WO1994016434A1 - Recursive finite state grammar - Google Patents

Recursive finite state grammar

Info

Publication number
WO1994016434A1
Authority
WO
WIPO (PCT)
Prior art keywords
finite state
sub
recognition
grammar
network
Prior art date
Application number
PCT/US1993/012598
Other languages
French (fr)
Inventor
Yen-Lu Chow
Kai-Fu Lee
Original Assignee
Apple Computer, Inc.
Application filed by Apple Computer, Inc. filed Critical Apple Computer, Inc.
Priority to AU60800/94A priority Critical patent/AU6080094A/en
Priority to DE4397100T priority patent/DE4397100T1/en
Priority to DE4397100A priority patent/DE4397100C2/en
Publication of WO1994016434A1 publication Critical patent/WO1994016434A1/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/19 Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L15/193 Formal grammars, e.g. finite state automata, context free grammars or word networks

Definitions

  • the present invention relates to the field of continuous speech recognition; more particularly, the present invention relates to finite state grammar networks utilized in the recognition process.
  • each language model is associated with a grammar.
  • a grammar represents the set of all possible sentence sequences that constitute recognizable inputs to the speech recognition system for any particular vocabulary.
  • the grammar is typically not every combination of words in the vocabulary. Instead, the grammar represents the combinations of words in the vocabulary that have meaning in the particular context or application currently being employed.
  • the grammar for a particular context or application is usually stored in memory in a compact format.
  • the grammar model for a speech recognition system can be static, i.e. specified before a particular application is run, or dynamic, where the grammar changes as the user interacts with the system. In the former, the grammar model is usually specified by one familiar with the application. In the latter, the grammar model can be constructed as the user interacts with the application by the use of a specially configured user interface.
  • the grammar changes as the user interacts, such that the grammar model reflects the current state of the vocabulary utilized by the speech recognition system.
  • the grammars are often encoded as finite state grammars.
  • in finite state grammars, the collection of sentences is represented as a single network of arcs and nodes; that is, the sentences are represented as states and transitions in the network.
  • Each arc, or transition, in the network refers to a particular word in the vocabulary, while each node, or state, ties the words together in a particular sentence.
  • the arcs connect the nodes to form a network.
  • Associated with each word is an acoustic model.
  • the acoustic model for the word is represented as a sequence of phonetic models.
  • a speech recognition system is capable of matching the acoustic description of each word in the grammar against the acoustic signal, such that the spoken phrase can be recognized.
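The arc-and-node structure described above can be sketched in a few lines of Python. The grammar content, node names, and representation here are invented for illustration; they are not taken from the patent:

```python
# A finite state grammar as a network: nodes are states, arcs are
# (source, word, destination) transitions. All names are illustrative.
grammar_arcs = [
    ("start", "open", "s1"),
    ("s1", "mail", "end"),
    ("s1", "folder", "end"),
]

def sentences(arcs, node="start"):
    """Enumerate every word sequence the network accepts."""
    if node == "end":
        return [[]]
    results = []
    for src, word, dst in arcs:
        if src == node:
            results += [[word] + rest for rest in sentences(arcs, dst)]
    return results
```

Here `sentences(grammar_arcs)` yields `["open", "mail"]` and `["open", "folder"]`, the only two sentences this toy grammar recognizes.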
  • the networks that comprise the grammar (e.g., the finite state grammar networks) can be very large. At run time, the entire network must be compiled. If a particular vocabulary contains thousands of words, the network utilized to depict all of the possible grammars can potentially require a large amount of memory, especially at run time. Therefore, even if a particular part of the network is not going to be required, it is still compiled, thereby requiring its own memory allocation.
  • portions of the grammar may be repeated at other locations in the network.
  • the present invention involves a recursive finite state grammar which uses a collection of finite state grammars.
  • the set of finite state grammars of the present invention comprises one global finite state grammar and multiple sub-finite state grammars.
  • the present invention creates and combines multiple grammars dynamically at run time. Furthermore, the present invention reduces the memory required to perform the speech recognition.
  • the method and means include multiple finite state grammars.
  • the multiple finite state grammars include at least one global finite state grammar and at least one sub-finite state grammar.
  • Each of the multiple finite state grammars includes multiple states and at least one transition arranged in a network. The transitions in the network are capable of including either terminals or non-terminals.
  • Each of the terminals is associated with an acoustic model, while each of the non-terminals is associated with a call to one of the sub-finite state grammars.
  • the present invention also includes a recognition engine which performs the recognition by traveling through the global finite state grammar. As terminals are encountered, the recognition engine matches the acoustic model of the terminal to the speech signals.
  • the recognition engine calls the sub-finite state grammar associated with the non-terminal and continues performing recognition by traversing the sub-finite state grammar.
  • the recognition engine matches the acoustic model to the speech signals to continue with the recognition.
  • Upon traversing the sub-finite state grammar, the recognition engine returns to and continues traversing the global finite state grammar at the location of the call. In this manner, the speech signals are matched against the acoustic models in the global and sub-finite state grammars to perform the speech recognition.
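This call-and-return behavior can be roughly sketched by expanding a global grammar whose arcs may reference sub-grammars. The grammar content below loosely mirrors the Figure 3 example, but the data structure and function are assumptions made for the sketch, not the patent's implementation:

```python
# Grammars map a non-terminal to alternative paths; a path is a list of
# symbols, each either a terminal word or another <non-terminal>.
grammars = {
    "<global>": [["<locate>", "<document>"]],
    "<locate>": [["find"], ["get"]],
    "<document>": [["the", "paper"], ["the", "figure"]],
}

def expand(symbol):
    """Return all word sequences a grammar symbol can generate."""
    if symbol not in grammars:              # terminal: a plain word
        return [[symbol]]
    results = []
    for path in grammars[symbol]:           # each alternative path
        seqs = [[]]
        for sym in path:                    # descend, then resume the path
            seqs = [s + e for s in seqs for e in expand(sym)]
        results += seqs
    return results
```

`expand("<global>")` produces the four sentences "find/get the paper/figure"; the sub-grammars are defined once and referenced wherever needed.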
  • Figure 1 is a block diagram of the computer system which may be utilized by the preferred embodiment of the present invention.
  • Figure 2 is a block diagram of the speech recognition system of the present invention.
  • Figures 3A-E illustrate an example of a recursive finite state grammar of the present invention.
  • Figure 4 illustrates an example of an acoustic model for the word "find" as used in one embodiment of the present invention.
  • Figure 5 illustrates the general description of the acoustic model for a word added to one of the sub-finite state grammars of the present invention.
  • the present invention is referred to as a recursive finite state grammar.
  • the framework of the present invention is superior to the finite state automata language models typically used for speech recognition.
  • the language is usually represented with a single finite-state automaton where the transitions represent the terminals (or words) in the language model.
  • the recursive finite state grammar of the present invention consists of one global finite state grammar and multiple sub-finite state grammars.
  • the global finite state grammar is comprised of transitions and states which form a network.
  • the states are equated with nodes in the network, while the transitions are represented as arcs between the nodes, or states.
  • Each transition in the present invention represents either a terminal or a non-terminal.
  • the terminal can be a word or phone which is associated with an acoustic model that represents its speech recognition pattern.
  • Non-terminals represent classes or categories of vocabulary and are associated with an index to a sub-finite state grammar which represents that portion of the vocabulary.
  • the sub-finite state grammars can be viewed as sub-networks having the same format as the global finite state grammar.
  • the recognition engine uses the global finite state grammar network to perform the matching between received speech signals and the acoustic models throughout the network.
  • When the recognition engine encounters a non-terminal in the network, the recognition engine calls the sub-finite state grammar network associated with the non-terminal and continues the recognition process.
  • the language model of the present invention incorporates not only a single network, but a collection of networks, each of which is capable of calling the other networks (i.e., it is recursive).
  • the minimal form of the network which results from employing sub-networks to represent transitions in the networks produces a network which is more conducive to efficiently searching the acoustic space.
  • the present invention allows for easy creation and combination of several grammars dynamically at run time.
  • the present invention also involves a new recognition algorithm wherein each transition encountered during the search is added to a stack.
  • When a word transition (arc) is encountered, the search proceeds as a normal search.
  • When a class indexing a sub-finite state grammar is reached, the sub-network is pushed onto the stack.
  • When the search exits the sub-network, the search continues at the same point in the original network where the call was made.
  • When the stack is empty, the final state of the global recursive finite state grammar has been reached, the recognition engine is finished, and a textual output representing the speech signals as recognized by the recognition engine is produced.
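The stack discipline described in the preceding points might look like the following deterministic toy (the real engine tracks many competing theories at once; the grammar names and single-path content here are invented):

```python
# Each grammar is a single path of symbols for this sketch; a symbol is
# either a word or a reference to another grammar in the table.
grammars = {
    "<global>": ["<locate>", "<document>"],
    "<locate>": ["find"],
    "<document>": ["the", "paper"],
}

def recognize_path(start="<global>"):
    words = []
    stack = [list(grammars[start])]
    while stack:                        # empty stack => final state reached
        frame = stack[-1]
        if not frame:                   # sub-network exhausted: return to caller
            stack.pop()
            continue
        symbol = frame.pop(0)
        if symbol in grammars:          # class indexing a sub-grammar: push it
            stack.append(list(grammars[symbol]))
        else:                           # word arc: normal search step
            words.append(symbol)
    return words
```

On exit from `<locate>`, the search resumes in `<global>` exactly where the call was made, and the traversal ends when the stack empties.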
  • Figure 1 illustrates some of the basic components of the computer system which may be utilized by the preferred embodiment of the present invention.
  • the computer system illustrated in Figure 1 comprises a bus or other communication means 101 for communicating information, a processing means 102 for processing information, a random access memory (RAM), or main memory, for storing information and instructions for processor 102, and a read only memory (ROM) for storing static information.
  • Other devices coupled to bus 101 include a data storage device 105, such as a magnetic disk and disk drive, for storing information and instructions; an alphanumeric input device 106, including alphanumeric and other keys, for communicating information and command selections to processor 102; a cursor control device 107, such as a mouse, track-ball, or cursor control keys, for controlling a cursor and communicating information and command selections to the processor 102; a display device 108 for displaying data input and output; a sound chip 109 for processing sound signals and information; a microphone/audio receiver 111 for receiving speech and audio signals; and a telecommunications port 110 for input and output of telecommunication and audio signals.
  • An embodiment of the present invention is implemented for use on some members of the Macintosh™ family of computers, available from Apple Computer, Inc. of Cupertino, California.
  • Receiver 201, consisting of the microphone/audio receiver 111, receives the speech and transforms the received speech signals into a digital representation of the successive amplitudes of the audio signal created by the speech. Then receiver 201 converts that digital signal into a frequency domain signal consisting of a sequence of frames. Each of the frames depicts the amplitude of the speech signal in each of a plurality of frequency bands over a specific time interval (i.e., window). In one embodiment, the time windows are 10 milliseconds apart. It should be noted that the present invention can be used with any type of receiver and speech encoding scheme.
  • recognition engine 202 uses a recognition algorithm to compare the sequence of frames produced by the utterance with a sequence of nodes contained in the acoustic models of the words in the recognition vocabulary.
  • the recognition vocabulary contains over five thousand text words.
  • the result of the recognition matching process is either a textual output or an action taken by the computer system which corresponds to the recognized word.
  • the recognition algorithm of one embodiment of the present invention uses probabilistic matching and dynamic programming.
  • Probabilistic matching determines the likelihood that a given frame of an utterance corresponds to a given node in an acoustic model of a word. This likelihood is determined not only as a function of how closely the amplitude of the individual frequency bands of a frame match the expected frequencies contained in the given node models, but also as a function of how the deviation between the actual and expected amplitudes in each such frequency band compares to the expected deviations for such values.
  • Dynamic programming provides a method to find an optimal, or near optimal, match between the sequence of frames produced by the utterance and the sequence of nodes contained in the model of the word. This is accomplished by expanding and contracting the duration of each node in the acoustic model of a word to compensate for the natural variations in the duration of speech sounds which occur in different utterances of the same word.
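A toy version of this expand-and-contract alignment follows, with a one-dimensional number standing in for an acoustic spectrum and an invented distance as the cost; it is a sketch of the dynamic-programming idea, not the patent's algorithm:

```python
def align(frames, nodes, cost):
    """Best total cost matching every frame to a node, in order, with
    each node allowed to stretch over one or more consecutive frames."""
    INF = float("inf")
    n, m = len(frames), len(nodes)
    dp = [[INF] * m for _ in range(n)]
    dp[0][0] = cost(frames[0], nodes[0])
    for i in range(1, n):
        for j in range(m):
            stay = dp[i - 1][j]                           # node duration expands
            advance = dp[i - 1][j - 1] if j > 0 else INF  # move to the next node
            best = min(stay, advance)
            if best < INF:
                dp[i][j] = best + cost(frames[i], nodes[j])
    return dp[n - 1][m - 1]

# frames and node templates are plain numbers standing in for spectra
frames = [1.0, 1.1, 3.0, 3.2, 5.0]
nodes = [1.0, 3.0, 5.0]
total = align(frames, nodes, cost=lambda f, t: abs(f - t))
```

The five frames align to three nodes by letting the first two nodes each absorb two frames, giving a total mismatch of 0.3 in this toy metric.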
  • the present invention employs language model filtering.
  • in language model filtering, a partial score reflecting the probability of each word occurring in the present language context is added to the score of that word before selecting the best scoring word, so that words which the language model indicates are most probable in the current context are more likely to be selected.
  • the acoustic model or, in other words, the speech recognition algorithm used in one embodiment of the present invention is the Hidden Markov Model (HMM) method.
  • the HMM method evaluates each word in the active vocabulary by representing the acoustic model for each word as a hidden Markov process and by computing the probability of each word of generating the current acoustic token as a probabilistic function of that hidden Markov process.
  • the word scores are represented as the negative logarithms of probabilities, so all scores are non-negative, and a score of zero represents a probability of one, that is, a perfect score.
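The score arithmetic described above is simple to state in code; the key property is that multiplying probabilities along a path becomes adding scores:

```python
import math

def score(p):
    """Negative-log-probability score: 0 is perfect, larger is worse."""
    return -math.log(p)

perfect = score(1.0)                    # probability 1 -> score 0
path_score = score(0.5) + score(0.25)   # adding scores ...
same = score(0.5 * 0.25)                # ... multiplies the probabilities
```

This is why the search can accumulate scores additively while still ranking theories by probability.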
  • the searching performed by the recognition engine of the present invention is undertaken in conjunction with a global finite state grammar and a collection of sub-finite state grammars.
  • the global finite state grammar of the present invention is comprised of states (nodes) and transitions (arcs) in a network. Each transition in the network comprises either a word or a category constituting an index to one of the sub- finite state grammars. Allowing transitions to be indices into sub-finite state grammars potentially makes the global finite state grammar smaller in size, thereby requiring less memory.
  • each index to a sub-finite state grammar can be used repeatedly throughout the network, such that the need to repeat the same state to state transition at different places in the network is obviated.
  • each word designated by arcs within the global finite state grammar or within any of the sub-finite state grammars is associated with an acoustic model and, to that extent, represents the machinery employed by the present invention to match the received speech signals during the recognition process.
  • the sub-finite state grammars contain states and transitions in the same manner as those in global finite state grammar.
  • the transitions in the sub-finite state grammars can likewise be either terminals or non-terminals. In one embodiment, each of the transitions in the sub-finite state grammars is a word. Furthermore, each of the sub-finite state grammars is capable of calling itself.
  • An example of a recursive finite state grammar of the present invention is shown in Figures 3A-E.
  • the global finite state grammar is shown comprising seven nodes 301-307 coupled together by arcs 321-327.
  • Node 301 represents the beginning of the global finite state grammar
  • node 307 represents the end of the global finite state grammar.
  • Arc 321 couples nodes 301 and 302 and is associated with the index to the sub-finite state grammar <locate> depicted in Figure 3B corresponding to the class (i.e., vocabulary) of location words consisting of "find" and "get".
  • Arc 324 couples nodes 301 and 304 and is the word "mail”.
  • Arcs 322 and 325 couple each of the nodes 302 and 304 respectively to nodes 303 and 305 respectively with the index associated to the sub-finite grammar <document> depicted in Figure 3C corresponding to the class of document types consisting of "paper" and "figure".
  • Nodes 303 and 305 are coupled to node 306 via arcs 323 and 326 respectively.
  • Arc 323 represents the word "from” and arc 326 represents the word "to”.
  • Node 306 is coupled to node 307 by arc 327, which represents the index to the sub-finite state grammar <personal-name> depicted in Figure 3D corresponding to the class of personal names of individuals consisting of John, Mary, and NEW-WORD.
  • Each of nodes 301-307 is also coupled to a self-looping arc, arcs 311-317 respectively.
  • Each of arcs 311-317 is associated with the index to the noise words sub-finite state grammar <nv> represented in Figure 3E. It should be reiterated that the words, such as the word "mail" associated with arc 324, represent the acoustic models for the words.
  • the location words sub-finite state grammar <locate> is shown comprised of nodes 331 and 332 coupled by arc 333 representing the word "find" (i.e., the acoustic modeling machinery used for matching the speech input to the word "find") and by arc 334 representing the word "get" (i.e., the acoustic modeling machinery used for matching the speech input to the word "get").
  • the acoustic model for the word "find” is shown in Figure 4. Referring to Figure 4, the acoustic model is depicted as a series of nodes 401-405 each coupled by a phone arc.
  • Node 401 is coupled to node 402 by arc 406 which is the acoustic phone /f/.
  • Node 402 is coupled to node 403 by arc 407 which is the acoustic phone /ay/.
  • Node 403 is coupled to node 404 by arc 408 which is the acoustic phone /n/.
  • Node 404 is coupled to node 405 by arc 409 which is the acoustic phone /d/. It should be noted that all of the word designated arcs mentioned are represented by acoustic models in this manner.
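The chain structure of Figure 4 can be generated mechanically from a phone list. The builder below is an illustrative sketch, with node and arc numbers chosen to match the figure:

```python
def word_model(phones, first_node=401, first_arc=406):
    """Build a linear acoustic model: a chain of nodes joined by one
    phone arc each, as in Figure 4's model for the word "find"."""
    nodes = list(range(first_node, first_node + len(phones) + 1))
    arcs = [(nodes[i], phones[i], nodes[i + 1], first_arc + i)
            for i in range(len(phones))]
    return nodes, arcs

# the model for "find": phones /f/ /ay/ /n/ /d/
nodes, arcs = word_model(["f", "ay", "n", "d"])
```

This reproduces nodes 401-405 and phone arcs 406-409 from the figure.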
  • the document type sub-finite state grammar <document> is depicted as comprising nodes 341-343 and arcs 344-346.
  • Node 341 begins the sub-finite state grammar and is coupled to node 342 via arc 344 which corresponds to the word "the”.
  • Node 342 is coupled to node 343 by arc 345 representing the word "paper” and by arc 346 representing the word "figure”.
  • the personal name sub-finite state grammar <personal-name> is shown comprising nodes 351-352 and arcs 353-355.
  • Node 351 is coupled to node 352 by arc 353 representing the word "Mary", by arc 354 representing the word "John" and by arc 355 representing the word NEW-WORD.
  • NEW-WORD represents an out-of-vocabulary word which was not in the original vocabulary category (e.g., personal names in this instance).
  • the recognition engine can generate an output indicating the presence of the out-of-vocabulary words.
  • the present invention allows for the incorporation of out-of-vocabulary (OOV) word detection capability for open- class grammar categories.
  • An open-class grammar category is one in which one of the acoustic models correlates with a high probability to any spoken word.
  • Figure 5 illustrates an example of the all-phone sub-network for NEW-WORD.
  • the example acoustic model for NEW-WORD comprises nodes 501-504 and arcs 505-509.
  • Node 501 is the beginning and is coupled to node 502 via arc 505, which represents any phone in the NEW-WORD.
  • Node 502 is coupled to node 503 via arc 506 which again represents a phone within the NEW-WORD.
  • Node 503 is coupled to node 504 by arc 507 to end the acoustic model of NEW- WORD.
  • arc 507 represents another phone within NEW-WORD.
  • Arcs 508 and 509 are self-looping arcs which loop from and to nodes 502 and 503 respectively. These arcs also represent any phone in the acoustic model for NEW-WORD.
  • the acoustic model for NEW-WORD comprises a multiplicity of phones.
  • the acoustic model for NEW-WORD could contain any number of phones.
  • the actual number of phones chosen, which indicates the minimum duration of the acoustic model, is a design choice that is typically made by the designer.
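As a hedged sketch of the Figure 5 structure, the model below has a minimum of `min_phones` any-phone arcs in series plus self-loops on the interior nodes; the arc counts follow the figure, but the match test is invented for illustration:

```python
def new_word_model(min_phones=3):
    """All-phone NEW-WORD sub-network: `min_phones` serial any-phone arcs
    set the minimum duration; self-loops on interior nodes let longer
    out-of-vocabulary words still match."""
    return {
        "serial_arcs": min_phones,               # arcs 505-507 when min_phones=3
        "self_loops": max(min_phones - 1, 0),    # arcs 508-509 when min_phones=3
    }

def matches_new_word(n_phones, model):
    # any phone sequence at least as long as the serial path can match
    return n_phones >= model["serial_arcs"]

model = new_word_model(3)
```

Raising `min_phones` lengthens the minimum duration, so very short noises are less likely to be reported as new words.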
  • the representation is hierarchical so that only a single instance of both the all-phone network, such as that depicted in Figure 5, and the OOV network is needed. Thus, the present invention reduces the amount of memory needed to compensate for OOV words.
  • a dictionary incorporates out-of-vocabulary words into the recognition engine.
  • the dictionary contains non-verbal words, phone words, or both.
  • the system designer has other accessible parameters, besides setting the minimum number of phones, by which the out-of-vocabulary detection can be controlled.
  • a language weighting for open-class transitions in the grammar can also be chosen to control the ratio of false alarms (i.e., words recognized by the out-of-vocabulary detection when they are actually in the dictionary) versus detections.
  • the language weighting is an adjustment to the probabilities of a language model, wherein less likely language models have a lower probability associated with them, such that they have a less likely opportunity for being chosen as the result of the recognition process.
  • a language weighting is chosen for each of the phone arcs in the all-phone sub-network to give additional control over false alarms/detections.
  • the noise words sub-finite state grammar <nv> is shown comprised of nodes 361-362 and arcs 363-366.
  • Node 361 is coupled to node 362 by arc 363 representing the acoustic machinery for the sound of a telephone ring, by arc 364 representing the acoustic machinery for the sound of a cough, by arc 365 representing the acoustic machinery for the sound of silence, and by arc 366.
  • sub-finite state grammar <nv> is a non-verbal sub-finite state grammar (network) in that the recognition is not of a word, but is of a sound.
  • Figure 3E, in conjunction with Figure 3A, illustrates the advantageous manner in which non-verbal models are used in the present invention.
  • by using the non-verbal model of noises, such as coughs, sneezes, etc., the size of the network can be reduced in comparison to the prior art monolithic finite state grammars, while experiencing insignificant overhead.
  • the size of the network can be reduced because the entire class of noises does not have to be incorporated into the network at every node.
  • the memory space required to store the non-verbal model of noises is reduced because the different classes of noises (i.e., the sub-finite state grammar) are only compiled when needed.
  • noise sub-finite state grammars or categories of noises, can be located at any state in the recognition engine (i.e., at any node in the network) and are the same as any other sub-finite state grammar.
  • non-verbal networks are implemented using a self-looping mechanism, such that the beginning and ending of the arcs corresponding to the non-verbal network are at the same location.
  • the present invention allows for the use of non-verbal networks which can be located freely about the network with little hindrance on performance.
  • Figures 3A-E represent the static depictions of an example of a recursive finite state grammar of the present invention.
  • the grammars, both global and sub-finite state, must be compiled.
  • in prior art recognition engines, although some grammars appear to be hierarchical, their hierarchical nature is lost upon being compiled.
  • the present invention retains its hierarchical structure during the recognition process because each of the sub-finite state grammars and the global finite state grammar is compiled individually.
  • the sub-finite state grammars are only accessed when needed. Hence, any required memory allocation can be deferred until the time when such an access is needed, such that the recognition process requires less memory.
  • the present invention saves memory. Moreover, by allowing the sub-finite state grammars to be compiled individually, any changes that take the form of additions and deletions to individual sub-finite state grammars can be made to the recognition engine without having to modify, and subsequently recompile, the global finite state network. Therefore, the global finite state grammar does not have to be recompiled every time a change occurs in the recognition engine. Thus, the present invention comprises a very flexible run-time recognition engine.
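The benefit of individual compilation can be sketched as a cache of independently compiled grammars; `compile_grammar` below is a hypothetical stand-in for the real compilation step, not the patent's mechanism:

```python
compiled = {}   # cache of independently compiled grammars

def compile_grammar(name, source):
    """'Compile' one grammar in isolation and cache the artifact."""
    compiled[name] = ("compiled", tuple(source))
    return compiled[name]

compile_grammar("<global>", ["<locate>", "<document>"])
compile_grammar("<locate>", ["find", "get"])

global_before = compiled["<global>"]
compile_grammar("<locate>", ["find", "get", "fetch"])  # edit one sub-grammar
```

After the edit, the cached global artifact is untouched; only the changed sub-grammar was recompiled.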
  • the recognition engine can begin the recognition process.
  • the recognition process is typically a matching process, wherein the acoustic models are matched with the speech input signal.
  • the recognition engine must be able to identify that the transition involves an index to a sub-network. In other words, the recognition engine is not just seeing a terminal. Instead, the recognition engine is seeing a generic category or class. Therefore, the present invention must be able to compensate for the presence of the non-terminals.
  • a stack system is created in the memory of the computer system and is utilized in performing the recognition process.
  • all of the first phones of the acoustic models (machineries) that correspond to the transitions from the first node of the network are pushed onto the stack.
  • for example, the first phone of each such acoustic model would be pushed onto the stack.
  • the models are pushed on the stack in their order of appearance in the network.
  • subsequent phones of the acoustic models corresponding to the current transitions being evaluated in the network are placed on the stack, while some of the previous phones may be removed.
  • the stack is capable of growing and shrinking. Note that each path through the network represents one possible theory as to what the acoustic input signals are. As the recognition process continues, certain theories become less likely. In this case, portions of the acoustic models associated with these less likely theories may be removed from the stack.
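This survive-or-die behavior amounts to beam pruning in score space; the following toy keeps only theories within an invented beam width of the best score (lower negative-log scores are better):

```python
def prune(theories, beam=2.0):
    """Keep only theories within `beam` of the best (lowest) score."""
    best = min(s for _, s in theories)
    return [(h, s) for h, s in theories if s <= best + beam]

theories = [("find the", 1.0), ("find a", 2.5), ("fine the", 4.0)]
survivors = prune(theories)
```

The worst theory falls outside the beam and is dropped, shrinking the stack; new expansions of the survivors grow it again.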
  • as the recognition engine traverses through the network (i.e., by traversing through the serial stack), the recognition engine encounters either terminals (e.g., words, phones, etc. which have acoustic models) or non-terminals.
  • the terminals are expanded and the associated acoustic models (e.g., HMM) are pushed on the stack.
  • the pushing of terminals and non-terminals onto the stack is performed at run-time on an as-needed basis.
  • the entire network does not have to occupy memory space, such that the present invention produces a large memory savings.
  • When a non-terminal is reached in the search, the recognition engine must obtain the sub-finite state grammar (i.e., the sub-network) and employ it in the recognition process.
  • a pointer directs the recognition engine to the non-terminal sub- network.
  • the recognition engine creates a dynamic version of the sub-network and pushes the dynamic version onto the stack.
  • the dynamic version is a copy of the sub-finite state grammar. A copy is made because the particular sub-network may appear at more than one location in the hierarchical topology, such that the recognition is able to keep track of all the different theories or instances of use.
  • Each theory or model has a history consisting of a sequence of words.
  • each occurrence of a sub-network in a network is associated with its own history, such that the probability of that occurrence of the sub-network is uniquely identified in the network (or sub-network).
  • the history is only the identity of the last predecessor.
  • a score associated with a particular theory is a percentage indicative of the probability that the current word follows the predecessor.
  • the dynamic version comprises the topology of the network and also includes the information needed for the recognition engine to generate its result (i.e., its identity, its history and its scores associated with the sub-network).
  • the actual sub-finite state grammar is not pushed onto the stack because it may appear and, thus, be needed at other instances within the global network.
  • the acoustic models of the terminals are placed on the stack.
  • the recognition engine uses the acoustic models in the recognition process in the same manner as prior art finite state grammar recognition systems, which is well-known in the art.
  • When each class or category which indexes a sub-network is pushed onto the stack, a mechanism exists by which the sub-network can be traversed.
  • the sub-network may be popped off the stack.
  • before a sub-network is traversed, information corresponding to its termination or ending state is pushed onto the stack with it. In other words, this information identifies the termination or ending state of the current sub-network as the location of the next node in the network which called the current sub-network.
  • the recognition engine knows where to transition to by referring to the ending or termination state. Therefore, in the preferred embodiment, there is no need to provide the functionality necessary to pop items off the stack. It should be noted that the self-looping mechanism described earlier employs this feature. By having the ending state be the same as the beginning state, the transition which occurs is able to loop to itself.
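The return-state bookkeeping described above can be sketched as follows; the frame layout and the node numbers (taken loosely from the Figure 3 example) are illustrative assumptions:

```python
def push_subnetwork(stack, subnetwork, return_node):
    """Push a dynamic copy of a sub-network together with the node the
    search resumes at when the sub-network's ending state is reached,
    so no explicit pop/return lookup is needed."""
    frame = {"net": subnetwork, "exit_to": return_node}
    stack.append(frame)
    return frame

stack = []
# <document> called between nodes 302 and 303: resume at 303 on exit
doc = push_subnetwork(stack, "<document>", return_node=303)
# a self-looping <nv> network at node 301: exit state equals entry state
nv = push_subnetwork(stack, "<nv>", return_node=301)
```

Storing the exit node with the frame is what lets the self-looping non-verbal networks work: their stored exit is simply the node they started from.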
  • the recognition engine performs the searching. Based on the likelihood of the theories, the recognition engine continues onto the next machinery or machineries. The stack grows and shrinks as the theories survive (are above a threshold probability) or die (fall below a threshold of probability). Once all of the machinery has been evaluated, signified by the stack being empty, the most probable theory is produced as textual output or as an action taken by the computer (e.g., opening a folder, etc.). In the case of text, the textual output represents the recognized speech.

Abstract

A method and means for performing speech recognition are described. The method and means include multiple finite state grammars. The multiple finite state grammars include at least one global finite state grammar and at least one sub-finite state grammar. Each of the multiple finite state grammars includes multiple states and at least one transition arranged in a network. The transitions in the network are capable of including either terminals or non-terminals. Each of the terminals is associated with an acoustic model, while each of the non-terminals is associated with a call to one of the sub-finite state grammars. The present invention also includes a recognition engine which performs the recognition by traveling through the global finite state grammar. As terminals are encountered, the recognition engine matches the acoustic model of the terminal to the speech signals. As non-terminals are encountered, the recognition engine calls the sub-finite state grammar associated with the non-terminal and continues performing recognition by traversing the sub-finite state grammar. In traversing the sub-finite state grammar, the recognition engine matches the acoustic model to the speech signals to continue with the recognition. Upon traversing the sub-finite state grammar, the recognition engine returns to and continues traversing the global finite state grammar at the location of the call. In this manner, the speech signals are matched against the acoustic models in the global and sub-finite state grammars to generate textual output.

Description

RECURSIVE FINITE STATE GRAMMAR
FIELD OF THE INVENTION
The present invention relates to the field of continuous speech recognition; more particularly, the present invention relates to finite state grammar networks utilized in the recognition process.
BACKGROUND OF THE INVENTION
Recently, speech recognition systems have become more prevalent in today's high-technology market. Due to advances in computer technology and advances in speech recognition algorithms, these speech recognition systems have become more powerful. Current speech recognition systems operate by matching an acoustic description, or model, of a word in their vocabulary against a representation of the acoustic signal generated by the utterance of the word to be recognized. The vocabulary contains all of the words which the speech recognition system is capable of recognizing. In other words, the vocabulary consists of all of the words which have acoustic models stored in the system. It should be noted that not all of the vocabulary is active all of the time. At any one time, only a portion of the vocabulary may be active. Typically, only a portion of the vocabulary is activated because of limitations in the current state of the art. Language models are used to indicate which portion of the vocabulary is currently active.
In continuous speech recognition, each language model is associated with a grammar. A grammar represents the set of all possible sentence sequences that constitute recognizable inputs to the speech recognition system for any particular vocabulary. The grammar is typically not every combination of words in the vocabulary. Instead, the grammar represents the combinations of words in the vocabulary that have meaning in the particular context or application currently being employed. The grammar for a particular context or application is usually stored in memory in a compact format. The grammar model for a speech recognition system can be static, i.e., specified before a particular application is run, or dynamic, where the grammar changes as the user interacts with the system. In the former, the grammar model is usually specified by one familiar with the application. In the latter, the grammar model can be constructed as the user interacts with the application by the use of a specially configured user interface. In this case, the grammar changes as the user interacts, such that the grammar model reflects the current state of the vocabulary utilized by the speech recognition system. In the prior art, the grammars are often encoded as finite state grammars. In finite state grammars, the collection of sentences is represented as a single network of arcs and nodes; that is, the sentences are represented as states and transitions in the network. Each arc, or transition, in the network refers to a particular word in the vocabulary, while each node, or state, ties the words together in a particular sentence. The arcs connect the nodes to form a network. Associated with each word is an acoustic model. The acoustic model for the word is represented as a sequence of phonetic models.
Through the use of the network, a speech recognition system is capable of matching the acoustic description of each word in the grammar against the acoustic signal, such that the spoken phrase can be recognized. The networks that comprise the grammar (e.g., the finite state grammar networks) for a particular application can be very large. At run time, the entire network must be compiled. If a particular vocabulary contains thousands of words, the network utilized to depict all of the possible grammars can potentially require a large amount of memory, especially at run time. Therefore, even if a particular part of the network is not going to be required, it is still compiled, thereby requiring its own memory allocation. Moreover, portions of the grammar may be repeated at other locations in the network. Thus, identical portions of the grammar must be compiled multiple times, such that multiple memory allocations, each associated with identical but separately stored parts of the network, are required. Since memory and its usage are at a premium in today's technology, there is a desire in speech recognition systems to reduce the amount of memory utilized in storing the grammar.
As will be shown, the present invention involves a recursive finite state grammar which uses a collection of finite state grammars. The set of finite state grammars of the present invention comprises one global finite state grammar and multiple sub-finite state grammars. The present invention creates and combines multiple grammars dynamically at run time. Furthermore, the present invention reduces the memory required to perform the speech recognition.
SUMMARY OF THE INVENTION
A method and means for performing speech recognition are described. The method and means include multiple finite state grammars. The multiple finite state grammars include at least one global finite state grammar and at least one sub-finite state grammar. Each of the multiple finite state grammars includes multiple states and at least one transition arranged in a network. The transitions in the network are capable of including either terminals or non-terminals. Each of the terminals is associated with an acoustic model, while each of the non-terminals is associated with a call to one of the sub-finite state grammars. The present invention also includes a recognition engine which performs the recognition by traveling through the global finite state grammar. As terminals are encountered, the recognition engine matches the acoustic model of the terminal to the speech signals. As non-terminals are encountered, the recognition engine calls the sub-finite state grammar associated with the non-terminal and continues performing recognition by traversing the sub-finite state grammar. In traversing the sub-finite state grammar, the recognition engine matches the acoustic model to the speech signals to continue with the recognition. Upon traversing the sub-finite state grammar, the recognition engine returns to and continues traversing the global finite state grammar at the location of the call. In this manner, the speech signals are matched against the acoustic models in the global and sub-finite state grammars to perform the speech recognition.
BRIEF DESCRIPTION OF DRAWINGS
The present invention will be understood more fully from the detailed description given below and from the accompanying drawings of the preferred embodiment of the invention, which, however, should not be taken to limit the invention to the specific embodiment but are for explanation and understanding only.
Figure 1 is a block diagram of the computer system which may be utilized by the preferred embodiment of the present invention.
Figure 2 is a block diagram of the speech recognition system of the present invention.
Figures 3A-E illustrate an example of a recursive finite state grammar of the present invention.
Figure 4 illustrates an example of an acoustic model for the word "find" as used in one embodiment of the present invention.
Figure 5 illustrates the general description of the acoustic model for a word added to one of the sub-finite state grammars of the present invention.
DETAILED DESCRIPTION OF THE INVENTION
A method and means for performing speech recognition are described. In the following description, numerous specific details are set forth such as specific processing steps, recognition algorithms, acoustic models, etc., in order to provide a thorough understanding of the present invention. It will be obvious, however, to one skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known speech recognition processing steps and circuitry have not been described in detail to avoid unnecessarily obscuring the present invention.
Overview of the Present Invention
The present invention is referred to as a recursive finite state grammar. The framework of the present invention is superior to the finite state automata language models typically used for speech recognition. In the prior art finite state automata language models, the language is usually represented with a single finite-state automaton where the transitions represent the terminals (or words) in the language model. The recursive finite state grammar of the present invention consists of one global finite state grammar and multiple sub-finite state grammars. The global finite state grammar is comprised of transitions and states which form a
network. The states are equated with nodes in the network, while the transitions are represented as arcs between the nodes, or states. Each transition in the present invention represents either a terminal or a non-terminal. The terminal can be a word or phone which is associated with an acoustic model that represents its speech recognition pattern. Non-terminals, on the other hand, represent classes or categories of vocabulary and are associated with an index to a sub-finite state grammar which represents that portion of the vocabulary. The sub-finite state grammars can be viewed as sub-networks having the same format as the global finite state grammar. When performing recognition, the recognition engine uses the global finite state grammar network to perform the matching between received speech signals and the acoustic models throughout the network. When the recognition engine encounters a non-terminal in the network, the recognition engine calls the sub-finite state grammar network associated with the non-terminal and continues the recognition process. Thus, the language model of the present invention incorporates not only a single network, but a collection of networks, each of which is capable of calling the other networks (i.e., it is recursive).
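To make this structure concrete, the collection of networks described above can be sketched in a few lines of code. This is an illustrative sketch only, with invented names and a toy grammar loosely based on Figures 3A-C; it is not the implementation of the present invention:

```python
# Illustrative sketch of a recursive finite state grammar (invented layout).
# An arc is (source_state, dest_state, label, is_nonterminal):
# terminals name acoustic models; non-terminals name sub-grammars.

grammars = {
    "<global>": [
        (0, 1, "<locate>", True),    # call sub-grammar of location words
        (1, 2, "<document>", True),  # call sub-grammar of document types
        (2, 3, "from", False),       # terminal: matched via acoustic model
    ],
    "<locate>": [
        (0, 1, "find", False),
        (0, 1, "get", False),
    ],
    "<document>": [
        (0, 1, "the", False),
        (1, 2, "paper", False),
        (1, 2, "figure", False),
    ],
}

def terminals_reachable(name):
    """Collect every terminal word reachable from a grammar,
    following non-terminal arcs into their sub-grammars."""
    words = set()
    for _, _, label, is_nt in grammars[name]:
        if is_nt:
            words |= terminals_reachable(label)
        else:
            words.add(label)
    return words

print(sorted(terminals_reachable("<global>")))
```

Note that the sub-grammars `<locate>` and `<document>` are stored once, however many arcs of the global grammar index them.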
The recursive nature of the present invention reduces the required memory since the same language can be represented more efficiently and can be potentially smaller
than the single finite state grammars of the prior art. The minimal form of the network which results from employing sub-networks to represent transitions in the networks produces a network which is more conducive to efficiently searching the acoustic space. Moreover, the present invention allows for easy creation and combination of several grammars dynamically at run time.
The present invention also involves a new recognition algorithm wherein each transition encountered during the search is added to a stack. When a word transition (arc) is encountered, the search proceeds as a normal search. However, when a class indexing a sub-finite state grammar is reached, the network is pushed onto the stack. When the search exits the network, the search continues at the same point in the original network where the call was made. When the stack is empty, the final state of the global recursive finite state grammar has been reached, the recognition engine is finished, and a textual output representing the speech signals as recognized by the recognition engine is output.
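The stack discipline of this algorithm can be illustrated with a simplified sketch (invented names; scoring, pruning, and the parallel theories are omitted). A non-terminal pushes a return point onto the stack before the search descends into the sub-network, and an empty stack signals the final state:

```python
# Simplified sketch of the stack-driven traversal (not the patent's code).
# Each grammar is a flat list of labels; a "<name>" label is a non-terminal
# that calls the sub-grammar of that name.

grammars = {
    "<global>": ["<locate>", "<document>", "from", "<personal-name>"],
    "<locate>": ["find"],
    "<document>": ["the", "paper"],
    "<personal-name>": ["john"],
}

def traverse(name):
    """Flatten the order in which terminals are matched, using an
    explicit stack of (grammar, position) return points."""
    output = []
    stack = [(grammars[name], 0)]
    while stack:                       # empty stack => final state reached
        seq, i = stack.pop()
        while i < len(seq):
            label = seq[i]
            i += 1
            if label.startswith("<"):  # non-terminal: save the return point,
                stack.append((seq, i)) # then descend into the sub-network
                seq, i = grammars[label], 0
            else:
                output.append(label)   # terminal: match its acoustic model
    return output

print(traverse("<global>"))
```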
The Overview of a Computer System in the Preferred Embodiment
The preferred embodiment of the present invention may be practiced on computer systems having alternative configurations. Figure 1 illustrates some of the basic
components of such a computer system, but is not meant to be limiting nor to exclude other components or combinations of components. The computer system illustrated in Figure 1 comprises a bus or other communication means 101 for communicating information, a processing means 102
(commonly referred to as a host processor) coupled with bus 101 for processing information, a random access memory (RAM) or other storage device 103 (commonly referred to as main memory) coupled with bus 101 for storing information and instructions for the processor 102, a read only memory (ROM) or other static storage device 104 coupled with the bus 101 for storing static information and instructions for the processor 102.
Other devices coupled to bus 101 include a data storage device 105, such as a magnetic disk and disk drive for storing information and instructions, an alphanumeric input device 106, including alphanumeric and other keys, for communicating information and command selections to processor 102, a cursor control device 107, such as a mouse, track-ball, cursor control keys, etc., for controlling a cursor and communicating information and command selections to the processor 102, a display device 108 for displaying data text input and output, a sound chip 109 for processing sound signals and information, a microphone/audio receiver 111 for receiving speech and audio signals and a
telecommunications port 110 for input and output of telecommunication and audio signals.
An embodiment of the present invention is implemented for use on some of the members of the family of Macintosh™ brand of computers, available from Apple Computer, Inc. of Cupertino, California.
Overview of the Speech Recognition System
A simplified version of the speech recognition system of the present invention is shown in Figure 2. Receiver 201, consisting of the microphone/audio receiver 111, receives the speech and transforms the received speech signals into a digital representation of the successive amplitudes of the audio signal created by the speech. Then receiver 201 converts that digital signal into a frequency domain signal consisting of a sequence of frames. Each of the frames depicts the amplitude of the speech signal in each of a plurality of frequency bands over a specific time interval (i.e., window). In one embodiment, the time windows are 10 milliseconds apart. It should be noted that the present invention can be used with any type of receiver and speech encoding scheme.
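The framing step can be pictured with a toy sketch. Only the 10 millisecond spacing comes from the text; the sample rate, and the omission of windowing and frequency analysis, are simplifying assumptions:

```python
# Toy illustration of cutting a sampled signal into 10 ms frames.
# The sample rate is an assumption for illustration only.

SAMPLE_RATE = 16000              # samples per second (assumed)
FRAME_STEP = SAMPLE_RATE // 100  # 10 ms apart => 160 samples per step

def frame_signal(samples, step=FRAME_STEP):
    """Split a list of amplitude samples into consecutive 10 ms frames.
    A real front end would also apply a window function and a frequency
    analysis (e.g., a filter bank) to each frame."""
    return [samples[i:i + step] for i in range(0, len(samples) - step + 1, step)]

one_second = [0.0] * SAMPLE_RATE
frames = frame_signal(one_second)
print(len(frames), len(frames[0]))  # 100 frames of 160 samples each
```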
Once the speech is converted, recognition engine 202 uses a recognition algorithm to compare the sequence of frames produced by the utterance with a sequence of nodes
contained in the acoustic model of each word in the active vocabulary, as defined by the grammar, to determine if a match exists. In the current embodiment of the invention, the recognition vocabulary contains over five thousand text words. The result of the recognition matching process is either a textual output or an action taken by the computer system which corresponds to the recognized word.
The recognition algorithm of one embodiment of the present invention uses probabilistic matching and dynamic programming. Probabilistic matching determines the likelihood that a given frame of an utterance corresponds to a given node in an acoustic model of a word. This likelihood is determined not only as a function of how closely the amplitude of the individual frequency bands of a frame match the expected frequencies contained in the given node models, but also as a function of how the deviation between the actual and expected amplitudes in each such frequency band compares to the expected deviations for such values. Dynamic programming provides a method to find an optimal, or near optimal, match between the sequence of frames produced by the utterance and the sequence of nodes contained in the model of the word. This is accomplished by expanding and contracting the duration of each node in the acoustic model of a word to compensate for the natural variations in the duration of speech sounds which occur in
different utterances of the same word. A score is computed for each time-aligned match, based on the sum of the dissimilarities between the acoustic information in each frame and the acoustic model of the node against which it is time-aligned. The words with the lowest sum of such distances are then selected as the best scoring words. In one embodiment, the present invention employs language model filtering. When language model filtering is used, a partial score reflecting the probability of each word occurring in the present language context is added to the score of that word before selecting the best scoring word, so that words which the language model indicates are most probable in the current context are more likely to be selected.
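The expansion and contraction of node durations is exactly what a dynamic-programming alignment provides. The following toy sketch (with an invented distance function and one-dimensional "acoustic" values) computes the minimal sum of frame-to-node dissimilarities when each node may absorb one or more consecutive frames:

```python
# Sketch of the dynamic-programming alignment described above (illustrative;
# the distance function and values are invented). Each frame is time-aligned
# to a node, nodes may stretch or shrink in duration, and the score is the
# sum of frame-to-node dissimilarities.

def align(frames, nodes, dist):
    """Minimal cost of aligning every frame to the node sequence in order,
    where each node may absorb one or more consecutive frames."""
    INF = float("inf")
    # cost[j] = best cost of the frames seen so far, ending in node j
    cost = [INF] * len(nodes)
    cost[0] = dist(frames[0], nodes[0])
    for f in frames[1:]:
        new = [INF] * len(nodes)
        for j in range(len(nodes)):
            stay = cost[j]                           # node j stretches
            advance = cost[j - 1] if j > 0 else INF  # move to the next node
            best = min(stay, advance)
            if best < INF:
                new[j] = best + dist(f, nodes[j])
        cost = new
    return cost[-1]  # all frames consumed, ending at the last node

# Toy 1-D "acoustic" values: frames and node templates are plain numbers.
frames = [1.0, 1.1, 5.0, 5.2, 9.0]
nodes = [1.0, 5.0, 9.0]
score = align(frames, nodes, dist=lambda a, b: abs(a - b))
print(round(score, 2))
```

The middle node absorbs two frames (5.0 and 5.2) without penalty to the alignment structure; only the frame-to-node distances contribute to the score.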
The acoustic model or, in other words, the speech recognition algorithm used in one embodiment of the present invention is the Hidden Markov Model (HMM) method. As is well-known to those skilled in the art, the HMM method evaluates each word in the active vocabulary by representing the acoustic model for each word as a hidden Markov process and by computing the probability of each word of generating the current acoustic token as a probabilistic function of that hidden Markov process. In one embodiment, the word scores are represented as the negative logarithms of probabilities, so all scores are non-negative, and a score of zero represents a probability of one, that is, a perfect score. It
should be noted that other terminal or word matching schemes can be utilized by the present invention.
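The negative-logarithm scoring convention can be verified in a few lines (illustrative values only):

```python
import math

# Negative-log scoring as described: score = -log(probability), so all
# scores are non-negative and a score of 0 corresponds to probability 1.
def score(probability):
    return -math.log(probability)

print(score(1.0) == 0.0)         # True: a perfect score
print(score(0.5) < score(0.1))   # True: higher probability => lower score

# Multiplying probabilities along a path becomes adding scores, which is
# cheaper and numerically safer over long utterances.
p_path = 0.5 * 0.25
assert abs(score(p_path) - (score(0.5) + score(0.25))) < 1e-12
```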
The searching performed by the recognition engine of the present invention is undertaken in conjunction with a global finite state grammar and a collection of sub-finite state grammars. The global finite state grammar of the present invention is comprised of states (nodes) and transitions (arcs) in a network. Each transition in the network comprises either a word or a category constituting an index to one of the sub-finite state grammars. Allowing transitions to be indices into sub-finite state grammars potentially makes the global finite state grammar smaller in size, thereby requiring less memory. The reduction of memory size is further accentuated by the fact that each index to a sub-finite state grammar can be used repeatedly throughout the network, such that the need to repeat the same state-to-state transition at different places in the network is obviated. It should be noted that each word-designated arc within the global finite state grammar or within any of the sub-finite state grammars is associated with an acoustic model and, to that extent, represents the machinery employed by the present invention to match the received speech signals during the recognition process.
The sub-finite state grammars contain states and transitions in the same manner as those in the global finite state grammar. The transitions in the sub-finite state grammars are
capable of representing words or indices to other sub-finite state grammars. In one embodiment, each of the transitions in the sub-finite state grammars is a word. Furthermore, each of the sub-finite state grammars is capable of calling itself.
An example of a recursive finite state grammar of the present invention is shown in Figures 3A-E. Referring to Figure 3A, the global finite state grammar is shown comprising seven nodes 301-307 coupled together by arcs 321-327. Node 301 represents the beginning of the global finite state grammar, and node 307 represents the end of the global finite state grammar. Arc 321 couples nodes 301 and 302 and is associated with the index to the sub-finite state grammar <locate> depicted in Figure 3B corresponding to the class (i.e., vocabulary) of location words consisting of "find" and "get". Arc 324 couples nodes 301 and 304 and is the word "mail". Arcs 322 and 325 couple nodes 302 and 304 respectively to nodes 303 and 305, and each is associated with the index to the sub-finite state grammar <document> depicted in Figure 3C corresponding to the class of document types consisting of "paper" and "figure". Nodes 303 and 305 are coupled to node 306 via arcs 323 and 326 respectively. Arc 323 represents the word "from" and arc 326 represents the word "to". Node 306 is coupled to node 307 by arc 327, which represents the index to the sub-finite state grammar
<personal-name> depicted in Figure 3D corresponding to the class of personal names of individuals consisting of John, Mary, and NEW-WORD. Each of nodes 301-307 is also coupled to a self-looping arc 311-317, respectively. Each of arcs 311-317 is associated with the index to the noise words sub-finite state grammar <nv> represented in Figure 3E. It should be reiterated that the words, such as the word "mail" associated with arc 324, represent the acoustic models for the words. Referring to Figure 3B, the location words sub-finite state grammar <locate> is shown comprised of nodes 331 and 332 coupled by arc 333 representing the word "find" (i.e., the acoustic modeling machinery used for matching the speech input to the word "find") and by arc 334 representing the word "get" (i.e., the acoustic modeling machinery used for matching the speech input to the word "get"). The acoustic model for the word "find" is shown in Figure 4. Referring to Figure 4, the acoustic model is depicted as a series of nodes 401-405, each coupled by a phone arc. Node 401 is coupled to node 402 by arc 406 which is the acoustic phone /f/. Node 402 is coupled to node 403 by arc 407 which is the acoustic phone /ay/. Node 403 is coupled to node 404 by arc 408 which is the acoustic phone /n/. Node 404 is coupled to node 405 by arc 409 which is the acoustic phone /d/. It should be noted that all of the word-designated arcs mentioned
throughout this specification correspond to acoustic models, such as that shown in Figure 4.
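Such a phone-chain model can be sketched as follows (the data layout is an assumption for illustration; in practice each phone arc would carry an HMM rather than a label):

```python
# Sketch of a word model as a linear chain of phone arcs, following
# Figure 4 (illustrative layout, not the patent's data structure).

def word_model(word, phones):
    """Build (source_node, dest_node, phone) arcs forming a linear chain:
    n phones join n + 1 nodes."""
    return [(i, i + 1, p) for i, p in enumerate(phones)]

find_model = word_model("find", ["f", "ay", "n", "d"])
print(find_model)  # four arcs joining five nodes, one arc per phone
```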
Referring to Figure 3C, the document type sub-finite state grammar <document> is depicted as comprising nodes 341-343 and arcs 344-346. Node 341 begins the sub-finite state grammar and is coupled to node 342 via arc 344 which corresponds to the word "the". Node 342 is coupled to node 343 by arc 345 representing the word "paper" and by arc 346 representing the word "figure". Referring to Figure 3D, the personal name sub-finite state grammar <personal-name> is shown comprising nodes 351-352 and arcs 353-355. Node 351 is coupled to node 352 by arc 353 representing the word "Mary", by arc 354 representing the word "John" and by arc 355 representing the word NEW-WORD. NEW-WORD represents an out-of-vocabulary word which was not in the original vocabulary category (e.g., personal names in this instance). By including a general acoustic model for NEW-WORD in the sub-finite state grammar, the recognition engine can generate an output indicating the presence of the out-of-vocabulary words.
The present invention allows for the incorporation of out-of-vocabulary (OOV) word detection capability for open-class grammar categories. An open-class grammar category is one in which one of the acoustic models correlates with a high probability to any spoken word. The open-class OOV
class network is represented as a sequence of all-phone sub-networks. A self-loop at the last state allows for arbitrarily long words. Figure 5 illustrates an example of the all-phone sub-network for NEW-WORD. Referring to Figure 5, the example acoustic model for NEW-WORD comprises nodes 501-504 and arcs 505-509. Node 501 is the beginning and is coupled to node 502 via arc 505, which represents any phone in the NEW-WORD. Node 502 is coupled to node 503 via arc 506 which again represents a phone within the NEW-WORD. Node 503 is coupled to node 504 by arc 507 to end the acoustic model of NEW-WORD. Once again, arc 507 represents another phone within NEW-WORD. Arcs 508 and 509 are self-looping arcs which loop from and to nodes 502 and 503 respectively. These arcs also represent any phone in the acoustic model for NEW-WORD. Thus, the acoustic model for NEW-WORD comprises a multiplicity of phones. It should be noted that the acoustic model for NEW-WORD could contain any number of phones. The actual number of phones chosen, which indicates the minimum duration of the acoustic model, is a design choice that is typically made by the designer. The representation is hierarchical so that only a single instance of both the all-phone network, such as that depicted in Figure 5, and the OOV network is needed. Thus, the present invention reduces the amount of memory needed to compensate for OOV words.
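The topology of Figure 5 can be sketched as follows (illustrative layout; "ANY" stands in for the all-phone arc). The chain length fixes the minimum number of phones, while the self-loops on the interior nodes allow arbitrarily long out-of-vocabulary words:

```python
# Sketch of the NEW-WORD all-phone network of Figure 5 (invented layout).
# "ANY" marks an arc that matches any phone.

def new_word_network(min_phones):
    """Chain of min_phones all-phone arcs, with self-loops on every node
    after the first so the matched word can contain extra phones."""
    arcs = [(i, i + 1, "ANY") for i in range(min_phones)]  # the chain
    arcs += [(i, i, "ANY") for i in range(1, min_phones)]  # self-loops
    return arcs

net = new_word_network(3)
chain = [a for a in net if a[0] != a[1]]
loops = [a for a in net if a[0] == a[1]]
print(len(chain), len(loops))  # 3 chain arcs, 2 self-loops, as in Figure 5
```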
In the present invention, a dictionary incorporates out-of-vocabulary words into the recognition engine. The dictionary contains non-verbal words, phone words, or both. The system designer has other accessible parameters, besides setting the minimum number of phones, by which the out-of-vocabulary detection can be controlled. A language weighting for open-class transitions in the grammar can also be chosen to control the ratio of false alarms (i.e., words recognized by the out-of-vocabulary detection when they are actually in the dictionary) versus detections. The language weighting is an adjustment to the probabilities of a language model, wherein less likely language models have a lower probability associated with them, such that they have a lower likelihood of being chosen as the result of the recognition process. Similarly, a language weighting is chosen for each of the phone arcs in the all-phone sub-network to give additional control over false alarms/detections.
Referring back to Figure 3E, the noise words sub-finite state grammar <nv> is shown comprised of nodes 361-362 and arcs 363-366. Node 361 is coupled to node 362 by arc 363 representing the acoustic machinery for the sound of a telephone ring, by arc 364 representing the acoustic machinery for the sound of a cough, by arc 365 representing the acoustic machinery for the sound of silence, and by arc
366 representing the acoustic machinery for the sound of a door slamming. It should be noted that the sub-finite state grammar <nv> is a non-verbal sub-finite state grammar (network) in that the recognition is not of a word, but is of a sound.
Figure 3E, in conjunction with Figure 3A, illustrates the advantageous manner in which non-verbal models are used in the present invention. In this case, the non-verbal models of noises, such as coughs, sneezes, etc., are represented in the present invention as a class or sub-network. By using sub-finite state grammars to implement different classes of noises which can occur during the recognition process, the size of the network can be reduced in comparison to the prior art monolithic finite state grammars, while experiencing insignificant overhead. The size of the network can be reduced because the entire class of noises does not have to be incorporated into the network at every node. Furthermore, the memory space required to store the non-verbal model of noises is reduced because the different classes of noises (i.e., the sub-finite state grammars) are only compiled when needed.
This is especially true when a large number of non-verbal models are used. These noise sub-finite state grammars, or categories of noises, can be located at any state in the recognition engine (i.e., at any node in the network) and are the same as any other sub-finite state grammar. These non-verbal networks are implemented using a self-looping mechanism, such that the beginning and ending of the arcs corresponding to the non-verbal network are at the same location. Thus, the present invention allows for the use of non-verbal networks which can be located freely about the network with little hindrance on performance.
The networks depicted in Figures 3A-E are implemented in memory using pointers in the same manner as prior art monolithic finite state grammars, as is well-known in the art. It should be noted that the relationship between the global finite state grammar and the sub-finite state grammars of the present invention is hierarchical in nature.
Figures 3A-E represent the static depictions of an example of a recursive finite state grammar of the present invention. To utilize these static representations, i.e., make them dynamic, the grammars, both global and sub-finite, must be compiled. In prior art recognition engines, although some grammars appear to be hierarchical, their hierarchical nature is lost upon being compiled. The present invention retains its hierarchical structure during the recognition process because each of the sub-finite state grammars and the global finite state grammar is compiled individually. The sub-finite state grammars are only accessed when needed. Hence, any required memory allocation can be deferred until the time when such an access is needed, such that the recognition
engine arrives at a solution by piecing the grammars together. If no access is required, no memory allocation is made. In this manner, the present invention saves memory. Moreover, by allowing the sub-finite state grammars to be compiled individually, any changes that take the form of additions and deletions to individual sub-finite state grammars can be made to the recognition engine without having to modify, and subsequently recompile, the global finite state network. Therefore, the global finite state grammar does not have to be recompiled every time a change occurs in the recognition engine. Thus, the present invention comprises a very flexible run-time recognition engine.
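The deferred, per-grammar compilation can be sketched with a simple cache (invented API; the "compilation" here is a stand-in for building the real network). Editing one sub-grammar invalidates only its own cache entry, so the global grammar is never recompiled:

```python
# Sketch of deferred per-grammar compilation (illustrative only).

sources = {
    "<global>": ["<locate>", "mail"],
    "<locate>": ["find", "get"],
}
compiled = {}              # cache of compiled networks
compile_count = {"n": 0}   # how many compilations actually happened

def get_compiled(name):
    """Compile a grammar on first access only; later accesses hit the cache."""
    if name not in compiled:
        compile_count["n"] += 1
        compiled[name] = tuple(sources[name])  # stand-in for real compilation
    return compiled[name]

get_compiled("<global>")           # compiles the global grammar
get_compiled("<global>")           # cache hit: no recompilation
assert compile_count["n"] == 1     # <locate> was never needed, never compiled

sources["<locate>"].append("locate")   # edit one sub-grammar...
compiled.pop("<locate>", None)         # ...and invalidate only that entry
get_compiled("<locate>")
print(compile_count["n"])              # 2: the global grammar was untouched
```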
Once the global finite state grammar and the individual sub-finite state grammars are compiled, the recognition engine can begin the recognition process. The recognition process is typically a matching process, wherein the acoustic models are matched with the speech input signal. However, in the present invention, where non-terminals in the global finite state grammar (or any sub-finite state grammar as well) are encountered by the recognition engine, the recognition engine must be able to identify that the transition involves an index to a sub-network. In other words, the recognition engine is not just seeing a terminal. Instead, the recognition engine is seeing a generic category or class. Therefore, the present invention must be able to compensate for the presence of the
non-terminals in the networks. To utilize the recursive finite state grammars of the present invention in the recognition process, a stack system is created in the memory of the computer system and is utilized in performing the recognition process.
At run time, all of the first phones of the acoustic models (machineries) that correspond to the transitions from the first node of the network are pushed onto the stack. For example, in the case of the word "find" in Figure 4, the phone /f/ would be pushed onto the stack. The models are pushed on the stack in their order of appearance in the network. As the recognition process continues, subsequent phones of the acoustic models corresponding to the current transitions being evaluated in the network are placed on the stack, while some of the previous phones may be removed. Thus, the stack is capable of growing and shrinking. Note that each path through the network represents one possible theory as to what the acoustic input signals are. As the recognition process continues, certain theories become less likely. In this case, portions of the acoustic models associated with these less likely theories may be removed from the stack.
As the recognition engine traverses through the network (i.e., by traversing through the serial stack), the recognition engine encounters both terminals (e.g., words, phones, etc., which have acoustic models) and non-terminals
(i.e., indices to sub-finite state grammars). The terminals are expanded and the associated acoustic models (e.g., HMM) are pushed on the stack. Thus, there is a stack of active machineries (e.g., HMMs) as the search is performed. It should be reiterated that the pushing of terminals and non¬ terminals onto the stack is performed at run-time on an as- needed basis. Hence, the entire network does not have to occupy memory space, such that the present invention produces a large memory savings. When a non-terminal is reached in the search, the recognition engine must obtain the sub-finite state grammar (i.e., the sub-network) and employ it in the recognition process. In the currently preferred embodiment, a pointer directs the recognition engine to the non-terminal sub- network. The recognition engine creates a dynamic version of the sub-network and pushes the dynamic version on stack. The dynamic version is a copy of the sub-finite state grammar. A copy is made because the particular sub-network may appear at more than one location in the hierarchical topology, such that the recognition is able to keep tract of all the different theories or instances of use. Each theory or model has a history consisting of a sequence of words. Thus, each occurrence of a sub-network in a network is associated with its own history, such that the probability of that occurrence of the sub-network is uniquely identified in the network (or sub- 27
network). In one embodiment, the history is only the identity of the last predecessor. A score associated with a particular theory is a percentage indicative of the probability that the current word follows the predecessor. The dynamic version comprises the topology of the network and also includes the information needed for the recognition engine to generate its result (i.e., its identity, its history and its scores associated with the sub-network). The actual sub-finite state grammar is not pushed onto the stack because it may appear and, thus, be needed at other instances within the global network. Thus, as different parts of the global finite state grammar are traversed and non¬ terminals gets expanded into terminals and non-terminals, the acoustic models of the terminals are placed on the stack. The recognition engine uses the acoustic models in the recognition process in the same manner as prior art finite state grammar recognition system, which is well-known in the art.
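The as-needed expansion described above can be sketched as follows. This is an illustrative Python sketch only, not the claimed implementation: the grammar table `SUBGRAMMARS`, the symbol names, and the function `step` are all hypothetical, and for brevity the sketch visits every transition in sequence rather than scoring parallel theories.

```python
# Hypothetical grammar table: each transition is either a terminal (a
# word with an acoustic model) or a non-terminal (an index to a
# sub-finite state grammar).
SUBGRAMMARS = {
    "<document>": [("terminal", "find"), ("nonterminal", "<docname>")],
    "<docname>":  [("terminal", "memo"), ("terminal", "letter")],
}

def step(stack):
    """Advance the search by one stack item.

    A non-terminal is expanded only when it is reached (the as-needed
    expansion described above): a dynamic copy of its sub-grammar is
    pushed, carrying this instance's own history, so the shared
    sub-grammar itself never goes on the stack.  A terminal is
    returned so its acoustic model could be evaluated."""
    kind, name, history = stack.pop()
    if kind == "nonterminal":
        # Push in reverse so the first transition is evaluated first;
        # list(history) makes the copy per-instance.
        for k, n in reversed(SUBGRAMMARS[name]):
            stack.append((k, n, list(history)))
        return None
    return name, history

# Start with only the global grammar's entry symbol on the stack.
stack = [("nonterminal", "<document>", [])]
words = []
while stack:
    result = step(stack)
    if result is not None:
        word, _history = result
        words.append(word)
```

After the loop, `words` holds every terminal encountered, in network order, while at no point did the full expanded network reside in memory.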
When each class or category which indexes a sub-network is pushed onto the stack, a mechanism exists by which the sub-network can be traversed. In one embodiment, the sub-network may be popped off the stack. In the preferred embodiment, when a sub-network is pushed onto the stack, information corresponding to its termination or ending state is pushed onto the stack with it. In other words, the information pushed onto the stack identifies the termination or ending state of the current sub-network as the location of the next node in the network which called the current sub-network. Once the recognition engine completes traversing the particular sub-network, it knows where to transition to by referring to the ending or termination state. Therefore, in the preferred embodiment, there is no need to provide the functionality necessary to pop items off the stack. It should be noted that the self-looping mechanism described earlier employs this feature: by having the ending state be the same as the beginning state, the transition is able to loop back to itself.
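The termination-state mechanism can be sketched as below. This is a minimal illustration under assumed names (`enter_subnetwork`, `leave_subnetwork`, the node labels `"n1"`/`"n3"` are all hypothetical); it only shows how recording the caller's next node at entry removes the need for a pop, and how a self-loop falls out of making the ending state equal the calling node.

```python
def enter_subnetwork(stack, sub_name, ending_state):
    """Push a sub-network together with its termination information.

    The 'ending state' recorded here is the next node of the calling
    network, so when the sub-network has been traversed the engine
    simply transitions to that node -- no pop operation is required."""
    stack.append({"sub": sub_name, "ending_state": ending_state})

def leave_subnetwork(frame):
    # On completing the sub-network, continue at the state recorded
    # when it was entered.
    return frame["ending_state"]

stack = []

# Ordinary call: afterwards, resume at node "n3" of the caller.
enter_subnetwork(stack, "<docname>", ending_state="n3")
resume_at = leave_subnetwork(stack[-1])

# Self-loop: the ending state is the calling node itself, so the
# transition (e.g., a noise sub-network) can repeat indefinitely.
enter_subnetwork(stack, "<noise>", ending_state="n1")
loops_back = leave_subnetwork(stack[-1]) == "n1"
```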
While the word machineries are on the stack, the recognition engine performs the searching. Based on the likelihood of the theories, the recognition engine continues on to the next machinery or machineries. The stack grows and shrinks as the theories survive (remain above a threshold probability) or die (fall below the threshold). Once all of the machineries have been evaluated, signified by the stack being empty, the most probable theory is produced as textual output or as an action taken by the computer (e.g., opening a folder). In the case of text, the textual output represents the recognized speech.
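The survive-or-die pruning described above can be illustrated with a short sketch. The threshold value, the `prune` helper, and the scored theories are all hypothetical, chosen only to show the mechanism: theories below the threshold are removed (the stack shrinks), and when evaluation is complete the highest-scoring survivor becomes the recognized output.

```python
THRESHOLD = 0.05  # hypothetical pruning probability

def prune(theories, threshold=THRESHOLD):
    """Keep theories that survive (score above the threshold); the
    rest die and are removed, shrinking the set of active machineries."""
    return [t for t in theories if t["score"] > threshold]

theories = [
    {"words": ["find", "memo"],   "score": 0.62},
    {"words": ["find", "demo"],   "score": 0.02},  # below threshold: dies
    {"words": ["find", "letter"], "score": 0.30},
]
survivors = prune(theories)

# Once every machinery has been evaluated (the stack is empty), the
# most probable surviving theory becomes the recognized output.
best = max(survivors, key=lambda t: t["score"])
recognized = " ".join(best["words"])
```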
Whereas many alterations and modifications of the present invention will no doubt become apparent to a person of ordinary skill in the art after having read the foregoing description, it is to be understood that the particular embodiment shown and described by way of illustration is in no way intended to be considered limiting. Therefore, reference to the details of the preferred embodiments is not intended to limit the scope of the claims, which themselves recite only those features regarded as essential to the invention.
Thus, a speech recognition system using recursive finite state grammars has been described.

Claims

We claim:
1. A speech recognition system for recognizing speech signals comprising:

a plurality of finite state grammars including at least one global finite state grammar and at least one sub-finite state grammar, wherein each of said plurality includes a plurality of states and at least one transition arranged in a network, and further wherein said transitions are capable of including either terminals or non-terminals, each of said terminals associated with an acoustic model and each of said non-terminals associated with calls to said at least one sub-finite state grammar; and

recognition means for performing recognition by traversing said global finite state grammar, wherein if said recognition engine encounters a terminal then said recognition engine uses the acoustic model of the terminal in performing recognition of the speech signals and if said recognition engine encounters a non-terminal then said recognition engine calls the sub-finite state grammar associated with the non-terminal and continues performing recognition by traversing said sub-finite state grammar, such that upon completion of traversing said sub-finite state grammar said recognition engine returns to and continues traversing the global finite state grammar at the location of the call.
2. The system as defined in Claim 1 wherein each of said terminals is a word.
3. The system as defined in Claim 1 wherein said acoustic model comprises a Hidden Markov Model.
4. The system as defined in Claim 1 wherein said recognition means traverses said grammars using a stack, such that as said recognition engine encounters a terminal in one of said plurality of finite state grammars, the acoustic model associated with said terminal is pushed onto the stack.
5. The system as defined in Claim 4 wherein information regarding the next state is pushed onto the stack with the acoustic models associated with the terminals of the sub-finite state grammar, such that the recognition engine continues traversing at the location indicated by the next state upon completion of traversing said sub-finite state grammar.
6. The system as defined in Claim 1 wherein the recognition engine determines that the end of the global finite state grammar has been reached by determining that the stack is empty, such that the recognition process is finished.
7. A speech recognition system for recognizing speech signals comprising:

a plurality of finite state grammars including at least one global finite state grammar and at least one sub-finite state grammar, wherein each of said plurality includes a plurality of states and at least one transition arranged in a network, and further wherein said transitions are capable of including either words or classes, each of said words associated with an acoustic model and each of said classes associated with a call to said at least one sub-finite state grammar; and

recognition means for performing recognition by traversing said global finite state grammar, wherein if a word is encountered then said recognition engine performs recognition using the acoustic model associated with the word and if a class is encountered then said recognition engine calls the sub-finite state grammar associated with the class and continues performing recognition by traversing said sub-finite state grammar, such that upon completion of traversing said sub-finite state grammar said recognition engine returns to and continues traversing the global finite state grammar at the location of the call.
8. The system as defined in Claim 7 wherein at least one of the states in at least one of the plurality of finite state grammars contains a self-looping transition which begins and ends at the same state.
9. The system as defined in Claim 8 wherein said self-looping transition comprises a noise word.
10. The system as defined in Claim 8 wherein said self-looping transition comprises a class.
11. The system as defined in Claim 10 wherein said class comprises a sub-finite state grammar of noises, such that each transition in the sub-finite state grammar is associated with acoustic models which represent the noises.
12. The system as defined in Claim 7 wherein at least one of the transitions is associated with an acoustic model having all phones, such that out-of-vocabulary word detection is provided.
13. The system as defined in Claim 12 wherein the all phone acoustic model has at least one self-looping transition, and further wherein the self-looping transition is associated with an all-phone acoustic model.
14. The system as defined in Claim 13 wherein said at least one self-looping transition is located at the last state to compensate for arbitrarily long words.
15. A method for recognizing speech signals comprising the steps of:

providing a first transition network having states and transitions between the states, such that the first transition network may be traversed, wherein each of the transitions is associated with a terminal or a class;

providing a second transition network having states and at least one transition between the states, such that the second transition network may be traversed;

traversing the first network such that speech recognition is performed, wherein the second network is called upon arriving at the transition associated with the class when said first network is being traversed, such that the second transition network is traversed; and

upon traversing the second transition network, returning to the first network, wherein the traversing of the first network continues from the point of the call, such that the speech signals are recognized.
PCT/US1993/012598 1992-12-31 1993-12-28 Recursive finite state grammar WO1994016434A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
AU60800/94A AU6080094A (en) 1992-12-31 1993-12-28 Recursive finite state grammar
DE4397100T DE4397100T1 (en) 1992-12-31 1993-12-28 Recursive grammar with a finite number of states
DE4397100A DE4397100C2 (en) 1992-12-31 1993-12-28 Method for recognizing speech signals and speech recognition system with recursive grammar with a finite number of states

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US99901792A 1992-12-31 1992-12-31
US07/999,017 1992-12-31

Publications (1)

Publication Number Publication Date
WO1994016434A1 true WO1994016434A1 (en) 1994-07-21

Family

ID=25545784

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1993/012598 WO1994016434A1 (en) 1992-12-31 1993-12-28 Recursive finite state grammar

Country Status (4)

Country Link
AU (1) AU6080094A (en)
CA (1) CA2151371A1 (en)
DE (2) DE4397100C2 (en)
WO (1) WO1994016434A1 (en)

Cited By (96)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1996037881A2 (en) * 1995-05-26 1996-11-28 Applied Language Technologies Method and apparatus for dynamic adaptation of a large vocabulary speech recognition system and for use of constraints from a database in a large vocabulary speech recognition system
EP0801378A2 (en) * 1996-04-10 1997-10-15 Lucent Technologies Inc. Method and apparatus for speech recognition
EP0903727A1 (en) * 1997-09-17 1999-03-24 Istituto Trentino Di Cultura A system and method for automatic speech recognition
US6064959A (en) * 1997-03-28 2000-05-16 Dragon Systems, Inc. Error correction in speech recognition
US7584103B2 (en) 2004-08-20 2009-09-01 Multimodal Technologies, Inc. Automated extraction of semantic content and generation of a structured document from speech
US8959102B2 (en) 2010-10-08 2015-02-17 Mmodal Ip Llc Structured searching of dynamic structured document corpuses
US9262612B2 (en) 2011-03-21 2016-02-16 Apple Inc. Device access using voice authentication
US9318108B2 (en) 2010-01-18 2016-04-19 Apple Inc. Intelligent automated assistant
US9330720B2 (en) 2008-01-03 2016-05-03 Apple Inc. Methods and apparatus for altering audio output signals
US9338493B2 (en) 2014-06-30 2016-05-10 Apple Inc. Intelligent automated assistant for TV user interactions
US9483461B2 (en) 2012-03-06 2016-11-01 Apple Inc. Handling speech synthesis of content for multiple languages
US9495129B2 (en) 2012-06-29 2016-11-15 Apple Inc. Device, method, and user interface for voice-activated navigation and browsing of a document
US9535906B2 (en) 2008-07-31 2017-01-03 Apple Inc. Mobile device having human language translation capability with positional feedback
US9582608B2 (en) 2013-06-07 2017-02-28 Apple Inc. Unified ranking with entropy-weighted information for phrase-based semantic auto-completion
US9620104B2 (en) 2013-06-07 2017-04-11 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
US9626955B2 (en) 2008-04-05 2017-04-18 Apple Inc. Intelligent text-to-speech conversion
US9633660B2 (en) 2010-02-25 2017-04-25 Apple Inc. User profiling for voice input processing
US9633674B2 (en) 2013-06-07 2017-04-25 Apple Inc. System and method for detecting errors in interactions with a voice-based digital assistant
US9646614B2 (en) 2000-03-16 2017-05-09 Apple Inc. Fast, language-independent method for user authentication by voice
US9646609B2 (en) 2014-09-30 2017-05-09 Apple Inc. Caching apparatus for serving phonetic pronunciations
US9668121B2 (en) 2014-09-30 2017-05-30 Apple Inc. Social reminders
US9697820B2 (en) 2015-09-24 2017-07-04 Apple Inc. Unit-selection text-to-speech synthesis using concatenation-sensitive neural networks
US9715875B2 (en) 2014-05-30 2017-07-25 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US9721566B2 (en) 2015-03-08 2017-08-01 Apple Inc. Competing devices responding to voice triggers
US9760559B2 (en) 2014-05-30 2017-09-12 Apple Inc. Predictive text input
US9785630B2 (en) 2014-05-30 2017-10-10 Apple Inc. Text prediction using combined word N-gram and unigram language models
US9798393B2 (en) 2011-08-29 2017-10-24 Apple Inc. Text correction processing
US9818400B2 (en) 2014-09-11 2017-11-14 Apple Inc. Method and apparatus for discovering trending terms in speech requests
US9842105B2 (en) 2015-04-16 2017-12-12 Apple Inc. Parsimonious continuous-space phrase representations for natural language processing
US9842101B2 (en) 2014-05-30 2017-12-12 Apple Inc. Predictive conversion of language input
US9858925B2 (en) 2009-06-05 2018-01-02 Apple Inc. Using context information to facilitate processing of commands in a virtual assistant
US9865280B2 (en) 2015-03-06 2018-01-09 Apple Inc. Structured dictation using intelligent automated assistants
US9886953B2 (en) 2015-03-08 2018-02-06 Apple Inc. Virtual assistant activation
US9886432B2 (en) 2014-09-30 2018-02-06 Apple Inc. Parsimonious handling of word inflection via categorical stem + suffix N-gram language models
US9892734B2 (en) 2006-06-22 2018-02-13 Mmodal Ip Llc Automatic decision support
US9899019B2 (en) 2015-03-18 2018-02-20 Apple Inc. Systems and methods for structured stem and suffix language models
US9934775B2 (en) 2016-05-26 2018-04-03 Apple Inc. Unit-selection text-to-speech synthesis based on predicted concatenation parameters
US9953088B2 (en) 2012-05-14 2018-04-24 Apple Inc. Crowd sourcing information to fulfill user requests
US9966065B2 (en) 2014-05-30 2018-05-08 Apple Inc. Multi-command single utterance input method
US9966068B2 (en) 2013-06-08 2018-05-08 Apple Inc. Interpreting and acting upon commands that involve sharing information with remote devices
US9972304B2 (en) 2016-06-03 2018-05-15 Apple Inc. Privacy preserving distributed evaluation framework for embedded personalized systems
US9971774B2 (en) 2012-09-19 2018-05-15 Apple Inc. Voice-based media searching
US10049663B2 (en) 2016-06-08 2018-08-14 Apple, Inc. Intelligent automated assistant for media exploration
US10049668B2 (en) 2015-12-02 2018-08-14 Apple Inc. Applying neural network language models to weighted finite state transducers for automatic speech recognition
US10057736B2 (en) 2011-06-03 2018-08-21 Apple Inc. Active transport based notifications
US10067938B2 (en) 2016-06-10 2018-09-04 Apple Inc. Multilingual word prediction
US10074360B2 (en) 2014-09-30 2018-09-11 Apple Inc. Providing an indication of the suitability of speech recognition
US10078631B2 (en) 2014-05-30 2018-09-18 Apple Inc. Entropy-guided text prediction using combined word and character n-gram language models
US10079014B2 (en) 2012-06-08 2018-09-18 Apple Inc. Name recognition system
US10083688B2 (en) 2015-05-27 2018-09-25 Apple Inc. Device voice control for selecting a displayed affordance
US10089072B2 (en) 2016-06-11 2018-10-02 Apple Inc. Intelligent device arbitration and control
US10101822B2 (en) 2015-06-05 2018-10-16 Apple Inc. Language input correction
US10127220B2 (en) 2015-06-04 2018-11-13 Apple Inc. Language identification from short strings
US10127911B2 (en) 2014-09-30 2018-11-13 Apple Inc. Speaker identification and unsupervised speaker adaptation techniques
US10169329B2 (en) 2014-05-30 2019-01-01 Apple Inc. Exemplar-based natural language processing
US10176167B2 (en) 2013-06-09 2019-01-08 Apple Inc. System and method for inferring user intent from speech inputs
US10185542B2 (en) 2013-06-09 2019-01-22 Apple Inc. Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant
US10186254B2 (en) 2015-06-07 2019-01-22 Apple Inc. Context-based endpoint detection
US10192552B2 (en) 2016-06-10 2019-01-29 Apple Inc. Digital assistant providing whispered speech
US10223066B2 (en) 2015-12-23 2019-03-05 Apple Inc. Proactive assistance based on dialog communication between devices
US10241752B2 (en) 2011-09-30 2019-03-26 Apple Inc. Interface for a virtual digital assistant
US10241644B2 (en) 2011-06-03 2019-03-26 Apple Inc. Actionable reminder entries
US10249300B2 (en) 2016-06-06 2019-04-02 Apple Inc. Intelligent list reading
US10255907B2 (en) 2015-06-07 2019-04-09 Apple Inc. Automatic accent detection using acoustic models
US10269345B2 (en) 2016-06-11 2019-04-23 Apple Inc. Intelligent task discovery
US10276170B2 (en) 2010-01-18 2019-04-30 Apple Inc. Intelligent automated assistant
US10283110B2 (en) 2009-07-02 2019-05-07 Apple Inc. Methods and apparatuses for automatic speech recognition
US10297253B2 (en) 2016-06-11 2019-05-21 Apple Inc. Application integration with a digital assistant
US10318871B2 (en) 2005-09-08 2019-06-11 Apple Inc. Method and apparatus for building an intelligent automated assistant
US10354011B2 (en) 2016-06-09 2019-07-16 Apple Inc. Intelligent automated assistant in a home environment
US10366158B2 (en) 2015-09-29 2019-07-30 Apple Inc. Efficient word encoding for recurrent neural network language models
US10446143B2 (en) 2016-03-14 2019-10-15 Apple Inc. Identification of voice inputs providing credentials
US10446141B2 (en) 2014-08-28 2019-10-15 Apple Inc. Automatic speech recognition based on user feedback
US10490187B2 (en) 2016-06-10 2019-11-26 Apple Inc. Digital assistant providing automated status report
US10496753B2 (en) 2010-01-18 2019-12-03 Apple Inc. Automatically adapting user interfaces for hands-free interaction
US10509862B2 (en) 2016-06-10 2019-12-17 Apple Inc. Dynamic phrase expansion of language input
US10521466B2 (en) 2016-06-11 2019-12-31 Apple Inc. Data driven natural language event detection and classification
US10553209B2 (en) 2010-01-18 2020-02-04 Apple Inc. Systems and methods for hands-free notification summaries
US10552013B2 (en) 2014-12-02 2020-02-04 Apple Inc. Data detection
US10568032B2 (en) 2007-04-03 2020-02-18 Apple Inc. Method and system for operating a multi-function portable electronic device using voice-activation
US10567477B2 (en) 2015-03-08 2020-02-18 Apple Inc. Virtual assistant continuity
US10593346B2 (en) 2016-12-22 2020-03-17 Apple Inc. Rank-reduced token representation for automatic speech recognition
US10659851B2 (en) 2014-06-30 2020-05-19 Apple Inc. Real-time digital assistant knowledge updates
US10671428B2 (en) 2015-09-08 2020-06-02 Apple Inc. Distributed personal assistant
US10679605B2 (en) 2010-01-18 2020-06-09 Apple Inc. Hands-free list-reading by intelligent automated assistant
US10691473B2 (en) 2015-11-06 2020-06-23 Apple Inc. Intelligent automated assistant in a messaging environment
US10706373B2 (en) 2011-06-03 2020-07-07 Apple Inc. Performing actions associated with task items that represent tasks to perform
US10705794B2 (en) 2010-01-18 2020-07-07 Apple Inc. Automatically adapting user interfaces for hands-free interaction
US10733993B2 (en) 2016-06-10 2020-08-04 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US10747498B2 (en) 2015-09-08 2020-08-18 Apple Inc. Zero latency digital assistant
US10789041B2 (en) 2014-09-12 2020-09-29 Apple Inc. Dynamic thresholds for always listening speech trigger
US10791176B2 (en) 2017-05-12 2020-09-29 Apple Inc. Synchronization and task delegation of a digital assistant
US10810274B2 (en) 2017-05-15 2020-10-20 Apple Inc. Optimizing dialogue policy decisions for digital assistants using implicit feedback
US11010550B2 (en) 2015-09-29 2021-05-18 Apple Inc. Unified language modeling framework for word prediction, auto-completion and auto-correction
US11025565B2 (en) 2015-06-07 2021-06-01 Apple Inc. Personalized prediction of responses for instant messaging
US11587559B2 (en) 2015-09-30 2023-02-21 Apple Inc. Intelligent device identification

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0248377A2 (en) * 1986-06-02 1987-12-09 Motorola, Inc. Continuous speech recognition system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE3786822T2 (en) * 1986-04-25 1994-01-13 Texas Instruments Inc Speech recognition system.


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
NIEDERMAIR: "Datenbankdialog in gesprochener Sprache - Linguistische Analyse in SPICOS II", INFORMATIONSTECHNIK IT, vol. 31, no. 6, December 1989 (1989-12-01), MÜNCHEN, DE, pages 382 - 391, XP000074830 *
PARSONS: "Voice and Speech Processing", 1986, MCGRAW-HILL, NEW YORK, US *

US9972304B2 (en) 2016-06-03 2018-05-15 Apple Inc. Privacy preserving distributed evaluation framework for embedded personalized systems
US10249300B2 (en) 2016-06-06 2019-04-02 Apple Inc. Intelligent list reading
US11069347B2 (en) 2016-06-08 2021-07-20 Apple Inc. Intelligent automated assistant for media exploration
US10049663B2 (en) 2016-06-08 2018-08-14 Apple, Inc. Intelligent automated assistant for media exploration
US10354011B2 (en) 2016-06-09 2019-07-16 Apple Inc. Intelligent automated assistant in a home environment
US10509862B2 (en) 2016-06-10 2019-12-17 Apple Inc. Dynamic phrase expansion of language input
US10067938B2 (en) 2016-06-10 2018-09-04 Apple Inc. Multilingual word prediction
US11037565B2 (en) 2016-06-10 2021-06-15 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US10490187B2 (en) 2016-06-10 2019-11-26 Apple Inc. Digital assistant providing automated status report
US10192552B2 (en) 2016-06-10 2019-01-29 Apple Inc. Digital assistant providing whispered speech
US10733993B2 (en) 2016-06-10 2020-08-04 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US11152002B2 (en) 2016-06-11 2021-10-19 Apple Inc. Application integration with a digital assistant
US10521466B2 (en) 2016-06-11 2019-12-31 Apple Inc. Data driven natural language event detection and classification
US10297253B2 (en) 2016-06-11 2019-05-21 Apple Inc. Application integration with a digital assistant
US10089072B2 (en) 2016-06-11 2018-10-02 Apple Inc. Intelligent device arbitration and control
US10269345B2 (en) 2016-06-11 2019-04-23 Apple Inc. Intelligent task discovery
US10593346B2 (en) 2016-12-22 2020-03-17 Apple Inc. Rank-reduced token representation for automatic speech recognition
US11405466B2 (en) 2017-05-12 2022-08-02 Apple Inc. Synchronization and task delegation of a digital assistant
US10791176B2 (en) 2017-05-12 2020-09-29 Apple Inc. Synchronization and task delegation of a digital assistant
US10810274B2 (en) 2017-05-15 2020-10-20 Apple Inc. Optimizing dialogue policy decisions for digital assistants using implicit feedback

Also Published As

Publication number Publication date
DE4397100C2 (en) 2003-02-27
AU6080094A (en) 1994-08-15
CA2151371A1 (en) 1994-07-21
DE4397100T1 (en) 1995-11-23

Similar Documents

Publication Publication Date Title
WO1994016434A1 (en) Recursive finite state grammar
US5390279A (en) Partitioning speech rules by context for speech recognition
JP2644171B2 (en) Method and speech recognition system for building a target field-dependent model in the form of a decision tree for intelligent machines
US7478037B2 (en) Assigning meanings to utterances in a speech recognition system
US5384892A (en) Dynamic language model for speech recognition
US5613036A (en) Dynamic categories for a speech recognition system
US6501833B2 (en) Method and apparatus for dynamic adaptation of a large vocabulary speech recognition system and for use of constraints from a database in a large vocabulary speech recognition system
JP3696231B2 (en) Language model generation and storage device, speech recognition device, language model generation method and speech recognition method
US7487094B1 (en) System and method of call classification with context modeling based on composite words
US7676365B2 (en) Method and apparatus for constructing and using syllable-like unit language models
US7620548B2 (en) Method and system for automatic detecting morphemes in a task classification system using lattices
US6178401B1 (en) Method for reducing search complexity in a speech recognition system
US6061653A (en) Speech recognition system using shared speech models for multiple recognition processes
EP0384584A2 (en) A chart parser for stochastic unification grammar
US20070118353A1 (en) Device, method, and medium for establishing language model
EP0938076B1 (en) A speech recognition system
WO2002029612A1 (en) Method and system for generating and searching an optimal maximum likelihood decision tree for hidden markov model (hmm) based speech recognition
Georgila et al. Large Vocabulary Search Space Reduction Employing Directed Acyclic Word Graphs and Phonological Rules
Kobayashi et al. A sub-word level matching strategy in a speech understanding system
Georgila et al. Improved large vocabulary speech recognition using lexical rules

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AT AU BB BG BR BY CA CH CZ DE DK ES FI GB HU JP KP KR KZ LK LU LV MG MN MW NL NO NZ PL PT RO RU SD SE SK UA UZ VN

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN ML MR NE SN TD TG

121 Ep: The EPO has been informed by WIPO that EP was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 20040101)
WWE WIPO information: entry into national phase

Ref document number: 2151371

Country of ref document: CA

RET De: Translation (DE OG Part 6b)

Ref document number: 4397100

Country of ref document: DE

Date of ref document: 19951123

WWE WIPO information: entry into national phase

Ref document number: 4397100

Country of ref document: DE

122 Ep: PCT application non-entry in European phase