RECURSIVE FINITE STATE GRAMMAR
FIELD OF THE INVENTION
The present invention relates to the field of continuous speech recognition; more particularly, the present invention relates to finite state grammar networks utilized in the recognition process.
BACKGROUND OF THE INVENTION
Recently, speech recognition systems have become more prevalent in today's high-technology market. Due to advances in computer technology and advances in speech recognition algorithms, these speech recognition systems have become more powerful. Current speech recognition systems operate by matching an acoustic description, or model, of a word in their vocabulary against a representation of the acoustic signal generated by the utterance of the word to be recognized. The vocabulary contains all of the words which the speech recognition system is capable of recognizing. In other words, the vocabulary consists of all of the words which have acoustic models stored in the system. It should be noted that not all of the vocabulary is active all of the time. At any one time, only a portion of the vocabulary may be active. Typically, only a portion of the vocabulary is activated
because there are limitations in the current state of the art. Language models are used to indicate which portion of the vocabulary is currently active.
In continuous speech recognition, each language model is associated with a grammar. A grammar represents the set of all possible sentence sequences that constitute recognizable inputs to the speech recognition system for any particular vocabulary. The grammar is typically not every combination of words in the vocabulary. Instead, the grammar represents the combinations of words in the vocabulary that have meaning in the particular context or application currently being employed. The grammar for a particular context or application is usually stored in memory in a compact format. The grammar model for a speech recognition system can be static, i.e. specified before a particular application is run, or dynamic, where the grammar changes as the user interacts with the system. In the former, the grammar model is usually specified by one familiar with the application. In the latter, the grammar model can be constructed as the user interacts with the application by the use of a specially configured user interface. In this case, the grammar changes as the user interacts, such that the grammar model reflects the current state of the vocabulary utilized by the speech recognition system.
In the prior art, the grammars are often encoded as finite state grammars. In finite state grammars, the collection of sentences is represented as a single network of arcs and nodes; that is, the sentences are represented as states and transitions in the network. Each arc, or transition, in the network refers to a particular word in the vocabulary, while each node, or state, ties the words together in a particular sentence. The arcs connect the nodes to form a network. Associated with each word is an acoustic model. The acoustic model for the word is represented as a sequence of phonetic models. Through the use of the network, a speech recognition system is capable of matching the acoustic description of each word in the grammar against the acoustic signal, such that the spoken phrase can be recognized. The networks that comprise the grammar (e.g., the finite state grammar networks) for a particular application can be very large. At run time, the entire network must be compiled. If a particular vocabulary contains thousands of words, the network utilized to depict all of the possible grammars can potentially require a large amount of memory, especially at run time. Therefore, even if a particular part of the network will never be required, it is still compiled, thereby requiring its own memory allocation. Moreover, portions of the grammar may be repeated at other locations in the network. Thus, identical portions of the grammar must be
compiled multiple times, such that multiple memory allocations, each associated with identical but separately stored parts of the network, are required. Since memory and its usage are at a premium in today's technology, there is a desire in speech recognition systems to reduce the amount of memory utilized in storing the grammar.
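A prior-art style monolithic finite state grammar of the kind described above can be sketched as data. This is a hedged illustration only; the state names and words are invented, not taken from the patent.

```python
# A minimal sketch (not the patented implementation) of a flat finite
# state grammar: states are nodes, word-labeled arcs are transitions,
# and the network as a whole enumerates the recognizable sentences.
GRAMMAR = {
    # state -> list of (word, next_state); "END" is the final state
    "START": [("mail", "S1")],
    "S1":    [("the", "S2")],
    "S2":    [("paper", "END"), ("figure", "END")],
}

def sentences(state="START", prefix=()):
    """Enumerate every word sequence the network accepts."""
    if state == "END":
        yield " ".join(prefix)
        return
    for word, nxt in GRAMMAR[state]:
        yield from sentences(nxt, prefix + (word,))

print(sorted(sentences()))   # -> ['mail the figure', 'mail the paper']
```

In a real system each arc would carry an acoustic model rather than a bare string, and the whole network would be compiled at once, which is the memory cost discussed above.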
As will be shown, the present invention involves a recursive finite state grammar which uses a collection of finite state grammars. The set of finite state grammars of the present invention comprises one global finite state grammar and multiple sub-finite state grammars. The present invention creates and combines multiple grammars dynamically at run time. Furthermore, the present invention reduces the memory required to perform the speech recognition.
SUMMARY OF THE INVENTION
A method and means for performing speech recognition are described. The method and means include multiple finite state grammars. The multiple finite state grammars include at least one global finite state grammar and at least one sub-finite state grammar. Each of the multiple finite state grammars includes multiple states and at least one transition arranged in a network. The transitions in the network are capable of including either terminals or non-terminals. Each of the terminals is associated with an acoustic model, while each of the non-terminals is associated with a call to one of the sub-finite state grammars. The present invention also includes a recognition engine which performs the recognition by traveling through the global finite state grammar. As terminals are encountered, the recognition engine matches the acoustic model of the terminal to the speech signals. As non-terminals are encountered, the recognition engine calls the sub-finite state grammar associated with the non-terminal and continues performing recognition by traversing the sub-finite state grammar. In traversing the sub-finite state grammar, the recognition engine matches the acoustic model to the speech signals to continue with the recognition. Upon traversing the sub-finite state grammar, the recognition engine returns to and continues
traversing the global finite state grammar at the location of the call. In this manner, the speech signals are matched against the acoustic models in the global and sub-finite state grammars to perform the speech recognition.
BRIEF DESCRIPTION OF DRAWINGS
The present invention will be understood more fully from the detailed description given below and from the accompanying drawings of the preferred embodiment of the invention, which, however, should not be taken to limit the invention to the specific embodiment but are for explanation and understanding only.
Figure 1 is a block diagram of the computer system which may be utilized by the preferred embodiment of the present invention.
Figure 2 is a block diagram of the speech recognition system of the present invention.
Figures 3A-E illustrate an example of a recursive finite state grammar of the present invention.
Figure 4 illustrates an example of an acoustic model for the word "find" as used in one embodiment of the present invention.
Figure 5 illustrates the general description of the acoustic model for a word added to one of the sub-finite state grammars of the present invention.
DETAILED DESCRIPTION OF THE INVENTION
A method and means for performing speech recognition are described. In the following description, numerous specific details are set forth such as specific processing steps, recognition algorithms, acoustic models, etc., in order to provide a thorough understanding of the present invention. It will be obvious, however, to one skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known speech recognition processing steps and circuitry have not been described in detail to avoid unnecessarily obscuring the present invention.
Overview of the Present Invention
The present invention is referred to as a recursive finite state grammar. The framework of the present invention is superior to the finite state automata language models typically used for speech recognition. In the prior art finite state automata language models, the language is usually represented with a single finite-state automaton where the transitions represent the terminals (or words) in the language model. The recursive finite state grammar of the present invention consists of one global finite state grammar and multiple sub-finite state grammars. The global finite state grammar is comprised of transitions and states which form a
network. The states are equated with nodes in the network, while the transitions are represented as arcs between the nodes, or states. Each transition in the present invention represents either a terminal or a non-terminal. The terminal can be a word or phone which is associated with an acoustic model that represents its speech recognition pattern. Non-terminals, on the other hand, represent classes or categories of vocabulary and are associated with an index to a sub-finite state grammar which represents that portion of the vocabulary. The sub-finite state grammars can be viewed as sub-networks having the same format as the global finite state grammar. When performing recognition, the recognition engine uses the global finite state grammar network to perform the matching between received speech signals and the acoustic models throughout the network. When the recognition engine encounters a non-terminal in the network, the recognition engine calls the sub-finite state grammar network associated with the non-terminal and continues the recognition process. Thus, the language model of the present invention incorporates not only a single network, but a collection of networks, each of which is capable of calling the other networks (i.e., it is recursive).
The recursive nature of the present invention reduces the required memory since the same language can be represented more efficiently and can be potentially smaller
than the single finite state grammars of the prior art. The minimal form of the network which results from employing sub-networks to represent transitions in the networks produces a network which is more conducive to efficiently searching the acoustic space. Moreover, the present invention allows for easy creation and combination of several grammars dynamically at run time.
The present invention also involves a new recognition algorithm wherein each transition encountered during the search is added to a stack. When a word transition (arc) is encountered, the search proceeds as a normal search. However, when a class indexing a sub-finite state grammar is reached, the network is pushed onto the stack. When the search exits the network, the search continues at the same point in the original network where the call was made. When the stack is empty, the final state of the global recursive finite state grammar has been reached, the recognition engine is finished, and a textual output representing the speech signals as recognized by the recognition engine is output.
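The stack discipline described above can be sketched with a toy recursive grammar. The grammar names and words below are hypothetical illustrations, not the patent's data: terminals are matched against input words, a non-terminal pushes its sub-network together with a return point, and an empty return stack at the global grammar's final state signals acceptance.

```python
# Illustrative sketch of a stack-based search over a recursive grammar.
RULES = {
    # grammar -> {state: [(kind, label, next_state)]}; "F" is final
    "<global>": {0: [("NT", "<locate>", 1)], 1: [("T", "paper", "F")]},
    "<locate>": {0: [("T", "find", "F"), ("T", "get", "F")]},
}

def accepts(words):
    # agenda holds ((grammar, state, return-stack), words consumed so far)
    agenda = [(("<global>", 0, ()), 0)]
    while agenda:
        (gram, state, ret), i = agenda.pop()
        if state == "F":
            if ret:                               # return to the caller
                (g2, s2), rest = ret[-1], ret[:-1]
                agenda.append(((g2, s2, rest), i))
            elif i == len(words):                 # empty stack: finished
                return True
            continue
        for kind, label, nxt in RULES[gram][state]:
            if kind == "T" and i < len(words) and words[i] == label:
                agenda.append(((gram, nxt, ret), i + 1))
            elif kind == "NT":                    # push the sub-network
                agenda.append(((label, 0, ret + ((gram, nxt),)), i))
    return False

print(accepts(["find", "paper"]), accepts(["paper"]))   # -> True False
```

A real engine matches acoustic scores rather than exact word strings, but the push/return bookkeeping is the same shape.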
The Overview of a Computer System in the Preferred Embodiment
The preferred embodiment of the present invention may be practiced on computer systems having alternative configurations. Figure 1 illustrates some of the basic
components of such a computer system, but is not meant to be limiting nor to exclude other components or combinations of components. The computer system illustrated in Figure 1 comprises a bus or other communication means 101 for communicating information, a processing means 102
(commonly referred to as a host processor) coupled with bus 101 for processing information, a random access memory (RAM) or other storage device 103 (commonly referred to as main memory) coupled with bus 101 for storing information and instructions for the processor 102, and a read only memory (ROM) or other static storage device 104 coupled with the bus 101 for storing static information and instructions for the processor 102.
Other devices coupled to bus 101 include a data storage device 105, such as a magnetic disk and disk drive for storing information and instructions, an alphanumeric input device 106, including alphanumeric and other keys, for communicating information and command selections to processor 102, a cursor control device 107, such as a mouse, track-ball, cursor control keys, etc., for controlling a cursor and communicating information and command selections to the processor 102, a display device 108 for displaying data text input and output, a sound chip 109 for processing sound signals and information, a microphone/audio receiver 111 for receiving speech and audio signals and a
telecommunications port 110 for input and output of telecommunication and audio signals.
An embodiment of the present invention is implemented for use on some of the members of the family of Macintosh™ brand of computers, available from Apple Computer, Inc. of Cupertino, California.
Overview of the Speech Recognition System
The simplified version of the speech recognition system of the present invention is shown in Figure 2. Receiver 201, consisting of the microphone/audio receiver 111, receives the speech and transforms the received speech signals into a digital representation of the successive amplitudes of the audio signal created by the speech. Then receiver 201 converts that digital signal into a frequency domain signal consisting of a sequence of frames. Each of the frames depicts the amplitude of the speech signal in each of a plurality of frequency bands over a specific time interval (i.e., window). In one embodiment, the time windows are 10 milliseconds apart. It should be noted that the present invention can be used with any type of receiver and speech encoding scheme.
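The framing step above can be sketched as follows. The 16 kHz sample rate is an assumption made for illustration; the specification only fixes the 10 millisecond window spacing.

```python
# A hedged sketch of cutting a digitized signal into 10 ms frames.
SAMPLE_RATE = 16_000                  # samples per second (assumed)
FRAME_LEN = SAMPLE_RATE // 100        # 10 ms -> 160 samples per frame

def frames(samples):
    """Split a list of samples into consecutive 10 ms frames."""
    return [samples[i:i + FRAME_LEN]
            for i in range(0, len(samples) - FRAME_LEN + 1, FRAME_LEN)]

one_second = [0.0] * SAMPLE_RATE
print(len(frames(one_second)))        # -> 100 frames per second of audio
```

Each frame would then be transformed into per-band amplitudes (e.g., by a Fourier transform) before being matched against the acoustic models.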
Once the speech is converted, recognition engine 102 uses a recognition algorithm to compare the sequence of frames produced by the utterance with a sequence of nodes
contained in the acoustic model of each word in the active vocabulary, as defined by the grammar, to determine if a match exists. In the current embodiment of the invention, the recognition vocabulary contains over five thousand text words. The result of the recognition matching process is either a textual output or an action taken by the computer system which corresponds to the recognized word.
The recognition algorithm of one embodiment of the present invention uses probabilistic matching and dynamic programming. Probabilistic matching determines the likelihood that a given frame of an utterance corresponds to a given node in an acoustic model of a word. This likelihood is determined not only as a function of how closely the amplitude of the individual frequency bands of a frame match the expected frequencies contained in the given node models, but also as a function of how the deviation between the actual and expected amplitudes in each such frequency band compares to the expected deviations for such values. Dynamic programming provides a method to find an optimal, or near optimal, match between the sequence of frames produced by the utterance and the sequence of nodes contained in the model of the word. This is accomplished by expanding and contracting the duration of each node in the acoustic model of a word to compensate for the natural variations in the duration of speech sounds which occur in
different utterances of the same word. A score is computed for each time aligned match, based on the sum of the dissimilarity between the acoustic information in each frame and the acoustic model of the node against which it is time aligned. The words with the lowest sum of such distances are then selected as the best scoring words. In one embodiment, the present invention employs language model filtering. When language model filtering is used, a partial score reflecting the probability of each word occurring in the present language context is added to the score of that word before selecting the best scoring word so that words which the language model indicates are most probable in the current context are more likely to be selected.
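The selection rule above, summed per-frame dissimilarity plus a language model partial score, can be sketched as follows. The distance values, the partial scores, and the candidate words are all invented for illustration.

```python
# A hedged sketch: lower total score wins, and the language model
# partial score is added before the best-scoring word is selected.
def total_score(frame_distances, lm_partial):
    return sum(frame_distances) + lm_partial

CANDIDATES = {
    # word -> (per-frame acoustic distances, language model partial score)
    "find": ([0.2, 0.1, 0.3], 0.5),   # acoustically close, likely in context
    "fine": ([0.2, 0.2, 0.2], 1.5),   # acoustically close, unlikely in context
}
best = min(CANDIDATES, key=lambda w: total_score(*CANDIDATES[w]))
print(best)   # -> find
```

Here the two candidates are acoustically almost tied, and the language model partial score breaks the tie in favor of the word that is more probable in context.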
The acoustic model or, in other words, the speech recognition algorithm used in one embodiment of the present invention is the Hidden Markov Model (HMM) method. As is well-known to those skilled in the art, the HMM method evaluates each word in the active vocabulary by representing the acoustic model for each word as a hidden Markov process and by computing the probability of each word of generating the current acoustic token as a probabilistic function of that hidden Markov process. In one embodiment, the word scores are represented as the negative logarithms of probabilities, so all scores are non-negative, and a score of zero represents a probability of one, that is, a perfect score. It
should be noted that other terminal or word matching schemes can be utilized by the present invention.
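The negative-logarithm scoring convention noted above can be sketched directly: scores are non-negative, zero is a perfect score, and multiplying probabilities becomes adding scores.

```python
# A sketch of negative-log-probability word scores.
import math

def score(prob):
    """Negative log probability; lower is better."""
    return -math.log(prob)

print(score(1.0) == 0.0)             # -> True (perfect score)
print(score(0.5) < score(0.1))       # -> True (likelier words score lower)
# multiplying independent probabilities becomes adding scores
print(math.isclose(score(0.5 * 0.2), score(0.5) + score(0.2)))  # -> True
```

The additivity in the last line is why log-domain scores are convenient for dynamic programming searches, which accumulate scores along a path.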
The searching performed by the recognition engine of the present invention is undertaken in conjunction with a global finite state grammar and a collection of sub-finite state grammars. The global finite state grammar of the present invention is comprised of states (nodes) and transitions (arcs) in a network. Each transition in the network comprises either a word or a category constituting an index to one of the sub-finite state grammars. Allowing transitions to be indices into sub-finite state grammars potentially makes the global finite state grammar smaller in size, thereby requiring less memory. The reduction of memory size is further accentuated by the fact that each index to a sub-finite state grammar can be used repeatedly throughout the network, such that the need to repeat the same state to state transition at different places in the network is obviated. It should be noted that each word-designated arc within the global finite state grammar or within any of the sub-finite state grammars is associated with an acoustic model and, to that extent, represents the machinery employed by the present invention to match the received speech signals during the recognition process.
The sub-finite state grammars contain states and transitions in the same manner as those in the global finite state grammar. The transitions in the sub-finite state grammars are
capable of representing words or other indices for other sub-finite state grammars. In one embodiment, each of the transitions in the sub-finite state grammars is a word. Furthermore, each of the sub-finite state grammars is capable of calling itself.
An example of a recursive finite state grammar of the present invention is shown in Figures 3A-E. Referring to Figure 3A, the global finite state grammar is shown comprising seven nodes 301-307 coupled together by arcs 321-327. Node 301 represents the beginning of the global finite state grammar, and node 307 represents the end of the global finite state grammar. Arc 321 couples nodes 301 and 302 and is associated with the index to the sub-finite state grammar <locate> depicted in Figure 3B corresponding to the class (i.e., vocabulary) of location words consisting of "find" and "get". Arc 324 couples nodes 301 and 304 and is the word "mail". Arcs 322 and 325 couple each of the nodes 302 and 304 respectively to nodes 303 and 305 respectively with the index associated with the sub-finite state grammar <document> depicted in Figure 3C corresponding to the class of document types consisting of "paper" and "figure". Nodes 303 and 305 are coupled to node 306 via arcs 323 and 326 respectively. Arc 323 represents the word "from" and arc 326 represents the word "to". Node 306 is coupled to node 307 by arc 327, which represents the index to the sub-finite state grammar
<personal-name> depicted in Figure 3D corresponding to the class of personal names of individuals consisting of John, Mary, and NEW-WORD. Each of nodes 301-307 is also coupled to a respective self-looping arc 311-317. Each of arcs 311-317 is associated with the index to the noise words sub-finite state grammar <nv> represented in Figure 3E. It should be reiterated that the words, such as the word "mail" associated with arc 324, represent the acoustic models for the words. Referring to Figure 3B, the location words sub-finite state grammar <locate> is shown comprised of nodes 331 and 332 coupled by arc 333 representing the word "find" (i.e., the acoustic modeling machinery used for matching the speech input to the word "find") and by arc 334 representing the word "get" (i.e., the acoustic modeling machinery used for matching the speech input to the word "get"). The acoustic model for the word "find" is shown in Figure 4. Referring to Figure 4, the acoustic model is depicted as a series of nodes 401-405 each coupled by a phone arc. Node 401 is coupled to node 402 by arc 406 which is the acoustic phone /f/. Node 402 is coupled to node 403 by arc 407 which is the acoustic phone /ay/. Node 403 is coupled to node 404 by arc 408 which is the acoustic phone /n/. Node 404 is coupled to node 405 by arc 409 which is the acoustic phone /d/. It should be noted that all of the word designated arcs mentioned
throughout this specification correspond to acoustic models, such as that shown in Figure 4.
Referring to Figure 3C, the document type sub-finite state grammar <document> is depicted as comprising nodes 341-343 and arcs 344-346. Node 341 begins the sub-finite state grammar and is coupled to node 342 via arc 344 which corresponds to the word "the". Node 342 is coupled to node 343 by arc 345 representing the word "paper" and by arc 346 representing the word "figure". Referring to Figure 3D, the personal name sub-finite state grammar <personal-name> is shown comprising nodes 351-352 and arcs 353-355. Node 351 is coupled to node 352 by arc 353 representing the word "Mary", by arc 354 representing the word "John" and by arc 355 representing the word NEW-WORD. NEW-WORD represents an out-of-vocabulary word which was not in the original vocabulary category (e.g., personal names in this instance). By including a general acoustic model for NEW-WORD in the sub-finite state grammar, the recognition engine can generate an output indicating the presence of the out-of-vocabulary words.
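The example grammar of Figures 3A-3D can be sketched as data and expanded into its sentences. This is a hedged illustration; the <nv> self-loops and the NEW-WORD arc are omitted for brevity.

```python
# The Figure 3A-3D example as data: "T" arcs are terminals (words),
# "NT" arcs index a sub-finite state grammar.
GRAMMARS = {
    "<global>": {
        301: [("NT", "<locate>", 302), ("T", "mail", 304)],
        302: [("NT", "<document>", 303)],
        303: [("T", "from", 306)],
        304: [("NT", "<document>", 305)],
        305: [("T", "to", 306)],
        306: [("NT", "<personal-name>", 307)],
    },
    "<locate>": {331: [("T", "find", 332), ("T", "get", 332)]},
    "<document>": {341: [("T", "the", 342)],
                   342: [("T", "paper", 343), ("T", "figure", 343)]},
    "<personal-name>": {351: [("T", "John", 352), ("T", "Mary", 352)]},
}
STARTS = {"<global>": 301, "<locate>": 331, "<document>": 341,
          "<personal-name>": 351}
FINALS = {"<global>": 307, "<locate>": 332, "<document>": 343,
          "<personal-name>": 352}

def expand(name, state=None):
    """Yield the word tuples generated from `state` of grammar `name`."""
    state = STARTS[name] if state is None else state
    if state == FINALS[name]:
        yield ()
        return
    for kind, label, nxt in GRAMMARS[name][state]:
        for head in (expand(label) if kind == "NT" else [(label,)]):
            for tail in expand(name, nxt):
                yield head + tail

sents = sorted(" ".join(s) for s in expand("<global>"))
print(len(sents))   # -> 12 (2 locate verbs x 2 documents x 2 names, + mail)
```

The expansion yields sentences such as "find the paper from John" and "mail the figure to Mary", while each <document> sub-network is stored only once even though it is referenced from two arcs.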
The present invention allows for the incorporation of out-of-vocabulary (OOV) word detection capability for open-class grammar categories. An open-class grammar category is one in which one of the acoustic models correlates with a high probability to any spoken word. The open-class OOV
class network is represented as a sequence of all-phone sub-networks. A self loop at the last state allows for arbitrarily long words. Figure 5 illustrates an example of the all-phone sub-network for NEW-WORD. Referring to Figure 5, the example acoustic model for NEW-WORD comprises nodes 501-504 and arcs 505-509. Node 501 is the beginning and is coupled to node 502 via arc 505, which represents any phone in the NEW-WORD. Node 502 is coupled to node 503 via arc 506 which again represents a phone within the NEW-WORD. Node 503 is coupled to node 504 by arc 507 to end the acoustic model of NEW-WORD. Once again, arc 507 represents another phone within NEW-WORD. Arcs 508 and 509 are self-looping arcs which loop from and to nodes 502 and 503 respectively. These arcs also represent any phone in the acoustic model for NEW-WORD. Thus, the acoustic model for NEW-WORD comprises a multiplicity of phones. It should be noted that the acoustic model for NEW-WORD could contain any number of phones. The actual number of phones chosen, which indicates the minimum duration of the acoustic model, is a design choice that is typically made by the designer. The representation is hierarchical so that only a single instance of both the all-phone network, such as that depicted in Figure 5, and the OOV network is needed. Thus, the present invention reduces the amount of memory needed to compensate for OOV words.
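The Figure 5 topology can be sketched as a small reachability check: three chained "any phone" arcs fix the minimum duration, and the self-loops on the two interior nodes admit arbitrarily long words.

```python
# A hedged sketch of the NEW-WORD all-phone network of Figure 5.
# Every arc accepts any phone; only the node topology matters here.
ARCS = {501: [502], 502: [502, 503], 503: [503, 504]}

def can_match(n_phones, state=501):
    """Can exactly n_phones arcs be consumed from `state` to final node 504?"""
    if state == 504:
        return n_phones == 0
    return n_phones > 0 and any(can_match(n_phones - 1, s) for s in ARCS[state])

print([n for n in range(1, 7) if can_match(n)])   # -> [3, 4, 5, 6]
```

Words of one or two phones cannot reach the final node, which is exactly the minimum-duration constraint the chain of arcs imposes; anything of three phones or more can, thanks to the self-loops.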
In the present invention, a dictionary incorporates out-of-vocabulary words into the recognition engine. The dictionary contains non-verbal words, phone words, or both. The system designer has other accessible parameters, besides setting the minimum number of phones, by which the out-of-vocabulary detection can be controlled. A language weighting for open-class transitions in the grammar can also be chosen to control the ratio of false alarms (i.e., words recognized by the out-of-vocabulary detection when they are actually in the dictionary) versus detections. The language weighting is an adjustment to the probabilities of a language model, wherein less likely language models have a lower probability associated with them, such that they have a less likely opportunity for being chosen as the result of the recognition process. Similarly, a language weighting is chosen for each of the phone arcs in the all-phone sub-network to give additional control over false alarms/detections.
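The effect of the language weighting knob described above can be sketched as follows. All scores and the weight values are invented for illustration; scores are treated as negative log probabilities, so lower wins and the weight acts as a penalty on the open-class path.

```python
# A hedged sketch: a larger penalty on the open-class (NEW-WORD)
# transition makes OOV detections rarer, trading missed detections
# for fewer false alarms.
def best_hypothesis(in_vocab_score, oov_score, oov_weight):
    """Scores are negative log probabilities; the lower total wins."""
    return "OOV" if oov_score + oov_weight < in_vocab_score else "in-vocab"

print(best_hypothesis(5.0, 4.0, 0.5))   # -> OOV (small penalty, OOV path wins)
print(best_hypothesis(5.0, 4.0, 2.0))   # -> in-vocab (penalty suppresses OOV)
```

The same kind of per-arc weight on the all-phone sub-network's phone arcs gives finer-grained control over the same trade-off.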
Referring back to Figure 3E, the noise words sub-finite state grammar <nv> is shown comprised of nodes 361-362 and arcs 363-366. Node 361 is coupled to node 362 by arc 363 representing the acoustic machinery for the sound of a telephone ring, by arc 364 representing the acoustic machinery for the sound of a cough, by arc 365 representing the acoustic machinery for the sound of silence, and by arc
366 representing the acoustic machinery for the sound of a door slamming. It should be noted that the sub-finite state grammar <nv> is a non-verbal sub-finite state grammar (network) in that the recognition is not of a word, but is of a sound.
Figure 3E, in conjunction with Figure 3A, illustrates the advantageous manner in which non-verbal models are used in the present invention. In this case, the non-verbal model of noises, such as coughs, sneezes, etc., is represented in the present invention as a class or sub-network. By using sub-finite state grammars to implement different classes of noises which can occur during the recognition process, the size of the network can be reduced in comparison to the prior art monolithic finite state grammars, while experiencing insignificant overhead. The size of the network can be reduced because the entire class of noises does not have to be incorporated into the network at every node. Furthermore, the memory space required to store the non-verbal model of noises is reduced because the different classes of noises (i.e., the sub-finite state grammars) are only compiled when needed.
This is especially true when a large number of non-verbal models are used. These noise sub-finite state grammars, or categories of noises, can be located at any state in the recognition engine (i.e., at any node in the network) and are the same as any other sub-finite state grammar. These non-verbal networks are implemented using a self-looping mechanism, such that the beginning and ending of the arcs corresponding to the non-verbal network are at the same location. Thus, the present invention allows for the use of non-verbal networks which can be located freely about the network with little hindrance on performance.
The networks depicted in Figures 3A-E are implemented in memory using pointers, in the same manner as prior art monolithic finite state grammars, as is well-known in the art. It should be noted that the relationship between the global finite state grammar and the sub-finite state grammars of the present invention is hierarchical in nature.
Figures 3A-E represent the static depictions of an example of a recursive finite state grammar of the present invention. To utilize these static representations, i.e., make them dynamic, the grammars, both global and sub-finite, must be compiled. In prior art recognition engines, although some grammars appear to be hierarchical, their hierarchical nature is lost upon being compiled. The present invention retains its hierarchical structure during the recognition process because each of the sub-finite state grammars and the global finite state grammar is compiled individually. The sub-finite state grammars are only accessed when needed. Hence, any required memory allocation can be deferred until the time when such an access is needed, such that the recognition
engine arrives at a solution by piecing the grammars together. If no access is required, no memory allocation is made. In this manner, the present invention saves memory. Moreover, by allowing the sub-finite state grammars to be compiled individually, any changes that take the form of additions and deletions to individual sub-finite state grammars can be made to the recognition engine without having to modify, and subsequently recompile, the global finite state network. Therefore, the global finite state grammar does not have to be recompiled every time a change occurs in the recognition engine. Thus, the present invention comprises a very flexible run-time recognition engine.
Once the global finite state grammar and the individual sub-finite state grammars are compiled, the recognition engine can begin the recognition process. The recognition process is typically a matching process, wherein the acoustic models are matched with the speech input signal. However, in the present invention, where non-terminals in the global finite state grammar (or any sub-finite state grammar as well) are encountered by the recognition engine, the recognition engine must be able to identify that the transition involves an index to a sub-network. In other words, the recognition engine is not just seeing a terminal. Instead, the recognition engine is seeing a generic category or class. Therefore, the present invention must be able to compensate for the presence of the
non-terminals in the networks. To utilize the recursive finite state grammars of the present invention in the recognition process, a stack system is created in the memory of the computer system and is utilized in performing the recognition process.
At run time, all of the first phones of the acoustic models (machineries) that correspond to the transitions from the first node of the network are pushed onto the stack. For example, in the case of the word "find" in Figure 4, the phone /f/ would be pushed onto the stack. The models are pushed on the stack in their order of appearance in the network. As the recognition process continues, subsequent phones of the acoustic models corresponding to the current transitions being evaluated in the network are placed on the stack, while some of the previous phones may be removed. Thus, the stack is capable of growing and shrinking. Note that each path through the network represents one possible theory as to what the acoustic input signals are. As the recognition process continues, certain theories become less likely. In this case, portions of the acoustic models associated with these less likely theories may be removed from the stack.
As the recognition engine traverses through the network (i.e., by traversing through the serial stack), the recognition engine encounters both terminals (e.g., words, phones, etc. which have acoustic models) and non-terminals
(i.e., indices to sub-finite state grammars). The terminals are expanded and the associated acoustic models (e.g., HMM) are pushed on the stack. Thus, there is a stack of active machineries (e.g., HMMs) as the search is performed. It should be reiterated that the pushing of terminals and non-terminals onto the stack is performed at run-time on an as-needed basis. Hence, the entire network does not have to occupy memory space, such that the present invention produces a large memory savings. When a non-terminal is reached in the search, the recognition engine must obtain the sub-finite state grammar (i.e., the sub-network) and employ it in the recognition process. In the currently preferred embodiment, a pointer directs the recognition engine to the non-terminal sub-network. The recognition engine creates a dynamic version of the sub-network and pushes the dynamic version onto the stack. The dynamic version is a copy of the sub-finite state grammar. A copy is made because the particular sub-network may appear at more than one location in the hierarchical topology, such that the recognition engine is able to keep track of all the different theories or instances of use. Each theory or model has a history consisting of a sequence of words. Thus, each occurrence of a sub-network in a network is associated with its own history, such that the probability of that occurrence of the sub-network is uniquely identified in the network (or sub-network). In one embodiment, the history is only the identity of the last predecessor. A score associated with a particular theory is a percentage indicative of the probability that the current word follows the predecessor. The dynamic version comprises the topology of the network and also includes the information needed for the recognition engine to generate its result (i.e., its identity, its history and its scores associated with the sub-network). The actual sub-finite state grammar is not pushed onto the stack because it may appear and, thus, be needed at other instances within the global network. Thus, as different parts of the global finite state grammar are traversed and non-terminals get expanded into terminals and non-terminals, the acoustic models of the terminals are placed on the stack. The recognition engine uses the acoustic models in the recognition process in the same manner as prior art finite state grammar recognition systems, which is well-known in the art.
When each class or category which indexes a sub-network is pushed onto the stack, a mechanism exists by which the sub-network can be traversed. In one embodiment, the sub-network may be popped off the stack. In the preferred embodiment, when a sub-network is pushed onto the stack, information corresponding to its termination or ending state is pushed onto the stack with it. In other words, this information
is pushed onto the stack which identifies the termination or ending state of the current sub-network as the location of the next node in the network which called the current sub-network. Once the recognition engine completes traversing the particular sub-network, then using the pointer to the next location, the recognition engine knows where to transition to by referring to the ending or termination state. Therefore, in the preferred embodiment, there is no need to provide the functionality necessary to pop items off the stack. It should be noted that the self-looping mechanism described earlier employs this feature. By having the ending state be the same as the beginning state, the transition which occurs is able to loop to itself.
As the word machineries are on the stack, the recognition engine performs the searching. Based on the likelihood of the theories, the recognition engine continues onto the next machinery or machineries. The stack grows and shrinks as the theories survive (are above a threshold probability) or die (fall below a threshold of probability). Once all of the machineries have been evaluated, signified by the stack being empty, the most probable theory is produced as textual output or as an action taken by the computer (e.g., opening a folder, etc.). In the case of text, the textual output represents the recognized speech.
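The grow-and-shrink behavior described above can be sketched as a prune step: theories whose probability falls below a threshold die and are dropped. The theories and numbers below are invented for illustration.

```python
# A hedged sketch of pruning the active theories on the stack.
def prune(theories, threshold=0.01):
    """Keep only the theories whose probability meets the threshold."""
    return {words: p for words, p in theories.items() if p >= threshold}

theories = {
    ("find", "the"): 0.40,    # strong theory, survives
    ("fine", "the"): 0.02,    # weak but above threshold, survives
    ("vine",):       0.001,   # below threshold, dies
}
survivors = prune(theories)
print(sorted(" ".join(w) for w in survivors))   # -> ['find the', 'fine the']
```

When every machinery has been evaluated and only one theory remains, that theory's word history is what the engine emits as the recognized text.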
Whereas many alterations and modifications of the present invention will no doubt become apparent to a person of ordinary skill in the art after having read the foregoing description, it is to be understood that the particular embodiment shown and described by way of illustration is in no way intended to be considered limiting. Therefore, reference to the details of the preferred embodiments is not intended to limit the scope of the claims, which themselves recite only those features regarded as essential to the invention.
Thus, a speech recognition system using recursive finite state grammars has been described.