EP1428205A1 - Grammars for speech recognition - Google Patents

Grammars for speech recognition

Info

Publication number
EP1428205A1
Authority
EP
European Patent Office
Prior art keywords
constituents
subset
grammar
references
representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP02782503A
Other languages
English (en)
French (fr)
Inventor
Johan Schalkwyk
Michael S. Phillips
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SpeechWorks International Inc
Original Assignee
SpeechWorks International Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US09/906,575 (published as US20030009331A1)
Application filed by SpeechWorks International Inc filed Critical SpeechWorks International Inc
Publication of EP1428205A1
Current legal status: Withdrawn

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/18: Speech classification or search using natural language modelling
    • G10L15/183: Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/19: Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L15/193: Formal grammars, e.g. finite state automata, context free grammars or word networks

Definitions

  • This invention relates to grammars for speech recognition.
  • phrase-structured grammar such as a context-free grammar.
  • a number of rewrite rules are specified.
  • Each rewrite rule has a "left-hand side," which identifies a non-terminal symbol for which the rule specifies an allowable expansion, and a "right-hand side," which specifies the allowable expansion as a sequence of one or more elements.
  • the elements of the expansion can be non-terminal symbols, which can be expanded according to one or more rules of the grammar, or can be terminal symbols, in this case words in the lexicon of the speech recognizer.
  • One non-terminal symbol is identified as the "top level" symbol, which is associated with the complete set of valid sequences of terminal symbols in the language defined by the grammar.
  • a well-known syntax for such grammars is the Backus-Naur Form (BNF).
  • Various other syntaxes can also be used, for example, allowing optional and alternative sub-sequences of elements on the right-hand side of a rule.
  • One such extended syntax is Extended BNF (EBNF).
  • Non-terminals in the grammar may be associated with semantically meaningful constituents, and identification of the words that are associated with those constituents aids in the interpretation of what was meant by the utterance.
  • a non-terminal may be associated with the structure of a date (e.g., "May twenty third").
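  • As an illustrative BNF sketch (these rules are invented for this description, not taken from the patent), such a date constituent could be specified with rewrite rules such as:

        <date>  ::= <month> <day>
        <month> ::= January | February | May | June | July
        <day>   ::= first | second | third | twenty third

    Here <date>, <month>, and <day> are non-terminal symbols, and the words on the right-hand sides are terminal symbols of the lexicon.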
  • the nested structure of the non-terminals can be identified in a word sequence that is hypothesized by a speech recognizer using one of a number of parsing algorithms.
  • the output of such parsing algorithms can be a nested bracketing and labeling of the constituents of the word sequence.
  • Such a bracketing can equivalently be represented as a parse tree, in which the leaves are associated with words and the interior nodes of the tree are each associated with non-terminals.
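  • For example, under the illustrative date rules sketched above, a parse of "May twenty third" could be output as the nested bracketing [DATE [MONTH May ] [DAY twenty third ] ]; the equivalent parse tree has the date non-terminal at the root, the month and day non-terminals as interior nodes, and the words as leaves.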
  • In automated speech recognition, one use of a word-level grammar is to constrain the recognizer to hypothesize only word sequences that fall within the language specified by the grammar. Advantages of constraining the recognizer in this way include increased accuracy, assuming that the speaker truly uttered a word sequence in the grammar. Furthermore, by avoiding consideration of the combinatorially large number of word sequences made up from words in the lexicon, the total amount of computation is reduced.
  • One approach to using a word-level grammar in automatic speech recognition is to represent the grammar as a finite-state machine, which is represented as a graph. The graph has a starting node and an ending node. Arcs are labeled with words.
  • If certain constraints on the rules are satisfied, a context-free grammar can be represented exactly as a finite-state machine. If the constraints are not satisfied, a finite-state machine can only approximate the context-free grammar.
  • an automatic speech recognizer searches for a path through the graph that best represents the input utterance. Typical high-accuracy speech recognizers represent words in terms of sequences of sub-word units, and perform recognition of utterances in terms of these sub-word units.
  • a typically used sub-word unit is based on a phone.
  • each word is represented as one or more sequences of phonemes.
  • the alternative sequences for a word correspond to different pronunciations of the word.
  • One representation of these alternative phoneme sequences is as a graph in which any path from the starting node to the ending node represents an allowable pronunciation.
  • one approach to incorporating the word-level grammar constraint as well as the phonetically based pronunciation constraint is to form a single combined finite state machine in which each word-arc of the word-level graph is replaced with the phoneme-level graph for the word on that arc.
  • the speech recognizer searches for a path through the single phoneme-level graph that best represents the input utterance. The speech recognizer then hypothesizes the word sequence associated with that phoneme-level path.
  • A finite-state transducer is like a finite-state machine in that arcs are labeled with symbols, such as phonemes, that are "accepted" by the transducer. That is, an allowable phoneme sequence corresponds to the sequence of accepted symbols on arcs on a path from the start node to the end node. Each arc, in addition to having an accepted, or input, symbol, has an output symbol. In the case of a finite-state transducer that accepts phonemes and produces words, the input symbols are phonemes and the output symbols are words. Output symbols can also be null.
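  • The following Python sketch illustrates the transducer idea just described (a minimal sketch; the function names and the toy arc set are illustrative, not from the patent):

        from typing import Dict, List, Optional, Tuple

        # arcs[state] -> list of (input symbol, output symbol or None, next state)
        Arcs = Dict[int, List[Tuple[str, Optional[str], int]]]

        def transduce(arcs: Arcs, start: int, final: int,
                      inputs: List[str]) -> Optional[List[str]]:
            """Accept the input symbols along a path from start to final,
            collecting the non-null output symbols emitted on the way."""
            def walk(state: int, pos: int, out: List[str]) -> Optional[List[str]]:
                if pos == len(inputs) and state == final:
                    return out
                for sym, emit, nxt in arcs.get(state, []):
                    if pos < len(inputs) and sym == inputs[pos]:
                        found = walk(nxt, pos + 1, out + ([emit] if emit else []))
                        if found is not None:
                            return found
                return None
            return walk(start, 0, [])

        # A transducer that accepts the phonemes of "call" and outputs the word
        # on the first arc, with null outputs on the remaining arcs.
        arcs: Arcs = {0: [("k", "call", 1)], 1: [("aa", None, 2)], 2: [("l", None, 3)]}
        print(transduce(arcs, 0, 3, ["k", "aa", "l"]))   # ['call']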
  • phonetically-based speech recognizers associate phoneme labels with observations in the input utterance in such a way that the characteristics of the observations associated with a phoneme depend not only on the label of that phoneme, but also on the context of preceding and following phonemes in the hypothesized phoneme sequence.
  • Such a recognizer is referred to as using "context-dependent" phonemes.
  • the parameters of the model may depend on the label of that phoneme, as well as the label of the preceding and the following phoneme in a particular hypothesized phoneme sequence, in what is referred to as "triphone" modeling. Note however that in recognition, the following phoneme is not yet known while recognizing a current phoneme.
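  • As a hedged sketch of how such context labels can be derived (illustrative only; the boundary markers are an assumed convention), a phoneme sequence can be rewritten into per-phone contexts, using the 'a_c' notation adopted later in this description for a phone preceded by 'a' and followed by 'c':

        def triphone_contexts(phones):
            """Pair each phone with its 'left_right' context; '<s>' and '</s>'
            mark the utterance boundaries (an assumed convention)."""
            padded = ["<s>"] + phones + ["</s>"]
            return [(p, f"{padded[i]}_{padded[i + 2]}") for i, p in enumerate(phones)]

        print(triphone_contexts(["k", "aa", "l"]))
        # [('k', '<s>_aa'), ('aa', 'k_l'), ('l', 'aa_</s>')]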
  • In cross-word context modeling, the last phoneme of a word can affect the characteristics of the first phoneme of the next word, and conversely, the first phoneme of the second word can affect the characteristics of the last phoneme of the first word.
  • cross-word context modeling provides higher accuracy than context modeling that does not take into account the dependency between words.
  • One approach to introducing context-dependent models is to form a finite-state transducer, in which inputs of arcs are labeled according to the phoneme as well as the context for that phoneme. Any path from the starting node to the ending node is associated with an allowable phoneme sequence as in the case of a simple phoneme- based graph. Furthermore, the sequence of contexts of each of the phonemes along any path are consistent with the underlying sequence of phonemes.
  • A method of forming a context-dependent phoneme graph from a simple phoneme graph is described in M. Riley, F. Pereira, and M. Mohri, "Transducer Composition for Context-Dependent Network Expansion," in Proceedings of the 5th European Conference on Speech Communication and Technology (Eurospeech '97), 1997.
  • the process of forming a runtime grammar begins with a developer 110 specifying a context-free grammar 120, which includes a number of rules 122.
  • a grammar compiler 130 is applied to rules 122 to form a finite-state machine grammar 140, such as a context-dependent phoneme based finite state transducer.
  • FSM grammar 140 is then used at the time of recognition of an utterance by a speech recognizer 150.
  • An alternative to pre-expansion of a word-level grammar to form a finite-state machine or a finite-state transducer prior to recognizing an utterance is dynamic expansion of the grammar during the recognition process.
  • the process of constructing a word-level graph is deferred until a particular non-terminal is encountered during recognition, and the word and phoneme level graph is constructed "on-the-fly.”
  • A number of examples of such dynamic expansion are described in M. K. Brown and S. C. Glinski, "Context-Free Large-Vocabulary Connected Speech Recognition with Evolutional Grammars," in Proceedings of ICASSP 1994.
  • In one such approach, a recursive transition network (RTN) is formed.
  • the RTN includes a separate finite-state machine for each non-terminal that is on the left hand side of a grammar rule, and the paths through the graph associated with that non-terminal correspond to the possible sequences of elements (terminals and non-terminals) on the right-hand side of rules for that non-terminal.
  • When a non-terminal is encountered on an arc, what is essentially a recursive "call" to the finite state machine for that non-terminal is made.
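  • A minimal Python sketch of this recursive "call" behavior follows (illustrative only; the toy grammar and helper names are not from the patent). Each non-terminal owns its own finite-state machine, and a non-terminal arc is followed by recursively running the machine for that non-terminal:

        def accept(networks, nt, tokens, pos):
            """Return the set of token positions reachable after matching
            non-terminal nt starting at position pos."""
            start, finals, arcs = networks[nt]
            reached = set()
            def walk(state, p):
                if state in finals:
                    reached.add(p)
                for kind, label, nxt in arcs.get(state, []):
                    if kind == "word" and p < len(tokens) and tokens[p] == label:
                        walk(nxt, p + 1)
                    elif kind == "nt":                 # recursive "call"
                        for q in accept(networks, label, tokens, p):
                            walk(nxt, q)
            walk(start, pos)
            return reached

        # Each entry: non-terminal -> (start node, final nodes, arcs).
        # This toy grammar is non-recursive; a production RTN would need
        # cycle handling for recursive grammars.
        networks = {
            "$ROOT": (0, {2}, {0: [("word", "call", 1)], 1: [("nt", "$WHO", 2)]}),
            "$WHO":  (0, {2}, {0: [("word", "speech", 1)], 1: [("word", "works", 2)]}),
        }
        tokens = ["call", "speech", "works"]
        print(len(tokens) in accept(networks, "$ROOT", tokens, 0))   # True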
  • a parsing procedure is integrated into the runtime recognizer which makes use of the phrase-structure grammar to predict allowable next words based on partial word hypotheses.
  • Such alternative approaches are illustrated in FIG. 2.
  • a developer 110 specifies a CFG grammar 120.
  • a speech recognizer 250 directly processes the phrase-structured form of the grammar.
  • the invention provides a way of combining aspects of pre-computation of context-dependent phoneme graphs as well as dynamic processing of grammar constraints to provide a configurable tradeoff between data size and recognition-time computation. This tradeoff can be obtained without sacrificing recognition accuracy, and in particular, allows full modeling of all cross-word phoneme contexts.
  • The invention is a method for speech recognition.
  • a specification of a grammar is first accepted. This specification includes specifications of a number of constituents of the grammar. Each specification of one of the constituents defines sequences of elements associated with that constituent, where these sequences of elements include words and references to the constituents of the grammar.
  • A first subset of the constituents of the grammar is selected, and the remaining constituents form a second subset.
  • the method first includes processing the specification of the constituent to form a first processed representation that defines sequences of elements that are associated with that constituent and that includes words and references to constituents in the first subset. Forming the first processed representation of each constituent includes expanding references to constituents in the second subset according to the specifications of those constituents, and retaining references to constituents in the first subset without expanding said references. For each of the constituents in the first subset, the method further includes processing the first processed representation to form a second processed representation that defines sequences of elements that include subword units and references to constituents in the first subset. Configuration data that includes the second processed representation of each of the constituents in the first subset is then stored.
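  • The following Python sketch illustrates this partial-expansion idea (a sketch under simplifying assumptions: flat sequences stand in for graph representations, and the rule set is invented for illustration). References to constituents in the first, "kept" subset survive as symbols, while constituents in the second subset are expanded inline:

        def expand(rules, seq, kept):
            """Expand a right-hand-side sequence, inlining every non-terminal
            that is not in the kept subset."""
            out = [[]]
            for el in seq:
                if el.startswith("$") and el not in kept:
                    alts = [tail for rhs in rules[el] for tail in expand(rules, rhs, kept)]
                    out = [prefix + tail for prefix in out for tail in alts]
                else:
                    out = [prefix + [el] for prefix in out]   # word or kept reference
            return out

        rules = {"$ROOT": [["fly", "from", "$CITY", "on", "$DATE"]],
                 "$DATE": [["may", "first"], ["may", "second"]],
                 "$CITY": [["albuquerque"], ["wilmington"]]}
        for seq in expand(rules, rules["$ROOT"][0], kept={"$CITY"}):
            print(" ".join(seq))
        # fly from $CITY on may first
        # fly from $CITY on may second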
  • the method can include one or more of the following features.
  • the specification of the grammar includes a specification of a phrase-structure grammar, and the specification of each of the constituents includes a rewrite rule that specifies allowable substitutions of references to the constituent to include the sequences of elements associated with that constituent.
  • the specification of the phrase-structure grammar includes a context-free grammar.
  • the specification of the grammar is in Backus Naur Form (BNF).
  • Selecting the first subset of the constituents includes selecting those constituents according to static characteristics of the grammar.
  • Selecting the constituents according to runtime characteristics of a speech recognizer using the grammar. Selecting the constituents according to an expected processing time associated with the selection of those constituents.
  • Forming the second processed representation of each constituent includes expanding words in terms of subword units.
  • Expanding the words in terms of context-dependent subword units such that the expansion of at least some of the words depends on context in preceding or following words in the sequences of elements defined by the first processed representations of the constituent.
  • Expanding words adjacent to references to constituents in the first subset includes determining multiple possible expansions of those words according to the context of the referenced constituents. Determining multiple possible expansions of said words in terms of subword units includes limiting said multiple expansions according to context within the second processed representation.
  • Computing the second processed representation of each constituent includes forming a graph representation of that constituent. Paths through that graph are associated with sequences of elements that include context-dependent subword units and references to constituents in the first subset of constituents.
  • The first processed representation of each of the constituents in the first subset includes a first FST representation of that constituent, and processing the first processed representation of each of the constituents in the first subset to form the second processed representation of that constituent includes applying a composition operation to the first FST representation of that constituent to form the second FST representation of that constituent.
  • the method further includes accessing the stored configuration data by a speech recognizer, and automatically processing an utterance according to the configuration data. Only some of the second processed representations of the constituents in the first subset are selectively accessed by the speech recognizer according to content of the utterance being processed.
  • the invention includes one or more of the following advantages.
  • Selecting only a subset of the constituents of the grammar to be retained by reference in the processed forms of the constituents requires less runtime computation than an approach in which all references to constituents are retained.
  • the size of the configuration data can be controlled.
  • the size of the configuration data affects not only the size of that data on a static storage device, but can also affect the amount of dynamic memory needed to execute a speech recognizer using that configuration data.
  • A tradeoff is possible between selecting a large subset of constituents whose references are not expanded, thereby yielding relatively small configuration data, and selecting a small subset of constituents, which yields relatively less computation at runtime.
  • cross-word subword unit modeling is maintained at the boundaries of those constituents, thereby avoiding loss in speech recognition accuracy as compared to approaches in which cross-word context is not considered at such boundaries.
  • FIG. 1 is a block diagram that illustrates a prior art approach in which a context-free grammar (CFG) is fully expanded into a finite-state machine (FSM) grammar at configuration time, and the FSM grammar is processed at recognition time by a speech recognizer;
  • FIG. 2 is a block diagram that illustrates a prior art approach in which a context-free grammar is processed directly by a speech recognizer at recognition time without forming a finite-state machine prior to recognition;
  • FIG. 3 is a block diagram that illustrates an approach to processing and using a grammar according to the present invention;
  • FIGS. 4a-c are diagrams that illustrate a simple context-free grammar, a fully-expanded finite-state transducer, and an input sequence and corresponding output sequence, respectively;
  • FIGS. 5a-b are diagrams that illustrate two finite state machines;
  • FIGS. 6a-b are diagrams that illustrate phone-level transducers;
  • FIG. 7a is a diagram that illustrates a phone-based transducer;
  • FIG. 7b is a diagram that illustrates a corresponding context-model-based transducer;
  • FIG. 8a is a diagram that illustrates a portion of the phone-based transducer for the word-based finite state machine shown in FIG. 5a;
  • FIG. 8b is a diagram that illustrates a corresponding context-model-based transducer;
  • FIG. 9a is a diagram that illustrates a portion of the phone-based transducer for the word-based finite state machine shown in FIG. 5b; and
  • FIG. 9b is a diagram that illustrates a corresponding context-model-based transducer.
  • a developer 110 specifies a context-free grammar (CFG) 120.
  • CFG 120 includes a number of rules 122, each of which has a left-hand side, which is a non-terminal symbol, and a right-hand side, which specifies allowable rewrites of the non-terminal symbol in terms of elements, each of which is a non-terminal symbol or a terminal symbol.
  • the terminal symbols of CFG 120 are words.
  • CFG 120 specifies the set of word sequences (the language) that can be hypothesized during recognition of an utterance. Typically, a speaker is expected to speak an utterance that falls within the specified language.
  • Developer 110 specifies CFG 120 using a text editing software tool.
  • Alternatively, various types of software systems are used, for example, systems that support creation of CFG 120 using a graphical interface, or that provide aids to specification and verification of the grammar.
  • a grammar compiler 330 processes CFG 120 to produce data that is used by a speech recognizer 350 at the time an utterance is recognized.
  • This data includes a compiled grammar 340, which is similar to a recursive transition network (RTN).
  • Compiled grammar 340 includes a number of separate finite-state transducers (FSTs) 342.
  • Each FST 342 is associated with a different non-terminal symbol that was defined in CFG 120.
  • Each FST 342 includes a graph with arcs that are each labeled with an input symbol and an output symbol.
  • the input symbols include labels of subword units, in particular, labels of context-dependent phones.
  • The input symbols can also include labels of non-terminals that are associated with others of the FSTs 342.
  • The output symbols of the FST can include markers that are used to construct parses of the output word sequences according to CFG 120 without requiring re-parsing of the word sequence after speech recognizer 350 has completed processing of an input utterance.
  • The output symbols can include markers that are used to identify procedures that are to be executed when particular elements are present in a word sequence produced by the speech recognizer.
  • Each FST 342 represents an expansion of a number of CFG rules 122, and at least some of the FSTs 342 are not fully expanded, resulting in arcs in those FSTs that are labeled with non-terminals rather than phone models.
  • Grammar compiler 330 determines the nature of this "partial" expansion of CFG 120 into the FSTs 342 based on input from a developer 312 (who can be but is not necessarily the same person as developer 110) as well as on an automated analysis of CFG 120 by an automated tool, grammar analyzer 335. As is described more fully below, information provided to grammar compiler 330 by grammar analyzer 335 and developer 312 determines which non-terminals are associated with separate FSTs 342, and which instances of those non-terminals are left unexpanded, appearing as input symbols on arcs of others of the FSTs 342.
  • grammar compiler 330 optionally makes use of a predefined CFG library 324 that includes a number of predefined CFG rules.
  • CFG rules 122 may include non-terminal elements in their right-hand sides that are not defined by any other of rules 122.
  • CFG library 324 may provide the needed rules.
  • speech recognizer 350 can optionally make use of a predefined or dynamically modified FST library 344.
  • A non-terminal element in a right-hand side of a CFG rule 122 may be defined neither by any of CFG rules 122 nor by a rule in CFG library 324.
  • such an element may be specified by an FST that is in FST library 344.
  • FST library 344 includes a predefined set of FSTs that speech recognizer 350 can make use of at recognition time.
  • developer 110 in specifying CFG 120 specifies a number of separate CFG rules 122.
  • the left-hand side of each rule specifies a nonterminal that can be expanded using the rule.
  • the right-hand side is specified according to an extended Backus-Naur Form (EBNF), in which alternative and optional elements or sequences of elements are allowed.
  • the non-terminal symbols start with the dollar sign character, '$', while terminal symbols (words) are written without any delimiters.
  • Alternative sequences are delimited by a vertical bar, '|'.
  • the top-level non-terminal, whose expansion defines the language accepted by the grammar, is denoted by $ROOT.
  • Consider an example CFG 120 that accepts sentences such as "I would like to fly from Albuquerque to Wilmington on the fourth of July."
  • a grammar that specifies the ways in which a speaker might phrase such a request would be significantly more complicated if it were to capture many more possible variations.
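  • A small grammar in this notation, sketched here for illustration (the patent's actual rules are not reproduced, and the '=', ';', and optional-bracket syntax are assumptions of this sketch), might read:

        $ROOT  = i would like to fly from $CITY to $CITY [on $DATE];
        $CITY  = albuquerque | wilmington;
        $DATE  = the $DAY of $MONTH;
        $MONTH = june | july;
        $DAY   = first | fourth;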
  • rules are restricted to preclude any recursion.
  • Recursion is the situation in which a rule that defines a non-terminal includes that non-terminal as an element on the right-hand side of that rule, or in the expansion of the right-hand side by recursive expansion of the non-terminal elements.
  • restricted forms of recursion are allowed in other examples, and approaches to handling such recursions are noted below.
  • the developer can annotate any element with a name of a procedure that is to be executed when that element is present in a hypothesized output.
  • For example, the non-terminal $CITY can be annotated with the procedure name from_city, resulting in the function from_city( ) being executed when that element is hypothesized by the speech recognizer in an utterance, with the argument to the function corresponding to the output subsequence associated with that element.
  • Grammar compiler 330 processes CFG 120 to produce FST 342 that are used at recognition time by speech recognizer 350.
  • Each FST 342 is associated with a different non-terminal defined by CFG rules 122, but not every non-terminal used in CFG 120 is, in general, associated with its own FST 342.
  • Each FST 342 includes a graph with arcs between nodes that are each labeled with an input symbol and an output symbol. One node is identified as the starting node and one or more nodes are identified as ending nodes for each of the FST.
  • each arc of an FST is labeled 'a:b' to denote an input symbol 'a' and an output symbol 'b' for that arc.
  • Input symbols include:
  • Model labels: each model label identifies a particular context-dependent phonetic model that is expected by the speech recognizer. In general, each model is associated with a particular phone, and identifies one of a number of enumerated context-based variants of that phone.
  • Non-terminal labels: in addition to model labels, arcs in FST 342 can specify input symbols that are non-terminals. Each non-terminal that appears as a label in an FST 342 is associated with one of the FSTs 342.
  • Null labels: arcs in FST 342 can also be null, denoted by 'ε', 'eps', or 'epsilon'. Such arcs can be traversed during recognition without matching (consuming) any inputs.
  • Output symbols include the following. Word labels: along any path in FST 342 from the starting node to an ending node, the sequence of input model labels corresponds to a pronunciation of a sequence of words that falls within the language defined by CFG 120. Each of a subset of the arcs on that path has an output word label that identifies that word sequence.
  • Procedure labels: in addition to outputs that bracket constituents, output procedure labels identify procedures that are to be executed if a path including that arc is hypothesized by the speech recognizer.
  • FIG. 4a shows CFG 120 with two non-terminals, $ROOT and $WHO.
  • The language defined by this grammar consists of the two possible sentences "call speech works" and "call speech works please."
  • FIG. 4b shows a corresponding FST 342 assuming that it is fully expanded such that there are no non-terminals on its arcs.
  • The input symbols in this illustration correspond simply to phones.
  • the starting node is node 0 (by convention) and the ending node is node 17, which has no arcs leaving it.
  • The 0→1 arc is labeled 'k:call'.
  • The input symbol for that arc is the phone 'k' and the output is the word 'call'.
  • Arcs 1→2 and 2→3 correspond to the phones 'aa' and 'l' and have null outputs. Therefore, from the starting node, an input 'k, aa, l' produces the outputs 'call, ε, ε'. With the null outputs removed, this produces the partial sentence "call."
  • Arcs 3→4 and 12→13 relate to bracketing of the constituent '$WHO'.
  • Arc 3→4 has a null input and an open bracket output '<'.
  • Arc 12→13 has a null input and a labeled closing bracket output '$WHO>'. A path from node 3 to node 13 can be identified with the constituent $WHO by matching these brackets.
  • The arcs from node 4 to node 8 correspond to the word "speech." Note that there are two alternative pronunciations, resulting in two alternative arcs from node 6 to node 7, one with input 'ey' and one with input 'iy'.
  • The arcs from node 8 to node 12 correspond to the word "works," while the arcs from node 13 to node 17 correspond to the word "please." Note that since "please" is optional in the grammar, an arc with a null input and a null output joins node 13 to node 17. Referring to FIG. 4c, an input 'k, aa, l, ..., iy, z' is matched to a path from starting node 0 to the ending node 17, producing the output symbols (after null removal) 'call, <, speech, works, $WHO>, please'.
  • The recognized sentence is "call speech works please" and the sub-sequence of words corresponding to the $WHO constituent is identified as "speech works."
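  • A short Python sketch of this bracket matching (illustrative; the marker spellings follow the reconstruction above) recovers both the word sequence and the $WHO constituent from the output symbols:

        output = ["call", "<", "speech", "works", "$WHO>", "please"]

        words, spans, stack = [], {}, []
        for sym in output:
            if sym == "<":                                    # open bracket
                stack.append(len(words))
            elif sym.startswith("$") and sym.endswith(">"):   # labeled close bracket
                spans[sym[:-1]] = (stack.pop(), len(words))
            else:
                words.append(sym)

        print(" ".join(words))               # call speech works please
        start, end = spans["$WHO"]
        print(" ".join(words[start:end]))    # speech works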
  • Grammar compiler 330 produces compiled grammar 340 in a sequence of steps. First, a word-level FSM is constructed for each non-terminal that is to be retained.
  • The arcs are labeled with words, word-level markers such as constituent brackets, and non-terminal symbols for non-terminals that are not expanded.
  • Grammar compiler 330 accepts information from grammar analyzer 335 and developer 312 that identifies the non-terminals that are not to be expanded within other constituents and which will be associated with separate FSTs 342.
  • A description of grammar analyzer 335, which automatically or semi-automatically identifies those non-terminals, is deferred to later in the description below. Referring to the example introduced above, which accepts sentences such as "I want to fly from Albuquerque to Wilmington," FIGS. 5a-b illustrate corresponding word-level FSMs assuming that the non-terminal $CITY has been identified as not to be expanded, while non-terminals $FLY, $DATE, $MONTH and $DAY are expanded.
  • FIG. 5a illustrates the FSM for $ROOT and FIG. 5b illustrates the FSM for $CITY.
  • a sub-graph 510 corresponds to the complete expansion of the nonterminal $DATE, while the arcs from node 8 to 9 and from node 12 to 13 are labeled with the non-terminal $CITY, which is not expanded. Note that if $CITY had been expanded, two instances of the expansion shown in FIG. 5b would be present in the FSM for $ROOT. In practice, many more instances of such a relatively large subgraph could be present, for example, if $ROOT included the alternatives 'from $CITY to $CITY' and 'to $CITY from $CITY', there would be four instances.
  • Although $DATE expands to a relatively large sub-graph, it occurs only once in the expansion of $ROOT.
  • The word-level FSMs can be processed in any way that preserves the language of symbols they accept. For example, each word-level graph can be "determinized" to ensure that no node has more than one succeeding non-null arc with the same input label. Other procedures, such as introduction and removal of null arcs to modify the graph, can be performed at this stage as well.
  • Each of the word-level FSMs is expanded to produce an FST whose inputs are phonemes and whose outputs are words, word-level markers, and non-terminals.
  • the FST for $CITY is fully expanded to phoneme inputs and word outputs.
  • the arc from node 0 to node 1 corresponds to the first phoneme of the word "Albuquerque” and is labeled with the phoneme input 'ah'.
  • Referring to FIG. 6a, a portion of the expansion of the word-level graph shown in FIG. 5a is illustrated. This portion corresponds to the "from $CITY to" portion of the grammar. From left to right, the first arcs correspond to the word "from". The next arc corresponds to the un-expanded arc for the non-terminal $CITY. The symbol $CITY is introduced as both the input and the output of the arc. The next arc corresponds to the first phoneme of the word "to."
  • Each of the phoneme-level expansions is then processed to produce a phone-based FST whose inputs are phones rather than phonemes.
  • The effect of P is not illustrated, and for the purpose of illustration we assume that P does not introduce any additional pronunciations.
  • the context of a phone which affects the selection of a model depends on the immediately preceding and the immediately following phone, in what is often referred to as tri-phone modeling. It should be understood that the approach described below is applicable to other contexts, for example, that make use of more preceding and following phones, and make use of other types of contexts, such as word and phrase boundaries.
  • A phone 'b' that is preceded by a phone 'a' and followed by a phone 'c' is denoted as having an 'a_c' context.
  • Each phone has an enumerated set of models, the selection of which is a deterministic function of its context. That is, all the possible contexts 'x_y' are mapped into a smaller number of groups. This grouping is based on data, on a priori decisions (for example, using knowledge of speech), or both.
  • The different models for a particular phone 'p' are enumerated and denoted 'p.1', 'p.2', ... .
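  • A hedged Python sketch of this deterministic grouping follows (the lookup table is invented; the model numbers merely echo those used in the FIG. 8 discussion below, and a real grouping would be derived from data, for example with phonetic decision trees):

        # Map a (preceding phone, following phone) context to a model number.
        GROUPS = {("ah", "ah"): 3, ("ah", "w"): 14}

        def model_for(phone, left, right, default_group=1):
            """Select the enumerated model 'p.N' for a phone in context 'left_right'."""
            return f"{phone}.{GROUPS.get((left, right), default_group)}"

        print(model_for("m", "ah", "ah"))   # m.3   ('m' of "from" before "Albuquerque")
        print(model_for("m", "ah", "w"))    # m.14  ('m' of "from" before "Wilmington")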
  • Referring to FIG. 8a, a similar portion of the phone-level FST in our example corresponds to the portion of the grammar "from $CITY to." Note that in the context-dependent expansion, we account for the possible expansions of $CITY. In this embodiment, the grammar compiler does not make use of the actual starting or ending phone of the expansion of $CITY in performing the context-dependent expansion of the FST.
  • Referring to FIG. 8b, the phone 'm' is expanded into a number of contexts. Note that since the preceding context, in this case the word "from," is known, the set of possible contexts is restricted to be consistent with that preceding context. However, since the following context is not known, a number of different contexts are expanded.
  • 'm' is expanded to 'm.3', 'm.12', 'm.14', and 'm.17'.
  • Of these, only 'm.3' and 'm.14' are actually needed, for the contexts of "Albuquerque" and "Wilmington" as shown in FIG. 7b.
  • The resulting expansions are linked by null arcs to the starting node of $CITY. These null arcs include annotations that are used at recognition time and that indicate the particular contexts for which they are appropriate.
  • Referring to FIG. 9a and FIG. 9b, similar context processing is performed for the FST for $CITY.
  • The first arc of "Albuquerque" from node 910 to node 912 is expanded based on the unknown preceding context in which $CITY is used. Note that in this example, $CITY follows both "from" and "to" in different instances in the grammar.
  • Node 950 corresponds to node 910, and the multiple arcs from node 950 to node 952 carry the multiple contexts in which the first phone of "Albuquerque" may be used.
  • the last arc of each word is expanded to account for the unknown following context. This expansion is performed using a composition operation with an FST C, as introduced above.
  • At recognition time, speech recognizer 350 makes use of the multiple FSTs 342.
  • speech recognizer 350 keeps track of a state the grammar is in, based on past input, and uses compiled grammar 340 to identify allowable next models based on the state.
  • the state corresponds to the node in the single FST of that compiled grammar.
  • speech recognizer 350 recursively enters ("calls") the subsidiary FST when it encounters non-terminal arcs during recognition in an RTN approach, and maintains a calling "stack" along with the state.
  • the speech recognizer dynamically determines that the node following 'm.3' can propagate to the arc labeled 'ah.3' and the node following 'm.14' can propagate to the arc labeled 'w.4', and in this simple example, the nodes following 'm.12' and 'm.17' do not propagate to any arc in the $CITY FST.
  • The recognizer maintains a context (call stack) so that when it reaches the ending node of the $CITY FST, it propagates to the correct arcs in the $ROOT FST.
  • Speech recognizer 350 hypothesizes one path, or alternatively multiple ("N-best") paths, through the recursively expanded FSTs, and for each hypothesized path produces the corresponding sequence of output symbols, ignoring the null outputs.
  • these output symbols include words, and opening and closing brackets for constituents, which are used to parse the output without having to process the word sequence with CFG 120 after recognition.
  • elements in a right-hand side of a CFG rule 122 can be annotated with procedure names. The mechanics of this are not illustrated in the examples above. If an element is annotated with a procedure label, an arc with an opening bracket is introduced in the word graph before the element and an arc with a closing bracket and an identifier of the procedure is introduced after the element.
  • the speech recognizer locates these procedure brackets and invokes the identified procedures using the bracketed output subsequence as an argument to the procedure.
  • an application program has invoked the speech recognizer, and the procedures that are automatically called are "callback" procedures in the application that are used to process the hypothesized output. For example, if a constituent corresponds to the ways a currency amount (e.g., "a dollar fifty”) are spoken, the callback procedure may process the subsequence of words to automatically fill in a value for the stated dollar amount.
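  • A Python sketch of this callback mechanism follows (illustrative only; the '{' and '}procedure' marker spellings, and the from_city example, are assumptions consistent with the annotation example given earlier):

        def run_callbacks(output, callbacks):
            """Scan the recognizer's output symbols, invoking each registered
            callback on the word subsequence its brackets enclose."""
            words, stack = [], []
            for sym in output:
                if sym == "{":                   # opening procedure bracket
                    stack.append(len(words))
                elif sym.startswith("}"):        # closing bracket + procedure id
                    start = stack.pop()
                    callbacks[sym[1:]](words[start:])
                else:
                    words.append(sym)
            return words

        def from_city(words):
            print("from_city called with:", " ".join(words))

        run_callbacks(["fly", "from", "{", "albuquerque", "}from_city", "to"],
                      {"from_city": from_city})
        # from_city called with: albuquerque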
  • a number of alternative approaches are used alone or in combination to select the non-terminals which are not expanded during grammar compilation.
  • developer 312 provides a list of those non-terminals to grammar compiler 330. For example, developer 312 may select these non-terminals based on knowledge that they result in large subgrammars, but are not often used in practice by a speaker. Since they are not often used, speech recognizer 350 would not often incur the overhead of calling the nested FST for that non-terminal.
  • Another criterion used by developer 312 might be based on a combination of a relatively large size of the expansion of the non-terminal and a large number of instances of the non-terminal in the grammar.
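  • As an illustrative sketch of such a criterion (the scoring rule and the numbers are invented, not the patent's), non-terminals can be ranked by expansion size times instance count, and the top scorers left unexpanded:

        def select_unexpanded(sizes, counts, budget=1):
            """Pick the non-terminals whose (expansion size x instance count)
            score is highest; leaving them unexpanded shrinks the compiled grammar."""
            scored = sorted(sizes, key=lambda nt: sizes[nt] * counts[nt], reverse=True)
            return set(scored[:budget])

        sizes  = {"$CITY": 20000, "$DATE": 400, "$FLY": 3}   # estimated expansion sizes
        counts = {"$CITY": 2,     "$DATE": 1,   "$FLY": 1}   # instances in the grammar
        print(select_unexpanded(sizes, counts))              # {'$CITY'}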
  • the selection process is alternatively automated using grammar analyzer 335.
  • a criterion based on the overall size of compiled grammar 340 and an estimated overhead for processing the nested calls at run-time may be used.
  • Grammar analyzer 335 processes CFG 120, and optionally processes a corpus of typical utterances, in determining the non-terminals to select according to the criterion.
  • Some FSTs 342, or equivalently FSTs in FST library 344, are not known at configuration time.
  • non-terminals may be dynamically expanded based on the identity of the speaker or on external context such as the time of day or the area code from which a speaker is calling.
  • Software which is stored on computer readable media such as magnetic or optical disks, or which is accessed over a communication medium such as a data network, includes instructions for causing processors to perform the various steps.
  • processors can be general-purpose processors, such as Intel Pentium processors, and the software can execute under the control of a general-purpose operating system such as Microsoft Windows NT, or a variant of the UNIX operating system.
  • the processors can be special purpose and the operating systems can be special-purpose operating systems.
  • the instructions can be machine-level instructions.
  • the instructions are higher-level instructions, such as Java byte codes or program language statements that are interpreted at runtime.
  • The steps for configuration time and for recognition time can all be performed on a single computer.
  • Alternatively, the configuration-time steps may be performed on one or a number of computers and the recognition performed on another computer or set of computers.
  • Information can be transferred from the configuration computer to the recognition computer over a communication network or using physical media.
  • Multiple computers can be used for the configuration steps or the recognition steps, and some configuration steps may be performed on the recognition computer. For example, determining runtime characteristics of a grammar may be performed on a computer hosting the recognizer, with these determined runtime characteristics being fed back to a configuration computer. Such an approach can include profiling execution of the recognizer and feeding back the profiling results to the grammar compiler.
EP02782503A 2001-07-05 2002-07-03 Grammars for speech recognition Withdrawn EP1428205A1 (de)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US906575 1997-08-05
US30316601P 2001-07-05 2001-07-05
US303166P 2001-07-05
US09/906,575 US20030009331A1 (en) 2001-07-05 2001-07-16 Grammars for speech recognition
PCT/US2002/021479 WO2003005347A1 (en) 2001-07-05 2002-07-03 Grammars for speech recognition

Publications (1)

Publication Number Publication Date
EP1428205A1 true EP1428205A1 (de) 2004-06-16

Family

ID=26973297

Family Applications (1)

Application Number Title Priority Date Filing Date
EP02782503A Withdrawn EP1428205A1 (de) 2001-07-05 2002-07-03 Grammatiken für die spracherkennung

Country Status (1)

Country Link
EP (1) EP1428205A1 (de)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5241619A (en) * 1991-06-25 1993-08-31 Bolt Beranek And Newman Inc. Word dependent N-best search method

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5241619A (en) * 1991-06-25 1993-08-31 Bolt Beranek And Newman Inc. Word dependent N-best search method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of WO03005347A1 *

Similar Documents

Publication Publication Date Title
US7072837B2 (en) Method for processing initially recognized speech in a speech recognition session
US20030009331A1 (en) Grammars for speech recognition
US5384892A (en) Dynamic language model for speech recognition
Hori et al. Efficient WFST-based one-pass decoding with on-the-fly hypothesis rescoring in extremely large vocabulary continuous speech recognition
Lee et al. Recent development of open-source speech recognition engine julius
US5613036A (en) Dynamic categories for a speech recognition system
US7127394B2 (en) Assigning meanings to utterances in a speech recognition system
US5875426A (en) Recognizing speech having word liaisons by adding a phoneme to reference word models
JPH08278794A (ja) Speech recognition apparatus, speech recognition method, and speech translation apparatus
Bergmann et al. An adaptable man-machine interface using connected-word recognition
US5819221A (en) Speech recognition using clustered between word and/or phrase coarticulation
EP1475779A1 (de) System mit kombiniertem statistischen und regelbasierten Grammatikmodell zur Spracherkennung und zum Sprachverstehen
US6345249B1 (en) Automatic analysis of a speech dictated document
KR100726875B1 (ko) 구두 대화에서의 전형적인 실수에 대한 보완적인 언어모델을 갖는 음성 인식 디바이스
US6980954B1 (en) Search method based on single triphone tree for large vocabulary continuous speech recognizer
EP0938077B1 (de) Spracherkennungssystem
US20070038451A1 (en) Voice recognition for large dynamic vocabularies
US20040034519A1 (en) Dynamic language models for speech recognition
EP1111587B1 (de) Vorrichtung zum Spracherkennung mit Durchführung einer syntaktischen Permutationsregel
AbuZeina et al. Cross-word modeling for Arabic speech recognition
US6772116B2 (en) Method of decoding telegraphic speech
KR20050101695A (ko) 인식 결과를 이용한 통계적인 음성 인식 시스템 및 그 방법
Mohri Weighted grammar tools: the GRM library
EP1428205A1 (de) Grammatiken für die spracherkennung
NAKAGAWA et al. Comparison of Syntax-Oriented Spoken Japanese Understanding System with Semantic-Oriented System

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20040204

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR IE IT LI LU MC NL PT SE SK TR

AX Request for extension of the european patent

Extension state: AL LT LV MK RO SI

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20061220