WO2018053502A1 - Systems and methods for adaptive proper name entity recognition and understanding

Systems and methods for adaptive proper name entity recognition and understanding

Info

Publication number
WO2018053502A1
Authority
WO
WIPO (PCT)
Prior art keywords
grammar
words
span
acoustic
recognizer
Prior art date
Application number
PCT/US2017/052251
Other languages
English (en)
French (fr)
Inventor
Harry William PRINTZ
Original Assignee
Promptu Systems Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US15/269,924 (US9818401B2)
Application filed by Promptu Systems Corporation
Priority to EP17851782.7A (EP3516649A4)
Priority to AU2017326987A (AU2017326987B2)
Priority to CA3036998A (CA3036998A1)
Publication of WO2018053502A1
Priority to AU2022263497A (AU2022263497A1)

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/26 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
    • G01C21/34 Route searching; Route guidance
    • G01C21/36 Input/output arrangements for on-board computers
    • G01C21/3605 Destination input or retrieval
    • G01C21/3608 Destination input or retrieval using speech input, e.g. using speech recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/32 Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding

Definitions

  • Various of the disclosed embodiments relate to systems and methods for automatic recognition and understanding of fluent, natural human speech, notably speech that may include proper name entities, as discussed herein.
  • Figure 1 is a screenshot of an example graphical user interface in a personal assistant application implementing various features of some embodiments.
  • Figure 2 is a graphical depiction of a grammar generated using a request in the example of Figure 1 in some embodiments.
  • Figure 3 is an example processing diagram depicting the processing operations of an embodiment as applied to an example word sentence.
  • Figure 4 is a screenshot of an example graphical user interface depicting the results following processing in an example system as may occur in some embodiments.
  • Figure 5 is an example breakdown of an utterance waveform as may occur in some embodiments.
  • Figure 6 is an example breakdown of an utterance waveform as may occur in some embodiments.
  • Figure 7 is an example breakdown of an utterance waveform as may occur in some embodiments.
  • Figure 8 is a block diagram depicting various components in an example speech processing system having server and client proper name resolution modules as may occur in some embodiments.
  • Figure 9 is a flow diagram depicting the proper name recognition process at a high level for various embodiments using automatic speech recognition (ASR) and natural language understanding (NLU) components.
  • Figure 10 is a flow diagram depicting various steps in a proper name recognition process as may occur in some embodiments.
  • Figure 11 is an example hypothesis corpus as may be generated in some embodiments.
  • Figure 12 is an example of a first hypothesis breakdown based upon the example of Figure 11 as may occur in some embodiments.
  • Figure 13 is an example of a second hypothesis breakdown based upon the example of Figure 11 as may occur in some embodiments.
  • Figure 14 is a flow diagram depicting various steps in a server-side process for proper name recognition as may occur in some embodiments.
  • Figure 15 is a flow diagram depicting various steps in a client-side process for proper name recognition as may occur in some embodiments.
  • Figure 16 and Figure 17 are further processing examples without and with score fusion respectively as may occur in some embodiments.
  • Figure 18 is an example target section suitable for incorporation into an adaptation grammar, depicting generic labels n_1, n_2, n_3, ..., n_k on the grammar arcs.
  • Figure 19 is an example target section suitable for incorporation into an adaptation grammar, with selected arcs labeled with both literal sequences, derived from a user contact list, and associated actions to be performed on a semantic meaning variable.
  • Figure 20 is an example waveform and associated annotations, with an associated primary recognizer output and adaptation grammar that may be used to perform secondary recognition.
  • Figure 21 is an example waveform and associated annotations, with an associated primary recognizer output and adaptation grammar that may be used to perform secondary recognition, illustrating the use of word baseforms.
  • Figure 22 is an example slotted adaptation grammar incorporating an example target section, illustrating unpopulated slots in the prefix and suffix sections.
  • Figure 23 is an example waveform and annotations thereof, with an associated primary recognizer output and slotted adaptation grammar with populated slots that may be used to perform secondary recognition.
  • Figure 24 is an example slotted adaptation grammar with a single unpopulated prefix slot and a single unpopulated suffix slot.
  • Figure 25 is an example waveform and annotations thereof, with an associated primary recognizer output and slotted adaptation grammar with a single prefix slot and a single suffix slot, respectively populated with prefix and suffix word sequences, that may be used to perform secondary recognition.
  • Figure 26 is an example unpopulated slotted adaptation grammar, structured to permit correction of span-too-small errors.
  • Figure 27 is an example waveform and annotations thereof, with an associated primary recognizer output and populated slotted adaptation grammar, structured to permit correction of span-too-small errors, illustrating the correction of a span-too-small error, and also an incorrect decoding.
  • Figure 28 is an example populated slotted adaptation grammar, illustrating prefix and suffix sections populated per a span-too-large error, which will therefore likely yield an incorrect secondary decoding.
  • Figure 29 is an example populated slotted adaptation grammar, illustrating prefix and suffix sections populated per a span-too-large error, and including epsilon arcs appropriate for correcting span-too-small errors, which are not effective in correcting the error in question.
  • Figure 30 shows example unpopulated and populated slotted adaptation grammars, which include left shim and right shim structures that can correct span-too-large errors.
  • Figure 31 is an example waveform and annotations thereof, with an associated primary recognizer output and fully populated slotted adaptation grammar, structured to permit correction of span-too-large errors, illustrating the correction of a span-too-large error.
  • Figure 32 shows example unpopulated and populated slotted adaptation grammars, which include left shim and right shim structures that can correct span-too-large errors by means of phoneme loops.
  • Figure 33 is an example waveform and annotations thereof, with an associated primary recognizer output and fully populated slotted adaptation grammar, structured to permit correction of span-too-large errors by means of phoneme loops, illustrating the correction of a span-too-large error.
  • Figure 34 shows example unpopulated and populated slotted adaptation grammars, which include left shim and right shim structures that can correct span-too-large errors by means of either phoneme loops or alternate literals.
  • Figure 35 shows an example unpopulated slotted adaptation grammar, structured to permit correction of both span-too-small and span-too-large errors, including simultaneous errors of different types, and two populated instances thereof appropriate to distinct primary transcriptions.
  • Figure 36 is an example waveform and annotations thereof, illustrating the use of the top populated grammar of Figure 35 to correct a span-too-large error.
  • Figure 37 is an example waveform and annotations thereof, illustrating the use of the bottom populated grammar of Figure 35 to correct a span-too-small error, and also an incorrect primary decoding.
  • Figure 38 is an example waveform and annotations thereof, illustrating the use of a slotted adaptation grammar, with prefix and suffix sections populated with baseforms from the primary recognizer output, to correct a span-too-large error.
  • Figure 39 is an example waveform and annotations thereof, illustrating the use of a slotted adaptation grammar, with populated left and right shims structured to correct span-too-large errors via phoneme sequences decoded by the primary recognizer for the target span, to correct a span-too-large error.
  • Figure 40 is an example waveform and annotations thereof, illustrating the use of a slotted adaptation grammar, with populated left and right shims structured to correct span-too-large errors via phoneme sequences decoded by the primary recognizer for the target span, and prefix and suffix sections structured to correct span-too-small errors via nested epsilon arcs, to correct a span-too-large error.
  • Figure 41 is an example waveform and annotations thereof, illustrating the use of the method of adaptive proper name recognition with a primary recognizer output that includes a lattice.
  • Figure 42 is a block diagram of a computer system as may be used to implement features of some of the embodiments.
  • An "acoustic prefix" as referenced herein is one or more words, as decoded in the primary recognition step, that precede a target span. This may also be called the "left acoustic context.”
  • An "acoustic span" is a portion of an audio waveform.
  • An "acoustic suffix" is one or more words, as decoded in the primary recognition step, that follow a target span. This may also be called the "right acoustic context.”
  • An "adaptation grammar” is a grammar that is used, in conjunction with a grammar-based ASR system, as an adaptation object.
  • An "adaptation object” is computer-stored information that enables adaptation (in some embodiments, very rapid adaptation) of a secondary recognizer to a specified collection of recognizable words and word sequences. For grammar-based ASR systems, this is a grammar, which may be in compiled or finalized form.
  • An "adaptation object generation module” creates adaptation objects. It may accept as input words or word sequences, some of which may be completely novel, and specifications of allowed ways of assembling the given words or word sequences.
  • An “adaptation object generator” is the same as an “adaptation object generation module.”
  • An "adaptation object generation step” is a step in the operation of some embodiments, which may comprise the use of an adaptation object generation module, operating upon appropriate inputs, to create an adaptation object. This process may be divided into two stages: object preparation and object finalization. If the secondary recognizer uses grammar-based ASR technology, "object preparation” may comprise grammar compilation, and “object finalization” may comprise population of grammar slots.
  • An “aggregate word” is a notional “word,” with very many pronunciations, that stands for an entire collection of proper names. This may be the same as a “placeholder” or “placeholder word.”
  • ARPAbet refers to a phonetic alphabet for the English language.
  • ASR refers to automatic speech recognition: the automatic conversion of spoken language into text.
  • An "ASR confidence score” refers to a numerical score that reflects the strength of evidence for a particular transcription of a given audio signal.
  • a “baseform” refers to a triple that associates: (1) a word as a lexical object (that is, a sequence of letters as a word is typically spelled); (2) an index that can be used to distinguish many baseforms for the same word from one another; and (3) a pronunciation for the word, comprising a sequence of phonemes.
  • a given word may have several associated baseforms, distinguished by their pronunciation. For instance, here are the baseforms for the word "tomato”, which, as memorialized in the lyric of the once-popular song "Let's Call the Whole Thing Off,” has two accepted pronunciations. The number enclosed in parentheses is the above-mentioned index:
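The baseform triple can be sketched as a small record type. This is a minimal illustration, not the listing from the source: the class name `Baseform`, the helper `render`, and the ARPAbet-style phoneme strings (stress marks omitted) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Baseform:
    """A (word, index, pronunciation) triple, as defined in the glossary."""
    word: str                  # the literal, i.e. the word as spelled
    index: int                 # distinguishes baseforms of the same word
    phonemes: Tuple[str, ...]  # pronunciation as a sequence of phonemes


# Assumed ARPAbet-style pronunciations, for illustration only.
TOMATO = [
    Baseform("tomato", 1, ("T", "AH", "M", "EY", "T", "OW")),
    Baseform("tomato", 2, ("T", "AH", "M", "AA", "T", "OW")),
]


def render(b: Baseform) -> str:
    """Format a baseform as word(index) followed by its phonemes."""
    return f"{b.word}({b.index}) {' '.join(b.phonemes)}"


for b in TOMATO:
    print(render(b))
```

Both entries share the same literal and differ only in index and pronunciation, which is what lets a vocabulary hold several baseforms per word.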
  • a "decode span” or “decode acoustic span” is the same as a “full span” or “full acoustic span”.
  • the "epsilon word object” or equivalently “epsilon word,” denoted “w_e,” is a grammar label that enables a decoder to traverse the arc it labels without matching any portion of the waveform being decoded.
  • a “feature vector” is a multi-dimensional vector, with elements that are typically real numbers, comprising a processed representation of the audio in one frame of speech.
  • a new feature vector may be computed for each 10ms advance within the source utterance. See “frame.”
  • a "frame” is the smallest individual element of a waveform that is matched by an ASR system's acoustic model, and may typically comprise approximately 200 ms of speech. For the purpose of computing feature vectors, successive frames of speech may overlap, with each new frame advancing, e.g., 10 ms within the source utterance.
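The overlap between successive frames can be made concrete with a little arithmetic. The sketch below assumes the example values given in the definitions above (a frame of about 200 ms and a 10 ms advance); the function name `frame_starts_ms` and the choice to keep only frames that fit entirely inside the utterance are assumptions for illustration.

```python
from typing import List


def frame_starts_ms(utterance_ms: int, frame_ms: int = 200, hop_ms: int = 10) -> List[int]:
    """Start times (ms) of every frame that fits entirely inside the utterance.

    One feature vector would be computed per returned start time.
    """
    if utterance_ms < frame_ms:
        return []
    return list(range(0, utterance_ms - frame_ms + 1, hop_ms))


starts = frame_starts_ms(1000)
print(len(starts), starts[:3], starts[-1])
```

Under these assumptions, neighboring frames share 190 ms of audio, so the number of feature vectors is governed by the advance rather than by the frame length.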
  • a "full span” or “full acoustic span” is the entire audio segment decoded by a secondary recognition step, including the audio of acoustic prefix words and acoustic suffix words, plus the putative target span.
  • a "grammar” is a symbolic representation of all the permitted sequences of words that a particular instance of a grammar-based ASR system can recognize. See “VXML" in this glossary for a discussion of one way to represent such a grammar.
  • the grammar used by a grammar-based ASR system may be easy to change.
  • Grammar-based ASR is a technology for automatic speech recognition in which only the word sequences allowed by suitably specified grammar can be recognized from a given audio input. Compare with “open dictation ASR.”
  • a "grammar label” is an object that may be associated with a given arc within a grammar— hence “labeling" the arc— that identifies a literal, a baseform, a phoneme, a context-dependent phoneme, or some other entity that must be matched within the waveform when a decoder traverses that arc. This nomenclature is used as well for the objects that populate the slots of a slotted grammar.
  • variable "h” refers to a “history” or "language model context,” typically comprising two or more preceding words. This functions as the conditioning information in a language model probability such as p(w | h).
  • a "literal” is the textual form of a word.
  • NLU refers to natural language understanding: the automatic extraction, from human-readable text, of a symbolic representation of the meaning of the text, sufficient for a completely mechanical device of appropriate design to execute the requested action with no further human guidance.
  • NLU confidence score is a numerical score that reflects the strength of evidence for a particular NLU meaning hypothesis.
  • Open dictation ASR is a technology for automatic speech recognition in which in principle an arbitrary sequence of words, drawn from a fixed vocabulary but otherwise unconstrained to any particular order or grammatical structure, can be recognized from a given audio input. Compare with "grammar-based ASR.”
  • a "placeholder” or "placeholder word” is the same as an “aggregate” or an “aggregate word.”
  • a "phonetic alphabet” is a list of all the individual sound units (“phonemes”) that are found within a given language, with an associated notation for writing sequences of these phonemes to define a pronunciation for a given word.
  • a "primary recognition step” or “primary decoding step” is a step in the operation of some embodiments, comprising supplying a user's spoken command or request as input to the primary recognizer, yielding as output one or more transcriptions of this input, optionally labeled with the start time and end time, within this input, of each transcribed word.
  • a "primary recognizer” or “primary decoder” is a conventional open dictation automatic speech recognition (ASR) system, in principle capable of transcribing an utterance comprised of an arbitrary sequence of words in the system's large but nominally fixed vocabulary.
  • a "primary transcription” or “primary decoding” is a sequence, in whole or in part, of regular human-language words in textual form, or other textual objects nominally representing the content of an audio input signal, generated by a primary recognizer.
  • a "proper name” or "proper name entity” is a sequence of one or more words that refer to a specific person, place, business or thing. By the conventions of English language orthography, typically the written form of a proper name entity will include one or more capitalized words, as in for example "Barack Obama,” “Joseph Biden,” “1600 Pennsylvania Avenue,” “John Doe's Diner,” “The Grand Ole Opry,” “Lincoln Center,” “Cafe des Artistes,” “AT&T Park,” “Ethan's school,” “All Along the Watchtower,” “My Favorite Things,” “Jimi Hendrix,” “The Sound of Music” and so on. However, this is not a requirement, and within the context of this specification purely descriptive phrases such as “daycare” or “grandma's house” may also be regarded as proper name entities.
  • "secondary recognition” or “secondary decoding” refers to either of (a) the execution of a secondary recognition step, in whole or in part, by a secondary recognizer, or (b) the result, in whole or in part, of a secondary recognition step.
  • a “secondary recognition step” or “secondary decoding step” is a step in the operation of some embodiments, comprising supplying a selected portion of the user's spoken command or request, which may comprise the entirety of this spoken command or request, as input to the secondary recognizer, yielding as output one or more transcriptions of this input, each transcription possibly labeled with (1) a confidence score and (2) one or more associated meaning variables and their values.
  • a “secondary recognizer” or “secondary decoder” is an automatic speech recognition (ASR) system, characterized by its ability to perform very rapid adaptation to new vocabulary words, novel word sequences, or both, including completely novel proper names and words.
  • a secondary recognizer may generate an ASR confidence score for its output, and may be operated in "n-best mode" to generate up to a given number n of distinct outputs, each of which may bear an associated ASR confidence score.
  • a "secondary transcription” or “secondary decoding” is a sequence, in whole or in part, of regular human-language words in textual form, or other textual objects nominally representing the content of an audio input signal, generated by a secondary recognizer.
  • a "slotted adaptation grammar” is a slotted grammar that is used as an adaptation object.
  • a "slotted grammar” is a grammar, wherein certain otherwise unlabeled grammar arcs have placeholder slots that may be populated with zero, one or a sequence of grammar labels, after the nominal compilation of the slotted grammar. If a slot is left unpopulated, the grammar behaves in decoding as if the associated arc were not present.
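A slotted grammar can be pictured as a small graph whose slot arcs stay inert until they are populated. The following is a minimal sketch under stated assumptions: the names `Arc` and `SlottedGrammar`, the integer state numbering, the restriction to acyclic grammars, and the contact name used to populate the slot are all hypothetical and chosen only to illustrate the definition.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Arc:
    src: int
    dst: int
    labels: Optional[List[str]] = None  # None marks an unpopulated slot


class SlottedGrammar:
    """Toy acyclic grammar; assumed structure, not the source's representation."""

    def __init__(self, arcs: List[Arc], start: int, final: int):
        self.arcs, self.start, self.final = arcs, start, final

    def accepts(self, words: Sequence[str]) -> bool:
        """True if some start-to-final path matches the word sequence exactly."""
        def walk(state: int, pos: int) -> bool:
            if state == self.final and pos == len(words):
                return True
            for arc in self.arcs:
                if arc.src != state or arc.labels is None:
                    continue  # an unpopulated slot behaves as if the arc were absent
                n = len(arc.labels)
                if list(words[pos:pos + n]) == arc.labels and walk(arc.dst, pos + n):
                    return True
            return False
        return walk(self.start, 0)


# "call" followed by one of two parallel slots, both empty after compilation.
g = SlottedGrammar([Arc(0, 1, ["call"]), Arc(1, 2), Arc(1, 2)], start=0, final=2)
print(g.accepts(["call", "john", "doe"]))   # slots unpopulated

g.arcs[1].labels = ["john", "doe"]          # populate one slot after compilation
print(g.accepts(["call", "john", "doe"]))
```

Populating a slot only assigns labels to an existing arc, which is why, in this sketch, adaptation does not require rebuilding the rest of the grammar.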
  • a “span” is a contiguous section of the input utterance, identified by its start time and end time within the whole of the input utterance (hereafter called the “span extent"), hypothesized to comprise a proper name entity, and labeled with the putative type of this entity (hereafter called the “span type").
  • the term may also include acoustic prefix and suffix words, not nominally part of the proper name entity per se. See also “acoustic prefix” “acoustic suffix", "target span” and “full span.”
  • a “span extent” is the start time and end time of a span, within an input utterance.
  • a "span type" is the putative type of the proper name entity believed to be present within the span; thus a personal name, business name, numbered street address, etc.
  • a "target span” is the portion of the acoustic span, decoded by a secondary recognition step, that nominally contains the words of the proper name entity.
  • the term refers to the acoustic span, exclusive of the acoustic prefix words and acoustic suffix words.
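The relationship between a target span and its full span can be sketched with two record types. The names `Span`, `TimedWord` and `full_span`, the millisecond units, and the span type string are assumptions for illustration; only the roles (span extent, span type, acoustic prefix, acoustic suffix) come from the definitions above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Span:
    start_ms: int   # span extent: start time within the utterance
    end_ms: int     # span extent: end time within the utterance
    span_type: str  # putative entity type; the type names here are hypothetical


@dataclass(frozen=True)
class TimedWord:
    text: str
    start_ms: int
    end_ms: int


def full_span(prefix: List[TimedWord], target: Span, suffix: List[TimedWord]) -> Span:
    """Widen the target span to cover acoustic prefix and suffix words, if any."""
    start = prefix[0].start_ms if prefix else target.start_ms
    end = suffix[-1].end_ms if suffix else target.end_ms
    return Span(start, end, target.span_type)


target = Span(1000, 1900, "personal-name")
prefix = [TimedWord("to", 900, 1000)]
print(full_span(prefix, target, []))
```

With an empty prefix and suffix the full span coincides with the target span, matching the case where no acoustic context words are included.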
  • An "understanding step” is a step in the operation of some embodiments, comprising supplying as input the text and word timings of the user's utterance as generated by the primary recognizer, and yielding as output one or more hypothesized symbolic meanings of the user's input, each such meaning possibly including the identification of one or more acoustic spans, comprising a span extent and span type, each such span to be separately processed by a secondary recognition step.
  • each hypothesized symbolic meaning may include an associated NLU confidence score.
  • An "utterance” is audio presented as input to an ASR system, to be transcribed (converted into text) by that system.
  • a "verbalization" of one or more words is an audio signal comprising the spoken form of those words.
  • a "vocabulary" is, informally, a list of the words with associated pronunciations, which forms part of the input to an ASR system, and which defines the words that could in principle be recognized by such a system.
  • the term may refer to a list of baseforms. Also sometimes called a "lexicon.”
  • VXML is a popular standard for specifying the grammar, for grammar-based ASR systems.
  • variable "w" refers to a generic word, including an aggregate word.
  • the word "word” may refer to any of: the spoken form of a conventional word in an ordinary human language, thus a verbalization of this word; the textual form of a conventional word in an ordinary human language, thus the "literal” corresponding to this word; or an aggregate word.
  • the textual output marking a period of silence, in a transcription generated by an ASR system, is also regarded as a word.
  • ww is an abbreviation for "whole waveform.”
  • wwapnr is an abbreviation for "whole waveform adaptive proper name recognition.”
  • Various of the disclosed embodiments attain high recognition accuracy and understanding of freely spoken utterances containing proper names such as, e.g., names of persons, streets, cities, businesses, landmarks, songs, videos or other entities that are known to be pertinent to a particular user of such a system.
  • Various embodiments augment the recognition system with methods that recognize and understand completely novel proper names, never before incorporated into the system in question.
  • Various embodiments may achieve this benefit with extremely low latency, e.g., on the order of a few hundred milliseconds.
  • Some embodiments may be used to recognize entities, such as numbered street addresses, or street intersections, that include within them street names and possibly city and state names as well.
  • An example of the latter would be “333 Ravenswood Avenue” or the more precise “333 Ravenswood Avenue, Menlo Park, California.”
  • Word sequences that are purely descriptive and generic, such as “grandma's house,” “the office,” “daycare,” “the playground” and so on, which the user has identified to the system as personally significant, may also be addressed in some embodiments.
  • the terms "proper names” and "proper name entities” will be understood to refer to the proper names and word sequences discussed in this and the preceding paragraph.
  • Some embodiments also extract a symbolic meaning, as appropriate, associated with the identity or relevant particulars of the recognized entity (such as the index of a particular entry in a list of businesses or personal contact names, the number portion of a street address, the user's current address or the internal symbolic label of a street name within an automatic mapping or navigation system), so that the system as a whole may respond appropriately to the user's spoken request.
  • An additional benefit of some embodiments is higher accuracy recognition of proper name entities than can be achieved with conventional methods, such as direct adaptation of an open dictation ASR system. This benefit may be obtained because some embodiments place additional sources of information at the disposal of the speech decoding and meaning assignment process. This information may be principally but not exclusively derived from an NLU processing step, from the state of the system as a whole, such as recent prior user inputs and search results, or information about or associated with the user, such as the contents of a personal or professional calendar.
  • an additional benefit of some embodiments is that the open dictation ASR component of the system, prepared by some of the methods described here, may require no further adaptation or modification to enable recognition of names and entities that are not initially present in its vocabulary.
  • this open dictation ASR component may be shared by a multitude of users, with the necessary adaptations to enable recognition of proper names confined to other components of the system. This may provide several important advantages.
  • Second, such adaptation of open dictation ASR systems typically involves preparation of a new vocabulary, language model and acoustic model, or some subset thereof, each of which is an electronic computer file.
  • Such files can be large even by current standards of electronic storage technology. For instance a typical language model may occupy some 4 GB of storage.
  • the computational and hence economic cost to prepare these specially adapted files, and the associated economic cost to store them and load them on demand, for each individual user of the system, may be prohibitively high.
  • some embodiments do not require adaptation of the open dictation ASR system, yet yield accuracy akin to or superior to the performance of an open dictation ASR system that has been adapted in the conventional fashion.
  • a "primary" open dictation ASR system may reside at a central server, whereas a "secondary" grammar-based ASR system may reside within a smartphone, automotive dashboard, television, laptop or other electronic computing device that is the user's personal property.
  • the latter device is referred to as the "client device” or the “client” herein.
  • the system adaptations to enable recognition and understanding of the proper name entities associated with that particular user may be confined to the secondary ASR system and may be executed exclusively within the client device in some embodiments.
  • Various embodiments accept as input an audio signal comprising fluent, natural human speech, which notably may contain one or several proper names, or unorthodox sequences of otherwise ordinary words.
  • the embodiments may produce as output an accurate textual transcription of this audio signal, and optionally a symbolic rendering of its meaning.
  • the system comprises four major functional components, respectively a primary speech recognizer (or more simply a primary recognizer), a natural language understanding module (also called an NLU module, language understanding module or just understanding module), an adaptation object generator, and a secondary speech recognizer (or secondary recognizer).
  • Some embodiments also include a fifth major functional component, the score fusion and hypothesis selection module, which will be discussed in later sections.
  • the system as a whole may include mechanisms to cause these components to operate and communicate as described herein, and to store the input audio signal in such form that it may be reprocessed, in whole or part, during the operation of some embodiments.
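The flow among the four major components can be sketched as a single function that wires them together. This is a rough outline under stated assumptions, not the system's actual interfaces: the function `recognize`, the tuple layouts, the placeholder token `<name>`, and the toy stand-ins for each component are all hypothetical.

```python
from typing import Callable, List, Tuple

Word = Tuple[str, float, float]  # (text, start_s, end_s)
Span = Tuple[float, float, str]  # (start_s, end_s, span_type)


def recognize(audio: bytes,
              primary: Callable[[bytes], List[Word]],
              understand: Callable[[List[Word]], List[Span]],
              make_adaptation_object: Callable[[str], object],
              secondary: Callable[[bytes, float, float, object], str]):
    """Primary decoding, understanding, then one secondary decoding per span."""
    words = primary(audio)                    # transcription with word timings
    spans = understand(words)                 # hypothesized proper name spans
    names = []
    for start_s, end_s, span_type in spans:
        adaptation = make_adaptation_object(span_type)
        names.append(secondary(audio, start_s, end_s, adaptation))
    return words, names


# Toy stand-ins so the sketch runs; real components would replace these.
def toy_primary(audio):
    return [("call", 0.0, 0.4), ("<name>", 0.4, 1.1)]

def toy_understand(words):
    return [(s, e, "personal-name") for (t, s, e) in words if t == "<name>"]

def toy_generator(span_type):
    return {"personal-name": ["pat smith", "lee jones"]}.get(span_type, [])

def toy_secondary(audio, start_s, end_s, adaptation):
    return adaptation[0] if adaptation else ""


words, names = recognize(b"", toy_primary, toy_understand, toy_generator, toy_secondary)
print(words, names)
```

The stored audio is passed to the secondary recognizer along with the span extent, reflecting the requirement that the input signal remain available for reprocessing in whole or in part.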
  • the primary recognizer may include a conventional open dictation automatic speech recognition (ASR) system. Such a system accepts as input an audio signal comprising human speech. It may produce as output a textual transcription of this input, labeled with the start time and end time, within the input audio signal, of each transcribed word. It may also attach an ASR confidence score to each transcribed word and optionally to the output transcription as a whole.
  • the primary recognizer may be an "open dictation" ASR system in that it may transcribe an utterance comprising an arbitrary sequence of words that belong to its vocabulary. This is contrasted with a grammar-based ASR system that can recognize only certain predetermined word sequences. In those embodiments where the recognizer is "conventional", this designation is used in the sense that the recognizer does not make use of the embodiments described herein. As a result, the primary recognizer may be assumed to have a large but fixed vocabulary.
  • This vocabulary may be difficult or impossible to augment with proper names or other novel words not presently in the vocabulary. Attempting to do so may require many minutes, hours or possibly even days of computational effort. These unknown proper names or other novel words— "unknown” in the sense of not being listed in the aforesaid fixed vocabulary— are therefore not recognizable by this primary recognizer. Moreover, if presented with an audio signal comprising words that belong to the vocabulary, but which are spoken in an unusual and possibly nominally meaningless sequence, such as "The The” (the name of an English musical group founded in 1979), the primary recognizer may have difficulty generating a correct transcription. Again, it is often difficult for the primary recognizer to accurately transcribe such unorthodox word sequences without significant computational effort.
  • the natural language understanding module may accept as input the transcription and word timings generated by the primary recognizer, and optionally additional pertinent information, and may emit as output one or more hypotheses of the meaning of the utterance (also called an NLU hypothesis, meaning hypothesis or just a meaning).
  • This meaning may be represented in a symbolic form suitable for processing or execution by a computer.
  • Each meaning hypothesis may optionally include a numerical NLU confidence score, which reflects the strength of evidence for that particular meaning.
  • This module may identify a particular word or word sequence in the input transcription that potentially comprises a proper name entity and label this word or word sequence with a putative type (for instance, a person's name, a street intersection, a numbered street address, and so on). Each such word or word sequence is called a proper name entity acoustic span, or the acoustic span of a proper name entity, or just an acoustic span.
  • the basis for marking this word or word sequence as an acoustic span may be quite indirect, and may not reflect the nominal meaning of the words that comprise it.
  • a given hypothesis may include one or more such acoustic spans, each one constituting an information element that must be resolved to fully specify the meaning of the phrase.
  • the transcription and meaning of the span may be determined by the context in which the embodiment is applied.
  • a given hypothesis may not include any acoustic spans at all. In this case the proper name recognition embodiments discussed herein may not apply.
  • the adaptation object generation module or adaptation object generator may create computational objects that are used to adapt the secondary recognizer in the manner described in the next paragraph. As detailed herein, this process may be divided into two stages, respectively object preparation and object finalization.
  • the secondary recognizer also comprises an ASR system, insofar as it accepts an audio signal as input and generates a transcription, and other information, as output. It may also attach an ASR confidence score to each transcribed word and optionally to the output transcription as a whole; it may also be operated in "n-best mode," to generate up to a given number n of distinct outputs, each of which may bear associated ASR confidence scores. However, its characteristics may be markedly different from those of the primary recognizer. Specifically, the secondary recognizer may be capable of very rapid adaptation to new vocabulary words, novel word sequences, or both, including completely novel proper names and words.
  • the secondary recognizer may also be unlike the primary recognizer as it is constrained to transcribe only a relatively small collection of phrases, numbering, e.g., in the tens, hundreds or thousands, rather than, e.g., the billions of phrases supported by the primary recognizer.
  • the primary and secondary recognizers may further be distinguished based upon their usage.
  • the secondary recognizer rather than processing the audio signal comprising the entirety of the user's spoken input, may operate upon only one or two short segments of the signal, extracted, e.g., from a saved copy of the signal. These segments are referred to herein as acoustic spans (or simply spans).
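As an illustration of operating on a short segment of a saved signal, the following minimal sketch slices a span out of a stored sample buffer by its start and end times. The function name `extract_span`, the 16 kHz sample rate, and the example times are assumptions made for illustration; none of them come from the described embodiments.

```python
def extract_span(samples, sample_rate_hz, start_ms, end_ms):
    """Return the samples covering the interval [start_ms, end_ms)."""
    if not 0 <= start_ms <= end_ms:
        raise ValueError("span must satisfy 0 <= start_ms <= end_ms")
    start = start_ms * sample_rate_hz // 1000
    # Clamp the end so a span that overruns the signal is truncated, not an error.
    end = min(len(samples), end_ms * sample_rate_hz // 1000)
    return samples[start:end]


# Hypothetical 3-second saved signal at 16 kHz (all zeros, for illustration).
saved_signal = [0] * (3 * 16000)
span_audio = extract_span(saved_signal, 16000, 1000, 2500)
```

Keeping the whole signal and slicing on demand lets the same saved copy serve several spans without re-recording.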
  • $W^* = \operatorname{argmax}_W \, P(A \mid W) \cdot P(W)$
  • A is the audio signal to be decoded (transcribed)
  • W is an hypothesis (guess) as to the correct decoding (transcription)
  • W* is the final decoding (transcription)
  • $P(A \mid W)$ and $P(W)$ are the numerical values of the acoustic model and language model respectively, for the indicated inputs.
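The decoding rule above can be sketched as a search over a small, explicit set of candidate transcriptions. This is only a toy illustration: a real recognizer searches a vastly larger space, and the candidate strings and probability values below are hypothetical. Scores are summed in the log domain, which is equivalent to multiplying the two probabilities.

```python
import math


def decode(hypotheses, acoustic_logp, language_logp):
    """Return W* = argmax_W P(A|W) * P(W), computed in the log domain."""
    return max(hypotheses, key=lambda w: acoustic_logp[w] + language_logp[w])


# Hypothetical model scores for two candidate transcriptions of one signal A.
acoustic = {
    "recognize speech": math.log(0.020),     # P(A | W)
    "wreck a nice beach": math.log(0.025),
}
language = {
    "recognize speech": math.log(0.010),     # P(W)
    "wreck a nice beach": math.log(0.001),
}
best = decode(list(acoustic), acoustic, language)
```

Here the second candidate has the better acoustic score, but the language model makes the first the overall winner.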
  • Such a system may derive its
  • the features and performance of the adaptation object generator may be those of a grammar compiler; similarly the features and performance of the secondary recognizer may be those of a grammar-based ASR system.
  • the primary recognizer is a large-vocabulary open dictation ASR system that uses statistical language models as described above and the secondary recognizer is a grammar-based ASR system. Many of the examples presented herein will proceed on this basis.
  • the audio input is supplied to the primary recognizer, which emits a transcription comprising a sequence of words in the primary recognizer's vocabulary. It is assumed that all the words in the sample command above are in this vocabulary.
  • this transcription is supplied as input to the natural language understanding module. Suppose that no proper name entity acoustic spans are identified and that only a single meaning hypothesis is generated. This yields a symbolic representation of the command's meaning
  • m_categories [ indpak] ,
  • although the system may not include any of "de," "Dosa," "Guddu," "Karahi," "Masala," "Noori" or "Tikka" in its ASR vocabulary, the system implementing the present embodiments is nevertheless capable of recognizing and responding properly to a command like "tell me how to get to Guddu de Karahi." This may be achieved in some embodiments by creating a specialized recognizer that can process the indicated business names (and in some embodiments nothing else), exploiting the information obtained by both the primary recognizer and the natural language module, and other such information that may be relevant, so that an appropriate acoustic span may be identified, and then deploying this specialized recognizer to good effect.
  • the system operates the adaptation object generator to create an object suitable for adapting the secondary recognizer to recognize precisely these names. This may be done by preparing a grammar, illustrated in graphical form in Figure 2, that contains exactly these names, and compiling it into a binary form so that it is ready for use by the secondary recognizer. This operation, which may typically take a few hundred milliseconds, may be performed immediately upon receiving from Yelp® the list of names to be shown on the tablet display.
  • Compilation may involve (1) obtaining one or more pronunciations for each indicated word in the grammar (this may typically be done by first searching a vocabulary, but if this search fails any required pronunciations may be automatically generated by a "grapheme to phoneme" or "g2p" processing module, which applies the standard rules of English language pronunciation to the given word spelling to produce one or more plausible pronunciations), (2) creating a computational structure that permits words to be decoded only in the order allowed by the grammar, (3) attaching to this structure operations to be performed on indicated meaning variables when a given decoding is obtained (which may typically comprise assigning values to these variables), and (4) emitting this structure in such form that it may be immediately loaded by a suitable grammar-based ASR system and used to guide its decoding of audio input.
  • This compiled grammar, denoted "business-names.g" in Figure 3, may be labeled with its type (in this case effectively business-names) and held for possible future use. In some embodiments, this comprises the adaptation object generation step.
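The four compilation steps described above can be sketched in miniature as follows. This is a simplified illustration, not the compiler of any actual grammar-based ASR system: the `VOCABULARY` contents, the letter-by-letter `g2p` placeholder, the trie layout, the use of `pickle` as the emitted form, and the name "Example Cafe" are all assumptions made for this sketch.

```python
import pickle

# Hypothetical pronunciation vocabulary (word -> list of phoneme strings).
VOCABULARY = {"de": ["d ey"], "cafe": ["k ae f ey"]}


def g2p(word):
    """Placeholder grapheme-to-phoneme step: one symbol per letter.
    A real g2p module would apply the pronunciation rules of the language."""
    return [" ".join(word.lower())]


def pronunciations(word):
    # Step 1: search the vocabulary first; fall back to g2p if the search fails.
    return VOCABULARY.get(word.lower()) or g2p(word)


def compile_grammar(grammar_type, names):
    """Compile a list of names into a word trie and return it in serialized form."""
    root = {"next": {}, "meaning": None}
    lexicon = {}
    for index, name in enumerate(names):
        node = root
        for word in name.split():
            lexicon[word] = pronunciations(word)
            # Step 2: words can be decoded only in the order the grammar allows.
            node = node["next"].setdefault(word, {"next": {}, "meaning": None})
        # Step 3: assignment to a meaning variable when this path is decoded.
        node["meaning"] = {"selected_index": index}
    compiled = {"type": grammar_type, "root": root, "lexicon": lexicon}
    # Step 4: emit the structure in a form that can be loaded immediately.
    return pickle.dumps(compiled)


def accepts(compiled_bytes, phrase):
    """Return the meaning of an in-grammar phrase, or None if out of grammar."""
    node = pickle.loads(compiled_bytes)["root"]
    for word in phrase.split():
        node = node["next"].get(word)
        if node is None:
            return None
    return node["meaning"]


blob = compile_grammar("business-names", ["Guddu de Karahi", "Example Cafe"])
```

The `accepts` helper stands in for decoding: it checks word order only and ignores audio entirely, which is enough to show that a partial or reordered name falls outside the grammar.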
  • this grammar may be created speculatively and that this action may not require any great prescience on the part of the system.
  • other grammars for example covering the user's personal contacts, the businesses in the user's personal calendar, the artists, song titles, and album names stored in the user's iPod® or USB flash drive, or all the numbered street addresses for the city in which the user is currently located, may have been created on an equally speculative basis. The net result is that effectively a panoply of specially adapted secondary recognizers may be available for use in the secondary recognition step, to process various spans that may be identified, of various types.
  • The system may now wait for further input.
  • the audio input is first passed to the primary recognizer, which generates an initial nominal transcription of its input, labeled with word timings. This is referred to herein as the primary recognition step.
  • the user's audio input may also be retained for later processing by the secondary recognizer.
  • this imperfect initial transcription may be presented to the natural language understanding module.
  • This module processes the input word sequence, and determines by application of standard methods of computational linguistics to the first six words of the transcription— “tell me how to get to”—that the user is making a request for directions. Noting that the rest of the transcription— “go to do a call Rocky”— is both nominally somewhat nonsensical, and also occupies a position in the phrase as a whole that in conventional conversational English would likely comprise the name of the target to be navigated to, the language understanding module also determines that the portion of the audio input corresponding to this part of the transcription probably contains a spoken rendering of one of the displayed business names.
  • this selected portion of the audio input is referred to as the proper name entity acoustic span, or acoustic span of a proper name entity, or just acoustic span for short, that is now to be processed by the secondary recognizer.
  • the acoustic span in question is known to begin at 1330 ms into the audio input, corresponding to the start of the word "go,” and end at 2900 ms into the audio input, corresponding to the end of the word "Rocky.” In this way the extent associated to the span has been determined. This entire operation comprises the language understanding step of this example.
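Determining the extent of a span from word timings can be sketched as below. Only the two boundary values (1330 ms and 2900 ms) come from the example; every other timing, the `TimedWord` type, and the function name are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class TimedWord:
    text: str
    start_ms: int
    end_ms: int


def span_extent(words, first_index, last_index):
    """Extent of a span: start of its first word through end of its last word."""
    return words[first_index].start_ms, words[last_index].end_ms


# Primary transcription with word timings (interior timings are made up).
words = [
    TimedWord("tell", 0, 200), TimedWord("me", 200, 350),
    TimedWord("how", 350, 550), TimedWord("to", 550, 650),
    TimedWord("get", 650, 900), TimedWord("to", 900, 1100),
    TimedWord("go", 1330, 1500), TimedWord("to", 1500, 1620),
    TimedWord("do", 1620, 1800), TimedWord("a", 1800, 1900),
    TimedWord("call", 1900, 2300), TimedWord("Rocky", 2300, 2900),
]
extent = span_extent(words, 6, 11)  # words "go" through "Rocky"
```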
  • the system may then proceed to the secondary recognition step.
  • the language understanding module having determined that a particular segment of the audio input probably comprises one of the displayed business names
  • the already-compiled grammar which enables recognition of these names (and in some embodiments, only these names) is loaded into the secondary recognizer.
  • the acoustic span transcription "Guddu de Karahi” may be interpolated into the primary recognizer's transcription, replacing the word sequence "go to do a call Rocky” that was initially guessed for this span, thereby yielding a final transcription "tell me how to get to Guddu de Karahi.”
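The interpolation of the span transcription into the primary transcription amounts to a splice over word indices, sketched here. The function name and the use of word indices (rather than times) to locate the span are assumptions of this sketch.

```python
def interpolate(primary_words, first_index, last_index, span_words):
    """Replace primary_words[first_index..last_index] with span_words."""
    return primary_words[:first_index] + span_words + primary_words[last_index + 1:]


primary = "tell me how to get to go to do a call Rocky".split()
final = " ".join(interpolate(primary, 6, 11, "Guddu de Karahi".split()))
```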
  • the symbolic meaning directionApnrCommand is populated with a parameter identifying the navigation target, yielding the complete symbolic meaning
  • This symbolic meaning may then be processed by other functional elements of the system, extracting information (including location) pertinent to the third business from the five-element array of such objects, executing appropriate operations to find a route from the user's current position to the indicated location, rendering a map showing this route, etc.
  • the map may depict other associated information as deemed useful and pertinent to the application context by the system designers.
  • the resulting image in this example is shown in Figure 4.
  • the disclosed embodiments provide many advantages over conventional systems.
  • the "Achilles heel" of grammar-based ASR technology is that the user must speak within the grammar or the technology will not function.
  • the disclosed embodiments do not comprise simply causing the user to stay within a grammar, when speaking his or her request.
  • the disclosed embodiments allow the user to speak freely, using the words and phrase structure that come naturally when expressing the desired action.
  • the associated audio input may then be analyzed by the primary recognizer and the language understanding module to determine if a proper name entity, substantive to the correct processing of the command, has in fact been spoken. If so the proper name entity's extent within the audio input, and putative type, are identified, and this specific segment of audio may then be processed by the secondary recognizer, adapted to recognize the proper name entity within a relatively small list of possibilities.
  • This narrowing of the task in two important senses— first by pruning away the freely formed and now-extraneous audio that would confound a grammar-based ASR system, and second by adapting the secondary recognizer to drastically reduce the space of possible transcriptions— may allow the secondary recognition step to succeed.
  • This analysis and subsequent narrowing may in turn depend upon the ability, afforded by various disclosed embodiments, to integrate information and insights normally outside the scope of ASR technology— specifically in this case that a prior command generated a list of businesses and hence that a followup command naming one of them is not unlikely, plus the observation that the phrase "tell me how to get to,” or a myriad of other phrases of similar meaning, was probably followed by a proper name or other vocalization of a navigation target.
  • Some embodiments may even include several distinct and competing mechanisms for identifying acoustic spans, with all of them processed by distinct and separately adapted secondary recognizers, with a final determination of one or a few surviving hypotheses (surviving for presentation to and ultimate disambiguation by the user) performed by the "score fusion and hypotheses selection" module discussed herein.
  • Various embodiments divide the speech decoding process (and as we shall see, the meaning extraction or "understanding” process as well) into a primary recognition step, one or more understanding steps, one or more adaptation object generation steps (which may comprise two stages, an object preparation stage and an object finalization stage), one or more secondary recognition steps, and (optionally) a score fusion and hypothesis selection step.
  • the primary recognition step comprises recognition of the input utterance by a conventional open dictation ASR system, though this system may have been specially prepared to assist the language understanding module to identify the extent and type of one or more acoustic spans. This yields a transcription of the utterance (and possibly alternate transcriptions as well) in the vocabulary of the open dictation recognizer, plus nominal start and end times for each transcribed word.
  • the primary recognition step which may be accomplished by the nominally more powerful, more flexible and more computationally demanding primary recognizer (the object of the comparative "more" being the secondary recognizer), may not bear the full responsibility of generating the final transcription of the input utterance. Instead, the main objective of this step may be to provide a sufficiently accurate transcription for the language understanding module to do its work of hypothesizing one or more symbolic meanings for the user's command, including the extent and type of any proper name entity acoustic spans that may figure in the full specification of this meaning. It should be clear from the example in the overview that the words in the primary recognizer's transcription may be far from correct. In fact the primary recognizer may provide several alternate transcriptions of the input waveform, each one subject to the processing steps described below; a means for selecting the final preferred transcription and its associated meaning will be explained shortly.
  • the output of the primary recognition step comprising (1) a nominal transcription, (2) the start time and end time within the waveform of each transcribed word at the granularity of a single frame and (3) possibly other information, described further below, of use in determining the extent and type of any acoustic spans, may then be passed to the understanding step.
  • the understanding step applies the methods of natural language understanding to hypothesize one or more symbolic meanings for the nominal transcription and as appropriate to identify the extent and type of any proper name entity acoustic spans that contribute to this meaning.
  • Each acoustic span becomes an element of the hypothesis, to be processed by an associated secondary recognition step to yield the span's transcription and meaning.
  • various embodiments may speculatively create or make use of various adaptation objects, appropriate to the type or types of spans to be processed.
  • This adaptation may comprise preparing the secondary recognizer to recognize completely novel words, restricting the secondary recognizer so that it does not use certain other words in its vocabulary or uses them only in particular orders, or both.
  • the output may comprise a collection of hypotheses, each one containing one or more acoustic spans.
  • the third step which is the secondary recognition step.
  • a grammar-based speech recognizer which has been specially adapted to the span type.
  • a new grammar can be generated or compiled, or a grammar with unpopulated placeholder "slots" can be completed and made ready for service, in a few hundred milliseconds or less in some embodiments.
  • The output of this secondary recognition step, performed solely on the subject acoustic span using a suitably specialized grammar, may be taken as the nominal transcription of the span.
  • the literal sequences of this same grammar may be labeled, in appropriate and conventional ways, with the meaning of each potential decoding path through the grammar.
  • the act of transcribing the span may at the same time generate an appropriate symbolic meaning, associated to the transcription.
  • grammar-based ASR fails when presented with freely-formed human speech, which typically lies outside the scope of even elaborate grammars
  • the disclosed embodiments perform well when presented with in-grammar utterances.
  • the preceding processing stages establish this desired condition.
  • an acoustic span which is known, or more correctly hypothesized by earlier processing steps, to consist of the name of one of a few businesses drawn from those listed in a user's daily appointment calendar for a particular day. If that sole portion of the original utterance is provided as the audio input to a grammar-based ASR system, and the grammar used for decoding comprises all and only the business names extracted from the user's calendar for that day, then it is highly likely that the correct proper name will be decoded.
  • Such secondary recognitions may be performed for each of the acoustic spans identified by the prior decoding stages, until a final transcription is obtained for the whole of the original utterance. If no competing alternative meaning hypotheses were proposed by the prior processing steps, then the decoding is complete. However, this may not always be the case. More likely, several alternative transcriptions, each with one or more associated meaning hypotheses, may have been generated, each hypothesis having NLU and ASR confidence scores. It remains to select the final preferred decoding, or at a minimum, assign a confidence score to each whole decoding, and provide a ranked list of alternatives. As differing hypotheses may comprise different numbers of acoustic spans, this may force the comparison of hypotheses that are based upon different numbers of confidence scores. One will recognize various approaches to combine such scores in a consistent manner, to allow meaningful and reliable score-based ranking. The NLU system itself may be involved in generating this ranking.
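One possible way to compare hypotheses that carry different numbers of confidence scores is to normalize by the number of scores, for instance with a geometric mean. The sketch below illustrates that single choice; it is an assumption made for illustration, not the combination method of the described embodiments, and the hypothesis labels and score values are hypothetical.

```python
import math


def normalized_confidence(scores):
    """Geometric mean of confidence scores, each in (0, 1]."""
    if not scores or any(not 0.0 < s <= 1.0 for s in scores):
        raise ValueError("scores must be a non-empty list of values in (0, 1]")
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


# Two hypothetical hypotheses bearing different numbers of confidence scores.
candidates = {
    "one score": [0.70],
    "three scores": [0.85, 0.85, 0.85],
}
by_product = sorted(candidates, key=lambda k: math.prod(candidates[k]), reverse=True)
by_geomean = sorted(candidates, key=lambda k: normalized_confidence(candidates[k]),
                    reverse=True)
```

A raw product penalizes the hypothesis with more scores even though each of its scores is higher; the normalized score removes that bias, which is why the two rankings differ here.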
  • each admissible path comprises the entirety of one individual complete transcription.
  • the original utterance in its entirety may then be decoded against this grammar by the secondary recognizer operating in n-best mode, yielding an acoustic confidence score for each complete hypothesis, nominally expressed as $P(T_i \mid A)$.
  • $T_i$ is the text associated with the $i$th hypothesis
  • A is the acoustic input, which is constant across the hypotheses being ranked. It is possible that this will suffice, and a ranking of hypotheses may be made purely upon this acoustic score.
  • NLU confidence scores can be normalized to probabilities, they may be meaningfully combined with the ASR confidence scores by the following application of the laws of conditional probability.
  • $T_i$ and $A$ denote the transcription and acoustic input as above, and let $M_i$ denote the symbolic meaning assigned by NLU processing to the $i$th hypothesis.
  • $P(M_i \mid T_i)$ for the NLU confidence score of the $i$th hypothesis meaning, given the associated transcription.
  • $P(M_i, T_i \mid A) = P(M_i \mid T_i, A) \cdot P(T_i \mid A)$.
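The product rule above suggests a simple ranking procedure, sketched below. The sketch additionally assumes that the meaning depends on the audio only through the transcription, so that the NLU confidence score can stand in for the first factor; that approximation, the field names, and the numerical values are assumptions made for illustration.

```python
def joint_score(nlu_confidence, asr_confidence):
    """Approximate P(M_i, T_i | A) as P(M_i | T_i) * P(T_i | A).

    Assumes P(M_i | T_i, A) is well approximated by P(M_i | T_i)."""
    return nlu_confidence * asr_confidence


# Hypothetical hypotheses with normalized NLU and ASR confidence scores.
hypotheses = [
    {"meaning": "phone_call", "p_m_given_t": 0.7, "p_t_given_a": 0.7},
    {"meaning": "directions", "p_m_given_t": 0.9, "p_t_given_a": 0.6},
]
ranked = sorted(
    hypotheses,
    key=lambda h: joint_score(h["p_m_given_t"], h["p_t_given_a"]),
    reverse=True,
)
```

Neither score alone would choose the same winner in every case: here the hypothesis with the lower ASR score ranks first because its NLU score is higher.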
  • the disclosed embodiments may apply, e.g., to: business names (resulting from a search); business names (retrieved from a personal phone book, personal calendar, or both); personal contact names (retrieved from a phone book, or from a calendar); locations (numbered street addresses); locations (intersections); locations (landmarks); music library search; and video library search.
  • the adaptation object may be constructed from the names retrieved in the just-executed search.
  • the adaptation object may be constructed from business names retrieved from a personal phone book, personal calendar or both, possibly restricted to the current day's personal calendar.
  • the adaptation object may be constructed from personal contact names retrieved from a personal phone book, personal calendar or both, possibly restricted to today's personal calendar.
  • the adaptation object may be one of many constructed well in advance, each object comprising the valid street addresses for each street in every political subdivision in a country (typically a city), the adaptation object actually used being determined either by the user's current location as determined say by GPS, by an explicit or implicit preceding request for a particular such subdivision, or by the identity of the political subdivision as decoded by the primary recognizer from some part of the user's utterance (e.g., the transcribed words "Menlo Park" in the primary recognizer transcription "tell me how to get to three thirty three Ravenswood Avenue in Menlo Park").
  • the adaptation object may likewise be one of many constructed well in advance, each object comprising intersections of each street in every given political subdivision in a country, with the adaptation object actually used determined as described above.
  • various adaptation objects are constructed from the artist names, song names, album names, and genre names in a user's personal music storage device.
  • various adaptation objects are constructed from the actor names, director names, and genre names in a given catalog of video content to be navigated.
  • this list is intended to be merely exemplary and not exhaustive. The disclosed embodiments may be applied in other ways as well.
  • Allophones may be used to address this phenomenon, wherein the templates or models used to match a particular phoneme are made to depend upon the sequence of phonemes that precede it, and those that follow it.
  • the secondary recognition step or more properly the generation of the adaptation object associated to this step in some embodiments, does not account for coarticulation.
  • the adaptation object in the example, a grammar of business names— that was prepared shows no words either preceding or following the listed names.
  • Such a grammar would be appropriate for decoding speech that consists of one of these names having been spoken in isolation, with no preceding or following words. But in fact the speech to be decoded is an extract from a longer, fluently spoken phrase.
  • some embodiments employ a method for incorporating acoustic context into the adaptation object generation step, so that the secondary recognizer may accommodate coarticulation effects.
  • the following discussion is a running example that demonstrates the operation of one embodiment of the method. The example is discussed in relation to the Figures 5, 6, and 7.
  • the nominal adaptation object comprising the contact names say in the user's address book, has the structure illustrated in Figure 5. That is to say, it comprises a list of alternatives, each one a personal contact name, and each labeled with some suitable meaning variable command to be executed if the associated literal sequence is decoded by the secondary recognizer. But it contains no information about the acoustic context in which the contact name was spoken.
  • the audio excerpt that is processed comprises the extent of the full span, from the nominal start of the first word of the acoustic prefix through the nominal end of the last word of the acoustic suffix (this is why this is also referred to herein as the "decode span," because the full extent is in fact processed— decoded— by the secondary recognizer).
  • this has the effect of causing the secondary recognizer to perform a forced alignment between the prefix literals and their corresponding audio, and likewise between the suffix literals and their audio.
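The relationship between the target span, its acoustic prefix and suffix, and the full decode span can be sketched as follows. The utterance, the timings, the contact name "Pat Example", and the choice of two context words on each side are all hypothetical.

```python
def decode_span(words, first_index, last_index, context=2):
    """Widen a target span by up to `context` words of acoustic prefix and suffix.

    `words` is a list of (text, start_ms, end_ms) tuples from the primary recognizer."""
    prefix = words[max(0, first_index - context):first_index]
    suffix = words[last_index + 1:last_index + 1 + context]
    full = prefix + words[first_index:last_index + 1] + suffix
    return {
        "prefix_literals": [w[0] for w in prefix],
        "suffix_literals": [w[0] for w in suffix],
        # From the start of the first prefix word to the end of the last suffix word.
        "start_ms": full[0][1],
        "end_ms": full[-1][2],
    }


# Hypothetical timed transcription; the target span is "Pat Example".
timed = [
    ("please", 0, 300), ("call", 300, 600), ("Pat", 600, 900),
    ("Example", 900, 1500), ("right", 1500, 1750), ("now", 1750, 2100),
]
result = decode_span(timed, 2, 3)
```

When the target sits at the very start or end of the utterance, the prefix or suffix is simply shorter, and the decode span begins or ends with the target itself.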
  • the adaptation object that is, in this embodiment, the personal contact name grammar— may be populated with the prefix and suffix words as determined by the primary recognition step. This would seem to present a challenge to the desire to achieve low latency decoding of the user's spoken phrase, as part of the adaptation step is now executed between the language understanding step and one or more of the secondary recognition steps.
  • Various embodiments contemplate a grammar with so-called "slots,” which are placeholders for literals to be populated at the very last moment, with very low latency.
  • This "slotted grammar” with a target section comprising the names of the user's personal contact list, and with four unpopulated slots for the acoustic prefix literals and acoustic suffix literals, may be speculatively created at the system's leisure as soon as this name list is available.
  • This grammar is illustrated in Figure 7. Creating and compiling this grammar, leaving the slots unpopulated, is the preparation stage of the adaptation object generation step. It is then held ready for use at the appropriate moment.
  • This slotted grammar may then be populated with the appropriate words extracted from the primary recognizer's transcription, in the finalization stage of the adaptation object generation step.
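The two stages can be sketched with a small data structure: the target names are fixed in the preparation stage, and the slots are filled in the finalization stage. The class and method names, the split of the four slots into two prefix and two suffix slots, and the contact names are assumptions of this sketch; a real system would operate on a compiled grammar rather than Python lists.

```python
from dataclasses import dataclass, field


@dataclass
class SlottedGrammar:
    targets: list  # fixed during the preparation stage
    prefix_slots: list = field(default_factory=lambda: [None, None])
    suffix_slots: list = field(default_factory=lambda: [None, None])

    def finalize(self, prefix_words, suffix_words):
        """Finalization stage: populate slots; unused slots stay empty (None)."""
        if (len(prefix_words) > len(self.prefix_slots)
                or len(suffix_words) > len(self.suffix_slots)):
            raise ValueError("more context words than available slots")
        pad_prefix = len(self.prefix_slots) - len(prefix_words)
        pad_suffix = len(self.suffix_slots) - len(suffix_words)
        # Prefix words sit next to the target on its left, suffix words on its right.
        self.prefix_slots = [None] * pad_prefix + list(prefix_words)
        self.suffix_slots = list(suffix_words) + [None] * pad_suffix

    def paths(self):
        """Word sequences the finalized grammar admits, skipping empty slots."""
        prefix = [w for w in self.prefix_slots if w is not None]
        suffix = [w for w in self.suffix_slots if w is not None]
        return [prefix + name.split() + suffix for name in self.targets]


# Preparation: done speculatively as soon as the (hypothetical) names are known.
grammar = SlottedGrammar(targets=["Pat Example", "Sam Sample"])
# Finalization: done once the primary transcription supplies the context words.
grammar.finalize(["please", "call"], ["right"])
admitted = grammar.paths()
```

Because finalization only writes a handful of words into an existing structure, it is far cheaper than rebuilding the grammar, which is the point of leaving slots open.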
  • Figure 8 is a block diagram depicting various components in a speech processing system 800 having server and client proper name resolution modules as may occur in some embodiments.
  • the depicted topology is merely an example provided for purposes of explanation and one will recognize that variations will readily exist.
  • the depicted modules may be relocated from the client to server and vice versa (e.g., fulfilment may be performed at the server and the results returned to the client).
  • the depicted placement of components and topology is merely one example of many possible configurations.
  • the depicted system may be used to address utterances which do not include proper entities (e.g., "Show me nearby restaurants") as well as utterances which do include proper entities (e.g., "Tell me how to get to Guddu de Karahi").
  • a user 805 may speak a command 810 to a user interface 820 of a client device 815.
  • the user may ask "Show me nearby restaurants".
  • the client device 815 may be an iPhone®, iPad®, tablet, personal computer, personal digital assistant, etc., or any device able to receive audio from the user 805.
  • the user interface 820 may convert the incoming command to a waveform 825a.
  • the waveform 825a may be stored locally before being transmitted to the server 850. Storing the waveform locally may allow portions of the waveform to later be considered by the client device, based on the hypotheses, without requesting that the waveform 825b be transmitted back to the client from the server (one will recognize that in some other embodiments the server may instead transmit all or a portion of the waveform back to the client).
  • the server 850 may submit the waveform 825b to a primary recognizer 830.
  • Primary recognizer 830 may be an "open-dictation" ASR system as known in the art.
  • the primary recognizer 830 may employ a lexicon associating energy patterns in a waveform with phonetic components to identify words corresponding to the phonetic components. Bayesian techniques as known in the art may be applied.
  • the server system 850 may include a Natural Language Understanding module (NLU) 855 configured to convert the transcription and word timings from the primary recognizer 830 into hypotheses.
  • the hypotheses 815 and associated metadata may then be transmitted across a medium (e.g., the Internet) to client system 815.
  • the hypotheses metadata may include the results of the ASR, such as the timestamps for word occurrences and the confidence of recognition for a given word.
  • the hypotheses may be received at a secondary recognizer 860.
  • the secondary recognizer 860 may be a grammar based ASR system as discussed herein. If the hypotheses do not include acoustic spans, the hypotheses may pass through the scoring module 885 (if necessary) to identify a best match and proceed to the fulfillment unit 890, possibly as a symbolic representation, which will attempt to fulfill the request (e.g., make a request to Yelp®). For the example request "Show me nearby restaurants", e.g., the fulfilment unit 890 may contact a map server and request a list of restaurants within proximity to the user's 805 coordinates. Once the results 870 have been retrieved, the client device may present the results to the user.
  • the secondary recognizer 860 may consult adaptation object generator 865 to identify an appropriate grammar of proper names from the various components 840a-e of the user device. Having identified possible proper names for the acoustic spans, the secondary recognizer 860 may apply various of the identified proper entities and determine corresponding confidence levels. These decodings may be referred to the scoring module 885 so that the most likely candidate may be identified. The most likely candidate may then be passed to the fulfillment module 890 as discussed above.
  • Figure 9 is a flow diagram depicting the proper name recognition process at a high level for various embodiments using automatic speech recognition (ASR) and natural language understanding (NLU) components.
  • recognition of a proper name within an utterance may generally proceed in four steps, depicted in Figure 9. These steps are generally referred to herein as "Primary Recognition” 905, “Understanding” 910, “Secondary Recognition” 915, and “Fusion” 920.
  • the client may perform the object preparation stage of "Adaptation Object Generation” at block 900a, though one will recognize that the depicted order is merely for exemplary purposes and the process block may occur at other times in other embodiments than as depicted here.
  • Primary Recognition 905 for the example phrase "Tell me how to get to Guddu de Karahi" may occur at the open dictation (ASR) unit 830.
  • the output 805 of "Primary Recognition” 905 may include a nominal transcription, the start and end time within the waveform of each transcribed word at the granularity of a single frame, plus typically one or more putative proper name entity acoustic spans and information relating to the type of each span.
  • the NLU 855 may then perform the "Understanding" 910 step, applying the methods of natural language understanding to hypothesize one or more types to each span, possibly also adjusting the span boundaries (start frame and end frame) as assigned by the primary recognition step, and may provide additional information, such as potential shim words, and prefix and suffix acoustic context words, all defined in the sequel, that may aid in the decoding of each span.
  • the server may infer the presence of proper names in the text as described below and prepare one or more hypotheses 815 for their resolution.
  • the hypotheses 815 may be submitted to the client.
  • the client may then identify proper name entities from the various components 840a-e of the user device.
  • a GPS 840a component may provide relevant street names near the user's location
  • an address book 840c may store the user's 805 contacts
  • a search cache 840d may reflect recent inquiries and operations performed by the user 805
  • a calendar 840b may reflect meetings and events associated with user 805. The content from one or more of these components may be considered when identifying proper name entities as discussed herein.
  • The client may perform the finalization stage of "Adaptation Object Generation" at block 900b.
  • the client may consult various local modules (e.g., the search cache) to identify appropriate proper name entities to consider in the grammar for "Secondary Recognition" 915.
  • the steps need not necessarily proceed in this order and that "Adaptation Object Generation” may occur earlier in the process.
  • the client may then perform "Secondary Recognition" 915, by seeking to substitute various proper name entities for the acoustic spans to achieve suitable local resolution results.
  • the client may use an ASR or a separate grammar-based ASR system to determine the probability that a given portion of the waveform corresponds to a proper name entity identified from the components 840a-d.
  • each such span may now be decoded by a grammar-based speech recognizer within client-side proper name resolutions, using a grammar that has been specially adapted to the type and individual user of the system based upon components 840a-d.
  • while the adaptation of an open dictation recognizer to a specialized vocabulary or context may be computationally expensive, a new grammar, or a grammar with unpopulated placeholder "slots," can be generated and ready for service on the client in a few seconds or less.
  • the output of the "Secondary Recognition" 915 step, performed on the subject acoustic span, using a suitably specialized grammar, may be considered the nominal transcription of the span.
  • the literal sequences of this same grammar may be labeled with the meaning of each potential decoding path through the grammar.
  • the generation of the transcription can at the same time generate an appropriate symbolic meaning, for the selected decoding path. If only the portion of the original utterance associated with the proper name is provided as the audio input to a grammar-based ASR system, and the grammar used for decoding comprises all and only the business names extracted from the user's calendar for that day, then it is highly likely that the correct proper name will be decoded.
  • the "Secondary Recognition" 915 may be complete and the decoded result may be submitted for fulfillment.
  • the client module may select the final preferred decoding, or assign a confidence score to each whole decoding, and provide a ranked list of alternatives to the server.
  • because differing hypotheses may comprise different numbers of acoustic spans, this may force the comparison of hypotheses that are based upon different numbers of confidence scores.
  • the "Fusion" 920 of the different scores may occur when the proper names are considered in the context of the NLU unit.
  • the "Primary Recognition" step 905 may be accomplished by the open dictation ASR 830 technology, which may be more powerful and more flexible, but more computationally demanding, than a grammar-based ASR.
  • the "Primary Recognition" step 905 may not bear the principal responsibility for generating the final transcription of the input utterance in some embodiments recognizing proper names. Rather, this step may determine the portion or portions of the input waveform that comprise one or more of the proper name entities.
  • the open dictation ASR 830 may simply note portions of the waveforms for which the identified words have exceedingly low confidence levels.
  • proper name entities may contain or be comprised wholly of words not present in the vocabularies of the ASR systems as normally constituted.
  • the open dictation ASR component of the system prepared by the methods described here, requires no further adaptation or modification to enable recognition of names and entities that are not even present in its vocabulary. Thus, this component may be shared by a multitude of users, with the necessary adaptations to enable recognition of proper names confined to other components of the system.
  • Figure 10 is a flow diagram depicting various steps in a proper name recognition process as may occur in some embodiments.
  • the system may receive an utterance waveform from a user. Where the process is divided between client and server devices a copy of the waveform may be retained at the client as discussed herein.
  • a "standard" open dictation ASR may be applied to the waveform. This may produce a complete textual word for every aspect of the waveform, even when the confidence levels are exceptionally low.
  • some embodiments further contemplate applying a modified version of the open dictation ASR to the waveform to achieve one or more textual readings that explicitly identify words that may reflect proper names (e.g., based on the highest possible confidence level for a word still failing to exceed a threshold). These modified systems may indicate placeholder words for the potential proper names (e.g., fna, lna, and sa designations as discussed herein).
  • Block 1010 may roughly correspond to the "Primary Recognition" step 905.
  • Block 1020 may roughly correspond to the "Understanding" step 910.
  • the system may determine if one or more word confidence values are deficient, e.g., have confidence levels falling below a threshold, or if the modified systems have otherwise identified one or more potential proper names. Where all of the confidence values exceed a threshold, or where no proper name candidates are otherwise identified, the system may transition to block 1035. At block 1035, the system may complete processing to generate a symbolic representation of the request. At block 1040, the system may attempt fulfilment using the symbolic representation and return any results to the user. As discussed above, one will recognize that fulfillment is just one possible application for the above processes. Accordingly, blocks 1035 and 1040 may readily be substituted by other applications, e.g., performing operations on the client device.
  • the system may generate one or more hypotheses based upon the deficient word(s) that include acoustic spans as described in greater detail herein.
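  • The confidence check and span generation described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the `Word` record, the `deficient_spans` helper, and the rule of merging consecutive low-confidence words into one span are assumptions. The threshold of 300, the confidences 110 and 150, and the 1220-1750 span come from the example discussed herein; the other timings and confidences are made up.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Word:
    """One transcribed word with its timing and recognizer confidence (hypothetical record)."""
    text: str
    start_ms: int
    end_ms: int
    confidence: int


def deficient_spans(words: List[Word], threshold: int = 300) -> List[Tuple[int, int]]:
    """Merge runs of consecutive low-confidence words into (start_ms, end_ms) spans."""
    spans: List[Tuple[int, int]] = []
    run_start: Optional[int] = None
    run_end: Optional[int] = None
    for w in words:
        if w.confidence < threshold:
            if run_start is None:
                run_start = w.start_ms  # a new deficient run begins here
            run_end = w.end_ms
        elif run_start is not None:
            spans.append((run_start, run_end))  # a confident word closes the run
            run_start = None
    if run_start is not None:
        spans.append((run_start, run_end))  # run reached the end of the utterance
    return spans


words = [
    Word("where", 0, 400, 910),
    Word("is", 400, 600, 880),
    Word("goose", 1220, 1520, 110),
    Word("karate", 1520, 1750, 150),
    Word("the", 1800, 1900, 870),
]
print(deficient_spans(words))
```

  • An empty result corresponds to the case where every confidence value exceeds the threshold and no acoustic span needs to be proposed.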
  • the system e.g., the client device, may decode each probable first name segment against its first name grammar.
  • Block 1050 may generally correspond to the "Secondary Recognition" 915 step.
  • the "Secondary Recognition" 915 step reduces to little more than inserting the most likely grammar decoding result in the appropriate location in the text output by "Primary Recognition" 905 and/or "Understanding" 910 operations.
  • the system may determine which of the proposed proper entities for the acoustic spans (and/or the confidence levels associated with a hypothesis without acoustic spans) best corresponds to the utterance. For example, the system may identify the resolution with the highest cumulative confidence values. This determination may be made by considering one or more of the original, open dictation ASR confidence values, the original NLU confidence values, the ASR grammar-based confidence values determined at block 1050, and possibly a second NLU determination using the ASR grammar-based results, as part of a "Score Fusion" 920.
  • the system may convert the proper name to symbolic form at block 1060 and present the symbolic representation of the entire utterance for fulfilment. Conversely, if no appropriate resolutions are found at block 1055, the system may announce a failure at block 1065. In some embodiments, rather than announce failure, the system may instead attempt fulfillment with the words having deficient probabilities or with the closest approximates.
  • Figure 11 is an example hypothesis corpus as may be generated in some embodiments. These example hypotheses may be generated as part of block 1045.
  • a waveform 1105 may be associated with the user utterance "Where is Guddu de Karahi, the restaurant, located?" 1110.
  • the client ASR/NLU and/or the server ASR/NLU may generate the proposed decodings 1115a-c.
  • the decoding 1115a construes the utterance as "Where is goose karate the restaurant, located?" with confidence values of 110 and 150 associated with the words "goose" and "karate" respectively. This example, where low confidence values are generated, but words are identified anyway, may correspond to the path through blocks 1015 and block 1045 discussed above.
  • confidence values may be lower than a threshold, e.g., 300, indicating an incorrect association.
  • “goose” mismatches "Guddu” and “karate” mismatches "Karahi” as the words are superficially similar. Accordingly, the 110 and 150 confidence levels reflect an unlikely match (e.g., because the spectral character of the waveform doesn't agree with the expected character of the phonemes in these words). However, if no better proper name match is found for the proposals 1115b-c, the system may accept this interpretation by default, and submit these words to the symbolic representation for fulfilment.
  • the client ASR/NLU and/or the server ASR/NLU may construe "Guddu” as "parking for” with corresponding low confidence levels 90 and 75.
  • the system may have simply identified the portion of the waveform within "Karahi” as unknowable and accordingly, a potential proper name.
  • an appropriate substituted identifier (fna, lna, etc.) may be inserted for the hypothesis.
  • decoding 1115c the system may simply have recognized the entirety of the "Guddu de Karahi" waveform as being unrecognizable. The system may recognize that two separate words were spoken, but may be unable to recognize the identity of the words.
  • decodings 1115b and 1115c, where placeholders are used to identify possible proper names, may correspond to the path through blocks 1010 and block 1020 discussed above.
  • Figure 12 is an example of a first hypothesis breakdown based upon the example of Figure 11 as may occur in some embodiments.
  • the system may recognize that the "Karahi" portion between 1520ms and 1750ms could not be recognized. Accordingly, a hypothesis 1205 having an acoustic span between 1520 and 1750 may be generated (one will recognize that the values 1520 and 1750 are merely exemplary and other representations, e.g., milliseconds, may be used).
  • the NLU may infer that this is a "Location inquiry" and ascribe a corresponding potential meaning, based upon the "Where" and “parking for" portions of the utterance.
  • the NLU may infer based upon the phrase "restaurant” that the span is of type "Business name”.
  • the Potential Meaning and Putative Type in the hypothesis may be used to localize the search for proper names on the client device. For example, knowing that this is a "Location Query" the client device may not consider first and last names in an address book, but may rather consider only meeting locations in a calendar. Only business names associated with locations in the calendar may be considered based upon the Putative Type of span.
  • the pronunciation of a proper noun may be influenced by the preceding and succeeding words. Accordingly, a prefix portion and a suffix portion may also be identified in the hypothesis for consideration by the components searching for proper names. Alternatively, some embodiments may prepend and post-pend brief segments of silence (or low-power background noise), ramped from very low power to the nominal power of the utterance (e.g., ramping up from low to nominal power, for the prepended audio, and ramping down from nominal power to low for the post-pended audio). This temporal smoothing of the audio input may eliminate abrupt audio transitions, which could be falsely matched as fricative phonemes.
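  • The temporal smoothing alternative can be sketched as below. This is a rough illustration under stated assumptions: audio is a plain list of float samples, the "nominal power" is approximated by the excerpt's root-mean-square amplitude, the ramp is linear, and the function name `ramp_pad` and its parameters are hypothetical.

```python
import random
from typing import List


def ramp_pad(samples: List[float], pad_len: int, low: float = 0.01, seed: int = 0) -> List[float]:
    """Prepend and append low-power noise ramped toward the excerpt's nominal level."""
    if not samples:
        return []
    rng = random.Random(seed)
    # Assumed definition of "nominal power": RMS amplitude of the excerpt.
    nominal = (sum(s * s for s in samples) / len(samples)) ** 0.5
    # Gains rise linearly from low * nominal up to nominal.
    steps = max(pad_len - 1, 1)
    gains = [nominal * (low + (1.0 - low) * i / steps) for i in range(pad_len)]
    lead = [g * rng.uniform(-1.0, 1.0) for g in gains]            # ramp up before the excerpt
    tail = [g * rng.uniform(-1.0, 1.0) for g in reversed(gains)]  # ramp down after it
    return lead + samples + tail


excerpt = [0.5, -0.5, 0.5, -0.5]
padded = ramp_pad(excerpt, pad_len=8)
print(len(padded))
```

  • The excerpt itself is left untouched; only the padding on either side is synthesized.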
  • Figure 13 is an example of a second hypothesis breakdown based upon the third proposal 1115c in the example of Figure 11 as may occur in some embodiments.
  • the system may recognize that the "Guddu de Karahi" portion between 1220ms and 1750ms could not be recognized. Accordingly, a hypothesis 1205 having an acoustic span between 1220ms and 1750ms may be generated, and the prefix and suffix portions adjusted accordingly.
  • the NLU may again identify the proper noun as a "Business Name" in the Putative Type but may instead consider the general inquiry as an "Address Book" query, limiting the search to only the address book contents.
  • Figure 14 is a flow diagram depicting various steps in a server-side process for proper name recognition as may occur in some embodiments.
  • process 1400 may depict the operations of block 1030 in greater detail.
  • the system may consider the next possible textual representation generated from the ASR and/or NLU. A plurality of probabilities and word timings may be included as part of the textual representation.
  • the system may prepare a hypothesis template, e.g., a data structure for holding the various hypothesis parameters.
  • the system may generate a "potential meaning" for the hypothesis by referencing NLU statistics.
  • the system may generate a "putative type" for the span by referencing NLU statistics.
  • the system may determine the timestamps associated with the beginning and end of the span. As discussed in greater detail below, the prefix and suffix to the potential proper name in question may also be included in this determination.
  • the system may consider additional potential text representations if they exist. If not, the system may proceed to block 1440, where the system may submit the queued hypotheses to the client system for analysis, or depending upon the topology, to the appropriate component for analyzing the hypotheses. For example, in some embodiments, the system may analyze the hypotheses locally on the server, or they may be both generated and analyzed on the client device.
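  • A hypothesis template of the kind described above might be represented as in the following sketch. The field names, the `nlu_guess` keyword rules (which stand in for the NLU statistics), and the word timings other than 1520 and 1750 are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Hypothesis:
    """Hypothetical container for the hypothesis parameters."""
    potential_meaning: str
    putative_type: str
    span_start_ms: int
    span_end_ms: int
    prefix_words: List[str] = field(default_factory=list)
    suffix_words: List[str] = field(default_factory=list)


def nlu_guess(texts: List[str]) -> Tuple[str, str]:
    """Toy stand-in for NLU statistics: keyword rules only."""
    meaning = "Location inquiry" if "where" in texts else "Unknown"
    span_type = "Business name" if "restaurant" in texts else "Unknown"
    return meaning, span_type


def build_hypothesis(words, first: int, last: int, context: int = 1) -> Hypothesis:
    """words: list of (text, start_ms, end_ms); first..last index the unrecognized span."""
    texts = [w[0] for w in words]
    meaning, span_type = nlu_guess(texts)
    return Hypothesis(
        potential_meaning=meaning,
        putative_type=span_type,
        span_start_ms=words[first][1],
        span_end_ms=words[last][2],
        prefix_words=texts[max(0, first - context):first],
        suffix_words=texts[last + 1:last + 1 + context],
    )


decoding = [
    ("where", 0, 400), ("is", 400, 600), ("parking", 1220, 1400), ("for", 1400, 1520),
    ("<unk>", 1520, 1750), ("the", 1800, 1900), ("restaurant", 1900, 2400),
    ("located", 2400, 2900),
]
h = build_hypothesis(decoding, 4, 4)
print(h)
```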
  • Figure 15 is a flow diagram depicting various steps in a client-side process 1500 for proper name recognition as may occur in some embodiments.
  • the client module may consider the next hypothesis received from the server module.
  • the client module may extract the potential meaning from the hypothesis.
  • the client module may extract the putative type of span from the hypothesis.
  • the client module may collect the corpus of proper nouns based upon the potential meaning and/or putative type.
  • the system may extract the timestamps associated with the putative span and (if present) the timestamps to any suffix or prefix portions.
  • the system may consider the next proper name in the identified corpus. Where substitution of the corpus member results in satisfactory confidence values at block 1535, the system may include the substituted member among the successful resolutions at block 1540.
  • the system determines which resolution to submit for fulfillment, e.g., using the score fusion processes discussed herein.
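  • The client-side loop over the corpus can be sketched as follows. This is a hedged illustration: the `resolve_span` helper, the `sources` mapping, and the `fake_scores` table are hypothetical, and the scoring callback merely stands in for a grammar-based recognizer decoding the acoustic span.

```python
from typing import Callable, Dict, List, Tuple


def resolve_span(
    putative_type: str,
    sources: Dict[str, List[str]],
    score: Callable[[str], int],
    threshold: int = 300,
) -> List[Tuple[str, int]]:
    """Try each proper name of the putative type; keep those scoring at or above threshold."""
    corpus = sources.get(putative_type, [])       # corpus selected by putative type
    scored = [(name, score(name)) for name in corpus]
    kept = [(n, s) for n, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical on-device sources, keyed by putative span type.
sources = {
    "Business name": ["Guddu de Karahi", "Gold Coast Cafe"],
    "First name": ["Toby"],
}
# Made-up confidences standing in for the secondary recognizer's output.
fake_scores = {"Guddu de Karahi": 820, "Gold Coast Cafe": 140}

resolutions = resolve_span("Business name", sources, lambda n: fake_scores.get(n, 0))
print(resolutions)
```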
  • as the hypotheses from decodings 2 and 3 illustrate, differing hypotheses may comprise different numbers of acoustic spans. This may require that hypotheses based upon different numbers of confidence values be considered so as to achieve a meaningful and reliable score-based ranking. In some embodiments, the ranking of hypotheses may be made purely upon these hypotheses.
  • where NLU confidence scores can be normalized to probabilities, they may be meaningfully combined with the grammar and/or open dictation ASR confidence scores.
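  • One way to compare hypotheses carrying different numbers of scores is a length-normalized combination. The sketch below uses a geometric mean of probabilities; this particular fusion rule and the example values are assumptions chosen for illustration, not a rule stated in this description.

```python
import math
from typing import List


def fused_score(probabilities: List[float]) -> float:
    """Geometric mean of per-component probabilities, so score count does not bias ranking."""
    if not probabilities:
        return 0.0
    if any(p <= 0.0 for p in probabilities):
        return 0.0  # a zero-probability component rules the hypothesis out
    return math.exp(sum(math.log(p) for p in probabilities) / len(probabilities))


# Hypothetical hypotheses: one with a single acoustic span, one with two.
one_span = [0.9, 0.8]        # e.g., NLU probability and one span probability
two_spans = [0.9, 0.7, 0.6]  # e.g., NLU probability and two span probabilities

ranked = sorted([("one_span", fused_score(one_span)),
                 ("two_spans", fused_score(two_spans))],
                key=lambda pair: pair[1], reverse=True)
print(ranked[0][0])
```

  • A plain product would penalize the two-span hypothesis simply for having more factors; the per-component average avoids that effect.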
  • Each individual name type grammar may be prepared from an appropriate data source, specialized to information about the user's friends and associates, location, past, current or future activities, and so on.
  • a first name grammar may be prepared by listing all the first names of any contact found in the user's address book, along with common nicknames or abbreviations; similarly with last names.
  • a street name grammar may be prepared by combining the names of all streets within a given radius of the user's current location, possibly augmented by all street names extracted from past or future appointments, as noted in the user's personal calendar, or all streets on or near any recently-driven routes, as determined by a car or telephone handset GPS system.
  • these grammars may be prepared at the client, and never communicated to the server.
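  • Grammar preparation from on-device data can be sketched as below. The nickname table, the flat kilometre coordinates, and the function names are assumptions for illustration; a real system would draw on the address book and GPS components discussed herein.

```python
import math
from typing import Dict, List, Tuple

# Illustrative nickname dictionary (hypothetical contents).
NICKNAMES: Dict[str, List[str]] = {"steven": ["steve"], "william": ["bill", "will"]}


def first_name_grammar(contacts: List[Tuple[str, str]]) -> List[str]:
    """List every contact's first name plus any common nicknames."""
    names = set()
    for first, _last in contacts:
        key = first.lower()
        names.add(key)
        names.update(NICKNAMES.get(key, []))
    return sorted(names)


def street_name_grammar(
    streets: List[Tuple[str, Tuple[float, float]]],
    here: Tuple[float, float],
    radius_km: float,
) -> List[str]:
    """Keep streets whose reference point lies within radius_km of the user's location."""
    return sorted(name.lower() for name, pos in streets if math.dist(pos, here) <= radius_km)


contacts = [("Pak", "Shak"), ("Steven", "Youngest")]
streets = [("El Camino Real", (0.5, 0.0)), ("Market Street", (40.0, 3.0))]
print(first_name_grammar(contacts))
print(street_name_grammar(streets, here=(0.0, 0.0), radius_km=2.0))
```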
  • the second approach is to introduce an additional aggregate to capture the type, say street-type-aggregate or sta, and include within it the pronunciations of all the types. This may approximately halve the number of pronunciations nominally included in sa. However it may weaken the language model, and thereby hamper the ability of the primary recognizer to find the end of the audio segment that comprises the street name.
  • Alternative Method for Language Model Generation
  • some embodiments implement a more restricted method, which may yield good results in the contexts in which the technique is likely to be the most useful.
  • This method may preprocess the entire training corpus with the NLU system, replacing proper name entities with appropriate aggregate words, in context. This will then yield n-gram counts with aggregate words, from which language models can be constructed, e.g., by conventional means, with such words as first-class objects.
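  • The corpus preprocessing idea can be sketched with bigram counts. In this illustration a simple lookup table stands in for the NLU system (which would tag entities in context), and the tiny corpus is made up.

```python
from collections import Counter
from typing import Dict, List


def replace_entities(tokens: List[str], tagger: Dict[str, str]) -> List[str]:
    """Swap each tagged proper name for its aggregate word (e.g., fna)."""
    return [tagger.get(tok, tok) for tok in tokens]


def bigram_counts(sentences: List[str], tagger: Dict[str, str]) -> Counter:
    """Count bigrams after aggregate substitution, with sentence boundary markers."""
    counts: Counter = Counter()
    for sentence in sentences:
        toks = ["<s>"] + replace_entities(sentence.split(), tagger) + ["</s>"]
        counts.update(zip(toks, toks[1:]))
    return counts


corpus = ["call barack now", "call michelle now"]
tagger = {"barack": "fna", "michelle": "fna"}  # toy stand-in for NLU tagging
counts = bigram_counts(corpus, tagger)
print(counts[("call", "fna")])
```

  • Because both names collapse to the same aggregate word, their contexts pool into a single set of counts, which is what lets the aggregate act as a first-class vocabulary item.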
  • Some embodiments may adopt a hybrid approach in which the conditional probabilities p(fna-i | h) could be determined by this method, while values for p(x | h'), with fna-i ∈ h', could be determined by the previously outlined method.
  • the "Primary Recognition" 905 and "Understanding" 910 steps may return the sequence fna-i lna-j.
  • the method proposed above would perform a "Secondary Recognition" 915 decoding of the audio segment associated to fna-i with a grammar of first names, and an independent "Secondary Recognition" 915 decoding of the audio segment associated to lna-j with a grammar of last names.
  • the command includes a request to send a message to a recipient
  • it may be necessary to identify the intended recipient of a message. For example, consider the utterance "send a message to Barack thanks so much for the invitation comma we'd love to visit you and Michelle the next time we're in Washington".
  • the output of the primary recognizer may then very well read: send a message to fna thanks so much for the invitation comma we'd love to visit you and fna the next time we're in Washington
  • the NLU may be located at the server rather than the client. Without further communication from the client back to the server, of the "Secondary Recognition" 915 recognizer results, there may be no way to perform the required analysis to determine that "Barack" is indeed the name of the intended recipient.
  • the NLU will be able to work out the position of the intended recipient, from the information that a particular token (fna) in the decoding is likely to be a person's proper name, and from the words that appear adjacent or near to this token.
  • This information may be communicated to the client, where the "Secondary Recognition" 915 recognizer can definitively identify the recipient name.
  • Other elements of the client software may process this name to determine a suitable destination address.
  • the putative type of the audio segment could be changed, in a manner understandable to the client, to communicate to the client both the grammar to be used by the secondary decoder, and the special meaning, if any, of the audio segment.
  • the first instance of fna could be changed to a type fna-recipient, with the client suitably modified to decode the associated audio segment against the first name grammar as before, and then interpret the result as the name of the intended recipient.
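  • The retyping of a span to carry a role can be sketched as follows. The "message to" adjacency rule is a toy stand-in for the NLU, and the convention of splitting the type at the first hyphen is an assumption made for this illustration.

```python
from typing import List, Optional, Tuple


def tag_recipient(tokens: List[str]) -> List[str]:
    """Retype the first fna that directly follows 'message to' as fna-recipient."""
    out = list(tokens)
    for i in range(2, len(out)):
        if out[i] == "fna" and out[i - 2:i] == ["message", "to"]:
            out[i] = "fna-recipient"
            break
    return out


def grammar_and_role(span_type: str) -> Tuple[str, Optional[str]]:
    """Client side: split a type such as 'fna-recipient' into grammar name and role."""
    base, _, role = span_type.partition("-")
    return base, (role or None)


primary = ("send a message to fna thanks so much for the invitation comma "
           "we'd love to visit you and fna the next time we're in Washington").split()
tagged = tag_recipient(primary)
print(tagged[4], grammar_and_role(tagged[4]))
```

  • Both spans are still decoded against the first name grammar; only the retyped one is interpreted as the recipient.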
  • acoustic span may improve secondary decoding.
  • some embodiments allow for coarticulation effects in selecting phone models during the decoding process.
  • These issues may be dealt with by expanding the acoustic span to include some number of acoustic prefix words and acoustic suffix words, which are those immediately preceding and following the nominal proper name entity. This yields the important distinction between the target span, which is the span of words comprising the nominal proper name entity, and the full span, which includes the audio putatively corresponding to the just-mentioned acoustic prefix words and acoustic suffix words.
  • the secondary recognition grammar must then be structured in such a way that allows decoding of these words. Indeed, it may be helpful if the secondary recognition proceeds through the first acoustic prefix word and the final acoustic suffix word.
  • By making the decoding of this acoustic prefix word optional (e.g., by providing an epsilon-path around it), the appropriate frames of audio may thereby participate in the successful decoding of the name "Toby."
  • Shim words are words that should be present in the primary recognizer decoding, adjacent to the target span, but for which the audio has erroneously been incorporated into the target span.
  • the grammar may be enlarged with optional paths that include such shim words.
  • where it is not known whether shim words are necessary, or what the shim words should be, they may be hypothesized using a forward (conventional) language model that identifies likely forward extensions of the acoustic prefix words.
  • a backward language model that identifies likely backward extensions of the acoustic suffix words may similarly be used. These considerations may therefore yield one or more such words, which may be incorporated as optional alternatives within the target grammar.
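  • The optional paths for acoustic context words and shim words can be sketched by enumerating the word sequences such a grammar would accept. The grammar is modeled here as four stages, where every stage except the name may be skipped via an epsilon path; the stage layout, the function name, and the example words other than "toby" are assumptions.

```python
from itertools import product
from typing import List


def expand(prefix_alts: List[str], shim_alts: List[str],
           names: List[str], suffix_alts: List[str]) -> List[List[str]]:
    """Enumerate accepted sequences: [prefix] [shim] name [suffix], brackets optional."""
    def optional(alternatives: List[str]) -> List[List[str]]:
        return [[]] + [[w] for w in alternatives]  # [] models the epsilon path

    paths = []
    for pre, shim, name, suf in product(optional(prefix_alts), optional(shim_alts),
                                        [[n] for n in names], optional(suffix_alts)):
        paths.append(pre + shim + name + suf)
    return paths


paths = expand(["call"], ["to"], ["toby", "tony"], ["now"])
print(len(paths))
```

  • A recognizer decoding against this structure may absorb context audio into the prefix, shim, or suffix arcs when that fits the signal, or bypass them when it does not.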
  • each such excerpt may exhibit significant signal transients at its start or end, due to the possibly abrupt onset of speech at the start of the excerpt, and likewise the possibly abrupt cessation of speech at the end of the excerpt, either of which may cause recognition errors.
  • There are methods for compensating for such transients but it is best to avoid them completely, if possible.
  • Whole waveform adaptive proper name recognition comprises the same functional components described in paragraphs [00114] through [00123] inclusive, operating in the same manner and order as described therein and in the sequel, with two important distinctions.
  • the primary recognizer need not emit word timings, as described in paragraph [00116], comprising the nominal start time and end time, within the input audio signal, of each transcribed word. These timings may still be emitted by the primary recognizer, but they are no longer used during secondary decoding.
  • the secondary recognizer may not operate upon certain short segments of the audio signal comprising the user's spoken input, as described in paragraph [00123]. Instead, the secondary recognizer may operate upon the entirety of the audio signal comprising the user's spoken input.
  • the secondary recognizer may now and indeed must process the entire input audio signal. That is, when the method is applied the secondary recognizer decodes the whole waveform of the user's spoken utterance, rather than one or more excerpts thereof; hence the method's name.
  • this method is identical in concept and execution to the adaptive proper name recognition method, and variants to and elaborations thereof, all as previously described, except that no excerpting of the input audio signal is performed. Because no excerpting is performed, there is no need to divide the input audio signal at nominal word boundaries. Hence, no word timings are used or required for this purpose. And because no word timings are used or required, there can be no errors, in transcription or ultimately in meaning, due to errors in the determination of these timings.
  • FIG. 18 depicts such a structure, labeled "target section.” Additional arcs may then be added, corresponding to reasonable variations of each contact name as supplied, for instance comprising the first name or last name only of each complete contact name, or consulting a dictionary of common nicknames to substitute for names already listed, each substitution thereby creating a variation and hence an additional arc.
  • Each arc of this grammar may also be labeled, as previously described, with operations to be performed on suitable meaning variables, if the arc is traversed during secondary decoding, so that a symbolic indication of the identity of the decoded contact, or a list of such symbolic indications of possible decoded contacts, if the spoken command is acoustically or semantically ambiguous, may be emitted as part of the secondary decoding step.
  • the target section structure of Figure 18 may be created; in Figure 18 each of the indicated labels n_1, n_2, n_3, ..., n_k stands for both the literals and meaning variable operations associated with each alternative, as just described.
  • Figure 19 depicts this same structure, with the label n_1 replaced by the literals and meaning variable operation associated to the example contact name "pak shak", and the label n_2 replaced by the literals and meaning variable operation associated to the example contact name "steve youngest."
  • a further k - 2 arcs are present in this structure, comprising the number required to represent the aforementioned additional names in the contact list and reasonable variations thereof, with the k - 2 labels n_3, ..., n_k replaced by the literals, and optional meaning variable operations, associated with those names.
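  • A target section with labeled arcs and name variations might be built as in this sketch. The arc representation (a tuple of literal words paired with a meaning string), the `contact_id=` label format, and the nickname entry are assumptions for illustration.

```python
from typing import Dict, List, Tuple

Arc = Tuple[Tuple[str, ...], str]  # (literal words, meaning label)


def target_section(contacts: List[Tuple[str, str]],
                   nicknames: Dict[str, List[str]]) -> List[Arc]:
    """One arc per full name, plus arcs for first-only, last-only and nickname variants."""
    arcs: List[Arc] = []
    for idx, (first, last) in enumerate(contacts, start=1):
        meaning = f"contact_id={idx}"  # hypothetical meaning variable operation
        variants = [(first, last), (first,), (last,)]
        variants += [(nick,) for nick in nicknames.get(first, [])]
        arcs.extend((variant, meaning) for variant in variants)
    return arcs


contacts = [("pak", "shak"), ("steve", "youngest")]
arcs = target_section(contacts, {"steve": ["stevie"]})
for literal, meaning in arcs:
    print(" ".join(literal), "->", meaning)
```

  • If two contacts share a variant (for instance the same first name), the section contains two arcs with identical literals and different meanings, which is one way the ambiguous case mentioned above can arise.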
  • Figure 20 illustrates the method of whole waveform adaptive proper name recognition, implemented using a grammar as the adaptation object. We proceed to recount the steps whereby the method, so implemented, yields the correct transcription and meaning of a typical utterance.
  • the primary recognizer output is then passed to the understanding step.
  • the language understanding module hypothesizes that the transcription as a whole is a command to send a text message to a user contact.
  • the language understanding module further hypothesizes that the word sequence "steve young us" comprises a proper name entity of type user-contact-name.
  • This information which may consist of all of (a) a symbolic indication that the command is of type text-message-to-user-contact, (b) identification of the whole waveform prefix span, comprising the transcribed word sequence "SIL send a message to,” which constitute the whole waveform prefix words, (c) identification of the target span, of putative type user-contact-name, and comprising the transcribed word sequence "steve young us,” which constitute the "target words,” and (d) identification of the "whole waveform suffix span,” comprising the transcribed word sequence "SIL hi steve how are you,” which constitute the "whole waveform suffix words,” comprises the input to the adaptation object preparation step.
  • These various information elements are as indicated within Figure 20.
  • the adaptation object preparation step uses this information to construct the adaptation object, comprising the grammar ww-contact-name.g, as shown within Figure 20. This is done by assembling the indicated sections, respectively the whole waveform prefix section, the target section, and the whole waveform suffix section, each as depicted in Figure 20.
  • the whole waveform prefix section is constructed as indicated, by assembling a linear sequence of the required number of grammar arcs, these arcs labeled with the whole waveform prefix words in succession. The end of this sequence of arcs is attached to the previously described user contact name target section. Note that this target section may have been constructed separately, at the time of registration of the user contact names.
  • the fact that the user contact name target section is incorporated into the adaptation object, as opposed to some other kind of target section (for example, registered business names, numbered street addresses within Menlo Park, California, geographically proximate business names, or some other type appropriate to a different instance of adaptive proper name recognition), is a consequence of the putative target span type user-contact-name provided by the language understanding step. It is in this way that the type as well as the extent of each putative span, as identified by the language understanding module, has a decisive effect upon the secondary decoding step.
  • the end of the target section is then attached to the whole waveform suffix section, which is constructed in a manner identical to the whole waveform prefix section, except that the grammar arcs are labeled with the whole waveform suffix words.
  • the resulting grammar ww-contact-name.g is then compiled and provided to the secondary recognition step. It is important to note that by virtue of the inclusion of all of the whole waveform prefix words, and all of the whole waveform suffix words, in the indicated order and locations within the grammar, the grammar is crafted for secondary recognition of the complete whole waveform decode span, as indicated in Figure 20. No excerpting of this whole waveform decode span is required or performed.
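  • The assembly of the three sections can be sketched as follows. This models the grammar only at the word level (no acoustics and no compilation step); the dictionary layout and the meaning labels are assumptions, while the prefix and suffix words are those of the example.

```python
from typing import Dict, List, Tuple

Arc = Tuple[Tuple[str, ...], str]  # (literal words, meaning label)


def ww_grammar(prefix_words: List[str], target_arcs: List[Arc],
               suffix_words: List[str]) -> Dict[str, list]:
    """Chain: linear prefix arcs -> alternative target arcs -> linear suffix arcs."""
    return {"prefix": list(prefix_words), "target": list(target_arcs),
            "suffix": list(suffix_words)}


def accepted_sequences(grammar: Dict[str, list]) -> List[Tuple[List[str], str]]:
    """Every whole-utterance word sequence the grammar permits, with its meaning."""
    return [(grammar["prefix"] + list(literal) + grammar["suffix"], meaning)
            for literal, meaning in grammar["target"]]


prefix = "SIL send a message to".split()
suffix = "SIL hi steve how are you".split()
target = [(("pak", "shak"), "contact_id=1"), (("steve", "youngest"), "contact_id=2")]

grammar = ww_grammar(prefix, target, suffix)
for words, meaning in accepted_sequences(grammar):
    print(" ".join(words), "->", meaning)
```

  • Only the target section branches; the prefix and suffix are fixed by the primary transcription, so the secondary recognizer's freedom is confined to the choice of name and to where its audio begins and ends.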
  • the secondary recognition step receives the adaptation object, comprising the indicated compiled grammar.
  • the compiled grammar is loaded into the secondary recognizer, and the full decode span is presented to the secondary recognizer as input.
  • the secondary recognizer uses the grammar to decode the full decode span.
  • the secondary recognizer, in processing the target section of the grammar, finds the closest acoustic match (or matches, if the secondary recognizer is operating in n-best mode) permitted by the target section to the input audio signal. If present on the grammar arcs traversed, the associated operations upon semantic meaning variables are also performed, with corresponding values emitted as part of each decoding. This completes the secondary decoding step.
  • if the language understanding module has hypothesized multiple distinct spans within the same input audio signal, these are each likewise processed, and the results for each distinct secondary decoding assembled into one or more complete transcriptions, each with an associated symbolic meaning.
  • each complete transcription, and its associated symbolic meaning may be presented to the score fusion module for final ranking and winnowing of the various hypotheses.
  • this example of the wwapnr method using a grammar as an adaptation object, is complete, yielding a symbolic meaning and associated transcription.
  • the secondary recognizer is free to find the best possible match or matches permitted by the target section to any contiguous portion of the entire input audio signal, as long as the ww prefix section and the ww suffix section are themselves matched against appropriate portions of the ww decode acoustic span.
  • the slotted grammar is constructed with no prefix section or no suffix section respectively. If there are neither prefix words nor suffix words the resulting grammar has no slots and comprises the target section alone; this case has already been covered. This completes the discussion of the example, detailing the operation of whole waveform adaptive proper name recognition, implemented by use of a slotted grammar for the adaptation object.
  • the appropriate version of the slotted contact name grammar may be selected, its slots populated, and then finalized for use in the secondary recognition step. While this is not impossible, it is a complication that we would like to avoid.
  • Upon presentation of the input audio signal "send a message to Steve Youngest Hi Steve how are you" the primary decoder emits the indicated transcription, and the language understanding step once again hypothesizes a user-contact-name proper name entity and passes this and associated information to the adaptation object generation step.
  • the adaptation object generation step retrieves the grammar ww-l-l-slotted-contact-name.g and populates its slot 1 with the full sequence of ww prefix words, and likewise its slot 2 with the full sequence of ww suffix words, and finalizes the grammar for use.
  • the remaining operations of this embodiment of the invention may then be executed as described in paragraphs [00269], [00270] and [00271].
  • the alternate way is first to prepare a slotted wwapnr grammar with a sufficiently large number of prefix and suffix slots to accommodate the maximal number of prefix and suffix words that may be encountered in practice. Then, when using this slotted grammar as the adaptation object for any given primary decoder transcription, first populate the slots adjacent to the target section with the prefix and suffix words, and then populate any unfilled slots of the ww prefix section or ww suffix section with the epsilon word object. Thus populated, by virtue of the characteristics of the epsilon word object, the grammar functions as if it contained only the appropriate number of prefix and suffix slots, now populated with the ww prefix and suffix words, as determined by the language understanding step.
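  • The epsilon-padding scheme just described can be sketched as follows. This is an illustration only, assuming one word per slot and a string `"<eps>"` standing in for the epsilon word object; the function name `populate_slots` and the slot counts are hypothetical.

```python
EPSILON = "<eps>"  # stand-in for the epsilon word object; it matches no audio


def populate_slots(prefix_words, suffix_words, n_prefix_slots, n_suffix_slots):
    """Fill a fixed-size slotted grammar, padding unused slots with epsilon.

    Words are placed in the slots adjacent to the target section, so the
    prefix is padded on its far (left) side and the suffix on its far
    (right) side.
    """
    if len(prefix_words) > n_prefix_slots or len(suffix_words) > n_suffix_slots:
        raise ValueError("grammar has too few slots for this transcription")
    prefix = [EPSILON] * (n_prefix_slots - len(prefix_words)) + list(prefix_words)
    suffix = list(suffix_words) + [EPSILON] * (n_suffix_slots - len(suffix_words))
    return prefix, suffix
```

Because epsilon slots consume no audio, the populated grammar behaves as though it had exactly as many prefix and suffix slots as there are prefix and suffix words.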
  • This kind of span extent error can be compensated for by adjoining epsilon arcs to the grammar, to permit some given number of prefix words or suffix words to be skipped when decoding the prefix and suffix sections of the grammar respectively, thereby allowing the audio that yielded those words in the primary transcription to be absorbed within the decoding of the target section.
  • the error per se is not an inaccurate determination of word start and end times, but a misclassification of a token in the primary recognizer output as a non-target word. More succinctly, one or more words of the primary recognizer transcription that should have been assigned to the putative span were incorrectly excluded from it. For this reason we refer to this as a "span-too-small error."
  • Figure 26 depicts a slotted grammar, ww-l-e2-e2-l-slotted-contact-name.g, that exhibits the desired behavior: the four additional epsilon arcs ε_p, ε_pp, ε_s and ε_ss permit up to two whole waveform prefix words and up to two whole waveform suffix words to be skipped when decoding the ww prefix section and ww suffix section, respectively.
  • when processing the ww prefix section of the grammar, the decoder was not forced to either skip or match the word or words occupying slot 3. That is, the decoder was free to either (a) traverse the slot 3 arc, matching the word or words populating this arc against the input audio signal forward from the end of the slot 2 match, and thereby rendering that portion matched against slot 3 unavailable to matching within the target section, or (b) traverse the ε_p arc, and match the input audio signal, forward from the end of the slot 2 match, against some arc of the target section.
  • the ε_p path in the ww-l-e2-e2-l-slotted-contact-name.g grammar allows the decoder to choose freely between either including or excluding from the target span the word or words labeling slot 3.
  • the decoder may revise, on the basis of the present audio input signal, and the options available within the target section, the provisional decision made earlier by the language understanding module regarding the extent of the target span. In this way, the grammar allows the decoder to compensate for the span-too-small error of the language understanding module.
  • the decoder may alternatively choose to traverse the ε_pp path within the ww prefix section, thereby including both the slot 2 and slot 3 contents in the target span. Similar comments apply independently to the ε_s and ε_ss paths and their alternatives. That is, independent of its actions with respect to the paths taken when processing the ww prefix section, the decoder may choose to traverse either ε_s or slot 4, respectively including or excluding from the target span the contents of slot 4, or likewise to traverse either ε_ss or slot 4 → slot 5, respectively including or excluding from the target span the contents of both slot 4 and slot 5, in that order.
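  • The alternatives that the epsilon arcs open up on the prefix side can be enumerated directly. The sketch below is illustrative: `prefix_options` is a hypothetical helper, and the slot contents in the usage note (in particular a first slot holding "SIL send a") are an assumption about how the running example would be populated.

```python
def prefix_options(prefix_slots, max_skip=2):
    """List the ways the decoder may split the prefix slots.

    Each option is (matched_in_prefix, absorbed_into_target). Skipping k
    slots hands the audio of the k slots nearest the target section to the
    target section, which widens the target span by those words.
    """
    options = []
    for k in range(min(max_skip, len(prefix_slots)) + 1):
        cut = len(prefix_slots) - k
        options.append((prefix_slots[:cut], prefix_slots[cut:]))
    return options
```

For slots `["SIL send a", "message", "to"]` the three options are: no revision, absorb "to" into the target span, or absorb "message" and "to". The suffix side is symmetric, with the slots nearest the target section being the leftmost ones.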
  • Figure 27 exhibits a populated version of ww-l-e2-e2-l-slotted-contact-name.g, and shows how it compensates for a span-too-small error.
  • the audio input graphic shows the waveform presented.
  • the true word sequence shows the user's true spoken command that yielded the waveform.
  • the primary recognizer output shows the transcription of this waveform generated by the primary decoding step; note that it does not match the true word sequence.
  • the "time" line above this shows the now-irrelevant word boundaries generated by the primary recognizer.
  • slot 2, slot 3, slot 4 and slot 5 have been populated respectively with one word apiece of the ww prefix words and ww suffix words, immediately adjacent to the putative target words, to allow the secondary recognizer to revise, at a granularity of each individual word, the matching of the corresponding portions of the input audio signal against the target section, rather than the contents of the indicated slots.
  • the ww-l-e2-e2-l-slotted-contact-name graphic of Figure 27 shows the actual grammar arcs traversed to yield the secondary recognizer output. Note that this path correctly expands the target section to match the true target, yields the correct transcription of the whole utterance, including the contact name, and assigns to the semantic meaning variable cid the numerical value 1, corresponding to the now correctly identified contact name.
  • the arc labeled "ε_p" in the illustrated decoding path is not a loop in a graph. It connects the head of the arc labeled "message" to the tail of the arc labeled "tasteve youngus thai." It is depicted in this way because in traversing this arc the secondary recognizer has matched no portions of the input audio signal. Thus, in keeping with the graphical rendering of this matching process, the arc bridges a portion of the waveform of zero width. Hence it appears to be a loop.
  • the arc provides a way for the secondary recognizer to skip over the alternate arc labeled "to" in the ww prefix section of the ww-l-e2-e2-l-slotted-contact-name.g: (slots populated) graphic, thereby causing the secondary recognizer to match the audio that yielded this word in the primary recognizer output against some portion of the target section.
  • the language understanding module incorrectly decides that the target span comprises the transcribed words "tupac shakur," and correspondingly determines that the ww prefix words comprise "send a message" and no others and, likewise, that the ww suffix words comprise "you coming tonight SIL" and no others.
  • If these words are populated into the nominal ww-3-3-slotted-contact-name.g grammar, illustrated in Figure 28, it is unlikely that the secondary recognizer will yield the correct result.
  • the portion of the input audio signal corresponding to the spoken but misrecognized word "to," and likewise the portion of the input audio signal corresponding to the spoken but misrecognized word "are," must be matched by the start and end of some literal sequence labeling an arc of the target section.
  • the secondary recognizer must match these portions of the audio signal somewhere, but there is no way to match them within the whole waveform prefix section or the whole waveform suffix section, respectively. Thus, the secondary recognizer is likely to choose an incorrect pathway through the target section, in a vain attempt to match the entire target span waveform against this section of the grammar, as it is constrained to do.
  • each shim also includes an epsilon path, respectively ε_ls and ε_rs, so that the secondary recognizer is not forced to traverse one of the alternative literal arcs, but may honor the target span extent as originally determined by the language understanding module, independently with respect to the ww prefix and ww suffix, if that provides a better acoustic match through the grammar overall.
  • The purpose of each shim structure is to allow the secondary recognizer to revise the target span as originally determined by the language understanding module, but in the sense opposite to that expressed in the preceding variant, which compensates for span-too-small errors.
  • these structures permit the secondary recognizer to narrow the span extent, by decoding audio that would otherwise be forced to match within the target section to literals that appear outside it, on the non-epsilon arcs of the left and right shims. It should be noted that any such left shim or right shim slots may be populated with a literal sequence, for example "to my friend" in the case of a left shim arc, rather than just a single literal.
  • this best acoustic match is the literal sequence "pak shak.”
  • a difficulty in applying this method is determining what alternative literals should be compiled into (or in the case of a slotted grammar implementation, populated into) the left shim and right shim respectively.
  • Various methods are possible. Among methods to find plausible left shim alternative literals are (a) select alternate primary decodings of the start of the nominal target acoustic span, (b) use a forward language model to select likely forward extensions of the whole waveform prefix words, (c) select lexicon words that are a good acoustic match to the start of the nominal target acoustic span, (d) select words according to some weighted combination of the scores yielded by (b) and (c), or (e) select words that are known to be likely to appear immediately before a named entity of the putative target span type, for instance "to" or "for" in the case of type text-message-to-user-contact.
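  • Method (d) above, a weighted combination of language model and acoustic scores, might look like the following sketch. Everything here is hypothetical: the function name, the use of log-probabilities, the linear interpolation weight, and the scores themselves are illustrative assumptions rather than a description of any particular implementation.

```python
def rank_shim_candidates(candidates, lm_logprob, acoustic_logprob,
                         lm_weight=0.5, top_k=2):
    """Pick left shim literals by a weighted sum of two log-scores.

    lm_logprob[w]       : hypothetical forward language model score of w
                          as an extension of the ww prefix words
    acoustic_logprob[w] : hypothetical acoustic match score of w against
                          the start of the nominal target acoustic span
    Higher (less negative) combined scores rank first.
    """
    def combined(word):
        return (lm_weight * lm_logprob[word]
                + (1.0 - lm_weight) * acoustic_logprob[word])

    return sorted(candidates, key=combined, reverse=True)[:top_k]
```

The `top_k` literals returned would then be compiled or populated into the left shim arcs; the same pattern applies to the right shim with a backward language model and the end of the target acoustic span.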
  • Figure 32 illustrates the embodiment of the invention for a slotted grammar.
  • the graphic labeled ww-3-lspl-rspl-3-slotted-contact-name.g: (slots unpopulated) shows how the left shim and right shim structures of the grammars of preceding figures have been replaced by a phoneme loop.
  • the loop itself is indicated within both shims by the indicated loop arc, labeled "{φ}".
  • the use of curly braces here is intended to indicate that the loop can match any phoneme φ in the secondary recognizer phonetic alphabet.
  • the indicated loop arc actually stands for a collection of parallel loop arcs, each one labeled with a different phoneme of the recognizer's phonetic alphabet.
  • the epsilon paths ε_ls and ε_ls′ within the left shim provide access to the left shim phoneme loop without matching any audio; they also allow the decoder to completely bypass the phoneme loop if it so desires. Similar remarks apply to the epsilon paths ε_rs and ε_rs′ within the right shim.
  • each phoneme loop functions much as a less discriminating version of the left shim and right shim arcs labeled with alternate literals or literal sequences.
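  • The phoneme loop shim can be modeled as a tiny automaton: one epsilon arc into a loop node, one parallel self-loop arc per phoneme, and one epsilon arc out. The sketch below is an assumption-laden illustration; the node names, the `"<eps>"` label and the three-phoneme alphabet used in the usage note are hypothetical.

```python
EPS = "<eps>"  # epsilon label: traversed without consuming any audio


def phoneme_loop_shim(alphabet, entry="ls_in", loop="ls_loop", exit_="ls_out"):
    """Build the arcs of a shim containing a phoneme loop."""
    arcs = [(entry, loop, EPS)]                     # access to the loop
    arcs += [(loop, loop, ph) for ph in alphabet]   # one self-loop per phoneme
    arcs.append((loop, exit_, EPS))                 # leave the loop
    return arcs


def accepts(arcs, phonemes, start="ls_in", final="ls_out"):
    """Check whether the shim can consume exactly this phoneme sequence."""
    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            state = stack.pop()
            for src, dst, label in arcs:
                if src == state and label == EPS and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen

    current = closure({start})
    for ph in phonemes:
        step = {dst for src, dst, label in arcs if src in current and label == ph}
        current = closure(step)
    return final in current
```

With `arcs = phoneme_loop_shim(["T", "UW", "ER"])`, the empty sequence is accepted (the loop is bypassed entirely), as is any sequence drawn from the alphabet, such as `["T", "UW"]`.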
  • the grammar in the Figure 31 graphic labeled ww-3-ls2-rsl-3-slotted-contact-name.g: (slots populated).
  • the input audio signal segment from 1145 ms to 1271 ms inclusive, or thereabouts, is matched against the left shim arc labeled "to" of this grammar; this is illustrated in the Figure 31 graphic labeled ww-3-ls2-rsl-3-slotted-contact-name.g: (decoding path).
  • the same audio segment may be matched against the phoneme sequence "T UW" when using the grammar in the graphic labeled ww-3-lspl-rspl-3-slotted-contact-name.g: (slots populated) of Figure 32.
  • This is illustrated in Figure 33. It is organized identically to Figure 31, except that the graphics labeled ww-3-ls2-rsl-3-slotted-contact-name.g: (slots populated) and ww-3-ls2-rsl-3-slotted-contact-name.g: (decoding path) in Figure 31 are respectively replaced by the graphics labeled ww-3-lspl-rspl-3-slotted-contact-name.g: (slots populated) and ww-3-lspl-rspl-3-slotted-contact-name.g: (decoding path) in Figure 33.
  • the phoneme loop mechanism is less discriminating than the left shim and right shim arcs labeled with alternate literals or literal sequences insofar as each loop can match an arbitrary sequence of phonemes, and may therefore consume too much audio from the target acoustic span, leading to a different kind of decoding error.
  • the weights accorded to the phoneme loops must be tuned with respect to the weights of the target section to ensure this does not happen.
  • Figure 34 illustrates how the phoneme loop idea may be combined with the previously-described method that uses alternative literals or literal sequences to label arcs within the shims.
  • One means of compensating for this is to post-process any such user-visible transcription, by which is meant any portion of the secondary transcription that is to be shown to a human user of the system or consumer of its output, and replace phonemes or phoneme sequences with the closest matching word or words present in the lexicon.
  • This strategy applied to the secondary recognizer transcription fragment "ER you coming tonight" yields "are you coming tonight."
  • Other more elaborate methods might involve a similar search of the lexicon, and include a language model score as well, when selecting the ordinary-language word or words to replace a phoneme or phoneme sequence.
  • this language model score may itself be conditioned upon one or more of: the putative command type, the putative span type, the putative span decoding, the location of the phoneme sequence with respect to the target span (viz., immediately preceding or immediately following the target span), one or more adjacent decoded words, or other known or hypothesized characteristics of the utterance.
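  • The simplest form of this post-processing, replacing a run of phonemes with the lexicon word whose baseform is closest, can be sketched as below. The sketch assumes a toy lexicon with hypothetical baseforms, uses plain edit distance over phoneme symbols as the closeness measure, replaces each run with a single word, and omits the language model score discussed above.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def closest_word(phonemes, lexicon):
    """Return the lexicon word with a baseform nearest to the phonemes."""
    return min(lexicon,
               key=lambda w: min(edit_distance(phonemes, bf) for bf in lexicon[w]))


def replace_phonemes(tokens, lexicon, phoneme_set):
    """Replace each maximal run of phoneme tokens with one lexicon word."""
    out, run = [], []
    for tok in list(tokens) + [None]:   # None flushes a trailing run
        if tok in phoneme_set:
            run.append(tok)
            continue
        if run:
            out.append(closest_word(run, lexicon))
            run = []
        if tok is not None:
            out.append(tok)
    return out


# Toy lexicon; the baseforms listed are assumptions for illustration.
TOY_LEXICON = {"are": [["AA", "R"], ["ER"]], "you": [["Y", "UW"]], "her": [["HH", "ER"]]}
TOY_PHONEMES = {"AA", "R", "ER", "Y", "UW", "HH"}
```

Applied to the tokens of "ER you coming tonight" with the toy lexicon, `replace_phonemes` returns the tokens of "are you coming tonight".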
  • a further natural variant of the wwapnr method is a way of coping simultaneously with either or both span-too-small and span-too-large errors with a single grammar. Such a way is useful insofar as it is unknown, for any particular primary decoding and language understanding steps, whether the current hypothesized span extent exhibits a span-too-small error, a span-too-large error, both span-too-small and span-too-large errors (this can happen if the span is too small on one end and too large on the other end), or neither.
  • a mechanism that allows correction of all possible modes and combinations of span extent errors while retaining the virtues of the wwapnr method is of value.
  • Such a mechanism is exhibited in Figure 35.
  • the grammar has prefix slots slot 1, slot 2 and slot 3, suffix slots slot 4, slot 5 and slot 6, left shim slots ls 1 and ls 2, and right shim slots rs 1 and rs 2, all unpopulated.
  • the number of these slots is arbitrary and purely exemplary; the design can be generalized to more or fewer slots as desired.
  • the grammar also includes the same target section that we have used throughout our running sequence of examples, reflecting the previously-registered list of user contact names; this is likewise arbitrary and purely exemplary.
  • the arcs associated with the left shim slots ls 1 and ls 2 permit the secondary recognizer to match the contents of these slots, if populated, rather than the target section, against a portion of the input audio signal, likewise as explained earlier in the discussion of span-too-large errors.
  • the novel aspect of the structure as depicted is that the secondary recognizer is free to exploit the epsilon paths ε_p and ε_pp to correct a span-too-small error, or the left shim arcs associated with either of slots ls 1 and ls 2 to correct a span-too-large error, or to make no revision to the target span by traversing the path slot 2 → slot 3 → ε_ls, to obtain the best possible acoustic match.
  • these options are afforded by a single adaptation object, implemented as a slotted grammar.
  • both epsilon paths ε_p and ε_pp extend over the left shim structure. This ensures that if either of ε_p or ε_pp is traversed during secondary decoding (that is, the decoder has chosen to correct for a span-too-small error with respect to the ww prefix acoustic span) then no arc of the left shim may be traversed. Similarly, if any of the left shim arcs is traversed during secondary decoding, then neither of the epsilon paths ε_p or ε_pp may be traversed. This is as desired because it is not possible to commit simultaneously both a span-too-small and a span-too-large error with respect to the ww prefix acoustic span.
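  • The mutual exclusion between the span-too-small paths and the left shim arcs follows from the graph topology, which a small path enumeration makes concrete. The arc list below is one reading of the prefix side of the Figure 35 structure as described in the text; the node names (`n0` to `n3`, `T` for the entry to the target section) and the ASCII arc labels are hypothetical.

```python
def prefix_side_arcs():
    """Arcs of the prefix side: three prefix slots, then the left shim."""
    return [
        ("n0", "n1", "slot1"), ("n1", "n2", "slot2"), ("n2", "n3", "slot3"),
        # Left shim: two literal arcs and the no-revision epsilon arc.
        ("n3", "T", "ls1"), ("n3", "T", "ls2"), ("n3", "T", "eps_ls"),
        # Span-too-small epsilon paths; both extend over the left shim.
        ("n2", "T", "eps_p"), ("n1", "T", "eps_pp"),
    ]


def paths(arcs, node, goal):
    """Enumerate label sequences of all paths from node to goal (acyclic)."""
    if node == goal:
        return [[]]
    found = []
    for src, dst, label in arcs:
        if src == node:
            found += [[label] + rest for rest in paths(arcs, dst, goal)]
    return found


def mixes_corrections(path):
    """True if a path uses both a skip epsilon and a left shim literal."""
    labels = set(path)
    return bool(labels & {"eps_p", "eps_pp"}) and bool(labels & {"ls1", "ls2"})
```

Enumerating `paths(prefix_side_arcs(), "n0", "T")` yields five alternatives: no revision, two span-too-small corrections and two span-too-large corrections. None of them combines a skip epsilon with a shim literal.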
  • Figure 37 shows the input audio waveform for the command "send a message tasteve youngus thai steve how are you."
  • the true word sequence, time, primary recognizer output, whole waveform prefix words, target words and whole waveform suffix words are all identical to those depicted earlier in the corresponding figure for this command, Figure 27.
  • the adaptation object is the grammar labeled ww-l-e2-ls2-rs2-e2-l-slotted-contact-name.g: (slots populated for "SIL send a message tasteve youngus thai steve how are you"), copied from Figure 35.
  • Another method of correcting all possible modes and combinations of span extent errors is to utilize four distinct grammars or slotted grammars: one that can correct span-too-small errors both before and after the target span, one that can correct a span-too-small error before the target span and a span-too-large error after the target span, one that can correct a span-too-large error before the target span and a span-too-small error after the target span, and one that can correct span-too-large errors both before and after the target span. But this is four times more computationally costly than the method just explained, and is therefore not preferred.
  • In the text that follows we will continue to discuss additional embodiments of the invention. We will continue to couch this discussion in the framework of wwapnr. However it is to be noted that these variants may apply equally well to embodiments that do not use wwapnr, as explained earlier in this specification.
  • the primary recognizer output may additionally include the actual baseforms decoded for each word of the primary transcription, as is common in practice.
  • the elements of the adaptation object that are derived from the primary recognizer output, which up to this point have been words, are replaced by the corresponding baseforms as determined by the primary recognizer.
  • the adaptation object is a grammar
  • any arcs of the grammar that had been labeled with words from the primary transcription are in this variant instead labeled with the corresponding baseforms from the primary transcription.
  • the adaptation object is a slotted grammar
  • any slots that had been populated with words from the primary transcription are in this variant instead populated with the corresponding baseforms from the primary transcription.
  • Figure 21 depicts the output of the primary recognizer as a sequence of baseforms, and shows how these in turn are used to label the arcs of the ww-contact-name.g adaptation grammar.
  • each possible baseform of any of these words must be considered by the secondary recognizer; hence, the adaptation object is neither constructed nor populated to restrict the secondary recognizer to any particular baseform, when decoding the target section.
  • the user's contact name list itself includes a preferred baseform of any given name, possibly supplied by the user. This is especially likely for unusual names and, if so, this known preferred baseform may be appropriately incorporated into the adaptation object, which in this example comprises the target section of ww-contact-name.g.
  • Figure 38 likewise illustrates this variant, where the adaptation object is a slotted grammar, with a more elaborate structure than that of Figure 21.
  • the primary recognizer output is exhibited as a sequence of baseforms.
  • the slots of the ww prefix and ww suffix sections of ww-l-e2-ls2-rs2-e2-l-slotted-contact-name.g are now populated with baseforms, rather than words. However, this is not so for the slots of the left shim and right shim.
  • the secondary recognizer must be free to consider each possible baseform for each shim word.
  • the arcs of the ww prefix section match and ww suffix section match portions of this graphic, which indicate the path through the ww-l-e2-ls2-rs2-e2-l-slotted-contact-name.g grammar selected by the secondary recognizer, are labeled not with words but with baseforms. Moreover, these baseforms match those on corresponding arcs of the grammar. This of course is not an accident. Because the cited grammar arcs are labeled not with words but with baseforms, the decoder is permitted to match corresponding portions of the input audio signal to the indicated baseforms only. While this restriction might seem to limit the freedom of action of the secondary recognizer, possibly resulting in an error in the secondary transcription, in fact it has no impact on accuracy.
  • the baseforms in the grammar are precisely those appearing in corresponding locations in the primary recognizer output and, hence, are already known to match the input audio signal well.
  • the restriction has the previously cited advantage of preventing the secondary recognizer from needlessly exploring the quality of the acoustic match between every baseform of these words that may be present in its lexicon. This, in turn, reduces the computational workload and memory requirements of the secondary recognizer when processing the ww prefix and ww suffix sections of the grammar.
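  • The saving can be illustrated with a simple count of pronunciation alternatives. The sketch below assumes the "word(NN)" labeling seen in the figures denotes the NN-th baseform of a word; the helper names and the per-word variant counts are hypothetical.

```python
def baseform_label(word, index):
    """Format a decoded baseform as it appears on a grammar arc."""
    return f"{word}({index:02d})"


def alternatives_scored(words, variants_per_word, restricted):
    """Count baseforms the secondary recognizer must score for these words.

    restricted=True  : arcs carry the single baseform chosen by the
                       primary recognizer, so one alternative per word
    restricted=False : arcs carry words, so every baseform in the
                       lexicon must be explored
    """
    return sum(1 if restricted else variants_per_word[w] for w in words)
```

For prefix words "send", "a", "message" with hypothetical variant counts of 1, 2 and 1, the unrestricted grammar requires scoring four baseforms, while the baseform-labeled grammar requires three. The gap grows with the number of pronunciation variants per word.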
  • the baseforms output by the primary recognizer may comprise sequences of context-dependent phonemes, and may therefore be inserted in the manner just described into the adaptation object.
  • these sequences of context-dependent phonemes may be used by the secondary recognizer when performing a decoding with respect to the given adaptation object. This will likewise further restrict the secondary recognizer, again with no impact on accuracy, but with a further reduction in its computational workload and memory requirements.
  • these embodiments are inefficient, insofar as (a) use of the phoneme loop is computationally demanding, in the search it imposes upon the secondary recognizer, and (b) the secondary recognizer must explore all baseforms for all alternate literals. They are also potentially error-prone in that they allow the secondary recognizer both too much freedom, because the phoneme loop allows the decoder to match an arbitrary sequence of phonemes, and too little freedom, because the methods that exclude the phoneme loop allow the decoder to match only the literals that appear within the shim.
  • We now describe a variant that exploits the primary recognizer baseform output and thereby steers a middle path between these two extremes.
  • This variant uses a left shim that comprises a portion of the phoneme sequence, decoded by the primary recognizer, at the start of the putative target section, and a right shim that comprises a portion of the phoneme sequence, decoded by the primary recognizer, at the end of the putative target section.
  • These shims may either be constructed directly, if the adaptation object is a grammar, or created by appropriately populating the proper slots, if the adaptation object is a slotted grammar.
  • a grammar and a slotted grammar as the adaptation object, as the method may be applied equally well to either by one skilled in the art.
  • the shims are constructed from a selected prefix and a selected suffix of the phoneme sequence associated to the primary recognizer's decoding of the target span. Over each such prefix and suffix sequence is adjoined the now familiar nested assembly of epsilon paths, so that the secondary recognizer may match or exclude from matching contiguous portions of the putative target section acoustic span, as it prefers, against the target section of the grammar.
  • the method thus effectively narrows the target span from the possibly too-large extent assigned by the language understanding module, but in a manner that allows the secondary recognizer to enlarge it, at the granularity of an individual phoneme.
  • Figure 39 illustrates the application of this idea to the exemplary span-too-large utterance considered earlier.
  • The line labeled primary recognizer output shows the sequence of baseforms for the whole utterance decoded by the primary recognizer; immediately beneath this the line primary recognizer output (phonemes) shows the actual phoneme sequence corresponding to each indicated baseform.
  • the graphic labeled ww-3-lsp3-rsp3-3-slotted-contact-name shows the structure of the associated slotted grammar, and how it is to be populated to achieve the desired effect.
  • the ww prefix and ww suffix section slots are populated with the corresponding baseforms from the primary recognizer output, as previously described.
  • the left shim and right shim each now contain a linear sequence of three arcs, respectively labeled with the first three and last three phonemes output by the primary recognizer, for the putative target span.
  • the arcs of the left shim are labeled "T", "UW" and "P".
  • the arcs of the right shim are labeled "AA", "K" and "ER".
  • the choice of three phonemes for each shim is arbitrary and reflects a design that is known to work well in practice. Designs with a larger or smaller number of phonemes are possible and also fall within the scope of the invention, as do designs with differing numbers of phonemes in the left and right shims.
  • the exemplary phoneme alphabet is the ARPAbet. This choice is arbitrary and purely for expository convenience. Note the two nested structures of epsilon arcs: those labeled ε_ls″, ε_ls′ and ε_ls within the left shim and ε_rs, ε_rs′ and ε_rs″ within the right shim.
  • the secondary recognizer matches the contiguous audio of the waveform corresponding to these phonemes outside the target section. Conversely, by traversing the arc labeled ε_ls it chooses to match the audio of the waveform corresponding to the phoneme beneath this arc, P, within the target section. In this way, the secondary recognizer can obtain a good match to the "pak" portion of the contact name; absent the left shim this would not have been possible. It is worth noting that the indicated structure allowed only this and three other ways of matching the indicated portion of the input audio signal, viz.
  • the secondary recognizer matches the audio of the waveform corresponding to the phoneme sequence beneath it, AA K, within the target section, while matching the audio corresponding to the phoneme ER outside it.
  • the remainder of the decoding path comprises a forced alignment of the ww suffix section against the ww suffix acoustic span.
  • this variant may include phonemes within a transcription. However as discussed in paragraphs [00315] through [00317] inclusive above, this is either of no consequence or may be dealt with by the methods detailed therein. This comment applies equally to other variants that may include phonemes within a transcription. This concludes the exposition of this variant.
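  • The construction of the phoneme shims, and the four alternatives each nested epsilon structure affords, can be sketched as follows. The helper names are hypothetical, and the full phoneme sequence used in the usage note is an assumption: the text gives only its first three phonemes (T UW P) and its last three (AA K ER).

```python
def build_phoneme_shims(target_phonemes, n=3):
    """Take the first n and last n phonemes of the putative target span."""
    if n <= 0 or n > len(target_phonemes):
        raise ValueError("n must be between 1 and the number of phonemes")
    return target_phonemes[:n], target_phonemes[-n:]


def left_shim_alternatives(shim):
    """Ways to split the left shim: (matched outside target, given to target).

    The nested epsilon arcs let the decoder return a trailing run of shim
    phonemes to the target section, one phoneme at a time.
    """
    return [(shim[:k], shim[k:]) for k in range(len(shim), -1, -1)]


def right_shim_alternatives(shim):
    """Ways to split the right shim: (given to target, matched outside target)."""
    return [(shim[:k], shim[k:]) for k in range(len(shim) + 1)]
```

Assuming the primary recognizer decoded the putative span as `"T UW P AA K SH AA K ER".split()`, the shims are T UW P and AA K ER. The decoding path described above corresponds to the left alternative `(["T", "UW"], ["P"])` and the right alternative `(["AA", "K"], ["ER"])`.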
  • Figure 40 illustrates one such combination.
  • this variant as illustrated in the graphic labeled ww-l-e2-lsp3-rsp3-e2-l-slotted-contact-name.g: (slots populated for primary decoding "send(01) a(02) message(01) tupac(01) shakur(01) you(03) coming(01) tonight(01) SIL(02)"), the ww prefix section and the ww suffix section have been populated as before with the primary recognizer decoded baseforms.
  • the outermost begins at the tail of the arc labeled AH and again ends at the head of the arc labeled P.
  • the baseforms you(03) and coming(01) may be replaced by the linear sequence of connected phonemes Y UH K AH M IH NG, with a corresponding structure of seven nested epsilon arcs, within the suffix section. Accordingly, this embodiment is also included within the scope of the invention.
  • the phonemes used in the adaptation objects which are derived from the primary recognizer output, may be the context-independent phonemes typically associated with baseforms, the specific context-dependent phonemes decoded by the primary recognizer, if these are present in the primary recognizer output, or some admixture of the two. Accordingly, all such embodiments, whether they use context-independent or context-dependent phonemes, are also included within the scope of the invention.
  • the primary recognizer output comprises at least a sequence of transcribed words, optionally labeled with nominal start times and end times within the input audio signal, optionally labeled with the associated decoded baseforms, and said baseforms themselves optionally labeled with the individual context-dependent phonemes used in and possibly output as part of the primary recognizer decoding.
  • a primary recognizer may also output a lattice, which is a directed graph, the arcs of which are labeled with words decoded by the primary recognizer, and optionally with the additional information described in the preceding paragraph. This lattice may be used as the basis of an alternate embodiment of the invention, as follows.
  • the lattice is used to generate one or more primary recognizer outputs, comprising a linear sequence of transcribed words, possibly with additional optional information as previously described.
  • Each such output, or at a minimum the highest ranking such output, is provided to the language understanding module, which as previously described identifies a command type, and typically one or more putative spans with associated span type.
  • the lattice is then excerpted to remove the arcs associated with each such span.
  • the exact means by which this excerpting is performed may vary under different embodiments of the invention. For concreteness in this discussion, we explain one such method, which is to remove any arcs that correspond to portions of the audio input signal that lie wholly or partly within the subject span.
  • a target section of structure and content appropriate to the span type is then interpolated into the lattice, attached at either extreme to all appropriate frontier nodes from which arcs were excerpted in the preceding step.
  • Various of the techniques for handling span-too-small or span-too-large errors may be applied at this stage.
  • the resulting lattice then serves as the adaptation object; it is processed by the secondary recognizer to find the best match to the input audio signal. If multiple high-ranking secondary recognizer outputs emerge from the secondary recognition step, or if other adaptation lattices likewise yield high-ranking secondary outputs, they may be ranked or winnowed by a score fusion step, as previously described.
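  • The excerpt-and-interpolate procedure can be sketched on a toy lattice. The sketch assumes arcs carry start and end times, treats the span as a half-open interval in milliseconds, and represents the whole target section by a single placeholder arc. All node numbers, timings and names are hypothetical, and a span touching the very start or end of the lattice is not handled.

```python
from typing import NamedTuple


class Arc(NamedTuple):
    src: int
    dst: int
    word: str
    start_ms: int
    end_ms: int


TARGET = "<target-section>"  # placeholder for the interpolated target section


def excerpt_and_splice(lattice, span_start, span_end):
    """Remove arcs overlapping the span, then attach the target section."""
    # Keep only arcs whose audio lies entirely outside the span.
    kept = [a for a in lattice if a.end_ms <= span_start or a.start_ms >= span_end]
    removed = [a for a in lattice if a not in kept]
    # Frontier nodes: nodes that lost an arc but remain attached to a kept arc.
    left = {a.src for a in removed} & {a.dst for a in kept}
    right = {a.dst for a in removed} & {a.src for a in kept}
    spliced = [(l, r, TARGET) for l in sorted(left) for r in sorted(right)]
    return [(a.src, a.dst, a.word) for a in kept] + spliced


# Toy lattice loosely modeled on the running example; timings are invented.
EXAMPLE_LATTICE = [
    Arc(0, 1, "send", 0, 300), Arc(1, 2, "a", 300, 400),
    Arc(2, 3, "message", 400, 900), Arc(3, 4, "to", 900, 1100),
    Arc(4, 5, "steve", 1100, 1500), Arc(5, 6, "young", 1500, 1800),
    Arc(6, 7, "us", 1800, 2000), Arc(3, 8, "toast", 900, 1250),
    Arc(8, 5, "eve", 1250, 1500), Arc(7, 9, "hi", 2000, 2200),
]
```

With a span of 1100 ms to 2000 ms, the arcs "steve", "young", "us", "toast" and "eve" are removed, and the target placeholder is attached from both frontier nodes 3 and 4 to node 7. The arc from node 3 lets the target section also absorb the audio of "to".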
  • Figure 41 illustrates this procedure with a familiar example.
  • The graphic labeled "primary recognizer output (lattice)" is the aforementioned lattice; it is far simpler in structure than an actual lattice would be.
  • The highest ranking path through this lattice is depicted as the sequence of straight horizontal arcs, corresponding to the conventional primary recognizer output for this utterance. This is processed in the usual manner by the language understanding module to yield the command type, the indicated target span, and the associated span type.
  • The graphic labeled "excerpted lattice" illustrates the excerpting of arcs that lie within or impinge upon the target acoustic span; note that the arcs labeled "steve," "young" and "us," along with those labeled "toast" and "eve," have been removed.
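The excerpting and interpolation steps described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation of any embodiment: the `Arc` record, the `build_adaptation_lattice` function, the placeholder label `<contact-name-grammar>`, and the toy lattice (its topology, timings, and the words "call" and "now") are all hypothetical. The toy lattice only loosely echoes the arc labels mentioned for Figure 41. The target section is reduced here to one labeled arc per candidate, whereas a real target section would be a grammar appropriate to the span type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arc:
    src: int    # node the arc leaves
    dst: int    # node the arc enters
    word: str   # word label decoded by the primary recognizer
    t0: float   # start time of the arc in the audio, in seconds
    t1: float   # end time of the arc in the audio, in seconds


def overlaps(arc, span_start, span_end):
    """True if the arc lies wholly or partly within the span."""
    return arc.t0 < span_end and arc.t1 > span_start


def build_adaptation_lattice(arcs, start_node, final_node,
                             span_start, span_end, candidates):
    """Excerpt the arcs in the span, then splice in a target section."""
    kept = [a for a in arcs if not overlaps(a, span_start, span_end)]
    removed = [a for a in arcs if overlaps(a, span_start, span_end)]

    # Frontier nodes: endpoints of removed arcs that are still connected
    # to the retained part of the lattice (or are the start/final node).
    still_entered = {a.dst for a in kept} | {start_node}
    still_left = {a.src for a in kept} | {final_node}
    left = {a.src for a in removed if a.src in still_entered}
    right = {a.dst for a in removed if a.dst in still_left}

    # Interpolate the target section between every left/right frontier pair.
    target = [Arc(left_node, right_node, name, span_start, span_end)
              for left_node in sorted(left)
              for right_node in sorted(right)
              for name in candidates]
    return kept + target, left, right


# Hypothetical toy lattice; the span to re-recognize covers 0.4 s to 1.3 s.
toy = [
    Arc(0, 1, "call", 0.0, 0.4),
    Arc(1, 2, "steve", 0.4, 0.8),
    Arc(2, 3, "young", 0.8, 1.1),
    Arc(3, 4, "us", 1.1, 1.3),
    Arc(1, 6, "toast", 0.4, 0.9),
    Arc(6, 4, "eve", 0.9, 1.3),
    Arc(4, 5, "now", 1.3, 1.6),
]
adapted, left, right = build_adaptation_lattice(
    toy, start_node=0, final_node=5,
    span_start=0.4, span_end=1.3,
    candidates=["<contact-name-grammar>"])
print([a.word for a in adapted], left, right)
```

In this toy case only the arcs outside the span survive the excerpting, and the placeholder target arc is attached between the single left and single right frontier node. The adapted lattice would then be handed to the secondary recognizer as the adaptation object.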
  • Various of the methods described herein may involve the repeated processing of portions of the input audio signal, or of the entirety of the input audio signal. At a minimum this may comprise processing the input audio signal first by a primary recognizer, and thereafter in whole or in part by a secondary recognizer. While these two recognizers may operate on entirely different principles, they may equally well share significant internal operating details, notably including the so-called front end and the associated feature vectors or other intermediate representations of the speech signal that it produces, an acoustic model, neural network or other computational device for evaluating the quality of a given acoustic match, or some other internal device or mechanism. It will be apparent to one skilled in the art that the primary and secondary recognizers may therefore share significant internal data, for instance model parameters, network weights, or other information used during decoding, and may likewise perform some duplicate computations.
  • The primary recognizer provides to the secondary recognizer the results of certain of the computations that it performs, so that the secondary recognizer may, instead of repeating those computations, simply look up the previously computed result obtained by the primary recognizer.
  • Any precomputation must be independent of any adaptation object, as any such object will not be available for consultation by the secondary recognizer until the primary recognizer and the language understanding module have each completed their work. That said, it may yet be possible to speculatively precompute some results that may or may not later be used by the secondary recognizer, depending upon the contents of the adaptation object. The ready availability of these results may therefore also reduce the overall system latency, if they are ultimately needed by the secondary recognizer.
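The sharing of computed results between the two recognizers can be illustrated with a small cache. The following Python sketch is an assumption-laden illustration rather than the mechanism of any embodiment: the `SharedFrontEnd` class, the `primary_pass` and `secondary_pass` functions, and the use of per-frame log energy as a stand-in for real front-end feature vectors are all hypothetical. The cached values depend only on the audio, so they are independent of any adaptation object, as the text requires.

```python
import math


class SharedFrontEnd:
    """Caches per-frame features so that a later pass can look them up."""

    def __init__(self, samples, frame_len):
        self.samples = samples
        self.frame_len = frame_len
        self._cache = {}
        self.computed = 0  # counts how many frames were actually computed

    def features(self, frame_index):
        if frame_index not in self._cache:
            start = frame_index * self.frame_len
            frame = self.samples[start:start + self.frame_len]
            # Stand-in feature: log energy of the frame (hypothetical choice).
            energy = sum(x * x for x in frame)
            self._cache[frame_index] = math.log(energy + 1e-10)
            self.computed += 1
        return self._cache[frame_index]


def primary_pass(front_end, n_frames):
    """The primary recognizer touches every frame of the utterance."""
    return [front_end.features(i) for i in range(n_frames)]


def secondary_pass(front_end, first_frame, last_frame):
    """The secondary recognizer revisits only the frames in the span."""
    return [front_end.features(i) for i in range(first_frame, last_frame + 1)]


front_end = SharedFrontEnd(samples=[0.1] * 1600, frame_len=160)
primary_features = primary_pass(front_end, n_frames=10)
computed_after_primary = front_end.computed
span_features = secondary_pass(front_end, first_frame=3, last_frame=6)
computed_after_secondary = front_end.computed
print(computed_after_primary, computed_after_secondary)
```

Because the secondary pass only reads frames that the primary pass already processed, the computation counter does not advance during the second pass. A speculative precomputation would follow the same pattern, filling the cache before it is known whether the secondary recognizer will need the entries.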
  • Figure 42 is a block diagram of a computer system as may be used to implement features of some of the embodiments.
  • The computing system 1800 may include one or more central processing units ("processors") 1805, memory 1810, input/output devices 1825, e.g. keyboard and pointing devices, display devices, storage devices 1820, e.g. disk drives, and network adapters 1830, e.g. network interfaces, that are connected to an interconnect 1815.
  • The interconnect 1815 is illustrated as an abstraction that represents any one or more separate physical buses, point-to-point connections, or both connected by appropriate bridges, adapters, or controllers.
  • The interconnect 1815 may include, for example, a system bus, a Peripheral Component Interconnect (PCI) bus or PCI-Express bus, a HyperTransport or industry standard architecture (ISA) bus, a small computer system interface (SCSI) bus, a universal serial bus (USB), IIC (I2C) bus, or an Institute of Electrical and Electronics Engineers (IEEE) standard 1394 bus, also called "Firewire".
  • The memory 1810 and storage devices 1820 are computer-readable storage media that may store instructions that implement at least portions of the various embodiments.
  • The data structures and message structures may be stored or transmitted via a data transmission medium, such as a signal on a communications link.
  • Various communications links may be used, such as the Internet, a local area network, a wide area network, or a point-to-point dial-up connection.
  • Computer-readable media can include computer-readable storage media, e.g. non-transitory media, and computer-readable transmission media.
  • The instructions stored in memory 1810 can be implemented as software and/or firmware to program the processor 1805 to carry out actions described above.
  • Such software or firmware may be initially provided to the processing system 1800 by downloading it from a remote system through the computing system 1800, e.g. via network adapter 1830.
  • The various embodiments introduced herein can be implemented by, for example, programmable circuitry, e.g. one or more microprocessors, programmed with software and/or firmware, or entirely in special-purpose hardwired (non-programmable) circuitry, or in a combination of such forms.
  • Special-purpose hardwired circuitry may be in the form of, for example, one or more ASICs, PLDs, FPGAs, etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Remote Sensing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Automation & Control Theory (AREA)
  • General Engineering & Computer Science (AREA)
  • Machine Translation (AREA)
PCT/US2017/052251 2016-09-19 2017-09-19 Systems and methods for adaptive proper name entity recognition and understanding WO2018053502A1 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
EP17851782.7A EP3516649A4 (de) 2016-09-19 2017-09-19 Systeme und verfahren zur adaptiven erkennung und zum verstehen ordnungsgemässer namensentitäten
AU2017326987A AU2017326987B2 (en) 2016-09-19 2017-09-19 Systems and methods for adaptive proper name entity recognition and understanding
CA3036998A CA3036998A1 (en) 2016-09-19 2017-09-19 Systems and methods for adaptive proper name entity recognition and understanding
AU2022263497A AU2022263497A1 (en) 2016-09-19 2022-11-02 Systems and methods for adaptive proper name entity recognition and understanding

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US15/269,924 2016-09-19
US15/269,924 US9818401B2 (en) 2013-05-30 2016-09-19 Systems and methods for adaptive proper name entity recognition and understanding

Publications (1)

Publication Number Publication Date
WO2018053502A1 (en) 2018-03-22

Family

ID=61620171

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2017/052251 WO2018053502A1 (en) 2016-09-19 2017-09-19 Systems and methods for adaptive proper name entity recognition and understanding

Country Status (4)

Country Link
EP (1) EP3516649A4 (de)
AU (2) AU2017326987B2 (de)
CA (1) CA3036998A1 (de)
WO (1) WO2018053502A1 (de)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109257547A (zh) * 2018-09-21 2019-01-22 南京邮电大学 中文在线音视频的字幕生成方法
CN111159366A (zh) * 2019-12-05 2020-05-15 重庆兆光科技股份有限公司 一种基于正交主题表示的问答优化方法
CN111415655A (zh) * 2020-02-12 2020-07-14 北京声智科技有限公司 语言模型构建方法、装置及存储介质
CN114757176A (zh) * 2022-05-24 2022-07-15 上海弘玑信息技术有限公司 一种获取目标意图识别模型的方法以及意图识别方法
US20230186618A1 (en) 2018-04-20 2023-06-15 Meta Platforms, Inc. Generating Multi-Perspective Responses by Assistant Systems
US11886473B2 (en) 2018-04-20 2024-01-30 Meta Platforms, Inc. Intent identification for agent matching by assistant systems
US12008802B2 (en) 2021-06-29 2024-06-11 Meta Platforms, Inc. Execution engine for compositional entity resolution for assistant systems

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113688628B (zh) * 2021-07-28 2023-09-22 上海携宁计算机科技股份有限公司 文本识别方法、电子设备和计算机可读存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020194200A1 (en) * 2000-08-28 2002-12-19 Emotion Inc. Method and apparatus for digital media management, retrieval, and collaboration
US7120582B1 (en) * 1999-09-07 2006-10-10 Dragon Systems, Inc. Expanding an effective vocabulary of a speech recognition system
US20080221893A1 (en) * 2007-03-01 2008-09-11 Adapx, Inc. System and method for dynamic learning
US20140358544A1 (en) 2013-05-30 2014-12-04 Promptu Systems Corporation Systems and methods for adaptive proper name entity recognition and understanding

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5440177B2 (ja) * 2007-12-21 2014-03-12 日本電気株式会社 単語カテゴリ推定装置、単語カテゴリ推定方法、音声認識装置、音声認識方法、プログラム、および記録媒体
US8108214B2 (en) * 2008-11-19 2012-01-31 Robert Bosch Gmbh System and method for recognizing proper names in dialog systems
US9818401B2 (en) * 2013-05-30 2017-11-14 Promptu Systems Corporation Systems and methods for adaptive proper name entity recognition and understanding

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7120582B1 (en) * 1999-09-07 2006-10-10 Dragon Systems, Inc. Expanding an effective vocabulary of a speech recognition system
US20020194200A1 (en) * 2000-08-28 2002-12-19 Emotion Inc. Method and apparatus for digital media management, retrieval, and collaboration
US20080221893A1 (en) * 2007-03-01 2008-09-11 Adapx, Inc. System and method for dynamic learning
US8457959B2 (en) * 2007-03-01 2013-06-04 Edward C. Kaiser Systems and methods for implicitly interpreting semantically redundant communication modes
US20140358544A1 (en) 2013-05-30 2014-12-04 Promptu Systems Corporation Systems and methods for adaptive proper name entity recognition and understanding

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3516649A4

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11721093B2 (en) 2018-04-20 2023-08-08 Meta Platforms, Inc. Content summarization for assistant systems
US11887359B2 (en) 2018-04-20 2024-01-30 Meta Platforms, Inc. Content suggestions for content digests for assistant systems
US11704900B2 (en) 2018-04-20 2023-07-18 Meta Platforms, Inc. Predictive injection of conversation fillers for assistant systems
US12001862B1 (en) 2018-04-20 2024-06-04 Meta Platforms, Inc. Disambiguating user input with memorization for improved user assistance
US11715289B2 (en) 2018-04-20 2023-08-01 Meta Platforms, Inc. Generating multi-perspective responses by assistant systems
US20230186618A1 (en) 2018-04-20 2023-06-15 Meta Platforms, Inc. Generating Multi-Perspective Responses by Assistant Systems
US11694429B2 (en) 2018-04-20 2023-07-04 Meta Platforms Technologies, Llc Auto-completion for gesture-input in assistant systems
US11908181B2 (en) 2018-04-20 2024-02-20 Meta Platforms, Inc. Generating multi-perspective responses by assistant systems
US11908179B2 (en) 2018-04-20 2024-02-20 Meta Platforms, Inc. Suggestions for fallback social contacts for assistant systems
US11886473B2 (en) 2018-04-20 2024-01-30 Meta Platforms, Inc. Intent identification for agent matching by assistant systems
US11704899B2 (en) 2018-04-20 2023-07-18 Meta Platforms, Inc. Resolving entities from multiple data sources for assistant systems
US11727677B2 (en) 2018-04-20 2023-08-15 Meta Platforms Technologies, Llc Personalized gesture recognition for user interaction with assistant systems
US11869231B2 (en) 2018-04-20 2024-01-09 Meta Platforms Technologies, Llc Auto-completion for gesture-input in assistant systems
CN109257547A (zh) * 2018-09-21 2019-01-22 南京邮电大学 中文在线音视频的字幕生成方法
CN109257547B (zh) * 2018-09-21 2021-04-06 南京邮电大学 中文在线音视频的字幕生成方法
CN111159366A (zh) * 2019-12-05 2020-05-15 重庆兆光科技股份有限公司 一种基于正交主题表示的问答优化方法
CN111415655B (zh) * 2020-02-12 2024-04-12 北京声智科技有限公司 语言模型构建方法、装置及存储介质
CN111415655A (zh) * 2020-02-12 2020-07-14 北京声智科技有限公司 语言模型构建方法、装置及存储介质
US12008802B2 (en) 2021-06-29 2024-06-11 Meta Platforms, Inc. Execution engine for compositional entity resolution for assistant systems
CN114757176A (zh) * 2022-05-24 2022-07-15 上海弘玑信息技术有限公司 一种获取目标意图识别模型的方法以及意图识别方法

Also Published As

Publication number Publication date
AU2017326987A1 (en) 2019-04-11
CA3036998A1 (en) 2018-03-22
EP3516649A4 (de) 2020-04-29
AU2017326987B2 (en) 2022-08-04
AU2022263497A1 (en) 2022-12-22
EP3516649A1 (de) 2019-07-31

Similar Documents

Publication Publication Date Title
US9818401B2 (en) Systems and methods for adaptive proper name entity recognition and understanding
US11783830B2 (en) Systems and methods for adaptive proper name entity recognition and understanding
US9449599B2 (en) Systems and methods for adaptive proper name entity recognition and understanding
AU2017326987B2 (en) Systems and methods for adaptive proper name entity recognition and understanding
US10957312B2 (en) Scalable dynamic class language modeling
US8346537B2 (en) Input apparatus, input method and input program
US9940927B2 (en) Multiple pass automatic speech recognition methods and apparatus
US8380505B2 (en) System for recognizing speech for searching a database
US11270687B2 (en) Phoneme-based contextualization for cross-lingual speech recognition in end-to-end models
JP5189874B2 (ja) 多言語の非ネイティブ音声の認識
AU2023258338A1 (en) Systems and methods for adaptive proper name entity recognition and understanding
JPWO2016067418A1 (ja) 対話制御装置および対話制御方法
EP3005152B1 (de) Systeme und verfahren zur adaptiven erkennung und zum verstehen ordnungsgemässer namensentitäten
JP2008243080A (ja) 音声を翻訳する装置、方法およびプログラム
KR101424496B1 (ko) 음향 모델 학습을 위한 장치 및 이를 위한 방법이 기록된 컴퓨터 판독 가능한 기록매체
KR101483947B1 (ko) 핵심어에서의 음소 오류 결과를 고려한 음향 모델 변별 학습을 위한 장치 및 이를 위한 방법이 기록된 컴퓨터 판독 가능한 기록매체
JP6275569B2 (ja) 対話装置、方法およびプログラム
JP2000330588A (ja) 音声対話処理方法、音声対話処理システムおよびプログラムを記憶した記憶媒体

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 17851782; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 3036998; Country of ref document: CA)
NENP Non-entry into the national phase (Ref country code: DE)
ENP Entry into the national phase (Ref document number: 2017326987; Country of ref document: AU; Date of ref document: 20170919; Kind code of ref document: A)
ENP Entry into the national phase (Ref document number: 2017851782; Country of ref document: EP; Effective date: 20190423)