WO2001026092A2 - Attribute-based word modeling - Google Patents

Attribute-based word modeling

Info

Publication number
WO2001026092A2
Authority
WO
WIPO (PCT)
Prior art keywords
speech recognition
recognition system
suprasegmental
word
acoustic
Prior art date
Application number
PCT/IB2000/001539
Other languages
English (en)
Other versions
WO2001026092A3 (fr)
Inventor
Michael Finke
Juergen Fritsch
Detleff Koll
Alex Waibel
Original Assignee
Lernout & Hauspie Speech Products N.V.
Priority date
Filing date
Publication date
Application filed by Lernout & Hauspie Speech Products N.V.
Priority to AU79383/00A
Publication of WO2001026092A2
Publication of WO2001026092A3

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/183 - Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/14 - Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 - Hidden Markov Models [HMMs]
    • G10L15/148 - Duration modelling in HMMs, e.g. semi HMM, segmental models or transition probabilities
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/183 - Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/19 - Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L15/193 - Formal grammars, e.g. finite state automata, context free grammars or word networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L2015/085 - Methods for reducing search complexity, pruning

Definitions

  • the invention relates to automatic speech recognition, and more particularly, to word models used in a speech recognition system.
  • ASR: automatic speech recognition
  • Speaking mode has been considered to reduce confusability by probabilistically weighting alternative pronunciations depending on the speaking style.
  • This approach uses pronunciation modeling and acoustic modeling based on a wide range of observables such as speaking rate; duration; and syllabic, syntactic, and semantic structure — contributing factors that are subsumed in the notion of speaking mode.
  • a phonetic transcription of relaxed informal speech is by its nature a simplification.
  • Pronunciation models implementing purely phonological mappings generate phonetic transcriptions that underspecify durational and spectral properties of speech.
  • Reduced variants as predicted by a pronunciation model ought to be phonetically homophonous (e.g., the fast variant of "support" pronounced as /s/p/o/r/t/ is phonetically homophonous with "sport").
  • Figure 1 illustrates a prefix search tree according to a representative embodiment of the present invention showing roots, nodes, leaves, and single phone word nodes (stubs).
  • Figure 2 illustrates the heap structure of a root node.
  • Figure 3 illustrates the heap structure of a leaf node.
  • Figure 4 illustrates the heap structure of a stub.
  • Embodiments of the present invention generalize phonetic speech transcription to an attribute-based representation that integrates supra-segmental non-phonetic features.
  • a pronunciation model is trained to augment an attribute transcription by marking possible pronunciation effects, which are then taken into account by an acoustic model induction algorithm.
  • a finite state machine, single-prefix-tree, one-pass, time-synchronous decoder is used to decode highly spontaneous speech within this new representational framework.
  • the notion of context is broadened from a purely phonetic concept to one based on a set of speech attributes.
  • the set of attributes incorporates various features and predictors such as dialect, gender, articulatory features (e.g. vowel, high, nasal, shifted, stress, reduced), word or syllable position (e.g. word begin/end, syllable boundary), word class (e.g. pause, function word), duration, speaking rate, fundamental frequency, HMM state (e.g. begin/middle/end state), etc.
  • This approach affects all levels of modeling within the recognition engine, from the way words are represented in the dictionary, through pronunciation modeling and duration modeling, to acoustic modeling. This leads to strategies to efficiently decode conversational speech within the mode dependent modeling framework.
  • a word is transcribed as a sequence of instances $(\iota_0, \iota_1, \ldots, \iota_k)$ which are bundles of instantiated attributes (i.e. attribute-value pairs).
  • Each attribute can be either binary, discrete (i.e. multi-valued), or continuous-valued.
  • the filled pause "um" is transcribed by a single instance $\iota$ consisting of truth values for the following binary attributes: (pause, nasal, voiced, labial).
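  • As a rough illustration (a hypothetical sketch, not the patent's own data structure), such an instance could be represented as a bundle of attribute-value pairs, with the "um" example above as a concrete value:

```cpp
// Hypothetical sketch: an instance as a bundle of attribute-value pairs,
// where each attribute is binary, discrete (multi-valued), or continuous,
// as described above. All names are illustrative assumptions.
#include <map>
#include <string>
#include <variant>

// A value is binary, discrete (multi-valued), or continuous.
using AttributeValue = std::variant<bool, int, double>;

// An instance: a bundle of instantiated attributes.
struct Instance {
    std::map<std::string, AttributeValue> attributes;
};

// The filled pause "um" as a single instance of binary attributes.
Instance makeUm() {
    Instance um;
    um.attributes["pause"]  = true;
    um.attributes["nasal"]  = true;
    um.attributes["voiced"] = true;
    um.attributes["labial"] = true;
    return um;
}
```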
  • the instance-based representation allows for a more detailed modeling of pronunciation effects as observed in sloppy informal speech. Instead of predicting an expected phonetic surface form based on a purely phonetic context, the canonical instance-based transcription is probabilistically augmented.
  • a pronunciation model predicts instances for the set of attributes.
  • the pronunciation model is trained to predict the pronunciation effect at each position of the transcription from its full context: $p(\iota'_k \mid \iota_0 \ldots \iota_{k-1}[\iota_k]\iota_{k+1} \ldots \iota_M)$.
  • Pronunciation variants are derived by augmenting the initial transcription by the predicted instances, $\iota_0\iota_1 \ldots \iota_k \mapsto (\iota_0 \otimes \iota'_0)(\iota_1 \otimes \iota'_1) \ldots (\iota_k \otimes \iota'_k)$, which are then weighted by the probability of the predicted effects, $\prod_k p(\iota'_k \mid \iota_0 \ldots [\iota_k] \ldots \iota_M)$.
  • Predicting pronunciation variation by augmenting the phonetic transcription by expected pronunciation effects avoids a potentially homophonous representation of variants (see, e.g., M. Finke and A. Waibel).
  • the original transcription is preserved, and the duration and acoustic model building processes exploit the augmented annotation.
  • Decision trees are grown to induce a set of context dependent duration and acoustic models.
  • the induction algorithm allows for questions with respect to all attributes defined in the transcription.
  • context dependent modeling means that the acoustic models derived depend on the phonetic context, pronunciation effects, and speaking mode-related attributes. This leads to a much tighter coupling of pronunciation modeling and acoustic modeling because model induction takes the pronunciation predictors into account as well as acoustic evidence.
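  • As a hypothetical sketch of such decision-tree based induction (the patent does not specify the implementation), each node can ask about any attribute at any context offset and map a context down to a unique model index:

```cpp
// Hypothetical sketch of decision-tree model induction over attributes:
// each node may question any attribute (phonetic, pronunciation effect, or
// speaking mode-related) at any context offset. Names and the Lookup
// signature are assumptions; internal nodes are assumed to have two children.
#include <functional>
#include <memory>
#include <string>

struct TreeNode {
    std::string attribute;  // e.g. "nasal", "reduced", "word_end"
    int offset = 0;         // context position relative to the current instance
    bool value = true;      // the answer that routes to the "yes" branch
    int modelIndex = -1;    // at a leaf: index of the context dependent model
    std::unique_ptr<TreeNode> yes, no;
};

// Answers: does `attribute` hold at context position `offset`?
using Lookup = std::function<bool(const std::string&, int)>;

// Walk the tree for one context and return the induced model's index.
int classify(const TreeNode& node, const Lookup& context) {
    if (!node.yes) return node.modelIndex;  // leaf reached
    const bool answer = context(node.attribute, node.offset) == node.value;
    return classify(answer ? *node.yes : *node.no, context);
}
```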
  • a corresponding large vocabulary continuous speech recognition (LVCSR) decoder should handle finite state grammar decoding, forced alignment of training transcripts, large vocabulary statistical grammar decoding, and lattice rescoring.
  • One typical embodiment uses a single-prefix-tree time-synchronous one-pass decoder that represents the underlying recognition grammar by an abstract finite state machine.
  • the dictionary is represented by a pronunciation prefix tree as described in H. Ney, R. Haeb-Umbach, B.-H. Tran, M. Oerder, "Improvement In Beam Search For 10000-word Continuous Speech Recognition," IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, pp. 9-12, 1992, incorporated herein by reference.
  • a priority heap may represent alternative linguistic theories in each node of the prefix tree as described in previously cited Alleva, Huang, and Hwang.
  • the heap maintains all contexts whose probabilities are within a certain threshold, thus avoiding following only the single best local history.
  • the threshold and the heap policy allow more or less aggressive search techniques by effectively controlling hypothesis merging.
  • compared to maintaining context dependent copies of the prefix tree, the heap approach is more dynamic and scalable.
  • the language model is implemented in the decoder as an abstract finite state machine.
  • the exact nature of the underlying grammar remains transparent to the recognizer.
  • the only means to interact with a respective language model is through the following set of functions, in which FSM is a finite state machine-based language model:
  • FSM.initial() Returns the initial state of the FSM.
  • FSM.arcs(state) Returns all arcs departing from a given state.
  • An arc consists of the input label (recognized word), the output label, the cost, and the next state. Finite state machines are allowed to be non-deterministic, i.e. there can be multiple arcs with the same input label.
  • FSM.cost(state) Returns the exit cost for a given state to signal whether or not a state is a final state.
  • the decoder may virtually add a self loop with a given cost term to each grammar state.
  • any number of filler words can be accepted/recognized at each state of the finite state machine.
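  • A minimal C++ sketch of the interface just described is given below; the three functions follow the text, but all concrete types and signatures are assumptions:

```cpp
// Hypothetical sketch of the abstract finite state machine interface used by
// the decoder. Only FSM.initial(), FSM.arcs(state), and FSM.cost(state) are
// given in the text; the concrete types here are illustrative assumptions.
#include <cstdint>
#include <string>
#include <vector>

struct Arc {
    std::string input;   // input label: the recognized word
    std::string output;  // output label
    float cost;          // arc cost (language model score)
    std::int64_t next;   // destination state
};

class LanguageModelFSM {
public:
    virtual ~LanguageModelFSM() = default;

    // Returns the initial state of the FSM.
    virtual std::int64_t initial() const = 0;

    // Returns all arcs departing from a state. The machine may be
    // non-deterministic: several arcs can carry the same input label.
    virtual std::vector<Arc> arcs(std::int64_t state) const = 0;

    // Returns the exit cost of a state, signaling whether it is final
    // (e.g., an effectively infinite cost for non-final states).
    virtual float cost(std::int64_t state) const = 0;
};
// The decoder can treat every state as if it also had a self loop with a
// fixed cost for filler words, so fillers are accepted at any state without
// changing the grammar itself.
```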
  • One typical embodiment provides a set of different instantiations of the finite state machine interfaces that are used in different contexts of training, testing or rescoring a recognizer:
  • Finite State Grammar Decoding may explicitly define a finite state grammar. Besides its use in command-and-control, this application can be used in training the recognizer.
  • In M. Finke and A. Waibel, "Flexible Transcription Alignment," IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 34-40, Santa Barbara, CA, December 1997, we showed that, when dealing with unreliable transcripts of training data, a significant gain in word accuracy can be achieved by training from probabilistic transcription graphs instead of the raw transcripts.
  • Typical embodiments allow for decoding of right recursive rule grammars by simulating an underlying heap to deal with recursion.
  • the transcription graphs of the Flexible Transcription Alignment (FTA) paradigm may be expressed in the decoder by a probabilistic rule grammar. Thus, forced alignment of training data is basically done through decoding these utterance grammars.
  • N-gram Decoding Statistical n-gram language models are not explicitly represented as a finite state machine. Instead, a finite state machine wrapper is built around the n-gram models. The state index encodes the history such that FSM.arcs(state) can retrieve all the language model scores required from the underlying n-gram tables. This implies that the FSM is not minimized and the state space is the vocabulary raised to one less than the order of the n-gram model.
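  • Building on the interface sketched above, a bigram model (order 2, so a state encodes a one-word history) could be wrapped roughly as follows; the table layout is an assumption, and back-off handling is omitted for brevity:

```cpp
// Hypothetical sketch using the LanguageModelFSM interface sketched earlier:
// a bigram wrapper whose state index encodes the one-word history, so that
// arcs(state) can fetch all required scores from the underlying n-gram table.
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

class BigramFSM : public LanguageModelFSM {
public:
    BigramFSM(std::vector<std::string> vocab,
              std::vector<std::vector<float>> bigramCost)
        : vocab_(std::move(vocab)), cost_(std::move(bigramCost)) {}

    // State 0 is used as the sentence-start history here (an assumption).
    std::int64_t initial() const override { return 0; }

    std::vector<Arc> arcs(std::int64_t state) const override {
        std::vector<Arc> out;
        for (std::size_t w = 0; w < vocab_.size(); ++w) {
            // The next state is just the index of the word consumed: the
            // un-minimized state space is the vocabulary to the power of
            // (order - 1).
            out.push_back({vocab_[w], vocab_[w],
                           cost_[static_cast<std::size_t>(state)][w],
                           static_cast<std::int64_t>(w)});
        }
        return out;
    }

    float cost(std::int64_t /*state*/) const override { return 0.0f; }

private:
    std::vector<std::string> vocab_;
    std::vector<std::vector<float>> cost_;
};
```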
  • Lattice Rescoring Lattices are finite state machines, too. So, rescoring a word graph using a different set of acoustic models and a different language model is feasible by decoding along lattices and on-the-fly composition of finite state machines.
  • language model lookahead uses the same finite state machine: for trigrams, the lookahead tree will thus be a trigram lookahead; for fourgrams, a fourgram lookahead; and for finite state grammars, the lookahead will be a projection of all words allowed at a given grammar state.
  • Lookahead trees may be saved in an aging cache as they are computed to avoid recomputing the tree in subsequent frames.
  • the size of the cache and the number of steps to compute the tree can be reduced by precomputing a new data structure from the prefix tree: the cost tree.
  • the cost tree represents the cost structure in a condensed form, and turns the rather expensive recursive procedure of finding the best score in the tree into an iterative algorithm.
  • Each heap element, hypothesis, or tree copy has the current FSM lookahead score attached. When a hypothesis is expanded to the next node and the corresponding lookahead tree has been removed from the cache, the tree will not be recomputed. Instead, the lookahead probability of the prefix is propagated forward ("lazy cache" evaluation).
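  • One plausible shape for such an aging cache, with the lazy fallback on a miss, is sketched below; the eviction policy and all names are assumptions:

```cpp
// Hypothetical sketch of an aging cache for lookahead trees. On a miss the
// caller keeps the lookahead score already attached to the hypothesis
// ("lazy cache" evaluation) instead of forcing a recomputation.
#include <cstdint>
#include <iterator>
#include <unordered_map>
#include <utility>

struct LookaheadTree { /* condensed cost tree for one grammar state */ };

class LookaheadCache {
public:
    explicit LookaheadCache(int maxAgeFrames) : maxAge_(maxAgeFrames) {}

    // Returns the cached tree, or nullptr on a miss (the caller then
    // propagates the prefix's previous lookahead probability forward).
    LookaheadTree* lookup(std::int64_t grammarState, int frame) {
        auto it = cache_.find(grammarState);
        if (it == cache_.end()) return nullptr;
        it->second.lastUsed = frame;  // refresh the entry's age
        return &it->second.tree;
    }

    void insert(std::int64_t grammarState, LookaheadTree tree, int frame) {
        cache_[grammarState] = Entry{std::move(tree), frame};
    }

    // Evict entries that have not been used for maxAge_ frames.
    void age(int frame) {
        for (auto it = cache_.begin(); it != cache_.end();)
            it = (frame - it->second.lastUsed > maxAge_) ? cache_.erase(it)
                                                         : std::next(it);
    }

private:
    struct Entry { LookaheadTree tree; int lastUsed = 0; };
    int maxAge_;
    std::unordered_map<std::int64_t, Entry> cache_;
};
```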
  • Typical embodiments use polyphonic within-word acoustic models, but triphone acoustic models across word boundaries.
  • context dependent root and leaf nodes are dealt with as follows: instead of having context dependent copies of the prefix tree, each root node may be represented as a set of models, one for each possible phonetic context. The hypotheses of these models are merged at the transition to the within word units (fan-in). As a compact means of representing the fan-in of root nodes and the corresponding fan-out of leaf nodes, the notion of a multiplexer was developed.
  • a multiplexer is a dual map that maps an instance $\iota$ to the index of a unique hidden Markov model for the context of $\iota$: $\mathrm{mpx}(\iota) : \iota \mapsto j \in \{0, 1, \ldots, N_{\mathrm{mpx}}\}$ and $\mathrm{mpx}[\iota] : \iota \mapsto m \in \{m_0, m_1, \ldots, m_N\}$, where $m_0, m_1, \ldots, m_N$ are unique models.
  • the set of multiplexer models can be precomputed based on the acoustic modeling decision tree and the dictionary of the recognizer.
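  • A minimal sketch of such a multiplexer, assuming for brevity that a context instance can be keyed by a string, might look like:

```cpp
// Hypothetical sketch of a multiplexer as a dual map: mpx(i) yields the index
// of the unique HMM for context instance i, and mpx[i] yields the model
// itself. Keying instances by a string is an assumption for illustration.
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

struct HiddenMarkovModel { /* context dependent acoustic model */ };

class Multiplexer {
public:
    // mpx(i): instance -> index of its unique model.
    std::size_t index(const std::string& instance) const {
        return contextToIndex_.at(instance);
    }

    // mpx[i]: instance -> the unique model itself.
    const HiddenMarkovModel& model(const std::string& instance) const {
        return models_[index(instance)];
    }

    // Both tables are precomputed from the acoustic modeling decision tree
    // and the dictionary of the recognizer.
    std::unordered_map<std::string, std::size_t> contextToIndex_;
    std::vector<HiddenMarkovModel> models_;
};
```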
  • Figure 1 shows the general organization of a multiplexer-based prefix search tree with various types of nodes, including a root node 10, an internal node 12, a leaf node 14, and a single phone word node 16 (also called a stub).
  • To model conversational speech, multiplexers are particularly useful since the augmented attribute representation of words leads to an explosion in the number of crossword contexts. Because multiplexers map to unique model indices, they effectively implement a compression of the fan-in/out and a technique to address the context dependent model by the context instance $\iota$.
  • the heap structure of a root node, 10 in Fig. 1, is shown in Figure 2.
  • the root node 10 represents the first attribute instance of words in terms of a corresponding multiplexer 20.
  • the structure also includes for each node a finite state machine grammar state 22 and corresponding state index 24.
  • Cost structure 26 contains a finite state machine lookahead score for the node.
  • For every word there is a leaf node, 14 in Fig. 1, the heap structure of which is illustrated by Figure 3.
  • a multiplexer describes the leaf node fan-out, and each heap element represents the complete fan-out for a given grammar state.
  • Figure 4 illustrates the heap structure for a single-phone instance word node or stub, 16 in Fig. 1. Words consisting of only one phone are represented by a multiplexer of multiplexers. Depending on the left context of the word, this stub multiplexer returns a multiplexer representing the right-context dependent fan-out of this word.
  • the heap policy is the same as for root nodes, and each heap element represents the complete fan-out as for leaf nodes.
  • two heap related controls may also be used: (1) the maximum number of heap elements can be bounded, and (2) there can be a beam to prune hypotheses within a heap against each other.
  • the number of finite state machine states expanded at each time t can be constrained as well (topN threshold).
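  • The following simplified sketch shows how the heap bound and the in-heap beam could combine (a real decoder would prune incrementally rather than sort, and would apply the topN threshold across grammar states separately):

```cpp
// Hypothetical sketch of two of the heap controls described above: (1) a
// bound on the number of heap elements and (2) a beam pruning hypotheses
// within a heap against each other.
#include <algorithm>
#include <cstddef>
#include <vector>

struct Hypothesis {
    float score;  // accumulated cost; lower is better
    // grammar state, linguistic history, ... omitted for brevity
};

void pruneHeap(std::vector<Hypothesis>& heap,
               std::size_t maxElements, float beam) {
    if (heap.empty()) return;
    std::sort(heap.begin(), heap.end(),
              [](const Hypothesis& a, const Hypothesis& b) {
                  return a.score < b.score;
              });
    // (1) bound the maximum number of heap elements
    if (heap.size() > maxElements) heap.resize(maxElements);
    // (2) beam: prune hypotheses within the heap against the best one
    const float best = heap.front().score;
    heap.erase(std::remove_if(heap.begin(), heap.end(),
                              [&](const Hypothesis& h) {
                                  return h.score > best + beam;
                              }),
               heap.end());
}
```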
  • Acoustic model evaluation is sped up by means of Gaussian selection through Bucket Box Intersection (BBI) and by Dynamic Frame Skipping (DFS).
  • for DFS, a threshold on the Euclidean distance between successive feature vectors is defined to trigger reevaluation of the acoustics. To avoid skipping too many consecutive frames, only one skip at a time may be taken: after skipping one frame, the next one must be evaluated.
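  • A sketch of the DFS skip decision, under the assumption just stated that the distance is measured between consecutive feature vectors:

```cpp
// Hypothetical sketch of the Dynamic Frame Skipping decision: skip (reuse
// the previous acoustic scores) only when consecutive frames are close in
// Euclidean distance, and never skip twice in a row.
#include <cmath>
#include <cstddef>
#include <vector>

bool shouldSkipFrame(const std::vector<float>& previous,
                     const std::vector<float>& current,
                     float threshold, bool skippedLastFrame) {
    if (skippedLastFrame) return false;  // at most one skip at a time
    float sumSq = 0.0f;
    for (std::size_t i = 0; i < current.size(); ++i) {
        const float d = current[i] - previous[i];
        sumSq += d * d;
    }
    // Below the threshold: no reevaluation of the acoustics is triggered.
    return std::sqrt(sumSq) < threshold;
}
```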
  • an evaluation test started from a Switchboard recognizer trained on human-to-human telephone speech.
  • the acoustic front end computed 42-dimensional feature vectors consisting of 13 mel-frequency cepstral coefficients plus log power and their first and second derivatives. Cepstral mean and variance normalization as well as vocal tract length normalization were used to compensate for channel and speaker variation.
  • the recognizer consisted of 8000 pentaphonic Gaussian mixture models. A 15k word recognition vocabulary and approximately 30k dictionary variants generated by a mode dependent pronunciation model were used for decoding.
  • Embodiments of the invention may be implemented in any conventional computer programming language. For example, preferred embodiments may be implemented in a procedural programming language (e.g., "C") or an object-oriented programming language (e.g., "C++"). Alternative embodiments of the invention may be implemented as pre-programmed hardware elements, other related components, or as a combination of hardware and software components. Embodiments can be implemented as a computer program product for use with a computer system.
  • Such implementation may include a series of computer instructions fixed either on a tangible medium, such as a computer readable medium (e.g., a diskette, CD-ROM, ROM, or fixed disk) or transmittable to a computer system, via a modem or other interface device, such as a communications adapter connected to a network over a medium.
  • the medium may be either a tangible medium (e.g., optical or analog communications lines) or a medium implemented with wireless techniques (e.g., microwave, infrared or other transmission techniques).
  • the series of computer instructions embodies all or part of the functionality previously described herein with respect to the system. Those skilled in the art should appreciate that such computer instructions can be written in a number of programming languages for use with many computer architectures or operating systems.
  • Such instructions may be stored in any memory device, such as semiconductor, magnetic, optical or other memory devices, and may be transmitted using any communications technology, such as optical, infrared, microwave, or other transmission technologies.
  • a computer program product may be distributed as a removable medium with accompanying printed or electronic documentation (e.g., shrink wrapped software), preloaded with a computer system (e.g., on system ROM or fixed disk), or distributed from a server or electronic bulletin board over the network (e.g., the Internet or World Wide Web).
  • some embodiments of the invention may be implemented as a combination of both software (e.g., a computer program product) and hardware. Still other embodiments of the invention are implemented as entirely hardware, or entirely software (e.g., a computer program product).

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention concerns an attribute-based speech recognition system. A speech pre-processor receives speech input and produces a sequence of acoustic observations representative of that speech input. A database of context dependent acoustic models characterizes the probability that a given sequence of sounds produces the sequence of acoustic observations. Each acoustic model includes phonetic attributes and suprasegmental non-phonetic attributes. A type 3 language model characterizes the probability that a given sequence of words is spoken. A one-pass decoder compares the sequence of acoustic observations against the acoustic models and the language model to output at least one word sequence representative of the speech input.
PCT/IB2000/001539 1999-10-06 2000-10-06 Attribute-based word modeling WO2001026092A2 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU79383/00A AU7938300A (en) 1999-10-06 2000-10-06 Attribute-based word modeling

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US15787499P 1999-10-06 1999-10-06
US60/157,874 1999-10-06

Publications (2)

Publication Number Publication Date
WO2001026092A2 (fr) 2001-04-12
WO2001026092A3 WO2001026092A3 (fr) 2003-05-22

Family

ID=22565650

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2000/001539 WO2001026092A2 (fr) 1999-10-06 2000-10-06 Attribute-based word modeling

Country Status (2)

Country Link
AU (1) AU7938300A (fr)
WO (1) WO2001026092A2 (fr)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003005345A1 (fr) * 2001-07-05 2003-01-16 Speech recognition with dynamic grammars
EP1450350A1 (fr) * 2003-02-20 2004-08-25 Method of speech recognition using attributes
EP1528539A1 (fr) * 2003-10-30 2005-05-04 System and method for using meta-data in language modeling
US7149688B2 (en) 2002-11-04 2006-12-12 Speechworks International, Inc. Multi-lingual speech recognition with cross-language context modeling
GB2453366A (en) * 2007-10-04 2009-04-08 Toshiba Res Europ Ltd Automatic speech recognition method and apparatus
EP2096630A1 (fr) * 2006-12-08 2009-09-02 NEC Corporation Audio recognition device and audio recognition method
US11153472B2 (en) 2005-10-17 2021-10-19 Cutting Edge Vision, LLC Automatic upload of pictures from a camera

Non-Patent Citations (13)

* Cited by examiner, † Cited by third party
Title
Anastasakos, A., et al., "Duration modeling in large vocabulary speech recognition," International Conference on Acoustics, Speech, and Signal Processing, vol. 1, 9 May 1995, pages 628-631, XP002123829 *
Byrne, W., et al., "Pronunciation modelling using a hand-labelled corpus for conversational speech recognition," IEEE International Conference on Acoustics, Speech and Signal Processing, New York, NY: IEEE, vol. Conf. 23, 12 May 1998, pages 313-316, XP000854578, ISBN: 0-7803-4429-4 *
Database INSPEC [Online], Institute of Electrical Engineers, Stevenage, GB; Delmonte, R., "Linguistic tools for speech recognition and understanding," Database accession no. 4199465, XP002159657; & Speech Recognition and Understanding: Recent Advances, Trends and Applications, Proceedings of the NATO Advanced Study Institute, Cetraro, Italy, 1-13 July 1990, pages 481-485, 1992, Berlin, Germany: Springer-Verlag, ISBN: 3-540-54032-6 *
Erler, K., et al., "HMM representation of quantized articulatory features for recognition of highly confusible words," Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), New York: IEEE, vol. Conf. 17, 23 March 1992, pages 545-548, XP000341204, ISBN: 0-7803-0532-9 *
Finke, M., et al., "Modeling and efficient decoding of large vocabulary conversational speech," Eurospeech '99, vol. 1, 5-9 September 1999, pages 467-470, Budapest, Hungary, XP002168070 *
Llorens, D., et al., "Acoustic and syntactical modeling in the ATROS system," ICASSP, Phoenix, AZ, 15-19 March 1999, New York, NY: IEEE, pages 641-644, XP000900202, ISBN: 0-7803-5042-1 *
Mergel, D., et al., "Construction of language models for spoken database queries," International Conference on Acoustics, Speech & Signal Processing (ICASSP), New York: IEEE, vol. Conf. 12, 1 April 1987, pages 844-847, XP000758092 *
Myoung-Wan Koo, et al., "A new decoder based on a generalized confidence score," International Conference on Acoustics, Speech, and Signal Processing, vol. 1, 1998, pages 213-216, XP002123828 *
Ney, H., et al., "Dynamic programming search for continuous speech recognition," IEEE Signal Processing Magazine, vol. 16, no. 5, September 1999, pages 64-83, XP002159654, ISSN: 1053-5888 *
Renals, S., et al., "Start-synchronous search for large vocabulary continuous speech recognition," IEEE Transactions on Speech and Audio Processing, vol. 7, no. 5, September 1999, pages 542-553, XP002159651, ISSN: 1063-6676 *
Suaudeau, N., et al., "An efficient combination of acoustic and supra-segmental informations in a speech recognition system," ICASSP-94: 1994 IEEE International Conference on Acoustics, Speech and Signal Processing, Adelaide, SA, Australia, 19-22 April 1994, pages I/65-68, vol. 1, New York, NY: IEEE, XP002159652, ISBN: 0-7803-1775-0 *
Wagner, M., "Speaker characteristics in speech and speaker recognition," Proceedings of IEEE TENCON '97: IEEE Region 10 Annual Conference on Speech and Image Technologies for Computing and Telecommunications, Brisbane, Australia, 1997, page 626, vol. 2, New York, NY: IEEE, XP002159653, ISBN: 0-7803-4365-4 *
Wang, H.-M., et al., "Complete recognition of continuous Mandarin speech for Chinese language with very large vocabulary but limited training data," Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), New York: IEEE, 9 May 1995, pages 61-64, XP000657931, ISBN: 0-7803-2432-3 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003005345A1 (fr) * 2001-07-05 2003-01-16 Speech recognition with dynamic grammars
US7149688B2 (en) 2002-11-04 2006-12-12 Speechworks International, Inc. Multi-lingual speech recognition with cross-language context modeling
EP1450350A1 (fr) * 2003-02-20 2004-08-25 Method of speech recognition using attributes
EP1528539A1 (fr) * 2003-10-30 2005-05-04 System and method for using meta-data in language modeling
US7996224B2 (en) 2003-10-30 2011-08-09 At&T Intellectual Property Ii, L.P. System and method of using meta-data in speech processing
US11818458B2 (en) 2005-10-17 2023-11-14 Cutting Edge Vision, LLC Camera touchpad
US11153472B2 (en) 2005-10-17 2021-10-19 Cutting Edge Vision, LLC Automatic upload of pictures from a camera
US8706487B2 (en) 2006-12-08 2014-04-22 Nec Corporation Audio recognition apparatus and speech recognition method using acoustic models and language models
EP2096630A1 (fr) * 2006-12-08 2009-09-02 NEC Corporation Audio recognition device and audio recognition method
EP2096630A4 (fr) * 2006-12-08 2012-03-14 Nec Corp Audio recognition device and audio recognition method
GB2453366B (en) * 2007-10-04 2011-04-06 Toshiba Res Europ Ltd Automatic speech recognition method and apparatus
US8311825B2 (en) 2007-10-04 2012-11-13 Kabushiki Kaisha Toshiba Automatic speech recognition method and apparatus
GB2453366A (en) * 2007-10-04 2009-04-08 Toshiba Res Europ Ltd Automatic speech recognition method and apparatus

Also Published As

Publication number Publication date
AU7938300A (en) 2001-05-10
WO2001026092A3 (fr) 2003-05-22

Similar Documents

Publication Publication Date Title
US6963837B1 (en) Attribute-based word modeling
Young A review of large-vocabulary continuous-speech recognition
Woodland et al. The 1994 HTK large vocabulary speech recognition system
EP1575030B1 (fr) Apprentissage de la prononciation de nouveaux mots utilisant un graphe de prononciation
Stolcke et al. The SRI March 2000 Hub-5 conversational speech transcription system
US5241619A (en) Word dependent N-best search method
EP1960997B1 (fr) Systeme de reconnaissance vocale a vaste vocabulaire
Seymore et al. The 1997 CMU Sphinx-3 English broadcast news transcription system
Ward Extracting information in spontaneous speech.
Lee et al. Improved acoustic modeling for large vocabulary continuous speech recognition
US20050187758A1 (en) Method of Multilingual Speech Recognition by Reduction to Single-Language Recognizer Engine Components
Matsoukas et al. Advances in transcription of broadcast news and conversational telephone speech within the combined EARS BBN/LIMSI system
US5819221A (en) Speech recognition using clustered between word and/or phrase coarticulation
Alghamdi et al. Arabic broadcast news transcription system
Hain et al. Automatic transcription of conversational telephone speech
Finke et al. Modeling and efficient decoding of large vocabulary conversational speech.
Aubert One pass cross word decoding for large vocabularies based on a lexical tree search organization
Lee et al. Improved acoustic modeling for continuous speech recognition
WO2001026092A2 (fr) Attribute-based word modeling
Ney et al. Dynamic programming search strategies: From digit strings to large vocabulary word graphs
Fosler-Lussier et al. On the road to improved lexical confusability metrics
Gauvain et al. Large vocabulary speech recognition based on statistical methods
Chen et al. Large vocabulary word recognition based on tree-trellis search
Steinbiss et al. Continuous speech dictation—From theory to practice
Elshafei et al. Speaker-independent natural Arabic speech recognition system

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AU CA JP

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase in:

Ref country code: JP