EP1590795A1 - Generation and deletion of pronunciation variants to reduce the word error rate in speech recognition - Google Patents

Generation and deletion of pronunciation variants to reduce the word error rate in speech recognition

Info

Publication number
EP1590795A1
Authority
EP
European Patent Office
Prior art keywords
pronunciation
variants
pronunciation variants
word
recognition
Prior art date
Legal status
Withdrawn
Application number
EP04704214A
Other languages
German (de)
English (en)
Inventor
Tobias Schneider
Andreas Schröer
Michael Wandinger
Günter DI STEINMASSL
Current Assignee
Unify GmbH and Co KG
Original Assignee
Siemens AG
Priority date
Filing date
Publication date
Application filed by Siemens AG
Publication of EP1590795A1

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • G10L2015/0635Training updating or merging of old and new templates; Mean values; Weighting
    • G10L2015/0636Threshold criteria for the updating

Definitions

  • For all words belonging to the vocabulary, the corresponding phoneme sequences must be known. These phoneme sequences are entered in the vocabulary. During the actual recognition process, the Viterbi algorithm then searches for the best path through the given phoneme sequences that correspond to the words. If more than mere single-word recognition is required, probabilities for transitions between words can also be modeled and included in the Viterbi algorithm.
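The best-path search described above can be sketched minimally in Python. This is an illustrative left-to-right model with self-loops and no skip transitions, not the patent's implementation; the frame-wise phoneme log-probabilities (`frame_logprobs`) are an assumed input that a real acoustic front end would produce.

```python
def viterbi_score(frame_logprobs, phonemes):
    """Best-path log score of one phoneme sequence against a frame sequence.

    frame_logprobs: list of dicts mapping phoneme -> log-probability per frame
    phonemes: the phoneme sequence of one vocabulary entry
    Each state may loop on itself or advance to the next phoneme.
    """
    NEG = float("-inf")
    n = len(phonemes)
    prev = [NEG] * n
    prev[0] = frame_logprobs[0].get(phonemes[0], NEG)
    for frame in frame_logprobs[1:]:
        cur = [NEG] * n
        for s in range(n):
            # best predecessor: stay in state s or enter from state s-1
            best = prev[s] if s == 0 else max(prev[s], prev[s - 1])
            emit = frame.get(phonemes[s], NEG)
            if best > NEG and emit > NEG:
                cur[s] = best + emit
        prev = cur
    return prev[-1]  # score of paths that end in the last phoneme

def recognize(frame_logprobs, vocabulary):
    """Return the word whose best pronunciation variant scores highest."""
    return max(
        vocabulary,
        key=lambda w: max(viterbi_score(frame_logprobs, v) for v in vocabulary[w]),
    )
```

Each word may list several pronunciation variants; the word score is the maximum over its variants, which is how additional variants can lower the word error rate.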
  • The phonetic model is inferred from the orthographic spelling by predefined rules or by statistical approaches. Since a written word is pronounced differently in different languages, several pronunciation variants can be generated in the vocabulary for each word; the literature also describes numerous methods for generating pronunciation variants. A large number of pronunciation variants in turn reduces the word error rate.
  • speech recognition systems are adapted to their respective users.
  • Word models can be adapted through transformation, such as maximum likelihood linear regression (MLLR), or through model parameter prediction, such as regression model prediction (RMP) or maximum a posteriori (MAP) estimation. These approaches adapt the acoustic modeling of the feature space on which the word models are based, which is available, for example, as a hidden Markov model (HMM).
  • The speech recognizer is thus changed from a speaker-independent to a speaker-dependent system.
  • The complexity, i.e. the storage space consumption, increases with the number of possible words in the speech recognizer.
  • The object of the invention is to provide speech recognition with a reduced word error rate that is particularly adaptable and consumes very few resources.
  • In a method for speech recognition, several pronunciation variants for a word to be recognized are stored, for example in the memory of a device set up for the method. Alternatively or in addition, these multiple pronunciation variants can be generated and added to the vocabulary. Each time a word is recognized, it is registered which pronunciation variant of the word was recognized. After several recognition processes, the pronunciation variants are evaluated based on the number of times each variant was recognized.
  • The recognition frequency is used here as the simplest and least resource-consuming criterion. More sophisticated assessment methods are of course also conceivable, in which, for example, the degree of correspondence between the utterance to be recognized and the pronunciation variant recognized in each case is also taken into account.
  • The method can work with existing words stored in the vocabulary. However, it gains a decisive advantage if the word models can additionally or alternatively be expanded dynamically: when a new word is added to the vocabulary, several pronunciation variants of the new word are automatically generated and also added to the vocabulary.
  • Pronunciation variants for a word can be generated, for example, by phoneme replacement, phoneme deletion and/or phoneme insertion.
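A minimal sketch of variant generation by phoneme replacement, deletion and insertion. The `similar` phoneme table and the `@` filler (a schwa-like insertion) are hypothetical choices for illustration, not part of the patent:

```python
def generate_variants(phonemes, similar, filler="@", limit=10):
    """Generate pronunciation variants of one phoneme sequence.

    similar: dict mapping a phoneme to acoustically close phonemes
             (an assumed, hand-made table in this sketch)
    limit:   maximum number of variants, to bound memory use
    """
    variants = []
    seen = {tuple(phonemes)}  # never emit the canonical form itself

    def add(candidate):
        key = tuple(candidate)
        if key not in seen and len(variants) < limit:
            seen.add(key)
            variants.append(list(candidate))

    # replacement: swap one phoneme for a similar one
    for i, p in enumerate(phonemes):
        for q in similar.get(p, ()):
            add(phonemes[:i] + [q] + phonemes[i + 1:])
    # deletion: drop one phoneme (common in fast speech)
    if len(phonemes) > 1:
        for i in range(len(phonemes)):
            add(phonemes[:i] + phonemes[i + 1:])
    # insertion: add a filler phoneme between existing phonemes
    for i in range(1, len(phonemes)):
        add(phonemes[:i] + [filler] + phonemes[i:])
    return variants
```

The `limit` parameter reflects the idea below of generating a maximum number of variants per word to make optimal use of the available memory.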
  • Pronunciation variants can also be created, for example, by adding noise to the spoken signal (signal in the broader sense, i.e. speech, feature, or phoneme chain).
  • A further pronunciation variant for the spoken word can be generated from the utterance itself upon its recognition.
  • a particularly good utilization of the available memory can be achieved if a maximum number of pronunciation variants is generated for several words.
  • Another important aspect of the invention relates to the evaluation of the pronunciation variants.
  • the method advantageously saves storage space if the number of stored pronunciation variants is reduced on the basis of the evaluation of the pronunciation variants. This can be achieved, for example, by deleting pronunciation variants that are recognized less frequently.
  • Pronunciation variants whose confidence lies below a threshold value are preferably deleted.
  • the speech recognizer can still be kept speaker-independent if the requirement is also set that the canonical pronunciation variant of the word is never deleted.
  • a device that is set up to carry out the method described above can be implemented, for example, by providing means by which one or more method steps can be carried out in each case.
  • Advantageous configurations of the device result analogously to the advantageous configurations of the method.
  • a program product for a data processing system which contains code sections with which one of the described methods can be carried out on the data processing system, can be implemented by suitable implementation of the method in a programming language and translation into code executable by the data processing system.
  • The code sections are stored for this purpose. A program product is understood to mean the program as a tradable product. It can be in any form, for example on paper, on a computer-readable data medium, or distributed over a network.
  • the proposed method is based on a dynamic expansion of the word models in combination with an assessment of the pronunciation variants.
  • The available memory is used optimally by generating a maximum number of pronunciation variants.
  • After each recognition run, an assessment of the pronunciation variants is carried out.
  • These confidence values are in each case added to the confidence values already accumulated for the pronunciation variants in previous recognition runs. A simple "Boolean" confidence is the value 1 for the pronunciation variant that was referenced in this recognition, and the value 0 for all other variants.
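The Boolean confidence accumulation can be sketched as follows; the class name and interface are invented for illustration:

```python
from collections import defaultdict

class ConfidenceLedger:
    """Accumulates per-variant confidence over recognition runs.

    With Boolean confidence, the referenced variant receives 1 per run,
    while all other variants of the word receive 0 (i.e. stay unchanged).
    """

    def __init__(self):
        self.total = defaultdict(float)

    def register(self, word_variants, referenced, confidence=1.0):
        # touch every variant so unreferenced ones are tracked at 0
        for v in word_variants:
            self.total[v] += confidence if v == referenced else 0.0
```

A graded confidence measure (e.g. an acoustic score) could be passed via `confidence` instead of the Boolean 1/0 without changing the bookkeeping.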
  • A recognition error can be detected, among other things, from the reaction of the user: for example, the user repeats the utterance or aborts a command initiated by voice.
  • A further pronunciation variant for the spoken word can be generated from the utterance upon its recognition. Here it must again be ensured that no recognition error has occurred. This step can also take place unnoticed by the user.
  • The accumulated confidence generated for each pronunciation variant is then used to reduce the vocabulary again at a given point in time. This is done by deleting those vocabulary entries whose accumulated confidence lies below a certain threshold. These entries are generally pronunciation variants that have never or only rarely been referenced and are therefore not relevant for a recognition run.
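The threshold-based pruning, including the rule stated above that the canonical variant is never deleted, might look like this (a sketch; names are assumptions):

```python
def prune_vocabulary(entries, confidence, threshold, canonical):
    """Keep variants whose accumulated confidence reaches the threshold.

    entries:    list of pronunciation-variant identifiers in the vocabulary
    confidence: dict of accumulated confidence per variant
    canonical:  set of canonical variants, which are never deleted so the
                recognizer stays usable for new speakers
    """
    return [
        v for v in entries
        if confidence.get(v, 0.0) >= threshold or v in canonical
    ]
```

After pruning, the search space of the recognizer shrinks accordingly, which both accelerates recognition and makes the remaining entries easier to discriminate.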
  • The adaptation does not take place at the level of acoustic modeling (for example the HMM). Instead, it is achieved by selecting one or more pronunciation variants. This selection depends on the referencing in the successful recognition runs. The original, canonical variant obtained by typein is retained.
  • Deleting pronunciation variants increases the reliability of recognition or rejection, since the remaining entries, that is to say the adapted models, are generally easier to discriminate. At the same time, recognition is accelerated as the vocabulary becomes smaller.
  • word entries in the vocabulary are defined by their phoneme sequence or by a status sequence.
  • pronunciation variants can be generated by adding noise to the speech data.
  • Another way of creating variants is to modify the phoneme or state sequence obtained. This can be done, for example, with the help of random factors or with user-specific information, such as a confusion matrix from the last recognition runs.
  • a confusion matrix can be created, for example, by a second recognition run with phonemes.
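One way to derive such a confusion matrix is a minimum-edit-distance alignment between the expected phoneme sequence and the output of the phoneme-level recognition run. The following sketch counts substitutions only (insertions and deletions are skipped for brevity), and the function names are assumptions:

```python
def align_and_count(reference, hypothesis, matrix=None):
    """Update a phoneme confusion matrix from an edit-distance alignment.

    reference:  expected phoneme sequence of the word
    hypothesis: phoneme sequence from a second, phoneme-level run
    matrix:     dict mapping (reference_phoneme, recognized_phoneme)
                to a substitution count
    """
    if matrix is None:
        matrix = {}
    m, n = len(reference), len(hypothesis)
    # standard edit-distance DP table
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # match/substitution
    # backtrace, counting substitutions (confusions) only
    i, j = m, n
    while i > 0 and j > 0:
        cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
        if d[i][j] == d[i - 1][j - 1] + cost:
            if cost:
                pair = (reference[i - 1], hypothesis[j - 1])
                matrix[pair] = matrix.get(pair, 0) + 1
            i, j = i - 1, j - 1
        elif d[i][j] == d[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return matrix
```

Frequently confused phoneme pairs in this matrix are then natural candidates for the phoneme replacements used when generating new variants.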
  • Typein is used to infer the phoneme sequence from the orthographic spelling.
  • When assigning graphemes to phonemes, statistical methods are known which, in addition to the most likely phoneme sequence, also supply alternative phoneme sequences.
  • Herr Meier was called ten times by voice command.
  • The five variants were referenced as follows, which corresponds to the Boolean confidence already mentioned:
  • Variant 1.2: 6
  • Variant 1.3: 0
  • The vocabulary is thus reduced by more than half.
  • The effort for speech recognition (the search) is reduced to the same extent.
  • Variant 2.1: / f r aU m a r t i n /
  • Variant 2.2: / f r aU m A t n /
  • Herr Meier is called three times, Frau Martin five times by voice command.
  • The five variants are assessed with confidence values as follows.
  • A criterion is used here, i.e. a confidence measure that allows a statement about the reliability of the spoken utterance for each variant:
  • Variant 1.2: / h E r m aI er /; Original 2: / f r aU m a r t e ⁇ /

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method for speech recognition which is based on a dynamic expansion of the word models combined with an evaluation of the pronunciation variants.
EP04704214A 2003-02-04 2004-01-22 Generation et suppression de variantes de prononciation pour diminuer le taux de mots errones en reconnaissance vocale Withdrawn EP1590795A1 (fr)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
DE10304460A DE10304460B3 (de) 2003-02-04 2003-02-04 Generieren und Löschen von Aussprachevarianten zur Verringerung der Wortfehlerrate in der Spracherkennung
DE10304460 2003-02-04
PCT/EP2004/000527 WO2004070702A1 (fr) 2003-02-04 2004-01-22 Generation et suppression de variantes de prononciation pour diminuer le taux de mots errones en reconnaissance vocale

Publications (1)

Publication Number Publication Date
EP1590795A1 true EP1590795A1 (fr) 2005-11-02

Family

ID=31502580

Family Applications (1)

Application Number Title Priority Date Filing Date
EP04704214A Withdrawn EP1590795A1 (fr) 2003-02-04 2004-01-22 Generation et suppression de variantes de prononciation pour diminuer le taux de mots errones en reconnaissance vocale

Country Status (4)

Country Link
US (1) US20060143008A1 (fr)
EP (1) EP1590795A1 (fr)
DE (1) DE10304460B3 (fr)
WO (1) WO2004070702A1 (fr)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7280963B1 (en) * 2003-09-12 2007-10-09 Nuance Communications, Inc. Method for learning linguistically valid word pronunciations from acoustic data
US7624013B2 (en) * 2004-09-10 2009-11-24 Scientific Learning Corporation Word competition models in voice recognition
US7533018B2 (en) * 2004-10-19 2009-05-12 Motorola, Inc. Tailored speaker-independent voice recognition system
GB2424742A (en) * 2005-03-31 2006-10-04 Ibm Automatic speech recognition
US7983914B2 (en) * 2005-08-10 2011-07-19 Nuance Communications, Inc. Method and system for improved speech recognition by degrading utterance pronunciations
TW200926142A (en) * 2007-12-12 2009-06-16 Inst Information Industry A construction method of English recognition variation pronunciation models
US9275640B2 (en) * 2009-11-24 2016-03-01 Nexidia Inc. Augmented characterization for speech recognition
US9177545B2 (en) * 2010-01-22 2015-11-03 Mitsubishi Electric Corporation Recognition dictionary creating device, voice recognition device, and voice synthesizer
US9837070B2 (en) * 2013-12-09 2017-12-05 Google Inc. Verification of mappings between phoneme sequences and words
US9747897B2 (en) * 2013-12-17 2017-08-29 Google Inc. Identifying substitute pronunciations
DK179496B1 (en) 2017-05-12 2019-01-15 Apple Inc. USER-SPECIFIC Acoustic Models
US11043213B2 (en) * 2018-12-07 2021-06-22 Soundhound, Inc. System and method for detection and correction of incorrectly pronounced words
CN110277090B (zh) * 2019-07-04 2021-07-06 思必驰科技股份有限公司 用户个人的发音词典模型的自适应修正方法及系统

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE3931638A1 (de) * 1989-09-22 1991-04-04 Standard Elektrik Lorenz Ag Verfahren zur sprecheradaptiven erkennung von sprache
JPH0772840B2 (ja) * 1992-09-29 1995-08-02 日本アイ・ビー・エム株式会社 音声モデルの構成方法、音声認識方法、音声認識装置及び音声モデルの訓練方法
US5899973A (en) * 1995-11-04 1999-05-04 International Business Machines Corporation Method and apparatus for adapting the language model's size in a speech recognition system
US6076053A (en) * 1998-05-21 2000-06-13 Lucent Technologies Inc. Methods and apparatus for discriminative training and adaptation of pronunciation networks
US6208964B1 (en) * 1998-08-31 2001-03-27 Nortel Networks Limited Method and apparatus for providing unsupervised adaptation of transcriptions
US6535849B1 (en) * 2000-01-18 2003-03-18 Scansoft, Inc. Method and system for generating semi-literal transcripts for speech recognition systems
US7181395B1 (en) * 2000-10-27 2007-02-20 International Business Machines Corporation Methods and apparatus for automatic generation of multiple pronunciations from acoustic data
EP1233406A1 (fr) * 2001-02-14 2002-08-21 Sony International (Europe) GmbH Reconnaissance de la parole adaptée aux locuteurs étrangers
DE10119284A1 (de) * 2001-04-20 2002-10-24 Philips Corp Intellectual Pty Verfahren und System zum Training von jeweils genau einer Realisierungsvariante eines Inventarmusters zugeordneten Parametern eines Mustererkennungssystems
US6925154B2 (en) * 2001-05-04 2005-08-02 International Business Machines Corproation Methods and apparatus for conversational name dialing systems

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO2004070702A1 *

Also Published As

Publication number Publication date
DE10304460B3 (de) 2004-03-11
WO2004070702A1 (fr) 2004-08-19
US20060143008A1 (en) 2006-06-29

Similar Documents

Publication Publication Date Title
DE60302407T2 (de) Umgebungs- und sprecheradaptierte Spracherkennung
DE112010005959B4 (de) Verfahren und System zur automatischen Erkennung eines Endpunkts einer Tonaufnahme
DE69818231T2 (de) Verfahren zum diskriminativen training von spracherkennungsmodellen
DE10306022B3 (de) Dreistufige Einzelworterkennung
JP3990136B2 (ja) 音声認識方法
EP1084490B1 (fr) Dispositif et procede de reconnaissance d'un vocabulaire predetermine dans une parole au moyen d'un ordinateur
DE10304460B3 (de) Generieren und Löschen von Aussprachevarianten zur Verringerung der Wortfehlerrate in der Spracherkennung
DE60318385T2 (de) Sprachverarbeitungseinrichtung und -verfahren, aufzeichnungsmedium und programm
DE10119284A1 (de) Verfahren und System zum Training von jeweils genau einer Realisierungsvariante eines Inventarmusters zugeordneten Parametern eines Mustererkennungssystems
DE60018696T2 (de) Robuste sprachverarbeitung von verrauschten sprachmodellen
EP1199704A2 (fr) Sélection d'une séquence alternative de mots pour une adaptation discriminante
DE10040063A1 (de) Verfahren zur Zuordnung von Phonemen
WO2005088607A1 (fr) Determination de seuils de fiabilite et de rejet avec adaptation a l'utilisateur et au vocabulaire
WO2001086634A1 (fr) Procede pour produire une banque de donnees vocales pour un lexique cible pour l'apprentissage d'un systeme de reconnaissance vocale
DE60029456T2 (de) Verfahren zur Online-Anpassung von Aussprachewörterbüchern
DE102005030965B4 (de) Erweiterung des dynamischen Vokabulars eines Spracherkennungssystems um weitere Voiceenrollments
EP1435087A1 (fr) Procede de production de segments de reference decrivant des blocs vocaux et procede de modelisation d'unites vocales d'un modele de test parle
JP2000075886A (ja) 統計的言語モデル生成装置及び音声認識装置
EP1457966A1 (fr) Méthode de détermination d'un risque de confusion d'entrées de vocabulaire pour la reconnaissance de la parole à partir de phonèmes
EP1445759B1 (fr) Méthode adaptée à l'usager pour modéliser le bruit de fond en reconnaissance de parole
DE10122087C1 (de) Verfahren zum Training und Betrieb eines Spracherkenners, Spracherkenner und Spracherkenner-Trainingssystem
EP2012303B1 (fr) Procédé de reconnaissance d'un signal vocal
EP1677285B1 (fr) Procédé destiné à la détermination de variantes de prononciation d'un mot provenant d'un vocabulaire préréglé d'un système de reconnaissance vocale
DE10244722A1 (de) Verfahren und Vorrichtung zum rechnergestützten Vergleich einer ersten Folge lautsprachlicher Einheiten mit einer zweiten Folge lautsprachlicher Einheiten, Spracherkennungseinrichtung und Sprachsyntheseeinrichtung
DE10359624A1 (de) Spracherkennung mit sprecherunabhängiger Vokabularerweiterung

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20050620

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LI LU MC NL PT RO SE SI SK TR

AX Request for extension of the european patent

Extension state: AL LT LV MK

DAX Request for extension of the european patent (deleted)
RBV Designated contracting states (corrected)

Designated state(s): DE ES FR GB IT

17Q First examination report despatched

Effective date: 20100615

RAP1 Party data changed (applicant data changed or rights of an application transferred)

Owner name: SIEMENS ENTERPRISE COMMUNICATIONS GMBH & CO. KG

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20101228