EP1590795A1 - Generation and deletion of pronunciation variants to reduce the word error rate in speech recognition - Google Patents
Info
- Publication number
- EP1590795A1 (application EP04704214A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- pronunciation
- variants
- pronunciation variants
- word
- recognition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0635—Training updating or merging of old and new templates; Mean values; Weighting
- G10L2015/0636—Threshold criteria for the updating
Definitions
- for all words belonging to the vocabulary, the corresponding phoneme sequences must be known. These phoneme sequences are entered in the vocabulary. During the actual recognition process, the so-called Viterbi algorithm then searches for the best path through the given phoneme sequences, which correspond to the words. If recognition is not limited to single words, probabilities for transitions between the words can also be modeled and included in the Viterbi search.
- the phonetic model is inferred from the orthographic spelling by predefined rules or by statistical approaches. Since a written word is pronounced differently in different languages, several pronunciation variants can be generated in the vocabulary for each word. There are also numerous methods in the literature for generating pronunciation variants. The large number of pronunciation variants in turn reduces the word error rate.
- speech recognition systems are adapted to their respective users.
- this is done through transformation, such as maximum likelihood linear regression (MLLR), or through model parameter prediction, such as regression model prediction (RMP) or maximum a posteriori (MAP) prediction, which adapts the acoustic modeling of the feature space on which the word models are based and which is available, for example, as a hidden Markov model (HMM).
- the speech recognizer is thus changed from a speaker-independent to a speaker-dependent system.
- the complexity, i.e. the storage space consumption
- the complexity increases with the number of possible words in the speech recognizer.
- the object of the invention is to provide speech recognition with a reduced word error rate which is particularly adaptable and has only a very low resource consumption.
- in a method for speech recognition, several pronunciation variants for a word to be recognized are stored, for example in the memory of a device which is set up for the method. Alternatively or in addition, these multiple pronunciation variants can also be generated and added to the vocabulary. Each time a word is recognized, it is registered which pronunciation variant of the word was recognized. After several recognition processes, the pronunciation variants are evaluated based on the number of times each variant was recognized.
- the frequency of recognition is used here as the simplest and least resource-consuming criterion. More complicated assessment methods are of course also conceivable, in which, for example, the degree of correspondence between the utterance to be recognized and the pronunciation variant recognized in each case is also taken into account.
- the method can work with existing words stored in the vocabulary. However, the method gains a very decisive advantage if the word models can be dynamically expanded as an alternative or in addition. When a new word is added to the vocabulary, several pronunciation variants of the new word are automatically generated and also added to the vocabulary.
- pronunciation variants for a word can be generated, for example, by phoneme replacement, phoneme deletion and / or phoneme insertion.
- pronunciation variants can also be created, for example, by adding noise to the spoken signal (signal in the broader sense, i.e. speech, features, phoneme chain).
- when a word is recognized from an utterance, a further pronunciation variant for the spoken word can be generated from that utterance.
- a particularly good utilization of the available memory can be achieved if a maximum number of pronunciation variants is generated for several words.
- Another important aspect of the invention relates to the evaluation of the pronunciation variants.
- the method advantageously saves storage space if the number of stored pronunciation variants is reduced on the basis of the evaluation of the pronunciation variants. This can be achieved, for example, by deleting pronunciation variants that are recognized less frequently.
- Pronunciation variants whose confidence lies below a threshold value are preferably deleted.
- the speech recognizer can still be kept speaker-independent if the requirement is also set that the canonical pronunciation variant of the word is never deleted.
- a device that is set up to carry out the method described above can be implemented, for example, by providing means by which one or more method steps can be carried out in each case.
- Advantageous configurations of the device result analogously to the advantageous configurations of the method.
- a program product for a data processing system which contains code sections with which one of the described methods can be carried out on the data processing system, can be implemented by suitable implementation of the method in a programming language and translation into code executable by the data processing system.
- the code sections are saved for this purpose. A program product is understood here to mean the program as a tradable product. It can be in any form, for example on paper, on a computer-readable data medium or distributed over a network.
- the proposed method is based on a dynamic expansion of the word models in combination with an assessment of the pronunciation variants.
- Pronunciation variants optimally used by generating a maximum number of variants.
- Pronunciation variants carried out.
- these confidence values are in each case added to the confidence values already accumulated for the pronunciation variants in previous recognition runs; a simple “boolean” confidence here is the value 1 for the pronunciation variant that was referenced in this recognition and the value 0 for all other variants.
- A misrecognition can be inferred, among other things, from the reaction of the user: for example, the recognition is repeated or a command initiated by voice is aborted.
- a further pronunciation variant for the spoken word can be generated upon recognition based on the utterance. Here it must again be ensured that no misrecognition has occurred. This step can also be done unnoticed by the user.
- the accumulated confidence generated for each pronunciation variant is now used to reduce the vocabulary again at a given point in time. This is done by deleting those vocabulary entries whose accumulated confidence is below a certain threshold. These entries are generally pronunciation variants that have never or only rarely been referenced and are therefore not relevant for a recognition run.
- the adaptation does not take place at the level of acoustic modeling (for example HMM). Instead, the adaptation is achieved by selecting one or more language variants. This selection is dependent on the referencing in the successful ones
- Typein is the original, canonical
- Deleting the pronunciation variants increases the reliability of recognition or rejection, since the relevant entries, that is to say the adapted models, are generally easier to distinguish discriminatively. At the same time, recognition is accelerated as the vocabulary becomes smaller.
- word entries in the vocabulary are defined by their phoneme sequence or by a state sequence.
- pronunciation variants can be generated by adding noise to the speech data.
- Another way of creating variants is to modify the phoneme or state sequence obtained. This can be done with the help of random factors or with user-specific information, for example a confusion matrix from the last recognition runs.
- a confusion matrix can be created, for example, by a second recognition run with phonemes.
- Typein is used to infer the phoneme sequence from the orthographic spelling.
- when assigning graphemes to phonemes, statistical methods are known which, in addition to the most likely phoneme sequence, are also alternative
- Mr. Meier was called ten times by voice command.
- the five variants were referenced as follows, which corresponds to the boolean confidence already mentioned:
- Variant 1.2 6 6
- Variant 1.3 0 0
- the vocabulary is thus reduced by more than half.
- Speech recognition (search) is reduced to the same extent.
- Variant 2.1 / for a m a r t i n / Variant 2.2: / for a m a t n /
- Variant-2.1 / for aU a r t i n
- Variant-2.2 / for aU m A t n /
- Mr. Meier is called three times, Ms. Martin is called five times by voice command.
- the five variants are assessed with confidence as follows.
- a criterion is now used here, i.e. a confidence measure that allows a statement about the reliability of the spoken utterance for each variant:
- Variant-1.2 / h E r m al er / Original 2: / for a u m a r t e ⁇ /
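The evaluation and deletion scheme described in the definitions above can be illustrated with a minimal sketch. It is not the patent's implementation: the names `generate_variants` and `Vocabulary`, the phoneme symbols, the replacement table and the threshold value are all assumptions made for illustration. The sketch generates variants by single phoneme replacement or deletion, accumulates the simple “boolean” confidence (1 for the referenced variant, 0 for all others) over recognition runs, and then deletes variants below a threshold while never deleting the canonical variant.

```python
from collections import defaultdict


def generate_variants(canonical, replacements, max_variants):
    """Derive up to max_variants new variants from the canonical phoneme
    sequence by single phoneme replacement or deletion (hypothetical helper)."""
    canonical = tuple(canonical)
    seen = {canonical}
    result = []
    for i, phoneme in enumerate(canonical):
        # Replacement candidates first, then the deletion candidate.
        candidates = [canonical[:i] + (alt,) + canonical[i + 1:]
                      for alt in replacements.get(phoneme, ())]
        candidates.append(canonical[:i] + canonical[i + 1:])
        for cand in candidates:
            if not cand or cand in seen:
                continue
            if len(result) == max_variants:
                return result
            seen.add(cand)
            result.append(cand)
    return result


class Vocabulary:
    """Vocabulary with several pronunciation variants per word."""

    def __init__(self):
        self.variants = {}                    # word -> [canonical, variant, ...]
        self.confidence = defaultdict(float)  # (word, variant) -> accumulated confidence

    def add_word(self, word, canonical, replacements, max_variants=4):
        canonical = tuple(canonical)
        self.variants[word] = [canonical] + generate_variants(
            canonical, replacements, max_variants)

    def register(self, word, recognized):
        """Boolean confidence: add 1 for the referenced variant, 0 for the rest."""
        recognized = tuple(recognized)
        for variant in self.variants[word]:
            self.confidence[(word, variant)] += 1.0 if variant == recognized else 0.0

    def prune(self, threshold):
        """Delete variants below the threshold; the canonical entry always stays."""
        for word, entries in self.variants.items():
            canonical = entries[0]
            self.variants[word] = [
                v for v in entries
                if v == canonical or self.confidence[(word, v)] >= threshold]


# Illustrative run: ten successful calls, with a hypothetical distribution
# of references over the five stored variants.
vocab = Vocabulary()
vocab.add_word("meier", ("m", "aI", "6"), {"aI": ["a"], "6": ["er"]})
entries = vocab.variants["meier"]
for variant, count in zip(entries, [4, 0, 6, 0, 0]):
    for _ in range(count):
        vocab.register("meier", variant)
vocab.prune(threshold=1.0)
print(vocab.variants["meier"])
```

In this run two of the five stored entries survive, so the vocabulary for the word shrinks by more than half. A graded confidence measure could replace the boolean value in `register` without changing the rest of the sketch.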
Landscapes
- Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Machine Translation (AREA)
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
DE10304460A DE10304460B3 (de) | 2003-02-04 | 2003-02-04 | Generation and deletion of pronunciation variants to reduce the word error rate in speech recognition |
DE10304460 | 2003-02-04 | ||
PCT/EP2004/000527 WO2004070702A1 (fr) | 2003-02-04 | 2004-01-22 | Generation and deletion of pronunciation variants to reduce the word error rate in speech recognition |
Publications (1)
Publication Number | Publication Date |
---|---|
EP1590795A1 (fr) | 2005-11-02 |
Family
ID=31502580
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP04704214A Withdrawn EP1590795A1 (fr) | 2003-02-04 | 2004-01-22 | Generation and deletion of pronunciation variants to reduce the word error rate in speech recognition |
Country Status (4)
Country | Link |
---|---|
US (1) | US20060143008A1 (fr) |
EP (1) | EP1590795A1 (fr) |
DE (1) | DE10304460B3 (fr) |
WO (1) | WO2004070702A1 (fr) |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7280963B1 (en) * | 2003-09-12 | 2007-10-09 | Nuance Communications, Inc. | Method for learning linguistically valid word pronunciations from acoustic data |
US7624013B2 (en) * | 2004-09-10 | 2009-11-24 | Scientific Learning Corporation | Word competition models in voice recognition |
US7533018B2 (en) * | 2004-10-19 | 2009-05-12 | Motorola, Inc. | Tailored speaker-independent voice recognition system |
GB2424742A (en) * | 2005-03-31 | 2006-10-04 | Ibm | Automatic speech recognition |
US7983914B2 (en) * | 2005-08-10 | 2011-07-19 | Nuance Communications, Inc. | Method and system for improved speech recognition by degrading utterance pronunciations |
TW200926142A (en) * | 2007-12-12 | 2009-06-16 | Inst Information Industry | A construction method of English recognition variation pronunciation models |
US9275640B2 (en) * | 2009-11-24 | 2016-03-01 | Nexidia Inc. | Augmented characterization for speech recognition |
US9177545B2 (en) * | 2010-01-22 | 2015-11-03 | Mitsubishi Electric Corporation | Recognition dictionary creating device, voice recognition device, and voice synthesizer |
US9837070B2 (en) * | 2013-12-09 | 2017-12-05 | Google Inc. | Verification of mappings between phoneme sequences and words |
US9747897B2 (en) * | 2013-12-17 | 2017-08-29 | Google Inc. | Identifying substitute pronunciations |
DK179496B1 (en) | 2017-05-12 | 2019-01-15 | Apple Inc. | USER-SPECIFIC Acoustic Models |
US11043213B2 (en) * | 2018-12-07 | 2021-06-22 | Soundhound, Inc. | System and method for detection and correction of incorrectly pronounced words |
CN110277090B (zh) * | 2019-07-04 | 2021-07-06 | 思必驰科技股份有限公司 | Adaptive correction method and system for a user's personal pronunciation dictionary model |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE3931638A1 (de) * | 1989-09-22 | 1991-04-04 | Standard Elektrik Lorenz Ag | Method for speaker-adaptive recognition of speech |
JPH0772840B2 (ja) * | 1992-09-29 | 1995-08-02 | 日本アイ・ビー・エム株式会社 | Method for constructing a speech model, speech recognition method, speech recognition device and method for training a speech model |
US5899973A (en) * | 1995-11-04 | 1999-05-04 | International Business Machines Corporation | Method and apparatus for adapting the language model's size in a speech recognition system |
US6076053A (en) * | 1998-05-21 | 2000-06-13 | Lucent Technologies Inc. | Methods and apparatus for discriminative training and adaptation of pronunciation networks |
US6208964B1 (en) * | 1998-08-31 | 2001-03-27 | Nortel Networks Limited | Method and apparatus for providing unsupervised adaptation of transcriptions |
US6535849B1 (en) * | 2000-01-18 | 2003-03-18 | Scansoft, Inc. | Method and system for generating semi-literal transcripts for speech recognition systems |
US7181395B1 (en) * | 2000-10-27 | 2007-02-20 | International Business Machines Corporation | Methods and apparatus for automatic generation of multiple pronunciations from acoustic data |
EP1233406A1 (fr) * | 2001-02-14 | 2002-08-21 | Sony International (Europe) GmbH | Speech recognition adapted to foreign speakers |
DE10119284A1 (de) * | 2001-04-20 | 2002-10-24 | Philips Corp Intellectual Pty | Method and system for training parameters of a pattern recognition system, each parameter being assigned to exactly one realization variant of an inventory pattern |
US6925154B2 (en) * | 2001-05-04 | 2005-08-02 | International Business Machines Corporation | Methods and apparatus for conversational name dialing systems |
-
2003
- 2003-02-04 DE DE10304460A patent/DE10304460B3/de not_active Expired - Fee Related
-
2004
- 2004-01-22 US US10/544,596 patent/US20060143008A1/en not_active Abandoned
- 2004-01-22 EP EP04704214A patent/EP1590795A1/fr not_active Withdrawn
- 2004-01-22 WO PCT/EP2004/000527 patent/WO2004070702A1/fr active Search and Examination
Non-Patent Citations (1)
Title |
---|
See references of WO2004070702A1 * |
Also Published As
Publication number | Publication date |
---|---|
DE10304460B3 (de) | 2004-03-11 |
WO2004070702A1 (fr) | 2004-08-19 |
US20060143008A1 (en) | 2006-06-29 |
Legal Events
Code | Title | Description |
---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase | Free format text: ORIGINAL CODE: 0009012 |
17P | Request for examination filed | Effective date: 20050620 |
AK | Designated contracting states | Kind code of ref document: A1; Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LI LU MC NL PT RO SE SI SK TR |
AX | Request for extension of the european patent | Extension state: AL LT LV MK |
DAX | Request for extension of the european patent (deleted) | |
RBV | Designated contracting states (corrected) | Designated state(s): DE ES FR GB IT |
17Q | First examination report despatched | Effective date: 20100615 |
RAP1 | Party data changed (applicant data changed or rights of an application transferred) | Owner name: SIEMENS ENTERPRISE COMMUNICATIONS GMBH & CO. KG |
STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
18D | Application deemed to be withdrawn | Effective date: 20101228 |