EP1590795A1 - Generation and deletion of pronunciation variations in order to reduce the word error rate in speech recognition - Google Patents
Generation and deletion of pronunciation variations in order to reduce the word error rate in speech recognition
- Publication number
- EP1590795A1 (application EP04704214A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- pronunciation
- variants
- pronunciation variants
- word
- recognition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0635—Training updating or merging of old and new templates; Mean values; Weighting
- G10L2015/0636—Threshold criteria for the updating
Definitions
- The phoneme sequences of all words belonging to the vocabulary must be known. These phoneme sequences are entered in the vocabulary. During the actual recognition process, the Viterbi algorithm then searches for the best path through the phoneme sequences that correspond to the words. If recognition is not limited to single words, probabilities for transitions between the words can also be modeled and included in the Viterbi algorithm.
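As a rough illustration of the search described above, the following Python sketch scores each stored phoneme sequence against an utterance with a Viterbi-style dynamic program and returns the best-scoring vocabulary entry. The per-frame log-likelihood dictionaries, the zero-cost "stay" and "advance" transitions, and the names `viterbi_score` and `recognize` are assumptions made for this example; the source does not specify them.

```python
import math

def viterbi_score(frames, phonemes):
    """Best log-score for aligning frames to a left-to-right phoneme sequence.

    frames:   list of dicts mapping phoneme -> log-likelihood for that frame
              (hypothetical acoustic scores).
    phonemes: phoneme sequence of one vocabulary entry.
    Each phoneme must cover at least one frame; from one frame to the next
    the path either stays in the phoneme or advances to the next one.
    """
    neg_inf = -math.inf
    n = len(phonemes)
    # best[j] = best score of a path that is in phoneme j at the current frame
    best = [neg_inf] * n
    best[0] = frames[0].get(phonemes[0], neg_inf)
    for frame in frames[1:]:
        new = [neg_inf] * n
        for j, p in enumerate(phonemes):
            prev = best[j] if j == 0 else max(best[j], best[j - 1])
            if prev > neg_inf:
                new[j] = prev + frame.get(p, neg_inf)
        best = new
    # A complete path has to end in the last phoneme.
    return best[-1]

def recognize(frames, vocabulary):
    """Return (word, variant index) of the best-scoring phoneme sequence.

    vocabulary: word -> list of phoneme sequences (pronunciation variants).
    """
    scored = [
        (viterbi_score(frames, seq), word, i)
        for word, variants in vocabulary.items()
        for i, seq in enumerate(variants)
    ]
    _, word, i = max(scored)
    return word, i
```

Returning the variant index together with the word is a design choice of the sketch: it makes it possible to register which pronunciation variant was referenced in a recognition run.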
- The phonetic model is inferred from the orthographic spelling by predefined rules or by statistical approaches. Since a written word is pronounced differently in different languages, several pronunciation variants can be generated in the vocabulary for each word. The literature also describes numerous methods for generating pronunciation variants. The large number of pronunciation variants in turn reduces the word error rate.
- Speech recognition systems are adapted to their respective users.
- The word models are adapted through a transformation such as maximum likelihood linear regression (MLLR), or through model parameter prediction such as regression model prediction (RMP) or maximum a posteriori prediction (MAP). This adapts the acoustic modeling of the feature space on which the word models are based, which is available, for example, as a hidden Markov model (HMM).
- The speech recognizer is thus changed from a speaker-independent to a speaker-dependent system.
- The complexity, i.e. the storage space consumption, increases with the number of possible words in the speech recognizer.
- The object of the invention is to provide speech recognition with a reduced word error rate that is particularly adaptable and has only a very low resource consumption.
- In a method for speech recognition, several pronunciation variants of a word to be recognized are stored, for example in the memory of a device that is set up for the method. Alternatively or in addition, these pronunciation variants can also be generated and added to the vocabulary. Each time a word is recognized, the pronunciation variant by which the word was recognized is registered. After several recognition processes, the pronunciation variants are evaluated based on the number of times each of them was recognized.
- The frequency of recognition is used here as the simplest and least resource-consuming criterion. More complicated assessment methods are also conceivable, in which, for example, the degree of correspondence between the utterance to be recognized and the pronunciation variant recognized in each case is also taken into account.
- The method can work with existing words stored in the vocabulary. However, it gains a decisive advantage if, alternatively or in addition, the word models can be expanded dynamically: when a new word is added to the vocabulary, several pronunciation variants of the new word are generated automatically and also added to the vocabulary.
- Pronunciation variants for a word can be generated, for example, by phoneme replacement, phoneme deletion and/or phoneme insertion.
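The three operations named above can be sketched in Python as follows. This is only a minimal illustration: the function name `generate_variants`, the representation of a pronunciation as a tuple of phoneme symbols, and the restriction to a single edit per variant are assumptions of the example, not details taken from the source.

```python
def generate_variants(phonemes, inventory):
    """Return all distinct sequences exactly one edit away from `phonemes`.

    phonemes:  canonical pronunciation as a sequence of phoneme symbols.
    inventory: phoneme symbols available for replacement and insertion.
    """
    canonical = tuple(phonemes)
    variants = set()
    for i in range(len(canonical)):
        # Phoneme deletion: drop the phoneme at position i.
        variants.add(canonical[:i] + canonical[i + 1:])
        # Phoneme replacement: substitute each different inventory symbol.
        for p in inventory:
            if p != canonical[i]:
                variants.add(canonical[:i] + (p,) + canonical[i + 1:])
    for i in range(len(canonical) + 1):
        # Phoneme insertion: insert an inventory symbol before position i.
        for p in inventory:
            variants.add(canonical[:i] + (p,) + canonical[i:])
    # Deleting the only phoneme of a one-phoneme word would leave nothing.
    variants.discard(())
    return sorted(variants)
```

Enumerating every single-edit variant grows quickly with the size of the phoneme inventory, so a practical system would presumably cap the number of variants per word or select them with user-specific information.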
- Pronunciation variants can also be created, for example, by adding noise to the spoken signal (signal in the broader sense, i.e. speech, features, phoneme chain).
- When a word is recognized from an utterance, a further pronunciation variant of the spoken word can be generated from this utterance.
- A particularly good utilization of the available memory can be achieved if a maximum number of pronunciation variants is generated for several words.
- Another important aspect of the invention relates to the evaluation of the pronunciation variants.
- The method advantageously saves storage space if the number of stored pronunciation variants is reduced on the basis of the evaluation of the pronunciation variants. This can be achieved, for example, by deleting pronunciation variants that are recognized less frequently.
- Pronunciation variants whose confidence lies below a threshold value are preferably deleted.
- The speech recognizer can still be kept speaker-independent if it is additionally required that the canonical pronunciation variant of the word is never deleted.
- A device that is set up to carry out the method described above can be implemented, for example, by providing means by which one or more method steps can be carried out in each case.
- Advantageous configurations of the device result analogously to the advantageous configurations of the method.
- A program product for a data processing system, containing code sections with which one of the described methods can be carried out on the data processing system, can be realized by suitably implementing the method in a programming language and translating it into code executable by the data processing system.
- The code sections are stored for this purpose. A program product is understood here to mean the program as a tradable product. It can be in any form, for example on paper, on a computer-readable data medium, or distributed over a network.
- The proposed method is based on a dynamic expansion of the word models in combination with an assessment of the pronunciation variants.
- The memory available for pronunciation variants is optimally used by generating a maximum number of variants.
- An assessment of the pronunciation variants is carried out.
- These confidence values are in each case added to the confidence values already accumulated for the pronunciation variants in previous recognition runs. A simple “boolean” confidence takes the value 1 for the pronunciation variant that was referenced in this recognition and the value 0 for all other variants.
- A misrecognition can be inferred, among other things, from the reaction of the user: for example, the recognition is repeated or a command initiated by voice is aborted.
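The accumulation of the boolean confidence can be sketched as below. The class name `ConfidenceLog`, its method names and the `misrecognized` flag are hypothetical; skipping runs that the user's reaction marks as wrong is an assumption suggested by the text, not a rule it states explicitly.

```python
from collections import defaultdict

class ConfidenceLog:
    """Accumulates per-variant confidence over recognition runs."""

    def __init__(self):
        # (word, variant_id) -> confidence summed over all counted runs
        self.accumulated = defaultdict(float)

    def register(self, word, variant_ids, referenced, misrecognized=False):
        """Record one recognition run for `word`.

        variant_ids:   ids of all stored variants of the word.
        referenced:    id of the variant the recognizer matched in this run.
        misrecognized: True if the user's reaction (a repeated or aborted
                       command) suggests that the result was wrong.
        """
        if misrecognized:
            return  # assumption: a wrong result does not count for any variant
        for vid in variant_ids:
            # Boolean confidence: 1 for the referenced variant, 0 otherwise.
            self.accumulated[(word, vid)] += 1.0 if vid == referenced else 0.0
```

With this sketch, a variant that is referenced in six counted runs ends up with an accumulated confidence of 6, and a variant that is never referenced stays at 0. A graded confidence measure could be added in place of the constant 1.0 without changing the structure.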
- Upon recognition, a further pronunciation variant of the spoken word can be generated from the utterance. Here it must again be ensured that no misrecognition has occurred. This step can also take place unnoticed by the user.
- The confidence accumulated for each pronunciation variant is then used to reduce the vocabulary again at a given point in time. This is done by deleting those vocabulary entries whose accumulated confidence is below a certain threshold. These entries are generally pronunciation variants that have never or only rarely been referenced and are therefore not relevant for a recognition run.
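A minimal Python sketch of this reduction step follows. The data layout (a dictionary of variants per word), the function name `prune_vocabulary` and the convention that the canonical variant carries the id `"canonical"` are assumptions made for the example. Protecting the canonical variant follows the requirement stated elsewhere in the text that it is never deleted.

```python
def prune_vocabulary(vocabulary, accumulated, threshold):
    """Return a copy of `vocabulary` without rarely referenced variants.

    vocabulary:  word -> {variant_id: phoneme sequence}; by assumption the
                 canonical variant of every word has the id "canonical".
    accumulated: (word, variant_id) -> accumulated confidence.
    threshold:   variants whose accumulated confidence is below this value
                 are deleted.
    """
    pruned = {}
    for word, variants in vocabulary.items():
        pruned[word] = {
            vid: seq
            for vid, seq in variants.items()
            # The canonical variant is kept regardless of its confidence.
            if vid == "canonical"
            or accumulated.get((word, vid), 0.0) >= threshold
        }
    return pruned
```

Returning a new dictionary instead of deleting in place is a choice of the sketch; on a device with very little memory, deleting entries in place would be the more natural option.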
- The adaptation does not take place at the level of acoustic modeling (for example HMM). Instead, the adaptation is achieved by selecting one or more pronunciation variants. This selection depends on how the variants were referenced in the successful recognition runs.
- Typein is the original, canonical
- Deleting pronunciation variants increases the reliability of recognition or rejection, since the remaining relevant entries, that is to say the adapted models, are generally easier to discriminate. At the same time, recognition is accelerated as the vocabulary becomes smaller.
- Word entries in the vocabulary are defined by their phoneme sequence or by a state sequence.
- Pronunciation variants can be generated by adding noise to the speech data.
- Another way of creating variants is to modify the phoneme or state sequence obtained. This can be done, for example, with the help of random factors or with user-specific information, such as a confusion matrix from the last recognition runs.
- A confusion matrix can be created, for example, by a second recognition run with phonemes.
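One possible way to obtain such a confusion matrix is sketched below: the phoneme sequence of the recognized vocabulary entry is aligned with the output of a second, phoneme-level recognition run, and the aligned pairs are counted. The use of a Levenshtein alignment and the names `align` and `confusion_matrix` are assumptions of this example; the source only states that a second recognition run with phonemes can be used.

```python
from collections import Counter

def align(ref, hyp):
    """Levenshtein-align two phoneme sequences.

    Returns a list of (ref_symbol, hyp_symbol) pairs. None marks a gap:
    (x, None) is a deleted phoneme, (None, y) an inserted one.
    """
    n, m = len(ref), len(hyp)
    # cost[i][j] = edit distance between ref[:i] and hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Trace back from the bottom-right corner to recover the alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((ref[i - 1], None))
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))
            j -= 1
    return pairs[::-1]

def confusion_matrix(runs):
    """Count aligned (reference phoneme, recognized phoneme) pairs.

    runs: iterable of (ref, hyp) phoneme sequence pairs, one per run.
    """
    counts = Counter()
    for ref, hyp in runs:
        counts.update(align(ref, hyp))
    return counts
```

Frequent off-diagonal entries of the resulting counter, such as a phoneme that this user often drops, could then guide which replacements or deletions are applied when new variants are generated.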
- With Typein, the phoneme sequence is inferred from the orthographic spelling.
- For assigning graphemes to phonemes, statistical methods are known which, in addition to the most likely phoneme sequence, also yield alternative phoneme sequences.
- Mr. Meier was called ten times by voice command.
- the five variants were referenced as follows, which corresponds to the boolean confidence already mentioned:
- Variant 1.2 6 6
- Variant 1.3 0 0
- The vocabulary is thus reduced by more than half.
- The effort for speech recognition (search) is reduced to the same extent.
- Variant 2.1: / for a m a r t i n /
- Variant 2.2: / for a m a t n /
- Variant 2.1: / for aU a r t i n /
- Variant 2.2: / for aU m A t n /
- Mr. Meier is called three times and Ms. Martin five times by voice command.
- The five variants are assessed with confidence as follows.
- A criterion is now used here, i.e. a confidence measure that allows a statement about the reliability of the spoken utterance for each variant:
- Variant 1.2: / h E r m al er /
- Original 2: / for a u m a r t e ⁇ /
Landscapes
- Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Machine Translation (AREA)
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
DE10304460 | 2003-02-04 | ||
DE10304460A DE10304460B3 (en) | 2003-02-04 | 2003-02-04 | Speech recognition method e.g. for mobile telephone, identifies which spoken variants of same word can be recognized with analysis of recognition difficulty for limiting number of acceptable variants |
PCT/EP2004/000527 WO2004070702A1 (en) | 2003-02-04 | 2004-01-22 | Generation and deletion of pronunciation variations in order to reduce the word error rate in speech recognition |
Publications (1)
Publication Number | Publication Date |
---|---|
EP1590795A1 (en) | 2005-11-02 |
Family
ID=31502580
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP04704214A Withdrawn EP1590795A1 (en) | 2003-02-04 | 2004-01-22 | Generation and deletion of pronunciation variations in order to reduce the word error rate in speech recognition |
Country Status (4)
Country | Link |
---|---|
US (1) | US20060143008A1 (en) |
EP (1) | EP1590795A1 (en) |
DE (1) | DE10304460B3 (en) |
WO (1) | WO2004070702A1 (en) |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7280963B1 (en) * | 2003-09-12 | 2007-10-09 | Nuance Communications, Inc. | Method for learning linguistically valid word pronunciations from acoustic data |
US7624013B2 (en) * | 2004-09-10 | 2009-11-24 | Scientific Learning Corporation | Word competition models in voice recognition |
US7533018B2 (en) * | 2004-10-19 | 2009-05-12 | Motorola, Inc. | Tailored speaker-independent voice recognition system |
GB2424742A (en) * | 2005-03-31 | 2006-10-04 | Ibm | Automatic speech recognition |
US7983914B2 (en) * | 2005-08-10 | 2011-07-19 | Nuance Communications, Inc. | Method and system for improved speech recognition by degrading utterance pronunciations |
TW200926142A (en) * | 2007-12-12 | 2009-06-16 | Inst Information Industry | A construction method of English recognition variation pronunciation models |
US9275640B2 (en) * | 2009-11-24 | 2016-03-01 | Nexidia Inc. | Augmented characterization for speech recognition |
US9177545B2 (en) * | 2010-01-22 | 2015-11-03 | Mitsubishi Electric Corporation | Recognition dictionary creating device, voice recognition device, and voice synthesizer |
US9837070B2 (en) * | 2013-12-09 | 2017-12-05 | Google Inc. | Verification of mappings between phoneme sequences and words |
US9747897B2 (en) * | 2013-12-17 | 2017-08-29 | Google Inc. | Identifying substitute pronunciations |
DK179496B1 (en) | 2017-05-12 | 2019-01-15 | Apple Inc. | USER-SPECIFIC Acoustic Models |
US11043213B2 (en) * | 2018-12-07 | 2021-06-22 | Soundhound, Inc. | System and method for detection and correction of incorrectly pronounced words |
CN110277090B (en) * | 2019-07-04 | 2021-07-06 | 思必驰科技股份有限公司 | Self-adaptive correction method and system for pronunciation dictionary model of user person |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE3931638A1 (en) * | 1989-09-22 | 1991-04-04 | Standard Elektrik Lorenz Ag | METHOD FOR SPEAKER ADAPTIVE RECOGNITION OF LANGUAGE |
JPH0772840B2 (en) * | 1992-09-29 | 1995-08-02 | 日本アイ・ビー・エム株式会社 | Speech model configuration method, speech recognition method, speech recognition device, and speech model training method |
JP3126985B2 (en) * | 1995-11-04 | 2001-01-22 | インターナシヨナル・ビジネス・マシーンズ・コーポレーション | Method and apparatus for adapting the size of a language model of a speech recognition system |
US6076053A (en) * | 1998-05-21 | 2000-06-13 | Lucent Technologies Inc. | Methods and apparatus for discriminative training and adaptation of pronunciation networks |
US6208964B1 (en) * | 1998-08-31 | 2001-03-27 | Nortel Networks Limited | Method and apparatus for providing unsupervised adaptation of transcriptions |
US6535849B1 (en) * | 2000-01-18 | 2003-03-18 | Scansoft, Inc. | Method and system for generating semi-literal transcripts for speech recognition systems |
US7181395B1 (en) * | 2000-10-27 | 2007-02-20 | International Business Machines Corporation | Methods and apparatus for automatic generation of multiple pronunciations from acoustic data |
EP1233406A1 (en) * | 2001-02-14 | 2002-08-21 | Sony International (Europe) GmbH | Speech recognition adapted for non-native speakers |
DE10119284A1 (en) * | 2001-04-20 | 2002-10-24 | Philips Corp Intellectual Pty | Method and system for training parameters of a pattern recognition system assigned to exactly one implementation variant of an inventory pattern |
US6925154B2 (en) * | 2001-05-04 | 2005-08-02 | International Business Machines Corporation | Methods and apparatus for conversational name dialing systems |
2003
- 2003-02-04: DE application DE10304460A, published as DE10304460B3 (en); status: not active, Expired - Fee Related
2004
- 2004-01-22: EP application EP04704214A, published as EP1590795A1 (en); status: not active, Withdrawn
- 2004-01-22: US application US10/544,596, published as US20060143008A1 (en); status: not active, Abandoned
- 2004-01-22: WO application PCT/EP2004/000527, published as WO2004070702A1 (en); status: active, Search and Examination
Non-Patent Citations (1)
Title |
---|
See references of WO2004070702A1 * |
Also Published As
Publication number | Publication date |
---|---|
US20060143008A1 (en) | 2006-06-29 |
DE10304460B3 (en) | 2004-03-11 |
WO2004070702A1 (en) | 2004-08-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
DE60302407T2 (en) | | Ambient and speaker-adapted speech recognition |
DE112010005959B4 (en) | | Method and system for automatic recognition of an end point of a sound recording |
DE69519297T2 (en) | | METHOD AND DEVICE FOR VOICE RECOGNITION BY MEANS OF OPTIMIZED PARTIAL BUNDLING OF LIKELIHOOD MIXTURES |
DE69818231T2 (en) | | METHOD FOR THE DISCRIMINATIVE TRAINING OF VOICE RECOGNITION MODELS |
DE69607913T2 (en) | | METHOD AND DEVICE FOR VOICE RECOGNITION ON THE BASIS OF NEW WORD MODELS |
DE10306022B3 (en) | | Speech recognition method for telephone, personal digital assistant, notepad computer or automobile navigation system uses 3-stage individual word identification |
JP3990136B2 (en) | | Speech recognition method |
EP1084490B1 (en) | | Arrangement and method for computer recognition of a predefined vocabulary in spoken language |
DE10304460B3 (en) | | Speech recognition method e.g. for mobile telephone, identifies which spoken variants of same word can be recognized with analysis of recognition difficulty for limiting number of acceptable variants |
DE60318385T2 (en) | | LANGUAGE PROCESSING APPARATUS AND METHOD, RECORDING MEDIUM AND PROGRAM |
DE10119284A1 (en) | | Method and system for training parameters of a pattern recognition system assigned to exactly one implementation variant of an inventory pattern |
DE60018696T2 (en) | | ROBUST LANGUAGE PROCESSING OF CHARACTERED LANGUAGE MODELS |
EP1199704A2 (en) | | Selection of an alternate stream of words for discriminant adaptation |
DE10040063A1 (en) | | Procedure for assigning phonemes |
EP1723636A1 (en) | | User and vocabulary-adaptive determination of confidence and rejecting thresholds |
EP1282897A1 (en) | | Method for creating a speech database for a target vocabulary in order to train a speech recognition system |
DE60029456T2 (en) | | Method for online adjustment of pronunciation dictionaries |
DE102005030965B4 (en) | | Extension of the dynamic vocabulary of a speech recognition system by further voice enrollments |
JP2000075886A (en) | | Statistical language model generator and voice recognition device |
DE10308611A1 (en) | | Determination of the likelihood of confusion between vocabulary entries in phoneme-based speech recognition |
EP1445759B1 (en) | | User adaptive method for modeling of background noise in speech recognition |
DE10122087C1 (en) | | Method for training and operating a voice/speech recognition device for recognizing a speaker's voice/speech independently of the speaker uses multiple voice/speech trial databases to form an overall operating model. |
EP2012303B1 (en) | | Method for detecting a speech signal |
EP1677285B1 (en) | | Method for determining pronunciation variants of a word from a predeterminable vocabulary of a speech recognition system |
DE10302101A1 (en) | | Training of a Hidden Markov Model using training data vectors and a nearest neighbor clustering method based on condition parameters used to describe the Hidden Markov Model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase | Free format text: ORIGINAL CODE: 0009012 |
| 17P | Request for examination filed | Effective date: 20050620 |
| AK | Designated contracting states | Kind code of ref document: A1; Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LI LU MC NL PT RO SE SI SK TR |
| AX | Request for extension of the european patent | Extension state: AL LT LV MK |
| DAX | Request for extension of the european patent (deleted) | |
| RBV | Designated contracting states (corrected) | Designated state(s): DE ES FR GB IT |
| 17Q | First examination report despatched | Effective date: 20100615 |
| RAP1 | Party data changed (applicant data changed or rights of an application transferred) | Owner name: SIEMENS ENTERPRISE COMMUNICATIONS GMBH & CO. KG |
| STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
| 18D | Application deemed to be withdrawn | Effective date: 20101228 |