EP1590795A1

EP1590795A1 - Generation and deletion of pronunciation variations in order to reduce the word error rate in speech recognition

Info

Publication number: EP1590795A1
Application number: EP04704214A
Authority: EP
Inventors: Tobias Schneider; Andreas Schröer; Michael Wandinger; Günter DI STEINMASSL
Original assignee: Siemens AG
Current assignee: Unify GmbH and Co KG
Priority date: 2003-02-04
Filing date: 2004-01-22
Publication date: 2005-11-02
Also published as: US20060143008A1; DE10304460B3; WO2004070702A1

Abstract

Disclosed is a speech recognition method which is based on a dynamic extension of the word models in combination with an evaluation of the pronunciation variations.

Description

Generation and deletion of pronunciation variants to reduce the word error rate in speech recognition

In phoneme-based speech recognition, the phoneme sequence of every word belonging to the vocabulary must be known. These phoneme sequences are entered in the vocabulary. During the actual recognition process, the so-called Viterbi algorithm then searches for the best path through the given phoneme sequences that correspond to the words. If recognition is not limited to single words, probabilities for transitions between the words can also be modeled and included in the Viterbi algorithm.

Recognition proves problematic for spoken utterances that deviate from the canonical phonetic transcription of a word that is usually used in the vocabulary, or that differ discriminatively from the utterances used during the training of a word model.

Such utterances can no longer be classified correctly by the existing models, and a misrecognition occurs. The reasons for these deviations include the particular accent of the speaker as well as the respective manner of speaking, which can be fast, indistinct or very slow, for example. Stationary and impulsive noise can also lead to misclassification.

Furthermore, technical systems, especially systems on so-called embedded platforms, for example in cell phones, are subject to a resource limitation that affects the size or power of the modeling.

Many application scenarios in speech recognition are based on an expansion of the word models in the speech recognizer or on the adaptation of word models already present in the speech recognizer.

In so-called SayIn, a new word model is generated by speaking an utterance. With a two-time enrollment, the speech recognizer has two different pronunciation variants available for classifying a word. This reduces the word error rate because the discriminative differences are better captured.

With so-called TypeIn, the phonetic model is inferred from the orthographic spelling by predefined rules or by statistical approaches. Since a written word is pronounced differently in different languages, several pronunciation variants can be generated in the vocabulary for each word. The literature also describes numerous methods for generating pronunciation variants. The large number of pronunciation variants in turn reduces the word error rate.

However, these methods have in common that at the time of modeling it is not known which of the respective pronunciation variants are relevant for an individual user during recognition. This is particularly the case with TypeIn, since the respective accent of the speaker is not taken into account.

To reduce the word error rate, speech recognition systems are adapted to their respective users. When word models are adapted, the acoustic modeling of the feature space on which the word models are based, which is available, for example, as a hidden Markov model (HMM), is adapted either by transformation, such as maximum likelihood linear regression (MLLR), or by model parameter prediction, such as regression model prediction (RMP) or maximum a posteriori prediction (MAP). This achieves a system state that is strongly adapted to the respective user. In contrast, other users are no longer recognized sufficiently well by such a system.

The speech recognizer is thus changed from a speaker-independent to a speaker-dependent system.

Usually, the complexity, i.e. the storage space consumption, increases with the number of possible words in the speech recognizer. In embedded systems, often only a very limited storage space is available, and it is not fully used when the speech recognizer holds only a small number of words.

Proceeding from this, the object of the invention is to provide speech recognition with a reduced word error rate which is particularly adaptable and has only a very low resource consumption.

This object is achieved by the inventions specified in the independent claims. Advantageous refinements result from the subclaims.

In a method for speech recognition, several pronunciation variants for a word to be recognized are stored, for example in the memory of a device which is set up for the method. Alternatively or in addition, these multiple pronunciation variants can also be generated and added to the vocabulary. Each time a word is recognized, it is registered which of the pronunciation variants of the word was recognized. After several recognition processes, the pronunciation variants are evaluated on the basis of how often each of them was recognized.

The frequency of recognition is used here as the simplest and least resource-consuming criterion. Of course, more complicated assessment methods are also conceivable, in which, for example, the degree of correspondence between the utterance to be recognized and the pronunciation variant recognized in each case is also taken into account.

The method can work with existing words stored in the vocabulary. However, the method gains a very decisive advantage if the word models can be dynamically expanded as an alternative or in addition. When a new word is added to the vocabulary, several pronunciation variants of the new word are automatically generated and likewise added to the vocabulary.

Several pronunciation variants for a word can be generated, for example, by phoneme replacement, phoneme deletion and / or phoneme insertion.
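The three operations named above can be illustrated with a minimal Python sketch. It is not taken from the patent: the function name `generate_variants`, the substitution table, the list of insertable phonemes and the `max_variants` limit are all assumptions chosen for illustration, and only single-edit variants are produced.

```python
from typing import Dict, List, Sequence, Tuple

Phonemes = Tuple[str, ...]


def generate_variants(
    canonical: Sequence[str],
    substitutions: Dict[str, List[str]],
    insertable: Sequence[str] = (),
    max_variants: int = 4,
) -> List[Phonemes]:
    """Return up to max_variants single-edit variants of a phoneme sequence.

    substitutions and insertable are hypothetical inputs; the patent does
    not prescribe where candidate phonemes come from.
    """
    base = tuple(canonical)
    seen = {base}
    variants: List[Phonemes] = []

    def add(candidate: Phonemes) -> None:
        # Skip empty sequences, duplicates, and anything beyond the limit.
        if candidate and candidate not in seen and len(variants) < max_variants:
            seen.add(candidate)
            variants.append(candidate)

    # Phoneme replacement: swap one phoneme for a listed alternative.
    for i, phoneme in enumerate(base):
        for replacement in substitutions.get(phoneme, []):
            add(base[:i] + (replacement,) + base[i + 1:])

    # Phoneme deletion: drop one phoneme at a time.
    for i in range(len(base)):
        add(base[:i] + base[i + 1:])

    # Phoneme insertion: insert one phoneme at every position.
    for i in range(len(base) + 1):
        for phoneme in insertable:
            add(base[:i] + (phoneme,) + base[i:])

    return variants
```

The limit `max_variants` stands in for the memory budget: generation simply stops once the permitted number of entries has been reached.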

In the case of language-independent speech recognizers in particular, it can also be advantageous if the pronunciation variants are generated for different languages.

In SayIn in particular, pronunciation variants can also be created, for example, by adding noise to the spoken signal (signal in the broader sense, i.e. speech, features, phoneme chain).

As an extension, alternatively or additionally, a further pronunciation variant for the spoken word can be generated from the utterance when an utterance is recognized as that word. A particularly good utilization of the available memory can be achieved if a maximum number of pronunciation variants is generated for several words.

Another important aspect of the invention relates to the evaluation of the pronunciation variants.

The method advantageously saves storage space if the number of stored pronunciation variants is reduced on the basis of the evaluation of the pronunciation variants. This can be achieved, for example, by deleting pronunciation variants that are recognized less frequently.

Pronunciation variants whose confidence lies below a threshold value are preferably deleted.

However, the speech recognizer can still be kept speaker-independent if the requirement is also set that the canonical pronunciation variant of the word is never deleted.

A device that is set up to carry out the method described above can be implemented, for example, by providing means by which one or more method steps can be carried out in each case. Advantageous configurations of the device result analogously to the advantageous configurations of the method.

A program product for a data processing system, which contains code sections with which one of the described methods can be carried out on the data processing system, can be realized by suitable implementation of the method in a programming language and translation into code executable by the data processing system. The code sections are stored for this purpose. A program product is understood here to mean the program as a tradable product. It can be in any form, for example on paper, on a computer-readable data medium or distributed over a network.

Further essential advantages and features of the invention result from the description of an embodiment.

The proposed method is based on a dynamic expansion of the word models in combination with an assessment of the pronunciation variants.

When adding a new word to the recognizer vocabulary, several pronunciation variants of this word are generated at the same time, which are also added to the vocabulary. These variants differ phonetically and can be created in different ways, depending on the technology used.

The memory available for the pronunciation variants is used optimally by generating a maximum number of variants.

With each recognition, in addition to the actual classification by the models, all pronunciation variants are evaluated with a confidence. In the event of successful recognition, that is, no misrecognition, these confidences are added to the confidences already accumulated for the pronunciation variants in previous recognition runs. With a simple "boolean" confidence, the value is 1 for the pronunciation variant that was referenced in this recognition and 0 for all other variants. A misrecognition can be determined, among other things, from the reaction of the user: for example, the recognition is repeated or a command initiated by voice is aborted. As an extension, a further pronunciation variant for the spoken word can be generated from the utterance upon recognition. Here it must again be ensured that no misrecognition has occurred. This step can also take place unnoticed by the user.

The accumulated confidence generated for each pronunciation variant is now used to reduce the vocabulary again at a given point in time. This is done by deleting those vocabulary entries whose accumulated confidence is below a certain threshold. These entries are generally pronunciation variants that have never or only rarely been referenced and are therefore not relevant for a recognition run.
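The bookkeeping described above (accumulate a confidence per variant after each successful recognition, then delete entries below a threshold) could look roughly like the following Python sketch. The class and method names are hypothetical, the protection of the canonical variant follows the requirement stated earlier in the description, and how a misrecognition is detected is left to the caller via the `success` flag.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class VariantEntry:
    word: str
    phonemes: str
    canonical: bool = False
    references: int = 0
    confidence: float = 0.0  # accumulated over successful recognition runs


@dataclass
class Vocabulary:
    entries: Dict[str, VariantEntry] = field(default_factory=dict)

    def add(self, name: str, word: str, phonemes: str, canonical: bool = False) -> None:
        self.entries[name] = VariantEntry(word, phonemes, canonical)

    def register(self, name: str, success: bool = True, confidence: float = 1.0) -> None:
        """Credit the referenced variant; confidence=1.0 is the boolean case."""
        if not success:
            # Misrecognition (e.g. the user repeated or aborted): no credit.
            return
        entry = self.entries[name]
        entry.references += 1
        entry.confidence += confidence

    def prune(self, threshold: float) -> List[str]:
        """Delete non-canonical entries whose accumulated confidence is below threshold."""
        removed = [
            name
            for name, entry in self.entries.items()
            if entry.confidence < threshold and not entry.canonical
        ]
        for name in removed:
            del self.entries[name]
        return removed
```

Because only vocabulary entries are removed and the acoustic models are never modified, this kind of bookkeeping needs just two counters per entry, which fits the low resource consumption the method aims for.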

Thanks to the deleted pronunciation variants, free space is now available for new words in the vocabulary.

In contrast to the prior art, the adaptation does not take place at the level of acoustic modeling (for example HMM). Instead, the adaptation is achieved by selecting one or more pronunciation variants. This selection depends on the referencing in the successful recognition runs. The available memory space is optimally used regardless of the number of words to be recognized.

If, for example with TypeIn, the original, canonical pronunciation variant is kept in the vocabulary, speaker independence is still guaranteed. If the system is used by several users, it is adapted to all users, since on average the frequently referenced pronunciation variants of all speakers are retained. An advantage over other adaptation methods is that the original system behavior can be restored at any time, since the HMM, i.e. the acoustic modeling of the feature space, remains untouched. No further information is required for the adaptation, such as the assignment of the states to features. The method can therefore be carried out without much additional code and memory and is therefore also suitable for the embedded area.

Deleting the pronunciation variants increases the reliability of recognition or rejection, since the relevant entries, that is to say the adapted models, are generally easier to distinguish discriminatively. At the same time, recognition is accelerated as the vocabulary becomes smaller.

In a phoneme-based speech recognition system, for example an HMM recognizer, word entries in the vocabulary are defined by their phoneme sequence or by a state sequence.

In the case of SayIn, pronunciation variants can be generated by adding noise to the speech data. Another way of creating variants is to modify the phoneme or state sequence obtained. This can be done with the help of random factors or with user-specific information, for example a confusion matrix from the most recent recognition runs. A confusion matrix can be created, for example, by a second recognition run with phonemes.
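As a rough illustration of the confusion-matrix idea, the following Python sketch counts how often an expected phoneme was recognized as another phoneme and proposes substitutions that occurred often enough. The function names, the assumption that expected and recognized phonemes are already aligned pairwise, and the `min_count` cutoff are not from the patent.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def build_confusion_matrix(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count (expected phoneme, recognized phoneme) pairs.

    The pairs are assumed to come from an already aligned second
    recognition run with phonemes; the alignment itself is not shown.
    """
    matrix: Dict[str, Counter] = defaultdict(Counter)
    for expected, recognized in pairs:
        matrix[expected][recognized] += 1
    return matrix


def variants_from_confusions(
    sequence: Sequence[str],
    matrix: Dict[str, Counter],
    min_count: int = 2,
) -> List[Tuple[str, ...]]:
    """Propose variants by applying frequently observed confusions."""
    base = tuple(sequence)
    variants: List[Tuple[str, ...]] = []
    for i, phoneme in enumerate(base):
        # most_common() lists the strongest confusions first.
        for other, count in matrix.get(phoneme, Counter()).most_common():
            if other != phoneme and count >= min_count:
                variant = base[:i] + (other,) + base[i + 1:]
                if variant not in variants:
                    variants.append(variant)
    return variants
```

Variants derived this way are user-specific, since they reflect the confusions actually observed for the speakers who used the device.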

TypeIn is used to infer the phoneme sequence from the orthographic spelling. For the assignment of graphemes to phonemes, statistical methods are known which, in addition to the most likely phoneme sequence, also deliver alternative phoneme sequences. The use of neural networks can serve as an example here. The assignment can also be made taking the respective language into account. For example, the name "Martin" is pronounced differently in German and French, so there are two different phoneme sequences. Of course, as with SayIn, the state sequences can also be generated by random factors and user-dependent information.

Example 1

"Herr Meier" is added to the vocabulary as a new entry.

The following (German-language) canonical phoneme sequence is determined using TypeIn:

Original-1: / h E r m aI 6 /

The variants could look like this. It is assumed that a total of five vocabulary entries correspond to the maximum permitted memory requirement:

Variant-1.1: / h e r m aI 6 /
Variant-1.2: / h E r m aI er /
Variant-1.3: / h 6 m aI 6 /
Variant-1.4: / h e r m l e 6 /

Selection or determination of the confidence of the variants

"Herr Meier" was called ten times by voice command. The five variants were referenced as follows, which corresponds to the boolean confidence already mentioned:

Pronunciation variant   #references   Σconfidence
Original-1:             4             4
Variant-1.1:            0             0
Variant-1.2:            6             6
Variant-1.3:            0             0
Variant-1.4:            0             0

In the adaptation step that follows, all variants with confidence 0 are deleted. The vocabulary now only contains the variants "Original-1" and "Variant-1.2".

Original-1: / h E r m aI 6 /
Variant-1.2: / h E r m aI er /

The vocabulary is thus reduced by more than half, from five entries to two. The load that the speech recognition (search) places on the processor is reduced to the same extent. At the same time, the risk of confusion with other commands is reduced.

Since the canonical variant "Original-1" still exists, the speaker independence for the following recognition runs is preserved.

Example 2

The name "Frau Martin" is now added to the vocabulary from Example 1 by means of phoneme-based SayIn. The determined phoneme sequence is:

Original-2: / f r aU m a r t e~ /

The variants of "Frau Martin" could look like this:

Variant-2.1: / f r aU m a r t i n /
Variant-2.2: / f r aU m a t n /

The vocabulary now contains the following entries:

Original-1: / h E r m aI 6 /
Variant-1.2: / h E r m aI er /
Original-2: / f r aU m a r t e~ /
Variant-2.1: / f r aU m a r t i n /
Variant-2.2: / f r aU m A t n /

Selection or determination of the confidence of the variants

"Herr Meier" is called three times and "Frau Martin" five times by voice command. The five variants are assessed with a confidence as follows. Instead of the boolean value, a confidence measure is now used that allows a statement about the reliability of the spoken utterance for each variant:

Pronunciation variant   #references   Σconfidence
Original-1:             2             100
Variant-1.2:            1             30
Original-2:             3             60
Variant-2.1:            1             10
Variant-2.2:            1             20

In the following adaptation step, all variants with a confidence of less than 25 are deleted. The vocabulary now only contains the variants "Original-1", "Variant-1.2" and "Original-2".

Original-1: / h E r m aI 6 /
Variant-1.2: / h E r m aI er /
Original-2: / f r aU m a r t e~ /

There are now 2 free entries available for further pronunciation variants or new words.
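The figures of Example 2 can be checked with a few lines of Python. This is only a sketch of the arithmetic: the reference counts and summed confidences are the values given in the example, the limit of five entries is the maximum stated in Example 1, and the variable names are chosen here for illustration.

```python
from collections import Counter

MAX_ENTRIES = 5   # maximum permitted vocabulary entries (from Example 1)
THRESHOLD = 25    # variants with a summed confidence below this are deleted

# name -> (word, number of references, summed confidence), as given in Example 2
table = {
    "Original-1": ("Herr Meier", 2, 100),
    "Variant-1.2": ("Herr Meier", 1, 30),
    "Original-2": ("Frau Martin", 3, 60),
    "Variant-2.1": ("Frau Martin", 1, 10),
    "Variant-2.2": ("Frau Martin", 1, 20),
}

# The references per word should add up to the number of voice commands.
refs_per_word = Counter()
for word, refs, _ in table.values():
    refs_per_word[word] += refs

kept = [name for name, (_, _, conf) in table.items() if conf >= THRESHOLD]
free_entries = MAX_ENTRIES - len(kept)

print(dict(refs_per_word))  # {'Herr Meier': 3, 'Frau Martin': 5}
print(kept)                 # ['Original-1', 'Variant-1.2', 'Original-2']
print(free_entries)         # 2
```

The reference counts match the three and five voice commands stated in the example, and the two deleted variants account for the two free entries.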

Claims

1. Method for speech recognition,

- in which several pronunciation variants for a word exist and/or are generated,

- in which a recognition process registers which of the pronunciation variants of the word is recognized,

- in which an analysis of the frequency of recognition of the individual pronunciation variants takes place after several recognition processes.

2. The method of claim 1, wherein the pronunciation variants are generated by phoneme replacement, phoneme deletion and/or phoneme insertion.

3. The method of claim 1 or 2, wherein the pronunciation variants are generated for different languages.

4. The method according to any one of the preceding claims, wherein the pronunciation variants are generated by adding noise.

5. The method as claimed in one of the preceding claims, in which one of the pronunciation variants, in particular after a recognition process, is generated on the basis of an utterance recognized as the word.

6. The method as claimed in one of the preceding claims, in which a maximum permissible number of pronunciation variants is specified for several, in particular all, words.

7. The method according to any one of the preceding claims, in which the number of stored pronunciation variants is reduced on the basis of the analysis of the frequency of recognition of the individual pronunciation variants.

8. The method according to claim 7, in which less frequently recognized pronunciation variants are deleted.

9. The method according to claim 8, in which those pronunciation variants are deleted whose confidence is below a threshold value.

10. The method of claim 8 or 9, wherein the canonical pronunciation variant is not deleted.

11. Device which is set up to carry out a method according to one of the preceding claims.

12. Program product for a data processing system, which contains code sections with which a method according to one of claims 1 to 10 can be carried out on a data processing system.