WO2006037446A1 - Method for adapting and/or creating statistical linguistic models - Google Patents

Method for adapting and/or creating statistical linguistic models

Info

Publication number
WO2006037446A1
WO2006037446A1 PCT/EP2005/009973 EP2005009973W
Authority
WO
WIPO (PCT)
Prior art keywords
path
word
correct
speech
speech recognition
Prior art date
Application number
PCT/EP2005/009973
Other languages
German (de)
French (fr)
Inventor
Albert FABREGAT SUBIRÀ
Udo Haiber
Harald HÜNING
Original Assignee
DaimlerChrysler AG
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by DaimlerChrysler AG
Publication of WO2006037446A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/065 - Adaptation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/183 - Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/19 - Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L15/197 - Probabilistic grammars, e.g. word n-grams

Definitions

  • the invention relates to a method for adapting and/or generating statistical language models for automatic speech recognition systems.
  • the speech recognition is carried out in particular by means of statistical models.
  • acoustic models which are based on so-called HMM models (Hidden Markov Model), and linguistic language models, which represent occurrence probabilities of semantic and syntactic language elements, are used.
  • frequently, the statistical language models used for speech recognition - for estimating the probabilities of particular word sequences as speech input - are not provided with sufficient training material.
  • the training material mostly consists of a large amount of text data. Especially with regard to the above-mentioned goal of allowing freer voice input, such comprehensive training data would be urgently needed.
  • the present invention has for its object to provide a method for adapting and/or generating statistical language models of the type mentioned at the outset, which avoids the disadvantages of the prior art and in particular requires only a small amount of text data as training material.
  • Fig. 1 is an illustration of the structure of the method according to the invention.
  • Fig. 3 shows an overview of the consideration of side effects.
  • Fig. 5 is an illustration of a dynamic threshold.
  • the individual probabilities of a known statistical language model are referred to as uni-grams, bi-grams, tri-grams or N-grams, which represent the occurrence probability of a word when zero, one, two or N-1 words have already preceded it. If a speech recognition system has to decide between alternative recognized sentences, it takes into account both the language model and the acoustic scores of an HMM model. The word recognition results sometimes differ from the spoken words. These deviations are counted as word errors (substitutions, insertions and deletions) with respect to a reference transcription (correct path).
  • the speech recognition system first generates an internal superordinate word graph (jumbo graph), which has a large number of alternative sequences (paths) of word hypotheses or sentences with respective path scores on the basis of the corresponding occurrence probabilities.
  • the speech recognition system delivers as output either a specific sentence, the so-called best path, or a reduced word graph.
  • a word graph is shown by way of example in DE 198 42 151 A1 (see FIG. 3 there).
  • if any path of the superordinate graph is the correct sentence/path (i.e. the reference path) but has not been identified as the best path after applying the language model to the superordinate graph, i.e. the correct path does not appear in the output of the speech recognition system, then the language model or its probabilities should be changed in such a way that this sentence appears in the output the next time.
  • path scores in a word hypothesis graph of the language model are compared, in particular by forming distance values between at least two paths of the word hypothesis graph; at least one best path in the word hypothesis graph with respect to the speech recognition process is identified; at least one correct path, or a path counting as correct with a minimum number of word errors, is marked in the word hypothesis graph; the comparison of the distance values of the best and the correct paths is carried out in such a way that an adaptation of the language model can be achieved which leads to a smaller number of word errors in the best path if the same speech input is entered again during the speech recognition process.
  • FIG. 1 shows the coarse structure of a method 1 according to the invention for adapting and / or generating a statistical language model 2 for automatic speech recognition systems (not shown).
  • word hypothesis graphs are created from acoustic speech data 3 in a step 4 and are stored as superordinate word graphs (jumbo graphs) in internal data 5 of the speech recognition system.
  • the statistical language model 2 is used in order to get from the superordinate word graphs to an output 6 of the speech recognition system.
  • the word graphs are evaluated.
  • the path scores for each possible path of the word graphs are compared.
  • the superordinate word graphs of the internal data 5 of the speech recognition system are stored as current speech recognition results 8; an adaptation of the language model 2 is then determined from them.
  • each path is scored with the following equation (1), and only the path with the best path score is selected and output as the recognized sentence:

    path score = Σ_{i=0}^{N-1} ac_i + v · Σ_{j=0}^{N-1} p(w_j | w_{j-2}, w_{j-1}) + N · pen + N_p · pWeight    (1)

  • ac_i are the logarithmic acoustic scores of the words, and v is a global language model weight (relative to the acoustic scores)
  • p(w_j | w_{j-2}, w_{j-1}) are logarithmic tri-gram probabilities, and N is the number of words of the computed path
  • pen is a penalty value for a higher or lower number of word hypotheses per path, and N_p is the number of pauses within a path
  • pWeight is an empirically set pause weight
  • path scores must be calculated for several paths each from a plurality of word graphs. These data are stored. A comparison between path scores across several word graphs is thus made possible by calculating difference or distance values. Because of their value range, it is favorable to give the path scores a negative logarithmic format. For each sentence that is considered, a distance value is stored. These are calculated as follows:
  • if the best sentence is the correct sentence, the distance value of the best sentence results from the absolute value of the difference between the path score of the best sentence and that of the second-best sentence.
  • for the remaining sentences of the word graph, the respective distance value results from the absolute value of the difference between the path score of the respective sentence and the path score of the best sentence.
  • if the best sentence is not the correct sentence, the distance value of the best sentence results from the absolute value of the difference between its path score and that of the correct sentence. If several correct sentences have been determined, the path score closest to that of the best path is considered, because the closer a path score comes to that of the best sentence, the easier it is to make it the best. For the remaining sentences of the word graph, the respective distance value results from the absolute value of the difference between the path score of the best sentence and the path score of the respective sentence.
  • Fig. 2 shows the different cases when determining the distance values for a speech utterance with the unique designation KILW047.
  • each circle represents the path score of a path, and the distance values are shown as arrows.
  • the two hatched circles represent erroneous paths, while the unhatched ones represent correct paths.
  • the necessary data is stored together in a distance file.
  • the first line of the following Table 1 contains the (unique) name of the superordinate graph.
  • ⁇ s> and ⁇ / s> mark the beginning and the end of the respective sentence.
  • the goal now is to make the path score of the best sentence less than that of the reference sentence.
  • the speech recognition system should select the correct sentence as the best one due to the changes to be made.
  • the probabilities of those N-grams are increased, which occur only in the correct path and not in the best path, and the probabilities of those N-grams are reduced which occur only in the best path and not in the correct path.
  • the distance should, so to speak, be distributed between the tri-grams that caused the error.
  • the path scores of the correct sentences are increased or those of the incorrect ones are reduced.
  • two tri-grams are involved in the error, so they can be increased for the correction. It is possible to increase the bi-gram "<s> Zeig", the tri-gram "<s> Zeig mir", or even all of them. The last option is the most convenient, since it requires only small changes to the tri-grams to reduce the difference between the path scores, so that other sentences are affected less.
  • another possibility is to reduce the tri-grams of the best sentence. In the present case, a combination of increasing and reducing the tri-grams is used.
  • the distance is distributed among all possible tri-grams in order to reduce it to zero.
  • this should correct the error, provided of course that no other sentences are affected by side effects.
  • such errors can be prevented by analyzing the stored data. This is achieved by defining constraints that determine when N-grams may be changed. Suppose a tri-gram is to be increased. The tri-gram is then searched for among all sentences in the distance file. Four different situations can occur per sentence found, depending on the respective stored flags:
  • BE: if the sentence containing the tri-gram to be increased is the best sentence of a superordinate graph but contains an error, the tri-gram cannot be increased, because otherwise the path score of the erroneous path would also be increased. This would make the correction more difficult. Nevertheless, there is an exception: if the reference sentence of the superordinate graph also contains the tri-gram, it is increased as desired. This preserves the distance between the best and the correct sentence.
  • BC: in such a case, the tri-gram is increased because the sentence is correct. If the path score of the sentence is improved, misrecognitions are reduced.
  • SC: in this case too, the tri-gram is increased, even though it is not the recognized sentence. Incidentally, recognizing the correct sentence is made easier when its path score is increased.
  • BC: in this case, the tri-gram can be reduced as long as the path score of the best sentence remains higher than that of the second-best sentence. In other words, the reduction of the tri-gram must not cause a degradation of the path score that is greater than the distance value.
  • SC: if the tri-gram was found in a sentence that is not the best sentence but is the correct sentence, it must not be reduced.
  • Fig. 3 shows an overview of the consideration of the side effects when changes to the tri-grams are to be carried out.
  • if a tri-gram was not found in the language model, this corresponds to a so-called back-off case.
  • a new tri-gram can be introduced as a normal tri-gram into the optimized language model, or the change can be distributed among the values that serve to calculate the back-off probability (usually a lower-order N-gram probability and a back-off weight).
  • the embodiment of the method according to the invention outlined below iteratively recalculates all path evaluations (and distance values), which advantageously leads to an improvement in the treatment of side effects.
  • the core idea is to use a classifier to reproduce a comparison of path scores for different paths of the parent graph (see Fig. 4).
  • the parameters of the classifier should be convertible into probabilities of the language model. Different classifier architectures have in common that they require many numerical values as inputs and have some sort of threshold function to provide an output, such as "0" or "1".
  • the probabilities of the language model 2 involved in the error are changed according to a learning rule. If there was no error, the language model is not changed. This procedure is carried out for each of the superordinate graphs (jumbo graphs). Subsequently, iterative processing takes place; in other words, the procedure is carried out several times on the superordinate graphs. With a suitable choice of the learning rule of the classifier, the number of errors keeps decreasing as long as the method is applied.
  • the corrections are carried out according to a so-called cross-entropy learning rule. The behavior of this learning rule is desirable because it has been shown to minimize the number of errors instead of minimizing the squared error like the gradient descent learning rule, for what matters here is essentially the error frequency.
  • the transfer of the data to a neural network 9 as a classifier is shown in FIG. 4.
  • one input is provided for each tri-gram of the language model 2.
  • the input value indicates how often the tri-gram occurs in the respective path.
  • the transfer of a path in this way is referred to as a learning pattern.
  • these learning patterns are divided into two target values: correct ("1") and not correct ("0").
  • the output value of the learning pattern with the target value "1” should be greater than zero.
  • the output value of the learning patterns with the target value "0” should be less than zero.
  • the data originating from the same superordinate graph should be treated together as a group. The reason for this is that the comparison of the path scores must be translated into a dynamic threshold function of the neural network classifier 9.
  • the solution is to set a dynamic threshold in such a way that it imitates the decision as to which is the highest path score.
  • the dynamic threshold is calculated on each pass and is different for each group (i.e. for each superordinate graph). The goal is that only the presentation of the best path causes the threshold of the activation function to be exceeded. Consequently, the activation function is active only for the best path.
  • Setting the dynamic threshold requires prior input of all the learning patterns of a group. Thus, the transfer takes place as follows. First, the transfer of all learning patterns takes place in order to determine the dynamic threshold value. Subsequently, the calculated value is subtracted from all path evaluations and the new values are stored. These new values may be both above and below the threshold. This output is compared with the target values which indicate whether a path is correct or not.
  • the learning is performed on those learning patterns whose output is not identical to the target value.
  • the learning rule modifies the weights, which are later translated back into the language model.
  • a first way to calculate the dynamic threshold is to form the mean between the best path and the second-best path.
  • however, since there may be more than one correct path, the dynamic threshold is calculated as the mean between the best path score of all correct sentences and the best path score of all erroneous sentences. This calculation is shown in Fig. 5.
  • the purpose of the bounds is to force the erroneous sentences to lie not merely below the threshold, but below the threshold minus a predetermined bound. The same applies to the correct sentence: it must lie above the threshold plus a certain bound. These bounds are determined empirically. In practice, they are set to -0.1 and +0.1, since the output moves within [-1, 1]. As can be seen in Fig. 5, the unhatched circle (correct sentence) must lie above the upper bound, and the hatched circles (erroneous sentences) must lie below the lower bound. This defines a confidence interval. If circles lie within the interval, it is not certain that the error will be corrected.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method for adapting and/or creating statistical linguistic models for automatic speech recognition systems. Said method takes into account current speech recognition results and specific acoustic conditions.

Description

Method for adapting and/or generating statistical language models
The invention relates to a method for adapting and/or generating statistical language models for automatic speech recognition systems.
Known automatic speech recognition systems are used in a wide variety of environments. For example, they are used as parts of dialog systems in motor vehicles for controlling information or entertainment systems (navigation system, telephone, radio or the like) via voice inputs. Today, speech recognition is still largely restricted to certain predefined commands, but in the future freer speaking by the users or drivers is the aim.
Speech recognition is carried out in particular by means of statistical models. Both acoustic models, which are based on so-called HMM models (Hidden Markov Models), and linguistic language models, which represent occurrence probabilities of semantic and syntactic language elements, are used.
Frequently, especially in dialog systems, there is the problem that not enough training material is available for the training of the statistical language models used for speech recognition - for estimating the probabilities of particular word sequences as speech input - which must be carried out before the system is put into operation. The training material mostly consists of a large amount of text data. Especially with regard to the above-mentioned goal of allowing freer voice input, such comprehensive training data would be urgently needed.
In addition, there is often the problem that speech recordings under realistic conditions are very cost-intensive.
Usually, language models are not adapted to specific acoustic situations. As described above, separate models are used for this, which additionally complicates free speech input.
DE 198 42 151 A1 discloses a method for adapting linguistic language models in systems with automatic speech recognition.
Also known from the prior art is so-called "discriminative training", in which the language model and the acoustic model are trained together. However, this requires large amounts of acoustic training data, which must lie in the domain of the corresponding language model and are likewise often not available.
The present invention is based on the object of providing a method for adapting and/or generating statistical language models of the type mentioned at the outset which avoids the disadvantages of the prior art and, in particular, manages with a small amount of text data as training material.
This object is achieved according to the invention by claim 1.
These measures advantageously make it possible to generate or adapt probabilities for statistical language models without large amounts of text data having to be available as training material. At the same time, specific acoustic conditions are taken into account during the adaptation or generation. The method manages with a small amount of data, which can be extracted from existing current speech recognition results. The language model is thus easy to adapt. In addition, provided acoustic data are available that cover the entire target domain of the language model, it is in particular possible to build up an entirely new statistical language model.
Advantageous embodiments and further developments of the invention emerge from the subclaims. Exemplary embodiments of the invention are described in principle below with reference to the drawing.
In the drawing:
Fig. 1 shows the structure of the method according to the invention;
Fig. 2 shows the distance values of a speech utterance;
Fig. 3 shows an overview of the consideration of side effects;
Fig. 4 shows a simplified representation of a neural network; and
Fig. 5 shows a dynamic threshold value.
The individual probabilities of a known statistical language model are referred to as uni-grams, bi-grams, tri-grams or N-grams, which represent the occurrence probability of a word when zero, one, two or N-1 words have already preceded it. If a speech recognition system has to decide between alternative recognized sentences, it takes into account both the language model and the acoustic scores of an HMM model. The word recognition results sometimes differ from the spoken words. These deviations are counted as word errors (substitutions, insertions and deletions) with respect to a reference transcription (correct path). The speech recognition system first generates an internal superordinate word graph (jumbo graph), which contains a large number of alternative sequences (paths) of word hypotheses, i.e. sentences, with respective path scores based on the corresponding occurrence probabilities. After the language model has been applied to this superordinate word graph, the speech recognition system delivers as output either a specific sentence, the so-called best path, or a reduced word graph. Such a word graph is shown by way of example in DE 198 42 151 A1 (see Fig. 3 there).
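As a rough illustration of these N-gram probabilities, the following minimal sketch estimates tri-gram scores from tokenized sentences by simple maximum likelihood. The patent does not say how the initial language model is trained; the training scheme, the negative-log convention and all names are assumptions made here for illustration.

```python
from collections import defaultdict
import math

def train_trigram_scores(sentences):
    """Estimate tri-gram scores -log p(w | u, v) by maximum likelihood.
    Negative logarithms match the path score format used in the text."""
    tri_counts = defaultdict(int)   # counts of (u, v, w)
    ctx_counts = defaultdict(int)   # counts of the context (u, v)
    for words in sentences:
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        for u, v, w in zip(padded, padded[1:], padded[2:]):
            tri_counts[(u, v, w)] += 1
            ctx_counts[(u, v)] += 1
    return {(u, v, w): -math.log(c / ctx_counts[(u, v)])
            for (u, v, w), c in tri_counts.items()}

# Tiny example with one sentence from Table 1 below.
scores = train_trigram_scores([["neues", "Ziel", "eingeben"]])
```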
It is now desirable that, if any path of the superordinate graph is the correct sentence/path (i.e. the reference path) but has not been identified as the best path after applying the language model to the superordinate graph, i.e. the correct path does not appear in the output of the speech recognition system, the language model or its probabilities should be changed in such a way that this sentence appears in the output the next time.
Accordingly, the following method steps are proposed as an embodiment of the invention:
Path scores in a word hypothesis graph of the language model are compared, in particular by forming distance values between at least two paths of the word hypothesis graph; at least one best path in the word hypothesis graph with respect to the speech recognition process is identified; at least one correct path, or a path counting as correct with a minimum number of word errors, is marked in the word hypothesis graph; the comparison of the distance values of the best and the correct paths is carried out in such a way that an adaptation of the language model can be achieved which leads to a smaller number of word errors in the best path if the same speech input is entered again during the speech recognition process.
Limits arise in this respect from so-called side effects of the changes. If a recognized sentence is changed, this can cause errors in other sentences. It is accordingly advantageous that a side effect with regard to word errors in other paths is determined when the probabilities are to be changed.
Fig. 1 shows the coarse structure of a method 1 according to the invention for adapting and/or generating a statistical language model 2 for automatic speech recognition systems (not shown). For this purpose, word hypothesis graphs are created from acoustic speech data 3 in a step 4 and stored as superordinate word graphs (jumbo graphs) in internal data 5 of the speech recognition system. The statistical language model 2 is applied in order to get from the superordinate word graphs to an output 6 of the speech recognition system. In a step 7, the word graphs are scored. In order to output the best sentence, the path scores for every possible path of the word graphs are compared. The superordinate word graphs of the internal data 5 of the speech recognition system are stored as current speech recognition results 8; an adaptation of the language model 2 is then determined from them. The language model 2 is applied in the scoring of the superordinate word graphs on the basis of a comparison of alternative paths of the word graph (Figure 1: step 7). Each path is scored with the following equation (1), and only the path with the best path score is selected and output as the recognized sentence.
path score = Σ_{i=0}^{N-1} ac_i + v · Σ_{j=0}^{N-1} p(w_j | w_{j-2}, w_{j-1}) + N · pen + N_p · pWeight    (1)

where:

ac_i are the logarithmic acoustic scores of the words,
v is a global language model weight (relative to the acoustic scores),
p(w_j | w_{j-2}, w_{j-1}) are logarithmic tri-gram probabilities,
N is the number of words of the computed path,
pen is a penalty value for a higher or lower number of word hypotheses per path,
N_p is the number of pauses within a path, and
pWeight is an empirically set pause weight.
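A minimal sketch of equation (1) as reconstructed above: negative-log acoustic and tri-gram scores are summed, the language model sum is weighted by v, and the word-count and pause penalties are added. The function and parameter names are illustrative; with negative-log scores the best path is the one with the smallest value.

```python
def path_score(acoustic_scores, words, trigram_score,
               v=1.0, pen=0.0, p_weight=0.0):
    """Equation (1): acoustic scores plus v times the tri-gram scores
    plus a word-count penalty and a pause penalty, all in negative-log
    format (smaller totals are better)."""
    n = len(words)
    padded = ["<s>", "<s>"] + list(words)
    lm = sum(trigram_score(padded[j], padded[j + 1], padded[j + 2])
             for j in range(n))
    # Which tokens count as pauses is an assumption; here only #PAUSE#.
    n_pauses = sum(1 for w in words if w == "#PAUSE#")
    return sum(acoustic_scores) + v * lm + n * pen + n_pauses * p_weight
```

With the `scores` dictionary from the earlier sketch, a call could look like `path_score(ac, words, lambda u, v, w: scores.get((u, v, w), 10.0))`, where 10.0 is a placeholder score for unseen tri-grams.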
According to the invention, path scores must be calculated for several paths each from several word graphs. These data are stored. A comparison of path scores across several word graphs is thus made possible by calculating difference or distance values. Because of their value range, it is favorable to give the path scores a negative logarithmic format. For each sentence that is considered, a distance value is stored. These are calculated as follows (a small code sketch follows the two rules):
1. If the best sentence is the correct sentence, the distance value of the best sentence results from the absolute value of the difference between the path score of the best sentence and that of the second-best sentence. For the remaining sentences of the word graph, the respective distance value results from the absolute value of the difference between the path score of the respective sentence and the path score of the best sentence.
2. If the best sentence is not the correct sentence, the distance value of the best sentence results from the absolute value of the difference between its path score and that of the correct sentence. If several correct sentences have been determined, the path score that comes closest to that of the best path is considered, because the closer a path score comes to that of the best sentence, the easier it is to make it the best. For the remaining sentences of the word graph, the respective distance value results from the absolute value of the difference between the path score of the best sentence and the path score of the respective sentence.
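The two rules could be implemented as follows; this sketch follows the wording above and assumes negative-log scores (smaller is better), sentence records shaped like the distance-file lines of Table 1 below, and at least one correct path per graph.

```python
def add_distance_values(entries):
    """Attach a 'distance' to every sentence record of one jumbo graph.
    Each record needs 'score' (negative-log path score) and 'correct'
    (True for the =C= flag)."""
    best = min(entries, key=lambda e: e["score"])
    others = [e for e in entries if e is not best]
    if best["correct"]:
        # Rule 1: distance of the best sentence to the second-best one.
        reference = min(others, key=lambda e: e["score"])
    else:
        # Rule 2: distance to the correct sentence whose path score
        # comes closest to that of the best path.
        correct = [e for e in others if e["correct"]]
        reference = min(correct, key=lambda e: abs(e["score"] - best["score"]))
    best["distance"] = abs(best["score"] - reference["score"])
    for e in others:
        e["distance"] = abs(e["score"] - best["score"])
    return entries
```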
Fig. 2 shows the different cases in the determination of the distance values for a speech utterance with the unique designation KILW047. Each circle represents the path score of a path, and the distance values are shown as arrows. The two hatched circles represent erroneous paths, while the unhatched ones represent correct paths.
Advantageously, the necessary data are stored together in a distance file. The first line of Table 1 below contains the (unique) name of the superordinate graph. Below it, the generated sentences are stored as follows: distance value, path score, name of the superordinate graph, a first flag (=B= or =S=) indicating whether it is the best sentence (=B=) or not (=S=), a second flag (=C= or =E=) indicating whether it is the reference sentence, i.e. the correct sentence (=C=), or whether it contains an error (=E=), and finally the associated word sequence. <s> and </s> mark the beginning and the end of the respective sentence.
Table 1:

KILW047
173.704 744.355 KILW047 =B= =C= <s> #PAUSE# neues Ziel eingeben #PAUSE# </s>;
75.241 819.596 KILW047 =S= =C= <s> #NOISE# neues Ziel eingeben #PAUSE# </s>;
449.679 1194.034 KILW047 =S= =E= <s> #PAUSE# <zahl> ist <hotel> mir eingeben #PAUSE# </s>;
173.704 918.059 KILW047 =S= =E= <s> #PAUSE# neues <zahl> eingeben #PAUSE# </s>;
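A parser for this file might look as follows; a sketch assuming exactly the layout of Table 1, i.e. one name line per jumbo graph followed by semicolon-terminated sentence lines.

```python
def parse_distance_file(path):
    """Read a distance file into {graph name: [sentence records]}.
    Field order per sentence line: distance, path score, graph name,
    =B=/=S= flag, =C=/=E= flag, word sequence (with <s> ... </s>)."""
    graphs, current = {}, None
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip().rstrip(";")
            if not line:
                continue
            fields = line.split()
            if len(fields) == 1:                 # header line: graph name
                current = graphs.setdefault(fields[0], [])
            else:
                current.append({
                    "distance": float(fields[0]),
                    "score": float(fields[1]),
                    "graph": fields[2],
                    "best": fields[3] == "=B=",
                    "correct": fields[4] == "=C=",
                    "words": fields[5:],
                })
    return graphs
```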
A calculation can now be carried out with regard to the changes to the probabilities of the language model according to equation (1) that are necessary in order to correct the errors. Consider the following example:
Correct path: "<s> Zeig mir die letzte Nummer noch einmal an </s>"
Best path: "<s> Fahrzeug mir die letzte Nummer noch einmal an </s>"
The goal is now to make the path score of the best sentence lower than that of the reference sentence. In other words, as a result of the changes to be made, the speech recognition system should select the correct sentence as the best one. The probabilities of those N-grams which occur only in the correct path and not in the best path are increased, and the probabilities of those N-grams which occur only in the best path and not in the correct path are reduced.
In the present case, the distance should be distributed, so to speak, among the tri-grams that caused the error. There are two possibilities for this: either the path scores of the correct sentences are increased or those of the erroneous ones are reduced. In the first case, two tri-grams are involved in the error, so these can be increased for the correction. It is possible to increase the bi-gram "<s> Zeig", the tri-gram "<s> Zeig mir", or even all of them. The last option is the most convenient, since it requires only small changes to the tri-grams to reduce the difference between the path scores, so that other sentences are influenced less. A further possibility is to reduce the tri-grams of the best sentence. In the present case, a combination of increasing and reducing the tri-grams is used. The distance is distributed among all possible tri-grams in order to reduce it to zero. This should correct the error, provided of course that no other sentences are affected by side effects. Such errors can be prevented by analyzing the stored data. This is achieved by defining constraints which determine when N-grams may be changed; a possible implementation of these checks is sketched after the two case lists below. Suppose a tri-gram is to be increased. The tri-gram is then searched for among all sentences in the distance file. Four different situations can occur per sentence found, depending on the respective stored flags:
1. BE: if the sentence containing the tri-gram to be increased is the best sentence of a superordinate graph but contains an error, the tri-gram cannot be increased, because otherwise the path score of the erroneous path would also be increased. This would make the correction more difficult. Nevertheless, there is an exception: if the reference sentence of the superordinate graph also contains the tri-gram, it is increased as desired. The distance between the best and the correct sentence is thereby preserved.
2. SE: in this case, the sentence containing the tri-gram is neither the recognized one nor the correct one. Consequently, the tri-gram can be increased, but by no more than the distance to the path score of the best sentence. Otherwise this sentence would become the best sentence, which would lead to a further error. Here too, however, there is an exception: if the correct sentence of the superordinate graph contains the same tri-gram, it is increased as desired.
3. BC: in such a case, the tri-gram is increased, since the sentence is correct. If the path score of the sentence is improved, misrecognitions are reduced.
4. SC: in this case too, the tri-gram is increased, even though it is not the recognized sentence. Incidentally, recognizing the correct sentence is made easier when its path score is increased.
So far, only the increase of a tri-gram of a correct sentence has been considered. However, it is also possible to reduce the tri-grams of the best sentence if it contains an error. In the above example, the tri-grams "<s> Fahrzeug mir" and "Fahrzeug mir die" are involved in the error. Similar restrictions are used for the reduction; only the conditions for a change differ.
1. BE: if the tri-gram is reduced, the path score of the best sentence, not that of the correct sentence, is degraded. Therefore there is no restriction here.
2. SE: here, too, there is no restriction on the reduction.
3. BC: in this case, the tri-gram can be reduced as long as the path score of the best sentence remains higher than that of the second-best sentence. In other words, the reduction of the tri-gram must not cause a degradation of the path score that is greater than the distance value.
4. SC: if the tri-gram was found in a sentence that is not the best sentence but is the correct sentence, it must not be reduced.
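One possible reading of the eight cases is sketched below: every sentence in the distance file caps the amount by which a tri-gram score may be moved, and the overall permissible change is the minimum of these caps. The record layout follows the parsing sketch above; the patent defines the cases only verbally, so the exact arithmetic is an assumption.

```python
def ngrams(words, n=3):
    """All n-grams of a word sequence, as a set of tuples."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def increase_cap(trigram, sent, graph_sents, wanted):
    """Permissible increase given one sentence (cases BE, SE, BC, SC)."""
    if trigram not in ngrams(sent["words"]):
        return wanted
    in_reference = any(trigram in ngrams(s["words"])
                       for s in graph_sents if s["correct"])
    if not sent["correct"]:
        if sent["best"]:                             # BE: forbidden ...
            return wanted if in_reference else 0.0   # ... unless in reference
        if not in_reference:                         # SE: at most the distance
            return min(wanted, sent["distance"])
    return wanted                                    # BC and SC: allowed

def decrease_cap(trigram, sent, wanted):
    """Permissible decrease given one sentence."""
    if trigram not in ngrams(sent["words"]):
        return wanted
    if sent["correct"]:
        if not sent["best"]:                         # SC: must not be reduced
            return 0.0
        return min(wanted, sent["distance"])         # BC: at most the distance
    return wanted                                    # BE and SE: no restriction
```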
Fig. 3 shows an overview of the consideration of the side effects when changes are to be carried out on the tri-grams.
It is often desirable to prefer sentences that are more relevant for the user. In other words, it is less problematic to correct sentences with higher priority, even if sentences with lower priority are affected by side effects of this. For example, sentences containing critical words that are elementarily important for a subsequent dialog or the like can be assigned a higher priority. This embodiment is realized as follows: first, a list of prioritized sentences must be provided. Then the kind of preference is to be defined by means of the constraints. For sentences without priority, the method works according to the above-mentioned constraints of Fig. 3. If, however, a sentence with priority contains an error, the method is modified. In the case of a side effect, two possibilities can occur. If the error was produced in a sentence with priority, the method works as before. The change can nevertheless be carried out if the constraint concerns a sentence without priority. These measures can, however, increase the overall error rate, since the sentences without priority are degraded.
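Combined with the caps above, a priority-aware check might waive constraints stemming from non-priority sentences whenever the error being corrected lies in a prioritized sentence. This is one hypothetical realization; `increase_cap` is the helper from the previous sketch and `priority` a set of prioritized word sequences.

```python
def allowed_increase(trigram, wanted, graphs, priority, fixing_priority_error):
    """Cap a proposed tri-gram increase over all stored sentences; caps
    from non-priority sentences are ignored while a priority error is
    being corrected (at the cost of a possibly higher overall error
    rate, as noted in the text)."""
    cap = wanted
    for sents in graphs.values():
        for sent in sents:
            limit = increase_cap(trigram, sent, sents, wanted)
            if limit >= cap:
                continue
            if fixing_priority_error and tuple(sent["words"]) not in priority:
                continue    # side effect on a non-priority sentence is accepted
            cap = limit
    return cap
```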
Although the present exemplary embodiment concerns only tri-grams, a corresponding application to other N-grams is possible analogously. If a tri-gram was not found in the language model, this corresponds to a so-called back-off case. In that case, a new tri-gram can be introduced as a normal tri-gram into the optimized language model, or the change can be distributed among the values that serve to calculate the back-off probability (usually a lower-order N-gram probability and a back-off weight).
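The back-off case can be pictured as follows: if a tri-gram is absent, its score is composed of a back-off weight for the context and the lower-order score; the addition reflects the logarithmic domain. The storage layout sketched here is an assumption, not the patent's own data structure.

```python
from dataclasses import dataclass, field

@dataclass
class BackoffModel:
    unigrams: dict = field(default_factory=dict)   # (w,)      -> score
    bigrams: dict = field(default_factory=dict)    # (v, w)    -> score
    trigrams: dict = field(default_factory=dict)   # (u, v, w) -> score
    backoff: dict = field(default_factory=dict)    # context   -> back-off weight
    unk: float = 10.0                              # score for unknown words

def backoff_trigram_score(m, u, v, w):
    """Tri-gram score with back-off to the bi-gram and uni-gram level;
    in the log domain the back-off weight is simply added."""
    if (u, v, w) in m.trigrams:
        return m.trigrams[(u, v, w)]
    if (v, w) in m.bigrams:
        return m.backoff.get((u, v), 0.0) + m.bigrams[(v, w)]
    return (m.backoff.get((u, v), 0.0) + m.backoff.get((v,), 0.0)
            + m.unigrams.get((w,), m.unk))
```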
Up to now, the path scores are not recalculated after a change of a language model probability. The control of the side effects is therefore incomplete when several probabilities are changed at once. In contrast to this, the embodiment of the method according to the invention outlined below recalculates all path scores (and distance values) iteratively, which advantageously leads to an improvement in the treatment of side effects. The core idea is to use a classifier to reproduce the comparison of path scores for different paths of the superordinate graph (see Fig. 4). The parameters of the classifier should be convertible into probabilities of the language model. Different classifier architectures have in common that they take many numerical values as inputs and have some kind of threshold function in order to deliver an output such as "0" or "1". Furthermore, there exist learning classifiers which adapt some of their parameters in response to presented input/output pairs together with a learning signal. Such a learning classifier is used here.
It is based on the observation that the path evaluation calculation in the logarithmic domain corresponds to a weighted sum which is common to many classifiers (as part of a so-called neuron function). In addition to the presentation of the path evaluation formula as a classifier (Fig. 4), attention must be paid to the representation of the data and how a dynamic threshold value is applied. The following conditions must be translated into input / output pairs of a classifier. The sentence with the best path score corresponds to the output of the speech recognition system. If the detected sentence is not the reference sentence and an error occurred, the probabilities of the language model 2 involved in the error are changed according to a learning rule. If there was no error, the language model is not changed. This procedure is carried out for each of the superordinate graphs (jumbo graphs). Subsequently, an iterative processing takes place. In other words, the process is carried out several times to the higher-level graph. With a suitable choice of the learning rule of the classifier, the number of errors always decreases as long as the method is used.
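The correspondence between a log-domain path score and the weighted sum of a neuron can be made concrete with a small sketch; the representation (each tri-gram of the path contributes its occurrence count times its log-probability) and the example values are illustrative assumptions, not the reference implementation.

    def path_score(trigram_counts, log_probs):
        # In the logarithmic domain the path score is a weighted sum:
        # the tri-gram log-probabilities act as weights, the tri-gram
        # counts of the path as inputs -- the core of a neuron function.
        return sum(count * log_probs[trigram]
                   for trigram, count in trigram_counts.items())

    counts = {("ich", "suche", "eine"): 1, ("suche", "eine", "tankstelle"): 1}
    logp = {("ich", "suche", "eine"): -1.2, ("suche", "eine", "tankstelle"): -2.3}
    print(path_score(counts, logp))  # -> -3.5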
The corrections are carried out according to a so-called cross-entropy learning rule. The behavior of this learning rule is desirable because it has been shown to minimize the number of errors, instead of minimizing the squared error as the gradient descent learning rule does; what matters here is essentially the error frequency.
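A minimal sketch of such a cross-entropy update for a single sigmoid neuron, assuming a learning rate eta; the function names are hypothetical. For cross-entropy loss the derivative of the sigmoid cancels out of the gradient, so the update is driven by the raw error (target - output) and does not vanish for saturated, confidently wrong outputs, as it does under the squared-error rule.

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def cross_entropy_update(weights, inputs, target, eta=0.01):
        # Gradient of the cross-entropy loss for a sigmoid neuron:
        # d(loss)/d(w_i) = (output - target) * x_i
        output = sigmoid(sum(w * x for w, x in zip(weights, inputs)))
        error = target - output
        return [w + eta * error * x for w, x in zip(weights, inputs)]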
The transfer of the data to a neural network 9 serving as classifier is shown in Fig. 4. One input is provided for each tri-gram of the language model 2. The input value indicates the number of occurrences of that tri-gram in the path. The presentation of a path in this way is referred to as a learning pattern. According to the evaluation of the path, these learning patterns are divided into two target values: correct ("1") and not correct ("0"). The output value of a learning pattern with the target value "1" should be greater than zero; the output value of a learning pattern with the target value "0" should be less than zero. The data originating from the same superordinate graph should be treated together as one group. The reason for this is that the comparison of the path scores has to be translated into a dynamic threshold function of the neural network classifier 9.
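The encoding of a path as a learning pattern might look like the following sketch, in which trigram_index stands for an (assumed) fixed ordering of all tri-grams of the language model; names and data layout are hypothetical.

    from collections import Counter

    def encode_learning_pattern(path_words, trigram_index, is_correct):
        # Count every tri-gram occurring in the path.
        counts = Counter(zip(path_words, path_words[1:], path_words[2:]))
        # One input per tri-gram of the language model; the input value
        # is the number of occurrences of that tri-gram in the path.
        inputs = [counts.get(trigram, 0) for trigram in trigram_index]
        # Target value: correct path -> 1, not correct -> 0.
        target = 1 if is_correct else 0
        return inputs, target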
The solution is to set a dynamic threshold in such a way that it imitates the decision as to which path score is the highest. The dynamic threshold is calculated on each pass and is different for each group (i.e. for each superordinate graph). The goal is that only the presentation of the best path causes the threshold of the activation function to be exceeded. Consequently, the activation function is active only for the best path. Setting the dynamic threshold requires the prior presentation of all learning patterns of a group. The transfer therefore proceeds as follows. First, all learning patterns are presented in order to determine the dynamic threshold. Subsequently, the calculated value is subtracted from all path scores and the new values are stored. These new values may lie both above and below the threshold. This output is compared with the target values, which indicate whether a path is correct or not. Learning is performed on those learning patterns whose output is not identical to the target value. The learning rule modifies the weights, which are later translated back into the language model. A first way to calculate the dynamic threshold is to form an average between the best path and the second-best path. However, the possibility that there is more than one correct path should be taken into account. Therefore, the dynamic threshold is calculated as the mean between the best path score of all correct sentences and the best path score of all erroneous sentences. This calculation is shown in Fig. 5.
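The threshold computation just described might be sketched as follows, assuming a group that contains at least one correct and at least one erroneous path; scores and targets are the outputs and target values of the group's learning patterns.

    def dynamic_threshold(scores, targets):
        # Mean of the best path score of all correct sentences and the
        # best path score of all erroneous sentences in one group
        # (i.e. in one superordinate graph).
        best_correct = max(s for s, t in zip(scores, targets) if t == 1)
        best_wrong = max(s for s, t in zip(scores, targets) if t == 0)
        return 0.5 * (best_correct + best_wrong)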
It is also advantageous to define two bounds so that the error can be eliminated with higher reliability. The purpose of the bounds is to force the erroneous sentences to lie not merely below the threshold but below the threshold reduced by a predefined margin. The same applies to the correct sentence: it must likewise lie above a certain bound. These bounds are determined empirically. In practice, they are set to -0.1 and +0.1, since the output lies within [-1, +1]. As can be seen in Fig. 5, the unshaded circle (correct sentence) must lie above the high bound and the hatched circles (erroneous sentences) below the low bound. This defines a confidence interval. If the circles lie within the interval, it is not certain that the error will be corrected.
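The confidence interval test might be sketched as follows; the bounds -0.1 and +0.1 are the empirical values stated above, everything else is an assumption for illustration.

    def violates_confidence_interval(shifted_score, target, low=-0.1, high=0.1):
        # shifted_score is the path score after subtraction of the
        # dynamic threshold. A correct path (target 1) must lie above
        # the high bound, an erroneous path (target 0) below the low
        # bound; patterns inside [low, high] are still trained on.
        if target == 1:
            return shifted_score <= high
        return shifted_score >= low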

Claims

1. Method (1) for adapting and/or generating statistical language models (2) for automatic speech recognition systems, wherein currently existing linguistic speech recognition models are taken into account, characterized in that:
path scores in a word hypothesis graph of the language model (2) are compared between at least two paths of the word hypothesis graph by forming distance values, wherein at least one best path in the word hypothesis graph is identified with respect to the speech recognition process, wherein at least one correct path, or a path regarded as correct, having a minimum number of word errors is marked in the word hypothesis graph, wherein the comparison of the distance values for the best and the correct paths is carried out in such a way that an adaptation of the language model (2) can be achieved which leads to a smaller number of word errors in the best path when the same speech input is entered again during the speech recognition process, and wherein, for adapting the language model, in addition to the existing linguistic language model, the acoustic evaluation of the speech signal by an HMM model is also used.
2. Method according to claim 1, characterized in that the language model is designed as an N-gram language model (2), a separate and modifiable probability being stored for each N-gram.
3. Method according to claim 2, characterized in that the probabilities of those N-grams which occur only in the correct path and not in the best path are increased, and the probabilities of those N-grams which occur only in the best path and not in the correct path are decreased.
4. Method according to one of claims 1 to 3, characterized in that a side effect with respect to word errors in other paths is determined when the probabilities are changed.
5. Method according to one of claims 1 to 4, characterized in that the distance values are determined by calculating the absolute values of the difference between logarithmic path scores.
6. Method according to one of claims 1 to 5, characterized in that, after the change of probabilities, a recalculation of the path scores and of the distance values is carried out.
7. Method according to claim 6, characterized in that a classifier is used in the calculation which, with respect to a group of paths of the word graph, decides on the basis of a threshold value whether probabilities have to be changed.
8. Method according to claim 7, characterized in that the decisions of the classifier form learning rules for a neural network (9).

9. Method according to claim 7 or 8, characterized in that a neural network (9) is used as the classifier, the weight parameters of the neural network (9) being convertible into probability values for the language model (2).
PCT/EP2005/009973 2004-10-01 2005-09-16 Method for adapting and/or creating statistical linguistic models WO2006037446A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
DE200410048348 DE102004048348B4 (en) 2004-10-01 2004-10-01 Method for adapting and/or generating statistical language models
DE102004048348.5 2004-10-01

Publications (1)

Publication Number Publication Date
WO2006037446A1 true WO2006037446A1 (en) 2006-04-13

Family

ID=35717648

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2005/009973 WO2006037446A1 (en) 2004-10-01 2005-09-16 Method for adapting and/or creating statistical linguistic models

Country Status (2)

Country Link
DE (1) DE102004048348B4 (en)
WO (1) WO2006037446A1 (en)

Also Published As

Publication number Publication date
DE102004048348A1 (en) 2006-04-13
DE102004048348B4 (en) 2006-07-13

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KM KP KR KZ LC LK LR LS LT LU LV LY MA MD MG MK MN MW MX MZ NA NG NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU LV MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase