WO2006037446A1 - Method for adapting and/or creating statistical linguistic models - Google Patents

Method for adapting and/or creating statistical linguistic models

Info

Publication number
WO2006037446A1
WO2006037446A1 PCT/EP2005/009973 EP2005009973W
Authority
WO
WIPO (PCT)
Prior art keywords
path
word
correct
speech
speech recognition
Prior art date
Application number
PCT/EP2005/009973
Other languages
German (de)
French (fr)
Inventor
Albert FABREGAT SUBIRÀ
Udo Haiber
Harald HÜNING
Original Assignee
DaimlerChrysler AG
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by DaimlerChrysler AG
Publication of WO2006037446A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/065 - Adaptation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/183 - Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/19 - Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L15/197 - Probabilistic grammars, e.g. word n-grams

Definitions

  • the invention relates to a method for adapting and/or generating statistical language models for automatic speech recognition systems.
  • the speech recognition is carried out in particular by means of statistical models.
  • acoustic models which are based on so-called HMM models (Hidden Markov Model), and linguistic language models, which represent occurrence probabilities of semantic and syntactic language elements, are used.
  • frequently, the statistical language models used for speech recognition - for estimating the probabilities of particular word sequences as speech input - are not provided with sufficient training material.
  • the training material mostly consists of a large amount of text data. Especially with regard to the above-mentioned goal of allowing freer voice input, such comprehensive training data would be urgently needed.
  • the present invention has for its object to provide a method for adapting and/or generating statistical language models of the type mentioned at the outset, which avoids the disadvantages of the prior art and in particular requires only a small amount of text data as training material.
  • Fig. 1 is an illustration of the structure of the method according to the invention.
  • Fig. 3 shows an overview of the consideration of side effects.
  • Fig. 5 is an illustration of a dynamic threshold.
  • the individual probabilities of a known statistical language model are referred to as uni-grams, bi-grams, tri-grams or N-grams, which represent the occurrence probability of a word when zero, one, two or N-1 words have already preceded it. If a speech recognition system has to decide between alternative recognized sentences, it takes into account both the language model and the acoustic scores of an HMM model. The word recognition results sometimes differ from the spoken words. These deviations are counted as word errors (substitutions, insertions and deletions) with respect to a reference transcription (correct path).
  • the speech recognition system first generates an internal superordinate word graph (jumbo graph), which has a large number of alternative sequences (paths) of word hypotheses or sentences with respective path scores on the basis of the corresponding occurrence probabilities.
  • the speech recognition system delivers as output either a specific sentence, the so-called best path, or a reduced word graph.
  • a word graph is shown by way of example in DE 198 42 151 A1 (see FIG. 3 there).
  • if any path of the superordinate graph is the correct sentence/path (i.e. the reference path) but has not been identified as the best path after applying the language model to the superordinate graph, i.e. the correct path does not appear in the output of the speech recognition system, then the language model or its probabilities should be changed in such a way that this sentence appears in the output the next time.
  • path scores in a word hypothesis graph of the language model are compared, in particular by forming distance values between at least two paths of the word hypothesis graph; at least one best path in the word hypothesis graph with respect to the speech recognition process is identified; at least one correct path, or a path counting as correct with a minimum number of word errors, is marked in the word hypothesis graph; the comparison of the distance values of the best and the correct paths is carried out in such a way that an adaptation of the language model can be achieved which leads to a smaller number of word errors in the best path if the same speech input is entered again during the speech recognition process.
  • FIG. 1 shows the coarse structure of a method 1 according to the invention for adapting and / or generating a statistical language model 2 for automatic speech recognition systems (not shown).
  • word hypothesis graphs are created from acoustic speech data 3 in a step 4 and are stored as superordinate word graphs (jumbo graphs) in internal data 5 of the speech recognition system.
  • the statistical language model 2 is used in order to get from the superordinate word graphs to an output 6 of the speech recognition system.
  • the word graphs are evaluated.
  • the path scores for each possible path of the word graphs are compared.
  • the superordinate word graphs of the internal data 5 of the speech recognition system are stored as current speech recognition results 8; an adaptation of the language model 2 is then determined from them.
  • each path is scored with the following equation (1), and only the path with the best path score is selected and output as the recognized sentence:

    path score = Σ_{i=0}^{N-1} ac_i + v · Σ_{j=0}^{N-1} p(w_j | w_{j-2}, w_{j-1}) + N · pen + N_p · pWeight    (1)

  • ac_i are the logarithmic acoustic scores of the words, and v is a global language model weight (relative to the acoustic scores)
  • p(w_j | w_{j-2}, w_{j-1}) are logarithmic tri-gram probabilities, and N is the number of words of the computed path
  • pen is a penalty value for a higher or lower number of word hypotheses per path, and N_p is the number of pauses within a path
  • pWeight is an empirically set pause weight
  • path scores must be calculated for several paths each from a plurality of word graphs. These data are stored. A comparison between path scores across several word graphs is thus made possible by calculating difference or distance values. Because of their value range, it is favorable to give the path scores a negative logarithmic format. For each sentence that is considered, a distance value is stored. These are calculated as follows:
  • if the best sentence is the correct sentence, the distance value of the best sentence results from the absolute value of the difference between the path score of the best sentence and that of the second-best sentence.
  • for the remaining sentences of the word graph, the respective distance value results from the absolute value of the difference between the path score of the respective sentence and the path score of the best sentence.
  • if the best sentence is not the correct sentence, the distance value of the best sentence results from the absolute value of the difference between its path score and that of the correct sentence. If several correct sentences have been determined, the path score closest to that of the best path is considered, because the closer a path score comes to that of the best sentence, the easier it is to make it the best. For the remaining sentences of the word graph, the respective distance value results from the absolute value of the difference between the path score of the best sentence and the path score of the respective sentence.
  • Fig. 2 shows the different cases when determining the distance values for a speech utterance with the unique designation KILW047.
  • each circle represents the path score of a path, and the distance values are shown as arrows.
  • the two hatched circles represent erroneous paths, while the unhatched ones represent correct paths.
  • the necessary data is stored together in a distance file.
  • the first line of the following Table 1 contains the (unique) name of the superordinate graph.
  • ⁇ s> and ⁇ / s> mark the beginning and the end of the respective sentence.
  • the goal now is to make the path score of the best sentence less than that of the reference sentence.
  • the speech recognition system should select the correct sentence as the best one due to the changes to be made.
  • the probabilities of those N-grams are increased, which occur only in the correct path and not in the best path, and the probabilities of those N-grams are reduced which occur only in the best path and not in the correct path.
  • the distance should, so to speak, be distributed between the tri-grams that caused the error.
  • the path scores of the correct sentences are increased or those of the incorrect ones are reduced.
  • two tri-grams are involved in the error, so they can be increased for the correction. It is possible to increase the bi-gram "<s> Zeig", the tri-gram "<s> Zeig mir", or even all of them. The last option is the most convenient, since it requires only small changes to the tri-grams to reduce the difference between the path scores, so that other sentences are affected less.
  • another possibility is to reduce the tri-grams of the best sentence. In the present case, a combination of increasing and reducing the tri-grams is used.
  • the distance is distributed among all possible tri-grams in order to reduce it to zero.
  • this should correct the error, provided of course that no other sentences are affected by side effects.
  • such errors can be prevented by analyzing the stored data. This is achieved by defining constraints that determine when N-grams may be changed. Suppose a tri-gram is to be increased. The tri-gram is then searched for among all sentences in the distance file. Four different situations can occur per sentence found, depending on the respective stored flags:
  • BE: if the sentence containing the tri-gram to be increased is the best sentence of a superordinate graph but contains an error, the tri-gram cannot be increased, because otherwise the path score of the erroneous path would also be increased. This would make the correction more difficult. Nevertheless, there is an exception: if the reference sentence of the superordinate graph also contains the tri-gram, it is increased as desired. This preserves the distance between the best and the correct sentence.
  • BC: in such a case, the tri-gram is increased because the sentence is correct. If the path score of the sentence is improved, misrecognitions are reduced.
  • SC: in this case too, the tri-gram is increased, even though it is not the recognized sentence. Incidentally, recognizing the correct sentence is made easier when its path score is increased.
  • BC: in this case, the tri-gram can be reduced as long as the path score of the best sentence remains higher than that of the second-best sentence. In other words, the reduction of the tri-gram must not cause a degradation of the path score that is greater than the distance value.
  • SC: if the tri-gram was found in a sentence that is not the best sentence but is the correct sentence, it must not be reduced.
  • Fig. 3 shows an overview of the consideration of the side effects when changes to the tri-grams are to be carried out.
  • if a tri-gram was not found in the language model, this corresponds to a so-called back-off case.
  • a new tri-gram can be introduced as a normal tri-gram into the optimized language model, or the change can be distributed among the values that serve to calculate the back-off probability (usually a lower-order N-gram probability and a back-off weight).
  • the embodiment of the method according to the invention outlined below iteratively recalculates all path evaluations (and distance values), which advantageously leads to an improvement in the treatment of side effects.
  • the core idea is to use a classifier to reproduce a comparison of path scores for different paths of the parent graph (see Fig. 4).
  • the parameters of the classifier should be convertible into probabilities of the language model. Different classifier architectures have in common that they require many numerical values as inputs and have some sort of threshold function to provide an output, such as "0" or "1".
  • the probabilities of the language model 2 involved in the error are changed according to a learning rule. If there was no error, the language model is not changed. This procedure is carried out for each of the superordinate graphs (jumbo graphs). Subsequently, iterative processing takes place; in other words, the procedure is carried out several times on the superordinate graphs. With a suitable choice of the learning rule of the classifier, the number of errors keeps decreasing as long as the method is applied.
  • the corrections are carried out according to a so-called cross-entropy learning rule. The behavior of this learning rule is desirable because it has been shown to minimize the number of errors instead of minimizing the squared error like the gradient descent learning rule, for what matters here is essentially the error frequency.
  • the transfer of the data to a neural network 9 as a classifier is shown in FIG. 4.
  • one input is provided for each tri-gram of the language model 2.
  • the input value indicates how often the tri-gram occurs in the respective path.
  • the transfer of a path in this way is referred to as a learning pattern.
  • these learning patterns are divided into two target values: correct ("1") and not correct ("0").
  • the output value of the learning pattern with the target value "1” should be greater than zero.
  • the output value of the learning patterns with the target value "0” should be less than zero.
  • the data originating from the same superordinate graph should be treated together as a group. The reason for this is that the comparison of the path scores must be translated into a dynamic threshold function of the neural network classifier 9.
  • the solution is to set a dynamic threshold in such a way that it imitates the decision as to which is the highest path score.
  • the dynamic threshold is calculated on each pass and is different for each group (i.e. for each superordinate graph). The goal is that only the presentation of the best path causes the threshold of the activation function to be exceeded. Consequently, the activation function is active only for the best path.
  • Setting the dynamic threshold requires prior input of all the learning patterns of a group. Thus, the transfer takes place as follows. First, the transfer of all learning patterns takes place in order to determine the dynamic threshold value. Subsequently, the calculated value is subtracted from all path evaluations and the new values are stored. These new values may be both above and below the threshold. This output is compared with the target values which indicate whether a path is correct or not.
  • the learning is performed on those learning patterns whose output is not identical to the target value.
  • the learning rule modifies the weights, which are later translated back into the language model.
  • a first way to calculate the dynamic threshold is to form the mean between the best path and the second-best path.
  • however, since there may be more than one correct path, the dynamic threshold is calculated as the mean between the best path score of all correct sentences and the best path score of all erroneous sentences. This calculation is shown in Fig. 5.
  • the purpose of the bounds is to force the erroneous sentences to lie not merely below the threshold, but below the threshold minus a predetermined bound. The same applies to the correct sentence: it must lie above the threshold plus a certain bound. These bounds are determined empirically. In practice, they are set to -0.1 and +0.1, since the output moves within [-1, 1]. As can be seen in Fig. 5, the unhatched circle (correct sentence) must lie above the upper bound, and the hatched circles (erroneous sentences) must lie below the lower bound. This defines a confidence interval. If circles lie within the interval, it is not certain that the error will be corrected.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method for adapting and/or creating statistical linguistic models for automatic speech recognition systems. Said method takes into account current speech recognition results and specific acoustic conditions.

Description

Method for adapting and/or generating statistical language models
The invention relates to a method for adapting and/or generating statistical language models for automatic speech recognition systems.
Known automatic speech recognition systems are used in a wide variety of environments. For example, they are used as parts of dialog systems in motor vehicles for controlling information or entertainment systems (navigation system, telephone, radio or the like) via voice inputs. Today, speech recognition is still largely restricted to certain predefined commands, but in the future freer speaking by the users or drivers is the aim.
Speech recognition is carried out in particular by means of statistical models. Both acoustic models, which are based on so-called HMM models (Hidden Markov Models), and linguistic language models, which represent occurrence probabilities of semantic and syntactic language elements, are used.
Frequently, especially in dialog systems, there is the problem that not enough training material is available for the training of the statistical language models used for speech recognition - for estimating the probabilities of particular word sequences as speech input - which must be carried out before the system is put into operation. The training material mostly consists of a large amount of text data. Especially with regard to the above-mentioned goal of allowing freer voice input, such comprehensive training data would be urgently needed.
In addition, there is often the problem that speech recordings under realistic conditions are very cost-intensive.
Usually, language models are not adapted to specific acoustic situations. As described above, separate models are used for this, which additionally complicates free speech input.
DE 198 42 151 A1 discloses a method for adapting linguistic language models in systems with automatic speech recognition.
Also known from the prior art is so-called "discriminative training", in which the language model and the acoustic model are trained together. However, this requires large amounts of acoustic training data, which must lie in the domain of the corresponding language model and are likewise often not available.
The present invention is based on the object of providing a method for adapting and/or generating statistical language models of the type mentioned at the outset which avoids the disadvantages of the prior art and, in particular, manages with a small amount of text data as training material.
This object is achieved according to the invention by claim 1.
These measures advantageously make it possible to generate or adapt probabilities for statistical language models without large amounts of text data having to be available as training material. At the same time, specific acoustic conditions are taken into account during the adaptation or generation. The method manages with a small amount of data, which can be extracted from existing current speech recognition results. The language model is thus easy to adapt. In addition, provided acoustic data are available that cover the entire target domain of the language model, it is in particular possible to build up an entirely new statistical language model.
Advantageous embodiments and further developments of the invention emerge from the subclaims. Exemplary embodiments of the invention are described in principle below with reference to the drawing.
In the drawing:
Fig. 1 shows the structure of the method according to the invention;
Fig. 2 shows the distance values of a speech utterance;
Fig. 3 shows an overview of the consideration of side effects;
Fig. 4 shows a simplified representation of a neural network; and
Fig. 5 shows a dynamic threshold value.
The individual probabilities of a known statistical language model are referred to as uni-grams, bi-grams, tri-grams or N-grams, which represent the occurrence probability of a word when zero, one, two or N-1 words have already preceded it. If a speech recognition system has to decide between alternative recognized sentences, it takes into account both the language model and the acoustic scores of an HMM model. The word recognition results sometimes differ from the spoken words. These deviations are counted as word errors (substitutions, insertions and deletions) with respect to a reference transcription (correct path). The speech recognition system first generates an internal superordinate word graph (jumbo graph), which contains a large number of alternative sequences (paths) of word hypotheses, i.e. sentences, with respective path scores based on the corresponding occurrence probabilities. After the language model has been applied to this superordinate word graph, the speech recognition system delivers as output either a specific sentence, the so-called best path, or a reduced word graph. Such a word graph is shown by way of example in DE 198 42 151 A1 (see Fig. 3 there).
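As a rough illustration of these N-gram probabilities, the following minimal sketch estimates tri-gram scores from tokenized sentences by simple maximum likelihood. The patent does not say how the initial language model is trained; the training scheme, the negative-log convention and all names are assumptions made here for illustration.

```python
from collections import defaultdict
import math

def train_trigram_scores(sentences):
    """Estimate tri-gram scores -log p(w | u, v) by maximum likelihood.
    Negative logarithms match the path score format used in the text."""
    tri_counts = defaultdict(int)   # counts of (u, v, w)
    ctx_counts = defaultdict(int)   # counts of the context (u, v)
    for words in sentences:
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        for u, v, w in zip(padded, padded[1:], padded[2:]):
            tri_counts[(u, v, w)] += 1
            ctx_counts[(u, v)] += 1
    return {(u, v, w): -math.log(c / ctx_counts[(u, v)])
            for (u, v, w), c in tri_counts.items()}

# Tiny example with one sentence from Table 1 below.
scores = train_trigram_scores([["neues", "Ziel", "eingeben"]])
```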
It is now desirable that, if any path of the superordinate graph is the correct sentence/path (i.e. the reference path) but has not been identified as the best path after applying the language model to the superordinate graph, i.e. the correct path does not appear in the output of the speech recognition system, the language model or its probabilities should be changed in such a way that this sentence appears in the output the next time.
Accordingly, the following method steps are proposed as an embodiment of the invention:
Path scores in a word hypothesis graph of the language model are compared, in particular by forming distance values between at least two paths of the word hypothesis graph; at least one best path in the word hypothesis graph with respect to the speech recognition process is identified; at least one correct path, or a path counting as correct with a minimum number of word errors, is marked in the word hypothesis graph; the comparison of the distance values of the best and the correct paths is carried out in such a way that an adaptation of the language model can be achieved which leads to a smaller number of word errors in the best path if the same speech input is entered again during the speech recognition process.
Limits arise in this respect from so-called side effects of the changes. If a recognized sentence is changed, this can cause errors in other sentences. It is accordingly advantageous that a side effect with regard to word errors in other paths is determined when the probabilities are to be changed.
Fig. 1 shows the coarse structure of a method 1 according to the invention for adapting and/or generating a statistical language model 2 for automatic speech recognition systems (not shown). For this purpose, word hypothesis graphs are created from acoustic speech data 3 in a step 4 and stored as superordinate word graphs (jumbo graphs) in internal data 5 of the speech recognition system. The statistical language model 2 is applied in order to get from the superordinate word graphs to an output 6 of the speech recognition system. In a step 7, the word graphs are scored. In order to output the best sentence, the path scores for every possible path of the word graphs are compared. The superordinate word graphs of the internal data 5 of the speech recognition system are stored as current speech recognition results 8; an adaptation of the language model 2 is then determined from them. The language model 2 is applied in the scoring of the superordinate word graphs on the basis of a comparison of alternative paths of the word graph (Figure 1: step 7). Each path is scored with the following equation (1), and only the path with the best path score is selected and output as the recognized sentence.
path score = Σ_{i=0}^{N-1} ac_i + v · Σ_{j=0}^{N-1} p(w_j | w_{j-2}, w_{j-1}) + N · pen + N_p · pWeight    (1)

where:

ac_i are the logarithmic acoustic scores of the words,
v is a global language model weight (relative to the acoustic scores),
p(w_j | w_{j-2}, w_{j-1}) are logarithmic tri-gram probabilities,
N is the number of words of the computed path,
pen is a penalty value for a higher or lower number of word hypotheses per path,
N_p is the number of pauses within a path, and
pWeight is an empirically set pause weight.
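A minimal sketch of equation (1) as reconstructed above: negative-log acoustic and tri-gram scores are summed, the language model sum is weighted by v, and the word-count and pause penalties are added. The function and parameter names are illustrative; with negative-log scores the best path is the one with the smallest value.

```python
def path_score(acoustic_scores, words, trigram_score,
               v=1.0, pen=0.0, p_weight=0.0):
    """Equation (1): acoustic scores plus v times the tri-gram scores
    plus a word-count penalty and a pause penalty, all in negative-log
    format (smaller totals are better)."""
    n = len(words)
    padded = ["<s>", "<s>"] + list(words)
    lm = sum(trigram_score(padded[j], padded[j + 1], padded[j + 2])
             for j in range(n))
    # Which tokens count as pauses is an assumption; here only #PAUSE#.
    n_pauses = sum(1 for w in words if w == "#PAUSE#")
    return sum(acoustic_scores) + v * lm + n * pen + n_pauses * p_weight
```

With the `scores` dictionary from the earlier sketch, a call could look like `path_score(ac, words, lambda u, v, w: scores.get((u, v, w), 10.0))`, where 10.0 is a placeholder score for unseen tri-grams.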
According to the invention, path scores must be calculated for several paths each from several word graphs. These data are stored. A comparison of path scores across several word graphs is thus made possible by calculating difference or distance values. Because of their value range, it is favorable to give the path scores a negative logarithmic format. For each sentence that is considered, a distance value is stored. These are calculated as follows (a small code sketch follows the two rules):
1. If the best sentence is the correct sentence, the distance value of the best sentence results from the absolute value of the difference between the path score of the best sentence and that of the second-best sentence. For the remaining sentences of the word graph, the respective distance value results from the absolute value of the difference between the path score of the respective sentence and the path score of the best sentence.
2. If the best sentence is not the correct sentence, the distance value of the best sentence results from the absolute value of the difference between its path score and that of the correct sentence. If several correct sentences have been determined, the path score that comes closest to that of the best path is considered, because the closer a path score comes to that of the best sentence, the easier it is to make it the best. For the remaining sentences of the word graph, the respective distance value results from the absolute value of the difference between the path score of the best sentence and the path score of the respective sentence.
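The two rules could be implemented as follows; this sketch follows the wording above and assumes negative-log scores (smaller is better), sentence records shaped like the distance-file lines of Table 1 below, and at least one correct path per graph.

```python
def add_distance_values(entries):
    """Attach a 'distance' to every sentence record of one jumbo graph.
    Each record needs 'score' (negative-log path score) and 'correct'
    (True for the =C= flag)."""
    best = min(entries, key=lambda e: e["score"])
    others = [e for e in entries if e is not best]
    if best["correct"]:
        # Rule 1: distance of the best sentence to the second-best one.
        reference = min(others, key=lambda e: e["score"])
    else:
        # Rule 2: distance to the correct sentence whose path score
        # comes closest to that of the best path.
        correct = [e for e in others if e["correct"]]
        reference = min(correct, key=lambda e: abs(e["score"] - best["score"]))
    best["distance"] = abs(best["score"] - reference["score"])
    for e in others:
        e["distance"] = abs(e["score"] - best["score"])
    return entries
```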
Fig. 2 shows the different cases in the determination of the distance values for a speech utterance with the unique designation KILW047. Each circle represents the path score of a path, and the distance values are shown as arrows. The two hatched circles represent erroneous paths, while the unhatched ones represent correct paths.
Advantageously, the necessary data are stored together in a distance file. The first line of Table 1 below contains the (unique) name of the superordinate graph. Below it, the generated sentences are stored as follows: distance value, path score, name of the superordinate graph, a first flag (=B= or =S=) indicating whether it is the best sentence (=B=) or not (=S=), a second flag (=C= or =E=) indicating whether it is the reference sentence, i.e. the correct sentence (=C=), or whether it contains an error (=E=), and finally the associated word sequence. <s> and </s> mark the beginning and the end of the respective sentence.
Table 1:

KILW047
173.704 744.355 KILW047 =B= =C= <s> #PAUSE# neues Ziel eingeben #PAUSE# </s>;
75.241 819.596 KILW047 =S= =C= <s> #NOISE# neues Ziel eingeben #PAUSE# </s>;
449.679 1194.034 KILW047 =S= =E= <s> #PAUSE# <zahl> ist <hotel> mir eingeben #PAUSE# </s>;
173.704 918.059 KILW047 =S= =E= <s> #PAUSE# neues <zahl> eingeben #PAUSE# </s>;
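A parser for this file might look as follows; a sketch assuming exactly the layout of Table 1, i.e. one name line per jumbo graph followed by semicolon-terminated sentence lines.

```python
def parse_distance_file(path):
    """Read a distance file into {graph name: [sentence records]}.
    Field order per sentence line: distance, path score, graph name,
    =B=/=S= flag, =C=/=E= flag, word sequence (with <s> ... </s>)."""
    graphs, current = {}, None
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip().rstrip(";")
            if not line:
                continue
            fields = line.split()
            if len(fields) == 1:                 # header line: graph name
                current = graphs.setdefault(fields[0], [])
            else:
                current.append({
                    "distance": float(fields[0]),
                    "score": float(fields[1]),
                    "graph": fields[2],
                    "best": fields[3] == "=B=",
                    "correct": fields[4] == "=C=",
                    "words": fields[5:],
                })
    return graphs
```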
A calculation can now be carried out with regard to the changes to the probabilities of the language model according to equation (1) that are necessary in order to correct the errors. Consider the following example:
Correct path: "<s> Zeig mir die letzte Nummer noch einmal an </s>"
Best path: "<s> Fahrzeug mir die letzte Nummer noch einmal an </s>"
The goal is now to make the path score of the best sentence lower than that of the reference sentence. In other words, as a result of the changes to be made, the speech recognition system should select the correct sentence as the best one. The probabilities of those N-grams which occur only in the correct path and not in the best path are increased, and the probabilities of those N-grams which occur only in the best path and not in the correct path are reduced.
In the present case, the distance should be distributed, so to speak, among the tri-grams that caused the error. There are two possibilities for this: either the path scores of the correct sentences are increased or those of the erroneous ones are reduced. In the first case, two tri-grams are involved in the error, so these can be increased for the correction. It is possible to increase the bi-gram "<s> Zeig", the tri-gram "<s> Zeig mir", or even all of them. The last option is the most convenient, since it requires only small changes to the tri-grams to reduce the difference between the path scores, so that other sentences are influenced less. A further possibility is to reduce the tri-grams of the best sentence. In the present case, a combination of increasing and reducing the tri-grams is used. The distance is distributed among all possible tri-grams in order to reduce it to zero. This should correct the error, provided of course that no other sentences are affected by side effects. Such errors can be prevented by analyzing the stored data. This is achieved by defining constraints which determine when N-grams may be changed; a possible implementation of these checks is sketched after the two case lists below. Suppose a tri-gram is to be increased. The tri-gram is then searched for among all sentences in the distance file. Four different situations can occur per sentence found, depending on the respective stored flags:
1. BE: if the sentence containing the tri-gram to be increased is the best sentence of a superordinate graph but contains an error, the tri-gram cannot be increased, because otherwise the path score of the erroneous path would also be increased. This would make the correction more difficult. Nevertheless, there is an exception: if the reference sentence of the superordinate graph also contains the tri-gram, it is increased as desired. The distance between the best and the correct sentence is thereby preserved.
2. SE: in this case, the sentence containing the tri-gram is neither the recognized one nor the correct one. Consequently, the tri-gram can be increased, but by no more than the distance to the path score of the best sentence. Otherwise this sentence would become the best sentence, which would lead to a further error. Here too, however, there is an exception: if the correct sentence of the superordinate graph contains the same tri-gram, it is increased as desired.
3. BC: in such a case, the tri-gram is increased, since the sentence is correct. If the path score of the sentence is improved, misrecognitions are reduced.
4. SC: in this case too, the tri-gram is increased, even though it is not the recognized sentence. Incidentally, recognizing the correct sentence is made easier when its path score is increased.
So far, only the increase of a tri-gram of a correct sentence has been considered. However, it is also possible to reduce the tri-grams of the best sentence if it contains an error. In the above example, the tri-grams "<s> Fahrzeug mir" and "Fahrzeug mir die" are involved in the error. Similar restrictions are used for the reduction; only the conditions for a change differ.
1. BE: if the tri-gram is reduced, the path score of the best sentence, not that of the correct sentence, is degraded. Therefore there is no restriction here.
2. SE: here, too, there is no restriction on the reduction.
3. BC: in this case, the tri-gram can be reduced as long as the path score of the best sentence remains higher than that of the second-best sentence. In other words, the reduction of the tri-gram must not cause a degradation of the path score that is greater than the distance value.
4. SC: if the tri-gram was found in a sentence that is not the best sentence but is the correct sentence, it must not be reduced.
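One possible reading of the eight cases is sketched below: every sentence in the distance file caps the amount by which a tri-gram score may be moved, and the overall permissible change is the minimum of these caps. The record layout follows the parsing sketch above; the patent defines the cases only verbally, so the exact arithmetic is an assumption.

```python
def ngrams(words, n=3):
    """All n-grams of a word sequence, as a set of tuples."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def increase_cap(trigram, sent, graph_sents, wanted):
    """Permissible increase given one sentence (cases BE, SE, BC, SC)."""
    if trigram not in ngrams(sent["words"]):
        return wanted
    in_reference = any(trigram in ngrams(s["words"])
                       for s in graph_sents if s["correct"])
    if not sent["correct"]:
        if sent["best"]:                             # BE: forbidden ...
            return wanted if in_reference else 0.0   # ... unless in reference
        if not in_reference:                         # SE: at most the distance
            return min(wanted, sent["distance"])
    return wanted                                    # BC and SC: allowed

def decrease_cap(trigram, sent, wanted):
    """Permissible decrease given one sentence."""
    if trigram not in ngrams(sent["words"]):
        return wanted
    if sent["correct"]:
        if not sent["best"]:                         # SC: must not be reduced
            return 0.0
        return min(wanted, sent["distance"])         # BC: at most the distance
    return wanted                                    # BE and SE: no restriction
```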
Fig. 3 shows an overview of the consideration of the side effects when changes are to be carried out on the tri-grams.
It is often desirable to prefer sentences that are more relevant for the user. In other words, it is less problematic to correct sentences with higher priority, even if sentences with lower priority are affected by side effects of this. For example, sentences containing critical words that are elementarily important for a subsequent dialog or the like can be assigned a higher priority. This embodiment is realized as follows: first, a list of prioritized sentences must be provided. Then the kind of preference is to be defined by means of the constraints. For sentences without priority, the method works according to the above-mentioned constraints of Fig. 3. If, however, a sentence with priority contains an error, the method is modified. In the case of a side effect, two possibilities can occur. If the error was produced in a sentence with priority, the method works as before. The change can nevertheless be carried out if the constraint concerns a sentence without priority. These measures can, however, increase the overall error rate, since the sentences without priority are degraded.
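Combined with the caps above, a priority-aware check might waive constraints stemming from non-priority sentences whenever the error being corrected lies in a prioritized sentence. This is one hypothetical realization; `increase_cap` is the helper from the previous sketch and `priority` a set of prioritized word sequences.

```python
def allowed_increase(trigram, wanted, graphs, priority, fixing_priority_error):
    """Cap a proposed tri-gram increase over all stored sentences; caps
    from non-priority sentences are ignored while a priority error is
    being corrected (at the cost of a possibly higher overall error
    rate, as noted in the text)."""
    cap = wanted
    for sents in graphs.values():
        for sent in sents:
            limit = increase_cap(trigram, sent, sents, wanted)
            if limit >= cap:
                continue
            if fixing_priority_error and tuple(sent["words"]) not in priority:
                continue    # side effect on a non-priority sentence is accepted
            cap = limit
    return cap
```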
Although the present exemplary embodiment concerns only tri-grams, a corresponding application to other N-grams is possible analogously. If a tri-gram was not found in the language model, this corresponds to a so-called back-off case. In that case, a new tri-gram can be introduced as a normal tri-gram into the optimized language model, or the change can be distributed among the values that serve to calculate the back-off probability (usually a lower-order N-gram probability and a back-off weight).
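The back-off case can be pictured as follows: if a tri-gram is absent, its score is composed of a back-off weight for the context and the lower-order score; the addition reflects the logarithmic domain. The storage layout sketched here is an assumption, not the patent's own data structure.

```python
from dataclasses import dataclass, field

@dataclass
class BackoffModel:
    unigrams: dict = field(default_factory=dict)   # (w,)      -> score
    bigrams: dict = field(default_factory=dict)    # (v, w)    -> score
    trigrams: dict = field(default_factory=dict)   # (u, v, w) -> score
    backoff: dict = field(default_factory=dict)    # context   -> back-off weight
    unk: float = 10.0                              # score for unknown words

def backoff_trigram_score(m, u, v, w):
    """Tri-gram score with back-off to the bi-gram and uni-gram level;
    in the log domain the back-off weight is simply added."""
    if (u, v, w) in m.trigrams:
        return m.trigrams[(u, v, w)]
    if (v, w) in m.bigrams:
        return m.backoff.get((u, v), 0.0) + m.bigrams[(v, w)]
    return (m.backoff.get((u, v), 0.0) + m.backoff.get((v,), 0.0)
            + m.unigrams.get((w,), m.unk))
```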
Up to now, the path scores are not recalculated after a change of a language model probability. The control of the side effects is therefore incomplete when several probabilities are changed at once. In contrast to this, the embodiment of the method according to the invention outlined below recalculates all path scores (and distance values) iteratively, which advantageously leads to an improvement in the treatment of side effects. The core idea is to use a classifier to reproduce the comparison of path scores for different paths of the superordinate graph (see Fig. 4). The parameters of the classifier should be convertible into probabilities of the language model. Different classifier architectures have in common that they take many numerical values as inputs and have some kind of threshold function in order to deliver an output such as "0" or "1". Furthermore, there exist learning classifiers which adapt some of their parameters in response to presented input/output pairs together with a learning signal. Such a learning classifier is used here.
It is based on the observation that the path evaluation calculation in the logarithmic domain corresponds to a weighted sum which is common to many classifiers (as part of a so-called neuron function). In addition to the presentation of the path evaluation formula as a classifier (Fig. 4), attention must be paid to the representation of the data and how a dynamic threshold value is applied. The following conditions must be translated into input / output pairs of a classifier. The sentence with the best path score corresponds to the output of the speech recognition system. If the detected sentence is not the reference sentence and an error occurred, the probabilities of the language model 2 involved in the error are changed according to a learning rule. If there was no error, the language model is not changed. This procedure is carried out for each of the superordinate graphs (jumbo graphs). Subsequently, an iterative processing takes place. In other words, the process is carried out several times to the higher-level graph. With a suitable choice of the learning rule of the classifier, the number of errors always decreases as long as the method is used.
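The correspondence between a log-domain path score and the weighted sum of a neuron can be made concrete with a small sketch; the representation (each tri-gram of the path contributes its occurrence count times its log-probability) and the example values are illustrative assumptions, not the reference implementation.

    def path_score(trigram_counts, log_probs):
        # In the logarithmic domain the path score is a weighted sum:
        # the tri-gram log-probabilities act as weights, the tri-gram
        # counts of the path as inputs -- the core of a neuron function.
        return sum(count * log_probs[trigram]
                   for trigram, count in trigram_counts.items())

    counts = {("ich", "suche", "eine"): 1, ("suche", "eine", "tankstelle"): 1}
    logp = {("ich", "suche", "eine"): -1.2, ("suche", "eine", "tankstelle"): -2.3}
    print(path_score(counts, logp))  # -> -3.5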
The corrections are carried out according to a so-called cross-entropy learning rule. The behavior of this learning rule is desirable because it has been shown to minimize the number of errors, instead of minimizing the squared error as the gradient descent learning rule does; what matters here is essentially the error frequency.
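A minimal sketch of such a cross-entropy update for a single sigmoid neuron, assuming a learning rate eta; the function names are hypothetical. For cross-entropy loss the derivative of the sigmoid cancels out of the gradient, so the update is driven by the raw error (target - output) and does not vanish for saturated, confidently wrong outputs, as it does under the squared-error rule.

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def cross_entropy_update(weights, inputs, target, eta=0.01):
        # Gradient of the cross-entropy loss for a sigmoid neuron:
        # d(loss)/d(w_i) = (output - target) * x_i
        output = sigmoid(sum(w * x for w, x in zip(weights, inputs)))
        error = target - output
        return [w + eta * error * x for w, x in zip(weights, inputs)]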
The transfer of the data to a neural network 9 serving as classifier is shown in Fig. 4. One input is provided for each tri-gram of the language model 2. The input value indicates the number of occurrences of that tri-gram in the path. The presentation of a path in this way is referred to as a learning pattern. According to the evaluation of the path, these learning patterns are divided into two target values: correct ("1") and not correct ("0"). The output value of a learning pattern with the target value "1" should be greater than zero; the output value of a learning pattern with the target value "0" should be less than zero. The data originating from the same superordinate graph should be treated together as one group. The reason for this is that the comparison of the path scores has to be translated into a dynamic threshold function of the neural network classifier 9.
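The encoding of a path as a learning pattern might look like the following sketch, in which trigram_index stands for an (assumed) fixed ordering of all tri-grams of the language model; names and data layout are hypothetical.

    from collections import Counter

    def encode_learning_pattern(path_words, trigram_index, is_correct):
        # Count every tri-gram occurring in the path.
        counts = Counter(zip(path_words, path_words[1:], path_words[2:]))
        # One input per tri-gram of the language model; the input value
        # is the number of occurrences of that tri-gram in the path.
        inputs = [counts.get(trigram, 0) for trigram in trigram_index]
        # Target value: correct path -> 1, not correct -> 0.
        target = 1 if is_correct else 0
        return inputs, target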
The solution is to set a dynamic threshold in such a way that it imitates the decision as to which path score is the highest. The dynamic threshold is calculated on each pass and is different for each group (i.e. for each superordinate graph). The goal is that only the presentation of the best path causes the threshold of the activation function to be exceeded. Consequently, the activation function is active only for the best path. Setting the dynamic threshold requires the prior presentation of all learning patterns of a group. The transfer therefore proceeds as follows. First, all learning patterns are presented in order to determine the dynamic threshold. Subsequently, the calculated value is subtracted from all path scores and the new values are stored. These new values may lie both above and below the threshold. This output is compared with the target values, which indicate whether a path is correct or not. Learning is performed on those learning patterns whose output is not identical to the target value. The learning rule modifies the weights, which are later translated back into the language model. A first way to calculate the dynamic threshold is to form an average between the best path and the second-best path. However, the possibility that there is more than one correct path should be taken into account. Therefore, the dynamic threshold is calculated as the mean between the best path score of all correct sentences and the best path score of all erroneous sentences. This calculation is shown in Fig. 5.
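The threshold computation just described might be sketched as follows, assuming a group that contains at least one correct and at least one erroneous path; scores and targets are the outputs and target values of the group's learning patterns.

    def dynamic_threshold(scores, targets):
        # Mean of the best path score of all correct sentences and the
        # best path score of all erroneous sentences in one group
        # (i.e. in one superordinate graph).
        best_correct = max(s for s, t in zip(scores, targets) if t == 1)
        best_wrong = max(s for s, t in zip(scores, targets) if t == 0)
        return 0.5 * (best_correct + best_wrong)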
It is also advantageous to define two bounds so that the error can be eliminated with higher reliability. The purpose of the bounds is to force the erroneous sentences to lie not merely below the threshold but below the threshold reduced by a predefined margin. The same applies to the correct sentence: it must likewise lie above a certain bound. These bounds are determined empirically. In practice, they are set to -0.1 and +0.1, since the output lies within [-1, +1]. As can be seen in Fig. 5, the unshaded circle (correct sentence) must lie above the high bound and the hatched circles (erroneous sentences) below the low bound. This defines a confidence interval. If the circles lie within the interval, it is not certain that the error will be corrected.
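The confidence interval test might be sketched as follows; the bounds -0.1 and +0.1 are the empirical values stated above, everything else is an assumption for illustration.

    def violates_confidence_interval(shifted_score, target, low=-0.1, high=0.1):
        # shifted_score is the path score after subtraction of the
        # dynamic threshold. A correct path (target 1) must lie above
        # the high bound, an erroneous path (target 0) below the low
        # bound; patterns inside [low, high] are still trained on.
        if target == 1:
            return shifted_score <= high
        return shifted_score >= low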

Claims

1. Method (1) for adapting and/or generating statistical language models (2) for automatic speech recognition systems, wherein currently existing linguistic speech recognition models are taken into account, characterized in that:
path scores in a word hypothesis graph of the language model (2) are compared between at least two paths of the word hypothesis graph by forming distance values, wherein at least one best path in the word hypothesis graph is identified with respect to the speech recognition process, wherein at least one correct path, or a path regarded as correct, having a minimum number of word errors is marked in the word hypothesis graph, wherein the comparison of the distance values for the best and the correct paths is carried out in such a way that an adaptation of the language model (2) can be achieved which leads to a smaller number of word errors in the best path when the same speech input is entered again during the speech recognition process, and wherein, for adapting the language model, in addition to the existing linguistic language model, the acoustic evaluation of the speech signal by an HMM model is also used.
2. Method according to claim 1, characterized in that the language model is designed as an N-gram language model (2), a separate and modifiable probability being stored for each N-gram.
3. Method according to claim 2, characterized in that the probabilities of those N-grams which occur only in the correct path and not in the best path are increased, and the probabilities of those N-grams which occur only in the best path and not in the correct path are decreased.
4. Method according to one of claims 1 to 3, characterized in that a side effect with respect to word errors in other paths is determined when the probabilities are changed.
5. Method according to one of claims 1 to 4, characterized in that the distance values are determined by calculating the absolute values of the difference between logarithmic path scores.
6. Method according to one of claims 1 to 5, characterized in that, after the change of probabilities, a recalculation of the path scores and of the distance values is carried out.
7. Method according to claim 6, characterized in that a classifier is used in the calculation which, with respect to a group of paths of the word graph, decides on the basis of a threshold value whether probabilities have to be changed.
8. Method according to claim 7, characterized in that the decisions of the classifier form learning rules for a neural network (9).

9. Method according to claim 7 or 8, characterized in that a neural network (9) is used as the classifier, the weight parameters of the neural network (9) being convertible into probability values for the language model (2).
PCT/EP2005/009973 2004-10-01 2005-09-16 Method for adapting and/or creating statistical linguistic models WO2006037446A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
DE200410048348 DE102004048348B4 (en) 2004-10-01 2004-10-01 Method for adapting and/or generating statistical language models
DE102004048348.5 2004-10-01

Publications (1)

Publication Number Publication Date
WO2006037446A1 true WO2006037446A1 (en) 2006-04-13

Family

ID=35717648

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2005/009973 WO2006037446A1 (en) 2004-10-01 2005-09-16 Method for adapting and/or creating statistical linguistic models

Country Status (2)

Country Link
DE (1) DE102004048348B4 (en)
WO (1) WO2006037446A1 (en)

Also Published As

Publication number Publication date
DE102004048348A1 (en) 2006-04-13
DE102004048348B4 (en) 2006-07-13

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KM KP KR KZ LC LK LR LS LT LU LV LY MA MD MG MK MN MW MX MZ NA NG NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU LV MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase