DE4300159C2

DE4300159C2 - Procedure for the mutual mapping of feature spaces

Info

Publication number: DE4300159C2
Application number: DE19934300159
Authority: DE
Inventors: Lars Dipl Ing Knohl; Ansgar Dipl Ing Rinscheid
Original assignee: Individual
Current assignee: Individual
Priority date: 1993-01-07
Filing date: 1993-01-07
Publication date: 1995-04-27
Anticipated expiration: 2013-01-08
Also published as: DE4300159A1

Description

Currently known speech recognition systems consist essentially of a preprocessing stage, which transforms a speaker's utterance, available as a time signal (sound pressure over time), into the respective feature space, followed by a second stage, which compares this utterance (explicitly or implicitly) with the reference patterns obtained in a preceding training phase and arrives at one or more recognition hypotheses. The kind and amount of information that is extracted for a given segment of the time signal and combined into a vector is called a feature set. A feature set can consist, for example, of the coefficients of an LPC analysis; cf. Flanagan, J. L. (1972): "Speech Analysis Synthesis and Perception", Springer-Verlag, Berlin, p. 390 ff.

Every possible realization of a feature extraction is called a feature vector, and the totality of all possible feature vectors is called the feature space. Depending on individual anatomical and habitual characteristics, each speaker covers a subregion of this feature space, the speaker's individual feature space, which coincides more or less exactly with the feature spaces of other speakers. Likewise, each utterance of a speaker, i.e. its realization as a sequence of feature vectors, describes a trajectory through the speaker's feature space; the totality of all possible pronunciation variants of a speaker forms the speaker-specific feature space of an utterance, which in turn coincides only more or less exactly with the feature spaces of the same utterance spoken by other speakers. The generally only partial agreement between the feature trajectories of different speakers is the fundamental problem for the achievable speaker independence of a speech recognition system, which, alongside the average recognition rate, is the central criterion for judging the quality of such a system. If high speaker independence is to be achieved, the recognition system must be trained in the training phase on the speech patterns of as many speakers as possible. As a consequence, the resulting reference patterns (see above) occupy a correspondingly widely scattered region of the feature space. This increases the probability that feature patterns of different utterances overlap, and recognition reliability decreases accordingly. This holds all the more, the larger the vocabulary to be recognized and the higher its perplexity. The most reliable recognition results can therefore be achieved with recognizers trained on a single speaker (the reference speaker). The disadvantage of such so-called speaker-dependent recognizers is that recognizing a previously unknown speaker (the test speaker) requires elaborate and very time-consuming retraining.

Against this background, it is natural to develop speaker adaptation methods that adapt or normalize a test speaker to the reference speaker. Such an adaptation can be done either implicitly, by adapting the recognition method directly to the test speaker, or explicitly, by placing a module in front of the recognition stage that maps the test speaker's feature space used for recognition onto the feature space of the reference speaker.

Two basic methods aimed at explicit speaker adaptation are known from the literature:
Furui, S.: "Speaker-Independent and Speaker-Adaptive Recognition Techniques", in "Advances in Speech Signal Processing", eds. Sadaoki Furui & M. Mohan Sondhi, Marcel Dekker Inc., New York, chap. 3.4.

In the first method, a (usually nonlinear) functional mapping of the feature spaces onto one another is carried out. If this is done, for example, with a multilayer neural network, nearly one hundred percent speaker adaptation can be achieved, but training the neural network requires such an extensive training corpus that, quite apart from the unreasonable burden this places on the test speaker, there is no significant time advantage over retraining the recognition stage.

The second category comprises the so-called codebook mapping methods. They are based on the concept of quantizing the feature space of both the reference speaker and the test speaker (representing it by a limited number of prototypes) and linking the entries of the resulting codebooks in a suitable way. To this end, in an adaptation phase the vector-quantized and time-axis-transformed version of a training utterance by the test speaker is compared with the stored version of the reference speaker, and the corresponding centroids (prototypes) are recorded in the form of a histogram. With the help of this correspondence histogram, the feature vectors of the test speaker are replaced by those of the reference speaker in the operating phase. One disadvantage of codebook mapping methods lies, among other things, in the difficulty of establishing unambiguous correspondences between the codebooks of the two speakers. A further problem is the choice of a suitable method for generating the speakers' codebooks (vector quantization method). The state of the art is to use so-called conventional vector quantization algorithms, as described in their essential features in:
Linde, Y.; Buzo, A. and Gray, R. M. (1980): "An Algorithm for Vector Quantization Design", IEEE Transactions on Communications, Vol. 28, No. 1, p. 84 ff.
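The correspondence histogram used by codebook mapping can be sketched as follows. This is only an illustration under stated assumptions: the function names `build_correspondence` and `map_to_reference` are hypothetical, both index sequences are assumed to be already time-aligned frame by frame, and picking the most frequent partner is one plausible way to resolve the histogram, not necessarily the rule used in the cited literature.

```python
from collections import Counter, defaultdict

def build_correspondence(test_codes, ref_codes):
    """Count how often each test codebook index co-occurs with each reference index.

    Assumption: both sequences are already time-aligned, one entry per frame.
    """
    if len(test_codes) != len(ref_codes):
        raise ValueError("sequences must be time-aligned and of equal length")
    histogram = defaultdict(Counter)
    for t, r in zip(test_codes, ref_codes):
        histogram[t][r] += 1
    return histogram

def map_to_reference(test_code, histogram):
    """Replace a test codebook index by its most frequent reference partner."""
    partners = histogram.get(test_code)
    if not partners:
        return None  # no correspondence was observed during adaptation
    return partners.most_common(1)[0][0]

# Toy adaptation data: codebook indices per aligned frame.
test_codes = [0, 0, 1, 2, 2, 2]
ref_codes = [3, 3, 4, 5, 5, 4]
hist = build_correspondence(test_codes, ref_codes)
```

Test index 2 is paired with reference index 5 twice and with index 4 once, which shows the ambiguity the text describes: the histogram has to be resolved into a single correspondence.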

The theory of self-organizing, topology-preserving feature maps offers a further way of quantizing speaker feature spaces. It was developed at the beginning of the 1980s and is described in detail by its co-founder in:
Kohonen, T. (1990): "The Self-Organizing Map", Proceedings of the IEEE, Vol. 78, No. 9, September 1990, p. 1464 ff.
Self-organizing, topology-preserving feature maps, referred to below as SOTE maps for short, are a special form of so-called neural networks. A comprehensive overview of the known variants and forms of neural networks can be found, for example, in:
Schöneburg, E. (1990): "Neuronale Netzwerke: Einführung, Überblick und Anwendungsmöglichkeiten", Markt und Technik-Verlag, chap. 4.
The essential function of SOTE maps is vector quantization, i.e. the representation of a high-dimensional vector space by a finite number of prototypical representatives (neurons). In contrast to conventional vector quantizers, the prototypes of a SOTE map are arranged on a 2-dimensional grid, from which the name "map" derives. The prototypes are not arranged randomly, but in such a way that prototypes that are close together in feature space are also located in the immediate spatial neighborhood on the map. As a result, the topology of the feature map resembles that of the underlying feature space; the feature map is said to be topology-preserving.
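A single training step of such a feature map can be sketched as follows. This is a minimal, generic self-organizing-map update, not code taken from the patent: the Gaussian neighbourhood function, the parameter names `rate` and `radius`, and the tiny 4×5 grid are assumptions chosen for illustration.

```python
import math
import random

def make_map(rows, cols, dim, rng):
    """Create a rows x cols grid of prototype vectors with random initial values."""
    return [[[rng.random() for _ in range(dim)] for _ in range(cols)] for _ in range(rows)]

def sq_dist(w, x):
    """Squared Euclidean distance between a prototype and a feature vector."""
    return sum((wi - xi) ** 2 for wi, xi in zip(w, x))

def find_winner(som, x):
    """Return the grid position (row, col) of the prototype closest to x."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(som):
        for c, w in enumerate(row):
            d = sq_dist(w, x)
            if d < best_dist:
                best, best_dist = (r, c), d
    return best

def train_step(som, x, rate, radius):
    """Pull the winner and its grid neighbours toward x (Gaussian neighbourhood)."""
    wr, wc = find_winner(som, x)
    for r, row in enumerate(som):
        for c, w in enumerate(row):
            grid_dist2 = (r - wr) ** 2 + (c - wc) ** 2
            h = math.exp(-grid_dist2 / (2.0 * radius ** 2))
            for i in range(len(w)):
                w[i] += rate * h * (x[i] - w[i])
    return wr, wc

rng = random.Random(0)
som = make_map(4, 5, 3, rng)
x = [0.2, 0.8, 0.5]
winner_before = find_winner(som, x)
d_before = sq_dist(som[winner_before[0]][winner_before[1]], x)
winner = train_step(som, x, rate=0.5, radius=1.0)
d_after = sq_dist(som[winner[0]][winner[1]], x)
```

Because neighbours on the grid are moved together with the winner, prototypes that are adjacent on the map drift toward similar regions of the feature space, which is the topology-preserving behaviour described above.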

The invention specified in claim 1 is based on the technical problem of quantizing the feature spaces of two speakers in a way that reflects the topology of the underlying feature spaces, thereby enabling a fault-tolerant mapping of the feature spaces onto one another, or the fault-tolerant exchange of the speakers' feature vectors in the sense of a codebook mapping. Furthermore, the training of the codebooks to be mapped should guarantee the greatest possible topological identity of the codebooks, so that an elaborate, error-prone correspondence search between the codebooks becomes unnecessary.

This problem is solved by the features specified in claim 1.

The invention according to claim 1 is based on quantizing the feature spaces of the speakers to be adapted by means of self-organizing, topology-preserving feature maps and, for training the feature maps, first training the feature map of a reference speaker completely on the reference speaker's feature space in a first phase, and then, at the beginning of the adaptation phase of a test speaker's feature map, initializing that map with the topology of the reference speaker map.

One advantage of the invention follows directly from the topology-preserving properties of SOTE maps: fault tolerance in the quantization of both speakers' feature spaces. If a wrong decision occurs in the operating phase, i.e. during the exchange of the speakers' feature vectors, its effect on the adaptation result is generally small, because neurons with similar meaning are located in spatial proximity on the feature maps.

At the same time, every quantization decision brings with it a whole group of possible alternative decisions, and these groups can each be defined by simple spatial neighborhood regions; the method is therefore also suitable for developing speech recognition methods that deliberately exploit such hypotheses about possible quantization alternatives to form or safeguard a recognition decision. Because the adaptation method is reversible, i.e. a mapping of the reference speaker onto a test speaker is also possible, it is likewise suitable for speech synthesis. For example, the "voice" of a speech synthesizer built on the spectral features of a reference speaker could be changed.

A further significant advantage of the invention stems from the self-organizing properties of SOTE maps. "Self-organizing" refers to the fact that the topology of the maps develops entirely on its own during the learning or training phase. Unlike conventional vector quantization methods, the number of desired prototypes does not have to be specified explicitly in advance; instead, a large number of prototypes (neurons) is made available at the outset, which, after the training phase is complete, can be grouped into so-called feature classes in a suitable way by analyzing the map topology. In contrast to conventional methods, this guarantees a topology-faithful quantization of the feature space of interest.

At the same time, unsupervised training, and with it the ultimately uncontrolled formation of feature classes, poses a difficulty for the use of SOTE maps in codebook mapping methods. In general it cannot be guaranteed that two feature maps trained independently on different speakers develop the same shape and number of feature classes. This, however, is an essential prerequisite for forming an unambiguous correspondence relation between two codebooks.

This difficulty is circumvented by the generation of the test speaker's codebook (test speaker map) as explained in the invention. It also ensures high efficiency and thus time savings in the adaptation phase. Whereas the reference speaker's feature map is trained using the known competitive learning rules, i.e. it is initialized with random values at the start of training, the test speaker's feature map is initialized with the previously fully trained feature map of the reference speaker. The training phase of the test speaker map (adaptation phase) can thus be shortened considerably compared with the training phase of the reference speaker map (training phase), and the test speaker's adaptation speech corpus can be limited to a few words. Because the basic topology of the feature space of interest is already given in the adaptation phase in the form of a copy of the reference speaker map, the topological similarity of the two maps is also promoted.
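The initialization described here can be sketched in a few lines. The sketch makes several assumptions that are not in the source: maps are stored as dictionaries from grid position to prototype vector, the names `init_test_map` and `adapt` are hypothetical, and the retraining uses a simplified winner-only update rather than the full neighbourhood-based learning rule.

```python
import copy

def init_test_map(reference_map):
    """Start the test speaker map as an independent copy of the trained reference map."""
    return copy.deepcopy(reference_map)

def adapt(test_map, frames, rate=0.3):
    """Briefly retrain the copied map on the test speaker's frames.

    Simplification (assumption): only the winning prototype is updated.
    """
    for x in frames:
        winner = min(
            test_map,
            key=lambda pos: sum((w - xi) ** 2 for w, xi in zip(test_map[pos], x)),
        )
        test_map[winner] = [w + rate * (xi - w) for w, xi in zip(test_map[winner], x)]
    return test_map

# Toy 2x2 reference map with 2-dimensional prototypes.
reference_map = {
    (0, 0): [0.0, 0.0], (0, 1): [1.0, 0.0],
    (1, 0): [0.0, 1.0], (1, 1): [1.0, 1.0],
}
test_map = adapt(init_test_map(reference_map), frames=[[0.2, 0.1], [0.9, 1.2]])
```

Since the test map starts from the reference prototypes, both maps keep the same grid layout and only the prototype values shift toward the test speaker's data; the reference map itself is left untouched.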

A further advantageous embodiment of the invention is specified in claim 2. It is based on the assumption that the topologies of the feature maps of the speakers to be adapted are at least similar. This eliminates an error-prone, separately performed correspondence search between the feature maps to be mapped onto one another, together with the associated problem of disambiguating possible ambiguities. The method therefore requires no explicit knowledge of the meaning of the feature sets used and can, for example, also be used to adapt abstract feature sets (e.g. event-oriented feature sets in speech recognition). At the same time, the time required in the operating phase is reduced to a minimum.
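If both maps share the same topology, the exchange of feature vectors in the operating phase reduces to a lookup by grid position. The following sketch illustrates that idea under assumptions: maps are dictionaries from grid position to prototype vector, and the function names are hypothetical.

```python
def nearest_position(feature_map, x):
    """Grid position of the prototype closest to x (squared Euclidean distance)."""
    return min(
        feature_map,
        key=lambda pos: sum((w - xi) ** 2 for w, xi in zip(feature_map[pos], x)),
    )

def map_to_reference(x, test_map, reference_map):
    """Quantize x on the test map, then read the reference prototype at the same position."""
    pos = nearest_position(test_map, x)
    return pos, reference_map[pos]

# Toy maps with identical grid layout but speaker-specific prototype values.
reference_map = {(0, 0): [0.0, 0.0], (0, 1): [1.0, 0.0]}
test_map = {(0, 0): [0.2, 0.1], (0, 1): [1.3, 0.2]}

pos, ref_vector = map_to_reference([0.25, 0.1], test_map, reference_map)
```

No correspondence table is needed: the grid position itself is the link between the two codebooks.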

A suitable further development of the method specified in claim 1 is set out in claim 3. Its advantage lies, first, in the additional increase in efficiency of the adaptation phase achieved by limiting the respective training region to a relatively small winner search space. Second, it ensures that the map topology given by the copy of the reference speaker map is not significantly changed during retraining of the test speaker map; this, however, is a basic prerequisite for the 1:1 mapping of the feature maps intended in the operating phase.
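One way to picture a restricted winner search space is sketched below. The exact construction in claim 3 is not reproduced in this section, so the following is an assumption: the search space is taken to be all grid positions within a square neighbourhood of `radius` around the winners that the same training utterance produced on the reference speaker map. All names are hypothetical.

```python
def search_space(ref_winners, radius, rows, cols):
    """Grid positions within `radius` (square neighbourhood) of any reference winner."""
    allowed = set()
    for wr, wc in ref_winners:
        for r in range(max(0, wr - radius), min(rows, wr + radius + 1)):
            for c in range(max(0, wc - radius), min(cols, wc + radius + 1)):
                allowed.add((r, c))
    return allowed

def restricted_winner(test_map, x, allowed):
    """Search for the winner only among the allowed grid positions."""
    return min(
        allowed,
        key=lambda pos: sum((w - xi) ** 2 for w, xi in zip(test_map[pos], x)),
    )

rows, cols = 40, 25
# Toy map: each prototype simply equals its own grid coordinates.
test_map = {(r, c): [float(r), float(c)] for r in range(rows) for c in range(cols)}

allowed = search_space([(0, 0), (5, 5)], radius=1, rows=rows, cols=cols)
winner = restricted_winner(test_map, [20.0, 20.0], allowed)
```

Here only 13 of the 1000 positions are examined, and the retraining cannot touch neurons far from the reference trajectory, which is how the restriction both saves time and protects the copied topology.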

By specifying the training region (winner trajectory), the adaptation of speakers with different accents is also possible, provided that the respective training utterances contain the same principal components.

Claim 4 describes an advantageous extension of the method specified in claim 3. The refinement according to claim 4 suppresses statistical uncertainties in the choice of the winner neurons to be included in the permitted training region. In addition, a further time saving is achieved in the adaptation phase.

Claim 5 describes an advantageous, efficient implementation of the method according to claim 4. The smoothing method described is based on an averaging method that has also been used successfully in other fields of application.

The extension, specified in claim 6, of the method set out in claim 4 is based on the principle of using explicit world knowledge (empirical values) about the response multiplicity of the neurons to select those winner neurons that are to be taken from a winner trajectory of the reference speaker map into the winner search space for retraining the test speaker map; the smoothing achieved in this way is matched to the characteristic temporal structure of the feature space to be adapted.

Claim 7 represents an advantageous embodiment of the method specified in claim 6. It makes targeted use of knowledge about the statistics of the neurons' response multiplicity that was already gained in the training phase of the reference speaker map.

The addition, specified in claim 8, to the method set out in claim 1 likewise has the advantage of avoiding a significant change, during retraining of the test speaker map, to the map topology given by the copy of the reference speaker map.

The further development of claim 3 set out in claim 9 ensures that the test speaker map is retrained in the temporally correct direction along the winner trajectories of the reference speaker map. Particularly when the winner trajectories of the reference speaker map are strongly intertwined or cross one another, wrong decisions in the winner search are thereby avoided.

Claim 10 represents an advantageous embodiment of claim 9. The method according to claim 10 can be integrated into the adaptation process in a simple, computationally efficient way.

The realization, specified in claim 11, of the method set out in claim 10 is advantageous in that it weights larger time intervals disproportionately more heavily than small ones and at the same time makes it unnecessary to take the absolute value of the time interval.
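A squared time offset is one weighting with exactly these two properties, and the sketch below uses it purely as an illustration. The formula of claim 11 is not given in this section, so the squared penalty, the additive combination with a feature distance, and the names `time_penalty`, `pick_winner` and `weight` are all assumptions.

```python
def time_penalty(dt):
    """Squared time offset: independent of sign and growing faster than linearly."""
    return dt * dt

def pick_winner(candidates, t, weight):
    """Choose among candidates given as (position, feature_distance, reference_time).

    Assumption: the cost is feature distance plus a weighted squared time offset.
    """
    return min(candidates, key=lambda c: c[1] + weight * time_penalty(t - c[2]))[0]

candidates = [((3, 4), 0.10, 12), ((9, 1), 0.08, 2)]
winner_with_time = pick_winner(candidates, t=11, weight=0.01)
winner_without_time = pick_winner(candidates, t=11, weight=0.0)
```

Without the time term the slightly closer but temporally distant neuron (9, 1) would win; with it, the neuron that lies at the right point of the trajectory is preferred, which is the kind of wrong decision at crossing trajectories that the time weighting is meant to avoid.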

The overall method is illustrated further by the following exemplary embodiment from the field of automatic speech recognition.

Both the reference speaker map and the test speaker map are designed as self-organizing, topology-preserving feature maps of dimension 40 × 25 (1000 neurons per map in total).

In the training phase, the reference speaker map is trained on the speech corpus of interest, spoken by the reference speaker, by means of the known learning rules according to Kohonen, cf. Kohonen, T. (1990): "The Self-Organizing Map", Proceedings of the IEEE, Vol. 78, No. 9, September 1990, pp. 1464 ff. (Fig. 1/1a & 1/1b). For this purpose, every training utterance is sampled in the time domain, and at each sampling instant a frequency analysis of the speech signal is carried out and summarized in the form of a vector (feature vector). After the reference speaker map has been trained by means of the known learning rules for self-organizing, topology-preserving feature maps, each neuron on the feature map represents a prototype of these feature vectors. If the method specified in claim 7 for smoothing the winner trajectories is to be used in the adaptation phase, the mean and variance of the response multiplicity of each neuron are additionally determined in the training phase and stored accordingly (Fig. 1/2a & 1/2b). After the training has been completed, the reference speaker map is divided into so-called feature classes using the methods known for this purpose; these classes are likewise to be stored (Fig. 1/3a & 1/3b).
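The bookkeeping of the response multiplicity can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that "response multiplicity" means the number of consecutive frames for which a neuron stays the winner, that neurons are identified by integer indices, and that the population variance is used. The function names and the factor `k` are hypothetical.

```python
from itertools import groupby
from statistics import mean, pvariance


def response_runs(winners):
    """Collapse a winner sequence into (neuron, run_length) pairs."""
    return [(neuron, sum(1 for _ in group)) for neuron, group in groupby(winners)]


def multiplicity_stats(utterances):
    """Per neuron: (mean, variance) of its run lengths over all training utterances."""
    runs = {}
    for winners in utterances:
        for neuron, length in response_runs(winners):
            runs.setdefault(neuron, []).append(length)
    return {n: (mean(v), pvariance(v)) for n, v in runs.items()}


def filter_trajectory(winners, stats, k=2.0):
    """Keep neurons whose run length lies within mean +/- k * variance.

    k = 2 follows the 'plus/minus twice the variance' wording of claim 7;
    reading the criterion as an interval check is an assumption.
    """
    kept = []
    for neuron, length in response_runs(winners):
        m, var = stats[neuron]
        if m - k * var <= length <= m + k * var:
            kept.append(neuron)
    return kept
```

The statistics would be collected once while training the reference speaker map and stored with it; the filter would then be applied to each winner trajectory during adaptation.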

In the adaptation phase, the test speaker map to be trained is initialized with the reference speaker map, i.e. a copy of the reference speaker map is created, which is to be retrained on the speech corpus of the test speaker in the course of the adaptation phase (Fig. 2/1a & 2/1b). The retraining is carried out in principle in the same way as in the training phase of the reference speaker map. The only difference is that the winner search space on the test speaker map is deliberately restricted. The basic idea is to admit to the search space only those neurons that emerge as winners on the reference speaker map during the training utterance in question (Fig. 2/2a & 2/2b). In order to limit the search space even further, or to eliminate improbable winner neurons, the winner trajectories are additionally smoothed before they become available as the winner search space (Fig. 2/3). This is done either by means of the averaging method specified in claim 5, or by including in the training region of the test speaker map only those winner neurons that were selected with a multiplicity matching their mean response multiplicity, as recorded in the training phase of the reference speaker map, plus/minus twice the variance of the response multiplicity (claim 7), cf. Fig. 2/4. As a result, the winner search space is restricted to relatively smooth trajectories (Fig. 4).

In addition, the modification of a neighborhood region grouped around the respective winner neuron, which is associated with the training of self-organizing, topology-preserving feature maps, is restricted to those neighborhood neurons that belong to the same feature class as the winner neuron, the feature classes having already been determined in the training phase of the reference speaker map (Fig. 2/5a & 2/5b). This avoids an undesired shift of the map topology given by the copy of the reference speaker map.
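One retraining step with a restricted winner search and a class-limited neighborhood update might look like the sketch below. It is an illustration under stated assumptions: prototypes are plain lists of floats, the neighborhood is a square (Chebyshev) region of fixed `radius`, and the update uses a constant learning rate `lr`. None of these choices, nor the function and parameter names, are taken from the patent text.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def restricted_winner(weights, x, allowed):
    """Index of the closest prototype, searched only among the allowed neurons."""
    return min(allowed, key=lambda i: sq_dist(weights[i], x))


def adapt_step(weights, classes, coords, x, allowed, lr=0.1, radius=1):
    """One retraining step on the test speaker map (modifies weights in place).

    weights: prototype vector per neuron
    classes: feature-class label per neuron (from the reference map)
    coords:  (row, col) grid position per neuron
    allowed: neuron indices on the reference winner trajectory
    """
    win = restricted_winner(weights, x, allowed)
    win_row, win_col = coords[win]
    for i, (row, col) in enumerate(coords):
        in_range = max(abs(row - win_row), abs(col - win_col)) <= radius
        # Only neighbors in the winner's feature class are moved toward x.
        if in_range and classes[i] == classes[win]:
            weights[i] = [w + lr * (xi - w) for w, xi in zip(weights[i], x)]
    return win
```

Because neighbors outside the winner's feature class are left untouched, class boundaries inherited from the reference speaker map stay where they were, which is the stated purpose of the restriction.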

For training trajectories such as those shown in Fig. 4, the underlying competitive learning strategy also ensures that the test speaker map is retrained in the correct temporal order. If, however, a training utterance forms a strongly intertwined or self-crossing winner trajectory (Fig. 5), this can no longer be stated with certainty. Maintaining this order is nevertheless important, particularly with regard to a possible subsequent speech recognition, since the formation of consistent, similar trajectories on the feature maps of both speakers is a basic prerequisite for successful recognition. In order to guarantee order-preserving training even in the problem cases mentioned, a temporal weighting is attached to each winner neuron (Fig. 2/6), which leads to a comparison as shown in Fig. 6. The winner neuron belonging to a feature vector of a training utterance of the test speaker is then obtained not only by means of the known distance measures for vectors, but with simultaneous consideration of the temporal correlation between the feature vectors of the training utterance and the neurons of the winner trajectory of the reference speaker map. This relationship is realized in the winner search in the form of an additive cost term proportional to the square of the temporal distance.
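The time-weighted winner search can be sketched as a cost that adds a quadratic time penalty to the usual vector distance. The sketch assumes squared Euclidean distance as the vector distance, relative times normalized to the range 0 to 1 for both utterances, and a hypothetical proportionality factor `lam`; the patent text fixes none of these.

```python
def timed_winner(weights, trajectory, x, t, lam=1.0):
    """Pick the trajectory entry with the lowest combined cost.

    weights:    prototype vector per neuron index
    trajectory: list of (neuron_index, relative_time) pairs recorded on the
                reference speaker map
    x, t:       test speaker feature vector and its relative time
    cost = ||x - w||^2 + lam * (t - t_ref)^2
    """
    def cost(entry):
        neuron, t_ref = entry
        distance = sum((xi - wi) ** 2 for xi, wi in zip(x, weights[neuron]))
        return distance + lam * (t - t_ref) ** 2

    return min(trajectory, key=cost)


# Neurons 0 and 2 have similar prototypes but are visited at opposite ends
# of the utterance, as on a crossing trajectory.
weights = {0: [0.0], 1: [1.0], 2: [0.1]}
trajectory = [(0, 0.0), (1, 0.5), (2, 1.0)]
late_winner = timed_winner(weights, trajectory, [0.04], t=0.9)
```

With `lam=0` the search falls back to the plain distance and selects neuron 0 for this late frame; with the time penalty it selects neuron 2, which lies at the matching point of the trajectory. Squaring the time difference removes the need for an absolute value and penalizes large offsets more than proportionally.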

In the operating phase, the adaptation of the feature spaces of the two speakers is carried out in the form of a 1:1 mapping of the test speaker map onto the reference speaker map (Fig. 3). For each feature vector of an utterance of the test speaker, the associated winner neuron on the test speaker map is determined, and the feature vector is replaced by the feature vector represented by the neuron of the same name on the reference speaker map.
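The 1:1 mapping of the operating phase amounts to a codebook lookup, sketched below. It assumes that the "neuron of the same name" is the neuron with the same index in both maps and that squared Euclidean distance is the similarity measure; the function name is hypothetical.

```python
def map_to_reference(test_weights, ref_weights, utterance):
    """Replace each test speaker feature vector by the reference prototype
    stored at the index of its winner neuron on the test speaker map."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    mapped = []
    for x in utterance:
        win = min(range(len(test_weights)),
                  key=lambda i: sq_dist(test_weights[i], x))
        mapped.append(ref_weights[win])
    return mapped


# Two-neuron toy maps: the test speaker's vectors are exchanged for the
# reference speaker's prototypes at the same neuron index.
test_map = [[0.0, 0.0], [1.0, 1.0]]
ref_map = [[10.0, 10.0], [20.0, 20.0]]
converted = map_to_reference(test_map, ref_map, [[0.9, 1.2], [0.1, -0.1]])
```

The converted sequence lies in the reference speaker's feature space and could then be passed to a recognizer trained on that speaker.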

Claims

1. Method for the mutual mapping of feature spaces by means of an exchange of the feature vectors prototypically representing the feature spaces (codebook mapping), in particular for speaker adaptation, in which

- the prototypes of the feature spaces to be mapped (codebooks) are generated and stored in the form of self-organizing, topology-preserving feature maps, and in which
- to train the feature maps to be mapped (codebooks), in a first phase (training phase) the feature map of a reference speaker (reference speaker map) is fully trained on the feature space of the reference speaker, and at the beginning of a second phase (adaptation phase) the feature map of a test speaker (test speaker map) is initialized with the topology of the reference speaker map.

2. The method according to claim 1, characterized in that the mapping of the feature spaces of two speakers onto one another takes place in the operating phase in the form of a 1:1 exchange of the prototypes of the feature maps.

3. The method according to claim 1, characterized in that the search for the most similar prototype (winner search), which is associated with the training of self-organizing, topology-preserving feature maps, is restricted in the adaptation phase to the set of those neurons (smallest local units of a feature map) on the test speaker map which, during an adaptation utterance (sequence of feature vectors), previously emerged as winner neurons on the reference speaker map (winner trajectories), the reference speaker map being presented with the adaptation utterance spoken by the reference speaker and the test speaker map with the adaptation utterance spoken by the test speaker.

4. The method according to claim 3, characterized in that the winner trajectories of the reference speaker map are first smoothed before they are used as the winner search space for retraining the test speaker map.

5. The method according to claim 4, characterized in that the smoothing of the winner trajectories is carried out in the form of a local averaging of several winner neurons selected in temporal succession, the averaged centroid at each smoothing point being represented by the neuron nearest to the local centroid.

6. The method according to claim 4, characterized in that only the coordinates of those neurons from the winner trajectories are included in the respectively permitted winner search space for retraining the test speaker map which are addressed on the reference speaker map during a training utterance with a multiplicity that corresponds, within certain limits, to the multiplicity with which the individual neurons are known to emerge successively as winner neurons (response multiplicity).

7. The method according to claim 6, characterized in that

a) during the training phase, both the mean and the variance of the response multiplicity of the neurons of the reference speaker map are determined and recorded, and that
b) only the coordinates of those neurons from the winner trajectories are included in the respectively permitted winner search space for retraining the test speaker map which, during the corresponding training utterance, are moreover addressed with a multiplicity equal to their mean response multiplicity plus/minus twice the variance of their response multiplicity.

8. The method according to claim 1, characterized in that

a) the neurons of the reference speaker map are classified into feature classes after completion of the training phase, the feature classes being formed by grouping the feature vectors, represented by the neurons, that have the greatest relative similarity, and that
b) the modification of a neighborhood area grouped around a winner neuron, which is associated with the training of self-organizing, topology-preserving feature maps, is limited during the adaptation phase to those neighborhood neurons which belong to the same feature class as the respective winner neuron.

9. The method according to claim 3, characterized in that the search for the respectively most similar neuron (winner search), which is associated with the retraining of the test speaker map, is performed in the adaptation phase in accordance with the chronological order in which the neurons of the winner trajectory were addressed during a training utterance.

10. The method according to claim 9, characterized in that

a) the winner neurons (winner trajectory) emerging on the reference speaker map during a training utterance are additionally provided with a mark that indicates the time, relative to the start of the training utterance, at which the respective neuron is determined as the winner, and that
b) in the winner search, the relative time interval between the feature vectors of the training utterance of the test speaker and the neurons of the winner trajectory of the reference speaker map is also taken into account, the similarity (winning probability) being reduced as the time interval increases.

11. The method according to claim 10, characterized in that the time interval is taken into account in the winner search in the form of an additive cost term proportional to the square of the relative time interval between the feature vectors of the test speaker's utterance and the neurons of the winner trajectory of the reference speaker map.