EP1406244B1 - Voice activity detection based on unsupervised trained clustering - Google Patents

Voice activity detection based on unsupervised trained clustering

Info

Publication number
EP1406244B1
EP1406244B1 (application EP03102639A / EP20030102639)
Authority
EP
European Patent Office
Prior art keywords
signal
speech
classes
accordance
classified
Prior art date
Legal status
Expired - Fee Related
Application number
EP20030102639
Other languages
German (de)
French (fr)
Other versions
EP1406244A3 (en)
EP1406244A2 (en)
Inventor
Dr. Stephan Grashey
Current Assignee
Siemens AG
Original Assignee
Siemens AG
Priority date
Filing date
Publication date
Application filed by Siemens AG
Publication of EP1406244A2
Publication of EP1406244A3
Application granted
Publication of EP1406244B1
Anticipated expiration
Expired - Fee Related

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals

Definitions

  • A Voice Activity Detector (VAD) is a device that distinguishes between speech including background noise ("speech") and the background noise alone ("non-speech").
  • The input of a VAD can be, for example, a voice signal of a communication terminal recorded by a microphone. While the user speaks, the signal is composed of the user's voice and the background noise (for example, street noise). During speech pauses, by contrast, the signal consists solely of the background noise.
  • The output of a Voice Activity Detector adds to the input signal the information whether or not it contains speech.
  • A VAD can be used for data reduction, so that only the speech signal is stored or transmitted.
  • In speech recognition, a VAD allows faster and better recognition, since the recognizer can concentrate on the pure speech passages of the audio signal.
  • VADs are either set on the basis of heuristics or trained in the course of a training phase.
  • The input signal in each case is the suitably preprocessed audio signal.
  • A feature extraction yields feature vectors whose size depends on the number of features used.
  • The simplest, but still widely used, heuristic is to judge a signal against a specific, fixed energy threshold. If the signal energy exceeds the threshold, "speech" is assumed, otherwise "non-speech".
  • Another example is the determination of the zero-crossing rate of the autocorrelation function of the speech signal, with a corresponding threshold for deciding whether or not a speech signal is present.
  • VADs that are trained in a training phase include statistical VADs and neural networks. These are trained with data for which it is known when speech and when noise occurs, i.e. data that has been labeled beforehand, for example by hand. Examples of methods with which it can be decided in this way whether a speech signal is present or not are given in Stadermann, J.: "Sprach/Pause-Detektion in der automatischen Spracherkennung", University of Duisburg, diploma thesis, 1999, pages 28-36.
  • VADs for wireless communication in particular are described in El-Maleh, K. and Kabal, P.: "Comparison of voice activity detection algorithms for wireless personal communication systems", Proc. IEEE Canadian Conference on Electrical and Computer Engineering, St. John's, Newfoundland, May 1997, pages 470-473.
  • The object of the invention is to enable a more precise distinction between speech and non-speech. Emphasis is also to be placed on automatic adaptability to different noise situations, speakers, or languages.
  • A VAD can in principle be regarded as a classifier with N = 2 classes (speech/non-speech).
  • A much better classification can be achieved if a signal is not immediately assigned to the speech or non-speech class, but is first assigned, depending on its features, to one class of a plurality of more than three classes. In this way, the numerous different characteristics of speech and noise can be taken into account better.
  • The plurality is preferably greater than or equal to 10, in particular greater than or equal to 64. Depending on the class to which the signal is assigned, it is then decided whether the signal is a speech signal or not.
  • While it is known in downstream speech processing to divide speech signals that have already been recognized as such, i.e. after voice activity detection, into more than two classes, a special feature of the invention is that two or more classes may be provided for which, when the signal is assigned to one of them, it is decided that the signal is not speech.
  • The classes may be clustered so that similar classes are adjacent or grouped together.
  • The classes are formed automatically in a training phase, in particular using test signals, by a self-organizing clustering method trained without supervision.
  • A neural network is preferably used, in particular a Kohonen network with the network architecture of a self-organizing map.
  • This trained and structured network is then preferably also used directly in the detection phase, in which it is decided whether or not a signal is a speech signal.
  • The device described can be used in biometric speech recognition during enrollment to capture the voice of the enrolling person as a reference rather than larger or smaller portions of the background noise. Otherwise, a person who has a similar noise environment during verification may be authenticated by the system.
  • A method for detecting whether a speech signal is present or not can be constructed analogously to the described device. This also applies to its preferred embodiments.
  • A program product for a data processing system, containing code sections with which one of the described methods can be executed on the data processing system, can be produced by suitable implementation of the method in a programming language and translation into code executable by the data processing system.
  • The code sections are stored for this purpose.
  • A program product is understood here to mean the program as a tradable product. It can be present in any form, for example on paper, on a computer-readable medium, or distributed over a network.
  • Known VADs have the problem that features extracted from the signal are divided into only two classes, although their characteristics differ widely within one and the same class. For example, in a speech signal the features representing unvoiced sounds are as a rule very different from those reflecting voiced sounds. Nevertheless, both are assigned to the same class ("speech").
  • A self-organizing clustering method with N > 2 classes, trained without supervision, is used.
  • N is chosen arbitrarily but sensibly.
  • For training, therefore, only feature vectors extracted from an audio signal are used, without a class membership being specified at the same time.
  • More generally, there is thus a larger number m of classifier classes representing "speech" and a larger number n of classes representing "non-speech" (m + n = N > 2).
  • This first phase is illustrated with reference to Figure 1.
  • This preprocessing is preferably the same as that used for later speech recognition. This can save a second preprocessing.
  • The preprocessing 2 extracts feature vectors 3 from the audio signals of the audio database 1; these vectors specify features of the audio signals. The feature vectors 3 are supplied to the input neurons of a neural network 4.
  • the neural network 4 is a Kohonen network with the network architecture of a self-organizing map (SOM: Self-Organizing Map). It has the property that there is a local neighborhood relationship between the individual neurons, so that the reference vectors representing the individual classes are spatially ordered after completion of the training.
  • The neural network is trained on the basis of a database which, for example, contains speech and noise with equal frequency.
  • the training of such a network represents a self-organizing cluster process with unsupervised learning.
  • the result of the classifier training is a class representation 5.
  • In the association phase, each individual class of the classifier 4 in the form of the neural network is assigned to one of the two classes speech or non-speech.
  • The classifier 4 itself is operated in classification mode, i.e. it outputs the associated class 6 for each feature vector 3.
  • The association unit 7 is operated in training mode, i.e. on the basis of the labeled audio signals 8 it learns the assignment of each of the classifier classes to "speech" or "non-speech". It is determined for each class how many test signals assigned to it are "speech" and how many are "non-speech". Depending on this result, each class is declared in an association step as a speech or a non-speech class. The result is the class assignment 9 of the VAD.
  • The results obtained are further improved by using a mean value filter to eliminate individual outliers.
  • "Label" indicates the classification produced by a conventional VAD used for labeling.
  • This detection and classification agrees much better with reality than that of the conventional VAD. This is particularly noticeable in that even pauses between individual syllables are detected as "non-speech".
  • Dissimilar property vectors are no longer forced into the same class, but are assigned to a class on the basis of a similarity criterion alone. This increases the accuracy of the classification.
  • the method is independent of the language and / or content of the spoken text.
  • The invention can preferably also be used during enrollment in biometric speech recognition to detect word boundaries, since previous methods based on signal energy repeatedly lead to errors and thus to a security risk in biometric authentication.

Description

A Voice Activity Detector (VAD) is a device that distinguishes between speech including background noise ("speech") and the background noise alone ("non-speech"). The input of a VAD can be, for example, a voice signal of a communication terminal recorded by a microphone. While the user speaks, the signal is composed of the user's voice and the background noise (for example, street noise). During speech pauses, by contrast, the signal consists solely of the background noise. The output of a Voice Activity Detector adds to the input signal the information whether or not it contains speech.

The applications of a VAD are manifold. A VAD can be used for data reduction, so that only the speech signal is stored or transmitted. In speech recognition, a VAD allows faster and better recognition, since the recognizer can concentrate on the pure speech passages of the audio signal.

VADs are either set on the basis of heuristics or trained in the course of a training phase. The input signal in each case is the suitably preprocessed audio signal. A feature extraction then yields feature vectors whose size depends on the number of features used.

The simplest, but still widely used, heuristic is to judge a signal against a specific, fixed energy threshold. If the signal energy exceeds the threshold, "speech" is assumed, otherwise "non-speech".
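
As an illustration of this heuristic, a minimal fixed-energy-threshold VAD might look like the following Python sketch; the frame length and threshold are arbitrary example values and are not taken from the patent:

    import numpy as np

    def energy_vad(signal, frame_len=256, threshold=1e-3):
        # Minimal fixed-threshold VAD sketch: a frame counts as
        # "speech" if its mean energy exceeds a fixed threshold.
        # frame_len and threshold are illustrative values only.
        n_frames = len(signal) // frame_len
        decisions = []
        for i in range(n_frames):
            frame = np.asarray(signal[i * frame_len:(i + 1) * frame_len],
                               dtype=float)
            energy = np.mean(frame ** 2)
            decisions.append(energy > threshold)
        return decisions  # True = "speech", False = "non-speech"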

Another example is the determination of the zero-crossing rate of the autocorrelation function of the speech signal, with a corresponding threshold for deciding whether or not a speech signal is present.
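
The zero-crossing rate of the autocorrelation function could, for example, be computed as in the following sketch; the patent does not prescribe this exact formulation:

    import numpy as np

    def autocorr_zero_crossing_rate(frame):
        # Zero-crossing rate of the frame's autocorrelation function;
        # a threshold on this rate then discriminates speech from
        # non-speech. A sketch, not the formulation used in the patent.
        frame = frame - np.mean(frame)
        # Full autocorrelation, keeping non-negative lags only.
        acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        signs = np.sign(acf)
        crossings = np.sum(signs[:-1] * signs[1:] < 0)
        return crossings / len(acf)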

In addition, there are more complex methods that make the desired distinction using a more or less large number of thresholds based on a wide variety of features.

VADs that are trained in the course of a training phase include, for example, statistical VADs and neural networks. These are trained with data for which it is known when speech and when noise occurs, i.e. data that has been labeled beforehand, for example by hand. Examples of methods with which it can be decided in this way whether or not a speech signal is present are given in Stadermann, J.: "Sprach/Pause-Detektion in der automatischen Spracherkennung" (Speech/Pause Detection in Automatic Speech Recognition), University of Duisburg, diploma thesis, 1999, pages 28-36.

Further VADs, in particular for wireless communication, are disclosed in El-Maleh, K. and Kabal, P.: "Comparison of voice activity detection algorithms for wireless personal communication systems", Proc. IEEE Canadian Conference on Electrical and Computer Engineering, St. John's, Newfoundland, May 1997, pages 470-473.

The document Venkatesha Prasad et al.: "Comparison of Voice Activity Detection Algorithms for VoIP", Proc. 7th ISCC, July 2002, pp. 530-535, discloses a device for detecting whether or not a speech signal is present. For this purpose, a speech signal is assigned to one of the three classes "speech", "noise", or "low-energy speech phonemes". Depending on the class to which the speech signal has been assigned, it is then decided whether the signal is a speech signal or not. The disadvantage here is that the division of the speech signal is bound to the three fixed, predefined classes, so that the differentiated characteristics of speech signals are not taken into account. This results in a greater inaccuracy in the decision as to whether a speech signal is present or not.

Proceeding from this, the object of the invention is to enable a more precise distinction between speech and non-speech. Emphasis is also to be placed on automatic adaptability to different noise situations, speakers, or languages.

This object is achieved by the inventions specified in the independent claims. Advantageous embodiments emerge from the subclaims.

The invention is based on the idea that a VAD can in principle be regarded as a classifier with N = 2 classes (speech/non-speech). It has been found, however, that a much better classification can be achieved if a signal is not immediately assigned to the speech or non-speech class, but is first assigned, depending on its features, to one class of a plurality of more than three classes. In this way, the numerous different characteristics of speech and noise can be taken into account better.

In accordance with these numerous different characteristics, the plurality is preferably greater than or equal to 10, in particular greater than or equal to 64. Depending on the class to which the signal is assigned, it is then decided whether the signal is a speech signal or not.

While it is known in downstream speech processing to divide speech signals that have already been recognized as such, i.e. after voice activity detection, into more than two classes, a special feature of the invention is that two or more classes may be provided for which, when the signal is assigned to one of them, it is decided that the signal is not speech.

For this purpose, the classes may be clustered so that similar classes are adjacent or grouped together. To this end, the classes are formed automatically in a training phase, in particular using test signals, by a self-organizing clustering method trained without supervision.

A neural network is preferably used here, in particular a Kohonen network with the network architecture of a self-organizing map.

This trained and structured network is then preferably also used directly in the detection phase, in which it is decided whether or not a signal is a speech signal.

The device described can be used particularly advantageously in biometric speech recognition during enrollment, in order to capture the voice of the enrolling person as a reference rather than larger or smaller portions of the background noise. Otherwise, a person who has a similar noise environment during verification may be authenticated by the system.

A method for detecting whether or not a speech signal is present can be constructed analogously to the described device. This also applies to its preferred embodiments.

A program product for a data processing system, containing code sections with which one of the described methods can be executed on the data processing system, can be produced by suitable implementation of the method in a programming language and translation into code executable by the data processing system. The code sections are stored for this purpose. A program product is understood here to mean the program as a tradable product. It can be present in any form, for example on paper, on a computer-readable data carrier, or distributed over a network.

Further essential advantages and features of the invention emerge from the description of an exemplary embodiment with reference to the figures, in which:

Figure 1 shows the training phase of a device with means for detecting whether a speech signal is present or not;

Figure 2 shows the association phase of the device of Figure 1;

Figure 3 shows an example of the detection of whether a speech signal is present or not.

VADs known in the prior art have the problem that features extracted from the signal are divided into only two classes, although their characteristics differ widely within one and the same class. For example, in a speech signal the features representing unvoiced sounds are as a rule very different from those reflecting voiced sounds. Nevertheless, both are assigned to one and the same class ("speech").

It is therefore proposed to use a two-phase learning method to decide whether or not a speech signal is present.

In the first phase of the method, a self-organizing clustering method with N > 2 classes, trained without supervision, is used. N is chosen arbitrarily but sensibly. For training, therefore, only feature vectors extracted from an audio signal are used, without a class membership being specified at the same time. More generally, there is thus a larger number m of classifier classes representing "speech" and a larger number n of classes representing "non-speech" (m + n = N > 2). It thus becomes possible, for example, to assign voiced and unvoiced sounds to different classes.

This first phase is illustrated with reference to Figure 1, which shows an audio database 1 with audio signals. These are fed to a preprocessing 2. This preprocessing is preferably the same as that used for later speech recognition, which saves a second preprocessing.

The preprocessing 2 extracts feature vectors 3 from the audio signals of the audio database 1; these vectors specify features of the audio signals. The feature vectors 3 are supplied to the input neurons of a neural network 4.
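
The patent leaves the concrete features open (preferably the same preprocessing as for speech recognition). Purely as an assumed stand-in, a preprocessing step producing one feature vector per frame could be sketched as follows; log frame energy plus coarse log-band energies are hypothetical choices:

    import numpy as np

    def extract_feature_vectors(signal, frame_len=256, n_bands=16):
        # Hypothetical preprocessing: one feature vector per frame,
        # consisting of the log frame energy and a coarse log-magnitude
        # band spectrum. The patent does not prescribe these features.
        vectors = []
        for start in range(0, len(signal) - frame_len + 1, frame_len):
            frame = np.asarray(signal[start:start + frame_len], dtype=float)
            spectrum = np.abs(np.fft.rfft(frame))
            bands = np.array_split(spectrum, n_bands)
            band_energies = np.array([np.mean(b ** 2) for b in bands])
            log_energy = np.log(np.sum(frame ** 2) + 1e-10)
            vectors.append(np.concatenate(([log_energy],
                                           np.log(band_energies + 1e-10))))
        return np.array(vectors)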

The neural network 4 is a Kohonen network with the network architecture of a self-organizing map (SOM). It has the property that a local neighborhood relationship exists between the individual neurons, so that after training the reference vectors representing the individual classes are spatially ordered.
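
A minimal self-organizing map over such feature vectors could be trained as in the following sketch. Grid size, learning rate, and neighborhood schedule are illustrative assumptions; the exemplary embodiment below uses N = 625 classes, which would correspond to a 25 x 25 grid:

    import numpy as np

    def train_som(vectors, grid=25, epochs=10, lr0=0.5, seed=0):
        # Unsupervised Kohonen/SOM training sketch: each of the
        # grid*grid neurons holds a reference vector; thanks to the
        # neighborhood function, similar classes end up adjacent.
        rng = np.random.default_rng(seed)
        dim = vectors.shape[1]
        weights = rng.normal(size=(grid * grid, dim))
        coords = np.array([(i // grid, i % grid)
                           for i in range(grid * grid)], dtype=float)
        sigma0 = grid / 2.0
        n_steps = epochs * len(vectors)
        step = 0
        for _ in range(epochs):
            for v in vectors[rng.permutation(len(vectors))]:
                t = step / n_steps
                lr = lr0 * (1.0 - t)                  # decaying learning rate
                sigma = max(sigma0 * (1.0 - t), 1.0)  # shrinking neighborhood
                # Best-matching unit = class of this feature vector.
                bmu = np.argmin(np.sum((weights - v) ** 2, axis=1))
                d2 = np.sum((coords - coords[bmu]) ** 2, axis=1)
                h = np.exp(-d2 / (2.0 * sigma ** 2))
                weights += lr * h[:, None] * (v - weights)
                step += 1
        return weights

    def classify(weights, v):
        # Class index = index of the nearest reference vector.
        return int(np.argmin(np.sum((weights - v) ** 2, axis=1)))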

The neural network is trained on the basis of a database which, for example, contains speech and noise with equal frequency.

The training of such a network constitutes a self-organizing clustering method with unsupervised learning.

The result of the classifier training is a class representation 5.

After successful classifier training, in a second phase, the association phase, each individual class of the classifier 4 in the form of the neural network is assigned to one of the two classes speech or non-speech. For this purpose, the classifier 4 itself is now operated in classification mode, i.e. it outputs the associated class 6 for each feature vector 3. This is shown in Figure 2. The association unit 7, by contrast, is operated in training mode, i.e. on the basis of the labeled audio signals 8 it learns the assignment of each of the classifier classes to "speech" or "non-speech". It is determined for each class how many test signals assigned to it are "speech" and how many are "non-speech". Depending on this result, each class is declared in an association step as a speech or a non-speech class. The result is the class assignment 9 of the VAD.
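
The association step described above amounts to a per-class majority vote over the labeled test signals; a sketch under the same assumptions as before:

    import numpy as np

    def associate_classes(weights, labeled_vectors, labels):
        # Association phase: count per SOM class how many labeled
        # feature vectors are speech (True) vs. non-speech (False),
        # then declare each class by majority vote. Classes that
        # receive no vectors default to non-speech here; the patent
        # does not specify this case.
        n_classes = weights.shape[0]
        speech = np.zeros(n_classes)
        total = np.zeros(n_classes)
        for v, is_speech in zip(labeled_vectors, labels):
            c = int(np.argmin(np.sum((weights - v) ** 2, axis=1)))
            total[c] += 1
            speech[c] += bool(is_speech)
        return speech > total / 2.0  # boolean class assignment 9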

After the association step, the results obtained are further improved by using a mean value filter to eliminate individual outliers.
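
The patent names only a mean value filter; one plausible reading, with an assumed window size, is a sliding average over the per-frame decisions:

    import numpy as np

    def smooth_decisions(decisions, window=5):
        # Mean filter over the frame-wise speech/non-speech decisions:
        # a frame is kept as speech if the average decision in a window
        # around it exceeds 0.5, eliminating isolated outliers.
        # The window size is an assumption.
        d = np.asarray(decisions, dtype=float)
        kernel = np.ones(window) / window
        return np.convolve(d, kernel, mode="same") > 0.5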

In Figure 3, the amplitude A of the German word "zwanzig" (twenty) is plotted against time t. Below the graph, the result of the detection of whether a speech signal is present or not is shown for this signal.

The first row, labeled "Real", indicates the actual classification; "Noise" stands for "non-speech" and "Speech" for "speech".

The second row ("Label") indicates the classification produced by a conventional VAD used for labeling.

The third row, labeled "N-VAD", finally shows the detection achieved by the device and method according to the invention with a predefined class number N = 625. As can be seen, this detection and classification agrees much better with reality than that produced by the conventional VAD. This is particularly noticeable in that even pauses between individual syllables are detected as "non-speech".
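
Tying the sketches above together, a detection phase along the lines of Figures 1 to 3 could, under the same assumptions, read:

    # Usage example combining the hypothetical sketches above:
    # unsupervised training, association, then detection with smoothing.
    train_feats = extract_feature_vectors(training_signal)    # phase 1
    weights = train_som(train_feats)
    labeled_feats = extract_feature_vectors(labeled_signal)   # phase 2
    class_is_speech = associate_classes(weights, labeled_feats, frame_labels)
    test_feats = extract_feature_vectors(test_signal)         # detection
    raw = [class_is_speech[classify(weights, v)] for v in test_feats]
    vad_decisions = smooth_decisions(raw)

Here training_signal, labeled_signal, frame_labels, and test_signal stand for assumed input data, not quantities named in the patent.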

The invention provides in particular the following advantages:

Dissimilar feature vectors are no longer forced into the same class, but are assigned to a class solely on the basis of a similarity criterion. This increases the accuracy of the classification.

Inaccuracies in labeling the audio signals do not affect the actual training process, since unsupervised learning is used. Typically, for instance, short speech pauses between individual syllables are not captured during labeling but are assigned to the class "speech", although the background noise predominates in such a pause. In the proposed method based on unsupervised learning, this short pause is assigned to the feature vectors that correspond to it.

The method is independent of the language and/or content of the spoken text.

Overall, the accuracy of the VAD is improved, which is reflected in better results in applications built on it.

In line with this increased accuracy, the invention can preferably also be used during enrollment in biometric speech recognition to detect word boundaries, since previous methods based on signal energy repeatedly lead to errors and thus to a security risk in biometric authentication.

Claims (9)

  1. Device for detecting whether a speech signal is present or not, with
    - Means for classifying a signal into one of more than two classes,
    - Means for deciding whether the signal is a speech signal or not, depending on the class to which the signal is classified,
    characterized in that the following means are provided:
    - Means for extracting feature vectors from the speech signal,
    - Means for classifying, in a learning process, the extracted feature vectors with the aid of a self-organizing cluster method into one of more than two automatically formed classes,
    - Means for classifying, in an association process, the classes from the learning process as "speech" or "non-speech" respectively.
  2. Device in accordance with claim 1,
    characterized in that
    the number of the more than two classes is equal to or greater than 10, especially equal to or greater than 64.
  3. Device in accordance with claim 1,
    characterized in that
    the automatically formed classes are classes which are formed by a neural network.
  4. Device in accordance with one of the previous claims,
    characterized in that
    the device features a neural network for classification of the signal into one of more than two classes.
  5. Device in accordance with one of the claims 3 or 4,
    characterized in that
    the neural network is a Kohonen network.
  6. Device in accordance with one of the previous claims,
    characterized in that
    the device is a mobile terminal, especially a mobile telephone.
  7. Biometric method, in which a device in accordance with one of the claims 1 to 6 is used.
  8. Method for detecting whether a speech signal is present or not, in which
    - a signal is classified into one of more than two classes which are clustered into self-organizing clusters,
    - depending on the class into which the signal is classified, a decision is made as to whether the signal is a speech signal or not,
    characterized in that,
    - feature vectors are extracted from the speech signal,
    - in a learning process the extracted feature vectors are classified with the aid of a self-organizing cluster process into one of more than two automatically formed classes,
    - in an association process the classes from the learning process are classified as "speech" or "non-speech" respectively.
  9. Program product for a data processing system, containing code sections with which all steps of a method in accordance with claim 7 or 8 are executed when the program product runs on the data processing system.
EP20030102639 2002-09-27 2003-08-25 Voice activity detection based on unsupervised trained clustering Expired - Fee Related EP1406244B1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
DE2002145107 DE10245107B4 (en) 2002-09-27 2002-09-27 Voice Activity Detection based on unsupervised trained clustering methods
DE10245107 2002-09-27

Publications (3)

Publication Number Publication Date
EP1406244A2 EP1406244A2 (en) 2004-04-07
EP1406244A3 EP1406244A3 (en) 2005-01-12
EP1406244B1 true EP1406244B1 (en) 2006-10-11

Family

ID=31984148

Family Applications (1)

Application Number Title Priority Date Filing Date
EP20030102639 Expired - Fee Related EP1406244B1 (en) 2002-09-27 2003-08-25 Voice activity detection based on unsupervised trained clustering

Country Status (3)

Country Link
EP (1) EP1406244B1 (en)
DE (2) DE10245107B4 (en)
ES (1) ES2269917T3 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210359872A1 (en) * 2020-05-18 2021-11-18 Avaya Management L.P. Automatic correction of erroneous audio setting

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102006021427B4 (en) * 2006-05-05 2008-01-17 Giesecke & Devrient Gmbh Method and device for personalizing cards

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4802221A (en) * 1986-07-21 1989-01-31 Ncr Corporation Digital system and method for compressing speech signals for storage and transmission
EP0435458B1 (en) * 1989-11-28 1995-02-01 Nec Corporation Speech/voiceband data discriminator
JP3088171B2 (en) * 1991-02-12 2000-09-18 三菱電機株式会社 Self-organizing pattern classification system and classification method
DE4442613C2 (en) * 1994-11-30 1998-12-10 Deutsche Telekom Mobil System for determining the network quality in communication networks from the end-user and operator's point of view, in particular cellular networks
IT1281001B1 (en) * 1995-10-27 1998-02-11 Cselt Centro Studi Lab Telecom PROCEDURE AND EQUIPMENT FOR CODING, HANDLING AND DECODING AUDIO SIGNALS.
US5737716A (en) * 1995-12-26 1998-04-07 Motorola Method and apparatus for encoding speech using neural network technology for speech classification
US6564198B1 (en) * 2000-02-16 2003-05-13 Hrl Laboratories, Llc Fuzzy expert system for interpretable rule extraction from neural networks

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210359872A1 (en) * 2020-05-18 2021-11-18 Avaya Management L.P. Automatic correction of erroneous audio setting
US11502863B2 (en) * 2020-05-18 2022-11-15 Avaya Management L.P. Automatic correction of erroneous audio setting

Also Published As

Publication number Publication date
EP1406244A3 (en) 2005-01-12
DE10245107B4 (en) 2006-01-26
DE10245107A1 (en) 2004-04-08
EP1406244A2 (en) 2004-04-07
ES2269917T3 (en) 2007-04-01
DE50305333D1 (en) 2006-11-23

Similar Documents

Publication Publication Date Title
DE69432570T2 (en) voice recognition
DE69814104T2 (en) DISTRIBUTION OF TEXTS AND IDENTIFICATION OF TOPICS
DE2953262C2 (en)
DE69722980T2 (en) Recording of voice data with segments of acoustically different environments
EP0604476B1 (en) Process for recognizing patterns in time-varying measurement signals
DE60108373T2 (en) Method for detecting emotions in speech signals using speaker identification
DE60124559T2 (en) DEVICE AND METHOD FOR LANGUAGE RECOGNITION
DE69924596T2 (en) Selection of acoustic models by speaker verification
DE60128270T2 (en) Method and system for generating speaker recognition data, and method and system for speaker recognition
DE19824354A1 (en) Device for verifying signals
DE2422028A1 (en) CIRCUIT ARRANGEMENT FOR IDENTIFYING A SHAPE FREQUENCY IN A SPOKEN WORD
DE69813597T2 (en) PATTERN RECOGNITION USING MULTIPLE REFERENCE MODELS
DE112018007847B4 (en) INFORMATION PROCESSING DEVICE, INFORMATION PROCESSING METHOD AND PROGRAM
WO1993002448A1 (en) Method and device for recognizing individual words of spoken speech
EP1406244B1 (en) Voice activity detection based on unsupervised trained clustering
DE10209324C1 (en) Method for automatic detection of different speakers in speech recognition system correlates speech signal with speaker-independent and speaker-dependent code books
DE19705471C2 (en) Method and circuit arrangement for speech recognition and for voice control of devices
EP0965088B1 (en) Reliable identification with preselection and rejection class
EP0817167B1 (en) Speech recognition method and device for carrying out the method
DE112018006597B4 (en) Speech processing device and speech processing method
CN106971725B (en) Voiceprint recognition method and system with priority
DE3935308C1 (en) Speech recognition method by digitising microphone signal - using delta modulator to produce continuous of equal value bits for data reduction
DE60002868T2 (en) Method and device for analyzing a sequence of spoken numbers
DE102008040002A1 (en) Speaker identification method, involves determining statistical distribution of extracted portions of speech signal, and determining threshold value for classification of speaker by using determined statistical distribution
DE10308611A1 (en) Determination of the likelihood of confusion between vocabulary entries in phoneme-based speech recognition

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LI LU MC NL PT RO SE SI SK TR

AX Request for extension of the european patent

Extension state: AL LT LV MK

PUAL Search report despatched

Free format text: ORIGINAL CODE: 0009013

AK Designated contracting states

Kind code of ref document: A3

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LI LU MC NL PT RO SE SI SK TR

AX Request for extension of the european patent

Extension state: AL LT LV MK

17P Request for examination filed

Effective date: 20050711

AKX Designation fees paid

Designated state(s): DE ES FR GB

GRAP Despatch of communication of intention to grant a patent

Free format text: ORIGINAL CODE: EPIDOSNIGR1

GRAS Grant fee paid

Free format text: ORIGINAL CODE: EPIDOSNIGR3

GRAA (expected) grant

Free format text: ORIGINAL CODE: 0009210

AK Designated contracting states

Kind code of ref document: B1

Designated state(s): DE ES FR GB

REG Reference to a national code

Ref country code: GB

Ref legal event code: FG4D

Free format text: NOT ENGLISH

GBT Gb: translation of ep patent filed (gb section 77(6)(a)/1977)

Effective date: 20061011

REF Corresponds to:

Ref document number: 50305333

Country of ref document: DE

Date of ref document: 20061123

Kind code of ref document: P

REG Reference to a national code

Ref country code: ES

Ref legal event code: FG2A

Ref document number: 2269917

Country of ref document: ES

Kind code of ref document: T3

ET Fr: translation filed
PLBE No opposition filed within time limit

Free format text: ORIGINAL CODE: 0009261

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT

26N No opposition filed

Effective date: 20070712

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: ES

Payment date: 20130925

Year of fee payment: 11

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: GB

Payment date: 20130814

Year of fee payment: 11

Ref country code: FR

Payment date: 20130814

Year of fee payment: 11

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: DE

Payment date: 20141020

Year of fee payment: 12

GBPC Gb: european patent ceased through non-payment of renewal fee

Effective date: 20140825

REG Reference to a national code

Ref country code: FR

Ref legal event code: ST

Effective date: 20150430

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: GB

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20140825

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: FR

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20140901

REG Reference to a national code

Ref country code: DE

Ref legal event code: R119

Ref document number: 50305333

Country of ref document: DE

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: ES

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20140826

Ref country code: DE

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20160301