EP1406244B1 - Détection d'activité vocale basée sur l'agrégation non-supervisée - Google Patents


Info

Publication number
EP1406244B1
EP1406244B1 (application EP20030102639 / EP03102639A)
Authority
EP
European Patent Office
Prior art keywords
signal
speech
classes
accordance
classified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
EP20030102639
Other languages
German (de)
English (en)
Other versions
EP1406244A3 (fr)
EP1406244A2 (fr)
Inventor
Stephan Dr. Grashey
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Siemens AG
Original Assignee
Siemens AG
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Siemens AG filed Critical Siemens AG
Publication of EP1406244A2 publication Critical patent/EP1406244A2/fr
Publication of EP1406244A3 publication Critical patent/EP1406244A3/fr
Application granted Critical
Publication of EP1406244B1 publication Critical patent/EP1406244B1/fr
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals

Definitions

  • A Voice Activity Detector (VAD) is a device that distinguishes between speech with background noise ("speech") and background noise alone ("non-speech").
  • The input of a VAD can be, for example, the voice signal of a communication terminal recorded by a microphone. While the user speaks, the signal is composed of his voice and the background noise (for example, street noise). During pauses in speaking, by contrast, the signal consists solely of the background noise.
  • The output of a Voice Activity Detector then adds to the input signal the information whether it contains speech or not.
  • In applications such as Voice over IP (VoIP), a VAD can be used for data reduction, so that only the speech portions of the signal need to be stored or transmitted.
  • In speech recognition, a VAD allows faster and better recognition, because recognition can concentrate on the pure speech passages of the audio signal.
  • VADs are either configured on the basis of heuristics or trained during a training phase.
  • In each case the input is the suitably pre-processed audio signal. A feature extraction step yields feature vectors whose size depends on the number of features used.
  • The simplest, but still widely used, heuristic is to judge a signal against a specific, fixed energy threshold. If the signal energy exceeds the threshold, "speech" is assumed, otherwise "non-speech".
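This fixed-threshold heuristic can be sketched in a few lines; the frame length and threshold value below are illustrative assumptions, not values from the patent:

```python
def frame_energy(frame):
    """Short-time energy of one frame: the sum of squared samples."""
    return sum(s * s for s in frame)

def energy_vad(samples, frame_len=160, threshold=0.5):
    """Label each frame "speech" if its energy exceeds a fixed threshold,
    otherwise "non-speech". frame_len and threshold are illustrative."""
    labels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        labels.append("speech" if frame_energy(frame) > threshold else "non-speech")
    return labels

# A loud burst followed by near-silence:
print(energy_vad([0.5] * 160 + [0.01] * 160))  # ['speech', 'non-speech']
```

Its obvious weakness is that any sufficiently loud noise also exceeds the threshold and is taken for speech.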
  • Another example is determining the zero-crossing rate of the autocorrelation function of the speech signal and applying a corresponding threshold to decide whether a speech signal is present or not.
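The autocorrelation-based heuristic can be sketched similarly; the patent gives no concrete values, so the zero-crossing-rate limit of 0.25 below is an illustrative assumption:

```python
import math

def autocorrelation(x):
    """Autocorrelation of a frame for all non-negative lags."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(n)]

def zero_crossing_rate(seq):
    """Fraction of adjacent value pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(seq, seq[1:]) if a * b < 0)
    return crossings / max(len(seq) - 1, 1)

def zcr_vad(frame, zcr_threshold=0.25):
    """A periodic (voiced) frame has a slowly oscillating autocorrelation,
    hence a low zero-crossing rate; noise crosses zero far more often."""
    rate = zero_crossing_rate(autocorrelation(frame))
    return "speech" if rate < zcr_threshold else "non-speech"

# A 5-cycle sinusoid (voiced-like) versus a rapidly alternating "noise" frame:
voiced = [math.sin(2 * math.pi * 5 * i / 100) for i in range(100)]
noise = [(-1) ** i for i in range(100)]
print(zcr_vad(voiced), zcr_vad(noise))  # speech non-speech
```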
  • VADs that are trained during a training phase include statistical VADs and neural networks. These are trained with data for which it is known when speech and when noise occurs, i.e. data that has been labeled beforehand, for example by hand. Examples of methods with which it can be decided in this way whether a speech signal is present or not are given in Stadermann, J.: "Speech/Pause Detection in Automatic Speech Recognition", University of Duisburg, Diploma Thesis, 1999, pages 28-36.
  • VADs, in particular for wireless communication, are described in El-Maleh, K. and Kabal, P.: "Comparison of voice activity detection algorithms for wireless personal communication systems", Proc. IEEE Canadian Conference on Electrical and Computer Engineering, St. John's, Newfoundland, May 1997, pages 470-473.
  • The object of the invention is to enable a more precise distinction between speech and non-speech. Emphasis is also placed on automatic adaptability to different noise situations, speakers, or languages.
  • Conventional VADs use only N = 2 classes (speech/non-speech).
  • A much better classification can be achieved if a signal is not immediately assigned to the speech or non-speech class, but is first placed into one of a plurality of more than two classes. In this way the numerous different characteristics of speech and noise can be better taken into account.
  • The plurality is preferably greater than or equal to 10, in particular greater than or equal to 64. Depending on the class into which the signal is placed, it is then decided whether the signal is a speech signal or not.
  • While it is known to divide signals already recognized as speech, i.e. after the Voice Activity Detection, into more than two classes, a special feature of the invention is that two or more classes may be provided such that a signal assigned to one of them is decided to be non-speech.
  • The classes may be organized in clusters, so that similar classes are grouped adjacently or in groups.
  • The classes are preferably formed automatically by a self-organizing clustering method that is trained in a training phase, in particular by means of test signals.
  • For this, a neural network is preferably used, in particular a Kohonen network with the architecture of a self-organizing map.
  • This trained and structured network is then preferably also used directly in the detection phase, in which it is decided whether a signal is a speech signal or not.
  • The described device can be used in biometric speech recognition during enrollment, so that only the voice of the person being enrolled is captured as a reference, and not larger or smaller portions of the background noise. Otherwise a person who has a similar noise environment during verification might be wrongly authenticated by the system.
  • A method for detecting whether a speech signal is present or not can be constructed analogously to the described device. The same applies to its preferred embodiments.
  • A program product for a data processing system, containing code sections with which one of the described methods can be executed on the data processing system, can be realized by suitably implementing the method in a programming language and translating it into code executable by the data processing system.
  • The code sections are stored for this purpose.
  • A program product is understood to mean the program as a tradable product. It can be in any form, for example on paper, on a computer-readable medium, or distributed over a network.
  • Conventional VADs have the problem that features extracted from the signal are divided into only two classes, although their characteristics differ widely within one and the same class. For example, in a speech signal, features that represent unvoiced sounds tend to be very different from those that reflect voiced sounds. Nevertheless, both are assigned to the same class ("speech").
  • For this reason, a self-organizing clustering method with N > 2 classes is used.
  • N is chosen arbitrarily, but meaningfully.
  • For training, therefore, only feature vectors extracted from an audio signal are used, without a class affiliation being specified at the same time.
  • More generally, the classifier therefore has a larger number m of classes representing "speech" and a larger number n of classes representing "non-speech" (m + n = N > 2).
  • This first phase is illustrated with reference to FIG.
  • This preprocessing is preferably the same as that used for the subsequent speech recognition; this saves a second preprocessing step.
  • The preprocessing 2 extracts feature vectors 3 from the audio signals of the audio database 1; these specify properties of the audio signals. The feature vectors 3 are fed to the input neurons of a neural network 4.
  • The neural network 4 is a Kohonen network with the architecture of a self-organizing map (SOM). It has the property that a local neighborhood relationship exists between the individual neurons, so that the reference vectors representing the individual classes are spatially ordered once training is complete.
  • The neural network is trained on the basis of a database that contains, for example, speech and noise with equal frequency.
  • The training of such a network constitutes a self-organizing clustering process with unsupervised learning.
  • The result of the classifier training is a class representation 5.
  • In the association phase, each individual class of the classifier 4 (the neural network) is assigned to one of the two classes speech or non-speech.
  • The classifier 4 itself is operated in classification mode, that is, for each feature vector 3 it outputs the associated class 6.
  • The association unit 7 is operated in training mode, that is, on the basis of the labeled audio signals 8 it learns to associate each of the classifier classes with "speech" or "non-speech". It is determined how many test signals labeled "speech" or "non-speech" were assigned to each class. Depending on this result, each class is declared a speech or a non-speech class in an association step. The result is the class assignment 9 of the VAD.
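The association step described above amounts to a majority vote per classifier class over the labeled frames; a minimal sketch under that reading (function names are hypothetical):

```python
from collections import Counter, defaultdict

def associate_classes(class_ids, labels):
    """Association step: map each classifier class to "speech" or
    "non-speech" by majority vote over the labeled training frames.

    class_ids -- class index the classifier assigned to each frame
    labels    -- the known "speech"/"non-speech" label of each frame
    """
    votes = defaultdict(Counter)
    for class_id, label in zip(class_ids, labels):
        votes[class_id][label] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in votes.items()}

def vad_decide(class_id, assignment):
    """Detection phase: a plain table lookup once association is done."""
    return assignment[class_id]

# Class 0 mostly received speech frames, class 1 mostly noise frames:
assignment = associate_classes(
    [0, 0, 0, 1, 1, 1, 0],
    ["speech", "speech", "non-speech",
     "non-speech", "non-speech", "speech", "speech"])
print(assignment)  # {0: 'speech', 1: 'non-speech'}
```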
  • The results obtained are further improved by using a mean-value filter to eliminate individual outliers.
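Such a mean-value filter over the per-frame decisions might look as follows (the window size is an illustrative choice; rounding the local mean amounts to a majority vote over the neighborhood):

```python
def smooth_decisions(decisions, window=3):
    """Remove isolated outliers from a 0/1 (non-speech/speech) sequence
    by replacing each frame with the rounded mean of its neighborhood."""
    half = window // 2
    smoothed = []
    for i in range(len(decisions)):
        lo, hi = max(0, i - half), min(len(decisions), i + half + 1)
        neighborhood = decisions[lo:hi]
        # Rounding the mean: ties and majorities of 1s yield 1.
        smoothed.append(1 if 2 * sum(neighborhood) >= len(neighborhood) else 0)
    return smoothed

# The isolated "non-speech" frame inside a speech run is removed:
print(smooth_decisions([1, 1, 0, 1, 1, 0, 0, 0]))  # [1, 1, 1, 1, 1, 0, 0, 0]
```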
  • "Label" indicates the classification made by a conventional VAD used for labeling.
  • This detection and classification corresponds much more closely to reality than that of the traditional VAD. This is especially noticeable in that even pauses between individual syllables are detected as "non-speech".
  • Dissimilar feature vectors are no longer forced into the same class, but are assigned to a class solely on the basis of a similarity criterion. This increases the accuracy of the classification.
  • The method is independent of the language and/or the content of the spoken text.
  • The invention can also preferably be used during enrollment in biometric speech recognition to recognize word boundaries, since previous methods based on signal energy repeatedly led to errors and thus to a security risk in biometric authentication.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
  • Image Analysis (AREA)
  • Telephonic Communication Services (AREA)
  • Character Discrimination (AREA)

Claims (9)

  1. Device for detecting the presence or absence of a speech signal, comprising
    - means for classifying a signal into one class out of more than two classes,
    - means for determining whether the signal is a speech signal or not, depending on the class into which the signal is classified,
    characterized in that the following means are provided:
    - means for extracting feature vectors from the speech signal,
    - means for classifying the feature vectors, during a learning process, into one of more than two automatically formed classes by means of a self-organizing clustering method,
    - means for classifying the classes of the learning process, in an association process, as "speech" or "non-speech".
  2. Device according to claim 1, characterized in that the number of the more than two classes is greater than or equal to 10, in particular greater than or equal to 64.
  3. Device according to claim 1, characterized in that the automatically formed classes are classes formed by a neural network.
  4. Device according to one of the preceding claims, characterized in that the device for classifying the signal into one of more than two classes comprises a neural network.
  5. Device according to claim 3 or 4, characterized in that the neural network is a Kohonen network.
  6. Device according to one of the preceding claims, characterized in that the device is a mobile terminal, in particular a mobile telephone.
  7. Biometric method in which a device according to one of claims 1 to 6 is used.
  8. Method for detecting the presence or absence of a speech signal, in which
    - a signal is classified into one of more than two classes that comprise an automatically formed cluster organization,
    - depending on the class into which the signal is classified, it is decided whether the signal is a speech signal or not,
    characterized in that
    - feature vectors are extracted from the speech signal,
    - the extracted feature vectors are classified, during a learning process, into one of more than two automatically formed classes by means of a self-organizing clustering method,
    - the classes of the learning process are classified, in an association process, as "speech" or "non-speech".
  9. Program product for a data processing system, containing code sections with which all the steps of a method according to one of claims 7 to 8 are executed when the program product runs on a data processing system.
EP20030102639 2002-09-27 2003-08-25 Détection d'activité vocale basée sur l'agrégation non-supervisée Expired - Fee Related EP1406244B1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
DE10245107 2002-09-27
DE2002145107 DE10245107B4 (de) 2002-09-27 2002-09-27 Voice Activity Detection auf Basis von unüberwacht trainierten Clusterverfahren

Publications (3)

Publication Number Publication Date
EP1406244A2 EP1406244A2 (fr) 2004-04-07
EP1406244A3 EP1406244A3 (fr) 2005-01-12
EP1406244B1 true EP1406244B1 (fr) 2006-10-11

Family

ID=31984148

Family Applications (1)

Application Number Title Priority Date Filing Date
EP20030102639 Expired - Fee Related EP1406244B1 (fr) 2002-09-27 2003-08-25 Détection d'activité vocale basée sur l'agrégation non-supervisée

Country Status (3)

Country Link
EP (1) EP1406244B1 (fr)
DE (2) DE10245107B4 (fr)
ES (1) ES2269917T3 (fr)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102006021427B4 (de) * 2006-05-05 2008-01-17 Giesecke & Devrient Gmbh Verfahren und Vorrichtung zum Personalisieren von Karten

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4802221A (en) * 1986-07-21 1989-01-31 Ncr Corporation Digital system and method for compressing speech signals for storage and transmission
EP0435458B1 (fr) * 1989-11-28 1995-02-01 Nec Corporation Discriminateur entre parole et autres données transmises dans la bande vocale
JP3088171B2 (ja) * 1991-02-12 2000-09-18 三菱電機株式会社 自己組織型パタ−ン分類システム及び分類方法
DE4442613C2 (de) * 1994-11-30 1998-12-10 Deutsche Telekom Mobil System zum Ermitteln der Netzgüte in Nachrichtennetzen aus Endnutzer- und Betreibersicht, insbesondere Mobilfunknetzen
IT1281001B1 (it) * 1995-10-27 1998-02-11 Cselt Centro Studi Lab Telecom Procedimento e apparecchiatura per codificare, manipolare e decodificare segnali audio.
US5737716A (en) * 1995-12-26 1998-04-07 Motorola Method and apparatus for encoding speech using neural network technology for speech classification
US6564198B1 (en) * 2000-02-16 2003-05-13 Hrl Laboratories, Llc Fuzzy expert system for interpretable rule extraction from neural networks

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210359872A1 (en) * 2020-05-18 2021-11-18 Avaya Management L.P. Automatic correction of erroneous audio setting
US11502863B2 (en) * 2020-05-18 2022-11-15 Avaya Management L.P. Automatic correction of erroneous audio setting

Also Published As

Publication number Publication date
DE50305333D1 (de) 2006-11-23
ES2269917T3 (es) 2007-04-01
DE10245107A1 (de) 2004-04-08
EP1406244A3 (fr) 2005-01-12
EP1406244A2 (fr) 2004-04-07
DE10245107B4 (de) 2006-01-26

Similar Documents

Publication Publication Date Title
DE69432570T2 (de) Spracherkennung
DE69814104T2 (de) Aufteilung von texten und identifizierung von themen
DE2953262C2 (fr)
DE69722980T2 (de) Aufzeichnung von Sprachdaten mit Segmenten von akustisch verschiedenen Umgebungen
EP0604476B1 (fr) Procede de reconnaissance des formes dans des signaux de mesure variables dans le temps
DE60108373T2 (de) Verfahren zur Detektion von Emotionen in Sprachsignalen unter Verwendung von Sprecheridentifikation
DE60124559T2 (de) Einrichtung und verfahren zur spracherkennung
DE69924596T2 (de) Auswahl akustischer Modelle mittels Sprecherverifizierung
DE60128270T2 (de) Verfahren und System zur Erzeugung von Sprechererkennungsdaten, und Verfahren und System zur Sprechererkennung
KR101785500B1 (ko) 근육 조합 최적화를 통한 안면근육 표면근전도 신호기반 단모음인식 방법
DE19824354A1 (de) Vorrichtung zur Verifizierung von Signalen
DE2422028A1 (de) Schaltungsanordnung zur identifizierung einer formantfrequenz in einem gesprochenen wort
DE69813597T2 (de) Mustererkennung, die mehrere referenzmodelle verwendet
DE112018007847B4 (de) Informationsverarbeitungsvorrichtung, informationsverarbeitungsverfahren und programm
WO1993002448A1 (fr) Procede et dispositif pour la reconnaissance de mots isoles du langage parle
EP1406244B1 (fr) Détection d'activité vocale basée sur l'agrégation non-supervisée
DE10209324C1 (de) Automatische Detektion von Sprecherwechseln in sprecheradaptiven Spracherkennungssystemen
DE19705471C2 (de) Verfahren und Schaltungsanordnung zur Spracherkennung und zur Sprachsteuerung von Vorrichtungen
EP0965088B1 (fr) Identification sure avec preselection et classe de rebuts
EP0817167B1 (fr) Procédé de reconnaissance de la parole et dispositif de mise en oeuvre du procédé
DE112018006597B4 (de) Sprachverarbeitungsvorrichtung und Sprachverarbeitungsverfahren
CN106971725B (zh) 一种具有优先级的声纹识方法和系统
DE3935308C1 (en) Speech recognition method by digitising microphone signal - using delta modulator to produce continuous of equal value bits for data reduction
DE60002868T2 (de) Verfahren und Einrichtung zur Analyse einer Folge von gesprochenen Nummern
DE102008040002A1 (de) Verfahren zur szenariounabhängigen Sprechererkennung

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LI LU MC NL PT RO SE SI SK TR

AX Request for extension of the european patent

Extension state: AL LT LV MK

PUAL Search report despatched

Free format text: ORIGINAL CODE: 0009013

AK Designated contracting states

Kind code of ref document: A3

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LI LU MC NL PT RO SE SI SK TR

AX Request for extension of the european patent

Extension state: AL LT LV MK

17P Request for examination filed

Effective date: 20050711

AKX Designation fees paid

Designated state(s): DE ES FR GB

GRAP Despatch of communication of intention to grant a patent

Free format text: ORIGINAL CODE: EPIDOSNIGR1

GRAS Grant fee paid

Free format text: ORIGINAL CODE: EPIDOSNIGR3

GRAA (expected) grant

Free format text: ORIGINAL CODE: 0009210

AK Designated contracting states

Kind code of ref document: B1

Designated state(s): DE ES FR GB

REG Reference to a national code

Ref country code: GB

Ref legal event code: FG4D

Free format text: NOT ENGLISH

GBT Gb: translation of ep patent filed (gb section 77(6)(a)/1977)

Effective date: 20061011

REF Corresponds to:

Ref document number: 50305333

Country of ref document: DE

Date of ref document: 20061123

Kind code of ref document: P

REG Reference to a national code

Ref country code: ES

Ref legal event code: FG2A

Ref document number: 2269917

Country of ref document: ES

Kind code of ref document: T3

ET Fr: translation filed
PLBE No opposition filed within time limit

Free format text: ORIGINAL CODE: 0009261

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT

26N No opposition filed

Effective date: 20070712

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: ES

Payment date: 20130925

Year of fee payment: 11

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: GB

Payment date: 20130814

Year of fee payment: 11

Ref country code: FR

Payment date: 20130814

Year of fee payment: 11

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: DE

Payment date: 20141020

Year of fee payment: 12

GBPC Gb: european patent ceased through non-payment of renewal fee

Effective date: 20140825

REG Reference to a national code

Ref country code: FR

Ref legal event code: ST

Effective date: 20150430

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: GB

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20140825

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: FR

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20140901

REG Reference to a national code

Ref country code: DE

Ref legal event code: R119

Ref document number: 50305333

Country of ref document: DE

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: ES

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20140826

Ref country code: DE

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20160301