EP1406244B1 - Voice activity detection based on unsupervised clustering - Google Patents
Voice activity detection based on unsupervised clustering
- Publication number
- EP1406244B1 EP1406244B1 EP20030102639 EP03102639A EP1406244B1 EP 1406244 B1 EP1406244 B1 EP 1406244B1 EP 20030102639 EP20030102639 EP 20030102639 EP 03102639 A EP03102639 A EP 03102639A EP 1406244 B1 EP1406244 B1 EP 1406244B1
- Authority
- EP
- European Patent Office
- Prior art keywords
- signal
- speech
- classes
- accordance
- classified
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
Definitions
- A Voice Activity Detector (VAD) is a device that distinguishes between speech with background noise ("speech") and background noise alone ("non-speech").
- The input of a VAD can be, for example, the voice signal of a communication terminal recorded by a microphone. While the user speaks, the signal is composed of the user's voice and the background noise (for example, street noise). During pauses between utterances, in contrast, the signal consists solely of the background noise.
- The output of a Voice Activity Detector adds to the input signal the information whether it contains speech or not.
- For data reduction, for example in Voice over IP (VoIP), a VAD can be used to store or transmit only the speech signal.
- In speech recognition, a VAD allows faster and more accurate recognition, because the recognizer can focus on the pure speech passages of the audio signal.
- VADs are either configured on the basis of heuristics or trained during a training phase.
- In each case, the input signal is the suitably pre-processed audio signal.
- The preprocessing performs a property extraction; depending on the number of properties used, property vectors of different sizes are obtained.
- The simplest, but still widely used, heuristic is to compare a signal against a specific, fixed energy threshold. If the signal energy exceeds the threshold, "speech" is assumed, otherwise "non-speech".
- Another example is the determination of the zero crossing rate of the autocorrelation function of the speech signal and a corresponding threshold for discriminating whether a speech signal is present or not.
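The two threshold heuristics above can be sketched as follows. This is an illustrative sketch, not code from the patent; the frame length, sampling rate, and threshold value are assumed for the example.

```python
import numpy as np

def energy_vad(signal, frame_len=160, threshold=0.01):
    """Simple heuristic VAD: a frame is "speech" if its mean
    energy exceeds a fixed threshold, otherwise "non-speech"."""
    decisions = []
    for i in range(len(signal) // frame_len):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        energy = np.mean(frame ** 2)                 # mean energy of the frame
        decisions.append(bool(energy > threshold))   # above threshold -> speech
    return decisions

# Toy input: one near-silent frame followed by one loud tone frame (8 kHz).
quiet = 0.001 * np.ones(160)
loud = 0.5 * np.sin(2 * np.pi * 200 * np.arange(160) / 8000)
print(energy_vad(np.concatenate([quiet, loud])))  # → [False, True]
```

The zero-crossing-rate heuristic works analogously, with the count of sign changes per frame replacing the energy as the thresholded property.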
- VADs that are trained during a training phase include statistical VADs and neural networks. These are trained with data for which it is known when speech and when noise occurs, i.e. data that has previously been labeled, for example by hand. Examples of methods that decide in this way whether a speech signal is present or not are given in Stadermann, J.: "Speech/Pause Detection in Automatic Speech Recognition", University of Duisburg, diploma thesis, 1999, pages 28-36.
- VADs, in particular for wireless communication, are described in El-Maleh, K. and Kabal, P.: "Comparison of voice activity detection algorithms for wireless personal communication systems", Proc. IEEE Canadian Conference on Electrical and Computer Engineering, St. John's, Newfoundland, May 1997, pages 470-473.
- The object of the invention is to make possible a more precise distinction between speech and non-speech. Value is also placed on automatic adaptability to different noise situations, speakers, or languages.
- Conventionally, the signal is divided into N = 2 classes (speech / non-speech).
- A much better classification can be made if a signal is not immediately assigned to the speech or non-speech class, but is first divided into one class of a plurality of more than three classes. In this way, the numerous different properties of speech and noise can be better taken into account.
- the plurality is preferably greater than or equal to 10, in particular greater than or equal to 64. Depending on the class into which the signal is divided, it is then decided whether the signal is a speech signal or not.
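The resulting two-stage decision (first assign the signal to one of the N > 2 classes, then decide speech/non-speech from that class) amounts to a simple lookup. The class indices and their speech/non-speech assignments below are hypothetical, for illustration only.

```python
# Hypothetical result of the association phase: each of N = 6 classifier
# classes has been declared a speech or a non-speech class.
class_assignment = {0: "non-speech", 1: "non-speech", 2: "speech",
                    3: "speech", 4: "speech", 5: "non-speech"}

def is_speech(class_index, assignment):
    """Second stage: decide from the class into which the first
    stage divided the signal whether it is a speech signal."""
    return assignment[class_index] == "speech"

print(is_speech(2, class_assignment))  # → True
```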
- It is also possible to divide speech signals that are recognized as such, i.e. after the voice activity detection, into more than two classes. As a special feature of the invention, two or more classes may be provided for which it is decided that the signal is not speech when it is divided into one of them.
- The classes may be organized in clusters, so that similar classes are grouped adjacently or in groups.
- Preferably, the classes are formed automatically in a self-organizing clustering process that is trained in a training phase, in particular by means of test signals.
- For this, a neural network is preferably used, in particular a Kohonen network with the network architecture of a self-organizing map.
- This trained and structured network is then preferably also used directly in the detection phase, in which it is decided whether a signal is a speech signal or not.
- The described device can be used in biometric speech recognition during enrollment, in order to capture as a reference only the voice of the person being enrolled, and not larger or smaller parts of the background noise. Otherwise, a person who has a similar noise environment during verification might be authenticated by the system.
- a method for detecting whether a speech signal is present or not can be constructed analogously to the described device. This also applies to its preferred embodiments.
- A program product for a data processing system, which contains code sections with which one of the described methods can be executed on the data processing system, can be realized by suitably implementing the method in a programming language and translating it into code executable by the data processing system.
- the code sections are stored for this purpose.
- A program product is understood here as the program as a tradable product. It can be present in any form, for example on paper, on a computer-readable medium, or distributed over a network.
- Conventional VADs have the problem that the properties extracted from the signal are divided into only two classes, although their characteristics differ widely within one and the same class. For example, in a speech signal, properties that represent unvoiced sounds tend to be very different from those that reflect voiced sounds. Nevertheless, both are assigned to the same class ("speech").
- Instead, a self-organizing clustering process with N > 2 classes is used.
- N is chosen arbitrarily, but meaningfully.
- For training, therefore, only property vectors extracted from an audio signal are used, without a class affiliation being specified at the same time.
- More generally, therefore, the classifier has a larger number m of classes representing "speech" and a larger number n of classes representing "non-speech" (m + n = N > 2).
- This first phase will be illustrated with reference to FIG.
- This preprocessing is preferably the same as that used for later speech recognition. This can save a second preprocessing.
- The preprocessing 2 extracts property vectors 3, in which properties of the audio signals are specified, from the audio signals of the audio database 1. These property vectors 3 are fed to the input neurons of a neural network 4.
- the neural network 4 is a Kohonen network with the network architecture of a self-organizing map (SOM: Self-Organizing Map). It has the property that there is a local neighborhood relationship between the individual neurons, so that the reference vectors representing the individual classes are spatially ordered after completion of the training.
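A minimal sketch of such a self-organizing map is given below, assuming a one-dimensional grid of units; the learning-rate and neighbourhood parameters are illustrative assumptions, as the patent does not specify them.

```python
import numpy as np

def train_som(vectors, n_classes=8, epochs=15, lr0=0.5, radius0=2.0, seed=0):
    """Train a 1-D self-organizing map: each of the n_classes units holds a
    reference vector; units near the winning unit on the grid are pulled
    along with it, so similar classes end up spatially adjacent."""
    rng = np.random.default_rng(seed)
    weights = rng.standard_normal((n_classes, vectors.shape[1])) * 0.1
    for epoch in range(epochs):
        lr = lr0 * (1 - epoch / epochs)                     # decaying learning rate
        radius = max(radius0 * (1 - epoch / epochs), 0.5)   # shrinking neighbourhood
        for x in vectors:
            winner = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
            for j in range(n_classes):
                # Gaussian neighbourhood: units close to the winner learn more.
                h = np.exp(-((j - winner) ** 2) / (2 * radius ** 2))
                weights[j] += lr * h * (x - weights[j])
    return weights

def classify(x, weights):
    """Classification mode: return the index of the closest reference vector."""
    return int(np.argmin(np.linalg.norm(weights - x, axis=1)))
```

After training, the index returned by `classify` plays the role of the class output in classification mode; because neighbouring units are updated together, similar reference vectors come to occupy adjacent grid positions.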
- The neural network is trained on the basis of a database which, for example, contains speech and noise with equal frequency.
- the training of such a network represents a self-organizing cluster process with unsupervised learning.
- the result of the classifier training is a class representation 5.
- In the association phase, each individual class of the classifier 4 in the form of the neural network is assigned to one of the two classes speech or non-speech.
- For this, the classifier 4 itself is operated in classification mode, that is, for each property vector 3 it outputs the associated class 6.
- The association unit 7 is operated in training mode, that is, on the basis of the labeled audio signals 8 it learns to associate each of the classifier classes with "speech" or "non-speech". It is determined, for each class, how many of the test signals assigned to it are "speech" and how many are "non-speech". Depending on this result, each class is declared in an association step as a speech or a non-speech class. The result is the class assignment 9 of the VAD.
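The counting and majority-vote declaration performed by the association unit can be sketched as follows; the class indices and labels form a made-up toy example.

```python
from collections import Counter

def associate_classes(class_outputs, labels, n_classes):
    """Association step: count, for each classifier class, how many labeled
    frames assigned to it were "speech" vs "non-speech", then declare the
    class by majority vote."""
    counts = [Counter() for _ in range(n_classes)]
    for cls, label in zip(class_outputs, labels):
        counts[cls][label] += 1
    # Classes that never occurred default to "non-speech" here (an assumption).
    return {cls: (counts[cls].most_common(1)[0][0] if counts[cls] else "non-speech")
            for cls in range(n_classes)}

# Toy data: classifier outputs for labeled frames.
outputs = [0, 0, 1, 1, 1, 2]
labels = ["speech", "speech", "non-speech", "non-speech", "speech", "non-speech"]
print(associate_classes(outputs, labels, 3))
# → {0: 'speech', 1: 'non-speech', 2: 'non-speech'}
```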
- The results obtained are further improved by using a mean value filter to eliminate individual outliers.
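One way to realize such a mean value filter, assuming binary frame decisions and an illustrative window length, is a moving-average majority vote:

```python
def smooth_decisions(decisions, window=5):
    """Mean-value filter over binary frame decisions: a frame is kept as
    speech only if the majority of frames in the surrounding window are
    speech, which eliminates isolated outliers."""
    half = window // 2
    smoothed = []
    for i in range(len(decisions)):
        lo, hi = max(0, i - half), min(len(decisions), i + half + 1)
        neighborhood = decisions[lo:hi]
        smoothed.append(sum(neighborhood) / len(neighborhood) > 0.5)
    return smoothed

raw = [True, True, False, True, True, False, False, False]
print(smooth_decisions(raw))  # the isolated False at index 2 is smoothed away
```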
- In the figures, "Label" indicates the classification made by a conventional VAD used for labeling.
- This detection and classification corresponds much more closely to reality than that of the traditional VAD. This is especially noticeable in that even pauses between individual syllables are detected as "non-speech".
- Dissimilar property vectors are no longer forced into the same class, but are assigned to a class on the basis of a similarity criterion alone. This increases the accuracy of the classification.
- the method is independent of the language and / or content of the spoken text.
- The invention can also preferably be used in the context of enrollment in biometric speech recognition to recognize word boundaries, since previous methods based on the signal energy repeatedly lead to errors and thus to a security risk in biometric authentication.
Claims (9)
- Device for the detection of the presence or absence of a speech signal, comprising: means for classifying a signal into one class among more than two classes; means for determining whether the signal is a speech signal or not as a function of the class into which the signal is classified; characterized in that the following means are provided: means for extracting property vectors from the speech signal; means for classifying the property vectors, during a learning process, into one class of more than two automatically formed classes, by means of a self-organizing clustering process; means for classifying the classes of the learning process, in an association process, as "speech" or "non-speech".
- Device according to claim 1, characterized in that the number of the more than two classes is greater than or equal to 10, in particular greater than or equal to 64.
- Device according to claim 1, characterized in that the automatically formed classes are classes formed by a neural network.
- Device according to one of the preceding claims, characterized in that the device for classifying the signal into one class of more than two classes comprises a neural network.
- Device according to claim 3 or 4, characterized in that the neural network is a Kohonen network.
- Device according to one of the preceding claims, characterized in that the device is a mobile terminal, in particular a mobile telephone.
- Biometric method in which a device according to one of claims 1 to 6 is used.
- Method for the detection of the presence or absence of a speech signal, in which: a signal is classified into one class of more than two classes, which comprise an automatically formed cluster organization; it is decided, as a function of the class into which the signal is classified, whether the signal is a speech signal or not; characterized in that: property vectors are extracted from the speech signal; the extracted property vectors are classified, during a learning process, into one class of more than two automatically formed classes, by means of a self-organizing clustering process; the classes of the learning process are classified, in an association process, as "speech" or "non-speech".
- Program product for a data processing system, containing code sections with which all the steps of a method according to one of claims 7 to 8 are executed when the program product runs on a data processing system.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
DE10245107 | 2002-09-27 | ||
DE2002145107 DE10245107B4 (de) | 2002-09-27 | 2002-09-27 | Voice Activity Detection auf Basis von unüberwacht trainierten Clusterverfahren |
Publications (3)
Publication Number | Publication Date |
---|---|
EP1406244A2 EP1406244A2 (fr) | 2004-04-07 |
EP1406244A3 EP1406244A3 (fr) | 2005-01-12 |
EP1406244B1 true EP1406244B1 (fr) | 2006-10-11 |
Family
ID=31984148
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP20030102639 Expired - Fee Related EP1406244B1 (fr) | 2002-09-27 | 2003-08-25 | Détection d'activité vocale basée sur l'agrégation non-supervisée |
Country Status (3)
Country | Link |
---|---|
EP (1) | EP1406244B1 (fr) |
DE (2) | DE10245107B4 (fr) |
ES (1) | ES2269917T3 (fr) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210359872A1 (en) * | 2020-05-18 | 2021-11-18 | Avaya Management L.P. | Automatic correction of erroneous audio setting |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE102006021427B4 (de) * | 2006-05-05 | 2008-01-17 | Giesecke & Devrient Gmbh | Verfahren und Vorrichtung zum Personalisieren von Karten |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4802221A (en) * | 1986-07-21 | 1989-01-31 | Ncr Corporation | Digital system and method for compressing speech signals for storage and transmission |
EP0435458B1 (fr) * | 1989-11-28 | 1995-02-01 | Nec Corporation | Discriminateur entre parole et autres données transmises dans la bande vocale |
JP3088171B2 (ja) * | 1991-02-12 | 2000-09-18 | 三菱電機株式会社 | 自己組織型パタ−ン分類システム及び分類方法 |
DE4442613C2 (de) * | 1994-11-30 | 1998-12-10 | Deutsche Telekom Mobil | System zum Ermitteln der Netzgüte in Nachrichtennetzen aus Endnutzer- und Betreibersicht, insbesondere Mobilfunknetzen |
IT1281001B1 (it) * | 1995-10-27 | 1998-02-11 | Cselt Centro Studi Lab Telecom | Procedimento e apparecchiatura per codificare, manipolare e decodificare segnali audio. |
US5737716A (en) * | 1995-12-26 | 1998-04-07 | Motorola | Method and apparatus for encoding speech using neural network technology for speech classification |
US6564198B1 (en) * | 2000-02-16 | 2003-05-13 | Hrl Laboratories, Llc | Fuzzy expert system for interpretable rule extraction from neural networks |
-
2002
- 2002-09-27 DE DE2002145107 patent/DE10245107B4/de not_active Expired - Fee Related
-
2003
- 2003-08-25 DE DE50305333T patent/DE50305333D1/de not_active Expired - Lifetime
- 2003-08-25 EP EP20030102639 patent/EP1406244B1/fr not_active Expired - Fee Related
- 2003-08-25 ES ES03102639T patent/ES2269917T3/es not_active Expired - Lifetime
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210359872A1 (en) * | 2020-05-18 | 2021-11-18 | Avaya Management L.P. | Automatic correction of erroneous audio setting |
US11502863B2 (en) * | 2020-05-18 | 2022-11-15 | Avaya Management L.P. | Automatic correction of erroneous audio setting |
Also Published As
Publication number | Publication date |
---|---|
DE50305333D1 (de) | 2006-11-23 |
ES2269917T3 (es) | 2007-04-01 |
DE10245107A1 (de) | 2004-04-08 |
EP1406244A3 (fr) | 2005-01-12 |
EP1406244A2 (fr) | 2004-04-07 |
DE10245107B4 (de) | 2006-01-26 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
AK | Designated contracting states |
Kind code of ref document: A2 Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LI LU MC NL PT RO SE SI SK TR |
|
AX | Request for extension of the european patent |
Extension state: AL LT LV MK |
|
PUAL | Search report despatched |
Free format text: ORIGINAL CODE: 0009013 |
|
AK | Designated contracting states |
Kind code of ref document: A3 Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LI LU MC NL PT RO SE SI SK TR |
|
AX | Request for extension of the european patent |
Extension state: AL LT LV MK |
|
17P | Request for examination filed |
Effective date: 20050711 |
|
AKX | Designation fees paid |
Designated state(s): DE ES FR GB |
|
GRAP | Despatch of communication of intention to grant a patent |
Free format text: ORIGINAL CODE: EPIDOSNIGR1 |
|
GRAS | Grant fee paid |
Free format text: ORIGINAL CODE: EPIDOSNIGR3 |
|
GRAA | (expected) grant |
Free format text: ORIGINAL CODE: 0009210 |
|
AK | Designated contracting states |
Kind code of ref document: B1 Designated state(s): DE ES FR GB |
|
REG | Reference to a national code |
Ref country code: GB Ref legal event code: FG4D Free format text: NOT ENGLISH |
|
GBT | Gb: translation of ep patent filed (gb section 77(6)(a)/1977) |
Effective date: 20061011 |
|
REF | Corresponds to: |
Ref document number: 50305333 Country of ref document: DE Date of ref document: 20061123 Kind code of ref document: P |
|
REG | Reference to a national code |
Ref country code: ES Ref legal event code: FG2A Ref document number: 2269917 Country of ref document: ES Kind code of ref document: T3 |
|
ET | Fr: translation filed | ||
PLBE | No opposition filed within time limit |
Free format text: ORIGINAL CODE: 0009261 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT |
|
26N | No opposition filed |
Effective date: 20070712 |
|
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: ES Payment date: 20130925 Year of fee payment: 11 |
|
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: GB Payment date: 20130814 Year of fee payment: 11 Ref country code: FR Payment date: 20130814 Year of fee payment: 11 |
|
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: DE Payment date: 20141020 Year of fee payment: 12 |
|
GBPC | Gb: european patent ceased through non-payment of renewal fee |
Effective date: 20140825 |
|
REG | Reference to a national code |
Ref country code: FR Ref legal event code: ST Effective date: 20150430 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: GB Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20140825 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: FR Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20140901 |
|
REG | Reference to a national code |
Ref country code: DE Ref legal event code: R119 Ref document number: 50305333 Country of ref document: DE |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: ES Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20140826 Ref country code: DE Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20160301 |