EP1639579A1 - Method and system for voice signal analysis for compact representation of speakers - Google Patents

Method and system for voice signal analysis for compact representation of speakers

Info

Publication number
EP1639579A1
Authority
EP
European Patent Office
Prior art keywords
speaker
speakers
vocal
dimension
representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP03748194A
Other languages
English (en)
French (fr)
Inventor
Yassine Mami
Delphine Charlet
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Orange SA
Original Assignee
France Telecom SA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by France Telecom SA filed Critical France Telecom SA
Publication of EP1639579A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/06 - Decision making techniques; Pattern matching strategies
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/10 - Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique

Definitions

  • The present invention relates to a method and a device for analyzing voice signals.
  • The analysis of voice signals notably requires being able to represent a speaker.
  • The representation of a speaker by a Gaussian mixture model (GMM) is an effective representation of the acoustic or vocal identity of a speaker.
  • In this technique, the speaker is represented, in a reference acoustic space of predetermined dimension, by a weighted sum of a predetermined number of Gaussians.
  • This type of representation is accurate when a large amount of data is available and there are no physical constraints on storing the parameters of the model or on performing calculations on these numerous parameters.
  • The authors propose to represent a speaker, no longer absolutely in a reference acoustic space, but relatively with respect to a predetermined set of representations of reference speakers, also called anchor models, for which GMM-UBM models (UBM for "Universal Background Model") are available.
  • The proximity between a speaker and the reference speakers is evaluated by means of a Euclidean distance. This greatly reduces the computational load, but performance remains limited and insufficient.
  • The invention aims to analyze voice signals by representing speakers with respect to a predetermined set of reference speakers, with a reduced number of parameters that lowers the computational load for real-time applications, while maintaining acceptable performance in comparison with an analysis using the GMM-UBM representation.
  • The probability density of the similarities between the representation of the speech signals of the speaker λ and the predetermined set of vocal representations of the reference speakers is represented by a Gaussian distribution N(μ_λ, Σ_λ), with a mean vector μ_λ of dimension E and a covariance matrix Σ_λ of dimension E×E, estimated in the space of resemblances to the predetermined set of E reference speakers.
  • A priori information is also introduced into the probability densities of the resemblances N(μ_λ, Σ_λ) with respect to the E reference speakers.
  • A system for analyzing a speaker's voice signals, comprising databases in which are stored the voice signals of a predetermined set of E reference speakers and their associated vocal representations in a predetermined model, as well as audio archive databases, characterized in that it comprises means for analyzing voice signals using a vector representation of the similarities between the vocal representation of the speaker and the predetermined set of vocal representations of the E reference speakers.
  • The databases also store the analysis of the voice signals carried out by said analysis means.
  • The invention can be applied to the indexing of audio documents; other applications can also be envisaged, such as the acoustic identification of a speaker or the verification of a speaker's identity.
  • Other objects, characteristics and advantages of the invention will become apparent on reading the following description, given by way of nonlimiting example and made with reference to the single appended drawing, which illustrates an implementation of the method for indexing audio documents.
  • The figure shows an application of the system according to one aspect of the invention to the indexing of audio databases.
  • The system comprises means for receiving voice data from a speaker, for example a microphone 1, connected by a wired or wireless connection 2 to means 3 for recording a request made by a speaker λ and comprising a set of voice signals.
  • The recording means 3 are connected by a connection 4 to storage means 5 and, by a connection 6, to acoustic processing means 7 for the request.
  • These acoustic processing means transform the voice signals of the speaker λ into a representation in an acoustic space of dimension D by a GMM model of representation of the speaker λ. This representation is defined by a weighted sum of M Gaussian densities:

    p(x | λ) = Σ_{i=1}^{M} w_i b_i(x)   (1)

    b_i(x) = (2π)^{-D/2} |Σ_i|^{-1/2} exp( -(1/2) (x - μ_i)^T Σ_i^{-1} (x - μ_i) )   (2)

    Σ_{i=1}^{M} w_i = 1   (3)

  where:
  • D is the dimension of the acoustic space of the absolute GMM model;
  • x is an acoustic vector of dimension D, i.e. a vector of cepstral coefficients of a speech signal sequence of the speaker λ in the absolute GMM model;
  • M is the number of Gaussians of the absolute GMM model, generally a power of 2 between 16 and 1024;
  • w_i, μ_i and Σ_i are the weight, mean vector and covariance matrix of the i-th Gaussian.
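For illustration, here is a minimal Python sketch of relations (1) to (3), the GMM log-likelihood of a sequence of frames. Diagonal covariance matrices are assumed here for simplicity; the patent does not specify the covariance structure.

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """Sum over frames of log p(x_t | lambda), relations (1) to (3).

    X         : (T, D) array of cepstral vectors
    weights   : (M,) mixture weights w_i, summing to 1 (relation (3))
    means     : (M, D) Gaussian mean vectors mu_i
    variances : (M, D) diagonal covariances of the densities b_i (relation (2))
    """
    T, D = X.shape
    diff = X[:, None, :] - means[None, :, :]                  # (T, M, D)
    # log b_i(x_t) for every frame t and Gaussian i (relation (2)):
    log_b = -0.5 * (D * np.log(2.0 * np.pi)
                    + np.sum(np.log(variances), axis=1)       # log |Sigma_i|
                    + np.sum(diff ** 2 / variances, axis=2))  # Mahalanobis term
    # log p(x_t | lambda) = logsumexp_i [log w_i + log b_i(x_t)] (relation (1)):
    log_p = np.logaddexp.reduce(np.log(weights) + log_b, axis=1)
    return float(np.sum(log_p))
```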
  • The acoustic processing means 7 for the request are connected by a connection 8 to analysis means 9.
  • These analysis means 9 are able to represent a speaker by a probability density vector representing the similarities between the vocal representation of said speaker in the chosen GMM model and the vocal representations of the E reference speakers in the chosen GMM model.
  • The analysis means 9 are also able to carry out verification and/or identification tests for a speaker. To carry out these tests, the analysis means build the vector of probability densities, that is to say of similarities between the speaker and the reference speakers. The aim is to describe a relevant representation of a single segment x of the signal of the speaker λ by means of the following equations:

    w_x = [ (1/T_x) log( p(x | λ_1) / p(x | λ_UBM) ), ..., (1/T_x) log( p(x | λ_E) / p(x | λ_UBM) ) ]^T   (4)

    p(x | λ_j) = Π_{t=1}^{T_x} Σ_{k=1}^{M} w_k b_k(x_t)   (5)

  where:
  • w_x is a vector of the space of resemblances to the predetermined set of E reference speakers, representing the segment x in this representation space;
  • p(x | λ_j) / p(x | λ_UBM) is a probability density normalized by a universal model, representing the resemblance of the acoustic representation of a segment x of the vocal signal of a speaker λ, knowing a reference speaker λ_j;
  • T_x is the number of frames, or acoustic vectors, of the speech segment x;
  • p(x | λ_j) is a probability representing the resemblance of the acoustic representation of a voice signal segment x of a speaker λ, knowing a reference speaker λ_j;
  • p(x | λ_UBM) is a probability representing the resemblance of the acoustic representation of a voice signal segment x of a speaker λ in the UBM world model;
  • M is the number of Gaussians of the relative GMM model, generally a power of 2 between 16 and 1024;
  • D is the dimension of the acoustic space of the absolute GMM model;
  • x_t is an acoustic vector of dimension D, i.e. a vector of cepstral coefficients of a speech signal sequence of the speaker λ in the absolute GMM model;
  • b_k(x) represents, for k = 1 to M, Gaussian densities of dimension D;
  • μ_λ^(j) represents the components of the mean vector μ_λ, of dimension E, of the resemblances N(μ_λ, Σ_λ) of the speaker λ with respect to the E reference speakers;
  • Σ_λ^(j,j') represents the components of the covariance matrix Σ_λ, of dimension E×E, of the resemblances N(μ_λ, Σ_λ) of the speaker λ with respect to the E reference speakers.
  • The analysis means 9 are connected by a connection 10 to learning means 11, which make it possible to calculate the vocal representations, in the form of vectors of dimension D, of the E reference speakers in the chosen GMM model.
  • The learning means 11 are connected by a connection 12 to a database 13 comprising voice signals from a predetermined set of speakers and their associated voice representations in the GMM reference model.
  • The database 13 is connected by a connection 14 to the analysis means 9 and by a connection 15 to the acoustic processing means 7.
  • The system further comprises a database 16 connected by a connection 17 to the acoustic processing means 7, and by a connection 18 to the analysis means 9.
  • The database 16 contains audio archives in the form of audio articles, as well as the associated vocal representations in the chosen GMM model.
  • The database 16 is also able to store the associated representations of the audio articles calculated by the analysis means 9.
  • The learning means 11 are further connected by a connection 19 to the acoustic processing means 7.
  • The learning means 11 determine the representations in the GMM reference model of the E reference speakers, using the voice signals of these E reference speakers stored in the database 13 and the acoustic processing means 7. This determination takes place according to relations (1) to (3) above.
  • This set of E reference speakers will represent the new acoustic representation space.
  • These representations of the E reference speakers in the GMM model are stored in memory, for example in database 13. All of this can be done offline.
  • The acoustic processing means 7 calculate a vocal representation of the speaker in the predetermined GMM model, as explained above with reference to relations (1) to (3).
  • The acoustic processing means 7 have calculated, for example offline, the vocal representations of a set of S test speakers and a set of T speakers in the predetermined GMM model. These two sets are disjoint. These representations are stored in the database 13.
  • The analysis means 9 calculate, for example offline, a vocal representation of the S speakers and the T speakers with respect to the E reference speakers.
  • This representation is a vector representation with respect to these E reference speakers, as described above.
  • The analysis means 9 also compute, for example offline, a vocal representation of the S speakers and the T speakers with respect to the E reference speakers, and a vocal representation of the speakers' articles from the audio database.
  • This representation is a vector representation with respect to these E reference speakers.
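A possible sketch of this offline indexing of the audio articles of database 16, under the assumption (not spelled out in the patent) that each article is summarized by the empirical mean and covariance of the anchor vectors of its segments, reusing anchor_vector from above:

```python
def index_article(segments, anchor_models, ubm):
    """Represent one audio article by a Gaussian in the resemblance space.

    segments : list of (T, D) frame arrays, the segments of the article
    Returns (mu, Sigma): empirical mean (E,) and covariance (E, E) of the
    segments' anchor vectors (at least two segments are needed for Sigma).
    """
    W = np.vstack([anchor_vector(X, anchor_models, ubm) for X in segments])
    return W.mean(axis=0), np.cov(W, rowvar=False)
```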
  • The processing means 7 transmit the voice representation of the speaker λ in the predetermined GMM model to the analysis means 9, which calculate a voice representation of the speaker λ.
  • This representation is a probability density representation of the resemblances to the E reference speakers. It is calculated by introducing a priori information drawn from the voice representations of the T speakers. Indeed, the use of this a priori information makes it possible to keep a reliable estimate even when the number of available speech segments of the speaker λ is small.
  • We introduce the a priori information by means of the following equations:

    μ̂_λ = ( Σ_0^{-1} + N_λ Σ_λ^{-1} )^{-1} ( Σ_0^{-1} μ_0 + N_λ Σ_λ^{-1} μ_λ )

  where:
  • μ_λ is the mean vector, of dimension E, of the resemblances N(μ_λ, Σ_λ) of the speaker λ with respect to the E reference speakers;
  • N_λ is the number of segments of voice signals from the speaker λ, represented by N_λ vectors of the space of resemblances to the predetermined set of E reference speakers;
  • μ̂_λ is the mean vector, of dimension E, of the resemblances N(μ̂_λ, Σ_λ) of the speaker λ with respect to the E reference speakers, with introduction of the a priori information;
  • Σ_λ is the covariance matrix, of dimension E×E, of the resemblances N(μ_λ, Σ_λ) of the speaker λ with respect to the E reference speakers;
  • μ_0 and Σ_0 are the mean vector and covariance matrix of the a priori information, estimated on the voice representations of the T speakers.
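One standard way to realize this smoothing, sketched for illustration under the assumption that the resemblance vectors are Gaussian with known covariance Σ_λ and that the prior N(μ_0, Σ_0) is estimated on the T speakers:

```python
def map_mean(W, prior_mean, prior_cov, speaker_cov):
    """MAP estimate mu_hat of the mean of the resemblance distribution.

    W           : (N_lambda, E) anchor vectors of the speaker's segments
    prior_mean  : (E,) prior mean mu_0 estimated on the T speakers
    prior_cov   : (E, E) prior covariance Sigma_0
    speaker_cov : (E, E) covariance Sigma_lambda of the speaker's Gaussian
    """
    n = W.shape[0]                             # N_lambda segments
    w_bar = W.mean(axis=0)                     # empirical mean of resemblances
    prior_prec = np.linalg.inv(prior_cov)      # Sigma_0^{-1}
    lik_prec = n * np.linalg.inv(speaker_cov)  # N_lambda * Sigma_lambda^{-1}
    post_cov = np.linalg.inv(prior_prec + lik_prec)
    # The posterior mean interpolates the prior mean and the empirical mean;
    # with few segments (small N_lambda) the prior dominates, keeping the
    # estimate reliable, as the description explains.
    return post_cov @ (prior_prec @ prior_mean + lik_prec @ w_bar)
```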
  • The analysis means 9 then compare the vocal representations of the request and of the articles of the audio base by speaker identification and/or verification tests.
  • The speaker identification test consists in evaluating a likelihood measure between the vector w_x of the test segment and the set of representations of the articles in the audio base.
  • The speaker verification test consists in calculating a likelihood score between the vector w_x of the test segment and the representation of an article of the audio base, normalized by its likelihood score with the representation of the a priori information.
  • The segment is authenticated if the score exceeds a given predetermined threshold, said score being given by the following relation:

    score(x, λ) = log p(w_x | N(μ̂_λ, Σ_λ)) - log p(w_x | N(μ_0, Σ_0))
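Both tests can be sketched as follows (illustrative only; the Gaussian article models and the threshold value are assumptions consistent with the description above):

```python
def log_gaussian(w, mean, cov):
    """Log-density of vector w under N(mean, cov)."""
    E = mean.shape[0]
    diff = w - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (E * np.log(2.0 * np.pi) + logdet
                   + diff @ np.linalg.solve(cov, diff))

def identify(w_x, article_models):
    """Identification test: index of the article Gaussian that best
    explains the test vector w_x. article_models: list of (mean, cov)."""
    return int(np.argmax([log_gaussian(w_x, m, c) for m, c in article_models]))

def verify(w_x, claimed_model, prior_model, threshold):
    """Verification test: likelihood score of the claimed article model,
    normalized by the a priori model; authenticate if above threshold."""
    score = (log_gaussian(w_x, *claimed_model)
             - log_gaussian(w_x, *prior_model))
    return score > threshold
```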
  • This invention can also be applied to other uses, such as recognition or identification of a speaker.
  • This compact representation of a speaker makes it possible to drastically reduce the computational cost, because far fewer elementary operations are required given the drastic reduction in the number of parameters needed to represent a speaker. For example, for a request of 4 seconds of speech from a speaker, that is to say 250 frames, with a GMM model of dimension 27 and 16 Gaussians, the number of elementary operations is reduced by a factor of 540, which greatly reduces the computation time.
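The patent does not spell out the operation count behind this figure; one accounting that reproduces it, assuming E = 200 reference speakers (the value of E is not stated), is:

```latex
% GMM-UBM scoring of the request against one model:
% T_x frames x M Gaussians x D dimensions of elementary operations
250 \times 16 \times 27 = 108\,000
% Comparing two representations in the resemblance space costs on the
% order of E elementary operations; with the assumed E = 200:
\frac{108\,000}{200} = 540
```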
  • The memory size used to store the representations of the speakers is also significantly reduced. The invention therefore makes it possible to analyze the vocal signals of a speaker while drastically reducing the computation time and the memory size required to store the vocal representations of the speakers.
EP03748194A 2003-07-01 2003-07-01 Method and system for voice signal analysis for compact representation of speakers Withdrawn EP1639579A1 (de)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/FR2003/002037 WO2005015547A1 (fr) 2003-07-01 2003-07-01 Method and system for analyzing voice signals for the compact representation of speakers

Publications (1)

Publication Number Publication Date
EP1639579A1 true EP1639579A1 (de) 2006-03-29

Family

ID=34130575

Family Applications (1)

Application Number Title Priority Date Filing Date
EP03748194A Withdrawn EP1639579A1 (de) 2003-07-01 2003-07-01 Verfahren und system zur sprachanalyse zur kompakten darstellung von sprechern

Country Status (7)

Country Link
US (1) US7539617B2 (de)
EP (1) EP1639579A1 (de)
JP (1) JP4652232B2 (de)
KR (1) KR101011713B1 (de)
CN (1) CN1802695A (de)
AU (1) AU2003267504A1 (de)
WO (1) WO2005015547A1 (de)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005034086A1 (ja) * 2003-10-03 2005-04-14 Asahi Kasei Kabushiki Kaisha Data processing device and data processing device control program
DE602006018795D1 (de) * 2006-05-16 2011-01-20 Loquendo Spa Kompensation der variabilität zwischen sitzungen zur automatischen extraktion von informationen aus sprache
JP4717872B2 (ja) * 2006-12-06 2011-07-06 Electronics and Telecommunications Research Institute Speaker information acquisition system using speaker's voice feature information, and method therefor
WO2008074076A1 (en) * 2006-12-19 2008-06-26 Torqx Pty Limited Confidence levels for speaker recognition
CN102237084A (zh) * 2010-04-22 2011-11-09 Panasonic Corporation Method, device and equipment for online adaptive adjustment of an acoustic space reference model
US8635067B2 (en) * 2010-12-09 2014-01-21 International Business Machines Corporation Model restructuring for client and server based automatic speech recognition
WO2012075640A1 (en) * 2010-12-10 2012-06-14 Panasonic Corporation Modeling device and method for speaker recognition, and speaker recognition system
JP6556575B2 (ja) 2015-09-15 2019-08-07 Toshiba Corporation Speech processing device, speech processing method, and speech processing program
CA3172758A1 (en) * 2016-07-11 2018-01-18 FTR Labs Pty Ltd Method and system for automatically diarising a sound recording

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2105034C (en) * 1992-10-09 1997-12-30 Biing-Hwang Juang Speaker verification with cohort normalized scoring
US5664059A (en) * 1993-04-29 1997-09-02 Panasonic Technologies, Inc. Self-learning speaker adaptation based on spectral variation source decomposition
US5793891A (en) * 1994-07-07 1998-08-11 Nippon Telegraph And Telephone Corporation Adaptive training method for pattern recognition
JPH08110792A (ja) * 1994-10-12 1996-04-30 Atr Onsei Honyaku Tsushin Kenkyusho:Kk Speaker adaptation device and speech recognition device
US5864810A (en) * 1995-01-20 1999-01-26 Sri International Method and apparatus for speech recognition adapted to an individual speaker
US5790758A (en) * 1995-07-07 1998-08-04 The United States Of America As Represented By The Secretary Of The Navy Neural network architecture for gaussian components of a mixture density function
US5835890A (en) * 1996-08-02 1998-11-10 Nippon Telegraph And Telephone Corporation Method for speaker adaptation of speech models recognition scheme using the method and recording medium having the speech recognition method recorded thereon
US6029124A (en) * 1997-02-21 2000-02-22 Dragon Systems, Inc. Sequential, nonparametric speech recognition and speaker identification
US6212498B1 (en) * 1997-03-28 2001-04-03 Dragon Systems, Inc. Enrollment in speech recognition
US6009390A (en) * 1997-09-11 1999-12-28 Lucent Technologies Inc. Technique for selective use of Gaussian kernels and mixture component weights of tied-mixture hidden Markov models for speech recognition
US5946656A (en) * 1997-11-17 1999-08-31 At & T Corp. Speech and speaker recognition using factor analysis to model covariance structure of mixture components
US6141644A (en) * 1998-09-04 2000-10-31 Matsushita Electric Industrial Co., Ltd. Speaker verification and speaker identification based on eigenvoices
US6411930B1 (en) * 1998-11-18 2002-06-25 Lucent Technologies Inc. Discriminative gaussian mixture models for speaker verification
US20010044719A1 (en) * 1999-07-02 2001-11-22 Mitsubishi Electric Research Laboratories, Inc. Method and system for recognizing, indexing, and searching acoustic signals
US7035790B2 (en) * 2000-06-02 2006-04-25 Canon Kabushiki Kaisha Speech processing system
US6954745B2 (en) * 2000-06-02 2005-10-11 Canon Kabushiki Kaisha Signal processing system
US6754628B1 (en) * 2000-06-13 2004-06-22 International Business Machines Corporation Speaker recognition using cohort-specific feature transforms

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO2005015547A1 *

Also Published As

Publication number Publication date
WO2005015547A1 (fr) 2005-02-17
US20060253284A1 (en) 2006-11-09
JP2007514959A (ja) 2007-06-07
US7539617B2 (en) 2009-05-26
CN1802695A (zh) 2006-07-12
KR20060041208A (ko) 2006-05-11
JP4652232B2 (ja) 2011-03-16
KR101011713B1 (ko) 2011-01-28
AU2003267504A1 (en) 2005-02-25

Similar Documents

Publication Publication Date Title
Li et al. Cn-celeb: multi-genre speaker recognition
US7245767B2 (en) Method and apparatus for object identification, classification or verification
US6253179B1 (en) Method and apparatus for multi-environment speaker verification
US10706857B1 (en) Raw speech speaker-recognition
Deshpande et al. Classification of music signals in the visual domain
US20140195237A1 (en) Fast, language-independent method for user authentication by voice
CN109243487B (zh) A replay speech detection method based on normalized constant-Q cepstral features
CN110534101B (zh) Mobile device source identification method and system based on multimodal fusion deep features
JPH11507443A (ja) Speaker verification system
Liu et al. A Spearman correlation coefficient ranking for matching-score fusion on speaker recognition
US20160019897A1 (en) Speaker recognition from telephone calls
Peri et al. Robust speaker recognition using unsupervised adversarial invariance
EP2202723A1 (de) Verfahren und System zur Authentifizierung einem Sprecher
WO2005015547A1 (fr) Method and system for analyzing voice signals for the compact representation of speakers
Shim et al. Replay spoofing detection system for automatic speaker verification using multi-task learning of noise classes
CN113628612A (zh) Speech recognition method and apparatus, electronic device, and computer-readable storage medium
Fathan et al. Mel-spectrogram image-based end-to-end audio deepfake detection under channel-mismatched conditions
US7516071B2 (en) Method of modeling single-enrollment classes in verification and identification tasks
Abualadas et al. Speaker identification based on hybrid feature extraction techniques
Pandey et al. Cell-phone identification from audio recordings using PSD of speech-free regions
FR3099016A1 Method for generating a private key from biometric characteristics
Jakubec et al. On deep speaker embeddings for speaker verification
Duraibi et al. Voice Feature Learning using Convolutional Neural Networks Designed to Avoid Replay Attacks
CN112995135B (zh) A batch content authentication method for massive digital voice content
Hsu et al. Performance Comparison of Audio Tampering Detection Using Different Datasets

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20051219

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LI LU MC NL PT RO SE SI SK TR

DAX Request for extension of the european patent (deleted)
GRAP Despatch of communication of intention to grant a patent

Free format text: ORIGINAL CODE: EPIDOSNIGR1

GRAC Information related to communication of intention to grant a patent modified

Free format text: ORIGINAL CODE: EPIDOSCIGR1

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20110329