EP1125281A1 - Method for training a speaker recognition system

Info

Publication number: EP1125281A1
Application number: EP00969207A
Authority: EP
Inventors: Marcin Kuropatwinski
Original assignee: Siemens AG
Current assignee: Siemens AG
Priority date: 1999-08-26
Filing date: 2000-08-25
Publication date: 2001-08-22
Also published as: AU7901200A; WO2001015141A1

Abstract

The invention relates to a method of recognizing speakers using the parameters of an LPAS encoder or a parametric encoder for modeling the probability distribution for the speaker classes.

Description

METHOD FOR TRAINING A SPEAKER RECOGNITION SYSTEM

The invention relates to a method for recognizing speakers based on their voices.

The object on which the invention is based is to specify a method for recognizing speakers on the basis of their voices which is robust, safe and reliable.

This object is achieved by the features specified in claim 1.

The invention is described in more detail below using a flow chart.

The invention enables a speaker to be recognized on the basis of his voice. The task of speaker recognition is to distinguish between different speakers or to verify a claimed speaker identity, the only input information being a recording of the speaker's voice.

A method is also proposed that prevents the access system from being tricked if the voice and keyword are picked up by third parties.

When storing complex probability distributions for a speaker's speech parameters, a compromise must be made between accuracy and memory requirements. For this reason, methods of storing the probability distributions are proposed, which can be used depending on the number of speakers. So far, speaker recognition has been solved using hidden Markov models or vector quantization, for example, see literature [1].

The invention solves the problem of speaker recognition on the basis of the parameters of an analysis-by-synthesis encoder with linear prediction (LPAS) [1] (e.g. a Harmonic Vector Excitation Coding (HVXC) codec [5] or a waveform interpolation codec [4]). The speech signal parameters used so far, e.g. cepstral or AR parameters, do not provide a satisfactory solution to the problem. Other parameters therefore have to be used, e.g. the parameters of the vocal tract excitation, which carry speaker-dependent and at the same time largely phoneme-independent information.

In addition, a method of estimating the probability distribution of the encoder parameters for the respective speaker is given, as well as a method that prevents the access system from being tricked.

speaker identification

In speaker recognition systems, statistical principles [2] are used to check whether the spoken sentence was spoken by one of the speakers registered in the speaker recognition system. There are basically two types of speaker recognition systems, text-dependent and text-independent systems. For the procedure described in the invention, text independence is achieved through an extended training phase in which the speaker has to record a variety of material, and the probability distributions of the speech signal parameters mentioned are determined from the entire speech material. Training a text-dependent system is an easier task because the speech material spoken by the speaker during the usage phase is limited to a few keywords or certain sentences. The preparatory phase continues until the system reliably recognizes the speaker's voice.

The task of speaker identification is shown in Figure 2.

Figure 2. Problem of speaker identification (input: the speaker's voice)

Speaker identification is treated as a multiple detection problem [2]. The classes to be distinguished, one for each speaker to be recognized by the system, are designated $sp_i$, $i = 1..M$, where $M$ is the number of speakers registered in the speaker recognition system. Speaker recognition is based on the recorded voice signals of the respective speakers. The speech signal is segmented into signal frames $\mathbf{x} = [x(1) \ldots x(K)]$ (e.g. $K = 160$ for a signal frame of 20 ms length and a sampling frequency of 8 kHz). The segmentation yields the speech signal frames $\mathbf{x}(1) \ldots \mathbf{x}(N)$, where $N$ depends on the total length of the sentence or keyword spoken by the speaker. The decision about the speaker is made from the probabilities or probability densities (collectively referred to as probability scores) that the frames $\mathbf{x}(l)$, $l = 1..N$, belong to class $sp_i$. The statistically optimal decision rule chooses the class $sp_j$ with the highest probability score given $\mathbf{x}(l)$, $l = 1..N$, i.e. the frames are assigned to the class $sp_j$ for which:

$$p(\mathbf{x}(1) \ldots \mathbf{x}(N) \mid sp_j) > p(\mathbf{x}(1) \ldots \mathbf{x}(N) \mid sp_i) \quad \text{for all } i \neq j$$
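The decision rule above can be illustrated with a minimal Python sketch. It assumes that a log-score per frame and per speaker is already available and that the frames are statistically independent, so that the joint score is the sum of the per-frame log-scores. The function name `identify_speaker`, the dictionary layout and the numbers are hypothetical and serve only as an illustration.

```python
import math


def identify_speaker(frame_log_scores):
    """Return the speaker class with the highest summed log-score.

    frame_log_scores maps a speaker label to a list holding
    log p(x(l) | sp_i) for the frames l = 1..N (hypothetical layout).
    Summing the logarithms assumes statistically independent frames.
    """
    totals = {spk: sum(scores) for spk, scores in frame_log_scores.items()}
    best = max(totals, key=totals.get)
    return best, totals


# Made-up per-frame scores for M = 3 speakers and N = 4 frames.
scores = {
    "sp_1": [math.log(p) for p in (0.20, 0.10, 0.30, 0.25)],
    "sp_2": [math.log(p) for p in (0.40, 0.35, 0.30, 0.50)],
    "sp_3": [math.log(p) for p in (0.05, 0.10, 0.02, 0.10)],
}
best_speaker, totals = identify_speaker(scores)
print(best_speaker)
```

Working with logarithms is a design choice of this sketch, made to avoid numerical underflow when many frame scores are multiplied.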

speaker verification

Figure 3. Speaker verification problem (inputs: the speaker's voice and the claimed identity of the speaker; output: does the speaker agree with the claimed identity?)

The problem of speaker verification is to check the claimed identity of the speaker using his voice. This corresponds to the situation shown in Figure 3.

The process of speaker verification is similar to that of speaker identification, i.e. segmentation of the spoken sentence is also carried out. After that, however, no classification of the voice is made, but a probability score is calculated for the given speaker identity and compared with a threshold. The identity of the speaker is confirmed on the basis of his voice if:

$$p(\mathbf{x}(1) \ldots \mathbf{x}(N) \mid sp_j) > \text{threshold}$$

where $sp_j$ corresponds to the claimed speaker identity. The threshold must be set sufficiently high to avoid admitting or authorizing a speaker whose identity differs from the claimed one.
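The threshold test can be sketched in Python as follows. The sketch assumes per-frame log-scores for the claimed identity and a threshold expressed in the log domain. The function name `verify_speaker`, the threshold value and the scores are hypothetical.

```python
import math


def verify_speaker(frame_log_scores, log_threshold):
    """Accept the claimed identity if the combined log-score exceeds the threshold.

    frame_log_scores holds log p(x(l) | sp_j) for the claimed speaker sp_j.
    The frames are assumed to be statistically independent.
    """
    total = sum(frame_log_scores)
    return total > log_threshold, total


# Made-up scores: a genuine speaker and an impostor claiming the same identity.
genuine = [math.log(p) for p in (0.40, 0.35, 0.30, 0.50)]
impostor = [math.log(p) for p in (0.05, 0.10, 0.02, 0.10)]
log_threshold = math.log(1e-3)  # hypothetical threshold

accepted, _ = verify_speaker(genuine, log_threshold)
rejected, _ = verify_speaker(impostor, log_threshold)
print(accepted, rejected)
```

Raising `log_threshold` lowers the risk of admitting an impostor at the cost of rejecting genuine speakers more often.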

LPAS encoder

The speech coding methods used today are mainly based on the analysis-by-synthesis method with an LPC synthesis filter [2]. In these methods, speech coding is optimized by repeating the coding and decoding operations until the optimal parameter set for the given speech section is found.

Figure 4. Scheme of an LPAS encoder

One of the most widely used types of LPAS encoder is the CELP encoder. A relatively new development is the Harmonic Vector Excitation Coding (HVXC) codec, whose form of excitation signal is particularly suitable for the described task. The synthesis model of a CELP encoder is shown in Figure 5. The synthesis model defines how the synthesized speech signal is calculated from the quantized parameters of the speech signal. In general, each LPAS encoder has the following parameter groups:

• Short-term predictor parameters. The short-term predictor parameters are usually calculated by classic LPC analysis, using the correlation method or the covariance method of linear prediction [3]. 8 to 10 LPC coefficients are used for signal frames with a length of 20 to 30 ms and a sampling rate of 8 kHz. The short-term predictor parameters can be represented in various forms (e.g. as reflection coefficients or as line spectrum frequencies, LSF), depending on which representation can be quantized better. The LSF coefficients have proven best suited for quantization, and this form of the prediction coefficients is usually used. The short-term predictor parameters are calculated in an open-loop procedure, i.e. without the overall optimization with respect to the synthesis error that is shown in Figure 4 for the other parameters.

• Long-term predictor parameters. The long-term predictor parameters are used in a filter that synthesizes the fundamental frequency of the speech signal. Most often, a long-term predictor with one filter coefficient and one parameter for the fundamental period of the speech signal is used. A long-term predictor with the parameters b = [b, N] is part of Figure 5. The long-term predictor parameters are also calculated in an open-loop procedure without overall optimization with the other parameters. In some codecs, a refined search for the long-term predictor parameters is additionally carried out in a closed-loop procedure.

• Excitation parameters. In a CELP encoder, the 5 to 10 ms subframes of the residual signal are vector-quantized in a closed-loop procedure. The transmitted parameters enable the waveforms to be restored from the stored codebook on the decoder side.
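The open-loop calculation of the short-term predictor parameters can be sketched in Python with the correlation (autocorrelation) method and the Levinson-Durbin recursion. This is a minimal sketch under stated assumptions, not the procedure of any particular codec: no windowing is applied, the function names are hypothetical, and the test frame is a synthetic decaying exponential. An order of 8 to 10 would be typical according to the text; order 2 is used here only to keep the example easy to check.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation values r[0..max_lag] of one signal frame."""
    n = len(frame)
    return [sum(frame[t] * frame[t - lag] for t in range(lag, n))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations for the given autocorrelation values.

    Returns (a, reflection, err) where a[0] = 1 and the prediction error
    filter is A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    reflection = []
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        reflection.append(k)
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, reflection, err


ORDER = 2  # illustration only; 8 to 10 coefficients are typical at 8 kHz
frame = [0.9 ** t for t in range(160)]  # synthetic frame x(t) = 0.9 * x(t-1)
r = autocorrelation(frame, ORDER)
a, reflection, err = levinson_durbin(r, ORDER)
print(a)
```

For this synthetic frame the first coefficient comes out close to -0.9 and the second close to zero, which reflects the first-order structure of the signal. The reflection coefficients returned alongside are one of the alternative representations mentioned above.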

Figure 5. Synthesis model of a CELP encoder (blocks: excitation codebook, long-term predictor, short-term predictor)

In an HVXC codec, the output from the LPC analysis filter is transformed into the frequency domain and the spectral envelope, which is normalized for the period, is vector-quantized.

Speaker recognition with the parameters of an LPAS encoder

The parameters of a speech encoder describe the possible speech signals in detail with a significantly smaller number of parameters than the representation of the speech signal as a sequence of samples.

The decomposition of the speech signal into the parameter groups mentioned can be used for speaker recognition in various ways. The methods of calculating the parameters and synthesizing the speech signal determine the methods of estimating the probability densities (or the probabilities, for parameters regarded as discrete random variables).

Parameters determined in a closed-loop procedure should be regarded as discrete random variables, because it is not possible to assign volumes of parameter space regions of the vector quantizer to such parameters. This applies in particular to the excitation parameters. The probability distributions of such parameters are estimated by calculating the relative frequencies of the parameters / code vectors in the training set.

Parameters calculated in an open-loop procedure in the encoder are first available in unquantized form and are only then quantized, generally by vector quantization. For such parameters, the probability densities can be estimated from the training set. This approach is used primarily for the short-term predictor parameters. The estimation of the probability densities is based on the histogram method [6]. This method requires knowledge of the volumes of the regions of the parameter space associated with the quantized points.

One method of storing the probability distributions arises if the possible code vectors for the speech signal parameters are stored once for the entire population, i.e. the quantization levels / code vectors are determined once from a database containing the recordings of many speakers. The probability distributions of the parameters for the individual speakers are then stored in the system together with the indices of the code vectors. This method is suitable for large systems with a large number of users (ATMs, access systems in companies).

Figure 6. Speaker identification with the parameters of an LPAS encoder (blocks: speaker's voice; code vectors for the parameters; coding operation of the LPAS encoder with open-loop and closed-loop parameter calculation; coded parameters; probability distributions of the coded parameters for the speakers; decision about the speaker; identity of the speaker)

Another method arises when the code vectors for the parameters are trained individually for each speaker. The code vectors are then stored together with the values of the probability densities at the points of the parameter space determined by the code vectors. A diagram of this method is shown in Figure 7. This method is intended for a small number of speakers (e.g. for a voice-controlled door in an apartment).

Figure 7. Speaker identification with the parameters of an LPAS encoder. Probability densities are stored together with the code vectors for the parameters (blocks: voice of the speaker; calculation of the unquantized parameters in an open-loop procedure; decision; identity of the speaker)

Training phase of a speaker recognition system

The probability density distributions for the speaker classes are estimated from the training material. For text-dependent speaker recognition (speaker identification / speaker verification), a certain sentence or keyword is repeated during the training phase until speaker recognition works reliably. For text-independent speaker verification, phonetically balanced speech material must be recorded. In this case too, the training phase must be repeated until speaker identification / verification works reliably.

The material recorded during the training phase is used several times with shifted phase in order to make the speaker recognition system independent of the initial phase of the recorded voices. The data used for training is referred to as the training set $TS_{sp_j}$, where $sp_j$ denotes the speaker.
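One possible reading of the repeated, phase-shifted use of the training material is that the frame grid is shifted by a fraction of a frame before each pass. The following Python sketch illustrates this reading. The interpretation itself, the function names and the number of offsets are assumptions made for illustration, and the frame length of 160 samples follows from 20 ms at 8 kHz.

```python
def segment_into_frames(samples, frame_len, offset=0):
    """Split samples into consecutive, non-overlapping frames starting at offset.

    Trailing samples that do not fill a complete frame are dropped.
    """
    usable = max(len(samples) - offset, 0)
    n_frames = usable // frame_len
    return [samples[offset + i * frame_len: offset + (i + 1) * frame_len]
            for i in range(n_frames)]


def frames_for_training(samples, frame_len, n_offsets):
    """Reuse the same recording with several shifted frame grids (assumed reading)."""
    step = frame_len // n_offsets
    frames = []
    for k in range(n_offsets):
        frames.extend(segment_into_frames(samples, frame_len, offset=k * step))
    return frames


SAMPLE_RATE_HZ = 8000
FRAME_MS = 20
K = SAMPLE_RATE_HZ * FRAME_MS // 1000  # samples per frame

signal = list(range(1000))  # placeholder samples standing in for a recording
base = segment_into_frames(signal, K)
augmented = frames_for_training(signal, K, n_offsets=4)
print(K, len(base), len(augmented))
```

Each shifted pass yields frames whose boundaries fall at different positions in the recording, so the estimated distributions depend less on where the first frame happened to start.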

Estimation of the probability densities

In order to describe the method according to the invention for estimating the probability densities of the parameters for the speaker classes, the necessary definitions are introduced first. This abstraction of the coding process has the advantage that the estimation of the probability densities can be described simply, without going into the details of the very complicated operations in the speech encoder. A detailed description of the parameter calculation can be found in [4] and [5].

A speech encoder works on successive signal frames. For each signal frame, the operations described in the LPAS encoder section are performed in the speech encoder and yield the parameters of the speech signal for the respective frame.

• The calculation of a non-quantized parameter vector $\mathbf{p}$ from the signal frame $\mathbf{x}$ in an open-loop procedure is written as $\mathbf{p} = A_p(\mathbf{x})$.

• The quantization of the parameter is written as $\hat{\mathbf{p}} = Q_p(\mathbf{p})$.

• The region of the parameter space that is mapped to the code vector $\hat{\mathbf{p}}$ in the coding process is referred to as $S_{\hat{\mathbf{p}}} = \{\mathbf{p} : Q_p(\mathbf{p}) = \hat{\mathbf{p}}\}$.

• The volume of this region is denoted $V(S_{\hat{\mathbf{p}}})$.

• The set of possible code vectors for the parameter $\mathbf{p}$ is written as $C_p = \{\hat{\mathbf{p}}_i;\ i = 1..N_p\}$, where $N_p$ is the number of code vectors.

• The set of regions associated with the code vectors is written as $R_p = \{S_i;\ i = 1..N_p\}$.

• The membership function of a region $S$ is

$$1_S(\mathbf{p}) = \begin{cases} 1 & \text{for } \mathbf{p} \in S \\ 0 & \text{for } \mathbf{p} \notin S \end{cases}$$

The frequency of occurrence of a parameter in the training set is calculated as

$$f_{S_i} = \frac{\text{number of parameter values from the training set } TS_{sp_j} \text{ that fall into the region } S_i}{\text{number of parameter values from the training set } TS_{sp_j}}$$

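The estimated probability density then follows from the histogram method [6] as the relative frequency of each region divided by its volume:

$$\hat{p}(\mathbf{p} \mid sp_j) = \sum_{i=1}^{N_p} \frac{f_{S_i}}{V(S_i)}\, 1_{S_i}(\mathbf{p})$$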
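The histogram method can be illustrated with a minimal Python sketch for a one-dimensional parameter, where the quantizer regions are intervals and the volume of a region is simply its width. For a real vector quantizer the region volumes are considerably harder to obtain, so the scalar case is a simplifying assumption of this sketch. The function names, the region boundaries and the training values are hypothetical.

```python
import bisect


def histogram_density(training_values, boundaries):
    """Estimate a piecewise-constant density over the regions S_i = [b_i, b_{i+1}).

    The density in each region is its relative frequency in the training
    set divided by the region volume (here: the interval width).
    """
    counts = [0] * (len(boundaries) - 1)
    for v in training_values:
        if not boundaries[0] <= v < boundaries[-1]:
            raise ValueError("value outside the quantizer range")
        counts[bisect.bisect_right(boundaries, v) - 1] += 1
    n = len(training_values)
    densities = []
    for i, c in enumerate(counts):
        volume = boundaries[i + 1] - boundaries[i]
        densities.append((c / n) / volume)
    return densities


def density_at(value, boundaries, densities):
    """Look up the estimated density for a parameter value."""
    if not boundaries[0] <= value < boundaries[-1]:
        return 0.0
    return densities[bisect.bisect_right(boundaries, value) - 1]


boundaries = [0.0, 0.5, 1.0, 2.0]  # three regions of unequal width
training = [0.1, 0.2, 0.3, 0.6, 1.2, 1.5, 1.9, 0.4]
densities = histogram_density(training, boundaries)
print(densities, density_at(1.5, boundaries, densities))
```

Dividing by the region volume is what turns relative frequencies into a density: the widest region holds three of the eight values but receives a lower density than the narrow first region.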

Estimation of the probabilities

For the parameters regarded as discrete random variables, i.e. above all the excitation from the codebook, which is optimized in a closed-loop procedure, and the fundamental period of the speech signal, probability mass functions are estimated. These are given by the relative frequencies of the respective parameter codes determined in the training set for the respective speaker.
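The estimation of a probability mass function from relative frequencies can be sketched in Python as follows. The sketch assumes that the encoder delivers one codebook index per frame for the parameter in question. The function name `estimate_pmf`, the codebook size and the index sequence are hypothetical.

```python
from collections import Counter


def estimate_pmf(code_indices, codebook_size):
    """Relative frequency of every codebook index in a speaker's training set."""
    counts = Counter(code_indices)
    n = len(code_indices)
    return [counts[i] / n for i in range(codebook_size)]


# Made-up excitation codebook indices observed for one speaker.
training_indices = [3, 1, 3, 0, 3, 1, 2, 3]
pmf = estimate_pmf(training_indices, codebook_size=4)
print(pmf)
```

Indices that never occur in the training set receive probability zero in this sketch, which is one reason why a sufficiently large training set matters.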

Storing the probability distributions

The speech parameters in a speech coder are not all calculated at the same time but one after the other. For example, the short-term predictor parameters are calculated first, and the remaining parameters are then optimized with respect to the synthesis or prediction error for the already known short-term predictor parameters. This enables efficient storage of the probability distributions as conditional probabilities of the code vectors in a tree structure. This is possible thanks to the following dependency:

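$$P(\mathbf{p}_K, \mathbf{p}_L, \mathbf{p}_A \mid sp_i) = P(\mathbf{p}_K \mid sp_i)\, P(\mathbf{p}_L \mid sp_i, \mathbf{p}_K)\, P(\mathbf{p}_A \mid sp_i, \mathbf{p}_K, \mathbf{p}_L)$$

where

$\mathbf{p}_K$ is the vector of short-term parameters,

$\mathbf{p}_L$ is the vector of long-term parameters,

$\mathbf{p}_A$ is the vector of excitation parameters.

A significant simplification results if the speech parameters within a signal frame can be assumed to be statistically independent. The above formula then becomes:

$$P(\mathbf{p}_K, \mathbf{p}_L, \mathbf{p}_A \mid sp_i) = P(\mathbf{p}_K \mid sp_i)\, P(\mathbf{p}_L \mid sp_i)\, P(\mathbf{p}_A \mid sp_i)$$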
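Both forms of the frame score can be sketched in Python. The sketch assumes that each parameter group is represented by a code vector index and that a speaker's distributions are held in plain dictionaries. The nested lookup in the conditional variant mirrors the tree structure mentioned above. All names, keys and probabilities are hypothetical.

```python
import math


def log_score_independent(codes, model):
    """log P(p_K, p_L, p_A | sp) assuming independent parameter groups."""
    k, l, a = codes
    return (math.log(model["K"][k])
            + math.log(model["L"][l])
            + math.log(model["A"][a]))


def log_score_conditional(codes, tree):
    """log P(p_K, p_L, p_A | sp) using the chain of conditional probabilities.

    tree["K"][k] holds P(p_K), tree["L|K"][k][l] holds P(p_L | p_K) and
    tree["A|K,L"][(k, l)][a] holds P(p_A | p_K, p_L).
    """
    k, l, a = codes
    return (math.log(tree["K"][k])
            + math.log(tree["L|K"][k][l])
            + math.log(tree["A|K,L"][(k, l)][a]))


# Made-up distributions for one speaker with two code vectors per group.
independent_model = {"K": [0.5, 0.5], "L": [0.25, 0.75], "A": [0.1, 0.9]}
conditional_tree = {
    "K": [0.5, 0.5],
    "L|K": {0: [0.2, 0.8], 1: [0.6, 0.4]},
    "A|K,L": {(0, 0): [0.5, 0.5], (0, 1): [0.5, 0.5],
              (1, 0): [0.3, 0.7], (1, 1): [0.9, 0.1]},
}

frame_codes = (0, 1, 1)
s_ind = log_score_independent(frame_codes, independent_model)
s_cond = log_score_conditional(frame_codes, conditional_tree)
print(math.exp(s_ind), math.exp(s_cond))
```

The independent variant needs one table per parameter group, whereas the conditional variant needs one table per branch of the tree, which illustrates the trade-off between accuracy and memory requirements.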

The probability densities must be stored in the system at very many points in the parameter space. The number of bits used for storing the probability densities is critical to the complexity of the overall system. A vector quantizer is therefore used for the probability values. This reduces the number of bits needed to store the probability distributions.

System security

In order to prevent the system from being outwitted, a noise signal known to the system is emitted simultaneously with the recording of the speaker's voice and is subtracted from the digitized speech signal.
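The subtraction of the known noise can be sketched in Python under strongly simplifying assumptions: the emitted noise reaches the recording sample-aligned and with unit gain, and signals are lists of integer samples. The second case in the sketch, a replayed recording that still contains the noise of an earlier session, is an illustrative interpretation of how the measure could hinder replay attacks and is not stated in this form in the text. All names and values are hypothetical.

```python
def subtract_known_noise(recorded, known_noise):
    """Remove the noise known to the system from the digitized recording."""
    if len(recorded) != len(known_noise):
        raise ValueError("signals must have the same length")
    return [r - n for r, n in zip(recorded, known_noise)]


speech = [3, -1, 4, 1, -5, 9]      # made-up clean speech samples
noise_now = [1, 1, -2, 0, 2, -1]   # noise emitted during the current session
noise_old = [-1, 2, 0, 1, -1, 1]   # noise emitted during an earlier session

# Live speaker: the microphone picks up speech plus the current noise.
live_recording = [s + n for s, n in zip(speech, noise_now)]
# Replay: an intercepted recording (speech plus old noise) plus the current noise.
replayed_recording = [s + o + n for s, o, n in zip(speech, noise_old, noise_now)]

cleaned_live = subtract_known_noise(live_recording, noise_now)
cleaned_replay = subtract_known_noise(replayed_recording, noise_now)
leftover = [c - s for c, s in zip(cleaned_replay, speech)]
print(cleaned_live, leftover)
```

In the live case the subtraction restores the speech exactly, while in the replay case the noise of the earlier session remains in the signal and would be expected to disturb the encoder parameters.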

The invention can be used for access control applications such as a voice-controlled door, or for verification, for example in bank access systems. The procedure can be implemented as a program module on a processor that performs the task of speaker recognition in the system.

[1] S. Furui, "Recent advances in speaker recognition", Pattern Recognition Letters, Tokyo Inst. of Technol., 1997

[2] P. Vary, U. Heute, W. Hess, Digitale Sprachsignalverarbeitung, B. G. Teubner, Stuttgart, 1998

[3] K. Kroschel, Statistische Nachrichtentheorie, 3rd ed., Springer-Verlag, 1997

[4] W. B. Kleijn, K. K. Paliwal, Speech Coding and Synthesis, Elsevier, 1995

[5] ISO/IEC 14496-3, MPEG-4 HVXC Speech Coder description

[6] Prakasa Rao, Functional Estimation, Academic Press, 1982

Flow chart: Speaker verification using the parameters of an LPAS encoder

• Recorded voice of the speaker: a specific keyword or phrase for text-dependent speaker verification, any text for text-independent speaker verification.

• Segmentation of the speech signal into signal frames of length 20 to 30 ms.

• Calculation of the non-quantized speech parameters: the short-term predictor parameters, the long-term predictor parameters and the long-term residual signal are calculated.

• For each frame, calculation of the probability scores (probabilities or probability densities) from the speaker data (probability distributions of the speech parameters) stored for the claimed speaker identity.

• Combination of the probability scores from all signal frames. The signal frames of the speech signal are assumed to be statistically independent.

• Decision whether the claimed identity of the speaker and the speaker's voice match.

Claims

1. A method for recognizing speakers based on their voices, having the following features:

(a) in a preparatory phase,

(a1) k text-dependent or text-independent reference speech utterances of M speakers, which form a speaker-related training set, are segmented into first speech signal frames of length L,

(a2) the first speech signal frames are fed to an analysis-by-synthesis encoder based on linear prediction,

(a3) first short-term predictor parameters, long-term predictor parameters and/or excitation parameters of the encoder are calculated in the analysis-by-synthesis encoder for each of the M speakers and each first speech signal frame, the parameters then forming a speaker-related training material,

(a4) for each of the M speakers and each first speech signal frame, the frequencies of occurrence of the first short-term predictor parameters, long-term predictor parameters and/or excitation parameters of the encoder in the speaker-related training set, or the probability densities with which the first short-term predictor parameters, long-term predictor parameters and/or excitation parameters are contained in the speaker-related training set, are calculated in the analysis-by-synthesis encoder,

(a5) the calculated frequencies or probability densities are stored speaker-related as speaker data,

(b) in a simulated usage phase of the training phase,

(b1) a text-dependent or text-independent simulation speech utterance of an m-th speaker with m = 1..M is segmented into second speech signal frames of length L,

(b2) the second speech signal frames are fed to the analysis-by-synthesis encoder,

(b3) second short-term predictor parameters, long-term predictor parameters and/or excitation parameters of the encoder are calculated in the analysis-by-synthesis encoder for the m-th speaker and each second speech signal frame,

(b4) first probability scores are calculated for each second speech signal frame from the calculated second short-term predictor parameters, long-term predictor parameters and/or excitation parameters and the speaker data stored for the m-th speaker in the preparatory phase, the first probability scores indicating the probability with which the second short-term predictor parameters, long-term predictor parameters and/or excitation parameters match the first short-term predictor parameters, long-term predictor parameters and/or excitation parameters,

(b5) the first probability scores from all the second speech signal frames are combined,

(b6) it is checked whether the combined first probability scores are greater than a predetermined first threshold; the voice of the m-th speaker is confirmed if the combined first probability scores are greater than the predetermined first threshold, or the preparatory phase is carried out for further reference speech utterances of the m-th speaker until the voice of the m-th speaker is confirmed if the combined first probability scores are less than or equal to the predetermined first threshold,

(c) in a usage phase,

(c1) a text-dependent or text-independent useful speech utterance of the m-th speaker with m = 1..M is segmented into third speech signal frames of length L,

(c2) the third speech signal frames are fed to the analysis-by-synthesis encoder,

(c3) third short-term predictor parameters, long-term predictor parameters and/or excitation parameters of the encoder are calculated in the analysis-by-synthesis encoder for the m-th speaker and each third speech signal frame,

(c4) second probability scores are calculated for each third speech signal frame from the calculated third short-term predictor parameters, long-term predictor parameters and/or excitation parameters and the speaker data stored for the m-th speaker in the preparatory phase, the second probability scores indicating the probability with which the third short-term predictor parameters, long-term predictor parameters and/or excitation parameters were uttered by the m-th speaker,

(c5) the second probability scores from all the third speech signal frames are combined,

(c6) it is checked whether the combined second probability scores are greater than a predetermined second threshold; the voice of the m-th speaker is recognized if the combined second probability scores are greater than the predetermined second threshold, or the voice of the m-th speaker is not recognized if the combined second probability scores are less than or equal to the predetermined second threshold.

2. The method according to claim 1, characterized in that a parametric encoder, in particular a "Harmonic Vector Excited Predictive" encoder or a "Waveform Interpolating" encoder, is used.

3. The method according to claim 1, characterized in that an encoder based on linear prediction, in particular an LPAS encoder, is used as the analysis-by-synthesis encoder.

4. The method according to any one of claims 1 to 3, characterized in that the frequencies or probability densities are quantized with a vector quantizer with a certain, substantially reduced number of bits.

5. The method according to any one of claims 1 to 4, characterized in that, together with the input of the speaker's utterance into the speaker recognition system, a noise known to the speaker recognition system is also entered.

6. The method according to any one of claims 1 to 5, characterized in that the noise entered is subtracted internally before the segmentation of the recording of the speaker's voice.