WO2007048688A1

WO2007048688A1 - System and method for variable-text speaker recognition

Info

Publication number: WO2007048688A1
Application number: PCT/EP2006/067071
Authority: WO
Inventors: Amitav Das; Viswanathan Ramasubramanian
Original assignee: Siemens Aktiengesellschaft
Priority date: 2005-10-24
Filing date: 2006-10-05
Publication date: 2007-05-03
Also published as: EP1941495A1

Abstract

The invention proposes a system and method for text-dependent speaker recognition based on variable text, which is robust against background noise and intra-speaker variations over time. The system for speaker recognition comprises: at least one module (1) for feature extraction of an input utterance, converting the utterance into a sequence of input test feature vectors, at least one set of speaker specific multiple templates (2) comprising of a sequence of training feature vectors obtained from recorded speech during training, whereby the multiple templates are suitable for recognition of continuous input speech and means (3) for matching the sequence of feature vectors against the sets of speaker specific multiple templates (2).

Description

System and method for variable-text speaker recognition

Conventional speaker recognition techniques suffer from two major problems: a) background noise variation and b) intra-speaker variability or non-contemporary testing. Various background noise conditions (both noise types and noise levels) can drastically reduce the performance of any device applying such techniques. Intra-speaker variability refers to the variations observed in the speech of the same person recorded or tested over a long period of time. Moreover, there may also be distinct pronunciation variability for the same speaker across various sessions. The speaker's voice may be influenced by physiological changes such as a cold. Different moods of the speaker may also influence the voice. Typically, the performance of conventional methods degrades when there is a large time gap between training and testing.

Also, in speaker-recognition systems, the use of a "variable-text" mode of operation provides the flexibility of defining arbitrary passwords from a given vocabulary of words (such as the 10 digits). Passwords can then be dynamically defined and changed by the user at any time, or by the system as in a prompted mode of operation. This is in contrast to the "fixed-text" mode of operation, which requires that the test password (a phrase or a sentence) be recorded for training every time it is defined anew or needs to be changed. Thus a "variable-text" mode of operation yields a more secure and more flexible system.

In a text-dependent speaker recognition system, during training, the system is trained with some passwords (speech signal segments) and the corresponding "text" information (information about what the password is) uttered by the various speakers. During actual usage of the system, given the uttered password and the corresponding text information, the system attempts to identify the speaker.

The password or "text" can be a "fixed-text", in which the password is unique and fixed, one per user, or it can be "variable-text", in which the password can be dynamically formed by concatenating words from a set of words [e.g. 9-3-1, from the words nine, three and one]. For a fixed-text system, if the password needs to be changed, then the system has to be re-trained with the new password. For a variable-text system, it is easy to change the password, as the new password can be dynamically formed from a set of words.

Variable-text systems typically have a front-end "feature extractor" which extracts features from the given speech signal segments at the input. A "back-end" classifier then uses these extracted features either to train the system (during the training phase) or to enable the system to detect the speaker (during actual use). For such back-end classification, conventional text-dependent speaker recognition systems use either Hidden Markov Model (HMM) based methods or the Dynamic Time Warping (DTW) method.

DTW is essentially a method in which two signal segments of different time duration can be compared, with the signals being suitably time-warped and aligned against each other to aid the comparison. In this way two signal segments can be compared even though they are not time-aligned or have different lengths.
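The DTW comparison described above can be sketched as follows. This is a minimal illustration, not the method of any cited system: it assumes scalar frames, an absolute-difference local distance, and the common three-way step pattern; the name `dtw_distance` is hypothetical.

```python
import math


def dtw_distance(a, b):
    """Accumulated distance of the best time-warped alignment of a and b.

    a, b: sequences of scalar frames (a simplifying assumption; real
    systems compare feature vectors).
    """
    inf = math.inf
    # acc[i][j] = best accumulated distance aligning a[:i] with b[:j]
    acc = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    acc[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three allowed predecessor cells
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[len(a)][len(b)]
```

In this sketch two segments that differ only in duration, such as `[1, 2, 3]` and `[1, 2, 2, 3]`, align with zero accumulated distance, which is the property that makes DTW usable for utterances spoken at different speeds.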

Conventional speaker recognition systems such as B. Yegnanarayana, S. R. Mahadeva Prasanna, J. Mariam Zachariah and C. S. Gupta, "Combining evidence from source, suprasegmental and spectral features for a fixed-text speaker verification system", IEEE Trans. Speech and Audio Processing, 2005, use a fixed-text mode of operation and a single-template based DTW method for the back-end classification.

In H. Ney, "The use of one-stage dynamic programming algorithm for connected word recognition", IEEE Trans. on Acoustics, Speech and Signal Processing, vol. ASSP-32, no. 2, April 1984, a variation of DTW, called one-pass Dynamic Programming (DP), has been used for connected word recognition systems. Here the system is designed to detect words and not the speaker.

An object of the invention is to provide a system and method for text-dependent speaker recognition based on variable text, which is robust against background noise and intra-speaker variations over time.

This object is achieved by a system for speaker recognition according to claim 1. The object is further achieved by a method for speaker recognition according to claim 16.

A method is presented which uses a variation of one-pass DP together with multiple templates to create a robust variable-text text-dependent speaker identification/verification system.

The proposed technique uses a judicious combination of multiple templates and one-pass DP to solve the background noise and intra-speaker variability problem as described next. The proposed one pass DP based algorithm allows for variable-text operation, use of multiple templates and inter-word pauses to be present or absent to handle a natural way of speaking of the password by the user.

The use of multiple templates has the following advantages:

1. Multiple templates: A conventional system with a single template fails to capture the variations due to intra-speaker variability, whereas the proposed system with multiple templates captures a wider range of these variations and is hence more robust and accurate.

2. Multiple-session templates for non-contemporary testing: When the training and testing phases are separated by a long gap of time, conventional speaker recognition systems with a single template fail. The proposed system with multiple templates, captured over a period of time, adequately captures the variations over time.

3. Multiple templates for different noise conditions: If the system is to be deployed in different background conditions where the noise levels and noise types differ, conventional systems with a single template cannot capture all the various background noise conditions, and hence the performance of such single-template based systems degrades severely. The proposed system, designed with multiple templates coming from various noise conditions and noise types, is more robust and provides consistently better performance across such a variety of background noise conditions.

The proposed one-pass DP based system and method also has the advantage of allowing inter-word pauses to be present in the input utterance corresponding to the connected word sequence comprising the password. Typically, users exercise no voluntary control over the manner in which a sequence of words is spoken continuously. The presence or absence of inter-word pauses, and their durations, are therefore unpredictable. In the proposed method, the one-pass DP algorithm is adapted to allow for inter-word pauses by using a silence template between every pair of adjacent words (their multiple templates) and allowing this silence template to be skipped or not. This adds a high degree of convenience for the user and system reliability for varied degrees of inter-word co-articulation, as well as some freedom in the way the isolated word templates are defined (their end-pointing is no longer crucial).

The proposed system and method advantageously uses a judicious combination of one-pass DP and multiple templates to deliver a robust variable-text text-dependent speaker identification system with the flexibility to create a wide set of passwords easily and to use a "prompted" mode of operation. It uses multiple templates from multiple training sessions and delivers robustness against non-contemporary testing. A further advantage is the use of multiple noisy templates from multiple noise conditions in training to deliver robust performance under various background noise conditions. The proposed method and system incorporate ways to handle variable inter-word silence, thereby allowing graceful and effective handling of continuous speech utterances during actual usage.

The proposed method can be used in secure voice-based access control systems in buildings, cars, offices, shops and factories, where the user (or system) can change access passwords from time to time, or in a prompted mode of speaker-recognition operation where the system uses randomly generated passwords every time it is used for high security applications. The proposed method can be used in applications where hands-free access control and command are done by voice and where, in addition, recognizing the person through the proposed speaker identification method grants [or, in the case of an impostor, denies] access. Thus, it can be used in hands-free secure voice-driven control and command applications in buildings, machine shop floors, factories, cars, offices and roads, where there is heavy background noise.

The proposed method can also be used in virtual or augmented reality applications running in background noise, where voice is used for command and control and is also used to grant access to the system, to various levels of the information content and to various levels of control. For example, the system recognizes that the speaker is the supervisor and hence provides more information and control than it would to a team member.

Further advantageous features are disclosed in the dependent claims.

The following figures describe some possible embodiments of the present invention:

Fig 1: Example of conventional SINGLE-TEMPLATE DTW for variable-text speaker recognition,

Fig 2: System architecture of the proposed variable-text speaker-recognition system based on one-pass DP with multiple templates,

Fig 3: A typical matching between the input utterance and the word-templates,

Fig 4: DP recursion that operates in the one-pass DP algorithm for the case of multiple templates,

Fig 5: A typical matching by one-pass DP algorithm when inter-word silences are present and absent,

Fig 6: General recursions operating in the one-pass DP algorithm.

As shown in Figure 1, an input utterance of the password "915" is compared to a 915 "template" of a particular speaker, to see whether it was spoken by that speaker. Note that this is an example of a variable-text system, where the "template" is formed by concatenating the individual word templates of 9, 1 and 5. Also note that this is an example of conventional SINGLE-TEMPLATE DTW, where only one template is used for each word.
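The concatenation of word templates in Figure 1 can be sketched as follows. The storage layout (a dictionary from word to a list of feature frames) and the name `build_password_template` are assumptions made for illustration only; the frame values are placeholders.

```python
def build_password_template(password, word_templates):
    """Concatenate single word templates into one password template.

    password: iterable of words, e.g. "915" for the digits 9, 1 and 5.
    word_templates: hypothetical mapping from a word to its single
        template, stored as a list of feature frames.
    """
    frames = []
    for word in password:
        # append a copy of this word's frames in spoken order
        frames.extend(word_templates[word])
    return frames


# Placeholder one-dimensional frames for the digits 9, 1 and 5.
templates = {"9": [[0.9], [0.8]], "1": [[0.1]], "5": [[0.5], [0.4]]}
reference_915 = build_password_template("915", templates)
```

Because the reference is assembled at run time, changing the password only changes the order in which stored word templates are concatenated; no new recording is needed.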

Figure 2 shows the system architecture of the proposed variable-text speaker-recognition system based on one-pass DP with multiple templates 2 (both multi-session and multi-noise-type) per speaker. Here, each speaker has a set of templates (say, L templates $R_{i,j}$, $j = 1, \ldots, L$) for word $W_i$. Given an input utterance, the feature extraction module 1 converts the utterance into a sequence of feature vectors (such as mel-frequency cepstral coefficients (MFCCs)). This feature vector sequence corresponds to the input 'password' text (say, the digit string 915 in the figure).

The one-pass DP algorithm (matching means 3, illustrated in Figure 3) matches the feature-vector sequence against the multiple templates of the words 9, 1 and 5 and the inter-word silence models. The resulting match score is the optimal distance between the input utterance and the word-templates. For speaker identification, one such score is computed for each speaker, and the speaker with the lowest score is declared the input speaker. For speaker verification, this score corresponds to the match between the input utterance and the 'claimed speaker' models. The score is normalized by the impostor score, computed between the input utterance and background word-templates, and the normalized score is compared to a threshold: the input speaker's claim is accepted if the normalized score is less than the threshold and rejected otherwise.
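The two decision rules above can be sketched as follows. The text does not state how the score is normalized by the impostor score; the ratio used here is an assumption, as are the function names and the example values.

```python
def identify_speaker(scores):
    """Return the speaker whose match score (a distance) is lowest.

    scores: mapping from speaker label to one-pass DP match score.
    """
    return min(scores, key=scores.get)


def verify_speaker(claimed_score, impostor_score, threshold):
    """Accept the claim if the normalized score is below the threshold.

    Normalization by division is an ASSUMPTION; the source only says
    the score is "normalized by the impostor score".
    """
    if impostor_score <= 0:
        raise ValueError("impostor score must be positive for ratio normalization")
    normalized = claimed_score / impostor_score
    return normalized < threshold


# Illustrative values only, not measured results.
example_scores = {"speaker_a": 12.5, "speaker_b": 30.1}
best = identify_speaker(example_scores)
accepted = verify_speaker(claimed_score=10.0, impostor_score=40.0, threshold=0.5)
```

Since the scores are distances, lower is better in both rules, which is why identification takes the minimum and verification accepts below the threshold.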

Figure 3 shows a typical matching between the input utterance (on the x-axis) and the word-templates (on the y-axis) for the example utterance '915' shown in Figure 1. Four templates of the word "1" are shown on the y-axis; the multiple templates of the words "9" and "5" are not shown for the sake of clarity. The best warping path obtained by the one-pass DP algorithm can be seen to have preferred token 2 of word 1 ($R_{1,2}$) as the best matching template for the word "1" part of the input utterance "915".

Figure 4 shows the DP recursion that operates in the one-pass DP algorithm for the case of multiple templates. The recursions for multiple templates are of two types: i) within-word recursion and ii) across-word recursion, as shown in Figure 4 in the context of the word sequence '915', with the across-word recursion being illustrated for the transition from any of the 4 templates of word '1' to one template of the word '5'.

These two recursions can be described by the following equations :

Within-word recursion

$$D(m,n,v) = d(m,n,v) + \min_{n-2 \le j \le n} D(m-1,j,v) \qquad (1)$$

Across-word recursion

$$D(m,1,v) = d(m,1,v) + \min\Big\{ D(m-1,1,v),\ \min_{u \in \mathrm{Pred}(v)} D(m-1,N_u,u) \Big\} \qquad (2)$$

Here, $D(m,n,v)$ is the minimum accumulated distortion of any path reaching the grid point defined by frame $n$ of the word-$v$ template and frame $m$ of the input utterance, and $d(m,n,v)$ is the local distance between the $m$-th frame of the input utterance and the $n$-th frame of the word-$v$ template.

The within-word recursion applies to all frames of the word-v template other than the starting frame (i.e., n > 1). The across-word recursion applies to frame 1 of any word v to account for a potential 'entry' into the word-v template from the last frame $N_u$ of any of the other words {u} that are valid predecessors of word v; i.e., Pred(v) is the set of valid predecessors of v, which are the multiple templates of the word preceding the word v in the 'password' text. For instance, if the 'password' text is 915 and v = 5, then

Pred(v=5) = {$R_{1,1}$, $R_{1,2}$, $R_{1,3}$, $R_{1,4}$}, i.e., the 4 templates of word 1.
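Recursions (1) and (2) can be sketched in code as follows. This is an illustrative reading, not the patented implementation: the Euclidean local distance, the start condition (a path begins at frame 1 of a template of the first word), the end condition (a path ends at the last frame of any template of the last word), and all names are assumptions. Silence templates are not modeled here.

```python
import math


def local_distance(x, y):
    """Euclidean distance between two feature vectors (assumed choice of d)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def one_pass_dp_score(utterance, password_templates):
    """Minimum accumulated distance between utterance and password.

    utterance: list of input feature vectors (frames m).
    password_templates: one entry per password word, in order; each
        entry is the list of that word's templates, and each template
        is a list of feature vectors (frames n).
    """
    inf = math.inf

    def empty_grid():
        return [[[inf] * len(t) for t in word] for word in password_templates]

    # Assumed start: first input frame matches frame 1 of a first-word template.
    prev = empty_grid()
    for k, template in enumerate(password_templates[0]):
        prev[0][k][0] = local_distance(utterance[0], template[0])

    for frame in utterance[1:]:
        cur = empty_grid()
        for w, word in enumerate(password_templates):
            # min over u in Pred(v) of D(m-1, N_u, u): the best exit from
            # any template of the preceding word
            entry = inf
            if w > 0:
                entry = min(column[-1] for column in prev[w - 1])
            for k, template in enumerate(word):
                for n, ref in enumerate(template):
                    if n == 0:
                        best = min(prev[w][k][0], entry)  # across-word, eq. (2)
                    else:
                        best = min(prev[w][k][max(0, n - 2):n + 1])  # within-word, eq. (1)
                    cur[w][k][n] = local_distance(frame, ref) + best
        prev = cur

    # Assumed end: last frame of any template of the last word.
    return min(column[-1] for column in prev[-1])
```

Only the previous input frame's column is kept in memory, which is what makes the search a single pass over the utterance. The template chosen for each word falls out of the minimization and does not have to be fixed in advance.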

Figure 5 shows a typical matching by the one-pass DP algorithm when inter-word silences are present and absent. The input utterance is '915' as in Figure 2, but spoken with silence before 9, between 1 and 5, and after 5; there is no inter-word silence between 9 and 1, which represents an inter-word co-articulation. The one-pass DP algorithm uses a concatenated template of '915' but with a silence template between adjacent words. The across-word recursion now allows entry into any word either from a silence template or from one of the multiple templates of the predecessor word. Figure 5 shows how this adaptation of the one-pass DP algorithm correctly decodes the input utterance by skipping the silence model between words 9 and 1. The other inter-word silences are mapped to the corresponding silence templates.

Figure 6 shows the general recursions operating in the one-pass DP algorithm for this case of inter-word silence and multiple templates per word; the corresponding DP recursions are given below.

Within-word recursion

$$D(m,n,v) = d(m,n,v) + \min_{n-2 \le j \le n} D(m-1,j,v) \qquad (3)$$

Across-word recursion

$$D(m,1,v) = d(m,1,v) + \min\Big\{ D(m-1,1,v),\ \min_{u \in \mathrm{Pred}'(v)} D(m-1,N_u,u) \Big\} \qquad (4)$$

Here, all terms are as in the recursion described above except for the definition of Pred'(v): Pred'(v) = {silence template $R_{\mathrm{sil}}$, Pred(v)}, where Pred(v) is as in the recursion described above. That is, the valid predecessors of any word v now include a silence template $R_{\mathrm{sil}}$ in addition to the multiple templates of the word preceding the word v in the 'password' text. For instance, if the 'password' text is 915 and v = 5, then Pred'(v=5) = {$R_{\mathrm{sil}}$, $R_{1,1}$, $R_{1,2}$, $R_{1,3}$, $R_{1,4}$}, and Pred'(v=1) = {$R_{\mathrm{sil}}$, $R_{9,1}$, $R_{9,2}$, $R_{9,3}$, $R_{9,4}$}. Likewise, the entry into a silence template in any part of the concatenated reference template is given by the across-word recursion described above.
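The predecessor sets Pred'(v) for the example password can be enumerated with a short sketch. The string labels, the keying by word position (so that repeated digits do not collide), and the name `build_predecessor_sets` are illustrative assumptions. The first word is left out because the text does not define its predecessor set.

```python
def build_predecessor_sets(password, templates_per_word, silence="R_sil"):
    """Map each word position (from the second word on) to Pred'(v).

    Pred'(v) holds the silence template plus every template of the
    word that precedes v in the password text.
    """
    predecessors = {}
    for position in range(1, len(password)):
        previous_word = password[position - 1]
        previous_templates = [
            f"R_{previous_word},{j}" for j in range(1, templates_per_word + 1)
        ]
        # the silence template may be entered or skipped, so both the
        # silence template and the previous word's templates are valid
        predecessors[position] = [silence] + previous_templates
    return predecessors


pred_915 = build_predecessor_sets("915", templates_per_word=4)
```

For the password 915 with four templates per word, position 2 (the word 5) receives the silence template and the four templates of word 1, matching the example given in the text.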

The score D(T,N_r,r), where T is the last frame of the input utterance and word-r is the last silence template (with N_r as its last frame), yields the minimum accumulated distance of the match between the input utterance and the 'password' text and is used as the score for the speaker whose word-templates were used. Use of this score for speaker identification or speaker verification is as described in the beginning of this section.

Further embodiments of the invention are as follows:

A method of applying a 'modified K-means' algorithm for selection of a small number of templates from a given pool of templates corresponding to multiple training sessions (over time) and multiple noise-type/noise-level conditions.

A method of applying the one-pass DP algorithm in forced alignment to excise isolated templates from connected word training data. This makes enrollment easy, as a new speaker need only speak connected word strings of particular training texts (rather than tedious repetitions of isolated words), and the extraction of templates is automatic.

A method of combining the automatic extraction with the modified K-means algorithm to yield a 'segmental K-means' algorithm that obtains a small, but effective, number of templates from connected word training data which has been subjected to noise-addition and noise-removal.

A method of obtaining the multi-session training data during testing by converting the input utterance into training templates based on a confidence measure of recognition. In this way, the system updates the user templates continuously over time and handles non-contemporary testing effectively; the new connected-word data can be integrated with old data for further selection of templates.
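The template-selection embodiment listed above does not spell out the 'modified K-means' procedure. The sketch below assumes one plausible reading: a medoid-based clustering in which each cluster is represented by the pool member with the smallest total distance to the other members, so that every selected template is an actual recording. The function name, the initialization from the first k pool entries, and the scalar example data are all hypothetical.

```python
def select_templates(pool, k, dist, max_iters=10):
    """Pick k representative templates from a pool (medoid-style sketch).

    pool: list of candidate templates.
    dist: function returning a distance between two templates, for
        example a DTW distance; any symmetric distance works here.
    """
    medoids = list(range(k))  # assumed initialization: first k entries
    for _ in range(max_iters):
        # assign every pool member to its nearest current medoid
        clusters = {m: [] for m in medoids}
        for i in range(len(pool)):
            nearest = min(medoids, key=lambda m: dist(pool[i], pool[m]))
            clusters[nearest].append(i)
        # re-centre each cluster on its most central member
        updated = []
        for m, members in clusters.items():
            if not members:  # keep a medoid that attracted no members
                updated.append(m)
                continue
            centre = min(
                members,
                key=lambda c: sum(dist(pool[c], pool[o]) for o in members),
            )
            updated.append(centre)
        if sorted(updated) == sorted(medoids):
            break
        medoids = updated
    return [pool[m] for m in sorted(medoids)]


# Scalar stand-ins for templates recorded in two different conditions.
chosen = select_templates([10, 12, 9, 50, 53, 49], 2, lambda a, b: abs(a - b))
```

Selecting pool members rather than averaged centroids avoids having to average sequences of different lengths, which is one reason a medoid-style variant is a natural fit for template data.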

The proposed system could be suitable for use in augmented reality applications or for deployment in a distributed fashion. It could also be implemented on a DSP, an FPGA or another embedded device.

Claims

1. System for speaker recognition comprising:

• at least one module (1) for feature extraction of an input utterance, converting the utterance into a sequence of input test feature vectors,

• at least one set of speaker specific multiple templates (2) comprising of a sequence of training feature vectors obtained from recorded speech during training, whereby the multiple templates are suitable for recognition of continuous input speech and

• means (3) for matching the sequence of input test feature vectors against the sets of speaker specific multiple templates (2).

2. System according to claim 1, wherein the multiple sequences of training feature vectors are repetitions of words from varying conditions.

3. System according to claim 2, wherein the multiple sequences of training feature vectors are repetitions of words spoken by the speaker recorded over a period of time and/or under different noise conditions.

4. System according to any one of the previous claims, wherein the input utterance comprises of a subset of words selected from a predefined vocabulary of words.

5. System according to any one of the previous claims, further comprising means for retrieving corresponding speaker specific multiple templates (2) of a password text uttered by speakers under verification.

6. System according to any one of the preceding claims, wherein the means (3) for matching the sequence of input test feature vectors against the sets of speaker specific multiple templates (2) are characterized in that a time warp and an alignment of the sequence of input test feature vectors and a sequence of training feature vectors obtained from the multiple templates (2) are carried out.

7. System according to claim 6, wherein a one-pass dynamic programming algorithm matches the sequence of input test feature vectors and the sequence of the training feature vectors obtained from the multiple templates (2), the result being a match score between the input utterance and the word templates for each speaker.

8. System according to any one of the previous claims, further comprising of silence templates, especially speaker independent silence templates, whereby the silence templates are located between the adjacent words, each having multiple templates.

9. System according to any one of the previous claims, wherein the means (3) for matching the sequence of input test feature vectors against the sets of speaker specific multiple templates (2) are characterized in that they allow for the silence templates to be skipped or not.

10. System according to any one of the previous claims, wherein the one-pass dynamic programming algorithm comprises of at least one dynamic programming recursion for operation on the multiple templates.

11. System according to claim 10, wherein the dynamic programming recursion is a within-word recursion for the matching of the sequence of input test feature vectors to interior portions of the multiple templates of the corresponding password text and an across-word recursion for transition from any of the multiple templates of one word to one of the multiple templates of another word.

12. System according to any one of the previous claims, further comprising means for changing the password text for any use.

13. Access control system for secure access comprising of a system according to any one of the claims 1 to 12.

14. Embedded device with a system according to any one of the claims 1 to 12.

15. Method for speaker recognition whereby:

• an input utterance is converted into a sequence of input test feature vectors,

• at least one set of speaker specific multiple templates (2) comprising of a sequence of training feature vectors is recorded from speech during training, whereby the multiple templates are suitable for recognition of continuous input speech and

• the sequence of input test feature vectors is matched against the sets of speaker specific multiple templates (2).

16. Method according to claim 15, whereby repetitions of words from varying conditions are recorded for the multiple sequences of training feature vectors.

17. Method according to claim 16, whereby repetitions of words spoken by the speaker recorded over a period of time and/or under different noise conditions are recorded for the multiple sequences of training feature vectors.

18. Method according to any one of the claims 15 to 17, whereby the input utterance is a subset of words selected from a predefined vocabulary of words.

19. Method according to any one of the claims 15 to 18, whereby corresponding speaker specific multiple templates of a password text uttered by speakers under verification are retrieved.

20. Method according to any one of the claims 15 to 19, whereby a time warp and an alignment of the sequence of input test feature vectors and a sequence of training feature vectors obtained from the multiple templates are carried out for matching the sequence of input test feature vectors against the sets of speaker specific multiple templates.

21. Method according to claim 20, whereby a one-pass dynamic programming algorithm matches the sequence of input test feature vectors and the sequence of the training feature vectors obtained from the multiple templates (2), the result being a match score between the input utterance and the word templates for each speaker, whereby at least one score is computed for each speaker and whereby the computed scores are used for speaker identification and/or verification.

22. Method according to any one of the claims 15 to 21, whereby silence templates, especially speaker independent silence templates, are located between the adjacent words, each having multiple templates.

23. Method according to any one of the claims 15 to 22, whereby the silence templates can be skipped or not for matching the sequence of input test feature vectors against the sets of speaker specific multiple templates (2).

24. Method according to any one of the claims 15 to 23, whereby at least one dynamic programming recursion in the one-pass dynamic programming algorithm operates on the multiple templates.

25. Method according to claim 24, whereby the dynamic programming recursion is a within-word recursion for matching the sequence of input test feature vectors to interior portions of the multiple templates of the corresponding password text and whereby the dynamic programming recursion is an across-word recursion for transition from any of the multiple templates of one word to one of the multiple templates of another word.

26. Method according to any one of the claims 15 to 25, whereby the password text is changed for any use.