WO2007048688A1 - System and method for variable-text speaker recognition - Google Patents

System and method for variable-text speaker recognition

Info

Publication number
WO2007048688A1
WO2007048688A1 (PCT/EP2006/067071)
Authority
WO
WIPO (PCT)
Prior art keywords
templates
speaker
feature vectors
sequence
word
Prior art date
Application number
PCT/EP2006/067071
Other languages
French (fr)
Inventor
Amitav Das
Viswanathan Ramasubramanian
Original Assignee
Siemens Aktiengesellschaft
Priority date
Filing date
Publication date
Application filed by Siemens Aktiengesellschaft filed Critical Siemens Aktiengesellschaft
Priority to EP06793966A priority Critical patent/EP1941495A1/en
Publication of WO2007048688A1 publication Critical patent/WO2007048688A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/22 Interactive procedures; Man-machine interfaces

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention proposes a system and method for text-dependent speaker recognition based on variable text, which is robust against background noise and intra-speaker variations over time. The system for speaker recognition comprises: at least one module (1) for feature extraction of an input utterance, converting the utterance into a sequence of input test feature vectors; at least one set of speaker specific multiple templates (2) comprising a sequence of training feature vectors obtained from recorded speech during training, whereby the multiple templates are suitable for recognition of continuous input speech; and means (3) for matching the sequence of feature vectors against the sets of speaker specific multiple templates (2).

Description

System and method for variable-text speaker recognition
Conventional speaker recognition techniques suffer from two major problems: a) background noise variation and b) intra-speaker variability, or non-contemporary testing. Varying background noise conditions (both noise types and noise levels) can drastically reduce the performance of any device applying such techniques. Intra-speaker variability refers to the variations seen in the speech of the same person recorded and tested over a long period of time. Moreover, there may also be distinct pronunciation variability for the same speaker across sessions. The speaker's voice may be influenced by physiological changes such as a cold infection, and different moods of the speaker may also influence the voice. Typically, the performance of conventional methods degrades when there is a large time gap between training and testing.
Also, in speaker-recognition systems, the use of a 'variable-text' mode of operation provides the flexibility of defining arbitrary passwords from a given vocabulary of words (such as the 10 digits). Passwords can then be dynamically defined and changed by the user at any time, or by the system as in a prompted mode of operation. This is in contrast to the 'fixed-text' mode of operation, which requires that the test password (a phrase or a sentence) be recorded for training every time it is defined anew or needs to be changed. A 'variable-text' mode of operation is thus the more secure and more flexible choice.
In a text-dependent speaker recognition system, during training, the system is trained with some passwords (speech signal segments) and the corresponding "text" information (information about what the password is) uttered by the various speakers. During actual usage of the system, given the uttered password and the corresponding text information, the system attempts to identify the speaker.
The password or "text" can be a "fixed text", in which the password is unique and fixed, one per user, or it can be a "variable text", in which the password can be dynamically formed by concatenation from a set of words [e.g. 9-3-1, from the words nine, three and one]. For a fixed-text system, if the password needs to be changed, the system has to be re-trained with the new password. For a variable-text system, it is easy to change the password, as the new password can be dynamically formed from the set of words.
Variable-text systems typically have a front-end "feature extractor", which extracts features from the speech signal segments at the input. A "back-end" classifier then uses these extracted features either to train the system (during the training phase) or to detect the speaker (during actual use). For such back-end classification, conventional text-dependent speaker recognition systems use either Hidden Markov Model (HMM) based methods or the Dynamic Time Warping (DTW) method.
DTW is essentially a method in which two signal segments of different time duration are compared, with the signals being suitably time-warped and aligned against each other to aid the comparison. This way, two signal segments can be compared even though they are not time-aligned or have different lengths.
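For illustration only (the patent does not prescribe an implementation), a minimal DTW sketch in Python, using Euclidean frame distances and the basic three-step recursion:

```python
import numpy as np

def dtw_distance(x, y):
    """Minimal DTW: align two feature sequences x (Tx, d) and y (Ty, d).

    Euclidean frame distances and the basic three-step recursion; real
    systems add slope constraints and better normalization.
    """
    Tx, Ty = len(x), len(y)
    D = np.full((Tx + 1, Ty + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Tx + 1):
        for j in range(1, Ty + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])   # local frame distance
            D[i, j] = cost + min(D[i - 1, j - 1],        # match
                                 D[i - 1, j],            # insertion
                                 D[i, j - 1])            # deletion
    return D[Tx, Ty] / (Tx + Ty)   # length-normalized alignment cost
```

The length normalization at the end makes scores comparable across utterances of different durations.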
Conventional speaker recognition systems, such as B. Yegnanarayana, S. R. Mahadeva Prasanna, J. Mariam Zachariah and C. S. Gupta, "Combining evidence from source, suprasegmental and spectral features for a fixed-text speaker verification system", IEEE Trans. Speech and Audio Processing, 2005, use a fixed-text mode of operation and a single-template based DTW method for the back-end classification.
In H. Ney, "The use of a one-stage dynamic programming algorithm for connected word recognition", IEEE Trans. on Acoustics, Speech and Signal Processing, vol. ASSP-32, no. 2, April 1984, a variation of DTW, called one-pass Dynamic Programming (DP), has been used for connected word recognition systems. There, the system is designed to detect words, not the speaker.
An object of the invention is to provide a system and method for text-dependent speaker recognition based on variable text which is robust against background noise and intra-speaker variations over time.
This object is achieved by a system for speaker recognition according to claim 1. The object is further achieved by a method for speaker recognition according to claim 16.
A method is presented which uses a variation of one-pass DP together with multiple templates to create a robust variable-text, text-dependent speaker identification/verification system.
The proposed technique uses a judicious combination of multiple templates and one-pass DP to solve the background noise and intra-speaker variability problems, as described next. The proposed one-pass DP based algorithm allows for variable-text operation, the use of multiple templates, and inter-word pauses to be present or absent, to handle the natural way the user speaks the password.
The use of multiple templates has the following advantages: 1. Multiple templates: A conventional system with a single template will fail to capture the variations due to intra-speaker variability, whereas the proposed system with multiple templates captures these variations and hence makes the system more robust and accurate.
2. Multiple-session templates for non-contemporary testing: When the training and testing phases are separated by a long gap of time, conventional speaker recognition systems with a single template will fail. The proposed system, with multiple templates captured over a period of time, adequately captures the variations over time.
3. Multiple templates for different noise conditions: If the system is to be deployed in different background conditions, where the noise levels and noise types differ, then a single template cannot capture all the various background noise conditions, and the performance of conventional single-template systems degrades severely. The proposed system, designed with multiple templates coming from various noise conditions and noise types, is more robust and provides consistently better performance across such a variety of background noise conditions.
The proposed one-pass DP based system and method also has the advantage of allowing inter-word pauses to be present in the input utterance corresponding to the connected word sequence comprising the password. Typically, users exercise no voluntary control over the manner in which a sequence of words is spoken continuously; the presence or absence of inter-word pauses (and their durations) is therefore unpredictable. In the proposed method, the one-pass DP algorithm is adapted to allow for inter-word pauses by using a silence template between every pair of adjacent words (each having multiple templates) and allowing this silence template to be skipped or not. This adds a high degree of convenience for the user and system reliability under varied degrees of inter-word co-articulation, as well as some freedom in the way the isolated word templates are defined (their end-pointing is no longer crucial).
The proposed system and method advantageously use a judicious combination of one-pass DP and multiple templates to deliver a robust variable-text, text-dependent speaker identification system with the flexibility to create a wide set of passwords easily and to use a "prompted" mode of operation. It uses multiple templates from multiple training sessions and delivers robustness against non-contemporary testing. A further advantage is the use of multiple noisy templates from multiple noise conditions in training, delivering robust performance under various background noise conditions. The proposed method and system incorporate ways to handle variable inter-word silence, thereby allowing graceful and effective handling of continuous speech utterances during actual usage.
The proposed method can be used in secure voice-based access control systems in buildings, cars, offices, shops and factories, where the user (or the system) can change access passwords from time to time, or in a prompted mode of speaker-recognition operation where, for high-security applications, the system uses randomly generated passwords every time it is used. The proposed method can also be used in applications where hands-free access control and command are done by voice and where, in addition, recognizing the person through the proposed speaker identification method grants [or, in case of an impostor, denies] access. Thus, it can be used in hands-free, secure, voice-driven control and command applications in buildings, machine shop floors, factories, cars, offices and roads, where there is heavy background noise.
The proposed method can also be used in virtual- or augmented-reality applications running in background noise, where voice is used for command and control and also to grant access to the system and to various levels of information content and control. For example, the system recognizes that the user is the supervisor and hence provides more information and control than it would to a team member.
Further advantageous features are disclosed in the dependent claims.
The following figures describe some possible embodiments of the present invention:
Fig 1: Example of conventional SINGLE-TEMPLATE DTW for variable-text speaker recognition,
Fig 2: System architecture of the proposed variable-text speaker-recognition system based on one-pass DP with multiple templates,
Fig 3: A typical matching between the input utterance and the word templates,
Fig 4: DP recursion that operates in the one-pass DP algorithm for the case of multiple templates,
Fig 5: A typical matching by the one-pass DP algorithm when inter-word silences are present and absent,
Fig 6: General recursions operating in the one-pass DP algorithm.
As shown in Figure 1, an input utterance of the password "915" is compared to a "915" template of a particular speaker, to see if it was spoken by that speaker. Note that this is an example of a variable-text system, where the "template" is formed by concatenation of the individual word templates of 9, 1 and 5. Also note that this is an example of conventional SINGLE-TEMPLATE DTW, where only one template is used per word.
Figure 2 shows the system architecture of the proposed variable-text speaker-recognition system based on one-pass DP with multiple templates 2 (both multi-session and multi-noise-type) per speaker. Here, each speaker has a set of templates (say, L templates R_i,j, j = 1, ..., L) for each word W_i. Given an input utterance, the feature extraction module 1 converts the utterance into a sequence of feature vectors (such as mel-frequency cepstral coefficients (MFCCs)). This feature vector sequence corresponds to the input 'password' text (say, the digit string 915 in the figure).
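As a hedged sketch of what feature extraction module 1 might do (MFCCs are named in the text; the librosa toolkit and the 16 kHz sampling rate are assumptions of this example):

```python
import librosa

def extract_features(wav_path, n_mfcc=13):
    """Module 1 sketch: convert an utterance into a sequence of MFCC vectors."""
    signal, sr = librosa.load(wav_path, sr=16000)               # assumed 16 kHz mono
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T                                               # shape (frames, n_mfcc)
```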
The one-pass DP algorithm 3 matches (Figure 3) the feature-vector sequence against the multiple templates of the words 9, 1 and 5 and the inter-word silence models. The resulting match score is the optimal distance between the input utterance and the word templates. For speaker identification, one such score is computed for each speaker, and the speaker with the lowest score is declared the input speaker. For speaker verification, this score corresponds to the match between the input utterance and the 'claimed speaker' models; it is normalized by the impostor score, computed between the input utterance and background word templates, and the normalized score is compared to a threshold; the input speaker claim is accepted if the normalized score is less than the threshold and rejected otherwise.
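The verification decision just described might be sketched as follows; treating "normalized by the impostor score" as a simple ratio is an assumption of this example, since the text does not define the normalization:

```python
def verify(claimed_score, impostor_score, threshold):
    """Accept the speaker claim if the normalized score is below the threshold.

    claimed_score: one-pass DP distance against the claimed speaker's templates;
    impostor_score: distance against background word templates. The ratio form
    of the normalization is an assumption of this sketch.
    """
    return (claimed_score / impostor_score) < threshold
```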
Figure 3 shows a typical matching between the input utterance (on the x-axis) and the word templates (on the y-axis) for the example utterance '915' shown in Figure 1. Four templates of the word "1" are shown on the y-axis; the multiple templates of the words "9" and "5" are omitted for clarity. The best warping path obtained by the one-pass DP algorithm can be seen to have preferred token 2 of word 1 (R_1,2) as the best matching template for the "1" part of the input utterance "915".
Figure 4 shows the DP recursion that operates in the one-pass DP algorithm for the case of multiple templates. The recursions for multiple templates are of two types: i) a within-word recursion and ii) an across-word recursion, shown in Figure 4 in the context of the word sequence '915', with the across-word recursion illustrated for the transition from any of the 4 templates of word '1' to one template of word '5'.
These two recursions can be described by the following equations:
Within-word recursion
D(m, n, v) = d(m, n, v) + min_{n-2 <= j <= n} D(m-1, j, v)    (1)
Across-word recursion
D(m, 1, v) = d(m, 1, v) + min{ D(m-1, 1, v), min_{u in Pred(v)} D(m-1, N_u, u) }    (2)
Here, D(m, n, v) is the minimum accumulated distortion of any path reaching the grid point defined by frame n of the word-v template and frame m of the input utterance, and d(m, n, v) is the local distance between the m-th frame of the input utterance and the n-th frame of the word-v template.
The within-word recursion applies to all frames of the word-v template other than the starting frame (i.e., n > 1). The across-word recursion applies to frame 1 of any word v, to account for a potential 'entry' into the word-v template from the last frame N_u of any of the words {u} which are valid predecessors of word v; i.e., Pred(v) is the set of valid predecessor templates of v, namely the multiple templates of the word preceding word v in the 'password' text. For instance, if the 'password' text is 915 and v = 5, then Pred(v=5) = {R_1,1, R_1,2, R_1,3, R_1,4}, i.e., the 4 templates of word 1.
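A compact Python sketch of recursions (1) and (2) with multiple templates per word follows; the data layout and names are assumptions, and the code is didactic (no path backtracking) rather than efficient:

```python
import numpy as np

def one_pass_dp(X, word_seq, templates):
    """One-pass DP over a password word sequence with multiple templates per word.

    X: input utterance features, shape (T, d); word_seq: e.g. ['9', '1', '5'];
    templates: dict word -> list of arrays, each (N_template, d).
    Implements recursions (1) and (2); layout and names are assumptions.
    """
    # flatten to (position in password, template array) units
    refs = [(w_idx, R) for w_idx, w in enumerate(word_seq) for R in templates[w]]
    T = len(X)
    D = [np.full((T, len(R)), np.inf) for _, R in refs]

    for m in range(T):
        for v, (w_idx, R) in enumerate(refs):
            for n in range(len(R)):
                d = float(np.linalg.norm(X[m] - R[n]))  # local distance d(m, n, v)
                if m == 0:
                    # a path may start only at frame 1 of a first-word template
                    if w_idx == 0 and n == 0:
                        D[v][m, n] = d
                elif n == 0:
                    # across-word recursion (2): stay in frame 1, or enter from
                    # the last frame of any template of the preceding word
                    best = D[v][m - 1, 0]
                    for u, (u_idx, Ru) in enumerate(refs):
                        if u_idx == w_idx - 1:  # Pred(v)
                            best = min(best, D[u][m - 1, len(Ru) - 1])
                    D[v][m, n] = d + best
                else:
                    # within-word recursion (1): min over j in {n-2, n-1, n}
                    prev = D[v][m - 1, max(0, n - 2):n + 1].min()
                    D[v][m, n] = d + prev

    # optimal distance: best path ending at the last frame of a last-word template
    return min(D[v][T - 1, len(R) - 1] for v, (w_idx, R) in enumerate(refs)
               if w_idx == len(word_seq) - 1)
```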
Figure 5 shows a typical matching by the one-pass DP algorithm when inter-word silences are present and absent. The input utterance is '915' as in Figure 2, but spoken with silence before 9, between 1 and 5, and after 5; there is no inter-word silence between 9 and 1, which represents an inter-word co-articulation. The one-pass DP algorithm uses a concatenated template of '915' but with a silence template between adjacent words. The across-word recursion now allows entry into any word either from a silence template or from one of the multiple templates of the predecessor words. Figure 5 shows how this adaptation of the one-pass DP algorithm correctly decodes the input utterance by skipping the silence model between words 9 and 1. The other inter-word silences are mapped to the corresponding silence templates.
Figure 6 shows the general recursions operating in the one-pass DP algorithm for this case of inter-word silence and multiple templates per word; the corresponding DP recursions are given below.
Within-word recursion
D(m, n, v) = d(m, n, v) + min_{n-2 <= j <= n} D(m-1, j, v)    (3)
Across-word recursion
D(m, 1, v) = d(m, 1, v) + min{ D(m-1, 1, v), min_{u in Pred'(v)} D(m-1, N_u, u) }    (4)
Here, all terms are as in the recursion described above, except for the definition of Pred'(v): Pred'(v) = {silence template R_sil} ∪ Pred(v), where Pred(v) is as before; i.e., the valid predecessors of any word v now include a silence template R_sil in addition to the multiple templates of the word preceding word v in the 'password' text. For instance, if the 'password' text is 915 and v = 5, then Pred'(v=5) = {R_sil, R_1,1, R_1,2, R_1,3, R_1,4} and Pred'(v=1) = {R_sil, R_9,1, R_9,2, R_9,3, R_9,4}. Likewise, the entry into a silence template in any part of the concatenated reference template is governed by the across-word recursion described above.
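One way to organize the concatenated reference so that each inter-word silence unit is skippable, mirroring Pred'(v) = {R_sil} ∪ Pred(v); the data structure below is a hypothetical illustration, not the patent's representation:

```python
def build_reference_sequence(word_seq, templates, silence_template):
    """Interleave skippable silence units between adjacent password words.

    Returns a list of (unit, template_list, skippable) triples; a decoder
    honouring Pred'(v) = {R_sil} union Pred(v) may pass through or bypass
    each silence unit. Hypothetical structure for illustration.
    """
    seq = [('sil', [silence_template], True)]            # optional leading silence
    for w in word_seq:
        seq.append((w, templates[w], False))             # word units are mandatory
        seq.append(('sil', [silence_template], True))    # skippable inter-word / trailing silence
    return seq
```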
The score D(T, N_r, r), where T is the last frame of the input utterance and word r is the last silence template (with N_r as its last frame), yields the minimum accumulated distance of the match between the input utterance and the 'password' text and is used as the score for the speaker whose word templates were used. The use of this score for speaker identification or speaker verification is as described at the beginning of this section.
Further embodiments of the invention are as follows: A method of applying a 'modified K-means' algorithm for the selection of a small number of templates from a given pool of templates corresponding to multiple training sessions (over time) and multiple noise-type/noise-level conditions.
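The 'modified K-means' step is not spelled out in the text; a K-medoids-style selection under a DTW (or other) distance is one plausible reading, sketched here:

```python
import numpy as np

def select_templates(pool, K, distance, iters=20, seed=0):
    """Pick K representative templates from a candidate pool (K-medoids style).

    pool: list of candidate templates; distance: pairwise distance function,
    e.g. a DTW distance. All names here are illustrative assumptions.
    """
    n = len(pool)
    dmat = np.array([[distance(a, b) for b in pool] for a in pool])
    rng = np.random.default_rng(seed)
    medoids = [int(i) for i in rng.choice(n, size=K, replace=False)]
    for _ in range(iters):
        assign = dmat[:, medoids].argmin(axis=1)  # nearest medoid per candidate
        new_medoids = []
        for k in range(K):
            members = np.where(assign == k)[0]
            if members.size == 0:
                new_medoids.append(medoids[k])    # keep an empty cluster's medoid
                continue
            # medoid = member minimizing total distance to its cluster
            within = dmat[np.ix_(members, members)].sum(axis=1)
            new_medoids.append(int(members[within.argmin()]))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return [pool[i] for i in medoids]
```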
A method of applying the one-pass DP algorithm in forced alignment to excise isolated templates from connected-word training data. This makes enrollment easy, as a new speaker need only speak connected word strings of particular training texts (rather than tedious repetitions of isolated words), and the extraction of templates is made automatic.
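Assuming the forced-alignment pass returns a per-frame word label via backtracking, the excision step could look like this hypothetical sketch:

```python
import numpy as np

def excise_word_templates(X, word_seq, frame_labels):
    """Cut isolated word templates out of a connected-word training utterance.

    Assumes a one-pass DP forced-alignment run produced 'frame_labels', the
    index into word_seq (or -1 for silence) per input frame, via backtracking.
    Hypothetical interface; assumes the words in word_seq are distinct.
    """
    labels = np.asarray(frame_labels)
    return {w: X[labels == i] for i, w in enumerate(word_seq)}
```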
A method of combining the automatic extraction with the modified K-means algorithm to yield a 'segmental K-means' algorithm that obtains a small, but effective, number of templates from connected-word training data which has been subjected to noise addition and noise removal.
A method of obtaining multi-session training data during testing by converting the input utterance into training templates based on a confidence measure of recognition. By this, the system updates the user templates continuously over time and handles non-contemporary testing effectively; the new connected-word data can be integrated with old data for further selection of templates.
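A minimal sketch of such a confidence-gated template update; the ratio test and the 0.5 threshold are illustrative assumptions, not from the patent:

```python
def maybe_update_templates(X_test, score, impostor_score, pool, accept_ratio=0.5):
    """Confidence-gated enrolment update (sketch).

    Adds the test utterance to the speaker's template pool only when the
    normalized score indicates a high-confidence match; the ratio test and
    the 0.5 threshold are illustrative assumptions.
    """
    if score / impostor_score < accept_ratio:
        pool.append(X_test)
    return pool
```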
The proposed system could be suitable for use in augmented reality or could be used in a distributed fashion. Implementation of the proposed system on a DSP, an FPGA or another embedded device would be possible.

Claims

1. System for speaker recognition comprising:
• at least one module (1) for feature extraction of an input utterance, converting the utterance into a sequence of input test feature vectors,
• at least one set of speaker specific multiple templates (2) comprising a sequence of training feature vectors obtained from recorded speech during training, whereby the multiple templates are suitable for recognition of continuous input speech, and
• means (3) for matching the sequence of input test feature vectors against the sets of speaker specific multiple templates (2).
2. System according to claim 1, wherein the multiple sequences of training feature vectors are repetitions of words from varying conditions.
3. System according to claim 2, wherein the multiple sequences of training feature vectors are repetitions of words spoken by the speaker recorded over a period of time and/or under different noise conditions.
4. System according to any one of the previous claims, wherein the input utterance comprises a subset of words selected from a predefined vocabulary of words.
5. System according to any one of the previous claims, further comprising means for retrieving corresponding speaker specific multiple templates (2) of a password text uttered by speakers under verification.
6. System according to any one of the preceding claims, wherein the means (3) for matching the sequence of input test feature vectors against the sets of speaker specific multiple templates (2) are characterized in that a time warp and an alignment of the sequence of input test feature vectors and a sequence of training feature vectors obtained from the multiple templates (2) are carried out.
7. System according to claim 6, wherein a one-pass dynamic programming algorithm matches the sequence of input test feature vectors and the sequence of the training feature vectors obtained from the multiple templates (2) , the result being a match score between the input utterance and the word templates for each speaker.
8. System according to any one of the previous claims, further comprising silence templates, especially speaker independent silence templates, whereby the silence templates are located between the adjacent words, each having multiple templates.
9. System according to any one of the previous claims, wherein the means (3) for matching the sequence of input test feature vectors against the sets of speaker specific multiple templates (2) are characterized in that they allow for the silence templates to be skipped or not.
10. System according to any one of the previous claims, wherein the one-pass dynamic programming algorithm comprises at least one dynamic programming recursion for operation on the multiple templates.
11. System according to claim 10, wherein the dynamic programming recursion is a within-word recursion for the matching of the sequence of input test feature vectors to interior portions of the multiple templates of the corresponding password text and an across-word recursion for transition from any of the multiple templates of one word to one of the multiple templates of another word.
12. System according to any one of the previous claims, further comprising means for changing the password text for any use.
13. Access control system for secure access comprising a system according to any one of claims 1 to 12.
14. Embedded device with a system according to any one of the claims 1 to 12.
15. Method for speaker recognition whereby:
• an input utterance is converted into a sequence of input test feature vectors,
• at least one set of speaker specific multiple templates (2) comprising a sequence of training feature vectors is recorded from speech during training, whereby the multiple templates are suitable for recognition of continuous input speech, and
• the sequence of input test feature vectors is matched against the sets of speaker specific multiple templates (2).
16. Method according to claim 15, whereby repetitions of words from varying conditions are recorded for the multiple sequences of training feature vectors.
17. Method according to claim 16, whereby repetitions of words spoken by the speaker recorded over a period of time and/or under different noise conditions are recorded for the multiple sequences of training feature vectors.
18. Method according to any one of the claims 15 to 17, whereby the input utterance is a subset of words selected from a predefined vocabulary of words.
19. Method according to any one of the claims 15 to 18, whereby corresponding speaker specific multiple templates of a password text uttered by speakers under verification are retrieved.
20. Method according to any one of the claims 15 to 19, whereby a time warp and an alignment of the sequence of input test feature vectors and a sequence of training feature vectors obtained from the multiple templates are carried out for matching the sequence of input test feature vectors against the sets of speaker specific multiple templates.
21. Method according to claim 20, whereby a one-pass dynamic programming algorithm matches the sequence of input test feature vectors and the sequence of the training feature vectors obtained from the multiple templates (2), the result being a match score between the input utterance and the word templates for each speaker, whereby at least one score is computed for each speaker and whereby the computed scores are used for speaker identification and/or verification.
22. Method according to any one of the claims 15 to 21, whereby silence templates, especially speaker independent silence templates, are located between the adjacent words, each having multiple templates.
23. Method according to any one of the claims 15 to 22, whereby the silence templates can be skipped or not for matching the sequence of input test feature vectors against the sets of speaker specific multiple templates (2).
24. Method according to any one of the claims 15 to 23, whereby at least one dynamic programming recursion in the one-pass dynamic programming algorithm operates on the multiple templates.
25. Method according to claim 24, whereby the dynamic programming recursion is a within-word recursion for matching the sequence of input test feature vectors to interior portions of the multiple templates of the corresponding password text and whereby the dynamic programming recursion is an across-word recursion for transition from any of the multiple templates of one word to one of the multiple templates of another word.
26. Method according to any one of the claims 15 to 25, whereby the password text is changed for any use.
PCT/EP2006/067071 2005-10-24 2006-10-05 System and method for variable-text speaker recognition WO2007048688A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP06793966A EP1941495A1 (en) 2005-10-24 2006-10-05 System and method for variable-text speaker recognition

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN2826DE2005 2005-10-24
IN2826/DEL/2005 2005-10-24

Publications (1)

Publication Number Publication Date
WO2007048688A1

Family

ID=37497008

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2006/067071 WO2007048688A1 (en) 2005-10-24 2006-10-05 System and method for variable-text speaker recognition

Country Status (2)

Country Link
EP (1) EP1941495A1 (en)
WO (1) WO2007048688A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7254316B1 (en) 2022-04-11 2023-04-10 株式会社アープ Program, information processing device, and method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5893902A (en) * 1996-02-15 1999-04-13 Intelidata Technologies Corp. Voice recognition bill payment system with speaker verification and confirmation
US5960392A (en) * 1996-07-01 1999-09-28 Telia Research Ab Method and arrangement for adaptation of data models
EP1164576A1 (en) * 2000-06-15 2001-12-19 Swisscom AG Speaker authentication method and system from speech models

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5893902A (en) * 1996-02-15 1999-04-13 Intelidata Technologies Corp. Voice recognition bill payment system with speaker verification and confirmation
US5960392A (en) * 1996-07-01 1999-09-28 Telia Research Ab Method and arrangement for adaptation of data models
EP1164576A1 (en) * 2000-06-15 2001-12-19 Swisscom AG Speaker authentication method and system from speech models

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
NEY H: "The use of a one-stage dynamic programming algorithm for connected word recognition", IEEE TRANSACTIONS ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, IEEE INC. NEW YORK, US, vol. ASSP-32, no. 2, April 1984 (1984-04-01), pages 263 - 271, XP002228868, ISSN: 0096-3518 *
QI LI ET AL: "Automatic Verbal Information Verification for User Authentication", IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, IEEE SERVICE CENTER, NEW YORK, NY, US, vol. 8, no. 5, September 2000 (2000-09-01), XP011054041, ISSN: 1063-6676 *
RAMASUBRAMANIAN V ET AL: "Text-Dependent Speaker-Recognition Using One-Pass Dynamic Programming Algorithm", ACOUSTICS, SPEECH AND SIGNAL PROCESSING, 2006. ICASSP 2006 PROCEEDINGS. 2006 IEEE INTERNATIONAL CONFERENCE ON TOULOUSE, FRANCE 14-19 MAY 2006, PISCATAWAY, NJ, USA,IEEE, 14 May 2006 (2006-05-14) - 19 May 2006 (2006-05-19), pages I-901 - I-904, XP010930326, ISBN: 1-4244-0469-X *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7254316B1 (en) 2022-04-11 2023-04-10 株式会社アープ Program, information processing device, and method
JP2023155684A (en) * 2022-04-11 2023-10-23 株式会社アープ Program, information processing device and method

Also Published As

Publication number Publication date
EP1941495A1 (en) 2008-07-09

Similar Documents

Publication Publication Date Title
US10950245B2 (en) Generating prompts for user vocalisation for biometric speaker recognition
US7447632B2 (en) Voice authentication system
Woo et al. The MIT mobile device speaker verification corpus: data collection and preliminary experiments
JP2002304190A (en) Method for generating pronunciation change form and method for speech recognition
CA2239339A1 (en) Method and apparatus for providing speaker authentication by verbal information verification using forced decoding
KR20010102549A (en) Speaker recognition
Ezzine et al. Moroccan dialect speech recognition system based on cmu sphinxtools
Ilyas et al. Speaker verification using vector quantization and hidden Markov model
Tuasikal et al. Voice activation using speaker recognition for controlling humanoid robot
Jung et al. Selecting feature frames for automatic speaker recognition using mutual information
JPH11231895A (en) Method and device speech recognition
JP4461557B2 (en) Speech recognition method and speech recognition apparatus
WO2007048688A1 (en) System and method for variable-text speaker recognition
Ramasubramanian et al. Text-dependent speaker-recognition systems based on one-pass dynamic programming algorithm
JPH0854891A (en) Device and method for acoustic classification process and speaker classification process
Al-Dahri et al. A word-dependent automatic Arabic speaker identification system
Bansal et al. Automatic Speaker Identification Using Vector Quantization, Medwell Journals, 2007
Wang et al. Robust Text-independent Speaker Identification in a Time-varying Noisy Environment.
Gupta et al. Text dependent voice based biometric authentication system using spectrum analysis and image acquisition
Phyu et al. Building Speaker Identification Dataset for Noisy Conditions
Phyu et al. Text Independent Speaker Identification for Myanmar Speech
Yang et al. User verification based on customized sentence reading
JP3919314B2 (en) Speaker recognition apparatus and method
Li et al. Speaker authentication
Kounoudes et al. Combined Speech Recognition and Speaker Verification over the Fixed and Mobile Telephone Networks.

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 2006793966

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE

WWP Wipo information: published in national office

Ref document number: 2006793966

Country of ref document: EP