JP3092788B2

JP3092788B2 - Speaker recognition threshold setting method and speaker recognition apparatus using the method

Info

Publication number: JP3092788B2
Application number: JP08004508A
Authority: JP
Inventors: Tomoko Matsui; Sadaoki Furui
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1996-01-16
Filing date: 1996-01-16
Publication date: 2000-09-25
Anticipated expiration: 2016-01-16
Also published as: JPH09198086A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a method for setting the threshold used for speaker decisions in a speaker recognition method, and to a speaker recognition apparatus to which the method is applied. Such speaker recognition is used, for example, to confirm from an input voice that the speaker is the same person as the holder of a personal identification number. The recognition method converts the input speech into a representation based on feature parameters, obtains the similarity between the input speech in that representation and speech models in the same representation registered in advance for each speaker, and thereby recognizes the speaker who uttered the input speech.

[0002]

[Related Art] FIG. 3 shows the functional configuration of a conventional apparatus, taking text-independent speaker recognition as an example. Speakers are registered first. Speech such as sentences uttered by each speaker (registration speech) is input from input terminal 11 to feature parameter extraction means 12 and converted into a representation based on feature parameters contained in the speech (for example, cepstrum and pitch). From the registration speech data, now a time series of feature parameters, model creation means 13 creates a model of that speech, for example a hidden Markov model (HMM; expressed, for example, as a weighted sum of multiple Gaussian distributions). An HMM can be created by, for example, the method described in Tomoko Matsui and Sadaoki Furui, "A comparative study of text-independent speaker recognition methods using VQ and discrete/continuous HMMs," IEICE Speech Technical Group report, SP91-89, 1991. The HMM obtained for each speaker in this way is registered in model storage unit 14 in association with that speaker.

[0003] When a speaker is to be recognized, the speaker's utterance is input from input terminal 11 to feature parameter extraction means 12 and converted into a time series of feature parameters. Similarity calculation means 15 computes the similarity between this time series and the HMM of each speaker stored in model storage unit 14. Speaker recognition decision means 16 compares the result with a threshold held in threshold storage unit 17, which is chosen to allow for the range over which the similarity of the true speaker's voice can vary. If the similarity is larger than the threshold, the input speech is judged to be that of the registered speaker whose HMM was used in the similarity calculation; if it is smaller, the speech is judged to be that of someone else. The decision is then output.

[0004] Conventionally, two error rates have been considered when setting the threshold: the false rejection rate and the impostor acceptance rate. The false rejection rate is obtained from a speaker recognition experiment that uses, out of all the registration speech, the speaker's own registration speech, and is the rate at which the true speaker is wrongly rejected. The impostor acceptance rate is obtained from a speaker recognition experiment that uses impostor speech, and is the rate at which an impostor is wrongly accepted. Depending on the purpose of speaker recognition, the false rejection rate may matter more than the impostor acceptance rate, or the reverse. When the purpose is not clear-cut, the value giving the equal error rate, at which the two rates coincide, has been taken as the optimal threshold on the basis of Bayes' theorem (the equal-error-rate threshold). As shown in FIG. 4A, curve 21, the false rejection rate, rises as the threshold is raised, whereas curve 22, the impostor acceptance rate, falls as the threshold is raised. In the conventional approach, the registered speakers other than the true speaker are treated as impostors, and speaker recognition is performed by computing the similarity of all the registration speech to each model (HMM) while varying the decision threshold. This recognition experiment yields the intersection of false rejection curve 21 and impostor acceptance curve 22 in FIG. 4A, that is, the threshold φ₀ at which both error rates take the same value ε₀. That value was set as the threshold; in other words, the equal-error-rate threshold based on the registration speech was used.
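The conventional procedure of paragraph [0004] can be sketched as a simple threshold sweep. This is a minimal illustration, not the apparatus itself: the function names, the use of observed scores as candidate thresholds, and the score values are all assumptions made for the example.

```python
def error_rates(genuine, impostor, threshold):
    """Return (false rejection rate, impostor acceptance rate) at a threshold.

    A trial is accepted when its similarity is larger than the threshold.
    """
    frr = sum(s <= threshold for s in genuine) / len(genuine)
    far = sum(s > threshold for s in impostor) / len(impostor)
    return frr, far


def equal_error_threshold(genuine, impostor):
    """Sweep candidate thresholds and return (phi0, eps0).

    Candidates are the observed scores (an assumption); the candidate that
    minimises |FRR - FAR| approximates the crossing of the two curves.
    """
    def gap(th):
        frr, far = error_rates(genuine, impostor, th)
        return abs(frr - far)

    phi0 = min(sorted(set(genuine) | set(impostor)), key=gap)
    frr, far = error_rates(genuine, impostor, phi0)
    return phi0, (frr + far) / 2


# Made-up similarity scores, for illustration only.
genuine = [2.1, 2.4, 1.9, 2.8, 2.2, 1.2, 2.6, 2.0]
impostor = [0.5, 1.1, 0.9, 1.5, 0.7, 2.05, 1.0, 0.3]
phi0, eps0 = equal_error_threshold(genuine, impostor)
print(phi0, eps0)
```

With real data the two rates rarely coincide exactly at an observed score, which is why the sketch reports the mean of the two rates at the closest candidate.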

[0005]

[Problems to Be Solved by the Invention] However, when a speaker's model is not sufficiently robust to differences in utterance content, utterance variation, and the like, the similarity between the model and the speaker's own speech used to build it (the speaker's registration speech) is generally larger than the similarity between the model and the speech the speaker utters at recognition time. Consequently, if the false rejection curve is computed as a function of the threshold for the speech the speaker utters at recognition time, the result, for example dotted curve 23 in FIG. 4A, is worse than false rejection curve 21 obtained from the registration speech: the rejection rate is higher at any given threshold. The equal-error-rate threshold φ₀ based on the registration speech is therefore larger than the equal-error-rate threshold based on the recognition speech, and using φ₀ for recognition raises the false rejection rate.

[0006] In addition, because the amount of registration speech available from each speaker is limited, the speaker's model is often not sufficiently robust to differences in utterance content and utterance variation, and the false rejection rate cannot be estimated reliably. Furthermore, a speaker's voice varies from utterance to utterance, and varies considerably over periods of two to three months. To maintain high recognition performance, it is therefore desirable to have each speaker provide speech at regular intervals and to update the model. Updating the model changes both the false rejection characteristic and the impostor acceptance characteristic, so it is desirable to reset the threshold whenever the model is updated.

[0007]

[Means for Solving the Problems] In the method of claim 1, the threshold is set to a value obtained by subtracting a predetermined value from the equal-error-rate threshold found in a speaker recognition experiment that uses the registration speech and treats registered speakers other than the true speaker as impostors. In other words, the threshold is set to a value that gives a higher impostor acceptance rate than the equal-error-rate threshold does. This impostor acceptance rate exceeds the one at the equal-error-rate threshold by roughly the upper limit of the system error rate of the speaker recognition method. With this configuration, the false rejection rate does not become excessive even when the model is not robust.

[0008] In the method of claim 2, the model is updated periodically. At each update, a speaker recognition experiment is run with the update speech and the updated model, again treating registered speakers other than the true speaker as impostors, and the new threshold is obtained by subtracting from the resulting equal-error-rate threshold a value that is smaller than the predetermined value and that decreases with the number of updates. As updates accumulate, the model becomes progressively more robust to differences in utterance content and utterance variation, and the threshold moves asymptotically from the value giving the higher impostor acceptance rate toward the ideal equal-error-rate threshold for recognition speech that an ideal model would give.

[0009] An example of updating the threshold at each model update follows. The threshold φ is set according to

φ = wφ₁ + (1 − w)φ₀  (1)

Here φ₀ is the equal-error-rate threshold obtained in a speaker recognition experiment that uses the registration speech and treats registered speakers other than the true speaker as impostors, that is, the threshold determined first. φ₁ is the threshold at which, according to the relationship between impostor acceptance rate and threshold (FIG. 4A), the impostor acceptance rate equals (ε₀ + x)% (for example, x = 1%). The rate (ε₀ + x)% corresponds to the upper limit of the impostor acceptance rate (the estimated system error rate) inferred from the performance of the speaker recognition method. w is a parameter that controls how quickly the threshold approaches the equal-error-rate threshold as the speaker's model is updated, and can be defined, for example, by the following equation.

[0010]

w = 2 / (1 + exp(0.25t))  (2)

Here t is the number of times the speaker's model has been updated (t = 0, 1, 2, ...); this expression was determined experimentally. According to equations (1) and (2), t = 0 corresponds to when the speaker recognition apparatus is first built or when the set of speakers to be recognized is entirely replaced, that is, to the threshold first determined from the registration speech, which is set Δφ (= φ₀ − φ₁) below the equal-error-rate threshold φ₀. In general, as the number of model updates increases, w decreases, Δφ also decreases, and the threshold approaches φ₀. Note that φ₀ and φ₁ are also recomputed at every model update.
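To make the behaviour of equations (1) and (2) concrete, the following sketch tabulates w and the blended threshold for a few update counts. The function names and the values of φ₀ and φ₁ are assumptions for illustration; in the described method φ₀ and φ₁ are recomputed at every update rather than held fixed as they are here.

```python
import math


def blend_weight(t: int) -> float:
    """Equation (2): w = 2 / (1 + exp(0.25 t)), with t the update count."""
    return 2.0 / (1.0 + math.exp(0.25 * t))


def blended_threshold(phi0: float, phi1: float, t: int) -> float:
    """Equation (1): phi = w * phi1 + (1 - w) * phi0."""
    w = blend_weight(t)
    return w * phi1 + (1.0 - w) * phi0


# Illustrative values only: phi1 lies below phi0 because a higher impostor
# acceptance rate corresponds to a lower threshold.
phi0, phi1 = 1.5, 1.2
for t in range(0, 9, 2):
    print(t, round(blend_weight(t), 3), round(blended_threshold(phi0, phi1, t), 3))
```

At t = 0 the weight is exactly 1, so the threshold equals φ₁; as t grows the weight decays toward 0 and the threshold moves toward φ₀.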

[0011]

[Embodiments of the Invention] FIG. 1 shows the processing sequence in an embodiment of the method of the invention, and FIG. 2 shows the functional configuration of an embodiment of the apparatus, with parts corresponding to those in FIG. 3 given the same reference numerals. In the embodiment of FIG. 2, the following are added: feature-parameter temporary storage unit 25, which temporarily holds the feature-parameter time series of each input utterance at registration and at model update; model update means 26, which updates the model in model storage unit 14 when a model update instruction is issued; and threshold calculation means 27, which computes the threshold at registration and at each model update and updates the threshold in threshold storage unit 17.

[0012] When registration speech or update speech is input to input terminal 11, feature parameter extraction means 12 converts it into a time series of feature parameters, as shown in FIGS. 1 and 2 (S₁). At registration, model creation means 13 creates a model of the speech; at a model update, the corresponding model in model storage unit 14 is updated using the feature-parameter time series of the update speech (S₂).

[0013] For this model update, the feature-parameter time series of the registration speech and of each update speech may be retained for every speaker; a new model is then created from all the retained time series together with that of the newly input update speech, and the corresponding model in model storage unit 14 is replaced. Alternatively, when the model is an HMM, Bayesian estimation may be used: the HMM parameter vector θ is estimated so as to maximize the product of the likelihood f(X|θ) of the update speech's feature-parameter time series X under the corresponding speaker's HMM and the prior probability density function g(θ), which reflects the characteristics of the speech uttered so far, and that θ defines the new HMM.
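Written as a formula, the Bayesian update in paragraph [0013] is a maximum a posteriori estimate. The hat notation for the updated parameter vector is an assumption introduced here; f, g, X, and θ are as defined in the text.

```latex
\hat{\theta} = \arg\max_{\theta} \; f(X \mid \theta)\, g(\theta)
```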

[0014] Next, the equal error rate ε₀ and its threshold φ₀ are computed, using the registration speech at registration and the update speech at a model update (S₃). The feature-parameter time series of this speech are held temporarily in feature-parameter temporary storage unit 25, and similarity calculation means 15 computes their similarity to each model in model storage unit 14. Speaker recognition decision means 16 then makes decisions on these similarities for a range of thresholds. This amounts to a speaker recognition experiment using the registration speech (or update speech) with registered speakers other than the true speaker as impostors. It yields the false rejection curve and impostor acceptance curve of FIG. 4A, from which the error rate ε₀ at which the two rates are equal and the corresponding threshold φ₀ are obtained.

[0015] The threshold φ₁ at which the impostor acceptance rate equals (ε₀ + x)% is then found (S₄), and the new threshold φ is computed as wφ₁ + (1 − w)φ₀ (S₅). This new threshold φ is stored as the threshold for the corresponding speaker in threshold storage unit 17. Finally, the model update count t is incremented by 1 and processing ends (S₆). Steps S₃, S₄, S₅, and S₆ are carried out by threshold calculation means 27.
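Steps S₄ to S₆ can be sketched as follows, taking the equal-error-rate threshold φ₀ and rate ε₀ from step S₃ as given. This is an illustration under stated assumptions: the function names are hypothetical, rates are expressed as fractions rather than percentages, φ₁ is taken as the lowest observed impostor score whose acceptance rate does not exceed the target, and the scores are made up. The large x in the example reflects the tiny sample; the experiment in this document uses x = 1%.

```python
import math


def threshold_for_far(impostor, target_far):
    """Step S4: lowest observed impostor score at which the impostor
    acceptance rate (fraction of scores strictly above it) is <= target_far."""
    for th in sorted(set(impostor)):
        far = sum(s > th for s in impostor) / len(impostor)
        if far <= target_far:
            return th
    return max(impostor)  # not reached for target_far >= 0; kept for safety


def update_threshold(phi0, eps0, impostor, t, x=0.01):
    """Return (new threshold, incremented update count) for one speaker."""
    phi1 = threshold_for_far(impostor, eps0 + x)   # step S4
    w = 2.0 / (1.0 + math.exp(0.25 * t))           # equation (2)
    phi = w * phi1 + (1.0 - w) * phi0              # step S5, equation (1)
    return phi, t + 1                              # step S6


# Made-up impostor scores and step-S3 results, for illustration only.
impostor = [0.5, 1.1, 0.9, 1.5, 0.7, 2.05, 1.0, 0.3]
phi, t = update_threshold(phi0=1.5, eps0=0.125, impostor=impostor, t=0, x=0.125)
print(phi, t)
```

In a full implementation the returned threshold would be written to the per-speaker entry in the threshold store, and the scores would be recomputed against the updated model before each call.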

[0016] In FIG. 1, at t = 0 the threshold is computed from the registration speech uttered at registration, and the threshold φ₁, at which the impostor acceptance rate is x% higher than the equal error rate ε₀, is set as the threshold. The false rejection rate therefore does not become excessive even if the model is not robust. At every subsequent model update, a speaker recognition experiment is run on the updated model using the update speech, with registered speakers other than the true speaker as impostors. The resulting equal-error-rate threshold is close to that of a model that has become more robust; at the same time w becomes smaller, so the threshold that is set lies below this near-ideal equal-error-rate threshold by a smaller margin. The more often the model is updated, the closer the threshold comes to the desirable value.

[0017]

[Effects of the Invention] An experiment conducted to confirm the effect of the invention is described next. The data were sentences (average length 4 seconds per sentence) uttered by 20 male speakers in five sessions (A, B, C, D, and E) spread over about 15 months. Ten of the men served as registered speakers and the other ten as impostors. The speech was converted into a conventional feature, namely a short-time sequence of cepstra. The cepstrum was extracted at a sampling frequency of 12 kHz with a frame length of 32 ms, a frame period of 8 ms, and LPC (linear predictive coding) analysis of order 16. Ten sentences uttered in session A were used for registration. Ten sentences uttered in session B were used for the first update and ten sentences uttered in session C for the second update. For testing, five sentences uttered in sessions D and E were used one sentence at a time; that is, the model and threshold obtained from each of sessions A, B, and C were each tested five times. The threshold was set with x = 1%.
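The analysis conditions in paragraph [0017] translate into frame sizes as shown below. The constant and function names are assumptions, as is the choice to count only complete frames with no padding at the end of the utterance; the document does not say how utterance boundaries were handled.

```python
SAMPLE_RATE_HZ = 12_000   # sampling frequency from the experiment
FRAME_LENGTH_MS = 32      # analysis window length
FRAME_PERIOD_MS = 8       # shift between successive windows

frame_length = SAMPLE_RATE_HZ * FRAME_LENGTH_MS // 1000  # samples per frame
frame_shift = SAMPLE_RATE_HZ * FRAME_PERIOD_MS // 1000   # samples per shift


def frame_count(num_samples: int) -> int:
    """Number of complete frames in a signal (no end padding, an assumption)."""
    if num_samples < frame_length:
        return 0
    return 1 + (num_samples - frame_length) // frame_shift


# An average 4-second sentence, as in the experiment description.
print(frame_length, frame_shift, frame_count(4 * SAMPLE_RATE_HZ))
```

Under these assumptions each frame spans 384 samples, successive frames start 96 samples apart, and a 4-second sentence yields 497 cepstral vectors.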

[0018] The effect of the invention was tested on text-independent speaker recognition (see, for example, Tomoko Matsui and Sadaoki Furui, "A comparative study of text-independent speaker recognition methods using VQ and discrete/continuous HMMs," IEICE Speech Technical Group report, SP91-89, 1991). Each speaker's HMM had a single state represented by a weighted sum of 64 Gaussian distributions (see, for example, the same reference).
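A single-state HMM whose output is a weighted sum of Gaussians scores each frame independently with the mixture density, so a similarity can be sketched as below. Several points are assumptions not stated in this document: diagonal covariances, the average per-frame log-likelihood as the similarity measure, and all function names. The experiment used 64 components; the code accepts any number.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def gmm_log_likelihood(x, weights, means, variances):
    """Log of the weighted sum of component densities (log-sum-exp form)."""
    logs = [math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    peak = max(logs)
    return peak + math.log(sum(math.exp(value - peak) for value in logs))


def average_log_likelihood(frames, weights, means, variances):
    """Mean per-frame log-likelihood, used here as the similarity score."""
    total = sum(gmm_log_likelihood(f, weights, means, variances) for f in frames)
    return total / len(frames)


# Tiny two-component, two-dimensional example with made-up parameters.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [1.0, 1.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
frames = [[0.1, -0.2], [0.9, 1.1]]
print(average_log_likelihood(frames, weights, means, variances))
```

The log-sum-exp step avoids underflow when component densities are very small, which matters for cepstral vectors of realistic dimension.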

[0019] Results were evaluated by the average of the false rejection rate and the impostor acceptance rate, and are shown in FIG. 4B. "Conventional method" denotes the result obtained with the equal-error-rate threshold from a speaker recognition experiment using all the registration speech with registered speakers other than the true speaker as impostors. The figure shows that the method of the invention performs better than the conventional method. These results demonstrate that the method of the invention is effective.

[Brief description of the drawings]

FIG. 1 is a flowchart showing the processing procedure in an embodiment of the method of the invention.

FIG. 2 is a block diagram showing the functional configuration of an embodiment of the apparatus of the invention.

FIG. 3 is a block diagram showing the functional configuration of a conventional speaker recognition apparatus.

FIG. 4: A is a graph showing the false rejection rate and the impostor acceptance rate as functions of the threshold; B is a graph showing the experimental results that illustrate the effect of the invention.

Continuation of the front page

(56) References
JP-A-8-123475 (JP, A)
JP-A-62-209500 (JP, A)
JP-A-47-41103 (JP, A)
JP-A-7-248791 (JP, A)
JP-B-63-29279 (JP, B2)
S. Furui, "Cepstral Analysis Technique for Automatic Speaker Verification," IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-29, No. 2, April 1981, pp. 254-272
Tomoko Matsui et al., "A study of model and threshold updating methods in speaker recognition," Proceedings of the 1996 Spring Meeting of the Acoustical Society of Japan, I, 1-5-6, pp. 11-12 (issued March 26, 1996)
Tomoko Matsui et al., "Normalization of speaker verification likelihood using phoneme- and speaker-independent models," IEICE Technical Report [Speech], Vol. 94, No. 90, SP94-22, pp. 61-66 (issued June 16, 1994)
Tomoko Matsui et al., "Model and threshold updating methods in speaker verification," IEICE Technical Report [Speech], Vol. 95, No. 468, SP95-120, pp. 21-26 (issued January 19, 1996)
Tomoko Matsui et al., "Model and threshold updating methods in speaker verification," Transactions of the IEICE, Vol. J81-D-II, No. 2, February 1998, pp. 268-276 (issued February 25, 1998)

(58) Fields searched (Int. Cl.⁷, DB name)
G10L 15/00 - 17/00
JICST file (JOIS)

Claims

(57) [Claims]

1. A method of setting a threshold for speaker recognition, for use in a speaker recognition method in which input speech is converted into a representation using feature parameters, the similarity between the input speech in that representation and speech models in the same representation registered in advance for each speaker is obtained, and the speaker who uttered the input speech is recognized by comparing the similarity with a speaker decision threshold, the method comprising: calculating two error rates, the false rejection rate and the impostor acceptance rate, using the speech uttered at registration and the registered models; and setting the speaker decision threshold to a value obtained by subtracting a predetermined value from the threshold at which the two calculated error rates become equal.

2. The method of setting a threshold for speaker recognition according to claim 1, wherein each time the model for a speaker is updated, the two error rates are calculated using the updated model and the speech uttered at the time of the update, and the speaker decision threshold is updated to a value obtained by subtracting, from the threshold at which the two calculated error rates become equal, a value that is smaller than the predetermined value and smaller than the value subtracted previously.

3. The method according to claim 1, wherein the predetermined value is substantially equal to an upper limit of an error rate of the speaker recognition method itself.

4. A speaker recognition apparatus in which input speech is converted by feature parameter extraction means into a representation using feature parameters; a model of the input speech in that representation is created by model creation means and stored in a model storage unit; similarity calculation means calculates the similarity between speech in the representation converted by the feature parameter extraction means and each model in the model storage unit; and speaker recognition decision means compares each calculated similarity with a threshold, stored in a threshold storage unit, that indicates the range of variation of similarity regarded as the true speaker's voice, judging the speech to be the true speaker's when the similarity is higher and another person's when it is lower, the apparatus comprising: model update means for updating, when a model update instruction is issued, the model of the corresponding speaker in the model storage unit using input speech in the feature-parameter representation from the feature parameter extraction means; and threshold calculation means for calculating the false rejection rate and the impostor acceptance rate of the updated model for the speech used in the update, and updating the threshold of the corresponding speaker in the threshold storage unit to a value obtained by subtracting a slightly smaller value from the threshold at which the two rates become equal.