JPH08123475A

JPH08123475A - Method and device for speaker collation

Info

Publication number: JPH08123475A
Application number: JP6265856A
Authority: JP
Inventors: Katsutake Bin; 雄偉閔
Original assignee: N T T DATA TSUSHIN KK; NTT Data Communications Systems Corp
Current assignee: N T T DATA TSUSHIN KK; NTT Data Corp
Priority date: 1994-10-28
Filing date: 1994-10-28
Publication date: 1996-05-17
Anticipated expiration: 2015-07-04
Also published as: JP3058569B2

Abstract

PURPOSE: To provide a speaker verification device that takes intra-speaker variation into account and estimates in advance, in a short time and from a small quantity of feature samples, a threshold that adapts to changes over time in voice features. CONSTITUTION: Speaker-specific codebooks for each period are created from the speakers' voice features at different periods A to C and stored in a codebook file 100. When the threshold is determined, a speaker-specific codebook 101 for period D is created for a first speaker, and the intra-codebook distance and the inter-codebook distance of the first speaker are derived from this codebook and the stored past codebooks of the same speaker and of other speakers. Using correlation values obtained for each speaker at predetermined time intervals, a time-difference intra-speaker distance is obtained from the intra-codebook distance and a time-difference inter-speaker distance is obtained from the inter-codebook distance. An initial threshold is then adjusted so that these intra-speaker and inter-speaker distances yield an equal error rate.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to speaker verification technology used in speaker recognition systems and speech recognition systems. In particular, it relates to a method for estimating, in a short time and from a small quantity of feature samples, a speaker-specific threshold that adapts to changes over time in the features of a speaker's voice, and to an apparatus that implements the method.

[0002]

[Prior Art] A speaker verification device determines whether the identification name claimed by a speaker matches the speaker's true identification name. Normally, the speaker identification names to be verified and the codebook corresponding to each identification name are registered in advance. At verification time, the speaker's actual voice and an identification name are input, and the codebook designated by that identification name is compared with the actual voice to detect the feature difference between them. If this feature difference is less than or equal to a predetermined threshold set for each speaker, the claimed identification name is judged to be the true one and the speaker is judged to be the genuine person. Otherwise, the claimed identification name is judged to be false and the speaker is judged to be an impostor. In speaker verification, therefore, the choice of the speaker-specific threshold is important, and whether this value is appropriate strongly affects the speaker identification rate.

[0003] Recognition errors in speaker verification fall broadly into two kinds. In the first, the speaker claims the true identification name but the name is nevertheless recognized as false; the rate of this error is called the false rejection rate (FRR). In the second, the speaker claims a false name but it is nevertheless recognized as the true identification name; the rate of this error is called the false acceptance rate (FAR). If the speaker-specific threshold is raised, the speaker is more likely to be judged genuine even when the feature difference is large, so the FRR falls but the FAR rises. Conversely, if the threshold is lowered, the FAR falls but the FRR rises. The FRR and the FAR thus trade off against each other: when one goes down, the other goes up. Since the recognition error rate is expressed as the average of the two, it is preferable to adjust the speaker-specific threshold so that this average is as small as possible.
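The trade-off described above can be illustrated with a minimal Python sketch. The decision rule (accept when the distance is less than or equal to the threshold) follows the text; the function name `frr_far` and the sample distances are hypothetical and serve only as an illustration.

```python
def frr_far(intra, inter, threshold):
    """Return (FRR, FAR) for the rule "accept if distance <= threshold".

    intra: distances measured for genuine-speaker trials
    inter: distances measured for impostor trials
    """
    # A genuine trial is falsely rejected when its distance exceeds the threshold.
    frr = sum(d > threshold for d in intra) / len(intra)
    # An impostor trial is falsely accepted when its distance is within the threshold.
    far = sum(d <= threshold for d in inter) / len(inter)
    return frr, far


# Hypothetical distances, for illustration only.
intra = [0.30, 0.35, 0.42, 0.50]
inter = [0.45, 0.60, 0.70, 0.85]

for t in (0.40, 0.55):
    print(t, frr_far(intra, inter, t))
```

With these sample values, raising the threshold from 0.40 to 0.55 lowers the FRR and raises the FAR, which is the behavior the paragraph describes.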

[0004] Various methods for determining this threshold have been proposed. The first is the equal error rate setting method, which sets the speaker-specific threshold so that the FRR and the FAR are equal. It is introduced in Chapter 9 of "Digital Speech Processing" (author: Sadaoki Furui; publisher: Tokai University Press). FIG. 4 is a block diagram for realizing this method. Training speech from the genuine speaker and from impostors is input to a speech input terminal 400, and a preprocessing unit 401 stores each utterance as speech frames of a fixed duration. A feature extraction unit 402 extracts the features of each utterance. A vector quantization unit 403 vector-quantizes the extracted features using the speaker codebook 404 corresponding to the identification name, and the resulting distortion distances of the code vectors within the same speaker (hereinafter, intra-speaker distance) and between different speakers (hereinafter, inter-speaker distance) are stored in an intra-speaker/inter-speaker distance storage unit 405. An FRR/FAR calculation unit 406 calculates the FRR from the intra-speaker distances and a predetermined initial threshold, and calculates the FAR from the inter-speaker distances and the same initial threshold. An FRR/FAR comparison unit 408 compares the FRR and the FAR; if they are not equal, a threshold adjustment unit 407 adjusts the initial threshold and processing returns to the FRR/FAR calculation unit 406. When the FRR and the FAR become equal, the adjustment ends and the resulting value is output as the threshold for that speaker.
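The adjustment loop of the equal error rate setting method can be sketched as follows. The source does not say how the threshold is adjusted at each step, so this sketch assumes a bisection search, and it stops when the search interval is smaller than a tolerance because a finite sample may never give exactly equal rates. The names `frr_far`, `equal_error_threshold`, and `tol`, and the sample distances, are hypothetical.

```python
def frr_far(intra, inter, threshold):
    """Return (FRR, FAR) for the rule "accept if distance <= threshold"."""
    frr = sum(d > threshold for d in intra) / len(intra)
    far = sum(d <= threshold for d in inter) / len(inter)
    return frr, far


def equal_error_threshold(intra, inter, tol=1e-6):
    """Search for the threshold at which FRR and FAR cross (assumed bisection)."""
    lo = min(min(intra), min(inter))
    hi = max(max(intra), max(inter))
    while hi - lo > tol:
        mid = (lo + hi) / 2
        frr, far = frr_far(intra, inter, mid)
        if frr > far:
            lo = mid   # too many rejections: raise the threshold
        else:
            hi = mid   # FRR no longer exceeds FAR: try a lower threshold
    return hi


# Hypothetical distances, for illustration only.
intra = [0.30, 0.35, 0.42, 0.50, 0.58]
inter = [0.45, 0.55, 0.60, 0.70, 0.85]

t = equal_error_threshold(intra, inter)
print(t, frr_far(intra, inter, t))
```

Bisection works here because the FRR never increases and the FAR never decreases as the threshold rises, so their difference changes sign at most once.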

[0005] A second known method sets the threshold in consideration of the distribution of inter-speaker distances (see S. Furui, "Cepstral Analysis Technique for Automatic Speaker Verification," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-29, no. 2, pp. 254-272, April 1981). FIG. 5 is a block diagram for realizing this method. After training speech is input to a speech input terminal 500, the configuration from a preprocessing unit 501 through a feature extraction unit 502, a vector quantization unit 503, and a speaker codebook 504 is the same as in FIG. 4. The characteristic feature of this method is that the inter-speaker distances obtained by the vector quantization unit 503 are stored in an inter-speaker distance storage unit 505, an inter-speaker standard deviation calculation unit 506 and an inter-speaker mean calculation unit 507 compute the standard deviation and the mean of the inter-speaker distances, respectively, and a threshold calculation unit 508 derives the threshold from the resulting statistical parameters.

[0006] A third method is the "speaker verification method and apparatus" disclosed by the present inventors (see the specification of Japanese Patent Application No. Hei 6-41615). From a plurality of codebooks, this method selects the genuine speaker's codebook corresponding to the identification name claimed by the speaker and the codebooks of other speakers, calculates statistics of the feature difference between a predetermined quantity of code vectors generated from the other speakers' codebooks and the genuine speaker's codebook, and obtains the threshold from these statistics. In other words, it is characterized by converting inter-codebook distance into inter-speaker distance. This method is realized by blocks 600 to 612 of FIG. 6.

[0007]

[Problems to Be Solved by the Invention] Each of the above methods determines the threshold from feature samples collected at one particular period and does not consider that the features of the human voice change over time. As a result, the speaker identification rate sometimes declined as time passed. A threshold that adapts to these changes can be estimated by creating, for each speaker, a standard pattern of voice features from feature samples spanning as long a period as possible. However, whether the long-term feature samples are stored as they are or the voice features are extracted and stored, the speaker verification device must secure an enormous memory capacity. Moreover, because the quantity of feature samples becomes enormous, calculating the speaker-specific threshold takes a long time.

[0008] In addition, the first and second methods set the threshold strictly after the fact, so they cannot be fully exploited in applications that require the threshold to be set in advance, for example by estimation. The second and third methods do not consider the variation in intra-speaker distance, so the genuine speaker may be rejected with high probability at verification time.

[0009] An object of the present invention is to solve the above problems and to provide a method for estimating in advance, in a short time and from a small quantity of feature samples, a threshold that takes the variation in intra-speaker distance into account and adapts to changes over time in voice features, together with an apparatus for carrying out this method.

[0010]

[Means for Solving the Problems] The present invention makes effective use of the following property: provided the codebook size is at or above a certain value, there are strong correlations between the intra-codebook distance and the intra-speaker distance and between the inter-codebook distance and the inter-speaker distance, as shown in FIG. 2 and FIG. 3 respectively, and these correlations are robust to the time difference. The invention is characterized in that it uses this property to determine an optimal speaker-specific threshold that takes into account both the variation in intra-speaker distance and the time differences in the voice features of the same speaker and of other speakers.

[0011] FIG. 2 is a plot of the measured correlation between the intra-codebook distance and the intra-speaker distance based on eight sets of male speech, where one set is, for example, the features extracted at a predetermined frame length from the speech of a unit-size text consisting of a predetermined number of words. The codebook size is, for example, 512, and the time difference is nine months. As is clear from the figure, for the same speaker the two quantities are linearly correlated. Verification by the inventor showed that, with y denoting the time-difference intra-speaker distance and x the intra-codebook distance, the relation y = ax + b holds. Although the coefficients a and b vary somewhat, the relation as a whole tends to be robust to the time difference. In the example of FIG. 2, a was 0.944 and b was 0.101.
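The linear relation y = ax + b can be sketched in Python as follows. The coefficients a = 0.944 and b = 0.101 are the values reported for FIG. 2; the source does not state how the coefficients were fitted, so the ordinary least-squares fit below is an assumption, and the function names and the sample points passed to it are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b (assumed fitting method)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def to_time_difference_distance(x, a, b):
    """Map a codebook distance x to the estimated time-difference distance y."""
    return a * x + b


# Coefficients reported for the FIG. 2 example.
A_FIG2, B_FIG2 = 0.944, 0.101

# Hypothetical intra-codebook distance, for illustration only.
print(to_time_difference_distance(0.5, A_FIG2, B_FIG2))

# Hypothetical points lying exactly on y = 2x + 1, to show the fit itself.
print(fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]))
```

In this sketch the fit would be run once per speaker on previously measured pairs of distances, and only the two coefficients would need to be kept for later use.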

[0012] FIG. 3 is a plot of the measured correlation between the inter-codebook distance and the time-difference inter-speaker distance based on one set of female speech. As in FIG. 2, the codebook size is 512 and the time difference is nine months. FIG. 3 shows that, even with only one set, the same kind of linear correlation as in FIG. 2 is evident, and this tendency remains the same when the number of sets is increased. Although not shown, the relation between the inter-codebook distance and the inter-speaker distance based on male speech was similar. In short, once the intra-codebook distance and the inter-codebook distance are known, the time-difference intra-speaker distance and the time-difference inter-speaker distance can be derived from these correlations.

[0013] The speaker verification method of the present invention, which exploits this property, proceeds as follows. A first codebook is created from the voice features, at an arbitrary period, of a first speaker whose threshold is to be determined, and the intra-codebook distance and the inter-codebook distortion distance of the first speaker are derived. Then, based on a first correlation value between the intra-codebook distance and the time-difference intra-speaker distance of that period and a second correlation value between the inter-codebook distance and the time-difference inter-speaker distance, both prepared for the first speaker, the time-difference intra-speaker distance and the time-difference inter-speaker distance at that period are derived. Once these distances have been derived, the procedure is the same as in the conventional first method: the false rejection rate and the false acceptance rate are calculated from these distances and an arbitrarily chosen initial threshold, and the initial threshold is adjusted so that the two rates become equal.

[0014] As is clear from FIG. 2 and FIG. 3, the first and second correlation values are values obtained in advance for each speaker at predetermined time intervals. The first correlation value is the linear correlation value between the intra-codebook distortion distance and the time-difference distortion distance within the same speaker, and the second correlation value is the linear correlation value between the inter-codebook distortion distance and the inter-speaker distortion distance.

[0015] The speaker verification apparatus of the present invention, which also exploits the above property, has means for creating a speaker-specific codebook for each period from the speaker-specific voice features of different periods and for storing in memory, together with the speaker-specific codebook of that period, the number of occurrences of each code vector that appeared while that codebook was being created. This means can be realized using known techniques. The apparatus further has: target-speaker codebook creation means for creating a first codebook of a first speaker whose threshold is to be determined; means that, given the created first codebook and the stored past codebooks of the first speaker and of other speakers, generates each code vector of the past codebooks the stored number of times, quantizes these code vectors with the code vectors of the first codebook, and derives the intra-codebook distortion distance of the first speaker and the inter-codebook distortion distance between the first speaker and the other speakers; correlation value storage means that stores a first correlation value between the intra-codebook distortion distance and the time-difference distortion distance within the same speaker and a second correlation value between the inter-codebook distortion distance and the time-difference distortion distance between different speakers, both measured in advance for each speaker at predetermined time intervals; time-difference distortion distance derivation means that reads the first and second correlation values corresponding to the first speaker and derives the time-difference distortion distance within the first speaker and the time-difference distortion distance between speakers for the period concerned; and threshold adjustment means that calculates the false rejection rate and the false acceptance rate from these time-difference distortion distances and an arbitrarily chosen initial threshold and adjusts the initial threshold so that the two rates become equal.

[0016]

[Operation] In the present invention, instead of long-term feature samples, a plurality of speaker-specific codebooks from different periods are created for each speaker, and the number of occurrences of the code vectors in each speaker-specific codebook is stored at that time. Preferably, the correlations shown in FIG. 2 and FIG. 3, that is, the first and second correlation values, are stored in the correlation value storage means before the threshold is determined. When the threshold of the first speaker is determined, code vectors are generated from the codebooks of the first speaker and of other speakers for each past period, according to the code sequence representing the code vectors and the number of occurrences of each code, and the intra-codebook distance and the inter-codebook distance corresponding to the threshold determination period are obtained. Next, the time-difference intra-speaker distance and the time-difference inter-speaker distance are derived from these intra-codebook and inter-codebook distances and from the first and second correlation values read from the correlation value storage means.

[0017] In this way, simply by using the codebooks created for each speaker in the past, the time-difference intra-speaker distance and the time-difference inter-speaker distance of the first speaker can be approximated in advance, and the memory used by the speaker verification device is greatly reduced compared with conventional devices. With the conventional methods, storing the speech waveform as it is (short type) requires sampling frequency (1/sec) × speech duration (sec) × 2 (bytes), and extracting the feature vectors from the waveform and storing them (double type) requires total number of frames × feature vector size × 16 (bytes). In contrast, the memory used by the method and apparatus of the present invention is speaker-specific codebook size × feature vector size × 16 (bytes: double type) + speaker-specific codebook size × 4 (bytes: int type). Suppose, therefore, that there are 10 word utterances sampled at 16 kHz with an average length of 3 seconds, an analysis frame length of 30 msec, and a frame period of 16 msec. If feature samples are recorded every month and only the features of the samples of 100 speakers recorded over one year are stored, the required memory capacity is 1/11 of the conventional capacity when the codebook size is 256. That is, according to the present invention, about 92% of the memory capacity can be saved. The same applies to the amount of computation: according to the present invention, about 92% of the conventional computation can be eliminated. After the time-difference intra-speaker distance and the time-difference inter-speaker distance have been derived, the initial threshold is adjusted so as to obtain an equal error rate by the same procedure or means as in the conventional first method, and the result is taken as the threshold of the first speaker.

[0018]

[Embodiment] An embodiment of the present invention will now be described in detail with reference to the drawings. FIG. 1 is a block diagram of the speaker-specific threshold determination section of a speaker verification apparatus according to one embodiment of the present invention. Reference numeral 100 denotes a codebook file, 101 a speaker-specific codebook (the period-D speaker-specific codebook), 102 a vector quantization unit, 103 an intra-codebook distance storage unit, 104 an inter-codebook distance storage unit, 105 a correlation conversion value storage unit, 106 a time-difference intra-speaker distance conversion unit, 107 a time-difference inter-speaker distance conversion unit, 108 an FRR/FAR calculation unit, 109 an FRR/FAR comparison unit, and 110 a threshold adjustment unit.

[0019] The codebook file 100 contains a speaker-specific codebook storage section, which stores the codebooks created for each speaker at different periods (periods A to C in the illustrated example), and a code data storage section, which stores the code sequence representing the code vectors of each speaker-specific codebook and the number of occurrences of each code. These can be reused at any time as the past voice features of each speaker. The intervals between periods A to C are preferably fairly long, because over a short interval a speaker's voice features may show no difference. Codes or code sequences that never occurred are not stored in the code data storage section. This can be expected to further reduce the size (memory usage) of the codebook file 100 and the amount of subsequent computation. Here, a code sequence is the set of codes assigned to the code vectors, for example one for the centroid of each cluster, and the number of occurrences of a code is the count of how many times that code was assigned to the same cluster in the course of the vector quantization process up to its completion.
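The occurrence counts kept in the code data storage section can be sketched as follows. This is a minimal illustration that assumes a squared Euclidean distortion measure and a codebook that has already been trained; the source does not specify either, and all names and sample values are hypothetical. Codes that never occur are simply absent from the stored result, as the paragraph describes.

```python
from collections import Counter


def sq_dist(u, v):
    """Squared Euclidean distance (assumed distortion measure)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest_code(vec, codebook):
    """Index of the code vector (centroid) closest to vec."""
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def count_occurrences(vectors, codebook):
    """Count how often each code is assigned; unused codes are not stored."""
    counts = Counter(nearest_code(v, codebook) for v in vectors)
    return dict(counts)


# Hypothetical 2-dimensional codebook and feature vectors.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
vectors = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (0.0, 0.2)]

print(count_occurrences(vectors, codebook))  # code 2 never occurs, so it is omitted
```

Storing one integer per used code alongside the codebook lets the original feature distribution be approximated later without keeping the feature vectors themselves.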

[0020] The period-D speaker-specific codebook 101 is a codebook created by extracting analysis frames from the training speech, at period D, of the first speaker (speaker X) whose threshold is to be set, obtaining the features, and vector-quantizing those features. Period D is an arbitrary period, but it is later than periods A, B, and C described above.

[0021] The vector quantization unit 102 generates code vectors from the period A to C codebooks of the same speaker (speaker X) already stored in the codebook file 100, each according to the number of occurrences of its code, and outputs them to, for example, the period-D speaker-specific codebook 101, against which they are quantized. This yields the feature difference relative to speaker X's own period-D codebook, that is, the intra-codebook distance, which is stored in the intra-codebook distance storage unit 103.

[0022] Likewise, code vectors are generated from the period A to C speaker-specific codebooks of another speaker (speaker Y) already stored in the codebook file 100, each according to the number of occurrences of its code, and are output to, for example, the period-D speaker-specific codebook 101. This yields the feature difference between the other speaker's codebooks and the first speaker's period-D codebook, that is, the inter-codebook distance, which is stored in the inter-codebook distance storage unit 104.
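The derivation of the intra-codebook and inter-codebook distances can be sketched as follows. The source does not specify the distortion measure or how the individual distortions are combined, so this sketch assumes squared Euclidean distance and a mean weighted by occurrence count. All names and sample values are hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance (assumed distortion measure)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def codebook_distance(past_codebook, counts, current_codebook):
    """Average distortion when past code vectors are quantized by the current codebook.

    Each past code vector is weighted by its stored occurrence count, which is
    equivalent to emitting it that many times and averaging the distortions.
    """
    total = 0.0
    n = 0
    for code, count in counts.items():
        vec = past_codebook[code]
        distortion = min(sq_dist(vec, c) for c in current_codebook)
        total += count * distortion
        n += count
    return total / n


# Hypothetical 2-dimensional codebooks.
current_d = [(0.0, 0.0), (2.0, 0.0)]          # first speaker, period D
past_same = [(0.0, 1.0), (2.0, 0.0)]          # first speaker, an earlier period
past_other = [(0.0, 3.0), (5.0, 0.0)]         # another speaker, an earlier period

intra_cb = codebook_distance(past_same, {0: 3, 1: 1}, current_d)
inter_cb = codebook_distance(past_other, {0: 1, 1: 1}, current_d)
print(intra_cb, inter_cb)
```

The same function serves both cases: passing the first speaker's past codebook gives an intra-codebook distance, and passing another speaker's past codebook gives an inter-codebook distance.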

[0023] The correlation conversion value storage unit 105 stores, for each speaker, the correlation values uniquely derived from the linear correlation equations based on FIG. 2 and FIG. 3 and from the intra-codebook and inter-codebook distances (correlation value storage means). The correlation values referred to here are the first correlation value, which expresses the relation between the intra-codebook distance and the time-difference intra-speaker distance, and the second correlation value, which expresses the relation between the inter-codebook distance and the time-difference inter-speaker distance. These correlation values are read out using the speaker as the key; the first correlation value is input to the time-difference intra-speaker distance conversion unit 106 and the second correlation value to the time-difference inter-speaker distance conversion unit 107. In this embodiment, the first and second correlation values relating to the speaker (speaker X) and the other speaker (speaker Y) are read out.

[0024] The time-difference intra-speaker distance conversion unit 106 and the time-difference inter-speaker distance conversion unit 107 obtain, from the intra-codebook distance and the inter-codebook distance and from the first and second correlation values, the time-difference intra-speaker distance and the time-difference inter-speaker distance of the speaker (speaker X) at period D, and input them to the FRR/FAR calculation unit 108.

[0025] The FRR/FAR calculation unit 108 calculates the FRR using the input time-difference intra-speaker distance and an arbitrary initial threshold, and calculates the FAR using the input time-difference inter-speaker distance and the same initial threshold. The FRR/FAR comparison unit 109 compares the FRR and FAR values; if they are not equal, the threshold adjustment unit 110 adjusts the threshold and processing returns to the FRR/FAR calculation unit 108. When the FRR and the FAR become equal, the adjustment ends and the threshold at that point is output as the threshold of speaker (A).
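The loop formed by units 108 to 110 can be sketched as follows. The specification does not state how the threshold is adjusted, so this sketch assumes a bisection search; it also assumes that a distance above the threshold means rejection. With a finite number of samples the FRR and FAR may never match exactly, so the loop also stops once the search interval becomes narrower than a tolerance. All function and parameter names are illustrative.

```python
def frr(intra_distances, threshold):
    """False rejection rate: share of same-speaker distances above the threshold."""
    return sum(d > threshold for d in intra_distances) / len(intra_distances)


def far(inter_distances, threshold):
    """False acceptance rate: share of other-speaker distances at or below the threshold."""
    return sum(d <= threshold for d in inter_distances) / len(inter_distances)


def equal_error_threshold(intra_distances, inter_distances, initial_threshold,
                          tol=1e-6, max_iter=200):
    """Adjust the threshold until FRR and FAR are equal (or the interval is tiny)."""
    lo = min(min(intra_distances), min(inter_distances))
    hi = max(max(intra_distances), max(inter_distances))
    # Clamp the arbitrary initial threshold into the range of observed distances.
    threshold = min(max(initial_threshold, lo), hi)
    for _ in range(max_iter):
        rejection = frr(intra_distances, threshold)
        acceptance = far(inter_distances, threshold)
        if rejection == acceptance or hi - lo < tol:
            break
        if rejection > acceptance:
            lo = threshold  # too strict: move the threshold up
        else:
            hi = threshold  # too lenient: move the threshold down
        threshold = (lo + hi) / 2.0
    return threshold
```

Bisection is used here only because the FRR never increases and the FAR never decreases as the threshold grows; any other adjustment rule that converges to the crossing point would serve the same purpose.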

[0026] As described above, in this embodiment, the correlation between the intra-codebook distance and the time-difference intra-speaker distance, and the correlation between the inter-codebook distance and the time-difference inter-speaker distance, are obtained in advance and stored in the correlation conversion value storage unit 105, and a plurality of speaker-specific codebooks created at different periods for each speaker are retained. When a threshold is determined, these speaker-specific codebooks are used to derive the intra-codebook distance and the inter-codebook distance, and the time-difference intra-speaker distance and the time-difference inter-speaker distance are then derived from them based on the above correlations. Consequently, except for the first period (when the codebook is first created), the speaker-specific threshold can be determined whenever needed at each subsequent period. In addition, the memory used by the codebook file 100 is greatly reduced compared with the conventional methods, and the time required for threshold calculation is shortened. The conventional problems can thereby be solved. In this embodiment, the first and second correlation values are stored in advance in the correlation conversion value storage unit 105 and read out using the speaker whose threshold is to be determined as a key; however, the configuration is not necessarily limited to this, and the values may instead be calculated as needed and input to the time-difference intra-speaker distance conversion unit 106 and the time-difference inter-speaker distance conversion unit 107.
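The two options mentioned above (reading a stored correlation value keyed by speaker, or calculating it as needed) can be sketched as follows. The sketch assumes that a correlation value is a linear relation obtained by an ordinary least-squares fit over paired past measurements; the specification only states that the values are obtained in advance, so the fitting method, the `(slope, intercept)` representation, and all names here are assumptions.

```python
def fit_linear_correlation(codebook_distances, speaker_distances):
    """Least-squares fit of speaker_distance = slope * codebook_distance + intercept."""
    n = len(codebook_distances)
    mean_x = sum(codebook_distances) / n
    mean_y = sum(speaker_distances) / n
    sxx = sum((x - mean_x) ** 2 for x in codebook_distances)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(codebook_distances, speaker_distances))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def get_correlation(store, speaker_id, codebook_distances, speaker_distances):
    """Read the stored correlation for a speaker, or compute it on demand."""
    if speaker_id in store:
        return store[speaker_id]  # stored in advance, keyed by speaker
    # Not stored: calculate as needed from past paired measurements.
    return fit_linear_correlation(codebook_distances, speaker_distances)
```

Here `store` stands in for the correlation conversion value storage unit 105 as a plain dictionary mapping a speaker identifier to a `(slope, intercept)` pair.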

[0027]

[Effects of the Invention] As is apparent from the above description, according to the speaker verification method of the present invention, the time-difference intra-speaker distance and the time-difference inter-speaker distance are derived from the intra-codebook distance and the inter-codebook distance together with the first and second correlation values acquired in advance. It is therefore unnecessary to store long-term feature samples when determining a speaker-specific threshold that tracks changes in voice features over time, and the memory usage of the speaker verification device can be reduced. Furthermore, because only a small quantity of feature samples is needed, the calculation time is greatly shortened compared with the conventional methods, so that the speaker-specific threshold can be estimated in a short time.

[0028] Further, according to the speaker verification apparatus of the present invention, codebooks created for each speaker at different periods in the past are reused, so that the time-difference intra-speaker distance and the time-difference inter-speaker distance of the first speaker can be approximated in advance by means of the correlation values. Consequently, no extra memory is needed even when determining a speaker-specific threshold that adapts to changes in voice features over time. In addition, since the quantity of feature samples is small and the time for calculating each distance is shortened, the overall configuration of the speaker-specific threshold determining means can be simplified.

[Brief description of drawings]

FIG. 1 is a block diagram of the main part of a speaker verification device according to an embodiment of the present invention.

FIG. 2 is a plot of measured data showing the correlation between the intra-codebook distance and the intra-speaker distance.

FIG. 3 is a plot of measured data showing the correlation between the inter-codebook distance and the inter-speaker distance.

FIG. 4 is a block diagram for realizing the equal-error threshold setting method, which is the first conventional method.

FIG. 5 is a block diagram for realizing the threshold setting method that takes the distribution of inter-speaker distances into account, which is the second conventional method.

FIG. 6 is a block diagram for realizing the method of converting an inter-codebook distance into an inter-speaker distance, which is the third conventional method.

[Explanation of symbols]

100 codebook file
101 speaker-specific codebook created at an arbitrary period, for example period D
102 vector quantization unit
103 intra-codebook distance storage unit
104 inter-codebook distance storage unit
105 correlation conversion value storage unit (correlation value storage means)
106 time-difference intra-speaker distance conversion unit
107 time-difference inter-speaker distance conversion unit
108 FRR/FAR calculation unit
109 FRR/FAR comparison unit
110 threshold adjustment unit

Claims

[Claims]

1. A speaker verification method having a threshold determination process for determining a speaker-specific threshold used to distinguish a speaker from other speakers, wherein the threshold determination process comprises the steps of: creating a first codebook based on voice features, at an arbitrary period, of a first speaker whose threshold is to be determined; deriving an intra-codebook distortion distance between the first codebook and a codebook of the first speaker created before that period, and an inter-codebook distortion distance between the first codebook and a codebook of another speaker created before that period; and deriving, at that period, a time-difference distortion distance within the first speaker and a time-difference distortion distance between the first speaker and the other speaker, based on a first correlation value between the intra-codebook distortion distance and the time-difference distortion distance within the first speaker and a second correlation value between the inter-codebook distortion distance and the time-difference distortion distance with respect to the other speaker, both correlation values being prepared in advance for the first speaker.

2. The speaker verification method according to claim 1, wherein the threshold determination process further comprises the steps of calculating a false rejection rate and an impostor acceptance rate based on each time-difference distortion distance at that period and an arbitrarily set initial threshold, and adjusting the initial threshold so that the false rejection rate and the impostor acceptance rate become equal.

3. The speaker verification method, wherein the first and second correlation values are values obtained in advance at a predetermined time interval for each speaker, the first correlation value is a linear correlation value between the intra-codebook distortion distance and the time-difference distortion distance within the same speaker, and the second correlation value is a linear correlation value between the inter-codebook distortion distance and the inter-speaker distortion distance.

4. A speaker verification apparatus having means for determining a speaker-specific threshold used to distinguish a speaker from other speakers, comprising: means for creating a speaker-specific codebook for each period based on speaker-specific voice features of different periods, and for storing in a memory the number of appearances of each code vector that appeared in the process of creating the speaker-specific codebook, together with the speaker-specific codebook of that period; codebook creating means for creating a first speaker-specific codebook for a first speaker whose threshold is to be determined; means for generating code vectors, in the stored numbers of appearances, from the stored past codebooks of the first speaker and of another speaker, and for quantizing these code vectors with the code vectors of the created first codebook, thereby deriving an intra-codebook distortion distance of the first speaker and an inter-codebook distortion distance between the first speaker and the other speaker; correlation value storage means for storing a first correlation value between the intra-codebook distortion distance and the time-difference distortion distance within the same speaker, and a second correlation value between the inter-codebook distortion distance and the time-difference distortion distance between other speakers, each measured in advance at a predetermined time interval for each speaker; and time-difference distortion distance deriving means for reading the first and second correlation values corresponding to the first speaker and deriving, at that period, the time-difference distortion distance within the first speaker and the time-difference distortion distance between other speakers.

5. The speaker verification apparatus according to claim 4, further comprising: means for calculating a false rejection rate and an impostor acceptance rate based on each time-difference distortion distance and an arbitrarily set initial threshold; and a threshold adjusting unit that adjusts the initial threshold so that the false rejection rate and the impostor acceptance rate become equal.