JPH11249684A

JPH11249684A - Method and device for deciding threshold value in speaker collation

Info

Publication number: JPH11249684A
Application number: JP10069514A
Authority: JP
Inventors: Eiko Yamada; 栄子山田; Hiroaki Hattori; 浩明服部
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1998-03-04
Filing date: 1998-03-04
Publication date: 1999-09-17
Anticipated expiration: 2018-03-04
Also published as: JP3036509B2

Abstract

PROBLEM TO BE SOLVED: To determine a threshold in advance, and to obtain a suitable threshold even when each registered speaker utters a different registration word. SOLUTION: In determining the speaker-verification threshold that serves as the criterion for accepting a claimed identity when an input pattern is matched against a standard pattern, subword-unit standard patterns of multiple speakers other than the registered speaker are stored beforehand in a storage unit 20. At registration, the subword standard patterns representing the uttered content are concatenated to form word standard patterns, a similarity calculation unit 30 obtains the distribution of their similarities to the speaker's registration speech, and the threshold is determined from that distribution.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a method and a device for determining a threshold in speaker verification, that is, the threshold used inside a speaker verification device to decide at verification time whether the speaker is the claimed person.

[0002]

[Prior Art] In speaker verification, the similarity between the input speech and the registered pattern of the claimed speaker is computed at verification time. The speaker is judged to be the claimed person when the similarity is higher than a predetermined threshold, and to be another person when it is lower. Conventionally, this threshold has often been set after the fact from the results of evaluation experiments using multiple words from multiple speakers, and the large effort required to set it has been a problem.

[0003] As a way to solve this problem, Furui has proposed a method that sets the threshold in advance (S. Furui, "Cepstral analysis technique for automatic speaker verification," IEEE Trans. Acoust., Speech, Signal Processing, ASSP-29, 2, pp. 254-271, 1981; hereinafter Reference 1). This method examines the distribution of the similarity (distance) between the registered standard pattern of the speaker and the speech of multiple other speakers (uttering the same content as the speaker), and sets the threshold according to equation (1):

$$\mathrm{threshold}(k) = 0.6\,\bigl(\mu(k) - \sigma(k)\bigr) + 7.0 \tag{1}$$

[0004] Here, μ is the mean of the similarities between the speaker's standard pattern and the speech of multiple other speakers, σ is their variance, and k is the registered speaker number.
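Equation (1) can be illustrated with a minimal Python sketch. The function name `furui_threshold` and the sample scores are hypothetical; the constants 0.6 and 7.0 come from the text; reading the equation as 0.6 × (μ − σ) + 7.0 and computing σ as a population variance (the text calls σ the variance) are assumptions.

```python
from statistics import fmean, pvariance


def furui_threshold(other_speaker_scores, a=0.6, b=7.0):
    """Evaluate threshold(k) = a * (mu(k) - sigma(k)) + b for one registered speaker.

    other_speaker_scores: similarities between the registered speaker's
    standard pattern and utterances of the same content by other speakers.
    """
    mu = fmean(other_speaker_scores)
    # Assumption: sigma is the population variance, following the wording of the text.
    sigma = pvariance(other_speaker_scores, mu)
    return a * (mu - sigma) + b


if __name__ == "__main__":
    # Made-up scores, used only to show the call.
    print(furui_threshold([10.0, 12.0, 14.0]))
```

Because the scores must come from other speakers uttering the same content, this computation needs a matching recording set for every registration word, which is the limitation the invention addresses.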

[0005] FIG. 5 shows the configuration of a device that carries out this method. First, the speaker standard pattern creation unit 310 creates the speaker's standard pattern (a word standard pattern) from the registration speech and stores it in the word-unit tree-structured speaker standard pattern storage unit 320. Next, the similarity calculation unit 330 calculates the similarity between the created speaker standard pattern and the speech of multiple speakers stored in the multi-speaker speech storage unit 340. The mean and variance calculation unit 350 then calculates the mean and variance of the similarities from the resulting distribution. Finally, the threshold calculation unit 360 obtains the threshold from the calculated mean and variance according to equation (1). With this method the threshold can be determined in advance from the registration utterance, so little effort is needed to set it.

[0006]

[Problems to Be Solved by the Invention] Because the conventional method registers in units of words, utterances by many speakers with the same content as the speaker's registration utterance must be prepared in advance, and when multiple registered speakers register different words it is difficult to prepare all of those utterances beforehand. The conventional method is therefore effective when all registered speakers register a common utterance, but it cannot yield an appropriate threshold when each registered speaker utters a different registration word.

[0007] An object of the present invention is to provide a threshold determination method and device that can determine the threshold in advance and can obtain an appropriate threshold even when each registered speaker utters a different registration word.

[0008]

[Means for Solving the Problems] To solve the above problems, the threshold determination method and device according to the present invention adopt the following characteristic configurations.

[0009] (1) A method of determining a threshold in speaker verification, the threshold serving as the criterion for the identity decision when an input pattern is matched against a standard pattern, characterized in that subword-unit standard patterns of multiple speakers other than the registered speaker are stored; at registration, the subword standard patterns representing the uttered content are concatenated to create word standard patterns; the distribution of their similarities to the speaker's registration speech is obtained; and the threshold is determined based on the resulting distribution.

[0010] (2) The threshold determination method according to (1), wherein the subword is a phoneme, a syllable, or a demisyllable.

[0011] (3) The threshold determination method according to (1), wherein the similarity at the Kth percentile of the distribution of similarities between the speaker's registration speech and the standard patterns of the multiple speakers is determined as the threshold.

[0012] (4) The threshold determination method according to (1), wherein the threshold is determined as the average of the N similarities near the Kth percentile of the distribution of similarities between the speaker's registration speech and the standard patterns of the multiple speakers.

[0013] (5) The threshold determination method according to (1), wherein the threshold is determined based on N similarities selected at random from the distribution of similarities between the speaker's registration speech and the standard patterns of the multiple speakers.

[0014] (6) The threshold determination method according to (1), wherein the threshold is determined based on the N similarities near the median of the distribution of similarities between the speaker's registration speech and the standard patterns of the multiple speakers.

[0015] (7) A threshold determination method in which the standard patterns of multiple speakers are stored as a tree-structured speaker standard pattern composed of nodes each corresponding to a speaker cluster, where the first-level node corresponds to the cluster of all speakers, the child nodes of the first-level node are the second-level nodes, each upper node has a standard pattern representing the acoustic features of the lower nodes connected to it, and the terminal leaf nodes consist of the standard patterns of the individual speakers; the similarity between the registration speech and the standard pattern of each second-level node is calculated; the similarity between the registration speech and the standard patterns of the lower nodes of the nodes having the top M similarities is then calculated; this calculation proceeds from upper nodes toward lower nodes, at each level calculating the similarity between the registration speech and the standard patterns of the lower nodes of the top M nodes by similarity; for a node whose similarity is not calculated, the value of its upper node is substituted; and the threshold is determined based on the distribution of the similarities finally obtained at the leaf nodes.

[0016] (8) The threshold determination method according to (7), wherein the similarity at the Kth percentile of the similarity distribution is determined as the threshold.

[0017] (9) The threshold determination method according to (7), wherein the average of the N similarities near the Kth percentile of the similarity distribution is determined as the threshold.

[0018] (10) The threshold determination method according to (7), wherein the average of the N similarities near the Kth percentile of the similarity distribution is determined as the threshold.

[0019] (11) A threshold determination method comprising: analyzing the input registration speech of the speaker and converting it into feature vectors; concatenating the subword-unit standard patterns of multiple speakers according to the input utterance content to create a word standard pattern for each speaker; calculating the similarity between the speaker's feature vectors and the word standard patterns of the multiple speakers; and obtaining the verification threshold from the distribution of the calculated similarities.

[0020] (12) A threshold determination method comprising: analyzing the input registration speech of the speaker and converting it into feature vectors; concatenating the subword-unit standard patterns of multiple speakers according to the input utterance content to create a word standard pattern for each speaker; calculating the similarity between the speaker's feature vectors and the word standard patterns of the multiple speakers; sorting the calculated similarities in descending order; and using the average of the N similarities near the Kth percentile of the similarity distribution as the verification threshold.

[0021] (13) A threshold determination method comprising: analyzing the input registration speech of the speaker and converting it into feature vectors; for multi-speaker standard patterns organized as a tree in which each node represents the acoustic features of the lower nodes connected to it and the terminal leaf nodes hold the standard patterns of the individual speakers, concatenating the subword-unit standard patterns according to the input utterance content to create a word standard pattern for each node; calculating the similarity between the speaker's feature vectors and the created tree-structured speaker standard pattern from upper nodes toward lower nodes, where at each level the similarities of the lower nodes are calculated only for the top M nodes by similarity and the similarity of any node not calculated is replaced by that of its upper node; and obtaining the verification threshold from the distribution of the calculated similarities.

[0022] (14) A threshold determination device comprising: an analysis unit that analyzes the input registration speech of the speaker and converts it into feature vectors; a subword-unit speaker standard pattern storage unit that stores the standard patterns of multiple speakers in subword units; a similarity calculation unit that concatenates the subword-unit standard patterns stored in the subword-unit speaker standard pattern storage unit according to the input utterance content to create a word standard pattern for each speaker, and calculates the similarity between the speaker's feature vectors produced by the analysis unit and the word standard patterns of the multiple speakers; and a threshold determination unit that obtains the verification threshold from the distribution of the similarities calculated by the similarity calculation unit.

[0023] (15) A threshold determination device comprising: an analysis unit that analyzes the input registration speech of the speaker and converts it into feature vectors; a subword-unit speaker standard pattern storage unit that stores the standard patterns of multiple speakers in subword units; a similarity calculation unit that concatenates the subword-unit standard patterns stored in the subword-unit speaker standard pattern storage unit according to the input utterance content to create a word standard pattern for each speaker, and calculates the similarity between the speaker's feature vectors produced by the analysis unit and the word standard patterns of the multiple speakers; and a threshold calculation unit that sorts the similarities calculated by the similarity calculation unit in descending order and uses the average of the N similarities near the Kth percentile of the similarity distribution as the verification threshold.

[0024] (16) A threshold determination device comprising: an analysis unit that analyzes the input registration speech of the speaker and converts it into feature vectors; a subword-unit tree-structured speaker standard pattern storage unit that stores, in subword units, multi-speaker standard patterns organized as a tree in which each node represents the acoustic features of the lower nodes connected to it and the terminal leaf nodes hold the standard patterns of the individual speakers; a similarity calculation unit that concatenates the subword-unit standard patterns stored in the subword-unit tree-structured speaker standard pattern storage unit according to the input utterance content to create a word standard pattern for each node, and that, when calculating the similarity between the speaker's feature vectors produced by the analysis unit and the created tree-structured speaker standard pattern from upper nodes toward lower nodes, calculates the similarities of the lower nodes only for the top M nodes by similarity at each level and replaces the similarity of any node not calculated with that of its upper node; and a threshold determination unit that obtains the verification threshold from the distribution of the similarities calculated by the similarity calculation unit.

[0025]

[Embodiments of the Invention] Embodiments of the threshold determination method and device according to the present invention are described below. The invention concerns the determination of the threshold used for the identity decision at verification time. Subword-unit standard patterns of multiple speakers other than the registered speaker are stored; at registration, the subword standard patterns representing the uttered content are concatenated to create word standard patterns, and the distribution of their similarities to the speaker's registration speech is obtained.

[0026] The threshold is determined by treating the distribution obtained in this way as the distribution of similarities between the speaker's standard pattern and impostor speech. A speaker's speech and the standard pattern trained from it are considered to carry equivalent information. Therefore, if the stored multi-speaker standard patterns are those of many speakers densely covering the acoustic space, the distribution of similarities between the speaker's registration speech and the multi-speaker standard patterns can be regarded as the distribution of similarities between the speaker's standard pattern and the speech of any impostor.

[0027] By obtaining the threshold from the distribution of similarities between the speaker's registration speech and the multi-speaker standard patterns, an appropriate threshold can be set against any impostor.

[0028] In addition, because this method uses subword-unit standard patterns, a standard pattern can be created for any word, and an appropriate threshold can be obtained even when each registered speaker utters a different word.

[0029] As the subword that serves as the unit of the standard patterns, phonemes, syllables, demisyllables (Watanabe et al., "Speaker-independent speech recognition using HMMs with demisyllable units," Trans. IEICE D-II, Vol. J75-D-II, No. 8, pp. 1281-1289, 1992-8; hereinafter Reference 2), and the like can be used.

[0030] Various methods can be used to calculate the threshold. For example, the similarity at the Kth percentile of the distribution of similarities between the speaker's registration speech and the multi-speaker standard patterns can be selected as the threshold.

[0031] With this method, K percent of all similarity values lie above the threshold, which is equivalent to setting the false acceptance rate to K percent. A threshold that gives a desired false acceptance rate can therefore be set in advance. For a false acceptance rate of 10 percent, the similarity at the 10th percentile of the similarity distribution is used as the threshold; for a false acceptance rate of 0 percent, the maximum similarity is used as the threshold.
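The percentile rule can be sketched in Python as follows. The function name `percentile_threshold` is hypothetical, and the ranking convention (similarities sorted from highest to lowest, rank = ceil(K percent of the count), with K = 0 mapped to the maximum) is an assumption chosen to agree with the 0 percent case described above.

```python
import math


def percentile_threshold(similarities, k_percent):
    """Return the similarity at the Kth percentile, counted from the highest value."""
    if not similarities:
        raise ValueError("at least one similarity is required")
    ranked = sorted(similarities, reverse=True)  # highest similarity first
    # 1-based rank of the Kth percentile; K = 0 selects the maximum similarity.
    rank = max(1, math.ceil(len(ranked) * k_percent / 100.0))
    rank = min(rank, len(ranked))
    return ranked[rank - 1]


if __name__ == "__main__":
    scores = [float(v) for v in range(1, 65)]  # made-up similarities 1..64
    print(percentile_threshold(scores, 10))    # 7th highest value
    print(percentile_threshold(scores, 0))     # maximum value
```

Under this convention a verification score is accepted only when it exceeds the returned value, so at most K percent of the stored speakers' patterns would pass as the registered speaker.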

[Math. 1]

[0032] Alternatively, the average of the N similarities near the Kth percentile of the distribution of similarities between the speaker's registration speech and the multi-speaker standard patterns can be used as the threshold. When the number of stored speakers is small and the standard patterns cover the acoustic space only coarsely, averaging N similarities mitigates the loss of accuracy in the threshold calculation. The threshold is calculated according to equation (2).
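A minimal Python sketch of the neighborhood average follows. Equation (2) itself is not reproduced in this text, so the details here are assumptions: the function name `neighborhood_mean_threshold`, the parameter `n_neighbors` (standing for the N similarities near the percentile), and the choice of a window centered on the percentile rank and clamped to the ends of the sorted list.

```python
import math


def neighborhood_mean_threshold(similarities, k_percent, n_neighbors):
    """Average the n_neighbors similarities around the Kth percentile (from the top)."""
    if not similarities or n_neighbors < 1:
        raise ValueError("need at least one similarity and n_neighbors >= 1")
    ranked = sorted(similarities, reverse=True)  # highest similarity first
    # 0-based index of the Kth percentile; K = 0 maps to the maximum.
    center = max(1, math.ceil(len(ranked) * k_percent / 100.0)) - 1
    center = min(center, len(ranked) - 1)
    # Center the window on the percentile, then clamp it inside the list.
    start = center - n_neighbors // 2
    start = max(0, min(start, len(ranked) - n_neighbors))
    window = ranked[start:start + n_neighbors]
    return sum(window) / len(window)


if __name__ == "__main__":
    scores = [float(v) for v in range(1, 65)]  # made-up similarities 1..64
    print(neighborhood_mean_threshold(scores, 10, 3))
```

With `n_neighbors` set to 1 the sketch reduces to picking the single similarity at the Kth percentile, so the averaging only smooths the estimate when the stored speaker set is sparse.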

[0033] N is the number of speaker standard patterns for which the similarity is calculated. As other methods, the threshold can be obtained from N similarities selected at random from the similarity distribution, or from the N similarities near the median of the similarity distribution.

[0034] A threshold determination method according to another invention applies when the stored multi-speaker standard patterns form a tree-structured speaker standard pattern. A simple tree-structured speaker standard pattern is shown in FIG. 5.

[0035] The tree-structured speaker standard pattern is composed of nodes, each corresponding to a speaker cluster. The first-level node corresponds to the cluster of all speakers. Child nodes branch from the first-level node and form the second level. Each upper node has a standard pattern that represents the acoustic features of the lower nodes connected to it. The terminal leaf nodes consist of the standard patterns of the individual speakers.

[0036] Speaker clustering is described in detail in, for example, Kai-Fu Lee, "Large-Vocabulary Speaker-Independent Continuous Speech Recognition: The SPHINX System," CMU-CS-88-148, pp. 103-107, 1988.4 (hereinafter Reference 3). The tree structuring of speaker standard patterns is described in detail in Kosaka et al., "Tree-structured speaker clustering for speaker adaptation," IEICE Technical Report, SP93-110, pp. 49-54, 1993-12 (hereinafter Reference 4).

[0037] The threshold calculation method using the tree-structured speaker standard pattern is described with reference to FIG. 4.

[0038] First, the similarity between the registration speech and the standard pattern of each second-level node is calculated. Next, the similarity is calculated between the registration speech and the standard patterns of the lower nodes (the third level, which is the lowest level) of the nodes having the top M similarities. In this way, proceeding from upper nodes toward lower nodes, the similarity between the registration speech and the standard patterns of the lower nodes of the top M nodes by similarity is calculated at each level.

[0039] For a node whose similarity is not calculated, the value of its upper node is substituted. Finally, the similarities obtained at the leaf nodes are sorted, and the similarity at the Kth percentile of the similarity distribution becomes the threshold.
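The top-down search with substitution can be sketched in Python as below. This is an illustration under assumptions, not the patented implementation: the `Node` class, the `leaf_similarities` function, and the `similarity` callback are hypothetical names, and the tree is assumed to have uniform depth so that every level holds either only internal nodes or only leaves.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    pattern: object                       # word standard pattern of this cluster or speaker
    children: List["Node"] = field(default_factory=list)


def count_leaves(node: Node) -> int:
    """Number of leaf (single-speaker) nodes below a node, counting the node itself if it is a leaf."""
    if not node.children:
        return 1
    return sum(count_leaves(child) for child in node.children)


def leaf_similarities(root: Node, similarity: Callable[[object], float], m: int) -> List[float]:
    """Return one similarity per leaf, computed or inherited from a pruned ancestor."""
    results: List[float] = []
    frontier = list(root.children)        # second level: every node is scored
    while frontier:
        scored = sorted(((similarity(n.pattern), n) for n in frontier),
                        key=lambda pair: pair[0], reverse=True)
        next_frontier: List[Node] = []
        for rank, (score, node) in enumerate(scored):
            if not node.children:
                results.append(score)                  # leaf: computed value
            elif rank < m:
                next_frontier.extend(node.children)    # top M: descend one level
            else:
                # Pruned node: every leaf below it inherits this node's similarity.
                results.extend([score] * count_leaves(node))
        frontier = next_frontier
    return results
```

The returned list has one entry per stored speaker, so it can be sorted and passed to a percentile rule in the same way as similarities computed without the tree.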

[0040] Alternatively, the average of the N similarities near the Kth percentile of the similarity distribution becomes the threshold. Note that the number of similarities actually calculated at the leaf nodes must be at least K percent of the total. For example, suppose the tree has four levels and each node has four child nodes (so there are 64 speaker standard patterns at the leaf nodes) and the similarity at the 10th percentile is used as the threshold. Since 10 percent of 64 is 6.4, at least 7 similarities must be calculated at the leaf nodes, which requires M >= 2. With a tree-structured speaker standard pattern, similarities need to be calculated only for the lower nodes of the high-similarity nodes at each level, so the threshold can be calculated very quickly at registration.

[0041] FIG. 1 is a block diagram showing a first embodiment of the threshold determination method and device according to the present invention.

[0042] The speech input at registration is sent to the analysis unit 10, and the utterance content is sent to the similarity calculation unit 30. The analysis unit 10 converts the input speech into feature vectors and sends them to the similarity calculation unit 30. Cepstrum, delta cepstrum, and the like are used as the feature vectors (Furui, "Digital Speech Processing," Tokai University Press, pp. 44-47, 1985; hereinafter Reference 5).

[0043] In the similarity calculation unit 30, the subword-unit standard patterns stored in the subword-unit speaker standard pattern storage unit 20 are concatenated according to the received utterance content to create a word standard pattern for each speaker. The similarity between each speaker's word standard pattern and the feature vectors sent from the analysis unit 10 is then calculated.
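The concatenation step can be sketched in Python as follows. The storage layout is an assumption made for illustration: each speaker's subword standard pattern is represented as a list of feature frames keyed by a subword label, and `build_word_patterns` is a hypothetical name. A model-based representation such as an HMM per subword would be concatenated analogously.

```python
def build_word_patterns(subword_store, utterance):
    """Concatenate each speaker's subword patterns in the order given by the utterance.

    subword_store: {speaker_id: {subword_label: [frame, ...]}}
    utterance:     sequence of subword labels describing the registration word
    """
    word_patterns = {}
    for speaker_id, patterns in subword_store.items():
        frames = []
        for label in utterance:
            frames.extend(patterns[label])   # append this subword's frames in order
        word_patterns[speaker_id] = frames
    return word_patterns


if __name__ == "__main__":
    # Made-up one-dimensional frames for two stored speakers.
    store = {
        "speaker_a": {"ka": [[1.0], [2.0]], "ze": [[3.0]]},
        "speaker_b": {"ka": [[1.5]], "ze": [[2.5], [3.5]]},
    }
    print(build_word_patterns(store, ["ka", "ze"]))
```

Because the word pattern is assembled at registration time, the same store serves any registration word whose subwords it contains.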

[0044] As the similarity calculation method, the Viterbi algorithm (Nakagawa, "Speech Recognition by Probabilistic Models," IEICE, 1988; hereinafter Reference 6), DP matching (Sakoe et al., "Similarity evaluation of speech patterns by dynamic programming," Proc. IECE General National Convention, p. 136, 1970.8; hereinafter Reference 7), and the like are used. The calculated similarities are sent to the threshold determination unit 40.
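As an illustration of DP matching, the following Python sketch aligns two frame sequences by dynamic programming. It is a simplified textbook form and not necessarily the formulation of Reference 7: the symmetric step pattern without slope weights, the Euclidean frame distance, and the use of the negated accumulated distance as a similarity are all assumptions, and the function names are hypothetical.

```python
import math


def dp_matching_distance(pattern_a, pattern_b):
    """Accumulated frame distance along the best monotonic alignment of two sequences."""
    inf = float("inf")
    rows, cols = len(pattern_a), len(pattern_b)
    # cost[i][j]: best accumulated distance aligning the first i and first j frames.
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            local = math.dist(pattern_a[i - 1], pattern_b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # stretch pattern_b
                                     cost[i][j - 1],      # stretch pattern_a
                                     cost[i - 1][j - 1])  # advance both
    return cost[rows][cols]


def dp_similarity(pattern_a, pattern_b):
    """Assumption: a smaller distance means a higher similarity, so negate the distance."""
    return -dp_matching_distance(pattern_a, pattern_b)


if __name__ == "__main__":
    registered = [[0.0], [1.0], [2.0]]
    stretched = [[0.0], [1.0], [1.0], [2.0]]
    print(dp_matching_distance(registered, stretched))
```

The alignment absorbs differences in speaking rate, which is why the same registration word spoken at different speeds can still be compared frame by frame.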

[0045] The threshold determination unit 40 obtains the threshold from the distribution of the similarities sent from the similarity calculation unit 30 and outputs it.

[0046] FIG. 2 is a block diagram showing a second embodiment of the threshold determination method and device according to the present invention.

[0047] The speech input at registration is sent to the analysis unit 110, and the utterance content is sent to the similarity calculation unit 130.

[0048] The analysis unit 110 converts the input speech into feature vectors and sends them to the similarity calculation unit 130. The feature vectors are the same as in the first invention described above.

[0049] In the similarity calculation unit 130, the subword-unit standard patterns stored in the subword-unit speaker standard pattern storage unit 120 are concatenated according to the received utterance content to create a word standard pattern for each speaker. Next, the similarity between each speaker's word standard pattern and the feature vectors sent from the analysis unit 110 is calculated. The similarity calculation method is the same as in the first invention described above. The calculated similarities are sent to the threshold calculation unit 140.

[0050] In the threshold calculation unit 140, the similarities sent from the similarity calculation unit 130 are sorted in descending order, and the average of the N similarities near the K-th percentile of the similarity distribution is calculated. The calculated average is output as the threshold.
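A minimal sketch of this threshold calculation follows. The text does not fix how the K-th percentile position or the "N nearby" values are chosen, so the sketch assumes that the percentile is counted from the top of the descending list and that the N values form a window centered on that position, clipped at the ends of the list. The function name `percentile_neighborhood_threshold` is hypothetical.

```python
from typing import Sequence


def percentile_neighborhood_threshold(similarities: Sequence[float],
                                      k: float, n: int) -> float:
    """Average of the n similarities around the k-th percentile position.

    Assumptions (not stated in the source): the percentile is counted from
    the top of the descending-sorted list, and the n values are a window
    centered on that position.
    """
    if not similarities:
        raise ValueError("at least one similarity is required")
    if n < 1 or not 0 <= k <= 100:
        raise ValueError("need n >= 1 and 0 <= k <= 100")

    ordered = sorted(similarities, reverse=True)   # highest similarity first
    center = min(len(ordered) - 1, int(len(ordered) * k / 100.0))

    # Take a window of up to n values around the center, clipped to the list.
    lo = max(0, center - n // 2)
    hi = min(len(ordered), lo + n)
    lo = max(0, hi - n)
    window = ordered[lo:hi]
    return sum(window) / len(window)
```

Under these assumptions, roughly K percent of the other speakers' similarities lie above the returned value, and averaging over N neighbors smooths the estimate when the stored speaker set is small.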

[0051] FIG. 3 is a block diagram showing a third embodiment of the threshold determination method and apparatus according to the present invention.

[0052] The enrollment speech that is input is sent to the analysis unit 210, and the utterance content is sent to the similarity calculation unit 230. The input speech sent to the analysis unit 210 is converted into a feature vector and sent to the similarity calculation unit 230. The feature vector is the same as in the first invention.

[0053] In the similarity calculation unit 230, the subword-unit standard patterns stored in the subword-unit tree-structured speaker standard pattern storage unit 220 are concatenated according to the utterance content received, and a word standard pattern is created for each node. Next, the similarity between each node's word standard pattern and the feature vector sent from the analysis unit 210 is calculated.

[0054] The method of calculating the similarity between the tree-structured speaker standard patterns and the feature vector is described below, taking the tree structure of FIG. 4 as an example.

[0055] First, the similarity between the feature vector and the standard pattern of each second-stage node is calculated. Next, for the nodes having the top M similarities, the similarity between the feature vector and the standard patterns of their lower nodes (the third stage, which is the lowest level) is calculated. For a node whose similarity is not calculated, the similarity of its upper node is substituted. In this way, the similarity calculation proceeds from the upper nodes toward the leaf nodes, and the similarities finally obtained are sent to the threshold determination unit 240.
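The top-M descent described above can be sketched as follows. This is an illustrative version that generalizes the three-stage example to any depth; the `Node` class, the `score` callback (standing in for the similarity between the feature vector and a node's word standard pattern), and the function names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)


def leaves(node: Node) -> List[Node]:
    """All leaf nodes (individual speakers) below a node."""
    if not node.children:
        return [node]
    found: List[Node] = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def tree_leaf_similarities(root: Node,
                           score: Callable[[Node], float],
                           m: int) -> Dict[str, float]:
    """Similarity for every leaf, scoring only the children of the top-m nodes."""
    result: Dict[str, float] = {}
    level = list(root.children)            # start at the second stage
    while level:
        scored = sorted(((score(n), n) for n in level),
                        key=lambda pair: pair[0], reverse=True)
        next_level: List[Node] = []
        for rank, (sim, node) in enumerate(scored):
            if not node.children:
                result[node.name] = sim    # leaf: its own similarity is final
            elif rank < m:
                next_level.extend(node.children)   # expand the top-m nodes
            else:
                # Not expanded: leaves below inherit the upper node's similarity.
                for leaf in leaves(node):
                    result[leaf.name] = sim
        level = next_level
    return result
```

The returned leaf similarities form the distribution from which the threshold is taken, while the number of similarity calculations grows with M rather than with the total number of stored speakers.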

[0056] In the threshold determination unit 240, the similarities sent from the similarity calculation unit 230 are sorted in descending order, and the similarity corresponding to the K-th percentile of the similarity distribution is output as the threshold. Alternatively, the average of the N similarities near the K-th percentile of the similarity distribution is output as the threshold.

[0057]

[Effects of the Invention] As described above, according to the present invention, a threshold that yields a desired false acceptance rate can be set in advance, and an appropriate threshold can be obtained even when each registered speaker utters a different word.

[Brief description of the drawings]

FIG. 1 is a block diagram showing a first embodiment of the threshold determination apparatus according to the present invention.

FIG. 2 is a block diagram showing a second embodiment of the threshold determination apparatus according to the present invention.

FIG. 3 is a block diagram showing a third embodiment of the threshold determination apparatus according to the present invention.

FIG. 4 is a diagram for explaining a conventional threshold determination method.

FIG. 5 is a block diagram of a conventional threshold determination apparatus.

[Explanation of symbols]

10, 110, 210: analysis unit
20, 120: subword-unit speaker standard pattern storage unit
30, 130, 230, 330: similarity calculation unit
40, 240: threshold determination unit
140, 360: threshold calculation unit
220: subword-unit tree-structured speaker standard pattern storage unit
310: personal standard pattern creation unit
320: word-unit personal standard pattern storage unit
340: multi-speaker speech storage unit
350: mean and variance calculation unit

Claims


1. A threshold determination method in speaker verification for determining a threshold that serves as the criterion for identity decision when an input pattern is compared with a standard pattern, the method comprising: storing in advance subword-unit standard patterns of a plurality of speakers other than the registered speaker; at enrollment, concatenating the subword standard patterns representing the utterance content to create word standard patterns; obtaining a distribution of similarities between the word standard patterns and the person's enrollment speech; and determining the threshold based on the obtained distribution.

2. The method according to claim 1, wherein the subword is a phoneme, a syllable, or a demisyllable.

3. The threshold determination method in speaker verification according to claim 1, wherein the threshold is determined as the similarity at the K-th percentile of the distribution of similarities between the person's enrollment speech and the standard patterns of the plurality of speakers.

4. The threshold determination method in speaker verification according to claim 1, wherein the threshold is determined as the average of N similarities near the K-th percentile of the distribution of similarities between the person's enrollment speech and the standard patterns of the plurality of speakers.

5. The threshold determination method in speaker verification according to claim 1, wherein the threshold is determined based on N similarities randomly selected from the distribution of similarities between the person's enrollment speech and the standard patterns of the plurality of speakers.

6. The threshold determination method in speaker verification according to claim 1, wherein the threshold is determined based on N similarities near the median of the distribution of similarities between the person's enrollment speech and the standard patterns of the plurality of speakers.

7. A threshold determination method in speaker verification, wherein: standard patterns of a plurality of speakers are stored as tree-structured speaker standard patterns composed of nodes corresponding to speaker clusters, in which the first-stage node corresponds to the cluster of all speakers, the child nodes of the first-stage node correspond to the second-stage nodes, each upper node holds a standard pattern representing the acoustic characteristics of the lower nodes connected to it, and the terminal leaf nodes consist of the standard patterns of the individual speakers; the similarity between the enrollment speech and the standard pattern of each second-stage node is calculated; the similarity between the enrollment speech and the standard patterns of the lower nodes of the nodes having the top M similarities is then calculated, proceeding from the upper nodes toward the lower nodes; for a node whose similarity is not calculated, the similarity of its upper node is substituted; and the threshold is determined based on the distribution of the similarities finally obtained at the leaf nodes.

8. The method according to claim 7, wherein the similarity of the Kth percentile of the similarity distribution is determined as a threshold.

9. The method according to claim 7, wherein the average of N similarities near the K-th percentile of the similarity distribution is determined as the threshold.

10. The method according to claim 7, wherein the average of the N similarities near the K-th percentile of the similarity distribution is determined as the threshold.

11. A threshold determination method in speaker verification, comprising: analyzing the input enrollment speech of the person and converting it into a feature vector; concatenating subword-unit standard patterns of a plurality of speakers according to the input utterance content to create a word standard pattern for each speaker; calculating the similarity between the person's feature vector and each word standard pattern; and obtaining the threshold used at verification from the distribution of the calculated similarities.

12. A threshold determination method in speaker verification, comprising: analyzing the input enrollment speech of the person and converting it into a feature vector; concatenating subword-unit standard patterns of a plurality of speakers according to the input utterance content to create a word standard pattern for each speaker; calculating the similarity between the person's feature vector and each word standard pattern; sorting the calculated similarities in descending order; and using the average of N similarities near the K-th percentile of the similarity distribution as the threshold at verification.

13. A threshold determination method in speaker verification, comprising: analyzing the input enrollment speech of the person and converting it into a feature vector; concatenating, according to the input utterance content, subword-unit standard patterns of a plurality of speakers held in a tree structure in which each node represents the acoustic characteristics of the lower nodes connected to it and the terminal leaf nodes hold the standard patterns of the individual speakers, to create a word standard pattern for each node; calculating the similarity between the person's feature vector and the created tree-structured speaker standard patterns from the upper nodes toward the lower nodes, wherein in each layer the similarities of the lower nodes are calculated only for the top M nodes with the highest similarity, and the similarity of a node that is not calculated is substituted by the similarity of its upper node; and obtaining the threshold used at verification from the distribution of the calculated similarities.

14. A threshold determination apparatus in speaker verification, comprising: an analysis unit that analyzes the input enrollment speech of the person and converts it into a feature vector; a subword-unit speaker standard pattern storage unit that stores standard patterns of a plurality of speakers in subword units; a similarity calculation unit that concatenates the subword-unit standard patterns stored in the subword-unit speaker standard pattern storage unit according to the input utterance content to create a word standard pattern for each speaker, and calculates the similarity between the person's feature vector converted by the analysis unit and the word standard patterns of the plurality of speakers; and a threshold determination unit that obtains the threshold used at verification from the distribution of the similarities calculated by the similarity calculation unit.

15. A threshold determination apparatus in speaker verification, comprising: an analysis unit that analyzes the input enrollment speech of the person and converts it into a feature vector; a subword-unit speaker standard pattern storage unit that stores standard patterns of a plurality of speakers in subword units; a similarity calculation unit that concatenates the subword-unit standard patterns stored in the subword-unit speaker standard pattern storage unit according to the input utterance content to create a word standard pattern for each speaker, and calculates the similarity between the person's feature vector converted by the analysis unit and the word standard patterns of the plurality of speakers; and a threshold calculation unit that sorts the similarities calculated by the similarity calculation unit in descending order and uses the average of N similarities near the K-th percentile of the similarity distribution as the threshold at verification.

16. A threshold determination apparatus in speaker verification, comprising: an analysis unit that analyzes the input enrollment speech of the person and converts it into a feature vector; a tree-structured speaker standard pattern storage unit that stores, in subword units, standard patterns of a plurality of speakers in a tree structure in which each node represents the acoustic characteristics of the lower nodes connected to it and the terminal leaf nodes hold the standard patterns of the individual speakers; a similarity calculation unit that concatenates the stored subword-unit standard patterns according to the input utterance content to create a word standard pattern for each node, and that, when calculating the similarity between the person's feature vector converted by the analysis unit and the created tree-structured speaker standard patterns from the upper nodes toward the lower nodes, calculates the similarities of the lower nodes only for the top M nodes with the highest similarity in each layer and substitutes the similarity of the upper node for a node that is not calculated; and a threshold determination unit that obtains the threshold used at verification from the distribution of the similarities calculated by the similarity calculation unit.