JP2000122693A

JP2000122693A - Speaker recognizing method and speaker recognizing device

Info

Publication number: JP2000122693A
Application number: JP10297026A
Authority: JP
Inventors: Yutaka Deguchi; 豊出口; Hiroshi Kanazawa; 博史金澤
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1998-10-19
Filing date: 1998-10-19
Publication date: 2000-04-28
Anticipated expiration: 2018-10-19
Also published as: JP3859884B2

Abstract

PROBLEM TO BE SOLVED: To shorten the time required for speaker recognition while maintaining high recognition performance, by creating a speaker model with a tree-structured storage structure from feature quantities extracted from the speaker's utterances. SOLUTION: The speaker model of a single speaker is expressed as a tree structure. The speaker model creation unit 3 creates the speaker model using a group of feature vectors calculated from the speaker's voice data input at training time. In the method that uses a codebook of feature vectors as the speaker model, the speaker's feature vectors are clustered by some method and one representative vector is determined for each cluster. The set of these representative vectors is used as the speaker model. The collation unit 5 collates the feature vectors of the speaker to be recognized, calculated by the feature calculation unit 2, with each speaker model stored in the speaker model storage unit 4 to calculate a likelihood or a distance, and identifies the speaker as the claimed person when the likelihood exceeds a prescribed value.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a speaker recognition method for identifying an individual by voice, and to a speaker recognition device using the method.

[0002]

[Description of the Related Art] Conventionally, individuals have been identified at financial institutions, government offices, and the like by means of a pre-registered seal or personal identification number. With such conventional methods, however, accurate identification becomes impossible when the seal is lost or stolen, or when the personal identification number is forgotten or leaked. Methods that identify an individual by his or her voice have therefore been proposed.

[0003] Speaker recognition schemes that identify an individual by voice are broadly divided into utterance-content-dependent (text-dependent) methods, in which recognition is performed on the utterance of specific words, and utterance-content-independent (text-independent) methods, in which authentication is performed on arbitrary utterances. Text-dependent methods generally achieve a higher recognition rate, but impose the constraint that specific words must be uttered. Text-independent methods have the advantage that the longer the utterance, the higher the recognition rate. Because they do not require the user to memorize specific words and place little burden on the user, text-independent methods can be applied in a wide variety of fields.

[0004] An example of a conventional speaker recognition method using the utterance-content-independent approach is described next. At training time, feature quantities (feature vectors) are calculated from the collected voice data of a plurality of speakers, and a speaker model is then created and stored for each speaker. At recognition time, the voice data of the speaker to be recognized is converted into feature quantities, the converted feature quantities are collated with the previously stored speaker models to calculate a likelihood, and the speaker is identified as the claimed person when, for example, the likelihood exceeds a certain value.

[0005] When voice data is converted into feature quantities at training time, a commonly used technique is, for example, to extract segments of about 16 ms to 40 ms from the voice data successively, every 8 ms to 16 ms, and to calculate a feature quantity for each segment.
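The segmenting described in [0005] can be sketched as follows. This is only a minimal illustration under stated assumptions: the function name `frame_signal`, the 32 ms segment length, and the 16 ms shift are hypothetical choices taken from within the ranges given above, and the feature calculation applied to each segment is omitted.

```python
def frame_signal(samples, sample_rate, frame_ms=32, hop_ms=16):
    """Split a 1-D sample sequence into overlapping fixed-length segments.

    frame_ms and hop_ms are assumed values inside the 16-40 ms and
    8-16 ms ranges mentioned in the text.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame and hop lengths must be at least one sample")

    frames = []
    start = 0
    # Keep only complete segments; a trailing partial segment is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames
```

With one second of audio sampled at 16 kHz, these settings give segments of 512 samples taken every 256 samples.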

[0006] At recognition time, a feature quantity is likewise calculated for each segment, each calculated feature quantity is collated individually with the speaker model, and the collation results are integrated to produce the final recognition result.
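The text does not say how the per-segment collation results are integrated. The sketch below assumes one simple possibility: each segment is scored by its Euclidean distance to the nearest codebook vector, and the scores are averaged over the utterance. The function names, the use of Euclidean distance, and the averaging rule are all assumptions made for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def frame_distance(frame_vec, codebook):
    """Distance from one segment's feature vector to its nearest codebook entry."""
    return min(euclidean(frame_vec, c) for c in codebook)


def utterance_distance(frame_vecs, codebook):
    """Integrate per-segment distances by averaging (an assumed rule)."""
    if not frame_vecs:
        raise ValueError("at least one feature vector is required")
    return sum(frame_distance(x, codebook) for x in frame_vecs) / len(frame_vecs)


def is_accepted(frame_vecs, codebook, threshold):
    """Accept the claimed identity when the averaged distance is below threshold."""
    return utterance_distance(frame_vecs, codebook) < threshold
```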

[0007] Typical ways of creating a speaker model are to use a hidden Markov model (hereinafter, HMM), or to cluster the distribution of the feature vectors, retain a representative feature vector for each cluster, and use a codebook listing those representative vectors.

[0008] To cope with changes in the surrounding environment, it is also common, after the likelihood has been calculated, to normalize the recognition result using not only the recognition result for the claimed speaker's model but also the recognition results for the models of other speakers. For example, a frequently used technique calculates the likelihood (distance) not only for the claimed speaker's model but also for the models of five other speakers who resemble that speaker, and normalizes the recognition result using the two values with the highest likelihood (smallest distance).

[0009] The method used to create the speaker model has a large effect on speaker recognition performance. To avoid misrecognition, the speaker model must be created so that the likelihood is large (or the distance small) for the speaker's own voice data and the likelihood is small (or the distance large) for the voice data of anyone else. In addition, to limit the burden on the user, recognition with a minimum of voice data is desirable, but the shorter the utterance, the harder it is to capture the speaker's characteristics. A speaker model that maintains high recognition performance even for short utterances is therefore needed.

[0010] Maintaining high recognition performance for short utterances requires a fine-grained speaker model. That is, when the speaker model is an HMM that represents the distribution of feature vectors as a combination of several Gaussian distributions (a mixture distribution), the number of Gaussian distributions used in the mixture (the number of HMM distributions) is increased. When a codebook of feature vectors is used as the speaker model, measures such as increasing the size of the codebook are taken. In general, however, the amount of computation is proportional to the number of HMM distributions or to the size of the codebook, so the finer the model, the longer recognition takes.

[0011]

[Problems to Be Solved by the Invention] As described above, creating a fine-grained speaker model to improve recognition performance for short utterances has the drawback that it increases the amount of likelihood (distance) computation needed to collate the input speech with the speaker model, and thus lengthens the time required for speaker recognition.

[0012] An object of the present invention is therefore to provide a speaker recognition method, and a speaker recognition device using the method, that reduce the amount of computation required to collate input speech with a speaker model and thereby shorten the time required for speaker recognition without coarsening the model, that is, while maintaining high recognition performance.

[0013]

[Means for Solving the Problems] (1) The speaker recognition method of the present invention recognizes a speaker on the basis of a speaker model created in advance from feature quantities extracted from the speaker's utterances, and is characterized in that a speaker model having a tree-structured storage structure is created from feature quantities extracted from an input speaker's utterance and is stored in storage means, and in that feature quantities extracted from the input utterance of a speaker to be recognized are collated with the stored speaker model to recognize that speaker.

[0014] According to the present invention, even when a fine-grained speaker model is created, structuring the speaker model as a tree reduces the amount of computation needed to calculate the likelihood (or distance) between the input speech and the speaker model while maintaining high recognition performance, so that fast speaker recognition becomes possible.

[0015] (2) The speaker recognition method of the present invention recognizes a speaker on the basis of a speaker model created in advance from feature quantities extracted from the speaker's utterances, and is characterized in that: a speaker model having a tree-structured storage structure is created from feature quantities extracted from an input speaker's utterance and is stored in storage means; feature quantities extracted from the input utterance of a speaker to be recognized are collated with the plurality of stored speaker models to select the speaker models to be used for normalizing the result of collating the feature quantities of the speaker to be recognized with that speaker's own speaker model; and the selected speaker models are used to normalize that collation result, whereby the speaker to be recognized is recognized.

[0016] According to the present invention, the values obtained while calculating the likelihood (or distance) are used to decide which speaker models to use for normalization, which suppresses the increase in computation that normalizing the collation result would otherwise entail.

[0017] (3) The speaker recognition device of the present invention recognizes a speaker on the basis of a speaker model created in advance from feature quantities extracted from the speaker's utterances, and is characterized by comprising: storage means for creating and storing a speaker model having a tree-structured storage structure from feature quantities extracted from an input speaker's utterance; and recognition means for collating feature quantities extracted from the input utterance of a speaker to be recognized with the speaker model stored in the storage means to recognize that speaker.

[0018] According to the present invention, even when a fine-grained speaker model is created, structuring the speaker model as a tree reduces the amount of computation needed to calculate the likelihood (or distance) between the input speech and the speaker model while maintaining high recognition performance, so that fast speaker recognition becomes possible.

[0019] (4) The speaker recognition device of the present invention recognizes a speaker on the basis of a speaker model created in advance from feature quantities extracted from the speaker's utterances, and is characterized by comprising: storage means for creating and storing a speaker model having a tree-structured storage structure from feature quantities extracted from an input speaker's utterance; selection means for collating feature quantities extracted from the input utterance of a speaker to be recognized with the plurality of speaker models stored in the storage means, and selecting the speaker models to be used for normalizing the result of collating the feature quantities of the speaker to be recognized with that speaker's own speaker model; and recognition means for normalizing that collation result using the speaker models selected by the selection means to recognize the speaker to be recognized.

[0020] According to the present invention, the values obtained while calculating the likelihood (or distance) are used to decide which speaker models to use for normalization, which suppresses the increase in computation that normalizing the collation result would otherwise entail.

[0021]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings. FIG. 1 shows a configuration example of a speaker recognition device that uses the speaker recognition method of this embodiment (with the utterance-content-independent approach applied). The device comprises: a voice input unit 1 that receives speech; a feature calculation unit 2 that extracts segments from the input speech and calculates a feature quantity (for example, a feature vector) for each extracted segment; a speaker model creation unit 3 that creates a speaker model from the calculated feature quantities; a speaker model storage unit 4 that stores the created speaker model; and a collation unit 5 that, at recognition time, collates the feature quantities calculated from the speech input to the voice input unit 1 with the speaker model (dictionary) stored in the speaker model storage unit 4 to calculate a likelihood (or a distance indicating how far the input is from the speaker model serving as the dictionary), and identifies the speaker as the claimed person when the calculated likelihood exceeds a predetermined value (or when the distance is smaller than a predetermined value).

[0022] An outline of the processing performed by the speaker recognition device configured as in FIG. 1 is described with reference to the flowchart of FIG. 3. At training time (step S1), the feature calculation unit 2 calculates feature vectors from the voice data of a plurality of speakers input through the voice input unit 1, after which the speaker model creation unit 3 creates a speaker model for each speaker and stores it in the speaker model storage unit 4 (steps S2 to S5). At recognition time (step S1), the feature calculation unit 2 calculates feature vectors from the voice data of the speaker to be recognized input through the voice input unit 1, after which the collation unit 5 collates those feature vectors with each of the speaker models stored in the speaker model storage unit 4 to calculate a likelihood or a distance. The speaker is identified as the claimed person when, for example, the likelihood exceeds a predetermined value (steps S6 to S9).

[0023] FIG. 2 shows another configuration example of a speaker recognition device that uses the speaker recognition method of this embodiment (with the utterance-content-independent approach applied). Parts identical to those in FIG. 1 are given the same reference numerals, and only the differences are described. In FIG. 2, a normalization unit 6 is added to the configuration of FIG. 1; it normalizes the likelihood or distance, calculated by the collation unit 5, between the feature quantities of the speaker to be recognized and that speaker's speaker model.

[0024] To cope with changes in the surrounding environment, the normalization unit 6 normalizes the recognition result in step S8 of FIG. 3, after the collation unit 5 has calculated the likelihood or distance, using not only the recognition result for the claimed speaker's model but also the recognition results for the models of other speakers. For example, the likelihood (or distance) is calculated not only for the claimed speaker's model but also for the models of five other speakers who resemble that speaker, and the recognition result is normalized using the two values with the highest likelihood (smallest distance). The speaker is identified as the claimed person on the basis of the normalized recognition result (likelihood or distance), that is, when the likelihood exceeds a predetermined value or the distance is smaller than a predetermined value.

[0025] In the configurations of FIGS. 1 and 2, the characteristic feature of the present invention is the method by which the speaker model creation unit 3 creates the speaker model in step S4 of FIG. 3. That is, in the present invention the speaker model of a single speaker is expressed as a tree structure such as those shown in FIGS. 4 and 5. This tree structure is hereinafter called a tree-structured speaker model.

[0026] The speaker model creation unit 3 creates the speaker model using the group of feature vectors calculated from the speaker's voice data input at training time. In the method that uses a codebook of feature vectors as the speaker model, the speaker's feature vectors are clustered by some method and one representative vector is determined for each cluster. The set of these representative vectors is used as the speaker model. This embodiment uses such a codebook of feature vectors as the speaker model.

[0027] In step S4 of FIG. 3, a tree-structured speaker model is created using the feature vectors that the feature calculation unit 2 has calculated from the voice data of a plurality of speakers. FIG. 4 shows an example of a tree-structured speaker model. The model in FIG. 4 is a two-layer tree with one level of intermediate nodes (b1 to b8) between the root node and the terminal nodes (a1 to a32), but in principle a structure with several levels of intermediate nodes poses no problem at all. For example, as in FIG. 5, some terminal nodes (nodes c1, c2, c4, and c5 in FIG. 5) may have one level of intermediate nodes above them (nodes b2 and b3 in FIG. 5) while other terminal nodes (nodes a1 to a5 in FIG. 5) have two levels (in FIG. 5, node c3 and its parent node b2 are the intermediate nodes for nodes a1 and a2, and node c3 and its parent node b3 are the intermediate nodes for nodes a3 to a5).

[0028] Next, the procedure by which the speaker model creation unit 3 builds a tree-structured speaker model is described with reference to the flowchart of FIG. 6. The tree-structured speaker model of the present invention is created by clustering the feature vectors successively, one more time than the number of levels of intermediate nodes. The number of nodes at each level is the number of clusters produced by the corresponding clustering pass, and one node is associated with one feature vector. Taking the tree-structured speaker model of FIG. 4 as an example, the feature vectors are first clustered into eight clusters using an ordinary clustering technique (step S11). Because each cluster contains several feature vectors at this point, a representative vector is determined for each cluster (steps S12 to S13), and the representative vectors are assigned to nodes b1 to b8 (step S14).

[0029] Next, a second clustering pass is performed on the feature vectors belonging to each cluster, dividing the feature vectors of each cluster into four further clusters (step S15). Here each resulting cluster has exactly one feature vector, so each feature vector is assigned to a lower-level node (steps S12 and S16).

[0030] In this embodiment, the terminal nodes (the 32 nodes a1 to a32) hold the final, detailed feature vectors and the intermediate nodes (the eight nodes b1 to b8) hold intermediate representative vectors, so that the speaker model of one speaker is built from a total of 40 representative vectors.
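The two-pass construction of [0028] to [0030] can be sketched as follows. This is a rough illustration under several assumptions, not the procedure of FIG. 6 itself: the text only calls for "an ordinary clustering technique", so plain k-means with evenly spaced seeds is assumed; a terminal node holds the mean of its final cluster; an upper node holds the mean of the terminal vectors beneath it; and all names (`kmeans`, `build_tree`, `leaves`, the dictionary layout) are hypothetical.

```python
import math


def dist(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))


def kmeans(vectors, k, iterations=20):
    """Plain Lloyd k-means (assumed here); returns the non-empty clusters."""
    step = len(vectors) / k
    centroids = [vectors[int(i * step)] for i in range(k)]  # evenly spaced seeds
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: dist(v, centroids[i]))
            clusters[nearest].append(v)
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return [c for c in clusters if c]


def leaves(node):
    """Vectors held by all terminal nodes below (or at) the given node."""
    if not node["children"]:
        return [node["vector"]]
    return [v for child in node["children"] for v in leaves(child)]


def build_tree(vectors, branching):
    """Cluster once per entry of branching, e.g. (8, 4) for a FIG. 4 style tree."""
    if not branching or len(vectors) == 1:
        # Terminal node: representative vector of the final cluster.
        return {"vector": mean(vectors), "children": []}
    k = min(branching[0], len(vectors))
    children = [build_tree(c, branching[1:]) for c in kmeans(vectors, k)]
    # Upper node: mean of the terminal vectors connected below it.
    return {"vector": mean([v for c in children for v in leaves(c)]),
            "children": children}
```

Calling `build_tree(feature_vectors, (8, 4))` on data that really falls into eight groups of four sub-groups yields 8 intermediate nodes and 32 terminal nodes, matching the 40 vectors counted above. K-means is sensitive to its seeds, so real data may need a more careful initialization.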

[0031] Conventionally, only the 32 representative vectors corresponding to the terminal nodes were held, so the model is larger by the number of intermediate nodes. Because there is one level of intermediate nodes here, the speaker model was built with two clustering passes. Even with more levels of intermediate nodes, the model can be built by repeating this operation until each terminal node holds only one feature vector.

[0032] In the tree-structured speaker model, each upper node holds the mean of the representative vectors of the terminal nodes connected below it. For example, in the tree-structured speaker model of FIG. 4, node b1 holds the mean of the feature vectors held at nodes a1 to a4.
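Written out for the FIG. 4 example, the relation for node b1 is the following. The symbol v for the vector held at a node is notation introduced here for illustration and does not appear in the source.

```latex
\mathbf{v}_{b_1} = \frac{1}{4} \sum_{i=1}^{4} \mathbf{v}_{a_i}
```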

[0033] In the present invention, the distance between an input vector, that is, a feature quantity extracted from the utterance of the speaker to be recognized, and a speaker model is calculated by the procedure shown in the flowchart of FIG. 7.

[0034] In the present invention, the tree-structured speaker model, which expresses the result of the clustering, is traversed from the root node to a terminal node in search of the nearest node, and the distance between the input vector and the vector held by that node is defined as the distance between the input vector and the speaker model. The node being examined at any point during the traversal is called the node of interest. The tree-structured speaker model of FIG. 4 is used as an example below.

[0035] The root node is taken as the node of interest (step S21), and the distance between the input vector and the vector held by each of the nodes b1 to b8 connected to the node of interest is calculated (step S22). If, for example, b1 gives the smallest distance, the node of interest is moved to node b1 (step S23). It is then checked whether the node of interest has lower nodes (step S24). When it does, as in this case, the process returns to step S22, and the distance between the input vector and the vector held by each of the nodes a1 to a4 connected to the node of interest is calculated. On the basis of these distances, the node of interest is moved to the node with the smallest distance (step S23). It is checked again whether the node of interest has lower nodes (step S24). When it does not, as is now the case, the process proceeds to step S25, where the distance between the input vector and the vector held by the node of interest is output as the distance between the input vector and the speaker model, and the process ends.

[0036] When 32 representative vectors are held, as in the speaker model of FIG. 4, a conventional speaker model requires 32 vector distance calculations, whereas this embodiment requires only 12 in total: eight for the nodes at the first level and four for the nodes at the second level. Structuring the speaker model as a tree thus speeds up the distance calculation.
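The descent of steps S21 to S25 and the count of 12 distance calculations can be checked with a small sketch. The tree below is synthetic: its vector values, the dictionary layout, the Euclidean distance, and the names `make_example_tree` and `tree_distance` are assumptions made for illustration; only the 8 x 4 shape follows FIG. 4.

```python
import math


def dist(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def make_example_tree():
    """Root with 8 intermediate nodes, each with 4 terminal nodes (synthetic values)."""
    children = []
    for g in range(8):
        terminal = [{"vector": (100.0 * g, 10.0 * s), "children": []} for s in range(4)]
        # Each intermediate node holds the mean of its four terminal vectors.
        children.append({"vector": (100.0 * g, 15.0), "children": terminal})
    return {"vector": None, "children": children}


def tree_distance(root, x):
    """Greedy descent; returns (distance, number of distance calculations)."""
    node, best, count = root, None, 0           # S21: start at the root
    while node["children"]:                     # S24: any lower nodes?
        scored = [(dist(x, c["vector"]), c) for c in node["children"]]  # S22
        count += len(scored)
        best, node = min(scored, key=lambda p: p[0])                    # S23
    return best, count                          # S25


def flat_distance(root, x):
    """Conventional search over every terminal vector, for comparison."""
    terminal = [t["vector"] for b in root["children"] for t in b["children"]]
    return min(dist(x, v) for v in terminal), len(terminal)
```

For an input such as `(205.0, 12.0)`, `tree_distance` performs 12 calculations and `flat_distance` performs 32, and in this well-separated example both return the same distance.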

[0037] Structuring the speaker model as a tree can, however, cause the representative vector that should have been selected to be missed, so that a larger distance is calculated than with a conventional speaker model. When this error in the distance calculation affects recognition accuracy, the error can be reduced by moving the node of interest, in step S23 of FIG. 7, to the s nodes whose vectors have the smallest distances rather than to the single nearest node.

[0038] For example, with the tree-structured speaker model of FIG. 4, if s = 2 in step S23 of FIG. 7, that is, if the two nodes with the smallest distances are selected as nodes of interest, a total of 16 distance calculations are needed: eight for the nodes at the first level and 4 + 4 = 8 for the nodes at the second level. This is still considerably fewer than the 32 calculations of the conventional method, and the error is smaller than when only the single nearest node is selected.
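Keeping the s nearest nodes at each level can be sketched as below. This is one possible reading of the modified step S23, written under assumptions: the s survivors are chosen across all children of the current nodes of interest, Euclidean distance is used, and the function names, dictionary layout, and synthetic tree values are hypothetical.

```python
import math


def dist(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def make_example_tree():
    """Root with 8 intermediate nodes, each with 4 terminal nodes (synthetic values)."""
    children = []
    for g in range(8):
        terminal = [{"vector": (100.0 * g, 10.0 * s), "children": []} for s in range(4)]
        children.append({"vector": (100.0 * g, 15.0), "children": terminal})
    return {"vector": None, "children": children}


def beam_distance(root, x, s=2):
    """Descend keeping the s nearest nodes per level.

    Returns (distance, number of distance calculations).
    """
    frontier, best, count = [root], math.inf, 0
    while frontier:
        scored = [(dist(x, c["vector"]), c) for n in frontier for c in n["children"]]
        count += len(scored)
        scored.sort(key=lambda p: p[0])
        kept = scored[:s]
        # A kept node without children is terminal, so it is a candidate answer.
        for d, c in kept:
            if not c["children"]:
                best = min(best, d)
        frontier = [c for _, c in kept if c["children"]]
    return best, count
```

On the 8 x 4 tree, `beam_distance(tree, x, s=2)` performs 8 + 8 = 16 calculations and `s=1` performs 12. The benefit shows up when the nearest intermediate node is not the parent of the nearest terminal node, in which case `s=1` returns a larger distance than `s=2`.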

[0039] Next, normalization of the calculated likelihood (normalization of the distance) is described. In general, in addition to the distance to the claimed speaker's model, the distances to n other speaker models are calculated, and the distance is normalized using the following equation (1).

[0040]

[Equation 1]

[0041] In equation (1), the distance d(X|Si) obtained from the input vector X and the speaker model Si is normalized using the other speaker models Sj (1 ≤ j ≤ n) to obtain the normalized distance dnew(X|Si). The geometric mean, for example, is commonly used as the function f in the denominator.
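The equation itself is not reproduced in this text. A form consistent with the description above would be the following; it is a reconstruction from the surrounding wording, and in particular the assumption that the numerator is the unmodified distance d(X|Si) cannot be confirmed from the source.

```latex
d_{\mathrm{new}}(X \mid S_i) =
  \frac{d(X \mid S_i)}{f\bigl(d(X \mid S_1), \ldots, d(X \mid S_n)\bigr)},
\qquad
f(d_1, \ldots, d_n) = \Bigl(\prod_{j=1}^{n} d_j\Bigr)^{1/n}
\ \text{(geometric mean)}
```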

[0042] The n speaker models are those of speakers whose voices resemble the claimed speaker's, that is, speaker models that yield small distances for the claimed speaker's utterances; they are identified in advance from the training speech. In practice, instead of using the distances to all n speaker models, the calculation is often performed using only the k speaker models with the smallest distances.
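A minimal sketch of this normalization, assuming that the normalized distance is the claimed speaker's distance divided by the geometric mean of the cohort distances and that all distances are strictly positive. The function names and the sample numbers are hypothetical.

```python
import math

def geometric_mean(values):
    """Geometric mean of strictly positive numbers, computed in the log domain."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def normalized_distance(d_claimed, cohort_distances, k=None):
    """Divide the claimed speaker's distance by f(cohort distances).
    If k is given, only the k smallest cohort distances enter f."""
    chosen = sorted(cohort_distances)
    if k is not None:
        chosen = chosen[:k]
    return d_claimed / geometric_mean(chosen)

# Hypothetical distances: claimed speaker 2.0, three other speaker models.
print(normalized_distance(2.0, [4.0, 1.0, 16.0]))       # all n = 3 models
print(normalized_distance(2.0, [4.0, 1.0, 16.0], k=2))  # k = 2 closest models
```

Restricting f to the k closest models makes the normalizer depend only on the most confusable speakers, which is why selecting those k models cheaply matters in the paragraphs that follow.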

[0043] In the conventional method, speaker normalization requires calculating the distances to all n speaker models. In the present invention, however, because tree-structured speaker models are used, the k speaker models with approximately the smallest distances can be selected, and the distances to those models obtained, without calculating the full distances to all n speaker models.

[0044] FIG. 8 is a flowchart for explaining the processing operation when normalizing the matching result between the feature vector extracted from the utterance of the speaker to be recognized and that speaker's speaker model. The description takes the tree-structured speaker model of FIG. 4 as an example.

[0045] With the root node of each of the n speaker models as the node of interest (step S31), the distances to the eight representative vectors of the first-level nodes b1 to b8 are first calculated for each speaker model, and the minimum of the eight is found (step S32). From the minimum-distance nodes (Node x') obtained from each of the n speaker models, the t (> k) nodes with the smallest values are extracted and set as the new nodes of interest (step S33).

[0046] It is then checked whether any of the t nodes newly set as nodes of interest has lower-level nodes (step S34). In the tree-structured speaker model shown in FIG. 4 such nodes exist, so the process returns to step S32, and the second-level calculation is performed only for the t speaker models in which a new node of interest has been set.

[0047] For the t speaker models, the t distances obtained at the second level (one per model) are taken in ascending order (step S33). In the tree-structured speaker model shown in FIG. 4, no node now has lower-level nodes (third-level nodes), so the process proceeds to step S35: from the t nodes, the k nodes with the smallest distances are taken, and the distance is normalized by equation (1) using the speaker models that contain those k nodes (step S35).
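The selection procedure of FIG. 8 can be sketched as below for two-level trees. This is an illustrative reading of steps S31 to S35 under stated assumptions (Euclidean distance, each model stored as a list of first-level nodes with their leaf vectors); `select_cohort` and the data layout are hypothetical and not taken from the patent.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_cohort(x, models, t, k):
    """models: dict name -> list of (representative_vector, [leaf_vectors]).
    Returns (the k closest (distance, name) pairs, number of distance calculations)."""
    count = 0
    # Steps S31-S32: for every model, find its closest first-level node.
    first = []
    for name, tree in models.items():
        scored = [(dist(x, rep), i) for i, (rep, _) in enumerate(tree)]
        count += len(scored)
        d, i = min(scored)
        first.append((d, name, i))
    # Step S33: keep only the t models whose best first-level node is closest.
    survivors = sorted(first)[:t]
    # Steps S34 and S32 again: descend one level, but only for the survivors.
    second = []
    for _, name, i in survivors:
        leaves = models[name][i][1]
        count += len(leaves)
        second.append((min(dist(x, leaf) for leaf in leaves), name))
    # Step S35: the k closest models are the ones used for normalization.
    return sorted(second)[:k], count
```

Models dropped at step S33 never have their leaf vectors evaluated, which is where the saving over a full evaluation of every tree comes from.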

[0048] The effect on speaker recognition with normalization is described more concretely using the tree-structured speaker model shown in FIG. 4 with n = 10, t = 5 and k = 2. With the conventional speaker model, distances must be calculated for 10 speaker models of 32 vectors each, so 320 vector distance calculations are required. In contrast, according to the present invention, because the speaker models are structured, the calculation over all speaker models requires only 8 × 10 = 80 calculations for the first-level intermediate nodes and 4 × 10 = 40 calculations for the second-level leaf nodes, 120 in total.

[0049] Furthermore, as shown in FIG. 8, by limiting the candidates to t = 5 speaker models in step S33, the second-level calculation (four nodes per model) is needed only for those five speaker models, so it takes 4 × 5 = 20 calculations instead of 40. If normalization is performed by the procedure shown in FIG. 8, speaker recognition using tree-structured speaker models therefore saves 20 vector distance calculations compared with calculating the distances to all speaker models.
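The counts in [0048] and [0049] can be checked with a few lines of arithmetic. The helper names below are hypothetical; the formulas assume a two-level tree in which every first-level node has the same number of children.

```python
def conventional_count(n, vectors_per_model):
    """Flat codebooks: every vector of every model is compared."""
    return n * vectors_per_model

def tree_count(n, first_level, children, t=None):
    """First level is evaluated for all n models; the second level is
    evaluated for t models (for all n models when t is None)."""
    second = n if t is None else t
    return n * first_level + second * children

n, t = 10, 5
print(conventional_count(n, 32))    # flat speaker models
print(tree_count(n, 8, 4))          # tree models, all descended
print(tree_count(n, 8, 4, t=t))     # tree models, only t descended
```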

[0050] In addition, the intermediate results at the intermediate nodes, which are produced while obtaining the distance between a speaker model and the input vector, can be used to speed up likelihood normalization: the calculation is cut off partway for speaker models that will clearly not be used for distance normalization, and is continued only for the speaker models that are likely to be needed in the end.

[0051] Although the above embodiment creates a tree-structured speaker model as shown in FIG. 4, the present invention can be applied to any structure as long as it is a tree structure. Likewise, although this embodiment builds the tree-structured speaker model from a codebook of feature vectors, the same effect is obtained when the tree-structured speaker model is built from HMM distributions.

[0052] In speaker recognition, creating a fine-grained speaker model to improve recognition performance has the drawback of increasing the amount of calculation at recognition time. As described above, however, according to the above embodiment, even when a fine-grained speaker model is created, structuring the speaker model as a tree reduces the amount of calculation needed to obtain the likelihood (or distance) between the input speech and the speaker model while maintaining high recognition performance, thereby enabling high-speed speaker recognition.

[0053] When the matching result is normalized, likelihoods (or distances) to speaker models that are not evaluated during the ordinary likelihood (or distance) calculation are also required, so in the conventional approach normalization adds considerably to the amount of calculation compared with the case without normalization. According to the above embodiment, however, the values obtained during the likelihood (or distance) calculation are used to decide which speaker models to use for normalization, so the increase in the amount of calculation that accompanies normalization of the matching result can be suppressed.

[0054]

[Effects of the Invention] As described above, according to the present invention, the amount of calculation required to match the input speech against the speaker model is reduced, and the time required for speaker recognition can be shortened while maintaining high recognition performance.

[Brief description of the drawings]

FIG. 1 is a diagram showing a configuration example of a speaker recognition device to which the speaker recognition method according to an embodiment of the present invention is applied.

FIG. 2 is a diagram showing another configuration example of a speaker recognition device to which the speaker recognition method according to the embodiment of the present invention is applied.

FIG. 3 is a flowchart schematically showing the processing operation of the speaker recognition device.

FIG. 4 is a diagram showing an example of a tree-structured speaker model.

FIG. 5 is a diagram showing another example of a tree-structured speaker model.

FIG. 6 is a flowchart showing the procedure for constructing a tree-structured speaker model.

FIG. 7 is a flowchart showing the procedure for calculating the distance between a speaker model and the input vector, which is the feature amount extracted from the utterance of the speaker to be recognized.

FIG. 8 is a flowchart for explaining the processing operation when normalizing the matching result between the feature vector extracted from the utterance of the speaker to be recognized and that speaker's speaker model.

[Explanation of symbols]

1: speech input part; 2: feature amount calculation part; 3: speaker model creation part; 4: speaker model storage part; 5: collation part; 6: normalization part

Claims

1. A speaker recognition method for recognizing a speaker based on a speaker model created in advance from feature amounts extracted from the speaker's uttered sound, characterized in that a speaker model having a tree-structured storage structure is created based on the feature amounts extracted from the speaker's uttered sound and is stored in storage means, and the speaker to be recognized is recognized by matching the feature amounts extracted from the input uttered sound of the speaker to be recognized against the stored speaker model.

2. A speaker recognition method for recognizing a speaker based on a speaker model created in advance from feature amounts extracted from the speaker's uttered sound, characterized in that: a speaker model having a tree-structured storage structure is created based on the feature amounts extracted from the speaker's uttered sound and is stored in storage means; the feature amounts extracted from the input uttered sound of the speaker to be recognized are matched against the plurality of stored speaker models to select the speaker models to be used for normalizing the matching result between the input feature amounts of the speaker to be recognized and the speaker model of that speaker; and the speaker to be recognized is recognized by normalizing, using the selected speaker models, the matching result between the input feature amounts of the speaker to be recognized and the speaker model of that speaker.

3. A speaker recognition device for recognizing a speaker based on a speaker model created in advance from feature amounts extracted from the speaker's uttered sound, characterized by comprising: storage means for creating and storing a speaker model having a tree-structured storage structure based on the feature amounts extracted from the speaker's uttered sound; and recognition means for recognizing the speaker to be recognized by matching the feature amounts extracted from the input uttered sound of the speaker to be recognized against the speaker model stored in the storage means.

4. A speaker recognition device for recognizing a speaker based on a speaker model created in advance from feature amounts extracted from the speaker's uttered sound, characterized by comprising: storage means for creating and storing a speaker model having a tree-structured storage structure based on the feature amounts extracted from the speaker's uttered sound; selection means for matching the feature amounts extracted from the input uttered sound of the speaker to be recognized against the plurality of speaker models stored in the storage means, thereby selecting the speaker models to be used for normalizing the matching result between the input feature amounts of the speaker to be recognized and the speaker model of that speaker; and recognition means for recognizing the speaker to be recognized by normalizing, using the speaker models selected by the selection means, the matching result between the input feature amounts of the speaker to be recognized and the speaker model of that speaker.