JP2000163092A

JP2000163092A - Method and device for collating speaker

Info

Publication number: JP2000163092A
Application number: JP10339213A
Authority: JP
Inventors: Toshihiro Isobe; 俊洋磯部; Junichi Takahashi; 淳一高橋
Original assignee: NTT Data Corp
Current assignee: NTT Data Group Corp
Priority date: 1998-11-30
Filing date: 1998-11-30
Publication date: 2000-06-16
Anticipated expiration: 2018-11-30
Also published as: JP3422702B2

Abstract

PROBLEM TO BE SOLVED: To improve speaker verification performance. SOLUTION: At registration, a personal dictionary creating part 13 generates a phoneme dictionary for a new user from the speech read in and stores it in a personal dictionary storage part 23. This corresponds to setting a cohort speaker set in units of speakers. Also at registration, a by-phoneme neighbor speaker selection part 17 selects, for each phoneme of the new phoneme dictionary, the speakers whose corresponding phoneme is close to it in the acoustic space, drawing on the phoneme dictionaries of registered speakers already stored in the storage part 23. The selected speaker information is sorted nearest first and stored in a by-phoneme neighbor speaker table storage part 21. This corresponds to setting a cohort speaker set in units of phonemes. At authentication, a personal verification part 15 calculates the likelihood of the true speaker's phoneme dictionary for the speech read in. Based on the information in the storage part 21, a background speaker dictionary synthesis part 19 synthesizes background speaker phoneme dictionaries by combining, for each phoneme in the true speaker's phoneme dictionary, the corresponding phoneme of a neighbor speaker. The mean likelihood of the background speakers for the input speech is then calculated. Acceptance or rejection is decided by calculating a normalized likelihood and comparing it with a preset threshold.

Description

DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

[Technical Field of the Invention] The present invention relates to an improvement in a speaker verification method and apparatus that sets background speakers for a true speaker and normalizes the true speaker's verification score for an input speech by the background speakers' verification score for that same input speech.

【０００２】[0002]

[Prior Art] In a speaker verification apparatus, acceptance or rejection is decided by comparing the verification score of the true speaker's dictionary (acoustic model) for the input speech with a preset threshold. When the acoustic model is an HMM (hidden Markov model), the verification score is the likelihood of the HMM for the input speech. Consequently, fluctuations in the verification score (likelihood) caused by differences in utterance content or by the voice input channel greatly degrade the verification performance of the apparatus. To reduce these fluctuations, several methods have been studied that set background speakers for each true speaker and normalize the true speaker's likelihood for the input speech by dividing it by the background speakers' likelihood for the same input speech.
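The normalization described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes both scores are log-likelihoods, so that dividing likelihoods becomes subtracting logs, and it assumes that a larger normalized score favors acceptance. The function names are hypothetical.

```python
def normalized_score(log_l_true: float, log_l_background: float) -> float:
    """Return log(L_true / L_background) for scores given as log-likelihoods."""
    # Dividing likelihoods is the same as subtracting their logarithms.
    return log_l_true - log_l_background


def decide(log_l_true: float, log_l_background: float, threshold: float) -> bool:
    """Accept the identity claim when the normalized score reaches the threshold.

    The comparison direction (>=) is an assumption; the text only says the
    score is compared with a preset threshold.
    """
    return normalized_score(log_l_true, log_l_background) >= threshold
```

Because a change in utterance content or input channel tends to shift both scores together, the difference is expected to fluctuate less than the raw score of the true speaker.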

[0003] Examples of such methods include the method of Higgins et al., which uses the speaker with the highest likelihood as the background speaker; the method of Rosenberg et al., which uses a cohort speaker set as the background speakers; the normalization method of Liu et al., based on the mean and variance of the likelihoods of many speakers; and the normalization method of Matsui et al., which uses the likelihood of a multi-speaker model based on posterior probability. All of these methods are concerned with representing the background speakers accurately and efficiently.

【０００４】[0004]

[Problems to Be Solved by the Invention] In each of the methods described above (that is, methods that normalize the true speaker's score by the likelihood ratio between the true speaker and the background speakers), the acoustic model of the background speakers should ideally represent the acoustic space of all speakers other than the true speaker. However, obtaining acoustic models for this ideal set of all speakers is difficult. Even in an actual service, deriving the background acoustic model from a large number of registered speakers is inefficient because of the heavy computational cost. The method of Rosenberg et al. (that is, the method that uses a cohort speaker set as the background speakers) is described in detail below.

[0005] In designing a cohort speaker set as the background speakers, it is important to approximate the entire speaker space accurately and efficiently with little computation. Because likelihoods are expressed as logarithms, selecting background speakers that are as close as possible to the true speaker is expected to improve the resolution of the verification decision criterion (that is, the likelihood ratio).

[0006] Conventionally, in this method, the top N registered speakers other than the true speaker whose phoneme dictionaries (acoustic models) are closest to the true speaker's in the acoustic space are selected as the background speakers, as shown in FIG. 1 (N = 2 in FIG. 1), and the mean of those N speakers' likelihoods for the input speech is used as the background speaker likelihood. Alternatively, in some cases only the single other speaker whose likelihood value is closest to the true speaker's is selected. The reason is that a speaker's acoustic model gives accurate likelihoods for input speech that lies near that model in the acoustic space, but gives likelihoods with large errors for input speech that lies far from it. In other words, the normalized likelihood must be accurate for the true speaker and for impostors who fall near the true speaker in the acoustic space, but it need not be as accurate for impostors who fall far from the true speaker.
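The conventional speaker-level selection can be illustrated with a short sketch. It assumes that a distance from each registered speaker's model to the true speaker's model has already been computed by some measure (the text does not say which), and that the per-speaker scores for the input speech are available. All names are hypothetical.

```python
from typing import Dict, List


def select_cohort(distances: Dict[str, float], n: int) -> List[str]:
    """Return the n registered speakers nearest to the true speaker.

    `distances` maps each other registered speaker to its (assumed,
    precomputed) model distance from the true speaker.
    """
    return sorted(distances, key=distances.get)[:n]


def background_score(scores: Dict[str, float], cohort: List[str]) -> float:
    """Mean of the cohort speakers' scores for the input speech."""
    return sum(scores[s] for s in cohort) / len(cohort)
```

With N = 2, as in FIG. 1, only the two nearest speakers contribute to the background score, regardless of how the remaining speakers score.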

[0007] Strictly speaking, however, the neighbor speakers differ from phoneme to phoneme (that is, the distance between speakers depends on the phoneme). Neighbor speakers selected on a per-speaker basis as described above are therefore not close to the true speaker for every phoneme. They contain errors, and so do not satisfy the background speaker condition of being close to the true speaker in the acoustic space. This was regarded as a factor that lowers the accuracy of likelihood normalization and degrades verification performance. That is, because the background speaker likelihood could be controlled only in units of speakers, the resolution of the decision criterion in speaker verification was considered not necessarily high.

[0008] Accordingly, an object of the present invention is to improve speaker verification performance.

【０００９】[0009]

[Means for Solving the Problems] A speaker verification apparatus according to a first aspect of the present invention sets background speakers for a true speaker and normalizes the true speaker's verification score for an input speech by the background speakers' verification score for that input speech. The apparatus comprises: means for selecting, from among the acoustic models of the speakers, speakers whose distance in the acoustic space from a registered speaker's acoustic model is relatively small with respect to a local feature; means for holding the selected speakers as neighbor speakers of the registered speaker; and means for synthesizing an acoustic model of a background speaker by combining the local features of the neighbor speakers' acoustic models in correspondence with the respective local features of the true speaker's acoustic model.

[0010] With the above configuration, speaker verification performance can be improved.

[0011] In a preferred embodiment according to the first aspect of the present invention, each phoneme of the acoustic model is a hidden Markov model (HMM), and the verification score is the likelihood of the HMM for the input speech. The local feature is a phoneme of each speaker's acoustic model. In a modification of this embodiment, the local feature is each state of each phoneme HMM. In another modification, the local feature is each distribution of each state of each phoneme HMM.

[0012] In the above embodiment, the selecting means selects, from among the acoustic models of the speakers, speakers whose distance in the acoustic space from the registered speaker's acoustic model is relatively small with respect to a phoneme. In the modification of the embodiment, the selecting means selects speakers whose distance in the acoustic space from the registered speaker's acoustic model is relatively small with respect to each state of each phoneme HMM. In the other modification, the selecting means selects speakers whose distance in the acoustic space from the registered speaker's acoustic model is relatively small with respect to each distribution of each state of each phoneme HMM.

[0013] In the above embodiment, the synthesizing means synthesizes the acoustic model of a background speaker by combining the phonemes of the neighbor speakers' acoustic models in correspondence with the respective phonemes of the true speaker's acoustic model. In the modification of the embodiment, the synthesizing means synthesizes the acoustic model of a background speaker by combining the states of the neighbor speakers' phoneme HMMs in correspondence with the respective states of the true speaker's phoneme HMMs. In the other modification, the synthesizing means synthesizes the acoustic model of a background speaker by combining the distributions of each state of each phoneme HMM in correspondence with the respective distributions of each state of the true speaker's phoneme HMMs.

[0014] A speaker verification method according to a second aspect of the present invention sets background speakers for a true speaker and normalizes the true speaker's likelihood for an input speech by the background speakers' likelihood for that input speech. The method comprises: a first step of selecting, from among the acoustic models of the speakers, speakers whose distance in the acoustic space from a registered speaker's acoustic model is relatively small with respect to a local feature; a second step of holding the selected speakers as neighbor speakers of the registered speaker; and a third step of synthesizing an acoustic model of a background speaker by combining the local features of the neighbor speakers' acoustic models in correspondence with the respective local features of the true speaker's acoustic model.

[0015] A program medium according to a third aspect of the present invention carries, in computer-readable form, a computer program for causing a computer to operate as each means of a speaker verification apparatus. The apparatus sets background speakers for a true speaker and normalizes the true speaker's verification score for an input speech by the background speakers' verification score for that input speech, and comprises: means for selecting, from among the acoustic models of the speakers, speakers whose distance in the acoustic space from a registered speaker's acoustic model is relatively small with respect to a local feature; means for holding the selected speakers as neighbor speakers of the registered speaker; and means for synthesizing an acoustic model of a background speaker by combining the local features of the neighbor speakers' acoustic models in correspondence with the respective local features of the true speaker's acoustic model.

【００１６】[0016]

[Embodiments of the Invention] Embodiments of the present invention are described in detail below with reference to the drawings.

[0017] FIG. 2 is a functional block diagram of a speaker verification apparatus according to one embodiment of the present invention.

[0018] The apparatus has phoneme-specific neighbor speaker tables and is configured so that either a voice registration mode or a voice authentication mode can be selectively set. As shown in the figure, it comprises a voice input unit (input unit) 11, a personal dictionary creation unit (creation unit) 13, a personal authentication unit (authentication unit) 15, and a phoneme-specific neighbor speaker selection unit (selection unit) 17. In addition to these units, the apparatus comprises a background speaker dictionary synthesis unit (synthesis unit) 19, a phoneme-specific neighbor speaker table storage unit (storage unit) 21, and a personal dictionary storage unit (storage unit) 23.

[0019] In the voice registration mode, the creation unit 13 creates a phoneme dictionary (acoustic model) for a new user (speaker) based on the voice information read in through the input unit 11, and stores it in the storage unit 23. For example, as shown in the figure, when the new user is speaker 1, the creation unit 13 creates a phoneme dictionary of speaker 1 consisting of phonemes 1 to P. Likewise, when the new user is speaker 2, it creates a phoneme dictionary of speaker 2 consisting of phonemes 1 to P, and when the new user is speaker S, it creates a phoneme dictionary of speaker S consisting of phonemes 1 to P. The creation unit 13 stores each of the phoneme dictionaries of speakers 1 to S in the storage unit 23.

[0020] The above processing by the creation unit 13 generalizes to setting a cohort speaker set in units of speakers. That is, when the cohort speaker set is defined in units of speakers, the likelihood ratio L(i) for speaker i is expressed by equation (1) below.

【００２１】[0021]

[Equation 1]

In equation (1), o denotes the observation vector, λ(i) the phoneme HMM set of speaker i, K the cohort size, c_k(i) the k-th cohort speaker of speaker i, and P the number of phonemes. The cohort speakers c_k(i) (k = 1, 2, ..., K) are the K registered speakers closest to speaker i in terms of inter-speaker distance.
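The equation itself is not reproduced in this text. One reading consistent with the symbol definitions above, and with the later reference to a second term on the right-hand side, is a log-likelihood ratio whose cohort term is an average over the K cohort speakers. This is an assumed reconstruction, not a confirmed transcription; in particular, how the phoneme count P enters the original equation cannot be recovered from the text, so it is omitted here.

```latex
% Assumed form of equation (1); not confirmed by the source text.
\[
\log L(i) \;=\; \log p\bigl(o \mid \lambda(i)\bigr)
\;-\; \frac{1}{K} \sum_{k=1}^{K} \log p\bigl(o \mid \lambda(c_k(i))\bigr)
\]
```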

[0022] In the voice registration mode, in parallel with the creation of the phoneme dictionary by the creation unit 13 (the setting of a cohort speaker set in units of speakers), the selection unit 17 selects, for each phoneme of the newly created phoneme dictionary, the speakers whose corresponding phoneme is close to it in the acoustic space, from the phoneme dictionaries of the registered speakers already stored in the storage unit 23. It then stores this speaker information, sorted nearest first, in the storage unit 21 as a phoneme-specific neighbor speaker table. For example, when the new user is speaker 1 and the speakers closest to phoneme 1 of speaker 1 are, nearest first, speaker 9, speaker 7, ..., speaker 11, the selection unit 17 stores speaker 9, speaker 7, ..., speaker 11 in the phoneme-specific neighbor speaker table for phoneme 1 of speaker 1, as shown in the figure. If the speakers closest to phoneme P of speaker 1 are, nearest first, speaker 8, speaker 21, ..., speaker 14, the selection unit 17 stores speaker 8, speaker 21, ..., speaker 14 in the table for phoneme P of speaker 1. Similarly, when the new user is speaker S and the speakers closest to phoneme 1 of speaker S are, nearest first, speaker 2, speaker 30, ..., speaker 19, the selection unit 17 stores speaker 2, speaker 30, ..., speaker 19 in the table for phoneme 1 of speaker S. If the speakers closest to phoneme P of speaker S are, nearest first, speaker 24, speaker 13, ..., speaker 18, the selection unit 17 stores speaker 24, speaker 13, ..., speaker 18 in the table for phoneme P of speaker S.
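Building the phoneme-specific neighbor speaker table can be sketched as below. The sketch makes two assumptions that the text does not state: each phoneme model is reduced to a single mean vector, and the distance between phoneme models is Euclidean. The function and variable names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
PhonemeDict = Dict[str, Vector]  # phoneme label -> model (here: a mean vector)


def euclidean(a: Vector, b: Vector) -> float:
    """Stand-in phoneme model distance (an assumption, see lead-in)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def build_neighbor_table(
    new_dict: PhonemeDict,
    registered: Dict[str, PhonemeDict],
    distance: Callable[[Vector, Vector], float] = euclidean,
) -> Dict[str, List[str]]:
    """For each phoneme of the new speaker, rank registered speakers nearest first."""
    table: Dict[str, List[str]] = {}
    for phoneme, model in new_dict.items():
        ranked = sorted(
            (distance(model, other[phoneme]), speaker)
            for speaker, other in registered.items()
            if phoneme in other
        )
        table[phoneme] = [speaker for _, speaker in ranked]
    return table
```

Note that the ranking is computed independently per phoneme, so the nearest speaker for one phoneme need not be the nearest for another, which is the point of the phoneme-level table.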

[0023] The above processing by the selection unit 17 generalizes to setting a cohort speaker set in units of phonemes. That is, when the cohort speaker set is defined in units of phonemes, the second term on the right-hand side of equation (1) is expressed by equation (2) below.

【００２４】[0024]

[Equation 2]

In equation (2), #c_k(i) denotes the k-th virtual cohort speaker of speaker i, and λ(#c_k(i)) denotes the phoneme HMM set of virtual cohort speaker #c_k(i).

[0025] Next, the phoneme set of speaker i is expressed by equation (3) below.

【００２６】[0026]

[Equation 3]

From equation (3), the phoneme HMM set of a virtual cohort speaker is then expressed by equation (4) below.

【００２７】[0027]

[Equation 4]

In equation (4), c_k(i,p) denotes the k-th cohort speaker for phoneme p of speaker i. Here c_k(i,p) is a cohort speaker in units of phonemes: the K speakers closest to speaker i are selected according to the phoneme model distance between phoneme p of speaker i and phoneme p of each other registered speaker.
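Equations (2) to (4) are not reproduced in this text. A hedged reading that matches the definitions given is shown below: the cohort term of equation (1) becomes an average over K virtual cohort speakers, and the phoneme HMM set of each virtual cohort speaker is assembled phoneme by phoneme from the k-th nearest speaker for that phoneme. The notation λ_p(·) for the HMM of phoneme p is an assumption introduced here, as is the use of log-likelihoods.

```latex
% Assumed forms; the original equations are not available in the text.
\begin{align*}
% (2) cohort term of equation (1), using virtual cohort speakers
&\frac{1}{K} \sum_{k=1}^{K} \log p\bigl(o \mid \lambda(\#c_k(i))\bigr) \tag{2} \\
% (3) phoneme HMM set of speaker i
&\lambda(i) = \bigl\{\, \lambda_p(i) \;\big|\; p = 1, \dots, P \,\bigr\} \tag{3} \\
% (4) phoneme HMM set of the k-th virtual cohort speaker
&\lambda(\#c_k(i)) = \bigl\{\, \lambda_p\bigl(c_k(i,p)\bigr) \;\big|\; p = 1, \dots, P \,\bigr\} \tag{4}
\end{align*}
```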

[0028] In the authentication mode, the authentication unit 15 calculates the likelihood Ls of the true speaker's phoneme dictionary (acoustic model) for the voice information read in through the input unit 11.

[0029] In parallel with the authentication unit 15 calculating the likelihood Ls as described above, the synthesis unit 19 synthesizes background speaker phoneme dictionaries based on the information in the storage unit 21. It does so by combining, for each phoneme (1 to P) of the phoneme dictionary of the true speaker (one of speakers 1 to S), the corresponding phonemes of that phoneme's neighbor speakers. It then calculates the mean of the background speaker likelihood Lb for the input voice information. After that, the normalized likelihood Ln = Ls / Lb is calculated and compared with a preset threshold to decide acceptance or rejection.
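The synthesis and decision steps can be sketched as follows. This is an illustration under stated assumptions rather than the patent's implementation: the k-th background dictionary takes, for every phoneme, the model of the k-th nearest speaker for that phoneme; scores are treated as log-likelihoods, so Ln = Ls / Lb is computed as a difference; and scores at or above the threshold are accepted. All names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
PhonemeDict = Dict[str, Vector]


def synthesize_background(
    neighbor_table: Dict[str, List[str]],
    registered: Dict[str, PhonemeDict],
    k: int,
) -> List[PhonemeDict]:
    """Build k virtual background dictionaries from a per-phoneme neighbor table."""
    return [
        {phoneme: registered[ranked[rank]][phoneme]
         for phoneme, ranked in neighbor_table.items()}
        for rank in range(k)
    ]


def verify(ls: float, background_scores: Sequence[float],
           threshold: float) -> Tuple[float, bool]:
    """Return (Ln, accepted) from log-domain scores."""
    lb = sum(background_scores) / len(background_scores)  # mean background score
    ln = ls - lb  # log-domain counterpart of Ls / Lb (assumption)
    return ln, ln >= threshold
```

Scoring the input speech against each synthesized dictionary is left out, since it depends on the HMM decoder in use.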

[0030] As described above, according to one embodiment of the present invention, in the voice registration mode the selection unit 17 selects, for each phoneme of the newly created phoneme dictionary, the speakers whose corresponding phoneme is close to it in the acoustic space, from the phoneme dictionaries of the registered speakers already stored in the storage unit 23. It then stores this speaker information, sorted nearest first, in the storage unit 21 as a phoneme-specific neighbor speaker table. This makes it possible to synthesize background speaker phoneme dictionaries that are closer in the acoustic space to the true speaker's phoneme dictionary (acoustic model). Using the likelihood of the synthesized background speakers to normalize the true speaker's likelihood allows the normalization to be performed with high accuracy. As a result, a speaker verification apparatus with high authentication accuracy can be realized.

[0031] FIG. 3 is a functional block diagram of a speaker verification apparatus according to a first modification of the embodiment of the present invention.

[0032] The apparatus has state-specific neighbor speaker tables and is configured so that either a voice registration mode or a voice authentication mode can be selectively set. It differs from the speaker verification apparatus shown in FIG. 2 in that a state-specific neighbor speaker selection unit (selection unit) 25 replaces the phoneme-specific neighbor speaker selection unit 17, and a state-specific neighbor speaker table storage unit (storage unit) 27 replaces the phoneme-specific neighbor speaker table storage unit 21. The other units carry the same reference numerals as in FIG. 2. Note that the processing of the creation unit 13 and the synthesis unit 19 in this apparatus also differs slightly from that of the creation unit 13 and the synthesis unit 19 in FIG. 2.

[0033] For example, as shown in the figure, when the new user is speaker 1, the creation unit 13 creates a phoneme dictionary of speaker 1 consisting of phoneme 1 state 1, ..., phoneme 1 state J, ..., phoneme P state 1, ..., phoneme P state J. Likewise, when the new user is speaker S, the creation unit 13 creates a phoneme dictionary of speaker S consisting of phoneme 1 state 1, ..., phoneme 1 state J, ..., phoneme P state 1, ..., phoneme P state J. The creation unit 13 stores each of the phoneme dictionaries of speakers 1 to S in the storage unit 23.

[0034] In the voice registration mode, in parallel with the creation of the phoneme dictionary by the creation unit 13, the selection unit 25 selects, for each state of each phoneme of the newly created phoneme dictionary, the speakers whose corresponding state is close to it in the acoustic space, from the phoneme dictionaries of the registered speakers already stored in the storage unit 23. It then stores this speaker information, sorted nearest first, in the storage unit 27 as a state-specific neighbor speaker table. For example, when the new user is speaker 1 and the speakers closest to state 1 of phoneme 1 of speaker 1 are, nearest first, speaker 9, speaker 7, ..., speaker 11, the selection unit 25 stores speaker 9, speaker 7, ..., speaker 11 in the neighbor speaker table for state 1 of phoneme 1 of speaker 1, as shown in the figure. If the speakers closest to state J of phoneme 1 of speaker 1 are, nearest first, speaker 20, speaker 4, ..., speaker 15, the selection unit 25 stores speaker 20, speaker 4, ..., speaker 15 in the table for state J of phoneme 1 of speaker 1. If the speakers closest to state 1 of phoneme P of speaker 1 are, nearest first, speaker 14, speaker 41, ..., speaker 12, the selection unit 25 stores speaker 14, speaker 41, ..., speaker 12 in the table for state 1 of phoneme P of speaker 1. If the speakers closest to state J of phoneme P of speaker 1 are, nearest first, speaker 17, speaker 21, ..., speaker 32, the selection unit 25 stores speaker 17, speaker 21, ..., speaker 32 in the table for state J of phoneme P of speaker 1. Similarly, when the new user is speaker S and the speakers closest to state 1 of phoneme P of speaker S are, nearest first, speaker 8, speaker 11, ..., speaker 36, the selection unit 25 stores speaker 8, speaker 11, ..., speaker 36 in the table for state 1 of phoneme P of speaker S. If the speakers closest to state J of phoneme P of speaker S are, nearest first, speaker 18, speaker 3, ..., speaker 16, the selection unit 25 stores speaker 18, speaker 3, ..., speaker 16 in the table for state J of phoneme P of speaker S.

[0035] The above processing by the selection unit 25 generalizes to setting a cohort speaker set in units of states. Let each phoneme HMM be a left-to-right continuous-mixture HMM with S states and M mixture components. The HMM λp of phoneme p then holds, for a given state s, the self-loop transition probability a_{p,s,s}, the transition probability a_{p,s,s+1} to the next state, and the distribution weights w_{p,s,m} (m = 1, 2, ..., M) as parameters. In a state-level cohort speaker set, the neighboring speakers of speaker i are selected separately for the parameters held in each state s, so the HMM set of the virtual cohort speaker is expressed by equation (5) below. Here c_k(i,p,s) is a state-level cohort speaker: the top K neighboring speakers are selected according to the inter-state distance between state s of phoneme p of speaker i and the same state of each registered speaker.
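The state-level selection described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the text does not define the inter-state distance here, so Euclidean distance between one hypothetical feature vector per state (for example a mean vector) is assumed, and the names `top_k_neighbors` and `build_state_table` are invented for this sketch.

```python
import math


def top_k_neighbors(target, registered, k):
    """Return the IDs of the k registered speakers nearest to `target`.

    target:     feature vector for one state of one phoneme of the new speaker
    registered: dict mapping speaker ID -> feature vector of the same state
    """
    # Assumption: Euclidean distance stands in for the inter-state distance.
    ranked = sorted(registered, key=lambda spk: math.dist(target, registered[spk]))
    return ranked[:k]


def build_state_table(new_model, registered_models, k):
    """Build {(phoneme, state): [speaker IDs, nearest first]} for a new speaker.

    new_model:         {(phoneme, state): feature vector}
    registered_models: {speaker ID: {(phoneme, state): feature vector}}
    """
    table = {}
    for key, vec in new_model.items():
        # Compare against the same (phoneme, state) of every registered speaker.
        candidates = {spk: model[key] for spk, model in registered_models.items()}
        table[key] = top_k_neighbors(vec, candidates, k)
    return table
```

Because each (phoneme, state) key is ranked independently, different states of the same phoneme can end up with different neighboring speakers, which is the point of the state-level cohort.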

[0036]

(Equation 5)

In the authentication mode of the device, the authentication unit 15 calculates the likelihood Ls of the true speaker's phoneme dictionary (acoustic model) with respect to the speech information read through the input unit 11.

[0037] In parallel with the authentication unit 15 calculating the likelihood Ls, the synthesis unit 19 synthesizes the phoneme dictionary of the background speakers based on the information in the storage unit 27, by combining, for each state (1 to J) of each phoneme (1 to P) in the phoneme dictionary of the true speaker (1 to S), the parameters of the corresponding state (1 to J) of the neighboring speakers. It then calculates the mean of the background speakers' likelihoods Lb for the input speech information. After that, as in the device shown in FIG. 2, it calculates the normalized likelihood Ln = Ls / Lb, compares Ln with a preset threshold, and decides acceptance or rejection.
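The normalization and decision step can be sketched as below. This is only an illustration under stated assumptions: the text says Ln is compared with a preset threshold but not which direction counts as acceptance, so accepting when Ln is at or above the threshold is assumed here, and the function names are hypothetical.

```python
def normalized_likelihood(ls, background_likelihoods):
    """Return Ln = Ls / Lb, where Lb is the mean background-speaker likelihood."""
    lb = sum(background_likelihoods) / len(background_likelihoods)
    return ls / lb


def accept(ls, background_likelihoods, threshold):
    """Decide acceptance from the normalized likelihood.

    Assumption: the claimed identity is accepted when Ln >= threshold.
    """
    return normalized_likelihood(ls, background_likelihoods) >= threshold
```

For instance, with Ls = 1.0 and background likelihoods 0.25 and 0.75, Lb is 0.5 and Ln is 2.0, so a threshold of 1.5 would lead to acceptance under the assumed rule.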

[0038] FIG. 4 is an explanatory diagram showing the configuration of an HMM with J states and M mixture components according to the first modification of the embodiment of the present invention.

[0039] In FIG. 4, the HMM parameters consist of the transition probabilities A, the distribution weights W, and the distributions N. Of these, the parameters belonging to state j are A_{j,j}, A_{j,j+1}, W_{j,1} ... W_{j,M}, and N_{j,1} ... N_{j,M}.

[0040] FIG. 5 is a functional block diagram of a speaker verification device according to the second modification of the embodiment of the present invention.

[0041] This device has a distribution-specific neighboring speaker table and is configured so that either the voice registration mode or the voice authentication mode can be selectively set. It differs from the speaker verification device shown in FIG. 2 in two respects: a distribution-specific neighboring speaker selection unit (selection unit) 29 replaces the phoneme-specific neighboring speaker selection unit 17 shown in FIG. 2, and a distribution-specific neighboring speaker table storage unit (storage unit) 31 replaces the phoneme-specific neighboring speaker table storage unit 21. The other units carry the same reference numerals as in FIG. 2. Note that the processing of the creation unit 13 and the synthesis unit 19 shown in the figure also differs somewhat from that of the creation unit 13 and the synthesis unit 19 shown in FIG. 2.

[0042] For example, as shown in the figure, when the new user is speaker 1, the creation unit 13 creates a phoneme dictionary of speaker 1 consisting of phoneme 1 state 1 distribution 1, ..., phoneme 1 state 1 distribution M, ..., phoneme P state J distribution 1, ..., phoneme P state J distribution M. Likewise, when the new user is speaker S, the creation unit 13 creates a phoneme dictionary of speaker S consisting of phoneme 1 state 1 distribution 1, ..., phoneme 1 state 1 distribution M, ..., phoneme P state J distribution 1, ..., phoneme P state J distribution M. The creation unit 13 then stores the phoneme dictionaries of speakers 1 to S in the storage unit 23.

[0043] In the voice registration mode of the device, in parallel with the creation of the phoneme dictionary by the creation unit 13, the selection unit 29 selects, for each distribution of each state of each phoneme of the newly created phoneme dictionary, speaker information that is close to that distribution in acoustic space from among the phoneme dictionaries of the registered speakers already stored in the storage unit 23. It then stores that speaker information, arranged in order of increasing distance, in the storage unit 31 as the distribution-specific neighboring speaker table. For example, when the new user is speaker 1, if the speakers close to distribution 1 of state 1 of phoneme 1 of speaker 1 are, in order from the nearest, speaker 9, speaker 7, ..., and speaker 11, the selection unit 29 stores speaker 9, speaker 7, ..., and speaker 11 in the distribution-specific neighboring speaker table for distribution 1 of state 1 of phoneme 1 of speaker 1, as shown in the figure. If the speakers close to distribution M of state 1 of phoneme 1 of speaker 1 are, in order from the nearest, speaker 20, speaker 4, ..., and speaker 15, the selection unit 29 stores speaker 20, speaker 4, ..., and speaker 15 in the distribution-specific neighboring speaker table for distribution M of state 1 of phoneme 1 of speaker 1. If the speakers close to distribution 1 of state J of phoneme P of speaker 1 are, in order from the nearest, speaker 14, speaker 41, ..., and speaker 12, the selection unit 29 stores speaker 14, speaker 41, ..., and speaker 12 in the distribution-specific neighboring speaker table for distribution 1 of state J of phoneme P of speaker 1. If the speakers close to distribution M of state J of phoneme P of speaker 1 are, in order from the nearest, speaker 17, speaker 21, ..., and speaker 32, the selection unit 29 stores speaker 17, speaker 21, ..., and speaker 32 in the distribution-specific neighboring speaker table for distribution M of state J of phoneme P of speaker 1. When the new user is speaker S, if the speakers close to distribution 1 of state J of phoneme P of speaker S are, in order from the nearest, speaker 8, speaker 11, ..., and speaker 36, the selection unit 29 stores speaker 8, speaker 11, ..., and speaker 36 in the distribution-specific neighboring speaker table for distribution 1 of state J of phoneme P of speaker S, as shown in the figure. If the speakers close to distribution M of state J of phoneme P of speaker S are, in order from the nearest, speaker 18, speaker 3, ..., and speaker 16, the selection unit 29 stores speaker 18, speaker 3, ..., and speaker 16 in the distribution-specific neighboring speaker table for distribution M of state J of phoneme P of speaker S, as shown in the figure.

[0044] The above processing by the selection unit 29 generalizes to setting a cohort speaker set in units of distributions. In a distribution-level cohort speaker set, the HMM set of the virtual cohort speaker is expressed by equation (6) below.

[0045]

(Equation 6)

Because the same state contains M cohort speaker sets, one per distribution, the transition probabilities a are calculated by summing the transition probabilities from the individual cohort speaker sets and renormalizing so that the self-loop probability and the probability of transition to the next state sum to 1. The transition probability a is expressed by equation (7) below.
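The renormalization of transition probabilities described here can be sketched as follows. This is one reading of the prose, not a transcription of equation (7): the function name and the input layout (one `(a_self, a_next)` pair per distribution's cohort speaker) are assumptions made for illustration.

```python
def renormalize_transitions(cohort_transitions):
    """Combine per-distribution cohort transition probabilities for one state.

    cohort_transitions: list of (a_self, a_next) pairs, one pair taken from the
    cohort speaker selected for each of the M distributions of the state.
    Returns (a_self, a_next) rescaled so that the two values sum to 1.
    """
    self_sum = sum(a_self for a_self, _ in cohort_transitions)
    next_sum = sum(a_next for _, a_next in cohort_transitions)
    total = self_sum + next_sum
    return self_sum / total, next_sum / total
```

With two distributions whose cohort speakers contribute (0.5, 0.5) and (0.75, 0.25), the sums are 1.25 and 0.75, giving a self-loop probability of 0.625 and a next-state probability of 0.375.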

[0046]

(Equation 7)

The distribution weights w are calculated in the same way, by renormalizing the distribution weights from the individual cohort speaker sets so that they sum to 1. The distribution weight w is expressed by equation (8) below.

[0047]

(Equation 8)

In equations (6), (7), and (8), c_k(i,p,s,m) is a distribution-level cohort speaker; more specifically, it is the k-th cohort speaker for distribution m of state s of phoneme p of speaker i, that is, a neighboring speaker selected for each distribution.
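The bodies of equations (7) and (8) are not reproduced in this text. Based only on the prose description (sum over the M cohort speaker sets of a state, then renormalize to 1), one plausible form is sketched below; the superscript notation for "parameter taken from cohort speaker c_k(i,p,s,m)" is an assumption of this sketch and may differ from the original equations.

```latex
% Hypothetical reconstruction from the prose, not the patent's own typesetting.
\[
a_{p,s,s}
  = \frac{\sum_{m=1}^{M} a_{p,s,s}^{\,c_k(i,p,s,m)}}
         {\sum_{m=1}^{M}\left(a_{p,s,s}^{\,c_k(i,p,s,m)} + a_{p,s,s+1}^{\,c_k(i,p,s,m)}\right)},
\qquad
a_{p,s,s+1}
  = \frac{\sum_{m=1}^{M} a_{p,s,s+1}^{\,c_k(i,p,s,m)}}
         {\sum_{m=1}^{M}\left(a_{p,s,s}^{\,c_k(i,p,s,m)} + a_{p,s,s+1}^{\,c_k(i,p,s,m)}\right)}
\]
\[
w_{p,s,m}
  = \frac{w_{p,s,m}^{\,c_k(i,p,s,m)}}
         {\sum_{m'=1}^{M} w_{p,s,m'}^{\,c_k(i,p,s,m')}}
\]
```

Under this form, a_{p,s,s} + a_{p,s,s+1} = 1 and the weights w_{p,s,1}, ..., w_{p,s,M} sum to 1, which matches the two normalization conditions stated in the text.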

[0048] In the authentication mode of the device, the authentication unit 15 calculates the likelihood Ls of the true speaker's phoneme dictionary (acoustic model) with respect to the speech information read through the input unit 11.

[0049] In parallel with the authentication unit 15 calculating the likelihood Ls, the synthesis unit 19 synthesizes the phoneme dictionary of the background speakers based on the information in the storage unit 31, by combining, for each distribution (1 to M) of each state (1 to J) of each phoneme (1 to P) in the phoneme dictionary of the true speaker (1 to S), the corresponding distribution (1 to M) of the neighboring speakers. It then calculates the mean of the background speakers' likelihoods Lb for the input speech information. In this case, the transition probabilities A of the true speaker's phoneme dictionary are used as-is for the transition probabilities A of the background speaker. After that, as in the device shown in FIG. 2, it calculates the normalized likelihood Ln = Ls / Lb, compares Ln with a preset threshold, and decides acceptance or rejection. The parameters belonging to distribution m of state j described above are W_{j,m} and N_{j,m} in FIG. 4.
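The distribution-level synthesis in this paragraph can be sketched as follows. The data layout and names are hypothetical: each model is a dict keyed by (phoneme, state) holding a transition pair and a list of (weight, distribution) entries, and the neighbor table is keyed by (phoneme, state, distribution index). As the paragraph states, the true speaker's transition probabilities are reused; rescaling the collected weights to sum to 1 follows the prose on distribution weights.

```python
def synthesize_background(true_model, neighbor_table, registered_models, k):
    """Build the model of the k-th virtual background speaker (k is 0-based).

    true_model:        {(phoneme, state): {"trans": (a_self, a_next),
                                           "mix": [(weight, dist), ...]}}
    neighbor_table:    {(phoneme, state, m): [speaker IDs, nearest first]}
    registered_models: {speaker ID: model with the same layout as true_model}
    """
    background = {}
    for (p, s), params in true_model.items():
        mix = []
        for m in range(len(params["mix"])):
            # Take distribution m from the k-th nearest speaker for that distribution.
            spk = neighbor_table[(p, s, m)][k]
            mix.append(registered_models[spk][(p, s)]["mix"][m])
        # Rescale the collected weights so that they sum to 1.
        total = sum(w for w, _ in mix)
        mix = [(w / total, d) for w, d in mix]
        # Transition probabilities are taken over unchanged from the true speaker.
        background[(p, s)] = {"trans": params["trans"], "mix": mix}
    return background
```

Calling this for k = 0, 1, ..., K-1 would give K virtual background speakers whose likelihoods can then be averaged to obtain Lb.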

[0050] Next, evaluation experiments on the devices shown in FIGS. 2, 3, and 5 and on the conventional device are described.

[0051] First, as the experimental conditions, FIG. 6 shows the specifications of the speech data used in the evaluation experiments on these devices.

[0052] The verification scheme was text-prompted. Set A was used for the closed experiment (that is, an experiment in which registered speakers other than the true speaker are used as impostors), and set B was used as the impostors for the open experiment (that is, an experiment in which speakers other than the registered speakers are used as impostors). The HMMs were context-independent phoneme HMMs with 3 states and 3 mixture components, and in speaker registration all HMM parameters were estimated by maximum-likelihood (ML) estimation.

[0053] Next, the experimental results and a discussion of them are described.

[0054] In the experiments described above, the performance of each method was compared using the EER (equal error rate) obtained with a threshold set a posteriori. FIG. 7 shows the data obtained from the closed experiment (closed data), and FIG. 8 shows the data obtained from the open experiment (open data). In FIG. 7, curve 33 shows the EER (in %) of the speaker-level cohort (that is, the conventional device), curve 35 shows the EER (in %) of the phoneme-level cohort (that is, the device shown in FIG. 2), and curve 37 shows the EER (in %) of the state-level cohort (that is, the device shown in FIG. 3). Curve 39 shows the EER (in %) of the distribution-level cohort (that is, the device shown in FIG. 5).
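For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below approximates it by sweeping the threshold over the observed scores; it is a generic illustration with hypothetical names and inputs, not the evaluation code used in the experiments, and it assumes that higher scores favor acceptance.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER by trying every observed score as the threshold.

    genuine:  normalized scores of true-speaker trials
    impostor: normalized scores of impostor trials
    A trial is accepted when its score is >= the threshold (assumption).
    """
    best_gap, eer = None, None
    for t in sorted(set(genuine) | set(impostor)):
        frr = sum(g < t for g in genuine) / len(genuine)     # true speakers rejected
        far = sum(i >= t for i in impostor) / len(impostor)  # impostors accepted
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

Because the threshold is chosen after seeing the scores, this corresponds to the a posteriori threshold setting mentioned in the text.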

[0055] Similarly, in FIG. 8, curve 41, which corresponds to curve 33 in FIG. 7, shows the EER (in %) of the speaker-level cohort; curve 43, which corresponds to curve 35, shows the EER (in %) of the phoneme-level cohort; and curve 45, which corresponds to curve 37, shows the EER (in %) of the state-level cohort. Curve 47, which corresponds to curve 39, shows the EER (in %) of the distribution-level cohort.

[0056] As is clear from FIGS. 7 and 8, in both the closed and open experiments the EER decreases as the size of the cohort speaker set increases and tends to saturate at a size of around 4.5. In both experiments, selecting and designing the cohort speaker set in finer units reduces the EER further. At size 5, the distribution-level method achieved high error reduction rates relative to the speaker-level method: 67% in the closed experiment and 35% in the open experiment.
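The error reduction rate quoted here is presumably the relative decrease in EER with respect to the speaker-level baseline; the text does not spell out the formula, so the sketch below is an assumption, and the function name is hypothetical.

```python
def error_reduction_rate(eer_baseline, eer_proposed):
    """Relative EER reduction of the proposed method over the baseline.

    Assumed definition: (baseline - proposed) / baseline.
    """
    return (eer_baseline - eer_proposed) / eer_baseline
```

Under this definition, a 67% reduction means the distribution-level EER is roughly one third of the speaker-level EER at the same cohort size, whatever the absolute values read from the figures are.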

[0057] Moreover, the verification performance of the distribution-level method at size 2 is roughly comparable to that of the speaker-level method at size 5. The devices according to the embodiments of the present invention thus achieved high-resolution likelihood normalization with a smaller cohort speaker set.

[0058] The above relates only to one embodiment of the present invention and its modifications; it goes without saying that the present invention is not limited to the above.

[0059]

[Effect of the Invention] As described above, according to the present invention, speaker verification performance can be improved.

[Brief description of the drawings]

FIG. 1 is an explanatory diagram showing how background speakers are set.

FIG. 2 is a functional block diagram of a speaker verification device according to an embodiment of the present invention.

FIG. 3 is a functional block diagram of a speaker verification device according to the first modification of the embodiment of the present invention.

FIG. 4 is an explanatory diagram showing the configuration of an HMM with J states and M mixture components according to the first modification of the embodiment of the present invention.

FIG. 5 is a functional block diagram of a speaker verification device according to the second modification of the embodiment of the present invention.

FIG. 6 is an explanatory diagram showing, as experimental conditions, the specifications of the speech data used in the evaluation experiments on the conventional speaker verification device and the speaker verification devices according to the present invention.

FIG. 7 is a diagram showing the data obtained from the closed experiment on the conventional speaker verification device and the speaker verification devices according to the present invention.

FIG. 8 is a diagram showing the data obtained from the open experiment on the conventional speaker verification device and the speaker verification devices according to the present invention.

[Explanation of symbols]

11 voice input unit (input unit)
13 personal dictionary creation unit (creation unit)
15 personal authentication unit (authentication unit)
17 phoneme-specific neighboring speaker selection unit (selection unit)
19 background speaker dictionary synthesis unit (synthesis unit)
21 phoneme-specific neighboring speaker table storage unit (storage unit)
23 personal dictionary storage unit (storage unit)
25 state-specific neighboring speaker selection unit (selection unit)
27 state-specific neighboring speaker table storage unit (storage unit)
29 distribution-specific neighboring speaker selection unit (selection unit)
31 distribution-specific neighboring speaker table storage unit (storage unit)

Claims


1. A speaker verification device that sets a background speaker for a true speaker and normalizes a verification score of the true speaker with respect to an input voice by a verification score of the background speaker with respect to the input voice, the device comprising: means for selecting, from among the acoustic models of the respective speakers, a speaker whose distance in acoustic space from the acoustic model of a registered speaker is relatively short with respect to a local feature; means for holding the selected speaker as a neighboring speaker of the registered speaker; and means for synthesizing an acoustic model of the background speaker by combining the local features of the acoustic model of the neighboring speaker in correspondence with the respective local features of the acoustic model of the true speaker.

2. The speaker verification device according to claim 1, wherein each phoneme of the acoustic model is represented by a hidden Markov model (HMM), and the verification score is the likelihood of the HMM with respect to the input voice.

3. The speaker verification device according to claim 1, wherein the local feature is a phoneme of an acoustic model of each speaker.

4. The speaker verification device according to claim 2, wherein the local feature is a state of each of the phoneme HMMs.

5. The speaker verification device according to claim 2, wherein the local feature is a distribution of each state of each of the phoneme HMMs.

6. The speaker verification device according to claim 1, wherein the selecting means selects, from among the acoustic models of the respective speakers, a speaker whose distance in acoustic space from the acoustic model of the registered speaker is relatively short with respect to each phoneme.

7. The speaker verification device according to claim 1 or 5, wherein the selecting means selects, from among the acoustic models of the respective speakers, a speaker whose distance in acoustic space from the acoustic model of the registered speaker is relatively short with respect to each state of each phoneme HMM.

8. The speaker verification device according to claim 1, wherein the selecting means selects, from among the acoustic models of the respective speakers, a speaker whose distance in acoustic space from the acoustic model of the registered speaker is relatively short with respect to each distribution of each state of each phoneme HMM.

9. The speaker verification device according to claim 3, wherein the synthesizing means synthesizes the acoustic model of the background speaker by combining the phonemes of the acoustic model of the neighboring speaker in correspondence with the respective phonemes of the acoustic model of the true speaker.

10. The speaker verification device according to claim 4, wherein the synthesizing means synthesizes the acoustic model of the background speaker by combining the states of the phoneme HMMs of the neighboring speaker in correspondence with the respective states of the phoneme HMMs of the true speaker.

11. The speaker verification device according to claim 5, wherein the synthesizing means synthesizes the acoustic model of the background speaker by combining the distributions of the states of the phoneme HMMs of the neighboring speaker in correspondence with the respective distributions of the states of the phoneme HMMs of the true speaker.

12. A speaker verification method that sets a background speaker for a true speaker and normalizes the likelihood of the true speaker with respect to an input voice by the likelihood of the background speaker with respect to the input voice, the method comprising: a first step of selecting, from among the acoustic models of the respective speakers, a speaker whose distance in acoustic space from the acoustic model of a registered speaker is relatively short with respect to a local feature; a second step of holding the selected speaker as a neighboring speaker of the registered speaker; and a third step of synthesizing an acoustic model of the background speaker by combining the local features of the acoustic model of the neighboring speaker in correspondence with the respective local features of the acoustic model of the true speaker.

13. A computer-readable program medium carrying a computer program for causing a computer to operate as each of the means of a speaker verification device that sets a background speaker for a true speaker and normalizes a verification score of the true speaker with respect to an input voice by a verification score of the background speaker with respect to the input voice, the device comprising: means for selecting, from among the acoustic models of the respective speakers, a speaker whose distance in acoustic space from the acoustic model of a registered speaker is relatively short with respect to a local feature; means for holding the selected speaker as a neighboring speaker of the registered speaker; and means for synthesizing an acoustic model of the background speaker by combining the local features of the acoustic model of the neighboring speaker in correspondence with the respective local features of the acoustic model of the true speaker.