JP2009116278A

JP2009116278A - Method and device for register and evaluation of speaker authentication

Info

Publication number: JP2009116278A
Application number: JP2007292384A
Authority: JP
Inventors: Jie Hao; ジー・ハオ; Luan Jian; ジアン・ルアン
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2007-11-09
Filing date: 2007-11-09
Publication date: 2009-05-28

Abstract

PROBLEM TO BE SOLVED: To provide a method and device for registration and evaluation of speaker authentication. SOLUTION: The method for registration of speaker authentication comprises the steps of: generating a plurality of acoustic feature vector sequences, one from each of a plurality of utterances of the same content spoken by a speaker; generating a reference template from the plurality of acoustic feature vector sequences; generating a pseudo-impostor feature vector sequence corresponding to each of the plurality of acoustic feature vector sequences, based on a codebook that includes a plurality of codes and the feature vectors corresponding to those codes; and selecting an optimal acoustic feature subset based on the plurality of acoustic feature vector sequences, the reference template, and the plurality of pseudo-impostor feature vector sequences. COPYRIGHT: (C)2009,JPO&INPIT

Description

The present invention relates to information processing technology, and in particular to speaker authentication technology.

Different speakers can be distinguished by the characteristics of their speech, which makes speaker authentication possible. K. Yu, J. Mason, and J. Oglesby, “Speaker recognition using hidden Markov models, dynamic time warping and vector quantisation” (Vision, Image and Signal Processing, IEE Proceedings, Vol. 142, Oct. 1995, pp. 313-18) introduces three speaker recognition engine technologies: HMM (hidden Markov models), DTW (dynamic time warping), and VQ (vector quantization).

A speaker authentication process usually includes an enrollment phase and a verification phase. In the enrollment phase, a speaker template for the speaker (user) is generated from utterances, spoken by the speaker, that contain a password. In the verification phase, the speaker template is used to decide whether a test utterance is the same password spoken by the same speaker. The quality of the speaker template is therefore very important to the authentication process as a whole.

In a DTW-based speaker verification system, many features are required for each input frame in order to obtain reliable performance. In general, these features are extracted in the same way for all speakers, and the individual characteristics of each speaker are ignored. Several schemes have been proposed that customize an optimal feature set for each speaker by choosing a suitable feature subset from the full acoustic feature set. Doing so can improve verification performance and also reduce the memory required for the template. However, finding an effective criterion for feature selection is difficult, especially when the available information is limited.

A known optimization method can be characterized by two components: a performance criterion and a search method. For the first component, the usual performance criteria require an impostor database. One example is the false accept rate, used as the performance criterion in B. Sabac, “Speaker recognition using discriminative features selection” (ICSLP-2002, pp. 2321-2324). In other words, the performance of different feature subsets must be tested with many client trials and impostor trials in order to find the best subset. However, in a speaker verification system in which the user can choose the password, impostor data is rarely available.

K. Yu, J. Mason, J. Oglesby, “Speaker recognition using hidden Markov models, dynamic time warping and vector quantisation” (Vision, Image and Signal Processing, IEE Proceedings, Vol. 142, Oct. 1995, pp. 313-18).
B. Sabac, “Speaker recognition using discriminative features selection” in ICSLP-2002, pp. 2321-2324.

To solve the problems of the prior art described above, the present invention provides a method and apparatus for enrollment of speaker authentication, a method and apparatus for verification of speaker authentication, and a speaker authentication system.

According to one aspect of the present invention, there is provided a method for enrollment of speaker authentication, comprising: generating a plurality of acoustic feature vector sequences, one from each of a plurality of utterances of the same content spoken by a speaker; generating a reference template from the plurality of acoustic feature vector sequences; generating a pseudo-impostor feature vector sequence corresponding to each of the plurality of acoustic feature vector sequences, based on a codebook that includes a plurality of codes and the feature vectors corresponding to the plurality of codes; and selecting an optimal acoustic feature subset based on the plurality of acoustic feature vector sequences, the reference template, and the plurality of pseudo-impostor feature vector sequences.

According to another aspect of the present invention, there is provided a method for verification of speaker authentication, comprising: generating a test acoustic feature vector sequence from a test utterance; optimizing the test acoustic feature vector sequence based on the optimal acoustic feature subset generated during enrollment, to obtain an optimal test acoustic feature vector sequence; and determining, based on a reference template and the optimal test acoustic feature vector sequence, whether the test utterance is the enrolled utterance spoken by the same speaker.

According to another aspect of the present invention, there is provided an apparatus for enrollment of speaker authentication, comprising: an acoustic feature extraction unit that generates an acoustic feature vector sequence from an utterance spoken by a speaker; a template generation unit that generates a reference template from a plurality of acoustic feature vector sequences corresponding to a plurality of utterances of the same content spoken by the speaker; a pseudo-impostor data generation unit that generates a pseudo-impostor feature vector sequence corresponding to each of the plurality of acoustic feature vector sequences, based on a codebook that includes a plurality of codes and the feature vectors corresponding to the plurality of codes; and an optimization unit that selects an optimal acoustic feature subset based on the plurality of acoustic feature vector sequences, the reference template, and the plurality of pseudo-impostor feature vector sequences.

According to another aspect of the present invention, there is provided an apparatus for verification of speaker authentication, comprising: a test acoustic feature extraction unit that generates a test acoustic feature vector sequence from a test utterance; a test optimization unit that optimizes the test acoustic feature vector sequence based on the optimal acoustic feature subset generated during enrollment, to obtain an optimal test acoustic feature vector sequence; and a determination unit that determines, based on a reference template and the optimal test acoustic feature vector sequence, whether the test utterance is the enrolled utterance spoken by the same speaker.

According to another aspect of the present invention, there is provided a speaker authentication system comprising the above apparatus for enrollment of speaker authentication and the above apparatus for verification of speaker authentication.

FIG. 1 is a flowchart showing a method for enrollment of speaker authentication according to an embodiment of the present invention.
FIG. 2 is a flowchart showing a method for enrollment of speaker authentication according to another embodiment of the present invention.
FIG. 3 is a flowchart showing a method for verification of speaker authentication according to an embodiment of the present invention.
FIG. 4 is a block diagram of an apparatus for enrollment of speaker authentication according to an embodiment of the present invention.
FIG. 5 is a block diagram of an apparatus for verification of speaker authentication according to an embodiment of the present invention.
FIG. 6 is a block diagram of a speaker authentication system according to an embodiment of the present invention.

The above features, advantages, and objects of the present invention will be better understood from the following description of embodiments, taken in conjunction with the drawings.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the drawings.

FIG. 1 is a flowchart of a method for enrollment of speaker authentication according to an embodiment of the present invention. As shown in FIG. 1, first, in step 101, a plurality of acoustic feature vector sequences are generated, one from each of a plurality of utterances of the same content spoken by a speaker. Each acoustic feature vector may include a plurality of acoustic features that represent the utterance, for example in the form of MFCC (mel-frequency cepstrum coefficients). The present invention is not limited to this, however: the acoustic features of an utterance may be represented in any other known or future manner, such as LPCC (linear predictive cepstrum coefficients), coefficients obtained from energy, fundamental frequency, or wavelet analysis, and information such as pitch and duration together with their first and second derivatives over time. All features considered suitable for speaker recognition can be merged to serve as the feature set. Then, using the feature selection method described below, an optimal feature subset is customized for the speaker by automatically selecting, during the enrollment process, a number of features from this feature set according to the characteristics of each speaker.

Next, in step 105, a reference template is generated from the plurality of acoustic feature vector sequences. For example, the first acoustic feature vector sequence is selected as the initial template. The second acoustic feature vector sequence is then time-aligned with it using the DTW method, and a new template is generated by averaging the corresponding feature vectors of the two sequences. Next, the third acoustic feature vector sequence is time-aligned with the new template, and this cycle is repeated until all acoustic feature vector sequences have been merged into a single independent template (this is called template merging). For details, see W. H. Abdulla, D. Chow, and G. Sin, “Cross-words reference template for DTW-based speech recognition systems” (IEEE TENCON 2003, pp. 1576-1579). The present invention places no particular limitation on how the reference template is generated.
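The template-merging loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of the cited paper: the function names `dtw_path` and `merge_templates`, the Euclidean local distance, and the unconstrained three-way step pattern are all choices made for the example.

```python
import math

def euclid(a, b):
    """Euclidean distance between two feature vectors (assumed local distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_path(ref, seq):
    """Return (total_cost, path); path is a list of (ref_index, seq_index) pairs."""
    n, m = len(ref), len(seq)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclid(ref[i - 1], seq[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end of both sequences to recover the alignment.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path

def merge_templates(sequences):
    """Merge enrollment sequences into one template by repeated DTW averaging."""
    template = [list(v) for v in sequences[0]]  # first sequence is the initial template
    for seq in sequences[1:]:
        _, path = dtw_path(template, seq)
        aligned = [[] for _ in template]
        for i, j in path:
            aligned[i].append(seq[j])
        merged = []
        for t_vec, frames in zip(template, aligned):
            # Mean of the frames aligned to this template frame, then
            # averaged with the template frame itself.
            mean = [sum(col) / len(frames) for col in zip(*frames)]
            merged.append([(a + b) / 2.0 for a, b in zip(t_vec, mean)])
        template = merged
    return template
```

In this sketch the template keeps the length of the first sequence, and each new sequence is weighted equally with the current template; other weightings are equally compatible with the text.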

Next, in step 110, a pseudo-impostor feature vector sequence corresponding to each of the plurality of acoustic feature vector sequences is generated based on a codebook. The codebook used in this embodiment is trained on the acoustic space of the whole application. For example, for a Chinese-language application environment the codebook needs to cover the acoustic space of Chinese speech, and for an English-language application environment it needs to cover the acoustic space of English speech. For some special application environments, the acoustic space covered by the codebook can of course be changed accordingly.

The codebook of this embodiment contains a number of codes and the feature vector corresponding to each code. The number of codes depends on the size of the acoustic space, the required compression ratio, and the required compression quality. The larger the acoustic space, the more codes are needed. For the same acoustic space, fewer codes give a higher compression ratio, and more codes give a higher-quality compressed template. According to a preferred embodiment of the present invention, 256 to 512 codes are suitable for the acoustic space of ordinary Chinese speech. The number of codes and the acoustic space covered by the codebook can of course be adjusted as needed.

Specifically, in this step a code is first assigned to each feature vector of the acoustic feature vector sequence, so that the acoustic feature vector sequence is converted into a corresponding code sequence. For example, the codebook feature vector closest to a given feature vector of the acoustic feature vector sequence can be found by computing a distance, such as the Euclidean distance, between that feature vector and each feature vector in the codebook. The code corresponding to the closest codebook feature vector is then assigned to that feature vector of the acoustic feature vector sequence.

Next, the code sequence is converted back into a feature vector sequence by replacing each code with the codebook feature vector corresponding to it. The resulting sequence serves as the pseudo-impostor feature vector sequence.
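The two conversions described above (vector to code, then code back to vector) amount to vector quantization of the enrollment sequence. A minimal sketch follows; the names `nearest_code` and `pseudo_impostor`, and the representation of the codebook as a dictionary from code to feature vector, are assumptions made for illustration.

```python
def nearest_code(vec, codebook):
    """Return the code whose codebook vector is closest to vec (squared Euclidean)."""
    return min(codebook,
               key=lambda c: sum((x - y) ** 2 for x, y in zip(vec, codebook[c])))

def pseudo_impostor(seq, codebook):
    """Quantize an enrollment sequence and map the codes back to codebook vectors."""
    codes = [nearest_code(v, codebook) for v in seq]   # vector -> code
    return [list(codebook[c]) for c in codes]          # code -> vector

# Hypothetical two-entry codebook, for illustration only.
codebook = {0: [0.0, 0.0], 1: [1.0, 1.0]}
enrolled = [[0.1, 0.2], [0.9, 0.8]]
impostor = pseudo_impostor(enrolled, codebook)
```

The resulting sequence has the same length and timing as the speaker's own utterance but carries the speaker-independent spectral content of the codebook, which is what lets it stand in for impostor data.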

Next, in step 115, an optimal acoustic feature subset is selected based on the plurality of acoustic feature vector sequences, the reference template, and the plurality of pseudo-impostor feature vector sequences. Specifically, each candidate acoustic feature subset is examined, and the subset that maximizes the ability of the reference template to discriminate against the pseudo-impostor feature vector sequences is selected as the optimal acoustic feature subset.

According to an embodiment of the present invention, for each candidate acoustic feature subset, the DTW distances d_e(i) between the reference template and the plurality of acoustic feature vector sequences (called intra-speaker distances) and the DTW distances d_p(i) between the reference template and the plurality of pseudo-impostor feature vector sequences (called inter-speaker distances) are calculated. The acoustic feature subset that minimizes the ratio of the intra-speaker distances d_e(i) to the inter-speaker distances d_p(i) is selected as the optimal acoustic feature subset.

According to another embodiment of the present invention, for each candidate acoustic feature subset, the DTW distances d_e(i) between the reference template and the plurality of acoustic feature vector sequences and the DTW distances d_p(i) between the reference template and the plurality of pseudo-impostor feature vector sequences are calculated. The acoustic feature subset that minimizes the ratio of the difference between d_e(i) and d_p(i) to the sum of d_e(i) and d_p(i) is selected as the optimal acoustic feature subset.
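The two selection criteria can be written compactly as below. This is a sketch only: the text specifies the ratios but not how the individual distances are combined across sequences, so summing over i, and the symbols S (a candidate subset), J_1, J_2, and N (the number of enrollment utterances), are assumptions introduced here for illustration.

```latex
D_e(S) = \sum_{i=1}^{N} d_e(i), \qquad D_p(S) = \sum_{i=1}^{N} d_p(i)
\quad \text{(distances computed using only the features in } S\text{)}

J_1(S) = \frac{D_e(S)}{D_p(S)}, \qquad
J_2(S) = \frac{D_e(S) - D_p(S)}{D_e(S) + D_p(S)}, \qquad
S^{*} = \arg\min_{S} J_k(S), \quad k \in \{1, 2\}
```

Under this particular aggregation, J_2 = (J_1 - 1)/(J_1 + 1), which is monotonically increasing in J_1 and bounded between -1 and 1, so both criteria would rank subsets identically. They can differ if the ratios are instead formed per utterance and then combined.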

Furthermore, according to an embodiment of the present invention, the examination of candidate acoustic feature subsets is carried out within a specific range. For example, the range may consist of the candidate acoustic feature subsets whose number of acoustic features is greater than a specified number.
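A restricted search of this kind, combined with the ratio criterion, might look like the following sketch. The exhaustive enumeration, the summation of distances across sequences, and all names (`dtw_distance`, `project`, `select_subset`, `min_size`) are assumptions for the example; the text does not prescribe a search strategy, and enumeration is only practical for small feature sets.

```python
import math
from itertools import combinations

def dtw_distance(a, b):
    """Accumulated DTW distance between two sequences (Euclidean local distance)."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        cur = [inf]
        for j, y in enumerate(b, 1):
            d = math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))
            cur.append(d + min(prev[j], cur[j - 1], prev[j - 1]))
        prev = cur
    return prev[-1]

def project(seq, subset):
    """Keep only the feature dimensions listed in subset."""
    return [[v[k] for k in subset] for v in seq]

def select_subset(template, enroll, impostor, n_features, min_size):
    """Return the subset (tuple of dimension indices) minimizing d_e / d_p.

    min_size is the smallest subset size examined (hypothetical parameter
    standing in for the 'specified number' of the text).
    """
    best, best_score = None, float("inf")
    for size in range(min_size, n_features + 1):
        for subset in combinations(range(n_features), size):
            t = project(template, subset)
            d_e = sum(dtw_distance(t, project(s, subset)) for s in enroll)
            d_p = sum(dtw_distance(t, project(s, subset)) for s in impostor)
            score = d_e / d_p if d_p > 0 else float("inf")
            if score < best_score:
                best, best_score = subset, score
    return best

# Toy data: dimension 0 separates the speaker from the impostor, dimension 1 does not.
template = [[0.0, 0.0], [1.0, 0.0]]
enroll = [[[0.0, 0.5], [1.0, 0.5]]]
impostor = [[[3.0, 0.0], [4.0, 0.0]]]
chosen = select_subset(template, enroll, impostor, n_features=2, min_size=1)
```

Raising `min_size` shrinks the search space and also guards against degenerate subsets that happen to score well on very few features.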

As can be seen from the above description, the method for enrollment of speaker authentication according to this embodiment can select an optimal acoustic feature subset even when no impostor database is available, so that the extracted features are more discriminative. With this technique, a text-dependent speaker verification system can achieve a significant improvement.

FIG. 2 is a flowchart of a method for enrollment of speaker authentication according to another embodiment of the present invention. This embodiment is described below with reference to the figure. Description of the parts that are the same as in the previous embodiment is omitted where appropriate.

As shown in FIG. 2, steps 101 to 115 of the method for enrollment of speaker authentication according to this embodiment are the same as in the embodiment shown in FIG. 1, and are not described again here.

Compared with the embodiment described above, this embodiment further includes step 220, in which the reference template is compressed. Specifically, this step may include reducing the dimensionality of the acoustic feature vectors in the reference template based on the optimal acoustic feature subset, or reducing the number of acoustic feature vectors in the reference template based on the codebook.

For the manner of reducing the number of acoustic feature vectors in the reference template based on the codebook, see Chinese Patent Application No. 200510115300.5, filed by the present applicant on November 11, 2005 (entitled “Apparatus and method for compressing and merging speaker templates, and speaker authentication”). A plurality of adjacent feature vectors in the reference template that are assigned the same code are replaced with a single feature vector. For example, the average vector of such a group of adjacent feature vectors with the same code is first calculated, and the calculated average vector then replaces that group.

If the reference template contains several such groups of adjacent feature vectors with the same code, they can be replaced one group at a time in the manner described above. Each group of feature vectors is thus replaced with one feature vector, the number of feature vectors in the reference template is reduced, and the template is thereby compressed.
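The run-merging compression described above can be sketched as follows. The sketch assumes that each template vector's code is obtained by nearest-neighbor lookup in the codebook; the names `nearest_code` and `compress_template` and the dictionary codebook are hypothetical and are not taken from the referenced application.

```python
from itertools import groupby

def nearest_code(vec, codebook):
    """Return the code whose codebook vector is closest to vec (squared Euclidean)."""
    return min(codebook,
               key=lambda c: sum((x - y) ** 2 for x, y in zip(vec, codebook[c])))

def compress_template(template, codebook):
    """Replace each run of adjacent vectors sharing a code with their average."""
    compressed = []
    for _, run in groupby(template, key=lambda v: nearest_code(v, codebook)):
        frames = list(run)
        compressed.append([sum(col) / len(frames) for col in zip(*frames)])
    return compressed

# Hypothetical one-dimensional codebook and template, for illustration only.
codebook = {0: [0.0], 1: [10.0]}
template = [[0.0], [2.0], [9.0], [1.0]]
small = compress_template(template, codebook)
```

Only adjacent vectors are merged, so the last vector here stays separate from the first two even though it shares their code; this preserves the temporal order that DTW matching relies on.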

As can be seen from the above description, the method for enrollment of speaker authentication according to this embodiment can not only select an optimal acoustic feature subset when no impostor database is available, but can also compress the reference template. The memory required for the template is therefore reduced, and so is the computational load. With this technique, a text-dependent speaker verification system can achieve a significant improvement.

FIG. 3 is a flowchart showing a method for verification of speaker authentication according to an embodiment of the present invention, under the same inventive concept. This embodiment is described below with reference to FIG. 3. Description of the parts that are the same as in the previous embodiments is omitted where appropriate.

As shown in FIG. 3, first, in step 301, a test acoustic feature vector sequence is generated from a test utterance. As in step 101 of FIG. 1 described above, each acoustic feature vector may include a plurality of acoustic features that represent the utterance, for example in the form of MFCC (mel-frequency cepstrum coefficients). The present invention is not limited to this, however: the acoustic features of an utterance may be represented in any other known or future manner, such as LPCC (linear predictive cepstrum coefficients), coefficients obtained from energy, fundamental frequency, or wavelet analysis, and information such as pitch and duration together with their first and second derivatives over time. All features considered suitable for speaker recognition can be merged to serve as the feature set.

Next, in step 305, the test acoustic feature vector sequence is optimized based on the optimal acoustic feature subset generated during enrollment, to obtain an optimal test acoustic feature vector sequence. The method of selecting the optimal acoustic feature subset has been described in the previous embodiments and is not repeated here.

Next, in step 310, it is determined, based on the reference template and the optimal test acoustic feature vector sequence, whether the test utterance is the enrolled utterance spoken by the same speaker. Specifically, for example, a DTW matching score between the reference template and the optimal test acoustic feature vector sequence is first calculated, and the DTW matching score is then compared with a threshold to determine whether the input utterance is the enrolled utterance spoken by the same speaker.
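Steps 305 and 310 can be sketched together as below. The sketch makes several assumptions that the text leaves open: the matching score is taken to be the accumulated DTW distance (so lower means a better match), the stored template is assumed to already contain only the selected feature dimensions, and the names `dtw_distance`, `project`, `verify`, and the threshold value are hypothetical.

```python
import math

def dtw_distance(a, b):
    """Accumulated DTW distance between two sequences (Euclidean local distance)."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        cur = [inf]
        for j, y in enumerate(b, 1):
            d = math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))
            cur.append(d + min(prev[j], cur[j - 1], prev[j - 1]))
        prev = cur
    return prev[-1]

def project(seq, subset):
    """Keep only the feature dimensions selected during enrollment."""
    return [[v[k] for k in subset] for v in seq]

def verify(test_seq, template, subset, threshold):
    """Accept the test utterance if its DTW distance to the template is small enough."""
    optimized = project(test_seq, subset)        # step 305: apply the stored subset
    score = dtw_distance(template, optimized)    # step 310: DTW matching score
    return score <= threshold

# Toy data: the template was enrolled on dimension 0 only.
template = [[0.0], [1.0]]
subset = (0,)
genuine = [[0.1, 9.0], [1.0, 9.0]]
other = [[5.0, 0.0], [6.0, 0.0]]
accepted = verify(genuine, template, subset, threshold=0.5)
rejected = verify(other, template, subset, threshold=0.5)
```

Because the unselected dimension is discarded before matching, the large values in dimension 1 of the genuine utterance have no effect on the decision.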

In the present invention, any known or future method of calculating the DTW matching score between the reference template and the optimal test acoustic feature vector sequence can be applied. Likewise, any known or future method of setting the decision threshold can be applied.

As can be seen from the above description, the method for verification of speaker authentication according to this embodiment can use the optimal acoustic feature subset selected in the enrollment phase, so that the extracted features are more discriminative and the system achieves a significant improvement.

FIG. 4 is a block diagram of an apparatus for enrollment of speaker authentication according to an embodiment of the present invention, under the same inventive concept. This embodiment is described below with reference to FIG. 4. Description of the parts that are the same as in the previous embodiments is omitted where appropriate.

As shown in FIG. 4, the apparatus 400 for enrollment of speaker authentication includes: an acoustic feature extraction unit 401 that generates an acoustic feature vector sequence from an utterance spoken by a speaker; a template generation unit 402 that generates a reference template from a plurality of acoustic feature vector sequences corresponding to a plurality of utterances of the same content spoken by the speaker; a pseudo-impostor data generation unit 403 that generates a pseudo-impostor feature vector sequence corresponding to each of the plurality of acoustic feature vector sequences, based on a codebook 704 that includes a plurality of codes and the feature vectors corresponding to those codes; and an optimization unit 405 that selects an optimal acoustic feature subset based on the plurality of acoustic feature vector sequences, the reference template, and the plurality of pseudo-impostor feature vector sequences.

The pseudo-impostor data generation unit 403 includes a vector-to-code conversion unit 4031, which assigns a code to each feature vector in the acoustic feature vector sequence so as to convert the acoustic feature vector sequence into a corresponding code sequence, and a code-to-vector conversion unit 4032, which converts the code sequence into a corresponding feature vector sequence, serving as the pseudo-impostor feature vector sequence, based on the codes and the feature vectors corresponding to those codes.

According to an embodiment of the present invention, the vector-to-code conversion unit 4031 is configured to search the codebook for the feature vector closest to a given feature vector of the acoustic feature vector sequence, and to assign to that feature vector the code corresponding to the closest feature vector in the codebook.

According to an embodiment of the present invention, the optimization unit 405 is configured to examine individual acoustic feature subset candidates and to select, as the optimal acoustic feature subset, the acoustic feature subset that maximizes the identification rate of the reference template with respect to the pseudo-impersonator feature vector sequences.

According to another embodiment of the present invention, the optimization unit 405 is configured to examine individual acoustic feature subset candidates, to calculate a plurality of DTW distances de(i) between the reference template and the plurality of acoustic feature vector sequences (called intra-speaker distances) and a plurality of DTW distances dp(i) between the reference template and the plurality of pseudo-impersonator feature vector sequences (called inter-speaker distances), and to select, as the optimal acoustic feature subset, the acoustic feature subset that minimizes the ratio of the DTW distances de(i) to the DTW distances dp(i).
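The ratio-based selection can be sketched as follows. Several details are assumptions made for illustration only, since this section does not fix them: a textbook DTW with Euclidean local cost, summation as the way the plural distances de(i) and dp(i) are combined, and an exhaustive search over all non-empty subsets. The function names are hypothetical.

```python
import math
from itertools import combinations

def project(sequence, subset):
    """Keep only the feature dimensions listed in `subset`."""
    return [[frame[d] for d in subset] for frame in sequence]

def dtw_distance(a, b):
    """Textbook DTW: Euclidean local cost, steps (i-1,j), (i,j-1), (i-1,j-1)."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            local = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]

def ratio_criterion(template, enrolled, pseudo, subset):
    """Sum of intra-speaker distances over sum of inter-speaker distances (assumed aggregation)."""
    t = project(template, subset)
    d_e = sum(dtw_distance(t, project(s, subset)) for s in enrolled)
    d_p = sum(dtw_distance(t, project(s, subset)) for s in pseudo)
    return d_e / d_p if d_p > 0 else float("inf")

def select_subset(template, enrolled, pseudo, n_features):
    """Return the candidate subset with the smallest ratio."""
    candidates = [c for k in range(1, n_features + 1)
                  for c in combinations(range(n_features), k)]
    return min(candidates, key=lambda s: ratio_criterion(template, enrolled, pseudo, s))

# Toy data: dimension 0 separates speaker from pseudo-impersonator, dimension 1 is noise.
template = [[0.0, 5.0], [1.0, 5.0], [2.0, 5.0]]
enrolled = [[[0.0, 3.0], [1.0, 7.0], [2.0, 4.0]]]
pseudo = [[[3.0, 5.0], [4.0, 5.0], [5.0, 5.0]]]
best = select_subset(template, enrolled, pseudo, n_features=2)  # (0,)
```

Exhaustive search grows as 2^n in the number of features, so a practical system would likely restrict the candidates or use a greedy search.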

According to another embodiment of the present invention, the optimization unit 405 is configured to examine individual acoustic feature subset candidates, to calculate a plurality of DTW distances de(i) between the reference template and the plurality of acoustic feature vector sequences and a plurality of DTW distances dp(i) between the reference template and the plurality of pseudo-impersonator feature vector sequences, and to select, as the optimal acoustic feature subset, the acoustic feature subset that minimizes the ratio of the difference between the DTW distances de(i) and the DTW distances dp(i) to the sum of the DTW distances de(i) and the DTW distances dp(i).
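The two distance-based criteria can be written compactly as below. The notation is an assumption introduced here: S denotes a candidate feature subset, the superscript S marks distances computed on that subset, and the plural distances are combined by summation, which this section does not specify.

```latex
J_{\mathrm{ratio}}(S) = \frac{\sum_i d_e^{S}(i)}{\sum_i d_p^{S}(i)},
\qquad
J_{\mathrm{norm}}(S) = \frac{\sum_i d_e^{S}(i) - \sum_i d_p^{S}(i)}
                            {\sum_i d_e^{S}(i) + \sum_i d_p^{S}(i)},
\qquad
S^{*} = \arg\min_{S} J(S)
```

Under this summation assumption, and with positive distances, J_norm = (J_ratio - 1) / (J_ratio + 1) increases monotonically with J_ratio, so both criteria would rank candidate subsets in the same order; J_norm is simply bounded between -1 and 1. With a different aggregation, such as averaging per-utterance ratios, the two criteria could differ.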

According to another embodiment of the present invention, the optimization unit 405 is configured to examine the individual acoustic feature subset candidates within a specific range. For example, the specific range includes the acoustic feature subset candidates whose number of acoustic features is greater than a specific number.
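Restricting the search to a specific range can be illustrated with a generator that only yields subsets above a minimum size. This is a sketch under the assumption that features are identified by index; `candidate_subsets` and `min_size` are hypothetical names.

```python
from itertools import combinations

def candidate_subsets(n_features, min_size):
    """Yield every feature-index subset containing more than `min_size` features."""
    for k in range(min_size + 1, n_features + 1):
        yield from combinations(range(n_features), k)

# With 4 features and a minimum of 2, only subsets of size 3 and 4 are examined.
subsets = list(candidate_subsets(4, 2))
```

Here 5 candidates are examined instead of the 15 non-empty subsets of four features, which reduces the search cost and avoids degenerate subsets with very few features.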

As shown in FIG. 4, the apparatus 400 for enrollment of speaker authentication further includes a compression unit 406 that compresses the dimension of the acoustic feature vectors in the reference template based on the optimal acoustic feature subset.

According to another embodiment of the present invention, the compression unit 406 is further configured to compress the number of acoustic feature vectors in the reference template based on the codebook.
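Both kinds of template compression can be sketched as follows. Dimension compression simply keeps the selected feature dimensions. How the codebook is used to reduce the number of vectors is not described in this section, so the frame-merging rule below (averaging runs of adjacent frames that map to the same code) is purely an assumption, as are the function names.

```python
import math
from itertools import groupby

def compress_dimensions(template, subset):
    """Keep only the dimensions in the optimal feature subset."""
    return [[frame[d] for d in subset] for frame in template]

def compress_frames(template, codebook):
    """ASSUMED rule: merge adjacent frames sharing the same nearest code into their mean."""
    def code_of(frame):
        return min(codebook, key=lambda c: math.dist(frame, codebook[c]))
    merged = []
    for _, run in groupby(template, key=code_of):
        frames = list(run)
        merged.append([sum(column) / len(frames) for column in zip(*frames)])
    return merged

# Hypothetical codebook and 5-frame template.
codebook = {0: (0.0, 0.0), 1: (10.0, 10.0)}
template = [[0.0, 1.0], [1.0, 0.0], [9.0, 10.0], [10.0, 9.0], [11.0, 11.0]]

fewer_frames = compress_frames(template, codebook)      # 5 frames become 2
compact = compress_dimensions(fewer_frames, subset=(0,))  # 2 dimensions become 1
```

Both steps shrink the stored template, and a shorter, lower-dimensional template also lowers the cost of each DTW comparison at verification time.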

As can be seen from the above description, the apparatus for enrollment of speaker authentication according to this embodiment can implement the enrollment method described in the previous embodiment. It can select an optimal acoustic feature subset even when no impersonator database is available, so that the extracted features are more discriminative. With this technique, a text-dependent speaker verification system achieves a significant improvement.

Furthermore, the reference template is compressed accordingly, which reduces the memory required for the template. The computational load is also reduced.

FIG. 5 is a block diagram of an apparatus for verification of speaker authentication according to an embodiment of the present invention under the same inventive concept. This embodiment is described below with reference to FIG. 5. Descriptions of parts identical to those of the previous embodiments are omitted where appropriate.

As shown in FIG. 5, the apparatus 500 for verification of speaker authentication according to this embodiment includes: a test acoustic feature extraction unit 501 that generates a test acoustic feature vector sequence from a test utterance; a test optimization unit 502 that optimizes the test acoustic feature vector sequence based on the optimal acoustic feature subset generated during enrollment, to obtain an optimal test acoustic feature vector sequence; and a determination unit 503 that determines, based on a reference template and the optimal test acoustic feature vector sequence, whether the test utterance is an enrolled utterance spoken by the same speaker.

The determination unit 503 includes a DTW calculation unit 5031 that calculates a DTW matching score between the reference template and the optimal test acoustic feature vector sequence. The determination unit 503 compares the DTW matching score with a threshold to determine whether the test utterance is an enrolled utterance spoken by the same speaker.
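The verification path (feature selection on the test sequence, DTW matching, threshold comparison) can be sketched as below. The assumptions are: the stored template already contains only the selected dimensions, the matching score is a DTW distance so that smaller means a better match, and the threshold value is arbitrary. `verify` and `dtw_distance` are hypothetical names.

```python
import math

def dtw_distance(a, b):
    """Textbook DTW with Euclidean local cost."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            local = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]

def verify(test_sequence, template, subset, threshold):
    """Accept the test utterance if its DTW distance to the template is within the threshold."""
    # Test optimization: keep only the dimensions chosen at enrollment.
    optimized = [[frame[d] for d in subset] for frame in test_sequence]
    score = dtw_distance(template, optimized)
    return score <= threshold

# Toy data: the template was compressed to dimension 0 at enrollment.
template = [[0.0], [1.0], [2.0]]
subset = (0,)
genuine = [[0.1, 9.0], [0.1, 3.0], [1.0, -4.0], [2.1, 7.0]]
impostor = [[5.0, 0.0], [6.0, 0.0], [7.0, 0.0]]

accepted = verify(genuine, template, subset, threshold=1.0)   # True
rejected = verify(impostor, template, subset, threshold=1.0)  # False
```

Note that the genuine utterance has a different length from the template and a noisy second dimension; DTW absorbs the length difference, and the feature subset discards the noisy dimension.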

As can be seen from the above description, the apparatus for verification of speaker authentication according to this embodiment can implement the verification method described in the previous embodiment. By using the optimal acoustic feature subset selected in the enrollment stage, it can select the optimal feature set that captures the characteristics of each speaker. The extracted features are therefore more discriminative, and the system achieves a significant improvement. Furthermore, the reference template is compressed accordingly, which reduces the memory required for the template. The computational load is also reduced.

FIG. 6 is a block diagram of a speaker authentication system according to an embodiment of the present invention under the same inventive concept. This embodiment is described below with reference to FIG. 6. Descriptions of parts identical to those of the previous embodiments are omitted where appropriate.

As shown in FIG. 6, the system for speaker authentication according to this embodiment includes an enrollment apparatus 400, which may be the apparatus for enrollment of speaker authentication described in the previous embodiment, and a verification apparatus 500, which may be the apparatus for verification of speaker authentication described in the previous embodiment. The reference template and the optimal feature subset generated by the enrollment apparatus 400 are transferred to the verification apparatus 500 by any means of communication, such as a network, an internal channel, or a recording medium such as a magnetic disk.
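The hand-off between the two apparatuses only requires that the reference template and the feature subset travel together. One possible encoding is sketched below; the JSON format and the field names `reference_template` and `feature_subset` are assumptions for illustration, since the text leaves the transfer format open.

```python
import json

def export_enrollment(template, subset):
    """Serialize the enrollment result for transfer to the verification side."""
    return json.dumps({"reference_template": template, "feature_subset": list(subset)})

def import_enrollment(payload):
    """Restore the template and the feature subset on the verification side."""
    data = json.loads(payload)
    return data["reference_template"], tuple(data["feature_subset"])

payload = export_enrollment([[0.5], [10.0]], (0,))
template, subset = import_enrollment(payload)
```

The same payload could be sent over a network, passed through an internal channel, or written to a file on a recording medium.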

As can be seen from the above description, the system for speaker authentication according to this embodiment selects, in the enrollment stage, an optimal acoustic feature subset suited to the characteristics of each speaker even when no impersonator database exists, and uses that same subset in the verification stage, so that the feature set applied at verification is also matched to each speaker. The extracted features are therefore more discriminative, and the system achieves a significant improvement. Furthermore, the reference template is compressed accordingly, which reduces the memory required for the template. The computational load is also reduced.

The apparatus 400 for enrollment of speaker authentication, the apparatus 500 for verification of speaker authentication, and their various components may be built from dedicated circuits or chips, or may be implemented by a computer (processor) executing corresponding programs.

Although the method and apparatus for enrollment of speaker authentication, the method and apparatus for verification of speaker authentication, and the system for speaker authentication according to the present invention have been described in detail with reference to several exemplary embodiments, these embodiments are not exhaustive, and those skilled in the art can make various changes and modifications within the spirit and scope of the present invention. Accordingly, the present invention is not limited to these embodiments, and its scope is defined by the appended claims.

Claims

A method for enrollment of speaker authentication, comprising:
generating a plurality of acoustic feature vector sequences based on each of a plurality of utterances of the same content spoken by a speaker;
generating a reference template from the plurality of acoustic feature vector sequences;
generating a pseudo-impersonator feature vector sequence corresponding to each of the plurality of acoustic feature vector sequences, based on a codebook including a plurality of codes and feature vectors corresponding to the plurality of codes; and
selecting an optimal acoustic feature subset based on the plurality of acoustic feature vector sequences, the reference template, and the plurality of pseudo-impersonator feature vector sequences.

The method of claim 1, wherein generating the pseudo-impersonator feature vector sequence corresponding to each of the plurality of acoustic feature vector sequences comprises:
assigning a code to each feature vector in the acoustic feature vector sequence to convert the acoustic feature vector sequence into a corresponding code sequence; and
converting the code sequence into a corresponding feature vector sequence, as the pseudo-impersonator feature vector sequence, based on the codes and the feature vectors corresponding to the codes.

The method of claim 2, wherein assigning a code to each feature vector in the acoustic feature vector sequence comprises:
searching the codebook for the feature vector closest to the feature vector of the acoustic feature vector sequence; and
assigning, to the feature vector of the acoustic feature vector sequence, the code corresponding to the closest feature vector in the codebook.

The method of claim 1, wherein selecting the optimal acoustic feature subset comprises:
examining individual acoustic feature subset candidates; and
selecting, as the optimal acoustic feature subset, the acoustic feature subset that maximizes the identification rate of the reference template with respect to the pseudo-impersonator feature vector sequences.

The method of claim 1, wherein selecting the optimal acoustic feature subset comprises:
examining individual acoustic feature subset candidates;
calculating a plurality of DTW distances de(i) between the reference template and the plurality of acoustic feature vector sequences, and a plurality of DTW distances dp(i) between the reference template and the plurality of pseudo-impersonator feature vector sequences; and
selecting, as the optimal acoustic feature subset, the acoustic feature subset that minimizes the ratio of the plurality of DTW distances de(i) to the plurality of DTW distances dp(i).

The method of claim 1, wherein selecting the optimal acoustic feature subset comprises:
examining individual acoustic feature subset candidates;
calculating a plurality of DTW distances de(i) between the reference template and the plurality of acoustic feature vector sequences, and a plurality of DTW distances dp(i) between the reference template and the plurality of pseudo-impersonator feature vector sequences; and
selecting, as the optimal acoustic feature subset, the acoustic feature subset that minimizes the ratio of the difference between the plurality of DTW distances de(i) and the plurality of DTW distances dp(i) to the sum of the plurality of DTW distances de(i) and the plurality of DTW distances dp(i).

The method according to any one of claims 4 to 6, wherein the step of examining the candidates of the individual acoustic feature subsets is performed in a specific range.

The method of claim 7, wherein the specific range includes acoustic feature subset candidates in which the number of acoustic features is greater than a specific number.

The method of claim 1, further comprising compressing dimensions of acoustic feature vectors in the reference template based on the optimal acoustic feature subset.

The method of claim 1, further comprising compressing a number of acoustic feature vectors in the reference template based on the codebook.

A method for verification of speaker authentication, comprising:
generating a test acoustic feature vector sequence from a test utterance;
optimizing the test acoustic feature vector sequence based on an optimal acoustic feature subset generated during enrollment, to obtain an optimal test acoustic feature vector sequence; and
determining, based on a reference template and the optimal test acoustic feature vector sequence, whether the test utterance is an enrolled utterance spoken by the same speaker.

The method of claim 11, wherein the determining comprises:
calculating a DTW matching score between the reference template and the optimal test acoustic feature vector sequence; and
comparing the DTW matching score with a threshold to determine whether the test utterance is an enrolled utterance spoken by the same speaker.

An apparatus for enrollment of speaker authentication, comprising:
an acoustic feature extraction unit that generates an acoustic feature vector sequence from an utterance spoken by a speaker;
a template generation unit that generates a reference template from a plurality of acoustic feature vector sequences corresponding to a plurality of utterances of the same content spoken by the speaker;
a pseudo-impersonator data generation unit that generates a pseudo-impersonator feature vector sequence corresponding to each of the plurality of acoustic feature vector sequences, based on a codebook including a plurality of codes and feature vectors corresponding to the plurality of codes; and
an optimization unit that selects an optimal acoustic feature subset based on the plurality of acoustic feature vector sequences, the reference template, and the plurality of pseudo-impersonator feature vector sequences.

The apparatus of claim 13, wherein the pseudo-impersonator data generation unit comprises:
a vector-to-code conversion unit that assigns a code to each feature vector in the acoustic feature vector sequence in order to convert the acoustic feature vector sequence into a corresponding code sequence; and
a code-to-vector conversion unit that converts the code sequence into a corresponding feature vector sequence, as the pseudo-impersonator feature vector sequence, based on the codes and the feature vectors corresponding to the codes.

The apparatus of claim 14, wherein the vector-to-code conversion unit is configured to search the codebook for the feature vector closest to the feature vector of the acoustic feature vector sequence, and to assign to the feature vector of the acoustic feature vector sequence the code corresponding to the closest feature vector in the codebook.

The apparatus of claim 13, wherein the optimization unit is configured to examine individual acoustic feature subset candidates and to select, as the optimal acoustic feature subset, the acoustic feature subset that maximizes the identification rate of the reference template with respect to the pseudo-impersonator feature vector sequences.

The apparatus of claim 13, wherein the optimization unit is configured to examine individual acoustic feature subset candidates, to calculate a plurality of DTW distances de(i) between the reference template and the plurality of acoustic feature vector sequences and a plurality of DTW distances dp(i) between the reference template and the plurality of pseudo-impersonator feature vector sequences, and to select, as the optimal acoustic feature subset, the acoustic feature subset that minimizes the ratio of the plurality of DTW distances de(i) to the plurality of DTW distances dp(i).

The apparatus of claim 13, wherein the optimization unit is configured to examine individual acoustic feature subset candidates, to calculate a plurality of DTW distances de(i) between the reference template and the plurality of acoustic feature vector sequences and a plurality of DTW distances dp(i) between the reference template and the plurality of pseudo-impersonator feature vector sequences, and to select, as the optimal acoustic feature subset, the acoustic feature subset that minimizes the ratio of the difference between the plurality of DTW distances de(i) and the plurality of DTW distances dp(i) to the sum of the plurality of DTW distances de(i) and the plurality of DTW distances dp(i).

The apparatus according to any one of claims 16 to 18, wherein the optimization unit is configured to examine candidates of the individual acoustic feature subset in a specific range.

The apparatus of claim 19, wherein the specific range includes acoustic feature subset candidates in which the number of acoustic features is greater than a specific number.

The apparatus of claim 13, further comprising a compression unit that compresses dimensions of acoustic feature vectors in the reference template based on the optimal acoustic feature subset.

The apparatus of claim 13, further comprising a compression unit that compresses the number of acoustic feature vectors in the reference template based on the codebook.

An apparatus for verification of speaker authentication, comprising:
a test acoustic feature extraction unit that generates a test acoustic feature vector sequence from a test utterance;
a test optimization unit that optimizes the test acoustic feature vector sequence based on an optimal acoustic feature subset generated during enrollment, to obtain an optimal test acoustic feature vector sequence; and
a determination unit that determines, based on a reference template and the optimal test acoustic feature vector sequence, whether the test utterance is an enrolled utterance spoken by the same speaker.

The apparatus of claim 23, wherein the determination unit comprises a DTW calculation unit that calculates a DTW matching score between the reference template and the optimal test acoustic feature vector sequence, and
the determination unit compares the DTW matching score with a threshold to determine whether the test utterance is an enrolled utterance spoken by the same speaker.

A system for speaker authentication, comprising:
the apparatus for enrollment of speaker authentication according to any one of claims 13 to 22; and
the apparatus for verification of speaker authentication according to claim 23 or claim 24.