JPH08286692A

JPH08286692A - Method and device for speaker collating

Info

Publication number: JPH08286692A
Application number: JP7087289A
Authority: JP
Inventors: Katsutake Bin; 雄偉閔; Nobuo Koizumi; 宣夫小泉
Original assignee: N T T DATA TSUSHIN KK; NTT Data Communications Systems Corp
Current assignee: N T T DATA TSUSHIN KK; NTT Data Corp
Priority date: 1995-04-12
Filing date: 1995-04-12
Publication date: 1996-11-01
Anticipated expiration: 2017-01-28
Also published as: JP3251460B2

Abstract

PURPOSE: To improve system security by enabling a device that uses a text-specified speaker verification method to discriminate whether an input voice is the natural voice of a registered user or a recorded or synthesized voice. CONSTITUTION: At speaker registration, the device arbitrarily decides two designated texts, and a codebook corresponding to each text is created for every speaker and stored in a codebook storage unit 18. At speaker verification, the user is asked to utter the voices corresponding to the two designated texts. The features of the input voice corresponding to one of the designated texts are quantized with the speaker-specific codebook for that text to obtain an intra-speaker distortion distance, and a relative threshold indicating whether the voice is natural is determined from the feature difference between the two voices and the distance between the codebooks. A comparison unit 15 compares the intra-speaker distortion distance with the relative threshold. If the distance is larger than the threshold, the input is judged to be other than a natural voice and processing in the speaker verification unit 16 is prohibited.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speaker verification method, and in particular to a technique for discriminating whether an input voice is a natural voice uttered by a registered user or a recorded or synthesized voice.

[0002]

[Prior Art] Conventionally, text-dependent or text-independent methods have been adopted as the speaker verification method used in speaker recognition devices and the like.

[0003] Conceptual diagrams of the text-dependent speaker verification method are shown in FIGS. 5 and 6. In the first method, shown in FIG. 5, text-specified learning speech is input at speaker registration, its features are extracted (S51, S52), and a codebook is created (S53). At speaker verification, the features of the test voice input for the same text are extracted (S54, S55) and quantized with the corresponding codebook to calculate a quantization distance (S56), and the legitimacy of the person who uttered the voice is judged by comparing the calculated distance with a predetermined threshold. The second method, shown in FIG. 6, also performs a threshold comparison, but differs in that a time-series standard pattern is created (S63) after feature extraction of the learning speech (S61, S62). In this method, dynamic time warping (DTW) is performed (S66) after feature extraction of the input test voice (S64, S65) and before the threshold comparison (S67). DTW converts the time scale ("pitch time") of an utterance, which varies with the speaker, the timing of the utterance, and so on even for speech based on the same text, into a common time scale. For text-dependent speaker verification of this kind, see, for example, D. K. Burton, "Text-Dependent Speaker Verification Using Source Coding," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-35, pp. 133-143, Feb. 1987.
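The first method (FIG. 5) can be illustrated with a minimal Python sketch. It assumes that a codebook has already been trained, that distortion is measured as the mean squared Euclidean distance to the nearest code vector, and that a speaker is accepted when the distance does not exceed the threshold. The source specifies none of these details, and every function name and data value below is hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantization_distance(features, codebook):
    """Mean distortion when each feature vector is replaced by its
    nearest code vector (assumed distance measure, not from the source)."""
    total = sum(min(squared_distance(f, c) for c in codebook) for f in features)
    return total / len(features)


def accept_speaker(features, codebook, threshold):
    """Accept the claimed identity when the distortion is within the threshold."""
    return quantization_distance(features, codebook) <= threshold


# Toy data: two test frames quantized against a two-entry codebook.
codebook = [[0.0, 0.0], [1.0, 0.0]]
test_features = [[0.0, 0.0], [1.0, 1.0]]
distance = quantization_distance(test_features, codebook)
```

A fixed threshold of this kind is what the later paragraphs replace with a threshold that depends on the input itself.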

[0004] A conceptual diagram of the text-independent speaker verification method is shown in FIG. 7. This method differs from the text-dependent method shown in FIG. 5 in that the user decides the content of the text (the utterance content) of his or her own volition (S71 to S77). For this method, see, for example, F. K. Soong, A. E. Rosenberg, L. R. Rabiner, and B. H. Juang, "A Vector Quantization Approach to Speaker Recognition," Proc. IEEE ICASSP, vol. 1, pp. 387-390, March 1985.

[0005] However, text-dependent and text-independent speaker verification share a common risk: if a registered user's voice has been recorded beforehand by someone else by some means, the recorded voice data can be used to impersonate that user and gain unauthorized entry to the speaker verification system.

[0006] A text-specified speaker verification method was therefore proposed as a way of eliminating this problem. In this method, as shown in FIG. 8, the speaker verification system creates a speaker-independent phoneme model and a speaker-adapted phoneme model in advance (S81 to S84), and the system specifies the utterance content at verification time. The test voice from the user is then input and its features are extracted (S86), and whether the user uttered the specified content is determined by phoneme model concatenation and likelihood calculation (S87, S88). If the content of the input test voice does match the content of the specified text, speaker verification is started (S89). Impersonation by means of a recording or the like is prevented in this way. For the text-specified speaker verification method, see, for example, "Speaker Recognition Technology" (Matsui and Furui, NTT R&D, vol. 43, No. 10, 1994).

[0007]

[Problems to be Solved by the Invention] As described above, in the text-specified speaker verification method the system specifies the utterance content, which makes it possible to counter unauthorized entry using recorded and reproduced voice. However, when highly advanced speech synthesis technology is combined with computer technology, even the text-specified speaker verification method may not be fully secure.

[0008] For example, as shown in FIG. 9, suppose that a speaker verification system 90 performing text-specified processing is combined with a speech synthesizer 91 that stores various phoneme waveforms based on previously collected voice of a registered user. If the speech synthesizer 91 instantly selects and concatenates the phoneme waveforms for the designated text, generates the corresponding synthesized sound, and inputs it to the system, the system cannot determine whether the access is by an unauthorized user.

[0009] In view of the above problem, an object of the present invention is to provide a technique that, in text-specified speaker verification, automatically discriminates whether an input test voice is a natural voice uttered by the registered user in person or a recorded and reproduced voice or a synthesized voice, thereby improving the security of system use.

[0010]

[Means for Solving the Problems] Speech is given meaning by vocal-tract features in addition to the sound-source signal, and a recorded and reproduced voice or a synthesized voice always contains the characteristics of electric circuits such as recording and playback equipment. Moreover, because the vocal-tract features of speech normally vary with time, environment, and so on even for identical content, those circuit characteristics cannot be detected directly even if they are present in the input voice. However, as shown in FIG. 3, the vocal-tract features 31 of a voice that has passed through an electric circuit are necessarily displaced in the feature space relative to the vocal-tract features 30 of the original natural voice. It has been confirmed theoretically that this holds at every frequency and that the relative positions within each feature space do not change. FIG. 3 shows a two-dimensional case for convenience; in practice the space is often multidimensional. The present invention uses this property of the input test voice to provide a new speaker verification method that determines whether the voice is the speaker's natural voice or a recorded and reproduced or synthesized voice, and an apparatus for carrying out that method.

[0011] In the speaker verification method of the present invention, at speaker registration a first speaker-specific codebook corresponding to a first designated text and a second speaker-specific codebook corresponding to a second designated text are stored together with frequency information on the code vectors appearing from each speaker-specific codebook. At speaker verification, when the voices corresponding to the first and second designated texts have each been input, the features of the input voice corresponding to the first designated text are quantized with the first speaker-specific codebook to derive an intra-speaker distortion distance. In addition, a relative threshold serving as the reference value for the voice type is derived from the feature difference between the input voices and the speaker-specific codebooks corresponding to them. The derived intra-speaker distortion distance is then compared with the relative threshold to determine whether the input voice is a natural voice or some other sound.

[0012] Specifically, this type determination of the input voice is made as follows. Let Ci be the i-th cluster of the first speaker-specific codebook, Xi the centroid of cluster Ci, V1(i,j) the j-th feature vector corresponding to the first designated text that belongs to cluster Ci, and V2(i,j) the j-th feature vector corresponding to the second designated text that belongs to cluster Ci. Whether the following conditional expression holds is then determined.

[Equation 2]

$$\sum \left[ V_1(i,j) - X_i \right]^2 \le \sum \left[ V_2(i,j) - X_i \right]^2 - \sum \left[ V_1(i,j) - V_2(i,j) \right]^2$$

[0013] The left-hand side of the above expression is the intra-speaker distortion distance obtained by quantizing the natural voice corresponding to the first designated text with the first speaker-specific codebook. The second term on the right-hand side represents the sum of squares (mean value) of the Euclidean distances between the two vocal-tract features belonging to the same cluster. The value of the right-hand side serves as the threshold. It has been confirmed that the expression always holds when the input voice is a natural voice. When the input is anything other than a natural voice, the intra-speaker distortion distance becomes relatively large and the expression no longer holds, so speaker verification with an active (input-dependent) threshold becomes possible.
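The conditional expression can be evaluated directly once the feature vectors have been grouped by cluster. The following Python sketch is a hedged illustration only: it assumes that the sums run over every cluster i and every paired index j, which the source does not state, and the function and argument names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def equation2_holds(centroids, v1, v2):
    """Evaluate the conditional expression of Equation 2.

    centroids[i] is Xi; v1[i][j] and v2[i][j] are V1(i,j) and V2(i,j),
    the j-th feature vectors of each utterance that belong to cluster Ci.
    Summation over all i and j is an assumption.
    """
    lhs = 0.0         # sum of [V1(i,j) - Xi]^2
    rhs_first = 0.0   # sum of [V2(i,j) - Xi]^2
    rhs_second = 0.0  # sum of [V1(i,j) - V2(i,j)]^2
    for x_i, v1_i, v2_i in zip(centroids, v1, v2):
        for a, b in zip(v1_i, v2_i):
            lhs += squared_distance(a, x_i)
            rhs_first += squared_distance(b, x_i)
            rhs_second += squared_distance(a, b)
    return lhs <= rhs_first - rhs_second


# Toy data: one cluster with centroid at the origin, one vector pair.
centroids = [[0.0, 0.0]]
holds = equation2_holds(centroids, [[[1.0, 0.0]]], [[[3.0, 0.0]]])
fails = equation2_holds(centroids, [[[3.0, 0.0]]], [[[1.0, 0.0]]])
```

In the first call the left-hand side is 1 and the right-hand side is 9 - 4 = 5, so the expression holds; in the second the left-hand side is 9 and the right-hand side is 1 - 4 = -3, so it does not.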

[0014] In the above expression, the first term on the right-hand side should strictly be obtained by quantizing the natural voice corresponding to the second designated text with the first speaker-specific codebook, but in practice this is not possible. The first term is therefore approximated using the inter-codebook distance between the second speaker-specific codebook and the first speaker-specific codebook. This approximation technique and the threshold determination technique are described in detail in the speaker verification method and apparatus previously proposed by the present applicant (specifications of Japanese Patent Application No. 6-265856 and Japanese Patent Application No. 6-41615) and in "A Method for Determining the Threshold for Speaker Verification Based on Speaker-Specific Codebooks" (Min and Murakami, Proceedings of the 1994 Spring Meeting of the Acoustical Society of Japan, I, 3-7-1, March 1994). As is clear from the above expression, the relationship between the right-hand side and the left-hand side changes actively with the features of the input voice and with the speaker-specific codebooks corresponding to the input voice, so in the following description the threshold corresponding to the right-hand side is called the relative threshold.

[0015] A speaker verification apparatus of the present invention suited to carrying out the above method is characterized by comprising: codebook storage means storing a first speaker-specific codebook corresponding to a first designated text and a second speaker-specific codebook corresponding to a second designated text, together with frequency information on the code vectors appearing from each speaker-specific codebook; feature extraction means for recognizing the input voices corresponding to the first and second designated texts and extracting the features of each input voice; codebook selection means for selecting, from the codebook storage means, the speaker-specific codebook corresponding to each extracted voice feature; means for quantizing the features of the input voice corresponding to the first designated text with the selected first speaker-specific codebook to derive an intra-speaker distortion distance; threshold determination means for determining a relative threshold serving as the reference for the voice type, based on the feature difference between the input voices corresponding to the first and second designated texts and on the selected first and second speaker-specific codebooks; and means for determining whether the input voice is a natural voice or some other sound by comparing the intra-speaker distortion distance with the determined relative threshold.

[0016] In the above configuration, the threshold determination means has, for example, means for clustering the feature difference with the selected first speaker-specific codebook and deriving the mean of the sum of squares of the distances belonging to each cluster, and means for quantizing the selected second speaker-specific codebook with the first speaker-specific codebook to derive the inter-codebook distance. The difference between this inter-codebook distance and the mean sum of squares is determined as the relative threshold.

[0017]

[Operation] In the present invention, the voices corresponding to two designated texts are input from the speaker within a period during which the characteristics of the voice input means, such as a microphone, remain essentially the same. After the features of each input voice have been extracted, their difference is taken in order to remove the influence of electric circuit characteristics and the like. That is, as shown in FIG. 4, the features of the two input voices (voice signals) contain vocal-tract features, circuit characteristics, and microphone characteristics, but the difference between the two reduces to the difference in vocal-tract features. This vocal-tract feature difference is clustered with the speaker-specific codebook corresponding to the designated text, and the mean of the sum of squares of the vocal-tract feature differences (Euclidean distances) belonging to the same cluster is derived. The value obtained by subtracting this mean sum of squares from the inter-codebook distance is determined as the relative threshold for judging whether the voice is natural. If the intra-speaker distortion distance exceeds the relative threshold, the input voice is judged to be a recorded and reproduced voice or a synthesized voice, and subsequent processing is refused at that point. If it does not exceed the relative threshold, the input voice is judged to be the user's natural voice and processing continues. In this way a natural voice can be distinguished from a recorded and reproduced or synthesized voice, and unauthorized entry to the system using a speech synthesizer or the like can be prevented.
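The cancellation of circuit and microphone characteristics can be written out in one line, under an assumption that the source does not state explicitly: that the channel contribution is additive in the feature domain, as it is for cepstral-type features. The symbols F_k (extracted feature of utterance k), T_k (vocal-tract feature), and H (shared circuit and microphone term) are introduced here for illustration only.

```latex
F_1 = T_1 + H, \qquad F_2 = T_2 + H
\quad\Longrightarrow\quad
F_1 - F_2 = (T_1 + H) - (T_2 + H) = T_1 - T_2
```

The shared term H drops out only if it is the same for both utterances, which is why the two voices are to be input within a short interval.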

[0018]

[Embodiment] A preferred embodiment of the present invention is described in detail below with reference to the drawings. FIG. 1 is a block configuration diagram of a speaker verification apparatus 1 according to one embodiment of the present invention, and shows an example of an apparatus that applies the text-specified speaker verification method described above.

[0019] The speaker verification apparatus 1 comprises: a voice input unit 10, for example a microphone, for inputting the voice corresponding to a designated text; a voice recognition unit 11 that recognizes the content of the input voice; a feature extraction unit 12 that extracts the features of each input voice, that is, the vocal-tract features (including circuit characteristics and microphone characteristics); a vector quantization unit 13 that quantizes the extracted features with a speaker-specific codebook; a threshold determination unit 14 that determines an active relative threshold based on the output of the feature extraction unit 12 and the speaker-specific codebooks; a comparison unit 15 that compares the outputs of the vector quantization unit 13 and the threshold determination unit 14; and a speaker verification unit 16 that performs speaker verification according to the output of the comparison unit 15. It also comprises a speaker-specific codebook creation unit 17 that, at speaker registration, creates a codebook for each speaker corresponding to the text learning voice data, and a codebook storage unit 18 that stores the created speaker-specific codebooks. The codebook storage unit 18 is connected to the threshold determination unit 14 so that it can supply codebooks to it at any time.

[0020] Next, the operation of the speaker verification apparatus 1 configured as above is described. In this embodiment, at speaker registration the speaker verification apparatus 1 arbitrarily decides two designated texts, designated text 1 and designated text 2, and has the speaker being registered utter learning voices (natural voices) corresponding to these texts. The speaker-specific codebook creation unit 17 creates a speaker-specific codebook from each input utterance. The learning voice is then vector-quantized with the created speaker-specific codebook to obtain the appearance frequency of each codeword (code vector) of that codebook. This appearance frequency information is stored in the codebook storage unit 18 together with the speaker-specific codebook.
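The appearance-frequency step can be sketched as follows. This is a minimal Python illustration that assumes the codebook has already been created (the source does not name the training algorithm) and that the nearest code vector is chosen by squared Euclidean distance. All names and data are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(vector, codebook):
    """Index of the code vector closest to the given feature vector."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(vector, codebook[i]))


def codeword_frequencies(features, codebook):
    """Count how often each code vector is selected when the learning
    voice features are vector-quantized with the codebook."""
    counts = [0] * len(codebook)
    for f in features:
        counts[nearest_index(f, codebook)] += 1
    return counts


# Toy data: three learning frames, two code vectors.
codebook = [[0.0, 0.0], [1.0, 1.0]]
training_features = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]]
frequencies = codeword_frequencies(training_features, codebook)
```

Storing these counts next to the codebook is what later allows code vectors to be generated "according to the appearance frequency information" when the inter-codebook distance is computed.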

[0021] At speaker verification, the user is prompted to input test voices corresponding to designated texts 1 and 2. The two voices are preferably input within a short interval during which the characteristics of the microphone and so on remain essentially the same. The voice recognition unit 11 applies voice recognition to each input test voice and determines whether it is a test voice corresponding to designated text 1 or 2. When a test voice is recognized as matching the specified content, the feature extraction processing unit 12 extracts the features of each test voice. After the test voice features have been extracted, the speaker-specific codebooks corresponding to designated texts 1 and 2 are each selected from the codebook storage unit 18.

[0022] Of the selected speaker-specific codebooks, the one corresponding to designated text 1 is input to the vector quantization unit 13. Vector quantization is then applied to the features of the corresponding input test voice to obtain the intra-speaker distortion distance corresponding to the left-hand side of the expression given above. In practice this intra-speaker distortion distance is the average quantization distortion distance. The speaker-specific codebooks corresponding to designated text 1 and designated text 2 are also input to the threshold determination unit 14. The threshold determination unit 14 additionally receives the features of the voices corresponding to designated texts 1 and 2 extracted by the feature extraction unit 12. Based on this input information, the threshold determination unit 14 determines the relative threshold that indicates whether the voice is natural. The processing procedure in the threshold determination unit 14 is described in detail with reference to FIG. 2.

[0023] Referring to FIG. 2, the threshold determination unit 14 obtains the vocal-tract feature difference between the voices corresponding to the designated texts (S21), inputs this difference to any one of the input speaker-specific codebooks (S22, S23), for example the speaker-specific codebook for designated text 1, and applies clustering (S24). The sum of the squared distances of the difference features belonging to the resulting clusters 1 to N is calculated, and its mean is obtained and stored in a memory (not shown) (S25). Meanwhile, each code vector of the other speaker-specific codebook (the one for designated text 2) is generated according to the appearance frequency information (S26) and quantized with the speaker-specific codebook for designated text 1 to obtain the inter-codebook distance (S27, S28). The mean squared distance of the difference features is subtracted from the inter-codebook distance thus obtained to derive the relative threshold (S29, S30). This relative threshold corresponds to the right-hand side of the expression given above.
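The procedure of FIG. 2 can be sketched in Python under several stated assumptions: the frames of the two utterances are paired by index, all distances are squared Euclidean distances, the mean is pooled over all clusters (so the cluster assignment of S24 does not change the result and is omitted), and the inter-codebook distance is a frequency-weighted mean. The source confirms none of these details, and all names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_squared_difference(features1, features2):
    """S21 to S25: mean squared norm of the frame-wise feature differences
    (frames paired by index, pooled over all clusters: both assumptions)."""
    pairs = list(zip(features1, features2))
    return sum(squared_distance(a, b) for a, b in pairs) / len(pairs)


def inter_codebook_distance(codebook1, codebook2, frequencies2):
    """S26 to S28: quantize the code vectors of codebook 2 with codebook 1,
    weighting each code vector by its stored appearance frequency."""
    weighted = sum(
        n * min(squared_distance(c2, c1) for c1 in codebook1)
        for c2, n in zip(codebook2, frequencies2)
    )
    return weighted / sum(frequencies2)


def relative_threshold(features1, features2, codebook1, codebook2, frequencies2):
    """S29, S30: inter-codebook distance minus mean squared feature difference."""
    return (inter_codebook_distance(codebook1, codebook2, frequencies2)
            - mean_squared_difference(features1, features2))


# Toy data for designated texts 1 and 2.
codebook1 = [[0.0, 0.0], [4.0, 0.0]]
codebook2 = [[1.0, 0.0], [4.0, 3.0]]
frequencies2 = [3, 1]
features1 = [[0.0, 0.0], [4.0, 0.0]]
features2 = [[1.0, 0.0], [4.0, 1.0]]
threshold = relative_threshold(features1, features2,
                               codebook1, codebook2, frequencies2)
```

With this toy data the inter-codebook distance is (3 x 1 + 1 x 9) / 4 = 3.0 and the mean squared difference is 1.0, giving a relative threshold of 2.0.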

[0024] Returning to FIG. 1, the comparison unit 15 compares the intra-speaker distortion distance obtained by the vector quantization unit 13 with the relative threshold determined by the threshold determination unit 14. As described above, the intra-speaker distortion distance is always larger than the relative threshold when the input voice is a recorded and reproduced voice or a synthesized voice, and always smaller when it is a natural voice. The type of the input voice, that is, whether it is a natural voice or some other sound, can therefore be determined from which of the two values is larger. If, for example, subsequent processing in the speaker verification unit 16 is refused when the input voice is not a natural voice, unauthorized entry that misuses another person's voice can be blocked.
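The decision made by the comparison unit 15 and the gating of the speaker verification unit 16 can be sketched as follows. This is an illustrative Python fragment only: the function names are hypothetical, the verification step is passed in as a callable, and treating equality as "natural" follows the "does not exceed" wording of the operation section.

```python
def is_natural_voice(intra_speaker_distortion, relative_threshold):
    """True when the distortion does not exceed the relative threshold."""
    return intra_speaker_distortion <= relative_threshold


def gate_verification(intra_speaker_distortion, relative_threshold, verify):
    """Run the speaker verification step only for input judged to be a
    natural voice; return None when further processing is refused."""
    if not is_natural_voice(intra_speaker_distortion, relative_threshold):
        return None
    return verify()


# Toy usage with a stand-in verification step.
accepted = gate_verification(1.0, 2.0, lambda: "verified")
refused = gate_verification(3.0, 2.0, lambda: "verified")
```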

[0025]

[Effects of the Invention] As is clear from the above description, according to the present invention, even when a voice corresponding to the designated text is created by a speech synthesizer or the like and input to the speaker verification apparatus, the apparatus can detect this instantly, so the security of speaker verification and the reliability of verification are markedly improved.

[0026] In addition, the relative threshold used to determine whether the input voice is a natural voice or some other voice is obtained actively from the vocal-tract feature difference between the two input voices and the inter-codebook distance of the speaker-specific codebooks corresponding to each voice, and the intra-speaker distortion distance compared against the relative threshold is obtained from the input voice and the speaker-specific codebook corresponding to the first designated text. As a result, the type of each input voice can be determined even when a plurality of registered speakers speak at arbitrary times. A crime-prevention effect can therefore be expected for systems that perform this kind of speaker verification processing.

[Brief description of drawings]

FIG. 1 is a block configuration diagram of a speaker verification system according to one embodiment of the present invention.

FIG. 2 is a diagram explaining the processing procedure of the threshold determination unit according to the embodiment.

FIG. 3 is a conceptual diagram of displacement in the feature space, for explaining the principle of the present invention.

FIG. 4 is a conceptual diagram of the removal of electric circuit characteristics, microphone characteristics, and the like according to the present invention.

FIG. 5 is a conceptual diagram of the processing of the first text-dependent speaker verification method.

FIG. 6 is a conceptual diagram of the processing of the second text-dependent speaker verification method.

FIG. 7 is a conceptual diagram of the processing of the text-independent speaker verification method.

FIG. 8 is a conceptual diagram of the processing of the conventional text-specified speaker verification method.

FIG. 9 is a conceptual diagram of unauthorized entry into a conventional speaker verification system.

[Explanation of symbols]

1 Speaker verification device (話者照合装置)
10 Speech input unit (音声入力部)
11 Speech recognition unit (音声認識部)
12 Feature extraction processing unit (特徴抽出処理部)
13 Vector quantization unit (ベクトル量子化部)
14 Threshold value determination unit (閾値決定部)
15 Comparison unit (比較部)
16 Speaker verification unit (話者照合部)
17 Speaker-specific codebook creation unit (話者別コードブック作成部)
18 Codebook storage unit (コードブック格納部)

Claims

[Claims]

1. A speaker verification method characterized in that: a first speaker-specific codebook corresponding to a first designated text and a second speaker-specific codebook corresponding to a second designated text are stored together with appearance frequency information of the code vectors of each speaker-specific codebook; when voices corresponding to the first and second designated texts are respectively input, the features of the input voice corresponding to the first designated text are quantized with the first speaker-specific codebook to derive an intra-speaker distortion distance, and a relative threshold serving as a reference value for the voice type is derived based on the feature difference between the input voices and the speaker-specific codebooks corresponding to the respective input voices; and the derived intra-speaker distortion distance is compared with the relative threshold to determine whether the input voice is a natural voice or a voice other than a natural voice.

2. The speaker verification method according to claim 1, wherein, letting Ci be the i-th cluster of the first speaker-specific codebook, Xi be the centroid of the cluster Ci, V1(i, j) be the j-th feature vector corresponding to the first designated text belonging to the cluster Ci, and V2(i, j) be the j-th feature vector corresponding to the second designated text belonging to the cluster Ci, the type determination determines the input voice to be a natural voice when the conditional expression Σ[V1(i, j) − Xi]² ≦ Σ[V2(i, j) − Xi]² − Σ[V1(i, j) − V2(i, j)]² is satisfied, and to be a voice other than a natural voice when it is not satisfied.

3. The speaker verification method according to claim 2, wherein the term Σ[V2(i, j) − Xi]² in the conditional expression is approximated by the inter-codebook distance of the speaker-specific codebooks corresponding to the respective input voices.

4. A speaker verification apparatus comprising: codebook storage means in which a first speaker-specific codebook corresponding to a first designated text and a second speaker-specific codebook corresponding to a second designated text are stored together with appearance frequency information of the code vectors of each speaker-specific codebook; feature extraction means for recognizing the input voices corresponding to the first and second designated texts and extracting the features of each input voice; codebook selecting means for selecting, from the codebook storage means, the speaker-specific codebook corresponding to each extracted voice feature; means for deriving an intra-speaker distortion distance by quantizing the features of the input voice corresponding to the first designated text with the selected first speaker-specific codebook; threshold value determining means for determining a relative threshold serving as a reference for the voice type based on the feature difference between the input voices corresponding to the first and second designated texts and on the selected first and second speaker-specific codebooks; and means for determining whether or not the input voice is a natural voice by comparing the intra-speaker distortion distance with the determined relative threshold.

5. The speaker verification apparatus according to claim 4, wherein the threshold value determining means comprises: means for clustering the feature difference with the selected first speaker-specific codebook and deriving the mean sum of squares of the distances belonging to each cluster; and means for deriving the inter-codebook distance by quantizing the selected second speaker-specific codebook with the first speaker-specific codebook; and wherein the difference between the inter-codebook distance and the mean sum of squares is determined as the relative threshold.
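For illustration only, and not as part of the claims, the threshold derivation of claim 5 might be sketched in Python as follows. Several details are assumptions of this sketch rather than statements from the text: frames of the two utterances are paired by index, each difference is assigned to the cluster of its first-text frame, distances are squared Euclidean, the per-cluster sums are averaged over all frames, and the stored frequency information is used to weight the inter-codebook distance. All function names are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(v: Vector, codebook: List[Vector]) -> int:
    """Index of the code vector closest to v."""
    return min(range(len(codebook)), key=lambda i: sq_dist(v, codebook[i]))


def mean_squared_feature_difference(
    feats1: List[Vector], feats2: List[Vector], codebook1: List[Vector]
) -> float:
    """Cluster frame differences by the first codebook, then average.

    Per-cluster sums of squared differences are accumulated and the total
    is divided by the number of frames (normalisation is an assumption).
    """
    if len(feats1) != len(feats2) or not feats1:
        raise ValueError("feature sequences must be non-empty and equal length")
    per_cluster: Dict[int, float] = {}
    for v1, v2 in zip(feats1, feats2):
        i = nearest_index(v1, codebook1)
        per_cluster[i] = per_cluster.get(i, 0.0) + sq_dist(v1, v2)
    return sum(per_cluster.values()) / len(feats1)


def inter_codebook_distance(
    codebook2: List[Vector], freq2: List[int], codebook1: List[Vector]
) -> float:
    """Quantize codebook 2 with codebook 1, weighting by code vector frequency."""
    weighted = sum(
        n * sq_dist(c, codebook1[nearest_index(c, codebook1)])
        for c, n in zip(codebook2, freq2)
    )
    return weighted / sum(freq2)


def relative_threshold(feats1, feats2, codebook1, codebook2, freq2) -> float:
    """Inter-codebook distance minus the mean squared feature difference."""
    return inter_codebook_distance(codebook2, freq2, codebook1) - (
        mean_squared_feature_difference(feats1, feats2, codebook1)
    )


# Toy data, invented purely for illustration.
cb1 = [(0.0, 0.0), (10.0, 0.0)]
cb2 = [(0.0, 3.0), (10.0, 4.0)]
f1 = [(0.0, 0.0), (10.0, 0.0)]
f2 = [(0.0, 1.0), (10.0, 2.0)]
threshold = relative_threshold(f1, f2, cb1, cb2, [1, 1])
```

Under these assumptions, the returned value plays the role of the right-hand side of the conditional expression in claim 2, with the Σ[V2(i, j) − Xi]² term replaced by the inter-codebook distance as in claim 3.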