JP2007033898A

JP2007033898A - Speaker checking device

Info

Publication number: JP2007033898A
Application number: JP2005217478A
Authority: JP
Inventors: Chikashi Sugiura; 千加志杉浦; Takehiko Isaka; 岳彦井阪
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2005-07-27
Filing date: 2005-07-27
Publication date: 2007-02-08
Anticipated expiration: 2025-07-27
Also published as: JP4714523B2

Abstract

PROBLEM TO BE SOLVED: To improve verification accuracy without requiring feature vectors of other speakers' speech, even when speech is distorted by background noise or by noise suppression processing, or when the content of a speaker's utterance differs.

SOLUTION: A speech feature generation processing unit 1 generates feature vectors by extracting feature quantities from input speech, and also generates a feature representative vector by finding the center of gravity of the set of generated feature vectors. In verification mode, first and second speaker feature conversion processing units 6 and 7 perform conversion processing that brings the code vectors and the speaker's feature vector set closer to their respective feature representative vectors, and a threshold determination processing unit 9 compares the VQ distortion between the converted vectors with a threshold value to verify the speaker.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a speaker verification device that authenticates a user by, for example, extracting features of the user's voice.

In recent years, various speaker verification devices have been proposed that authenticate a user by extracting features of the user's voice, for example when the user enters a building or uses equipment that must be kept secure.
For example, one known device samples the input speech of each speaker, obtains feature vectors from the feature quantities, vector-quantizes them, and registers the result in a codebook. Thereafter, each time a user's voice is input, the device obtains feature vectors from the feature quantities of the input speech, vector-quantizes them, and collates them with the vectors registered in the codebook to authenticate the user (see, for example, Patent Document 1).

As another speaker verification device, an apparatus has been proposed that calculates a covariance matrix from the feature vectors of the genuine speaker and the feature vectors of other speakers, and statistically improves discrimination performance by transforming the feature vectors with this covariance matrix (see, for example, Patent Document 2).
Patent Document 1: JP 7-248791 A
Patent Document 2: Japanese Patent No. 3080388

However, the device described in Patent Document 1 simply collates the feature vectors of the user's own voice registered in advance in the codebook with the feature vectors of the input speech. Consequently, if the speech is distorted by background noise or the content of the utterance differs, the feature distribution of another speaker may fall within the feature distribution of the genuine speaker registered in the codebook. As a result, another speaker may be misrecognized as the genuine speaker, or the genuine speaker may be misrecognized as another speaker, so verification accuracy was low.

The device described in Patent Document 2, on the other hand, transforms the feature vectors on the basis of a covariance matrix, and can therefore achieve higher verification accuracy than the device of Patent Document 1. However, calculating the covariance matrix requires acquiring feature vectors of other speakers' speech, and this processing entails extra effort and cost.

The present invention has been made in view of the above circumstances. Its object is to provide a speaker verification device that can improve verification accuracy without requiring feature vectors of other speakers' speech, even when speech is distorted by background noise or by the processing that suppresses it, or when the content of the speaker's utterance differs.

To achieve the above object, the present invention includes means for generating a feature vector from each input speech frame of a speaker and for generating a feature representative vector that represents the plurality of generated feature vectors. For a speaker to be registered, the code vectors obtained by vector-quantizing the generated feature vectors are stored in a codebook together with the feature representative vector. At verification time, a conversion process is applied to the first feature vector stored in the codebook so as to reduce the distance between that first feature vector and the first feature representative vector, and a conversion process is applied to the second feature vector generated from the input speech of the speaker to be verified so as to reduce the distance between that second feature vector and the second feature representative vector. The vector quantization distortion between the code vectors and the second feature vectors after the conversion process is then calculated and compared with a preset threshold value, and the comparison result is output as the speaker verification result.

Therefore, even when the input speech is distorted by background noise or by the processing that suppresses it, or when the content of the speaker's utterance differs, applying the above conversion process to the feature vectors shrinks both the distribution of the registered speaker's code vectors stored in the codebook and the distribution of the feature vectors of the speaker to be verified. This reduces the extent to which the feature vector set of the speaker to be verified falls within the code vectors of the registered speaker. The difference in vector quantization distortion between the registered speaker's code vectors and the feature vectors of the speaker to be verified therefore becomes pronounced, which makes it possible to improve verification accuracy.

In short, in the present invention, conversion processes are applied to the first feature vector stored in the codebook and to the second feature vector generated from the input speech of the speaker to be verified, reducing the distance between the first feature vector and its feature representative vector and the distance between the second feature vector and its feature representative vector, respectively. The vector quantization distortion between the code vectors and the second feature vectors after the conversion process is then calculated and compared with a preset threshold value, and the comparison result is output as the speaker verification result.

Therefore, according to the present invention, it is possible to provide a speaker verification device that can improve verification accuracy without requiring feature vectors of other speakers' speech, even when speech is distorted by background noise or when the content of the speaker's utterance differs.

Embodiments of the present invention will be described below with reference to the drawings.
First, the principle of the present invention will be described.
In speaker verification using a vector quantization (VQ) model, false acceptance errors occur because, as shown for example in FIG. 5, the feature vectors of another speaker fall within the genuine speaker's codebook, so that the VQ distortion becomes small. An effective way to avoid this problem is intra-speaker distribution correction, which shrinks the distribution of the genuine speaker's codebook and the distribution of the other speaker's feature vectors. For example, as shown in FIG. 6, in each of the genuine speaker's codebook and the other speaker's feature vectors being compared, the distribution of the code vectors or feature vectors is moved closer to the corresponding feature representative vector.

This intra-speaker distribution correction is realized by applying, to the code vectors and to the feature vectors, a conversion process that reduces the distance between each vector and its feature representative vector. For example, let v_i be a feature vector before conversion, v_i′ the feature vector after conversion, v_g the feature representative vector, A the conversion matrix, B the first weight matrix, C the second weight matrix, and N the number of vectors. The conversion is then expressed by the following equation (1).

[Equation (1)]

In equation (1), the conversion matrix A is, for example, a D×D diagonal matrix whose diagonal elements diag{A} are 0.0 or greater. The first and second weight matrices B and C are, for example, D×D identity matrices. D denotes the number of vector dimensions.

The feature representative vector can be expressed, for example, as the center of gravity of the code vectors or of the feature vector set. Using these centers of gravity as the feature representative vectors absorbs the variation in the code vectors and feature vectors caused by background noise, by the processing that suppresses it, and by differences in utterance content, so that the representative vector depends only on the speaker.
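The idea of moving each vector toward its center of gravity can be sketched as follows. This is only an illustration under stated assumptions: the weight matrices B and C are taken to be identity matrices (which the text permits), and the conversion is assumed to reduce to v_i′ = v_g + A(v_i − v_g) with a diagonal A. Equation (1) itself is not reproduced in this text, so this form is an assumption, and the function names `centroid` and `shrink_toward` are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def centroid(vectors: Sequence[Sequence[float]]) -> Vector:
    """Return the center of gravity of a non-empty set of equal-length vectors."""
    count = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / count for d in range(dim)]


def shrink_toward(
    vectors: Sequence[Sequence[float]],
    representative: Sequence[float],
    rates: Sequence[float],
) -> List[Vector]:
    """Move every vector toward the representative vector.

    Assumed form (not confirmed by the source): v' = v_g + a_d * (v - v_g)
    per dimension d, where rates[d] is the diagonal element a_d of A.
    """
    dim = len(representative)
    return [
        [representative[d] + rates[d] * (v[d] - representative[d]) for d in range(dim)]
        for v in vectors
    ]


if __name__ == "__main__":
    features = [[0.0, 2.0], [2.0, 4.0], [4.0, 0.0]]
    rep = centroid(features)
    print(rep)
    print(shrink_toward(features, rep, [0.5, 0.5]))
```

Under this assumed form, a diagonal element of 1.0 leaves a dimension unchanged and a value of 0.0 collapses it onto the representative vector.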

Let a_d denote each diagonal element of the diagonal matrix A described above. To shrink the distribution of the feature vectors, a_d takes a value in the range 0.0 to 1.0; a_d is referred to here as the reduction rate. For example, setting a_d = 1.0 for all dimensions leaves the vectors unconverted, whereas setting a_d = 0.0 for all dimensions makes the feature representative vector alone serve as the speaker's feature vector.
FIGS. 7 and 8 show an example of the speaker dependence of the feature representative vector: FIG. 7 shows the feature representative vector of speaker X, and FIG. 8 shows that of speaker Y. FIG. 9 shows the feature representative vectors of speaker X and speaker Y superimposed.

As is clear from these figures, a dimension in which the feature representative vector varies widely between speakers is a dimension that expresses the difference between speakers more prominently, whereas a dimension in which it varies little is one in which speakers differ little. Accordingly, the reduction rate a_d is set to a value close to 0.0 for dimensions with large variation and to a value close to 1.0 for dimensions with small variation. The reduction rate a_d is calculated, for example, by the following equation (2).

[Equation (2)]

Here, σ_d (d = 1, 2, ..., D) is the standard deviation, for each dimension, of the feature representative vectors across speakers, and indicates in which dimensions of the feature representative vector speaker characteristics tend to appear; the larger the value, the more readily speaker characteristics appear in that dimension. p is an adjustment parameter that makes the dimension average of the reduction rates a_d equal to B. Its value is not specified directly but is given by the following equation.

[Equation for p]

In this equation, B is the dimension average of the reduction rates a_d and is a parameter for reducing the inclusion of speaker feature vectors; it is a scalar, distinct from the weight matrix B of equation (1). The larger the value of B, the weaker the inclusion-reducing effect. Conversely, the smaller the value, the stronger the effect; but if the value is made too small, the increase in VQ distortion caused by variation in the speaker's feature representative vector becomes pronounced, and false rejection of the genuine speaker becomes more likely. B is preferably set to a value around 0.5, for example, and excessive correction caused by too small a value is best avoided.

q is a parameter that controls how strongly the per-dimension differences in the standard deviation σ_d are emphasized; the emphasis is greatest when q = 0.0. When q = B, the per-dimension differences vanish and the reduction rate takes the constant value B in every dimension (in which case p = 0.0).
An appropriate value of the standard deviation σ_d can be set by analyzing, in advance, feature vectors of multiple speakers in multiple environments. If the calculated value is obtained by analyzing speech from a sufficiently large number of speakers and noise environments, it can be regarded as a universal property of the feature vectors extracted from speech signals. In other words, the value need only be calculated in advance and does not have to be recalculated afterwards for each usage environment. Furthermore, because cepstral coefficients such as the LPC cepstrum are used for the feature vectors, differences in the speech input system, including the microphone, between the analysis environment and the actual environment appear as a linear difference in the feature vectors. Such differences in the speech input system are therefore absorbed when the standard deviation σ_d of the feature representative vectors across speakers is calculated.
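The offline preparation of σ_d and of a set of reduction rates might look like the sketch below. The standard deviation part follows the description directly. The mapping from σ_d to a_d does not: equation (2) and the equation for p are not reproduced in this text, so `reduction_rates` uses a hypothetical stand-in, a_d = q + p·(1/σ_d), chosen only because it reproduces the stated properties (a_d decreases as σ_d grows, the dimension average equals B before clamping, and q = B gives p = 0 and a constant rate B). It should not be read as the patent's formula.

```python
import math
from typing import List, Sequence


def per_dimension_std(rep_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Population standard deviation of the speakers' representative vectors, per dimension."""
    count = len(rep_vectors)
    dim = len(rep_vectors[0])
    stds = []
    for d in range(dim):
        mean = sum(v[d] for v in rep_vectors) / count
        variance = sum((v[d] - mean) ** 2 for v in rep_vectors) / count
        stds.append(math.sqrt(variance))
    return stds


def reduction_rates(stds: Sequence[float], b: float, q: float) -> List[float]:
    """Hypothetical stand-in for equation (2): a_d = q + p * (1 / sigma_d).

    p is derived so that the dimension average of a_d equals b.
    Assumes every sigma_d is positive. Values are clamped to [0.0, 1.0],
    which can shift the average away from b when clamping takes effect.
    """
    weights = [1.0 / s for s in stds]
    mean_weight = sum(weights) / len(weights)
    p = (b - q) / mean_weight
    return [min(1.0, max(0.0, q + p * w)) for w in weights]


if __name__ == "__main__":
    sigma = per_dimension_std([[0.0, 1.0], [2.0, 5.0], [4.0, 9.0]])
    print(sigma)
    print(reduction_rates(sigma, b=0.5, q=0.0))
```

Because σ_d is treated as a property of the feature type rather than of the deployment, a table of rates like this would be computed once and stored, not recomputed at verification time.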

Next, an embodiment of the speaker verification device according to the present invention will be described.
FIG. 1 is a block diagram showing its functional configuration. The speaker verification device includes a speech feature generation processing unit 1. The speech feature generation processing unit 1 generates feature vectors from the input speech of a speaker, finds the center of gravity of the generated feature vector set, and uses this center of gravity as the feature representative vector.

FIG. 2 is a block diagram showing the functional configuration of the speech feature generation processing unit 1. The speech feature generation processing unit 1 includes a preprocessing unit 11, an LPC coefficient calculation unit 12, an LPCC generation unit 13, and a speaker feature representative vector generation unit 14.
The preprocessing unit 11 applies analog-to-digital (A/D) conversion and noise suppression processing to the input speech signal and then sets a speech analysis interval. It cuts the speech waveform within this interval with an analysis window of fixed duration at a fixed shift period, and generates and holds the resulting speech frames.
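The framing step of the preprocessing unit can be illustrated with a short sketch. It assumes the samples are already digitized and noise-suppressed, and it uses a Hamming window; the text does not name the window shape, frame length, or shift, so those choices and the function name `split_into_frames` are assumptions.

```python
import math
from typing import List, Sequence


def split_into_frames(
    samples: Sequence[float], frame_len: int, shift: int
) -> List[List[float]]:
    """Cut fixed-length frames at a fixed shift period and apply an analysis window.

    The Hamming window is an assumed choice; trailing samples that do not
    fill a whole frame are dropped.
    """
    if frame_len < 2 or shift < 1:
        raise ValueError("frame_len must be >= 2 and shift must be >= 1")
    window = [
        0.54 - 0.46 * math.cos(2.0 * math.pi * n / (frame_len - 1))
        for n in range(frame_len)
    ]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, shift):
        frames.append([samples[start + n] * window[n] for n in range(frame_len)])
    return frames


if __name__ == "__main__":
    frames = split_into_frames([1.0] * 10, frame_len=4, shift=2)
    print(len(frames), frames[0])
```

Overlapping frames (a shift shorter than the frame length) are typical for short-time speech analysis, which is why the two parameters are kept separate here.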

The LPC coefficient calculation unit 12 uses linear predictive coding (LPC) to extract, from each speech frame formed by the preprocessing unit 11, feature quantities related to the speaker-individuality information contained in the speech signal. The LPCC generation unit 13 generates feature vectors (LPC cepstrum; the parameters of first order and higher, excluding the zeroth-order power term) from the feature quantities extracted by the LPC coefficient calculation unit 12. The speaker feature representative vector generation unit 14 calculates the center of gravity of the feature vector set generated by the LPCC generation unit 13 and uses it as the feature representative vector.
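The conversion from LPC coefficients to an LPC cepstrum can be sketched with the standard recursion for an all-pole model. The sketch assumes the predictor convention s[n] ≈ Σ α_k·s[n−k], takes the LPC coefficients as given (their estimation, for example by the autocorrelation method, is omitted), and returns only the coefficients of order 1 and higher, matching the statement that the zeroth-order power term is excluded. The function name `lpc_to_cepstrum` is hypothetical, and the text does not state which recursion or sign convention the device actually uses.

```python
from typing import List, Sequence


def lpc_to_cepstrum(alpha: Sequence[float], n_ceps: int) -> List[float]:
    """Convert predictor coefficients alpha_1..alpha_p into cepstral coefficients c_1..c_n_ceps.

    Model: H(z) = 1 / (1 - sum_k alpha_k z^-k).
    Recursion: c_n = alpha_n + sum_{k=1}^{n-1} (k / n) * c_k * alpha_{n-k},
    with alpha_n taken as 0 for n > p. The power term c_0 is not computed.
    """
    order = len(alpha)
    ceps = [0.0] * (n_ceps + 1)  # index 0 is unused
    for n in range(1, n_ceps + 1):
        acc = alpha[n - 1] if n <= order else 0.0
        # Only terms with 1 <= n - k <= order contribute.
        for k in range(max(1, n - order), n):
            acc += (k / n) * ceps[k] * alpha[n - k - 1]
        ceps[n] = acc
    return ceps[1:]


if __name__ == "__main__":
    # Single-pole model with pole at 0.5: c_n should equal 0.5**n / n.
    print(lpc_to_cepstrum([0.5], 4))
```

A single-pole model is a convenient check because its cepstrum has the closed form c_n = r^n / n.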

The speaker verification device also includes a vector quantization unit 2 and a speaker-specific codebook database 3 as functions for speaker registration. While the speaker registration mode is selected, the vector quantization unit 2 takes in the feature vector set generated by the speech feature generation processing unit 1 together with its feature representative vector, vector-quantizes the feature vectors, and outputs code vectors. As shown for example in FIG. 3, the speaker-specific codebook database 3 stores the code vectors generated by the vector quantization unit 2 and the feature representative vector generated by the speech feature generation processing unit 1 in association with a speaker name ID entered through speaker name input means (not shown).
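Registration can be sketched as codebook training followed by storage under the speaker's ID. The text does not say how the vector quantizer is trained, so the plain k-means (Lloyd) iteration below is an assumed choice, as are the deterministic initialization, the dictionary layout, and the names `train_codebook` and `register_speaker`.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_codebook(
    vectors: Sequence[Sequence[float]], size: int, iterations: int = 10
) -> List[Vector]:
    """Assumed training method: k-means, initialized with the first `size` vectors."""
    codebook = [list(v) for v in vectors[:size]]
    for _ in range(iterations):
        cells: List[List[Sequence[float]]] = [[] for _ in codebook]
        for v in vectors:
            nearest = min(
                range(len(codebook)), key=lambda j: squared_distance(v, codebook[j])
            )
            cells[nearest].append(v)
        for j, cell in enumerate(cells):
            if cell:  # an empty cell keeps its previous code vector
                codebook[j] = [sum(column) / len(cell) for column in zip(*cell)]
    return codebook


def register_speaker(
    database: Dict[str, Dict[str, object]],
    speaker_id: str,
    vectors: Sequence[Sequence[float]],
    size: int,
) -> None:
    """Store code vectors and the representative vector (centroid) under the speaker ID."""
    representative = [sum(column) / len(vectors) for column in zip(*vectors)]
    database[speaker_id] = {
        "code_vectors": train_codebook(vectors, size),
        "representative": representative,
    }


if __name__ == "__main__":
    db: Dict[str, Dict[str, object]] = {}
    register_speaker(db, "speaker-X", [[0.0, 0.0], [10.0, 10.0], [0.0, 2.0], [10.0, 12.0]], 2)
    print(db["speaker-X"])
```

Note that the representative vector is computed from the full feature vector set rather than from the code vectors, as in the description of the speech feature generation processing unit 1.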

In addition, as functions required for speaker verification, the speaker verification device includes a conversion coefficient / threshold control unit 4, a conversion coefficient / threshold database 5, a first speaker feature conversion processing unit 6, a second speaker feature conversion processing unit 7, a VQ distortion calculation unit 8, a threshold determination processing unit 9, and an end determination unit 10.

The conversion coefficient / threshold control unit 4 reads, from the conversion coefficient / threshold database 5, a speaker feature conversion coefficient for reducing the false acceptance rate and the VQ distortion decision threshold that corresponds to it. The speaker feature conversion coefficient is the reduction rate a. The reduction rate a is a value prepared in advance according to equation (2) above, taking into account the per-dimension variation of the feature representative vectors between speakers. The threshold is set so that the genuine-speaker acceptance rate and the impostor rejection rate are equal. As shown for example in FIG. 4, the conversion coefficient / threshold database 5 stores the speaker feature conversion coefficients and decision thresholds calculated and set in this way.

While the speaker verification mode is selected, the first speaker feature conversion processing unit 6 reads the code vectors and the feature representative vector of each speaker from the speaker-specific codebook database 3. Using the conversion given by equation (1) and the conversion coefficient supplied by the conversion coefficient / threshold control unit 4, it then performs conversion processing that brings the code vectors closer to the feature representative vector.

While the speaker verification mode is selected, the second speaker feature conversion processing unit 7 takes in the speaker's feature vector set and its feature representative vector generated by the speech feature generation processing unit 1. Using the conversion given by equation (1) and the conversion coefficient supplied by the conversion coefficient / threshold control unit 4, it then performs conversion processing that brings the speaker's feature vector set closer to the feature representative vector.

The VQ distortion calculation unit 8 takes in the converted code vectors and the converted speaker feature vectors from the first and second speaker feature conversion processing units 6 and 7, respectively, and calculates the VQ distortion between them.
The threshold determination processing unit 9 compares the VQ distortion calculated by the VQ distortion calculation unit 8 with the threshold supplied by the conversion coefficient / threshold control unit 4 and outputs a flag signal representing the comparison result.
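The distortion calculation and threshold decision can be sketched as below. VQ distortion is taken here to be the mean, over the input feature vectors, of the squared Euclidean distance to the nearest code vector; the text does not state the distance measure or whether a distortion equal to the threshold counts as acceptance, so both are assumptions, as are the function names.

```python
from typing import Sequence


def vq_distortion(
    features: Sequence[Sequence[float]], codebook: Sequence[Sequence[float]]
) -> float:
    """Mean squared Euclidean distance from each feature vector to its nearest code vector."""
    total = 0.0
    for v in features:
        total += min(
            sum((x - y) ** 2 for x, y in zip(v, code)) for code in codebook
        )
    return total / len(features)


def accept_flag(distortion: float, threshold: float) -> bool:
    """True when the distortion does not exceed the threshold (assumed convention)."""
    return distortion <= threshold


if __name__ == "__main__":
    d = vq_distortion([[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [2.0, 2.0]])
    print(d, accept_flag(d, 1.5))
```

A small distortion means the input vectors lie close to the registered code vectors, which is why the flag indicates the genuine speaker when the distortion is at or below the threshold.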

The end determination unit 10 determines, from the flag signal output by the threshold determination processing unit 9, whether the speaker being verified is the genuine speaker. If so, it sends a conversion coefficient change control signal to the conversion coefficient / threshold control unit 4 to have the conversion coefficient and threshold updated, so that the conversion by the first and second speaker feature conversion processing units 6 and 7 and the VQ distortion decision by the threshold determination processing unit 9 are executed repeatedly. Once this repetition has been executed a preset number of times, the overall decision is output as verification decision information.

Next, the operation of the device configured as described above will be described.
First, before verification, the voice features of each speaker to be verified, that is, each genuine speaker, are registered. When the genuine speaker inputs his or her voice through a microphone, the speech feature generation processing unit 1 converts the input speech into speech frames and performs LPC analysis on each frame, thereby extracting the feature quantities of the input speech. A set of feature vectors is generated from these feature quantities, and the centroid vector of the generated feature vector set becomes its feature representative vector.

The generated feature vector set is vector-quantized by the vector quantization unit 2 and then stored in the speaker-specific codebook database 3, together with the feature representative vector, in association with the speaker name ID. For the other speakers to be verified, a speech feature vector set and its feature representative vector are generated in the same way, and the code vectors of the feature vector set and the feature representative vector are stored in the speaker-specific codebook database 3.
The conversion coefficient / threshold database 5 stores sets of conversion coefficients and thresholds obtained beforehand through preliminary experiments.

Once registration in the databases 3 and 5 has been completed as described above, speaker verification proceeds as follows. When the speech of the speaker to be verified is input, the speech feature generation processing unit 1 generates the set of feature vectors of the input speech and its feature representative vector, and these are input to the second speaker feature conversion processing unit 7. Using the conversion given by equation (1) and the conversion coefficient supplied by the conversion coefficient / threshold control unit 4, the second speaker feature conversion processing unit 7 performs conversion processing that brings the speaker's feature vector set closer to its feature representative vector.

In parallel with this, the first speaker feature conversion processing unit 6 reads the code vectors and their feature representative vector for each speaker from the speaker-specific codebook database 3. Using the conversion given by equation (1) and the conversion coefficient supplied by the conversion coefficient / threshold control unit 4, it performs conversion processing that brings the code vectors closer to their feature representative vector.

The VQ distortion calculation unit 8 calculates the VQ distortion between the converted feature vector set of the speaker and the code vectors stored in the speaker-specific codebook database 3 (after the conversion described above), and the threshold determination processing unit 9 compares the calculated VQ distortion with the threshold. The result of this decision is output to the end determination unit 10 as a flag signal.

The end determination unit 10 determines from the flag signal whether the speaker being verified is the genuine speaker. If the speaker is judged to be genuine, the end determination unit 10 sends a conversion coefficient change control signal to the conversion coefficient / threshold control unit 4, which then updates the conversion coefficient and threshold. The update shifts the conversion coefficient and threshold by a fixed amount in the direction that reduces the false acceptance rate. The first and second speaker feature conversion processing units 6 and 7 then convert the code vectors and the speaker's feature vector set, respectively, using the updated conversion coefficient. The VQ distortion calculation unit 8 calculates the VQ distortion between the converted code vectors and the converted feature vector set, and the threshold determination processing unit 9 compares it with the threshold.

The end determination unit 10 again determines, from the flag signal representing this comparison result, whether the speaker being verified is the genuine speaker, and if so, sends another conversion coefficient change control signal to the conversion coefficient / threshold control unit 4. The first and second speaker feature conversion processing units 6 and 7 then convert the code vectors and the speaker's feature vector set using the further updated conversion coefficient, and the threshold determination processing unit 9 compares the VQ distortion between the converted vectors with the threshold.

Thereafter, in the same way, each time the speaker to be verified is determined to be the genuine user, the conversion coefficient and the threshold value are updated step by step in the direction that reduces the false acceptance rate, and the series of verification steps, from the conversion of the code vectors and the speaker's feature vector set through to the VQ distortion comparison, is repeated with the updated values. When the number of repetitions reaches a preset count of one or more and the final comparison result at that point indicates the genuine user, a determination result stating that the speaker is the genuine user is output. If, on the other hand, the speaker is determined to be another person partway through the repetition, a determination result stating that the speaker is another person is output. In this case, the device may instead judge that verification is not possible and prompt the user to restart the verification process from the beginning.
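
The repeated verification described above can be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the function name `verify_iteratively`, the callable `distortion_at`, and the representation of the updates as a precomputed list of (coefficient, threshold) pairs are hypothetical, and the rule that a distortion at or below the threshold means "genuine user" is assumed rather than taken from the text.

```python
def verify_iteratively(distortion_at, schedule):
    """Return True only if every repetition accepts the speaker.

    distortion_at: hypothetical callable that maps a conversion coefficient
        to the VQ distortion measured after conversion with that coefficient.
    schedule: list of (coefficient, threshold) pairs, one per repetition,
        assumed to be ordered so that later repetitions are stricter.
    """
    if not schedule:
        raise ValueError("at least one repetition is required")
    for coefficient, threshold in schedule:
        distortion = distortion_at(coefficient)
        # Assumption: a distortion at or below the threshold means "genuine".
        if distortion > threshold:
            # Rejected partway through: report "another person".
            return False
    # Accepted at every repetition, including the final one.
    return True
```

A caller would supply the real distortion measurement; returning `False` could equally be mapped to a "please retry" prompt, as the paragraph above allows.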

As described above, in this embodiment the speech feature generation processing unit 1 extracts feature quantities from the input speech to generate feature vectors, and generates a feature representative vector by finding the centroid of the generated feature vector set. In verification mode, the first and second speaker feature conversion processing units 6 and 7 use the conversion coefficient supplied by the conversion coefficient/threshold control unit 4 to move the code vectors and the speaker's feature vector set closer to their respective feature representative vectors. The threshold determination processing unit 9 compares the VQ distortion between the converted vectors with the threshold value, and the identity of the speaker is determined from the result.
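
As a minimal sketch of the processing summarized above, the following Python fragment computes a centroid as the feature representative vector, moves each vector toward its representative vector, and measures the VQ distortion. The scalar reduction rate `a`, the conversion x' = c + a(x - c), the use of the mean squared Euclidean distance to the nearest code vector as the VQ distortion, and all function names are assumptions made for illustration; the conversion actually used in the embodiment may be more general.

```python
def centroid(vectors):
    """Feature representative vector: per-dimension mean of the vector set."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def shrink_toward(vectors, representative, a):
    """Move every vector toward the representative vector.

    a = 1.0 leaves the vectors unchanged; smaller a shrinks the distribution
    (assumed form of the conversion, not taken verbatim from the source).
    """
    return [[c + a * (x - c) for x, c in zip(v, representative)]
            for v in vectors]


def vq_distortion(features, codebook):
    """Mean squared distance from each feature vector to its nearest code vector."""
    def sq_dist(u, v):
        return sum((p - q) ** 2 for p, q in zip(u, v))
    return sum(min(sq_dist(f, c) for c in codebook) for f in features) / len(features)


# Toy data (illustrative only): enrolled code vectors and a probe utterance.
enrolled = [[0.0, 0.0], [2.0, 0.0]]
probe = [[4.0, 0.0], [6.0, 0.0]]
before = vq_distortion(probe, enrolled)
after = vq_distortion(shrink_toward(probe, centroid(probe), 0.5),
                      shrink_toward(enrolled, centroid(enrolled), 0.5))
```

In this toy case the probe belongs to a different speaker, and the distortion after conversion (`after`) is larger than before (`before`), which is the kind of separation the conversion is meant to sharpen.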

Consequently, even when the input speech is distorted by background noise or by the noise suppression applied to it, or when the speaker's utterance content differs, the spread of the code vectors stored in the speaker-specific codebook database 3 and the distribution of the feature vectors of the speaker to be verified are both reduced. This reduces the extent to which the feature vector set of the speaker to be verified is contained within the registered speaker's code vectors. As a result, the difference in vector quantization distortion between the registered speaker's code vectors and the feature vectors of the speaker to be verified becomes more pronounced, which makes it possible to improve verification accuracy.

In other words, the device of this embodiment is highly robust in noisy environments. As shown in FIG. 11, feature vectors obtained in a noisy environment tend, compared with those of clean speech, to approach the distribution of noise-only feature vectors. This is intuitive: the louder the noise, the more the speech is buried in it, until only the noise can be heard. As a result, speaker verification performance degrades in noisy environments. To address this problem, noise suppression is generally applied to the noisy speech. However, as illustrated in FIG. 11, noise suppression can over-suppress the signal, which widens the distribution of the feature vectors and, as a side effect, leads to the containment of distributions shown in FIG. 5.

In contrast, the embodiment of the present invention applies to the feature vectors a conversion that shrinks their distribution. Even when noise suppression is applied, this mitigates the side effect described above, namely the accentuated containment between feature vector distributions, and as a result reduces the performance degradation caused by background noise.

Furthermore, in this embodiment the speech feature generation processing unit 1 uses only voiced sounds when generating the feature vector set and the feature representative vector of the input speech. Unvoiced sounds, which are easily affected by background noise, can therefore be excluded from feature extraction in advance, which reduces utterance dependency.
In addition, the speech feature generation processing unit 1 calculates the centroid of the feature vector set and uses it as the feature representative vector. Averaging cancels the shifts in individual feature vectors caused by utterance bias, and as a result reduces the variation of the feature representative vector.

In addition, in the embodiment described above, when the comparison between the VQ distortion and the threshold value indicates that the speaker is the genuine user, the series of verification steps, from the conversion of the code vectors and the speaker's feature vector set through to the VQ distortion comparison, is repeated multiple times while the conversion coefficient and the threshold value are updated step by step in the direction that reduces the false acceptance rate. Both the false rejection rate and the false acceptance rate can therefore be kept low.

FIG. 10 shows an example of the effect obtained with the device of this embodiment. It plots, as a line graph, the average misrecognition rate for three cases: no feature conversion, a reduction rate a that is the same for every vector dimension, and a reduction rate a that differs from dimension to dimension.
As FIG. 10 shows, using the conversion process of this embodiment reduces the containment of other speakers' feature vector distributions within the genuine user's codebook, and improves the average misrecognition rate compared with the case without conversion.
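
To make the distinction between a constant reduction rate and a per-dimension reduction rate concrete, a conversion with one rate per vector dimension might look like the Python sketch below. The function name `shrink_per_dimension`, the conversion form c + a_i (x_i - c_i), and the rate values used in the usage lines are hypothetical; the source does not give the rates that were used for FIG. 10.

```python
def shrink_per_dimension(vectors, representative, rates):
    """Shrink each dimension toward the representative vector with its own rate.

    rates[i] is the assumed reduction rate for dimension i; passing the same
    value for every dimension reproduces the constant-rate case.
    """
    if len(representative) != len(rates):
        raise ValueError("one reduction rate is needed per dimension")
    return [[c + a * (x - c) for x, c, a in zip(v, representative, rates)]
            for v in vectors]


# Illustrative values only: dimension 0 is halved, dimension 1 is quartered.
vectors = [[0.0, 0.0], [2.0, 4.0]]
representative = [1.0, 2.0]
converted = shrink_per_dimension(vectors, representative, [0.5, 0.25])
```

Using a smaller rate in a dimension shrinks the spread more strongly in that dimension, while the representative vector itself stays fixed.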

The present invention is not limited to the embodiment described above. For example, in the embodiment the feature vectors and the representative vector are generated only from the voiced sounds of the input speech. Alternatively, they may be generated from vowels only. Ordinary speech signals always contain vowels, and vowels strongly convey individual characteristics. Therefore, the temporal stationarity of vowels is used to extract the feature vectors that correspond to vowels, and the average of the code vectors obtained by quantizing the feature vectors of each vowel is used as the feature representative vector. This reduces the shift of the feature representative vector caused by utterance bias.

Besides the LPC cepstrum, the feature vectors may be generated from MFCC or from spectral analysis, and the verification model is not limited to VQ; the invention can also be applied to other verification models such as GMM.
In addition, the number of repetitions of the series of verification steps, from the conversion of the code vectors and the speaker's feature vector set through to the VQ distortion comparison, as well as the form of the conversion formula, the parameter values, and the like, may be modified in various ways without departing from the gist of the invention.

In short, the present invention is not limited to the embodiment exactly as described; at the implementation stage, the constituent elements can be modified and embodied without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the constituent elements disclosed in the embodiment. For example, some constituent elements may be removed from the full set shown in the embodiment. Furthermore, constituent elements from different embodiments may be combined as appropriate.

FIG. 1 is a functional block diagram showing an embodiment of the speaker verification device according to the present invention.
FIG. 2 is a functional block diagram showing the configuration of the speech feature generation processing unit of the speaker verification device shown in FIG. 1.
FIG. 3 is a diagram showing an example of the information elements stored in the speaker codebook database of the speaker verification device shown in FIG. 1.
FIG. 4 is a diagram showing an example of the information elements stored in the conversion coefficient/threshold database of the speaker verification device shown in FIG. 1.
FIG. 5 is a diagram showing an example of the feature vector space before conversion correction.
FIG. 6 is a diagram showing an example of the feature vector space after conversion correction.
FIG. 7 is a diagram showing the change of the feature representative vector of speaker X with respect to the vector dimension.
FIG. 8 is a diagram showing the change of the feature representative vector of speaker Y with respect to the vector dimension.
FIG. 9 is a diagram showing the variation of the feature representative vector between speakers with respect to the vector dimension.
FIG. 10 is a diagram showing the change in the misrecognition rate without conversion correction, with a setting that is constant across vector dimensions, and with settings that vary across vector dimensions in three patterns.
FIG. 11 is a diagram showing an example of the relationship between the feature vector distribution and the noise distribution.

Explanation of symbols

1: speech feature generation processing unit; 2: vector quantization unit; 3: speaker-specific codebook database; 4: conversion coefficient/threshold control unit; 5: conversion coefficient/threshold database; 6, 7: speaker feature conversion processing units; 8: VQ distortion calculation unit; 9: threshold determination processing unit; 10: end determination unit; 11: preprocessing unit; 12: LPC coefficient calculation unit; 13: LPCC generation unit; 14: speaker feature representative vector generation unit.

Claims

A speaker verification apparatus comprising:
feature vector generation means for dividing a speaker's input speech in time into a plurality of frames and generating a feature vector from each of the divided frames;
feature representative vector generation means for generating a feature representative vector that represents the plurality of feature vectors generated by the feature vector generation means;
a codebook that stores, for a speaker to be registered, a code vector obtained by vector quantization of a first feature vector generated by the feature vector generation means, and a first feature representative vector generated by the feature representative vector generation means;
means for performing, for a speaker to be verified, a conversion process that reduces the distance between a second feature vector generated by the feature vector generation means and a second feature representative vector generated by the feature representative vector generation means;
means for performing, on the code vector stored in the codebook, a conversion process that reduces the distance between the code vector and the first feature representative vector;
means for calculating a vector quantization distortion between the code vector after the conversion process and the second feature vector after the conversion process; and
determination means for comparing the calculated vector quantization distortion with a preset threshold value and outputting the comparison result as a speaker verification result.

The speaker verification apparatus according to claim 1, wherein the means for performing the conversion process obtains a matrix by multiplying a conversion matrix having a feature vector conversion function by the difference between a feature vector and a weighted feature representative vector, which is the feature representative vector multiplied by a first weight matrix, and performs the conversion by adding to this matrix a weighted feature representative vector, which is the feature representative vector multiplied by a second weight matrix.

The speaker verification apparatus according to claim 2, wherein, when the conversion matrix is a diagonal matrix, the means for performing the conversion process calculates each diagonal component of the diagonal matrix from an average value of the diagonal components and the variation of each dimension of the feature representative vector.

The speaker verification apparatus according to claim 1, wherein the feature vector generation means includes:
means for extracting voiced sound frames from the plurality of frames of the input speech; and
means for generating a feature vector from each extracted voiced sound frame.

The speaker verification apparatus according to claim 1, wherein the feature representative vector generation means includes:
means for extracting, from the plurality of feature vectors generated by the feature vector generation means, the feature vectors that consist only of voiced sound frames; and
means for calculating a feature representative vector from the extracted feature vectors consisting only of voiced sound.

The speaker verification apparatus according to claim 1, wherein the feature representative vector generation means includes:
means for extracting, from the plurality of feature vectors generated by the feature vector generation means, the feature vectors that correspond to vowels of the input speech; and
means for calculating a feature representative vector from the extracted feature vectors corresponding to vowels.

The speaker verification apparatus according to claim 1, wherein the feature representative vector generation means includes:
means for vector-quantizing each of the plurality of feature vectors generated by the feature vector generation means; and
means for calculating the centroid of the plurality of code vectors obtained by the vector quantization and using the calculated centroid as the feature representative vector.

The speaker verification apparatus according to claim 1, further comprising:
means for variably setting both the conversion coefficient used for the conversion process and the threshold value;
means for repeating, each time both the conversion coefficient and the threshold value are variably set, the conversion process for the code vector and the second feature vector, the calculation of the vector quantization distortion between the code vector and the second feature vector after the conversion process, and the comparison of the calculated vector quantization distortion with the threshold value; and
means for obtaining a verification result based on the plurality of determination results obtained by the repetition.