JP2000330590A

JP2000330590A - Method and system for collating speaker

Info

Publication number: JP2000330590A
Application number: JP11141172A
Authority: JP
Inventors: Shogo Nakamura; 尚五中村
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1999-05-21
Filing date: 1999-05-21
Publication date: 2000-11-30

Abstract

PROBLEM TO BE SOLVED: To provide a method and a system that can reliably verify a speaker using a technique that performs the verification in the time domain. SOLUTION: The speaker verification system comprises a band division unit 1 that divides input speech (sampled input speech) into frequency bands; a non-linear compression unit 2 that non-linearly compresses the speech signal of a predetermined band among the band-divided signals; a vector-sequencing unit 3 that forms a vector sequence from the non-linearly compressed speech signal; a dictionary creation unit 4 that creates a dictionary vector sequence as a dictionary; and a speaker determination unit 5 that verifies (determines) the speaker by pattern matching between the vector sequence of the speech to be verified and the dictionary vector sequence of the dictionary creation unit 4.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a speaker verification method and a speaker verification system.

[0002]

[Description of the Related Art] HMMs, which use a probabilistic model, are well known as a speaker verification technique, but no definitive method has yet been established. In particular, verification in the time domain has been very difficult, because speech varies from utterance to utterance even for the same speaker.

[0003]

[Problems to Be Solved by the Invention] An object of the present invention is to provide a speaker verification method and a speaker verification system that can reliably verify a speaker using a technique that performs the verification in the time domain.

[0004]

[Means for Solving the Problems] To achieve the above object, the invention according to claim 1 comprises: a non-linear compression step of non-linearly compressing sampled input speech to convert it into a low-resolution speech signal; a vector-sequencing step of forming a vector sequence from the non-linearly compressed speech signal; and a verification and determination step of performing pattern matching between a dictionary vector sequence created in advance in a dictionary and the vector sequence of the speech to be verified, thereby verifying and determining the speaker.

[0005] According to the invention of claim 2, in the speaker verification method of claim 1, the non-linear compression step gathers an appropriate number of sample values of the sampled input speech into one group, normalizes all sample values in the group by the maximum absolute value of the sample values in that group, and then converts the normalized sample values into a low-resolution speech signal with a non-linear quantization function.

[0006] According to the invention of claim 3, in the speaker verification method of claim 1 or claim 2, the vector-sequencing step treats the speech signal as a vector sequence by making each group correspond to one vector. In the verification and determination step, because the same speech uttered by the same speaker contains similar speech patterns, the corresponding vector sequences also contain many similar vectors, and similar vectors are vectors with a high correlation value. Vector sequences that contain many vectors with a high correlation value are therefore judged to be similar speech, and this judgment is used for the speaker verification decision.

[0007] According to the invention of claim 4, in the speaker verification method of any one of claims 1 to 3, when the vector sequence of the speech to be verified is formed, the input speech is shifted one bit at a time and a corresponding vector sequence is created for each shift. The speaker verification decision is made by matching each shifted vector sequence against the dictionary vector sequence in turn.

[0008] The invention of claim 5 comprises: a band division unit that divides sampled input speech into frequency bands; a non-linear compression unit that non-linearly compresses the speech signal of a predetermined band among the band-divided speech signals; a vector-sequencing unit that forms a vector sequence from the non-linearly compressed speech signal; a dictionary creation unit that creates a dictionary vector sequence as a dictionary; and a speaker determination unit that performs pattern matching between the vector sequence of the speech to be verified and the dictionary vector sequence created by the dictionary creation unit 4, thereby verifying and determining the speaker.

[0009] According to the invention of claim 6, in the speaker verification system of claim 5, the dictionary creation unit computes the correlation value of each vector between the vector sequences corresponding to the individual utterances of the same speech by the same speaker, and counts the vectors whose correlation value is at or above a predetermined threshold. It does this for every vector sequence, and takes the vector sequence with the largest total count as the dictionary vector sequence for that speech. The dictionary vector sequences created in this way from the speech of the individual speakers are registered in the dictionary.

[0010] According to the invention of claim 7, in the speaker verification system of claim 5, the non-linear compression unit gathers an appropriate number of sample values of the sampled input speech into one group, normalizes all sample values in the group by the maximum absolute value of the sample values in that group, and then converts the normalized sample values into a low-resolution speech signal with a non-linear quantization function.

[0011] According to the invention of claim 8, in the speaker verification system of claim 5, the speaker determination unit computes, vector by vector, the correlation values between the vector sequence of the speech to be verified and each of the dictionary vector sequences registered in the dictionary for a plurality of speakers, and counts the vectors whose correlation value is at or above a predetermined threshold. The speaker whose dictionary vector sequence gives the largest count is determined to be the speaker of the speech to be verified.

[0012] According to the invention of claim 9, in the speaker verification system of claim 5, when the vector sequence of the speech to be verified is formed, the input speech is shifted one bit at a time and a corresponding vector sequence is created for each shift. The speaker verification decision is made by matching each shifted vector sequence against the dictionary vector sequence in turn.

[0013]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings. Fig. 1 shows a configuration example of a speaker verification system according to the present invention. Referring to Fig. 1, the speaker verification system has a band division unit 1 that divides input speech (sampled input speech) into frequency bands; a non-linear compression unit 2 that non-linearly compresses the speech signal of a predetermined band among the band-divided speech signals; a vector-sequencing unit 3 that forms a vector sequence from the non-linearly compressed speech signal; a dictionary creation unit 4 that creates a dictionary vector sequence as a dictionary; and a speaker determination unit 5 that verifies (determines) the speaker by pattern matching between the vector sequence of the speech to be verified and the dictionary vector sequence of the dictionary creation unit 4.

[0014] In the present invention, the non-linear compression unit 2 non-linearly compresses speech signal values quantized with M bits and converts them into a speech signal with fewer bits (m bits). For the same speech uttered by the same speaker, this suppresses variation in the amplitude direction and yields time series with high similarity.

[0015] Let the i-th set of n time-series values, each of m bits, be one block (Bi). Block (Bi) can then be expressed as in the following equation (Equation 1). That is, block (Bi) consists of n m-bit signals.

[0016]

[Equation 1] $B_i = \{m\ \text{bits},\ m\ \text{bits},\ \ldots,\ m\ \text{bits}\}$

[0017] If the block (Bi) is regarded as one vector, the converted digital speech signal can be regarded as a sequence of n-dimensional vectors. That is, the speech signal can be expressed as an n-dimensional vector sequence.

[0018] The dictionary creation unit 4 creates the dictionary vector sequences used for speaker verification by the method described above and stores them in advance as a dictionary. When speech to be verified is input, the speaker determination unit 5 forms the corresponding vector sequence by the same procedure and finds the most similar one among the dictionary vector sequences stored in the dictionary for the respective speakers, thereby verifying the speaker.

[0019] In the speaker verification system of Fig. 1, the band division unit 1 divides, for example, speech sampled at 16 kHz (the input speech) equally into eight channels. Speaker verification uses the component in the lowest frequency band [0 to 1 kHz] and, where needed, other characteristic band components.
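As a quick check of the numbers in this paragraph, the following minimal sketch computes the band edges of an equal split. It assumes the eight channels span 0 Hz up to the Nyquist frequency (half the sampling rate), which is consistent with the stated lowest band of 0 to 1 kHz; the function name `band_edges` is hypothetical and not part of the source.

```python
def band_edges(sample_rate_hz, channels):
    """Equal-width bands covering 0 Hz up to the Nyquist frequency.

    Assumption: the split covers the whole representable range
    (0 .. sample_rate / 2); the source states only "eight equal channels".
    """
    nyquist = sample_rate_hz / 2
    width = nyquist / channels
    return [(k * width, (k + 1) * width) for k in range(channels)]


# 16 kHz sampling, eight channels: each band is 1 kHz wide.
edges = band_edges(16000, 8)
```

Under that assumption, `edges[0]` is the 0 to 1 kHz band used for verification.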

[0020] The following sections describe, for the speaker verification system of Fig. 1, how the non-linear compression unit 2 compresses the speech amplitude, how the vector-sequencing unit 3 forms the vector sequence, how the dictionary creation unit 4 creates the dictionary (dictionary vector sequence), and how the speaker determination unit 5 verifies and determines the speaker.

[0021] First, the processing in the non-linear compression unit 2 is described. As shown for example in Fig. 2(a), the non-linear compression unit 2 gathers the speech signal quantized with M bits (the sample values of the input speech) into sets of n temporally consecutive samples and treats each set as one block (group). One block (one group) therefore consists of n consecutive M-bit speech signal values Ai. With the signal divided into blocks in this way, the non-linear compression unit 2 normalizes all n sample values in a block by the maximum absolute value among the n sample values in that block. The non-linear compression unit 2 then converts the normalized n sample values, using a predetermined non-linear quantization function, into a sequence of n m-bit signals (a low-resolution signal sequence), as shown in Fig. 2(b).
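The per-block normalization and quantization can be sketched as follows. This is only an illustration: the source does not specify the non-linear quantization function f, so a mu-law style curve is assumed here, and the names `compress_block`, `m_bits`, and `mu` are hypothetical.

```python
import math


def compress_block(samples, m_bits=4, mu=255.0):
    """Normalize one block by its peak magnitude, then quantize to m_bits.

    Assumption: a mu-law curve stands in for the unspecified
    "predetermined non-linear quantization function" f.
    """
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return [0] * len(samples)       # silent block: nothing to normalize
    levels = 2 ** (m_bits - 1) - 1      # 7 signed levels for 4 bits
    out = []
    for s in samples:
        x = s / peak                    # normalized sample in [-1, 1]
        y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
        out.append(int(round(y * levels)))
    return out
```

Because each block is divided by its own peak, two blocks that differ only by an overall gain map to the same low-resolution values, which is the amplitude-variation suppression the paragraph describes.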

[0022] The following equation (Equation 2) expresses the converted sequence Bi of n m-bit signals.

[0023]

[Equation 2] $B_i = \{b_{i1}, b_{i2}, \ldots, b_{in}\} = f\{A_i\}$

[0024] Here, f denotes the non-linear quantization function.

[0025] Thus, in the present invention, the non-linear compression unit 2 gathers an appropriate number of sample values of the sampled input speech into one group, normalizes all sample values in the group by the maximum absolute value of the sample values in that group, and then converts them into a low-resolution signal with a non-linear quantization function. This suppresses variation of the speech signal in the amplitude direction.

[0026] Next, the processing in the vector-sequencing unit 3 is described. If an arbitrary speech signal S sampled with M bits has length L, the number of blocks BN is given by the following equation (Equation 3).

[0027]

[Equation 3] $BN = \lfloor L/n \rfloor + 1$ (the integer part of L/n, plus 1)

[0028] In Equation 3, the +1 accounts for the case where dividing the length L of the speech signal S into segments of n leaves a remainder: zeros are appended so that the remainder forms one full block. Fig. 3 illustrates this remainder handling. When the length L is divided into segments of n, for example into five blocks as in Fig. 3, the last (fifth) block is only partly filled. In that case, zeros are appended to the end of the partial fifth block (it is padded with zeros) to complete five blocks. Since the number of blocks equals the number of vectors, an arbitrary speech signal S is expressed as a sequence of BN n-dimensional vectors, as in the following equation (Equation 4).

[0029]

[Equation 4] $S = \{B_1, B_2, \ldots, B_{BN}\}$

[0030] In this way, the vector-sequencing unit 3 forms an arbitrary speech signal S sampled with M bits into a vector sequence as in Equation 4.
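The blocking and zero padding of Equations 3 and 4 can be sketched in a few lines. The function name `to_vector_sequence` is hypothetical; the block count follows Equation 3 literally.

```python
def to_vector_sequence(signal, n):
    """Split a signal into BN = int(L / n) + 1 blocks of n samples each.

    The tail is padded with zeros so that every block has exactly n samples.
    """
    bn = len(signal) // n + 1                       # Equation 3
    padded = list(signal) + [0] * (bn * n - len(signal))
    return [padded[i * n:(i + 1) * n] for i in range(bn)]


# L = 9, n = 4: two full blocks plus one zero-padded block.
blocks = to_vector_sequence([1, 2, 3, 4, 5, 6, 7, 8, 9], 4)
```

Note that, taken literally, Equation 3 also adds a block when L is an exact multiple of n; in this sketch that extra block is all zeros.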

[0031] To create the dictionary data, the dictionary creation unit 4 has the same speaker utter the same speech several times and creates a vector sequence for each utterance. If the vector sequences corresponding to the individual utterances are T1, T2, ..., then T1, T2, ... are given by the following equations.

[0032]

[Equation 5]
$T1 = \{B_1^{(T1)}, B_2^{(T1)}, \ldots, B_{BN}^{(T1)}\}$
$T2 = \{B_1^{(T2)}, B_2^{(T2)}, \ldots, B_{BN}^{(T2)}\}$
$\ldots$

[0033] In general, T1, T2, ... differ in length.

[0034] Next, the correlation value of each vector is computed between all the different vector sequences, and the number of vectors whose correlation value is 0.9 or more is counted. First, for every vector of T1, the correlation values with the vectors of the other vector sequences are examined in turn, and the total count SUM1 of correlation values of 0.9 or more is obtained as in the following equation.

[0035]

[Equation 6]
$$
\begin{aligned}
\mathrm{SUM1} = {} & \mathrm{INT}(\langle B_1^{(T1)}, B_1^{(T2)}\rangle / 0.9) + \mathrm{INT}(\langle B_2^{(T1)}, B_1^{(T2)}\rangle / 0.9) + \cdots \\
& + \mathrm{INT}(\langle B_1^{(T1)}, B_2^{(T2)}\rangle / 0.9) + \mathrm{INT}(\langle B_2^{(T1)}, B_2^{(T2)}\rangle / 0.9) + \cdots \\
& + \mathrm{INT}(\langle B_{BN}^{(T1)}, B_{BN}^{(T2)}\rangle / 0.9) + \cdots
\end{aligned}
$$

[0036] Here, <x, y> denotes the correlation between x and y, and INT(*) denotes the integer part of *.

[0037] Similarly, for every vector of T2, the correlation values with the vectors of the other vector sequences are examined in turn, and the total count of correlation values of 0.9 or more is obtained. The same operation is carried out for all the vector sequences, and the vector sequence with the largest total count of correlation values of 0.9 or more is taken as the dictionary vector sequence for that speaker's speech.

[0038] In the low-frequency band in particular, the steady-state characteristics of vowels appear, so the same speech uttered by the same speaker contains similar components. Speech signals that have undergone the normalization and non-linear compression described above therefore have similar patterns.

[0039] In this way, the dictionary creation unit 4 computes the correlation value of each vector between the vector sequences corresponding to the individual utterances of the same speech by the same speaker (between all the different vector sequences) and counts the vectors whose correlation value is at or above a predetermined threshold (0.9 in the example above). It does this for every vector sequence and takes the vector sequence with the largest total count as the dictionary vector sequence, that is, the dictionary data, for that speaker's speech. The dictionary vector sequences created in this way from the speech of the individual (plural) speakers are registered in the dictionary.
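The selection of a dictionary vector sequence from several utterances can be sketched as follows. The source does not define how the correlation value is computed, so a normalized inner product is assumed here; the names `correlation`, `count_matches`, and `choose_dictionary` are hypothetical.

```python
import math


def correlation(x, y):
    """Normalized inner product of two equal-length vectors (assumed measure)."""
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    if nx == 0 or ny == 0:
        return 0.0                       # all-zero (padded) vectors never match
    return sum(a * b for a, b in zip(x, y)) / (nx * ny)


def count_matches(seq_a, seq_b, threshold=0.9):
    """Number of vector pairs, one from each sequence, at or above the threshold."""
    return sum(1 for a in seq_a for b in seq_b if correlation(a, b) >= threshold)


def choose_dictionary(utterances, threshold=0.9):
    """Index of the utterance whose vectors match the other utterances most often."""
    totals = [
        sum(count_matches(t, other, threshold)
            for j, other in enumerate(utterances) if j != i)
        for i, t in enumerate(utterances)
    ]
    return max(range(len(utterances)), key=totals.__getitem__)


# Three toy utterances of two 3-dimensional vectors each.
utterances = [
    [[1, 2, 3], [3, 2, 1]],
    [[1, 2, 3], [1, 2, 3.1]],
    [[-1, 5, -2], [1, 2, 3]],
]
best = choose_dictionary(utterances)
```

In this toy data the second utterance collects the most matches against the other two, so it would be the one registered.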

[0040] Next, the speaker verification and determination processing in the speaker determination unit 5 is described. In this processing, the correlation values between the input speech S (the vector sequence of the speech to be verified) and each of the dictionary vector sequences registered in the dictionary for a plurality of speakers are computed vector by vector in the same way as described above, and the vectors whose correlation value is at or above a predetermined threshold (for example, 0.9) are counted. The speaker whose dictionary vector sequence gives the largest count is determined to be the speaker of the speech to be verified. This works because speech from other speakers, processed as described above, is less similar, so similar patterns appear less often.
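The decision rule in this paragraph amounts to an argmax over the registered speakers. The sketch below again assumes a normalized inner product as the correlation measure, which the source does not specify; `identify_speaker` and the dictionary layout (speaker name mapped to a vector sequence) are hypothetical.

```python
import math


def correlation(x, y):
    """Normalized inner product of two equal-length vectors (assumed measure)."""
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    if nx == 0 or ny == 0:
        return 0.0
    return sum(a * b for a, b in zip(x, y)) / (nx * ny)


def match_count(probe, reference, threshold=0.9):
    """Count probe/reference vector pairs whose correlation reaches the threshold."""
    return sum(1 for a in probe for b in reference if correlation(a, b) >= threshold)


def identify_speaker(probe, dictionary, threshold=0.9):
    """Return the speaker whose dictionary vector sequence gives the largest count."""
    return max(dictionary, key=lambda name: match_count(probe, dictionary[name], threshold))


# Toy dictionary with two registered speakers.
dictionary = {
    "A": [[1, 2, 3], [2, 1, 0]],
    "B": [[-1, 1, -1], [0, 0, 1]],
}
probe = [[1, 2, 3.2], [2, 1, 0.1]]
speaker = identify_speaker(probe, dictionary)
```

Here the probe matches both of speaker A's vectors and none of speaker B's, so `speaker` is `"A"`.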

[0041] That is, in the present invention, the speech signal is treated as a vector sequence by making each group correspond to one vector. Because the same speech uttered by the same speaker contains similar speech patterns, the corresponding vector sequences also contain many similar vectors, and similar vectors are vectors with a high correlation value. Vector sequences that contain many vectors with a high correlation value are judged to be similar speech, and this judgment is used for the speaker verification decision.

[0042] Fig. 4 is a graph showing, for six speakers uttering the word "digital", the total number of correlation values of 0.9 or more obtained against speech from the same speaker and against speech from the other five speakers. In Fig. 4, as described later, the speech to be verified is shifted one point at a time to create a vector sequence and the correlation is computed for each, so the horizontal axis of the graph represents the number of shifts. The speech to be verified is shifted one point at a time to avoid the problem that it is generally not synchronized with the dictionary data.

[0043] The shifting of the speech to be verified, which is performed to alleviate this synchronization problem and the troublesome problem of speech segmentation, is described below with reference to the flowchart of Fig. 5. First, the shift count N is initialized to 1, and the number of correlation values of 0.9 or more between the speech to be verified (its vector sequence) S and the dictionary vector sequence is obtained (step S2). Next, the speech to be verified S is shifted by one sample (step S3), and processing returns to step S2, where the number of correlation values of 0.9 or more between the speech to be verified (its vector sequence) S and the dictionary vector sequence is obtained again. Steps S2 and S3 are repeated n times in this way, until a shift of n samples is complete (steps S4 and S5). The point at which the sum over the band components exceeds a predetermined threshold is then taken as the start of the speech (step S6). If, on the other hand, a band component is at or below a preset value relative to the sum obtained in step S6, that band is judged to contain no speech component.
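The shift loop of steps S2 to S5 can be sketched as follows. Several simplifications are assumptions rather than the source's method: the samples are taken as already compressed, an incomplete trailing block is dropped instead of zero-padded, and the correlation is a normalized inner product. The names `shifted_counts` and `correlation` are hypothetical.

```python
import math


def correlation(x, y):
    """Normalized inner product of two equal-length vectors (assumed measure)."""
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    if nx == 0 or ny == 0:
        return 0.0
    return sum(a * b for a, b in zip(x, y)) / (nx * ny)


def shifted_counts(samples, dictionary_seq, n, threshold=0.9):
    """Match counts for shifts of 0 .. n-1 samples (S1 .. Sn in the text)."""
    counts = []
    for shift in range(n):
        shifted = samples[shift:]
        # Re-block the shifted signal; an incomplete trailing block is dropped.
        blocks = [shifted[i:i + n] for i in range(0, len(shifted) - n + 1, n)]
        counts.append(sum(1 for a in blocks for b in dictionary_seq
                          if correlation(a, b) >= threshold))
    return counts


# The dictionary pattern starts two samples into the probe signal.
counts = shifted_counts([0, 0, 1, -1, 1, -1, 0, 0, 0], [[1, -1, 1, -1]], n=4)
```

In this toy example only the shift of two samples lines the block boundary up with the dictionary pattern, which shows why the unshifted sequence alone can miss a match.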

[0044] Thus, when a speech vector sequence is formed, shifting the input speech one bit at a time and creating a corresponding vector sequence for each shift makes it possible to improve the accuracy of speaker verification.

[0045] Figs. 6 to 10 show a concrete processing example.

[0046] Referring first to Fig. 6, the original speech signal is sampled at 16 kHz and quantized to 8 bits. In the band division unit 1, this speech signal (8-bit data sampled at 16 kHz) is band-limited to 1 kHz or below and converted by the non-linear quantization function into 4-bit data sampled at 2 kHz.

[0047] Next, as shown in Fig. 7, the 4-bit data sequence sampled at 2 kHz is arranged into blocks of 20 samples each. That is, a sequence of 20-dimensional vectors is formed. Because 20 points make up one block, speech with n blocks (that is, n vectors) is a speech signal of 20n points.

[0048] Figs. 8 to 10 illustrate the pattern matching performed for speaker verification. Fig. 8 shows the dictionary vector sequence {1B, 2B, 3B, ..., nB} of the speech of a plurality of speakers (n speakers) registered in the dictionary. One dictionary vector sequence (one block), for example 1B, has a duration of 10 ms and contains the 20 points {1B1, 1B2, ..., 1B20}.

[0049] Figs. 9 and 10 show how pattern matching is performed between the speech to be verified {1A, 2A, 3A, ..., jA} and the dictionary vector sequence {1B, 2B, 3B, ..., nB}. Referring to Fig. 9, the correlation between the first vector 1A of the speech to be verified and every vector {1B, 2B, 3B, ..., nB} of the dictionary vector sequence is computed first, and the number of correlation values of 0.9 or more is counted. In the example of Fig. 9, this count for the first vector 1A is four. Similarly, correlating the second vector 2A of the speech to be verified with every vector {1B, 2B, 3B, ..., nB} of the dictionary vector sequence gives a count of one. This processing is carried out in turn up to the last vector jA of the speech to be verified, and the counts of correlation values of 0.9 or more obtained for the first vector 1A through the last vector jA are summed. In Fig. 9, this sum S1 is shown as S1 = 12.

[0050] The vector sequence obtained from the speech to be verified is not synchronized with the dictionary vector sequence. The original speech {1A, 2A, 3A, ..., jA} is therefore shifted by one point and a new vector sequence is built as shown in Fig. 10. The total number of vectors (blocks) with a correlation value of 0.9 or more is then obtained in the same way as described above (as shown in Fig. 9). The sum obtained with a one-point shift is denoted S2, and the sum obtained with a 19-point shift is denoted S20.

[0051] Fig. 4 is an example in which the sums S1 to S20 are plotted for the speech of the word "digital". As Fig. 4 makes clear, the number of correlation values of 0.9 or more is higher for the same speaker than for the other speakers. Using this result, the following equation (Equation 7) is evaluated.

[0052]

[Equation 7]

[0053] Here, S(h)_i is the number of correlation values of 0.9 or more obtained for the h-th dictionary with a shift of i−1 points, and Th is a predetermined threshold. Equation 7 is evaluated, and the speaker corresponding to the h-th dictionary vector sequence that maximizes D_S is obtained as the verification result.

[0054] If the speech to be verified is a longer time series than the dictionary speech, the total number of vectors with a correlation value of 0.9 or more may increase. The speech to be verified is therefore truncated to the length of the dictionary speech.

[0055]

[Effects of the Invention] As described above, according to the inventions of claims 1 to 7, sampled input speech is non-linearly compressed and converted into a low-resolution speech signal, a vector sequence is formed from the non-linearly compressed speech signal, and pattern matching is performed between a dictionary vector sequence created in advance in the dictionary and the vector sequence of the speech to be verified in order to verify and determine the speaker. Speaker verification can therefore be performed reliably with a technique that performs the verification in the time domain.

[0056] In particular, according to the inventions of claims 2 and 7, the sample values of the sampled input voice are divided into groups of an appropriate size, all sample values in each group are normalized by the maximum absolute value of the sample values in that group, and the normalized sample values are then converted into a low-resolution signal by a non-linear quantization function. Fluctuation of the voice signal in the amplitude direction can therefore be suppressed.
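The effect of group-wise normalization on amplitude variation can be illustrated with a small sketch. The grouping and peak normalization follow the description above. The quantizer is a hypothetical example (square-root compression onto a few integer steps) and is not the non-linear quantization function of the patent. The handling of a short final group is likewise an assumption.

```python
import math

def normalize_groups(samples, group_size):
    """Split samples into groups and divide each group by its largest absolute value."""
    groups = []
    for k in range(0, len(samples), group_size):
        group = samples[k:k + group_size]  # the last group may be shorter
        peak = max(abs(x) for x in group)
        groups.append([x / peak if peak else 0.0 for x in group])
    return groups

def quantize(x, levels=3):
    """Hypothetical non-linear quantizer: square-root compression of a value in
    [-1, 1] onto integer steps in [-levels, levels]."""
    return int(math.copysign(round(levels * math.sqrt(abs(x))), x))

def compress(samples, group_size, levels=3):
    """Normalize each group by its peak, then quantize every normalized sample."""
    return [
        [quantize(x, levels) for x in group]
        for group in normalize_groups(samples, group_size)
    ]
```

Because each group is divided by its own peak before quantization, two groups that differ only by an overall gain produce the same low-resolution values.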

[0057] Further, according to the inventions of claims 4 and 9, when the vector sequence of the voice to be verified is formed, the input voice is shifted one bit at a time, a vector sequence is created for each shift, and each shifted vector sequence is compared in turn with the speaker vector sequence to determine the speaker. The accuracy of speaker verification can therefore be improved.

[Brief description of the drawings]

FIG. 1 is a diagram showing a configuration example of a speaker verification system according to the present invention.

FIG. 2 is a diagram for explaining the processing operation of the non-linear compression unit.

FIG. 3 is a diagram for explaining fraction processing.

FIG. 4 is a diagram showing the total number of vectors with a correlation value of 0.9 or more, obtained from the voice of the same speaker and from the voices of other speakers, when five speakers utter the word "digital".

FIG. 5 is a flowchart for explaining the operation of shifting the voice to be verified.

FIG. 6 is a diagram showing a specific processing example of the speaker verification method of the present invention.

FIG. 7 is a diagram showing a specific processing example of the speaker verification method of the present invention.

FIG. 8 is a diagram for explaining the pattern matching performed for speaker verification.

FIG. 9 is a diagram for explaining the pattern matching performed for speaker verification.

FIG. 10 is a diagram for explaining the pattern matching performed for speaker verification.

[Explanation of symbols]

1 Band division unit
2 Non-linear compression unit
3 Vector-sequencing unit
4 Dictionary creation unit
5 Speaker determination unit

Claims

[Claims]

1. A speaker verification method comprising: a non-linear compression step of non-linearly compressing a sampled input voice into a low-resolution voice signal; a vector-sequencing step of forming a vector sequence from the non-linearly compressed voice signal; and a verification-determination step of performing pattern matching between a dictionary vector sequence created in advance in a dictionary and a vector sequence of a voice to be verified, thereby verifying and determining the speaker.

2. The speaker verification method according to claim 1, wherein, in the non-linear compression step, the sample values of the sampled input voice are divided into groups of an appropriate size, all sample values in each group are normalized by the maximum absolute value of the sample values in that group, and the normalized sample values are then converted into a low-resolution voice signal by a non-linear quantization function.

3. The speaker verification method according to claim 1, wherein, in the vector-sequencing step, each group is made to correspond to one vector so that the voice signal is treated as a vector sequence, and wherein, in the verification-determination step, on the basis that identical utterances by the same speaker contain similar voice patterns, so that their vector sequences contain many similar vectors, and that similar vectors are vectors with a high correlation value, a vector sequence containing many vectors with a high correlation value is determined to be a similar voice, and this determination is used for the speaker verification decision.

4. The speaker verification method according to any one of claims 1 to 3, wherein, when the vector sequence of the voice to be verified is formed, the input voice is shifted one bit at a time, a vector sequence is created for each shift, and each shifted vector sequence is compared in turn with the dictionary vector sequence to determine the speaker.

5. A speaker verification system comprising: a band division unit that divides a sampled input voice into frequency bands; a non-linear compression unit that non-linearly compresses the voice signal of a predetermined band among the band-divided voice signals; a vector-sequencing unit that forms a vector sequence from the non-linearly compressed voice signal; a dictionary creation unit that creates a dictionary vector sequence as a dictionary; and a speaker determination unit that performs pattern matching between the vector sequence of a voice to be verified and the dictionary vector sequence created by the dictionary creation unit, thereby verifying and determining the speaker.

6. The speaker verification system according to claim 5, wherein the dictionary creation unit calculates, vector by vector, correlation values between the vector sequences corresponding to individual utterances of the same voice by the same speaker, counts for every vector sequence the total number of vectors whose correlation value is equal to or greater than a predetermined threshold, and takes the vector sequence with the largest such total as the dictionary vector sequence for that voice, and wherein the dictionary vector sequence created by the dictionary creation unit from each speaker's voice is registered in the dictionary.

7. The speaker verification system according to claim 5, wherein the non-linear compression unit divides the sample values of the sampled input voice into groups of an appropriate size, normalizes all sample values in each group by the maximum absolute value of the sample values in that group, and then converts the normalized sample values into a low-resolution voice signal by a non-linear quantization function.

8. The speaker verification system according to claim 5, wherein the speaker determination unit calculates, vector by vector, correlation values between the vector sequence of the voice to be verified and the dictionary vector sequence of each of a plurality of speakers registered in the dictionary, counts the total number of vectors whose correlation value is equal to or greater than a predetermined threshold, and determines the speaker corresponding to the dictionary vector sequence with the largest such total to be the speaker.

9. The speaker verification system according to claim 5, wherein, when the vector sequence of the voice to be verified is formed, the input voice is shifted one bit at a time, a vector sequence is created for each shift, and each shifted vector sequence is compared in turn with the dictionary vector sequence to determine the speaker.