JP2008224911A - Speaker recognition system - Google Patents

Speaker recognition system

Info

Publication number
JP2008224911A
JP2008224911A
Authority
JP
Japan
Prior art keywords
speaker
speech
phase
likelihood
power spectrum
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP2007061123A
Other languages
Japanese (ja)
Inventor
Seiichi Nakagawa
聖一 中川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toyohashi University of Technology NUC
Original Assignee
Toyohashi University of Technology NUC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toyohashi University of Technology NUC filed Critical Toyohashi University of Technology NUC
Priority to JP2007061123A priority Critical patent/JP2008224911A/en
Publication of JP2008224911A publication Critical patent/JP2008224911A/en
Pending legal-status Critical Current


Abstract

PROBLEM TO BE SOLVED: To provide a method of identifying or verifying a speaker in which not only power spectrum information but also phase information, which has not been used before, is applied to speaker recognition.
SOLUTION: Feature parameters representing the individuality of the speaker contained in an utterance are extracted. In a method of identifying which of the speaker models built in advance from the same feature parameters the utterance corresponds to (speaker identification), or a method of determining whether the utterance belongs to a particular person (speaker verification), an extraction method for phase feature parameters, which represent speaker characteristics contained in the speech waveform and have not been used before, is combined with a conventional feature parameter extraction method to identify or verify the speaker.
COPYRIGHT: (C)2008,JPO&INPIT

Description

The present invention relates to speaker identification and speaker verification for personal identification using speech.

Speaker recognition using speech includes a method of extracting the individuality of a speaker contained in an utterance and identifying which of the registered speakers produced it (speaker identification), and a method of determining whether an utterance belongs to a particular person (speaker verification). There is also text-dependent speaker recognition, in which the utterance content is predetermined and the user is required to speak that content, and text-independent speaker recognition, in which any content may be spoken. In general, the text-dependent type achieves higher performance, but if a user's voice is captured by a recording device, playing back the recording makes "spoofing" possible, which is a security risk. For this reason, there is a method of preventing spoofing by having the speaker recognition system or device specify, each time, the content the user should utter; this is called text-prompted speaker recognition. This patent concerns the extraction of feature parameters representing the individual characteristics contained in speech, which underlies all of these: speaker identification, speaker verification, text-independent, text-dependent, and text-prompted speaker recognition.

A typical conventional feature parameter represents the power spectrum information obtained for each short segment of speech; the mel-frequency cepstral coefficient (MFCC) is the de facto worldwide standard. Human hearing is sensitive to the power spectrum but insensitive to phase, and for speech in an ordinary quiet environment a phase difference cannot be perceived. For example, the sum of a 100 Hz sine wave and a 1100 Hz sine wave sounds exactly the same as the sum of a 100 Hz sine wave and a 1100 Hz cosine wave. Therefore, robust speech recognition can be achieved by using only power spectrum information and discarding the phase. Speaker recognition has been treated in exactly the same way, and phase information has not been used for it.

When the spectrum of a short segment of a speech waveform (for example, 32 milliseconds) is obtained by a Fourier transform, a real part and an imaginary part are obtained. The magnitude of this complex number at each frequency (the power spectrum) has been used as an important feature parameter. For example, if a 512-point segment (32 milliseconds) of a waveform sampled at 16 kHz is cut out and a discrete Fourier transform is applied, 256 complex spectral values are obtained. If each complex value is regarded as a vector, its magnitude (the sum of the squared real part and the squared imaginary part: the power spectrum) has been treated as important, while the angle of the vector (determined by the ratio of the imaginary part to the real part) has been ignored.
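For illustration, the following minimal NumPy sketch computes the two quantities contrasted above, the power spectrum and the phase angle, from one frame; the function and variable names are illustrative, not from the patent, and the 256 spectral values mentioned in the text correspond to the non-redundant bins of a 512-point transform.

```python
import numpy as np

def spectrum_and_phase(frame):
    """Return the power spectrum and raw phase of one 512-sample frame
    (32 ms at 16 kHz); names are illustrative."""
    spec = np.fft.rfft(frame)      # complex spectrum: 257 bins incl. DC and Nyquist
    power = np.abs(spec) ** 2      # real^2 + imag^2 -> the power spectrum
    phase = np.angle(spec)         # atan2(imag, real) -> the angle usually discarded
    return power, phase
```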

Considering the human vocal apparatus, the source waveform produced by vocal-fold vibration differs greatly from person to person and affects not only the power spectrum but also the phase. The vocal tract, which corresponds to an acoustic tube, determines the phonemic character of speech, and this has been said to be expressed by the power spectrum. However, different vocal tracts can correspond to the same power spectrum, and this difference is thought to appear as a phase difference.

Through experiments, it was found that speaker information exists not only in the power spectrum but also in the phase. Because the phase carries less speaker information than the power spectrum, it was found that, when used together with the power spectrum, the phase information complements it.

The problem to be solved is to apply to speaker recognition not only power spectrum information but also phase information, which has not been used before.

A problem in obtaining phase information is that the phase changes when the excised segment shifts, even within a stationary portion of the speech. When a 512-point segment is cut out, shifting the cut point by even one sample changes the phase of the resulting spectrum (although the power spectrum does not change). This problem is solved, for example, by choosing a reference frequency ω and computing the relative phase shift of every other frequency with respect to it. There are various ways to realize this; as one example, the phase at ω hertz is normalized to 45 degrees (= π/4), and the phases of the other frequencies ω′ are normalized accordingly. If the phase of ω is φ and the phase of another frequency ω′ is ψ, then ψ is converted to ψ + ω′ × (π/4 − φ)/ω. In this way the phase of ω′ is obtained as a value relative to the phase of ω.
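A minimal sketch of this normalization, assuming the per-bin phases and centre frequencies of one frame are available as NumPy arrays (names and the wrapping step are illustrative):

```python
import numpy as np

def normalize_phase(phase, freqs, ref_freq):
    """Shift all phases so that the reference frequency omega has phase pi/4.

    phase    -- per-bin phase angles (radians) of one frame
    freqs    -- centre frequency of each bin in Hz
    ref_freq -- the chosen base frequency omega in Hz
    """
    ref_bin = int(np.argmin(np.abs(freqs - ref_freq)))
    phi = phase[ref_bin]                                   # phase of omega
    # psi -> psi + omega' * (pi/4 - phi) / omega for every frequency omega'
    shifted = phase + freqs * (np.pi / 4 - phi) / ref_freq
    return np.angle(np.exp(1j * shifted))                  # wrap back into (-pi, pi]
```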

In this way, a 256-point phase sequence is obtained from, for example, a 512-point discrete Fourier transform. Using it directly as the feature parameters is conceivable, but the dimensionality usually has to be reduced because of the trade-off with the amount of training data. In the case of the power spectrum, the usual procedure is to convert it to log power, pass it through a bank of band-pass filters (typically 20 to 30), apply the cepstral transform, and use roughly the 12 lowest-order coefficients (mel-frequency cepstral coefficients, MFCC). The purpose of the cepstral transform is to deal with the fact that the power spectrum is multiplied by the characteristics of the sound source, so that harmonics of the fundamental frequency ride on it and the power spectrum representing the vocal-tract shape does not form a smooth spectral envelope. There are various ways to reduce the 256 phase feature parameters to about 12; as one example, the first 12, or the 5th through the 16th, can be used. The reason for not converting them to cepstra (although this may of course be done) is that, unlike the power spectrum, the phase parameters are little affected by the harmonics of the fundamental frequency.
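For reference, a compact sketch of the conventional MFCC pipeline summarized above (log power, mel-scaled filterbank, cepstral transform, low-order coefficients); the filterbank construction, the 24-filter count, and the choice to drop the 0th coefficient follow common practice and are assumptions, not prescribed by the patent.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_from_power(power, sample_rate=16000, n_filters=24, n_ceps=12):
    """MFCCs from one frame's power spectrum (rfft output, len = n_fft/2 + 1)."""
    n_fft = 2 * (len(power) - 1)
    # Triangular mel-spaced filterbank between 0 Hz and the Nyquist frequency
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bin_edges = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, len(power)))
    for i in range(1, n_filters + 1):
        left, centre, right = bin_edges[i - 1], bin_edges[i], bin_edges[i + 1]
        for k in range(left, centre):
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)
    log_energy = np.log(fbank @ power + 1e-10)      # log filterbank energies
    ceps = dct(log_energy, type=2, norm="ortho")    # cepstral (DCT) transform
    return ceps[1:n_ceps + 1]                       # drop c0, keep 12 low-order coefficients
```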

For example, the 12 MFCCs and the 12 phase parameters can be used as separate feature sets: a likelihood is computed for each against the corresponding speaker model, and the two likelihoods are combined to identify the speaker. Alternatively, the two sets of 12 feature parameters can be concatenated into 24 feature parameters, and a single likelihood is obtained by matching against a speaker model.
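The two combination strategies might be sketched as follows; the score() interface (an average per-frame log-likelihood, as in scikit-learn's GaussianMixture) and the interpolation weight are assumptions for illustration, not specified in the patent.

```python
import numpy as np

# Strategy 1: keep the two 12-dimensional feature streams separate,
# score them against two per-speaker models and combine the likelihoods.
def combined_log_likelihood(mfcc, phase, mfcc_model, phase_model, w=0.5):
    """mfcc, phase: (n_frames, 12) matrices for one utterance.
    mfcc_model / phase_model: anything with a score() method returning an
    average log-likelihood per frame. w is a tuning assumption."""
    return w * mfcc_model.score(mfcc) + (1.0 - w) * phase_model.score(phase)

# Strategy 2: concatenate the streams frame by frame into 24-dimensional
# vectors and train / score a single speaker model on them.
def concatenate_features(mfcc, phase):
    return np.hstack([mfcc, phase])      # shape (n_frames, 24)
```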

[Incorporation into a conventional system]
It is easy to incorporate the present feature parameters into a conventional speaker recognition system. Most conventional speaker recognition systems extract feature parameters from the speech waveform, match them against a speaker model created for each speaker to compute a likelihood, and then either output the speaker whose model gives the maximum likelihood (speaker identification) or decide whether the speech belongs to the speaker of a given model according to whether the matching likelihood exceeds a threshold (speaker verification). Such a system can be adapted simply by replacing its feature parameters with the present ones. Alternatively, a speaker recognition method that combines the likelihoods obtained from the two feature sets can use the idea of Patent Document 1 below as it stands.

Note that the feature parameters depend not only on the speaker but also on the phoneme, so rather than preparing one representative feature vector per speaker, the speaker model must cover the variation caused by different phonemes. A representative technique is introduced in Patent Document 1 below.
JP 2005-091758 A, Speaker recognition system and method

The effectiveness was verified using a speech database for speaker recognition evaluation provided by NTT. The database contains speech from 22 male and 13 female speakers recorded over five sessions spanning about one year. The sampling frequency is 16 kHz, the frame length 25 ms, and the frame period 10 ms. A speaker model was created for each speaker from five sentences uttered in the first session, using a Gaussian mixture model (GMM). The utterances come at three speaking rates, normal, fast, and slow, and the speaker models used five sentences spoken at the normal rate. The experiment was run sentence by sentence, with 700 test sentences in total. The identification rate was 95.7% with MFCC, the worldwide standard feature parameter, and 41.0% with the first 12 phase parameters. When speaker identification was performed using a combination of the two likelihoods, the rate improved to 97.6%.

A function for extracting phase-information feature parameters is added to the feature extraction unit of a conventional speaker recognition system, and the speaker model creation, likelihood calculation, and speaker decision units are modified accordingly.

FIG. 1 shows an embodiment of the speaker recognition system. First, in the training phase, a speaker model is created offline for each speaker from a set of speech data uttered by that speaker.

First, in FIG. 1, the speech analysis unit 11 takes the speech waveform into the system and performs, for example, a discrete Fourier transform or linear predictive (LPC) analysis on it. The waveform is converted into a discrete sample time series, for example sampled at 16 kHz and A/D converted at 16 bits per sample. An analysis window is applied to cut out a segment of the waveform; this time span is called a frame. The next frame is cut out with a shift chosen so that adjacent segments overlap; this shift interval is called the frame period. For example, a Hamming window is applied and 256 points are cut out; a discrete Fourier transform then yields 128 complex spectral values. Alternatively, the spectrum can be obtained by linear predictive (LPC) analysis or similar methods. This is computed every frame period of 10 milliseconds.
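A minimal sketch of this analysis step, assuming the 16 kHz signal is already loaded as a NumPy array; the frame length and shift follow the example values above, and the function name is illustrative.

```python
import numpy as np

def analyze(signal, frame_len=256, frame_shift=160):
    """Cut a 16 kHz waveform into overlapping Hamming-windowed frames and
    return the complex DFT of each frame (frame_shift = 160 samples ~ 10 ms)."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    spectra = np.empty((n_frames, frame_len // 2 + 1), dtype=complex)
    for i in range(n_frames):
        start = i * frame_shift
        spectra[i] = np.fft.rfft(signal[start:start + frame_len] * window)
    return spectra      # one row of complex bins per frame
```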

In FIG. 1, the feature extraction unit 12 extracts feature parameters from the result of the speech analysis. In addition to conventional representative feature parameters such as MFCC, it extracts phase feature parameters. The complex spectrum is normally converted into the power spectrum, the sum of the squared real part and the squared imaginary part. Its logarithm is taken, the frequency axis is warped to the mel scale, the result is passed through a bank of several tens of band filters and then a discrete cosine transform, and about 12 low-order coefficients are taken as feature parameters (MFCC). To obtain the phase information, on the other hand, the angle is computed from the ratio of the imaginary part to the real part (the phase). Since this phase varies with the excision position, the phases of the other frequencies are normalized to values relative to a reference frequency ω whose phase is set to π/4. Of the 128 values, an appropriate subset of about 12 is used, taking into account the trade-off between the amount of training data and recognition accuracy. Various choices of the 12 are possible; for example, the first 12, which correspond to a frequency band of roughly 60 to 720 Hz. These feature parameters are computed every frame period, forming a time series.
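A hedged sketch of the phase branch of this feature extraction, applying the same normalization shown earlier to one frame's complex spectrum; the choice of reference bin and the selection of the lowest bins are illustrative assumptions.

```python
import numpy as np

def phase_features(spectrum, sample_rate=16000, n_keep=12, ref_bin=1):
    """Normalized-phase features for one frame.

    spectrum -- complex rfft of a 256-sample frame
    ref_bin  -- bin index used as the reference frequency omega (an assumption)
    n_keep   -- number of low-frequency bins kept (the patent's example keeps 12)
    """
    n_fft = 2 * (len(spectrum) - 1)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)   # bin centre frequencies in Hz
    phase = np.angle(spectrum)
    phi = phase[ref_bin]
    shifted = phase + freqs * (np.pi / 4 - phi) / freqs[ref_bin]
    shifted = np.angle(np.exp(1j * shifted))              # wrap into (-pi, pi]
    return shifted[1:1 + n_keep]                          # skip DC, keep the lowest bins
```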

In FIG. 1, the speaker model creation unit 13 creates a speaker model for each speaker using the feature parameters obtained by the feature extraction unit 12. Gaussian mixture models (GMM) and hidden Markov models (HMM) are typically used as speaker models. The normalized phase varies from speaker to speaker, but it also varies with the phoneme even for the same speaker, so the set of feature patterns is represented by a mixture of several distributions. The feature parameters can be handled in two ways: the 12 MFCCs and the 12 normalized phases are treated separately and a speaker model is created for each, or the two are concatenated and a speaker model is created over the 24 parameters. In the test phase, where a speaker is recognized from an actual utterance, the speech analysis unit 11 and the feature extraction unit 12 are the same as in the training phase, except that this phase runs online in real time.
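A non-authoritative sketch of per-speaker GMM training with scikit-learn; the library, mixture size, and covariance type are assumptions, since the patent only specifies that a GMM (or HMM) is used.

```python
from sklearn.mixture import GaussianMixture

def train_speaker_models(features_by_speaker, n_components=16):
    """Fit one diagonal-covariance GMM per enrolled speaker.

    features_by_speaker -- dict: speaker id -> (n_frames, dim) feature matrix
    n_components        -- mixture size (a tuning assumption, not from the patent)
    """
    models = {}
    for speaker, feats in features_by_speaker.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", max_iter=200, random_state=0)
        models[speaker] = gmm.fit(feats)
    return models
```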

In FIG. 1, the likelihood calculation unit 14 computes the likelihood of the feature parameter time series of the test data using the speaker models created by the speaker model creation unit 13.

In FIG. 1, the speaker recognition decision unit 15 identifies or verifies the speaker based on the likelihoods obtained by the likelihood calculation unit 14. For speaker identification, the likelihoods of the speaker models are compared and the speaker corresponding to the model with the highest likelihood is output as the result. For speaker verification, the likelihood computed by the likelihood calculation unit 14 is compared with a predetermined threshold; if the likelihood is larger, the utterance is judged to belong to the speaker of that model, and if it is smaller, the test speaker is judged not to be that speaker. When the likelihood calculation unit 14 produces two likelihoods, one from the MFCCs and one from the normalized phases, their weighted linear sum is used as the likelihood.
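A minimal sketch of this decision logic, assuming per-speaker models with a score() method as in the earlier sketches; the interpolation weight and the verification threshold are application-dependent assumptions.

```python
def identify(mfcc, phase, mfcc_models, phase_models, w=0.5):
    """Speaker identification: return the enrolled speaker whose combined
    (weighted) log-likelihood for the test utterance is highest."""
    scores = {spk: w * mfcc_models[spk].score(mfcc)
                   + (1.0 - w) * phase_models[spk].score(phase)
              for spk in mfcc_models}
    return max(scores, key=scores.get)

def verify(mfcc, phase, mfcc_model, phase_model, threshold, w=0.5):
    """Speaker verification: accept the claimed identity if the combined
    log-likelihood exceeds a predetermined threshold."""
    score = w * mfcc_model.score(mfcc) + (1.0 - w) * phase_model.score(phase)
    return score > threshold
```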

A wide range of applications is possible, such as access control for buildings, personal authentication replacing computer passwords, providing services tailored to individuals through personal identification within a group such as a household or hospital room, and automatic attribution of speakers when producing meeting minutes.

FIG. 1 is a block diagram of an embodiment of the speaker recognition system.

Explanation of symbols

11…Speech analysis unit
12…Feature extraction unit
13…Speaker model creation unit
14…Likelihood calculation unit
15…Speaker recognition decision unit
MFCC…Mel-Frequency Cepstrum Coefficient
GMM…Gaussian Mixture Model
HMM…Hidden Markov Model

Claims (1)

In a method of extracting the individuality of a speaker contained in an utterance and identifying which of the registered speakers produced it (speaker identification), or a method of determining whether the utterance belongs to a particular person (speaker verification): a method of extracting phase feature parameters, not used heretofore, that represent speaker characteristics contained in the speech waveform, and a technique for performing speaker identification or speaker verification using them.
JP2007061123A 2007-03-10 2007-03-10 Speaker recognition system Pending JP2008224911A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2007061123A JP2008224911A (en) 2007-03-10 2007-03-10 Speaker recognition system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2007061123A JP2008224911A (en) 2007-03-10 2007-03-10 Speaker recognition system

Publications (1)

Publication Number Publication Date
JP2008224911A true JP2008224911A (en) 2008-09-25

Family

ID=39843629

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2007061123A Pending JP2008224911A (en) 2007-03-10 2007-03-10 Speaker recognition system

Country Status (1)

Country Link
JP (1) JP2008224911A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014203370A1 (en) * 2013-06-20 2014-12-24 株式会社東芝 Speech synthesis dictionary creation device and speech synthesis dictionary creation method
CN107393554A (en) * 2017-06-20 2017-11-24 武汉大学 In a kind of sound scene classification merge class between standard deviation feature extracting method
WO2021001998A1 (en) * 2019-07-04 2021-01-07 日本電気株式会社 Sound model generation device, sound model generation method, and recording medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0219960A (en) * 1988-07-08 1990-01-23 Hitachi Ltd Neural network and method and device for analysis of acoustic signal using the network
JP2001109494A (en) * 1999-10-04 2001-04-20 Secom Co Ltd Voice identification device and voice identification method
JP2005091758A (en) * 2003-09-17 2005-04-07 Seiichi Nakagawa System and method for speaker recognition

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0219960A (en) * 1988-07-08 1990-01-23 Hitachi Ltd Neural network and method and device for analysis of acoustic signal using the network
JP2001109494A (en) * 1999-10-04 2001-04-20 Secom Co Ltd Voice identification device and voice identification method
JP2005091758A (en) * 2003-09-17 2005-04-07 Seiichi Nakagawa System and method for speaker recognition

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CSNJ201010077176; Kohei Asakawa and Seiichi Nakagawa: 'Speaker Recognition Using MFCC and Phase Information', Proceedings of the 2007 Spring Meeting of the Acoustical Society of Japan (CD-ROM), 2007-03-06, pp. 157-158, Acoustical Society of Japan *
JPN6012033381; Kohei Asakawa and Seiichi Nakagawa: 'Speaker Recognition Using MFCC and Phase Information', Proceedings of the 2007 Spring Meeting of the Acoustical Society of Japan (CD-ROM), 2007-03-06, pp. 157-158, Acoustical Society of Japan *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014203370A1 (en) * 2013-06-20 2014-12-24 株式会社東芝 Speech synthesis dictionary creation device and speech synthesis dictionary creation method
CN105340003A (en) * 2013-06-20 2016-02-17 株式会社东芝 Speech synthesis dictionary creation device and speech synthesis dictionary creation method
JPWO2014203370A1 (en) * 2013-06-20 2017-02-23 株式会社東芝 Speech synthesis dictionary creation device and speech synthesis dictionary creation method
CN107393554A (en) * 2017-06-20 2017-11-24 武汉大学 In a kind of sound scene classification merge class between standard deviation feature extracting method
CN107393554B (en) * 2017-06-20 2020-07-10 武汉大学 Feature extraction method for fusion inter-class standard deviation in sound scene classification
WO2021001998A1 (en) * 2019-07-04 2021-01-07 日本電気株式会社 Sound model generation device, sound model generation method, and recording medium
JPWO2021001998A1 (en) * 2019-07-04 2021-01-07
JP7294422B2 (en) 2019-07-04 2023-06-20 日本電気株式会社 SOUND MODEL GENERATOR, SOUND SIGNAL PROCESSING SYSTEM, SOUND MODEL GENERATION METHOD, AND PROGRAM

Similar Documents

Publication Publication Date Title
Dhingra et al. Isolated speech recognition using MFCC and DTW
Nakagawa et al. Speaker identification and verification by combining MFCC and phase information
JP4802135B2 (en) Speaker authentication registration and confirmation method and apparatus
Kumar et al. Design of an automatic speaker recognition system using MFCC, vector quantization and LBG algorithm
Perrot et al. Voice disguise and automatic detection: review and perspectives
Yegnanarayana et al. Epoch-based analysis of speech signals
Sinith et al. A novel method for text-independent speaker identification using MFCC and GMM
Ali et al. Mel frequency cepstral coefficient: a review
Yusnita et al. Malaysian English accents identification using LPC and formant analysis
WO2013154805A1 (en) Text dependent speaker recognition with long-term feature
Shahnawazuddin et al. Pitch-normalized acoustic features for robust children's speech recognition
Pao et al. Combining acoustic features for improved emotion recognition in mandarin speech
Patil et al. Novel variable length Teager energy based features for person recognition from their hum
Magre et al. A comparative study on feature extraction techniques in speech recognition
Sapijaszko et al. An overview of recent window based feature extraction algorithms for speaker recognition
Maazouzi et al. MFCC and similarity measurements for speaker identification systems
CN112151066A (en) Voice feature recognition-based language conflict monitoring method, medium and equipment
JP2008224911A (en) Speaker recognition system
Selvan et al. Speaker recognition system for security applications
Kumar et al. Text dependent speaker identification in noisy environment
Chaudhari et al. Combining dynamic features with MFCC for text-independent speaker identification
Kabir et al. Vector quantization in text dependent automatic speaker recognition using mel-frequency cepstrum coefficient
Mehendale et al. Speaker identification
Gandhi et al. Feature extraction from temporal phase for speaker recognition
Mishra et al. Automatic speech recognition using template model for man-machine interface

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20100129

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20120621

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20120703

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20120903

A02 Decision of refusal

Free format text: JAPANESE INTERMEDIATE CODE: A02

Effective date: 20130423