JP2709386B2

JP2709386B2 - Spectrogram normalization method

Info

Publication number: JP2709386B2
Application number: JP62156958A
Authority: JP
Inventors: 哲中村; 清宏鹿野
Original assignee: 株式会社エイ・ティ・ア−ル自動翻訳電話研究所
Priority date: 1987-06-24
Filing date: 1987-06-24
Publication date: 1998-02-04
Anticipated expiration: 2013-02-04
Also published as: JPS64998A

Description

【発明の詳細な説明】［産業上の利用分野］この発明はスペクトログラムの正規化方法に関し、特
に、ベクトル量子化を用いて異話者間のスペクトログラ
ムの正規化を行ない、不特定話者認識のための話者適応
化や性質変換技術に適用できるようなスペクトログラム
の正規化方法に関する。［従来の技術および発明が解決しようとする問題点］自動翻訳電話では、入力として音声が用いられるが、
その音声は不特定話者の音声であり、このような不特定
話者の音声を的確に認識する必要がある。不特定話者認
識のための１つの手段として、異話者間のスペクトログ
ラムの正規化を行なう方法があるが、従来の異話者間の
スペクトログラムの正規化手段は、主に母音区間の正規
化に関するものであり、決定論的なスペクトル周波数の
変化などの方法しかなかった。そこで、ベクトル量子化を用いて異話者間のスペクト
ログラムの正規化を行なう方法が考えられる。ところ
が、従来のベクトル量子化では、計算量，メモリの増加
を抑えて認識性能を向上させるべくベクトル量子化に用
いるスペクトル歪み尺度の改良が行なわれてきた。そし
て、種々の特徴の組合わせの複合スペクトル歪み尺度が
用いられてきたが、この方法ではスペクトル歪み尺度に
多種の特徴量を混在させ、それらの間の依存関係を拘束
条件として用い、より認識性能の良い空間へ特徴を写像
するところに意味があった。しかし、この方法では、次
のような問題点があった。各特徴量間の依存関係がベクトル量子化のコードブ
ック内に統計的に妥当性を持つためには、非常に多くの
ラーニングサンプルとこのための膨大な計算時間が必要
である。コードブックサイズで見た場合、各特徴に必要なコ
ードブックサイズは特徴間の依存関係を拘束条件にする
ことで減少する。しかし、それでも全体のコードブック
サイズは各特徴に必要なコードブックサイズの積になっ
て、非常に大きくなってしまい、膨大なメモリが必要で
あった。複合スペクトル歪み尺度を用いてベクトル量子化の
コードブックを生成した場合、各種の特徴間の相関によ
り、スペクトルの再現能力が低下する。それゆえに、この発明の主たる目的は、ベクトル量子
化を用いてスペクトルを個人ごとに有限のベクトルで表
現し、その後、異話者間のベクトルの対応を求めること
により、異話者間のスペクトログラムを正規化し得るス
ペクトログラムの正規化方法を提供することである。［問題点を解決するための手段］この発明は音声をディジタル化し、その音声の特徴と
してスペクトログラムを抽出し、その抽出されたスペク
トログラムを未知話者と基準話者との間で正規化するス
ペクトログラムの正規化方法において、音声の特徴ごと
のコードブックを生成し、そのコードブックを参照して
未知話者の音声をベクトル量子化し、動的計画法を用い
てベクトル量子化によって生成された特徴ごとのコード
列と基準話者の学習用標準パターン列との対応づけのヒ
ストグラムを作成し、動的計画法によるマッチングの局
部距離に各種の特徴のコード間距離の和を用いて対応づ
けの経路を拘束することにより対応づけの学習を行な
い、基準話者の特徴ベクトルの線形結合で未知話者の特
徴ベクトルを書換えることにより、スペクトログラムの
正規化を行なうようにしたものである。［作用］この発明に係るスペクトログラムの正規化方法は、音
声をディジタル化し、音声の特徴ごとにコードブックを
生成し、そのコードブックを参照して未知話者の音声を
ベクトル量子化し、動的計画法を用いてベクトル量子化
によって生成された特徴ごとのコード列と基準話者の学
習用標準パターン列との対応づけのヒストグラムを作成
し、動的計画法によるマッチングの局部距離に各種の特
徴のコード間距離の和を用いて対応づけの経路を拘束す
ることにより、対応づけの学習を行ない、基準話者の特
徴ベクトルの線形結合で未知話者の特徴ベクトルを書換
えることにより、スペクトログラムの正規化を行なう。［発明の実施例］以下に、図面を参照して、この発明の実施例について
より詳細に説明する。第１図はこの発明が適用される音声認識装置の概略ブ
ロック図である。第１図において、音声認識装置はアンプ１とローパス
フィルタ２とA/D変換器３と処理装置４とから構成され
る。アンプ１は入力された音声信号を増幅するものであ
り、ローパスフィルタ２は増幅された音声信号から折返
し雑音を除去するものである。A/D変換器３は音声信号
を12kHzのサンプリング信号により、16ビットのディジ
タル信号に変換するものである。処理装置４はコンピュ
ータ５と磁気ディスク６と端末類７とプリンタ８とを含
む。コンピュータ５はA/D変換器３から入力された音声
のディジタル信号に基づいて音声認識を行なうものであ
る。第２図はこの発明の一実施例の音声の入力から正規化
スペクトログラムを出力するまでの全体の流れを示すフ
ロー図である。次に、第１図ないし第２図を参照して、この発明の一
実施例の動作について説明する。入力された音声信号は
アンプ１で増幅され、ローパスフィルタ２によって折返
し雑音が除去された後、第２図に示すステップ（図示で
はSPと略称する）SP1において、A/D変換器３が入力され
た音声信号を16ビットのディジタル信号に変換する。処
理装置４のコンピュータ５はステップSP2において、デ
ィジタル信号に変換された音声の特徴抽出を行なう。こ
の特徴抽出では、たとえば線形分析（LPC分析）などの
手法を用いて行なわれる。ステップSP3において、コードブックの生成であるか
否かが判別され、コードブックの生成であれば、ステッ
プSP4において、抽出された音声の特徴に基づいて、コ
ードブック生成が行なわれる。このコードブック生成と
しては、たとえばLBGアルゴリズムが用いられ、特徴ご
とに生成されて、ステップSP5において、磁気ディスク
６のセパレートコードブックに格納される。なお、LBG
アルゴリズムについては、Linde,Buzo,Gray;“An algo
rithm for Vector Quantization Design"IEEE COM
−28 （1980−01）に詳細に記載されている。量子化を行なうときには、ステップSP3においてコー
ドブックの生成でないことが判別され、前述のステップ
SP2で求められた音声の特徴が、ステップSP6において、
セパレートコードブックを参照してセパレートベクトル
量子化される。そして、ステップSP7において、変換ベ
クトルの学習であるか否かが判別され、変換ベクトルの
学習であれば、ステップSP8において、セパレートベク
トル量子化により生成された特徴ごとのコード列が標準
話者の学習用標準パターン系列とDouble SplitによるD
P（Dynamic Programming:動的計画法）マッチングされ
る。この学習用標準パターン系列はステップSP9におい
て予め磁気ディスク６に登録されている。ステップSP10
において、DPマッチングの結果のベクトルの対応づけの
ヒストグラムを用いて、変換ベクトルが生成される。こ
の変換ベクトルはステップSP11において、磁気ディスク
６に登録される。前述のステップSP7において、変換ベクトルの学習で
ないことを判別したとき、すなわち正規化であることを
判別したときには、ステップSP12について、セパレート
ベクトル量子化により生成された特徴ごとのコード列
が、ステップSP11において既に格納されている変換ベク
トルを用いてフレームごとに置換えられ、正規化スペク
トログラムが生成されて出力される。第３図はベクトル量子化を用いたスペクトログラム正
規化の動作を説明するためのフロー図であり、第４図は
セパレートベクトル量子化の動作を説明するためのフロ
ー図であり、第５図は変換ベクトル学習のアルゴリズム
を説明するためのフロー図であり、第６図はスペクトロ
グラム正規化のアルゴリズムであり、第７図はマッチン
グ方式を説明するためのフロー図である。次に、第３図を参照して、ベクトル量子化を用いたス
ペクトログラム正規化について説明する。この発明にお
けるベクトル量子化を用いたスペクトログラム正規化は
大きく２つの機能から構成されている。１つは、ステッ
プSP23におけるベクトル量子化である。このベクトル量
子化は、特徴の種類ごとに別々にベクトル量子化を行な
うセパレートベクトル量子化であって、ステップSP22に
おいて、特徴別に別々のコードブックが生成される。２つ目は、ステップSP24におけるスペクトルの変換
（正規化）であり、ステップSP24において、学習用単語
を未知話者に発声させることにより、ベクトルの対応づ
けを行なう。ここでは、全学習用単語について求めた対
応づけのヒストグラムを求め、これを重みとして未知話
者のコードブックの特徴ベクトルを標準話者のコードブ
ックの特徴ベクトルの線形結合で表わし、これを変換コ
ードブックとしてステップSP25において格納しておき、
正規化時には、入力されたスペクトルを入力ごとに変換
コードブックを用いて置換え、スペクトルの正規化を行
なう。ここで、セパレートベクトル量子化について詳細に説
明する。この発明では、音声をパワーとスペクトル情報
（自己相関係数,LPCケプストラム係数）の２種類の特徴
に分割し、それぞれについて別々にベクトル量子化を行
なう。但し、パワーはスカラーであるため、不均一スカ
ラー量子化となっている。第４図において、ステップSP
31において、16ビットのディジタル信号に変換された音
声信号に対して、14次の自己相関分析によるLPC分析を
行ない、入力音声の特徴であるパワーと自己相関係数,L
PCケプストラム係数を抽出する。ステップSP32におい
て、パワーのコードブック生成であるか否かを判別し、
パワーのコードブック生成であれば、ステップSP33にお
いて、入力音声のパワーをスカラー量子化する。スカラ
ー量子化では、不均一量子化の手法を用いて、ステップ
SP33においてパワーコードブックを生成し、ステップSP
34において、生成したコードブックを磁気ディス
ク６に格納する。パワーコードブックの生成でないとき、すなわち、量
子化時には、ステップSP34におけるパワーコードブック
を用いて、ステップSP35において量子化を行ない、パワ
ーに関するコード列を出力する。一方、ステップSP36において、LPC相関係数およびLPC
ケプストラム係数のコードブック生成であることが判別
されると、ステップSP37において、LBGアルゴリズムに
より、WLR尺度に基づいてコードブックが生成され、ス
テップSP38において、生成されたコードブックが磁気ディ
スク６に格納される。ここで、WLR尺度は、音声の特徴を
強調する尺度であり、単語音声の認識において高い性能
を示すものであり、杉山，鹿野による“ピークに重みを
おいたLPCスペクトルマッチング尺度”電子通信学会論
文（Ａ）J64−A5（198−05）に記載されている。なお、LPC相関係数およびLPCケプストラム係数のコー
ドブック生成でないとき、すなわち、量子化時には入力
音声の自己相関係数とLPCケプストラム係数に対し、ス
テップSP38におけるスペクトルコードブックを用いて、
ステップSP39においてベクトル量子化を行ない、スペク
トル情報に関するコード列を出力する。ここで、コードブック生成，量子化に用いたスペクト
ル歪み尺度は次のものである。 d_power＝P/P′＋Ｐ′/P−２ …（１） d_spectrum＝Σ（Ｃ（ｎ） −Ｃ′（ｎ））（Ｒ（ｎ）−Ｒ′（ｎ）） …（２）ここで、 d_powerはパワー項の歪み尺度であり、 d_spectrumはスペクトル歪み尺度であり、Ｒ（ｎ）はコードブックのｎ次の自己相関係数であ
り、Ｒ′（ｎ）は入力のｎ次の自己相関係数であり、Ｃ（ｎ）はコードブックのｎ次のLPCケプストラム係
数であり、Ｃ′（ｎ）は入力のｎ次のLPCケプストラム係数であ
り、Ｐはコードブックのパワーであり、Ｐ′は入力のパワーである。次に、第５図を参照して、第３図に示したステップSP
24,ステップSP25におけるスペクトルの正規化および変
換コードブックの生成について詳細に説明する。まず、
変換コードブックを生成するにあたって、学習用単語を
未知話者に発声させる。この入力音声をステップSP41に
おいて、ステップSP42で既に格納されているコードブッ
クを用いてセパレートベクトル量子化する。ステップSP
43において、量子化されたコード列は、ステップSP44に
おいて既に格納されている標準話者の同一単語の学習用
標準パターンとDouble Split法によりDPマッチングさ
れ、未知話者と標準話者が発声した同一学習単語でベク
トルの対応づけを求める。そして、すべての学習単語に
ついて対応づけを求め、ヒストグラムの形で格納する。
ステップSP45において、求めたヒストグラムを用いて、
未知話者の特徴ベクトルを、ステップSP46において格納
されている標準話者のコードブックの特徴ベクトルの対
応づけのヒストグラムを重みとした荷重和で表わす。こ
の荷重和は次の式で表すことができる。 k:標準話者のコードブックのコード番号 n:未知話者のコードブックのコード番号ａ′：未知話者から標準話者への変換ベクトルｂ（ｋ）：標準話者のコードブックの特徴ベクトル h_n（ｋ）:DPマッチングによる対応付けで求められた
未知話者のコードｎに対する標準話者のコードｋのヒス
トグラムつぎに、ステップSP48において、ａ′の変換ベクトル
で未知話者のコードブックを入替え、ステップSP43,SP4
5およびSP47およびSP48を繰返し行なう。この繰返しを
一定回数または全学習単語に対するDP距離が収束するま
で繰返し、ステップSP47において収束したことを判別す
ると、最終的な未知話者から標準話者への変換ベクトル
が求められる。次に、第６図を参照して、スペクトルの正規化につい
て説明する。ステップSP51において、未知話者の入力音
声を、コードブックを用いてセパレートベクトル量子化
する。ここで、未知話者のコードブックはステップSP52
において予め格納されている。そして、先程求めたステ
ップSP54における未知話者から標準話者への変換ベクト
ルにより、ステップSP53において未知話者のコードブッ
クを入替え、フレームワイズにスペクトルの入替えを行
なって正規化スペクトログラムを出力する。次に、第７図を参照して、対応づけを求めるマッチン
グ動作について説明する。マッチングはDouble Split
法を用いて行なう。ステップSP61において、セパレート
ベクトル量子化によりパワーとスペクトルと別々にベク
トル量子化し生成されたコード列と、コード列として格
納されている標準パターンとをマッチングする。標準パ
ターンはステップSP62において、セパレートベクトル量
子化によりコード化されたパワーおよびスペクトルの標
準パターンが予め格納されている。そして、ステップSP
61におけるマッチングにおいては、コード間の距離は予
めステップSP63において距離マトリクスを作成してお
き、この表びきを行なうことで求める。このようにし
て、順番に標準パターンとマッチングして求めた入力音
声と標準パターンのベクトルの対応をステップSP64にお
けるヒストグラム生成部に出力する。そして、ヒストグ
ラム生成部で求められたヒストグラムを重みとして、未
知話者の特徴ベクトルを標準話者の特徴ベクトルの線形
結合で表わして変換ベクトルとする。次に、マッチング方法について詳細に説明する。従来
のマッチングでは、入力も標準パターンも１つの特徴列
あるいはコード列であったが、セパレートベクトル量子
化では、一般に複数のコード列により構成される。この
発明では、パワーコード列とスペクトルコード列の２系
列のマッチング手法を例に掲げて説明する。パワーとス
ペクトルの両方の情報を考えた場合の距離尺度としてPW
LR尺度がある。これは次式で示される。 d_PWLR＝Σ(Ｃ(ｎ)−Ｃ′(ｎ))(Ｒ(ｎ)−Ｒ′(ｎ)) ＋ａ(P/P′＋Ｐ′/P−２) …（３）ａ＝0.01 従来のDouble Split法によるコード列のマッチング
では、前述のようにすべての空間がベクトル量子化さ
れ、有限個の点で代表されていることを利用して、予め
すべての代表点間の距離を求めて距離マトリクスに格納
しておく。したがって、 d_PWLR（i,j）＝DL（Ａ（ｉ）,B（ｊ）） DL（Ａ（ｉ）,B（ｊ））＝Σ（C_K（ｎ）−C_L（ｎ））（R_K（ｎ）−R_L（ｎ））＋ａ・（P_K/P_L＋P_L/P_K−２）Ａ（ｉ）は、入力音声のｉフレーム目のコード番号Ｂ（ｊ）は、標準パターンのｊフレーム目のコード番
号 DL（K,L）は、コードK,L間の距離を距離マトリクスか
ら表びきで求めたもの K,Lは、Ａ（ｉ）,B（ｊ）のコード番号しかし、セパレートベクトル量子化では、２つの系列
を有するので次のようにして距離を求める。ｄ_{［ｐ］［WLR］}（i,j）＝DL_spect（A_spect（ｉ）,B_spect（ｊ））＋ａ・DL_power（A_power（ｉ）,B_power（ｊ））ここで、 DL_spect（A_spect（ｉ）,B_spect（ｊ））＝Σ（C_K（ｎ）−C_L（ｎ））（R_K（ｎ）−R_L（ｎ）） DL_power（A_power（ｉ）,B_power（ｊ））＝Ｐ_Ｋ′/P_Ｌ′＋Ｐ_Ｌ′/P_Ｋ′−２ K,Lは、A_spect（ｉ）,B_spect（ｊ）のコード番号Ｋ′,L′は、A_power（ｉ）,B_power（ｊ）のコード番
号である。これは、PWLR尺度の第１項と第２項の別々にコード化
して距離を計算し、和を求めたものである。この局部距
離の尺度を用いて、DPマッチングにより距離を求める。以上のようにして、非常に高性能なベクトル量子化を
用いた正規化方式を達成できる。［発明の効果］以上のように、この発明によれば、音声をベクトル
量子化した後スペクトログラムを抽出し、ベクトル量子
化のコードブックについて異話者間で対応づけを行な
い、この対応づけに基づいてスペクトログラムの正規化
を行なうようにしたので、各特徴の依存項を無視でき、
ラーニングサンプルを少なくてすみ、計算量が減少す
る。ただし、セパレートすることにより、別のベクトル
量子化系を構成するので、この分計算量が多少増加する
が、ラーニングサンプルが少ないので十分計算量を減少
できる。コードブックサイズはセパレートベクトル量子
化では、各特徴に必要なコードブックサイズの和になる
ので、全体のコードブックサイズを激減させることがで
きる。しかも、各特徴の依存項は無視するので、コード
ブックの特徴内で最適な量子化をすることができ、この
ために忠実にスペクトログラムを再現できる。

Description: [Detailed Description of the Invention]

[Field of Industrial Application] The present invention relates to a spectrogram normalization method. More particularly, it relates to a spectrogram normalization method that uses vector quantization to normalize spectrograms between different speakers and that can be applied to speaker adaptation for speaker-independent recognition and to property conversion techniques.

[Prior Art and Problems to Be Solved by the Invention] An automatic translation telephone takes speech as its input. That speech comes from unspecified speakers, and it must be recognized accurately. One means of speaker-independent recognition is to normalize spectrograms between different speakers. Conventional means of normalizing spectrograms between speakers, however, dealt mainly with the normalization of vowel segments, and only methods such as deterministic changes of the spectral frequency were available.

A method that normalizes spectrograms between different speakers using vector quantization is therefore conceivable. In conventional vector quantization, the spectral distortion measure used for quantization has been improved in order to raise recognition performance while suppressing increases in computation and memory. Composite spectral distortion measures that combine various features have been used. The point of that approach was to mix many kinds of feature quantities in the distortion measure, use the dependencies among them as constraints, and map the features into a space with better recognition performance. That approach, however, had the following problems.

For the dependencies among the features to be statistically valid within the vector quantization codebook, a very large number of learning samples and an enormous amount of computation time are required.

In terms of codebook size, the size required for each feature is reduced by using the dependencies among the features as constraints. Even so, the overall codebook size is the product of the codebook sizes required for the individual features, so it becomes very large and requires an enormous amount of memory.

When a vector quantization codebook is generated with a composite spectral distortion measure, the correlation among the various features degrades the ability to reproduce the spectrum.
The main object of the present invention is therefore to provide a spectrogram normalization method that can normalize spectrograms between different speakers by using vector quantization to represent each individual's spectra with a finite set of vectors and then finding the correspondence between the vectors of different speakers.

[Means for Solving the Problems] The present invention is a spectrogram normalization method that digitizes speech, extracts a spectrogram as a feature of the speech, and normalizes the extracted spectrogram between an unknown speaker and a reference speaker. A codebook is generated for each feature of the speech, and the speech of the unknown speaker is vector-quantized with reference to those codebooks. Using dynamic programming, a histogram is created of the correspondence between the per-feature code sequences produced by vector quantization and the reference speaker's standard pattern sequences for learning. The correspondence is learned by constraining the correspondence path, using the sum of the inter-code distances of the various features as the local distance of the dynamic-programming matching. The spectrogram is normalized by rewriting the feature vectors of the unknown speaker as linear combinations of the feature vectors of the reference speaker.

[Operation] The spectrogram normalization method according to the present invention digitizes speech, generates a codebook for each feature of the speech, and vector-quantizes the speech of an unknown speaker with reference to those codebooks. Using dynamic programming, it creates a histogram of the correspondence between the per-feature code sequences produced by vector quantization and the reference speaker's standard pattern sequences for learning. It learns the correspondence by constraining the correspondence path, using the sum of the inter-code distances of the various features as the local distance of the dynamic-programming matching, and normalizes the spectrogram by rewriting the feature vectors of the unknown speaker as linear combinations of the feature vectors of the reference speaker.

[Embodiments of the Invention] Embodiments of the present invention are described below in more detail with reference to the drawings.

FIG. 1 is a schematic block diagram of a speech recognition apparatus to which the present invention is applied. In FIG. 1, the speech recognition apparatus comprises an amplifier 1, a low-pass filter 2, an A/D converter 3, and a processing device 4. The amplifier 1 amplifies the input speech signal, and the low-pass filter 2 removes aliasing noise from the amplified speech signal. The A/D converter 3 converts the speech signal into a 16-bit digital signal using a 12 kHz sampling signal. The processing device 4 includes a computer 5, a magnetic disk 6, terminals 7, and a printer 8. The computer 5 performs speech recognition based on the digital speech signal input from the A/D converter 3.

FIG. 2 is a flowchart showing the overall flow of one embodiment of the present invention, from speech input to output of the normalized spectrogram.

Next, the operation of one embodiment of the present invention is described with reference to FIGS. 1 and 2. The input speech signal is amplified by the amplifier 1, and aliasing noise is removed by the low-pass filter 2. Then, in step SP1 shown in FIG. 2 (steps are abbreviated as SP in the drawings), the A/D converter 3 converts the input speech signal into a 16-bit digital signal.
In step SP2, the computer 5 of the processing device 4 extracts features from the digitized speech. This feature extraction uses a technique such as linear prediction analysis (LPC analysis).

In step SP3, it is determined whether a codebook is to be generated. If so, a codebook is generated in step SP4 from the extracted speech features. The LBG algorithm, for example, is used for codebook generation. A codebook is generated for each feature and stored in step SP5 as a separate codebook on the magnetic disk 6. The LBG algorithm is described in detail in Linde, Buzo, Gray, "An Algorithm for Vector Quantizer Design," IEEE COM-28 (1980-01).

When quantization is performed, step SP3 determines that no codebook is to be generated, and in step SP6 the speech features obtained in step SP2 are separate-vector-quantized with reference to the separate codebooks. Step SP7 then determines whether conversion vectors are to be learned. If so, in step SP8 the per-feature code sequences produced by separate vector quantization are matched against the standard speaker's standard pattern sequences for learning by DP (dynamic programming) matching using the Double Split method. These standard pattern sequences for learning have been registered on the magnetic disk 6 in advance in step SP9. In step SP10, conversion vectors are generated using the histogram of vector correspondences that results from the DP matching. The conversion vectors are registered on the magnetic disk 6 in step SP11.

When step SP7 determines that conversion vectors are not to be learned, that is, that normalization is to be performed, in step SP12 the per-feature code sequences produced by separate vector quantization are replaced frame by frame using the conversion vectors already stored in step SP11, and a normalized spectrogram is generated and output.

FIG. 3 is a flowchart for explaining the operation of spectrogram normalization using vector quantization. FIG. 4 is a flowchart for explaining the operation of separate vector quantization. FIG. 5 is a flowchart for explaining the conversion-vector learning algorithm. FIG. 6 shows the spectrogram normalization algorithm. FIG. 7 is a flowchart for explaining the matching method.

Next, spectrogram normalization using vector quantization is described with reference to FIG. 3. Spectrogram normalization using vector quantization according to the present invention consists of two main functions. The first is the vector quantization in step SP23. This is separate vector quantization, in which vector quantization is performed separately for each type of feature; in step SP22 a separate codebook is generated for each feature.

The second is the spectrum conversion (normalization) in step SP24. In step SP24, the unknown speaker utters learning words so that the vector correspondences can be found. A histogram of the correspondences is obtained over all learning words. With this histogram as weights, each feature vector of the unknown speaker's codebook is expressed as a linear combination of the feature vectors of the standard speaker's codebook, and the result is stored as a conversion codebook in step SP25. At normalization time, each input spectrum is replaced using the conversion codebook, which normalizes the spectrum.

Separate vector quantization is now described in detail. In the present invention, speech is divided into two kinds of features, power and spectral information (autocorrelation coefficients and LPC cepstrum coefficients), and each is vector-quantized separately. Because power is a scalar, however, it undergoes non-uniform scalar quantization. In FIG. 4, in step SP31 the speech signal converted into a 16-bit digital signal undergoes LPC analysis by 14th-order autocorrelation analysis, which extracts the features of the input speech: power, autocorrelation coefficients, and LPC cepstrum coefficients. Step SP32 determines whether a power codebook is to be generated. If so, in step SP33 the power of the input speech is scalar-quantized. The scalar quantization uses a non-uniform quantization technique: the power codebook is generated in step SP33, and in step SP34 the generated codebook is stored on the magnetic disk 6.

When no power codebook is to be generated, that is, at quantization time, the power codebook of step SP34 is used to quantize the power in step SP35, and the code sequence for power is output.

Meanwhile, when step SP36 determines that a codebook for the LPC correlation coefficients and LPC cepstrum coefficients is to be generated, in step SP37 a codebook is generated by the LBG algorithm based on the WLR measure, and in step SP38 the generated codebook is stored on the magnetic disk 6. The WLR measure emphasizes the features of speech and shows high performance in word speech recognition. It is described in Sugiyama and Shikano, "LPC spectrum matching measure with weighted peaks," Transactions of the IECE of Japan (A), J64-A5 (198-05).

When no codebook for the LPC correlation coefficients and LPC cepstrum coefficients is to be generated, that is, at quantization time, the autocorrelation coefficients and LPC cepstrum coefficients of the input speech are vector-quantized in step SP39 using the spectrum codebook of step SP38, and the code sequence for the spectral information is output.

The spectral distortion measures used for codebook generation and quantization are as follows:

$$d_{\mathrm{power}} = \frac{P}{P'} + \frac{P'}{P} - 2 \tag{1}$$

$$d_{\mathrm{spectrum}} = \sum_{n} \bigl(C(n) - C'(n)\bigr)\bigl(R(n) - R'(n)\bigr) \tag{2}$$

where
d_power is the distortion measure of the power term,
d_spectrum is the spectral distortion measure,
R(n) is the n-th order autocorrelation coefficient of the codebook,
R′(n) is the n-th order autocorrelation coefficient of the input,
C(n) is the n-th order LPC cepstrum coefficient of the codebook,
C′(n) is the n-th order LPC cepstrum coefficient of the input,
P is the power of the codebook, and
P′ is the power of the input.

Next, the spectrum normalization and conversion-codebook generation of steps SP24 and SP25 shown in FIG. 3 are described in detail with reference to FIG. 5. First, to generate the conversion codebook, the unknown speaker utters the learning words. In step SP41, this input speech is separate-vector-quantized using the codebooks already stored in step SP42. In step SP43, the quantized code sequence is DP-matched, by the Double Split method, against the standard speaker's standard pattern for learning of the same word, already stored in step SP44. This yields the vector correspondence for the same learning word uttered by the unknown speaker and the standard speaker. Correspondences are found for all learning words and stored in the form of a histogram.
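The matching and histogram step described above can be illustrated with a minimal Python sketch. It is a sketch under stated assumptions, not the patent's implementation: the text does not spell out the path constraints of the Double Split method, so a plain three-predecessor DP recursion is swapped in, and the names `d_power`, `d_spectrum`, `dp_align`, `add_to_histogram`, and `local_dist` are hypothetical. Only the two distortion functions follow equations (1) and (2) directly.

```python
from collections import defaultdict


def d_power(p, p_in):
    """Equation (1): P/P' + P'/P - 2, which is zero when the powers are equal."""
    return p / p_in + p_in / p - 2.0


def d_spectrum(c, r, c_in, r_in):
    """Equation (2): sum over n of (C(n) - C'(n)) * (R(n) - R'(n))."""
    return sum((cn - cin) * (rn - rin)
               for cn, rn, cin, rin in zip(c, r, c_in, r_in))


def dp_align(grid):
    """Minimum-cost path through grid[i][j] from (0, 0) to the last cell.

    Assumption: each cell may be reached from its diagonal, upper, or left
    neighbour. The patent's Double Split path constraints may differ.
    """
    rows, cols = len(grid), len(grid[0])
    inf = float("inf")
    cost = [[inf] * cols for _ in range(rows)]
    back = [[None] * cols for _ in range(rows)]
    cost[0][0] = grid[0][0]
    for i in range(rows):
        for j in range(cols):
            if i == 0 and j == 0:
                continue
            best, arg = inf, None
            for pi, pj in ((i - 1, j - 1), (i - 1, j), (i, j - 1)):
                if pi >= 0 and pj >= 0 and cost[pi][pj] < best:
                    best, arg = cost[pi][pj], (pi, pj)
            cost[i][j] = best + grid[i][j]
            back[i][j] = arg
    # Trace the path back from the final cell to (0, 0).
    path, node = [], (rows - 1, cols - 1)
    while node is not None:
        path.append(node)
        node = back[node[0]][node[1]]
    return cost[rows - 1][cols - 1], path[::-1]


def add_to_histogram(hist, unknown_codes, ref_codes, local_dist):
    """Align one learning word and count which codes were paired.

    hist[n][k] counts how often unknown-speaker code n was aligned with
    standard-speaker code k. Returns the DP distance for this word.
    """
    grid = [[local_dist(a, b) for b in ref_codes] for a in unknown_codes]
    total, path = dp_align(grid)
    for i, j in path:
        hist[unknown_codes[i]][ref_codes[j]] += 1
    return total
```

A caller would create `hist = defaultdict(lambda: defaultdict(int))` once and call `add_to_histogram` for every learning word, passing a `local_dist` that looks up precomputed inter-code distances.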
In step SP45, using the obtained histogram, each feature vector of the unknown speaker is expressed as a weighted sum of the feature vectors of the standard speaker's codebook stored in step SP46, with the correspondence histogram as the weights. This weighted sum is expressed in terms of the following quantities:

k: code number in the standard speaker's codebook
n: code number in the unknown speaker's codebook
a′: conversion vector from the unknown speaker to the standard speaker
b(k): feature vector of the standard speaker's codebook
h_n(k): histogram of standard-speaker code k for unknown-speaker code n, obtained from the correspondences found by DP matching

Next, in step SP48, the unknown speaker's codebook is replaced with the conversion vectors a′, and steps SP43, SP45, SP47, and SP48 are repeated. The repetition continues a fixed number of times or until the DP distance over all learning words converges. When step SP47 determines that it has converged, the final conversion vectors from the unknown speaker to the standard speaker are obtained.

Next, spectrum normalization is described with reference to FIG. 6. In step SP51, the input speech of the unknown speaker is separate-vector-quantized using the codebooks. The unknown speaker's codebooks have been stored in advance in step SP52. Then, using the conversion vectors from the unknown speaker to the standard speaker obtained earlier (step SP54), the unknown speaker's codebook is replaced in step SP53, the spectra are replaced frame-wise, and a normalized spectrogram is output.

Next, the matching operation that finds the correspondence is described with reference to FIG. 7. Matching uses the Double Split method. In step SP61, the code sequences produced by quantizing power and spectrum separately through separate vector quantization are matched against the standard patterns, which are stored as code sequences. The standard patterns of power and spectrum, coded by separate vector quantization, are stored in advance in step SP62. In the matching of step SP61, the distance between codes is obtained by table lookup in a distance matrix created in advance in step SP63. The correspondences between the vectors of the input speech and those of the standard patterns, found by matching against the standard patterns in turn, are output to the histogram generation unit in step SP64. With the histogram obtained by the histogram generation unit as weights, the feature vectors of the unknown speaker are expressed as linear combinations of the feature vectors of the standard speaker, which gives the conversion vectors.

Next, the matching method is described in detail. In conventional matching, both the input and the standard pattern were a single feature sequence or code sequence, but in separate vector quantization they generally consist of multiple code sequences. The present invention is explained using the example of matching two sequences, a power code sequence and a spectrum code sequence. A distance measure that considers both power and spectral information is the PWLR measure, given by:

$$d_{\mathrm{PWLR}} = \sum_{n} \bigl(C(n) - C'(n)\bigr)\bigl(R(n) - R'(n)\bigr) + a\left(\frac{P}{P'} + \frac{P'}{P} - 2\right), \quad a = 0.01 \tag{3}$$

Conventional code-sequence matching by the Double Split method exploits the fact that, as described above, the whole space is vector-quantized and represented by a finite number of points: the distances between all representative points are computed in advance and stored in a distance matrix. Therefore,

$$d_{\mathrm{PWLR}}(i,j) = \mathrm{DL}\bigl(A(i), B(j)\bigr)$$

$$\mathrm{DL}\bigl(A(i), B(j)\bigr) = \sum_{n} \bigl(C_K(n) - C_L(n)\bigr)\bigl(R_K(n) - R_L(n)\bigr) + a\left(\frac{P_K}{P_L} + \frac{P_L}{P_K} - 2\right)$$

where
A(i) is the code number of the i-th frame of the input speech,
B(j) is the code number of the j-th frame of the standard pattern,
DL(K, L) is the distance between codes K and L, obtained by table lookup from the distance matrix, and
K, L are the code numbers A(i), B(j).

Separate vector quantization, however, has two sequences, so the distance is obtained as follows:

$$d_{\mathrm{PWLR}}(i,j) = \mathrm{DL}_{\mathrm{spect}}\bigl(A_{\mathrm{spect}}(i), B_{\mathrm{spect}}(j)\bigr) + a \cdot \mathrm{DL}_{\mathrm{power}}\bigl(A_{\mathrm{power}}(i), B_{\mathrm{power}}(j)\bigr)$$

where

$$\mathrm{DL}_{\mathrm{spect}}\bigl(A_{\mathrm{spect}}(i), B_{\mathrm{spect}}(j)\bigr) = \sum_{n} \bigl(C_K(n) - C_L(n)\bigr)\bigl(R_K(n) - R_L(n)\bigr)$$

$$\mathrm{DL}_{\mathrm{power}}\bigl(A_{\mathrm{power}}(i), B_{\mathrm{power}}(j)\bigr) = \frac{P_{K'}}{P_{L'}} + \frac{P_{L'}}{P_{K'}} - 2$$

K, L are the code numbers A_spect(i), B_spect(j), and K′, L′ are the code numbers A_power(i), B_power(j).

In other words, the first and second terms of the PWLR measure are coded separately, their distances are computed, and the two are summed. This local distance measure is used to find the distance by DP matching.

In this way, a very high-performance normalization scheme using vector quantization can be achieved.

[Effects of the Invention] As described above, according to the present invention, speech is vector-quantized, a spectrogram is extracted, the vector quantization codebooks are associated between different speakers, and the spectrogram is normalized based on this correspondence. The dependency terms among the features can therefore be ignored, fewer learning samples are needed, and the amount of computation decreases. Separating the features does add another vector quantization system, which increases the computation somewhat, but because few learning samples are needed the overall computation is still reduced sufficiently. With separate vector quantization, the codebook size is the sum of the codebook sizes required for the individual features, so the overall codebook size can be reduced drastically. Moreover, because the dependency terms among the features are ignored, optimal quantization can be performed within each feature of the codebook, and the spectrogram can therefore be reproduced faithfully.
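The separate local distance and the histogram-weighted conversion described in the embodiment can be sketched in Python as follows. This is an illustration under assumptions rather than the patented implementation. The function names are hypothetical, each frame is assumed to be a `(spectrum_code, power_code)` pair, and the weighted sum is divided by the total histogram count for each code. That normalization is an assumption, because the weighted-sum equation itself is not reproduced in the text.

```python
def separate_local_distance(dl_spect, dl_power, in_frame, ref_frame, a=0.01):
    """Local DP distance for separate vector quantization.

    dl_spect and dl_power are precomputed inter-code distance tables.
    The result is DL_spect(K, L) + a * DL_power(K', L'), with a = 0.01
    as stated for the PWLR measure.
    """
    k_spect, k_power = in_frame
    l_spect, l_power = ref_frame
    return dl_spect[k_spect][l_spect] + a * dl_power[k_power][l_power]


def conversion_vectors(hist, ref_codebook):
    """Build one conversion vector per unknown-speaker code n.

    hist[n][k] is the count h_n(k); ref_codebook[k] is the vector b(k).
    Assumption: the weighted sum is normalized by the total count for n.
    """
    dim = len(ref_codebook[0])
    conv = {}
    for n, row in hist.items():
        total = sum(row.values())
        if total == 0:
            continue  # code n never appeared in the learning words
        vec = [0.0] * dim
        for k, count in row.items():
            for d in range(dim):
                vec[d] += count * ref_codebook[k][d]
        conv[n] = [v / total for v in vec]
    return conv


def normalize(code_sequence, conv):
    """Frame-wise replacement: each code becomes its conversion vector."""
    return [conv[code] for code in code_sequence]
```

In the iterative learning the description outlines, `conversion_vectors` would be recomputed after each round of DP matching until the total DP distance stops changing, and `normalize` would then be applied to new utterances from the same speaker.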

【図面の簡単な説明】第１図はこの発明の一実施例が適用される音声認識装置
の概略ブロック図である。第２図は音声の入力から正規
化までの全体の処理の流れを示すフロー図である。第３
図はベクトル量子化を用いたスペクトログラム正規化の
動作を説明するためのフロー図である。第４図はセパレ
ートベクトル量子化の動作を説明するためのフロー図で
ある。第５図は変換ベクトル学習のアルゴリズムを説明
するためのフロー図である。第６図はスペクトログラム
正規化のアルゴリズムを示すフロー図である。第７図は
マッチング動作を説明するためのフロー図である。図において、１はアンプ、２はローパスフィルタ、３は
A/D変換器、４は処理装置、５はコンピュータを示す。

[Brief Description of the Drawings] FIG. 1 is a schematic block diagram of a speech recognition apparatus to which an embodiment of the present invention is applied. FIG. 2 is a flowchart showing the overall flow of processing from speech input to normalization. FIG. 3 is a flowchart for explaining the operation of spectrogram normalization using vector quantization. FIG. 4 is a flowchart for explaining the operation of separate vector quantization. FIG. 5 is a flowchart for explaining the conversion-vector learning algorithm. FIG. 6 is a flowchart showing the spectrogram normalization algorithm. FIG. 7 is a flowchart for explaining the matching operation.

In the figures, 1 is an amplifier, 2 is a low-pass filter, 3 is an A/D converter, 4 is a processing device, and 5 is a computer.

フロントページの続き (56)参考文献: 特開昭61−261799（ＪＰ，Ａ)、特開昭61−123889（ＪＰ，Ａ)、特公昭56−51637（ＪＰ，Ｂ２)、国際公開86／5618（ＷＯ，Ａ１)

Continuation of front page (56) References: JP-A-61-261799 (JP, A); JP-A-61-123889 (JP, A); JP-B-56-51637 (JP, B2); International Publication 86/5618 (WO, A1)

Claims

(57) [Claims]

1. A spectrogram normalization method that digitizes speech, extracts a spectrogram as a feature of the speech, and normalizes the extracted spectrogram between an unknown speaker and a reference speaker, wherein: a codebook is generated for each feature of the speech; the speech of the unknown speaker is vector-quantized with reference to the codebooks; using dynamic programming, a histogram is created of the correspondence between the per-feature code sequences generated by the vector quantization and the reference speaker's standard pattern sequences for learning; the correspondence is learned by constraining the correspondence path, using the sum of the inter-code distances of the various features as the local distance of the dynamic-programming matching; and the spectrogram is normalized by rewriting the feature vectors of the unknown speaker as linear combinations of the feature vectors of the reference speaker.

2. The spectrogram normalization method according to claim 1, wherein separate vector quantization is performed using two kinds of features of the speech, power and autocorrelation coefficients; a histogram is created by learning with certain learning words; and the normalization is performed by replacing each codebook feature vector of the unknown speaker with a linear combination of the feature vectors of the reference speaker, with the histogram as the weights.