JPH0372991B2

Info

Publication number: JPH0372991B2
Application number: JP61016993A
Authority: JP
Inventors: Peter Vincent deSouza; Lalit Rai Bahl; Robert Leroy Mercer; Michael Alan Picheny
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 1986-01-30
Filing date: 1986-01-30
Publication date: 1991-11-20
Also published as: JPS62178999A

Description

[Detailed description of the invention]

以下の順序で本発明を説明する。Ａ産業上の利用分野Ｂ開示の概要Ｃ従来の技術Ｄ発明が解決しようとする問題点Ｅ問題点を解決するための手段Ｆ実施例 F1音声入力信号のラベル化 F2フイーニーム単位のワード・モデルの生成 F3モデル生成および認識手順の概要 F4ラベル語彙生成および音声のラベル・ストリ
ングへの変換 F5ラベル・ストリングを使用するワード・モデ
ルの生成 F6認識プロセス F7ラベル・アルフアベツトの変化 F8表Ｇ発明の効果Ａ産業上の利用分野本発明は音声認識、詳細には所定の語彙中のワ
ードについて統計的なマルコフ・モデルを用いて
音声認識する音声認識システムに係る。Ｂ開示の概要音声認識システムで、ラベルに基づくマルコ
フ・モデルによりワードをモデル化する装置を開
示する。モデル化は、語彙中のワードに対応す
る第１音声入力を音響プロセツサに送り、音響プ
ロセツサは、発声されたワードの各々を標準ラベ
ルの列に変換して、標準ラベルの各々を、時間間
隔に割当て可能な音響タイプに対応させること；
各々の標準ラベルを、複数の状態と、ある状態
からある状態への少なくとも１つの遷移と、ある
遷移で少なくとも１つの設定可能な出力確率とを
有する確率モデルとして表わすこと；選択され
た音響入力を音響プロセツサに送り、音響プロセ
ツサは、選択された音響入力を、それぞれが時間
間隔に割当てられた音響タイプに対応する個人化
ラベルに変換すること；各々の出力確率を、所
与のモデルにより示された標準ラベルの確率とし
て設定し、所与のモデルにおける所与の遷移で特
定の個人化ラベルを生成することを含む。本発明
は、音声認識システムで簡単かつ自動的にワード
のモデルを生成する問題を扱う。Ｃ従来の技術最近の音声認識システムにおいては、ワードの
音響モデル化に関し一般に利用される２つの手法
がある。１つの手法はワード・テンプレートを使
用するもので、ワードを認識するためのマツチン
グ・プロセスは動的計画法（DP）の手順に基づ
く。この手順の例は、音響、音声および信号処理
に関するIEEE会報ASSP第23巻（1975年）67〜
72頁記載のエフ・イタクラの論文“最小予測誤差
原理を応用した音声認識”（F.Itakura、
“
MinimumPredictionResidualPrincipleAppliedt
ｏＳｐｅｅｃｈＲｅｃｏｇｎｉｔｉｏｎ”、IEEE
TransactionsonAcoustics、Speech、
andSignalProcessing、Vol.ASSP−23、1975、
pp67−72）および米国特許第4181821号に示され
ている。もう１つの手法は、確率的なトレーニングおよ
び復号のアルゴリズムに適した音素単位のマルコ
フ・モデルを用いる。この手法および関連手順の
説明は、IEEE会報第64巻（1976年）532〜556頁
記載のエフ・ジエリネクの論文“統計的方法によ
る連続音声認識”（F.Jelinek、“Continuous
Speech Recognitionby Statistical Methods”
Proceedingsofthe IEEE、Vol.64、1976、
pp.532−556）に記載されている。これらのモデルの下記の３つの点は特に重要で
ある。 (a) ワードの個別性：ワード・テンプレートは、
ワードの実際のサンプルから構築されるので、
より良好に認識することができる。音標に基づ
いたモデルは人工の音声を表わす基本形式から
導かれるので、理想的に作られたワードを表わ
し、実際には生起しないことがある。 (b) トレーニング可能性：マルコフ・モデルは、
例えば、（前述のジエリネクの論文に記載され
た）フオワード・バツクワード・アルゴリズム
によりトレーニングすることができるから、テ
ンプレートよりもすぐれている。ワード・テン
プレートは、（前述のイタクラの論文に記載さ
れた）イタクラ距離、スペクトル距離等のよう
な、トレーニングされない距離尺度を用いる。
１つの例外は、IBM研究報告RC5971、1976年
４月号に記載されたアール・バキスの論文“セ
ンチ秒音響状態による連続音声認識”（R.
Bakis、“Continuous Speech Recognition
Via Centisecond Acoustic States”IBM
Research Report RC5971、 April1976）で
用いた、ワード・テンプレートのトレーニング
を可能にする方法である。 (c) 計算速度：音響プロセツサから個別に出力さ
れたアルフアベツトを用いるマルコフ・モデル
は、（前記イタクラが用いた）動的計画法によ
るマツチングまたは（前記バキスが用いた）連
続パラメータのワード・テンプレートよりもか
なり計算速度が速い。Ｄ発明が解決しようとする問題点本発明の目的は、ワード・テンプレートのよう
にワードの個別性を有し、しかも個別のアルフア
ベツトのマルコフ・モデルで使用可能なトレーニ
ング可能性を生じる音響モデル化の方法を提供す
ることである。更に本発明の目的は、単純でしかも認識プロセ
スで高速動作をする音声認識の音響ワード・モデ
ルを提供することである。Ｅ問題点を解決するための手段本発明により、ワード・モデルを生成する場
合、最初に、ワードを表わす音響信号を、個別の
アルフアベツトから標準ラベルのストリングに変
換する。ラベルの各々は、それぞれのワードの時
間間隔を表わす。次いで、標準ラベルの各々は確
率的な（例えばマルコフ）モデルに置き換えら
れ、複数の連続モデル（標準ラベルごとに１モデ
ル）から成る基本形式のモデルを形成する。この
際確率はいまだ入力されていない。次いで、この
ような基本形式モデルを、モデルに適用する統計
値すなわち確率を生成する発声サンプルによりト
レーニングする。その後、そのモデルを実際の音
声認識に使用する。この方法の利点は、生成されたワード・モデル
が音素単位のモデルよりもずつと詳細で、しかも
トレーニング可能であり；パラメータ数は、語彙
の大きさではなく、標準ラベルのアルフアベツト
（種類）の数に左右され；これらのラベル単位の
モデルによる統計的なマツチングは、計算上、ワ
ード・テンプレートによるDPマツチングよりも
ずつと速いことである。これらの利点は、前述のバスキの論文に示され
た手法に関しては著しいものがある。バキスの論
文では、ワードは状態の連続と定義され、各々の
状態は異なつていると見なされる。それゆえ、各
ワードが典型的に60の状態に広がり、語彙が5000
ワードであつた場合、バキスによる手法は、
300000の異なつた状態と見なされることになる。
本発明により、各々の状態は200のオーダのラベ
ルの１つに対応して識別される。本発明は、ワー
ドを構成する200ラベルを、300000の状態として
ではなく単に（ラベルを表わす）番号の列として
記憶することができる記憶装置しか必要としな
い。更に、本発明のラベル単位のモデルによる手
法では、トレーニング・データは少なくて済む。
バキスの論文の手法では、各々の話者はトレーニ
ングのためワードごとに発声しなければならない
が、本発明の場合は、200の標準ラベルのモデル
に関連した値を設定するのに十分なワードを発声
するだけでよい。また、バキスの論文に示された
手法は、例えば“boy”と“boys”のような２つ
のワードを無関係に扱うが、本発明では、“boy”
の標準的なラベルに関するトレーニングは、
“boys”にも適用する。Ｆ実施例 F1音声入力信号のラベル化このシステムにおける音声認識およびモデル生
成の前処理は、音声入力信号を符号に変換して表
示することである。これは例えば、1981年の
ICASSP会報1153〜1155頁の、エイ・ナダス他の
論文“ブートストラツプ化またはクラスタ化から
得られた自動的に選択された音響原型による連続
音声認識”（A.Nadasetal、“Continuous Speech
Recognitionwith Automatically Selected
Acoustic Prototype Obtainedby Either
Bootstrappingor Clustering、”Proceedings
ICASSP1981、pp.1153−1155）に記載された手
順により行なわれる。この変換手順の場合、音響
入力信号の一定長の1/100秒のオーダの間隔がス
ペクトル分析され、その結果生じた情報は、“ラ
ベル”すなわちフイーニーム（feneme、フロン
トエンドから得られる微小音素をこのように呼ぶ
こととする。基本的にはどこのプロセスで得られ
るにしろ、微小な音響タイプを表わすものはこの
中に入る）からなる有限集合（アルフアベツト）
の中から選んだラベルを、間隔の各々に割り当て
る。ラベルの各々は音響タイプ、より具体的には
特有の10ミリ秒音声間隔のスペクトルパターンを
表わす。特有のスペクトル・パターンの初期選
択、すなわちラベル集合の生成も前述のナダス他
の論文に記載されている。本発明においては、種々のラベル集合がある。
第一に、標準ラベルからなる有限アルフアベツト
（ラベル列）がある。標準ラベルは、最初の話者
が通常の音響プロセツサに向つて発声すると生成
される（音響プロセツサは通常の方法を用いてク
ラスタ化およびラベル化を実行する）。標準ラベ
ルは最初の話者の音声に対応する。最初の話者
が、設定された標準ラベルを有する語彙の各ワー
ドを発声すると、音響プロセツサは各ワードを
（第２表および第２図に例示したように）標準ラ
ベルの列に変換する。ワードごとの標準ラベルの
列は記憶装置に書込まれる。第二に、個人化され
たラベル集合がある。これらの個人化ラベルは、
標準ラベルおよびその列が設定された後、次の話
者（最初の話者が再び、または別の話者）が音声
入力を音響プロセツサに供給することにより生成
される。標準ラベル集合およびそれぞれの個人化ラベル
集合のアルフアベツトの各々は、200（この数は異
なつてもよい）のラベルを含むことが望ましい。
標準ラベルと個人化ラベルは、各標準ラベルに関
連した確率モデルにより相互に関係づけられる。
特に、各標準ラベルは、 (a) ある状態からある状態に広がる複数の状態お
よび遷移、 (b)モデルにおける各々の遷移の確率、および (c) 複数のラベル出力確率（所与の遷移における
各出力確率は、次の整形中の話者からの音響入
力に基づいた所与の遷移における特定の個人化
ラベルを作成する標準ラベルのモデルの確率に
対応する）を有するモデルにより表わす。遷移確率およびラベル出力確率は、トレーニン
グ中の話者が既知の発声をするトレーニング期間
中に設定される。マルコフ・モデルをトレーニン
グする手法は既知であるが、下記に簡単に説明す
る。このラベル化手法の重要な特徴は、音響信号に
基づいて自動的に実行でき、従つて音声上の解釈
を必要としないことである。ラベル化手法の詳細については、後に第５図お
よび第６図に関連して説明する。 F2フイーニーム単位のワード・モデルの生成本発明は、簡単かつ自動的で、音素単位のモデ
ルを使用するものよりも正確に表現されるワー
ド・モデルを生成する新規の方法を示す。あるワ
ード・モデルを生成する場合、そのワードを最初
に１回発声し、音響プロセツサにより標準ラベル
のストリングを得る（原理については第２図、サ
ンプルについては第２表参照）。次に、標準ラベ
ルの各々は、最初および最後の状態ならびに状態
間に起こりうる或る遷移を表わすマルコフ・モデ
ル（第３図）に置き換えられる。これらのラベル
のマルコフ・モデルを連結した結果が完全なワー
ド・モデルである。そして、このモデルは、文献、例えば前述のジ
エリネクの論文から知られている他のマルコフ・
モデルの音声認識の場合にトレーニングし使用す
ることができる。各標準ラベルの統計的なモデル
は、そのラベルから形成したワード・モデルとし
て記憶装置に（第３表の形式で）記憶することが
望ましい。第３表において、標準ラベルのモデルM1〜
MNの各々は、３つの遷移を生じることがあり、
各々の遷移はそれに関連した遷移確率を有する。
更に、モデルM1〜MNの各々は各遷移に200の出
力確率（個人化ラベルごとに１つの確率）を有す
る。出力確率の各々は、モデルM1に対応する標
準ラベルが特定の遷移でそれぞれの個人化ラベル
を生成する確率を表わす。遷移確率および出力確
率は話者により異なることがあるのでトレーニン
グ期間中に設定する。必要なら、遷移確率と出力
確率を組合せ、それぞれが所定の出力を生成して
所定の遷移を行なう確率を示す合成確率を形成す
ることができる。与えられたワードについて、最初の話者が発声
すると、ラベルの順序、従つてそれに対応するモ
デルの順序が決まる。最初の話者は語彙のすべて
のワードを発声して、ワードごとに標準ラベル
（およびモデル）のそれぞれの順序を設定する。
その後、トレーニング中に、次の話者は既知の音
響入力（語彙にあるワードが望ましい）を発声す
る。これらの既知の音響入力の発声から、ある特
定の話者の遷移確率および出力確率を決定して記
憶し、モデルをトレーニングする。ちなみに、次
の話者は200のモデルの確率を設定するのに必要
な音響入力数を発声するだけでよい。すなわち、
次の話者は語彙にあるすべてのワードを発声しな
くてもよいが、その代り、200のモデルをトレー
ニングするのに必要な音響入力数だけは発声しな
ければならない。ラベル単位のモデルの使用は、語彙にワードを
追加するうえで更に重要である。語彙にワードを
加えるのに必要なのは、（新しいワードの発声の
ように）ラベルの列を決めることだけである。
（特定の話者の確率を含む）ラベル単位のモデル
は既に記憶装置に書込まれているので、新しいワ
ードに必要なデータはラベルの順序だけである。このモデルの利点は、極めて簡単で容易に生成
できることである。次に、本発明の良好な実施例
の詳細について説明する。本発明の良好な実施例では１つの音響プロセツ
サを用いるように示しているが、処理プログラム
は、例えば、最初に標準ラベルのアルフアベツト
を決める第１のプロセツサ；最初に１回発声した
ワードごとの標準ラベルの列を生成する第２のプ
ロセツサ；個人化ラベルのアルフアベツトを選択
する第３のプロセツサ；トレーニング入力すなわ
ちワードを個人化ラベルのストリングに変換する
第４のプロセツサを含む複数のプロセツサにわた
つて分配することもできる。 F3 モデル生成および認識手順の概要第１図は本発明によるモデル生成および実際の
認識を行う音声認識システムの概要を示す。入力された音声は、音響プロセツサ１で、予め
生成しておいて標準的なラベル・アルフアベツト
２を用いてラベル・ストリングに変換し、初期ス
テツプ３で、各ワードの最初の１回の発声により
生じた標準ラベルのストリング、ならびに前に作
成した基本フイーニーム・マルコフ・モデル４を
用いて、ワードごとにマルコフ・モデルを生成す
る。ラベル単位のマルコフ・モデルは中間的に記
憶される（フイーニーム単位のワード・モデル
５）。その後、トレーニング・ステツプ６で、次
の話者によるいくつかのワードの発声をラベル単
位のモデルとマツチング、モデルごとの遷移およ
び個人化ラベル出力の確率値に関する統計値を生
成する。実際の認識動作では、認識すべき発声か
ら生じる個人化ラベルのストリングを、統計的な
ワードのラベル単位のモデルとマツチングさせ、
個人化ラベルのストリングを生成する最高の確率
を生じるワードまたは複数ワードの識別子を出力
に供給する。第１図で、記号は１回発声（生成）するラベ
ル・ストリングを示し、記号は認識すべきいく
つかの発声をトレーニングするラベル・ストリン
グを示し、記号は実際に認識すべき発声のラベ
ル・ストリングを示す。 F4 ラベル語彙生成および音声のラベル・スト
リングへの変換ラベルのアルフアベツトを生成し、音声を実際
にラベル・ストリングに変換する手順について、
第５図および第６図に関連して説明する（この手
順の説明は、例えば前述のナダス他の論文にも記
載されている）。一般に音声の音響タイプ（詳細にはスペクト
ル・パラメータ）の原型ベクトルを表わす標準ラ
ベルを生成する場合、話者は音声サンプルを得る
ため約５分間しやべる（第５図のブロツク１１）。
音響プロセツサ（その詳細は第６図に関連して説
明する）では、それぞれが10ミリ秒の30000ベク
トルの音声パラメータが得られる（ブロツク１
２）。次いで、これらのベクトルを分析すなわち
ベクトル量子化動作で処理して、それぞれがほぼ
同じベクトルを含む200のクラスタに分類する
（ブロツク１３）。このような手順は既に、研究論
文、例えば、IEEEのASSP雑誌、1940年４月号
の４〜29頁のアーム・エム・グレイの論文“ベク
トル量子化”（R.M.Gray、“Vector
Quantization”、IEEE ASSP Magazine、April
1984、pp.4−29）に開示されている。クラスタの各々について、１つの原型ベクトル
を選択し、その結果生じた200の原型ベクトルを、
後の参照のために記憶する（ブロツク１４）。こ
のようなベクトルの各々は１つの音響要素すなわ
ちラベルを表わす。代表的なラベル・アルフアベ
ツトを第１表に示す。第６図は、音声を音響的に処理して発声のラベ
ル・ストリングを得る手順のブロツク図である。
マイクロホン２１からの音声はＡ／Ｄコンバータ
２２によりデイジタル表示に変換される。ブロツ
ク２３で、20ミリ秒のウインドウをデイジタル表
示から取出し、ウインドウは（いくらかオーバラ
ツプさせて）10ミリ秒ごとに取込む。ウインドウ
の各々について、高速フーリエ変換（FFT）で
スペクトル分析を行ない、各々の10ミリ秒の音声
を表わすベクトルを得る（第６図のブロツク２
４）。これらのベクトルのパラメータは、複数の
スペクトル・バンドのエネルギ値である。このよ
うに得られた現在のベクトルの各々は、ブロツク
２５で、前述の前処理で生成された原型ベクトル
のセツト（ブロツク１４）と比較する。ブロツク
２５で、現在のベクトルに最も近い原型ベクトル
を決定し、ブロツク２６で、この原型のラベルす
なわち識別子を出力する。このように、10ミリ秒
ごとに１つのラベルが出力に現われ、音声信号
は、コード化形式、すなわち200のフイーニー
ム・ラベルのコード化アルフアベツトで使用でき
る。ちなみに、本発明は周期的な間隔で生成され
たラベルに限定しなくてもよく、前記説明は、各
ラベルのそれぞれの時間間隔に対応させているだ
けである。 F5 ラベル・ストリングを使用するワード・モ
デルの生成ワード・モデルを生成する場合、語彙中に必要
とされるワードの各々は、いつたん発声してか
ら、前述のように標準ラベルのストリングに変換
する。１つのワードの標準ラベルのラベル・スト
リングの図形表示は、第２図に示すように、その
ワードを発声したとき音響プロセツサの出力に現
われたy¹，y²，……y^mの列から成る。このストリ
ングをそれぞれのワードの標準ラベルの基本形式
とみなす。ワードの基本モデルを、そのワードの発音の変
化を考慮に入れて生成するには、基本形式のスト
リングのフイーニームyiの各々を、そのフイーニ
ームの基本マルコフ・モデルＭ（yi）と取替える。基本マルコフ・モデルは第３図に示すように極
めて簡単な形式にすることができる。その構成
は、初期状態Si、最終状態Sf、遷移T1および
T2、ならびにナル遷移T0から成る。遷移T1は状
態をSiからSfに導き、１つの個人化ラベルの出力
を表わす。遷移T2は初期状態Siを離れて初期状
態Siに戻り、１つの個人化された出力を表わす。
遷移T0は状態をSiからSfに導くが、個人化ラベ
ルの出力は割当てられない。この基本モデルで
は、 (a) 遷移T1を１回だけ行なうことにより、個人
化ラベルが１つ出現し、 (b) 遷移T2を数回行なうことにより、個人化ラ
ベルがいくつか出現し、 (c) ナル遷移T0を行なうことにより、脱落個人
化ラベルが出現する。第３図に示す同じ基本モデルを、200の異なつ
た標準ラベルのすべてについて選択することがで
きる。もちろん、より複雑な基本モデルを使用し
たり、異なつたモデルをそれぞれの標準ラベルに
割当てたりすることもできるが、本実施例では、
第３図のモデルをすべての標準ラベルに使用す
る。第４図は、第２図に示すラベル単位の基本形式
のワード全体の完全な基本形式のマルコフ・モデ
ルを示す。その構成は、ワードの全標準ラベルの
基本モデルＭ（yi）の簡単な連結から成り、基本
モデルの各々の最終状態を、次の基本モデルの初
期状態に結合する。従つて、ｍ標準ラベルのスト
リングを生じたワードの完全な基本形式のマルコ
フ・モデルはｍ個の基本マルコフ・モデルを含
む。代表的な標準ラベル数（従つて、ワード・モ
デル当りの状態数）は約30〜80である。４ワード
のラベルを第２表に示す。例えば、ワード
“thanks”はラベルPX5で始まり、ラベルPX2で
終了する。ワードの基本形式のマルコフ・モデルを生成す
ることは、それぞれのワードの異なつた状態およ
び遷移ならびにそれらの相互関係を定めることを
意味する。音声認識に役立てるためには、ワード
の基本形式のモデルは、それを数回の発声により
整形する、すなわちモデルにおける遷移ごとに統
計値を蓄積して統計的モデルを作らなければなら
ない。同じ基本モデルがいくつかの異なつたワードで
現われるので、モデルをトレーニングするため各
ワードを数回発声する必要はない。このようなトレーニングはいわゆる“フオワー
ド・バツクワード・アルゴリズム”（研究論文、
例えば前述のジエリネクの論文に記載されてい
る）により行なうことができる。トレーニングの結果、モデル内の各遷移に確率
値が割当てられる。例えば、１つの特定の状態の
場合、T1が0.5、T2が0.4、T0が0.1の確率値にな
ることがある。更に、非ナル遷移T1およびT2の
各々について、それぞれの遷移が生じたとき、
200の個人化ラベルごとにその出現確率がいくら
であるかを示す確率のリストが与えられる。ワー
ドの統計的モデル全体は第３表に示すようなリス
トすなわち表の形式をとる。各々の基本モデルま
たは標準ラベルは表の１つの欄に示され、各々の
遷移は、個々の個人化ラベルの確率の要素（更に
それぞれの遷移をとる全体的な確率）を有する行
すなわちベクトルに対応する。すべてのワードのモデルを実際に記憶する場
合、ワードごとに、ワードを構成する基本マルコ
フ・モデルの識別子を成分とする１つのベクトル
を記憶し、アルフアベツトの200の標準ラベルご
とに記憶されている確率値を、１つの統計的マル
コフ・モデルに包含すれば十分である。従つて、
第３表に示す統計的なワード・モデルは、実際に
はそのように記憶する必要はなく、分散形式で記
憶して、必要に応じてデータを組合せることがで
きる。 F6 認識プロセス実際に音声認識を行う際には、最初の項で説明
したように、発声を個人化ラベルのストリングに
変換する。次いで、これらの個人化ラベルのスト
リングはワード・モデルの各々と突合わせ、その
モデルが表わすワードの発声により個人化ラベル
のストリングが生じた確率を得る。特に、マツチ
ングはそれぞれのワードのマツチ・スコアに基づ
いて実行する。各々のマツチ・スコアは前述のジ
エリネクの論文で説明した“フオワード確率”を
表わす。ラベル単位のモデルを用いる認識プロセ
スは、音素単位のモデルを用いる既知のプロセス
に類似している。最高の確率を有するワードまた
は複数のワードが出力として選択される。 F7 ラベル・アルフアベツトの変化最初のワード・モデル生成に用いる標準ラベ
ル・アルフアベツトとトレーニングおよび認識に
用いる個人化ラベルの集合とは、一般的ではある
が、すべてが同一ではないことがある。けれど
も、認識中に生成されたラベルに関連した確率値
は、トレーニング中に生成された個人化ラベルと
いくらか異なることがあつても、実際の認識結果
は大体正しい。これはマルコフ・モデルの適応性
により可能である。しかしながら、それぞれのトレーニングおよび
認識アルフアベツトの変化が過度に大きい場合、
精度に影響することもある。また、最初の話者と
非常に異なる音声の次の話者がモデルをトレーニ
ングするのに用いる確率は、精度に限界を生じる
ことがある。 F8 表第１表は代表的なラベル・アルフアベツトを示
す。表中、２文字は要素の音を大まかに表わす。２桁の数字は母音に関連し、第１の数字は音の
アクセント、第２の数字は最新の識別番号を表わ
す。１桁の数字は子音に関連し、最新の識別番号を
表わす。第２表は４ワードのラベル・ストリングのサン
プルを示す。第３表はワードの統計的マルコフ・モデル（フ
エネメ単位のモデル）を示す。

The present invention will be described in the following order.

A. Industrial field of application
B. Summary of the disclosure
C. Prior art
D. Problems to be solved by the invention
E. Means for solving the problems
F. Embodiment
  F1. Labeling of speech input signals
  F2. Generation of feneme-based word models
  F3. Overview of model generation and recognition procedures
  F4. Generation of the label vocabulary and conversion of speech into label strings
  F5. Generation of word models using label strings
  F6. Recognition process
  F7. Variation of the label alphabet
  F8. Tables
G. Effects of the invention

A. Industrial field of application

The present invention relates to speech recognition, and more particularly to a speech recognition system that recognizes speech using statistical Markov models for the words of a predetermined vocabulary.

B. Summary of the disclosure

An apparatus for modeling words by label-based Markov models in a speech recognition system is disclosed. The modeling includes: sending a first speech input corresponding to words of the vocabulary to an acoustic processor, which converts each spoken word into a sequence of standard labels, each standard label corresponding to an acoustic type assignable to a time interval;
representing each standard label as a probabilistic model having a plurality of states, at least one transition from a state to a state, and at least one settable output probability at some transition; sending selected acoustic inputs to the acoustic processor, which converts them into personalized labels, each corresponding to an acoustic type assigned to a time interval; and setting each output probability as the probability that the standard label represented by a given model produces a particular personalized label at a given transition in that model. The invention addresses the problem of generating word models simply and automatically in a speech recognition system.

C. Prior art

In recent speech recognition systems, two approaches are commonly used for the acoustic modeling of words. One approach uses word templates, and the matching process for recognizing a word is based on a dynamic programming (DP) procedure. An example of this procedure is given in F. Itakura, "Minimum Prediction Residual Principle Applied to Speech Recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-23, 1975, pp. 67-72, and in U.S. Patent No. 4,181,821.

The other approach uses phoneme-based Markov models that are suited to probabilistic training and decoding algorithms. This approach and related procedures are described in F. Jelinek, "Continuous Speech Recognition by Statistical Methods," Proceedings of the IEEE, Vol. 64, 1976, pp. 532-556.

The following three aspects of these models are particularly important.

(a) Word specificity: word templates are built from actual samples of the word and can therefore recognize it better. Phonetically based models are derived from baseforms that represent artificial speech, so they represent an idealized word that may not occur in practice.

(b) Trainability: Markov models are superior to templates because they can be trained, for example by the forward-backward algorithm (described in the Jelinek paper cited above). Word templates use distance measures that are not trained, such as the Itakura distance (described in the Itakura paper cited above) or a spectral distance.
One exception is the method that enables training of word templates, used in R. Bakis, "Continuous Speech Recognition via Centisecond Acoustic States," IBM Research Report RC5971, April 1976.

(c) Computation speed: Markov models that use a discrete alphabet output by the acoustic processor are considerably faster to compute than matching by dynamic programming (as used by Itakura) or continuous-parameter word templates (as used by Bakis).

D. Problems to be solved by the invention

An object of the invention is to provide an acoustic modeling method that has the word specificity of word templates and also offers the trainability available with discrete-alphabet Markov models.

A further object of the invention is to provide acoustic word models for speech recognition that are simple yet operate at high speed in the recognition process.

E. Means for solving the problems

According to the invention, to generate a word model, the acoustic signal representing the word is first converted into a string of standard labels from a discrete alphabet. Each label represents a time interval of the word. Each standard label is then replaced by a probabilistic (for example, Markov) model, forming a baseform model composed of a sequence of successive models (one model per standard label). At this point no probabilities have been entered. The baseform models are then trained with sample utterances, which generate the statistics, that is, the probabilities, applied to the models. The models are then used for actual speech recognition.

The advantages of this method are that the generated word models are much more detailed than phoneme-based models and yet trainable; that the number of parameters depends on the size of the standard label alphabet rather than on the size of the vocabulary; and that statistical matching with these label-based models is computationally much faster than DP matching with word templates.

These advantages are notable in comparison with the approach of the Bakis paper cited above. In the Bakis paper, a word is defined as a sequence of states, and each state is regarded as distinct.
Consequently, if each word typically spans 60 states and the vocabulary contains 5000 words, the Bakis approach has to deal with 300,000 distinct states. According to the invention, each state is identified with one of a set of labels numbering on the order of 200. The invention therefore needs only enough storage to hold the words as sequences of numbers (each number denoting one of the 200 labels), rather than as 300,000 states.

Furthermore, the label-based model approach of the invention requires less training data.
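The size comparison above can be checked with a few lines of arithmetic. The sketch below is only an illustration of the figures quoted in the text (60 states per word, 5000 words, about 200 labels); the function and constant names are made up for this example.

```python
# Illustrative arithmetic only; the figures are the ones quoted in the text.
STATES_PER_WORD = 60        # typical number of states per word
VOCABULARY_SIZE = 5000      # number of words in the vocabulary
LABEL_ALPHABET_SIZE = 200   # order of magnitude of the standard label alphabet


def distinct_states_word_based(states_per_word, vocabulary_size):
    """Word-specific states: every state of every word is a separate model."""
    return states_per_word * vocabulary_size


def distinct_models_label_based(alphabet_size):
    """Label-based models: one small model per label, shared by all words."""
    return alphabet_size


word_based = distinct_states_word_based(STATES_PER_WORD, VOCABULARY_SIZE)
label_based = distinct_models_label_based(LABEL_ALPHABET_SIZE)
print(word_based, label_based, word_based // label_based)
```

Under these assumptions the number of distinct models to be trained no longer grows with the vocabulary; each word adds only a sequence of label numbers.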
In the Bakis approach, each speaker must utter every word for training, whereas with the invention a speaker need only utter enough words to set the values associated with the 200 standard label models. Moreover, the Bakis approach treats two words such as "boy" and "boys" as unrelated, whereas in the invention training on the standard labels of "boy" also applies to "boys".

F. Embodiment

F1. Labeling of speech input signals

The preprocessing for speech recognition and model generation in this system is the conversion of the speech input signal into a coded representation. This is done, for example, by the procedure described in A. Nadas et al., "Continuous Speech Recognition with Automatically Selected Acoustic Prototypes Obtained by Either Bootstrapping or Clustering," Proceedings ICASSP 1981, pp. 1153-1155. In this conversion procedure, fixed-length intervals of the acoustic input signal, on the order of 1/100 second each, are spectrally analyzed, and on the basis of the resulting information each interval is assigned a label chosen from a finite set (alphabet) of "labels", or fenemes. (The term feneme is used here for the very small phoneme-like units obtained from the front end; in principle, anything that denotes a minute acoustic type falls under this term, whatever process produces it.) Each label represents an acoustic type, more specifically a characteristic spectral pattern of a 10-millisecond speech interval. The initial selection of the characteristic spectral patterns, that is, the generation of the label set, is also described in the Nadas et al. paper.

The invention uses several label sets. First, there is a finite alphabet of standard labels. The standard labels are generated when a first speaker speaks into a conventional acoustic processor, which performs clustering and labeling by conventional methods. The standard labels thus correspond to the first speaker's speech. Once the standard labels are established and the first speaker utters each word of the vocabulary, the acoustic processor converts each word into a sequence of standard labels (as illustrated in Table 2 and FIG. 2). The sequence of standard labels for each word is written to storage. Second, there are sets of personalized labels. These personalized labels are generated when, after the standard labels and their sequences have been established, a subsequent speaker (the first speaker again, or a different speaker) supplies speech input to the acoustic processor. The alphabet of the standard label set and of each personalized label set preferably contains 200 labels, although this number may differ.
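As a rough illustration of labeling, the sketch below assigns to each 10-millisecond interval the label of the closest stored spectral pattern. It assumes, as the later description of FIG. 6 states, that the closest prototype determines the label; the Euclidean distance, the two-dimensional vectors, and the label names are assumptions made only for this example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_label(vector, prototypes):
    """Return the label whose prototype vector is closest to `vector`."""
    return min(prototypes, key=lambda name: squared_distance(vector, prototypes[name]))


def to_label_string(vectors, prototypes):
    """Convert one vector per time interval into a string (list) of labels."""
    return [nearest_label(v, prototypes) for v in vectors]


# Hypothetical three-label alphabet and four intervals of "speech".
prototypes = {"AA1": (1.0, 0.0), "EE2": (0.0, 1.0), "SX3": (1.0, 1.0)}
intervals = [(0.9, 0.1), (0.8, 0.2), (0.2, 0.9), (0.9, 1.1)]
labels = to_label_string(intervals, prototypes)
print(labels)
```

Running the same function with a different prototype table would give a different label string for the same speech, which is one way to picture how a standard label set and a personalized label set can differ.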
The standard labels and the personalized labels are related to each other through the probabilistic model associated with each standard label. Specifically, each standard label is represented by a model having:

(a) a plurality of states and transitions, each transition extending from a state to a state;
(b) a probability for each transition in the model; and
(c) a plurality of label output probabilities, where each output probability at a given transition corresponds to the probability that the model of the standard label produces a particular personalized label at that transition, based on acoustic input from the subsequent speaker who is training the model.

The transition probabilities and label output probabilities are set during a training session in which the training speaker utters known text. Techniques for training Markov models are known and are briefly described below.

An important feature of this labeling technique is that it can be carried out automatically on the basis of the acoustic signal and therefore requires no phonetic interpretation.

Details of the labeling technique are described later with reference to FIGS. 5 and 6.

F2. Generation of feneme-based word models

The invention provides a new method of generating word models that is simple and automatic and yields a more accurate representation than models using phoneme-based units. To generate the model of a word, the word is first uttered once and the acoustic processor produces a string of standard labels (see FIG. 2 for the principle and Table 2 for samples). Each standard label is then replaced by a Markov model (FIG. 3) that has an initial state, a final state, and certain possible transitions between states. Concatenating the Markov models of these labels yields the complete word model.

This model can then be trained and used for speech recognition in the same way as other Markov models known from the literature, for example the Jelinek paper cited above. The statistical model of each standard label is preferably kept in storage (in the form of Table 3), together with the word models formed from the labels.

In Table 3, each of the standard label models M1 to MN has three possible transitions, and each transition has an associated transition probability. In addition, each of the models M1 to MN has 200 output probabilities at each transition, one for each personalized label. Each output probability represents the probability that the standard label corresponding to the model (for example, M1) produces the respective personalized label at the particular transition. Because the transition probabilities and output probabilities can differ from speaker to speaker, they are set during the training session. If desired, the transition probabilities and output probabilities can be combined into composite probabilities, each indicating the probability of producing a given output while taking a given transition.

For a given word, the utterance by the first speaker determines the order of the labels and hence the order of the corresponding models. The first speaker utters all the words of the vocabulary to establish the respective sequence of standard labels (and models) for each word.
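To make this structure concrete, the following sketch stores one small model per standard label and represents a word purely as a sequence of label identifiers. It also computes the forward probability that the concatenated word model generates a given personalized-label string, which is the kind of match score the description later uses for recognition. It assumes the three-transition layout given later for FIG. 3 (T1 and T2 emit a label, the null transition T0 emits nothing); the class and function names and the toy two-label alphabet are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LabelModel:
    """Model of one standard label (assumed FIG. 3 layout)."""
    p_t1: float   # Si -> Sf, emits one personalized label
    p_t2: float   # Si -> Si (self-loop), emits one personalized label
    p_t0: float   # Si -> Sf, null transition, emits nothing
    out_t1: dict  # personalized label -> output probability at T1
    out_t2: dict  # personalized label -> output probability at T2


def forward_probability(word, models, observed):
    """Probability that the concatenated models of `word` emit `observed`.

    `word` is a list of standard-label identifiers; state s is the initial
    state of the s-th model and state len(word) is the final state.
    """
    n = len(word)

    def follow_null_transitions(alpha):
        # Null transitions only move forward, so one left-to-right pass suffices.
        for s in range(n):
            alpha[s + 1] += alpha[s] * models[word[s]].p_t0

    alpha = [1.0] + [0.0] * n
    follow_null_transitions(alpha)
    for label in observed:
        nxt = [0.0] * (n + 1)
        for s in range(n):
            m = models[word[s]]
            nxt[s] += alpha[s] * m.p_t2 * m.out_t2.get(label, 0.0)
            nxt[s + 1] += alpha[s] * m.p_t1 * m.out_t1.get(label, 0.0)
        follow_null_transitions(nxt)
        alpha = nxt
    return alpha[n]


# Toy model: transition probabilities 0.5 / 0.4 / 0.1, uniform outputs.
uniform = {"a": 0.5, "b": 0.5}
models = {"L1": LabelModel(0.5, 0.4, 0.1, uniform, uniform)}
print(forward_probability(["L1"], models, ["a"]))
```

Because a word is only a list of label identifiers, adding a word in this sketch means adding one list; the per-label probabilities in `models` are shared by every word.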
Thereafter, during training, a subsequent speaker utters known acoustic input, preferably words of the vocabulary. From the utterances of this known input, the transition probabilities and output probabilities for that particular speaker are determined and stored, which trains the models. Note that the subsequent speaker need only utter as much acoustic input as is required to set the probabilities of the 200 models. That is, the subsequent speaker does not have to utter every word of the vocabulary, but must utter enough acoustic input to train the 200 models.

The use of label-based models is of further significance when words are added to the vocabulary. All that is needed to add a word is to determine its label sequence, for example from an utterance of the new word. Because the label-based models (including the probabilities for the particular speaker) have already been written to storage, the only data required for the new word is its sequence of labels.

The advantage of this model is that it is extremely simple and easy to generate. A preferred embodiment of the invention is described in detail below.

Although the preferred embodiment is shown with a single acoustic processor, the processing may also be distributed over several processors, for example a first processor that initially determines the alphabet of standard labels; a second processor that generates the sequence of standard labels for each word from its first single utterance; a third processor that selects the alphabet of personalized labels; and a fourth processor that converts training inputs or words into strings of personalized labels.

F3. Overview of model generation and recognition procedures

FIG. 1 gives an overview of a speech recognition system according to the invention that performs model generation and actual recognition.

Input speech is converted into label strings in acoustic processor 1 using the previously generated standard label alphabet 2. In initial step 3, a Markov model is generated for each word from the string of standard labels produced by the first single utterance of the word and from the previously created elementary feneme Markov models 4. The label-based Markov models are stored intermediately (feneme-based word models 5). Then, in training step 6, utterances of a number of words by a subsequent speaker are matched against the label-based models to generate statistics on the probability values of the transitions and of the personalized label outputs of each model. In the actual recognition operation, the string of personalized labels resulting from an utterance to be recognized is matched against the statistical label-based word models.
The identifier of the word or words having the highest probability of producing the string of personalized labels is supplied at the output. In FIG. 1, three symbols (not reproduced in this text) denote, respectively, the label strings of the single model-generating utterances, the label strings of the training utterances, and the label strings of the utterances actually to be recognized.

F4. Generation of the label vocabulary and conversion of speech into label strings

The procedure for generating the label alphabet and for actually converting speech into label strings is described with reference to FIGS. 5 and 6; FIG. 6 is a block diagram of the acoustic processing that produces the label string of an utterance. (A description of this procedure is also given, for example, in the Nadas et al. paper cited above.)

In general, to generate the standard labels, which represent prototype vectors of the acoustic types (more precisely, of the spectral parameters) of speech, a speaker talks for about five minutes to provide a speech sample (block 11 of FIG. 5). In the acoustic processor (details of which are described with reference to FIG. 6), 30,000 vectors of speech parameters are obtained, each representing 10 milliseconds (block 12). These vectors are then processed in an analysis, or vector quantization, operation that groups them into 200 clusters, each containing vectors that are nearly alike (block 13). Such procedures have been disclosed in the literature, for example in R. M. Gray, "Vector Quantization," IEEE ASSP Magazine, April 1984, pp. 4-29.

For each cluster, one prototype vector is selected, and the resulting 200 prototype vectors are stored for later reference (block 14). Each such vector represents one acoustic element, or label. A representative label alphabet is shown in Table 1.
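The prototype-generation steps of FIG. 5 (blocks 11 to 14) can be sketched as follows. This is a minimal illustration that uses plain k-means as the vector quantizer and takes the centroid of each cluster as its prototype; the text does not say which clustering method or which prototype-selection rule is used, so both are assumptions, as are the function names and the toy data.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    count = len(vectors)
    return tuple(sum(v[i] for v in vectors) / count for i in range(len(vectors[0])))


def build_prototypes(vectors, k, iterations=10):
    """Group `vectors` into k clusters and return one prototype per cluster."""
    step = len(vectors) // k
    prototypes = [vectors[i * step] for i in range(k)]  # deterministic start
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda j: squared_distance(v, prototypes[j]))
            clusters[nearest].append(v)
        # Keep the old prototype if a cluster happens to be empty.
        prototypes = [mean_vector(c) if c else prototypes[j]
                      for j, c in enumerate(clusters)]
    return prototypes


# Five minutes of speech at one vector per 10 ms gives the 30,000 vectors in the text.
vectors_in_sample = (5 * 60 * 1000) // 10

# Toy data: two obvious groups instead of 30,000 vectors and 200 clusters.
toy = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
toy_prototypes = build_prototypes(toy, k=2)
print(vectors_in_sample, toy_prototypes)
```

With the real figures, `k` would be 200 and each stored prototype would carry the identifier of one label of the alphabet.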
In FIG. 6, speech from microphone 21 is converted to a digital representation by A/D converter 22. In block 23, 20-millisecond windows are extracted from the digital representation, one window every 10 milliseconds (so that successive windows partly overlap). Each window is spectrally analyzed by a fast Fourier transform (FFT), yielding one vector for each 10 milliseconds of speech (block 24 of FIG. 6). The parameters of these vectors are the energy values of a number of spectral bands. In block 25, each current vector obtained in this way is compared with the set of prototype vectors generated in the preprocessing described above (block 14). Block 25 determines the prototype vector closest to the current vector, and block 26 outputs the label, or identifier, of that prototype. Thus one label appears at the output every 10 milliseconds, and the speech signal becomes available in coded form, namely in the coded alphabet of 200 feneme labels. Note that the invention need not be limited to labels generated at periodic intervals; the description merely associates each label with a respective time interval.

F5. Generation of word models using label strings

To generate the word models, each word required in the vocabulary is uttered once and converted into a string of standard labels as described above. As shown schematically in FIG. 2, the standard label string of a word consists of the sequence y1, y2, ..., ym that appeared at the output of the acoustic processor when the word was uttered. This string is taken as the standard-label baseform of the word.

To generate a baseform model of the word that takes variations in its pronunciation into account, each feneme yi of the baseform string is replaced by the elementary Markov model M(yi) of that feneme.

The elementary Markov model can be of a very simple form, as shown in FIG. 3. It consists of an initial state Si, a final state Sf, transitions T1 and T2, and a null transition T0. Transition T1 leads from Si to Sf and represents the output of one personalized label. Transition T2 leaves initial state Si and returns to Si, and likewise represents the output of one personalized label.
Transition T0 leads the state from Si to Sf, but the output of the personalization label is not assigned. In this basic model, (a) one personalized label appears by performing transition T1 only once, (b) several personalized labels appear by performing transition T2 several times, and (c ) By performing the null transition T0, a dropped personalization label appears. The same basic model shown in Figure 3 can be selected for all 200 different standard labels. Of course, more complex basic models could be used or different models could be assigned to each standard label, but in this example,
The model in Figure 3 is used for all standard labels. FIG. 4 shows a complete basic form Markov model for the entire label-based basic form word shown in FIG. Its construction consists of a simple concatenation of basic models M(yi) of all standard labels of a word, connecting the final state of each basic model to the initial state of the next basic model. Thus, a complete basic form Markov model for a word that yields a string of m standard labels includes m basic Markov models. A typical standard number of labels (and thus number of states per word model) is approximately 30-80. The four word labels are shown in Table 2. For example, the word "thanks" begins with label PX5 and ends with label PX2. Generating a Markov model of the basic form of a word means defining the different states and transitions of each word and their interrelationships. To be useful for speech recognition, a model of the basic form of a word must be reshaped by uttering it several times, that is, a statistical model must be created by accumulating statistical values at each transition in the model. Since the same basic model appears in several different words, there is no need to utter each word several times to train the model. This kind of training is called the “forward-backward algorithm” (research paper,
For example, as described in the above-mentioned paper by Zielinek). As a result of training, each transition in the model is assigned a probability value. For example, for one particular state, T1 may have a probability value of 0.5, T2 may have a probability value of 0.4, and T0 may have a probability value of 0.1. Furthermore, for each of the non-null transitions T1 and T2, when the respective transition occurs,
For each of the 200 personalized labels, a list of probabilities indicates how likely that label is to appear. The entire statistical model of a word takes the form of a list or table, as shown in Table 3. Each basic model (that is, each standard label) occupies one section of the table, and each transition corresponds to a row, or vector, whose elements are the probabilities of the individual personalized labels (together with the overall probability of taking that transition). When the models for all words are actually stored, it is sufficient to store, for each word, one vector whose components are the identifiers of the basic Markov models that make up the word, and, for each of the 200 standard labels of the alphabet, one statistical Markov model containing the stored probability values.
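The storage scheme just described can be sketched as two separate stores that are combined on demand. All names here (`LabelModel`, `models`, `word_vectors`, `expand`) and the word entries are hypothetical, and the uniform output probabilities are placeholders standing in for trained values. Only the two alphabet sizes of 200 and the example transition probabilities come from the text.

```python
from dataclasses import dataclass

NUM_STANDARD = 200      # standard labels, one elementary model each
NUM_PERSONALIZED = 200  # personalized labels that a transition can emit


@dataclass
class LabelModel:
    transition_prob: dict  # {"T1": p, "T2": p, "T0": p}
    output_prob: dict      # {"T1": [...], "T2": [...]}, one value per personalized label


def placeholder_model():
    """Untrained placeholder; real values would come from training."""
    uniform = [1.0 / NUM_PERSONALIZED] * NUM_PERSONALIZED
    return LabelModel(
        {"T1": 0.5, "T2": 0.4, "T0": 0.1},
        {"T1": list(uniform), "T2": list(uniform)},
    )


# Shared store: one statistical model per standard label (ids 0..199 assumed).
models = {label: placeholder_model() for label in range(NUM_STANDARD)}

# Per-word store: only a vector of model identifiers (hypothetical entries).
word_vectors = {"word_a": [3, 17, 17, 42], "word_b": [42, 3]}


def expand(word):
    """Combine both stores into the full per-word table when it is needed."""
    return [(label, models[label]) for label in word_vectors[word]]


def shared_parameter_count():
    """Probabilities held in the shared store, independent of vocabulary size."""
    per_model = 3 + 2 * NUM_PERSONALIZED  # 3 transition values + 2 output vectors
    return NUM_STANDARD * per_model


print(len(expand("word_a")))
print(shared_parameter_count())
```

Adding a word to the vocabulary in this layout adds only one short identifier vector, while the shared store of probability values stays the same size.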
The statistical word model shown in Table 3 does not actually need to be stored as such; it can be stored in this distributed form and the data combined as required.

F6 Recognition Process

In actual speech recognition, utterances are converted into strings of personalized labels, as explained in the first section. Each personalized label string is then matched against each word model to obtain the probability that the string was produced by an utterance of the word represented by that model. Specifically, matching is based on a match score for each word. Each match score represents a "forward probability" as described in the aforementioned paper by Jelinek. The recognition process using label-unit models is similar to the known process using phoneme-unit models. The word or words with the highest probability are selected as output.

F7 Changes in the Label Alphabet

The standard label alphabet used to generate the initial word models and the set of personalized labels used for training and recognition are generally the same, but they need not be entirely identical. Even if the probability values associated with the labels generated during recognition differ somewhat from those of the personalized labels generated during training, the actual recognition results are generally correct. This is possible because of the adaptability of Markov models. However, if the training and recognition alphabets differ too much, accuracy may be affected. Also, for a subsequent speaker whose voice is very different from that of the first speaker, the probabilities obtained when the model was trained may limit accuracy.

F8 Tables

Table 1 shows a typical label alphabet. In the table, the two letters roughly represent the sound of the element. Two digits are associated with vowels: the first digit represents the stress of the sound and the second is a current identification number. A single digit is associated with consonants and is a current identification number. Table 2 shows sample label strings for four words. Table 3 shows the statistical Markov model of a word (feneme-unit model).
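The match score of section F6 can be illustrated with a small forward-probability sketch. It is not the patent's implementation. It assumes each standard label's model has a forward transition T1, a self-loop T2 and a null transition T0, and that word models are simple concatenations. The two-label toy alphabet, the models "A" and "B", and the vocabulary entries are hypothetical values chosen only to make the example runnable.

```python
def forward_score(word_labels, observations, models):
    """Forward probability that the word model produces `observations`.

    word_labels : standard labels of the word, one elementary model each
    observations: personalized labels as integer indices
    models      : {standard_label: (transition_probs, output_probs)}
    """
    m = len(word_labels)
    # alpha[s]: probability of having produced the labels so far and being in state s
    alpha = [0.0] * (m + 1)
    alpha[0] = 1.0
    for s in range(1, m + 1):  # null transitions taken before any label is seen
        trans, _ = models[word_labels[s - 1]]
        alpha[s] = alpha[s - 1] * trans["T0"]
    for obs in observations:
        new = [0.0] * (m + 1)
        for s in range(m + 1):
            if s < m:  # T2: self-loop on state s emits the label
                trans, out = models[word_labels[s]]
                new[s] += alpha[s] * trans["T2"] * out["T2"][obs]
            if s > 0:  # arcs arriving from state s - 1
                trans, out = models[word_labels[s - 1]]
                new[s] += alpha[s - 1] * trans["T1"] * out["T1"][obs]
                new[s] += new[s - 1] * trans["T0"]  # null arc consumes no label
        alpha = new
    return alpha[m]


# Hypothetical toy data: two personalized labels (0 and 1), two standard labels.
TRANS = {"T1": 0.5, "T2": 0.4, "T0": 0.1}
models = {
    "A": (TRANS, {"T1": [0.9, 0.1], "T2": [0.9, 0.1]}),
    "B": (TRANS, {"T1": [0.1, 0.9], "T2": [0.1, 0.9]}),
}
vocabulary = {"word_ab": ["A", "B"], "word_ba": ["B", "A"]}


def recognize(observations):
    """Score every word model and return the best word with all scores."""
    scores = {
        word: forward_score(labels, observations, models)
        for word, labels in vocabulary.items()
    }
    return max(scores, key=scores.get), scores


best, scores = recognize([0, 1])
print(best, scores)
```

A practical system would normally work with log probabilities or scaling to avoid underflow on strings of 30 to 80 labels; plain probabilities are used here only to keep the sketch short.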

[Table]

[Table]
...
MN   SN   TN1   0.6   ...
          TN2   0.3   ...
          TN0   0.1   − − −

G. Effects of the Invention

According to the present invention:
(a) label-unit Markov models represent words at a much more detailed level and are therefore superior to phoneme-unit models;
(b) unlike word templates that use DP matching, label-unit word models can be trained with the forward-backward algorithm;
(c) the number of parameters is determined by the size of the feneme alphabet rather than the size of the vocabulary, so the required storage capacity grows slowly compared with the growth of the vocabulary;
(d) recognition procedures using label-unit models are computationally much faster than DP matching and the use of continuous-parameter word templates; and
(e) word modeling can be performed automatically.

[Brief explanation of drawings]

FIG. 1 is a block diagram of the model generation and recognition procedure according to the present invention; FIG. 2 is a diagram showing the label string of a word obtained from the acoustic processor; FIG. 3 is a diagram showing the basic Markov model for one label; FIG. 4 is a diagram showing the basic form of a word generated by replacing each of the standard labels of the string shown in FIG. 2 with a basic Markov model; FIG. 5 is a block diagram showing the process of initial generation of the standard label alphabet; and FIG. 6 is a block diagram showing the operation of the acoustic processor in deriving the personalized label string of an uttered word.

1 ... acoustic processor, 2 ... label alphabet, 3 ... initial step, 4 ... feneme Markov model, 5 ... word model, 6 ... training step, 7 ... statistical Markov model, 8 ... recognition process, 12 ... acoustic processor.

Claims

[Scope of Claims]

1. A speech recognition device characterized by comprising the following (a), (b) and (c):
(a) Means for generating, in response to an unknown speech input, a corresponding sequence of recognition labels from a set of recognition labels, each of which represents an acoustic type that can be assigned to a minute time interval.
(b) Means for storing, as a word model, a sequence of Markov models formed by selecting a plurality of Markov models from a set of Markov models that correspond respectively to a set of standard labels, each of which represents an acoustic type that can be assigned to a minute time interval. Each word model is defined by extracting the corresponding standard labels according to the utterance of a word in the word vocabulary, forming the string of standard labels generated for that word, and connecting the Markov models corresponding to the respective standard labels in the string. The Markov model corresponding to each of the standard labels is common to all words in the word vocabulary and includes (i) a plurality of states, (ii) at least one transition from one state to another, and (iii) for at least one transition, the probability that each of the set of recognition labels is generated by the label generating means.
(c) Means for generating, by referring to the word models in the storing means, the probability that the sequence of labels generated in response to the unknown speech input is generated for each word in the word vocabulary.

2. The speech recognition device according to claim 1, wherein the acoustic types of the set of standard labels differ from the acoustic types of the set of recognition labels.