JPH0372996B2

Info

Publication number: JPH0372996B2
Application number: JP61058464A
Authority: JP
Inventors: Rai Booru Raritsuto; Uinsento Desoza Piitaa; Reroi Maasaa Robaato
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 1986-03-18
Filing date: 1986-03-18
Publication date: 1991-11-20
Also published as: JPS62220996A

Description

[Detailed description of the invention]

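The invention is described in the following order.

A. Field of industrial application
B. Prior art
C. Problems to be solved by the invention
D. Means for solving the problems
E. Embodiments
  E-1 Speech recognition system
    E-1A Overview of the configuration (Figs. 1 and 2)
    E-1B Auditory model and its implementation (Fig. 4)
    E-1C Detailed match (Figs. 3 and 11)
    E-1D Basic fast match (Fig. 12)
    E-1E Alternative fast match (Figs. 15 and 16)
    E-1F Matching based on the first J labels (Fig. 16)
    E-1G Phone tree structure and fast match (Fig. 17)
    E-1H Language model
    E-1J Stack decoder (Figs. 21 and 22)
    E-1K Constructing phonetic baseforms (Fig. 3)
    E-1L Constructing fenemic baseforms (Fig. 23)
  E-2 Selecting likely words from the vocabulary by polling (Figs. 25 to 29)
F. Effects of the invention
G. Tables

A. Field of industrial application

This invention relates generally to speech recognition, and in particular to a technique for forming a short list of likely words selected from a vocabulary of words.

B. Prior art

In a probabilistic approach to speech recognition, an acoustic processor first converts the speech waveform into a string of labels, or fenemes. Each label represents a type of sound and is typically selected from an alphabet of about 200 distinct labels. The generation of such labels is discussed in various articles, such as "Continuous Speech Recognition by Statistical Methods", Proceedings of the IEEE, vol. 64, pp. 532-556 (1976).

In employing labels for speech recognition, Markov model phone machines (also called probabilistic finite-state machines) have been discussed. A Markov model normally has a plurality of states and transitions between the states. In addition, a Markov model normally has probabilities assigned to it relating to (a) the probability of each transition occurring and (b) the respective probability of producing each label at the various transitions. Markov models are described in, for example, L. R. Bahl, F. Jelinek and R. L. Mercer, "A Maximum Likelihood Approach to Continuous Speech Recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-5, no. 2, March 1983.

In speech recognition, a matching process is performed to determine which word in the vocabulary is most likely, given the string of labels produced by the acoustic processor.

As set forth in the applicant's U.S. patent application No. 672974, filed November 19, 1984, acoustic matching is performed by (a) characterizing each word in the vocabulary by a sequence of Markov model phone machines and (b) determining the respective likelihood that each word-representing sequence of phone machines produces the string of labels generated by the acoustic processor. Each word-representing sequence of phone machines is called a word baseform.

As described in that application, a word baseform may be made up of phonetic phone machines. In that case, each phone machine preferably corresponds to a phonetic sound and has seven states and thirteen transitions.

Alternatively, each word baseform may be formed as a sequence of fenemic phone machines. A fenemic phone machine is a simpler Markov model than a phonetic phone machine and preferably consists of two states. Between the first state and the second state there is a null transition, at which no label can be produced. Between the first and second states there is also a non-null transition, at which one label from the label alphabet can be produced. At the first state there is a self-loop at which a label can be produced. During a training phase, statistics are determined for each fenemic phone machine: for each feneme, the probability of each transition and the probability of each label being produced at each non-null transition are computed from known utterances. Word baseforms are formed by concatenating fenemic phone machines.

In systems employing fenemic or phonetic baseforms, the ultimate goal is to find the most likely word (or sequence of words) corresponding to the string of labels generated by the acoustic processor in response to an unknown speech input. One method of achieving this goal is described in the above U.S. application. In that method, the number of words is first reduced from the full vocabulary to a list of likely candidate words; the listed words are then examined in a more detailed match procedure or a language model procedure, where preferably the most likely word is selected. In reducing the number of candidates, the method taught by the above application applies approximations to the phone machines, which yields fast processing without excessive computation. To achieve the reduction, an approximate acoustic match is performed by computations based on a matching trellis. This approximate acoustic match has been found effective and sufficient for determining which words should undergo the more detailed match and the language-model processing.

C. Problems to be solved by the invention

The object of this invention is to provide a fast and computationally simple method, and apparatus for carrying it out, for determining which words in a vocabulary have a relatively high likelihood of corresponding to a string of speech labels generated by an acoustic processor.

D. Means for solving the problems

The invention teaches a different method of reducing the number of words to be examined in the detailed match. Namely, the invention relates to a polling method in which a table is provided where each label in the alphabet casts a "vote" for each word in the vocabulary. The vote reflects the likelihood that the given word produced the given label. The vote values are computed from the label output probabilities and the transition probability statistics obtained during a training session.

According to one embodiment, when a string of labels is generated by the acoustic processor, a subject word is selected. From the vote table, each label in the string is looked up and the vote of each label for the subject word is determined. All the label votes for the subject word are accumulated and combined to give a likelihood score. Repeating the process for each word in the vocabulary yields a likelihood score for each word. A list of likely candidate words can be derived from the likelihood scores.

In a second embodiment, a second table is also formed that contains a penalty that each label has for each word in the vocabulary. The penalty assigned to a given label represents the likelihood of a word not producing the given label. In the second embodiment, both label votes and penalties are considered in evaluating the likelihood score of a given word based on the label string.

To account for length, the likelihood scores are preferably scaled based on the number of labels considered in evaluating the word's likelihood score.

Further, when the end time of a word along the generated labels has not been determined, the invention provides that likelihood scores be computed at successive time intervals, so that a subject word may have a plurality of successive likelihood scores. The invention further provides that the subject word be assigned its highest likelihood score, preferably relative to the likelihood scores of all the other words in the vocabulary.

According to the invention, a method is taught for selecting likely words from a vocabulary of words, where each word is represented by a sequence of at least one probabilistic finite-state phone machine and an acoustic processor generates speech labels in response to speech input. The method includes the step of (a) forming a first table in which each label in the alphabet casts a vote for each word in the vocabulary, the vote of each label for a given word representing the likelihood of the given word producing the label that casts the vote. Preferably, the method further includes the steps of (b) forming a second table in which a penalty is assigned to each label for each word in the vocabulary, the penalty assigned to a given label for a given word representing the likelihood of the given label not being produced by the model of the given word, and (c) for a given string of labels, determining the likelihood of a particular word, which includes combining the votes of all labels in the string for the particular word with the penalties of all labels not in the string for the particular word.

Preferably, the method also repeats steps (a), (b) and (c) for all words to provide a likelihood score for each word.

If desired, the above method may be used in combination with the method of U.S. application No. 672974.

E. Embodiments

E-1 Speech recognition system

E-1A Overview of the configuration

Fig. 1 shows a general block diagram of a speech recognition system 1000. The system 1000 includes a stack decoder 1002, an acoustic processor 1004 connected to the stack decoder, an array processor 1006 used in performing the fast approximate acoustic match, an array processor 1008 used in performing the detailed acoustic match, a language model 1010, and a workstation 1012.

The acoustic processor 1004 is designed to convert a speech waveform input into a string of labels, or fenemes, each of which in a general sense identifies a corresponding sound type. In this system the acoustic processor 1004 is based on a unique model of the human ear, described in the applicant's Japanese patent application No. 60-211229.

The labels, or fenemes, from the acoustic processor 1004 enter the stack decoder 1002. Logically, the stack decoder 1002 can be represented by the block elements shown in Fig. 2. That is, the stack decoder 1002 includes a search block 1020 that communicates with the workstation 1012 and, via interfaces 1022, 1024, 1026 and 1028, with the acoustic processor 1004, the fast match processor 1006, the detailed match processor 1008 and the language model 1010.

In operation, fenemes from the acoustic processor 1004 are directed by the search block 1020 to the fast match processor 1006. The fast match procedure is described below and also in the above U.S. application No. 672974. Briefly, the object of matching is to determine the most likely word for a given string of labels.

The fast match is designed to examine words in the vocabulary and reduce the number of candidate words for a given string of incoming labels. It is based on probabilistic finite-state machines, also called Markov models.

Once the fast match has reduced the number of candidate words, the stack decoder 1002 communicates with the language model 1010, which determines the contextual likelihood of each candidate word in the fast match candidate list, preferably based on existing tri-grams.

Preferably, the detailed match then examines those words from the fast match candidate list that, based on the language model computations, have a reasonable likelihood of being the spoken word. The detailed match procedure is described in the above U.S. application No. 672974.

The detailed match is performed by means of Markov model phone machines such as the machine shown in Fig. 3.

After the detailed match, the language model is preferably invoked again to determine word likelihood. The stack decoder 1002 of the invention, using information derived from the fast match, the detailed match and the language model, is designed to determine the most likely path, or sequence, of words for a string of generated labels.

Two prior-art approaches to finding the most likely word sequence are Viterbi decoding and single-stack decoding. Both are described in the Bahl, Jelinek and Mercer article cited above: Viterbi decoding in section V and single-stack decoding in section VI.

In single-stack decoding, paths of varying length are listed in a single stack according to likelihood. Single-stack decoding must account for the fact that likelihood depends somewhat on path length, so normalization is generally required.

Viterbi decoding, by contrast, needs no such normalization and is generally practical for small tasks.

As another alternative, decoding for a small-vocabulary system may be performed by examining each possible combination of words as a possible word sequence and determining which combination has the highest probability of producing the generated label string. The computation this requires becomes impractical for large-vocabulary systems.

The stack decoder 1002 in effect serves to control the other blocks but does not perform many computations. Hence the stack decoder 1002 preferably includes a 4341 processor running under the IBM VM/370 operating system, as described in publications such as Virtual Machine/System Product Introduction Release 3 (1983). The array processors, which perform considerable computation, have been implemented with commercially available Floating Point Systems (FPS) 190L units.

A novel technique involving multiple stacking and a unique decision strategy for determining the best word sequence, or path, was invented by L. R. Bahl, F. Jelinek and R. L. Mercer and is described later.

E-1B Auditory model and its implementation

Fig. 4 shows a specific embodiment of an acoustic processor 1100 (labelled 1004 in Fig. 1). A speech wave input (for example, natural speech) enters an analog-to-digital (A/D) converter 1102, which samples the input at a prescribed rate. A typical sampling rate is one sample every 50 microseconds. To shape the edges of the digital signal from the A/D converter 1102, a time window generator 1104 is provided. The output of the time window generator 1104 enters a fast Fourier transform (FFT) element 1106, which provides a frequency spectrum output for each time window.

The output of the FFT 1106 is then processed to generate labels $y_1 y_2 \ldots y_f$. Four elements cooperate to generate the labels: a feature selection block 1108, a cluster block 1110, a prototype block 1112 and a labeller 1114. In generating labels, prototypes are defined as points (or vectors) in a space based on selected features, and acoustic inputs are then characterized by the same selected features to provide corresponding points (or vectors) in the space that can be compared with the prototypes.

Specifically, in defining the prototypes, sets of points are grouped into respective clusters by the cluster block 1110. The method of defining clusters is based on probability distributions, such as the Gaussian distribution, applied to speech. The prototype of each cluster, relating to the centroid or other characteristic of the cluster, is generated by the prototype block 1112. The generated prototypes and the acoustic input, both characterized by the same selected features, enter the labeller 1114, which performs a comparison procedure that assigns a label to the particular acoustic input.

The selection of appropriate features is a key factor in deriving labels that represent the speech wave input. The acoustic processor described here includes an improved feature selection block 1108. With this acoustic processor an auditory model is derived and applied in the acoustic processor of a speech recognition system.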
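The labelling step described above (comparing a feature vector with the cluster prototypes and assigning one label) can be illustrated with a minimal sketch. The text only says that a "comparison procedure" is performed; the squared Euclidean distance used here, the function names `assign_label` and `label_frames`, and the two-dimensional toy data are all assumptions made for illustration.

```python
import math


def assign_label(feature_vector, prototypes):
    """Return the index of the prototype nearest to feature_vector.

    Assumption: nearness is squared Euclidean distance; the source does
    not specify the comparison that labeller 1114 performs.
    """
    best_label, best_dist = None, math.inf
    for label, proto in enumerate(prototypes):
        dist = sum((x - p) ** 2 for x, p in zip(feature_vector, proto))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def label_frames(frames, prototypes):
    """Convert a sequence of per-frame feature vectors into a label string."""
    return [assign_label(frame, prototypes) for frame in frames]


# Toy data: three prototypes and three frames in a 2-D feature space.
prototypes = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
frames = [(0.1, -0.2), (0.9, 1.2), (3.5, 0.4)]
labels = label_frames(frames, prototypes)
```

In the system described here the feature space would have about twenty dimensions (one per frequency band) and the alphabet about 200 prototypes; the sketch keeps both small so that it can be checked by hand.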
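To explain the auditory model, reference is made to Fig. 5.

Fig. 5 shows part of the inner human ear. Specifically, an inner hair cell 1200 is shown with end portions 1202 extending into a fluid-containing channel 1204. Upstream from the inner hair cells are outer hair cells 1206, also shown with end portions 1208 extending into the channel 1204. Nerves that convey information to the brain are associated with the inner hair cell 1200 and the outer hair cells 1206. Specifically, neurons undergo electrochemical changes that result in electrical impulses being conveyed along a nerve to the brain for processing. The electrochemical changes are stimulated by the mechanical motion of the basilar membrane 1210.

It has been recognized in prior teachings that the basilar membrane 1210 serves as a frequency analyzer for acoustic waveform inputs and that portions along the basilar membrane 1210 respond to respective critical frequency bands. That different portions of the basilar membrane respond to corresponding frequency bands affects the loudness perceived for an acoustic waveform input: two tones are perceived as louder when they lie in different critical frequency bands than when two tones of similar intensity occupy the same frequency band. It has been found that there are on the order of 22 critical frequency bands defined by the basilar membrane 1210.

Conforming to the frequency response of the basilar membrane 1210, the acoustic processor 1100 of this embodiment preferably separates the acoustic waveform input into some or all of the critical frequency bands and then examines the signal component of each separated band individually. This is achieved by appropriately filtering the signal from the FFT 1106 (Fig. 4) to provide, in the feature selection block 1108, a separate signal for each critical frequency band examined.

The separate inputs have also been blocked into time frames (of preferably 25.6 ms) by the time window generator 1104. Hence the feature selection block 1108 preferably includes 22 signals, each representing the sound intensity in a given frequency band for one time frame.

The filtering is preferably performed by a conventional critical-band filter 1300, shown in Fig. 6. The separate signals are then processed by an equal-loudness converter 1302, which accounts for the variation of perceived loudness as a function of frequency.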
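In this regard, note that a first tone at a given dB level at one frequency may differ in perceived loudness from a second tone at the same dB level at another frequency. The converter 1302 can be based on empirical data, converting the signals in the various frequency bands so that each is measured on a similar loudness scale. For example, the converter 1302 preferably maps from acoustic intensity to equal-loudness contours based on the 1933 studies of Fletcher and Munson, with some modification. The modified results of these studies are shown in Fig. 7. As the equal-loudness contour passing through the point (1000 Hz, 40 dB) in Fig. 7 shows, a 1000 Hz tone at 40 dB is comparable in loudness level to a 100 Hz tone at 60 dB.

The converter 1302 preferably adjusts loudness according to the contours of Fig. 7 to achieve equal loudness regardless of frequency.

As Fig. 7 further shows, in addition to the dependence on frequency, acoustic intensity and loudness do not correspond. That is, a change in sound intensity or amplitude is not necessarily reflected by a like change in perceived loudness. For example, at 100 Hz, a 10 dB change in intensity around 110 dB produces a far larger change in perceived loudness than a 10 dB change around 20 dB. This difference is handled by a loudness scaling element 1304, which compresses loudness in a predefined fashion. Preferably, the loudness scaling element 1304 compresses the intensity $P$ to its cube root, $P^{1/3}$, by replacing the loudness amplitude measure in phons with sones.

Fig. 8 shows empirically determined data for sones versus phons. By employing sones, the model remains substantially accurate at large speech amplitudes. One sone is defined as the loudness of a 1 kHz tone at 40 dB.

Referring again to Fig. 6, a time-varying response element 1306 is shown that processes the equal-loudness, loudness-scaled signal for each critical frequency band. Specifically, for each frequency band examined, a neural firing rate $f$ is determined at each time frame. According to this acoustic processor, the firing rate is

$$f = (S_o + DL)\,n \qquad (1)$$

where $n$ is the amount of neurotransmitter, $S_o$ is a spontaneous firing constant relating to neural firings independent of the acoustic waveform input, $L$ is a measure of loudness, and $D$ is a displacement constant. That is, $S_o n$ corresponds to the spontaneous neural firing rate, which occurs whether or not there is an acoustic input, and $DLn$ corresponds to the firing rate due to the acoustic input.

Significantly, this acoustic processor characterizes the value of $n$ as changing over time according to the relationship

$$\frac{dn}{dt} = A_o - (S_o + S_h + DL)\,n \qquad (2)$$

where $A_o$ is a replenishment constant and $S_h$ is a spontaneous neurotransmitter decay constant. The novel relationship in equation (2) takes into account that neurotransmitter is produced at a certain rate ($A_o$) and is lost through (a) decay ($S_h \times n$), (b) spontaneous firing ($S_o \times n$) and (c) neural firing due to the acoustic input ($DL \times n$). The presumed locations of these modelled phenomena are shown in Fig. 5.

Equation (2) also reflects the fact that this acoustic processor is nonlinear, in that the next amount of neurotransmitter and the next firing rate depend multiplicatively on at least the current neurotransmitter amount. That is, the amount of neurotransmitter at time $t + \Delta t$ equals the amount at time $t$ plus $(dn/dt)\,\Delta t$:

$$n(t + \Delta t) = n(t) + \frac{dn}{dt}\,\Delta t \qquad (3)$$

Equations (1), (2) and (3) describe a time-varying signal analyzer that exploits the fact that the auditory system appears to adapt over time, making the signals on the auditory nerve nonlinear in the acoustic input. In this respect, the acoustic processor provides the first model that embodies nonlinear signal processing in a speech recognition system so as to conform better to the apparent time variations in the nervous system.

To reduce the number of unknowns in equations (1) and (2), the acoustic processor uses the following equation (4), which applies to a fixed loudness $L$:

$$S_o + S_h + DL = \frac{1}{T} \qquad (4)$$

where $T$ is a measure of the time it takes the auditory response to drop to 37% of its maximum after an acoustic input is generated. $T$ is a function of loudness and, according to this processor, is derived from graphs that show the decay of the response for various loudness levels. That is, when a tone of fixed loudness is generated, it produces a response at a first, high level, after which the response decays toward a steady-state level with time constant $T$.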
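As a rough illustration of equations (1) to (3), the sketch below steps the neurotransmitter state forward one frame at a time and records the firing rate. The constants are the preferred values this description quotes for R = 1.5 (S_o = 0.0888, S_h = 0.1111, D = 0.00666, A_o = 1), taken here to be per centisecond; the function name `firing_rates`, the one-centisecond step and the step-shaped test input are assumptions for illustration only.

```python
# Preferred constants quoted in this description (assumed per centisecond).
S_O, S_H, D, A_O = 0.0888, 0.1111, 0.00666, 1.0


def firing_rates(loudness, dt=1.0, n0=None):
    """Simulate one frequency band.

    loudness: sequence of loudness values L (sones), one per time step.
    dt:       step length in centiseconds (assumed).
    n0:       initial neurotransmitter amount; defaults to the resting
              steady state A_o / (S_o + S_h).
    """
    n = A_O / (S_O + S_H) if n0 is None else n0
    rates = []
    for L in loudness:
        f = (S_O + D * L) * n                    # equation (1)
        dn_dt = A_O - (S_O + S_H + D * L) * n    # equation (2)
        rates.append(f)
        n = n + dn_dt * dt                       # equation (3)
    return rates


# Five silent frames followed by a steady 20-sone tone.
rates = firing_rates([0.0] * 5 + [20.0] * 40)
```

With this input the rate jumps at the onset of the tone and then decays toward a lower steady level, which is the adaptation behaviour the model is meant to capture; the ratio of the final rate to the resting rate comes out close to 1.5.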
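With no acoustic input, $T = T_0$, which is on the order of 50 ms. For a loudness of $L_{max}$, $T = T_{max}$, which is on the order of 30 ms. Setting $A_o = 1$, $1/(S_o + S_h)$ is 5 centiseconds when $L = 0$.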
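Also, when $L$ is $L_{max}$ and $L_{max} = 20$ sones, equation (5) results:

$$S_o + S_h + D\,(20) = \frac{1}{T_{max}} = \frac{1}{3} \qquad (5)$$

where $T_{max} = 30$ ms is written as 3 centiseconds, the same unit as $T_0 = 5$ centiseconds.

With the above data and equations, $S_o$ and $S_h$ are determined by equations (6) and (7):

$$S_o = \frac{D L_{max}}{R + (D L_{max} T_0 R) - 1} \qquad (6)$$

$$S_h = \frac{1}{T_0} - S_o \qquad (7)$$

where

$$R = \frac{f_s\,|_{L = L_{max}}}{f_s\,|_{L = 0}} \qquad (8)$$

Here $f_s|$ denotes the firing rate at a given loudness when $dn/dt = 0$, that is, in the steady state.

$R$ is the only variable left in the acoustic processor. Hence, to alter the performance of the processor, only $R$ is changed. That is, $R$ is a single parameter that may be adjusted to alter performance, which normally means minimizing steady-state effects relative to transient effects. Minimizing steady-state effects is desirable because inconsistent output patterns for similar speech inputs generally result from differences in frequency response, speaker differences, background noise, and distortion that affect the steady-state portions of the speech signal but not the transient portions. The value of $R$ is preferably set by optimizing the error rate of the complete speech recognition system. A suitable value found in this fashion is $R = 1.5$. The values of $S_o$ and $S_h$ are then 0.0888 and 0.1111 respectively, with $D$ being 0.00666.

Fig. 9 shows a flowchart of the acoustic processor. Digitized speech, preferably sampled at 20 kHz, in a 25.6 ms time frame passes through a Hanning window 1320, the output of which is passed to a Fourier transform (DFT) 1322, preferably at 10 ms intervals. The transformed output is filtered by element 1324 to provide a power density output for each of at least one frequency band (preferably all the critical frequency bands, or at least twenty of them). The power density is then converted to a loudness level by a log-magnitude conversion step 1326. This is readily performed from the graph of Fig. 7. The subsequent processing, which includes the threshold update of step 1330, is shown in Fig. 10.

In Fig. 10, a threshold of feeling $T_f$ and a threshold of hearing $T_h$ are initially set, for each filtered frequency band $m$, to $T_f = 120$ dB and $T_h = 0$ dB in step 1340. Thereafter a speech counter, a total-frames register and a histogram register are reset in step 1342.

Each histogram includes bins, each of which indicates the number of samples, or counts, during which the power or a similar measure in a given frequency band lies in a respective range. In the present case, a histogram preferably represents, for each given frequency band, the number of centiseconds during which the loudness lies in each of a plurality of loudness ranges. For example, in the third frequency band there may be 20 centiseconds with power between 10 dB and 20 dB. Similarly, in the twentieth frequency band, 150 of a total of 1000 centiseconds may lie between 50 dB and 60 dB. Percentiles are derived from the total number of samples (or centiseconds) and the counts in the bins.

In step 1344, a frame from the filter output of each frequency band is examined, and in step 1346 the bins in the appropriate histograms, one per filter, are incremented. In step 1348 the total number of bins in which the amplitude exceeds 55 dB is summed for each filter (that is, frequency band), and the number of filters indicating the presence of speech is determined. If fewer than a minimum number of filters (for example, six of twenty) suggest speech, the next frame is examined at step 1344. If enough filters indicate speech at step 1350, the speech counter is incremented at step 1352. The speech counter is incremented at step 1352 until 10 seconds of speech have occurred at step 1354, whereupon new values of $T_f$ and $T_h$ are determined for each filter at step 1356.

The new $T_f$ and $T_h$ values for a given filter are determined as follows. For $T_f$, the dB value of the bin holding the 35th sample from the top of 1000 bins (that is, the 96.5th percentile of speech) is defined as $BIN_H$; $T_f$ is then set as $T_f = BIN_H + 40$ dB. For $T_h$, the dB value of the bin holding the 0.01 quantile value counted up from the lowest bin is defined as $BIN_L$. That is, $BIN_L$ is the bin at 1% of the number of samples in the histogram, excluding the samples classified as speech. $T_h$ is then defined as $T_h = BIN_L - 30$ dB.

Returning to Fig. 9, the sound amplitudes are converted to sones and scaled in step 1332, based on the thresholds updated in steps 1330 and 1332, as described above. An alternative way of deriving sones and scaling is to take the filter amplitude $a$ (after the bins have been incremented) and convert it to dB according to

$$a^{dB} = 20 \log_{10}(a) - 10 \qquad (9)$$

Each filter amplitude is then scaled to a range between 0 and 120 to provide equal loudness:

$$a^{eql} = 120\,\frac{a^{dB} - T_h}{T_f - T_h} \qquad (10)$$

$a^{eql}$ is then preferably converted from a loudness level (phons) to an approximation of loudness in sones (with a 1 kHz signal at 40 dB mapped to 1):

$$L^{dB} = \frac{a^{eql} - 30}{4} \qquad (11)$$

Loudness in sones is then approximated as

$$L_s(\text{approx}) = 10^{L^{dB}/20} \qquad (12)$$

The loudness in sones is then supplied as input to equations (1) and (2) at step 1334 to determine the output firing rate $f$ for each frequency band (step 1335). With 22 frequency bands, a 22-dimensional vector characterizes the acoustic input over successive time frames. Generally, however, twenty frequency bands are examined, using a conventional mel-scaled filter bank.

Before the next time frame is processed (step 1336), the next state of $n$ is determined according to equation (3) in step 1337.

The acoustic processor described above can be improved for applications in which the firing rate $f$ and the neurotransmitter amount $n$ have a large DC pedestal. That is, where the dynamic range of the terms in the $f$ and $n$ equations is important, the following equations are derived to reduce the pedestal height.

First, in the steady state with no acoustic input ($L = 0$), equation (2) can be solved for the steady-state internal state $n'$:

$$n' = \frac{A_o}{S_o + S_h} \qquad (13)$$

The internal state of the neurotransmitter amount $n(t)$ can be represented as a steady-state portion $n'$ and a varying portion $n''(t)$:

$$n(t) = n' + n''(t) \qquad (14)$$

Combining equations (1) and (14) gives the following expression for the firing rate:

$$f(t) = (S_o + DL)\,(n' + n''(t)) \qquad (15)$$

In this equation the term $S_o \times n'$ is a constant, while all the other terms include either the varying part of $n$ or the input signal represented by $DL$.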
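Since subsequent processing involves only the squared differences between output vectors, constant terms can be disregarded. Using equation (13), and writing $f''(t)$ for equation (15) with the constant term removed:

$$f''(t) = (S_o + DL)\,n''(t) + \frac{D L A_o}{S_o + S_h} \qquad (16)$$

Considering equation (3), the next state becomes

$$n(t + \Delta t) = n'(t + \Delta t) + n''(t + \Delta t) \qquad (17)$$

$$= n' + n''(t) + \bigl(A_o - (S_o + S_h + DL)\,n(t)\bigr)\,\Delta t \qquad (18)$$

$$= n' + n''(t) - (S_o + S_h)\,n'\,\Delta t - D L\,n'\,\Delta t + A_o\,\Delta t - (S_o + S_h + DL)\,n''(t)\,\Delta t \qquad (19)$$

By equation (13), $A_o\,\Delta t$ cancels $(S_o + S_h)\,n'\,\Delta t$, and the constant $n'$ appears on both sides. Disregarding all constant terms in equation (19) therefore gives

$$n''(t + \Delta t) = n''(t)\,(1 - S_h\,\Delta t) - f''(t)\,\Delta t \qquad (20)$$

Equations (16) and (20) now constitute, respectively, the output equation and the state-update equation applied to each filter during each 10 ms time frame. The result is a 20-element vector every 10 ms, each element corresponding to the firing rate of a respective frequency band in the mel-scaled filter bank.

Note that, for the embodiment just described, the equations for $f$, $dn/dt$ and $n(t + \Delta t)$ define the special-case expression for the firing rate $f$ and the next state $n(t + \Delta t)$, respectively.

Note also that the values contributing to the terms of the various equations (for example $T_0 = 5$ centiseconds, $T_{L_{max}} = 3$ centiseconds, $A_o = 1$, $R = 1.5$ and $L_{max} = 20$) may be set otherwise, in which case $S_o$, $S_h$ and $D$ take values different from the preferred derived values 0.0888, 0.11111 and 0.00666 respectively.

This acoustic model was implemented by the inventors in the PL/I programming language on Floating Point Systems FPS 190L hardware, but various other software or hardware may be used.

E-1C Detailed match

Fig. 3 shows a sample detailed match phone machine 2000. Each detailed match phone machine is a probabilistic finite-state machine characterized by (a) a plurality of states $S_i$, (b) a plurality of transitions $tr(S_j|S_i)$, some between different states and some from a state back to itself, each with an associated probability, and (c) for each label that can be generated at a particular transition, a corresponding actual label probability.

In Fig. 3 the detailed match phone machine 2000 has seven states, $S_1$ to $S_7$, and thirteen transitions, tr1 to tr13. As Fig. 3 shows, the phone machine 2000 has three transitions drawn with dashed lines, tr11, tr12 and tr13. At each of these three transitions the phone can change from one state to another without producing a label; such a transition is called a null transition. Along transitions tr1 to tr10, by contrast, labels can be produced. Specifically, along each of tr1 to tr10, one or more labels may have a distinct probability of being generated there. Preferably, at each such transition there is a probability associated with every label that can be generated in the system. That is, if there are 200 labels that can be selectively generated by the acoustic channel, each (non-null) transition has 200 "actual label probabilities" associated with it, each being the probability that the corresponding label is generated by the phone at that transition. In Fig. 3 the actual label probabilities for transition tr1 are denoted P[i], where i is an integer from 1 to 200 identifying the label. For label 1, for example, there is a probability P[1] that the detailed match phone machine 2000 generates label 1 at transition tr1. The various actual label probabilities are stored together with the label and the corresponding transition.

When a string of labels $y_1 y_2 y_3 \ldots$ is presented to the detailed match phone machine 2000 for a given phone, a match procedure is performed. The procedure is explained with reference to Fig. 11.

Fig. 11 is a trellis diagram of the phone machine of Fig. 3. As in the phone machine, the trellis shows a null transition from state $S_1$ to state $S_7$ and transitions from $S_1$ to $S_2$ and from $S_1$ to $S_4$. Transitions between other states are also shown. The trellis also shows time measured in the horizontal direction. In Fig. 11, the start-time probabilities $q_0$ and $q_1$ represent the probabilities that the phone has its start time at $t = t_0$ and $t = t_1$ respectively. At each start time, $t_0$ and $t_1$, the various transitions are shown. Note that the interval between successive start times is preferably equal in length to the time interval of a label.

In employing the detailed match phone machine 2000 to determine how closely a given phone matches the labels of an incoming string, an end-time distribution for the phone is sought and used in determining a match value for the phone. Reliance on the end-time distribution is common to all the phone machine embodiments discussed here in connection with the match procedure. In generating the end-time distribution for a detailed match, the phone machine 2000 involves exact and complicated computations.

Referring to Fig. 11, consider first the computation required for the phone to have both its start time and its end time at $t = t_0$. For the example phone machine structure of Fig. 3, the following probability applies:

$$\Pr(S_7, t = t_0) = q_0\,T(1 \to 7) + \Pr(S_2, t = t_0)\,T(2 \to 7) + \Pr(S_3, t = t_0)\,T(3 \to 7) \qquad (21)$$

where $\Pr$ denotes the probability of being in the state given in parentheses, and $T$ denotes the transition probability between the two states in the direction of the arrow. Equation (21) gives the respective probabilities for the three conditions under which the end time can occur at $t = t_0$. It can also be seen that, in this example, an end time at $t = t_0$ is limited to occurrence at state $S_7$.

Looking next at end time $t = t_1$, computations must be made for every state other than $S_1$. State $S_1$ starts at the end time of the previous phone. For purposes of explanation, only the computation for state $S_4$ is shown.

For state $S_4$ the computation is

$$\Pr(S_4, t = t_1) = \Pr(S_1, t = t_0)\,T(1 \to 4)\,\Pr(y_1 | 1 \to 4) + \Pr(S_4, t = t_0)\,T(4 \to 4)\,\Pr(y_1 | 4 \to 4) \qquad (22)$$

In words, equation (22) says that the probability of the phone machine being in state $S_4$ at time $t = t_1$ is the sum of two terms: (a) the probability of being in state $S_1$ at $t = t_0$, multiplied by the probability of the transition from $S_1$ to $S_4$ and by the probability of the given label $y_1$ being produced given a transition from $S_1$ to $S_4$; and (b) the probability of being in state $S_4$ at $t = t_0$, multiplied by the probability of the transition from $S_4$ to itself and by the probability of the given label $y_1$ being produced given a transition from $S_4$ to itself.

Similarly, computations are made for the other states (excluding $S_1$) to obtain the probability that the phone is in a particular state at time $t = t_1$. In general, in determining the probability of being in a particular state at a given time, the detailed match (a) identifies each previous state that has a transition leading to the particular state, together with the respective probability of each such previous state; (b) identifies, for each such previous state, a value representing the probability of the label that must be generated at the transition between that previous state and the current state in order to conform to the label string; and (c) combines the probability of each previous state with the respective value representing the label probability to obtain the probability of the particular state over the corresponding transition.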
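The recursion in steps (a) to (c) can be sketched as a small forward pass over a trellis. This is a minimal illustration under stated assumptions, not the APAL implementation mentioned in this description: the function name `end_time_distribution`, the tuple layout of a transition, and the two-state toy machine are hypothetical, states are numbered from 0 with the last state as the end state, and null transitions are assumed to lead only from a lower-numbered to a higher-numbered state.

```python
import math


def end_time_distribution(n_states, transitions, q, labels):
    """Return Pr(end state, t) for t = 0 .. len(labels).

    transitions: list of (src, dst, prob, label_probs); label_probs is a
                 dict label -> probability, or None for a null transition.
    q:           start-time distribution, q[t] = Pr(phone starts at t).
    labels:      incoming label string; labels[t - 1] is produced between
                 time t - 1 and time t.
    """
    null_arcs = sorted((a for a in transitions if a[3] is None),
                       key=lambda a: a[0])
    label_arcs = [a for a in transitions if a[3] is not None]
    end_probs, prev = [], None
    for t in range(len(labels) + 1):
        cur = [0.0] * n_states
        cur[0] = q[t] if t < len(q) else 0.0      # phone may start now
        if t > 0:
            y = labels[t - 1]
            for src, dst, prob, label_probs in label_arcs:
                # previous-state probability x transition x label probability
                cur[dst] += prev[src] * prob * label_probs.get(y, 0.0)
        for src, dst, prob, _ in null_arcs:
            # null transitions consume no label, so they stay in column t
            cur[dst] += cur[src] * prob
        end_probs.append(cur[-1])
        prev = cur
    return end_probs


# Toy two-state machine: self-loop, label-producing arc and null arc.
emit = {"a": 0.8, "b": 0.2}
arcs = [(0, 0, 0.3, emit), (0, 1, 0.5, emit), (0, 1, 0.2, None)]
end_probs = end_time_distribution(2, arcs, q=[1.0], labels=["a", "a"])
match_value = math.log10(sum(end_probs))
```

Each entry added to `cur[dst]` is one term of the kind that appears in equations (21) and (22); summing over all incoming transitions gives the state probability for that column of the trellis.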
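The overall probability of being in the particular state is determined from the probabilities of that state over all transitions leading to it. Note that the computation for state $S_7$ includes terms for the three null transitions, which allow the phone to start and end at time $t = t_1$ with the phone ending in state $S_7$.

As with the probability computations for $t = t_0$ and $t = t_1$, probability computations for a series of other end times are preferably made to form an end-time distribution. The value of the end-time distribution for a given phone indicates how well the given phone matches the incoming labels. In determining how well a word matches a string of incoming labels, the phones that represent the word are processed in sequence. Each phone generates an end-time distribution of probability values. The match value for the phone is obtained by summing the end-time probabilities and then taking the logarithm of the sum. The start-time distribution for the next phone is derived by normalizing the end-time distribution, for example by dividing each value by the sum so that the scaled values sum to one.

There are at least two methods of determining $h$, the number of phones to be examined for a given word or word string. In a depth-first method, computation proceeds along a baseform, computing a running subtotal with each successive phone. When the subtotal is found to be below a predefined threshold for the given phone position along the baseform, the computation terminates. Alternatively, in a breadth-first method, a computation is made for similar phone positions in each word: the computation for the first phone of each word is followed by the computation for the second phone of each word, and so on. In the breadth-first method, the computations along the same number of phones for the various words are compared at the same relative phone positions. In either method, the word or words with the largest sum of match values is the sought object.

The detailed match has been implemented in APAL (Array Processor Assembly Language), the native assembler for the Floating Point Systems, Inc. 190L.

It should be recognized that the detailed match requires considerable memory to store each actual label probability (that is, the probability that a given phone generates a given label $y$ at a given transition), the transition probabilities for each phone machine, and the probabilities of a given phone being in a given state at a given time after a defined start time. The FPS 190L is set up to make the various computations: end times; match values based on, for example, a sum (preferably the logarithm of the sum of end-time probabilities); start times based on previously generated end-time probabilities; and word match scores based on the match values of the sequential phones in a word. In addition, the detailed match preferably accounts for "tail probabilities" in the match procedure. A tail probability measures the likelihood of successive labels without regard to words. In a simple embodiment, a given tail probability corresponds to the likelihood of a label following another label. This likelihood is readily determined from strings of labels generated, for example, by some sample speech.

Hence the detailed match requires enough storage to hold the baseforms, the statistics for the Markov models, and the tail probabilities. For a 5000-word vocabulary in which each word comprises about ten phones, the baseforms require 5000 × 10 storage locations. With 70 distinct phones (a Markov model for each), 200 distinct labels and ten transitions at which each label has a probability of being produced, the statistics would require 70 × 10 × 200 locations. However, the phone machines are preferably divided into three portions (a start portion, a middle portion and an end portion) with statistics corresponding to each (one of the three self-loops is preferably included in each successive portion). The storage requirement is accordingly reduced to 70 × 3 × 200. The tail probabilities require 200 × 200 storage locations. In this arrangement, 50K of integer storage and 82K of floating-point storage perform satisfactorily.

Note that the detailed match may be implemented using fenemic phones rather than phonetic phones.

E-1D Basic fast match

Because the detailed match is computationally expensive, a basic fast match and an alternative fast match are provided that reduce the computation required at some sacrifice in accuracy. The fast match is preferably used in conjunction with the detailed match: the fast match lists likely candidate words from the vocabulary, and the detailed match is then performed on, at most, the candidate words on that list.

The fast approximate acoustic matching technique is the subject of the applicant's above-mentioned U.S. application No. 672974.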
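In the fast approximate acoustic match, each phone machine is preferably simplified by replacing the actual probability of each label at all transitions in a given phone machine with a specific replacement value. The replacement value is preferably selected so that the match value for a given phone when the replacement values are used is an overestimate of the match value achieved by the detailed match when the actual label probabilities are not replaced. One way of assuring this condition is to select each replacement value so that no probability corresponding to a given label in a given phone machine is greater than the replacement value. Replacing the actual label probabilities in a phone machine with the corresponding replacement values greatly reduces the computation required to determine the match score for a word. Moreover, since the replacement values are preferably overestimates, the resulting match score is not less than the score that would have been determined without the replacement.

In a specific embodiment that performs acoustic matching in a linguistic decoder with Markov models, each phone machine is characterized, through training, as having:

(a) a plurality of states and transition paths between the states;
(b) transitions $tr(S_j|S_i)$ with probabilities $T(i \to j)$, each representing the probability of a transition to state $S_j$ given the current state $S_i$, where $S_i$ and $S_j$ may be the same state or different states; and
(c) actual label probabilities $P(y_k | i \to j)$.

Each actual label probability $P(y_k | i \to j)$ represents the probability that label $y_k$ is generated by the given phone machine at a given transition from one state to a subsequent state, where $k$ is a label-identifying index. Each phone machine includes (d) means for assigning to each $y_k$ in the phone machine a single specific value $P'(y_k)$ and (e) means for replacing each actual output probability $P(y_k | i \to j)$ at each transition in the given phone machine by the single specific value $P'(y_k)$ assigned to the corresponding $y_k$. Preferably, the replacement value is at least as great as the maximum actual label probability for the corresponding label $y_k$ at any transition in the particular phone machine. The fast match is employed to determine a list of on the order of ten to one hundred candidate words, selected as the most likely words in the vocabulary for the incoming labels. The candidate words are preferably subjected to the language model and to the detailed match. By cutting the number of words considered by the detailed match to about 1% of the vocabulary, the computational cost is greatly reduced while accuracy is maintained.

The basic fast match simplifies the detailed match by replacing, with a single value, the actual label probabilities of a given label at all transitions at which the label may be generated in a given phone machine. That is, regardless of the transition in the given phone machine at which the label has a probability of occurring, the probability is replaced by a single specific value. The value is an overestimate, at least as great as the largest probability of the label occurring at any transition in the given phone machine.

By setting the label probability replacement value to the maximum of the actual label probabilities for the given label in the given phone machine, it is assured that the match value generated with the basic fast match is at least as high as the match value that would result from the detailed match. In this way the basic fast match typically overestimates the match value of each phone, so that more words are generally selected as candidates. Words considered candidates according to the detailed match also pass muster under the basic fast match.

Fig. 12 shows a phone machine 3000 for the basic fast match. Labels (also called symbols and fenemes) enter the basic fast match phone machine 3000 together with a start-time distribution. The start-time distribution and the label string input are like those entering the detailed match phone machine described above. It should be recognized that the start time may, on occasion, not be a distribution over several times but a precise time, for example following an interval of silence, at which the phone begins. When speech is continuous, however, the end-time distribution is used to define the start-time distribution (as discussed in greater detail below). The phone machine 3000 generates an end-time distribution and, from the end-time distribution, a match value for the particular phone. The match score for a word is defined as the sum of the match values of its component phones, or at least of the first $h$ phones in the word.

Fig. 13 diagrams the basic fast match computation. The computation involves only the start-time distribution, the number, or length, of labels produced by the phone, and the replacement values $P'_{y_k}$ associated with each label $y_k$. By substituting the corresponding replacement value for all actual label probabilities of a given label in a given phone machine, the basic fast match replaces transition probabilities with length distribution probabilities and removes the need to carry the actual label probabilities (which can differ for each transition in a given phone machine) and the probabilities of being in a given state at a given time.

In this regard, the length distributions are determined from the detailed match model. Specifically, for each length in the length distribution, the procedure preferably examines each state individually and determines, for each state, the various transition paths by which the currently examined state can occur (a) given a particular label length and (b) regardless of the outputs along the transitions. The probabilities for all transition paths of the particular length to each particular state are summed, and the sums for all the particular states are then added to indicate the probability of the given length in the distribution. The procedure is repeated for each length. In the preferred form of the procedure, these computations are made with reference to a trellis diagram, as is known in the art of Markov modelling. For transition paths that share branches along the trellis structure, the computation for each common branch need be made only once and is applied to each path that includes the common branch.

In the diagram of Fig. 13, two limitations are included by way of example. First, it is assumed that the length of labels produced by the phone is 0, 1, 2 or 3, with respective probabilities $l_0$, $l_1$, $l_2$ and $l_3$. The start time is also limited, permitting only four start times with respective probabilities $q_0$, $q_1$, $q_2$ and $q_3$. With these limitations, the following equations define the end-time distribution of the subject phone:

$$\phi_0 = q_0 l_0$$
$$\phi_1 = q_1 l_0 + q_0 l_1 p_1$$
$$\phi_2 = q_2 l_0 + q_1 l_1 p_2 + q_0 l_2 p_1 p_2$$
$$\phi_3 = q_3 l_0 + q_2 l_1 p_3 + q_1 l_2 p_2 p_3 + q_0 l_3 p_1 p_2 p_3$$
$$\phi_4 = q_3 l_1 p_4 + q_2 l_2 p_3 p_4 + q_1 l_3 p_2 p_3 p_4$$
$$\phi_5 = q_3 l_2 p_4 p_5 + q_2 l_3 p_3 p_4 p_5$$
$$\phi_6 = q_3 l_3 p_4 p_5 p_6$$

Examining these equations, it is seen that $\phi_3$ includes a term corresponding to each of the four start times.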
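The seven end-time equations follow a single pattern: a phone that starts at time s and produces k labels ends at time s + k, and contributes the start probability, the length probability and the replacement values of the labels it spans. The sketch below writes that pattern as a double loop. The function name `basic_fast_match` and the numeric inputs are illustrative assumptions; only the structure comes from the equations.

```python
import math


def basic_fast_match(q, l, p):
    """End-time distribution phi for the basic fast match.

    q[s]: probability that the phone starts at time t_s.
    l[k]: probability that the phone produces k labels.
    p[t]: replacement value P' of the label at position t (1-based);
          p[0] is unused.
    """
    phi = [0.0] * (len(q) + len(l) - 1)
    for s, q_s in enumerate(q):
        for k, l_k in enumerate(l):
            prod = 1.0
            for t in range(s + 1, s + k + 1):   # labels s+1 .. s+k
                prod *= p[t]
            phi[s + k] += q_s * l_k * prod
    return phi


# Illustrative numbers: four start times, lengths 0 to 3, six labels.
q = [0.4, 0.3, 0.2, 0.1]
l = [0.1, 0.2, 0.3, 0.4]
p = [None, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
phi = basic_fast_match(q, l, p)
match_value = math.log10(sum(phi))
```

Because the replacement value of a label does not depend on the transition, no per-state probabilities have to be carried, which is where the saving over the detailed match comes from.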
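In $\phi_3$, the first term represents the probability that the phone starts at time $t = t_3$ and produces a length of zero labels (that is, the phone starts and ends at the same time). The second term represents the probability that the phone starts at time $t = t_2$, that the length of labels is one, and that label 3 is produced by the phone. The third term represents the probability that the phone starts at time $t = t_1$, that the length of labels is two (namely labels 2 and 3), and that labels 2 and 3 are produced by the phone.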
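Similarly, the fourth term represents the probability that the phone starts at time $t = t_0$, that the length of labels is three, and that the three labels 1, 2 and 3 are produced by the phone.

Comparing the computations required by the basic fast match with those required by the detailed match shows the relative simplicity of the former. In this regard, the $P'_{y_k}$ values remain the same for each appearance in all the equations, as do the label length probabilities. Moreover, with the length and start-time limitations, the computations for the later end times become simpler. For example, at $\phi_6$ the phone must start at time $t = t_3$ and all three labels 4, 5 and 6 must be produced by the phone for that end time to apply. In generating a match value for a subject phone, the end-time probabilities along the defined end-time distribution are summed. If desired, the logarithm of the sum is taken to give

$$\text{match value} = \log_{10}(\phi_0 + \cdots + \phi_6)$$

As noted previously, the match score for a word is readily determined by summing the match values for the successive phones in the particular word.

The generation of the start-time distribution is described with reference to Fig. 14. In Fig. 14(a), the word THE1 is repeated, broken down into its component phones. Fig. 14(b) shows the string of labels over time. Fig. 14(c) shows a first start-time distribution. This start-time distribution has been derived from the end-time distribution of the most recent previous phone (in the previous word, which may include the "word" of silence). Based on the label inputs and the start-time distribution of Fig. 14(c), the end-time distribution $\phi_{DH}$ for the phone DH is generated. The start-time distribution for the next phone, UH, is determined by recognizing the time during which the previous phone's end-time distribution exceeds a threshold (A) in Fig. 14(d). (A) is determined individually for each end-time distribution. Preferably, (A) is a function of the sum of the end-time distribution values for the subject phone. The interval between times a and b thus represents the time during which the start-time distribution for the phone UH is set (see Fig. 14(e)). The interval between times c and d in Fig. 14(e) corresponds to the time during which the end-time distribution for the phone DH exceeds the threshold and the start-time distribution of the next phone is set. The values of the start-time distribution are obtained by normalizing the end-time distribution, for example by dividing each end-time value by the sum of the end-time values that exceed the threshold (A).

The basic fast match phone machine 3000 has been implemented in a Floating Point Systems, Inc. 190L with an APAL program. Other hardware and software may also be used to develop a specific form of the match procedure following the teachings above.

E-1E Alternative fast match

The basic fast match, used alone or in conjunction with the detailed match or the language model, greatly reduces the computation required. To reduce the computation further, the technique described here simplifies the detailed match further by defining a uniform label length distribution between two lengths, a minimum length $L_{min}$ and a maximum length $L_{max}$. In the basic fast match, the probabilities of a phone generating labels of a given length, namely $l_0$, $l_1$, $l_2$ and so on, typically have differing values. In the alternative fast match, the probability for each length of labels is replaced by a single uniform value.

Preferably, the minimum length equals the smallest length having a nonzero probability in the original length distribution, although other lengths may be selected if desired. The selection of the maximum length is more arbitrary than the selection of the minimum length, but it is significant in that the probability of lengths less than the minimum and greater than the maximum is set to zero. By defining the length probability to exist only between the minimum and maximum lengths, a uniform pseudo-distribution is obtained. In one approach, the uniform probability can be set as the average probability over the pseudo-distribution. Alternatively, the uniform probability can be set as the maximum of the length probabilities that are replaced by the uniform value.

The effect of characterizing all the label length probabilities as equal is readily seen from the equations given above for the end-time distribution in the basic fast match. Specifically, the length probabilities can be factored out as a constant.

With $L_{min}$ set to zero and all length probabilities replaced by a single constant value, the end-time distribution can be characterized as

$$\theta_m = \frac{\phi_m}{l} = q_m + \theta_{m-1}\,p_m$$

where $l$ is the single uniform replacement value and the value of $p_m$ preferably corresponds to the replacement value for the given label generated in the given phone at time $m$.

For the above equation for $\theta_m$, the match value is defined as

$$\text{match value} = \log_{10}(\theta_0 + \theta_1 + \cdots + \theta_m) + \log_{10}(l)$$

Comparing the basic fast match and the alternative fast match, the number of additions and multiplications required is found to be greatly reduced by employing the alternative fast match. With $L_{min} = 0$, the basic fast match was found to require forty multiplications and twenty additions, in that the length probabilities must be considered. With the alternative fast match, $\theta_m$ is determined recursively and requires one multiplication and one addition for each successive $\theta_m$.

Figs. 15 and 16 illustrate how the alternative fast match simplifies computation. In Fig. 15(a), a phone machine embodiment 3100 corresponds to a minimum length $L_{min} = 0$. The maximum length is taken to be infinite so that the length distribution may be characterized as uniform. Fig. 15(b) shows the trellis diagram that results from the phone machine 3100. Assuming that start times after $q_n$ are outside the start-time distribution, every determination of a successive $\theta_m$ with $m < n$ requires one addition and one multiplication. For end times thereafter, only one multiplication is required and no additions.

Fig. 16 shows the case $L_{min} = 4$. Fig. 16(a) shows a specific phone machine embodiment 3200 for this case, and Fig. 16(b) shows the corresponding trellis diagram. Because $L_{min} = 4$, the trellis of Fig. 16(b) has zero probability along the paths marked u, v, w and z. For end times that extend between $\theta_4$ and $\theta_n$, note that four multiplications and one addition are required. For end times greater than $n + 4$, one multiplication and no additions are required. This embodiment has been implemented in APAL code on an FPS 190L. Note that additional states may be added to the embodiments of Fig. 15 or Fig. 16 as desired.

E-1F Matching based on the first J labels

As a further refinement of the basic fast match and the alternative fast match, it is contemplated that only the first $J$ labels of a string entering a phone machine be considered in the match. Assuming that labels are produced by the acoustic processor of an acoustic channel at the rate of one per centisecond, a reasonable value for $J$ is one hundred. In other words, labels corresponding to on the order of one second of speech are provided to determine a match between a phone and the labels entering the phone machine. Limiting the number of labels examined has two advantages. First, decoding delay is reduced. Second, problems in comparing the scores of short words with those of long words are substantially avoided. The length $J$ can, of course, be varied as desired.

The effect of limiting the number of labels examined can be seen with reference to the trellis of Fig. 16(b). Without the present refinement, the fast match score is the sum of the probabilities of the $\theta_m$'s along the bottom row of the diagram. That is, the probability of being in state $S_4$ at each time starting at $t = t_0$ (for $L_{min} = 0$) or $t = t_4$ (for $L_{min} = 4$) is determined as a $\theta_m$, and all the $\theta_m$'s are then totalled. For $L_{min} = 4$ there is no probability of being in state $S_4$ at any time before $t_4$. With the refinement, the summing of the $\theta_m$'s terminates at time $J$. In Fig. 16(b), time $J$ corresponds to time $t_{n+2}$.

Terminating the examination of $J$ labels over $J$ time intervals results in two probability summations in determining a match score. First, as before, there is a row computation along the bottom row of the trellis, but only up to time $J - 1$. The probabilities of being in state $S_4$ at each time up to $J - 1$ are summed to form a row score. Second, there is a column score, the sum of the probabilities that the phone is in each of the states $S_0$ to $S_4$ at time $J$:

$$\text{column score} = \sum_{f=0}^{4} \Pr(S_f, J)$$

The match value for the phone is obtained by summing the row score and the column score and then taking the logarithm of the sum. To continue the fast match for the next phone, the values along the bottom row, preferably including time $J$, are used to derive the next phone's start-time distribution.

After a match value has been determined for each successive phone, the total over all phones is, as noted previously, the sum of the match values of all the phones.

In examining the manner in which the end-time probabilities are generated in the basic fast match and the alternative fast match above, it is seen that the determination of column scores does not conform readily to the fast match computations. To better adapt the refinement of limiting the number of labels examined to the fast match, the matching technique described here replaces the column score with an additional row score. That is, an additional row score is determined for the phone being in state $S_4$ (in Fig. 16(b)) between times $J$ and $J + K$, where $K$ is the maximum number of states in any phone machine. Hence, if any phone machine has ten states, the refinement adds ten end times along the bottom row of the trellis, for each of which a probability is determined. All the probabilities along the bottom row up to and including the probability at time $J + K$ are added to produce the match value for the given phone.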
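A minimal sketch of the alternative fast match recursion for the case L_min = 0 is given below, with the bottom-row sum cut off at a chosen last time (J + K in the refinement described above). The function name `alt_fast_match`, the argument `last_time` and the numbers are assumptions for illustration; the recursion and the match value formula are those stated in the text.

```python
import math


def alt_fast_match(q, p, l_uniform, last_time):
    """Return ([theta_0 .. theta_last_time], match value).

    q[m]:      start-time probability at time m (0 beyond len(q)).
    p[m]:      replacement value of the label at time m (1-based; p[0]
               is unused).
    l_uniform: the single uniform length probability.
    last_time: last end time included in the sum, e.g. J + K.
    """
    thetas, prev = [], 0.0
    for m in range(last_time + 1):
        q_m = q[m] if m < len(q) else 0.0
        # theta_m = q_m + theta_{m-1} * p_m: one multiply and one add
        theta = q_m + (prev * p[m] if m > 0 else 0.0)
        thetas.append(theta)
        prev = theta
    match_value = math.log10(sum(thetas)) + math.log10(l_uniform)
    return thetas, match_value


# Illustrative numbers only.
thetas, match_value = alt_fast_match(
    q=[0.5, 0.5], p=[None, 0.5, 0.5, 0.5], l_uniform=0.25, last_time=3)
```

Once the start-time distribution is exhausted, each new value needs only the multiplication, which matches the operation counts given for Fig. 15(b).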
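As before, consecutive phone match values are summed to provide a word match score.

This embodiment has been implemented in APAL code on an FPS 190L; however, it may be implemented with other code on other hardware.

E-1G Phone tree structure and fast match

By employing the basic fast match or the alternative fast match, with or without the maximum label limitation, the computation time required to determine phone match values is greatly reduced. The savings remain significant even when the detailed match is performed on the words in the fast match list.

Once determined, the phone match values are compared along the branches of a tree structure 4100, as shown in Fig. 17, to determine which paths of phones are most probable. In Fig. 17, the phone match values for DH and UH1 (emanating from point 4102 to branch 4104) should sum to a much higher value for the spoken word "the" than the various sequences of phones branching from the phone MX. In this regard, the phone match value of the first MX phone is computed only once and is then used for each baseform extending from it (see branches 4104 and 4106). In addition, when the total score computed along a first sequence of branches is found to be much lower than a threshold, or much lower than the total scores of other sequences of branches, all baseforms extending from the first sequence can be eliminated as candidate words at the same time. For example, when MX is determined not to be a likely path, the baseforms associated with branches 4108 through 4118 are discarded together.

With the fast match and the tree structure, an ordered list of candidate words is produced with great computational savings.

With regard to storage requirements, note that the tree structure of phones, the statistics for the phones, and the tail probabilities must be stored. For the tree structure, there are 25000 arcs and four data words characterizing each arc. The first data word is an index to successor arcs or phones. The second data word indicates the number of successor phones along the branch. The third data word indicates at which node in the tree the arc is located. The fourth data word indicates the current phone. Hence the tree structure requires 25000 × 4 storage locations.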
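The benefit of arranging baseforms in a tree can be sketched with a toy traversal: a phone shared by several baseforms is scored once, and a whole subtree is dropped as soon as its running score falls below a threshold. Everything concrete here is hypothetical, including the nested-dictionary layout (the description stores arcs as four data words each), the function name `score_tree`, the phones other than DH, UH1 and MX, the words other than "the", and the log-score values. In the real system a phone's match value also depends on the incoming labels and start-time distribution, which this sketch replaces with a fixed lookup.

```python
def score_tree(node, phone_score, threshold, prefix_score=0.0, survivors=None):
    """Walk a phone tree depth-first, pruning low-scoring subtrees.

    node:        {"children": {phone: node}, "words": [words ending here]}
    phone_score: function phone -> log match value (toy stand-in).
    threshold:   a branch whose running total drops below this is
                 abandoned together with every baseform beneath it.
    """
    if survivors is None:
        survivors = {}
    for phone, child in node.get("children", {}).items():
        total = prefix_score + phone_score(phone)   # scored once per arc
        if total < threshold:
            continue                                # prune whole subtree
        for word in child.get("words", ()):
            survivors[word] = total
        score_tree(child, phone_score, threshold, total, survivors)
    return survivors


# Hypothetical tree: "the" = DH UH1; "me" and "my" share the MX arc.
tree = {"children": {
    "DH": {"children": {"UH1": {"words": ["the"]}}},
    "MX": {"children": {"IY1": {"words": ["me"]},
                        "AY1": {"words": ["my"]}}},
}}
toy_scores = {"DH": -1.0, "UH1": -1.5, "MX": -9.0, "IY1": -1.0, "AY1": -1.0}
scored = []


def lookup(phone):
    scored.append(phone)          # record which phones were evaluated
    return toy_scores[phone]


survivors = score_tree(tree, lookup, threshold=-5.0)
```

In this run the MX arc falls below the threshold, so neither of the phones beneath it is ever evaluated and both words that depend on it are eliminated at once.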
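In the fast match there are 100 distinct phones and 200 distinct fenemes. Since a feneme has a single probability of being produced anywhere in a phone, storage for 100 × 200 statistical probabilities is required. Finally, the tail probabilities require 200 × 200 storage locations. 100K of integer storage and 60K of floating-point storage are sufficient for the fast match.

E-1H Language model

As noted previously, a language model that stores information, such as tri-grams, relating to words in context may be included to improve the probability of correct word selection.

The language model 1010 (Fig. 1) has a unique character. Specifically, a modified tri-gram method is used. In this method, a sample text is examined to determine the likelihood of each ordered triplet of words, each ordered pair of words, and each single word in the vocabulary. A list of the most likely triplets of words and a list of the most likely pairs of words are formed. In addition, the likelihood of a triplet not being in the triplet list and the likelihood of a pair not being in the pair list are each determined.

According to the language model, when a subject word follows two words, a determination is made as to whether the subject word and the two preceding words are on the triplet list. If so, the stored probability assigned to the triplet is indicated. If the subject word and its two predecessors are not on the triplet list, a determination is made as to whether the subject word and its adjacent predecessor are on the pair list. If so, the probability of the pair is multiplied by the probability of a triplet not being on the triplet list, and the product is assigned to the subject word. If the subject word and its predecessors are on neither the triplet list nor the pair list, the probability of the subject word alone is multiplied by the probability of a triplet not being on the triplet list and by the probability of a pair not being on the pair list. The product is then assigned to the subject word.

Fig. 18 shows a flowchart of the training of the phone machines employed in acoustic matching. In step 5002 a vocabulary of words, typically on the order of 5000 words, is defined. Each word is then represented by a sequence of phone machines (step 5004). The phone machines are shown, by way of example, as phonetic phone machines but may alternatively comprise a sequence of fenemic phones. Representing words by a sequence of phonetic phone machines or by a sequence of fenemic phone machines is discussed below.

In step 5006 the word baseforms are arranged in the tree structure described above. The statistics for each phone machine in each word baseform are determined by training according to the well-known forward-backward algorithm set forth in the article "Continuous Speech Recognition by Statistical Methods" by F. Jelinek.

In step 5009, values to be substituted for the actual parameter values or statistics used in the detailed match are determined. For example, the values to be substituted for the actual label output probabilities are determined. In step 5010 the determined values replace the stored actual probabilities, so that the phones in each word baseform carry the approximate substitute values. All approximations relating to the basic fast match are performed in step 5010.

A decision is then made as to whether the acoustic matching is to be enhanced (step 5011). If not, the values determined for the basic approximate match are set for use, and other estimates relating to other approximations are not set (step 5012). If an enhanced approximate match is desired, step 5018 follows: a uniform string length distribution is defined (step 5018), and a decision is made as to whether further enhancement is desired (step 5020). If further enhancement of the approximate match is desired, the acoustic matching is limited to the first J labels in the generated string (step 5022).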
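For the basic fast match, the substitution in steps 5009 and 5010 amounts to taking, for each label, the largest probability that the label has at any transition of the phone machine. A minimal sketch follows; the function name `replacement_values`, the dictionary layout and the three-transition toy machine are assumptions for illustration.

```python
def replacement_values(label_probs_by_transition):
    """Per-label replacement value P'(y_k) for one phone machine.

    label_probs_by_transition: one dict per non-null transition, mapping
    label -> actual label probability at that transition.
    Returns label -> maximum probability over all transitions, so that
    no actual label probability exceeds its replacement value.
    """
    replacement = {}
    for probs in label_probs_by_transition:
        for label, prob in probs.items():
            replacement[label] = max(replacement.get(label, 0.0), prob)
    return replacement


# Toy phone machine with three label-producing transitions.
machine = [
    {"a": 0.7, "b": 0.3},
    {"a": 0.2, "b": 0.8},
    {"a": 0.5, "c": 0.5},
]
repl = replacement_values(machine)
```

Because every actual probability is bounded by its replacement value, any match value computed with the replacements is at least as large as the one the detailed match would give, which is the overestimate property the fast match relies on.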
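Whether or not the enhanced approximate matches are selected, the parameter values determined are then set in step 5012, whereupon each phone machine in each word baseform has been trained with the desired approximations that enable the fast approximate match.

E-1J Stack decoder

A preferred stack decoder for use in the speech recognition system of Fig. 1 was invented by L. Bahl, F. Jelinek and R. L. Mercer of the applicant's speech recognition group. The preferred stack decoder is now described.

Figs. 19 and 20 show a plurality of successive labels $y_1 y_2 \ldots$ generated at successive label intervals, or label positions.

Fig. 20 also shows a plurality of generated word paths, namely path A, path B and path C. In the context of Fig. 19, path A might correspond to the entry "to be or", path B to the entry "two b", and path C to the entry "too". For a subject word path, there is a label (or, equivalently, a label interval) at which the subject word path has the highest probability of having ended; such a label is called a "boundary label".

For a word path W representing a sequence of words, the most likely end time, represented in the label string as a "boundary label" between two words, can be found by known methods such as that described in the article "Faster Acoustic Match Computation" by L. R. Bahl, F. Jelinek and R. L. Mercer, IBM Technical Disclosure Bulletin, vol. 23, no. 4, September 1980. Briefly, the article discusses methodology for addressing two similar concerns: (a) how much of a label string Y is accounted for by a word (or word sequence), and (b) at which label interval a partial sentence, corresponding to a part of the label string, ends.

For any given word path, there is a "likelihood value" associated with each label or label interval from the first label of the label string through the boundary label. Taken together, all the likelihood values for a given word path form a "likelihood vector" for that word path. Accordingly, there is a corresponding likelihood vector for each word path. Likelihood values $L_t$ are illustrated in Fig. 20.

The "likelihood envelope" $\Lambda_t$ at label interval $t$ for a collection of word paths $W^1, W^2, \ldots, W^s$ is defined mathematically as

$$\Lambda_t = \max\bigl(L_t(W^1), \ldots, L_t(W^s)\bigr)$$

That is, for each label interval, the likelihood envelope holds the highest likelihood value associated with any word path in the collection. A likelihood envelope 1040 is illustrated in Fig. 20.

A word path is considered "complete" if it corresponds to a complete sentence. A complete path is preferably identified by a speaker input, for example by the speaker pressing a button on reaching the end of a sentence. The input is synchronized with a label interval to mark the sentence end. A complete word path cannot be extended by appending words to it.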
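The likelihood envelope defined above is an element-wise maximum over the likelihood vectors of the word paths. The sketch below computes it for paths of different lengths. The function name `likelihood_envelope`, the use of negative infinity for label intervals that no path reaches, and the numeric log-likelihoods are assumptions for illustration.

```python
import math


def likelihood_envelope(likelihood_vectors, n_intervals):
    """Element-wise maximum of the likelihood vectors.

    likelihood_vectors: one list per word path, holding L_t from the
                        first label up to that path's boundary label.
    n_intervals:        number of label intervals in the envelope.
    """
    envelope = [-math.inf] * n_intervals
    for vector in likelihood_vectors:
        for t, value in enumerate(vector):
            envelope[t] = max(envelope[t], value)
    return envelope


# Illustrative log-likelihoods for three word paths of different lengths.
path_a = [-1.0, -2.0, -3.0, -4.0]
path_b = [-1.5, -1.8]
path_c = [-0.5]
envelope = likelihood_envelope([path_a, path_b, path_c], n_intervals=5)
```

A path only contributes to the intervals up to its own boundary label, so a short path can define the envelope early in the string while a longer path defines it later.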
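A "partial" word path corresponds to an incomplete sentence and can be extended.

Partial paths are classified as "live" or "dead". A word path is dead if it has already been extended and live if it has not. With this classification, a path that has already been extended to form one or more longer word paths is never reconsidered for extension at a later time.

Each word path can also be characterized as "good" or "bad" relative to the likelihood envelope. A word path is good if, at the label corresponding to its boundary label, its likelihood value lies within Δ of the maximum likelihood envelope. Otherwise it is bad. Preferably, although not necessarily, Δ is a fixed value by which each value of the maximum likelihood envelope is lowered to serve as the good/bad threshold level.

There is one stack element for each label interval. Each live word path is assigned to the stack element corresponding to the label interval of that path's boundary label. A stack element may hold zero, one, or more path entries, and the entries are listed in order of likelihood value.

The steps performed by the stack decoder 1002 of Fig. 1 are described next.

Forming the likelihood envelope and determining which word paths are good are interrelated, as shown in the sample flowchart of Fig. 22. In that flowchart, a null path is first entered into the first stack (stack 0) in step 5050. A stack (complete) element is provided that holds any previously determined complete paths (step 5052). Each complete path in the stack (complete) element has an associated likelihood vector. The likelihood vector of the complete path with the highest likelihood at its boundary label initially defines the maximum likelihood envelope. If the stack (complete) element holds no complete path, the maximum likelihood envelope is initialized to −∞ at every label interval. Likewise, if complete paths are not specified, the envelope may be initialized to −∞. Initialization of the envelope is shown in steps 5054 and 5056.

After the maximum likelihood envelope has been initialized, it is lowered by a predetermined amount Δ, forming a Δ-good region above the lowered likelihood values and a Δ-bad region below them. The larger the value of Δ, the more word paths are considered extendable. When log₁₀ is used to compute L_t, a value of 2.0 for Δ gives satisfactory results. Preferably, but not necessarily, Δ is uniform along the length of the label intervals.

A word path whose likelihood at its boundary label lies in the Δ-good region is marked "good". Otherwise it is marked "bad".

As shown in Fig. 22, the loop that updates the likelihood envelope and marks word paths as good (extendable) or bad begins by finding the longest unmarked word path (step 5058). If two or more unmarked word paths are in the stack corresponding to that longest length, the one with the highest likelihood at its boundary label is selected. If a word path is found, it is marked "good" when the likelihood at its boundary label lies in the Δ-good region and "bad" otherwise (step 5060). If the word path is marked "bad", another unmarked live path is found and marked (step 5062). If the word path is marked "good", the likelihood envelope is updated to include the likelihood values of that path: at each label interval the updated value is the larger of (a) the current value in the likelihood envelope and (b) the likelihood value associated with the word path marked "good". This is shown in steps 5064 and 5066. After the envelope has been updated, the longest and best unmarked live word path is again found (step 5058).

The loop is repeated until no unmarked word paths remain. The shortest word path marked "good" is then selected. If more than one path marked "good" has that shortest length, the one with the highest likelihood at its boundary label is selected (step 5070). The selected shortest path is then extended: at least one likely follower word is determined as described above, preferably by performing the fast match, language model, detailed match, and language model procedures. Each likely follower word yields an extended word path, formed by appending the follower word to the end of the selected shortest word path.

After the extended word paths have been formed from the selected shortest word path, the selected word path is removed from the stack in which it was an entry, and each extended word path is entered into the appropriate stack, namely the stack corresponding to the boundary label of that extended word path (step 5072).

With regard to step 5072, the operation of extending the selected path is described with reference to Fig. 21. After a path has been found in step 5070, the following procedure is performed to extend the path into one or more word paths on the basis of a suitable approximate match.

In step 6000 (Fig. 21), the acoustic processor 1004 (Fig. 1) generates a string of labels as described above.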
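Before the extension procedure continues, the marking loop of Fig. 22 (steps 5058 to 5070) can be summarized in a minimal Python sketch. This is an illustration under stated assumptions, not the decoder itself: the function name and data layout are hypothetical, every live path is assumed to have a non-empty likelihood vector whose last entry is the likelihood at its boundary label, and "length" is taken as the number of label intervals covered.

```python
NEG_INF = float("-inf")

def mark_and_select(live_paths, num_intervals, delta=2.0, envelope=None):
    """Mark live paths good/bad and pick the path to extend.

    live_paths maps a path name to its likelihood vector; the vector's
    last entry is the likelihood at the path's boundary label.
    Returns (marks, selected_name, envelope).
    """
    if envelope is None:
        # No complete paths: the envelope starts at -infinity everywhere.
        envelope = [NEG_INF] * num_intervals
    else:
        envelope = list(envelope)
    marks = {}
    unmarked = dict(live_paths)
    while unmarked:
        # Longest unmarked path; ties broken by boundary likelihood.
        name = max(unmarked, key=lambda n: (len(unmarked[n]), unmarked[n][-1]))
        vector = unmarked.pop(name)
        boundary = len(vector) - 1
        good = vector[-1] >= envelope[boundary] - delta
        marks[name] = "good" if good else "bad"
        if good:
            # The envelope becomes the element-wise maximum.
            for t, value in enumerate(vector):
                envelope[t] = max(envelope[t], value)
    good_names = [n for n, m in marks.items() if m == "good"]
    selected = None
    if good_names:
        # Shortest good path; ties broken by highest boundary likelihood.
        selected = min(good_names,
                       key=lambda n: (len(live_paths[n]), -live_paths[n][-1]))
    return marks, selected, envelope

# Hypothetical live paths with log10 likelihoods.
live = {
    "A": [-1.0, -2.5, -4.0, -6.0],
    "B": [-1.5, -2.0, -9.0],
    "C": [-0.5, -3.0],
}
marks, selected, env = mark_and_select(live, num_intervals=4)
print(marks)     # {'A': 'good', 'B': 'bad', 'C': 'good'}
print(selected)  # C
```

Path B falls more than Δ = 2.0 below the envelope at its boundary label and is marked bad, while the shortest good path, C, is the one chosen for extension.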
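The label string is supplied as input to step 6002. In step 6002, either the basic approximate match or one of the enhanced approximate match procedures is performed, in accordance with the teachings above, to obtain an ordered list of candidate words. The language model (described earlier) is then applied in step 6004. The words remaining after the language model has been applied are entered, together with the generated labels, into the detailed match processor, which performs step 6006. The detailed match yields a list of remaining candidate words, which is preferably submitted to the language model in step 6008. The likely words determined in this way by the approximate match, the detailed match, and the language model are used to extend the path found in step 5070 of Fig. 22. Each likely word determined in step 6008 (Fig. 21) is appended separately to the found word path, so that a plurality of extended word paths can be formed.

Referring again to Fig. 22, after the extended paths have been formed and the stacks re-formed, the process repeats by returning to step 5052.

Each iteration thus consists of selecting the shortest best word path and extending it. A word path marked "bad" in one iteration may become "good" in a later iteration, so the characterization of live word paths as "good" or "bad" is made independently in each iteration. In practice, the likelihood envelope does not change greatly from one iteration to the next, so the computation that decides whether a word path is good or bad is performed efficiently. Moreover, no normalization is required.

When complete sentences are identified, step 5074 is preferably included: when no unmarked word paths remain and there is no good word path to be extended, decoding ends. The complete word path with the highest likelihood at its boundary label is taken to be the most likely word sequence for the input label string.

For continuous speech in which sentence ends are not identified, path extension proceeds continually, or for a predetermined number of words chosen by the user of the system.

E-1K Constructing phonetic baseforms

One type of Markov model phone machine that can be used in forming baseforms is based on phonetics: each phone machine corresponds to a given phonetic sound, such as those in the International Phonetic Alphabet.

For a given word there is a sequence of phonetic sounds, each corresponding to its own phone machine. Each phone machine includes a number of states and a number of transitions between them. Some transitions can produce a phoneme output and others (called null transitions) cannot. As noted earlier, the statistics associated with each phone machine include (a) the probability that a given transition occurs and (b) the likelihood that a particular phoneme is produced at a given transition. Preferably, at each non-null transition some probability is associated with every phoneme. The phoneme alphabet shown in Table 1 at the end of the detailed description preferably contains 200 phonemes. The phone machine used in forming phonetic phone machines is shown in Fig. 3. A sequence of such phone machines is provided for each word. The statistics, or probabilities, are entered into the phone machines during a training period in which known words are uttered. The transition probabilities and phoneme probabilities in the various phonetic phone machines are determined during training by noting the phoneme strings generated when each known phonetic sound is uttered at least once and applying the well-known forward-backward algorithm.

An example of the statistics for one phone, identified as phone DH, is given in Table 2 at the end of the detailed description. As an approximation, the label output probability distributions of transitions tr1, tr2, and tr8 of the phone machine of Fig. 3 are represented by a single distribution; transitions tr3, tr4, tr5, and tr9 are represented by a single distribution; and transitions tr6, tr7, and tr10 are likewise represented by a single distribution. This is reflected in Table 2 by the assignment of arcs (that is, transitions) to columns 4, 5, or 6. Table 2 shows the probability of each transition and the probability of each label (that is, phoneme) being generated at the beginning, middle, and end of phone DH. For phone DH, for example, the probability of the transition from state S₁ to state S₂ is computed as 0.07243, and the probability of the transition from state S₁ to state S₄ is 0.92757 (these are the only two transitions possible from the initial state, so their probabilities sum to 1). As for label output probabilities, phone DH has a probability of 0.091 of producing phoneme AE13 (see Table 1) at the end of the phone, that is, in column 6 of Table 2. Table 2 also gives a count associated with each node (state). The node count indicates the number of times during training that the phone was in the corresponding state. Statistics like those in Table 2 are found for each phone machine.

Arranging phonetic phone machines into word baseforms is typically done by a phonetician and is generally not done automatically.

E-1L Constructing phonemic baseforms

Fig. 23 shows an example of a phonemic phone. The phonemic phone has two states and three transitions. The null transition, shown as a dashed line, represents a path from state 1 to state 2 on which no label is produced. The self-loop transition at state 1 allows any number of labels to be produced there. The non-null transition between state 1 and state 2 is permitted to produce a label. The probabilities associated with each transition and with each label at a transition are determined during training in a manner similar to that described for phonetic baseforms.

Phonemic word baseforms are constructed by concatenating phonemic phones. One approach is described in the applicant's U.S. patent application No. 697,174, filed February 1, 1985. Preferably, a phonemic word baseform is grown from multiple utterances of the corresponding word, as described in the applicant's U.S. patent application No. 738,933, filed May 29, 1985.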
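Briefly, one method of growing a baseform from multiple utterances comprises the following steps.

(a) Convert multiple utterances of the word segment into respective phoneme strings.
(b) Define a set of phonemic Markov model phone machines.
(c) Determine the best single phone machine P₁ for producing the multiple phoneme strings.
(d) Determine the best two-phone baseform, of the form P₁P₂ or P₂P₁, for producing the multiple phoneme strings.
(e) Align the best two-phone baseform against each phoneme string.
(f) Split each phoneme string into a left portion and a right portion, the left portion corresponding to the first phone machine of the two-phone baseform and the right portion to the second.
(g) Identify each left portion as a left substring and each right portion as a right substring.
(h) Process the set of left substrings in the same manner as the set of phoneme strings corresponding to the multiple utterances, including the step of inhibiting further splitting of a substring when its single-phone baseform has a higher probability than the best two-phone baseform.
(j) Process the set of right substrings in the same manner as the set of phoneme strings corresponding to the multiple utterances, including the step of inhibiting further splitting of a substring when its single-phone baseform has a higher probability than the best two-phone baseform.
(k) Concatenate the unsplit single phones in an order corresponding to the order of the phoneme substrings to which they correspond.

The number of model elements is typically approximately equal to the number of phonemes obtained for an utterance of the word. The baseform models are then trained (that is, filled with statistics) by speaking known utterances into an acoustic processor, which generates a string of labels in response.

Based on the known utterances and the generated labels, the statistics of the word models are derived according to the well-known forward-backward algorithm.

Fig. 24 shows a trellis corresponding to phonemic phones. This trellis is considerably simpler than the trellis of Fig. 11, which relates to the phonetic detailed match.

E-2 Selecting likely words from the vocabulary by polling

Fig. 25 is a flowchart of one embodiment of the invention. In flowchart 8000, a vocabulary of words is first prescribed in step 8002. Depending on the user, the words may belong to a standard office correspondence vocabulary or to a technical vocabulary. The vocabulary here contains on the order of 5000 words or more, although the number of words may vary.

Each word is represented by a sequence of Markov model phone machines in accordance with the teachings of section E-1K or E-1L. That is, each word may be represented as a constructed baseform of sequential phonetic phone machines or as a constructed baseform of sequential phonemic phone machines.

Next, in step 8006, a "vote" is computed for each label for each word. Step 8006 is described with reference to Figs. 25, 26, 27, 28, and 29.

Fig. 26 shows the distribution of acoustic labels for a given phone machine P_p. The counts shown are obtained from the statistics generated during training. Recall that during training, known utterances corresponding to known phone sequences are spoken and a string of labels is generated in response. The number of times each label is produced when a known phone is uttered is thus obtained during training. A distribution like that of Fig. 26 is generated for each phone.

The training data yield not only the information contained in Fig. 26 but also the expected number of labels for a given phone. That is, when a known utterance corresponding to a given word is spoken, the number of labels generated is noted. Each time the known utterance is spoken, the number of labels corresponding to a given phone is noted. From this information, the most likely, or expected, number of labels for the given phone is defined. Fig. 27 shows the expected number of labels for each phone. Where the phones are phonemic phones, the expected number of labels per phone should typically average about one. For phonetic phones, the number of labels can vary greatly.

Information is extracted from the training data by means of the forward-backward algorithm, which is described in detail in the above-mentioned article "Continuous Speech Recognition by Statistical Methods". Briefly, the forward-backward algorithm determines the probability of each transition between a state i and a state (i+1) in a phone by

(a) looking forward from the initial state of the Markov model to state i and determining the statistics of arriving at state i in a "forward pass", and
(b) looking backward from the final state of the Markov model to state (i+1) and determining the statistics of proceeding from state (i+1) to the final state in a "backward pass".

The probability of the transition from state i to state (i+1), and the label outputs at that transition, are combined with the other statistics to determine the probability of a particular transition occurring given a certain string of labels.

As shown in Fig. 28 for word 1 and word 2, each word is known to be a predefined sequence of phones. Given the phone sequence of each word and the information discussed in connection with Figs. 26 and 27, it is possible to determine how many times a given label is most likely to occur for a particular word W. For a word such as word 1, the expected number of occurrences of label 1 is the count of label 1 for phone P₁, plus the count of label 1 for phone P₃, plus the count of label 1 for phone P₆, and so on. Similarly, for word 1, the expected number of occurrences of label 2 is the count of label 2 for phone P₁, plus the count of label 2 for phone P₃, and so on. The expected count of each label for word 1 is computed by performing these steps for each of the 200 labels. The count for a particular label may represent either the number of occurrences of that label divided by the total number of labels generated during training, or the number of occurrences of that label during training itself.

Fig. 29 shows the expected count of each label for a particular word (for example, word 1).

From the expected label counts of a given word, as shown in Fig. 29, a "vote" of each label for the word is computed. The vote of a label L′ for a given word W′ represents the likelihood that word W′ produces label L′, and corresponds to the logarithm of the probability that W′ produces L′. Preferably, the vote is given by

vote = log₁₀{Pr(L′|W′)}

The votes are stored in a table, as shown in Fig. 30. For each of words 1 through W, each label has a vote denoted by V with a double subscript: the first subscript identifies the label and the second the word. V₁₂, for example, is the vote of label 1 for word 2.

Referring again to Fig. 25, the process of selecting likely candidate words from the vocabulary by polling includes step 8008, in which labels are generated in response to an unknown speech input. This is performed by the acoustic processor 1004 (Fig. 1).

For a subject word, the generated labels are looked up in the table of Fig. 30, and the vote of each generated label for that word is retrieved. The votes are then accumulated to give a total vote for the word (step 8010). For example, if labels 1, 3, and 5 are generated, the votes V₁₁, V₃₁, and V₅₁ are retrieved and combined.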
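The construction of the vote table and the accumulation of votes described above can be sketched in Python as follows. All names and numbers are hypothetical. In particular, the sketch assumes that Pr(L′|W′) is the word's expected count for the label divided by the word's total expected count, and it uses a small floor value so that labels a word never produced do not yield log₁₀(0); neither choice is stated in the text.

```python
import math

def expected_label_counts(baseform, phone_label_counts):
    """Sum the per-phone label counts over the phones of one word."""
    totals = {}
    for phone in baseform:
        for label, count in phone_label_counts[phone].items():
            totals[label] = totals.get(label, 0.0) + count
    return totals

def build_vote_table(baseforms, phone_label_counts, alphabet, floor=1e-6):
    """votes[word][label] = log10 Pr(label | word).

    Pr is taken here as the word's expected count for the label divided
    by the word's total expected count (an assumption); `floor` keeps
    labels the word never produced from yielding log10(0).
    """
    votes = {}
    for word, baseform in baseforms.items():
        counts = expected_label_counts(baseform, phone_label_counts)
        total = sum(counts.values())
        votes[word] = {
            label: math.log10(max(counts.get(label, 0.0) / total, floor))
            for label in alphabet
        }
    return votes

def total_vote(word, labels, votes):
    """Accumulate the votes of the generated labels for one word."""
    return sum(votes[word][label] for label in labels)

# Hypothetical training counts for three phones over a 3-label alphabet.
phone_label_counts = {
    "P1": {1: 8.0, 2: 2.0},
    "P3": {1: 1.0, 3: 9.0},
    "P6": {2: 5.0, 3: 5.0},
}
baseforms = {"word1": ["P1", "P3"], "word2": ["P6"]}
votes = build_vote_table(baseforms, phone_label_counts, alphabet=[1, 2, 3])
print(round(total_vote("word1", [1, 3], votes), 3))  # -0.694
```

With generated labels 1 and 3, word1 collects a much higher total vote than word2, whose only phone never produced label 1 in the hypothetical training counts.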
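If the vote values are logarithms of probabilities, they are summed to give the total vote for word 1. The same procedure is carried out for every word in the vocabulary, so that labels 1, 3, and 5 "vote" for each word.

According to one embodiment of the invention, the accumulated vote of each word serves as the likelihood score of that word. The n words with the highest accumulated votes (n being a predetermined integer) are identified as candidate words, which are later processed in the detailed match and language model described earlier.

In another embodiment, "penalties" are computed along with the votes. That is, a penalty is computed and assigned for each word (step 8012). The penalty represents the likelihood that the label in question is not produced by a given word. There are various ways of computing penalties. One approach for a word represented by a phonemic baseform involves assuming that each phonemic phone produces exactly one label. For a given label and a particular phonemic phone, the penalty of the given label corresponds to the logarithm of the probability that some other label is produced by that phonemic phone. The penalty of label 1 for phone P₂ thus corresponds to the logarithm of the probability that any one of labels 2 through 200 is the one label produced. The assumption of one label output per phonemic phone is not exact, but it has been found satisfactory for computing penalties. Once the penalty of a label has been determined for each phone, the penalty for a word, which is a known sequence of phones, is readily determined.

The penalty of each label for each word is shown in Fig. 31. Each penalty is identified as PEN with two subscripts, the first denoting the label and the second the word.

Returning to Fig. 25, the labels generated in step 8008 are examined to see which labels of the label alphabet have not been generated. For each such label, the penalty for the label not being generated is evaluated. To obtain the total penalty for a given word, the penalty of each label not generated is retrieved for that word, and all such penalties are accumulated (step 8014). If each penalty corresponds to the logarithm of a "null" probability, the penalties for a given word are summed over all the labels, as with the votes. The procedure is repeated for each word in the vocabulary, so that each word has a total vote and a total penalty given the string of generated labels.

Once a total vote and a total penalty have been obtained for each word in the vocabulary, a likelihood score is determined by combining the two values (step 8016). If desired, the total vote may be weighted more heavily than the total penalty, or vice versa.

In addition, the likelihood score of each word is preferably scaled according to the number of labels that are voting (step 8018). Specifically, after the total vote and the total penalty (both of which are sums of logarithms of probabilities) have been added together, the final sum is divided by the number of acoustic labels generated and considered in computing the votes and penalties.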
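The penalty computation and the scaled likelihood score described above can be sketched in Python. The names, probabilities, and vote values are hypothetical. The sketch assumes the one-label-per-phone approximation from the text, takes the word penalty of a label as the sum over the word's phones of log₁₀(1 − Pr(label | phone)), which is one natural reading rather than something the text spells out, and divides by the number of generated labels.

```python
import math

def label_penalty(label, baseform, phone_label_probs, floor=1e-6):
    """log10 probability that no phone of the word produces `label`,
    assuming each phonemic phone emits exactly one label."""
    penalty = 0.0
    for phone in baseform:
        p_other = 1.0 - phone_label_probs[phone].get(label, 0.0)
        penalty += math.log10(max(p_other, floor))
    return penalty

def likelihood_score(labels, word_votes, word_penalties, alphabet,
                     vote_weight=1.0, penalty_weight=1.0):
    """Combine votes and penalties, then scale by the number of labels."""
    if not labels:
        raise ValueError("at least one generated label is required")
    generated = set(labels)
    total_vote = sum(word_votes[l] for l in labels)
    total_penalty = sum(word_penalties[l] for l in alphabet
                        if l not in generated)
    combined = vote_weight * total_vote + penalty_weight * total_penalty
    return combined / len(labels)

# Hypothetical per-phone label probabilities for a two-phone word.
alphabet = [1, 2, 3]
phone_label_probs = {"P1": {1: 0.8, 2: 0.2}, "P3": {1: 0.1, 3: 0.9}}
baseform = ["P1", "P3"]
word_penalties = {l: label_penalty(l, baseform, phone_label_probs)
                  for l in alphabet}
# Hypothetical votes log10 Pr(label | word) for the same word.
word_votes = {1: math.log10(0.45), 2: math.log10(0.10), 3: math.log10(0.45)}

score = likelihood_score([1, 3], word_votes, word_penalties, alphabet)
print(round(score, 3))  # -0.395
```

The `vote_weight` and `penalty_weight` parameters reflect the remark that either total may be weighted more heavily than the other; with both at 1.0 the score is the plain sum divided by the label count.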
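Dividing by the number of labels scales the likelihood score.

A further aspect of the invention concerns which labels of the string are considered in the voting and penalty (that is, polling) computations. When the end of a word is identified and the corresponding label is known, preferably all labels generated between a known start time and the known end time are considered. When the end time is found not to be known (step 8020), however, the invention provides the following method. A reference end time is defined, and after the reference end time a likelihood score is evaluated repeatedly at successive time intervals (step 8022). For example, after 500 ms, a (scaled) likelihood score is evaluated for each word at 50 ms intervals, and this continues until 1000 ms from the start time of the word utterance. In this example, each word has ten (scaled) likelihood scores.

A procedure is then applied to select which of the ten likelihood scores is to be assigned to a given word. Specifically, from the series of likelihood scores obtained for a given word, the score that is highest relative to the scores of the other words at the same time interval is selected (step 8024). The highest likelihood score at each time interval is subtracted from all the likelihood scores at that interval.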
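The normalization just described, together with the per-word selection that the text goes on to explain (steps 8024 and 8026), can be sketched in Python. The function names and scores are hypothetical, and only three intervals are shown instead of ten to keep the example short.

```python
def assign_relative_scores(scores):
    """scores[word] is a list of scaled scores, one per time interval.

    Returns {word: best relative score}, where a relative score is the
    word's score minus the highest score of any word in that interval.
    """
    num_intervals = len(next(iter(scores.values())))
    best_per_interval = [
        max(word_scores[t] for word_scores in scores.values())
        for t in range(num_intervals)
    ]
    return {
        word: max(s - best for s, best in zip(word_scores, best_per_interval))
        for word, word_scores in scores.items()
    }

def top_candidates(assigned, n):
    """Return the n words with the highest assigned relative scores."""
    return sorted(assigned, key=assigned.get, reverse=True)[:n]

# Hypothetical scaled likelihood scores at three successive intervals.
scores = {
    "to":  [-0.9, -0.6, -0.8],
    "two": [-0.7, -0.9, -1.2],
    "be":  [-1.5, -1.4, -0.9],
}
assigned = assign_relative_scores(scores)
print(top_candidates(assigned, 2))  # ['to', 'two']
```

A word that is the best scorer in at least one interval receives a relative score of zero, and every other word receives a negative value, so the ranking compares each word at the interval where it does best against its competitors.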
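The word with the highest likelihood score at a given time interval is thereby set to zero, and the likelihood scores of the other, less likely words take negative values. The least negative likelihood score of a given word (the one closest to zero) is assigned to that word as its highest relative likelihood score.

Once a likelihood score has been assigned to each word, the n words with the highest assigned likelihood scores are selected as the candidate words obtained by polling (step 8026).

In one embodiment of the invention, the n words obtained by polling are provided as a reduced list of words, which are then subjected to the detailed match and language model processing described earlier. In this embodiment the reduced list obtained by polling takes the place of the acoustic fast match described above. In this regard, it is noted that the acoustic fast match provides a tree-like lattice structure into which word baseforms are entered as sequential phones, with words having the same initial phones following a common branch of the tree. For a 2000-word vocabulary, the polling method of the invention has been found to be two to three times faster than the fast match with the tree-like lattice structure.

Alternatively, the acoustic fast match and polling can be used in combination. That is, from the trained Markov models and the generated string of labels, the approximate fast match is performed in step 8028 in parallel with polling. One list is provided by the acoustic match and another by polling. In a conservative approach, the entries on one list are used to augment the other list. In an approach that seeks to reduce the number of best candidate words further, only words appearing on both lists are retained for further processing. The interaction of the two techniques in step 8030 depends on the accuracy and computational goals of the system. As a further alternative, the lattice-style acoustic fast match may be applied sequentially to the polling list.

Apparatus for performing polling is shown in Fig. 32. In that figure, element 8102 stores word models that have been trained as described above. From the statistics applied to the word models, a vote generator 8104 computes the vote of each label for each word and stores the votes in a vote table memory 8106.

Similarly, a penalty generator 8108 computes the penalty of each label for each word in the vocabulary and enters the values into a penalty table memory 8110.

A word likelihood score evaluator 8112 receives the labels generated by an acoustic processor 8114 in response to an unknown speech input. For a given word selected by a word select element 8116, the word likelihood score evaluator 8112 combines the votes of each generated label for the selected word with the penalties of each label not generated. The evaluator 8112 also includes means for scaling the likelihood score as described above. The likelihood score evaluator may also, although it need not, include means for repeating the score evaluation at successive time intervals after a reference time.

The likelihood score evaluator 8112 supplies word scores to a word lister 8120, which orders the words according to their assigned likelihood scores.

In an embodiment in which the word list obtained by polling is combined with the list obtained by the approximate fast match, a list comparer 8122 is provided. The list comparer 8122 receives as input the polling list from the word lister 8120 and the list from the acoustic fast match (described in several embodiments above).

Several features are incorporated to reduce storage and computational requirements. First, the votes and penalties can be formatted as integers ranging from 0 to 255. Second, the actual penalties can be replaced by approximate penalties computed from the corresponding vote by the formula penalty = a × vote + b.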
In this formula, a and b are constants determined by least squares regression. Third, the labels can be grouped into acoustic classes, each class containing at least one label. The assignment of labels to classes is determined by hierarchically clustering the labels so as to maximize the resulting mutual information between the acoustic classes and the words.

Note that, according to the invention, periods of silence are detected (by known methods) and ignored.

The invention has been implemented in PL/I on an IBM MVS system, but it may also be implemented in other programming languages and on other systems.

The invention may be modified in various ways within the scope of its technical concept.

For example, the end time of a word may be determined by summing the expected number of labels of each phone in the word baseform. In addition, votes and penalties may be computed only for selected labels of the generated label string, for example the odd-numbered labels or the first m labels. It is preferable, however, to consider the votes and penalties of every label between the start and the end of the word.

Furthermore, the invention contemplates the use of various voting formulas other than the summing of logarithms of probabilities. The invention applies generally to polling apparatus and polling methods for obtaining a short list of candidate words in which each label casts a vote for each word in the vocabulary and the vote typically differs from word to word.

F. Effect of the invention

As described above, the invention uses a polling method to produce the list of words that are to undergo the detailed match in speech recognition, so that such a word list is obtained in a short processing time.

The invention is described in the following order.

A. Industrial field of application
B. Prior art
C. Problems to be solved by the invention
D. Means for solving the problems
E. Embodiments
E-1 Speech recognition system
E-1A Overview of the configuration (Figs. 1 and 2)
E-1B Auditory model and its implementation (Fig. 4)
E-1C Detailed match (Figs. 3 and 11)
E-1D Basic fast match (Fig. 12)
E-1E Alternative fast match (Figs. 15 and 16)
E-1F Matching based on the first J labels (Fig. 16)
E-1G Phone tree structure and fast match (Fig. 17)
E-1H Language model
E-1J Stack decoder (Figs. 21 and 22)
E-1K Constructing phonetic baseforms (Fig. 3)
E-1L Constructing phonemic baseforms (Fig. 23)
E-2 Selecting likely words from the vocabulary by polling (Figs. 25 to 29)
F. Effect of the invention
G. Tables

A. Industrial field of application

This invention relates generally to speech recognition, and in particular to techniques for forming a short list of likely words selected from a vocabulary of words.

B. Prior art

In a probabilistic approach to speech recognition, an acoustic processor first converts the speech waveform into a string of labels, or phonemes. The labels, each of which represents a type of sound, are typically selected from an alphabet of about 200 distinct labels. The generation of such labels is described in various articles, such as "Continuous Speech Recognition by Statistical Methods", Proceedings of the IEEE, vol. 64, pp. 532-556 (1976).

Markov model phone machines (also called probabilistic finite-state machines) have been discussed in connection with the use of labels for speech recognition.
A Markov model normally has a plurality of states and transitions between those states. In addition, a Markov model normally has probabilities assigned to it relating to (a) the probability of each transition occurring and (b) the respective probability of producing each label at the various transitions. Markov models are described in articles such as L. R. Bahl, F. Jelinek, and R. L. Mercer, "A Maximum Likelihood Approach to Continuous Speech Recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, volume PAMI-5, No. 2, March 1983.

In speech recognition, a matching process is performed to determine which word in the vocabulary is most likely, given the string of labels supplied by the acoustic processor. As shown in the applicant's U.S. patent application No. 672,974, filed November 19, 1984, acoustic matching is performed by (a) characterizing each word in the vocabulary by a sequence of Markov model phone machines and (b) determining the respective likelihood of each word-representing sequence of phone machines producing the string of labels generated by the acoustic processor. Each word-representing sequence of phone machines is called a word baseform.

As described in the above-mentioned U.S. patent application, a word baseform can be composed of phonetic phone machines. In that case each phone machine preferably corresponds to a phonetic sound and has seven states and thirteen transitions.

Alternatively, each word baseform may be formed as a sequence of phonemic phone machines. A phonemic phone machine is a simpler Markov model than a phonetic phone machine and preferably consists of two states. Between the first state and the second state there is a null transition, on which no label can be produced. Between the first state and the second state there is also a non-null transition, on which one label from the alphabet can be produced. At the first state there is a self-loop, at which a label can be produced. During training, the statistics for each phonemic phone machine are determined.
That is, for each phonemic phone, the probability of each transition and the probability of each label being produced at each non-null transition are computed from known utterances. Word baseforms are formed by concatenating phonemic phone machines.

In a system employing phonemic or phonetic baseforms, the ultimate goal is to find the most likely word (or sequence of words) corresponding to the string of labels generated by the acoustic processor in response to an unknown speech input. One way of achieving this goal is described in the above-mentioned U.S. patent application. In that approach, the number of words is first reduced from the total number in the vocabulary to a list of likely candidate words. The words on the list are then considered in a more detailed match procedure or a language model procedure, from which preferably the most likely word is selected. In reducing the number of candidate words, that method applies approximations to the phone machines, which gives fast processing without excessive computation. To achieve the word reduction, an approximate acoustic match is performed using calculations based on a matching trellis. This approximate acoustic match has been found useful in deciding which words should be processed further by the more detailed match and the language model.

C. Problems to be solved by the invention

The object of the invention is to provide a fast and computationally simple method, and apparatus for carrying it out, for determining which words of the vocabulary have a relatively high likelihood of corresponding to a string of acoustic labels.

D. Means for solving the problems

The invention teaches a non-traditional way of reducing the number of words that must be examined in detail. The invention relates to a polling method in which a table is set up so that each label in the alphabet casts a "vote" for each word in the vocabulary. The vote reflects the likelihood that a given word produced a given label. The vote values are computed from the label output probability and transition probability statistics obtained during a training session.

According to one embodiment of the invention, a string of labels is generated and a subject word is selected. Each label in the string is identified in the vote table, and the vote of each label for the subject word is determined. All the votes of the labels for the subject word are accumulated and combined to give a likelihood score. Repeating this process for each word in the vocabulary gives every word a likelihood score. A list of likely candidate words can be derived from the likelihood scores.

In a second embodiment, a second table is also formed containing the penalty that each label assigns to each word in the vocabulary. The penalty assigned to a given label represents the likelihood that a word does not produce that label. In the second embodiment, both the label votes and the penalties are considered in determining the likelihood score of a given word from the label string.

To account for length, the likelihood score is preferably scaled according to the number of labels considered in computing the likelihood score of the word.

Furthermore, when the end point of a word along the generated labels is not determined, the invention provides that likelihood scores be computed at successive time intervals, so that the subject word has a plurality of successive likelihood scores. The invention further provides that the subject word be assigned the likelihood score that is highest relative to the likelihood scores of, preferably, all other words in the vocabulary.

The invention teaches a method of selecting likely words from a vocabulary of words, in which each word is represented by a sequence of at least one probabilistic finite-state phone machine and an acoustic processor generates acoustic labels in response to speech input. The method comprises the step of (a) forming a first table in which each label in the alphabet casts a vote for each word in the vocabulary, the vote of each label for a given word representing the likelihood that the given word produces the label casting the vote. The method further comprises the steps of (b) forming a second table in which each label is assigned a penalty for each word in the vocabulary, the penalty assigned to a given label for a given word representing the likelihood that the given label is not produced according to the model of the given word, and (c) for a given string of labels, determining the likelihood of a particular word by combining the votes of all labels in the string for that word with the penalties of all labels not in the string for that word.

Preferably, the method also includes repeating steps (a), (b), and (c) for each word, so as to give every word a likelihood score. If desired, the method can be used in combination with the technique of the above-mentioned application No. 672,974.
E. Embodiments

E-1 Speech recognition system

E-1A Overview of the configuration

Fig. 1 is a general block diagram of a speech recognition system 1000. The system 1000 includes a stack decoder 1002 and, connected to the stack decoder, an acoustic processor 1004, an array processor 1006 used in performing the fast approximate acoustic match, an array processor 1008 used in performing the detailed acoustic match, a language model 1010, and a workstation 1012.

The acoustic processor 1004 is designed to convert a speech waveform input into a string of labels, or phonemes, each of which in a broad sense identifies a corresponding type of sound. In this system the acoustic processor 1004 is based on a unique model of the human ear, which is described in the applicant's Japanese Patent Application No. 60-211229.

The labels, or phonemes, from the acoustic processor 1004 enter the stack decoder 1002. Logically, the stack decoder 1002 can be represented by the block elements shown in Fig. 2. That is, the stack decoder 1002 has a search block 1020, which communicates with the workstation 1012 and, through interfaces 1022, 1024, 1026, and 1028, with the acoustic processor 1004, the fast match processor 1006, the detailed match processor 1008, and the language model 1010.

In operation, phonemes from the acoustic processor 1004 are directed by the search block 1020 to the fast match processor 1006. The fast match procedure is described later and also in the above-mentioned U.S. patent application No. 672,974.
Briefly, the object of matching is to determine the most likely word (or words) for a given string of labels. The fast match is designed to examine the words in the vocabulary and reduce the number of candidate words for a given string of incoming labels. The fast match is based on probabilistic finite-state machines, also referred to as Markov models.

Once the fast match has reduced the number of candidate words, the stack decoder 1002 communicates with the language model 1010, which determines the contextual likelihood of each candidate word in the fast match candidate list, preferably on the basis of existing trigrams.

Preferably, the detailed match then examines those words from the fast match candidate list that have a reasonable likelihood of being the spoken word according to the language model computations. The detailed match procedure is described in the above-mentioned U.S. patent application No. 672,974. The detailed match is carried out by means of Markov model phone machines such as that shown in Fig. 3.

After the detailed match, the language model is preferably invoked again to determine word likelihood. Using the information obtained from the fast match, the detailed match, and the language model, the stack decoder 1002 of this embodiment is designed to determine the most likely path, or sequence, of words for the string of generated labels.

Two prior art approaches to finding the most likely word sequence are Viterbi decoding and single stack decoding. Both are described in the above-mentioned article by L. R. Bahl, F. Jelinek, and R. L. Mercer, Viterbi decoding in section 5 and single stack decoding in section 6.

In single stack decoding, paths of varying length are listed in a single stack according to likelihood. Single stack decoding must take into account that likelihood depends to some extent on path length, so normalization is generally employed. The Viterbi technique, on the other hand, requires no such normalization and is generally practical for small tasks.

As another alternative, decoding can be performed in a small vocabulary system by examining each possible combination of words as a possible word sequence and determining which combination has the highest probability of producing the generated label string. The amount of computation this technique requires, however, makes it impractical for large vocabulary systems.

The stack decoder 1002 in effect serves to control the other elements but does not perform many computations. The stack decoder 1002 therefore preferably includes a 4341 processor running under the VM/370 operating system, as described in publications such as Virtual Machine/System Product Introduction Release 3 (1983). The array processors, which perform considerable computation, have been implemented with commercially available Floating Point Systems (FPS) 190L units.

A novel technique involving multiple stacks and a unique decision strategy for determining the best word sequence, or path, was invented by L. R. Bahl, F. Jelinek, and R. L. Mercer and is described later.

E-1B Auditory model and its implementation

Fig. 4 shows a specific embodiment of an acoustic processor 1100 (1004 in Fig. 1).
In this figure, a speech waveform input (for example, natural speech) enters an analog-to-digital (A/D) converter 1102, which samples the input at a prescribed rate. A typical sampling rate is one sample every 50 microseconds. A time window generator 1104 is provided for the output of the A/D converter 1102. The output of the time window generator 1104 is fed to a fast Fourier transform (FFT) element 1106, which provides a frequency spectrum for each time window.

The output of the FFT 1106 is then processed to produce labels y₁y₂...y_f. Four elements cooperate to generate the labels: a feature selection element 1108, a cluster element 1110, a prototype element 1112, and a labeller 1114. In generating labels, prototypes are defined as points (or vectors) in a space on the basis of selected features, and speech inputs are then characterized by the same selected features to give corresponding points (or vectors) in the space that can be compared with the prototypes.

Specifically, in defining the prototypes, sets of points are grouped into individual clusters by the cluster element 1110. The method of defining the clusters is based on probability distributions, such as the Gaussian distribution, applied to speech. The prototype of each cluster, relating to the centroid or another characteristic of the cluster, is generated by the prototype element 1112. The generated prototypes and the speech input, both characterized by the same selected features, enter the labeller 1114. The labeller 1114 performs a comparison procedure that results in a label being assigned to a particular speech input.

The selection of appropriate features is a key factor in deriving labels that represent the speech waveform input. The acoustic processor described here has an improved feature selection element 1108. With this acoustic processor, an auditory model is derived and applied in the acoustic processor of a speech recognition system.
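The comparison procedure of the labeller described above, in which a feature vector is compared with the cluster prototypes and receives the label of the best match, can be illustrated with a minimal Python sketch. The text bases clustering on probability distributions such as the Gaussian; the sketch swaps in plain squared Euclidean distance as a simplifying assumption. The function names, the two-dimensional feature vectors, and the label names "AA11" and "SIL" are hypothetical.

```python
def nearest_prototype_label(feature_vector, prototypes):
    """Return the label of the prototype closest to the feature vector.

    prototypes maps a label to its prototype vector; squared Euclidean
    distance is used here as the comparison (an assumption).
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(prototypes,
               key=lambda label: squared_distance(feature_vector,
                                                  prototypes[label]))

def label_frames(frames, prototypes):
    """Assign one label to each time frame's feature vector."""
    return [nearest_prototype_label(frame, prototypes) for frame in frames]

# Hypothetical prototypes and per-frame feature vectors.
prototypes = {"AA11": [1.0, 0.0], "AE13": [0.0, 1.0], "SIL": [0.0, 0.0]}
frames = [[0.9, 0.1], [0.1, 0.8], [0.05, 0.0]]
print(label_frames(frames, prototypes))  # ['AA11', 'AE13', 'SIL']
```

In the system described, each feature vector would hold one value per examined critical frequency band rather than two, and the alphabet would contain about 200 prototypes.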
To explain the auditory model, please refer to Figure 5.
Ru. Figure 5 is a diagram showing a part of the inside of the human ear.
It is. In particular, internal hair cells 1200 absorb fluid.
End 1202 protruding into retaining passageway 1204
It is shown as having internal hair cells
Upstream are external hair cells 1206, which also
with an end 1208 projecting into the passageway 1204
It is shown as. and internal hair cells 12
00 and external hair cells 1206 send information to the brain.
It has nerves connected to it. In particular, the new euro
electricity sent through nerves to the brain for processing
undergoes electrochemical changes that result in physical stimulation. This electric
Gas chemical changes are caused by mechanical movement of the basement membrane 1210.
be stimulated by Conventionally, the basilar membrane 1210 has a frequency response to audio wave input.
It acts as a wave number analyzer and
It is possible for each part to respond to individual critical frequency bands.
Are known. And the different basement membrane 1210
When a part responds to its corresponding frequency band
This means that the perceived sound for a sound wave input is
Affects size. That is, two similar
than when sounds of intensity occupy the same frequency band.
Also, the two sounds are in different critical frequency bands.
In some cases, the sound is perceived to be louder.
Ru. Approximately 22 determined by basement membrane 1210
It is known that a critical frequency band exists.
To conform to the frequency response of the basilar membrane 1210, the acoustic processor 1100 of this embodiment preferably separates the acoustic wave input into some or all of the critical frequency bands and examines the signal component of each separated band individually. This function is achieved by appropriately filtering the signal from the FFT 1106 (FIG. 4) so that the feature selection block 1108 receives a separate signal for each critical frequency band examined. The separate inputs have also been blocked into time frames (preferably of 25.6 ms) by the time window generator 1104. The feature selection block 1108 therefore preferably holds 22 signals, each of which represents the sound intensity in a given frequency band for one time frame after another. The filtering is preferably performed by the conventional critical band filter 1300 shown in FIG. 6. The separate signals are then processed by an equal loudness converter 1302.
This converter 1302 accounts for the variation of perceived loudness as a function of frequency. In this regard, a first tone at a given dB level at one frequency may differ in perceived loudness from a second tone at the same dB level at a different frequency. Converter 1302 converts the signals of the individual frequency bands, on the basis of empirical data, so that the various bands are measured on the same loudness scale. For example, converter 1302 preferably maps sound intensity to equal loudness on the basis of the 1933 studies of Fletcher and Munson, with some modifications. The modified results of those studies are shown in FIG. 7. As can be seen from the equal loudness contour passing through the point (1000 Hz, 40 dB), a 1000 Hz tone at 40 dB is comparable in loudness to a 100 Hz tone at 60 dB. The converter 1302 preferably adjusts loudness according to the contours of FIG. 7 so as to give equal loudness regardless of frequency.
As FIG. 7 further shows, besides the dependence on frequency, changes in sound intensity and changes in loudness do not correspond. That is, a change in the intensity, or amplitude, of a sound is not necessarily reflected by a similar change in perceived loudness. For example, at 100 Hz a 10 dB change in intensity near 110 dB produces a much larger change in perceived loudness than a 10 dB change near 20 dB. This difference is handled by a loudness scaling block 1304, which compresses loudness in a predetermined manner. Preferably, the loudness scaling block 1304 compresses the intensity P to its cube root P^(1/3) by replacing the loudness amplitude measure in phons with sones.
FIG. 8 shows experimentally obtained sone versus phon data. By adopting sones, the model remains substantially accurate at large speech signal amplitudes. One sone is defined as the loudness of a 1 kHz tone at 40 dB.
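The two statements above, cube-root compression of intensity and one sone for a 1 kHz tone at 40 dB, can be combined into a small numerical sketch. The function name `phons_to_sones` and the closed formula are assumptions made for illustration only; the description itself relies on the empirical curve of FIG. 8, not on a formula.

```python
def phons_to_sones(phons: float) -> float:
    """Approximate loudness in sones for a loudness level in phons.

    Assumption (not from the source): the intensity relative to the
    40 dB reference is 10 ** ((phons - 40) / 10), and loudness is the
    cube root of that ratio, so 40 phons maps to exactly 1 sone.
    """
    relative_intensity = 10.0 ** ((phons - 40.0) / 10.0)
    return relative_intensity ** (1.0 / 3.0)


if __name__ == "__main__":
    for level in (40, 50, 70, 100):
        print(level, "phons ->", round(phons_to_sones(level), 2), "sones")
```

Under this assumption each 10 dB increase multiplies the loudness by the cube root of 10, roughly 2.15, which is why large amplitudes are strongly compressed.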
Referring again to FIG. 6, a time-varying response block 1306 is shown, which processes the equal-loudness, loudness-scaled signal corresponding to each critical frequency band. In particular, a neural firing rate f is determined for each frequency band examined at each time frame. According to this acoustic processor, the firing rate is expressed as
$$f = (S_o + DL)\,n \tag{1}$$
where n is the amount of neurotransmitter, So is a spontaneous firing constant relating to neural firing that is independent of the acoustic wave input, L is a loudness measurement, and D is a displacement constant. Thus (So)n corresponds to the spontaneous neural firing rate that occurs whether or not there is an acoustic wave input, and DLn corresponds to the firing rate due to the acoustic wave input.
Significantly, this acoustic processor characterizes the value of n as changing over time according to the relationship
$$\frac{dn}{dt} = A_o - (S_o + S_h + DL)\,n \tag{2}$$
where Ao is a replenishment constant and Sh is the spontaneous decay constant of the neurotransmitter. The novel relationship of equation (2) takes into account that neurotransmitter is produced at a certain rate and is lost (a) through decay (Sh × n), (b) through spontaneous firing, and (c) through neural firing due to the acoustic wave input. The presumed locations of these modeled phenomena are shown in FIG. 5.
Equation (2) also reflects the fact that this acoustic processor is nonlinear, in that the next amount of neurotransmitter and the next firing rate depend multiplicatively on the current conditions of at least the neurotransmitter amount. That is, the amount of neurotransmitter at time (t + Δt) equals the amount at time t plus (dn/dt)·Δt:
$$n(t + \Delta t) = n(t) + \frac{dn}{dt}\,\Delta t \tag{3}$$
Equations (1), (2), and (3) describe a time-varying signal analyzer that takes advantage of the fact that the auditory system is adaptive over time and that the signals on the auditory nerve are nonlinearly related to the acoustic wave input. In this respect, this acoustic processor provides the first model that embodies nonlinear signal processing in a speech recognition system so as to conform better to the apparent time variations in the nervous system.
To reduce the number of unknowns in equations (1) and (2), this acoustic processor uses the following equation (4), which applies to a fixed loudness L:
$$S_o + S_h + DL = \frac{1}{T} \tag{4}$$
where T is a measure of the time it takes for the auditory response to fall to 37% of its maximum after an acoustic wave input is generated. T is a function of loudness and, according to this acoustic processor, is obtained from graphs that show the decay of the response for various loudness levels.
That is, when a tone of fixed loudness is generated, it produces a response at a first high level, after which the response decays toward a steady-state level with time constant T. With no acoustic wave input, T = To (about 50 milliseconds). At a loudness of Lmax, T = Tmax (about 30 milliseconds). Setting Ao = 1, 1/(So + Sh) is 5 centiseconds when L = 0. When L is Lmax, with Lmax = 20 sones, equation (5) results:
$$S_o + S_h + D(20) = \frac{1}{30} \tag{5}$$
From the above data and equations, So and Sh are determined by equations (6) and (7):
$$S_o = \frac{D L_{max}}{R + (D L_{max} T_o R) - 1} \tag{6}$$
$$S_h = \frac{1}{T_o} - S_o \tag{7}$$
where
$$R = \frac{f_s\,|_{L = L_{max}}}{f_s\,|_{L = 0}} \tag{8}$$
and f_s| denotes the firing rate at a given loudness in the steady state, that is, when dn/dt = 0. R is the only variable left in the acoustic processor.
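As a numerical check of equations (4) to (8), the sketch below derives D, So, and Sh from To, Tmax, Lmax, and R. The function names are hypothetical, and the sketch assumes that times are expressed in centiseconds (To = 5, Tmax = 3), which is the unit in which the constants quoted in this description come out.

```python
def model_constants(t0, t_max, l_max, r):
    """Derive (So, Sh, D) from the decay times and the ratio R.

    Assumption: t0 and t_max are in centiseconds and Ao = 1.
    """
    # Equation (4) at L = 0 and at L = Lmax gives D * Lmax = 1/Tmax - 1/To.
    d = (1.0 / t_max - 1.0 / t0) / l_max
    so = d * l_max / (r + d * l_max * t0 * r - 1.0)   # equation (6)
    sh = 1.0 / t0 - so                                # equation (7)
    return so, sh, d


def steady_state_rate(so, sh, d, loudness, a0=1.0):
    """Firing rate when dn/dt = 0, from equations (1) and (2)."""
    n = a0 / (so + sh + d * loudness)
    return (so + d * loudness) * n


if __name__ == "__main__":
    so, sh, d = model_constants(t0=5.0, t_max=3.0, l_max=20.0, r=1.5)
    print(round(so, 4), round(sh, 4), round(d, 5))
    # Ratio of steady-state rates at Lmax and at silence, equation (8).
    print(steady_state_rate(so, sh, d, 20.0) / steady_state_rate(so, sh, d, 0.0))
```

With R = 1.5 the sketch gives So of about 0.0889, Sh of about 0.1111, and D of about 0.00667, and the ratio of steady-state firing rates recovers R.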
Therefore, to change the performance of the acoustic processor, only R need be changed. That is, R is a single parameter that can be adjusted to alter performance, which normally means minimizing steady-state effects relative to transient effects. Minimizing steady-state effects is desirable because inconsistent output patterns for similar speech inputs generally result from differences in frequency response, speaker differences, background noise, and distortion, which affect the steady-state portions of the speech signal but not the transient portions. The value of R is preferably set by optimizing the error rate of the complete speech recognition system. A suitable value found in this way is R = 1.5. The values of So, Sh, and D are then 0.0888, 0.1111, and 0.00666, respectively.
FIG. 9 shows a flowchart of this acoustic processor. In FIG. 9, digitized speech in a time frame of preferably 25.6 ms, sampled at preferably 20 kHz, passes through a Hanning window 1320, and the output of the Hanning window 1320 is subjected to a Fourier transform 1322, preferably at 10 millisecond intervals.
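The framing figures above imply 512 samples per frame (25.6 ms at 20 kHz) and a step of 200 samples (10 ms). The sketch below illustrates windowing and transforming one frame. All names are hypothetical, and a naive discrete Fourier transform stands in for the FFT purely to keep the example free of external libraries.

```python
import cmath
import math

SAMPLE_RATE_HZ = 20_000
FRAME_LEN = 512    # 25.6 ms at 20 kHz
FRAME_STEP = 200   # 10 ms at 20 kHz


def hanning(n):
    """Symmetric Hanning window of n points."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]


def frames(samples, frame_len=FRAME_LEN, step=FRAME_STEP):
    """Yield successive Hanning-windowed frames of the sample sequence."""
    window = hanning(frame_len)
    for start in range(0, len(samples) - frame_len + 1, step):
        chunk = samples[start:start + frame_len]
        yield [s * w for s, w in zip(chunk, window)]


def power_spectrum(frame):
    """Power at bins 0 .. n/2 using a naive O(n^2) DFT (assumption: stands in for an FFT)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum
```

A tone with exactly 25 cycles per frame should produce its largest power at bin 25, which corresponds to about 977 Hz at this sampling rate.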
The transform output is filtered by element 1324 to give a power density output for each of at least one frequency band (preferably all the critical frequency bands, or at least twenty of them). The power density is then converted from log magnitude to loudness level in step 1326. This is readily done according to the graph of FIG. 7. The subsequent processing, including step 1330, is shown in FIG. 10.
In FIG. 10, a threshold of feeling Tf and a threshold of hearing Th are initially set, in step 1340, to Tf = 120 dB and Th = 0 dB for each filtered frequency band m. A speech counter, a total frames register, and a histogram register are then reset in step 1342.
Each histogram contains bins, each of which holds the number of samples or counts during which the power, or some similar measure, in a given frequency band lies in a respective range. In the present case a histogram preferably represents, for each given frequency band, the number of centiseconds during which the loudness lies in each of several loudness ranges. For example, in the third frequency band there may be 20 centiseconds during which the power lies between 10 dB and 20 dB. Similarly, in the twentieth frequency band, 150 of a total of 1000 centiseconds may lie between 50 dB and 60 dB. Percentiles are derived from the total number of samples (or centiseconds) and the counts contained in the bins.
In step 1344, a frame from the filter output of each frequency band is examined, and one bin in the appropriate histogram for each filter is incremented in step 1346. The total number of bins whose amplitude exceeds 55 dB is summed for each filter (that is, each frequency band), and the number of filters indicating the presence of speech is determined. If fewer than a minimum number of filters (for example, six of twenty) indicate speech, the next frame is examined in step 1344. If enough filters indicate speech in step 1350, a speech counter is incremented in step 1352. The speech counter is incremented in step 1352 until 10 seconds of speech have occurred in step 1354, whereupon new values of Tf and Th are determined for each filter in step 1356.
The new values of Tf and Th for a given filter are determined as follows. For Tf, the dB value of the bin that holds the 35th sample from the top of 1000 bins (that is, the 96.5th percentile of speech) is defined as BIN_H. Tf is then set to Tf = BIN_H + 40 dB. For Th, the dB value of the bin that holds the sample lying the fraction 0.01 of the way up from the lowest bin is defined as BIN_L.
That is, BIN_L is the bin in the histogram at 1% of the number of samples in the histogram, excluding the number of samples classified as speech. Th is then defined as Th = BIN_L − 30 dB.
Returning to FIG. 9, the sound amplitudes are converted to sones and scaled in step 1332 on the basis of the thresholds updated in steps 1330 and 1332, in the manner described above. As an alternative way of obtaining sones and scaling, the filter amplitudes "a" (after the bins have been incremented) may be converted to dB according to
$$a^{dB} = 20 \log_{10}(a) - 10 \tag{9}$$
Each filter amplitude is then scaled to a range between 0 and 120 to give equal loudness, according to
$$a^{eql} = \frac{120\,(a^{dB} - T_h)}{T_f - T_h} \tag{10}$$
a^eql is then preferably converted from a loudness level (phons) to an approximation of loudness in sones (with a 1 kHz signal at 40 dB mapping to 1) by
$$L^{dB} = \frac{a^{eql} - 30}{4} \tag{11}$$
and the loudness in sones is approximated as
$$L_s(\text{approx}) = 10^{L^{dB}/20} \tag{12}$$
The loudness in sones is then supplied as input to equations (1) and (2) in step 1334, and the output firing rate f for each frequency band is determined (step 1335). With 22 frequency bands, a 22-dimensional vector characterizes the acoustic wave input over successive time frames. In general, however, twenty frequency bands are examined using a mel-scaled filter bank (the mel being a unit of pitch).
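The alternative conversion of equations (9) to (12) is a short chain of arithmetic steps and can be sketched directly. The function name `amplitude_to_sones` and the default thresholds (the initial values Tf = 120 dB and Th = 0 dB) are choices made for this illustration.

```python
import math


def amplitude_to_sones(a, t_h=0.0, t_f=120.0):
    """Approximate loudness in sones for a filter amplitude a (a > 0).

    t_h and t_f are the hearing and feeling thresholds in dB; the defaults
    are the initial values, used here only as an assumption.
    """
    a_db = 20.0 * math.log10(a) - 10.0            # equation (9)
    a_eql = 120.0 * (a_db - t_h) / (t_f - t_h)    # equation (10)
    l_db = (a_eql - 30.0) / 4.0                   # equation (11)
    return 10.0 ** (l_db / 20.0)                  # equation (12)


if __name__ == "__main__":
    for amplitude in (100.0, 1_000.0, 10_000.0):
        print(amplitude, "->", round(amplitude_to_sones(amplitude), 3))
```

With the default thresholds, equation (10) leaves the dB value unchanged, so an amplitude of 100 (30 dB after equation (9)) maps to a value of 1.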
Before the next time frame is processed (step 1336), the next state of n is determined according to equation (3) in step 1337.
The acoustic processor described above can be improved for applications in which the firing rate f and the neurotransmitter amount n have large DC pedestals. That is, where the dynamic range of the terms in the equations for f and n is important, the following equations are derived to reduce the pedestal height.
First, in the steady state and with no acoustic wave input (L = 0), equation (2) can be solved for the steady-state internal state n′:
$$n' = \frac{A_o}{S_o + S_h} \tag{13}$$
The internal state of the neurotransmitter amount n(t) can be expressed as a steady part n′ and a time-varying part n″:
$$n(t) = n' + n''(t) \tag{14}$$
Combining equations (1) and (14) gives the following expression for the firing rate:
$$f(t) = (S_o + DL)\,(n' + n''(t)) \tag{15}$$
In this equation the term So × n′ is a constant, while all other terms contain either the time-varying part of n or the input signal represented by DL.
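Before the pedestal is removed, the behavior of the basic model of equations (1) to (3) can be illustrated numerically. The sketch below is an illustration, not the patented implementation: the function name is hypothetical, the time step is taken as one centisecond per frame, and the constants are the preferred values quoted in this description, written as exact fractions.

```python
def simulate_firing(loudness_per_frame, so, sh, d, a0=1.0, dt=1.0):
    """Return the firing rate f for each frame of loudness input (in sones).

    Assumption: the neurotransmitter amount starts at its silent steady
    state a0 / (so + sh), and dt is one frame (one centisecond).
    """
    n = a0 / (so + sh)
    rates = []
    for loudness in loudness_per_frame:
        rates.append((so + d * loudness) * n)            # equation (1)
        dn_dt = a0 - (so + sh + d * loudness) * n        # equation (2)
        n += dn_dt * dt                                  # equation (3)
    return rates


if __name__ == "__main__":
    so, sh, d = 4.0 / 45.0, 1.0 / 9.0, 1.0 / 150.0      # about 0.0888, 0.1111, 0.00666
    signal = [0.0] * 5 + [20.0] * 40                     # silence, then a 20-sone tone
    rates = simulate_firing(signal, so, sh, d)
    print([round(r, 3) for r in rates[:10]], round(rates[-1], 3))
```

The output shows the adaptive behavior described above: the rate jumps when the tone starts and then decays toward a lower steady-state level, which is 1.5 times the silent rate, in agreement with R = 1.5.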
Because subsequent processing involves only the squared differences between output vectors, constant terms can be disregarded. Using equation (13) for n′ and removing the constant term from equation (15), the result, denoted f″(t), is
$$f''(t) = (S_o + DL)\,n''(t) + \frac{DL\,A_o}{S_o + S_h} \tag{16}$$
Taking equation (3) into account, the next state becomes
$$n(t + \Delta t) = n'(t + \Delta t) + n''(t + \Delta t) \tag{17}$$
$$= n'(t + \Delta t) + n''(t) + \bigl(A_o - (S_o + S_h + DL)\,n(t)\bigr)\,\Delta t \tag{18}$$
$$= n'(t + \Delta t) + n''(t) - (S_o + S_h)\,n'(t)\,\Delta t - DL\,n'(t)\,\Delta t + A_o\,\Delta t - (S_o + S_h + DL)\,n''(t)\,\Delta t \tag{19}$$
Disregarding all constant terms in equation (19) gives
$$n''(t + \Delta t) = n''(t)\,(1 - S_h\,\Delta t) - f''(t)\,\Delta t \tag{20}$$
Equations (15) and (20) now constitute the output equation and the state update equation that are applied to each filter during each 10 millisecond time frame. The result of applying these equations is a 20-element vector every 10 milliseconds, each element of which corresponds to the firing rate of a respective frequency band of the mel-scaled filter bank.
For the embodiment just described, the expressions for f, dn/dt, and n(t+1) are replaced by the special-case expressions for the firing rate f and the next state n(t+Δt), respectively.
Note that the values assigned to the terms in the various equations (namely To = 5 centiseconds, T_Lmax = 3 centiseconds, Ao = 1, R = 1.5, and Lmax = 20) may be set otherwise, in which case So, Sh, and D take values different from the preferred derived values 0.0888, 0.11111, and 0.00666.
This acoustic model has been implemented by the inventors in the PL/I programming language on Floating Point Systems FPS 190L hardware, but it may also be implemented with other software or hardware.
E-1C Detailed Match
FIG. 3 shows a sample detailed match phone machine 2000. Each detailed match phone machine is a probabilistic finite-state machine characterized by (a) a plurality of states Si, (b) a plurality of transitions tr(Sj|Si), some between different states and some from a state back to itself, each transition having an associated probability, and (c) for each label that can be generated at a particular transition, a corresponding actual label probability.
In FIG. 3, the detailed match phone machine 2000 is given seven states S₁ to S₇ and thirteen transitions tr1 to tr13. As FIG. 3 shows, the phone machine 2000 has three transitions drawn with dashed lines, tr11, tr12, and tr13. At each of these three transitions the phone can pass from one state to another without generating a label, and such a transition is therefore called a null transition.
Labels can, on the other hand, be generated along transitions tr1 to tr10. In particular, along each of the transitions tr1 to tr10, one or more labels may have a distinct probability of being generated. Preferably, at each transition there is a probability associated with each label that can be generated in the system.
That is, if there are 200 labels that can be selectively generated by the acoustic channel, each transition that is not a null transition has 200 "actual label probabilities", each of which is the probability that the corresponding label is generated by the phone at that particular transition. In FIG. 3, the actual label probabilities for transition tr1 are denoted P[i], where i is an integer from 1 to 200 identifying the label. For label 1, for example, there is a probability P[1] that the detailed match phone machine 2000 generates label 1 at transition tr1. The various actual label probabilities are stored in association with the label and the corresponding transition.
When a string of labels y₁y₂y₃... is presented to the detailed match phone machine 2000 corresponding to a given phone, a match procedure is performed. The procedure associated with the detailed match phone machine is explained with reference to FIG. 11.
FIG. 11 is a trellis diagram of the phone machine of FIG. 3. Like the phone machine of FIG. 3, the trellis diagram shows a null transition from state S₁ to state S₇ and transitions from state S₁ to state S₂ and from state S₁ to state S₄. Transitions between other states are also shown. The trellis diagram also has a time scale in the horizontal direction. In FIG. 11, the start-time probabilities q₀ and q₁ represent the probabilities that a phone has a start time at t = t₀ and at t = t₁, respectively. The various transitions are shown at each of the start times t₀ and t₁. Note that the interval between successive start times is preferably equal in length to the time interval of a label.
When the detailed match phone machine 2000 is used to determine how closely a given phone matches the labels of an input string, an end-time distribution for the phone is found and is used to determine a match value for that phone. The notion of relying on the end-time distribution is common to all embodiments of the phone machines disclosed here.
In forming the end-time distribution to perform a detailed match, the detailed match phone machine 2000 carries out computations that are exact and complicated.
Referring to the trellis diagram of FIG. 11, consider first the computations required for both the start time and the end time to be at t = t₀. For the example phone machine structure of FIG. 3, the following probability applies:
$$\Pr(S_7, t = t_0) = q_0\,T(1 \to 7) + \Pr(S_2, t = t_0)\,T(2 \to 7) + \Pr(S_3, t = t_0)\,T(3 \to 7) \tag{21}$$
where Pr denotes the probability of the state shown in parentheses and T is the transition probability between the two states shown in parentheses, in the direction of the arrow. Equation (21) gives the respective probabilities for the three conditions under which the end time can be t = t₀. It can also be seen that, in this example, an end time of t = t₀ can occur only at state S₇.
Turning to the end time t = t₁, a computation must be made for every state other than state S₁. State S₁ starts at the end time of the previous phone. For purposes of explanation, only the computation for state S₄ is shown. For state S₄ the computation is
$$\Pr(S_4, t = t_1) = \Pr(S_1, t = t_0)\,T(1 \to 4)\,\Pr(y_1 \mid 1 \to 4) + \Pr(S_4, t = t_0)\,T(4 \to 4)\,\Pr(y_1 \mid 4 \to 4) \tag{22}$$
In words, equation (22) states that the probability that the phone machine is in state S₄ at time t = t₁ is the sum of two terms: (a) the probability of being in state S₁ at t = t₀, multiplied by the probability of the transition from state S₁ to state S₄ and by the probability that the given label y₁ is generated given a transition from state S₁ to state S₄; and (b) the probability of being in state S₄ at t = t₀, multiplied by the probability of the transition from state S₄ to itself and by the probability that the given label y₁ is generated given a transition from state S₄ to itself.
Similarly, computations are made for the other states (excluding state S₁) to find the probability that the phone is in each particular state at time t = t₁. In general, to determine the probability of being in a subject state at a given time, the detailed match (a) recognizes each previous state that has a transition leading to the subject state, together with the probability of each such previous state; (b) recognizes, for each such previous state, a value representing the probability of the label that must be generated on the transition between that previous state and the current state in order to conform to the label string; and (c) combines the probability of each previous state with the value representing the label probability to give the probability of the subject state over the corresponding transition. The overall probability of being in the subject state is determined from the subject-state probabilities over all transitions that lead to it. Note that the computation for state S₇ includes terms relating to the three null transitions, which allow the phone to start and end at time t = t₁ with the phone ending in state S₇.
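The general procedure in steps (a) to (c) is a forward computation over the trellis. The sketch below applies it to a deliberately tiny, made-up phone machine, not the seven-state machine of FIG. 3. The function name, the tuple layout of a transition, and the requirement that null transitions always lead from a lower-numbered to a higher-numbered state are all assumptions made for this illustration.

```python
def end_time_distribution(n_states, transitions, start_dist, labels):
    """Return the probability of being in the final state at times 0 .. len(labels).

    transitions: list of (src, dst, prob, label_probs); label_probs is a dict
    mapping label to probability, or None for a null transition.
    start_dist[t]: probability that the phone starts (in state 0) at time t.
    """
    final = n_states - 1
    label_arcs = [tr for tr in transitions if tr[3] is not None]
    # Assumption: null arcs only go to higher-numbered states, so a single
    # pass in ascending source order propagates them completely.
    null_arcs = sorted((tr for tr in transitions if tr[3] is None), key=lambda tr: tr[0])

    alpha = [[0.0] * n_states for _ in range(len(labels) + 1)]
    for t in range(len(labels) + 1):
        if t < len(start_dist):
            alpha[t][0] += start_dist[t]
        if t > 0:
            y = labels[t - 1]  # label generated between time t-1 and time t
            for src, dst, prob, label_probs in label_arcs:
                alpha[t][dst] += alpha[t - 1][src] * prob * label_probs.get(y, 0.0)
        for src, dst, prob, _ in null_arcs:
            alpha[t][dst] += alpha[t][src] * prob
    return [alpha[t][final] for t in range(len(labels) + 1)]


if __name__ == "__main__":
    arcs = [
        (0, 1, 0.5, {"a": 1.0}),   # label-producing transition
        (0, 2, 0.5, None),         # null transition straight to the final state
        (1, 2, 1.0, {"b": 1.0}),   # label-producing transition
    ]
    print(end_time_distribution(3, arcs, start_dist=[1.0], labels=["a", "b"]))
```

The label-producing update has the same shape as equation (22), and the null-transition pass has the same shape as equation (21), where the end state is reached at the same time index as the start.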
As with the probability computations for times t = t₀ and t = t₁, probability computations for a series of other end times are preferably made to form an end-time distribution. The value of the end-time distribution for a given phone indicates how well the given phone matches the input labels.
To determine how well a word matches a string of input labels, the phones representing the word are processed in sequence. Each phone generates an end-time distribution of probability values. The match value for the phone is obtained by summing the end-time probabilities and taking the logarithm of that sum. The start-time distribution for the next phone is obtained by normalizing the end-time distribution, for example by dividing each value by the sum of the values so that the scaled values sum to one.
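The two operations just described, taking the logarithm of the summed end-time probabilities and normalizing the end-time distribution into the next start-time distribution, can be sketched as follows. The function names and the sample numbers are hypothetical.

```python
import math


def phone_match_value(end_time_dist):
    """Match value of a phone: logarithm of the sum of its end-time probabilities."""
    return math.log(sum(end_time_dist))


def next_start_distribution(end_time_dist):
    """Scale the end-time distribution so that its values sum to one."""
    total = sum(end_time_dist)
    return [p / total for p in end_time_dist]


if __name__ == "__main__":
    end_dist = [0.1, 0.3, 0.1]   # made-up end-time probabilities
    print(phone_match_value(end_dist))
    print(next_start_distribution(end_dist))
```

The normalized values then serve as the start-time distribution for the next phone of the word, and the match values of the successive phones are added to form the word score.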
Note that there are at least two methods of determining h, the number of phones to be examined for a given word or word string. In the depth-first method, computation proceeds along a baseform, with a running subtotal computed at each successive phone. The computation terminates when the subtotal is found to be below a predetermined threshold corresponding to a given phone position along the baseform.
Alternatively, in the breadth-first method, a computation is made for similar phone positions in each word. That is, the computation for the first phone of each word is followed by the computation for the second phone of each word, and so on. In the breadth-first method, the values computed along the same number of phones for the various words are compared at the same relative phone positions. In either method, the word or words with the largest sum of match values are the sought object.
The detailed match has been implemented in APAL (Array Processor Assembly Language), the native assembler of the Floating Point Systems, Inc. 190L. It should be recognized that the detailed match requires considerable memory for storing the actual label probabilities (that is, the probability that a given phone generates a given label y at a given transition), the transition probabilities for each phone machine, and the probabilities that a given phone is in a given state at a given time after a defined start time.
The FPS 190L mentioned above is set up to perform the various computations: end times; match values based on, for example, a sum (preferably the logarithm of the sum of the end-time probabilities); start times based on the previously generated end-time probabilities; and word match scores based on the match values for the sequential phones of a word. In addition, the detailed match preferably takes tail probabilities into account in the matching procedure. A tail probability measures the likelihood of successive labels without regard to words. In a simple embodiment, a given tail probability corresponds to the likelihood that one label follows another label. This likelihood is readily determined, for example, from the label strings generated from some sample speech.
The detailed match therefore requires sufficient storage to hold the baseforms, the statistics for the Markov models, and the tail probabilities. For example, for a vocabulary of 5000 words in which each word consists of about 10 phones, the baseforms require 5000 × 10 storage locations.
With 70 distinct phones (each with its own Markov model), 200 distinct labels, and 10 transitions at which any label has a probability of being generated, the statistics require 70 × 10 × 200 locations. Preferably, however, the phone machines are divided into three portions, a start portion, a middle portion, and an end portion, each with corresponding statistics (the three self-loops are preferably included in successive portions). The required storage is thereby reduced to 70 × 3 × 200. The tail probabilities require 200 × 200 storage locations. In this arrangement, 50K of integer storage and 82K of floating point storage perform satisfactorily.
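The storage figures above can be checked with a few multiplications. The constant names are hypothetical, and reading the 50K integer figure as the baseform storage and the 82K floating point figure as the reduced statistics plus the tail probabilities is an interpretation offered here, not a statement from the source.

```python
VOCABULARY_WORDS = 5000
PHONES_PER_WORD = 10
DISTINCT_PHONES = 70
DISTINCT_LABELS = 200
LABEL_TRANSITIONS = 10
PORTIONS_PER_PHONE = 3


def storage_locations():
    """Return the storage counts quoted in the description."""
    return {
        "baseforms": VOCABULARY_WORDS * PHONES_PER_WORD,
        "full_statistics": DISTINCT_PHONES * LABEL_TRANSITIONS * DISTINCT_LABELS,
        "reduced_statistics": DISTINCT_PHONES * PORTIONS_PER_PHONE * DISTINCT_LABELS,
        "tail_probabilities": DISTINCT_LABELS * DISTINCT_LABELS,
    }


if __name__ == "__main__":
    counts = storage_locations()
    for name, value in counts.items():
        print(name, value)
    print("reduced statistics + tail:",
          counts["reduced_statistics"] + counts["tail_probabilities"])
```

Dividing each phone machine into three portions cuts the statistics from 140,000 to 42,000 locations; together with the 40,000 tail probabilities this gives 82,000 values.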
Note that the detailed match may also be performed using phonemic phones rather than phonetic phones.
E-1D Basic Fast Match
Because the detailed match is computationally expensive, as noted above, a basic fast match and an alternative fast match are provided that reduce the required computation at some sacrifice in accuracy. The fast match is preferably used in combination with the detailed match: the fast match lists likely candidate words from the vocabulary, and the detailed match is performed on, at most, the candidate words on the list.
A fast approximate acoustic matching technique is described in the above-mentioned U.S. patent application Ser. No. 672,974 of the present applicant.
In the fast approximate acoustic match, each phone machine is preferably simplified by replacing the actual label probability of each label, at all transitions in a given phone machine, with a specific replacement value. The specific replacement value is preferably selected so that the match value for a given phone obtained with the replacement values is an overestimate of the match value that the detailed match would achieve with the actual label probabilities. One way of guaranteeing this condition is to select each replacement value so that no probability corresponding to a given label in a given phone machine is greater than its replacement value.
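One way to satisfy the condition above is to take, for each label, the maximum of its actual probabilities over all transitions of the phone machine. The sketch below shows that choice; the function name, the dictionary layout, and the sample probabilities are hypothetical.

```python
def replacement_values(label_probs_by_transition):
    """Return one replacement value per label for a single phone machine.

    label_probs_by_transition maps a transition name to a dict of
    label -> actual label probability at that transition.
    """
    replacement = {}
    for label_probs in label_probs_by_transition.values():
        for label, prob in label_probs.items():
            # Keep the largest probability seen for this label at any transition.
            replacement[label] = max(replacement.get(label, 0.0), prob)
    return replacement


if __name__ == "__main__":
    phone_machine = {
        "tr1": {"y1": 0.2, "y2": 0.8},
        "tr2": {"y1": 0.5, "y2": 0.5},
    }
    print(replacement_values(phone_machine))
```

Because each replacement value is at least as large as every probability it replaces, the replacement values for a phone machine need not sum to one; they are upper bounds, not a probability distribution.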
Replacing the actual label probabilities in a phone machine with the corresponding replacement values greatly reduces the amount of computation needed to determine the match score for a word. Moreover, because the replacement value is preferably an overestimate, the resulting match score is not less than the score that would have been determined without the replacement.
In a specific embodiment in which the acoustic match is performed in a linguistic decoder with Markov models, each phone machine is characterized, by training, as having the following (a) to (c):
(a) A plurality of states and transition paths between those states.
(b) Transitions tr(Sj|Si) with probabilities T(i→j), each of which represents the probability of a transition to state Sj given the current state Si, where Si and Sj may be the same state or different states.
(c) Actual label probabilities P(Yk|i→j), each of which represents the probability that label Yk is generated by the phone machine at a given transition from one state to a subsequent state, where k is a subscript identifying the label.
Each phone machine includes (d) means for assigning a single specific value P′(Yk) to each Yk in that phone machine and (e) means for replacing each actual output probability P(Yk|i→j), at each transition in the given phone machine, with the single specific value P′(Yk) assigned to the corresponding Yk. Preferably, the replacement value is at least as large as the largest actual label probability of the corresponding label Yk at any transition in the particular phone machine. The fast match is used to determine a list of roughly 10 to 100 candidate words selected as the words in the vocabulary most likely to correspond to the input labels.
The language model and the detailed match are preferably applied to these candidate words. By reducing the number of words considered by the detailed match in this way to roughly 1%, the computational cost is greatly reduced while accuracy is maintained.
The basic fast match simplifies the detailed match by replacing, with a single value, the actual label probabilities of a given label at all transitions at which that label can be generated in a given phone machine. That is, whatever the transition in a given phone machine at which a given label has a probability of occurring, that probability is replaced by a single specific value. This value is preferably an overestimate that is at least as large as the largest probability of the label occurring at any transition in the given phone machine.
By setting the replacement value of a label probability to the maximum of the actual label probabilities for the given label in the given phone machine, the match value generated by the basic fast match is assured to be at least as large as the match value that the detailed match would produce. The basic fast match thus typically overestimates the match value of each phone, so that more words are generally selected as candidates. Words that are considered candidates under the detailed match therefore also pass the criterion of the basic fast match.

Referring to FIG. 12, a phone machine 3000 for the basic fast match is shown. Labels (also called symbols or fenemes) are input to the basic fast match phone machine 3000 together with a start-time distribution. The start-time distribution and the label string input are similar to those input to the detailed match phone machine described above. It should be recognized that the start time is sometimes not a distribution over a plurality of times but instead represents a precise time at which the phone begins, for example the time following an interval of silence. When speech is continuous, however, the end-time distribution is used to define the start-time distribution (as described in more detail below). The phone machine 3000 generates an end-time distribution and, from that end-time distribution, a match value for the particular phone. The match score for a word is defined as the sum of the match values of its component phones (at least the first h phones of the word).

Referring to FIG. 13, the basic fast match computation is shown diagrammatically. The computation is concerned only with the start-time distribution, the number (or length) of labels produced by the phone, and the replacement value $p'_{y_k}$ associated with each label $y_k$. By replacing all actual label probabilities for a given label in a given phone machine with the corresponding replacement value, the basic fast match replaces transition probabilities with length distribution probabilities, and removes the need to include the actual label probabilities (which can differ for each transition in a given phone machine) and the probabilities of being in a given state at a given time.

In this regard, the length distribution is determined from the detailed match model. Specifically, for each length in the length distribution, the procedure examines each state individually and determines, for each state, the various transition paths by which the currently examined state can be reached (a) given a particular label length and (b) regardless of the outputs along the transitions. The probabilities of all transition paths of the particular length leading to each state are summed, and the sums for all states are then added to give the probability of the given length in the distribution. This procedure is repeated for each length. In the preferred form of the matching procedure, these computations are made with reference to a trellis diagram, as is known in the art of Markov modelling. That is, for transition paths that share a branch along the trellis structure, the computation for each common branch need be made only once, and the result is applied to each path that includes the common branch.
In FIG. 13, two limitations are included by way of example. First, the length of the label string produced by the phone is assumed to be 0, 1, 2, or 3, with respective probabilities $l_0$, $l_1$, $l_2$, and $l_3$. Second, the start time is also limited, so that only four start times are permitted, with respective probabilities $q_0$, $q_1$, $q_2$, and $q_3$. With these limitations, the following equations define the end-time distribution of the subject phone:

$$
\begin{aligned}
\Phi_0 &= q_0 l_0 \\
\Phi_1 &= q_1 l_0 + q_0 l_1 p_1 \\
\Phi_2 &= q_2 l_0 + q_1 l_1 p_2 + q_0 l_2 p_1 p_2 \\
\Phi_3 &= q_3 l_0 + q_2 l_1 p_3 + q_1 l_2 p_2 p_3 + q_0 l_3 p_1 p_2 p_3 \\
\Phi_4 &= q_3 l_1 p_4 + q_2 l_2 p_3 p_4 + q_1 l_3 p_2 p_3 p_4 \\
\Phi_5 &= q_3 l_2 p_4 p_5 + q_2 l_3 p_3 p_4 p_5 \\
\Phi_6 &= q_3 l_3 p_4 p_5 p_6
\end{aligned}
$$

Examining these equations, it can be seen that $\Phi_3$ contains a term corresponding to each of the four start times. The first term represents the probability that the phone starts at time $t = t_3$ and produces a label string of length zero (that is, the phone starts and ends at the same time). The second term represents the probability that the phone starts at time $t = t_2$, that the label length is one, and that label 3 is produced by the phone. The third term represents the probability that the phone starts at time $t = t_1$, that the label length is two (namely labels 2 and 3), and that labels 2 and 3 are produced by the phone. Similarly, the fourth term represents the probability that the phone starts at time $t = t_0$, that the label length is three, and that the three labels 1, 2, and 3 are produced by the phone.
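
The end-time equations of the basic fast match can be checked with a short Python sketch. This is an illustration under stated assumptions, not the APAL implementation: the function names, and the convention that `p[t]` holds the replacement value of the label at time t with `p[0]` unused, are choices made here for readability.

```python
import math


def end_time_distribution(q, l, p):
    """Return [Phi_0, Phi_1, ...] for one phone under the basic fast match.

    q[s]: probability that the phone starts at time t_s
    l[n]: probability that the phone produces exactly n labels
    p[t]: replacement value of the label observed at time t (p[0] is unused);
          p must have at least len(q) + len(l) - 1 entries
    """
    phi = [0.0] * (len(q) + len(l) - 1)
    for s, q_s in enumerate(q):
        label_product = 1.0
        for n, l_n in enumerate(l):
            if n > 0:
                # Extend the product by the label produced at time s + n.
                label_product *= p[s + n]
            # A phone starting at s and producing n labels ends at s + n.
            phi[s + n] += q_s * l_n * label_product
    return phi


def match_value(phi):
    """Base-10 logarithm of the summed end-time probabilities."""
    return math.log10(sum(phi))
```

With four start times and four label lengths, `end_time_distribution` returns seven values corresponding to the seven end-time probabilities defined for FIG. 13.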
Comparing the computation required by the basic fast match with that required by the detailed match shows that the former is relatively simple. In this regard, the $p'_{y_k}$ value remains the same for each appearance in all the equations, as do the label length probabilities. Moreover, because the length and the start time are limited, the computations for the later end times become simpler. For example, for $\Phi_6$ to apply, the phone must start at time $t = t_3$ and all three labels 4, 5, and 6 must be produced by the phone.

In generating the match value for a subject phone, the end-time probabilities along the defined end-time distribution are summed. If desired, the logarithm of the sum is taken, giving the expression:

$$\text{match value} = \log_{10}(\Phi_0 + \cdots + \Phi_6)$$

As noted previously, the match score for a word is readily obtained by summing the match values of the successive phones in the particular word.
The generation of the start-time distribution is now described with reference to FIG. 14. In FIG. 14a, the word THE1 is repeated, broken down into its component phones. In FIG. 14b, the label string is shown over time. In FIG. 14c, a first start-time distribution is shown. This start-time distribution is derived from the end-time distribution of the most recent previous phone (in the previous word, which may be a "word" of silence). Based on the label inputs and the start-time distribution of FIG. 14c, the end-time distribution $\Phi_{DH}$ of the phone DH is generated. The start-time distribution of the next phone, UH, is determined by recognizing the interval during which the end-time distribution of the previous phone exceeds a threshold (A) in FIG. 14d. The threshold (A) is determined individually for each end-time distribution. Preferably, (A) is a function of the sum of the end-time distribution values of the subject phone. The interval between times a and b thus represents the interval over which the start-time distribution of the phone UH is set (see FIG. 14e). The interval between times c and d in FIG. 14e corresponds to the interval during which the end-time distribution of the phone DH exceeds the threshold (A) and over which the start-time distribution of the next phone is set. The values of the start-time distribution are obtained by normalizing the end-time distribution, for example by dividing each end-time value by the sum of the end-time values that exceed the threshold (A).

The basic fast match phone machine 3000 has been implemented with an APAL program on a Floating Point Systems 190L. Other hardware and software may also be used to develop a specific form of the matching procedure by following the teachings above.
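
The derivation of the next phone's start-time distribution from the previous phone's end-time distribution can be sketched as follows. The function name and the handling of the case in which no value exceeds the threshold are assumptions of this sketch; the specification states only that the threshold (A) is preferably a function of the sum of the end-time values, so the threshold is passed in as a parameter here.

```python
def next_start_distribution(end_time_values, threshold):
    """Derive a start-time distribution from an end-time distribution.

    Values not exceeding the threshold are set to zero; the remaining
    values are divided by their sum so that they add up to one.
    """
    kept = [v if v > threshold else 0.0 for v in end_time_values]
    total = sum(kept)
    if total == 0.0:
        # Assumption of this sketch: return all zeros if nothing passes.
        return kept
    return [v / total for v in kept]
```

A caller might, for instance, pass a threshold equal to a fixed fraction of `sum(end_time_values)`; that particular fraction would be a tuning choice not given in the text.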

E-1E Alternative fast match

The basic fast match, used alone or in combination with the detailed match and/or a language model, greatly reduces the amount of computation required. To reduce the computation further, the technique described here simplifies the detailed match further by defining a uniform label length distribution between two lengths, a minimum length $L_{min}$ and a maximum length $L_{max}$. In the basic fast match, the probabilities of a phone producing a label string of a given length, namely $l_0$, $l_1$, $l_2$, and so on, typically have different values. In the alternative fast match, the probability for each label length is replaced by a single uniform value.

Preferably, the minimum length is equal to the smallest length having a non-zero probability in the original length distribution, although other lengths may be selected if desired. The selection of the maximum length is more arbitrary than that of the minimum length, but it is significant in that the probabilities of lengths less than the minimum and greater than the maximum are set to zero. By defining the length probability to exist only between the minimum length and the maximum length, a uniform pseudo-distribution is obtained. In one approach, the uniform probability can be set to the average probability over the pseudo-distribution. Alternatively, the uniform probability can be set to the maximum of the length probabilities that are replaced by the uniform value.

The effect of characterizing all label length probabilities as equal is readily seen by referring to the equations given above for the end-time distribution in the basic fast match. Specifically, the length probabilities can be factored out as a constant.
With $L_{min}$ set to zero and all length probabilities replaced by a single constant value $l$, the end-time distribution can be characterized as:

$$\theta_m = \frac{\Phi_m}{l} = q_m + \theta_{m-1}\, p_m$$

where $l$ is the single uniform replacement value and $p_m$ preferably corresponds to the replacement value for the label generated in the given phone at time $m$.

For the above equation for $\theta_m$, the match value is defined as:

$$\text{match value} = \log_{10}(\theta_0 + \theta_1 + \cdots + \theta_m) + \log_{10}(l)$$

Comparing the basic fast match with the alternative fast match, the number of additions and multiplications required is found to be greatly reduced by adopting the alternative fast match. With $L_{min} = 0$, the basic fast match was found to require 40 multiplications and 20 additions, because the length probabilities must be taken into account. With the alternative fast match, $\theta_m$ is determined recursively, and only one multiplication and one addition are required for each successive $\theta_m$.
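
The recursion for the alternative fast match with a minimum length of zero can be written as a short Python sketch. The function names, the argument `n_end` (number of end times to evaluate), and the convention that `p[0]` is unused are assumptions made for illustration.

```python
import math


def theta_sequence(q, p, n_end):
    """Compute theta_0 .. theta_{n_end-1} by the one-step recursion.

    q[m]: start-time probability at time m (taken as 0 beyond len(q))
    p[m]: replacement value of the label at time m (p[0] is unused)
    """
    theta = []
    previous = 0.0
    for m in range(n_end):
        q_m = q[m] if m < len(q) else 0.0
        # One multiplication and one addition per end time.
        current = q_m + previous * p[m]
        theta.append(current)
        previous = current
    return theta


def alternative_match_value(theta, uniform_length_prob):
    """log10 of the summed theta values plus log10 of the uniform value l."""
    return math.log10(sum(theta)) + math.log10(uniform_length_prob)
```

Once the start-time distribution is exhausted, `q_m` is zero and each further end time costs only the multiplication, which agrees with the operation counts stated for FIG. 15.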
FIGS. 15 and 16 are provided to illustrate further how the alternative fast match simplifies the computation. In FIG. 15a, a phone machine embodiment 3100 corresponding to a minimum length $L_{min} = 0$ is shown. The maximum length is assumed to be infinite so that the length distribution can be characterized as uniform. In FIG. 15b, the trellis diagram resulting from the phone machine 3100 is shown. Assuming that start times after $q_n$ lie outside the start-time distribution, each successive $\theta_m$ with $m < n$ requires one addition and one multiplication. For end times thereafter, only one multiplication and no addition is required.

FIG. 16 shows the case $L_{min} = 4$. FIG. 16a shows a specific embodiment of a phone machine 3200 for this case, and FIG. 16b shows the corresponding trellis diagram. Because $L_{min} = 4$, the trellis diagram of FIG. 16b has zero probability along the paths marked u, v, w, and z. For end times extending between $\theta_4$ and $\theta_n$, four multiplications and one addition are required. For end times greater than $n + 4$, one multiplication and no addition is required. This embodiment has been implemented in APAL code on an FPS 190L.

It should be noted that additional states may be added to the embodiments of FIG. 15 or FIG. 16 as desired.

E-1F Matching based on the first J labels

As a further refinement of the basic fast match and the alternative fast match, it is contemplated that only the first J labels of the string entering a phone machine be considered in the match. Assuming that the acoustic processor of the acoustic channel produces labels at a rate of one per 1/100 second, a reasonable value for J is 100. In other words, labels corresponding to about one second of speech are provided to determine the match between a phone and the labels entering the phone machine. Limiting the number of labels examined yields two advantages. First, the decoding delay is reduced. Second, problems in comparing the scores of short words with those of long words are substantially avoided. The length of J can, of course, be varied as desired.

The effect of limiting the number of labels examined can be seen by referring to the trellis diagram of FIG. 16b. Without this refinement, the fast match score is the sum of the probabilities $\theta_m$ along the bottom row of the diagram. That is, the probability of being in state $S_4$ at each time, starting at $t = t_0$ (for $L_{min} = 0$) or $t = t_4$ (for $L_{min} = 4$), is determined as a $\theta_m$, and all the $\theta_m$ are then summed. For $L_{min} = 4$, there is no probability of being in state $S_4$ at any time before $t_4$. With the refinement, the summing of the $\theta_m$ ends at time J. In FIG. 16b, time J corresponds to time $t_{n+2}$.

Terminating the examination at J labels over J time intervals results in the following two probability sums in determining a match score. First, as described above, there is a row calculation along the bottom row of the trellis diagram, but only up to time J-1. The probabilities of being in state $S_4$ at each time up to time J-1 are summed to form a row score. Second, there is a column score, which corresponds to the sum of the probabilities that the phone is in each of the states $S_0$ through $S_4$ at time J. That is, the column score is:

$$\text{column score} = \sum_{f=0}^{4} \Pr(S_f, J)$$

The match value for a phone is obtained by summing the row score and the column score and then taking the logarithm of that sum.
To continue the fast match for the next phone, the values along the bottom row, preferably including time J, are used to derive the start-time distribution of the next phone.

After a match value has been determined for each successive phone, the total for all the phones is, as noted before, the sum of the match values of all the phones.

Examining the manner in which the end-time probabilities are generated in the basic fast match and the alternative fast match described above, it is noted that the determination of column scores does not fit readily into the fast match computations. To adapt the refinement of limiting the number of labels examined to the fast match and the alternative fast match, the column score is replaced by an additional row score. That is, an additional row score is determined for the phone being in state $S_4$ (in FIG. 16b) between times J and J+K, where K is the maximum number of states in any phone machine. Hence, if a phone machine has 10 states, the refinement adds 10 end times along the bottom row of the trellis, for each of which a probability is computed. To obtain the match value of a given phone, all the probabilities along the bottom row up to and including the probability at time J+K are added. As before, the match values of successive phones are summed to obtain the word match score.

This embodiment has been implemented in APAL code on an FPS 190L, although it may also be implemented with other code on other hardware.

E-1G Phone tree structure and fast match

By using the basic fast match or the alternative fast match, with or without the limitation on the maximum number of labels, the computation time required to determine phone match values is greatly reduced. Moreover, the savings in computation time remain large even when the detailed match is performed on the words in the list obtained by the fast match.

Once the phone match values have been determined, they are compared along the branches of a tree structure 4100, as shown in FIG. 17, to determine which paths of phones are most probable. In FIG. 17, the sum of the match values of the phones DH and UH1 (leading from point 4102 to branch 4104) should be much higher for the spoken word "the" than the sums for the various phone sequences branching from the phone MX. In this regard, the match value of the first MX phone is computed only once and is then used for each baseform extending from it (see branches 4104 and 4106). In addition, when the total score computed along a first sequence of branches is found to be much lower than a threshold, or much lower than the total scores of other sequences of branches, all baseforms extending from the first sequence can be eliminated from the candidate words at the same time. For example, when MX is determined not to be a likely path, the baseforms associated with branches 4108 through 4118 are discarded simultaneously. By using the fast match and the tree structure, an ordered list of candidate words is obtained with very little computation.

With regard to storage requirements, the tree structure of phones, the statistics of the phones, and the tail probabilities must be stored. For the tree structure, there are 25,000 arcs and four data words characterizing each arc. The first data word is an index to the successor arcs or phones. The second data word indicates the number of successor phones along the branch. The third data word indicates the node of the tree at which the arc is located. The fourth data word indicates the current phone. Hence, the tree structure requires 25,000 × 4 storage locations. In the fast match, there are 100 distinct phones and 200 distinct fenemes. Because a feneme has a single probability of being produced anywhere in a phone, storage for 100 × 200 statistical probabilities is required. Finally, the tail probabilities require 200 × 200 storage locations. Storage for 100K integers and 60K floating-point values is sufficient for the fast match.
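
The sharing of phone match values along the tree and the simultaneous elimination of baseforms can be sketched as follows. The nested-dictionary tree, the callback `arc_score`, the fixed `threshold`, and the phones and words other than DH, UH1, MX, and "the" are hypothetical; the specification stores the tree as arcs of four data words and does not prescribe this layout.

```python
def candidate_words(tree, arc_score, threshold):
    """Walk a phone tree and return (word, score) pairs, best first.

    tree: {"children": {phone: subtree}, "word": optional word ending here}
    arc_score: function taking the tuple of phones from the root to an arc
               and returning that arc's (logarithmic) phone match value
    threshold: a branch whose running total falls below it is abandoned
    """
    results = []

    def walk(node, path, total):
        for phone, child in node["children"].items():
            new_path = path + (phone,)
            # Each arc is scored once and shared by every baseform below it.
            score = total + arc_score(new_path)
            if score < threshold:
                continue  # discards all baseforms extending from this arc
            if "word" in child:
                results.append((child["word"], score))
            walk(child, new_path, score)

    walk(tree, (), 0.0)
    return sorted(results, key=lambda item: item[1], reverse=True)
```

In this sketch a poor score on a shared arc such as MX prevents any of the arcs below it from being evaluated at all.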

E-1H Language model

As noted previously, a language model that stores information about words in context, such as tri-grams, can be included to increase the probability of a correct word selection.

The language model 1010 (FIG. 1) has a unique character. Specifically, a modified tri-gram method is used. In this method, a sample text is examined to determine the likelihood of each single word, each ordered pair of words, and each ordered triplet of words in the vocabulary. A list of the most likely triplets of words and a list of the most likely pairs of words are formed. In addition, the likelihood of a triplet not being in the triplet list and the likelihood of a pair not being in the pair list are respectively determined.

According to this language model, when a subject word follows two words, a determination is made as to whether the subject word and the two preceding words are in the triplet list. If so, the stored probability assigned to that triplet is used. If the subject word and its two predecessors are not in the triplet list, a determination is made as to whether the subject word and its adjacent predecessor are in the pair list. If so, the probability of the pair is multiplied by the probability of a triplet not being in the triplet list, and the product is assigned to the subject word. If the subject word and its predecessors are in neither the triplet list nor the pair list, the probability of the subject word alone is multiplied by the probability of a triplet not being in the triplet list and by the probability of a pair not being in the pair list. This product is then assigned to the subject word.
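
The three-way decision of the modified tri-gram method can be sketched in Python. The dictionary-based lists, the parameter names, and the example probabilities are assumptions made for illustration; the specification does not give a data layout or numerical values.

```python
def word_probability(w1, w2, w3, triplets, pairs, singles,
                     p_triplet_missing, p_pair_missing):
    """Probability assigned to subject word w3 following words w1 and w2.

    triplets: {(w1, w2, w3): probability} for the most likely triplets
    pairs:    {(w2, w3): probability} for the most likely pairs
    singles:  {w3: probability} for single words
    p_triplet_missing: probability that a triplet is not in the triplet list
    p_pair_missing:    probability that a pair is not in the pair list
    """
    if (w1, w2, w3) in triplets:
        return triplets[(w1, w2, w3)]
    if (w2, w3) in pairs:
        return p_triplet_missing * pairs[(w2, w3)]
    return p_triplet_missing * p_pair_missing * singles[w3]
```

The sketch assumes every vocabulary word has an entry in `singles`; handling of unseen words is outside what the text describes.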

Referring to FIG. 18, a flowchart illustrating the training of the phone machines used in acoustic matching is shown. In step 5002, a vocabulary of words (typically about 5000 words) is defined. Each word is then represented by a sequence of phone machines (step 5004). The phone machines are shown, by way of example, as phonetic phone machines, but they may instead be fenemic phone machines. Representing words by sequences of phonetic phone machines or of fenemic phone machines is described below.

In step 5006, the word baseforms are arranged in the tree structure. The statistics of each phone machine in each word baseform are determined by training based on the well-known forward-backward algorithm described in the paper "Continuous Speech Recognition by Statistical Methods" by F. Jelinek.

In step 5009, the values to be substituted for the actual parameter values or statistics are determined. For example, the values to be substituted for the actual label output probabilities are determined. In step 5010, the determined values replace the stored actual probabilities, so that the phones in each word baseform contain the approximate substitute values. All approximations relating to the basic fast match are performed in step 5010.

A determination is then made as to whether the acoustic matching is to be enhanced (step 5011). If not, the values determined for the basic approximate match are set for use, and the estimates relating to the other approximations are not set (step 5012). If an enhanced approximate match is desired, step 5018 is executed. A uniform string length distribution is defined (step 5018), and a determination is made as to whether further enhancement is desired (step 5020). If further enhancement of the approximate match is desired, the acoustic matching is limited to the first J labels of the generated string (step 5022).
Whether or not one of the enhanced approximate matches is selected, the parameter values determined are set in step 5012, at which point each phone machine in each word baseform has been trained with the desired approximations.

E-1J Stack decoder

A preferred stack decoder for use in the speech recognition system of FIG. 1 was invented by L. Bahl, F. Jelinek, and R. L. Mercer of the applicant's speech recognition group. This preferred stack decoder is described below.

FIGS. 19 and 20 show a plurality of labels $y_1 y_2 \ldots$ generated successively at successive label intervals, or label positions.

FIG. 20 also shows several generated word paths, namely path A, path B, and path C. In the context of FIG. 19, path A corresponds to the entry "to be or", path B to the entry "two b", and path C to the entry "too". For a subject word path, there is a label (or, equivalently, a label interval) at which the word path has the highest probability of having ended. Such a label is called a "boundary label".

For a word path W representing a sequence of words, the most likely end time, represented in the label string as the "boundary label" between two words, can be found by known methods such as that described in the article "Faster Acoustic Match Computation" by L. R. Bahl et al., IBM Technical Disclosure Bulletin, volume 23, number 4, September 1980. Briefly, the article discusses methods for addressing two similar concerns: (a) how much of a label string Y is accounted for by a word (or word sequence), and (b) at which label interval a partial sentence, corresponding to a part of the label string, ends.

For any given word path, there is a "likelihood value" associated with each label or label interval, from the first label of the label string through the boundary label. Taken together, all the likelihood values of a given word path represent the "likelihood vector" of that word path. Accordingly, for each word path there is a corresponding likelihood vector. Likelihood values $L_t$ are shown in the figure.

The "likelihood envelope" $\Lambda_t$ at label interval $t$ for a collection of word paths $W^1, W^2, \ldots, W^S$ is defined mathematically as:

$$\Lambda_t = \max\bigl(L_t(W^1), \ldots, L_t(W^S)\bigr)$$

That is, for each label interval, the likelihood envelope contains the highest likelihood value associated with any word path in the collection. FIG. 20 shows a likelihood envelope 1040.

A word path is considered "complete" if it corresponds to a complete sentence. A complete path is preferably identified by an input from the speaker, for example the pressing of a button when the end of a sentence is reached. This input is synchronized with a label interval to mark the end of the sentence. A complete word path cannot be extended by appending words.
A "partial" word path corresponds to an incomplete sentence and can be extended.

Partial paths are classified as "live" or "dead". A word path is "dead" if it has already been extended and "live" if it has not yet been extended. With this classification, a path that has already been extended to form one or more longer word paths is never reconsidered for extension at a later time.

Each word path can also be characterized as "good" or "bad" relative to the likelihood envelope. A word path is "good" if, at the label corresponding to its boundary label, it has a likelihood value within Δ of the maximum likelihood envelope. Otherwise, the word path is "bad". Preferably, but not necessarily, Δ is a fixed value by which each value of the maximum likelihood envelope is reduced to serve as the good/bad threshold level.
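
The likelihood envelope and the good/bad test can be illustrated with a minimal Python sketch. Representing a likelihood vector as a list whose last element is the value at the boundary label is an assumption of this sketch, as are the function names.

```python
import math


def likelihood_envelope(likelihood_vectors):
    """Element-wise maximum over the likelihood vectors of several word paths.

    Vectors may have different lengths, since each one runs only from the
    first label to that path's boundary label.
    """
    n_intervals = max(len(vector) for vector in likelihood_vectors)
    envelope = [-math.inf] * n_intervals
    for vector in likelihood_vectors:
        for t, value in enumerate(vector):
            envelope[t] = max(envelope[t], value)
    return envelope


def is_good(vector, envelope, delta):
    """True if the path's likelihood at its boundary label is within delta
    of the envelope at that label interval."""
    boundary = len(vector) - 1
    return vector[boundary] >= envelope[boundary] - delta
```

With logarithmic likelihoods, `delta` plays the role of Δ in the text.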

There is one stack element for each label interval. Each live word path is assigned to the stack element corresponding to the label interval of its boundary label. A stack element can have zero, one, or more word path entries, and the entries are listed in order of their likelihood values.

The steps executed by the stack decoder 1002 of FIG. 1 are now described.

Forming the likelihood envelope and determining which word paths are "good" are interrelated, as the sample flowchart of FIG. 22 indicates.

In the flowchart of FIG. 22, a null path is first entered into stack 0 in step 5050. A stack (complete) element is provided that contains any complete paths that have been previously determined (step 5052). Each complete path in the stack (complete) element has an associated likelihood vector. The likelihood vector of the complete path having the highest likelihood at its boundary label initially defines the maximum likelihood envelope. If there is no complete path in the stack (complete) element, the maximum likelihood envelope is initialized to −∞ at each label interval. Likewise, if complete paths are not identified, the maximum likelihood envelope can be initialized to −∞. Initializing the envelope is shown in steps 5054 and 5056.

After the maximum likelihood envelope is initialized, it is lowered by the predetermined amount Δ, forming a Δ-good region above the lowered likelihood values and a Δ-bad region below them.
The larger the value of Δ, the more word paths are considered for possible extension. When $\log_{10}$ is used to determine $L_t$, a value of 2.0 for Δ gives satisfactory results. The value of Δ is preferably, but not necessarily, uniform along the length of the label intervals.

If a word path has a likelihood at its boundary label that lies in the Δ-good region, the word path is marked "good". Otherwise, the word path is marked "bad".

As shown in FIG. 22, the loop for updating the likelihood envelope and marking word paths as "good" (extendable) or "bad" begins by finding the longest unmarked word path (step 5058). If more than one unmarked word path is in the stack corresponding to the longest word path length, the word path having the highest likelihood at its boundary label is selected. If a word path is found, it is marked "good" if the likelihood at its boundary label lies within the Δ-good region, and "bad" otherwise (step 5060). If the word path is marked "bad", another unmarked live path is found and marked (step 5062). If the word path is marked "good", the likelihood envelope is updated to include the likelihood values of the path marked "good". That is, for each label interval, the updated likelihood value is determined as the larger of (a) the current likelihood value in the likelihood envelope and (b) the likelihood value associated with the word path marked "good". This is shown in steps 5064 and 5066. After the envelope is updated, the longest best unmarked live word path is again found (step 5058).

The loop is repeated until no unmarked word paths remain. At that point, the shortest word path marked "good" is selected. If more than one word path marked "good" has the shortest length, the one having the highest likelihood at its boundary label is selected (step 5070). The selected shortest path is then extended. That is, at least one likely follower word is determined, as described above, preferably by executing the fast match, language model, detailed match, and language model procedure. For each likely follower word, an extended word path is formed. Specifically, an extended word path is formed by appending a likely follower word to the end of the selected shortest word path.

After the selected shortest word path has been formed into extended word paths, the selected word path is removed from the stack in which it was an entry, and each extended word path is entered into the appropriate stack. Specifically, an extended word path becomes an entry in the stack corresponding to the boundary label of that extended word path (step 5072).
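
One iteration of the marking loop and the selection of the path to extend can be sketched as below. This is a simplified reading of the flowchart under stated assumptions: each path is a dictionary with hypothetical keys `"words"` and `"likelihoods"`, path length is taken to be the position of the boundary label, and the stacks are replaced by a single sorted pass over the live paths.

```python
import math


def select_path_to_extend(live_paths, complete_paths, n_intervals, delta):
    """Mark live paths good or bad and return the shortest good one (or None).

    path["likelihoods"][t] is the likelihood at label interval t; the last
    entry is the likelihood at the path's boundary label.
    """
    envelope = [-math.inf] * n_intervals
    if complete_paths:
        # The best complete path initially defines the envelope.
        best = max(complete_paths, key=lambda p: p["likelihoods"][-1])
        for t, value in enumerate(best["likelihoods"]):
            envelope[t] = value

    good = []
    # Longest path first; ties broken by higher boundary likelihood.
    ordered = sorted(live_paths,
                     key=lambda p: (len(p["likelihoods"]), p["likelihoods"][-1]),
                     reverse=True)
    for path in ordered:
        boundary = len(path["likelihoods"]) - 1
        if path["likelihoods"][boundary] >= envelope[boundary] - delta:
            good.append(path)
            # Raise the envelope to include this good path's likelihoods.
            for t, value in enumerate(path["likelihoods"]):
                envelope[t] = max(envelope[t], value)

    if not good:
        return None
    # Shortest good path; ties broken by higher boundary likelihood.
    return min(good, key=lambda p: (len(p["likelihoods"]), -p["likelihoods"][-1]))
```

The returned path would then be extended with likely follower words, removed from the live set, and the extended paths added before the next iteration.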

The action of extending the selected path in step 5072 is described with reference to FIG. 21. After the path is found in step 5070, the following procedure is carried out, by which one or more word paths are extended on the basis of an appropriate approximate match.

In step 6000 (FIG. 21), the acoustic processor 1002 (FIG. 1) generates a label string as described above. The label string is provided as input to enable step 6002 to be executed. In step 6002, the basic approximate match or one of the enhanced approximate matching procedures is executed to obtain an ordered list of candidate words according to the teachings given earlier. A language model (as described above) is then applied in step 6004. The words remaining after the language model is applied are entered, together with the generated labels, into a detailed match processor, which executes step 6006. The detailed match yields a list of remaining candidate words, to which the language model is preferably applied in step 6008. The likely words determined in this way by the approximate match, the detailed match, and the language model are used to extend the path found in step 5070 of FIG. 22. Each of the likely words determined in step 6008 is separately appended to the found word path, so that a plurality of extended word paths can be formed.

Referring again to FIG. 22, after the extended paths are formed and the stacks are re-formed, the process repeats by returning to step 5052.
Each iteration thus consists of selecting the shortest "good" word path and extending it. A word path marked "bad" in one iteration may become "good" in a later iteration. The characterization of a live word path as "good" or "bad" is thus made independently in each iteration. In practice, the likelihood envelope does not change greatly from one iteration to the next, and the computation to determine whether a word path is "good" or "bad" is done efficiently. Moreover, normalization is not required.

When complete sentences are identified, step 5074 is preferably included. That is, when no unmarked live word paths remain and there are no "good" word paths to be extended, decoding ends. The complete word path having the highest likelihood at its boundary label is identified as the most likely word sequence for the input label string.

In the case of continuous speech in which sentence ends are not identified, path extension continues indefinitely or for a predetermined number of words, as preferred by the system user.
Continue. E-1K Construction of basic phonetic formats can be used in forming the basic form

One type of Markov model sound machine that can be used in forming baseforms is based on phonetics. That is, each sound machine corresponds to a given phonetic sound, such as a sound of the International Phonetic Alphabet.

For a given word there is a sequence of phonetic sounds, each corresponding to its own sound machine. Each sound machine has a number of states and transitions between those states. Some of the transitions can produce a phoneme output, while others (called null transitions) cannot. As noted earlier, the statistics associated with each sound machine include (a) the probability that a given transition occurs and (b) the likelihood that a particular phoneme is produced at a given transition. Preferably, at each non-null transition some probability is associated with each phoneme. The phoneme alphabet shown in Table 1 at the end of the detailed description preferably contains 200 phonemes. A sound machine used in forming phonetic baseforms is shown in FIG. 3. A sequence of such sound machines is provided for each word. The statistics, or probabilities, are entered into the sound machines during a training period in which known words are uttered. The transition probabilities and phoneme probabilities of the various phonetic sound machines are determined during training by noting the phoneme strings generated when a known phonetic sound is uttered at least once and by applying the well-known forward-backward algorithm.

An example of the statistics for one phone, identified as phone DH, is shown in Table 2 at the end of the detailed description. As an approximation, the label output probability distributions of transitions tr1, tr2 and tr8 of the sound machine of FIG. 3 are represented by a single distribution; transitions tr3, tr4, tr5 and tr9 are likewise represented by a single distribution, as are transitions tr6, tr7 and tr10. This is indicated in Table 2 by the assignment of the arcs (i.e., transitions) to columns 4, 5 or 6. Table 2 shows the probability of each transition and the probability that a label (i.e., phoneme) is produced at the beginning, middle or end of phone DH. For the DH phone, for example, the probability of the transition from state S₁ to state S₂ is computed as 0.07243. The probability of the transition from state S₁ to state S₄ is 0.92757 (since these are the only two transitions possible from the initial state, their probabilities sum to 1). As to label output probabilities, the DH phone has a probability of 0.091 of producing phoneme AE13 (see Table 1) at the end portion of the phone, i.e., in column 6 of Table 2. Table 2 also shows a count associated with each node (state). The node count indicates the number of times during training that the phone was in the corresponding state. Statistics like those of Table 2 are found for each sound machine.
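
The tying described above, in which several arcs of the FIG. 3 sound machine share one label output distribution, can be pictured with a small lookup sketch. The two transition probabilities and the AE13 output probability are the values quoted in the text. The dictionary layout, the names `TIED_GROUP`, `DH_FROM_S1`, `DH_OUTPUT`, `output_prob` and `is_normalized`, and the pairing of the three arc groups with "begin", "middle" and "end" in the order the text lists them are assumptions made for illustration; the rest of Table 2 is not reproduced.

```python
import math
from typing import Dict, Optional

# Arcs that share one output distribution (assumed to map, in the order
# listed in the text, to the beginning, middle and end of the phone).
TIED_GROUP: Dict[str, str] = {
    "tr1": "begin", "tr2": "begin", "tr8": "begin",
    "tr3": "middle", "tr4": "middle", "tr5": "middle", "tr9": "middle",
    "tr6": "end", "tr7": "end", "tr10": "end",
}

# Transition probabilities out of the initial state S1 of phone DH (from the text).
DH_FROM_S1: Dict[str, float] = {"S2": 0.07243, "S4": 0.92757}

# Only the single output probability quoted in the text is filled in.
DH_OUTPUT: Dict[str, Dict[str, float]] = {"end": {"AE13": 0.091}}


def output_prob(arc: str, label: str) -> Optional[float]:
    """Tied output probability of `label` on `arc`, or None if not listed here."""
    return DH_OUTPUT.get(TIED_GROUP[arc], {}).get(label)


def is_normalized(probs: Dict[str, float]) -> bool:
    """Outgoing transition probabilities of one state should sum to 1."""
    return math.isclose(sum(probs.values()), 1.0, abs_tol=1e-9)
```

Because the lookup goes through the tied group, arcs tr6, tr7 and tr10 necessarily return the same value for a given label, which is the point of the approximation.
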

The arranging of phonetic sound machines into a word baseform is typically performed by a phonetician and is normally not done automatically.

E-1L Construction of phonemic baseforms

FIG. 23 shows an example of a phonemic sound machine. This phonemic sound machine has two states and three transitions. The null transition, drawn as a dashed line in the figure, represents a path from state 1 to state 2 along which no label is produced. The self-loop transition at state 1 allows any number of labels to be produced there. The non-null transition between state 1 and state 2 permits a label to be produced. The probabilities associated with each transition, and with each label at a transition, are determined during the training phase in a manner similar to that described for phonetic baseforms.

Phonemic word baseforms are constructed by concatenating phonemic sound machines. One method of doing so is described in patent application No. 697,174, filed by the applicant on February 1, 1985. Preferably, the phonemic baseform of a word is grown from multiple utterances of the word. This is described in patent application No. 738,933, filed by the applicant on May 29, 1985.
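
The two-state phonemic sound machine of FIG. 23 is small enough that the probability of a whole label string can be written out directly: either every label comes from the self-loop and the machine leaves by the null transition, or the last label comes from the non-null transition. The sketch below does this under assumed parameter names (`p_self`, `p_forward`, `p_null`, `out_self`, `out_forward`); it is an illustration of the structure, not the patent's trained model.

```python
from typing import Dict, Sequence


def string_probability(
    labels: Sequence[str],
    p_self: float,
    p_forward: float,
    p_null: float,
    out_self: Dict[str, float],
    out_forward: Dict[str, float],
) -> float:
    """Probability that the two-state machine produces exactly `labels`.

    The machine starts in state 1. The self-loop emits one label and stays
    in state 1; the non-null arc emits one label and moves to state 2; the
    null arc moves to state 2 without emitting a label.
    """
    assert abs(p_self + p_forward + p_null - 1.0) < 1e-9
    # Path 1: every label is emitted on the self-loop, then the null arc is taken.
    via_null = p_null
    for lab in labels:
        via_null *= p_self * out_self.get(lab, 0.0)
    if not labels:
        return via_null
    # Path 2: all but the last label come from the self-loop and the last
    # label is emitted on the non-null arc from state 1 to state 2.
    via_forward = p_forward * out_forward.get(labels[-1], 0.0)
    for lab in labels[:-1]:
        via_forward *= p_self * out_self.get(lab, 0.0)
    return via_null + via_forward
```

With the invented values `p_self=0.2`, `p_forward=0.7`, `p_null=0.1` and a one-label alphabet, the empty string has probability 0.1 and the one-label string has probability 0.2 × 0.1 + 0.7 = 0.72; summed over all string lengths the probabilities approach 1.
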
Briefly, one method of growing a baseform from multiple utterances is as follows.
(a) Convert the multiple utterances of a word segment into respective phoneme strings.
(b) Define a set of phonemic Markov model sound machines.
(c) Determine the best single sound machine P₁ for producing the multiple phoneme strings.
(d) Determine the best two-sound baseform, of the form P₁P₂ or P₂P₁, for producing the multiple phoneme strings.
(e) Align the best two-sound baseform against each phoneme string.
(f) Split each phoneme string into a left portion and a right portion, the left portion corresponding to the first sound machine of the two-sound baseform and the right portion corresponding to the second sound machine of the two-sound baseform.
(g) Identify each left portion as a left substring and each right portion as a right substring.
(h) Process the set of left substrings in the same manner as the set of phoneme strings corresponding to the multiple utterances. This processing includes the further step of inhibiting further splitting of a substring when its single-sound baseform has a higher probability than its best two-sound baseform.
(j) Process the set of right substrings in the same manner as the set of phoneme strings corresponding to the multiple utterances. This processing likewise includes the step of inhibiting further splitting of a substring when its single-sound baseform has a higher probability than its best two-sound baseform.
(k) Concatenate the unsplit single sounds in an order corresponding to the order of the phoneme substrings to which they correspond.

The number of these model elements is typically approximately equal to the number of phonemes obtained for an utterance of the word.
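
Steps (c) through (k) above amount to a recursive divide-and-conquer procedure, which the following sketch illustrates. It is only a sketch under stated assumptions: the real procedure scores candidates with Markov model likelihoods, whereas here a toy log-likelihood (`seg_score`, with an assumed constant `Q` for the chance of emitting one more label) stands in for them. The names `grow_baseform`, `best_split`, `seg_score` and `Q` are hypothetical, and every machine is assumed to give a nonzero probability to every label.

```python
import math
from typing import Dict, List, Sequence, Tuple

Q = 0.5  # assumed probability that a machine emits one more label (toy model)


def seg_score(out: Dict[int, float], seg: Sequence[int]) -> float:
    """Toy log-likelihood that one machine produces the label segment `seg`."""
    return math.log(1 - Q) + sum(math.log(Q * out[y]) for y in seg)


def best_split(s: Sequence[int], out1: Dict[int, float],
               out2: Dict[int, float]) -> Tuple[float, int]:
    """Best (score, cut) when machine 1 produces s[:cut] and machine 2 s[cut:]."""
    return max((seg_score(out1, s[:k]) + seg_score(out2, s[k:]), k)
               for k in range(len(s) + 1))


def grow_baseform(strings: List[Sequence[int]],
                  machines: Dict[str, Dict[int, float]]) -> List[str]:
    def single_total(m: str) -> float:
        return sum(seg_score(machines[m], s) for s in strings)

    def pair_total(pair: Tuple[str, str]) -> float:
        return sum(best_split(s, machines[pair[0]], machines[pair[1]])[0]
                   for s in strings)

    # (c) best single machine P1 for the whole set of strings
    p1 = max(machines, key=single_total)
    # (d) best two-machine baseform of the form P1 P2 or P2 P1
    pairs = [pair for p2 in machines for pair in ((p1, p2), (p2, p1))]
    first, second = max(pairs, key=pair_total)
    # (h)/(j) stop splitting when one machine does at least as well as two
    if single_total(p1) >= pair_total((first, second)):
        return [p1]
    # (e)/(f)/(g) align the pair with each string and cut it into left and right
    cuts = [best_split(s, machines[first], machines[second])[1] for s in strings]
    left = [s[:k] for s, k in zip(strings, cuts)]
    right = [s[k:] for s, k in zip(strings, cuts)]
    # (h)/(j)/(k) treat both substring sets the same way and concatenate in order
    return grow_baseform(left, machines) + grow_baseform(right, machines)
```

In this toy model a second machine costs an extra factor of `1 - Q` per string, so a split is accepted only when the two machines fit the labels clearly better than one; that plays the role of the stopping rule in steps (h) and (j).
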

The baseform models are then trained (that is, filled with statistics) by speaking known utterances into the speech processor, which generates a string of labels in response. From the known utterances and the labels generated, the statistics for the word models are derived according to the well-known forward-backward algorithm.

FIG. 24 shows a lattice corresponding to phonemic sounds. This lattice is considerably simpler than the lattice of FIG. 11, which relates to the detailed match with phonetic sound machines.

E-2 Selection of likely words from the vocabulary by polling

Referring to FIG. 25, a flowchart of one embodiment of the present invention is illustrated. In this flowchart, a vocabulary of words is first determined in step 8000. Depending on the user, the words correspond to a standard business-correspondence vocabulary or to a technical vocabulary. Here the vocabulary contains 5000 or more words, although the number of words may vary.

Each word is represented by a sequence of Markov model sound machines, as taught in section E-1K or E-1L above. That is, each word can be represented either as a baseform constructed of successive phonetic sound machines or as a baseform constructed of successive phonemic sound machines.

Next, in step 8006, a "vote" is computed for each label of each word. The vote-computing step 8006 is described with reference to FIGS. 25, 26, 27, 28 and 29.

FIG. 26 is a graph showing the distribution of acoustic labels for a given sound machine P_p. The counts shown in the figure are taken from the statistics generated during training. Recall that, during training, known utterances corresponding to known phone sequences are spoken and a label string is generated in response. The number of times each label is produced when a known phone is uttered is therefore obtained during training. A distribution like that of FIG. 26 is generated for each phone.

From the training data one can obtain not only the information contained in FIG. 26 but also the expected number of labels for a given phone. That is, each time a known utterance corresponding to a given phone is spoken, the number of labels generated is recorded. From this information the most likely, or expected, number of labels for the given phone is determined. FIG. 27 is a graph showing the expected number of labels for each phone. If the phones are phonemic phones, the expected number of labels for each phone should typically average about one. For phonetic phones, the number of labels can vary widely.

The extraction of this information from the training data is accomplished with the forward-backward algorithm, which is described in detail in the above-mentioned paper entitled "Continuous Speech Recognition by Statistical Methods."

Briefly, the forward-backward algorithm determines the probability of each transition between a state i and a state (i+1) in a phone by
(a) looking forward from the initial state of the Markov model to state i and determining the statistics of reaching state i along a "forward path", and
(b) looking backward from the final state of the Markov model to state (i+1) and determining the statistics of reaching the final state from state (i+1) along a "backward path".
The probability of the transition from state i to state (i+1), together with the label output at that transition, is combined with these other statistics in determining the probability that a particular transition occurs given a particular label string.

As shown in FIG. 28 for word 1 and word 2, each word is known to be a predetermined sequence of phones. Given the phone sequence of each word and the information discussed in connection with FIGS. 25 and 26, it is possible to determine how many times a given label is most likely to occur for a particular word W. For a word such as word 1, the number of times label 1 is expected is obtained by adding the count of label 1 for phone P₁, the count of label 1 for phone P₃, the count of label 1 for phone P₆, and so on. Similarly, for word 1, the number of times label 2 is expected is obtained by adding the count of label 2 for phone P₁, the count of label 2 for phone P₃, and so on. The expected count of each label for word 1 is computed by performing these steps for each of the 200 labels.

The count for a particular label is either the number of occurrences of that label divided by the total number of labels generated during training, or simply the number of occurrences of that label during training. FIG. 29 shows the expected count of each label for a particular word (for example, word 1).

For a given word, the "vote" of each label for that word is computed from the expected label counts shown in FIG. 29. The vote of label L′ for a given word W′ represents the likelihood that word W′ produces label L′. The vote corresponds to the logarithm of the probability that word W′ produces L′. Preferably, the vote is given by

vote = log₁₀{Pr(L′|W′)}

The vote values are stored in a table as shown in FIG. 30. For each of words 1 through W, each label has a vote denoted by V with a double subscript. The first subscript corresponds to the label and the second to the word. Thus, for example, V₁₂ is the vote of label 1 for word 2.
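
The two steps just described, adding up per-phone label counts along a word's baseform and turning the result into log₁₀ votes, can be sketched as follows. The function and parameter names are hypothetical. Two points are assumptions not stated in the text: Pr(L′|W′) is estimated by dividing each expected count by the word's total count, and a small `floor` replaces zero probabilities so that the logarithm is defined.

```python
import math
from collections import Counter
from typing import Dict, List


def expected_counts(baseform: List[str],
                    phone_counts: Dict[str, Dict[int, float]]) -> Counter:
    """Add up, label by label, the training counts of every phone in the word."""
    total: Counter = Counter()
    for phone in baseform:
        total.update(phone_counts[phone])  # Counter.update adds the counts
    return total


def votes(counts: Counter, alphabet: List[int],
          floor: float = 1e-6) -> Dict[int, float]:
    """vote(L, W) = log10 Pr(L | W), with Pr estimated from the expected counts."""
    n = sum(counts.values())
    return {lab: math.log10(max(counts[lab] / n, floor)) for lab in alphabet}
```

For a word made of phones P1, P3 and P6 with label-1 counts of 6, 1 and 3, the expected count of label 1 is 10; if the word's counts total 20, the vote of label 1 is log₁₀(0.5), about -0.301. One such dictionary per word gives a column of the table of FIG. 30.
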

Referring again to FIG. 25, the process of selecting likely candidate words from the vocabulary by polling is shown to include step 8008, in which labels are generated in response to an unknown speech input. This is performed by the speech processor 1004 (FIG. 1).

For the word under consideration, the generated labels are looked up in the table of FIG. 30, and the vote of each generated label for that word is retrieved. The votes are then accumulated to give a total vote for the word (step 8010). For example, if labels 1, 3 and 5 are generated, votes V₁₁, V₃₁ and V₅₁ are evaluated and combined. If the votes are logarithms of probabilities, they are summed to give the total vote for word 1. The same procedure is followed for each word in the vocabulary, so that labels 1, 3 and 5 "vote" for each word.

According to one embodiment of the invention, the accumulated vote obtained for each word serves as the likelihood score for that word. The n words (n being a predetermined integer) with the highest accumulated votes are defined as the candidate words, which are later processed in the detailed match and the language model as described above.

In another embodiment, a "penalty" is computed along with the vote. That is, for each word, penalties are computed and assigned (step 8012). A penalty represents the likelihood that the label in question is not produced by the given word.

There are various methods of computing the penalty. One method of computing the penalties of a word represented by a phonemic baseform involves assuming that each phonemic phone produces only one label. For a given label and a particular phonemic phone, the penalty of the given label corresponds to the logarithm of the probability that some other label is produced by that phone. The penalty of label 1 for phone P₂ therefore corresponds to the logarithm of the probability that any one of labels 2 through 200 is the one label produced. The assumption of one label output per phonemic phone is not exact, but it has proved satisfactory for the purpose of computing penalties. Once the penalty of a label has been determined for each phone, the penalty for a word, which is a known sequence of phones, is readily determined.

FIG. 31 shows the penalty of each label for each word. Each penalty is written PEN followed by two subscripts, the first representing the label and the second representing the word.
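
Under the one-label-per-phone assumption described above, the penalty of a label for a phone is the logarithm of one minus that label's output probability. A minimal sketch follows. The text says only that the word-level penalty is "readily determined" from the phone-level penalties; adding the phone penalties along the baseform (treating the phones as independent) is an assumption of this sketch, as are the function names and the `floor` that keeps the logarithm finite.

```python
import math
from typing import Dict, List


def phone_penalty(label: int, out: Dict[int, float],
                  floor: float = 1e-6) -> float:
    """log10 of the probability that the phone's single label is NOT `label`."""
    return math.log10(max(1.0 - out.get(label, 0.0), floor))


def word_penalty(label: int, baseform: List[str],
                 phone_out: Dict[str, Dict[int, float]]) -> float:
    """Assumed combination: phones emit independently, so the logarithms add."""
    return sum(phone_penalty(label, phone_out[p]) for p in baseform)
```

A label that a word's phones almost always produce gets a strongly negative penalty, so its absence from the input counts heavily against the word; a label the word never produces gets a penalty of zero.
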

Returning again to FIG. 25, the labels generated in step 8008 are examined to determine which labels of the label alphabet have not occurred. The penalty of each label that has not occurred is then evaluated. To obtain the total penalty for a given word, the penalty of each non-occurring label for that word is retrieved and all such penalties are accumulated (step 8014). If each penalty corresponds to the logarithm of a "null" probability, the penalties for the given word are summed over all such labels, in the same way as the votes. This procedure is repeated for each word in the vocabulary, so that each word has a total vote and a total penalty given the string of generated labels.

Once a total vote and a total penalty have been obtained for each word in the vocabulary, a likelihood score is defined by combining the two values (step 8016). If desired, the total vote may be weighted more heavily than the total penalty, or vice versa.

Furthermore, the likelihood score of each word is preferably scaled according to the number of labels that voted (step 8018). Specifically, the total vote and the total penalty, both of which are sums of logarithms of probabilities, are added together, and the final sum is divided by the number of speech labels generated and considered in computing the votes and penalties. The result is a scaled likelihood score.
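
Steps 8010 through 8018 can be put together in a few lines. The sketch below assumes vote and penalty tables indexed as `table[word][label]` (the tables of FIG. 30 and FIG. 31), lets a label that occurs several times vote once per occurrence, and exposes the optional weighting as `w_vote` and `w_pen`. All names are hypothetical, and the label string is assumed to be non-empty.

```python
from typing import Dict, List, Sequence

Table = Dict[str, Dict[int, float]]  # table[word][label] -> log-probability


def scaled_score(generated: Sequence[int], word: str, vote: Table, pen: Table,
                 alphabet: Sequence[int],
                 w_vote: float = 1.0, w_pen: float = 1.0) -> float:
    """Scaled likelihood score of `word` for a non-empty generated label string."""
    # Step 8010: every generated label votes for the word.
    total_vote = sum(vote[word][lab] for lab in generated)
    # Step 8014: labels of the alphabet that did not occur contribute penalties.
    absent = set(alphabet) - set(generated)
    total_pen = sum(pen[word][lab] for lab in absent)
    # Step 8016: combine, optionally weighting one term more than the other.
    combined = w_vote * total_vote + w_pen * total_pen
    # Step 8018: scale by the number of labels that voted.
    return combined / len(generated)


def candidate_words(generated: Sequence[int], vocabulary: Sequence[str],
                    vote: Table, pen: Table, alphabet: Sequence[int],
                    n: int) -> List[str]:
    """Return the n words with the highest scaled likelihood scores."""
    ranked = sorted(
        vocabulary,
        key=lambda w: scaled_score(generated, w, vote, pen, alphabet),
        reverse=True,
    )
    return ranked[:n]
```

Dividing by the number of voting labels makes scores computed over label strings of different lengths comparable, which matters when the score is evaluated at several candidate end times.
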

A further aspect of the invention relates to determining which labels to consider in the voting and penalty (i.e., polling) computation. Where the end of a word is identified and the labels corresponding to it are known, all labels generated between a known start time and the known end time are preferably considered. When, however, the end time is found to be unknown (step 8020), the invention provides the following method. A reference end time is defined, and a likelihood score is evaluated repeatedly at successive time intervals after the reference end time (step 8022). For example, after 500 ms the (scaled) likelihood score of each word is evaluated, and this is repeated at successive intervals until 1000 ms have elapsed from the start time of the word utterance. In this example, each word has ten (scaled) likelihood scores.

A method is then applied for choosing which of the ten likelihood scores to assign to a given word. Specifically, from the series of likelihood scores obtained for a given word, the score that is highest relative to the scores of the other words at the same time interval is selected (step 8024). To this end, the highest likelihood score at each time interval is subtracted from all the likelihood scores at that interval. The word with the highest likelihood score at a given interval is thereby set to zero, and the scores of the other, less likely words become negative. The least negative relative score of a given word (the one closest to zero) is assigned to that word as its highest relative likelihood score.

When a likelihood score has been assigned to each word, the n words with the highest assigned scores are selected as the candidate words resulting from polling (step 8026).
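
The selection rule for an unknown end time (steps 8022 through 8026) can be sketched as follows, assuming the scaled scores have already been computed and stored as `scores[word][t]` for each candidate end time `t`. The function names are hypothetical.

```python
from typing import Dict, List


def assign_relative_scores(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """scores[w][t] is the scaled likelihood score of word w at end time t."""
    words = list(scores)
    n_times = len(scores[words[0]])
    # Highest score among all words at each time interval.
    best_at = [max(scores[w][t] for w in words) for t in range(n_times)]
    # Subtract it, then keep, for each word, the relative score closest to zero.
    return {w: max(scores[w][t] - best_at[t] for t in range(n_times))
            for w in words}


def top_n(assigned: Dict[str, float], n: int) -> List[str]:
    """The n words with the highest assigned relative scores (step 8026)."""
    return sorted(assigned, key=assigned.get, reverse=True)[:n]
```

A word that is the best candidate at any one interval receives a relative score of zero, so several words can tie at the top; how such ties are ordered is not specified in the text and is left here to the stable sort.
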

In one embodiment of the invention, the n words resulting from polling are provided as a reduced list of words, which is then subjected to processing by the detailed match and the language model. In this embodiment, the reduced list obtained by polling takes the place of the acoustic fast match described above. In this regard, it is noted that the acoustic fast match provides a tree-like lattice structure in which word baseforms are entered as sequences of phones and in which words that begin with the same phones follow a common branch of the tree. For a vocabulary of 2000 words, however, the polling method of the invention has been found to be two to three times faster than the fast match with the tree-like lattice structure.

Alternatively, the acoustic fast match and polling may be used in conjunction. That is, from the trained Markov models and the generated label string, the approximate fast match is performed in step 8028 in parallel with the polling. One list is provided by the acoustic match and another list is provided by polling. In a conservative approach, the entries on one list are used to augment the other list. In an approach that seeks to reduce the number of best candidate words further, only the words that appear on both lists are retained for further processing. How the two techniques interact in step 8030 depends on the accuracy and computational goals of the system.
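
The two ways of combining the polling list with the fast match list in step 8030 correspond to a union and an intersection of the lists. A minimal sketch, with hypothetical function names and with list order taken from the polling list first:

```python
from typing import List


def merge_union(polling: List[str], fast_match: List[str]) -> List[str]:
    """Augment one list with the other, keeping the order of first appearance."""
    return list(dict.fromkeys(polling + fast_match))


def merge_intersection(polling: List[str], fast_match: List[str]) -> List[str]:
    """Keep only the words that appear on both lists."""
    in_fast = set(fast_match)
    return [w for w in polling if w in in_fast]
```

The union favors accuracy, since a word missed by one technique can still survive; the intersection favors speed, since fewer words reach the detailed match.
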

As yet another alternative, the fast match may be applied in sequence to the polling list.

Apparatus for performing the polling is shown in FIG. 32. In this figure, element 8102 stores the word models trained as described above. From the statistics applied to the word models, a vote generator 8104 computes the vote of each label for each word and stores the votes in vote table storage 8106.

Similarly, a penalty generator 8108 computes the penalty of each label for each word in the vocabulary and enters the values into penalty table storage 8110.

A word likelihood score evaluator 8112 receives the labels generated by a speech processor 8114 in response to an unknown speech input. For a given word selected by a word select element 8116, the word likelihood score evaluator 8112 combines the votes of each label that occurred for the selected word with the penalties of each label that did not occur.

The evaluator 8112 also includes means for scaling the likelihood score as described above. The likelihood score evaluator may also, although it need not, include means for repeating the score evaluation at successive time intervals after a reference time.

The likelihood score evaluator 8112 supplies the word scores to a word lister 8120, which orders the words according to their assigned likelihood scores.

In an embodiment that combines the word list obtained by polling with the list obtained by the approximate fast match, a list comparer 8122 is provided. The comparer 8122 receives as inputs the polling list from the word lister 8120 and the list from the acoustic fast match (described in several of the embodiments above).

Several features may be included to reduce storage and computational requirements. First, the votes and penalties can be formatted as integers ranging from 0 to 255. Second, the actual penalties can be replaced by approximate penalties computed from the corresponding votes by the formula penalty = a × vote + b, where a and b are constants determined by least-squares regression. Third, labels can be grouped into speech classes, each class including at least one label. The allocation of labels to classes is determined by hierarchically clustering the labels so as to maximize the mutual information between the resulting speech classes and the words.

Further, according to the invention, periods of silence are detected (by known methods) and ignored.
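
The first two storage-saving features can be illustrated briefly. The text does not say how values are mapped onto 0 to 255, so the linear mapping in `quantize` and its range arguments `lo` and `hi` are assumptions; `fit_penalty_line` is an ordinary least-squares fit of penalty = a × vote + b. Both function names are hypothetical.

```python
from typing import List, Tuple


def quantize(values: List[float], lo: float, hi: float) -> List[int]:
    """Map values in [lo, hi] linearly onto the integers 0..255, clipping outliers."""
    scale = 255.0 / (hi - lo)
    return [min(255, max(0, round((v - lo) * scale))) for v in values]


def fit_penalty_line(votes: List[float],
                     penalties: List[float]) -> Tuple[float, float]:
    """Least-squares constants (a, b) such that penalty is close to a * vote + b."""
    n = len(votes)
    mean_v = sum(votes) / n
    mean_p = sum(penalties) / n
    sxx = sum((v - mean_v) ** 2 for v in votes)
    sxy = sum((v - mean_v) * (p - mean_p) for v, p in zip(votes, penalties))
    a = sxy / sxx
    return a, mean_p - a * mean_v
```

With the fitted constants, only the vote table needs to be stored: a penalty can be recomputed from the corresponding vote when it is needed, at the cost of some approximation error.
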

The invention has been implemented in PL/I on an IBM MVS system, but it can also be implemented in other programming languages and on other systems.

Various modifications of the invention are possible within the scope of its technical idea. For example, the end time of a word may be determined by summing the expected number of labels for each phone in its baseform. In addition, the votes and penalties may be computed for only selected labels of the generated label string, for example the odd-numbered labels or the first m labels. It is preferable, however, to consider the vote and penalty of every label between the beginning and the end of the word. The invention also contemplates the use of voting formulas other than the summing of logarithms of probabilities. The invention applies broadly to polling apparatus and polling methods in which each label casts a vote for each word in the vocabulary, the vote typically differing from word to word, in order to obtain a short list of candidate words.

F Effect of the invention

As described above, according to the invention, the list of words to be subjected to the detailed match in speech recognition is produced by polling, so that such a word list can be obtained in a short processing time.

[Table]

[Table]

[Brief description of the drawings]

FIG. 1 is a general block diagram of the speech recognition system.
FIG. 2 is a block diagram showing the system of FIG. 1 in more detail.
FIG. 3 is a diagram of a phonetic sound machine.
FIG. 4 is a block diagram of the speech processor.
FIG. 5 is a cross-sectional view showing the inside of the human ear.
FIG. 6 is a block diagram of part of the speech processor.
FIG. 7 is a diagram of equal-loudness curves.
FIG. 8 is a diagram showing the relationship between sones and phons.
FIG. 9 is a flowchart of the processing performed by the speech processor.
FIG. 10 is a detailed flowchart of the threshold update in FIG. 9.
FIG. 11 is a diagram showing a detailed match lattice.
FIG. 12 is a block diagram of a fast match sound machine.
FIG. 13 is a diagram showing the fast match computation.
FIG. 14 is a diagram showing the relationship among a phone, a label string, and start and end times.
FIG. 15 is a diagram showing a sound machine of minimum length 0 and its start-time distribution.
FIG. 16 is a diagram showing a sound machine of minimum length 4 and its time chart.
FIG. 17 is a diagram showing a tree structure of phones.
FIG. 18 is a flowchart of the steps performed in training the sound machines used for acoustic matching.
FIG. 19 is a diagram showing successive steps of stack decoding.
FIG. 20 is a diagram showing likelihood vectors for word paths and a likelihood envelope.
FIG. 21 is a flowchart of the procedure for extending a found path.
FIG. 22 is a flowchart of the operation of the stack decoder.
FIG. 23 is a diagram of a phonemic sound machine.
FIG. 24 is a diagram showing a lattice of phonemic sounds.
FIGS. 25.1 and 25.2 are diagrams showing the polling method of the invention, and FIG. 25 shows how FIGS. 25.1 and 25.2 are combined.
FIG. 26 is a diagram showing a distribution of label counts.
FIG. 27 is a diagram showing the number of times each phone produces each label during training.
FIG. 28 is a diagram showing the phone sequences that make up words.
FIG. 29 is a diagram showing, for each label, the count expected for a given word.
FIG. 30 is a diagram showing the vote of each label for each word.
FIG. 31 is a diagram showing the penalty of each label for each word.
FIG. 32 is a block diagram of the polling apparatus of the invention.

Claims

[Claims]

1. A speech recognition method in which speech is quantized at predetermined small time intervals and labels are generated according to the quantized speech data as preprocessing for speech recognition; a Markov model having at least two state transitions, and label output probabilities with which each of the labels is output at each of those state transitions, is established for each phone; input speech to be recognized is converted into a string of the labels; the string of labels is matched against a phone or a string of phones with reference to the probability data of the Markov models; and the input speech is recognized on the basis of this matching; the method being characterized by comprising, before the recognition is performed, (a) referring to a table that indicates, for each of the labels and each word in the vocabulary, the label generation likelihood with which the label is generated in each of the small intervals when the word is uttered, (b) a step of accumulating, for a given word, the label generation likelihoods of the plurality of labels generated in response to the input speech to be recognized, and a step of determining, according to the accumulated value, whether the given word is a candidate word for the input speech, the recognition being performed on the words determined to be candidate words.

2. A speech recognition apparatus in which speech is quantized at predetermined small time intervals and labels are generated according to the quantized speech data as preprocessing for speech recognition; a Markov model having at least two state transitions, and label output probabilities with which each of the labels is output at each of those state transitions, is established for each phone; input speech to be recognized is converted into a string of the labels; the string of labels is matched against a phone or a string of phones with reference to the probability data of the Markov models; and the input speech is recognized on the basis of this matching; the apparatus being characterized by comprising (a) means for determining, for each label in the label set and each word in the vocabulary, a first likelihood, namely a label generation likelihood that the word generates the label in each of the small intervals, (b) means for forming a table of label non-generation likelihoods, each being the likelihood that, for each label in the label set and each word in the vocabulary, the word does not generate the label in each of the small intervals, and (c) means for combining, for the labels generated in response to input speech, the label generation likelihoods and the label non-generation likelihoods associated with a given word, to determine the likelihood that the word corresponds to the input speech; whether the given word is a candidate for the input speech to be recognized being determined on the basis of the likelihood from means (c), and the recognition being performed on the candidate words.