JPH0372992B2

Info

Publication number: JPH0372992B2
Application number: JP61032048A
Authority: JP
Inventors: Lalit Rai Bahl; Peter Vincent deSouza; Robert Leroy Mercer; Michael Alan Picheny
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 1986-02-18
Filing date: 1986-02-18
Publication date: 1991-11-20
Also published as: JPS62194291A

Description

[Detailed description of the invention]

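The invention is described in the following order.

A. Industrial Field of Application
B. Summary of the Disclosure
C. Prior Art
D. Problems to Be Solved by the Invention
E. Means for Solving the Problems
F. Embodiment
  F1 Speech Recognition System Environment
    F1a General Description (Figs. 4 to 9)
    F1b Auditory Model and Its Implementation in the Acoustic Processor of the Speech Recognition System (Figs. 10 to 16)
    F1c Detailed Match (Figs. 6 and 17)
    F1d Basic Fast Match (Figs. 18 to 20)
    F1e Alternative Fast Match (Figs. 21 and 22)
    F1f Matching Based on the First J Labels (Fig. 22)
    F1g Phoneme Tree Structure and Fast Match Embodiments (Fig. 23)
    F1h Language Model (Figs. 4 and 24)
    F1j Stack Decoder (Figs. 7 to 9 and 25)
    F1k Construction of Phonetic Baseforms
  F2 Forming a Set of Phoneme Machines Including Onset Phoneme Machines and Trailing Phoneme Machines (Figs. 1A to 3 and 26 to 32)
  F3 Tables
G. Effects of the Invention

A. Industrial Field of Application

The present invention relates to speech recognition using Markov models. In particular, it relates to constructing word baseforms (that is, Markov models of words) with attention to the beginnings of words, and to using such word baseforms.

B. Summary of the Disclosure

The invention discloses an apparatus and a method for constructing word baseforms that can be matched against a string of generated acoustic labels, including the formation of a set of phonetic phoneme machines. Each phoneme machine has (i) a plurality of states, (ii) a plurality of transitions, each going from one state to a state, (iii) a stored probability for each transition, and (iv) stored label output probabilities, each label output probability corresponding to the probability that the phoneme machine produces the corresponding label. The set of phonetic phoneme machines is formed to include a subset of onset phoneme machines, the stored probabilities of each onset phoneme machine corresponding to at least one phonetic element uttered at the beginning of a speech segment. The set is also formed to include a subset of trailing phoneme machines, the stored probabilities of each trailing phoneme machine corresponding to at least one phonetic element uttered at the end of a speech segment. A word baseform is constructed by concatenating phoneme machines selected from the set.

C. Prior Art

Background for the present invention is provided by U.S. patent applications Ser. No. 06/665,401 (filed October 26, 1984) and Ser. No. 06/672,974 (filed November 19, 1984).

In the probabilistic approach to speech recognition, an acoustic waveform is first converted by an acoustic processor into a string of labels. The label alphabet (set) typically consists of about 200 labels, each of which identifies a corresponding sound type.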
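The generation of such labels is described in various papers and in the aforementioned U.S. application Ser. No. 06/665,401. Briefly, the acoustic input is divided into successive time frames and a label is assigned to each frame. Labels are normally formed on the basis of energy characteristics.

Markov models (probabilistic finite-state machines) have already been proposed for using labels in speech recognition. A Markov model normally includes a plurality of states and transitions between those states. In addition, a Markov model normally has probabilities assigned to it relating to (a) the probability of each transition occurring and (b) the respective probability of producing each label at the various transitions. Markov models (or, equivalently, Markov sources) are described in various papers, such as L. R. Bahl et al., "A Maximum Likelihood Approach to Continuous Speech Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-5, No. 2, March 1983, pp. 179-190. A Markov model machine is also called a Markov model phoneme machine, or simply a phoneme machine.

In recognizing speech, a matching process is performed to determine which word (or words) in the vocabulary has the highest likelihood of having produced the label string generated by the acoustic processor. One such matching procedure is set out in the aforementioned U.S. application Ser. No. 06/672,974. According to that application, acoustic matching is performed by (a) characterizing each word in the vocabulary by a sequence of Markov model phoneme machines and (b) determining the respective likelihood that each word-representing sequence of phoneme machines produced the string of labels generated by the acoustic processor. Each word-representing sequence of phoneme machines corresponds to a word baseform.

In forming word baseforms, the nature of the phoneme machines used to build them must first be defined. U.S. application Ser. No. 06/672,974 shows word baseforms constructed from phonetic phoneme machines. In that case each phoneme machine corresponds to a phonetic phone and includes seven states and thirteen transitions. Specifically, a set of about 70 phonemes, each representing a corresponding phonetic element, is the basis from which baseforms are built. Typically, the baseform of a word is constructed by a phonetician breaking the word into its phonetic segments and assigning the corresponding phoneme machine to each segment.

Conventionally, each of the 70 phonemes represented a given class of phone regardless of whether the phone occurred at the beginning, in the middle, or at the end of a word. For example, the "k" sound was represented by the phoneme k whether it occurred at the beginning of a word as in "cat", in the middle of a word as in "scat", or at the end of a word as in "back".

D. Problems to Be Solved by the Invention

The invention is based on the finding that a given sound exhibits different energy characteristics depending on whether it is adjacent to a period of silence, that is, whether it is preceded or followed by silence. In particular, the invention relies on the fact that, for some sounds, energy builds up when silence precedes the sound and, for some sounds, energy decays when silence follows. Because energy characteristics are generally used by the acoustic processor in deciding which labels to generate for the acoustic input, the energy build-up or decay can cause different labels to be generated depending on whether the sound occurs at the beginning or at the end of a word.

Accordingly, the invention defines a number of phoneme machines of a first type, which account for the energy build-up when a sound is uttered at the beginning of a word, and a number of phoneme machines of a second type, which account for the energy decay when a sound is uttered at the end of a word. In addition, there are phoneme machines of a third type, corresponding to sounds uttered without significant energy build-up or decay. Phoneme machines of the first type are called onset phoneme machines, those of the second type trailing phoneme machines, and those of the third type common phoneme machines.

The statistics of an onset phoneme machine reflect the transition from silence, and the statistics of a trailing phoneme machine reflect the transition to silence. A common phoneme machine preferably has statistics corresponding to sounds uttered in the middle of a word or, more generally, at word positions where a transition to or from silence does not greatly affect the statistics of the phoneme machine.

A given sound has only a common phoneme machine associated with it if its energy characteristics do not vary greatly wherever in a word it is uttered.

According to the invention, a plurality of onset phoneme machines and trailing phoneme machines are provided to account for the energy characteristics of given sounds when they occur adjacent to a period of silence.

Thus, according to the invention, when a given word starts with a subject sound that has a corresponding onset phoneme machine, the word is given a baseform that starts with that onset phoneme machine followed by the common phoneme machine for the subject sound. Similarly, when a given word ends with a subject sound that has a corresponding trailing phoneme machine, the word is given a baseform that ends with that trailing phoneme machine preceded by the common phoneme machine for the subject sound.

It is therefore an object of the invention to include, in the set of Markov models from which word baseforms are constructed, Markov models corresponding to phones that occur at transitions to or from silence, and thereby to improve accuracy in word recognition systems that use such baseforms.

A further object of the invention is to limit the total number of phoneme machines by grouping together sounds with similar energy build-up characteristics and defining a single onset phoneme machine for all sounds in the group, and by grouping together sounds with similar energy decay characteristics and defining a single trailing phoneme machine for all sounds in the group.

E. Means for Solving the Problems

The method that achieves these objects includes the following steps.

(a) Forming a set of phonetic phoneme machines, each phoneme machine having (i) a plurality of states, (ii) a plurality of transitions, each going from one state to a state, (iii) a stored probability for each transition, and (iv) stored label output probabilities, each label output probability corresponding to the probability that the phoneme machine produces the corresponding label.

(b) Forming the set of phonetic phoneme machines to include a subset of onset phoneme machines, the stored probabilities of each onset phoneme machine corresponding to at least one phonetic element uttered at the beginning of a speech segment.

(c) Constructing each word baseform as a sequence of phoneme machines, such that a word starting with a phonetic element that has a given onset phoneme machine corresponding to it has a word baseform starting with the given onset phoneme machine.

The method of the invention has the following further features. The set of phonetic phoneme machines is formed to include a subset of trailing phoneme machines, the stored probabilities of each trailing phoneme machine corresponding to at least one phonetic element uttered at the end of a speech segment. Each word baseform is constructed as a sequence of phoneme machines such that a word ending with a phonetic element that has a given trailing phoneme machine corresponding to it has a word baseform ending with the given trailing phoneme machine.

The apparatus of the invention that achieves these objects includes a set of Markov model phoneme machines. Each phoneme machine is characterized by having (i) a plurality of states, (ii) a plurality of transitions, each going from one state to a state, (iii) means for storing a probability for each transition, and (iv) means for storing label output probabilities. Each label output probability corresponds to the probability that the phoneme machine produces a particular label at an identified transition. Some of the phoneme machines are onset phoneme machines, each of which (i) is associated with at least one sound from a set of sounds and (ii) has transition probabilities and label output probabilities trained from utterances of at least one associated sound at the start of a word. The apparatus further has means for constructing each word baseform as a sequence of phoneme machines, the constructing means including means for placing a given onset phoneme machine at the beginning of a subject word baseform when the word corresponding to that baseform starts with a sound associated with the given onset phoneme machine.

As a further feature of the apparatus, some of the phoneme machines are trailing phoneme machines, each of which (i) is associated with at least one sound from the set of sounds and (ii) has transition probabilities and label output probabilities trained from utterances of at least one associated sound at the end of a speech segment. The constructing means further includes means for placing a given trailing phoneme machine at the end of a subject word baseform when the word corresponding to that baseform ends with a sound associated with the given trailing phoneme machine.

The apparatus of the invention also includes common phoneme machines, each corresponding to a sound that is subject to energy build-up at the start of an utterance and representing that sound when it is not subject to the build-up. In that case the constructing means further includes means for including an onset phoneme machine followed by the common phoneme machine corresponding to a particular sound, when the particular sound starts the word and has an onset phoneme machine associated with it.

F. Embodiment

F1 Speech Recognition System Environment

F1a General Description (Figs. 4 to 9)

Fig. 4 is a general block diagram of a speech recognition system 1000. The system includes a stack decoder 1002 and, connected to it, an acoustic processor (AP) 1004, an array processor 1006 that performs a fast approximate acoustic match, an array processor 1008 that performs a detailed acoustic match, a language model 1010, and a workstation 1012.

The acoustic processor 1004 is designed to transform a speech waveform input into a string of labels, or fenemes, each of which roughly identifies a corresponding sound type. ("Feneme" is the name given to the minute phoneme-like units obtained from the front end, although in principle the term is not limited to units obtained from a front end.) In this system the acoustic processor 1004 is based on a unique model of the human ear and is described in U.S. application Ser. No. 06/665,401 (filed October 26, 1984).

The labels, or fenemes, from the acoustic processor 1004 are sent to the stack decoder 1002. Fig. 5 shows the logical elements of the stack decoder 1002: a search element 1020 and, connected to it, the workstation 1012 and interfaces 1022, 1024, 1026 and 1028. These interfaces connect respectively to the acoustic processor 1004, the array processors 1006 and 1008, and the language model 1010. In operation, fenemes from the acoustic processor 1004 are directed by the search element 1020 to the array processor 1006 (fast match). The fast match procedure described below is also described in the aforementioned U.S. application Ser. No. 06/672,974 (filed November 19, 1984). Briefly, the object of matching is to determine the most likely word (or words) for a given string of labels.

The fast match examines the words in the vocabulary and is designed to reduce the number of candidate words for a given string of incoming labels. The fast match is based on probabilistic finite-state machines, also referred to herein as Markov models.

After the fast match has reduced the number of candidate words, the stack decoder 1002 communicates with the language model 1010, which determines the contextual likelihood of each candidate word in the fast match candidate list, preferably on the basis of existing trigrams.

The detailed match preferably examines those words from the fast match candidate list that, based on the language model computations, have a reasonable likelihood of being the spoken word. The detailed match is also described in the aforementioned U.S. application Ser. No. 06/672,974. It is performed by Markov model phoneme machines such as the one shown in Fig. 6.

After the detailed match, the language model is preferably invoked again to determine word likelihood. The stack decoder 1002 of the invention is designed to determine, using information derived from the fast match, the detailed match, and the language model, the most likely path, or sequence, of words for a generated label string.

Two conventional methods for finding the most likely word sequence are Viterbi decoding and single-stack decoding. Both are described in the aforementioned paper by Bahl et al., "A Maximum Likelihood Approach to Continuous Speech Recognition": Viterbi decoding in Section V and single-stack decoding in Section VI.

In single-stack decoding, paths of different lengths are listed in a single stack according to likelihood, and decoding is based on that single stack. Because single-stack decoding must account for the fact that likelihood depends somewhat on path length, normalization is generally used.

The Viterbi technique requires no normalization and is generally suited to small tasks.

As another alternative, decoding can be performed for a small vocabulary by examining each possible combination of words as a possible word sequence and determining which combination has the highest probability of producing the generated label string. The computational requirements of this technique become enormous and impractical for large vocabularies.

The stack decoder 1002 in effect controls the other elements but does not itself perform many computations. It therefore preferably includes a 4341 processor running under the IBM VM/370 operating system, as described in publications such as Virtual Machine/System Product Introduction Release 3 (1983). The array processors, which perform considerable computation, are implemented with the commercially available Floating Point Systems (FPS), Inc. 190L.

Figs. 7, 8 and 9 show a novel technique invented by L. R. Bahl et al. that includes multiple stacking and a unique decision strategy.

Fig. 7 shows a plurality of successive labels y₁y₂… generated at successive label intervals.

Fig. 8 shows a plurality of word paths, namely path A, path B and path C. In the context of Fig. 7, path A could correspond to the entry "to be or", path B to the entry "two b", and path C to the entry "too". For each subject word path there is a label having the highest probability that the subject word path has ended there. Such a label is called a boundary label.

For a word path W representing a sequence of words, the most likely end time (represented in the label string as a boundary label) can be found by a known method for finding the likely boundary between two words, described in L. R. Bahl et al., "Faster Acoustic Match Computation," IBM Technical Disclosure Bulletin, Vol. 23, No. 4, September 1980. Briefly, that article discusses methods for addressing two important questions: (a) how much of a label string Y is accounted for by a word (or word sequence), and (b) at which label interval a partial sentence (corresponding to part of the label string) ends.

For any given word path, there is a "likelihood value" associated with each label or label interval, from the first label of the label string through the boundary label. Taken together, all the likelihood values for a given word path represent a "likelihood vector" for that path, so each word path has a corresponding likelihood vector. Likelihood values L_t are shown in Fig. 8.

The "likelihood envelope" Λ_t at label interval t for a collection of word paths W¹, W², …, W^S is defined mathematically as

$$\Lambda_t = \max\left(L_t(W^1), \ldots, L_t(W^S)\right)$$

That is, for each label interval, the likelihood envelope contains the highest likelihood value associated with any word path in the collection. A likelihood envelope 1040 is shown in Fig. 8.

A word path is considered "complete" if it corresponds to a complete sentence. A complete path is preferably identified by the speaker signalling, for example by pressing a button, on reaching the end of a sentence. The entered input is synchronized with a label interval to mark the sentence end. A complete word path cannot be extended by appending words to it. A partial word path corresponds to an incomplete sentence and can be extended.

Partial paths are classified as "live" or "dead". A word path is dead if it has already been extended, and live if it has not yet been extended.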
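As a minimal sketch, and not the implementation used in the described system, the likelihood envelope and the live/dead bookkeeping defined above might be expressed in Python as follows. The `WordPath` class, its field names, and the example likelihood values are hypothetical; likelihood vectors are assumed to hold log-likelihoods, one per label interval up to the path's boundary label.

```python
import math
from dataclasses import dataclass


@dataclass
class WordPath:
    """Hypothetical record for one word path (names are illustrative)."""
    words: tuple            # e.g. ("to", "be", "or")
    likelihoods: list       # L_t for t = 0 .. boundary label (log-likelihoods)
    complete: bool = False  # True if the path corresponds to a full sentence
    extended: bool = False  # True once the path has been extended

    @property
    def boundary(self):
        """Index of the boundary label (last interval with a likelihood)."""
        return len(self.likelihoods) - 1

    @property
    def live(self):
        """A partial path is live until it has been extended."""
        return not self.complete and not self.extended


def likelihood_envelope(paths, num_intervals):
    """Lambda_t = max over paths of L_t(W); -inf where no path has a value."""
    envelope = [-math.inf] * num_intervals
    for path in paths:
        for t, value in enumerate(path.likelihoods[:num_intervals]):
            envelope[t] = max(envelope[t], value)
    return envelope


# Made-up likelihood values for the three example entries of Fig. 8.
paths = [
    WordPath(("to", "be", "or"), [-1.0, -2.5, -4.0]),
    WordPath(("two", "b"), [-1.5, -2.0]),
    WordPath(("too",), [-3.0], extended=True),
]
envelope = likelihood_envelope(paths, 4)
live_paths = [p.words for p in paths if p.live]
```

Intervals that no path reaches keep the value -inf, which matches the way the envelope is initialized when no complete path is available.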
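With this classification, a path that has already been extended to form one or more longer word paths is not considered for extension again at a later time.

Each word path can also be characterized as "good" or "bad" relative to the likelihood envelope. A word path is good if, at the label corresponding to its boundary label, it has a likelihood value within the maximum likelihood envelope. Otherwise the word path is bad. It is preferable, though not necessary, to reduce each value of the maximum likelihood envelope by a fixed amount to set the good/bad threshold level.

There is a stack element for each label interval. Each live word path is assigned to the stack element for the label interval corresponding to its boundary label. A stack element may have zero, one, or more word path entries, listed in order of likelihood value.

The steps performed by the stack decoder 1002 of Fig. 4 are now described.

Forming the likelihood envelope and determining which word paths are good are interrelated, as shown in the flowchart of the stack decoding technique in Fig. 9.

In the flowchart of Fig. 9, a null path is first entered into the first stack (0) at block 1050. At block 1052 a (complete) stack element containing the complete paths determined so far, if any, is provided. Each complete path in the (complete) stack element has a likelihood vector associated with it. The likelihood vector of the complete path having the highest likelihood at its boundary label initially defines the maximum likelihood envelope. If there is no complete path in the (complete) stack element, the maximum likelihood envelope is initialized to −∞ at each label interval. The envelope may also be initialized to −∞ when complete paths are not specified. The envelope is initialized at blocks 1054 and 1056.

After it has been initialized, the maximum likelihood envelope is reduced by a predetermined amount Δ, forming a Δ-good region above the reduced likelihoods and a Δ-bad region below them. The larger Δ is, the larger the number of word paths considered for possible extension. When log₁₀ is used to determine L_t, a value of 2.0 for Δ gives satisfactory results. The value of Δ is preferably, but not necessarily, uniform along the length of label intervals.

If a word path has a likelihood at its boundary label that lies within the Δ-good region, the word path is marked "good". Otherwise it is marked "bad".

As shown in Fig. 9, the loop that updates the likelihood envelope and marks word paths as "good" (extendable) or "bad" begins at block 1058 by finding the longest unmarked word path. If more than one unmarked word path is in the stack corresponding to the longest word path length, the word path with the highest likelihood at its boundary label is selected. If a word path is found, block 1060 checks whether the likelihood at its boundary label lies within the Δ-good region. If not, block 1062 marks the path as being in the Δ-bad region and block 1058 finds the next unmarked live path. If so, block 1064 marks the path as being in the Δ-good region and block 1066 updates the likelihood envelope to include the likelihood values of the path marked "good". That is, for each label interval, the updated likelihood value is the greater of (a) the current likelihood value in the envelope and (b) the likelihood value associated with the word path marked "good". This is done at blocks 1064 and 1066. After the envelope has been updated, control returns to block 1058 to find the longest, best unmarked live word path again.

The loop is repeated until no unmarked word paths remain. At that point the shortest word path marked "good" is selected at block 1070. If there is more than one "good" word path of the shortest length, block 1072 selects the one with the highest likelihood at its boundary label, and the selected shortest path is extended. That is, at least one likely follower word is determined, as described above, by preferably performing the fast match, language model, detailed match, and language model procedures. For each likely follower word an extended word path is formed. Specifically, an extended word path is formed by appending a likely follower word to the end of the selected shortest word path.

After the selected shortest word path has formed the extended word paths, it is removed from the stack in which it was an entry, and each extended word path is inserted into the appropriate stack in its place. In particular, an extended word path becomes an entry in the stack corresponding to its boundary label (block 1072).

After the extended paths have been formed and the stacks re-formed, control returns to block 1052 and the process is repeated.

Each iteration thus selects and extends the shortest, best "good" word path. Because a word path marked "bad" on one iteration may become "good" on a later iteration, the characterization of a live word path as "good" or "bad" is made independently on each iteration. In practice the likelihood envelope does not change greatly from one iteration to the next, so the computation that decides whether a word path is good or bad is performed efficiently, and normalization is not required.

When complete sentences are identified, block 1074 is preferably included. That is, when no live word paths remain unmarked and there are no "good" word paths to extend, decoding ends. The complete word path having the highest likelihood at its respective boundary label is identified as the most likely word sequence for the input label string.

For continuous speech in which sentence endings are not identified, path extension proceeds continually, or for a predetermined number of words as desired by the user of the system.

F1b Auditory Model and Its Implementation in the Acoustic Processor of the Speech Recognition System (Figs. 10 to 16)

Fig. 10 shows a specific embodiment of an acoustic processor 1100 as described above. An acoustic wave input (for example, natural speech) enters an A/D converter 1102, which samples at a prescribed rate. A typical sampling rate is one sample every 50 microseconds. A time window generator 1104 is provided to shape the edges of the digital signal. The output of the time window generator 1104 enters an FFT (fast Fourier transform) element 1106, which provides a frequency spectrum output for each time window.

The output of the FFT element 1106 is then processed to produce labels L₁L₂…L_f. A feature selection element 1108, a cluster element 1110, a prototype element 1112, and a labeller 1114 act together to generate the labels. In generating labels, prototypes are defined as points (or vectors) in a space based on selected features. The acoustic input is characterized by the same selected features, providing corresponding points (or vectors) in the space that can be compared with the prototypes.

Specifically, in defining the prototypes, sets of points are grouped into clusters by the cluster element 1110. Methods of forming clusters are based on probability distributions (such as the Gaussian distribution) applied to speech. The prototype of each cluster, relating to the centroid or another characteristic of the cluster, is generated by the prototype element 1112. The generated prototypes and the acoustic input, both characterized by the same selected features, enter the labeller 1114. The labeller 1114 performs a comparison procedure that results in a label being assigned to a particular acoustic input.

Incidentally, techniques for assigning labels to acoustic input have been devised for applications other than speech recognition. Such techniques, and labellers for them, can therefore generally be used in a speech recognition system.

The selection of appropriate features is a key factor in deriving labels that represent the acoustic (speech) wave input. The acoustic processor described here includes an improved feature selection element 1108. According to this acoustic processor, an auditory model is derived and applied in the acoustic processor of a speech recognition system. The auditory model is explained with reference to Fig. 11.

Fig. 11 shows part of the human inner ear. Specifically, an inner hair cell 1200 is shown in detail with end portions 1202 extending into a fluid-containing channel 1204. Upstream from the inner hair cells 1200 are outer hair cells 1206, also shown with end portions 1208 extending into the channel 1204. Nerves that convey information to the brain are associated with the inner hair cells 1200 and the outer hair cells 1206. In particular, neurons undergo electrochemical changes that result in electrical impulses being conveyed along a nerve to the brain for processing. The electrochemical changes are stimulated by the mechanical motion of the basilar membrane 1210.

It has long been recognized that the basilar membrane 1210 serves as a frequency analyzer for acoustic waveform inputs and that portions along the basilar membrane 1210 respond to respective critical frequency bands. The fact that different portions of the basilar membrane 1210 respond to corresponding frequency bands affects the perceived loudness of an acoustic waveform input. That is, the loudness of tones is perceived to be greater when two tones are in different critical frequency bands than when two tones of similar power intensity occupy the same frequency band. It has been found that there are on the order of 22 critical frequency bands defined by the basilar membrane 1210.

Conforming to the frequency response of the basilar membrane 1210, the invention in its preferred form physically divides the acoustic waveform input into some or all of the critical frequency bands and then examines the signal component separately for each defined critical frequency band. This function is achieved by appropriately filtering the signal from the FFT element 1106 (Fig. 10) to provide a separate signal to the feature selection element 1108 for each examined critical frequency band.

The separate inputs are also blocked into time frames (preferably of 25.6 milliseconds) by the time window generator 1104. Hence the feature selection element 1108 preferably receives 22 signals, each of which represents the sound intensity in a given frequency band for each time frame.

The signals are preferably filtered by a conventional critical band filter 1300 of Fig. 12. The signals are then processed separately by an equal-loudness converter 1302, which accounts for perceived loudness variations as a function of frequency. Incidentally, a first tone at a given dB level at one frequency may differ in perceived loudness from a second tone at the same dB level at another frequency. The equal-loudness converter 1302, based on empirical data, converts the signals in the various frequency bands so that each is measured on the same loudness scale. For example, the converter 1302 can map acoustic power to equal loudness based on the 1933 studies of Fletcher and Munson, with some modification. The modified results of those studies are shown in Fig. 13. According to Fig. 13, a 1 kHz tone at 40 dB is comparable in loudness level to a 100 Hz tone at 60 dB.

The equal-loudness converter 1302 adjusts loudness according to the contours of Fig. 13 to produce equal loudness regardless of frequency.

In addition to the dependence on frequency, power changes do not correspond to loudness changes, as is evident from examining a particular frequency in Fig. 13. That is, variations in sound intensity, or amplitude, are not at all points reflected by similar changes in perceived loudness. For example, at a frequency of 100 Hz, the perceived loudness change for a 10 dB change around 110 dB is much larger than the perceived loudness change for a 10 dB change around 20 dB.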
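This difference is addressed by a loudness compression element 1304, which compresses loudness in a predefined manner. The loudness compression element 1304 can compress power P to its cube root P^{1/3} by replacing loudness amplitude measures in phons with sones.

Fig. 14 shows a known, empirically determined relationship between phons and sones. By using sones, the model of the invention remains substantially accurate even at large speech signal amplitudes. One sone is defined as the loudness of a 1 kHz tone at 40 dB.

Fig. 12 also shows a novel time-varying response element 1306. This element acts on the equal-loudness, loudness-compressed signals associated with each critical frequency band. Specifically, for each examined frequency band, a neural firing rate f is determined at each time frame. According to the acoustic processor of the invention, the firing rate f is defined as

$$f = (S_o + D L)\,n \qquad (1)$$

where n is the amount of neurotransmitter; S_o is a spontaneous firing constant relating to neural firing that occurs independently of the acoustic waveform input; L is a measurement of loudness; and D is a displacement constant.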
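S_o·n corresponds to the spontaneous neural firing rate, which occurs whether or not there is an acoustic wave input, and D·L·n corresponds to the firing rate due to the acoustic wave input.

Significantly, the invention is characterized in that the value of n changes over time according to

$$\frac{dn}{dt} = A_o - (S_o + S_h + D L)\,n \qquad (2)$$

where A_o is a replenishment constant and S_h is a spontaneous neurotransmitter decay constant. The novel relationship in equation (2) takes into account that neurotransmitter is produced at a constant rate A_o while being lost through (a) decay (S_h·n), (b) spontaneous firing (S_o·n), and (c) neural firing due to the acoustic wave input (D·L·n). These modelled phenomena are presumed to occur at the locations shown in Fig. 11.

Equation (2) also shows that the acoustic processor of the invention is nonlinear, in that the next amount of neurotransmitter and the next firing rate depend multiplicatively on the current conditions, at least on the current amount of neurotransmitter. That is, the amount of neurotransmitter at time t + Δt equals the amount at time t plus (dn/dt)·Δt:

$$n(t + \Delta t) = n(t) + \frac{dn}{dt}\,\Delta t \qquad (3)$$

Equations (1), (2) and (3) describe the operation of a time-varying signal analyzer. The time-varying signal analyzer reflects the fact that the auditory system is adaptive over time and that the signals on the auditory nerve are nonlinearly related to the acoustic wave input.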
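The update defined by equations (1), (2) and (3) can be sketched in Python as below. This is an illustrative sketch only: the function names are hypothetical, the constants passed in the usage lines are example values, and starting n at A_o/(S_o + S_h) (the level at which dn/dt = 0 for zero loudness) is an assumption made here for the initial condition.

```python
def firing_step(n, loudness, s_o, s_h, d, a_o=1.0, dt=1.0):
    """Advance one time frame for one frequency band.

    Returns (f, n_next): the firing rate for this frame and the
    neurotransmitter amount for the next frame.
    """
    f = (s_o + d * loudness) * n                      # equation (1)
    dn_dt = a_o - (s_o + s_h + d * loudness) * n      # equation (2)
    n_next = n + dn_dt * dt                           # equation (3)
    return f, n_next


def firing_rates(loudness_frames, s_o, s_h, d, a_o=1.0, dt=1.0):
    """Firing rate per frame for a sequence of loudness values (one band)."""
    n = a_o / (s_o + s_h)   # assumed resting level: dn/dt = 0 when L = 0
    rates = []
    for loudness in loudness_frames:
        f, n = firing_step(n, loudness, s_o, s_h, d, a_o, dt)
        rates.append(f)
    return rates


# A constant tone switched on after silence (example constants only).
rates = firing_rates([20.0] * 50, s_o=0.0888, s_h=0.11111, d=0.00666)
```

With a constant loudness the computed rate is highest in the first frame and then decays toward a steady level, which is the adaptive behaviour the time-varying signal analyzer is meant to capture.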
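Incidentally, the acoustic processor of the invention provides the first model that embodies nonlinear signal processing in a speech recognition system so as to conform better to the apparent time variations in the nervous system.

To reduce the number of unknowns in equations (1) and (2), the invention uses the following equation, which applies to a fixed loudness L:

$$S_o + S_h + D L = 1/T \qquad (4)$$

where T is a measure of the time it takes for the auditory response to drop to 37% of its maximum after an audio wave input is generated. T is a function of loudness and, in the acoustic processor of the invention, is derived from known graphs that show the decay of the response for various loudness levels. That is, when a tone of fixed loudness is generated, it initially produces a response at a high level, after which the response decays toward a steady-state level with a time constant T. With no acoustic wave input, T = T_0, which is on the order of 50 milliseconds. For a loudness of L_max, T = T_max, which is on the order of 30 milliseconds. By setting A_o = 1, 1/(S_o + S_h) is determined to be 5 centiseconds when L = 0. When L is L_max and L_max = 20 sones, the following equation results:

$$S_o + S_h + D(20) = 1/30 \qquad (5)$$

With the above data and equations, S_o and S_h are determined by equations (6) and (7):

$$S_o = \frac{D L_{max}}{R + (D L_{max} T_0 R) - 1} \qquad (6)$$

$$S_h = 1/T_0 - S_o \qquad (7)$$

where

$$R = \frac{f_{steady\ state}\big|_{L = L_{max}}}{f_{steady\ state}\big|_{L = 0}} \qquad (8)$$

f_steady state represents the firing rate at a given loudness when dn/dt is zero.

R is the only variable left in the acoustic processor. The performance of the processor is therefore altered just by changing R. That is, R is a single parameter that can be adjusted to alter performance, which normally means minimizing steady-state effects relative to transient effects. It is desirable to minimize steady-state effects because inconsistent output patterns for similar speech inputs generally result from differences in frequency response, speaker differences, background noise, and distortion, all of which affect the steady-state portions of the speech signal but not the transient portions. The value of R is preferably set by optimizing the error rate of the complete speech recognition system. A suitable value found in this way is R = 1.5. The values of S_o and S_h are then 0.0888 and 0.11111 respectively, with D being 0.00666.

Fig. 15 is a flowchart of the operation of the acoustic processor according to the invention. Digitized speech in a 25.6 millisecond time frame, sampled preferably at 20 kHz, passes through a Hanning window 1320, the output of which is preferably subjected to a double Fourier transform (DFT 1322) at 10 millisecond intervals. The transform output is filtered at block 1324 to provide a power density output for each of at least one frequency band (preferably all the critical frequency bands, or at least 20 of them). The power density is then converted at block 1326 from recorded magnitude to loudness level. This operation is readily performed using a modification of the graph of Fig. 13. The subsequent process, including the threshold updating of block 1330, is outlined in Fig. 16.

In Fig. 16, the threshold of feeling T_f and the threshold of hearing T_h for each filtered frequency band m are first set to 120 dB and 0 dB respectively (block 1340). The speech counter, total frames register, and histogram register are then reset (block 1342).

Each histogram includes bins, each of which indicates the number of samples, or counts, during which the power or a similar measure (in a given frequency band) lies in a respective range. In the invention, a histogram preferably represents, for each given frequency band, the number of centiseconds during which the loudness lies in each of a plurality of loudness ranges. For example, in the third frequency band there may be 20 centiseconds between 10 dB and 20 dB in power. Similarly, in the twentieth frequency band, 150 out of a total of 1000 centiseconds may lie between 50 dB and 60 dB. Percentiles are derived from the total number of samples (centiseconds) and the counts contained in the bins.

At block 1344 a frame of the filter output for each frequency band is examined, and at block 1346 the bins in the appropriate histograms (one per filter) are incremented. At block 1348 the total number of bins in which the amplitude exceeds 55 dB is summed for each filter (that is, each frequency band), and the number of filters indicating the presence of speech is determined. At block 1350, if there is not a minimum number of filters (for example, 6 of 20) indicating speech, the next frame is examined at block 1344. If enough filters indicate speech, the speech counter is incremented at block 1352. The speech counter is incremented until 10 seconds of speech have occurred (block 1354), whereupon new values of T_f and T_h are determined for each filter at block 1356.

The new values of T_f and T_h for a given filter are determined as follows. For T_f, the dB value of the bin holding the 35th sample from the top of 1000 bins (that is, the 96.5th percentile of loudness) is defined as BIN_H, and T_f is set to T_f = BIN_H + 40 dB. For T_h, the dB value of the bin holding the (0.01)(total bins − speech count)th value from the lowest bin is defined as BIN_L. That is, BIN_L is the bin in the histogram at 1% of the number of samples, excluding the samples classified as speech. T_h is defined as T_h = BIN_L − 30 dB.

At blocks 1330 and 1332 of Fig. 15, the thresholds are updated as described above, and the sound amplitudes are converted to sones and compressed on the basis of the updated thresholds. An alternative method of deriving sones and compressing is to take the filter amplitudes "a" (after the bins have been incremented) and convert them to dB by

$$a^{dB} = 20 \log_{10}(a) - 10 \qquad (9)$$

Each filter amplitude is then scaled to a range between 0 dB and 120 dB to provide equal loudness:

$$a^{eql} = 120\,\frac{a^{dB} - T_h}{T_f - T_h} \qquad (10)$$

a^{eql} is then preferably converted from a loudness level (in phons) to an approximation of loudness in sones (with a 1 kHz signal at 40 dB mapped to 1) by

$$L^{dB} = (a^{eql} - 30)/4 \qquad (11)$$

The approximate loudness in sones, L_s, is then given by

$$L_s = 10^{L^{dB}/20} \qquad (12)$$

At step 1334, L_s is used as the input to equations (1) and (2) to determine the output firing rate f for each frequency band (block 1335). With 22 frequency bands, a 22-dimensional vector characterizes the acoustic wave input over successive time frames. Generally, however, 20 frequency bands are examined using a conventional mel-scaled filter bank.

Before the next time frame is processed (block 1336), the "next state" of n is determined according to equation (3) at block 1337.

The acoustic processor described above is subject to improvement in applications where the firing rate f and the neurotransmitter amount n have a large DC pedestal. That is, where the dynamic range of the terms in the equations for f and n is important, the following equations are derived to reduce the pedestal height.

In the steady state, with no acoustic wave input signal (L = 0), equation (2) can be solved for the steady-state internal state n′:

$$n' = \frac{A_o}{S_o + S_h} \qquad (13)$$

The internal state of the neurotransmitter amount n(t) can be represented as a steady-state portion and a varying portion:

$$n(t) = n' + n''(t) \qquad (14)$$

Combining equations (1) and (14) gives the following expression for the firing rate:

$$f(t) = (S_o + D L)\,(n' + n''(t)) \qquad (15)$$

The term S_o·n′ is a constant, while all other terms include either the varying part of n or the input signal represented by (D·L). Because later processing involves only the squared differences between output vectors, constant terms can be disregarded. From equations (15) and (13):

$$f''(t) = (S_o + D L)\,n''(t) + \frac{D L A_o}{S_o + S_h} \qquad (16)$$

Considering equation (3), with Δt taken as one time step, the "next state" becomes:

$$n(t + \Delta t) = n'(t + \Delta t) + n''(t + \Delta t) \qquad (17)$$

$$n''(t + \Delta t) = n''(t) + A_o - (S_o + S_h + D L)\,(n' + n''(t)) \qquad (18)$$

$$n''(t + \Delta t) = n''(t) - S_h\,n''(t) - (S_o + D L)\,n''(t) - \frac{A_o D L}{S_o + S_h} + A_o - \frac{S_o A_o + S_h A_o}{S_o + S_h} \qquad (19)$$

Ignoring all constant terms, equation (19) becomes:

$$n''(t + \Delta t) = n''(t)\,(1 - S_h\,\Delta t) - f''(t) \qquad (20)$$

Equations (16) and (20) constitute the output equation and the state update equation applied to each filter during each 10 millisecond time frame. The result of applying these equations is a 20-element vector every 10 milliseconds, each element of which corresponds to the firing rate for a respective frequency band in the mel-scaled filter bank.

With respect to the embodiment just described, the flowchart of Fig. 16 applies except that the equations for f, dn/dt and n(t + Δt) are replaced by equations (16) and (20), which define special-case expressions for the firing rate f and the "next state" n(t + Δt) respectively.

The values attributed to the terms in the various equations (namely T_0 = 5 csec, T_{L_max} = 3 csec, A_o = 1, R = 1.5 and L_max = 20) may be set otherwise, and the terms S_o, S_h and D then differ from their preferred values of 0.0888, 0.11111 and 0.00666 respectively.

The invention may be practiced in various software or hardware.

F1c Detailed Match (Figs. 6 and 17)

Fig. 6 shows, by way of example, a detailed match phoneme machine 2000. Each detailed match phoneme machine is a probabilistic finite-state machine characterized by (a) a plurality of states S_i; (b) a plurality of transitions tr(S_j|S_i), some between different states and some from a state back to itself, each transition having a corresponding probability; and (c) for each label that can be generated at a particular transition, a corresponding actual label probability.

In Fig. 6, seven states S₁ to S₇ and thirteen transitions tr1 to tr13 are provided in the detailed match phoneme machine 2000. The paths of three of these transitions, tr11, tr12 and tr13, are shown as dashed lines.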
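At each of these three transitions, the phoneme can change from one state to another without producing a label. Such a transition is therefore called a null transition. Labels can be produced along transitions tr1 to tr10. Specifically, along each of the transitions tr1 to tr10, one or more labels may have a distinct probability of being generated there. For each transition there is a probability associated with each label that can be generated in the system. That is, if there are 200 labels that can be selectively generated by the acoustic channel, each (non-null) transition has 200 "actual label probabilities" associated with it, each of which corresponds to the probability that the corresponding label is generated by the phoneme at that particular transition. The actual label probabilities for transition tr1 are represented, as shown, by the symbol P followed by a bracketed column of the numerals 1 to 200, each numeral representing a given label. For label 1 there is a probability P[1] that the detailed match phoneme machine 2000 generates label 1 at transition tr1. The various actual label probabilities are stored in association with the label and the corresponding transition.

When a string of labels y₁y₂y₃… is presented to the detailed match phoneme machine 2000 corresponding to a given phoneme, a match procedure is performed. The procedure associated with the detailed match phoneme machine is explained with reference to Fig. 17.

Fig. 17 is a trellis diagram of the phoneme machine of Fig. 6. As in the phoneme machine, the trellis diagram shows a null transition from state S₁ to state S₇, a transition from state S₁ to state S₂, and a transition from state S₁ to state S₄. Transitions between other states are also shown. The trellis diagram also shows time measured in the horizontal direction. The start-time probabilities q₀ and q₁ represent the probabilities that the phoneme has a start time at t = t₀ or t = t₁ respectively. The various transitions at each start time are also shown. Incidentally, the interval between successive start (and end) times is preferably equal in length to the time interval of a label.

In using the detailed match phoneme machine 2000 to determine how closely a given phoneme matches the labels of an incoming string, an end-time distribution for the phoneme is sought and used in determining a match value for the phoneme. Reliance on the end-time distribution to perform the detailed match is common to all embodiments of phoneme machines discussed herein in relation to a matching procedure. In generating the end-time distribution to perform a detailed match, the detailed match phoneme machine 2000 involves computations that are exact and complicated.

First, with reference to the trellis diagram of Fig. 17, consider the computations required to have both a start time and an end time at t = t₀. For the example phoneme machine structure shown in Fig. 6, the following probability applies:

$$\Pr(S_7, t = t_0) = q_0\,T(1 \to 7) + \Pr(S_2, t = t_0)\,T(2 \to 7) + \Pr(S_3, t = t_0)\,T(3 \to 7) \qquad (21)$$

where Pr denotes a probability and T denotes the transition probability between the two states in parentheses. This equation gives the respective probabilities for the three conditions under which the end time can occur at t = t₀. Moreover, an end time at t = t₀ is limited to the current occurrence in state S₇.

Next, looking at the end time t = t₁, a calculation relating to every state other than S₁ must be made. State S₁ starts at the end time of the previous phoneme. For purposes of explanation, only the calculation for state S₄ is shown. For S₄ the calculation is:

$$\Pr(S_4, t = t_1) = \Pr(S_1, t = t_0)\,T(1 \to 4)\,\Pr(y_1 \mid 1 \to 4) + \Pr(S_4, t = t_0)\,T(4 \to 4)\,\Pr(y_1 \mid 4 \to 4) \qquad (22)$$

Equation (22) shows that the probability of the phoneme machine being in state S₄ at t = t₁ is the sum of two terms: (a) the probability of being in state S₁ at time t = t₀, multiplied by the probability of the transition from state S₁ to state S₄, and further multiplied by the probability that the given label y in the string is produced on the transition from state S₁ to state S₄; and (b) the probability of being in state S₄ at time t = t₀, multiplied by the probability of the transition from state S₄ to itself, and further multiplied by the probability of producing the given label y during that transition from state S₄ to itself.

Similarly, calculations for the other states (excluding state S₁) are performed to generate the corresponding probabilities that the phoneme is in a particular state at time t = t₁. In general, in determining the probability of being in a subject state at a given time, the detailed match (a) recognizes each previous state that has a transition leading to the subject state, together with the respective probability of each such previous state; (b) recognizes, for each such previous state, a value representing the probability of the label that must be generated at the transition between that previous state and the current state in order to conform to the label string; and (c) combines the probability of each previous state with the respective value representing the label probability to provide a subject-state probability over the corresponding transition.

The overall probability of being in the subject state is determined from the subject-state probabilities over all transitions leading to it.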
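The state-probability recursion described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the APAL implementation: the function name, the arc representation, and the three-state toy machine are hypothetical, states are numbered from 0 with the last state as the final state, and null transitions are assumed to run only from a lower-numbered to a higher-numbered state so that they can be applied in a single pass per time step.

```python
def end_time_distribution(transitions, num_states, labels, start_probs):
    """Probability of being in the final state at each time 0..len(labels).

    transitions: list of (src, dst, trans_prob, label_probs), where
                 label_probs maps label -> probability, or is None for a
                 null transition (no label produced).
    start_probs: start_probs[t] is the probability the phoneme starts at t.
    """
    label_arcs = [a for a in transitions if a[3] is not None]
    # Null arcs sorted by source so probability flows forward in one pass.
    null_arcs = sorted((a for a in transitions if a[3] is None),
                       key=lambda a: a[0])
    prev = None
    end_times = []
    for t in range(len(labels) + 1):
        cur = [0.0] * num_states
        if t < len(start_probs):
            cur[0] += start_probs[t]           # phoneme may start at time t
        if prev is not None:
            y = labels[t - 1]                  # label produced between t-1 and t
            for src, dst, p, label_probs in label_arcs:
                # previous-state probability x transition x label probability
                cur[dst] += prev[src] * p * label_probs.get(y, 0.0)
        for src, dst, p, _ in null_arcs:
            cur[dst] += cur[src] * p           # no label consumed
        end_times.append(cur[-1])
        prev = cur
    return end_times


# Toy 3-state machine (made-up probabilities), one null arc from 0 to 2.
arcs = [
    (0, 1, 0.8, {"a": 0.5, "b": 0.5}),
    (1, 1, 0.4, {"a": 0.1, "b": 0.9}),
    (1, 2, 0.6, {"b": 1.0}),
    (0, 2, 0.2, None),
]
ends = end_time_distribution(arcs, 3, ["a", "b"], [1.0])
```

Each entry of `ends` plays the role of one value of the end-time distribution; the null arc is what allows a nonzero probability of ending at the same time the phoneme starts.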
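The calculation for state S₇ includes terms relating to the three null transitions, which permit the phoneme to start and end at time t = t₁ with the phoneme ending in state S₇. As with the determination of the probabilities for times t = t₀ and t = t₁, the probabilities for a series of other end times are preferably determined to form an end-time distribution. The value of the end-time distribution for a given phoneme indicates how well the given phoneme matches the incoming labels.

In determining how well a word matches a string of incoming labels, the phonemes that represent the word are processed in sequence. Each phoneme generates an end-time distribution of probability values. The match value for a phoneme is obtained by summing the end-time probabilities and taking the logarithm of that sum. The start-time distribution for the next phoneme is derived by normalizing the end-time distribution, for example by scaling each of its values by dividing it by their sum, so that the scaled values sum to one.

There are at least two methods of determining h, the number of phonemes to be examined for a given word or word string. In a depth-first method, the computation is made along a baseform, computing a running subtotal with each successive phoneme. When the subtotal is found to be below a predefined threshold for a given phoneme position along the baseform, the computation terminates. In the other method, a breadth-first method, the computation is made for similar phoneme positions in each word. The computations proceed in order: the first phoneme of each word, then the second phoneme of each word, and so on. In the breadth-first method, the computations along the same number of phonemes for the various words are compared at the same relative phoneme positions. In either method, the word having the largest sum of match values is the sought object.

The detailed match has been implemented in APAL (Array Processor Assembly Language), which is the native assembler for the Floating Point Systems, Inc. 190L. Incidentally, the detailed match requires considerable memory for storing each of the actual label probabilities (that is, the probability that a given phoneme generates a given label y at a given transition), the transition probabilities for each phoneme machine, and the probabilities of a given phoneme being in a given state at a given time after a defined start time. The 190L is set up to make the various computations of end times; match values based, preferably, on the logarithmic sum of end-time probabilities; start times based on the previously generated end-time probabilities; and word match scores based on the match values for sequential phonemes in a word. In addition, the detailed match preferably accounts for tail probabilities in the matching procedure. A tail probability measures the likelihood of successive labels without regard to words. In a simple embodiment, a given tail probability corresponds to the likelihood of a label following another label. This likelihood is readily determined from the strings of labels generated by, for example, some sample speech.

The detailed match therefore provides sufficient storage to contain the baseforms, the statistics for the Markov models, and the tail probabilities. For a 5000-word vocabulary in which each word comprises about 10 phonemes, the baseforms have a memory requirement of 5000 × 10. Where there are 70 distinct phonemes (with a Markov model for each phoneme), 200 distinct labels, and ten transitions at which any label has a probability of being produced, the statistics require 70 × 10 × 200 storage locations. However, the phoneme machines are preferably divided into three portions (a start portion, a middle portion, and an end portion), with the statistics table corresponding to that division.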
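(One of the three self-loops is preferably included in each successive portion.) The storage requirement is accordingly reduced to 70 × 3 × 200. For the tail probabilities, 200 × 200 storage locations are needed. In this arrangement, 50K of integer storage and 82K of floating-point storage perform satisfactorily.

Furthermore, whereas earlier systems include 70 different phonemes, the invention provides about 96 phonemes with respective phoneme machines.

F1d Basic Fast Match (Figs. 18 to 20)

Because the detailed match is computationally expensive, a basic fast match and an alternative fast match are provided, which reduce the required computation without sacrificing much accuracy. The fast match is preferably used in conjunction with the detailed match: the fast match lists likely candidate words from the vocabulary, and the detailed match is performed on, at most, the candidate words on this list.

A fast approximate acoustic matching technique is described in the aforementioned U.S. application Ser. No. 06/672,974 (filed November 19, 1984). In the fast approximate acoustic match, each phoneme machine is preferably simplified by replacing the actual label probability for each label at all transitions in a given phoneme machine with a specific replacement value. The specific replacement value is preferably selected so that the match value for a given phoneme when the replacement values are used is an overestimate of the match value achieved by the detailed match when the replacement values do not replace the actual label probabilities. One way of assuring this condition is to select each replacement value so that no probability corresponding to a given label in a given phoneme machine is greater than its replacement value. By substituting the corresponding replacement values for the actual label probabilities in a phoneme machine, the number of computations required in determining a match score for a word is greatly reduced. Moreover, because the replacement value is preferably an overestimate, the resulting match score is not less than the score that would have been determined without the replacement.

In a specific embodiment of performing an acoustic match in a linguistic decoder with Markov models, each phoneme machine is characterized, by training, as having (a) a plurality of states and transition paths between states; (b) transitions tr(S_j|S_i) having probabilities T(i→j), each of which represents the probability of a transition to state S_j given the current state S_i, where S_i and S_j may be the same state or different states; and (c) actual label probabilities, where each actual label probability p(y_k|i→j) indicates the probability that a label y_k (k being a label identifier) is produced by the given phoneme machine at a given transition from one state to the next. Each phoneme machine includes (a) means for assigning to each y_k in the phoneme machine a single specific value p′(y_k), and (b) means for replacing each actual output probability p(y_k|i→j) at each transition in the given phoneme machine with the single specific value p′(y_k) assigned to the corresponding y_k. The replacement value is preferably at least as great as the maximum actual label probability for the corresponding label y_k at any transition in the particular phoneme machine.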
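The choice of replacement values just described can be illustrated with a short Python sketch. The function name, the dictionary layout keyed by (from-state, to-state), and the probabilities are hypothetical; the sketch simply takes, for each label, the largest actual label probability found at any transition of one phoneme machine, which is the smallest value that satisfies the overestimation condition.

```python
def replacement_values(label_probs_by_transition):
    """Return p'(y_k) = max over transitions (i -> j) of p(y_k | i -> j).

    label_probs_by_transition maps a transition key to a dict of
    label -> actual label probability for that transition.
    """
    replacements = {}
    for label_probs in label_probs_by_transition.values():
        for label, prob in label_probs.items():
            replacements[label] = max(replacements.get(label, 0.0), prob)
    return replacements


# Made-up actual label probabilities for three non-null transitions.
actual = {
    (1, 2): {"y1": 0.6, "y2": 0.4},
    (2, 2): {"y1": 0.1, "y2": 0.9},
    (2, 3): {"y1": 0.3, "y2": 0.7},
}
p_prime = replacement_values(actual)
```

Note that the replacement values for one machine need not sum to one; they are upper bounds used for scoring, not a probability distribution.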
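The fast match embodiments are employed to define a list of on the order of 10 to 100 candidate words, selected as the words in the vocabulary most likely to correspond to the incoming labels. The candidate words are preferably subjected to the language model and to the detailed match. By paring the number of words considered by the detailed match to about 1% of the words in the vocabulary, the computational cost is greatly reduced while accuracy is maintained.

The basic fast match simplifies the detailed match by replacing with a single value the actual label probabilities for a given label at all transitions at which the given label may be generated in a given phoneme machine. That is, regardless of the transition in a given phoneme machine at which a label has a probability of occurring, the probability is replaced by a single specific value. The value is preferably an overestimate, at least as great as the largest probability of the label occurring at any transition in the given phoneme machine.

By setting the label probability replacement value as the maximum of the actual label probabilities for the given label in the given phoneme machine, it is assured that a match value generated with the basic fast match is at least as high as the match value that would result from the detailed match. In this way the basic fast match typically overestimates the match value of each phoneme, so that more words are generally selected as candidate words. Words considered candidates according to the detailed match also pass according to the basic fast match.

Fig. 18 shows a basic fast match phoneme machine 3000. Labels (also referred to as symbols and fenemes) enter the basic fast match phoneme machine 3000 together with a start-time distribution. The start-time distribution and the label string input are like those entering the detailed match phoneme machine described above. The start time may on occasion not be a distribution over a plurality of times but may instead represent a precise time, for example following an interval of silence, at which the phoneme begins. When speech is continuous, however, the end-time distribution is used to define the start-time distribution, as discussed in detail below. The basic fast match phoneme machine 3000 generates an end-time distribution and, from the generated end-time distribution, a match value for the particular phoneme. The match score for a word is defined as the sum of the match values for the component phonemes (at least the first h phonemes in the word).

Fig. 19 illustrates the basic fast match computation. The basic fast match computation is concerned only with the start-time distribution, the number or length of labels produced by the phoneme, and the replacement values p′(y_k) associated with each label y_k. By substituting the corresponding replacement value for all actual label probabilities for a given label in a given phoneme machine, the basic fast match replaces transition probabilities with length distribution probabilities and obviates the need to include the actual label probabilities (which can differ for each transition in a given phoneme machine) and the probabilities of being in a given state at a given time.

Incidentally, the length distributions are determined from the detailed match model. Specifically, for each length in the length distribution, the procedure preferably examines each state individually and determines, for each state, the various transition paths by which the currently examined state can occur (a) given a particular label length and (b) regardless of the outputs along the transitions. The probabilities for all transition paths of the particular length to each subject state are summed, and the sums for all the subject states are then added to indicate the probability of a given length in the distribution. This procedure is repeated for each length. In accordance with the preferred form of the matching procedure, these computations are made with reference to a trellis diagram, as is known in the art of Markov modelling. For transition paths that share branches along the trellis structure, the computation for each common branch need be made only once, and the result is applied to each path that includes the common branch.

In Fig. 19, two limitations are included by way of example. First, it is assumed that the length of labels produced by the phoneme can be 0, 1, 2 or 3, with respective probabilities l₀, l₁, l₂ and l₃. The start time is also limited, only four start times being permitted, with respective probabilities q₀, q₁, q₂ and q₃. That is, L(l₀, l₁, l₂, l₃) and Q(q₀, q₁, q₂, q₃) are assumed. With these limitations, the following equations define the end-time distribution of the subject phoneme, where p_i abbreviates the replacement value p′(y_i) of the i-th label:

$$\Phi_0 = q_0 l_0$$
$$\Phi_1 = q_1 l_0 + q_0 l_1 p_1$$
$$\Phi_2 = q_2 l_0 + q_1 l_1 p_2 + q_0 l_2 p_1 p_2$$
$$\Phi_3 = q_3 l_0 + q_2 l_1 p_3 + q_1 l_2 p_2 p_3 + q_0 l_3 p_1 p_2 p_3$$
$$\Phi_4 = q_3 l_1 p_4 + q_2 l_2 p_3 p_4 + q_1 l_3 p_2 p_3 p_4$$
$$\Phi_5 = q_3 l_2 p_4 p_5 + q_2 l_3 p_3 p_4 p_5$$
$$\Phi_6 = q_3 l_3 p_4 p_5 p_6$$

Examining these equations, it is observed that Φ₃ includes a term corresponding to each of the four start times. The first term represents the probability that the phoneme starts at time t = t₃ and produces a length of zero labels (the phoneme starting and ending at the same time). The second term represents the probability that the phoneme starts at time t = t₂, that the length of labels is one, and that label 3 is produced by the phoneme. The third term represents the probability that the phoneme starts at time t = t₁, that the length of labels is two (namely labels 2 and 3), and that labels 2 and 3 are produced by the phoneme. Similarly, the fourth term represents the probability that the phoneme starts at time t = t₀, that the length of labels is three, and that the three labels 1, 2 and 3 are produced by the phoneme.

Comparing the computations required in the basic fast match with those required by the detailed match, the relative simplicity of the former is evident. Incidentally, the value of p′(y) remains the same for each appearance in all the equations, as do the label length probabilities. Moreover, with the length and start-time limitations, the computations for the later end times become simpler. For example, at Φ₆ the phoneme must start at time t = t₃, and all three labels 4, 5 and 6 must be produced by the phoneme for that end time to apply.

In generating a match value for a subject phoneme, the end-time probabilities along the defined end-time distribution are summed. If desired, the logarithm of the sum is taken:

$$\text{match value} = \log_{10}(\Phi_0 + \cdots + \Phi_6)$$

As noted above, the match score for a word is readily determined by summing the match values for the successive phonemes in the particular word.

The generation of the start-time distribution is next described with reference to Fig. 20. In Fig. 20(a), the word THE₁ is repeated, broken down into its component phonemes.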
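In Fig. 20(b), the string of labels is depicted along the time axis. Fig. 20(c) shows a first start-time distribution. The first start-time distribution has been derived from the end-time distribution of the most recent previous phoneme (in the previous word, which may include a "word" of silence). Based on the label inputs and the start-time distribution of Fig. 20(c), the end-time distribution Φ_DH for the phoneme DH is generated (Fig. 20(d)). The start-time distribution for the next phoneme, UH1, is determined by recognizing the times at which the previous phoneme's end-time distribution exceeded a threshold (A) in Fig. 20(d). (A) is determined individually for each end-time distribution.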
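(A) is a function of the sum of the end-time distribution values for the subject phoneme. The interval between times a and b thus represents the time during which the start-time distribution for the phoneme UH1 is set.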
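In Fig. 20(e), the interval between times c and d corresponds to the times during which the end-time distribution for the phoneme DH exceeds the threshold (A) and during which the start-time distribution of the next phoneme is set. The values of the start-time distribution are obtained by normalizing the end-time distribution, for example by dividing each end-time value by the sum of the end-time values that exceed the threshold (A).

The basic fast match phoneme machine 3000 has been implemented in an APAL program on the aforementioned Floating Point Systems, Inc. 190L. Other hardware and software may also be used to develop a specific form of the invention by following the teachings set forth herein.

F1e Alternative Fast Match (Figs. 21 and 22)

The basic fast match, employed alone or preferably in conjunction with the detailed match and the language model, greatly reduces computation requirements. To reduce them further, the invention further simplifies the detailed match by defining a uniform label length distribution between two lengths: a minimum length L_min and a maximum length L_max. In the basic fast match, the probabilities of a phoneme generating labels of a given length (namely l₀, l₁, l₂, and so on) typically have differing values. In the alternative fast match, the probability for each label length is replaced by a single uniform value.

The minimum length is preferably equal to the smallest length having a nonzero probability in the original length distribution, although other lengths may be selected if desired. The selection of the maximum length is more arbitrary than the selection of the minimum length, but the probabilities of lengths less than the minimum and greater than the maximum are set to zero. By defining the length probability to exist only between the minimum length and the maximum length, a uniform pseudo-distribution can be set forth. In one approach, the uniform probability can be set as the average probability over the pseudo-distribution. Alternatively, the uniform probability can be set as the maximum of the length probabilities that are replaced by the uniform value.

The effect of making all the label length probabilities equal is readily observed with reference to the end-time distribution equations of the basic fast match set forth above. Specifically, the length probabilities can be factored out as a constant.

With L_min set to zero and all length probabilities replaced by a single constant value, the end-time distribution can be characterized as:

$$\Theta_m = \Phi_m / l = q_m + \Theta_{m-1}\,p_m \qquad (23)$$

where l is the single uniform replacement value and the value of p_m preferably corresponds to the replacement value for the given label generated in the given phoneme at time m.

For the above equation for Θ_m, the match value is defined as:

$$\text{match value} = \log_{10}(\Theta_0 + \Theta_1 + \cdots + \Theta_m) + \log_{10}(l) \qquad (24)$$

Comparing the basic fast match and the alternative fast match, the number of required additions and multiplications is greatly reduced by employing the alternative fast match phoneme machines. With L_min = 0, it has been found that the basic fast match requires forty multiplications and twenty additions, because the length probabilities must be considered. In the alternative fast match, Θ_m is determined recursively and requires one multiplication and one addition for each successive Θ_m.

Figs. 21 and 22 show in detail how the alternative fast match simplifies the computations. Fig. 21(a) shows an embodiment of a phoneme machine 3100 corresponding to a minimum length L_min = 0. The maximum length is assumed to be infinite so that the length distribution may be characterized as uniform. Fig. 21(b) shows the trellis diagram that results from the phoneme machine 3100.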
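Assuming that start times after q_n lie outside the start-time distribution, every determination of a successive Θ_m with m < n requires one addition and one multiplication. For determinations of end times thereafter, only one multiplication is required and no addition.

Fig. 22(a) shows an embodiment of a specific phoneme machine 3200 with a minimum length L_min = 4, and Fig. 22(b) shows the corresponding trellis diagram. Because L_min = 4, the trellis diagram of Fig. 22(b) has zero probability along the paths marked U, V, W and Z.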
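For end times between θ_4 and θ_n, four multiplications and one addition are required. For end times later than n + 4, a single multiplication suffices and no addition is needed. This embodiment has been implemented in APAL code on the aforementioned FPS 190L.
Additional states may be added to the embodiments of FIG. 21 or FIG. 22 as desired. For example, any number of states with null transitions may be included without changing the value of L_min.
F1f Matching based on the first J labels (FIG. 22)
As a further refinement of the basic and alternative fast match, only the first J labels of the string entering a phone machine are considered in the match. Assuming that the acoustic processor of the acoustic channel produces labels at a rate of one per centisecond, a reasonable value of J is 100. In other words, labels corresponding to about one second of speech are supplied to determine the match between a phone and the labels entering the phone machine. Limiting the number of labels examined has two advantages. First, decoding delay is reduced. Second, the problem of comparing the scores of short words with those of long words is largely avoided. The length J can of course be varied as desired.
The effect of limiting the number of labels examined can be seen in the trellis of FIG. 22(b). Without this refinement, the fast match score is the sum of the probabilities θ_m along the bottom row of the diagram. That is, the probability of being in state S_4 at each time starting at t = t_0 (for L_min = 0) or t = t_4 (for L_min = 4) is determined as a θ_m, and all the θ_m are then summed. For L_min = 4, the probability of being in state S_4 at any time before t_4 is zero. With the refinement, the summing of the θ_m stops at time J. In FIG. 22(b), time J corresponds to time t_{n+2}.
Terminating the examination of the J labels at time J results in two probability sums when the match score is determined. First, as before, there is a row calculation along the bottom row of the trellis, but only up to time J − 1. The probabilities of being in state S_4 at each time up to J − 1 are summed to give a row score. Second, there is a column score, which is the sum of the probabilities that the phone is in each of the states S_0 through S_4 at time J. The column score is computed as follows.
$$\text{column score} = \sum_{f=0}^{4} \Pr(S_f, J) \qquad (25)$$
The match score of the phone is obtained by adding the row score and the column score and taking the logarithm of the sum. To continue the fast match for the next phone, the values along the bottom row (preferably including time J) are used to derive the start-time distribution of the next phone.
After a match score has been determined for each of the consecutive phones, the total for all the phones is, as noted above, the sum of their match scores.
Examining how the end-time probabilities are generated in the basic and alternative fast match embodiments shows that the column score does not fit readily into the fast match computation. To adapt the label-limiting refinement better to the fast match and the alternative match, the invention allows the column score to be replaced by an additional row score. That is, an additional row score is determined for the phone being in state S_4 (in FIG. 22(b)) between times J and J + K, where K is the maximum number of states in any phone machine. Hence, if any phone machine has 10 states, the refinement adds 10 end times along the bottom row of the trellis and determines a probability for each. All the probabilities along the bottom row up to and including time J + K are added to give the match score for the given phone. As before, the match values of consecutive phones are summed to give a match score for the word. This embodiment has been implemented in APAL code on the FPS 190L mentioned above but, like other parts of the system, may be implemented in other code on other hardware.
F1g Phone tree structure and fast match embodiments (FIG. 23)
Using the basic or alternative fast match, with or without the maximum-label limit, greatly reduces the computation time needed to determine phone match values.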
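As a rough illustration of the label-limited score of section F1f, the sketch below adds a row score (final state, times 0 to J − 1) to the column score of equation (25) and takes the logarithm. The function name, the list-of-lists trellis layout, the use of the natural logarithm and the toy probabilities are assumptions made for illustration; nothing here is taken from the APAL implementation.

```python
import math

def limited_match_score(trellis, J):
    """Score a phone from a trellis truncated at time J.

    trellis[t][f] is assumed to hold Pr(S_f, t), the probability of being
    in state S_f at time t; the last state is the final state (S_4 in FIG. 22).
    """
    final = len(trellis[0]) - 1
    # Row score: probability of being in the final state at times 0 .. J-1.
    row_score = sum(trellis[t][final] for t in range(J))
    # Column score, equation (25): sum over all states at time J.
    column_score = sum(trellis[J])
    return math.log(row_score + column_score)

# Toy trellis (hypothetical numbers): 4 time steps, 5 states S_0..S_4.
toy = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.2, 0.5, 0.1, 0.0, 0.1],
    [0.0, 0.2, 0.3, 0.2, 0.2],
    [0.0, 0.0, 0.1, 0.2, 0.4],
]
score = limited_match_score(toy, J=2)
```

With these toy numbers the row score is 0.1 and the column score is 0.9, so the summed probability is 1 and the score is close to zero. Replacing the column score by extra row terms up to J + K, as the text goes on to describe, would only change the two summation ranges.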
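Moreover, the computational saving remains large even when the detailed match is performed on the words in the list produced by the fast match.
Once determined, the phone match values are compared along the branches of a tree structure 4100, as shown in FIG. 23, to determine which paths of phones are most probable. In FIG. 23, the sum of the phone match values for phones DH and UH1 of the spoken word "the" (emanating from point 4102 to branch 4104) should be much higher than for any of the phone sequences branching from phone MX. Note that the phone match value of the first phone MX is computed only once and is then used for every baseform extending from it (see branches 4104 and 4106). In addition, when the total score computed along a first sequence of branches is found to be much lower than a threshold, or much lower than the total score of other branch sequences, all baseforms extending from the first sequence may be eliminated as candidate words at once. For example, the baseforms associated with branches 4108 through 4118 are discarded together when MX is determined not to be a likely path.
With the fast match embodiments and the tree structure, an ordered list of candidate words is produced with a large saving in computation.
As for storage requirements, the phone tree structure, the phone statistics and the tail probabilities must be stored. The tree has 25,000 arcs, each characterized by four data words. The first data word is an index to the successor arcs, or phones. The second gives the number of successor phones along the branch. The third indicates at which node of the tree the arc is located. The fourth identifies the current phone. The tree therefore needs 25,000 × 4 storage locations. The fast match uses 100 distinct phones and 200 distinct fenemes. Because a feneme has a single probability of being produced anywhere in a phone, storage for 100 × 200 statistical probabilities is required. The tail probabilities need 200 × 200 storage locations. Storage for 100K integers and 60K floating-point values is therefore sufficient for the fast match.
F1h Language model (FIGS. 4 and 24)
As noted above, including a language model that stores information about words in context, such as trigrams, raises the probability of selecting the correct word. Language models are described in the paper cited above.
The language model 1010 (FIG. 4) preferably has a distinctive character: a modified trigram method is used. A sample text is examined to determine the likelihood of each ordered word triplet, each ordered word pair and each single word in the vocabulary. A list of the most likely triplets and a list of the most likely pairs are formed. In addition, the likelihood of a triplet not being in the triplet list and the likelihood of a pair not being in the pair list are determined.
Under the language model, when a subject word follows two words, it is determined whether the subject word and the two preceding words are in the triplet list. If so, the stored probability assigned to that triplet is used. If not, it is determined whether the subject word and its immediately preceding word are in the pair list. If so, the probability of the pair is multiplied by the probability of a triplet not being in the triplet list, and the product is assigned to the subject word. If neither the triplet nor the pair containing the subject word is in its list, the probability of the subject word alone is multiplied by the probability of a triplet not being in the triplet list and by the probability of a pair not being in the pair list, and the product is assigned to the subject word.
Flowchart 5000 of FIG. 24 shows the training of the phone machines used in acoustic matching. In block 5002, a vocabulary of words, typically on the order of 5,000 words, is defined. Each word is then represented by a sequence of phone machines. The phone machines are shown, by way of example, as phonetic phone machines, but a sequence of fenemic phones may be used instead. The representation of words by a sequence of phonetic phone machines or of fenemic phone machines is described below. The phone machine sequence for a word is called a word baseform.
In block 5006, the word baseforms are arranged in the tree structure described above. The statistics for each phone machine in each word baseform are determined by training with the well-known forward-backward algorithm described in F. Jelinek, "Continuous Speech Recognition by Statistical Methods", Proceedings of the IEEE, Vol. 64, 1976 (block 5008).
In block 5009, values are determined to be substituted for the actual parameter values, or statistics, used in the detailed match; for example, substitute values for the actual label output probabilities. In block 5010, the determined values replace the stored actual probabilities so that the phones in each word baseform contain the approximate substitute values. All approximations for the basic fast match are performed in block 5010.
Block 5011 then decides whether the acoustic match is to be enhanced. If not, the values determined for the basic approximate match are set for use and no further estimates for other approximations are set (block 5012). If enhancement is desired, block 5018 follows, in which a uniform string-length distribution is defined; block 5020 then decides whether further enhancement is desired. If not, the label output probability values and string-length probability values are approximated and set for use in the acoustic match. If further enhancement is desired, block 5022 limits the acoustic match to the first J labels of the generated string. Whether or not one of the enhanced embodiments is selected, the determined parameter values are set in block 5012, so that each phone machine in each word baseform is trained with the desired approximations, enabling the fast approximate match.
F1j Stack decoder (FIGS. 7 to 9 and FIG. 25)
A preferred stack decoder for use in the speech recognition system of FIG. 4 is described next.
FIG. 7 shows a plurality of successive labels y_1 y_2 ... generated at successive label intervals, or label positions.
FIG. 8 shows a plurality of generated word paths, namely path A, path B and path C. In the context of FIG. 7, path A could correspond to the entry "to be or", path B to the entry "two b", and path C to the entry "too".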
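For each subject word path there is a label (or, equivalently, a label interval) at which the path has the highest probability of having ended; such a label is called the boundary label.
For a word path W representing a sequence of words, the most likely end time, represented in the label string as the boundary label between two words, can be found by known methods such as that described in L. R. Bahl et al., "Faster Acoustic Match Computation", IBM Technical Disclosure Bulletin, Vol. 23, No. 4, September 1980. Briefly, the article discusses ways of addressing two related concerns:
(a) how much of a label string is accounted for by a word (or word sequence), and
(b) at which label interval a partial sentence, corresponding to part of the label string, ends.
For any given word path there is a "likelihood value" associated with each label, or label interval, from the first label of the label string through the boundary label. Taken together, the likelihood values of a given word path form its "likelihood vector", so each word path has a corresponding likelihood vector. Likelihood values L_t are illustrated in FIG. 8.
The "likelihood envelope" Λ_t at label interval t for a collection of word paths W^1, W^2, ..., W^s is defined mathematically as
$$\Lambda_t = \max\bigl(L_t(W^1), \ldots, L_t(W^s)\bigr)$$
That is, at each label interval the likelihood envelope contains the highest likelihood value associated with any word path in the collection. A likelihood envelope 1040 is shown in FIG. 8.
A word path is considered "complete" if it corresponds to a complete sentence. A complete path is preferably identified by the speaker signaling, for example by pressing a button, when the end of a sentence is reached. This input is synchronized with the label interval that marks the sentence end. A complete word path cannot be extended by appending words to it. A partial word path corresponds to an incomplete sentence and can be extended.
Partial paths are classified as "live" or "dead". A word path is "dead" if it has already been extended and "live" if it has not.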
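The likelihood envelope defined above is an element-wise maximum over the likelihood vectors of the word paths. A minimal sketch follows, assuming each likelihood vector is a Python list running from the first label to the path's boundary label, and that a path contributes nothing beyond its boundary label. The path names echo FIG. 8, but the numbers are hypothetical.

```python
def likelihood_envelope(likelihood_vectors, num_intervals):
    """Return, for each label interval t, the maximum of L_t over all word paths.

    likelihood_vectors maps a word path name to its likelihood values for
    intervals 0 .. boundary label. Intervals that no path reaches stay at
    minus infinity (an assumption of this sketch).
    """
    envelope = [float("-inf")] * num_intervals
    for values in likelihood_vectors.values():
        for t, value in enumerate(values):
            envelope[t] = max(envelope[t], value)
    return envelope

# Hypothetical log-likelihood vectors for three word paths.
paths = {
    "A": [-1.0, -2.5, -3.0, -4.2],
    "B": [-1.5, -2.0, -3.5],
    "C": [-0.8, -2.2],
}
envelope = likelihood_envelope(paths, num_intervals=4)
```

Here path C supplies the envelope value at the first interval, path B at the second, and path A at the last two, which is the sense in which the envelope holds the best value of any path at each interval.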
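With this classification, a path that has already been extended to form one or more longer word paths is never considered for extension again at a later time.
Each word path can also be characterized as "good" or "bad" relative to the likelihood envelope. A word path is good if, at the label corresponding to its boundary label, its likelihood value is within Δ of the maximum likelihood envelope; otherwise it is bad. Preferably, though not necessarily, Δ is a fixed value by which each value of the maximum likelihood envelope is reduced to give the good/bad threshold level.
There is a stack element for each label interval. Each live word path is assigned to the stack element for the label interval corresponding to its boundary label. A stack element may hold zero, one or more word path entries, listed in order of likelihood value.
The steps performed by the stack decoder 1002 of FIG. 4 are now described.
As the flowchart of FIG. 9 shows, forming the likelihood envelope and determining which word paths are good are interrelated.
In the flowchart of FIG. 9, a null path is first entered into the first stack (stack 0) in block 1050. In block 1052, a (complete) stack element containing previously determined complete paths, if any, is provided. Each complete path in the (complete) stack element has an associated likelihood vector. The likelihood vector of the complete path with the highest likelihood at its boundary label initially defines the maximum likelihood envelope. If the (complete) stack element contains no complete path, the maximum likelihood envelope is initialized to −∞ at each label interval. The envelope may also be initialized to −∞ when complete paths are not specified. Initialization of the envelope takes place in blocks 1054 and 1056.
After initialization, the maximum likelihood envelope is reduced by a predefined amount Δ to form a Δ-good region above the reduced likelihoods and a Δ-bad region below them. The value of Δ controls the breadth of the search: the larger Δ is, the larger the number of word paths considered for possible extension.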
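The good/bad test and the effect of Δ on the breadth of the search can be sketched as follows. The helper names and all numeric values are hypothetical; the only rule taken from the text is that a path is good when its likelihood at the boundary label is within Δ of the envelope.

```python
def is_good(path_values, envelope, delta):
    """True if the path's likelihood at its boundary label is within delta of the envelope."""
    boundary = len(path_values) - 1            # index of the boundary label
    return path_values[boundary] >= envelope[boundary] - delta

envelope = [-0.8, -2.0, -3.0, -4.2]             # hypothetical envelope values
candidates = {
    "X": [-1.0, -2.4],
    "Y": [-1.2, -4.5],
    "Z": [-0.9, -2.1, -6.0],
}

def good_paths(delta):
    """Names of the candidate paths that fall in the delta-good region."""
    return sorted(name for name, values in candidates.items()
                  if is_good(values, envelope, delta))

narrow = good_paths(0.5)   # small delta: few paths qualify
wide = good_paths(3.0)     # large delta: more paths qualify

# An envelope initialized to minus infinity accepts any finite likelihood.
empty_envelope = [float("-inf")] * 4
first_path_is_good = is_good([-50.0], empty_envelope, 2.0)
```

With the small Δ only path X survives, while the large Δ admits all three, which mirrors the statement that a larger Δ widens the search.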
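When log_10 is used to determine L_t, a Δ of 2.0 gives satisfactory results. The value of Δ is preferably, but not necessarily, uniform along the length of the label intervals.
A word path whose likelihood at its boundary label lies in the Δ-good region is marked "good"; otherwise it is marked "bad".
As shown in FIG. 9, the loop that updates the likelihood envelope and marks word paths as "good" (eligible for extension) or "bad" begins at block 1058, which finds the longest unmarked word path. If more than one unmarked word path is in the stack corresponding to the longest word path length, the one with the highest likelihood at its boundary label is chosen. If a word path is found, block 1060 tests whether its likelihood at the boundary label lies in the Δ-good region. If not, the path is marked as lying in the Δ-bad region (block 1062) and block 1058 finds the next unmarked live path. If so, the path is marked as lying in the Δ-good region (block 1064) and the likelihood envelope is updated in block 1066 to include the likelihood values of the path marked "good". That is, for each label interval the updated likelihood value is the greater of
(a) the current likelihood value in the likelihood envelope, and
(b) the likelihood value associated with the word path marked "good".
This is done in blocks 1064 and 1066. After the envelope is updated, control returns to block 1058 to find the longest, best unmarked live word path again.
The loop repeats until no unmarked word paths remain. The shortest word path marked "good" is then selected in block 1070. If more than one "good" word path has the shortest length, the one with the highest likelihood at its boundary label is selected in block 1072, and the selected shortest path is extended. That is, at least one likely follower word is determined, as described above, preferably by performing the fast match, language model, detailed match and language model procedures. An extended word path is formed for each likely follower word, specifically by appending the likely follower word to the end of the selected shortest word path.
After the selected shortest word path has been used to form the extended word paths, it is removed from the stack in which it was an entry, and each extended word path is entered into the appropriate stack in its place. In particular, an extended word path becomes an entry in the stack corresponding to its boundary label (block 1072).
The operation of extending the selected path in block 1072 is described with reference to the flowchart of FIG. 25. After a path is found in block 1070, the following procedure is performed, by which the word path or paths are extended on the basis of an appropriate approximate match.
In block 6000 of FIG. 25, the acoustic processor 1004 (of FIG. 4) generates a string of labels as described above. The label string is supplied as input to block 6002, where the basic or one of the enhanced approximate match procedures is performed to obtain an ordered list of candidate words. The language model is then applied in block 6004 as described above. In block 6006, the subject words remaining after the language model is applied are passed, together with the generated labels, to the detailed match processor. In block 6008, the detailed match yields a list of remaining candidate words, which are preferably submitted to the language model. The likely words, as determined by the approximate match, detailed match and language model, are used to extend the path found in block 1070 of FIG. 9. Each likely word determined in block 6008 (FIG. 25) is appended separately to the found word path, so that a plurality of extended word paths may be formed (block 6010).
In FIG. 9, after the extended paths are formed and the stacks re-formed, the process repeats by returning to block 1052.
Each iteration thus selects and extends the shortest, best "good" word path. A word path marked "bad" on one iteration may become "good" on a later one, so the good/bad character of each live word path is assigned independently on every iteration. In practice, the likelihood envelope does not change greatly from one iteration to the next, so the computation that decides whether a word path is good or bad is done efficiently, and normalization is not required.
When complete sentences are identified, block 1074 is preferably included. That is, when no live word paths remain unmarked and there are no "good" word paths to extend, decoding is finished. The complete word path with the highest likelihood at its boundary label is identified as the most likely word sequence for the input label string.
For continuous speech in which sentence endings are not identified, path extension either proceeds continually or continues for a predefined number of words chosen by the system user.
F1k Constructing phonetic baseforms
One type of Markov model phone machine that can be used in forming baseforms is based on phonetics; that is, each phone machine corresponds to a given phonetic sound.
For a given word there is a sequence of phonetic sounds, each with its own corresponding phone machine. Each phone machine includes a number of states and transitions between them, some of which can produce a feneme output and some of which (called null transitions) cannot. As noted above, the statistics for each phone machine include
(a) the probability of a given transition occurring, and
(b) the likelihood that a particular feneme is produced at a given transition.
Preferably, each non-null transition has a probability associated with each feneme. The feneme alphabet shown in Table 1 contains about 200 fenemes. FIG. 6 shows a phone machine used in forming phonetic baseforms. A sequence of such phone machines is provided for each word. The statistics, or probabilities, are entered into the phone machines during a training phase in which known words are uttered. The transition probabilities and feneme probabilities of the various phonetic phone machines are determined during training by applying the well-known forward-backward algorithm to the feneme string(s) generated when a known phonetic sound is uttered at least once.
A sample of the statistics for one phone, identified as phone DH, is shown in Table 2. As an approximation, the label output probability distributions for transitions tr1, tr2 and tr8 of the phone machine of FIG. 6 are represented by a single distribution, as are those for transitions tr3, tr4, tr5 and tr9 and those for transitions tr6, tr7 and tr10.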
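The sharing of one label output distribution among several transitions can be pictured as a lookup through a transition-to-section table. In the sketch below the transition names follow FIG. 6 and the three groups are those listed above. The section names, the function name and the dictionary layout are assumptions. The single probability shown (0.091 for feneme AE13 in the end section) is the value the text quotes for phone DH; all other entries are left empty rather than invented.

```python
# Which shared distribution each non-null transition of FIG. 6 uses.
SECTION_OF = {
    "tr1": "begin", "tr2": "begin", "tr8": "begin",
    "tr3": "middle", "tr4": "middle", "tr5": "middle", "tr9": "middle",
    "tr6": "end", "tr7": "end", "tr10": "end",
}

def label_probability(section_distributions, transition, label):
    """Look up a label output probability through the shared (tied) distribution."""
    section = SECTION_OF[transition]
    return section_distributions[section].get(label, 0.0)

# Partial statistics for phone DH; only the AE13 value is quoted in the text.
dh = {"begin": {}, "middle": {}, "end": {"AE13": 0.091}}

p_tr6 = label_probability(dh, "tr6", "AE13")
p_tr10 = label_probability(dh, "tr10", "AE13")
p_tr1 = label_probability(dh, "tr1", "AE13")
```

Because tr6 and tr10 point at the same section, they return the same probability, so only three distributions per phone need to be stored instead of ten.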
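This is shown in Table 2 by the assignment of the arcs (that is, transitions) to the columns labelled 4, 5 or 6. Table 2 shows the probability of each transition and the probability of each label (that is, feneme) being produced at the beginning, middle or end, respectively, of phone DH. For the DH phone, for example, the probability of the transition from state S_1 to state S_2 is computed as 0.07243, and the probability of the transition from state S_1 to state S_4 is 0.92757. (Since these are the only two possible transitions from the initial state, their probabilities sum to 1.)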
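As to label output probabilities, the DH phone has a probability of 0.091 of producing feneme AE13 (see Table 1) in its final portion, that is, in the column labelled 6 of Table 2. Table 2 also shows a count associated with each node (that is, state). The node count indicates the number of times during training that the phone was in the corresponding state. Statistics such as those in Table 2 exist for each phone machine.
Arranging phonetic phone machines into a word baseform is typically done by a phonetician and is normally not automatic.
Phonetic baseforms have been used successfully in the detailed match and the fast approximate match. Because they depend on the judgment of a phonetician and are not produced automatically, phonetic baseforms are sometimes inaccurate.
F2 Forming a set of phone machines that includes onset phone machines and trailing phone machines (FIGS. 1A to 3 and FIGS. 26 to 32)
The phone machines used in constructing the baseforms described in the preceding section are selected from a set of phone machines. As noted above, in earlier speech recognition systems each sound (or, more precisely, each phonetic element) was associated with only a single phone machine.
Each phone machine, as described above, includes transitions with their probabilities and label output probabilities associated with those transitions. A phone machine thus holds statistics indicating the likelihood of producing a given label at a given transition of the machine when the corresponding phonetic sound is uttered. These statistics are derived during a training period in which known speech is uttered into the acoustic processor 1004 (FIG. 4) and the well-known forward-backward algorithm is applied.
The statistics derived during training are determined largely by the labels that the acoustic processor 1004 generates when the known speech is uttered, and those labels are in turn determined by energy-related characteristics of the spoken input. The spectrogram of the word WILL in FIG. 26 and its waveform in FIG. 27 show that the energy characteristics while the "w" sound builds up from silence differ markedly from those of the "w" sound after the build-up.
Before the present invention, no distinction was made as to whether a sound, or phonetic element, occurred at the beginning of a word following a period of silence, in the middle of a word, or at the end of a word. The present invention makes this distinction.
The first 0.1 second of the word "WILL" shown in FIGS. 26 and 27 represents the build-up of the "w" sound, and the portion of the waveform immediately following corresponds to a "w" sound that is little affected by silence.
Treating the energy build-up and the subsequent portion of the "w" sound together as a single phone, as earlier systems did, introduces error. The single phone machine for the "w" sound mixed into its statistics every occurrence of the sound, whether at the beginning, at the end or within a word. Its statistics were therefore contaminated by both energy build-up and energy decay.
In accordance with the invention, a given sound, such as the "w" sound, may have several phone machines associated with it. The "w" sound, for example, has a common phone machine containing statistics for the "w" sound uttered without the influence of silence; that is, statistics generated from utterances of the "w" sound that are not adjacent to a period of silence. The common phone is therefore not contaminated by the energy characteristics of build-up or decay. In addition, the "w" sound has an onset phone machine, reflecting statistics for the "w" sound uttered at a transition from silence, and a trailing phone machine, reflecting statistics for the "w" sound uttered immediately before a period of silence.
The onset phone machine for the "w" sound is designated ONSETLX, or ONLX, and its trailing phone machine TRAILLX, or TRLX. The common phone machine is designated WX. Each phone machine is defined separately and has its own transition probabilities and label probabilities. The differing statistics of the three phone machines associated with the "w" sound are shown in Tables 3, 4 and 5.
In Table 3, the statistics of phone machine ONLX are organized in the same way as those of Table 2. Three columns give the probabilities of producing the various labels in the beginning, middle and end sections of the phone machine. The transition probabilities from one state to another are also shown. FIG. 28 shows how the transitions of a phone machine (such as that of FIG. 6) are grouped to form the three sections.
The statistics of Table 3 are derived during a training period and apply to a particular speaker.
During training, a sample of known text is uttered by that speaker. The sequence of phones corresponding to the known text is determined from the text. As the known words are uttered, a string of labels (that is, fenemes) is generated.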
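The labels are aligned against the sequence of phone machines by a conventional method such as Viterbi alignment. The correspondence between the generated labels and the phones of the known text provides the basis for determining the probabilities held in each phone machine. For example, the "w" sound preceded by silence may occur a number of times, at known intervals, during the training period. The number of times a particular label, WX7 for instance, is generated when the "w" sound is preceded by silence is processed to give probabilities such as those shown in Table 3. Specifically, the onset phone machine for the "w" sound has a probability of 0.036 of producing label WX7 in its middle section and a probability of 0.197 of producing label WX7 in its end section. Also in Table 3, the transition probability of onset phone ONLX between state 1 and state 4 is 0.67274, whereas that between state 1 and state 2 is 0.32370.
The significance of the invention is evident from a comparison of Tables 3, 4 and 5. Tables 4 and 5 do not include label output WX7, which appears in Table 3, among their significant label outputs. Furthermore, in Table 5 the transition from state 1 to state 4 has probability 1.0, and the parallel transition from state 1 to state 2 has probability 0. These statistics differ markedly from those of Table 3.
The significant differences among the statistics of Tables 3, 4 and 5 show that lumping all occurrences of the "w" sound into the statistics of a single phone machine, regardless of position in the word, can introduce error.
In forming word baseforms, each comprising a sequence of phones, the phones are selected from a predefined set. Earlier systems, which used a single phone machine per sound, had (as noted above) about 70 phones. In accordance with the invention, the phone set is preferably augmented by 26 phones: 14 onset phones and 12 trailing phones. Table 6 lists these additional phones.
In Table 6, not every sound (that is, phonetic element) has its own onset phone machine and trailing phone machine. Giving every sound all three machines is within the scope of the invention, but an inventory of 210 phone machines, three per sound, is considered too large unless a large amount of training data is available. Accordingly, certain sounds whose statistics do not change greatly whether or not they are adjacent to a period of silence have only a common phone machine. These sound classes include PX, TX and KX, which are referred to as voiceless stops. Because voiceless stops are not affected by position in the word, each is represented by a single phone.
In addition, certain groups of sound classes have very similar statistics for energy build-up, and each such group is given a single onset phone machine. One example in Table 6 is onset phone machine ONSETAA, or ONAA, with which eight sounds (that is, phonetic elements) are associated. Similarly, certain groups of sounds have very similar statistics for energy decay, and each such group is given a single trailing phone machine. In Table 6, for example, seven sounds are associated with trailing phone machine TRAILAA, or TRAA.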
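This grouping reduces the number of phone machines and the amount of training data needed to generate acoustic statistics for them. It has not resulted in any significant compromise in performance relative to a system using 210 phone machines.
Table 6 also gives the standard phonetic symbols corresponding to the identifiers used here. It should be noted that the invention preferably covers a subset of the conventional phonetic elements (identified by the symbols shown) and also contemplates other types of sounds beyond the International Phonetic Alphabet.
In Table 6, a phone with the suffix "0" denotes an unstressed vowel and a phone with the suffix "1" a stressed vowel.
Table 7 identifies all the phones for which phone machines are defined in the preferred embodiment of the invention.
Word baseforms are constructed from the set of phones listed in Table 7. Considering the word "WILL" again, its baseform is formed as the sequence of phones (or, equivalently, phone machines) shown in FIG. 29. The phonetic spelling of the word "WILL" is shown in FIG. 30. Phone machine ONLX is the onset phone machine for the "w" sound.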
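(The ONLX phone machine is also the first phone machine of baseforms that begin with the "l" or "hw" phonetic element.) In the word "WILL", the ONLX phone machine is followed by phone machine WX, the common phone machine for the "w" sound. Then come the IX1 phone machine, the LX common phone machine and the TRLX phone machine.
Each word in the vocabulary is similarly represented by a baseform (like that of the word "WILL" in FIG. 29). To form a baseform, the phones making up the subject word are determined and the phone machines corresponding to those phones are then concatenated.
In the inventory stored in the computer, each word is represented by its sequence of phone machines, and the statistics are stored per phone machine. To reduce storage, each phone machine may be represented by an identifier and each word baseform formed as a sequence of phone machine identifiers. For example, the baseform of the word "WILL" corresponds to the identifier sequence 43-27-81-12-56. Identifier 43 corresponds to phone machine ONLX, identifier 27 to phone machine WX, and so on. After training, statistics such as those shown in Tables 2 to 5 are stored for each phone machine in a portion of memory. When a subject word is considered, the statistics for its constituent phone machine identifiers are retrieved.
As two further examples, FIGS. 31 and 32 show the baseforms of the words "BOG" and "DOG" respectively. Both baseforms begin with onset phone machine ONBX. In the word "BOG", ONBX is followed by phone machines BX, AW1, GX and TRBX. In the word "DOG", ONBX is followed by the sequence of phone machines DX, AW1, GX and TRBX. The same onset phone machine is used because the energy build-up of the "B" and "D" sounds is similar. In training the ONBX phone machine, utterances of any of the sounds (that is, phonetic elements) that it represents are preferably included in generating its statistics.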
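The WILL, BOG and DOG baseforms above can be reproduced with a small lookup-and-concatenate routine. This is only a sketch: the two tables contain just the onset and trailing assignments that these three examples reveal, a sound absent from a table is assumed to have only a common phone machine, and the function and variable names are invented for illustration. In the identifier table, only 43 (ONLX) and 27 (WX) are stated explicitly in the text; the remaining three are assumed to follow the order of the quoted sequence.

```python
# Onset and trailing assignments inferred from the WILL, BOG and DOG examples.
ONSET = {"WX": "ONLX", "LX": "ONLX", "BX": "ONBX", "DX": "ONBX"}
TRAILING = {"LX": "TRLX", "GX": "TRBX"}

def build_baseform(sounds):
    """Concatenate phone machine names for a word given as its common phones in order."""
    baseform = []
    last = len(sounds) - 1
    for i, sound in enumerate(sounds):
        if i == 0 and sound in ONSET:
            baseform.append(ONSET[sound])      # onset machine precedes the common machine
        baseform.append(sound)                 # common phone machine
        if i == last and sound in TRAILING:
            baseform.append(TRAILING[sound])   # trailing machine follows the common machine
    return baseform

will = build_baseform(["WX", "IX1", "LX"])
bog = build_baseform(["BX", "AW1", "GX"])
dog = build_baseform(["DX", "AW1", "GX"])

# Compact storage as identifiers (81, 12 and 56 are assumed from the sequence order).
IDS = {"ONLX": 43, "WX": 27, "IX1": 81, "LX": 12, "TRLX": 56}
will_ids = [IDS[name] for name in will]
```

A word beginning with a voiceless stop such as KX would simply start with its common phone machine, since KX has no entry in the onset table.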
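The same preferably applies to the other onset and trailing phone machines that correspond to more than one sound (that is, phonetic element).
The procedure for constructing baseforms is described with reference to the flowchart of FIGS. 1A and 1B. In block 8002, a set of phone machines is formed from onset phone machines, common phone machines and trailing phone machines.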
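In block 8004, a word is selected from the vocabulary. In block 8006, the word is characterized as a plurality of phonetic elements, or sounds generally, in a predefined order, such as W-I-l for the word "WILL". Block 8008 then examines the first phonetic element in the order to determine whether it has a corresponding onset phone machine. If so, the onset phone machine is retrieved in block 8010, and in block 8012 the first two phone machines of the baseform are set to the onset phone machine of the first phonetic element followed by its common phone machine. If the first phonetic element has no corresponding onset phone machine, its common phone machine is retrieved in block 8013 and forms the beginning of the baseform.
If block 8014 finds no next phonetic element, block 8015 determines whether the first phonetic element has an associated trailing phone. If it has none, the baseform consists of the onset phone (machine) followed by the common phone (machine). If it has one, the trailing phone is appended to the common phone in block 8016, so that the word baseform comprises the onset phone machine, the common phone machine and the trailing phone machine of the first phonetic element.
If block 8014 finds a next phonetic element, block 8017 examines it to determine whether it is the last in the order. If it is the last, block 8018 determines whether it has an associated trailing phone machine. If so, the baseform is completed in block 8020 by appending the common phone machine corresponding to the last phonetic element followed by its trailing phone machine. If it has none, the common phone machine of the last phonetic element is appended as the end of the baseform in block 8022.
If block 8017 finds that the next phonetic element is not the last, the common phone of that element is appended in block 8024 to the phone machines already arranged. Successive phone machines are appended, lengthening the sequence, until the phone machine or machines corresponding to the last phonetic element have been appended.
The formation of phone machines according to the invention is described next with reference to FIGS. 2A and 2B. First, in block 8100, a set of sounds is defined, for example as phonetic elements selected from the International Phonetic Alphabet. The set of sounds represents the kinds of sounds produced in speech. In block 8102, a plurality of phone machines is formed, each having means for storing its statistics. In block 8104, a given sound is selected from a first set of sounds, each of which is to have an onset phone machine assigned to it. The first set preferably consists of the sounds that are significantly affected by energy build-up. (As noted above, the first set may comprise all the sounds if sufficient training data is available.) In block 8106, an onset phone machine is assigned to the given sound. In block 8108, the statistics of the assigned onset phone machine are derived from utterances, at the beginning of a speech segment (a word, for example), of the sound corresponding to the given sound or of a sound with similar energy build-up characteristics.
A common phone machine is then defined for the given sound in block 8110, and statistics for it are generated in block 8112. Once every sound that is to have an onset phone machine has been processed as the given sound (block 8114), a second set of sounds is defined, each of which is to have a trailing phone machine assigned to it.
In block 8116, a given sound is selected from the second set, and in block 8118 a trailing phone machine is assigned to it. In block 8120, statistics for the assigned trailing phone are generated from utterances, occurring at the end of a speech segment, of the sound corresponding to the given sound or of a sound with similar energy decay characteristics. A common phone machine for the given sound is then assigned in block 8122, and its statistics are generated in block 8124 if they have not been determined previously. Block 8126 determines whether all the sounds that are to have a trailing phone machine have been selected as the given sound. If so, all the phone machines have been defined. If not, a previously unselected sound is selected as the given sound and the operations of blocks 8118 to 8126 are repeated.
The procedure of FIGS. 2A and 2B may be modified in various ways within the invention. First, if only onset phone machines are sought, blocks 8116 to 8126 may be omitted. Likewise, if only trailing phone machines are sought, blocks 8104 to 8114 may be omitted.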
Second, the first and second sets of sounds may, if desired, be processed at the same time.
In addition, these embodiments include the step of assigning a single onset or trailing phone machine to more than one sound. In that case the statistics need be generated only once and are applied as appropriate to each sound. Preferably, it is first decided which sounds are to have onset and trailing phone machines assigned to them, thereby defining the first and second sets of blocks 8104 and 8116 respectively.
For speech recognition, the invention provides apparatus for forming baseforms from the enlarged set of phone machines. An example of such apparatus is shown in FIG. 3.
FIG. 3 shows a plurality of phone machines 8202 to 8212. Each is constructed like phone machine 8202 and includes (a) a transition probability memory 8214, (b) a label probability memory 8216 and (c) a state identifier and transition identifier memory 8218. A number of the phone machines, including 8202 and 8204, are common phone machines; a number, including 8206 and 8208, are onset phone machines; and a number, including 8210 and 8212, are trailing phone machines. Statistics are stored in the memories of each of the phone machines 8202 to 8212 by a phone machine trainer 8220.
Each word is defined beforehand as a sequence of phones, and these sequences are stored in a storage device 8230. A baseform constructor 8240 combines the phone sequence information from storage device 8230 with the statistics derived by the phone machine trainer 8220 to construct sequences of phone machines. The sequence of phone machines for a given word constitutes the baseform of that word and is used in acoustic matching (described in sections F1c to F1f above). That is, when unknown speech to be recognized is uttered, the acoustic processor 1004 (of FIG. 4) generates a string of labels in response. The invention allows the labels in the string to be matched against baseforms formed of phone machines from the improved set of phone machines.
Using the phone machines added by the invention considerably improves the accuracy and speed of speech recognition.
The invention can be used in isolated-word speech recognition systems as well as in continuous speech recognition systems. With isolated words, each word is followed by a pause, so the beginning and end of each word generally exhibit energy build-up and decay. The invention is particularly well suited to such systems. In continuous speech, words run together and pauses typically occur between phrases. Instead of characterizing every word baseform as having build-up and decay portions, onset and trailing phone machines are therefore provided at phrase boundaries. Isolated words and continuous-speech phrases are both covered by the generic term "speech segment"; a speech segment is regarded as the portion of speech between two periods of silence.
The invention is described in the following order.
A. Field of industrial application
B. Summary of the disclosure
C. Prior art
D. Problems to be solved by the invention
E. Means for solving the problems
F. Embodiment
F1 Speech recognition system environment
F1a General description (FIGS. 4 to 9)
F1b Auditory model and its implementation in the acoustic processor of a speech recognition system (FIGS. 10 to 16)
F1c Detailed match (FIGS. 6 and 17)
F1d Basic fast match (FIGS. 18 to 20)
F1e Alternative fast match (FIGS. 21 and 22)
F1f Matching based on the first J labels (FIG. 22)
F1g Phone tree structure and fast match embodiments (FIG. 23)
F1h Language model (FIGS. 4 and 24)
F1j Stack decoder (FIGS. 7 to 9 and FIG. 25)
F1k Constructing phonetic baseforms
F2 Forming a set of phone machines that includes onset phone machines and trailing phone machines (FIGS. 1A to 3 and FIGS. 26 to 32)
F3 Tables
G. Effect of the invention
A. Field of industrial application
The present invention relates to speech recognition using Markov models, and in particular to constructing word baseforms (the Markov models of words) with special attention to the beginnings of words, and to using such word baseforms.
B. Summary of the disclosure
The invention discloses apparatus and a method for constructing word baseforms that can be matched against a string of generated acoustic labels, including forming a set of phonetic phone machines. Each phone machine has (i) a plurality of states, (ii) a plurality of transitions, each extending from a state to a state, (iii) a stored probability for each transition and (iv) stored label output probabilities, each label output probability corresponding to the probability of the phone machine producing the corresponding label. The set of phonetic phone machines is formed to include a subset of onset phone machines, the stored probabilities of each onset phone machine corresponding to at least one phonetic element being uttered at the beginning of a speech segment. The set is also formed to include a subset of trailing phone machines, the stored probabilities of each trailing phone machine corresponding to at least one phonetic element being uttered at the end of a speech segment. Word baseforms are constructed by concatenating phone machines selected from the set.
C. Prior art
Inventions that provide background or environment for the present invention are set out in U.S. Patent Application No. 06/665,401 (filed October 26, 1984) and No. 06/672,974 (filed November 19, 1984).
In a probabilistic approach to speech recognition, the acoustic waveform is first converted by an acoustic processor into a string of labels. The alphabet (set) of labels typically contains about 200 different labels, each of which identifies a corresponding sound type.
The generation of such labels has been reported in various papers as well as
As described in the aforementioned U.S. Patent Application No. 06/665,401.
Ru. Simply put, the acoustic input is divided and continuous.
into time frames and label each time frame
Assign. Labels are usually based on energy characteristics.
It is formed by Marco on using labels for speech recognition
The f-model (stochastic finite state machine) has already been proposed.
It is being proposed. Markov models typically have multiple
Contains states and transitions between those states. Furthermore,
Markov models typically include (a) the probability of occurrence of each transition;
and (b) each generating each label at various transitions.
has a probability assigned to it with respect to the probability of
Ru. Markov model (or equivalently Markov model
IEEE Proceedings: Pattern Analysis and
Computer Information (PAMI) Vol. 5 No. 2 (March 1983)
month) outside of L.R.B. listed on pages 179-190.
Paper “Maximum Likelihood Method for Recognizing Continuous Speech” (L.R. Bahl
et al, “A Maximum Likelihood Approach
to Continuous Speech Recognition”, IEEE
Transactions on Pattern Analysis and
Machine Intelligence, Vol.PAMI-5, No.o.2,
March 1983)
There is. A Markov model machine is a Markov model machine.
Also called model phoneme machine or simply phoneme machine.
cormorant. When recognizing speech, which word(s) in the vocabulary?
) is generated by a sound processor.
has the highest likelihood of yielding a label string
A matching process is performed to determine the
Ru. One such matching procedure is the
As shown in patent application no. 06/672974. in addition
According to Acoustic matching, (a) each word in the vocabulary
into a sequence of Markov model phoneme machines.
(b) the phoneme machine represented by each word;
of the sequence, generated by the acoustic processor.
Determine the likelihood of each yielding a string of labels.
It is executed by setting Each word represents
Phoneme machine sequences correspond to word basic forms
do. When forming a word basic form, first
We define the properties of the phoneme machine used to construct the basic form of
need to be justified. Said U.S. Patent Application No. 06/
No. 672974, a work constructed from phonetic phoneme machines.
The basic form of the card is shown. In this case, each phoneme
The machine corresponds to a phonetic type single note, and has 7 states and 13
Contains transitions. In detail, each corresponds to
The basic form is a set of about 70 phonemes representing phonetic elements.
It has become the basis for building the. Generally word base
In this format, a phonetician identifies words with their respective phonetic symbols.
the corresponding phoneme machine for each phonetic symbol.
Constructed by assigning to segments
Ru. Traditionally, each of the 70 phonemes corresponds to a given grade.
whether the single note is at the beginning, middle, or end of the word.
represents a given phonetic magnitude, regardless of whether it occurs in
did. For example, the single note “k” is similar to “cat”.
at the beginning of a word like “scat”.
in the middle of the word, or in the case of “back”
The phoneme k occurs in any case at the end of the letter k.
was displayed. D. Problem that the invention aims to solve The present invention provides that the predetermined sound is
i.e. precedes or follows the quiet period?
It expresses different energy characteristics depending on the
Based on knowledge. In particular, the present invention is preceded by a quiet period.
Depending on the sound, the energy may be amplified and
If a silent period follows, some sounds may lose energy.
It is based on the fact that it is attenuated. Energy special
The quality generally determines the label that should be generated for the acoustic input.
It is used by the sound processor when
depending on whether the sound occurs at the beginning or end of the code.
This results in energy enhancement or attenuation, resulting in different lamps.
Bells may be generated. Therefore, the invention provides that the sound is uttered at the beginning of the word.
There are several first methods that involve energy enhancement when
With a phoneme machine of the type, a sound is uttered at the end of a word.
If the
Define the second type of phoneme machine. Furthermore, it is uttered
sounds with significant energy enhancement or attenuation.
There is a third type of phoneme machine for cases where
Ru. Start the first type of phoneme machine, the first type of phoneme machine.
Ending the second type of phoneme machine, the third type of phoneme machine
The phoneme machine is called the common phoneme machine. Start phoneme machine statistics reflect transitions from silence.
The end phoneme machine statistics indicate the transition to silence.
reflect. The common phoneme machine starts in the middle of the word.
Generally speaking, the sound that is spoken is a transition to silence.
Transitions from or to silence are added to the statistics of the phoneme machine.
The sound uttered at the word position does not have a large effect.
It is desirable to have corresponding statistics. A given sound is the part of the word in which that sound is uttered.
even if the corresponding energy characteristics do not change significantly.
If there is a common phoneme machine associated with it, then
do. In accordance with the present invention, a plurality of starting phoneme machines and
An ending phoneme machine is provided, and a predetermined sound is used for a period of silence.
Give the energy characteristics when it occurs adjacent to . Thus, according to the invention, a given word
Open with a target sound that has a starting phoneme machine that corresponds to
starts with that starting phoneme machine, and
Basic form followed by a common phoneme machine for the target sound
is configured to obtain. Similarly, according to the present invention
A given word has its corresponding ending phoneme machine.
If the word ends with a target sound that has a
ends with its ending phoneme machine, and its target
We will obtain the basic form preceded by the common phoneme machine of the sound.
It is composed of sea urchins. Therefore, the object of the present invention is to (construct word basic forms)
) The Markov model set includes
for single sounds that occur in transitions or transitions from silence.
including the corresponding Markov model, and
precision in word recognition systems using this form.
The aim is to increase the level of Furthermore, it is an object of the present invention to obtain similar energy enhancement properties.
Group sounds that have , into that group.
Define a single-start phoneme machine for all sounds that belong to
sound with similar energy attenuation characteristics.
group together, and everything belonging to that group.
by defining a single-end phoneme machine for all sounds.
The goal is to limit the total number of phoneme machines. E. Means to solve the problem The method to achieve the above objective includes the following steps:
nothing. (a) Each phoneme machine has () multiple states, ()
Multiple transitions, each transitioning from one state to another
transition, () memorized probability for each transition, () notation
Stored label output probability (each label output probability is
of each phoneme machine to generate the corresponding label.
corresponding to the probability), then the phonetic phoneme
Form a set of machines. (b) The set of phonetic alphabet type phoneme machines is the starting phoneme machine.
formed to include a subset of thin
If, the memorized probability of each starting phoneme machine is
At least the first utterance of the audio segment
Corresponds to one phonetic type element. (c) a given starting phoneme machine to which a word corresponds.
starts with a phonetic element with a given starting sound.
If you have a word elementary form that starts with elementary machine,
Then, each word basic form is sequenced by a phoneme machine.
Build it as a service. The method of the invention further has the following features. Sunawa
In other words, the set of phonetic symbol type phoneme machines is the end phoneme machine.
formed to contain a subset of thin, each end
The phoneme machine's memorized probabilities are the phonetic segment
at least one single phonetic type pronounced at the end of
corresponds to an element. Also, word corresponds to it
Ends with a phonetic type element with a given ending phoneme machine
and the word base ending with a given ending phoneme machine
form, each word base form has a phoneme machine.
constructed as a sequence of events. The device of the present invention which achieves the above object is a Markovian
Contains a set of model phoneme machines. Each phoneme machine
is () multiple states, () each state
Multiple transitions from to a state, () per transition
means to store the probability of () the label output probability
Means for storing (each label output probability is
At the transition, each phoneme machine generates a specific label.
corresponds to the probability that Part of the phoneme machine is the start phoneme
machine, each starting phoneme machine is ()
relating to at least one sound from a set of sounds;
At least one related sound at the beginning of a word ()
The transition probability and label output probability are shaped from the utterance of
be done. ) and the basic form of each word as a phoneme map.
Means for constructing as a sequence of syn (the construction
The means is that the word corresponding to the basic form of the target word is
When starting with a sound associated with a given starting phoneme machine
If the given starting phoneme appears at the beginning of the target word basic form,
Including means of placing the machine. ) is characterized by having
shall be. Furthermore, as a feature of the device of the present invention, a phoneme machine
Part of is the end phoneme machine (each end phoneme machine is
() associated with at least one sound from a set of sounds
and () at least at the end of the audio segment.
Transition probabilities and labels from the utterance of one related sound
Output probabilities are formatted. ), the construction means
furthermore, the word corresponding to the basic form of the target word is
If it ends with a sound related to a given ending phoneme, the
Given ending phoneme machine at the end of the elephant word base form
including means for placing. Furthermore, the device of the present invention each
In response to sounds that are affected by energy enhancement,
A common representation of the sound when it is not affected by Lugi enhancement.
also includes a phoneme machine, in which case the construction means further comprises:
, the starting phoneme machine and the subsequent common phoneme machine
(A particular sound starts a word, and the starting phoneme machine
(corresponding to a particular sound) if relevant.
including means for containing. F Example F1 speech recognition system environment F1a General explanation (Figures 4 to 9) FIG. 4 is an overview diagram of the speech recognition system 1000.
The system includes stack decoder 1002 and, connected to it, acoustic processor (AP) 1004, array processor 1006 for performing fast approximate acoustic matching, array processor 1008 for performing precision acoustic matching, language model 1010, and workstation 1012.
Acoustic processor 1004 converts a speech waveform input into a string of labels, each of which roughly identifies a corresponding sound type. These labels are also called fenemes, the name given here to the micro-phonemes obtained from the front end (in principle they are obtained from the front end, but they are not limited to this). In this system, acoustic processor 1004 is based on a unique model of human hearing and is described in U.S. patent application Ser. No. 06/665,401 (filed October 26, 1984).
The labels, or fenemes, from acoustic processor 1004 are sent to stack decoder 1002. FIG. 5 shows the logical elements of stack decoder 1002. That is, stack decoder 1002 includes search element 1020, which is connected to workstation 1012, and interfaces 1022, 1024, 1026 and 1028. These interfaces connect to acoustic processor 1004, array processors 1006 and 1008, and language model 1010, respectively.
In operation, search element 1020 sends the fenemes from acoustic processor 1004 to array processor 1006 (fast matching). The fast matching procedure is explained below and is also described in U.S. patent application Ser. No. 06/672,974 (filed November 19, 1984). Briefly, the object of matching is to determine the most likely word (or words) for a given label string.
Fast matching examines the words in a vocabulary and is designed to reduce the number of candidate words for a given string of incoming labels. Fast matching is based on probabilistic finite state machines (also referred to as Markov models in this specification).
After fast matching has reduced the number of candidate words,
stack decoder 1002 calls language model 1010, which determines the contextual likelihood of each candidate word in the fast-matching candidate list, preferably on the basis of existing trigrams.
Precision matching preferably examines those words from the fast-matching candidate list that, on the basis of the language model computations, have a reasonable likelihood of being the spoken word. Precision matching is also described in the above-mentioned U.S. patent application Ser. No. 06/672,974 (filed November 19, 1984). Precision matching is performed by means of Markov-model phoneme machines such as the one shown in FIG. 6.
After precision matching, the language model is preferably invoked again to determine word likelihood. Stack decoder 1002 of the present invention uses the information obtained from fast matching, precision matching, and the language model, and is designed to determine the most likely path, or sequence, of words for a string of generated labels.
Two conventional methods of finding the most likely word sequence are Viterbi decoding and single-stack decoding. Both are described in the above-mentioned paper by L. R. Bahl et al. on the recognition of continuous speech: Viterbi decoding in Section V, and single-stack decoding in another section of the same paper.
In single-stack decoding, paths of varying length are listed in a single stack according to their likelihood, and decoding is based on that single stack. Because the likelihood in single-stack decoding depends somewhat on path length, normalization is generally used. The Viterbi method does not require normalization and is generally suitable for small tasks.
As another alternative, a small-vocabulary system can decode by examining each possible combination of words as a possible word sequence and determining which combination has the highest probability of producing the generated label string. For a large-vocabulary system, the amount of computation this method requires becomes enormous and impractical.
Stack decoder 1002 in effect serves to control the other elements and does not itself perform many computations. Stack decoder 1002 therefore preferably includes a 4341 processor running under the IBM VM/370 operating system, as described in publications such as Virtual Machine/System Product Introduction Release 3 (1983). The array processors, which perform a considerable amount of computation,
are implemented with commercially available 190L units manufactured by Floating Point Systems (FPS).
FIGS. 7, 8 and 9 show a novel technique, invented by L. R. Bahl et al., that includes a multiple-stack method and a unique decision strategy.
FIG. 7 shows a plurality of successive labels $y_1 y_2 \ldots$ generated at successive label intervals. FIG. 8 also shows a plurality of word paths, namely path A, path B and path C. In the context of FIG. 7, path A may correspond to the entry "to be or", path B to the entry "two b", and path C to the entry "too". For each target word path there is a label (or, equivalently, a label interval) at which the target word path has the highest probability of having ended. Such a label is called a boundary label.
For a word path W representing a sequence of words, the most likely end time (represented in the label string as a boundary label) can be found by a known method of locating likely boundaries between words, such as the one described by L. R. Bahl et al. in "Faster Acoustic Match Computation", IBM Technical Disclosure Bulletin, Vol. 23, No. 4, September 1980. Briefly, that paper discusses ways of addressing two important points: (a) how much of label string Y is accounted for by a word (or word sequence), and (b) at which label interval a partial sentence (corresponding to part of the label string) ends.
For any given word path,
there is a "likelihood value" associated with each label, or label interval, from the first label of the label string to the boundary label. Taken together, all the likelihood values for a given word path form the "likelihood vector" of that word path, so each word path has a corresponding likelihood vector. Likelihood values $L_t$ are shown in FIG. 8.
The "likelihood envelope" $\Lambda_t$ at label interval $t$ for a collection of word paths $W^1, W^2, \ldots, W^S$ is defined mathematically as follows:
$$\Lambda_t = \max\bigl(L_t(W^1), \ldots, L_t(W^S)\bigr)$$
That is, for each label interval, the likelihood envelope contains the highest likelihood value associated with any word path in the collection. FIG. 8 shows likelihood envelope 1040.
A word path is considered "complete" if it corresponds to a complete sentence. A complete path is preferably identified by the speaker entering an input, for example pressing a button, on reaching the end of a sentence. The entered input is synchronized with a label interval to mark the end of the sentence. A complete word path cannot be extended by appending further words. A "partial" word path corresponds to an incomplete sentence and can therefore be extended.
Partial paths are classified as "live" or "dead". A word path is "dead" when it has already been extended and "live" when it has not yet been extended. With this classification, a path that has already been extended to form one or more longer word paths is not considered for extension again later.
Each word path can also be characterized as a "good" or "bad" path relative to the likelihood envelope. A word path is good if, at the label corresponding to its boundary label, its likelihood value lies within Δ of the maximum likelihood envelope; otherwise it is a bad word path. Preferably, but not necessarily, Δ is a fixed value by which each value of the maximum likelihood envelope is reduced to serve as the good/bad threshold level.
For each label interval there is a stack element.
Each live word path is assigned to the stack element for the label interval that corresponds to the boundary label of that live path. A stack element may hold zero, one, or more word path entries, listed in order of likelihood value.
The steps performed by stack decoder 1002 of FIG. 4 are described next.
Forming the likelihood envelope and determining which word paths are good are interrelated, as the flowchart of the stack decoding method in FIG. 9 shows.
In the flowchart of FIG. 9, a null path is first entered into the first stack (0) at block 1050. At block 1052, a (complete) stack element containing previously determined complete paths is provided, if any exist. Each complete path in the (complete) stack element has a likelihood vector associated with it. The likelihood vector of the complete path with the highest likelihood at its boundary label initially defines the maximum likelihood envelope. If the (complete) stack element contains no complete path, the maximum likelihood envelope is initialized to −∞. The maximum likelihood envelope may also be initialized to −∞ when complete paths are not specified. The envelope is initialized at blocks 1054 and 1056.
After it is initialized, the maximum likelihood envelope is reduced by a predetermined amount Δ, forming a Δ-good region above the reduced likelihoods and a Δ-bad region below them. The larger Δ is, the larger the number of word paths considered for possible extension. When $\log_{10}$ is used to determine $L_t$, a value of 2.0 for Δ gives satisfactory results. The value of Δ is preferably, but not necessarily, uniform along the length of the label intervals.
If a word path has, at its boundary label, a likelihood that lies in the Δ-good region, the word path is marked "good". In other cases
the word path is marked "bad".
As shown in FIG. 9, the loop that updates the likelihood envelope and marks word paths as "good" (extendable) or "bad" starts at block 1058 by finding the longest unmarked word path. If two or more unmarked word paths are in the stack corresponding to the longest word path length, the one with the highest likelihood at its boundary label is selected. If a word path is found, block 1060 checks whether the likelihood at its boundary label lies in the Δ-good region. If it does not, block 1062 marks the path as lying in the Δ-bad region, and block 1058 looks for another unmarked live path. If it does, block 1064 marks the path as lying in the Δ-good region, and block 1066 updates the likelihood envelope to include the likelihood values of the path marked "good". That is, for each label interval, the updated likelihood value is the larger of (a) the current likelihood value in the likelihood envelope and (b) the likelihood value associated with the word path marked "good". These operations are performed at blocks 1064 and 1066. After the envelope has been updated, control returns to block 1058 and the longest, best unmarked live word path is found again.
This loop repeats until no unmarked word paths remain. When none remain, block 1070 selects the shortest word path marked "good". If two or more "good" word paths share the shortest length, block 1072 selects the one with the highest likelihood at its boundary label. The selected shortest path is then extended. That is, at least one likely following word is determined, preferably by performing the fast matching, language model, precision matching, and language model procedures described above. For each likely following word, an extended word path is formed. Specifically, an extended word path is formed by appending a likely following word to the end of the selected shortest word path.
After the selected shortest word path has been formed into extended word paths, it is removed from the stack in which it was an entry, and each extended word path is inserted into the appropriate stack instead. In particular, an extended word path becomes an entry in the stack corresponding to its boundary label (block 1072).
After the extended paths have been formed and the stacks re-formed, the process is repeated by returning to block 1052.
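The marking loop and the selection step of FIG. 9 can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the patent's implementation: the names `WordPath`, `mark_paths` and `select_for_extension` are invented, each path stores log-likelihoods from the first label to its boundary label, the envelope starts at −∞ (the case with no complete path), and the search for following words (fast matching, language model, precision matching) is left out entirely.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class WordPath:
    words: Tuple[str, ...]
    likelihoods: List[float]       # L_t for t = 0 .. boundary label
    mark: Optional[str] = None     # None (unmarked), "good" or "bad"

    @property
    def boundary(self) -> int:
        return len(self.likelihoods) - 1

def mark_paths(paths: List[WordPath], n_intervals: int,
               delta: float = 2.0) -> List[float]:
    """Mark every live path afresh as good or bad; return the final envelope."""
    envelope = [-math.inf] * n_intervals
    unmarked = list(paths)
    while unmarked:
        # Longest unmarked path; ties go to the higher boundary likelihood.
        path = max(unmarked, key=lambda p: (p.boundary, p.likelihoods[-1]))
        unmarked = [p for p in unmarked if p is not path]
        if path.likelihoods[-1] >= envelope[path.boundary] - delta:
            path.mark = "good"
            # Raise the envelope wherever this good path lies above it.
            for t, value in enumerate(path.likelihoods):
                envelope[t] = max(envelope[t], value)
        else:
            path.mark = "bad"
    return envelope

def select_for_extension(paths: List[WordPath]) -> Optional[WordPath]:
    """Shortest good path; ties go to the higher boundary likelihood."""
    good = [p for p in paths if p.mark == "good"]
    if not good:
        return None
    return min(good, key=lambda p: (p.boundary, -p.likelihoods[-1]))

# Hypothetical live paths, loosely following the entries named for FIG. 8.
paths = [
    WordPath(("to", "be", "or"), [-1.0, -2.0, -3.0, -4.0]),
    WordPath(("two", "b"), [-1.0, -2.0, -2.5]),
    WordPath(("too",), [-3.0, -5.0, -9.0]),
]
envelope = mark_paths(paths, n_intervals=4)
chosen = select_for_extension(paths)
print([p.mark for p in paths], chosen.words)
```

In a full decoder the chosen path would be removed from its stack, extended by each likely following word, and the marking would be redone on the next iteration; the sketch stops at the selection.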
Each iteration therefore selects the shortest, best "good" word path and extends it. A word path marked "bad" in one iteration may become a "good" path in a later iteration, so the characterization of a live word path as "good" or "bad" is made independently at each iteration. In practice, the likelihood envelope does not change greatly from one iteration to the next, so the computation that decides whether a word path is good or bad is carried out efficiently. Moreover, normalization is not required.
When complete sentences are to be identified, block 1074 is preferably included. That is, decoding ends when no live word path remains unmarked and there is no "good" word path left to extend. The complete word path with the highest likelihood at its boundary label is identified as the most likely word sequence for the input label string.
For continuous speech in which sentence ends are not identified, path extension proceeds continuously, or for a predetermined number of words chosen by the system user.
F1b Auditory model and its realization in the acoustic processor of the speech recognition system (FIGS. 10 to 16)
FIG. 10 shows a specific embodiment of acoustic processor 1100, as described above.
An acoustic wave input (for example, natural speech) enters A/D converter 1102, which samples it at a predetermined rate. A typical sampling rate is one sample every 50 microseconds. Time window generator 1104 is provided to shape the edges of the digital signal. The output of time window generator 1104 enters FFT (fast Fourier transform) device 1106, which provides a frequency spectrum output for each time window.
The output of FFT device 1106 is then processed to generate labels $L_1 L_2 \ldots L_f$. Feature selection device 1108, cluster device 1110, prototype device 1112, and labeler 1114 cooperate to generate the labels. In generating labels, prototypes are defined as points (or vectors) in a space based on selected features. The acoustic input is characterized by the same selected features, providing corresponding points (or vectors) in the space that can be compared with the prototypes.
Specifically, in defining the prototypes, cluster device 1110 groups sets of points into clusters. Methods of forming clusters are based on probability distributions (such as the Gaussian distribution) applied to speech. The prototype of each cluster (relating to the centroid or another characteristic of the cluster) is generated by prototype device 1112. The generated prototypes and the acoustic input, both characterized by the same selected features, enter labeler 1114. Labeler 1114 performs a comparison procedure that results in a label being assigned to a particular acoustic input.
Incidentally, methods of assigning labels to acoustic input have also been devised for applications other than speech recognition. Such methods and their labelers can therefore generally be used in a speech recognition system.
Selecting appropriate features is a key factor in deriving labels that represent the acoustic (speech) wave input. The acoustic processor described here includes improved feature selection device 1108. According to this acoustic processor, an auditory model is derived and applied in the acoustic processor of a speech recognition system. The auditory model is explained with reference to FIG. 11.
FIG. 11 shows part of the human inner ear. Specifically, it shows inner hair cell 1200 and end portions 1202 extending from it into fluid-containing channel 1204. Upstream of inner hair cell 1200, it also shows outer hair cells 1206 and end portions 1208 extending into channel 1204. Nerves that carry information to the brain are connected to inner hair cell 1200 and outer hair cells 1206. In particular, neurons undergo electrochemical changes that cause electrical pulses to be carried along a nerve to the brain for processing. These electrochemical changes are stimulated by the mechanical motion of basilar membrane 1210.
Basilar membrane 1210 serves as a frequency analyzer for acoustic wave input.
It is conventionally known that portions along basilar membrane 1210 respond to respective critical frequency bands. The fact that different portions of basilar membrane 1210 respond to corresponding frequency bands affects the loudness perceived for an acoustic waveform input. That is, tones are perceived as louder when two tones lie in different critical frequency bands than when two tones of similar power occupy the same frequency band. Basilar membrane 1210 is known to define on the order of 22 critical frequency bands.
To match the frequency response of basilar membrane 1210, the present invention preferably divides the input acoustic waveform physically into some or all of the critical frequency bands and then examines the signal component of each defined critical frequency band separately. This is done by appropriately filtering the signal from FFT device 1106 (FIG. 10) so that feature selection device 1108 receives a separate signal for each critical frequency band examined.
The separate inputs are also blocked into time frames (preferably 25.6 ms) by time window generator 1104. Feature selection device 1108 therefore preferably includes 22 signals, each of which represents the sound intensity in a given frequency band for one time frame after another.
The signals are preferably filtered by conventional critical band filter 1300 of FIG. 12. They are then processed individually by equal loudness converter 1302, which accounts for the variation of perceived loudness as a function of frequency. In this regard, a first tone at a given dB level at one frequency may differ in perceived loudness from a second tone at the same dB level at a second frequency. Equal loudness converter 1302 can be based on empirical data and converts the signals in the various frequency bands so that each is measured on the same loudness scale. For example, equal loudness converter 1302 can map acoustic power to equal loudness on the basis of the 1933 study by Fletcher and Munson, with some modifications. FIG. 13 shows the results of that study as modified. According to FIG. 13, a 1 kHz tone at 40 dB corresponds in loudness level to a 100 Hz tone at 60 dB.
Equal loudness converter 1302 adjusts loudness according to the curves shown in FIG. 13 to produce equal loudness regardless of frequency.
In addition to the dependence on frequency, an examination of a single frequency in FIG. 13 shows that changes in power do not correspond to changes in loudness. That is, variations in sound intensity, or amplitude, are not reflected at all points by similar changes in perceived loudness. For example, at a frequency of 100 Hz, the perceived loudness change for a 10 dB change near 110 dB is much larger than the perceived loudness change for a 10 dB change near 20 dB.
Processed by device 1304. Volume compression device 13
04 is the volume amplitude measurement value in phon units.
By replacing the power P with its cube root
P^1/3can be compressed into Figure 14 shows known phone pairs determined empirically.
Shows the relationship between Thorns. With the use of sone units,
The model of the present invention is almost correct even with large audio signal amplitude.
Maintain a solid state. 1 sone is a tone of 1KHz.
The volume level is specified as 40dB. FIG. 12 shows a new time-varying response device 13.
06 is shown. This device has each critical frequency
Band-related volume equalization and volume compression signals
Works better. In detail, the tested frequency
For every few bands, the neural firing rate f changes each time frame.
It can be determined by The firing rate f is the acoustic process of the present invention.
It is defined as follows according to the f=(So+DL)n (1) However, n is the amount of neurotransmitter; So is the acoustic wave
Spontaneous firing related to neural firing independent of shape input
fire constant; L is the volume measurement; D is the displacement constant.
So・n is a spontaneous phenomenon that occurs regardless of the presence or absence of acoustic wave input.
Corresponds to the spontaneous neural firing rate, and DLn corresponds to the acoustic wave input.
This corresponds to the firing rate of An important point is that in the present invention, the value of n is determined by the following formula:
By having the characteristic of changing over time,
be. dn/dt=Ao−(So+Sh+DL)n (2) where Ao is the recruitment constant; Sh is the spontaneous neurotransmission
is the material attenuation constant. The new relationship shown in equation (2)
is that neurotransmitters are not produced at a certain rate at Ao.
(a) Decay (Sh・n), (b) Spontaneous firing (So・
n), and (c) neural firing due to acoustic wave input (DL・
n) is taken into consideration. these
The modeled phenomenon is located at the location shown in Figure 11.
Assume that this occurs. As is clear from equation (2), the following amounts of neurotransmitters and
and the next firing rate is at least as high as the current amount of neurotransmitter.
It is proportional to the square of the sound processor of the present invention.
This shows the fact that it is nonlinear. Sunawa
The amount of neurotransmitter in state (t+Δt) is
Amount of neurotransmitter in state (t+dn/dt・Δt)
be equivalent to. Then, n(t+Δt)=n(t)+(dn/dt)・Δt (3) holds true. Equations (1), (2) and (3) describe the operation of the time-varying signal analyzer.
represent A time-varying signal analyzer shows how the auditory system changes over time.
It is adaptive and the auditory nerve signal is non-directive with the acoustic wave input.
It shows the fact that they are linearly related.
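The time-varying response of equations (1) to (3) can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the function name `firing_rates`, the step of one centisecond, and the input loudness sequence are invented for the example, and the constants are the values the text quotes further on for R = 1.5 (So = 0.0888, Sh = 0.11111, D = 0.00666, Ao = 1).

```python
def firing_rates(loudness, so=0.0888, sh=0.11111, d=0.00666, ao=1.0, dt=1.0):
    """Apply equations (1)-(3) to a sequence of loudness values (sones).

    One loudness value per time step; dt is in centiseconds (an assumption).
    Returns the firing rate f for every step.
    """
    n = ao / (so + sh)          # start from the steady state for L = 0
    rates = []
    for level in loudness:
        f = (so + d * level) * n                    # equation (1)
        dn_dt = ao - (so + sh + d * level) * n      # equation (2)
        n = n + dn_dt * dt                          # equation (3)
        rates.append(f)
    return rates

# Hypothetical input: five steps of silence, then a constant tone of 20 sones.
rates = firing_rates([0.0] * 5 + [20.0] * 40)
print(round(rates[4], 3), round(rates[5], 3), round(rates[-1], 3))
```

With these constants the first frame of the tone produces a rate well above the level the response later settles to, and the ratio of the settled rate to the rate in silence comes out close to 1.5, which is the kind of adaptation the three equations are meant to capture.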
In this regard, the acoustic processor of the present invention provides the first model that embodies nonlinear signal processing in a speech recognition system so as to follow more closely the apparent time variations in the nervous system.
To reduce the number of unknowns in equations (1) and (2), the present invention uses the following equation:
$$S_o + S_h + D L = 1/T \qquad (4)$$
where $T$ is a measure of the time it takes for the auditory response to fall to 37% of its maximum value after an audio wave input is generated. $T$ is a function of loudness and, in the acoustic processor of the present invention, is derived from known graphs showing the decay of the response for various loudness levels. That is, when a tone of fixed loudness is generated, it first produces a response at a high level, after which the response decays toward a steady-state level with time constant $T$. With no acoustic wave input, $T = T_0$, which is about 50 milliseconds. For a loudness of $L_{max}$, $T = T_{max}$, which is about 30 milliseconds. Setting $A_o = 1$, $1/(S_o + S_h)$ is determined to be 5 centiseconds when $L = 0$. When $L$ is $L_{max}$ and $L_{max}$ = 20 sones, the following equation holds:
$$S_o + S_h + D(20) = 1/30 \qquad (5)$$
From the above data and equations, $S_o$ and $S_h$ are determined by equations (6) and (7):
$$S_o = \frac{D L_{max}}{R + (D L_{max} T_0 R) - 1} \qquad (6)$$
$$S_h = \frac{1}{T_0} - S_o \qquad (7)$$
where
$$R = \frac{f_{\text{steady state}}\big|_{L = L_{max}}}{f_{\text{steady state}}\big|_{L = 0}} \qquad (8)$$
and $f_{\text{steady state}}$ denotes the firing rate at a given loudness when $dn/dt$ is 0.
$R$ is the only variable left in the acoustic processor, so the performance of the processor is changed simply by changing $R$. That is, $R$ is a single parameter that can be adjusted to alter performance, which normally means minimizing steady-state effects relative to transient effects. Minimizing steady-state effects is desirable because inconsistent output patterns for similar speech inputs generally result from differences in frequency response, speaker differences, background noise, and distortion, all of which affect the steady-state portions of the speech signal but not the transient portions. The value of $R$ is preferably set by optimizing the error rate of the complete speech recognition system. The optimum value found in this way is $R = 1.5$. The values of $S_o$ and $S_h$ are then 0.0888 and 0.11111 respectively, and the value of $D$ is 0.00666.
FIG. 15 is a flowchart showing the operation of the acoustic processor according to the present invention.
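The constants quoted above for R = 1.5 can be checked numerically. The sketch below is an illustration only: the function name `model_constants` is invented, time is taken in centiseconds (T0 = 5, Tmax = 3, the unit in which the quoted values come out), and D is obtained by subtracting equation (4) at L = 0 from equation (4) at L = Lmax, a step the text does not spell out.

```python
def model_constants(t0=5.0, tmax=3.0, lmax=20.0, r=1.5):
    """Return (So, Sh, D) from equations (4), (6) and (7).

    t0 and tmax are the time constants in centiseconds (assumed unit)
    for L = 0 and L = Lmax; lmax is in sones.
    """
    # Equation (4) at L = Lmax minus equation (4) at L = 0 gives D * Lmax.
    d = (1.0 / tmax - 1.0 / t0) / lmax
    so = d * lmax / (r + d * lmax * t0 * r - 1.0)   # equation (6)
    sh = 1.0 / t0 - so                              # equation (7)
    return so, sh, d

so, sh, d = model_constants()
print(f"So={so:.4f}  Sh={sh:.4f}  D={d:.5f}")
```

Under these assumptions the results agree with the quoted 0.0888, 0.11111 and 0.00666 to within rounding, and the steady-state firing rate at Lmax divided by the rate at L = 0 reproduces R.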
Digitized speech, preferably sampled at 20 kHz, in a 25.6 ms time frame passes through Hanning window 1320, and its output is preferably subjected to a double Fourier transform at DFT 1322 at 10 ms intervals. The transform output is filtered at block 1324 to provide a power density output for each of at least one frequency band (preferably all the critical frequency bands, or at least 20 of them). Next, at block 1326, the power density is converted from log magnitude to loudness level. This can readily be done with the modified graph of FIG. 13. The process that follows (including the threshold updating of block 1330) is outlined in FIG. 16.
In FIG. 16, threshold of feeling $T_f$ and threshold of hearing $T_h$ are first set to 120 dB and 0 dB respectively for each filtered frequency band m (block 1340). The speech counter, the total frame register, and the histogram register are then reset (block 1342).
Each histogram contains bins, each of which represents the number of samples, or counts, during which the power or a similar measure (in a given frequency band) lies in a respective range. In the present invention, a histogram preferably represents, for each given frequency band, the number of centiseconds during which the loudness lies in each of a plurality of loudness ranges. For example, in the third frequency band there may be 20 centiseconds between 10 dB and 20 dB. Similarly, in the 20th frequency band there may be 150 centiseconds, out of a total of 1000 centiseconds, between 50 dB and 60 dB. Percentiles are derived from the total number of samples (that is, centiseconds) and the counts contained in the bins.
At block 1344, a frame from the filter output of each frequency band is examined, and at block 1346 the bins in the appropriate histograms (one per filter) are incremented. At block 1348, the total number of bins in which the amplitude exceeds 55 dB is summed for each filter (that is, frequency band), and the number of filters indicating the presence of speech is determined. If, at block 1350, fewer than a minimum number of filters (for example, 6 of 20) indicate speech, the next frame is examined at block 1344. If enough filters indicate the presence of speech, the speech counter is incremented at block 1352. The speech counter is incremented until 10 seconds of speech have occurred at block 1354, whereupon new values of $T_f$ and $T_h$ are determined for each filter at block 1356.
The new values of $T_f$ and $T_h$ for a given filter
are determined as follows. For $T_f$, the dB value of the bin holding the 35th sample from the top of the 1000 bins (that is, the 96.5th percentile of loudness) is defined as $BIN_H$, and $T_f$ is set to $T_f = BIN_H + 40$ dB. For $T_h$, the dB value of the bin holding the (0.01)(total number of bins − speech count)th value from the lowest bin is defined as $BIN_L$. That is, $BIN_L$ is the bin in the histogram at 1% of the number of samples, excluding the samples classified as speech. $T_h$ is defined as $T_h = BIN_L - 30$ dB.
At blocks 1330 and 1332 of FIG. 15,
the sound amplitudes are converted to sones and compressed on the basis of the updated thresholds, as described above. An alternative method of deriving sones and compressing is to take the filter amplitude $a$ (after the bins have been incremented) and convert it to dB with the following equation:
$$a^{dB} = 20 \log_{10}(a) - 10 \qquad (9)$$
Each filter amplitude is then compressed into the range between 0 dB and 120 dB to give equal loudness:
$$a^{eql} = \frac{120\,(a^{dB} - T_h)}{T_f - T_h} \qquad (10)$$
$a^{eql}$ is then preferably converted from a loudness level (in phons) to an approximation of loudness in sones, with a 1 kHz signal at 40 dB mapping to 1, by the following equation:
$$L^{dB} = \frac{a^{eql} - 30}{4} \qquad (11)$$
The approximate loudness in sones, $L_s$, is then given by
$$L_s = 10^{L^{dB}/20} \qquad (12)$$
At block 1334, $L_s$ is supplied as the input to equations (1) and (2), and block 1355 determines the output firing rate $f$ for each frequency band. With 22 frequency bands, a 22-dimensional vector characterizes the acoustic wave input over successive time frames. Generally, however, 20 frequency bands are examined by using a conventional mel-scaled filter bank.
Block 1336 processes the next time frame.
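Equations (9) to (12) above amount to a short chain of conversions, sketched here in Python. The function name `approx_sones` and the sample amplitudes are assumptions for illustration; the default thresholds are the initial values of 120 dB and 0 dB given for the threshold of feeling and the threshold of hearing.

```python
import math

def approx_sones(a, t_f=120.0, t_h=0.0):
    """Map a filter amplitude to an approximate loudness in sones."""
    a_db = 20.0 * math.log10(a) - 10.0            # equation (9)
    a_eql = 120.0 * (a_db - t_h) / (t_f - t_h)    # equation (10)
    l_db = (a_eql - 30.0) / 4.0                   # equation (11)
    return 10.0 ** (l_db / 20.0)                  # equation (12)

# Hypothetical amplitudes; any positive value can be passed.
for amplitude in (100.0, 1000.0):
    print(amplitude, approx_sones(amplitude))
```

In the flow described in the text, the default thresholds would be replaced by the per-filter values of T_f and T_h once they have been updated from the histograms.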
At block 1337, the "next state" of $n$ is determined according to equation (3).
The acoustic processor described above can be improved for applications in which the firing rate $f$ and the amount of neurotransmitter $n$ have a large DC pedestal. That is, when the dynamic range of the terms in the equations for $f$ and $n$ is important, the following equations are derived to reduce the pedestal height.
In the steady state with no acoustic wave input signal present ($L = 0$), equation (2)
can be solved for the steady-state internal state $n'$:
$$n' = \frac{A_o}{S_o + S_h} \qquad (13)$$
The internal state of the amount of neurotransmitter, $n(t)$, is expressed as a steady-state part and a varying part:
$$n(t) = n' + n''(t) \qquad (14)$$
Combining equations (1) and (14) gives the following expression for the firing rate:
$$f(t) = (S_o + D L)\,(n' + n''(t)) \qquad (15)$$
The term $S_o n'$ is a constant, whereas all the other terms contain either the varying part of $n$ or the input signal represented by $D L$. Because subsequent processing involves only the squared differences between output vectors, the constant term can be ignored. From equations (15) and (13), the following equation is obtained:
$$f''(t) = (S_o + D L)\,n''(t) + \frac{D L A_o}{S_o + S_h} \qquad (16)$$
Considering equation (3), the "next state" is as follows:
$$n(t + \Delta t) = n'(t + \Delta t) + n''(t + \Delta t) \qquad (17)$$
$$n(t + \Delta t) = n' + n''(t) + \bigl[A_o - (S_o + S_h + D L)\,(n' + n''(t))\bigr]\,\Delta t \qquad (18)$$
$$n(t + \Delta t) = n' + n''(t) - \Bigl[S_h\,n''(t) + (S_o + D L)\,n''(t) + \frac{A_o D L}{S_o + S_h} - A_o + \frac{S_o A_o + S_h A_o}{S_o + S_h}\Bigr]\,\Delta t \qquad (19)$$
If all constant terms are ignored, equation (19) becomes
$$n''(t + \Delta t) = n''(t)\,(1 - S_h\,\Delta t) - f''(t)\,\Delta t \qquad (20)$$
Equations (15) and (20) are applied every 10 ms.
Output expression applied to each filter during the time frame
and configure the state update expression. Using these expressions
The result is a vector of 20 elements every 10 ms,
Each element of this vector is scaled by mel
each frequency in the filter bank
Corresponds to the firing rate of the band. Regarding the embodiment described above, the flowchart in FIG.
Special case of fire rate f and “next state” n(t+Δt)
Equations (11) and (16) that define the equations of
Therefore, we can put the expressions of f, dn/dt and n(t+Δt).
This applies except for changing it. A unique value for each equation term (i.e., t₀=
5csec, t_L=3csec, Ao=1, R=1.5 and L_nax
= 20) can be set to other values, So, Sh
and D terms are set to different values when other terms are set to different values.
Then, the respective desired values are 0.0888, 0.11111,
and will be a different value from 0.00666. The present invention can be applied to various software or hardware.
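As a rough illustration only, the output and state-update equations can be applied frame by frame to one filter band. The sketch below is in Python, which is not the language of the embodiment. It assumes that equation (16) reads f″(t) = (So + D・L)・n″(t) + (D・L・A)/(So + Sh), that Δt is one frame, and that the input L and the constant A are supplied by the caller; the name `update_band` and its arguments are hypothetical.

```python
def update_band(n_var, L, A, So=0.0888, Sh=0.11111, D=0.00666, dt=1.0):
    """Apply one time frame of equations (16) and (20) to one filter band.

    n_var : varying part n''(t) of the internal state
    L     : input signal level for this band in this frame
    A     : the constant A of equation (13) (assumed to be given)
    Returns (f_var, n_var_next).
    """
    # Equation (16): firing rate with the constant term So*n' dropped.
    f_var = (So + D * L) * n_var + (D * L * A) / (So + Sh)
    # Equation (20): state update with constant terms ignored.
    n_var_next = n_var * (1.0 - So * dt) - f_var
    return f_var, n_var_next


def update_frame(states, levels, A):
    """Apply update_band to every band; returns (firing_rates, new_states)."""
    pairs = [update_band(n, L, A) for n, L in zip(states, levels)]
    return [f for f, _ in pairs], [n for _, n in pairs]
```

Calling `update_frame` with 20 states and 20 band levels would produce the 20-element output vector for one 10 ms frame under these assumptions.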
F1c Precision matching (Fig. 6, Fig. 17)

Figure 6 shows a precision matching phoneme machine 2000 as an example. Each precision matching phoneme machine is a probabilistic finite-state machine characterized by:
(a) a plurality of states S_i;
(b) a plurality of transitions tr(S_j|S_i), some between different states and some from a state back to itself, each transition having a corresponding probability; and
(c) for each label that can be generated at a particular transition, a corresponding actual label probability.

In Figure 6, the precision matching phoneme machine 2000 has seven states S₁ to S₇ and thirteen transitions tr1 to tr13. Three of these transitions, tr11, tr12 and tr13, are drawn with dashed lines. At each of these three transitions the phoneme can change from one state to another without generating a label; such a transition is therefore called a null transition. Labels can be generated along transitions tr1 to tr10. Specifically, along each of the transitions tr1 to tr10, one or more labels may each have a distinct probability of being generated. Preferably, each transition has a probability associated with every label that the system can generate. In other words, if there are 200 labels that can be selectively generated by the acoustic channel, each transition (that is not a null transition) has 200 "actual label probabilities" associated with it, each of which corresponds to the probability that the corresponding label is generated by the phoneme at that particular transition. The actual label probabilities of transition tr1 are represented, as shown, by the symbol P followed by the bracketed column of numbers 1 to 200. Each of these numbers represents a label. For label 1, there is a probability P[1] that the precision matching phoneme machine 2000 generates label 1 at transition tr1. The various actual label probabilities are stored in association with the label and the corresponding transition.

When a string of labels y₁y₂y₃... is presented to the precision matching phoneme machine 2000 corresponding to a given phoneme, a matching procedure is performed. The procedure associated with the precision matching phoneme machine is explained with reference to Figure 17.
Figure 17 is a trellis diagram of the phoneme machine of Figure 6. As in the phoneme machine, the trellis diagram shows a null transition from state S₁ to state S₇, a transition from state S₁ to state S₂, and a transition from state S₁ to state S₄. Transitions between the other states are also shown. The horizontal direction of the trellis diagram represents time. The start-time probabilities q₀ and q₁ represent the probabilities that the phoneme has its start time at t = t₀ or t = t₁, respectively. The various transitions are shown at each of these start times. The interval between successive start (and end) times is preferably equal in length to the time interval of a label.

When the precision matching phoneme machine 2000 is used to determine how closely a given phoneme matches the labels of an incoming string, the end-time distribution of the phoneme is sought and used to determine the matching value of that phoneme. Relying on the end-time distribution to perform precision matching is common to all the phoneme machine embodiments described herein in connection with the matching procedure. In generating the end-time distribution for precision matching, the precision matching phoneme machine 2000 performs computations that are exact and complicated.
First, referring to the trellis diagram of Figure 17, consider the computation required for the phoneme to both start and end at time t = t₀. For the example phoneme machine of Figure 6, the following probability applies:

Pr(S₇, t=t₀) = q₀・T(1→7) + Pr(S₂, t=t₀)・T(2→7) + Pr(S₃, t=t₀)・T(3→7)   (21)

where Pr denotes a probability and T denotes the transition probability between the two states given in parentheses. The equation gives the respective probabilities of the three conditions under which the end time can occur at t = t₀. In this example, an end time at t = t₀ can occur only in state S₇.

Next, for the end time t = t₁, a computation must be made for every state other than S₁. State S₁ begins at the end time of the preceding phoneme. For convenience of explanation, only the computation for state S₄ is shown:

Pr(S₄, t=t₁) = Pr(S₁, t=t₀)・T(1→4)・Pr(y₁|1→4) + Pr(S₄, t=t₀)・T(4→4)・Pr(y₁|4→4)   (22)

Equation (22) states that the probability that the phoneme machine is in state S₄ at time t = t₁ is the sum of two terms:
(a) the probability of being in state S₁ at time t = t₀, multiplied by the transition probability from state S₁ to state S₄, and further multiplied by the probability that the given label y₁ of the string is generated on the transition from state S₁ to state S₄; and
(b) the probability of being in state S₄ at time t = t₀, multiplied by the transition probability from state S₄ to itself, and further multiplied by the probability that the given label y₁ is generated on the transition from state S₄ to itself.

Similar computations are made for the other states (excluding state S₁) to generate the corresponding probabilities that the phoneme is in a particular state at time t = t₁. In general, to determine the probability of being in a target state at a given time, precision matching:
(a) identifies each previous state that has a transition leading to the target state, together with the probability of each such previous state;
(b) identifies, for each such previous state, a value representing the probability of the label that must be generated on the transition between that previous state and the current state in order to conform to the label string; and
(c) combines the probability of each previous state with the corresponding label probability value to give the target-state probability over the corresponding transition.
The overall probability of being in the target state is determined from the target-state probabilities over all transitions leading to it.
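The general step (a) to (c) can be sketched as a single trellis update. This is a minimal Python illustration, not the APAL implementation of the embodiment; the function name `advance`, the dictionary layout and the numeric values used to exercise it are hypothetical. Null transitions are assumed to be listed in an order in which each source state is already final when it is used.

```python
def advance(prev, label, trans, label_prob, nulls=()):
    """Compute Pr(state, t) for every state from Pr(state, t-1).

    prev       : dict state -> probability of being in that state at t-1
    label      : the label of the string consumed between t-1 and t
    trans      : dict (i, j) -> T(i->j) for label-producing transitions
    label_prob : dict (i, j) -> dict label -> Pr(label | i->j)
    nulls      : sequence of ((i, j), T(i->j)) null transitions, which
                 move probability between states at the same time t
    """
    cur = {}
    # Steps (a)-(c): each previous state contributes its probability times
    # the transition probability times the label probability.
    for (i, j), t_ij in trans.items():
        contrib = prev.get(i, 0.0) * t_ij * label_prob[(i, j)].get(label, 0.0)
        cur[j] = cur.get(j, 0.0) + contrib
    # Null transitions consume no label, so they act within time t.
    for (i, j), t_ij in nulls:
        cur[j] = cur.get(j, 0.0) + cur.get(i, 0.0) * t_ij
    return cur
```

With only the transitions 1→4 and 4→4 supplied, the value returned for state 4 is exactly the two-term sum of equation (22).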
The computation for state S₇ includes terms for the three null transitions, which allow the phoneme to start and end at time t = t₁ with the phoneme ending in state S₇. As with the probability determinations for times t = t₀ and t = t₁, probabilities are preferably determined for a series of other end times so as to form an end-time distribution. The value of the end-time distribution of a given phoneme indicates how well that phoneme matches the incoming labels.

To determine how well a word matches the incoming labels, the phonemes that represent the word are processed in sequence. Each phoneme generates an end-time distribution of probability values. The matching value of a phoneme is obtained by summing its end-time probabilities and taking the logarithm of that sum. The start-time distribution of the next phoneme is derived by normalizing the end-time distribution, for example by dividing each of its values by their sum so that the scaled values sum to 1.
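The two operations just described, taking the matching value of a phoneme and deriving the next start-time distribution, can be sketched as follows. This is an illustrative Python fragment with hypothetical function names; base-10 logarithms are assumed, as in the matching-value expressions given for high-speed matching.

```python
import math


def matching_value(end_dist):
    """Logarithm of the summed end-time probabilities of one phoneme."""
    return math.log10(sum(end_dist))


def next_start_distribution(end_dist):
    """Normalize an end-time distribution so that its values sum to 1."""
    total = sum(end_dist)
    return [v / total for v in end_dist]


def word_matching_score(end_dists):
    """Sum of the matching values of the successive phonemes of a word."""
    return sum(matching_value(d) for d in end_dists)
```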
There are at least two ways to determine h, the number of phonemes to be examined for a given word or word string. In the depth-first method, the computation proceeds along a basic form, computing a running subtotal with each successive phoneme. When this subtotal is found to fall below a predetermined threshold at a given phoneme position, the computation ends. In the alternative, breadth-first method, the computation is made for similar phoneme positions in each word: the first phoneme of each word, then the second phoneme of each word, and so on. In the breadth-first method, the values computed along the same number of phonemes in the various words are compared at the same relative phoneme position. In either method, the word with the largest sum of matching values is the target word.
Precision matching is implemented in APAL (Array Processor Assembly Language), the native assembler of the Floating Point Systems, Inc. 190L. Precision matching requires considerable memory to store the actual label probabilities (i.e., the probability that a given phoneme generates a given label y at a given transition), the transition probabilities of each phoneme machine, and the probabilities that a given phoneme is in a given state at a given time after the start time. The 190L is set up to compute the end times; the matching values, preferably based on the logarithm of the sum of the end-time probabilities; the start times, based on the previously generated end-time probabilities; and the word matching scores, based on the matching values of the sequential phonemes in a word. In addition, precision matching preferably takes tail probabilities into account in the matching procedure. A tail probability measures the likelihood of successive labels independently of words. In a simple embodiment, a given tail probability corresponds to the likelihood that one label follows another. This likelihood is easily determined from, for example, strings of labels generated from sample speech.

Precision matching is therefore provided with enough storage to hold the basic forms, the Markov model statistics and the tail probabilities. For a 5000-word vocabulary in which each word contains about 10 phonemes, the basic forms require 5000 × 10 storage locations. With 70 distinct phonemes (with a Markov model for each phoneme), 200 distinct labels, and 10 transitions at which any label has a probability of being generated, the statistics would require 70 × 10 × 200 storage locations. However, the phoneme machines are preferably divided into three parts (a start part, a middle part and an end part), with the statistics tables corresponding to these parts. (The three self-loops are preferably included in the successive parts.) The storage requirement is thereby reduced to 70 × 3 × 200. The tail probabilities require 200 × 200 storage locations. With this arrangement, 50K integer storage locations (5000 × 10 = 50,000) and 82K floating-point storage locations (70 × 3 × 200 + 200 × 200 = 82,000) are sufficient.

Whereas earlier systems included 70 different phonemes, the present invention provides approximately 96 phonemes, each with its own phoneme machine.

F1d Basic high-speed matching (Figs. 18 to 20)
Because precision matching is computationally expensive, basic high-speed matching and alternative high-speed matching are provided to reduce the required computation without sacrificing much accuracy. High-speed matching is preferably used together with precision matching: high-speed matching selects likely candidate words from the vocabulary and places them on a list, and precision matching is performed on, at most, the candidate words on this list.

The fast approximate acoustic matching technique is described in the above-mentioned U.S. Patent Application No. 06/672974 (filed November 19, 1984). In fast approximate acoustic matching, each phoneme machine is preferably simplified by replacing the actual label probability of each label at all transitions in the given phoneme machine with a specific replacement value. The replacement value is preferably chosen so that the matching value of a given phoneme obtained with the replacement values overestimates the matching value obtained by precision matching, in which the actual label probabilities are not replaced. One way to guarantee this condition is to choose each replacement value so that no probability corresponding to the given label in the given phoneme machine is larger than its replacement value. Replacing the actual label probabilities in a phoneme machine with the corresponding replacement values greatly reduces the amount of computation needed to determine a word matching score. Moreover, because the replacement value is preferably an overestimate, the resulting matching score is not less than the score that would have been determined without the replacement.
In a specific embodiment that performs acoustic matching in a linguistic decoder with Markov models, each phoneme machine is characterized, through training, as having:
(a) a plurality of states and transition paths between the states;
(b) transitions tr(S_j|S_i) having probabilities T(i→j), each of which represents the probability of a transition to state S_j given the current state S_i (where S_i and S_j may be the same state or different states); and
(c) actual label probabilities, each actual label probability p(y_k|i→j) representing the probability that label y_k is generated by the given phoneme machine at a given transition from one state to the next (k being a notation that identifies the label).

Each phoneme machine includes:
(a) means for assigning one specific value p′(y_k) to each y_k in that phoneme machine; and
(b) means for replacing each actual output probability p(y_k|i→j) at each transition in the given phoneme machine with the one specific value p′(y_k) assigned to the corresponding y_k.

The replacement value is preferably at least as large as the maximum actual label probability of the corresponding label y_k at any transition in the particular phoneme machine. The high-speed matching embodiments are used to form a list of about 10 to 100 candidate words selected as the words in the vocabulary most likely to correspond to the incoming labels. The candidate words are preferably submitted to the language model and to precision matching. By cutting the number of words considered in precision matching down to about 1% of the words in the vocabulary, the computational cost is greatly reduced while accuracy is maintained.
Basic high-speed matching simplifies precision matching by replacing, with a single value, the actual label probabilities of a given label at all transitions in a given phoneme machine at which that label can be generated. That is, regardless of the transition in the given phoneme machine at which the label has a probability of occurring, that probability is replaced by one specific value. This value is preferably an overestimate, at least as large as the largest probability of the label occurring at any transition in the given phoneme machine. Setting the label probability replacement value to the maximum of the actual label probabilities of the given label in the given phoneme machine guarantees that the matching value generated by basic high-speed matching is at least as large as the matching value that would result from precision matching. Basic high-speed matching thus generally overestimates the matching value of each phoneme, so that more words are generally selected as candidate words. Words that precision matching would consider candidates also pass basic high-speed matching.
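Choosing the replacement value as the per-label maximum over all transitions of a phoneme machine can be sketched as below. This is an illustrative Python fragment; the function name and the data layout (one list of label probabilities per non-null transition) are assumptions made for the example.

```python
def replacement_values(label_prob):
    """Return p'(y_k) for every label k of one phoneme machine.

    label_prob : dict transition -> list of actual label probabilities,
                 one entry per label, for each non-null transition.
    Each replacement value is the largest actual probability of that
    label at any transition, so it never underestimates the actual value.
    """
    per_transition = list(label_prob.values())
    n_labels = len(per_transition[0])
    return [max(probs[k] for probs in per_transition) for k in range(n_labels)]
```

Because each entry is a maximum, the replacement values of one phoneme machine generally sum to more than 1, which reflects the deliberate overestimation described above.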
Figure 18 shows the basic high-speed matching phoneme machine 3000. Labels (also called symbols and fenemes) enter the basic high-speed matching phoneme machine 3000 together with a start-time distribution. The start-time distribution and the label string are input in the same way as for the precision matching phoneme machine described above. On occasion, the start time is not a distribution over several times but instead represents a precise time at which the phoneme begins, for example following an interval of silence. When speech is continuous, however, the end-time distribution is used to form the start-time distribution (as explained in more detail later). The basic high-speed matching phoneme machine 3000 generates an end-time distribution and, from the generated end-time distribution, a matching value for the given phoneme. The word matching score is defined as the sum of the matching values of the constituent phonemes (at least the first h phonemes of the word).
Figure 19 shows the basic high-speed matching computation. This computation is concerned only with the start-time distribution, the number (or length) of labels generated by the phoneme, and the replacement value p′(y_k) associated with each label y_k. By replacing all the actual label probabilities of a given label in a given phoneme machine with the corresponding replacement value, basic high-speed matching replaces the transition probabilities with length-distribution probabilities, and it is no longer necessary to include the actual label probabilities (which can differ for each transition in a given phoneme machine) or the probabilities of being in a given state at a given time.

The length distribution is determined from the precision matching model. Specifically, for each length in the length distribution, the procedure preferably examines each state individually and determines, for each state, the transition paths by which the state under examination can be reached (a) given a particular label length and (b) regardless of the outputs along the transitions. The probabilities of all transition paths of the particular length to each target state are summed, and the sums for all target states are then added to give the probability of the given length in the distribution. These steps are repeated for each length. In the preferred form of the matching procedure, these computations are made with reference to a trellis diagram, as is known in the art of Markov modelling. For transition paths that share a branch along the trellis structure, the computation for each common branch need be made only once, and the result is applied to each path that contains the common branch.
Figure 19 includes two restrictions by way of example. First, the number of labels generated by the phoneme is assumed to be 0, 1, 2 or 3, with respective probabilities l₀, l₁, l₂ and l₃. Second, the start time is also restricted: only four start times are permitted, with respective probabilities q₀, q₁, q₂ and q₃. That is, L(l₀, l₁, l₂, l₃) and Q(q₀, q₁, q₂, q₃) are assumed. Under these restrictions, the end-time distribution of the target phoneme is defined by the following equations:

Φ₀ = q₀l₀
Φ₁ = q₁l₀ + q₀l₁p₁
Φ₂ = q₂l₀ + q₁l₁p₂ + q₀l₂p₁p₂
Φ₃ = q₃l₀ + q₂l₁p₃ + q₁l₂p₂p₃ + q₀l₃p₁p₂p₃
Φ₄ = q₃l₁p₄ + q₂l₂p₃p₄ + q₁l₃p₂p₃p₄
Φ₅ = q₃l₂p₄p₅ + q₂l₃p₃p₄p₅
Φ₆ = q₃l₃p₄p₅p₆

Examining these equations shows that Φ₃ contains a term for each of the four start times. The first term represents the probability that the phoneme starts at time t = t₃ and generates zero labels (the phoneme ends at the same time as it begins). The second term represents the probability that the phoneme starts at time t = t₂, that the label length is 1, and that label 3 is generated by the phoneme. The third term represents the probability that the phoneme starts at time t = t₁, that the label length is 2 (namely labels 2 and 3), and that labels 2 and 3 are generated by the phoneme. Similarly, the fourth term represents the probability that the phoneme starts at time t = t₀, that the label length is 3, and that the three labels 1, 2 and 3 are generated by the phoneme.
Comparing the computations required for basic high-speed matching with those required for precision matching shows that the former are relatively simple. The value p′(y_k) remains the same wherever it appears in the equations, as do the label length probabilities. In addition, the restrictions on length and start time make the computations for the later end times simpler. For example, for Φ₆ the phoneme must start at time t = t₃, and all three labels 4, 5 and 6 must be generated by the phoneme for that end time to apply.

To generate the matching value of the target phoneme, the end-time probabilities along the given end-time distribution are summed. If desired, the logarithm of the sum is taken, as in the following expression:

Matching value = log₁₀(Φ₀ + … + Φ₆)

As mentioned above, the word matching score is easily determined by summing the matching values of the consecutive phonemes in a given word.
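The end-time equations Φ₀ to Φ₆ all follow one pattern: a start at time s with n labels generated contributes q_s・l_n・p_(s+1)…p_(s+n) to the end time s+n. The Python sketch below is an illustration of that pattern under the two restrictions of Figure 19; the function names are hypothetical, and p[i] is assumed to hold the replacement value p′ of the label observed at time i (p[0] is unused).

```python
import math


def end_time_distribution(q, l, p):
    """Basic high-speed matching end-time distribution.

    q : start-time probabilities q[0..]
    l : label length probabilities l[0..]
    p : p[i] is the replacement value of the label at time i (p[0] unused)
    Returns the list [Phi_0, Phi_1, ...] of length len(q) + len(l) - 1.
    """
    phi = [0.0] * (len(q) + len(l) - 1)
    for s, q_s in enumerate(q):
        product = 1.0  # running product p[s+1] * ... * p[s+n]
        for n, l_n in enumerate(l):
            if n > 0:
                product *= p[s + n]
            phi[s + n] += q_s * l_n * product
    return phi


def matching_value(phi):
    """Base-10 logarithm of the summed end-time probabilities."""
    return math.log10(sum(phi))
```

With four start times and lengths 0 to 3, the returned list has seven entries, matching Φ₀ to Φ₆ term by term.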
The generation of the start-time distribution is explained next with reference to Figure 20. In Figure 20a, the word THE₁ is shown again, broken down into its constituent phonemes. In Figure 20b, the label string is shown along the time axis. Figure 20c shows the first start-time distribution. This first start-time distribution is derived from the end-time distribution of the most recent preceding phoneme (in the preceding word, which may be a "word" of silence). Based on the label input and the start-time distribution of Figure 20c, the end-time distribution Φ_DH of the phoneme DH is generated (Figure 20d). The start-time distribution of the next phoneme, UH1, is determined by identifying the times at which the end-time distribution of the preceding phoneme exceeds the threshold A in Figure 20d. A is determined individually for each end-time distribution. A is preferably a function of the sum of the end-time distribution values of the target phoneme. The interval between time a and time b thus represents the time over which the start-time distribution of the phoneme UH1 is set. In Figure 20e, the interval between time c and time d corresponds to the times at which the end-time distribution of the phoneme DH exceeds the threshold A and over which the start-time distribution of the next phoneme is set. The values of the start-time distribution are obtained by normalizing the end-time distribution, for example by dividing each end-time value by the sum of the end-time values that exceed the threshold A.
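The thresholded normalization can be sketched as follows. This Python fragment is illustrative only: the text states that A is a function of the sum of the end-time values but does not give that function, so the threshold is simply passed in by the caller, and the function name is hypothetical.

```python
def start_from_end(end_dist, threshold):
    """Derive the next phoneme's start-time distribution.

    Only end-time values that exceed the threshold are kept; the kept
    values are divided by their sum so that the result sums to 1.
    Times at or below the threshold get a start probability of 0.
    """
    kept = [v if v > threshold else 0.0 for v in end_dist]
    total = sum(kept)
    if total == 0.0:
        raise ValueError("no end-time value exceeds the threshold")
    return [v / total for v in kept]
```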
The basic high-speed matching phoneme machine 3000 is implemented with an APAL program on the Floating Point Systems, Inc. 190L. Following the description herein, specific forms of the invention can also be developed using other hardware and software.

F1e Alternative high-speed matching (Figs. 21 and 22)
Basic high-speed matching, used alone or preferably together with precision matching and the language model, greatly reduces the computational requirements. To reduce them further, the present invention simplifies precision matching further by forming a label length distribution that is uniform between two lengths, a minimum length L_min and a maximum length L_max. In basic high-speed matching, the probabilities that a phoneme generates labels of a given length (namely l₀, l₁, l₂, and so on) generally have different values. In alternative high-speed matching, the probability of each label length is replaced with one uniform value.

The minimum length is preferably equal to the smallest length that has a non-zero probability in the original length distribution, although other lengths can be selected. The choice of the maximum length is more arbitrary than the choice of the minimum length, but it is significant in that the probabilities of lengths smaller than the minimum and greater than the maximum are set to 0. By defining the length probability to exist only between the minimum and maximum lengths, a uniform pseudo-distribution can be set. In one approach, the uniform probability is set to the average probability over the pseudo-distribution. Alternatively, the uniform probability is set to the maximum of the length probabilities that the uniform value replaces.

The effect of making all label length probabilities equal is readily seen from the equations given above for the end-time distribution in basic high-speed matching. Specifically, the length probability can be factored out as a constant. With L_min set to 0 and all length probabilities replaced by a single constant value, the end-time distribution is expressed as:

Θ_m = Φ_m / l = q_m + Θ_(m−1)・p_m   (23)

where l is the single uniform replacement value, and the value of p_m preferably corresponds to the replacement value of the given label generated at time m in the given phoneme. For this expression for Θ_m, the matching value is defined as:

Matching value = log₁₀(Θ₀ + Θ₁ + … + Θ_m) + log₁₀(l)   (24)

Comparing basic high-speed matching with alternative high-speed matching shows that using the alternative high-speed matching phoneme machine greatly reduces the number of additions and multiplications required. With L_min = 0, basic high-speed matching required 40 multiplications and 20 additions because the length probabilities must be considered, whereas in alternative high-speed matching Θ_m is determined recursively, so that each successive Θ_m requires only one multiplication and one addition.
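The recursion for the end-time distribution in alternative high-speed matching can be sketched in a few lines. This Python illustration assumes L_min = 0, an unbounded maximum length, and the recurrence Θ_m = q_m + Θ_(m−1)・p_m; the function names are hypothetical, and p[m] is assumed to hold the replacement value of the label at time m (p[0] is unused).

```python
import math


def theta_distribution(q, p, n_times):
    """Return [Theta_0, ..., Theta_(n_times-1)].

    q : start-time probabilities; times past len(q) have probability 0
    p : p[m] is the replacement value of the label at time m (p[0] unused)
    Each step costs one multiplication and one addition.
    """
    thetas = []
    previous = 0.0
    for m in range(n_times):
        q_m = q[m] if m < len(q) else 0.0
        current = q_m + previous * p[m]
        thetas.append(current)
        previous = current
    return thetas


def matching_value(thetas, uniform_length_prob):
    """Matching value: log10 of the summed Theta plus log10 of l."""
    return math.log10(sum(thetas)) + math.log10(uniform_length_prob)
```

Once the start-time distribution is exhausted (q_m = 0), the addition contributes nothing and each further Θ_m is a single multiplication.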
Figures 21 and 22 show in more detail how alternative high-speed matching simplifies the computation. Figure 21a shows an example of a phoneme machine 3100 corresponding to a minimum length L_min = 0. The maximum length is assumed to be infinite so that the length distribution can be treated as uniform. Figure 21b shows the trellis diagram resulting from the phoneme machine 3100. Assuming that start times after q_n are outside the start-time distribution, every determination of a successive Θ_m with m < n requires one addition and one multiplication. Determining the end times after that requires only one multiplication and no addition.

Figure 22a shows a specific embodiment of a phoneme machine 3200 for a minimum length L_min = 4, and Figure 22b shows the corresponding trellis diagram. Because L_min = 4, the trellis diagram of Figure 22b has zero probability along the paths marked U, V, W and Z. For end times between Θ₄ and Θ_n, four multiplications and one addition are required. For end times greater than n + 4, only one multiplication and no addition is required. This embodiment is implemented in APAL code on the FPS 190L.

Additional states can be added to the embodiments of Figure 21 or Figure 22 as desired. For example, any number of states with null transitions can be included without changing the value of L_min.

F1f Matching based on the first J labels (Fig. 22)
To refine basic high-speed matching and alternative high-speed matching further, only the first J labels of the string entering a phoneme machine are considered in the match. Assuming that labels are produced by the acoustic processor of the acoustic channel at a rate of one label per centisecond, a reasonable value for J is 100. In other words, labels corresponding to about one second of speech are supplied to determine the match between a phoneme and the labels entering the phoneme machine. Limiting the number of labels examined gives two advantages: first, the decoding delay is reduced, and second, the problem of comparing the scores of short words with those of long words is largely avoided. The length of J can, of course, be changed as desired.

The effect of limiting the number of labels examined can be seen with reference to the trellis diagram of Figure 22b. Without this refinement, the high-speed matching score is the sum of the probabilities Θ_m along the bottom row of the diagram. That is, the probability of being in state S₄ at each time, starting at t = t₀ (for L_min = 0) or t = t₄ (for L_min = 4), is determined as a Θ_m, and all the Θ_m are then summed. For L_min = 4, the probability of being in state S₄ at any time before t₄ is 0. With the refinement, the summing of the Θ_m ends at time J. In Figure 22b, time J corresponds to time t_(n+2).

Ending the examination after J labels, over the interval up to time J, results in the sum of the following two probabilities when the matching score is determined. First, as described above, there is a row computation along the bottom row of the trellis diagram, but only up to time J−1. The probabilities of being in state S₄ at each time up to time J−1 are summed to give a row score. Second, there is a column score, which corresponds to the sum of the probabilities that the phoneme is in each of the states S₀ to S₄ at time J. The column score is computed as follows:

Column score = Σ_(f=0)^(4) Pr(S_f, J)   (25)

The phoneme matching score is obtained by summing the row score and the column score and taking the logarithm of the sum. To continue high-speed matching with the next phoneme, the values along the bottom row (preferably including time J) are used to derive the start-time distribution of the next phoneme.

After the matching score of each consecutive phoneme has been determined, the total for all the phonemes is, as mentioned above, the sum of the matching scores of all those phonemes.
Examining how the end-time probabilities are generated in the basic and alternative high-speed matching embodiments described above shows that the determination of column scores does not fit readily into the high-speed matching computation. To adapt the refinement of limiting the number of labels examined to high-speed matching and alternative matching, the present invention allows the column score to be replaced by an additional row score. That is, an additional row score is determined for the phoneme being in state S₄ (in Figure 22b) between times J and J+K, where K is the maximum number of states in any phoneme machine. Thus, if any phoneme machine has 10 states, this refinement adds 10 end times along the bottom row of the trellis diagram, and a probability is determined for each of them. All the probabilities along the bottom row up to and including the probability at time J+K are added to produce the matching score of the given phoneme. As before, the matching values of consecutive phonemes are summed to give the word matching score.

This embodiment is implemented in APAL code on the FPS 190L mentioned above but, like the other parts of the system, it can also be implemented with other code on other hardware.

F1g Phoneme tree structure and high-speed matching embodiments (Fig. 23)
(Figure 23) Basic fast matching or alternative fast matching
(with or without a max label limit
) to determine the phoneme matching value.
The calculation time required to determine the results is significantly reduced.
Furthermore, use the words in the list obtained through high-speed matching to
Even when performing dense matching, the computational complexity is
Significant savings. Once the phoneme matching value is determined,
As shown in FIG. 23, the branches of the tree structure 4100
A comparison is made along the path to determine which path of phonemes occurs most.
Determine if it is possible to break. In Figure 23, (point
4102 to branch 4104)
The phoneme DH and UH1 of the phoneme “the”
The sum of the values for each phoneme branching from the phoneme MX is
must be significantly higher than in the case of the other sequences.
Must be. By the way, the phoneme of the first phoneme MX
The matching value is calculated only once and then spread out.
used for each basic form. (branch 4104 and
and 4106. ) Furthermore, the beginning of the branch
The total score calculated along the sequence of
lower than the limit or other seams in the branch.
It can be seen that the total score is lower than Kensu's total score.
and all base forms extending from the first sequence
may be removed from the candidate words at the same time.
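The shared-prefix scoring and pruning described above can be illustrated with a minimal sketch. It is not the patented implementation: it assumes that matching values are log-domain numbers that can simply be added, that each phoneme has one fixed value (in the real system the value depends on the label string), and that a single fixed threshold is used. The names `Node`, `build_tree` and `prune`, the two invented words and all numeric values are hypothetical; only the sequence MX, DH, UH1 for "the" comes from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One arc of the phoneme tree; `words` lists basic forms ending here."""
    phoneme: str
    children: dict = field(default_factory=dict)
    words: list = field(default_factory=list)


def build_tree(basic_forms):
    """Merge basic forms that share leading phonemes into one tree."""
    root = Node("ROOT")
    for word, phonemes in basic_forms.items():
        node = root
        for phoneme in phonemes:
            node = node.children.setdefault(phoneme, Node(phoneme))
        node.words.append(word)
    return root


def prune(root, match_value, threshold):
    """Return surviving (word, score) pairs and the number of arcs scored."""
    survivors, scored = [], 0
    pending = [(child, 0.0) for child in root.children.values()]
    while pending:
        node, total = pending.pop()
        total += match_value[node.phoneme]  # looked up once per shared arc
        scored += 1
        if total >= threshold:
            survivors.extend((word, total) for word in node.words)
            pending.extend((child, total) for child in node.children.values())
        # otherwise every basic form below this arc is dropped together
    return survivors, scored


# "the" = MX DH UH1 follows the text; the other entries are invented.
basic_forms = {
    "the": ["MX", "DH", "UH1"],
    "hyp_word_1": ["MX", "DH", "EH1"],
    "hyp_word_2": ["BX", "AW1", "GX"],
}
match_value = {"MX": -0.5, "DH": -0.25, "UH1": -0.25,
               "EH1": -7.0, "BX": -9.0, "AW1": -0.5, "GX": -0.5}
survivors, scored = prune(build_tree(basic_forms), match_value, threshold=-5.0)
```

In this toy run the arc for MX is scored once although two basic forms pass through it, and the arcs below BX are never scored because the whole branch is discarded at its first phoneme.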
For example, the basic forms associated with branches 4108 through 4118 are discarded together when MX is determined not to be a promising path.

With the fast matching embodiments and the tree structure, an ordered list of candidate words is produced with a considerable saving in computation.

Regarding memory requirements, the phoneme tree structure, the phoneme statistics and the tail probabilities must be stored. The tree structure has 25,000 arcs, and four data words characterize each arc. The first data word is an index to the subsequent arcs or phonemes. The second data word represents the number of subsequent phonemes along the branch. The third data word indicates at which node of the tree structure the arc is located. The fourth data word represents the current phoneme. The tree structure therefore requires 25,000 × 4 storage locations.
Fast matching involves 100 distinct phonemes and 200 distinct fenemes. Because a feneme has a single probability of being produced anywhere in a phoneme, storage for 100 × 200 statistical probabilities is required. For the tail probabilities, 200 × 200 storage locations are required. Hence, storage for 100K integers and 60K floating-point values is sufficient for fast matching.

F1h Language model (Figure 4, Figure 24)

As mentioned above, including a language model that stores information about words in context (such as triplets of words) increases the probability of selecting the correct word. Language models have been described in the literature.

Language model 1010 (Figure 4) preferably has a distinctive character; specifically, a modified triplet (trigram) method is used. According to the invention, a sample text is examined to determine the likelihood of each ordered triplet of words, each ordered pair of words and each single word in the vocabulary. A list of the most likely word triplets and a list of the most likely word pairs are then formed. In addition, the likelihood that a triplet is not in the triplet list and the likelihood that a pair is not in the pair list are each determined.

According to the language model, when the target word follows two words, it is determined whether the target word and the two preceding words are in the triplet list.
If they are in the triplet list, the stored probability assigned to that triplet is used. If the target word and the two preceding words are not in the triplet list, it is determined whether the target word and the immediately preceding word are in the pair list. If they are, the probability of the pair is multiplied by the probability that a triplet is not in the triplet list, and the product is assigned to the target word. If neither the triplet nor the pair containing the target word is in its respective list, the probability of the target word alone is multiplied by the probability that a triplet is not in the triplet list and by the probability that a pair is not in the pair list, and the product is assigned to the target word.

Flowchart 5000 of Figure 24 shows the training of the phoneme machines used in acoustic matching. Block 5002 defines a vocabulary of words (typically on the order of 5,000 words). Each word is then represented by a sequence of phoneme machines. The phoneme machines have been shown, by way of example, as phonetic phoneme machines, but a sequence of fenemic phoneme machines may be used instead. The representation of words by a sequence of phonetic phoneme machines or by a sequence of fenemic phoneme machines is explained below. The phoneme machine sequence of a word is called a word basic form.

In block 5006, the word basic forms are arranged in a tree structure. The statistics for each phoneme machine in each word basic form are determined by training with the well-known forward-backward algorithm described in F. Jelinek, "Continuous Speech Recognition by Statistical Methods", Proceedings of the IEEE, Vol. 64, 1976 (block 5008).

In block 5009, the values to be substituted for the actual parameter values, i.e. the statistics, used in detailed matching are determined. For example, the values that replace the actual label output probabilities are determined. In block 5010, the determined values replace the actual stored probabilities, so that the phonemes in each word basic form contain the approximate substitute values. All approximations relating to basic fast matching are carried out in block 5010.

Next, in block 5011, it is decided whether the acoustic matching needs to be improved. If no improvement is required, the values determined for the basic approximate matching are set for use, and no other estimates relating to other approximations are set (block 5012). If improvement is required, processing proceeds to block 5018.
In block 5018, a uniform string length distribution is defined, and in block 5020 it is decided whether further improvement is needed. If no further improvement is needed, the label output probability values and the matching length probability values are approximated and set for use in acoustic matching. If further improvement is required, block 5022 limits acoustic matching to the first J labels of the generated string. Whether or not one of the improved embodiments is selected, the determined parameter values are set in block 5012, so that each phoneme machine in each word basic form is trained with the desired approximations, which enables fast approximate matching.

F1j Stack decoder (Figures 7 to 9, Figure 25)

Next, the preferred stack decoder used in the speech recognition system of Figure 4 is described.

Figure 7 shows a plurality of consecutive labels y₁y₂… generated at successive label intervals, i.e. label positions. Figure 8 shows a plurality of generated word paths, namely path A, path B and path C. In the context of Figure 7, path A may correspond to the entry "to be or", path B to the entry "two b", and path C to the entry "too".
For each target word path, there is a label (or, equivalently, a label interval) at which the target word path has the highest probability of having ended; such a label is called a boundary label.

For a word path W representing a sequence of words, the most likely end time, represented in the label string as the boundary label between two words, can be found by known methods such as the one described in L. R. Bahl et al., "Faster Acoustic Match Computation", IBM Technical Disclosure Bulletin, Vol. 23, No. 4, September 1980. Briefly, that paper discusses methods for addressing two similar concerns: (a) how much of the label string is accounted for by a word (or word sequence), and (b) at which label interval a partial sentence, corresponding to a part of the label string, ends.

For any given word path, there is a "likelihood value" associated with each label, i.e. each label interval, from the first label of the label string up to the boundary label. Taken together, all the likelihood values of a given word path form the "likelihood vector" of that word path, so there is a corresponding likelihood vector for each word path. Likelihood values L_t are shown in Figure 8.

The "likelihood envelope" Λ_t at label interval t for a collection of word paths W¹, W², ..., W^s is defined mathematically as follows.

$$\Lambda_t = \max\left(L_t(W^1), \ldots, L_t(W^s)\right)$$

That is, for each label interval, the likelihood envelope contains the highest likelihood value associated with any word path in the collection. The likelihood envelope 1040 is shown in the drawing.
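As a small illustration of the envelope definition above, the following sketch takes the per-interval maximum over a collection of likelihood vectors. It assumes that each vector is a plain list indexed by label interval and that a path contributes nothing beyond its boundary label; the function name `likelihood_envelope` and the numeric values are invented for the example.

```python
import math


def likelihood_envelope(likelihood_vectors, num_intervals):
    """Return the per-interval maximum over all word paths."""
    envelope = [-math.inf] * num_intervals
    for vector in likelihood_vectors:
        # A vector only covers intervals up to its path's boundary label.
        for t, value in enumerate(vector):
            envelope[t] = max(envelope[t], value)
    return envelope


# Invented log-likelihood vectors for three word paths of different lengths.
paths = {
    "A": [-1.0, -2.0, -3.0, -4.0],
    "B": [-1.5, -1.8],
    "C": [-0.9],
}
envelope = likelihood_envelope(paths.values(), num_intervals=5)
```

Intervals that no path has reached keep the value −∞, which matches the initialization described for the case in which no complete path exists.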
A word path is considered "complete" if it corresponds to a complete sentence. A complete path is preferably identified by having the speaker provide an input, e.g. pressing a button, on reaching the end of a sentence. The input is synchronized with a label interval to mark the end of the sentence. A complete word path cannot be extended by appending words to it. A partial word path corresponds to an incomplete sentence and can therefore be extended.

Partial paths are classified as "live" or "dead". A word path is "dead" when it has already been extended and "live" when it has not yet been extended. With this classification, a path that has already been extended to form one or more longer word paths is not considered for extension again at a later time.

Each word path can also be characterized as "good" or "bad" relative to the likelihood envelope. A word path is good if, at the label corresponding to its boundary label, it has a likelihood value that is within Δ of the maximum likelihood envelope. Otherwise the word path is bad. Reducing each value of the maximum likelihood envelope by a fixed value Δ gives the good/bad threshold level; a fixed Δ is desirable but not essential.

There is a stack element for each label interval.
Each live word path is assigned to the stack element corresponding to the label interval of its boundary label. A stack element may have zero, one or more word path entries, listed in order of likelihood value.

Next, the steps executed by stack decoder 1002 of Figure 4 are described.

As the flowchart of Figure 9 shows, forming the likelihood envelope and determining which word paths are good are interrelated.

In the flowchart of Figure 9, a null path is first entered into the first stack (stack 0) in block 1050. In block 1052, a (complete) stack element is provided, which contains the complete paths, if any, that were determined previously. Each complete path in the (complete) stack element has a likelihood vector associated with it. The likelihood vector of the complete path that has the highest likelihood at its boundary label initially defines the maximum likelihood envelope. If the (complete) stack element contains no complete path, the maximum likelihood envelope is initialized to −∞. The maximum likelihood envelope may also be initialized in this way when complete paths are not specified. The envelope is initialized in blocks 1054 and 1056.

After the maximum likelihood envelope is initialized, it is reduced by a predetermined amount Δ, forming a Δ-good region above the reduced likelihoods and a Δ-bad region below them. The value of Δ controls the breadth of the search: the larger Δ is, the larger the number of word paths considered for possible extension. When log₁₀ is used to determine L_t, a Δ value of 2.0 gives satisfactory results. It is desirable, but not necessary, that the value of Δ be uniform along the length of the label intervals.

If a word path has a likelihood at its boundary label that lies within the Δ-good region, the word path is marked "good". Otherwise the word path is marked "bad".

As shown in Figure 9, the loop that updates the likelihood envelope and marks word paths as "good" (extendable) or "bad" begins in block 1058 with finding the longest unmarked word path. If two or more unmarked word paths are in the stack corresponding to the longest word path length, the word path with the highest likelihood at its boundary label is selected. If a word path is found, block 1060 checks whether the likelihood at its boundary label lies within the Δ-good region. If it does not, block 1062 marks the path as being in the Δ-bad region, and block 1058 looks for another unmarked live path. If it does, block 1064 marks the path as being in the Δ-good region, and block 1066 updates the likelihood envelope to include the likelihood values of the path marked "good". That is, for each label interval, the updated likelihood value is determined as the larger of (a) the current likelihood value in the likelihood envelope and (b) the likelihood value associated with the word path marked "good". These operations are performed in blocks 1064 and 1066.
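The marking loop just described (longest unmarked path first, a Δ test at the boundary label, then an envelope update for "good" paths) might be sketched as follows. This is an illustration under stated assumptions, not the decoder itself: the `WordPath` class, the `mark_paths` function and all likelihood values are hypothetical, and a path's length is taken to be the index of its boundary label. The default Δ of 2.0 is the value the text reports for log₁₀ likelihoods.

```python
import math
from dataclasses import dataclass


@dataclass
class WordPath:
    words: tuple
    likelihoods: list  # likelihood vector; the last index is the boundary label
    mark: str = ""

    @property
    def boundary(self):
        return len(self.likelihoods) - 1


def mark_paths(live_paths, envelope, delta=2.0):
    """Mark each live path "good" or "bad" and return the updated envelope."""
    envelope = list(envelope)
    unmarked = list(live_paths)
    while unmarked:
        # Longest unmarked path; ties go to the highest boundary likelihood.
        path = max(unmarked, key=lambda p: (p.boundary, p.likelihoods[-1]))
        unmarked.remove(path)
        b = path.boundary
        if path.likelihoods[b] >= envelope[b] - delta:
            path.mark = "good"
            # Raise the envelope wherever this path is more likely.
            for t, value in enumerate(path.likelihoods):
                envelope[t] = max(envelope[t], value)
        else:
            path.mark = "bad"
    return envelope


# Entries follow paths A, B and C of the text; the numbers are invented.
path_a = WordPath(("to", "be", "or"), [-1.0, -2.0, -3.0, -4.0])
path_b = WordPath(("two", "b"), [-1.0, -5.0, -8.0])
path_c = WordPath(("too",), [-0.5, -2.5])
updated = mark_paths([path_a, path_b, path_c], [-math.inf] * 4)
```

Starting from an envelope of −∞, the longest path is necessarily marked "good" and defines the first finite envelope; path B then falls more than Δ below it at its boundary label and is marked "bad".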
After the envelope has been updated, processing returns to block 1058 to find the longest, best unmarked live word path again. This loop repeats until no unmarked word paths remain. When no unmarked word paths remain, the shortest word path marked "good" is selected in block 1070. If two or more "good" word paths have the shortest length, block 1072 selects the one with the highest likelihood at its boundary label. The selected shortest path is then extended. That is, at least one likely following word is determined, preferably by executing the fast matching, language model, detailed matching and language model procedures described above. For each likely following word, an extended word path is formed. Specifically, an extended word path is formed by appending a likely following word to the end of the selected shortest word path.

After the selected shortest word path has been formed into extended word paths, the selected word path is removed from the stack in which it was an entry, and each extended word path is entered into the appropriate stack instead. In particular, each extended word path becomes an entry in the stack corresponding to its boundary label (block 1072).

The operation of extending the selected path in block 1072 is explained with reference to the flowchart of Figure 25. After the path is found in block 1070, the following steps are performed, by which the word path or paths are extended on the basis of an appropriate approximate matching.

In block 6000 of Figure 25, acoustic processor 1004 (of Figure 4) generates a label string as described above. The label string is provided as the input to block 6002, where the basic approximate matching procedure or one of the improved approximate matching procedures is performed to obtain an ordered list of candidate words. Then, in block 6004, the language model is applied as described above. After the language model has been applied, the remaining target words are sent in block 6006, together with the generated labels, to a detailed matching processor. The detailed matching yields a list of remaining candidate words, which is preferably submitted to the language model in block 6008. The likely words (as determined by the approximate matching, the detailed matching and the language model) are used to extend the path found in block 1070 of Figure 9. Each likely word so determined is appended separately to the found word path, so that a plurality of extended word paths is formed (block 6010).

In Figure 9, once the extended paths have been formed and the stacks re-formed, processing returns to block 1052 and the steps are repeated.

At each iteration, therefore, the shortest, best "good" word path is selected and extended. A word path marked "bad" in one iteration can become "good" in a later iteration, so whether a live word path is "good" or "bad" is determined independently at each iteration. In practice, the likelihood envelope does not change greatly from one iteration to the next, so the computation that decides whether a word path is good or bad is carried out efficiently. Normalization is also unnecessary.

When complete sentences are to be identified, block 1074 is preferably included. That is, when no live word path remains unmarked and there is no "good" word path left to extend, decoding ends. The complete word path with the highest likelihood at its boundary label is identified as the most likely word sequence for the input label string.

For continuous speech in which sentence ends are not identified, path extension either proceeds continuously or continues up to a predetermined number of words chosen by the system user.
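The stack bookkeeping of the extension step can be sketched as below. This is only an illustration under assumptions: stacks are modeled as a dictionary from boundary label to a list of (likelihood, words) entries, the set `good` is assumed to come from a prior marking pass, and `propose_followers` is a hypothetical stand-in for the fast matching, language model and detailed matching steps, returning each likely following word with the boundary label and likelihood of the extended path. All words and numbers are invented.

```python
def extend_shortest_good_path(stacks, good, propose_followers):
    """Extend the shortest "good" path in place and return its words.

    stacks: {boundary label: [(likelihood, words), ...]} of live paths.
    good: set of (boundary label, words) pairs already marked "good".
    propose_followers: hypothetical callable standing in for the matching
        and language model steps; yields (word, boundary, likelihood).
    """
    candidates = [
        (boundary, likelihood, words)
        for boundary, entries in stacks.items()
        for likelihood, words in entries
        if (boundary, words) in good
    ]
    if not candidates:
        return None
    # Shortest boundary first; ties go to the highest likelihood.
    boundary, likelihood, words = min(candidates, key=lambda c: (c[0], -c[1]))
    stacks[boundary].remove((likelihood, words))  # the path is now "dead"
    for word, new_boundary, new_likelihood in propose_followers(words):
        entries = stacks.setdefault(new_boundary, [])
        entries.append((new_likelihood, words + (word,)))
        # Keep each stack element listed in order of likelihood value.
        entries.sort(key=lambda entry: entry[0], reverse=True)
    return words


stacks = {
    2: [(-3.0, ("to",)), (-3.5, ("two",))],
    5: [(-6.0, ("today",))],
}
good = {(2, ("to",)), (2, ("two",)), (5, ("today",))}
chosen = extend_shortest_good_path(
    stacks, good, lambda words: [("be", 5, -5.0), ("bee", 6, -7.0)]
)
```

After the call, the selected path no longer appears in its original stack, and each extended path sits in the stack of its own boundary label, ordered by likelihood.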
F1k Construction of phonetic basic forms

One type of Markov model phoneme machine that can be used to form basic forms is based on phonetics. That is, each phoneme machine corresponds to a given phonetic sound.

For each given word, there is a sequence of phonetic sounds, each with its own corresponding phoneme machine. Each phoneme machine contains a number of states and transitions between them, some of which can produce a feneme output and some of which (called null transitions) cannot. As mentioned above, the statistics for each phoneme machine include (a) the probability that a given transition occurs and (b) the likelihood that a particular feneme is produced at a given transition. Preferably, each non-null transition has some probability associated with each feneme. The feneme alphabet shown in Table 1 contains approximately 200 fenemes. Figure 6 shows a phoneme machine used to form phonetic basic forms. A sequence of such phoneme machines is provided for each word. The statistics, i.e. the probabilities, are entered into the phoneme machines during a training phase in which known words are uttered. The transition probabilities and feneme probabilities of the various phonetic phoneme machines are determined during training by recording the feneme string(s) generated when a known phonetic sound is uttered at least once and applying the well-known forward-backward algorithm.

A sample of the statistics for one phoneme, identified as phoneme DH, is shown in Table 2. As an approximation, the label output probability distributions of transitions tr1, tr2 and tr8 are represented by a single distribution, as are those of transitions tr3, tr4, tr5 and tr9 and those of transitions tr6, tr7 and tr10. Table 2 indicates this by assigning the arcs (i.e. transitions) to the columns labeled 4, 5 or 6. Table 2 shows the probability of each transition and the probability of each label (i.e. feneme) being generated at the beginning, in the middle or at the end of phoneme DH, respectively.
For the DH phoneme, for example, the probability of a transition from state S₁ to state S₂ is counted as 0.07243, and the probability of a transition from state S₁ to state S₄ is 0.92757. (Because these are the only two possible transitions from the initial state, the two probabilities sum to 1.) As for the label output probabilities, the DH phoneme has a probability of 0.091 of producing feneme AE13 (see Table 1) in the last part of the phoneme, i.e. the column labeled 6 in Table 2. Table 2 also shows a count associated with each node (i.e. state). The node count represents the number of times the phoneme was in the corresponding state. Statistics like those of Table 2 exist for each phoneme machine.

Arranging phonetic phoneme machines into a word basic form is generally done by a phonetician and is normally not done automatically.

Phonetic basic forms have been used successfully in detailed matching and in fast approximate matching. Because a phonetic basic form depends on the judgment of the phonetician and is not produced automatically, phonetic basic forms are sometimes inaccurate.
F2 Formation of a set of phoneme machines including starting phoneme machines and ending phoneme machines (Figures 1A to 3, Figures 26 to 32)

The phoneme machines used to construct the basic forms described in the previous section are selected from a set of phoneme machines. As mentioned above, in earlier generations of speech recognition systems each sound (or, more precisely, each phonetic element) was associated with only a single phoneme machine. Each phoneme machine contains transitions and the probabilities associated with them, as well as the label output probabilities associated with the transitions. The phoneme machine therefore contains statistics indicating the likelihood that a given label is produced at a given transition of the phoneme machine when the corresponding phonetic sound is uttered. These statistics are derived during a training period in which known sounds are spoken into acoustic processor 1004 (Figure 4) and the well-known forward-backward algorithm is applied.

Most of the statistics derived during training are determined by the labels that acoustic processor 1004 generates when known sounds are uttered. The labels generated by acoustic processor 1004 are determined by energy-related characteristics of the spoken input. From the spectrogram of the word WILL shown in Figure 26 and the waveform of the word WILL shown in Figure 27, it can be seen that the energy characteristics while the "w" sound builds up from silence differ markedly from the energy characteristics of the "w" sound that follows the energy build-up.
Prior to this invention, no distinction was made as to whether a sound or phonetic element occurred at the beginning of a word following a period of silence, in the middle of a word, or at the end of a word. According to the invention, this distinction is made.

The first 0.1 second of the word "WILL" shown in Figures 26 and 27 corresponds to the build-up of the "w" sound, and the portion of the waveform immediately after it corresponds to a "w" sound that is less influenced by silence.

Treating the energy build-up and the subsequent portion of the "w" sound as a single phoneme, as earlier generations of systems did, leads to inaccuracy and degraded performance. That is, the single phoneme machine for the "w" sound mixed into its statistics the "w" sound occurring at the beginning of a word, at the end of a word and within a word. The single phoneme machine therefore contained statistics in which energy build-up and energy decay were intermingled.

In accordance with the present invention, a given sound, such as the "w" sound, may have a plurality of phoneme machines associated with it. For example, the "w" sound has a common phoneme machine that contains the statistics of the "w" sound uttered without the influence of silence. The common phoneme machine contains statistics generated by uttering the "w" sound when it is not adjacent to a period of silence. The common phoneme is therefore not contaminated by energy characteristics related to energy build-up or decay. In addition, the "w" sound has a starting phoneme machine, which reflects the statistics for uttering the "w" sound at the transition from a period of silence, and an ending phoneme machine, which reflects the statistics for uttering the "w" sound just before a period of silence.

The starting phoneme machine for the "w" sound is denoted ONSETLX, or ONLX, and the ending phoneme machine for the "w" sound is denoted TRAILLX, or TRLX. The common phoneme machine is denoted WX. Each phoneme machine is defined separately, and each has its own transition probabilities and label probabilities. The statistics of the three phoneme machines related to the "w" sound are shown in Tables 3, 4 and 5.

In Table 3, phoneme machine ONLX has statistics organized in the same way as the statistics in the table described earlier. The probabilities of producing the various labels in the first, middle and last sections of the phoneme machine are shown in three columns. The transition probabilities from one state to another are also shown.
Figure 28 shows how the transitions of a phoneme machine (such as the phoneme machine of Figure 6) are grouped into three sections.

The statistics in Table 3 are derived during the training period and apply to a given speaker. During training, the speaker utters sample known text. From the known text, the sequence of phonemes corresponding to that text is determined. When the known words are uttered, a string of labels (i.e. fenemes) is generated. The labels are aligned against the phoneme machines in the sequence in a conventional manner, such as Viterbi alignment. The correspondence between the generated labels and the phonemes of the known text is the basis for determining each of the probabilities found in each phoneme machine. For example, the "w" sound preceded by silence may occur many times at known intervals during the training period. The number of times a particular label, e.g. WX7, is generated when the "w" sound is preceded by silence is processed to give the probabilities shown in Table 3. Specifically, the starting phoneme machine for the "w" sound has a probability of 0.036 of generating label WX7 in its middle section and a probability of 0.197 of generating label WX7 in its end section. Also in Table 3, the transition probability between state 1 and state 4 of starting phoneme ONLX is 0.67274, whereas the transition probability between state 1 and state 2 is 0.32370.

The significance of the invention is apparent when Tables 3, 4 and 5 are compared. Tables 4 and 5 do not include label output WX7, shown in Table 3, as a major label output. Moreover, in Table 5 the transition from state 1 to state 4 has a probability of 1.0, and the parallel transition from state 1 to state 2 has a probability of 0. These values differ markedly from the statistics of Table 3. The marked differences among the statistics shown in Tables 3, 4 and 5 indicate that errors can result when all occurrences of the "w" sound, regardless of position in the word, are lumped together as the statistics of a single phoneme machine.

When word basic forms, each comprising a sequence of phonemes, are formed, the phonemes are selected from a predetermined set of phonemes. Earlier generations of systems, which used the single phoneme machine approach (as mentioned above), had approximately 70 phonemes. According to the invention, it is desirable to add to the phoneme set 26 phonemes, consisting of 14 starting phonemes and 12 ending phonemes. Table 6 shows these additional phonemes.

In Table 6, not every sound (i.e. phonetic element) has its own starting and ending phoneme machines. Although such an arrangement is within the scope of the present invention, an inventory of 210 phoneme machines (three phonemes per sound) is considered too large when a large amount of training data cannot be obtained. Accordingly, certain sounds whose statistics do not change significantly whether or not they are adjacent to a period of silence have only a common phoneme machine. Such sounds include PX, TX and KX, which are called unvoiced stops. Because unvoiced stops are not affected in this way, each is represented by a single phoneme.
Furthermore, certain groups of sounds have very similar statistics with respect to energy build-up, so one starting phoneme machine can be provided for each such group. In Table 6, one such group of eight sounds (i.e. phonetic elements) is associated with the starting phoneme machine ONSETAA, or ONAA. Similarly, certain groups of sounds have very similar statistics with respect to energy decay, so one ending phoneme machine is provided for each such group. For example, in Table 6, seven sounds are associated with the ending phoneme TRAILAA, or TRAA.
This grouping reduces the number of phoneme machines, and the amount of training data, needed to generate the acoustic statistics. The grouping has not caused any significant loss of performance relative to a system that uses 210 phonemes.

Table 6 also shows the standard phonetic symbols corresponding to the identifiers used in the present invention. Note that, although the present invention preferably includes some of the phonetic elements of the International Phonetic Alphabet (identified by the symbols shown), other types of sounds are also contemplated. In Table 6, a phoneme with the suffix "0" refers to an unstressed vowel, and a phoneme with the suffix "1" refers to a stressed vowel.

Table 7 identifies all the phonemes for which phoneme machines are formed in accordance with a preferred embodiment of the present invention.

Word basic forms are constructed from the set of phonemes shown in Table 7. Consider the word "WILL" again: its basic form is formed as the sequence of phonemes (or, equivalently, phoneme machines) shown in Figure 29. The phonetic spelling of the word "WILL" is shown in a separate figure. Phoneme machine ONLX is the starting phoneme machine for the "w" sound. (The ONLX phoneme machine is also the first phoneme machine of basic forms that begin with the "l" or "hw" phonetic element.) In the word "WILL", the ONLX phoneme machine is followed by phoneme machine WX, the common phoneme machine for the "w" sound. These are followed by the IX1 phoneme machine, the LX phoneme machine and the TRLX phoneme machine.

Each word in the vocabulary is similarly represented by a basic form (like the basic form of the word "WILL" shown in Figure 29). To form each basic form, the phonemes that make up the target word are determined, and the phoneme machines corresponding to those phonemes are then concatenated. In the inventory stored in the computer, each word is defined by its corresponding sequence of phoneme machines, and the statistics are stored for each phoneme machine. To reduce storage requirements, each phoneme machine can be represented by a corresponding identifier, so that a word basic form is formed as a sequence of phoneme machine identifiers. For example, the basic form of the word "WILL" corresponds to the identifier sequence 43-27-81-12-56. Identifier 43 corresponds to phoneme machine ONLX, identifier 27 corresponds to phoneme machine WX, and so on. For each phoneme machine, statistics such as those shown in Tables 2 to 5 are stored in a portion of memory after the training period. When a target word is considered, the statistics for the identifiers of its component phoneme machines are retrieved.

Two other examples of basic forms are shown in Figures 31 and 32, which show the basic forms of the word "BOG" and the word "DOG", respectively. Both basic forms begin with the starting phoneme machine ONBX. In the word "BOG", ONBX is followed by phoneme machines BX, AW1, GX and TRBX. In the word "DOG", ONBX is followed by the sequence of phoneme machines DX, AW1, GX and TRBX.
nothing. The energy storage of the “B” and “D” sounds is similar.
The same starting phoneme machine is used since
Ru. When shaping the ONBX phoneme machine,
any of the phonemes (i.e. phonetic elements) that are displayed
It is desirable to incorporate utterances into the generation of statistical values.
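The identifier scheme described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the text gives the sequence 43-27-81-12-56 for “WILL” and states explicitly only that 43 is ONLX and 27 is WX, so the remaining three identifiers are read off the sequence in order. The dictionary names, the function names, and the empty placeholder statistics are hypothetical and are not taken from the patent.

```python
from typing import Dict, List, Tuple

# Identifier table. The text states 43 -> ONLX and 27 -> WX; the other three
# entries are inferred from the order of the WILL sequence.
ID_TO_MACHINE: Dict[int, str] = {43: "ONLX", 27: "WX", 81: "IX1", 12: "LX", 56: "TRLX"}

# A word basic form is stored as a sequence of phoneme machine identifiers.
BASIC_FORM_IDS: Dict[str, Tuple[int, ...]] = {"WILL": (43, 27, 81, 12, 56)}

# Placeholder statistics (hypothetical layout): one record per phoneme machine,
# stored once and shared by every word whose basic form uses that machine.
STATISTICS: Dict[str, Dict[str, dict]] = {
    name: {"transition_probs": {}, "label_probs": {}}
    for name in ID_TO_MACHINE.values()
}


def machine_names(word: str) -> List[str]:
    """Expand a word's identifier sequence into phoneme machine names."""
    return [ID_TO_MACHINE[i] for i in BASIC_FORM_IDS[word]]


def retrieve_statistics(word: str) -> List[Dict[str, dict]]:
    """Look up the stored statistics of each component phoneme machine."""
    return [STATISTICS[name] for name in machine_names(word)]


# BOG and DOG as given in the text (by name, since no identifiers are stated).
BOG = ("ONBX", "BX", "AW1", "GX", "TRBX")
DOG = ("ONBX", "DX", "AW1", "GX", "TRBX")

# Position-by-position comparison: only the second machine differs.
SAME_POSITION = [a == b for a, b in zip(BOG, DOG)]

if __name__ == "__main__":
    print(machine_names("WILL"))
    print(SAME_POSITION)
```

Because statistics are keyed by phoneme machine rather than by word, the record for ONBX would be stored once and reached from both “BOG” and “DOG”.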
It is desirable that this condition also apply to the various other starting and ending phoneme machines that correspond to more than one sound (i.e., phonetic element).

The steps for constructing a basic form are described with reference to the flowcharts of FIGS. 1A and 1B. In block 8002, a set of phoneme machines is formed from starting phoneme machines, common phoneme machines, and ending phoneme machines.
Next, in block 8004, a word is selected from the vocabulary of words. In block 8006, the word is characterized as a plurality of phonetic elements, or more generally sounds, in a predetermined order, such as W-I-l for the word “WILL”. Next, in block 8008, the first phonetic element in the predetermined order is examined to determine whether it has a corresponding starting phoneme machine. If there is a corresponding starting phoneme machine, it is retrieved in block 8010, and in block 8012 the first two phoneme machines of the basic form are set to be the starting phoneme machine of the first phonetic element followed by the common phoneme machine. If there is no starting phoneme machine corresponding to the first phonetic element, the common phoneme machine is retrieved in block 8013. This common phoneme machine then represents the beginning of the basic form.

Next, if block 8014 finds that there is no next phonetic element, block 8015 determines whether the first phonetic element has an ending phoneme (machine) associated with it. If there is no ending phoneme, the basic form consists of the starting phoneme (machine) followed by the common phoneme (machine). If there is an ending phoneme associated with the first phonetic element, the ending phoneme is appended to the common phoneme in block 8016, so that the word basic form comprises the starting phoneme machine, the common phoneme machine, and the ending phoneme machine of the first phonetic element.

If block 8014 finds that there is a next phonetic element, block 8017 examines it and determines whether it is the last in the order. If it is the last, block 8018 determines whether that phonetic element has an associated ending phoneme machine. If there is an ending phoneme machine, the basic form is completed in block 8020 by appending the common phoneme machine corresponding to the last phonetic element, followed by the ending phoneme machine. If there is no associated ending phoneme machine, the common phoneme machine of the last phonetic element is appended. If block 8017 finds that the next phonetic element is not the last, block 8024 appends the common phoneme machine of that phonetic element to the phoneme machines already arranged. Phoneme machines are appended successively in this way, lengthening the sequence of phoneme machines, until the phoneme machine (or machines) corresponding to the last phonetic element has been appended.

Next, the formation of phoneme machines according to the present invention is described with reference to FIGS. 2A and 2B. First, in block 8100, sounds are defined, for example as phonetic elements selected from the International Phonetic Alphabet. The set of sounds represents the kinds of individual sounds formed in speech. In block 8102, a plurality of phoneme machines is formed, each having means for storing statistics relating to it. Next, in block 8104, a given sound is selected from a first set of sounds, each of which is to have a starting phoneme machine assigned to it. The first set is preferably formed of the sounds that are significantly affected by energy build-up. (As mentioned above, the first set can be formed provided that sufficient training data is obtained.) Then, in block 8106, a starting phoneme machine is assigned to the given sound. In block 8108, statistics for the assigned starting phoneme machine are generated from utterances occurring at the beginning of a speech segment (for example, a word), these being utterances of the sounds that correspond to the given sound, i.e., sounds with similar energy build-up characteristics.

Block 8110 then forms a common phoneme machine for the given sound, and block 8112 generates statistics for it. In block 8114, once every sound that is to have a starting phoneme machine has been treated as the given sound, processing moves on to a second set of sounds (those that are to have ending phoneme machines). In block 8116, a given sound is selected from the second set, and in block 8118 an ending phoneme machine is assigned to it. In block 8120, statistics for the assigned ending phoneme machine are generated from utterances occurring at the end of speech segments, namely utterances of the sounds that correspond to the given sound, i.e., sounds with similar energy decay characteristics. Then, in block 8122, a common phoneme machine is assigned to the given sound, and in block 8124 its statistics are generated if they have not been determined previously. Block 8126 determines whether every sound that is to be assigned an ending phoneme machine has been selected as the given sound. If so, all the phoneme machines have been formed. If not, a previously unselected sound is selected as the given sound, and blocks 8118 to 8126 are repeated.

The procedure of FIGS. 2A and 2B can be modified in various ways in accordance with the present invention. First, if only starting phoneme machines are desired, blocks 8116 to 8126 can be omitted. Similarly, if only ending phoneme machines are desired, blocks 8104 to 8114 can be omitted.
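Returning to the construction procedure of FIGS. 1A and 1B, the block-by-block description can be condensed into a short Python sketch. This is a minimal illustration under stated assumptions, not the patent's implementation: the three lookup tables and their phonetic-element keys ("W", "I", "L", and so on) are hypothetical, and they contain only the phoneme machines that the text names for “WILL”, “BOG” and “DOG”.

```python
from typing import Dict, List, Sequence

# Hypothetical inventories keyed by phonetic element (block 8002).
# ONLX serves "w", "l" and "hw"; ONBX serves "b" and "d", as in the text.
STARTING: Dict[str, str] = {"W": "ONLX", "L": "ONLX", "HW": "ONLX",
                            "B": "ONBX", "D": "ONBX"}
COMMON: Dict[str, str] = {"W": "WX", "I": "IX1", "L": "LX", "B": "BX",
                          "D": "DX", "AW": "AW1", "G": "GX"}
ENDING: Dict[str, str] = {"L": "TRLX", "G": "TRBX"}


def build_basic_form(elements: Sequence[str]) -> List[str]:
    """Concatenate phoneme machines for a word's phonetic elements."""
    if not elements:
        raise ValueError("a word needs at least one phonetic element")

    machines: List[str] = []

    # Blocks 8008 to 8013: the first element contributes its starting
    # phoneme machine (if it has one) followed by its common phoneme machine.
    first = elements[0]
    if first in STARTING:
        machines.append(STARTING[first])
    machines.append(COMMON[first])

    # Blocks 8014, 8017 and 8024: every later element, including the last,
    # contributes its common phoneme machine.
    for element in elements[1:]:
        machines.append(COMMON[element])

    # Blocks 8015/8016 and 8018/8020: the last element contributes its
    # ending phoneme machine if one is associated with it.
    last = elements[-1]
    if last in ENDING:
        machines.append(ENDING[last])

    return machines


if __name__ == "__main__":
    for word, spelling in (("WILL", ["W", "I", "L"]),
                           ("BOG", ["B", "AW", "G"]),
                           ("DOG", ["D", "AW", "G"])):
        print(word, build_basic_form(spelling))
```

With these tables, the sketch yields the five-machine sequences that the text gives for “WILL”, “BOG” and “DOG”. A word with a single phonetic element passes through the same three stages, which matches the case in which block 8014 finds no next phonetic element.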
Second, if desired, the first set of sounds and the second set of sounds can be formed at the same time.

In addition, the step of assigning a single starting phoneme machine or ending phoneme machine to two or more sounds is relevant to these embodiments. In this case, the statistics need only be generated once and are then used for each such sound as appropriate. It is preferable first to determine which sounds are to have a starting phoneme machine and an ending phoneme machine assigned to them, thereby forming the first set of block 8104 and the second set of block 8116, respectively.

For speech recognition, the present invention provides an apparatus for forming basic forms from an enlarged number of phoneme machines. An example of this apparatus is shown in FIG. 3. FIG. 3 shows a plurality of phoneme machines 8202 to 8212. Each phoneme machine, like phoneme machine 8202, includes (a) a transition probability memory 8214, (b) a label probability memory 8216, and (c) a state identifier and transition identifier memory 8218. The phoneme machines that include phoneme machines 8202 and 8204 are common phoneme machines, those that include phoneme machines 8206 and 8208 are starting phoneme machines, and those that include phoneme machines 8210 and 8212 are ending phoneme machines. Statistics are stored in the memories of each of the phoneme machines 8202 to 8212 by a phoneme machine training device 8220.

Each word is defined in advance as a sequence of phonemes, and these sequences are stored in storage device 8230. Basic form construction device 8240 constructs a sequence of phoneme machines by combining the phoneme sequence information from storage device 8230 with the statistics obtained by the phoneme machine training device 8220. The sequence of phoneme machines for a given word represents the basic form of that word and is used for matching (described in sections F1c to F1f above). That is, when unknown speech to be recognized is uttered, acoustic processor 1004 (shown in FIG. 4) generates a string of labels in response. The present invention makes it possible to match basic forms, formed of phoneme machines from the improved set of phoneme machines, against the labels in the string. Using the phoneme machines added by the invention significantly improves the accuracy and speed of speech recognition.

The present invention can be used both in isolated-word speech recognition systems and in continuous-speech recognition systems. With isolated words, there is a short pause after each word, so in most cases there is energy build-up at the beginning of each word and energy decay at its end. The present invention is particularly well suited to such systems. In continuous speech, several words are run together, usually with short pauses between phrases; instead of characterizing each word basic form with build-up and decay portions, the invention suggests supplying starting phoneme machines and ending phoneme machines between phrases. Isolated words and continuously spoken phrases are both covered by the general term “speech segment”. A speech segment is regarded as the portion of speech lying between two periods of silence.
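The apparatus of FIG. 3 can be pictured with a small Python data-structure sketch. It is offered only as an illustration under assumptions: the class and function names, the field layout, and the numeric values are hypothetical placeholders and are not statistics or structures from the patent. The comments indicate which element of FIG. 3 each part is meant to mirror.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Mapping, Sequence, Tuple

Transition = Tuple[int, int]  # (from_state, to_state)


@dataclass
class PhonemeMachine:
    name: str
    kind: str  # "starting", "common", or "ending"
    # Mirrors the state and transition identifier memory (8218).
    transitions: List[Transition] = field(default_factory=list)
    # Mirrors the transition probability memory (8214).
    transition_probs: Dict[Transition, float] = field(default_factory=dict)
    # Mirrors the label probability memory (8216): P(label | transition).
    label_probs: Dict[Tuple[Transition, str], float] = field(default_factory=dict)


def store_statistics(machine: PhonemeMachine,
                     transition_probs: Mapping[Transition, float],
                     label_probs: Mapping[Tuple[Transition, str], float]) -> None:
    """Role of the training device (8220): write statistics into a machine."""
    machine.transitions = list(transition_probs)
    machine.transition_probs = dict(transition_probs)
    machine.label_probs = dict(label_probs)


def construct_basic_form(word: str,
                         sequences: Mapping[str, Sequence[str]],
                         machines: Mapping[str, PhonemeMachine]) -> List[PhonemeMachine]:
    """Role of the construction device (8240): combine the stored sequence
    for a word (storage device 8230) with the trained phoneme machines."""
    return [machines[name] for name in sequences[word]]


# Hypothetical inventory and word definition for "WILL".
MACHINES = {
    "ONLX": PhonemeMachine("ONLX", "starting"),
    "WX": PhonemeMachine("WX", "common"),
    "IX1": PhonemeMachine("IX1", "common"),
    "LX": PhonemeMachine("LX", "common"),
    "TRLX": PhonemeMachine("TRLX", "ending"),
}
SEQUENCES = {"WILL": ["ONLX", "WX", "IX1", "LX", "TRLX"]}

# Placeholder numbers only, to show where trained statistics would live.
store_statistics(MACHINES["ONLX"],
                 {(0, 0): 0.5, (0, 1): 0.5},
                 {((0, 1), "L1"): 1.0})

WILL_FORM = construct_basic_form("WILL", SEQUENCES, MACHINES)

if __name__ == "__main__":
    print([(m.name, m.kind) for m in WILL_FORM])
```

Keeping the statistics inside each phoneme machine, rather than inside each word, reflects the point made in the text that a basic form is only a sequence referring to shared phoneme machines.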

[Table]

[Table]

[Table]

[Table]

Phoneme machine  Symbol   Phoneme machine  Symbol   Phoneme machine  Symbol
'BX'             b        'ONAW'                    'UH0'
'DH'                      'ONBX'                    'UU0'            u
'DX'             d        'ONDH'                    'UX0'            U
'D$'                      'ONEE'                    'AA1'            α
'FX'             f        'ONER'                    'AE1'
'GX'             g        'ONFX'                    'AI1'            aI
'HX'             h        'ONIX'                    'AU1'            aU
'JX'             j        'ONLX'                    'AW1'
'KQ'                      'ONMX'                    'EE1'            i
'KX'             k        'ONSH'                    'EH1'            ε
'LX'             l        'ONSX'                    'EI1'            e
'MX'             m        'ONUH'                    'ER1'            з
'NG'             ŋ        'TRAA'                    'IX1'            I
'NX'             n        'TRAW'                    'OI1'            oI
'NXV'                     'TRBX'                    'OU1'            o
'PQ'             u        'TRDH'                    'UH1'            Λ
'PX'             p        'TREE'                    'UU1'            u
'RX'             r        'TRER'                    'UX1'            U
'R$'                      'TRFX'                    'AA2'            α
'SH'                      'TRKQ'                    'AE2'
'SX'             s        'TRLX'                    'AI2'            aI
'TH'             θ        'TRMX'                    'AU2'            aU
'TQ'                      'TRSH'                    'AW2'
'TX'             t        'TRSX'                    'EE2'            i
'VX'             v        'AA0'            α        'EH2'            ε
'WX'             w        'AE0'                     'EI2'            e
'W@'             hw       'AI0'            aI       'ER2'            з
'XX'                      'AU0'            aU       'IX2'            I
'ZH'                      'AW0'                     'OI2'            oI
'ZX'             z        'EE0'            i        'OU2'            o
'?X'             ?        'EH0'            ε        'UH2'            Λ
'EEG'                     'EI0'            e        'UU2'            u
'IXG'                     'ER0'                     'UX2'            U
'UXG'                     'IX0'            I
'ONAA'                    'OI0'            oI
G. Effects of the Invention

According to the present invention, improved word basic forms can be constructed, which improves both the accuracy and the speed of speech recognition.

[Brief explanation of drawings]

FIGS. 1A and 1B are flowcharts showing a method of constructing basic forms according to the present invention.
FIG. 1C is a diagram showing how FIGS. 1A and 1B are arranged relative to each other.
FIGS. 2A and 2B are flowcharts showing a method of forming starting phoneme machines, common phoneme machines, and ending phoneme machines according to the present invention for use in constructing improved basic forms.
FIG. 2C is a diagram showing how FIGS. 2A and 2B are arranged relative to each other.
FIG. 3 is a block diagram showing an apparatus for constructing improved word basic forms formed from starting phoneme machines, common phoneme machines, and ending phoneme machines.
FIG. 4 is a general block diagram of a system environment in which the present invention can be implemented.
FIG. 5 is a block diagram showing in detail the stack decoder in the system environment of FIG. 4.
FIG. 6 is a diagram showing a precision matching phoneme machine that is identified and represented in storage by statistics obtained during a training session.
FIG. 7 is a diagram showing successive steps of stack decoding.
FIG. 8 is a diagram showing the likelihood vectors of respective word paths and a likelihood envelope.
FIG. 9 is a flowchart showing the steps of the stack decoding procedure.
FIG. 10 is a diagram showing the elements of the acoustic processor.
FIG. 11 is a diagram showing the parts of a typical human ear, indicating where the components of the acoustic model are defined.
FIG. 12 is a block diagram showing portions of the acoustic processor.
FIG. 13 is a diagram showing the relationship between sound intensity and frequency, used in the design of the acoustic processor.
FIG. 14 is a diagram showing the relationship between sones and phons.
FIG. 15 is a flowchart showing how sound is characterized by the acoustic processor of FIG. 10.
FIG. 16 is a flowchart showing how the threshold values are updated in FIG. 15.
FIG. 17 is a diagram showing the trellis, or lattice, of the precision matching procedure.
FIG. 18 is a diagram showing a phoneme machine used in performing matching.
FIG. 19 is a time distribution diagram used in a matching procedure having certain imposed conditions.
FIGS. 20a to 20e are diagrams showing the interrelationship between phonemes, a label string, and the start and end times determined by the matching procedure.
FIGS. 21a and 21b are diagrams showing a particular phoneme machine of minimum length 0 and its corresponding start time distribution.
FIGS. 22a and 22b are diagrams showing a particular phoneme machine of minimum length 4 and its corresponding trellis.
FIG. 23 is a diagram showing a phoneme tree structure that allows several words to be processed at the same time.
FIG. 24 is a flowchart showing the steps performed in forming trained word basic forms.
FIG. 25 is a flowchart showing the steps performed in extending a word path.
FIG. 26 is a spectrogram of the word WILL spoken in isolation.
FIG. 27 is a diagram showing the waveform of the word WILL spoken in isolation.
FIG. 28 is a diagram showing a phonetic phoneme machine divided into three statistical portions: beginning, middle, and end.
FIG. 29 is a diagram showing the word “WILL” as a basic form containing five successive phonemes according to the present invention.
FIG. 30 is a diagram showing the standard phonetic spelling of the word “WILL” as three successive phonetic elements.
FIG. 31 is a diagram showing the phoneme sequence of the word “BOG” according to the present invention.
FIG. 32 is a diagram showing the phoneme sequence of the word “DOG” according to the present invention.

1000: speech recognition system; 1002: stack decoder; 1004: acoustic processor; 1006, 1008: array processors; 1010: language model; 1012: workstation; 1020: search device; 1022, 1024, 1026, 1028: interfaces.

Claims

[Claims]

1. A speech recognition method in which, from a set of labels each representing an acoustic type assignable to a small time interval, a corresponding label string is generated in response to an unknown speech input, and speech recognition is performed by matching this label string against word Markov models of the words in a vocabulary, characterized in that: each word Markov model is formed by taking the corresponding partial word Markov models from a set of partial word Markov models and concatenating them; the set of partial word Markov models comprises starting-part partial word Markov models, which correspond to the starting part of a word, and other partial word Markov models; and the starting-part partial word Markov models are used only for matching the starting part of a word.

2. The speech recognition method according to claim 1, wherein, for at least some of the partial word Markov models, an ending-part partial word Markov model corresponding to the ending part of a word is further provided, and the ending-part partial word Markov models are used only for matching the ending part of a word.

3. The speech recognition method according to claim 1 or 2, wherein the probability values of the starting-part partial word Markov models are determined on the basis of utterances of the starting parts of words.