JPH0372994B2


Info

Publication number: JPH0372994B2
Application number: JP61032051A
Authority: JP
Inventors: Rai Baaru Raritsuto; Binsento Desooza Piitaa; Reroi Maasaa Robaato; Aran Pichenii Maikeru
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 1986-02-18
Filing date: 1986-02-18
Publication date: 1991-11-20
Also published as: JPS62194293A

Description

[Detailed description of the invention]

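The invention is described in the following order.

A. Field of Industrial Application
B. Summary of the Disclosure
C. Prior Art
D. Problems to Be Solved by the Invention
E. Means for Solving the Problems
F. Embodiments
  F1 Speech recognition system environment
    F1a General description (FIGS. 2 to 7)
    F1b Auditory model and its implementation in the acoustic processor of a speech recognition system (FIGS. 8 to 14)
    F1c Detailed match (FIGS. 4 and 15)
    F1d Basic fast match (FIGS. 16 to 18)
    F1e Alternative fast match (FIGS. 19 and 20)
    F1f Matching based on the first J labels (FIG. 20)
    F1g Phone tree structure and fast match embodiments (FIG. 21)
    F1h Language model (FIGS. 2 and 22)
  F2 Constructing phonetic and fenemic Markov model baseforms for words
    F2a Constructing phonetic baseforms (FIG. 4, Table 2)
    F2b Constructing fenemic baseforms (FIGS. 23 and 24)
  F3 Generating an alphabet of composite fenemic phones (FIG. 4)
  F4 Automatic formation of short phonetic-type Markov model word baseforms (FIGS. 1A, 1B, 21, 23, and 25 to 30)
  F5 Comparative results
  F6 Alternative embodiments
  F7 Tables
G. Effects of the Invention

A. Field of Industrial Application

The present invention relates to the field of generating acoustic models of the words in a vocabulary. The acoustic models are used mainly in speech recognition, but they can also be incorporated into phonetic applications that investigate the types of sounds in a word.

B. Summary of the Disclosure

The present invention discloses the automatic construction of a phonetic-type baseform that is shorter than the fenemic baseform of a given word. (A feneme is a minute phone; the name derives from "front end", because fenemes are usually obtained from the front end.) Specifically, in a system in which (i) each word in the vocabulary is defined by a fenemic baseform of fenemic phones, (ii) an alphabet of composite phones is defined, each composite phone corresponding to at least one fenemic phone, and (iii) a string of fenemes is generated in response to speech input, the invention makes it possible to convert a word baseform of fenemic phones into a shortened word baseform of composite phones by (a) replacing each fenemic phone in the fenemic representation of the word with its corresponding composite phone, and (b) merging at least one pair of adjacent composite phones into a single composite phone when the adverse effect of the merge is below a predetermined threshold.

Ｃ従来の技術本発明の背景となる発明に関するものとして、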
米国特許出願第06／665401号（1984年10月26日出
願）、同第06／672974号（1984年11月19日出願）、
および同第06／697174号（1985年２月１日出願）
がある。確率的な音声認識方法では、音響波形は最初、
音響プロセツサによりラベルすなわちフイーニー
ム（フロント・エンドから得られるような微小音
素）のストリングに変換される。ラベル（その
各々は単音符号を識別する）は、一般に約200の
異なつたラベルのアルフアベツトから選択され
る。このようなラベルの生成は、種々の論文なら
びに前記米国特許出願第06／665401号に記載され
ている。ラベルを用いて音声認識を行う際、マルコフ・
モデル・マシン（すなわち確率的な有限状態マシ
ン）が論議されている。マルコフ・モデルは通
常、複数の状態ならびに状態間の遷移を含む。更
に、マルコフ・モデルは、(a)生起する各遷移の確
率、(b)種々の遷移で各ラベルを生成するそれぞれ
の確率に関連して、それに割当てられた確率を有
するのが普通である。
Markov models, or Markov sources, are described in various articles, for example L. R. Bahl et al., "A Maximum Likelihood Approach to Continuous Speech Recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-5, No. 2, March 1983.
音声を認識する際、マツチング・プロセスを実
行して、語彙中のどのワード（複数のワードの場
合もある）が、音響プロセツサにより生成された
ラベルのストリングを生成している最高の尤度を
生じるかを決定する。１つのこのようなマツチン
グ手順は、前記米国特許出願第06／672974号に示
されている。それによれば、音響突合せは、(a)語
彙中の各ワードをマルコフ・モデル音素マシンの
シーケンスにより特徴づけ、(b)音響プロセツサに
より生成されるラベル・ストリングを生成する、
各ワードを表わす音素マシンのシーケンスのそれ
ぞれの尤度を決定することにより、実行する。各
ワードが表わす音素マシンのシーケンスをワード
基本形態という。ワード基本形態を決める際、最初に、その基本
形態を構築するのに用いる音素マシンの性質を決
める必要がある。前記米国特許第06／697174号
に、フイーニームに基づいた基本形態によるワー
ドが示されている。すなわち、フイーニームのア
ルフアベツト（集合）をなす200のフイーニーム
の各々について、マルコフ・モデルが用意され、
発声されると（音響プロセツサで生成されるよう
に）０，１または２以上のフイーニームを生成す
る特定のフイーニームの確率を表示する。所与の
フイーニームのマルコフ・モデル音素マシンは、
それぞれの発声の発音変化により、所与のフイー
ニーム以外にフイーニームを生成しない確率、ま
たは所与のフイーニーム以外にいくつかのフイー
ニームを生成する確率を生じる。フイーニームの
基本形態により、各ワードの基本形態中の音素マ
シンの数は、ワード当りのフイーニームの数にほ
ぼ等しい。一般に、センチ秒当り１の割合で音響
プロセツサによりフイーニームを良好に生成する
場合、ワード当り60〜100のフイーニームがある。それに代るものとして、前記米国特許第06／
672974号に記載されているように、音標型の音素
マシンのワード基本形態を構築することがある。
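In that case each phone machine corresponds to a phonetic phone and includes seven states and thirteen transitions.

D. Problems to Be Solved by the Invention

A fenemic phone machine is built from two states and is determined easily and automatically. However, fenemic baseforms are quite long and therefore require lengthy computation in the matching process. Phonetic baseforms, by contrast, are short and need less computation, but they are less accurate and are not determined easily or automatically; in general the phonetic spelling must be entered by hand.

Fenemic baseforms and phonetic baseforms are both effective and useful, but neither may be optimal for some uses.

Accordingly, an object of the present invention is to provide a baseform that largely avoids the disadvantages of fenemic and phonetic baseforms while retaining the principal advantages of each.

A further object of the invention is to form short, simple, phonetic-type Markov model word baseforms automatically.

A further object of the invention is to provide, for a subject word, a baseform that is shorter than the fenemic baseform of the same word, yet is accurate and has the character of a phonetic-type baseform.

E. Means for Solving the Problems

According to the invention, each word in the vocabulary is assumed to be defined by an existing fenemic baseform. An alphabet of composite phones is also assumed to have been derived. The composite phones are preferably chosen to resemble the elements of an alphabet of phonemes. In that preferred case, each phone machine corresponds to an element of the phoneme alphabet and stores the transition and label probabilities used in matching that phonemic element against the fenemes in a feneme string.

The alphabet of composite phones can have fewer elements than the alphabet of fenemic phones.

The simple baseform of composite phones provided by the invention is constructed as follows.

The subject word is uttered once or, preferably, several times.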
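Each utterance generates a string of fenemes. Based on the feneme string (or strings) generated for the subject word, the fenemic word baseform with the highest (joint) probability of producing the strings is selected. Each string for the subject word is then aligned against the fenemic baseform, so that each fenemic phone in the baseform is aligned with the corresponding fenemes in each string. Each fenemic phone in the fenemic word baseform is replaced by its corresponding composite phone, which yields a baseform of composite phones. The following steps are then performed.

(a) Select a pair of adjacent composite phones.
(b) For each feneme string, determine the substring of fenemes aligned against the selected pair of adjacent composite phones.
(c) Determine the single composite phone with the highest joint probability of producing the substrings determined for all the feneme strings.
(d) Repeat steps (a) to (c) for every pair of adjacent composite phones, determining a single composite phone for each pair.
(e) Measure the adverse effect of replacing each pair of composite phones in the baseform with the single composite phone determined for it.
(f) Replace the pair whose replacement has the smallest adverse effect with its single composite phone.
(g) After the replacement, provide the new baseform of composite phones.
(h) Repeat steps (a) to (g) on the new baseform.

Step (h) is repeated as many times as needed to shorten the baseform of composite phones to the desired length, that is, until the adverse effect of replacing a pair with a single phone exceeds a threshold.

Furthermore, when the composite phone alphabet is chosen to match a given phoneme alphabet, the shortened baseform produced by the invention provides a phonetic representation of the word that can be used in applications concerned with phonetics.

F. Embodiments

F1 Speech recognition system environment

F1a General description (FIGS. 2 to 7)

FIG. 2 is a general block diagram of a speech recognition system 1000. The system includes a stack decoder 1002 and, connected to it, an acoustic processor (AP) 1004, an array processor 1006 that performs a fast approximate acoustic match, an array processor 1008 that performs a detailed acoustic match, a language model 1010, and a workstation 1012.

The acoustic processor 1004 is designed to convert a speech waveform input into a string of labels, or fenemes, each of which roughly identifies a corresponding sound type. In the present system, the acoustic processor 1004 is based on a unique model of the human ear and is described in U.S. patent application Ser. No. 06/665401 (filed October 26, 1984).

The labels, or fenemes, from the acoustic processor 1004 are sent to the stack decoder 1002. FIG. 3 shows the logical elements of the stack decoder 1002: a search element 1020 and, connected to it, the workstation 1012 and interfaces 1022, 1024, 1026, and 1028. These interfaces connect to the acoustic processor 1004, the array processors 1006 and 1008, and the language model 1010, respectively.

In operation, fenemes from the acoustic processor 1004 are routed by the search element 1020 to the array processor 1006 (fast match).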
下記に説明する高速マツチング手順は前記米国特
許出願第06／672974号（1984年11月19日出願）に
も記載されている。マツチングの目的は、簡単に
いえば、所与のラベル・ストリングの少なくとも
１つの最も見込みのあるワードを決定することで
ある。高速マツチングはワード語彙の中のワードを検
査するとともに、所与の到来ラベルのストリング
の候補ワードの数を少なくするように設計されて
いる。高速マツチングは確率的な有限状態マシン
（本明細書ではマルコフ・モデルともいう）に基
づくものである。高速マツチングが候補ワード数を減少した後、
スタツク・デコーダ１００２は、言語モデル１０
１０と対話し、できれば、現に存在する三重字に
基づき、高速マツチング候補リスト中の各候補ワ
ードの文脈上の尤度を確定する。精密マツチングは、これらのワードを、話され
ワードとして適度の尤度を有する高速マツチング
候補リストから、言語モデル計算に基づいて検査
することが望ましい。精密マツチングも前記米国
特許出願第06／672974号（1984年11月19日出願）
に記載されている。精密マツチングは、第４図に
示すようなマルコフ・モデル音素マシンにより実
行する。精密マツチングの後、再び言語モデルを呼出
し、ワードの尤度を決定することが望ましい。本
発明のスタツク・デコーダ１００２は、―高速マ
ツチング、精密マツチング、および言語モデルの
使用から得られた情報を用いて―生成されたラベ
ル・ストリングのワードの最も見込みのあるパス
すなわちシーケンスを確定するように設計されて
いる。
Two conventional methods of finding the most likely sequence of words are Viterbi decoding and single-stack decoding. Both techniques are described, in separate sections, in L. R. Bahl et al., "A Maximum Likelihood Approach to Continuous Speech Recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-5, No. 2, March 1983.
単一スタツク復号化手法では、長さの異なるパ
スは尤度に従つて単一のスタツクにリスト化さ
れ、この単一のスタツクに基づいて復号化され
る。単一スタツク復号化は、パスの長さにより尤
度がいくらか異なるので一般に正規化が用いられ
るという事実に基づくものである。ビタービ手法は正規化を必要とせず、一般に小
さいタスクに向いている。もう１つの代替方法として、小さい語彙システ
ムにより復号化を実行することがある。この場
合、見込みのあるワード・シーケンスとして可能
なワードの組合せを検査し、どの組合せが、生成
されたラベル・ストリングを生成する最高の確率
を有するかを決定する。この手法の計算要求は大
きい語彙システムの場合には実際的ではない。スタツク・デコーダ１００２は、実際には、他
の要素を制御するように作用するが、実行される
計算は多くはない。従つて、スタツク・デコーダ
１００２は、VM（仮想計算機）／システム・プ
ロダクト・イントロダクシヨン・リリース３
（1983）のような出版物に記載されているように、
IBM VM／370オペレーテイング・システムの制
御の下にランする4341プロセツサを含むことが望
ましい。相当な量の計算を実行するアレイ・プロ
セツサは、フローテイング・ポイント・システム
（FPS）社製の市販の190Lにより実現されてい
る。複数のスタツクおよび独特な決定方式を含む新
規の手法がエル・アール・バール（L.R.Bahl）
外により発明されている。この手法は第５図、第
６図および第７図に示されている。第５図および第６図に、連続する“ラベル間
隔”で生成された複数の連続ラベルY₁，Y₂……
が示されている。また第６図には、複数のワード・パスすなわち
パスＡ、パスＢ、およびパスＣが示されている。
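In the context of FIG. 5, path A might correspond to the entry "to be or", path B to the entry "two b", and path C to the entry "too". For a subject word path there is a label (or, equivalently, a label interval) at which the path has the highest probability of having ended. Such a label is called a "boundary label".

For a word path W representing a sequence of words, the most likely end time (represented in the label string as a "boundary label" between two words) can be found by a known method such as that described in L. R. Bahl et al., "Faster Acoustic Match Computation", IBM Technical Disclosure Bulletin, Vol. 23, No. 4, September 1980. Briefly, the article discusses how to address two matters of interest:

(a) how much of a label string Y is accounted for by a word (or word sequence), and
(b) at which label interval a partial sentence, corresponding to a part of the label string, ends.

For any given word path there is a "likelihood value" associated with each label, or label interval, from the first label of the label string through the boundary label. Taken together, the likelihood values of a given word path form its "likelihood vector". Each word path therefore has a corresponding likelihood vector. Likelihood values L_t are illustrated in FIG. 6.

The "likelihood envelope" Λ_t at label interval t for a collection of word paths W^1, W^2, ..., W^s is defined mathematically as

$\Lambda_t = \max\bigl(L_t(W^1), \ldots, L_t(W^s)\bigr)$

That is, for each label interval the likelihood envelope contains the highest likelihood value associated with any word path in the collection. A likelihood envelope 1040 is shown in FIG. 6.

A word path is considered "complete" if it corresponds to a complete sentence. A complete path is preferably identified by the speaker signalling, for example by pressing a button, that the end of a sentence has been reached. The signal is synchronized with the label interval that marks the sentence end. A complete word path cannot be extended by appending words to it. A partial word path corresponds to an incomplete sentence and can be extended.

Partial paths are classified as "live" or "dead". A word path is dead if it has already been extended and live if it has not. With this classification, a path that has already been extended to form one or more longer word paths is not reconsidered for extension at a later time.

Each word path can also be characterized as "good" or "bad" relative to the likelihood envelope. A word path is good if, at the label corresponding to its boundary label, its likelihood value lies within the maximum likelihood envelope; otherwise it is bad. Preferably, but not necessarily, each value of the maximum likelihood envelope is reduced by a fixed amount to serve as the good/bad threshold level.

There is a stack element for each label interval. Each live word path is assigned to the stack element corresponding to the label interval of its boundary label. A stack element may hold zero, one, or more word path entries, listed in order of likelihood value.

次に、第２図のスタツク・デコーダ１００２に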
より実行されるステツプについて説明する。尤度包絡線を形成し、どのワード・パスが良い
かを決定することは、第７図のスタツク復号手法
の流れ図に示すように相互関係を有する。第７図の流れ図において、ブロツク１０５０
で、最初に、ナル・パスが第１のスタツク（０）
に入る。ブロツク１０５２で、前に確定されてい
る完全なパスを含む（完全な）スタツク要素が、
もしあれば、供給される。（完全な）スタツク要
素中の完全なパスの各々は、それに関連する尤度
ベクトルを有する。その境界ラベルに最高の尤度
を有する完全なパスの尤度ベクトルは、最初に最
尤包絡線を決める。もし（完全な）スタツク要素
に完全なパスがなければ、最尤包絡線は各ラベル
間隔で−∞に初期設定される。更に、完全なパス
が指定されていない場合にも、最尤包絡線が−∞
に初期設定されることがある。包絡線の初期設定
はブロツク１０５４および１０５６で行われる。最尤包絡線は、初期設定された後、所定の量△
だけ減少され、減少された尤度を上まわる△規定
の良い領域を形成し、減少された尤度を下まわる
△規定の悪い領域を形成する。△が大きければ大
きいほど、延長が可能とみなされるワード・パス
数が大きくなる。L_tを確定するのにlog₁₀を用い
る場合、△の値が２であれば満足すべき結果が得
られる。△の値がラベル間隔の長さに沿つて均一
であることは、望ましいけれども、必ずしも必要
ではない。ワード・パスがその境界ラベルに、△−良い領
域内にある尤度を有する場合、そのワード・パス
は“良い”とマークされる。その他の場合は、ワ
ード・パスは“悪い”とマークされる。第７図に示すように、尤度包絡線を更新し、ワ
ード・パスを“良い”（延長が可能な）パス、ま
たは“悪い”パスとしてマークするループは、マ
ークされていない最長ワード・パスを探すブロツ
ク１０５８で始まる。２以上のマークされていな
いワード・パスが、最長のワード・パス長に対応
するスタツクにある場合、その境界ラベルに最高
の尤度を有するワード・パスが選択される。ワー
ド・パスが発見された場合、ブロツク１０６０
で、その境界ラベルでの尤度が△−規定の良い領
域内にあるかどうかを調べる。もし良い領域内に
なければ、ブロツク１０６２で、△規定の悪い領
域内のパスとマークし、ブロツク１０５８で、次
のマークされていない生きているパスを探す。も
し良い領域内にあれば、ブロツク１０６４で、△
規定の良い領域内のパスとマークし、ブロツク１
０６６で、尤度包絡線を更新して、“良い”とマ
ークされたパスの尤度値を包含する。すなわち、
ラベル間隔ごとに、更新された尤度値は、 (a) その尤度包絡線内の現在の尤度値と、 (b) “良い”とマークされたワード・パスに関連
した尤度値の間のより大きい尤度値として確定される。この
動作はブロツク１０６４および１０６６で行われ
る。包絡線が更新された後、ブロツク１０５８に
戻り、マークされていない最長、最良の生きてい
るワード・パスを再び探す。このループは、マークされていないワード・パ
スがなくなるまで反復される。マークされていな
いワード・パスがなくなると、ブロツク１０７０
で、最短の“良い”とマークされたワード・パス
が選択される。もし、最短の長さを有する２以上
の“良い”ワード・パスがあれば、ブロツク１０
７２で、その境界ラベルに最高の尤度を有するワ
ード・パスが選択され、選択された最短のパスは
延長される。すなわち、少なくとも１つのありう
る後続ワードが、前述のように、高速マツチン
グ、言語モデル、精密マツチング、および言語モ
デル手順を良好に実行することにより確定され
る。見込みのある後続ワードごとに、延長された
ワード・パスが形成される。詳細に述べれば、延
長されたワード・パスは、選択された最短ワー
ド・パスの終りに、見込みのある後続ワードを付
加することにより形成される。選択された最短ワード・パスが、延長されたワ
ード・パスを形成した後、該選択されたワード・
パスは、それがエントリであつたスタツクから除
去され、その代りに、各々の延長されたワード・
パスは適切なスタツクに挿入される。特に、延長
されたワード・パスは、その境界ラベルに対応す
るスタツクへのエントリになる（ブロツク１０７
２）。延長されたパスが形成され、そのスタツクが再
形成された後、ブロツク１０５２に戻り、プロセ
スが反復される。従つて、反復ごとに、最短、最良の“良い”ワ
ード・パスが選択され、延長される。ある反復で
“悪い”パスとマークされたワード・パスは後の
反復で“良い”パスになることがある。よつて、
生きているワード・パスが“良い”パスか、“悪
い”パスかという特徴は、各々の反復で独自に付
与される。実際には、尤度包絡線は１つの反復と
次の反復とで大幅には変化しないので、ワード・
パスが良いか悪いかを決定する計算が効率的に行
われる。更に、正規化も不要になる。完全な文を識別する場合、ブロツク１０７４を
包含することが望ましい。すなわち、生きている
ワード・パスでマークされずに残つているものは
なく、延長すべき“良い”ワード・パスがない場
合、復号は終了する。その境界ラベルのそれぞれ
に最高の尤度を有する完全なワード・パスが、入
力ラベル・ストリングの最も見込みのあるワー
ド・シーケンスとして識別される。文終了が識別されない連続音声の場合、パス延
長は、継続して行われる、すなわち、そのシステ
ムのユーザが希望する所定のワード数について行
われる。 F1b 聴覚モデルおよび音声認識システムの音響
プロセツサにおけるその実現（第８図〜第１
４図）第８図は、前述のような音響プロセツサ１１０
０の特定の実施例を示す。音響波入力（例えば、
自然の音声）が、所定の速度でサンプリングする
Ａ／Ｄ変換器１１０２に入る。代表的なサンプリ
ング速度は毎50マイクロ秒当り１サンプルであ
る。デイジタル信号の端を整形するために、時間
窓発生器１１０４が設けられている。時間窓発生
器１１０４の出力は、時間窓ごとに周波数スペク
トル出力を与えるFFT（高速フーリエ変換）装置
１１０６に入る。そして、FFT装置１１０６の出力は、ラベル
L₁L₂……L_fを生成するように処理される。特徴
選択装置１１０８、クラスタ装置１１１０、原型
装置１１１２および記号化装置１１１４は共同し
てラベルを生成する。ラベルを生成する際、原型
は、選択された特徴に基づき空間に点（またはベ
クトル）として形成される。音響入力は、選択さ
れた同じ特徴により、原型に比較しうる対応する
点（またはベクトル）を空間に供給するように特
徴づけられている。詳細に言えば、原型を定義する際、クラスタ装
置１１１０により点のセツトをそれぞれのクラス
タとして群化する。クラスタを形成する方法は、
音声に適応される（ガウス分布のような）確率分
布に基づいている。各クラスの原型は、（クラス
タの中心軌跡または他の特徴に関連して）原型装
置１１１２により生成される。生成された原型お
よび音響入力（どちらも同じ特徴が選択されてい
る）は記号化装置１１１４に入る。記号化装置１
１１４は比較手順を実行し、その結果、特定の音
響入力にラベルを割当てる。適切な特徴の選択は、音響（音声）波入力を表
わすラベルを取出す際の重要な要素である。ここ
に説明する音響プロセツサは改良された特徴選択
装置１１０８を含む。この音響プロセツサに従つ
て、の聴覚モデルが取出され、音声認識システム
の音響プロセツサで使用される。聴覚モデルを、
第９図により説明する。第９図は人間の内耳の部分を示す。詳細に述べ
れば、内毛細胞１２００と、液体を含有する溝１
２０４に広がる末端部１２０２が詳細に示されて
いる。また、内毛細胞１２００から上流には、外
毛細胞１２０６と、溝１２０４に広がる末端部１
２０８が示されている。内毛細胞１２００と外毛
細胞１２０６には、脳に情報を伝達する神経が結
合している。特に、ニユーロンが電気化学的変化
を受け、電気パルスが神経に沿つて脳に運ばれ、
処理されることになる。電気化学的変化は、基底
膜１２１０の機械的運動により刺激される。基底膜１２１０が音響波入力の周波数分析器と
して作用し、基底膜１２１０に沿つた部分がそれ
ぞれの臨界周波数バンドに応答することは従来か
ら知られている。対応する周波数バンドに応答す
る基底膜１２１０のそれぞれの部分は、音響波形
入力を知覚する音量に影響を与える。すなわち、
トーンの音量は、類似のパワーの強度の２つのト
ーンが同じ周波数バンドを占有する場合よりも、
２つのトーンが別個の臨界周波数バンドにある場
合の方が大きく知覚される。基底膜１２１０によ
り規定された22の等級の臨界周波数バンドがある
ことが分つている。基底膜１２１０の周波数レスポンスに合わせ
て、本発明は良好な形式で、臨界周波数バンドの
一部または全部に入力された音響波形を物理的に
定め、次いで、規定された臨界周波数バンドごと
に別個に信号成分を検査する。この機能は、
FFT装置１１０６（第８図）からの信号を適切
に濾波し、検査された臨界周波数バンドごとに特
徴選択装置１１０８に別個の信号を供給すること
により行われる。別個の入力も、時間窓発生器１１０４により
（できれば25.6ミリ秒の）時間フレームにブロツ
クされる。それゆえ、特徴選択装置１１０８は22
の信号を含むことが望ましい。これらの信号の
各々は、時間フレームごとに所与の周波数バンド
の音の強さを表わす。信号は、第１０図の通常の臨界バンド・フイル
タ１３００により濾波することが望ましい。次い
で、信号は個別に、音量の変化を周波数の関数と
して知覚する音響等化変換器１３０２により処理
する。ちなみに、１つの周波数で所与のdBレベ
ルの第１のトーンの知覚された音量は、もう１つ
の周波数で同じdBレベルの第２のトーンの音量
と異なることがある。音量等化変換器１３０２
は、経験的なデータに基づき、それぞれの周波数
バンドの信号を変換して各々が同じ音量尺度で測
定されるようにする。例えば、音量等化変換器１
３０２は、1933年のフレツチヤおよびムンソン
（Fletcher and Munson）の研究に多少変更を加
えることにより、音響エネルギを同等の音量に写
像することができる。第１１図は前記研究に変更
を加えた結果を示す。第１１図により、40dBで
1KHzのトーンは60dBで100Hzのトーンの音量レ
ベルに対応することが分る。音量等化変換器１３０２は、第１１図に示す曲
線に従つて音量を調整し、周波数と無関係に同等
の音量を生じさせる。周波数への依存性のほか、第１１図で特定の周
波数を調べれば明らかなように、パワーの変化は
音量の変化に対応しない。すなわち、音の強度、
すなわち振幅の変動は、すべての点で、知覚され
た音量の同様の変化に反映されない。例えば、
100Hzの周波数では、110dB付近における10dBの
知覚された音量変化は、20dB付近における10dB
の知覚された音量変化よりもずつと大きい。この
差は、所定の方法で音量を圧縮する音量圧縮装置
１３０４により処理する。音量圧縮装置１３０４
は、ホン単位の音量振幅測定値をソーン単位に置
換えることにより、パワーＰをその立方根P^1/3に
圧縮することができる。第１２図は、経験的に決められた既知のホン対
ソーンの関係を示す。ソーン単位の使用により、
本発明のモデルは大きな音声信号振幅でもほぼ正
確な状態を保持する。１ソーンは、1KHzのトーンで40dBの音量と規
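定されている。
FIG. 10 also shows a novel time-varying response element 1306, which operates on the loudness-equalized, loudness-compressed signal associated with each critical frequency band. Specifically, for each frequency band examined, a neural firing rate f is determined at each time frame. According to the present acoustic processor, the firing rate is defined as

$f = (S_o + D L)\,n$  (1)

where n is the amount of neurotransmitter; S_o is a spontaneous firing constant, which relates to neural firings that occur independently of the acoustic waveform input; L is a measure of loudness; and D is a displacement constant. The term S_o n corresponds to the spontaneous neural firing rate, which occurs whether or not there is acoustic input, and the term D L n corresponds to the firing rate due to the acoustic input.

Significantly, the value of n varies over time according to

$dn/dt = A_o - (S_o + S_h + D L)\,n$  (2)

where A_o is a replenishment constant and S_h is a spontaneous neurotransmitter decay constant. The relationship in equation (2) takes into account that neurotransmitter is produced at a constant rate A_o while being lost through (a) decay (S_h n), (b) spontaneous firing (S_o n), and (c) neural firing due to acoustic input (D L n). These modeled phenomena are presumed to occur at the locations shown in FIG. 9.

Equation (2) also shows that the present acoustic processor is nonlinear: the next amount of neurotransmitter and the next firing rate depend multiplicatively on the current conditions, at least on the current amount of neurotransmitter. That is, the amount of neurotransmitter at time t + Δt equals the amount at time t plus (dn/dt) Δt:

$n(t + \Delta t) = n(t) + (dn/dt)\,\Delta t$  (3)

Equations (1), (2), and (3) describe the operation of a time-varying signal analyzer, which reflects the fact that the auditory system adapts over time and that the signals on the auditory nerve are nonlinearly related to the acoustic input.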
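Equations (1) to (3) can be exercised numerically with a short sketch. The Python below is only an illustration under stated assumptions: the constants are the illustrative values S_o = 0.0888, S_h = 0.11111, D = 0.00666 and A_o = 1 quoted for this processor (time measured in centiseconds), the step Δt is taken as one 10 ms frame, and the function names and the choice of starting from the silent steady state are hypothetical.

```python
# Illustrative constants quoted in the text (time unit: centiseconds).
SO = 0.0888    # spontaneous firing constant S_o
SH = 0.11111   # spontaneous neurotransmitter decay constant S_h
D = 0.00666    # displacement constant D
AO = 1.0       # replenishment constant A_o


def firing_rate(n, loudness):
    """Equation (1): f = (S_o + D*L) * n."""
    return (SO + D * loudness) * n


def next_amount(n, loudness, dt=1.0):
    """Equations (2) and (3): one Euler step of dn/dt = A_o - (S_o + S_h + D*L) * n."""
    dn_dt = AO - (SO + SH + D * loudness) * n
    return n + dn_dt * dt


def simulate(loudness_per_frame, n0=None, dt=1.0):
    """Return the firing rate of one frequency band for each time frame."""
    # Assumption: start from the silent steady state n = A_o / (S_o + S_h).
    n = AO / (SO + SH) if n0 is None else n0
    rates = []
    for loudness in loudness_per_frame:
        rates.append(firing_rate(n, loudness))
        n = next_amount(n, loudness, dt)
    return rates
```

Feeding `simulate` five silent frames followed by a constant 20-sone input gives a high onset rate that decays toward a steady-state rate about 1.5 times the silent rate, which is the adaptive behavior the equations are meant to capture.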
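In this regard, the present acoustic processor provides the first model that applies nonlinear signal processing in a speech recognition system so as to conform better to the apparent temporal behavior of the nervous system.

To reduce the number of unknowns in equations (1) and (2), the following relation, which applies to a fixed loudness L, is used:

$S_o + S_h + D L = 1/T$  (4)

where T is a measure of the time taken for the auditory response to fall to 37% of its maximum after an audio input is produced. T is a function of loudness and is derived from known graphs showing the decay of the response at various loudness levels. That is, when a tone of fixed loudness is produced, it first elicits a response at a high level, after which the response decays toward a steady-state level with time constant T. With no acoustic input, T = T_0, which is on the order of 50 milliseconds. For a loudness of L_max, T = T_max, which is on the order of 30 milliseconds. With A_o set to 1, 1/(S_o + S_h) is determined to be 5 centiseconds when L = 0. When L is L_max and L_max = 20 sones, the following holds:

$S_o + S_h + D(20) = 1/30$  (5)

From the above data and equations, S_o and S_h are given by equations (6) and (7):

$S_o = D L_{max} / \bigl[ R + (D L_{max} T_0 R) - 1 \bigr]$  (6)

$S_h = 1/T_0 - S_o$  (7)

where

$R = f_{\text{steady state}}\big|_{L = L_{max}} \,/\, f_{\text{steady state}}\big|_{L = 0}$  (8)

and f_steady state is the firing rate at a given loudness when dn/dt is zero.

R is the only variable remaining in the acoustic processor, so the performance of the processor is altered simply by changing R. R is a single parameter that can be adjusted to alter performance, which normally means minimizing steady-state effects relative to transient effects. Minimizing steady-state effects is desirable because inconsistent output patterns for similar speech inputs generally result from differences in frequency response, speaker differences, background noise, and distortion, all of which affect the steady-state portions of the speech signal but not the transient portions. The value of R is preferably set by optimizing the error rate of the complete speech recognition system. The optimum value found in this way is R = 1.5. The values of S_o and S_h are then 0.0888 and 0.11111, respectively, and D is found to be 0.00666.

第１３図は本発明による音響プロセツサの動作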
の流れ図である。できれば、20kHzでサンプリン
グされた、25.6ミリ秒の時間フレーム中のデイジ
タル化音声は、ハニング窓１３２０を通過し、そ
の出力は10ミリ秒間隔で、DFT１３２２におい
て２重フーリエ変換されることが望ましい。変換
出力はブロツク１３２４で濾波され、少なくても
１つの周波数バンド（できればすべての臨界周波
数バンドか、または少なくとも20のバンド）の
各々にパワー密度出力を供給する。次いで、パワ
ー密度はブロツク１３２６で、記録された大きさ
から音量レベルに変換される。この動作は、第１
１図のグラフの変更により容易に実行される。そ
の後のプロセスの概要（ブロツク１３３０の限界
値更新を含む）は第１４図に示されている。第１４図において、最初に、濾波された周波数
バンドｍの各々の感覚限界Tfおよび可聴限界T_h
がそれぞれ、120dBおよび0dBになるように設定
される（ブロツク１３４０）。その後、音声カウ
ンタ、合計フレーム・レジスタおよびヒストグラ
ム・レジスタをリセツトする（ブロツク１３４
２）。ヒストグラムの各々はビン（bin）を含み、ビ
ンの各々は、（所与の周波数バンドで）パワーま
たは類似の測定値がそれぞれのレンジ内にある間
のサンプル数すなわちカウント数を表わす。本発
明では、ヒストグラムは、（所与の周波数バンド
ごとに）音量が複数の音量レンジの各々の中にあ
る期間のセンチ秒数を表わすことが望ましい。例
えば、第３の周波数バンドでは、10dBと20dBの
パワーの間が20センチ秒の場合がある。同様に、
第20の周波数バンドでは、50dBと60dBの間に、
合計1000センチ秒のうちの150センチ秒がある場
合がある。合計サンプル数（すなわちセンチ秒）
およびビンに含まれたカウントから百分位数が取
出される。ブロツク１３４４で、それぞれの周波数バンド
のフイルタ出力のフレームが検査され、ブロツク
１３４６で、適切なヒストグラム（フイルタ当り
１つ）中のビンが増分される。ブロツク１３４８
で、振幅が55dBを越えるビンの合計数がフイル
タ（すなわち周波数バンド）ごとに集計され、音
声の存在を示すフイルタ数を決定する。ブロツク
１３５０で、音声の存在を示す最小限（例えば20
のうちの６）のフイルタがない場合、ブロツク１
３４４で次のフレームを検査する。音声の存在を
示す十分なフイルタがある場合、ブロツク１３５
２で、音声カウンタを増分する。音声カウンタ
は、ブロツク１３５４で音声が10秒間現われ、ブ
ロツク１３５６で新しいTfおよびT_hの値がフイ
ルタごとに決定されるまで増分される。所与のフイルタの新しいTfおよびT_hの値は次
のように決定される。Tfの場合、1000ビンの最
上位から35番目のサンプルを保持するビンのdB
値（すなわち、音量の96.5番目の百分位数）は
BIN_Hと定義され、TfはTf＝BIN_H＋40dBに設定
される。T_hの場合、最下位のビンから（0.01）
（ビン総数−音声カウント）番目の値を保持する
ビンのdB値がBIN_Lと定義される。すなわち、
BIN_Lは、ヒストグラム中の、音声として分類さ
れたものを除いたサンプル数の１％のビンであ
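る。T_hはT_h＝BIN_L−30dBと定義される。
At blocks 1330 and 1332 of FIG. 13, the thresholds are updated as described above, and the sound amplitudes are converted to sones and compressed on the basis of the updated thresholds. An alternative way of introducing sones and compression is to take the filter amplitude a (after the bins have been incremented) and convert it to dB:

$a^{dB} = 20 \log_{10}(a) - 10$  (9)

Each filter amplitude is then scaled to a range between 0 dB and 120 dB to provide equal loudness:

$a^{eql} = 120\,(a^{dB} - T_h) / (T_f - T_h)$  (10)

Each a^{eql} is then preferably converted from a loudness level (phons) to an approximation of loudness in sones (with a 1 kHz signal at 40 dB mapping to 1):

$L^{dB} = (a^{eql} - 30) / 4$  (11)

The approximate loudness in sones, L_s, is then given by

$L_s = 10^{L^{dB}/20}$  (12)

At step 1334, L_s is used as the input to equations (1) and (2) to determine the output firing rate f for each frequency band. With 22 frequency bands, a 22-dimensional vector characterizes the acoustic input over successive time frames. Generally, however, 20 frequency bands are examined using a conventional mel-scaled filter bank.

Before the next time frame is processed, the "next state" of n is determined at block 1337 according to equation (3).

前述の音響プロセツサは、発火率ｆおよび神経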
伝達質量ｎが大きいDCペデスタルを有する場合
の使用についての改善を必要とする。すなわち、
ｆおよびｎの式の項のダイナミツクレンジが重要
な場合、下記の式を導いてペデスタルの高さを下
げる。安定状態で、かつ音響入力信号が存在しない
（Ｌ＝０）場合、式(2)は次のように安定状態の内
部状態n′について解くことができる。 n′＝Ａ／（So＋Sh）（13）神経伝達物質の量ｎ（ｔ）の内部状態は、次の
ように安定状態部分および変動部分として示され
る。ｎ（ｔ）＝n′＋n″（ｔ）（14）式(1)および（14）を結合すると、次のように発
火率が得られる。ｆ（ｔ）＝（So＋Ｄ・Ｌ）（n′＋n″（ｔ））（15） So・n′の項は定数であるが、他のすべての項
は、ｎの変動部分か、または（Ｄ・Ｌ）により表
わされた入力信号を含む。爾後の処理は出力ベク
トル間の差の二乗のみに関連するので、定数項は
無視される。式（15）および（13）から次式が得
られる。 f″（ｔ）＝（So＋Ｄ・Ｌ）・〔｛n″（ｔ）＋Ｄ・Ｌ
・
Ａ｝／（So＋Sh）〕（16）式(3)を考慮すると、“次状態”は次のようにな
る。ｎ(t+△t)＝n′(t+△t)＋n″(t+△t) （17）ｎ(t+△t)＝n″(t)＋Ａ−(So+Sh+D・L) ・（n′＋n″(t) （18）ｎ(t+△t)＝n″(t)−(Sh・n″(t) −（So＋Ao・L^A）・n″（ｔ） −（Ao・L^A・Ｄ）／（So＋Sh）＋Ao −（So・Ao）＋（Sh・Ao））／（So＋Sh）（19）式（19）はすべての常数項を無視すれば次のよ
うになる。 n″(t+△t)＝n″(t)(1-So・△t)−f″（ｔ）（20）式（15）および（20）は、それぞれの10ミリ秒
時間フレーム中に各フイルタに適用される出力式
および状態更新式を構成する。これらの式の使用
結果は10ミリ秒ごとの20要素のベクトルであり、
このベクトルの各要素は、メルでスケーリングさ
れたフイルタ・バンクにおけるそれぞれの周波数
バンドの発火率に対応する。前述の実施例に関し、第１３図の流れ図は、発
火率ｆおよび“次状態”ｎ（ｔ＋△ｔ）の特別の
場合の式をそれぞれ定義する式（11）および
（16）により、ｆ，dn／dtおよびｎ（ｔ＋△ｔ）
の式を置換える以外は当てはまる。それぞれの式の項に特有の値（すなわち、t₀＝
5csec、t_L＝3csec，Ao＝１，Ｒ＝1.5および
Lmax＝20）は他の値に設定することができ、
So，ShおよびＤの項は、他の項が異なつた値に
設定されると、それぞれの望ましい値0.0888，
0.11111、および0.00666とは異なる値になる。本発明は種々のソフトウエアまたはハードウエ
アにより実施することができる。 F1c 精密マツチング（第４図、第１５図、第１
６図）第４図は一例として音標型の音素マシン、2000
を示す。音標型マツチングの各マシンは、確率的
に限定された状態マシンであり、 (a) 複数の状態Si； (b) 複数の遷移tr（Sj｜Si）：ある遷移は異なつた
状態間で、ある遷移は同じ状態間で遷移し、各
遷移は対応する確率を有する； (c) 特定の遷移で生成しうるラベルごとに対応す
る実際のラベル確率を有することを特徴とする。第４図では、７つの状態S₁〜S₇ならびに13の遷
移tr１〜tr１３が精密マツチング音素マシン２０
００に設けられ、その中の３つの遷移tr１１，tr
１２およびtr１３のパスは破線で示されている。
これらの３つの遷移の各々で、音素はラベルを生
成せずに１つの状態から別の状態に変ることがあ
る。従つて、このような遷移はナル遷移と呼ばれ
る。遷移tr１〜tr１０に沿つて、ラベルを生成す
ることができる。詳細に述べれば、遷移tr１〜tr
１０の各々に沿つて少なくとも１つのラベルは、
そこに生成される独特の確率を有することがあ
る。遷移ごとに、システムで生成することができ
る各ラベルに関連した確率がある。すなわち、も
し選択的に音響チヤンネルにより生成することが
できるラベルが200あれば、（ナルではない）各遷
移はそれに関連した“実際のラベル確率”を200
有し、その各々は、対応するラベルが特定の遷移
で音素により生成される確率に対応する。遷移tr
１の実際のラベル確率は、図示のように、記号Ｐ
と、それに続くブラケツトに囲まれた１〜200の
列で表わされる。これらの数字の各々は所与のラ
ベルを表わす。ラベル１の場合は、精密マツチン
グ音素マシン２０００が遷移tr１でラベル１を生
成する確率Ｐ〔１〕がある。種々の実際のラベル
確率は、ラベルおよび対応する遷移に関連して記
憶されている。ラベルｙ１ｙ２ｙ３……のストリングが、所与
の音素に対応する精密マツチング音素マシン２０
００に提示されると、マツチング手順が実行され
る。精密マツチング音素マシンに関連した手順に
ついて第１５図により説明する。第１５図は第４図の音素マシンのトレリス図で
ある。前記音素マシンの場合のように、このトレ
リス図も状態S₁から状態S₇へのナル遷移、状態S₁
から状態S₂への遷移、および状態S₁から状態S₄へ
の遷移を示す。他の状態間の遷移も示されてい
る。また、トレリス図は水平方向に、測定された
時刻を示す。開始時確率q₀、およびq₁は、音素が
その音素の時刻ｔ＝t₀またはｔ＝t₁のそれぞれに
おいて開始時刻を有する確率を表わす。各開始時
刻におけるそれぞれの遷移も示されている。ちな
みに、連続する開始（および終了）時刻の間隔
は、ラベルの時間間隔に等しい長さであることが
望ましい。精密マツチング音素マシン２０００を用いて所
与の音素が到来ストリングのラベルにどれくらい
ぴつたりとマツチングされるかを決定する際、そ
の音素の終了時刻分布を探索して、その音素のマ
ツチング値を決めるのに使用する。終了時刻分布
に依存して精密マツチングを実行する方法は、マ
ツチング手順に関して本発明で説明するすべての
音素マシンの実施例に共通である。精密なマツチ
ングを実行するため終了時刻分布を生成する際、
精密マツチング音素マシン２０００は、正確で複
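雑な計算を必要とする。
Consider first, with reference to the trellis of FIG. 15, the computation required for the phone to have both its start time and its end time at t = t_0. For the example phone machine structure of FIG. 4, the following probability applies:

$\Pr(S_7, t = t_0) = q_0\,T(1 \to 7) + \Pr(S_2, t = t_0)\,T(2 \to 7) + \Pr(S_3, t = t_0)\,T(3 \to 7)$  (21)

where Pr denotes a probability and T(i → j) denotes the transition probability between the two states in parentheses. The equation gives the probabilities of the three conditions under which the end time can occur at t = t_0. Moreover, in this example an end time at t = t_0 can occur only in state S_7.

Turning to the end time t = t_1, a computation must be made for every state other than S_1, since state S_1 starts at the end time of the previous phone. For purposes of explanation, only the computation for state S_4 is shown.

For S_4 the computation is

$\Pr(S_4, t = t_1) = \Pr(S_1, t = t_0)\,T(1 \to 4)\,\Pr(y_1 \mid 1 \to 4) + \Pr(S_4, t = t_0)\,T(4 \to 4)\,\Pr(y_1 \mid 4 \to 4)$  (22)

Equation (22) states that the probability of the phone machine being in state S_4 at time t = t_1 is the sum of two terms: (a) the probability of being in state S_1 at time t = t_0, multiplied by the probability of the transition from S_1 to S_4, multiplied further by the probability of producing the given label y_1 of the string on the transition from S_1 to S_4; and (b) the probability of being in state S_4 at time t = t_0, multiplied by the probability of the transition from S_4 to itself, multiplied further by the probability of producing y_1 on that self-transition.

Similar computations are made for the other states (except S_1) to obtain the probability that the phone is in each state at time t = t_1. In general, to determine the probability of being in a subject state at a given time, the detailed match

(a) identifies each previous state that has a transition leading to the subject state, together with the probability of each such previous state;
(b) identifies, for each such previous state, a value representing the probability of the label that must be produced on the transition between that previous state and the current state in order to conform to the label string; and
(c) combines the probability of each previous state with the corresponding label probability value to give the probability of the subject state over the corresponding transition.

The overall probability of being in the subject state is determined from the subject-state probabilities over all the transitions leading to it.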
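The per-state computation of the detailed match can be sketched as a single trellis step. This is a minimal illustration rather than the patented APAL implementation: the function name, the tuple layout of the transitions, and the `entry` argument (probability injected into the start state at the current time) are assumptions made for the example, and null transitions are assumed to be listed so that each source state is final before it is used.

```python
def step_state_probs(prev, label, label_transitions, null_transitions=(), entry=None):
    """Advance state probabilities by one label interval.

    prev              -- dict: state -> probability at the previous time
    label             -- the label observed in this interval
    label_transitions -- iterable of (src, dst, trans_prob, label_probs), where
                         label_probs maps label -> probability on that transition
    null_transitions  -- iterable of (src, dst, trans_prob); these produce no label
                         and therefore connect states at the same time
    entry             -- dict: state -> probability entering at the current time
    """
    cur = dict(entry or {})
    # Label-producing transitions: previous-state probability times transition
    # probability times the probability of producing the observed label.
    for src, dst, trans_prob, label_probs in label_transitions:
        contribution = prev.get(src, 0.0) * trans_prob * label_probs.get(label, 0.0)
        cur[dst] = cur.get(dst, 0.0) + contribution
    # Null transitions move probability between states without consuming a label.
    for src, dst, trans_prob in null_transitions:
        cur[dst] = cur.get(dst, 0.0) + cur.get(src, 0.0) * trans_prob
    return cur
```

With two label transitions 1→4 and 4→4, the value returned for state 4 is the two-term sum of equation (22); with only a null transition 1→7 and an entry probability for state 1, the value for state 7 is the first term of equation (21).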
状態S₇に関する計算は、３つのナル遷移に関する
項を含み、その音素が状態S₇で終了する音素によ
り時刻ｔ＝t₁で開始・終了することを可能にす
る。時刻ｔ＝t₀およびｔ＝t₁に関する確率を決定す
る場合のように、他の終了時刻の組の確率の決定
は、終了時刻分布を形成するように行うことが望
ましい。所与の音素の終了時刻分布の値は、所与
の音素がどれ位良好に到来ラベルにマツチングさ
れるかを表示する。ワードがどれ位良好に到来ラベルにマツチング
されるかを決定する際、そのワードを表わす音素
は順次に処理される。各音素は確率値の終了時刻
分布を生成する。音素のマツチング値は、終了時
刻確率を合計し、その合計の対数をとることによ
り得られる。次の音素の開始時刻分布は終了時刻
分布を正規化することにより引出される。この正
規化では、例えば、それらの値の各々を、それら
の合計で割ることによりスケーリングし、スケー
リングされた値の合計が１になるようにする。所与のワードまたはワード・ストリングの検査
すべき音素数ｈを決定する方法が少なくとも２つ
ある。深さ優先方法では、計算は基本形態に沿つ
て行う（連続する音素の各々により連続して小計
を計算する）。この小計がそれに沿つた所与の音
素位置の所定の限界値以下であると分つた場合、
計算は終了する。もう１つの方法、幅優先方法で
は、各ワードにおける類似の音素位置の計算を行
う。計算は、各ワードの第１の音素の計算、続い
て各ワードの第２の音素の計算というように、順
次に行う。幅優先方法では、それぞれのワードの
同数の音素に沿つた計算値は、相対的に同じ音素
位置で比較する。いずれの方法でも、マツチング
値の最大の和を有するワードが、求めていた目的
ワードである。精密なマツチングはAPAL（アレイ・プロセツ
サ・アセンブリ言語）で実現されている。これ
は、フローテイング・ポイント・システムズ社
（Floating Point Systems，Inc.）製のアセンブ
ラ190Lである。精密マツチングは、実際のラベル確率（すなわ
ち、所与の音素が所与の遷移で所与のラベルｙを
生成する確率）、音素マシンごとの遷移確率、お
よび所与の音素が所定の開始時刻後の所与の時刻
で所与の状態である確率の各々を記憶するために
かなりのメモリを必要とする。前述の190Lは、
終了時刻、できれば終了時刻確率の対数和に基づ
いたマツチング値、前に生成された終了時刻確率
に基づいた開始時刻、およびワード中の順次音素
のマツチング値に基づいたワード・マツチング得
点のそれぞれの計算をするようにセツトアツプさ
れる。更に、精密なマツチングは、マツチング手
順の末尾確率を計算することが望ましい。末尾確
率はワードとは無関係に連続するラベルの尤度を
測定する。簡単な実施例では、所与の末尾確率は
もう１つのラベルに続くラベルの尤度に対応す
る。この尤度は、例えば、或るサンプル音声によ
り生成されたラベルのストリングから容易に決定
される。それ故、精密なマツチングは基本形態、マルコ
フ・モデルの統計値、および末尾確率を含むのに
十分な記憶装置を備える。各ワードが約10の音素
を含む5000ワードの語彙の場合、基本形式は5000
×10の記憶量を必要とする。（音素ごとにマルコ
フ・モデルを有する）70の別個の音素、200の別
個のラベル、および任意のラベルが生成される確
率を有する10の遷移がある場合、統計値は70×10
×200の記憶ロケーシヨンを必要とすることにな
る。しかしながら、音素マシンは３つの部分（開
始部分、中間部分および終了部分）に分割され、
統計表はそれに対応することが望ましい。（３つ
の自己ループの１つが各部分に含まれることが望
ましい。）従つて、記憶要求は70×３×200に減少
する。末尾確率に関しては、200×200の記憶ロケ
ーシヨンが必要である。この配列では、50Kの整
数および82Kの浮動小数点の記憶装置であれば満
足に動作する。 F1d 基本高速マツチング（第１６図〜第１８
図）精密マツチングの計算には高い費用がかかるか
ら、精度をあまり犠牲にしないで所要の計算を少
なくする基本高速マツチングおよび代替高速マツ
チングを行う。高速マツチングは精密マツチング
に関連して使用することが望ましい。高速マツチ
ングは、語彙から見込みのある候補ワードを取出
してリストに載せ、精密マツチングは大抵の場
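合、このリストの候補ワードで実行される。
The fast approximate acoustic match is described in the aforementioned U.S. patent application Ser. No. 06/672974 (filed November 19, 1984). In the fast approximate acoustic match, each phone machine is preferably simplified by replacing the actual label probability of each label at all transitions in a given phone machine with a specific replacement value. The replacement value is preferably chosen so that the match value for a given phone computed with the replacement values is an overestimate of the match value the detailed match would give with the actual label probabilities. One way to guarantee this is to choose each replacement value so that no probability corresponding to the given label in the given phone machine exceeds it. Substituting replacement values for the actual label probabilities greatly reduces the computation needed to determine a word's match score. Moreover, because the replacement values are preferably overestimates, the resulting match score is not less than the score that would have been obtained without the replacement.

In a particular embodiment that performs the acoustic match in a linguistic decoder with Markov models, each phone machine is characterized, through training, as having

(a) a plurality of states and transition paths between states;
(b) transitions tr(S_j | S_i) with probabilities T(i → j), each of which represents the probability of a transition to state S_j given the current state S_i, where S_i and S_j may be the same state or different states; and
(c) actual label probabilities, where each actual label probability P(y_k | i → j) is the probability that the given phone machine produces label y_k (k being a label-identifying index) on a given transition from one state to the next.

Each phone machine includes

(a) means for assigning a single specific value p'(y_k) to each y_k in the phone machine, and
(b) means for replacing each actual output probability P(y_k | i → j) at each transition in the given phone machine with the single specific value p'(y_k) assigned to the corresponding y_k.

The replacement value is preferably at least as large as the maximum actual label probability of the corresponding label y_k at any transition in the particular phone machine.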
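One way to obtain replacement values that satisfy this condition is to take, for each label, the largest actual label probability found on any transition of the phone machine. The sketch below assumes a hypothetical data layout (a dictionary from transition name to a label-probability dictionary); the patent does not prescribe any particular representation.

```python
def replacement_values(label_probs_by_transition):
    """Return p'(y_k) = max over transitions of P(y_k | i -> j) for one phone machine.

    label_probs_by_transition -- dict: transition id -> {label: probability}
    """
    replacement = {}
    for label_probs in label_probs_by_transition.values():
        for label, prob in label_probs.items():
            # Keep the largest probability seen so far for this label.
            replacement[label] = max(replacement.get(label, 0.0), prob)
    return replacement
```

Because every replacement value is at least as large as the probability it stands in for, a match value computed with these values cannot fall below the detailed-match value.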
高速マツチング実施例は、到来ラベルに対応する
語彙で最も起こりうるワードとして選択された10
乃至100のオーダの候補ワードのリストを形成す
るように使用される。候補ワードは言語モデルお
よび精密なマツチングに従属することが望まし
い。精密なマツチングで考慮するワード数を、語
彙中のワードの１％のオーダに切詰めることによ
り、計算費用は、精度を維持しながら大幅に減少
される。基本高速マツチングは、すべての遷移における
所与のラベルの実際のラベル確率を１つの値と取
替えることにより簡略化し、所与のラベルを所与
の音素マシンで生成することができる。すなわ
ち、ラベルが生じる確率を有する所与の音素マシ
ンにおける遷移とは無関係に、その確率を、１つ
の特定の値と取替える。この値は過大評価され、
少なくとも、所与の音素マシン中の任意の遷移で
生ずるラベルの最大の確率の大きさであることが
望ましい。ラベル確率置換え値を、所与の音素マシン中の
所与のラベルの実際のラベル確率の最大値として
設定することにより、基本高速マツチングにより
生成されたマツチング値が少なくとも、精密なマ
ツチングの使用から生じるようなマツチング値と
同じ大きさであることが保証される。このよう
に、基本高速マツチングは一般に各音素のマツチ
ング値を過大評価するので、より多くのワードが
一般に、候補ワードとして選択される。精密なマ
ツチングにより候補とみなされるワードも、基本
高速マツチングに従つて合格する。第１６図は基本高速マツチング音素マシン３０
００を示す。ラベル（記号およびフイーニームと
も呼ばれる）は開始時刻分布と一緒に基本高速マ
ツチング音素マシン３０００に入る。開始時刻分
布およびラベル・ストリングの入力は、前述の精
密マツチング音素マシンの入力に似ている。開始
時刻は、時には、複数の時刻にわたる分布ではな
いことがあるが、その代り、例えば、沈黙間隔に
続く正確な（音素開始）時刻を表わすこともあ
る。しかしながら、音声が連続している場合、終
了時刻分布は、（後に詳細に説明するように）開
始時刻分布を形成するのに用いられる。基本高速
マツチング音素マシン３０００は、終了時刻分布
を生成するとともに、生成された終了時刻分布か
らの特定の音素のマツチング値を生成する。ワー
ドのマツチング得点は、構成する音素（少なくと
もそのワードの最初のｈ音素）のマツチング値の
和として定義される。第１７図は基本高速マツチング計算を示す。基
本高速マツチング計算は、開始時刻分布、音素に
より生成されたラベルの数または長さ、および
各々のラベルy_kに関連した置換え値p′（y_k）だけ
に関連する。所与の音素マシン中の所与のラベルの実際のラ
ベル確率をすべて、対応する置換え値と取替える
ことにより、基本高速マツチングは、遷移確率を
長さ分布確率と取替えるので、（所与の音素マシ
ンで遷移ごとに異なることがある）実際のラベル
確率、ならびに所与の時刻に所与の状態にある確
率を含むことが不要になる。ちなみに、長さ分布は精密なマツチングモデル
から決定される。詳細に説明すれば、長さ分布の
長さごとに、この手順は、各状態を個々に検査
し、状態ごとに、それぞれの遷移パスを決定する
ことが望ましい。それにより、現に検査された状
態は、 (a) 特定のラベルの長さを与えられると、 (b) 遷移に沿つた出力と無関係に、生ずることがある。各々の目的状態への特定の
長さのすべての遷移パスの確率は合計され、次い
で、すべての目的状態の合計は加算され、分布中
の所与の長さの確率を表わす。以上の手順は各々
の長さについて反復実行される。良好なマツチン
グ手順の形式に従つて、これらの計算は、マルコ
フ・モデリングの技術で知られているようにトレ
リス図に関して行われる。トレリス構造に沿つて
分枝を共有する遷移パスの場合、共通分枝ごとの
計算は一度だけ行えばよく、その結果は共通分枝
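を含む各々のパスに加えられる。
In FIG. 17, two limitations are included by way of example. First, the length of the label sequence produced by the phone is limited to 0, 1, 2, or 3, with respective probabilities l_0, l_1, l_2, and l_3. The start time is also limited: only four start times are allowed, with respective probabilities q_0, q_1, q_2, and q_3. That is, L(l_0, l_1, l_2, l_3) and Q(q_0, q_1, q_2, q_3) are assumed. Under these limitations, the end-time distribution of the subject phone is defined by the following equations, in which p_m denotes the replacement value of the m-th label of the string:

$\Phi_0 = q_0 l_0$
$\Phi_1 = q_1 l_0 + q_0 l_1 p_1$
$\Phi_2 = q_2 l_0 + q_1 l_1 p_2 + q_0 l_2 p_1 p_2$
$\Phi_3 = q_3 l_0 + q_2 l_1 p_3 + q_1 l_2 p_2 p_3 + q_0 l_3 p_1 p_2 p_3$
$\Phi_4 = q_3 l_1 p_4 + q_2 l_2 p_3 p_4 + q_1 l_3 p_2 p_3 p_4$
$\Phi_5 = q_3 l_2 p_4 p_5 + q_2 l_3 p_3 p_4 p_5$
$\Phi_6 = q_3 l_3 p_4 p_5 p_6$

Examining these equations, Φ_3 contains a term for each of the four start times. The first term is the probability that the phone starts at time t = t_3 and produces a label sequence of length zero (the phone ends as soon as it starts). The second term is the probability that the phone starts at t = t_2, that the length is one, and that label 3 is produced by the phone. The third term is the probability that the phone starts at t = t_1, that the length is two (labels 2 and 3), and that labels 2 and 3 are produced by the phone. Similarly, the fourth term is the probability that the phone starts at t = t_0, that the length is three, and that the three labels 1, 2, and 3 are produced by the phone.

Comparing the computation required by the basic fast match with that required by the detailed match shows the former to be relatively simple. Note that each p'(y) value, like each length probability, keeps the same value wherever it appears in the equations. In addition, the length and start-time limitations make the computations for the later end times simpler. For Φ_6, for example, the phone must start at time t = t_3, and all three labels 4, 5, and 6 must be produced by the phone for that end time to apply.

To generate the match value of the subject phone, the end-time probabilities along the end-time distribution are summed. If desired, the logarithm of the sum is taken:

match value $= \log_{10}(\Phi_0 + \cdots + \Phi_6)$

As noted above, the match score of a word is readily determined by summing the match values of the successive phones in the word.

The generation of the start-time distribution is now described with reference to FIG. 18. In FIG. 18(a), the word THE_1 is repeated, broken down into its component phones.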
第１８図ｂでは、ラベルのストリングが時間軸に
沿つて示されている。第１８図ｃは、最初の開始
時刻分布を示す。最初の開始時刻分布は、（沈黙
ワードを含むことがある先行ワードにおける）最
新の先行音素の終了時刻分布から引出されてい
る。第１８図ｃのラベル入力および開始時刻分布
に基づいて、音素DHの終了時刻分布Φ_DHが生成
される（第１８図ｄ）。次の音素UH１の開始時
刻分布は、前の音素終了分布が第１８図ｄの限界
値Ａを、越えた時刻を認識することにより決定さ
れる。Ａは終了時刻分布ごとに個々に決定され
る。Ａは、対象音素の終了時刻分布の値の和の関
数である。従つて、時刻ａと時刻ｂの間隔は、音
素UH１の開始時刻分布が設定される時間を表わ
す。第１８図ｅにおいて、時刻ｃと時刻ｄの間隔
は、音素DHの終了時刻分布が限界値Ａを越え、
かつ次の音素の開始時刻分布が設定される時間に
相当する。開始時刻分布の値は、例えば、限界値
Ａを越える終了時刻の和で各終了時刻値を割つて
終了時刻分布を正規化することにより得られる。基本高速マツチング音素マシン３０００は、前
記フローテイング・ポイント・システムズ社の、
APALプログラムによるアセンブラ１９０Ｌで実
現されている。また、本明細書の説明に従つて、
他のハードウエアおよびソフトウエアを用いて本
発明の特定の形式を展開することもできる。 F1e 代替高速マツチング（第１９図、第２０
図）単独で、またはできれば精密マツチングおよび
（または）言語モデルと共に使用された基本高速
マツチングは、計算所要量を大幅に少なくする。
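For reference, the basic fast match computation described for FIG. 17 can be sketched as follows. The code is an illustration under assumptions, not the APAL implementation: the function names are hypothetical, `label_values[m]` is taken to hold the replacement value of label m + 1 of the incoming string, and the start-time and length distributions are passed as plain lists.

```python
import math


def end_time_distribution(start_probs, length_probs, label_values):
    """Basic fast match: end-time distribution from start times, lengths and labels.

    start_probs  -- q_0 .. q_S, probability of starting at each start time
    length_probs -- l_0 .. l_K, probability of producing 0 .. K labels
    label_values -- replacement values of labels 1, 2, ... (0-based list)
    """
    phi = [0.0] * (len(start_probs) + len(length_probs) - 1)
    for s, q in enumerate(start_probs):
        for k, l in enumerate(length_probs):
            product = 1.0
            # A phone starting at t_s with length k produces labels s+1 .. s+k.
            for m in range(s, s + k):
                product *= label_values[m]
            phi[s + k] += q * l * product
    return phi


def match_value(phi):
    """Phone match value: log10 of the summed end-time probabilities."""
    return math.log10(sum(phi))
```

With four start times and four lengths the nested loops visit sixteen (start, length) combinations, which is why the alternative fast match described next looks for a cheaper recursion.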
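To reduce the computation further, the invention additionally simplifies the detailed match by defining a uniform label length distribution between two lengths, a minimum length L_min and a maximum length L_max. In the basic fast match, the probabilities of producing label sequences of given lengths (l_0, l_1, l_2, and so on) generally have different values. In the alternative fast match, the probability of each length is replaced by a single uniform value.

The minimum length is preferably the smallest length with nonzero probability in the original length distribution, although other lengths may be chosen if desired. The choice of maximum length is more arbitrary than that of minimum length, but the probability of any length less than the minimum or greater than the maximum is set to zero. Defining the length probability to exist only between the minimum and maximum lengths gives a uniform pseudo-distribution. In one approach, the uniform probability is set to the average probability over the pseudo-distribution. Alternatively, the uniform probability is set to the maximum of the length probabilities, which is then replaced by the uniform value.

The effect of making all the length probabilities equal is readily seen from the end-time distribution equations of the basic fast match above: the length probability can be factored out as a constant.

With L_min set to zero and all length probabilities replaced by a single constant, the end-time distribution can be written

$\Theta_m = \Phi_m / l = q_m + \Theta_{m-1}\,p_m$  (23)

where l is the single uniform replacement value and p_m preferably corresponds to the replacement value of the given label produced in the given phone at time m.

For this Θ_m, the match value is defined as

match value $= \log_{10}(\Theta_0 + \Theta_1 + \cdots + \Theta_m) + \log_{10}(l)$  (24)

Comparing the basic and alternative fast matches, the number of additions and multiplications required is greatly reduced with the alternative fast match phone machine. With L_min = 0, the basic fast match required forty multiplications and twenty additions, because the length probabilities had to be considered, whereas the alternative fast match determines Θ_m recursively and needs only one multiplication and one addition for each successive Θ_m.

FIGS. 19 and 20 show the simplification of the computation in detail. FIG. 19(a) shows a phone machine embodiment 3100 corresponding to a minimum length L_min = 0. The maximum length is assumed to be infinite so that the length distribution is uniform. FIG. 19(b) shows the trellis that results from phone machine 3100. Assuming that start times after q_n lie outside the start-time distribution, each successive Θ_m with m < n requires one addition and one multiplication.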
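The recursion of the alternative fast match with a minimum length of zero can be sketched in a few lines. As before, this is an illustration under assumptions: the function names are hypothetical, `label_values[m - 1]` is taken to be p_m, start probabilities beyond the end of `start_probs` are treated as zero, and the uniform length probability is passed in explicitly.

```python
import math


def alternate_end_times(start_probs, label_values):
    """Theta_m = q_m + Theta_(m-1) * p_m, for L_min = 0 and unbounded maximum length."""
    theta = []
    for m in range(len(label_values) + 1):
        # Start times past the end of the distribution contribute nothing.
        q_m = start_probs[m] if m < len(start_probs) else 0.0
        # One multiplication carries the previous end time forward by one label.
        carried = theta[m - 1] * label_values[m - 1] if m > 0 else 0.0
        theta.append(q_m + carried)
    return theta


def alternate_match_value(theta, uniform_length_prob):
    """Match value: log10 of the summed Theta values plus log10 of the uniform l."""
    return math.log10(sum(theta)) + math.log10(uniform_length_prob)
```

Each Θ_m inside the start-time distribution costs one multiplication and one addition, and each later Θ_m costs a single multiplication, which is the saving claimed over the basic fast match.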
それ以後の終了時刻を決定する場合は、１回の乗
算だけでよく、加算は不要である。第２０図ａは、最小長L_nio＝４の場合の特定の
音素マシン３２００の実施例を示し、第２０図ｂ
は、それに対応するトレリス図を示す。L_nio＝４
であるから、第２０図ｂのトレリス図は、記号
Ｕ、Ｖ、ＷおよびＺのパスに沿つて０確率を生じ
る。Θ₄とΘ_oの間の終了時刻の場合、４回の乗算
と１回の加算が必要である。ｎ＋４よりも大きい
終了時刻の場合は、１回の乗算数だけでよく、加
算は不要である。この実施例は、前記FPS社の
190L上のAPALコードで実現されている。所望の追加状態を第１９図または第２０図の実
施例に付加することができる。 F1f 最初のＪレベルに基づいたマツチング（第
２０図）基本高速マツチングおよび代替高速マツチング
を更に改良するため、音素マシンに入るストリン
グの最初のＪラベルのマツチングだけを考慮する
ようにする。ラベルが音響チヤンネルの音響プロ
セツサにより、毎センチ秒ごとに１ラベルの割合
で生成されるものと仮定すると、Ｊの妥当な値は
100である。換言すれば、１秒にオーダの音声に
対応するラベルが供給され、音素と音素マシンに
入るラベルとのマツチングを確定する。検査する
ラベル数を限定することにより、２つの利点が得
られる。第１は、復号遅延の減少であり、第２
は、短かいワードの得点と長いワードの得点を比
較する問題を十分に回避できることである。もち
ろん、Ｊの長さは希望により変更することができ
る。検査するラベル数を限定することによる効果
は、第２０図ｂのトレリス図により観察すること
ができる。本発明による改良を伴なわない場合、
高速マツチング得点は、この図面の最下部のロー
（row）に沿つたΘmの確率の和である。すなわ
ち、ｔ＝t₀（L_nio＝０の場合）またはｔ＝t₄（L_nio＝
４の場合）で開始する各時刻に状態S₄である確率
は、Θmとして確定され、次いで、すべてのΘm
は合計される。L_nio＝４の場合、t₄以前の任意の
時刻に状態S₄である確率は０である。前記改良に
より、Θmの和をとることは、時刻Ｊで終了す
る。第２０図ｂで、時刻Ｊは時刻t_o+2に相当す
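る。
Terminating the examination at J labels, that is, not examining beyond time J, means that two probability sums enter the match score. First, as described above, there is a row calculation along the bottom row of the trellis, but only up to time J − 1. The probabilities of being in state S_4 at each time up to J − 1 are summed to give a row score. Second, there is a column score, the sum of the probabilities that the phone is in each of the states S_0 to S_4 at time J:

column score $= \sum_{f=0}^{4} \Pr(S_f, J)$  (25)

The match score for the phone is obtained by adding the row score and the column score and taking the logarithm of the sum. To continue the fast match for the next phone, the values along the bottom row, preferably including time J, are used to derive the start-time distribution of the next phone.

Ｊ回の連続音素の各々のマツチング得点を確定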
した後、前述のように、全音素の合計はその音素
のすべてのマツチング得点の和である。前述の基本高速マツチングおよび代替高速マツ
チングの実施例で終了時刻の確率を生成する方法
を調べると、カラム得点の確定は、高速マツチン
グ計算に容易に適合しないことが分る。検査する
ラベル数を限定するための改良を前記高速マツチ
ングおよび代替マツチングによりよく適応させる
ため、本発明は、カラム得点を追加ロー得点と置
換えることを可能にする。すなわち、（第２０図
ｂで）時刻ＪおよびＪ＋Ｋの間で状態S₄である音
素の追加ロー得点が確定される。ただし、Ｋは任
意の音素マシンにおける最大状態数である。それ
ゆえ、任意の音素マシンが10の状態を有する場
合、本発明の改良により、そのトレリス図の最下
部のローに沿つて10の終了時刻が付加され、その
各々について確率が決定される。時刻Ｊ＋Ｋまで
の最下位のローに沿つたすべての確率（時刻Ｊ＋
Ｋでの確率を含む）が加算され、所与の音素のマ
ツチング得点を生成する。前述のように、連続す
る音素のマツチング値を合計し、ワードのマツチ
ング得点を得る。この実施例は前述のFPS社の190L上のAPAL
コードで実現されているが、このシステムの他の
部分の場合のように、他のハードウエアで他のコ
ードにより実現することもできる。 F1g 音素木構造および高速マツチング実施例
（第２１図）基本高速マツチングまたは代替高速マツチング
を（最大ラベル制限がある場合またはない場合
に）使用することにより、音素マツチング値を決
定する際に必要な計算時間が大幅に少なくなる。
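In addition, the computational savings remain large even when the detailed match is performed on the words in the list produced by the fast match.

Once the phone match values have been determined, they are compared along the branches of a tree structure 4100, as shown in FIG. 21, to determine which paths of phones are most probable. In FIG. 21, for the spoken word "the", the phone match values for the phones DH and UH1 (which emanate from point 4102 onto branch 4104) should sum to a much higher value than those of the various phone sequences branching from the phone MX. Note that the phone match value of the first MX phone is computed only once and is then used for every baseform that extends from it (see branches 4104 and 4106). Furthermore, if the total score computed along a first sequence of branches is found to be much lower than a threshold, or much lower than the total scores of other sequences of branches, all the baseforms extending from that first sequence may be eliminated as candidate words at the same time. For example, the baseforms associated with branches 4108 to 4118 are discarded together once it is determined that MX is not a likely path.

With the fast match embodiments and the tree structure, an ordered list of candidate words is generated with a large saving in computation.

As for storage requirements, the phone tree structure, the phone statistics, and the tail probabilities are to be stored. For the tree structure, there are 25000 arcs and four data words characterizing each arc. The first data word is an index to the successor arcs, or phones. The second data word gives the number of successor phones along the branch. The third data word indicates the node of the tree at which the arc is located. The fourth data word identifies the current phone. The tree structure therefore requires 25000 × 4 storage locations. In the fast match, there are 100 distinct phones and 200 distinct fenemes.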
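The four data words per arc can be pictured as a small record. The sketch below is illustrative only: the field names, the integer phone identifiers, and the assumption that an arc's successors are stored contiguously starting at its successor index are not stated in the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Arc:
    first_successor: int   # data word 1: index of the successor arcs (phones)
    successor_count: int   # data word 2: number of successor phones on the branch
    node: int              # data word 3: tree node at which the arc is located
    phone: int             # data word 4: identifier of the current phone

def successors(arcs, index):
    """Arcs that extend the arc at `index` (assumes contiguous storage)."""
    arc = arcs[index]
    return arcs[arc.first_successor : arc.first_successor + arc.successor_count]

# A tiny hypothetical tree: one root arc with two successor arcs.
arcs = [
    Arc(first_successor=1, successor_count=2, node=0, phone=7),
    Arc(first_successor=3, successor_count=0, node=1, phone=12),
    Arc(first_successor=3, successor_count=0, node=1, phone=31),
]

NUM_ARCS = 25_000
WORDS_PER_ARC = 4
TREE_STORAGE_WORDS = NUM_ARCS * WORDS_PER_ARC  # integer storage for the tree
```

With this layout, discarding a branch amounts to never visiting the successors of its first arc, which is what allows a whole group of baseforms to be dropped at once.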
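Because a feneme has a single probability of being produced anywhere in a phone, storage for 100 × 200 statistical probabilities is required. For the tail probabilities, 200 × 200 storage locations are required. Storage for 100K integers and 60K floating-point values is therefore sufficient for the fast match.

F1h Language model (FIGS. 2 and 22)

As noted above, the probability of selecting the correct word can be increased by including a language model that stores information, such as tri-grams, about words in context. Language models are described in the article cited above.

The language model 1010 (FIG. 2) preferably has a unique character. Specifically, a modified tri-gram method is used. According to the invention, a sample text is examined to determine the likelihood of each ordered triplet of words, each ordered pair of words, and each single word in the vocabulary. A list of the most likely word triplets and a list of the most likely word pairs are formed. In addition, the likelihood of a triplet not being in the triplet list and the likelihood of a pair not being in the pair list are each determined.

According to the language model, when a subject word follows two words, it is determined whether the subject word and the two preceding words are in the triplet list. If they are, the stored probability assigned to that triplet is used. If the subject word and the two preceding words are not in the triplet list, it is determined whether the subject word and the immediately preceding word are in the pair list. If they are, the probability of the pair is multiplied by the probability of a triplet not being in the triplet list, and the product is assigned to the subject word. If neither the triplet nor the pair containing the subject word is in its respective list, the probability of the subject word alone is multiplied by the probability of a triplet not being in the triplet list and by the probability of a pair not being in the pair list, and the product is assigned to the subject word.

FIG. 22 shows a flowchart 5000 in which words are selected from the vocabulary by means of the fast match, the detailed match, and the language model. At block 5002, the words in the vocabulary are defined, and at block 5004 each word is represented by a phonetic Markov model baseform. At block 5006, the phones of the baseforms are arranged in a tree structure (as shown in FIG. 21). The Markov models for the various words are trained at block 5008 by reading sample text (as described below). The probabilities associated with each Markov model word baseform (for example, the transition probabilities and the label probabilities at the various transitions) are retained for the detailed match, which is performed later. At block 5010, the probability of a given label in a given Markov model phone machine is replaced by a replacement value that is, where possible, an overestimate, so that the basic fast match can be performed at block 5012. The label string from the acoustic processor (described above) is supplied at block 5014 (not shown) as input to the approximate match of block 5012. If, at block 5016 (not shown), the basic fast match is to be enhanced, a uniform length distribution approximation is applied at block 5018. If, at block 5020, further enhancement is desired, matching is limited to the first J labels at block 5022. After the desired approximate fast match, a list of candidate words in order of probability is supplied to the language model at block 5024 (not shown). Then, at block 5026 (not shown), the detailed match is performed on the more likely words in the vocabulary, as determined by the approximate match and the language model. The words from the detailed match of block 5026 (not shown) can be presented to the language model at block 5028 (not shown), and the most likely word selected at block 5030 (not shown).

F2 Construction of phonetic and fenemic Markov model word baseforms

F2a Construction of phonetic baseforms (FIG. 4, Tables 1 and 2)

One type of Markov model phone machine that can be used in forming baseforms is based on phonetics. That is, each phone machine corresponds to a given phonetic unit.

For a given word, there is a sequence of phonetic sounds, each having a respective phone machine corresponding to it. Each phone machine includes a number of states and transitions between them, some of which can produce a feneme output and some of which (referred to as null transitions) cannot.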
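As noted above, the statistics for each phone machine include (a) the probability of a given transition occurring and (b) the probability of a particular feneme being produced at a given transition. Preferably, at each non-null transition there is a probability associated with each feneme.

The feneme alphabet shown in Table 1 contains about 200 fenemes. FIG. 4 shows a phone machine used in forming phonetic baseforms. A sequence of such phone machines is provided for each word. The statistics, or probabilities, are entered into the phone machines during a training phase in which known words are uttered. The transition probabilities and feneme probabilities of the various phonetic phone machines are determined during training by recording the feneme strings generated when known phonetic sounds are uttered at least once and applying the well-known forward-backward algorithm.

A sample of the statistics for one phone, identified as phone DH, is given in Table 2. As an approximation, the label output probability distributions for transitions tr1, tr2, and tr8 of the phone machine of FIG. 4 are represented by a single distribution; those for transitions tr3, tr4, tr5, and tr9 are represented by a single distribution; and those for transitions tr6, tr7, and tr10 are represented by a single distribution. This is shown in Table 2 by the assignment of label 4, 5, or 6 to the arcs (that is, transitions) in the respective columns. Table 2 gives the probability of each transition and the probability of each label (that is, feneme) being produced at the beginning, middle, or end of the phone DH. For the DH phone, for example, the probability of a transition from state S₁ to state S₂ is computed as 0.07243, and the probability of a transition from state S₁ to state S₄ is 0.92757. (Since these are the only two possible transitions from the initial state, their probabilities sum to 1.)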
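The tying of output distributions and the sum-to-one constraint can be written down directly. In the sketch below, the mapping of labels 4, 5, and 6 to the beginning, middle, and end of the phone follows the order in which the passage lists them and should be read as an assumption; the dictionary layout and the helper name are likewise illustrative.

```python
# Transition -> tied label output distribution (label 4, 5, or 6 in Table 2).
TIED_DISTRIBUTION = {
    "tr1": 4, "tr2": 4, "tr8": 4,            # assumed: beginning of the phone
    "tr3": 5, "tr4": 5, "tr5": 5, "tr9": 5,  # assumed: middle of the phone
    "tr6": 6, "tr7": 6, "tr10": 6,           # end of the phone
}

# Transition probabilities out of state S1 for phone DH, as quoted in the text.
S1_TRANSITIONS = {"S2": 0.07243, "S4": 0.92757}

def is_stochastic(probs, tol=1e-9):
    """True if the probabilities of all outgoing transitions sum to 1."""
    return abs(sum(probs.values()) - 1.0) <= tol

dh_ok = is_stochastic(S1_TRANSITIONS)
```

Tying ten non-null transitions to three distributions means only three feneme distributions per phone need to be estimated and stored.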
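As regards label output probabilities, the DH phone has a probability of 0.091 of producing the feneme AE13 (see Table 1) in the last portion of the phone, that is, in the column of Table 2 labelled 6. Table 2 also shows a count associated with each node (that is, state). The node count indicates the number of times during training that the phone was in the corresponding state. Statistics like those of Table 2 exist for each phone machine.

Arranging phonetic phone machines into a word baseform sequence is generally done by a phonetician and is normally not automatic.

Phonetic baseforms have been used in the detailed match and in the fast approximate acoustic match with some success. Because they depend on the judgment of the phonetician and are not constructed automatically, however, phonetic baseforms are sometimes inaccurate.

F2b Construction of fenemic baseforms (FIGS. 23 and 24)

An alternative to the phonetic baseform is the fenemic baseform. Instead of each phone machine in the baseform corresponding to a phonetic sound, each phone machine corresponds to a respective feneme in the feneme alphabet. A simple fenemic phone machine includes two states, S₁ and S₂, and a plurality of transitions, each with a respective probability (see FIG. 23). One transition is a null transition between the two states, at which no feneme can be produced. A first non-null transition between the two states has a respective probability for each feneme being produced there. A second non-null transition is a self-loop at state S₁, and each feneme has a probability of being produced there. A simple fenemic phone machine can therefore produce zero, one, or more fenemes. FIG. 24 shows a trellis diagram based on fenemic phones.

In constructing fenemic Markov model phone machines, it is preferable to base the phone machines on multiple utterances. By being formed from multiple utterances, the phone machines can capture variations in the pronunciation of a given word. One method of capturing such variations is disclosed in U.S. patent application Ser. No. 738,933 (filed May 29, 1985).

In constructing word baseforms according to that application, a fenemic baseform is constructed for each word segment in the vocabulary (for example, each word, or a predefined syllable, or a part thereof) by the following method. Specifically, the method includes the following steps.

(a) Transform each of multiple utterances of the word segment into a respective feneme string.
(b) Define a set of fenemic Markov model phone machines.
(c) Determine the best single phone machine P₁ for producing the multiple feneme strings.
(d) Determine the best two-phone baseform, of the form P₁P₂ or P₂P₁, for producing the multiple feneme strings.
(e) Align the best two-phone baseform against each feneme string.
(f) Split each feneme string into a left portion and a right portion, the left portion corresponding to the first phone machine of the two-phone baseform and the right portion corresponding to the second phone machine.
(g) Identify each left portion as a left substring and each right portion as a right substring.
(h) Process the set of left substrings in the same way as the set of feneme strings corresponding to the multiple utterances, with further splitting of a substring being inhibited when its single-phone baseform has a higher probability of producing the substring than the best two-phone baseform.
(i) Process the set of right substrings in the same way as the set of feneme strings corresponding to the multiple utterances, with further splitting of a substring being inhibited when its single-phone baseform has a higher probability of producing the substring than the best two-phone baseform.
(j) Concatenate the unsplit single phones in an order corresponding to the order of the feneme substrings to which they correspond.

In a particular embodiment, the method further includes the following steps.

(k) Align the concatenated baseform against each of the feneme strings and identify, for each phone in the concatenated baseform, the substring in each feneme string that corresponds to it; the substrings corresponding to a given phone form a set of common substrings.
(l) For each set of common substrings, determine the phone machine with the highest joint probability of producing the common substrings.
(m) For each set of common substrings, replace the phone in the concatenated baseform with the phone determined to have the highest joint probability.

The baseform that results from the phone replacement is a refined baseform. Repeating steps (k) to (m) until no phone is replaced yields a further refined baseform.

There are other methods of constructing a fenemic baseform for a given word segment (for example, a word). For example, a baseform can be derived from the output of the acoustic processor in response to a single utterance of the given word segment.

However it is constructed, a fenemic baseform generally has the advantage of being more detailed and more accurate than a phonetic baseform of the same word. Moreover, whereas phonetic baseforms are normally constructed manually by a phonetician, fenemic baseforms are generally constructed automatically.

Fenemic baseforms, however, are typically ten times as long as phonetic baseforms, which is a considerable computational disadvantage.

In its most basic form, the present invention relates to forming a composite baseform for a word segment (for example, a word) from a previously constructed fenemic baseform.

With fenemic baseforms, the problem of reducing the baseform to the length of a phonetic baseform, while retaining the accuracy and the ability to construct baseforms automatically, has become an important one. The present invention addresses this problem and makes possible the automatic generation of simple phonetic-type Markov model phone machine baseforms that represent the words in a vocabulary. Phonetic-type baseforms are useful in speech recognition and can readily be used, for example, in the fast approximate match procedure and the detailed match procedure described above.

F3 Generation of an alphabet of composite fenemic phones (FIG. 4)

In deriving an alphabet of composite phones, it is assumed that there is a predefined set of N fenemic phones that have been trained with the forward-backward algorithm. Given the set of N fenemic phones, the alphabet of composite phones is derived by performing the following steps.

(a) Select a pair of phones from the current set of phones.
(b) Determine the entropy of each phone in the selected pair. The entropy of a phone is defined as follows, the sum being taken over all fenemes (i = 1 to 200):

$$\text{entropy} = -\sum_{i=1}^{200} p_i \log_2 p_i$$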
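Each p_i is the probability of the particular phone producing feneme i, there being 200 distinct fenemes.
(c) Determine the entropy of the composite phone (or corresponding cluster) that results from merging the two phones of the selected pair into a single phone. Each output probability associated with the composite phone is the average of the output probabilities of the two paired phones.
(d) Determine the loss of mutual information between phones and fenemes, based on the entropy measures of the composite phone and of each merged phone, from the following expression:

$$\text{loss} = E_{\text{composite}} - \frac{E_{\text{phone 1}} + E_{\text{phone 2}}}{2}$$

The greater the loss of mutual information, the less well the composite phone represents the two phones of the pair.
(e) Repeat steps (a) to (d) for each pair of fenemic phones.
(f) Select the composite phone that represents the smallest loss over the selected pairs of merged phones, and replace the two merged phones with that composite phone, thereby reducing the number of phones in the set by one.
(g) Repeat steps (a) to (f) until the number of phones in the set is n. Each remaining phone is a composite phone.

The typical 200 phones in the set of fenemic phones are preferably reduced to about 50 unique composite phones in the composite phone alphabet.

Each composite phone is preferably represented by a Markov model with 7 states and 13 transitions, as shown in FIG. 4. Each transition has a transition probability corresponding to the likelihood of that transition occurring. Each non-null transition (tr1 to tr10) preferably has a plurality of output probabilities associated with it, each output probability representing the likelihood of a given feneme being produced at the particular non-null transition.

The established 50-element set of composite phones is preferably used in constructing simple baseforms, as described in section F4 below. The number and makeup of the composite phones in the composite phone alphabet can be chosen to resemble a selected phoneme alphabet known in the phonetic art.

F4 Automatic formation of short phonetic-type Markov model word baseforms (FIGS. 1A, 1B, 21, 23, and 25 to 30)

According to the invention, simple Markov model baseforms are formed as follows. It is first assumed that a fenemic baseform has been constructed for each word, each baseform being formed by one of the methods described above. The fenemic baseforms are preferably constructed from multiple utterances and stored in memory. A fenemic baseform for a word is typically on the order of 60 to 100 fenemes long.

In FIG. 25, the subject word is represented by f feneme strings, each string corresponding to a respective utterance of the word. Each of the blocks 1 to 9 (and so on) represents a feneme (that is, a label) generated by the acoustic processor. In this embodiment, the acoustic processor generates one label per centisecond. The fenemes in each string are likely to differ from one pronunciation of the subject word to the next.

According to the invention, the fenemic baseform of the subject word is assumed to be known. FIG. 26 shows a fenemic baseform. The sequence of fenemic phones corresponding to the subject word (the fenemic baseform) has been formed in advance. The fenemic baseform is preferably derived by the method, described above, of constructing fenemic baseforms from multiple utterances. For each of the f utterances of the subject word, there is a feneme string, FS₁ to FS_f, defined in terms of fenemes. The fenemes are selected from an alphabet, or set, of 200 fenemes. Although a fenemic phone can produce zero, one, or more fenemes when the simple phone model of FIG. 23 is used, a fenemic phone generally produces one feneme on average. FIG. 26 shows the fenemic baseform in terms of its fenemic phones, each unique fenemic phone being denoted by a subscripted P.

According to the invention, each composite phone corresponds to at least one of the N phones in the fenemic phone alphabet, and each fenemic phone corresponds to only one composite phone. The number n of composite phones in the set of composite phones is smaller than the number N of fenemic phones in the set of fenemic phones. The new alphabet preferably has considerably fewer elements than the original alphabet, for example 40 or 50 elements compared with the order of 200 elements in the fenemic alphabet. One method of determining the phones of the alphabet is outlined in the preceding section.

Given an alphabet of n composite phones, a fenemic baseform for each word in the vocabulary, and multiple utterances of the subject word, a simple baseform for the subject word is formed by the following method. First, each utterance is represented by a respective feneme string FS₁ to FS_f, as shown in FIG. 25.

Each feneme string is then aligned against the fenemic baseform of the subject word shown in FIG. 26. The result of the alignment is shown in FIG. 27, which shows the fenemic baseform aligned against the fenemes in each string. The alignment is preferably obtained by the Viterbi alignment technique discussed in various articles, such as F. Jelinek, "Continuous Speech Recognition by Statistical Methods", Proceedings of the IEEE, Vol. 64, April 1976, pp. 532-556.

The subscript of each phone in FIG. 27 identifies the phone from among the 200 phones of the fenemic phone set. Each phone string is represented by the same sequence of phones, namely P₁P₁P₃P₄P₂...P₄.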
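The alignment of a feneme string against a fenemic baseform can be sketched as a small Viterbi-style dynamic program. The version below is a simplification under stated assumptions: every phone uses the two-state model with one self-loop, one forward transition, and one null transition; the loop and forward transitions share a single output distribution; and the transition and output probabilities are invented for the example. None of the names come from the patent.

```python
import math

NEG_INF = float("-inf")

def safe_log(p):
    return math.log(p) if p > 0 else NEG_INF

def segment_logprob(phone, fenemes, emit, p_loop, p_fwd, p_null):
    """Log probability that one two-state phone produced `fenemes`."""
    if not fenemes:
        return safe_log(p_null)              # null transition: no output
    lp = safe_log(p_fwd)                     # last feneme leaves the phone
    if len(fenemes) > 1:
        lp += (len(fenemes) - 1) * safe_log(p_loop)
    for f in fenemes:
        lp += safe_log(emit[phone].get(f, 0.0))
    return lp

def viterbi_align(baseform, string, emit, p_loop=0.3, p_fwd=0.6, p_null=0.1):
    """Return, for each phone of `baseform`, the substring aligned to it."""
    n, t_len = len(baseform), len(string)
    best = [[NEG_INF] * (t_len + 1) for _ in range(n + 1)]
    back = [[0] * (t_len + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(1, n + 1):
        for t in range(t_len + 1):
            for start in range(t + 1):
                if best[i - 1][start] == NEG_INF:
                    continue
                cand = best[i - 1][start] + segment_logprob(
                    baseform[i - 1], string[start:t], emit, p_loop, p_fwd, p_null)
                if cand > best[i][t]:
                    best[i][t], back[i][t] = cand, start
    if best[n][t_len] == NEG_INF:
        raise ValueError("string cannot be produced by this baseform")
    segments, t = [], t_len
    for i in range(n, 0, -1):                # trace back segment boundaries
        start = back[i][t]
        segments.append(string[start:t])
        t = start
    return segments[::-1]

# Hypothetical output distributions for two fenemic phones.
emit = {"P1": {"f1": 0.9, "f2": 0.1}, "P2": {"f2": 0.9, "f1": 0.1}}
alignment = viterbi_align(["P1", "P2"], ["f1", "f1", "f2"], emit)
```

Running the same alignment on each of the f strings gives, for every phone of the baseform, one substring per utterance, which is the information the later merging steps work from.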
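The alignment of the fenemic phones against the fenemes, however, differs among the f strings. For a typical word lasting about one second, there are about 100 fenemes and about 100 fenemic phones.

FIG. 28 shows the result of replacing the fenemic phones with composite phones. Each phone of FIG. 27 has been replaced by its corresponding composite phone, each of which is denoted CP_i. The subscript i identifies the composite phone within the set of composite phones. Phones P₁ and P₂ have been replaced by CP₁₀, phone P₃ by CP₁₁, and phone P₄ by CP₁₃. The number of composite phones in each string of FIG. 28 equals the number of phones for each of the strings FS₁ to FS_f in FIG. 27.

Each composite phone is preferably similar in structure to the phonetic-type phones described above, with seven states and 13 transitions. In FIG. 28, therefore, the simple two-state fenemic phones are replaced by seven-state composite phones.

Next, in FIG. 29, the same two composite phones aligned against each feneme string (namely the leftmost composite phone and the one to its right) are replaced by a single composite phone. Specifically, the two adjacent composite phones identified as CP₁₀ (in FIG. 28) are replaced by the single composite phone CP₁₀. Similarly, in FIG. 30, two adjacent composite phones are again replaced by a single composite phone: the adjacent composite phones CP₁₁ and CP₁₃ (of FIG. 29) are replaced by the single composite phone CP₈.

FIGS. 29 and 30 show that each combination of successive composite phones reduces the number of composite phones by one. As explained below, a baseform of a specified length can be obtained by continuing the process of merging two composite phones.

The process of repeatedly merging two composite phones is based on "nearness". For a given sequence of composite phones, each pair of adjacent composite phones is examined to determine the adverse effect that would result from combining the two phones. The pair whose merger would have the smallest adverse effect is regarded as the nearest and is the candidate for the next merger. There are various ways of measuring the adverse effect of a merger; in the present invention, "nearness" is preferably defined as MAX(a/A, b/B). A and B are selectable thresholds that limit the length of the computed baseform. The quantities a and b are defined as follows.

For two composite phones CP_n and CP_o that form an adjacent pair in the sequence, there are feneme sequences F_n and F_o aligned against the two composite phones respectively. The composite phone CP_no is defined as the composite phone that maximizes the joint probability of producing the fenemes F_no, where F_no is the concatenation of the two adjacent feneme sequences F_n and F_o. CP_no is the phone that would replace CP_n and CP_o if those two adjacent composite phones were nearest and were to be merged. L_n, L_o, and L_no denote the log probabilities of F_n, F_o, and F_no given the composite phones CP_n, CP_o, and CP_no respectively. The quantity a is given by:

$$a = L_n + L_o - L_{no}$$

Thus a is a measure of the decrease in probability caused by merging the composite phones CP_n and CP_o. The greater the adverse effect of merging the two composite phones, the larger the value of a.

Next, b is determined by the method known as the paired t-test. The general paired t-test procedure is described in various texts, for example W. Mendenhall, "Introduction to Probability and Statistics", 5th Edition, Duxbury Press, Massachusetts, 1979, pp. 294-297. Following the paired t-test, T₁ is the t statistic obtained by performing a paired t-test on the difference between the log probabilities of the fenemes in the feneme sequence F_n when those fenemes are produced by (i) the composite phone CP_n and (ii) the composite phone CP_no. Similarly, T₂ is the t statistic for the fenemes in the feneme sequence F_o when those fenemes are produced by (i) the composite phone CP_o and (ii) the composite phone CP_no. For example, for each feneme in the feneme sequence F_o, there is a probability of the composite phone CP_o producing that feneme and a probability of the composite phone CP_no producing that feneme. These two probabilities can be compared as a pair. The process is repeated for each feneme sequence, there being a separate sequence associated with each utterance of the word.

Z₁ and Z₂ denote the values taken by T₁ and T₂ after their t distributions have been transformed to the N(0, 1) distribution. The quantity b is then given by:

$$b = \max(Z_1, Z_2)$$

The quantity b indicates how (statistically) significant the difference in feneme probabilities is when the composite phones CP_n and CP_o are merged and replaced by CP_no.

By taking the maximum of a/A and b/B for each pair of composite phones, the pair whose merger has the smallest adverse effect is determined. That nearest pair can then be merged and replaced by a single composite phone.

The examination of all pairs of composite phones and the merging of the two nearest adjacent composite phones continue until the number of composite phones in the baseform reaches a specified limit, or until the distance between the nearest pair of phones exceeds 1, in which case a, b, or both exceed their respective thresholds.

When the composite phones are chosen to reflect a phonetic alphabet, the resulting sequence of composite phones not only provides a simple baseform that is acceptable for speech recognition but also provides an automatically generated description of the pronunciation of the word. The latter result can also be used in the field of phonetics.

FIGS. 1A and 1B show a basic flowchart 8000 of the method of the invention. According to the flowchart, the subject word is uttered once or, preferably, several times. At block 8002, a feneme string is generated for each utterance. At block 8004, based on the string or strings generated for the subject word, the fenemic word baseform with the highest (joint) probability of producing the string or strings is selected. The baseforms for the words in the vocabulary have been formed in advance at block 8006. Then, at block 8008, each string of the subject word is aligned against the selected baseform. In this way, each fenemic phone of the baseform is aligned against the corresponding fenemes in each string (of the subject word). At block 8009, each fenemic phone is then replaced by its corresponding composite phone. The following steps are then performed.

(a) Select a pair of adjacent composite phones (block 8010).
(b) For each feneme string, determine the feneme substring aligned against the selected pair of adjacent phones (block 8012).
(c) From the alphabet of composite phones, which has been formed with fewer elements than the alphabet of fenemic phones (block 8013), determine the composite phone with the highest joint probability of producing the respective determined substrings of all the feneme strings (block 8014).
(d) Repeat steps (a) to (c) for all pairs of adjacent composite phones (block 8016), so that a single composite phone is determined for each pair of adjacent composite phones.
(e) Measure the adverse effect of replacing each phone pair in the baseform with the single composite phone determined for it (block 8020).
(f) If the smallest adverse effect does not exceed a predetermined threshold (block 8024), replace the phone pair that yields the smallest adverse effect with its corresponding composite phone (block 8022).
(g) After the phone pair has been replaced with its corresponding composite phone, provide the new baseform of phones (block 8026).
(h) If the number of phones is greater than a predetermined desired number (block 8028), repeat steps (a) to (g) to obtain a new baseform.

Step (h) is repeated as many times as necessary to shorten the baseform to the desired length (block 8030), or until the adverse effect of replacing a phone pair with a composite phone exceeds the threshold (block 8024).

The procedure is repeated for each word in the vocabulary until all the words have been processed (block 8040). Hence, for each word in the vocabulary there is either a baseform of composite phones shortened to the desired length, or a baseform that cannot be shortened further without exceeding the adverse-effect threshold.

The baseforms shortened by this procedure are phonetic-type baseforms and can be used in performing the detailed acoustic match and/or the fast approximate acoustic match as described above. That is, the tree structure 4100 of FIG. 21 (which represents phones stored in memory that form the various phone sequences) can contain baseforms shortened according to the invention. In that case, each phone is a composite phone.

Accordingly, it is intended that each word baseform of FIG. 21 represent a sequence of composite phones in shortened-baseform form, each composite phone in the sequence being represented by a phone machine used in matching against the incoming labels, or fenemes.

F5 Comparison results

The results of testing the performance of speech recognition systems using the various types of baseforms are given below.

First, in a system with a vocabulary of 62 keyboard characters and 600 arbitrary utterances of them, the number of errors with manually determined phonetic baseforms was 28. The number of errors with fenemic baseforms constructed after a single utterance was 14. The number of errors with fenemic baseforms constructed from several utterances was 5.

Second, a system with a vocabulary of 2000 common words of office correspondence and a script of 2070 arbitrary utterances produced 108 errors with conventional phonetic baseforms, compared with 42 errors with fenemic baseforms constructed from several utterances.

Finally, using the "shortened" baseforms described in section F6, the following results were obtained.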
With a vocabulary of 2000 typical office words and a script of 2070 arbitrary words, the phonetic baseforms had 8 errors attributable to the fast approximate acoustic match described above, and the shortened baseforms had 2.

These results show that fenemic baseforms based on multiple utterances are more accurate than standard phonetic baseforms and than fenemic baseforms generated from a single utterance. They also show that the shortened baseforms are a considerable improvement over conventional phonetic-type baseforms. On these results, the preferred embodiment is a shortened baseform based on multiple utterances.

F6 Alternative embodiments

The set of composite phones may, for example, be determined by techniques other than the one described above. Also, a nearness measure such as MAX(a/A, b/B) may be recast as a/A, or b/B, or some other measure that gives comparable results.

Furthermore, the method of the invention for constructing shortened baseforms is preferably carried out on multiple utterances of a word, using the feneme string corresponding to each utterance. The invention can, however, also be used to construct a shortened baseform from a single utterance of the subject word. With a single utterance, the composite phone that gives the highest probability of producing the fenemes in the one substring is considered in replacing an adjacent phone pair. Since there is only one string, there is no joint probability to consider.

Because pronunciation varies, shortened baseforms based on multiple utterances are more accurate and are preferred.

The present invention will be described in the following order.

A. Industrial field of application
B. Summary of the disclosure
C. Prior art
D. Problems to be solved by the invention
E. Means for solving the problems
F. Embodiment
F1 Speech recognition system environment
F1a General description (FIGS. 2 to 7)
F1b Auditory model and its implementation in the acoustic processor of the speech recognition system (FIGS. 8 to 14)
F1c Detailed match (FIGS. 4 and 15)
F1d Basic fast match (FIGS. 16 to 18)
F1e Alternative fast match (FIGS. 19 and 20)
F1f Matching based on the first J labels (FIG. 20)
F1g Phone tree structure and fast match embodiments (FIG. 21)
F1h Language model (FIGS. 2 and 22)
F2 Construction of phonetic and fenemic Markov model word baseforms
F2a Construction of phonetic baseforms (FIG. 4, Table 2)
F2b Construction of fenemic baseforms (FIGS. 23 and 24)
F3 Generation of an alphabet of composite fenemic phones (FIG. 4)
F4 Automatic formation of short phonetic-type Markov model word baseforms (FIGS. 1A, 1B, 21, 23, and 25 to 30)
F5 Comparison results
F6 Alternative embodiments
F7 Tables
G. Effect of the invention

A. Industrial field of application

The present invention relates to the field of generating acoustic models for words in a vocabulary.
Acoustic models are used primarily for speech recognition, but they can also be incorporated into phonetic applications that seek to describe how words are pronounced.

B. Summary of the disclosure

The present invention discloses the automatic construction of a phonetic-type baseform that is shorter than the fenemic baseform of a given word. (A feneme is a micro-phoneme; the name comes from the initials of "front end", from which fenemes are usually obtained.) Specifically, in a system that (i) defines each word in the vocabulary by a fenemic baseform of fenemic phones, (ii) defines an alphabet of composite phones, each corresponding to at least one fenemic phone, and (iii) generates a string of fenemes in response to speech input, the invention makes it possible to convert a word baseform of fenemic phones into a shortened word baseform of composite phones by (a) replacing each fenemic phone in the fenemic word baseform with its corresponding composite phone, and (b) merging at least one pair of adjacent composite phones into a single composite phone when the adverse effect of the merger is below a predetermined threshold.

C. Prior art

Inventions forming the background of the present invention are described in U.S. patent applications Ser. No. 06/665,401 (filed October 26, 1984), Ser. No. 06/672,974 (filed November 19, 1984), and Ser. No. 06/697,174 (filed February 1, 1985).

In probabilistic speech recognition methods, the acoustic waveform is initially
converted by an acoustic processor into a string of labels, or fenemes (micro-phonemes such as those obtained from a front end). The labels, each of which identifies a sound type, are generally selected from an alphabet of about 200 distinct labels. The generation of such labels is described in various articles and in the aforementioned U.S. patent application Ser. No. 06/665,401.

In performing speech recognition with labels, Markov model machines (that is, probabilistic finite-state machines) have been discussed. A Markov model normally includes a plurality of states and transitions between the states. In addition, a Markov model normally has probabilities assigned to it relating to (a) the probability of each transition occurring and (b) the respective probability of producing each label at the various transitions. Markov models, or Markov sources, are described in various articles, for example L. R. Bahl et al., "A Maximum Likelihood Approach to Continuous Speech Recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-5, No. 2, March 1983.

In recognizing speech, a matching process is performed to determine which word (or words) in the vocabulary has the highest likelihood of having produced the string of labels generated by the acoustic processor. One such matching procedure is described in the aforementioned U.S. patent application Ser. No. 06/672,974. According to that application, acoustic matching is performed by (a) characterizing each word in the vocabulary by a sequence of Markov model phone machines and (b) determining the respective likelihood that each word-representing sequence of phone machines produced the string of labels generated by the acoustic processor. Each word-representing sequence of phone machines is called a word baseform.

In determining a word baseform, the nature of the phone machines used to construct the baseform must first be decided. The aforementioned U.S. patent application Ser. No. 06/697,174 describes word baseforms based on fenemes. That is, a Markov model is provided for each of the 200 fenemes forming the feneme alphabet; the model indicates the probability that a particular feneme, when uttered, produces zero, one, or more fenemes (as generated by the acoustic processor). Because pronunciation varies from one utterance to the next, the Markov model phone machine for a given feneme has some probability of producing no feneme other than the given feneme and some probability of producing fenemes other than the given feneme. With fenemic baseforms, the number of phone machines in each word baseform is approximately equal to the number of fenemes per word. Typically, when the acoustic processor generates fenemes at a rate of one per centisecond, there are 60 to 100 fenemes per word.

Alternatively, word baseforms may be constructed from phonetic phone machines, as described in the aforementioned U.S. patent application Ser. No. 06/672,974. In this case, each phone machine corresponds to a phonetic sound and includes seven states and thirteen transitions.

D. Problems to be solved by the invention

The fenemic phone machine has a simple two-state structure.
It is determined easily and automatically. However, fenemic baseforms are quite long, so the matching process requires lengthy computation. Phonetic baseforms, on the other hand, are short and require less computation, but they are less accurate and are not determined easily or automatically; in general, phonetic baseforms must be entered manually.

Fenemic baseforms and phonetic baseforms are both effective and useful, but depending on the use, neither may be optimal.

It is therefore an object of the present invention to provide a baseform that embraces the main advantages of both fenemic and phonetic baseforms while largely avoiding the disadvantages of each.

A further object of the invention is to form short, simple phonetic-type Markov model word baseforms automatically.

A still further object of the invention is to provide, for a subject word, a baseform that is shorter than the fenemic baseform, retains its accuracy, and has the characteristics of a phonetic-type baseform.

E. Means for solving the problems

According to the invention, it is assumed that each word in the vocabulary has been defined by a fenemic baseform. It is also assumed that an alphabet of composite phones has been derived. The composite phones are preferably selected to resemble phonemes. In that case, each phone machine corresponds to an element of the phoneme alphabet and stores the transition probabilities and label probabilities used in matching that element against the fenemes in a feneme string. The elements of the composite phone alphabet can be chosen to resemble the elements of a phoneme alphabet.

The simple baseform of composite phones obtained by the invention is constructed as follows.

The subject word is uttered once or, preferably, several times, and each utterance generates a string of fenemes. Based on the generated feneme string (or strings) of the subject word, the fenemic word baseform with the highest (joint) probability of producing the string or strings is selected. Each string of the subject word is then aligned against the fenemic baseform. In this way, each fenemic phone in the fenemic baseform is aligned against the corresponding fenemes in each string (of the subject word). Each fenemic phone in the fenemic word baseform is replaced by its corresponding composite phone, thereby forming a baseform of composite phones. The following steps are then performed.

(a) Select a pair of adjacent composite phones.
(b) For each feneme string, determine the substring of fenemes aligned against the selected pair of adjacent composite phones.
(c) Determine the single composite phone that yields the highest joint probability of producing the respective determined substrings for all the feneme strings.
(d) Repeat steps (a) to (c) for all pairs of adjacent composite phones, determining a single composite phone for each pair.
(e) Measure the adverse effect of replacing each pair of composite phones in the baseform with the single composite phone determined for it.
(f) Replace the phone pair that yields the smallest adverse effect with its corresponding single composite phone.
(g) After the pair has been replaced with its corresponding single composite phone, provide the new baseform.
(h) Repeat steps (a) to (g) for this new baseform.

Step (h) is repeated as many times as necessary to shorten the baseform of composite phones to a desired length, or until the adverse effect of replacing a pair with a single composite phone exceeds a threshold.

Furthermore, when a composite phone alphabet that matches a given phoneme alphabet is selected, the shortened baseform generated by the invention provides a phonetic representation of the word that can be used in applications relating to phonetics.

F. Embodiment

F1 Speech recognition system environment

F1a General description (FIGS. 2 to 7)

FIG. 2 shows a general block diagram of a speech recognition system 1000.
The system includes a stack decoder 1002 to which are connected an acoustic processor (AP) 1004, an array processor 1006 that performs fast approximate acoustic matching, an array processor 1008 that performs detailed acoustic matching, a language model 1010, and a workstation 1012.

The acoustic processor 1004 is designed to transform a speech waveform input into a string of labels, or fenemes, each of which in a general sense identifies a corresponding sound type. In the present system, the acoustic processor 1004 is based on a unique model of the human ear and is described in U.S. patent application Ser. No. 06/665,401 (filed October 26, 1984).

The labels, or fenemes, from the acoustic processor 1004 are sent to the stack decoder 1002. FIG. 3 shows the logical elements of the stack decoder 1002. That is, the stack decoder 1002 includes a search element 1020, which communicates with the workstation 1012 and with interfaces 1022, 1024, 1026, and 1028. These interfaces communicate with the acoustic processor 1004, the array processors 1006 and 1008, and the language model 1010, respectively.

In operation, fenemes from the acoustic processor 1004 are directed by the search element 1020 to the array processor 1006 (fast match). The fast match procedure, described below, is also set out in U.S. patent application Ser. No. 06/672,974 (filed November 19, 1984). Briefly, the object of matching is to determine the most likely word (or words) for a given string of labels.

The fast match is designed to examine the words in the vocabulary and to reduce the number of candidate words for a given string of incoming labels. The fast match is based on probabilistic finite-state machines, also referred to herein as Markov models.

Once the fast match has reduced the number of candidate words, the stack decoder 1002 communicates with the language model 1010, which determines the contextual likelihood of each candidate word in the fast match candidate list, preferably based on existing tri-grams.

The detailed match preferably examines those words from the fast match candidate list that, based on the language model computations, have a reasonable likelihood of being the spoken word. The detailed match is also described in U.S. patent application Ser. No. 06/672,974 (filed November 19, 1984). The detailed match is performed by means of Markov model phone machines such as the one shown in FIG. 4.

After the detailed match, the language model is preferably invoked again to determine word likelihood. The stack decoder 1002 of the invention uses the information obtained from the fast match, the detailed match, and the language model to determine the most likely path, or sequence, of words for a string of generated labels.

Two prior methods of finding the most likely word sequence are Viterbi decoding and single-stack decoding. Each of these techniques is described in L. R. Bahl et al., "A Maximum Likelihood Approach to Continuous Speech Recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-5, No. 2, March 1983.

In single-stack decoding, paths of varying length are listed in a single stack according to likelihood, and decoding is based on that single stack. Single-stack decoding must take into account the fact that likelihood depends somewhat on path length, and so normalization is generally used.

The Viterbi technique does not require normalization and is generally suited to small tasks.

As another alternative, decoding may be performed with a small-vocabulary system by examining each possible combination of words as a possible word sequence and determining which combination has the highest probability of producing the generated label string. The computational requirements of this technique make it impractical for large-vocabulary systems.

The stack decoder 1002 actually
serves to control the other elements but does not perform many computations itself. The stack decoder 1002 therefore preferably includes a 4341 processor running under the IBM VM/370 operating system, as described in publications such as Virtual Machine/System Product Introduction Release 3 (1983). The array processors, which perform considerable computation, are implemented with commercially available 190L processors from Floating Point Systems (FPS).

A novel technique that includes multiple stacks and a unique decision strategy was invented by L. R. Bahl et al. The technique is illustrated in FIGS. 5, 6, and 7.

FIGS. 5 and 6 show a plurality of successive labels Y₁, Y₂, ... generated at successive "label intervals".

FIG. 6 also shows a plurality of word paths, namely path A, path B, and path C. In the context of FIG. 5, path A would correspond to the entry "to be", path B to the entry "two b", and path C to the entry "too". For a subject word path, there is a label (or, equivalently, a label interval) at which the subject word path has the highest probability of having ended. Such a label is called a "boundary label".

For a word path W representing a sequence of words, the most likely end time (represented in the label string as a "boundary label" between two words) can be found by known methods such as that described in L. R. Bahl et al., "Faster Acoustic Match Computation", IBM Technical Disclosure Bulletin, Vol. 23, No. 4, September 1980. Briefly, the article describes how to address two questions: (a) how much of a label string Y is accounted for by a word (or word sequence), and (b) at which label interval a partial sentence (corresponding to a part of the label string) ends.

For any given word path, there is a "likelihood value" associated with each label, or label interval, from the first label of the label string through the boundary label. Taken together, all the likelihood values for a given word path represent a "likelihood vector" for that word path. Accordingly, there is a corresponding likelihood vector for each word path. Likelihood values L_t are illustrated in FIG. 6.

The "likelihood envelope" Λ_t at label interval t for a collection of word paths W¹, W², ..., W^s is defined mathematically as:

$$\Lambda_t = \max\bigl(L_t(W^1), \ldots, L_t(W^s)\bigr)$$

That is, for each label interval, the likelihood envelope includes the highest likelihood value associated with any word path in the collection. FIG. 6 shows the likelihood envelope 1040.
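The envelope definition amounts to a pointwise maximum over the likelihood vectors of the word paths. The sketch below assumes that each likelihood vector is a list indexed by label interval and ending at the path's boundary label, and that intervals not yet reached by any path stay at minus infinity; the function name and the sample values are hypothetical.

```python
def likelihood_envelope(likelihood_vectors):
    """Pointwise maximum of L_t over a collection of word paths.

    A vector shorter than the longest one simply does not contribute
    beyond its own boundary label.
    """
    length = max(len(vec) for vec in likelihood_vectors)
    envelope = [float("-inf")] * length
    for vec in likelihood_vectors:
        for t, value in enumerate(vec):
            envelope[t] = max(envelope[t], value)
    return envelope

# Hypothetical log-likelihood vectors for three word paths.
paths = {"A": [-1.0, -2.5, -4.0], "B": [-1.5, -2.0], "C": [-0.5]}
env = likelihood_envelope(list(paths.values()))
```

Starting every interval at minus infinity matches the initialization used when no complete path is yet available.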
A word path is considered "complete" if it corresponds to a complete sentence. A complete path is preferably identified by the speaker entering an input, for example pressing a button, on reaching the end of a sentence. The entered input is synchronized with the label interval that marks the end of the sentence. A complete word path cannot be extended by appending words to it. A partial word path corresponds to an incomplete sentence and can be extended.

Partial paths are classified as "live" or "dead". A word path is "dead" if it has already been extended and "live" if it has not yet been extended. With this classification, a path that has already been extended to form one or more longer extended word paths is not considered again for extension at a later time.

Each word path can also be characterized as "good" or "bad" relative to the likelihood envelope. A word path is good if, at the label corresponding to its boundary label, it has a likelihood value close to the maximum likelihood envelope. Otherwise, the word path is bad. Preferably, though not necessarily, each value of the maximum likelihood envelope is reduced by a fixed amount to serve as the good/bad threshold level.

There is a stack element for each label interval.
Each live word path is assigned to the stack element for the label interval that corresponds to the boundary label of that live path. A stack element may hold zero, one, or more word path entries, which are listed in order of likelihood value.
The steps executed by the stack decoder 1002 are described next.
Forming the likelihood envelope and determining which word paths are "good" are interrelated, as the stack decoding flowchart of FIG. 7 shows.
In the flowchart of FIG. 7, a null path is first entered into the first stack (0) at block 1050. At block 1052, a stack (complete) element is provided, which contains any complete paths that have been determined previously. Each complete path in the stack (complete) element has a likelihood vector associated with it. The likelihood vector of the complete path that has the highest likelihood at its boundary label initially defines the maximum likelihood envelope. If the stack (complete) element contains no complete path, the maximum likelihood envelope is initialized to −∞ at each label interval. The envelope may also be initialized to −∞ when complete paths are not specified. The envelope is initialized in blocks 1054 and 1056.
After the maximum likelihood envelope is initialized, it is reduced by a predetermined amount Δ, forming a Δ-good region above the reduced likelihoods and a Δ-bad region below them. The larger Δ is, the larger the number of word paths considered for possible extension. When log10 is used to determine L_t, a Δ value of 2 gives satisfactory results. The value of Δ is preferably, but not necessarily, uniform along the length of the label intervals.
A word path whose likelihood at its boundary label lies in the Δ-good region is marked "good". Otherwise the word path is marked "bad".
As shown in FIG. 7, the loop that updates the likelihood envelope and marks word paths as "good" (extendable) or "bad" begins at block 1058 by finding the longest unmarked word path. If two or more unmarked word paths are in the stack corresponding to the longest word path length, the one with the highest likelihood at its boundary label is selected. If a word path is found, block 1060 checks whether the likelihood at its boundary label lies within the Δ-good region. If it does not, the path is marked as lying in the Δ-bad region at block 1062, and another unmarked live path is sought at block 1058. If it does, the path is marked as lying in the Δ-good region at block 1064, and at block 1066 the likelihood envelope is updated to include the likelihood values of the path marked "good". That is, for each label interval, the updated likelihood value is the larger of (a) the current likelihood value in the likelihood envelope and (b) the likelihood value associated with the word path marked "good". These operations take place in blocks 1064 and 1066.
After the envelope has been updated, the process returns to block 1058 to find again the longest, best unmarked live word path.
This loop repeats until no unmarked word paths remain. When none remain, the shortest word path marked "good" is selected at block 1070. If two or more "good" word paths share the shortest length, the one with the highest likelihood at its boundary label is selected at block 1072. The selected shortest path is then extended. That is, at least one likely follower word is determined, preferably by performing the fast matching, language model, precision matching, and language model procedures.
For each likely follower word, an extended word path is formed. Specifically, an extended word path is formed by appending a likely follower word to the end of the selected shortest word path.
After the selected shortest word path has been formed into extended word paths, the selected word path is removed from the stack in which it was an entry, and each extended word path is entered into the appropriate stack instead. In particular, an extended word path becomes an entry in the stack corresponding to its boundary label (block 1072).
Once the extended paths have been formed and the stacks re-formed, the process returns to block 1052 and repeats.
At each iteration, therefore, the shortest, best "good" word path is selected and extended. A word path marked "bad" in one iteration may become "good" in a later iteration. Whether a live word path is "good" or "bad" is thus determined independently at each iteration. In practice, the likelihood envelope does not change greatly from one iteration to the next, so the computation that decides whether a word path is good or bad is done efficiently.
Furthermore, normalization is not required.
When complete sentences are to be identified, block 1074 is preferably included. That is, when no live word path remains unmarked and there is no "good" word path left to extend, decoding ends. The complete word path with the highest likelihood at its boundary label is identified as the most likely word sequence for the input label string.
For continuous speech, in which sentence ends are not identified, path extension proceeds continuously or for a predetermined number of words chosen by the system user.
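One iteration of the marking loop and the choice of the path to extend can be sketched as follows. This is a simplified illustration, not the decoder of FIG. 7: the `WordPath` class, the function names, and the example values are hypothetical, the envelope is assumed to start at −∞ (no complete path known), and finding follower words by matching is left out.

```python
import math
from dataclasses import dataclass

@dataclass
class WordPath:
    words: tuple
    likelihoods: list   # log10 likelihood per label interval, up to the boundary label
    mark: str = ""      # "", "good" or "bad"

    @property
    def boundary(self):
        # Index of the boundary label (last label interval the path covers).
        return len(self.likelihoods) - 1

def mark_paths(live_paths, num_intervals, delta=2.0):
    """Mark every live path good or bad and return the updated envelope."""
    envelope = [-math.inf] * num_intervals
    unmarked = list(live_paths)
    while unmarked:
        # Longest unmarked path; ties broken by likelihood at the boundary label.
        path = max(unmarked, key=lambda p: (p.boundary, p.likelihoods[p.boundary]))
        unmarked.remove(path)
        b = path.boundary
        if path.likelihoods[b] >= envelope[b] - delta:
            path.mark = "good"
            # Envelope keeps the larger of its current value and the good path's value.
            for t, value in enumerate(path.likelihoods):
                envelope[t] = max(envelope[t], value)
        else:
            path.mark = "bad"
    return envelope

def select_for_extension(live_paths):
    """Shortest good path; ties broken by likelihood at the boundary label."""
    good = [p for p in live_paths if p.mark == "good"]
    if not good:
        return None
    return min(good, key=lambda p: (p.boundary, -p.likelihoods[p.boundary]))

paths = [
    WordPath(("the",), [-1.0, -2.0]),
    WordPath(("a",), [-1.2, -5.0]),
    WordPath(("the", "cat"), [-1.0, -2.0, -3.0, -4.0]),
]
envelope = mark_paths(paths, num_intervals=4)
chosen = select_for_extension(paths)
print([p.mark for p in paths], chosen.words)
```

Because marks are recomputed from scratch on every call, a path that is bad in one iteration can become good in a later one, as the text describes.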
F1b Auditory model and its realization in the acoustic processor of a speech recognition system (FIGS. 8 to 14)
FIG. 8 shows a specific example of the acoustic processor 1100 described above. An acoustic wave input (for example, natural speech) enters an A/D converter 1102, which samples it at a predetermined rate. A typical sampling rate is one sample every 50 microseconds.
A time window generator 1104 is provided to shape the edges of the digital signal. The output of the time window generator 1104 enters an FFT (fast Fourier transform) device 1106, which provides a frequency spectrum output for each time window.
The output of the FFT device 1106 is then processed to generate labels L1 L2 ... Lf. A feature selection device 1108, a cluster device 1110, a prototype device 1112, and an encoder 1114 cooperate to generate the labels. In generating labels, prototypes are defined as points (or vectors) in a space based on selected features. The acoustic input is characterized by the same selected features, which yields corresponding points (or vectors) in the space that can be compared with the prototypes.
Specifically, in defining the prototypes, the cluster device 1110 groups sets of points into respective clusters. Methods of forming clusters are based on probability distributions, such as the Gaussian distribution, applied to speech. The prototype of each cluster, relating to the centroid or another characteristic of the cluster, is generated by the prototype device 1112. The generated prototypes and the acoustic input, both characterized by the same selected features, enter the encoder 1114. The encoder 1114 performs a comparison procedure that assigns a label to a particular acoustic input.
Selecting appropriate features is a key factor in deriving labels that represent the acoustic (speech) wave input. The acoustic processor described here includes an improved feature selection device 1108. In this acoustic processor, an auditory model is derived and applied in the acoustic processor of a speech recognition system. The auditory model is explained with reference to FIG. 9, which shows a portion of the human inner ear.
Specifically, an inner hair cell 1200 is shown in detail together with end portions 1202 that extend into a fluid-containing channel 1204. Upstream from the inner hair cell 1200 are outer hair cells 1206, with end portions 1208 that also extend into the channel 1204. Nerves that convey information to the brain are associated with the inner hair cell 1200 and the outer hair cells 1206. In particular, neurons undergo electrochemical changes, and the resulting electrical impulses are carried along a nerve to the brain for processing. The electrochemical changes are stimulated by the mechanical motion of the basilar membrane 1210.
It is known from prior teaching that the basilar membrane 1210 serves as a frequency analyzer for acoustic wave inputs, and that regions along the basilar membrane 1210 respond to respective critical frequency bands. The fact that each portion of the basilar membrane 1210 responds to a corresponding frequency band affects the loudness perceived for an acoustic wave input. That is, tones are perceived as louder when two tones lie in separate critical frequency bands than when two tones of similar power occupy the same frequency band. The basilar membrane 1210 is known to define 22 critical frequency bands.
To match the frequency response of the basilar membrane 1210, the present invention preferably divides the input acoustic waveform into some or all of the critical frequency bands and then examines the signal component of each defined critical frequency band separately. This is done by appropriately filtering the signal from the FFT device 1106 (FIG. 8) so that a separate signal is provided to the feature selection device 1108 for each critical frequency band examined.
The separate inputs are also blocked into time frames (preferably 25.6 ms) by the time window generator 1104. The feature selection device 1108 therefore preferably handles 22 signals, each of which represents the sound intensity in a given frequency band, one time frame after another.
The signals are preferably filtered by a conventional critical band filter 1300 of FIG. 10. The separate signals are then processed by an equal-loudness converter 1302, which accounts for variations in perceived loudness as a function of frequency. Note that a first tone at a given dB level at one frequency may differ in perceived loudness from a second tone at the same dB level at another frequency. The equal-loudness converter 1302 can be based on empirical data and converts the signals of the frequency bands so that each is measured on the same loudness scale. For example, the equal-loudness converter 1302 can map acoustic power to equal loudness on the basis of the 1933 studies of Fletcher and Munson, with some modifications. FIG. 11 shows the modified results of those studies. According to FIG. 11, a 1 kHz tone at 40 dB corresponds in loudness level to a 100 Hz tone at 60 dB.
The equal-loudness converter 1302 adjusts loudness according to these curves so as to produce equal loudness regardless of frequency.
Apart from the dependence on frequency, a look at a single frequency in FIG. 11 shows that changes in power do not correspond to changes in loudness. That is, variations in sound intensity, or amplitude, are not reflected at all points by similar changes in perceived loudness. For example, at a frequency of 100 Hz, the perceived change in loudness for a 10 dB change around 110 dB is much larger than the perceived change in loudness for a 10 dB change around 20 dB. This difference is handled by a loudness compression device 1304, which compresses loudness in a predetermined way. The loudness compression device 1304 preferably compresses the power P to its cube root P^{1/3} by replacing the loudness amplitude measure in phons with one in sones.
FIG. 12 shows the known, empirically determined relationship between phons and sones. By using sones, the model of the present invention remains substantially accurate even at large speech signal amplitudes. One sone is defined as the loudness of a 1 kHz tone at 40 dB.
FIG. 10 also shows a novel time-varying response device 1306. This device acts on the equal-loudness, loudness-compressed signals associated with each critical frequency band. Specifically, for each frequency band examined, a neural firing rate f is determined at each time frame. In the acoustic processor of the present invention, the firing rate f is defined as follows:
$f = (S_o + D L)\,n$  (1)
where n is the amount of neurotransmitter; So is a spontaneous firing constant, which relates to neural firing that is independent of the acoustic wave input; L is the measured loudness; and D is a displacement constant. So·n corresponds to the spontaneous neural firing rate, which occurs whether or not an acoustic wave input is present, and D·L·n corresponds to the firing rate caused by the acoustic wave input.
Significantly, in the present invention the value of n changes over time according to the following equation:
$\frac{dn}{dt} = A_o - (S_o + S_h + D L)\,n$  (2)
where Ao is a replenishment constant and Sh is a spontaneous neurotransmitter decay constant. The novel relationship of equation (2) takes into account that neurotransmitter is produced at a certain rate (Ao) and is lost through (a) decay (Sh·n), (b) spontaneous firing (So·n), and (c) neural firing caused by the acoustic wave input (D·L·n). These modeled phenomena are assumed to occur at the locations shown in FIG. 9.
As equation (2) makes clear, the next amount of neurotransmitter and the next firing rate depend multiplicatively on at least the current amount of neurotransmitter, which shows that the acoustic processor of the present invention is nonlinear. That is, the amount of neurotransmitter in state (t+Δt) equals the amount of neurotransmitter in state t plus (dn/dt)·Δt:
$n(t + \Delta t) = n(t) + \frac{dn}{dt}\,\Delta t$  (3)
Equations (1), (2), and (3) describe the operation of a time-varying signal analyzer. The analyzer reflects the fact that the auditory system appears to adapt over time, so that signals on the auditory nerve are nonlinearly related to the acoustic wave input. In this respect, the acoustic processor of the present invention provides the first model that embodies nonlinear signal processing in a speech recognition system, so as to better follow apparent temporal variations in the nervous system.
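Equations (1) to (3) can be applied frame by frame, as in the following Python sketch for a single frequency band. The function name, the starting state (the silent steady state Ao/(So+Sh)), the input sequence, and the one-centisecond step are assumptions for illustration; the constants are the values the text quotes for R = 1.5.

```python
def simulate_firing(loudness, s_o, s_h, d, a_o=1.0, dt=1.0, n0=None):
    """Apply equations (1) to (3) frame by frame for one frequency band.

    loudness: loudness L per time frame (sones).
    dt: frame step in the time unit of the constants (centiseconds here).
    """
    # Assumption: start from the silent steady state n = Ao / (So + Sh).
    n = a_o / (s_o + s_h) if n0 is None else n0
    rates = []
    for L in loudness:
        f = (s_o + d * L) * n                   # equation (1)
        dn_dt = a_o - (s_o + s_h + d * L) * n   # equation (2)
        n = n + dn_dt * dt                      # equation (3)
        rates.append(f)
    return rates

# Three silent frames followed by ten frames at 20 sones (hypothetical input).
rates = simulate_firing([0.0] * 3 + [20.0] * 10, s_o=0.0888, s_h=0.1111, d=0.00666)
print([round(r, 3) for r in rates])
```

In this sketch the firing rate jumps when the tone starts and then decays toward a lower steady level as the neurotransmitter is depleted, which is the adaptive, nonlinear behaviour the equations are meant to capture.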
To reduce the number of unknowns in equations (1) and (2), the present invention uses the following equation:
$S_o + S_h + D L = 1/T$  (4)
where T is the measured time it takes for the auditory response to drop to 37% of its maximum value after an acoustic wave input is generated. T is a function of loudness and, in the acoustic processor of the present invention, is taken from known graphs that show the decay of the response at various loudness levels. That is, when a tone of fixed loudness is generated, it first produces a response at a high level, after which the response decays toward a steady-state level with time constant T. With no acoustic wave input, T = T_0, which is about 50 milliseconds. At a loudness of L_max, T = T_max, which is about 30 milliseconds. With Ao = 1, 1/(So + Sh) is determined to be 5 centiseconds when L = 0. When L is L_max, with L_max = 20 sones, the following equation holds (with T_max expressed in milliseconds; in centiseconds the right-hand side is 1/3):
$S_o + S_h + D(20) = 1/30$  (5)
From the above data and equations, So and Sh are determined by equations (6) and (7):
$S_o = \frac{D L_{max}}{R + (D L_{max} T_0 R) - 1}$  (6)
$S_h = \frac{1}{T_0} - S_o$  (7)
where
$R = \frac{f_{\text{steady state}}\big|_{L = L_{max}}}{f_{\text{steady state}}\big|_{L = 0}}$  (8)
and f_steady state denotes the firing rate at a given loudness when dn/dt is 0.
R is the only variable left in the acoustic processor.
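As a worked check of equations (4) to (8), the following Python sketch solves for So, Sh, and D. The function name is hypothetical, and the inputs are the values quoted in the text, expressed in centiseconds: T_0 = 5, T_max = 3, L_max = 20 sones, and R = 1.5.

```python
def model_constants(t0, t_max, l_max, ratio):
    """Solve equations (4) to (8) for So, Sh and D.

    t0, t_max: decay time constants at L = 0 and L = l_max (same time unit).
    ratio: R, the steady-state firing rate at l_max divided by that at L = 0.
    """
    # Equation (4) at L = l_max minus equation (4) at L = 0 gives D * Lmax.
    d_lmax = 1.0 / t_max - 1.0 / t0
    s_o = d_lmax / (ratio + d_lmax * t0 * ratio - 1.0)   # equation (6)
    s_h = 1.0 / t0 - s_o                                 # equation (7)
    return s_o, s_h, d_lmax / l_max

s_o, s_h, d = model_constants(t0=5.0, t_max=3.0, l_max=20.0, ratio=1.5)
print(round(s_o, 4), round(s_h, 4), round(d, 5))
```

The results agree with the constants quoted in the text (0.0888, 0.11111, and 0.00666) up to rounding of the last digit.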
Therefore, the performance of the processor is changed simply by changing R. That is, R is a single parameter that can be adjusted to alter performance, which normally means minimizing steady-state effects relative to transient effects. It is desirable to minimize steady-state effects because inconsistent output patterns for similar speech inputs generally result from differences in frequency response, speaker differences, background noise, and distortion, which affect the steady-state portions of the speech signal but not the transient portions. The value of R is preferably set by optimizing the error rate of the complete speech recognition system. A suitable value found in this way is R = 1.5. The values of So and Sh are then 0.0888 and 0.11111, respectively, and the value of D is 0.00666.
FIG. 13 is a flowchart showing the operation of the acoustic processor according to the present invention.
Digitized speech in a 25.6 ms time frame, preferably sampled at 20 kHz, passes through a Hanning window 1320, whose output is subjected to a discrete Fourier transform (DFT) 1322, preferably at 10 ms intervals. The transform output is filtered at block 1324 to provide a power density output for each of at least one frequency band (preferably all the critical frequency bands, or at least 20 of them). The power density is then converted from log magnitude to loudness level at block 1326. This is readily done using the modified graph of FIG. 11. The subsequent process, including the threshold update of block 1330, is outlined in FIG. 14, in which a threshold of feeling T_f and a threshold of hearing T_h are first set for each filtered frequency band m to 120 dB and 0 dB, respectively (block 1340). A speech counter, a total frame register, and a histogram register are then reset (block 1342).
Each histogram contains bins, each of which gives the number of samples, or counts, during which the power or a similar measure (in a given frequency band) lies in a respective range. In the present invention, a histogram preferably represents, for a given frequency band, the number of centiseconds during which the loudness lies in each of a number of loudness ranges. For example, in the third frequency band there may be 20 centiseconds with power between 10 dB and 20 dB. Similarly, in the 20th frequency band, 150 out of a total of 1000 centiseconds may lie between 50 dB and 60 dB. Percentiles are derived from the total number of samples (that is, centiseconds) and the counts contained in the bins.
At block 1344, a frame from the filter output of each frequency band is examined, and at block 1346 bins in the appropriate histograms (one per filter) are incremented. At block 1348, the total number of bins in which the amplitude exceeds 55 dB is summed for each filter (that is, each frequency band), and the number of filters that indicate the presence of speech is determined. If at block 1350 fewer than a minimum number of filters (for example, 6 of 20) indicate speech, the next frame is examined at block 1344. If enough filters indicate speech, the speech counter is incremented at block 1352. The speech counter is incremented until 10 seconds of speech have occurred at block 1354, whereupon new values of T_f and T_h are determined for each filter at block 1356.
The new values of T_f and T_h for a given filter are determined as follows. For T_f, the dB value of the bin that holds the 35th sample from the top of 1000 bins (that is, the 96.5th percentile of loudness) is defined as BIN_H, and T_f is set to T_f = BIN_H + 40 dB. For T_h, the dB value of the bin that holds the (0.01)(total number of bins − speech count)th value from the lowest bin is defined as BIN_L. That is, BIN_L is the bin in the histogram at 1% of the number of samples, excluding those classified as speech. T_h is defined as T_h = BIN_L − 30 dB.
In blocks 1330 and 1332 of the acoustic-processor flowchart
the sound amplitudes are converted to sones and compressed on the basis of the thresholds updated as described above. An alternative way of deriving sones and compressing is to take the filter amplitude a (after the bins have been incremented) and convert it to dB with the following equation:
$a^{dB} = 20 \log_{10}(a) - 10$  (9)
Each filter amplitude is then scaled to a range between 0 dB and 120 dB to give equal loudness:
$a^{eql} = 120\,\frac{a^{dB} - T_h}{T_f - T_h}$  (10)
a^{eql} is then preferably converted from a loudness level (in phons) to an approximation of loudness in sones (with a 1 kHz signal at 40 dB mapping to 1) by the following equation:
$L^{dB} = (a^{eql} - 30)/4$  (11)
An approximation of the loudness in sones, L_s, is then given by
$L_s = 10^{L^{dB}/20}$  (12)
At block 1334, L_s is supplied as the input to equations (1) and (2) to determine the output firing rate f for each frequency band. With 22 frequency bands, a 22-dimensional vector characterizes the acoustic wave input over successive time frames. In general, however, 20 frequency bands are examined using a conventional mel-scaled filter bank.
Before the next time frame is processed, the "next state" of n is determined according to equation (3) at block 1337.
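The alternative conversion of equations (9) to (12) is a short chain of arithmetic, sketched below in Python. The function name is hypothetical, and the thresholds used in the example are the initial values given in the text (T_h = 0 dB, T_f = 120 dB) rather than updated ones.

```python
import math

def amplitude_to_sones(a, t_h, t_f):
    """Map a filter amplitude to approximate loudness in sones, equations (9) to (12)."""
    a_db = 20.0 * math.log10(a) - 10.0            # equation (9)
    a_eql = 120.0 * (a_db - t_h) / (t_f - t_h)    # equation (10)
    l_db = (a_eql - 30.0) / 4.0                   # equation (11)
    return 10.0 ** (l_db / 20.0)                  # equation (12)

# With T_h = 0 dB and T_f = 120 dB, equation (10) leaves the dB value unchanged.
for amplitude in (100.0, 1000.0, 10000.0):
    print(amplitude, amplitude_to_sones(amplitude, t_h=0.0, t_f=120.0))
```

The value returned for each band is the loudness L that equations (1) and (2) take as input.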
The acoustic processor described above can be improved for applications in which the firing rate f and the neurotransmitter amount n have a large DC pedestal. That is, where the dynamic range of the terms of the f and n equations is important, the following equations are derived to reduce the pedestal height.
In the steady state, with no acoustic wave input signal present (L = 0), equation (2) can be solved for the steady-state internal state n′:
$n' = \frac{A}{S_o + S_h}$  (13)
The internal state of the neurotransmitter amount n(t) can be expressed as a steady-state part and a varying part:
$n(t) = n' + n''(t)$  (14)
Combining equations (1) and (14) gives the following expression for the firing rate:
$f(t) = (S_o + D \cdot L)\,(n' + n''(t))$  (15)
The term So·n′ is a constant, whereas all other terms contain either the varying part of n or the input signal represented by (D·L). Because subsequent processing involves only the squared differences between output vectors, the constant term can be ignored. From equations (15) and (13), the following equation is obtained:
$f''(t) = (S_o + D \cdot L) \cdot \left[\frac{n''(t) + D \cdot L \cdot A}{S_o + S_h}\right]$  (16)
Considering equation (3), the "next state" becomes:
$n(t + \Delta t) = n'(t + \Delta t) + n''(t + \Delta t)$  (17)
$n(t + \Delta t) = n''(t) + A - (S_o + S_h + D \cdot L) \cdot (n' + n''(t))$  (18)
$n(t + \Delta t) = n''(t) - (S_h \cdot n''(t)) - (S_o + A_o \cdot L^A) \cdot n''(t) - \frac{A_o \cdot L^A \cdot D}{S_o + S_h} + A_o - \frac{(S_o \cdot A_o) + (S_h \cdot A_o)}{S_o + S_h}$  (19)
Ignoring all constant terms, equation (19) can be rewritten as follows:
$n''(t + \Delta t) = n''(t)\,(1 - S_o \cdot \Delta t) - f''(t)$  (20)
Equations (15) and (20) constitute the output equation and the state update equation applied to each filter during each 10 ms time frame. Applying these equations yields a 20-element vector every 10 ms, each element of which corresponds to the firing rate of a respective frequency band in the mel-scaled filter bank.
For the embodiment just described, the acoustic-processor flowchart applies, except that the equations for f, dn/dt, and n(t+Δt) are replaced by equations (11) and (16), which define special-case expressions for the firing rate f and the "next state" n(t+Δt).
The values assigned to the terms of the equations (namely t_0 = 5 csec, t_L = 3 csec, Ao = 1, R = 1.5, and L_max = 20) can be set otherwise, and when the other terms are set differently, the terms So, Sh, and D will differ from the preferred values 0.0888, 0.11111, and 0.00666, respectively.
The present invention can be implemented in various software or hardware.
F1c Precision matching (FIG. 4, FIG. 15, FIG. 16)
FIG. 4 shows an example of a precision matching phoneme machine 2000. Each precision matching phoneme machine is a probabilistic finite-state machine characterized by:
(a) a plurality of states Si;
(b) a plurality of transitions tr(Sj | Si), some of which extend between different states and some of which return from a state to itself, each transition having a corresponding probability;
(c) for each label that can be generated at a particular transition, a corresponding actual label probability.
In FIG. 4, the precision matching phoneme machine 2000 has seven states S1 to S7 and thirteen transitions tr1 to tr13, and the paths of three of the transitions, tr11, tr12, and tr13, are drawn with dashed lines. At each of these three transitions the phoneme can change from one state to another without generating a label. Such a transition is therefore called a null transition. Labels can be generated along transitions tr1 to tr10. Specifically, along each of the transitions tr1 to tr10, one or more labels may each have a distinct probability of being generated.
For each transition, there is preferably a probability associated with each label that the system can generate. That is, if there are 200 labels that the acoustic channel can selectively generate, each (non-null) transition has 200 "actual label probabilities" associated with it, each of which corresponds to the probability that the phoneme generates the corresponding label at that particular transition. As shown, the actual label probabilities of transition tr1 are represented by the symbol P followed by a bracketed column of the numerals 1 to 200, each numeral representing a given label. For label 1, there is a probability P[1] that the precision matching phoneme machine 2000 generates label 1 at transition tr1. The various actual label probabilities are stored in association with the label and the corresponding transition.
When a string of labels y1 y2 y3 ... is presented to the precision matching phoneme machine 2000 corresponding to a given phoneme, a matching procedure is performed.
The procedure associated with the precision matching phoneme machine is explained with reference to FIG. 15, which is a trellis diagram of the phoneme machine of FIG. 4. As in the phoneme machine, the trellis diagram shows a null transition from state S1 to state S7, a transition from state S1 to state S2, and a transition from state S1 to state S4. Transitions between other states are also shown. In addition, the trellis diagram shows time in the horizontal direction. The start-time probabilities q0 and q1 represent the probabilities that the phoneme has its start time at time t = t0 or t = t1, respectively. The respective transitions at each start time are also shown. Note that the interval between successive start (and end) times is preferably equal in length to the time interval of a label.
When the precision matching phoneme machine 2000 is used to determine how closely a given phoneme matches the labels of an incoming string, the end-time distribution of the phoneme is found and used to determine a matching value for the phoneme. Relying on the end-time distribution to perform precision matching is common to all the phoneme machine embodiments described in this invention with respect to the matching procedure. In generating the end-time distribution for precision matching, the precision matching phoneme machine 2000 requires exact and complicated computations.
First, referring to the trellis diagram of FIG. 15, consider the computations required for both the start time and the end time to fall at time t = t0. For the example phoneme machine structure shown in FIG. 4, the following probability applies:
$\Pr(S_7, t = t_0) = q_0\,T(1 \to 7) + \Pr(S_2, t = t_0) \cdot T(2 \to 7) + \Pr(S_3, t = t_0) \cdot T(3 \to 7)$  (21)
where Pr denotes a probability and T denotes the transition probability between the two states named in parentheses. This equation gives the respective probabilities of the three conditions under which the end time can occur at t = t0. Moreover, in the present example an end time of t = t0 is limited to occurring in state S7.
Turning next to the end time t = t1, a computation must be made for every state other than state S1. State S1 starts at the end time of the previous phoneme. For convenience of explanation, only the computation for state S4 is shown. For S4, the computation is:
$\Pr(S_4, t = t_1) = \Pr(S_1, t = t_0) \cdot T(1 \to 4) \cdot \Pr(y_1, 1 \to 4) + \Pr(S_4, t = t_0) \cdot T(4 \to 4) \cdot \Pr(y_1, 4 \to 4)$  (22)
Equation (22) shows that the probability that the phoneme machine is in state S4 at time t = t1 is the sum of two terms: (a) the probability of being in state S1 at time t = t0, multiplied by the transition probability from state S1 to state S4, and further multiplied by the probability that the given label y1 of the string is generated on the transition from state S1 to state S4; and (b) the probability of being in state S4 at time t = t0, multiplied by the transition probability from state S4 to itself, and further multiplied by the probability that the given label y1 is generated on the transition from state S4 to itself.
Similar computations are performed for the other states (excluding state S1) to generate the corresponding probabilities that the phoneme is in a particular state at time t = t1. In general, to determine the probability of being in a target state at a given time, precision matching:
(a) identifies each previous state that has a transition leading to the target state, together with the probability of each such previous state;
(b) identifies, for each such previous state, a value representing the probability of the label that must be generated on the transition between that previous state and the current state in order to conform to the label string;
(c) combines the probability of each previous state with the corresponding value representing the label probability to give the probability of the target state over the corresponding transition.
The overall probability of being in the target state is determined from the target state probabilities over all the transitions that lead to it. The computation for state S7 includes terms for the three null transitions, which allow the phoneme to start and end at time t = t1, ending in state S7.
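The general recursion of steps (a) to (c) can be sketched as a forward pass over a trellis. The Python sketch below uses a hypothetical three-state machine, not the seven-state machine of FIG. 4; the function name, the data layout, and all probabilities are assumptions chosen for illustration.

```python
def end_time_distribution(labels, n_states, label_trans, null_trans, start_probs):
    """Forward pass over the trellis of one phoneme machine.

    label_trans: list of (src, dst, trans_prob, {label: label_prob}).
    null_trans:  list of (src, dst, trans_prob) with src < dst, so a single
                 sweep in order of src propagates them correctly.
    start_probs: q_t, the probability that the phoneme starts at time t.
    State 0 is the initial state and state n_states - 1 the final state.
    """
    final = n_states - 1
    prev = None
    ends = []
    for t in range(len(labels) + 1):
        cur = [0.0] * n_states
        if t < len(start_probs):
            cur[0] += start_probs[t]
        if prev is not None:
            y = labels[t - 1]   # label generated between time t-1 and time t
            for src, dst, tp, label_probs in label_trans:
                # previous-state probability * transition probability * label probability
                cur[dst] += prev[src] * tp * label_probs.get(y, 0.0)
        for src, dst, tp in sorted(null_trans):
            cur[dst] += cur[src] * tp   # null transitions consume no label
        ends.append(cur[final])
        prev = cur
    return ends

# Hypothetical machine: 0 -> 1 -> 2 with a self-loop on 1 and a null skip 0 -> 2.
label_trans = [
    (0, 1, 0.6, {"a": 0.9, "b": 0.1}),
    (1, 1, 0.5, {"a": 0.2, "b": 0.8}),
    (1, 2, 0.5, {"a": 0.5, "b": 0.5}),
]
null_trans = [(0, 2, 0.4)]
ends = end_time_distribution(["a", "b"], 3, label_trans, null_trans, start_probs=[1.0])
print(ends)
```

Each entry of `ends` is the probability that the phoneme has reached its final state after consuming that many labels, which plays the role of one value of the end-time distribution.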
Ru. Time t=t₀and t=t₁determine the probability for
Determining the probabilities of other pairs of end times, such as when
It is desirable to do this so as to form an end time distribution.
Delicious. The end time distribution value of a given phoneme is
How well do the phonemes of match to the incoming labels?
Displays whether the How well words match incoming labels
the phoneme that represents the word.
are processed sequentially. Each phoneme is the end time of the probability value
Generate a distribution. The phoneme matching value is
By summing the time probabilities and taking the logarithm of that sum,
can be obtained. The start time distribution of the next phoneme is the end time
It is derived by normalizing the distribution. This positive
In normalization, for example, we define each of those values as
scale by dividing by the sum of
Make sure that the sum of the ringed values is 1. Examining a given word or word string
There are at least two ways to determine the number of phonemes h that should be
be. In the depth-first method, the computation follows the basic form.
(continuously subtotal by each successive phoneme)
). the given sound that this subtotal follows
If it is found that the elementary position is below a predetermined limit value,
The calculation ends. Another method, breadth-first method
calculates similar phoneme positions in each word.
cormorant. The calculation consists of calculating the first phoneme of each word, followed by
calculate the second phoneme of each word, and so on.
Do it next. In the breadth-first method, each word's
Calculated values along the same number of phonemes are relative to the same phoneme.
Compare by position. In either method, matching
The word with the largest sum of values is the goal we were looking for.
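The phoneme-by-phoneme processing described above (matching value as the logarithm of the summed end-time probabilities, start-time distribution by normalization, and depth-first termination on a running subtotal) can be sketched as follows. This is only an illustrative sketch: the function names, the base-10 logarithm, and the sample numbers are assumptions made for the example.

```python
import math


def phoneme_matching_value(end_time_probs):
    """Matching value of one phoneme: log of the summed end-time probabilities."""
    return math.log10(sum(end_time_probs))


def next_start_distribution(end_time_probs):
    """Normalize an end-time distribution so that its values total 1."""
    total = sum(end_time_probs)
    return [p / total for p in end_time_probs]


def depth_first_subtotal(matching_values, threshold):
    """Accumulate matching values along one basic form (depth-first method).

    Returns (subtotal, phonemes_examined). The computation stops as soon as
    the running subtotal falls below the threshold.
    """
    subtotal = 0.0
    examined = 0
    for value in matching_values:
        subtotal += value
        examined += 1
        if subtotal < threshold:
            break
    return subtotal, examined


end_times = [0.1, 0.2, 0.1]                      # illustrative values
value = phoneme_matching_value(end_times)        # log10(0.4)
start_times = next_start_distribution(end_times)
subtotal, examined = depth_first_subtotal([-0.5, -2.0, -0.3], threshold=-2.0)
```

In the depth-first call, the subtotal drops below the threshold at the second phoneme, so the third phoneme is never examined.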
Precise matching has been implemented in APAL (Array Processor Assembly Language), the native
assembler of the Floating Point Systems, Inc. 190L.

Precise matching requires considerable memory to store each of the actual label probabilities
(that is, the probability that a given phoneme generates a given label y at a given transition),
the transition probabilities of each phoneme machine, and the probabilities of a given phoneme
being in a given state at a given time after a given start time. The 190L mentioned above is set
up to compute each of the following: end times; matching values based, preferably, on the
logarithm of the sum of the end-time probabilities; start times based on the previously
generated end-time probabilities; and word matching scores based on the matching values of the
sequential phonemes in a word.
Furthermore, precise matching preferably takes tail probabilities into account in the matching
procedure. A tail probability measures the likelihood of consecutive labels independently of
words. In a simple example, a given tail probability corresponds to the likelihood of one label
following another. This likelihood is easily determined from a string of labels generated, for
example, from some sample speech.

Precise matching therefore provides sufficient storage to hold the basic forms, the statistics
of the Markov models, and the tail probabilities. For a vocabulary of 5000 words in which each
word contains about 10 phonemes, the basic forms require 5000 × 10 storage locations. With 70
distinct phonemes (one Markov model per phoneme), 200 distinct labels, and 10 transitions at
which any label has a probability of being generated, the statistics would require
70 × 10 × 200 storage locations. However, the phoneme machine is preferably divided into three
parts (a beginning part, a middle part, and an end part), with a statistics table corresponding
to each. (Each of the three parts preferably contains one of the self-loops.) The storage
requirement is thereby reduced to 70 × 3 × 200. The tail probabilities require 200 × 200
storage locations. In this arrangement, 50K integer and 82K floating-point storage locations
work satisfactorily.
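The storage figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the numbers given in the text. The attribution of the basic forms to integer storage and of the statistics and tail probabilities to floating-point storage is an assumption, made because it reproduces the quoted 50K and 82K totals.

```python
# Figures quoted in the text for precise matching.
VOCABULARY_WORDS = 5000
PHONEMES_PER_WORD = 10
DISTINCT_PHONEMES = 70
DISTINCT_LABELS = 200
TRANSITIONS_PER_PHONEME = 10
PARTS_PER_PHONEME = 3  # beginning, middle, end

baseform_storage = VOCABULARY_WORDS * PHONEMES_PER_WORD
untied_statistics = DISTINCT_PHONEMES * TRANSITIONS_PER_PHONEME * DISTINCT_LABELS
tied_statistics = DISTINCT_PHONEMES * PARTS_PER_PHONEME * DISTINCT_LABELS
tail_storage = DISTINCT_LABELS * DISTINCT_LABELS

# Assumption: basic forms are stored as integers; probabilities as floats.
integer_storage = baseform_storage
float_storage = tied_statistics + tail_storage

print(integer_storage, float_storage)
```

Tying the ten transitions to three distributions reduces the label statistics from 140000 to 42000 locations.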
F1d Basic fast matching (FIGS. 16 to 18)

Because precise matching is computationally expensive, basic fast matching and alternative fast
matching are provided to reduce the required computation without sacrificing much accuracy.
Fast matching is preferably used in conjunction with precise matching: fast matching extracts
promising candidate words from the vocabulary, and precise matching is performed on, at most,
the candidate words in this list.

The fast approximate acoustic matching method is described in the above-mentioned U.S. Patent
Application No. 06/672974 (filed on November 19, 1984). In fast approximate acoustic matching,
each phoneme machine is preferably simplified by replacing the actual label probability of each
label at all transitions in the given phoneme machine with a specific replacement value.
The specific replacement value is preferably chosen so that the matching value of a given
phoneme obtained with the replacement values overestimates the matching value obtained by
precise matching, in which the replacement values do not replace the actual label
probabilities. One way to guarantee this condition is to choose each replacement value so that
no probability corresponding to the given label in the given phoneme machine is larger than its
replacement value. Replacing the actual label probabilities in a phoneme machine with the
corresponding replacement values greatly reduces the computation needed to determine a word
matching score. Moreover, since the replacement value is preferably an overestimate, the
resulting matching score is not less than the score that would have been determined without the
replacement.

In a particular embodiment that performs acoustic matching in a linguistic decoder with Markov
models, each phoneme machine is characterized, through training, by
(a) a plurality of states and transition paths between the states;
(b) transitions tr(Sj|Si) having probabilities T(i→j), each of which represents the probability
of a transition to state Sj given the current state Si (where Si and Sj may be the same state or
different states); and
(c) actual label probabilities, each actual label probability p(y_k|i→j) representing the
probability that label y_k (k being a symbol that identifies the label) is generated by the
given phoneme machine at a given transition from one state to the next.
Each phoneme machine includes
(a) means for assigning one specific value p′(y_k) to each y_k in that phoneme machine, and
(b) means for replacing each actual output probability p(y_k|i→j) at each transition in the
given phoneme machine with the specific value p′(y_k) assigned to the corresponding y_k.
The replacement value is preferably at least as large as the maximum actual label probability of
the corresponding label y_k at any transition in the particular phoneme machine.
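The choice of replacement values described above can be sketched in Python. The sketch takes the preferred choice mentioned in the text, namely the maximum actual label probability over all transitions of one phoneme machine. The function name, the dictionary layout, and the sample probabilities are assumptions made for illustration.

```python
def replacement_values(label_probs_by_transition):
    """Compute p'(y_k) for one phoneme machine.

    label_probs_by_transition: dict mapping a transition name to a dict
    {label: actual label probability p(y_k | i->j)} for that transition.
    Returns a dict {label: p'(y_k)} where p'(y_k) is the largest actual
    probability of y_k at any transition of the machine.
    """
    replacements = {}
    for label_probs in label_probs_by_transition.values():
        for label, prob in label_probs.items():
            replacements[label] = max(replacements.get(label, 0.0), prob)
    return replacements


# Illustrative phoneme machine with two non-null transitions.
machine = {
    "tr1": {"y1": 0.2, "y2": 0.8},
    "tr2": {"y1": 0.5, "y2": 0.5},
}
p_prime = replacement_values(machine)  # {"y1": 0.5, "y2": 0.8}
```

Because every actual probability is at most its replacement value, any product of label probabilities along a path can only grow after replacement, which is why the fast matching value overestimates the precise matching value. The replacement values generally no longer sum to 1.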
The fast matching embodiments are used to form a list of candidate words, on the order of 10 to
100 words, chosen as the words in the vocabulary most likely to correspond to the incoming
labels. The candidate words are preferably subjected to the language model and to precise
matching. By cutting the number of words considered by precise matching to the order of 1% of
the words in the vocabulary, the computation cost is greatly reduced while accuracy is
maintained.
Basic fast matching simplifies precise matching by replacing, with a single value, the actual
label probabilities of a given label at all transitions at which that label can be generated in
a given phoneme machine. That is, regardless of the transition in the given phoneme machine at
which the label has a probability of occurring, the probability is replaced by one specific
value. This value is preferably an overestimate, at least as large as the largest probability
of the label occurring at any transition in the given phoneme machine.

By setting the label probability replacement value to the maximum of the actual label
probabilities of the given label in the given phoneme machine, the matching value generated by
basic fast matching is guaranteed to be at least as large as the matching value that would
result from precise matching. Basic fast matching thus generally overestimates the matching
value of each phoneme, so that more words are generally selected as candidate words. Words that
would be considered candidates by precise matching also pass basic fast matching.

FIG. 16 shows the basic fast matching phoneme machine 3000. Labels (also called symbols and
fenemes) enter the basic fast matching phoneme machine 3000 together with a start-time
distribution. The start-time distribution and the label string are input in the same way as for
the precise matching phoneme machine described above. The start time is sometimes not a
distribution over a plurality of times but instead represents a precise time at which the
phoneme starts, for example following an interval of silence.
When the speech is continuous, however, the end-time distribution is used to form the start-time
distribution (as explained in more detail later). The basic fast matching phoneme machine 3000
generates an end-time distribution and, from the generated end-time distribution, a matching
value for the particular phoneme. The matching score of a word is defined as the sum of the
matching values of its constituent phonemes (at least the first h phonemes of the word).

FIG. 17 shows the basic fast matching computation. This computation is concerned only with the
start-time distribution, the number (or length) of labels generated by the phoneme, and the
replacement value p′(y_k) associated with each label y_k. By replacing all actual label
probabilities of a given label in a given phoneme machine with the corresponding replacement
value, basic fast matching replaces the transition probabilities with length-distribution
probabilities and makes it unnecessary to include the actual label probabilities (which may
differ for each transition in a given phoneme machine) and the probabilities of being in a given
state at a given time.

The length distribution is determined from the precise matching model. Specifically, for each
length in the length distribution, the procedure preferably examines each state individually and
determines, for each state, the transition paths by which the state under examination can occur
(a) given a particular label length and (b) regardless of the outputs along the transitions.
The probabilities of all transition paths of the particular length to each target state are
summed, and the sums for all target states are then added to give the probability of the given
length in the distribution. These steps are repeated for each length. In the preferred form of
the matching procedure, these computations are made on a trellis diagram, as is known in the art
of Markov modeling. For transition paths that share branches along the trellis structure, the
computation for each common branch needs to be made only once, and the result is applied to each
path containing the common branch.

FIG. 17 includes two restrictions by way of example. First, the length of labels generated by
the phoneme is assumed to be 0, 1, 2, or 3, with probabilities l₀, l₁, l₂, and l₃,
respectively. Second, the start time is also restricted: only four start times are allowed,
with probabilities q₀, q₁, q₂, and q₃, respectively. That is, L(l₀, l₁, l₂, l₃) and
Q(q₀, q₁, q₂, q₃) are assumed. Under these restrictions, the end-time distribution of the
target phoneme is defined by the following equations:
$$
\begin{aligned}
\Phi_0 &= q_0 l_0 \\
\Phi_1 &= q_1 l_0 + q_0 l_1 p_1 \\
\Phi_2 &= q_2 l_0 + q_1 l_1 p_2 + q_0 l_2 p_1 p_2 \\
\Phi_3 &= q_3 l_0 + q_2 l_1 p_3 + q_1 l_2 p_2 p_3 + q_0 l_3 p_1 p_2 p_3 \\
\Phi_4 &= q_3 l_1 p_4 + q_2 l_2 p_3 p_4 + q_1 l_3 p_2 p_3 p_4 \\
\Phi_5 &= q_3 l_2 p_4 p_5 + q_2 l_3 p_3 p_4 p_5 \\
\Phi_6 &= q_3 l_3 p_4 p_5 p_6
\end{aligned}
$$

Here p_m denotes the replacement value p′(y) of the label at time m. Examining these equations
shows that Φ₃ contains a term corresponding to each of the four start times. The first term
represents the probability that the phoneme starts at time t=t₃ and generates a label string of
length 0 (the phoneme ends at the same time as it starts). The second term represents the
probability that the phoneme starts at time t=t₂, that the label length is 1, and that label 3
is generated by the phoneme. The third term represents the probability that the phoneme starts
at time t=t₁, that the label length is 2 (namely labels 2 and 3), and that labels 2 and 3 are
generated by the phoneme. Similarly, the fourth term represents the probability that the
phoneme starts at time t=t₀, that the label length is 3, and that the three labels 1, 2, and 3
are generated by the phoneme.

Comparing the computation required by basic fast matching with that required by precise matching
shows that the former is relatively simple. The value of p′(y) remains the same wherever it
appears in the equations, as do the label length probabilities. In addition, the length and
start-time restrictions make the computation of later end times simpler. At Φ₆, for example,
the phoneme must start at time t=t₃ and all three labels 4, 5, and 6 must be generated by the
phoneme for that end time to apply.

To generate the matching value of the target phoneme, the end-time probabilities along the
defined end-time distribution are summed. If desired, the logarithm of the sum is taken, giving
the following expression:

$$
\text{matching value} = \log_{10}(\Phi_0 + \cdots + \Phi_6)
$$

As mentioned above, the word matching score is easily determined by summing the matching values
of the consecutive phonemes in a given word. The generation of the start-time distribution is
explained with reference to FIG. 18.
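Before turning to FIG. 18, the end-time equations of the FIG. 17 example can be written as one general sum: the end-time probability at time m adds, over every start time s, the start probability q_s times the probability of length m − s times the product of the replacement values of labels s+1 to m. The following Python sketch implements that sum under the same restrictions (four start times, lengths 0 to 3). The function name and the sample numbers are assumptions made for illustration.

```python
import math


def end_time_distribution(q, l, p):
    """End-time distribution for basic fast matching.

    q[s]: probability that the phoneme starts at time t_s
    l[n]: probability that the phoneme generates n labels
    p[k]: replacement value of the label at time k, for k >= 1
          (p[0] is an unused placeholder)
    Returns [Phi_0, ..., Phi_M] with M = (len(q) - 1) + (len(l) - 1).
    """
    last_end = (len(q) - 1) + (len(l) - 1)
    phi = []
    for m in range(last_end + 1):
        total = 0.0
        for s, q_s in enumerate(q):
            length = m - s
            if 0 <= length < len(l):
                product = 1.0
                for k in range(s + 1, m + 1):
                    product *= p[k]
                total += q_s * l[length] * product
        phi.append(total)
    return phi


# Illustrative values only.
q = [0.4, 0.3, 0.2, 0.1]
l = [0.1, 0.2, 0.3, 0.4]
p = [1.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]  # p[0] unused

phi = end_time_distribution(q, l, p)
matching_value = math.log10(sum(phi))
```

With these inputs the function returns seven values, one for each of the seven end-time equations of the example.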
In FIG. 18a, the word THE₁ is shown again, broken down into its constituent phonemes. In
FIG. 18b, the string of labels is shown along the time axis. FIG. 18c shows the first
start-time distribution. The first start-time distribution is derived from the end-time
distribution of the most recent preceding phoneme (in the preceding word, which may be a "word"
of silence). Based on the label input and the start-time distribution of FIG. 18c, the end-time
distribution Φ_DH of phoneme DH is generated (FIG. 18d). The start-time distribution of the
next phoneme, UH1, is determined by recognizing the times at which the end-time distribution of
the preceding phoneme exceeds the threshold A in FIG. 18d. A is determined individually for
each end-time distribution. A is a function of the sum of the end-time distribution values of
the target phoneme. The interval between time a and time b therefore represents the times over
which the start-time distribution of phoneme UH1 is set. In FIG. 18e, the interval between
time c and time d corresponds to the times at which the end-time distribution of phoneme DH
exceeds the threshold A and over which the start-time distribution of the next phoneme is set.
The values of the start-time distribution are obtained by normalizing the end-time distribution,
for example by dividing each end-time value by the sum of the end-time values that exceed the
threshold A.

The basic fast matching phoneme machine 3000 has been implemented on the Floating Point Systems,
Inc. 190L using an APAL program. Following the description herein, other hardware and software
may also be used to develop specific forms of the invention.

F1e Alternative fast matching (FIGS. 19 and 20)

Basic fast matching, used alone or, preferably, together with precise matching and/or a language
model, greatly reduces the computational requirements.
To further reduce the computational requirements, the present invention further simplifies
precise matching by forming a uniform label length distribution between two lengths, a minimum
length L_min and a maximum length L_max. In basic fast matching, the probabilities that a
phoneme generates labels of a given length (namely l₀, l₁, l₂, and so on) generally have
different values. In alternative fast matching, the probability of each label length is
replaced by one uniform value.

The minimum length is preferably equal to the smallest length having a nonzero probability in
the original length distribution, although other lengths can be selected if desired. The choice
of the maximum length is more arbitrary than that of the minimum length, but the probabilities
of lengths smaller than the minimum or larger than the maximum are set to 0. By defining the
length probability to exist only between the minimum length and the maximum length, a uniform
pseudo-distribution can be set forth. In one approach, the uniform probability is set to the
average probability over the pseudo-distribution. Alternatively, the uniform probability is set
to the maximum of the length probabilities that are replaced by the uniform value.

The effect of making all label length probabilities equal is easily seen from the equations for
the end-time distribution. Specifically, the length probability can be factored out as a
constant. With L_min set to 0 and all length probabilities replaced by a single constant value,
the end-time distribution can be expressed as follows:

$$
\Theta_m = \frac{\Phi_m}{l} = q_m + \Theta_{m-1}\, p_m \qquad (23)
$$

where l is the single uniform replacement value, and the value of p_m preferably corresponds to
the replacement value of the label generated at time m in the given phoneme.
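The recursion of equation (23) can be checked numerically. The Python sketch below computes Θ_m recursively and compares it with the direct sum obtained when every length probability equals the same constant l. It assumes L_min = 0 and no upper limit on the length, and takes Θ before time 0 to be 0. The function names and the sample numbers are assumptions made for illustration.

```python
import math


def theta_recursive(q, p, n_end):
    """Theta_m = q_m + Theta_{m-1} * p_m, one multiply and one add per step.

    q[m]: start-time probability at time m (taken as 0 beyond len(q))
    p[m]: replacement value of the label at time m (p[0] is unused)
    """
    thetas = []
    previous = 0.0  # Theta before time 0
    for m in range(n_end):
        q_m = q[m] if m < len(q) else 0.0
        previous = q_m + previous * p[m]
        thetas.append(previous)
    return thetas


def theta_direct(q, p, n_end):
    """Direct evaluation: sum over start times s of q_s * p_{s+1} * ... * p_m."""
    thetas = []
    for m in range(n_end):
        total = 0.0
        for s in range(min(m, len(q) - 1) + 1):
            product = 1.0
            for k in range(s + 1, m + 1):
                product *= p[k]
            total += q[s] * product
        thetas.append(total)
    return thetas


# Illustrative values only.
q = [0.4, 0.3, 0.2, 0.1]
p = [1.0, 0.5, 0.25, 0.5, 0.2, 0.1, 0.3]  # p[0] unused
uniform_l = 0.25                          # single uniform length probability

recursive = theta_recursive(q, p, 7)
direct = theta_direct(q, p, 7)
# Matching value with the constant length probability restored.
matching_value = math.log10(sum(recursive)) + math.log10(uniform_l)
```

The recursive form needs no inner loops, which is the source of the saving in multiplications and additions.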
For the above equation for Θ_m, the matching value is defined as follows:

$$
\text{matching value} = \log_{10}(\Theta_0 + \Theta_1 + \cdots + \Theta_m) + \log_{10}(l) \qquad (24)
$$

Comparing basic fast matching with alternative fast matching shows that the number of additions
and multiplications required is greatly reduced by using the alternative fast matching phoneme
machine. With L_min = 0, basic fast matching required 40 multiplications and 20 additions,
because the length probabilities must be considered, whereas in alternative fast matching Θ_m is
determined recursively, so that each successive Θ_m requires only one multiplication and one
addition.

FIGS. 19 and 20 show in detail how alternative fast matching simplifies the computation.
FIG. 19a shows an example of a phoneme machine 3100 corresponding to the minimum length
L_min = 0. The maximum length is assumed to be infinite so that the length distribution is
uniform. FIG. 19b shows the trellis diagram resulting from phoneme machine 3100. Assuming that
start times after q_n are outside the start-time distribution, every determination of a
successive Θ_m with m < n requires one addition and one multiplication. Determining a later end
time requires only one multiplication and no addition.

FIG. 20a shows a specific embodiment of a phoneme machine 3200 for the minimum length
L_min = 4, and FIG. 20b shows the corresponding trellis diagram. Because L_min = 4, the
trellis diagram of FIG. 20b has a probability of 0 along the paths marked U, V, W, and Z.
For end times between Θ₄ and Θ_n, four multiplications and one addition are required. For end
times greater than n+4, only one multiplication and no addition is required. This embodiment
has been implemented in APAL code on the FPS 190L mentioned above.

Additional states can be added to the embodiment of FIG. 19 or FIG. 20 as desired.

F1f Matching based on the first J labels (FIG. 20)

To further improve basic fast matching and alternative fast matching, only the first J labels of
the string entering a phoneme machine are considered in the matching. Assuming that labels are
generated by the acoustic processor of an acoustic channel at a rate of one label per
centisecond, a reasonable value for J is 100. In other words, labels corresponding to on the
order of one second of speech are supplied to determine the match between a phoneme and the
labels entering the phoneme machine. Limiting the number of labels examined provides two
advantages. First, decoding delay is reduced; second, problems in comparing the scores of short
words with those of long words are substantially avoided. The value of J can, of course, be
changed as desired.
The effect of limiting the number of labels examined can be seen from the trellis diagram of
FIG. 20b. Without this improvement, the fast matching score is the sum of the probabilities Θ_m
along the bottom row of the diagram. That is, the probability of being in state S₄ at each time
starting at t=t₀ (for L_min = 0) or t=t₄ (for L_min = 4) is determined as a Θ_m, and all the
Θ_m are then summed. For L_min = 4, the probability of being in state S₄ at any time before t₄
is 0. With the improvement, the summing of the Θ_m ends at time J. In FIG. 20b, time J
corresponds to time t_{n+2}.

Ending the examination after J labels, over the interval up to time J, leads to the following
two probability sums when the matching score is determined. First, as before, there is a row
computation along the bottom row of the trellis diagram, but only up to time J−1. The
probabilities of being in state S₄ at each time up to time J−1 are summed to give a row score.
Second, there is a column score, corresponding to the sum of the probabilities that the phoneme
is in each of the states S₀ to S₄ at time J. The column score is computed as follows:

$$
\text{column score} = \sum_{f=0}^{4} \Pr(S_f, J) \qquad (25)
$$

The phoneme matching score is obtained by summing the row score and the column score and taking
the logarithm of the sum.
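The row score and column score described above can be sketched in Python as follows. The sketch assumes that the trellis is already available as a table of state probabilities indexed by time and state, with the final state in the last column. The table layout, the function name, and the sample numbers are assumptions made for illustration.

```python
import math


def phoneme_matching_score(trellis, J):
    """Matching score limited to the first J labels.

    trellis[t][f] is Pr(S_f, t), the probability of being in state S_f at
    time t; the last entry of each row is the final state (S_4 in FIG. 20b).
    """
    final_state = len(trellis[0]) - 1
    # Row score: final-state probabilities at times 0 .. J-1.
    row_score = sum(trellis[t][final_state] for t in range(J))
    # Column score: all states at time J, as in equation (25).
    column_score = sum(trellis[J])
    return math.log10(row_score + column_score)


# Illustrative trellis with five states S_0..S_4 and three times.
trellis = [
    [0.5, 0.0, 0.0, 0.0, 0.0],
    [0.2, 0.1, 0.0, 0.0, 0.05],
    [0.1, 0.1, 0.05, 0.05, 0.1],
]
score = phoneme_matching_score(trellis, J=2)
```

Here the row score is 0.05 and the column score is 0.4, so the score is the base-10 logarithm of 0.45.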
To continue fast matching with the next phoneme, the values along the bottom row (preferably
including time J) are used to derive the start-time distribution of the next phoneme.

After the matching score of each of the consecutive phonemes has been determined, the total for
all the phonemes is, as mentioned above, the sum of the matching scores of all those phonemes.

Examining how the end-time probabilities are generated in the basic fast matching and
alternative fast matching embodiments described above shows that the determination of the column
score does not fit readily into the fast matching computations. To better adapt the improvement
of limiting the number of labels examined to fast matching and alternative fast matching, the
present invention allows the column score to be replaced by an additional row score. That is,
an additional row score is determined for the phoneme being in state S₄ (in FIG. 20b) between
times J and J+K, where K is the maximum number of states in any phoneme machine. Thus, if any
phoneme machine has 10 states, this improvement adds 10 end times along the bottom row of the
trellis diagram, and a probability is determined for each of them. All the probabilities along
the bottom row up to and including time J+K are added to generate the matching score of the
given phoneme. As before, the matching values of consecutive phonemes are summed to give the
word matching score.

This embodiment has been implemented in APAL code on the FPS 190L mentioned above, but, like
other parts of the system, it can also be implemented in other code on other hardware.

F1g Phoneme tree structure and fast matching embodiments (FIG. 21)

Using basic fast matching or alternative fast matching (with or without the maximum label
limit) greatly reduces the computation time required to determine the phoneme matching values.
Furthermore, the computational savings remain large even when precise matching is performed on
the words in the list obtained by fast matching.

Once determined, the phoneme matching values are compared along the branches of the tree
structure 4100, as shown in FIG. 21, to determine which paths of phonemes are most probable.
In FIG. 21, the sum of the phoneme matching values of phonemes DH and UH1 (from point 4102 to
branch 4104) should, for the spoken word "the", be much higher than that of the various
sequences of phonemes branching from phoneme MX. Note that the phoneme matching value of the
first phoneme MX is computed only once and is then used for each basic form extending from it.
(See branches 4104 and 4106.) Furthermore, when the total score computed along a first sequence
of branches is found to be much lower than a threshold, or much lower than the total score of
other sequences of branches, all basic forms extending from the first sequence can be removed
from the list of candidate words at the same time. For example, the basic forms associated with
branches 4108 to 4118 are discarded at the same time when it is determined that MX is not a
likely path. With the fast matching embodiments and the tree structure, an ordered list of
candidate words is generated with a large saving in computation.

As to storage requirements, the phoneme tree structure, the phoneme statistics, and the tail
probabilities are to be stored. For the tree structure, there are 25000 arcs and four data
words characterizing each arc. The first data word represents the index of the successor arcs
or phonemes. The second data word represents the number of successor phonemes along the branch.
The third data word indicates at which node of the tree the arc is located. The fourth data
word represents the current phoneme. The tree structure therefore requires 25000 × 4 storage
locations. In fast matching, there are 100 distinct phonemes and 200 distinct fenemes. Since a
feneme has a single probability of being generated anywhere in a phoneme, storage for 100 × 200
statistical probabilities is required. For the tail probabilities, 200 × 200 storage locations
are required. For fast matching, therefore, storage for 100K integers and 60K floating-point
values is sufficient.
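The tree-structured search described above, in which a shared phoneme is scored once and an unlikely branch is discarded together with every basic form that extends from it, can be sketched as follows. This is a simplified illustration, not the APAL implementation. It assumes that the matching value depends only on the phoneme identity (in the actual fast matching it also depends on the incoming labels and the start-time distribution), and the phoneme names IY and AY, the words "me" and "my", and all numbers are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    children: dict = field(default_factory=dict)  # phoneme -> Node
    words: list = field(default_factory=list)     # words ending at this node


def build_tree(baseforms):
    """Arrange basic forms (word -> list of phonemes) in a tree of shared prefixes."""
    root = Node()
    for word, phonemes in baseforms.items():
        node = root
        for phoneme in phonemes:
            node = node.children.setdefault(phoneme, Node())
        node.words.append(word)
    return root


def candidate_words(root, matching_value, threshold):
    """Return (score, word) pairs, best first, pruning branches below threshold."""
    found = []

    def walk(node, score):
        for word in node.words:
            found.append((score, word))
        for phoneme, child in node.children.items():
            branch_score = score + matching_value(phoneme)  # computed once per branch
            if branch_score < threshold:
                continue  # discards every basic form extending from this branch
            walk(child, branch_score)

    walk(root, 0.0)
    return sorted(found, reverse=True)


baseforms = {"the": ["DH", "UH1"], "me": ["MX", "IY"], "my": ["MX", "AY"]}
values = {"DH": -0.5, "UH1": -0.4, "MX": -5.0, "IY": -0.1, "AY": -0.2}
calls = {}


def matching_value(phoneme):
    calls[phoneme] = calls.get(phoneme, 0) + 1
    return values[phoneme]


result = candidate_words(build_tree(baseforms), matching_value, threshold=-3.0)
```

In this example the shared phoneme MX is scored once, falls below the threshold, and removes both words that begin with it, so only "the" remains in the candidate list.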
F1h Language model (FIGS. 2 and 22)

As mentioned above, including a language model that stores information about words in context
(such as trigrams) increases the probability of selecting the correct word. Language models are
described in the literature mentioned above.

The language model 1010 (FIG. 2) preferably has a unique character: a modified trigram method
is used. According to the invention, sample text is examined to determine the likelihood of
each ordered word triple, ordered word pair, and single word in the vocabulary. A list of the
most likely word triples and a list of the most likely word pairs are then formed. In addition,
the likelihood of a word triple not being in the triple list and the likelihood of a word pair
not being in the pair list are each determined.

According to the language model, when a target word follows two words, it is determined whether
the target word and the two preceding words are in the triple list. If they are, the stored
probability assigned to the triple is used. If the target word and the two preceding words are
not in the triple list, it is determined whether the target word and its adjacent preceding word
are in the pair list. If they are, the probability of the pair is multiplied by the probability
of a triple not being in the triple list, and the product is assigned to the target word. If
the triple and the pair containing the target word are in neither the triple list nor the pair
list, the probability of the target word alone is multiplied by the probability of a triple not
being in the triple list and by the probability of a pair not being in the pair list, and the
product is assigned to the target word.

FIG. 22 shows a flowchart 5000 for selecting words from the vocabulary using fast matching,
precise matching, and the language model. In block 5002, the words of the vocabulary are
defined, and in block 5004 each word is represented by a Markov model phonetic basic form. In
block 5006, the phonemes of the basic forms are arranged in a tree structure (as shown in
FIG. 21). The Markov models of the various words are trained in block 5008 by reading sample
text (described later). The probabilities associated with each Markov model word basic form
(for example, the transition probabilities and the label probabilities at the various
transitions) are retained for the precise matching performed later. In block 5010, the
probabilities of a given label in a given Markov model phoneme are replaced by a replacement
value, preferably an overestimate, so that basic fast matching can be performed in block 5012.
The label string from the acoustic processor (described above) provides the input to block 5012
through block 5014 (not shown) for approximate matching. If basic fast matching is to be
improved (block 5016, not shown), the uniform length distribution values are used in block
5018. If a further improvement is desired (block 5020), block 5022 limits the matching to the
first J labels. After the desired approximate fast matching, a list of candidate words in order
of probability is supplied to the language model in block 5024 (not shown). Next, block 5026
(not shown) performs precise matching on the promising words as determined by approximate
matching and the language model. The words from the precise matching of block 5026 (not shown)
are supplied to the language model in block 5028 (not shown), and the most likely word is
selected in block 5030 (not shown).
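The lookup rule of the modified trigram language model described above (triple list first, then pair list, then single word, each fallback scaled by the probability of not being in the corresponding list) can be sketched in Python. The function name, the dictionary layout, and all words and probabilities are hypothetical and serve only to illustrate the three cases.

```python
def word_probability(word, prev2, prev1, triples, pairs, singles,
                     p_triple_unlisted, p_pair_unlisted):
    """Probability assigned to `word` following the words prev2, prev1.

    triples: {(w1, w2, w3): probability} for the most likely word triples
    pairs:   {(w1, w2): probability} for the most likely word pairs
    singles: {w: probability} for single words
    p_triple_unlisted: probability that a triple is not in the triple list
    p_pair_unlisted:   probability that a pair is not in the pair list
    """
    if (prev2, prev1, word) in triples:
        return triples[(prev2, prev1, word)]
    if (prev1, word) in pairs:
        return pairs[(prev1, word)] * p_triple_unlisted
    return singles[word] * p_triple_unlisted * p_pair_unlisted


# Illustrative lists and probabilities only.
triples = {("in", "the", "house"): 0.2}
pairs = {("the", "cat"): 0.1}
singles = {"house": 0.01, "cat": 0.02, "dog": 0.03}

p_house = word_probability("house", "in", "the", triples, pairs, singles, 0.5, 0.4)
p_cat = word_probability("cat", "in", "the", triples, pairs, singles, 0.5, 0.4)
p_dog = word_probability("dog", "in", "the", triples, pairs, singles, 0.5, 0.4)
```

The three calls exercise the triple, pair, and single-word cases in turn.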
F2 Construction of phonetic and fenemic Markov model basic forms of words

F2a Construction of phonetic basic forms (FIG. 4, Tables 1 and 2)

One type of Markov model phoneme machine that can be used to form basic forms is based on
phonetics. That is, each phoneme machine corresponds to a given phonetic unit. For a given
word there is a sequence of phonetic sounds, each with its own corresponding phoneme machine.
Each phoneme machine includes a number of states and a number of transitions between them, some
of which can generate a feneme output and some of which (called null transitions) cannot.
As mentioned above, the statistics for each phoneme machine include (a) the probability of a
given transition occurring, and (b) the probability that a particular feneme is generated at a
given transition. Preferably, at each non-null transition there is some probability associated
with each feneme. The feneme alphabet shown in Table 1 contains about 200 fenemes. FIG. 4
shows a phoneme machine used to form phonetic basic forms. A sequence of such phoneme machines
is provided for each word. The statistics, that is, the probabilities, are entered into the
phoneme machines during a training phase in which known words are uttered. The transition
probabilities and feneme probabilities in the various phonetic phoneme machines are determined
during training by noting the feneme string(s) generated when a known phonetic sound is uttered
at least once and by applying the well-known forward-backward algorithm.
Table 2 shows a sample of the statistics for one phoneme, identified as phoneme DH. As an
approximation, the label output probability distributions of transitions tr1, tr2, and tr8 are
represented by a single distribution, as are those of transitions tr3, tr4, tr5, and tr9 and
those of transitions tr6, tr7, and tr10. This is indicated in Table 2 by the assignment of each
arc (that is, transition) to the column labeled 4, 5, or 6. Table 2 shows the probability of
each transition and the probability of each label (that is, feneme) being generated at the
beginning, middle, or end, respectively, of phoneme DH. For phoneme DH, for example, the
probability of a transition from state S₁ to state S₂ is counted as 0.07243, and the
probability of a transition from state S₁ to state S₄ is 0.92757. (Since these are the only
two possible transitions from the initial state, the two probabilities sum to 1.)
Regarding the label output probabilities, the probability that phoneme DH produces feneme AE13 (see Table 1) in its final portion, that is, in the column labeled 6 in Table 2, is 0.091. Table 2 also shows a count associated with each node (that is, state). The node count indicates the number of times during training that the phoneme was in the corresponding state. Statistics such as those in Table 2 exist for each phoneme machine.
The arrangement of phonetic phoneme machines into a sequence that forms a word baseform is typically performed by a phonetician and is normally not done automatically.

Phonetic baseforms have been used with some success in detailed matching and in fast approximate acoustic matching. However, because phonetic baseforms depend on the judgment of a phonetician and are not produced automatically, they are sometimes inaccurate.

F2b Construction of fenemic baseforms (FIGS. 23 and 24)

An alternative to the phonetic baseform is the fenemic baseform.
Instead of each phoneme machine corresponding to a phonetic unit, each phoneme machine in a fenemic baseform corresponds to one feneme of the feneme alphabet. A simple fenemic phoneme machine has two states, S₁ and S₂, and a plurality of transitions, each with its own probability (see FIG. 23). One transition is a null transition between the two states, on which no feneme can be produced. A first non-null transition between the two states has a respective probability for each feneme being produced on it. A second non-null transition is a self-loop at state S₁ and likewise has a probability for each feneme being produced on it. The simple fenemic phoneme machine can therefore produce zero, one, or two or more fenemes. FIG. 24 shows a trellis diagram based on fenemic phonemes.

When constructing Markov model phoneme machines in feneme units, it is preferable to base the phoneme machines on multiple utterances. Forming the phoneme machines from multiple utterances makes it possible to account for variations in the pronunciation of a given word. One method for doing so is disclosed in U.S. patent application No. 738933 (filed May 29, 1985).
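To illustrate why the simple two-state fenemic phoneme machine of FIG. 23 can produce zero, one, or two or more fenemes, the following Python sketch computes the distribution of the number of fenemes emitted on one pass through the machine. It is only an illustration: the function name and the three probability values are assumptions chosen for the example, not statistics from the patent.

```python
def feneme_count_distribution(p_loop, p_forward, p_null, max_count):
    """Return [P(0 fenemes), P(1 feneme), ..., P(max_count fenemes)].

    p_loop    -- probability of the emitting self-loop at S1 (assumed value)
    p_forward -- probability of the emitting transition S1 -> S2 (assumed value)
    p_null    -- probability of the null transition S1 -> S2 (assumed value)
    """
    if abs(p_loop + p_forward + p_null - 1.0) > 1e-9:
        raise ValueError("transition probabilities out of S1 must sum to 1")
    dist = []
    for k in range(max_count + 1):
        # k self-loops followed by the null transition emit exactly k fenemes.
        prob = (p_loop ** k) * p_null
        if k >= 1:
            # k - 1 self-loops followed by the emitting S1 -> S2 transition.
            prob += (p_loop ** (k - 1)) * p_forward
        dist.append(prob)
    return dist


# Hypothetical probabilities, for illustration only.
example = feneme_count_distribution(p_loop=0.2, p_forward=0.7, p_null=0.1,
                                    max_count=3)
print(example)
```

With these assumed values most of the probability mass falls on exactly one feneme, which agrees with the statement elsewhere in the description that a fenemic phoneme produces about one feneme on average.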
According to the application cited above, a fenemic baseform is constructed for each word segment in the vocabulary (for example, each word, a given syllable, or a part thereof) by the following method. The method includes the following steps:
(a) Convert each of multiple utterances of the word segment into a respective feneme string.
(b) Define a set of fenemic Markov model phoneme machines.
(c) Determine the single phoneme machine P₁ that best produces the multiple feneme strings.
(d) Determine the best two-phoneme baseform, of the form P₁P₂ or P₂P₁, for producing the multiple feneme strings.
(e) Align the best two-phoneme baseform against each feneme string.
(f) Split each feneme string into a left portion and a right portion, the left portion corresponding to the first phoneme machine of the two-phoneme baseform and the right portion corresponding to the second phoneme machine.
(g) Identify each left portion as a left substring and each right portion as a right substring.
(h) Process the set of left substrings in the same manner as the set of feneme strings corresponding to the multiple utterances, but inhibit further splitting of a substring when its single-phoneme baseform has a higher probability of producing the substring than does the best two-phoneme baseform.
(i) Process the set of right substrings in the same manner as the set of feneme strings corresponding to the multiple utterances, again inhibiting further splitting of a substring when its single-phoneme baseform has a higher probability of producing the substring than does the best two-phoneme baseform.
(j) Concatenate the unsplit single phonemes in an order corresponding to the order of the feneme substrings to which they correspond.
In a particular embodiment, the method further includes the following steps:
(k) Align the concatenated baseform against each of the feneme strings and identify, for each phoneme in the concatenated baseform, the substring in each feneme string that corresponds to it; the substrings corresponding to a given phoneme form a set of common substrings.
(l) For each set of common substrings, determine the phoneme machine having the highest joint probability of producing the common substrings.
(m) For each set of common substrings, replace the phoneme in the concatenated baseform with the phoneme determined to have the highest joint probability.
The baseform resulting from these replacements is a refined baseform. If steps (k) through (m) are repeated until no phoneme is replaced, a further refined baseform is obtained.

There are other ways to construct the fenemic baseform of a given word segment (for example, a word).
For example, a baseform can be derived from the output of the acoustic processor in response to a single utterance of the word segment. Regardless of how it is constructed, a fenemic baseform generally has the advantage of being more detailed and accurate than a phonetic baseform of the same word. Moreover, whereas phonetic baseforms are usually constructed manually by phoneticians, fenemic baseforms are generally constructed automatically.
However, a fenemic baseform is generally about ten times as long as a phonetic baseform, which is a considerable computational disadvantage.

In its most basic form, the invention relates to forming a composite baseform for a word segment (for example, a word) from a previously constructed fenemic baseform. An important problem has been how to shorten the baseform to a length comparable to that of a phonetic baseform while retaining the accuracy and automatic constructibility of the fenemic baseform. The present invention addresses this problem and makes it possible to generate automatically simple, phonetic-type baseforms of Markov model phoneme machines that represent the words in the vocabulary. The phonetic-type baseform is useful in speech recognition; for example, it can readily be used in the fast approximate matching procedure and the detailed matching procedure described above.
F3 Generation of the alphabet of composite fenemic phonemes (FIG. 4)

In deriving the alphabet of composite phonemes, it is assumed that there is a predetermined set of N fenemic phonemes that have been trained using the forward-backward algorithm. Given the set of N fenemic phonemes, the alphabet of composite phonemes is derived by performing the following steps:
(a) Select a pair of phonemes from the current set of phonemes.
(b) Determine the entropy of each phoneme of the selected pair. The entropy of a phoneme is defined as

$$\text{Entropy} = -\sum_{i=1}^{200} p_i \log_2 p_i$$

where the sum is taken over all fenemes (i = 1 to 200, there being 200 different fenemes) and each p_i is the probability that the phoneme produces feneme i.
(c) Determine the entropy of the composite phoneme (or corresponding cluster) that results from merging the two phonemes of the selected pair into a single phoneme. Each output probability associated with the composite phoneme is the average of the corresponding output probabilities of the two merged phonemes.
(d) Determine the loss of mutual information between phonemes and fenemes from the entropies of the composite phoneme and of each merged phoneme, according to

$$\text{Loss} = E_{\text{composite}} - \frac{E_{\text{phoneme 1}} + E_{\text{phoneme 2}}}{2}$$

The greater the loss of mutual information, the less well the composite phoneme reflects the two phonemes of the pair.
(e) Repeat steps (a) through (d) for each pair of fenemic phonemes.
(f) Select the composite phoneme whose merged pair yields the minimum loss, replace the two merged phonemes with that composite phoneme, and thereby reduce the number of phonemes in the set by one.
(g) Repeat steps (a) through (f) until the number of phonemes in the set is n. Each remaining phoneme is a composite phoneme.
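Steps (a) through (g) amount to a greedy, pairwise merging of output distributions. The following Python sketch shows one way this could be coded, under stated assumptions: each phoneme is reduced to a single list of feneme output probabilities, merged names are formed by joining the two names with "+", and the three toy distributions are invented for the example (a real alphabet would have about 200 phonemes over about 200 fenemes). Following step (c), the merged distribution is the plain average of the two merged distributions.

```python
import math
from itertools import combinations


def entropy(dist):
    """Entropy in bits: -sum(p_i * log2(p_i)), skipping zero probabilities."""
    return -sum(p * math.log2(p) for p in dist if p > 0.0)


def merge(d1, d2):
    """Output distribution of the composite phoneme: the average of the pair."""
    return [(a + b) / 2.0 for a, b in zip(d1, d2)]


def merge_loss(d1, d2):
    """Loss = E_composite - (E_phoneme1 + E_phoneme2) / 2."""
    return entropy(merge(d1, d2)) - (entropy(d1) + entropy(d2)) / 2.0


def build_composite_alphabet(phonemes, n):
    """Merge the minimum-loss pair repeatedly until n phonemes remain."""
    phonemes = dict(phonemes)
    while len(phonemes) > n:
        a, b = min(combinations(sorted(phonemes), 2),
                   key=lambda pair: merge_loss(phonemes[pair[0]],
                                               phonemes[pair[1]]))
        phonemes[a + "+" + b] = merge(phonemes.pop(a), phonemes.pop(b))
    return phonemes


# Toy data (hypothetical): three phonemes over a four-feneme alphabet.
toy = {
    "P1": [0.9, 0.1, 0.0, 0.0],
    "P2": [0.8, 0.2, 0.0, 0.0],
    "P3": [0.0, 0.0, 0.5, 0.5],
}
alphabet = build_composite_alphabet(toy, n=2)
print(sorted(alphabet))
```

In this toy case P1 and P2 have similar output distributions, so merging them loses little information, whereas merging either with P3 would lose a full bit; the sketch therefore merges P1 and P2 first.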
The typical 200 phonemes of the fenemic phoneme set are preferably reduced to about 50 distinct composite phonemes in the composite phoneme alphabet.

Each composite phoneme is preferably represented by a Markov model with seven states and thirteen transitions, as shown in FIG. 4. Each transition has a transition probability corresponding to its likelihood. Each non-null transition (tr1 to tr10) preferably has a plurality of output probabilities associated with it, each output probability representing the likelihood that a given feneme is produced on that particular non-null transition.

The set of 50 composite phonemes is preferably used to construct simple baseforms, as described in section F4 below. The number and makeup of the composite phonemes in the alphabet can be chosen to resemble a phonetic alphabet of phonemes selected according to known practice in phonetics.

F4 Automatic formation of word baseforms of short phonetic-type Markov models (FIGS. 1A, 1B, 21, 23, and 25 to 30)

According to the invention, simple Markov model baseforms are formed as follows. First, it is assumed that a fenemic baseform has been constructed for each word, each baseform having been formed according to one of the methods described above. The fenemic baseforms are preferably constructed from multiple utterances and stored in memory.
In general, the fenemic baseform of a word is on the order of 60 to 100 fenemes long.

In FIG. 25, the target word is represented by f feneme strings, each corresponding to one utterance of the word. Each of the blocks 1 through 9 (and so on) represents a feneme (that is, a label) produced by the acoustic processor. In this example, one label is produced every centisecond. The fenemes in each string will probably differ from one pronunciation of the target word to another.

According to the invention, it is assumed that the fenemic baseform of the target word is known. FIG. 26 shows this fenemic baseform. The sequence of fenemic phonemes corresponding to the target word (the fenemic baseform) is formed in advance, preferably by the multiple-utterance method of constructing fenemic baseforms described above. For each of the f utterances of the target word there is a feneme string, FS₁ through FS_f. The fenemes are selected from an alphabet, that is, a set, of 200 fenemes. With the simple phoneme model shown in FIG. 23, a fenemic phoneme can produce zero, one, or two or more fenemes, but in general each fenemic phoneme produces one feneme on average. FIG. 26 shows the fenemic baseform made up of fenemic phonemes, each identified by the letter P with a subscript.

According to the invention, each composite phoneme corresponds to at least one of the N phonemes in the fenemic phoneme alphabet, and each fenemic phoneme corresponds to only one composite phoneme. The number n of composite phonemes in the set of composite phonemes is smaller than the number N of phonemes in the set of fenemic phonemes. The new alphabet preferably has considerably fewer elements than the original alphabet, for example 40 or 50 elements compared with the order of 200 elements in the original alphabet. One way of determining the phonemes of the alphabet is outlined in the preceding section.
Given the alphabet of n composite phonemes, the fenemic baseform of each word in the vocabulary, and multiple utterances of the target word, a simple baseform is formed by the following method. First, each utterance is represented by a respective one of the feneme strings FS₁ through FS_f, as shown in FIG. 25. Each feneme string is then aligned against the fenemic baseform of the target word shown in FIG. 26. The result of the alignment is shown in FIG. 27.
In FIG. 27, the fenemic baseform is shown aligned against each string. The alignment is preferably obtained by Viterbi alignment, which is discussed in various papers such as F. Jelinek, "Continuous Speech Recognition by Statistical Methods," Proceedings of the IEEE, Vol. 64, April 1976, pp. 532-556.

The subscript of each phoneme in FIG. 27 identifies the phoneme among the 200 phonemes of the alphabet. Each string is represented by the same sequence of phonemes, namely P₁P₁P₃P₄P₂...P₄. The alignment of the phonemes to the fenemes, however, differs from string to string. For a typical word lasting about one second there are about 100 fenemes and about 100 fenemic phonemes.
FIG. 28 shows the result of replacing the fenemic phonemes with composite phonemes. Each phoneme of FIG. 27 has been replaced by its corresponding composite phoneme, denoted CP_i, where the subscript i identifies the composite phoneme within the set of composite phonemes. Phonemes P₁ and P₂ are replaced by CP₁₀, phoneme P₃ by CP₁₁, and phoneme P₄ by CP₁₃. The number of composite phonemes for each string in FIG. 28 equals the number of phonemes for each of the strings FS₁ through FS_f in FIG. 27.

Each composite phoneme preferably has seven states and thirteen transitions, structurally similar to the phonetic-type phoneme described above. Thus, in FIG. 28, the simple two-state fenemic phonemes are replaced by seven-state composite phonemes.

Next, in FIG. 29, the same two adjacent composite phonemes aligned against each feneme string (namely, the leftmost composite phoneme and the composite phoneme to its right) are replaced by a single composite phoneme. Specifically, the two adjacent composite phonemes identified as CP₁₀ (in FIG. 28) are replaced by the single composite phoneme CP₁₀.
Similarly, in FIG. 30, a pair of adjacent composite phonemes is again replaced by a single composite phoneme: the adjacent composite phonemes CP₁₁ and CP₁₃ (of FIG. 29) are replaced by the single composite phoneme CP₈. FIGS. 29 and 30 show that each merge reduces the number of composite phonemes in the sequence by one. As explained below, a baseform of a specified length can be obtained by continuing to merge pairs of composite phonemes.

The process of iteratively merging two composite phonemes is based on closeness. For each pair of adjacent composite phonemes in a given sequence, the adverse effect that would result from merging the two phonemes is determined. The pair whose merger has the smallest adverse effect is regarded as the closest pair and becomes the candidate for the next merge.

Although there are various ways of measuring the adverse effect of a merge, it is preferably defined as MAX(a/A, b/B).
A and B are selectable threshold values that limit the calculated length of the baseform; a and b are defined as follows.

Let CP_n and CP_o be two composite phonemes forming a pair of adjacent composite phonemes in the sequence, and let F_n and F_o be the runs of fenemes aligned against them, respectively. The composite phoneme CP_no is defined as the composite phoneme that maximizes the joint probability of producing the fenemes F_no, where F_no is the concatenation of the two adjacent feneme runs F_n and F_o. CP_no is the phoneme that would replace CP_n and CP_o if those two adjacent composite phonemes were the closest pair and were therefore to be merged. L_n, L_o and L_no denote the log probabilities of F_n, F_o and F_no given the composite phonemes CP_n, CP_o and CP_no, respectively. The factor a is given by

$$a = L_n + L_o - L_{no}$$

Thus a measures the decrease in probability caused by merging composite phonemes CP_n and CP_o; its value increases as the adverse effect of merging the two phonemes increases.

Next, b is determined by the method known as the paired t-test. The paired t-test is described in various texts, for example W. Mendenhall, "Introduction to Probability and Statistics," 5th Edition, Duxbury Press, Massachusetts, 1979, pp. 294-297. T₁ is the t-statistic obtained by applying a paired t-test to the difference between the log probabilities of the fenemes in the run F_n when those fenemes are produced (i) by composite phoneme CP_n and (ii) by composite phoneme CP_no. Similarly, T₂ is the t-statistic for the fenemes in the run F_o when those fenemes are produced (i) by composite phoneme CP_o and (ii) by composite phoneme CP_no. For example, for each feneme in the run F_o there is a probability that composite phoneme CP_o produces that feneme and a probability that composite phoneme CP_no produces it, and these two probabilities can be compared as a pair. This process is repeated for each run of fenemes, it being understood that a separate run is associated with each utterance of the word. Z₁ and Z₂ denote the values taken by T₁ and T₂ after the t-distribution has been converted to an N(0, 1) distribution. The factor b is then given by

$$b = \mathrm{MAX}(Z_1, Z_2)$$

b indicates how (statistically) significant the difference in feneme probabilities is when composite phonemes CP_n and CP_o are merged into CP_no.

By taking the maximum of a/A and b/B for each pair of composite phonemes, the pair whose merger has the least adverse effect can be determined. This closest pair can then be merged and replaced by a single composite phoneme.
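The closeness measure MAX(a/A, b/B) can be illustrated with a short Python sketch. This is a simplified illustration rather than the patented implementation: the function names and all numeric values are assumptions, and the conversion of the t-statistics T₁ and T₂ into the N(0, 1) values Z₁ and Z₂ is not shown, so Z₁ and Z₂ are passed in directly.

```python
import math
import statistics


def paired_t_statistic(log_probs_separate, log_probs_merged):
    """Paired t-statistic of per-feneme log-probability differences.

    log_probs_separate -- log probability of each feneme under CP_n (or CP_o)
    log_probs_merged   -- log probability of the same feneme under CP_no
    Requires at least two fenemes whose differences are not all identical.
    """
    diffs = [s - m for s, m in zip(log_probs_separate, log_probs_merged)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


def merge_adverse_effect(l_n, l_o, l_no, z1, z2, limit_a, limit_b):
    """Return MAX(a/A, b/B) for one candidate merge.

    a = L_n + L_o - L_no is the drop in log probability caused by the merge;
    b = MAX(Z1, Z2) is the significance of the change in feneme probabilities.
    limit_a and limit_b play the role of the thresholds A and B.
    """
    a = l_n + l_o - l_no
    b = max(z1, z2)
    return max(a / limit_a, b / limit_b)


# Hypothetical numbers, for illustration only.
t1 = paired_t_statistic([-1.0, -2.0, -3.0], [-2.0, -4.0, -6.0])
effect = merge_adverse_effect(l_n=-10.0, l_o=-12.0, l_no=-25.0,
                              z1=1.0, z2=2.0, limit_a=5.0, limit_b=4.0)
print(t1, effect)
```

A value of the measure above 1 means that a exceeds A or b exceeds B, so in this scheme a merge would be accepted only while the returned value stays at or below 1.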
The examination of all pairs of composite phonemes and the merging of the two closest adjacent composite phonemes continue until the number of composite phonemes in the baseform reaches a specified limit or until the distance between the closest pair of phonemes exceeds 1 (that is, until a or b, or both, exceeds its respective threshold).

If the composite phonemes are chosen to reflect a phonetic alphabet, the resulting sequence of composite phonemes provides not only a simple baseform acceptable for speech recognition but also an automatically generated description of the pronunciation of the word. The latter result can also be used in the field of phonetics.

FIGS. 1A and 1B show a basic flowchart 8000 of the method of the invention. According to this flowchart, the target word is uttered once or, preferably, multiple times (block 8002). Each utterance produces a feneme string. In block 8004, the fenemic word baseform having the best (joint) probability of producing the string or strings generated for the target word is selected. The baseform of each word in the vocabulary is formed in advance in block 8006.
Next, in block 8008, each string of the target word is aligned against the selected baseform, so that each fenemic phoneme in the baseform is aligned with the corresponding fenemes in each string (of the target word). Then, in block 8009, each fenemic phoneme is replaced by its corresponding composite phoneme. The following steps are then performed:
(a) Select a pair of adjacent composite phonemes (block 8010).
(b) For each feneme string, determine the feneme substring that is aligned against the selected pair of adjacent phonemes (block 8012).
(c) From the alphabet of composite phonemes, which is formed to have fewer elements than the alphabet of fenemic phonemes (block 8013), determine the composite phoneme having the best joint probability of producing the substrings determined for the respective feneme strings (block 8014).
(d) Repeat steps (a) through (c) for all pairs of adjacent composite phonemes (block 8016), so that a single composite phoneme is determined for each pair of adjacent composite phonemes.
(e) Measure the adverse effect of replacing each phoneme pair in the baseform with the single composite phoneme determined for it (block 8020).
(f) If the minimum adverse effect does not exceed a predetermined threshold (block 8024), replace the phoneme pair yielding the minimum adverse effect with its corresponding composite phoneme (block 8022).
(g) The replacement of the phoneme pair with its corresponding composite phoneme provides a new baseform of phonemes (block 8026).
(h) If the number of phonemes is greater than a predetermined desired number (block 8028), repeat steps (a) through (g) on the new baseform.
Step (h) is repeated until the baseform has been shortened to the desired length (block 8030) or until the adverse effect of replacing a phoneme pair with a composite phoneme exceeds the threshold (block 8024).

This procedure is repeated for each word in the vocabulary until all words have been processed (block 8040). Thus, for each word in the vocabulary there is either a baseform of composite phonemes shortened to the desired length or a baseform that cannot be shortened further without exceeding the adverse-effect threshold.

A baseform shortened by this procedure is a phonetic-type baseform and can be used in performing the detailed acoustic matching and (or) the fast approximate acoustic matching described above. That is, the tree structure 4100 of FIG. 21 (which represents the various phoneme sequences stored in memory) can contain baseforms shortened according to the invention.
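The shortening loop of blocks 8010 through 8030 can be sketched in Python as follows. The sketch abstracts the acoustic part away: `best_merge` is a hypothetical callback standing in for steps (a) through (e), returning the best single composite phoneme for an adjacent pair together with its adverse effect, and the toy lookup table is invented so that the example reproduces the merges of FIGS. 29 and 30 (CP10 CP10 into CP10, then CP11 CP13 into CP8). In a real system the callback would depend on the feneme substrings aligned against each pair.

```python
def shorten_baseform(baseform, best_merge, target_length, max_effect=1.0):
    """Greedily merge adjacent composite phonemes.

    Stops when the baseform reaches target_length or when the least harmful
    merge would have an adverse effect above max_effect.
    """
    phonemes = list(baseform)
    while len(phonemes) > max(target_length, 1):
        # Evaluate every adjacent pair in the current baseform.
        candidates = []
        for i in range(len(phonemes) - 1):
            replacement, effect = best_merge(phonemes[i], phonemes[i + 1])
            candidates.append((effect, i, replacement))
        effect, i, replacement = min(candidates)
        if effect > max_effect:
            break  # no further merge is acceptable
        phonemes[i:i + 2] = [replacement]  # baseform becomes one phoneme shorter
    return phonemes


# Hypothetical merge table: (left, right) -> (composite phoneme, adverse effect).
TOY_MERGES = {
    ("CP10", "CP10"): ("CP10", 0.1),
    ("CP11", "CP13"): ("CP8", 0.4),
}


def toy_best_merge(left, right):
    # Any pair not in the toy table is treated as too costly to merge.
    return TOY_MERGES.get((left, right), (left, 5.0))


short = shorten_baseform(["CP10", "CP10", "CP11", "CP13"],
                         toy_best_merge, target_length=2)
print(short)
```

Asking for a target length of 1 with the same toy table leaves the result at two phonemes, because the only remaining merge exceeds the adverse-effect threshold; this mirrors the case of a baseform that cannot be shortened further.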
In this case, each phoneme is a composite phoneme. The baseform of each word in FIG. 21 would therefore be a sequence of composite phonemes in a shortened baseform, each composite phoneme in the sequence being represented by a phoneme machine that is used in matching against incoming labels, or fenemes.

F5 Comparison results

The results of performance tests of speech recognition systems using various types of baseforms are as follows.

First, for a system with a vocabulary of 62 keyboard characters and 600 arbitrary utterances, manually determined phonetic baseforms produced 28 errors. Fenemic baseforms constructed from a single utterance produced 14 errors, and fenemic baseforms constructed from multiple utterances produced 5 errors.

Second, for a vocabulary of 2000 words common in business correspondence and a script of 2070 arbitrary utterances, a system using conventional phonetic baseforms produced 108 errors, compared with 42 errors using fenemic baseforms constructed from multiple utterances.

Finally, the following results were obtained using the "shortened" baseforms described in section F6. With a vocabulary of 2000 typical office words and a script of 2070 arbitrary words, and using the fast approximate acoustic matching described above, there were 8 errors associated with phonetic baseforms and 2 errors associated with the shortened baseforms.
These results show that fenemic baseforms based on multiple utterances are more accurate than standard phonetic baseforms and than fenemic baseforms generated from a single utterance. They also show that the shortened baseforms are a significant improvement over phonetic baseforms. The results further indicate that the preferred practice is a shortened baseform based on multiple utterances.

F6 Alternative embodiments

For example, the set of composite phonemes can be determined by methods different from the one described above. Also, a closeness measure such as MAX(a/A, b/B) can be recast as a/A alone, b/B alone, or some other measure that produces comparable results.

Furthermore, the method of constructing shortened baseforms according to the invention is preferably based on multiple utterances of a word and processes the feneme string corresponding to each utterance. However, the invention can also be used to construct a shortened baseform from a single utterance of the target word. With a single utterance, the composite phoneme that yields the highest probability of producing the fenemes of the one substring is considered for replacing the adjacent phoneme pair; because there is only one string, there is no joint probability to consider. Because of variations in pronunciation, shortened baseforms based on multiple utterances are more accurate and are preferred.

[Table]

[Table]

G Effects of the invention

According to the present invention, a given word can be converted from a word baseform consisting of fenemic phonemes into a shortened word baseform consisting of composite phonemes.

[Brief explanation of drawings]

FIGS. 1A and 1B are flowcharts showing the method of the invention.
FIG. 1C shows how FIGS. 1A and 1B are arranged relative to each other.
FIG. 2 is a general block diagram of a system environment in which the invention can be practiced.
FIG. 3 is a block diagram showing in detail the stack decoder in the system environment of FIG. 2.
FIG. 4 shows a detailed-match phoneme machine that is identified and represented in storage by statistics obtained during a training session.
FIG. 5 shows successive steps of stack decoding.
FIG. 6 shows the likelihood vectors of respective word paths and a likelihood envelope.
FIG. 7 is a flowchart of the stack decoding procedure.
FIG. 8 shows the elements of the acoustic processor.
FIG. 9 shows portions of a typical human ear, indicating where the components of the acoustic model are formed.
FIG. 10 is a block diagram showing portions of the acoustic processor.
FIG. 11 shows the relationship between sound intensity and frequency used in designing the acoustic processor.
FIG. 12 shows the relationship between sones and phons.
FIG. 13 shows how sound is characterized by the acoustic processor of FIG. 8.
FIG. 14 shows how the thresholds in FIG. 13 are updated.
FIG. 15 shows the trellis, or lattice, of the detailed matching procedure.
FIG. 16 shows a phoneme machine used in performing matching.
FIG. 17 is a time distribution diagram used in a matching procedure having specific conditions.
FIGS. 18(a) to 18(e) show the interrelationship between phonemes, a label string, and the start and end times determined in the matching procedure.
FIGS. 19(a) and 19(b) show a particular phoneme machine with a minimum length of 0 and the corresponding start-time distribution.
FIGS. 20(a) and 20(b) show a particular phoneme machine with a minimum length of 4 and the corresponding trellis.
FIG. 21 shows a tree structure of phonemes that permits multiple words to be processed simultaneously.
FIG. 22 is a general flowchart of the steps performed in forming trained word baseforms.
FIG. 23 shows a fenemic phoneme machine.
FIG. 24 is a trellis diagram of a sequence of fenemic phoneme machines.
FIG. 25 shows multiple feneme strings, each produced in response to a corresponding utterance of the target word.
FIG. 26 shows the fenemic baseform (comprising a sequence of fenemic phonemes) having the highest joint probability of producing the feneme strings of FIG. 25.
FIG. 27 shows the fenemic baseform aligned against each feneme string.
FIG. 28 shows the replacement of each fenemic phoneme by its corresponding composite phoneme.
FIG. 29 shows a new baseform in which a common pair of adjacent composite phonemes associated with each string shown in FIG. 28 has been replaced by the same single composite phoneme.
FIG. 30 shows a new baseform in which a common pair of adjacent composite phonemes associated with each string shown in FIG. 29 has been replaced by the same single composite phoneme.

1000: speech recognition system; 1002: stack decoder; 1004: acoustic processor; 1006, 1008: array processors; 1010: language model; 1012: workstation; 1020: search device; 1022, 1024, 1026, 1028: interfaces.

Claims

[Claims] 1. A first word Markov model formed by concatenating first word partial Markov models each corresponding to a set of first phonemes each representing an acoustic type that can be assigned to a minute time interval. , respectively for each word in the vocabulary, each of which corresponds to one or more of the first phonemes; A set of second word partial Markov models with complex transitions are prepared corresponding to each of the set of second phonemes, and the second word partial Markov model is prepared based on the first word Markov model.
The second model formed by concatenating word partial Markov models
A method for constructing a word Markov model, the method comprising: (a) generating a string of first phonemes in response to the utterance of a target word; and (b) constructing a first word Markov model of the target word according to the generation. (c) replacing each of the first word partial Markov models of the first word Markov model with a corresponding second word partial Markov model; forming a string of word partial Markov models; (d) forming a string of adjacent second word partial Markov models in the string of second word partial Markov models;
selecting a pair of models; (e) determining a substring of the first phoneme that is aligned to the selected pair of adjacent second word partial Markov models; and (f) second word partial Markov models. (g) select the second word partial Markov model that produces the determined substring with the highest probability; (h ) Repeat steps (d) to (g) for each pair of adjacent second word partial Markov models in the string of second word partial Markov models, and (i) align the second word partial Markov models to the determined substring; (j) measuring the adverse effect of replacing each pair of adjacent second word partial Markov models with a correspondingly selected second word partial Markov model according to a predetermined criterion; synthesizing a second word Markov model by replacing at least one adjacent pair of second word partial Markov models with a correspondingly selected second word partial Markov model based on the adverse effect; Characteristic Ward Markov model construction method.