JPH0375879B2


Info

Publication number: JPH0375879B2
Application number: JP61032050A
Authority: JP
Inventors: Rai Baaru Raritsuto; Sherudan Koohen Hooru; Reroi Maasaa Robaato
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 1986-02-18
Filing date: 1986-02-18
Publication date: 1991-12-03
Also published as: JPS62194294A

Description

[Detailed Description of the Invention] The invention is described in the following order.

A. Field of industrial application
B. Summary of the disclosure
C. Prior art
C1 Phonetic models of words (Fig. 2)
C2 Statistical models and the recognition process
C3 Word combinations, influence of context, and open-ended word graphs (Figs. 3 and 4)
C4 Recognition process with open-ended graphs
D. Problems to be solved by the invention
E. Means for solving the problems
F. Embodiment
F1 Partitioning into subgraphs (clinks) (Fig. 2)
F2 Precomputation of boundary links (blinks) (Fig. 5)
F3 Reduction to clinks and storage of results (Figs. 6A and 6B)
F4 Recognition procedure using precomputed blinks (Fig. 1)
F5 Specific numerical example
F6 Environment of the speech recognition system of the invention
F6a General description (Fig. 7, Figs. 10-12)
F6b Auditory model and its implementation in the acoustic processor of the speech recognition system (Figs. 13-19)
F6c Detailed matching (Figs. 9 and 20)
F6d Basic fast matching (Figs. 21-23)
F6e Alternative fast matching (Figs. 24 and 25)
F6f Matching based on the first J levels (Fig. 25)
F6g Phoneme tree structure and fast matching embodiment (Fig. 26)
F6h Language model (Fig. 7)
F6i Training by approximation (Fig. 27)
F6j Extension of word paths by words selected through acoustic matching (Fig. 7, Figs. 10-12, Fig. 28)
F7 Appendices
G. Effects of the invention

A. Field of industrial application

The present invention relates to speech recognition apparatus and, in particular, to apparatus that recognizes continuous speech using phonetic graphs as word models.

B. Summary of the disclosure

A continuous speech recognition system is disclosed that has a speech processor and a word recognition computer as subsystems. Associated with the speech processor, the system comprises: means for developing a graph of confluent links (hereinafter also called clinks, or CLINKs) between confluent nodes; means for developing a graph of boundary links (hereinafter also called blinks, or BLINKs) between adjacent words; means for storing an inventory of the confluent links and boundary links as a coded inventory; and means for converting an unknown utterance into an encoded sequence of confluent links and boundary links corresponding to recognition sequences stored in the recognition vocabulary of the word recognition subsystem. The invention also includes a method of performing continuous speech recognition by characterizing the speech that matches a candidate word as a sequence of confluent links. The invention can further be applied to isolated-word speech recognition in the same way as to continuous speech recognition, except that there are then no boundary links.

C. Prior art

In systems that use a very small vocabulary of a few hundred words and are designed to recognize only isolated words spoken separately, storage requirements are low and access speed is not critical.

More powerful speech recognition systems use vocabularies of several thousand to several tens of thousands of words. Moreover, to recognize consecutive words in continuous speech, the conventional models, that is, the phonetic graphs, must be extended to accommodate the many pronunciation variations that can occur at the beginning and end of a word, because the contexts in which the word occurs can differ.

Techniques for dealing with these problems are described in P. S. Cohen and R. L. Mercer, "The Phonological Component of an Automatic Speech Recognition System," in Speech Recognition, edited by D. R. Reddy, Academic Press, New York, 1975, pp. 275-320, and in an article by the same authors, "Method for Rapidly Applying Context-Sensitive Phonological Rules," IBM Technical Disclosure Bulletin, Vol. 24, No. 8, January 1982, pp. 4084-4086.

C1 Phonetic models of words (Fig. 2)

Speech recognition systems use several approaches to represent the words of a given vocabulary. One way to represent a word is as a string of phonemes, each of which is a model of a basic speech sound that corresponds roughly to, for example, a vowel or consonant written as a letter of the alphabet. Because a word can be pronounced in many different ways, one must either keep as many models as there are pronunciations or use a single model that accommodates all of them. An intermediate approach is to represent each word by at least one baseform and to provide a set of rules for the variations that may occur. The Cohen and Mercer article cited above describes this approach well.

Fig. 2 shows one possible phonetic graph for a hypothetical word with the baseform "ABCDEF". Several pronunciations are possible for the same part of a word (a phone, a letter, or a sequence of them). Each path through the word model, that is, through the phonetic graph, corresponds to one pronunciation of the word.

C2 Statistical models and the recognition process

To provide a means of recognizing speech, each word model is trained: the word is spoken several times and statistics are collected, yielding values that represent the probability that any given path, or any given branch, of the word graph is taken when the word is spoken. During actual speech recognition, each utterance is converted by a so-called acoustic processor into a string of acoustic labels. This label sequence is then compared with all word models by a specific procedure (for example, the Viterbi algorithm), which produces for each word a probability value indicating how likely it is that the label sequence actually represents that word. The word with the highest probability is output as the recognized word.
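As an illustration only, the scoring step described above can be sketched in Python. The two toy word models, their probabilities, and the names `viterbi_log_score` and `recognize` are all hypothetical, and labels are emitted from states rather than arcs to keep the sketch short; the text names the Viterbi algorithm but does not specify this model layout.

```python
import math

NEG_INF = float("-inf")

def _log(p):
    """Natural log that maps a zero probability to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF

def viterbi_log_score(labels, model):
    """Log-probability of the single best state path that emits `labels`."""
    states = model["states"]
    # Initialise with the start distribution and the first label.
    best = {s: _log(model["start"].get(s, 0.0)) + _log(model["emit"][s].get(labels[0], 0.0))
            for s in states}
    for label in labels[1:]:
        # For each state keep only the best predecessor (the Viterbi recursion).
        best = {
            s: max(best[r] + _log(model["trans"][r].get(s, 0.0)) for r in states)
               + _log(model["emit"][s].get(label, 0.0))
            for s in states
        }
    return max(best.values())

def recognize(labels, models):
    """Return the word whose model gives the label string the highest score."""
    return max(models, key=lambda word: viterbi_log_score(labels, models[word]))

# Hypothetical two-state, left-to-right word models over the labels "a" and "b".
TOPOLOGY = {"s0": {"s0": 0.5, "s1": 0.5}, "s1": {"s1": 1.0}}
MODELS = {
    "yes": {"states": ["s0", "s1"], "start": {"s0": 1.0}, "trans": TOPOLOGY,
            "emit": {"s0": {"a": 0.9, "b": 0.1}, "s1": {"a": 0.1, "b": 0.9}}},
    "no":  {"states": ["s0", "s1"], "start": {"s0": 1.0}, "trans": TOPOLOGY,
            "emit": {"s0": {"a": 0.1, "b": 0.9}, "s1": {"a": 0.9, "b": 0.1}}},
}

best_word = recognize(["a", "a", "b"], MODELS)
```

Working in the log domain avoids numerical underflow on long label strings; a trained system would take the probabilities from the statistics gathered during training rather than from hand-written tables.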

C3 Word combinations, influence of context, and open-ended word graphs (Figs. 3 and 4)

The procedure described above is suitable for recognizing single, isolated words. An important characteristic of continuous speech, however, is that words are joined to one another when spoken, and that the pronunciation of each particular word depends on its context, that is, on which word was spoken before it and which word after it. Consequently, in addition to the task of determining word ends, or boundaries, the specific problems described below arise. Fig. 3 shows the phonetic graphs of two separate words (BRAND and NEW) and the combined graph obtained by joining them. (In the figure, vertical lines represent word boundaries.) Joining the words can change the pronunciation of the end of the first word and of the beginning of the second. In another context, the beginning and end of each of the two words may change in ways different from those shown in Fig. 3.

In general, because each word is joined to other words when spoken, its beginning and end can each take a certain number of different pronunciations. These possibilities can be built into the phonetic graph, or model, of each word, as shown in the example of Fig. 4, which is the phonetic graph of the word THOSE. Thus, instead of holding for each word a graph that starts and ends at a single node (which is adequate for many words spoken in isolation), each word model has several possible branches at its left and right ends. This, too, is described in the Cohen and Mercer article cited above.

C4 Recognition process with open-ended graphs

The goal of speech recognition is to identify a sequence of individual words (elements of a given vocabulary), but the recognition process for continuous speech has to deal with a continuous stream of phonemes. During recognition, therefore, all possible interconnections between the previously recognized word and each possible next word must be considered. Accordingly, the possible choices of phonemes that join the last known word to each possible next word must be computed and evaluated, according to the given rules, promptly, that is, during actual recognition. (This is also described in the Cohen and Mercer article cited above.) Usually, probabilities are established that each of several words was spoken in the unknown (next) utterance, and the word with the highest probability is selected as the next recognized word.

A large amount of storage is required to hold a different graph, or model, for each word of the desired vocabulary, because each word carries many open branches at its left and right ends together with the associated connection rules.

D. Problems to be solved by the invention

Enlarging the vocabulary, extending the word graphs, and storing the connection rules for the extended graphs greatly increase the required storage capacity and access time. It is therefore desirable to use the available storage space efficiently and to simplify access to the models during recognition.

A principal object of the invention is to store word models, that is, phonetic graphs, in a speech recognition system for continuous or isolated words in a way that reduces the storage space required. The word models are partitioned into elementary graphs, each of which corresponds to a "clink". A clink represents at least one pronunciation of a phone, or of a sequence of phones, extending between two confluent nodes. For continuous speech, both the internal portion of each word and the boundary portion where two words meet consist of clinks represented by elementary graphs. The invention can also be used for isolated-word speech, which has no boundary portions of the kind found in continuous speech.

A further object of the invention is to provide a storage organization for phonetic models that allows word graphs to be assembled rapidly during the recognition process.

E. Means for solving the problems

In the continuous speech recognition embodiment of the invention, the phonetic word graphs that are initially created are partitioned into subgraphs: the subgraphs between confluent nodes, called confluent links or clinks, and the left and right boundary portions, called hooks. In a preprocessing operation, all possible connections of pairs of right and left hooks are made, producing the boundary-link subgraphs called blinks (each representing the link that joins two words). Confluent links (clinks) and boundary links (blinks) may occur several times in the various words or word links, but each is preferably stored only once, in a separate inventory. Each hook and each clink is represented by a corresponding identifier. The complete model, or graph, of each word is therefore not stored; instead, only a string of numbers, or identifiers, is stored that identifies the hooks and clinks making up the word graph.

When a word model is to be used during recognition, its string of identifiers is fetched, and the distributed elements of the word graph can then be accessed rapidly. The required boundary link (which depends not only on the word currently under examination but also on the preceding word) is accessed by means of the right-hook identifier of that preceding word and the left-hook identifier of the current word.

The advantages of the invention are evident. Instead of storing the details of every word graph, the elementary graphs that recur in many word graphs are stored only once. Because the context-dependent boundary graphs are assembled in advance, they can be accessed rapidly, and no evaluation is needed to connect hooks during recognition. When a new word is added to the vocabulary, it is highly probable that the graphs of its clinks, and the graphs of the blinks based on its hooks, are already stored, so in most cases it suffices to store the string of numbers identifying the word's subgraphs. Only very rarely does a new clink or blink have to be added.

For isolated-word speech, a word is characterized by the string of numbers, or identifiers, corresponding to the clinks that form the word from beginning to end. As in the continuous speech case, each elementary graph is stored only once, and each word is represented by a string of identifiers, each corresponding to the elementary graph of a clink. In this case, too, storage requirements are greatly reduced.

F. Embodiment

The invention discloses an improved phonetic representation that reduces storage requirements and simplifies the computation required during the speech recognition process.

F1 Partitioning into subgraphs (clinks) (Fig. 2)

As Fig. 2 shows, the phonetic graph of a word may have several points (indicated by arrows in the drawing) through which every possible pronunciation path must pass. The word graph of Fig. 2 is shown below with these common points marked by asterisks.

    * A * ( XYCD | XCD | BCD | BZ ) * E * F *

(The strings in parentheses are the alternative pronunciations between the second and third common points.)

These common points are called confluent nodes and are indicated by arrows in Fig. 2. Each word is divided into sections by its confluent nodes. Each section between two confluent nodes is an elementary subgraph called a confluent link, or clink for short.
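As a sketch of how confluent nodes might be located automatically, the Python fragment below counts paths through a small acyclic word graph: a node is confluent exactly when the number of start-to-end paths through it equals the total number of paths. The edge list is a hypothetical reconstruction of the "ABCDEF" example (the node names and topology are assumptions, since Fig. 2 is not reproduced here), and the procedure is illustrative rather than the one used in the embodiment.

```python
from functools import lru_cache

# Hypothetical edge list for the word "ABCDEF": (source node, phone, target node).
EDGES = [
    ("n0", "A", "n1"),
    # Four alternative pronunciations between n1 and n2.
    ("n1", "X", "a1"), ("a1", "Y", "a2"), ("a2", "C", "a3"), ("a3", "D", "n2"),
    ("n1", "X", "b1"), ("b1", "C", "b2"), ("b2", "D", "n2"),
    ("n1", "B", "c1"), ("c1", "C", "c2"), ("c2", "D", "n2"),
    ("n1", "B", "d1"), ("d1", "Z", "n2"),
    ("n2", "E", "n3"),
    ("n3", "F", "n4"),
]

def confluent_nodes(edges, start, end):
    """Return the nodes that lie on every path from `start` to `end` (graph must be acyclic)."""
    succ, pred = {}, {}
    for src, _phone, dst in edges:
        succ.setdefault(src, []).append(dst)
        pred.setdefault(dst, []).append(src)

    @lru_cache(maxsize=None)
    def paths_to_end(node):
        return 1 if node == end else sum(paths_to_end(n) for n in succ.get(node, []))

    @lru_cache(maxsize=None)
    def paths_from_start(node):
        return 1 if node == start else sum(paths_from_start(p) for p in pred.get(node, []))

    total = paths_to_end(start)
    nodes = {start, end} | set(succ) | set(pred)
    # Paths through a node = (paths reaching it) * (paths leaving it).
    return sorted(n for n in nodes if paths_from_start(n) * paths_to_end(n) == total)

nodes = confluent_nodes(EDGES, "n0", "n4")
```

Cutting the graph at the returned nodes yields the clinks: here one for A, one holding the four alternatives, one for E, and one for F.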

In the context of continuous speech, each word graph contains, in addition to at least one clink, a left hook (open branches leading into the first confluent node) and a right hook (open branches starting at the last confluent node). Each word in continuous speech can therefore be represented as a string of clinks preceded by a left hook and followed by a right hook.

Any vocabulary contains only a limited number of distinct clinks, left hooks, and right hooks, each of which may be used in several different word models, so each clink, left hook, and right hook is stored only once as a phonetic graph (including its statistics). Each word of the vocabulary is then stored as a string of code numbers identifying the sequence of clinks and hooks that make up the word. This is the first improvement, and it reduces the storage required for the desired vocabulary.
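The store-once idea can be sketched in Python as a small interning table. Everything here is a simplification made for illustration: a clink is modeled as a set of alternative phone strings rather than a graph with statistics, hooks are omitted, the second word "GEF" is invented, and the class name `Inventory` is hypothetical.

```python
class Inventory:
    """Stores each distinct subgraph once and hands out a stable integer code number."""

    def __init__(self):
        self._ids = {}      # subgraph -> code number
        self.graphs = []    # code number -> subgraph

    def intern(self, pronunciations):
        key = frozenset(pronunciations)
        if key not in self._ids:
            self._ids[key] = len(self.graphs)
            self.graphs.append(key)
        return self._ids[key]

# Hypothetical word graphs, already cut into clinks at their confluent nodes.
WORD_GRAPHS = {
    "ABCDEF": [{"A"}, {"XYCD", "XCD", "BCD", "BZ"}, {"E"}, {"F"}],
    "GEF":    [{"G"}, {"E"}, {"F"}],
}

clinks = Inventory()
# Each word is reduced to a string of code numbers pointing into the shared inventory.
word_inventory = {
    word: [clinks.intern(clink) for clink in graph]
    for word, graph in WORD_GRAPHS.items()
}
```

In this toy vocabulary the two words contain seven clink occurrences but only five distinct clinks, because the clinks for E and F are shared; the saving grows with vocabulary size.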

F2 Precomputation of boundary links (blinks) (Fig. 5)

As explained in the Cohen and Mercer article, combining the graphs of two words produces a link between the words that is itself a phonetic subgraph. With the model representation described above, comprising confluent links (clinks), left hooks, and right hooks, each connection of two words actually produces a combination of a right hook and a left hook, which forms a boundary subgraph between the last confluent node of the left word and the first confluent node of the right word. This subgraph is called a boundary link, or blink.

Fig. 5 outlines this process of joining two words. The example chosen joins the word "THOSE" (the graph of Fig. 4) to itself, that is, the case in which "THOSE THOSE" is spoken. The upper left of the drawing shows the right hook of the word, and the upper right shows its left hook. A particular set of rules applies to each branch and determines how and when the corresponding pronunciation occurs. In any particular connection, only a subset of the ending and starting branches can be joined, namely those that are compatible with each other. An example of the resulting boundary link is shown in the lower part of Fig. 5.

In Fig. 5, asterisks mark compatible end nodes, and IRS denotes a set of interconnection rules.

The second improvement is the precomputation of the boundary links, or blinks, that arise as subgraphs from combining the branches of a right hook with the branches of a left hook. As stated above, each word of the vocabulary can be stored as a string of identifiers for a left hook, several clinks, and a right hook, while the phonetic graph of each possible clink and hook is stored separately.

In the blink precomputation, every possible connection of a right hook with a left hook is evaluated. For example, given 50 distinct right hooks and 30 distinct left hooks, precomputation produces a set of 50 × 30 = 1,500 blinks. In a typical vocabulary many of these blinks are identical, so in practice only a few hundred distinct blinks may result.
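The all-pairs evaluation and the collapsing of identical blinks can be sketched as follows. This is a deliberately simplified stand-in: each open branch carries a single context tag that plays the role of the interconnection rules, two branches join when their tags match, and a blink is just the set of joined phone sequences. The hook contents, tags, and function names are hypothetical; the actual procedure is the one referred to in Appendix 1.

```python
from itertools import product

def join_hooks(right_hook, left_hook):
    """Join every compatible pair of open branches into one boundary subgraph."""
    return frozenset(
        r_phones + l_phones
        for (r_phones, r_tag), (l_tag, l_phones) in product(right_hook, left_hook)
        if r_tag == l_tag  # stand-in for the interconnection rules
    )

def precompute_blinks(right_hooks, left_hooks):
    """Evaluate all (right hook, left hook) pairs and store identical blinks once."""
    blink_ids = {}    # blink subgraph -> blink identifier
    hook_pairs = {}   # (right hook ID, left hook ID) -> blink identifier
    for (rh_id, rh), (lh_id, lh) in product(right_hooks.items(), left_hooks.items()):
        blink = join_hooks(rh, lh)
        hook_pairs[(rh_id, lh_id)] = blink_ids.setdefault(blink, len(blink_ids))
    return hook_pairs, blink_ids

# Hypothetical hooks: right-hook branches are (phones, tag), left-hook branches are (tag, phones).
RIGHT_HOOKS = {
    "RH1": [(("z",), "vowel"), (("s",), "cons")],
    "RH2": [(("z",), "vowel"), (("s",), "cons"), (("zh",), "glide")],
}
LEFT_HOOKS = {
    "LH1": [("vowel", ("ae",)), ("cons", ("dh",))],
    "LH2": [("vowel", ("ih",)), ("glide", ("y",))],
}

hook_pairs, blink_ids = precompute_blinks(RIGHT_HOOKS, LEFT_HOOKS)
```

With two right hooks and two left hooks there are four pairs but only three distinct blinks, since RH1 and RH2 produce the same boundary subgraph when followed by LH1. The same effect is what reduces the 1,500 pairs in the example above to a few hundred stored blinks.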

Appendix 1 shows the procedure for generating blinks from pairs of hooks and for determining the clinks that make up each blink.

F3 Reduction to clinks and storage of results (Figs. 6A and 6B)

In the next step, the blinks are converted into clinks (confluent links). Each blink that contains internal confluent nodes (which may be the case for a small number of blinks) is decomposed into a string of clinks, each extending from one confluent node to the next. Each blink that has no internal confluent node (which may be the case for the majority of blinks) is treated as a single clink.

The set of clinks obtained from the set of blinks is then compared with the initial set of clinks collected to represent the internal portions of all word models. As a result of the comparison, some of the blink-derived clinks can be dropped because equivalent clinks are already available in the basic clink inventory for the internal word portions. New clinks (that is, clinks not present in the basic clink inventory) are added to the stored clink inventory, and each added clink is assigned its own identifier.

The blink precomputation step produces one more data table to be stored. For each combination of a right hook and a left hook, this table contains the identifier of the resulting blink and the identifiers of the clinks that make up that blink (in many cases a single clink).

Figs. 6A and 6B outline the tables, or lists, that are actually stored for the recognition vocabulary when the technique of the invention is used. In a particular embodiment of the invention, a commercially available IBM System/370 computer is used, and the inventories described below are stored at corresponding locations in the memory of the IBM System/370.

(A) Word inventory: Each word of the vocabulary is stored, as described above, as a string of identifiers for a left hook, a sequence of clinks, and a right hook. A hook may be an empty hook, which is still stored but is identified as empty (number 00).

(B) Clink inventory: Each clink is stored as a phonetic graph whose branches represent phonemes. The graph can be stored in the form of a structure available in a programming language. In addition to the structure, a table of probability values is stored that gives, for each branch of the structure, its probability of occurrence. These values are obtained in the training phase. The clink inventory also contains the clinks that result from the blink computation procedure.

(C) Initial hook inventory: Initially there is an inventory of all left hooks and right hooks identified in the word vocabulary. Like clinks, these hooks are structures (phonetic graphs), but they are open on one side. In addition, for each hook there is a table giving the connection rules that apply to each of its open branches. This hook inventory is no longer used once the blinks have been precomputed.

(D) Hook-pair inventory: After the blinks have been precomputed, the identifier of the corresponding blink and the identifiers of the clinks that make up that blink are stored for each combination of a right hook and a left hook. Given a pair of right and left hooks, this table therefore supplies the identifiers of the corresponding clinks, which can be fetched from the clink inventory and which represent the resulting blink.

Of the three inventories (A), (B), and (D) used for recognition, only the clink inventory (B) requires a large amount of storage. The word inventory (A) and the hook-pair inventory (D) store only strings of identifiers, each of which requires only two bytes of storage.
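To make the two-byte figure concrete, the Python sketch below packs a word-inventory entry as a run of 16-bit unsigned integers using the standard `struct` module. The byte order, the function names, and the identifier values are assumptions chosen for illustration; the text does not specify the record layout used on the IBM System/370.

```python
import struct

def pack_word(left_hook, clinks, right_hook):
    """Encode a word as left hook ID, clink IDs, right hook ID, two bytes each."""
    ids = [left_hook, *clinks, right_hook]
    return struct.pack(f">{len(ids)}H", *ids)   # big-endian unsigned 16-bit values

def unpack_word(blob):
    """Decode a packed word back into (left hook ID, clink IDs, right hook ID)."""
    ids = struct.unpack(f">{len(blob) // 2}H", blob)
    return ids[0], list(ids[1:-1]), ids[-1]

# Illustrative identifiers: a word with one left hook, three clinks, one right hook.
record = pack_word(25, [117, 85, 187], 64)
```

A word with three internal clinks thus occupies ten bytes in the word inventory, regardless of how large the phonetic graphs behind those identifiers are.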

Appendix 2 shows a sample portion of the clink inventory.

F4 Recognition procedure using precomputed blinks (Fig. 1)

Briefly, recognition of continuous speech preferably operates as follows.

(a) The speech signal is converted into a string of acoustic labels.

(b) Each word is represented as a word baseform by a sequence of Markov models. Each word baseform can be matched against a string of labels.

(c) The matching procedure preferably uses the following two steps.

First, a coarse, or fast, match is performed to obtain a subset of vocabulary words that roughly correspond to the leading portion of the remaining label sequence.

Next, a detailed matching procedure is carried out with the models, that is, the graphs, of the candidate words selected by the fast match. The resulting probability values are used to make the final selection of the output word. (A language model is also used in the matching procedure.)

In the continuous speech embodiment that uses the phonetic graph storage technique of the invention, the recognition procedure is modified as follows. Recognition proceeds word by word, but each step actually runs from the last confluent node of one word to the last confluent node of the next word. This is because the right hook of the last recognized word (which, in continuous speech, is affected by the pronunciation of the next word) must be available for recognizing the next word.

To continue the recognition procedure, the identifier string of each candidate word is fetched from storage. The left-hook identifier of the candidate word and the right-hook identifier of the last recognized word are used as a pair to look up, in the hook-pair inventory, the identifiers of the clinks that make up the corresponding blink. These clink identifiers, together with the identifiers from the string that represents the internal portion of the candidate word, are then used to form a sequence of clink models (graphs) selected from the clink inventory. The resulting clink sequence is matched against the input label string, using the end-time distribution of the previous step as the start-time distribution, to obtain a probability value for the candidate word currently under examination.

第１図はこの認識手順を示す。図中、ワードＮ
−１は、ワードＮの左フツク（LH２５）との境
界を形成する右フツク（RH３６）を有する。ワ
ードＮは、それぞれの合流リンクの間に広がる内
部部分に沿つてクリンクCL１１７、CL０８５、
およびCL１８７を有する。CL１１７、CL０８
５、およびCL１８７は、あらかじめ形成された
クリンク目録から得る。前述のように、ワードＮ
は、ワードＮ−１の最後の合流ノードから、ワー
ドＮの最後の合流ノードまでのモデル（すなわち
グラフ）であることを特徴とする。このモデルの
最初のクリンクは、ワードＮ−１の識別子RH３
６とワードＮの識別子LH２５の対により決ま
り、あらかじめ形成されている対応するブリンク
（ブリンク１１８）を見つける。ブリンク１１８
はクリンク目録のクリンク２６７に対応する。ク
リンク２６７はブリンクにだけ関連づけたり、ま
たはワードの内部部分の合流ノードの間に存在す
ることがある基本グラフを形成したりすることが
できる。 FIG. 1 shows this recognition procedure. In the figure, word N-1 has a right hook (RH36) that forms a boundary with the left hook (LH25) of word N. Word N has clinks CL117, CL085 and CL187 along its interior portion, which extends between the respective confluence links. CL117, CL085 and CL187 are obtained from the previously formed clink inventory. As noted above, word N is characterized by a model (i.e., a graph) running from the last confluence node of word N-1 to the last confluence node of word N. The first clink of this model is determined by the pair formed by identifier RH36 of word N-1 and identifier LH25 of word N, which locates the corresponding pre-formed blink (blink 118). Blink 118 corresponds to clink 267 of the clink inventory. Clink 267 may be associated only with blinks, or it may also form an elementary graph that can occur between confluence nodes in the interior portion of a word.

ワードＮのRH６４により識別される右フツク
とワードＮ＋１のLH１８により識別される左フ
ツクとにより形成されるブリンクを検査する際、
ブリンク１６９を形成するため、２つのクリンク
（クリンク０２３およびクリンク１２５）がクリ
ンク目録から必要になる。ブリンク１６９が現わ
れると、クリンク０２５およびそれに続くクリン
ク１２５の連続と置換えることができる。 When the blink formed by the right hook of word N (identified by RH64) and the left hook of word N+1 (identified by LH18) is examined, two clinks (clink 023 and clink 125) are needed from the clink inventory to form blink 169. Wherever blink 169 occurs, it can be replaced by the sequence of clink 025 followed by clink 125.
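The lookup described for FIG. 1 can be sketched as follows. This is a minimal illustration, assuming the hook pair inventory is held as an in-memory table; the table layout and the function name `clink_sequence` are hypothetical, and only the identifiers are taken from the text.

```python
# Hypothetical in-memory hook pair inventory. The identifiers come from the
# FIG. 1 description; the dictionary layout is an assumption for illustration.
# The text gives both 023 and 025 for the first clink of blink 169; 023 is
# used here.
HOOK_PAIR_INVENTORY = {
    ("RH36", "LH25"): ["CL267"],           # blink 118
    ("RH64", "LH18"): ["CL023", "CL125"],  # blink 169
}


def clink_sequence(prev_right_hook, left_hook, interior_clinks):
    """Return the clink identifiers that model one candidate word.

    The model runs from the last confluence node of the previous word
    to the last confluence node of the candidate word.
    """
    blink_clinks = HOOK_PAIR_INVENTORY[(prev_right_hook, left_hook)]
    return blink_clinks + list(interior_clinks)


# Word N of FIG. 1, following word N-1 whose right hook is RH36.
word_n = clink_sequence("RH36", "LH25", ["CL117", "CL085", "CL187"])
print(word_n)
```

Because the blinks are precomputed, assembling a word model at recognition time reduces to one table lookup and a list concatenation.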

図中、小さな丸印は合流ノードを表わし、星印
は最後の合流ノードを表わす。記号CL，RHおよ
びLHはそれぞれ、クリンク、右フツクおよび左
フツクの識別子を表わす。破線の枠は合流リン
ク・グラフを表わす。 In the figure, small circles represent confluence nodes, and stars represent last confluence nodes. The symbols CL, RH and LH denote clink, right hook and left hook identifiers, respectively. The dashed frames represent confluent link graphs.

音響プロセツサにより生成されたラベルのスト
リングからワードのシーケンスを決定するため
に、マツチング・プロセスを実行する。マツチン
グ・プロセスにより、ワード（その左フツクとそ
の後の、最後の合流ノードまでのクリンクの列と
により形成）の各々は、あたかも最後に認識され
たワードの右フツクに続いているかのように扱わ
れる。第１図では、各ワードは、あたかもワード
Ｎ−１のRH３６に続いたかのように別個に扱わ
れる。すなわち、語彙の各ワードの左フツクは、
ワードＮ−１の右フツクに（１対１で）連結され
てブリンクを形成し、それに適切なクリンクの列
が続く。ブリンクは、１つのクリンクまたは１組
のクリンクに変換される。クリンクの各々は、１
つの音素または直並列配列に構成される複数の音
素と見なすことができる。音素の各々には、対応
するマルコフ・モデル音素マシンが指定されてい
る。クリンクの各々の、音素マシンまたは複数の
音素マシンの配列は、所与のクリンク、またはク
リンクの列の確率を決定するのに用い、音響プロ
セツサにより生成されたストリングのフイーニー
ム（feneme：フロント・エンド・プロセツサ等
で生成される微小音素をこのように呼ぶことにす
る）を生成する。以下、マツチング・プロセスの
詳細について説明する。なお、米国特許出願第
06／672974号（1984年11月19日出願）にもマツチ
ング・プロセスが記載されている。 A matching process is performed to determine the sequence of words from the string of labels generated by the acoustic processor. In the matching process, each word (formed by its left hook and the subsequent sequence of clinks up to its last confluence node) is treated as if it followed the right hook of the last recognized word. In FIG. 1, each word is treated separately as if it followed RH36 of word N-1. That is, the left hook of each word in the vocabulary is concatenated (one at a time) with the right hook of word N-1 to form a blink, which is followed by the appropriate sequence of clinks. The blink is converted into one clink or a set of clinks. Each clink can be regarded as one phoneme or as several phonemes in a series-parallel arrangement. Each phoneme is assigned a corresponding Markov model phoneme machine. The phoneme machine, or arrangement of phoneme machines, for each clink is used to determine the probability that a given clink, or sequence of clinks, produces the fenemes in the string generated by the acoustic processor ("feneme" being the term used here for the micro-phonemes generated by a front-end processor or the like). The matching process is described in detail below; it is also described in U.S. patent application Ser. No. 06/672,974 (filed November 19, 1984).

マツチング手法により、特定のラベル・ストリ
ングが所与の音素マシンに供給されると所与の音
素マシンは（それに記憶されているデータから）、
所与の音素マシンが特定の到来するストリング中
のラベルを生成する尤度を決定する。 Under the matching technique, when a particular label string is supplied to a given phoneme machine, that machine determines (from the data stored in it) the likelihood that it generates the labels in the particular incoming string.

以下に説明するように、記憶し使用するデータ
に応じて異なつた型の音素マシンがある。最初、
所定の確率を音素モデルに組込む精密マツチング
音素マシンについて説明する。このマシンには次
のような特徴がある。 As explained below, there are different types of phoneme machine, depending on the data they store and use. First, the precision matching phoneme machine, which incorporates predetermined probabilities into the phoneme model, is described. This machine has the following features:

(a) 複数の状態および状態間の遷移を有する。(a) Has multiple states and transitions between states.

(b) 各遷移に関連した確率、すなわち特定の遷移
が生じる確率を有する。(b) have a probability associated with each transition, i.e. the probability that a particular transition will occur;

(c) 所与の遷移で、特定のラベルが音素により生
成される確率を有する。(c) At a given transition, a particular label has a probability of being produced by a phoneme.

精密マツチング音素マシンは、音素が到来する
ストリング中のラベルにいかにぴつたりと一致す
るかを非常に正確に決定するのに使用することが
できるが、そのためには莫大な量の計算が必要に
なる。 A precision matching phoneme machine can be used to determine very precisely how closely a phoneme matches the labels in an incoming string, but doing so requires an enormous amount of computation.
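A minimal sketch of how a machine with features (a) to (c) could score a label string, assuming that every transition emits exactly one label and that the highest-numbered state is final. The function name, the data layout and the toy probabilities are illustrative assumptions rather than details from the patent; the computation is the standard forward recursion for a Markov model.

```python
def forward_likelihood(transitions, n_states, labels, start=None):
    """Probability that the machine emits `labels` and ends in the final state.

    transitions: list of (src, dst, transition_prob, {label: output_prob}).
    States are numbered 0..n_states-1 and the last one is treated as final.
    start: starting distribution over states (defaults to all mass on state 0).
    """
    alpha = list(start) if start is not None else [1.0] + [0.0] * (n_states - 1)
    for label in labels:
        nxt = [0.0] * n_states
        for src, dst, p_trans, p_label in transitions:
            # Feature (b) times feature (c): take the transition, emit the label.
            nxt[dst] += alpha[src] * p_trans * p_label.get(label, 0.0)
        alpha = nxt
    return alpha[-1]


# Toy three-state machine; all numbers are made up for illustration.
TOY_MACHINE = [
    (0, 0, 0.5, {"A": 0.8, "B": 0.2}),
    (0, 1, 0.5, {"A": 0.3, "B": 0.7}),
    (1, 1, 0.4, {"A": 0.1, "B": 0.9}),
    (1, 2, 0.6, {"B": 1.0}),
]

score = forward_likelihood(TOY_MACHINE, 3, ["A", "B"])
print(score)
```

The inner loop visits every transition for every label, which gives a feel for why the text calls the precise computation expensive for a large vocabulary.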

精密マツチング音素マシンに関連した莫大な計
算要求を避けるために、前記マツチング手法で
は、もう１つの音素マシンを考慮している。第２
の音素マシン（基本高速マツチング音素マシンと
呼ぶ）は概算を行ない、精密マツチング音素マシ
ンに関する計算を簡単にする。特に、所与のラベ
ルが所与の音素における遷移で生ずる確率を有す
る場合は必ず、その確率は１つの特定の値を割当
てられる。（少なくとも、その音素における任意
の遷移で生ずるラベルの最高の確率の大きさであ
ることが望ましい）別の実施例（代替高速マツチングという）で
は、いくつかの長さの任意のラベル・ストリング
を生成する音素の確率は、それに対応する音素マ
シンにより均一とみなされる。代替高速マツチン
グ音素マシンでは、最小および最大の長さが指定
され、その間に均一の長さの分布が形成される。
代替高速マツチング音素マシンでは、その長さの
分布内の任意の長さの確率が１つの規定された値
に置換えられ、その分布の外側の任意の長さの確
率は０である。更に、代替高速マツチング音素マ
シンは、基本高速マツチング音素マシンの代りに
使用し、音素マシンのストリングにより形成され
たワードが候補ワードとして適切であるかどうか
を決定する際に必要な計算量を更に減少すること
ができる。 To avoid the enormous computational demands of the precision matching phoneme machine, the matching technique provides for a second kind of phoneme machine. The second phoneme machine (called the basic fast matching phoneme machine) makes an approximation that simplifies the computations of the precision matching phoneme machine. Specifically, wherever a given label has a probability of occurring at a transition in a given phoneme, that probability is assigned a single specific value (preferably at least as large as the highest probability of that label occurring at any transition in the phoneme). In another embodiment (called alternative fast matching), the probability that a phoneme generates a label string of any of a number of lengths is treated as uniform by the corresponding phoneme machine. In the alternative fast matching phoneme machine, a minimum and a maximum length are specified, and a uniform length distribution is formed between them. The probability of any length within that distribution is replaced by a single prescribed value, and the probability of any length outside the distribution is zero. Moreover, the alternative fast matching phoneme machine can be used in place of the basic fast matching phoneme machine to reduce still further the computation needed to decide whether a word formed by a string of phoneme machines qualifies as a candidate word.
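The two approximations can be illustrated with a short sketch. It assumes the replacement value for a label is exactly the maximum over all transitions (the text only asks for a value at least that large) and that the single prescribed length value is one over the number of admissible lengths; both choices, and the function names, are assumptions made for this example.

```python
def basic_fast_match_values(transition_label_probs):
    """Per label, take the largest output probability over all transitions.

    transition_label_probs: one {label: prob} mapping per transition.
    """
    replacement = {}
    for label_probs in transition_label_probs:
        for label, p in label_probs.items():
            replacement[label] = max(replacement.get(label, 0.0), p)
    return replacement


def uniform_length_distribution(min_len, max_len):
    """Return a function giving one fixed value inside [min_len, max_len], 0 outside."""
    value = 1.0 / (max_len - min_len + 1)
    return lambda length: value if min_len <= length <= max_len else 0.0


# Toy phoneme with three emitting transitions (made-up numbers).
values = basic_fast_match_values([
    {"A": 0.8, "B": 0.2},
    {"A": 0.3, "B": 0.7},
    {"A": 0.1, "B": 0.9},
])
length_prob = uniform_length_distribution(2, 5)
print(values, length_prob(3), length_prob(6))
```

Since the replacement value never underestimates a label probability, the approximate score cannot fall below the precise one, so a word that would score well in precision matching is not discarded by the fast match for that reason.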

本発明の音素グラフに従つて、できればラベル
確率置換え値および長さ分布置換え値の両者を含
む基本高速マツチング音素マシンまたは代替高速
マツチング音素マシンによりワードを処理して、
ワードの語彙からワード・リストを取出すことが
望ましい。取出されたワード・リスト中のワード
は、その後、精密マツチング音素マシンにより処
理し、１つのワードが選択される。 In accordance with the phonetic graphs of the invention, words are preferably processed by the basic fast matching phoneme machine or the alternative fast matching phoneme machine, preferably one that includes both label probability replacement values and length distribution replacement values, in order to extract a word list from the vocabulary. The words in the extracted list are then processed by the precision matching phoneme machine, and one word is selected.

代替的に、クリンク音素マシンにより表現され
た各ワードを、精密マツチング音素マシンだけを
用いてマツチングすることもできる。この場合
は、処理量およびそれに要する時間が増大する結
果になる。 Alternatively, each word represented by the Clink phoneme machine could be matched using only the precision matching phoneme machine. In this case, the amount of processing and the time required for it will increase.

前述の音素マシンの各々は、開始時分布および
ラベル・ストリングを入力として受取りマツチン
グ値を決定する。このマツチング値は、所与の音
素がストリング中のラベルの列を生成する尤度を
示す。 Each of the aforementioned phoneme machines receives as input a starting distribution and a label string and determines a matching value. This matching value indicates the likelihood that a given phoneme will generate a sequence of labels in the string.

スタツク・アルゴリズムは、復号動作、すなわ
ち、入力ラベル・ストリングをワード・モデルに
突合せるのに使用することができる。 The stack algorithm can be used for the decoding operation, ie, matching the input label string to the word model.

F5 具体的な数値例 5000ワードの語彙を有するシステムの記憶要求
は下記のようになる。F5 Specific numerical example The storage requirements for a system with a vocabulary of 5000 words are as follows.

(a) ブリンクを前計算しない古い構成の場合：全語彙：634000バイト各々の付加ワード：2500バイト (b) 本発明による新しい構成の場合：全語彙：72500バイト各々の付加ワード：18〜20バイトクリンクおよびフツクの数： 1000ワードの語彙の場合、下記の基本グラフが
使用されていた。
(a) Old configuration, without precomputed blinks: entire vocabulary: 634000 bytes; each additional word: 2500 bytes.
(b) New configuration according to the invention: entire vocabulary: 72500 bytes; each additional word: 18 to 20 bytes.
Number of clinks and hooks: for a 1000-word vocabulary, the following elementary graphs were used.

237クリンク 72右フツク 40左フツク右フツクと左フツクの組合せは2880のブリンク
を生じ、その多くは等しかつた。ブリンクの各々
を構成するには、現にある237のクリンクに加え
て122の新しいクリンクが必要であつた。従つて、
ブリンクを前計算した後の合計クリンク目録は
359クリンクであつた。 237 clinks, 72 right hooks, 40 left hooks. Combining the right hooks with the left hooks yielded 2880 blinks, many of which were identical. Constructing the blinks required 122 new clinks in addition to the 237 existing ones. The total clink inventory after precomputing the blinks was therefore 359 clinks.
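The figures quoted in this example can be cross-checked with a few lines of arithmetic. The variable names are illustrative; note that the byte counts refer to the 5000-word system and the clink and hook counts to the 1000-word vocabulary, so the two groups are checked independently.

```python
# Counts quoted for the 1000-word vocabulary.
right_hooks, left_hooks = 72, 40
blinks = right_hooks * left_hooks   # every right hook paired with every left hook
clinks_total = 237 + 122            # existing clinks plus those added for blinks

# Storage figures quoted for the 5000-word system.
old_bytes, new_bytes = 634_000, 72_500
storage_ratio = old_bytes / new_bytes

print(blinks, clinks_total, round(storage_ratio, 1))
```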

F6 本発明の音声認識システムの環境 F6a 全般的説明（第７図、第１０図〜第１２
図）第７図は、本発明の環境を与える音声認識シス
テム１０００の概要ブロツク図を示す。このシス
テムは、スタツク・デコーダ１００２、およびそ
れに接続された音響プロセツサ（AP）１００４、
高速概算音響マツチングを実行するアレイ・プロ
セツサ１００６、精密な音響マツチングを実行す
るアレイ・プロセツサ１００８、言語モデル１０
１０、ならびにワークステーシヨン１０１２を含
む。F6 Environment of the speech recognition system of the present invention F6a General description (FIGS. 7 and 10 to 12) FIG. 7 shows a schematic block diagram of a speech recognition system 1000 that provides an environment for the present invention. The system includes a stack decoder 1002 and, connected to it, an acoustic processor (AP) 1004, an array processor 1006 that performs fast approximate acoustic matching, an array processor 1008 that performs precise acoustic matching, a language model 1010, and a workstation 1012.

音響プロセツサ１００４は、音声波形入力をラ
ベルのすなわちフイーニーム（この各々は対応す
る音響タイプを特定する）ストリングに変換する
ように設計されている。本システムでは、音響プ
ロセツサ１００４は、人間の聴覚の独特なモデル
に基づくもので、米国特許出願第06／665401号
（1984年10月26日出願）に記載されている。 Acoustic processor 1004 is designed to convert a speech waveform input into a string of labels, or fenemes, each of which identifies a corresponding sound type. In the present system, acoustic processor 1004 is based on a unique model of human hearing and is described in U.S. patent application Ser. No. 06/665,401 (filed October 26, 1984).

音響プロセツサ１００４からのラベル、すなわ
ちフイーニームはスタツク・デコーダ１００２に
送られる。第８図は、スタツク・デコーダ１００
２の論理素子を示す。すなわち、スタツク・デコ
ーダ１００２は探索装置１０２０、およびそれに
接続されたワークステーシヨン１０１２、インタ
フエース１０２２、１０２４、１０２６ならびに
１０２８を含む。これらのインタフエースの各々
は、音響プロセツサ１００４、アレイ・プロセツ
サ１００６、１００８ならびに言語モデル１０１
０にそれぞれ接続される。 The labels, or fenemes, from acoustic processor 1004 are sent to stack decoder 1002. FIG. 8 shows the logic elements of stack decoder 1002. That is, stack decoder 1002 includes a searcher 1020 and, connected to it, workstation 1012 and interfaces 1022, 1024, 1026 and 1028. These interfaces connect to acoustic processor 1004, array processors 1006 and 1008, and language model 1010, respectively.

動作中、音響プロセツサ１００４からのフイー
ニームは探索装置１０２０によりアレイ・プロセ
ツサ１００６（高速突合せ）に送付される。下記
に説明する高速マツチング手順は前記米国特許出
願第06／672974号（1984年11月19日出願）にも記
載されている。マツチングの目的は、簡単にいえ
ば、所与のラベル・ストリングの少なくとも１つ
の最も起こりうるワードを決定することである。 In operation, fenemes from acoustic processor 1004 are sent by searcher 1020 to array processor 1006 (fast match). The fast matching procedure described below is also described in the aforementioned U.S. patent application Ser. No. 06/672,974, filed November 19, 1984. Briefly, the purpose of matching is to determine the most likely word or words for a given label string.

高速マツチングはワードの語彙中のワードを検
査し、所与の到来ラベルのストリングの候補ワー
ドの数を少なくするように設計されている。高速
マツチングは確率的に限定された状態マシン（本
明細書ではマルコフ・モデルともいう）に基づく
ものである。 Fast matching is designed to examine the words in the vocabulary and to reduce the number of candidate words for a given string of incoming labels. Fast matching is based on probabilistic finite state machines (also referred to herein as Markov models).

精密マツチングは、これらのワードを、話され
たワードとして適度の尤度を有する高速マツチン
グ候補リストから、言語モデル計算に基づいて検
査することが望ましい。精密マツチングも前記米
国特許出願第06／672974号（1984年11月19日出
願）に記載されている。精密マツチングは、第９
図に示すようなマルコフ・モデルの音素マシンに
より実行する。 Precision matching preferably examines those words from the fast matching candidate list that, based on language model calculations, have a reasonable likelihood of being the spoken word. Precision matching is also described in the aforementioned U.S. patent application Ser. No. 06/672,974 (filed November 19, 1984). Precision matching is performed by Markov model phoneme machines such as the one shown in FIG. 9.

精密マツチングの後、言語モデルを再び呼出
し、ワードの尤度を決定することが望ましい。 After precision matching, it is desirable to invoke the language model again to determine the word likelihood.

スタツク・デコーダ１００２（第７図）の目的
は、ラベルy₁，y₂，y₃…のストリングに最高の確
率を与えるワード・ストリングを決定することで
ある。 The purpose of stack decoder 1002 (FIG. 7) is to determine the word string that has the highest probability given the string of labels $y_1, y_2, y_3, \ldots$

これは数学的には次のように表現する。 This can be expressed mathematically as follows.

Max（Pr（Ｗ｜Ｙ） (1) これは全ワード・ストリングＷにわたつてＹを
与えるＷの最大確率である。周知のように、Pr
（Ｗ｜Ｙ）は次のように書くことができる。 $\max_{W} \Pr(W \mid Y)$ (1) This is the maximum, over all word strings $W$, of the probability of $W$ given $Y$. As is well known, $\Pr(W \mid Y)$ can be written as follows.

Pr（Ｗ｜Ｙ）＝Pr（Ｗ）・Pr（Ｙ｜Ｗ）／Pr（Ｙ）
(2) ただし、Pr（Ｙ）はＷに無関係である。 $\Pr(W \mid Y) = \dfrac{\Pr(W)\,\Pr(Y \mid W)}{\Pr(Y)}$ (2) where $\Pr(Y)$ is independent of $W$.
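Because Pr(Y) does not depend on W, maximizing expression (1) amounts to maximizing the product Pr(W)·Pr(Y|W). The sketch below illustrates this with made-up numbers; the function name, the candidate strings and their probabilities are assumptions for illustration, with the prior standing in for the language model and the likelihood for the acoustic match.

```python
import math


def best_word_string(candidates):
    """Pick the word string maximizing Pr(W) * Pr(Y|W).

    candidates: {word_string: (prior Pr(W), likelihood Pr(Y|W))},
    with all probabilities strictly positive. Log probabilities are
    summed to avoid underflow on long label strings.
    """
    return max(
        candidates,
        key=lambda w: math.log(candidates[w][0]) + math.log(candidates[w][1]),
    )


# Hypothetical candidates for one label string Y.
toy = {
    "recognize speech": (0.02, 1e-6),
    "wreck a nice beach": (0.0001, 3e-6),
}
winner = best_word_string(toy)
print(winner)
```

Here the second candidate has the better acoustic likelihood, but the first wins once the prior is taken into account.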

連続するワードW^*の最も起こりうるパス（す
なわち列）を決定する１つの方法は、それぞれの
可能なパスを調べ、復号しようとするラベル・ス
トリングを生じるパスの各々の確率を決定するこ
とである。そして、関連する最高の確率を有する
パスを選択する。5000ワードの語彙の場合、この
方法は、特にワードの列が長いとき、扱いにくく
なり、非実際的である。 One way to determine the most likely path (i.e., sequence) of successive words $W^*$ is to examine each possible path and determine the probability that each path produces the label string being decoded. The path with the highest associated probability is then selected. For a 5000-word vocabulary, this method becomes unwieldy and impractical, especially when the word sequence is long.

最尤ワード列W^*を発見する公知の他の２つの
方法は、ビタービ（Viterbi）復号化およびスタ
ツク復号化である。これらの手法の各々は、パタ
ーン解析およびマシン情報に関するIEEE会報
PAMI第５巻第２号、1983年３月号記載のエル・
アール・バール外の論文、“連続音声認識の最尤
アプローチ”（L.R.Bahl et al、“Ａ Maximum
Likelihood Approach to Continuous Speech
Recognition”、IEEE Transactions on Pattern
Analysis and Machine Intelligence、Vol.
PAMI−５、No.2、March 1983）の第項およ
び第項にそれぞれ記載されている。 Two other known methods of finding the most likely word sequence $W^*$ are Viterbi decoding and stack decoding. Each of these techniques is described in its own section of L. R. Bahl et al., "A Maximum Likelihood Approach to Continuous Speech Recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-5, No. 2, March 1983.

この論文のスタツク復号手法は単一のスタツク
復号化に関連する。すなわち、長さの異なるパス
は尤度により単一スタツクにリストされ、復号は
この単一のスタツクに基づいて行なわれる。単一
スタツク復号は、尤度がいくらかパスの長さに左
右され、従つて一般に正規化が行なわれるという
事実によるものである。しかしながら、正規化
は、もし正規化フアクタが正しく推定されなけれ
ば、不適切な探索により過度の探索および探索エ
ラーを生じることがある。 The stack decoding technique in that article relates to single-stack decoding. That is, paths of different lengths are listed in a single stack according to likelihood, and decoding is based on that single stack. Single-stack decoding must account for the fact that likelihood depends to some extent on path length, which is why normalization is generally employed. Normalization, however, can cause excessive searching and search errors through inadequate searching if the normalization factor is not properly estimated.

ビタービ手法は、正規化は必要としないが、一
般に小さなタスクの場合にしか実際的ではない。
大規模な語彙を使用すると、基本的に時間に同期
するビタービ・アルゴリズムは、非同期の音響マ
ツチング成分とインタフエースしなければならな
いことがある。この場合、インタフエースは適切
ではないという結果になる。 The Viterbi method does not require normalization, but is generally only practical for small tasks.
With large vocabularies, the essentially time-synchronous Viterbi algorithm may have to interface with asynchronous acoustic matching components. In this case, the result is that the interface is not suitable.

エル・アール・バール（L.R.Bahl）他の発明
による代替の新規装置および方法（後述）は、最
も起こりうるワード列W^*を他の手法に比し低い
計算要求と高い精度で復号することができる方法
に関係する。特に、多重スタツク復号および独特
の決定方法により所与の時刻にどのワード列を展
開すべきかを決定することを特徴とする手法が設
けられている。この決定方法に従つて、相対的に
長さの短かいパスは、その短かさの故に不利には
ならないが、その代り、その相対的な尤度により
判定される。 An alternative novel apparatus and method invented by L. R. Bahl et al. (described below) relates to a way of decoding the most likely word sequence $W^*$ with lower computational demands and higher accuracy than other techniques. In particular, a technique is provided that features multi-stack decoding and a unique decision strategy for determining which word sequence should be extended at a given time. Under this decision strategy, a relatively short path is not penalized for its shortness but is instead judged on its relative likelihood.

後に説明する第１０図、第１１図および第１２
図はこの装置および方法を示す。 FIGS. 10, 11 and 12, described later, illustrate this apparatus and method.

スタツク・デコーダ１００２は、実際には、他
の要素を制御するように作用するが、多くの計算
を実行することはない。従つて、スタツク・デコ
ーダ１００２は、IBM VM／370オペレーテイン
グ・システム（モデル155、VS2、リリース1.7）
の制御の下にランする4341プロセツサを含むこと
が望ましい。相当な量の計算を実行するアレイ・
プロセツサは、フローテイング・ポイント・シス
テム（FPS）社製の市販の190Lにより実現され
ている。 Stack decoder 1002 in effect serves to control the other elements and does not itself perform many computations. Stack decoder 1002 therefore preferably includes a 4341 processor running under the IBM VM/370 operating system (Model 155, VS2, Release 1.7). The array processors, which perform a considerable amount of computation, are implemented with the commercially available 190L from Floating Point Systems (FPS), Inc.

F6b 聴覚モデルおよび音声認識システムの音響
プロセツサにおけるその実現（第１３図〜第１９
図）第１３図は、前述のような音響プロセツサ１１
００の特定の実施例を示す。音響波入力（例え
ば、自然の音声）が、所定の速度でサンプリング
するＡ／Ｄ変換器１１０２に入る。代表的なサン
プリング速度は毎50マイクロ秒当り１サンプルで
ある。デイジタル信号の端を整形するために、時
間窓発生器１１０４が設けられている。時間窓発
生器１１０４の出力は、時間窓ごとに周波数スペ
クトル出力を与えるFFT（高速フーリエ変換）装
置１１０６に入る。F6b Auditory model and its realization in the acoustic processor of the speech recognition system (FIGS. 13 to 19) FIG. 13 shows a specific embodiment of acoustic processor 1100 as described above. An acoustic wave input (e.g., natural speech) enters an A/D converter 1102, which samples at a prescribed rate. A typical sampling rate is one sample every 50 microseconds. A time window generator 1104 is provided to shape the edges of the digital signal. The output of time window generator 1104 enters an FFT (fast Fourier transform) device 1106, which provides a frequency spectrum output for each time window.

そして、FFT装置１１０６の出力は、ラベル
L₁，L₂……L_fを生成するように処理される。特
徴選択装置１１０８、クラスタ装置１１１０、原
型装置１１１２および記号化装置１１１４は共同
してラベルを生成する。ラベルを生成する際、原
型は、選択された特徴に基づき空間に点（または
ベクトル）として形成される。音響入力は、選択
された同じ特徴により、原型に比較しうる対応す
る点（またはベクトル）を空間に供給するように
特徴づけられている。 The output of FFT device 1106 is then processed to produce labels $L_1, L_2, \ldots, L_f$. Feature selector 1108, clustering device 1110, prototype device 1112, and encoder 1114 cooperate to generate the labels. In generating the labels, prototypes are defined as points (or vectors) in a space based on selected features. The acoustic input is characterized by the same selected features so as to provide corresponding points (or vectors) in that space that can be compared with the prototypes.

詳細に言えば、原型を定義する際、クラスタ装
置１１１０により点のセツトを集めてクラスタに
群化する。クラスタを形成する方法は、音声に適
用される（ガウス分布のような）確率分布に基づ
いている。各クラスタの原型は、（クラスタの中
心軌跡または他の特徴に関連して）原型装置１１
１２により生成される。生成された原型および音
響入力（どちらも同じ特徴が選択されている）は
記号化装置１１１４に入る。記号化装置１１１４
は比較手順を実行し、その結果、特定の音響入力
にラベルを割当てる。 Specifically, in defining the prototypes, clustering device 1110 collects sets of points and groups them into clusters. The method of forming clusters is based on probability distributions (such as the Gaussian distribution) applied to speech. The prototype of each cluster (relating to the centroid or another characteristic of the cluster) is generated by prototype device 1112. The generated prototypes and the acoustic input, both characterized by the same selected features, enter encoder 1114. Encoder 1114 performs a comparison procedure and, as a result, assigns a label to a particular acoustic input.
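The comparison step can be pictured as nearest-prototype labelling. The sketch below assumes squared Euclidean distance as the comparison measure, which the text does not specify; the function name, the label names and the two-dimensional prototype vectors are hypothetical.

```python
def assign_label(feature_vector, prototypes):
    """Return the label of the prototype closest to feature_vector.

    prototypes: {label: prototype_vector}, all vectors in the same
    feature space as feature_vector. Squared Euclidean distance is an
    assumed comparison measure.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(prototypes, key=lambda label: sq_dist(feature_vector, prototypes[label]))


# Hypothetical prototypes in a two-dimensional feature space.
PROTOTYPES = {"L1": (0.0, 0.0), "L2": (1.0, 1.0), "L3": (0.0, 2.0)}
label = assign_label((0.9, 1.2), PROTOTYPES)
print(label)
```

Applied to one feature vector per time frame, this produces the label string that is passed on to the decoder.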

適切な特徴の選択は、音響（音声）波入力を表
わすラベルを取出す際の重要な要素である。音響
プロセツサは改良された特徴選択装置１１０８に
関係する。音響プロセツサに従つて、独特の聴覚
モデルが取出され使用される。聴覚モデルを、第
１４図により説明する。 Selecting appropriate features is a key factor in deriving labels that represent the acoustic (speech) wave input. The acoustic processor described here relates to an improved feature selector 1108. In accordance with this acoustic processor, a unique auditory model is derived and applied. The auditory model is explained with reference to FIG. 14.

第１４図は人間の内耳の部分を示す。詳細に述
べれば、内毛細胞１２００と、液体を含有する溝
１２０４に広がる末端部１２０２が詳細に示され
ている。また、内毛細胞１２００から上流には、
外毛細胞１２０６と、溝１２０４に広がる末端部
１２０８が示されている。内毛細胞１２００と外
毛細胞１２０６には、脳に情報を伝達する神経が
結合している。特に、ニユーロンが電気化学的変
化を受け、電気パルスが神経に沿つて脳に運ば
れ、処理されることになる。電気化学的変化は、
基底膜１２１０の機械的運動により刺激される。 FIG. 14 shows part of the human inner ear. Specifically, it shows in detail an inner hair cell 1200 with end portions 1202 that extend into a fluid-containing channel 1204. Upstream of inner hair cell 1200 are outer hair cells 1206, likewise shown with end portions 1208 extending into channel 1204. Nerves that convey information to the brain are associated with inner hair cell 1200 and outer hair cells 1206. In particular, neurons undergo electrochemical changes, as a result of which electrical impulses are carried along a nerve to the brain for processing. The electrochemical changes are stimulated by the mechanical motion of basilar membrane 1210.

基底膜１２１０が音響波入力の周波数分析器と
して作用し、基底膜１２１０に沿つた部分がそれ
ぞれの臨界周波数バンドに応答することは従来か
ら知られている。対応する周波数バンドに応答す
る基底膜１２１０のそれぞれの部分は、音響波形
入力を知覚する音量に影響を与える。すなわち、
トーンの音量は、類似のパワーの強度の２つのト
ーンが同じ周波数バンドを占有する場合よりも、
２つのトーンが別個の臨界周波数バンドにある場
合の方が大きく知覚される。基底膜１２１０によ
り規定された22の等級の臨界周波数バンドがある
ことが分つている。 It is known in the art that basilar membrane 1210 acts as a frequency analyzer for acoustic wave inputs, and that portions along basilar membrane 1210 respond to respective critical frequency bands. The different portions of basilar membrane 1210 responding to their corresponding frequency bands affect the loudness perceived for an acoustic waveform input. That is, tones are perceived as louder when two tones lie in separate critical frequency bands than when two tones of similar power intensity occupy the same frequency band. It has been found that there are on the order of 22 critical frequency bands defined by basilar membrane 1210.

基底膜１２１０の周波数レスポンスに合わせ
て、本発明は良好な形式で、臨界周波数バンドの
一部または全部に入力された音響波形を定め、次
いで、規定された臨界周波数バンドごとに別個に
信号成分を検査する。この機能は、FFT装置１
１０６（第１３図）からの信号を適切に濾波し、
検査された臨界周波数バンドごとに特徴選択装置
１１０８に別個の信号を供給することにより行な
われる。 In keeping with the frequency response of basilar membrane 1210, the invention in its preferred form resolves the input acoustic waveform into some or all of the critical frequency bands and then examines the signal component separately for each defined critical frequency band. This is achieved by appropriately filtering the signal from FFT device 1106 (FIG. 13) so that a separate signal is supplied to feature selector 1108 for each critical frequency band examined.

別個の入力も、時間窓発生器１１０４により
（できれば25.6ミリ秒の）時間フレームにブロツ
クされる。それゆえ、特徴選択装置１１０８は22
の信号を含むことが望ましい。これらの信号の
各々は、時間フレームごとに所与の周波数バンド
の音の強さを表わす。 The separate inputs are also blocked into time frames (preferably of 25.6 milliseconds) by time window generator 1104. Feature selector 1108 therefore preferably includes 22 signals, each of which represents the sound intensity in a given frequency band for one time frame after another.

信号は、第１５図の通常の臨界バンド・フイル
タ１３００により濾波することが望ましい。次い
で、信号は別個に、音量の変化を周波数の関数と
して知覚する音量等化変換器１３０２により処理
する。ちなみに、１つの周波数で所与のdBレベ
ルの第１のトーンの知覚された音量は、もう１つ
の周波数と同じdBレベルの第２のトーンの音量
と異なることがある。音量等化変換器１３０２
は、経験的なデータに基づき、それぞれの周波数
バンドの信号を変換して各々が同じ音量尺度で測
定されるようにする。例えば、音量等化変換器１
３０２は、1933年のフレツチヤおよびムンソン
（Fletcher and Munson）の研究に多少変更を加
えることにより、音響エネルギを同等の音量に写
像することができる。第１６図は前記研究に変更
を加えた結果を示す。第１６図により、40dBで
1KHzのトーンは60dBで100Hzのトーンの音量レ
ベルに対応することが分る。 The signals are preferably filtered by a conventional critical band filter 1300 of FIG. 15. The signals are then processed separately by a volume equalization converter 1302, which accounts for perceived loudness variations as a function of frequency. In this regard, a first tone at a given dB level at one frequency may differ in perceived loudness from a second tone at the same dB level at another frequency. Based on empirical data, volume equalization converter 1302 converts the signals in the various frequency bands so that each is measured on the same loudness scale. For example, converter 1302 can map acoustic energy to equivalent loudness by using the 1933 work of Fletcher and Munson with some modifications. FIG. 16 shows the result of those modifications. According to FIG. 16, a 1 kHz tone at 40 dB corresponds in loudness level to a 100 Hz tone at 60 dB.

音量等化変換器１３０２は、第１６図に示す曲
線に従つて音量を調整し、周波数と無関係に同等
の音量を生じさせる。 The volume equalization converter 1302 adjusts the volume according to the curve shown in FIG. 16, producing the same volume regardless of frequency.

周波数への依存性のほか、第１６図で特定の周
波数を調べれば明らかなように、パワーの変化は
音量の変化に対応しない。すなわち、音の強度、
すなわち振幅の変動は、すべての点で、知覚され
た音量の同等の変化に反映されない。例えば、
100Hzの周波数では、110dB付近における10dB
の知覚された音量変化は、20dB付近における
10dBの知覚された音量変化よりもずつと大きい。
この差は、所定の方法で音量を圧縮する音量圧縮
装置１３０４により処理する。音量圧縮装置１３
０４は、ホン単位の音量振幅測定値をソーン単位
に置換えることにより、パワーＰをその立方根
P^1/3に圧縮することができる。 In addition to the dependence on frequency, a change in power does not correspond to an equal change in loudness, as is evident from examining a particular frequency in FIG. 16. That is, variations in sound intensity, or amplitude, are not reflected at all points by equivalent changes in perceived loudness. For example, at a frequency of 100 Hz, the perceived change in loudness for a 10 dB change near 110 dB is much larger than the perceived change in loudness for a 10 dB change near 20 dB. This difference is handled by volume compression device 1304, which compresses the loudness in a predetermined manner. Volume compression device 1304 can compress power $P$ to its cube root, $P^{1/3}$, by replacing loudness amplitude measurements in phons with measurements in sones.

第１７図は、経験的に決められた既知のホン対
ソーンの関係を示す。ソーン単位の使用により、
本発明のモデルは大きな音声信号振幅でもほぼ正
確な状態を保持する。１ソーンは、1KHzのトー
ンで40dBの音量と規定されている。 FIG. 17 shows the known, empirically determined relationship between phons and sones. By using sones, the model of the invention remains substantially accurate even at large speech signal amplitudes. One sone is defined as the loudness of a 1 kHz tone at 40 dB.

第１５図には、新規の時変レスポンス装置１３
０６が示されている。この装置は、各臨界周波数
バンドに関連した音量等化および音量圧縮信号に
より動作する。詳細に述べれば、検査された周波
数バンドごとに、神経発火率ｆが各時間フレーム
で決められる。発火率ｆは本発明の音響プロセツ
サに従つて次のように定義される。 FIG. 15 shows a novel time-varying response device 1306. This device operates on the volume-equalized, volume-compressed signals associated with each critical frequency band. Specifically, for each frequency band examined, a neural firing rate $f$ is determined at each time frame. The firing rate $f$ is defined in accordance with the acoustic processor of the invention as follows.

ｆ＝（So＋DL）ｎ (1) ただし、ｎは神経伝達物質の量；Soは音響波
形入力と無関係に神経発火にかかわる自発的な発
火定数；Ｌは音量測定値；Ｄは変位定数である。
So・ｎは音響波入力の有無に無関係に起きる自
発的な神経発火率に相当し、DLnは音響波入力に
よる発火率に相当する。 $f = (S_o + DL)\,n$ (1) where $n$ is the amount of neurotransmitter; $S_o$ is a spontaneous firing constant relating to neural firing that occurs independently of the acoustic waveform input; $L$ is a measurement of loudness; and $D$ is a displacement constant. $S_o \cdot n$ corresponds to the spontaneous neural firing rate, which occurs whether or not there is acoustic wave input, and $DLn$ corresponds to the firing rate due to the acoustic wave input.

重要な点は、本発明では、ｎの値は次式により
時間とともに変化するという特徴を有することで
ある。 An important point is that the present invention has the characteristic that the value of n changes with time according to the following equation.

dn／dt＝Ao−（So＋Sh＋DL）ｎ (2) ただし、Aoは補充定数；Shは自発的な神経伝
達物質減衰定数である。式(2)に示す新しい関係
は、神経伝達物質が一定の割合Aoで生成されな
がら、(a)減衰（Sh・ｎ）、(b)自発的な発火（So・
ｎ）、および(c)音響波入力による神経発火（DL・
ｎ）により失われることを考慮している。これら
のモデル化された現象は第９図に示された場所で
起きるものと仮定する。 $dn/dt = A_o - (S_o + S_h + DL)\,n$ (2) where $A_o$ is a replenishment constant and $S_h$ is a spontaneous neurotransmitter decay constant. The new relationship in equation (2) takes into account that neurotransmitter is produced at a constant rate $A_o$ while being lost through (a) decay ($S_h \cdot n$), (b) spontaneous firing ($S_o \cdot n$), and (c) neural firing due to acoustic wave input ($DL \cdot n$). These modeled phenomena are assumed to occur at the locations shown in FIG. 9.

式(2)で明らかなように、神経伝達物質の次量お
よび次発火率が少なくとも神経伝達物質の現量の
自乗に比例しており、本発明の音響プロセツサが
非線形であるという事実を示している。すなわ
ち、状態（ｔ＋△ｔ）での神経伝達物質の量は、
状態（ｔ＋dn／dt・△ｔ）での神経伝達物質の
量に等しい。よつて、ｎ（ｔ＋△ｔ）＝ｎ（ｔ）＋（dn／dt）・△ｔ (3) が成立する。 Equation (2) also reflects the fact that the acoustic processor of the invention is nonlinear: the next amount of neurotransmitter and the next firing rate depend multiplicatively on at least the current amount of neurotransmitter. That is, the amount of neurotransmitter at time $t + \Delta t$ equals the amount at time $t$ plus $(dn/dt)\,\Delta t$. Hence $n(t + \Delta t) = n(t) + (dn/dt)\,\Delta t$ (3) holds.

式(1)，(2)および(3)は、時変信号分析器の動作を
表わす。時変信号分析器は、聴覚器官系が時間に
適応性を有し、聴神経の信号が音響波入力と非直
線的に関連させられるという事実を示している。
ちなみに、本発明の音響プロセツサは、神経系統
の明白な時間的変化によりよく追随するように、
音声認識システムで非線形信号処理を実施する最
初のモデルを提供するものである。 Equations (1), (2) and (3) describe the operation of the time-varying signal analyzer, which reflects the fact that the auditory system is adaptive over time, so that the signals on the auditory nerve are related nonlinearly to the acoustic wave input. In this respect, the acoustic processor of the invention provides the first model to implement nonlinear signal processing in a speech recognition system so as to follow more closely the apparent temporal variations in the nervous system.
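Equations (1) to (3) can be stepped directly, as in the following sketch. The default constants are the values this description quotes for R = 1.5 with Ao = 1; measuring time in centiseconds and using a step of one unit are assumptions made here, as are the function name and the test signal.

```python
def simulate_firing(loudness_per_step, n0, dt=1.0, Ao=1.0,
                    So=0.0888, Sh=0.11111, D=0.00666):
    """Return the firing rate f at each step for a loudness sequence.

    n0 is the initial amount of neurotransmitter. Time units are an
    assumption (centiseconds, one step per unit).
    """
    n = n0
    rates = []
    for L in loudness_per_step:
        rates.append((So + D * L) * n)       # equation (1)
        dn_dt = Ao - (So + Sh + D * L) * n   # equation (2)
        n = n + dn_dt * dt                   # equation (3)
    return rates


# Start from the resting level (dn/dt = 0 with L = 0), then apply a
# constant loudness of 20 sones for 50 steps.
n_rest = 1.0 / (0.0888 + 0.11111)
rates = simulate_firing([20.0] * 50, n0=n_rest)
print(rates[0], rates[-1])
```

The first value is the onset response; later values decay toward the steady-state rate, which is the adaptive behaviour the equations are meant to capture.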

式(1)および(2)において未知の項数を少なくする
ため、本発明では、一定の音量Ｌに適用される次
式を用いる。 In order to reduce the number of unknown terms in equations (1) and (2), the present invention uses the following equation that is applied to a constant volume L.

So＋Sh＋DL＝１／Ｔ (4) ただし、Ｔはオーデイオ波入力が生成された
後、聴覚レスポンスがその最大値の37％に低下す
るまでの時間の測定値である。Ｔは、音量の関数
であり、本発明の音響プロセツサにより、種々の
音量レベルのレスポンスの減衰を表示する既知の
グラフから取出す。すなわち、一定の音量のトー
ンが生成されると、最初、高いレベルのレスポン
スが生じ、その後、レスポンスは時定数Ｔによ
り、安定した状態のレベルに向つて減衰する。音
響波入力がない場合、Ｔ＝T₀である。これは50
ミリ秒程度である。音量がL_naxの場合、Ｔ＝
T_naxである。これは30ミリ秒程度である。Ao＝
１に設定することにより、１／（So＋Sh）は、
Ｌ＝０の場合、５センチ秒と決定される。Ｌが
L_naxで、L_nax＝20ソーンの場合、次式が成立つ。 $S_o + S_h + DL = 1/T$ (4) where $T$ is a measure of the time taken for the auditory response to fall to 37% of its maximum after an audio wave input is produced. $T$ is a function of loudness and, in the acoustic processor of the invention, is derived from known graphs that show the decay of the response for various loudness levels. That is, when a tone of fixed loudness is produced, it initially generates a high-level response, after which the response decays toward a steady-state level with time constant $T$. With no acoustic wave input, $T = T_0$, which is on the order of 50 milliseconds. For a loudness of $L_{max}$, $T = T_{max}$, which is on the order of 30 milliseconds. By setting $A_o = 1$, $1/(S_o + S_h)$ is determined to be 5 centiseconds when $L = 0$. When $L$ is $L_{max}$ and $L_{max} = 20$ sones, the following equation holds.

So＋Sh＋Ｄ（20）＝1/30 (5) 前記データおよび式により、SoおよびShは下
記に示す式(6)および(7)により決まる。 So+Sh+D(20)=1/30 (5) Based on the above data and formula, So and Sh are determined by formulas (6) and (7) shown below.

So＝DL_nax／〔Ｒ＋（DL_naxT₀Ｒ）−１〕 (6) Sh＝１／T₀−So (7) ただし、Ｒ＝ｆ安定状態／ｆ安定状態L_nax／Ｌ＝０ (8) f安定状態は、dn／dtが０の場合、所与の音量
での発火率を表わす。 So=DL _nax / [R+(DL _nax T ₀ R)-1] (6) Sh=1/T ₀ -So (7) However, R=f stable state/f stable state L _nax /L=0 (8 ) f steady state represents the firing rate at a given volume when dn/dt is 0.

R is the only variable remaining in the acoustic processor, so the performance of the processor is changed simply by changing R. That is, R is a single parameter that can be adjusted to alter performance, which normally means minimizing steady-state effects relative to transient effects. Minimizing steady-state effects is desirable because inconsistent output patterns for similar speech inputs are generally caused by differences in frequency response, speaker differences, background noise and distortion, all of which affect the steady-state portions of the speech signal but not the transient portions. The value of R is preferably set by optimizing the error rate of the complete speech recognition system. The optimum value found in this way is R = 1.5. The values of So and Sh are then 0.0888 and 0.11111 respectively, and D is 0.00666.
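The quoted constants can be checked numerically. The following is a minimal sketch, assuming all times are expressed in centiseconds (T₀ = 5, T_max = 3) and that D is obtained by subtracting equation (4) at L = 0 from equation (4) at L = L_max; the function and parameter names are illustrative and do not come from the specification.

```python
def auditory_model_constants(t0=5.0, t_max=3.0, l_max=20.0, r=1.5):
    """Return (So, Sh, D) from equations (4) to (7); times are in centiseconds."""
    # Equation (4) at L = 0 gives So + Sh = 1/T0, and at L = L_max it gives
    # So + Sh + D * L_max = 1/T_max. Subtracting the two isolates D.
    d = (1.0 / t_max - 1.0 / t0) / l_max
    dl_max = d * l_max
    # Equation (6): So in terms of the steady-state ratio R.
    so = dl_max / (r + dl_max * t0 * r - 1.0)
    # Equation (7): Sh is the remainder of 1/T0.
    sh = 1.0 / t0 - so
    return so, sh, d


so, sh, d = auditory_model_constants()
print(so, sh, d)
```

With the default arguments the results agree with the quoted values 0.0888, 0.11111 and 0.00666 to within rounding of the last digit.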

FIG. 18 is a flowchart of the operation of the acoustic processor according to the present invention. Digitized speech, preferably sampled at 20 kHz, is taken in 25.6-millisecond time frames and passed through a Hanning window 1320, whose output is preferably subjected to a discrete Fourier transform in DFT 1322 at 10-millisecond intervals. The transform output is filtered in block 1324 to provide a power density output for each of at least one frequency band (preferably all the critical frequency bands, or at least 20 of them). The power density is then converted from log magnitude to loudness level in block 1326. This conversion is performed either with a modified form of the graph of FIG. 16 or on the basis of thresholds derived by the process outlined later with reference to FIG. 19.
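The framing stage of FIG. 18 can be sketched as follows. This is an illustration only: it assumes a 512-sample frame (25.6 ms at 20 kHz) advanced by 200 samples (10 ms) and a symmetric Hann window. The constant and function names are hypothetical, and the DFT and filtering stages are omitted.

```python
import math

FRAME_LEN = 512   # 25.6 ms at a 20 kHz sampling rate
HOP = 200         # 10 ms between successive frames


def hann(n):
    """Symmetric Hann (Hanning) window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]


def windowed_frames(samples):
    """Yield successive Hann-windowed frames of the sample sequence."""
    window = hann(FRAME_LEN)
    for start in range(0, len(samples) - FRAME_LEN + 1, HOP):
        frame = samples[start:start + FRAME_LEN]
        yield [s * w for s, w in zip(frame, window)]
```

Because the frame is longer than the hop, consecutive frames overlap by 312 samples, so each 10-millisecond output summarizes 25.6 milliseconds of signal.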

In FIG. 19, the threshold of feeling T_f and the threshold of hearing T_h of each filtered frequency band m are first set to 120 dB and 0 dB respectively (block 1340). A speech counter, a total-frames register and a histogram register are then reset (block 1342).

Each histogram contains bins, each of which represents the number of samples (the count) during which the power or a similar measurement in a given frequency band lies within a respective range. In the present invention, a histogram preferably represents, for a given frequency band, the number of centiseconds during which the loudness lies in each of a plurality of loudness ranges. For example, in the third frequency band there may be 20 centiseconds during which the power is between 10 dB and 20 dB. Similarly, in the twentieth frequency band, 150 centiseconds out of a total of 1000 centiseconds may lie between 50 dB and 60 dB. Percentiles are derived from the total number of samples (that is, centiseconds) and the counts held in the bins.

In block 1344, a frame of the filter output for each frequency band is examined, and in block 1346 a bin in the appropriate histogram (one per filter) is incremented. In block 1348, the number of bins in which the amplitude exceeds 55 dB is totalled for each filter (that is, each frequency band) to determine the number of filters that indicate the presence of speech. In block 1350, if fewer than a minimum number of filters (for example, 6 of 20) indicate the presence of speech, the next frame is examined in block 1344. If enough filters indicate the presence of speech, the speech counter is incremented in block 1352. The speech counter is incremented until block 1354 finds that 10 seconds of speech have occurred, at which point new values of T_f and T_h are determined for each filter in block 1356.
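The speech-presence test of blocks 1348 to 1352 can be sketched as below. The sketch assumes one dB level per filter per frame and uses the 55 dB level and the "6 of 20" example from the text; the names are illustrative, not part of the specification.

```python
SPEECH_DB = 55.0   # per-filter level above which a filter indicates speech
MIN_FILTERS = 6    # minimum number of such filters, e.g. 6 of 20


def frame_has_speech(filter_db, threshold_db=SPEECH_DB, min_filters=MIN_FILTERS):
    """True when enough filters in this frame exceed the speech level."""
    active = sum(1 for level in filter_db if level > threshold_db)
    return active >= min_filters


def count_speech_frames(frames):
    """Speech counter: the number of frames classified as containing speech."""
    return sum(1 for filter_db in frames if frame_has_speech(filter_db))
```

At one frame per centisecond, the thresholds would be recomputed once this counter reaches 1000, which corresponds to 10 seconds of speech.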

The new values of T_f and T_h for a given filter are determined as follows. For T_f, the dB value of the bin holding the 35th sample from the top of the 1000 bins (that is, the 96.5th percentile of loudness) is defined as BIN_H, and T_f is set to T_f = BIN_H + 40 dB. For T_h, the dB value of the bin holding the (0.01)(total bins − speech count)-th value from the lowest bin is defined as BIN_L. That is, BIN_L is the bin at 1% of the number of samples in the histogram excluding those classified as speech. T_h is then defined as T_h = BIN_L − 30 dB.
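One possible reading of this threshold update is sketched below. As a simplifying assumption it works on the sorted per-frame dB levels of a single filter rather than on histogram bins, and it rounds the 1% rank to the nearest whole sample; the function name and arguments are hypothetical.

```python
def update_thresholds(levels_db, speech_count, top_rank=35, low_fraction=0.01):
    """Return (T_f, T_h) for one filter.

    levels_db    -- one dB level per frame (per centisecond) for this filter;
                    must hold at least top_rank values
    speech_count -- number of those frames classified as speech
    """
    ordered = sorted(levels_db)
    # BIN_H: level of the 35th sample counted down from the loudest.
    bin_h = ordered[-top_rank]
    # BIN_L: level at the 1% point of the frames not classified as speech,
    # counted up from the quietest.
    low_rank = max(1, round(low_fraction * (len(ordered) - speech_count)))
    bin_l = ordered[low_rank - 1]
    return bin_h + 40.0, bin_l - 30.0
```

Anchoring T_f to a high percentile and T_h to a low percentile of the observed levels adapts the 0 dB to 120 dB compression range to the speaker and channel actually in use.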

In blocks 1330 and 1332 of FIG. 18, the thresholds are updated as described above, and the sound amplitudes are converted to sones and compressed on the basis of the updated thresholds. An alternative way of converting to sones and compressing is to take the filter amplitude a (after the bins have been incremented) and convert it to dB by

$$a^{dB} = 20 \log_{10}(a) - 10 \tag{9}$$

Each filter amplitude is then compressed to the range between 0 dB and 120 dB, so as to give equal loudness, by

$$a^{eql} = 120\,(a^{dB} - T_h)/(T_f - T_h) \tag{10}$$

a^eql is then preferably converted from loudness level (in phons) to an approximation of loudness in sones (with a 1 kHz signal at 40 dB mapped to 1) by

$$L^{dB} = (a^{eql} - 30)/4 \tag{11}$$

The approximate loudness in sones, L_s, is then given by

$$L_s = 10^{L^{dB}/20} \tag{12}$$

In step 1334, L_s is used as the input to equations (1) and (2) to determine the output firing rate f for each frequency band. With 22 frequency bands, a 22-dimensional vector characterizes the acoustic wave input over successive time frames. In general, however, 20 frequency bands are examined using a conventional mel-scaled filter bank.
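The chain of equations (9) to (12) can be written as a single function. This is a sketch under one stated assumption: equation (12) is read as a power of ten, L_s = 10^(L^dB / 20), which is the conventional decibel-to-linear conversion. The function name and default thresholds are illustrative.

```python
import math


def amplitude_to_sones(a, t_h=0.0, t_f=120.0):
    """Approximate loudness in sones for a filter amplitude a > 0."""
    a_db = 20.0 * math.log10(a) - 10.0            # equation (9)
    a_eql = 120.0 * (a_db - t_h) / (t_f - t_h)    # equation (10)
    l_db = (a_eql - 30.0) / 4.0                   # equation (11)
    return 10.0 ** (l_db / 20.0)                  # equation (12), assumed power form
```

With the initial thresholds T_h = 0 dB and T_f = 120 dB, equation (10) leaves a^dB unchanged, so the thresholds only alter the result after they have been updated from the histograms.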

Before the next time frame is processed, the "next state" of n is determined in block 1337 in accordance with equation (3).

The acoustic processor described above needs improvement for applications in which the firing rate f and the neurotransmitter amount n have a large DC pedestal. That is, where the dynamic range of the terms in the equations for f and n matters, the following equations are derived to reduce the height of the pedestal.

In the steady state with no acoustic wave input signal (L = 0), equation (2) can be solved for the steady-state internal state n′ as follows.

$$n' = A/(S_o + S_h) \tag{13}$$

The internal state of the neurotransmitter amount n(t) can be expressed as a steady-state part and a varying part:

$$n(t) = n' + n''(t) \tag{14}$$

Combining equations (1) and (14) gives the firing rate:

$$f(t) = (S_o + D L)(n' + n''(t)) \tag{15}$$

The term So·n′ is a constant, while every other term contains either the varying part of n or the input signal, represented by D·L. Because subsequent processing involves only the squared differences between output vectors, constant terms can be ignored. Dropping So·n′ from equation (15) and substituting n′ from equation (13) gives

$$f''(t) = (S_o + D L)\,n''(t) + \frac{D L A}{S_o + S_h} \tag{16}$$

Taking equation (3) into account, the "next state" becomes:

$$n(t + \Delta t) = n'(t + \Delta t) + n''(t + \Delta t) \tag{17}$$

$$n(t + \Delta t) = n''(t) + A - (S_o + S_h + D L)(n' + n''(t)) \tag{18}$$

$$n(t + \Delta t) = n''(t) - S_h n''(t) - (S_o + A_o L^A)\,n''(t) - \frac{A_o L^A D}{S_o + S_h} + A_o - \frac{(S_o A_o) + (S_h A_o)}{S_o + S_h} \tag{19}$$

Ignoring all constant terms, equation (19) becomes

$$n''(t + \Delta t) = n''(t)(1 - S_o \Delta t) - f''(t) \tag{20}$$

Equations (15) and (20) constitute the output equation and the state update equation applied to each filter during each 10-millisecond time frame. The result of applying these equations is a 20-element vector every 10 milliseconds, each element of which corresponds to the firing rate of a respective frequency band in the mel-scaled filter bank.
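The behaviour of one filter channel can be simulated with the unreduced forms. The sketch below assumes the output f = (So + D·L)·n from equation (15) and a full-state update n ← n + A − (So + Sh + D·L)·n with a step of one frame, which is the form implied by equations (13) and (18). It is an illustration, not the reduced pedestal-free update of equation (20), and the function name is hypothetical.

```python
def simulate_firing_rate(loudness, so=0.0888, sh=0.11111, d=0.00666, a=1.0):
    """Firing rate per frame for a sequence of loudness values in sones."""
    n = a / (so + sh)                        # equation (13): resting state at L = 0
    rates = []
    for level in loudness:
        rates.append((so + d * level) * n)   # equation (15) with n = n' + n''
        n = n + a - (so + sh + d * level) * n   # full-state form of equation (18)
    return rates


# 50 silent frames followed by 200 frames of a 20-sone tone.
rates = simulate_firing_rate([0.0] * 50 + [20.0] * 200)
```

In this simulation the rate jumps at tone onset and then decays toward a steady level about 1.5 times the resting rate, which matches the ratio R = 1.5 chosen for the processor.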

For the embodiment just described, the flowchart of FIG. 18 applies except that the equations for f, dn/dt and n(t + Δt) are replaced by equations (11) and (16), which define the special-case expressions for the firing rate f and the "next state" n(t + Δt) respectively.

The specific values given to the terms of the equations (namely t₀ = 5 csec, t_L = 3 csec, Ao = 1, R = 1.5 and L_max = 20, where t₀ and t_L are the T₀ and T_max of equation (4)) can be set to other values. When they are, the terms So, Sh and D take values different from their preferred values of 0.0888, 0.11111 and 0.00666 respectively.

The present invention can be implemented in various forms of software or hardware.

F6c Precision matching (FIGS. 9 and 20)

FIG. 9 shows a precision matching phoneme machine 2000 as an example. Each such machine is a probabilistic finite-state machine characterized by:
(a) a plurality of states S_i;
(b) a plurality of transitions tr(S_j → S_i), some between different states and some from a state back to itself, each transition having a corresponding probability; and
(c) for each label that can be generated at a particular transition, a corresponding actual label probability.

In FIG. 9, the precision matching phoneme machine 2000 has seven states S₁ to S₇ and thirteen transitions tr1 to tr13, of which the three transitions tr11, tr12 and tr13 are drawn as dashed paths. On each of these three transitions the phoneme can pass from one state to another without generating a label; such transitions are therefore called null transitions. Labels can be generated along transitions tr1 to tr10. More specifically, along each of the transitions tr1 to tr10, one or more labels can each have its own probability of being generated there. For each transition there is a probability associated with every label that the system can generate. That is, if there are 200 labels that can be selectively generated by the acoustic channel, each non-null transition has 200 "actual label probabilities" associated with it, each corresponding to the probability that the corresponding label is generated by the phoneme at that particular transition. The actual label probabilities of transition tr1 are represented in the figure by the symbol P followed by a bracketed column of the numbers 1 to 200, each number standing for a given label. For label 1, there is a probability P[1] that the precision matching phoneme machine 2000 generates label 1 at transition tr1. The various actual label probabilities are stored in association with the label and the corresponding transition.

When a string of labels y₁y₂y₃... is presented to the precision matching phoneme machine 2000 corresponding to a given phoneme, a matching procedure is performed. The procedure associated with the precision matching phoneme machine is described with reference to FIG. 20.

FIG. 20 is a trellis diagram of the phoneme machine of FIG. 9. As in the phoneme machine itself, the trellis diagram shows the null transition from state S₁ to state S₇, the transition from state S₁ to state S₂ and the transition from state S₁ to state S₄. Transitions between other states are also shown. The trellis diagram also shows time, measured in the horizontal direction. The start-time probabilities q₀ and q₁ represent the probabilities that the phoneme has its start time at t = t₀ or t = t₁ respectively. The respective transitions at each of the start times t₀ and t₁ are also shown. The interval between successive start (and end) times is preferably equal in length to the time interval of a label.

In using the precision matching phoneme machine 2000 to determine how closely a given phoneme matches the labels of an incoming string, an end-time distribution for the phoneme is sought and used to determine a matching value for the phoneme. Relying on the end-time distribution to perform the match is common to all of the phoneme machine embodiments described in this specification with regard to the matching procedure. In generating the end-time distribution for precision matching, the precision matching phoneme machine 2000 involves exact and complicated computations.

First, consider with the trellis diagram of FIG. 20 the computations needed for the phoneme to have both its start time and its end time at t = t₀. For the example phoneme machine structure shown in FIG. 9, the following probability applies.

$$\Pr(S_7, t = t_0) = q_0\,T(1 \to 7) + \Pr(S_2, t = t_0)\,T(2 \to 7) + \Pr(S_3, t = t_0)\,T(3 \to 7) \tag{21}$$

Here Pr denotes a probability and T denotes the transition probability between the two states in parentheses. This equation gives the respective probabilities of the three conditions under which the end time can occur at t = t₀; in this example, an end time at t = t₀ is limited to an occurrence in state S₇.

Next, for the end time t = t₁, a computation must be made for every state other than state S₁. State S₁ starts at the end time of the previous phoneme. For purposes of explanation, only the computation for state S₄ is shown.

For S₄, the computation is:

$$\Pr(S_4, t = t_1) = \Pr(S_1, t = t_0)\,T(1 \to 4)\,\Pr(y_1, 1 \to 4) + \Pr(S_4, t = t_0)\,T(4 \to 4)\,\Pr(y_1, 4 \to 4) \tag{22}$$

Equation (22) states that the probability of the phoneme machine being in state S₄ at time t = t₁ is the sum of two terms:
(a) the probability of being in state S₁ at time t = t₀, multiplied by the probability of the transition from state S₁ to state S₄, multiplied further by the probability that the given label y₁ of the string is generated on the transition from state S₁ to state S₄; and
(b) the probability of being in state S₄ at time t = t₀, multiplied by the probability of the transition from state S₄ to itself, multiplied further by the probability that the given label y₁ is generated on the transition from state S₄ to itself.

Similarly, computations are carried out for the other states (excluding state S₁) to produce the corresponding probabilities that the phoneme is in a particular state at time t = t₁. In general, in determining the probability of being in a subject state at a given time, precision matching:
(a) identifies each previous state having a transition that leads to the subject state, together with the probability of each such previous state;
(b) identifies, for each such previous state, a value representing the probability of the label that must be generated on the transition between that previous state and the current state in order to conform to the label string; and
(c) combines the probability of each previous state with the respective value representing the label probability to give the probability of the subject state over the corresponding transition.

The overall probability of being in the subject state is determined from the subject-state probabilities over all the transitions leading to it. The computation for state S₇ includes terms for the three null transitions, which allow the phoneme to start and end at time t = t₁, with the phoneme ending in state S₇.
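The per-state computation described above amounts to one forward step through the trellis. The following is a minimal sketch with hypothetical data structures (dictionaries keyed by state pairs). It assumes that no null transition starts from a state that is itself reached only by another null transition, which holds when all null transitions lead to the final state.

```python
def forward_step(prev, label, transitions, label_probs, null_transitions):
    """Advance state probabilities by one label interval.

    prev             -- {state: probability at the previous time}
    transitions      -- {(i, j): T(i -> j)} for label-producing transitions
    label_probs      -- {(i, j): {label: Pr(label, i -> j)}}
    null_transitions -- {(i, j): T(i -> j)} for transitions emitting no label
    """
    cur = {}
    for (i, j), t_ij in transitions.items():
        # One term of equation (22): previous state, transition, label.
        term = prev.get(i, 0.0) * t_ij * label_probs[(i, j)].get(label, 0.0)
        cur[j] = cur.get(j, 0.0) + term
    # Null transitions consume no label, so they move probability within the
    # same time column, as in equation (21).
    for (i, j), t_ij in null_transitions.items():
        cur[j] = cur.get(j, 0.0) + cur.get(i, 0.0) * t_ij
    return cur
```

For example, with Pr(S₁, t₀) = 0.6, Pr(S₄, t₀) = 0.4, T(1→4) = 0.5, T(4→4) = 0.3 and label probabilities 0.2 and 0.1 for y₁, the two terms of equation (22) are 0.06 and 0.012, giving Pr(S₄, t₁) = 0.072.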

As with the determination of probabilities for times t = t₀ and t = t₁, probabilities are preferably determined for a series of other end times so as to form an end-time distribution. The values of the end-time distribution for a given phoneme indicate how well that phoneme matches the incoming labels.

In determining how well a word matches the incoming labels, the phonemes representing the word are processed in sequence. Each phoneme generates an end-time distribution of probability values. The matching value for a phoneme is obtained by summing its end-time probabilities and taking the logarithm of the sum. The start-time distribution for the next phoneme is derived by normalizing the end-time distribution, for example by dividing each of its values by their sum so that the scaled values sum to 1.
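The two operations applied to an end-time distribution can be sketched directly. The function names are illustrative, and the natural logarithm is an assumption, since the text does not state the base.

```python
import math


def phoneme_match_value(end_time_probs):
    """Matching value of a phoneme: logarithm of the summed end-time probabilities."""
    return math.log(sum(end_time_probs))


def next_start_distribution(end_time_probs):
    """Start-time distribution for the next phoneme: end times scaled to sum to 1."""
    total = sum(end_time_probs)
    return [p / total for p in end_time_probs]
```

Because each phoneme contributes the logarithm of its own total and passes on a normalized distribution, the word score formed by adding the phoneme matching values is the logarithm of the product of those totals.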

There are at least two ways of determining the number h of phonemes to be examined for a given word or word string. In the depth-first method, the computation proceeds along a baseform, computing a running subtotal with each successive phoneme. The computation is terminated if the subtotal is found to be at or below a predetermined threshold for the given phoneme position along the baseform. In the other method, the breadth-first method, the computation is made for similar phoneme positions in each word: the first phoneme of each word, then the second phoneme of each word, and so on. In the breadth-first method, the values computed along the same number of phonemes for the various words are compared at the same relative phoneme position. In either method, the word having the largest sum of matching values is the sought word.
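The depth-first variant can be sketched as a running subtotal with early termination. The names are hypothetical; the threshold list holds one predetermined value per phoneme position.

```python
def depth_first_score(phoneme_values, thresholds):
    """Return the word score, or None if a running subtotal falls to or
    below the threshold for its phoneme position."""
    subtotal = 0.0
    for value, threshold in zip(phoneme_values, thresholds):
        subtotal += value
        if subtotal <= threshold:
            return None   # computation for this word is terminated early
    return subtotal
```

Matching values are logarithms of probabilities and are therefore typically negative, so the subtotal decreases along the baseform and the thresholds are chosen to decrease with phoneme position as well.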

Precision matching has been implemented in APAL (Array Processor Assembly Language), the native assembler of the Floating Point Systems, Inc. 190L. Precision matching requires considerable memory to store each of the actual label probabilities (that is, the probability that a given phoneme generates a given label y at a given transition), the transition probabilities for each phoneme machine, and the probabilities that a given phoneme is in a given state at a given time after a defined start time. The 190L is set up to compute the end times, the matching values (preferably based on the logarithm of the sum of the end-time probabilities), the start times based on the previously generated end-time probabilities, and the word matching scores based on the matching values of the successive phonemes in a word. In addition, precision matching preferably accounts for tail probabilities in the matching procedure. A tail probability measures the likelihood of successive labels independently of words. In a simple embodiment, a given tail probability corresponds to the likelihood of one label following another. This likelihood is readily determined from, for example, strings of labels generated by some sample speech.

Precision matching therefore requires enough storage to hold the baseforms, the statistics of the Markov models and the tail probabilities. For a vocabulary of 5000 words in which each word comprises about 10 phonemes, the baseforms require 5000 × 10 storage locations. With 70 distinct phonemes (each with its own Markov model), 200 distinct labels and 10 transitions at which any label has a probability of being generated, the statistics would require 70 × 10 × 200 locations. However, the phoneme machine is preferably divided into three sections (a start section, a middle section and an end section), with the statistics tables corresponding to them (one of the three self-loops preferably being included in each section). The storage requirement is thereby reduced to 70 × 3 × 200. The tail probabilities require 200 × 200 storage locations. In this arrangement, 50K of integer storage and 82K of floating-point storage operate satisfactorily.
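The storage figures can be reproduced with simple arithmetic. The sketch below assumes that the reduced statistics table holds 70 phoneme models × 3 sections × 200 labels; that reading is an assumption, adopted because it is the one that yields the quoted 82K floating-point total. The variable names are illustrative.

```python
words, phonemes_per_word = 5000, 10
phoneme_models, transitions, sections, labels = 70, 10, 3, 200

baseform_ints = words * phonemes_per_word              # 50,000 integers
full_stat_floats = phoneme_models * transitions * labels   # 140,000 before reduction
stat_floats = phoneme_models * sections * labels       # 42,000 after reduction
tail_floats = labels * labels                          # 40,000

print(baseform_ints, stat_floats + tail_floats)
```

Sharing one statistics table per section rather than per transition cuts the label statistics from 140,000 to 42,000 entries.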

F6d Basic high-speed matching (FIGS. 21 to 23)

Because precision matching is computationally expensive, a basic high-speed matching and an alternative high-speed matching are provided that reduce the required computation without sacrificing much accuracy. High-speed matching is preferably used in conjunction with precision matching: the high-speed matching lists likely candidate words from the vocabulary, and precision matching is then performed, in most cases, on the candidate words in that list.

A high-speed approximate acoustic matching technique is described in the aforementioned U.S. patent application Ser. No. 06/672,974, filed November 19, 1984. In high-speed approximate acoustic matching, each phoneme machine is preferably simplified by replacing the actual label probability of each label at all transitions in a given phoneme machine with a specific replacement value. The specific replacement value is preferably chosen so that the matching value for a given phoneme obtained with the replacement values overestimates the matching value that precision matching would obtain without the replacement. One way of ensuring this condition is to choose each replacement value so that no probability corresponding to the given label in the given phoneme machine is greater than it. Replacing the actual label probabilities in a phoneme machine with the corresponding replacement values greatly reduces the number of computations required in determining a word matching score. Moreover, because each replacement value is preferably an overestimate, the resulting matching score is not less than the score that would have been determined without the replacement.

In a specific embodiment that performs acoustic matching in a linguistic decoder with Markov models, each phoneme machine is formed so as to have:
(a) a plurality of states and transition paths between the states;
(b) transitions tr(S_i → S_j), each having a probability T(i → j) of a transition to state S_j given the current state S_i, where S_i and S_j may be the same state or different states; and
(c) an actual label probability p(y_k | i → j) for each transition from one state to a subsequent state, where k is a label-identifying index.

Each phoneme machine includes:
(a) means for assigning to each y_k in the phoneme machine a single specific value p′(y_k); and
(b) means for replacing each actual output probability p(y_k | i → j) at each transition in a given phoneme machine with the single specific value p′(y_k) assigned to the corresponding y_k.

The replacement value is preferably at least as great as the maximum actual label probability of the corresponding label y_k at any transition in the particular phoneme machine. The high-speed matching embodiment is used to form a list of candidate words, on the order of 10 to 100, selected as the most likely words in the vocabulary to correspond to the incoming labels. The candidate words are preferably then subjected to the language model and to precision matching. By pruning the number of words considered in precision matching to on the order of 1% of the words in the vocabulary, the computational cost is greatly reduced while accuracy is maintained.

The basic high-speed matching simplifies precision matching by replacing with a single value the actual label probabilities of a given label at all transitions at which that label can be generated in a given phoneme machine. That is, regardless of the transition in a given phoneme machine at which a label has a probability of occurring, the probability is replaced with one specific value. This value is preferably an overestimate, at least as great as the largest probability of the label occurring at any transition in the given phoneme machine.

Setting the replacement value of a label to the maximum of the actual label probabilities of that label in a given phoneme machine guarantees that the matching value generated by the basic high-speed matching is at least as large as the matching value that would result from precision matching. Because the basic high-speed matching thus typically overestimates the matching value of each phoneme, more words are generally selected as candidate words. Any word that precision matching would regard as a candidate also passes the basic high-speed matching.
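The choice of replacement values can be sketched as a per-label maximum over the transitions of one phoneme machine. The dictionary layout and the function name are hypothetical.

```python
def replacement_values(label_probs):
    """Return {label: p'(label)} for one phoneme machine.

    label_probs -- {transition: {label: actual label probability}}
    """
    p_prime = {}
    for probs in label_probs.values():
        for label, p in probs.items():
            # Keep the largest probability seen for this label at any transition.
            p_prime[label] = max(p_prime.get(label, 0.0), p)
    return p_prime


machine = {"tr1": {"a": 0.2, "b": 0.5}, "tr2": {"a": 0.4, "b": 0.1}}
p_prime = replacement_values(machine)
```

Since every actual label probability is less than or equal to its replacement value, any product of probabilities along a path through the machine can only grow when the replacement is made, which is why the high-speed matching value never falls below the precision matching value.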

FIG. 21 shows a basic fast matching phoneme machine 3000. Labels (also called symbols or fenemes) enter the basic fast matching phoneme machine 3000 together with a start-time distribution. The start-time distribution and the label string input are like those of the precise matching phoneme machine described above. The start time is sometimes not a distribution over several times but a single exact (phoneme start) time, for example following an interval of silence. When speech is continuous, however, an end-time distribution is used to form the start-time distribution (as explained in detail below). The basic fast matching phoneme machine 3000 generates an end-time distribution and, from it, a matching value for the particular phoneme. The matching score of a word is defined as the sum of the matching values of its constituent phonemes (at least the first h phonemes of the word).

FIG. 22 shows the basic fast matching computation. It involves only the start-time distribution, the number (length) of labels generated by the phoneme, and the replacement value p′(y_k) associated with each label y_k.

By replacing all the actual label probabilities of a given label in a given phoneme machine with the corresponding replacement value, basic fast matching can replace the transition probabilities with length distribution probabilities. It then no longer needs the actual label probabilities (which can differ from transition to transition within a phoneme machine) or the probabilities of being in a given state at a given time.

The length distribution is determined from the precise matching model. Specifically, for each length in the length distribution, the procedure preferably examines each state individually and determines, for each state, the transition paths by which the state currently being examined can be reached (a) given the particular label length and (b) regardless of the outputs along the transitions. The probabilities of all transition paths of the particular length into each such state are summed, and the sums for all such states are then added to give the probability of that length in the distribution. This procedure is repeated for each length. In the preferred form of the invention, these computations are made with reference to a trellis diagram, as is known in the art of Markov modeling. For transition paths that share branches along the trellis structure, the computation for each common branch need be made only once, and the result is applied to each path that includes the common branch.
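One way to read the procedure above is as a dynamic program over a trellis whose columns count labels produced so far. The Python sketch below works under stated assumptions that are not in the specification: the label length of a phoneme is taken to be the number of label-producing transitions on a path from the initial state to the final state, null transitions only lead to higher-numbered states, and the machine, its probabilities, and the name `length_distribution` are invented for illustration.

```python
def length_distribution(num_states, transitions, max_len):
    """Return [l_0, ..., l_max_len]: probability of leaving the machine after k labels.

    transitions: list of (src, dst, prob, emits_label). Assumptions of this sketch:
    state 0 is initial, state num_states - 1 is final with no outgoing
    transitions, and null transitions always go to a higher-numbered state.
    """
    final = num_states - 1
    # alpha[k][s]: probability of being in state s after producing k labels,
    # regardless of which labels were produced.
    alpha = [[0.0] * num_states for _ in range(max_len + 2)]
    alpha[0][0] = 1.0
    for k in range(max_len + 1):
        # Null transitions stay in the same column; forward order makes one pass enough.
        for s in range(num_states):
            for src, dst, prob, emits in transitions:
                if src == s and not emits:
                    alpha[k][dst] += alpha[k][s] * prob
        # Label-producing transitions move to the next column. Each partial sum
        # alpha[k][src] is computed once and reused by every path through it.
        for src, dst, prob, emits in transitions:
            if emits:
                alpha[k + 1][dst] += alpha[k][src] * prob
    return [alpha[k][final] for k in range(max_len + 1)]


# Illustrative three-state machine: a null skip, a self-loop, and two label arcs.
transitions = [
    (0, 1, 0.7, True),
    (0, 2, 0.3, False),
    (1, 1, 0.4, True),
    (1, 2, 0.6, True),
]
lengths = length_distribution(3, transitions, max_len=4)
```

The reuse of `alpha[k][src]` corresponds to computing a shared trellis branch once and applying the result to every path that contains it.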

FIG. 22 includes two limitations by way of example. First, the length of the label string generated by the phoneme is assumed to be 0, 1, 2, or 3, with respective probabilities l₀, l₁, l₂, and l₃. Second, the start time is also limited: only four start times are permitted, with respective probabilities q₀, q₁, q₂, and q₃. That is, L = (l₀, l₁, l₂, l₃) and Q = (q₀, q₁, q₂, q₃) are assumed. Under these limitations, the end-time distribution of the target phoneme is defined by the following equations.

$$
\begin{aligned}
\Phi_0 &= q_0 l_0 \\
\Phi_1 &= q_1 l_0 + q_0 l_1 p_1 \\
\Phi_2 &= q_2 l_0 + q_1 l_1 p_2 + q_0 l_2 p_1 p_2 \\
\Phi_3 &= q_3 l_0 + q_2 l_1 p_3 + q_1 l_2 p_2 p_3 + q_0 l_3 p_1 p_2 p_3 \\
\Phi_4 &= q_3 l_1 p_4 + q_2 l_2 p_3 p_4 + q_1 l_3 p_2 p_3 p_4 \\
\Phi_5 &= q_3 l_2 p_4 p_5 + q_2 l_3 p_3 p_4 p_5 \\
\Phi_6 &= q_3 l_3 p_4 p_5 p_6
\end{aligned}
$$

Examining these equations shows that Φ₃ contains a term for each of the four start times. Its first term is the probability that the phoneme starts at time t = t₃ and generates a label string of length 0 (the phoneme ends as soon as it starts). The second term is the probability that the phoneme starts at time t = t₂, the label length is 1, and label 3 is generated by the phoneme. The third term is the probability that the phoneme starts at time t = t₁, the label length is 2, and labels 2 and 3 are generated by the phoneme. Similarly, the fourth term is the probability that the phoneme starts at time t = t₀, the label length is 3, and the three labels 1, 2, and 3 are generated by the phoneme.

Comparing the computation required for basic fast matching with that required for precise matching shows that the former is considerably simpler. Note that the value of p′(y) stays the same wherever it appears in the equations, as do the label length probabilities. In addition, the limits on length and start time simplify the computation of the later end times. For Φ₆, for example, the phoneme must have started at time t = t₃, and all three labels 4, 5, and 6 must have been generated by the phoneme for that end time to apply.

To generate the matching value for the target phoneme, the end-time probabilities along the resulting end-time distribution are summed, and the logarithm of the sum is taken:

$$\text{matching value} = \log_{10}(\Phi_0 + \cdots + \Phi_6)$$

As noted above, the matching score of a word is readily determined by summing the matching values of the successive phonemes in the word.
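The end-time equations and the matching value can be checked numerically. The following is a minimal Python sketch under the limitations of FIG. 22 (four start times, lengths 0 to 3); the function names and all numeric values are assumptions chosen for illustration, not values from the specification.

```python
import math


def end_time_distribution(q, l, p):
    """Phi_m = sum over start times i and lengths k with i + k = m of
    q[i] * l[k] * p[i+1] * ... * p[m].

    q: start-time probabilities q_0..q_3; l: length probabilities l_0..l_3;
    p: dict mapping label index n (1-based) to its replacement value p_n.
    """
    phi = [0.0] * (len(q) + len(l) - 1)
    for i, q_i in enumerate(q):
        for k, l_k in enumerate(l):
            prod = 1.0
            for n in range(i + 1, i + k + 1):  # labels y_{i+1} .. y_{i+k}
                prod *= p[n]
            phi[i + k] += q_i * l_k * prod
    return phi


def match_value(phi):
    """Base-10 logarithm of the summed end-time probabilities."""
    return math.log10(sum(phi))


# Illustrative inputs (made-up numbers).
q = [0.4, 0.3, 0.2, 0.1]
l = [0.1, 0.2, 0.3, 0.4]
p = {n: 0.5 for n in range(1, 7)}

phi = end_time_distribution(q, l, p)
value = match_value(phi)
```

A word score would then be the sum of such matching values over the phonemes of the word, with each phoneme's start-time distribution derived from the end-time distribution of the phoneme before it.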

Next, the generation of the start-time distribution is described with reference to FIG. 23. In FIG. 23(a), the word THE₁ is shown again, broken down into its constituent phonemes. In FIG. 23(b), the label string is shown along the time axis. FIG. 23(c) shows the first start-time distribution, which is derived from the end-time distribution of the most recent preceding phoneme (in the preceding word, which may be a silence word). From the label input and the start-time distribution of FIG. 23(c), the end-time distribution Φ_DH of the phoneme DH is generated (FIG. 23(d)). The start-time distribution of the next phoneme, UH1, is determined by finding the times at which the end-time distribution of the preceding phoneme exceeds the threshold A in FIG. 23(d). A is determined individually for each end-time distribution and is a function of the sum of the values of the end-time distribution of the phoneme in question. The interval between times a and b is thus the period over which the start-time distribution of the phoneme UH1 is set. In FIG. 23(e), the interval between times c and d corresponds to the period during which the end-time distribution of the phoneme DH exceeds the threshold A and over which the start-time distribution of the next phoneme is set. The values of the start-time distribution are obtained by normalizing the end-time distribution, for example by dividing each end-time value by the sum of the end-time values that exceed the threshold A.
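The thresholding and normalization step can be sketched in Python as follows. The specification says only that the threshold A is a function of the sum of the end-time values; taking A as a fixed fraction of that sum is an assumption of this sketch, as are the function name, the `fraction` parameter, and the numbers.

```python
def start_time_distribution(phi, fraction=0.1):
    """Derive the next phoneme's start-time distribution from end-time values.

    The threshold A is taken here as `fraction` times the sum of phi; the source
    only says A is a function of that sum, so this particular choice is assumed.
    """
    threshold = fraction * sum(phi)
    # Keep only the end times whose probability exceeds the threshold A.
    kept = [v if v > threshold else 0.0 for v in phi]
    total = sum(kept)
    if total == 0.0:
        raise ValueError("no end-time value exceeds the threshold")
    # Normalize by the sum of the values above the threshold.
    return [v / total for v in kept]


# Illustrative end-time distribution of the preceding phoneme (made-up numbers).
phi_dh = [0.01, 0.05, 0.2, 0.3, 0.04]
q_next = start_time_distribution(phi_dh)
```

In this example only the third and fourth end times survive the threshold, so the next phoneme may start only at those two times.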

The basic fast matching phoneme machine 3000 has been implemented in APAL (assembler) code on the Floating Point Systems, Inc. 190L mentioned above. Other hardware and software may also be used to develop specific forms of the invention in accordance with the description herein.

F6e Alternative fast matching (FIGS. 24 and 25)

Basic fast matching, used alone or preferably together with precise matching and/or the language model, greatly reduces the computation required. To reduce it further, the invention also simplifies precise matching by defining a uniform label length distribution between two lengths, a minimum length L_min and a maximum length L_max. In basic fast matching, the probabilities of generating label strings of the various lengths (l₀, l₁, l₂, and so on) generally differ. In alternative fast matching, the probability of each label length is replaced with a single uniform value.

The minimum length is preferably equal to the smallest length that has a non-zero probability in the original length distribution, although another length may be chosen if desired. The choice of the maximum length is more arbitrary than that of the minimum length, but the probability of any length less than the minimum or greater than the maximum is set to zero. Defining the length probability to exist only between the minimum and maximum lengths yields a uniform pseudo-distribution. In one approach, the uniform probability is set to the average probability over the pseudo-distribution. Alternatively, the uniform probability is set to the maximum of the length probabilities that it replaces.

The effect of making all the label length probabilities equal is readily seen from the end-time distribution equations given above for basic fast matching: the length probability can be factored out as a constant.

With L_min set to zero and all the length probabilities replaced with a single constant value l, the end-time distribution can be expressed as

$$\theta_m = \frac{\Phi_m}{l} = q_m + \theta_{m-1}\, p_m \qquad (23)$$

where l is the single uniform replacement value and p_m preferably corresponds to the replacement value of the label generated in the given phoneme at time m.

With this expression for the end-time distribution, the matching value is defined as

$$\text{matching value} = \log_{10}(\theta_0 + \theta_1 + \cdots + \theta_m) + \log_{10}(l) \qquad (24)$$

Comparing basic fast matching with alternative fast matching shows that the alternative fast matching phoneme machine greatly reduces the number of additions and multiplications required. With L_min = 0, basic fast matching required 40 multiplications and 20 additions, because the length probabilities had to be taken into account. With alternative fast matching, θ_m is determined recursively, so each successive θ_m requires only one multiplication and one addition.
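The recursion can be sketched in Python as follows, writing θ_m for the end-time value Φ_m divided by the uniform length probability l. The sketch assumes L_min = 0 and no maximum length; the function names, the choice l = 0.25, and the other numbers are illustrative assumptions rather than values from the specification.

```python
import math


def theta_sequence(q, p, num_end_times):
    """theta_m = q_m + theta_{m-1} * p_m with L_min = 0 and no maximum length.

    q: start-time probabilities q_0..q_n (treated as 0 beyond the list);
    p: dict mapping label index m (1-based) to its replacement value p_m.
    """
    theta = []
    prev = 0.0
    for m in range(num_end_times):
        q_m = q[m] if m < len(q) else 0.0
        # One multiplication and one addition per end time.
        prev = q_m + (prev * p[m] if m > 0 else 0.0)
        theta.append(prev)
    return theta


def alt_match_value(theta, uniform_l):
    """log10 of the summed theta values plus log10 of the uniform length probability."""
    return math.log10(sum(theta)) + math.log10(uniform_l)


# Illustrative inputs (made-up numbers).
q = [0.4, 0.3, 0.2, 0.1]
p = {m: 0.5 for m in range(1, 7)}

theta = theta_sequence(q, p, num_end_times=7)
value = alt_match_value(theta, uniform_l=0.25)
```

Once the start-time probabilities are exhausted (here after q₃), each further θ_m is just the previous value times p_m, so the addition drops out.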

FIGS. 24 and 25 show in detail how alternative fast matching simplifies the computation. FIG. 24a shows an embodiment of a phoneme machine 3100 corresponding to a minimum length L_min = 0. The maximum length is assumed to be infinite so that the length distribution is uniform. FIG. 24b shows the trellis diagram resulting from the phoneme machine 3100. Assuming that start times after q_n lie outside the start-time distribution, each successive θ_m (that is, Φ_m divided by the uniform length probability) with m < n requires one addition and one multiplication. Determining end times after that requires only one multiplication and no addition. FIG. 25a shows an embodiment of a specific phoneme machine 3200 with minimum length L_min = 4, and FIG. 25b shows the corresponding trellis diagram. Because L_min = 4, the trellis diagram of FIG. 25b has zero probability along the paths marked U, V, W, and Z. For end times between θ₄ and θ_n, four multiplications and one addition are required. For end times greater than n + 4, only one multiplication and no addition are required. This embodiment has been implemented in APAL code on the FPS 190L.

In accordance with the invention, additional states may be added to the embodiments of FIG. 24 or FIG. 25 as desired. For example, any number of states with null transitions may be included without changing the value of L_min.

F6f Matching based on the first J labels (FIG. 25)

The invention contemplates a further refinement of basic fast matching and alternative fast matching in which only the first J labels of the string entering the phoneme machine are considered in the match. Assuming that the acoustic processor of the acoustic channel generates labels at a rate of one label per centisecond, a reasonable value for J is 100. In other words, labels corresponding to about one second of speech are supplied to determine the match between a phoneme and the labels entering the phoneme machine. Limiting the number of labels examined has two advantages. First, decoding delay is reduced. Second, the problems of comparing the scores of short words with those of long words are largely avoided. The value of J can, of course, be changed as desired.

The effect of limiting the number of labels examined can be seen in the trellis diagram of FIG. 25b. Without this refinement, the fast matching score is the sum of the probabilities Φ_m along the bottom row of the diagram. That is, the probability of being in state S₄ at each time, starting at t = t₀ (for L_min = 0) or t = t₄ (for L_min = 4), is determined as a Φ_m, and all the Φ_m are then summed. For L_min = 4, the probability of being in state S₄ at any time before t₄ is zero. With the refinement, the summing of the Φ_m ends at time J. In FIG. 25b, time J corresponds to time t_{n+2}.

Ending the examination after J labels over J time intervals yields two probability sums in determining the matching score. First, as described above, there is a row computation along the bottom row of the trellis diagram, but only up to time J − 1. The probabilities of being in state S₄ at each time up to time J − 1 are summed to form a row score. Second, there is a column score, which is the sum of the probabilities that the phoneme is in each of the states S₀ to S₄ at time J. The column score is computed as

$$\text{column score} = \sum_{f=0}^{4} \Pr(S_f, J) \qquad (25)$$

The matching score of the phoneme is obtained by adding the row score and the column score and taking the logarithm of the sum. To continue fast matching with the next phoneme, the values along the bottom row (preferably including time J) are used to derive the start-time distribution of the next phoneme.

After a matching score has been determined for each of the J consecutive phonemes, the total for all the phonemes is, as noted above, the sum of the matching scores of those phonemes.

Examining how the end-time probabilities are generated in the basic and alternative fast matching embodiments described above shows that the determination of the column score does not fit readily into the fast matching computation. To adapt the refinement that limits the number of labels examined to basic fast matching and alternative fast matching, the invention allows the column score to be replaced with an additional row score. That is, an additional row score is determined for the phoneme being in state S₄ (in FIG. 25a) between times J and J + K, where K is the maximum number of states in any phoneme machine. Thus, if a phoneme machine can have as many as 10 states, the refinement adds 10 end times along the bottom row of the trellis diagram, and a probability is determined for each. All the probabilities along the bottom row up to and including time J + K are added to produce the matching score of the given phoneme. As before, the matching values of successive phonemes are summed to obtain the matching score of a word.

This embodiment has been implemented in APAL code on the FPS 190L mentioned above; as with other parts of the invention, it may also be implemented with other code on other hardware.

F6g Phoneme tree structure and fast matching embodiments (FIG. 26)

Using basic fast matching or alternative fast matching, with or without the maximum label limit, reduces the computation time needed to determine phoneme matching values. Moreover, the computational savings remain large even when precise matching is then performed on the words in the list obtained by fast matching.

Once determined, the phoneme matching values are compared along the branches of a tree structure 4100, as shown in FIG. 26, to determine which paths of phonemes are most probable. In FIG. 26, for the spoken word "the", the phoneme matching values of the phonemes DH and UH1 (from point 4102 along branch 4104) should sum to a much higher value than do the various sequences of phonemes branching from the phoneme MX. Note that the phoneme matching value of the first phoneme MX is computed only once and is then used for every base form extending from it (see branches 4104 and 4106). In addition, when the total score computed along a first sequence of branches is found to be well below a threshold, or well below the total scores of other sequences of branches, all base forms extending from that first sequence may be eliminated as candidate words at the same time. For example, the base forms associated with branches 4108 to 4118 are discarded together once it is determined that MX is not a likely path. With the fast matching embodiments and the tree structure, an ordered list of candidate words is generated with great computational savings.
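The sharing and pruning described above can be sketched in Python. This is a deliberately simplified illustration: each phoneme's matching value is looked up in a fixed table, whereas in the system described it depends on the labels and on the start-time distribution. The base forms, phoneme names other than DH, UH1, and MX, the table values, the threshold, and the function names are all hypothetical.

```python
def build_tree(base_forms):
    """Build a phoneme tree (nested dicts) from word -> phoneme sequence."""
    root = {"children": {}, "words": []}
    for word, phonemes in base_forms.items():
        node = root
        for ph in phonemes:
            node = node["children"].setdefault(ph, {"children": {}, "words": []})
        node["words"].append(word)
    return root


def candidates(root, match_value, threshold):
    """Walk the tree, scoring each arc once and dropping weak branches whole."""
    found, evaluations = [], 0
    stack = [(root, 0.0)]
    while stack:
        node, score = stack.pop()
        for word in node["words"]:
            found.append((score, word))
        for ph, child in node["children"].items():
            evaluations += 1  # one evaluation per arc, shared by all words below it
            total = score + match_value[ph]
            if total >= threshold:  # otherwise every base form below is discarded
                stack.append((child, total))
    return sorted(found, reverse=True), evaluations


# Hypothetical base forms and log-domain matching values.
base_forms = {
    "the": ["DH", "UH1"],
    "they": ["DH", "EY"],
    "me": ["MX", "IY"],
    "man": ["MX", "AE", "N"],
}
match_value = {"DH": -1.0, "UH1": -1.5, "EY": -4.0, "MX": -6.0,
               "IY": -1.0, "AE": -1.0, "N": -1.0}

ranked, evaluations = candidates(build_tree(base_forms), match_value, threshold=-5.5)
```

Here the MX branch falls below the threshold after a single evaluation, so "me" and "man" are dropped together, and only four arcs are scored instead of the nine phonemes that scoring each word separately would require.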

As for storage requirements, the phoneme tree structure, the phoneme statistics, and the tail probabilities must be stored. The tree structure has 25,000 arcs, each characterized by four data words. The first data word is an index to a successor arc or phoneme. The second data word gives the number of successor phonemes along the branch. The third data word indicates the node of the tree at which the arc is located. The fourth data word indicates the current phoneme. The tree structure therefore requires 25,000 × 4 storage locations. In fast matching there are 100 distinct phonemes and 200 distinct fenemes. Because a feneme has a single probability of being produced anywhere in a phoneme, storage for 100 × 200 statistical probabilities is required. The tail structure requires 200 × 200 storage locations. For fast matching, 100K of integer storage and 60K of floating-point storage are therefore sufficient.
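The storage totals follow from simple arithmetic, shown here as a short Python check. It assumes that the tree data words are the integer storage and that the phoneme statistics and tail probabilities together make up the floating-point storage; that split is an inference from the totals, and the variable names are illustrative.

```python
ARCS, WORDS_PER_ARC = 25_000, 4
PHONEMES, FENEMES = 100, 200

tree_integers = ARCS * WORDS_PER_ARC       # tree structure: four data words per arc
label_probabilities = PHONEMES * FENEMES   # one probability per (phoneme, feneme) pair
tail_probabilities = FENEMES * FENEMES     # tail structure
floats = label_probabilities + tail_probabilities
```

With these figures, `tree_integers` comes to 100,000 and `floats` to 60,000, matching the 100K and 60K stated in the text.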

F6h Language model (FIG. 7)

As noted above, the probability of selecting the correct word can be increased by including a language model that stores information, such as trigrams, about words in context. Language models are described in the paper cited above.

The language model 1010 (FIG. 7) preferably has a distinctive character. Specifically, a modified trigram method is used. In accordance with the invention, a sample text is examined to determine the likelihood of each ordered triplet of words, each ordered pair of words, and each single word in the vocabulary. Lists of the most likely word triplets and the most likely word pairs are then formed. In addition, the likelihood of a triplet not being in the triplet list and the likelihood of a pair not being in the pair list are each determined.

Under the language model, when a target word follows two words, it is determined whether the target word and the two preceding words are in the triplet list. If they are, the stored probability assigned to that triplet is used. If the target word and its two predecessors are not in the triplet list, it is determined whether the target word and the immediately preceding word are in the pair list. If they are, the probability of the pair is multiplied by the probability of a triplet not being in the triplet list, and the product is assigned to the target word. If the triplet and the pair that include the target word are in neither the triplet list nor the pair list, the probability of the target word alone is multiplied by the probability of a triplet not being in the triplet list and by the probability of a pair not being in the pair list, and the product is assigned to the target word.
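The three-way lookup can be sketched in Python as follows. The dictionaries standing in for the triplet list, pair list, and single-word probabilities, the two "not in list" probabilities, the function name, and every number are hypothetical values chosen for illustration.

```python
def word_probability(w1, w2, w3, triplets, pairs, singles,
                     p_triplet_unlisted, p_pair_unlisted):
    """Probability assigned to target word w3 given the two preceding words w1, w2."""
    if (w1, w2, w3) in triplets:
        # The triplet is listed: use its stored probability.
        return triplets[(w1, w2, w3)]
    if (w2, w3) in pairs:
        # Fall back to the pair, scaled by the chance of an unlisted triplet.
        return pairs[(w2, w3)] * p_triplet_unlisted
    # Fall back to the single word, scaled by both "not in list" probabilities.
    return singles[w3] * p_triplet_unlisted * p_pair_unlisted


# Hypothetical lists and probabilities.
triplets = {("to", "be", "or"): 0.3}
pairs = {("be", "or"): 0.1}
singles = {"or": 0.01, "two": 0.002}

p_listed = word_probability("to", "be", "or", triplets, pairs, singles, 0.2, 0.1)
p_pair = word_probability("we", "be", "or", triplets, pairs, singles, 0.2, 0.1)
p_single = word_probability("to", "be", "two", triplets, pairs, singles, 0.2, 0.1)
```

The three calls exercise the triplet, pair, and single-word cases in turn.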

F6i Training with approximations (FIG. 27)

The flowchart 5000 of FIG. 27 shows the training of the phoneme machines used in acoustic matching. In block 5002, a vocabulary of words (typically on the order of 5000 words) is defined. In block 5004, each word is represented by a sequence of phoneme machines. The phoneme machines are shown, by way of example, as phonetic phoneme machines, but may alternatively be a sequence of fenemic phonemes. The representation of words by a sequence of phonetic phoneme machines or of fenemic phoneme machines is described below. The phoneme machine sequence of a word is called the word base form.

In block 5006, the word base forms are arranged in the tree structure described above. The statistics for each phoneme machine in each word base form are determined by training with the well-known forward-backward algorithm (block 5008), as set out in F. Jelinek, "Continuous Speech Recognition by Statistical Methods", Proceedings of the IEEE, Vol. 64, 1976, pp. 532-556.

In block 5009, the values to be substituted for the actual parameter values, or statistics, used in precise matching are determined. For example, the values to be substituted for the actual label output probabilities are determined. In block 5010, the determined values replace the stored actual probabilities, so that the phonemes in each word base form contain the approximate replacement values. All the approximations relating to basic fast matching are performed in block 5010.

Next, in block 5011, it is decided whether the acoustic matching is to be approximated further, that is, enhanced. If not, block 5012 sets the values determined for the basic approximate matching for use, and no other approximations are applied. If enhancement is desired, block 5018 defines a uniform string length distribution. Block 5020 then decides whether further enhancement is desired. If not, block 5012 sets the approximated label output probability values and string length probability values for use in acoustic matching. If further enhancement is desired, block 5022 limits acoustic matching to the first J labels of the generated string. Whether or not one of the enhanced embodiments is selected, the parameter values determined are set in block 5012. At that point, each phoneme machine in each word base form has been trained with the desired approximations.
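The decision flow of blocks 5010 to 5022 can be summarized in a few lines of Python. This is only a restatement of the flowchart as described in the text; the function name, the two boolean parameters, and the step descriptions are illustrative and are not part of the specification.

```python
def select_approximations(enhance, enhance_further):
    """Return the approximation steps applied, following blocks 5010 to 5022."""
    steps = ["replace label output probabilities (block 5010)"]
    if enhance:                      # decision in block 5011
        steps.append("uniform string length distribution (block 5018)")
        if enhance_further:          # decision in block 5020
            steps.append("limit matching to first J labels (block 5022)")
    # Every path ends by setting the parameter values for use.
    steps.append("set parameter values (block 5012)")
    return steps


basic_only = select_approximations(enhance=False, enhance_further=False)
all_steps = select_approximations(enhance=True, enhance_further=True)
```

The nesting makes explicit that the limit to the first J labels is reached only after the uniform length distribution has been chosen.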

F6j Extending word paths with words selected by acoustic matching (FIGS. 7, 10 to 12, and 28)

A preferred stack decoding method used in the speech recognition system of FIG. 7 is described next.

FIGS. 10 and 11 show a plurality of successive labels y₀ y₁ ... generated at successive "label intervals", or "label positions".

FIG. 11 also shows a plurality of generated word paths, namely path A, path B, and path C. In the context of FIG. 10, path A could correspond to the entry "to be or", path B to the entry "two b", and path C to the entry "too". For a given word path, there is a label (or, equivalently, a label interval) at which that word path has the highest probability of having ended. Such a label is called a "boundary label".

For a word path W representing a sequence of words, the most likely end time (represented in the label string as a "boundary label" between two words) can be found by known methods such as that described in L. R. Bahl et al., "Faster Acoustic Match Computation," IBM Technical Disclosure Bulletin, Vol. 23, No. 4, September 1980. Briefly, the article describes a method for addressing two important questions:

(a) how much of a label string Y is accounted for by a word (or word sequence), and

(b) at which label interval a partial sentence (corresponding to a part of the label string) ends.

For any given word path, there is a "likelihood value" associated with each label, or label interval, from the first label of the label string through the boundary label. Taken together, all of the likelihood values for a given word path represent the "likelihood vector" of that word path. Accordingly, for each word path there is a corresponding likelihood vector. Likelihood values L_t are illustrated in FIG. 11.

The "likelihood envelope" Λ_t at label interval t for a collection of word paths W¹, W², …, W^s is defined mathematically as follows.

Λ_t = max(L_t(W¹), …, L_t(W^s))

That is, for each label interval, the likelihood envelope contains the highest likelihood value associated with any word path in the collection. A likelihood envelope 8040 is illustrated in FIG. 11.
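By way of illustration only, the envelope definition above can be sketched in Python. The function name `likelihood_envelope`, the sample likelihood values, and the convention that a path contributes nothing (−∞) beyond its own boundary label are assumptions of this sketch and are not taken from the specification.

```python
import math


def likelihood_envelope(likelihood_vectors, num_intervals):
    """Return the element-wise maximum over a collection of likelihood vectors.

    Each vector holds one log-likelihood per label interval, from the first
    label through that path's boundary label. Intervals beyond a path's
    boundary label are treated as -inf (an assumption of this sketch).
    """
    envelope = [-math.inf] * num_intervals
    for vector in likelihood_vectors:
        for t, value in enumerate(vector[:num_intervals]):
            envelope[t] = max(envelope[t], value)
    return envelope


# Hypothetical log-likelihood vectors for three word paths of different length.
paths = {
    "A": [-1.0, -2.5, -3.0, -4.2],
    "B": [-1.2, -2.0, -3.5],
    "C": [-0.8, -2.6],
}
env = likelihood_envelope(paths.values(), num_intervals=4)
```

With these sample values, each entry of `env` is the largest likelihood that any of the three paths attains at the corresponding label interval.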

A word path is considered "complete" if it corresponds to a complete sentence. A complete path is preferably identified by the speaker providing an input, for example by pressing a button, on reaching the end of a sentence. The entered input is synchronized with the label interval that marks the sentence end. A complete word path cannot be extended by appending words to it. A "partial" word path corresponds to an incomplete sentence and can be extended.

Partial paths are classified as "live" or "dead." A word path is "dead" if it has already been extended and "live" if it has not yet been extended. With this classification, a path that has already been extended to form one or more longer word paths is not considered again for extension at a later time.

Each word path can also be characterized as "good" or "bad" relative to the likelihood envelope. A word path is good if, at the label corresponding to its boundary label, its likelihood value lies within the maximum likelihood envelope. Otherwise, the word path is bad. Preferably, though not necessarily, each value of the maximum likelihood envelope is reduced by a fixed amount, and the reduced envelope serves as the good/bad threshold level.

There is a stack element for each label interval. Each live word path is assigned to the stack element corresponding to the label interval of its boundary label. A stack element may have zero, one, or more word path entries, listed in order of likelihood value.
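The assignment of live word paths to stack elements can be sketched as follows. This is a minimal illustration, not the decoder's actual data layout: the `WordPath` class, its field names, the `build_stacks` helper, and the convention that the boundary label is the index of the last stored likelihood value are all assumptions introduced here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class WordPath:
    words: tuple             # the word sequence, e.g. ("to", "be", "or")
    likelihoods: list        # one log-likelihood per label interval up to the boundary label
    extended: bool = False   # True once the path has been extended ("dead")

    @property
    def boundary(self):
        # Assumption: the boundary label is the index of the last likelihood value.
        return len(self.likelihoods) - 1

    @property
    def boundary_likelihood(self):
        return self.likelihoods[-1]


def build_stacks(paths):
    """Group live paths by boundary label; order each stack by likelihood, best first."""
    stacks = defaultdict(list)
    for path in paths:
        if not path.extended:          # dead paths are not entered in any stack
            stacks[path.boundary].append(path)
    for entries in stacks.values():
        entries.sort(key=lambda p: p.boundary_likelihood, reverse=True)
    return dict(stacks)


# Hypothetical paths; the likelihood values are invented for the example.
a = WordPath(("to", "be", "or"), [-1.0, -2.0, -3.0, -4.0])
b = WordPath(("two", "b"), [-1.2, -2.2, -3.1])
c = WordPath(("too",), [-1.1, -2.9, -3.4])
d = WordPath(("to",), [-0.9, -1.5], extended=True)
stacks = build_stacks([a, b, c, d])
```

Here paths `b` and `c` share the stack for label interval 2, with `b` listed first because its boundary likelihood is higher, while the dead path `d` appears in no stack.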

Next, the steps performed by the stack decoder 1002 of FIG. 7 are described.

Forming the likelihood envelope and determining which word paths are "good" are interrelated, as shown in the flowchart of the stack decoding technique in FIG. 12.

In the flowchart of FIG. 12, a null path is first entered into the first stack (0) at block 5050. At block 5052, a (complete) stack element containing previously determined complete paths, if any, is provided. Each complete path in the (complete) stack element has a likelihood vector associated with it. The likelihood vector of the complete path having the highest likelihood at its boundary label initially defines the maximum likelihood envelope. If there is no complete path in the (complete) stack element, the maximum likelihood envelope is initialized to −∞ at each label interval. The maximum likelihood envelope may likewise be initialized to −∞ if complete paths are not specified. The envelope is initialized at blocks 5054 and 5056.

After the maximum likelihood envelope is initialized, it is reduced by a predefined amount Δ, forming a Δ-good region above the reduced likelihoods and a Δ-bad region below them. The larger Δ is, the larger the number of word paths considered eligible for extension. When log₁₀ is used to determine L_t, a value of 2 for Δ gives satisfactory results. The value of Δ is preferably, but not necessarily, uniform along the length of the label intervals.
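A short sketch may help make the role of Δ concrete. Because L_t is a base-10 logarithm, Δ = 2 means a path stays in the good region as long as its likelihood is no less than 1/100 of the envelope value at its boundary label. The helper names `reduced_envelope` and `is_good`, the sample values, and the choice to count a likelihood exactly on the threshold as good are assumptions of this sketch.

```python
import math

DELTA = 2.0  # in log10 units, the value the text reports as satisfactory


def reduced_envelope(envelope, delta=DELTA):
    """Lower every envelope value by delta to obtain the good/bad threshold."""
    return [value - delta for value in envelope]


def is_good(boundary, boundary_likelihood, envelope, delta=DELTA):
    """A path is in the delta-good region if it is at or above the threshold
    at its boundary label (treating equality as good is an assumption)."""
    return boundary_likelihood >= envelope[boundary] - delta


envelope = [-1.0, -2.0, -3.0]         # hypothetical log10 envelope values
threshold = reduced_envelope(envelope)
ratio = 10 ** -DELTA                  # likelihood ratio corresponding to DELTA

near = is_good(2, -4.5, envelope)     # 1.5 below the envelope: inside the good region
far = is_good(2, -5.5, envelope)      # 2.5 below the envelope: in the bad region
fresh = is_good(0, -9.0, [-math.inf]) # an envelope at -inf rejects nothing
```

Note that an envelope initialized to −∞ stays at −∞ after the reduction, so the first path examined is always marked good.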

As shown in FIG. 12, the loop that updates the likelihood envelope and marks word paths as "good" (eligible for extension) or "bad" begins at block 5058, which searches for the longest unmarked word path. If two or more unmarked word paths are in the stack corresponding to the longest word path length, the word path having the highest likelihood at its boundary label is selected. If a word path is found, block 5060 checks whether its likelihood at the boundary label lies within the Δ-good region. If it does not, block 5062 marks the path as being in the Δ-bad region, and block 5058 searches for the next unmarked live path. If it does, block 5064 marks the path as being in the Δ-good region, and block 5066 updates the likelihood envelope to include the likelihood values of the path marked "good." That is, for each label interval, the updated likelihood value is determined as the greater of

(a) the present likelihood value in the likelihood envelope, and

(b) the likelihood value associated with the word path marked "good."

This operation is performed at blocks 5064 and 5066. After the envelope is updated, control returns to block 5058 to search again for the longest, best unmarked live word path.
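The marking loop of blocks 5058 to 5066 can be sketched as below. This is an illustration under stated assumptions rather than the decoder's implementation: paths are plain dictionaries with hypothetical keys `name` and `likelihoods`, "longest" is read as "latest boundary label," and a likelihood exactly on the threshold counts as good. Because the order in which unmarked paths are picked depends only on path length and boundary likelihood, a single sorted pass is equivalent to repeatedly searching for the longest unmarked path.

```python
import math


def mark_paths(live_paths, envelope, delta=2.0):
    """Mark each live path 'good' or 'bad' and return (marks, updated envelope)."""
    envelope = list(envelope)  # work on a copy
    marks = {}
    # Block 5058: longest path first, ties broken by highest boundary likelihood.
    order = sorted(
        live_paths,
        key=lambda p: (len(p["likelihoods"]), p["likelihoods"][-1]),
        reverse=True,
    )
    for path in order:
        boundary = len(path["likelihoods"]) - 1
        # Block 5060: is the boundary likelihood inside the delta-good region?
        if path["likelihoods"][-1] >= envelope[boundary] - delta:
            marks[path["name"]] = "good"          # block 5064
            # Block 5066: envelope becomes the greater of old and new values.
            for t, value in enumerate(path["likelihoods"]):
                envelope[t] = max(envelope[t], value)
        else:
            marks[path["name"]] = "bad"           # block 5062
    return marks, envelope


# Hypothetical live paths and an envelope initialized to -inf (no complete paths).
live = [
    {"name": "A", "likelihoods": [-1.0, -2.0, -3.0, -4.0]},
    {"name": "B", "likelihoods": [-1.5, -2.5, -6.5]},
    {"name": "C", "likelihoods": [-0.5, -3.0]},
]
marks, env = mark_paths(live, [-math.inf] * 4)
```

In this example path A is examined first and sets the envelope, path B falls more than Δ below the envelope at its boundary label and is marked bad, and path C is marked good and raises the envelope at the first label interval.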

This loop repeats until no unmarked word paths remain. At that point, block 5070 selects the shortest word path marked "good." If two or more "good" word paths share the shortest length, the one having the highest likelihood at its boundary label is selected at block 5072, and the selected shortest path is then extended. That is, at least one likely follower word is determined, as described above, preferably by performing the fast matching, language model, precise matching, and language model procedures. For each likely follower word, an extended word path is formed. Specifically, an extended word path is formed by appending a likely follower word to the end of the selected shortest word path.

After the selected shortest word path has been used to form extended word paths, it is removed from the stack in which it was an entry, and each extended word path is inserted into the appropriate stack in its place. In particular, an extended word path becomes an entry in the stack corresponding to its boundary label (block 5072).
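The selection and replacement step of blocks 5070 and 5072 might look as follows in outline. The dictionary layout, the `extend_shortest_good` function, and the `propose_followers` callback (which stands in for the whole acoustic matching and language model procedure and simply returns invented values here) are hypothetical and introduced only for this sketch.

```python
def extend_shortest_good(stacks, marks, propose_followers):
    """Extend the shortest 'good' path and re-file the results in the stacks.

    stacks: dict mapping boundary label -> list of paths (dicts with 'words', 'likelihoods')
    marks:  dict mapping a path's word tuple -> 'good' or 'bad'
    propose_followers: callable returning [(word, likelihoods of the added labels), ...]
    """
    good = [p for entries in stacks.values() for p in entries
            if marks.get(p["words"]) == "good"]
    if not good:
        return None
    # Block 5070/5072: shortest path, ties broken by highest boundary likelihood.
    chosen = min(good, key=lambda p: (len(p["likelihoods"]), -p["likelihoods"][-1]))
    stacks[len(chosen["likelihoods"]) - 1].remove(chosen)   # the chosen path is now dead
    for word, added in propose_followers(chosen):
        extended = {"words": chosen["words"] + (word,),
                    "likelihoods": chosen["likelihoods"] + added}
        boundary = len(extended["likelihoods"]) - 1
        stacks.setdefault(boundary, []).append(extended)
        stacks[boundary].sort(key=lambda p: p["likelihoods"][-1], reverse=True)
    return chosen


# Hypothetical state: two good paths in different stacks.
stacks = {
    1: [{"words": ("to",), "likelihoods": [-1.0, -2.0]}],
    2: [{"words": ("two",), "likelihoods": [-1.0, -2.0, -3.5]}],
}
marks = {("to",): "good", ("two",): "good"}


def stub_followers(path):
    # Stand-in for the matching procedures; the values are invented.
    return [("be", [-2.5, -3.0]), ("bee", [-2.8])]


chosen = extend_shortest_good(stacks, marks, stub_followers)
```

After the call, the path "to" has left stack 1, "to bee" has joined "two" in stack 2 ahead of it in likelihood order, and "to be" is the sole entry of a new stack 3.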

The operation of extending the selected path at block 5072 is now described with reference to the flowchart of FIG. 12. After a path is found at block 5070, the procedure shown in FIG. 28 is carried out, and the word path is extended on the basis of an appropriate approximate matching.

At block 6000 of FIG. 28, the acoustic processor 1004 (of FIG. 7) generates a string of labels. The string of labels is supplied as input to block 6002, where one of the basic or improved approximate matching procedures is performed to obtain an ordered list of candidate words. Then, at block 6004, the language model described above is applied as described above. At block 6006, the subject words remaining after the language model is applied are passed, together with the generated labels, to the precise matching processor. At block 6008, the precise matching produces a list of remaining candidate words, which are preferably submitted to the language model. The likely words (as determined by the approximate matching, precise matching, and language model) are used to extend the path found at block 5070 of FIG. 12. Each of the likely words determined at block 6008 (FIG. 28) is separately appended to the found word path, so that a plurality of extended word paths can be formed (block 6010).
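The order of the stages in FIG. 28 can be summarized in a short sketch. The three scoring callbacks, the `top_n` cutoff, the treatment of the language model as a simple accept/reject test, and all scores are placeholders invented for this illustration; the actual matching procedures are those described in the earlier sections.

```python
def likely_followers(labels, vocabulary, approx_match, language_model,
                     precise_match, top_n=2):
    """Chain the stages of FIG. 28 and return the surviving follower words."""
    # Block 6002: approximate matching yields an ordered candidate list.
    ordered = sorted(vocabulary, key=lambda w: approx_match(w, labels), reverse=True)
    # Block 6004: the language model removes unlikely candidates.
    survivors = [w for w in ordered if language_model(w)]
    # Blocks 6006 and 6008: precise matching rescores the survivors.
    rescored = sorted(survivors, key=lambda w: precise_match(w, labels),
                      reverse=True)[:top_n]
    # The remaining candidates are submitted to the language model once more.
    return [w for w in rescored if language_model(w)]


def extend_path(path, followers):
    """Block 6010: append each likely word separately to the found path."""
    return [path + (word,) for word in followers]


# Placeholder scores standing in for the real matching procedures.
approx_scores = {"be": -1.0, "bee": -1.5, "by": -4.0, "or": -2.0}
precise_scores = {"be": -1.2, "bee": -3.0, "or": -2.5}
allowed_by_lm = {"be", "bee", "or"}

followers = likely_followers(
    labels=["y0", "y1", "y2"],
    vocabulary=["be", "bee", "by", "or"],
    approx_match=lambda w, labels: approx_scores[w],
    language_model=lambda w: w in allowed_by_lm,
    precise_match=lambda w, labels: precise_scores[w],
)
extended = extend_path(("to",), followers)
```

The point of the arrangement is that the costly precise matching is run only on the words that survive the cheaper approximate matching and the language model.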

In FIG. 12, after the extended paths are formed and the stacks are re-formed, the process repeats by returning to block 5052.

Each iteration thus consists of selecting and extending the shortest, best "good" word path. A word path marked "bad" in one iteration may become "good" in a later iteration. Whether a live word path is "good" or "bad" is therefore determined independently at each iteration. In practice, the likelihood envelope does not change significantly from one iteration to the next, so the computation that decides whether a word path is good or bad is performed efficiently. Moreover, normalization is not required.

When complete sentences are identified, block 5074 is preferably included. That is, when no live word paths remain unmarked and there are no "good" word paths to be extended, decoding is finished. The complete word path having the highest likelihood at its boundary label is identified as the most likely word sequence for the input label string.

In the case of continuous speech in which sentence ends are not identified, path extension proceeds continually, or for a predefined number of words as preferred by the system user.

F7 Appendix

Appendix 1: Procedure for generating a blink from a pair of hooks and converting the blink into clinks

(a) Determine the nodes of the right hook that are eligible for connection to the left hook.

(b) Determine the nodes of the left hook that can be connected to eligible nodes of the right hook.

(c) Make the connections and remove unused branches.

(d) Determine the confluent nodes of the resulting graph (the blink graph) and break the blink up into a sequence of clinks.

Appendix 2: Sample portion of the clink inventory

G. Effects of the invention

As described above, the invention allows each word in the vocabulary to be identified by a sequence of clink identifiers and the corresponding phoneme machines. That is, the basic form of each word is stored simply as a sequence of identifiers, rather than as a sequence of phonemes each filled with the statistical tables needed for precise matching and for the fast matching that is used. As noted above, the statistical table for each phoneme machine is stored only once.

[Brief explanation of drawings]

FIG. 1 shows the assembly, in the recognition procedure, of a word graph from separately stored elementary graphs, or clinks.
FIG. 2 is a schematic diagram of the phonetic graph of a single word.
FIG. 3 shows simple phonetic strings of two separate words and the phonetic graph of the two words concatenated.
FIG. 4 shows the phonetic graph of a word having open branches at its left and right ends that collectively form the respective hooks and represent the various pronunciations in different contexts.
FIG. 5 shows the interconnection of the respective branches of two hooks of phonetic graphs, yielding a boundary subgraph, or blink.
FIGS. 6A and 6B are tables outlining, respectively, the inventory of elementary graphs and the inventory of identifier strings stored in the speech recognition vocabulary of a system using the invention.
FIG. 7 is a schematic block diagram of a system environment in which the invention may be practiced.
FIG. 8 is a block diagram showing in detail the stack decoder in the system environment of FIG. 7.
FIG. 9 shows a precise matching phoneme machine that is identified and represented in storage by statistics obtained during a training session.
FIG. 10 shows successive steps of stack decoding.
FIG. 11 shows the stack decoding technique.
FIG. 12 is a flowchart of the stack decoding technique.
FIG. 13 shows the elements of the acoustic processor.
FIG. 14 shows the portion of a typical human ear, indicating where the components of the acoustic model are defined.
FIG. 15 is a block diagram showing portions of the acoustic processor.
FIG. 16 shows the relationship between sound intensity and frequency used in designing the acoustic processor.
FIG. 17 shows the relationship between sones and phons.
FIG. 18 is a flowchart showing how sound is characterized by the acoustic processor of FIG. 13.
FIG. 19 is a flowchart showing how the thresholds in FIG. 18 are updated.
FIG. 20 shows the trellis, or lattice, of the precise matching procedure.
FIG. 21 shows a phoneme machine used in performing matching.
FIG. 22 is a time distribution diagram used in a matching procedure having certain conditions.
FIGS. 23(a) to 23(e) show the interrelationship among phonemes, a label string, and the start and end times determined in the matching procedure.
FIGS. 24(a) and 24(b) show a particular phoneme machine of minimum length zero and its corresponding start time distribution.
FIGS. 25(a) and 25(b) show a particular phoneme machine of minimum length four and its corresponding trellis.
FIG. 26 shows a phoneme tree structure that permits multiple words to be processed simultaneously.
FIG. 27 is a flowchart showing the shaping of the phoneme machines.
FIG. 28 is a flowchart showing how word paths are extended in the stack decoding procedure.

1000: speech recognition system; 1002: stack decoder; 1004: acoustic processor; 1006, 1008: array processors; 1010: language model; 1012: workstation; 1020: search device; 1022, 1024, 1026, 1028: interfaces.

Claims

1. A continuous speech recognition device in which partial Markov models corresponding to phonemes are connected to form a chain of partial Markov models representing a plurality of continuously uttered words, a string of labels is selected, according to the input speech, from a set of labels representing sound types each assignable to a minute time interval, and the label string is matched against the chain of partial Markov models to recognize continuous speech, the device being characterized by comprising the following means (a) to (g):

(a) means for storing first data representing a plurality of possible utterances of a first speech portion, the possible utterances of the first speech portion not varying with the speech portions that precede or follow it, and the first data being specified by a first identifier;

(b) means for storing second data representing a plurality of possible utterances of a second speech portion different from the first speech portion, the possible utterances of the second speech portion not varying with the speech portion that precedes it but varying with the speech portion that follows it, and the second data being specified by a second identifier;

(c) means for storing third data representing a plurality of possible utterances of a third speech portion different from the first and second speech portions, the possible utterances of the third speech portion varying with the speech portion that precedes it but not varying with the speech portion that follows it, and the third data being specified by a third identifier;

(d) means for storing fourth data representing a first word that includes the second speech portion, the fourth data including the second identifier representing a part of the first word;

(e) means for storing fifth data representing a second word, different from the first word, that includes the third speech portion, the fifth data including the third identifier representing a part of the second word;

(f) means for storing sixth data including a first identifier representing the second speech portion and the third speech portion that follows it; and

(g) means for converting a plurality of consecutive words into a chain of the first, second, and third identifiers on the basis of the contents of the storage means of (d) and (e), converting the second and third identifiers into first identifiers in accordance with the contents of (f), and constructing, on the basis of the resulting chain of first identifiers, the chain of partial Markov models to be matched against the label string of the input speech.