JPS62220996A

JPS62220996A - Voice recognition method and apparatus

Info

Publication number: JPS62220996A
Application number: JP61058464A
Authority: JP
Inventors: Lalit Rai Bahl; Peter Vincent deSouza; Robert Leroy Mercer
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 1986-03-18
Filing date: 1986-03-18
Publication date: 1987-09-29
Also published as: JPH0372996B2

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention] The invention is described in the following order.

A. Industrial field of application
B. Prior art
C. Problems to be solved by the invention
D. Means for solving the problems
E. Embodiment
E-1. Speech recognition system
E-1A. Overview of the configuration (FIGS. 1 and 2)
E-1B. Auditory model and its implementation (FIG. 4)
E-1C. Detailed match (FIGS. 3 and 11)
E-1D. Basic fast match (FIG. 12)
E-1E. Alternative fast match (FIGS. 15 and 16)
E-1F. Matching based on the first J labels (FIG. 16)
E-1G. Phone tree structure and fast match (FIG. 17)
E-1H. Language model
E-1J. Stack decoder (FIGS. 21 and 22)
E-1K. Construction of phonetic baseforms (FIG. 3)
E-1L. Construction of phonemic baseforms (FIG. 26)
E-2. Selection of likely words from a vocabulary by polling (FIGS. 25 to 29)
F. Effects of the invention
G. Tables

A. Industrial Field of Application

This invention relates generally to speech recognition technology, and in particular to techniques for forming a short list of likely words selected from a vocabulary of words.

B. Prior Art

In the probabilistic approach to speech recognition, a speech waveform is first transformed by an acoustic processor into a string of labels, or phonemes. The labels, each of which identifies a sound type, are typically selected from an alphabet of approximately 200 different labels. The generation of such labels has been discussed in various papers, such as the following.

"Continuous Speech Recognition by Statistical Methods", Proceedings of the IEEE, vol. 64, pp. 532-556 (1976).

In employing the labels to achieve speech recognition, Markov model phone machines (also referred to as probabilistic finite state machines) have been discussed.

A Markov model typically has a plurality of states and transitions between those states. In addition, a Markov model normally has probabilities assigned to it relating to (a) the probability of each transition occurring and (b) the respective probability of producing each label at the various transitions. Markov models are described in, for example, the paper by L. R. Bahl, F. Jelinek, and R. L. Mercer entitled "A Maximum Likelihood Approach to Continuous Speech Recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-5, No. 2, March 1983.

In speech recognition, a matching process is performed to determine which word in the vocabulary is most likely, given the label string produced by the acoustic processor.

As set forth in the applicant's U.S. Patent Application No. 672,974, filed November 19, 1984, acoustic matching is performed by (a) characterizing each word in the vocabulary by a sequence of Markov model phone machines and (b) determining the respective likelihood that each word-representing sequence of phone machines produced the label string generated by the acoustic processor. Each word-representing sequence of phone machines is called a word baseform.

As described in the above-mentioned U.S. patent application, a word baseform can be composed of phonetic phone machines. In this case, each phone machine preferably corresponds to a phonetic sound and has seven states and thirteen transitions.

Alternatively, each word baseform may be formed as a sequence of phonemic phone machines. A phonemic phone machine is a simpler Markov model than a phonetic phone machine and preferably consists of two states. Between the first state and the second state there is a null transition, at which no label can be produced. Between the first state and the second state there is also a non-null transition, at which one label from the label alphabet can be produced. At the first state there is a self-loop at which a label can be produced. During a training phase, statistics are determined for each phonemic phone machine: for each phoneme, the probability of each transition and the probability of each label being produced at each non-null transition are computed from known utterances. Word baseforms are formed by concatenating phonemic phone machines.
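The two-state machine described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name, field names, and probability values are hypothetical, and the concatenation routine is one straightforward way to sum over all the ways a label string can be divided among the machines of a baseform.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class TwoStateMachine:
    """Hypothetical two-state phone machine; all names are illustrative."""
    p_loop: float                  # self-loop at state 1 (produces a label)
    p_forward: float               # non-null transition 1 -> 2 (produces a label)
    p_null: float                  # null transition 1 -> 2 (produces no label)
    out_loop: Dict[str, float]     # label output probabilities on the self-loop
    out_forward: Dict[str, float]  # label output probabilities on the 1 -> 2 arc


def machine_probability(m: TwoStateMachine, labels: Sequence[str]) -> float:
    """P(start in state 1, produce exactly `labels`, end in state 2)."""
    in_state_1 = 1.0  # probability of still being in state 1 after the labels so far
    total = 0.0
    for i, label in enumerate(labels):
        if i == len(labels) - 1:
            # The last label may be produced on the non-null 1 -> 2 transition.
            total += in_state_1 * m.p_forward * m.out_forward.get(label, 0.0)
        # Otherwise the label is produced on the self-loop, staying in state 1.
        in_state_1 *= m.p_loop * m.out_loop.get(label, 0.0)
    # Leave state 1 through the null transition without producing a label.
    return total + in_state_1 * m.p_null


def baseform_probability(machines: List[TwoStateMachine], labels: Sequence[str]) -> float:
    """P(concatenated machines produce exactly `labels`), summed over all splits."""
    n = len(labels)
    reach = [1.0] + [0.0] * n  # reach[j]: machines so far have produced labels[:j]
    for m in machines:
        nxt = [0.0] * (n + 1)
        for i in range(n + 1):
            if reach[i] == 0.0:
                continue
            for j in range(i, n + 1):
                nxt[j] += reach[i] * machine_probability(m, labels[i:j])
        reach = nxt
    return reach[n]
```

In a real system the transition and output probabilities would come from the training phase; here they are simply passed in by the caller.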

In systems that employ phonemic or phonetic baseforms, the ultimate goal is to find the most likely word (or word sequence) corresponding to the label string generated by the acoustic processor in response to an unknown speech input. One method for achieving this goal is described in the above-mentioned U.S. patent application. In that method, the number of words is first reduced from all the words in the vocabulary to a list of likely candidate words; the listed words are then examined in a more detailed match procedure or a language model procedure, where preferably the most likely word is selected. In reducing the number of candidate words, the method taught by that application applies approximations to the phone machines, which yields fast processing without excessive computation. The word reduction is achieved by performing an approximate acoustic match using computations based on a matching lattice. This approximate acoustic match has been found effective and sufficient for determining which words should be subjected to the more detailed match and to language-model-based processing.

C. Problems to Be Solved by the Invention

The object of the invention is to provide a fast, computationally simple method, and an apparatus for carrying it out, for determining which words in a vocabulary have a relatively high likelihood of corresponding to a string of acoustic labels generated by an acoustic processor.

D. Means for Solving the Problems

The present invention teaches an unconventional method for reducing the number of words to be examined in the detailed match. Specifically, the invention relates to a polling method in which a table is provided wherein each label in the alphabet casts a "vote" for each word in the vocabulary. The vote reflects the likelihood that the given word produced the given label. The value of each vote is computed from the label output probabilities and the transition probability statistics obtained during a training session.

According to one embodiment of the invention, a subject word is selected when a label string is generated by the acoustic processor. Each label in the string is looked up in the vote table, and the vote of each label for the subject word is determined. All of the label votes for the subject word are accumulated and combined to give a likelihood score. Repeating this process for each word in the vocabulary yields a likelihood score for each word. A list of likely candidate words can then be derived from the likelihood scores.
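The vote-accumulation procedure of this embodiment can be sketched as follows. The table layout, the function names, and the numeric votes are all hypothetical, and adding the votes is only one assumed way of "combining" them (it corresponds to treating each vote as a log-likelihood).

```python
from typing import Dict, List, Sequence

# vote_table[label][word] -> vote of that label for that word (illustrative layout).
VoteTable = Dict[str, Dict[str, float]]


def poll(vote_table: VoteTable, labels: Sequence[str], vocabulary: Sequence[str]) -> Dict[str, float]:
    """Accumulate, for every word, the votes of all labels in the string."""
    scores: Dict[str, float] = {}
    for word in vocabulary:
        # Assumption: votes are combined by summation.
        scores[word] = sum(vote_table[label][word] for label in labels)
    return scores


def candidate_list(scores: Dict[str, float], size: int) -> List[str]:
    """Return the `size` words with the highest likelihood scores."""
    return sorted(scores, key=lambda w: scores[w], reverse=True)[:size]


# Toy table with made-up values for two labels and two words.
votes = {
    "L1": {"cat": -0.5, "dog": -2.0},
    "L2": {"cat": -1.5, "dog": -0.3},
}
scores = poll(votes, ["L1", "L1", "L2"], ["cat", "dog"])
shortlist = candidate_list(scores, 1)
```

Because each label costs only one table lookup per word, the work grows with the product of string length and vocabulary size, with no lattice computation at all.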

In a second embodiment, a second table is also formed, containing the penalty that each label has for each word in the vocabulary. The penalty assigned to a given label represents the likelihood that the word does not produce that label. In the second embodiment, both label votes and penalties are considered in computing the likelihood score for a given word based on the label string.

To account for length, the likelihood score is preferably scaled based on the number of labels considered in computing the word's likelihood score.

Furthermore, when the end time of a word along the generated label string is not defined, the invention provides that likelihood scores be computed at successive time intervals, so that a subject word may have a plurality of successive likelihood scores. The invention further provides that the subject word be assigned the highest of its likelihood scores, preferably relative to the likelihood scores of all other words in the vocabulary.
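A minimal sketch of length scaling and of scoring at successive time intervals is given below. It assumes that scaling means dividing the accumulated vote by the number of labels considered, and that an "interval" is a fixed number of labels; both choices, like the names and values, are illustrative rather than taken from the description.

```python
from typing import Dict, List, Sequence, Tuple


def scaled_scores_over_time(
    vote_table: Dict[str, Dict[str, float]],
    labels: Sequence[str],
    word: str,
    interval: int,
) -> List[Tuple[int, float]]:
    """Length-scaled score of `word` after every `interval` labels."""
    results: List[Tuple[int, float]] = []
    running = 0.0
    for count, label in enumerate(labels, start=1):
        running += vote_table[label][word]
        if count % interval == 0 or count == len(labels):
            # Assumption: scale by the number of labels considered so far.
            results.append((count, running / count))
    return results


def best_scaled_score(scores: List[Tuple[int, float]]) -> float:
    """Highest of the successive scores (requires at least one score)."""
    return max(score for _, score in scores)


votes = {"L1": {"cat": -0.5}, "L2": {"cat": -1.5}}
successive = scaled_scores_over_time(votes, ["L1", "L1", "L2", "L2"], "cat", interval=2)
best = best_scaled_score(successive)
```

Here the word scores best on the first two labels, which is the kind of evidence that its end point lies early in the string.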

According to the invention, a method is taught for selecting likely words from a vocabulary of words, wherein each word is represented by a sequence of at least one probabilistic finite state phone machine and wherein an acoustic processor generates acoustic labels in response to a speech input. The method comprises the step of (a) forming a first table in which each label in the alphabet casts a vote for each word in the vocabulary, the vote of each label for a given word representing the likelihood of that word producing the label casting the vote. The method preferably further comprises the steps of (b) forming a second table in which a penalty is assigned to each label for each word in the vocabulary, the penalty assigned to a given label for a given word representing the likelihood of the given label not being produced by the model of the given word; and (c) for a given string of labels, determining the likelihood of a particular word, which includes combining the votes of all labels in the string for the particular word with the penalties of all labels not in the string for that particular word.

The method preferably also includes repeating steps (a), (b), and (c) for all words, so as to provide a likelihood score for each word.

If desired, the method described above can be used in combination with that of the above-mentioned U.S. Patent Application No. 672,974.

E. Embodiment

E-1. Speech Recognition System

E-1A. Overview of the Configuration

FIG. 1 shows a general block diagram of a speech recognition system 1000. The system 1000 includes a stack decoder 1002; an acoustic processor 1004 connected to the stack decoder; an array processor 1006 used in performing a fast approximate acoustic match; an array processor 1008 used in performing a detailed acoustic match; a language model 1010; and a work station 1012.

The acoustic processor 1004 is designed to transform a speech waveform input into a string of labels, or phonemes, each of which in a general sense identifies a corresponding sound type. In this system, the acoustic processor 1004 is based on a unique model of the human ear and is described in the applicant's Japanese Patent Application No. Sho 60-211229.

The labels or phonemes from the acoustic processor 1004 enter the stack decoder 1002. Logically, the stack decoder 1002 may be represented by the block elements shown in FIG. 2. That is, the stack decoder 1002 includes a search block 1020 that communicates with the work station 1012 and, through respective interfaces 1022, 1024, 1026, and 1028, with the acoustic processor 1004, the fast match processor 1006, the detailed match processor 1008, and the language model 1010.

In operation, phonemes from the acoustic processor 1004 are directed by the search block 1020 to the fast match processor 1006. The fast match procedure is described later and also in the applicant's above-mentioned U.S. Patent Application No. 672,974. Briefly, the object of the matching is to determine the most likely word (or words) for a given string of labels.

The fast match is designed to examine the words in the vocabulary and to reduce the number of candidate words for a given string of incoming labels. The fast match is based on probabilistic finite state machines, also referred to as Markov models.

Once the fast match has reduced the number of candidate words, the stack decoder 1002 communicates with the language model 1010, which determines the contextual likelihood of each candidate word in the fast match candidate list, preferably based on existing trigrams.

Preferably, the detailed match then examines those words from the fast match candidate list that, based on the language model computations, have a reasonable likelihood of being the spoken word. The detailed match procedure is described in the above-mentioned U.S. Patent Application No. 672,974.

The detailed match is performed by means of Markov model phone machines such as the machine illustrated in FIG. 3.

After the detailed match, the language model is preferably invoked again to determine word likelihood. The stack decoder 1002 of the invention, using information derived from the fast match, the detailed match, and the application of the language model, is designed to determine the most likely path, or sequence, of words for the generated label string.
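The order in which the stack decoder calls on the other components can be sketched as a single word-selection step. Everything in this sketch is hypothetical: the three callables stand in for the fast match, language model, and detailed match; the threshold is an invented parameter; and multiplying the detailed match score by the language model score is just one assumed way of using the language model a second time.

```python
from typing import Callable, List, Optional, Sequence


def decode_next_word(
    labels: Sequence[str],
    history: Sequence[str],
    fast_match: Callable[[Sequence[str]], List[str]],
    language_model: Callable[[str, Sequence[str]], float],
    detailed_match: Callable[[str, Sequence[str]], float],
    lm_threshold: float,
) -> Optional[str]:
    """One pass: fast match, language model, detailed match, language model again."""
    # 1. The fast match reduces the vocabulary to a candidate list.
    candidates = fast_match(labels)
    # 2. The language model keeps candidates that are plausible in context.
    plausible = [w for w in candidates if language_model(w, history) >= lm_threshold]
    # 3. The detailed match scores the survivors; the language model is applied again.
    scored = [(detailed_match(w, labels) * language_model(w, history), w) for w in plausible]
    return max(scored)[1] if scored else None


# Stub components with made-up scores.
lm_scores = {"to": 0.6, "two": 0.3, "too": 0.01}
dm_scores = {"to": 0.2, "two": 0.5, "too": 0.9}
chosen = decode_next_word(
    labels=["L1", "L2"],
    history=["count", "from", "one"],
    fast_match=lambda labels: ["to", "two", "too"],
    language_model=lambda w, h: lm_scores[w],
    detailed_match=lambda w, l: dm_scores[w],
    lm_threshold=0.05,
)
```

The point of the staging is that the expensive detailed match runs only on words that have already survived two cheaper filters.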

Two prior approaches to finding the most likely word sequence are Viterbi decoding and single-stack decoding. Both techniques are described in the above-mentioned paper by L. R. Bahl, F. Jelinek, and R. L. Mercer: Viterbi decoding in Section 5 and single-stack decoding in Section 6.

In the single-stack decoding technique, paths of varying length are listed in a single stack according to likelihood. Single-stack decoding must account for the fact that likelihood is somewhat dependent on path length, and hence normalization is generally employed.

The Viterbi technique, by contrast, does not require such normalization and is generally practical for small tasks.

As another alternative, decoding for small vocabulary systems can be performed by examining each possible combination of words as a possible word sequence and determining which combination has the highest probability of producing the generated label string. The computational requirements of this technique, however, become impractical for large vocabulary systems.

The stack decoder 1002 serves essentially to control the other blocks and does not perform many computations itself. Hence, the stack decoder 1002 preferably includes a 4341 processor running under the IBM VM/370 operating system, as described in publications such as "Virtual Machine/System Product Introduction Release 3" (1983). The array processors, which perform the considerable computation, have been implemented with the commercially available Floating Point Systems (FPS) 190L.

A novel technique that involves multiple stacking and a unique decision strategy for determining the best word sequence, or path, has been invented by L. R. Bahl, F. Jelinek, and R. L. Mercer and is discussed later.

E-1B. Auditory Model and Its Implementation

FIG. 4 shows a specific embodiment of the acoustic processor 1100 (reference numeral 1004 in FIG. 1). A speech wave input (e.g., natural speech) enters an analog-to-digital (A/D) converter 1102, which samples the input at a prescribed rate. A typical sampling rate is one sample every 50 microseconds. To shape the edges of the digital signal from the A/D converter 1102, a time window generator 1104 is provided. The output of the time window generator 1104 enters a fast Fourier transform (FFT) element 1106, which provides a frequency spectrum for each time window.
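The sampling, windowing, and spectrum steps can be sketched in plain Python. Several things here are assumptions: the 25.6 ms frame length is the preferred time frame given later in the description, the frames are taken without overlap, the window is a Hann window, and a naive discrete Fourier transform is swapped in for the FFT so that the example needs no external library.

```python
import cmath
import math
from typing import List, Sequence

SAMPLE_PERIOD_US = 50                                    # one sample every 50 microseconds
FRAME_LENGTH_US = 25_600                                 # 25.6 ms time frame
SAMPLE_RATE_HZ = 1_000_000 // SAMPLE_PERIOD_US           # 20,000 samples per second
SAMPLES_PER_FRAME = FRAME_LENGTH_US // SAMPLE_PERIOD_US  # 512 samples per frame


def frames(samples: Sequence[float], frame_len: int = SAMPLES_PER_FRAME) -> List[Sequence[float]]:
    """Split the sampled signal into consecutive, non-overlapping frames."""
    return [samples[i:i + frame_len] for i in range(0, len(samples) - frame_len + 1, frame_len)]


def hann(frame: Sequence[float]) -> List[float]:
    """Taper the edges of a frame (assumed window shape; needs len(frame) > 1)."""
    n = len(frame)
    return [x * (0.5 - 0.5 * math.cos(2 * math.pi * k / (n - 1))) for k, x in enumerate(frame)]


def magnitude_spectrum(frame: Sequence[float]) -> List[float]:
    """Naive O(N^2) DFT magnitudes for bins 0..N/2 (stand-in for the FFT)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]
```

Each bin k of a 512-sample frame corresponds to k × 20000 / 512 Hz, so a 2500 Hz tone should peak at bin 64.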

The output of the FFT element 1106 is then processed to produce labels y1 y2 ... . Four blocks cooperate to generate the labels: a feature selection block 1108, a cluster block 1110, a prototype block 1112, and a labeling block 1114. In generating labels, prototypes are defined as points (or vectors) in a space based on selected features; speech inputs are then characterized by the same selected features to provide corresponding points (or vectors) in the space that can be compared with the prototypes.

Specifically, in defining the prototypes, sets of points are grouped into respective clusters by the cluster block 1110. Methods for defining clusters have been based on probability distributions, such as the Gaussian distribution, applied to speech. The prototype of each cluster, relating to the centroid or another characteristic of the cluster, is generated by the prototype block 1112. The generated prototypes and the speech input, both characterized by the same selected features, enter the labeling block 1114.

The labeling block 1114 performs a comparison procedure that results in a label being assigned to a particular speech input.
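The comparison procedure can be pictured as a nearest-prototype lookup. The description does not specify the distance measure, so Euclidean distance is an assumption here, as are the function names and the two-dimensional toy prototypes.

```python
import math
from typing import Dict, List, Sequence


def assign_label(feature_vector: Sequence[float], prototypes: Dict[str, Sequence[float]]) -> str:
    """Return the label whose prototype lies closest to the feature vector."""
    # Assumption: closeness is measured by Euclidean distance.
    return min(prototypes, key=lambda label: math.dist(feature_vector, prototypes[label]))


def label_string(frames: Sequence[Sequence[float]], prototypes: Dict[str, Sequence[float]]) -> List[str]:
    """Label every frame's feature vector in turn."""
    return [assign_label(vector, prototypes) for vector in frames]


# Two made-up prototypes in a two-dimensional feature space.
prototypes = {"L1": [0.0, 0.0], "L2": [10.0, 10.0]}
labels = label_string([[1.0, 2.0], [9.0, 8.0], [0.5, 0.5]], prototypes)
```

In the system described, each feature vector would have one component per critical frequency band rather than two.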

The selection of appropriate features is a key factor in deriving labels that represent the speech wave input. The acoustic processor described here includes an improved feature selection block 1108. In accordance with this acoustic processor, an auditory model is derived and applied in the acoustic processor of a speech recognition system.

To explain the auditory model, reference is made to FIG. 5.

FIG. 5 shows part of the inner human ear.

In particular, an inner hair cell 1200 is shown with end portions 1202 extending into a fluid-containing channel 1204. Upstream of the inner hair cells are outer hair cells 1206, also shown with end portions 1208 extending into the channel 1204. Nerves that convey information to the brain are connected to the inner hair cell 1200 and the outer hair cells 1206. Specifically, neurons undergo electrochemical changes that result in electrical impulses being conveyed along a nerve to the brain for processing. These electrochemical changes are stimulated by the mechanical motion of the basilar membrane 1210.

It has been recognized that the basilar membrane 1210 serves as a frequency analyzer for speech wave inputs and that portions along the basilar membrane 1210 respond to respective critical frequency bands. The fact that different portions of the basilar membrane 1210 respond to corresponding frequency bands affects the loudness perceived for a speech wave input. That is, two tones of similar intensity are perceived as louder when they lie in two different critical frequency bands than when they occupy the same frequency band.

It has been found that there are on the order of 22 critical frequency bands defined by the basilar membrane 1210.

Conforming to the frequency response of the basilar membrane 1210, the acoustic processor 1100 of this embodiment preferably separates the speech wave input into some or all of the critical frequency bands and then examines the signal component of each separated band individually. This function is achieved by appropriately filtering the signal from the FFT element 1106 (FIG. 4) so as to provide, in the feature selection block 1108, a separate signal for each critical frequency band examined.

The separate inputs are also blocked into time frames (preferably of 25.6 milliseconds) by the time window generator 1104. Hence, the feature selection block 1108 preferably includes 22 signals, each of which represents the sound intensity in a given frequency band for one time frame.
preferably includes 22 signals, each representing the sound intensity in a given frequency band corresponding to one time frame.

Filtering is preferably performed by the conventional critical band filter 1300 shown in FIG. 6. The separate signals are then processed by an equal loudness converter 1302, which accounts for variations in perceived loudness as a function of frequency. In this regard, note that a first tone at a given dB level at one frequency may differ in perceived loudness from a second tone at the same dB level at another frequency. The converter 1302 may be based on empirical data, converting the signals in the various frequency bands so that each is measured on the same loudness scale. For example, the converter 1302 preferably maps from sound intensity to equal loudness contours based on the 1933 studies of Fletcher and Munson, subject to certain modifications. The modified results of these studies are shown in FIG. 7. As can be seen from the equal loudness contour passing through the point (1000 Hz, 40 dB) in FIG. 7, a tone of 1000 Hz at 40 dB is comparable in loudness level to a tone of 100 Hz at 60 dB.

The converter 1302 preferably adjusts loudness in accordance with the contours of FIG. 7 so as to give equal loudness regardless of frequency.

As can further be seen from FIG. 7, in addition to the dependence on frequency, sound intensity and loudness do not correspond. That is, a change in sound intensity or amplitude is not necessarily reflected by a like change in perceived loudness. For example, at 100 Hz, a 10 dB change in intensity around 110 dB produces a much larger change in perceived loudness than a 10 dB change around 20 dB. This difference is addressed by a loudness scaling element 1304, which compresses loudness in a predefined fashion.

Preferably, the loudness scaling element 1304 compresses the intensity P to its cube root, P^(1/3), by replacing the loudness amplitude measure in phons with sones.
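The effect of cube-root compression is easy to check numerically. The sketch below only illustrates the arithmetic of P^(1/3); the function names are hypothetical, and the decibel conversion uses the standard definition of a power ratio (10 dB corresponds to a factor of 10).

```python
def compress_intensity(power: float) -> float:
    """Cube-root compression of an intensity value P."""
    return power ** (1.0 / 3.0)


def db_to_power_ratio(decibels: float) -> float:
    """Convert a level difference in dB to a power ratio."""
    return 10.0 ** (decibels / 10.0)


# A 10 dB increase multiplies the intensity by 10, but multiplies the
# compressed value by only the cube root of 10 (a little over 2).
growth_after_compression = compress_intensity(db_to_power_ratio(10.0))
```

A thousandfold range of intensities (30 dB) is thus squeezed into a tenfold range after compression.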

FIG. 8 shows empirically determined data for sones versus phons. By employing sones, the model remains substantially accurate at large speech signal amplitudes. One sone is defined as the loudness of a 1 kHz tone at 40 dB.

Referring again to FIG. 6, a time-varying response element 1306 is shown, which acts on the equal-loudness, loudness-scaled signals associated with each critical frequency band. Specifically, for each frequency band examined, a neural firing rate f is determined in each time frame. According to this acoustic processor, the firing rate f is defined as:

f = (So + DL)n    (1)

where n is the amount of neurotransmitter, So is a spontaneous firing constant relating to neural firings independent of the speech wave input, L is a measurement of loudness, and D is a displacement constant. That is, (So)n corresponds to the spontaneous neural firing rate that occurs whether or not there is a speech wave input, and DLn corresponds to the firing rate due to the speech wave input.

Significantly, the acoustic processor characterizes the value of n as changing over time according to the relationship:

dn/dt = Ao - (So + Sh + DL)n    (2)

where Ao is a replenishment constant and Sh is a spontaneous neurotransmitter decay constant. The novel relationship set forth in equation (2) takes into account that neurotransmitter is produced at a certain rate (Ao) and is lost (a) through decay (Sh × n), (b) through spontaneous firing (So × n), and (c) through neural firing due to the speech wave input (DL × n). The presumed locations of these modeled phenomena are shown in FIG. 5.

Equation (2) also reflects the fact that this acoustic processor is non-linear, in that the next amount of neurotransmitter and the next firing rate depend multiplicatively on the current conditions of at least the neurotransmitter amount. That is, the amount of neurotransmitter at time (t + Δt) equals the amount of neurotransmitter at time t plus (dn/dt)·Δt:

$n(t + \Delta t) = n(t) + (dn/dt)\,\Delta t$  (3)

Equations (1), (2), and (3) describe a time-varying signal analyzer which exploits the fact that the auditory system is adaptive over time, causing signals on the auditory nerve to be non-linearly related to the acoustic wave input. In this regard, the acoustic processor provides the first model that embodies non-linear signal processing in a speech recognition system, so as to conform better to apparent time variations in the nervous system.
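
The interplay of equations (1) to (3) can be sketched as a single update per time frame. The following Python fragment is a minimal illustration only: the function name `firing_step`, the 1 csec step size, and the test input (three silent frames followed by a 20 sone step) are assumptions made for the example, and the constants are the preferred values quoted in this description.

```python
def firing_step(n, loudness, dt, s_o, s_h, d, a_o=1.0):
    """One time-frame update of the neurotransmitter model (hypothetical helper)."""
    f = (s_o + d * loudness) * n                      # equation (1): firing rate
    dn_dt = a_o - (s_o + s_h + d * loudness) * n      # equation (2): rate of change of n
    n_next = n + dn_dt * dt                           # equation (3): next amount of n
    return f, n_next

# Preferred constants from this description (time measured in centiseconds).
S_O, S_H, D = 0.0888, 0.11111, 0.00666

n = 1.0 / (S_O + S_H)          # resting level: dn/dt = 0 when L = 0 and Ao = 1
rates = []
for loudness in [0.0] * 3 + [20.0] * 10:   # assumed input: silence, then a 20 sone step
    f, n = firing_step(n, loudness, dt=1.0, s_o=S_O, s_h=S_H, d=D)
    rates.append(f)
```

In this sketch the firing rate jumps at the onset of the sound and then declines toward a lower steady level as the neurotransmitter is depleted, which is the adaptive behaviour the equations are meant to capture.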

In order to reduce the number of unknowns in equations (1) and (2), the acoustic processor uses the following equation (4), which applies to a fixed loudness L.

$S_o + S_h + D L = 1/T$  (4)

where T is a measure of the time it takes for the auditory response to drop to 67% of its maximum after an acoustic wave input is generated. T is a function of loudness and, according to this acoustic processor, is derived from existing graphs which display the decay of the response for various loudness levels. That is, when a tone of fixed loudness is generated, it produces a response at a first high level, after which the response decays toward a steady-state level with a time constant T.

With no acoustic wave input, T = $T_o$, which is on the order of 50 msec. For a loudness of $L_{max}$, T = $T_{max}$, which is on the order of 30 msec. By setting $A_o = 1$, $1/(S_o + S_h)$ is determined to be 5 csec when L = 0. When L is $L_{max}$ and $L_{max}$ = 20 sones, equation (5) results.

$S_o + S_h + D(20) = 1/30$  (5)

With the above data and equations, $S_o$ and $S_h$ are determined by equations (6) and (7) as follows.

$S_o = D L_{max} / (R + D L_{max} T_o R - 1)$  (6)

$S_h = 1/T_o - S_o$  (7)

where

$R = f_{ss}|_{L = L_{max}} \,/\, f_{ss}|_{L = 0}$  (8)

and $f_{ss}$ denotes the steady-state firing rate at a given loudness, that is, the firing rate when dn/dt = 0.

R is the only variable left in the acoustic processor. Hence, to alter the performance of the processor, only R is changed. That is, R is a single parameter which may be adjusted to alter performance, which normally means minimizing steady-state effects relative to transient effects. Minimizing steady-state effects is desirable because inconsistent output patterns for the same speech input generally result from differences in frequency response, speaker differences, background noise, and distortion, all of which affect the steady-state portions of speech but not the transient portions. The value of R is preferably set by optimizing the error rate of the complete speech recognition system. A suitable value found in this way is R = 1.5. The values of $S_o$ and $S_h$ are then 0.0888 and 0.1111 respectively, and D is 0.00666.
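
The way the constants follow from R can be checked numerically. The sketch below is illustrative only: the helper name `derive_constants` is hypothetical, times are taken in centiseconds ($T_o$ = 5, $T_{max}$ = 3) with $L_{max}$ = 20 sones and R = 1.5 as listed among the parameter values in this description, and D is obtained by the assumed step of subtracting equation (4) at L = 0 from equation (4) at L = $L_{max}$.

```python
def derive_constants(r, t_o, t_max, l_max):
    """Derive (So, Sh, D) from R and the two time constants (hypothetical helper)."""
    # Assumed step: equation (4) at Lmax minus equation (4) at L = 0 isolates D.
    d = (1.0 / t_max - 1.0 / t_o) / l_max
    dl = d * l_max
    s_o = dl / (r + dl * t_o * r - 1.0)    # equation (6)
    s_h = 1.0 / t_o - s_o                  # equation (7)
    return s_o, s_h, d

s_o, s_h, d = derive_constants(r=1.5, t_o=5.0, t_max=3.0, l_max=20.0)
```

Under these assumptions the three results agree with the preferred values 0.0888, 0.1111, and 0.00666 to within rounding.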

Referring to FIG. 9, a flowchart of the acoustic processor is shown. Digitized speech in a 25.6 msec time frame, sampled preferably at 20 kHz, passes through a Hanning window 1320, the output of which is supplied to a Fourier transform (DFT) 1322, preferably at 10 msec intervals. The transform output is filtered by element 1324 to provide a power density output for each of at least one frequency band (preferably all of the critical frequency bands, or at least twenty of them). The power density is then converted to loudness level by a logarithmic conversion step 1326. This is readily performed on the basis of the graph of FIG. 7. The subsequent processing, which includes the threshold updating of step 1330, is shown in FIG. 10.
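
The framing arithmetic implied by these figures (512 samples per 25.6 msec frame and 200 samples per 10 msec step at 20 kHz) can be sketched as follows. This is a minimal pure-Python illustration; the names `hann` and `windowed_frames` are hypothetical, and the transform and filter-bank stages are deliberately omitted.

```python
import math

SAMPLE_RATE_HZ = 20_000
FRAME_LEN = round(SAMPLE_RATE_HZ * 0.0256)   # samples in one 25.6 msec frame
HOP_LEN = round(SAMPLE_RATE_HZ * 0.010)      # samples between frames (10 msec)

def hann(n):
    """Hanning window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def windowed_frames(samples):
    """Yield successive Hanning-windowed frames, one every HOP_LEN samples."""
    window = hann(FRAME_LEN)
    for start in range(0, len(samples) - FRAME_LEN + 1, HOP_LEN):
        frame = samples[start:start + FRAME_LEN]
        yield [s * w for s, w in zip(frame, window)]

frames = list(windowed_frames([1.0] * SAMPLE_RATE_HZ))   # one second of dummy input
```

Each frame produced in this way would then be passed to the transform and filter stages described above.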

In FIG. 10, a threshold of feeling $T_f$ and a threshold of hearing $T_h$ are initially set at step 1340, for each filtered frequency band m, to $T_f$ = 120 dB and $T_h$ = 0 dB. Thereafter, a speech counter, a total frames register, and a histogram register are reset at step 1342.

Each histogram includes bins, each of which indicates the number of samples or counts during which the power or some similar measure in a given frequency band lies in a respective range. In the present instance a histogram preferably represents, for each given frequency band, the number of centiseconds during which the loudness is in each of a plurality of loudness ranges. For example, in the third frequency band there may be twenty centiseconds during which the power is between 10 dB and 20 dB.

Similarly, in the twentieth frequency band there may be one hundred fifty centiseconds, out of a total of one thousand, between 50 dB and 60 dB. From the total number of samples (or centiseconds) and the counts contained in the bins, percentiles are derived.

At step 1344, a frame from the filter output of each frequency band is examined, and the bins in the appropriate histograms, one per filter, are incremented at step 1346. The total number of bins in which the amplitude exceeds 55 dB is summed for each filter (i.e., frequency band) at step 1348, and the number of filters indicating the presence of speech is determined. If the minimum number of filters suggesting speech (e.g., six of twenty) is not present, the next frame is examined at step 1344.

If, at step 1350, there are enough filters indicating speech, a speech counter is incremented at step 1352. The speech counter is incremented at step 1352 until 10 seconds of speech have occurred at step 1354, whereupon new values of $T_f$ and $T_h$ are determined for each filter at step 1356.

The new values of $T_f$ and $T_h$ are determined for a given filter as follows. For $T_f$, the dB value of the bin holding the 35th sample from the top of 1000 bins (i.e., the 96.5th percentile of speech) is defined as $BIN_H$. $T_f$ is then set as $T_f = BIN_H + 40$ dB. For $T_h$, the dB value of the bin holding the 0.01 percentile sample counted from the lowest bin is defined as $BIN_L$. That is, $BIN_L$ is the bin in the histogram which corresponds to 1% of the number of samples in the histogram, excluding the number of samples classified as speech. $T_h$ is then defined as $T_h = BIN_L - 30$ dB.
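
The threshold update can be sketched as a search through the histogram from either end. The fragment below is a rough illustration under several assumptions that are not stated in this description: the histogram is a list of counts with one bin per dB, the upper rank is taken as the 96.5th percentile of the speech frames, and the lower rank is 1% of the non-speech frames. The function names are hypothetical.

```python
import math

def bin_holding_rank(counts, rank, from_top=False):
    """Index of the bin holding the rank-th sample (1-based), counted from one end."""
    order = range(len(counts) - 1, -1, -1) if from_top else range(len(counts))
    seen = 0
    for idx in order:
        seen += counts[idx]
        if seen >= rank:
            return idx
    raise ValueError("rank exceeds number of samples")

def update_thresholds(counts, total_frames, speech_frames, bin_width_db=1.0):
    """Return (Tf, Th) in dB from a per-band loudness histogram."""
    top_rank = max(1, round(speech_frames * (1.0 - 0.965)))       # 96.5th percentile
    low_rank = max(1, math.ceil(0.01 * (total_frames - speech_frames)))
    bin_h = bin_holding_rank(counts, top_rank, from_top=True) * bin_width_db
    bin_l = bin_holding_rank(counts, low_rank) * bin_width_db
    return bin_h + 40.0, bin_l - 30.0     # Tf = BIN_H + 40 dB, Th = BIN_L - 30 dB

# Assumed toy histogram: 500 quiet frames at 20 dB, 1000 speech frames at 60 and 80 dB.
counts = [0] * 100
counts[20], counts[60], counts[80] = 500, 965, 35
t_f, t_h = update_thresholds(counts, total_frames=1500, speech_frames=1000)
```

With this toy histogram the sketch places $BIN_H$ at 80 dB and $BIN_L$ at 20 dB.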

Returning to FIG. 9, the sound amplitudes are converted to sones and scaled on the basis of the updated thresholds (steps 1330 and 1332), in the manner described above. An alternative method of deriving sones and scaling is to take the filter amplitudes "a" (after the bins have been incremented) and convert them to dB according to the expression:

$a_{dB} = 20 \log_{10}(a) - 10$  (9)

Each filter amplitude is then scaled to a range between 0 and 120, to provide equal loudness, according to the expression:

$a_{eql} = 120\,(a_{dB} - T_h) / (T_f - T_h)$  (10)

$a_{eql}$ is then preferably converted from a loudness level (phons) to an approximation of loudness in sones (with a 1 kHz signal at 40 dB mapping to 1) by the expression:

$L_{dB} = (a_{eql} - 30) / 4$  (11)

Loudness in sones is then approximated as:

$L\ (\text{approx.}) = 10^{L_{dB}/20}$  (12)

The loudness in sones is then provided as input to equations (1) and (2) at step 1334, whereby the output firing rate f for each frequency band is determined (step 1335). With twenty-two frequency bands, a twenty-two dimensional vector characterizes the acoustic wave input over successive time frames. Generally, however, twenty frequency bands are examined by employing a conventional mel-scaled filter bank (the mel being a unit of pitch).
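
The alternative conversion chain of equations (9) to (12) can be sketched as one function. The fragment is illustrative only: the function name `amplitude_to_sones` is hypothetical, and equation (12) is read here as ten raised to the power $L_{dB}/20$, which is an assumption about the intended notation.

```python
import math

def amplitude_to_sones(a, t_f, t_h):
    """Map a filter amplitude to approximate loudness in sones (hypothetical helper)."""
    a_db = 20.0 * math.log10(a) - 10.0             # equation (9)
    a_eql = 120.0 * (a_db - t_h) / (t_f - t_h)     # equation (10): scale to 0..120
    l_db = (a_eql - 30.0) / 4.0                    # equation (11)
    return 10.0 ** (l_db / 20.0)                   # equation (12), read as a power of ten

# With the initial thresholds Tf = 120 dB and Th = 0 dB, equation (10) leaves a_dB unchanged.
quiet = amplitude_to_sones(10.0 ** 2.5, t_f=120.0, t_h=0.0)   # a_dB = 40
loud = amplitude_to_sones(10.0 ** 4.5, t_f=120.0, t_h=0.0)    # a_dB = 80
```

Raising $T_h$ or lowering $T_f$ stretches the same amplitude range over the 0 to 120 scale, which is how the updated thresholds enter the loudness estimate.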

Before the next time frame is processed (step 1336), the next state of n is determined in accordance with equation (3) at step 1337.

The acoustic processor described above is subject to improvement in applications where the firing rate f and the neurotransmitter amount n have large DC pedestals. That is, where the dynamic range of the terms of the f and n equations is important, the following equations are derived to reduce the pedestal height.

First, in the steady state and in the absence of an acoustic wave input (L = 0), equation (2) can be solved for the steady-state internal state n':

$n' = A_o / (S_o + S_h)$  (13)

The internal state of the neurotransmitter amount n(t) can be represented as a steady-state portion n' and a time-varying portion n''(t):

$n(t) = n' + n''(t)$  (14)

Combining equations (1) and (14), the following expression for the firing rate results:

$f(t) = (S_o + D L)\,(n' + n''(t))$  (15)

In this expression the term $S_o n'$ is a constant, while all other terms include either the time-varying part of n or the input signal represented by D L.

Since the subsequent processing involves only the squared differences between output vectors, constant terms may be disregarded. Using equation (13) for n', and writing f''(t) for equation (15) with the constant term removed:

$f''(t) = (S_o + D L)\,n''(t) + D L A_o / (S_o + S_h)$  (16)

Considering equation (3), the next state becomes:

$n(t + \Delta t) = n'(t + \Delta t) + n''(t + \Delta t)$  (17)

$= n'(t + \Delta t) + n''(t) + (A_o - (S_o + S_h + D L)\,n(t))\,\Delta t$  (18)

$= n'(t + \Delta t) + n''(t) - (S_o + S_h)\,n'\,\Delta t - D L\,n'\,\Delta t + A_o\,\Delta t - (S_o + S_h + D L)\,n''(t)\,\Delta t$  (19)

Ignoring all constant terms in equation (19) yields:

$n''(t + \Delta t) = n''(t)\,(1 - S_h \Delta t) - f''(t)\,\Delta t$  (20)

Equations (15) and (20) now constitute the output equation and the state-update equation, respectively, applied to each filter during each 10 millisecond time frame. The result of applying these equations is a 20-element vector every 10 milliseconds, each element of the vector corresponding to the firing rate for a respective frequency band in the mel-scaled filter bank.
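
That the pedestal-free update tracks the same internal state as the original model can be checked numerically. The sketch below is illustrative only: it compares the update of equations (2) and (3) with the update of equations (16) and (20) on an arbitrary, assumed loudness sequence, with hypothetical function names and the preferred constants quoted in this description.

```python
def full_step(n, loudness, dt, s_o, s_h, d, a_o=1.0):
    """Next n from equations (2) and (3)."""
    return n + (a_o - (s_o + s_h + d * loudness) * n) * dt

def reduced_step(n_var, loudness, dt, s_o, s_h, d, a_o=1.0):
    """Pedestal-free output and next time-varying state, equations (16) and (20)."""
    n_rest = a_o / (s_o + s_h)                                      # equation (13)
    f_var = (s_o + d * loudness) * n_var + d * loudness * n_rest    # equation (16)
    n_var_next = n_var * (1.0 - s_h * dt) - f_var * dt              # equation (20)
    return f_var, n_var_next

S_O, S_H, D = 0.0888, 0.11111, 0.00666
n_rest = 1.0 / (S_O + S_H)
n_full, n_var = n_rest, 0.0          # both models start at the resting level
for loudness in [0.0, 5.0, 20.0, 20.0, 3.0]:      # assumed loudness values in sones
    n_full = full_step(n_full, loudness, 1.0, S_O, S_H, D)
    _, n_var = reduced_step(n_var, loudness, 1.0, S_O, S_H, D)
gap = abs(n_full - (n_rest + n_var))
```

In this sketch the two state trajectories differ only by the constant n', so the reduced form carries the same information with a smaller dynamic range.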

With respect to the embodiment just described, the equations for f, dn/dt, and n(t+1) define the expression for the firing rate f and the next state n(t + Δt), respectively.

It is to be noted that the values attributed to the terms in the various equations (namely $T_o$ = 5 csec, $T_{max}$ = 3 csec, $A_o$ = 1, R = 1.5, and $L_{max}$ = 20) may be set otherwise, in which case $S_o$, $S_h$, and D will differ from the preferred derived values of 0.0888, 0.11111, and 0.00666, respectively.

The present acoustic model has been implemented by the inventors in the PL/I programming language on Floating Point Systems FPS 190L hardware, but it may also be practiced with various other software or hardware.

E-1C. Detailed Match

FIG. 3 shows a sample detailed match phone machine 2000. Each detailed match phone machine is a probabilistic finite-state machine characterized by (a) a plurality of states $S_i$; (b) a plurality of transitions $tr(S_j|S_i)$, some of which extend between different states and some of which extend from a state back to itself, each transition having a probability associated with it; and (c) an actual label probability for each label that can be generated at a particular transition.

In FIG. 3, seven states S1 through S7 and thirteen transitions tr1 through tr13 are provided in the detailed match phone machine 2000. As can be seen from FIG. 3, phone machine 2000 has three transitions shown with dashed lines, namely tr11, tr12, and tr13. At each of these three transitions the phone can change from one state to another without producing a label, and such a transition is accordingly called a null transition. Along transitions tr1 through tr10, labels can be produced. Specifically, along each of the transitions tr1 through tr10, one or more labels may have a distinct probability of being generated. Preferably, for each transition there is a probability associated with each label that can be generated in the system. That is, if there are two hundred labels that can be selectively generated by the acoustic channel, each transition (that is not a null transition) has two hundred "actual label probabilities" associated with it, each of which corresponds to the probability that the corresponding label is generated by the phone at that particular transition. In FIG. 3, the actual label probabilities for transition tr1 are denoted by the symbol p[t], where t is an integer from 1 to 200 identifying the label. For label 1, for example, there is a probability p[1] that the detailed match phone machine 2000 generates label 1 at transition tr1. The various actual label probabilities are stored together with the label and the corresponding transition.

When a string of labels $y_1 y_2 y_3 \ldots$ is presented to a detailed match phone machine 2000 corresponding to a given phone, a match procedure is performed. The procedure associated with the detailed match phone machine is explained with reference to FIG. 11.

FIG. 11 is a trellis diagram of the phone machine of FIG. 3.

As in the phone machine of FIG. 3, the trellis diagram shows a null transition from state S1 to state S7, and transitions from state S1 to state S2 and from state S1 to state S4. The transitions between other states are also shown. The trellis diagram also shows time, measured in the horizontal direction. In FIG. 11, the start-time probabilities $q_0$ and $q_1$ represent the probabilities that the phone has a start time at time t = $t_0$ and at time t = $t_1$, respectively. The various transitions are shown at each start time $t_0$ and $t_1$. It should be noted that the interval between successive start times is preferably equal in length to the time interval of a label.

In employing the detailed match phone machine 2000 to determine how closely a given phone matches the labels of an incoming string, an end-time distribution for the phone is sought and is used in determining a match value for the phone. The notion of relying on the end-time distribution is common to all embodiments of phone machines discussed herein in connection with the matching procedure.

In generating the end-time distribution to perform a detailed match, the detailed match phone machine 2000 involves computations which are exact and complicated.

Referring to FIG. 11, consider first the computations required for the phone to have both its start time and its end time at time t = $t_0$. For this to be the case with the example phone machine structure shown in FIG. 3, the following probability applies.

$\Pr(S_7, t = t_0) = q_0\,T(1 \to 7) + \Pr(S_2, t = t_0)\,T(2 \to 7) + \Pr(S_3, t = t_0)\,T(3 \to 7)$  (21)

where Pr represents the probability of being in the state indicated in parentheses, and T represents the probability of the transition between the two states indicated in parentheses, in the direction of the arrow. Equation (21) gives the respective probabilities for the three conditions under which the end time can occur at t = $t_0$. It is also seen that, in this example, the end time at t = $t_0$ is limited to occurrence at state S7.

Looking next at the end time t = $t_1$, it is seen that a calculation must be made for every state other than state S1. State S1 starts at the end time of the previous phone. For purposes of explanation, only the calculation pertaining to state S4 is set forth.

For state S4, the calculation is:

$\Pr(S_4, t = t_1) = \Pr(S_1, t = t_0)\,T(1 \to 4)\,\Pr(y_1 \mid 1 \to 4) + \Pr(S_4, t = t_0)\,T(4 \to 4)\,\Pr(y_1 \mid 4 \to 4)$  (22)

In words, equation (22) indicates that the probability of the phone machine being in state S4 at time t = $t_1$ depends on the sum of two terms: (a) the probability of being at state S1 at time t = $t_0$, multiplied by the probability of the transition from state S1 to state S4 and by the probability of the given label $y_1$ being generated given a transition from state S1 to state S4; and (b) the probability of being at state S4 at time t = $t_0$, multiplied by the probability of the transition from state S4 to itself and by the probability of the given label $y_1$ being generated given a transition from state S4 to itself.

Similarly, calculations pertaining to the other states (excluding state S1) are performed to determine the probability that the phone is at a particular state at time t = $t_1$. Generally, in determining the probability of being at a subject state at a given time, the detailed match (a) recognizes each previous state that has a transition leading to the subject state, and the respective probability of each such previous state; (b) recognizes, for each such previous state, a value representing the probability of the label that must be generated at the transition between that previous state and the current state in order to conform to the label string; and (c) combines the probability of each previous state with the respective value representing the label probability, to provide the probability of the subject state over the corresponding transition. The overall probability of being at the subject state is determined from the subject state probabilities over all transitions leading to it. It should be noted that the calculation for state S7 includes terms relating to the three null transitions, which permit the phone to start and end at time t = $t_1$ with the phone ending in state S7.

As with the probability determinations for times t = $t_0$ and t = $t_1$, probability determinations for a series of other end times are preferably made to form an end-time distribution. The value of the end-time distribution for a given phone indicates how well that phone matches the incoming labels. In determining how well a word matches a string of incoming labels, the phones which represent the word are processed in sequence. Each phone generates an end-time distribution of probability values. A match value for the phone is obtained by summing the end-time probabilities and then taking the logarithm of that sum. The start-time distribution for the next phone is obtained by normalizing the end-time distribution, for example by dividing each value by the sum so that the scaled values total one.

It should be noted that there are at least two methods of determining the number of phones to be examined for a given word or word string. In the depth-first method, computation proceeds along a baseform, computing a running subtotal with each successive phone. When the subtotal is found to be below a predefined threshold for a given phone position along the baseform, the computation terminates. Alternatively, in the breadth-first method, a computation is made for similar phone positions in each word: the computation for the first phone of each word is followed by the computation for the second phone of each word, and so on. In the breadth-first method, the computations along the same number of phones for the various words are compared at the same relative phone positions. In either method, the word or words having the largest sum of match values are the object sought.

The detailed match has been implemented in APAL (Array Processor Assembly Language), which is the native assembler for the Floating Point Systems, Inc. 190L.

It should be recognized that the detailed match requires considerable memory for storing each of the actual label probabilities (i.e., the probability that a given phone generates a given label y at a given transition), the transition probabilities for each phone machine, and the probabilities of a given phone being at a given state at a given time after a defined start time. The above-noted FPS 190L is set up to make the various computations of end times; match values based on, for example, a sum (preferably the logarithm of the sum of end-time probabilities); start times based on the previously generated end-time probabilities; and word match scores based on the match values for sequential phones in a word.

In addition, the detailed match preferably accounts for tail probabilities in the matching procedure. A tail probability measures the likelihood of successive labels without regard to words. In a simple embodiment, a given tail probability corresponds to the likelihood of a label following another label. This likelihood is readily determined from strings of labels generated by, for example, some sample speech.

Hence, the detailed match requires sufficient storage to contain baseforms, statistics for the Markov models, and tail probabilities. For a 5000-word vocabulary in which each word comprises approximately ten phones, the baseforms have a memory requirement of 5000 × 10. Where there are 70 distinct phones (with a Markov model for each phone), 200 distinct labels, and ten transitions at which any label has a probability of being produced, the statistics would require 70 × 10 × 200 locations. However, the phone machines are preferably divided into three portions (a start portion, a middle portion, and an end portion) with statistics corresponding thereto; the three self-loops are preferably included in successive portions. Accordingly, the storage requirement is reduced to 70 × 3 × 200. With regard to the tail probabilities, 200 × 200 storage locations are needed. In this arrangement, 50K of integer storage and 82K of floating-point storage perform satisfactorily.
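
The storage figures can be reproduced with a few multiplications. The sketch below only restates the arithmetic of this paragraph; the variable names are chosen for the example, and the reading that the floating-point total consists of the tied statistics plus the tail probabilities is an assumption.

```python
VOCAB_WORDS, PHONES_PER_WORD = 5000, 10
PHONE_TYPES, LABELS, TRANSITIONS, PORTIONS = 70, 200, 10, 3

baseform_slots = VOCAB_WORDS * PHONES_PER_WORD        # 5000 x 10 integer entries
full_stats = PHONE_TYPES * TRANSITIONS * LABELS       # one label table per transition
tied_stats = PHONE_TYPES * PORTIONS * LABELS          # one label table per portion
tail_slots = LABELS * LABELS                          # label-following-label table
float_slots = tied_stats + tail_slots                 # assumed floating-point total
```

Under this reading, the integer and floating-point totals come to 50,000 and 82,000 locations, in line with the 50K and 82K figures given above.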

It should be noted that the detailed match may be implemented by using phonemic, rather than phonetic, phones.

E-1D. Basic Fast Match

Because the detailed match is computationally expensive, a basic fast match and an alternative fast match are provided which reduce the required computation at some sacrifice in accuracy. The fast match is preferably used in conjunction with the detailed match: the fast match lists likely candidate words from the vocabulary, and the detailed match is performed on, at most, the candidate words on the list.

The fast approximate acoustic matching technique is the subject of the applicant's aforementioned U.S. patent application Ser. No. 672,974. In the fast approximate acoustic match, each phone machine is preferably simplified by replacing the actual probability of each label at every transition in a given phone machine with a specific replacement value. The replacement value is preferably chosen so that the match value obtained for a given phone when the replacement values are used is an overestimate of the match value that the detailed match would produce without the replacement. One way to guarantee this condition is to choose each replacement value so that no probability of the given label in the given phone machine is greater than the replacement value. Replacing the actual label probabilities in a phone machine with the corresponding replacement values greatly reduces the computation needed to determine the match score for a word. Moreover, because the replacement values are preferably overestimates, the resulting match score is never less than the score that would have been obtained without the replacement.

In a particular embodiment, in which acoustic matching is performed in a linguistic decoder that uses Markov models, each phone machine is characterized, by training, as having the following (a) to (c):

(a) a plurality of states and transition paths between the states;

(b) transitions tr(Sj | Si) with probabilities T(i→j), each of which represents the probability of a transition to state Sj given the current state Si, where Si and Sj may be the same state or different states; and

(c) actual label probabilities P(Yk | i→j).

Each actual label probability P(Yk | i→j) represents the probability that label Yk is produced by the given phone machine at a given transition from one state to a subsequent state, where k is an index identifying the label. Each phone machine also has (d) means for assigning a single specific value P'(Yk) to each label Yk in that phone machine, and (e) means for replacing each actual output probability P(Yk | i→j) at every transition in the given phone machine with the single specific value P'(Yk) assigned to the corresponding Yk.

Preferably, the replacement value is at least as large as the largest actual label probability of the corresponding label Yk at any transition in the particular phone machine. The fast match procedure is used to determine a list of on the order of 10 to 100 candidate words, selected from the vocabulary as the words most likely to correspond to the input labels. The candidate words are preferably subjected to the language model and to the detailed match.

By reducing the number of words considered by the detailed match to about 1% of the vocabulary in this way, the computational cost is greatly reduced while accuracy is maintained.

The basic fast match simplifies the detailed match by replacing, with a single value, the actual label probabilities of a given label at all transitions in a given phone machine at which that label can be produced.

That is, regardless of the transition in the given phone machine at which the label has a probability of being produced, the probability is replaced by a single specific value. This value is an overestimate: it is at least as large as the largest probability of the label at any transition in the given phone machine.

When the replacement value for a given label in a given phone machine is set to the maximum of that label's actual probabilities, the match value produced by the basic fast match is at least as large as the match value that the detailed match would produce. The basic fast match thus typically overestimates the match value of each phone, so that more words are generally selected as candidates. Words that would be considered candidates under the detailed match therefore also pass the basic fast match.
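The overestimate property can be illustrated with a toy phone machine. The sketch below is not the patent's implementation: the probability table is invented, and only the rule "replacement value = largest actual probability of the label over all transitions" is taken from the text.

```python
# Hypothetical label probabilities P(Yk | i->j) for one phone machine:
# each row is a transition, each column is a label.
actual = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.1, 0.8],
]


def replacement_values(label_probs):
    """Return P'(Yk): the largest probability of each label over all transitions."""
    return [max(column) for column in zip(*label_probs)]


def path_label_probability(label_probs, transitions, labels):
    """Product of the actual label probabilities along one path of transitions."""
    product = 1.0
    for transition, label in zip(transitions, labels):
        product *= label_probs[transition][label]
    return product


def replaced_label_probability(replacements, labels):
    """Same product with every actual probability replaced by P'(Yk)."""
    product = 1.0
    for label in labels:
        product *= replacements[label]
    return product


p_prime = replacement_values(actual)
labels = [0, 2, 1]
exact = path_label_probability(actual, [0, 1, 2], labels)
approx = replaced_label_probability(p_prime, labels)
print(p_prime, exact, approx)
```

Because every factor in the replaced product is at least as large as the factor it replaces, the replaced product can never fall below the exact one, whichever transitions the path uses.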

Referring to FIG. 12, a phone machine 3000 for the basic fast match is shown. Labels (also referred to as symbols and phonemes) are input to the basic fast match phone machine 3000 together with a start-time distribution. The start-time distribution and the label string input are like those input to the detailed match phone machine described above. It should be recognized that the start time may, on occasion, not be a distribution over a plurality of times but may instead be a precise time, for example following an interval of silence, at which the phone begins. When speech is continuous, however, the end-time distribution is used to define the start-time distribution (as is discussed in greater detail below). The phone machine 3000 generates an end-time distribution and, from that end-time distribution, a match value for the particular phone. The match value for a word is defined as the sum of the match values of its component phones, or at least of the first h phones of the word.

Referring to FIG. 13, the basic fast match computation is shown diagrammatically. The computation involves only the start-time distribution, the number (or length) of labels produced by the phone, and the replacement value P'(Yk) associated with each label Yk. By replacing all the actual label probabilities of a given label in a given phone machine with the corresponding replacement value, the basic fast match replaces the transition probabilities with length distribution probabilities, and removes the need to carry the actual label probabilities (which can differ for each transition in a given phone machine) and the probabilities of being in a given state at a given time.

In this regard, the length distribution is determined from the detailed match model. Specifically, for each length in the length distribution, the procedure examines each state individually and determines, for each state, the various transition paths by which the state under examination can be reached (a) given the particular label length and (b) regardless of the outputs along the transitions. The probabilities of all transition paths of the particular length that lead to each such state are summed, and the sums for all such states are then added to give the probability of the given length in the distribution.

The above procedure is repeated for each length. In the preferred form of the matching procedure, these computations are made with reference to a trellis diagram, as is known in the art of Markov modeling.

That is, for transition paths that share branches along the trellis structure, the computation for each common branch need be made only once, and the result is applied to every path that includes the common branch.

In the diagram of FIG. 13, two limitations are included by way of example. First, it is assumed that the length of the label string produced by the phone is 0, 1, 2, or 3, with respective probabilities l0, l1, l2, and l3. Second, the start time is also limited, so that only four start times are possible, with respective probabilities q0, q1, q2, and q3. With these limitations, the following equations define the end-time distribution of the subject phone.
, Q2 and q3 are only possible. With these limitations, the following equation determines the end time distribution of the target note.

\[
\begin{aligned}
\Phi_0 &= q_0 l_0 \\
\Phi_1 &= q_1 l_0 + q_0 l_1 p_1 \\
\Phi_2 &= q_2 l_0 + q_1 l_1 p_2 + q_0 l_2 p_1 p_2 \\
\Phi_3 &= q_3 l_0 + q_2 l_1 p_3 + q_1 l_2 p_2 p_3 + q_0 l_3 p_1 p_2 p_3 \\
\Phi_4 &= q_3 l_1 p_4 + q_2 l_2 p_3 p_4 + q_1 l_3 p_2 p_3 p_4 \\
\Phi_5 &= q_3 l_2 p_4 p_5 + q_2 l_3 p_3 p_4 p_5 \\
\Phi_6 &= q_3 l_3 p_4 p_5 p_6
\end{aligned}
\]

Here p_m is the replacement value of the label produced at time m. Examining these equations, it can be seen that Φ3 includes a term corresponding to each of the four start times.

The first term represents the probability that the phone starts at time t = t3 and produces a label string of length zero (that is, the phone starts and ends at the same time). The second term represents the probability that the phone starts at time t = t2, that the label length is 1, and that label 3 is produced by the phone. The third term represents the probability that the phone starts at time t = t1, that the label length is 2 (namely labels 2 and 3), and that labels 2 and 3 are produced by the phone. Similarly, the fourth term represents the probability that the phone starts at time t = t0, that the label length is 3, and that the three labels 1, 2, and 3 are produced by the phone.
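The end-time equations follow one pattern: each term pairs a start time with a label length and multiplies in the replacement values of the labels consumed. A minimal sketch of that pattern is given below; the function name and the sample numbers are assumptions made for illustration, not values from the patent.

```python
def end_time_distribution(q, l, p):
    """Compute the end-time values Phi_0 .. Phi_(len(q)+len(l)-2).

    q[s] is the probability that the phone starts at time t_s.
    l[n] is the probability that the phone produces n labels.
    p[t] is the replacement value of the label produced at time t
    (p[0] is never used, because labels are numbered from 1).
    """
    phi = [0.0] * (len(q) + len(l) - 1)
    for s, q_s in enumerate(q):
        for n, l_n in enumerate(l):
            term = q_s * l_n
            # A phone starting at t_s with n labels consumes labels s+1 .. s+n.
            for t in range(s + 1, s + n + 1):
                term *= p[t]
            phi[s + n] += term
    return phi


# Made-up example values: four start times, label lengths 0 to 3, six labels.
q = [0.4, 0.3, 0.2, 0.1]
l = [0.1, 0.2, 0.3, 0.4]
p = [1.0, 0.5, 0.6, 0.7, 0.8, 0.9, 0.5]

phi = end_time_distribution(q, l, p)
for m, value in enumerate(phi):
    print(f"Phi_{m} = {value:.6f}")
```

With four start times and four possible lengths the result has seven entries, Φ0 through Φ6, and the entry for Φ3 is the only one that receives a contribution from every start time.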

Comparing the computation required by the basic fast match with that required by the detailed match shows that the former is relatively simple. In this regard, the P'(Yk) values, like the label length probabilities, remain the same for each appearance in all of the equations. Moreover, with the length and start-time limitations, the computations for the later end times become simpler. For example, at Φ6 the phone must start at time t = t3, and all three labels 4, 5, and 6 must be produced by the phone for that end time to apply. In generating the match value for the subject phone, the end-time probabilities along the defined end-time distribution are summed. If desired, the logarithm of the sum is taken, giving the following expression.

\[
\text{match value} = \log_{10}(\Phi_0 + \cdots + \Phi_6)
\]

As noted previously, the match score for a word is readily determined by summing the match values of the successive phones in that word.
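The scoring rule can be written out directly. The sketch below assumes a base-10 logarithm and uses invented end-time values; the function names are illustrative only.

```python
import math


def phone_match_value(end_time_probs):
    """Match value of one phone: log10 of the sum of its end-time probabilities."""
    return math.log10(sum(end_time_probs))


def word_match_score(end_time_dists):
    """Match score of a word: sum of the match values of its successive phones."""
    return sum(phone_match_value(dist) for dist in end_time_dists)


# Invented end-time distributions for a two-phone word.
first_phone = [0.05, 0.03, 0.02]
second_phone = [0.004, 0.006]
print(phone_match_value(first_phone))
print(word_match_score([first_phone, second_phone]))
```

Summing logarithms over the phones of a word corresponds to multiplying the per-phone probability sums.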

The generation of the start-time distribution is now described with reference to FIG. 14. In FIG. 14(a), the word THE1 is repeated, broken down into its component phones. FIG. 14(b) shows the label string over time. FIG. 14(c) shows a first start-time distribution, which has been derived from the end-time distribution of the most recent previous phone (in the previous word, which may be a "word" of silence). Based on the label input and the start-time distribution of FIG. 14(c), the end-time distribution ΦDH of the phone DH is generated. The start-time distribution of the next phone, UH, is determined by recognizing the period during which the end-time distribution of the previous phone exceeds a threshold (A) in FIG. 14(d). (A) is determined individually for each end-time distribution.

Preferably, (A) is a function of the sum of the end-time distribution values of the subject phone. The interval between times a and b thus represents the time over which the start-time distribution for the phone UH is set (see FIG. 14(e)).

The interval between times c and d in FIG. 14(e) corresponds to the period during which the end-time distribution of the phone DH exceeds the threshold (A) and over which the start-time distribution of the next phone is set. The values of the start-time distribution are obtained by normalizing the end-time distribution, for example by dividing each end-time value by the sum of the end-time values that exceed the threshold (A).
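The thresholding and normalization step can be sketched as follows. The text says only that (A) is preferably a function of the sum of the end-time values; taking it as a fixed fraction of that sum is an assumption made here for illustration, as are the function name and the sample numbers.

```python
def next_start_distribution(end_dist, fraction=0.1):
    """Derive the next phone's start-time distribution from an end-time distribution.

    The threshold (A) is taken as `fraction` times the sum of the end-time
    values (an illustrative choice). Values at or below the threshold are
    dropped, and the remaining values are divided by their sum.
    """
    threshold = fraction * sum(end_dist)
    kept = [value if value > threshold else 0.0 for value in end_dist]
    total = sum(kept)
    if total == 0.0:
        raise ValueError("no end-time value exceeds the threshold")
    return [value / total for value in kept]


# Invented end-time distribution of the previous phone.
end_dist = [0.01, 0.05, 0.2, 0.3, 0.04]
start_dist = next_start_distribution(end_dist)
print(start_dist)
```

The nonzero span of the result plays the role of the interval between times a and b in FIG. 14(e).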

The basic fast match phone machine 3000 has been implemented with an APAL program on a Floating Point Systems, Inc. 190L. Other hardware and software may also be used to develop a specific form of the matching procedure by following the teachings set forth above.

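E-1E. Alternative fast match

The basic fast match, used alone or in combination with the detailed match or the language model, greatly reduces the required computation. To reduce the computation further, the technique described here simplifies the detailed match still more by defining a uniform label length distribution between two lengths: a minimum length Lmin and a maximum length Lmax. In the basic fast match, the probabilities that a phone produces a label string of a given length (that is, l0, l1, l2, and so on) typically have different values. In the alternative fast match, the probability of each label length is replaced by a single uniform value.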
*The word model is combined to significantly reduce the amount of calculation required. To further reduce the required total fi amount, the method described here can be further refined by defining a uniform label length ax distribution between two lengths, the minimum length and the maximum length in L. , which further simplifies detailed matching. In basic fast matching, the probability of a phone generating a label of a given length, fj, 1°, 11.12, etc., typically has different values. However, according to this alternative fast match, the probabilities for each length of the label are replaced by a single uniform value.

Preferably, the minimum length is equal to the smallest length that has a nonzero probability in the original length distribution, although other lengths may be selected if desired.

The selection of the maximum length is more arbitrary than that of the minimum length, but it is significant in that the probabilities of lengths smaller than the minimum or greater than the maximum are set to zero. Defining the length probability to exist only between the minimum and maximum lengths gives a uniform pseudo-distribution. In one approach, the uniform probability can be set to the average probability over the pseudo-distribution. Alternatively, it can be set to the maximum of the length probabilities that the uniform value replaces.
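The two ways of choosing the uniform value can be sketched as below. The length distribution, the choice of maximum length, and the function name are all assumptions made for illustration.

```python
def uniform_length_value(length_probs, l_max, mode="average"):
    """Return (l_min, uniform value) for a uniform pseudo-distribution.

    l_min is the smallest length with a nonzero probability in the original
    distribution. The uniform value is either the average or the maximum of
    the original probabilities for lengths l_min .. l_max.
    """
    l_min = next(n for n, prob in enumerate(length_probs) if prob > 0.0)
    window = length_probs[l_min:l_max + 1]
    if mode == "average":
        value = sum(window) / len(window)
    elif mode == "maximum":
        value = max(window)
    else:
        raise ValueError("mode must be 'average' or 'maximum'")
    return l_min, value


# Invented original length distribution for lengths 0 .. 5.
length_probs = [0.0, 0.1, 0.4, 0.3, 0.15, 0.05]
print(uniform_length_value(length_probs, l_max=4, mode="average"))
print(uniform_length_value(length_probs, l_max=4, mode="maximum"))
```

Choosing the maximum keeps the value an overestimate of every length probability it replaces, in the same spirit as the label replacement values of the basic fast match.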

The effect of characterizing all the label length probabilities as equal is readily seen by referring to the equations given above for the end-time distribution of the basic fast match. Specifically, the length probabilities can be factored out as a constant.

With Lmin set to zero and all the length probabilities replaced by a single constant value, the end-time distribution can be characterized as:

\[
\theta_m = \Phi_m / l = q_m + \theta_{m-1}\, p_m
\]

where l is the single uniform replacement value and p_m preferably corresponds to the replacement value of the label produced in the given phone at time m.

For the above expression for θm, the match value is defined as follows.

\[
\text{match value} = \log_{10}(\theta_0 + \theta_1 + \cdots + \theta_m) + \log_{10}(l) \qquad (1)
\]

Comparing the basic fast match with the alternative fast match shows that the number of additions and multiplications required is greatly reduced by the alternative fast match. With Lmin = 0, the basic fast match was found to require 40 multiplications and 20 additions, because the length probabilities must be taken into account. With the alternative fast match, θm is determined recursively, and each successive θm requires only one multiplication and one addition.
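The saving from the recursion can be seen by computing θm two ways: once from the explicit sum over start times and once recursively. This is a sketch under the stated case (Lmin = 0, uniform length value, no upper limit on length); the function names and the numbers are invented for illustration.

```python
import math


def theta_recursive(q, p, num_times):
    """theta_m = q_m + theta_(m-1) * p_m: one multiplication and one addition per step."""
    thetas = []
    previous = 0.0
    for m in range(num_times):
        q_m = q[m] if m < len(q) else 0.0  # no start times beyond the distribution
        theta = q_m + previous * p[m]
        thetas.append(theta)
        previous = theta
    return thetas


def theta_direct(q, p, num_times):
    """Same values from the explicit sum over all start times s <= m."""
    thetas = []
    for m in range(num_times):
        total = 0.0
        for s in range(min(m, len(q) - 1) + 1):
            term = q[s]
            for t in range(s + 1, m + 1):
                term *= p[t]
            total += term
        thetas.append(total)
    return thetas


def match_value(thetas, uniform_l):
    """Expression (1): log10 of the summed thetas plus log10 of the uniform value l."""
    return math.log10(sum(thetas)) + math.log10(uniform_l)


q = [0.4, 0.3, 0.2, 0.1]
p = [1.0, 0.5, 0.6, 0.7, 0.8, 0.9, 0.5]  # p[0] has no effect
recursive = theta_recursive(q, p, 7)
direct = theta_direct(q, p, 7)
print(recursive)
print(match_value(recursive, uniform_l=0.25))
```

Once the start-time distribution is exhausted, q_m is zero and each further θm costs a single multiplication, which matches the operation counts described for FIG. 15.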

FIGS. 15 and 16 are provided to show how the alternative fast match simplifies the computation.

In FIG. 15(a), a phone machine embodiment 3100 corresponding to a minimum length Lmin = 0 is shown. The maximum length is taken to be infinite, so that the length distribution can be characterized as uniform. FIG. 15(b) shows the trellis diagram produced by the phone machine 3100. Assuming that start times after qn lie outside the start-time distribution, determining each successive θm for m < n requires one addition and one multiplication. Determining the end times after that requires only one multiplication and no addition.

FIG. 16 shows the case Lmin = 4. FIG. 16(a) shows a specific embodiment of a phone machine 3200 for this case, and FIG. 16(b) shows the corresponding trellis diagram.

Because Lmin = 4, the trellis diagram of FIG. 16(b) has zero probability along the paths marked u, v, w, and z. Note that the end times extending between θ4 and θn require four multiplications and one addition. End times greater than n + 4, however, require only one multiplication and no addition. This embodiment has been implemented in APAL code on an FPS 190L.

Note that states may be added to the embodiment of FIG. 15 or FIG. 16 as desired.

E-1F. Matching based on the first J labels

As a further refinement of the basic fast match and the alternative fast match, it is contemplated that only the first J labels of the string entering a phone machine be considered in the match. Assuming that labels are produced by the acoustic processor of the acoustic channel at a rate of one per hundredth of a second, a reasonable value for J is 100. In other words, labels corresponding to about one second of speech are provided to determine the match between a phone and the labels entering the phone machine. Limiting the number of labels examined gives two advantages. First, the decoding delay is reduced. Second, problems in comparing the scores of short words with those of long words are substantially avoided. The value of J may, of course, be varied as desired.

The effect of limiting the number of labels examined can be seen with reference to the trellis diagram of FIG. 16(b). Without this refinement, the fast match score is the sum of the probabilities θm along the bottom row of the diagram. That is, the probability of being in state S4 at each time, starting at t = t0 (for Lmin = 0) or t = t4 (for Lmin = 4), is determined as a θm, and all the θm are then summed. For Lmin = 4, there is no probability of being in state S4 at any time before t4. With the refinement, the summing of the θm ends at time J. In FIG. 16(b), time J corresponds to time t(n+2).

Ending the examination of the J labels at time interval J results in the following two probability summations when the match score is computed. First, as described above, there is a row computation along the bottom row of the trellis diagram, but only up to time J − 1. The probabilities of being in state S4 at each time up to time J − 1 are summed to give a row score. Second, there is a column score, which is the sum of the probabilities that the phone is in each of the states S0 through S4 at time J. The match value for the phone is obtained by adding the row score and the column score and then taking the logarithm of the sum. To continue the fast match for the next phone, the values along the bottom row, preferably including time J, are used to derive the start-time distribution of the next phone.

After a match value has been computed for each of the successive phones, the total for all the phones is, as noted previously, the sum of the match values of all the phones.

Examining how the end-time distribution is generated in the basic fast match and the alternative fast match described above shows that the computation of the column score does not fit readily into the fast match computations. To adapt the refinement of limiting the number of labels examined to the fast match more effectively, the matching technique described here replaces the column score with an additional row score. That is, an additional row score is computed for the phone being in state S4 (FIG. 16(b)) between times J and J + K, where K is the maximum number of states in a phone machine. Hence, if a phone machine has 10 states, this refinement adds 10 end times along the bottom row of the trellis diagram, for each of which a probability is computed. All the probabilities along the bottom row up to and including the probability at time J + K are added to give the match value for the given phone. As before, the match values of consecutive phones are summed to give the match score for the word.

This embodiment has been implemented in APAL code on an FPS 190L. It can, however, also be implemented using other code and other hardware.

E-1G. Phone tree structure and fast match

Using the basic fast match or the alternative fast match, with or without the limit on the maximum number of labels, greatly reduces the computation time required to determine the match values of the phone machines. Moreover, the saving in computation time remains large even when the detailed match is performed on the words obtained by the fast match.

Once the phone match values have been computed, they are compared along the branches of a tree structure 4100, as shown in FIG. 17, to determine which paths of phones are most probable. In FIG. 17, for the spoken word "the", the phone match values for DH and UH1 (leading from point 4102 to branch 4104) should sum to a much higher value than the various sequences of phones branching from the phone MX. In this regard, note that the phone match value of the first MX phone is computed only once and is then used for every baseform extending from that MX phone (see branches 4104 and 4106). In addition, when the total score computed along a first sequence of branches is found to be much lower than a threshold, or much lower than the total scores of other sequences of branches, all the baseforms extending from that first sequence can be eliminated as candidate words at the same time. For example, when it is determined that MX is not a likely path, the baseforms associated with branches 4108 through 4118 are discarded together.
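The sharing and pruning behaviour of the tree can be sketched with a small trie of baseforms. This is a simplified illustration, not the patent's data structure: each phone is given one fixed match value (in practice the value depends on where in the label string the phone is matched), and the words "me" and "my" and the phones IY and AY are hypothetical examples.

```python
def build_tree(baseforms):
    """Build a trie from a mapping of word -> tuple of phones."""
    root = {"children": {}, "words": []}
    for word, phones in baseforms.items():
        node = root
        for phone in phones:
            node = node["children"].setdefault(
                phone, {"children": {}, "words": []})
        node["words"].append(word)
    return root


def candidates(node, phone_value, threshold, score=0.0):
    """Yield (word, score) pairs, abandoning any branch whose running score
    drops below the threshold so that all baseforms under it are dropped at once."""
    for word in node["words"]:
        yield word, score
    for phone, child in node["children"].items():
        child_score = score + phone_value[phone]  # computed once per shared branch
        if child_score < threshold:
            continue
        yield from candidates(child, phone_value, threshold, child_score)


# Hypothetical baseforms and log-domain phone match values.
baseforms = {"the": ("DH", "UH1"), "me": ("MX", "IY"), "my": ("MX", "AY")}
phone_value = {"DH": -1.0, "UH1": -1.5, "MX": -9.0, "IY": -1.0, "AY": -1.2}

tree = build_tree(baseforms)
result = list(candidates(tree, phone_value, threshold=-5.0))
print(result)
```

In this example the MX branch falls below the threshold after a single evaluation, so both words that begin with MX are discarded without scoring their remaining phones.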

By using the fast match and the tree structure, an ordered list of candidate words is obtained with a great saving in computation.

With regard to storage requirements, note that the tree structure of phones, the statistics for the phones, and the terminal probabilities must be stored. For the tree structure, there are 25000 arcs and four data words characterizing each arc. The first data word is an index to the successor arcs or phones. The second data word gives the number of successor phones along the branch. The third data word indicates at which node of the tree the arc is located. The fourth data word indicates the current phone. The tree structure therefore requires 25000 × 4 storage locations. In the fast match, there are 100 distinct phones and 200 distinct phonemes. Since a phoneme has a single probability of being produced anywhere in a phone, storage for 100 × 200 statistical probabilities is required. Finally, the terminal probabilities require 200 × 200 storage locations. For the fast match, 100K of integer storage and 60K of floating-point storage are sufficient.

E-1H. Language model

As noted previously, a language model that stores information on words in context, such as tri-grams, can be used to increase the probability of a correct word selection.

The language model 1010 (FIG. 1) has a unique character. Specifically, a modified tri-gram method is used. In this method, a sample text is examined to determine the likelihood of each single word, each ordered pair of words, and each ordered triplet of words in the vocabulary. A list of the most likely triplets of words and a list of the most likely pairs of words are then formed.

In addition, the probability that a triplet is not on the triplet list and the probability that a pair is not on the pair list are each determined.

この言語モデルによれば、ある嶺部の単Ｆ？［２つの単
語が続くとき、この当面の単語とそれに続く２つの単語
が上記三速単ＫＮ’）ストに存在しているか否かについ
て判断がなされる。そして、もしそうなら、その三速即
飴に割りあてられた記憶されているｗｅ究が示される。According to this language model, a certain Reibe's simple F? [When two words follow, a determination is made as to whether this current word and the following two words are present in the three-speed single KN') strike. And, if so, the memorized weight assigned to that three-speed candy is shown.

If the subject word and its two predecessors are not on the triplet list, a determination is made as to whether the subject word and its adjacent predecessor are on the pair list. If so, the probability of that pair is multiplied by the probability of a triplet not being on the triplet list, and the product is assigned to the subject word. If the subject word and its predecessors are on neither the triplet list nor the pair list, the probability of the subject word alone is multiplied by the probability of a triplet not being on the triplet list and by the probability of a pair not being on the pair list. This product is then assigned to the subject word.
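The three-way lookup described above can be sketched in Python. This is a minimal illustration, not the patent's implementation: the function name, the dictionary-based tables, and the parameter names are all hypothetical, and the sketch assumes that the context consists of the two words preceding the subject word, as in a conventional trigram model.

```python
from typing import Dict, Tuple


def word_probability(
    w1: str,
    w2: str,
    w3: str,
    trigrams: Dict[Tuple[str, str, str], float],
    bigrams: Dict[Tuple[str, str], float],
    unigrams: Dict[str, float],
    p_triplet_missing: float,
    p_pair_missing: float,
) -> float:
    """Probability assigned to subject word w3 after predecessors w1, w2.

    All table and parameter names are illustrative assumptions.
    """
    # Case 1: the triplet is on the triplet list, so use its stored probability.
    if (w1, w2, w3) in trigrams:
        return trigrams[(w1, w2, w3)]
    # Case 2: only the pair is listed, so scale the pair probability by the
    # probability that a triplet is not on the triplet list.
    if (w2, w3) in bigrams:
        return p_triplet_missing * bigrams[(w2, w3)]
    # Case 3: neither is listed, so scale the single-word probability by both
    # "not on the list" probabilities.
    return p_triplet_missing * p_pair_missing * unigrams.get(w3, 0.0)
```

Each fallback multiplies in one more "not on the list" factor, so a word reached only through its single-word probability is penalized twice.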

Referring to FIG. 18, a flowchart is shown that details the training of the phone machines used in acoustic matching. In step 5002, a vocabulary of words (typically about 5000 words) is defined. Each word is then represented by a sequence of phone machines (step 5004). The phone machines have been shown, by way of example, as phonetic phone machines, but they may instead comprise phonemic phones. Representing a word by a sequence of phonetic phone machines or by a sequence of phonemic phone machines is described below.

In step 5006, the word base forms are arranged in the tree structure described above. The statistics for each phone machine in each word base form are determined by training according to the well-known forward-backward algorithm set forth in the article "Continuous Speech Recognition by Statistical Methods" by F. Jelinek.

In step 5009, values to be substituted for the actual parameter values or statistics used in the detailed match are determined. For example, the values to be substituted for the actual label output probabilities are determined. In step 5010, the determined values replace the stored actual probabilities, so that the phones in each word base form carry the approximate substitute values. All approximations relating to the basic fast match are performed in step 5010.

A decision is then made as to whether the acoustic matching is to be enhanced (step 5011). If not, the values determined for the basic approximate match are set for use, and the estimates relating to other approximations are not set (step 5012). If an enhanced approximate match is desired, step 5018 follows. A uniform string length determination is made (step 5018), and a decision is made as to whether further enhancement is desired (step 5020). If further enhancement of the approximate match is desired, acoustic matching is limited to the first five labels in the generated string (step 5022). Whether or not an enhanced approximate match is selected, the computed parameter values are set in step 5012, at which point each phone machine in each word base form has been trained with the desired approximations that enable the fast approximate match.

E-1J. Stack Decoder

A preferred stack decoder for use in the speech recognition system of FIG. 1 was invented by L. Bahl, F. Jelinek, and R. L. Mercer of the applicant's speech recognition group. This preferred stack decoder is described below.

FIG. 19 and FIG. 20 show a plurality of successive labels Y1, Y2, ... generated at successive label intervals, or label positions.

FIG. 20 also shows several generated word paths, namely path A, path B, and path C. In the context of FIG. 19, path A corresponds to the entry "to be or", path B to the entry "two b", and path C to the entry "too". For a subject word path, there is a label (or, equivalently, a label interval) at which the word path has the highest probability of having ended. Such a label is called a "boundary label".

For a word sequence W representing a sequence of words, the most likely end time, represented in the label string as a "boundary label" between two words, can be found by a method such as that described in the article "Faster Acoustic Match Computation" by L. R. Bahl, F. Jelinek, and R. L. Mercer, IBM Technical Disclosure Bulletin, volume 23, number 4, September 1980. Briefly, the article discusses methods for addressing two similar concerns: (a) how much of a label string Y is accounted for by a word (or word sequence), and (b) at which label interval a partial sentence, corresponding to part of the label string, ends.

For any given word path, there is a "likelihood value" associated with each label or label interval, from the first label of the label string through the boundary label. Taken together, all the likelihood values for a given word path represent the "likelihood vector" of that word path. Accordingly, each word path has a corresponding likelihood vector.

The likelihood values L_t are illustrated in FIG. 20.

The "likelihood envelope" A_t at a label interval t, for a collection of word paths W_1, W_2, ..., W_S, is defined mathematically as:

$$A_t = \max\bigl(L_t(W_1), \ldots, L_t(W_S)\bigr)$$

That is, for each label interval, the likelihood envelope contains the highest likelihood value associated with any word path in the collection. A likelihood envelope 1040 is illustrated in FIG. 20.
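As a rough illustration of this definition, the envelope can be computed as an element-wise maximum over the likelihood vectors. The sketch below is an assumption-laden example: the function name is hypothetical, likelihood vectors are modelled as plain lists that stop at each path's boundary label, and intervals covered by no path are left at negative infinity.

```python
from typing import List, Sequence

NEG_INF = float("-inf")


def likelihood_envelope(
    vectors: Sequence[Sequence[float]], num_intervals: int
) -> List[float]:
    """Element-wise maximum of likelihood vectors of differing lengths."""
    envelope = [NEG_INF] * num_intervals
    for vector in vectors:
        # A vector only has values up to its path's boundary label.
        for t, value in enumerate(vector[:num_intervals]):
            if value > envelope[t]:
                envelope[t] = value
    return envelope
```

Because shorter paths contribute nothing beyond their boundary label, later intervals are shaped only by the longer paths.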
)) That is, for each label interval, the likelihood hull path contains the highest likelihood value associated with any word path in the collection. In FIG. 220, the likelihood envelope 1p1040 is illustrated.

A word path is considered "complete" if it corresponds to a complete sentence. A complete path is preferably identified by an input from the speaker, for example the pressing of a button when the speaker reaches the end of a sentence. This input is synchronized with a label interval to mark the end of the sentence. A complete word path cannot be extended by appending words.

A "partial" word path corresponds to an incomplete sentence and can be extended.

Partial paths are classified as "live" or "dead". A word path is "dead" if it has already been extended and "live" if it has not. With this classification, a path that has already been extended to form one or more longer word paths is never reconsidered for extension at a later time.

Each word path can also be characterized as "good" or "bad" relative to the likelihood envelope. A word path is "good" if, at the label corresponding to its boundary label, it has a likelihood value within Δ of the maximum likelihood envelope. Otherwise the word path is "bad". Preferably, but not necessarily, Δ is a fixed value by which each value of the maximum likelihood envelope is reduced to serve as the good/bad threshold level.

There is one stack element for each label interval.

Each live word path is assigned to the stack element for the label interval that corresponds to the boundary label of that path. A stack element can hold zero, one, or more word path entries, and the entries are listed in order of likelihood value.

The steps performed by the stack decoder 1002 of FIG. 1 are described next.

Forming the likelihood envelope and determining which word paths are "good" are interrelated, as shown by the steps in the sample flowchart of FIG. 22.

In the flowchart of FIG. 22, a null path is first entered into the first stack (0) in step 5050. A stack (complete) element is provided that contains any complete paths determined previously (step 5052). Each complete path in the stack (complete) element has an associated likelihood vector. The likelihood vector of the complete path with the highest likelihood at its boundary label initially defines the maximum likelihood envelope. If there is no complete path in the stack (complete) element, the maximum likelihood envelope is initialized to −∞ at each label interval. Likewise, if complete paths are not specified, the maximum likelihood envelope can be initialized to −∞.

Initialization of the envelope is shown by steps 5054 and 5056.

After the maximum likelihood envelope has been initialized, it is lowered by a predetermined amount Δ. This forms a Δ-good region above the lowered likelihood values and a Δ-bad region below them.

The larger the value of Δ, the greater the number of word paths considered extendable. When log10 is used in computing L_t, a value of 2.0 for Δ gives satisfactory results. The value of Δ is preferably, but not necessarily, uniform along the length of the label intervals.

If a word path has a likelihood at its boundary label that lies in the Δ-good region, the word path is marked "good". Otherwise, the word path is marked "bad".
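The good/bad test can be sketched as a comparison against the lowered envelope at the boundary label. This is an illustrative sketch only: the function name is hypothetical, the boundary label is taken to be the last entry of the likelihood vector, and treating a value exactly on the threshold as "good" is an assumption the text does not settle.

```python
from typing import Sequence


def mark_path(
    likelihoods: Sequence[float], envelope: Sequence[float], delta: float = 2.0
) -> str:
    """Return "good" or "bad" for one word path.

    likelihoods: the path's likelihood vector, ending at its boundary label.
    envelope: the current maximum likelihood envelope, one value per interval.
    """
    boundary = len(likelihoods) - 1          # index of the boundary label
    threshold = envelope[boundary] - delta   # lowered envelope at that label
    return "good" if likelihoods[boundary] >= threshold else "bad"
```

With an envelope still at negative infinity, the threshold is also negative infinity, so every path with a finite likelihood is marked "good".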

As shown in FIG. 22, the loop that updates the likelihood envelope and marks word paths as "good" (extendable) or "bad" begins by finding the longest unmarked word path (step 5058). If more than one unmarked word path is in the stack corresponding to that longest word path length, the word path with the highest likelihood at its boundary label is selected. If a word path is found, it is marked "good" when the likelihood at its boundary label lies in the Δ-good region and "bad" otherwise (step 5060).

If the word path is marked "bad", another unmarked live path is found and marked (step 5062). If the word path is marked "good", the likelihood envelope is updated to include the likelihood values of the path marked "good". That is, for each label interval, the updated likelihood value is the greater of (a) the current likelihood value in the likelihood envelope and (b) the likelihood value associated with the word path marked "good". This is shown by steps 5064 and 5066. After the envelope has been updated, the longest, best unmarked live word path is found again (step 5058).

The loop is then repeated until no unmarked word paths remain. At that point, the shortest word path marked "good" is selected. If more than one "good" path has the shortest length, the one with the highest likelihood at its boundary label is selected (step 5070). The selected shortest path is then extended. That is, at least one likely follower word is determined as described above, preferably by performing the fast match, language model, detailed match, and language model procedures. Each likely follower word gives rise to an extended word path. Specifically, an extended word path is formed by appending a likely follower word to the end of the selected shortest word path.

After extended word paths have been formed from the selected shortest word path, the selected word path is removed from the stack in which it was an entry, and each extended word path is entered into the appropriate stack. In particular, an extended word path becomes an entry in the stack corresponding to the boundary label of that extended word path (step 5072).
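The bookkeeping of step 5072 can be sketched as follows. This is a simplified illustration under stated assumptions: stacks are modelled as a dictionary keyed by boundary label index, each entry is a hypothetical `(likelihood, words)` tuple, and entries are kept sorted by descending likelihood at the boundary label.

```python
from typing import Dict, List, Sequence, Tuple

Entry = Tuple[float, Tuple[str, ...]]  # (likelihood at boundary label, words)


def enter_path(stacks: Dict[int, List[Entry]], boundary: int, entry: Entry) -> None:
    """Enter a word path into the stack for its boundary label."""
    entries = stacks.setdefault(boundary, [])
    entries.append(entry)
    # Entries are listed in order of likelihood value, highest first.
    entries.sort(key=lambda e: e[0], reverse=True)


def replace_with_extensions(
    stacks: Dict[int, List[Entry]],
    boundary: int,
    selected: Entry,
    extensions: Sequence[Tuple[int, Entry]],
) -> None:
    """Remove the extended path and enter each of its extensions."""
    stacks[boundary].remove(selected)
    for new_boundary, entry in extensions:
        enter_path(stacks, new_boundary, entry)
```

Removing the selected path has the same effect as marking it "dead": it can no longer be chosen for extension.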

With regard to step 5072, the action of extending the selected path is described with reference to FIG. 21.

After a path is found in step 5070, the following procedure is performed, by which one or more word paths are extended on the basis of an appropriate approximate match.

In step 6000 (FIG. 21), the acoustic processor 1002 (FIG. 1) generates a string of labels as described above. The string of labels is provided as input to enable step 6002 to be performed. In step 6002, either the basic approximate match or one of the enhanced approximate match procedures is performed to obtain an ordered list of candidate words according to the teachings set out above. The language model (as described above) is then applied in step 6004. The words that remain after the language model has been applied are entered, together with the generated labels, into the detailed match processor, which performs step 6006. The detailed match yields a list of remaining candidate words, which is preferably submitted to the language model in step 6008.

The likely words determined in this way by the approximate match, the detailed match, and the language model are used to extend the path found in step 5070 of FIG. 22. Each of the likely words determined in step 6008 (FIG. 21) is separately appended to the found word path, so that a plurality of extended word paths can be formed.

Referring again to FIG. 22, after the extended paths have been formed and the stacks re-formed, the process is repeated by returning to step 5052.

Each iteration thus consists of selecting the shortest best "good" word path and extending it. A word path marked "bad" in one iteration may become "good" in a later iteration. The characterization of a live word path as "good" or "bad" is therefore made independently in each iteration. In practice, the likelihood envelope does not change greatly from one iteration to the next, so the computation that decides whether a word path is "good" or "bad" is carried out efficiently. Moreover, normalization is not required.
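One iteration of the marking loop and the selection of the path to extend might look like the sketch below. It is an illustration rather than the patented procedure: the class and function names are hypothetical, path length is assumed to be measured by boundary label index, every path is assumed to cover at least one label (the null path is not modelled), and the fast match, detailed match, and language model that produce follower words are left out.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class WordPath:
    words: Tuple[str, ...]
    likelihoods: List[float]       # one value per label, up to the boundary label
    live: bool = True              # False once the path has been extended
    mark: Optional[str] = None     # "good", "bad", or None (unmarked)

    @property
    def boundary(self) -> int:
        return len(self.likelihoods) - 1

    @property
    def boundary_likelihood(self) -> float:
        return self.likelihoods[-1]


def select_path_to_extend(
    paths: Sequence[WordPath], envelope: Sequence[float], delta: float = 2.0
) -> Tuple[Optional[WordPath], List[float]]:
    """Mark every live path, update the envelope, pick the shortest good path."""
    env = list(envelope)
    for p in paths:
        if p.live:
            p.mark = None  # marking is redone independently on each iteration
    while True:
        unmarked = [p for p in paths if p.live and p.mark is None]
        if not unmarked:
            break
        # Longest unmarked path first; ties go to the highest boundary likelihood.
        path = max(unmarked, key=lambda p: (p.boundary, p.boundary_likelihood))
        if path.boundary_likelihood >= env[path.boundary] - delta:
            path.mark = "good"
            for t, value in enumerate(path.likelihoods):
                env[t] = max(env[t], value)  # raise the envelope where needed
        else:
            path.mark = "bad"
    good = [p for p in paths if p.live and p.mark == "good"]
    if not good:
        return None, env  # nothing left to extend: decoding is finished
    # Shortest good path; ties go to the highest boundary likelihood.
    chosen = min(good, key=lambda p: (p.boundary, -p.boundary_likelihood))
    return chosen, env
```

Processing the longest paths first lets them shape the envelope before shorter paths are judged against it, which is why a short path with a poor likelihood can be marked "bad" even though it was entered early.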

When complete sentences are identified, step 5074 is preferably included. That is, when no unmarked live word paths remain and there are no "good" word paths to extend, decoding is finished. The complete word path with the highest likelihood at its boundary label is taken to be the most likely word sequence for the input label string.

In the case of continuous speech in which sentence endings are not identified, path extension proceeds either continually or for a predetermined number of words, as preferred by the system user.

E-1K. Constructing Phonetic Base Forms

One type of Markov model phone machine that can be used in forming base forms is based on phonetics. That is, each phone machine corresponds to a given phonetic sound, such as those included in the International Phonetic Alphabet.

For a given word, there is a sequence of phonetic sounds, each with a corresponding phone machine. Each phone machine includes a number of states and a number of transitions between those states. Some of the transitions can produce a phoneme output, and some (called null transitions) cannot. As noted above, the statistics relating to each phone machine include (a) the probability of a given transition occurring and (b) the likelihood of a particular phoneme being produced at a given transition. Preferably, at each non-null transition, some probability is associated with each phoneme. The phoneme alphabet shown in Table 1 at the end of the detailed description preferably contains 200 phonemes. The phone machine used in forming phonetic base forms is shown in FIG. 3. A sequence of such phone machines is provided for each word. The statistics, or probabilities, are entered into the phone machines during a training period in which known words are uttered. The transition probabilities and phoneme probabilities in the various phonetic phone machines are determined during training by noting the phoneme string or strings generated when a known phonetic sound is uttered at least once, and by applying the well-known forward-backward algorithm.

An example of the statistics for one phone, identified as phone DH, is shown in Table 2 at the end of the detailed description. As an approximation, the label output probability distributions for transitions tr1, tr2, and tr13 of the phone machine of FIG. 3 are represented by a single distribution. Likewise, transitions tr3, tr4, tr5, and tr9 are represented by a single distribution, and transitions tr6, tr7, and tr10 are represented by a single distribution.

This is shown in Table 2 by the assignment of arcs (i.e. transitions) to the respective columns 4, 5, or 6.

Table 2 shows the probability of each transition and the probability of a label (i.e. phoneme) being generated at the beginning, middle, or end of the phone DH. For the DH phone, for example, the probability of the transition from state S1 to state S2 is computed as 0.07243. The probability of the transition from state S1 to state S4 is 0.92757. (Since these are the only two possible transitions from the initial state, their probabilities sum to 1.)

As to label output probabilities, the DH phone has a probability of 0.091 of producing the phoneme AE13 (see Table 1) at the end portion of the phone, i.e. column 6 of Table 2. Table 2 also contains a count associated with each node (state). The node count indicates the number of times during training that the phone was in the corresponding state. Statistics such as those in Table 2 are found for each phone machine.

Arranging phonetic phone machines into a word base form is typically performed by a phonetician and is normally not done automatically.

E-1L. Constructing Phonemic Base Forms

FIG. 23 shows an example of a phonemic phone.

The phonemic phone has two states and three transitions. In the figure, a null transition is shown with a dashed line and represents a path from state 1 to state 2 on which no label is produced. A self-loop transition at state 1 allows any number of labels to be produced there. A non-null transition between state 1 and state 2 is permitted to produce a label. The probabilities associated with each transition, and with each label at a transition, are determined during training in a manner similar to that described for the phonetic base forms.
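To make the two-state structure concrete, the sketch below computes the probability that such a phone produces exactly n labels. It is a hedged illustration: the function name and the example probabilities are hypothetical, and it assumes that the three transition probabilities leaving state 1 (self-loop, non-null forward, null) sum to 1.

```python
def label_count_probability(
    n: int, p_self: float, p_forward: float, p_null: float
) -> float:
    """Probability that a two-state, three-transition phone emits n labels.

    The phone stays in state 1 through some number of self-loops (one label
    each) and then leaves either by the non-null forward transition (one
    more label) or by the null transition (no label).
    """
    if n < 0:
        return 0.0
    # n self-loops followed by the null transition.
    via_null = (p_self ** n) * p_null
    # n - 1 self-loops followed by the label-producing forward transition.
    via_forward = (p_self ** (n - 1)) * p_forward if n >= 1 else 0.0
    return via_null + via_forward
```

For instance, with assumed values `p_self=0.25`, `p_forward=0.5`, and `p_null=0.25`, the phone emits exactly one label with probability 0.5625.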

Phonemic word base forms are constructed by concatenating phonemic phones. One approach is described in U.S. patent application Ser. No. 697,174, filed February 1, 1985 by the present applicant. Preferably, the phonemic word base forms are grown from multiple utterances of the corresponding word.

This is described in U.S. patent application Ser. No. 738,933, filed May 29, 1985 by the present applicant. Briefly, one method of growing base forms from multiple utterances consists of the following steps.

(a) Transform multiple utterances of the word segment into respective phoneme strings.

(b) Define a set of phonemic Markov model phone machines.

(c) Determine the best single phone machine P1 for producing the multiple phoneme strings.

(d) Determine the best two-phone base form, of the form P1P2 or P2P1, for producing the multiple phoneme strings.

(e) Align the best two-phone base form against each phoneme string.

(f) Split each phoneme string into a left portion and a right portion, the left portion corresponding to the first phone machine of the two-phone base form and the right portion corresponding to the second phone machine of the two-phone base form.

(g) Identify each left portion as a left substring and each right portion as a right substring.

(h) Process the set of left substrings in the same manner as the set of phoneme strings corresponding to the multiple utterances. This processing includes the step of inhibiting further splitting of a substring when the single-phone base form has a higher probability than the best two-phone base form.

(j) Process the set of right substrings in the same manner as the set of phoneme strings corresponding to the multiple utterances. This processing includes the step of inhibiting further splitting of a substring when the single-phone base form has a higher probability than the best two-phone base form.

(k) Concatenate the unsplit single phones in an order corresponding to the order of the phoneme substrings to which they correspond.

The number of these model elements is typically about equal to the number of phonemes obtained for an utterance of the word.

The base form models are then trained (that is, filled with statistics) by speaking known utterances into an acoustic processor, which generates a string of labels in response.

Based on the known utterances and the generated labels, the statistics for the word models are derived according to the well-known forward-backward algorithm.
The word model statistics are obtained according to the kward) algorithm.

FIG. 24 shows a lattice corresponding to phonemic phones. This lattice is considerably simpler than the lattice of FIG. 11, which relates to the phonetic detailed match.

E-2. Selecting Likely Words From a Vocabulary by Polling

Referring to FIG. 25, a flowchart 8000 of one embodiment of the invention is shown. In this flowchart, a vocabulary of words is first defined in step 8002. Depending on the user, the words may correspond to a standard office correspondence vocabulary or to a technical vocabulary. The vocabulary has contained on the order of 5000 words or more, although the number of words can vary.
In C, the word weight of the word is first determined in step 8002. These words correspond to either a business communication word weight or a technical word weight, depending on the user. At this time, there are 511DO or more words in the word, but the number of words can be changed.

Each word is represented by a sequence of Markov model phone machines in accordance with the teachings of section E-1K or E-1L above. That is, each word can be represented either as a constructed base form of sequential phonetic phone machines or as a constructed base form of sequential phonemic phone machines.

Next, in step 8006, a "vote" is computed for each label of each word.

The vote-computing step 8006 is described with reference to FIGS. 25, 26, 27, 28, and 29.

FIG. 26 is a graph showing the distribution of acoustic labels for a given phone machine P. The counts shown are obtained from the statistics generated during training. Recall that during training, known utterances corresponding to known phone sequences are spoken, and a string of labels is generated in response. The number of times each label is generated when a known phone is uttered is thus obtained during training. A distribution such as that of FIG. 26 is generated for each phone.

The training data yields not only the information contained in FIG. 26 but also the expected number of labels for a given phone. That is, when a known utterance corresponding to a given phone is spoken, the number of labels generated is recorded. The number of labels for the given phone is recorded each time the known utterance is spoken. From this information, the most likely, or expected, number of labels for the given phone is determined. FIG. 27 is a graph showing the expected number of labels for each phone.

If the phones are phonemic phones, the expected number of labels per phone should typically average about one. For phonetic phones, the number of labels can vary over a very wide range.

Extracting information from the training data is accomplished using the forward-backward algorithm, which is described in detail in the above-mentioned paper entitled "Continuous Speech Recognition by Statistical Methods". Briefly, the forward-backward algorithm determines the probability of each transition between a state i and a state (i+1) in a phone by (a) looking forward from the initial state of the Markov model to state i and determining the statistics of reaching state i along a "forward path", and (b) looking backward from the final state of the Markov model to state (i+1) and determining the statistics of reaching the final state from state (i+1) along a "backward path". The probability of the transition from state i to state (i+1), together with the label output on that transition, is combined with the other statistics in determining the probability that a particular transition occurs given a certain label string.
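The forward and backward passes described above can be illustrated with a minimal sketch. It assumes a toy phone model in which every transition emits exactly one label (null transitions are left out), state 0 is the initial state and the highest-numbered state is the final state. The function name `transition_posteriors` and the arc tuple layout are hypothetical; they are not taken from the cited paper or from this specification.

```python
def transition_posteriors(arcs, n_states, labels):
    """Posterior probability of each arc at each time, given a label string.

    arcs: list of (src, dst, transition_prob, {label: output_prob}).
    Assumption: state 0 is initial, state n_states - 1 is final,
    and every arc emits exactly one label.
    """
    T = len(labels)

    # Forward pass: alpha[t][s] = probability of emitting the first t labels
    # and being in state s.
    alpha = [[0.0] * n_states for _ in range(T + 1)]
    alpha[0][0] = 1.0
    for t in range(1, T + 1):
        y = labels[t - 1]
        for src, dst, p, out in arcs:
            alpha[t][dst] += alpha[t - 1][src] * p * out.get(y, 0.0)

    # Backward pass: beta[t][s] = probability of emitting the remaining labels
    # from state s and ending in the final state.
    beta = [[0.0] * n_states for _ in range(T + 1)]
    beta[T][n_states - 1] = 1.0
    for t in range(T, 0, -1):
        y = labels[t - 1]
        for src, dst, p, out in arcs:
            beta[t - 1][src] += p * out.get(y, 0.0) * beta[t][dst]

    total = alpha[T][n_states - 1]
    if total == 0.0:
        raise ValueError("label string cannot be produced by this model")

    # Combine forward statistics, the arc itself, and backward statistics.
    post = {}
    for t in range(1, T + 1):
        y = labels[t - 1]
        for k, (src, dst, p, out) in enumerate(arcs):
            post[(t, k)] = (alpha[t - 1][src] * p * out.get(y, 0.0)
                            * beta[t][dst] / total)
    return post
```

Summing `post[(t, k)]` over all times `t` gives the expected number of times arc `k` is taken, which is the kind of count that training accumulates.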

As shown in FIG. 28 for word 1 and word 2, each word is known to be a predetermined sequence of phones. Given the phone sequence for each word and the information described in connection with FIGS. 25 and 26, it is possible to determine how many times a given label is most likely to occur for a particular word W. For a word such as word 1, the expected number of occurrences of label 1 is the count of label 1 for phone P1, plus the count of label 1 for phone P3, plus the count of label 1 for phone P6, and so on. Similarly, for word 1, the expected number of occurrences of label 2 is the count of label 2 for phone P1, plus the count of label 2 for phone P3, and so on. The expected count of each label for word 1 is computed by performing the above steps for each of the 200 labels. The count for a particular label may represent the number of occurrences of that label divided by the total number of labels generated during training, or it may represent the number of occurrences of the label during training itself.
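The summation over phones described above can be sketched as follows. The function name `expected_word_counts` and the dictionary layout are assumptions made for illustration, and the numbers in the usage example are invented.

```python
from collections import Counter


def expected_word_counts(phone_sequence, phone_label_counts):
    """Sum the per-phone label counts over the phones that make up a word.

    phone_sequence: list of phone names, e.g. ["P1", "P3", "P6"].
    phone_label_counts: {phone: {label: count}} obtained during training.
    """
    totals = Counter()
    for phone in phone_sequence:
        # Counter.update adds counts label by label.
        totals.update(phone_label_counts[phone])
    return dict(totals)


# Usage with invented counts: word 1 consists of phones P1, P3 and P6.
counts = expected_word_counts(
    ["P1", "P3", "P6"],
    {"P1": {1: 0.6, 2: 0.3}, "P3": {1: 0.1, 2: 0.5}, "P6": {1: 0.2}},
)
```

A phone that occurs twice in a word contributes its counts twice, which matches the idea that the word is expected to produce those labels on each occurrence.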

FIG. 29 shows the expected count for each label in a particular word (for example, word 1).

For a given word, a "vote" for each label of that word is computed from the expected label counts such as those shown in FIG. 29. The vote of a label L' for a given word W represents the likelihood that word W produces label L'. The vote corresponds to the logarithm of the probability that word W produces label L'. Preferably, the vote is given by the following expression:

$$\text{vote} = \log_{10}\bigl(\Pr(L' \mid W)\bigr)$$

These votes are stored in a table as shown in FIG. 30. For each of the words 1 through W, each label has a vote denoted by V with a double subscript. The first subscript corresponds to the label and the second to the word. Thus, for example, V12 is the vote of label 1 for word 2.
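A minimal sketch of building such a vote table is given below. The specification does not say how the probability is derived from the expected counts; normalizing each word's counts to sum to one, and flooring very small probabilities so the logarithm stays finite, are assumptions made here. The name `vote_table` is hypothetical.

```python
import math


def vote_table(word_label_counts, floor=1e-6):
    """Return {(label, word): log10 of the probability that word yields label}.

    word_label_counts: {word: {label: expected count}}.
    Assumption: probabilities are the counts normalized per word, with a
    small floor to avoid taking the logarithm of zero.
    """
    table = {}
    for word, counts in word_label_counts.items():
        total = sum(counts.values())
        for label, count in counts.items():
            prob = max(count / total, floor)
            # Key order follows the text: first the label, then the word.
            table[(label, word)] = math.log10(prob)
    return table


# Usage with invented counts for a single word.
votes = vote_table({"word1": {1: 9.0, 2: 1.0}})
```

Keying the table by (label, word) mirrors the double subscript of V in FIG. 30.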

Referring again to FIG. 25, the process of selecting likely candidate words from the vocabulary by polling is shown to include a step 8008 of generating labels in response to an unknown speech input. This step is performed by the acoustic processor 1004 (FIG. 1).

For a subject word, the generated labels are looked up in the table of FIG. 30, and the vote of each generated label for that word is retrieved. The votes are then accumulated to give a total vote for the word (step 8010). For example, if labels 1, 3 and 5 are generated, the votes V11, V31 and V51 are retrieved and combined. If the votes are logarithms of probabilities, they are summed to give the total vote for word 1. A similar procedure is followed for every word in the vocabulary, so that labels 1, 3 and 5 "vote" for each word.
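The accumulation of step 8010 can be sketched as a table lookup followed by a sum. The function name `total_votes` and the table layout (keys of the form (label, word)) are illustrative assumptions, and the vote values in the usage example are invented.

```python
def total_votes(generated_labels, votes, vocabulary):
    """Accumulate, for every word, the votes of the labels that were generated.

    generated_labels: label string from the acoustic processor.
    votes: {(label, word): log-probability vote}.
    A label that occurs several times in the string votes once per occurrence.
    """
    return {
        word: sum(votes[(label, word)] for label in generated_labels)
        for word in vocabulary
    }


# Usage: labels 1, 3 and 5 vote for words 1 and 2 (invented vote values).
table = {
    (1, 1): -1.0, (3, 1): -2.0, (5, 1): -0.5,
    (1, 2): -3.0, (3, 2): -0.25, (5, 2): -1.0,
}
totals = total_votes([1, 3, 5], table, [1, 2])
```

Because the votes are logarithms, summing them corresponds to multiplying the underlying probabilities.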

According to one embodiment of the invention, the accumulated vote for each word serves as the likelihood score for that word. The n words having the highest accumulated votes (n being a predetermined integer) are defined as candidate words, which are later processed in the detailed match and the language model described above.
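Selecting the n highest-scoring words can be sketched with the standard library. The function name `top_candidates` and the scores in the usage example are illustrative only.

```python
import heapq


def top_candidates(word_scores, n):
    """Return the n words with the highest likelihood scores, best first.

    word_scores: {word: accumulated vote, used here as the likelihood score}.
    """
    return heapq.nlargest(n, word_scores, key=word_scores.get)


# Usage with invented scores.
shortlist = top_candidates({"the": -3.5, "to": -4.25, "be": -2.0}, 2)
```

`heapq.nlargest` avoids sorting the whole vocabulary when n is small relative to the vocabulary size.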

In another embodiment, "penalties" are computed in addition to votes. That is, a penalty is determined and assigned for each word (step 8012). The penalty represents the likelihood that a subject label is not produced by a given word.

There are various methods for determining the penalty.

One approach for determining the penalty for a word represented by a phonemic baseform involves assuming that each phonemic phone produces only one label. For a given label and a particular phonemic phone, the penalty for the given label corresponds to the logarithm of the probability that some other label is produced by that phone. The penalty of label 1 for phone P2 thus corresponds to the logarithm of the probability that any one of labels 2 through 200 is the one label generated.
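Under the one-label-per-phone assumption, the per-phone penalty can be sketched as the logarithm of one minus the probability of the given label. The function name `phone_penalty`, the base-10 logarithm (chosen to match the votes) and the probability floor are assumptions made for illustration.

```python
import math


def phone_penalty(label, label_probs, floor=1e-6):
    """Log10 probability that a phone produces some label other than `label`.

    label_probs: {label: probability that the phone produces that label},
    assumed to sum to one over the label alphabet.
    """
    p_other = 1.0 - label_probs.get(label, 0.0)
    # The floor keeps the logarithm finite when the phone always
    # produces the given label.
    return math.log10(max(p_other, floor))


# Usage with invented probabilities for one phone.
penalty = phone_penalty(1, {1: 0.9, 2: 0.1})
```

A label the phone never produces has a penalty of zero, since the phone is then certain to produce some other label.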

The assumption of one label output per phonemic phone, although not strictly correct, has proved satisfactory for computing penalties. Once the penalty of a label has been determined for each phone, the penalty for a word made up of a sequence of known phones is readily determined.
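The text does not spell out how the per-phone penalties are combined into a word penalty. One plausible reading, sketched below, treats the phones as independent, so that the probability that no phone of the word produces the label is a product and its logarithm is a sum. This combination rule, the name `word_penalty` and the probability floor are all assumptions.

```python
import math


def word_penalty(label, phone_sequence, phone_label_probs, floor=1e-6):
    """Log10 probability that none of the word's phones produces `label`.

    Assumptions: each phone produces exactly one label, and the phones
    are independent of one another.
    phone_label_probs: {phone: {label: probability}}.
    """
    total = 0.0
    for phone in phone_sequence:
        p_other = 1.0 - phone_label_probs[phone].get(label, 0.0)
        total += math.log10(max(p_other, floor))
    return total


# Usage with invented probabilities for a two-phone word.
pen = word_penalty(1, ["P1", "P2"], {"P1": {1: 0.9, 2: 0.1}, "P2": {2: 1.0}})
```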

The penalty of each label for each word is shown in FIG. 31. Each penalty is identified as PEN followed by two subscripts, the first representing the label and the second representing the word.

Referring again to FIG. 25, the labels generated in step 8008 are examined to see which labels of the label alphabet have not been generated. The penalty for each label not generated is then determined. To obtain the total penalty for a given word, the penalty of each label not generated is retrieved for that word, and all such penalties are accumulated (step 8014). If each penalty corresponds to the logarithm of a "zero" probability, the penalties for the given word are summed over all such labels, as were the votes. This procedure is repeated for each word in the vocabulary, so that each word has a total vote and a total penalty given the string of generated labels.
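Step 8014 can be sketched as a set difference followed by a sum. The function name `total_penalty` and the table layout (keys of the form (label, word)) are illustrative assumptions, and the penalty values in the usage example are invented.

```python
def total_penalty(generated_labels, alphabet, penalties, word):
    """Sum the penalties of every alphabet label that was NOT generated.

    penalties: {(label, word): log-probability that word does not yield label}.
    """
    missing = set(alphabet) - set(generated_labels)
    return sum(penalties[(label, word)] for label in missing)


# Usage: a four-label alphabet in which labels 2 and 4 were not generated.
pens = {(1, "w"): -9.0, (2, "w"): -0.5, (3, "w"): -9.0, (4, "w"): -0.25}
result = total_penalty([1, 3, 3], [1, 2, 3, 4], pens, "w")
```

Only the labels absent from the generated string contribute, so a label that was generated many times and a label generated once are treated alike here.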

Once a total vote and a total penalty have been obtained for each word in the vocabulary, a likelihood score is determined by combining the two values (step 8016). If desired, the total vote may be weighted more heavily than the total penalty, or vice versa.
If desired, the overall vote value may be weighted more heavily than the overall penalty, and vice versa.

Moreover, the likelihood score of each word is preferably scaled based on the number of labels that vote (step 8018). Specifically, after the total vote and the total penalty (both of which represent sums of logarithms of probabilities) are added together, the final sum is divided by the number of acoustic labels generated and considered in computing the votes and penalties.

The result is a scaled likelihood score.
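Steps 8016 and 8018 can be sketched together. The weighting parameters are an assumption introduced to illustrate the optional weighting mentioned in the text; the specification does not give their form or values. The name `scaled_score` is hypothetical.

```python
def scaled_score(total_vote, total_penalty, n_labels,
                 vote_weight=1.0, penalty_weight=1.0):
    """Combine the total vote and total penalty, then scale by label count.

    n_labels: number of generated labels considered in the polling.
    The weights are illustrative; equal weights give a plain sum.
    """
    if n_labels <= 0:
        raise ValueError("at least one label must be considered")
    combined = vote_weight * total_vote + penalty_weight * total_penalty
    return combined / n_labels


# Usage with invented totals for a four-label string.
score = scaled_score(-6.0, -2.0, 4)
```

Dividing by the number of labels makes scores comparable between label strings of different lengths.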

A further aspect of the invention relates to determining which labels in a string are to be considered in the voting and penalty (i.e., polling) computations. Where the end of a word is identified and the labels corresponding to it are known, preferably all labels generated between the known start time and the known end time are considered. However, when the end time is found not to be known (step 8020), the invention provides the following method. A reference end time is defined, and a likelihood score is evaluated repeatedly at successive time intervals after the reference end time (step 8022).

For example, after 500 ms, the (scaled) likelihood score of each word is evaluated at 50 ms intervals, and this continues until 1000 ms from the start time of the word utterance. In this example, each word has ten (scaled) likelihood scores.

A procedure is then applied to decide which of the ten likelihood scores is assigned to a given word. Specifically, from the series of likelihood scores obtained for a given word, the score that is highest relative to the likelihood scores of the other words obtained at the same time interval is selected (step 8024).

The highest likelihood score at a given time interval is subtracted from all the other likelihood scores at that same time interval.

The word having the highest likelihood score at a given time interval is thereby set to zero, and the likelihood scores of the other, less likely words take negative values. The least negative (closest to zero) likelihood score for a given word is then assigned to that word as its highest relative likelihood score.
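The relative-score selection of step 8024 can be sketched as follows. The function name `best_relative_scores` and the input layout (one dictionary of scaled scores per time interval) are assumptions, and the scores in the usage example are invented.

```python
def best_relative_scores(scores_by_interval):
    """For each word, keep its best score relative to the interval's top score.

    scores_by_interval: list of {word: scaled likelihood score}, one
    dictionary per evaluation time.
    """
    best = {}
    for scores in scores_by_interval:
        top = max(scores.values())
        for word, score in scores.items():
            relative = score - top  # 0 for the best word, negative otherwise
            if word not in best or relative > best[word]:
                best[word] = relative
    return best


# Usage: two evaluation times, three words (invented scores).
result = best_relative_scores([
    {"a": -2.0, "b": -3.0, "c": -6.0},
    {"a": -5.0, "b": -4.5, "c": -5.5},
])
```

Comparing each word against the best word at the same time removes the drift in absolute scores that comes from evaluating over label strings of different lengths.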

Once a likelihood score has been assigned to each word, the n words with the highest assigned likelihood scores are selected as the candidate words obtained by polling (step 8026).

In one embodiment of the invention, the n words obtained by polling are provided as a reduced list of words, and the words on this list are then subjected to the detailed match and language model processing described above. In this embodiment, the reduced list obtained by polling takes the place of the acoustic fast match described earlier. In this regard, it is observed that the acoustic fast match provides a tree-like lattice structure into which word baseforms are entered as sequential phones, and in which words having the same initial phones follow a common branch along the tree structure. For a 2000-word vocabulary, however, the polling method of the invention has been found to be two to three times faster than the fast match with the tree-like lattice structure.

Alternatively, the acoustic fast match and the polling may be used in combination.

That is, from the trained Markov models and the generated string of labels, the approximate fast match is performed in step 8028 in parallel with the polling. One list is provided by the acoustic match and the other by the polling. In a conventional approach, the entries on one list are used to augment the other list. In an approach that seeks to further reduce the number of best candidate words, however, only the words appearing on both lists are retained for further processing. The interaction of the two techniques in step 8030 depends on the accuracy and computational goals of the system. As yet another alternative, the lattice-type acoustic fast match may be applied sequentially to the polling list.
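The two ways of combining the lists in step 8030 can be sketched as a union and an intersection. The function name `combine_candidates` and the `mode` parameter are hypothetical.

```python
def combine_candidates(polling_list, fast_match_list, mode="union"):
    """Combine the polling list with the fast match list.

    mode="union": augment the polling list with fast match entries.
    mode="intersection": keep only words that appear on both lists.
    The order of the polling list is preserved in both cases.
    """
    if mode == "union":
        seen = set(polling_list)
        extra = [w for w in fast_match_list if w not in seen]
        return list(polling_list) + extra
    if mode == "intersection":
        keep = set(fast_match_list)
        return [w for w in polling_list if w in keep]
    raise ValueError("mode must be 'union' or 'intersection'")


# Usage with invented candidate lists.
both = combine_candidates(["be", "the", "to"], ["to", "two", "be"], "intersection")
```

The union favors accuracy by passing more words to the detailed match, while the intersection favors speed by passing fewer.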

Apparatus for performing the polling is shown in FIG. 32. In the figure, element 8102 stores the word models trained as described above.

From the statistics applied to the word models, a vote generator 8104 computes the vote of each label for each word and stores the votes in a vote table storage 8106.

Similarly, a penalty generator 8108 computes the penalty of each label for each word in the vocabulary and enters the values into a penalty table storage 8110.

A word likelihood score evaluator 8112 receives the labels generated by an acoustic processor 8114 in response to an unknown speech input. For a subject word selected by a word select element 8116, the word likelihood score evaluator 8112 combines the votes of each generated label for the selected word with the penalties of each label not generated. The evaluator 8112 also includes means for scaling the likelihood score as described above. The likelihood score evaluator may also, but need not, include means for repeating the score evaluation at successive time intervals after a reference time.

The likelihood score evaluator 8112 provides word scores to a word lister 8120, which orders the words according to their assigned likelihood scores.

In an embodiment in which the word list obtained by polling is combined with the list obtained by the approximate fast match, a list comparer 8122 is provided.

The list comparer 8122 receives as input the polling list from the word lister and the list from the acoustic fast match (described in several embodiments above).

Several features are included to reduce the storage and computation required. First, the votes and penalties can be formatted as integers ranging from 0 to 255. Second, the actual penalties can be replaced by approximate penalties computed from the corresponding votes by the expression penalty = a × (vote) + b, where a and b are constants determined by least-squares regression. Third, the labels can be grouped into speech classes, each class containing at least one label. The assignment of labels to classes is determined by hierarchically clustering the labels so as to maximize the information between the resulting speech classes and the words.
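The first two storage-saving features can be sketched as follows. The specification does not state how log values are mapped onto 0 to 255; the linear mapping between a chosen lower and upper bound is an assumption, as are the names `quantize` and `fit_penalty_line`. The fit is an ordinary least-squares line.

```python
def quantize(values, lo, hi):
    """Map values in [lo, hi] linearly onto integers 0..255 (assumed mapping)."""
    scale = 255.0 / (hi - lo)
    return [min(255, max(0, round((v - lo) * scale))) for v in values]


def fit_penalty_line(votes, penalties):
    """Least-squares fit of penalty = a * vote + b; returns (a, b)."""
    n = len(votes)
    mean_v = sum(votes) / n
    mean_p = sum(penalties) / n
    sxx = sum((v - mean_v) ** 2 for v in votes)
    if sxx == 0.0:
        raise ValueError("votes must not all be equal")
    sxy = sum((v - mean_v) * (p - mean_p) for v, p in zip(votes, penalties))
    a = sxy / sxx
    b = mean_p - a * mean_v
    return a, b


# Usage with invented vote/penalty pairs that lie exactly on a line.
a, b = fit_penalty_line([-3.0, -2.0, -1.0], [-0.1, -0.3, -0.5])
codes = quantize([-4.0, -1.0, 0.0], lo=-4.0, hi=0.0)
```

With the fitted constants, only the vote table needs to be stored, since each penalty can be recomputed from its vote.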

Note that, according to the invention, periods of silence are detected (by known methods) and ignored.

The invention has been implemented in PL/I on an IBM MVS system, but it may also be implemented on other systems and in other programming languages.

Furthermore, the invention may be modified in various ways within the scope of its technical concept.

For example, the end time of a word may be determined by summing the expected number of labels for each phone in the word's baseform. In addition, votes and penalties may be computed only for selected labels of the generated label string, for example the odd-numbered labels or the first m labels. It is preferable, however, to consider the votes and penalties of every label between the beginning and the end of a word.
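Both variations can be sketched briefly. The function names `expected_word_length` and `select_labels`, the `mode` values, and the numbers in the usage example are assumptions made for illustration. The first function returns a label count; converting that count to a time would require the label rate of the acoustic processor, which is not assumed here.

```python
def expected_word_length(phone_sequence, expected_labels_per_phone):
    """Expected number of labels for a word: sum over the phones of its baseform."""
    return sum(expected_labels_per_phone[p] for p in phone_sequence)


def select_labels(label_string, mode="all", m=None):
    """Choose which labels of the generated string take part in the polling.

    mode="all": every label (the preferred case in the text).
    mode="odd": the 1st, 3rd, 5th, ... labels.
    mode="first_m": only the first m labels.
    """
    if mode == "all":
        return list(label_string)
    if mode == "odd":
        return list(label_string[0::2])
    if mode == "first_m":
        if m is None:
            raise ValueError("m is required for mode 'first_m'")
        return list(label_string[:m])
    raise ValueError("unknown mode")


# Usage with invented values.
length = expected_word_length(["P1", "P3", "P6"], {"P1": 4.0, "P3": 2.5, "P6": 6.0})
subset = select_labels([7, 3, 9, 3, 5], mode="odd")
```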

Furthermore, the invention contemplates the use of various voting formulas other than summing logarithms of probabilities. The invention applies broadly to polling apparatus and polling methods for obtaining a short list of candidate words in which each label casts a vote for each word in the vocabulary, the vote typically differing from word to word.

F. Effects of the Invention

As described above, according to the invention, a polling method is used to create the list of words to be subjected to the detailed match in speech recognition, so that such a word list can be obtained in a short processing time.


[Brief explanation of drawings]

FIG. 1 is a general block diagram of the speech recognition system.
FIG. 2 is a block diagram showing FIG. 1 in more detail.
FIG. 3 is a diagram of a phonetic phone machine.
FIG. 4 is a block diagram of the acoustic processor.
FIG. 5 is a cross-sectional view showing the interior of the human ear.
FIG. 6 is a block diagram of part of the acoustic processor.
FIG. 7 is a diagram of equal-loudness curves.
FIG. 8 is a diagram showing the correspondence between sones and phons.
FIG. 9 is a processing flowchart of the acoustic processor.
FIG. 10 is a detailed flowchart of the threshold update in FIG. 9.
FIG. 11 is a diagram showing the detailed match lattice.
FIG. 12 is a block diagram of a fast match phone machine.
FIG. 13 is a diagram showing the fast match computation.
FIG. 14 is a diagram showing the relationship among a phone, a label string, and start and end times.
FIG. 15 is a diagram showing a phone machine of minimum length 0 and its start-time distribution.
FIG. 16 is a diagram showing a phone machine of minimum length 4 and its time chart.
FIG. 17 is a diagram showing a tree structure of phones.
FIG. 18 is a flowchart of the steps performed in training phone machines for performing the acoustic match.
FIG. 19 is a diagram showing the successive steps of stack decoding.
FIG. 20 is a diagram showing likelihood vectors for word paths and a likelihood envelope.
FIG. 21 is a flowchart of the procedure for extending a found path.
FIG. 22 is a flowchart of the operation of the stack decoder.
FIG. 23 is a diagram of a phonemic phone machine.
FIG. 24 is a diagram showing a lattice of phonemic phones.
FIGS. 25.1 and 25.2 are diagrams showing the polling method of the invention.
FIG. 25 is a diagram showing how FIGS. 25.1 and 25.2 are combined.
FIG. 26 is a diagram showing the distribution of label counts.
FIG. 27 is a diagram showing the number of times each phone produces each label during the training period.
FIG. 28 is a diagram showing the phone sequences that make up words.
FIG. 29 is a diagram showing, for each label, the expected count for a given word.
FIG. 30 is a diagram showing the vote of each label for each word.
FIG. 31 is a diagram showing the penalty of each label for each word.
FIG. 32 is a block diagram of the polling apparatus of the invention.

Applicant: International Business Machines Corporation
Agent: Patent Attorney 山本 仁朗 (and one other)

Claims

[Claims]

(1) A speech recognition method for measuring the likelihood of a word from a vocabulary in response to a speech input, the method comprising: (a) generating, in response to the speech input, a string of labels selected from an alphabet of labels, each label representing a type of sound; (b) computing label votes, each representing the likelihood that a label is produced when a given word is uttered; and (c) for a subject word, accumulating the votes of at least a plurality of the labels generated in the string, such that the accumulated vote represents the likelihood of the subject word.

(2) A speech recognition apparatus for selecting likely words from a vocabulary, in which each word is represented by a sequence of at least one probabilistic finite-state phone machine and an acoustic processor generates acoustic labels in response to a speech input, the apparatus comprising: (a) means for forming a first table in which each label of the alphabet provides, for each word of the vocabulary, a vote representing the likelihood that the word produces that label; (b) means for forming a second table in which each label of the alphabet is assigned, for each word, a penalty representing the likelihood that the label is not produced according to the model of that word; and (c) means for determining the likelihood of a particular word for a given string of labels, including means for combining the votes of the labels with the penalties of all labels in the string corresponding to that particular word.