JPS59126598A

JPS59126598A - Voice recognition method and apparatus

Info

Publication number: JPS59126598A
Application number: JP58000550A
Authority: JP
Inventors: ステイ−ブン・ロイド・モシエル
Original assignee: Exxon Corp
Current assignee: Exxon Mobil Corp
Priority date: 1983-01-07
Filing date: 1983-01-07
Publication date: 1984-07-21

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION The present invention relates to a speech recognition method and apparatus and, more particularly, to a method and apparatus for recognizing, in real time, one or more keywords in a continuous audio signal.

Various speech recognition systems have previously been proposed that recognize isolated utterances by comparing a suitably processed, unknown, isolated audio signal with one or more previously prepared known keyword signals.

In this specification, the term "keyword" is used to mean a connected group of phonemes and sounds, and may be, for example, a portion of a syllable, a word, or a phrase.

While many systems have met with limited success, one system in particular has been employed successfully in commercial applications to recognize isolated keywords. This system operates substantially in accordance with the method described in U.S. Pat. No. 4,038,503, granted July 26, 1977, and provides a successful method for recognizing one of a restricted vocabulary of keywords, provided that the boundaries of the unknown audio signal data are either silence or background noise as measured by the recognition system. That system relies on the presumption that the interval during which the unknown audio signal occurs is well defined and contains only a single keyword utterance.

In a continuous audio signal, such as continuous conversational speech, in which the keyword boundaries are not known or marked beforehand, several methods have been devised to segment the incoming audio data, that is, to determine the boundaries of linguistic units such as phonemes, syllables, words, and sentences before the keyword recognition process begins. These prior continuous speech systems, however, have achieved only limited success, in part because a satisfactory segmenting process has not been found. Other substantial problems remain. For example, only limited vocabularies can be consistently recognized with a low false alarm rate; the recognition accuracy is highly sensitive to differences between the voice characteristics of different speakers; and the systems are highly sensitive to distortion in the audio signals being analyzed, such as typically occurs in audio signals transmitted over ordinary telephone communications apparatus.

The continuous speech recognition methods described in U.S. Pat. Nos. 4,227,176, 4,241,329, and 4,227,177 each describe commercially acceptable and effective procedures for successfully recognizing, in real time, keywords in continuous speech. The general methods described in these patents are presently in commercial use and have been shown, both experimentally and in practical field testing, to provide high reliability and a low error rate in a speaker-independent environment. However, even these techniques, which are at the forefront of present technology, and the concepts on which they were developed, have shortcomings in both false alarm rate and speaker-independent performance.

It is therefore a principal object of the present invention to provide a speech recognition method and apparatus having improved effectiveness in recognizing keywords in a continuous, unmarked audio signal. Other objects of the invention are a method and apparatus which are relatively insensitive to phase and amplitude distortion of the unknown audio input signal data, which are relatively insensitive to variations in the articulation rate of the unknown audio input signals, which respond equally well to different speakers and hence to different voice characteristics, which are reliable and have an improved, lower false alarm rate, and which operate in real time.

The invention relates to a speech analysis system for recognizing at least one predetermined keyword in an audio signal. Each keyword is characterized by a pattern template having at least one target pattern, and each target pattern represents at least one short-term power spectrum. Each target pattern has associated with it a maximum dwell time duration and a minimum dwell time duration.

The method of the invention features the step of forming, at a repetitive frame rate, a sequence of frame patterns from and representing the audio input signal. Each frame pattern is associated with a frame time. Thereafter, for each frame pattern, a numerical measure of the similarity of that frame pattern with selected ones of the target patterns is generated. Preferably, a numerical measure representing the similarity of each frame pattern with each target pattern is generated.

At each frame time, and for each keyword, a numerical word score representing the likelihood that the keyword has just ended at the then-present frame time is accumulated using these numerical measures. The accumulation step includes accumulating the numerical measures for each of a continuous sequence of the repetitively formed frame patterns, starting with the numerical measure of the similarity of the present frame pattern and the last target pattern of the keyword. Thereafter, at least a preliminary keyword recognition decision is generated whenever the numerical value thus determined for a keyword exceeds a predetermined recognition level.
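The backward accumulation described above can be illustrated with a small sketch. It is not the patent's implementation: the use of a plain sum, the convention that a larger similarity value means a closer match, the exhaustive search over dwell times between each target pattern's minimum and maximum, and the names `word_score`, `sim`, and `dwell` are all assumptions made for illustration.

```python
from functools import lru_cache


def word_score(sim, end_frame, dwell):
    """Best accumulated score for a keyword assumed to end at end_frame.

    sim[t][p] : similarity of frame t to target pattern p
                (assumed: larger means more similar)
    dwell[p]  : (min_frames, max_frames) allowed for target pattern p
    Returns None when the frames available cannot hold the whole keyword.
    """

    @lru_cache(maxsize=None)
    def best(p, last):
        # Best score for patterns 0..p laid out so that pattern p ends at `last`.
        if p < 0:
            return 0.0  # every target pattern has been placed
        lo, hi = dwell[p]
        result = None
        run = 0.0
        for d in range(1, hi + 1):      # d = frames spent on pattern p
            t = last - d + 1            # earliest frame covered by pattern p
            if t < 0:
                break
            run += sim[t][p]
            if d >= lo:
                rest = best(p - 1, t - 1)
                if rest is not None and (result is None or run + rest > result):
                    result = run + rest
        return result

    # Start with the present frame and the LAST target pattern, then walk back.
    return best(len(dwell) - 1, end_frame)
```

A preliminary recognition decision would then amount to comparing the returned score with a recognition level, for example `score is not None and score > level`, where `level` is a hypothetical threshold. Because a plain sum grows with the number of frames, a practical system would need some form of normalization that this sketch omits.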

In another aspect, the invention relates to a speech recognition apparatus having means for forming, at a first repetitive frame rate, a sequence of frame patterns from and representing the audio signal, each frame pattern being associated with a frame time. Means are further provided for generating, for each frame pattern, a numerical measure of the similarity of that frame pattern with selected ones of the target patterns. Preferably, as in the method described above, a numerical measure is generated for each frame pattern with respect to each target pattern.

An accumulating element totals, for each frame time and each keyword, a numerical word score representing the likelihood that the keyword has just ended at the then-present frame time. This total is determined for each frame time and for each keyword. The accumulating element includes apparatus for accumulating, for each keyword, the numerical measures for each of a continuous sequence of the repetitively formed frame patterns, starting with the numerical measure of the similarity of the present frame pattern and the last target pattern of the keyword.

The apparatus further features means for generating at least a preliminary keyword recognition signal whenever the numerical value accumulated for a keyword exceeds a predetermined criterion.

Other objects, features, and advantages of the invention will become apparent from the following description of preferred embodiments of the invention, taken in conjunction with the accompanying drawings.

Corresponding reference characters indicate corresponding elements throughout the drawings.

In one of the particular preferred embodiments described herein, speech recognition is performed by a system that includes both a specially constructed electronic apparatus, which carries out certain analog and digital processing of the incoming audio data signal (generally speech), and a general purpose digital computer programmed in accordance with the present invention to carry out certain other data reduction steps and numerical evaluations. The division of tasks between the hardware portion and the software portion of this system has been made so as to obtain a system that can accomplish speech recognition in real time at moderate cost.

However, some of the tasks performed in hardware in this particular system could well be performed in software, and some of the tasks performed by software programming in this example could, in other embodiments, be performed by special purpose circuitry. In this latter connection, both hardware and software implementations of the apparatus are described where available.

As indicated previously, one aspect of the present invention is the provision of apparatus that recognizes keywords in continuous speech signals even when those signals are distorted, for example, by a telephone line. Thus, referring in particular to FIG. 1, the voice input signal, indicated at 10, may be considered a voice signal produced by a carbon element telephone transmitter and receiver over a telephone line encompassing any arbitrary distance and any number of switching exchanges. A typical application of the invention is therefore the recognition of keywords in audio data supplied by an unknown source (a speaker-independent system) and received over the telephone system. On the other hand, the input signal may also be any audio data signal, for example a voice input signal, taken from a radio communications link, such as a commercial broadcast station, or from a private communications link.

As is apparent from the above description, the method and apparatus of the present invention are concerned with the recognition of speech signals containing a sequence of sounds, phonemes, or other recognizable indicia. In this specification, reference is made to "a keyword," "a sequence of target patterns," "a template pattern," or "a keyword template"; these four terms are considered generic and equivalent. This is a convenient way of expressing a recognizable sequence of audio sounds, or representations thereof, which the method and apparatus can detect. The terms should be construed broadly and generically to encompass anything from a single phoneme, syllable, or sound to a series of words (in the grammatical sense), as well as a single word.

An analog-to-digital (A/D) converter 13 receives the incoming analog audio signal data on line 10 and converts the signal amplitude of that data into digital form. The illustrated A/D converter converts the input signal data to a 12-bit binary representation, the conversions occurring at a rate of 8,000 per second. In other embodiments, other sampling rates can be employed; for example, a 16 kHz rate can be used when a high quality signal is available. The A/D converter 13 applies its output over line 15 to an autocorrelator 17.

The autocorrelator 17 processes the digital input signal to generate a short-term autocorrelation function 100 times per second and, as indicated, applies its output over line 19. Each autocorrelation function has 32 values or channels, each value being calculated to a 30-bit resolution.

The autocorrelator is described in greater detail below in connection with FIG. 2.

The autocorrelation functions on line 19 are Fourier transformed by a Fourier transformation apparatus 21 to obtain the corresponding short-term windowed power spectra over line 23. The spectra are generated at the same repetition rate as the autocorrelation functions, that is, 100 per second, and each short-term power spectrum has 31 numerical terms, each with a resolution of 16 bits.

As will be understood, each of the 31 terms in the spectrum represents the signal power within a frequency band. The Fourier transformation apparatus also preferably includes a Hamming or similar window function to reduce spurious adjacent-band responses.
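The step from a 32-lag autocorrelation function to a 31-term windowed power spectrum can be sketched as follows. Because an autocorrelation function is symmetric about zero lag, its Fourier transform reduces to a cosine series. The evenly spaced band centres, the particular Hamming-shaped lag window, and the name `power_spectrum` are assumptions made for illustration; the text above does not specify them.

```python
import math


def power_spectrum(a, n_bands=31, sample_rate=8000.0):
    """Windowed cosine transform of an autocorrelation function a[0..L-1].

    Returns n_bands values, one per (hypothetical) frequency band centre.
    """
    lags = len(a)
    # Positive-lag half of a symmetric Hamming window (assumed shape):
    # 1.0 at lag 0, tapering to 0.08 at the largest lag.
    w = [0.54 + 0.46 * math.cos(math.pi * j / (lags - 1)) for j in range(lags)]
    spectrum = []
    for i in range(n_bands):
        # Hypothetical band centres, evenly spaced below the Nyquist frequency.
        f = (i + 1) * (sample_rate / 2) / (n_bands + 1)
        omega = 2 * math.pi * f / sample_rate
        # Symmetry a(-j) = a(j) lets the transform be written as a cosine sum.
        p = w[0] * a[0] + 2 * sum(
            w[j] * a[j] * math.cos(omega * j) for j in range(1, lags)
        )
        spectrum.append(p)
    return spectrum
```

With these assumed band centres (125 Hz apart at an 8,000 Hz sampling rate), the autocorrelation of a 1,000 Hz tone produces its largest term in the band centred on 1,000 Hz, and the window keeps the response in neighbouring bands lower.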

In the first illustrated embodiment, the Fourier transformation and the subsequent processing steps are performed under the control of an appropriately programmed general purpose digital computer, which uses a peripheral array processor to speed up the arithmetic operations that the method requires repetitively. The particular computer employed is a model PDP-11 manufactured by Digital Equipment Corporation of Massachusetts. The particular array processor employed is described in U.S. Pat. No. 4,228,498, assigned to the assignee of this application. The programming described below in connection with FIG. 3 is substantially predicated on the capabilities and characteristics of these available digital processing units.

The short-term windowed power spectra are frequency-response equalized, as indicated at 25. This equalization is performed as a function of the peak amplitudes occurring in each frequency band or channel, as described in greater detail below. The frequency-response equalized spectra on line 26 are generated at a rate of 100 per second, and each spectrum has 31 numerical terms evaluated to 16-bit accuracy. To facilitate the final evaluation of the incoming audio data, the frequency-response equalized and windowed spectra on line 26 are subjected to an amplitude transformation, as indicated at 35.

This imposes a non-linear amplitude transformation on the incoming spectra. The transformation is described in greater detail below, but it may be noted at this point that it improves the accuracy with which the unknown incoming audio signal can be matched with the keywords of a reference vocabulary. In the illustrated embodiment, this transformation is performed on all of the frequency-response equalized and windowed spectra at a time prior to the comparison of the spectra with the target patterns representing the keywords of the reference vocabulary.

The amplitude transformed and equalized short-term spectra on line 38 are then compared with the keyword target patterns at 40. The keyword target patterns, designated at 42, represent the keywords of the reference vocabulary in a statistical fashion with which the transformed and equalized spectra can be compared. Candidate words are thus selected according to the closeness of the comparison. In the illustrated embodiment, the selection process is designed to minimize the likelihood of a missed keyword while rejecting grossly inapplicable pattern sequences. A recognition decision is provided over line 44.

Referring to FIG. 1A, the speech recognition system of the invention employs a controller 45, which may be, for example, a general purpose digital computer such as a PDP-11 or a hardware controller built specifically for the system.

In the illustrated embodiment, the controller 45 receives preprocessed audio data from a preprocessor 46, which is described in detail in connection with FIG. 2. The preprocessor 46 receives the analog audio input signal over line 47 and supplies processed data to the control processor over interface line 48.

Generally, the operating speed of the control processor, if it is a general purpose processor, is not fast enough to process the incoming data in real time. As a result, various special purpose hardware can be employed to advantage to effectively increase the processing speed of element 45. In particular, a vector processing element 48a, such as that described in U.S. Pat. No. 4,228,498, assigned to the assignee of this invention, provides significantly increased array processing capability by using a pipeline effect. In addition, as described in more detail in connection with FIGS. 5, 6, 7, and 8, a likelihood function processor 48b can be used in conjunction with the vector processor to increase the operating speed of the apparatus by a further factor of ten.

While in the preferred embodiment of the invention the control processor 45 is a digital computer, in another particular embodiment, described in connection with FIGS. 9 and 10, a significant portion of the processing capability is implemented outside the control processor in a sequential processor 49. The structure of this processor is described in greater detail below in connection with FIGS. 9 and 10. Thus, the apparatus for implementing speech recognition illustrated herein has great flexibility, both in its speed and in its ability to be implemented in hardware, in software, or in an advantageous combination of hardware and software.

The preprocessor is described next.

In the apparatus illustrated in FIG. 2, an autocorrelation function, with its intrinsic averaging, is computed digitally on the digital data stream generated by the analog-to-digital converter 13, which operates on the incoming analog audio data, generally a voice signal, supplied over line 10. The converter 13 provides a digital input signal on line 15. The digital processing functions, as well as the analog-to-digital conversion, are timed under the control of a clock oscillator 51. The clock oscillator provides a basic timing signal of 256,000 pulses per second, and this signal is applied to a frequency divider 52 to obtain a second timing signal of 8,000 pulses per second. The slower timing signal controls the analog-to-digital converter 13 together with a latch register 53, which holds the 12-bit result of the last conversion until the next conversion is completed.

The autocorrelation products are generated by a digital multiplier 56, which multiplies the number contained in register 53 by the output of a 32-word shift register 58. Shift register 58 is operated in a recirculating mode and is driven by the faster clock frequency, so that one complete circulation of the shift register data is accomplished for each analog-to-digital conversion.

An input to shift register 58 is taken from register 53 once during each complete circulation cycle. One input to the digital multiplier 56 is taken directly from the latch register 53, while the other input is taken (with one exception described below) from the current output of the shift register through a multiplexer 59. The multiplications are performed at the higher clock frequency.

Thus, each value obtained from the A/D conversion is multiplied by each of the preceding 31 conversion values. As will be understood by those skilled in the art, the signals thereby generated are equivalent to multiplying the input signal by itself delayed in time by 32 different time increments (one of which is the zero delay). To produce the zero-delay correlation, that is, the power of the signal, multiplexer 59 causes the current value of the latch register 53 to be multiplied by itself at the time each new value is being introduced into the shift register. This timing function is indicated at 60.

As will also be understood by those skilled in the art, the products from a single conversion, together with its 31 predecessors, are not fairly representative of the energy distribution, or spectrum, over a reasonable sampling interval. Accordingly, the apparatus of FIG. 2 provides for averaging of these sets of products.

The accumulation process that effects the averaging is provided by a 32-word shift register 63, which is interconnected with an adder 65 to form a set of 32 accumulators. Each word can thus be recirculated after having been added to the corresponding increment from the digital multiplier. The circulation loop passes through a gate 67, which is controlled by a divide-by-N divider 69 driven by the low frequency clock signal. The divider 69 divides the lower frequency clock by a factor that determines the number of instantaneous autocorrelation functions which are accumulated, and thus averaged, before the shift register 63 is read out.

In the illustrated embodiment, 80 samples are accumulated before being read out. In other words, N for the divide-by-N divider 69 is equal to 80. After 80 conversion samples have been correlated and accumulated, the divider 69 triggers a computer interrupt circuit 71 over line 72. At this time, the contents of the shift register 63 are read successively into the computer memory through suitable interface circuitry 73.

The 32 successive words in the register are presented to the computer in ordered sequence through the interface 73. As will be understood by those skilled in the art, this data transfer from a peripheral unit, the autocorrelator preprocessor, to the computer would typically be performed by a direct memory access procedure.

Since 80 samples are averaged at an initial sampling rate of 8,000 samples per second, it will be seen that 100 averaged autocorrelation functions are supplied to the computer every second.

While the shift register contents are being read out to the computer, the gate 67 is closed, so that each of the words in the shift register is effectively reset to zero to permit the accumulation process to begin again.
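The behaviour of the FIG. 2 hardware can be mimicked in software with a short sketch. The class name `Autocorrelator`, its `push` method, and the choice to treat samples before start-up as zero are assumptions made for illustration; the comments map each field to the element of FIG. 2 that it loosely stands in for.

```python
from collections import deque


class Autocorrelator:
    """Software stand-in for the FIG. 2 autocorrelator (illustrative only)."""

    def __init__(self, lags=32, block=80):
        self.lags = lags
        self.block = block
        # Plays the role of shift register 58: the current sample and its
        # 31 predecessors, newest first. Start-up values are assumed zero.
        self.history = deque([0] * lags, maxlen=lags)
        # Plays the role of shift register 63 with adder 65: 32 accumulators.
        self.acc = [0] * lags
        # Plays the role of the divide-by-N divider 69.
        self.count = 0

    def push(self, sample):
        """Feed one A/D sample; return a 32-value frame every `block` samples."""
        self.history.appendleft(sample)  # oldest value drops off the far end
        for j in range(self.lags):
            # Lag 0 multiplies the sample by itself (signal power).
            self.acc[j] += sample * self.history[j]
        self.count += 1
        if self.count == self.block:
            # Read out, then reset the accumulators (the role of gate 67).
            frame, self.acc, self.count = self.acc, [0] * self.lags, 0
            return frame
        return None
```

Feeding this object 8,000 samples per second yields 8,000 / 80 = 100 frames per second, matching the rate stated above. The 32 multiply-accumulate steps per sample correspond to the ratio of the two clock rates, 256,000 / 8,000 = 32.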

Expressed mathematically, the operation of the apparatus shown in FIG. 2 can be described as follows. Assuming that the analog-to-digital converter generates the time series $S(t)$, where $t = 0, T_0, 2T_0, \ldots$ and $T_0$ is the sampling interval (1/8000 second in the illustrated embodiment), the illustrated digital correlation circuitry of FIG. 2 may be considered, ignoring start-up ambiguities, to compute the autocorrelation function

$$a(j, t) = \sum_{k=0}^{79} S(t + kT_0)\, S\bigl(t + (k - j)T_0\bigr) \qquad (1)$$

where $j = 0, 1, 2, \ldots, 31$ and $t = 80T_0, 160T_0, \ldots, 80nT_0, \ldots$.

These autocorrelation functions correspond to the correlation output on line 19 of FIG. 1.

Referring now to FIG. 3, the digital correlator operates continuously to transmit to the computer a series of data blocks at the rate of one complete autocorrelation function every 10 milliseconds. This is indicated at 77 in FIG. 3. Each data block represents the autocorrelation function derived from a corresponding subinterval of time. As noted above, the illustrated autocorrelation functions are provided to the computer at the rate of 100 32-word functions per second. This analysis interval is referred to hereinafter as a "frame."

In the first illustrated embodiment, the processing of the autocorrelation function data is performed by an appropriately programmed special purpose digital computer. A flow chart that includes the functions provided by the computer program is shown in FIG. 3. It should be pointed out, however, that various of the steps could be performed by hardware (as described below) rather than by software, and that certain of the functions performed by the apparatus of FIG. 2 could likewise be performed in software with a corresponding revision of the flow chart of FIG. 3.

Although the digital correlator of FIG. 2 performs some time averaging of the autocorrelation functions generated on an instantaneous basis, the averaged autocorrelation functions read out to the computer may still contain anomalous discontinuities or unevenness that could interfere with the orderly processing and evaluation of the samples. Accordingly, each block of data, that is, each autocorrelation function $a(j, t)$, is first smoothed with respect to time. This is indicated at 78 in the flow chart of FIG. 3. The preferred smoothing process is one in which the smoothed autocorrelation output $a_s(j, t)$ is given by the following equation.

$$a_s(j,t) = C_0\,a(j,t) + C_1\,a(j,t-T) + C_2\,a(j,t-2T) \qquad (2)$$

where $a(j,t)$ is the unsmoothed input autocorrelation function defined in Equation (1), $a_s(j,t)$ is the smoothed autocorrelation output, $j$ denotes the delay time, $t$ denotes real time, and $T$ denotes the time interval between consecutively generated autocorrelation functions (frames), equal to 0.01 second in the preferred embodiment. The weighting coefficients $C_0$, $C_1$, $C_2$ are preferably chosen to be 1/4, 1/2, 1/4 in the illustrated embodiment, although other values could be chosen. For example, a smoothing function approximating a Gaussian impulse response with a cutoff frequency of, say, 20 Hz could be implemented in the computer software. Experiments indicate, however, that the easier-to-implement smoothing function of Equation (2) gives satisfactory results. As noted above, the smoothing function is applied separately for each value $j$ of delay.
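The three-tap smoothing of Equation (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `smooth_frames` and the list-of-lists frame layout are hypothetical, and the output simply starts at the third frame because two earlier frames are needed.

```python
def smooth_frames(frames, weights=(0.25, 0.5, 0.25)):
    """Apply Equation (2) to a sequence of autocorrelation frames.

    frames: list of equal-length lists, one per 10 ms frame, oldest first.
    Returns smoothed frames starting at the third input frame, because the
    filter needs a(j, t), a(j, t-T) and a(j, t-2T).
    """
    c0, c1, c2 = weights
    smoothed = []
    for n in range(2, len(frames)):
        cur, prev1, prev2 = frames[n], frames[n - 1], frames[n - 2]
        # Each lag j is smoothed independently of the other lags.
        smoothed.append([c0 * a + c1 * b + c2 * c
                         for a, b, c in zip(cur, prev1, prev2)])
    return smoothed
```

With the default weights 1/4, 1/2, 1/4 the filter leaves a linear trend across frames delayed by one frame, which is why a ramp input returns the middle frame's values.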

The analysis that follows involves various operations on the short-term Fourier power spectrum of the speech signal. For hardware simplicity and processing speed, the transformation of the autocorrelation function to the frequency domain is carried out in 8-bit arithmetic in the illustrated embodiment.

At the high end of the passband, near 3 kHz, the spectral power density falls to a level at which resolution is inadequate in 8-bit quantities. The frequency response of the system is therefore tilted at a rising rate of 6 dB per octave. This is indicated at 79. This high-frequency emphasis is accomplished by taking the second derivative of the autocorrelation function with respect to its argument, that is, the time delay.

The derivative operation is

$$b(j,t) = -a(j+1,t) + 2\,a(j,t) - a(j-1,t) \qquad (3)$$

To evaluate the derivative at $j = 0$, the autocorrelation function is assumed to be symmetric about 0, so that $a(-j,t) = a(+j,t)$.

Also, because there is no data for $a(32)$, the derivative at $j = 31$ is taken to be the same as the derivative at $j = 30$.

As indicated in the flowchart of FIG. 3, the next step in the analysis procedure, after high-frequency emphasis, is to estimate the signal power in the current frame interval by finding the peak absolute value of the autocorrelation.

The power estimate $P(t)$ is

$$P(t) = \max_i \,\lvert b(i,t) \rvert \qquad (4)$$

To prepare the autocorrelation for the 8-bit spectrum analysis, the smoothed autocorrelation function is block normalized with respect to $P(t)$ (at 80), and the most significant eight bits of each normalized value are input to the spectrum analysis hardware.

The normalized (and smoothed) autocorrelation function is therefore

$$c(j,t) = 127\,\frac{b(j,t)}{P(t)} \qquad (5)$$

Next, as indicated at 80, a cosine Fourier transform is applied to each time-smoothed, frequency-emphasized, normalized autocorrelation function $c(j,t)$ to generate a 31-point power spectrum. The matrix of cosine values is given by

$$S(i,j) = 126\,g(i)\,\cos\!\left(\frac{2\pi i}{8000}\,f(j)\right), \qquad j = 0, 1, 2, \ldots, 31 \qquad (6)$$

where $S(i,j)$ is the spectral energy, at time $t$, in the band centered at $f(j)$ Hz; $g(i) = \tfrac{1}{2}\left(1 + \cos\frac{2\pi i}{63}\right)$ is the (Hamming) window function envelope used to reduce side lobes; and

$$f(j) = 30 + 1000\,(0.0552\,j + 0.438)^{1/0.63}\ \text{Hz}, \qquad j = 0, 1, 2, \ldots, 31 \qquad (7)$$

These are the analysis frequencies, equally spaced on the so-called "mel" curve of subjective musical pitch. As will be understood, this corresponds to a subjective-pitch (mel-scale) frequency-axis spacing for frequencies within the bandwidth of a typical communication channel, about 300 to 3500 Hz.
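The block normalization of Equations (4) and (5) and the analysis frequencies of Equation (7) are simple enough to check numerically. The sketch below is illustrative only: the function names are hypothetical, truncation with `int()` merely stands in for "keep the most significant eight bits", and the handling of an all-zero frame is an assumption that the text does not address.

```python
import math

def block_normalize(b):
    """Equations (4) and (5): scale a frame so its peak magnitude is 127."""
    power = max(abs(v) for v in b)            # P(t), Equation (4)
    if power == 0:
        return power, [0] * len(b)            # all-zero frame (assumed handling)
    # int() truncates toward zero, standing in for the 8-bit quantization.
    return power, [int(127 * v / power) for v in b]

def analysis_frequency(j):
    """Equation (7): centre frequency in Hz of band j on the mel-like curve."""
    return 30.0 + 1000.0 * (0.0552 * j + 0.438) ** (1.0 / 0.63)

def window(i):
    """Window envelope g(i) quoted with Equation (6)."""
    return 0.5 * (1.0 + math.cos(2.0 * math.pi * i / 63.0))
```

Evaluating `analysis_frequency` at the two ends of the index range gives roughly 300 Hz for band 0 and roughly 3400 Hz for band 31, in line with the channel bandwidth mentioned in the text.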

Because the spectrum analysis requires summation over lags from -31 to +31, only the positive values of $j$ are needed once the autocorrelation is assumed to be symmetric about zero. To avoid counting the lag-zero term twice, however, the cosine matrix is adjusted so that

$$S(0,j) = \frac{126}{2} = 63 \quad \text{for all } j \qquad (8)$$

The computed power spectrum is then given by Equation (9), in which the $j$-th result corresponds to the frequency $f(j)$.

As will also be understood, each point or value within each spectrum represents a corresponding band of frequencies.

Although this Fourier transform can be performed entirely within conventional computer hardware, the process can be speeded up considerably by using an external hardware multiplier or a Fast Fourier Transform (FFT) peripheral device.

The construction and operation of such modules are well known in the art, however, and are not described in detail here. A frequency smoothing function is advantageously built into the hardware FFT peripheral device, in which case each spectrum is smoothed in frequency according to the preferred (Hamming) window weighting function $g(i)$ described above.

This is carried out at 83 of block 85, which corresponds to the hardware implementation of the Fourier transform.

If the background noise is significant, an estimate of the power spectrum of the background noise should be subtracted from $S'(j,t)$ at this stage. The frame or frames selected to represent the noise must not contain any speech signal. The optimum rule for selecting noise frame intervals will vary with the application. If the talker is engaged in two-way communication with, for example, a machine controlled by the speech recognition apparatus, it is convenient to choose a frame arbitrarily in the interval immediately after the machine has finished speaking through its voice response unit. In less constrained situations, the noise frame can be found by choosing the frame of minimum amplitude during the past one or two seconds of audio input.

As successive smoothed power spectra are received from the FFT peripheral 85, the communication channel is equalized by determining a (generally different) peak power spectrum envelope for the spectra from peripheral 85 and modifying the output of the FFT apparatus accordingly, as described below. Each newly generated peak amplitude spectrum, corresponding to and updated by an incoming windowed power spectrum $S'(j,t)$ (where $j$ is indexed over the plural frequency bands of the spectrum), is the result of a fast-attack, slow-decay peak detecting function applied to each spectrum channel or band. The windowed power spectra are normalized with respect to the respective terms of the corresponding peak amplitude spectrum.

This is indicated at 87, 89, and 91.

In the illustrated embodiment, the "old" peak amplitude spectrum $p(j,t-T)$, determined before the new windowed spectrum was received, is compared band by band with the newly arrived spectrum $S'(j,t)$.

The new peak spectrum $p(j,t)$ is then generated according to the following rules. The power amplitude in each band of the "old" peak amplitude spectrum is multiplied by a fixed fraction, 1023/1024 in this embodiment. This corresponds to the slow-decay portion of the peak detecting function. If the power amplitude in a frequency band $j$ of the incoming spectrum $S'(j,t)$ is greater than the power amplitude in the corresponding band of the decayed peak amplitude spectrum, the decayed peak amplitude value for that band (or those bands) is replaced by the spectrum value of the corresponding band of the incoming windowed spectrum. This corresponds to the fast-attack portion of the peak detecting function.

Mathematically, the peak detecting function can be expressed as

$$p(j,t) = \max\bigl\{\, p(j,t-T)\,(1-E),\ \ P(t)\,S'(j,t) \,\bigr\}, \qquad j = 0, 1, \ldots, 31 \qquad (10)$$

where $j$ is indexed over each of the frequency bands, $p(j,t)$ is the resulting peak spectrum, $p(j,t-T)$ is the "old" or previous peak spectrum, $S'(j,t)$ is the newly arrived, partially processed power spectrum, $P(t)$ is the power estimate at time $t$, and $E$ is the decay parameter.

According to Equation (10), the peak spectrum normally decays by a factor of $1-E$ in the absence of a higher-valued spectrum input. Typically $E$ equals 1/1024. It may be undesirable, however, to let the peak spectrum decay during intervals of silence, particularly if no rapid change in the communication channel or voice characteristics is expected. To define a silent frame, the same method used to choose background noise frames is employed. The amplitudes (the square root of $P(t)$) of the past 128 frames are inspected and the minimum value is found. If the amplitude of the current frame is less than four times this minimum, the current frame is determined to be silent, and the value 0 is substituted for the value 1/1024 for $E$.
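The fast-attack, slow-decay peak detector of Equation (10) and the silence rule can be sketched as below. The function names and the list-based layout are hypothetical; the sketch assumes the caller keeps the power estimates of the recent frames (up to 128 of them) and passes them in.

```python
import math

def update_peak(old_peak, spectrum, power, decay=1.0 / 1024.0):
    """One step of Equation (10) applied to every band.

    old_peak: p(j, t-T); spectrum: S'(j, t); power: P(t); decay: E.
    """
    return [max(p * (1.0 - decay), power * s)
            for p, s in zip(old_peak, spectrum)]

def is_silent(recent_powers, current_power, factor=4.0):
    """Silence test from the text; amplitude is the square root of P(t)."""
    floor = min(math.sqrt(p) for p in recent_powers)
    return math.sqrt(current_power) < factor * floor
```

In use, a caller would pass `decay=0.0` to `update_peak` whenever `is_silent` returns true, so that the peak envelope is frozen during silence rather than allowed to decay.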

After the peak spectrum has been generated, the resulting peak amplitude spectrum $p(j,t)$ is smoothed in frequency (at 89) by averaging each frequency band peak value with the peak values corresponding to adjacent frequencies of the newly generated peak spectrum. The overall width of the band of frequencies contributing to the average is approximately equal to the typical frequency separation between formant frequencies. As will be understood by those skilled in the speech recognition art, this separation is on the order of 1000 Hz.

Averaging in this particular way retains the useful information in the spectra, namely the local variations that reveal formant resonances, while suppressing overall or gross emphasis in the frequency spectrum. In the preferred embodiment, the peak spectrum is smoothed with respect to frequency by a moving-average function covering seven adjacent frequency bands. The averaging function is given by Equation (11).

At the ends of the passband, $p(k,t)$ is taken to be 0 for $k$ less than 0 and for $k$ greater than 31. The normalizing envelope $h(j)$ takes into account the number of valid data elements actually summed. Thus $h(0) = 7/4$, $h(1) = 7/5$, $h(2) = 7/6$, $h(3) = 1$, ..., $h(28) = 1$, $h(29) = 7/6$, $h(30) = 7/5$, and $h(31) = 7/4$.

The resulting smoothed peak amplitude spectrum $e(j,t)$ is then used to normalize and frequency equalize the power spectrum just received. This is done by dividing the amplitude value of each frequency band of the incoming smoothed spectrum $S'(j,t)$ by the corresponding frequency band value of the smoothed peak spectrum $e(j,t)$. Mathematically, this is expressed as

$$s_n(j,t) = \frac{S'(j,t)}{e(j,t)} \cdot 32767 \qquad (12)$$

where $s_n(j,t)$ is the peak-normalized, smoothed power spectrum and $j$ is indexed over each of the frequency bands. This step is indicated at 91. The result is a sequence of frequency-equalized and normalized short-term power spectra that emphasizes changes in the frequency content of the incoming audio signal while suppressing any generalized long-term frequency emphasis or distortion. This method of frequency compensation has been found to be highly advantageous for recognizing speech signals transmitted over frequency-distorting communication links such as telephone lines, compared with the more usual frequency compensation systems in which the basis for compensation is the average power level, either of the whole signal or of each respective frequency band.
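The frequency smoothing and equalization steps can be illustrated as follows. Because Equation (11) itself is not reproduced in this text, the form used here is an assumption: the seven-band sum is divided by 7 and multiplied by the envelope $h(j)$, which amounts to averaging over the bands that actually exist. The function names are hypothetical, and the sketch assumes every smoothed peak value is non-zero.

```python
def smooth_peak_spectrum(p):
    """Seven-band moving average with end correction (assumed form of Eq. 11)."""
    n = len(p)
    e = []
    for j in range(n):
        lo, hi = max(0, j - 3), min(n - 1, j + 3)
        valid = hi - lo + 1              # 4, 5, 6 or 7 bands actually summed
        h = 7.0 / valid                  # h(0) = 7/4, h(1) = 7/5, h(2) = 7/6, ...
        e.append(h * sum(p[lo:hi + 1]) / 7.0)
    return e

def equalize(spectrum, e, full_scale=32767):
    """Equation (12): divide each band by the smoothed peak and rescale."""
    return [full_scale * s / ej for s, ej in zip(spectrum, e)]
```

With this form a flat peak spectrum stays flat after smoothing, including at the band edges, which is the purpose of the envelope $h(j)$.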

Note that although the successive spectra have been variously processed and equalized, the data representing the incoming audio signal still comprise spectra occurring at a rate of 100 per second.

The normalized and frequency-equalized spectra, indicated at 91, are subjected to an amplitude transformation, indicated at 93. This is done by applying a non-linear scaling operation to the spectrum amplitude values.

Designating the individual equalized and normalized spectra as $s_n(j,t)$ (from Equation 12), where $j$ indexes the different frequency bands of the spectrum and $t$ denotes real time, the non-linearly scaled spectrum $x(j,t)$ is defined by the following linear fraction function:

$$x(j,t) = 128\,\frac{s_n(j,t) - A}{s_n(j,t) + A}, \qquad j = 0, 1, \ldots, 30 \qquad (13)$$

where $A$ is the average value of the spectrum $s_n(j,t)$ over $j = 0$ to 31, defined as

$$A = \frac{1}{32} \sum_{j=0}^{31} s_n(j,t) \qquad (14)$$

where $j$ indexes the frequency bands of the power spectrum. The thirty-first term of the spectrum is replaced by the logarithm of $A$, so that

$$x(31,t) = 16 \log_2 A \qquad (15)$$

This scaling function (Equation 13) produces a soft threshold and gradual saturation effect for spectral intensities that deviate greatly from the short-term average $A$. Mathematically, the function is approximately linear for intensities near the average, approximately logarithmic for intensities further from the average, and substantially constant for extreme values of intensity. On a logarithmic scale, the function $x(j,t)$ is symmetric about zero and exhibits threshold and saturation behavior suggestive of an auditory nerve firing-rate function. In practice, the overall recognition system performs significantly better with this particular non-linear scaling function than with either a linear or a logarithmic scaling of the spectrum amplitudes.
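A short sketch of the amplitude transformation may help in checking the soft-threshold behavior numerically. The function name is hypothetical, and the use of a base-2 logarithm for the last word is an assumption, since the subscript of the logarithm is not legible in the source; the input is assumed to be strictly positive.

```python
import math

def amplitude_transform(sn):
    """Equations (13) to (15) for one 32-band equalized spectrum s_n(j, t)."""
    avg = sum(sn) / len(sn)                                  # A, Equation (14)
    # Bands 0..30: linear fraction function, bounded between -128 and +128.
    x = [128.0 * (v - avg) / (v + avg) for v in sn[:-1]]     # Equation (13)
    # Last word carries the log of the average (base 2 is an assumption).
    x.append(16.0 * math.log2(avg))                          # Equation (15)
    return x
```

A band equal to the average maps to 0, and bands far above or below the average approach +128 and -128 without ever reaching them, which is the gradual saturation described in the text.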

In this way there is generated a sequence of amplitude-transformed, frequency-response-equalized, normalized short-term power spectra $x(j,t)$, where $t = 0.01, 0.02, 0.03, 0.04, \ldots$ seconds and $j = 0, \ldots, 30$ (corresponding to the frequency bands of the generated power spectra). Thirty-two words are provided for each spectrum, and the value of $A$ (Equation 15), the average of the spectrum values, is stored as the thirty-second word. These amplitude-transformed short-term power spectra, referred to below as "frames", are stored, as indicated at 95, in a first-in, first-out circulating memory that, in the illustrated embodiment, has storage capacity for 256 thirty-two-word spectra. In the illustrated embodiment, 2.56 seconds of the audio input signal is thus available for analysis. This storage capacity gives the recognition system the flexibility, if required, to select spectra at different real times for analysis and evaluation, and hence to move forward and backward in time as the analysis requires.

The frames for the last 2.56 seconds are thus stored in the circulating memory and are available as needed. In operation, in the illustrated embodiment, each frame is stored for 2.56 seconds. A frame that enters the circulating memory at time $t_1$ is therefore lost, or shifted out of the memory, 2.56 seconds later, when a new frame corresponding to time $t_1 + 2.56$ seconds is stored.
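The behavior of the 256-frame circulating memory can be mimicked with a bounded double-ended queue. This is a sketch for illustration only: the names `frame_store`, `store_frame`, and `frame_at` are hypothetical and say nothing about how the memory at 95 is actually built.

```python
from collections import deque

FRAME_PERIOD_S = 0.01      # one frame every 10 ms
CAPACITY = 256             # 256 frames cover 2.56 s of audio

frame_store = deque(maxlen=CAPACITY)

def store_frame(frame):
    """Append the newest frame; the oldest is dropped once 256 are held."""
    frame_store.append(frame)

def frame_at(frames_back):
    """Return the frame stored `frames_back` frames before the newest one."""
    return frame_store[-1 - frames_back]
```

Looking back three or six frames with `frame_at(3)` or `frame_at(6)` corresponds to stepping 30 ms or 60 ms into the past.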

The frames passing through the circulating memory are compared, preferably in real time, against a known vocabulary of words in order to detect and identify keywords in the input. Each vocabulary word is represented by a template pattern that statistically represents a plurality of processed power spectra formed into plural non-overlapping multi-frame (preferably three-frame) design set patterns, or target patterns. These patterns are preferably selected so as to best represent the significant acoustic events of the vocabulary word, and they are stored at 99.

The spectra forming the design set patterns are generated for words spoken in various contexts, using the same system described above (FIG. 3) for processing the continuous unknown speech input on line 10.

Each vocabulary word thus has associated with it a generally plural sequence of design set patterns $p(i)_1, p(i)_2, \ldots$, each of which gives, in the domain of short-term power spectra, one designation of that $i$-th keyword. The collection of design set patterns for each keyword forms the statistical basis from which the target patterns are generated.

In the illustrated embodiment of the invention, each design set pattern $p(i)_j$ can be regarded as a 96-element array comprising three selected frames arranged in series. The frames forming a pattern should be spaced at least 30 milliseconds apart to avoid spurious correlation due to the time-domain smoothing. In other embodiments of the invention, other sampling strategies can be used to select the frames.

The preferred strategy, however, is to select frames spaced by a constant duration, preferably 30 milliseconds, and to space the non-overlapping design set patterns throughout the time interval that defines the keyword. Thus a first design set pattern $p_1$ corresponds to a portion of the keyword near its beginning, a second pattern $p_2$ corresponds to a portion later in time, and so on. The patterns $p_1, p_2, \ldots$ form the statistical basis for the sequence of target patterns, that is, the word template, against which the incoming speech data are matched.

The target patterns $t_1, t_2, \ldots$ each comprise the statistical data generated from the corresponding $p(i)_j$ on the assumption that the $p(i)_j$ consist of independent Gaussian variables. This assumption allows a likelihood statistic to be generated between the incoming data, described below, and the target patterns. A target pattern therefore consists of an array whose entries comprise the mean, the standard deviation, and an area normalization factor for the corresponding collection of design set pattern array entries. A more refined likelihood statistic is described later.
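As a rough illustration of how a target pattern could be derived from a class of design set patterns, the sketch below computes the per-element mean and standard deviation. The function name is hypothetical, the population form of the standard deviation is an assumption, and the area normalization factor mentioned in the text is not computed here.

```python
import statistics

def build_target_pattern(design_set):
    """Per-element mean and standard deviation over design set patterns.

    design_set: list of equal-length patterns (96 elements each in the text).
    """
    columns = list(zip(*design_set))     # one tuple per pattern element
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation; the choice of estimator is an assumption.
    stds = [statistics.pstdev(col) for col in columns]
    return means, stds
```

Elements that never vary across the design set come out with zero standard deviation, so a practical implementation would presumably need a floor on that value before using it as a divisor.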

As will be apparent to those skilled in the art, almost every keyword has more than one contextual and/or regional pronunciation, and hence more than one "spelling" in design set patterns. A vocabulary word having the patterned spelling $p_1, p_2, \ldots$ described above can therefore, in practice, be expressed generally as $p(i)_1, p(i)_2, \ldots$, $i = 1, 2, \ldots, M$, where each $p(i)_j$ is a possible alternative description of the $j$-th class of design set patterns, there being a total of $M$ different spellings for the word.

In the most general sense, therefore, the target patterns $t_1, t_2, \ldots, t_i, \ldots$ each represent plural alternative statistical spellings for the $i$-th group or class of design set patterns. In the illustrated embodiment, the term "target pattern" is thus used in its most general sense, and each target pattern may have more than one permissible alternative "statistical spelling."

This completes the preprocessing of the incoming unknown audio signal and of the audio signals that form the reference patterns.

The processing of the stored spectra is described next.

Referring to FIG. 3, the spectra, or frames, stored at 95 and representing the incoming continuous audio data are compared with the stored templates of target patterns, indicated at 99, that represent the keywords of the vocabulary, according to the following method.

For each 10-millisecond frame, a pattern for comparison with the stored reference patterns is formed at 97 by adjoining the current spectrum vector $s(j,t)$, the spectrum $s(j,t-0.03)$ from three frames earlier, and the spectrum $s(j,t-0.06)$ from six frames earlier, to form a 96-element pattern.
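Forming the 96-element comparison pattern from three stored frames can be sketched as below. The function name is hypothetical, and the ordering of the three frames within the pattern (oldest first) is an assumption, since the defining expression is not reproduced in this text.

```python
def multiframe_pattern(frames, spacing=3):
    """Concatenate three 32-word frames spaced 30 ms apart into 96 elements.

    frames: list of frames, oldest first and newest last; it must hold at
    least 2 * spacing + 1 frames. Oldest-to-newest ordering is an assumption.
    """
    newest = len(frames) - 1
    picks = (newest - 2 * spacing, newest - spacing, newest)
    pattern = []
    for idx in picks:
        pattern.extend(frames[idx])
    return pattern
```

Because a new pattern is formed for every 10 ms frame, consecutive patterns share no frames with each other but each spans 60 ms of speech.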

Each multi-frame pattern formed in this way can be transformed according to the procedures described in, for example, U.S. Pat. Nos. 4,241,329, 4,227,176, and 4,227,177. Although these transformations are useful in connection with the present invention, they do not form part of it, and it will be apparent to those skilled in the art how to adapt the teachings of those patents to the method and apparatus taught here.

Thus, in the illustrated embodiment, the transformations can reduce cross-correlation, decrease dimensionality, and increase the separation between target pattern classes. The multi-frame patterns comprising the equalized spectra are applied as inputs to the statistical likelihood calculation block, indicated at 100, which measures the probability that the transformed pattern (or sequence of transformed patterns) matches a target pattern (or sequence of target patterns).

The statistical likelihood calculation is described next.

The multi-frame patterns $x(j,t)$ formed as described above are applied as inputs to the statistical likelihood calculation block. As noted above, this processor provides a measure of the probability with which each of the successively presented multi-frame patterns (which in sequence represent the unknown input speech) matches each of the target patterns of the keyword templates in the machine's vocabulary. Typically, each datum representing a target pattern has a slightly skewed probability density, but it is nevertheless well approximated statistically by a normal (Gaussian) distribution with mean value $W_{i,k}$ and variance $\mathrm{var}(i,k)$, where $i$ is the sequential designation of the elements of the $k$-th target pattern.

The simplest implementation of this process assumes that the data associated with different values of $i$ and $k$ are uncorrelated, so that the joint probability density for the datum $x$ belonging to target pattern $k$ is (logarithmically)

$$L(x \mid k) = p(x,k) = \sum_i \left[ \tfrac{1}{2} \ln\bigl(2\pi\,\mathrm{var}(i,k)\bigr) + \frac{(x_i - W_{i,k})^2}{2\,\mathrm{var}(i,k)} \right]$$

Because the logarithm is a monotonic function, this statistic is sufficient to determine whether the probability of a match with any one target pattern of a keyword template is greater than or less than the probability of a match with some other vocabulary target pattern, or alternatively whether the probability of a match with a particular pattern exceeds a predetermined minimum level. Each input multi-frame pattern has its statistical likelihood $L(x \mid k)$ calculated for all of the target patterns of the keyword templates of the vocabulary.

The resulting likelihood statistic $L(x \mid k)$ is interpreted as the relative likelihood of occurrence of the target pattern named $k$ at the time $t$ at which pattern $x$ occurs.

As will be well understood by those skilled in the art, the ranking of these likelihood statistics constitutes the speech recognition, insofar as it can be performed from a single target pattern. These likelihood statistics can be used in various ways in an overall system, depending on the ultimate function to be performed.

Although a Gaussian distribution can be used for the probability model (see, for example, the above-mentioned U.S. Pat. Nos. 4,241,329, 4,227,176, and 4,227,177), the Laplace distribution

$$p(x) = \frac{1}{\sqrt{2}\,s'} \exp\!\left( -\frac{\sqrt{2}\,\lvert x - m \rvert}{s'} \right)$$

(where $m$ is the statistical mean and $s'$ is the standard deviation of the variable $x$) requires less computation and has been found to perform almost as well as the Gaussian distribution in, for example, the talker-independent isolated word recognition method described in U.S. Pat. No. 4,038,503. The degree of similarity $L(x \mid k)$ between an unknown input pattern $x$ and the $k$-th stored reference pattern is proportional to the logarithm of the probability and is computed at 100.
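To make the Laplace model concrete, the sketch below scores a pattern as the negative logarithm of the product of independent Laplace densities of the form quoted above, so that smaller scores mean better matches. This is an illustration only, not the expression evaluated at 100, which is not reproduced in this text; the function name and argument layout are hypothetical.

```python
import math

def laplace_score(x, means, scales):
    """Negative log-likelihood of x under independent Laplace densities.

    means: m per element; scales: s' (standard deviation) per element.
    Smaller scores indicate a better match.
    """
    root2 = math.sqrt(2.0)
    score = 0.0
    for xi, m, s in zip(x, means, scales):
        # -ln p(x) = ln(sqrt(2) * s') + sqrt(2) * |x - m| / s'
        score += math.log(root2 * s) + root2 * abs(xi - m) / s
    return score
```

Only an absolute difference, a multiplication, and an addition are needed per element once the logarithmic terms are precomputed for each reference pattern, which is the computational saving over the Gaussian form.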

To combine the likelihood scores $L$ of a sequence of patterns into the likelihood score of a spoken word or phrase, the score $L(x \mid k)$ for each frame is adjusted by subtracting the best (smallest) score of all the reference patterns for that frame:

$$L'(x \mid k) = L(x \mid k) - \min_i L(x \mid i) \qquad (18)$$

The best-fitting pattern for each frame will therefore have a score of zero. The adjusted scores for a hypothesized sequence of reference patterns can be accumulated from frame to frame to obtain a sequence score that is related directly to the probability that a decision in favor of the indicated sequence would be the correct decision.
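The per-frame adjustment and the frame-to-frame accumulation can be sketched as follows. The dictionary layout (reference pattern name mapped to score) and the function names are hypothetical conveniences, not part of the described apparatus.

```python
def adjust_scores(scores):
    """Subtract the frame's best (smallest) score from every score."""
    best = min(scores.values())
    return {k: v - best for k, v in scores.items()}

def sequence_score(per_frame_scores, hypothesis):
    """Accumulate adjusted scores along a hypothesized pattern sequence.

    per_frame_scores: list of {pattern name: score}, one dict per frame.
    hypothesis: list of pattern names, one per frame.
    """
    return sum(adjust_scores(scores)[k]
               for scores, k in zip(per_frame_scores, hypothesis))
```

A hypothesis that picks the best-fitting pattern on every frame accumulates a sequence score of zero; any other hypothesis accumulates a positive penalty.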

記憶された既知のパターンに対する未知の人カスベクト
ルパターンの比較は、１番目の基準パターンに対する下
記の関数を計算することにより遂行される。すなわち、ここに、ｓｉｋは１／ｓ’ｉｋに等しい。Comparison of the unknown human scum vector pattern to the stored known pattern is accomplished by calculating the following function for the first reference pattern. That is, where sik is equal to 1/s'ik.

In a conventional software implementation, the following instructions would be executed to compute the algebraic function s|x − u| (of Equation 19):

1. Compute x − u.
2. Test the sign of x − u.
3. If x − u is negative, negate it to form the absolute value.
4. Multiply by s.
5. Add the result into an accumulator.

In a typical speech recognition system having a 20-word vocabulary, there would be about 222 different reference patterns. The number of steps required to evaluate them, not counting overhead operations, is 5 × 96 × 222 = 106,560 steps, which must be carried out in less than 10 milliseconds in order to keep up with the real-time spectrum frame rate.
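The five steps and the step count above can be sketched in software as follows. This is an illustration only; the function and constant names are hypothetical, and the frame rate of 100 frames per second follows from the 10 millisecond frame interval.

```python
def weighted_abs_sum(s, x, u):
    """Accumulate s[i] * |x[i] - u[i]| using the five steps listed above."""
    acc = 0
    steps = 0
    for si, xi, ui in zip(s, x, u):
        d = xi - ui          # 1. compute x - u
        negative = d < 0     # 2. test the sign of x - u
        if negative:
            d = -d           # 3. negate to form the absolute value
        p = si * d           # 4. multiply by s
        acc += p             # 5. add the result into the accumulator
        steps += 5           # count all five steps, as the estimate in the text does
    return acc, steps

ELEMENTS = 96            # elements per pattern (from the text)
PATTERNS = 222           # reference patterns (from the text)
FRAMES_PER_SECOND = 100  # one frame every 10 ms

steps_per_frame = 5 * ELEMENTS * PATTERNS
steps_per_second = steps_per_frame * FRAMES_PER_SECOND
```

With these figures `steps_per_second` is 10,656,000, that is, roughly 11 million steps per second.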

The processor must therefore be capable of executing nearly 11 million instructions per second just to evaluate the likelihood functions. In view of the necessary speed, a special-purpose likelihood function hardware module 200 (FIG. 5) is employed, which is compatible with the system vector processor disclosed in U.S. Pat. No. 4,228,498.

In this special-purpose hardware, the five steps listed above are performed simultaneously with two sets of the arguments s, x, u, so that in effect ten instructions are performed in the time it takes to execute one instruction. Since the basic vector processor operates at a rate of 8 million instructions per second, the effective computation rate for the likelihood function becomes about 80 million instructions per second when the special-purpose hardware module 200 is employed.

Referring to FIG. 5, the hardware module 200 employs a combination of hardware pipelining and parallel processing to provide the simultaneous execution of the ten steps. Two identical sections 202, 204 each perform five arithmetic steps upon independent input data arguments, and the two results are combined by an adder 206 connected to their outputs. The accumulation of the sums from adder 206 forms the summation from 1 to 96 of Equation 19 and is handled by the arithmetic unit of the standard vector processor described in U.S. Pat. No. 4,228,498.

In operation, pipelining registers hold the intermediate data at the following stages of the processing:

1. Input arguments (clocked registers 208, 210, 212, 214, 216, 218).
2. Absolute value of x − u (clocked registers 220, 222).
3. Output of the multiplier (clocked registers 224, 226).

With the input data held in clocked registers 208 to 218, the magnitude of x − u is determined by subtract and absolute value circuits 228, 230. Referring to FIG. 6, the subtract and absolute value circuits 228, 230 each contain first and second subtracters 232, 234 (one computing x − u, the other computing u − x) and a multiplexer 236 that selects the positive result. The input arguments x and u, carried on lines 238, 240 from registers 208, 210 respectively, are 8-bit numbers ranging from −128 to +127.

Since the difference output of an 8-bit subtracter may overflow to 9 bits (for example, 127 − (−128) = 255), extra circuitry is needed and employed to handle the arithmetic overflow condition. The condition is determined by an overflow detector 235, whose inputs are the sign of x (over line 235a), the sign of u (over line 235b) and the sign of x − u (over line 235c).

Referring now to FIG. 7, the overflow detector is, in this illustrated embodiment, a combinatorial circuit having three-input AND gates 268, 270 and an OR gate 272. The truth table of FIG. 8 defines the overflow condition as a function of its inputs.
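The truth table of FIG. 8 is not reproduced in this text. The sketch below assumes the standard two's-complement rule for subtraction overflow, which happens to need exactly two three-input AND terms and one OR and uses only the three sign inputs named above. The gate assignment and all names are illustrative assumptions, not the patent's circuit.

```python
def to_signed8(value):
    """Wrap an integer to the signed 8-bit range -128..127."""
    return ((value + 128) % 256) - 128

def overflow(x, u):
    """Overflow of the 8-bit subtraction x - u, derived from three sign bits only."""
    sx = x < 0
    su = u < 0
    sd = to_signed8(x - u) < 0                # sign of the wrapped 8-bit difference
    gate_a = (not sx) and su and sd           # positive minus negative came out negative
    gate_b = sx and (not su) and (not sd)     # negative minus positive came out positive
    return gate_a or gate_b

def check_all():
    """Compare the sign-based rule with exact arithmetic for every 8-bit pair."""
    for x in range(-128, 128):
        for u in range(-128, 128):
            exact = not (-128 <= x - u <= 127)
            if overflow(x, u) != exact:
                return False
    return True
```

`check_all()` exhaustively confirms that the sign-based rule agrees with exact arithmetic over all 65,536 input pairs.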

The overflow condition is handled by providing four choices in the multiplexer 236, the circuit that selects the positive subtracter output. The choices are defined by the binary levels on lines 242 and 244. The level on line 242 represents the sign of x − u. The level on line 244 represents an overflow if it is "1". The choices are thus as follows.

    line 242   line 244   selection
       0          0       output of subtracter 232
       1          0       output of subtracter 234
       0          1       output of subtracter 232, shifted down one bit
       1          1       output of subtracter 234, shifted down one bit

The multiplexer is thus controlled to act like an 8-pole, 4-position switch. The shift operation is performed combinatorially by connecting the subtracter outputs to the appropriate multiplexer inputs.

The shift has the effect of dividing arithmetically by two.

If an overflow has occurred during the subtraction, the output of the multiplexer is the output of a subtracter divided by two. It is therefore necessary to remember that condition later in the computation, so that the final result can be multiplied by two to restore the correct scale factor. This restoration is performed by the multiplexer that follows the final pipelining register.

Accordingly, an extra bit is provided in the pipeline registers 220, 222, 224, 226 to control second multiplexers 248, 250. When the overflow bit is set (equal to 1), these multiplexers shift the multiplication product of the 8 × 8 bit multipliers 252, 254, respectively, up by one bit, thereby multiplying it by two. The multiplication can be carried out in a standard integrated circuit device, such as the TRW part number MPY-8-HJ, which accepts 8-bit numbers and outputs their product. Multipliers 252, 254 thus produce the product of s and |x − u| at each clock pulse (the value of s being properly timed by the extra data registers 256, 258).
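The halve-then-restore handling of overflow can be modelled at the arithmetic level as follows. This is a sketch rather than a gate-level description; the function name is hypothetical and s is assumed to be a non-negative integer.

```python
def scaled_term(s, x, u):
    """Arithmetic model of one s * |x - u| term with overflow rescaling."""
    diff = x - u
    overflow = not (-128 <= diff <= 127)   # condition flagged by the overflow detector
    magnitude = abs(diff)
    if overflow:
        magnitude >>= 1                    # first multiplexer: subtracter output shifted down one bit
    product = s * magnitude                # 8 x 8 bit multiplication
    if overflow:
        product <<= 1                      # second multiplexer: shift up one bit to restore the scale
    return product
```

Because the shifted-down magnitude drops its lowest bit, the restored product can fall short of the exact value by at most s. For example, with s = 2, x = 127 and u = −128 the model yields 508 where the exact product is 510.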

The outputs of multipliers 252, 254 are stored in registers 224, 226 and are output to the remaining circuitry over lines 260, 262 by way of adder 206.

The same special-purpose hardware module 200 can also be employed to compute the inner product of two vectors, as required in matrix multiplication. This is accomplished by gating circuits 264, 266, which enable the subtract and absolute value circuits 228, 230 to be bypassed. In this mode of operation, the data "x" and "s" input buses are applied directly to the pipeline registers 220, 222 as the multiplier inputs.

The word-level detection processing is described next.

The "spelling" of a keyword, according to the preferred embodiment of the invention, is an ordered sequence of reference pattern names, or "phones" (the target patterns), together with a minimum and a maximum dwell time (duration) associated with each phone in the spelling. Matching of the unknown input pattern sequence against a keyword spelling is performed by forcing each input pattern to belong to some phone of the spelling. The degree of "belonging" is measured by the likelihood score of the pattern relative to the phone. At each new input spectrum frame, an entire "word score" is computed relative to each keyword spelling as follows.

Referring to FIG. 4, the current frame (corresponding to circle 402) is hypothesized to be the end of a keyword.

The word score for each keyword of the vocabulary is then determined as follows. The first contribution to the word score is the likelihood score of the present input pattern relative to the final phone of the keyword spelling or relative to the immediately preceding phone, whichever is better (smaller).

The next frame backward in time (corresponding to circle 404) is then examined. If the minimum dwell time of the current phone has not yet elapsed, the contribution of the current (that is, immediately preceding) pattern is the better of (a) the likelihood score relative to the current phone and (b) the likelihood score relative to the immediately preceding phone. This contribution is added to the partial word score. If the minimum dwell time has elapsed, the likelihood scores relative to the current phone and the immediately preceding phone are examined. If the score of the preceding phone is better, the preceding phone becomes the current phone (over one of paths 406), its score is the one accumulated into the word score, and the minimum and maximum dwell times are reset. Otherwise the current phone remains current and its likelihood score is added to the word score. If the maximum dwell time for the current phone has elapsed, a penalty is added to the word score and the immediately preceding phone becomes the current phone. The analysis is complete when there is no preceding phone, and the final word score is the content of the word score accumulator divided by the number of frames from start to finish (that is, the average likelihood score per frame relative to the spelling).

A detection threshold on the word score is set to establish a trade-off between detection probability and false alarm probability. If the word score relative to any spelling is better than the threshold value, a detection is declared at 420 (FIG. 3). If two or more detections occur within too short a time interval, arbitration logic selects the best of the overlapping detections.
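The thresholding and arbitration step can be sketched as below. The text does not specify how the arbitration logic works, so the greedy best-first rule, the `min_separation` parameter and the tuple layout are all assumptions made for illustration; smaller scores are taken to be better, as elsewhere in the description.

```python
def arbitrate(detections, threshold, min_separation):
    """Keep detections whose score beats the threshold, resolving overlaps.

    detections: list of (frame, keyword, score) tuples; smaller scores are better.
    Two detections closer than min_separation frames are treated as overlapping.
    """
    candidates = [d for d in detections if d[2] < threshold]
    accepted = []
    for det in sorted(candidates, key=lambda d: d[2]):   # best score first
        if all(abs(det[0] - kept[0]) >= min_separation for kept in accepted):
            accepted.append(det)
    return sorted(accepted)   # report in time order
```

Raising the threshold admits more candidates, which increases both the detection probability and the false alarm probability.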

Since the hypothesized "current" phone changes monotonically through the word spelling and never retreats to an earlier state, the above description of the word detection method can be recast as a dynamic programming problem.

The dynamic programming approach is described next.

Referring to FIG. 4A, according to the dynamic programming approach, keyword recognition can be represented as the problem of finding an appropriate path through an abstract state space. In the figure, each circle represents a possible state, also designated a dwell time position or register, through which the decision-making process can pass.

The space between the vertical dashed lines 520, 522 represents each of the hypothetical states through which the decision-making process can pass in determining whether a pattern does or does not match the current phone.

This space is divided into a required dwell time portion 524 and an optional dwell time portion 526. The required dwell time portion is the minimum duration of the particular "current" phone or pattern. The optional dwell time portion represents the additional maximum duration of the pattern. Each circle within the optional or required dwell time portion represents one frame time of the continuum of formed frames and corresponds to the 0.01 second interval from frame to frame. Each circle thus identifies a hypothesized current phonetic position in a keyword spelling and, together with the number of (0.01 second) frames hypothesized to have elapsed since the current phone began (which corresponds to the number of earlier "circles" or positions in that phone or target pattern), represents the present duration of the pattern. After a pattern (phone) has begun and the minimum dwell time interval has elapsed, there are several possible paths for advancing to the first node or position (circle) 528 of the next target pattern (phone). Which path is taken depends on when the decision to move to the next pattern (phone) of the spelling is made. These decision possibilities are represented in the figure by the several arrows leading to circle 528. The transition to the next pattern (phone), whose beginning is represented by circle 528, can be made from any node or position in the optional dwell time of the current pattern (phone) or from the last node of the required dwell time interval.

The keyword recognition method described in U.S. Pat. Nos. 4,241,329, 4,227,176 and 4,227,177 makes the transition at the first node for which the likelihood score relative to the next pattern (phone) is better than the likelihood score relative to the current pattern (phone).

That is, the transition is made when a frame matches the next phone or pattern better than the current phone or pattern. The total word score, however, is the average pattern (phone) score per frame (that is, per node included in the path).

The same definition of "total score", applied to the word score up to the current node, can be used to decide when the transition should be made.

That is, it can be used to decide whether the transition to the next pattern should be made at the first opportunity, corresponding for example to transition indicating line 530, or at a later time, corresponding for example to transition indicating line 532. Optimally, one chooses the path into the next pattern (phone) for which the average score per node is best. Since the standard keyword method described in U.S. Pat. Nos. 4,241,329, 4,227,176 and 4,227,177 does not examine any of the potential paths after it has made the decision to move to the next pattern (phone), it may make only a suboptimal decision as measured by the average score per node.

Accordingly, the present invention employs an average score per node strategy for keyword recognition, and a detection is recorded whenever the average score per node for the "best ending node" of the last pattern of a keyword exceeds a predetermined threshold.

The dynamic programming approach requires, at each analysis time frame, the likelihood score that some vocabulary word is just beginning (that is, that some preceding word or other sound has just ended). In a closed vocabulary task, providing this score is a straightforward matter. In the keyword task, however, neither reference patterns for all expected sounds nor definitions for all possible words are available.

Several strategies for providing the input score are possible.

To illustrate them, certain features of the dynamic programming method need to be explained further. The method is implemented by an ordered array of accumulators A(1), A(2), ..., each of which stores a score corresponding to a particular sequence of patterns and pattern durations. The contents of the i-th accumulator at the analysis frame at time t are designated A(i,t). The likelihood score at time t for the reference pattern associated with the i-th accumulator is designated L(i,t).

The recurrence formula for an accumulator that does not correspond to the first frame (that is, the beginning) of a target pattern is as follows:

$$A(i,t) = L(i,t) + A(i-1,\,t-1)$$

The first accumulator A(n,t) of the next target pattern is supplied with the best (minimum) score of the available accumulators of the preceding pattern (that is, those accumulators from which a transition to the next pattern can be made).

That is,

$$A(n,t) = L(n,t) + \min_{i=m,\ldots,n-1} A(i,\,t-1)$$

In this way the optimum duration for the target pattern is found.
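The two recurrences can be sketched as one per-frame update. The layout below (one row of accumulators per pattern, one accumulator per dwell position, unreached states initialized to infinity) and all names are assumptions made for illustration, not the patent's register arrangement.

```python
from math import inf

def update_accumulators(prev, frame_scores, dwell, entry_score):
    """Advance all accumulators of one keyword by one analysis frame.

    prev[p][d]      accumulator for pattern p after d + 1 frames in that pattern
    frame_scores[p] likelihood score L of the current frame for pattern p
    dwell[p]        (min_dwell, max_dwell) in frames for pattern p
    entry_score     input score available to the first pattern at the previous frame
    """
    new = []
    for p, (min_d, max_d) in enumerate(dwell):
        if p == 0:
            best_in = entry_score
        else:
            prev_min = dwell[p - 1][0]
            # a transition may leave the preceding pattern from its last required
            # position or from any optional position
            best_in = min(prev[p - 1][prev_min - 1:])
        row = [frame_scores[p] + best_in]                                    # first accumulator
        row += [frame_scores[p] + prev[p][d - 1] for d in range(1, max_d)]   # A(i,t) = L + A(i-1,t-1)
        new.append(row)
    return new

def best_exit_score(state, dwell):
    """Best (minimum) score among the positions from which the last pattern may end."""
    min_d = dwell[-1][0]
    return min(state[-1][min_d - 1:])
```

Starting from a state filled with `inf` and calling `update_accumulators` once per frame propagates scores only along paths that respect the dwell limits; `best_exit_score` then gives the accumulated score at the best ending node.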

As indicated above, the word score for a keyword to be detected is the average likelihood score per analysis frame. It is obtained by taking the difference between the output score of the last target pattern at the current analysis frame (the score that would be fed to the accumulator of the next pattern, if there were one) and the input score at the time the word began, and dividing that difference by the duration of the word. The duration of the word associated with the accumulated word score, as well as the target pattern length, can be carried forward and updated from register to register.

The accumulator for the input score corresponding to the first register of the first pattern of a keyword is designated A(0,t). The simplest strategy is to use a linear ramp function of constant slope C as the input likelihood score to the keyword recognition process. The contents of the successive accumulators for this strategy are shown in the following table.

    Time   Accumulator contents
     t     A(0,t)   A(1,t)        A(2,t)
     0     0        0             0
     1     C        L(1,1)        L(2,1)
     2     2C       L(1,2)+C      L(2,2)+L(1,1)
     3     3C       L(1,3)+2C     L(2,3)+L(1,2)+C

At any time, the number of additions that have been performed in each accumulator is the same for all accumulators, so there is no bias due to initialization. The effect of C can be seen by noting that if C is small, A(1,t) tends to contain a better score than A(2,t), whereas if C is large, A(2,t) contains the better score.

As a result, the optimum pattern durations found by the method may be biased to be too long or too short. Since the ramp function propagates through all of the accumulators, the durations of all the patterns in the word will be similarly biased.
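The ramp-input strategy can be reproduced numerically with a short sketch. It assumes, as the table does, a simple chain of three accumulators in which each accumulator is fed by its predecessor; the function name and data layout are hypothetical.

```python
def ramp_table(L, C, frames):
    """Return A[t][i] for accumulators 0..2 with a ramp input A(0,t) = C * t.

    L[i][t] is the likelihood score of frame t for the pattern of accumulator i.
    L[0] and index 0 of each list are unused so that indices match the text.
    """
    A = [[0.0, 0.0, 0.0]]                  # t = 0: all accumulators cleared
    for t in range(1, frames + 1):
        prev = A[t - 1]
        A.append([C * t,                   # ramp input
                  L[1][t] + prev[0],       # A(1,t) = L(1,t) + A(0,t-1)
                  L[2][t] + prev[1]])      # A(2,t) = L(2,t) + A(1,t-1)
    return A
```

For example, with L(1,t) = 1, 2, 3 and L(2,t) = 10, 20, 30 for t = 1, 2, 3 and C = 0.5, the third row is [1.5, 4.0, 32.5], that is, 3C, L(1,3)+2C and L(2,3)+L(1,2)+C.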

Instead of accumulating a constant ramp, the contents of the first accumulator can be recirculated, adding not a constant but the likelihood score of the current signal relative to the first reference pattern of the desired keyword. The contents of all the remaining accumulators are then correct to within a constant, which is subtracted when the word score per frame is determined. This method is illustrated in the following table.

    Time   Accumulator contents
     t     A(0,t)                   A(1,t)                   A(2,t)
     0     0                        0                        0
     1     L(0,1)                   L(1,1)                   L(2,1)
     2     L(0,2)+L(0,1)            L(1,2)+L(0,1)            L(2,2)+L(1,1)
     3     L(0,3)+L(0,2)+L(0,1)     L(1,3)+L(0,2)+L(0,1)     L(2,3)+L(1,2)+L(0,1)
     4                                                       L(2,4)+L(1,3)+L(0,2)+L(0,1)

With this method, the choice of the optimum pattern durations for the second and subsequent patterns of the keyword is independent of the duration of the first pattern. On the other hand, it is impossible to tell from the contents of the accumulators how long the first pattern should be. This fact is made evident by the substitution L(2,t) = L(1,t) = L(0,t) in the above table.

Although three accumulators are provided, so that the pattern could have a duration of up to three analysis frames, all three accumulators always contain the same score, and there is no unique minimum to choose. This problem affects only the evaluation of the total word score; it does not affect the classification of subsequent patterns, for example for gathering the statistics of the reference patterns. The presently preferred embodiment uses this method, with an arbitrary constant duration assigned to the first pattern of each keyword.
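The ambiguity described above can be checked with a small sketch of the recirculating strategy. The simple three-accumulator chain, the function name and the sample scores are illustrative assumptions.

```python
def recirculated_table(L, frames):
    """Return A[t][i] for accumulators 0..2 when A(0,t) recirculates its own contents.

    L[i][t] is the likelihood score of frame t for accumulator i
    (index 0 of each list is unused so that indices match the text).
    """
    A = [[0.0, 0.0, 0.0]]
    for t in range(1, frames + 1):
        prev = A[t - 1]
        A.append([L[0][t] + prev[0],   # A(0,t) = L(0,t) + A(0,t-1)
                  L[1][t] + prev[0],   # A(1,t) = L(1,t) + A(0,t-1)
                  L[2][t] + prev[1]])  # A(2,t) = L(2,t) + A(1,t-1)
    return A

# Substitution L(2,t) = L(1,t) = L(0,t): the three accumulators share one pattern.
scores = [0, 1.0, 2.0, 4.0]
table = recirculated_table([scores, scores, scores], 3)
```

Every row of `table` then holds three identical values, so the accumulator contents alone give no basis for choosing a duration for the first pattern.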

The training of the reference patterns is described next.

To obtain the sample means u and variances s' used to construct the reference patterns, a number of utterances of each vocabulary word are entered into the speech recognition system, and the ensemble statistics of the corresponding preprocessed spectrum frames are evaluated. Crucial to the successful operation of the apparatus is the choice of which input spectrum frames should correspond to which target or reference patterns.

In the absence of better information, such as significant acoustic phonemes chosen by a human for the input word, the time interval between the beginning and the end of a spoken word is divided into a number of uniformly spaced subintervals. Each of these subintervals is made to correspond to a unique reference pattern. One or more three-frame patterns beginning in each interval are formed and classified according to the reference pattern associated with that interval. Subsequent examples of the same vocabulary word are similarly divided into the same number of uniformly spaced intervals. The mean values and variances of the elements of the three-frame patterns extracted from correspondingly ordered intervals are accumulated over all available examples of the vocabulary word to form the set of reference patterns for that word. The number of intervals (the number of reference patterns) should be about two or three per linguistic phoneme contained in the vocabulary word.
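The uniform-interval initialization can be sketched as follows. To keep the example short, each frame is reduced to a single scalar feature instead of a three-frame pattern of spectral elements; that simplification and all names are assumptions made for illustration.

```python
from statistics import fmean, pvariance

def interval_index(frame, start, end, n_intervals):
    """Map a frame index in [start, end) to one of n uniformly spaced subintervals."""
    return min((frame - start) * n_intervals // (end - start), n_intervals - 1)

def train_reference_patterns(examples, n_intervals):
    """Accumulate a mean and a variance per interval over all examples of one word.

    examples: list of utterances, each a list of per-frame feature values.
    Every utterance is assumed to contain at least n_intervals frames.
    """
    buckets = [[] for _ in range(n_intervals)]
    for frames in examples:
        for t, value in enumerate(frames):
            buckets[interval_index(t, 0, len(frames), n_intervals)].append(value)
    return [(fmean(b), pvariance(b)) for b in buckets]
```

Utterances of different lengths are aligned purely by relative position within the word, which is why this stage only initializes the patterns and is followed by further training passes.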

For best results, the starting points of the keywords are marked by a procedure that includes human examination of the recorded audio waveform and spectrum frames. To carry out this procedure automatically, the words must be spoken one at a time and bounded by silence, so that the apparatus can find the word boundaries accurately. The reference patterns can be initialized from one such sample of each word spoken in isolation, with all variances set to a convenient constant in the reference patterns. Thereafter, the training material can comprise utterances typical of those to be recognized, with word boundaries as found by the recognition process.

After statistics from a suitable number of training utterances have been accumulated, the reference patterns so found are used in place of the initial reference patterns.

A second pass through the training material is then made. This time the words are divided into intervals on the basis of the decisions made by the recognition processor, as in FIG. 3. Every three-frame input pattern (or one typical input pattern for each reference pattern) is associated with some reference pattern by the pattern matching method described above. The mean values and variances are accumulated a second time to form the final set of reference patterns, which are thus derived in a manner wholly compatible with the way in which they are used by the recognition apparatus.

The minimum (required) and maximum (required plus optional) dwell times are preferably determined during the training process. In the preferred embodiment of the invention, the apparatus is trained as described above, using several speakers. Further, as described above, the recognition process automatically determines the pattern boundaries during the training procedure. The boundaries are thus recorded, and the dwell times for each keyword identified by the apparatus are stored.

At the end of a training run, the dwell times for each pattern are examined and the minimum and maximum dwell times for the pattern are chosen. In the preferred embodiment of the invention, a histogram of the dwell times is generated and the minimum and maximum dwell times are set at the twenty-fifth and seventy-fifth percentiles. This provides high recognition accuracy while maintaining a low false alarm rate. Other choices of the minimum and maximum dwell times are also possible, there being a trade-off between recognition accuracy and false alarm rate. Thus, if a smaller minimum dwell time and a larger maximum dwell time are chosen, a higher recognition accuracy generally results, at the cost of a correspondingly higher false alarm rate.
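Choosing the dwell limits from the recorded dwell times can be sketched with the standard library. The text does not say how the percentiles are interpolated, so the default method of `statistics.quantiles` and the rounding to whole frames are assumptions, as is the function name.

```python
from statistics import quantiles

def dwell_limits(observed_dwell_frames):
    """Return (min_dwell, max_dwell) as the 25th and 75th percentiles, in frames."""
    q1, _median, q3 = quantiles(observed_dwell_frames, n=4)
    return round(q1), round(q3)
```

For the observed dwell times 3, 4, 4, 5, 5, 5, 6, 6, 7, 8, 9 frames this returns (4, 7). Widening the percentile range would admit more duration variation at the cost of more false alarms.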

An apparatus implemented using the present speech recognition method is described next.

As indicated previously, a presently preferred embodiment of the invention was constructed in which the signal and data manipulation beyond that performed by the preprocessor of FIG. 2 was implemented on, and controlled by, a Digital Equipment Corporation PDP-11 computer working in combination with a special-purpose vector computer processor such as that described in U.S. Pat. No. 4,228,498.

In addition to the use of computer programming, the method of the invention can be implemented in hardware.

Referring to FIG. 9, in one particular hardware embodiment of the invention, likelihood data from the likelihood data generating processor is provided over lines 300 to a memory 302. The memory 302 has sufficient storage capacity to record the likelihood scores of an input frame pattern relative to each of the target patterns of the vocabulary keywords being detected. This likelihood score input data is available from the processor over lines 300 and is transferred to memory 302 at a high data rate and in a predetermined sequence.

The data is stored in memory 302 according to the address output signal supplied over line 304 from address counter 306. Address counter 306 is incremented by pulse signals on count line 308, which are synchronized with the data on line 300, and is reset to a predetermined initial address by a reset signal on line 310.

The illustrated embodiment of FIG. 9 has target pattern shift register memories 312(a), 312(b), ..., 312(n). Each shift register memory can store, for a particular target pattern, the likelihood score data for each frame of the preceding 2.56 seconds of processed audio signal. Memories 312 are forward-backward shift registers that can be loaded with data over input line 314 (in the backward shift mode) or over input line 315 (in the forward shift mode). The output of each shift register memory 312 is available on output line 316 as the memory data is shifted in the "forward" direction.

In operation, with each (forward) clock pulse on line 318, memories 312 shift the contents of the register one data position "forward", that is, closer to the output on line 316.

Correspondingly, with each (backward) clock pulse on line 319, each memory 312 shifts its contents one position backward, that is, closer to the input on line 315. In the illustrated embodiment, each memory has positions for storing 2.56 seconds of likelihood score data.

The output of each memory 312 on line 316 is connected through a respective gate element 321 to a controlled comparison circuit 320. These comparison circuits 320, which are described in detail in connection with FIG. 10, each provide as outputs an accumulated, normalized word score on line 322 and a word score accumulation complete signal on line 324.

When the word score accumulation complete signal is available on all of lines 324 from each comparison circuit 320, indicating that keyword recognition processing according to the method described above has been completed, the keywords are examined to determine (a) whether the normalized word score for the current frame has exceeded a predetermined threshold level and (b) whether the words that exceed the threshold level should be considered for further post-decision processing.

There is one comparison circuit 320 for each keyword in the vocabulary. Each comparison circuit 320 thus has as its inputs, over lines 326, the outputs of the memories 312 that correspond to the target patterns of that keyword. As described in more detail below, the comparison circuit, which consists of a multiple comparison multiplexing element 330 and a duration count process control element 332, determines the normalized, accumulated word score for the hypothesis that the keyword ends at the then "current" frame time.

The illustrated shift register memories 312 can be configured for either a recirculating "forward" shift mode or a non-recirculating "backward" shift mode. In the recirculating forward mode, shift register memories 312 receive their inputs over line 315. These inputs are gated into memories 312 through gate elements 333 and 321.

In the non-recirculating mode of operation, each memory receives its input from memory 302 over line 314 through gate element 338.

In operation, the memories are first loaded in the backward direction through gate elements 338 to full capacity, that is, 256 likelihood scores are loaded into each memory. This input data is obtained from memory 302 according to the sequential count from address counter 306. In accordance with the address count, gates 338 are selectively enabled by enabling signals selectively supplied over lines 342. The enabling signals on lines 342 are sequenced to gates 338 by a 1-to-n decoding circuit 344, so that each output of memory 302 is stored in the corresponding one of memories 312.

Once the first 256 input likelihood scores for each pattern have been loaded into the respective memories 312 (corresponding, in the illustrated embodiment, to 256 readouts of the contents of memory 302), memories 312 are operated in the forward recirculating mode. The last entry in each shift register (the likelihood score corresponding to the 256th frame) is read out of the memory and passes through the now-enabled gates 321 and 333 to become the input (over line 315) at the other end of the same shift register memory. As memory 312 is repeatedly shifted, the likelihood scores of each target pattern for each of the last 256 frames are thus read out in reverse chronological order and reinserted into the shift register in the same order. After 256 counts on forward shift line 318, the shift register has returned to its original data state. At that point, the next likelihood score, taken from memory 302, which was loaded during the period in which registers 312 were being shifted, is entered into each register in sequence. The new likelihood score is loaded through gate 338 in response to a backward load pulse on line 319.

The oldest likelihood score in memory 312 is then lost.
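The load-and-recirculate cycle of one shift register memory 312 can be modeled in a few lines of software. This is a sketch only: the class and method names are hypothetical, a `collections.deque` stands in for the hardware register, and the right end of the deque is taken to be the output end (lines 316 and 314) while the left end is the recirculation input (line 315).

```python
from collections import deque


class RecirculatingScoreRegister:
    """Software model of one forward-backward shift register memory."""

    def __init__(self, capacity=256):
        # With maxlen set, appending on the right silently drops the
        # leftmost (oldest) entry once the register is full.
        self._cells = deque(maxlen=capacity)

    def load_backward(self, score):
        """Backward-shift load: the newest score enters at the output end."""
        self._cells.append(score)

    def read_recirculating(self):
        """Forward-shift once per stored score, feeding each output back in.

        Returns the scores newest first (reverse chronological order) and
        leaves the register in its original state.
        """
        out = []
        for _ in range(len(self._cells)):
            score = self._cells.pop()       # value at the output end
            self._cells.appendleft(score)   # recirculate to the other end
            out.append(score)
        return out


reg = RecirculatingScoreRegister(capacity=4)
for s in (1, 2, 3, 4):
    reg.load_backward(s)
first_pass = reg.read_recirculating()
reg.load_backward(5)                        # oldest score (1) is dropped
second_pass = reg.read_recirculating()
```

A capacity of 4 is used here only to keep the example small; the illustrated embodiment stores 256 scores per register.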

Shift registers 312 then contain the second through 257th likelihood scores for each target pattern. These scores are shifted in the same manner as described above. The shifting and loading process continues at each new frame time, so that the likelihood scores are read out at the appropriate time for the processing described below.

Referring to FIG. 10, the outputs of each group of registers 312 representing a keyword are made available over lines 326 to a respective controlled two-pole multiplexer switch 360. The operation of multiplexer 360 is as follows. At the beginning of each frame time, each multiplexer 360 is reset by a reset signal on line 362. In response to the reset signal on line 362, output lines 364 and 366 of multiplexer 360 are connected to the first input lines, here lines 326(a) and 326(b), respectively. At the beginning of the frame time, the data on line 326(a) represents the input likelihood score during the "current" frame time for the last target pattern of the keyword, and the data on line 326(b) represents the score during the "current" frame time for the immediately preceding target pattern. The outputs of multiplexer 360 on lines 364 and 366 are supplied to a numerical comparison element, for example arithmetic element 368. Element 368 provides the "better" input score on line 370 and identifies on line 372 which of lines 364 and 366 carried the better input score.

The better score is added to the contents of adder 374.

(The contents of adder 374 are reset to zero at the beginning of each frame time by the reset signal on line 362.) The accumulated likelihood score is then "normalized" by dividing the contents of adder 374, which are available on line 375 and represent the total accumulated score, by the number N of likelihood scores that have been accumulated. This division is performed in divider circuit 376. The output of divider circuit 376 on line 378 represents the average score per node and is used in determining whether the respective keyword is a possible detected keyword candidate.

The output of comparison circuit 368 on line 372 is used, together with the minimum and maximum dwell times, to determine whether multiplexer 360 should be incremented to the next two input likelihood scores, that is, the likelihood scores available on lines 326(b) and 326(c) (corresponding respectively to the target pattern immediately before the last target pattern of the keyword and the one before that, i.e. the second and third from last).

The signal level on line 372 is also used, together with the maximum dwell time signal on line 380, to determine whether a penalty should be added to the accumulated word score in adder 374. Thus, if the maximum dwell time for the then-current target pattern has elapsed and the "better" score is the likelihood score on line 364, gate 382 is actuated and a penalty count is added to the accumulated score in adder 374.

A programmed dwell time monitoring element 386, which is essentially a counter, receives the minimum and maximum dwell times for the various target patterns on lines 388 and 390. When the count in counter 386 exceeds the minimum dwell time, a signal level is placed on minimum-dwell-time-elapsed line 392. When the maximum dwell time for the current target pattern has elapsed, a corresponding signal level is placed on line 380, as noted above. Counter 386 is reset each time multiplexer 360 is incremented to the next pair of lines by the signal on line 394 (described below), and is incremented each time recirculating memory 312 is shifted by a count pulse on line 318. Word length counter 396 supplies the word length to divider 376 on line 397. Counter 396 is reset at the beginning of each frame time by the reset signal on line 362 (which corresponds to the shift-register backward signal on line 319) and is incremented each time recirculating memory 312 is shifted by a pulse on line 318.

According to the invention, the minimum and maximum dwell times, together with the signal level on line 372, control the incrementing of multiplexer 360. For example, if the current target pattern has the "better" score, as indicated by the scores on lines 364 and 366, a "next" signal is obtained only if the maximum dwell time has elapsed. (This is effected by a pulse on line 318 passing through gate 398 to line 394.) On the other hand, if only the minimum dwell time has elapsed but the "better" score is on line 366, that is, the immediately preceding target pattern is "better", the pulse on line 318 passes through gate 400 and produces a next signal on line 394, and the multiplexer is again incremented to the next pair of target pattern likelihood score input lines. In all other circumstances, the multiplexer is not incremented in the illustrated embodiment.

When multiplexer 360 is at the final pair of input lines 326, lines 326(x-1) and 326(x) in the illustrated embodiment, receipt of the next signal on line 394 causes an "end" signal to be generated on line 324. This end signal has the effect of freezing the output of divider 376 and of signaling that a score for the keyword has been obtained. As noted above, when end signals have been obtained from all of the comparison circuits 320, the best scores are examined and a decision is made according to the criteria described above. This decision is made before the start of the next frame time, preferably in real time, and the entire procedure then begins again.
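The accumulation and dwell-time control described for FIG. 10 can be sketched in software as follows. This is a minimal illustration rather than the circuit itself: the function name, argument layout, and default penalty are hypothetical. It assumes that lower likelihood scores are better (the text only says "better"), treats a dwell time as elapsed once the dwell count reaches it, and requires at least two target patterns per keyword.

```python
def score_keyword_ending_now(scores, min_dwell, max_dwell, penalty=1.0):
    """Normalized word score for the hypothesis that a keyword ends now.

    scores[p][k]: likelihood score of target pattern p (0 = first pattern
        of the keyword) for the frame k steps before the current frame.
    min_dwell[p], max_dwell[p]: dwell limits (in frames) for pattern p.
    Returns total / frame count when the "end" signal occurs, or None if
    the stored frames run out first.
    """
    if len(scores) < 2:
        raise ValueError("this sketch needs at least two target patterns")
    current = len(scores) - 1   # multiplexer reset: last pattern is current
    total = 0.0                 # accumulated score (adder 374)
    word_length = 0             # frames accumulated (counter 396)
    dwell = 0                   # frames on current pattern (counter 386)

    for k in range(len(scores[0])):
        cur_score = scores[current][k]
        prev_score = scores[current - 1][k]
        current_is_better = cur_score <= prev_score
        total += cur_score if current_is_better else prev_score
        word_length += 1
        dwell += 1

        min_elapsed = dwell >= min_dwell[current]
        max_elapsed = dwell >= max_dwell[current]

        # Penalty when the current pattern still wins after its maximum dwell.
        if max_elapsed and current_is_better:
            total += penalty

        # "Next" signal: forced at maximum dwell, or taken after minimum
        # dwell when the preceding pattern scores better.
        advance = max_elapsed if current_is_better else min_elapsed
        if advance:
            if current == 1:    # already at the final pair: "end" signal
                return total / word_length
            current -= 1
            dwell = 0
    return None


frames_back = [
    [9, 9, 9, 9, 1, 1],   # first target pattern
    [9, 9, 1, 1, 9, 9],   # middle target pattern
    [1, 1, 9, 9, 9, 9],   # last target pattern
]
word_score = score_keyword_ending_now(frames_back, [1, 1, 1], [5, 5, 5])
```

In the example, the last pattern matches the two most recent frames, the middle pattern the two before that, and the end signal occurs when the first pattern becomes the better match, giving an average score of 1.0 over five frames.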

From the foregoing, it will be seen that the several objects of the present invention have been achieved and that other advantageous results have been obtained.

It will be understood that the keyword recognition method and apparatus described herein can include isolated speech recognition as a special application. Additions, deletions, and other modifications of the described preferred embodiments will be apparent to those skilled in the art and are within the scope of the following claims.

[Brief explanation of the drawing]

FIG. 1 is a flow chart illustrating in general terms the sequence of operations performed in accordance with the invention; FIG. 1A is a schematic block circuit diagram of a preferred embodiment of the invention; FIG. 2 is a schematic block diagram of electronic apparatus for performing certain processing operations in the overall process illustrated in FIG. 1; FIG. 3 is a flow diagram of a digital computer program performing certain procedures in the process of FIG. 1; FIG. 4 is a diagram of the alignment process of a preferred embodiment of the invention; FIG. 4A is a diagram of the alignment process according to the dynamic programming method of the invention; FIG. 5 is an electrical block diagram of the likelihood function processor of a preferred embodiment of the invention; FIG. 6 is an electrical schematic block diagram of the subtract and absolute value circuit of a preferred embodiment of the invention; FIG. 7 is an electrical circuit diagram of the overflow detection logic circuit of a preferred embodiment of the invention; FIG. 8 is a truth table for the circuit diagram of FIG. 7; FIG. 9 is a block diagram showing the sequential decoding pattern alignment circuit of a preferred embodiment of the invention; and FIG. 10 is an electrical circuit diagram of a particular hardware embodiment for implementing the duration and comparison control process of the invention.

Reference numerals:
13: A/D converter
45: control processor
46: preprocessor
48a: vector processor
48b: likelihood function processor
49: sequential decoding processor
51: clock oscillator
52: frequency divider
53: latch
56: digital multiplier
58: 32-word circulating shift register
59: multiplexer
60: B select circuit
63: 32-word shift register memory
65: 32-bit adder
67: gate
71: computer interrupt circuit
73: interface

Engrossed copy of the drawings (no change in content)

Procedural Amendment (Formality)
May 24, 1983
To: Kazuo Wakasugi, Commissioner of the Patent Office
Indication of the case: Patent Application No. 550 of 1983
Title of the invention: Speech recognition method and apparatus
Person making the amendment: relationship to the case: patent applicant; name: Exxon Corporation
Agent
Date of notice of amendment order: April 26, 1983
Subject of the amendment: the applicant column of the application; the power of attorney and its translation (one copy each); the drawings (one copy); the "Detailed Description of the Invention" and "Brief Description of the Drawings" columns of the specification
Contents of the amendment: as per the attached sheet; engrossed copy of the drawings (no change in content)

The description of the present specification is amended as follows.

1. On page 63, line 7, "FIG. 8" is corrected to "the following".
2. The following table is inserted between lines 7 and 8 on page 63.

    1  1  1  0
    1  1  0  0
    1  0  1  1  (overflow)
    1  0  0  0
    0  1  1  0
    0  1  0  1  (overflow)
    0  0  1  0
    0  0  0  0

3. On page 82, line 8, "FIG. 9" is corrected to "FIG. 8".
4. On page 83, line 6, "FIG. 2" is corrected to "FIG. 8".
5. On page 84, line 14, "FIG. 10" is corrected to "FIG. 9".
6. On page 88, line 11, "FIG. 10" is corrected to "FIG. 2".
7. From the last line of page 94 to line 1 of page 95, "FIG. 8 is a truth table for the circuit diagram of FIG. 7" is deleted.
8. On page 95, line 1, "FIG. 9" is corrected to "FIG. 8".
9. On page 95, line 2, "FIG. 10" is corrected to "FIG. 9".

Claims

(1) In a speech analysis system for recognizing at least one predetermined keyword in an audible signal, each keyword being characterized by a template having at least one target pattern, each target pattern representing at least one short-term power spectrum, and each target pattern having associated with it a minimum dwell time and a maximum dwell time, a recognition method comprising the steps of: forming from the audible signal, at a repetitive frame rate, a series of frame patterns representing the audible signal, each frame pattern being associated with a frame time; generating, for each frame pattern, a numerical measure of the similarity of the frame pattern to selected ones of the target patterns; accumulating, for each frame time and each keyword, using the numerical measures, a numerical word score representing the likelihood that the keyword ended at that frame time, the accumulating step including accumulating, for each keyword, the numerical measures for each of a successive series of the repetitively formed frame patterns, starting with the numerical measure of similarity between the current frame pattern and the last target pattern of the keyword; and generating at least a preliminary keyword recognition decision whenever the numerical word score for a keyword exceeds a predetermined recognition level.

(2) The method of claim 1, wherein the accumulating step comprises the steps of: first adding, to the word score accumulated at each of a series of frame patterns not exceeding the minimum dwell time for the then-current target pattern, the numerical measure representing the similarity of the frame pattern to the then-current target pattern; second adding, to the word score accumulated at each frame pattern occurring at a frame time beyond the minimum dwell time of the then-current target pattern, the better of the numerical measure representing the similarity of the frame pattern to the then-current target pattern and the numerical measure representing the similarity of the frame pattern to the immediately preceding target pattern; updating the then-current target pattern by designating the immediately preceding target pattern as the new then-current target pattern when the numerical measure for the immediately preceding target pattern is better than the numerical measure for the then-current target pattern; and designating the immediately preceding target pattern as the new then-current target pattern whenever the maximum dwell time for the then-current target pattern is exceeded.

(3) The method further comprising the steps of: maintaining a frame count of the number of pattern frames used in determining the numerical word score for a keyword; and generating a normalized word score by dividing the accumulated numerical word score for the keyword by the number of pattern frames used in generating the score.

(4) The method of claim 3, wherein the second adding step includes adding a penalty value to the accumulated score for a keyword whenever the maximum dwell time of a target pattern element for the keyword is exceeded.

(5) In a speech analysis system for recognizing at least one predetermined keyword in an audible signal, each keyword being characterized by a template having at least one target pattern, each target pattern representing at least one short-term power spectrum, and each target pattern having associated with it a minimum dwell time and a maximum dwell time, a recognition apparatus comprising: means for forming from the audible signal, at a repetitive frame rate, a series of frame patterns representing the audible signal, each frame pattern being associated with a frame time; means for generating, for each frame pattern, a numerical measure of the similarity of the frame pattern to selected ones of the target patterns; means for accumulating, for each frame time and each keyword, using the numerical measures, a numerical word score representing the likelihood that the keyword ended at that frame time, the accumulating means including means for accumulating, for each keyword, the numerical measures for each of a successive series of the repetitively formed frame patterns, starting with the numerical measure of similarity between the current frame pattern and the last target pattern of the keyword; and means for generating at least a preliminary keyword recognition decision whenever the numerical word score for a keyword exceeds a predetermined recognition level.

(6) The apparatus of claim 5, wherein the accumulating means comprises: first means for adding, to the word score accumulated at each of a series of frame patterns not exceeding the minimum dwell time for the then-current target pattern, the numerical measure representing the similarity of the frame pattern to the then-current target pattern; second means for adding, to the word score accumulated at each frame pattern occurring at a frame time beyond the minimum dwell time of the then-current target pattern, the better of the numerical measure representing the similarity of the frame pattern to the then-current target pattern and the numerical measure representing the similarity of the frame pattern to the immediately preceding target pattern; means for updating the then-current target pattern by designating the immediately preceding target pattern as the new then-current target pattern when the numerical measure for the immediately preceding target pattern is better than the numerical measure for the then-current target pattern; and means for selecting the immediately preceding target pattern as the new then-current target pattern whenever the maximum dwell time for the then-current target pattern is exceeded.

(7) The apparatus further comprising: a counter for maintaining a frame count of the number of pattern frames used in determining the numerical word score for a keyword; and means for generating a normalized word score by dividing the accumulated numerical word score for the keyword by the number of pattern frames used in generating the score.

(8) The apparatus of claim 7, wherein the second adding means includes means for adding a penalty value to the accumulated score for a keyword whenever the maximum dwell time of a target pattern element for the keyword is exceeded.

(9) In a speech analysis system for recognizing at least one keyword in an audible signal, each keyword being characterized by a template having at least one target pattern, each target pattern representing at least one short-term power spectrum, and each target pattern having associated with it at least one required dwell time position and at least one optional dwell time position, a recognition method comprising the steps of: forming from the audible signal, at a repetitive frame time, a series of frame patterns representing the audible signal; generating a numerical measure of the similarity of each frame pattern to each of the target patterns; accumulating, for the second and subsequent required dwell time positions of each target pattern and for each optional dwell time position of each target pattern, the sum of the accumulated score for the previous target pattern dwell time position during the previous frame time and the current numerical measure associated with the target pattern; accumulating, for the first required dwell time position of the first target pattern of each keyword, the sum of the score of the first required dwell time position during the previous frame time and the current numerical measure associated with the first target pattern of the keyword; accumulating, for the first required dwell time position of each other target pattern, the sum of the best ending accumulated score for the previous target pattern of the same keyword and the current numerical measure associated with the target pattern; and generating a recognition decision based on the accumulated values at the possible word endings of the last target pattern of each keyword.

(10) The method of claim 9, including the step of storing, in association with each dwell time position accumulated score, a word duration count corresponding to the time position length of the keyword associated with the accumulated score at that dwell time position.

(11) The method of claim 10, including the step of storing, in association with each dwell time position accumulated score, a target pattern duration count corresponding to the position sequence of the dwell time position within the target pattern.

(12) In a speech analysis system for recognizing at least one keyword in an audible signal, each keyword being characterized by a template having at least one target pattern, each target pattern representing at least one short-term power spectrum, and each target pattern having associated with it at least one required dwell time position and at least one optional dwell time position, a recognition apparatus comprising: means for forming from the audible signal, at a repetitive frame time, a series of frame patterns representing the audible signal; means for generating a numerical measure of the similarity of each frame pattern to each of the target patterns; first means for accumulating, for the second and subsequent required dwell time positions of each target pattern and for each optional dwell time position of each target pattern, the sum of the accumulated score for the previous target pattern dwell time position during the previous frame time and the current numerical measure associated with the target pattern; second means for accumulating, for the first required dwell time position of the first target pattern of each keyword, the sum of the score of the zero dwell time position and the current numerical measure associated with the first target pattern of the keyword; third means for accumulating, for the first required dwell time position of each other target pattern, the sum of the best ending accumulated score for the previous target pattern of the same keyword and the current numerical measure associated with the target pattern; and means for generating a recognition decision based on the accumulated values at the possible word endings of the last target pattern of each keyword.

(13) The apparatus of claim 12, including means for storing, in association with each dwell time position accumulated score, a word duration count corresponding to the time position length of the keyword associated with the accumulated score at that dwell time position.

(14) The apparatus further comprising second means for storing, in association with each dwell time position accumulated score, a target pattern duration count corresponding to the position sequence of the dwell time position within the target pattern.

(15) In a speech analysis system for recognizing at least one keyword in an audible signal, each keyword being modeled by a template having at least one target pattern, each target pattern representing at least one short-term power spectrum, and each target pattern having associated with it at least one required dwell time position and at least one optional dwell time position, a method comprising the steps of: dividing an incoming audible signal corresponding to a keyword into a plurality of subintervals; making each subinterval correspond to a unique reference pattern; making a second pass through the audible input signal representing the keyword; determining the interval duration of each subinterval; repeating said steps for a plurality of audible input signals representing the same keyword; generating statistical data describing the duration of the reference pattern associated with each subinterval; and determining minimum and maximum dwell times for each reference pattern from the collected statistical data.

(16) The method of claim 15, wherein the subintervals are initially spaced evenly from the beginning to the end of the audible input keyword.