JPH02298998A - Voice recognition equipment and method thereof

Info

Publication number: JPH02298998A
Application number: JP2092371A
Authority: JP
Inventors: Ian Bickerton
Original assignee: Smiths Group PLC
Current assignee: Smiths Group PLC
Priority date: 1989-04-12
Filing date: 1990-04-09
Publication date: 1990-12-11
Also published as: DE4010028A1; GB2230370B; FR2645999A1; GB8908205D0; JP2001000007U; GB2230370A; GB9007067D0; DE4010028C2; FR2645999B1

Abstract

PURPOSE: To recognize speech effectively by having a pattern matching unit analyze the speech signal using both the output of a neural network unit and the word boundary identification from a first analysis, and output a signal representing the words spoken. CONSTITUTION: A memory 17 contains speech information about a vocabulary of recognizable words. A pattern matching unit 16 performs a first analysis of the speech signal, comparing it with the stored vocabulary to identify boundaries between different words and to give a first indication of the words spoken. The apparatus also includes a neural network unit 20 connected to the pattern matching unit 16. The pattern matching unit 16 performs a second analysis of the speech signal using both the output of the neural network unit 20 and the word boundary identification from the first analysis, and provides an output signal representing the words spoken from at least the second analysis.

Description

DETAILED DESCRIPTION OF THE INVENTION

(Technical Field)
This invention relates to speech recognition methods of the kind in which a first analysis of a speech signal is performed to identify boundaries between different words and to give a first indication of the words spoken by comparison with a stored vocabulary.

(Background Art)
In complex equipment having multiple functions, it is useful to be able to control the equipment by spoken commands. This is also useful where the user's hands are occupied with other tasks, or where the user is disabled and cannot use his or her hands to operate conventional mechanical switches and controls.

The problem with speech-controlled equipment is that speech recognition can be unreliable, especially where the speaker's voice is altered by environmental factors such as vibration. This can lead to failure to operate or, worse still, to incorrect operation.

Various techniques are used for speech recognition. One technique involves the use of Markov models, which are useful because they readily allow the boundaries between words in continuous speech to be identified.

In noisy environments, or where speech is degraded by stress on the speaker, Markov model techniques may not provide sufficiently reliable identification of the words spoken. Considerable effort has recently been made to improve the performance of such techniques by noise compensation, interpolation, syntax selection and other methods.

An alternative technique that has been proposed for speech recognition employs neural nets. These neural net techniques are capable of identifying individual words with high accuracy even when the speech is severely degraded. They are not, however, suited to the recognition of continuous speech, because they cannot accurately identify word boundaries.

(Disclosure of the Invention)
It is an object of the present invention to provide improved speech recognition apparatus and an improved speech recognition method.

According to one aspect of the present invention there is provided a speech recognition method of the kind defined above, characterized in that the method includes the steps of performing a second analysis of the speech signal, using neural network techniques and the word boundary identification from the first analysis, to give a second indication of the words spoken, and providing an output signal representing the words spoken from at least the second indication.

The first analysis may be performed using a Markov model. The vocabulary may contain dynamic time warping templates, and the first analysis may be performed using an asymmetric dynamic time warping algorithm.

The first analysis is preferably performed using a plurality of different algorithms. Each algorithm provides a signal indicating the word in the vocabulary memory closest to the speech signal, together with an indication of the confidence that the indicated word is the word spoken, and a comparison is made between the signals provided by the different algorithms. Where the first indication of the words spoken provides a measure of confidence, the output signal may be arranged to respond only to the first indication when the measure of confidence is greater than a predetermined value.

The second analysis may be performed using a multi-layer perceptron technique in conjunction with a neural net.

The output signal may be used to provide feedback of the words spoken to the speaker.

The method may include the step of performing a noise marking algorithm on the speech signal, and may include the step of performing a syntax restriction on the stored vocabulary according to the syntax of previously identified words.

The invention also relates to speech recognition apparatus including a memory containing speech information about a vocabulary of words that can be recognized, and a pattern matching unit that performs a first analysis of a speech signal, comparing the speech signal with the stored vocabulary to identify boundaries between different words and to give a first indication of the words spoken. The apparatus is characterized in that it includes a neural network unit (20) connected to the pattern matching unit (16), that the pattern matching unit (16) performs a second analysis of the speech signal using both the output of the neural network unit (20) and the word boundary identification from the first analysis, and that the pattern matching unit (16) provides an output signal representing the words spoken from at least the second analysis.

Speech recognition apparatus and a method according to the present invention will now be described, by way of example, with reference to the accompanying drawing, which shows the apparatus schematically.

(Embodiment)
The speech recognition apparatus is indicated generally by the reference numeral 1 and receives speech input signals from a microphone 2, such as one mounted in the oxygen mask of an aircraft pilot.

An output signal representing the identified words is supplied by the apparatus 1 to a feedback device 3 and to a utilization device 4. The feedback device 3 may be a visual display or an audible device arranged to inform the speaker of the words as identified by the apparatus 1. The utilization device 4 may be arranged to control a function of the aircraft equipment in response to a spoken command recognized by the utilization device from the output signals of the apparatus.

The signal from the microphone 2 is supplied to a preamplifier 10, which includes a pre-emphasis stage 11 that produces a flat long-term average speech spectrum, ensuring that all the frequency channel outputs occupy a similar dynamic range; the characteristic is nominally flat up to 1 kHz. A switch 12 can be set to give either 3 dB/octave or 6 dB/octave lift at higher frequencies. The preamplifier 10 also includes an anti-aliasing filter 21 in the form of an 8th-order Butterworth low-pass filter with a -3 dB cutoff frequency set at 4 kHz.
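To make the anti-aliasing specification concrete, the following minimal sketch evaluates the textbook magnitude response of an ideal low-pass Butterworth filter, |H(f)|^2 = 1 / (1 + (f/fc)^(2n)), for the stated order (8) and cutoff (4 kHz). The function name and the sample frequencies are illustrative assumptions; the patent does not describe how the filter is realized.

```python
import math

def butterworth_gain_db(freq_hz, cutoff_hz=4000.0, order=8):
    """Gain in dB of an ideal low-pass Butterworth filter at freq_hz."""
    ratio = freq_hz / cutoff_hz
    power_gain = 1.0 / (1.0 + ratio ** (2 * order))
    return 10.0 * math.log10(power_gain)

# Print the gain at a few illustrative frequencies (Hz).
for f in (1000, 4000, 5000, 8000):
    print(f, round(butterworth_gain_db(f), 2))
```

At the cutoff the power gain is exactly one half, which is the -3 dB point quoted in the text, and the steep roll-off above 4 kHz is what suppresses aliasing before analog-to-digital conversion.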

The output from the preamplifier 10 is passed via an analog-to-digital converter 13 to a digital filter bank 14. The filter bank 14 has 19 channels implemented as assembly software on a TMS32010 microprocessor and is based on the JSRU Channel Vocoder described by J. N. Holmes in IEE Proc., Vol. 127, Part F, No. 1, February 1980. The filter bank 14 has unequal channel spacing corresponding approximately to the critical bands of auditory perception in the frequency range 250 to 4000 Hz. The responses of adjacent channels cross at approximately 3 dB below their peaks. At the center of a channel, the attenuation of a neighboring channel is approximately 11 dB.
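The text does not list the channel frequencies, so the sketch below only illustrates what unequal, perceptually motivated spacing looks like. It assumes the common mel-scale formula as a stand-in for the critical-band spacing and divides 250 to 4000 Hz into 19 channels; the actual JSRU vocoder channel frequencies may differ, and all function names are hypothetical.

```python
import math

def hz_to_mel(f):
    """Common mel-scale approximation (an assumed stand-in for critical bands)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def band_edges(n_channels=19, low_hz=250.0, high_hz=4000.0):
    """Return n_channels + 1 band edges equally spaced on the mel scale."""
    lo, hi = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (hi - lo) / n_channels
    return [mel_to_hz(lo + k * step) for k in range(n_channels + 1)]

edges = band_edges()
widths = [b - a for a, b in zip(edges, edges[1:])]
for k, (a, b) in enumerate(zip(edges, edges[1:]), start=1):
    print(f"channel {k:2d}: {a:7.1f} to {b:7.1f} Hz")
```

The channels are narrow at low frequencies and widen toward 4 kHz, which is the qualitative behavior the description attributes to the filter bank.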

The signals from the filter bank 14 are supplied to an integration and noise marking unit 15, which incorporates a noise marking algorithm of the kind described by J. S. Bridle et al. in "A noise compensating spectrum distance measure applied to automatic speech recognition", Proc. Inst. Acoust., Windermere, November 1984. Adaptive noise cancellation techniques to reduce periodic noise may be implemented by the unit 15, which can be useful, for example, in reducing periodic helicopter noise.

The output of the noise marking unit 15 is supplied to a pattern matching unit 16, which performs various pattern matching algorithms. The pattern matching unit 16 is connected to a vocabulary memory 17, which contains dynamic time warping (DTW) templates and Markov models of each word in the vocabulary.

The DTW templates can be created using either a single-pass, time-aligned averaging technique or an embedded training technique. The templates represent frequency and spectral energy against time.

The Markov models are derived during training of the apparatus from many utterances of the same word, the spectral and temporal variation being captured in a statistical model. The Markov models consist of a number of discrete states, each state comprising a pair of spectral and variance frames. The spectral frame contains 19 values covering the frequency range from 120 Hz to 4 kHz; the variance frame contains the variance information associated with each spectral vector/feature, in the form of state mean duration and standard deviation information.
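The state layout described above can be pictured as a small data structure. The sketch below is an illustration under stated assumptions: the class and field names are hypothetical, and scoring frames and durations with independent Gaussians is an assumed choice, not something the patent specifies.

```python
import math
from dataclasses import dataclass
from typing import List

@dataclass
class SemiMarkovState:
    spectral_frame: List[float]  # 19 channel values for this state
    variance_frame: List[float]  # per-channel variances
    mean_duration: float         # expected state duration, in frames
    duration_std: float          # standard deviation of the duration

def frame_cost(frame, state):
    """Negative log-likelihood of one frame (independent Gaussians assumed)."""
    return 0.5 * sum(
        math.log(2.0 * math.pi * v) + (x - m) ** 2 / v
        for x, m, v in zip(frame, state.spectral_frame, state.variance_frame)
    )

def duration_cost(n_frames, state):
    """Negative log-likelihood of staying n_frames in the state (Gaussian assumed)."""
    z = (n_frames - state.mean_duration) / state.duration_std
    return 0.5 * z * z + math.log(state.duration_std * math.sqrt(2.0 * math.pi))

state = SemiMarkovState([1.0] * 19, [1.0] * 19, mean_duration=5.0, duration_std=2.0)
print(frame_cost([1.0] * 19, state), duration_cost(5, state))
```

Keeping an explicit duration term per state is what distinguishes a semi-Markov model from an ordinary hidden Markov model, where the duration is implied by a self-transition probability.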

During training, individual utterances are analyzed to classify stationary phonetic states and their spectral transitions. The model parameters are estimated by an iterative process using the Viterbi re-estimation algorithm, as described by M. J. Russell and R. H. Moore in "Explicit modelling of state occupancy in hidden Markov models for automatic speech recognition", Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Tampa, 26-29 March 1985. The final word models contain the natural spoken word variability, both temporal and in inflection.

Between the memory 17 and the pattern matching unit 16 is a syntax unit 18, which applies conventional syntax restrictions to the stored vocabulary with which the speech signal is compared, according to the syntax of previously identified words.
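A syntax restriction of this kind can be sketched as a lookup that narrows the active vocabulary according to the previously identified word. The vocabulary, the grammar table and the function name below are invented for illustration; the patent does not give the command syntax.

```python
# Hypothetical command vocabulary and grammar, for illustration only.
VOCABULARY = {"select", "radio", "channel", "one", "two"}
SYNTAX = {
    None: {"select"},          # start of an utterance
    "select": {"radio"},
    "radio": {"channel"},
    "channel": {"one", "two"},
}

def active_vocabulary(vocabulary, syntax, previous_word):
    """Return the words the matcher should compare against next."""
    allowed = syntax.get(previous_word)
    if allowed is None:
        # No rule for this word: fall back to the full vocabulary.
        return set(vocabulary)
    return set(vocabulary) & set(allowed)

print(sorted(active_vocabulary(VOCABULARY, SYNTAX, "channel")))
```

Restricting the candidate set in this way reduces both the matching work and the chance of confusing acoustically similar words that cannot occur in the same position.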

The pattern matching unit 16 is also connected to a neural network unit 20. The neural network unit 20 incorporates a multi-layer perceptron (MLP) such as that described by S. M. Peeling and R. H. Moore in "Experiments in isolated digit recognition using the multi-layer perceptron", RSRE Memorandum No. 4073, 1987.

The MLP has the property of being able to recognize incomplete patterns, such as occur when high background noise masks low-energy fricative speech. The MLP is implemented in the manner described by D. E. Rumelhart et al. in "Learning internal representation by error back propagation", Cognitive Science, UCSD, ICS Report No. 8506, September 1985.

The pattern matching unit 16 uses three different algorithms to select the best match between the spoken word and the words in the vocabulary.

The first is an asymmetric DTW algorithm of the kind described by J. S. Bridle in "Stochastic models and template matching: some important relationships between two apparently different techniques for automatic speech recognition", Proc. Inst. of Acoustics, Windermere, November 1984, and by J. S. Bridle et al. in "Continuous connected word recognition using whole word templates", Radio and Electronic Engineer, Vol. 53, No. 4, April 1983. This is an efficient single-pass process that is particularly suited to real-time speech recognition. The algorithm works effectively with the noise compensation techniques implemented by the unit 15.
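As a rough illustration of asymmetric dynamic time warping, the sketch below consumes exactly one input frame per step while the template index may stay, advance by one, or skip one frame. It is not the cited Bridle algorithm: the step pattern, the squared-Euclidean frame distance, the fixed endpoints and all names are assumptions chosen to keep the example short.

```python
def asymmetric_dtw(frames, template):
    """Cumulative distance of the best alignment of frames onto template.

    Each input frame is used exactly once (the asymmetric part); the
    template index may stay, advance by one, or skip one frame per step.
    The path starts at template[0] and ends at template[-1].
    """
    inf = float("inf")

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    m = len(template)
    prev = [inf] * m
    prev[0] = dist(frames[0], template[0])
    for frame in frames[1:]:
        cur = [inf] * m
        for j in range(m):
            best = prev[j]
            if j >= 1:
                best = min(best, prev[j - 1])
            if j >= 2:
                best = min(best, prev[j - 2])
            if best < inf:
                cur[j] = best + dist(frame, template[j])
        prev = cur
    return prev[m - 1]

# Toy one-dimensional "spectra"; real frames would hold 19 channel values.
templates = {
    "up": [[0.0], [1.0], [2.0], [3.0]],
    "down": [[3.0], [2.0], [1.0], [0.0]],
}
spoken = [[0.0], [0.0], [1.0], [2.0], [3.0]]  # "up" with a lengthened first sound
best_word = min(templates, key=lambda w: asymmetric_dtw(spoken, templates[w]))
print(best_word)
```

Because only the previous column of cumulative distances is kept, the computation proceeds frame by frame in a single pass, which is the property that makes this family of algorithms attractive for real-time use.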

The second algorithm uses hidden semi-Markov model (HSMM) techniques, in which the Markov models contained in the vocabulary memory 17 described above are compared with the spoken word signals. The additional information in the Markov models about temporal and inflection variation in the spoken words enhances recognition performance during pattern matching. In practice, the DTW and HSMM algorithms are integrated with one another. The integrated DTW and HSMM techniques are capable of identifying the boundaries between adjacent words in continuous speech.

The third algorithm uses MLP techniques in conjunction with the neural net 20. The MLP is controlled by the DTW/HSMM algorithm: the MLP has a variable window of view onto a speech buffer (not shown) within the pattern matching unit 16, and the size and position of this window are determined by the DTW/HSMM algorithm.
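The variable window can be pictured as a slice of the speech buffer whose start and end come from the word boundaries found by the first analysis. The sketch below additionally assumes that the perceptron needs a fixed number of input frames, so the slice is resampled by picking evenly spaced frames; that resampling step, the default of 8 frames, and the function name are assumptions, not details given in the patent.

```python
def window_for_mlp(speech_buffer, start, end, n_frames=8):
    """Return n_frames frames taken evenly from speech_buffer[start:end].

    start and end stand for the word boundaries supplied by the
    DTW/HSMM analysis; n_frames is an assumed fixed MLP input length.
    """
    if not 0 <= start < end <= len(speech_buffer):
        raise ValueError("window must lie inside the speech buffer")
    length = end - start
    picks = [start + (k * length) // n_frames for k in range(n_frames)]
    return [speech_buffer[i] for i in picks]

# Integers stand in for spectral frames to keep the example readable.
buffer = list(range(100))
print(window_for_mlp(buffer, 10, 26))
```

A long word is thinned out and a short word has frames repeated, so the network always sees an input of the same size regardless of how quickly the word was spoken.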

In this way, the HSMM algorithm is used by the MLP to identify the word boundaries, or end points, so that the spectral-time segments, or word candidates, can be processed by the MLP. Each algorithm provides a signal indicating its explanation of the speech signal, by identifying the word in the vocabulary memory that the algorithm finds closest to the speech, together with a confidence measure. Each algorithm may produce a list of several words with their associated confidence measures. Higher-level software within the unit 16 compares the independent results achieved by each algorithm and, after any weighting, produces an output to the feedback device 3 and to the utilization device 4 based on these results.
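The comparison performed by the higher-level software is not detailed in the text. One simple possibility, sketched here purely as an assumption, is a weighted sum of the confidence measures each algorithm assigns to its candidate words; the algorithm labels, weights and scores are invented for illustration.

```python
def combine_hypotheses(results, weights):
    """Pick the word with the highest weighted total confidence.

    results maps an algorithm name to a list of (word, confidence) pairs;
    weights maps an algorithm name to its weight (default 1.0).
    """
    scores = {}
    for algorithm, hypotheses in results.items():
        w = weights.get(algorithm, 1.0)
        for word, confidence in hypotheses:
            scores[word] = scores.get(word, 0.0) + w * confidence
    best = max(scores, key=scores.get)
    return best, scores

results = {
    "dtw": [("gear", 0.6), ("fuel", 0.3)],
    "hsmm": [("gear", 0.5), ("fuel", 0.4)],
    "mlp": [("fuel", 0.7), ("gear", 0.2)],
}
weights = {"dtw": 1.0, "hsmm": 1.0, "mlp": 1.5}
best, scores = combine_hypotheses(results, weights)
print(best, scores)
```

Giving the perceptron a larger weight, as in this example, would suit noisy conditions where the description says the neural net copes better with degraded speech.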

In this way, the apparatus of the present invention enables neural network techniques to be used in the recognition of natural, continuous speech, which has not previously been possible. One advantage of the apparatus and method of the present invention is that they have a short response time and give rapid feedback to the speaker. This is particularly important in aircraft applications.

It will be appreciated that alternative algorithms may be used; it is only necessary to provide one algorithm capable of identifying word boundaries, in conjunction with a second algorithm employing neural network techniques.

The neural net algorithm need not be used for every word. In some apparatus, it may be arranged that the Markov algorithm alone provides the output for as long as its measure of confidence is above a certain level. When difficult words are spoken, or when words are spoken indistinctly or with high background noise, the measure of confidence falls and the apparatus consults the neural net algorithm for an independent opinion.
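The confidence-gated arrangement can be sketched as follows. The threshold value, the function names, and the rule that the neural net's opinion replaces the first indication whenever it is consulted are all assumptions made for illustration; the patent states only that the neural net is consulted when confidence falls.

```python
def recognize_word(first_indication, confidence, consult_neural_net, threshold=0.8):
    """Return the output word, consulting the neural net only when needed.

    first_indication and confidence come from the Markov (first) analysis;
    consult_neural_net is a callable returning the neural net's word or None.
    """
    if confidence > threshold:
        # Confident first analysis: skip the second analysis entirely.
        return first_indication
    second_indication = consult_neural_net()
    if second_indication is None:
        return first_indication
    return second_indication

print(recognize_word("gear", 0.95, lambda: "flaps"))  # gear
print(recognize_word("gear", 0.40, lambda: "flaps"))  # flaps
```

Skipping the second analysis for clearly recognized words keeps the response time short, which the description identifies as important for aircraft use.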

It will be appreciated that the functions carried out by the units described could be carried out by programming of one or more computers and need not be performed by the discrete units referred to above.

The apparatus may be used for many applications, but is particularly suited to use in high-noise environments, such as the control of machinery and vehicles, especially fixed-wing and rotary-wing aircraft.

[Brief explanation of drawings]

FIG. 1 shows an embodiment of the speech recognition apparatus of the present invention.

1...Speech recognition apparatus
2...Microphone
3...Feedback device
4...Utilization device
10...Preamplifier
11...Pre-emphasis stage
12...Switch
13...Analog-to-digital converter
14...Digital filter bank
15...Noise marking unit
16...Pattern matching unit
17...Vocabulary memory
18...Syntax unit
20...Neural network unit

Claims

1. A speech recognition method of the kind in which a first analysis of a speech signal is performed to identify boundaries between different words and to give a first indication of the words spoken by comparison with a stored vocabulary, characterized in that the method includes the steps of performing a second analysis of the speech signal, using neural network techniques and the word boundary identification from the first analysis, to give a second indication of the words spoken, and providing an output signal representing the words spoken from at least the second indication.

2. A method according to claim 1, characterized in that the first analysis is performed using a Markov model.

3. A method according to claim 1 or 2, characterized in that the vocabulary includes dynamic time warping templates.

4. A method according to claim 3, characterized in that the first analysis is performed using an asymmetric dynamic time warping algorithm.

5. A method according to claim 1, characterized in that the first analysis is performed using a plurality of different algorithms, each of which provides a signal indicating the word in the vocabulary memory closest to the speech signal together with an indication of the confidence that the indicated word is the word spoken, and that a comparison is made between the signals provided by the different algorithms.

6. A method according to any one of claims 1 to 5, characterized in that the first indication of the words spoken provides a measure of confidence, and that the output signal is arranged to respond only to the first indication if the measure of confidence is greater than a predetermined value.

7. A method according to any one of claims 1 to 6, characterized in that the second analysis is performed using a multi-layer perceptron technique in conjunction with a neural net.

8. A method according to any one of claims 1 to 7, characterized in that the output signal is used to give feedback of the words spoken to the speaker.

9. A method according to any one of claims 1 to 8, characterized in that the method includes the step of performing a noise marking algorithm on the speech signal.

10. A method according to any one of claims 1 to 9, characterized in that the method includes the step of performing a syntax restriction on the stored vocabulary according to the syntax of previously identified words.

11. Speech recognition apparatus including a memory containing speech information about a vocabulary of words that can be recognized, and a pattern matching unit that performs a first analysis of a speech signal, comparing the speech signal with the stored vocabulary to identify boundaries between different words and to give a first indication of the words spoken, characterized in that the apparatus includes a neural network unit (20) connected to the pattern matching unit (16), that the pattern matching unit (16) performs a second analysis of the speech signal using both the output of the neural network unit (20) and the word boundary identification from the first analysis, and that the pattern matching unit (16) provides an output signal representing the words spoken from at least the second analysis.