JPH01310399A - Speech recognition device - Google Patents

Speech recognition device

Info

Publication number
JPH01310399A
JPH01310399A JP63141069A JP14106988A
Authority
JP
Japan
Prior art keywords
section
distance
microphone
voice
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP63141069A
Other languages
Japanese (ja)
Inventor
Tsuneo Nitta
恒雄 新田
Akira Nakayama
昭 中山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toshiba Corp
Toshiba Computer Engineering Corp
Original Assignee
Toshiba Corp
Toshiba Computer Engineering Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba Corp, Toshiba Computer Engineering Corp filed Critical Toshiba Corp
Priority to JP63141069A priority Critical patent/JPH01310399A/en
Publication of JPH01310399A publication Critical patent/JPH01310399A/en
Pending legal-status Critical Current


Abstract

PURPOSE: To accurately detect a speech section even in an environment where external noise is loud, by detecting the distance between the lips of a speaker and a microphone and entering the section in which the lips move as the next candidate for the speech section.

CONSTITUTION: A distance sensor 3 provided on a microphone device 1 detects variation in the distance between the microphone 2 and the lips of the speaker 21, including the motion of the lips and their peripheral parts when the speaker 21 speaks, and outputs it to a distance calculation part 6. The calculation part 6 converts the detection signal into a digital value and outputs it to a voicing section detection part 7. The detection part 7 uses an adaptively determined threshold value to regard the section in which the variation in the distance between the microphone 2 and the lips is largest as the voicing section, and outputs it to a time normalization part 8 as the next candidate for the speech section signal output from a speech section detection part 5. When external noise is too loud for the speech section to be detected through the microphone 2, the voicing section is used to find the speech section.

Description

Detailed Description of the Invention

[Object of the Invention]

(Industrial Application Field)

The present invention relates to a speech recognition device.

(Prior Art)

A speech recognition device is a device that takes in the voice uttered by a speaker, detects a speech section from the voice signal, temporally normalizes the signal of this speech section to obtain feature quantities, performs a similarity calculation between these feature quantities and standard patterns, and outputs the category with the highest score as the recognition result. Such devices are used in a variety of apparatuses that operate automatically in response to human voice.

In speech recognition with such a device, it is necessary, as preprocessing, to detect the speech section within the voice uttered by the speaker. In a conventional speech recognition device, the speech section is detected by inputting the speaker's voice through a microphone and, based on the acoustic feature parameters obtained by analyzing the input voice, locating the start and end of the speech section with an appropriate threshold value.

(Problem to be Solved by the Invention)

With this conventional method of speech section detection, however, in an environment with loud external noise, external noise may enter the microphone together with the speaker's voice. In that case the feature quantities obtained at analysis time contain the external noise components superimposed on the original speech components, so accurate detection of the speech section may become difficult. This is especially noticeable when the noise level is high or when sudden noise occurs.

The present invention was made in view of these circumstances, and its object is to provide a speech recognition device that can accurately detect speech sections without being affected by external noise.

[Structure of the Invention]

(Means for Solving the Problem)

To achieve the above object, the speech recognition device of the present invention comprises: a distance sensor that detects the distance between the speaker's lips and the microphone; distance calculation means that converts the signal detected by this distance sensor into a digital quantity; and utterance section detection means that extracts a stable distance-variation quantity from the distance time series output by the distance calculation means and detects the utterance section of the voice from the movement of the lips and their vicinity. The utterance section detected by the utterance section detection means is used as the next candidate for the speech section.

(Operation)

That is, the distance sensor detects the distance between the speaker's lips and the microphone while the speaker is speaking, and the section in which the lips are moving is regarded as the section in which the speaker is producing speech and is entered as the next candidate for the speech section. As a result, even when it is difficult to detect the speech section from a voice signal containing loud noise input through the microphone in an environment with loud external noise, the utterance section can be used to detect the speech section accurately.

(Embodiment)

An embodiment of the present invention will be described below with reference to the drawings.

An embodiment of the configuration of the speech recognition device of the present invention will be described with reference to FIG. 1. In the figure, reference numeral 1 denotes a microphone device in which a microphone 2 and a distance sensor 3 are integrally mounted; as shown in FIG. 2, it is worn by the speaker 21. The microphone 2, which is the voice input section, is of the close-talking type and is located a fixed distance in front of the lips 22 of the speaker 21. The distance sensor 3 is a sensor that measures the distance between the microphone 2 and the lips 22 of the speaker 21; since this distance changes as the lips 22 and their vicinity move during utterance, the sensor detects the change in the distance. For this purpose, the distance sensor 3 is mounted at a position from which the distance between the microphone 2 and the lips 22 can be detected accurately. An infrared sensor, an ultrasonic sensor, or the like is used as the distance sensor 3; among these, an infrared sensor is preferable because of its low noise. The microphone 2 and the distance sensor 3 are integrally supported by a support member 1a, which the speaker 21 wears. The microphone 2 and the distance sensor 3 send their signals to the main body of the device via a signal line (not shown).

Reference numeral 4 denotes an acoustic analysis section that receives the voice signal input from the microphone 2, extracts its acoustic feature parameters, and outputs the resulting signal to the speech section detection section 5. The speech section detection section 5 receives the signal from the acoustic analysis section 4, detects the speech section, and outputs the result to the time normalization section 8.

Reference numeral 6 denotes a distance calculation section that receives the detection signal from the distance sensor 3, converts it into a digital quantity, and outputs the result to the utterance section detection section 7. The utterance section detection section 7 receives the signal from the distance calculation section 6, detects the utterance section, and outputs the result to the time normalization section 8.

Reference numeral 8 denotes a time normalization section that receives the speech section signal and the utterance section signal output from the speech section detection section 5 and the utterance section detection section 7 respectively, obtains temporally normalized feature quantities from each signal, and outputs them to the similarity calculation section 9. The similarity calculation section 9 receives the signals from the time normalization section 8 and performs a similarity calculation against the standard patterns 10.

The case in which speech recognition is performed by the speech recognition device configured in this manner will now be described. The voice uttered by the speaker 21 wearing the microphone device 1 is input to the microphone 2, and the acoustic analysis section 4 extracts acoustic feature parameters from the voice signal. The extracted feature signal is output to the speech section detection section 5, which uses part of the feature quantities (for example, the power sequence) to detect the start and end of the speech section with an adaptively determined threshold. This speech section signal is output to the time normalization section 8.

Meanwhile, the distance sensor 3 provided on the microphone device 1 detects the change in the distance between the microphone 2 and the lips 22 accompanying the movement of the lips 22 and their vicinity when the speaker 21 utters a voice, and outputs the detection signal to the distance calculation section 6. The distance calculation section 6 converts this detection signal into a digital quantity and outputs it to the utterance section detection section 7. The utterance section detection section 7, using an adaptively determined threshold, detects the section in which the distance variation between the microphone 2 and the lips 22 is largest and regards it as the utterance section. This utterance section signal is output to the time normalization section 8 as the next candidate for the speech section signal output from the speech section detection section 5.

The time normalization section 8 receives the speech section signal and the utterance section signal output from the speech section detection section 5 and the utterance section detection section 7 respectively, and from the feature quantities extracted by the acoustic analysis section 4 (for example, the band-pass filter outputs) obtains two sets of feature quantities, one temporally normalized within the speech section and one within the utterance section, and outputs these signals to the similarity calculation section 9. The similarity calculation section 9 performs a similarity calculation between these two sets of feature quantities and the standard patterns 10, and outputs the category with the highest score as the speech recognition result.

Here, a speech section is determined from the voice uttered by the speaker 21, and in addition an utterance section is determined by attending to the movement of the lips 22 and their vicinity during speech; this utterance section is nominated for use as the next candidate for the speech section. Therefore, when external noise is loud and enters the microphone 2 together with the voice, making it difficult to detect the speech section from the feature quantities at analysis time, the utterance section can be used to determine the speech section. The speech section so obtained accurately matches the original speech section.

Next, the operation of the utterance section detection section 7 and the speech section detection section 5 will be explained to bring out the features of the device of the present invention. FIG. 3 outlines the internal processing flow of the utterance section detection section 7, which searches for the utterance section from the microphone-to-lips distance output by the distance calculation section 6. First, a smoothing process is applied to the time series data D(n) of the distance between the microphone 2 and the lips 22 to extract a time series d(n) of stable distance variation (step SP1). Next, over the interval n = 1 to 10 of this time series d(n), which can be regarded as a non-utterance interval, the average distance d^ is calculated (SP2), and a threshold dTH (= d^ + dO), obtained by adding an adaptively determined offset dO, is set (SP3).

The search for the utterance section SA, EA proceeds as follows: when an interval with d(n) > dTH has continued for several consecutive frames, the position where d(n) first exceeded dTH is taken as the start; then, after the interval with d(n) > dTH has continued for ten-odd frames or more, when an interval with d(n) < dTH continues for several consecutive frames, the position where d(n) first fell below dTH is determined as the end. The interval SA, EA is sent to the time normalization section 8 as the utterance section (SP4).

The speech section detection section 5 searches for the speech section by a similar method. In this case, however, the result is determined by feature quantities extracted from the voice signal input from the microphone 2, and those feature quantities do not necessarily derive from speech alone; they may include external noise components from around the microphone.

FIGS. 4 and 5 are diagrams showing the processing of the utterance section detection section 7 and the speech section detection section 5 when large external noise is present before and after the speech section. Since the utterance section SA, EA detected by the utterance section detection section 7 shown in FIG. 4 is found from the varying distance between the microphone and the lips, the section in which speech is considered to have been produced can be detected from the movement of the speaker's mouth entirely without the influence of external noise. In the processing of the speech section detection section 5 shown in FIG. 5, on the other hand, even with an adaptively determined threshold PTH, accurate detection of the speech section ST, ET becomes impossible because of the noise components, and a section SF, EF containing noise components is erroneously detected. Erroneous detection of the speech section is a fatal error for a speech recognition device. To avoid such a situation, by entering the utterance section SA, EA, which is unaffected by external noise even in an environment with relatively loud external noise, as the next candidate for the speech section, a more stable speech recognition device with a higher recognition rate can be realized.

The present invention is not limited to the embodiment described above and can be carried out with various modifications without departing from its gist. The microphone for voice input is not limited to the commonly used close-talking type; it may be changed according to the voice-capturing apparatus to which the device of the present invention is applied. For example, a telephone handset can be used as the microphone. In this case, the handset is fixed in a set position and the distance sensor is attached integrally to the handset.

Effect of the Invention

As explained above, according to the speech recognition device of the present invention, a distance sensor measures the variation in the distance between the speaker's lips and the microphone during utterance, the section in which the lips are moving is detected as the utterance section, and this section is entered as the next candidate for the speech section. Even when it is difficult to detect the speech section from the voice input through the microphone in an environment with loud noise, the utterance section can therefore be used to detect the speech section accurately without being affected by external noise, and a high recognition rate and stable accuracy can be obtained without restriction by the usage environment.

Brief Description of the Drawings

FIG. 1 is a system configuration diagram showing an embodiment of the speech recognition device of the present invention; FIG. 2 is an explanatory diagram showing the microphone device in this embodiment; FIG. 3 is a flowchart showing the processing of the utterance section detection section in this embodiment; FIG. 4 is a diagram showing how the utterance section is detected by the utterance section detection section; and FIG. 5 is a diagram showing how the speech section is detected by the speech section detection section.

1: microphone device; 2: microphone; 3: distance sensor; 4: acoustic analysis section; 5: speech section detection section; 6: distance calculation section; 7: utterance section detection section; 8: time normalization section; 9: similarity calculation section.

Applicant's representative: Patent attorney Takehiko Suzue

Claims (1)

[Claims]

1. In an apparatus that performs speech recognition by inputting the voice uttered by a speaker through a microphone, detecting a speech section from the voice signal, and performing a similarity calculation between standard patterns and feature quantities obtained by temporally normalizing the signal of this speech section, a speech recognition device characterized by comprising: a distance sensor that detects the distance between the lips of the speaker and the microphone; distance calculation means that converts the signal detected by this distance sensor into a digital quantity; and utterance section detection means that extracts a stable distance quantity from the distance time series of the signal output by this distance calculation means and detects the utterance section of the voice from the movement of the lips and their vicinity; wherein the utterance section detected by the utterance section detection means is used as the next candidate for the speech section.
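The decision rule of the claim — normalize the features over both the microphone-derived speech section and the lip-motion utterance section, score each against every standard pattern, and output the best-scoring category — can be sketched as follows. The linear resampling and the negative-squared-distance similarity are stand-ins chosen for the example; the patent does not specify the normalization or similarity measure it uses.

```python
def normalize_length(frames, target_len=16):
    """Linearly resample a feature-vector sequence to a fixed frame count
    (a simple stand-in for the time normalization section 8)."""
    idx = [round(i * (len(frames) - 1) / (target_len - 1))
           for i in range(target_len)]
    return [frames[i] for i in idx]

def similarity(a, b):
    """Illustrative similarity score: negative summed squared distance."""
    return -sum((x - y) ** 2 for fa, fb in zip(a, b) for x, y in zip(fa, fb))

def recognize(features, speech_sec, utter_sec, patterns, target_len=16):
    """Score both section candidates against every standard pattern and
    return the category with the highest score."""
    best_cat, best_score = None, None
    for start, end in (speech_sec, utter_sec):   # primary and next candidate
        seg = normalize_length(features[start:end + 1], target_len)
        for category, pattern in patterns.items():
            s = similarity(seg, pattern)
            if best_score is None or s > best_score:
                best_cat, best_score = category, s
    return best_cat
```

When noise makes the microphone-derived section too wide, its normalized features match the standard patterns poorly and the lip-motion candidate supplies the better score, which is exactly how the "next candidate" improves robustness.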
JP63141069A 1988-06-08 1988-06-08 Speech recognition device Pending JPH01310399A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP63141069A JPH01310399A (en) 1988-06-08 1988-06-08 Speech recognition device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP63141069A JPH01310399A (en) 1988-06-08 1988-06-08 Speech recognition device

Publications (1)

Publication Number Publication Date
JPH01310399A (en) 1989-12-14

Family

ID=15283514

Family Applications (1)

Application Number Title Priority Date Filing Date
JP63141069A Pending JPH01310399A (en) 1988-06-08 1988-06-08 Speech recognition device

Country Status (1)

Country Link
JP (1) JPH01310399A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0676899A3 (en) * 1994-04-06 1997-11-19 AT&T Corp. Audio-visual communication system having integrated perceptual speech and video coding
JP2006139117A (en) * 2004-11-12 2006-06-01 Kenwood Corp Microphone apparatus, utterance detector, utterance detecting method, and voice outputting method
JP4568905B2 (en) * 2004-11-12 2010-10-27 株式会社ケンウッド Microphone device and speech detection device

Similar Documents

Publication Publication Date Title
EP1503368B1 (en) Head mounted multi-sensory audio input system
US20190385605A1 (en) Method and system for providing voice recognition trigger and non-transitory computer-readable recording medium
JP2019101385A (en) Audio processing apparatus, audio processing method, and audio processing program
US20040015357A1 (en) Method and apparatus for rejection of speech recognition results in accordance with confidence level
JP3838159B2 (en) Speech recognition dialogue apparatus and program
WO2020250828A1 (en) Utterance section detection device, utterance section detection method, and utterance section detection program
JP3798530B2 (en) Speech recognition apparatus and speech recognition method
JPH01310399A (en) Speech recognition device
Yoshinaga et al. Audio-visual speech recognition using new lip features extracted from side-face images
Mathew et al. Piezoelectric Throat Microphone Based Voice Analysis
JP2005010652A (en) Speech detecting device
JPS6338993A (en) Voice section detector
JPH04184495A (en) Voice recognition device
JPH03114100A (en) Voice section detecting device
JP2005107384A (en) Device and method for speech recognition, program, and recording medium
JPH04324499A (en) Speech recognition device
JPH02178699A (en) Voice recognition device
JP2021162685A (en) Utterance section detection device, voice recognition device, utterance section detection system, utterance section detection method, and utterance section detection program
JPH0316038B2 (en)
JPH0546196A (en) Speech recognition device
JP3125928B2 (en) Voice recognition device
JP2000206986A (en) Language information detector
Ishi et al. Real-time audio-visual voice activity detection for speech recognition in noisy environments
Suk et al. Voice/non-voice classification using reliable fundamental frequency estimator for voice activated powered wheelchair control
Sahu et al. Odia isolated word recognition using DTW