JPS6219760B2

Info

Publication number: JPS6219760B2
Application number: JP55023797A
Authority: JP
Inventors: Masaru Nishimura; Takehiko Asano
Original assignee: Sanyo Electric Co Ltd
Current assignee: Sanyo Electric Co Ltd
Priority date: 1980-02-26
Filing date: 1980-02-26
Publication date: 1987-04-30
Also published as: JPS56119200A

Description

[Detailed description of the invention]

The present invention relates to a speech recognition device, and is characterized in particular by preventing malfunction and erroneous recognition caused by external noise.

Various speech recognition devices have been proposed that extract feature parameters from a speech signal, use those parameters to recognize a speaker's speech, and output signals, such as signals for controlling external equipment, according to the recognized speech. One example is shown in the block diagram of FIG. 1. Its main components are an input section 1 including an acoustic-electrical signal converter, such as a microphone, that converts input speech into an electrical signal; an input band filter 2 that extracts from the input acoustic signal only the frequency band to be processed; a feature extraction section 3 that extracts the features of the speech signal; a standard pattern storage section 4 that stores previously registered speech features as standard patterns; a recognition processing section 5 that identifies the input speech by comparing the feature pattern extracted from it with the standard patterns; and an output control section 6 that outputs the recognition result. To these are added an input signal amplitude normalization circuit 7 and a time axis adjustment section 8, which improve the recognition rate, and a registration control section 9, which handles the processing for registering the standard patterns of speech features in advance.

Many parameters can be used to extract speech features, such as the frequency spectrum distribution, the correlation function, the number of zero crossings, the formant frequencies, or linear prediction coefficients. Among these, the so-called filter bank method, which separates the frequency spectrum of the speech with a plurality of band-pass filters and examines its correlation with standard patterns, is widely used because it achieves a high recognition rate with a relatively simple configuration.

In such a speech recognition device, a very important issue is how to input the control command speech to be recognized with a good S/N ratio, that is, with a high ratio of signal to noise. In the environments where speech recognition devices are actually used, operation in a quiet, noise-free place is rather rare; the device is usually exposed to various ambient sounds other than the control command speech, which from the device's point of view is a noisy situation. In other words, how well the device can separate and extract the control command speech from among various sounds is a major factor in determining its recognition performance.

Almost all the frequency components of human speech lie within the speech band of 100 Hz to 5 kHz, and conventional speech recognition devices perform recognition using the spectral distribution within this band as the feature parameter. The frequency spectrum of the ambient sounds mentioned above, however, covers a far wider band than speech, and most ambient sounds contain both components within the speech band and components at higher frequencies. In the conventional device of FIG. 1, the input band filter 2 takes in the speech-band signal, and in this respect it makes no distinction between the control command speech and the speech-band components of the ambient sound. The signal passed from the input band filter 2 to the feature extraction section 3 is therefore the sum of the control command speech signal and the speech-band components of the ambient sound, and the feature extraction section 3 extracts feature parameters from this summed signal. To the speech recognition device, however, the speech-band components of the ambient sound are noise whose level, frequency content, and duration change from moment to moment. If registration and recognition are performed with parameters extracted from control command speech mixed with irregularly intruding ambient sound, recognition performance falls.

There is of course no problem if the S/N ratio, taking the control command speech as the signal and the ambient sound as the noise, is sufficiently good, and conventional devices take various measures to secure a good S/N ratio. A simple, reliable, and common measure is to use a close-talking microphone in the input section. A close-talking microphone, also called a noise-cancelling microphone, is designed to suppress the pickup of ambient sound when used, as the name implies, close to the lips. It gives a reasonable improvement in S/N ratio and is quite effective in practice. Its effect tends to diminish, however, for ambient sounds of large amplitude that contain relatively high frequency components, so it is still not sufficient.

The present invention was made to prevent the degradation of recognition performance that ambient sound causes in such a speech recognition device.

Specifically, it judges from the difference in frequency spectrum distribution between human speech and ambient noise whether an input is a "sound" that should be recognized or registered. It focuses on frequency components at 5 kHz and above and examines their duration in order to make the difference between control command speech and ambient sound clear.

An embodiment of the present invention is described in detail below with reference to FIG. 2. Reference numeral 1 denotes an input section consisting of a microphone 10 and an amplifier 11. The signal from the input section 1 passes through the input band filter 2 and then enters the feature extraction section 3. The feature extraction section 3 consists of N band-pass filters (hereinafter BPFs) 30, 31, ..., 32 with center frequencies f1, f2, ..., fN; integrators 33, 34, ..., 35 that integrate the outputs of these filters; a multiplexer 36 that switches among the integrator outputs; and an analog-to-digital (A/D) converter 37 that converts the filter output levels passed by the multiplexer into digital signals. The band covered by the N BPFs 30, 31, ..., 32 together lies within the band of the input band filter 2. The A/D converter 37 samples each filter component of the signal from the input section 1 at suitable intervals (about 10 msec in most cases) and converts it to a digital code, which is then stored in a storage memory 51 via a microcomputer 50, including I/O ports, in the recognition processing section 5. A separate standard pattern memory 4 is connected to the microcomputer 50 and holds the feature patterns of the control command speech, stored in advance in association with their control contents.

In speech recognition mode, control speech is input as described above, and the digitally coded signal sequence extracted by the filters 30, 31, ..., 32 of the feature extraction section 3 is stored in the storage memory 51. The microcomputer 50 then computes the difference between this stored pattern and every standard pattern, and identifies the input speech by determining the standard pattern with the smallest difference. It is well known that, because the time course of human speech is not always the same even when the same words are uttered, some form of time axis adjustment circuit 8 must be added, as shown in FIG. 1. In the embodiment of FIG. 2 it is omitted because the microcomputer 50 can perform the time axis adjustment in software while it computes the feature patterns. The amplitude normalization circuit 7 of FIG. 1 is omitted from FIG. 2 for the same reason.
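The front end described above (band-pass filters, integrators, and sampling about every 10 msec) can be approximated digitally. The following is a minimal sketch and not the patent's analog circuit: it substitutes the Goertzel algorithm for each band-pass filter and integrator pair, and the function names and the sample rate and center frequencies passed in by the caller are assumptions made for illustration.

```python
import math


def goertzel_power(frame, sample_rate, freq):
    """Signal power in `frame` near `freq` (Hz), via the Goertzel recurrence.

    Stand-in for one band-pass filter followed by its integrator.
    """
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / sample_rate)
    s_prev, s_prev2 = 0.0, 0.0
    for x in frame:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def filter_bank_frames(samples, sample_rate, center_freqs, frame_ms=10):
    """Split `samples` into frames of `frame_ms` and measure each band.

    Returns one list of band powers per frame, in the order of
    `center_freqs`. Trailing samples that do not fill a frame are dropped.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        frames.append(
            [goertzel_power(frame, sample_rate, f) for f in center_freqs]
        )
    return frames
```

Each returned row plays the role of one set of digital codes written to the storage memory 51; a sequence of rows forms the feature pattern of an utterance.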

In recognition mode, speech is captured continuously. When the input speech breaks off, that is, during a pause, the recognition computation described above is executed and the preceding input speech is identified by pattern matching. If the input speech can be identified, that is, if it matches some standard pattern within the allowable error, the microcomputer 50 controls the output control section 6 to output the external-equipment control signal corresponding to the control speech. If the input speech cannot be identified, no control signal is output; instead, for example, a display section 12 is driven to indicate that identification failed and to prompt the speaker to repeat the utterance.
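The matching and rejection logic can be sketched as follows. This is an illustration under stated assumptions, not the patent's procedure: the patent does not say how the difference is measured or how the time axis is adjusted, so the mean absolute difference, the linear resampling of frames, the `tolerance` parameter, and all function names here are hypothetical choices.

```python
def resample_frames(frames, target_len):
    """Crude time axis adjustment: pick target_len frames evenly spaced in time."""
    if target_len == 1:
        return [frames[0]]
    last = len(frames) - 1
    return [frames[round(i * last / (target_len - 1))] for i in range(target_len)]


def pattern_distance(a, b):
    """Mean absolute difference between two equal-length frame sequences."""
    total, count = 0.0, 0
    for frame_a, frame_b in zip(a, b):
        for xa, xb in zip(frame_a, frame_b):
            total += abs(xa - xb)
            count += 1
    return total / count


def recognize(input_frames, standard_patterns, tolerance):
    """Return (label, distance) of the closest standard pattern.

    The label is None when even the closest pattern differs by more than
    `tolerance`, which corresponds to asking the speaker to repeat.
    """
    best_label, best_dist = None, float("inf")
    for label, pattern in standard_patterns.items():
        aligned = resample_frames(input_frames, len(pattern))
        dist = pattern_distance(aligned, pattern)
        if dist < best_dist:
            best_label, best_dist = label, dist
    if best_dist <= tolerance:
        return best_label, best_dist
    return None, best_dist
```

A caller would output the control signal associated with the returned label, or drive the display when the label is None.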

Next, the way in which the configuration of FIG. 2 discriminates between the speech signal and ambient noise, which is the substance of the present invention, is explained.

The signal from the input section 1 enters the input band filter 2 and, at the same time, a high-pass filter 14. The high-pass filter 14 passes those components of the acoustic signal from the input section 1 that lie above the band of the input band filter 2. Although it is called a high-pass filter, it need not pass frequencies above the highest frequency delivered by the microphone 10, so it has a high-frequency roll-off matched to the characteristics of the microphone 10. The signal components passing through the high-pass filter 14 determine the difference between the control command speech and the ambient sound. The signal from the high-pass filter 14 is integrated by a high-band signal integrator 15 and then input to a duration detector 16, which examines how long it persists.
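A digital counterpart of the high-pass branch can be sketched as below. It is a simplified illustration, not the circuit of FIG. 2: a first-order high-pass filter stands in for the high-pass filter 14, a rectifier with a leaky integrator stands in for the integrator 15, and the upper roll-off matched to the microphone is ignored. The function names, the cutoff passed by the caller, and the 5 msec time constant are assumptions.

```python
import math


def high_pass(samples, sample_rate, cutoff_hz):
    """First-order high-pass filter (discrete RC approximation)."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    out = []
    prev_x = prev_y = 0.0
    for x in samples:
        y = alpha * (prev_y + x - prev_x)
        out.append(y)
        prev_x, prev_y = x, y
    return out


def integrate_level(samples, sample_rate, time_constant_ms=5.0):
    """Rectify the signal and smooth it with a leaky integrator.

    Returns the running level, one value per input sample.
    """
    beta = 1.0 / (1.0 + sample_rate * time_constant_ms / 1000.0)
    level = 0.0
    levels = []
    for x in samples:
        level += beta * (abs(x) - level)
        levels.append(level)
    return levels
```

With a 5 kHz cutoff, a tone inside the speech band produces a much lower level than a tone above it, which is the contrast the duration detector 16 then evaluates over time.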

As noted above, almost all the components of a speech signal are concentrated within the band limited by the input band filter 2 and the BPFs 30, 31, ..., 32, and very few fall within the band handled by the high-pass filter 14. The various noises arising in the surroundings, by contrast, contain many components within the band of the high-pass filter 14. It might therefore seem that speech and ambient sound could be distinguished by the presence or absence of components in this band. Some speech sounds, however, such as the consonants |S|, |K|, and |Z|, do contain components in this band, so judging simply by presence or absence, that is, by level, is risky. The present invention therefore provides the duration detection circuit 16 and brings in a time-axis element to make the judgment accurate. When a person speaking normally utters consonants such as |S|, |K|, or |Z|, the duration of the components of the consonant that fall within the band of the high-pass filter 14 lies between about 20 msec and 150 msec. The duration detection circuit 16 accordingly detects the duration of the signal components passing through the high-pass filter 14 and, if this duration does not fall between 20 msec and 150 msec, judges the input to be noise other than speech and sends a signal to that effect to the microcomputer 50. On receiving this signal, the microcomputer 50 stops taking in speech data and waits for the next speech input.
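The duration test can be sketched in a few lines. This is a minimal illustration under assumptions: the 20 msec and 150 msec limits come from the description, but the level threshold, the sampling step, and the function names are hypothetical, and the patent does not state how the circuit decides that a high-band component is present.

```python
def active_run_durations_ms(levels, threshold, step_ms):
    """Durations (msec) of each contiguous run where level > threshold."""
    durations = []
    run = 0
    for level in levels:
        if level > threshold:
            run += 1
        elif run:
            durations.append(run * step_ms)
            run = 0
    if run:  # a run that is still active at the end of the data
        durations.append(run * step_ms)
    return durations


def is_ambient_noise(levels, threshold, step_ms, min_ms=20.0, max_ms=150.0):
    """True if any high-band run falls outside the consonant duration window.

    Input with no high-band activity at all is not flagged, since ordinary
    speech has very little energy in this band.
    """
    return any(
        d < min_ms or d > max_ms
        for d in active_run_durations_ms(levels, threshold, step_ms)
    )
```

For example, with levels sampled every 10 msec, a run of 8 samples above the threshold (80 msec) is accepted as a consonant, while a run of 30 samples (300 msec) is flagged as noise, at which point the caller would discard the captured data and wait for the next input.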

As described above, the present invention distinguishes speech from ambient sound by the difference in the duration of the components in a certain band. This feature discriminates well and contributes greatly to improving the recognition rate of speech recognition devices. The configuration is also very simple, so the invention is highly practical.

[Brief explanation of the drawing]

FIG. 1 is a block diagram of a conventional speech recognition device, and FIG. 2 is a block diagram showing the configuration of the speech recognition device of the present invention. 1 is the input section, 3 the feature extraction section, 4 the standard pattern storage section, 5 the recognition processing section, 14 the high-pass filter, and 16 the duration detection circuit.

Claims

[Claims]

1. A speech recognition device comprising: acoustic-electrical signal conversion means; a speech band filter that passes the speech band components of the electrical signal from the conversion means; a high-pass filter that passes high-frequency components of the electrical signal above the speech band; and duration detection means for detecting the duration of the signal component passing through the high-pass filter, wherein the speech recognition operation is interrupted when the duration detection means detects that the signal passing through the high-pass filter persists for a time other than the duration found in normal speech.