JPH04276799A - Speech recognition system

Info

Publication number: JPH04276799A
Application number: JP3062599A
Authority: JP
Inventors: Kuniyasu Kaneuchi; 金内　邦容
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1991-03-04
Filing date: 1991-03-04
Publication date: 1992-10-01

Abstract

PURPOSE: To reduce erroneous recognition caused by ambient environmental noise with comparatively easy processing and a simple configuration. CONSTITUTION: Feature quantities of the sound input from an ordinary microphone 1a and from a bone conduction microphone 1b are extracted by feature extraction parts 2a and 2b, respectively, and are then input to interval detecting parts 3a and 3b, where interval detection is performed. The interval information and feature quantities from the interval detecting parts 3a and 3b are input to a control part 4, which regards as a correct speech interval any interval detected by the microphone 1a that was also detected as speech by the microphone 1b. This prevents pattern collation from being performed on sounds, other than speech uttered by the user, that are mistaken for speech. Because the high-frequency components of the speech may be lost if only the interval information of the microphone 1b is used, the interval information of the microphone 1a is used.

Description

[Detailed description of the invention]

[0001]

[Technical Field] The present invention relates to a speech recognition system, and more particularly to the speech input means of a speech recognition system.

[0002]

[Prior Art] Products that apply speech recognition have begun to appear in recent years, but many are still hard to use because they place considerable restrictions on the manner of speaking, the surrounding environment, and so on. Erroneous recognition and malfunction caused by ambient environmental noise are a particularly serious problem. In an office, for example, there are the operating sounds of office automation equipment, people's conversations, and air conditioner noise; in a car, there are engine noise and car stereo sound. Various countermeasures have been considered. Relatively effective ones include: providing two microphones, one for speech and one for noise, and performing recognition after subtracting the noise component obtained from the noise microphone from the speech obtained from the speech microphone; using a highly directional microphone to pick up only the user's speech; and using a bone conduction microphone. However, each of these methods has problems, such as complicated processing, the position of the microphone becoming an issue, or an inability to pick up the high-frequency components of speech, and no decisive, effective countermeasure exists yet.

[0003]

[Object] The present invention was made in view of the circumstances described above, and its object is to reduce erroneous recognition caused by ambient environmental noise with relatively easy processing and a relatively simple configuration.

[0004]

[Structure] To achieve the above object, the present invention provides (1) a speech recognition system comprising: a first speech input means, such as a microphone, that converts sound into an electrical signal; a first feature extraction unit that extracts features of the speech input by the first speech input means; a first speech section detection unit that detects speech sections from the features extracted by the first feature extraction unit; a second speech input means, such as a bone conduction vibration pickup, that detects speech using bone conduction; a second feature extraction unit that extracts features of the speech input from the second speech input means; a second speech section detection unit that detects speech sections from the features extracted by the second feature extraction unit; means for generating a speech pattern dictionary from the features obtained by the first feature extraction unit for speech input in advance; a speech pattern dictionary storage unit that stores the speech pattern dictionary; a pattern matching unit that performs pattern matching between the speech pattern dictionary and an unknown input speech pattern; and a result output unit that outputs the result of the pattern matching. The system is characterized in that, at the time of speech input, only speech detected by the first speech section detection unit that includes a speech section detected by the second speech section detection unit is regarded as uttered speech, and speech pattern dictionary registration and/or pattern matching for recognition is performed on it. The invention is further characterized in that, in (1) above, (2) a speech section detected as speech by the second speech section detection unit is cancelled if it was not detected as speech by the first speech section detection unit; or (3) when the volume level detected from the first speech input unit at times other than speech input is at or above a predetermined value, that is, when the ambient environmental noise is at or above a predetermined level, the speech section detected by the second section detection unit is regarded as uttered speech, and speech pattern dictionary registration and/or pattern matching for recognition is performed. The present invention is described below on the basis of an embodiment.

[0005] FIG. 1 is a block diagram for explaining one embodiment of the speech recognition system according to the present invention. In the figure, 1a is a microphone that converts ordinary acoustic signals into electrical signals, and 1b is a bone conduction microphone. Features are extracted from the acoustic signals input from microphones 1a and 1b, and pattern matching is performed, as described below. Various feature extraction methods, pattern matching methods, and the like have been proposed, and the present invention may employ any of them. Bone conduction microphones 1b include types that pick up the signal at the nasal bone and types that pick it up at the throat; either type may be used.

[0006] (1) First, features are extracted from the speech input from microphones 1a and 1b by the respective feature extraction units 2a and 2b. The extracted features are input to the respective section detection units 3a and 3b, each of which performs section detection with its own algorithm. The section information and features from section detection units 3a and 3b are input to the control unit 4. Of the sections detected with microphone 1a, the control unit 4 regards those also detected as speech with microphone 1b as correct speech sections (see FIG. 2). This prevents pattern matching from being performed on sounds, other than speech uttered by the user, that are mistaken for speech. The section information of microphone 1a is used because the high-frequency components of the speech may be lost if only the section information of microphone 1b is used.
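The selection rule in (1) can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the implementation described in the patent: sections are assumed to be (start, end) pairs in seconds, "detected as speech with microphone 1b" is interpreted as any overlap in time (claim 1 speaks of a 1a section that includes a 1b section), and the names `overlaps` and `select_speech_sections` are hypothetical.

```python
from typing import List, Tuple

# (start, end) in seconds; this representation is an assumption for illustration
Section = Tuple[float, float]


def overlaps(a: Section, b: Section) -> bool:
    """Return True if the half-open sections [start, end) share any time."""
    return a[0] < b[1] and b[0] < a[1]


def select_speech_sections(air: List[Section], bone: List[Section]) -> List[Section]:
    """Keep the microphone 1a (air) sections that microphone 1b (bone) confirms.

    The boundaries of the 1a section are returned unchanged, because 1b alone
    may miss the high-frequency parts of the speech.
    """
    return [a for a in air if any(overlaps(a, b) for b in bone)]


air_sections = [(0.5, 1.0), (2.0, 3.2), (4.0, 4.5)]  # 1a: noise, speech, noise
bone_sections = [(2.1, 3.0)]                          # 1b: speech only
print(select_speech_sections(air_sections, bone_sections))  # [(2.0, 3.2)]
```

Only the middle 1a section survives, and it keeps the wider 1a boundaries rather than the narrower 1b ones.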

[0007] (2) Conversely to the example in (1), a section may be detected as speech from the input of microphone 1b even though no speech was input. For example, when the user moves, friction between the body and the sensor of the bone conduction microphone 1b that is in contact with it generates a signal, and this signal may be regarded as speech (see FIG. 3). To prevent this false detection, a section detected with microphone 1b is not regarded as speech if it was not detected with microphone 1a.
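The cancellation rule in (2) might look like the sketch below. It assumes sections are (start, end) pairs in seconds and treats "detected with microphone 1a" as any overlap in time; the function name `cancel_unconfirmed` and the sample values are hypothetical.

```python
from typing import List, Tuple

# (start, end) in seconds; this representation is an assumption for illustration
Section = Tuple[float, float]


def cancel_unconfirmed(
    bone: List[Section], air: List[Section]
) -> Tuple[List[Section], List[Section]]:
    """Split microphone 1b sections into (kept, cancelled).

    A 1b section is cancelled when no microphone 1a section overlaps it,
    for example when it was caused by friction against the sensor.
    """
    kept: List[Section] = []
    cancelled: List[Section] = []
    for b in bone:
        confirmed = any(b[0] < a[1] and a[0] < b[1] for a in air)
        (kept if confirmed else cancelled).append(b)
    return kept, cancelled


bone_sections = [(1.0, 1.2), (2.1, 3.0)]  # first entry: friction from body movement
air_sections = [(2.0, 3.2)]
kept, cancelled = cancel_unconfirmed(bone_sections, air_sections)
print(kept)       # [(2.1, 3.0)]
print(cancelled)  # [(1.0, 1.2)]
```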

[0008] (3) When ambient environmental noise is constantly loud, that is, when the level of the sound input from microphone 1a always exceeds the threshold for section detection (see FIG. 4), section detection with microphone 1a is impossible. In this case, section detection is performed using only the section detection information from microphone 1b.
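The fallback in (3) can be sketched as a switch on the idle level of microphone 1a. This is a sketch under assumptions that the patent does not specify: the idle level is given as a list of samples taken outside speech input, "constantly loud" is interpreted as every sample being at or above the threshold, and the name `choose_sections` and all numeric values are hypothetical.

```python
from typing import List, Sequence, Tuple

# (start, end) in seconds; this representation is an assumption for illustration
Section = Tuple[float, float]


def choose_sections(
    idle_air_levels: Sequence[float],
    noise_threshold: float,
    air: List[Section],
    bone: List[Section],
) -> List[Section]:
    """Pick speech sections depending on the ambient noise seen by microphone 1a."""
    # Assumption: the environment is "constantly loud" when every idle level
    # sample from microphone 1a is at or above the threshold.
    noisy = len(idle_air_levels) > 0 and min(idle_air_levels) >= noise_threshold
    if noisy:
        # Microphone 1a cannot delimit speech, so rely on microphone 1b alone.
        return list(bone)
    # Otherwise keep the 1a sections that 1b confirms by overlapping in time.
    return [a for a in air if any(a[0] < b[1] and b[0] < a[1] for b in bone)]


quiet = choose_sections([0.1, 0.2, 0.1], 0.5, [(0.5, 1.0), (2.0, 3.2)], [(2.1, 3.0)])
loud = choose_sections([0.7, 0.8, 0.9], 0.5, [(0.0, 10.0)], [(2.1, 3.0)])
print(quiet)  # [(2.0, 3.2)]
print(loud)   # [(2.1, 3.0)]
```

In the loud case the single 1a section spans the whole recording and carries no boundary information, so the 1b section is used directly.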

[0009]

[Effects] As is clear from the above description, the present invention makes correct section detection possible both when the ambient environmental noise is loud and when it is quiet, by handling each case appropriately.

[Brief explanation of the drawing]

FIG. 1 is a block diagram for explaining an embodiment of the speech recognition system according to the present invention.

FIG. 2 is a diagram for explaining an example of the invention according to claim 1, showing an example of the output levels of microphones 1a and 1b.

FIG. 3 is a diagram for explaining an example of the invention according to claim 2, showing another example of the output levels of microphones 1a and 1b.

FIG. 4 is a diagram for explaining an example of the invention according to claim 3, showing yet another example of the output levels of microphones 1a and 1b.

[Explanation of symbols]

1a, 1b... microphones, 2a, 2b... feature extraction units, 3a, 3b... speech section detection units, 4... control unit, 5... pattern dictionary storage unit, 6... pattern matching unit, 7... result output unit, 8... display unit.

Claims

[Claims]

1. A speech recognition system comprising: a first speech input means, such as a microphone, that converts sound into an electrical signal; a first feature extraction unit that extracts features of the speech input by the first speech input means; a first speech section detection unit that detects speech sections from the features extracted by the first feature extraction unit; a second speech input means, such as a bone conduction vibration pickup, that detects speech using bone conduction; a second feature extraction unit that extracts features of the speech input from the second speech input means; a second speech section detection unit that detects speech sections from the features extracted by the second feature extraction unit; means for generating a speech pattern dictionary from the features obtained by the first feature extraction unit for speech input in advance; a speech pattern dictionary storage unit that stores the speech pattern dictionary; a pattern matching unit that performs pattern matching between the speech pattern dictionary and an unknown input speech pattern; and a result output unit that outputs the result of the pattern matching, wherein, at the time of speech input, only speech detected by the first speech section detection unit that includes a speech section detected by the second speech section detection unit is regarded as uttered speech, and speech pattern dictionary registration and/or pattern matching for recognition is performed.

2. The speech recognition system according to claim 1, wherein a speech section detected as speech by the second speech section detection unit is cancelled if it is not detected as speech by the first speech section detection unit.

3. The speech recognition system according to claim 1, wherein, when the volume level detected from the first speech input unit at times other than during speech input is equal to or higher than a predetermined value, the speech section detected by the second section detection unit is regarded as uttered speech, and speech pattern dictionary registration and/or pattern matching for recognition is performed.