JPH10312194A

JPH10312194A - Method and device for detecting speech to be recognized

Info

Publication number: JPH10312194A
Application number: JP9280670A
Authority: JP
Inventors: Mitsuhiro Inazumi; 満広稲積; Sunao Aizawa; 直相澤
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 1997-03-12
Filing date: 1997-10-14
Publication date: 1998-11-24
Anticipated expiration: 2017-10-14
Also published as: JP3726448B2

Abstract

PROBLEM TO BE SOLVED: To reduce current consumption while waiting for speech to be recognized. SOLUTION: The device comprises: intermittent drive control means 6, which drives sound input means 1 intermittently; input level determination means 2, which detects the level of the sound input while the intermittently driven sound input means 1 is in its operating state, determines from that level whether a sound is present, and returns to the non-operating state if no sound is present; sound determination means 3, which starts operating after the input level determination means 2 determines that a sound is present, roughly determines whether the sound is noise or speech-like, and returns to the non-operating state if the sound is not speech-like; and speech determination means 4, which starts operating after the sound determination means 3 determines that the sound is speech-like, determines whether the speech-like sound is in fact speech, passes its feature data to speech recognition means 5 if it is speech, and returns to the non-operating state if it is not.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of the Invention] The present invention relates to a method and device for detecting speech to be recognized, for use in a speech recognition device that recognizes input speech and performs some operation based on the recognition result. When the device is constantly waiting for input speech, the method and device detect the input speech efficiently and reduce current consumption.

[0002]

[Description of the Related Art] Recently, devices using speech recognition have been put to practical use in various fields. Some devices of this type function adequately if they begin recognition only after a switch is turned on. Others must always wait for input speech, so that as soon as speech is input they recognize it and act on the recognition result.

[0003] An example of the latter is a clock that announces the current time when the user asks for it. Most devices of this type run on dry batteries. To keep the device small and light, it is desirable to use small-capacity batteries, and also to go a long time without replacing them.

[0004] However, because a device of this type must always wait for speech input, it consumes current continuously even in the standby state. How to keep that current consumption small is a major challenge.

[0005] To always wait for speech, the sound detection circuitry, such as the microphone and amplifier, must always be kept operable. The condenser microphone generally used in this type of device consumes about 500 μA, and the amplifier that processes the audio signal from the microphone likewise consumes about 500 μA.

[0006] The current consumption of these sound detection circuits is therefore about 1 mA. If this state is maintained, 8.76 Ah is consumed per year. This corresponds to one D-size alkaline battery, or to two inexpensive D-size manganese batteries.
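
The annual figure above follows from simple arithmetic, sketched here in Python. The function name and the assumption of a 24 hour, 365 day year are illustrative only; the two 500 μA figures are taken from the text.

```python
def annual_charge_ah(current_ma: float, hours_per_year: float = 24 * 365) -> float:
    """Charge drawn in one year, in ampere-hours, at a constant current given in mA."""
    return current_ma / 1000.0 * hours_per_year


# Microphone (about 0.5 mA) plus amplifier (about 0.5 mA), both always on.
standby_ma = 0.5 + 0.5
print(round(annual_charge_ah(standby_ma), 2))  # 8.76
```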

[0007] Considering miniaturization, weight reduction and cost, it is desirable to use AA or smaller batteries, but in the example above the battery life would then be very short.

[0008] One conceivable countermeasure is to turn on a switch only when needed to enable speech input. However, having to operate a switch every time defeats the purpose of a device that uses this kind of speech recognition, so it is not a practical approach. Another approach is described in Japanese Examined Patent Publication No. Sho 61-54191. In that prior art, for an electronic timepiece with an alarm, the operation of the alarm device after the alarm set time has been reached can be controlled by voice input.

[0009]

[Problem to Be Solved by the Invention] However, the technique of Japanese Examined Patent Publication No. Sho 61-54191 controls the timing of voice input by an alarm signal or the like. Voice input cannot be made at an arbitrary time, so usability is poor.

[0010] An object of the present invention is therefore to provide a method and device for detecting speech to be recognized that detect input speech efficiently, thereby keeping the device's current consumption small while it waits for the speech to be recognized, so that a device powered by dry batteries can be used for a long time on small-capacity batteries.

[0011]

[Means for Solving the Problem] The method of the present invention for detecting speech to be recognized is used in a speech recognition device that recognizes speech input to a sound input means and performs some operation in response to the recognition result. The method is characterized in that: the sound input means is driven intermittently; the processing that determines whether a sound input while the intermittently driven sound input means is in its operating state is a sound to be recognized is divided into a plurality of steps and performed in stages; the step of the next stage starts operating only after the result of the step currently being processed satisfies the condition set for that step; with each successive stage the processing consumes more current and determines more reliably whether the sound is speech to be recognized; and, if the condition set for a step is not satisfied, the steps are returned to the non-operating state.

[0012] Specifically, the method comprises: a first processing step of driving the sound input means intermittently, detecting the level of the sound input while the intermittently driven sound input means is in its operating state, determining from the magnitude of that level whether a sound is present, and returning to the non-operating state if no sound is determined to be present; a second processing step that starts operating after the first processing step determines that a sound is present, roughly determines whether the sound is noise or a speech-like sound, and returns to the non-operating state if the sound is determined not to be speech-like; and a third processing step that starts operating after the second processing step determines that the sound is speech-like, determines whether the speech-like sound is in fact speech, passes its speech feature data to the recognition unit if it is judged to be speech, and returns to the non-operating state if it is judged not to be speech.
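
The three processing steps above form a cascade in which each later step runs only if the earlier one passes. The following Python fragment is a minimal sketch of that control flow, not the patented implementation: the function name `detect_speech`, its parameters, and the stand-in stage functions in the usage lines are all hypothetical.

```python
from typing import Callable, Optional, Sequence

Samples = Sequence[float]


def detect_speech(
    samples: Samples,
    has_sound: Callable[[Samples], bool],                    # step 1: cheap level check
    is_speech_like: Callable[[Samples], bool],               # step 2: rough noise/speech check
    extract_features: Callable[[Samples], Optional[list]],   # step 3: None means "not speech"
) -> Optional[list]:
    """Run the steps in order; return feature data, or None to return to the non-operating state."""
    if not has_sound(samples):
        return None          # later, more expensive steps never run
    if not is_speech_like(samples):
        return None
    return extract_features(samples)


# Stand-in stages for illustration only: a silent buffer is rejected at step 1.
quiet = [0.0] * 8
print(detect_speech(quiet, lambda s: any(abs(v) > 0.1 for v in s), lambda s: True, lambda s: list(s)))  # None
```

Passing the stages in as functions keeps the ordering logic separate from the individual checks, so each check can be as cheap or as costly as its stage requires.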

[0013] The first processing step may obtain the average power of the sound input while the sound input means is in its operating state, compare this average power with a reference level to determine whether a sound is present, and return to the non-operating state if no sound is determined to be present. Alternatively, it may divide the sound input while the sound input means is in its operating state into a frequency band that includes the frequency band of the human voice and the remaining frequency band, obtain the average power of at least one of these bands, judge the sound on the basis of that average power, and return to the non-operating state if the sound is determined at least not to be a human voice. These may also be combined.

[0014] For a sound signal that has satisfied the condition set in the first processing step, the second processing step may measure the duration of the sound signal, determine from the duration whether the sound is speech-like, and return to the non-operating state if it is determined not to be speech-like. Alternatively, it may measure the number of zero crossings of the sound signal within a predetermined time, determine from that number whether the sound is speech-like, and return to the non-operating state if it is determined not to be speech-like. These may also be combined. For a sound signal that has satisfied the condition set in the second processing step, the third processing step performs speech feature extraction, judges from the extracted speech feature data whether the input sound is speech, passes the feature data to the recognition unit if it is judged to be speech, and returns to the non-operating state if it is judged not to be speech.

[0015] Furthermore, the recognition unit may perform recognition processing only on speech feature data that contains a preset keyword.

[0016] The device of the present invention for detecting speech to be recognized is used in a speech recognition device that recognizes speech input to a sound input means and performs some operation in response to the recognition result. The device has intermittent drive control means that intermittently drives the sound input means, and a plurality of processing means that perform, in stages, the processing that determines whether a sound input while the intermittently driven sound input means is in its operating state is a sound to be recognized. The device is characterized in that: the processing means of the next stage starts operating only after the result from the processing means currently processing satisfies the condition set for that processing means; with each successive stage the processing consumes more current and determines more reliably whether the sound is speech to be recognized; and, if the condition set for a processing means is not satisfied, the processing means are returned to the non-operating state.

[0017] Specifically, the device comprises: intermittent drive control means that intermittently drives the sound input means; input level determination means that detects the level of the sound input while the sound input means driven intermittently by the intermittent drive control means is in its operating state, determines from the magnitude of that level whether a sound is present, and returns to the non-operating state if no sound is determined to be present; sound determination means that starts operating after the input level determination means determines that an input sound is present, roughly determines whether the sound is noise or a speech-like sound, and returns to the non-operating state if the sound is determined not to be speech-like; and speech determination means that starts operating after the sound determination means determines that the sound is speech-like, determines whether the speech-like sound is in fact speech, passes its speech feature data to the recognition unit if it is judged to be speech, and returns to the non-operating state if it is judged not to be speech.

[0018] The input level determination means may obtain the average power of the sound input while the sound input means is in its operating state, compare this average power with a reference level to determine whether a sound is present, and return to the non-operating state if no sound is determined to be present. Alternatively, it may divide the sound input while the sound input means is in its operating state into a frequency band that includes the frequency band of the human voice and the remaining frequency band, obtain the average power of at least one of these bands, judge the sound on the basis of that average power, and return to the non-operating state if the sound is determined at least not to be a human voice. These may also be combined.

[0019] For a sound signal that has satisfied the condition set in the input level determination means, the sound determination means may measure the duration of the sound signal, determine from the duration whether the sound is speech-like, and return to the non-operating state if it is determined not to be speech-like. Alternatively, it may measure the number of zero crossings of the sound signal within a predetermined time, determine from that number whether the sound is speech-like, and return to the non-operating state if it is determined not to be speech-like. These may also be combined.

[0020] For a sound signal that has satisfied the condition set in the sound determination means, the speech determination means performs speech feature extraction, judges from the speech feature data whether the input sound is speech, passes the feature data to the recognition unit if it is judged to be speech, and returns to the non-operating state if it is judged not to be speech.

[0021] Furthermore, the recognition unit may perform recognition processing only on speech feature data that contains a preset keyword, treating it as the speech to be recognized.

[0022] The present invention is effective when applied to a speech recognition device that is kept constantly waiting for the speech to be recognized and, when that speech is input, performs an operation corresponding to the recognition result. Because a device of this type is always kept waiting for speech, its current consumption is large, and for devices powered by dry batteries, how to keep current consumption small has been a major challenge.

[0023] To solve this, the present invention first drives the sound input means intermittently. As a concrete example, speech input is performed intermittently, alternating between an operating state and a non-operating state: input is enabled for 0.1 second, then disabled for the following 0.4 second, and so on. Driving the input intermittently in this way keeps the current consumption in the standby state small.
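
The saving from this duty cycle can be estimated with a short calculation. The Python sketch below assumes the roughly 1 mA always-on figure given earlier for the microphone and amplifier, and assumes (this is not stated in the text) that the current in the non-operating state is zero and that the later processing stages add nothing; the function name is hypothetical.

```python
def average_current_ma(active_ma: float, on_s: float, off_s: float, sleep_ma: float = 0.0) -> float:
    """Time-averaged current over one on/off cycle of intermittent driving."""
    period_s = on_s + off_s
    return (active_ma * on_s + sleep_ma * off_s) / period_s


# 0.1 s operating, 0.4 s non-operating, 1 mA while operating (sleep current assumed zero).
print(round(average_current_ma(1.0, 0.1, 0.4), 3))  # 0.2
```

Under these assumptions the standby current falls to about one fifth of the always-on value.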

[0024] Intermittent driving has its own problems, however. For example, if the drive time is made very short (about 0.1 second, say) to keep current consumption small, normal speech input is not possible because of the characteristics of the microphone. To deal with this, the processing is divided into several steps and performed in stages. The first stage only detects whether a sound is present, which takes little processing time and little current. For a sound signal that passes the first stage, the second stage determines what kind of sound it is. If the sound is determined to be speech-like, the third stage determines whether it is a human voice. Each successive step requires more processing time and more current, and if the condition of a step is not satisfied the sound input means is returned to the non-operating state, so wasteful current consumption is suppressed. The first step can be built from little more than means for calculating the average power, means for storing a reference level, comparison means and, in some cases, a frequency filter, so it is easy to implement. The second step likewise needs little more than timing means, means for measuring duration and means for counting zero crossings, and is as easy to implement as the first.

[0025] In the first step, when determining whether speech is present, processing with a frequency filter also makes it possible to reject sounds that differ from the human voice at an early stage. That is, by first determining whether there is a sound with a certain power within the frequency range of the human voice, a sound whose large average power lies in a frequency band different from that of the human voice can be excluded from processing, which makes the processing more efficient. Combining the determination of sound presence from the input level with this frequency-filter processing makes the processing more efficient still.
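
As a rough illustration of the band-power idea, the Python sketch below measures what fraction of a signal's power lies inside a chosen band. It uses a direct discrete Fourier transform in place of the frequency filter the text mentions, purely for clarity. The 300 to 3400 Hz band, the 8 kHz sample rate and the function name are assumptions made for this example and do not come from the source.

```python
import cmath
import math


def band_power_fraction(samples, sample_rate, low_hz, high_hz):
    """Fraction of the signal's power in [low_hz, high_hz], skipping DC and the Nyquist bin."""
    n = len(samples)
    in_band = 0.0
    total = 0.0
    for k in range(1, (n + 1) // 2):          # positive-frequency bins only
        freq = k * sample_rate / n
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples))
        power = abs(coeff) ** 2
        total += power
        if low_hz <= freq <= high_hz:
            in_band += power
    return in_band / total if total > 0 else 0.0


rate, n = 8000, 400                            # a 0.05 s window, assumed for the example
tone = [math.sin(2 * math.pi * 1000 * i / rate) for i in range(n)]   # inside the assumed voice band
hum = [math.sin(2 * math.pi * 60 * i / rate) for i in range(n)]      # outside it
print(band_power_fraction(tone, rate, 300, 3400) > 0.5)  # True
print(band_power_fraction(hum, rate, 300, 3400) > 0.5)   # False
```

A low-power device would more plausibly use a simple analog or digital band-pass filter followed by the average-power comparison; the transform here only makes the band split explicit.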

[0026] In the second step, counting zero crossings makes it possible to determine efficiently whether a sound is like a human voice or is some other sound. Combining the zero-crossing count with a determination of how long the signal stays at or above a predetermined level allows sound determination that is more accurate and more efficient still.
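
Counting zero crossings is inexpensive, as the minimal Python sketch below suggests. The text does not say how the count is mapped to a speech-like decision, so the rate limits passed to `is_speech_like` are placeholders chosen for illustration, and both function names are hypothetical.

```python
def zero_crossings(samples):
    """Count sign changes between consecutive non-zero samples."""
    count = 0
    prev = 0.0
    for x in samples:
        if x == 0.0:
            continue                      # exact zeros carry no sign; skip them
        if prev != 0.0 and (x > 0) != (prev > 0):
            count += 1
        prev = x
    return count


def is_speech_like(samples, window_s, min_rate_hz, max_rate_hz):
    """True if the zero-crossing rate falls inside an assumed range for speech."""
    rate = zero_crossings(samples) / window_s
    return min_rate_hz <= rate <= max_rate_hz


print(zero_crossings([1.0, -1.0, 1.0, -1.0]))  # 3
```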

[0027] The third step can be implemented with the speech feature extraction means that the recognition device already has, and this processing determines with high accuracy whether the sound is a human voice.

[0028] Furthermore, accepting only speech feature data that contains a preset keyword as speech to be recognized avoids needless recognition operations, which also keeps current consumption small.

[0029]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings.

[0030] FIG. 1 is a block diagram illustrating an embodiment of the present invention. The device comprises: sound input means 1, such as a condenser microphone; input level determination means 2, which determines whether the level of the sound input from the sound input means 1 is at or above a certain level; sound determination means 3, which determines whether a sound that the input level determination means 2 has judged to be at or above that level is speech-like or is some other noise; speech determination means 4, which determines whether a sound that the sound determination means 3 has judged to be speech-like is in fact speech; speech recognition means 5, which performs recognition on a sound that the speech determination means 4 has judged to be speech; and intermittent drive control means 6, among others. The intermittent drive control means 6 supplies an intermittent drive signal (described later) to the sound input means 1, and supplies an operating voltage to each of the other means when that means is to operate.

[0031] The operation of this configuration is as follows. The intermittent drive control means 6 supplies an intermittent drive signal to the sound input means 1, so that the sound input means 1 alternates periodically between an operating state and a non-operating state. Driving the sound input means 1 intermittently in this way raises several problems.

[0032] If the sound input means 1 enters the operating state too infrequently, speech that should be recognized may be missed. Conversely, if it does so too frequently, current consumption increases and the goal of low current consumption is compromised. Both points can be addressed by keeping a reasonable frequency while shortening the time spent in the operating state.

[0033] However, making the operating time too short runs into a problem with the characteristics of the sound input means 1. For example, when a condenser microphone is used as the sound input means 1, it normally takes a time on the order of seconds before the input sound signal can be taken out as a stable sound signal (for example, one from which features can be extracted for recognition processing).

[0034] Taking these points into account, the present invention drives the sound input means 1 intermittently and processes the sound signal captured while the sound input means 1 is in its operating state in a plurality of steps, moving in turn to processing that consumes more current, takes longer, and determines more reliably whether the sound is speech to be recognized. This processing is described concretely below.

[0035] In this embodiment, in view of the points above, the intermittent drive control means 6 outputs to the sound input means 1 an intermittent drive signal that, for example, makes it operable for 0.1 second, rests it for the following 0.4 second, makes it operable again for 0.1 second, rests it for the following 0.4 second, and so on.

[0036] The sound input means 1 can therefore accept sound only during the 0.1-second operable periods set intermittently by the intermittent drive control means 6. At all other times it performs neither sound input nor any other operation (referred to here as the sleep state).

[0037] Suppose, for example, that a sound signal is present at some time and the sound input means 1 is operable at that time. The sound is then captured by the sound input means 1. The input level determination means 2 judges the input level of the captured sound signal. In other words, this stage determines only whether a sound is present.

[0038] The input level determination means 2 can detect the presence of a sound in various ways. One example is shown in FIG. 2. It consists of an average power calculation unit 211, a reference level storage unit 212, a comparison unit 213 and an input sound determination result output unit 214. It calculates the average power of the sound signal input from the sound input means 1, compares that average power with the reference level, and outputs an input sound determination result based on the comparison.
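
The FIG. 2 arrangement amounts to a mean-square calculation followed by a threshold comparison. The Python sketch below is one way to express it under stated assumptions: the sample values are taken to be already digitised, the reference level is passed in as a plain number, and the function names are hypothetical.

```python
def average_power(samples):
    """Mean of the squared sample values (role of the average power calculation unit)."""
    if not samples:
        return 0.0
    return sum(x * x for x in samples) / len(samples)


def sound_present(samples, reference_level):
    """Compare average power with the stored reference level (role of the comparison unit)."""
    return average_power(samples) >= reference_level


print(sound_present([0.5, -0.5, 0.5, -0.5], reference_level=0.01))  # True
```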

[0039] As noted above, when a condenser microphone is used as the sound input means 1, it normally takes a time on the order of seconds for the input sound signal to become stable. For processing that merely determines whether or not a sound is present, however, about 0.1 second is sufficient in practice.

[0040] The processing described so far is the first stage of the present invention and corresponds to steps s1 to s3 of the flowchart in FIG. 3. In the sleep state (step s1), when an operation start signal arrives from the intermittent drive control means 6, the sound input means 1 enters the operating state and it is determined whether a sound signal at or above a predetermined level is present (steps s2 and s3). If such a sound signal is determined to be present, processing moves on to the second stage. If not, it is judged that there is no sound and the device returns to the sleep state.

[0041] When a sound is judged to be present, the second stage uses the sound determination means 3 to determine whether the sound is speech-like or noise. Several ways of making this determination are conceivable. Here, as one example shown in FIG. 4, the duration of the sound at or above the predetermined level is examined to determine whether it is sudden noise.

[0042] The sound determination means 3 shown in FIG. 4 comprises a duration determination unit 31, a timer unit 32, a duration storage unit 33, a sound determination result output unit 34, and the like. In this configuration, the time for which the signal judged by the input level determination means 2 to be at or above the predetermined level continues is measured using the time signal from the timer unit 32, and, based on the time stored in the duration storage unit 33, it is determined whether the input sound is a voice-like sound or some other sudden noise.

[0043] That is, if the duration of the input sound at or above the predetermined level is shorter than the time stored in the duration storage unit 33, the sound is determined to be, at the least, not a voice but sudden noise such as the sound of a door being closed.
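The duration test can be sketched as below. The sketch assumes that the input is a list of per-frame flags saying whether the level was at or above the predetermined level; the frame length and the stored minimum duration are hypothetical values chosen only for illustration.

```python
def longest_run_seconds(above_level, frame_seconds):
    """Length in seconds of the longest unbroken run of above-level frames."""
    longest = current = 0
    for flag in above_level:
        current = current + 1 if flag else 0
        longest = max(longest, current)
    return longest * frame_seconds


def is_voice_like_by_duration(above_level, frame_seconds, stored_seconds):
    """A sound shorter than the stored time is treated as sudden noise."""
    return longest_run_seconds(above_level, frame_seconds) >= stored_seconds


# Hypothetical data: 10 ms frames, 100 ms stored duration.
door_slam = [True, True, False, False, False]
utterance = [True] * 30
print(is_voice_like_by_duration(door_slam, 0.01, 0.1))
print(is_voice_like_by_duration(utterance, 0.01, 0.1))
```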

[0044] The second-stage processing described above corresponds to steps s4 and s5 in the flowchart of FIG. 3. That is, when the first-stage processing (determining whether a sound is present) determines that a sound is present, the sound determination processing first examines the duration of the sound at or above the predetermined level as described above (step s4) and, based on that duration, determines whether the input sound is noise (step s5). If the sound is determined not to be sudden noise, that is, if it may be a voice, processing proceeds to the third stage; if it is determined to be sudden noise, the device returns to the sleep state.

[0045] If the second-stage processing determines that the input sound is not sudden noise and may be a voice, the voice determination means 4 determines, as the third-stage processing, whether the voice-like sound is a human voice. The voice determination processing performed by the voice determination means 4 is described below.

[0046] The voice determination means 4 must first distinguish whether the voice-like sound is a human voice or some other sound. To do so, it applies feature extraction processing (for example, LPC analysis) to the input sound and determines from the analysis result whether the sound is a human voice. Specifically, it performs feature extraction based on a model of the human voice production mechanism, obtains the resulting error, and determines from the magnitude of the error whether the sound is a human voice. Thus, even a sound that the second-stage processing judged to be voice-like rather than sudden noise can be judged clearly by obtaining the error from LPC analysis of the sound signal. Needless to say, the voice determination means 4 can be implemented with the feature analysis means that the voice recognition device already has.
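One way to obtain an LPC error is the autocorrelation method with the Levinson-Durbin recursion, sketched below. The source does not say which LPC variant or which decision rule is used; the predictor order, the normalization by signal energy, and the reading "small error means the signal fits the model" are assumptions made here, and a practical detector would need calibration on real speech.

```python
import math
import random


def autocorrelation(samples, max_lag):
    """Autocorrelation r[0..max_lag] of one analysis window."""
    n = len(samples)
    return [sum(samples[i] * samples[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def normalized_lpc_error(samples, order=8):
    """Residual energy of an order-`order` linear predictor / signal energy."""
    r = autocorrelation(samples, order)
    if r[0] == 0.0:
        return 1.0
    coeffs = [0.0] * (order + 1)  # coeffs[1..m] hold the predictor taps
    error = r[0]
    for m in range(1, order + 1):
        acc = r[m] - sum(coeffs[j] * r[m - j] for j in range(1, m))
        k = acc / error  # reflection coefficient for this order
        updated = coeffs[:]
        updated[m] = k
        for j in range(1, m):
            updated[j] = coeffs[j] - k * coeffs[m - j]
        coeffs = updated
        error *= 1.0 - k * k
        if error <= 0.0:  # numerically perfect prediction
            return 0.0
    return error / r[0]


# Hypothetical test signals: a strongly predictable tone and white noise.
tone = [math.sin(2 * math.pi * n / 20) for n in range(400)]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(400)]
print(normalized_lpc_error(tone), normalized_lpc_error(noise))
```

The tone gives a normalized error close to 0 and the noise a value close to 1, which shows the kind of separation an error threshold relies on. Note that a pure tone also predicts well, so the error alone is only one ingredient of a voice decision.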

[0047] The third-stage processing described above corresponds to steps s6 and s7 in the flowchart of FIG. 3. That is, when the second-stage processing (determining whether the sound is voice-like) determines that the sound is voice-like, the voice determination processing performs feature extraction (step s6) and determines from the result whether the sound is a human voice (step s7). If it is determined to be a human voice, it is treated as recognition target voice, its feature data is sent to the voice recognition means 5, and processing proceeds to recognition. If step s7 determines that the sound is not a human voice, it is treated as not being recognition target voice, and the device returns to the sleep state. The voice recognition means 5 then performs recognition processing (step s8), and the device returns to the sleep state when recognition is complete.

[0048] As described above, the present invention enters recognition processing only after the input sound has passed through the first to third stages and has been determined to be a human voice.

[0049] That is, the first stage detects the input sound intermittently and performs only the processing that decides, from the level, whether an input sound is present; the second-stage processing is performed when an input sound at or above the predetermined level exists. The second stage determines whether the input sound at or above the predetermined level is sudden noise or a voice-like sound, and the third-stage processing is entered only when it is a voice-like sound. The third stage determines whether the voice-like sound is a human voice and, if so, passes the feature data to the voice recognition means 5 as a recognition target.
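The three-stage flow can be summarized as a cascade that stops at the first failed check. The sketch below is only a control-flow illustration: the stage predicates, the dictionary keys, and the thresholds are hypothetical stand-ins for the level, duration, and feature-extraction checks described in the text.

```python
def detect_recognition_target(window, stages):
    """Run the stage checks in order and stop at the first failure.

    Returns (passed_all, names_of_stages_run). A False result corresponds
    to returning to the sleep state without running the later stages.
    """
    stages_run = []
    for name, check in stages:
        stages_run.append(name)
        if not check(window):
            return False, stages_run
    return True, stages_run


# Hypothetical stage predicates over precomputed measurements.
STAGES = [
    ("level", lambda w: w["average_power"] >= 0.01),
    ("duration", lambda w: w["duration_seconds"] >= 0.1),
    ("voice", lambda w: w["lpc_error"] < 0.1),
]

door_slam = {"average_power": 0.5, "duration_seconds": 0.03, "lpc_error": 0.9}
utterance = {"average_power": 0.2, "duration_seconds": 0.6, "lpc_error": 0.05}
print(detect_recognition_target(door_slam, STAGES))
print(detect_recognition_target(utterance, STAGES))
```

Ordering the checks from cheapest to most expensive is what lets most non-voice sounds be rejected before the costly analysis ever runs.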

[0050] The time required for the first to third stages is very short compared with the utterance time of a recognition target voice such as "What time is it now?", so in practice it poses almost no obstacle to recognizing the recognition target voice.

[0051] As described above, in this embodiment the first-stage state of waiting for sound is an intermittent operation that is active for 0.1 second out of every 0.5 second, so the current consumption is 1/5 of that required when the device waits for input voice continuously.
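The 1/5 figure follows directly from the duty cycle. The short sketch below reproduces the arithmetic; it assumes, as the text implicitly does, that the current drawn while asleep is negligible, and the current values are hypothetical units.

```python
def duty_cycle(active_seconds, period_seconds):
    """Fraction of each period spent in the operating state."""
    return active_seconds / period_seconds


def average_current(active_current, sleep_current, duty):
    """Time-weighted average current over one intermittent-drive period."""
    return duty * active_current + (1.0 - duty) * sleep_current


duty = duty_cycle(0.1, 0.5)  # 0.1 s active in every 0.5 s
# Hypothetical currents: 10 units while active, 0 while asleep.
print(duty, average_current(10.0, 0.0, duty))
```

If the sleep current is not zero, the second function shows how much of the saving is lost.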

[0052] Incidentally, a D-size dry cell has about 4.5 times the capacity of an AA-size dry cell, so if the current consumption falls to 1/5, an AA cell can deliver the same battery life as a D cell.
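The battery comparison can be checked with a simple capacity-over-current estimate. The sketch below uses relative units (AA capacity = 1, continuous-listening current = 1); the linear life model is an assumption that ignores discharge-rate effects.

```python
def battery_life(capacity, current):
    """Idealized battery life: capacity divided by average current."""
    return capacity / current


# Relative units: the D cell has about 4.5 times the AA capacity.
d_cell_continuous = battery_life(4.5, 1.0)    # always listening
aa_cell_intermittent = battery_life(1.0, 0.2)  # current reduced to 1/5
print(d_cell_continuous, aa_cell_intermittent)
```

Under this model the AA cell at 1/5 current lasts at least as long as the D cell at full current.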

[0053] In addition, each successive stage from the first onward moves to processing that draws more current and runs longer. Processing moves to the next stage only when the condition set for the current stage is satisfied, and the device returns to the sleep state when the condition at any stage is not satisfied, so no wasted processing is performed, which further reduces current consumption. In particular, the processing from the third stage onward is substantially voice recognition processing, and the device is then close to full operation; if a condition is not satisfied before the third stage is reached, the device returns to the sleep state, so no wasted recognition operation is performed.

[0054] The example described above is one embodiment, and the invention is not limited to it. For example, in the embodiment above the intermittent drive control means 6 drives the sound input means 1 for 0.1 second out of every 0.5 second, that is, 0.1 second in the operating state followed by 0.4 second of rest. This operating frequency can be set arbitrarily, and the operating interval judged most suitable can be chosen in view of the characteristics of the device. If the frequency is too low, however, input sounds are likely to be missed, so a certain frequency is necessary.

[0055] The input level determination means 2 used in the first-stage processing need not be the one shown in FIG. 2; it may instead be configured as shown in FIG. 5(a) or FIG. 5(b), for example.

[0056] The configuration of FIG. 5(a) comprises a low-pass filter 215, an average power calculation unit 216, a reference level storage unit 217, a comparison unit 218, and an input level determination result output unit 219. The low-pass filter 215 here passes frequency components of 4 kHz or less. The average power calculation unit 216 calculates the average power of the components of 4 kHz or less, the comparison unit 218 compares that average power with the reference level stored in the reference level storage unit 217, and an input level determination result is output based on the comparison.

[0057] The frequency threshold is set to 4 kHz because most of the human voice lies at 4 kHz or below. Consequently, when the average power of the components of 4 kHz or less exceeds the reference level, the sound can be judged to be possibly a human voice. In other words, a sound with large average power at 4 kHz or above can be regarded as different from a human voice, so such sounds are excluded from processing. Thus a sound with a certain power within the frequency range of the human voice is first extracted as the determination result, and the second and subsequent stages are applied to that sound.

[0058] The configuration of FIG. 5(b) comprises a low-pass filter 215 that passes frequency components of 4 kHz or less, a high-pass filter 220 that passes frequency components above 4 kHz, an average power calculation unit 221 that calculates the average power of the components of 4 kHz or less, an average power calculation unit 222 that calculates the average power of the components above 4 kHz, a comparison unit 223 that takes the difference or ratio of these average powers, and an input level determination result output unit 224.

[0059] Providing two frequency band filters, one for the high band and one for the low band, and taking the difference or ratio of their outputs makes it possible to distinguish the human voice from other noise more accurately. For example, if the average power of the components above 4 kHz is far larger than that of the components of 4 kHz or less, the sound can be judged likely to be noise rather than a human voice; conversely, if the average power of the components of 4 kHz or less is far larger than that of the components above 4 kHz, the sound can be judged to be possibly a human voice. When power is spread evenly across both the low-frequency and high-frequency components, the ratio between the two is small, and in this case too the sound can be judged to be noise other than a human voice.
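The low-band versus high-band comparison can be sketched in software as below. The source describes a low-pass and a high-pass filter feeding two average power calculators; this sketch swaps in a plain discrete Fourier transform to split the power at 4 kHz, which gives the same two band powers but is not the filter structure of the figure. The sample rate, window length, and `ratio_threshold` are assumptions.

```python
import cmath
import math


def band_powers(samples, sample_rate, cutoff_hz=4000.0):
    """Average power at or below and above cutoff_hz, via a direct DFT."""
    n = len(samples)
    low = high = 0.0
    for k in range(n // 2 + 1):
        coeff = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        power = abs(coeff) ** 2 / (n * n)
        if 0 < k < n / 2:
            power *= 2.0  # include the mirrored negative-frequency bin
        if k * sample_rate / n <= cutoff_hz:
            low += power
        else:
            high += power
    return low, high


def may_be_voice(low, high, ratio_threshold=10.0):
    """True only when the low band clearly dominates the high band."""
    return low > ratio_threshold * high


def tone(freq_hz, sample_rate=16000, n=256):
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]


low_tone = tone(1000)    # energy below 4 kHz
high_tone = tone(6000)   # energy above 4 kHz
both = [a + b for a, b in zip(low_tone, high_tone)]  # evenly spread
for signal in (low_tone, high_tone, both):
    print(may_be_voice(*band_powers(signal, 16000)))
```

Only the low-dominated signal passes; the high-dominated signal and the evenly spread signal are both rejected, matching the three cases described above.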

[0060] By providing filters for two frequency bands and basing the determination on the average power of the components passing through each filter, even the first-stage processing can roughly determine whether the sound is a human voice or some other sound, which makes the subsequent processing still more efficient.

[0061] The input level determination means 2 may also combine the means shown in FIG. 2 with either of the means shown in FIGS. 5(a) and 5(b). For example, when the means of FIG. 2 and FIG. 5(a) are combined, the average power of the input sound is first calculated and compared with the reference level; if it is at or above the reference level, the sound is passed through the low-pass filter, the average power of the components of 4 kHz or less is calculated and compared with a reference level, and the input level determination result is output based on that comparison.

[0062] In the embodiment above, the sound determination means 3 receives the input level determination result from the input level determination means 2, examines the duration of the sound at or above the predetermined level, and detects voice-like sounds by excluding sudden noise from processing. The sound determination means 3 may instead detect voice-like sounds by measuring the number of zero crossings, as shown in FIG. 6, for example. The configuration of FIG. 6 comprises a zero-crossing count measurement unit 35, a timer unit 36, and a sound determination result output unit 37; by examining the number of zero crossings of the sound signal input to the sound input means 1, it roughly determines whether the sound is a human voice or some other sound.

[0063] That is, the number of zero crossings of the human voice in a given period is known in advance, so the zero crossings of the input sound within a given period are counted and the sound determination is based on that count. In this way, even among sounds that satisfied the condition set in the first stage, sounds such as a telephone ring, a chime, a musical instrument, or machinery can be distinguished from the human voice, and only sounds that are more likely to be a human voice are selected.
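Counting zero crossings over a fixed window can be sketched as follows. The source only says that the expected count for the human voice is known in advance; the accepted range used here (10 to 60 crossings in 0.1 second) and the test tones are hypothetical.

```python
import math


def count_zero_crossings(samples):
    """Number of sign changes between consecutive samples."""
    return sum(1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0))


def is_voice_like_by_crossings(samples, min_count, max_count):
    """Accept only counts inside the range assumed typical of a voice."""
    return min_count <= count_zero_crossings(samples) <= max_count


def tone(freq_hz, sample_rate=8000, n=800, phase=0.1):
    return [math.sin(phase + 2 * math.pi * freq_hz * t / sample_rate)
            for t in range(n)]


low_pitch = tone(100)     # 10 cycles in the 0.1 s window
high_pitch = tone(2000)   # for example, a ringing or chime-like tone
print(count_zero_crossings(low_pitch), count_zero_crossings(high_pitch))
print(is_voice_like_by_crossings(low_pitch, 10, 60),
      is_voice_like_by_crossings(high_pitch, 10, 60))
```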

[0064] The sound determination means 3 may also combine the means described earlier, which examines the duration to remove sudden noise, with the means shown in FIG. 6. For example, the duration is determined first; if the sound is determined to continue for the predetermined time or longer, the number of zero crossings of the input sound is examined, and that count is used to determine whether the sound is voice-like or noise. This makes it possible to determine with high accuracy whether the input sound is likely to be a human voice.

[0065] The third-stage processing performed by the voice determination means 4 removes non-voice sounds through voice feature extraction such as LPC analysis, but it will judge a human voice coming from, for example, a television or radio to be recognition target voice. When feature data for such a human voice that is not a recognition target is given to the voice recognition means 5, the voice recognition means 5 may react to it and give a nonsensical response. To exclude voices other than the recognition target voice, the subsequent recognition processing in the voice recognition means 5 may be performed using a keyword.

[0066] That is, a keyword is registered in advance as one of the registered words that the voice recognition means 5 can recognize, and the device is set so that recognition operation is enabled by inputting a voice that contains the keyword.

[0067] For example, consider a clock that responds with the current time when asked the time. A keyword such as "Taro" is registered in advance, and the user asks the time with a phrase that contains the keyword, for example "Taro, what time is it now?" rather than simply "What time is it now?". The device accepts a voice as recognition target voice only when it contains the keyword. Because a voice without the keyword is not accepted as recognition target voice, the device no longer reacts to a human voice from a television or radio and gives a nonsensical response, as described earlier, and this too reduces wasted current consumption.
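The keyword gate amounts to a membership test on the recognized words. The sketch below assumes the recognizer has already produced a list of words; the function name and the word lists are hypothetical, and "Taro" is the example keyword from the text.

```python
def accept_utterance(recognized_words, keyword):
    """Accept the utterance only if the registered keyword was recognized."""
    return keyword in recognized_words


KEYWORD = "Taro"
addressed = ["Taro", "what", "time", "is", "it", "now"]
television = ["what", "time", "is", "it", "now"]
print(accept_utterance(addressed, KEYWORD), accept_utterance(television, KEYWORD))
```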

[0068] The embodiment described above is an example of a preferred embodiment of the present invention, but the invention is not limited to it, and various modifications can be made without departing from the gist of the invention.

[0069] The processing program that performs the processing of the present invention can be stored on a storage medium such as a floppy disk, an optical disk, or a hard disk, and the present invention includes such storage media. The data may also be obtained from a network.

【００７０】[0070]

[Effects of the Invention] As described above, according to the present invention, the sound input means is driven intermittently and sound input is performed only while the sound input means is in the operating state, so the current consumption in the waiting state can be kept low.

[0071] In the present invention, the processing performed while the sound input means is in the operating state is divided into several stages. First, detection of the presence or absence of sound, which takes little processing time and little current, is performed as the first stage. For a sound signal that passes the first stage, the second stage determines what kind of sound it is. If the second stage determines that the sound is voice-like, the third stage determines whether it is a human voice. Each successive stage requires more processing time and current, and if the condition at any stage is not satisfied, the device is returned to the non-operating state, that is, to the mode in which only the sound input means is driven intermittently.

[0072] Making each successive stage require more processing time and current in this way addresses the various problems that arise from driving the sound input means intermittently, and also reduces current consumption substantially.

[0073] As a result, a battery-powered device can, for example, obtain from an AA cell the same life as from a D cell. Because a smaller battery capacity gives the same life, the device can be made smaller and lighter. In addition, when the device is sold with batteries included, the smaller battery capacity helps lower the selling price of the device. Various effects such as these are obtained.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the basic configuration of an embodiment of the present invention.

FIG. 2 is a diagram showing an example of the input level determination means shown in FIG. 1.

FIG. 3 is a flowchart illustrating the processing of the embodiment of the present invention.

FIG. 4 is a diagram showing an example of the sound determination means shown in FIG. 1.

FIG. 5 is a diagram showing other examples of the input level determination means shown in FIG. 1.

FIG. 6 is a diagram showing another example of the sound determination means shown in FIG. 1.

[Explanation of symbols]

1 sound input means
2 input level determination means
3 sound determination means
4 voice determination means
5 voice recognition means
6 intermittent drive control means
211, 216, 221, 222 average power calculation units
212, 217 reference level storage units
213, 218, 223 comparison units
214, 219, 224 input level determination result output units
215 low-pass filter
220 high-pass filter
31 duration determination unit
32, 36 timer units
33 duration storage unit
34, 37 sound determination result output units
35 zero-crossing count measurement unit

Claims

[Claims]

1. A method for detecting a voice to be recognized in a voice recognition device for recognizing a voice input to a sound input unit and performing some operation on the recognition result, wherein the sound input unit is intermittently driven, and The sound input means performs a process of determining whether the sound is a recognition target sound for a sound input during the operation state, in a plurality of steps, and in a stepwise manner. After the processing result in step satisfies the condition set in the process, the process of the next stage operates, and as the stages pass, the current consumption is large and the accuracy of determining whether or not the voice is to be recognized increases. A method for detecting a recognition target voice, wherein the method shifts to a process, and in a process in each process, when a condition set in the process is not satisfied, the process is returned to a non-operation state.

2. A method for detecting a recognition target voice in a voice recognition device that recognizes voice input to a sound input means and performs some operation in response to the recognition result, comprising: a first processing step in which the sound input means is driven intermittently, the level of the sound input while the sound input means is in the operating state is detected, the presence or absence of sound is determined from the magnitude of that level, and the device returns to the non-operating state when it is determined that there is no sound; a second processing step that starts operating after the first processing step determines that a sound is present, roughly determines whether the sound is noise or a voice-like sound, and returns to the non-operating state when the sound is determined not to be voice-like; and a third processing step that starts operating after the second processing step determines that the sound is voice-like, determines whether the voice-like sound is a voice, passes its voice feature data to the recognition unit when the sound is determined to be a voice, and returns to the non-operating state when the sound is determined not to be a voice.

3. The first processing step includes determining an average power of a sound input while the sound input means is in an operating state, and comparing the average power with a reference level to determine the presence or absence of a sound. 3. The method according to claim 2, wherein when it is determined that there is no sound, the process returns to the non-operation state.

4. The method according to claim 1, wherein the first processing step divides the sound input during the operation of the sound input unit into at least one of a frequency band including a frequency band of a human voice and a frequency band other than the frequency band. The average power of the frequency band of the frequency band is determined, the sound is determined based on the average power value, and at least when it is determined that the sound is not a human voice, the operation returns to the non-operation state. Recognition target voice detection method.

5. The second processing step measures the duration of the sound signal that satisfies the set conditions in the first processing step, and generates a sound that seems to be sound based on the duration. The recognition target voice detection method according to any one of claims 2 to 4, wherein it is determined whether or not the sound is a voice-like sound.

6. The second processing step measures the number of zero-crossings of the sound signal within a predetermined time period for a sound signal that satisfies the conditions set in the first processing step, and 6. A recognition target voice detection method according to any one of claims 2 to 5, wherein it is determined whether or not the voice is a voice-like sound on the basis of the above, and if it is determined that the sound is not a voice-like sound, the sound returns to a non-operation state. .

7. The recognition target voice detection method according to any one of claims 2 to 6, wherein the third processing step performs voice feature extraction on the sound signal that satisfied the condition set in the second processing step, determines from the extracted voice feature data whether the input sound is a voice, passes the feature data to the recognition unit when the sound is determined to be a voice, and returns to the non-operating state when the sound is determined not to be a voice.

8. The recognition target speech detection method according to claim 7, wherein the recognition unit performs recognition processing only on speech feature data including a preset keyword.

9. A recognition target voice detection device in a voice recognition device that recognizes voice input to a sound input means and performs some operation in response to the recognition result, comprising intermittent drive control means for intermittently driving the sound input means, wherein the processing that determines whether a sound input while the intermittently driven sound input means is in the operating state is a recognition target voice is divided into a plurality of stages and performed stage by stage; the processing means of the next stage operates after the processing result of the processing means currently processing satisfies the condition set for that processing means; each successive stage moves to processing that consumes more current and determines with higher accuracy whether the sound is a recognition target voice; and, when the condition set for a processing means is not satisfied, each processing means is returned to the non-operating state.

10. A recognition target voice detection device in a voice recognition device that recognizes voice input to a sound input device and performs some operation on the recognition result, wherein the intermittent drive control device drives the sound input device intermittently. The sound input means intermittently driven by the intermittent drive control means detects the level of the sound input during the operating state, determines the presence or absence of sound from the level of the level, and determines that there is no sound. In this case, the input level determining means returning to the non-operating state; and starting the operation after the input level determining means determines that there is an input sound, and roughly determining whether the sound is noise or sound like voice. If it is determined that the sound is not a sound like a sound, sound determining means returning to a non-operating state, and operation is started after the sound determining means determines that the sound is a sound like a sound, and the sound like the sound is sound It is determined whether or not the voice data is a voice. If it is determined that the voice is voice, the voice feature data is passed to the recognition unit, and if it is determined that the voice is not voice, the voice determination unit returns to a non-operation state. A recognition target speech detection device, characterized in that:

11. The input level determining means determines an average power of a sound input while the sound input means is in an operating state, and compares the average power with a reference level to determine the presence or absence of a sound. 11. The recognition target speech detection device according to claim 10, wherein when it is determined that there is no sound, the operation returns to the non-operation state.

12. The recognition target speech detection device, wherein the input level determining means divides the sound input while the sound input means is in the operating state into at least a frequency band that includes the frequency band of the human voice and another frequency band, obtains the average power of the frequency bands, determines the sound based on the values of the average power, and returns to the non-operating state at least when it determines that the sound is not a human voice.
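One possible way to picture the band-split determination of claim 12 is the sketch below. Everything specific in it is an assumption: the claim does not say how the bands are separated (a direct discrete Fourier transform is used here purely for brevity), the 300 to 3400 Hz voice band is an illustrative choice, and the decision rule (voice-band power larger than the power outside it) is hypothetical.

```python
import cmath
import math


def band_average_powers(samples, sample_rate, voice_band=(300.0, 3400.0)):
    """Return (power inside voice_band, power outside it) for one frame.

    Uses a direct DFT over the positive-frequency bins; the DC bin is skipped.
    Each bin's contribution is scaled so that the two results sum to the
    average power of the frame with its mean removed.
    """
    n = len(samples)
    low, high = voice_band
    voice_power = 0.0
    other_power = 0.0
    for k in range(1, n // 2 + 1):
        freq = k * sample_rate / n
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(samples)
        )
        # Every bin except the Nyquist bin has a mirror image at -freq.
        weight = 1.0 if 2 * k == n else 2.0
        power = weight * abs(coeff) ** 2 / (n * n)
        if low <= freq <= high:
            voice_power += power
        else:
            other_power += power
    return voice_power, other_power


def looks_like_human_voice(samples, sample_rate, voice_band=(300.0, 3400.0)):
    """Assumed rule: more power inside the voice band than outside it."""
    voice_power, other_power = band_average_powers(samples, sample_rate, voice_band)
    return voice_power > other_power
```

A low-power device would more plausibly use analog or simple digital band-pass filters than a full transform; the transform is used here only to keep the sketch short.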

13. The recognition target speech detection device according to any one of claims 10 to 12, wherein the sound determining means measures the duration of the sound signal that satisfies the condition set by the input level determining means, determines based on the duration whether or not the sound is a speech-like sound, and returns to the non-operating state when it determines that the sound is not a speech-like sound.
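The duration-based determination of claim 13 might look like the following sketch. It assumes the input signal has already been cut into frames and that each frame carries a flag saying whether it satisfied the input level condition; the function names and the lower and upper duration limits are hypothetical, since the claim states only that the determination is based on the duration.

```python
def longest_active_duration(frame_active, frame_seconds):
    """Length in seconds of the longest run of consecutive active frames."""
    longest = 0
    run = 0
    for active in frame_active:
        run = run + 1 if active else 0
        longest = max(longest, run)
    return longest * frame_seconds


def duration_speech_like(frame_active, frame_seconds, min_seconds, max_seconds):
    """Assumed rule: speech-like when the duration lies within the given limits.

    Very short bursts (clicks) fall below min_seconds; sounds that never stop
    (steady background noise) exceed max_seconds.
    """
    duration = longest_active_duration(frame_active, frame_seconds)
    return min_seconds <= duration <= max_seconds
```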

14. The recognition target speech detection device according to claim 10, wherein the sound determining means measures the number of zero crossings, within a predetermined time, of the sound signal that satisfies the condition set by the input level determining means, determines based on the number of zero crossings whether or not the sound is a speech-like sound, and returns to the non-operating state when it determines that the sound is not a speech-like sound.
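The zero-crossing count of claim 14 can be sketched as below. The counting convention (samples equal to zero are skipped, and a crossing is counted whenever the sign differs from the previous non-zero sample) and the acceptance range are assumptions; the claim states only that the number of zero crossings within a predetermined time is the basis for the determination.

```python
def zero_crossings(samples):
    """Count sign changes between successive non-zero samples."""
    count = 0
    previous_positive = None
    for s in samples:
        if s == 0:
            continue  # a zero sample carries no sign of its own
        positive = s > 0
        if previous_positive is not None and positive != previous_positive:
            count += 1
        previous_positive = positive
    return count


def zero_crossing_speech_like(samples, min_crossings, max_crossings):
    """Assumed rule: speech-like when the count lies within the given range.

    `samples` is taken to be exactly the predetermined time window.
    """
    return min_crossings <= zero_crossings(samples) <= max_crossings
```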

15. The recognition target speech detection device according to any one of claims 10 to 14, wherein the speech determining means performs speech feature extraction processing on the sound signal that satisfies the condition set by the sound determining means, determines based on the resulting speech feature data whether or not the input sound is speech, passes the feature data to the recognition unit when it determines that the sound is speech, and returns to the non-operating state when it determines that the sound is not speech.

16. The recognition target speech detection device according to claim 15, wherein the recognition unit recognizes only the voice feature data including a preset keyword as the recognition target speech.