JPS60104999A - Voice recognition equipment

Info

Publication number: JPS60104999A
Application number: JP58214331A
Authority: JP
Inventors: 武志則松
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1983-11-14
Filing date: 1983-11-14
Publication date: 1985-06-10

Abstract

(57) [Summary] This publication contains application data that predates electronic filing, so no abstract data is recorded.

Description

[Detailed description of the invention]

Field of industrial application

The present invention relates to a speech recognition device intended for a specific speaker.

Conventional configuration and its problems

A word speech recognition device for a specific speaker generally matches the input speech pattern against the pre-stored pattern of each word and takes the pattern with the highest similarity as the recognition result. In places with loud ambient noise, however, the noise mixes with the input speech and becomes a major cause of misrecognition. One countermeasure is to divide the ambient noise input into frequency bands and store it, then subtract the noise energy from the energy of each band of the input speech to obtain the speech signal, thereby improving the recognition rate in noise.
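The band-wise subtraction described for the conventional method can be sketched as follows. This is a minimal illustration, not the circuit of any particular device: the function name `subtract_noise` is hypothetical, and clamping negative results to zero is an assumption that the text does not state.

```python
from typing import List, Sequence


def subtract_noise(band_energy: Sequence[float],
                   noise_energy: Sequence[float]) -> List[float]:
    """Subtract stored per-band noise energy from per-band input energy."""
    if len(band_energy) != len(noise_energy):
        raise ValueError("input and noise must have the same number of bands")
    # Assumption: an energy estimate cannot be negative, so clamp at zero.
    return [max(e - n, 0.0) for e, n in zip(band_energy, noise_energy)]
```

When the noise estimate in a band is larger than the input energy, the speech content of that band is lost entirely, which is one way to picture the distortion that subtraction can cause under high noise.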

However, under high noise, subtracting the noise signal also distorts the input speech signal itself, which leads to misrecognition. In particular, when a speech recognition device is used where reliability is required, the question is how to suppress misrecognition.

Purpose of the invention

To solve the above problems, the present invention provides a speech recognition device that suppresses misrecognition as far as possible by first analyzing the ambient noise conditions and automatically determining whether recognition is possible.

Structure of the invention

The present invention comprises: a storage means that stores, as standard patterns, the time series of feature vectors of each word to be recognized; a detection means that detects whether a person has started the recognition device, by a switch or the like, and starts processing; a noise analysis means that determines whether the ambient noise is at a level that permits accurate recognition; and a time monitoring means that determines whether speech is input within a predetermined time. When speech is input, the recognition means matches the input speech against the standard patterns stored in the storage means, and the result is output as speech by the speech synthesis means.

Description of the embodiment

FIG. 1 is a block diagram showing an embodiment of the present invention. Reference numeral 1 denotes a detection means that detects whether information for starting recognition processing has been received and starts the next process. Reference numeral 2 denotes a noise analysis means that takes in the ambient noise and determines whether the noise energy is low enough for recognition to be performed. If recognition is possible, the display lamp 6 is lit. Reference numeral 3 denotes a time monitoring means that determines whether speech has been input within a fixed time after the display lamp 6 is lit; if there is no input within that time, the subsequent processing is stopped. Reference numeral 4 denotes a recognition means that matches the input speech against the standard patterns stored by the storage means 7 and outputs the standard pattern with the highest similarity as the recognition result.

Reference numeral 5 denotes a speech synthesis means that is controlled by the output of the recognition means 4 and speaks the recognition result.

FIG. 2 is a circuit diagram showing the specific configuration of this embodiment.

Reference numeral 10 denotes a microcomputer that implements means 2, 3, 4, and 7 of FIG. 1. It is equivalently composed of a storage unit 10b that stores, as standard patterns, the feature parameters of the group of words to be recognized; an arithmetic control unit 10c that analyzes the ambient noise and performs the recognition processing that identifies the input speech; an input unit 10a; and an output unit 10d.

Reference numeral 8 denotes a microphone that takes in speech and picks up ambient noise, and 9 denotes an A/D converter that A/D-converts the input signal from the microphone 8 and extracts feature parameters. Reference numeral 13 denotes a switch that starts recognition processing. The switch 13 may be replaced with an infrared sensor so that the recognition device operates when a person's presence at a predetermined location is detected. Reference numeral 12 denotes a recognition result display that outputs the recognition result as text.

FIG. 3 is a flowchart of the main part, explaining the operation of the microcomputer of this embodiment.

The operation of this embodiment, configured as above, is described in detail below with reference to the flowchart of FIG. 3.

In the speech recognition device of this embodiment, operation starts when a person who wants to use the device presses the switch 13. First, step 14 monitors whether the switch 13 is pressed. When a press is detected, the resulting output enters the input unit 10a of the microcomputer 10 and opens the recognition window. Step 15 then picks up and analyzes the noise: ambient noise enters through the microphone 8, the noise signal converted by the A/D converter 9 is input to the microcomputer 10, and the arithmetic control unit 10c compares the energy level of this noise with a threshold (the noise level at which the subsequent recognition processing can still be carried out adequately). Step 16 makes the decision.
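The noise check of steps 15 and 16 can be sketched as below. The text only says that a noise energy level is compared with a threshold, so the energy measure used here (mean squared amplitude of a block of samples) and the names `noise_energy` and `recognition_allowed` are assumptions made for illustration.

```python
from typing import Sequence


def noise_energy(samples: Sequence[float]) -> float:
    """Mean squared amplitude of a block of samples (an assumed energy measure)."""
    if not samples:
        return 0.0
    return sum(s * s for s in samples) / len(samples)


def recognition_allowed(samples: Sequence[float], threshold: float) -> bool:
    """Return True when the ambient noise is low enough to attempt recognition.

    Recognition is refused when the noise energy is greater than the threshold;
    how the exact boundary value is treated is a choice made for this sketch.
    """
    return noise_energy(samples) <= threshold
```

The threshold itself would have to be tuned to the recognizer, since the text defines it only as the noise level at which recognition can still be carried out adequately.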

If the noise level is greater than the threshold, recognition is judged to be difficult and the process returns to step 14. If step 16 judges that recognition is possible, step 17 lights the display lamp 6 (for example, a light-emitting diode) to tell the person that speech input is permitted. Step 18 then starts counting time inside the microcomputer 10. For the input signal from the microphone 8, step 19 cuts out a speech interval and step 20 judges whether it is actually a speech signal. If it is not speech, step 21 checks whether the counted time has exceeded a fixed duration t. If it has not, the process returns to step 19 and repeats; if it has, recognition processing is aborted and step 24 turns the display lamp 6 OFF. The judgment in step 20 depends on whether the steady-state portion at the beginning of the word in the input signal data has a formant structure; if it does, the signal is judged to be speech. To cut out a speech interval in step 19, a portion of the input signal is regarded as a speech interval if it satisfies the following conditions: it is preceded and followed by non-speech intervals of at least a certain duration, its length falls within a certain range, and the maximum energy within it is at or above a certain threshold.
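The three conditions for cutting out a speech interval in step 19 can be sketched as a search over per-frame energies. This is a simplified illustration: it treats a speech candidate as one contiguous run of frames above a silence level, and the function name `find_speech_interval` and all parameter names are hypothetical. The formant check of step 20 is not modeled.

```python
from typing import Optional, Sequence, Tuple


def find_speech_interval(energy: Sequence[float],
                         silence_level: float,
                         min_silence: int,
                         min_len: int,
                         max_len: int,
                         peak_threshold: float) -> Optional[Tuple[int, int]]:
    """Return (start, end) frame indices, end exclusive, of the first run that
    satisfies all three conditions, or None if no run qualifies."""
    n = len(energy)
    i = 0
    while i < n:
        if energy[i] <= silence_level:
            i += 1
            continue
        start = i
        while i < n and energy[i] > silence_level:
            i += 1
        end = i
        # Condition 1: enough silent frames immediately before and after the run.
        lead, j = 0, start - 1
        while j >= 0 and energy[j] <= silence_level:
            lead, j = lead + 1, j - 1
        trail, j = 0, end
        while j < n and energy[j] <= silence_level:
            trail, j = trail + 1, j + 1
        # Condition 2: length within range. Condition 3: peak energy high enough.
        if (lead >= min_silence and trail >= min_silence
                and min_len <= end - start <= max_len
                and max(energy[start:end]) >= peak_threshold):
            return (start, end)
    return None
```

A real front end would also need to tolerate short pauses inside a word, which this sketch would split into separate runs.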

When speech is input, the recognition processing of step 22 matches the input speech against the standard patterns stored in the storage unit 10b and takes the pattern with the highest similarity as the recognition result.
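The matching of step 22 can be illustrated with a small template matcher. The text does not say how similarity is computed, so dynamic time warping (DTW) with Euclidean frame distance is used here purely as an assumed stand-in; the smallest DTW distance is treated as the highest similarity. The names `dtw_distance` and `recognize` are hypothetical.

```python
import math
from typing import Dict, Sequence

Frame = Sequence[float]


def frame_distance(a: Frame, b: Frame) -> float:
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a: Sequence[Frame], seq_b: Sequence[Frame]) -> float:
    """Accumulated DTW cost between two time series of feature vectors."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three allowed predecessor paths.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognize(input_seq: Sequence[Frame],
              templates: Dict[str, Sequence[Frame]]) -> str:
    """Return the word whose stored pattern is closest to the input."""
    if not templates:
        raise ValueError("no standard patterns stored")
    return min(templates, key=lambda word: dtw_distance(input_seq, templates[word]))
```

Because the device is speaker-dependent, the `templates` dictionary in this sketch would be filled from utterances of the intended speaker.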

This recognition result, as the output of the microcomputer 10, drives the speech synthesis means 6 and the speech memory 11, and step 23 speaks the result as synthesized speech. The recognition result is also shown on the recognition result display 12. When this series of recognition processes is finished, a control signal is sent from the output unit 10d of the microcomputer 10 to the display lamp 6, which turns OFF; the recognition window closes and recognition ends.
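The overall flow from the noise check to the lamp being turned OFF can be sketched as one function. This is a hedged outline of the control flow only: the hardware is replaced by callables passed in by the caller, and every name here (`run_recognition_cycle`, `noise_ok`, `set_lamp`, `next_utterance`, `speak`, `now`, `timeout_s`) is hypothetical.

```python
from typing import Callable, Optional


def run_recognition_cycle(noise_ok: Callable[[], bool],
                          set_lamp: Callable[[bool], None],
                          next_utterance: Callable[[], Optional[object]],
                          recognize: Callable[[object], str],
                          speak: Callable[[str], None],
                          now: Callable[[], float],
                          timeout_s: float) -> Optional[str]:
    """Run one cycle after the start switch is pressed.

    Returns the recognized word, or None if recognition was refused or timed out.
    """
    if not noise_ok():                      # steps 15-16: too noisy, refuse
        return None
    set_lamp(True)                          # step 17: prompt the speaker
    try:
        start = now()                       # step 18: start counting time
        while now() - start <= timeout_s:   # step 21: time limit not exceeded
            utterance = next_utterance()    # steps 19-20: None means no speech yet
            if utterance is not None:
                word = recognize(utterance) # step 22: pattern matching
                speak(word)                 # step 23: synthesized speech output
                return word
        return None                         # timed out without speech
    finally:
        set_lamp(False)                     # lamp OFF, recognition window closed
```

Passing the clock in as `now` keeps the sketch testable without real time passing; on a device it could be a monotonic timer.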

With the configuration of the above embodiment, recognition is not accepted when the ambient noise level is high, on the grounds that it cannot be performed accurately. This prevents the misrecognition that results from forcing recognition under high noise. In addition, limiting the time for speech input by monitoring a timer helps prevent misrecognition caused by, for example, the voices of speakers other than the intended one.

Effects of the invention

As described above, the present invention has a noise analysis means that, when a person presses the switch, automatically analyzes the ambient noise conditions and interrupts recognition processing if the noise level is at or above a certain threshold. It also has a time monitoring means that interrupts recognition if no speech is input within a fixed time after the recognition window opens. Together these head off the causes of misrecognition, so the invention can provide a speech recognition device that suppresses misrecognition as far as possible.

In addition, the display lamp prompts speech input and the recognition result can be confirmed by ear through the speech synthesis means, so the invention can provide a speech recognition device that is very easy for the speaker to use.

[Brief explanation of the drawing]

FIG. 1 is a configuration diagram showing an embodiment of the present invention, FIG. 2 is a circuit diagram showing a specific configuration, and FIG. 3 is a flowchart of the main part for explaining the operation.

1: detection means, 2: noise analysis means, 3: time monitoring means, 4: recognition means, 6: speech synthesis means, 7: storage means, 8: microphone, 9: A/D converter, 10: microcomputer, 11: speech memory.

Name of agent: Patent attorney Toshio Nakao and one other

[Figures: FIG. 1, FIG. 3]

Claims

[Claims]

A speech recognition device comprising: a detection means for detecting information for starting recognition processing; a noise analysis means that, after recognition processing has been started by the detection means, determines whether the ambient noise level is low enough for accurate speech recognition and rejects recognition if the noise level is at or above a certain threshold; a display lamp that prompts speech input when the noise analysis means determines that recognition is possible; a timer that measures time in order to stop recognition if no speech is input within a certain period; a recognition means that recognizes the input speech and derives a recognition result; and a speech synthesis means that outputs the recognition result as speech.