JP2003330491A

JP2003330491A - Method, device, and program for voice recognition

Info

Publication number: JP2003330491A
Application number: JP2002135377A
Authority: JP
Inventors: Toru Iwazawa; 透岩沢
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2002-05-10
Filing date: 2002-05-10
Publication date: 2003-11-19

Abstract

PROBLEM TO BE SOLVED: To provide a voice recognition device which enables a user to easily specify a cause of erroneous recognition.

SOLUTION: The voice recognition device has a voice input part 1 for outputting voice data corresponding to an input voice; a voice recognition part 2 for collating the voice data with standard patterns of a preliminarily registered voice recognition vocabulary to recognize the input voice; a condition detection part 3 for examining whether at least one of a plurality of different factors causing trouble in voice recognition exists in the voice data; a voice recognition result discrimination part 4 for judging that the voice recognition is proper when none of the factors is detected by the condition detection part 3 and that it is not proper when any of the factors is detected; and an operation part 5 for executing, when the voice recognition is judged not to be proper, the corresponding one of a plurality of preliminarily set response operations that correspond to the respective factors.

COPYRIGHT: (C)2004,JPO

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to an apparatus and a method for real-environment-oriented voice recognition that assumes ambient noise, such as environmental noise and nearby conversation, will be input as unwanted sound. The present invention further relates to a program for executing such real-environment-oriented voice recognition.

[0002]

[Description of the Related Art] Voice recognition devices that recognize a speaker's voice by acoustic analysis (spectral analysis) are known. Such devices include limited-vocabulary voice recognition devices, which accept speech drawn from a limited (predetermined) vocabulary, and arbitrary-vocabulary voice recognition devices, which accept speech with an arbitrary vocabulary; some have been put into practical use. There are several kinds of limited-vocabulary voice recognition devices. Among them, discrete-word voice recognition devices, which recognize speech uttered one word at a time, are technically straightforward, and many such products have been offered to date.

[0003] The main part of a discrete-word voice recognition device consists of: a memory unit in which short-time spectra of the voice recognition vocabulary (limited vocabulary) are registered in advance as standard patterns; an acoustic analysis unit that computes the short-time spectrum of the speaker's voice (input voice) received from a microphone in frames of, for example, 10 to 20 ms; and a voice recognition unit that matches the short-time spectra obtained by the acoustic analysis unit against the standard patterns registered in the memory unit and outputs the best-matching standard pattern as the recognition result.
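As a rough illustration of the frame-based acoustic analysis described above, the following Python sketch splits a sampled signal into fixed-length frames and computes a magnitude spectrum for each frame. The function names, the non-overlapping framing, and the naive DFT are assumptions made for illustration only; the publication does not say how the short-time spectrum is computed.

```python
import cmath
import math


def split_frames(samples, sample_rate, frame_ms=20):
    """Split samples into non-overlapping frames of frame_ms milliseconds.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (an FFT would be used in practice)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


def short_time_spectra(samples, sample_rate, frame_ms=20):
    """Return one magnitude spectrum per frame."""
    return [magnitude_spectrum(f)
            for f in split_frames(samples, sample_rate, frame_ms)]
```

With a 20 ms frame, bin k of each spectrum corresponds to a frequency of k x 50 Hz, so a pure 100 Hz tone peaks at bin 2.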

[0004] One known way of matching against the standard patterns in the voice recognition unit is the DP matching method, which uses dynamic programming (DP). In DP matching, standard patterns corresponding to the candidate words to be recognized are created in advance, and the feature pattern obtained by analyzing the input voice is matched against every standard pattern while the time axes are aligned, so that the most similar standard pattern is extracted.
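The DP matching idea can be sketched in a few lines of Python. This is a minimal sketch, assuming a Euclidean local distance and the basic symmetric step pattern (match, insertion, deletion); the publication names the technique but does not fix these details, and the names `dp_distance` and `recognize` are illustrative.

```python
import math


def dp_distance(a, b):
    """Accumulated DP-matching (dynamic time warping) cost between two
    sequences of feature vectors, using Euclidean local distance."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            # Extend the cheapest of the three admissible predecessor paths.
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]


def recognize(features, standard_patterns):
    """Return the word whose standard pattern has the lowest DP cost."""
    return min(standard_patterns,
               key=lambda word: dp_distance(features, standard_patterns[word]))
```

Because the time axis is warped during matching, an utterance spoken more slowly than the registered pattern can still align with it at low cost.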

[0005] Unwanted sound such as ambient noise can cause voice recognition to fail. As a means of preventing such failures, Japanese Patent Laid-Open No. 9-146586 discloses a voice recognition device that prompts the speaker to speak louder when the ambient noise is high. This device determines whether the energy of the input voice is below a predetermined threshold and, if it is, warns the user to speak louder.

[0006] Insufficient speaker volume is also a problem in voice recognition. To address it, Japanese Patent Laid-Open No. 2000-155600 discloses a device that warns the speaker to speak at an appropriate input voice level when the input voice level is outside the appropriate range. In addition, Japanese Patent Laid-Open No. 2000-75893 discloses a technique that determines, before voice recognition is performed, whether the voice data is suitable for voice recognition.

[0007]

[Problems to Be Solved by the Invention] In voice recognition, when unwanted sound such as environmental noise or surrounding conversation is superimposed on the speaker's voice, the accuracy of the acoustic analysis drops and the recognition rate falls. Such unwanted sound does not come only from the surroundings; the system itself may be a noise source. Even in quiet conditions with little ambient noise, reverberation may have an effect, depending on the environment. Meanwhile, as a general problem of voice recognition, recognition accuracy also degrades because of utterance defects that stem from the user's lack of knowledge about how to speak to the device, such as uttering words outside the recognition vocabulary, speaking too quietly, or starting to speak before the recognition engine is running.

[0008] As described above, the factors that hinder voice recognition in a real environment are numerous and complex. Consequently, with the conventional voice recognition devices described above, it was difficult for the user to identify the cause when a misrecognition occurred. Moreover, when ambient noise is misrecognized as a word in the voice recognition vocabulary, the voice recognition device malfunctions unexpectedly, and it was difficult for the user to identify the cause of that malfunction.

[0009] The device described in Japanese Patent Laid-Open No. 9-146586 warns the user to speak louder when the energy of the input voice is below a predetermined threshold, but it does not identify the multiple factors that hinder voice recognition described above.

[0010] Likewise, the device described in Japanese Patent Laid-Open No. 2000-155600 merely warns the speaker to speak at an appropriate input voice level when the input voice level is outside the appropriate range; it too does not identify the multiple factors that hinder voice recognition described above.

[0011] Japanese Patent Laid-Open No. 2000-75893 likewise only determines, before voice recognition is performed, whether the voice data is suitable for voice recognition; it too does not identify the multiple factors that hinder voice recognition described above.

[0012] An object of the present invention is to solve the above problems and to provide a voice recognition device, a voice recognition method, and a program that allow a user to easily identify the cause of a misrecognition.

[0013]

[Means for Solving the Problems] To achieve the above object, the voice recognition device of the present invention comprises: voice input means for outputting voice data corresponding to an input voice; voice recognition means for recognizing the input voice by collating the voice data with standard patterns of a voice recognition vocabulary registered in advance; situation detection means for checking whether at least one of a plurality of different factors that cause problems in the voice recognition is present in the voice data; voice recognition result determination means for determining that the voice recognition is valid when the situation detection means detects none of the plurality of factors, and that the voice recognition is not valid when any of the plurality of factors is detected; and response means in which a plurality of response operations respectively corresponding to the plurality of factors are set in advance and which, when the voice recognition result determination means determines that the voice recognition is not valid, selectively executes the corresponding response operation from among the plurality of response operations.

[0014] The voice recognition method of the present invention comprises: a first step of outputting voice data corresponding to an input voice; a second step of recognizing the input voice by collating the voice data with standard patterns of a voice recognition vocabulary registered in advance; a third step of checking whether at least one of a plurality of different factors that cause problems in the voice recognition is present in the voice data; a fourth step of determining that the voice recognition is valid when none of the plurality of factors is detected in the third step, and that the voice recognition is not valid when any of the plurality of factors is detected; and a fifth step of selectively executing, when the voice recognition is determined not to be valid in the fourth step, the corresponding response operation from among a plurality of preset response operations respectively corresponding to the plurality of factors.

[0015] The program of the present invention causes a computer to execute: a first process of outputting voice data corresponding to an input voice; a second process of recognizing the input voice by collating the voice data with standard patterns of a voice recognition vocabulary registered in advance; a third process of checking whether at least one of a plurality of different factors that cause problems in the voice recognition is present in the voice data; a fourth process of determining that the voice recognition is valid when none of the plurality of factors is detected in the third process, and that the voice recognition is not valid when any of the plurality of factors is detected; and a fifth process of selectively executing, when the voice recognition is determined not to be valid in the fourth process, the corresponding response operation from among a plurality of preset response operations respectively corresponding to the plurality of factors.

[0016] In the present invention as described above, when any of a plurality of different factors that cause problems in voice recognition is contained in the voice data, that factor is detected. Specifically, the various factors that hinder voice recognition are detected: environmental noise such as stationary noise, sudden noise, and reverberation (including noise from the system itself); unwanted sound such as surrounding conversation; and utterance defects. The response operation corresponding to the detected factor is then selected from the response operations prepared in advance and executed, so the user can easily determine the cause of the misrecognition from that response operation.

[0017]

[Embodiments of the Invention] Next, embodiments of the present invention will be described with reference to the drawings.

[0018] (Embodiment 1) FIG. 1 is a block diagram showing the schematic configuration of a voice recognition device according to a first embodiment of the present invention. This voice recognition device comprises a voice input unit 1, a voice recognition unit 2, a situation detection unit 3, a voice recognition result determination unit 4, an operation unit 5, and a voice recognition vocabulary storage unit 21.

[0019] The voice input unit 1 includes, for example, a microphone. When a voice uttered by a speaker (spoken voice) is input from the microphone, the voice input unit 1 outputs voice data corresponding to the input voice (a voice pattern converted to the frequency domain at predetermined time intervals). The voice recognition vocabulary storage unit 21 stores the vocabulary required for voice recognition (the voice recognition vocabulary); for example, standard patterns (voice patterns) of words such as "ohayou" (good morning) and "konnichiwa" (hello) are registered in advance based on the speaker's voice.

[0020] The voice recognition unit 2 has a well-known voice recognition engine that uses the voice recognition vocabulary stored in the voice recognition vocabulary storage unit 21 as its dictionary, and performs voice recognition on the voice data output from the voice input unit 1. Specifically, the voice recognition unit 2 cuts out the voice section from the voice data output from the voice input unit 1, matches the data of the cut-out voice section against the voice recognition vocabulary (standard patterns) stored in the voice recognition vocabulary storage unit 21, and outputs the best-matching vocabulary entry as the recognition result 2a. The recognition result 2a is supplied to the voice recognition result determination unit 4, and at the same time the data of the voice section cut out during recognition (cut-out voice data 2b) is supplied to the situation detection unit 3.
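The publication does not say how the voice section is cut out. One simple possibility, shown here purely as an assumption, is to keep the span between the first and last frames whose energy exceeds a threshold; the function name and the threshold parameter are hypothetical.

```python
def cut_out_voice_section(frame_energies, threshold):
    """Return (start, end) frame indices of the voice section, end exclusive.

    The section runs from the first to the last frame whose energy exceeds
    threshold. Returns None when no frame exceeds it.
    """
    active = [i for i, energy in enumerate(frame_energies)
              if energy > threshold]
    if not active:
        return None
    return active[0], active[-1] + 1
```

Short dips below the threshold inside the utterance stay within the section, because only the outermost active frames define its boundaries.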

[0021] The situation detection unit 3 performs spectrum analysis on the cut-out voice data 2b supplied from the voice recognition unit 2, checks whether the cut-out voice data 2b contains a factor causing a problem on the environmental side or on the utterance side, and outputs the result (the situation detection result) to the voice recognition result determination unit 4. Environmental factors include, for example, stationary noise, that is, continuous noise such as surrounding conversation or other noise (for example, the wind noise of an electric fan). Utterance factors include utterance defects such as a clipped onset (the beginning of the utterance being cut off) and insufficient or excessive power.
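A heavily simplified sketch of such a detector follows. It works on per-frame energies rather than on full spectra, and every threshold, heuristic, and name in it (`Factor`, `detect_factors`, `noise_floor`, `low`, `high`, `noise_limit`) is an assumption made for illustration; the publication only states that each factor shows up as a characteristic spectral pattern.

```python
from enum import Enum


class Factor(Enum):
    STATIONARY_NOISE = "stationary noise"
    HEAD_CUT = "clipped onset"
    LOW_POWER = "insufficient power"
    HIGH_POWER = "excessive power"


def detect_factors(frame_energies, noise_floor,
                   low=1.0, high=100.0, noise_limit=0.5):
    """Return the set of problem factors found in a cut-out voice section.

    frame_energies: per-frame energies of the cut-out section.
    noise_floor: mean energy measured outside the section (assumed input).
    """
    factors = set()
    if not frame_energies:
        return factors
    peak = max(frame_energies)
    # Continuous background sound raises the energy outside the utterance.
    if noise_floor > noise_limit:
        factors.add(Factor.STATIONARY_NOISE)
    # Power deficiency or excess, judged from the loudest frame.
    if peak < low:
        factors.add(Factor.LOW_POWER)
    elif peak > high:
        factors.add(Factor.HIGH_POWER)
    # Heuristic: a section that is already loud in its first frame probably
    # started before capture began, so its onset is missing.
    if frame_energies[0] > 0.5 * peak:
        factors.add(Factor.HEAD_CUT)
    return factors
```

Returning a set rather than a single value lets several factors be reported for the same utterance, which matches the idea of checking for a plurality of different factors.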

[0022] The voice recognition result determination unit 4 determines, based on the situation detection result supplied from the situation detection unit 3, whether there is a fatal defect in the voice recognition performed by the voice recognition unit 2. If there is no fatal defect, it treats the recognition result 2a supplied from the voice recognition unit 2 as valid and forwards it unchanged to the operation unit 5. If there is a fatal defect, it discards the recognition result 2a supplied from the voice recognition unit 2 and sends the content of the defect to the operation unit 5 instead.
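The forward-or-discard behaviour can be written as a small function. This sketch assumes that every detected factor counts as a fatal defect, which follows the summary of the invention but may be stricter than a real implementation; the `Judgement` type and the `judge` function are illustrative names.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Judgement:
    valid: bool
    result: Optional[str]    # recognition result 2a, kept only when valid
    defects: FrozenSet[str]  # detected factors, non-empty only when invalid


def judge(recognition_result, detected_factors):
    """Forward the recognition result if no factor was detected;
    otherwise discard it and report the defects instead."""
    defects = frozenset(detected_factors)
    if not defects:
        return Judgement(True, recognition_result, frozenset())
    return Judgement(False, None, defects)
```

Discarding the recognition result on failure means the operation unit can never act on a word that was recognized under faulty conditions.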

[0023] When the recognition result 2a is supplied from the voice recognition result determination unit 4, the operation unit 5 executes a preset response operation based on that recognition result 2a. When the content of a fatal defect is supplied from the voice recognition result determination unit 4, the operation unit 5 executes a response (avoidance) operation that depends on the content of the defect. Here, a preset response operation is, for example, replying to the recognition result 2a with a voice message prepared in advance. An avoidance operation is, for example, notifying the speaker of a warning message that depends on the content of the defect, or actively taking steps within the voice recognition device to avoid the defect. Active avoidance operations include, for example, turning down the microphone volume when ambient noise is high, and, when misrecognitions occur in succession, stopping voice recognition and playing guidance telling the user to use another input device such as a mouse, touch panel, or buttons.
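The active avoidance behaviour could be organized as a small stateful class, as in the following sketch. The class name, the limit of three consecutive failures, the defect label "ambient noise", and the returned action strings are all assumptions for illustration; the publication gives no concrete limit or interface.

```python
class OperationUnit:
    """Chooses a normal response or an avoidance action (illustrative)."""

    def __init__(self, max_consecutive_errors=3):
        self.max_consecutive_errors = max_consecutive_errors
        self.consecutive_errors = 0
        self.recognition_enabled = True

    def on_result(self, word):
        """A valid recognition result resets the failure counter."""
        self.consecutive_errors = 0
        return f"respond to '{word}'"

    def on_defect(self, defect):
        """Pick an avoidance action for a reported defect."""
        self.consecutive_errors += 1
        if self.consecutive_errors >= self.max_consecutive_errors:
            # Repeated failures: stop recognizing and point the user
            # to another input device.
            self.recognition_enabled = False
            return "stop recognition; guide user to mouse, touch panel, or buttons"
        if defect == "ambient noise":
            return "lower microphone volume"
        return f"warn user: {defect}"
```

Resetting the counter on every valid result means only an unbroken run of failures triggers the fallback to other input devices.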

[0024] Next, the operation of the voice recognition device of this embodiment will be described in detail.

[0025] First, before using voice recognition, the user registers, in his or her own voice, the voice patterns (standard patterns) of the words needed for voice recognition in the voice recognition vocabulary storage unit 21. For example, voice patterns of various words such as "konnichiwa" (hello), "konbanwa" (good evening), and "ohayou" (good morning) are registered. A well-known technique can be used for this registration, and the procedure is routine, so a detailed description is omitted here.

[0026] After the words have been registered by voice in the voice recognition vocabulary storage unit 21, the user makes voice input based on the registered words. The voice uttered by the user is input to the voice input unit 1, which outputs voice data (a voice pattern) corresponding to the input voice. If ambient noise is present at this time (stationary noise such as the conversation of other people nearby), the voice data output from the voice input unit 1 contains that ambient noise.

[0027] When the voice data output from the voice input unit 1 is supplied to the voice recognition unit 2, the voice recognition unit 2 cuts out the required voice section from the voice data and matches the data of the cut-out voice section (cut-out voice data 2b) against the voice recognition vocabulary (standard patterns) stored in the voice recognition vocabulary storage unit 21. The DP matching method, for example, can be used for this matching. The voice recognition unit 2 then supplies the best-matching vocabulary entry to the voice recognition result determination unit 4 as the recognition result 2a and supplies the cut-out voice data 2b to the situation detection unit 3.

[0028] When the cut-out voice data 2b is supplied from the voice recognition unit 2, the situation detection unit 3 checks whether the cut-out voice data 2b contains a factor causing a problem due to ambient noise or a faulty utterance, and supplies the result (the situation detection result) to the voice recognition result determination unit 4. Ambient noise and faulty utterances each appear in the cut-out voice data 2b as a characteristic spectral pattern, so the factors can be detected by performing spectrum analysis on the cut-out voice data 2b and looking for those spectral patterns.

[0029] When the situation detection result is supplied from the situation detection unit 3, the voice recognition result determination unit 4 determines, based on that result, whether there is a fatal defect in the voice recognition performed by the voice recognition unit 2 (validity judgement). For example, a faulty utterance arises when the user, not knowing how to speak so that recognition works correctly, utters a word outside the voice recognition vocabulary stored in the voice recognition vocabulary storage unit 21 or speaks too quietly. If such a faulty utterance occurs when the voice is input at the voice input unit 1, a fatal defect arises in the voice recognition in the voice recognition unit 2, resulting in misrecognition. The voice recognition result determination unit 4 checks the situation detection result from the situation detection unit 3 for fatal defects of this kind. If there is no fatal defect, the voice recognition result determination unit 4 treats the recognition result 2a supplied from the voice recognition unit 2 as valid and forwards it unchanged to the operation unit 5. If there is a fatal defect, it discards the recognition result 2a and sends the content of the defect to the operation unit 5 instead.

[0030] When the recognition result 2a is supplied from the voice recognition result determination unit 4, the operation unit 5 executes the operation preset as the response to that recognition result 2a. For example, when the user inputs the word "ohayou" (good morning) by voice at the voice input unit 1 and the valid recognition result 2a "ohayou" is supplied from the voice recognition result determination unit 4, the operation unit 5 outputs the response message "ohayou". This response message can also be presented by voice.

[0031] On the other hand, when the content of a defect is supplied from the voice recognition result determination unit 4, the operation unit 5 executes, from among the preset response operations corresponding to the respective factors, the response operation corresponding to that defect. For example, if the defect is insufficient voice volume, the user is guided to speak a little louder. If the defect indicates that the ambient noise is high, the user is guided to turn down the microphone volume. Such guidance is given, for example, by selecting the appropriate message from a set of guidance messages prepared in advance.
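Selecting a message from a prepared set amounts to a table lookup. In the sketch below, the defect keys, the message wording, and the fallback message are assumptions for illustration; only the two example cases (insufficient volume and high ambient noise) come from the text.

```python
# Hypothetical guidance table: one prepared message per defect.
GUIDANCE_MESSAGES = {
    "insufficient volume": "Please speak a little louder.",
    "ambient noise": "It is noisy here. Please turn down the microphone volume.",
}

# Fallback for defects without a dedicated message (an assumption).
DEFAULT_MESSAGE = "Recognition failed. Please try again."


def select_guidance(defect):
    """Return the guidance message prepared for the given defect."""
    return GUIDANCE_MESSAGES.get(defect, DEFAULT_MESSAGE)
```

Because each defect maps to its own message, the message the user hears indicates which factor caused the failure.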

[0032] According to the voice recognition device of this embodiment described above, when a fatal defect occurs in voice recognition because of any of a plurality of factors, such as ambient noise or a faulty utterance, the user is given guidance that depends on the content of the defect. The user can therefore easily identify the cause when a misrecognition occurs.

[0033] The voice recognition device of this embodiment described above uses speaker-dependent voice recognition, in which recognition is based on the voice recognition vocabulary registered in the voice recognition vocabulary storage unit 21 by the user's own voice input, but the present invention is not limited to this. For example, the device can also be configured for speaker-independent voice recognition, which takes into account the variation in speech between speakers.

[0034] The voice recognition in the voice recognition unit 2 is not limited to discrete-word voice recognition; various kinds of voice recognition can be applied. However, in configurations that use the rejection vocabulary dictionary described later, discrete-word voice recognition is preferable because of the constraints of that dictionary.

[0035] (Embodiment 2) FIG. 2 is a block diagram showing the schematic configuration of a voice recognition device according to a second embodiment of the present invention. This voice recognition device has the configuration of the first embodiment described above (see FIG. 1) as its basic configuration and additionally includes a rejection vocabulary storage unit 22 in order to respond in finer detail when a misrecognition occurs. In FIG. 2, components identical to those shown in FIG. 1 bear the same reference numerals. To simplify the description, components that operate in the same way are not described again, and only the characteristic parts are described in detail.

[0036] The rejection vocabulary storage unit 22 stores in advance a rejection vocabulary used to reject any input voice other than a correct utterance by the user of the voice recognition vocabulary stored in the voice recognition vocabulary storage unit 21. The rejection vocabulary consists of, for example, combinations of vowels and syllables such as "a" and "i". Ideally, as many voice patterns as possible other than the voice recognition vocabulary stored in the voice recognition vocabulary storage unit 21 should be registered as rejection vocabulary. However, as the rejection vocabulary grows, the recognition processing that uses it (rejection processing) takes longer. There is thus a trade-off between the size of the rejection vocabulary and the processing time, and the rejection vocabulary must be set with this in mind.

【００３７】棄却語彙の生成には、例えば、任意の音節
列からなる音節ネットを利用した膨大な棄却語彙データ
ベースの中から、音声認識語彙格納部２１に格納されて
いる音声認識語彙およびその類似語彙にそれぞれマッチ
する棄却語彙を除去する、といった手法を用いる。類似
語彙の生成は、例えば音響的な類似語生成を利用する手
法が知られている。この手法は、音素レベルで誤認識を
起こし易いペアを予めピックアップしておき、そのペア
となった音素を子音とする同じ母音を持つ音節（類似音
節）に基づいて、音声認識語彙に類似する類似語を生成
する。例えば、音声認識語彙として登録されている「け
んかい（見解）」に対して、「カ」行、「ハ」行の音素
を類似音素とした場合は、「けんかい」、「へんか
い」、「けんはい」、「へんはい」の４つの類似語が生
成される。このようにして得られた棄却語彙を棄却用の
単語辞書として用いることで、利用者が音声認識語彙を
正しく発声した場合以外の入力音声や雑音を棄却させる
ことができる。The rejection vocabulary is generated, for example, by starting from a very large rejection vocabulary database built with a syllable net of arbitrary syllable strings and removing from it every entry that matches the speech recognition vocabulary stored in the speech recognition vocabulary storage unit 21 or a word similar to that vocabulary. A known way to generate the similar words is acoustic similar-word generation. In this method, pairs of phonemes that are easily confused at the phoneme level are picked out in advance, and words similar to the speech recognition vocabulary are generated from syllables (similar syllables) that have the same vowel and take the paired phonemes as their consonants. For example, for the registered word "kenkai" (view), if the phonemes of the "ka" row and the "ha" row are treated as similar phonemes, four similar words are generated: "kenkai", "henkai", "kenhai", and "henhai". By using the rejection vocabulary obtained in this way as a word dictionary for rejection, input voices and noise other than a correct utterance of the speech recognition vocabulary can be rejected.
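The similar-word generation described above can be sketched as follows. This is a minimal illustration and not the procedure of the patent itself: it assumes each kana character is one syllable, that the only confusable pair is the "ka" row and the "ha" row, and that the rejection database is a plain list of strings. The names `confusable_syllables`, `similar_words`, and `build_rejection_vocabulary` are hypothetical.

```python
from itertools import product

# Hypothetical confusion pair: the "ka" row and the "ha" row, in the same vowel order.
KA_ROW = "かきくけこ"
HA_ROW = "はひふへほ"


def confusable_syllables(syllable, row_a=KA_ROW, row_b=HA_ROW):
    """Return the syllable plus its same-vowel counterpart in the paired row, if any."""
    for src, dst in ((row_a, row_b), (row_b, row_a)):
        idx = src.find(syllable)
        if idx >= 0:
            return [syllable, dst[idx]]
    return [syllable]


def similar_words(word):
    """Generate every word obtained by swapping confusable syllables (includes the word itself)."""
    options = [confusable_syllables(s) for s in word]
    return {"".join(combo) for combo in product(*options)}


def build_rejection_vocabulary(candidates, recognition_vocabulary):
    """Drop candidates that match a recognition word or one of its similar words."""
    protected = set()
    for word in recognition_vocabulary:
        protected |= similar_words(word)
    return [c for c in candidates if c not in protected]
```

With the word けんかい this yields the four similar words named in the text, and a candidate such as へんかい is removed from the rejection vocabulary so that a slightly misheard correct utterance is not rejected.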

【００３８】本実施形態の音声認識装置では、音声入力
部１から音声データが入力されると、音声認識部２が、
その入力された音声データから音声区間を切り出し、こ
の切り出し音声データの音声パターンが棄却語彙格納部
２２に格納されている棄却語彙のいずれかの音声パター
ンとマッチしないかを調べる。In the speech recognition apparatus of this embodiment, when voice data is input from the voice input unit 1, the voice recognition unit 2 cuts out a voice section from the input voice data and checks whether the voice pattern of the cut-out voice data matches any voice pattern of the rejection vocabulary stored in the rejection vocabulary storage unit 22.

【００３９】入力音声データの音声パターンが棄却語彙
にマッチしている場合は、音声認識部２は、入力音声を
棄却し、その旨を示す音声認識結果を音声認識結果判定
部４へ送出するとともに、切り出し音声データを状況検
知部３に出力する。状況検知部３では、周囲雑音や発話
不具合検などの要因の検出が行われ、その検出結果（状
況検知結果）が音声認識結果判定部４に出力される。When the voice pattern of the input voice data matches the rejection vocabulary, the voice recognition unit 2 rejects the input voice, sends a voice recognition result indicating the rejection to the voice recognition result determination unit 4, and outputs the cut-out voice data to the situation detection unit 3. The situation detection unit 3 detects factors such as ambient noise and utterance defects, and outputs the detection result (situation detection result) to the voice recognition result determination unit 4.

【００４０】音声認識結果判定部４は、音声認識部２か
ら音声認識結果（棄却）が入力された場合は、状況検知
部３からの状況検知結果に基づく不具合内容を動作部５
に出力する。音声認識結果判定部４から不具合内容が供
給されると、動作部５は、その不具合内容に応じた応答
動作を実行する。例えば、声量不足であれば、利用者に
もう少し大きな声で発声するようにガイダンスし、周囲
雑音が大きければ、マイクロホンのボリュームを絞るよ
うにガイダンスする。また、発話であったにもかかわら
ず、発話不具合がなかった場合は、音声認識を停止させ
る、あるいは、再度音声入力を要求するようにガイダン
スする。例えば、棄却語彙格納部２２に棄却語彙として
「あ」の音声パターン（標準パターン）が格納されてお
り、音声入力部１にて、音声発声不良や周囲雑音がな
く、不要発話として「あ」の音声入力がなされて、音声
認識結果判定部４から音声認識結果（棄却）のみを示す
不具合内容が動作部５へ出力された場合は、動作部５は
「何」と聞き返す棄却応答を行う。上記のいずれのガイ
ダンスも、例えば予め用意されたガイダンス用のメッセ
ージを音声により提示することにより行うことができ
る。When the voice recognition result (rejection) is input from the voice recognition unit 2, the voice recognition result determination unit 4 outputs to the operation unit 5 the fault content based on the situation detection result from the situation detection unit 3. When the fault content is supplied from the voice recognition result determination unit 4, the operation unit 5 executes a response operation corresponding to that fault content. For example, if the voice volume is insufficient, the user is guided to speak a little louder, and if the ambient noise is loud, the user is guided to turn down the microphone volume. Further, if the input was an utterance and yet no utterance defect was found, voice recognition is stopped, or guidance is given requesting voice input again. For example, suppose the voice pattern (standard pattern) of "a" is stored as rejection vocabulary in the rejection vocabulary storage unit 22, and "a" is input at the voice input unit 1 as an unnecessary utterance with no faulty vocalization and no ambient noise. When fault content indicating only the voice recognition result (rejection) is then output from the voice recognition result determination unit 4 to the operation unit 5, the operation unit 5 gives a rejection response that asks back "What?". Any of the above guidance can be given, for example, by presenting a guidance message prepared in advance by voice.
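The mapping from fault content to response operation described above might look like the following sketch. The factor labels (`"low_volume"`, `"ambient_noise"`), the message wording, and the order in which factors are checked are assumptions made for illustration; the source only gives the three example responses.

```python
def choose_response(rejected, detected_factors):
    """Pick a guidance message for a rejected input (hypothetical labels and wording)."""
    if not rejected:
        return None  # no rejection response is needed
    if "low_volume" in detected_factors:
        return "Please speak a little louder."
    if "ambient_noise" in detected_factors:
        return "Please turn down the microphone volume."
    # Rejection only, with no detected defect factor: ask the user back.
    return "What?"
```

For instance, `choose_response(True, set())` corresponds to the case where only the rejection result reaches the operation unit, which answers by asking back.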

【００４１】一方、入力音声パターンが棄却語彙にマッ
チしていない場合は、音声認識部２は、入力音声パター
ンが音声認識語彙格納部２１に格納されている音声認識
語彙のいずれかの音声パターンとマッチするか否かを調
べる。これ以降は、上述した第１の実施形態で説明した
動作が行われる。On the other hand, when the input voice pattern does not match the rejection vocabulary, the voice recognition unit 2 checks whether the input voice pattern matches any voice pattern of the speech recognition vocabulary stored in the speech recognition vocabulary storage unit 21. From this point on, the operation described in the first embodiment is performed.

【００４２】以上説明した本実施形態によれば、利用者
が音声認識語彙格納部２１に格納されている音声認識語
彙を正しく発声した場合以外の入力音声は棄却され、そ
の棄却したものについての音声認識処理は行われないの
で、上述した第１の実施形態のものに比べて認識率が高
くなる。According to the present embodiment described above, any input voice other than a correct utterance of the speech recognition vocabulary stored in the speech recognition vocabulary storage unit 21 is rejected, and no speech recognition processing is performed on the rejected input, so the recognition rate is higher than in the first embodiment.

【００４３】また、棄却されたものについても、不具合
要因が検出され、この検出された不具合要因に応じた応
答動作が行われるので、上述した第１の実施形態のもの
に比べてより細かな応答が可能となる。Further, defect factors are detected even for rejected input, and a response operation corresponding to the detected defect factor is performed, so a finer response is possible than in the first embodiment.

【００４４】（実施形態３）図３は、本発明の第３の実
施形態である音声認識装置の概略構成を示すブロック図
である。この音声認識装置は、上述した第２の実施形態
の構成（図２参照）において、誤認識時のより細かな応
答を行うために、履歴解析部６が加えられている。図３
中、図２に示したものと同様のものには同じ符号を付し
ている。ここでは、説明を簡略化するために、同じ動作
を行うものについては説明を省略し、特徴部分について
のみ詳細に説明する。(Third Embodiment) FIG. 3 is a block diagram showing a schematic configuration of a speech recognition apparatus according to a third embodiment of the present invention. In this speech recognition apparatus, a history analysis unit 6 is added to the configuration of the second embodiment described above (see FIG. 2) in order to respond in finer detail when misrecognition occurs. In FIG. 3, components that are the same as those shown in FIG. 2 are designated by the same reference numerals. To simplify the description, components that operate in the same way are not described again, and only the characteristic part is described in detail.

【００４５】履歴解析部６は、過去に音声認識部２から
出力された音声認識結果および状況検知部３から出力さ
れた状況検知結果を履歴として保持しており、音声認識
部２から音声認識結果が出力され、状況検知部３から状
況検知結果が出力されると、それら音声認識結果および
状況検知結果の過去における発生状態を上記履歴に基づ
いて調べ、その調べた発生状態が所定の条件を満たすか
どうかを判定し、その判定結果（履歴解析結果）を音声
認識結果とともに音声認識結果判定部４に出力する。こ
こで、所定の条件とは、同じ音声認識結果に関する同じ
状況検知結果の過去における継続性と頻度である。継続
性は、同じ状況検知結果が連続して何回出力されている
かを示し、頻度は、同じ状況検知結果が過去の所定の回
数のうちの何回出力されているかを示す。The history analysis unit 6 keeps, as a history, the voice recognition results previously output from the voice recognition unit 2 and the situation detection results previously output from the situation detection unit 3. When a voice recognition result is output from the voice recognition unit 2 and a situation detection result is output from the situation detection unit 3, the history analysis unit 6 examines, based on the history, how those results have occurred in the past, determines whether the observed occurrence pattern satisfies a predetermined condition, and outputs the determination (history analysis result) together with the voice recognition result to the voice recognition result determination unit 4. Here, the predetermined condition concerns the past continuity and frequency of the same situation detection result for the same voice recognition result. Continuity indicates how many times in a row the same situation detection result has been output, and frequency indicates how many times the same situation detection result has been output within a predetermined number of past results.
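Continuity and frequency as defined above can be computed over a fixed-length history, as in this sketch. The class name `HistoryAnalyzer`, the window size, the two thresholds, and the rule that either measure alone satisfies the condition are assumptions; the source does not state how the two measures are combined.

```python
from collections import deque


class HistoryAnalyzer:
    """Keeps recent (recognition result, situation result) pairs and measures their repetition."""

    def __init__(self, window=10, min_run=3, min_count=5):
        self.history = deque(maxlen=window)  # only the last `window` results are kept
        self.min_run = min_run               # assumed continuity threshold
        self.min_count = min_count           # assumed frequency threshold

    def add(self, recognition_result, situation_result):
        self.history.append((recognition_result, situation_result))

    def continuity(self):
        """How many times in a row the latest pair has occurred, counting back from the end."""
        if not self.history:
            return 0
        latest = self.history[-1]
        run = 0
        for entry in reversed(self.history):
            if entry != latest:
                break
            run += 1
        return run

    def frequency(self):
        """How many times the latest pair occurs within the retained window."""
        if not self.history:
            return 0
        latest = self.history[-1]
        return sum(1 for entry in self.history if entry == latest)

    def condition_met(self):
        # Assumption: either measure reaching its threshold is enough.
        return self.continuity() >= self.min_run or self.frequency() >= self.min_count
```

A long run of identical results raises continuity, while an intermittent but recurring result is caught by frequency even though its run length stays short.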

【００４６】音声認識結果判定部４は、履歴解析部６か
ら入力された、継続性や頻度に基づく履歴解析結果から
雑音過多の有無を判断し、その結果を動作部５に出力す
る。動作部５は、音声認識結果判定部４からの雑音過多
の有無に応じて予め用意された応答動作を行う。例え
ば、雑音過多の場合は、音声認識を停止させ、雑音過多
でない場合は、予め設定されている応答動作群から該当
する応答動作を適宜選択して実行する。The voice recognition result judging section 4 judges the presence or absence of excessive noise from the history analysis result based on the continuity and frequency inputted from the history analyzing section 6, and outputs the result to the operating section 5. The operation unit 5 performs a response operation prepared in advance according to the presence or absence of excessive noise from the voice recognition result determination unit 4. For example, in the case of excessive noise, the voice recognition is stopped, and in the case of not excessive noise, the corresponding response operation is appropriately selected and executed from the preset response operation group.

【００４７】なお、状況検知部３にて環境面（周囲雑な
ど）の不具合が検出されない場合で、発話であり、か
つ、発話不具合がある場合は、履歴解析部６は、履歴情
報の解析は行わず、音声認識部２からの音声認識結果お
よび状況検知部３からの状況検知結果をそのまま音声認
識結果判定部４へ渡すようにしてもよい。この場合は、
音声認識結果判定部４および動作部５において、上述し
た第１および第２の実施形態の場合と同様な動作が行わ
れる。Note that when the situation detection unit 3 detects no environmental defect (ambient noise and the like), the input is an utterance, and an utterance defect is present, the history analysis unit 6 may skip the analysis of the history information and pass the voice recognition result from the voice recognition unit 2 and the situation detection result from the situation detection unit 3 to the voice recognition result determination unit 4 unchanged. In this case, the voice recognition result determination unit 4 and the operation unit 5 operate in the same way as in the first and second embodiments described above.

【００４８】履歴解析結果の利用方法に関する具体例は
後述する。A specific example of how to use the history analysis result will be described later.

【００４９】（実施形態４）図４は、本発明の第４の実
施形態である音声認識装置の概略構成を示すブロック図
である。この音声認識装置は、上述した第３の実施形態
の構成（図３参照）において、誤認識時のより細かな応
答を行うために、情報検知部３が環境推定部３１を含む
構成になっている。図４中、図３に示したものと同様の
ものには同じ符号を付している。ここでは、説明を簡略
化するために、同じ動作を行うものについては説明を省
略し、特徴部分についてのみ詳細に説明する。(Fourth Embodiment) FIG. 4 is a block diagram showing a schematic configuration of a speech recognition apparatus according to a fourth embodiment of the present invention. In this speech recognition apparatus, the situation detection unit 3 in the configuration of the third embodiment described above (see FIG. 3) includes an environment estimation unit 31 in order to respond in finer detail when misrecognition occurs. In FIG. 4, components that are the same as those shown in FIG. 3 are designated by the same reference numerals. To simplify the description, components that operate in the same way are not described again, and only the characteristic part is described in detail.

【００５０】環境推定部３１は、音声認識部２から入力
された切り出し音声データに含まれる、周囲環境および
システム内部から発せられる雑音や反響音による不具合
を推定し、その推定結果を出力する。具体的には、環境
推定部３１は、切り出し音声データに含まれる不具合要
因を定常雑音、突発雑音、反響音の優先順位で検出す
る。ここで、定常雑音は、電気ノイズや周囲会話、周囲
環境雑音が継続的に続く雑音のことを意味する。これに
対して、突発雑音は、物が落下したときの音や手を１回
叩いたときの音などのような、突発的に発せられた音
で、継続性のない雑音である。反響音は、発話音声に重
畳される、室内の壁などで反響した音声である。なお、
本実施形態では、不具合要因が存在する場合の音声認識
に与える影響が大きいものを優先させるように優先順位
を設定しているが、定常雑音、突発雑音、反響音の３つ
の不具合要因を全て判定し、全ての検出結果を出力する
ようにしてもよい。The environment estimation unit 31 estimates defects caused by noise and reverberant sound, originating from the surrounding environment and from inside the system, that are contained in the cut-out voice data input from the voice recognition unit 2, and outputs the estimation result. Specifically, the environment estimation unit 31 detects defect factors contained in the cut-out voice data in the priority order of stationary noise, sudden noise, and reverberant sound. Here, stationary noise means noise that continues over time, such as electrical noise, surrounding conversation, and ambient environmental noise. In contrast, sudden noise is a sound emitted abruptly and without continuity, such as the sound of an object falling or of a single hand clap. Reverberant sound is sound reflected from room walls and the like that is superimposed on the uttered voice. In the present embodiment, the priority order is set so that factors with a larger effect on speech recognition take precedence, but all three defect factors (stationary noise, sudden noise, and reverberant sound) may instead be evaluated and all detection results output.

【００５１】次に、環境推定部３１の動作を具体的に説
明する。図５は、図４に示した環境推定部３１の動作を
説明するためのフローチャート図である。Next, the operation of the environment estimating unit 31 will be specifically described. FIG. 5 is a flow chart for explaining the operation of the environment estimation unit 31 shown in FIG.

【００５２】まず、切り出し音声データを取得し（ステ
ップＳ１）、その取得した切り出し音声データに定常雑
音が含まれているがどうかを判定する(ステップＳ
２）。定常雑音がある場合は「定常雑音あり」を出力し
(ステップＳ３）、処理を終了する。定常雑音がない場
合は、続いて、ステップＳ１で取得した切り出し音声デ
ータに突発雑音が含まれているかどうかを判定する（ス
テップＳ４）。突発雑音がある場合は、「突発雑音あ
り」を出力し（ステップＳ５）、処理を終了する。突発
雑音がない場合は、続いて、ステップＳ１で取得した切
り出し音声データに反響音が含まれているかどうかを判
定する（ステップＳ６）。反響音がある場合は、「反響
音あり」を出力し（ステップＳ７）、処理を終了する。
反響音がない場合は、検知結果なしを出力し（ステップ
Ｓ８）、処理を終了する。First, the cut-out voice data is acquired (step S1), and it is determined whether the acquired cut-out voice data contains stationary noise (step S2). If there is stationary noise, "stationary noise present" is output (step S3), and the process ends. If there is no stationary noise, it is next determined whether the cut-out voice data acquired in step S1 contains sudden noise (step S4). If there is sudden noise, "sudden noise present" is output (step S5), and the process ends. If there is no sudden noise, it is next determined whether the cut-out voice data acquired in step S1 contains reverberant sound (step S6). If there is reverberant sound, "reverberant sound present" is output (step S7), and the process ends. If there is no reverberant sound, "no detection result" is output (step S8), and the process ends.
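The priority-ordered flow of steps S1 to S8 can be written as a short cascade. In this sketch the three detectors are passed in as callables because their internals are described separately; the function name `estimate_environment` and the returned label strings are illustrative choices.

```python
def estimate_environment(clip, has_stationary_noise, has_sudden_noise, has_reverberation):
    """Return the first defect factor found, checked in the priority order of FIG. 5."""
    if has_stationary_noise(clip):      # step S2
        return "stationary noise present"   # step S3
    if has_sudden_noise(clip):          # step S4
        return "sudden noise present"       # step S5
    if has_reverberation(clip):         # step S6
        return "reverberant sound present"  # step S7
    return "no detection result"        # step S8
```

Because the function returns at the first hit, a clip that contains both stationary noise and reverberant sound is reported only as stationary noise, which reflects the stated priority order.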

【００５３】次に、定常雑音、突発雑音、反響音の各不
具合要因の検出方法について具体的について説明する。Next, the methods for detecting each of the defect factors, namely stationary noise, sudden noise, and reverberant sound, are described concretely.

【００５４】（定常雑音の検出）定常雑音の検出方法と
しては以下の２つの方法が考えられる。(Detection of Stationary Noise) The following two methods can be considered as a method of detecting stationary noise.

【００５５】第１の方法は、切り出し音声データの音声
長に着目する方法である。定常雑音があると、音声認識
エンジンは音声区間の終端を正確に検出することができ
ない。この場合、切り出された音声区間長（切り出し音
声データ長）は、終端が正確に検出されたもの、例えば
静かな環境で音声認識語彙を発声した場合に切り出され
た音声区間長（切り出し音声データ長）に比べて長くな
る。したがって、切り出し音声データ長（時間）を取得
し、その値（時間）がある一定の閾値を超えているかど
うかで周囲雑音を検知することができる。The first method focuses on the length of the cut-out voice data. In the presence of stationary noise, the speech recognition engine cannot accurately detect the end of the voice section. In that case, the cut-out voice section length (cut-out voice data length) becomes longer than one whose end was detected accurately, for example the section length cut out when a word of the speech recognition vocabulary is uttered in a quiet environment. Therefore, ambient noise can be detected by acquiring the cut-out voice data length (time) and checking whether that value exceeds a certain threshold.

【００５６】なお、上記ある一定の閾値は、音声認識語
彙の最大音節数に依存し、それに応じて調節する必要が
ある。また、切り出された音声区間長（切り出し音声デ
ータ長）は、切り出し音声を例えばWave形式など一般の
音声形式のデータに落とせば、そのデータサイズから計
算することができる。音声区間の始端および終端は、例
えば音声の無いホワイトノイズ区間の平均パワーを学習
し、その平均パワーを基準にした所定の閾値を超えた時
点を始端、再びホワイトノイズ区間の平均パワーに戻っ
た時点を終端としてそれぞれ検出することができる。This threshold depends on the maximum number of syllables in the speech recognition vocabulary and needs to be adjusted accordingly. The cut-out voice section length (cut-out voice data length) can be calculated from the data size once the cut-out voice is saved in a common audio format such as the Wave format. The start and end of the voice section can be detected, for example, by learning the average power of a white-noise section containing no voice, taking as the start the point at which the power exceeds a predetermined threshold based on that average power, and taking as the end the point at which the power returns to the average power of the white-noise section.
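Both calculations in this paragraph are simple enough to sketch. The code below assumes uncompressed PCM audio, takes the payload size without any file header, and treats the first few frames as the voice-free noise section. The `margin` factor that derives the threshold from the learned noise power, and the use of that same threshold to mark the end point, are simplifying assumptions rather than values from the source.

```python
def clip_duration_seconds(data_bytes, sample_rate, channels=1, bytes_per_sample=2):
    """Length of a PCM clip computed from its payload size (header bytes excluded)."""
    return data_bytes / (sample_rate * channels * bytes_per_sample)


def frame_powers(samples, frame_len):
    """Average power of each full frame of `frame_len` samples."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_endpoints(samples, frame_len, noise_frames, margin=4.0):
    """Return (start_frame, end_frame); either is None if it was not found."""
    powers = frame_powers(samples, frame_len)
    # Learn the noise floor from the leading frames, assumed to contain no voice.
    noise_power = sum(powers[:noise_frames]) / noise_frames
    threshold = noise_power * margin
    start = end = None
    for i, power in enumerate(powers):
        if start is None:
            if power > threshold:
                start = i      # power rose above the noise-based threshold
        elif power <= threshold:
            end = i            # power fell back to the noise level
            break
    return start, end
```

If stationary noise keeps the power above the threshold, `end` is found late or not at all, which is exactly the lengthening of the cut-out section that the first detection method exploits.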

【００５７】第２の方法は、切り出し音声データの発話
部分以外の音声パワーに着目する方法である。切り出し
音声データの前後の短い区間、例えば前後１００ｍｓ〜
２００ｍｓの区間における最大パワーが一定の閾値を超
えているか否かを調べることで、定常雑音の存在が検知
可能である。音声パワーの閾値は、実環境で測定される
周囲雑音の音声パワーとの兼ね合いで調整する必要があ
る。The second method focuses on the voice power outside the utterance portion of the cut-out voice data. The presence of stationary noise can be detected by checking whether the maximum power in short sections before and after the cut-out voice data, for example sections of 100 ms to 200 ms, exceeds a certain threshold. The power threshold needs to be adjusted in light of the power of the ambient noise measured in the actual environment.

【００５８】上述の第１および第２の定常雑音検出方法
は、検知すべき音声の質により一長一短があるため、併
用して利用するのが望ましい。図６は、それら第１およ
び第２の定常雑音検出方法を併用した場合の定常雑音の
検出手順を示すフローチャート図である。Since the first and second stationary noise detection methods described above have advantages and disadvantages depending on the quality of the voice to be detected, it is desirable to use them together. FIG. 6 is a flowchart showing a stationary noise detection procedure when the first and second stationary noise detection methods are used in combination.

【００５９】まず、切り出し音声データの音声長を取得
し（ステップＳ１０）、その取得した音声長が予め設定
された閾値を超えているかどうかを調べる（ステップＳ
１１）。閾値を超えている場合は、「定常雑音あり」を
出力し（ステップＳ１２）、処理を終了する。閾値を越
えていない場合は、続いて、切り出し音声データの前後
１００ｍｓの区間におけるピークパワーを取得し（ステ
ップＳ１３）、その取得した前後ピークパワーがともに
予め設定された閾値を超えているかどうかを調べる（ス
テップＳ１４）。前後のピークパワーが両方とも閾値を
超えている場合は、「定常雑音あり」を出力し（ステッ
プＳ１５）、処理を終了する。前後のピークパワーのい
ずれか一方でも閾値を越えていない場合は、「定常雑音
なし」を出力し（ステップＳ１６）、処理を終了する。First, the voice length of the cut-out voice data is acquired (step S10), and it is checked whether the acquired voice length exceeds a preset threshold (step S11). If it exceeds the threshold, "stationary noise present" is output (step S12), and the process ends. If it does not exceed the threshold, the peak powers in the 100 ms sections before and after the cut-out voice data are acquired (step S13), and it is checked whether both of those peak powers exceed a preset threshold (step S14). If both the preceding and following peak powers exceed the threshold, "stationary noise present" is output (step S15), and the process ends. If either of them does not exceed the threshold, "no stationary noise" is output (step S16), and the process ends.
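The combined procedure of steps S10 to S16 reduces to two comparisons. The sketch below takes the already-measured clip length and the two peak powers as inputs; the parameter names and the choice of a boolean return value are illustrative.

```python
def has_stationary_noise(duration_s, peak_before, peak_after, max_duration_s, power_threshold):
    """Combine the length test and the surrounding-power test of FIG. 6."""
    if duration_s > max_duration_s:
        return True  # steps S11 -> S12: the clip is longer than any vocabulary word should be
    # Step S14: both the preceding and the following section must exceed the threshold.
    return peak_before > power_threshold and peak_after > power_threshold
```

Requiring both surrounding sections to be loud keeps a single noisy edge, such as a click just before the utterance, from being reported as stationary noise.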

【００６０】（突発雑音の検出）突発雑音と人間の発話
音声とを区別して検出するためには、継続性のない短い
区間の雑音を検出する方法が効果的である。その検出方
法としては様々な方法が考えられるが、簡単な例とし
て、切り出し音声データを一定のフレーム長のフレーム
に分割し突発雑音を検出する手法を以下に説明する。(Detection of Sudden Noise) In order to detect sudden noise and human speech separately, a method of detecting noise in a short section without continuity is effective. Although various methods are conceivable as the detection method, as a simple example, a method of dividing the cut-out voice data into frames having a fixed frame length and detecting sudden noise will be described below.

【００６１】図７は、突発雑音検出の一手順を示すフロ
ーチャート図である。まず、切り出し音声データの音声
長（時間）を取得し（ステップＳ２０）、その取得した
音声長（時間）を所定のフレーム長で分割し、各フレー
ムに０番から番号を割当てるとともに、分割されたフレ
ームの総数を算出する（ステップＳ２１）。続いて、カ
ウンタを初期化した後（ステップＳ２２）、カウンタ値
に対応した番号のフレームにおけるピークパワーを算出
し（ステップＳ２３）、そのピークパワーが一定の閾値
を超えているかどうか判定する（ステップＳ２４）。FIG. 7 is a flowchart showing one procedure for sudden noise detection. First, the voice length (time) of the cut-out voice data is acquired (step S20), the acquired voice length is divided by a predetermined frame length, a number starting from 0 is assigned to each frame, and the total number of frames is calculated (step S21). Next, a counter is initialized (step S22), the peak power in the frame whose number corresponds to the counter value is calculated (step S23), and it is determined whether that peak power exceeds a certain threshold (step S24).

【００６２】ステップＳ２４で、閾値を超えていると判
定された場合は、前後のフレームのピークパワーを取得
し（ステップＳ２５）、その取得したピークパワーがと
もに先の閾値（ステップＳ２４において用いた閾値）を
越えていないかどうかを判定する（ステップＳ２６）。
このステップＳ２６で、閾値を超えていると判定された
場合は、「突発雑音あり」を出力し（ステップＳ２
７）、処理を終了する。If it is determined in step S24 that the threshold is exceeded, the peak powers of the preceding and following frames are acquired (step S25), and it is determined whether or not both of those peak powers fail to exceed that threshold (the threshold used in step S24) (step S26). If it is determined in step S26 that the threshold is exceeded, "sudden noise present" is output (step S27), and the process ends.

【００６３】ステップＳ２４およびＳ２６で、閾値を超
えていないと判定された場合は、カウンタの値を１増や
し（ステップＳ２８）、カウンタの値がステップＳ２１
で算出したフレーム数より小さいかどうかを判定する
（ステップＳ２９）。カウンタがフレーム数より小さい
場合は、ステップＳ２３に戻る。カウンタがフレーム数
となった場合は、「突発雑音なし」を出力し（ステップ
Ｓ３０）、処理を終了する。If it is determined in steps S24 and S26 that the threshold is not exceeded, the counter value is incremented by 1 (step S28), and it is determined whether the counter value is smaller than the number of frames calculated in step S21 (step S29). If the counter is smaller than the number of frames, the process returns to step S23. When the counter reaches the number of frames, "no sudden noise" is output (step S30), and the process ends.
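The frame loop of FIG. 7 can be sketched as below. Two points are assumptions rather than statements from the source. First, a sudden noise is read here as a frame whose peak exceeds the threshold while both neighbouring frames stay at or below it, since the stated goal is to find short noise without continuity; the wording of steps S26 and S27 can be read differently. Second, a missing neighbour at either end of the clip is treated as quiet. Peak absolute amplitude stands in for peak power.

```python
def frame_peaks(samples, frame_len):
    """Peak absolute amplitude of each frame (the last frame may be shorter)."""
    return [
        max(abs(x) for x in samples[i:i + frame_len])
        for i in range(0, len(samples), frame_len)
    ]


def has_sudden_noise(samples, frame_len, threshold):
    """Report an isolated loud frame, following the loop of steps S20 to S30."""
    peaks = frame_peaks(samples, frame_len)           # steps S20-S21
    for i, peak in enumerate(peaks):                  # counter: steps S22, S28, S29
        if peak <= threshold:                         # step S24
            continue
        # Step S25: neighbouring frames; a missing neighbour counts as quiet (assumption).
        prev_peak = peaks[i - 1] if i > 0 else 0
        next_peak = peaks[i + 1] if i + 1 < len(peaks) else 0
        if prev_peak <= threshold and next_peak <= threshold:  # step S26, as read here
            return True                               # step S27
    return False                                      # step S30
```

A loud stretch that spans several consecutive frames, as speech normally does, never has two quiet neighbours and is therefore not reported, which is why the frame length has to sit between the typical durations of a burst and of an utterance.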

【００６４】図８は、切り出し音声データの一例を示す
波形図である。横軸に時間がとられ、縦軸に音声パワー
がとられている。この例では、音声データの先頭部（音
声入力開始直後）に突発的な雑音があり、その後ろに発
話音声がある。図８中、点線が囲われた区間が、あるフ
レーム長（時間幅）で区切られた１つのフレームであ
る。FIG. 8 is a waveform diagram showing an example of cut-out audio data. The horizontal axis represents time and the vertical axis represents voice power. In this example, there is a sudden noise at the beginning of the voice data (immediately after the start of voice input), and the uttered voice is behind it. In FIG. 8, a section surrounded by a dotted line is one frame sectioned by a certain frame length (time width).

【００６５】上記のステップＳ２１では、図８に示すよ
うなフレームで切り出し音声データを複数に分割し、そ
のフレーム数を算出する。例えば、切り出し音声データ
の音声長が１ｓで、フレーム長が５０ｍｓの場合、フレ
ーム数は２０個になる。In step S21, the cut-out audio data is divided into a plurality of frames as shown in FIG. 8, and the number of frames is calculated. For example, when the voice length of the cut-out voice data is 1 s and the frame length is 50 ms, the number of frames is 20.

【００６６】また、上記ステップＳ２５における前後の
フレームとは、具体的には次のようなフレームである。
例えばフレーム数が２０個で、先頭から順番に０、１、
２、・・・１８、１９と番号を割り当てた場合で、現在
のカウンタ値が５であった場合、ステップＳ２４で番号
「５」が割り当てられたフレームに対する処理が行わ
れ、ステップＳ２５では番号「４」、「６」が割り当て
られたフレームがそれぞれ前後のフレームとして処理さ
れる。なお、フレーム長は、突発雑音と人間の発話の継
続時間の特性を考慮して、突発雑音を正確に検知でき、
人間の発話を誤検知しないような最適な値に設定する必
要がある。The preceding and following frames in step S25 are, specifically, as follows. For example, suppose there are 20 frames, numbered 0, 1, 2, ..., 18, 19 in order from the beginning, and the current counter value is 5. Step S24 then processes the frame numbered 5, and step S25 processes the frames numbered 4 and 6 as the preceding and following frames, respectively. The frame length must be set, taking into account the characteristic durations of sudden noise and of human utterances, to an optimum value that allows sudden noise to be detected accurately without falsely detecting human speech.

【００６７】また、上記ステップＳ２４における一定の
閾値は、音声認識語彙に関する発話のピークパワーに依
存するが、理想的には、突発雑音のピークパワーの音声
認識への影響を考慮することが望ましい。Further, the fixed threshold value in step S24 depends on the peak power of the utterance related to the voice recognition vocabulary, but ideally, it is desirable to consider the influence of the peak power of the sudden noise on the voice recognition.

【００６８】（反響音の検出）図４に示した構成におい
て、音声入力部１が単一のマイクロホンよりなる場合
は、反響音を検出することは困難である。音声入力部１
として、向きの異なる２本のマイクロホンを用い、それ
ぞれのマイクロホンに入力される音声の差分をとること
で、反響音の存在を検出することが可能である。具体的
には、２本のマイクロホンの向きを異ならせることで、
一方のマイクロホンからは反響音があまり含まれていな
い音声データが出力され、他方のマイクロホンからは反
響音を含む音声データが出力されることになり、これら
マイクロホンの出力の差分をとることで反響音を検出す
ることができる。そして、その検出された反響音のレベ
ルが所定の閾値を越えた場合を、音声認識の不具合要因
となる反響音と判定する。(Detection of Reverberant Sound) In the configuration shown in FIG. 4, it is difficult to detect reverberant sound when the voice input unit 1 consists of a single microphone. When two microphones pointing in different directions are used as the voice input unit 1, the presence of reverberant sound can be detected by taking the difference between the sounds input to the respective microphones. Specifically, because the two microphones point in different directions, one microphone outputs voice data containing little reverberant sound while the other outputs voice data containing reverberant sound, and the reverberant sound can be detected by taking the difference between the two microphone outputs. When the level of the detected reverberant sound exceeds a predetermined threshold, it is judged to be reverberant sound that impairs speech recognition.
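A minimal sketch of the two-microphone difference follows. It assumes the two signals are time-aligned and gain-matched sample sequences, and it measures the level of the difference as a root-mean-square value; neither the alignment nor the RMS measure is specified in the source, and the function names are hypothetical.

```python
import math


def reverberation_level(mic_direct, mic_reflected):
    """RMS level of the sample-wise difference between the two microphone signals."""
    n = min(len(mic_direct), len(mic_reflected))
    if n == 0:
        return 0.0
    squared = sum((mic_reflected[i] - mic_direct[i]) ** 2 for i in range(n))
    return math.sqrt(squared / n)


def has_reverberation(mic_direct, mic_reflected, threshold):
    """True when the difference level exceeds the predetermined threshold."""
    return reverberation_level(mic_direct, mic_reflected) > threshold
```

In practice the direct voice would not cancel perfectly between two differently oriented microphones, so the threshold would have to be tuned on recordings from the actual installation.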

【００６９】環境推定部３１の推定結果は、履歴解析部
６に供給される。そして、履歴解析部６、音声認識結果
判定部４および動作部５において、上述の第１〜第３の
実施形態で説明した動作が適宜実行される。The estimation result of the environment estimation unit 31 is supplied to the history analysis unit 6. Then, the history analysis unit 6, the voice recognition result determination unit 4, and the operation unit 5 appropriately perform the operations described in the above-described first to third embodiments.

【００７０】（実施形態５）図９は、本発明の第５の実
施形態である音声認識装置の概略構成を示すブロック図
である。この音声認識装置は、上述した第４の実施形態
の構成（図４参照）において、誤認識時のより細かな応
答を行うために、状況検知部３が発話検出部３２をさら
に含む構成になっている。図９中、図４に示したものと
同様のものには同じ符号を付している。ここでは、説明
を簡略化するために、同じ動作を行うものについては説
明を省略し、特徴部分についてのみ詳細に説明する。(Fifth Embodiment) FIG. 9 is a block diagram showing a schematic configuration of a speech recognition apparatus according to a fifth embodiment of the present invention. In this speech recognition apparatus, the situation detection unit 3 in the configuration of the fourth embodiment described above (see FIG. 4) further includes an utterance detection unit 32 in order to respond in finer detail when misrecognition occurs. In FIG. 9, components that are the same as those shown in FIG. 4 are designated by the same reference numerals. To simplify the description, components that operate in the same way are not described again, and only the characteristic part is described in detail.

【００７１】発話検出部３２は、音声認識部２から入力
された切り出し音声データ中に人間の発話が含まれるか
どうかを判定し、その判定結果を出力する。切り出し音
声データから人間の発話の有無を検出する方法として
は、人間の発話音声のスペクトルの調波構造に着目して
検出する手法がある（文献：「音声認識のためのスペク
トルの調波構造の利用」日本音響学会平成９年度秋季研
究発表会講演論文集 pp3-4）。この調波構造の特徴量を
切り出し音声データから抽出し利用することで人間の発
話と雑音を分離することが可能である。The utterance detection unit 32 determines whether the cut-out voice data input from the voice recognition unit 2 contains a human utterance, and outputs the determination result. One method of detecting the presence of a human utterance in cut-out voice data focuses on the harmonic structure of the spectrum of human speech (reference: "Use of the harmonic structure of the spectrum for speech recognition", Proceedings of the 1997 Autumn Meeting of the Acoustical Society of Japan, pp. 3-4). By extracting features of this harmonic structure from the cut-out voice data and using them, human utterances can be separated from noise.

【００７２】発話検出部３２の発話検出結果は、履歴解
析部６に供給される。そして、履歴解析部６、音声認識
結果判定部４および動作部５において、上述の第１〜第
４の実施形態で説明した動作が適宜実行される。The speech detection result of the speech detection section 32 is supplied to the history analysis section 6. Then, the history analysis unit 6, the voice recognition result determination unit 4, and the operation unit 5 appropriately execute the operations described in the above-described first to fourth embodiments.

【００７３】本実施形態によれば、発話検出部３２を追
加することにより、発話検出なしにも関わらず音声認識
結果が音声認識語彙格納部２１の語彙であった場合に、
音声認識結果を破棄することが可能である。これによ
り、誤認識および誤動作を低減することができる。ま
た、音声認識結果が棄却語彙格納部２２の語彙であった
場合で、発話が検出された場合に、「発話が検出された
が認識できなかった」旨のガイダンスを動作部５で提示
させることが可能である。これにより、さらに細かな応
答を提供することができる。According to the present embodiment, adding the utterance detection unit 32 makes it possible to discard the voice recognition result when that result is a word in the speech recognition vocabulary storage unit 21 even though no utterance was detected. This reduces misrecognition and malfunction. Further, when the voice recognition result is a word in the rejection vocabulary storage unit 22 and an utterance was detected, the operation unit 5 can be made to present guidance to the effect that "an utterance was detected but could not be recognized". This provides an even finer response.
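The two cases described above can be expressed as a small decision function. The labels `"recognized"` and `"rejected"` and the returned strings are hypothetical, and the two remaining combinations (a recognized word with an utterance, a rejection without one) are filled in with the obvious defaults as an assumption.

```python
def decide(result_kind, utterance_detected):
    """Combine the recognition outcome with the utterance detector's verdict."""
    if result_kind == "recognized" and not utterance_detected:
        return "discard"  # a vocabulary word was matched, but nobody spoke
    if result_kind == "rejected" and utterance_detected:
        return "guide: utterance detected but not recognized"
    if result_kind == "recognized":
        return "accept"   # assumed default: matched word with a real utterance
    return "reject"       # assumed default: rejected input with no utterance
```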

【００７４】また、上記発話検出部３２の発話検出結果
に基づく音声認識結果の破棄と、上述の第２の実施形態
で述べた棄却語彙格納部２２を利用した不要発話棄却と
が互いに補間しあうことで棄却精度を高める効果を持
つ。例えば、棄却語彙を利用しない場合は、周囲会話の
ある状況では、発話検出部は発話ありと誤検出してしま
う。この場合、その周囲会話に対応する単語を棄却語彙
として棄却語彙格納部２２に予め登録しておき、周囲会
話を棄却語彙として認識させて棄却する、といった補間
を行うことで、発話検出部が周囲会話を発話ありと誤検
出して、誤動作してしまうことを防止することができ
る。また、この逆の補間として、音声認識語彙に極めて
近い雑音が入力され、棄却語彙による棄却がうまく行か
ない場合に、発話検出部３２で発話なしと検出すること
でその雑音を破棄させることも可能である。Further, discarding voice recognition results based on the utterance detection result of the utterance detection unit 32 and rejecting unnecessary utterances with the rejection vocabulary storage unit 22 described in the second embodiment complement each other, which raises the rejection accuracy. For example, when no rejection vocabulary is used and there is surrounding conversation, the utterance detection unit falsely detects an utterance. In this case, words corresponding to the surrounding conversation can be registered in advance as rejection vocabulary in the rejection vocabulary storage unit 22 so that the surrounding conversation is recognized as rejection vocabulary and rejected. This prevents the malfunction that would result from the utterance detection unit falsely treating surrounding conversation as an utterance. Conversely, when noise very close to the speech recognition vocabulary is input and rejection by the rejection vocabulary fails, the noise can still be discarded because the utterance detection unit 32 detects that there is no utterance.

【００７５】（実施形態６）図１０は、本発明の第６の
実施形態である音声認識装置の概略構成を示すブロック
図である。この音声認識装置は、上述した第５の実施形
態の構成（図９参照）において、誤認識時のより細かな
応答を行うために、状況検知部３が発話不具合検出部３
３をさらに含む構成になっている。図１０中、図９に示
したものと同様のものには同じ符号を付している。ここ
では、説明を簡略化するために、同じ動作を行うものに
ついては説明を省略し、特徴部分についてのみ詳細に説
明する。(Sixth Embodiment) FIG. 10 is a block diagram showing a schematic configuration of a speech recognition apparatus according to a sixth embodiment of the present invention. In this speech recognition apparatus, the situation detection unit 3 in the configuration of the fifth embodiment described above (see FIG. 9) further includes an utterance defect detection unit 33 in order to respond in finer detail when misrecognition occurs. In FIG. 10, components that are the same as those shown in FIG. 9 are designated by the same reference numerals. To simplify the description, components that operate in the same way are not described again, and only the characteristic part is described in detail.

【００７６】発話不具合検出部３３は、音声認識部２か
ら入力された切り出し音声データが所定の不具合条件を
満たすかどうかを判定する。この発話不具合検出部３３
では、環境要因ではなく、音声を発話する人間側の不具
合要因を検出する点が環境推定部３１と異なる。発話不
具合検出部３３によって検出される発話不具合の要因は
多岐にわたっているが、ここでは、頭切れ、パワー不
足、パワー過多を代表例として挙げる。The utterance defect detection unit 33 determines whether the cut-out voice data input from the voice recognition unit 2 satisfies a predetermined defect condition. The utterance defect detection unit 33 differs from the environment estimation unit 31 in that it detects defect factors on the side of the person speaking rather than environmental factors. The utterance defects it can detect have many causes; here, a cut-off beginning of the utterance (head cut-off), insufficient power, and excessive power are taken as representative examples.

[0077] Of these defect factors, insufficient power can be detected by checking whether the peak power of the entire clipped voice is smaller than a given threshold, and excessive power can be detected by checking whether that peak power is larger than a given threshold. The remaining factor, head cut-off, is detected by examining the leading part of the clipped voice as follows.
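
The two peak-power checks can be illustrated with a minimal sketch. It assumes samples scaled to [-1.0, 1.0] and takes "peak power" to mean the largest squared amplitude; the function names and both threshold values are hypothetical, since the source specifies only "a given threshold".

```python
from typing import Optional, Sequence

# Hypothetical thresholds on squared amplitude for samples scaled to [-1.0, 1.0].
# The source does not give concrete values.
LOW_POWER_THRESHOLD = 0.01
HIGH_POWER_THRESHOLD = 0.9


def peak_power(samples: Sequence[float]) -> float:
    """Return the largest instantaneous power (squared amplitude) in the data."""
    return max((s * s for s in samples), default=0.0)


def detect_power_defect(
    samples: Sequence[float],
    low: float = LOW_POWER_THRESHOLD,
    high: float = HIGH_POWER_THRESHOLD,
) -> Optional[str]:
    """Classify the whole clipped utterance by its peak power."""
    peak = peak_power(samples)
    if peak < low:
        return "insufficient power"
    if peak > high:
        return "excessive power"
    return None  # neither power defect detected
```

A practical implementation would more likely compute power per frame (for example in dB) rather than per sample; the per-sample form is used here only to keep the sketch short.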

[0078] FIG. 11 shows an example of a simple algorithm for detecting head cut-off. For simplicity, this example examines the first 100 ms of the voice data, but the section examined is not limited to 100 ms. A section of 50 ms or 30 ms may be used as long as the head cut-off detection described below remains possible.

[0079] First, the peak power in the first 100 ms frame of the clipped voice data is obtained (step S40). It is then determined whether the obtained peak power exceeds a threshold (step S41). If it does, it is next determined whether the first 100 ms frame contains uttered speech (step S42).

[0080] If step S41 finds that the threshold is not exceeded, or step S42 finds that no uttered speech is contained, "no head cut-off" is output (step S43) and the process ends. If step S42 finds that uttered speech is contained, "head cut-off present" is output (step S44) and the process ends.
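
Steps S40 to S44 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: peak power is taken as the largest squared amplitude, and `contains_speech` is a hypothetical callback standing in for whatever speech/non-speech decision the device uses in step S42.

```python
from typing import Callable, Sequence


def detect_head_cutoff(
    samples: Sequence[float],
    sample_rate: int,
    power_threshold: float,
    contains_speech: Callable[[Sequence[float]], bool],
    window_ms: int = 100,
) -> bool:
    """Return True for "head cut-off present", False for "no head cut-off"."""
    # Take the leading frame (100 ms by default; 50 ms or 30 ms are also allowed).
    head = samples[: sample_rate * window_ms // 1000]

    # S40: obtain the peak power of the leading frame.
    peak = max((s * s for s in head), default=0.0)

    # S41: if the peak power does not exceed the threshold, there is no cut-off (S43).
    if peak <= power_threshold:
        return False

    # S42: if the leading frame holds no uttered speech, there is no cut-off (S43).
    if not contains_speech(head):
        return False

    # S44: loud speech already present at the very start of the clip.
    return True
```

For example, with `sample_rate=16000` the default window covers the first 1600 samples.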

[0081] The utterance defect is detected in the utterance detection unit 32 as described above, and the detection result is supplied to the history analysis unit 6. The history analysis unit 6, the voice recognition result determination unit 4, and the operation unit 5 then carry out, as appropriate, the operations described in the first to fifth embodiments.

[0082] FIG. 12 shows specific examples of the detection results of the environment estimation unit 31, the utterance detection unit 32, and the utterance defect detection unit 33, together with the response operations corresponding to those results. Specific operations are described below with reference to FIG. 12.

[0083] In the example shown in FIG. 12, "こんにちは" ("hello") is stored in the voice recognition vocabulary storage unit 21 as the voice recognition vocabulary and "あ" ("a") is stored in the rejection vocabulary storage unit 22 as the rejection vocabulary, and each voice input always returns one of these two results. For simplicity, the environment estimation unit 31 detects only stationary noise. The utterance defect detection unit 33 outputs exactly one detection result, chosen in the priority order "head cut-off", "insufficient power", "excessive power". The history analysis unit 4 returns "situation: 3 consecutive" as the history collation result when the same situation detection result occurs three times in a row, and returns "rejection: 3 consecutive" when the case in which the utterance detection unit reports "utterance present" and the voice recognition result is the rejection vocabulary occurs three times in a row. Voice recognition is stopped when stationary noise continues, and also when misrecognition continues even though there is no stationary noise. In the table of FIG. 12, an entry of "-" means "regardless of the result", and "NOT(A)" means that detection result A was not detected.
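
The priority rule and the three-in-a-row history check can be sketched as below. The class and function names are hypothetical, and details the text leaves open (how situations are labelled, whether the history resets after firing) are assumptions: a situation is a string label or `None`, and the history is a sliding window that never resets.

```python
from collections import deque
from typing import Deque, Iterable, Optional

# Priority order stated in the text for the utterance defect detection unit.
DEFECT_PRIORITY = ("head cut-off", "insufficient power", "excessive power")
REJECT_WORD = "あ"  # rejection vocabulary used in the FIG. 12 example


def pick_defect(detected: Iterable[str]) -> Optional[str]:
    """Return the single highest-priority defect, or None if none was detected."""
    found = set(detected)
    for defect in DEFECT_PRIORITY:
        if defect in found:
            return defect
    return None


class HistoryAnalyzer:
    """Sliding-window check for results repeated run_length times in a row."""

    def __init__(self, run_length: int = 3) -> None:
        self.run_length = run_length
        self._situations: Deque[Optional[str]] = deque(maxlen=run_length)
        self._rejections: Deque[bool] = deque(maxlen=run_length)

    def update(
        self, situation: Optional[str], utterance_present: bool, recognized: str
    ) -> Optional[str]:
        self._situations.append(situation)
        self._rejections.append(utterance_present and recognized == REJECT_WORD)
        full = len(self._situations) == self.run_length
        # Same (non-empty) situation detection result in every slot of the window.
        if full and situation is not None and len(set(self._situations)) == 1:
            return f"situation: {self.run_length} consecutive"
        # "Utterance present" together with a rejection result in every slot.
        if full and all(self._rejections):
            return f"rejection: {self.run_length} consecutive"
        return None
```

Calling `update("stationary noise", False, "あ")` three times on a fresh analyzer yields `None`, `None`, and then `"situation: 3 consecutive"`.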

[0084] When the environment estimation unit 31 detects "stationary noise present", the user is given guidance about the presence of stationary noise regardless of any other results, including the voice recognition result. This is because speech uttered in an environment with stationary noise is unlikely to be recognized correctly and, assuming that the voice recognition vocabulary contains multiple words, a result that matches the voice recognition vocabulary is still likely to be a misrecognition. When the "stationary noise present" situation occurs three times in a row, the history analysis unit 4 outputs "situation: 3 consecutive" and the operation unit 5 stops voice recognition. If the device has other input devices such as a mouse, a touch panel, or buttons, guidance encouraging the user to use them may be given when voice recognition is stopped.

[0085] When the result of the environment estimation unit 31 is "no detection result", the output of the utterance detection unit 32 is consulted. If it is "utterance present", a response operation is performed with reference to the output of the utterance defect detection unit 33; if it is "no utterance", the input is ignored and no operation is performed. In the "no utterance" case, if the detection accuracy of the utterance detection unit 32 is insufficient, the detection result may be discarded when, for example, the voice recognition vocabulary "こんにちは" is output.

[0086] In the "utterance present" case, if the utterance defect detection unit 33 detects some utterance defect, guidance conveying that defect is given. If no utterance defect is detected, the operation depends on the voice recognition result. If the result is the voice recognition vocabulary "こんにちは", the response operation corresponding to "こんにちは" is performed. If it is the rejection vocabulary "あ", the normal rejection response, such as asking back "何" ("what?"), is performed; however, if the situation in which no utterance defect is detected occurs three times in a row, the history analysis unit 4 outputs "rejection: 3 consecutive" and the operation unit 5 stops voice recognition.

[0087] As with the utterance detection unit 32, the detection result of the utterance defect detection unit 33 may be discarded when the voice recognition result is the voice recognition vocabulary "こんにちは", depending on how much voice recognition accuracy degrades for speech detected as having an utterance defect.
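
The response selection described in paragraphs [0084] to [0086] can be summarized as a small decision function. This sketch follows the prose only (the table of FIG. 12 itself is not reproduced here); the function name, argument layout, and returned action labels are hypothetical, and the optional discarding of detection results mentioned in [0085] and [0087] is left out.

```python
from typing import Optional

RECOGNITION_WORD = "こんにちは"  # voice recognition vocabulary in the example
REJECT_WORD = "あ"               # rejection vocabulary in the example


def select_response(
    stationary_noise: bool,
    utterance_present: bool,
    defect: Optional[str],
    recognized: str,
    history: Optional[str],
) -> str:
    """Pick one response action from the detection results (labels are illustrative)."""
    # Stationary noise overrides every other result, including the recognition result.
    if stationary_noise:
        if history == "situation: 3 consecutive":
            return "stop recognition"
        return "guide: stationary noise"

    # No stationary noise: consult the utterance detection result.
    if not utterance_present:
        return "ignore"

    # Utterance present: a detected utterance defect is reported to the user.
    if defect is not None:
        return f"guide: {defect}"

    # No defect: the action depends on the recognition result.
    if recognized == RECOGNITION_WORD:
        return f"respond: {RECOGNITION_WORD}"
    if history == "rejection: 3 consecutive":
        return "stop recognition"
    return "ask back"  # normal rejection response, e.g. asking "何"
```

Keeping the checks in this order mirrors the text: environment first, then utterance presence, then speaker-side defects, and only then the recognized word.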

[0088] The voice recognition and response operations in each of the embodiments described above can basically all be implemented as a program and run on an existing computer system (whose basic configuration comprises a storage device, input devices such as a keyboard and mouse, display devices such as a CRT or LCD, and a control device that accepts input from the input devices and controls access to the storage device and the operation of the output and display devices). The program may be installed in advance on the computer's hard disk or supplied on a recording medium such as a CD-ROM. It may also be provided (downloaded) via the Internet.

[0089]

[Effects of the Invention] As described above, according to the present invention, the user can easily determine environmental or human causes of misrecognition and, by eliminating those causes, obtain an environment and input speech well suited to voice recognition. A voice recognition device with a higher recognition rate can therefore be provided.

[Brief description of drawings]

FIG. 1 is a block diagram showing the schematic configuration of a voice recognition device according to a first embodiment of the present invention.

FIG. 2 is a block diagram showing the schematic configuration of a voice recognition device according to a second embodiment of the present invention.

FIG. 3 is a block diagram showing the schematic configuration of a voice recognition device according to a third embodiment of the present invention.

FIG. 4 is a block diagram showing the schematic configuration of a voice recognition device according to a fourth embodiment of the present invention.

FIG. 5 is a flowchart for explaining the operation of the environment estimation unit shown in FIG. 4.

FIG. 6 is a flowchart showing one procedure for detecting stationary noise.

FIG. 7 is a flowchart showing one procedure for detecting sudden noise.

FIG. 8 is a waveform diagram showing an example of clipped voice data.

FIG. 9 is a block diagram showing the schematic configuration of a voice recognition device according to a fifth embodiment of the present invention.

FIG. 10 is a block diagram showing the schematic configuration of a voice recognition device according to a sixth embodiment of the present invention.

FIG. 11 is a flowchart showing one procedure for detecting head cut-off.

FIG. 12 is a diagram for explaining the detection results of the environment estimation unit, the utterance detection unit, and the utterance defect detection unit shown in FIG. 10 and the response operations corresponding to those detection results.

[Explanation of symbols]

1 Voice input unit
2 Voice recognition unit
2a Recognition result
2b Clipped voice data
3 Situation detection unit
4 Voice recognition result determination unit
5 Operation unit
6 History analysis unit
21 Voice recognition vocabulary storage unit
22 Rejection vocabulary storage unit
31 Environment estimation unit
32 Utterance detection unit
33 Utterance defect detection unit

Claims

[Claims]

1. A voice recognition device comprising: voice input means for outputting voice data corresponding to an input voice; voice recognition means for recognizing the input voice by collating the voice data with standard patterns for a voice recognition vocabulary registered in advance; situation detecting means for checking whether at least one of a plurality of different factors that cause problems in the voice recognition is present in the voice data; voice recognition result determination means for determining that the voice recognition is valid when the situation detecting means detects none of the plurality of factors, and that the voice recognition is not valid when any of the plurality of factors is detected; and response means in which a plurality of response operations respectively corresponding to the plurality of factors are set in advance and which, when the voice recognition result determination means determines that the voice recognition is not valid, selectively executes the corresponding response operation from among the plurality of response operations.

2. The voice recognition device according to claim 1, further comprising rejection vocabulary storage means in which voice patterns for a rejection vocabulary, used to reject input voice that is not an utterance based on the voice recognition vocabulary, are registered, wherein the voice recognition means performs the voice recognition by further collating the voice data with the voice patterns registered in the rejection vocabulary storage means.

3. The voice recognition device according to claim 1, further comprising history analysis means that holds a history of the factors detected by the situation detecting means and, when the situation detecting means detects a factor, checks the past occurrence state of the detected factor from the history and analyzes whether the checked occurrence state satisfies a predetermined condition, wherein the voice recognition result determination means further uses the analysis result of the history analysis means to determine the validity of the voice recognition, and the response means performs different response operations depending on whether or not the predetermined condition is satisfied.

4. The voice recognition device according to claim 1, wherein the situation detecting means includes environment estimating means for detecting, from the voice data, factors that cause problems due to specific noise and echo sound.

5. The voice recognition apparatus according to claim 4, wherein the situation detection unit further includes an utterance detection unit that checks whether or not a human utterance is included in the voice data.

6. The voice recognition device according to claim 5, wherein the situation detecting means further includes utterance defect detecting means that, when the utterance detection means detects a human utterance, checks whether the detected utterance satisfies a predetermined defect condition.

7. A voice recognition method comprising: a first step of outputting voice data corresponding to an input voice; a second step of recognizing the input voice by collating the voice data with standard patterns for a voice recognition vocabulary registered in advance; a third step of checking whether at least one of a plurality of different factors that cause problems in the voice recognition is present in the voice data; a fourth step of determining that the voice recognition is valid when none of the plurality of factors is detected in the third step, and that the voice recognition is not valid when any of the plurality of factors is detected; and a fifth step of, when the voice recognition is determined not to be valid in the fourth step, selectively executing the corresponding response operation from among a plurality of preset response operations respectively corresponding to the plurality of factors.

8. The voice recognition method according to claim 7, wherein the first step further includes a step of performing the voice recognition by collating the voice data with pre-registered voice patterns for a rejection vocabulary used to reject input voice that is not an utterance based on the voice recognition vocabulary.

9. The voice recognition method according to claim 7, further comprising: a step of holding a history of the factors detected in the third step; a step of checking, when a factor is detected in the third step, the past occurrence state of the detected factor from the history; and a step of analyzing whether the checked occurrence state satisfies a predetermined condition, wherein the fourth step further includes a step of determining the validity of the voice recognition using the result of the analysis, and the fifth step includes a step of performing different response operations depending on whether or not the predetermined condition is satisfied.

10. The voice recognition method according to claim 7, wherein the third step includes a step of detecting, from the voice data, factors that cause problems due to specific noise and echo sound.

11. The voice recognition method according to claim 10, wherein the third step further includes a step of checking whether or not a human utterance is included in the voice data.

12. The voice recognition method according to claim 11, wherein the third step further includes a step of checking, when a human utterance is detected, whether the detected utterance satisfies a predetermined defect condition.

13. A program for causing a computer to execute: a first process of outputting voice data corresponding to an input voice; a second process of recognizing the input voice by collating the voice data with standard patterns for a voice recognition vocabulary registered in advance; a third process of checking whether at least one of a plurality of different factors that cause problems in the voice recognition is present in the voice data; a fourth process of determining that the voice recognition is valid when none of the plurality of factors is detected in the third process, and that the voice recognition is not valid when any of the plurality of factors is detected; and a fifth process of, when the voice recognition is determined not to be valid in the fourth process, selectively executing the corresponding response operation from among a plurality of preset response operations respectively corresponding to the plurality of factors.

14. The program according to claim 13, wherein the first process further includes a process of performing the voice recognition by collating the voice data with pre-registered voice patterns for a rejection vocabulary used to reject input voice that is not an utterance based on the voice recognition vocabulary.

15. The program according to claim 13, further comprising: a process of holding a history of the factors detected by the third process; a process of checking, when a factor is detected by the third process, the past occurrence state of the detected factor from the history; and a process of analyzing whether the checked occurrence state satisfies a predetermined condition, wherein the fourth process further includes a process of determining the validity of the voice recognition using the result of the analysis, and the fifth process includes a process of performing different response operations depending on whether or not the predetermined condition is satisfied.

16. The program according to claim 13, wherein the third process includes a process of detecting, from the voice data, factors that cause problems due to specific noise and echo sound.

17. The program according to claim 16, wherein the third process further includes a process of checking whether or not a human utterance is included in the voice data.

18. The program according to claim 17, wherein the third process further includes a process of checking, when a human utterance is detected, whether the detected utterance satisfies a predetermined defect condition.