JP2010078986A

JP2010078986A - Equipment controller by speech recognition

Info

Publication number: JP2010078986A
Application number: JP2008247948A
Authority: JP
Inventors: Yasunari Obuchi; 康成大淵; Takashi Sumiyoshi; 貴志住吉
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2008-09-26
Filing date: 2008-09-26
Publication date: 2010-04-08
Anticipated expiration: 2028-09-26
Also published as: JP4779000B2

Abstract

PROBLEM TO BE SOLVED: When electronic equipment is controlled from a distance, a remote controller is inconvenient because it often cannot be found at a moment's notice, while speech recognition that runs without any switch can annoy the user by frequently responding falsely to noise and the like.

SOLUTION: Speech recognition is first carried out in a manual mode using a remote controller or the like, and sound input data that is a target of speech recognition, as well as other sound input data, is stored in a teacher signal database 120. Based on the stored teacher signal data, automatic determination parameter learning 122 generates a determination parameter database 124. Once sufficiently high accuracy is obtained, the apparatus enters an automatic mode and performs speech/non-speech automatic determination 114 on sound input data.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a technique that lets a user in a home, office, or similar setting operate an electrical device located at a distance using only his or her voice, without holding a remote control or similar tool.

Infrared remote controls are widely used as a means of operating devices such as televisions and air conditioners from a distance. Meanwhile, speech recognition technology, which recognizes the content of an utterance by capturing the human voice with a microphone and matching it against models prepared in advance, is well established, and attempts have been made to use it for device control. When speech recognition is used for device control, it is common to provide a speech recognition switch with which the user explicitly instructs the start of speech recognition, as in Patent Document 1, for example. There is also a technique that uses the utterance of a specific word in place of the speech recognition switch, as in Patent Document 2.

Patent Document 1: Japanese Patent Laid-Open No. 4-7998
Patent Document 2: JP 2000-322078 A

Controlling devices with a remote control has problems: the remote control cannot be found when it is needed in a hurry, or it has so many buttons that the user does not know how to use it. A speech recognition function reduces these problems, but conventional methods require the user to operate a speech recognition switch, so a device with at least one button must still be kept within reach. The problem that this device cannot be found in a hurry therefore remains. If speech recognition is instead left running at all times without a switch, it responds falsely to ambient noise and to utterances not intended as device operations, such as casual conversation, and behavior the user does not want occurs frequently. Using the utterance of a specific word in place of the speech recognition switch is one way to avoid this, but eliminating false responses to noise and the like remains difficult. Moreover, even a simple operation such as “turn on the light” always requires at least two utterances, which impairs convenience for the user.

An object of the present invention is to provide a highly accurate and convenient device control apparatus based on speech recognition.

The present invention exploits the fact that, in the specific environment where speech recognition is used, speech input can be accurately distinguished from other input if the properties of sounds that may cause false responses are thoroughly studied and the properties of the speech input to which the apparatus should respond are studied equally thoroughly. Studying these properties requires collecting a sufficient amount of data known to warrant a response from speech recognition and a sufficient amount of data known not to warrant one.

In the present invention, therefore, device control by voice is first performed in combination with a speech recognition switch. This makes it possible to collect enough data on sounds to which the apparatus should respond. Because sound data continues to be captured even while the speech recognition switch is not pressed, enough data to which the apparatus should not respond can be collected as well. The module that decides whether to start speech recognition is trained thoroughly on this data, and once false responses are no longer a concern, the user is informed that the speech recognition switch is no longer needed. When the user confirms this and specifies that speech recognition should be started automatically from then on, device control by voice becomes possible without a speech recognition switch.

Specifically, to achieve the above object, the present invention provides a device control apparatus based on speech recognition comprising: a switching unit that switches the activation method of speech recognition between manual and automatic; a detection unit that, regardless of the activation method, detects sound data that is a candidate for input speech for speech recognition; an automatic determination unit that automatically determines whether the detected sound data should be subjected to speech recognition; a storage unit that stores the detected sound data in a database when it is self-evident whether that data should be subjected to speech recognition; and a learning unit that learns the parameters used by the automatic determination unit from the data stored in the database.

According to the present invention, some time after the device control apparatus is purchased and installed, voice control becomes possible without anything equivalent to a remote control, which eliminates the problem of not being able to find the remote control at a moment's notice.

Effective embodiments of the present invention are described below with reference to the drawings, beginning with the basic configuration of the invention. In the following description, functional blocks may be referred to as “units” or “means”; for example, “speech/non-speech automatic determination” may be called the automatic determination unit or the automatic determination means.

FIG. 1 is a conceptual diagram showing the basic configuration of the present invention. In this configuration, input sound data from a voice input device (a microphone or the like) 102 is captured continuously while the apparatus is powered on. Speech segment candidate detection 104 is performed on the captured sound data. One possible method is to digitize the sound data with an A/D converter, cut out frames of about 10 milliseconds each, take the sum of the squared amplitudes of the waveform within each frame as the frame power, and compare that value with a preset threshold. Speech segment detection of this kind is widely known to engineers in the field. The role of the speech segment candidate detection unit 104 is only to extract candidates; later stages are expected to separate what should be processed as speech from what should not. Detecting a large amount of unneeded data is therefore not a problem, whereas missing needed data must be avoided as far as possible. This can be achieved by, for example, setting the above threshold to a very small value.
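The frame-power method described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the non-overlapping framing, and the threshold value in the usage note are assumptions made for the example.

```python
def frame_powers(samples, sample_rate, frame_ms=10):
    """Cut the signal into non-overlapping frames of about frame_ms
    milliseconds and return the sum of squared amplitudes of each frame."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    return [
        sum(x * x for x in samples[i:i + frame_len])
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def candidate_segments(powers, threshold):
    """Return (start_frame, end_frame) pairs, end exclusive, for every run
    of consecutive frames whose power exceeds the threshold."""
    segments = []
    start = None
    for i, power in enumerate(powers):
        if power > threshold and start is None:
            start = i                      # a candidate segment begins
        elif power <= threshold and start is not None:
            segments.append((start, i))    # the candidate segment ends
            start = None
    if start is not None:                  # signal ended inside a segment
        segments.append((start, len(powers)))
    return segments
```

With a deliberately small threshold (an assumed value such as 0.01 for samples normalized to the range -1 to 1), the detector errs on the side of passing too many candidates to the later stages, which matches the stated design goal of never dropping needed data.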

Next, operation mode check 106 confirms whether the operation mode of the apparatus is set to “manual” or “automatic”. The operation mode is set in operation mode setting 108, which is described in detail later.

FIG. 2 shows the main processing steps, extracted from FIG. 1, for the case where the operation mode is “manual”. First, user input detection 112 detects whether there is input from the user input device 110. The user input device is a device with which the user instructs the apparatus to start speech recognition; as described concretely later, a button on a remote control is one example. When activation of speech recognition has been instructed, for example by pressing the remote control button, speech recognition 116 is executed on the input sound data. At the same time, the input sound data is added to the teacher signal database 120 as a “speech” teacher signal. In speech recognition 116, the speech recognition unit performs recognition with reference to, for example, a preset list of device control commands, and the resulting control command is executed by device control 118. When activation of speech recognition has not been instructed, for example because the remote control button is not pressed, speech recognition 116 is not executed and the input sound data is added to the teacher signal database 120 as a “non-speech” teacher signal.
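The manual-mode behavior can be sketched as below: the button state serves as the ground-truth label for every detected segment. The class and function names, and the callables `recognize` and `control`, are hypothetical stand-ins for speech recognition 116 and device control 118; the patent does not specify any data layout.

```python
from dataclasses import dataclass, field


@dataclass
class TeacherSignalDatabase:
    """Illustrative stand-in for teacher signal database 120."""
    speech: list = field(default_factory=list)
    non_speech: list = field(default_factory=list)

    def add(self, segment, is_speech):
        (self.speech if is_speech else self.non_speech).append(segment)


def handle_manual(segment, button_pressed, db, recognize, control):
    """Manual mode: store the segment with the button state as its label,
    and run recognition and device control only when the button is pressed."""
    db.add(segment, is_speech=button_pressed)
    if not button_pressed:
        return None                 # non-speech: recognition is not executed
    command = recognize(segment)    # hypothetical recognizer callable
    if command is not None:
        control(command)            # hypothetical device-control callable
    return command
```

Every segment detected in manual mode ends up in the database with a trustworthy label, which is what later allows the speech/non-speech classifier to be trained for the user's own environment.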

FIG. 3 shows the main processing steps, extracted from FIG. 1, for the case where operation mode check 106 finds the operation mode to be “automatic”. First, the automatic determination means that performs speech/non-speech automatic determination 114 is started. Here the computer automatically determines whether the input sound data is “speech” or “non-speech”. Note that “speech” here means all sound data that should be input to speech recognition, and “non-speech” means all sound data that should not. The user's ordinary conversation, laughter, coughing, and the like should therefore be understood as falling under “non-speech”, just like ambient noise. The automatic determination uses the data stored in the determination parameter database 124, as described in detail later. If the input sound data is determined to be “speech”, speech recognition 116 is executed on it, and device control 118 is executed according to the recognition result, as in “manual” mode. In “automatic” mode, however, the judgment that the input sound data is “speech” is only an estimate, so the data is not added to the teacher signal database 120. If the result of the automatic determination is “non-speech”, the apparatus does nothing and waits for the next input.

FIG. 4 shows the series of processing steps, extracted from FIG. 1, for learning the determination parameters used in speech/non-speech discrimination. The teacher signals accumulated in the teacher signal database 120 are used by determination parameter learning 122, which is started once a certain amount of teacher signals has accumulated. Whereas the processing from speech segment candidate detection 104 through device control 118 should run without significant delay, determination parameter learning 122 has no such responsiveness requirement, so its start may be deferred depending on, for example, the processing capacity of the computer. Determination parameter learning 122 extracts various features from the sound data in the teacher signal database 120 and learns the relationship between each set of features and its label, speech or non-speech. Two-class classification problems of this kind have been studied extensively in machine learning, and engineers in the field can readily apply a variety of methods. Here, the most representative of them, linear discriminant analysis, is described first, and the rest of FIG. 4 is then described.

FIG. 5 illustrates determination parameter learning by linear discriminant analysis. The example uses two features, with feature 1 on the horizontal axis and feature 2 on the vertical axis. The features may range from simple ones such as average volume or observed segment length, to features commonly used in speech analysis such as the cepstrum or fundamental frequency, to likelihoods computed with Gaussian mixture models or hidden Markov models built from a sound database. Because the following discussion does not depend on which features are used, they are not detailed here. FIG. 5 assumes 13 “speech” and 26 “non-speech” teacher data points, 39 in total. Linear discriminant analysis draws a straight line on this plane so that the black circles (speech) gather on one side and the white circles (non-speech) on the other. As a quantitative criterion, the signed distance from each point to the line is computed (positive on one side of the line, negative on the other), and the line is chosen to maximize the difference between the sum of the distances for the black circles and the sum of the distances for the white circles. When the numbers of black and white circles differ, as in this figure, each black circle is counted as two so that the overall weights are balanced. This criterion yields the line that best separates the black circles from the white circles. In general, when the features are multidimensional, the result is a hyperplane that divides the feature space in two. The bold diagonal line in the figure is the result obtained by linear discriminant analysis. In this example, if points to the upper right of the line are judged to be black circles and points to the lower left white circles, only two points are misclassified: one black circle and one white circle. The determination performance on the teacher data is therefore about 92% for black circles (12 of 13) and about 96% for white circles (25 of 26), or about 94% on average.
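As an illustration of learning a separating line from two-dimensional features, the sketch below uses Fisher's linear discriminant, a standard textbook formulation that is swapped in here and is not necessarily the exact criterion the text describes. Averaging each class's scatter by its own size is an assumed way of balancing classes of unequal size, in the spirit of counting each black circle twice. All names and the toy data are hypothetical.

```python
def _mean(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def _scatter(points, m):
    """Entries (sxx, sxy, syy) of the 2x2 scatter matrix around mean m."""
    sxx = sum((p[0] - m[0]) ** 2 for p in points)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    syy = sum((p[1] - m[1]) ** 2 for p in points)
    return sxx, sxy, syy


def fit_line(speech, non_speech):
    """Return (w, b) such that w[0]*x + w[1]*y + b > 0 is judged speech."""
    m1, m0 = _mean(speech), _mean(non_speech)
    s1, s0 = _scatter(speech, m1), _scatter(non_speech, m0)
    # Divide each class scatter by its class size so both classes carry
    # equal total weight even when their counts differ (e.g. 13 vs 26).
    sxx = s1[0] / len(speech) + s0[0] / len(non_speech)
    sxy = s1[1] / len(speech) + s0[1] / len(non_speech)
    syy = s1[2] / len(speech) + s0[2] / len(non_speech)
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("degenerate data: scatter matrix is singular")
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    # w = S^-1 (m1 - m0), using the closed-form inverse of a 2x2 matrix.
    w = ((syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det)
    mid = ((m1[0] + m0[0]) / 2, (m1[1] + m0[1]) / 2)
    b = -(w[0] * mid[0] + w[1] * mid[1])  # line passes midway between means
    return w, b


def is_speech(w, b, point):
    """Check which side of the learned line the feature point lies on."""
    return w[0] * point[0] + w[1] * point[1] + b > 0
```

The pair `(w, b)` plays the role of the stored line parameters: judging a new sound then only requires converting it to a feature point and evaluating the sign of one linear expression.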

Returning to FIG. 4, when determination parameter learning 122 finishes, the result is saved in the determination parameter database 124. In the example of FIG. 5, the slope and intercept of the diagonal line are saved as parameters. With these parameters, when new sound data arrives, it can be converted into features and represented as a point on the plane, and whether it is speech or non-speech can be determined by checking which side of the line the point lies on. The result in the FIG. 5 example that the determination performance on the teacher data is about 94% corresponds to the role of determination accuracy estimation 126. Determination accuracy estimation 126 estimates, from the determination accuracy on the teacher data and similar information, the accuracy expected when new data is judged. This matters because manual activation with a remote control button or the like is preferable whenever the speech/non-speech automatic determination is not expected to be accurate enough. The result of determination accuracy estimation 126 is presented to the user by determination accuracy display 128. Because a figure such as 94% is not necessarily easy for the user to interpret, the display may instead, for example, light a red LED when the estimated accuracy is below a preset threshold and a blue LED when it is above.
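The arithmetic behind the 92%, 96%, and 94% figures, and the LED indication, can be reproduced with a short sketch. The function names and the 0.9 threshold are assumptions for illustration; the text only says that the threshold is preset, and it does not say which LED lights when the accuracy exactly equals the threshold.

```python
def class_accuracies(n_speech, speech_errors, n_non_speech, non_speech_errors):
    """Per-class accuracy on the teacher data and their unweighted average."""
    speech_acc = (n_speech - speech_errors) / n_speech
    non_speech_acc = (n_non_speech - non_speech_errors) / n_non_speech
    return speech_acc, non_speech_acc, (speech_acc + non_speech_acc) / 2


def led_color(estimated_accuracy, threshold=0.9):
    """Blue at or above the threshold, red below it.
    Both the 0.9 default and the tie-breaking rule are assumed."""
    return "blue" if estimated_accuracy >= threshold else "red"


# FIG. 5 example: 13 speech points and 26 non-speech points,
# with one misclassified point in each class.
speech_acc, non_speech_acc, average = class_accuracies(13, 1, 26, 1)
print(f"{speech_acc:.1%} {non_speech_acc:.1%} {average:.1%}")
print(led_color(average))
```

Averaging the two per-class accuracies, rather than dividing 37 correct points by 39, keeps the larger non-speech class from dominating the estimate.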

The operation mode setting is now described with reference to FIG. 1 again. When the user looks at determination accuracy display 128 and wishes to change the operation mode, the user does so by a button operation or the like (operation mode setting 108). For example, if the operation mode is changed from manual to automatic, subsequent processing follows the “automatic” branch at operation mode check 106. If the user is dissatisfied with the performance of the apparatus, returning the operation mode to “manual” makes it possible to operate the apparatus with the remote control or the like again. The change from “manual” to “automatic” typically becomes possible after speech recognition has been performed several hundred times, which takes about a few days.

FIG. 6 is a flowchart of the processing from acquisition of sound data to device control in an apparatus having the functional configuration of FIG. 1. Device control is actually executed in two cases: when a speech segment candidate is found, the operation mode is manual, the activation button is pressed, and the speech recognition result is a valid command (206-208-210-212-218-220-222); and when a speech segment candidate is found, the operation mode is automatic, the speech/non-speech determination result is speech, and the speech recognition result is a valid command (206-208-214-216-218-220-222). In all other cases, and after device control has been executed, the state returns to sound data acquisition 202, and the same operation repeats until the apparatus stops. In addition to these operations, the determination parameter learning flow described with FIG. 4 is started as needed, that is, whenever a certain amount of the required data has accumulated.
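The two paths that lead to device control can be condensed into one decision function. This is a sketch under assumptions: `classify` and `recognize` are hypothetical callables standing in for the speech/non-speech determination and the speech recognizer, and the storage of teacher signals is left out to keep the control flow visible.

```python
def process_segment(segment, mode, button_pressed, classify, recognize,
                    valid_commands):
    """Return the command to execute, or None when the flow should go
    straight back to sound data acquisition."""
    if segment is None:
        return None                  # no speech segment candidate was found
    if mode == "manual":
        if not button_pressed:
            return None              # activation button not pressed
    elif mode == "automatic":
        if not classify(segment):
            return None              # judged to be non-speech
    else:
        raise ValueError(f"unknown operation mode: {mode!r}")
    command = recognize(segment)
    # Only a recognition result that is a valid command triggers control.
    return command if command in valid_commands else None
```

A caller would invoke this once per detected candidate inside its acquisition loop and pass any non-`None` result to device control.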

FIG. 7 shows, as a first embodiment, the configuration of the device control apparatus based on speech recognition when it is implemented as part of a television set. Most of this configuration is implemented as modules inside the television main body 332. The internal modules include a central processing unit (CPU) and a storage unit (memory), which constitute the processing unit of the computer mentioned above. The CPU executes, as program processing, the functions of the speech segment candidate detection unit 304, operation mode check unit 306, teacher signal detection unit 312, speech/non-speech automatic determination unit 314, speech recognition unit 316, determination parameter learning unit 322, and determination accuracy estimation unit 326. The teacher signal database 320 and the determination parameter database 324 are held in the storage unit. The operation mode display unit 328 can be implemented as an independent display element or shown on the television display. Needless to say, some of the functions executed by the CPU, such as that of the speech recognition unit, may instead be implemented with dedicated circuits.

In addition, a remote control device 310 is used, along with a human detection sensor 330 serving as a human detection unit where needed. The remote control device 310 may be the one supplied with an ordinary television, or a single-button device dedicated to starting speech recognition. The former typically sends signals by infrared, while the latter would typically use low-power radio of the kind used for automotive keyless entry. The human detection sensor 330 detects whether anyone is in the room, for example from the response of an infrared sensor. Alternatively, it may be a device that provides information on whether anyone is in the room in cooperation with an office entry/exit management system or the like. In this embodiment, any existing technology can be used without particular difference as long as it provides information on whether anyone is in the room, so a detailed description of the human detection sensor 330 is omitted.

The infrared/wireless receiving unit 311 receives infrared or radio signals sent from the remote control device 310. This module plays the role of user input detection 112 in FIG. 1. In addition, when an ordinary remote control device 310 sends a television control signal, the unit passes the corresponding control signal to the channel/volume control unit 318. The channel/volume control unit 318 is a module found in ordinary television sets and changes the received channel, volume, and so on according to the control signal. When the remote control device 310 sends an operation mode change signal, the corresponding signal is passed to the operation mode setting unit 308. The operation mode setting unit 308 can also be operated directly with a button or the like (not shown).

The roles of the microphone 302, speech segment candidate detection unit 304, and operation mode check unit 306 are the same as those of the corresponding blocks in FIG. 1. When the operation mode is “manual”, the teacher signal detection unit 312 decides, according to the signal from the infrared/wireless receiving unit 311, whether the speech recognition unit 316 should be started. When the operation mode is “automatic”, it outputs, according to the signal from the human detection sensor 330, whether the operation of the speech recognition unit 316 should be suppressed or the decision should be left to the speech/non-speech automatic determination unit 314. The reason is that if no one is known to be in the room, it is desirable to treat the input sound data as “non-speech” regardless of the result of analyzing the sound. If the human detection sensor 330 cannot be fully trusted, the apparatus may, instead of immediately judging the input to be “non-speech”, temporarily adjust the decision criterion of the speech/non-speech automatic determination unit 314 to make a “non-speech” determination more likely.

As an additional function, once a speech recognition activation signal has been received from the infrared/wireless receiving unit 311, the apparatus may behave for a certain period afterward as it does in “manual” mode, even if the operation mode is “automatic”. If the remote control device 310 is known to be at the user's hand, the problem of not finding it at a moment's notice presumably does not need to be addressed, so there is little reason to run the risk of false responses to non-speech input. Here too, rather than switching fully to “manual” behavior, the apparatus may remain in “automatic” mode and temporarily adjust the decision criterion so that “speech” is less likely to be detected. When the speech recognition unit 316 operates, a signal corresponding to its result is sent to the channel/volume control unit 318. The “certain period” and “temporarily” mentioned here are typically on the order of several minutes.

When the operation mode is “manual”, “speech” or “non-speech” teacher signals are added to the teacher signal database 320 in the same way as in FIG. 1. In addition, even when the operation mode is “automatic”, if a “non-speech” determination is forced by the signal from the human detection sensor 330 or by the condition that the input falls within a certain period after use of the remote control device 310, the teacher signal detection unit 312 adds a “non-speech” teacher signal to the teacher signal database 320. The certain period after use is typically about several minutes. The subsequent roles of the determination parameter learning unit 322, determination parameter database 324, determination accuracy estimation unit 326, and operation mode display unit 328 are almost the same as those of the corresponding parts in FIG. 1.

FIG. 8 relates to the second embodiment and shows the configuration when the speech-recognition-based device control apparatus is implemented as a general-purpose device control apparatus for controlling various devices in a room or similar space.

The major difference from the embodiment of FIG. 7 is that, of the modules that were contained in the television main body 332, all except the channel/volume control unit 318 have been moved into the general-purpose control device 432 newly provided in this configuration. In the configuration of FIG. 7, processing by these modules was performed inside the television apparatus, and channel, volume, and similar settings were controlled according to the result. In this configuration, when the processing by these modules shows that device control is required, the control signal is either sent out as an infrared or wireless signal by the infrared/wireless transmission unit 432, or sent from the speech recognition unit 416 directly to the external device over a signal line. External devices include, besides the television 434, various electrical devices in the room such as the air conditioner 436 and the lighting 438. A device control signal is sent out as a combination specifying which device is to perform which operation, and each device performs the corresponding operation only when it receives a control command addressed to itself.
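The addressing scheme described above, in which every device sees the control signal but only the addressed device acts on it, can be sketched as follows. This is a minimal illustration, not the patent's implementation: the names `ControlCommand`, `Device`, and `broadcast`, and the device and action strings, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ControlCommand:
    """A (device, action) combination, as sent out by the general-purpose controller."""
    device_id: str  # hypothetical identifier such as "tv" or "aircon"
    action: str     # hypothetical action name such as "power_on"


@dataclass
class Device:
    device_id: str
    executed: list = field(default_factory=list)

    def receive(self, command: ControlCommand) -> bool:
        """Act only on commands addressed to this device; ignore all others."""
        if command.device_id != self.device_id:
            return False
        self.executed.append(command.action)
        return True


def broadcast(command: ControlCommand, devices: list) -> int:
    """Deliver one command to every device and return how many acted on it."""
    return sum(1 for device in devices if device.receive(command))
```

Because filtering happens on the receiving side, the controller does not need a separate channel per device, which matches a shared infrared or wireless link.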

FIG. 9 is a flowchart showing the flow of processing from acquisition of sound data to device control in apparatuses configured according to the embodiments of FIGS. 7 and 8. The difference from the flowchart of FIG. 6 is that, when the operation mode is automatic, two steps are added: a determination 514 of the time elapsed since the remote control device was used, and a determination 516 of human presence by the human detection sensor. If determinations 514 and 516 find that less than T seconds have elapsed since the remote control was used (T is set in advance according to the usage environment and, as noted above, is on the order of a few minutes), or that the sensor has not detected a person, the input sound data is judged to be self-evidently non-speech: no speech/non-speech determination is performed, and the sound data is stored as non-speech teacher data. If at least T seconds have elapsed since the remote control was used and the sensor has detected a person, the speech/non-speech determination 518 is executed as in FIG. 6.
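The two added determinations for automatic mode can be sketched as a small gating function. This is an illustrative sketch only: the names `Decision`, `gate_automatic_mode`, and `handle_sound` are hypothetical, and the default of 180 seconds for T is an assumed value within the "few minutes" range mentioned in the text.

```python
from enum import Enum


class Decision(Enum):
    FORCED_NON_SPEECH = "forced_non_speech"  # skip the classifier, keep as teacher data
    RUN_CLASSIFIER = "run_classifier"        # go on to speech/non-speech determination 518


def gate_automatic_mode(seconds_since_remote_use, person_detected, t_seconds=180.0):
    """Apply determinations 514 and 516; t_seconds is the preset T (assumed default)."""
    if seconds_since_remote_use < t_seconds:  # determination 514
        return Decision.FORCED_NON_SPEECH
    if not person_detected:                   # determination 516
        return Decision.FORCED_NON_SPEECH
    return Decision.RUN_CLASSIFIER


def handle_sound(sound, seconds_since_remote_use, person_detected,
                 non_speech_teacher_data, t_seconds=180.0):
    """Store forced non-speech inputs as teacher data and report the decision."""
    decision = gate_automatic_mode(seconds_since_remote_use, person_detected, t_seconds)
    if decision is Decision.FORCED_NON_SPEECH:
        non_speech_teacher_data.append(sound)
    return decision
```

A side effect of this gate is that every suppressed input also becomes a labeled non-speech example, so the classifier keeps receiving training data even in automatic mode.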

FIG. 10 relates to the third embodiment and shows the detailed configuration when the speech-recognition-based device control apparatus is implemented as part of a car navigation apparatus. In car navigation apparatuses with a speech recognition function, a speech recognition activation button 630 is generally installed at a position that is easy to reach while driving, separately from the touch panel and operation buttons 610 used to operate the navigation functions in general. The signal from the speech recognition activation button is sent directly to the teacher signal detection unit 612, whereas signals from the touch panel and operation buttons 610 are used either to operate the car navigation system directly or to set the operation mode. Unlike the television and general-purpose control device embodiments shown in FIGS. 7 and 8, all signals, including operation signals from the touch panel and the like, are transmitted by wire, but the behavior after these signals are received is substantially the same as in the television and other cases. The car navigation apparatus can also be connected, at installation, to a driving status transmission unit 632 that provides information on the driving status from the vehicle body, such as the engine speed. Using such information to, for example, disable touch panel operation when the travel speed exceeds a reference value is already common practice in many car navigation apparatuses.

In this embodiment, the adverse effect of a false reaction by speech recognition is additionally quantified on the basis of the travel speed, the steering angle, the amount of brake pedal depression, and the like. When the travel speed is high, when the vehicle is rounding a curve, or when the brake is being applied, the driver needs to concentrate more than usual, so it is important that speech recognition does not react falsely. The teacher signal detection unit 612 therefore calculates a weighted sum of the travel speed, steering angle, brake pedal depression, and so on as a false-reaction adverse-effect level, and when this value exceeds a predetermined value, it performs control so that speech recognition is not activated regardless of the result of the speech/non-speech discrimination. Alternatively, the teacher signal detection unit 612 performs control so that a correction proportional to the false-reaction adverse-effect level is added to the threshold used for speech/non-speech discrimination, making speech recognition harder to activate. This further reduces the risk that a malfunction of the speech recognition unit 616 creates a dangerous situation when there are special circumstances requiring the driver to concentrate on driving.
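Both control strategies in this paragraph reduce to a few lines of arithmetic. The sketch below is illustrative: the function names, the units of the inputs, the weights, and the gain are all assumptions, since the text specifies only that a weighted sum is compared with a preset value or used to raise the threshold proportionally.

```python
def adverse_effect_level(speed, steering_angle, brake_depression, weights):
    """Weighted sum of driving-status values (weights are assumed tuning constants)."""
    w_speed, w_steering, w_brake = weights
    return (w_speed * speed
            + w_steering * abs(steering_angle)
            + w_brake * brake_depression)


def may_activate(score, threshold, level, level_limit):
    """Strategy 1: never activate while the level exceeds the preset limit."""
    if level > level_limit:
        return False
    return score > threshold


def adjusted_threshold(base_threshold, level, gain):
    """Strategy 2: raise the threshold in proportion to the level."""
    return base_threshold + gain * level
```

The first strategy is a hard cutoff, while the second degrades gradually, so an unusually clear utterance could still activate recognition under moderate driving load.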

FIG. 11 is a flowchart showing an example of the operation of the automatic speech/non-speech determination unit (314, 414, and 614 in FIGS. 7, 8, and 10) in each embodiment. The example uses linear discriminant analysis. First, sound data is acquired (702), and feature values are computed from it (704). The features used here include average power, cepstrum, fundamental frequency, similarity to speech and non-speech models trained in advance, and the confidence of the recognition result obtained when speech recognition is applied to the data. The configuration of the present invention remains applicable as is if other features capable of expressing speech-likeness are used. Next, the features are normalized (706), and their inner product with linear discriminant coefficients learned in advance is computed (708). If the resulting value is greater than a threshold learned in advance, the data is determined to be speech; if it is smaller, non-speech (710, 712, 714). Although an example using linear discriminant analysis has been described, the configuration of the present invention remains applicable as is with other classification methods (for example, support vector machines). Applying such classification methods is straightforward for developers working in this field and is therefore not described in detail here.
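Steps 706 to 714 can be sketched in a few lines. The text does not say how the features are normalized, so the sketch assumes z-score normalization with per-feature means and standard deviations; the function names are hypothetical, and feature extraction (704) is taken as already done.

```python
def normalize(features, means, stds):
    """Step 706, assuming z-score normalization (the text leaves the method open)."""
    return [(x - m) / s for x, m, s in zip(features, means, stds)]


def discriminant_score(normalized, coefficients):
    """Step 708: inner product with the learned linear discriminant coefficients."""
    return sum(x * w for x, w in zip(normalized, coefficients))


def is_speech(features, means, stds, coefficients, threshold):
    """Steps 710 to 714: speech if the score exceeds the learned threshold."""
    score = discriminant_score(normalize(features, means, stds), coefficients)
    return score > threshold
```

A score exactly equal to the threshold is treated as non-speech here; the text only covers the greater and smaller cases.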

FIG. 12 is a flowchart showing an example of the operation of the determination parameter learning unit (322, 422, and 622 in FIGS. 7, 8, and 10) in each embodiment. This example also uses linear discriminant analysis. First, all the training data is acquired (802), and feature values are computed for it and normalized (804). The resulting feature data is converted to vectors, and the covariance matrix over the whole data set, the within-group mean of the speech data, and the within-group mean of the non-speech data are computed (806). Next, the inverse of the covariance matrix is computed (808) and multiplied by the difference between the within-group means (810). This yields the linear discriminant coefficients, which are saved (812). Because multiplying all the linear discriminant coefficients by a constant leaves performance unchanged, it is convenient to compare the result of applying the coefficients to the speech data with the result of applying them to the non-speech data and, if the former is smaller, to multiply all the coefficients by -1 to flip the sign. Next, the optimal threshold must be found (814); this is described in detail in connection with the determination accuracy estimation unit and is omitted here.
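Steps 806 to 812 can be sketched in plain Python as below. Two choices here are assumptions rather than details from the text: the covariance divides by the sample count n, and the sketch solves the linear system by Gauss-Jordan elimination instead of forming the inverse explicitly, which gives the same coefficients. Function names are hypothetical, and the input rows are taken to be already normalized feature vectors.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance_matrix(rows):
    """Covariance of all rows around their overall mean (divides by n)."""
    n = len(rows)
    mu = mean_vector(rows)
    dim = len(mu)
    cov = [[0.0] * dim for _ in range(dim)]
    for row in rows:
        centered = [x - m for x, m in zip(row, mu)]
        for i in range(dim):
            for j in range(dim):
                cov[i][j] += centered[i] * centered[j] / n
    return cov


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def learn_coefficients(speech_rows, non_speech_rows):
    """Return linear discriminant coefficients oriented so speech scores higher."""
    mu_speech = mean_vector(speech_rows)
    mu_non = mean_vector(non_speech_rows)
    cov = covariance_matrix(speech_rows + non_speech_rows)       # step 806
    diff = [s - n for s, n in zip(mu_speech, mu_non)]
    coeffs = solve(cov, diff)                                    # steps 808 and 810
    if dot(coeffs, mu_speech) < dot(coeffs, mu_non):             # sign convention
        coeffs = [-w for w in coeffs]
    return coeffs
```

With a positive-definite covariance matrix and the difference taken as speech minus non-speech, the sign flip should never trigger; it is kept so the orientation is guaranteed whichever way the difference is formed.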

FIG. 13 is a flowchart showing an example of the operation of the determination accuracy estimation unit (324, 424, and 624 in FIGS. 7, 8, and 10) in each embodiment. As in FIG. 12, normalized feature vectors are obtained for all the training data (902), and the linear discriminant coefficients are applied to them to obtain linear discriminant scores (904). Next, C is set to Cmin, the smallest candidate value of the threshold C (906). A suitable choice for Cmin is, for example, the mean linear discriminant score over all the non-speech data. Then the average discrimination rate R obtained with the candidate threshold C is computed. The average discrimination rate is the mean of the rate at which speech data is correctly determined to be speech and the rate at which non-speech data is correctly determined to be non-speech. If performance on either speech data or non-speech data is to be given more importance, the balance can be adjusted by taking a weighted average of the two rates. This process is repeated for various candidate values C, and the C that maximizes R is taken as the optimal threshold Cbest. The maximum value Rmax is taken as the estimated determination accuracy (908-918).
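The threshold search can be sketched as a simple sweep over candidate values. The text fixes only the starting point Cmin and the selection rule; the upper bound (here defaulting to the mean speech score), the evenly spaced candidates, and the function and parameter names are assumptions made for illustration.

```python
def average_discrimination_rate(speech_scores, non_speech_scores, threshold,
                                speech_weight=0.5):
    """Weighted mean of the two correct-determination rates (0.5 gives the plain mean)."""
    speech_rate = sum(s > threshold for s in speech_scores) / len(speech_scores)
    non_speech_rate = (sum(s <= threshold for s in non_speech_scores)
                       / len(non_speech_scores))
    return speech_weight * speech_rate + (1.0 - speech_weight) * non_speech_rate


def search_threshold(speech_scores, non_speech_scores, c_max=None, steps=100,
                     speech_weight=0.5):
    """Sweep C upward from Cmin and return (Cbest, Rmax)."""
    c_min = sum(non_speech_scores) / len(non_speech_scores)  # suggested Cmin
    if c_max is None:
        c_max = sum(speech_scores) / len(speech_scores)      # assumed upper bound
    c_best, r_max = c_min, -1.0
    for k in range(steps + 1):
        c = c_min + (c_max - c_min) * k / steps
        r = average_discrimination_rate(speech_scores, non_speech_scores, c,
                                        speech_weight)
        if r > r_max:  # keep the first candidate that reaches the maximum
            c_best, r_max = c, r
    return c_best, r_max
```

Raising `speech_weight` above 0.5 favors catching real speech at the cost of more false activations, and lowering it does the opposite.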

The present invention described in detail above can be used as a home appliance, such as a television, with a function that makes operation from a distance easier. It can also be used as a general-purpose control device for controlling several devices together, and as a car navigation apparatus for automobiles.

FIG. 1 is a configuration diagram of a typical system for explaining the basic configuration of the present invention.
FIG. 2 is a diagram showing details of the processing flow in the configuration of FIG. 1 when the operation mode is manual.
FIG. 3 is a diagram showing details of the processing flow in the configuration of FIG. 1 when the operation mode is automatic.
FIG. 4 is a diagram showing details of the processing flow related to learning of determination parameters in the configuration of FIG. 1.
FIG. 5 is an explanatory diagram of an example of determination parameter learning using linear discriminant analysis in the configuration of FIG. 1.
FIG. 6 is a flowchart for controlling a device by voice in the configuration of FIG. 1.
FIG. 7 is a configuration diagram, according to the first embodiment, of the device control apparatus implemented as part of a television apparatus.
FIG. 8 is a configuration diagram, according to the second embodiment, of the device control apparatus implemented as a general-purpose device control apparatus.
FIG. 9 is a flowchart explaining the operation of the device control apparatus according to the first and second embodiments.
FIG. 10 is a configuration diagram, according to the third embodiment, of the device control apparatus implemented as part of a car navigation apparatus.
FIG. 11 is a flowchart explaining the operation of a main part of the device control apparatus according to each embodiment.
FIG. 12 is a flowchart explaining the operation of a main part of the device control apparatus according to each embodiment.
FIG. 13 is a flowchart explaining the operation of a main part of the device control apparatus according to each embodiment.

Explanation of symbols

102…Voice input device such as a microphone
104…Processing block that detects speech segment candidates
106…Processing block that determines whether the operation mode is manual or automatic
108…Processing block in which the user sets the operation mode to manual or automatic
110…Processing block in which the user inputs activation of speech recognition with a switch or the like
112…Block that detects input from the user
114…Block that automatically determines whether input sound data is speech or non-speech
116…Block that executes speech recognition
118…Block that controls the device based on the speech recognition result
120…Database that stores teacher signals
122…Block that learns determination parameters using the teacher signals
124…Database that holds the determination parameters obtained by learning
126…Block that estimates the accuracy of determination with the current determination parameters
128…Block that displays the estimated determination accuracy to the user

Claims

A speech-recognition-based device control apparatus that controls a device by recognizing a user's voice with a speech recognition unit, comprising:
a switching unit that switches the speech recognition activation method between manual and automatic;
a detection unit that detects sound data that is a candidate for input speech for speech recognition;
an automatic determination unit that automatically determines whether the detected sound data is data that should be subject to speech recognition;
a teacher signal detection unit that, when it is self-evident whether the detected sound data is data that should be subject to speech recognition, stores the sound data in a database; and
a learning unit that learns the determination parameters used by the automatic determination unit on the basis of the data stored in the database.

The speech-recognition-based device control apparatus according to claim 1, wherein, when speech recognition is activated manually via the switching unit, the teacher signal detection unit stores the input sound data in the database as data that should be subject to speech recognition.

The speech-recognition-based device control apparatus according to claim 1, wherein the teacher signal detection unit suspends automatic activation of speech recognition for a certain period after the user manually activates speech recognition.

The speech-recognition-based device control apparatus according to claim 1, wherein, for a certain period after the user manually activates speech recognition, the teacher signal detection unit stores the detected sound data in the database as data that should not be subject to speech recognition.

The speech-recognition-based device control apparatus according to claim 1, wherein the teacher signal detection unit makes automatic activation of speech recognition less likely to occur for a certain period after the user manually activates speech recognition.

The speech-recognition-based device control apparatus according to claim 1, further comprising a human detection unit that detects, using information other than sound, whether a person is present around the device, wherein the teacher signal detection unit suspends automatic activation of speech recognition when the human detection unit detects that no person is present.

The speech-recognition-based device control apparatus according to claim 1, further comprising a human detection unit that detects, using information other than sound, whether a person is present around the device, wherein, when the human detection unit detects that no person is present, the teacher signal detection unit stores the detected sound data in the database as data that should not be subject to speech recognition.

The speech-recognition-based device control apparatus according to claim 1, wherein the teacher signal detection unit changes how readily automatic speech recognition is activated according to a false-reaction adverse-effect level that quantifies, according to the status of the device to be controlled, the adverse effect of a false reaction by speech recognition.

The speech-recognition-based device control apparatus according to claim 1, further comprising a determination accuracy estimation unit that estimates determination accuracy on the basis of the accumulated determination parameters.

The speech-recognition-based device control apparatus according to claim 9, further comprising an operation mode display unit that displays the determination accuracy calculated by the determination accuracy estimation unit.

A speech-recognition-based device control apparatus that recognizes a user's voice and controls a device, comprising:
a switching unit that switches the speech recognition activation method between manual and automatic;
a processing unit that processes input speech for speech recognition; and
a storage unit that stores processing results of the processing unit,
wherein the processing unit detects sound data that is a candidate for the input speech, and performs control so that speech recognition is performed on the sound data when the activation method is manual, or when the activation method is automatic and the detected sound data is determined to be data that should be subject to speech recognition, and
wherein, when it is clear whether the sound data is data that should be subject to speech recognition, the processing unit stores the sound data in the storage unit and learns the determination parameters used in the determination on the basis of the sound data stored in the storage unit.

The speech-recognition-based device control apparatus according to claim 11, wherein, when speech recognition is activated manually via the switching unit, the processing unit stores the sound data in the storage unit as data that should be subject to speech recognition.

The speech-recognition-based device control apparatus according to claim 12, wherein, for a certain period after the user manually activates speech recognition, the processing unit stores the sound data in the storage unit as data that should not be subject to speech recognition.

The speech-recognition-based device control apparatus according to claim 11, wherein the processing unit suspends automatic activation of speech recognition for a certain period after the user manually activates speech recognition.

The speech-recognition-based device control apparatus according to claim 11, wherein the processing unit performs control so that automatic activation of speech recognition is less likely to occur for a certain period after the user manually activates speech recognition.

The speech-recognition-based device control apparatus according to claim 11, further comprising a human detection unit that detects, using information other than sound, whether a person is present around the device, wherein the processing unit suspends automatic activation of speech recognition when the human detection unit detects that no person is present.

The speech-recognition-based device control apparatus according to claim 11, further comprising a human detection unit that detects, using information other than sound, whether a person is present around the device, wherein, when the human detection unit detects that no person is present, the processing unit stores the detected sound data in the storage unit as data that should not be subject to speech recognition.

The speech-recognition-based device control apparatus according to claim 11, wherein the processing unit changes how readily automatic speech recognition is activated according to a false-reaction adverse-effect level that quantifies, according to the status of the device to be controlled, the adverse effect of a false reaction by speech recognition.

The speech-recognition-based device control apparatus according to claim 11, wherein the processing unit stores the determination parameters in the storage unit and estimates determination accuracy on the basis of the stored determination parameters.

The speech-recognition-based device control apparatus according to claim 19, further comprising an operation mode display unit that displays the estimated determination accuracy.