JP5430382B2

JP5430382B2 - Input device and method

Info

Publication number: JP5430382B2
Application number: JP2009285106A
Authority: JP
Inventors: 貴司所
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2009-12-16
Filing date: 2009-12-16
Publication date: 2014-02-26
Anticipated expiration: 2029-12-16
Also published as: JP2011128766A

Description

The present invention relates to an input device and method that determine a command to be input to an operation target device based on an operator's voice and gestures.

In recent years, advances in recognition technologies such as voice recognition and image recognition have led to proposals for input devices that can input commands to an operation target device through multiple input means, such as voice and gestures. For example, Patent Literature 1 describes a technique that, among the command candidates corresponding to the content of the operator's voice and gesture, inputs to the operation target device the command most likely to be the one the operator intended.

JP 2002-182680 A

In the above prior art, the command to be input to the operation target device is always determined based on both the command candidates corresponding to the operator's voice and those corresponding to the operator's gesture. However, everyday conversation, gesticulation, and the like that are not intended as command input to the operation target device can interfere with recognition, so the operator's voice or gesture may not be recognized correctly. In the above prior art, when either the operator's voice or the gesture cannot be recognized normally because of such a recognition-inhibiting factor, a command the operator did not intend may be erroneously input to the operation target device.

Accordingly, an object of the present invention is to suppress the input of commands not intended by the operator to the operation target device, in an input device that determines a command to be input to the operation target device based on the operator's voice and gesture.

The present invention is an input device that determines a command to be input to an operation target device based on both voice and gesture by an operator, the input device comprising:
a voice input unit to which voice by the operator is input;
an image input unit to which an image capturing a gesture by the operator is input;
a first calculation unit that performs voice recognition processing on the input voice, identifies, from among voices predetermined as voices for inputting commands to the operation target device, a candidate voice that matches the input voice, and calculates a voice recognition score, which is an index of the likelihood that the input voice matches the identified candidate;
a second calculation unit that performs image recognition processing on the input image, identifies, from among gestures predetermined as gestures for inputting commands to the operation target device, a candidate gesture that matches the gesture captured in the input image, and calculates a gesture recognition score, which is an index of the likelihood that the gesture captured in the input image matches the identified candidate; and
a determination unit that determines the command to be input to the operation target device based on the identified candidates,
wherein, when either one of the calculated voice recognition score and gesture recognition score is smaller than a predetermined first threshold, the determination unit determines the command to be input to the operation target device based only on the candidate corresponding to the other recognition score, and when at least one of the calculated voice recognition score and gesture recognition score is smaller than a predetermined second threshold that is smaller than the first threshold, the determination unit discards both identified candidates and does not determine a command to be input to the operation target device based on the identified candidates.

The present invention is also an input device that determines a command to be input to an operation target device based on both voice and gesture by an operator, the input device comprising:
a voice input unit to which voice by the operator is input;
an image input unit to which an image capturing a gesture by the operator is input;
a first calculation unit that performs voice recognition processing on the input voice, identifies, from among voices predetermined as voices for inputting commands to the operation target device, a candidate voice that matches the input voice, and calculates a voice recognition score, which is an index of the likelihood that the input voice matches the identified candidate;
a second calculation unit that performs image recognition processing on the input image, identifies, from among gestures predetermined as gestures for inputting commands to the operation target device, a candidate gesture that matches the gesture captured in the input image, and calculates a gesture recognition score, which is an index of the likelihood that the gesture captured in the input image matches the identified candidate; and
a determination unit that determines the command to be input to the operation target device based on the identified candidates,
wherein, when the magnitude of the difference between the calculated voice recognition score and gesture recognition score is greater than a predetermined third threshold, the determination unit determines the command to be input to the operation target device based only on the candidate corresponding to the larger of the voice recognition score and the gesture recognition score.
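The third-threshold rule can be illustrated with a short Python sketch. The function name `choose_by_score_gap`, the string labels, and the percentage score scale are hypothetical. The fallback to both candidates when the gap does not exceed the threshold is an assumption drawn from the device's stated default of using both voice and gesture; this passage itself only specifies the large-gap case.

```python
def choose_by_score_gap(voice_score: float,
                        gesture_score: float,
                        third_threshold: float) -> str:
    """Return which candidate(s) to base the command on.

    Returns 'voice', 'gesture', or 'both' (hypothetical labels).
    """
    gap = abs(voice_score - gesture_score)
    if gap > third_threshold:
        # Scores disagree strongly: trust only the higher-scoring modality.
        return "voice" if voice_score > gesture_score else "gesture"
    # Assumption: when the gap is within the threshold, both candidates
    # remain in play (the text only specifies the large-gap case).
    return "both"
```

For instance, with scores expressed as percentages and a third threshold of 40, a voice score of 90 against a gesture score of 30 selects the voice candidate alone.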

The present invention is also an input method for determining a command to be input to an operation target device based on both voice and gesture by an operator, the input method comprising:
a voice input step in which voice by the operator is input;
an image input step in which an image capturing a gesture by the operator is input;
a first calculation step of performing voice recognition processing on the input voice, identifying, from among voices predetermined as voices for inputting commands to the operation target device, a candidate voice that matches the input voice, and calculating a voice recognition score, which is an index of the likelihood that the input voice matches the identified candidate;
a second calculation step of performing image recognition processing on the input image, identifying, from among gestures predetermined as gestures for inputting commands to the operation target device, a candidate gesture that matches the gesture captured in the input image, and calculating a gesture recognition score, which is an index of the likelihood that the gesture captured in the input image matches the identified candidate; and
a determination step of determining the command to be input to the operation target device based on the identified candidates,
wherein, in the determination step, when either one of the calculated voice recognition score and gesture recognition score is smaller than a predetermined first threshold, the command to be input to the operation target device is determined based only on the candidate corresponding to the other recognition score, and when at least one of the calculated voice recognition score and gesture recognition score is smaller than a predetermined second threshold that is smaller than the first threshold, both identified candidates are discarded and no command to be input to the operation target device is determined based on the identified candidates.

The present invention is also an input method for determining a command to be input to an operation target device based on both voice and gesture by an operator, the input method comprising:
a voice input step in which voice by the operator is input;
an image input step in which an image capturing a gesture by the operator is input;
a first calculation step of performing voice recognition processing on the input voice, identifying, from among voices predetermined as voices for inputting commands to the operation target device, a candidate voice that matches the input voice, and calculating a voice recognition score, which is an index of the likelihood that the input voice matches the identified candidate;
a second calculation step of performing image recognition processing on the input image, identifying, from among gestures predetermined as gestures for inputting commands to the operation target device, a candidate gesture that matches the gesture captured in the input image, and calculating a gesture recognition score, which is an index of the likelihood that the gesture captured in the input image matches the identified candidate; and
a determination step of determining the command to be input to the operation target device based on the identified candidates,
wherein, in the determination step, when the magnitude of the difference between the calculated voice recognition score and gesture recognition score is greater than a predetermined third threshold, the command to be input to the operation target device is determined based only on the candidate corresponding to the larger of the voice recognition score and the gesture recognition score.

According to the present invention, in an input device that determines a command to be input to an operation target device based on the operator's voice and gesture, the input of commands not intended by the operator to the operation target device can be suppressed.

FIG. 1 is a block diagram showing a schematic configuration of the command input device according to Embodiments 1 and 2.
FIG. 2 is a flowchart showing the processing performed when voice and a gesture are input in Embodiment 1.
FIG. 3 is a diagram showing the command determination method according to the recognition score in Embodiment 1.
FIG. 4 is a flowchart showing the processing performed when voice and a gesture are input in Embodiment 2.
FIG. 5 is a diagram showing the command determination method according to the recognition score in Embodiment 2.

(Embodiment 1)
Specific embodiments of the present invention are described below with reference to the drawings. The following embodiments are examples of carrying out the present invention and are not intended to limit its scope.
FIG. 1 is a block diagram showing a schematic configuration of a command input device 101 according to the first embodiment of the present invention. The command input device 101 is an input device that determines a command to be input to an operation target device 110 based on both the operator's voice and gesture. Examples of the operation target device 110 to which the command input device 101 of this embodiment can be applied include televisions, recorders, personal computers, game machines, media players, and other devices configured to operate according to operation instructions given by the operator's voice and gestures. For example, when the operation target device is a television receiver, the commands input to it include channel switching, volume adjustment, image quality adjustment, and input switching.

As shown in FIG. 1, the command input device 101 includes a voice input unit 102, a voice command recognition score determination unit 103, a voice command database 104, a gesture input unit 105, and a gesture command recognition score determination unit 106. The command input device 101 further includes a gesture command database 107, a threshold comparison unit 108, and a command determination unit 109. The recognition score difference determination unit 402 shown with broken lines in FIG. 1, together with the input lines to it from the voice command recognition score determination unit 103 and the gesture command recognition score determination unit 106, are components relating to Embodiment 2, described later. The parenthesized reference numerals on the threshold comparison unit and the command determination unit are likewise reference numerals for Embodiment 2 and are unrelated to the present embodiment.

The voice input unit 102 collects the voice uttered by the operator with a microphone, converts it into a voice signal, and outputs the signal to the voice command recognition score determination unit 103.

The voice command recognition score determination unit 103 performs voice recognition processing on the input voice signal and identifies the content of the voice uttered by the operator. Based on that content, it extracts, from among the voices predetermined as voices for inputting commands to the operation target device 110 (voice commands), a plurality of voice command candidates that match the input voice. The voice commands are stored in advance in the voice command database 104.

The voice command recognition score determination unit 103 assigns a recognition score to each of the extracted voice command candidates. Here, the recognition score is a quantity that indicates the likelihood that the input voice matches a voice command stored in the voice command database 104, that is, the degree of match. For example, the recognition score can be calculated by computing the correlation between the voice signal input to the voice command recognition score determination unit 103 and a typical voice signal of each voice command (stored in advance). Any method may be used to calculate the recognition score, as long as the result is a quantity that indicates the degree of match between the operator's voice and the voice command.
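As one way to picture the correlation-based scoring mentioned above, the following Python sketch scores an input signal against stored template signals using the Pearson correlation coefficient. The function names, the template dictionary, and the clipping of negative correlations to zero are assumptions made for illustration; the text leaves the actual scoring method open.

```python
import math
from typing import Dict, Sequence


def correlation_score(signal: Sequence[float],
                      template: Sequence[float]) -> float:
    """Pearson correlation clipped to [0, 1]; higher means a closer match."""
    n = len(signal)
    if n == 0 or n != len(template):
        raise ValueError("signal and template must be non-empty and equal length")
    mean_s = sum(signal) / n
    mean_t = sum(template) / n
    cov = sum((s - mean_s) * (t - mean_t) for s, t in zip(signal, template))
    var_s = sum((s - mean_s) ** 2 for s in signal)
    var_t = sum((t - mean_t) ** 2 for t in template)
    if var_s == 0 or var_t == 0:
        return 0.0  # a constant signal carries no shape to correlate
    # Assumption: negative correlation is treated as "no match".
    return max(0.0, cov / math.sqrt(var_s * var_t))


def score_candidates(signal: Sequence[float],
                     templates: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Assign a recognition score to every stored command template."""
    return {name: correlation_score(signal, tpl)
            for name, tpl in templates.items()}


# Hypothetical templates standing in for the "typical signals" stored in advance.
templates = {"volume_up": [0, 1, 0, -1], "channel_next": [1, 0, -1, 0]}
scores = score_candidates([0, 2, 0, -2], templates)
```

In this toy example the input has the same shape as the `volume_up` template at twice the amplitude, so it receives the highest score, while the uncorrelated `channel_next` template scores zero.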

The voice command database 104 stores the voice signal analysis data used in the voice recognition processing performed by the voice command recognition score determination unit 103. In this embodiment, the voice command recognition score determination unit 103, which identifies voice command candidates and calculates recognition scores using the data stored in the voice command database 104, constitutes the first calculation unit of the present invention.

The gesture input unit 105 captures the gesture performed by the operator with a camera, converts it into an image signal, and outputs the signal to the gesture command recognition score determination unit 106. In this embodiment, the gesture input unit 105 constitutes the image input unit of the present invention.

The gesture command recognition score determination unit 106 performs image recognition processing on the input image signal and identifies the content of the gesture performed by the operator. Based on that content, it extracts, from among the gestures predetermined as gestures for inputting commands to the operation target device 110 (gesture commands), a plurality of gesture command candidates that match the gesture captured in the input image. The gesture commands are stored in advance in the gesture command database 107.

The gesture command recognition score determination unit 106 assigns a recognition score to each of the extracted gesture command candidates. Here, the recognition score is a quantity that indicates the likelihood that the gesture captured in the input image matches a gesture command stored in the gesture command database 107, that is, the degree of match. For example, the recognition score can be calculated by computing the correlation between the image signal input to the gesture command recognition score determination unit 106 and a typical image signal of each gesture command (stored in advance). Any method may be used to calculate the recognition score, as long as the result is a quantity that indicates the degree of match between the operator's gesture and the gesture command. In this embodiment, the recognition score is a quantity whose value is larger the higher the degree of match.

The gesture command database 107 stores the image signal analysis data used in the image recognition processing performed by the gesture command recognition score determination unit 106. In this embodiment, the gesture command recognition score determination unit 106, which identifies gesture command candidates and calculates recognition scores using the data stored in the gesture command database 107, constitutes the second calculation unit of the present invention.

The threshold comparison unit 108 compares the maximum of the recognition scores of the voice command candidates input from the voice command recognition score determination unit 103 with a predetermined first threshold and second threshold stored in advance. It likewise compares the maximum of the recognition scores of the gesture command candidates input from the gesture command recognition score determination unit 106 with the first threshold and the second threshold. It then outputs the comparison results to the command determination unit 109. The comparison processing is described later.
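The comparison performed for one modality can be sketched in Python as follows. The names `Verdict` and `compare_with_thresholds` are hypothetical, the defaults of 50 and 5 follow the example percentages given for the first and second thresholds in this embodiment, and treating an empty candidate list as a score of zero is an assumption.

```python
from enum import Enum
from typing import Dict


class Verdict(Enum):
    USE = "use"        # maximum score >= first threshold
    IGNORE = "ignore"  # second threshold <= maximum score < first threshold
    REJECT = "reject"  # maximum score < second threshold


def compare_with_thresholds(candidate_scores: Dict[str, float],
                            first_threshold: float = 50.0,
                            second_threshold: float = 5.0) -> Verdict:
    """Classify one modality by the maximum score among its candidates."""
    if second_threshold >= first_threshold:
        raise ValueError("second threshold must be smaller than the first")
    # Assumption: no candidates at all is treated like a score of zero.
    best = max(candidate_scores.values(), default=0.0)
    if best >= first_threshold:
        return Verdict.USE
    if best >= second_threshold:
        return Verdict.IGNORE
    return Verdict.REJECT
```

The same function would be called once with the voice command candidates and once with the gesture command candidates, and the two verdicts passed on for command determination.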

According to the comparison results input from the threshold comparison unit 108, the command determination unit 109 selects one of the following three methods for determining the command to be input to the operation target device 110 from the extracted voice command candidates and gesture command candidates. In the first command determination method, the input command is determined based on both the voice command candidates and the gesture command candidates. In the second command determination method, the input command is determined based on either the voice command candidates or the gesture command candidates. In the third command determination method, the voice command candidates and the gesture command candidates are discarded, and no command is input to the operation target device 110. The command determination unit 109 determines the command to be input to the operation target device 110 by the selected method. In this embodiment, the threshold comparison unit 108 and the command determination unit 109 constitute the determination unit of the present invention.
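The selection among the three command determination methods might be sketched as below. `Method` and `select_method` are hypothetical names, and the mapping from comparison results to methods is an assumption based on the threshold rules stated for this device. The case in which neither modality is usable but nothing was discarded is not covered by the text up to this point, so the sketch returns `None` for it rather than guessing.

```python
from enum import Enum
from typing import Optional


class Method(Enum):
    BOTH = 1     # first method: voice and gesture candidates together
    SINGLE = 2   # second method: only one modality's candidates
    DISCARD = 3  # third method: drop all candidates, input no command


def select_method(voice_usable: bool,
                  gesture_usable: bool,
                  rejected: bool) -> Optional[Method]:
    """Pick a determination method from the threshold comparison results.

    `rejected` is True when either modality fell below the second threshold.
    """
    if rejected:
        return Method.DISCARD
    if voice_usable and gesture_usable:
        return Method.BOTH
    if voice_usable or gesture_usable:
        return Method.SINGLE
    # Not specified in the text so far: neither modality reached the
    # first threshold, yet neither fell below the second.
    return None
```

Keeping the unspecified case explicit makes it easy to plug in whatever behavior a concrete implementation chooses for it.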

FIG. 2 is a flowchart showing the processing that starts, in this embodiment, when voice and a gesture are input to the voice input unit 102 and the gesture input unit 105.

In S201, when a voice signal is input from the voice input unit 102 to the voice command recognition score determination unit 103 and an image signal capturing the gesture is input from the gesture input unit 105 to the gesture command recognition score determination unit 106, the process proceeds to S202.

In S202, the voice command recognition score determination unit 103 extracts voice command candidates based on the input voice signal and calculates a recognition score for each extracted voice command candidate, and the process proceeds to S203.

In S203, the threshold comparison unit 108 compares the maximum of the recognition scores of the voice command candidates calculated in S202 with the first threshold stored in advance (for example, 50%). If the maximum recognition score is equal to or greater than the first threshold, the process proceeds to S204. If the maximum recognition score is smaller than the first threshold, the process proceeds to S205.
Here, the first threshold is a reference value of the recognition score for judging whether an extracted command candidate is reliable enough for the command to be input to the operation target device 110 to be determined from it. The first threshold is set to an optimum value obtained through experiments or the like and is stored in advance. The value of 50% given in this embodiment is only an example, and the optimum value for the first threshold may differ depending on the implementation.
If the maximum of the recognition scores calculated for the command candidates is equal to or greater than the first threshold, the command candidates are judged to be reliable enough for the input command to be determined from them. If the maximum recognition score is less than the first threshold, the extracted command candidates are judged not to be reliable enough for the input command to be determined from them.

In S204, the threshold comparison unit 108 notifies the command determination unit 109 that the voice command candidates extracted from the voice input are to be used in determining the command to be input to the operation target device 110, and the process proceeds to S207.

In S205, the threshold comparison unit 108 compares the maximum of the recognition scores of the voice command candidates calculated in S202 with the second threshold stored in advance (for example, 5%). If the maximum recognition score is equal to or greater than the second threshold, the process proceeds to S206. If the maximum recognition score is smaller than the second threshold, the process proceeds to S212.
Here, the second threshold is a reference value of the recognition score for judging whether the operator is speaking or gesturing with the intention of inputting a command to the operation target device 110, and whether a significant recognition-inhibiting factor is affecting recognition of the voice or gesture. The second threshold is set to an optimum value obtained through experiments or the like and is stored in advance. The value of 5% given in this embodiment is only an example, and the optimum value for the second threshold may differ depending on the implementation.
If the maximum of the recognition scores calculated for the command candidates is equal to or greater than the second threshold, it is judged that the operator is speaking or gesturing with the intention of inputting a command to the operation target device 110 and that there is no significant recognition-inhibiting factor. If the maximum recognition score is less than the second threshold, it is judged either that the input voice signal or image signal does not capture an utterance or gesture made by the operator with the intention of inputting a command to the operation target device 110, or that a significant recognition-inhibiting factor is present. In such a case, no command to be input to the operation target device 110 is determined from the voice and gesture.

In S206, the threshold comparison unit 108 notifies the command determination unit 109 that the voice command candidates extracted from the voice input are not to be used in determining the command to be input to the operation target device 110, and the process proceeds to S207.

In S207, the gesture command recognition score determination unit 106 extracts gesture command candidates based on the input image signal and calculates a recognition score for each extracted gesture command candidate, and the process proceeds to S208.

In S208, the threshold comparison unit 108 compares the maximum of the recognition scores of the gesture command candidates calculated in S207 with the first threshold stored in advance (50%). If the maximum recognition score is equal to or greater than the first threshold, the process proceeds to S209. If the maximum recognition score is smaller than the first threshold, the process proceeds to S210.

In S209, the threshold comparison unit 108 notifies the command determination unit 109 that the gesture command candidates extracted from the gesture input will be used to determine the command to be input to the operation target device 110, and the process proceeds to S213.

In S210, the threshold comparison unit 108 compares the maximum of the recognition scores calculated in S207 for the gesture command candidates with the second threshold (5%) stored in advance. If the maximum recognition score is greater than or equal to the second threshold, the process proceeds to S211. If it is smaller than the second threshold, the process proceeds to S212.

In S211, the threshold comparison unit 108 notifies the command determination unit 109 that the gesture command candidates extracted from the gesture input will not be used to determine the command to be input to the operation target device 110, and the process proceeds to S213.
Note that this embodiment uses the first and second thresholds for the voice command candidate recognition scores unchanged as the thresholds for the gesture command candidate recognition scores, but separate thresholds may be defined for the gesture command candidates.

In S212, the threshold comparison unit 108 discards both the voice command candidates extracted from the voice input and the gesture command candidates extracted from the gesture input. This is because, with the recognition score of each command candidate smaller than the second threshold, it can be judged either that the operator is not speaking or gesturing to input a command or that a significant recognition impediment exists.

In S213, the command determination unit 109 determines the command to be input to the operation target device 110 according to the command determination method notified by the threshold comparison unit 108.
When both the voice command candidates and the gesture command candidates are used (S204 and S209 were executed), the recognition scores of the command candidates common to both are accumulated, and the candidate with the largest resulting value is determined as the input command.
When only one of the voice command candidates or the gesture command candidates is used (S204 and S211 were executed, or S206 and S209 were executed), the candidate with the highest recognition score among those candidates is determined as the input command.
When the command determination unit 109 is notified that neither the voice command candidates nor the gesture command candidates are to be used (S206 and S211 were executed), both sets of candidates are discarded as in S212, and no command is input to the operation target device 110.
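
The three cases handled in S213 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `decide_command`, the dictionary representation of candidates, and the use of addition to accumulate the scores of common candidates are all assumptions. The source also does not say what happens when the two modalities share no candidate; the sketch returns no command in that case.

```python
from typing import Dict, Optional

# Hypothetical representation: command name -> recognition score in [0.0, 1.0].
Scores = Dict[str, float]


def decide_command(
    voice: Scores,
    gesture: Scores,
    use_voice: bool,
    use_gesture: bool,
) -> Optional[str]:
    """Sketch of S213: choose the input command from the usable candidates."""
    if use_voice and use_gesture:
        # Only commands proposed by both modalities are considered.
        common = voice.keys() & gesture.keys()
        if not common:
            return None  # assumption: behaviour not specified in the source
        # Assumption: scores are accumulated by simple addition.
        return max(common, key=lambda c: voice[c] + gesture[c])
    if use_voice:
        return max(voice, key=lambda c: voice[c]) if voice else None
    if use_gesture:
        return max(gesture, key=lambda c: gesture[c]) if gesture else None
    # Neither modality is usable: discard everything, input no command.
    return None
```

With voice candidates `{"play": 0.8, "stop": 0.6}` and gesture candidates `{"play": 0.55, "stop": 0.9}`, using both modalities selects `"stop"`, whereas using voice alone selects `"play"`.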

In the flowchart above, the extraction of voice command candidates, calculation of recognition scores, comparison with the thresholds, and determination of whether to use the candidates for command determination (S202 to S206) are followed by the same processing for gesture command candidates (S207 to S211). However, the voice input processing and the gesture input processing may be performed in either order, so the gesture input processing may come first. The two may also be performed in parallel.
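
As one way to picture the parallel variant, the sketch below runs the two scoring stages concurrently with Python's standard `concurrent.futures` module. The function `recognize_in_parallel` and the two scorer callables are hypothetical stand-ins for the voice and gesture command recognition score determination units; the source does not prescribe any particular concurrency mechanism.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Tuple

# Hypothetical representation: command name -> recognition score.
Scores = Dict[str, float]


def recognize_in_parallel(
    score_voice: Callable[[bytes], Scores],
    score_gesture: Callable[[bytes], Scores],
    audio: bytes,
    image: bytes,
) -> Tuple[Scores, Scores]:
    """Run voice and gesture scoring concurrently and return both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        voice_future = pool.submit(score_voice, audio)
        gesture_future = pool.submit(score_gesture, image)
        # result() blocks until each scorer has finished.
        return voice_future.result(), gesture_future.result()
```

Because neither stage depends on the other's output, the threshold comparisons can start as soon as both results are available.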

FIG. 3 is a diagram showing the command determination method corresponding to the recognition scores described above. As shown in FIG. 3, when both the voice recognition score and the gesture recognition score are greater than or equal to the first threshold, the input command is determined using both the voice command candidates and the gesture command candidates. This is because, with both scores high, it can be judged that the operator is both speaking and gesturing to input a command and that there is no recognition impediment.

When one of the voice recognition score and the gesture recognition score is greater than or equal to the first threshold and the other is greater than or equal to the second threshold but less than the first threshold, the input command is determined based on the candidates corresponding to the former score. This is because, with one of the two scores low, it can be judged that although the operator is both speaking and gesturing to input a command, a recognition impediment affects one of the two.

When at least one of the voice recognition score and the gesture recognition score is smaller than the second threshold, the voice command candidates and the gesture command candidates are discarded and no command is input. This is because, with one of the two scores very low, it can be judged either that the operator input a command through only the other modality, or that the command recognized from the other modality is a misrecognition of the operator's everyday conversation or movements. When both the voice recognition score and the gesture recognition score are greater than or equal to the second threshold and less than the first threshold, the same judgment is made and the voice command candidates and gesture command candidates are likewise discarded.
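
The decision table of FIG. 3 can be summarized in a short Python sketch. The names `Method` and `select_method` are hypothetical, scores are assumed to be fractions in the range 0 to 1, and the default thresholds are the 50% and 5% example values given in this embodiment.

```python
from enum import Enum


class Method(Enum):
    BOTH = "both"
    VOICE_ONLY = "voice_only"
    GESTURE_ONLY = "gesture_only"
    DISCARD = "discard"


FIRST_THRESHOLD = 0.50   # example value from the embodiment
SECOND_THRESHOLD = 0.05  # example value from the embodiment


def select_method(
    voice_max: float,
    gesture_max: float,
    t1: float = FIRST_THRESHOLD,
    t2: float = SECOND_THRESHOLD,
) -> Method:
    """Map the two maximum recognition scores to a command determination method."""
    # At least one score below the second threshold: discard everything.
    if voice_max < t2 or gesture_max < t2:
        return Method.DISCARD
    voice_ok = voice_max >= t1
    gesture_ok = gesture_max >= t1
    if voice_ok and gesture_ok:
        return Method.BOTH
    if voice_ok:
        return Method.VOICE_ONLY
    if gesture_ok:
        return Method.GESTURE_ONLY
    # Both scores lie between the second and first thresholds: discard.
    return Method.DISCARD
```

For example, scores of 0.8 (voice) and 0.3 (gesture) select voice only, while 0.3 and 0.3 lead to both sets of candidates being discarded.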

According to the command input device 101 of this embodiment, when a recognition impediment lowers the recognition score of either the voice command candidates or the gesture command candidates, the input command is determined without using the affected candidates. Therefore, even if a command different from the operator's intention is recognized because of the recognition impediment, it can be kept from being determined as the input command to the operation target device 110.
In addition, when the recognition score of either the voice command candidates or the gesture command candidates is markedly low, it is judged that the operator's utterance or gesture is not a valid utterance or gesture for command input, or that the operator is not speaking or gesturing for command input in the first place. No command is then input to the operation target device 110 based on the voice and gesture. Therefore, when the operator does not intend to input a command to the operation target device 110, input of an unintended command to the operation target device 110 can be suppressed.

Note that the processing described in this embodiment, in which all identified command candidates are discarded if at least one of the voice recognition score and the gesture recognition score is less than the second threshold, need not necessarily be performed. Also, although an example was described in which a plurality of command candidates are identified from the input audio signal or image signal, there need not be a plurality of command candidates. When there is only one command candidate, the maximum value in the embodiment above may be read as the recognition score of that single candidate.

(Embodiment 2)
Next, a second embodiment of the present invention will be described. In Embodiment 1, the command determination method was selected based on the result of directly comparing the recognition scores of the voice and gesture command candidates with thresholds. In Embodiment 2, the command determination method is selected based on the result of comparing the difference between the recognition scores of the voice and gesture command candidates with a threshold.

The following description focuses on the differences from Embodiment 1. The command input device according to Embodiment 2 is configured by adding, to the command input device 101 of Embodiment 1, the recognition score difference determination unit 402 shown by broken lines in FIG. 1 and the input/output lines associated with it.

The threshold comparison unit 401 compares the maximum of the recognition scores of the voice command candidates, input from the voice command recognition score determination unit 103, with the second threshold stored in advance. It likewise compares the maximum of the recognition scores of the gesture command candidates, input from the gesture command recognition score determination unit 106, with the second threshold stored in advance. It then outputs the comparison results to the command determination unit 403. The second threshold here is the same as the second threshold described in Embodiment 1.

The recognition score difference determination unit 402 calculates the difference between the maximum of the recognition scores of the command candidates input from the voice command recognition score determination unit 103 and the maximum of the recognition scores of the command candidates input from the gesture command recognition score determination unit 106. It then outputs the calculation result to the command determination unit 403.

The command determination unit 403 selects, from the three methods described in Embodiment 1, the method for determining the command to be input to the operation target device 110, according to the comparison results input from the threshold comparison unit 401 and the difference input from the recognition score difference determination unit 402.

FIG. 4 is a flowchart showing the processing in this embodiment that is started when voice and gesture are input to the voice input unit 102 and the gesture input unit 105.

S501 and S502 are the same as S201 and S202 of Embodiment 1, so their description is omitted.

In S503, the threshold comparison unit 401 compares the maximum of the recognition scores calculated in S502 for the voice command candidates with the second threshold (for example, 5%) stored in advance. If the maximum recognition score is greater than or equal to the second threshold, the process proceeds to S505. If it is smaller than the second threshold, the process proceeds to S504.

In S504, because the recognition score of the voice command candidates is smaller than the second threshold, it is judged either that the operator is not speaking to input a command or that a significant recognition impediment affects voice recognition, and the voice command candidates are discarded.

S505 is the same as S207 of Embodiment 1, so its description is omitted.

In S506, the threshold comparison unit 401 compares the maximum of the recognition scores calculated in S505 for the gesture command candidates with the second threshold (5%) stored in advance. If the maximum recognition score is greater than or equal to the second threshold, the process proceeds to S507. If it is smaller than the second threshold, the process proceeds to S504.

In S504, because the recognition score of the gesture command candidates is smaller than the second threshold, it is judged that the operator is not inputting a command by gesture, and the gesture command candidates are discarded.

In S507, the recognition score difference determination unit 402 calculates the difference between the maximum of the recognition scores calculated in S502 for the voice command candidates and the maximum of the recognition scores calculated in S505 for the gesture command candidates. If the absolute value of the calculated difference is less than or equal to a third threshold (for example, 50%), the process proceeds to S508. If the absolute value of the difference is greater than the third threshold (50%), the process proceeds to S509.
The third threshold is a reference value for judging whether either the extracted voice command candidates or the extracted gesture command candidates lack the certainty needed to determine from them the command to be input to the operation target device 110. An optimum value is found by experiment or the like and stored. The value of 50% given in this embodiment is an example, and the optimum third threshold may differ depending on the implementation.
If the absolute value of the difference between the maximum recognition score of the voice command candidates and the maximum recognition score of the gesture command candidates is less than or equal to the third threshold, it is judged that both sets of command candidates have sufficient certainty for an input command to be determined from them. If the absolute value of the difference is greater than the third threshold, one of the two sets of command candidates has a considerably lower recognition score, and it is judged that that set does not have sufficient certainty for an input command to be determined from it.

In S508, the recognition score difference determination unit 402 notifies the command determination unit 403 that the command to be input to the operation target device 110 is to be determined based on both the voice command candidates and the gesture command candidates, and the process proceeds to S510.

In S509, the recognition score difference determination unit 402 notifies the command determination unit 403 that the command to be input to the operation target device 110 is to be determined based on whichever of the voice command candidates and the gesture command candidates has the larger maximum recognition score, and the process proceeds to S510.

In S510, the command determination unit 403 determines the command to be input to the operation target device 110 according to the command determination method notified by the recognition score difference determination unit 402.
When both the voice command candidates and the gesture command candidates are used (S508 was executed), the recognition scores of the command candidates common to both are accumulated, and the candidate with the largest resulting value is determined as the input command.
When only one of the voice command candidates or the gesture command candidates is used (S509 was executed), the candidate with the highest recognition score among those candidates is determined as the input command.

In the flowchart above, the extraction of voice command candidates, calculation of recognition scores, and comparison with the second threshold (S502 to S503) are followed by the same processing for gesture command candidates (S505 to S506). However, the voice input processing and the gesture input processing may be performed in either order, so the gesture input processing may come first. The two may also be performed in parallel.

FIG. 5 is a diagram showing the command determination method corresponding to the recognition scores described above. As shown in FIG. 5, when at least one of the maximum recognition score of the voice command candidates and the maximum recognition score of the gesture command candidates is smaller than the second threshold, the voice command candidates and the gesture command candidates are discarded and no command is input. This is the same as in Embodiment 1.

Otherwise, when the absolute value of the difference between the maximum recognition score of the voice command candidates and the maximum recognition score of the gesture command candidates is less than or equal to the third threshold (50%), the input command is determined using both the voice command candidates and the gesture command candidates. This is because, with little imbalance between the recognition scores of the voice command candidates and the gesture command candidates, it can be judged that the operator is inputting a command by both voice and gesture and that there is no recognition impediment.

Conversely, when the absolute value of the difference between the maximum recognition score of the voice command candidates and the maximum recognition score of the gesture command candidates is greater than the third threshold, the input command is determined using the command candidates with the larger maximum recognition score. This is because, with the recognition score of one of the two sets of candidates low, it can be judged that although the operator is inputting a command by both voice and gesture, a recognition impediment affects the recognition of one of the two.
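
The rules of FIG. 5 can be sketched in Python as follows. The function name `select_method_by_difference` and the string labels are hypothetical, scores are assumed to be fractions in the range 0 to 1, and the default thresholds are the 5% and 50% example values given in this embodiment.

```python
def select_method_by_difference(
    voice_max: float,
    gesture_max: float,
    t2: float = 0.05,  # second threshold (example value from the embodiment)
    t3: float = 0.50,  # third threshold (example value from the embodiment)
) -> str:
    """Select the command determination method from the score difference."""
    # Either score below the second threshold: discard all candidates.
    if voice_max < t2 or gesture_max < t2:
        return "discard"
    # Small imbalance between the modalities: use both.
    if abs(voice_max - gesture_max) <= t3:
        return "both"
    # Large imbalance: use only the modality with the higher score.
    return "voice_only" if voice_max > gesture_max else "gesture_only"
```

Under these rules a pair such as 0.3 (voice) and 0.3 (gesture) selects both modalities, because only the difference is compared with the third threshold, whereas the rules of FIG. 3 would discard both sets of candidates for the same scores.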

According to the command input device 101 of this embodiment, whether a recognition impediment has lowered the recognition score of either the voice command candidates or the gesture command candidates can be judged from the difference between the maximum recognition scores of the two sets of candidates. The candidates whose recognition score was lowered by the recognition impediment are then not used in determining the command to be input to the operation target device 110. Therefore, as in Embodiment 1, even if a command different from the operator's intention is recognized because of the recognition impediment, it can be kept from being determined as the input command to the operation target device 110.
In addition, when the recognition score of at least one of the voice command candidates and the gesture command candidates is markedly low, it is judged that the operator's utterance or gesture is not a valid utterance or gesture for command input, or that the operator is not speaking or gesturing for command input in the first place. No command is then input to the operation target device 110 based on the voice and gesture. Therefore, when the operator does not intend to input a command to the operation target device 110, input of an unintended command to the operation target device 110 can be suppressed.

102: voice input unit, 105: gesture input unit, 103: voice command recognition score determination unit, 106: gesture command recognition score determination unit, 108: threshold comparison unit, 109: command determination unit

Claims

An input device that determines a command to be input to an operation target device based on both voice and gesture by an operator, comprising:
a voice input unit to which the operator's voice is input;
an image input unit to which an image capturing a gesture by the operator is input;
a first calculation unit that performs voice recognition processing on the input voice, identifies, from among voices predetermined as voices for inputting commands to the operation target device, a voice candidate that matches the input voice, and calculates a voice recognition score that is an index of the probability that the input voice matches the identified candidate;
a second calculation unit that performs image recognition processing on the input image, identifies, from among gestures predetermined as gestures for inputting commands to the operation target device, a gesture candidate that matches the gesture captured in the input image, and calculates a gesture recognition score that is an index of the probability that the gesture captured in the input image matches the identified candidate; and
a determination unit that determines the command to be input to the operation target device based on the identified candidates,
wherein the determination unit, when either one of the calculated voice recognition score and gesture recognition score is smaller than a predetermined first threshold, determines the command to be input to the operation target device based only on the candidate corresponding to the other recognition score, and, when at least one of the calculated voice recognition score and gesture recognition score is smaller than a predetermined second threshold that is smaller than the first threshold, discards both identified candidates and does not determine a command to be input to the operation target device based on the identified candidates.

The input device according to claim 1, wherein the determination unit, when the calculated voice recognition score and gesture recognition score are both smaller than the first threshold, discards the identified candidates and does not determine a command to be input to the operation target device based on the identified candidates.

An input device that determines a command to be input to an operation target device based on both voice and gesture by an operator, comprising:
a voice input unit to which the operator's voice is input;
an image input unit to which an image capturing a gesture by the operator is input;
a first calculation unit that performs voice recognition processing on the input voice, identifies, from among voices predetermined as voices for inputting commands to the operation target device, a voice candidate that matches the input voice, and calculates a voice recognition score that is an index of the probability that the input voice matches the identified candidate;
a second calculation unit that performs image recognition processing on the input image, identifies, from among gestures predetermined as gestures for inputting commands to the operation target device, a gesture candidate that matches the gesture captured in the input image, and calculates a gesture recognition score that is an index of the probability that the gesture captured in the input image matches the identified candidate; and
a determination unit that determines the command to be input to the operation target device based on the identified candidates,
wherein the determination unit, when the magnitude of the difference between the calculated voice recognition score and gesture recognition score is larger than a predetermined third threshold, determines the command to be input to the operation target device based only on the candidate corresponding to the larger of the voice recognition score and the gesture recognition score.

The input device according to claim 3, wherein the determination unit, when at least one of the calculated voice recognition score and gesture recognition score is smaller than a predetermined second threshold, discards both identified candidates and does not determine a command to be input to the operation target device based on the identified candidates.

The input device according to any one of claims 1 to 4, wherein the first calculation unit calculates the voice recognition score by calculating a correlation between a voice predetermined as a voice for inputting a command to the operation target device and the input voice.

The input device according to claim 1, wherein the second calculation unit calculates the gesture recognition score by calculating a correlation between a gesture predetermined as a gesture for inputting a command to the operation target device and the gesture captured in the input image.

The input device according to claim 1, wherein the first calculation unit identifies, from among voices predetermined as voices for inputting commands to the operation target device, a plurality of voice candidates that match the input voice, and calculates a voice recognition score for each of the plurality of candidates, and
the determination unit compares the maximum of the voice recognition scores of the plurality of candidates with the first threshold and the second threshold.

The input device according to claim 1, wherein the second calculation unit identifies, from among gestures predetermined as gestures for inputting commands to the operation target device, a plurality of gesture candidates that match the gesture captured in the input image, and calculates a gesture recognition score for each of the plurality of candidates, and
the determination unit compares the maximum of the gesture recognition scores of the plurality of candidates with the first threshold and the second threshold.

An input method for determining a command to be input to an operation target device based on both voice and gesture by an operator, comprising:
a voice input step in which the operator's voice is input;
an image input step in which an image capturing a gesture by the operator is input;
a first calculation step of performing voice recognition processing on the input voice, identifying, from among voices predetermined as voices for inputting commands to the operation target device, a voice candidate that matches the input voice, and calculating a voice recognition score that is an index of the probability that the input voice matches the identified candidate;
a second calculation step of performing image recognition processing on the input image, identifying, from among gestures predetermined as gestures for inputting commands to the operation target device, a gesture candidate that matches the gesture captured in the input image, and calculating a gesture recognition score that is an index of the probability that the gesture captured in the input image matches the identified candidate; and
a determination step of determining the command to be input to the operation target device based on the identified candidates,
wherein, in the determination step, when either one of the calculated voice recognition score and gesture recognition score is smaller than a predetermined first threshold, the command to be input to the operation target device is determined based only on the candidate corresponding to the other recognition score, and, when at least one of the calculated voice recognition score and gesture recognition score is smaller than a predetermined second threshold that is smaller than the first threshold, both identified candidates are discarded and no command to be input to the operation target device is determined based on the identified candidates.

The input method according to claim 9, wherein in the determination step, when both the calculated speech recognition score and gesture recognition score are smaller than the first threshold value, the identified candidates are discarded and no command to be input to the operation target device is determined based on the identified candidates.
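
The determination logic recited in the two method claims above can be sketched in Python as follows. This is only an illustrative reading of the claim language, not the patentee's implementation: the names `Basis` and `choose_basis` and the example threshold values 0.6 and 0.3 are assumptions introduced here, and the order in which the conditions are checked (second threshold first) is likewise an assumption.

```python
from enum import Enum


class Basis(Enum):
    """Which recognition result(s) the command decision relies on (hypothetical names)."""
    BOTH = "both"
    VOICE_ONLY = "voice_only"
    GESTURE_ONLY = "gesture_only"
    DISCARD = "discard"


def choose_basis(voice_score: float, gesture_score: float,
                 first_threshold: float, second_threshold: float) -> Basis:
    """Apply the two-threshold rule to a pair of recognition scores."""
    if not second_threshold < first_threshold:
        raise ValueError("second threshold must be smaller than the first threshold")

    # At least one score below the second (lower) threshold: discard both candidates.
    if voice_score < second_threshold or gesture_score < second_threshold:
        return Basis.DISCARD

    voice_low = voice_score < first_threshold
    gesture_low = gesture_score < first_threshold

    # Both scores below the first threshold: discard (the dependent-claim case).
    if voice_low and gesture_low:
        return Basis.DISCARD
    # Exactly one score below the first threshold: rely only on the other candidate.
    if voice_low:
        return Basis.GESTURE_ONLY
    if gesture_low:
        return Basis.VOICE_ONLY
    # Neither score is low: use both candidates.
    return Basis.BOTH
```

For example, with the assumed thresholds 0.6 and 0.3, `choose_basis(0.9, 0.5, 0.6, 0.3)` selects the voice candidate only, while `choose_basis(0.9, 0.2, 0.6, 0.3)` discards both candidates because the gesture score falls below the second threshold.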

An input method for determining a command to be input to an operation target device based on both voice and gesture by an operator,
A voice input process in which the voice of the operator is input;
An image input process in which an image of a gesture made by an operator is input;
A first calculation step of performing speech recognition processing on the input voice, identifying, from among voices predetermined as voices for inputting commands to the operation target device, a voice candidate that matches the input voice, and calculating a speech recognition score that is an index of the probability that the input voice matches the identified candidate;
A second calculation step of performing image recognition processing on the input image, identifying, from among gestures predetermined as gestures for inputting commands to the operation target device, a gesture candidate that matches the gesture captured in the input image, and calculating a gesture recognition score that is an index of the probability that the gesture captured in the input image matches the identified candidate; and
A determination step of determining the command to be input to the operation target device based on the identified candidates,
Wherein, in the determination step, when the magnitude of the difference between the calculated speech recognition score and gesture recognition score is larger than a predetermined third threshold value, the command to be input to the operation target device is determined based only on the candidate corresponding to the larger of the speech recognition score and the gesture recognition score.

The input method according to claim 11, wherein in the determination step, when at least one of the calculated speech recognition score and gesture recognition score is smaller than a predetermined second threshold value, both identified candidates are discarded and no command to be input to the operation target device is determined based on the identified candidates.
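
The score-difference rule of the two claims above might be sketched as below. The function name, the string labels it returns, and the choice to test the second threshold before the difference are assumptions made for illustration; the claims themselves do not fix an evaluation order.

```python
from typing import Optional


def choose_basis_by_difference(voice_score: float, gesture_score: float,
                               third_threshold: float,
                               second_threshold: Optional[float] = None) -> str:
    """Return 'voice_only', 'gesture_only', 'both', or 'discard' (hypothetical labels)."""
    # Optional dependent-claim check: any score below the second threshold
    # means both candidates are discarded.
    if second_threshold is not None and min(voice_score, gesture_score) < second_threshold:
        return "discard"

    # Large gap between the two scores: rely only on the higher-scoring candidate.
    if abs(voice_score - gesture_score) > third_threshold:
        return "voice_only" if voice_score > gesture_score else "gesture_only"

    # Scores are close to each other: use both candidates.
    return "both"
```

Comparing the scores against each other, rather than against a fixed level, lets the method favor whichever modality was recognized more reliably in the current environment.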

The input method according to any one of claims 9 to 12, wherein in the first calculation step, the speech recognition score is calculated by calculating a correlation value between a voice predetermined as a voice for inputting a command to the operation target device and the input voice.

The input method according to any one of claims 9 to 13, wherein in the second calculation step, the gesture recognition score is calculated by calculating a correlation value between a gesture predetermined as a gesture for inputting a command to the operation target device and the gesture captured in the input image.
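
The claims do not say how the correlation value is computed. As one possible illustration, the sketch below uses the Pearson correlation coefficient between a stored template and the observed input, both assumed to be already reduced to equal-length numeric feature vectors. The function name and the feature-vector representation are assumptions, not part of the claims.

```python
import math
from typing import Sequence


def correlation_score(template: Sequence[float], observed: Sequence[float]) -> float:
    """Pearson correlation between a predetermined template and the observed input."""
    if len(template) != len(observed) or len(template) == 0:
        raise ValueError("feature vectors must be non-empty and of equal length")

    mean_t = sum(template) / len(template)
    mean_o = sum(observed) / len(observed)
    dev_t = [t - mean_t for t in template]
    dev_o = [o - mean_o for o in observed]

    denom = math.sqrt(sum(x * x for x in dev_t) * sum(y * y for y in dev_o))
    if denom == 0.0:
        # A constant vector has no defined correlation; treat it as no match.
        return 0.0
    return sum(x * y for x, y in zip(dev_t, dev_o)) / denom
```

The same function could serve for both the speech and the gesture score, provided each modality is first mapped to a comparable feature vector.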

In the first calculation step, a plurality of voice candidates that match the input voice are identified from among voices predetermined as voices for inputting commands to the operation target device, and a speech recognition score is calculated for each of the plurality of candidates,
The input method according to claim 9 or 10, wherein in the determination step, the maximum value of the speech recognition scores of the plurality of candidates is compared with the first threshold value and the second threshold value.

In the second calculation step, a plurality of gesture candidates that match the gesture captured in the input image are identified from among gestures predetermined as gestures for inputting commands to the operation target device, and a gesture recognition score is calculated for each of the plurality of candidates,
The input method according to claim 9 or 10, wherein in the determination step, the maximum value of the gesture recognition scores of the plurality of candidates is compared with the first threshold value and the second threshold value.
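
When several candidates are identified for one modality, the two claims above compare only the highest score with the thresholds. A minimal sketch of that selection follows. The function names, the candidate labels "play" and "stop", and the level labels are hypothetical and chosen only for illustration.

```python
from typing import Dict, Tuple


def best_candidate(scores: Dict[str, float]) -> Tuple[str, float]:
    """Return the candidate with the maximum recognition score and that score."""
    if not scores:
        raise ValueError("at least one candidate is required")
    name = max(scores, key=lambda candidate: scores[candidate])
    return name, scores[name]


def score_level(max_score: float, first_threshold: float, second_threshold: float) -> str:
    """Classify the maximum score against the two thresholds (second < first)."""
    if max_score < second_threshold:
        return "below_second"
    if max_score < first_threshold:
        return "below_first"
    return "at_or_above_first"
```

For instance, `best_candidate({"play": 0.8, "stop": 0.3})` yields the "play" candidate with its score, and that score is then the value compared with the first and second threshold values.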