JP5323770B2 - User instruction acquisition device, user instruction acquisition program, and television receiver

Info

Publication number: JP5323770B2
Application number: JP2010149860A
Authority: JP
Inventors: 真人藤井
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2010-06-30
Filing date: 2010-06-30
Publication date: 2013-10-23
Anticipated expiration: 2030-06-30
Also published as: JP2012014394A

Description

The present invention relates to a user instruction acquisition device, a user instruction acquisition program, and a television receiver that acquire, from a user of a device such as a television, audio equipment, a personal computer, or a home appliance, instructions for controlling that device.

The most basic way for a device such as a television to receive instructions from a user is through a remote control. To avoid the inconvenience of operating a remote control, Patent Documents 1 and 2 propose devices that receive instructions from the user through speech recognition or gestures (motion recognition) instead of a remote control.

Patent Document 1: JP 2004-192653 A
Patent Document 2: Japanese Patent No. 3886074

However, the remote-control approach lacks flexibility because the instructions that can be given to the device are fixed, and the remote control itself is complicated and hard to use. The devices proposed in Patent Documents 1 and 2 perform speech recognition at all times, so they cannot tell when the user actually gave an instruction or which of several users gave it, and they also react to noise. In addition, those devices require the user to give instructions in an unnatural, non-everyday setting, for example while looking at an anthropomorphic agent image shown on the display screen, which is cumbersome.

The present invention has been made in view of these points. Its object is to provide a user instruction acquisition device, a user instruction acquisition program, and a television receiver that allow a device to be controlled accurately through the user's natural speech or movements, and that acquire an instruction only when the user is actually giving one.

To solve the above problem, the user instruction acquisition device according to claim 1 identifies, among a plurality of users of a device, the user who is giving an instruction for controlling the device, and acquires the instruction from that user. The device comprises: face analysis means that, from video captured by a camera, recognizes each of the plurality of pre-registered users, detects changes in each user's face, and generates from those changes an utterance period indicating the period during which each user is speaking; hand motion analysis means that recognizes the hand motions of the plurality of users from the video of the users; voice analysis means that, based on the utterance period generated by the face analysis means, detects speech in the sound around the device and recognizes the content of the speech and the speaker using acoustic features registered in advance for each user; and command generation means that, when the speaker recognized by the voice analysis means is among the users recognized by the face analysis means, identifies that speaker as the user giving the instruction and generates a command predetermined for the change in the user's face detected by the face analysis means, the user's hand motion recognized by the hand motion analysis means, and the content of the user's speech recognized by the voice analysis means.

With this configuration, the user instruction acquisition device uses the face analysis means to generate, from changes in a user's face, the period during which the user is speaking, and performs speech recognition only while the user is speaking, which improves recognition accuracy. Furthermore, comparing the user recognized by face recognition with the speaker recognized from the speech identifies the user who gave the voice instruction, so commands can be generated accurately even when several users are using the device.
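The comparison between the face-recognition result and the speaker-recognition result can be sketched as a simple membership check. This is only an illustration: the function name `identify_instructing_user` and the use of registered names as plain strings are assumptions, not details taken from the patent.

```python
from typing import Optional, Set


def identify_instructing_user(recognized_faces: Set[str],
                              recognized_speaker: Optional[str]) -> Optional[str]:
    """Return the speaker only if that person is also visible to the camera.

    recognized_faces: registered names of users found by face recognition.
    recognized_speaker: registered name found by speaker recognition, or None.
    (Hypothetical interface, for illustration only.)
    """
    if recognized_speaker is not None and recognized_speaker in recognized_faces:
        return recognized_speaker
    # The voice did not come from a user in front of the device
    # (for example, sound from the television itself), so ignore it.
    return None
```

In this sketch a voice that matches nobody in the camera image yields no instructing user, so no command would be generated for it.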

In the user instruction acquisition device according to claim 2, the face analysis means comprises: face area detection means that detects the face areas of the plurality of users in the video; face recognition means that recognizes the user corresponding to each face area using facial features registered in advance for each user; face change detection means that detects changes in the users' faces from their face areas; and utterance state estimation means that determines from those changes whether the users are speaking and, when it determines that a user is speaking, generates the utterance period.

With this configuration, the user instruction acquisition device uses the utterance state estimation means to determine whether the user is speaking, and generates the utterance period and outputs it to the voice analysis means only when the user is determined to be speaking, which further improves speech recognition accuracy.

The user instruction acquisition program according to claim 3 identifies, among a plurality of users of a device, the user who is giving an instruction for controlling the device, and acquires the instruction from that user. To this end, it causes a computer to function as: face analysis means that, from video captured by a camera, recognizes each of the plurality of pre-registered users, detects changes in each user's face, and generates from those changes an utterance period indicating the period during which each user is speaking; hand motion analysis means that recognizes the hand motions of the plurality of users from the video of the users; voice analysis means that, based on the utterance period generated by the face analysis means, detects speech in the sound around the device and recognizes the content of the speech and the speaker using acoustic features registered in advance for each user; and command generation means that, when the speaker recognized by the voice analysis means is among the users recognized by the face analysis means, identifies that speaker as the user giving the instruction and generates a command predetermined for the change in the user's face detected by the face analysis means, the user's hand motion recognized by the hand motion analysis means, and the content of the user's speech recognized by the voice analysis means.

With this configuration, the user instruction acquisition program uses the face analysis means to generate, from changes in a user's face, the period during which the user is speaking, and performs speech recognition only while the user is speaking, which improves recognition accuracy. Furthermore, comparing the user recognized by face recognition with the speaker recognized from the speech identifies the user who gave the voice instruction, so commands can be generated accurately even when several users are using the device.

The television receiver according to claim 4 provides broadcast programs to a user and comprises the user instruction acquisition device according to claim 1 or claim 2, which acquires instructions given by the user through speech and movement by analyzing video from a camera and sound from a microphone installed on the television receiver.

With this configuration, the television receiver uses the face analysis means to determine from changes in the user's face whether the user is speaking, and performs speech recognition only while the user is speaking, which improves recognition accuracy. Furthermore, comparing the user recognized by face recognition with the speaker recognized from the speech identifies the user who gave the voice instruction, so commands can be generated accurately even when several users are using the device.

According to the inventions of claims 1, 2, 3, and 4, the user's utterance state is determined automatically from changes in the user's face, and a command can be generated simply by the user giving the device an instruction through speech and movement. The user can therefore convey an instruction to the device and control it as an extension of natural behavior, without any complicated operation.

FIG. 1 is a block diagram showing the overall configuration of the user instruction acquisition device according to the present invention.
FIG. 2(a) is a block diagram showing the specific configuration of the utterance state estimation means in the user instruction acquisition device according to the present invention, and FIG. 2(b) shows an example of the utterance conditions held in advance by the utterance condition storage unit.
FIG. 3(a) is a block diagram showing the specific configuration of the command generation means in the user instruction acquisition device according to the present invention, and FIG. 3(b) shows an example of the command conditions held in advance by the command condition storage unit.
FIG. 4 shows an example of a user's instruction in the user instruction acquisition device according to the present invention.
FIG. 5 is a flowchart showing the operation of the user instruction acquisition device according to the present invention.
FIG. 6 is a schematic diagram showing an example of a television receiver equipped with the user instruction acquisition device according to the present invention.

A user instruction acquisition device, a user instruction acquisition program, and a television receiver according to an embodiment of the present invention are described below with reference to the drawings. In the following description, identical components are given the same names and reference numerals, and their detailed description is omitted.

[User Instruction Acquisition Device]
The user instruction acquisition device 1 identifies, among a plurality of users of a device such as a television, the user who is giving an instruction for controlling the device, and acquires the instruction from that user.

As shown in FIG. 6, for example, the user instruction acquisition device 1 is connected to a television receiver T (hereinafter, the television) that provides broadcast programs to users, and acquires the user's instruction by analyzing the video and audio of the users input from a camera Cr and a microphone M installed on top of the television T. As shown in FIG. 1, the user instruction acquisition device 1 then generates the corresponding command and outputs it to, for example, the control unit of the device. Instead of being provided outside the television T as shown in FIG. 6, the user instruction acquisition device 1 may be built into the television T.

As shown in FIG. 1, the user instruction acquisition device 1 here comprises voice analysis means 10, face analysis means 20, and hand motion analysis means 30. As described above, it also comprises the camera Cr for capturing video of the users of the device and the microphone M for collecting sound around the device. The camera Cr and the microphone M are installed on top of the device, for example as shown in FIG. 6, so that they can capture video of the users and the sound around the device. Each component of the user instruction acquisition device 1 is described in detail below.

The voice analysis means 10 detects speech in the sound around the device collected by the microphone M, and recognizes the content of the speech and the speaker using acoustic features registered in advance for each user. As shown in FIG. 1, the voice analysis means 10 here comprises voice detection means 11, voice recognition means 12, and speaker recognition means 13.

The voice detection means 11 detects speech in the sound around the device. As shown in FIG. 1, when sound around the device is input from the microphone M, the voice detection means 11 extracts speech from it using, for example, pre-registered frequency characteristics of speech. When an utterance period, indicating the period during which a user is speaking, is input from the utterance state estimation means 24 described later, the voice detection means 11 outputs the speech detected during that period to the voice recognition means 12 and the speaker recognition means 13. In other words, the voice detection means 11 outputs detected speech to the voice recognition means 12 and the speaker recognition means 13 only while a user is speaking. The voice detection means 11 has a storage unit (not shown) that holds the speech frequency characteristic data in advance.
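The gating of detected speech by the utterance period can be sketched as follows. The representation is an assumption made for illustration: audio arrives as timestamped frames, and utterance periods are (start, end) pairs in seconds. The patent does not specify either format, and `gate_by_utterance_periods` is a hypothetical name.

```python
from typing import Any, List, Tuple

Frame = Tuple[float, Any]      # (timestamp in seconds, audio samples); assumed format
Period = Tuple[float, float]   # (start, end) of an utterance period; assumed format


def gate_by_utterance_periods(frames: List[Frame],
                              periods: List[Period]) -> List[Frame]:
    """Keep only the frames whose timestamp falls inside some utterance period."""
    return [
        (t, samples)
        for (t, samples) in frames
        if any(start <= t <= end for (start, end) in periods)
    ]
```

Frames outside every utterance period are dropped before recognition, which is how the visual cue keeps background noise away from the recognizers in this sketch.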

The voice recognition means 12 recognizes the content of speech. Specifically, it extracts frequency characteristics, such as the low-order DCT components of the spectrum, from the time waveform of the speech by acoustic analysis and uses them as acoustic features. It matches these features against acoustic models corresponding to the pronunciation of every registered word and also applies a language model (the frequency distribution of consecutive word occurrences) to obtain, as the recognition result, the word sequence that is most likely both acoustically and linguistically. The voice recognition means 12 has a storage unit (not shown) that holds the acoustic models and the language model in advance.
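As a rough illustration of choosing the word sequence that is most likely both acoustically and linguistically, the sketch below rescores a small list of candidate word sequences by adding a bigram language-model score to each acoustic score. Everything here is an assumption for illustration: a real recognizer searches over a lattice rather than a short candidate list, and the names `best_word_sequence`, `lm_weight`, and `unseen_logprob` are hypothetical.

```python
from typing import Dict, List, Tuple

Candidate = Tuple[Tuple[str, ...], float]  # (word sequence, acoustic log-likelihood)


def best_word_sequence(candidates: List[Candidate],
                       bigram_logprob: Dict[Tuple[str, str], float],
                       lm_weight: float = 1.0,
                       unseen_logprob: float = -10.0) -> Tuple[str, ...]:
    """Pick the candidate with the highest combined acoustic + language score."""

    def lm_score(words: Tuple[str, ...]) -> float:
        # Sum bigram log-probabilities, starting from a sentence-start token.
        prev, total = "<s>", 0.0
        for w in words:
            total += bigram_logprob.get((prev, w), unseen_logprob)
            prev = w
        return total

    best = max(candidates, key=lambda c: c[1] + lm_weight * lm_score(c[0]))
    return best[0]
```

A candidate that fits the audio slightly better can still lose if the language model considers its word order unlikely, which is the role the language model plays in the description above.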

As shown in FIG. 1, speech is input to the voice recognition means 12 from the voice detection means 11. The voice recognition means 12 extracts a word sequence from the speech by the method described above and outputs it as speech information to the information acquisition unit 41 of the command generation means 40 (see FIG. 3(a)).

The speaker recognition means 13 recognizes the speaker of the speech, that is, which user is uttering it. Specifically, it extracts from the speech the same acoustic features as the voice recognition means 12 and compares them with speaker models registered in advance for specific speakers to determine who the speaker is.

The speaker recognition means 13 can also use the Bayesian information criterion to determine the speaker. It can also classify the acoustic features into phoneme classes and match them using a mixture model of phoneme classes. The speaker recognition means 13 has a storage unit (not shown) that holds the speaker models in advance. A speaker model can be created, for example, by having the user utter specific words beforehand and registering the user's voice in the storage unit together with a registered name such as a full name or nickname.
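Comparing acoustic features against registered speaker models can be sketched as a maximum-likelihood choice. As a simplification, each speaker model below is a single diagonal-covariance Gaussian rather than the phoneme-class mixture model mentioned above; the model format and the name `identify_speaker` are assumptions for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Assumed model format: (mean vector, variance vector) per registered name.
SpeakerModel = Tuple[Sequence[float], Sequence[float]]


def gaussian_loglik(x: Sequence[float], mean: Sequence[float],
                    var: Sequence[float]) -> float:
    """Log-likelihood of one feature vector under a diagonal Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def identify_speaker(features: List[Sequence[float]],
                     speaker_models: Dict[str, SpeakerModel]) -> str:
    """Return the registered name whose model best explains the features.

    `features` must be non-empty; each entry is one acoustic feature vector.
    """
    def avg_loglik(name: str) -> float:
        mean, var = speaker_models[name]
        return sum(gaussian_loglik(f, mean, var) for f in features) / len(features)

    return max(speaker_models, key=avg_loglik)
```

This sketch always returns some registered name. A practical system would likely also reject utterances whose best score is too low, but the patent text does not describe such a threshold.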

As shown in FIG. 1, speech is input to the speaker recognition means 13 from the voice detection means 11. The speaker recognition means 13 determines the speaker from the speech by the method described above and outputs the result as speaker information to the information acquisition unit 41 of the command generation means 40 (see FIG. 3(a)).

The face analysis means 20 recognizes each of the plurality of pre-registered users in the video captured by the camera Cr by face image recognition processing, and detects changes in each user's face. It also determines from those changes whether the users are speaking, and generates an utterance period indicating the period during which each user is speaking. As shown in FIG. 1, the face analysis means 20 here comprises face area detection means 21, face change detection means 22, face recognition means 23, and utterance state estimation means 24.

The face area detection means 21 detects the areas of human faces in the video of the plurality of users. Specifically, it extracts universal features of the users from the images that make up the video and detects face areas by detecting those features. Using Haar functions to extract these universal features from the images enables high-speed processing.
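The patent does not say how the Haar functions are applied. One common reading is Haar-like rectangle features evaluated over an integral image, which is what makes such features fast to compute. The sketch below shows that technique under this assumption; the function names are hypothetical, and a complete face detector would additionally need a trained classifier over many such features.

```python
from typing import List


def integral_image(img: List[List[int]]) -> List[List[int]]:
    """Summed-area table with one extra row and column of zeros."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii: List[List[int]], x: int, y: int, w: int, h: int) -> int:
    """Sum of pixels in the rectangle at (x, y) with size w x h, in O(1)."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def haar_two_rect_vertical(ii: List[List[int]], x: int, y: int,
                           w: int, h: int) -> int:
    """Two-rectangle Haar-like feature: top half minus bottom half (h even)."""
    half = h // 2
    return rect_sum(ii, x, y, w, half) - rect_sum(ii, x, y + half, w, half)
```

Because every rectangle sum costs four table lookups regardless of its size, many features can be evaluated per image window, which is consistent with the high-speed processing mentioned above.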

As shown in FIG. 1, video of the plurality of users of the device is input to the face area detection means 21 from the camera Cr. The face area detection means 21 detects the users' face areas in the video by the method described above and outputs them as face area information to each detection unit of the face change detection means 22 and to the face recognition means 23.

The face change detection means 22 detects changes in each user's face from the face areas detected in the video of the plurality of users. For example, when the face area detection means 21 detects three face areas, the face change detection means 22 detects the changes in each of the three. As shown in FIG. 1, the face change detection means 22 here comprises a face orientation detection unit 221, a line-of-sight detection unit 222, an eye opening/closing detection unit 223, and a lip movement detection unit 224.

The face orientation detection unit 221 detects the orientation of each user's face relative to the device. For example, when the device is the television T (see FIG. 6), it detects from the face area information by how many degrees the user's face is turned horizontally and vertically relative to the center of the television screen. Specifically, templates for various face orientations are recorded in advance, based for example on the arrangement of features around the user's eyes, nose, and mouth extracted using the Haar functions described above or the Gabor wavelets described later, and the user's face orientation is estimated by matching against those templates.

The line-of-sight detection unit 222 detects the direction of each user's line of sight relative to the device. From the face area information, it detects by how many degrees the user's gaze is turned horizontally and vertically relative to the head. Based on the face area detected by the face area detection means 21, it estimates the position of the user's eyes from the arrangement of the parts of the face, and estimates the gaze direction by matching against pre-registered image patterns for each gaze direction. For example, when the device is the television T (see FIG. 6), combining this with the detection result of the face orientation detection unit 221 also makes it possible to estimate which part of the television screen the user is looking at.

The eye opening/closing detection unit 223 detects whether the user's eyes are open or closed. Like the line-of-sight detection unit 222, it estimates the position of the user's eyes from the arrangement of the parts of the face, based on the face area detected by the face area detection means 21. It determines that the eyes are open when there is a black region at that position, and that they are closed when the black region is no longer present.
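The black-region test can be sketched by counting dark pixels inside the estimated eye box. The grayscale image format, the two threshold values, and the name `eye_is_open` are all assumptions chosen for illustration; the patent gives no numeric criteria.

```python
from typing import List, Tuple


def eye_is_open(gray: List[List[int]],
                eye_box: Tuple[int, int, int, int],
                dark_threshold: int = 60,
                min_dark_ratio: float = 0.1) -> bool:
    """Report the eye as open if enough pixels in the eye box are dark.

    gray: grayscale image as rows of 0-255 intensities (assumed format).
    eye_box: (x, y, width, height) of the estimated eye position.
    """
    x, y, w, h = eye_box
    pixels = [gray[r][c] for r in range(y, y + h) for c in range(x, x + w)]
    dark = sum(1 for p in pixels if p < dark_threshold)
    return dark / len(pixels) >= min_dark_ratio
```

Fixed thresholds like these would be sensitive to lighting, so a practical system would probably calibrate them per user or per scene.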

The lip movement detection unit 224 detects the movement of the user's lips. Based on the face area detected by the face area detection means 21, it estimates the position of the user's mouth from the arrangement of the parts of the face and extracts lip motion vectors with a motion detection algorithm such as block matching or the Lucas-Kanade method. When the power of the motion vectors exceeds a threshold and the variation in that power is periodic, it determines that the user's lips are moving and the user is speaking.
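The two-part decision rule for lip movement (sufficient motion power and periodic variation of that power) can be sketched as follows. The sketch assumes that one average motion vector per frame has already been extracted for the mouth region; it interprets the power test as a mean over the window and measures periodicity with a normalized autocorrelation. These choices, the thresholds, and the function names are illustrative assumptions, not the patent's stated method.

```python
from typing import List, Tuple


def motion_power(vectors: List[Tuple[float, float]]) -> List[float]:
    """Squared magnitude of each per-frame motion vector."""
    return [dx * dx + dy * dy for dx, dy in vectors]


def normalized_autocorr(series: List[float], lag: int) -> float:
    """Autocorrelation of the mean-removed series at `lag`, scaled by its energy."""
    n = len(series)
    mean = sum(series) / n
    centered = [s - mean for s in series]
    denom = sum(c * c for c in centered)
    if denom == 0:
        return 0.0  # constant signal: no periodic variation
    return sum(centered[i] * centered[i + lag] for i in range(n - lag)) / denom


def lips_are_speaking(vectors: List[Tuple[float, float]],
                      power_threshold: float = 1.0,
                      periodicity_threshold: float = 0.5,
                      min_lag: int = 2,
                      max_lag: int = 10) -> bool:
    """True if mean motion power is high and the power varies periodically."""
    power = motion_power(vectors)
    if not power or sum(power) / len(power) <= power_threshold:
        return False
    max_lag = min(max_lag, len(power) - 1)
    return any(
        normalized_autocorr(power, lag) >= periodicity_threshold
        for lag in range(min_lag, max_lag + 1)
    )
```

The periodicity test is what separates speech-like opening and closing of the mouth from a steady drift of the whole face, which has high motion power but no repeating pattern.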

The face change detection means 22 outputs the face changes detected for each face area by the face orientation detection unit 221, the line-of-sight detection unit 222, the eye opening/closing detection unit 223, and the lip movement detection unit 224 as face change information to the utterance state determination unit 241 of the utterance state estimation means 24 (see FIG. 2(a)) and to the information acquisition unit 41 of the command generation means 40 (see FIG. 3(a)).

The face recognition means 23 recognizes the user in each face area detected in the video of the plurality of users. It applies face image recognition technology to the face areas detected by the face area detection means 21 and determines who is using the device. For example, when the device is the television T (see FIG. 6) and three users are watching it, the face recognition means 23 uses face image recognition to determine, for each of the three face areas, the registered name (such as a full name or nickname) of the corresponding user.

Specifically, the face recognition means 23 identifies the user from the face in the face area by a template matching method whose features are the results of frequency analysis of local luminance components using Gabor wavelets. It extracts, as facial features, the feature values at positions defined around the eyes, nose, and mouth in the face area detected by the face area detection means 21, together with their arrangement information, and identifies the user by matching them against the pre-registered image features of the users.

To keep recognition performance from degrading under changes such as facial expression, the face recognition means 23 can also use a method that tolerates deformation in the positional relationship of the features. The face recognition means 23 has a storage unit (not shown) that holds the facial features in advance. These facial features can be created, for example, by having the user photograph their face from specific angles beforehand and registering the face image in the user instruction acquisition device 1 together with a registered name such as a full name or nickname.

As shown in FIG. 1, the face recognition means 23 receives face region information from the face region detection means 21. It recognizes the user from the face region information by the technique described above and outputs the result as person information, together with its detection time, to the information acquisition unit 41 of the command generation means 40 (see FIG. 3(a)).

The utterance state estimation means 24 determines, from the changes in the faces of the plurality of users, whether those users are speaking, and generates an utterance period indicating the period during which they are speaking. As shown in FIG. 2(a), the utterance state estimation means 24 here includes an utterance state determination unit 241 and an utterance condition storage unit 242.

The utterance state determination unit 241 determines whether a user is speaking. As shown in FIG. 2(a), it receives from the face change detection means 22 the face change information, which consists of information such as the user's face direction, line of sight, eye opening and closing, and lip movement, together with the detection time (not shown) at which these face changes were detected. As also shown in FIG. 2(a), it receives the utterance condition from the utterance condition storage unit 242, which holds the utterance condition in advance.

Here, the utterance condition is a predetermined condition for determining that a user is speaking. As shown in FIG. 2(b), it is a condition defined in terms of the detection results for face changes such as the user's face direction, line of sight, eye opening and closing, and lip movement. That is, the utterance state determination unit 241 determines that a user of the device is in the utterance state only when the face changes detected by the face change detection means 22 satisfy this utterance condition.

As shown in FIG. 2(b), the utterance condition here specifies that the user is in the utterance state when all of the following hold: the user's face is directed to the front at a time rate of 80% or more, the user's line of sight is directed toward the television screen at a time rate of 80% or more, the user's eyes are open at a time rate of 80% or more, and the user's lips are moving at a time rate of 50% or more. The time rate is the ratio of the duration of a face change to the time over which the user's face changes were detected. For example, if the face change detection means 22 observes the user's face for 2 seconds, the time rate is 50% when the face change persists for 1 second and 80% when it persists for 1.6 seconds.

The utterance condition shown in FIG. 2(b) is only an example; the utterance condition and the time rates can be changed as appropriate according to the type of device or the type of user. For example, the face direction, line of sight, and eye opening and closing may be excluded from the items examined in FIG. 2(b), so that the user is determined to be in the utterance state whenever the user's lips alone are moving at or above a predetermined time rate.
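The time-rate test described above can be illustrated with a short Python sketch. This is only one possible reading of FIG. 2(b): the names `FaceChange`, `UTTERANCE_CONDITION`, `time_rate`, and `is_speaking` are hypothetical and do not appear in the specification, and the thresholds are simply the example values quoted in the text.

```python
from dataclasses import dataclass


@dataclass
class FaceChange:
    """Seconds for which each state held within one observation window (hypothetical layout)."""
    window: float          # length of the observation window in seconds
    facing_front: float
    gaze_on_screen: float
    eyes_open: float
    lips_moving: float


# Example thresholds from FIG. 2(b); they are configurable, not fixed.
UTTERANCE_CONDITION = {
    "facing_front": 0.8,
    "gaze_on_screen": 0.8,
    "eyes_open": 0.8,
    "lips_moving": 0.5,
}


def time_rate(duration: float, window: float) -> float:
    """Ratio of the time a state persisted to the time over which it was observed."""
    if window <= 0:
        raise ValueError("window must be positive")
    return duration / window


def is_speaking(change: FaceChange, condition=UTTERANCE_CONDITION) -> bool:
    """True only when every item in the condition meets its time-rate threshold."""
    return all(
        time_rate(getattr(change, name), change.window) >= threshold
        for name, threshold in condition.items()
    )
```

Passing `condition={"lips_moving": 0.5}` corresponds to the lips-only variant of the condition.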

The utterance state determination unit 241 collates the face change information input from the face change detection means 22 with the utterance condition input from the utterance condition storage unit 242. When the utterance condition is satisfied, it generates, from the detection time of the face change information, an utterance period indicating the period during which the user is speaking. As shown in FIG. 1 and FIG. 2(a), the utterance state determination unit 241 then outputs this utterance period to the voice detection means 11.
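One way to picture how an utterance period could gate the audio is sketched below in Python. The representation is an assumption made for illustration: audio is modelled as `(timestamp, samples)` frames and utterance periods as `(start, end)` pairs in seconds, and the function name `gate_frames` is hypothetical.

```python
def gate_frames(frames, periods):
    """Keep only the frames whose timestamp falls inside some utterance period.

    frames:  iterable of (timestamp_seconds, samples) pairs (assumed layout)
    periods: iterable of (start_seconds, end_seconds) pairs (assumed layout)
    """
    periods = list(periods)
    return [
        (t, samples)
        for t, samples in frames
        if any(start <= t <= end for start, end in periods)
    ]
```

Frames outside every period are dropped, so sound captured while no user satisfies the utterance condition never reaches the recognition stage.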

The utterance state determination unit 241 preferably displays on the device the result of determining whether the user is in the utterance state. For example, when the device is the television T (see FIG. 6) and the unit determines that the user is in the utterance state, it indicates this on the television screen. Displaying the determination result on the television screen in this way can be expected to keep the user looking at the television T, which can improve the determination accuracy. Returning to FIG. 1, the remaining configuration of the user instruction acquisition device 1 is described below.

The hand motion analysis means 30 detects the hand regions of persons from the video of the plurality of users captured by the camera Cr and recognizes the hand motions of those users. As shown in FIG. 1, the hand motion analysis means 30 here includes a hand region detection means 31 and a hand motion recognition means 32.

The hand region detection means 31 detects the hand regions of persons from the video of the plurality of users. Specifically, it detects a person's hand region in the images that make up the video by using skin color and rough shape information. Alternatively, when it can be assumed that the user gives instructions while pointing or holding out a hand, the hand region detection means 31 can identify the hand region by using a range image and cutting out the part that protrudes nearest to the camera.
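The range-image variant can be sketched as follows in Python. This is a simplified illustration under stated assumptions, not the method of the specification: the range image is a list of rows of distances in metres, a value of 0 marks a missing measurement, and the 0.10 m margin and the name `nearest_region_mask` are hypothetical.

```python
def nearest_region_mask(depth, margin=0.10):
    """Mark the pixels lying within `margin` metres of the nearest valid pixel.

    depth: list of rows of distances in metres; 0 means "no measurement" (assumption).
    Returns a list of rows of booleans with the same shape as `depth`.
    """
    valid = [d for row in depth for d in row if d > 0]
    if not valid:
        # Nothing measured, so nothing can be cut out.
        return [[False] * len(row) for row in depth]
    nearest = min(valid)
    return [[0 < d <= nearest + margin for d in row] for row in depth]
```

A real implementation would also need to confirm that the cut-out region is actually a hand, for example by combining it with the skin-color and shape cues.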

As shown in FIG. 1, the hand region detection means 31 receives from the camera Cr the video of the plurality of users who use the device. It detects the users' hand regions from the video by the technique described above and outputs them as hand region information to the hand motion recognition means 32.

The hand motion recognition means 32 recognizes the user's hand motion from the hand regions detected from the video of the plurality of users. It applies a motion recognition technique to the regions detected by the hand region detection means 31 and recognizes user hand motions that correspond to predetermined commands. Specifically, a template is prepared in advance for each motion to be recognized, consisting of time-series data of general-purpose features such as those known as SIFT or SURF, that is, data obtained by tracking each feature frame by frame. Motion recognition is then performed by matching these templates against the time-series data of the same features extracted from the user's hand region. The hand motion recognition means 32 recognizes not only whether the user made a hand motion but also the type of hand motion (pointing, finger waving, and so on).
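The specification does not name the algorithm used to match a feature time series against the motion templates. Purely as an illustration, the Python sketch below substitutes dynamic time warping (DTW), a common choice for comparing sequences of different lengths. The function names, the template layout (a dict from motion label to a list of feature vectors), and the `max_distance` rejection threshold are all assumptions.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of feature vectors."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(a[i - 1], b[j - 1])  # Euclidean distance between frames
            cost[i][j] = step + min(
                cost[i - 1][j],      # frame of `a` repeated against the same frame of `b`
                cost[i][j - 1],      # frame of `b` repeated against the same frame of `a`
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


def classify_motion(sequence, templates, max_distance):
    """Return the label of the closest template, or None if none is close enough."""
    best_label, best = None, math.inf
    for label, template in templates.items():
        distance = dtw_distance(sequence, template)
        if distance < best:
            best_label, best = label, distance
    return best_label if best <= max_distance else None
```

Returning `None` when no template is close enough corresponds to the case where no hand motion associated with a command is recognized.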

As shown in FIG. 1, the hand motion recognition means 32 receives hand region information from the hand region detection means 31. It recognizes the user's hand motion by the technique described above and outputs the result as hand motion information, together with its detection time, to the information acquisition unit 41 of the command generation means 40 (see FIG. 3(a)).

When the speaker recognized by the voice analysis means 10 is included among the plurality of users recognized by the face analysis means 20, the command generation means 40 generates a command predetermined for the change in the user's face detected by the face analysis means 20, the user's hand motion recognized by the hand motion analysis means 30, and the content of the user's voice recognized by the voice analysis means 10. As shown in FIG. 3(a), the command generation means 40 here includes an information acquisition unit 41, a command generation unit 42, and a command condition storage unit 43.

The information acquisition unit 41 acquires the information needed to generate a command for controlling the device. As shown in FIG. 3(a), it receives face change information from the face change detection means 22, person information from the face recognition means 23, hand motion information from the hand motion recognition means 32, voice information from the voice recognition means 12, and speaker information from the speaker recognition means 13.

When the speaker recognized by the speaker recognition means 13 is included among the persons recognized by the face recognition means 23, that is, when the user who gave a voice instruction to the device is among the plurality of users of the device, the information acquisition unit 41 outputs the face change information, hand motion information, and voice information of that user to the command generation unit 42, as shown in FIG. 3(a). In this way, when a plurality of users use the device, the information acquisition unit 41 can identify, among the users recognized by the face recognition means 23, the user who is giving an instruction to the device. The information acquisition unit 41 includes a storage unit (not shown) for temporarily holding the face change information, person information, hand motion information, voice information, and speaker information.

The face change information input from the face change detection means 22 to the information acquisition unit 41 is face change information for each face region detected by the face region detection means 21. Likewise, the person information input from the face recognition means 23 is registered-name information, such as a full name, for each face region detected by the face region detection means 21. By using the face region as the key, the information acquisition unit 41 can therefore determine which user a given piece of face change information belongs to.

As described above, the information acquisition unit 41 also receives the person information together with its detection time from the face recognition means 23, and the hand motion information together with its detection time from the hand motion recognition means 32. By using the detection time as the key, the information acquisition unit 41 can therefore determine which user a given piece of hand motion information belongs to.
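The two checks performed by the information acquisition unit can be sketched roughly as follows in Python. The specification does not spell out how detection times are compared, so the tolerance window used here is an assumption, as are the `Record` type and the function names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Record:
    label: str    # a registered name, or a hand-motion type
    time: float   # detection time in seconds


def instructing_user(persons: List[Record], speaker: str) -> Optional[Record]:
    """Return the person record matching the recognized speaker, or None if absent."""
    for person in persons:
        if person.label == speaker:
            return person
    return None


def motions_for(person: Record, motions: List[Record], tolerance: float = 0.5) -> List[str]:
    """Hand motions detected within `tolerance` seconds of the person's detection time."""
    return [m.label for m in motions if abs(m.time - person.time) <= tolerance]
```

When `instructing_user` returns `None`, the speaker was not among the users seen by the camera, and nothing is passed on for command generation.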

The command generation unit 42 generates a command corresponding to an instruction for controlling the device. As shown in FIG. 3(a), it receives from the information acquisition unit 41 the face change information, hand motion information, and voice information of the user who gave a voice instruction to the device. As also shown in FIG. 3(a), it receives the command conditions from the command condition storage unit 43, which holds the command conditions in advance.

Here, a command condition is a predetermined condition for generating a command. As shown in FIG. 3(b), it is a condition defined in terms of the detection results for the user's face direction, line of sight, eye opening and closing, lip movement, hand motion, voice, and so on. That is, the command generation unit 42 generates a command only when the change in the user's face detected by the face change detection means 22, the user's hand motion recognized by the hand motion recognition means 32, and the user's voice recognized by the voice recognition means 12 satisfy a command condition.

As shown in FIG. 3(b), four patterns are defined here as command conditions. The first pattern is the first column of the detection result field in FIG. 3(b). It specifies that when the user's face is directed to the front, the user's line of sight is directed toward the television screen, the user's eyes are open, the user's lips are moving, the user is making a hand motion, and the user is speaking, the content of the voice instruction and the content of the hand motion instruction are analyzed to generate a command. This means that a command is generated in a situation such as that of user A shown in FIG. 4(a).

The second pattern is the second column of the detection result field in FIG. 3(b). It specifies that when the user's face is directed to the front, the user's line of sight is directed toward the television screen, the user's eyes are open, the user's lips are moving, the user is not making a hand motion, and the user is speaking, the content of the voice instruction is analyzed to generate a command. This means that a command is generated in a situation such as that of user B shown in FIG. 4(b).

The third pattern is the third column of the detection result field in FIG. 3(b). It specifies that when the user's face is turned sideways, the user's line of sight is directed sideways, the user's eyes are open, the user's lips are moving, the user is making a hand motion, and the user is speaking, the content of the voice instruction and the content of the hand motion instruction are analyzed to generate a command. This means that a command is generated in a situation such as that of user C shown in FIG. 4(c).

The fourth pattern is the fourth column of the detection result field in FIG. 3(b). It specifies that when the user's face is turned sideways, the user's line of sight is directed sideways, the user's eyes are closed, the user's lips are moving, the user is not making a hand motion, and the user is speaking, the content of the voice instruction is analyzed to generate a command. This means that a command is generated in a situation such as that of user D shown in FIG. 4(d).

The command conditions shown in FIG. 3(b) are only examples and can be changed as appropriate according to the type of device or the type of user. For example, the face direction, line of sight, and eye opening and closing may be excluded from the items examined in FIG. 3(b), so that only the user's lip movement and voice are used as the conditions for generating a command.
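The four example patterns of FIG. 3(b) can be written down as a small lookup table. The Python sketch below is illustrative only: the `Detection` tuple, its field values, and the function name `modalities_to_analyse` are hypothetical, and the table simply encodes the four patterns described in the text.

```python
from typing import FrozenSet, NamedTuple, Optional


class Detection(NamedTuple):
    face: str           # "front" or "side"
    gaze: str           # "screen" or "side"
    eyes_open: bool
    lips_moving: bool
    hand_motion: bool
    speaking: bool


VOICE = frozenset({"voice"})
VOICE_AND_HAND = frozenset({"voice", "hand"})

# One entry per column of the detection result field in FIG. 3(b).
COMMAND_CONDITIONS = {
    Detection("front", "screen", True, True, True, True): VOICE_AND_HAND,   # pattern 1
    Detection("front", "screen", True, True, False, True): VOICE,           # pattern 2
    Detection("side", "side", True, True, True, True): VOICE_AND_HAND,      # pattern 3
    Detection("side", "side", False, True, False, True): VOICE,             # pattern 4
}


def modalities_to_analyse(detection: Detection) -> Optional[FrozenSet[str]]:
    """Return which instruction contents to analyse, or None if no pattern matches."""
    return COMMAND_CONDITIONS.get(detection)
```

A result of `None` means that no command condition is satisfied, so no command is generated.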

The command generation unit 42 includes a database (not shown) that holds in advance a list of commands for controlling the device. It analyzes the content of the user's voice instruction and hand motion instruction by searching this database for the commands that correspond to the content of the user's voice recognized by the voice recognition means 12 and to the user's hand motion recognized by the hand motion recognition means 32.

In this database, commands are associated with the natural words and motions that users produce in everyday life. For example, when the device is the television T (see FIG. 6), remarks that a user makes when the volume of the television T is too low, such as "The sound's tiny, isn't it", "The voices are quiet", and "I can't hear it well", are associated in the database with the command "increase the television volume". Similarly, the motion of covering the ears, which a user makes when the television volume is too high, is associated in the database with the command "decrease the television volume".
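A minimal Python sketch of such a database is given below. The specification does not describe how the database is searched, so exact-match lookup after lowercasing is an assumption, the keys are English glosses of the example remarks, and all names are hypothetical.

```python
VOLUME_UP = "increase the television volume"
VOLUME_DOWN = "decrease the television volume"

# English glosses of the example remarks; a real system would hold many more variants.
SPEECH_COMMANDS = {
    "the sound's tiny, isn't it": VOLUME_UP,
    "the voices are quiet": VOLUME_UP,
    "i can't hear it well": VOLUME_UP,
}

MOTION_COMMANDS = {
    "cover ears": VOLUME_DOWN,
}


def find_commands(speech=None, motion=None):
    """Look up the commands associated with a recognized remark and/or hand motion."""
    found = []
    if speech is not None:
        command = SPEECH_COMMANDS.get(speech.strip().lower())
        if command is not None:
            found.append(command)
    if motion is not None:
        command = MOTION_COMMANDS.get(motion)
        if command is not None:
            found.append(command)
    return found
```

Several different remarks map to the same command, which is what lets the user speak naturally instead of memorizing a fixed phrase.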

Because the database of the command generation unit 42 holds a command list that corresponds to the user's natural utterances and motions, the user can give instructions to the device under more natural circumstances.

In the user instruction acquisition device 1 configured as described above, the face analysis means 20 determines from the change in a user's face whether that user is speaking, and voice recognition is performed only while the user is speaking, so the accuracy of voice recognition can be improved. In addition, comparing the users recognized by face recognition with the speaker recognized by speaker recognition makes it possible to identify the user who gave a voice instruction to the device, so commands can be generated accurately even when a plurality of users use the device.

Furthermore, the user instruction acquisition device 1 automatically determines the user's utterance state from the change in the user's face, and generates a command when the user merely gives an instruction to the device by voice and motion. The user can therefore convey the content of an instruction to the device, and control the device, as an extension of natural behavior and without performing complicated operations.

The user instruction acquisition device 1 can be realized by running, on a general-purpose computer, a program that causes the computer to function as each of the means described above. This program (the user instruction acquisition program) can be distributed over a communication line, or written to a recording medium such as a CD-ROM and distributed.

[Operation of User Instruction Acquisition Device]
The operation of the user instruction acquisition device 1 is briefly described with reference to FIG. 5.
First, when the user instruction acquisition device 1 starts operating, the camera Cr captures video of the plurality of users who use the device and outputs it to the face region detection means 21 and the hand region detection means 31. The microphone M captures the sound around the device and outputs it to the voice detection means 11. The voice detection means 11 then detects voice in the sound around the device (step S1). Next, the face region detection means 21 detects the face regions of persons from the video of the plurality of users and outputs them as face region information to each detection unit of the face change detection means 22 and to the face recognition means 23 (step S2).

Next, each detection unit of the face change detection means 22 detects face changes, such as the user's face direction, line of sight, eye opening and closing, and lip movement, from the face region information of the plurality of users, and outputs them as face change information to the utterance state determination unit 241 and the information acquisition unit 41 (step S3). Next, the face recognition means 23 recognizes, from the face region information of the plurality of users, the person corresponding to the face contained in each region, that is, the user, and outputs the result as person information to the information acquisition unit 41 (step S4).

The hand region detection means 31 detects the hand regions of persons from the video of the plurality of users and outputs them as hand region information to the hand motion recognition means 32 (step S5). Next, the hand motion recognition means 32 recognizes the users' hand motions from the hand region information of the plurality of users and outputs them as hand motion information to the information acquisition unit 41 (step S6).

Next, the utterance state determination unit 241 checks whether the face change information of the plurality of users satisfies the utterance condition input from the utterance condition storage unit 242, and thereby determines whether the users are speaking (step S7). When it determines that a user is speaking, it generates an utterance period indicating the period during which the user is speaking and outputs it to the voice detection means 11. The voice detection means 11 then outputs the voice around the device to the voice recognition means 12 and the speaker recognition means 13 (Yes in step S7). When the utterance state determination unit 241 does not determine that a user is speaking, it waits until there is new input (No in step S7).

Next, the voice recognition means 12 recognizes the content of the voice around the device and outputs it as voice information to the information acquisition unit 41 (step S8). The speaker recognition means 13 recognizes the speaker of the voice around the device and outputs the result as speaker information to the information acquisition unit 41 (step S9).

Next, when the speaker in the speaker information is included among the persons in the person information, the information acquisition unit 41 outputs the face change information, hand motion information, and voice information of the user who gave the voice instruction to the command generation unit 42. When that user's face change information, hand motion information, and voice information satisfy a command condition, the command generation unit 42 generates a command (step S10).

Description of Symbols
1 User instruction acquisition device
10 Voice analysis means
11 Voice detection means
12 Voice recognition means
13 Speaker recognition means
20 Face analysis means
21 Face region detection means
22 Face change detection means
23 Face recognition means
24 Utterance state estimation means
30 Hand motion analysis means
31 Hand region detection means
32 Hand motion recognition means
40 Command generation means
41 Information acquisition unit
42 Command generation unit
43 Command condition storage unit
221 Face direction detection unit
222 Line-of-sight detection unit
223 Eye opening/closing detection unit
224 Lip movement detection unit
241 Utterance state determination unit
242 Utterance condition storage unit
Cr Camera
M Microphone
T Television receiver (television)

Claims

A user instruction acquisition device that identifies, from among a plurality of users who use a device, a user who is giving an instruction to control the device, and acquires the instruction from that user, the user instruction acquisition device comprising:
face analysis means for recognizing each of the plurality of users, registered in advance, from video captured by a camera, detecting a change in the face of each of the plurality of users, and generating, from the change in the face, an utterance period indicating a period during which each of the plurality of users is speaking;
hand motion analysis means for recognizing the hand motions of the plurality of users from the video of the plurality of users;
voice analysis means for detecting voice from the sound around the device on the basis of the utterance period generated by the face analysis means, and recognizing the content of the voice and the speaker using acoustic feature amounts registered in advance for each user; and
command generation means for, when the speaker recognized by the voice analysis means is included among the plurality of users recognized by the face analysis means, identifying the speaker as the user who is giving the instruction, and generating a command predetermined for the change in that user's face detected by the face analysis means, that user's hand motion recognized by the hand motion analysis means, and the content of that user's voice recognized by the voice analysis means.

The user instruction acquisition device according to claim 1, wherein the face analysis means comprises:
face region detection means for detecting the face regions of the plurality of users from the video;
face recognition means for recognizing the user corresponding to each face region using face feature amounts registered in advance for each user;
face change detection means for detecting changes in the faces of the plurality of users from the face regions of the plurality of users; and
utterance state estimation means for determining, from the changes in the faces of the plurality of users, whether the plurality of users are speaking and, when it is determined that they are speaking, generating the utterance period.

A user instruction acquisition program that, in order to identify, from among a plurality of users who use a device, a user who is giving an instruction to control the device and to acquire the instruction from that user, causes a computer to function as:
face analysis means for recognizing each of the plurality of users, registered in advance, from video captured by a camera, detecting a change in the face of each of the plurality of users, and generating, from the change in the face, an utterance period indicating a period during which each of the plurality of users is speaking;
hand motion analysis means for recognizing the hand motions of the plurality of users from the video of the plurality of users;
voice analysis means for detecting voice from the sound around the device on the basis of the utterance period generated by the face analysis means, and recognizing the content of the voice and the speaker using acoustic feature amounts registered in advance for each user; and
command generation means for, when the speaker recognized by the voice analysis means is included among the plurality of users recognized by the face analysis means, identifying the speaker as the user who is giving the instruction, and generating a command predetermined for the change in that user's face detected by the face analysis means, that user's hand motion recognized by the hand motion analysis means, and the content of that user's voice recognized by the voice analysis means.

A television receiver that provides broadcast programs to users, comprising the user instruction acquisition device according to claim 1 or 2, which acquires an instruction from a user, based on the user's voice and motion, by analyzing video from a camera and sound from a microphone installed in the television receiver.