JP2014175944A - Television conference apparatus, control method of the same, and program

Info

Publication number: JP2014175944A
Application number: JP2013048379A
Authority: JP
Inventors: Takahiro Hiramatsu; 嵩大平松
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2013-03-11
Filing date: 2013-03-11
Publication date: 2014-09-22
Anticipated expiration: 2033-03-11
Also published as: JP6149433B2

Abstract

PROBLEM TO BE SOLVED: To provide a television conference apparatus which can be operated by a user who cannot reach the television conference apparatus or its remote control. SOLUTION: A television conference apparatus has: image pick-up means which acquires image data of one or a plurality of users using the television conference apparatus; audio input means which acquires audio data of the user; analysis means which analyzes, from the image data, at least one of the direction of the user's face and the direction of the user's eyes; and estimation means which estimates the user's intention to operate the television conference apparatus on the basis of the analysis result of the analysis means. The television conference apparatus performs the operation on the basis of the estimation result obtained by the estimation means.

Description

The present invention relates to a video conference apparatus, a control method for the video conference apparatus, and a program.

Video conference systems that hold conferences with remote sites over a communication network such as the Internet are known. In such a system, a single video conference apparatus is expected to be shared by several users, and some of them may be seated where they cannot reach the apparatus or its remote control. Conventionally, such users have had difficulty operating the video conference apparatus.

Further, when the video conference apparatus is operated by voice recognition, remarks made during the conference may cause erroneous operation. Patent Document 1 discloses a voice-recognition electrical device that accepts voice control only when a voice recognition start switch is operated.

The technique disclosed in Patent Document 1 cannot solve the prior-art problem that a user located away from the video conference apparatus or the remote control cannot operate the apparatus.
Embodiments of the present invention have been made in view of the above problem, and an object thereof is to provide a video conference apparatus that can be operated even by a user who cannot reach the apparatus or the remote control.

To solve the above problem, the apparatus of claim 1 of the present application includes: imaging means that acquires image data of one or more users of a video conference apparatus; voice input means that acquires voice data of the users; analysis means that analyzes, from the image data, at least one of the direction of a user's face and the direction of the user's line of sight; and estimation means that estimates, based on the analysis result of the analysis means, the user's intention to operate the video conference apparatus. The apparatus performs the operation based on the estimation result of the estimation means.

According to the present embodiment, a video conference apparatus can be provided that can be operated even by a user who cannot reach the apparatus or the remote control.

FIG. 1 is a diagram showing a configuration example of a video conference system according to an embodiment.
FIG. 2 is a hardware configuration diagram of a video conference apparatus according to an embodiment.
FIG. 3 is a functional configuration diagram of the video conference apparatus according to an embodiment.
FIG. 4 is a diagram showing an example arrangement of the video conference apparatus according to an embodiment.
FIG. 5 is a flowchart according to the first embodiment.
FIG. 6 is a flowchart of image processing according to the first embodiment.
FIG. 7 is a conceptual diagram of the short-distance table according to the first embodiment.
FIG. 8 is a conceptual diagram of the long-distance table according to the first embodiment.
FIG. 9 is a conceptual diagram of the medium-distance table according to the first embodiment.
FIG. 10 is a conceptual diagram of an image acquired by the imaging means according to the first embodiment.
FIG. 11 is a conceptual diagram of a table that depends on the angle, according to the first embodiment.
FIG. 12 is a flowchart of the voice recognition means according to the first embodiment.
FIG. 13 is a functional configuration diagram of the voice recognition means according to the first embodiment.
FIG. 14 is a flowchart according to the second embodiment.
FIG. 15 is a diagram showing an example arrangement of a video conference apparatus according to another embodiment.

Embodiments of the present invention will be described below with reference to the accompanying drawings.
<System configuration>
FIG. 1 is a diagram illustrating a configuration example of a video conference system 100 according to an embodiment of the present invention. The video conference system 100 includes video conference apparatuses 101, 102, and 103 and a server 105, all connected to a network 104. The server 105, for example, receives video and audio data from the video conference apparatus 101 and forwards it to the desired video conference apparatuses, such as the video conference apparatuses 102 and 103. In doing so, the server 105 may also encode data received from the video conference apparatuses 101 to 103, or decode encoded data received from them. The configuration in FIG. 1 is only an example; the video conference system may include any number of video conference apparatuses of two or more. The apparatuses may also be connected peer to peer, without going through the server 105.

The video conference apparatuses 101, 102, and 103 can communicate with one another, for example via the server 105, to transmit and receive images and audio. This allows a user of the video conference apparatus 101 to hold a video conference with users of the video conference apparatuses 102 and 103 through images and audio exchanged in real time.
<Device configuration>
(Hardware configuration)
FIG. 2 shows an example of the hardware configuration of the video conference apparatus 101 according to this embodiment. The other video conference apparatuses 102 and 103 in the video conference system 100 need not have the same configuration.

The video conference apparatus 101 includes a video conference apparatus main body 200 implemented by a computer or the like, a display device 210, and one or more microphones 212. The video conference apparatus main body 200 includes a CPU (Central Processing Unit) 201, a memory 202, a control unit 203, an image processing unit 204, an audio processing unit 205, a network interface (hereinafter, network I/F) 206, an image sensor interface (hereinafter, image sensor I/F) 207, a camera 208, an image output interface (hereinafter, image output I/F) 209, an audio input/output interface (hereinafter, audio input/output I/F) 211, a speaker 213, and a system bus 214. This configuration is only an example and does not limit the scope of the present invention. For example, the camera 208 and the speaker 213 may be provided separately from the video conference apparatus main body 200, or may be built into the display device 210. The video conference apparatus main body 200 may itself include at least one of the microphones 212. Further, the image sensor I/F 207 may be part of the image processing unit 204, and the audio input/output I/F 211 may be part of the audio processing unit 205.

The CPU 201 is an arithmetic device that implements the functions of the video conference apparatus 101, for example by reading programs and data from the memory 202 and executing them. The memory 202 is a storage unit such as a RAM (Random Access Memory), a ROM (Read Only Memory), or an HDD (Hard Disk Drive). The memory 202 stores the software and data required for the various processes executed by the CPU 201, as well as image data, audio data, and the like.

The control unit 203 controls the video conference apparatus 101 as a whole. The image processing unit 204 performs various kinds of image processing on image data or image signals. The audio processing unit 205 performs various kinds of audio processing on audio data or audio signals. The image processing unit 204 and the audio processing unit 205 may include a processor such as a DSP (Digital Signal Processor). The network I/F 206 is an interface that connects the video conference apparatus 101 to the network 104 and exchanges data with the other video conference apparatuses 102, 103, and so on via the network 104.

The image sensor I/F 207 is an interface that captures the image signal output from the camera 208 as image data in a predetermined format. The display device 210 is, for example, an LCD (Liquid Crystal Display) monitor or a projector. The display device 210 may include the speaker 213 for audio output. The image output I/F 209 is an interface that outputs images of the remote party and various other images, such as menu screens and setting screens, to the display device 210.
The audio input/output I/F 211 captures the audio signal input through the microphone 212 as audio data in a predetermined format. It also converts audio data to be output into an audio signal that the speaker 213 can reproduce. The system bus 214 carries address, data, and various control signals.
(Functional configuration)
FIG. 3 is a diagram illustrating the functional configuration of the video conference apparatus 101 according to the present embodiment. The video conference apparatus 101 includes imaging means 301, analysis means 302, estimation means 303, control means 304, voice input means 305, voice output means 306, voice recognition means 307, and volume control means 308.

The imaging means 301 acquires image data of the users of the video conference apparatus 101 and includes, for example, the camera 208 of FIG. 2. The analysis means 302 performs image processing on the image data or image signal acquired by the imaging means 301 and analyzes, for each user of the video conference apparatus 101, at least one of the face direction and the line of sight. The estimation means 303 estimates each user's intention to operate the video conference apparatus 101 based on the analysis result of the analysis means 302. The analysis means 302 and the estimation means 303 are included in, for example, the image processing unit 204 of FIG. 2. Alternatively, their functions may be implemented by a program running on the CPU 201 or the like.

The control means 304 controls the video conference apparatus 101 based on the estimation result of the estimation means 303 and includes, for example, the control unit 203 of FIG. 2 and a program running on the CPU 201. Where necessary, the estimation means 303 or the like may control the voice recognition means 307 or the like directly, without going through the control means 304. The voice input means 305 acquires voice data of the users of the video conference apparatus 101 and includes, for example, the microphone 212 of FIG. 2. The voice recognition means 307 performs voice processing on the voice data or voice signal acquired by the voice input means 305, compares it with pre-registered word patterns and control contents, and recognizes the control content. The voice recognition means 307 is included in, for example, the audio processing unit 205 of FIG. 2. Alternatively, its function may be implemented by a program running on the CPU 201 or the like.

The voice output means 306 outputs audio from the video conference apparatus 101 and includes, for example, the speaker 213 of FIG. 2. The volume control means 308, under the control of the control means 304, the voice recognition means 307, and the like, adjusts the volume output from the voice output means 306 and the level of the audio input from the voice input means 305. The volume control means 308 is included in, for example, the audio input/output I/F 211 or the audio processing unit 205 of FIG. 2.

Next, FIG. 4 shows an example of how the parts are arranged when the video conference apparatus 101 is in use. As shown in FIG. 4, the video conference apparatus main body 200 with the camera 208, the display device 210, and a plurality of microphones 212a and 212b are placed on a conference table 406. Users 401, 402, 403, and 404 are assumed to be seated around the conference table 406 where they can see the display device 210 and the camera 208. The video conference apparatus 101 may further include a remote control 405 for operating it.

The video conference apparatus 101 estimates each user's intention to operate it based on the user's face direction or line-of-sight direction relative to the camera 208. For example, when a user's face or line of sight is directed toward the camera 208, the apparatus judges that the user intends to operate the video conference apparatus 101 by voice. The video conference apparatus 101 then starts voice recognition processing and carries out the control that corresponds to the recognition result.

With the above configuration, even the users 401 and 403 in FIG. 4, who are seated away from the video conference apparatus main body 200 and the remote control 405, can operate the video conference apparatus 101 by voice by addressing their instructions to the camera 208.
<Description of operation>
Next, the operation of the video conference apparatus 101 according to this embodiment will be described.
[First Embodiment]
FIG. 5 shows a flowchart according to the first embodiment. The video conference apparatus 101 acquires image data of one or more users with the imaging means 301 (step S501). Next, the analysis means 302 analyzes, from this image data, at least one of the face direction and the line of sight of each user (step S502).
The face direction and line-of-sight direction of each user may be analyzed with general face recognition and gaze analysis techniques. Only an outline of one example is given here. For the face direction, components such as each user's face, eyes, nose, and mouth are extracted from the image data acquired by the imaging means 301, for example by applying pattern recognition, and the face direction is determined from the positional relationship of the extracted components. Alternatively, the face direction may be calculated from the positional relationship and angle of the eyes. For the line of sight, for example, the position of each user's iris is detected, and the line of sight is analyzed from the distance between the iris and components such as the eyes, nose, and mouth, together with the face direction.

These methods are merely examples and do not limit the scope of the present invention. The face direction and line-of-sight direction of each user may be detected by other methods.

Next, the estimation means 303 estimates each user's intention to operate the apparatus based on the analysis result of the analysis means 302. In FIG. 4, for example, the users normally look at the display device 210 during a video conference, since they are checking the other party or the presented material shown on it. If, in this situation, one of the users turns toward the camera 208 or the video conference apparatus main body 200 and speaks, the utterance is likely intended not for the remote participants but as an operation of the video conference apparatus 101. In this embodiment, therefore, the estimation means 303 determines whether each user's face direction or line-of-sight direction points toward the camera 208 or the video conference apparatus main body 200 (step S503). If the face direction or line-of-sight direction of at least one user does, the estimation means 303 judges that the user intends to operate the video conference apparatus 101 by voice recognition.

As a more preferable example, a user may be judged to intend a voice-recognition operation only when the user's face direction or line-of-sight direction stays toward the camera 208 or the video conference apparatus main body 200 for longer than a predetermined time (for example, 5 seconds). This prevents voice recognition processing from starting by mistake when a user merely happens to glance at the camera 208 or the video conference apparatus main body 200.
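The dwell-time condition can be sketched as a small timer that restarts whenever the user looks away. This is only an illustration: the class name `DwellGate`, its method, and the use of seconds as plain floats are assumptions, and only the 5-second example value comes from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DwellGate:
    """Reports True once a user has faced the camera longer than threshold_s."""

    threshold_s: float = 5.0          # the "for example, 5 seconds" from the text
    _since: Optional[float] = None    # time at which the current gaze run began

    def update(self, facing_camera: bool, now_s: float) -> bool:
        if not facing_camera:
            self._since = None        # gaze broken: restart the timer
            return False
        if self._since is None:
            self._since = now_s       # first frame of a new gaze run
        return (now_s - self._since) > self.threshold_s
```

One gate per tracked user would be needed in practice, since the text judges each user separately.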

When the estimation means 303 judges that a user's face direction or line-of-sight direction points toward the camera 208, the voice recognition means 307 performs voice recognition processing on the voice data acquired by the voice input means 305 (step S504). The video conference apparatus 101 is then controlled based on the result of the voice recognition processing (step S505). The voice processing is described later.

As a preferable example, the apparatus may include means for notifying the users, by a display, a voice message, a sound effect, or the like, that the voice recognition means 307 is performing voice recognition processing. The users then know that recognition is in progress and can refrain from remarks that might cause erroneous operation. If, on the other hand, the estimation means 303 judges that the user's face direction or line-of-sight direction does not coincide with the direction of the camera 208 or the imaging means 301, the process returns to step S501, and steps S501 to S503 are performed again.
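One pass through steps S501 to S505 might look like the following sketch. The function `run_once` and its four callable parameters are hypothetical stand-ins for the imaging, analysis and estimation, voice recognition, and control means; the patent does not specify such an interface.

```python
from typing import Callable, Iterable, Optional


def run_once(
    capture: Callable[[], object],                  # S501: imaging means 301
    analyze: Callable[[object], Iterable[bool]],    # S502-S503: one flag per user
    recognize: Callable[[], Optional[str]],         # S504: voice recognition means 307
    control: Callable[[str], None],                 # S505: control means 304
) -> bool:
    """Run one pass of the flowchart; return True if voice recognition ran."""
    frame = capture()
    if not any(analyze(frame)):
        return False          # nobody faces the camera: the caller loops back to S501
    command = recognize()
    if command is not None:   # no matching command: nothing to control
        control(command)
    return True
```

A caller would invoke `run_once` repeatedly, which corresponds to returning to step S501 whenever no user is facing the camera.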

With the above operation, each user can operate the video conference apparatus 101 by voice control without touching the apparatus or the remote control 405.
(Image processing)
An example of the image processing performed by the analysis means 302 and the estimation means 303 is described next. FIG. 6 is a flowchart showing an example of this processing. When a human face is detected in the image data acquired by the imaging means 301, the analysis means 302 acquires the face direction of that user (step S601). At the same time, it obtains the distance between the imaging means 301 and the user from the proportion of the imaging range occupied by the face (step S602). If necessary, it also obtains the angle between the camera 208 and the user, or between the camera 208 and the video conference apparatus main body 200 (step S603).
Next, using the acquired distance to the user and the angle between the camera 208 and the user or the video conference apparatus main body 200, the estimation means 303 selects the table to use from among a plurality of tables. Each table relates a face position relative to the camera 208 to a reference face direction, that is, the face direction observed when a face at that position is turned toward the camera 208 (step S604).
The tables are described next. FIG. 7 shows a conceptual view of the short-distance table. Its four cells correspond to face positions in the image acquired by the camera 208. Each cell stores reference face direction data indicating the face direction when the face is turned toward the camera 208. For example, if the face direction of a person appearing in the upper left of the image acquired by the camera 208 matches the reference face direction in the cell at column A, row 1 of FIG. 7, that person can be judged to be facing the camera 208.

The proportion of the imaging range occupied by a face varies with the distance between the camera 208 and the user. When the user is far from the camera 208, the face occupies a small proportion of the imaging range, so the long-distance table with small cells shown in FIG. 8 is used. When the user is closer to the camera 208 and the face no longer fits within a cell of the long-distance table, the medium-distance table shown in FIG. 9 or the short-distance table shown in FIG. 7 is selected.
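Choosing a table from the face-area proportion could be sketched as two threshold comparisons. The function name `select_table` and the threshold values `far_max` and `mid_max` are hypothetical; the patent gives no numeric boundaries between the long-, medium-, and short-distance tables.

```python
def select_table(face_area: float, image_area: float,
                 far_max: float = 0.02, mid_max: float = 0.08) -> str:
    """Pick a table name from the fraction of the image occupied by the face."""
    if image_area <= 0:
        raise ValueError("image_area must be positive")
    ratio = face_area / image_area
    if ratio <= far_max:
        return "long"      # small face: the user is far away, use the fine grid
    if ratio <= mid_max:
        return "medium"
    return "short"         # large face: the user is close, use the coarse grid
```

In a real installation the thresholds would depend on the camera's field of view and the cell sizes of each table.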

For example, suppose the image data acquired by the imaging means 301 is as shown in FIG. 10. To judge the face direction of the person appearing at the front left, the medium-distance table of FIG. 9 is a suitable choice given the area of the face. The face then falls in column A, row 2 of the medium-distance table, so the corresponding reference face direction can be read from that table. If this reference face direction matches the face direction of the person at the front left of FIG. 10, that person can be judged to be facing the camera 208.

The relationship between the face position in the image data and the face direction when facing the camera 208 also depends on the angle between the user and the camera 208. For example, when the camera 208 is installed obliquely to the users, the face direction observed when a user looks at the camera 208 changes, as shown in FIG. 11. It is therefore advisable to prepare a plurality of tables for different angles between the user and the camera 208.

When the video conference apparatus 101 is permanently installed in a particular conference room, the angle between the user and the camera 208 may be set at installation time. When the position of the video conference apparatus main body 200 is fixed and the angle between the main body 200 and the camera 208 can be changed, the angle may be derived from the angle between the main body 200 and the camera 208. Alternatively, the optimal table may be selected by having one or more users face the camera 208 and perform a setting operation at the start of the video conference. The analysis means 302 may also calculate the angle from the face direction of the users while they are looking at the display device 210 during the video conference.

A face in the image data may straddle several cells of the table. In that case, the face may be treated as lying in, for example, the cell that contains the center of the face.
Returning to FIG. 6, the description of the flowchart continues. After selecting the table to use in step S604, the estimation means 303 determines from that table the reference face direction, that is, the face direction that a face at the detected position would have when looking toward the camera 208 (step S605). It then compares the user's face direction with the reference face direction to see whether they match (step S606). Here, "match" covers not only an exact match but also the case where the difference is within a predetermined range and the two can be judged to point in substantially the same direction.
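The tolerant comparison of step S606 can be sketched as a per-axis angular difference check. Representing a direction as `(yaw, pitch)` in degrees and the 10-degree default tolerance are assumptions; the text only requires the difference to be within a predetermined range.

```python
from typing import Tuple


def directions_match(face: Tuple[float, float],
                     reference: Tuple[float, float],
                     tolerance_deg: float = 10.0) -> bool:
    """Return True when both yaw and pitch differ by at most tolerance_deg."""
    yaw_diff = abs(face[0] - reference[0])
    pitch_diff = abs(face[1] - reference[1])
    return yaw_diff <= tolerance_deg and pitch_diff <= tolerance_deg
```

A wider tolerance makes the apparatus easier to trigger but raises the chance of starting voice recognition unintentionally.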

If the user's face direction matches the reference face direction, the voice recognition process is started (step S607). If the two do not match, the process returns to step S601 and is performed again.
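The comparison in steps S605 to S607 can be sketched as follows. This is only an illustrative sketch, not the implementation described in the specification: the yaw/pitch representation of a face direction, the grid-shaped table, the 10-degree tolerance, and all function names are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaceDirection:
    """Hypothetical face direction, in degrees relative to the camera axis."""
    yaw_deg: float
    pitch_deg: float


def area_of(face_center, frame_size, grid):
    """Return the (row, col) table area that contains the center of the face."""
    x, y = face_center
    width, height = frame_size
    cols, rows = grid
    col = min(int(x * cols / width), cols - 1)
    row = min(int(y * rows / height), rows - 1)
    return row, col


def directions_match(observed, reference, tolerance_deg=10.0):
    """'Match' means the difference is within a predetermined range (assumed 10 degrees)."""
    return (abs(observed.yaw_deg - reference.yaw_deg) <= tolerance_deg
            and abs(observed.pitch_deg - reference.pitch_deg) <= tolerance_deg)


def should_start_voice_recognition(face_center, observed, table, frame_size,
                                   tolerance_deg=10.0):
    """Look up the reference face direction (S605) and compare it (S606)."""
    rows, cols = len(table), len(table[0])
    row, col = area_of(face_center, frame_size, (cols, rows))
    return directions_match(observed, table[row][col], tolerance_deg)
```

With a one-row, three-column table, a face centered in the left third of the frame is compared against the left column's reference direction; voice recognition would start only when the observed direction falls within the tolerance of that entry.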

As described above, the image processing of FIG. 6 can be performed by analyzing only the face direction, without analyzing the user's line of sight. The above operation is merely an example and does not limit the scope of the present invention. For example, instead of preparing in advance a plurality of tables corresponding to the angle between the camera 208 and the user, a table for a given angle may be calculated from a reference table. Alternatively, instead of holding a plurality of tables corresponding to distance and angle, a new table may be created, for example, by having each user face the camera 208 and perform a predetermined operation at the start of the conference.

(Audio processing)

Next, the voice recognition process of the voice recognition means 307 will be described with a specific example. FIG. 12 shows a flowchart of the voice recognition process in the first embodiment, and FIG. 13 shows the functional configuration of the voice recognition means 307. In FIG. 13, the voice recognition means 307 includes an acoustic analysis means 1301, a word recognition means 1302, a control recognition means 1303, a word pattern storage means 1304, and a word-control signal correspondence storage means 1305. The acoustic analysis means 1301 performs acoustic analysis of the voice data or voice signal acquired by the voice input means 305 (steps S1201 and S1202 in FIG. 12). The word recognition means 1302 collates the voice data or voice signal analyzed by the acoustic analysis means 1301 with the word patterns stored in the word pattern storage means 1304 (step S1203 in FIG. 12).

When the analyzed voice data or voice signal matches a word pattern stored in the word pattern storage means 1304, the matched word information is input to the control recognition means 1303. The control recognition means 1303 determines or executes the control content corresponding to the input word information, based on the combinations of word information and control content stored in the word-control signal correspondence storage means 1305. With this configuration, the voice recognition means 307 analyzes the voice data or voice signal input from the voice input means 305, determines whether corresponding control content exists, and, if it does, sends a control signal for that control content (step S1204 in FIG. 12).

With the above operation, when the voice input from the voice input means 305 matches voice data registered in advance, the video conference apparatus 101 is operated by voice.
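The flow from word recognition to control signal (steps S1203 and S1204) can be illustrated with a small lookup sketch. It assumes the acoustic analysis has already produced a text transcript, which the specification does not state; the stored words, the control names, and the function names are all hypothetical.

```python
# Hypothetical contents of the word pattern storage (1304).
WORD_PATTERNS = {"mute", "volume up", "volume down", "end call"}

# Hypothetical contents of the word-control signal correspondence storage (1305).
WORD_TO_CONTROL = {
    "mute": "MIC_MUTE",
    "volume up": "VOLUME_UP",
    "volume down": "VOLUME_DOWN",
    "end call": "DISCONNECT",
}


def recognize_word(analyzed_text, word_patterns=WORD_PATTERNS):
    """Collate the analyzed input with stored word patterns (S1203)."""
    normalized = " ".join(analyzed_text.lower().split())
    return normalized if normalized in word_patterns else None


def control_for(word, word_to_control=WORD_TO_CONTROL):
    """Return the control content registered for a recognized word, if any."""
    return word_to_control.get(word)


def process_utterance(analyzed_text):
    """Return a control signal for the utterance, or None if nothing matches (S1204)."""
    word = recognize_word(analyzed_text)
    if word is None:
        return None
    return control_for(word)
```

Keeping the two stores as separate mappings mirrors the separation between the word pattern storage and the word-control signal correspondence storage, so either one could be moved to another location without changing the lookup logic.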

As described above, according to the present embodiment, a user who cannot reach the video conference apparatus main body 200 or the remote control 405 can operate the video conference apparatus 101. In addition, because the voice recognition process is activated when the user faces the camera 208, or the video conference apparatus main body 200 equipped with the camera 208, at the time an operation is needed, unintended operations caused by remarks made during the video conference can be effectively prevented. Furthermore, providing a means for notifying the user while the voice recognition means 307 is performing the voice recognition process can more effectively prevent malfunctions caused by remarks made during the conference.

The above configuration is merely an example and does not limit the scope of the present invention. For example, the word pattern storage means 1304 and the word-control signal correspondence storage means 1305 of FIG. 13 may reside in the memory 202 of FIG. 2, or in the cloud (for example, on a server on a network such as the Internet). Further, in step S1204 of FIG. 12, the control means 304 of FIG. 3 may control the video conference apparatus 101 instead of the control recognition means 1303 sending a control signal.

[Second Embodiment]

Next, a second embodiment will be described.

In FIG. 4, for example, when the user 402 is talking toward the user 401, the user 402 likely has no intention of addressing the users of the video conference apparatus at the communication destination and is saying something meant only for the user 401. Conventionally, in such a case, even if the user 402 speaks in a low voice, the gain of the microphone 212a is raised automatically, so the conversation may be conveyed unintentionally to the video conference users at the communication destination. The present embodiment addresses this problem.

FIG. 14 shows a flowchart according to the present embodiment. The operations in steps S501 to S505 shown in FIG. 5 are the same as in the first embodiment. The following description focuses on the differences from the first embodiment.

When it is determined in step S503 that the user's face orientation or line-of-sight direction does not match the direction of the imaging means 301, it is determined whether the face orientation or line-of-sight direction is within a predetermined range with respect to the direction of the imaging means 301 (step S1401).

When the face orientation or line-of-sight direction of at least one of the users is not within the predetermined range with respect to the direction of the imaging means 301, it is determined that the user intends the volume of the microphones 212a and 212b to be adjusted, and the volume of the microphones 212a and 212b is adjusted automatically (step S1402).

Specifically, the video conference apparatus 101 adjusts the volume of the audio signal sent to the video conference apparatuses 102 and 103 at the communication destinations, based on the audio input levels from the microphones 212a and 212b.

For example, when the audio input level from the microphone 212a is higher than a predetermined value, the speech is considered to be addressed to the users of the video conference apparatus at the communication destination, so the volume sent to the destination is not adjusted. On the other hand, when an audio signal lower than the predetermined value is input from the microphone 212a, it is determined that the speech is not addressed to the users at the communication destination, and the audio signal sent to the destination is lowered in volume or muted. When there is no audio input from the microphone 212a, it is determined that the user has merely looked in another direction, and the audio sent to the destination need not be adjusted.

The microphone 212b is controlled in the same way. As a result, quiet conversations do not reach the users at the communication destination, while louder remarks addressed to the destination are delivered to those users as usual.
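The per-microphone decision described above can be sketched as a small function. This is a hedged illustration only: the numeric level scale, the `silence_floor` parameter used to represent "no audio input", the handling of a level exactly equal to the threshold, and the names below are assumptions not found in the specification.

```python
from enum import Enum


class OutgoingAudio(Enum):
    UNCHANGED = "unchanged"          # send audio to the destination as usual
    LOWER_OR_MUTE = "lower_or_mute"  # lower the volume or mute (S1402)


def decide_outgoing_audio(within_range, input_level, speech_threshold,
                          silence_floor=0.0):
    """Decide how one microphone's audio is sent to the communication destination.

    within_range: result of step S1401 (face or gaze within the predetermined range).
    input_level, speech_threshold, silence_floor: assumed to share one linear scale.
    """
    if within_range:
        # S1401 yes: processing returns to S501; no volume adjustment is made.
        return OutgoingAudio.UNCHANGED
    if input_level <= silence_floor:
        # No audio input: the user has merely looked in another direction.
        return OutgoingAudio.UNCHANGED
    if input_level >= speech_threshold:
        # Loud speech is treated as addressed to the destination.
        # (Treating the exactly-equal case this way is an assumption.)
        return OutgoingAudio.UNCHANGED
    # Quiet speech while looking away: treat it as a local side conversation.
    return OutgoingAudio.LOWER_OR_MUTE
```

The same function would be called once per microphone, for example once with the level from the microphone 212a and once with the level from the microphone 212b.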

On the other hand, when it is determined in step S1401 that the user's face orientation or line-of-sight direction is within the predetermined range with respect to the direction of the imaging means 301, the process returns to step S501 and continues.

The "predetermined range" in step S1401 can be set arbitrarily according to the requirements of the system or the user. For example, the range in which the estimation means 303 can detect both eyes of the user may be used as the predetermined range. Alternatively, when the estimation means 303 is able to detect the user's face orientation or line-of-sight direction, it may be determined that the face orientation or line-of-sight direction is within the predetermined range. Or, when the estimation means 303 determines that the user has turned sideways beyond a predetermined angle with respect to the camera 208, it may determine that the face orientation or line-of-sight direction is not within the predetermined range.

As described above, according to the present embodiment, when a user is speaking in a low voice, unintended transmission of the conversation to the video conference users at the communication destination can be reduced.

[Other Embodiments]

As the camera 208 that acquires image data of each user, a camera for photographing video conference participants, such as one provided in a conventional video conference apparatus, can be used. However, the camera 208 may also be provided separately from the camera for the video conference participants (for example, the camera 208 in FIG. 14). In this case, the image data acquired by the camera 208 is not displayed at the communication destination, so a fisheye lens or a 360-degree panoramic lens, for example, can be used as the lens of the camera 208.

DESCRIPTION OF SYMBOLS
101 Video conference apparatus
200 Video conference apparatus main body
210 Display device
212, 212a, 212b Microphone
208 Camera
301 Imaging means
302 Analysis means
303 Estimation means
305 Voice input means
307 Voice recognition means

Japanese Patent Laid-Open No. 11-202892

Claims

Imaging means for acquiring image data of one or a plurality of users using the video conference device;
Voice input means for acquiring voice data of the user;
Analyzing means for analyzing at least one of the face direction and the line-of-sight direction of the user from the image data;
An estimation means for estimating an intention of the user to operate the video conference device based on an analysis result of the analysis means,
A video conference apparatus that performs the operation based on an estimation result of the estimation means.

The video conference apparatus according to claim 1, wherein the estimation means estimates that operation of the video conference apparatus is intended when the face orientation or line-of-sight direction of at least one of the users is directed toward the imaging means.

The video conference apparatus according to claim 1, comprising a table showing a relationship between a face position with respect to the imaging means and the face orientation or line-of-sight direction when the imaging means is viewed from that face position, wherein operation of the video conference apparatus is estimated to be intended when the face orientation or line-of-sight direction of at least one of the users is a face orientation or line-of-sight direction included in the table.

The video conference apparatus according to claim 3, comprising a plurality of the tables corresponding to a distance between the imaging unit and the user.

The video conference apparatus according to claim 3, further comprising a plurality of the tables corresponding to an angle between an imaging direction of the imaging unit and the user.

The video conference apparatus according to claim 1, wherein an intention of operating the video conference apparatus is an operation of the video conference apparatus by voice.

The video conference apparatus according to claim 1, wherein the estimation means estimates that volume control of the voice acquired from the voice input means is intended when the face orientation or line-of-sight direction of at least one of the users is not within a predetermined range with respect to the direction of the imaging means.

The video conference apparatus, wherein the volume control of the voice acquired from the voice input means is performed based on the face orientation or line-of-sight direction of at least one of the users and the voice input level from the voice input means.

Acquiring image data of one or more users of the video conference device;
Analyzing at least one of face orientation and line of sight for each of the users from the image data;
Estimating the user's intention to operate the video conference device based on the analysis result of the analysis means;
Performing the operation based on an estimation result of the estimation means;
A control method for a video conference apparatus.

A program for causing a computer to execute:
A procedure for acquiring image data of one or more users of the video conference device;
A procedure for analyzing at least one of the face orientation and line of sight of each of the users from the image data;
A procedure for estimating an intention of each of the users to operate the video conference device based on an analysis result of the analysis means; and
A procedure for performing the operation based on an estimation result of the estimation means.