JP6539624B2

JP6539624B2 - Gaze-matched face image synthesizing method, video conference system, and program

Info

Publication number: JP6539624B2
Application number: JP2016180848A
Authority: JP
Inventors: 柏野 邦夫; 黒住 隆行; 井上 卓弥; 高橋 友和; 平山 高嗣; 川西 康友; 出口 大輔; 井手 一郎; 村瀬 洋
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2016-09-15
Filing date: 2016-09-15
Publication date: 2019-07-03
Anticipated expiration: 2036-09-15
Also published as: JP2018046464A

Description

The present invention relates to a gaze-matched face image synthesis method, a video conference system, and a program.

Conventionally, as a gaze-matched face image synthesis method, an eye-region conversion technique for matching gaze between interlocutors in a video conference is known. It converts the eye region on the premise that the camera position is fixed relative to the display on the user terminal (Non-Patent Document 1).

Takuya Inoue, Tomokazu Takahashi, Takatsugu Hirayama, Daisuke Deguchi, Ichiro Ide, Hiroshi Murase, Takayuki Kurozumi, Kunio Kashino. “テレビ会議における対話者間の視線一致のための目領域変換手法に関する検討” (A study on an eye-region conversion technique for gaze matching between interlocutors in a video conference). IEICE Technical Report (信学技報), No. PRMU2014-60, pp. 33-38, October 2014.

However, the method of Non-Patent Document 1 has the drawback that gaze matching cannot be determined with sufficient accuracy when the camera viewpoint position changes relative to the display on the user terminal.

The present invention has been made in view of the above circumstances. Its object is to provide a gaze-matched face image synthesis method, a video conference system, and a program that can accurately determine gaze matching and generate a gaze-matched image even when the camera viewpoint position relative to the display changes.

To achieve the above object, the gaze-matched face image synthesis method of the first invention is a gaze-matched face image synthesis method in a video conference system, and includes: a feature extraction step of extracting features from a camera input image representing an interlocutor; an eye open/close determination step of determining whether the eyes are open or closed using the features of the camera input image; a camera position and partner gaze determination step of, when the eyes are determined to be open, determining whether the interlocutor represented by the camera input image is gazing at the conversation partner and determining the camera viewpoint position, based on the features of the camera input image and on a classifier that identifies whether the conversation partner is being gazed at and the camera viewpoint position, the classifier having been trained in advance from face images for which whether the conversation partner is being gazed at and the camera viewpoint position relative to the display are known; and a synthesis step of, when it is determined that the conversation partner is being gazed at, compositing an image captured while gazing at the camera into the eye region of the interlocutor represented by the camera input image.

The video conference system of the second invention includes: a feature extraction unit that extracts features from a camera input image representing an interlocutor; an eye open/close determination unit that determines whether the eyes are open or closed using the features of the camera input image; a camera position and partner gaze determination unit that, when the eyes are determined to be open, determines whether the interlocutor represented by the camera input image is gazing at the conversation partner and determines the camera viewpoint position, based on the features of the camera input image and on a classifier that identifies whether the conversation partner is being gazed at and the camera viewpoint position, the classifier having been trained in advance from face images for which whether the conversation partner is being gazed at and the camera viewpoint position relative to the display are known; and a synthesis unit that, when it is determined that the conversation partner is being gazed at, composites an image captured while gazing at the camera into the eye region of the interlocutor represented by the camera input image.

The gaze-matched face image synthesis method of the third invention is a gaze-matched face image synthesis method in a video conference system, and includes: a feature extraction step of extracting features from a camera input image representing an interlocutor; an eye open/close determination step of determining whether the eyes are open or closed using the features of the camera input image; a camera position and on-display gaze point determination step of, when the eyes are determined to be open, determining the gaze point on the interlocutor-side display and the camera viewpoint position, based on the features of the camera input image and on classifiers trained in advance from face images for which the gaze point on the interlocutor-side display and the camera viewpoint position relative to the display are known, each classifier identifying, for one pair of a camera viewpoint position and a position on the interlocutor-side display, whether the interlocutor is gazing at that position with the camera at that viewpoint position; a gaze partner determination step of determining, based on the gaze point on the interlocutor-side display, whether the interlocutor represented by the camera input image is gazing at a conversation partner; and a synthesis step of, when it is determined that the conversation partner is being gazed at, compositing into the eye region of the interlocutor represented by the camera input image an image captured while gazing at a position corresponding to that conversation partner.

The video conference system of the fourth invention includes: a feature extraction unit that extracts features from a camera input image representing an interlocutor; an eye open/close determination unit that determines whether the eyes are open or closed using the features of the camera input image; a camera position and on-display gaze point determination unit that, when the eyes are determined to be open, determines the gaze point on the interlocutor-side display and the camera viewpoint position, based on the features of the camera input image and on classifiers trained in advance from face images for which the gaze point on the interlocutor-side display and the camera viewpoint position relative to the display are known, each classifier identifying, for one pair of a camera viewpoint position and a position on the interlocutor-side display, whether the interlocutor is gazing at that position with the camera at that viewpoint position; a gaze partner determination unit that determines, based on the gaze point on the interlocutor-side display, whether the interlocutor represented by the camera input image is gazing at a conversation partner; and a synthesis unit that, when it is determined that the conversation partner is being gazed at, composites into the eye region of the interlocutor represented by the camera input image an image captured while gazing at a position corresponding to that conversation partner.

The program of the present invention is a program for causing a computer to execute each step of the gaze-matched face image synthesis method of the present invention.

As described above, according to the gaze-matched face image synthesis method, video conference system, and program of the present invention, whether the conversation partner is being gazed at, or the gaze point on the interlocutor-side display, is determined together with the camera viewpoint position relative to the display, based on a classifier trained in advance. An image captured while gazing at the camera, or at a position corresponding to the conversation partner, is then composited into the eye region of the interlocutor represented by the camera input image. As a result, gaze matching can be determined accurately and a gaze-matched image can be generated even when the camera viewpoint position relative to the display changes.

Brief Description of the Drawings
A block diagram showing a configuration example of the video conference system according to the first embodiment of the present invention.
A diagram for explaining the camera viewpoint position and the window arrangement.
A diagram for explaining the camera viewpoint position and the window arrangement.
A diagram for explaining the camera viewpoint position and the window arrangement.
A diagram for explaining the camera viewpoint position and the window arrangement.
A diagram for explaining the method of extracting an eye region image.
A diagram showing an example of an eye region image.
A diagram showing the arrangement of the points displayed on the display when capturing the learning camera input images.
A flowchart showing the learning processing routine in the video conference system according to the first embodiment of the present invention.
A flowchart showing the gaze-matched face image synthesis processing routine in the video conference system according to the first embodiment of the present invention.
A block diagram showing a configuration example of the video conference system according to the second embodiment of the present invention.
A diagram for explaining the arrangement of the camera and the windows.
A flowchart showing the gaze-matched face image synthesis processing routine in the video conference system according to the second embodiment of the present invention.
A block diagram showing a configuration example of the video conference system according to the third embodiment of the present invention.
A diagram for explaining the arrangement of the camera and the windows.
A flowchart showing the image-transmission-side gaze-matched face image synthesis processing routine in the video conference system according to the third embodiment of the present invention.
A flowchart showing the image-reception-side gaze-matched face image synthesis processing routine in the video conference system according to the third embodiment of the present invention.

<Overview of the embodiments of the present invention>
First, an overview of the embodiments of the present invention is given. The embodiments generate gaze-matched face images in a video conference system. In a video conference among multiple interlocutors, even when the windows are placed at various locations on the display, the embodiments show the conference participants, in an easily understood way, who is talking to whom.

Compared with the eye-region conversion technique for gaze matching between interlocutors in a video conference described in Non-Patent Document 1, the main point of the embodiments is that they can cope with changes in the orientation of the user terminal by determining the camera position relative to the display on the user terminal. In particular, when the display can be rotated depending on how the device is held, as with a tablet terminal, the embodiments make it possible to identify the camera viewpoint position and the gaze target even when the camera viewpoint position changes relative to the display on the user terminal.

<First Embodiment>
<Configuration of the video conference system>
Hereinafter, the first embodiment of the present invention is described in detail with reference to the drawings. FIG. 1 is a block diagram showing a video conference system 100 according to the first embodiment of the present invention. The video conference system 100 is configured using a computer that includes a CPU, a RAM, and a ROM storing programs for executing a learning processing routine and a gaze-matched face image synthesis processing routine. Functionally, it is configured as follows.

As shown in FIG. 1, the video conference system 100 of this embodiment includes a camera 10, a computing unit 20, a communication unit 80, and a display 90.

As shown in FIGS. 2A to 2D, the camera 10 captures the interlocutor's face from one of a plurality of viewpoint positions relative to the display 90 and outputs the camera input image to the computing unit 20. The display 90 shows a window 92 displaying an image of the conversation partner, and the interlocutor converses while looking at the window 92.

The computing unit 20 includes a learning image DB 22, eye region image extraction units 24 and 32, feature extraction units 26 and 34, a classifier construction unit 28, a classifier storage unit 30, an eye open/close determination unit 36, a camera position and partner gaze determination unit 38, a gaze image storage unit 40, a synthesis unit 42, and a screen output unit 44.

The learning image DB 22 stores, for each viewpoint position of the camera 10 relative to the display 90 and each position on the display 90, a learning camera input image captured by the camera 10 at that viewpoint position while the interlocutor gazes at a single point placed at that position. The display range of the window 92 on the display 90 is determined in advance. Each learning camera input image is given, in advance, information indicating whether the interlocutor is gazing at the conversation partner, according to whether its gaze point lies within the display range of the window 92, together with information indicating the viewpoint position of the camera 10. For example, the interlocutor keeps the face fixed and moves only the eyes; points such as those shown in FIG. 5 are displayed on the display one at a time, and the interlocutor gazes at them in turn. Among the captured camera input images, those in which the interlocutor gazes at a point inside the dotted rectangle are labeled Positive, and the rest are labeled Negative.
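The labeling rule described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `Rect` type, the `label_sample` function, the pixel coordinates, and the viewpoint name `"top"` are all hypothetical, and the rectangle is assumed to be half-open on its right and bottom edges.

```python
from typing import NamedTuple, Tuple


class Rect(NamedTuple):
    """Display range of the window, in display pixels (hypothetical layout)."""
    left: int
    top: int
    width: int
    height: int

    def contains(self, x: int, y: int) -> bool:
        # Half-open on the right and bottom edges (an assumption of this sketch).
        return (self.left <= x < self.left + self.width
                and self.top <= y < self.top + self.height)


def label_sample(gaze_point: Tuple[int, int], window: Rect,
                 camera_position: str) -> Tuple[str, str]:
    """Attach the gaze label and the camera viewpoint to one learning image."""
    x, y = gaze_point
    gaze_label = "Positive" if window.contains(x, y) else "Negative"
    return gaze_label, camera_position


window = Rect(left=100, top=50, width=200, height=150)
samples = [label_sample(p, window, "top")
           for p in [(150, 100), (10, 10), (300, 100)]]
```

Only the first point falls inside the window rectangle, so only the first sample is labeled Positive.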

The eye region image extraction unit 24 extracts an eye region image from each learning camera input image as follows.

For example, four feature points are extracted from the contour of each of the left and right eye regions, as shown in FIG. 3. Next, each eye region is cropped using the bounding rectangle of its four feature points, and the left and right eyes are joined to form the eye region image. FIG. 4 shows examples of extracted eye region images. At this stage, perturbations such as translation and scaling are applied to the extracted feature points to increase the number of eye region image patterns. This processing is performed on all learning camera input images, and all eye region images are then resized to the median size of the bounding rectangles of all eye region images.
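The geometric part of this step can be sketched as below. The sketch covers only the bounding rectangle, the perturbation of feature points, and the median target size; cropping, joining the two eyes, and resizing are omitted. The function names, the perturbation ranges (`max_shift`, `max_scale`), and the sample coordinates are assumptions made for illustration, since the source does not specify them.

```python
import random
from statistics import median


def bounding_rect(points):
    """Return (left, top, right, bottom) of the rectangle enclosing the points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def perturb(points, max_shift=2.0, max_scale=0.05, rng=random):
    """Apply one random translation and one scaling about the centroid."""
    dx = rng.uniform(-max_shift, max_shift)
    dy = rng.uniform(-max_shift, max_shift)
    scale = 1.0 + rng.uniform(-max_scale, max_scale)
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return [(cx + (x - cx) * scale + dx, cy + (y - cy) * scale + dy)
            for x, y in points]


def median_size(rects):
    """Median width and height over all bounding rectangles."""
    widths = [right - left for left, _, right, _ in rects]
    heights = [bottom - top for _, top, _, bottom in rects]
    return median(widths), median(heights)


# Four contour points of one eye (hypothetical pixel coordinates).
left_eye = [(30, 40), (50, 40), (40, 35), (40, 45)]
rng = random.Random(0)
augmented_rects = [bounding_rect(perturb(left_eye, rng=rng)) for _ in range(5)]
target_width, target_height = median_size(augmented_rects)
```

Passing a seeded `random.Random` instance keeps the augmentation reproducible, which is convenient when comparing classifiers trained on the same perturbed set.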

The feature extraction unit 26 extracts features from the eye region image extracted from each learning camera input image as follows.

For example, edges are detected in the extracted eye region image with the Canny detector, and the presence or absence of an edge is used as a feature. Next, the eye region image is converted to grayscale and histogram equalization is applied. The luminance values are then concatenated into a feature. Finally, the luminance feature and the edge feature are concatenated to form the final feature vector.
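A rough sketch of the feature assembly follows. It assumes the binary edge map has already been computed elsewhere (for example by a Canny detector) and is passed in as a list of rows; the edge detection itself is not implemented here. The function names, the scaling of luminance to the range 0 to 1, and the ordering of luminance before edges are assumptions of this sketch, not details given in the source.

```python
def equalize_histogram(gray):
    """Histogram-equalize an 8-bit grayscale image given as a list of rows."""
    pixels = [v for row in gray for v in row]
    hist = [0] * 256
    for v in pixels:
        hist[v] += 1
    # Cumulative distribution of intensity values.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    n = len(pixels)
    if n == cdf_min:
        # Constant image: nothing to spread out.
        return [row[:] for row in gray]
    lut = [round(max(c - cdf_min, 0) / (n - cdf_min) * 255) for c in cdf]
    return [[lut[v] for v in row] for row in gray]


def build_feature(gray, edges):
    """Concatenate equalized luminance values with binary edge flags."""
    equalized = equalize_histogram(gray)
    luminance = [v / 255.0 for row in equalized for v in row]
    edge_flags = [1.0 if e else 0.0 for row in edges for e in row]
    return luminance + edge_flags


gray = [[10, 20], [30, 40]]
edges = [[0, 1], [1, 0]]  # assumed output of an edge detector such as Canny
feature = build_feature(gray, edges)
```

For an eye region image of width W and height H, the resulting vector has 2 × W × H elements, which is why all eye region images need a common size before feature extraction.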

The classifier construction unit 28 constructs a classifier for identifying the pair consisting of whether the interlocutor is gazing at the conversation partner and the camera viewpoint position. It does so based on the features extracted from each learning camera input image and on the information attached to each learning camera input image, namely whether the interlocutor is gazing at the conversation partner and the viewpoint position of the camera 10. In this embodiment, an SVM (Support Vector Machine) classifier, which is generally regarded as performing well on two-class classification problems, is used as the classifier.

The class labels identified here are the pairs of whether the interlocutor is gazing at the conversation partner and the viewpoint position of the camera 10. For example, with four window positions on the display (upper left, upper right, lower left, and lower right) and the two cases of whether or not the interlocutor is gazing at the conversation partner, the classifier is trained using 4 × 2 = 8 labels in total.
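One way to enumerate the eight labels in this example is sketched below. The identifier names and the integer encoding are hypothetical; the source only states that 4 × 2 = 8 labels are used.

```python
from itertools import product

# The four positions from the example in the text, and the two gaze states.
POSITIONS = ("upper_left", "upper_right", "lower_left", "lower_right")
GAZE_STATES = (True, False)  # gazing at the conversation partner or not

# Map each (position, gazing) pair to an integer class id.
LABELS = {pair: idx for idx, pair in enumerate(product(POSITIONS, GAZE_STATES))}
ID_TO_PAIR = {idx: pair for pair, idx in LABELS.items()}


def decode(label_id):
    """Recover the (position, gazing) pair from a predicted class id."""
    return ID_TO_PAIR[label_id]


class_id = LABELS[("lower_left", True)]
position, gazing = decode(class_id)
```

Encoding the pair as a single class lets one multi-class classifier return both pieces of information from a single prediction.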

The classifier storage unit 30 stores the classifier trained by the classifier construction unit 28.

The eye region image extraction unit 32 extracts an eye region image from the camera input image in the same way as the eye region image extraction unit 24.

The feature extraction unit 34 extracts features from the eye region image extracted from the camera input image in the same way as the feature extraction unit 26.

The eye open/close determination unit 36 calculates the degree of eye opening based on the features extracted from the camera input image and on features extracted in advance from reference images whose degree of eye opening is known. It determines that the eyes are open when the degree of eye opening is greater than a threshold.

When the eyes are determined to be open, the camera position and partner gaze determination unit 38 determines whether the interlocutor is gazing at the conversation partner, and determines the viewpoint position of the camera 10, based on the classifier stored in the classifier storage unit 30 and the features extracted from the camera input image.

The gaze image storage unit 40 stores an eye image extracted from a camera input image captured in advance while the interlocutor was gazing at the camera 10. An eye image may also be prepared for each viewpoint position of the camera 10, extracted from a camera input image captured while the interlocutor was gazing at the camera 10 at that viewpoint position.

When it is determined that the interlocutor is gazing at the conversation partner, the synthesis unit 42 generates a composite image by compositing the eye image stored in the gaze image storage unit 40 into the eye region of the camera input image, and transmits the composite image via the communication unit 80 to the video conference system on the conversation partner's side (not shown). When an eye image is prepared for each viewpoint position of the camera 10, the eye image corresponding to the viewpoint position of the camera 10 determined by the camera position and partner gaze determination unit 38 is composited.
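The compositing step might look like the sketch below, which simply overwrites the eye region with a stored patch. The source does not say how the eye image is blended into the frame, so the plain overwrite, the function name, and the use of nested lists for images are all assumptions made for illustration.

```python
def composite_eye_region(frame, eye_patch, top, left):
    """Return a copy of frame with eye_patch pasted at row top, column left."""
    height, width = len(eye_patch), len(eye_patch[0])
    if (top < 0 or left < 0
            or top + height > len(frame) or left + width > len(frame[0])):
        raise ValueError("eye patch does not fit inside the frame")
    result = [row[:] for row in frame]  # leave the input frame untouched
    for dy, patch_row in enumerate(eye_patch):
        result[top + dy][left:left + width] = patch_row
    return result


frame = [[0] * 4 for _ in range(4)]
eye_patch = [[1, 2], [3, 4]]  # stands in for the stored gazing-at-camera eye image
composite = composite_eye_region(frame, eye_patch, top=1, left=1)
```

When one eye image is stored per camera viewpoint, the patch passed in would be selected by the viewpoint position returned by the classifier.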

The screen output unit 44 displays the image received via the communication unit 80 from the video conference system on the conversation partner's side in the window 92 of the display 90.

The video conference system on the conversation partner's side has the same configuration as the video conference system 100 on the interlocutor's side, and the two systems are connected via a network such as the Internet.

<Operation of the video conference system>
Next, the operation of the video conference system 100 according to the first embodiment is described. First, as preprocessing, for each viewpoint position of the camera 10 and each position on the display 90, a learning camera input image is captured by the camera 10 at that viewpoint position while the interlocutor gazes at a single point placed at that position. Each learning camera input image is stored in the learning image DB 22 together with information indicating whether the interlocutor is gazing at the conversation partner and information indicating the viewpoint position of the camera 10.

The video conference system 100 then executes the learning processing routine shown in FIG. 6.

First, in step S100, each learning camera input image is acquired from the learning image DB 22. In step S102, an eye region image is extracted from each learning camera input image.

In step S104, features are extracted from the eye region image of each learning camera input image. In step S106, a classifier is constructed based on the features extracted for each learning camera input image in step S104 and on the information attached to each learning camera input image, namely whether the interlocutor is gazing at the conversation partner and the viewpoint position of the camera 10. The classifier is stored in the classifier storage unit 30, and the learning processing routine ends.

Once a video conference with the conversation partner has started and the image transmitted from the partner's video conference system is displayed in the window 92 of the display 90, the video conference system 100 executes the gaze-matched face image synthesis processing routine shown in FIG. 7 each time a camera input image is input from the camera 10.

First, in step S110, an eye region image is extracted from the camera input image. In step S112, features are extracted from the eye region image.

In step S114, whether the eyes are open is determined based on the features extracted in step S112. If the eyes are determined to be open, the process proceeds to step S115. If the eyes are determined not to be open, the process proceeds to step S122.

In step S115, the pair consisting of whether the interlocutor is gazing at the conversation partner and the viewpoint position of the camera 10 is determined based on the features extracted in step S112 and the classifier stored in the classifier storage unit 30. In step S116, whether the interlocutor is gazing at the conversation partner is determined from the result of step S115. If the interlocutor is determined to be gazing at the conversation partner, the process proceeds to step S118. Otherwise, the process proceeds to step S122.

In step S118, a composite image is generated by compositing the eye image stored in the gaze image storage unit 40 into the eye region of the camera input image. In step S120, the composite image generated in step S118 is transmitted via the communication unit 80 to the video conference system on the conversation partner's side, and the gaze-matched face image synthesis processing routine ends.

In step S122, the camera input image is transmitted unchanged via the communication unit 80 to the video conference system on the conversation partner's side, and the gaze-matched face image synthesis processing routine ends.
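The control flow of steps S110 to S122 can be summarized in a short sketch. Each processing stage is passed in as a callable, so the sketch shows only the branching; the function name `process_frame`, its parameters, and the stand-in lambdas in the usage lines are hypothetical.

```python
def process_frame(frame, extract_eye_region, extract_features, eyes_open,
                  classify, composite, send):
    """Run one pass of the per-frame routine and return what was sent."""
    eye_region = extract_eye_region(frame)        # S110
    features = extract_features(eye_region)       # S112
    if not eyes_open(features):                   # S114: eyes closed
        send(frame)                               # S122: send frame unchanged
        return frame
    gazing, camera_position = classify(features)  # S115
    if not gazing:                                # S116: not gazing at partner
        send(frame)                               # S122: send frame unchanged
        return frame
    output = composite(frame, camera_position)    # S118
    send(output)                                  # S120
    return output


# Usage with trivial stand-ins for each stage.
sent = []
result_gazing = process_frame(
    "raw", lambda f: f, lambda r: r, lambda x: True,
    lambda x: (True, "top"), lambda f, c: f + "+eyes@" + c, sent.append)
result_closed = process_frame(
    "raw", lambda f: f, lambda r: r, lambda x: False,
    lambda x: (True, "top"), lambda f, c: f + "+eyes@" + c, sent.append)
```

Both early exits fall through to the same unmodified send, which mirrors the two branches into step S122 in the flowchart description.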

As described above, the video conference system according to the first embodiment determines, based on a classifier trained in advance, whether the interlocutor is gazing at the conversation partner and the camera viewpoint position relative to the display, and composites an image captured while gazing at the camera into the eye region of the interlocutor represented by the camera input image. As a result, gaze matching can be determined accurately and a gaze-matched image can be generated even when the camera viewpoint position relative to the display changes.

<Second Embodiment>
<Configuration of the video conference system>
Next, a video conference system according to the second embodiment is described. Parts configured in the same way as in the first embodiment are given the same reference numerals, and their description is omitted.

The second embodiment differs from the first embodiment in that a plurality of windows representing a plurality of conversation partners are displayed on the display, and in that the conversation partner at whom the interlocutor is gazing is determined before the eye region image is composited.

As shown in FIG. 8, the computing unit 220 of the video conference system 200 according to the second embodiment includes a learning image DB 22, eye region image extraction units 24 and 32, feature extraction units 26 and 34, a classifier construction unit 228, a classifier storage unit 230, an eye open/close determination unit 36, a camera position and on-display gaze point determination unit 238, a gaze partner determination unit 239, a gaze image storage unit 240, a gaze matching condition determination unit 241, a synthesis unit 242, and a screen output unit 44.

As shown in FIG. 9, the camera 10 captures the interlocutor's face from one of a plurality of viewpoint positions relative to the display 90 and outputs the camera input image to the computing unit 220. The display 90 shows a plurality of windows 92, each displaying the image of one of a plurality of conversation partners, and the interlocutor converses while viewing the windows 92.

For each pair of a viewpoint position of the camera 10 relative to the display 90 and a position on the display 90, the classifier construction unit 228 constructs a classifier based on the features extracted from each of the learning camera input images. Each classifier identifies whether the interlocutor is gazing at that position on the display 90 and the camera 10 is at that viewpoint position.

The class labels to be identified are the pairs of a position on the display 90 and a viewpoint position of the camera 10. For example, the camera 10 has four viewpoint positions (the top, bottom, left, and right of the display 90), and the positions on the display 90 are N positions sampled in the vertical and horizontal directions at a preset skip width. In total, 4 × N = 4N labels are used to train the classifiers. The boundary that separates learning camera input images in which the interlocutor is gazing at a position from those in which the interlocutor is not is a rectangle centered on that position.
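The label construction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the display size, the skip width, the rectangle half-size, and all function names are assumptions introduced here.

```python
from itertools import product


def grid_positions(width, height, skip):
    """Sample display positions every `skip` pixels horizontally and vertically."""
    return [(x, y) for y in range(0, height, skip) for x in range(0, width, skip)]


def build_labels(camera_viewpoints, display_positions):
    """One class label per (camera viewpoint, display position) pair."""
    return list(product(camera_viewpoints, display_positions))


def in_gaze_rectangle(point, center, half_w, half_h):
    """True if `point` lies inside the rectangle centered on `center`."""
    return abs(point[0] - center[0]) <= half_w and abs(point[1] - center[1]) <= half_h


# Assumed values: a 1920x1080 display sampled with a 480-pixel skip width.
viewpoints = ["top", "bottom", "left", "right"]
positions = grid_positions(width=1920, height=1080, skip=480)
labels = build_labels(viewpoints, positions)  # 4 * N labels, N = len(positions)
```

With these assumed values N is 12, so 48 labels result. A learning image would count as a positive example for a position when the gazed point falls inside that position's rectangle.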

The classifier storage unit 230 stores the classifiers learned by the classifier construction unit 228 for each pair of a viewpoint position of the camera 10 and a position on the display 90.

When the eyes are determined to be open, the camera-position and on-display gaze-point determination unit 238 determines, for each pair of a viewpoint position of the camera 10 and a position on the display 90, whether the interlocutor is gazing at that position on the display 90 and the camera 10 is at that viewpoint position. The determination is based on the classifier stored for that pair in the classifier storage unit 230 and on the features extracted from the camera input image.

The gaze partner determination unit 239 determines which conversation partner the interlocutor is gazing at, based on the determination results produced by the camera-position and on-display gaze-point determination unit 238 for each pair of a viewpoint position of the camera 10 and a position on the display 90. For example, each position on the display 90 that the interlocutor is determined to be gazing at casts a vote for the window 92 containing that position, and the conversation partner shown in the window 92 with the most votes is determined to be the conversation partner the interlocutor is gazing at. If several windows 92 tie for the most votes, the interlocutor may be judged to be gazing at the mean of the center positions of those windows 92.
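The voting example above can be sketched as follows. This is a minimal sketch under stated assumptions: windows are given as hypothetical `(left, top, width, height)` rectangles keyed by a window ID, and the function names are introduced here for illustration only.

```python
from collections import Counter


def window_containing(point, windows):
    """Return the ID of the window whose rectangle contains `point`, or None."""
    for window_id, (left, top, width, height) in windows.items():
        if left <= point[0] < left + width and top <= point[1] < top + height:
            return window_id
    return None


def most_voted_windows(gazed_positions, windows):
    """Each gazed display position votes for the window that contains it.

    Returns the sorted IDs of all windows tied for the most votes
    (an empty list when no gazed position lies inside any window).
    """
    votes = Counter()
    for point in gazed_positions:
        window_id = window_containing(point, windows)
        if window_id is not None:
            votes[window_id] += 1
    if not votes:
        return []
    best = max(votes.values())
    return sorted(w for w, n in votes.items() if n == best)


def mean_center(window_ids, windows):
    """Mean of the center positions of the given windows (used on a tie)."""
    centers = [(l + w / 2, t + h / 2) for l, t, w, h in (windows[i] for i in window_ids)]
    n = len(centers)
    return (sum(c[0] for c in centers) / n, sum(c[1] for c in centers) / n)
```

A single returned ID identifies the gazed conversation partner directly. When several IDs are returned, `mean_center` gives the averaged gaze position mentioned for the tie case.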

The gaze image storage unit 240 stores images of the eyes extracted from camera input images captured in advance while the interlocutor gazed at each of a plurality of synthesis gaze points 290 corresponding to the plurality of windows 92 (see FIG. 9).

When the interlocutor is determined to be gazing at a conversation partner, the synthesis unit 242 generates a composite image by compositing, into the eye region of the camera input image, the eye image stored in the gaze image storage unit 240 for the conversation partner being gazed at. It then transmits the composite image to the video conference system on the image receiving side via the communication unit 80.

The video conference system on the image receiving side has the same configuration as the interlocutor-side video conference system 200, and the two systems are connected via a network such as the Internet.

<Operation of the Video Conference System>
Next, the operation of the video conference system 200 according to the second embodiment will be described. First, as preprocessing, for each viewpoint position of the camera 10 and each position on the display 90, a learning camera input image is captured by the camera 10 at that viewpoint position while the interlocutor gazes at a single point placed at that position, and the image is stored in the learning image DB 22.

The video conference system 200 then executes the learning processing routine in the same way as in the first embodiment. In the second embodiment, however, a classifier is constructed for each pair of a viewpoint position of the camera 10 and a position on the display 90, based on the features extracted from each of the learning camera input images. Each classifier identifies whether the interlocutor is gazing at that position on the display 90 and the camera 10 is at that viewpoint position.

Once a video conference with the conversation partners has started and the images transmitted from the conversation partners' video conference systems are displayed in the respective windows 92 of the display 90, the video conference system 200 executes the gaze-matched face image synthesis processing routine shown in FIG. 10 each time a camera input image is input from the camera 10. Processing that is the same as in the first embodiment is given the same reference numerals, and its detailed description is omitted.

In step S114, whether the eyes are open is determined based on the features extracted in step S112. If the eyes are determined to be open, the routine proceeds to step S200; otherwise, it proceeds to step S122.

In step S200, based on the features extracted in step S112 and the classifiers stored in the classifier storage unit 230, it is determined, for each pair of a viewpoint position of the camera 10 and a position on the display 90, whether the interlocutor is gazing at that position and the camera 10 is at that viewpoint position. Then, in step S202, the conversation partner the interlocutor is gazing at is determined based on the determination results of step S200.

In step S204, whether there is a conversation partner the interlocutor is gazing at is determined based on the determination result of step S202. If there is, the routine proceeds to step S206; if not, it proceeds to step S122.

In step S206, based on the determination result of step S202, a composite image is generated by compositing, into the eye region of the camera input image, the eye image stored in the gaze image storage unit 240 for the conversation partner being gazed at. In step S120, the composite image generated in step S206 is transmitted to the video conference system on the image receiving side via the communication unit 80, and the gaze-matched face image synthesis processing routine ends.

In step S122, the camera input image is transmitted unmodified to the video conference system on the image receiving side via the communication unit 80, and the gaze-matched face image synthesis processing routine ends.
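The branching of the routine in FIG. 10 (steps S114, S204, S206, and S122) can be sketched as follows. This is only an illustration of the control flow: the function names, the dictionary-based frame, and the `paste_eyes` stand-in for the actual image compositing are assumptions, and the classifier evaluation of steps S200 and S202 is represented by an already computed `gazed_partner` value.

```python
def synthesize_frame(frame, eyes_open, gazed_partner, gaze_images, composite_eyes):
    """Return the frame to transmit for one camera input image.

    Hypothetical inputs: `eyes_open` stands for the result of step S114,
    `gazed_partner` for the result of steps S200 and S202 (None when no
    conversation partner is gazed at), and `composite_eyes` for the
    compositing operation of step S206.
    """
    if not eyes_open:                  # S114: eyes are not open
        return frame                   # S122: send the input image unmodified
    if gazed_partner is None:          # S204: no gazed conversation partner
        return frame                   # S122
    eye_image = gaze_images[gazed_partner]
    return composite_eyes(frame, eye_image)  # S206; the result is sent in S120


def paste_eyes(frame, eye_image):
    """Toy stand-in for compositing: replace the frame's eye region."""
    return {**frame, "eyes": eye_image}


# Assumed toy data: one stored eye image per conversation partner.
gaze_images = {"partner_1": "eyes_toward_window_1", "partner_2": "eyes_toward_window_2"}
frame = {"face": "raw", "eyes": "raw"}
```

Keeping the two unmodified-frame branches explicit mirrors the two paths into step S122 described in the routine.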

As described above, the video conference system according to the second embodiment uses classifiers learned in advance for each pair of a camera viewpoint position relative to the display and a position on the display. With them it determines the gaze point on the interlocutor-side display and the camera viewpoint position, and from these determines the conversation partner being gazed at. It then composites, into the eye region of the interlocutor in the camera input image, an image captured while the interlocutor gazed at the position corresponding to that conversation partner. As a result, even when the camera viewpoint position relative to the display changes, gaze matching can be determined accurately and an image with matched gaze can be generated.

Furthermore, in a video conference among multiple participants, even when a plurality of windows are arranged on the display, the system can show the participants clearly who is addressing whom.

<Third Embodiment>
<Configuration of the Video Conference System>
Next, a video conference system according to the third embodiment will be described. Parts configured in the same way as in the first and second embodiments are given the same reference numerals, and their description is omitted.

The third embodiment differs from the second embodiment in that the image is composited in the video conference system on the image receiving side.

As shown in FIG. 11, in the third embodiment, a video conference system 300 on the image transmitting side and a video conference system 400 on the image receiving side are connected via a network 500.

As shown in FIG. 12, in both the video conference system 300 on the image transmitting side and the video conference system 400 on the image receiving side, the display 90 shows a plurality of windows 92 displaying images of a plurality of conversation partners, and the interlocutor converses while viewing the windows 92. In this embodiment, the arrangement of the windows 92 on the display 90 can be changed independently in the video conference system 300 and in the video conference system 400. The following description takes as an example the case where the arrangement of the windows 92 on the display 90 differs between the video conference system 300 on the image transmitting side and the video conference system 400 on the image receiving side.

The computing unit 320 of the video conference system 300 on the image transmitting side includes the learning image DB 22, the eye region image extraction units 24 and 32, the feature extraction units 26 and 34, the classifier construction unit 228, the classifier storage unit 230, the eye open/close determination unit 36, the camera-position and on-display gaze-point determination unit 238, the gaze partner determination unit 239, and the screen output unit 44.

When the eye open/close determination unit 36 determines that the eyes are not open, the communication unit 80 of the video conference system 300 on the image transmitting side transmits the determination result of the eye open/close determination unit 36 and the camera input image to the video conference system 400 on the image receiving side. When the eye open/close determination unit 36 determines that the eyes are open, the communication unit 80 transmits the determination result of the eye open/close determination unit 36, the determination result of the gaze partner determination unit 239, and the camera input image to the video conference system 400 on the image receiving side.

The computing unit 420 of the video conference system 400 on the image receiving side includes an eye open/close determination unit 422, a gaze target calculation unit 424, the gaze image storage unit 240, and the synthesis unit 242.

The eye open/close determination unit 422 determines whether the eyes of the interlocutor on the image transmitting side are open, based on the eye open/close determination result received by the communication unit 80.

When the eyes of the interlocutor on the image transmitting side are determined to be open, the gaze target calculation unit 424 determines whether that interlocutor is gazing at a conversation partner, based on the information on the gazed conversation partner received by the communication unit 80. When the interlocutor on the image transmitting side is determined to be gazing at a conversation partner, the gaze target calculation unit 424 further calculates, from the same received information, which window 92 on the display 90 shows the person being gazed at. For example, among the windows 92 on the display 90, it identifies the window 92 corresponding to the received information on the gazed conversation partner, and takes the conversation partner shown in the identified window 92 as the conversation partner that the interlocutor on the image transmitting side is gazing at.

The gaze image storage unit 240 stores images of the eyes extracted from camera input images captured in advance while the interlocutor gazed at each of a plurality of synthesis gaze points 290 corresponding to the plurality of windows 92 as shown in FIG. 12.

When the interlocutor on the image transmitting side is determined to be gazing at a conversation partner, the synthesis unit 242 uses the calculation result of the gaze target calculation unit 424 to generate a composite image. Specifically, it composites, into the eye region of the camera input image received from the video conference system 300 on the image transmitting side, the eye image stored in the gaze image storage unit 240 for the conversation partner being gazed at, and displays the composite image in the corresponding window 92 of the display 90.

The video conference system 300 on the image transmitting side additionally has the same configuration as the video conference system 400 on the image receiving side, and the video conference system 400 additionally has the same configuration as the video conference system 300; these are omitted from the drawings for simplicity.

<Operation of the Video Conference System>
Next, the operation of the video conference system 300 on the image transmitting side according to the third embodiment will be described. First, as preprocessing, for each viewpoint position of the camera 10 relative to the display 90 and each position on the display 90, a learning camera input image is captured by the camera 10 at that viewpoint position while the interlocutor gazes at a single point placed at that position, and the image is stored in the learning image DB 22.

The video conference system 300 on the image transmitting side then executes the learning processing routine in the same way as in the second embodiment.

Once a video conference with the conversation partners has started and the images transmitted from each of the video conference systems 400 on the image receiving side are displayed in the windows 92 of the display 90, the video conference system 300 executes the image-transmitting-side gaze-matched face image synthesis processing routine shown in FIG. 13 each time a camera input image is input from the camera 10. Processing that is the same as in the first and second embodiments is given the same reference numerals, and its detailed description is omitted.

In step S114, whether the eyes are open is determined based on the features extracted in step S112. If the eyes are determined to be open, the routine proceeds to step S200; otherwise, it proceeds to step S300.

In step S302, the determination result of step S114, the determination result of step S202, and the camera input image are transmitted to the video conference system 400 on the image receiving side via the communication unit 80, and the image-transmitting-side gaze-matched face image synthesis processing routine ends.

In step S300, on the other hand, the determination result of step S114 and the camera input image are transmitted to the video conference system 400 on the image receiving side via the communication unit 80, and the image-transmitting-side gaze-matched face image synthesis processing routine ends.
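The two transmission cases (steps S300 and S302) can be sketched as a single message type. The specification does not define a wire format, so the `GazeMessage` class, its field names, and `build_message` are hypothetical names used only to illustrate what is sent in each case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GazeMessage:
    """Hypothetical payload sent from system 300 to system 400."""
    eyes_open: bool               # determination result of step S114
    gazed_partner: Optional[str]  # determination result of step S202, if sent
    frame: bytes                  # the unmodified camera input image


def build_message(frame, eyes_open, gazed_partner):
    """Assemble the payload for one camera input image."""
    if not eyes_open:
        # S300: only the eye open/close result and the image are sent.
        return GazeMessage(eyes_open=False, gazed_partner=None, frame=frame)
    # S302: the gaze partner result is sent as well (None if no partner is gazed at).
    return GazeMessage(eyes_open=True, gazed_partner=gazed_partner, frame=frame)
```

In both cases the camera input image itself is sent unmodified, since compositing is deferred to the receiving side in this embodiment.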

Each time the determination results and the camera input image are received from the video conference system 300 on the image transmitting side, the video conference system 400 executes the image-receiving-side gaze-matched face image synthesis processing routine shown in FIG. 14.

First, in step S400, whether the eyes of the interlocutor on the image transmitting side are open is determined based on the determination result received from the video conference system 300 on the image transmitting side. If the eyes are determined to be open, the routine proceeds to step S402; otherwise, it proceeds to step S404.

In step S402, whether there is a conversation partner that the interlocutor on the image transmitting side is gazing at is determined based on the determination result received from the video conference system 300 on the image transmitting side. If there is, the routine proceeds to step S406; if not, it proceeds to step S404.

In step S406, based on the determination result received from the video conference system 300 on the image transmitting side, it is calculated which window 92 on the display 90 shows the person that the interlocutor on the image transmitting side is gazing at.

In step S408, based on the calculation result of step S406, a composite image is generated by compositing, into the eye region of the camera input image received from the video conference system 300 on the image transmitting side, the eye image stored in the gaze image storage unit 240 for the conversation partner that the interlocutor on the image transmitting side is gazing at. In step S410, the composite image generated in step S408 is displayed in the corresponding window 92 of the display 90, and the image-receiving-side gaze-matched face image synthesis processing routine ends.

In step S404, the camera input image received from the video conference system 300 on the image transmitting side is displayed unmodified in the corresponding window 92 of the display 90, and the image-receiving-side gaze-matched face image synthesis processing routine ends.
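The receiving-side routine of FIG. 14 (steps S400 through S410) can be sketched as follows. This is a hedged illustration of the control flow only: the dictionary message layout, the `window_of_partner` lookup table for the local window arrangement, and the `paste_eyes` stand-in for image compositing are all assumptions introduced here.

```python
def render_received_frame(message, window_of_partner, gaze_images, composite_eyes):
    """Return the image to display for one received message.

    `message` is a hypothetical dict with keys "eyes_open", "gazed_partner",
    and "frame". `window_of_partner` maps a partner ID to the local window ID
    on the receiving-side display, whose layout may differ from the sender's.
    `gaze_images` holds one stored eye image per local window.
    """
    frame = message["frame"]
    if not message["eyes_open"]:                  # S400: eyes are not open
        return frame                              # S404: display unmodified
    partner = message.get("gazed_partner")
    if partner is None:                           # S402: no gazed partner
        return frame                              # S404
    window_id = window_of_partner[partner]        # S406: locate the local window
    return composite_eyes(frame, gaze_images[window_id])  # S408, shown in S410


def paste_eyes(frame, eye_image):
    """Toy stand-in for compositing: replace the frame's eye region."""
    return {**frame, "eyes": eye_image}


# Assumed toy data for a receiving side with two local windows.
window_of_partner = {"partner_1": "window_left", "partner_2": "window_right"}
gaze_images = {"window_left": "eyes_looking_left", "window_right": "eyes_looking_right"}
```

Resolving the gazed partner to a local window on the receiving side is what allows each system to use its own window arrangement.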

As described above, the video conference system according to the third embodiment uses classifiers learned in advance for each pair of a camera viewpoint position relative to the display and a position on the display. With them it determines the gaze point on the image-transmitting-side display and the camera viewpoint position, and from these determines the conversation partner being gazed at. On the image receiving side, it composites, into the eye region of the interlocutor in the camera input image, an image captured while gazing at the position corresponding to that conversation partner. As a result, even when the camera viewpoint position relative to the display changes, gaze matching can be determined accurately and an image with matched gaze can be generated.

Furthermore, in a video conference among multiple participants, even when the windows are arranged differently in each video conference system, the system can show the participants clearly who is addressing whom.

The present invention is not limited to the embodiments described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, the video conference system described above contains a computer system; when a WWW system is used, the "computer system" is taken to include a homepage providing environment (or display environment) as well.

Although this specification describes embodiments in which the program is installed in advance, the program may also be provided stored on a computer-readable recording medium.

Description of Reference Numerals
10 Camera
20, 220, 320, 420 Computing unit
24 Eye region image extraction unit
26 Feature extraction unit
28, 228 Classifier construction unit
30, 230 Classifier storage unit
32 Eye region image extraction unit
34 Feature extraction unit
36, 422 Eye open/close determination unit
38 Camera-position and partner-gaze determination unit
40, 240 Gaze image storage unit
42, 242 Synthesis unit
44 Screen output unit
80 Communication unit
90 Display
92 Window
100, 200, 300, 400 Video conference system
238 Camera-position and on-display gaze-point determination unit
239 Gaze partner determination unit
290 Synthesis gaze point
424 Gaze target calculation unit
500 Network

Claims

A gaze-matched face image synthesis method in a video conference system, comprising:
a feature extraction step of extracting features from a camera input image representing an interlocutor;
an eye open/close determination step of determining whether the eyes are open or closed using the features of the camera input image;
a camera-position and partner-gaze determination step of, when the eyes are determined to be open, determining whether the interlocutor represented by the camera input image is gazing at a conversation partner and determining the camera viewpoint position, based on the features of the camera input image and on a classifier that identifies whether the conversation partner is being gazed at and the camera viewpoint position, the classifier having been learned in advance from face images for which whether the conversation partner is being gazed at and the camera viewpoint position relative to the display are known; and
a synthesis step of, when the interlocutor is determined to be gazing at the conversation partner, compositing an image captured while gazing at the camera into the eye region of the interlocutor represented by the camera input image.

A gaze-matched face image synthesis method in a video conference system, comprising:
a feature extraction step of extracting features from a camera input image representing an interlocutor;
an eye open/close determination step of determining whether the eyes are open or closed using the features of the camera input image;
a camera-position and on-display gaze-point determination step of, when the eyes are determined to be open, determining the gaze point on the interlocutor-side display and the camera viewpoint position, based on the features of the camera input image and on classifiers learned in advance from face images for which the gaze point on the interlocutor-side display and the camera viewpoint position relative to the display are known, each classifier identifying, for a pair of a camera viewpoint position and a position on the interlocutor-side display, whether that position is being gazed at and the camera is at that viewpoint position;
a gaze partner determination step of determining whether the interlocutor represented by the camera input image is gazing at a conversation partner, based on the gaze point on the interlocutor-side display; and
a synthesis step of, when the interlocutor is determined to be gazing at the conversation partner, compositing an image captured while gazing at the position corresponding to the conversation partner into the eye region of the interlocutor represented by the camera input image.

The gaze-matched face image synthesis method according to claim 1 or 2, wherein, in the synthesis step, when the interlocutor is determined to be gazing at the conversation partner, a synthesis unit on the image receiving side composites the image captured while gazing into the eye region of the interlocutor represented by the camera input image.

A video conference system comprising:
a feature extraction unit that extracts features from a camera input image representing an interlocutor;
an eye open/close determination unit that determines whether the eyes are open or closed using the features of the camera input image;
a camera-position and partner-gaze determination unit that, when the eyes are determined to be open, determines whether the interlocutor represented by the camera input image is gazing at a conversation partner and determines the camera viewpoint position, based on the features of the camera input image and on a classifier that identifies whether the conversation partner is being gazed at and the camera viewpoint position, the classifier having been learned in advance from face images for which whether the conversation partner is being gazed at and the camera viewpoint position relative to the display are known; and
a synthesis unit that, when the interlocutor is determined to be gazing at the conversation partner, composites an image captured while gazing at the camera into the eye region of the interlocutor represented by the camera input image.

A video conference system comprising:
a feature extraction unit that extracts features from a camera input image representing an interlocutor;
an eye open/close determination unit that determines whether the eyes are open or closed using the features of the camera input image;
a camera-position and on-display gaze-point determination unit that, when the eyes are determined to be open, determines the gaze point on the interlocutor-side display and the camera viewpoint position, based on the features of the camera input image and on classifiers learned in advance from face images for which the gaze point on the interlocutor-side display and the camera viewpoint position relative to the display are known, each classifier identifying, for a pair of a camera viewpoint position and a position on the interlocutor-side display, whether that position is being gazed at and the camera is at that viewpoint position;
a gaze partner determination unit that determines whether the interlocutor represented by the camera input image is gazing at a conversation partner, based on the gaze point on the interlocutor-side display; and
a synthesis unit that, when the interlocutor is determined to be gazing at the conversation partner, composites an image captured while gazing at the position corresponding to the conversation partner into the eye region of the interlocutor represented by the camera input image.

A program for causing a computer to execute each step of the gaze-matched face image synthesis method according to any one of claims 1 to 3.