JP2010239499A

JP2010239499A - Communication terminal unit, communication control unit, method of controlling communication of communication terminal unit, and communication control program

Info

Publication number: JP2010239499A
Application number: JP2009086794A
Authority: JP
Inventors: Takahiro Shimazu; 宝浩島津
Original assignee: Brother Industries Ltd
Current assignee: Brother Industries Ltd
Priority date: 2009-03-31
Filing date: 2009-03-31
Publication date: 2010-10-21

Abstract

PROBLEM TO BE SOLVED: To provide a communication terminal unit capable of changing the transmitted user image according to the user's movement, together with a communication control unit, a communication control method for the communication terminal unit, and a communication control program.

SOLUTION: During a teleconference, the terminal unit detects whether its user is making a gesture (S4). When the user is not making a gesture (S5: NO), the user's face image is transmitted to the terminal unit of the opposite user (S3). When the user is making a gesture (S5: YES), the user's upper-body image is transmitted to the terminal unit of the opposite user (S7). When the user tries to express a feeling through a gesture, the opposite user can see the gesture. When the user does not make a gesture, the opposite user can see the user's facial expression.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a communication terminal device capable of bidirectionally transmitting and receiving images and sound to and from terminals at other sites, to a communication control device, to a communication control method for the communication terminal device, and to a communication control program.

Video conference systems are conventionally known in which a plurality of terminals are connected via a network and images and sound are transmitted and received bidirectionally, enabling a conference between people at remote locations. In such a system, a conference participant confers with the other party shown on a display screen. Because information can be conveyed through images, a participant can read the other party's facial expression from the displayed image and infer the other party's emotions. For example, a camera control method and apparatus, and a storage medium, have been proposed that estimate the speaker from input sound and automatically take a close-up of the speaker (see, for example, Patent Document 1). With this control method, the speaker is photographed in close-up, so the speaker's facial expression can be reliably shown on the display screen of the partner terminal.

Japanese Unexamined Patent Application Publication No. H11-341334 (JP-A-11-341334)

However, conference participants sometimes accompany their speech with gestures, such as hand and body movements, when expressing emotion. With the camera control method described in Patent Document 1, only the participant's face image is shown on the display screen, so a gesture made to express emotion does not appear on the screen. If the shooting range is widened so that gestures appear on the screen, the participant's face image becomes relatively small and the facial expression can no longer be made out.

The present invention has been made to solve the above problems, and its object is to provide a communication terminal device, a communication control device, a communication control method for a communication terminal device, and a communication control program that can change the user image to be transmitted according to the user's motion.

To achieve the above object, the communication terminal device of the invention according to claim 1 is a communication terminal device that communicates, via images and sound, with another communication terminal device connected via a network, and comprises: photographing means for photographing a user; motion detection means for detecting, from a captured image taken by the photographing means, a state in which at least one of the user's palm and arm moves by a predetermined amount or more as a motion of the user; first image range determining means for determining, when the motion is detected by the motion detection means, the range of an image of an upper-body region including the user's face region, palm region, and arm region within the captured image as a first image range; second image range determining means for determining, when the motion is not detected by the motion detection means, the range of an image of the user's face region within the captured image as a second image range; image transmitting means for transmitting the image of the first image range determined by the first image range determining means, or the image of the second image range determined by the second image range determining means, to the other communication terminal device; image receiving means for receiving the image transmitted from the other communication terminal device; and image display control means for displaying the image on a display screen when the image is received by the image receiving means.
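The selection between the first and second image ranges described above can be sketched as follows. This is only a minimal illustration, not the implementation disclosed in the publication: the `Rect` type, the function names `select_image_range` and `crop`, and the toy frame are all assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned pixel rectangle inside the captured image (hypothetical type)."""
    left: int
    top: int
    width: int
    height: int


def select_image_range(motion_detected: bool, face: Rect, upper_body: Rect) -> Rect:
    """Return the range to transmit: upper body when a motion is detected, face otherwise."""
    return upper_body if motion_detected else face


def crop(frame, r: Rect):
    """Cut the chosen range out of a frame stored as a list of pixel rows."""
    return [row[r.left:r.left + r.width] for row in frame[r.top:r.top + r.height]]


# Toy 4x6 frame whose "pixels" are just (row, column) tuples.
frame = [[(y, x) for x in range(6)] for y in range(4)]
face = Rect(left=2, top=0, width=2, height=2)        # second image range
upper_body = Rect(left=1, top=0, width=4, height=4)  # first image range

# No gesture detected, so only the face range would be transmitted.
to_send = crop(frame, select_image_range(False, face, upper_body))
```

In a real terminal the two rectangles would come from face, palm, and arm detection on each captured frame; here they are fixed values so that the branching itself is easy to follow.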

In the communication terminal device of the invention according to claim 2, in addition to the configuration of the invention according to claim 1, the motion detection means detects, as the motion, a state in which both the palm and the arm move by the predetermined amount or more.

In the communication terminal device of the invention according to claim 3, in addition to the configuration of the invention according to claim 1 or 2, the motion detection means detects, as movement of the palm, a state in which the shape of the palm changes by a predetermined amount or more.

In the communication terminal device of the invention according to claim 4, in addition to the configuration of the invention according to any one of claims 1 to 3, the motion detection means detects, as movement of the arm, a state in which the position of the arm changes by a predetermined amount or more.

The communication terminal device of the invention according to claim 5, in addition to the configuration of the invention according to any one of claims 1 to 4, further comprises nose position detection means for detecting the nose position of the target user from the captured image, and the first image range determining means determines the first image range so that the nose position detected by the nose position detection means is the center point of the first image range in the horizontal direction.

The communication terminal device of the invention according to claim 6, in addition to the configuration of the invention according to any one of claims 1 to 5, comprises speaker specifying means for specifying a speaker from among a plurality of users. The motion detection means detects, as a motion of the speaker, a state in which at least one of the palm and arm of the speaker specified by the speaker specifying means moves by the predetermined amount or more. The first image range determining means determines the range of an upper-body image including the speaker's face region, palm region, and arm region as the first image range, and the second image range determining means determines the range of a face image including the speaker's face region as the second image range.

The communication terminal device of the invention according to claim 7, in addition to the configuration of the invention according to claim 6, comprises person recognition means for recognizing persons from the captured image taken by the photographing means, and mouth shape detection means for detecting a change in the mouth shape of each person recognized by the person recognition means. The speaker specifying means specifies, as the speaker, a person for whom the mouth shape detection means has detected a change in mouth shape of a predetermined amount or more.

The communication terminal device of the invention according to claim 8, in addition to the configuration of the invention according to claim 6 or 7, comprises voice detection means for detecting the user's voice and the direction of the voice, and the speaker specifying means specifies, as the speaker, a person located in the direction detected by the voice detection means.

The communication control device of the invention according to claim 9 is a communication control device that is connected to a plurality of communication terminal devices via a network and controls communication performed between the communication terminal devices, and comprises: captured image receiving means for receiving a captured image taken by the photographing means of a communication terminal device and transmitted from that communication terminal device; motion detection means for detecting, based on the captured image received by the captured image receiving means, a state in which at least one of the user's palm and arm moves as a motion of the user; first image range determining means for determining, when the motion is detected by the motion detection means, the range of an image of an upper-body region including the user's face region, palm region, and arm region within the captured image as a first image range; second image range determining means for determining, when the motion is not detected by the motion detection means, the range of a face image including the user's face region within the captured image as a second image range; and image transmitting means for transmitting the image of the first image range determined by the first image range determining means, or the image of the second image range determined by the second image range determining means, to the communication terminal device.

The communication control method for a communication terminal device of the invention according to claim 10 is a communication control method for a communication terminal device that communicates, via images and sound, with another communication terminal device connected via a network, and comprises: a motion detection step of detecting a state in which at least one of the user's palm and arm moves as a motion of the user; a first image range determination step of determining, when the motion is detected in the motion detection step, the range of an image of an upper-body region including the user's face region, palm region, and arm region within a captured image taken by photographing means that photographs the user as a first image range; a second image range determination step of determining, when the motion is not detected in the motion detection step, the range of an image of the user's face region within the captured image as a second image range; an image transmission step of transmitting the image of the first image range determined in the first image range determination step, or the image of the second image range determined by the second image range determining means, to the other communication terminal device; an image reception step of receiving the image transmitted from the other communication terminal device; and an image display control step of displaying the image on a display screen when the image is received in the image reception step.

The communication control program of the invention according to claim 11 causes a computer to function as the various processing means of the communication terminal device according to any one of claims 1 to 8.

In the communication terminal device of the invention according to claim 1, communication via images and sound is performed with another communication terminal device connected via a network. The photographing means photographs the user. The motion detection means detects, from the captured image taken by the photographing means, a state in which at least one of the user's palm and arm moves by a predetermined amount or more as a motion of the user. When the motion is detected by the motion detection means, the first image range determining means determines the range of an image of an upper-body region including the user's face region, palm region, and arm region within the captured image as the first image range. When the motion is not detected, the second image range determining means determines the range of an image of the user's face region within the captured image as the second image range. The image transmitting means transmits the image of the first image range determined by the first image range determining means, or the image of the second image range determined by the second image range determining means, to the other communication terminal device. The image receiving means receives the image transmitted from the other communication terminal device. The image display control means displays the image on the display screen when the image is received by the image receiving means. Thus, when no motion is detected by the motion detection means, the image of the user's face region is transmitted to the other communication terminal device, and when a motion is detected, the image of the upper-body region is transmitted. Accordingly, on the other communication terminal device, the image of the user's upper-body region is shown on the display screen while the user is gesturing, and the image of the user's face region is shown while the user is not gesturing. When the user tries to express emotion with a gesture, the partner user can see the gesture; when the user makes no gesture, the partner user can see the user's facial expression. Conference participants at different sites can therefore communicate well. In addition, when the user makes no gesture, a face image with a smaller data amount than the upper-body image is transmitted, which reduces the communication load.
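The remark about data amount can be made concrete with a little arithmetic. The sketch below assumes uncompressed 24-bit pixels and purely hypothetical crop sizes (160x120 for the face range, 480x360 for the upper-body range); the publication gives no such figures, and the actual saving would depend on the codec and on whether the cropped image is rescaled before transmission.

```python
def raw_size_bytes(width: int, height: int, bytes_per_pixel: int = 3) -> int:
    """Size of an uncompressed image of the given dimensions."""
    return width * height * bytes_per_pixel


# Hypothetical crop sizes, chosen only for illustration.
face_bytes = raw_size_bytes(160, 120)
upper_body_bytes = raw_size_bytes(480, 360)

# How many times larger the upper-body crop is than the face crop.
ratio = upper_body_bytes / face_bytes
```

With these assumed sizes the upper-body range carries nine times as many raw bytes per frame as the face range, which is the kind of difference the text refers to when it says the communication load is reduced while the user is not gesturing.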

In the communication terminal device of the invention according to claim 2, in addition to the effect of the invention according to claim 1, the motion detection means detects, as the motion, a state in which both the palm and the arm move by the predetermined amount or more. In other words, small movements such as slight swaying of the user's upper body are not detected as a motion; only large motions of the predetermined amount or more are detected.
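The difference between the "at least one of palm and arm" condition of claim 1 and the "both palm and arm" condition of claim 2 can be sketched as a single predicate. The function name, the `require_both` flag, and the idea of a single scalar movement amount per body part are assumptions made for this illustration.

```python
def detect_motion(palm_amount: float, arm_amount: float,
                  threshold: float, require_both: bool = False) -> bool:
    """Decide whether the user is treated as making a motion.

    require_both=False: either palm or arm reaching the threshold counts.
    require_both=True: palm and arm must both reach the threshold.
    """
    palm_moving = palm_amount >= threshold
    arm_moving = arm_amount >= threshold
    if require_both:
        return palm_moving and arm_moving
    return palm_moving or arm_moving


# A large palm movement with an almost still arm.
either = detect_motion(5.0, 0.5, threshold=2.0)
both = detect_motion(5.0, 0.5, threshold=2.0, require_both=True)
```

Requiring both parts to move is the stricter rule, so isolated small movements are less likely to switch the transmitted image to the upper-body range.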

In the communication terminal device of the invention according to claim 3, in addition to the effect of the invention according to claim 1 or 2, the motion detection means detects, as movement of the palm, a state in which the shape of the palm changes by a predetermined amount or more. Thus, when the user is expressing emotion by moving the palm, an upper-body image that includes the palm can be shown on the display screen. Furthermore, because detection occurs only when the palm shape changes by the predetermined amount or more, a small change in palm shape is not detected as movement of the palm.
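One way to turn "the palm shape changes by a predetermined amount or more" into a number is to compare the palm's pixel masks in two frames. The measure used below (the share of pixels that differ between the two masks) is an assumption chosen for this sketch; this section of the publication does not specify how the shape change is quantified.

```python
def shape_change(prev_mask: set, curr_mask: set) -> float:
    """Fraction of palm pixels that differ between two frames (0.0 to 1.0).

    Each mask is a set of (row, column) pixels classified as palm.
    """
    union = prev_mask | curr_mask
    if not union:
        return 0.0
    return len(prev_mask ^ curr_mask) / len(union)


def palm_moved(prev_mask: set, curr_mask: set, threshold: float) -> bool:
    """Treat the palm as moving only when the shape change reaches the threshold."""
    return shape_change(prev_mask, curr_mask) >= threshold


open_hand = {(0, 0), (0, 1), (1, 0), (1, 1)}
closed_hand = {(0, 0), (0, 1)}

change = shape_change(open_hand, closed_hand)  # 2 differing pixels out of 4
```

With a threshold of, say, 0.3, the change from `open_hand` to `closed_hand` counts as palm movement, while two identical masks do not.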

In the communication terminal device of the invention according to claim 4, in addition to the effect of the invention according to any one of claims 1 to 3, the motion detection means detects, as movement of the arm, a state in which the position of the arm changes by a predetermined amount or more. Thus, when the user is expressing emotion by moving the arm, an upper-body image that includes the arm can be shown on the display screen. Furthermore, because detection occurs only when the arm position changes by the predetermined amount or more, a small change in arm position is not detected as movement of the arm.
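The arm-position test can be illustrated with a simple displacement check between two frames. Representing the arm by a single reference point and measuring the change as a Euclidean distance in pixels are assumptions made for this sketch, not details taken from the publication.

```python
import math


def arm_moved(prev_pos, curr_pos, threshold: float) -> bool:
    """Return True when the arm's reference point moved by at least `threshold` pixels.

    prev_pos and curr_pos are (x, y) coordinates of the same arm point
    in two consecutive captured images.
    """
    dx = curr_pos[0] - prev_pos[0]
    dy = curr_pos[1] - prev_pos[1]
    return math.hypot(dx, dy) >= threshold


large = arm_moved((100, 200), (103, 204), threshold=5.0)  # displacement is 5 pixels
small = arm_moved((100, 200), (101, 201), threshold=5.0)  # displacement is about 1.4 pixels
```

The threshold plays the role of the "predetermined amount": movements below it, such as slight swaying, leave the face image in place.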

In the communication terminal device of the invention according to claim 5, in addition to the effect of the invention according to any one of claims 1 to 4, the first image range determining means further includes nose position detection means. The nose position detection means detects the nose position of the target user from the captured image. The first image range determining means determines the first image range so that the nose position detected by the nose position detection means is the center point of the first image range in the horizontal direction. Thus, the user's face image can always be positioned at the center of the display screen.
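Centering the first image range on the nose position in the horizontal direction can be sketched as below. The function name and the clamping to the frame edges are assumptions for this example; this section does not say what happens when a centered range would extend past the captured image.

```python
def center_on_nose(nose_x: int, range_width: int, frame_width: int):
    """Return (left, right) pixel columns of a range centered on the nose.

    The range is shifted back inside the frame if it would overflow
    (an assumption; in that case the nose is no longer exactly centered).
    """
    left = nose_x - range_width // 2
    left = max(0, min(left, frame_width - range_width))
    return left, left + range_width


centered = center_on_nose(nose_x=320, range_width=200, frame_width=640)
near_left_edge = center_on_nose(nose_x=50, range_width=200, frame_width=640)
```

For `centered`, the nose column 320 sits exactly midway between the returned left and right columns, which is the property the paragraph above describes.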

In the communication terminal device of the invention according to claim 6, in addition to the effect of the invention according to any one of claims 1 to 5, the speaker specifying means specifies a speaker from among a plurality of users. The motion detection means detects, as a motion of the speaker, a state in which at least one of the palm and arm of the speaker specified by the speaker specifying means moves by the predetermined amount or more. The first image range determining means determines the range of an upper-body image including the speaker's face region, palm region, and arm region as the first image range. The second image range determining means determines the range of a face image including the speaker's face region as the second image range. Therefore, when a plurality of users are at one site, the first image range or the second image range can be specified for the speaker among them.

In the communication terminal device of the invention according to claim 7, in addition to the effect of the invention according to claim 6, the person recognition means recognizes persons from the captured image taken by the photographing means. The mouth shape detection means detects a change in the mouth shape of each person recognized by the person recognition means. The speaker specifying means specifies, as the speaker, a person for whom the mouth shape detection means has detected a change in mouth shape of a predetermined amount or more. Thus, even when there are a plurality of users at one site, the speaker can be accurately identified from among them.
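Speaker identification from mouth-shape change can be sketched as a threshold test over the recognized persons. The dictionary of per-person change values, the person labels, and the tie-breaking rule (pick the largest change when several people exceed the threshold) are assumptions for this illustration.

```python
from typing import Dict, Optional


def identify_speaker(mouth_change: Dict[str, float], threshold: float) -> Optional[str]:
    """Return the person whose mouth shape changed most, if it reached the threshold.

    mouth_change maps a hypothetical person label to the measured amount
    of mouth-shape change between frames. Returns None when nobody qualifies.
    """
    candidates = {p: c for p, c in mouth_change.items() if c >= threshold}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


changes = {"person_a": 0.1, "person_b": 0.6, "person_c": 0.4}
speaker = identify_speaker(changes, threshold=0.3)
```

Once a speaker is chosen this way, the gesture detection and image range selection would be applied to that person only, as claim 6 describes.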

In the communication terminal device of the invention according to claim 8, in addition to the effect of the invention according to claim 6 or 7, the voice detection means detects the user's voice and also detects the direction of the voice. The speaker specifying means specifies, as the speaker, a person located in the direction detected by the voice detection means. Thus, even when there are a plurality of users at one site, the speaker can be accurately identified from among them.
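Matching a detected voice direction to a person can be sketched as a nearest-angle lookup. Expressing both the voice direction and each person's position as an angle in degrees relative to the camera axis, and the 15-degree tolerance, are assumptions made for this example.

```python
from typing import Dict, Optional


def speaker_from_direction(voice_angle_deg: float,
                           person_angles: Dict[str, float],
                           tolerance_deg: float = 15.0) -> Optional[str]:
    """Return the person closest to the voice direction, within a tolerance.

    person_angles maps a hypothetical person label to that person's angle
    as seen from the terminal. Returns None if nobody is close enough.
    """
    if not person_angles:
        return None
    nearest = min(person_angles,
                  key=lambda name: abs(person_angles[name] - voice_angle_deg))
    if abs(person_angles[nearest] - voice_angle_deg) > tolerance_deg:
        return None
    return nearest


seats = {"left": -30.0, "center": 0.0, "right": 25.0}
speaker = speaker_from_direction(-28.0, seats)
```

The tolerance guards against assigning a sound that comes from outside the seating area to whichever participant happens to be nearest.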

The communication control device of the invention according to claim 9 is connected to a plurality of communication terminal devices via a network and controls communication performed between the communication terminal devices. The captured image receiving means receives a captured image taken by the photographing means of a communication terminal device and transmitted from that communication terminal device. The motion detection means detects, based on the captured image received by the captured image receiving means, a state in which at least one of the user's palm and arm moves as a motion of the user. When the motion is detected by the motion detection means, the first image range determining means determines the range of an image of an upper-body region including the user's face region, palm region, and arm region within the captured image as the first image range. When the motion is not detected, the second image range determining means determines the range of a face image including the user's face region within the captured image as the second image range. The image transmitting means transmits the image of the first image range determined by the first image range determining means, or the image of the second image range determined by the second image range determining means, to the communication terminal device. Thus, when no motion is detected by the motion detection means, the image of the user's face region is shown on the display screen of the communication terminal device, and when a motion is detected, the image of the upper-body region including the user's face region, palm region, and arm region is shown on the display screen. Because the size of the user's image can be adjusted according to the user's motion, the user's facial expression or motion can be reliably shown on the display screen of the communication terminal device. The user's emotions can therefore be expressed richly, enabling good communication with the other party. In addition, when the user is not moving, the image of the second image range, which has a smaller data amount than the image of the first image range, is transmitted, which reduces the communication load.

In the communication control method for a communication terminal device of the invention according to claim 10, the motion detection step detects a state in which at least one of the user's palm and arm moves as a motion of the user. When the motion is detected in the motion detection step, the first image range determination step determines the range of an image of an upper-body region including the user's face region, palm region, and arm region within a captured image taken by photographing means that photographs the user as the first image range. When the motion is not detected in the motion detection step, the second image range determination step determines the range of an image of the user's face region within the captured image as the second image range. The image transmission step transmits the image of the first image range determined in the first image range determination step, or the image of the second image range determined by the second image range determining means, to the other communication terminal device. The image reception step receives the image transmitted from the other communication terminal device. The image display control step displays the image on the display screen when the image is received in the image reception step. Thus, when no motion is detected, the image of the user's face region is shown on the display screen, and when a motion is detected, the image of the upper-body region including the user's face region, palm region, and arm region is shown on the display screen. Because the size of the user's image can be adjusted according to the user's motion, the user's facial expression or motion can be reliably shown on the display screen. The user's emotions can therefore be expressed richly, enabling good communication with the other party. In addition, when the user is not moving, the image of the second image range, which has a smaller data amount than the image of the first image range, is transmitted, which reduces the communication load.

The communication control program of the invention according to claim 11 causes a computer to function as the various processing means of the communication terminal device according to any one of claims 1 to 8, and therefore provides the effects described for any one of claims 1 to 8.

Block diagram showing the configuration of the video conference system 1.
Block diagram showing the electrical configuration of the terminal device 3.
Conceptual diagram showing the storage areas of the HDD 31.
Conceptual diagram showing the storage areas of the RAM 22.
Diagram showing one display mode of the display 28 (when the partner user is not making a gesture).
Diagram showing one display mode of the display 28 (when the partner user is making a gesture).
Diagram showing the range of the face image 51 in the captured image 50.
Diagram showing the range of the upper-body image 52 in the captured image 50.
Flowchart of the image transmission process performed by the CPU 20.
Flowchart of the face image range determination process performed by the CPU 20.
Flowchart of the motion detection process performed by the CPU 20.
Flowchart of the palm motion detection process performed by the CPU 20.
Flowchart of the arm motion detection process performed by the CPU 20.
Flowchart of the upper-body image range determination process performed by the CPU 20.
Flowchart of the image reception process performed by the CPU 20.
Block diagram showing the electrical configuration of the terminal device 130 of the second embodiment.
Conceptual diagram showing the storage areas of the RAM 122.
Flowchart of the image transmission process performed by the CPU 120.

The terminal device 3 according to a first embodiment of the present invention is described below with reference to the drawings. First, the configuration of the video conference system 1, of which the terminal device 3 is a component, is described with reference to FIG. 1.

The video conference system 1 includes terminal devices 3 and 4 connected to each other via a network 2. The terminal devices 3 and 4 are installed at different sites. In the video conference system 1, images and audio are exchanged between the terminal devices 3 and 4 via the network 2, so that users at different sites can hold a remote conference. In this embodiment, the site where the terminal device 3 is installed is called the local site, and the site where the terminal device 4 is installed is called the other site.

A feature of this embodiment is that, when the user of the terminal device 3 is talking while gesturing, an upper body image of that user is transmitted to the terminal device 4, and when the user is talking without gesturing, the user's face image is transmitted to the terminal device 4.

The electrical configuration of the terminal device 3 is described with reference to FIG. 2. FIG. 2 is a block diagram showing the electrical configuration of the terminal device 3. Since the terminal device 3 and the terminal device 4 have the same configuration, only the terminal device 3 is described here and the description of the terminal device 4 is omitted.

The terminal device 3 is provided with a CPU 20 as a controller that governs control of the terminal device 3. Connected to the CPU 20 are a ROM 21 that stores a BIOS and the like, a RAM 22 that temporarily stores various data, and an I/O interface 30 that mediates data transfer. A hard disk drive 31 (hereinafter, HDD 31) having various storage areas is connected to the I/O interface 30.

Also connected to the I/O interface 30 are a communication device 25 for communicating with the network 2, a mouse 27, a video controller 23, a key controller 24, a camera 34 for photographing the user, a microphone 35 for capturing the user's voice, and a CD-ROM drive 26. A display 28 that shows the partner user, who uses the terminal device 4, is connected to the video controller 23. A keyboard 29 is connected to the key controller 24.

A CD-ROM 114 inserted into the CD-ROM drive 26 stores the main program of the terminal device 3, the communication control program of the present invention, and the like. When the CD-ROM 114 is installed, these programs are set up from the CD-ROM 114 onto the HDD 31 and stored in a program storage area 313 (see FIG. 3), described later.

Next, the storage areas of the HDD 31 are described with reference to FIG. 3. The HDD 31 is provided with at least a captured image data storage area 311 that stores the captured image 50 (see FIGS. 7 and 8) taken by the camera 34, a display screen data storage area 312 that stores the screen data shown on the display 28 of the terminal device 3, a program storage area 313 that stores various programs, a predetermined value storage area 314 that stores predetermined values needed to execute the programs, and an other information storage area 315.

The program storage area 313 stores the main program of the terminal device 3, a conference support program for holding a remote conference with the terminal device 4, the communication control program of the present invention relating to image display, and the like. The other information storage area 315 stores other information used by the terminal device 3. If the terminal device 3 is a dedicated machine without the HDD 31, the programs are stored in the ROM 21.

Next, the storage areas of the RAM 22 are described with reference to FIG. 4. The RAM 22 is provided with at least an image range storage area 221, a palm motion storage area 222, an arm motion storage area 223, a motion detection storage area 224, a connected terminal storage area 225, and a processed image storage area 226. The image range storage area 221 stores the range of the transmission image within the captured image 50. The palm motion storage area 222 contains a palm area storage area 2221, which stores the area of the user's palm in the captured image 50, and a palm change storage area 2222, which stores whether a change in the shape of the user's palm has been detected from the captured image 50. The arm motion storage area 223 contains a contour data storage area 2231, which stores contour data of the user's arm, and an arm change storage area 2232, which stores whether the position of the user's arm is changing. The motion detection storage area 224 stores whether a user motion has been detected. The connected terminal storage area 225 stores the terminal ID of the terminal currently connected via the network 2. The processed image storage area 226 stores image data used for image processing.

Next, the screen shown on the display 28 of the terminal device 3 is described with reference to FIGS. 5 and 6. The display 28 of the terminal device 3 shows the image of the partner user transmitted from the terminal device 4. When the partner user is not gesturing, the display 28 shows an image of the partner user's face, as shown in FIG. 5. When the partner user is gesturing, the display 28 shows an image of the partner user's upper body, as shown in FIG. 6.

Next, a method of detecting the user's gesture is described. In this embodiment, the state in which the user is moving both the palm and the arm is detected as the state in which the user is gesturing.

First, a method of detecting the movement of the user's palm is described. Palm movement is detected based on the captured image 50 of the user (see FIG. 7) taken by the camera 34. First, the user's palm region is extracted from the captured image 50. If the area of the extracted palm region changes by a certain amount or more, palm movement is detected.

Various well-known methods can be used to extract the palm region; for example, the method described in Japanese Patent Application Laid-Open No. 2003-346162 is applicable. First, the captured image 50, which is represented in the RGB color system, is converted to the HSV color system. The HSV color system consists of three components: hue H, which represents the type of color; saturation S, which represents the vividness of the color; and value V, which represents the degree of brightness. A method of converting from the RGB color system to the HSV color system is described, for example, in "Image Analysis Handbook" (supervised by Takagi and Shimoda, University of Tokyo Press, pp. 485-491, 1991). The ranges of H, S, and V are as follows.
・0 ≦ H ≦ 2π
・0 ≦ S ≦ 1
・0 ≦ V ≦ 1
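
The conversion step can be illustrated with a minimal Python sketch. It is not the method of the cited handbook; it uses the standard library `colorsys` module and rescales the hue so that it falls in the 0 to 2π range stated above. The function name `rgb_to_hsv_radians` and the assumption of 8-bit RGB input are illustrative only.

```python
import colorsys
import math


def rgb_to_hsv_radians(r, g, b):
    """Convert 8-bit RGB values to (H, S, V).

    H is returned in radians (0 <= H < 2*pi); S and V are in [0, 1].
    The 8-bit input range is an assumption made for this sketch.
    """
    # colorsys works on floats in [0, 1] and returns the hue as a fraction of a turn.
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return h * 2.0 * math.pi, s, v


if __name__ == "__main__":
    print(rgb_to_hsv_radians(0, 255, 0))  # pure green: hue is one third of a turn
```

Applying the function to every pixel of a captured frame would yield the converted image on which the later skin color thresholds operate.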

Next, skin color extraction is performed to find the hand region in the image. In this embodiment, the thresholds for the skin color region are set as follows.
・0.11 < H < 0.22
・0.2 < S < 0.5
Pixels whose hue H and saturation S fall within these thresholds are extracted as skin color pixels.

To separate the hand region from the background, the image is binarized into skin color pixels and non-skin color pixels. In the resulting binary image, a skin color pixel portion whose area is within a predetermined range is extracted as the palm region. The palm region is extracted every 1/30 second.
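
The thresholding, binarization, and area filtering described above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the text does not say how skin color pixels are grouped into "portions", so 4-connected regions are assumed here, and the hue is compared on whatever scale the caller supplies because the text does not state the unit of the 0.11 to 0.22 threshold. The function names and the `min_area` and `max_area` parameters are hypothetical.

```python
H_MIN, H_MAX = 0.11, 0.22  # hue thresholds from the text (unit as supplied by the caller)
S_MIN, S_MAX = 0.2, 0.5    # saturation thresholds from the text


def binarize_skin(hsv_image):
    """Return a 0/1 mask: 1 for skin color pixels, 0 for all others.

    hsv_image is a list of rows, each row a list of (h, s, v) tuples.
    """
    return [
        [1 if (H_MIN < h < H_MAX and S_MIN < s < S_MAX) else 0 for (h, s, _v) in row]
        for row in hsv_image
    ]


def region_areas(mask):
    """Return the pixel counts of the 4-connected regions of 1-pixels (assumed grouping)."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    areas = []
    for y in range(height):
        for x in range(width):
            if mask[y][x] != 1 or seen[y][x]:
                continue
            # Iterative flood fill from this unvisited skin pixel.
            stack = [(y, x)]
            seen[y][x] = True
            area = 0
            while stack:
                cy, cx = stack.pop()
                area += 1
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            areas.append(area)
    return areas


def palm_candidates(mask, min_area, max_area):
    """Keep only regions whose area lies within the given (hypothetical) range."""
    return [a for a in region_areas(mask) if min_area <= a <= max_area]
```

In use, the mask would be recomputed for each frame (every 1/30 second in the text) and the area of the surviving region passed on to the palm change check.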

If the area of the extracted palm changes by a predetermined value or more, the palm shape is detected as having changed; if the area does not change by the predetermined value or more, the palm shape is detected as unchanged.

Next, a method of detecting the movement of the user's arm is described. Arm movement is detected based on the captured image 50 of the user (see FIG. 7) taken by the camera 34. Well-known methods can be used: for example, the method of "Extraction of human arm region and estimation of motion parameters under complex background using optical flow" (Transactions of the Institute of Electrical Engineers of Japan, Part C, Vol. 120-C, No. 12, pp. 1801-1808 (2000-12)), a method that extracts contour data from the captured image 50 and detects movement from the contour data, or a method that detects the user's palm in the captured image 50 and detects movement from changes in the palm position. This embodiment extracts contour data of the user's arm from the captured image 50 and detects arm movement when the position of the extracted contour data changes by a certain amount or more.

In this method, an extraction region from which the contour data of the user's arm will be extracted is first designated within the captured image 50. Here, a skin color portion of at least a certain area in the captured image 50 is extracted as the face region, a skin color portion below the face region is extracted as the palm region, and the region lying between the face region and the palm region in the vertical direction is designated as the extraction region for the arm contour data. The skin color portions are extracted as described above.

Next, the captured image 50 is converted to grayscale and contour data is extracted. The well-known first-derivative method is used to extract the contour data. In first-derivative contour extraction, the strength and direction of the contour are calculated from the density gradient at each pixel, and portions where the density value changes sharply are extracted as contour data. The extraction region is designated and the contour data is extracted every 1/30 second.
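
As an illustration of first-derivative contour extraction, the sketch below converts an RGB image to grayscale and computes a central-difference gradient at each interior pixel. The text does not specify the grayscale formula, the derivative kernel, or the threshold, so the luminance weights (0.299, 0.587, 0.114), the central-difference operator, and the `threshold` parameter are all assumptions made for this example.

```python
import math


def to_gray(rgb_image):
    """Convert rows of (r, g, b) tuples to gray levels using assumed luminance weights."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in rgb_image]


def gradient(gray, y, x):
    """Central-difference density gradient (gx, gy) at an interior pixel."""
    gx = (gray[y][x + 1] - gray[y][x - 1]) / 2.0
    gy = (gray[y + 1][x] - gray[y - 1][x]) / 2.0
    return gx, gy


def contour_pixels(gray, threshold):
    """Return {(y, x): (strength, direction)} for interior pixels on a contour.

    strength is the gradient magnitude and direction its angle in radians.
    A pixel counts as contour when strength >= threshold (hypothetical value).
    """
    edges = {}
    for y in range(1, len(gray) - 1):
        for x in range(1, len(gray[0]) - 1):
            gx, gy = gradient(gray, y, x)
            strength = math.hypot(gx, gy)
            if strength >= threshold:
                edges[(y, x)] = (strength, math.atan2(gy, gx))
    return edges
```

Comparing the set of contour pixel positions between successive frames is one way the positional change of the arm contour could then be measured.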

If the extracted contour data changes by a predetermined value or more, arm movement is detected; if the contour data does not change by the predetermined value or more, no arm movement is detected.

Next, a method of determining the range of the image of the user's face region (face image 51) within the captured image 50 is described with reference to FIG. 7. The face image range is determined by a well-known method; for example, the method described in Japanese Patent Application Laid-Open No. H10-334213 is applicable.

First, the user's face region is extracted from the captured image 50, in which the user's upper body has been photographed. To extract the face region, the captured image 50, which is represented in the RGB color system, is first converted to the HSV color system. Then, from the converted image, pixels whose hue H and saturation S fall within the thresholds given above are extracted as skin color pixels. To separate the face region from the background, the image is binarized into skin color pixels and non-skin color pixels. The HSV conversion, the skin color pixel extraction, and the binarization are the same as those described for palm region extraction. In the resulting binary image, the skin color pixel portion in the upper half of the image is extracted as the face region.

Next, the face image range used to cut out the face image 51 is determined. To determine the face image range, the position of the user's nose is first identified in the captured image 50. The nose position is identified by detecting the two adjacent nostrils in the center of the face region. Because the nostrils are not illuminated, they appear dark in the image. The dark portions in the center of the face region of the captured image are detected as the nostrils, and the center position of the detected nostrils is taken as the nose position. Then, taking the lower-left corner of the captured image 50 as the origin of its XY coordinates, the XY coordinates (x3, y3) of the nose position in the captured image 50 are identified.

Next, the maximum and minimum X coordinates and the maximum and minimum Y coordinates of the extracted face region are found, and the difference between each of these values and the nose position (x3, y3) is calculated. The largest of the calculated differences is taken as a first difference α, and a first enlargement value N is calculated by adding a predetermined value to the first difference α. Using the first enlargement value N, the range of the face image 51 is determined so that the nose position (x3, y3) is at the center of the face image 51. Specifically, the coordinates of the four corners of the face image 51 are determined as follows.
・(x3 + N, y3 + N)
・(x3 + N, y3 - N)
・(x3 - N, y3 + N)
・(x3 - N, y3 - N)
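
The face image range calculation can be written as a short Python sketch. The function name `face_image_range` is hypothetical; the argument names follow the symbols the document uses for this calculation (x1, x2 for the largest and smallest X coordinates of the face region, y1, y2 for the largest and smallest Y coordinates, and K for the predetermined value).

```python
def face_image_range(x1, x2, y1, y2, nose, k):
    """Return the four corners of a square face image centered on the nose.

    x1/x2: max/min X of the face region, y1/y2: max/min Y of the face region,
    nose: (x3, y3), k: predetermined margin added to the largest difference.
    """
    x3, y3 = nose
    # First difference alpha: the largest distance from the nose to a face-region edge.
    alpha = max(x1 - x3, x3 - x2, y1 - y3, y3 - y2)
    n = alpha + k  # first enlargement value N
    return [(x3 + n, y3 + n), (x3 + n, y3 - n), (x3 - n, y3 + n), (x3 - n, y3 - n)]


if __name__ == "__main__":
    print(face_image_range(80, 40, 160, 100, nose=(60, 120), k=10))
```

Because the half-width N is derived from the largest of the four differences, the square always contains the whole face region even when the nose is off-center within it.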

Next, a method of determining the range of the image of the user's upper body region (upper body image 52) within the captured image 50 is described with reference to FIG. 8. First, the color, geometric shape, shading pattern, motion, and the like are extracted from the captured image 50 as parameters. Next, a database of person-image parameters stored in advance in the other information storage area 315 of the HDD 31 is consulted, and the extracted parameters are matched against the person-image parameters in the database. The portion of the captured image that matches the stored person-image parameters well is identified as the user region. Then, taking the lower-left corner of the captured image 50 as the origin of its XY coordinates, the maximum and minimum X coordinates of the user region are detected.

Next, the user's nose position is detected in the captured image 50. The nose position is identified as described above. Then, the differences between the X coordinate x3 of the detected nose position and each of the maximum and minimum X coordinates of the user region are calculated. The larger of the two differences is taken as a second difference β, and a second enlargement value M is calculated by adding a predetermined value U to the second difference β.

The range of the upper body image 52 is then determined using the nose position (x3, y3) and the second enlargement value M as parameters. In this embodiment, the upper body image 52 is determined so that the nose position (x3, y3) is at the center of the upper body image 52 in the left-right direction (X direction) and at 2/3 of the height from the lower edge in the vertical direction (Y direction). Specifically, the coordinates of the four corners of the upper body image 52 are expressed as follows.
・(x3 + M, y3 + (M × (2/3)))
・(x3 + M, y3 - (M × (4/3)))
・(x3 - M, y3 + (M × (2/3)))
・(x3 - M, y3 - (M × (4/3)))
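
The upper body range can be sketched in the same way. The function name `upper_body_range` and its parameter names are hypothetical; β, M, and U follow the symbols in the text. The rectangle is 2M wide and 2M tall, with the nose 4M/3 above the lower edge, which is 2/3 of the height.

```python
def upper_body_range(user_x_max, user_x_min, nose, u):
    """Return the four corners of the upper body image.

    user_x_max/user_x_min: max/min X of the user region, nose: (x3, y3),
    u: predetermined margin U added to the second difference.
    """
    x3, y3 = nose
    # Second difference beta: the larger horizontal distance from the nose to the user region edge.
    beta = max(user_x_max - x3, x3 - user_x_min)
    m = beta + u  # second enlargement value M
    top = y3 + m * 2 / 3
    bottom = y3 - m * 4 / 3
    return [(x3 + m, top), (x3 + m, bottom), (x3 - m, top), (x3 - m, bottom)]


if __name__ == "__main__":
    print(upper_body_range(100, 10, nose=(60, 120), u=10))
```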

The communication control process in the video conference system 1, which takes the user's gestures into account, is described with reference to the flowcharts of FIGS. 9 to 15. This description assumes that the terminal device 3 at the local site and the terminal device 4 at the other site hold a conference. Each of the terminal devices 3 and 4 performs both an "image transmission process", which cuts out the user's face image 51 or upper body image 52 from the captured image 50 taken by the camera 34 and transmits it to the other terminal device, and an "image reception process", which receives and displays the image transmitted by the other terminal device. For convenience, the description below takes as an example the case where the "image transmission process" is executed on the terminal device 3 at the local site and the "image reception process" is executed on the terminal device 4 at the other site.

First, the image transmission process executed by the CPU 20 of the terminal device 3 at the local site is described. When the terminal device 3 and the terminal device 4 each connect to the network and begin communicating with each other, the image transmission process shown in FIG. 9 starts. When the image transmission process starts, the camera 34 is first driven and acquisition of the captured image 50 taken by the camera 34 begins (S1). The image data of the captured image 50 taken by the camera 34 is stored in the captured image data storage area 311.

Once acquisition of the captured image has begun (S1), the face image range determination process is performed (S2). The face image range determination process is described with reference to FIG. 10. It is a subroutine executed in S2 of the image transmission process of FIG. 9.

When the face image range determination process starts, the captured image 50, which is in the RGB color system, is first converted to the HSV color system and stored in the processed image storage area 226 as a converted image (S151). The HSV conversion is performed as described above. Then, from the converted image stored in the processed image storage area 226, pixels whose hue H and saturation S fall within the thresholds given above are extracted as skin color pixels (S152). The skin color pixels are extracted as described above. To separate the face region from the background, the image is binarized into skin color pixels and non-skin color pixels (S153). The skin color pixel portion in the upper half of the image is identified as the face region.

Next, the processed image storage area 226 is consulted and the user's nose position is identified (S154). The nose position is identified by detecting the two adjacent nostrils in the center of the face region. Because the nostrils are not illuminated, they appear dark in the image. The dark portions in the center of the face region are detected as the nostrils, and the center position of the detected nostrils is taken as the nose position. Then, taking the lower-left corner of the captured image 50 as the origin of its XY coordinates, the XY coordinates (x3, y3) of the nose position in the captured image 50 are identified and stored in a nose position storage area (not shown) of the image range storage area 221.

Next, for the face region, the maximum value x1 and minimum value x2 of the X coordinate and the maximum value y1 and minimum value y2 of the Y coordinate are detected (S155). The detected x1, x2, y1, and y2 are stored in a face region storage area (not shown) of the image range storage area 221. Then, the face region storage area and the nose position storage area are consulted, and the difference a between x1 and x3, the difference b between x3 and x2, the difference c between y1 and y3, and the difference d between y3 and y2 are calculated (S156). The differences a, b, c, and d are given by the following equations.
・a = x1 - x3
・b = x3 - x2
・c = y1 - y3
・d = y3 - y2
The largest of the four calculated differences a, b, c, and d is taken as the first difference α and stored in a first difference storage area (not shown) of the image range storage area 221.

Then, the first enlargement value N is calculated by adding the first difference α and a predetermined value K stored in the predetermined value storage area 314 (S157). The first enlargement value N is given by the following equation.
・N = α + K
The calculated first enlargement value is stored in a first enlargement value storage area (not shown) of the image range storage area 221.

Next, the first enlargement value storage area and the nose position storage area are consulted, and from the X coordinate x3 of the nose position, the X coordinate larger by N (x3 + N) and the X coordinate smaller by N (x3 - N) are calculated. From the Y coordinate y3 of the nose position, the Y coordinate larger by N (y3 + N) and the Y coordinate smaller by N (y3 - N) are calculated. (x3 + N) is the maximum X coordinate of the face image 51, and (x3 - N) is its minimum X coordinate. (y3 + N) is the maximum Y coordinate of the face image 51, and (y3 - N) is its minimum Y coordinate. The coordinates of the four points formed by combining these X and Y coordinates are then calculated. The four points are expressed as follows.
・(x3 + N, y3 + N)
・(x3 + N, y3 - N)
・(x3 - N, y3 + N)
・(x3 - N, y3 - N)

The coordinates of the four points are stored in the image range storage area 221 as information indicating the range of the face image 51 (S158). The face image range determination process then ends, and processing returns to the image transmission process (see FIG. 9).

When the face image range determination process (S2) ends, the captured image data storage area 311 and the image range storage area 221 are consulted, and the image data corresponding to the user's face image 51 is transmitted to the terminal device 4 used by the partner user (S3). Once the image data corresponding to the face image 51 has been transmitted to the terminal device 4 (S3), the motion detection process, which detects whether the user is moving, is performed (S4).
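
Cutting the stored range out of the captured image requires mapping the lower-left-origin coordinates used in this description onto the image data. The sketch below assumes the image is held as a list of rows with the top row first, which is a common layout but is not stated in the text. Clamping to the image bounds is also an assumption, added because the enlarged range can extend past the image edge. The function name `crop_lower_left` is hypothetical.

```python
def crop_lower_left(image, x_min, x_max, y_min, y_max):
    """Cut out the inclusive range [x_min, x_max] x [y_min, y_max] from image.

    image is a list of rows, row 0 being the TOP row (assumed layout).
    The coordinates are integers measured from the lower-left corner.
    Parts of the range outside the image are clamped away (assumed behavior).
    """
    height = len(image)
    width = len(image[0]) if height else 0
    x_lo, x_hi = max(0, x_min), min(width - 1, x_max)
    y_lo, y_hi = max(0, y_min), min(height - 1, y_max)
    # Y grows upward in the range but row indices grow downward, so flip.
    top_row = height - 1 - y_hi
    bottom_row = height - 1 - y_lo
    return [row[x_lo:x_hi + 1] for row in image[top_row:bottom_row + 1]]


if __name__ == "__main__":
    frame = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
    print(crop_lower_left(frame, 1, 2, 0, 1))  # bottom two rows, right two columns
```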

The motion detection process is described with reference to FIG. 11. The motion detection process is a subroutine executed in S4 of the image transmission process of FIG. 9. When the motion detection process starts, the palm motion detection process is started first (S51). The palm motion detection process is a subroutine executed in S51 of the motion detection process shown in FIG. 11.

The palm motion detection process is described with reference to FIG. 12. When the palm motion detection process starts, the captured image 50, which is in the RGB color system, is first converted to the HSV color system and stored in the processed image storage area 226 as a converted image (S101). Next, the processed image storage area 226 is consulted and skin color extraction is performed on the converted image (S102). The image is then binarized into skin color pixels and non-skin color pixels, and the resulting binary image is stored in the processed image storage area 226 (S103). Then, the predetermined value storage area 314 and the processed image storage area 226 are consulted, and a skin color pixel portion of the binary image whose area is within a predetermined range is extracted as the palm region (S104). The palm region is extracted every 1/30 second.

Then, the area of the extracted palm region is calculated and stored in the palm area storage area 2221 of the palm motion storage area 222 (S105). The palm area storage area 2221 has, for example, 60 storage slots, and the area of the palm region (palm area) is stored in successive slots every 1/30 second. Once a palm area has been stored in the 60th slot, the next palm area overwrites the first slot. When a predetermined amount of data has accumulated, the palm area storage area 2221 is consulted to determine whether the ratio of the smallest stored palm area to the largest stored palm area is less than, as an example, 3/4 (S106). If the ratio is less than 3/4, the palm shape is regarded as having changed (palm movement detected), and "1" is stored in the palm change storage area 2222 (S107). If the ratio is 3/4 or more, the palm shape is regarded as unchanged (no palm movement detected), and "0" is stored in the palm change storage area 2222 (S108). The palm motion detection process then ends, and processing returns to the motion detection process (see FIG. 11).
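
The 60-slot storage and the 3/4 ratio test can be sketched as follows. The class name `PalmAreaHistory` is hypothetical. A `collections.deque` with `maxlen` is used in place of explicit slot indices; it discards the oldest value when full, which has the same effect as overwriting the first slot after the 60th has been filled. How much data counts as the "predetermined amount" is not stated in the text, so the sketch only exposes an `is_full` helper and leaves that decision to the caller.

```python
from collections import deque


class PalmAreaHistory:
    """Fixed-size history of palm areas with a min/max ratio change test."""

    def __init__(self, slots=60, ratio_threshold=3 / 4):
        # deque(maxlen=...) drops the oldest entry once all slots are used.
        self._areas = deque(maxlen=slots)
        self._ratio_threshold = ratio_threshold

    def record(self, area):
        """Store the palm area measured for the latest frame."""
        self._areas.append(area)

    def is_full(self):
        """True once every slot holds a measurement."""
        return len(self._areas) == self._areas.maxlen

    def palm_changed(self):
        """Return 1 if min/max of the stored areas is below the threshold, else 0."""
        largest = max(self._areas, default=0)
        if largest <= 0:
            return 0  # nothing stored yet, or no palm pixels at all
        return 1 if min(self._areas) / largest < self._ratio_threshold else 0
```

At 30 measurements per second, 60 slots correspond to the most recent two seconds of palm areas.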

When the palm motion detection process (S51) ends, the palm change storage area 2222 is referred to, and it is determined whether a palm movement was detected (S52). If "0" is stored in the palm change storage area 2222 and it is therefore determined that no palm movement was detected (S52: NO), the user is regarded as not having made a motion, and "0" is stored in the motion detection storage area 224 (S56). If "1" is stored in the palm change storage area 2222 and it is therefore determined that a palm movement was detected (S52: YES), the arm motion detection process is performed next (S53).

The arm motion detection process will be described with reference to FIG. 13. The arm motion detection process is a subroutine executed in S53 of the motion detection process of FIG. 11. When the arm motion detection process starts, an extraction region from which contour data of the user's arm is to be extracted is first designated (S131). A skin-color portion of at least a certain area in the user's captured image 50 is extracted as the face region, a skin-color portion located below the face region is extracted as the palm region, and the region lying between the face region and the palm region in the vertical direction is designated as the extraction region for the arm contour data. Skin-color portions are extracted as described above.

Next, the captured image 50 is converted to grayscale, and contour data is extracted (S132). The contour data is extracted using the well-known first-order differential method. In first-order differential contour extraction, the strength and direction of a contour are calculated from the density gradient at each pixel, and portions where the density value changes sharply are extracted as contour data. The designation of the extraction region and the extraction of the contour data are performed every 1/30 second.
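A first-order differential contour extraction of the kind referred to in S132 can be sketched as below. The central-difference operator, the BT.601 grayscale weights, the threshold parameter, and the function names are assumptions; the description names only the general method.

```python
import math

def to_gray(rgb_image):
    """Grayscale conversion using ITU-R BT.601 luma weights (an assumption)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for r, g, b in row]
            for row in rgb_image]

def gradient_edges(gray, threshold):
    """Return (x, y, strength, direction) for pixels with a steep density gradient.

    The gradient is a central difference in x and y; strength is its
    magnitude and direction its angle in radians.
    """
    height, width = len(gray), len(gray[0])
    edges = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = (gray[y][x + 1] - gray[y][x - 1]) / 2
            gy = (gray[y + 1][x] - gray[y - 1][x]) / 2
            strength = math.hypot(gx, gy)
            if strength >= threshold:
                edges.append((x, y, strength, math.atan2(gy, gx)))
    return edges
```

Border pixels are skipped here because the central difference needs a neighbor on each side.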

The extracted contour data is stored in the contour data storage area 2231 of the arm motion storage area 223 of the RAM 22 (S133). The contour data storage area 2231 has, for example, 60 storage areas, and every 1/30 second contour data is stored in one of them in turn. Once contour data has been stored in the 60th storage area, the next, most recent contour data overwrites the first storage area. When a predetermined amount of data has accumulated in the contour data storage area 2231, the 60 storage areas are referred to, and the difference ΔX1 between the maximum and minimum X coordinates and the difference ΔY1 between the maximum and minimum Y coordinates among the stored contour data are calculated (S134).

It is then determined whether either the difference ΔX1 or the difference ΔY1 is larger than the predetermined value stored in the predetermined value storage area 314 of the HDD 31 (S135). If either ΔX1 or ΔY1 is larger than the predetermined value (S135: YES), the position of the arm is regarded as having changed (an arm movement was detected), and "1" is stored in the arm change storage area 2232 (S136). If both ΔX1 and ΔY1 are smaller than the predetermined value (S135: NO), the position of the arm is regarded as unchanged (no arm movement was detected), and "0" is stored in the arm change storage area 2232 (S137). The arm motion detection process then ends, and processing returns to the motion detection process (see FIG. 11).
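The decision of S133 to S137 can be sketched as follows. This follows one literal reading in which ΔX1 and ΔY1 are taken over every contour point in every stored frame; reducing each frame to a representative point first would be another possible reading. The names are hypothetical, and a full buffer is assumed to be the "predetermined amount of data".

```python
from collections import deque

class ContourHistory:
    """Ring buffer of per-frame contour point lists."""

    def __init__(self, slots=60):
        self.frames = deque(maxlen=slots)

    def store(self, contour_points):
        # S133: called once every 1/30 second with a list of (x, y) points.
        self.frames.append(list(contour_points))

    def spans(self):
        """S134: (delta_x1, delta_y1) over all stored contour points."""
        xs = [x for frame in self.frames for x, _ in frame]
        ys = [y for frame in self.frames for _, y in frame]
        if not xs:
            return 0, 0
        return max(xs) - min(xs), max(ys) - min(ys)

    def arm_changed(self, threshold):
        """S135-S137: 1 if either span exceeds the threshold, else 0.

        Returns None until the buffer is full (an assumption).
        """
        if len(self.frames) < self.frames.maxlen:
            return None
        delta_x1, delta_y1 = self.spans()
        return 1 if delta_x1 > threshold or delta_y1 > threshold else 0
```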

In the motion detection process, when the arm motion detection process (S53) ends, the arm change storage area 2232 is referred to, and it is determined whether an arm movement was detected (S54). If "0" is stored in the arm change storage area 2232 and no arm movement was detected (S54: NO), the user is regarded as not having made a motion, and "0" is stored in the motion detection storage area 224 (S56). If "1" is stored in the arm change storage area 2232 and it is therefore determined that an arm movement was detected (S54: YES), the user is regarded as having made a motion, and "1" is stored in the motion detection storage area 224 (S55). The motion detection process then ends, and processing returns to the image transmission process (see FIG. 9).

Returning to FIG. 9, when the motion detection process (S4) ends, the motion detection storage area 224 is referred to, and it is determined whether a user motion was detected (S5). If "0" is stored in the motion detection storage area 224, it is determined that no motion was detected (S5: NO), and it is then determined whether the conference has ended (S12). Whether the conference has ended is determined by checking whether one or more terminal devices other than the device itself are connected to the network 2.

When the terminal devices 3 and 4 connect to the network 2, a connection signal indicating the connection is transmitted to the counterpart terminal device. When a connection signal is received from another terminal device, the terminal ID of the sending terminal device is stored in the connected terminal storage area 225 of the RAM 22. Conversely, when a terminal device disconnects from the network 2, a disconnection signal indicating the disconnection is transmitted to the counterpart terminal device. If no terminal device other than the device itself is connected to the network 2 (S12: YES), the process ends. If one or more terminal devices other than the device itself are connected to the network 2 (S12: NO), the processes of S1 to S5 are repeated. That is, while no gesture by the user is detected at the terminal device 3 (S5: NO), the user's face image 51 continues to be transmitted to the terminal device 4 (S3).
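The bookkeeping behind the conference-end check can be sketched as below. The description states that a terminal ID is stored on receipt of a connection signal; removing the ID on receipt of a disconnection signal is an assumption made here, and the class and method names are hypothetical.

```python
class ConnectedTerminals:
    """Stand-in for the connected terminal storage area 225."""

    def __init__(self):
        self.terminal_ids = set()

    def on_connection_signal(self, terminal_id):
        # Store the ID of the terminal that sent the connection signal.
        self.terminal_ids.add(terminal_id)

    def on_disconnection_signal(self, terminal_id):
        # Assumption: the stored ID is removed when the peer disconnects.
        self.terminal_ids.discard(terminal_id)

    def conference_over(self):
        """True when no terminal other than this one remains connected."""
        return len(self.terminal_ids) == 0
```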

If "1" is stored in the motion detection storage area 224, it is determined that a motion was detected (S5: YES), and an upper body image range determination process is performed to determine the range of the user's upper body image 52 within the captured image 50 stored in the captured image data storage area 311 (S6).

The upper body image range determination process will be described with reference to FIG. 14. The upper body image range determination process is a subroutine executed in S6 of the image transmission process of FIG. 9. As shown in FIG. 14, when the process starts, the user region in the captured image is first identified (S171). Specifically, the captured image data storage area 311 is referred to, and the color, geometric shape, shading pattern, movement, and the like of the captured image are extracted as parameters. A database of person image parameters stored in advance in the other information storage area 315 of the HDD 31 is referred to, and the extracted parameters are matched against the person image parameters stored in the database. The portion of the captured image that matches the stored person image parameters well is then identified, and the image of that portion is identified as the user region. Next, with the lower left corner of the image area of the captured image 50 taken as the origin of the XY coordinates of the captured image 50, the maximum value x4 and the minimum value x5 of the X coordinate of the user region are detected (S172). The detected x4 and x5 are stored in the user region storage area (not shown) of the image range storage area 221.

Next, the position of the user's nose is identified in the captured image 50 (S173). The nose position is identified as described above. The X coordinate x3 of the detected nose position is stored in the nose position storage area (not shown) of the image range storage area 221. Then, with reference to the nose position storage area and the user region storage area, the difference e between x4 and x3 and the difference f between x3 and x5 are calculated from x3, x4, and x5 (S174). The differences e and f are given by the following equations.
・e = x4 - x3
・f = x3 - x5

The larger of the two calculated differences e and f is taken as the second difference β and stored in the second difference storage area (not shown) of the image range storage area 221. A second enlarged value M is then calculated by adding the second difference β and the predetermined value U stored in the predetermined value storage area 314 (S175). The second enlarged value M is given by the following equation.
・M = β + U
The calculated second enlarged value M is stored in the second enlarged value storage area (not shown) of the image range storage area 221.

Next, with reference to the second enlarged value storage area and the nose position storage area, the X coordinate that is larger than the X coordinate x3 of the nose position by M, (x3 + M), and the X coordinate that is smaller by M, (x3 - M), are calculated. Likewise, the Y coordinate that is larger than the Y coordinate y3 of the nose position by (M × (2/3)), (y3 + (M × (2/3))), and the Y coordinate that is smaller by (M × (4/3)), (y3 - (M × (4/3))), are calculated. (x3 + M) is the maximum X coordinate of the upper body image 52, and (x3 - M) is its minimum X coordinate. (y3 + (M × (2/3))) is the maximum Y coordinate of the upper body image 52, and (y3 - (M × (4/3))) is its minimum Y coordinate. The coordinates of the four points formed by combining these X and Y coordinates are then calculated. The four points are as follows.
・(x3 + M, y3 + (M × (2/3)))
・(x3 + M, y3 - (M × (4/3)))
・(x3 - M, y3 + (M × (2/3)))
・(x3 - M, y3 - (M × (4/3)))

The coordinates of the four points are stored in the image range storage area 221 as information indicating the range of the upper body image 52 (S176). The upper body image range determination process then ends, and processing returns to the image transmission process (see FIG. 9).
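The calculation of S174 to S176 can be summarized in a short Python sketch. The function name and the argument order are hypothetical; the symbols x3, y3, x4, x5, U, e, f, β, and M follow the description, with the origin at the lower left of the captured image.

```python
def upper_body_range(x3, y3, x4, x5, u):
    """Return the four corner points of the upper body image range.

    x3, y3: nose position; x4, x5: maximum and minimum X of the user
    region; u: the predetermined value U.
    """
    e = x4 - x3            # S174: distance from nose to right edge of user region
    f = x3 - x5            # S174: distance from nose to left edge of user region
    beta = max(e, f)       # second difference: the larger of e and f
    m = beta + u           # S175: second enlarged value M = beta + U
    x_max, x_min = x3 + m, x3 - m
    y_max = y3 + m * 2 / 3
    y_min = y3 - m * 4 / 3
    # S176: the four corner points stored as the image range
    return [(x_max, y_max), (x_max, y_min), (x_min, y_max), (x_min, y_min)]

corners = upper_body_range(x3=300, y3=400, x4=420, x5=210, u=30)
```

Because 2/3 M above the nose plus 4/3 M below it equals 2M, the range is always 2M wide and 2M high, with the nose centered horizontally and one third of the way down from the top edge.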

Returning to FIG. 9, when the upper body image range determination process (S6) ends, the captured image data storage area 311 and the image range storage area 221 are referred to, and the range of the upper body image 52 is cut out of the captured image 50 and transmitted to the terminal device 4 (S7). In other words, when it is determined that the user of the terminal device 3 is making a gesture (S5: YES), an image of the user's upper body is transmitted to the terminal device 4 (S7).

When the user's upper body image 52 has been transmitted to the terminal device 4 (S7), the connected terminal storage area 225 is referred to, and it is determined whether one or more terminal devices other than the device itself are connected to the network 2 (S8). If no other terminal device is connected to the network 2, the conference is regarded as having ended (S8: YES), and the process ends. If one or more other terminal devices are connected to the network 2, the conference is regarded as not having ended (S8: NO), and the motion detection process for detecting whether the user is making a motion is performed again (S9). The process of S9 is the same as that of S4, so its description is omitted.

When the motion detection process (S9) ends, the motion detection storage area 224 is referred to, and it is determined whether a user motion was detected (S10). If "1" is stored in the motion detection storage area 224, it is determined that a motion was detected (S10: YES), and it is then determined whether the user's face region and palm region are contained in the range of the cut-out upper body image 52 (S11). Specifically, the captured image data storage area 311 is first referred to, and the user's face region and palm region are extracted from the captured image 50. The face region and the palm region are extracted as described above. The maximum and minimum X and Y coordinates of the extracted face region and palm region are then detected, and it is determined whether the detected maximum and minimum values lie within the image range stored in the image range storage area 221.

If the detected palm region and face region are both determined to lie within the image range stored in the image range storage area 221 (S11: YES), processing returns to S8, where it is determined whether the conference has ended. If either the face region or the palm region is determined to extend outside the image range stored in the image range storage area 221 (S11: NO), the range of the user's upper body image 52 is determined again from the captured image 50 stored in the captured image data storage area 311 (S6). The newly determined range of the upper body image 52 overwrites the image range storage area 221 as the latest image range.
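The containment test of S11 can be sketched as follows. Representing the stored image range and each region as a (x_min, y_min, x_max, y_max) tuple is an assumption made for illustration, as are the function names.

```python
def bounding_box(points):
    """Minimum and maximum X and Y of a region given as (x, y) pixels."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)

def inside(image_range, box):
    """True if box lies entirely within image_range (both x_min, y_min, x_max, y_max)."""
    rx_min, ry_min, rx_max, ry_max = image_range
    bx_min, by_min, bx_max, by_max = box
    return (rx_min <= bx_min and bx_max <= rx_max
            and ry_min <= by_min and by_max <= ry_max)

def range_still_valid(image_range, face_points, palm_points):
    """S11: YES (True) only if both the face and the palm fit in the stored range."""
    return all(inside(image_range, bounding_box(p))
               for p in (face_points, palm_points))
```

A False result corresponds to S11: NO, after which the range is determined again in S6.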

If, on the other hand, "0" is stored in the motion detection storage area 224 in S10, it is determined that no user motion was detected (S10: NO), and it is then determined whether the conference has ended (S12). If no terminal device other than the device itself is connected to the network 2, the conference is regarded as having ended (S12: YES), and the process ends. If one or more other terminal devices are connected to the network 2, the conference is regarded as not having ended (S12: NO), and processing returns to S1. That is, when no user motion is detected at the terminal device 3 (S10: NO) and the conference has not ended (S12: NO), the user's face image 51, rather than the user's upper body image 52, is transmitted to the terminal device 4 (S3).
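The branching of the image transmission process of FIG. 9, as described in the text, can be written as a transition function. This is only a reading aid with hypothetical names. The labels for S1 (capture) and S2 (face image range determination) are inferred from the parallel steps S71 and S72 of the second embodiment and are an assumption.

```python
def next_step(step, *, motion=False, in_range=True, conference_over=False):
    """Return the step that follows `step` in the image transmission process.

    motion:          result of the latest motion detection (S4 or S9)
    in_range:        result of the S11 containment check
    conference_over: result of the S8 / S12 connected-terminal check
    """
    linear = {
        "S1": "S2",   # capture image (assumed label)
        "S2": "S3",   # determine face image range (assumed label)
        "S3": "S4",   # transmit face image 51
        "S4": "S5",   # motion detection
        "S6": "S7",   # determine upper body image range
        "S7": "S8",   # transmit upper body image 52
        "S9": "S10",  # motion detection again
    }
    if step in linear:
        return linear[step]
    if step == "S5":
        return "S6" if motion else "S12"
    if step == "S8":
        return "END" if conference_over else "S9"
    if step == "S10":
        return "S11" if motion else "S12"
    if step == "S11":
        return "S8" if in_range else "S6"
    if step == "S12":
        return "END" if conference_over else "S1"
    raise ValueError(f"unknown step: {step}")
```

For example, next_step("S10", motion=False) leads to S12 and, if the conference continues, back to S1, which is the path on which the face image 51 is sent again.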

Next, the image reception process executed by the CPU 20 of the terminal device 4 at the other site will be described with reference to the flowchart of FIG. 15. When the terminal device 3 and the terminal device 4 each connect to the network and begin communicating with each other, the image reception process shown in FIG. 15 starts. When the image reception process starts, it is determined whether an image transmitted from the terminal device 3 has been received (S31). If no image has been received (S31: NO), the process of S31 is repeated until an image is received. If an image has been received (S31: YES), the received image is displayed on the display 28 by the video controller 23 (S32).

As described above, when the user of the terminal device 3 is making a gesture, the terminal device 3 transmits image data corresponding to the user's upper body image 52. When the user of the terminal device 3 is not making a gesture, the terminal device 3 transmits image data corresponding to the user's face image 51. Accordingly, the display 28 of the terminal device 4 shows an image of the user's upper body when the user of the terminal device 3 is making a gesture (see FIG. 5) and an image of the user's face when the user is not making a gesture (see FIG. 6).

The connected terminal storage area 225 is then referred to, and it is determined whether one or more terminal devices other than the device itself are connected to the network 2 (S33). If no other terminal device is connected to the network 2 (S33: YES), the process ends. If one or more other terminal devices are connected to the network 2 (S33: NO), processing returns to S31, and the processes of S31 to S33 are repeated.

As described above, the terminal device 3 of the first embodiment is connected to the other terminal device 4 via the network 2. Together, these terminal devices constitute the video conference system 1, which carries out a remote conference by exchanging images and sound between the terminal devices. In the video conference system 1, a motion (gesture) of the user of the terminal device 3 (or 4) is detected during the remote conference. When the user is not making a gesture, data corresponding to the user's face image 51 is transmitted to the terminal device 4 (or 3) of the other user. When the user is making a gesture, data corresponding to the user's upper body image 52 is transmitted to the terminal device 4 (or 3) of the other user.

Accordingly, the display 28 of the terminal device 4 shows an image of the user's upper body when the user of the terminal device 3 is making a gesture (see FIG. 5) and an image of the user's face when the user is not making a gesture (see FIG. 6). When the user tries to express a feeling through a gesture, the other user can therefore see the gesture. When the user is not making a gesture, the other user can see the user's facial expression. Conference participants at different sites can thus communicate well with one another. In addition, when the user is not making a gesture, the image data of the face image 51, which is smaller in data volume than the upper body image 52, is transmitted, so the communication load can be reduced.

In the above description, the camera 34 shown in FIG. 2 corresponds to the "user photographing means" of the present invention. The display 28 shown in FIG. 2 corresponds to the "display screen" of the present invention. The CPU 20 that executes the face image range determination process shown in FIG. 10 corresponds to the "second image range determining means" of the present invention. The CPU 20 that executes the motion detection process shown in FIG. 11 corresponds to the "motion detecting means" of the present invention. The CPU 20 that executes the upper body image range determination process shown in FIG. 14 corresponds to the "first image range determining means" of the present invention. The CPU 20 that executes the processes of S3 and S7 shown in FIG. 9 corresponds to the "image transmitting means" of the present invention. The CPU 20 that executes the process of S172 shown in FIG. 14 and the CPU 20 that executes the process of S154 shown in FIG. 10 correspond to the "nose position detecting means" of the present invention. The CPU 20 that executes the process of S31 shown in FIG. 15 corresponds to the "image receiving means" of the present invention. The CPU 20 that executes the process of S32 shown in FIG. 15 corresponds to the "display control means" of the present invention.

Next, a terminal device 130 according to a second embodiment of the present invention will be described. The first embodiment assumes a conference held with one user at each of the terminal devices 3 and 4. The second embodiment differs from the first in that, when there are a plurality of users at the terminal devices 3 and 4 of each site, the speaker among them is identified and made the subject photographed by the camera 34. Like the terminal device 3 of the first embodiment, the terminal device 130 of the second embodiment forms part of the video conference system 1 shown in FIG. 1.

First, the electrical configuration of the terminal device 130 will be described with reference to FIG. 16. The terminal device 130 is provided with a CPU 120 as a controller that controls the terminal device 130. Connected to the CPU 120 are a ROM 121 that stores the BIOS and the like, a RAM 122 that temporarily stores various data, and an I/O interface 30 that mediates the transfer of data. Connected to the I/O interface 30 are a hard disk drive 131 (hereinafter, HDD 131) having various storage areas, a voice direction detection device 36, and a drive circuit 37. A microphone 35, to which the user's voice is input, is connected to the voice direction detection device 36. The voice direction detection device 36 detects the direction of the sound source from which a voice was emitted, based on the phase difference of the voice input to the microphone 35. A camera moving device 38 that rotates the camera 34 is connected to the drive circuit 37.

As shown in FIG. 17, the RAM 122 is provided with a sound source direction storage area 227, in which the detected direction of the sound source is stored, in addition to the same storage areas as the RAM 22 of the first embodiment (see FIG. 4). The rest of the electrical configuration of the terminal device 130 is the same as that of the terminal device 3 of the first embodiment (see FIG. 2).

The method of identifying the speaker will now be described. Various well-known methods can be applied, for example the methods described in JP-A-11-341334 and JP-A-2001-339703. In this modification, the voice direction detection device 36 first detects the direction of the sound source from which the voice was emitted, based on the phase difference of the voice input to the microphone 35. The camera 34 is then moved by the camera moving device 38 so that it photographs the detected direction of the sound source. Next, the shooting range of the camera 34 is narrowed.

The color, geometric shape, shading pattern, movement, and the like of the captured image are then extracted as parameters. Next, the database of person image parameters stored in advance in the other information storage area 315 of the HDD 31 is referred to, and the extracted parameters are matched against the person image parameters stored in the database. The portion of the captured image that matches the stored person image parameters well is then identified, and the image of that portion is identified as the image of the speaker.

Next, the image transmission process performed by the CPU 120 will be described with reference to the flowchart of FIG. 18. In this embodiment as well, the terminal device 130 and the terminal device 4 each perform both an "image transmission process" for transmitting images and an "image reception process" for receiving images. The "image reception process" is the same as in the first embodiment, so its description is omitted.

When the terminal device 130 and the terminal device 4 each connect to the network and begin communicating with each other, the image transmission process shown in FIG. 18 starts. When the image transmission process starts, the camera 34 is first driven, and acquisition of the images captured by the camera 34 begins (S71). The images captured by the camera 34 are stored in the captured image data storage area 311.

Next, it is determined whether a voice has been input from the microphone 35 (S84). If no voice has been input from the microphone 35, it is determined that there is no speaker at the local site (S84: NO), and the processes of S71 and S84 are repeated.

If, on the other hand, a voice has been input from the microphone 35, it is determined that there is a speaker at the local site (S84: YES), and the voice direction detection device 36 detects the direction of the sound source from which the voice was emitted, based on the phase difference of the voice input from the microphone 35. The direction of the sound source is stored in the sound source direction storage area 227. The sound source direction storage area 227 is then referred to, the drive circuit 37 drives the camera moving device 38, and the shooting direction of the camera 34 is turned toward the sound source. The image of the speaker is then identified by the matching process (S85). The identified image of the speaker is stored in the captured image data storage area 311.
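The description does not say how the voice direction detection device 36 derives a direction from the phase difference. As a rough illustration only, the sketch below uses a common far-field model for two microphones a known distance apart. The two-microphone arrangement, the single-frequency phase measurement, the speed-of-sound constant, and the function name are all assumptions and are not taken from the source.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at about 20 degrees C (assumed)

def source_angle_deg(phase_diff_rad, frequency_hz, mic_spacing_m):
    """Estimate the arrival angle from the phase difference between two microphones.

    The angle is measured from the direction perpendicular to the line
    joining the microphones (0 = straight ahead), using
    sin(angle) = c * delay / spacing with delay = phase / (2 * pi * f).
    """
    delay = phase_diff_rad / (2 * math.pi * frequency_hz)
    ratio = SPEED_OF_SOUND * delay / mic_spacing_m
    if abs(ratio) > 1:
        raise ValueError("phase difference is not physically possible for this spacing")
    return math.degrees(math.asin(ratio))
```

The phase of a single frequency is unambiguous only while the microphone spacing is less than half the wavelength, so a practical detector would need to take that limit into account.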

Next, a face image range determination process is performed to determine the range of the speaker's face image from the captured image stored in the captured image data storage area 311 (S72). Since the face image range determination process is the same as in the first embodiment, its description is omitted. When the face image range determination process ends (S72), the captured image data storage area 311 and the image range storage area 221 are referred to, and the speaker's face image is cut out from the captured image as the transmission image and transmitted to the terminal device 4 (S73).

When the speaker's face image has been transmitted to the terminal device 4 (S73), a motion detection process is performed to detect whether the speaker is making a motion (S74). Since the motion detection process is the same as in the first embodiment, its description is omitted. When the motion detection process (S74) ends, the motion detection storage area 224 is referred to, and it is determined whether a motion of the speaker has been detected based on the captured image (S75). If "0" is stored in the motion detection storage area 224, it is determined that no motion has been detected (S75: NO), and it is then determined whether the conference has ended (S82).

If the conference has ended (S82: YES), the process ends. On the other hand, if one or more terminals other than the terminal itself are connected to the network 2 (S82: NO), the processes of S71 to S75 are repeated. That is, in the terminal device 130, when no motion of the speaker is detected (S75: NO) and the conference has not ended (S82: NO), the speaker's face image continues to be transmitted to the terminal device 4 as the transmission image (S73).

If "1" is stored in the motion detection storage area 224, it is determined that a motion has been detected (S75: YES), and an upper body image range determination process is performed to determine the range of the speaker's upper body image from the captured image stored in the captured image data storage area 311 (S76). Since the upper body image range determination process is the same as in the first embodiment, its description is omitted. When the upper body image range determination process ends, the captured image data storage area 311 and the image range storage area 221 are referred to, and the speaker's upper body image is cut out from the captured image as the transmission image and transmitted to the terminal device 4 (S77). That is, in the terminal device 130, when a motion of the speaker is detected (S75: YES), the speaker's upper body image is transmitted to the terminal device 4 (S77).

When the speaker's upper body image has been transmitted to the terminal device 4 (S77), the connected terminal storage area 225 is referred to, and it is determined whether one or more terminals other than the terminal itself are connected to the network 2 (S78). If no terminal other than the terminal itself is connected to the network 2, the conference is regarded as having ended (S78: YES), and the process ends. On the other hand, if one or more terminals other than the terminal itself are connected to the network 2, the conference is regarded as not having ended (S78: NO), and the motion detection process for detecting whether the speaker is making a motion is performed again (S79).

When the motion detection process (S79) ends, the motion detection storage area 224 is referred to, and it is determined whether a motion of the speaker has been detected (S80). If "1" is stored in the motion detection storage area 224, it is determined that a motion has been detected (S80: YES). In that case, it is determined, by the same method as in the first embodiment, whether the speaker's face area and palm area are included in the cut-out upper body image range (S81).

When it is determined that the maximum and minimum values of the detected palm area and face area are all within the image range stored in the image range storage area 221 (S81: YES), the process returns to S78 and it is determined whether the conference has ended. When it is determined that either the face area or the palm area extends outside the image range stored in the image range storage area 221 (S81: NO), the process returns to S71 and the processing is repeated.
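The S81 check can be illustrated as a bounding-box containment test. The sketch below assumes that each area and the stored image range are represented by minimum and maximum pixel coordinates; the `Rect` type and the function names are hypothetical and are not taken from the specification.

```python
from typing import Iterable, NamedTuple


class Rect(NamedTuple):
    """Axis-aligned box given by its minimum and maximum coordinates."""
    x_min: int
    y_min: int
    x_max: int
    y_max: int


def contains(outer: Rect, inner: Rect) -> bool:
    """True if every minimum and maximum of `inner` lies within `outer`."""
    return (outer.x_min <= inner.x_min and inner.x_max <= outer.x_max and
            outer.y_min <= inner.y_min and inner.y_max <= outer.y_max)


def regions_within_range(image_range: Rect, regions: Iterable[Rect]) -> bool:
    """S81: YES only if the face area and all palm areas fit in the range."""
    return all(contains(image_range, r) for r in regions)
```

With this representation, a palm that moves past the edge of the transmitted upper body range makes `regions_within_range` return `False`, which corresponds to S81: NO and a return to S71.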

If "0" is stored in the motion detection storage area 224 in S80, it is determined that no motion of the speaker has been detected (S80: NO), and it is then determined whether the conference has ended (S82). If no terminal other than the terminal device at the local site is connected to the network 2, the conference is regarded as having ended (S82: YES), and the process ends. On the other hand, if one or more terminals other than the terminal itself are connected to the network 2, the conference is regarded as not having ended (S82: NO), and the process returns to S71.
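The branching described for FIG. 18 can be summarized as two small transition functions, one for each kind of transmitted image. This is only an illustrative sketch: the function names and the string labels for the next step are hypothetical, and the stored values (the flag in the motion detection storage area 224 and the terminal count derived from the connected terminal storage area 225) are passed in as plain arguments.

```python
def after_face_image(motion_flag: int, other_terminals: int) -> str:
    """Steps S75 and S82, taken after the face image is sent in S73."""
    if motion_flag == 1:       # S75: YES
        return "S76"           # determine the upper body range, then send (S77)
    if other_terminals == 0:   # S82: YES, the conference has ended
        return "END"
    return "S71"               # S82: NO, start over from image acquisition


def after_upper_body_image(motion_flag: int, other_terminals: int,
                           regions_in_range: bool) -> str:
    """Steps S78 to S82, taken after the upper body image is sent in S77."""
    if other_terminals == 0:   # S78: YES, the conference has ended
        return "END"
    if motion_flag == 0:       # S80: NO, then S82: NO (same terminal count)
        return "S71"
    if regions_in_range:       # S81: YES, keep the current upper body range
        return "S78"
    return "S71"               # S81: NO, redo the range from S71
```

Written this way, the upper body image keeps being used only while the speaker is still moving and the face and palm areas remain inside the stored range.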

As described above, when there are a plurality of users at one site, the terminal device 130 of the second embodiment identifies the speaker from among them and transmits the speaker's face image or upper body image to the other terminal device. Therefore, when the speaker tries to express his or her feelings with a gesture, a user at another site can check the speaker's gesture. When the speaker does not make a gesture, the other user can check the speaker's facial expression. Conference participants at different sites can therefore communicate well with each other.

The CPU 120 that performs the process of S85 shown in FIG. 18 corresponds to the "speaker specifying means" of the present invention. The CPU 120 that recognizes a person image from the captured image in the process of S85 shown in FIG. 18 corresponds to the "person recognizing means" of the present invention. The voice direction detection device 36 shown in FIG. 16 corresponds to the "voice detecting means" of the present invention.

The present invention is not limited to the first and second embodiments described above, and various modifications are possible. In the above embodiments, the terminal device 3 performed the image transmission process of cutting out a face image or an upper body image from the acquired captured image and transmitting the cut-out image to the other terminal device 4, and the terminal device 4 received the captured image transmitted from the other terminal device 3. However, the present invention is not limited to this configuration, and other configurations may be used.

For example, when an MCU (Multipoint Control Unit) that controls the entire video conference is connected to the network 2, the following processing may be performed by the terminal device 3, the terminal device 4, and the MCU. The terminal device 3 performs processing for transmitting the captured image to the MCU. The MCU performs processing for cutting out a face image or an upper body image from the captured image transmitted from the terminal device 3 by the method described above and transmitting the cut-out image to the terminal device 4. The terminal device 4 performs processing for receiving the captured image transmitted from the MCU and displaying it on the display 28. Although, for convenience of explanation, the case where the terminal device 3 transmits an image and the terminal device 4 receives it has been described as an example, both the terminal devices 3 and 4 perform the process of transmitting captured images to the MCU and the image reception process of receiving images transmitted from the MCU.

In the first and second embodiments described above, a video conference system made up of two terminal devices 3 and 4 was described as an example for convenience of explanation, but the invention is also applicable to a video conference system made up of two or more terminal devices.

In the second embodiment, the method used to identify the speaker from among a plurality of users participating in the conference was to detect the direction of the sound input to the microphone 35 and to estimate that the speaker is in the detected direction. The method of identifying the speaker is not limited to this. For example, changes in the lip shapes of the plurality of users may each be detected from the image captured by the camera, and a user whose lip shape changes may be identified as the speaker.

A method of identifying the speaker from changes in lip shape is therefore described. In this method, the color, geometric shape, shading pattern, movement, and the like of the image captured by the camera 34 are first extracted as parameters. Next, a database of person image parameters stored in advance in the other information storage area 315 of the HDD 31 is referred to, and matching processing is executed between the extracted parameters and the person image parameters stored in the database. The portions of the captured image that match the person image parameters stored in the database well are then identified, and the image of each identified portion is identified as an image of a person.

Next, a face area is detected in each of the identified person images. Contour data of each identified person image is then extracted, and if the extracted contour data changes in the lower half of the face area, a change in lip shape is detected. The face area detection method and the contour data extraction method are as described above. The image of a person whose lip shape changes is then identified as the image of the speaker. In this way, the speaker can be identified from among a plurality of conference participants.
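A minimal sketch of the lip-shape approach follows, assuming that the contour data of two consecutive frames is available as binary edge maps (nested lists of 0 and 1) and that each face area is a rectangle in pixel coordinates. The edge-map representation, the pixel-count measure of change, the threshold parameter, and all names are assumptions made for illustration; the specification says only that a change of the contour data in the lower half of the face area is detected.

```python
from typing import List, Tuple

Face = Tuple[int, int, int, int]  # (top, bottom, left, right); bottom/right exclusive
EdgeMap = List[List[int]]         # hypothetical contour data: 1 marks an edge pixel


def lower_half_change(prev: EdgeMap, curr: EdgeMap, face: Face) -> int:
    """Count contour pixels that differ in the lower half of a face area."""
    top, bottom, left, right = face
    mid = (top + bottom) // 2
    return sum(1
               for y in range(mid, bottom)
               for x in range(left, right)
               if prev[y][x] != curr[y][x])


def find_speakers(prev: EdgeMap, curr: EdgeMap, faces: List[Face],
                  threshold: int) -> List[int]:
    """Return the indices of faces whose lower-half contour change reaches
    the threshold, i.e. the persons treated as speaking."""
    return [i for i, face in enumerate(faces)
            if lower_half_change(prev, curr, face) >= threshold]
```

Restricting the comparison to the lower half of the face area keeps changes around the eyes or hairline from being mistaken for lip movement.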

In the first and second embodiments, the user's motion was regarded as detected when movements of both the palm and the arm were detected, but the user's motion may instead be regarded as detected when at least one of the palm movement and the arm movement is detected. Specifically, in the motion detection process of the first and second embodiments (see FIG. 11), the user's motion was regarded as detected (S55) when a palm movement was detected (S52: YES) and an arm movement was also detected (S54: YES). However, the processes of S53 and S54 may be omitted, so that the user's motion is regarded as detected (S55) when a palm movement is detected (S52: YES) and as not detected (S56) when no palm movement is detected (S52: NO). Alternatively, the processes of S51 and S52 may be omitted, so that the user's motion is regarded as detected (S55) when an arm movement is detected (S54: YES) and as not detected (S56) when no arm movement is detected (S54: NO). The user's motion may also be regarded as detected (S55) when either a palm movement is detected (S52: YES) or an arm movement is detected (S54: YES).
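The four variants of the motion detection decision can be written as a single function with a selectable policy. The policy names and the function itself are hypothetical. The returned value corresponds to the "1" or "0" written to the motion detection storage area 224.

```python
def motion_flag(palm_moved: bool, arm_moved: bool, policy: str = "both") -> int:
    """Return 1 (motion detected, S55) or 0 (not detected, S56).

    policy:
      "both"      - palm and arm must both move (first and second embodiments)
      "palm_only" - only the palm result is used (S53 and S54 omitted)
      "arm_only"  - only the arm result is used (S51 and S52 omitted)
      "either"    - a palm movement or an arm movement is sufficient
    """
    if policy == "both":
        detected = palm_moved and arm_moved
    elif policy == "palm_only":
        detected = palm_moved
    elif policy == "arm_only":
        detected = arm_moved
    elif policy == "either":
        detected = palm_moved or arm_moved
    else:
        raise ValueError(f"unknown policy: {policy}")
    return 1 if detected else 0
```

The "either" policy switches to the upper body image most readily, while "both" is the strictest and matches the behavior described for the embodiments.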

1 Video conference system
2 Network
3 Terminal device
4 Terminal device
20 CPU
23 Video controller
25 Communication device
28 Display
30 Interface
31 Hard disk drive
34 Camera
35 Microphone
36 Voice direction detection device
221 Image range storage area
222 Palm motion storage area
223 Arm motion storage area
224 Motion detection storage area
311 Captured image data storage area
312 Display screen data storage area

Claims

A communication terminal device that communicates with another communication terminal device connected via a network by means of images and sound, comprising:
photographing means for photographing a user;
motion detection means for detecting, as a motion of the user, a state in which the movement of at least one of the user's palm and arm in a captured image captured by the photographing means is a predetermined amount or more;
first image range determining means for determining, as a first image range, the range of an upper body area image including the user's face area, palm area, and arm area from the captured image captured by the photographing means when the motion is detected by the motion detection means;
second image range determining means for determining, as a second image range, the image range of the user's face area from the captured image when the motion is not detected by the motion detection means;
image transmitting means for transmitting the image of the first image range determined by the first image range determining means or the image of the second image range determined by the second image range determining means to the other communication terminal device;
image receiving means for receiving an image transmitted from the other communication terminal device; and
image display control means for displaying the image on a display screen when the image is received by the image receiving means.

The communication terminal device according to claim 1, wherein the motion detection means detects, as the motion, a state in which both the palm and the arm move by a predetermined amount or more.

The communication terminal device according to claim 1, wherein the motion detection means detects, as the movement of the palm, a state in which the shape of the palm changes by a predetermined amount or more.

The communication terminal device according to claim 1, wherein the motion detection means detects, as the movement of the arm, a state in which the position of the arm changes by a predetermined amount or more.

The communication terminal device according to claim 1, further comprising nose position detection means for detecting the nose position of the target user from the captured image,
wherein the first image range determining means determines the first image range such that the nose position detected by the nose position detection means is the center point of the first image range in the horizontal direction.

The communication terminal device according to claim 1, further comprising speaker specifying means for specifying a speaker from among a plurality of users,
wherein the motion detection means detects, as a motion of the speaker, a state in which the movement of at least one of the palm and arm of the speaker specified by the speaker specifying means is the predetermined amount or more,
the first image range determining means determines an upper body image range including the speaker's face area, palm area, and arm area as the first image range, and
the second image range determining means determines the range of a face image including the speaker's face area as the second image range.

The communication terminal device according to claim 6, further comprising:
person recognizing means for recognizing a person from the captured image captured by the photographing means; and
mouth shape detecting means for detecting a change in the mouth shape of the person recognized by the person recognizing means,
wherein the speaker specifying means specifies, as the speaker, a person for whom a change in mouth shape of a predetermined amount or more is detected by the mouth shape detecting means.

The communication terminal device according to claim 6, further comprising voice detecting means for detecting the voice of the user and the direction of the voice,
wherein the speaker specifying means specifies a person in the direction detected by the voice detecting means as the speaker.

A communication control device that is connected to a plurality of communication terminal devices via a network and controls communication performed between the communication terminal devices, comprising:
captured image receiving means for receiving a captured image that is captured by photographing means of a communication terminal device and transmitted from the communication terminal device;
motion detection means for detecting, as a motion of a user, a state in which at least one of the user's palm and arm is moving, based on the captured image received by the captured image receiving means;
first image range determining means for determining, as a first image range, the image range of an upper body area including the user's face area, palm area, and arm area from the captured image when the motion is detected by the motion detection means;
second image range determining means for determining, as a second image range, the range of a face image including the user's face area from the captured image when the motion is not detected by the motion detection means; and
image transmitting means for transmitting the image of the first image range determined by the first image range determining means or the image of the second image range determined by the second image range determining means to the communication terminal device.

A communication control method for a communication terminal device that communicates with another communication terminal device connected via a network by means of images and sound, comprising:
a motion detection step of detecting, as a motion of a user, a state in which at least one of the user's palm and arm is moving;
a first image range determination step of determining, as a first image range, the range of an upper body area image including the user's face area, palm area, and arm area from a captured image captured by photographing means that photographs the user when the motion is detected in the motion detection step;
a second image range determination step of determining, as a second image range, the image range of the user's face area from the captured image when the motion is not detected in the motion detection step;
an image transmission step of transmitting the image of the first image range determined in the first image range determination step or the image of the second image range determined in the second image range determination step to the other communication terminal device;
an image reception step of receiving an image transmitted from the other communication terminal device; and
an image display control step of displaying the image on a display screen when the image is received in the image reception step.

A communication control program for causing a computer to function as various processing means of the communication terminal device according to claim 1.