JP2003230049A

JP2003230049A - Camera control method, camera controller and video conference system

Info

Publication number: JP2003230049A
Application number: JP2002029428A
Authority: JP
Inventors: Takahiro Onishi; 崇浩大西
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2002-02-06
Filing date: 2002-02-06
Publication date: 2003-08-15

Abstract

PROBLEM TO BE SOLVED: To provide a camera control method that uses a small number of voice input sections, eliminates the effect of surrounding reverberation, and accurately displays the face image of the speaker on a screen.

SOLUTION: A sound source direction detection section 15 detects the direction of a sound source from the voice frequency and/or delay time of signals from a plurality of voice input sections, fewer in number than the participants. A camera control/position discrimination section 11, capable of pan, tilt and zoom control, turns a camera 10 toward that direction. A face image extraction/position discrimination section 8 extracts a person's face image from the received image, and the camera control/position discrimination section 11 is controlled so that the extracted face image is displayed at a prescribed position and/or size on the screen. When two or more face images are extracted, the face image of the person closest to the sound source direction is selected and displayed. When no face image is detected, or the face image cannot be displayed within the screen, the zoom is automatically moved to wide angle, and the face image of the person then displayed is shown at the prescribed position and/or size.

COPYRIGHT: (C)2003,JPO

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a camera control method, a camera control device, and a video conference system. In particular, it relates to a camera control method, camera control device, and video conference system that detect the direction of a sound source with a plurality of voice input means, control the direction of a camera, and have means for extracting a face image from the image captured by the camera and means for recognizing the position and size of the face image on the screen. More specifically, in a video conference system, for example, the invention automatically and accurately controls the orientation and zoom of the camera with respect to the speaker so that the video conference can proceed smoothly.

[0002]

[Prior Art] The following conventional examples exist for camera control methods and camera control devices that accurately display the image of a speaker on a screen in a video conference or the like. Japanese Unexamined Patent Publication No. 5-268599, "Automatic control system for a person-imaging camera in a video conference system," describes a technique for automatically controlling the position and size of a person on the screen by means of the person's contour and pattern matching. That is, the contour of a person is extracted from the output image of the person-imaging camera, and the image containing the extracted person is compared with pre-stored sample data of a person image positioned at the center of the screen. Pan, tilt and zoom control of the camera is then performed so as to approach the state of the sample data, so that the imaged person is automatically displayed at a position and size near the center of the display screen.

[0003]

Japanese Unexamined Patent Publication No. 9-307870, "Automatic camera direction control device in a video conference system," describes a technique that detects the direction of the speaker from a plurality of voice input means, points the camera toward the position of the voice input means, and extracts the face contour of the speaker to adjust the camera direction. That is, each participant in the video conference wears an earphone microphone with a lamp that lights while the participant is speaking. When a participant speaks, a lamp detection circuit detects the lit lamp. Based on the position data of the detected lamp, a camera direction control circuit turns the camera toward that lamp. A face contour extraction circuit then refers to standard face contour data registered in advance, detects the position of the speaker's face, and points the camera accurately at it.

[0004]

Published Japanese Translation of PCT Application No. 2000-512108, "Sound source localization method and apparatus," describes a technique in which the voice signals obtained by two voice input means are frequency-analyzed in time-divided segments, and the delay time between the signals obtained by the respective voice input means is calculated to detect the sound source direction. Furthermore, by providing just four voice input means, not only the vertical direction of the voice but also its distance can be detected. This makes it possible to control the direction and zoom of the camera, so that the camera is automatically pointed toward the speaker and zoomed in to an image of appropriate size.

[0005]

That is, even when the speaker is moving, the speaker's voice signal is received by two microphones (voice input means) placed apart from each other, and the delay between the arrival times of the two received signals is calculated using Fourier analysis, which allows the direction of the speaker's voice to be detected. Further, three or four microphones (voice input means) are arranged at mutually different positions in the X and Y coordinate directions, and the arrival-time relationship among the voice signals received by the respective microphones is calculated. According to that publication, this allows the position of the speaker along the three coordinate axes X, Y and Z (that is, direction and distance) to be calculated, so that the camera can be pointed toward the speaker and zoom-in control performed at the same time.
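The two-microphone delay measurement described above can be illustrated with a minimal sketch. It is not the cited publication's algorithm: it swaps the Fourier analysis for a brute-force time-domain cross-correlation, and it assumes a far-field source and a speed of sound of 343 m/s. The function names `estimate_lag_samples` and `bearing_from_delay_deg` are hypothetical.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at about 20 degrees C


def estimate_lag_samples(sig_a, sig_b, max_lag):
    """Return the lag (in samples) of sig_b relative to sig_a that maximizes
    their cross-correlation. A positive result means sig_b arrived later."""
    n = min(len(sig_a), len(sig_b))
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += sig_a[i] * sig_b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def bearing_from_delay_deg(delay_s, mic_spacing_m):
    """Far-field bearing in degrees, measured from the broadside of the
    microphone pair. Positive means the source is on microphone A's side."""
    ratio = SPEED_OF_SOUND_M_S * delay_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp against noise-induced overshoot
    return math.degrees(math.asin(ratio))
```

With a sampling rate of `fs` hertz, the delay in seconds is the returned lag divided by `fs`, and that delay is what `bearing_from_delay_deg` expects. In a reverberant room the correlation peak can belong to a reflection rather than the direct path, which is the weakness the present document goes on to discuss.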

[0006]

[Problems to Be Solved by the Invention] However, among the techniques described above, the technique of Japanese Unexamined Patent Publication No. 5-268599, for example, requires that image patterns of the persons participating in the video conference be registered in advance and therefore known. In addition, when the face images of two or more persons are extracted on the screen, it is difficult to accurately display the speaker who should actually be shown, and it is also difficult to display a speaker located outside the screen.

[0007]

In the technique of Japanese Unexamined Patent Publication No. 9-307870, detecting an accurate sound source direction requires as many lamp-equipped earphone microphones, that is, voice input means, as there are participants (speakers) in the video conference, so a large number of voice input means is needed. In the technique of Published Japanese Translation of PCT Application No. 2000-512108, the direction of and distance to the speaker can be detected with as few as two to four voice input means (microphones). However, when the video conference room is highly reverberant or noisy, the arrival direction of the voice signal varies, and it is difficult to detect the sound source direction accurately.

[0008]

The present invention has been made in view of these problems. It uses: sound source direction detection means that detects the sound source direction by analyzing the frequency and signal delay time of the input voice arriving through each of a plurality of voice input means, fewer in number than the participants; camera control/position determination means that moves the camera horizontally and/or vertically toward the sound source direction detected by the sound source direction detection means, performs wide-angle or telephoto zoom change control, and can control the camera so that the face image is displayed at a preset position on the screen and at a preset size; and face image extraction/position determination means that extracts a person's face image from the image captured by the camera and recognizes the position and size of the face image on the screen. Only when the face image extraction/position determination means detects a person's face image in the sound source direction detected by the sound source direction detection means is that image recognized as the speaker's face image, and the camera control/position determination means then displays it at the predetermined position and/or size on the screen. The invention thereby aims to solve the problems of the prior art, such as the inability to display the speaker accurately because of surrounding reverberation and the need to provide many voice input units.

[0009]

Furthermore, when the face image extraction/position determination means detects the face images of a plurality of persons, the person located closest to the sound source direction detected by the sound source direction detection means is regarded as the speaker, and that person's face image is displayed at the predetermined position and size. Conversely, when no face image is detected, or when the face image does not fit within the screen, the camera is zoomed to the wide-angle side and the face image of a person near the sound source direction is detected in the wide-angle image. That person is regarded as the speaker, and the face image is displayed at the predetermined position and size.

[0010]

[Means for Solving the Problems] The present invention provides technical means for solving the above problems, and each invention constitutes the following technical means. The first invention is a camera control method that has camera control means enabling control to change the angle of the camera in the horizontal and/or vertical direction, so that a captured image can be displayed at a predetermined position on the screen. The method further has: face image extraction means for extracting a person's face image from the image input from the camera; face image position determination means capable of recognizing the position of the face image extracted by the face image extraction means and displayed on the screen; and sound source direction detection means for detecting the sound source direction from the respective voice signals input to a plurality of voice input means. The camera direction control means automatically controls the horizontal and/or vertical angle change of the camera toward the sound source direction detected by the sound source direction detection means, so that the position of the person's face image on the screen, as extracted by the face image extraction means and recognized by the face image position determination means, is brought to a preset position on the screen. The face image of the person in the detected sound source direction is thereby displayed at the preset position on the screen.

[0011]

The second invention is the camera control method of the first invention in which the camera control means can control zoom changes of the camera toward wide angle and/or telephoto, and the face image position determination means can recognize the size of the face image on the screen. The camera control means automatically controls the zoom change so that the on-screen size of the person's face image, as extracted by the face image extraction means and recognized by the face image position determination means, is brought to a preset size on the screen. The face image of the person in the sound source direction detected by the sound source direction detection means is thereby displayed at the preset size on the screen.

[0012]

The third invention is the camera control method of the first or second invention in which, when the face image extraction means extracts the face images of two or more persons in the sound source direction detected by the sound source direction detection means, the camera control means is made to display the face image of the person located closest to the detected sound source direction at the preset position and/or size on the screen.
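The selection rule of the third invention can be sketched as follows. This is an illustrative sketch only: the document does not say how a face's on-screen position is converted to a direction, so the code assumes the bearing varies linearly across the frame. The names `Face`, `face_bearing_deg` and `pick_speaker` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Face:
    center_x: float  # horizontal center of the detected face, in pixels
    width: float     # width of the detected face, in pixels


def face_bearing_deg(face: Face, frame_width: float,
                     hfov_deg: float, camera_pan_deg: float) -> float:
    """Approximate bearing of a face, assuming (for simplicity) that the
    angle varies linearly from one edge of the frame to the other."""
    offset = (face.center_x - frame_width / 2.0) / frame_width
    return camera_pan_deg + offset * hfov_deg


def pick_speaker(faces: Sequence[Face], sound_deg: float, frame_width: float,
                 hfov_deg: float, camera_pan_deg: float) -> Optional[Face]:
    """Return the face whose bearing is closest to the detected sound
    source direction, or None when no face was extracted."""
    if not faces:
        return None
    return min(
        faces,
        key=lambda f: abs(
            face_bearing_deg(f, frame_width, hfov_deg, camera_pan_deg) - sound_deg
        ),
    )
```

Returning `None` for an empty list leaves the no-face case to the caller, which matches the separate handling that the fourth invention gives it.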

[0013]

The fourth invention is the camera control method of any one of the first to third inventions in which, when the face image extraction means detects no face image in the sound source direction detected by the sound source direction detection means, and/or when the face image extracted by the face image extraction means does not fit within the screen, the camera control means automatically zooms the camera to wide angle. Based on the image captured at wide angle, the face image extraction means then extracts a person's face image, and the face image position determination means recognizes the position and/or size of the face image displayed on the screen.
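The wide-angle fallback of the fourth invention amounts to a small decision rule, sketched below. The test for "does not fit within the screen" is an assumption: here a face box that touches or crosses the frame border is treated as cut off. The names `Box`, `fits_in_frame` and `next_action` and the action strings are hypothetical.

```python
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


def fits_in_frame(box: Box, frame_w: float, frame_h: float) -> bool:
    """True when the box lies strictly inside the frame. Touching the border
    is treated as 'cut off', which is an assumption of this sketch."""
    left, top, right, bottom = box
    return left > 0 and top > 0 and right < frame_w and bottom < frame_h


def next_action(face_box: Optional[Box], frame_w: float, frame_h: float) -> str:
    """Decide the next camera step for the candidate speaker's face box."""
    if face_box is None or not fits_in_frame(face_box, frame_w, frame_h):
        # No face, or the face spills past the frame: widen the view and
        # run face extraction again on the wide-angle image.
        return "zoom_wide"
    # A complete face is visible: move it to the preset position and size.
    return "frame_face"
```

What should happen when the camera is already at its widest setting and still finds no face is not stated in this part of the document, so the sketch leaves that case to the caller.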

[0014]

The fifth invention is the camera control method of any one of the first to fourth inventions, further having time measurement means that measures the time elapsed since the camera control means controlled an angle change and/or zoom change of the camera toward the sound source direction detected by the sound source direction detection means. Each time the elapsed time measured by the time measurement means is detected to have reached a preset unit time, the camera control means controls the angle change and/or zoom change of the camera toward the sound source direction detected by the sound source direction detection means.

[0015]

The sixth invention is the camera control method of any one of the first to fifth inventions in which the face image extraction means extracts a person's face image by identifying the person's skin color.

[0016]

The seventh invention is the camera control method of any one of the first to sixth inventions in which the face image extraction means extracts a person's face image by identifying the person's contour.

[0017]

The eighth invention is the camera control method of any one of the first to seventh inventions in which the sound source direction detection means detects the sound source direction by analyzing the frequency and/or delay time of the respective voice signals input to the plurality of voice input means.

[0018]

The ninth invention is a camera control device capable of controlling angle changes in the horizontal and/or vertical direction and/or zoom changes toward wide angle and/or telephoto, the device further comprising means that make it possible to implement the camera control method of any one of the first to eighth inventions.

[0019]

The tenth invention is a video conference system for holding video conferences, comprising the camera control device of the ninth invention.

[0020]

[Embodiments of the Invention] An example embodiment of the camera control method according to the present invention, a camera control device that implements the method, and a video conference system equipped with that device is described below. FIG. 1 is a functional block diagram showing an example configuration of an embodiment in which the camera control method and camera control device according to the present invention are applied to a video conference system. The video conference system main unit 20 consists of the following circuit elements. It includes a CPU (central processing unit) 1 that governs overall control of the main unit 20; a memory A 2 consisting of a ROM (read-only memory) that stores programs for control and computation by the CPU 1 and a RAM (random access memory) that serves as working memory to assist that control and computation and to store various data; and a programmable, rewritable memory B 3 for retaining the information needed to operate the video conference system.

[0021]

The video conference system main unit 20 further includes: two or more voice input units 4 and two or more speakers 5, which the participants in the video conference use to converse; a voice control unit 6 that controls the voice input units 4 and the speakers 5; a sound source direction detection unit 15 that detects the sound source direction by analyzing the frequency of each input voice signal from the voice input units 4 and the delay time of the input voice signals; a communication control unit 7 that exchanges video and audio data over a communication line 17; a face image extraction/position determination unit 8 that extracts a person's face image and determines its position and/or size on the screen; a demultiplexing/multiplexing unit 9 that separates and multiplexes the video and audio data exchanged with the communication control unit 7; a camera 10 that captures the video needed for the video conference; and a camera control/position determination unit 11 that controls changes in the horizontal and/or vertical angle (direction) of the camera 10, controls zoom changes toward wide angle and/or telephoto, and tracks the current angle (direction) and zoom state of the camera 10 as the camera position.

[0022]

Here, as described above, the sound source direction detection unit 15 provides the sound source direction detection means, which detects the sound source direction by analyzing the frequency of the input voice and the delay time of the voice signals from two or more voice input units 4, fewer in number than the participants in the video conference. The face image extraction/position determination unit 8 provides, as described above, the face image extraction means, which extracts a person's face image from the image input from the camera 10, and the face image position determination means, which can recognize the on-screen position and/or size of the face image so extracted and displayed. The face image extraction means can extract a person's face image by identifying the person's skin color or by identifying the person's contour.

[0023]

The camera control/position determination unit 11 provides, as described above, camera control means that controls angle changes of the camera 10 in the horizontal and/or vertical direction (that is, pan and/or tilt changes) and/or zoom changes toward wide angle and/or telephoto. This camera control means includes display position determination means that can determine the current position of the camera 10 (that is, its direction and zoom state) and can automatically display the face image of the person located in the sound source direction detected by the sound source direction detection means at the preset position and/or size (magnification) on the screen.

[0024]

The video conference system main unit 20 further includes: a display unit 12 that displays video data and various text information; a video control unit 13 that controls the encoding of video captured by the camera 10 and/or the encoding of video data and text information to be shown on the display unit 12; an operation unit 14 for entering the other party's telephone number and operating the main unit 20, and for determining those inputs; and a clock unit 16 needed to run the sound source direction detection and the face image extraction at unit time intervals during automatic speaker search. These circuit elements are interconnected as shown in FIG. 1 to form the main unit 20. In FIG. 1, solid lines indicate information signal lines carrying audio and/or image information, and broken lines indicate control signal lines carrying various control information.

[0025]

The specific information stored in the memory B 3 includes, for example: data on human skin color and human contours for face image detection; the preset screen display position and range of the face image; the preset on-screen size of the face image; the setting for the pause interval of the face image detection operation (that is, the unit time interval at which detection of the sound source direction, detection of a face image in that direction, and display of the face image at the predetermined position and size are executed); and the destination number (address) used when placing a call. While the main unit 20 is in communication, with image communication in progress, the sound source direction detection unit 15 detects the sound source direction based on the voice signals obtained by the voice input units 4. The CPU 1 determines the position of the camera 10 (direction and zoom state) through the camera control/position determination unit 11 and steers the camera 10 toward the sound source direction detected by the sound source direction detection unit 15.

[0026]

Next, based on the video data captured from the camera 10, the face image extraction/position determination unit 8 extracts a person's face image and detects the position of the extracted face image. The CPU 1 then again determines the position of the camera 10 (direction and zoom state) through the camera control/position determination unit 11. Using the face image that the face image extraction/position determination unit 8 extracted from the captured video data, and whose on-screen display position it detected, the CPU 1 controls the position of the camera 10 so that the face image is displayed at the preset position and/or size (magnification) on the screen.
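The step of bringing a detected face to the preset position and size can be sketched as a small geometric calculation. The document does not give formulas, so the following assumes an ideal pinhole camera with square pixels and a known horizontal field of view. The function name `framing_correction` and all of its parameters are hypothetical.

```python
import math


def framing_correction(face_cx, face_cy, face_h,
                       target_cx, target_cy, target_h,
                       frame_w, frame_h, hfov_deg):
    """Return (pan_deg, tilt_deg, zoom_ratio) that would move a face from its
    current place in the frame to the preset target place and size.

    Assumes a pinhole camera with square pixels. Positive pan is to the
    right, positive tilt is upward (image y grows downward).
    """
    # Focal length in pixels, derived from the horizontal field of view.
    focal_px = (frame_w / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)
    cx0, cy0 = frame_w / 2.0, frame_h / 2.0

    def angle(offset_px):
        return math.degrees(math.atan(offset_px / focal_px))

    pan_deg = angle(face_cx - cx0) - angle(target_cx - cx0)
    tilt_deg = angle(cy0 - face_cy) - angle(cy0 - target_cy)
    zoom_ratio = target_h / face_h  # > 1 means zoom toward telephoto
    return pan_deg, tilt_deg, zoom_ratio
```

Because zooming changes the field of view, a real controller would likely apply the pan and tilt first and then re-measure the face before zooming. Repeating the measurement, as the paragraph above describes the CPU 1 doing, also absorbs the error of the pinhole approximation.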

[0027]

Once a speaker has been detected and the speaker's face image has been displayed at the preset position and/or size (magnification) on the screen, continuous execution of the sound source direction detection, the face image extraction, and the camera control toward the predetermined position and/or size is suspended, in order to stabilize the face image that is captured by the camera 10 and displayed on the screen. The clock unit 16 starts a timer, and these operations are instead executed each time the preset unit time elapses.
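The timer-gated behavior described above can be sketched with a small helper that reports when the preset unit time has elapsed. This is a minimal sketch, not the embodiment's implementation: the class name `PeriodicTrigger` is hypothetical, and the injectable `clock` argument exists only so the logic can be exercised without waiting.

```python
import time


class PeriodicTrigger:
    """Reports True at most once per interval, so that detection and camera
    control are re-run only every `interval_s` seconds after a lock-on."""

    def __init__(self, interval_s, clock=time.monotonic):
        self.interval_s = interval_s
        self.clock = clock
        self._last_run = None  # no run has happened yet

    def due(self):
        """Return True when the cycle should run now, and restart the timer."""
        now = self.clock()
        if self._last_run is None or now - self._last_run >= self.interval_s:
            self._last_run = now
            return True
        return False
```

A control loop would call `due()` on every iteration and run sound source detection, face extraction and camera control only when it returns `True`. A longer interval gives a steadier picture at the cost of slower reaction when the speaker changes.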

[0028] The configuration of this embodiment shown in FIG. 1 is now described in more detail. As noted above, the video control unit 13 and the audio control unit 6 serve as the encoding means that encode the video information and the audio information, respectively, and as the decoding means that decode the encoded data. The communication control unit 7 serves as both the transmitting means that sends data to the communication line 17 and the receiving means that receives data from the communication line 17. The role of the image and audio information processing means, which performs in a timely manner the processing needed for image communication of the encoded and decoded information, is fulfilled mainly by the CPU 1 controlling the memory A2, the memory B3, the video control unit 13, the audio control unit 6, and the communication control unit 7.

[0029] Image compression and decompression are performed by the communication control unit 7. The clock unit 16 serves as the clock means, that is, the time measuring means, which counts time up to a preset fixed unit time. The face image extraction / position determination unit 8 serves as the face image extraction / position determination means, which extracts a person's face image from the input image and determines the position and/or size of that face image as displayed on the screen. The sound source direction detection unit 15 serves as the sound source direction detecting means, which detects the direction of the source of the input audio. The operation unit 14 serves as the input detection means, which detects key input for operating the video conference system main body 20. The camera control / position determination unit 11 serves as both the direction changing means, which tilts and pans the camera 10 up, down, left, and right, and the zoom changing means, which moves the camera 10 between the wide-angle and telephoto states.

[0030] Next, for the camera control method and camera control device of the video conference system configured as above, the operation of automatically searching for the speaker's position and displaying the speaker's face image at a predetermined position and/or size is described with reference to the flowcharts shown in FIGS. 2, 3, 4, 5, and 6. FIG. 2 is a flowchart illustrating one embodiment of the camera control method and camera control device according to the present invention in the video conference system shown in FIG. 1. This embodiment is described below along the flowchart of FIG. 2. When the video conference system main body 20 moves from the idle state (state S1) to the communicating state (state S2), in which image communication is in progress, it first checks whether any input audio signal is present at the audio input unit 4, which consists of two or more inputs (step a1). If no input audio signal is detected (NO in step a1), the process waits at step a1 until the next audio input arrives.
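The text does not state how step a1 decides that an input audio signal is present. One plausible reading, shown here only as a sketch, is an energy check on each microphone channel: the signal counts as present when the RMS level of any channel reaches a threshold. The function name, the threshold, and the sample representation are assumptions made for illustration.

```python
import math
from typing import Sequence


def has_voice_input(channels: Sequence[Sequence[float]], threshold: float) -> bool:
    """Return True if any microphone channel's RMS level reaches the threshold."""

    def rms(samples: Sequence[float]) -> float:
        if not samples:
            return 0.0
        return math.sqrt(sum(s * s for s in samples) / len(samples))

    # Step a1 (as assumed here): input is present if at least one of the
    # two or more voice inputs carries enough energy.
    return any(rms(ch) >= threshold for ch in channels)
```

A caller would poll this on each new audio frame and stay at step a1 while it returns False.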

[0031] If, on the other hand, an input audio signal is detected (YES in step a1), the sound source direction detection unit 15 detects the sound source direction of the input audio signal (step a2). The camera control / position determination unit 11 then detects the current position of the camera 10 (direction and zoom state, that is, its horizontal, vertical, telephoto, and wide-angle positions) (step a3). The detected sound source direction and the detected current position of the camera 10 are temporarily stored in the memory B3 (step a4). The CPU 1 checks the sound source direction and the current position of the camera 10 stored in the memory B3, calculates the amount of movement needed to point the camera 10 toward the sound source, and sends the calculated amount to the camera control / position determination unit 11, which then turns the camera 10 toward the detected sound source direction (step a5).
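The movement amount computed in step a5 can be pictured as the difference between the detected source direction and the current camera direction. The sketch below assumes directions are expressed as pan and tilt angles in degrees and that the camera may take the shorter way round when panning; the class, field names, and the zoom scale are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class CameraPosition:
    pan_deg: float   # horizontal direction
    tilt_deg: float  # vertical direction
    zoom: float      # hypothetical scale, 1.0 = widest angle


def wrap_deg(angle: float) -> float:
    """Map an angle to [-180, 180) so the pan takes the shorter way round."""
    return (angle + 180.0) % 360.0 - 180.0


def movement_to_source(current: CameraPosition, source_pan_deg: float,
                       source_tilt_deg: float = 0.0) -> Tuple[float, float]:
    """Return (pan_delta, tilt_delta) that points the camera at the source."""
    pan_delta = wrap_deg(source_pan_deg - current.pan_deg)
    tilt_delta = source_tilt_deg - current.tilt_deg
    return pan_delta, tilt_delta
```

For example, a camera at pan 170° asked to face a source at -170° gets a pan delta of +20° rather than -340°.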

[0032] After the camera 10 has moved toward the sound source direction, the face image extraction / position determination unit 8 extracts the face image of any person contained in the video signal captured by the camera 10 and determines the position of the face image on the screen and its on-screen size (magnification) (step a6). The position (direction and zoom state) of the camera 10 that corresponds to the determined on-screen position and size (magnification) of the face image is stored in the memory B3 (step a7).

[0033] Thereafter, the position and size (magnification) of the face image on the screen are determined according to the setting values stored in advance in the memory B3. Based on the determined position and size (magnification), the camera control / position determination unit 11 again operates the direction and zoom of the camera 10, so that the face image of the person who is the sound source, that is, the speaker, is automatically displayed at the preset position and size (magnification) on the screen (step a8).
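As an illustration of step a8, the correction that brings a detected face to the preset position and size can be sketched as below. This assumes normalized screen coordinates (0 to 1, with y increasing downward), a linear mapping between screen offset and camera angle across the current field of view, and a zoom factor equal to the ratio of target to measured face height. None of these details come from the patent; all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class FaceBox:
    cx: float      # face center x, 0.0 (left) to 1.0 (right)
    cy: float      # face center y, 0.0 (top) to 1.0 (bottom)
    height: float  # face height as a fraction of the screen height


def framing_correction(face: FaceBox, target: FaceBox,
                       hfov_deg: float, vfov_deg: float) -> Tuple[float, float, float]:
    """Return (pan_delta_deg, tilt_delta_deg, zoom_factor) toward the preset framing."""
    if face.height <= 0:
        raise ValueError("face height must be positive")
    # A face to the right of the target needs a pan to the right.
    pan_delta = (face.cx - target.cx) * hfov_deg
    # Screen y grows downward, so a face below the target needs a downward
    # (negative) tilt.
    tilt_delta = (target.cy - face.cy) * vfov_deg
    # Zoom in when the face is smaller than the preset size.
    zoom_factor = target.height / face.height
    return pan_delta, tilt_delta, zoom_factor
```

In practice the field of view changes with zoom, so a device following this approach would likely apply the correction and then re-measure the face rather than rely on a single step.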

[0034] In step a6, the face image extraction / position determination unit 8 extracts the face image of the person who is the speaker and determines the position and size of that face image on the screen. As the determination method, the face image is extracted from the person's skin color or outline, as described above and as disclosed, for example, in Japanese Unexamined Patent Publication No. 2000-354247, "Image processing apparatus," and the aforementioned Japanese Unexamined Patent Publication No. H05-268599, "Automatic control system for human image pickup camera in video conference system." Operating in this way makes it possible to display the speaker's face image accurately at the preset position on the screen and at the preset size.
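The cited publications are not reproduced here, so the following is not their method. It is only a toy sketch of the general idea of locating a face region by skin color: each RGB pixel is classified with a simple, commonly used threshold rule, and the bounding box of the matching pixels gives a position and size. The thresholds and function names are assumptions for illustration.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each 0..255


def is_skin(r: int, g: int, b: int) -> bool:
    """Heuristic RGB skin test (illustrative thresholds only)."""
    return (r > 95 and g > 40 and b > 20
            and max(r, g, b) - min(r, g, b) > 15
            and abs(r - g) > 15 and r > g and r > b)


def skin_bounding_box(image: List[List[Pixel]]) -> Optional[Tuple[int, int, int, int]]:
    """Return (left, top, right, bottom), inclusive, of skin pixels, or None."""
    hits = [(x, y)
            for y, row in enumerate(image)
            for x, px in enumerate(row)
            if is_skin(*px)]
    if not hits:
        return None
    xs = [x for x, _ in hits]
    ys = [y for _, y in hits]
    return min(xs), min(ys), max(xs), max(ys)
```

The box center gives the on-screen position and the box height relative to the image height gives the on-screen size; a practical detector would also use outline information to reject skin-colored non-face regions.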

[0035] Next, another embodiment of the camera control method and camera control device according to the present invention, shown in FIG. 3, is described. FIG. 3 is a flowchart illustrating another embodiment of the camera control method and camera control device according to the present invention in the video conference system shown in FIG. 1.

[0036] In FIG. 3, the camera control / position determination unit 11 moves the camera 10 toward the sound source direction detected by the sound source direction detection unit 15, and the face image extraction / position determination unit 8 then detects face images on the screen. If some error has arisen in the detected sound source direction, for example because of surrounding reverberation, and two or more face images are extracted, the position of each face image is temporarily stored in the memory B3, and the person located closest to the sound source direction detected by the sound source direction detection unit 15 is judged to be the intended speaker. The position and size (magnification) of the face image on the screen are then determined according to the setting values stored in advance in the memory B3, and the camera control / position determination unit 11 moves the camera 10 again on that basis, so that the speaker judged to be the source of the sound is automatically displayed on the screen at the preset position and size (magnification).

[0037] This embodiment is described below along the flowchart of FIG. 3. When the video conference system main body 20 moves from the idle state (state S1) to the communicating state (state S2), in which image communication is in progress, it first checks whether any input audio signal is present at the audio input unit 4, which consists of two or more inputs (step b1). If no input audio signal is detected (NO in step b1), the process waits at step b1 until the next audio input arrives.

[0038] If, on the other hand, an input audio signal is detected (YES in step b1), the sound source direction detection unit 15 detects the sound source direction of the input audio signal (step b2). The camera control / position determination unit 11 then detects the current position of the camera 10 (direction and zoom state, that is, its horizontal, vertical, telephoto, and wide-angle positions) (step b3). The detected sound source direction and the detected current position of the camera 10 are temporarily stored in the memory B3 (step b4). The CPU 1 checks the sound source direction and the current position of the camera 10 stored in the memory B3, calculates the amount of movement needed to point the camera 10 toward the sound source, and sends the calculated amount to the camera control / position determination unit 11, which then turns the camera 10 toward the detected sound source direction (step b5).

[0039] After the camera 10 has moved toward the sound source direction, the face image extraction / position determination unit 8 extracts the face image of any person contained in the video signal captured by the camera 10 and determines the position of the face image on the screen and its on-screen size (magnification) (step b6). It is then determined whether two or more face images have been detected (step b7). If two or more face images have been detected (YES in step b7), the position of each face image detected by the face image extraction / position determination unit 8 is compared with the sound source direction detected by the sound source direction detection unit 15, the person whose face image lies closest to the sound source direction is selected as the target person (that is, the speaker) (step b8), and the process proceeds to step b9. If only one face image has been detected (NO in step b7), the person in that face image is the target person (that is, the speaker), and the process proceeds directly to step b9. In step b9, the position (direction and zoom state) of the camera 10 for capturing the face image of the person selected as the speaker is stored in the memory B3.
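The selection in step b8 amounts to a nearest-match search. The sketch below assumes each detected face has already been converted to a bearing in the same angular frame as the detected sound source direction; that conversion, the function name, and the use of degrees are assumptions made for illustration.

```python
from typing import Optional, Sequence


def select_speaker(face_bearings_deg: Sequence[float],
                   source_bearing_deg: float) -> Optional[int]:
    """Return the index of the face closest to the sound source direction.

    Returns None when no face was detected. With a single face the result
    is index 0, which matches the NO branch of step b7.
    """
    if not face_bearings_deg:
        return None
    return min(range(len(face_bearings_deg)),
               key=lambda i: abs(face_bearings_deg[i] - source_bearing_deg))
```

When two faces are equally close, this sketch returns the first one; the text does not say how such ties should be resolved.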

[0040] Thereafter, the position and size (magnification) of the face image on the screen are determined according to the setting values stored in advance in the memory B3. Based on the determined position and size (magnification), the camera control / position determination unit 11 again operates the direction and zoom of the camera 10, so that the face image of the person who is the sound source, that is, the speaker, is automatically displayed at the preset position and size (magnification) on the screen (step b10).

[0041] Next, yet another embodiment of the camera control method and camera control device according to the present invention, shown in FIG. 4, is described. FIG. 4 is a flowchart illustrating yet another embodiment of the camera control method and camera control device according to the present invention in the video conference system shown in FIG. 1.

[0042] In FIG. 4, when the face image extraction / position determination unit 8 attempts to detect a face image on the screen and no face image is detected, and/or the extracted face image does not fit within the screen, the CPU 1 has the camera control / position determination unit 11 change the zoom of the camera 10 and automatically move it to wide angle, so that a wider area is shown on the screen. The face image extraction / position determination unit 8 then determines whether a face image is detected in this wider view. If a face image is extracted, the position and size (magnification) of the face image on the screen are determined according to the setting values stored in advance in the memory B3, and the camera control / position determination unit 11 operates the camera 10 again on that basis. In this way, even if the detected sound source direction contains an error, the speaker who is the source of the sound can be found and automatically displayed at the preset position and size (magnification) on the screen.

[0043] This embodiment is described below along the flowchart of FIG. 4. When the video conference system main body 20 moves from the idle state (state S1) to the communicating state (state S2), in which image communication is in progress, it first checks whether any input audio signal is present at the audio input unit 4, which consists of two or more inputs (step c1). If no input audio signal is detected (NO in step c1), the process waits at step c1 until the next audio input arrives.

[0044] If, on the other hand, an input audio signal is detected (YES in step c1), the sound source direction detection unit 15 detects the sound source direction of the input audio signal (step c2). The camera control / position determination unit 11 then detects the current position of the camera 10 (direction and zoom state, that is, its horizontal, vertical, telephoto, and wide-angle positions) (step c3). The detected sound source direction and the detected current position of the camera 10 are temporarily stored in the memory B3 (step c4). The CPU 1 checks the sound source direction and the current position of the camera 10 stored in the memory B3, calculates the amount of movement needed to point the camera 10 toward the sound source, and sends the calculated amount to the camera control / position determination unit 11, which then turns the camera 10 toward the detected sound source direction (step c5).

[0045] After the camera 10 has moved toward the sound source direction, the face image extraction / position determination unit 8 extracts the face image of any person contained in the video signal captured by the camera 10 and determines the position of the face image on the screen and its on-screen size (magnification) (step c6). The face image extraction / position determination unit 8 then determines whether a face image has been detected (step c7). If a face image has been detected (YES in step c7), the person in that face image is the target person (that is, the speaker), and the position (direction and zoom state) of the camera 10 capturing the face image of the person determined to be the speaker is stored in the memory B3 (step c10). If, in step c7, no face image has been detected and/or the extracted face image does not fit within the screen (NO in step c7), the camera control / position determination unit 11 controls the camera 10 to perform a wide-angle operation that captures a wider area.

[0046] Based on the video signal from the camera 10 after the wide-angle operation, the face image extraction / position determination unit 8 again extracts the face image of any person contained in the video signal and determines the position of the face image on the screen and its on-screen size (magnification) (step c9). If a face image is detected (YES in step c9), the sound source direction detection is regarded as having contained an error, and the person whose face image was detected by the face image extraction / position determination unit 8 after the wide-angle operation is taken to be the target person (that is, the speaker) who produced the sound. The process proceeds to step c10, where the position (direction and zoom state) of the camera 10 for capturing the face image of the person determined to be the speaker is stored in the memory B3.

[0047] Thereafter, the position and size (magnification) of the face image on the screen are determined according to the setting values stored in advance in the memory B3. Based on the determined position and size (magnification), the camera control / position determination unit 11 again operates the direction and zoom of the camera 10, so that the face image of the person who is the sound source, that is, the speaker, is automatically displayed at the preset position and size (magnification) on the screen (step c12).

[0048] If, in step c9, no face image is detected even with the camera 10 at wide angle (NO in step c9), the sound detected by the sound source direction detection unit 15 is regarded as some kind of noise, the sound source direction detection operation is temporarily stopped (step c11), and the process returns to step c1 to wait for the next audio input.
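The branch structure of steps c6 through c11 can be summarized in a small control-flow sketch. The camera and the face detector are passed in as callables so the logic can be read on its own; it is assumed here that `detect_faces` returns only faces that fit within the screen. All names and the returned labels are hypothetical.

```python
from typing import Callable, List, Tuple


def locate_with_wide_fallback(detect_faces: Callable[[], List[object]],
                              zoom_to_wide: Callable[[], None]
                              ) -> Tuple[str, List[object]]:
    """Follow the Fig. 4 flow: detect, widen on failure, detect once more."""
    faces = detect_faces()           # steps c6 and c7
    if faces:
        return "speaker", faces      # proceed to step c10
    zoom_to_wide()                   # wide-angle operation after NO in step c7
    faces = detect_faces()           # step c9
    if faces:
        # Direction detection is treated as having been slightly off.
        return "speaker_after_wide", faces
    # Still no face: treat the sound as noise (step c11) and wait for
    # the next audio input (step c1).
    return "noise", []
```

The caller would store the camera position and reframe the face for the first two outcomes, and return to waiting for audio input for the third.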

[0049] Next, yet another embodiment of the camera control method and camera control device according to the present invention, shown in FIG. 5, is described. FIG. 5 is a flowchart illustrating yet another embodiment of the camera control method and camera control device according to the present invention in the video conference system shown in FIG. 1.

[0050] In FIG. 5, once the face image of the target person (that is, the speaker) has been detected through any of the processes shown in FIGS. 2 to 4, and the operation for automatically displaying that face image at the preset position and size (magnification) on the screen has been determined, the continuous detection of the sound source direction by the sound source direction detection unit 15 is suspended. The clock unit 16 counts a timer until the unit time interval registered in advance in the memory B3 has elapsed, and the sound source direction detection unit 15 re-detects the sound source direction each time that interval elapses. This keeps the camera 10 from being moved in constant response to every slight movement of the speaker, so that the speaker's face image can be displayed stably.

[0051] This embodiment is described below along the flowchart of FIG. 5. When the video conference system main body 20 moves from the idle state (state S1) to the communicating state (state S2), in which image communication is in progress, it is first determined whether the timer counted by the clock unit 16 has reached the unit time interval registered in advance in the memory B3 (step d1). If the unit time interval has not been reached (NO in step d1), the clock unit 16 simply continues counting. If it is detected that the unit time interval has been reached and the predetermined unit time has elapsed (YES in step d1), the timer counted by the clock unit 16 is cleared and placed in a state from which it can count again from its initial value (step d2).
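The timer handling in steps d1, d2, and d10 can be sketched as a small class. Time is passed in explicitly so the behavior is easy to follow and test; it is assumed here that the timer is started when the communicating state is entered, which the text does not state. The class and method names are hypothetical.

```python
from typing import Optional


class UnitTimer:
    """Gate that opens once every preset unit time interval."""

    def __init__(self, unit_time_s: float) -> None:
        self.unit_time_s = unit_time_s
        self.started_at: Optional[float] = None  # None = cleared, not counting

    def start(self, now_s: float) -> None:
        """Begin counting from the initial value (step d10)."""
        self.started_at = now_s

    def expired(self, now_s: float) -> bool:
        """Check whether the unit time interval has been reached (step d1)."""
        return (self.started_at is not None
                and now_s - self.started_at >= self.unit_time_s)

    def clear(self) -> None:
        """Reset so that counting can begin again from the initial value (step d2)."""
        self.started_at = None
```

A control loop would call `expired()` on each pass, and on YES call `clear()`, run the detection and camera control once, then call `start()` again.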

[0052] When the clock unit 16 has detected that the timer has passed the time indicated by the unit time interval, it is first checked whether an input audio signal is detected (step d3). If no input audio signal is detected (NO in step d3), the process moves to step d10, where the clock unit 16 starts the timer count (step d10), then returns to step d1 and waits for the next unit time interval to elapse.

[0053] If, on the other hand, an input audio signal is detected (YES in step d3), the sound source direction detection unit 15 detects the sound source direction of the input audio signal (step d4). The camera control / position determination unit 11 then detects the current position of the camera 10 (direction and zoom state, that is, its horizontal, vertical, telephoto, and wide-angle positions) (step d5). The detected sound source direction and the detected current position of the camera 10 are temporarily stored in the memory B3 (step d6). The CPU 1 checks the sound source direction and the current position of the camera 10 stored in the memory B3, calculates the amount of movement needed to point the camera 10 toward the sound source, and sends the calculated amount to the camera control / position determination unit 11, which then turns the camera 10 toward the detected sound source direction (step d7).

[0054] Here, one of the camera control operations in the flowcharts of FIGS. 2 to 4 described above is selected and executed (step d8). That is, after the camera 10 has moved toward the sound source direction, the face image extraction / position determination unit 8 extracts the face image of any person contained in the video signal captured by the camera 10 and determines the position of the face image on the screen and its on-screen size (magnification). It is then determined whether exactly one face image has been detected, two or more face images have been detected, or no face image has been detected.

[0055] If only one face image has been detected, then as in steps a6 and a7 of the flowchart of FIG. 2, the extracted face image is taken to be that of the target speaker, and the position (direction and zoom state) of the camera 10 that corresponds to the on-screen position and size (magnification) of the face image, as determined by the face image extraction / position determination unit 8, is stored in the memory B3.

[0056] When two or more face images are detected, as shown in steps b6 to b9 of the flowchart of FIG. 3, the position of each face image detected by the face image extraction / position determination unit 8 is compared with the sound source direction detected by the sound source direction detection unit 15. The person whose face image lies closest to the sound source direction is selected as the target person (that is, the speaker), and the position (direction and zoom state) of the camera 10 for capturing the face image of the selected speaker is stored in the memory B 3.
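The selection rule in this paragraph can be sketched as follows. This is a minimal illustration, not the implementation of the invention: it assumes each detected face has already been converted to pan/tilt angles in the same coordinate system as the detected sound source direction, and the names `Face` and `select_speaker` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Face:
    """Hypothetical record for one detected face (angles in degrees)."""
    pan_deg: float
    tilt_deg: float
    size: float  # on-screen size (magnification); unused by the selection


def select_speaker(faces, source_pan_deg, source_tilt_deg=0.0):
    """Return the face closest to the sound source direction, or None.

    Closeness is measured here as squared angular distance in pan/tilt,
    which is an assumption; the source only says "closest position".
    """
    if not faces:
        return None
    return min(
        faces,
        key=lambda f: (f.pan_deg - source_pan_deg) ** 2
        + (f.tilt_deg - source_tilt_deg) ** 2,
    )
```

For example, with faces at pan angles of -10, 4, and 15 degrees and a sound source detected at 5 degrees, the face at 4 degrees would be selected as the speaker.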

[0057] On the other hand, when no face image is detected, or when the face image does not fit within the screen, the camera control / position determination unit 11 operates the camera 10 toward wide angle to capture a wider image, as shown in steps c8 to c11 of the flowchart of FIG. 4. If the face image extraction / position determination unit 8 then detects a face image of a person, the position (direction and zoom state) of the camera 10 for capturing that face image is stored in the memory B 3. If no face image is detected even after the wide-angle operation of the camera 10, the sound source direction detecting operation is temporarily stopped, the process moves to step d10, where the clock unit 16 starts counting the timer, and then returns to step d1 to wait until the next unit time interval elapses.
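The wide-angle fallback described here can be sketched as below. This is only an illustration under stated assumptions: `detect_faces` and `zoom_wide` are hypothetical callables standing in for the face image extraction / position determination unit and the camera control unit, each face is a `(pan_deg, fits_in_frame)` tuple, and choosing the nearest face after widening is an assumption carried over from the multi-face case.

```python
def find_speaker_pan(detect_faces, zoom_wide, source_pan_deg):
    """Return the pan angle of the face to frame, or None if none is found.

    detect_faces() -> list of (pan_deg, fits_in_frame) tuples (hypothetical).
    zoom_wide()    -> moves the camera zoom to wide angle (hypothetical).
    """
    # Keep only faces that fit entirely within the screen.
    usable = [f for f in detect_faces() if f[1]]
    if not usable:
        # No face, or the face does not fit: widen the view and retry once.
        zoom_wide()
        usable = [f for f in detect_faces() if f[1]]
        if not usable:
            # Still nothing: the caller skips this cycle and waits for
            # the next unit time interval.
            return None
    # Pick the face nearest the detected sound source direction.
    return min(usable, key=lambda f: abs(f[0] - source_pan_deg))[0]
```

Returning `None` corresponds to the branch in which the process starts the timer and waits for the next unit time interval without storing a camera position.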

[0058] Once the face image of the speaker has been detected and the position of the camera 10 has been stored in the memory B 3, the on-screen position and size (magnification) of the face image are determined according to the setting values stored in advance in the memory B 3. Based on the determined position and size (magnification), the camera control / position determination unit 11 again operates the direction and zoom of the camera 10 so that the face image of the person who is the sound source, that is, the speaker, is displayed at the preset position and size (magnification) on the screen (step d9). Further, the timer that was set to its initial value in step d2 is started (step d10), and the process returns to step d1 to wait until the next unit time interval elapses. The input voice detecting operation is thereby suspended until the next unit time interval registered in advance in the memory B 3 is reached. Consequently, even if the position of the speaker's face, that is, the mouth, shifts slightly during speech and the sound source direction shifts slightly, the position of the camera 10 is not readjusted each time. The camera 10 stays in a stable position until the predetermined time indicated by the unit time interval has elapsed, so the face image of the speaker can be displayed stably.
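The effect of the unit time interval can be sketched as a simple gate in front of the sound source direction detection. This is a hedged illustration: the class name `DetectionGate`, its methods, and the injectable `clock` parameter are assumptions made for the example and do not appear in the source.

```python
import time


class DetectionGate:
    """Allows sound source detection only once per unit time interval."""

    def __init__(self, unit_interval_s, clock=time.monotonic):
        self.unit_interval_s = unit_interval_s
        self.clock = clock
        self.started_at = None  # timer not started yet

    def restart(self):
        """Start (or restart) the timer after the camera has been positioned."""
        self.started_at = self.clock()

    def detection_allowed(self):
        """True if no timer is running or the unit interval has elapsed."""
        if self.started_at is None:
            return True
        return self.clock() - self.started_at >= self.unit_interval_s
```

While `detection_allowed()` returns `False`, small shifts in the speaker's mouth position cannot move the camera, which is the stabilizing behavior the paragraph describes.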

[0059]

[Effects of the Invention] As described above, the camera control method, the camera control device, and the video conference system according to the present invention provide the following effects. The present invention comprises: face image extraction / position determination means that extracts a face image of a person from the image input from the camera and displayed on the screen, and recognizes the position and size of the face image; sound source direction detecting means that can detect the sound source direction from the frequency and/or delay time of the audio signals input to a plurality of audio input means, fewer in number than the participants; and camera control means that can perform horizontal, vertical, wide-angle, and telephoto operations, can automatically direct the camera toward the detected sound source direction, that is, toward the speaker, and also includes camera position determination means that can control the camera so that the face image is displayed at a preset position on the screen and at a preset size. With this configuration, the camera can accurately capture the speaker and display the speaker at a predetermined position on the screen and at a predetermined size.

[0060] Further, when the camera is directed toward the sound source direction detected by the sound source direction detecting means, surrounding reverberation or similar effects may introduce some error into the detected direction, so that the face image extraction / position determination means detects face images of a plurality of persons. In that case, the person located closest to the detected sound source direction is identified as the speaker, and the face image of that speaker can be displayed at a predetermined position on the screen and at a predetermined size.

[0061] Furthermore, when the camera has been directed toward the sound source direction detected by the sound source direction detecting means but the face image extraction / position determination means detects no face image of a person, and/or when the face image does not fit within the screen, the camera is operated toward wide angle to capture a wider image, and the face image of the person regarded as the speaker is detected from that wider image. Thus, even if the detected sound source direction contains some error, the face image of the person regarded as the speaker can be displayed at a predetermined position on the screen and at a predetermined size.

[0062] Furthermore, the sound source direction detecting operation is not performed continuously. Instead, each time the predetermined unit time indicated by the previously registered unit time interval elapses, the sound source direction is detected, the camera is directed toward the sound source, and the face image of the speaker is displayed at the predetermined position and size on the screen. As a result, even if the position of the speaker's face, that is, the mouth, shifts slightly during speech and the sound source direction shifts slightly, the camera position (that is, the direction and zoom state) is prevented from being continually readjusted, and the face image of the speaker can be displayed stably on the screen.

[0063] In addition, the sound source direction detecting means detects the direction of the sound source from which the voice originates by analyzing the frequency and/or delay time of the input audio signals arriving from a plurality of audio input means, fewer in number than the participants, so the number of audio input means can be reduced.
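As one illustration of how a delay time can yield a direction, the sketch below uses the standard far-field model for a pair of microphones, in which the arrival angle θ satisfies sin θ = c·τ/d for delay τ, microphone spacing d, and speed of sound c. The source does not specify this model; the function name, the two-microphone geometry, and the default speed of sound of 343 m/s are assumptions.

```python
import math


def direction_from_delay(delay_s, mic_spacing_m, speed_of_sound=343.0):
    """Estimate the arrival angle in degrees from an inter-microphone delay.

    Assumes a far-field source and two microphones; 0 degrees is broadside
    (equal arrival time), and the sign follows the sign of the delay.
    """
    ratio = speed_of_sound * delay_s / mic_spacing_m
    # Clamp to the valid asin domain to absorb measurement noise.
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))
```

With this model a single microphone pair covers a range of directions, which is consistent with the point that fewer audio input means than participants are needed.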

[Brief description of drawings]

FIG. 1 is a functional block diagram showing an example of the configuration of an embodiment in which the camera control method and the camera control device according to the present invention are applied to a video conference system.

FIG. 2 is a flowchart for explaining one embodiment of the camera control method and the camera control device according to the present invention in the video conference system shown in FIG. 1.

FIG. 3 is a flowchart for explaining another embodiment of the camera control method and the camera control device according to the present invention in the video conference system shown in FIG. 1.

FIG. 4 is a flowchart for explaining still another embodiment of the camera control method and the camera control device according to the present invention in the video conference system shown in FIG. 1.

FIG. 5 is a flowchart for explaining still another embodiment of the camera control method and the camera control device according to the present invention in the video conference system shown in FIG. 1.

[Explanation of symbols]

1 ... CPU, 2 ... Memory A, 3 ... Memory B, 4 ... Voice input unit, 5 ... Speaker, 6 ... Voice control unit, 7 ... Communication control unit, 8 ... Face image extraction / position determination unit, 9 ... Demultiplexing / multiplexing unit, 10 ... Camera, 11 ... Camera control / position determination unit, 12 ... Display unit, 13 ... Video control unit, 14 ... Operation unit, 15 ... Sound source direction detection unit, 16 ... Clock unit, 17 ... Communication line, 20 ... Video conference system device body.

Claims


1. A camera control method that has camera control means capable of controlling the direction of a camera so as to change it in the horizontal direction and/or the vertical direction, and that can display a captured image at a predetermined position on a screen, the method using: face image extraction means for extracting a face image of a person from the image input from the camera; face image position determination means capable of recognizing the on-screen position of the face image extracted by the face image extraction means; and sound source direction detecting means for detecting a sound source direction from the audio signals respectively input to a plurality of audio input means, wherein the camera control means automatically controls the angle change of the camera in the horizontal direction and/or the vertical direction toward the sound source direction detected by the sound source direction detecting means, and the position of the face image of the person, extracted by the face image extraction means and recognized on the screen by the face image position determination means, is brought to a preset position on the screen, so that the face image of the person in the sound source direction detected by the sound source direction detecting means is displayed at the preset position on the screen.

2. The camera control method according to claim 1, wherein the camera control means is further capable of controlling a zoom change of the camera to wide angle and/or telephoto, the face image position determination means is further capable of recognizing the size of the face image on the screen, and the camera control means automatically controls the zoom change of the camera to wide angle and/or telephoto so that the size of the face image of the person, extracted by the face image extraction means and recognized on the screen by the face image position determination means, is brought to a preset size on the screen, whereby the face image of the person in the sound source direction detected by the sound source direction detecting means is displayed at the preset size on the screen.

3. The camera control method according to claim 1, wherein, when the face image extraction means extracts face images of two or more persons in the sound source direction detected by the sound source direction detecting means, the camera control means performs control so that the face image of the person located closest to the detected sound source direction is displayed at the preset position and/or size on the screen.

4. The camera control method according to claim 1, wherein, when the face image extraction means detects no face image of a person in the sound source direction detected by the sound source direction detecting means, and/or when the face image of the person extracted by the face image extraction means does not fit entirely within the screen, the camera control means automatically controls the zoom change of the camera to a wide-angle operation, and, based on the image captured by the wide-angle operation, the face image extraction means extracts the face image and the face image position determination means recognizes the position and/or size of the face image displayed on the screen.

5. The camera control method according to claim 1, further comprising time measuring means for measuring the time elapsed since the camera control means controlled the angle change and/or the zoom change of the camera toward the sound source direction detected by the sound source direction detecting means, wherein, each time the elapsed time measured by the time measuring means reaches a preset unit time, the camera control means controls the angle change and/or the zoom change of the camera toward the sound source direction detected by the sound source direction detecting means.

6. The camera control method according to claim 1, wherein the face image extraction means extracts the face image of the person by identifying the skin color of the person.

7. The camera control method according to claim 1, wherein the face image extraction means extracts the face image of the person by identifying the contour of the person.

8. The camera control method according to claim 1, wherein the sound source direction detecting means detects the sound source direction by analyzing the frequency and/or the delay time of each of the audio signals input to the plurality of audio input means.

9. A camera control device capable of controlling an angle change of a camera in the horizontal direction and/or the vertical direction and/or a zoom change to wide angle and/or telephoto, the device comprising means for realizing the camera control method according to any one of claims 1 to 8.

10. A video conference system for performing a video conference, comprising the camera control device according to claim 9.