JP2001352530A

JP2001352530A - Communication conference system

Info

Publication number: JP2001352530A
Application number: JP2000172960A
Authority: JP
Inventors: Masafumi Tanaka; 雅史田中; Kenichi Furuya; 賢一古家
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2000-06-09
Filing date: 2000-06-09
Publication date: 2001-12-21

Abstract

PROBLEM TO BE SOLVED: To provide a communication conference system that can adaptively decide the delay time before a shift of the camera's field of view or a change in microphone directivity begins when the talker changes. SOLUTION: A position estimation unit 10 estimates the position of the speaker, and a position history management unit 201 updates a position history database 202 according to the result. A field-of-view determination unit 203 decides the field of view based on the position history, and the delay time of a delay unit 204 is decided based on the position history, the field of view, and the field-of-view history. A device control unit 30 controls the viewing direction and viewing angle of a video camera once the delay time has elapsed from the time the field of view was decided.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a communication conference apparatus that determines the field of view of a video camera, the directivity of a microphone, and the like from a history of past speaker positions and a new speaker position. In particular, it relates to a technique for adaptively controlling the delay time before a change in the camera's field of view or in the microphone's directivity begins when the speaker position changes.

[0002]

[Prior Art] FIG. 8 is a diagram for explaining the main part of an automatic speaker tracking system for a video camera used in a conventional communication conference apparatus. Reference numeral 10 denotes a position estimation unit that estimates the position of the speaker, and 20' denotes a selectivity control unit. The selectivity control unit 20' receives the speaker position information obtained by the position estimation unit 10, and the position history management unit 201 updates the speaker position history database 202. The position history information in the position history database 202 is input to the field-of-view determination unit 203, which determines the field of view. The resulting field-of-view information is output to the device control unit 30, which controls the viewing angle (zoom) and viewing direction (left-right direction) of the video camera.

[0003]

[Problems to Be Solved by the Invention] In the conventional technique, therefore, once the field of view is determined by the field-of-view determination unit 203, the video camera starts moving either immediately or after a fixed delay time. Because the delay time before the video camera starts moving is fixed, the following inconveniences can arise.

[0004] First, when a speaker change occurs and the delay time is long, the new speaker's utterance may already have ended before the camera moves. On the other hand, when speaker changes are frequent and the delay time is kept short so that the video camera starts moving quickly with each change, the video becomes unsteady and viewers find it unpleasant.

[0005] Further, when the viewing direction of the video camera moves by a large amount, the movement takes time, so another speaker change may occur during the movement and cause the video camera to wander. To prevent this, the delay time must be set longer than when the movement is small. Doing so, however, means that the start of field-of-view movement is always delayed.

[0006] Likewise, it is undesirable for the timing of the microphone directivity change that accompanies a speaker change to deviate greatly from the timing of the video camera's field-of-view change.

[0007] An object of the present invention is to provide a communication conference apparatus that solves the above problems by adaptively determining the delay time before movement of the video camera's field of view or a change in microphone directivity begins following a speaker change.

[0008]

[Means for Solving the Problems] To solve the above problems, a first invention comprises: position estimation means (10) for estimating a speaker position from a video signal or from two or more audio signals; position history management means (201) for recording the speaker position obtained by the position estimation means (10) and updating a position history (202); field-of-view or directivity determination means (203) for determining, based on the position history (202), the field of view of imaging means that captures the video signal or the directivity of sound collection means that captures the audio signals; delay means (204) for delaying the determined field-of-view or directivity information; delay time determination means (205) for setting a variable delay time for the delay means (204); and device control means (30) for setting the field of view of the imaging means or the directivity of the sound collection means to match the new speaker position, using the field-of-view or directivity information delayed by the delay means (204).

[0009] In a second invention based on the first invention, the speaker position history (202) consists of sets of a speaker position, an utterance time, and an utterance duration, and the delay time determination means (205) sets the delay time using at least one of the following: the utterance continuation time of the previous speaker, obtained from the utterance duration; the cumulative utterance time difference, obtained by subtracting the cumulative utterance time of the previous speaker from that of the new speaker; and the amount of change from the previous speaker position to the new speaker position.

[0010] In a third invention based on the second invention, the speaker position history is recorded for each speaker, the speaker is identified from a video signal or from two or more audio signals, and the delay time is set from the history of the identified speaker.

[0011]

[Embodiments of the Invention] FIG. 1 is a block diagram of an embodiment of the communication conference apparatus of the present invention. Reference numeral 10 denotes a position estimation unit (position estimation means) that estimates the speaker position from the video signal of a video camera or the audio signals of microphones; 20 denotes a selectivity control unit that creates and outputs field-of-view information for the video camera based on the speaker position information obtained by the position estimation unit 10; and 30 denotes a device control unit (device control means) that controls the viewing direction and viewing angle of the video camera based on the field-of-view information from the selectivity control unit 20.

[0012] The position estimation unit 10 detects one or more moving objects in the captured image, determines whether each moving object is a person according to whether it contains a person's head whose features have been set in advance, and then estimates the position of a single speaker according to whether that person's lips are moving. Alternatively, it estimates the position of a single speaker by processing the audio signals input from a plurality of installed microphones.

[0013] In the selectivity control unit 20, reference numeral 201 denotes a position history management unit that receives the speaker position information obtained by the position estimation unit 10 and updates the position history database 202; 203 denotes a field-of-view determination unit (field-of-view determination means) that takes the position history information as input and processes it to create field-of-view information for the video camera so that the viewing angle and viewing direction match the speaker position; 204 denotes a delay unit (delay means) that delays the field-of-view information obtained by the field-of-view determination unit 203; 205 denotes a movement delay time determination unit (delay time determination means) that determines the movement delay time of the video camera; 206 denotes a field-of-view history management unit that manages the history of the video camera's field-of-view information; and 207 denotes a field-of-view history database.

[0014] The position history database 202 stores, for each speaker, data consisting of sets of a speaker position (each speaker is assumed not to move from a specific position, and that position is registered in rectangular or polar coordinates), an utterance time, and an utterance duration. Each time a speaker change occurs and the start or end of a speaker's utterance is detected, the utterance time, utterance duration, and speaker are recorded and updated. The field-of-view history database 207 likewise stores field-of-view information (viewing angle and viewing direction) for each speaker.
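As an illustration only, the record-keeping described for the position history database 202 could be sketched in Python as follows. The class and method names (`Utterance`, `PositionHistory`, `on_utterance_start`, `on_utterance_end`), the seat labels, and the use of seconds are assumptions made for this sketch and do not appear in the specification.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Utterance:
    speaker: str                      # hypothetical label for a fixed seat
    position: tuple                   # (x, y) in rectangular coordinates
    start_time: float                 # utterance time, seconds from system start
    duration: Optional[float] = None  # filled in when the utterance ends


@dataclass
class PositionHistory:
    records: list = field(default_factory=list)

    def on_utterance_start(self, speaker, position, now):
        # A detected utterance start opens a new record.
        self.records.append(Utterance(speaker, position, now))

    def on_utterance_end(self, now):
        # A detected utterance end closes the most recent record.
        if not self.records:
            return
        current = self.records[-1]
        current.duration = now - current.start_time


history = PositionHistory()
history.on_utterance_start("A", (1.0, 2.0), now=10.0)
history.on_utterance_end(now=25.0)
```

Keeping one record per utterance, rather than one running total per speaker, leaves both the latest utterance duration and the cumulative utterance time recoverable from the same list.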

[0015] The movement delay time determination unit 205 can set the delay time of the delay unit 204 as follows. As shown in FIG. 2, the delay time is set from two quantities obtained from the newly detected speaker position and the past speaker position history: the "utterance continuation time" of the previous speaker, and the "cumulative utterance time difference", obtained by subtracting the cumulative utterance time of the previous speaker from that of the new speaker (cumulative utterance time being measured from the start of operation of the apparatus).
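The two quantities can be derived from a list of past utterances. The following Python sketch assumes, for illustration, that the history is a list of (speaker, start time, duration) tuples in chronological order and that the "utterance continuation time" is the duration of the previous speaker's most recent utterance; the function name `delay_inputs` is hypothetical.

```python
def delay_inputs(history, previous_speaker, new_speaker):
    """Return (t1, t2) for a change from previous_speaker to new_speaker.

    history: list of (speaker, start_time, duration) tuples, oldest first.
    t1: duration of the previous speaker's most recent utterance.
    t2: new speaker's cumulative time minus previous speaker's cumulative time.
    """
    cumulative = {}
    last_duration = {}
    for speaker, _start, duration in history:
        cumulative[speaker] = cumulative.get(speaker, 0.0) + duration
        last_duration[speaker] = duration
    t1 = last_duration.get(previous_speaker, 0.0)
    t2 = cumulative.get(new_speaker, 0.0) - cumulative.get(previous_speaker, 0.0)
    return t1, t2


# Speaker A lectures, B interjects briefly, A resumes, then B takes over.
history = [("A", 0.0, 120.0), ("B", 120.0, 5.0), ("A", 125.0, 90.0)]
t1, t2 = delay_inputs(history, previous_speaker="A", new_speaker="B")
# t1 is 90.0 (A's last utterance); t2 is 5.0 - 210.0 = -205.0
```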

[0016] A speaker with a long "utterance continuation time" is often giving a report or a lecture and is likely to speak again after a short remark by another speaker. To prevent frequent, unpleasant movement of the video camera's field of view, the delay time is therefore set longer when the change is from a speaker with a long "utterance continuation time" to another speaker than when it is from a speaker with a short one. In a discussion among several people, on the other hand, the cumulative utterance times of the participants often differ; the moderator's cumulative utterance time, for example, tends to be short. When a speaker change occurs while cumulative utterance times differ in this way and the "cumulative utterance time difference" is positive (the new speaker's cumulative utterance time is greater than the previous speaker's), this indicates a change from a speaker with a long "average utterance continuation time" to a speaker with a short one, so a further speaker change is more likely to follow than in the case of a negative "cumulative utterance time difference", which indicates a change from a speaker with a short "average utterance continuation time" to a speaker with a long one. Therefore, to prevent frequent movement of the video camera's field of view, the movement delay time of the video camera is made longer when the "cumulative utterance time difference" is positive than when it is negative. In summary, as shown in FIG. 2, the longer the "utterance continuation time" and the larger the positive "cumulative utterance time difference", the longer the movement delay time of the video camera is set.

[0017] Further, in the present invention, the movement delay time determination unit 205 also takes field-of-view information into account when determining the delay time. As shown in FIG. 3, when the "field-of-view movement amount" from the previous field of view to the new one is small, the change in the camera's view is small and causes little disturbance to the video, so quick tracking can take priority over avoiding disturbance and the delay time is set short. Conversely, when the "field-of-view movement amount" is large, priority is given to reducing the risk of a disturbed image, and the movement delay time of the video camera is set long.

[0018] One concrete way to determine the delay time is to create and use a table that gives the delay time from the elements described above: the "utterance continuation time", the "cumulative utterance time difference", and the "field-of-view movement amount". In this case, the contents of the table are updated each time a speaker change occurs.
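A table of this kind might be organised as in the following Python sketch. The bin edges, the delay values, and the names `T1_EDGES`, `W_EDGES`, `DELAY_TABLE`, and `lookup_delay` are all hypothetical; the specification does not give the table contents, and the per-change update of the table is not modelled here.

```python
from bisect import bisect_right

T1_EDGES = [10.0, 60.0]   # seconds; assumed bins for utterance continuation time
W_EDGES = [15.0, 45.0]    # degrees; assumed bins for field-of-view movement

# DELAY_TABLE[i][j][k]: i = t1 bin, j = 0 if t2 <= 0 else 1, k = w bin.
# Values (seconds) grow with each index, following the trends in FIG. 2 and FIG. 3.
DELAY_TABLE = [
    [[0.5, 1.0, 1.5], [1.0, 1.5, 2.0]],
    [[1.0, 1.5, 2.0], [1.5, 2.0, 2.5]],
    [[1.5, 2.0, 2.5], [2.0, 2.5, 3.0]],
]


def lookup_delay(t1, t2, w):
    i = bisect_right(T1_EDGES, t1)
    j = 1 if t2 > 0 else 0
    k = bisect_right(W_EDGES, w)
    return DELAY_TABLE[i][j][k]


short = lookup_delay(5.0, -20.0, 10.0)    # brief remark, small movement
long_ = lookup_delay(120.0, 30.0, 90.0)   # long talk, large movement
```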

[0019] Alternatively, let T be the delay time before the video camera starts moving, t1 the "utterance continuation time", t2 the "cumulative utterance time difference", and w the "field-of-view movement amount", and let a, b, c, and d be positive coefficients. The delay time can then be computed from an expression such as

T = f(a·t1 + b·t2 + c·w + d)

The function f( ) is a monotonically increasing function whose range has a minimum of 0 or more and is bounded above. One example is

tanh(x) + 1 = 2 exp(x) / {exp(x) + exp(-x)}

Another is the step function

s(x) = 1 (x > threshold)
s(x) = 0 (x ≤ threshold)
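The expression can be evaluated directly. The Python sketch below implements both example functions; the function names and the coefficient values in the usage lines are assumptions chosen only to show the shape of the calculation.

```python
import math


def squash_tanh(x):
    # tanh(x) + 1 is monotonically increasing with values between 0 and 2.
    return math.tanh(x) + 1.0


def make_step(threshold):
    # Step function: 1 above the threshold, 0 at or below it.
    def step(x):
        return 1.0 if x > threshold else 0.0
    return step


def movement_delay(t1, t2, w, a, b, c, d, f=squash_tanh):
    # T = f(a*t1 + b*t2 + c*w + d)
    return f(a * t1 + b * t2 + c * w + d)


# Illustrative coefficients only.
after_lecture = movement_delay(120.0, 30.0, 40.0, a=0.01, b=0.01, c=0.01, d=0.1)
after_remark = movement_delay(5.0, -30.0, 5.0, a=0.01, b=0.01, c=0.01, d=0.1)
```

With these inputs the delay after a long lecture comes out larger than the delay after a short remark, which is the behaviour paragraphs [0016] and [0017] ask for. Because f is bounded above, an unusually long utterance cannot push the delay beyond a fixed ceiling.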

[0020] As described above, the delay time from when the field of view of the video camera is determined until the camera starts moving is decided using at least one of the "utterance continuation time", the "cumulative utterance time difference", and the "field-of-view movement amount", so the delay between creation of the field-of-view information and the start of camera movement is adjusted adaptively. Compared with a fixed delay, this makes it possible to lengthen the delay time when speakers change frequently, which prevents disturbance of the video, and to shorten the delay time when the field of view moves greatly, which achieves a quick change of view. If a long delay time has been set and a speaker change occurs before it expires, the previous delay time is reset and a delay time appropriate to the new speaker is set.
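The reset behaviour in the last sentence can be pictured with a small Python sketch of a delay unit that holds at most one pending field of view. The class name `DelayUnit`, its methods, and the polling style are assumptions for illustration; the specification does not say how the delay unit 204 is implemented.

```python
class DelayUnit:
    """Holds one pending field of view and releases it after its delay."""

    def __init__(self):
        self.pending_view = None
        self.deadline = None

    def schedule(self, view, delay, now):
        # A new decision discards any pending one, which resets the delay.
        self.pending_view = view
        self.deadline = now + delay

    def poll(self, now):
        # Return the view once its delay has elapsed, otherwise None.
        if self.pending_view is None or now < self.deadline:
            return None
        view = self.pending_view
        self.pending_view = None
        self.deadline = None
        return view


unit = DelayUnit()
unit.schedule("view_B", delay=3.0, now=0.0)
early = unit.poll(now=2.0)                   # delay not yet expired
unit.schedule("view_C", delay=1.0, now=2.0)  # speaker changed again: reset
released = unit.poll(now=3.0)
```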

[0021] FIG. 4 shows a specific example of the position estimation unit 10, for the case where the speaker position is estimated by processing an image captured by a video camera. Here, the luminance signal Y of the image is converted into a digital signal by the A/D converter 101, and the motion area detection unit 102 detects motion areas by taking the difference between frames. The binary conversion unit 103 binarizes the difference information indicating the motion areas, using a threshold created from the output signal of the A/D converter 101. Next, the binarized difference information is taken into the horizontal moving area extraction unit 104, which accumulates it over time and spatially along the horizontal direction (the lateral direction of the image) to obtain the horizontal position coordinates of each moving object, and sends them to the moving area selection unit 105. The moving area selection unit 105 selects the position coordinates of one moving object from the horizontal position coordinates of the plurality of moving objects obtained, and sends them to the head-top extraction unit 106 and the face width extraction unit 107. The head-top extraction unit 106 obtains the coordinates of the top of the head of the one moving object from the binarized difference information obtained by the binary conversion unit 103 and the position coordinates of the moving object obtained by the moving area selection unit 105. The face width extraction unit 107 creates head information, in the form of the left and right coordinates of the head of the moving object (person), from the binarized difference information obtained by the binary conversion unit 103, the edge (contour) information of the image obtained by the edge detection unit 108, and the head-top coordinates obtained by the head-top extraction unit 106, and sends it to the face feature extraction unit 109. The face feature extraction unit 109 determines whether the object is a person's head according to whether it contains information corresponding to the vertical lines of the cheeks and the horizontal lines of the eyebrows, eyes, and so on, which are among the facial features of a person, and when it determines that the object is a head, sends that information to the speaker determination unit 110. The speaker determination unit 110 determines the region where the lips are located from the left and right coordinates of the head and the head-top coordinates, detects the amount of change in the lip region at predetermined time intervals, and determines whether the person is the speaker according to whether the lips are moving up and down. In this way, the speaker position is estimated from the image information (reference: Japanese Unexamined Patent Publication No. H7-225841).
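The first stages of this pipeline (frame difference, binarization, and extraction of horizontal moving areas) can be sketched in plain Python. This is a simplified reading, not the method of the cited publication: it uses a fixed threshold rather than one derived from the A/D output, processes a single pair of frames without temporal accumulation, and sums the binarized differences down each column to form a profile along the horizontal axis. All function names are hypothetical.

```python
def horizontal_motion_profile(prev_frame, curr_frame, threshold):
    """Count, per column, the pixels whose luminance changed by more than threshold.

    Frames are equal-sized lists of rows of luminance values.
    """
    width = len(curr_frame[0])
    profile = [0] * width
    for prev_row, curr_row in zip(prev_frame, curr_frame):
        for x, (p, c) in enumerate(zip(prev_row, curr_row)):
            if abs(c - p) > threshold:   # frame difference, then binarization
                profile[x] += 1
    return profile


def moving_object_spans(profile, min_count=1):
    """Return (left, right) column spans where the profile is at least min_count."""
    spans, start = [], None
    for x, count in enumerate(profile):
        if count >= min_count and start is None:
            start = x
        elif count < min_count and start is not None:
            spans.append((start, x - 1))
            start = None
    if start is not None:
        spans.append((start, len(profile) - 1))
    return spans


prev = [[0] * 6 for _ in range(3)]
curr = [
    [0, 200, 200, 0, 0, 0],
    [0, 200, 200, 0, 200, 0],
    [0, 0, 0, 0, 200, 0],
]
profile = horizontal_motion_profile(prev, curr, threshold=50)
spans = moving_object_spans(profile)
```

Each span gives the horizontal extent of one candidate moving object, from which a single object would then be selected for head and lip analysis.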

[0022] FIG. 5 shows another specific example of the position estimation unit 10, for the case where the speaker position is estimated by processing a plurality of audio signals obtained by a plurality of microphones. Here, cross-correlation functions of the received audio signals are calculated for all combinations of microphones. For the cross-correlation functions between one predetermined reference microphone and each of the other microphones, the time difference giving the maximum value is found and taken as a preliminary estimated time difference. The time difference that maximizes the delay-and-sum power over all microphones is then searched for in the vicinity of the preliminary estimated time difference and taken as the estimated time difference, and the sound source position is calculated from this estimated time difference (reference: Japanese Unexamined Patent Publication No. H11-304906). Other methods of estimating a speaker position by processing the signals obtained by a plurality of microphones are described in detail in Chapter 7 of "Acoustic Systems and Signal Processing", Oga et al., Institute of Electronics, Information and Communication Engineers.
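The preliminary step, finding the time difference that maximizes the cross-correlation between the reference microphone and one other microphone, can be sketched in Python as below. The sketch works in whole samples, searches lags by brute force, and omits the subsequent delay-and-sum refinement and the position calculation; the function name `cross_correlation_lag` and the toy signals are assumptions.

```python
def cross_correlation_lag(reference, other, max_lag):
    """Return the lag, in samples, of `other` relative to `reference`
    that maximizes their cross-correlation within +/- max_lag."""
    best_lag, best_value = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        value = 0.0
        for i, sample in enumerate(reference):
            j = i + lag
            if 0 <= j < len(other):
                value += sample * other[j]
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag


# The same pulse reaches the second microphone two samples later.
reference_mic = [0, 0, 1, 2, 1, 0, 0, 0]
other_mic = [0, 0, 0, 0, 1, 2, 1, 0]
lag = cross_correlation_lag(reference_mic, other_mic, max_lag=4)
```

Dividing the lag by the sampling rate gives the preliminary estimated time difference for that microphone pair.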

[0023] The device control unit 30 controls the viewing direction and viewing angle of the video camera based on the field-of-view information from the selectivity control unit 20. For the viewing direction, the vertical and horizontal orientation of the video camera is controlled so that the camera faces the estimated speaker position. For the viewing angle, for example, a viewing angle suited to each speaker position is stored in a table in advance, the viewing angle for the estimated speaker position is read from the table, and the zoom of the video camera is controlled toward the wide-angle or telephoto side accordingly.
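A minimal Python sketch of this control step follows. It assumes a camera at the origin with the y axis pointing straight ahead and x to the right, so that the pan angle is positive to the right; the table contents and the names `VIEW_ANGLE_TABLE`, `pan_angle_deg`, and `view_angle_for` are hypothetical.

```python
import math

# Assumed table: viewing angle in degrees for each registered speaker position.
VIEW_ANGLE_TABLE = {
    (-1.0, 2.0): 30.0,
    (1.0, 1.0): 40.0,
}


def pan_angle_deg(speaker_xy, camera_xy=(0.0, 0.0)):
    # Horizontal angle from the camera's forward (y) axis to the speaker.
    dx = speaker_xy[0] - camera_xy[0]
    dy = speaker_xy[1] - camera_xy[1]
    return math.degrees(math.atan2(dx, dy))


def view_angle_for(speaker_xy, default=60.0):
    # Fall back to a wide default angle for positions not in the table.
    return VIEW_ANGLE_TABLE.get(speaker_xy, default)


pan = pan_angle_deg((1.0, 1.0))
zoom = view_angle_for((1.0, 1.0))
```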

[0024] The directivity of the microphones can also be controlled, together with the viewing direction and viewing angle of the video camera, based on the obtained field-of-view information. In this case, a time delay is applied to the audio signal of each microphone so that the output signals of the microphones are in phase for sound arriving from the viewing direction determined by the field-of-view information, and the signals are multiplied by predetermined weighting coefficients and then summed. As a result, audio signals other than those propagating from the determined viewing direction cancel one another and are attenuated, achieving sharp directivity toward the viewing direction.
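The delay, weight, and sum operation described here is commonly called delay-and-sum beamforming. The Python sketch below shows it with integer sample delays and equal default weights; the function name `delay_and_sum` and the toy two-microphone signals are assumptions for illustration.

```python
def delay_and_sum(channels, delays, weights=None):
    """Delay each channel by an integer number of samples, weight it, and sum.

    channels: list of equal-length sample lists, one per microphone.
    delays:   non-negative sample delay applied to each channel.
    """
    if weights is None:
        weights = [1.0 / len(channels)] * len(channels)
    length = len(channels[0])
    output = [0.0] * length
    for samples, delay, weight in zip(channels, delays, weights):
        for n in range(length):
            src = n - delay
            if 0 <= src < length:
                output[n] += weight * samples[src]
    return output


# A pulse from the steered direction reaches mic 0 two samples before mic 1.
channels = [
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
]
steered = delay_and_sum(channels, delays=[2, 0])    # channels brought into phase
unsteered = delay_and_sum(channels, delays=[0, 0])  # no alignment
```

When the delays match the arrival-time differences for the chosen direction, the pulses add to full amplitude; without alignment they stay separated at half amplitude, which is the attenuation of off-direction sound that the paragraph describes.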

[0025] FIG. 6 shows an embodiment in which the apparatus of the present invention is applied to an automatic speaker tracking system of a communication conference apparatus. Sound uttered by the speaker 60 is collected by the microphone array 50, the position estimation unit 10 estimates the speaker position from the collected multi-channel audio signals, the estimated speaker position information is input to the selectivity control unit 20 to obtain field-of-view information, and the device control unit 30 uses this field-of-view information to control the viewing direction and viewing angle of the video camera 40.

[0026] FIG. 7 shows an embodiment in which the apparatus of the present invention is applied to an automatic speaker tracking directional sound collection system of a communication conference apparatus. Sound uttered by the speaker 60 is collected by the microphone array 50, the position estimation unit 10 estimates the speaker position from the collected multi-channel audio signals, the selectivity control unit 20 determines the sound collection area (that is, the field of view) from the estimated speaker position, the device control unit 30 calculates the direction and width of the directivity, and the array signal processing device 70 outputs, from the multi-channel audio signals, only the audio signal from the sound collection area.

[0027]

[Effects of the Invention] As described above, according to the present invention, the field-of-view history is used in addition to the speaker position history, so the time from the field-of-view or directivity determination until the video camera starts changing its field of view or the microphone starts changing its directivity can be controlled adaptively, with the advantage that speaker changes can be handled appropriately.

[Brief description of the drawings]

FIG. 1 is a block diagram of the main part of the communication conference apparatus of the present invention.

FIG. 2 is an explanatory diagram of a method of setting the movement delay time of the video camera adapted to the speakers' utterances.

FIG. 3 is an explanatory diagram of a method of setting the movement delay time of the video camera adapted to the amount of field-of-view movement of the video camera.

FIG. 4 is an explanatory diagram of a position estimation unit that uses images.

FIG. 5 is an explanatory diagram of a position estimation unit that uses sound.

FIG. 6 is an explanatory diagram of an automatic speaker tracking video camera system.

FIG. 7 is an explanatory diagram of an automatic speaker tracking directional sound collection system.

FIG. 8 is a block diagram of the main part of a conventional communication conference apparatus.

[Explanation of symbols]

10: position estimation unit, 101: A/D converter, 102: motion area detection unit, 103: binary conversion unit, 104: horizontal moving area detection unit, 105: moving area selection unit, 106: head-top extraction unit, 107: face width extraction unit, 108: edge detection unit, 109: face feature extraction and determination unit, 110: speaker determination unit
20: selectivity control unit, 201: position history management unit, 202: position history database, 203: field-of-view determination unit, 204: delay unit, 205: movement delay time determination unit, 206: field-of-view history management unit, 207: field-of-view history database
30: device control unit
40: video camera
50: microphone array
60: speaker
70: array signal processing device

Continued from front page

F-terms (reference): 5C022 AA12 AB36 AB62 AB63 AB66 AC41 AC69 AC72 5C064 AA02 AB04 AC04 AC09 AC12 AC22 AD14 5D020 BB03 BB04

Claims


1. A communication conference apparatus comprising: position estimation means for estimating a speaker position from a video signal or from two or more audio signals; position history management means for recording the speaker position obtained by the position estimation means and updating a position history; field-of-view or directivity determination means for determining, based on the position history, the field of view of imaging means that captures the video signal or the directivity of sound collection means that captures the audio signals; delay means for delaying the determined field-of-view or directivity information; delay time determination means for setting a variable delay time for the delay means; and device control means for setting the field of view of the imaging means or the directivity of the sound collection means to match a new speaker position, using the field-of-view or directivity information delayed by the delay means.

2. The communication conference apparatus according to claim 1, wherein the speaker position history consists of sets of a speaker position, an utterance time, and an utterance duration, and the delay time determination means sets the delay time using at least one of: the utterance continuation time of the previous speaker, obtained from the utterance duration; the cumulative utterance time difference, obtained by subtracting the cumulative utterance time of the previous speaker from that of the new speaker; and the amount of change from the previous speaker position to the new speaker position.

3. The communication conference apparatus according to claim 2, wherein the speaker position history is recorded for each speaker, the speaker is identified from a video signal or from two or more audio signals, and the delay time is set from the history of the identified speaker.