JP2003189273A - Speaker identifying device and video conference system provided with speaker identifying device

Info

Publication number: JP2003189273A
Application number: JP2001387569A
Authority: JP
Inventors: Takashi Imai; 孝志今井; Kazuya Iwasaki; 一也岩崎
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2001-12-20
Filing date: 2001-12-20
Publication date: 2003-07-04
Anticipated expiration: 2021-12-20
Also published as: JP4212274B2

Abstract

PROBLEM TO BE SOLVED: To provide a video conference system capable of always accurately displaying the speaker on a monitor screen.

SOLUTION: When a human detecting part 3, which detects persons in an input image, detects a person in the voice source direction obtained by a direction detecting part 2, which detects the voice source direction of an input voice, a CGROM 6 and a superimposition generation circuit 4 display a prescribed marker indicating the speaker around the image of the detected person on the monitor screen. The human detecting part 3 is provided with a face extracting means 3a for detecting a person by extracting the person's face color, a facial contour detecting means 3b for detecting the facial contour, eyes, nose and lips of the person, and a lips detecting means 3c for detecting the speaker by detecting lip movement. The direction detecting part 2 can be provided with a first storage memory for storing the voice source characteristics of video conference participants input in advance, and a voice source characteristics comparing means for comparing the characteristics of a newly input voice source with those stored in the first storage memory.

COPYRIGHT: (C)2003,JPO

Description

Detailed Description of the Invention

[0001]

[Technical field of the invention] The present invention relates to a speaker identification device for a video conference system and to a video conference system equipped with that speaker identification device. In particular, it relates to a speaker identification device that, while a video conference is being held among a plurality of participants, attaches a marker (mark) identifying the speaker and displays it on the monitor screen, and to a video conference system equipped with that speaker identification device.

[0002]

[Prior art] As prior art for a video conference system that identifies the speaker and displays a marker (mark) during a video conference, there is, for example, the system disclosed in Japanese Patent Laid-Open No. 8-37655, "Video conference system having a speaker identification display function", which identifies only the input direction of a sound, determines the direction in which the speaker is present, and displays the marker accordingly. That is, in the technique disclosed in Japanese Patent Laid-Open No. 8-37655, while a video conference is being held among a plurality of participants, a position (coordinates) on the monitor screen is obtained from the sound direction data supplied by a sound direction detector, i.e. direction detecting means, that detects the direction in which a sound is generated. The camera device then moves automatically to the position of the speaker who is speaking in the video conference and, when the speaker appears on the monitor screen, a marker (mark) that identifies the speaker is attached. The publication states that participants speaking in the video conference can thereby be identified easily.

[0003]

[Problem to be solved by the invention] However, such a conventional video conference system with a speaker identification display function cannot identify the speech of the participants as such. Whenever any sound, i.e. a noise, occurs, it attaches a marker (mark) based solely on the direction from which that noise came. Consequently, the speaker identification marker is displayed even when a voice or noise other than a participant's speech occurs. For example, a sneeze, the sound of a dropped pen, or the sound of an object hitting the microphone also triggers the marker display. The marker may therefore be displayed at a position other than that of the speaker the conference actually needs to show, or the camera device may swing in an unintended direction, so that the monitor screen becomes very hard to watch or confusing for the participants.

[0004] The present invention has been made to solve this problem. For a video conference system that inputs, transmits, and receives images and audio, it aims to provide a speaker identification device, and a video conference system equipped with that device, comprising, for example, direction detecting means for detecting the sound source direction from the input audio, human detecting means for detecting a person in the input image, and marking means for displaying a predetermined marker (mark) around the image of the person detected by the human detecting means. Only when the human detecting means detects a person in the sound source direction obtained by the direction detecting means is the sound judged to be speech uttered by a speaker in the video conference, whereupon the marking means displays the predetermined marker indicating the speaker around the image of that person on the monitor screen.

[0005]

[Means for solving the problem] The speaker identification device of a video conference system according to the present invention, and the video conference system equipped with that device, are constituted by the following specific technical means.

[0006] A first technical means is a speaker identification device of a video conference system, which identifies a speaker in a video conference system that inputs, transmits, and receives images and audio, comprising: direction detecting means for detecting a sound source direction from input audio; human detecting means for detecting a person in an input image; and marking means for displaying a predetermined marker around the image of the person detected by the human detecting means on the displayed monitor screen; wherein, when the human detecting means detects a person in the sound source direction obtained by the direction detecting means, the marking means displays a marker indicating that person as the speaker in the video conference.
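As an illustration only, the decision rule of the first technical means can be sketched in Python. The sketch assumes that the sound source direction has already been converted to a horizontal coordinate on the monitor screen and that each detected person is represented by the horizontal extent of their image. `PersonBox`, `find_speaker`, and the `tolerance` parameter are hypothetical names introduced for this example and do not appear in the publication.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PersonBox:
    """Horizontal extent of one detected person on the monitor screen (assumed model)."""
    person_id: int
    left: int   # left edge in pixels
    right: int  # right edge in pixels


def find_speaker(sound_x: int,
                 people: Sequence[PersonBox],
                 tolerance: int = 0) -> Optional[PersonBox]:
    """Return the person located in the sound source direction, or None.

    None means that no person was detected in that direction, so the
    sound is treated as noise and no speaker marker should be displayed.
    """
    for person in people:
        if person.left - tolerance <= sound_x <= person.right + tolerance:
            return person
    return None


people = [PersonBox(1, 100, 200), PersonBox(2, 400, 500)]
speaker = find_speaker(150, people)  # sound from inside person 1's image
noise = find_speaker(300, people)    # sound from empty space between people
```

A sound arriving from a direction in which nobody is visible, such as a pen dropped between two seats, yields `None`, which corresponds to the case in which the marker is not displayed.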

[0007] A second technical means is the speaker identification device of a video conference system according to the first technical means, wherein the human detecting means further comprises face extracting means for detecting a person by extracting the color of the person's face.

[0008] A third technical means is the speaker identification device of a video conference system according to the first or second technical means, wherein the human detecting means further comprises facial contour extracting means for detecting a person by extracting the contour of the person's face and the contours of the eyes, nose, or mouth.

[0009] A fourth technical means is the speaker identification device of a video conference system according to any of the first to third technical means, wherein the human detecting means further comprises lip detecting means for detecting that a person's lips are moving.

[0010] A fifth technical means is the speaker identification device of a video conference system according to the fourth technical means, wherein, when the lip detecting means detects that a person's lips are moving, the marking means displays the marker indicating the speaker.

[0011] A sixth technical means is the speaker identification device of a video conference system according to the fifth technical means, wherein the marking means displays the marker indicating the speaker when the human detecting means detects a person in the sound source direction obtained by the direction detecting means and, in addition, the lip detecting means detects that the person's lips are moving.

[0012] A seventh technical means is the speaker identification device of a video conference system according to any of the first to sixth technical means, wherein the direction detecting means further comprises first storage means for storing the features of a sound source and sound source feature comparing means for comparing the features of a newly input sound source with the features of the sound source stored in the first storage means; the human detecting means further comprises second storage means for storing the image position of a detected person; and, when the sound source feature comparing means detects that the features of the newly input sound source match the features of the sound source stored in the first storage means, the image position of the person stored in the second storage means is read out and the marking means displays the marker indicating the speaker around the image of that person on the monitor screen.

[0013] An eighth technical means is the speaker identification device of a video conference system according to the seventh technical means, wherein the sound source feature comparing means reads out, from the sound source features stored in the first storage means, the features of the sound source located in the direction of the newly input sound source and compares them with the features of the newly input sound source; and, when they are found not to match and the human detecting means detects a person in the sound source direction obtained by the direction detecting means, the features of the newly input sound source are stored anew in the first storage memory and the image position of the person detected by the human detecting means is stored in the second storage memory.

[0014] A ninth technical means is the speaker identification device of a video conference system according to the seventh or eighth technical means, wherein the sound source feature comparing means reads out, from the sound source features stored in the first storage means, the features of the sound source located in the direction of the newly input sound source and compares them with the features of the newly input sound source; and, when they are found not to match and the human detecting means does not detect a person in the sound source direction obtained by the direction detecting means, the display of the marker indicating the speaker by the marking means is left in its previous state.
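The interplay of the seventh to ninth technical means can be sketched as follows, purely as an illustration. The publication does not specify how sound source features are represented or compared; this sketch assumes a numeric feature vector per direction and treats two vectors as matching when their Euclidean distance is within a threshold. `SpeakerMemory`, `locate`, `detect_person`, and `threshold` are hypothetical names.

```python
import math
from typing import Callable, Dict, Optional, Sequence, Tuple

Position = Tuple[int, int]  # (x, y) image position on the monitor screen


class SpeakerMemory:
    """Per-direction store of voice features and image positions (assumed model)."""

    def __init__(self, threshold: float = 0.5) -> None:
        self.threshold = threshold                         # assumed matching tolerance
        self.features: Dict[int, Tuple[float, ...]] = {}   # role of the first storage means
        self.positions: Dict[int, Position] = {}           # role of the second storage means

    def _matches(self, a: Sequence[float], b: Sequence[float]) -> bool:
        # Assumption: features match when their Euclidean distance is small.
        return len(a) == len(b) and math.dist(a, b) <= self.threshold

    def locate(self,
               direction: int,
               feature: Sequence[float],
               detect_person: Callable[[int], Optional[Position]]) -> Optional[Position]:
        """Return where to display the speaker marker, or None to leave it unchanged."""
        stored = self.features.get(direction)
        if stored is not None and self._matches(stored, feature):
            # Seventh means: known voice, reuse the stored image position.
            return self.positions[direction]
        position = detect_person(direction)
        if position is None:
            # Ninth means: unknown voice and nobody visible, keep the marker as it is.
            return None
        # Eighth means: unknown voice but a person is visible, store both anew.
        self.features[direction] = tuple(feature)
        self.positions[direction] = position
        return position
```

In this sketch the image-based detection (`detect_person`) runs only when the voice features do not match, so a participant whose voice is already registered can be marked without analyzing the image again.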

[0015] A tenth technical means is the speaker identification device of a video conference system according to any of the seventh to ninth technical means, wherein the second storage means further comprises facial feature storage means for storing facial features, relating to the contour of a person's face and/or the contours of the eyes, nose, or mouth, that are extracted from the input image by the facial contour extracting means of the human detecting means; the human detecting means further comprises facial feature comparing means for comparing the facial features stored in the facial feature storage means with facial features newly extracted from the input image by the facial contour extracting means; and, when the sound source feature comparing means detects that the features of the newly input sound source match the features of the sound source stored in the first storage means, the facial feature comparing means compares the facial features stored in the facial feature storage means with the facial features newly extracted by the facial contour extracting means and, if they are found to match, the image position of the person stored in the second storage means is read out and the marking means displays the marker indicating the speaker around the image of that person on the monitor screen.

[0016] An eleventh technical means is the speaker identification device of a video conference system according to the tenth technical means, wherein, when the facial feature comparing means compares the facial features stored in the facial feature storage means with the facial features newly extracted by the facial contour extracting means of the human detecting means and finds that they do not match, the display of the marker indicating the speaker by the marking means is left in its previous state.

[0017] A twelfth technical means is the speaker identification device of a video conference system according to any of the first to eleventh technical means, wherein the direction detecting means further comprises first storage means for storing the features of a sound source and sound source feature comparing means for comparing the features of a newly input sound source with the features of the sound source stored in the first storage means; the human detecting means further comprises second storage means for storing the image position of a detected person and lip detecting means for detecting that the lips of the detected person are moving; and, when the sound source feature comparing means detects that the features of the newly input sound source match the features of the sound source stored in the first storage means and the lip detecting means detects that the lips of the person in the direction of the matching sound source are moving, the image position of the person stored in the second storage means is read out and the marking means displays the marker indicating the speaker around the image of that person on the monitor screen.

[0018] A thirteenth technical means is the speaker identification device of a video conference system according to the twelfth technical means, wherein, when the lip detecting means does not detect that the lips of the person in the direction of the matching sound source are moving, the display of the marker indicating the speaker by the marking means is left in its previous state.

[0019] A fourteenth technical means is the speaker identification device of a video conference system according to any of the first to thirteenth technical means, wherein the direction detecting means further comprises time detecting means for detecting the duration for which input audio has been continuously input, and, when the duration of the input audio detected by the time detecting means is judged to be at least a preset predetermined duration, the human detecting means starts the operation of detecting a person in the input image.

[0020] A fifteenth technical means is the speaker identification device of a video conference system according to any of the first to fourteenth technical means, wherein the direction detecting means further comprises time detecting means for detecting the duration for which input audio has been continuously input, and, when the duration of the input sound source detected by the time detecting means is judged to be at least a preset predetermined duration, the direction detecting means starts the operation of detecting the sound source direction of the input audio.

[0021] A sixteenth technical means is the speaker identification device of a video conference system according to the fourteenth or fifteenth technical means, wherein the direction detecting means further comprises audio level detecting means for detecting the audio level of the input audio, and the time detecting means starts detecting the duration when the audio level of the input audio detected by the audio level detecting means is found to be at least a predetermined audio level.
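The level and duration conditions of the fourteenth to sixteenth technical means can be illustrated with a small gate that decides when direction detection or human detection may start. This is a sketch under assumptions: audio is taken to arrive as (timestamp in milliseconds, level) samples, and dropping below the level threshold is taken to reset the timer. `SpeechGate`, `feed`, `min_level`, and `min_duration_ms` are hypothetical names.

```python
from typing import Optional


class SpeechGate:
    """Opens once the input level has stayed at or above min_level for min_duration_ms."""

    def __init__(self, min_level: float, min_duration_ms: int) -> None:
        self.min_level = min_level
        self.min_duration_ms = min_duration_ms
        self._started_at: Optional[int] = None  # when the current loud run began

    def feed(self, timestamp_ms: int, level: float) -> bool:
        """Process one sample; return True when detection may be started."""
        if level < self.min_level:
            # Too quiet: timing does not start, and any running timer is reset.
            self._started_at = None
            return False
        if self._started_at is None:
            # Level threshold reached: start measuring the duration.
            self._started_at = timestamp_ms
        return timestamp_ms - self._started_at >= self.min_duration_ms


gate = SpeechGate(min_level=0.2, min_duration_ms=500)
results = [
    gate.feed(0, 0.9),    # loud, timer starts
    gate.feed(100, 0.05), # quiet, timer reset (e.g. a short knock has ended)
    gate.feed(200, 0.8),  # loud again, timer restarts at 200 ms
    gate.feed(600, 0.8),  # 400 ms of continuous sound, still too short
    gate.feed(700, 0.8),  # 500 ms of continuous sound, gate opens
]
```

Short noises such as a dropped pen never hold the level long enough to open the gate, which is the filtering effect these technical means aim at.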

[0022] A seventeenth technical means is the speaker identification device of a video conference system according to any of the first to sixteenth technical means, wherein the human detecting means detects the persons in the input image, and the marking means displays, for each person detected by the human detecting means, a marker distinguished by color and/or shape around the image of that person on the monitor screen.

[0023] An eighteenth technical means is the speaker identification device of a video conference system according to the seventeenth technical means, wherein, when the marking means displays the markers distinguished by color and/or shape around the images, on the monitor screen, of the persons detected by the human detecting means, it displays a speaker marker, of a color and/or shape indicating the speaker, around the image of the person at the image position in the sound source direction obtained by the direction detecting means, and displays non-speaker markers, of a color and/or shape indicating a non-speaker, around the images of the persons at image positions outside that sound source direction.

[0024] A nineteenth technical means is the speaker identification device of a video conference system according to any of the first to eighteenth technical means, wherein the human detecting means further comprises detection time storage means for storing the time at which a person in the input image was detected, and the detection time stored in the detection time storage means is displayed on the monitor screen as the participation time of the detected person.

[0025] A twentieth technical means is the speaker identification device of a video conference system according to any of the first to nineteenth technical means, wherein the human detecting means comprises one or more second storage means capable of storing the image positions of one or more persons detected in the input image. The human detecting means further comprises position comparing means which, when the human detecting means detects all persons in the input image at a preset fixed interval and obtains the image position of each detected person, compares each detected image position with the image positions of the one or more persons stored in the second storage means. The device further comprises leaving time storage means and participation time storage means. When the position comparing means finds that an image position stored in the second storage means matches none of the detected image positions, the person at that stored image position is regarded as having left, and the leaving time storage means stores the time at which the person ceased to be detected as the leaving time. Conversely, when the position comparing means finds that a detected image position matches none of the image positions stored in the second storage means, the person at that detected image position is regarded as having newly joined, and the participation time storage means stores the time at which the person was newly detected as the participation time. The leaving time stored in the leaving time storage means and the participation time stored in the participation time storage means are displayed on the monitor screen.
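The position comparison of the twentieth technical means amounts to a two-way difference between the stored image positions and the positions found in the latest periodic scan. The following Python sketch illustrates this under the assumption that two positions "match" when they lie within a pixel tolerance of each other; the publication does not define the matching criterion, and `diff_attendance`, `same_place`, and `tolerance` are hypothetical names.

```python
from typing import List, Sequence, Tuple

Position = Tuple[int, int]  # (x, y) image position of a person


def same_place(a: Position, b: Position, tolerance: int) -> bool:
    """Assumed criterion: positions match if both coordinates differ by at most tolerance."""
    return abs(a[0] - b[0]) <= tolerance and abs(a[1] - b[1]) <= tolerance


def diff_attendance(stored: Sequence[Position],
                    detected: Sequence[Position],
                    tolerance: int = 20) -> Tuple[List[Position], List[Position]]:
    """Return (joined, left) for one periodic scan.

    left:   stored positions that match no detected position (person has left).
    joined: detected positions that match no stored position (person has joined).
    """
    left = [s for s in stored
            if not any(same_place(s, d, tolerance) for d in detected)]
    joined = [d for d in detected
              if not any(same_place(d, s, tolerance) for s in stored)]
    return joined, left


stored_positions = [(100, 100), (300, 100)]
detected_positions = [(105, 98), (500, 100)]
joined, left = diff_attendance(stored_positions, detected_positions)
```

A caller would record the current time as the participation time for each entry in `joined` and as the leaving time for each entry in `left`, then update the stored positions before the next scan.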

[0026] A twenty-first technical means is the speaker identification device of a video conference system according to any of the first to twentieth technical means, wherein the direction detecting means further comprises sound source detection storage means for storing, as a sound source detection time, the time at which the sound source of input audio was detected, and the sound source detection time stored in the sound source detection storage means is displayed on the monitor screen.

[0027] A twenty-second technical means is the speaker identification device of a video conference system according to any of the first to twenty-first technical means, further comprising a camera device for image input that is movable at least in the horizontal direction, wherein the camera device is controlled so as to point in the sound source direction detected by the direction detecting means.

[0028] A twenty-third technical means is the speaker identification device of a video conference system according to the twenty-second technical means, wherein, when the camera device is controlled so as to point in the sound source direction detected by the direction detecting means, the camera device is returned to its position before the movement if the human detecting means detects no person in the detected sound source direction and/or no state is detected in which the marker indicating the speaker should be newly displayed.

[0029] A twenty-fourth technical means is the speaker identification device of a video conference system according to any of the first to twenty-third technical means, wherein the camera device for image input, which is movable at least in the horizontal direction, further comprises zoom means capable of zooming, and the camera device is controlled so as to point in the sound source direction detected by the direction detecting means and to zoom by means of the zoom means.

[0030] A twenty-fifth technical means is the speaker identification device of a video conference system according to the twenty-fourth technical means, wherein, when the camera device is controlled so as to point in the sound source direction detected by the direction detecting means and to zoom by means of the zoom means, the camera device is returned to its position before the movement if the human detecting means detects no person in the detected sound source direction and/or no state is detected in which the marker indicating the speaker should be newly displayed.

[0031] A twenty-sixth technical means is a video conference system equipped with the speaker identification device of a video conference system according to any of the first to twenty-fifth technical means.

[0032]

[Embodiments of the invention] An example embodiment of the speaker identification device of a video conference system according to the present invention, and of the video conference system equipped with that device, is described below with reference to the drawings. FIG. 1 is a block diagram showing an example configuration of a video conference system equipped with the speaker identification device according to the present invention.

[0033] As shown in FIG. 1, the video conference system equipped with the speaker identification device according to the present invention comprises a direction detecting part 2 that detects the sound source direction from the input audio and a human detecting part 3 that detects a person in the input image. As marking means for displaying the identification marker, it further comprises a CGROM 6, which is a ROM storing the various CG (computer graphics) information to be displayed on the monitor screen, and a superimposition generation circuit 4 that superimposes the CG information from the CGROM 6 on the monitor screen. The operation of the video conference system according to the present invention is described below with reference to FIG. 1.

[0034] First, input voice is converted by the voice input unit 1 into a voice signal, which is an electric signal, and sent to the direction detection unit 2 and the audio codec 7. The direction detection unit 2 detects the position of the speaker from the voice signal as position information on the monitor screen, and supplies this position information to the camera control unit 8 and the human detection unit 3.

[0035] Based on the position information of the speaker, the camera control unit 8 causes the camera device 9, which is movable at least in the horizontal direction, to pan or zoom. It thereby performs position control or zoom control of the camera device 9, for example so that the sound source (that is, the position of the speaker in the video conference) is located at the center of the monitor screen, or so that it is enlarged by zooming.

[0036] Here, when the camera control unit 8 pans the camera device 9 toward the utterance direction (sound source direction) detected by the direction detection unit 2, or performs zoom control, it can also return the camera device 9 to its original position before the operation in either or both of the following cases: the human detection unit 3 could not detect a person in the sound source direction detected by the direction detection unit 2; or the speaker could not be identified, for example because the features of the uttered voice do not match the features of the voice (sound source) of any video conference participant (that is, a state in which a marker indicating the speaker should be newly displayed could not be detected).

[0037] Alternatively, as described above, the camera device 9 includes zooming means, and the camera control unit 8 can also perform control so that, at the user's instruction or automatically, the camera device 9 is directed toward the sound source direction detected by the direction detection unit 2 and zoomed. Furthermore, if the human detection unit 3 could not detect a person in the sound source direction detected by the direction detection unit 2, and/or if the speaker could not be identified (that is, a state in which a marker indicating the speaker should be newly displayed could not be detected), the camera device 9 can be returned to its original position before the operation and its zoom state can be restored to the state before the operation.
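
The save-and-restore behaviour of the camera control unit 8 in the two paragraphs above can be illustrated with a minimal Python sketch. The class and method names (`CameraPose`, `CameraController`, `aim_at_source`, `confirm_or_restore`) are hypothetical and do not come from the patent, and the sketch assumes the "and/or" condition is read as "restore if either check fails".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CameraPose:
    pan_deg: float  # horizontal angle of the camera (assumed unit: degrees)
    zoom: float     # zoom factor, 1.0 meaning no magnification


class CameraController:
    """Hypothetical stand-in for the camera control unit 8."""

    def __init__(self, pose: CameraPose) -> None:
        self.pose = pose
        self._saved: Optional[CameraPose] = None

    def aim_at_source(self, source_deg: float, zoom: float) -> None:
        # Remember the pose before the operation so it can be restored later.
        self._saved = self.pose
        self.pose = CameraPose(source_deg, zoom)

    def confirm_or_restore(self, person_found: bool, speaker_identified: bool) -> bool:
        # Keep the new pose only when both checks pass (assumed reading of "and/or").
        if person_found and speaker_identified:
            self._saved = None
            return True
        # Otherwise return to the position and zoom state before the operation.
        if self._saved is not None:
            self.pose = self._saved
            self._saved = None
        return False
```

For example, calling `aim_at_source(35.0, 2.5)` and then `confirm_or_restore(False, True)` puts the camera back in the pose it had before the turn.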

[0038] The camera device 9, serving as the image pickup device, converts the captured video information into an electric video signal. The converted video signal is digitized by the video decoder 10 and sent to the human detection unit 3 as video data. Based on the position information input from the direction detection unit 2 and the video data input from the video decoder 10, the human detection unit 3 detects whether captured image data of a person exists at the position in the video data indicated by the position information.

[0039] When it is detected that captured image data of a person exists at the position indicated by the position information, the superimpose generation circuit 4, which provides the marking means, reads the marker data predetermined as the marker (mark) indicating the speaker from the CGROM 6, the ROM that stores the various kinds of CG (Computer Graphics) information. It then creates composite video data in which the read marker data is superimposed on the video data at a marker display position (that is, a position around the image of the speaker) calculated from the position information on the video data and the captured image data of the person who is the speaker.
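
As a rough illustration of the superimposition step, the following Python sketch places a small marker bitmap just above a detected face box and composites it onto a frame. The placement rule (centred above the face, clamped to the frame), the frame representation (nested lists of pixel values), and the function names are all assumptions; the patent only states that the marker is shown around the image of the speaker.

```python
def marker_position(face_box, marker_size, frame_size):
    """Return the top-left corner for a marker centred above the face box.

    face_box is (x, y, width, height); the result is clamped so the marker
    stays inside the frame. This placement rule is an assumption.
    """
    x, y, w, h = face_box
    mw, mh = marker_size
    fw, fh = frame_size
    mx = x + (w - mw) // 2
    my = y - mh
    mx = max(0, min(mx, fw - mw))
    my = max(0, min(my, fh - mh))
    return mx, my


def superimpose(frame, marker, top_left):
    """Return a copy of frame with marker drawn at top_left.

    Marker pixels equal to None are treated as transparent.
    """
    mx, my = top_left
    out = [row[:] for row in frame]
    for dy, marker_row in enumerate(marker):
        for dx, pixel in enumerate(marker_row):
            if pixel is not None:
                out[my + dy][mx + dx] = pixel
    return out
```

The original frame is left untouched, which mirrors the description of the composite video data as a separate product of the video data and the marker data.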

[0040] The created composite video data is compression-encoded by the image codec 5 into encoded image data, while the voice signal from the voice input unit 1 is compression-encoded by the audio codec 7 into encoded voice data. The encoded image data and the encoded voice data are multiplexed by the multiplexing circuit unit 11 and sent to the partner terminal through the communication line 12.

[0041] FIG. 2 is a conceptual diagram showing an example of the image displayed on the monitor screen of the video conference system after the encoded image data has been decoded by the image codec of the receiving partner terminal. The encoded image data is the composite video data created by the superimpose generation circuit 4, the marking means (that is, video data captured by the camera device 9 and digitized by the video decoder 10, combined with the marker data indicating the speaker), further compression-encoded by the image codec 5. In FIG. 2, 101 denotes the speaker among the video conference participants, 102 denotes a participant who is not currently speaking, and 100 denotes the marker indicating the speaker (an arrow-shaped mark in FIG. 2).

[0042] The human detection unit 3 includes face extraction means 3a, which can detect a person contained in the video data by using color information to extract the color of a human face from the video data (the video information from the camera device 9 as digitized by the video decoder 10) and thereby identifying (extracting) a face region. The human detection unit 3 also includes face contour extraction means 3b, which can detect a person contained in the video data by extracting from that video data the contour of a person's face and contours with individual features, such as those of the eyes, nose, and mouth.

[0043] Furthermore, as described later, the human detection unit 3 also includes lip detection means 3c, which can detect from the video data (the video information from the camera device 9 as digitized by the video decoder 10) whether a person's lips are moving. A person may be regarded as speaking, and the marker indicating the speaker displayed around the image of that person on the monitor screen, when the lip detection means 3c detects that a participant's lips are moving, regardless of the result of the direction detection unit 2; or when the human detection unit 3 detects a person in the sound source direction obtained by the direction detection unit 2 and the lip detection means 3c detects that the lips of that person are moving. Alternatively, the marker indicating the speaker may be displayed on the monitor screen when the lip detection means 3c detects that the lips of a person located in a direction matching the sound source direction detected by the direction detection unit 2 are moving and, in addition, the direction detection unit 2 finds that the features of the uttered voice (sound source) match the features of the uttered voice (sound source) of a video conference participant.
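
The three alternative marking conditions described in this paragraph can be summarised as a small decision function. The sketch below is only an illustration in Python; the `Policy` enum and the function name `should_mark` are hypothetical, and each boolean input stands for the result of the corresponding detector.

```python
from enum import Enum


class Policy(Enum):
    LIPS_ONLY = 1                 # lip movement alone, regardless of direction detection
    DIRECTION_AND_LIPS = 2        # person in the sound source direction and lips moving
    DIRECTION_LIPS_AND_VOICE = 3  # as above, plus voice features match a participant


def should_mark(policy: Policy,
                person_in_source_direction: bool,
                lips_moving: bool,
                voice_matches_participant: bool) -> bool:
    """Decide whether the speaker marker should be displayed."""
    if policy is Policy.LIPS_ONLY:
        return lips_moving
    if policy is Policy.DIRECTION_AND_LIPS:
        return person_in_source_direction and lips_moving
    return (person_in_source_direction
            and lips_moving
            and voice_matches_participant)
```

The stricter policies trade responsiveness for fewer false markings, for instance when a participant moves their lips without speaking.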

[0044] The direction detection unit 2 and the human detection unit 3 shown in FIG. 1 can also be configured as shown in FIG. 3. FIG. 3 is a block diagram showing another configuration example of the direction detection unit and the human detection unit constituting the video conference system equipped with the speaker identification device according to the present invention. That is, as shown in FIG. 3, the direction detection control unit 2′ includes, in addition to the direction detection unit 2 shown in FIG. 1, a sound source feature extraction unit 21 that extracts features of the input voice (sound source); a first storage memory 22, which is first storage means for storing the extracted features of the input voice (sound source) as feature data; and a sound source feature comparison unit 23 that compares the sound source features, input via the direction detection unit 2 and extracted by the sound source feature extraction unit 21, against all the feature data stored in the first storage memory 22.

[0045] Meanwhile, the human detection control unit 3′ includes, in addition to the human detection unit 3 shown in FIG. 1, a second storage memory 32, which provides second storage means for storing the image position of a person when the human detection unit 3 detects the person in the video data from the video decoder 10, and a position calculation unit 31 that calculates the image position of the video data in the second storage memory 32 corresponding to the position in the utterance direction (sound source direction) indicated by the direction detection unit 2.

[0046] In the direction detection control unit 2′ and the human detection control unit 3′ shown in FIG. 3, before the video conference begins, feature data representing the features of each participant's uttered voice (sound source) and the image position of each participant are registered in advance in the first storage memory 22 and the second storage memory 32, respectively. For this registration, the speaker's uttered voice is converted into an electric voice signal by the voice input unit 1, input to the direction detection control unit 2′, and sent to the sound source feature extraction unit 21 via the direction detection unit 2, which performs the same function as in FIG. 1. The sound source feature extraction unit 21 extracts the features of the speaker's voice signal (sound source) and stores them in the first storage memory 22 as feature data.

[0047] Meanwhile, when the human detection unit 3, which performs the same person detection function as in FIG. 1, detects the image of a person at the image position in the video data of the video decoder 10 corresponding to the position in the uttered voice direction (sound source direction) input from the direction detection control unit 2′, it stores the image position of that person in the second storage memory 32.

[0048] After the voice (sound source) features and the image positions of the video conference participants have been set in advance in the first storage memory 22 and the second storage memory 32, the video conference begins. The sound source feature comparison unit 23 then compares the features of each new uttered voice (sound source) sent from the sound source feature extraction unit 21 against the feature data of all voices stored in the first storage memory 22. If a voice matching the features of the new uttered voice (sound source) is found among the stored feature data, the utterance is judged to be that of a participant in the video conference. The image position of that person is then read from the second storage memory 32 via the position calculation unit 31, and a marker indicating the speaker is displayed, via the superimpose generation circuit 4, around the image of the speaker on the monitor screen.

[0049] In this way, when an utterance is judged to be that of a video conference participant based on the features of the sound source (the uttered voice), the human detection unit 3 in FIG. 3, which performs the same function as the human detection unit 3 shown in FIG. 1, need not operate at all and can be left unactivated.

[0050] On the other hand, if the sound source feature comparison unit 23 finds that the features of the uttered voice (sound source) sent from the sound source feature extraction unit 21 match none of the voice feature data stored in the first storage memory 22, the human detection unit 3 in FIG. 3, which performs the same function as the human detection unit 3 shown in FIG. 1, is activated and performs person detection on the video data from the video decoder 10. That is, the human detection unit 3 determines whether the image of a person exists at the position in the uttered voice direction (sound source direction) obtained via the position calculation unit 31, and thereby determines whether the uttered voice (sound source) is an utterance by a participant in the video conference.

[0051] If the human detection unit 3 determines that no person exists at the position in the uttered voice direction (sound source direction), the uttered voice (sound source) is regarded as noise unrelated to the video conference, and the original state is kept without any processing. If, on the other hand, it determines that a person exists at that position, a speaker is regarded as having moved and spoken from a new position; the participant's image position information calculated via the position calculation unit 31 is newly registered in the second storage memory 32, and the feature data of the uttered voice (sound source) is re-registered in the first storage memory 22.
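
The matching and re-registration flow of the FIG. 3 configuration can be sketched as follows. This is a hedged Python illustration, not the patented implementation: the patent does not say how voice features are represented or compared, so the sketch assumes feature vectors compared by Euclidean distance against a threshold. The names `SpeakerRegistry`, `on_utterance`, and `person_at` are hypothetical, and storing an unmatched voice under a new key is one possible reading of the re-registration step.

```python
import math


class SpeakerRegistry:
    """Hypothetical model of first storage memory 22 and second storage memory 32."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold  # assumed maximum feature distance for a match
        self.voice_features = {}    # participant id -> voice feature vector
        self.image_positions = {}   # participant id -> (x, y) image position
        self._auto_count = 0

    def register(self, pid, feature, position):
        self.voice_features[pid] = tuple(feature)
        self.image_positions[pid] = tuple(position)

    def _match(self, feature):
        # Compare against all stored feature data and keep the closest match.
        best_id, best_dist = None, self.threshold
        for pid, stored in self.voice_features.items():
            dist = math.dist(feature, stored)
            if dist <= best_dist:
                best_id, best_dist = pid, dist
        return best_id

    def on_utterance(self, feature, source_position, person_at):
        """Return (action, image position) for a new utterance.

        person_at is a callable standing in for the human detection unit 3;
        it is only called when no stored voice matches.
        """
        pid = self._match(feature)
        if pid is not None:
            return ("mark", self.image_positions[pid])
        if not person_at(source_position):
            return ("ignore", None)  # treated as unrelated noise
        self._auto_count += 1
        self.register(f"auto-{self._auto_count}", feature, source_position)
        return ("registered", tuple(source_position))
```

Note that `person_at` is skipped entirely when the voice matches, which corresponds to the human detection unit 3 being left unactivated in that case.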

[0052] In the conventional art, when a new sound occurs, a marker (mark) indicating the speaker is displayed on the monitor screen, or the camera device is panned so that the speaker who is the source of the sound is captured within the monitor screen. However, when some noise other than a speaker's voice occurs, the marker (mark) indicating the speaker is displayed on the monitor screen in exactly the same way, or the camera device is panned toward the direction of the noise.

[0053] In the video conference system equipped with the speaker identification means according to the present invention, when the human detection unit 3 detects no person, no marker indicating the speaker is newly displayed, and in addition a return command is sent to the camera control unit 8 shown in FIG. 1 so that the position and zoom state of the camera device 9 return to their original state. That is, even if the camera device 9 has once been panned toward the sound source direction, the return command restoring the position and zoom state of the camera device 9 can be sent because, although not illustrated, the position information and zoom information from before the camera control unit 8 performed the panning control are saved in, for example, the second storage memory 32 shown in FIG. 3. Thus, even if the camera device 9 pans in response to a noise other than a participant's speech, the position of the marker (mark) indicating the speaker, or the position of the camera device 9, can be returned to its original position.

[0054] Further, the second storage memory 32 may additionally include face feature storage means that stores the face features of each individual person, relating to the contour of the face and the contours of the eyes, nose, and mouth, as extracted by the face contour extraction means 3b of the human detection unit 3. In that case, the human detection unit 3 further includes face feature comparison means that compares the face features stored in the face feature storage means with face features newly extracted from the video data by the face contour extraction means 3b. When the sound source feature comparison unit 23 detects that the features of a newly input sound source match the feature data (sound source features) stored in the first storage memory 22, the face feature comparison means compares the face features stored in the face feature storage means with the face features newly extracted by the face contour extraction means 3b of the human detection unit 3. If they match, the image position of the person stored in the second storage memory 32 is read out, and a marker indicating the speaker can be displayed around the image of the person at that image position, or the camera device 9 can be panned to the person's position or zoomed to enlarge the image.

[0055] If the face feature comparison means compares the face features stored in the face feature storage means with the face features newly extracted by the face contour extraction means 3b and detects that they do not match, the display of the marker indicating the speaker and the position and zoom state of the camera device 9 are left unchanged.

[0056] Further, the direction detection unit 2 shown in FIG. 1 can also be configured as shown in FIG. 4. FIG. 4 is a block diagram showing yet another configuration example of the direction detection unit constituting the video conference system equipped with the speaker identification device according to the present invention. That is, as shown in FIG. 4, the direction detection control unit 2″ includes, in addition to the direction detection unit 2 shown in FIG. 1, a voice level detection unit 24 that detects uttered voice (sound source) exceeding a predetermined voice level, and a timer unit 25 that provides time detection means for detecting that uttered voice exceeding the predetermined voice level has continued for at least a preset predetermined duration.

[0057] In FIG. 4, the input uttered voice is converted into a voice signal by the voice input unit 1 and input to the direction detection control unit 2″, where the voice level detection unit 24 measures its voice level. When the uttered voice is detected to be a voice signal at or above the predetermined voice level, as speech from a participant, the timer of the timer unit 25 is started and begins counting elapsed time. When the preset predetermined duration has elapsed, the input voice (sound source) has continued for at least the predetermined duration, and the timer unit 25 sends an output signal to the direction detection unit 2.

[0058] If the participant's speech ends before the preset predetermined duration has elapsed and the voice level detection unit 24 no longer detects uttered voice at or above the predetermined voice level, the timer unit 25 stops counting and no output signal is generated. Only while the output signal from the timer unit 25 is being input does the direction detection unit 2, which performs the same function as in FIG. 1, start the operation of determining whether the input uttered voice (sound source) is a voice signal based on the speech of a participant in the video conference.

[0059] For example, a short sound such as a sneeze is not speech, so it is wasteful to pan the camera device 9 shown in FIG. 1 toward the position of such a short sound, or to display a marker on the monitor screen at the position of that sound as if indicating a speaker. As shown in this embodiment of the present invention, for such a short sound the timer unit 25 produces no output signal. Consequently, the direction detection unit 2 does not start the operation of detecting position information indicating the position in the uttered voice direction (sound source direction), and neither control operations such as panning of the camera device 9 nor the person detection operation in the human detection unit 3 are started.
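
The level-and-duration gating performed by the voice level detection unit 24 and the timer unit 25 can be sketched in Python as below. The class name `SpeechGate`, the normalised level scale, and the use of explicit timestamps in seconds are assumptions made for illustration; the sketch treats a level at or above the threshold as qualifying.

```python
class SpeechGate:
    """Hypothetical model of voice level detection unit 24 plus timer unit 25."""

    def __init__(self, level_threshold: float, min_duration: float) -> None:
        self.level_threshold = level_threshold  # predetermined voice level
        self.min_duration = min_duration        # predetermined duration in seconds
        self._started_at = None                 # time the current loud run began

    def update(self, now: float, level: float) -> bool:
        """Feed one level measurement; return True while the gate is open.

        The gate opens only after the level has stayed at or above the
        threshold for at least min_duration, so short sounds never open it.
        """
        if level < self.level_threshold:
            self._started_at = None  # the timer stops and is reset
            return False
        if self._started_at is None:
            self._started_at = now   # the timer starts counting
        return now - self._started_at >= self.min_duration
```

With `SpeechGate(0.2, 0.5)`, a sneeze lasting 0.3 seconds never returns `True`, so neither direction detection nor person detection would be triggered by it.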

[0060] In FIG. 4, it has been explained that, for such a short sound, the human detection unit 3 is not activated because the direction detection unit 2 is not activated. The configuration is not limited to this case: an output signal may also be supplied directly from the timer unit 25 to the human detection unit 3, so that for such a short sound the human detection unit 3 is kept from being activated directly, without going through the direction detection unit 2.

[0061] As explained above, in FIG. 4 the condition for starting the timer unit 25 is that there is uttered voice at or above the predetermined voice level. However, the condition may instead be that some identifiable voice is detected to continue for at least the predetermined duration, regardless of whether it is at or above the predetermined voice level.

[0062] Further, as described above, the human detection unit 3 shown in FIG. 3 not only detects whether a person exists at the position on the video data indicated by the uttered voice direction (sound source direction) from the direction detection unit 2; it can also detect the image positions of all participants before the video conference and register them in the second storage memory 32 shown in FIG. 3. Thus, for example, as shown in FIG. 5, the markers for speakers and non-speakers among the participants can be displayed on the monitor screen in different colors or different shapes.

[0063] FIG. 5 is a conceptual diagram showing an example of the monitor screen display when markers for the speaker and for non-speakers among the video conference participants are displayed. FIG. 5 shows an example in which the speaker marker 103, indicating that the speaker 105 is speaking, is displayed in, for example, a different color from the non-speaker markers 104 indicating the non-speakers 106.

[0064] That is, when the speaker 105 among the video conference participants speaks, the direction detection unit 2 detects position information indicating the sound source direction of the speaker 105. This position information is compared with the image positions stored in advance in the storage memory 32 to confirm the presence of a person in the sound source direction, whereby the speaker 105 is identified and displayed with the speaker marker 103 in a different color from the non-speaker markers 104 indicating the non-speakers 106 who are not speaking.
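
The comparison between the detected sound source position and the pre-registered image positions can be illustrated with a short Python sketch. The function name `assign_markers`, the use of the horizontal coordinate only, and the pixel tolerance are assumptions; the patent does not specify how closeness is judged.

```python
def assign_markers(stored_positions, source_x, tolerance):
    """Label every registered participant as 'speaker' or 'non-speaker'.

    stored_positions maps a participant id to an (x, y) image position.
    The participant whose x position is closest to source_x, and within
    tolerance pixels of it, is taken to be the speaker (an assumed rule).
    """
    nearest = None  # (participant id, horizontal offset)
    for pid, (x, _y) in stored_positions.items():
        offset = abs(x - source_x)
        if offset <= tolerance and (nearest is None or offset < nearest[1]):
            nearest = (pid, offset)
    speaker = nearest[0] if nearest else None
    return {
        pid: ("speaker" if pid == speaker else "non-speaker")
        for pid in stored_positions
    }
```

Each label could then select a marker color or shape, so that every participant on screen carries either the speaker marker or a non-speaker marker.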

[0065] Thus, even when all participants in the video conference are displayed on the monitor screen, the speaker marker 103 and the non-speaker markers 104, in different colors, are displayed for every participant at positions around each person's image, so that whether each participant is the speaker can be easily identified. The markers are not limited to distinguishing speakers from non-speakers. For example, they may be differentiated by color and/or shape and/or pattern so that the moderator, general participants, and observers of the video conference can be easily distinguished; any identification information useful for conducting the video conference can be provided by markers superimposed on the monitor screen.

[0066] The human detection control unit 3' constituting the human detection unit 3 need not perform person detection only when voice is input; it may instead have the human detection unit 3 detect persons constantly, at regular intervals. In that case, for each image position in the video data from the video decoder 10, the time at which a person is first detected is read from a clock circuit (not shown) and stored, as that person's time of joining the video conference, in the detection time storage section of the second storage memory 32 shown in FIG. 3. The superimpose generation circuit 4 then superimposes the detection time read from the detection time storage section on the screen, so that the monitor screen can show, as the person detection result, the time at which the person joined the video conference.

[0067] Alternatively, the human detection control unit 3' constituting the human detection unit 3 may include one or more second storage memories 32 capable of storing the image positions of one or more persons detected in the video data from the video decoder 10, that is, in the input image, together with position detection means. At a preset fixed period, all persons in the input image are detected and the image position of each detection result is obtained; the position detection means then compares each detected image position with the image positions of the one or more persons stored in the second storage memory 32. If any detected image position matches none of the image positions stored in the second storage memory 32, the person at that image position is regarded as having newly joined the video conference. The time at which the human detection unit 3 newly detected that person is stored as the participation time in the participation time storage section of the second storage memory 32 shown in FIG. 3, and the stored participation time may be displayed on the monitor screen.
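
As a rough illustration of the periodic comparison described above, the following sketch registers a participation time for every detected position that matches no stored position. The one-dimensional positions, the `tolerance` used to decide a "match", and the name `register_new_participants` are assumptions made for this example; the text does not define how two image positions are judged to coincide.

```python
def register_new_participants(stored, detected, now, tolerance=30.0):
    """Record a participation time for each newly detected position.

    stored   -- dict mapping a stored image position to its participation time
    detected -- image positions found in the current detection cycle
    now      -- current time as read from the clock (any printable value)
    Returns the positions regarded as newly joined in this cycle.
    """
    newly_joined = []
    for pos in detected:
        # A detected position is "new" if it matches no stored position.
        if not any(abs(pos - known) <= tolerance for known in stored):
            stored[pos] = now
            newly_joined.append(pos)
    return newly_joined


participation_times = {}
first_cycle = register_new_participants(participation_times, [100.0, 400.0], "9:00")
second_cycle = register_new_participants(participation_times, [102.0, 398.0, 650.0], "9:30")
```

In this usage the two people present at 9:00 keep their original participation time, and only the position near 650 is treated as a new participant at 9:30.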

[0068] For example, FIG. 6 is a conceptual diagram showing an example of a monitor screen display indicating the participation status of a video conference. In FIG. 6(A), 109 and 110 denote participants who joined the video conference at 9:00. The time at which each person was detected, that is, the participation time, is superimposed on the video data on the monitor screen, and participation time displays 107 and 108 are shown around the person image positions of these participants.

[0069] FIG. 6(B), for example, shows the participation status of the video conference some time after FIG. 6(A). As the participation time display 111 newly added to the monitor screen shows, participant 112 joined the video conference at 9:30. Because the participation time is stored in the detection time storage section or the participation time storage section of the second storage memory 32 shown in FIG. 3, the monitor screen can continue to display the person's participation time until the video conference ends.

[0070] The human detection control unit 3' may also include one or more second storage memories 32 capable of storing the image positions of one or more persons detected in the video data from the video decoder 10, that is, in the input image, together with position detection means. At a preset fixed period, all persons in the input image are detected and the image position of each detection result is obtained; the position detection means then compares each detected image position with the image positions of the one or more persons stored in the second storage memory 32. If any image position stored in the second storage memory 32 matches none of the detected image positions, the person at that stored image position is regarded as having left. The time at which the human detection unit 3 stopped detecting that person is stored as the leaving time in the leaving time storage section of the second storage memory 32 shown in FIG. 3, and the stored leaving time can be displayed on the monitor screen.

[0071] That is, when the human detection unit 3 no longer detects a person at an image position, that is, in an image region, where a person had been detected until then, the time at which the person ceased to be detected is read from a clock circuit (not shown) and superimposed on the monitor screen by the superimpose generation circuit 4. The monitor screen can thus show that time, as the time the participant left, around the image position of the person who had represented that participant. Because the leaving time is stored in the leaving time storage section of the second storage memory 32 shown in FIG. 3, the monitor screen can continue to display the person's leaving time until the video conference ends.
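
The leaving-time logic is the mirror image of the participation check: a stored position that matches none of the currently detected positions is treated as a departure. The sketch below assumes one-dimensional positions, a hypothetical `tolerance`, and the illustrative name `record_departures`; none of these are specified in the text.

```python
def record_departures(stored_positions, detected, leave_times, now, tolerance=30.0):
    """Store a leaving time for each stored position that is no longer detected.

    stored_positions -- image positions of people seen earlier in the conference
    detected         -- image positions found in the current detection cycle
    leave_times      -- dict mapping a stored position to its leaving time
    now              -- current time as read from the clock
    """
    for known in stored_positions:
        if known in leave_times:
            continue  # keep the first recorded leaving time on display
        if not any(abs(known - pos) <= tolerance for pos in detected):
            leave_times[known] = now
    return leave_times


positions = [100.0, 400.0, 650.0]
leave_times = {}
record_departures(positions, [101.0, 652.0], leave_times, "10:15")
record_departures(positions, [101.0], leave_times, "10:40")
```

Keeping the first recorded time in `leave_times` corresponds to the statement that the leaving time remains displayed until the conference ends.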

[0072] Furthermore, when the human detection unit 3 detects the person who spoke, based on position information from the direction of the speaker's voice (sound source direction) and on video data based on video information from the camera device 9, the time may be read from a clock circuit (not shown) and superimposed on the display by the superimpose generation circuit 4. That is, the direction detection control unit 2' stores the time at which it detected the sound source of the input voice, as the sound source detection time, in the sound source detection storage section of the first storage memory 22 shown in FIG. 3, and the stored sound source detection time can be displayed on the monitor screen around each participant's person image position. The time at which each participant spoke can thus be displayed on the monitor screen.

[0073] In addition to the person detection function based on the color of a person's face, the contour of the face, and the like, the human detection unit 3 may, as described above, further be provided with lip detection means 3c for detecting that a person's lips are moving.

[0074] In that case, as described above, when a voice is input to the voice input unit 1, the person detection function of the human detection unit 3 first detects the person located in the voice direction (sound source direction). The lip detection means 3c then detects whether the detected person's lips are moving. If lip movement is detected, the input voice is judged to be an utterance by the person in that voice direction (sound source direction), and the camera device 9 is panned and zoomed and the marker indicating the speaker is displayed. Conversely, if the lip detection means 3c detects that the detected person's lips are not moving, the input voice is judged not to be an utterance by the person in that voice direction (sound source direction), and neither the pan or zoom operation of the camera device 9 nor the display of the marker indicating the speaker is performed.
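
The decision flow above can be sketched as a small gate. How lip movement is actually measured is not stated in the text; the sketch assumes, purely for illustration, a per-frame "mouth opening" measurement and a variation `threshold`, and the function names are hypothetical.

```python
def lips_moving(mouth_openings, threshold=2.0):
    """Assumed criterion: the mouth opening varies by more than `threshold`
    across the observed frames."""
    return (len(mouth_openings) >= 2
            and max(mouth_openings) - min(mouth_openings) > threshold)


def handle_voice_input(person_found, mouth_openings):
    """Decide whether to aim the camera and show the speaker marker."""
    if person_found and lips_moving(mouth_openings):
        return {"aim_camera": True, "show_speaker_marker": True}
    # No person in the sound source direction, or the lips are still:
    # the sound is not treated as that person's utterance.
    return {"aim_camera": False, "show_speaker_marker": False}


talking = handle_voice_input(True, [3.0, 7.5, 4.0])
silent = handle_voice_input(True, [3.0, 3.2, 3.1])
```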

[0075] Furthermore, as shown in FIG. 3, consider the case where the direction detection control unit 2' includes a first storage memory 22 that stores the sound source features extracted by the sound source feature extraction unit 21 and a sound source feature comparison unit 23 that compares the features of a newly input sound source with the features stored in the first storage memory 22, where the human detection control unit 3' includes a second storage memory 32 that stores the image position of the detected person, and where the human detection control unit 3' further includes lip detection means 3c for detecting that the detected person's lips are moving. In that case, only when the sound source feature comparison unit 23 detects that the features of the newly input sound source match the features stored in the first storage memory 22, and the lip detection means 3c also detects that the lips of the person in the direction in which the match was detected are moving, is the image position of the person stored in the second storage memory 32 read out. The marker indicating the speaker may then be displayed around the image of the person at that image position, and the camera device 9 may be directed toward the speaker or zoomed.

[0076] In that case, if the lip detection means 3c does not detect that the lips of the person in the direction in which the sound source features match are moving, the utterance may be regarded as a very brief one, and the position at which the marker indicating the speaker is displayed and the position of the camera device 9 may be left as they were.
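
A minimal sketch of the combined condition in the two preceding paragraphs is given below. The text does not say what the sound source features are or how they are compared, so the feature vectors, the cosine-similarity comparison, the `threshold`, and the names `cosine_similarity` and `next_target` are all assumptions chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def next_target(current_target, stored_features, new_feature,
                lips_moving_by_person, threshold=0.9):
    """Return the person to mark as speaker after a new sound is heard.

    The target changes only when the new sound matches a stored feature
    AND that person's lips are moving; otherwise the original state is kept.
    """
    best_person, best_score = None, 0.0
    for person, feature in stored_features.items():
        score = cosine_similarity(feature, new_feature)
        if score > best_score:
            best_person, best_score = person, score
    if (best_person is not None and best_score >= threshold
            and lips_moving_by_person.get(best_person, False)):
        return best_person
    return current_target


stored = {"A": [1.0, 0.0, 0.2], "B": [0.1, 1.0, 0.0]}
switched = next_target("B", stored, [0.9, 0.05, 0.2], {"A": True})
kept = next_target("B", stored, [0.9, 0.05, 0.2], {"A": False})
```

Returning `current_target` unchanged models the "left as they were" behavior for both the marker position and the camera position.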

[0077]

[Effects of the Invention] As described above, with the speaker identification device of the video conference system according to the present invention, and the video conference system equipped with it, sounds other than participants' speech, such as a sneeze, an object being dropped, or something striking the microphone, can be prevented from mistakenly causing the camera device to turn toward the source of that sound or causing the marker indicating the speaker to be displayed. The participant who is speaking can therefore always be shown accurately on the monitor screen, providing a video conference system in which video conferences can be held comfortably.
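
One way such short non-speech sounds could be rejected is the voice level and duration check associated with the voice level detection unit 24 and the timer unit 25 and recited in the claims. The sketch below is an assumption-laden illustration: the per-frame level values, `level_threshold`, `min_frames`, and the name `should_start_detection` are hypothetical and not taken from the source.

```python
def should_start_detection(levels, level_threshold, min_frames):
    """Return True once the input level has stayed at or above
    `level_threshold` for at least `min_frames` consecutive frames."""
    run = 0
    for level in levels:
        # Count consecutive loud frames; any quiet frame resets the count.
        run = run + 1 if level >= level_threshold else 0
        if run >= min_frames:
            return True
    return False


sneeze = [0.1, 0.9, 0.8, 0.1, 0.1]   # loud but short
speech = [0.1, 0.7, 0.8, 0.6, 0.7]   # loud and sustained
sneeze_triggers = should_start_detection(sneeze, level_threshold=0.5, min_frames=3)
speech_triggers = should_start_detection(speech, level_threshold=0.5, min_frames=3)
```

With these example values the brief sound never reaches three consecutive loud frames, so direction detection and person detection would not be started for it.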

[0078] Furthermore, even when all participants in the video conference are shown on the monitor screen, markers that distinguish speakers from non-speakers, or markers that make it visually easy to tell apart the moderator, general participants, observers, and so on, can be superimposed on the display, providing a video conference system in which video conferences can proceed smoothly.

[0079] Furthermore, a video conference system can be provided that displays the time course of the conference, such as the time each participant joined, the time each participant left, and the time each participant spoke.

[Brief description of drawings]

[FIG. 1] A block diagram showing an example of the configuration of a video conference system equipped with the speaker identification device according to the present invention.

[FIG. 2] A conceptual diagram showing an example of an image displayed on the monitor screen in a video conference system equipped with the speaker identification device according to the present invention.

[FIG. 3] A block diagram showing another configuration example of the direction detection unit and the human detection unit constituting a video conference system equipped with the speaker identification device according to the present invention.

[FIG. 4] A block diagram showing yet another configuration example of the direction detection unit constituting a video conference system equipped with the speaker identification device according to the present invention.

[FIG. 5] A conceptual diagram showing an example of the monitor screen display when markers for the speaker and for non-speakers among the video conference participants are displayed.

[FIG. 6] A conceptual diagram showing an example of a monitor screen display indicating the participation status of a video conference.

[Explanation of symbols]

1 ... voice input unit, 2 ... direction detection unit, 2', 2'' ... direction detection control unit, 3 ... human detection unit, 3' ... human detection control unit, 3a ... face extraction means, 3b ... face contour extraction means, 3c ... lip detection means, 4 ... superimpose generation circuit, 5 ... image codec, 6 ... CGROM, 7 ... voice codec, 8 ... camera control unit, 9 ... camera device, 10 ... video decoder, 11 ... multiplexing circuit unit, 12 ... communication line, 13 ... storage memory, 21 ... sound source feature extraction unit, 22 ... first storage memory, 23 ... sound source feature comparison unit, 24 ... voice level detection unit, 25 ... timer unit, 31 ... position calculation unit, 32 ... second storage memory, 100 ... marker, 101 ... speaker, 102 ... participant, 103 ... speaker marker, 104 ... non-speaker marker, 105 ... speaker, 106 ... non-speaker, 107, 108 ... participation time display, 109, 110 ... participant, 111 ... participation time display, 112 ... participant.

Continuation of front page

F-terms (reference)
5B057 AA20 BA02 BA17 CA01 CA08 CA16 CB01 CB08 CB12 CB16 CE08 CH18 DA07 DA08 DA16 DB02 DB06 DB09 DC16 DC25 DC36
5C064 AA02 AC04 AC06 AC08 AC09 AC12 AC13 AC15 AC16 AC18 AC22 AD02
5L096 AA02 AA06 BA20 CA02 DA02 FA06 FA15 GA38 HA03

Claims

1. A speaker identification device of a video conference system that inputs and transmits images and voice, for identifying a speaker, comprising: direction detecting means for detecting a sound source direction from an input voice; human detecting means for detecting a person from an input image; and marking means for displaying a predetermined marker around the image of the person detected by the human detecting means on the displayed monitor screen, wherein, when a person is detected by the human detecting means in the sound source direction obtained by the direction detecting means, the marking means displays the marker indicating the speaker, treating that person as a speaker in the video conference.

2. The speaker of the video conference system according to claim 1, wherein the human detection means further comprises face extraction means for extracting the color of the face of the person to detect the person. Identification device.

3. The speaker identification device of the video conference system according to claim 1 or 2, wherein the human detecting means further comprises face contour extraction means for extracting the contour of a person's face and the contours of the eyes, nose, or mouth to detect the person.

4. The speaker of the video conference system according to claim 1, wherein the human detection means further comprises lip detection means for detecting movement of a person's lips. Identification device.

5. The marker for indicating the speaker is displayed by the marking means when the lips detecting means detects that the lips of the person are moving. The speaker identification device of the video conference system described in 1.

6. A case where a person is detected by the human detecting means in the sound source direction obtained by the direction detecting means, and the lips of the person are detected to be moving by the lip detecting means. The speaker identification device of the video conference system according to claim 5, wherein the marker indicating the speaker is displayed in a marking manner by the marking means when the speaker identification is performed.

7. The speaker identification device of the video conference system according to claim 1, wherein the direction detecting means further comprises first storage means for storing features of a sound source and sound source feature comparison means for comparing the features of a newly input sound source with the features of the sound source stored in the first storage means; the human detecting means further comprises second storage means for storing the image position of the detected person; and, when the sound source feature comparison means detects that the features of the newly input sound source match the features of the sound source stored in the first storage means, the image position of the person stored in the second storage means is read out and the marker indicating the speaker is displayed by the marking means around the image of that person on the monitor screen.

8. The speaker identification device of the video conference system according to claim 7, wherein, when the sound source feature comparison means reads out, from the features of the sound sources stored in the first storage means, the features of the sound source in the direction of the newly input sound source, compares them with the features of the newly input sound source, and detects that they do not match, and a person is detected by the human detecting means in the sound source direction obtained by the direction detecting means, the features of the newly input sound source are stored anew in the first storage memory and the image position of the person detected by the human detecting means is stored in the second storage memory.

9. The sound source feature comparison means reads out the features of the sound source in the direction of the newly input sound source from the features of the sound source stored in the first storage means, and the sound source feature comparison means. When it is detected that they do not match by comparing the characteristics of the newly input sound source, and the person detecting unit does not detect a person in the sound source direction obtained by the direction detecting unit. The speaker identification device of the video conference system according to claim 7 or 8, wherein the marking display of the marker indicating the speaker by the marking means is kept in the original state.

10. The speaker identification device of the video conference system according to any one of claims 7 to 9, wherein the second storage means further comprises face feature storage means for storing face features, relating to the contour of the person's face and/or the contours of the eyes, nose, or mouth, extracted from the input image by the face contour extraction means of the human detecting means; the human detecting means comprises face feature comparison means for comparing the face features stored in the face feature storage means with face features newly extracted from the input image by the face contour extraction means; and, when the sound source feature comparison means detects that the features of the newly input sound source match the features of the sound source stored in the first storage means, and the face feature comparison means, comparing the face features stored in the face feature storage means with the face features newly extracted by the face contour extraction means, also detects a match, the image position of the person stored in the second storage means is read out and the marker indicating the speaker is displayed by the marking means around the image of that person on the monitor screen.

11. The face feature comparing means compares the face features stored in the face feature storing means with the face features newly extracted by the face contour extracting means of the human detecting means, and matches them. 11. The speaker identification device of the video conference system according to claim 10, wherein the marking display of the marker indicating the speaker by the marking means is kept in the original state when it is detected that the speaker is not present.

12. The speaker identification device of the video conference system according to any one of claims 1 to 11, wherein the direction detecting means further comprises first storage means for storing features of a sound source and sound source feature comparison means for comparing the features of a newly input sound source with the features of the sound source stored in the first storage means; the human detecting means further comprises second storage means for storing the image position of the detected person and lip detecting means for detecting that the detected person's lips are moving; and, when the sound source feature comparison means detects that the features of the newly input sound source match the features of the sound source stored in the first storage means, and the lip detecting means detects that the lips of the person in the direction in which the sound source features match are moving, the image position of the person stored in the second storage means is read out and the marker indicating the speaker is displayed by the marking means around the image of that person on the monitor screen.

13. The speaker identification device of the video conference system according to claim 12, wherein, when the lip detecting means does not detect that the lips of the person in the direction in which the sound source features match are moving, the marking display of the marker indicating the speaker by the marking means is left in its original state.

14. The speaker identification device of the video conference system according to any one of claims 1 to 13, wherein the direction detecting means further comprises time detecting means for detecting the duration for which the input voice is continuously input, and, when the duration of the input voice detected by the time detecting means is determined to be equal to or longer than a preset duration, the human detecting means starts the operation of detecting a person from the input image.

15. The speaker identification device of the video conference system according to any one of claims 1 to 14, wherein the direction detecting means further comprises time detecting means for detecting the duration for which the input voice is continuously input, and, when the duration of the input sound source detected by the time detecting means is determined to be equal to or longer than a preset duration, the direction detecting means starts the operation of detecting the sound source direction of the input voice.

16. The direction detecting means further comprises a voice level detecting means for detecting a voice level of the input voice, and the voice level of the input voice detected by the voice level detecting means is a predetermined voice level determined in advance. The speaker identification device of the video conference system according to claim 14 or 15, wherein the detection operation of the duration by the time detection means is started when the above is detected.

17. The human detecting means detects a person in the input image, and the marking means, for each person detected by the human detecting means, the periphery of the image of the person displayed on the monitor screen. The speaker identification device of the video conference system according to any one of claims 1 to 16, wherein markers are displayed in different colors and / or shapes.

18. The marking means displays, for each person detected by the human detecting means, a marking display of the marker which is color-coded and / or shape-coded around an image of the person displayed on a monitor screen. When performing, around the image of the person at the image position in the sound source direction obtained by the direction detecting means, a marking display of a speaker marker of a color and / or shape indicating the speaker, and the direction detecting means The marking display of a non-speaker marker of a color and / or a shape indicating a non-speaker is performed around the image of a person at an image position other than the sound source direction obtained by the above. Speaker identification device of the described video conference system.

19. The human detection means further comprises a detection time storage means for storing the detected detection time when detecting the person in the input image, and the detection time stored in the detection time storage means. The speaker identification device of the video conference system according to any one of claims 1 to 18, wherein the time is displayed on the monitor screen as the detected participation time of the person.

20. The speaker identification device of the video conference system according to any one of claims 1 to 19, wherein the human detecting means comprises one or more second storage means capable of storing the image positions of one or more persons detected from the input image, and position comparison means which, when all persons in the input image are detected at every preset fixed period and the image position of each detection result is obtained, compares each detected image position with the image positions of the one or more persons stored in the second storage means; the device further comprises leaving time storage means which, when the position comparison means finds that any image position stored in the second storage means matches none of the detected image positions, regards the person at that stored image position as having left and stores the time at which the person was no longer detected as the leaving time, and participation time storage means which, conversely, when the position comparison means finds that any detected image position matches none of the image positions stored in the second storage means, regards the person at that detected image position as having newly joined and stores the time at which the person was detected as the participation time; and the leaving time stored in the leaving time storage means and the participation time stored in the participation time storage means are displayed on the monitor screen.

21. The speaker identification device for a video conference system according to claim 1, wherein the direction detection means further comprises a sound source detection storage means for storing, as a sound source detection time, the time at which a sound source of the input voice is detected, and the sound source detection time stored in the sound source detection storage means is displayed on the monitor screen.

22. The speaker identification device of the video conference system according to claim 1, further comprising a camera device for image input that is movable in at least a horizontal direction, wherein the camera device is controlled so as to be directed toward the sound source direction detected by the direction detecting means.

23. The speaker identification device of the video conference system, wherein, when the camera device is controlled so as to be directed toward the sound source direction detected by the direction detecting means, if no person is detected by the human detecting means in the detected sound source direction and/or a state in which the marker indicating the speaker is to be newly displayed is not detected, the camera device is returned to its position before the operation.

24. The speaker identification device for a video conference system according to claim 1, wherein the camera device for image input, which is movable in at least a horizontal direction, further comprises zoom means capable of zooming, and the camera device is controlled so as to be directed toward the sound source direction detected by the direction detection means and to zoom by means of the zoom means.

25. The speaker identification device of the video conference system according to claim 24, wherein, when the camera device is controlled so as to be directed toward the sound source direction detected by the direction detecting means and to zoom by means of the zoom means, if no person is detected by the human detecting means in the detected sound source direction and/or a state in which the marker indicating the speaker is to be newly displayed is not detected, the camera device is returned to its position before the operation.
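
The aim, zoom, and return behaviour recited in claims 22 to 25 can be sketched as follows. This is an illustrative sketch only: `CameraState`, `aim_at_speaker`, the callback parameters, and the `zoom_in` factor are hypothetical names and values, and the "and/or" condition of the claims is read here as "either check failing triggers the return". Restoring the zoom factor together with the pan angle is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass
class CameraState:
    pan_deg: float      # horizontal direction of the camera device
    zoom: float = 1.0   # zoom factor (1.0 means no zoom)


def aim_at_speaker(camera, source_dir_deg, person_detected, marker_pending,
                   zoom_in=2.0):
    """Turn toward the detected sound source; undo the move if nobody is there.

    person_detected(camera) and marker_pending(camera) are hypothetical
    callbacks standing in for the human detecting means and for the check that
    a speaker marker is to be newly displayed.
    """
    # Remember the state before the operation so that it can be restored.
    previous = CameraState(camera.pan_deg, camera.zoom)

    # Direct the camera toward the sound source direction and zoom in.
    camera.pan_deg = source_dir_deg
    camera.zoom = zoom_in

    # If either check fails, return the camera to its previous state.
    if not person_detected(camera) or not marker_pending(camera):
        camera.pan_deg, camera.zoom = previous.pan_deg, previous.zoom
        return False
    return True


cam = CameraState(pan_deg=0.0)
moved = aim_at_speaker(cam, 35.0, lambda c: False, lambda c: True)
```

Here no person is found in the sound source direction, so `moved` is `False` and `cam` ends up back at its original pan angle and zoom.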

26. A video conference system comprising the speaker identification device of the video conference system according to claim 1.