JP4212274B2 - Speaker identification device and video conference system including the speaker identification device

Info

Publication number: JP4212274B2
Application number: JP2001387569A
Authority: JP
Inventors: 孝志今井; 一也岩崎
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2001-12-20
Filing date: 2001-12-20
Publication date: 2009-01-21
Anticipated expiration: 2021-12-20
Also published as: JP2003189273A

Description

【０００１】
【発明の属する技術分野】
本発明は、テレビ会議システムの発言者識別装置及び該発言者識別装置を備えたテレビ会議システムに関する。特に、複数の参加者によりテレビ会議を行なっている際に、発言者を識別可能なマーカ（目印）を付与してモニタ画面にマーキング表示する機能を有するテレビ会議システムの発言者識別装置及び該発言者識別装置を備えたテレビ会議システムに関する。
【０００２】
【従来の技術】
テレビ会議を行なう際に、発言者を識別し、マーカ（目印）のマーキング表示を行なうテレビ会議システムの従来技術としては、例えば、特開平８−３７６５５号公報「話者識別表示機能を有するテレビ会議システム」において開示されているように、音声の入力方向のみを識別して、発言者が存在している方向を決定し、マーキング表示を行なうものがある。
即ち、特開平８−３７６５５号公報に開示されている技術においては、複数の参加者によりテレビ会議を行なっている際に、音の発生方向を検出する音声方向検出器即ち方向検出手段からの音声方向のデータに基づいて、モニタ画面上の位置（座標）を求めて、該テレビ会議において発言している発言者（話者）の位置にカメラ装置が自動的に移動し、更に、モニタ画面に、該発言者が写し出された際には、該発言者が識別できるようなマーカ（目印）を付与するように構成されているものであり、テレビ会議で発言している参加者を容易に識別することが可能であるとしている。
【０００３】
【発明が解決しようとする課題】
しかしながら、かかる従来の発言者識別表示機能を有するテレビ会議システムにあっては、テレビ会議の参加者の発言そのものを識別することを可能としているものではなく、何らかの音即ち物音が発生している場合に、該物音の発生方向のみに基づいて、マーカ（目印）を付与してしまう構成となっている。したがって、テレビ会議の参加者の発言以外の音声や物音が発生した場合であっても、発言者識別のマーカ（目印）表示がなされてしまうという問題がある。例えば、くしゃみやペンを落とした音、あるいは、マイクに物をぶつけた音など、参加者の発言以外の物音によっても、マーカ（目印）のマーキング表示が行なわれてしまうため、モニタ画面上に、テレビ会議にて真に必要とする発言者を示す位置とは異なる位置にマーキング表示が行なわれたり、あるいは、あらぬ方向にカメラ装置の向きが移動してしまったりして、テレビ会議参加者にとっては、モニタ画面が非常に見難くなったり、紛らわしく感じられる場合が生じてしまう。
【０００４】
本発明は、かかる課題を解決するためになされたものであり、画像及び音声を入力し送受信するテレビ会議システムにおいて、例えば、入力音声から音源方向を検出する方向検出手段と、入力画像から人物を検出する人間検出手段とを具備し、更に、前記人間検出手段にて検出された人物の画像周辺に前以って定められている所定のマーカ（目印）を表示するマーキング手段を具備し、前記方向検出手段によって得られた音源方向において、前記人間検出手段により人物が検出された場合に、始めて、テレビ会議における発言者による発声音声であると判定して、前記マーキング手段により、表示されているモニタ画面上の該人物の画像周辺に発言者を示す所定のマーカをマーキング表示させることを可能とするテレビ会議システムの発言者識別装置及び該発言者識別装置を備えたテレビ会議システムを提供せんとするものである。
【０００５】
【課題を解決するための手段】
本発明に係るテレビ会議システムの発言者識別装置及び該発言者識別装置を備えたテレビ会議システムは、以下の具体的な技術手段により構成されている。
【０００６】
第１の技術手段は、画像及び音声を入力し送受信するテレビ会議システムにおける発言者を識別するテレビ会議システムの発言者識別装置において、入力音声から音源方向を検出し当該音源の特徴を記憶する第１記憶手段と、新たに入力された音源の特徴と前記第１記憶手段に記憶された前記音源の特徴とを比較する音源特徴比較手段を備えた音源の方向検出手段と、入力画像内から人物を検出する人間検出手段と、前記方向検出手段によって得られた音源方向に前記人間検出手段にて人物を検出した場合、検出した人物の画像位置を記憶する第２記憶手段と、表示されているモニタ画面上の発言者の画像周辺に前以って定められているマーカを表示するマーキング手段とを具備し、前記音源特徴比較手段によって新たに入力された音源の特徴と前記第１記憶手段に記憶された前記音源の特徴とが一致していることが検出された場合、前記第２記憶手段に記憶された前記人物の前記画像位置を読み出して、テレビ会議における発言者として前記マーキング手段により発言者を示すマーカをマーキング表示させることを特徴とするものである。
【００１３】
第２の技術手段は、前記第１の技術手段において、前記音源特徴比較手段が、前記第１記憶手段に記憶された前記音源の特徴のうち、前記新たに入力された音源の方向にある音源の特徴を読み出し、前記音源特徴比較手段によって新たに入力された音源の特徴とを比較して一致していないことが検出された場合において、前記方向検出手段によって得られた音源方向に前記人間検出手段にて人物を検出した場合には、前記新たに入力された音源の特徴を前記第１記憶手段に記憶し直すと共に、前記人間検出手段にて検出された該人物の画像位置を前記第２記憶手段に記憶させるテレビ会議システムの発言者識別装置とすることを特徴とするものである。
【００１４】
第３の技術手段は、前記第１又は第２の技術手段において、前記音源特徴比較手段が、前記第１記憶手段に記憶された前記音源の特徴のうち、前記新たに入力された音源の方向にある音源の特徴を読み出し、前記音源特徴比較手段によって新たに入力された音源の特徴とを比較して一致していないことが検出された場合において、前記方向検出手段によって得られた音源方向に前記人間検出手段にて人物を検出しなかった場合には、前記マーキング手段による発言者を示す前記マーカのマーキング表示を元の状態のままとするテレビ会議システムの発言者識別装置とすることを特徴とするものである。
【００１５】
第４の技術手段は、前記第１乃至第３のいずれかの技術手段において、前記第２記憶手段が、前記人間検出手段が具備している顔輪郭抽出手段により入力画像の中の人物の顔の輪郭及び／又は目乃至鼻乃至口の輪郭に関する顔特徴を抽出して記憶する顔特徴記憶手段を更に具備し、前記人間検出手段が、該顔特徴記憶手段に記憶された前記顔特徴と前記顔輪郭抽出手段により入力画像の中から新たに抽出された顔特徴とを比較する顔特徴比較手段を更に具備し、前記音源特徴比較手段によって新たに入力された音源の特徴と前記第１記憶手段に記憶された前記音源の特徴とが一致していることが検出された場合、前記顔特徴比較手段において、前記顔特徴記憶手段に記憶された前記顔特徴と前記顔輪郭抽出手段によって新たに抽出された顔特徴とを比較して、一致していることが検出された場合には、前記第２記憶手段に記憶された人物の前記画像位置を読み出して、モニタ画面に表示された前記人物の画像周辺に前記マーキング手段により発言者を示す前記マーカをマーキング表示させるテレビ会議システムの発言者識別装置とすることを特徴とするものである。
【００１６】
第５の技術手段は、前記第４の技術手段において、前記顔特徴比較手段において、前記顔特徴記憶手段に記憶された前記顔特徴と前記人間検出手段の顔輪郭抽出手段によって新たに抽出された顔特徴とを比較して、一致していないことが検出された場合、前記マーキング手段による発言者を示す前記マーカのマーキング表示を元の状態のままとするテレビ会議システムの発言者識別装置とすることを特徴とするものである。
【００１７】
第６の技術手段は、前記第１乃至第５のいずれかの技術手段において、前記人間検出手段が、検出された人物の唇が動いていることを検出する唇検出手段を具備し、前記音源特徴比較手段によって新たに入力された音源の特徴と前記第１記憶手段に記憶された前記音源の特徴とが一致していることが検出された場合、前記唇検出手段によって前記音源の特徴が一致している方向にいる人物の唇が動いていることが検出された場合には、前記第２記憶手段に記憶された人物の前記画像位置を読み出して、モニタ画面に表示された前記人物の画像周辺に前記マーキング手段により発言者を示す前記マーカをマーキング表示させるテレビ会議システムの発言者識別装置とすることを特徴とするものである。
【００１８】
第７の技術手段は、前記第６の技術手段において、前記唇検出手段によって前記音源の特徴が一致している方向にいる人物の唇が動いていることが検出されない場合、前記マーキング手段による発言者を示す前記マーカのマーキング表示を元の状態のままとするテレビ会議システムの発言者識別装置とすることを特徴とするものである。
【００２２】
第８の技術手段は、前記第１乃至第７のいずれかの技術手段において、前記人間検出手段が、入力画像内の人物を検出し、前記マーキング手段が、前記人間検出手段によって検出された人物それぞれに対して、モニタ画面に表示された該人物の画像周辺に色分け及び／又は形状分けされたマーカのマーキング表示を行なうテレビ会議システムの発言者識別装置とすることを特徴とするものである。
【００２４】
第９の技術手段は、前記第１乃至第８のいずれかの技術手段において、前記人間検出手段が、入力画像内の人物を検出する際に、検出された検出時刻を記憶する検出時刻記憶手段を更に具備し、該検出時刻記憶手段に記憶された前記検出時刻を、検出された前記人物の参加時刻として、モニタ画面に表示するテレビ会議システムの発言者識別装置とすることを特徴とするものである。
【００２５】
第１０の技術手段は、前記第１乃至第９のいずれかの技術手段において、前記人間検出手段が、入力画像内から検出された１人以上の人物の画像位置を記憶することができる記憶手段を１つ以上具備し、更に、前記人間検出手段が、前以って設定された一定周期毎に、前記入力画像内のすべての人物の検出を行ない、各人物の検出結果の画像位置をそれぞれ求める際に、前記検出結果の人物の画像位置のそれぞれと前記記憶手段に記憶された１人以上の人物の画像位置との比較を行なう位置比較手段を更に具備し、前記位置比較手段にて、前記記憶手段に記憶された人物の画像位置のいずれかが、前記検出結果の人物の画像位置のいずれにも一致していない場合には、一致していない前記記憶手段に記憶された画像位置の人物が退席したものとして、該人物が検出されなくなった時刻を退席時刻として記憶する退席時刻記憶手段と、更に、前記位置比較手段にて、前記検出結果の人物の画像位置のいずれかが、前記記憶手段に記憶された人物の画像位置のいずれにも一致していない場合には、一致していない前記検出結果の画像位置の人物が新たに参加したものとして、該人物が新たに検出された時刻を参加時刻として記憶する参加時刻記憶手段と、を更に具備し、前記退席時刻記憶手段に記憶されている前記退席時刻を、更に、前記参加時刻記憶手段に記憶されている前記参加時刻を、モニタ画面に表示するテレビ会議システムの発言者識別装置とすることを特徴とするものである。
【００２６】
第１１の技術手段は、前記第１乃至第１０のいずれかの技術手段において、前記方向検出手段が、入力音声の音源を検出した時刻を音源検出時刻として記憶する音源検出記憶手段を更に具備し、前記音源検出記憶手段に記憶されている前記音源検出時刻をモニタ画面に表示するテレビ会議システムの発言者識別装置とすることを特徴とするものである。
【００３１】
第１２の技術手段は、前記第１乃至第１１の技術手段のいずれかのテレビ会議システムの発言者識別装置を備えているテレビ会議システムとすることを特徴とするものである。
【００３２】
【発明の実施の形態】
本発明に係るテレビ会議システムの発言者識別装置及び該発言者識別装置を備えたテレビ会議システムの実施形態の一例について、以下に図面を参照しながら説明する。
図１は、本発明に係る発言者識別装置を具備したテレビ会議システムの構成の一例を示すブロック構成図である。
【００３３】
図１に示すように、本発明に係る発言者識別装置を具備したテレビ会議システムは、入力音声から音源方向を検出する方向検出部２と、入力画像内から人物を検出する人間検出部３とを具備し、更に、モニタ画面に表示させる各種のＣＧ（ＣｏｍｐｕｔｅｒＧｒａｐｈｉｃｓ）情報を保存しているＲＯＭであるＣＧＲＯＭ６と、該ＣＧＲＯＭ６からのＣＧ情報を、モニタ画面に重畳させて表示させるためのスーパインポーズ発生回路４とを、識別用マーカを表示させるためのマーキング手段として具備している。
図１に基づいて、本発明に係るテレビ会議システムの動作について説明する。
【００３４】
まず、入力された音声は、音声入力部１より電気信号からなる音声信号に変換され、方向検出部２と音声コーデック７とに送られる。
方向検出部２においては、該音声信号に基づいて発言者の位置が、モニタ画面上の位置情報として検出され、カメラ制御部８および人間検出部３に対して、該位置情報が与えられる。
【００３５】
カメラ制御部８は、発言者の前記位置情報に基づいて、少なくとも水平方向に可動可能なカメラ装置９の旋回動作やズーミング動作を行なわせることにより、音声の発生源（即ち、テレビ会議における発言者の位置）が、モニタ画面の中央に位置するように制御したり、あるいは、ズーミングにより拡大表示するように制御したりして、カメラ装置９の位置制御あるいはズーム制御を行なう。
【００３６】
ここで、カメラ制御部８は、方向検出部２により検出された音声の発声方向（音源方向）にカメラ装置９の位置を向けるように旋回制御させたり、ズーム制御をさせたりする際に、方向検出部２により検出された音源方向に人間検出部３が人物を検出できなかった場合、及び／又は、例えば、テレビ会議の参加者の音声（音源）の特徴と一致していない特徴の発声音声である場合などのごとく、発言者を特定することができなかった場合（即ち、発言者を示すマーカを新たにマーキング表示すべき状態を検出できなかった場合）には、動作前の元のカメラ装置９の位置に戻すことも可能である。
【００３７】
あるいは、前述のごとく、カメラ装置９はズーム手段を具備しており、カメラ制御部８は、ユーザの指示により、又は、方向検出部２により検出された音源方向に、カメラ装置９の位置を向けると共に、自動的にカメラ装置９のズーミングを行なわせるように制御することも可能であり、また、方向検出部２により検出された音源方向に人間検出部３が人物を検出できなかった場合、及び／又は、発言者を特定することができなかった場合（即ち、発言者を示すマーカを新たにマーキング表示すべき状態を検出できなかった場合）には、動作前の元のカメラ装置９の位置に戻すと共にカメラ装置９のズーミング状態を動作前の状態に戻すことも可能である。
【００３８】
撮像装置としてのカメラ装置９においては、撮像された映像情報を、映像信号として電気信号に変換する。変換された該映像信号はビデオデコーダ１０によってデジタル処理され、人間検出部３に映像データとして送られる。
人間検出部３においては、方向検出部２から入力される前記位置情報とビデオデコーダ１０から入力される前記映像データとに基づいて、前記映像データにおける前記位置情報が示す位置に、人物の撮像画像データが存在しているか否かの検出を行なう。
【００３９】
前記位置情報が示す位置に、人物の撮像画像データが存在していることが検出された場合にあっては、マーキング手段を提供するスーパインポーズ発生回路４において、各種のＣＧ（ＣｏｍｐｕｔｅｒＧｒａｐｈｉｃｓ）情報を保存しているＲＯＭであるＣＧＲＯＭ６から、発言者を示すマーカ（目印）として予め定められているマーカデータを読み出し、前記映像データ上における前記位置情報と発言者である人物の撮像画像データとから演算されたマーカ表示位置（即ち、発言者の人物の画像周辺位置）に、読み出された発言者を示す前記マーカデータを前記映像データ上に重ね合わされた合成映像データが作成される。
【００４０】
作成された合成映像データは、画像コーデック５によって圧縮符号化処理された符号化画像データとされ、一方、音声入力部１からの音声信号が、音声コーデック７にて圧縮符号化された符号化音声データとされ、該符号化画像データと該符号化音声データとが、多重化回路部１１にて多重化されて、通信回線１２を通して、相手端末に送られる。
【００４１】
ここで、マーキング手段であるスーパインポーズ発生回路４にて作成された前記合成映像データ（即ち、カメラ装置９で撮像され、ビデオデコーダ１０でデジタル処理が施された映像データと、発言者を示すマーカデータとが合成された合成映像データ）が更に画像コーデック５により圧縮符号化された状態の前記符号化画像データが、受信側の相手端末における画像コーデックにより復号化されて、テレビ会議システムにおけるモニタ画面に画面表示されている画像の一例の概念図を、図２に示す。図２において、１０１は、テレビ会議の参加者のうち、発言者を示し、１０２は、テレビ会議の参加者で現在発言していない人物を示し、１００は、発言者であることを示すマーカ（図２においては、矢印形状のマーク）である。
【００４２】
人間検出部３は、カメラ装置９からの映像情報がビデオデコーダ１０によりデジタル処理が施された映像データの中から色情報を用いて人間の顔の色を抽出することにより、顔領域を特定（抽出）して、映像データの中に含まれている人物を検出することができる顔抽出手段３ａを備えている。
また、該人間検出部３は、カメラ装置９からの映像情報がビデオデコーダ１０によりデジタル処理が施された映像データの中から、人物の顔の輪郭や目乃至鼻乃至口といった個々の特徴を有する輪郭を抽出することによって、映像データの中に含まれている人物を検出することができる顔輪郭抽出手段３ｂも備えている。
【００４３】
更には、後述するように、該人間検出部３は、カメラ装置９からの映像情報がビデオデコーダ１０によりデジタル処理が施された映像データの中から、人物の唇が動いているか否かを検出することができる唇検出手段３ｃも備えている。かかる唇検出手段３ｃにより、参加者の人物の唇が動いていることが検出された場合、方向検出部２の結果如何によらず、あるいは、方向検出部２によって得られた音源方向に人間検出部３にて人物を検出した場合であって、かつ、前記唇検出手段３ｃにて該人物の唇が動いていることが検出された場合には、該人物が発言していると見なして、モニタ画面上の発言者である該人物の画像周辺に、発言者を示すマーカをマーキング表示させることとしても良いし、更に、方向検出部２により検出された音源方向に一致する方向にいる人物の唇が動いていることが、唇検出手段３ｃにより検出された場合であって、更に、方向検出部２にて発声音声（音源）の特徴とテレビ会議参加者の発声音声（音源）の特徴とが一致する場合において、発言者を示すマーカをモニタ画面にマーキング表示させることとしても構わない。
【００４４】
また、図１に示す方向検出部２と人間検出部３とは、図３のような構成にすることも可能である。
ここに、図３は、本発明に係る発言者識別装置を具備したテレビ会議システムを構成する方向検出部と人間検出部との他の構成例を示すブロック構成図である。即ち、図３に示すごとく、方向検出制御部２′は、図１に示す方向検出部２以外に、更に、入力音声（音源）の特徴を抽出する音源特徴抽出部２１と、抽出された入力音声（音源）の特徴を特徴データとして記憶する第一の記憶手段である第１記憶メモリ２２と、方向検出部２を介して入力されて、音源特徴抽出部２１により抽出された音源の特徴と第１記憶メモリ２２に記憶されているすべての前記特徴データとを比較照合する音源特徴比較部２３とを備えている。
【００４５】
一方、人間検出制御部３′は、図１に示す人間検出部３以外に、更に、人間検出部３において、ビデオデコーダ１０からの映像データの中に人物が検出された場合の人物の画像位置を記憶するための第二の記憶手段を提供する第２記憶メモリ３２と、方向検出部２が示す音声の発声方向（音源方向）の位置に相当する第２記憶メモリ３２における映像データの画像位置を算出する演算を行なう位置演算部３１とを備えている。
【００４６】
図３に示す方向検出制御部２′と人間検出制御部３′とにおいては、まず、テレビ会議が始まるに先立って、第１記憶メモリ２２と第２記憶メモリ３２とに、それぞれ、テレビ会議の各参加者の発声音声（音源）の特徴を示す特徴データと各参加者の画像位置とを予め登録する。
ここで、発言者の発声音声が、音声入力部１より電気信号の音声信号に変換されて、方向検出制御部２′に入力され、図１と同一の機能を果たす方向検出部２を介して、音源特徴抽出部２１に送られる。音源特徴抽出部２１においては、発言者の音声信号（音源）の特徴を抽出し、第１記憶メモリ２２に特徴データとして格納する。
【００４７】
一方、図１と同一の人物検出機能を果たす人間検出部３においては、方向検出制御部２′から入力される発声音声方向（音源方向）の位置に相当するビデオデコーダ１０の映像データ内の画像位置に人物の映像を検出した場合には、該人物の画像位置を第２記憶メモリ３２に格納する。
【００４８】
第１記憶メモリ２２と第２記憶メモリ３２とに、テレビ会議参加者に関する音声（音源）の特徴と人物の画像位置とを予め設定した後、テレビ会議が始まると、音源特徴比較部２３においては、音源特徴抽出部２１から送られてくる新たな発声音声（音源）の特徴と、第１記憶メモリ２２に記憶されているすべての音声の特徴データとを比較照合し、記憶されているすべての音声の前記特徴データの中に、新たな前記発声音声（音源）の特徴と一致する音声が存在していることが検出された場合には、本テレビ会議の参加者の発言と判断し、位置演算部３１を介して該人物の画像位置を第２記憶メモリ３２から読み出して、モニタ画面上における該発言者の人物の画像周辺に、発言者であることを示すマーカを、スーパインポーズ発生回路４を介して画面表示する。
【００４９】
かくのごとく、発声音声である音源の特徴に基づいて、テレビ会議の参加者の発言と判断される場合にあっては、図１に示す人間検出部３と同一の機能を果たす図３における人間検出部３は、何ら動作をする必要はなく、人間検出部３は起動されることがないものとすることができる。
【００５０】
一方、もし、音源特徴抽出部２１から送られてくる前記発声音声（音源）の特徴が、第１記憶メモリ２２に記憶されているすべての音声の特徴データと一致していないことが音源特徴比較部２３において判明した場合には、図１に示す人間検出部３と同一の機能を果たす図３における人間検出部３が起動されて、該人間検出部３において、ビデオデコーダ１０からの映像データの中から人物検出を行なう。即ち、該人間検出部３において、位置演算部３１を介して得られた前記発声音声方向（音源方向）の位置に、人物の映像が存在しているか否かが判別されることにより、該発声音声（音源）が、本テレビ会議の参加者の発言であるか否かを判定する。
【００５１】
人間検出部３において、前記発声音声方向（音源方向）の位置に人物が存在していないと判定された場合には、該発声音声（音源）が、本テレビ会議には無関係の雑音と見なして、何ら処理を行なうことなく、元の状態のままとし、一方、該発声音声方向（音源方向）の位置に人物が存在していると判定された場合には、発言者が位置を移動して発言しているものと見なして、位置演算部３１を介して算出されている参加者の画像位置情報を、新たに第２記憶メモリ３２に登録すると共に、第１記憶メモリ２２にも、該発声音声（音源）の特徴データを再登録する。
【００５２】
従来の技術においては、新たな音声が発生された際には、発言者を示すマーカ（目印）をモニタ画面にマーキング表示させたり、カメラ装置を旋回させて、該音声の発生元である発言者がモニタ画面内に収まるように撮像せんとしている。しかしながら、発言者以外の何らかの物音が発生した場合においても、全く同様に、発言者を示すマーカ（目印）がモニタ画面にマーキング表示されたり、あるいは、カメラ装置が物音の発生方向に旋回されてしまっていた。
【００５３】
本発明に係る発言者識別手段を備えたテレビ会議システムにおいては、人間検出部３にて人物が検出されなかった場合には、発言者を示すマーカを新たにマーキング表示させることも行なわれないし、更に、図１に示すカメラ制御部８を制御して、カメラ装置９の位置やズーミング状態が、元の状態に戻るように復帰指令を送出している。即ち、たとえ、カメラ装置９の位置が一旦音源方向に旋回されたとしても、図示はしていないが、カメラ制御部８が旋回制御される前の位置情報やズーミング情報が、例えば、図３に示す第２記憶メモリ３２に保存されていることにより、カメラ装置９の位置やズーミング状態を元の状態に戻す復帰指令の送出が可能とされている。而して、たとえ、参加者の発言以外の物音に反応して、カメラ装置９が旋回してしまった場合であっても、発言者を示すマーカ（目印）の位置、あるいは、カメラ装置９の位置を、元の位置に戻すことができる。
【００５４】
また、第２記憶メモリ３２が、人間検出部３が具備している顔輪郭抽出手段３ｂにより人物の顔の輪郭及び目乃至鼻乃至口の輪郭に関する個々の人物の顔特徴を抽出して記憶する顔特徴記憶手段を更に具備している場合においては、人間検出部３が、該顔特徴記憶手段に記憶された前記顔特徴と顔輪郭抽出手段３ｂにより映像データの中から新たに抽出された顔特徴とを比較する顔特徴比較手段を更に具備し、音源特徴比較部２３によって新たに入力された音源の特徴と第１記憶メモリ２２に記憶された前記特徴データ（音源の特徴）とが一致していることが検出された場合、前記顔特徴比較手段において、前記顔特徴記憶手段に記憶された前記顔特徴と人間検出部３の顔輪郭抽出手段３ｂによって新たに抽出された顔特徴とを比較して、一致していることが検出された場合には、第２記憶メモリ３２に記憶された人物の画像位置を読み出して、該画像位置の前記人物の画像周辺に発言者を示すマーカをマーキング表示させたり、カメラ装置９を前記人物の位置に旋回させたり、ズーミングして拡大表示させることも可能である。
【００５５】
ここで、前記顔特徴比較手段において、前記顔特徴記憶手段に記憶された前記顔特徴と顔輪郭抽出手段３ｂによって新たに抽出された顔特徴とを比較して、一致していないことが検出された場合、発言者を示す前記マーカのマーキング表示やカメラ装置９の位置やズーミング状態を元の状態のままとすることとする。
【００５６】
更に、図１に示す方向検出部２は、図４のような構成にすることも可能である。ここに、図４は、本発明に係る発言者識別装置を具備したテレビ会議システムを構成する方向検出部の更なる他の構成例を示すブロック構成図である。
即ち、図４に示すように、方向検出制御部２″は、図１に示す方向検出部２以外に、更に、所定の音声レベルを上回る発声音声（音源）を検出する音声レベル検出部２４と、かかる所定の音声レベルを上回る発声音声が予め設定されている所定継続時間以上に亘って継続していることを検出する時間検出手段を提供するタイマ部２５とを備えている。
【００５７】
図４において、入力された発声音声は、音声入力部１において、音声信号に変換されて、方向検出制御部２″に入力され、音声レベル検出部２４において発声音声の音声レベルが測定される。参加者からの発言として、該発声音声が前記所定の音声レベル以上の音声信号であることが検出された場合には、タイマ部２５のタイマが起動されて、経過時間の計数が開始され、予め設定されている前記所定継続時間が経過した場合には、所定の継続時間以上に亘って、入力音声（音源）が継続している状態にあり、タイマ部２５からの出力信号を、方向検出部２に対して送出する。
【００５８】
ここで、予め設定されている前記所定継続時間が経過する前に、参加者からの発言が終了して、音声レベル検出部２４において、前記所定の音声レベル以上の発声音声が検出されなくなると、タイマ部２５はタイマの計数を停止され、タイマ部２５からの出力信号は発生しなくなる。
図１と同様の機能を果たす方向検出部２は、タイマ部２５からの出力信号が入力されている場合にあって、始めて、入力された発声音声（音源）が、本テレビ会議の参加者の発言に基づく音声信号であるか否かを判断する動作が起動されることになる。
【００５９】
例えば、くしゃみのような短い時間の音声情報は発言ではないので、図１に示すカメラ装置９を、短い時間の該音声情報の位置に旋回させたり、あるいは、モニタ画面上に該音声情報の位置を発言者を示すようにマーキング表示させることは無駄である。本発明における実施例に示すように、かかる短い時間の音声情報の場合においては、タイマ部２５からの出力信号は出力されることはなく、而して、方向検出部２としては、発声音声方向（音源方向）の位置を示す位置情報を検出する動作が起動されずに、カメラ装置９の旋回などの制御動作や、あるいは、人間検出部３における人物検出動作も起動されない状態に設定されている。
【００６０】
また、図４においては、かかる短い時間の音声情報の場合に、方向検出部２が起動されないことにより、人間検出部３も起動されない旨を説明しているが、かかる場合に限らず、直接、タイマ部２５から人間検出部３へも出力信号が供給されていて、かかる短い時間の音声情報の場合には、方向検出部２を介することなく、直接、人間検出部３を起動させない状態とすることも可能である。
【００６１】
なお、以上に説明のごとく、図４においては、タイマ部２５を起動させる条件として、予め定められている所定の音声レベル以上のレベルにある発声音声がある場合を条件としているが、かかる所定の音声レベル以上のレベルにある発声音声であるか否かの如何に関わらず、識別可能な何らかの音声が継続して、所定継続時間以上に亘っていることが検出された場合であっても構わない。
【００６２】
また、前述のごとく、図３に示す人間検出部３は、方向検出部２からの発声音声方向（音源方向）の位置が示す映像データ上の位置に存在する人物の有無を検出するだけではなく、テレビ会議に先立って、予め、テレビ会議への参加者全員の画像位置の検出を行ない、図３に示す第２記憶メモリ３２に登録しておくようにすることが可能とされている。而して、例えば、図５に示すように、参加者のうち、発言者と非発言者とのマーカの表示を色分けしたり、あるいは、形状分けしたりして、変化させて、モニタ画面上にマーキング表示を行なわせることも可能である。
【００６３】
ここに、図５は、テレビ会議参加者のうち、発言者用と非発言者用のマーカのマーキング表示を行なう場合のモニタ画面表示の一例を示す概念図である。図５においては、発言者１０５が発言していることを示す発言者マーカ１０３は、非発言者１０６を示す非発言者マーカ１０４とは、例えば、異なる色を用いてマーキング表示している一例を示している。
【００６４】
即ち、テレビ会議参加者のうち、発言者１０５が発言を行なった場合、方向検出部２にて、発言者１０５の音源方向を示す位置情報が検出され、予め記憶メモリ３２に記憶されている画像位置と比較して、該音源方向に人物の存在を確認することによって、発言者１０５を識別し、発言していない非発言者１０６を示す非発言者マーカ１０４とは異なる色の発言者マーカ１０３によって表示している。
【００６５】
而して、たとえ、テレビ会議の参加者全員をモニタ画面に画面表示している状態であっても、モニタ画面の参加者各自毎に即ち人物の画像周辺位置に、参加者全員に対してそれぞれ色違いの発言者マーカ１０３と非発言者マーカ１０４とのマーキング表示を行ない、発言者か否かを容易に識別可能とすることができる。ここに、マーキング表示する前記マーカとしては、発言者と非発言者との識別用のみに限るものではない。例えば、テレビ会議の司会者と一般参加者とオブザーバとを容易に識別可能とするように、色分け及び／又は形状分け及び／又は模様分けすることとしても良く、テレビ会議の実施に有用な如何なる識別情報でも、モニタ画面に重畳表示されるマーカにより提供することが可能である。
【００６６】
また、人間検出部３を構成する人間検出制御部３′は、音声が入力された時だけ、人物検出動作を行なうのではなく、常時、定期的に、人間検出部３にて人物の検出動作を行なうこともできる。而して、ビデオデコーダ１０からの映像データの中の各画像位置で、人物を最初に検出した時刻を、図示していない時計回路から読み出し、検出された該人物の検出時刻を、テレビ会議への参加時刻として、図３に示す第２記憶メモリ３２の検出時刻記憶部に記憶させると共に、スーパインポーズ発生回路４により、該検出時刻記憶部から読み出した前記検出時刻を重ね合わせて画面表示させることにより、モニタ画面には、人物の検出結果として、該人物がテレビ会議に参加した参加時刻を表示することができる。
【００６７】
あるいは、人間検出部３を構成する人間検出制御部３′は、ビデオデコーダ１０からの映像データ即ち入力画像内から検出される人物の画像位置を１人以上記憶することができる１つ以上の第２記憶メモリ３２を具備し、更に、予め設定されている一定周期毎に、定期的に、前記入力画像内のすべての人物の検出を行ない、各人物の検出結果の画像位置をそれぞれ求める際に、前記検出結果の人物の画像位置のそれぞれと、第２記憶メモリ３２に記憶された１人以上の人物の画像位置とを比較する位置検出手段を更に具備しており、該位置検出手段にて、前記検出結果の人物の画像位置のいずれかが、第２記憶メモリ３２に記憶された人物の画像位置のいずれにも一致していない場合には、前記検出結果の画像位置の人物が新たにテレビ会議に参加したものと見なし、人間検出部３により前記検出結果の画像位置の人物が新たに検出された時刻を参加時刻として、図３に示す第２記憶メモリ３２の参加時刻記憶部に記憶させると共に、該参加時刻記憶部に記憶された前記参加時刻を、モニタ画面に表示させることとしても良い。
【００６８】
例えば、図６は、あるテレビ会議の参加状況を示すモニタ画面表示の一例を示す概念図である。図６（Ａ）において、１０９及び１１０は、９：００にテレビ会議に参加した参加者を示すものであり、それぞれの人物が検出された検出時刻即ち参加時刻を、モニタ画面上の映像データに重ね合わせて、該参加者の人物画像位置の画像周辺に、参加時刻表示１０７及び１０８が表示されている。
【００６９】
例えば、図６（Ｂ）は、図６（Ａ）から、しばらく時間が経過したテレビ会議の参加状況を示しており、モニタ画面に参加時刻表示１１１が新たに追加されて表示されているように、参加者１１２が９：３０にテレビ会議に参加したことがわかる。
かかる参加時刻の表示は、図３に示す第２記憶メモリ３２の前記検出時刻記憶部又は前記参加時刻記憶部に記憶させることにより、モニタ画面には、テレビ会議が終了するまで、該人物の参加時刻を引き続き表示させることができる。
【００７０】
また、人間検出制御部３′は、ビデオデコーダ１０からの映像データ即ち入力画像内から検出される人物の画像位置を１人以上記憶することができる１つ以上の第２記憶メモリ３２を具備し、更に、予め設定されている一定周期毎に、定期的に、前記入力画像内のすべての人物の検出を行ない、各人物の検出結果の画像位置をそれぞれ求める際に、前記検出結果の人物の画像位置のそれぞれと、第２記憶メモリ３２に記憶された１人以上の人物の画像位置とを比較する位置検出手段を更に具備しており、該位置検出手段にて、第２記憶メモリ３２に記憶された人物の画像位置のいずれかが、前記検出結果の画像位置のいずれにも一致していない場合には、第２記憶メモリ３２に記憶された一致していない画像位置の人物が退席したものと見なし、人間検出部３により該人物が検出されなくなった時刻を退席時刻として、図３に示す第２記憶メモリ３２の退席時刻記憶部に記憶させると共に、該退席時刻記憶部に記憶された前記退席時刻を、モニタ画面に表示させることができる。
【００７１】
即ち、今まで人物検出がされていた画像位置即ち画像領域において人間検出部３による人物の検出がされなくなった場合、図示していない時計回路より人物が検出されなくなった時刻を読み出し、スーパインポーズ発生回路４により重ね合わせ表示させてモニタ画面に表示させていることになる。
而して、モニタ画面には、人物が検出されなくなった時刻を、該参加者が退席した退席時刻として、該参加者を示していた人物の画像位置の画像周辺に画面表示することができる。
また、かかる人物が検出されなくなった退席時刻は、図３に示す第２記憶メモリ３２の前記退席時刻記憶部に記憶させることにより、モニタ画面には、テレビ会議が終了するまで、該人物の退席時刻を引き続き表示させることができる。
【００７２】
更に、人間検出部３は、発言者の発声音声方向（音源方向）による位置情報とカメラ装置９からの映像情報に基づく映像データとにより、音声を発声した発言者の人物を検出した場合、図示していない時計回路より時刻を読み出し、スーパインポーズ発生回路４より読み出された時刻を重ね合わせて表示することもできる。即ち、方向検出制御部２′が、入力音声の音源を検出した時刻を音源検出時刻として、図３に示す第１記憶メモリ２２の音源検出記憶部に記憶すると共に、モニタ画面には、該音源検出記憶部に記憶されている前記音源検出時刻を、各参加者の人物画像位置の画像周辺に表示させることができる。
而して、参加者各々が発言した発言時刻を、モニタ画面に表示することができる。
【００７３】
また、人間検出部３は、人物の顔の色や顔の輪郭などによる人物検出機能と共に、前述のごとく、人物の唇が動いていることを検出する唇検出手段３ｃを更に付与させることも可能としている。
【００７４】
かかる場合においては、前述のごとく、発声音声が音声入力部１に入力された場合、まず、人間検出部３の人物検出機能により、発声音声方向（音源方向）の位置にいる人物が検出される。その後、唇検出手段３ｃにより、検出された人物の唇が動いているか否かが検出される。唇が動いていることが検出できれば、入力された発声音声方向（音源方向）にいる当該人物の発言による発声音声であると判断し、カメラ装置９の旋回動作やズーム動作、更には、および発言者を示すマーカのマーキング表示を行なう。逆に、唇検出手段３ｃにより、検出された人物の唇が動いていないことが検出された場合は、入力された発声音声は、入力された発声音声方向（音源方向）にいる当該人物の発言による発声音声ではないと判断し、カメラ装置９の旋回動作やズーム動作及び発言者を示すマーカのマーキング表示を行なわない。
【００７５】
更には、図３に示すごとく、方向検出制御部２′が、音源特徴抽出部２１により抽出された音源の特徴を記憶する第１記憶メモリ２２と、新たに入力された音源の特徴と第１記憶メモリ２２に記憶された前記音源の特徴とを比較する音源特徴比較部２３とを具備していて、人間検出制御部３′が、検出された人物の画像位置を記憶する第２記憶メモリ３２を具備している場合において、人間検出制御部３′が、更に、検出された人物の唇が動いていることを検出する唇検出手段３ｃを具備している場合にあっては、音源特徴比較部２３によって新たに入力された音源の特徴と第１記憶メモリ２２に記憶された前記音源の特徴とが一致していることが検出された場合で、かつ、前記音源の特徴の一致が検出された方向にいる人物の唇が動いていることが、唇検出手段３ｃによって検出された場合に、始めて、第２記憶メモリ３２に記憶された人物の前記画像位置を読み出して、該画像位置の前記人物の画像周辺に発言者を示す前記マーカをマーキング表示させたり、カメラ装置９の位置を発言者の方向に向けさせたり、ズーミングさせることも可能である。
【００７６】
かかる場合において、前記音源の特徴が一致している方向にいる人物の唇が動いていることが、唇検出手段３ｃによって検出されない場合にあっては、ごく短時間の発言であったものと見なして、発言者を示す前記マーカをマーキング表示させる位置やカメラ装置９の位置を、元の状態のままとすることも可能である。
【００７７】
【発明の効果】
以上に説明したごとく、本発明に係るテレビ会議システムの発言者識別装置及び該発言者識別装置を備えたテレビ会議システムによれば、例えば、くしゃみや物を落とした音、マイクに物をぶつけた音など参加者の発言以外の音の発生によって、誤って、該発言以外の音の発生方向にカメラ装置が旋回してしまったり、発言者を示すマーカをマーキング表示させたりすることを防止することが可能であり、発言している参加者を常に正確にモニタ画面に表示することが可能となり、而して、快適にテレビ会議を行なうことが可能なテレビ会議システムを提供することができる。
【００７８】
更には、たとえ、テレビ会議の参加者全員をモニタ画面に画面表示した場合であっても、発言者や非発言者を識別可能なマーカを重畳させて画面表示させたり、あるいは、テレビ会議の司会者や一般の参加者あるいはオブザーバなどの識別が視覚的に容易なマーカを重畳させて画面表示させることも可能であり、而して、スムースにテレビ会議を行うことが可能なテレビ会議システムを提供することができる。
【００７９】
更には、テレビ会議の参加者が会議に参加した参加時刻、退席した退席時刻、発言した発言時刻などのテレビ会議の会議経過時刻を表示することが可能なテレビ会議システムを提供することができる。
【図面の簡単な説明】
【図１】本発明に係る発言者識別装置を具備したテレビ会議システムの構成の一例を示すブロック構成図である。
【図２】本発明に係る発言者識別装置を具備したテレビ会議システムにおいて、モニタ画面に画面表示された画像の一例を示す概念図である。
【図３】本発明に係る発言者識別装置を具備したテレビ会議システムを構成する方向検出部と人間検出部との他の構成例を示すブロック構成図である。
【図４】本発明に係る発言者識別装置を具備したテレビ会議システムを構成する方向検出部の更なる他の構成例を示すブロック構成図である。
【図５】テレビ会議参加者のうち、発言者用と非発言者用のマーカのマーキング表示を行なう場合のモニタ画面表示の一例を示す概念図である。
【図６】あるテレビ会議の参加状況を示すモニタ画面表示の一例を示す概念図である。
【符号の説明】
１…音声入力部、２…方向検出部、２′，２″…方向検出制御部、３…人間検出部、３′…人間検出制御部、３ａ…顔抽出手段、３ｂ…顔輪郭抽出手段、３ｃ…唇検出手段、４…スーパインポーズ発生回路、５…画像コーデック、６…ＣＧＲＯＭ、７…音声コーデック、８…カメラ制御部、９…カメラ装置、１０…ビデオデコーダ、１１…多重化回路部、１２…通信回線、１３…記憶メモリ、２１…音源特徴抽出部、２２…第１記憶メモリ、２３…音源特徴比較部、２４…音声レベル検出部、２５…タイマ部、３１…位置演算部、３２…第２記憶メモリ、１００…マーカ、１０１…発言者、１０２…参加者、１０３…発言者マーカ、１０４…非発言者マーカ、１０５…発言者、１０６…非発言者、１０７，１０８…参加時刻表示、１０９，１１０…参加者、１１１…参加時刻表示、１１２…参加者。
[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a speaker identification device for a video conference system and to a video conference system including the speaker identification device. In particular, it relates to a speaker identification device for a video conference system that, while a video conference is being held by a plurality of participants, attaches a marker (mark) identifying the speaker and displays it on the monitor screen, and to a video conference system including that speaker identification device.
[0002]
[Prior art]
As prior art for a video conference system that identifies a speaker and displays a marker (mark) during a video conference, there is, for example, the system disclosed in Japanese Patent Laid-Open No. 8-37655, "Video conference system having a speaker identification display function", which identifies only the direction from which sound is input, determines the direction in which the speaker is located, and displays a marking.
That is, in the technique disclosed in Japanese Patent Laid-Open No. 8-37655, while a video conference is being held by a plurality of participants, a position (coordinates) on the monitor screen is obtained from sound direction data supplied by a sound direction detector (direction detection means) that detects the direction in which a sound originates, and the camera device automatically moves to the position of the speaker who is speaking in the video conference. Further, when the speaker appears on the monitor screen, a marker (mark) that identifies the speaker is attached. The publication states that this makes it possible to easily identify the participant who is speaking in the video conference.
[0003]
[Problems to be solved by the invention]
However, such a conventional video conference system with a speaker identification display function does not identify the speech of a participant as such. Whenever some sound, that is, any noise, occurs, it attaches a marker (mark) based solely on the direction in which that noise originated. Consequently, the speaker identification marker is displayed even when a sound other than a participant's speech occurs. For example, a sneeze, the sound of a dropped pen, or the sound of something striking the microphone also triggers the marking display. As a result, the marking may be displayed on the monitor screen at a position different from that of the speaker who actually matters to the conference, or the camera device may turn in an irrelevant direction, so that participants may find the monitor screen very hard to follow or confusing.
[0004]
The present invention has been made to solve this problem. Its object is to provide a speaker identification device for a video conference system that inputs, transmits, and receives images and sound, and a video conference system including the speaker identification device, in which, for example, direction detection means detects the sound source direction from the input sound, human detection means detects a person in the input image, and marking means displays a predetermined marker (mark) around the image of the person detected by the human detection means. Only when the human detection means detects a person in the sound source direction obtained by the direction detection means is the sound judged to be speech uttered by a speaker in the video conference, whereupon the marking means displays the predetermined marker indicating the speaker around the image of that person on the monitor screen.
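The gating idea in this paragraph, marking a speaker only when a person is actually found in the direction the sound came from, can be sketched as follows. This is a minimal illustration, not the patented implementation: the bounding-box representation of detected persons, the function name `speaker_marker_position`, and the `offset` parameter are all assumptions introduced here.

```python
def speaker_marker_position(sound_x, person_boxes, offset=10):
    """Return where to draw the speaker marker, or None if no person is there.

    sound_x      -- horizontal screen coordinate derived from the sound direction
    person_boxes -- (left, top, right, bottom) boxes of detected persons (assumed format)
    offset       -- hypothetical gap in pixels between the marker and the top of the box
    """
    for left, top, right, bottom in person_boxes:
        # A person occupies the direction the sound came from: treat it as speech.
        if left <= sound_x <= right:
            return ((left + right) // 2, max(top - offset, 0))
    # No person in that direction: treat the sound as noise and draw nothing.
    return None


boxes = [(50, 40, 150, 200), (300, 50, 400, 210)]
print(speaker_marker_position(340, boxes))
print(speaker_marker_position(220, boxes))
```

With these two boxes, a sound mapped to x = 340 falls inside the second box and yields a marker position above that person, while a sound at x = 220 (for example a pen dropped between the two seats) yields no marker.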
[0005]
[Means for Solving the Problems]
A speaker identification device of a video conference system according to the present invention and a video conference system including the speaker identification device are configured by the following specific technical means.
[0006]
The first technical means is a speaker identification device for a video conference system that identifies a speaker in a video conference system that inputs, transmits, and receives images and sound, comprising: sound source direction detection means that includes first storage means, which detects the sound source direction from the input sound and stores the features of that sound source, and sound source feature comparison means, which compares the features of a newly input sound source with the features of the sound source stored in the first storage means; human detection means that detects a person in the input image; second storage means that, when the human detection means detects a person in the sound source direction obtained by the direction detection means, stores the image position of the detected person; and marking means that displays a predetermined marker around the image of the speaker on the displayed monitor screen. When the sound source feature comparison means detects that the features of the newly input sound source match the features of the sound source stored in the first storage means, the image position of the person stored in the second storage means is read out, and the marking means displays the marker indicating that person as the speaker in the video conference.
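The flow of the first technical means (compare a new voice feature with the stored ones and, on a match, read out the stored image position for marking) might look like the sketch below. The text does not say what the sound source features are or how a match is decided, so the feature vectors, the cosine-similarity test, the `threshold` value, and all names are assumptions made for illustration only.

```python
import math
from dataclasses import dataclass


@dataclass
class Participant:
    voice_feature: tuple   # stands in for an entry in the first storage means
    image_position: tuple  # stands in for an entry in the second storage means (x, y)


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def locate_speaker(new_feature, participants, threshold=0.95):
    """Return the stored image position of the matching participant, or None."""
    for p in participants:
        # Assumed matching rule: similarity at or above a fixed threshold.
        if cosine_similarity(new_feature, p.voice_feature) >= threshold:
            return p.image_position
    return None


participants = [
    Participant(voice_feature=(1.0, 0.0, 0.2), image_position=(120, 80)),
    Participant(voice_feature=(0.1, 1.0, 0.3), image_position=(400, 90)),
]
print(locate_speaker((0.98, 0.02, 0.21), participants))  # close to the first voice
print(locate_speaker((0.0, 0.1, 1.0), participants))     # unlike either voice
```

A returned position would be handed to the marking means. A result of `None` corresponds to the non-matching case that the second and third technical means go on to handle.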
[0013]
The second technical means is the speaker identification device of the first technical means, wherein the sound source feature comparison means reads out, from among the sound source features stored in the first storage means, the features of the sound source located in the direction of the newly input sound source and compares them with the features of the newly input sound source. When they are found not to match and the human detection means detects a person in the sound source direction obtained by the direction detection means, the features of the newly input sound source are stored anew in the first storage means and the image position of the person detected by the human detection means is stored in the second storage means.
[0014]
The third technical means is the speaker identification device of the first or second technical means, wherein the sound source feature comparison means reads out, from among the sound source features stored in the first storage means, the features of the sound source located in the direction of the newly input sound source and compares them with the features of the newly input sound source. When they are found not to match and the human detection means does not detect a person in the sound source direction obtained by the direction detection means, the marking display of the marker indicating the speaker by the marking means is left in its original state.
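The second and third technical means together describe what happens when the voice features do not match: re-register if a person is present in that direction, otherwise change nothing. A minimal sketch of that branch follows. Keying both stores by a direction value, and every name used here, are assumptions for illustration and are not taken from the patent text.

```python
def handle_mismatch(direction, new_feature, person_position,
                    voice_store, position_store):
    """Handle a sound whose features differ from those stored for `direction`.

    person_position is the image position of a person detected in that
    direction, or None when the human detection step found nobody.
    Returns True when the stores were updated.
    """
    if person_position is None:
        # Third means: nobody there, so leave the stores and the marker as they are.
        return False
    # Second means: a person is there, so store the new voice features
    # and the detected image position.
    voice_store[direction] = new_feature
    position_store[direction] = person_position
    return True


voices = {30: (1.0, 0.0)}
positions = {30: (120, 80)}
print(handle_mismatch(30, (0.0, 1.0), None, voices, positions))
print(handle_mismatch(30, (0.0, 1.0), (130, 82), voices, positions))
```

The first call models a noise with no person in that direction and leaves both stores untouched. The second models a different person speaking from that direction and overwrites both entries.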
[0015]
The fourth technical means is the speaker identification device of any one of the first to third technical means, wherein the second storage means further includes face feature storage means that stores face features relating to the contour of a person's face and/or the contours of the eyes, nose, or mouth, extracted from the input image by face contour extraction means provided in the human detection means, and the human detection means further includes face feature comparison means that compares the face features stored in the face feature storage means with face features newly extracted from the input image by the face contour extraction means. When the sound source feature comparison means detects that the features of the newly input sound source match the features of the sound source stored in the first storage means, the face feature comparison means compares the stored face features with the newly extracted face features. If they match, the image position of the person stored in the second storage means is read out and the marking means displays the marker indicating the speaker around the image of that person on the monitor screen.
[0016]
The fifth technical means is the speaker identification device of the fourth technical means, wherein, when the face feature comparison means compares the face features stored in the face feature storage means with the face features newly extracted by the face contour extraction means of the human detection means and finds that they do not match, the marking display of the marker indicating the speaker by the marking means is left in its original state.
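The additional face check of the fourth and fifth technical means can be pictured as a second gate after the voice match. The sketch below assumes that face features are short numeric vectors and that two faces "match" when their Euclidean distance is within a tolerance. Neither the representation nor the tolerance comes from the text, and the function names are hypothetical.

```python
import math


def faces_match(stored_face, new_face, tolerance=0.1):
    # Assumed rule: features match when they lie within `tolerance` of each other.
    return math.dist(stored_face, new_face) <= tolerance


def marker_after_face_check(voice_matched, stored_face, new_face,
                            stored_position, current_marker):
    """Return the marker position to display after the face comparison."""
    if voice_matched and faces_match(stored_face, new_face):
        # Fourth means: voice and face both match, so mark the stored position.
        return stored_position
    # Fifth means (face mismatch): keep the marker where it was.
    # Returning the current marker when the voice did not match is a
    # simplification of this sketch.
    return current_marker


stored = (0.42, 0.31, 0.77)
print(marker_after_face_check(True, stored, (0.43, 0.30, 0.78), (120, 80), (400, 90)))
print(marker_after_face_check(True, stored, (0.90, 0.10, 0.20), (120, 80), (400, 90)))
```

In the first call the extracted features are nearly identical to the stored ones, so the marker moves to the stored position. In the second the face differs, so the marker stays where it was.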
[0017]
The sixth technical means is the speaker identification device of any one of the first to fifth technical means, wherein the human detection means includes lip detection means that detects that the lips of a detected person are moving. When the sound source feature comparison means detects that the features of the newly input sound source match the features of the sound source stored in the first storage means, and the lip detection means detects that the lips of the person in the direction in which the sound source features matched are moving, the image position of the person stored in the second storage means is read out and the marking means displays the marker indicating the speaker around the image of that person on the monitor screen.
[0018]
The seventh technical means is the speaker identification device of the sixth technical means, wherein, when the lip detection means does not detect that the lips of the person in the direction in which the sound source features matched are moving, the marking display of the marker indicating the speaker by the marking means is left in its original state.
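How lip movement is detected is not specified here. One simple possibility, shown purely as an assumption, is to compare the mouth region across consecutive frames and call the lips "moving" when the mean absolute pixel difference exceeds a threshold. The frame format, the `threshold` value, and the function names below are all hypothetical.

```python
def lips_moving(mouth_frames, threshold=8.0):
    """mouth_frames: equally sized grayscale mouth regions as flat lists (0-255)."""
    if len(mouth_frames) < 2:
        return False
    for prev, cur in zip(mouth_frames, mouth_frames[1:]):
        mean_diff = sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        if mean_diff >= threshold:
            return True
    return False


def marker_after_lip_check(voice_matched, mouth_frames,
                           stored_position, current_marker):
    if voice_matched and lips_moving(mouth_frames):
        # Sixth means: voice matches and lips move, so mark the stored position.
        return stored_position
    # Seventh means: no lip movement detected, so keep the marker as it was.
    return current_marker


still = [[100, 100, 100, 100]] * 3
talking = [[100, 100, 100, 100], [100, 60, 60, 100], [100, 100, 100, 100]]
print(marker_after_lip_check(True, still, (120, 80), (400, 90)))
print(marker_after_lip_check(True, talking, (120, 80), (400, 90)))
```

A real system would first have to locate the mouth region in each frame. That step is outside this sketch.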
[0022]
The eighth technical means is the speaker identification device of any one of the first to seventh technical means, wherein the human detection means detects persons in the input image, and the marking means displays, for each person detected by the human detection means, a marker distinguished by color and/or shape around that person's image on the monitor screen.
[0024]
The ninth technical means is the speaker identification device of any one of the first to eighth technical means, wherein the human detection means further includes detection time storage means that stores the time at which a person in the input image was detected, and the detection time stored in the detection time storage means is displayed on the monitor screen as the participation time of the detected person.
[0025]
The tenth technical means is the speaker identification device of any one of the first to ninth technical means, wherein the human detection means includes one or more storage means capable of storing the image positions of one or more persons detected in the input image, and further includes position comparison means that, when the human detection means detects all persons in the input image at a preset fixed interval and obtains the image position of each detected person, compares each detected image position with the image positions of the one or more persons stored in the storage means. The device further includes leaving time storage means that, when the position comparison means finds that any stored image position matches none of the detected image positions, regards the person at that stored image position as having left and stores the time at which the person ceased to be detected as a leaving time, and participation time storage means that, when the position comparison means finds that any detected image position matches none of the stored image positions, regards the person at that detected image position as having newly joined and stores the time at which the person was newly detected as a participation time. The leaving time stored in the leaving time storage means and the participation time stored in the participation time storage means are displayed on the monitor screen.
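The periodic position comparison of the tenth technical means amounts to a two-way set difference between stored and freshly detected positions. The sketch below illustrates that bookkeeping. The pixel `tolerance` used to decide that two positions "match", the removal of a departed person from the stored list, and all names are assumptions made for this example.

```python
def positions_match(a, b, tolerance=20):
    return abs(a[0] - b[0]) <= tolerance and abs(a[1] - b[1]) <= tolerance


def update_attendance(stored, detected, now, join_times, leave_times):
    """Compare stored and detected image positions at one detection cycle.

    stored is the list of registered positions and is updated in place.
    """
    # A stored position with no detected counterpart: that person has left.
    for pos in list(stored):
        if not any(positions_match(pos, d) for d in detected):
            leave_times[pos] = now
            stored.remove(pos)
    # A detected position with no stored counterpart: a new person has joined.
    for d in detected:
        if not any(positions_match(d, pos) for pos in stored):
            join_times[d] = now
            stored.append(d)


stored, joins, leaves = [], {}, {}
update_attendance(stored, [(100, 80), (300, 80)], "09:00", joins, leaves)
update_attendance(stored, [(101, 81), (299, 80), (500, 85)], "09:30", joins, leaves)
update_attendance(stored, [(101, 81), (500, 85)], "10:00", joins, leaves)
print(joins)
print(leaves)
```

In this run two people are registered at 09:00, a third joins at 09:30, and the person at (300, 80) is recorded as leaving at 10:00. The recorded times are what the device would overlay on the monitor screen. The times themselves are made-up sample values.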
[0026]
The eleventh technical means is the speaker identification device of any one of the first to tenth technical means, wherein the direction detection means further includes sound source detection storage means that stores the time at which the sound source of the input sound was detected as a sound source detection time, and the sound source detection time stored in the sound source detection storage means is displayed on the monitor screen.
[0031]
The twelfth technical means is a video conference system including the speaker identification device for a video conference system according to any one of the first to eleventh technical means.
[0032]
DETAILED DESCRIPTION OF THE INVENTION
An example of an embodiment of a speaker identification device of a video conference system according to the present invention and a video conference system including the speaker identification device will be described below with reference to the drawings.
FIG. 1 is a block diagram showing an example of the configuration of a video conference system provided with a speaker identification device according to the present invention.
[0033]
As shown in FIG. 1, a video conference system including a speaker identification device according to the present invention includes a direction detection unit 2 that detects a sound source direction from input sound, a human detection unit 3 that detects a person from an input image, a CGROM 6, which is a ROM storing various CG (Computer Graphics) information to be displayed on the monitor screen, and a superimpose generation circuit 4 that superimposes the CG information from the CGROM 6 on the monitor screen. The superimpose generation circuit 4 serves as the marking means for displaying the marker that identifies the speaker.
The operation of the video conference system according to the present invention will be described with reference to FIG.
[0034]
First, the input voice is converted by the voice input unit 1 into an electrical audio signal and sent to the direction detection unit 2 and the audio codec 7.
In the direction detection unit 2, the position of the speaker is detected as position information on the monitor screen based on the audio signal, and the position information is given to the camera control unit 8 and the human detection unit 3.
[0035]
Based on the position information of the speaker, the camera control unit 8 turns and zooms the camera device 9, which is movable at least in the horizontal direction. It controls the position of the camera device 9 so that the sound source (that is, the speaker in the video conference) is located at the center of the monitor screen, or controls the zoom so that the speaker is displayed enlarged.
[0036]
Here, even after the camera control unit 8 has turned the camera device 9 toward the sound source direction detected by the direction detection unit 2, or has zoomed it, the camera device 9 can be returned to its original position before the operation when the speaker cannot be identified (that is, when a marker indicating the speaker should not be newly displayed). This is the case when the human detection unit 3 cannot detect a person in the sound source direction detected by the direction detection unit 2, and/or when, for example, the detected voice has features that do not match the voice (sound source) features of any participant in the video conference.
[0037]
Alternatively, as described above, the camera device 9 includes zoom means, and the camera control unit 8 can direct the camera device 9 toward the sound source direction, either on the user's instruction or based on the direction detected by the direction detection unit 2, and can also zoom the camera device 9 automatically. When the human detection unit 3 cannot detect a person in the sound source direction detected by the direction detection unit 2, and the speaker therefore cannot be identified (that is, when a marker indicating the speaker should not be newly displayed), both the position and the zooming state of the camera device 9 can be returned to their states before the operation.
[0038]
In the camera device 9 as an imaging device, the captured video information is converted into an electrical signal as a video signal. The converted video signal is digitally processed by the video decoder 10 and sent to the human detector 3 as video data.
Based on the position information input from the direction detection unit 2 and the video data input from the video decoder 10, the human detection unit 3 detects whether captured image data of a person exists at the position in the video data indicated by the position information.
[0039]
When captured image data of a person is detected at the position indicated by the position information, the superimpose generation circuit 4, which serves as the marking means, reads from the CGROM 6, a ROM storing various CG (Computer Graphics) information, the marker data set in advance as the marker (mark) indicating the speaker. It then calculates a marker display position (that is, a position around the image of the speaker) from the position information on the video data and the captured image data of the speaker, and creates composite video data in which the marker data is superimposed on the video data at that position.
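The gating described above, in which a marker is produced only when a person is found at the position indicated by the sound source, can be sketched roughly as follows. This is a minimal illustration, not the circuit of the embodiment: the `Box` format, the pixel `tolerance`, and the choice to place the marker 10 pixels above the person are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Box:
    """Bounding box of a detected person in screen pixels (hypothetical format)."""
    left: int
    top: int
    right: int
    bottom: int


def person_at(x: int, people: List[Box], tolerance: int = 20) -> Optional[Box]:
    """Return the detected person whose box covers horizontal position x, if any."""
    for box in people:
        if box.left - tolerance <= x <= box.right + tolerance:
            return box
    return None


def marker_position(sound_x: Optional[int], people: List[Box]) -> Optional[Tuple[int, int]]:
    """Return where to draw the speaker marker, or None when none should be shown."""
    if sound_x is None:
        return None  # no sound source detected
    box = person_at(sound_x, people)
    if box is None:
        return None  # a sound with no person at that position is treated as noise
    # Assumed layout: centre the marker horizontally, just above the person.
    return ((box.left + box.right) // 2, max(box.top - 10, 0))
```

With one person detected at `Box(100, 50, 200, 300)`, a sound localized at x = 150 yields a marker position, while a sound at x = 500 yields `None`.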
[0040]
The composite video data thus created is compression-encoded by the image codec 5 into encoded image data, while the audio signal from the voice input unit 1 is encoded by the audio codec 7 into encoded audio data. The encoded image data and the encoded audio data are multiplexed by the multiplexing circuit unit 11 and sent to the partner terminal through the communication line 12.
[0041]
The composite video data created by the superimpose generation circuit 4 serving as the marking means (that is, the video data captured by the camera device 9 and digitally processed by the video decoder 10, combined with the marker data indicating the speaker) is compression-encoded by the image codec 5, and the resulting encoded image data is decoded by the image codec of the partner terminal on the receiving side. FIG. 2 is a conceptual diagram showing an example of the image then displayed on the monitor screen of the video conference system. In FIG. 2, 101 denotes the speaker among the participants of the video conference, 102 denotes a participant who is not currently speaking, and 100 denotes the marker indicating the speaker (an arrow-shaped mark in FIG. 2).
[0042]
The human detection unit 3 includes face extracting means 3a, which identifies a face area by using color information to extract the color of a human face from the video data obtained by digitally processing the video information from the camera device 9 in the video decoder 10, and can thereby detect a person included in the video data.
The human detection unit 3 also includes face contour extracting means 3b, which can detect a person included in the video data by extracting individual features, such as the contour of the person's face and the contours of the eyes, nose, and mouth, from the same video data.
[0043]
Further, as will be described later, the human detection unit 3 also includes lip detecting means 3c, which can detect whether the lips of a person are moving from the video data obtained by digitally processing the video information from the camera device 9 in the video decoder 10. Several configurations are possible. When the lip detecting means 3c detects that a participant's lips are moving, that person may be regarded as speaking regardless of the result of the direction detection unit 2. Alternatively, the person may be regarded as speaking when the human detection unit 3 detects a person in the sound source direction obtained by the direction detection unit 2 and the lip detecting means 3c detects that the person's lips are moving. In either case, a marker indicating the speaker may be displayed around the image of that person on the monitor screen. As a further alternative, the marker may be displayed only when the lip detecting means 3c detects lip movement of the person in the direction matching the sound source direction detected by the direction detection unit 2 and, in addition, the features of the uttered voice (sound source) match the voice (sound source) features of a video conference participant.
[0044]
Further, the direction detection unit 2 and the human detection unit 3 shown in FIG. 1 can be configured as shown in FIG.
FIG. 3 is a block diagram showing another configuration example of the direction detection unit and the human detection unit constituting the video conference system including the speaker identification device according to the present invention. As shown in FIG. 3, the direction detection control unit 2 ′ includes, in addition to the direction detection unit 2 shown in FIG. 1, a sound source feature extraction unit 21 that extracts features of the input sound (sound source), a first storage memory 22 serving as first storage means that stores the extracted features as feature data, and a sound source feature comparison unit 23 that compares the features of a sound source input via the direction detection unit 2 and extracted by the sound source feature extraction unit 21 against all the feature data stored in the first storage memory 22.
[0045]
The human detection control unit 3 ′, on the other hand, includes, in addition to the human detection unit 3 shown in FIG. 1, a second storage memory 32 serving as second storage means that stores the image position of a person when the human detection unit 3 detects the person in the video data from the video decoder 10, and a position calculation unit 31 that calculates the image position in the video data, as stored in the second storage memory 32, corresponding to the voice direction (sound source direction) indicated by the direction detection unit 2.
[0046]
In the direction detection control unit 2 ′ and the human detection control unit 3 ′ shown in FIG. 3, first, prior to the start of the video conference, the first storage memory 22 and the second storage memory 32 respectively store the video conference. Feature data indicating the features of the voice (sound source) of each participant and the image position of each participant are registered in advance.
Here, the voice of the speaker is converted into a voice signal of an electrical signal from the voice input unit 1 and input to the direction detection control unit 2 ′, via the direction detection unit 2 that performs the same function as in FIG. 1. To the sound source feature extraction unit 21. In the sound source feature extraction unit 21, the feature of the voice signal (sound source) of the speaker is extracted and stored in the first storage memory 22 as feature data.
[0047]
On the other hand, in the human detection unit 3 that performs the same person detection function as that in FIG. 1, the image in the video data of the video decoder 10 corresponding to the position of the voice direction (sound source direction) input from the direction detection control unit 2 ′. When a person image is detected at the position, the image position of the person is stored in the second storage memory 32.
[0048]
When the video conference starts after the voice (sound source) features and the image positions of the participants have been set in the first storage memory 22 and the second storage memory 32 in advance, the sound source feature comparison unit 23 compares the features of each new uttered voice (sound source) sent from the sound source feature extraction unit 21 with the feature data of all the voices stored in the first storage memory 22. If the stored feature data includes a voice that matches the features of the new uttered voice, the sound is determined to be a speech by a participant of the video conference. The image position of that person is then read from the second storage memory 32 via the position calculation unit 31, and a marker indicating the speaker is displayed around the image of the person on the monitor screen through the superimpose generation circuit 4.
[0049]
When the sound is determined to be a speech by a participant of the video conference based on the features of the voice in this way, the human detection unit 3 in FIG. 3, which performs the same function as the human detection unit 3 shown in FIG. 1, does not need to perform any operation and need not be activated.
[0050]
On the other hand, if the sound source feature comparison unit 23 finds that the features of the voice (sound source) sent from the sound source feature extraction unit 21 match none of the feature data stored in the first storage memory 22, the human detection unit 3 in FIG. 3, which performs the same function as the human detection unit 3 shown in FIG. 1, is activated and detects persons in the video data from the video decoder 10. That is, the human detection unit 3 determines whether an image of a person exists at the position in the voice direction (sound source direction) obtained via the position calculation unit 31, and thereby determines whether the sound (sound source) is a speech by a participant of the video conference.
[0051]
When the human detection unit 3 determines that no person is present at the position in the voice direction (sound source direction), the sound is regarded as noise unrelated to the video conference, and no processing is performed. When it determines that a person is present at that position, the speaker is regarded as having moved: the image position of the participant calculated through the position calculation unit 31 is newly registered in the second storage memory 32, and the feature data of the uttered voice (sound source) is re-registered in the first storage memory 22.
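The decision flow of FIG. 3 described above can be sketched as follows. This is only an illustration under stated assumptions: voice features are modeled as plain lists of floats compared by Euclidean distance with a hypothetical `threshold`, the two storage memories are modeled as one list of records, and returning a marker position right after re-registration is an assumption, since the description does not spell out that step.

```python
from typing import List, Optional, Tuple

Position = Tuple[int, int]
Feature = List[float]


class SpeakerRegistry:
    """Pairs of (voice feature, image position); stands in for memories 22 and 32."""

    def __init__(self) -> None:
        self.records: List[Tuple[Feature, Position]] = []

    def register(self, feature: Feature, position: Position) -> None:
        self.records.append((feature, position))

    def match_voice(self, feature: Feature, threshold: float = 0.1) -> Optional[Position]:
        """Return the stored position of the first voice within `threshold`, if any."""
        for stored, position in self.records:
            distance = sum((a - b) ** 2 for a, b in zip(stored, feature)) ** 0.5
            if distance <= threshold:
                return position
        return None


def handle_sound(registry: SpeakerRegistry, feature: Feature,
                 sound_position: Position, person_present: bool) -> Optional[Position]:
    """Return the marker position for this sound, or None when it is treated as noise."""
    known = registry.match_voice(feature)
    if known is not None:
        return known  # voice matches a participant: person detection is not needed
    if not person_present:
        return None  # unknown voice and nobody there: noise, no processing
    # Unknown voice but a person is present: register position and voice feature.
    registry.register(feature, sound_position)
    return sound_position  # assumption: the newly registered person is marked
```

Here `person_present` stands for the result of the human detection unit at the sound source position, which is only consulted when the voice matches no stored feature.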
[0052]
In the prior art, when a new sound occurs, a marker (mark) indicating the speaker is displayed on the monitor screen, or the camera device is turned so that the speaker who produced the sound fits within the monitor screen. However, even when some noise other than a speaker's voice occurs, the marker indicating the speaker is displayed on the monitor screen, or the camera device is turned toward the direction of the noise.
[0053]
In the video conference system provided with the speaker identification means according to the present invention, when the human detection unit 3 does not detect a person, a marker indicating the speaker is not newly displayed. In addition, a return command is sent to the camera control unit 8 shown in FIG. 1 so that the position and zooming state of the camera device 9 return to their original states. That is, although not illustrated, the position information and zooming information from before the camera control unit 8 performed the turn control are stored in the second storage memory 32 shown in FIG. 3, so that even after the camera device 9 has been turned toward the sound source, a return command restoring its original position and zooming state can be sent. Thus, even if the camera device 9 turns in response to a sound other than a participant's speech, the position of the marker (mark) indicating the speaker and the position of the camera device 9 can be returned to their original positions.
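The save-and-return behavior of the camera described above can be sketched as follows. The class and method names are hypothetical, and the camera state is reduced to a pan angle and a zoom factor purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CameraState:
    pan_degrees: float
    zoom: float


class CameraController:
    """Remembers the state before a turn so that a return command can restore it."""

    def __init__(self, state: CameraState) -> None:
        self.state = state
        self._saved: Optional[CameraState] = None

    def turn_to(self, pan_degrees: float, zoom: float) -> None:
        """Turn toward a sound source, saving the previous position and zoom."""
        self._saved = self.state
        self.state = CameraState(pan_degrees, zoom)

    def confirm(self) -> None:
        """A speaker was identified: keep the new state and drop the saved one."""
        self._saved = None

    def restore(self) -> None:
        """No speaker was identified: return to the saved position and zoom."""
        if self._saved is not None:
            self.state = self._saved
            self._saved = None
```

A caller would invoke `turn_to` when a sound source is located, then `confirm` if a person is detected there or `restore` if not.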
[0054]
In addition, the second storage memory 32 may further include face feature storage means that stores the facial features of each person, relating to the contour of the face and the contours of the eyes, nose, and mouth, as extracted by the face contour extraction means 3b of the human detection unit 3. In this case, the human detection unit 3 includes face feature comparison means that compares the facial features stored in the face feature storage means with facial features newly extracted from the video data by the face contour extraction means 3b. When the sound source feature comparison unit 23 detects that the features of a newly input sound source match the feature data (sound source features) stored in the first storage memory 22, and the face feature comparison means detects that the stored facial features match the newly extracted facial features, the image position of the person stored in the second storage memory 32 is read, and a marker indicating the speaker is displayed around the image of the person at that image position. It is also possible to turn the camera device 9 toward the position of the person, or to zoom in and display an enlarged image.
[0055]
Here, the face feature comparison means compares the face feature stored in the face feature storage means with the face feature newly extracted by the face contour extraction means 3b, and detects that they do not match. In this case, the marker marking display indicating the speaker, the position of the camera device 9 and the zooming state are left in the original state.
[0056]
Furthermore, the direction detection unit 2 shown in FIG. 1 can be configured as shown in FIG. FIG. 4 is a block diagram showing still another example of the configuration of the direction detecting unit constituting the video conference system provided with the speaker identification device according to the present invention.
That is, as shown in FIG. 4, in addition to the direction detection unit 2 shown in FIG. 1, the direction detection control unit 2 ″ further includes a voice level detection unit 24 for detecting a voice (sound source) exceeding a predetermined voice level. And a timer unit 25 that provides time detection means for detecting that the utterance voice exceeding the predetermined voice level continues for a preset predetermined duration or more.
[0057]
In FIG. 4, the input uttered voice is converted into a voice signal by the voice input unit 1 and input to the direction detection control unit 2 ″, where the voice level detection unit 24 measures its level. When the uttered voice is detected to be a voice signal at or above the predetermined voice level, and is thus treated as possible speech from a participant, the timer of the timer unit 25 is started and begins counting the elapsed time. When the preset predetermined duration has elapsed, the input voice (sound source) has continued for the predetermined duration or longer, and the timer unit 25 sends an output signal to the direction detection unit 2.
[0058]
If the speech from the participant ends before the preset predetermined duration elapses and the voice level detection unit 24 no longer detects a voice exceeding the predetermined voice level, the timer unit 25 stops counting and generates no output signal.
Only when the output signal from the timer unit 25 is input does the direction detection unit 2, which performs the same function as in FIG. 1, start the operation of determining whether the input voice (sound source) is a voice signal based on a speech by a participant of the video conference.
[0059]
For example, a short sound such as a sneeze is not a speech, so it is pointless to turn the camera device 9 shown in FIG. 1 toward the position of such a sound or to display a speaker marker at that position on the monitor screen. In this embodiment, the timer unit 25 produces no output signal for such a short sound, so the direction detection unit 2 does not start the operation of detecting position information indicating the voice direction (sound source direction), and neither the control operations of the camera device 9, such as turning, nor the human detection operation of the human detection unit 3 is started.
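The combination of the voice level detection unit 24 and the timer unit 25 amounts to a level-plus-duration gate, which can be sketched as follows. This is a simplified illustration: time is counted in audio frames rather than by a hardware timer, and the threshold and frame count used in the example are arbitrary assumptions.

```python
class DurationGate:
    """Signals only after the level stays at or above a threshold for long enough."""

    def __init__(self, level_threshold: float, min_frames: int) -> None:
        self.level_threshold = level_threshold
        self.min_frames = min_frames
        self._count = 0  # consecutive frames at or above the threshold

    def update(self, level: float) -> bool:
        """Feed one frame's level; return True when direction detection may start."""
        if level >= self.level_threshold:
            self._count += 1
        else:
            self._count = 0  # sound ended early: the timer stops and resets
        return self._count >= self.min_frames
```

With `DurationGate(0.5, 3)`, a two-frame burst such as a sneeze never produces an output, while a sustained voice produces one from its third consecutive frame onward.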
[0060]
FIG. 4 shows a configuration in which, for such a short sound, the direction detection unit 2 is not activated and, as a result, the human detection unit 3 is not activated either. It is also possible to supply the output signal from the timer unit 25 to the human detection unit 3 as well, so that for such a short sound the human detection unit 3 is kept inactive directly, without going through the direction detection unit 2.
[0061]
As described above, in FIG. 4 the timer unit 25 is started when there is an uttered voice at or above the predetermined voice level. Alternatively, the condition may be that some identifiable sound continues for the predetermined duration or longer, regardless of whether it is at or above the predetermined voice level.
[0062]
Further, as described above, the human detection unit 3 shown in FIG. 3 not only detects the presence or absence of a person at the position on the video data indicated by the voice direction (sound source direction) from the direction detection unit 2; it can also detect the image positions of all the participants before the video conference and register them in the second storage memory 32 shown in FIG. 3. This makes it possible, for example, as shown in FIG. 5, to display markers for both the speaker and the non-speakers on the monitor screen, distinguished by color or shape.
[0063]
FIG. 5 is a conceptual diagram showing an example of a monitor screen display in which markers are displayed for both the speaker and the non-speakers among the video conference participants. In the example of FIG. 5, the speaker marker 103, which indicates that the speaker 105 is speaking, is displayed in a different color from the non-speaker marker 104, which indicates the non-speaker 106.
[0064]
That is, when the speaker 105 among the video conference participants speaks, the direction detection unit 2 detects direction information indicating the sound source direction of the speaker 105. This is compared with the image positions stored in advance in the second storage memory 32 to confirm that a person is present in the sound source direction, and the speaker 105 is thereby identified and displayed with the speaker marker 103, whose color differs from that of the non-speaker marker 104 indicating the non-speaker 106.
[0065]
Thus, even when all the participants of the video conference are displayed on the monitor screen, the speaker marker 103 and the non-speaker marker 104 can be displayed in different colors around the image of each participant, so that it is easy to tell who is speaking. The markers are not limited to distinguishing speakers from non-speakers. For example, color, shape, and/or pattern may be used to distinguish the moderator, general participants, and observers of the video conference, and any other identification information useful for conducting the video conference can likewise be provided by markers superimposed on the monitor screen.
[0066]
In addition, the human detection control unit 3 ′, which includes the human detection unit 3, need not perform person detection only when a voice is input; it can also have the human detection unit 3 perform person detection periodically at all times. In that case, the time at which a person is first detected at each image position in the video data from the video decoder 10 is read from a clock circuit (not shown) and stored, as the time at which that person joined the video conference, in the detection time storage unit of the second storage memory 32 shown in FIG. 3. The detection time read from the detection time storage unit is superimposed on the screen by the superimpose generation circuit 4, so that the time at which the person joined the video conference can be displayed on the monitor screen as a result of person detection.
[0067]
Alternatively, the human detection control unit 3 ′ may include one or more second storage memories 32 capable of storing the image positions of one or more persons detected from the video data from the video decoder 10, that is, from the input image, and may further include position detection means. Each time all the persons in the input image are detected periodically and the image position of each detected person is obtained, the position detection means compares each detected image position with each of the image positions stored in the second storage memory 32. If any detected image position matches none of the stored image positions, the person at that detected image position is regarded as having newly joined the video conference. The time at which the human detection unit 3 newly detected that person is stored as the participation time in the participation time storage unit of the second storage memory 32 shown in FIG. 3, and the stored participation time may be displayed on the monitor screen.
[0068]
For example, FIG. 6 is a conceptual diagram showing an example of a monitor screen display showing the participation status of a video conference. In FIG. 6A, reference numerals 109 and 110 denote participants who joined the video conference at 9:00. The detection time of each person, that is, the participation time, is superimposed on the video data, so that the participation time displays 107 and 108 appear around the images of the respective participants on the monitor screen.
[0069]
FIG. 6B shows the participation status of the video conference some time after FIG. 6A. The participation time display 111 has been newly added to the monitor screen, showing that the participant 112 joined the video conference at 9:30.
Because the participation time is stored in the detection time storage unit or the participation time storage unit of the second storage memory 32 shown in FIG. 3, each person's participation time can be displayed continuously on the monitor screen until the video conference ends.
[0070]
Further, the human detection control unit 3 ′ may include one or more second storage memories 32 capable of storing the image positions of one or more persons detected from the video data from the video decoder 10, that is, from the input image, and may further include position detection means. Each time all the persons in the input image are detected periodically at predetermined intervals and the image position of each detected person is obtained, the position detection means compares each detected image position with each of the image positions stored in the second storage memory 32. If any stored image position matches none of the detected image positions, the person at that stored image position is regarded as having left. The time at which the human detection unit 3 no longer detected that person is stored as the leaving time in the leaving time storage unit of the second storage memory 32 shown in FIG. 3, and the stored leaving time can be displayed on the monitor screen.
[0071]
That is, when the human detection unit 3 no longer detects a person at an image position, that is, in an image area, where a person had been detected until then, the time at which the person was no longer detected is read from a clock circuit (not shown) and superimposed on the monitor screen by the superimpose generation circuit 4.
Thus, the time at which the person was no longer detected can be displayed on the monitor screen, around the image position of that participant, as the time at which the participant left.
Further, because the leaving time is stored in the leaving time storage unit of the second storage memory 32 shown in FIG. 3, the person's leaving time can be displayed continuously on the monitor screen until the video conference ends.
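The periodic comparison of detected and stored image positions that yields participation and leaving times can be sketched as follows. This is a rough illustration: positions are matched within a hypothetical pixel `tolerance`, times are passed in as strings instead of being read from a clock circuit, and the class name is an assumption.

```python
from typing import Dict, List, Tuple

Position = Tuple[int, int]


def same_place(a: Position, b: Position, tolerance: int = 30) -> bool:
    """Treat two image positions as the same person if they are close enough."""
    return abs(a[0] - b[0]) <= tolerance and abs(a[1] - b[1]) <= tolerance


class AttendanceLog:
    """Tracks who is present and records participation and leaving times."""

    def __init__(self) -> None:
        self.present: List[Position] = []      # stored image positions
        self.joined: Dict[Position, str] = {}  # participation times
        self.left: Dict[Position, str] = {}    # leaving times

    def update(self, detected: List[Position], now: str) -> None:
        """Compare one periodic detection result against the stored positions."""
        # A stored position that matches no detection: that person has left.
        for pos in list(self.present):
            if not any(same_place(pos, d) for d in detected):
                self.present.remove(pos)
                self.left[pos] = now
        # A detection that matches no stored position: a new participant.
        for d in detected:
            if not any(same_place(d, pos) for pos in self.present):
                self.present.append(d)
                self.joined[d] = now
```

Calling `update` once per detection interval keeps `joined` and `left` ready to be superimposed next to each participant's image position.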
[0072]
Further, when the human detection unit 3 detects the speaker who produced a voice, based on the position information derived from the voice direction (sound source direction) and the video data derived from the video information from the camera device 9, the time can be read from a clock circuit (not shown) and superimposed on the screen by the superimpose generation circuit 4. That is, the direction detection control unit 2 ′ stores the time at which the sound source of the input sound was detected, as the sound source detection time, in the sound source detection storage unit of the first storage memory 22 shown in FIG. 3, and the stored sound source detection time can be displayed around the image position of each participant.
Thus, the time at which each participant spoke can be displayed on the monitor screen.
[0073]
Further, as described above, in addition to the person detection function based on the color and contour of a person's face, the human detection unit 3 can further include lip detection means 3c for detecting that the lips of a person are moving.
[0074]
In such a case, when an uttered voice is input to the voice input unit 1, the person detection function of the human detection unit 3 first detects a person at the position in the voice direction (sound source direction). The lip detecting means 3c then detects whether the lips of the detected person are moving. If the lips are detected to be moving, the input voice is determined to be an utterance by the person in the voice direction (sound source direction), and the camera device 9 is turned and zoomed and a marker indicating that person is displayed. If, on the other hand, the lip detecting means 3c detects that the lips of the detected person are not moving, the input voice is determined not to be an utterance by that person, so the camera device 9 is neither turned nor zoomed and no marker indicating the speaker is displayed.
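The two-stage check described above, person detection followed by lip movement detection, can be sketched as follows. The description does not say how lip movement is measured, so the approach here is purely an assumption: a hypothetical series of mouth-opening measurements over recent frames, with movement declared when their spread reaches `min_change`.

```python
from typing import Sequence


def lips_moving(mouth_openings: Sequence[float], min_change: float = 2.0) -> bool:
    """Assumed test: the mouth opening varies by at least min_change across frames."""
    if len(mouth_openings) < 2:
        return False  # not enough frames to observe any movement
    return max(mouth_openings) - min(mouth_openings) >= min_change


def should_mark_speaker(person_detected: bool, mouth_openings: Sequence[float]) -> bool:
    """Mark (and turn or zoom the camera) only if a person is there and lips move."""
    return person_detected and lips_moving(mouth_openings)
```

When `should_mark_speaker` returns False, the camera and the marker would simply be left as they are.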
[0075]
Furthermore, as shown in FIG. 3, the direction detection control unit 2 ′ may include the first storage memory 22, which stores the sound source features extracted by the sound source feature extraction unit 21, and the sound source feature comparison unit 23, which compares the features of a newly input sound source with the sound source features stored in the first storage memory 22, while the human detection control unit 3 ′ includes the second storage memory 32, which stores the image positions of detected persons, and further includes the lip detection means 3c for detecting that the lips of a detected person are moving. In that case, only when the sound source feature comparison unit 23 detects that the features of the newly input sound source match the sound source features stored in the first storage memory 22, and the lip detection means 3c detects that the lips of the person in the direction for which the match was detected are moving, is the image position of the person stored in the second storage memory 32 read. A marker indicating the speaker can then be displayed around the image of the person at that image position, and the camera device 9 can be directed toward the speaker or zoomed.
[0076]
In such a case, if the lip detection means 3c does not detect that the lips of the person in the direction for which the sound source features match are moving, the utterance can be regarded as having lasted only a very short time, and the position at which the marker indicating the speaker is displayed and the orientation of the camera device 9 can be left in their original state.
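The combined check on stored sound source features and lip movement might be sketched as below. Several details are assumptions made only for illustration: features are modeled as lists of floats compared element-wise within a tolerance, both memories are modeled as dictionaries keyed by sound source direction, and the names `features_match` and `update_marker` are hypothetical. The branch in which a non-matching feature is newly registered is omitted.

```python
from typing import Dict, Optional, Sequence, Tuple

Position = Tuple[int, int]  # assumed (x, y) image position on the monitor


def features_match(a: Sequence[float], b: Sequence[float],
                   tolerance: float = 0.1) -> bool:
    """Assumed comparison rule: same length and element-wise close."""
    return len(a) == len(b) and all(abs(x - y) <= tolerance for x, y in zip(a, b))


def update_marker(direction: int,
                  new_feature: Sequence[float],
                  lips_moving: bool,
                  first_memory: Dict[int, Sequence[float]],
                  second_memory: Dict[int, Position],
                  current_marker: Optional[Position]) -> Optional[Position]:
    """Return the marker position after one voice input.

    first_memory stands in for first storage memory 22 (sound features),
    second_memory for second storage memory 32 (person image positions).
    """
    stored = first_memory.get(direction)
    if stored is None or not features_match(stored, new_feature):
        return current_marker  # no match: leave marker as it was
    if not lips_moving:
        return current_marker  # match but no lip movement: leave as it was
    # Match and lip movement: read the stored image position.
    return second_memory.get(direction, current_marker)
```

Keeping the previous marker position in every non-confirming branch mirrors the text's choice to leave the display and camera in their original state.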
[0077]
[Effects of the invention]
As described above, according to the speaker identification device of the video conference system of the present invention and the video conference system including that device, sounds other than a participant's speech, for example a sneeze, the sound of a dropped object, or an object striking a microphone, do not cause the camera device to turn toward the source of the sound or cause the marker indicating the speaker to be displayed by mistake. The participant who is speaking can therefore always be displayed accurately on the monitor screen, which makes it possible to provide a video conference system in which a video conference can be conducted comfortably.
[0078]
Furthermore, even when all participants in the video conference are displayed on the monitor screen, markers that distinguish speakers from non-speakers can be superimposed on the screen, as can visually distinct markers that identify, for example, the moderator of the video conference, general participants, and observers. This makes it possible to provide a video conference system in which a video conference can proceed smoothly.
[0079]
Furthermore, it is possible to provide a video conference system capable of displaying how the video conference has progressed over time, such as the participation time at which a participant joined the conference, the leaving time at which a participant left the conference, and the speech time at which a participant spoke.
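One way the participation and leaving times could be recorded is by comparing the person image positions found in each periodic detection pass with the positions stored from the previous pass. The sketch below is illustrative only: it assumes positions match by exact equality (a practical system would likely need a tolerance), and the function name `update_attendance` and the dictionary-based time stores are hypothetical.

```python
from typing import Dict, List, Tuple

Position = Tuple[int, int]  # assumed (x, y) image position of a person


def update_attendance(stored: List[Position],
                      detected: List[Position],
                      now: float,
                      joined: Dict[Position, float],
                      left: Dict[Position, float]) -> List[Position]:
    """Compare one detection pass against the stored positions.

    joined and left stand in for the participation time and leaving time
    stores; they are updated in place. Returns the positions to store
    for the next pass.
    """
    detected_set = set(detected)
    stored_set = set(stored)
    # A stored position with no detected counterpart: the person has left.
    for pos in stored:
        if pos not in detected_set:
            left[pos] = now
    # A detected position with no stored counterpart: a new participant.
    for pos in detected:
        if pos not in stored_set:
            joined[pos] = now
    return list(detected)
```

Calling this function once per detection interval yields the times that the monitor display would then show next to each participant.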
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating an example of a configuration of a video conference system including a speaker identification device according to the present invention.
FIG. 2 is a conceptual diagram illustrating an example of an image displayed on a monitor screen in a video conference system including a speaker identification device according to the present invention.
FIG. 3 is a block configuration diagram showing another configuration example of a direction detection unit and a human detection unit constituting a video conference system including a speaker identification device according to the present invention.
FIG. 4 is a block diagram showing still another example of the configuration of the direction detecting unit constituting the video conference system including the speaker identification device according to the present invention.
FIG. 5 is a conceptual diagram showing an example of a monitor screen display in the case where marker display is performed for speakers and non-speakers among video conference participants.
FIG. 6 is a conceptual diagram showing an example of a monitor screen display showing a participation situation of a certain video conference.
[Explanation of symbols]
1 ... Voice input unit, 2 ... Direction detection unit, 2′, 2″ ... Direction detection control unit, 3 ... Human detection unit, 3′ ... Human detection control unit, 3a ... Face extraction means, 3b ... Face contour extraction means, 3c ... Lip detection means, 4 ... Superimpose generation circuit, 5 ... Image codec, 6 ... CGROM, 7 ... Audio codec, 8 ... Camera control unit, 9 ... Camera device, 10 ... Video decoder, 11 ... Multiplexing circuit unit, 12 ... Communication line, 13 ... Storage memory, 21 ... Sound source feature extraction unit, 22 ... First storage memory, 23 ... Sound source feature comparison unit, 24 ... Audio level detection unit, 25 ... Timer unit, 31 ... Position calculation unit, 32 ... Second storage memory, 100 ... Marker, 101 ... Speaker, 102 ... Participant, 103 ... Speaker marker, 104 ... Non-speaker marker, 105 ... Speaker, 106 ... Non-speaker, 107, 108 ... Participation time display, 109, 110 ... Participants, 111 ... Participation time display, 112 ... Participants.

Claims

A speaker identification device of a video conference system for identifying a speaker in a video conference system that inputs, transmits, and receives images and sound, the device comprising: direction detection means that detects the direction of a sound source from input speech and that includes first storage means for storing features of the sound source and sound source feature comparison means for comparing features of a newly input sound source with the features of the sound source stored in the first storage means; human detection means that detects a person from an input image and that includes second storage means for storing, when a person is detected by the human detection means in the sound source direction obtained by the direction detection means, the image position of the detected person; and marking means for displaying a predetermined marker around the image of the speaker on the displayed monitor screen, wherein, when the sound source feature comparison means detects that the features of the newly input sound source match the features of the sound source stored in the first storage means, the image position of the person stored in the second storage means is read out, the person is treated as a speaker in the video conference, and the marking means marks and displays the marker indicating the speaker.

The sound source feature comparison means reads out, from among the features of the sound source stored in the first storage means, the features of the sound source in the direction of the newly input sound source. When the sound source feature comparison means detects that these do not match the features of the newly input sound source, and a person is detected by the human detection means in the sound source direction obtained by the direction detection means, the features of the newly input sound source are stored anew in the first storage means and the image position of the person detected by the human detection means is stored in the second storage means. The speaker identification device for a video conference system according to claim 1.

The sound source feature comparison means reads out, from among the features of the sound source stored in the first storage means, the features of the sound source in the direction of the newly input sound source. When the sound source feature comparison means detects that these do not match the features of the newly input sound source, and no person is detected by the human detection means in the sound source direction obtained by the direction detection means, the marking display of the marker indicating the speaker by the marking means is left in its original state. The speaker identification device for a video conference system according to claim 1 or 2.

The second storage means further includes face feature storage means for storing face features that are extracted from the input image by face contour extraction means included in the human detection means and that relate to the contour of the person's face and/or the contours of the eyes, nose, or mouth, and the human detection means further includes face feature comparison means for comparing the face features stored in the face feature storage means with face features newly extracted from the input image by the face contour extraction means. When the sound source feature comparison means detects that the features of the newly input sound source match the features of the sound source stored in the first storage means, and the face feature comparison means further compares the face features stored in the face feature storage means with the face features newly extracted by the face contour extraction means and detects that they match, the image position of the person stored in the second storage means is read out, and the marking means marks and displays the marker indicating the speaker around the person image displayed on the monitor screen. The speaker identification device for a video conference system according to any one of the preceding claims.

When the face feature comparison means compares the face features stored in the face feature storage means with the face features newly extracted by the face contour extraction means of the human detection means and detects that they do not match, the marking display of the marker indicating the speaker by the marking means is left in its original state. The speaker identification device for a video conference system according to claim 4.

The human detection means includes lip detection means for detecting that the lips of the detected person are moving. When the sound source feature comparison means detects that the features of the newly input sound source match the features of the sound source stored in the first storage means, and the lip detection means detects that the lips of the person in the direction for which the features of the sound source match are moving, the image position of the person stored in the second storage means is read out, and the marking means marks and displays the marker indicating the speaker around the person image displayed on the monitor screen. The speaker identification device for a video conference system according to any one of claims 1 to 5.

When the lip detection means does not detect that the lips of the person in the direction for which the features of the sound source match are moving, the marking display of the marker indicating the speaker by the marking means is left in its original state. The speaker identification device for a video conference system according to claim 6.

The human detection means detects persons in the input image, and the marking means marks and displays, around the person image displayed on the monitor screen for each person detected by the human detection means, markers differentiated by color and/or by shape. The speaker identification device for a video conference system according to any one of claims 1 to 7.

The human detection means further includes detection time storage means for storing, when the human detection means detects a person in the input image, the time of detection, and the detection time stored in the detection time storage means is displayed on the monitor screen as the participation time of the detected person. The speaker identification device for a video conference system according to any one of claims 1 to 8.

The human detection means includes storage means capable of storing the image positions of one or more persons detected from the input image, and position comparison means that, each time the human detection means detects all persons in the input image at a preset fixed interval and obtains the image position of each detected person, compares each detected person image position with the person image positions stored in the storage means. When the position comparison means finds that any person image position stored in the storage means matches none of the detected person image positions, the person at that stored image position is regarded as having left, and leaving time storage means stores, as the leaving time, the time at which the person was no longer detected. When the position comparison means finds that any detected person image position matches none of the person image positions stored in the storage means, the person at that detected image position is regarded as having newly joined, and participation time storage means stores, as the participation time, the time at which the person newly joined. The leaving time stored in the leaving time storage means and the participation time stored in the participation time storage means are displayed on the monitor screen. The speaker identification device for a video conference system according to any one of claims 1 to 8.

The direction detection means further includes sound source detection storage means for storing, as the sound source detection time, the time at which the sound source of the input sound is detected, and the sound source detection time stored in the sound source detection storage means is displayed on the monitor screen. The speaker identification device for a video conference system according to any one of claims 1 to 10.

A video conference system comprising the speaker identification device for a video conference system according to any one of claims 1 to 11.