JP4585380B2

JP4585380B2 - Next speaker detection method, apparatus, and program

Info

Publication number: JP4585380B2
Application number: JP2005164119A
Authority: JP
Inventors: 篤信木村; 彰中山
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2005-06-03
Filing date: 2005-06-03
Publication date: 2010-11-24
Anticipated expiration: 2025-06-03
Also published as: JP2006338493A

Description

The present invention relates to a method and apparatus that, in a conference system in which a plurality of users meet either in the same space or in remote spaces connected for voice communication over a network, detect and explicitly indicate the user who wishes to speak next while another user is speaking.

One communication-support technique for assisting hearing-impaired users in an immersive virtual shared environment presents the other party's speech acts as text images within the user's field of view (Non-Patent Document 1, Patent Document 1).

In addition, as a technique for identifying the speaker in remote conferences and the like, displaying a speech balloon on the video has been proposed (Non-Patent Document 2).
Patent Document 1: JP 2001-228794 A
Non-Patent Document 1: "Development of a conversation support interface for hearing-impaired people in an immersive virtual shared environment", IEICE Technical Report OIS2004-22
Non-Patent Document 2: "Proposal of a speech display method in video conferencing", IEICE Technical Report MVE2001-137

These techniques re-present and reuse past utterances or conversation content that has already been entered; the processing that presents text in a speech balloon or the like is performed only after the text data has been input or read from a database.

An object of the present invention is to provide a next speaker specifying method and apparatus that resolve the frequent collisions at the start of speaking that arise in conversation, particularly in remote communication, because the right to speak next is not made explicit.

According to the present invention, the next speaker specifying method is used in a conference system in which a plurality of users, located at respective sites each having an imaging device that images the users and a presentation device that presents the video captured by the imaging devices to the users, hold a conference via a network. The method detects and explicitly indicates the user who wishes to speak next while one of the users is speaking, and comprises:
a user position detection step of detecting, from the video of the imaging device of each site (including the own site) presented on the presentation device, the user position of each user on the video and the three-dimensional position of each user's head;
a user gaze direction detection step of detecting, from the video of the imaging device of each site (including the own site) presented on the presentation device, the gaze direction of each user on the video;
a gaze target detection step of obtaining, from the three-dimensional position of the user's head, the gaze direction, and the three-dimensional position, tilt, and size of the presentation surface of the presentation device in a coordinate system centered on the imaging device, the coordinate position on the presentation surface at which the gaze direction vector extended from the three-dimensional head position intersects the surface, and detecting, from that coordinate position and the user position of each user, which user each user is looking at;
a next speaker determination step of performing a voting process over the users who are receiving gazes and determining a user who receives the gaze of at least a predetermined proportion of the users to be the next speaker; and
an effect presentation step of presenting, on the video on the presentation device of the user determined to be the next speaker, an effect that explicitly indicates that this user holds the right to speak next.

The present invention automatically detects the user who holds the right to speak next, based on the gaze of each user participating in the conference, and presents an effect indicating that right, thereby making the user who holds the right to speak next explicit to all participants in the conversation.

Because the next speaker in the conference is automatically detected and made explicit, each user can see which user the participants want to hear from next. Collisions at the start of speaking become less frequent, and conversation in the conference proceeds more smoothly.

The invention also has the effect of prompting a speaker whose remarks do not interest the participating users to yield the floor and, when no one is speaking, of encouraging an utterance by making the right to speak next explicit for one of the users.

These effects are particularly pronounced in remote communication.

Next, embodiments of the present invention will be described with reference to the drawings.

[First Embodiment]
FIG. 1 is a configuration diagram of a next speaker specifying device according to a first embodiment of the present invention. For simplicity, an example is shown in which two next speaker specifying devices 1 and 2 are connected via a network 3.

The next speaker specifying device 1 includes a communication device 11, a sound reproduction device 12, a sound collection device 13, a presentation device 14, an imaging device 15, and a next speaker determination device 16. The next speaker specifying device 2 has the same configuration and includes a communication device 21, a sound reproduction device 22, a sound collection device 23, a presentation device 24, an imaging device 25, and a next speaker determination device 26.

The sound reproduction device 12 reproduces the sound that was collected by the sound collection device 23 of the next speaker specifying device 2 and received by the communication device 11 via the network 3, and presents it to the user 301. The sound collection device 13 collects sound around the presentation device 14 and transmits it to the next speaker specifying device 2 via the communication device 11 and the network 3. The presentation device 14 presents the video captured by the imaging device 25 of the next speaker specifying device 2 and received by the communication device 11 via the network 3, the video captured by the imaging device 15, and visual effects. The imaging device 15 captures the area around the presentation device 14 and transmits the video from the communication device 11 to the next speaker specifying device 2 via the network 3; it is installed facing the same direction as the video presentation direction of the presentation device 14. The next speaker determination device 16 determines the next speaker.

FIG. 2 shows the configuration of the next speaker determination device 16. It includes a user position detection unit 101, a gaze direction detection unit 102, a gaze target detection unit 103, a next speaker determination unit 104, an effect presentation unit 105, an utterance voice detection unit 106, an utterance video detection unit 107, and an effect erasing unit 108. Although not shown, the next speaker determination device 26 has the same configuration as the next speaker determination device 16.

Based on the video from the imaging device 25 of the next speaker specifying device 2 and the video from the imaging device 15 of the own device 1, both presented on the presentation device 14, the user position detection unit 101 detects, for each user near the next speaker specifying device 2 and the own device 1, the user position, defined as the centroid of that user's region in the acquired video. In addition, using calibration data acquired in advance for estimating the three-dimensional head position that corresponds to a user region seen by each imaging device, it estimates the three-dimensional position of each user's head from the user regions in the two videos. Alternatively, two or more imaging systems such as cameras may be used for each imaging device, and the three-dimensional head position may be detected from stereo images.

In the video from the imaging device 25 of the next speaker specifying device 2 and the video from the imaging device 15 of the own device 1, both presented on the presentation device 14, the gaze direction detection unit 102 detects each user's pupil. Treating the user's eye as a sphere, it determines where on the sphere the pupil detected in the video lies, obtains the vector from the center of the sphere through the pupil position, and takes that vector as the user's gaze direction on the video. A more accurate gaze direction may instead be detected by having each user wear an imaging device for the eyeball.

The gaze target detection unit 103 uses the three-dimensional head position detected by the user position detection unit 101, the gaze direction on the video detected by the gaze direction detection unit 102, and the three-dimensional position, tilt, and size of the presentation surface of the presentation device 14 in a coordinate system centered on the imaging device 15. It detects whether the gaze direction vector, extended from the three-dimensional head position, intersects the presentation surface of the presentation device 14. If it does, the unit obtains the coordinate position of the intersection on the presentation surface and, from that coordinate position and the user positions detected by the user position detection unit 101, determines which user's video (including the user's own) the user is looking at.

Based on the result of the gaze target detection unit 103, the next speaker determination unit 104 considers the users in the conversation other than those who are looking at the current speaker. If some user receives the gaze of a majority of these users, that user is determined to be the next speaker.

For the user whom the next speaker determination unit 104 has determined to hold the right to speak next, the effect presentation unit 105 presents a balloon effect on that user's video on the presentation device 14, superimposed at the user position obtained by the user position detection unit 101, to indicate explicitly that the user holds the right to speak next.

The utterance voice detection unit 106 detects the presence or absence of an utterance from sound, using either the sound obtained at the sound reproduction device 12 via the communication device 11 or the sound obtained at the sound collection device 13 without passing through the communication device 11.

The utterance video detection unit 107 identifies and detects users who may be speaking from video, using either the video obtained at the presentation device 14 via the communication device 11 or the video obtained at the imaging device 15 without passing through the communication device 11.

When the utterance voice detection unit 106 detects an utterance lasting at least a predetermined time and the utterance video detection unit 107 detects a user who may be speaking at the same site, the effect erasing unit 108 erases the balloon effect superimposed by the effect presentation unit 105, provided that the detected user is someone other than the current speaker (the next speaker included). The requirement of "at least a predetermined time" allows for interjections and noise.

The user positions, gaze directions, gaze targets, next speaker, and other values detected by the units of the next speaker determination device 16 are stored in a storage unit (not shown).

FIG. 3 is a flowchart of the next speaker determination process in this embodiment, FIG. 4 shows an example of user position detection and effect presentation in this embodiment, and FIG. 5 shows an example of gaze direction detection and gaze position detection in this embodiment.

Next, the flow of the next speaker determination process will be described, focusing on the user 401 of the other device and the user 301 of the own device.

Based on the video from the imaging device 25 of the next speaker specifying device 2 and the video from the imaging device 15 of the own device 1, both presented on the presentation device 14, the user position detection unit 101 detects, for each user near the next speaker specifying device 2 and the own device 1, the user position 402, defined as the centroid of that user's region in the acquired video (step 201). In addition, using calibration data acquired in advance for estimating the three-dimensional head position that corresponds to a user region seen by each imaging device, it estimates the three-dimensional position of each user's head from the user regions in the two videos. FIG. 4 shows an example of detecting the user position 402 of the user 401 at the other site. Moving objects in the video are detected by computing the difference between successive frames, the region of the moving object regarded as a user is extracted, and the centroid of that region on the video is computed. This gives the center position of the moving object regarded as the user, which is taken as the user position 402 of the user 401 (reference: "Digital Video Processing", supervised by Nobuyuki Yagi, edited by the Institute of Image Information and Television Engineers, Ohmsha).

An example of estimating the three-dimensional head position of the user 401 at the other site is as follows. Data acquired in advance with the imaging device 25 at the other site, which associates the size and position of the region of the user 401 as seen by the imaging device 25 with the three-dimensional position of the user's head at that moment, is stored as calibration data for estimating the three-dimensional head position. From the region of the moving object regarded as the user 401, obtained when the user position 402 was detected, the calibration entry whose user region is the closest match is selected. The three-dimensional head position associated with that entry is taken as the estimated three-dimensional position of the user corresponding to the user position 402. The resolution of the three-dimensional head position data depends on the granularity with which the calibration data is created. No general rule can be given because the result is strongly affected by where the imaging device 25 is installed, but the finer the granularity, the more accurately the three-dimensional head position can be estimated. In practice, the cost of creating the data can be reduced by storing calibration data only for positions of the user 401 within the range in which the next speaker specifying device 2 is actually used.
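The position detection and calibration lookup described above can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions: frames are small grayscale grids given as nested lists, every changed pixel is treated as belonging to a single user, and the names `moving_region`, `user_position`, `estimate_head_position`, the threshold value, and the mismatch measure are all hypothetical rather than taken from the patent.

```python
from math import hypot


def moving_region(prev_frame, curr_frame, threshold=30):
    """Return the (x, y) pixels whose grayscale value changed by more than threshold.

    The threshold is an assumed value; the source only says that the
    inter-frame difference is used to find moving objects.
    """
    return [(x, y)
            for y, (prev_row, curr_row) in enumerate(zip(prev_frame, curr_frame))
            for x, (p, c) in enumerate(zip(prev_row, curr_row))
            if abs(c - p) > threshold]


def user_position(region):
    """Centroid of the user region on the video, or None if nothing moved."""
    if not region:
        return None
    n = len(region)
    return (sum(x for x, _ in region) / n, sum(y for _, y in region) / n)


def estimate_head_position(region, calibration):
    """Pick the calibration entry whose region best matches the observed one.

    calibration is a list of ((centroid_x, centroid_y, area), (X, Y, Z))
    pairs recorded in advance. The mismatch measure below (centroid distance
    plus area difference) is an assumption made for this sketch.
    """
    position = user_position(region)
    if position is None or not calibration:
        return None
    area = len(region)

    def mismatch(entry):
        (cx, cy, cal_area), _head = entry
        return hypot(position[0] - cx, position[1] - cy) + abs(area - cal_area)

    return min(calibration, key=mismatch)[1]
```

A denser calibration table corresponds to the finer granularity mentioned in the text: more entries give closer matches at the price of more data to record in advance.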

Next, in the video from the imaging device 25 of the next speaker specifying device 2 and the video from the imaging device 15 of the own device 1, both presented on the presentation device 14, the gaze direction detection unit 102 detects the user's pupil 302 on the video, as shown in FIG. 5. Treating the user's eye as a sphere, it determines where on the sphere the pupil 302 lies, obtains the vector from the center of the sphere through the position of the pupil 302, and takes that vector as the gaze direction 303 of each user on the video (step 202). When the imaging device 15 cannot capture the user's pupil from the front, for example when the user is looking away from the imaging device 15, detecting the gaze direction is difficult. However, a user who is not facing the presentation device 14 is considered to have little intention of taking part in the conversation and need not be treated as a participant, so no gaze direction is detected for such a user.
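The spherical eye model can be illustrated with a short sketch. It assumes the projected eye center and eye radius are already known in image pixels and that the camera looks along the positive z axis, so a gaze toward the camera has a negative z component; the function name and these conventions are assumptions of the sketch, not details given in the source.

```python
from math import sqrt


def gaze_direction(eye_center, eye_radius, pupil):
    """Unit vector from the eyeball center through the pupil.

    eye_center and pupil are (x, y) image positions in pixels and eye_radius
    is the projected eyeball radius in pixels (all assumed inputs). Returns
    None when the pupil lies outside the projected sphere.
    """
    dx = pupil[0] - eye_center[0]
    dy = pupil[1] - eye_center[1]
    planar_sq = dx * dx + dy * dy
    if planar_sq > eye_radius ** 2:
        return None
    # The visible pupil sits on the camera-facing half of the sphere.
    dz = -sqrt(eye_radius ** 2 - planar_sq)
    return (dx / eye_radius, dy / eye_radius, dz / eye_radius)
```

A pupil detected at the projected eye center yields a vector pointing straight at the camera, and the vector tilts as the pupil moves toward the edge of the eye.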

Next, the gaze target detection unit 103 uses the three-dimensional head position from the user position detection unit 101, the gaze direction 303 on the video from the gaze direction detection unit 102, and the three-dimensional position, tilt, and size of the presentation surface of the presentation device 14 in a coordinate system centered on the imaging device 15. It detects whether the vector of the gaze direction 303, extended from the three-dimensional head position, intersects the presentation surface of the presentation device 14. If it does, the unit obtains the coordinate position of the intersection on the presentation surface (the gaze position 304). When the gaze position 304 stays within a fixed error margin of the user position 402 of some user for at least a fixed time, the unit detects that the gaze is directed at that user (step 203). Specifically, the condition is that the gaze position stays within 2 cm, measured on the presentation device 14, of that user's position on the presentation device for at least 1 second. These values can be changed, and changing them makes it possible to control the outcome of the next speaker determination process as well as the flow of the conversation and the number of utterances.
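The intersection test and the dwell condition can be sketched as below. The sketch assumes the presentation surface is a flat rectangle described by one corner (`origin`), two perpendicular unit vectors along its edges (`u`, `v`), and its width and height in metres, all expressed in the camera-centered coordinate system; these parameter names and the sampling format are hypothetical. The 2 cm tolerance and 1 second duration are the example values given in the text.

```python
from math import hypot


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def gaze_point_on_surface(head, direction, origin, u, v, width, height):
    """Intersect the gaze ray with the presentation surface.

    Returns the hit point as (a, b) coordinates on the surface, measured
    from origin along u and v, or None if the ray misses the surface.
    """
    normal = cross(u, v)
    denom = dot(direction, normal)
    if abs(denom) < 1e-9:          # gaze is parallel to the surface
        return None
    t = dot(sub(origin, head), normal) / denom
    if t <= 0:                     # surface is behind the user
        return None
    hit = tuple(h + t * d for h, d in zip(head, direction))
    rel = sub(hit, origin)
    a, b = dot(rel, u), dot(rel, v)
    if 0 <= a <= width and 0 <= b <= height:
        return (a, b)
    return None


def is_gazing_at(samples, user_pos, tolerance=0.02, min_duration=1.0):
    """True if the gaze stayed within tolerance of user_pos for min_duration.

    samples is a time-ordered list of (timestamp_seconds, gaze_point_or_None).
    """
    start = None
    for timestamp, point in samples:
        on_target = (point is not None and
                     hypot(point[0] - user_pos[0], point[1] - user_pos[1]) <= tolerance)
        if not on_target:
            start = None
            continue
        if start is None:
            start = timestamp
        if timestamp - start >= min_duration:
            return True
    return False
```

Raising `tolerance` or lowering `min_duration` makes gaze detection, and therefore a change of next speaker, easier to trigger, which matches the remark that these values steer the flow of the conversation.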

Next, the next speaker determination unit 104 determines the user who explicitly holds the right to speak next by detecting the target on which the gazes of the users in the conversation are concentrated (step 204). A vote is taken automatically based on the result of the gaze target detection unit 103, and the next speaker is determined. Specifically, among the users other than those who are looking at the current speaker, if there is a target user who receives the gaze of a majority of them, that user is determined to be the next speaker. This voting condition can be changed. Possible variants include voting over the gaze directions of all users without excluding those looking at the current speaker, or requiring that the target user receive the gaze of at least two thirds of the users rather than a majority. Depending on the condition chosen, transfer of the right to speak can be made more or less likely to occur.
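The voting rule and its variants can be sketched as follows. The function name, the dictionary input format, and the `inclusive` flag are assumptions made for illustration; the source does not say whether the current speaker's own gaze is counted, so this sketch simply counts every user present in the input.

```python
from collections import Counter
from fractions import Fraction


def next_speaker(gaze_targets, current_speaker,
                 ratio=Fraction(1, 2), inclusive=False, exclude_watchers=True):
    """Return the user chosen as next speaker, or None if no one qualifies.

    gaze_targets maps each voting user to the user they are looking at
    (None if they are looking at no one). With the defaults, users looking
    at the current speaker are excluded and a strict majority is required.
    Use ratio=Fraction(2, 3), inclusive=True for the two-thirds variant.
    """
    voters = {user: target for user, target in gaze_targets.items()
              if not (exclude_watchers and target == current_speaker)}
    votes = Counter(target for target in voters.values() if target is not None)
    if not votes:
        return None
    candidate, count = votes.most_common(1)[0]
    needed = ratio * len(voters)
    passed = count >= needed if inclusive else count > needed
    return candidate if passed else None
```

For example, if A is speaking, B watches A, and three of the remaining four users look at D, then D is returned under both the majority rule and the two-thirds rule.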

If the next speaker determination unit 104 finds a user who is determined to be the next speaker, that user is given the right to speak next and processing moves on to effect presentation (step 205). If no such user is found, the next speaker determination (steps 202 and 203) is repeated.

Next, when the next speaker determination unit 104 has found a user determined to be the next speaker, the effect presentation unit 105 performs effect presentation on the video showing that user (step 206). FIG. 4 shows an example in which the user 401 at the other site is determined to be the next speaker. On the presentation device 14, a balloon effect 403 is superimposed at the user position 402 of the user 401, that is, at the centroid of the region of the user 401 in the video as detected by the user position detection unit 101. The balloon is positioned so that the tip of its tail 404 coincides with the user position 402, which makes the balloon effect 403 appear to originate from the user 401.
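The placement rule, with the tail tip pinned to the user position, might be sketched like this. The balloon size, the tail offset, the frame size, and the choice to place the body up and to the right of the user are all hypothetical; the source only fixes the tail tip at the user position 402.

```python
def balloon_layout(user_pos, body_size=(120, 60), tail_offset=(20, 30),
                   frame_size=(640, 480)):
    """Compute where to draw the balloon so its tail tip sits on user_pos.

    Coordinates are image pixels with y growing downward. The body is
    clamped to the frame, while the tail tip always stays on the user
    position. All sizes are assumed values for this sketch.
    """
    ux, uy = user_pos
    width, height = body_size
    dx, dy = tail_offset
    x = min(max(ux + dx, 0), frame_size[0] - width)
    y = min(max(uy - dy - height, 0), frame_size[1] - height)
    return {"tail_tip": (ux, uy), "body": (x, y, width, height)}
```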

Next, the utterance voice detection unit 106 processes either the sound from the next speaker specifying device 2 that is reproduced by the sound reproduction device 12 via the communication device 11, or the sound collected by the sound collection device 13 without passing through the communication device 11. When it detects a sound input louder than the noise level observed while no one is speaking, it regards an utterance as having occurred at the moment that input was detected (step 211).
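A simple level comparison of this kind could look like the sketch below. It assumes audio arrives as short frames of numeric samples and that loudness is measured as RMS; the frame format, the RMS measure, and the optional `margin` factor are assumptions of the sketch, since the source only states that the input must exceed the noise level of silence.

```python
from math import sqrt


def rms(samples):
    """Root-mean-square level of one audio frame (non-empty list of samples)."""
    return sqrt(sum(s * s for s in samples) / len(samples))


def detect_speech(frames, noise_level, margin=1.0):
    """Return the index of the first frame louder than the noise level.

    noise_level is the RMS measured while no one is speaking. margin=1.0
    follows the text literally; a larger margin is a hypothetical way to
    reduce false detections. Returns None if no frame qualifies.
    """
    for index, frame in enumerate(frames):
        if frame and rms(frame) > noise_level * margin:
            return index
    return None
```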

The utterance video detection unit 107 processes either the video from the next speaker specifying device 2 that is presented on the presentation device 14 via the communication device 11, or the video captured by the imaging device 15 without passing through the communication device 11. It extracts changes in each user's mouth movement from the video and, when a user's mouth movement changes, regards that user as possibly speaking (step 212).

When the utterance voice detection unit 106 detects that an utterance at some site has lasted at least a predetermined time, and the utterance video detection unit 107 detects a user who may be speaking at the same site, the effect erasing unit 108 erases the balloon effect superimposed by the effect presentation unit 105, provided that the detected user is someone other than the current speaker (the next speaker included) (step 213).
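The erase condition combines the audio and video results and can be written as a small predicate. The function name, the input format, and the 0.5 second default are assumptions; the source specifies only "a predetermined time".

```python
def should_erase_effect(speech_site, speech_duration, likely_speakers,
                        current_speaker, min_duration=0.5):
    """Decide whether the balloon effect should be erased.

    speech_site: site where audio detection found an utterance, or None.
    speech_duration: how long that utterance has lasted, in seconds.
    likely_speakers: dict mapping each user whose mouth movement changed
        to the site where that user is located.
    min_duration: assumed stand-in for the unspecified predetermined time.
    """
    if speech_site is None or speech_duration < min_duration:
        return False
    return any(site == speech_site and user != current_speaker
               for user, site in likely_speakers.items())
```

Requiring both a sustained utterance and a matching mouth movement at the same site keeps brief interjections and background noise from clearing the effect.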

[Second Embodiment]
This embodiment illustrates an example in which the conversation takes place in a single shared space.

FIG. 6 is an overall view of the next speaker specifying device of this embodiment, and FIG. 7 is a block diagram of the next speaker determination device 34 of this embodiment.

The next speaker specifying device 4 of this embodiment includes a presentation device 31, an imaging device 32, a sound collection device 33, and a next speaker determination device 34.

The presentation device 31 presents visual effects to the users 801 around it. Its display appears at the center of the conversation, for example on a conference desk in a meeting room, and it consists either of a ceiling-mounted projector and a flat white board placed on the desk, or of a video presentation device placed on the desk. The sound collection device 33 collects sound around the presentation device 31. The imaging device 32 captures the area around the presentation device 31; it is oriented opposite to the video presentation direction of the presentation device 31 and installed at a position from which it can capture both the users near the presentation device 31 and the content presented on it. The next speaker determination device 34 includes a user position detection unit 601, a gaze direction detection unit 602, a gaze target detection unit 603, a next speaker determination unit 604, an effect presentation unit 605, an utterance voice detection unit 606, an utterance video detection unit 607, and an effect erasing unit 608.

The user position detection unit 601 detects, based on the video from the imaging device 32 presented on the presentation device 31, the user position of each user, defined as the centroid of the user's region in the acquired video.

The gaze direction detection unit 602 detects each user's pupil in the video from the imaging device 32 presented on the presentation device 31. Treating the user's eye as a sphere, it determines where on the sphere the pupil lies, obtains the vector from the center of the sphere through the pupil position, and takes that vector as the user's gaze direction.

The gaze target detection unit 603 detects which user each user is looking at, from the user positions detected by the user position detection unit 601 and the gaze directions detected by the gaze direction detection unit 602.

Based on the result of the gaze target detection unit 603, the next speaker determination unit 604 considers the users in the conversation other than those who are looking at the current speaker. If some user receives the gaze of a majority of these users, that user is determined to be the next speaker.

The effect presentation unit 605 presents a balloon effect indicating that the user holds the right to speak next. The effect is shown near the user whom the next speaker determination unit 604 has determined to hold that right, on the part of the presentation device 31 that lies in that user's gaze direction, so that it appears to originate from that user.

The utterance voice detection unit 606 detects the presence or absence of an utterance from the sound obtained by the sound collection device 33. The utterance video detection unit 607 identifies and detects users who may be speaking from the video obtained by the imaging device 32.

When the utterance voice detection unit 606 detects an utterance lasting at least a predetermined time and the utterance video detection unit 607 detects a user who may be speaking, the effect erasing unit 608 erases the balloon effect superimposed by the effect presentation unit 605, provided that the detected user is someone other than the current speaker (the next speaker included). The requirement of "at least a predetermined time" allows for interjections and noise.

The user positions, gaze directions, gaze targets, next speaker, and other values detected by the units of the next speaker determination device 34 are stored in a storage unit (not shown).

FIG. 8 is a flowchart of the next speaker determination process in this embodiment, FIG. 9 shows an example of user position detection and effect presentation in this embodiment, and FIG. 10 shows an example of gaze direction detection in this embodiment.

Next, the flow of the next speaker determination process is described, focusing on the user 801.

The user position detection unit 601 detects, from the video from the imaging device 32 presented on the presentation device 31, the user position of each user, defined as the centroid of that user's region in the captured video (step 701). FIG. 9 shows an example of detecting the user position 802 of the user 801. Moving objects in the video are detected by computing the difference between successive frames, the region of a moving object regarded as a user is extracted, and the centroid of that region in the video is computed. This gives the center position of the moving object regarded as a user, which is taken as the user position 802 of the user 801 (reference: "Digital Video Processing", supervised by Nobuyuki Yagi, edited by the Institute of Image Information and Television Engineers, Ohmsha).
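The frame-difference and centroid computation of step 701 can be sketched as follows. This is a minimal illustration, not the method of the cited reference: it assumes grayscale frames given as lists of rows, treats every changed pixel as belonging to a single user region, and uses a hypothetical `threshold` value that the patent does not specify.

```python
def moving_region_centroid(prev_frame, curr_frame, threshold=30):
    """Return the (x, y) centroid of the pixels that changed between two
    grayscale frames, or None if nothing moved.

    Assumption: all changed pixels belong to one user region. Separating
    several users would need an extra region-labelling step.
    """
    sum_x = sum_y = count = 0
    for y, (prev_row, curr_row) in enumerate(zip(prev_frame, curr_frame)):
        for x, (p, c) in enumerate(zip(prev_row, curr_row)):
            # A pixel counts as "moving" when its brightness changed enough.
            if abs(c - p) > threshold:
                sum_x += x
                sum_y += y
                count += 1
    if count == 0:
        return None
    return (sum_x / count, sum_y / count)
```

With several users in view, the changed pixels would first have to be grouped into connected regions, with one centroid computed per region.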

The gaze direction detection unit 602 detects each user's pupil in the video from the imaging device 32 presented on the presentation device 31. Treating the user's eye as a sphere, it determines where on the sphere the pupil lies, obtains the vector from the center of the sphere through the pupil position in the video, and takes that vector as the user's gaze direction (step 702). FIG. 10 shows an example in which the gaze direction 903 of the user 901 is detected by detecting the user's pupil 902. When the imaging device 32 cannot capture the user's pupils from the front, for example when the user is facing away from the imaging device 32, detecting that user's gaze direction is difficult. However, a user who is not facing the presentation device 31 is considered to have little intention of taking part in the conversation and need not be treated as a member of the conversation space, so no gaze direction is detected for such a user.
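One way to read the spherical eye model of step 702 is sketched below. The coordinate convention is an assumption made for illustration: x and y are image coordinates, z points from the eye toward the camera, and `eye_center` and `eye_radius` (in pixels) are hypothetical inputs that the patent does not describe how to obtain.

```python
import math


def gaze_vector(eye_center, pupil, eye_radius):
    """Unit vector from the center of the eyeball sphere through the pupil.

    eye_center, pupil: (x, y) image coordinates.
    eye_radius: radius of the eyeball sphere in pixels (assumed known).
    Returns None if the pupil lies outside the projected sphere.
    """
    dx = pupil[0] - eye_center[0]
    dy = pupil[1] - eye_center[1]
    planar_sq = dx * dx + dy * dy
    if planar_sq > eye_radius ** 2:
        return None
    # The pupil sits on the camera-facing half of the sphere, so z >= 0.
    dz = math.sqrt(eye_radius ** 2 - planar_sq)
    return (dx / eye_radius, dy / eye_radius, dz / eye_radius)
```

A pupil at the center of the projected eye yields a vector pointing straight at the camera; the further the pupil moves toward the edge, the more the vector tilts sideways.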

The gaze target detection unit 603 detects that a user is looking at another user when the gaze direction 903 detected by the gaze direction detection unit 602 has been directed at the user position 802 of that other user, as detected by the user position detection unit 601, within a fixed error margin for a fixed time or longer (step 703). Specifically, the condition is that the gaze direction 903 is directed to within 1 m of the user position 802 of the user 801 for 1 second or longer, where the 1 m error is judged using the previously measured size of the presentation device 31 as the reference scale. These values can be changed, and changing them makes it possible to control the outcome of the next speaker determination process as well as the flow of conversation and the number of utterances in the conversation space.
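The dwell condition of step 703 can be illustrated with a short sketch. It assumes the gaze error has already been converted to meters and is supplied as timestamped samples; the function name and sample format are hypothetical, while the 1 m and 1 s defaults come from the text above.

```python
def is_gazing(samples, max_error_m=1.0, min_duration_s=1.0):
    """Return True if the gaze stayed within max_error_m of the target
    for at least min_duration_s without interruption.

    samples: iterable of (timestamp_s, error_m) pairs in time order,
    where error_m is the distance between the gaze and the target
    user's position.
    """
    run_start = None
    for timestamp, error in samples:
        if error <= max_error_m:
            if run_start is None:
                run_start = timestamp  # a new on-target run begins
            if timestamp - run_start >= min_duration_s:
                return True
        else:
            run_start = None  # gaze left the target, so reset the run
    return False
```

Raising `min_duration_s` or lowering `max_error_m` makes a gaze harder to register, which is the kind of tuning the paragraph above describes.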

The next speaker determination unit 604 determines which user explicitly holds the right to speak next by detecting the target on which the gazes of the users in the conversation space are concentrated (step 704). A vote is taken automatically on the basis of the results from the gaze target detection unit 603, and the next speaker is determined from it. Specifically, when, among the users other than those looking at the current speaker, a majority are looking at one target user, that user is determined to be the next speaker. This voting condition can be changed. For example, the vote may cover all users without excluding those looking at the current speaker, or the condition may require the gazes of two-thirds or more of the users rather than a majority. Depending on the condition chosen, transfer of the right to speak can be made more or less likely.
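The gaze vote of step 704, including the two variants mentioned above, might look like the following sketch. The data layout (a dict from each viewer to the user they are looking at) and all parameter names are assumptions; excluding the current speaker as a candidate is also an assumption that the text does not state.

```python
from collections import Counter
from fractions import Fraction


def next_speaker(gaze_targets, current_speaker,
                 exclude_current_watchers=True,
                 required=Fraction(1, 2), strict=True):
    """Return the user chosen as next speaker by the gaze vote, or None.

    gaze_targets: dict mapping each viewer to the user they look at
    (None if they look at nobody).
    required / strict: share of voters needed. The defaults mean "more
    than half"; required=Fraction(2, 3), strict=False means
    "two-thirds or more".
    """
    # Voters: by default, everyone except users watching the current speaker.
    voters = [viewer for viewer, target in gaze_targets.items()
              if not (exclude_current_watchers and target == current_speaker)]
    # Assumption: the current speaker cannot be elected as the next speaker.
    votes = Counter(
        gaze_targets[viewer] for viewer in voters
        if gaze_targets[viewer] not in (None, viewer, current_speaker)
    )
    if not votes:
        return None
    candidate, count = votes.most_common(1)[0]
    needed = required * len(voters)
    reached = count > needed if strict else count >= needed
    return candidate if reached else None
```

For example, with current speaker "A" and gazes `{"A": "D", "B": "A", "C": "D", "D": None, "E": "D"}`, user B is excluded for watching the speaker, and D receives three of the four remaining votes, so D is selected.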

If the next speaker determination unit 604 determines a user to be the next speaker, that user is given the next right to speak and processing moves on to effect presentation (step 705). If no user is determined to be the next speaker, the next speaker determination is repeated.

When the next speaker determination unit 604 has determined a user to be the next speaker, the effect presentation unit 605 performs effect presentation for the video showing that user (step 706). FIG. 9 shows an example in which the user 801 has been determined to be the next speaker. A speech-balloon effect 803, which explicitly indicates that the user holds the next right to speak, is presented on the presentation device 31 in the gaze direction of the user 801. The effect is superimposed so that the tip of the balloon's tail 804 coincides with the point on the presentation device 31 closest to the user position 802 of the user 801, which makes the speech-balloon effect 803 appear to originate from the user 801.
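The tail placement rule in step 706 amounts to finding the point on the presentation surface nearest to the user. A minimal sketch follows, assuming the surface is an axis-aligned rectangle and the user position is expressed in the same 2D coordinate system; both are simplifications not stated in the text, and the function name is hypothetical.

```python
def tail_tip_position(user_pos, surface):
    """Return the point on a rectangular presentation surface that is
    closest to the user position.

    user_pos: (x, y) of the user.
    surface: (left, top, right, bottom) of the presentation surface,
    assumed axis-aligned in the same coordinate system.
    """
    left, top, right, bottom = surface
    # Clamping each coordinate to the rectangle gives the nearest point.
    x = min(max(user_pos[0], left), right)
    y = min(max(user_pos[1], top), bottom)
    return (x, y)
```

A user seated to the left of the desk therefore gets a balloon whose tail touches the left edge of the surface at the user's own height along that edge.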

The utterance voice detection unit 606 processes the sound picked up by the sound collection device 33 and, when it detects audio input louder than the noise level observed while nobody is speaking, regards an utterance as having occurred at the point where that input was detected (step 711).
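The noise-level comparison of step 711 can be sketched as below. Measuring loudness as the RMS amplitude of short frames is an assumption, as is the optional `margin` factor; the text only requires the input to exceed the noise level measured while nobody speaks.

```python
import math


def rms(frame):
    """Root-mean-square amplitude of one audio frame (a list of samples)."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_speech(frames, noise_level, margin=1.0):
    """Return one boolean per frame: True where the frame is louder than
    the noise level measured while nobody is speaking.

    margin: hypothetical safety factor applied to the noise level;
    1.0 reproduces the plain "louder than noise" rule.
    """
    return [rms(frame) > noise_level * margin for frame in frames]
```

The resulting per-frame flags can then be checked for a sufficiently long run of speech before any action is taken.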

The utterance video detection unit 607 processes the video captured by the imaging device 32, extracts changes in each user's mouth movement from the video, and, when a user's mouth movement changes, regards that user as possibly speaking (step 712).

The effect erasing unit 608 erases the speech-balloon effect superimposed by the effect presentation unit 605 when the utterance voice detection unit 606 has detected an utterance for a predetermined time or longer and the utterance video detection unit 607 has detected a user who may be speaking, provided that this user is someone other than the current speaker (the next speaker included) (step 713).
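The erasing condition of step 713 combines the audio and video cues. The sketch below assumes per-frame speech flags at a fixed frame period and a set of users whose mouths were seen moving; these inputs and all names are hypothetical, and the "predetermined time" is left as a parameter because the text gives no value.

```python
def should_erase_effect(speech_flags, frame_period_s, min_speech_s,
                        mouth_moving_users, current_speaker):
    """Decide whether the speech-balloon effect should be erased.

    speech_flags: per-frame booleans from the audio detector, oldest first.
    frame_period_s: duration of one audio frame in seconds.
    min_speech_s: the "predetermined time" an utterance must last.
    mouth_moving_users: users the video detector flags as possibly speaking.
    """
    # Length of the uninterrupted run of speech ending at the latest frame.
    run = 0
    for flag in reversed(speech_flags):
        if not flag:
            break
        run += 1
    sustained = run * frame_period_s >= min_speech_s
    # Only a user other than the current speaker triggers the erase.
    new_talker = any(user != current_speaker for user in mouth_moving_users)
    return sustained and new_talker
```

Requiring a sustained run is what filters out short interjections and noise bursts, as the description of the effect erasing unit notes.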

The functions of the next speaker specifying device described above may also be realized by recording a program that implements them on a computer-readable recording medium and having a computer read and execute the program recorded on that medium. A computer-readable recording medium here means a recording medium such as a flexible disk, a magneto-optical disk, or a CD-ROM, or a storage device such as a hard disk drive built into a computer system. The term further includes media that hold a program dynamically for a short time (transmission media or transmission waves), as when a program is transmitted over the Internet, and media that hold a program for a certain time, such as volatile memory inside the computer that acts as the server in that case.

FIG. 1 is a configuration diagram of a system including the next speaker specifying device of the first embodiment of the present invention.
FIG. 2 is a block diagram of the next speaker specifying device of FIG. 1.
FIG. 3 is a flowchart of the next speaker determination process in the first embodiment.
FIG. 4 shows an example of user position detection and effect presentation in the first embodiment.
FIG. 5 shows an example of gaze direction detection and gaze position detection in the first embodiment.
FIG. 6 is a configuration diagram of the next speaker specifying device of the second embodiment of the present invention.
FIG. 7 is a block diagram of the next speaker specifying device of FIG. 1.
FIG. 8 is a flowchart of the next speaker determination process in the second embodiment.
FIG. 9 shows an example of user position detection and effect presentation in the second embodiment.
FIG. 10 shows an example of gaze direction detection in the second embodiment.

Explanation of symbols

1, 2, 4: next speaker specifying device
3: network
11, 21: communication device
12, 22: audio playback device
13, 23, 33: sound collection device
14, 24, 31: presentation device
15, 25, 32: imaging device
16, 26, 34: next speaker determination device
101, 601: user position detection unit
102, 602: gaze direction detection unit
103, 603: gaze target detection unit
104, 604: next speaker determination unit
105, 605: effect presentation unit
106, 606: utterance voice detection unit
107, 607: utterance video detection unit
108, 608: effect erasing unit
201 to 206, 211 to 213: steps
301: user
302: user's pupil
303: gaze direction
304: gaze position
301': user on the video
302': user's pupil on the video
303': gaze direction on the video
401, 801: user
402, 802: user position
403, 803: speech-balloon effect
404, 804: tail of the speech-balloon effect
901: user
902: user's pupil
903: gaze direction

Claims

A next speaker specifying method for a conference system in which a plurality of users confer via a network, each site having an imaging device that captures video of a user and a presentation device that presents the video captured by the imaging device to the user, the method detecting and explicitly indicating, while one of the users is speaking, a user who wishes to speak next, the method comprising:
a user position detection step of detecting, from the video of the imaging device at each site, including the local site, presented on the presentation device, the user position of each user on the video and the three-dimensional position of the user's head;
a gaze direction detection step of detecting, from the video of the imaging device at each site, including the local site, presented on the presentation device, the gaze direction of each user on the video;
a gaze target detection step of obtaining, from the three-dimensional position of the user's head, the gaze direction, and the three-dimensional position, inclination, and size of the presentation surface of the presentation device in a coordinate system centered on the imaging device, the coordinate position on the presentation surface at which the gaze vector from the three-dimensional position intersects that surface, and detecting, from the coordinate position and the user position of each user, from which users a gaze is directed at each user;
a next speaker determination step of tallying votes for the users who have received gazes and determining a user who has received gazes from a predetermined proportion or more of the users to be the next speaker; and
an effect presentation step of presenting, on the video on the presentation device of the user determined to be the next speaker, an effect that explicitly indicates that the user holds the next right to speak.

The next speaker specifying method according to claim 1, further comprising a step of erasing the presented effect when an utterance by the next speaker is detected.

A next speaker specifying device for a conference system in which a plurality of users confer via a network, each site having an imaging device that captures video of a user and a presentation device that presents the video captured by the imaging device to the user, the device detecting and explicitly indicating, while one of the users is speaking, a user who wishes to speak next, the device comprising:
user detection means for detecting, from the video of the imaging device at each site, including the local site, presented on the presentation device, the user position of each user on the video and the three-dimensional position of the user's head;
gaze direction detection means for detecting, from the video of the imaging device at each site, including the local site, presented on the presentation device, the gaze direction of each user on the video;
gaze target detection means for obtaining, from the three-dimensional position of the user's head, the gaze direction, and the three-dimensional position, inclination, and size of the presentation surface of the presentation device in a coordinate system centered on the imaging device, the coordinate position on the presentation surface at which the gaze vector from the three-dimensional position intersects that surface, and detecting, from the coordinate position and the user position of each user, from which users a gaze is directed at each user;
next speaker determination means for tallying votes for the users who have received gazes and determining a user who has received gazes from a predetermined proportion or more of the users to be the next speaker; and
effect presentation means for presenting, on the video on the presentation device of the user determined to be the next speaker, an effect that explicitly indicates that the user holds the next right to speak.

The next speaker specifying device according to claim 3, further comprising means for erasing the presented effect when an utterance by the next speaker is detected.

A program for causing a computer to operate as the next speaker specifying device according to claim 3 or 4.