JP2006338529A

JP2006338529A - Conversation structure estimation method

Info

Publication number: JP2006338529A
Application number: JP2005164395A
Authority: JP
Inventors: Kazuhiro Otsuka; 和弘大塚; Junji Yamato; 淳司大和
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2005-06-03
Filing date: 2005-06-03
Publication date: 2006-12-14
Anticipated expiration: 2025-06-03
Also published as: JP4804801B2

Abstract

PROBLEM TO BE SOLVED: To provide a conversation structure estimation method that automatically estimates the structure of a conversation by measuring the behavior of its participants, for scenes in which a plurality of persons converse face to face.

SOLUTION: The head direction of each person participating in the conversation is measured, and the gaze direction of each person is calculated from that head direction. The presence or absence of an utterance by each person is also detected. Structure information of the conversation at each point in time is then calculated from the gaze directions and the utterance presence/absence information of each person.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to techniques for automatically measuring or recognizing the behavior of a plurality of persons. In particular, it relates to a conversation structure estimation method that targets situations in which a plurality of persons converse and automatically estimates, from the observed behavior of those persons, the structure of the conversation taking place.

In face-to-face conversation among multiple persons, each participant is known to take on a role such as "speaker", "addressee", or "side participant", and these roles shift over time. Automatically extracting the structure of a conversation, meaning this division of roles and its change over time, is a fundamental problem for applications such as automatic indexing for building archives of meeting video and automatic video editing.

Among these roles, attention has so far focused mainly on the "speaker", who is considered to play the central part in conveying linguistic information. Techniques have been proposed that capture each person's utterance state as an acoustic signal and identify which of the conversation participants is the speaker (see, for example, Patent Document 1).

However, a conversation cannot be established by the speaker alone. In recent years it has gradually come to be recognized that information about the structure of the conversation, namely to whom the speaker's utterance is directed, is also important. Non-verbal information such as the gaze behavior of the participants has been suggested as a useful cue for estimating conversation structure (see Patent Document 2). Patent Document 3 reports experimental results showing that whether a speaker is addressing one person or several can be judged from how the speaker distributes gaze, but it proposes no method for measuring gaze automatically.

Meanwhile, Non-Patent Document 4 proposes detecting the head direction and using it in place of the gaze direction, but it does not estimate the structure of the conversation.
Document 1: A. Garg, V. Pavlovic, and J. M. Rehg, "Boosted learning in dynamic Bayesian networks for multimodal speaker detection," Proc. IEEE, Vol. 91, No. 9, 2003.
Document 2: N. Jovanovic and R. Akker, "Towards automatic addressee identification in multiparty dialogues," Proc. SIGdial, pp. 89-92, 2004.
Document 3: Y. Takemae, K. Otsuka, and N. Mukawa, "An analysis of speakers' gaze behavior for automatic addressee identification in multiparty conversation and its application to video editing," Proc. of IEEE International Workshop on Robot and Human Interactive Communication (IEEE/RO-MAN 2004), pp. 581-586, 2004.
Document 4: R. Stiefelhagen et al., "Modeling focus of attention for meeting index based on multiple cues," IEEE Trans. Neural Networks, Vol. 13, No. 4, 2002.

As described above, the prior art could not automatically estimate the structure of a conversation, such as the roles of the participants and how those roles change over time.

Accordingly, an object of the present invention is to provide a conversation structure estimation method that automatically estimates the structure of a conversation by measuring the behavior of the participants, for scenes in which a plurality of persons converse face to face.

The present invention has been made to solve the above problem. It is a conversation structure estimation method in a conversation structure estimation device, in which a head direction measurement processing unit measures the head direction of each person participating in a conversation, a gaze direction calculation processing unit calculates the gaze direction of each person from that person's head direction, an utterance presence/absence measurement processing unit detects whether each person is speaking, and a conversation structure estimation processing unit calculates structure information of the conversation at each point in time from the gaze direction and the utterance presence/absence information of each person.
Combining gaze direction with utterance presence/absence information in this way makes it possible to estimate aspects of the conversation structure, such as the person to whom the speaker's utterance is directed, that cannot be learned from utterance information alone. At present it is difficult to measure a person's gaze direction directly with a device without disturbing the conversation. Because the method does not measure gaze directly but estimates it approximately from the head direction, which is comparatively easy to measure, the conversation structure can be estimated by external observation without disturbing natural conversation.

The present invention is also characterized in that the conversation structure estimation processing unit calculates conversation structure information indicating which of three roles each person participating in the conversation holds at each point in time: speaker, addressee (the person the speaker is talking to), or side participant (a conversation participant who is neither speaker nor addressee).
Because this yields an estimate of who is talking to whom in a multi-person conversation, the invention can be applied to a wide variety of uses, such as indexing video recordings of conversations.

The present invention is also characterized by comprising an output unit that, based on the calculated conversation structure information, outputs a graph showing one of the following: information indicating a situation in which one speaker addresses all the other participants, who are the addressees; information indicating a situation in which two specific participants act as speaker or addressee while the other persons are side participants; or information indicating a situation that matches neither of these two.
This corresponds to estimating the pattern of information transmission that governs the conversation as a whole, obtained by integrating the roles of the individual persons, and it allows the invention to be applied to a wide variety of uses, such as indexing video recordings of conversations.

In the present invention, the calculation of each person's gaze direction from the head direction, and the calculation of the conversation structure information at each point in time from the gaze directions and the utterance presence/absence information, are both carried out by computing the joint posterior probability distribution of the gaze directions, the conversation structure information, and the parameters of the probabilistic model at each time. The computation uses a model expression for the relationship between head direction and gaze direction, a model expression for the mutual relationship among gaze direction, utterance presence/absence, and conversation structure information, and probabilistic model expressions for how these change over time.
Using such a probabilistic model allows uncertainties to be handled appropriately in probabilistic terms, including the ambiguity of the gaze direction for a given head direction and the ambiguity in the relationship among gaze direction, utterance presence/absence, and conversation structure. In addition, three problems are solved jointly: estimating gaze direction from head direction, estimating conversation structure from gaze direction and utterance information, and estimating the parameters of the probabilistic model. Solving them together lets the uncertainties of the individual solutions resolve one another, so that gaze direction and conversation structure can be estimated more accurately than when each problem is solved independently.

The present invention is also characterized in that the probabilistic model consists of: a probability distribution of each person's gaze direction that depends on the conversation structure; the probability that each person speaks, which depends on the conversation structure; a probability distribution of the head directions possible for each gaze direction of each person; a transition probability that determines how the conversation structure changes over time; and a transition probability that determines how each person's gaze direction changes over time depending on the conversation structure.
This allows the model to incorporate the human tendency for particular gaze behavior and utterance states to appear depending on the structure of the conversation. For example, when a speaker talks to another person, the speaker directs gaze at that addressee, and the addressee tends to look at the speaker. A speaker also has a high probability of producing an utterance. Using such a model makes it possible to estimate the conversation structure accurately from observed human behavior.

The present invention is also characterized in that the joint posterior probability distribution of the gaze directions, the conversation structure, and the parameters of the probabilistic model at each time is computed approximately with a Gibbs sampler, as a set of samples obtained by drawing random numbers from the full conditional posterior distribution of every unknown variable contained in the gaze directions, the conversation structure, and the model parameters at each time.
This makes it possible to obtain an approximate solution even for a probabilistic model with many unknown variables, such as that of the present invention, for which exact computation of the joint posterior probability distribution is difficult.

The present invention is also characterized in that the head direction of each person is measured using a magnetic sensor attached to the head of each conversation participant.
This makes it possible to measure the coordinates and rotation angles of each participant's head in three-dimensional space accurately and with high temporal resolution, which in turn enables temporally fine-grained estimation of the conversation structure.

The present invention is also characterized in that the presence or absence of an utterance by each person is detected from the magnitude of the acoustic signal obtained from a microphone worn by that person.
This makes it possible to detect the utterance state of each participant individually, enabling accurate estimation of the conversation structure.
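The text specifies only that utterances are detected from the magnitude of each person's microphone signal. As a minimal sketch of one way this could be done, the following assumes a frame-wise RMS amplitude compared against a fixed threshold; the function name `detect_utterance`, the frame length, and the threshold value are illustrative assumptions and do not come from the patent.

```python
import math
from typing import List, Sequence


def detect_utterance(samples: Sequence[float], frame_len: int, threshold: float) -> List[int]:
    """Return 1 for each full frame whose RMS amplitude exceeds `threshold`, else 0."""
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        flags.append(1 if rms > threshold else 0)
    return flags


# Synthetic signal (assumed values): a silent frame, a loud frame, a near-silent frame.
signal = [0.0] * 4 + [0.5, -0.5, 0.5, -0.5] + [0.01] * 4
utterance_flags = detect_utterance(signal, frame_len=4, threshold=0.1)
```

Each flag plays the part of the binary utterance observation for one person at one time step. A practical system would likely need a threshold tuned per microphone.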

The present invention is also a program to be executed by a computer of a conversation structure estimation device, the program causing the computer to execute: a process of measuring the head direction of each person participating in a conversation; a process of calculating the gaze direction of each person from that person's head direction; a process of detecting the presence or absence of an utterance by each person; and a process of calculating structure information of the conversation at each point in time from the gaze direction and the utterance presence/absence information of each person.

The present invention is also a recording medium storing a program to be executed by a computer of a conversation structure estimation device, the program causing the computer to execute: a process of measuring the head direction of each person participating in a conversation; a process of calculating the gaze direction of each person from that person's head direction; a process of detecting the presence or absence of an utterance by each person; and a process of calculating structure information of the conversation at each point in time from the gaze direction and the utterance presence/absence information of each person.

According to the present invention, for face-to-face conversation among multiple persons, the head direction and utterance state of each person in the conversation are measured, and the conversation structure, the gaze directions, and the model parameters at each point in time are estimated jointly from these measurements and a probabilistic model relating gaze direction and conversation structure. The roles of the participants during the conversation (speaker, addressee, side participant) and the conversation structure expressed as the change of those roles over time can therefore be estimated automatically.

A conversation structure estimation method according to an embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of a conversation structure estimation device according to the embodiment. In the figure, reference numeral 101 denotes a head direction measurement unit, 102 an utterance state measurement unit, 103 an observation data storage unit, 104 a parameter storage unit, 105 a sample set storage unit, 106 a Gibbs sampler, 107 a statistic calculation unit, and 108 an output unit.

The head direction measurement unit 101 is, for example, a magnetic sensor (or geomagnetic sensor) worn by each participant; it measures the direction of the head relative to a predetermined reference direction, for example from the relationship between geomagnetic north and the direction of the head. The utterance state measurement unit 102 is, for example, a pin microphone worn by each participant, and it measures the presence or absence of an utterance from the magnitude of the acoustic signal obtained from the microphone. The observation data storage unit 103 stores the data obtained from the head direction measurement unit 101 and the utterance state measurement unit 102 over a given time interval. The parameter storage unit 104 stores the values of the hyperparameters of the conversation model. The sample set storage unit 105 stores the set of samples generated by the Gibbs sampler 106. The Gibbs sampler 106 takes as input the observation data stored in the observation data storage unit 103 and the model hyperparameter values stored in the parameter storage unit 104, generates a sample set representing the joint posterior probability distribution of the unknown variables, and stores it in the sample set storage unit 105. The statistic calculation unit 107 calculates statistics of the unknown variables from the sample set recorded in the sample set storage unit 105. The output unit 108 outputs the statistics calculated by the statistic calculation unit 107 to a display or the like.

FIG. 2 is a diagram showing the processing flow of the conversation structure estimation device.
The processing flow of the conversation structure estimation device is described next with reference to FIG. 2.
First, over a given time interval ($1 \leq t \leq T$), the head direction measurement unit 101 measures the head direction of each participant at fixed time intervals (step S501). Likewise, over the same interval ($1 \leq t \leq T$), the utterance state measurement unit 102 measures the utterance state of each participant (acquires the audio) at fixed time intervals (step S502). The measured information is recorded in the observation data storage unit 103. Each measurement unit compares $t$ with $T$ (step S503), and the measurements of steps S501 and S502 are repeated until the whole interval has been covered. Next, the Gibbs sampler 106 is initialized using the parameter values stored in the parameter storage unit 104 (step S504). Then, for each variable, a sample is drawn (random number generation) from its full conditional posterior distribution and the value of the variable is updated (step S505). It is then determined whether all variables have been updated by the processing of step S505 (step S506); if they have, it is determined whether the number of iterations has reached a preset value (step S507). When the preset value has been reached, statistics for each variable are calculated from the sample set stored in the sample set storage unit 105 (step S508).
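The text does not say which statistics step S508 computes. Purely as an illustration, the sketch below assumes one common choice: for each time step, take the value that occurs most often across the stored samples (the marginal posterior mode). The function name `marginal_mode` and the structure labels in the example are hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def marginal_mode(samples: Sequence[Sequence[Hashable]]) -> List[Hashable]:
    """For each time step, return the value seen most often across all samples."""
    n_steps = len(samples[0])
    return [Counter(s[t] for s in samples).most_common(1)[0][0] for t in range(n_steps)]


# Three hypothetical samples of a structure sequence S_1..S_4 (labels are made up).
sample_set = [
    ["C1", "C1", "D12", "R0"],
    ["C1", "D12", "D12", "R0"],
    ["C1", "C1", "D12", "C2"],
]
estimated = marginal_mode(sample_set)
```

Other summaries, such as the full per-step frequency table, could be computed from the same sample set in the same way.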

The processing flow of the conversation structure estimation device is now described in more detail.
FIG. 3 is a diagram showing the relative positions of the conversation participants.
As FIG. 3 shows, the conversation participants targeted by the conversation structure estimation method of this embodiment are seated at relative coordinates such as those in FIG. 3 and hold a conversation. The number of persons $N$ satisfies $N \geq 3$. The time interval to be estimated is discretized at fixed intervals, with $t = 1, 2, \ldots, T$. The gaze direction of person $i$ at time $t$ is written $X_{i,t}$. When person $i$ directs gaze at the face of person $j$, the gaze direction is $X_{i,t} = j$; when person $i$ is looking at no one, it is written $X_{i,t} = i$. The gaze directions of all persons taken together are called the gaze pattern, written $Xa_t = \{X_{1,t}, X_{2,t}, \ldots, X_{N,t}\}$, and the sequence of gaze patterns over the target time interval is written $Xa_{1:T} = \{Xa_1, Xa_2, \ldots, Xa_T\}$.

The structure of the conversation at a time $t$ is denoted $S_t$. The structure in which one person $i$ is addressing all the other participants is written as

[formula omitted in source text]

and is called a "convergence structure". A situation in which the conversation proceeds only between two of the participants, $i$ and $j$, that is, in which persons $i$ and $j$ are the speaker or the addressee, is called a "dyad link" and is denoted by the symbol

[formula omitted in source text]

Structures other than these are called "divergence structures" and are written $S_t = R^0$. In a conversation among $N$ ($\geq 3$) persons, taking into account the combinations of target persons for the three kinds of structure above, there are assumed to be $M = N + {}_N C_2 + 1$ structures, and at each time the conversation takes one of the conversation states

[formula omitted in source text]

The sequence of conversation states over the target time range is written $S_{1:T} = \{S_1, S_2, \ldots, S_T\}$.
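The count $M = N + {}_N C_2 + 1$ can be checked by enumerating the structures directly: one convergence structure per speaker, one dyad link per unordered pair, and a single divergence structure. The tuple encoding and the name `enumerate_structures` below are illustrative assumptions, not notation from the patent.

```python
from itertools import combinations
from math import comb
from typing import List, Tuple

Structure = Tuple[object, ...]  # ("convergence", i), ("dyad", i, j) or ("divergence",)


def enumerate_structures(n: int) -> List[Structure]:
    """List every conversation structure for n participants numbered 1..n."""
    if n < 3:
        raise ValueError("the method assumes N >= 3 participants")
    structures: List[Structure] = [("convergence", i) for i in range(1, n + 1)]
    structures += [("dyad", i, j) for i, j in combinations(range(1, n + 1), 2)]
    structures.append(("divergence",))
    return structures


structures_4 = enumerate_structures(4)
# For N = 4: 4 convergence + 6 dyad-link + 1 divergence structures.
```

The number of structures grows quadratically with the number of participants, which keeps the discrete state space small for typical meeting sizes.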

First, as described above, in step S501 the head direction measurement unit 101 measures the head direction $h_{i,t}$ of each participant $i$ at each time $t$. As shown in FIG. 3, this measurement is the horizontal rotation angle of the head (taking the positive X-axis direction as the reference), measured as the angle formed with the coordinate axis when the person is viewed from above. The set of head direction measurements over the target time interval is written $H_{1:T} = \{H_1, \ldots, H_T\}$, with $H_t = \{h_{1,t}, \ldots, h_{N,t}\}$. Also as described above, in step S502 the utterance state measurement unit 102 measures the utterance state $u_{i,t}$ of each participant $i$ at each time $t$. The utterance state is the presence or absence of an utterance, represented by the binary value 0 or 1. The utterance observations over the target time interval are written $U_{1:T} = \{U_1, \ldots, U_T\}$, with $U_t = \{u_{1,t}, \ldots, u_{N,t}\}$. These observation data are recorded in the observation data storage unit 103.

FIG. 4 is a diagram showing the conversation model.
A dynamic Bayesian network such as that in FIG. 4 can be used as the conversation model. In this conversation model, the conversation structure is assumed to follow a Markov process with initial probability

[formula omitted in source text]

and state transition probability

[formula omitted in source text]

These parameters are collectively written as

[formula omitted in source text]

The gaze pattern $Xa_t$ is assumed to appear according to a generation probability $P(Xa_t \mid S_t)$ and a transition probability $P(Xa_t \mid Xa_{t-1}, S_{t-1})$ that depend on the conversation structure, and its likelihood is defined as

[formula omitted in source text]

Here the gaze directions of the individual persons are assumed to be conditionally independent given the conversation structure. The parameters for the gaze direction are written as

[formula omitted in source text]

The likelihood distribution of the head direction $H_t$ for a given gaze pattern $Xa_t$ is expressed with a Gaussian function as

[formula omitted in source text]

where $\mu_{ij}$ and $\sigma^2_{ij}$ are the mean and variance, respectively, of the likelihood distribution of the head direction when person $i$ looks at person $j$. Each conversation participant is assumed to speak according to a Bernoulli process that depends on the conversation state; the likelihood of the utterances is written as

[formula omitted in source text]

and the probability of speaking as

[formula omitted in source text]
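The formulas themselves are missing from this text, so the following is only a sketch of what a Gaussian head-direction likelihood and a Bernoulli utterance likelihood could look like. It assumes that head directions and utterances are independent across persons given the gaze pattern and the conversation state, which the text states explicitly only for gaze directions. The function names, the dictionary layout of `mu` and `var`, and all numeric values are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


def gaussian_pdf(x: float, mean: float, var: float) -> float:
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def head_likelihood(heads: Sequence[float], gaze: Sequence[int],
                    mu: Dict[Tuple[int, int], float],
                    var: Dict[Tuple[int, int], float]) -> float:
    """Product over persons i of N(h_i; mu[i, gaze_i], var[i, gaze_i]).

    Note: angles are treated as plain real numbers; wrap-around at 360 degrees is ignored.
    """
    p = 1.0
    for i, (h, target) in enumerate(zip(heads, gaze)):
        p *= gaussian_pdf(h, mu[(i, target)], var[(i, target)])
    return p


def utterance_likelihood(utterances: Sequence[int], speak_prob: Sequence[float]) -> float:
    """Product over persons of Bernoulli terms: q if the person speaks, 1 - q otherwise."""
    p = 1.0
    for u, q in zip(utterances, speak_prob):
        p *= q if u == 1 else 1.0 - q
    return p


# Hypothetical 3-person example; persons indexed from 0, angles in degrees.
mu = {(0, 1): 45.0, (1, 0): 225.0, (2, 0): 180.0}
var = {key: 100.0 for key in mu}
p_head = head_likelihood([45.0, 225.0, 180.0], [1, 0, 0], mu, var)
p_utt = utterance_likelihood([1, 0, 0], [0.9, 0.1, 0.1])
```

In the model described in the text, the speaking probabilities passed as `speak_prob` would depend on the current conversation state.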

Based on the conversation model above, the present invention aims to estimate all the unknown variables, namely the conversation structure sequence $S_{1:T}$, the gaze pattern sequence $Xa_{1:T}$, and the conversation model parameters

[formula omitted in source text]

from the observation data

[formula omitted in source text]

In one example of the present invention, the Gibbs sampler 106 takes a Bayesian approach and computes the joint posterior probability distribution of these unknown variables using a method called Gibbs sampling. Gibbs sampling first sets an initial value for each variable by sampling from its prior probability distribution. It then repeatedly samples each variable from its full conditional posterior probability distribution and updates the value of that variable. The sample set obtained after a sufficient number of iterations is taken to approximate the joint posterior probability distribution of the unknown variables, and the statistic calculation unit 107 calculates statistics of the unknown variables from that sample set.
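To make the update step concrete, here is a toy Gibbs sampler for a hidden discrete state sequence alone. It is a simplification under stated assumptions: the initial probabilities, transition matrix, and per-step observation likelihoods are held fixed, whereas the method in the text also samples the gaze patterns and the model parameters. The names `gibbs_sweep` and `sample_index`, the two-state setup, the burn-in length, and all numbers are illustrative.

```python
import random
from typing import List, Sequence


def sample_index(weights: Sequence[float], rng: random.Random) -> int:
    """Draw an index with probability proportional to its unnormalised weight."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def gibbs_sweep(states: List[int], init: Sequence[float],
                trans: Sequence[Sequence[float]],
                like: Sequence[Sequence[float]], rng: random.Random) -> None:
    """Resample every S_t in place from its full conditional distribution."""
    n_steps, n_states = len(states), len(init)
    for t in range(n_steps):
        weights = []
        for s in range(n_states):
            w = like[t][s]                                        # observation term
            w *= init[s] if t == 0 else trans[states[t - 1]][s]   # link to previous state
            if t + 1 < n_steps:
                w *= trans[s][states[t + 1]]                      # link to next state
            weights.append(w)
        states[t] = sample_index(weights, rng)


rng = random.Random(0)
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
like = [[0.9, 0.1]] * 3 + [[0.1, 0.9]] * 3   # observations favour state 0, then state 1
states = [rng.randrange(2) for _ in like]    # arbitrary starting values
samples = []
for sweep in range(200):
    gibbs_sweep(states, init, trans, like, rng)
    if sweep >= 50:                          # discard the first sweeps as burn-in
        samples.append(list(states))
```

The retained `samples` list stands in for the sample set from which statistics are later computed.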

In this example, a conjugate prior is adopted as the form of the prior probability distribution for each unknown variable. The priors of the initial probability of the conversation structure, the state transition probability of the conversation structure, the generation probability of the gaze pattern, and the state transition probability of the gaze pattern are each assumed to follow independent Dirichlet distributions. The priors of the mean and the variance of the head direction likelihood distribution follow a Gaussian distribution and an inverse chi-square distribution, respectively. The prior of the utterance probability follows a beta distribution.
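Conjugate priors are convenient because the posterior stays in the same family and is obtained by adding observed counts to the prior parameters. The sketch below shows the two standard updates relevant here, Beta-Bernoulli and Dirichlet-categorical; the function names and the example numbers are assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def beta_posterior(a: float, b: float, utterances: Sequence[int]) -> Tuple[float, float]:
    """Beta(a, b) prior on a speaking probability, updated with binary observations."""
    spoke = sum(utterances)
    return a + spoke, b + len(utterances) - spoke


def dirichlet_posterior(alpha: Sequence[float], counts: Sequence[int]) -> List[float]:
    """Dirichlet(alpha) prior on category probabilities, updated with category counts."""
    return [a + c for a, c in zip(alpha, counts)]


# Hypothetical data: a person spoke in 3 of 4 time steps.
post_a, post_b = beta_posterior(2.0, 2.0, [1, 1, 0, 1])
# Hypothetical transition counts out of one state into three states.
post_alpha = dirichlet_posterior([1.0, 1.0, 1.0], [5, 0, 2])
```

In a Gibbs sampler these updated parameters define the full conditional distribution from which the corresponding model parameter is redrawn.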

To set the gaze patterns and utterance states characteristic of each conversation structure, the shapes of these prior distributions are specified through hyperparameter values. For example, in the convergence structure R_i^C, in which one person i addresses all the other participants,

the distribution of the gaze direction of the speaker i is uniform, while the gaze direction of each receiver j (≠ i) is set to have a high probability toward the speaker. The utterance probability is high only for the speaker i. For the dyad-link structure R_{(i,j)}^{DL},

the probability of gaze directions that produce mutual gaze between the two persons of the target pair (i, j) is set high, and the gaze-direction distribution of persons outside the pair is uniform. The utterance probability is set high for the two persons of the pair. In the divergence structure R^0, the gaze-direction distribution of each person is uniform and the utterance probability is low. The parameter storage unit 104 stores the values set in this way.
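The hyperparameter settings described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the stored parameter format of the apparatus: the tuple encoding of a structure, the function name structure_hyperparameters, the numeric weights high and low, and the simplification that each person's gaze target is one of the other participants are all hypothetical.

```python
def structure_hyperparameters(structure, n_people, high=10.0, low=1.0):
    """Return (gaze_alpha, speech_beta) for one conversation structure.

    structure: ("convergence", i), ("dyad", i, j) or ("divergence",)
    gaze_alpha[p][t]: Dirichlet weight for person p looking at person t (t != p)
    speech_beta[p]:   (a, b) Beta parameters for person p's utterance probability
    """
    people = range(n_people)
    # Divergence defaults: uniform gaze weights, low utterance probability.
    gaze_alpha = {p: {t: low for t in people if t != p} for p in people}
    speech_beta = {p: (low, high) for p in people}
    kind = structure[0]
    if kind == "convergence":
        speaker = structure[1]
        for p in people:
            if p != speaker:
                gaze_alpha[p][speaker] = high  # receivers tend to look at the speaker
        speech_beta[speaker] = (high, low)     # only the speaker tends to talk
    elif kind == "dyad":
        i, j = structure[1], structure[2]
        gaze_alpha[i][j] = high                # mutual gaze within the pair
        gaze_alpha[j][i] = high
        speech_beta[i] = (high, low)
        speech_beta[j] = (high, low)
    elif kind != "divergence":
        raise ValueError(f"unknown structure: {structure!r}")
    return gaze_alpha, speech_beta

gaze, speech = structure_hyperparameters(("convergence", 0), n_people=4)
```

A Beta(a, b) prior has mean a / (a + b), so (high, low) favors speaking and (low, high) favors silence.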

Then, in step S505 described above, the Gibbs sampler 106 executes Gibbs sampling. First, for each of the unknown variables

a random number is generated (sampled) from the prior distribution determined by the values stored in the parameter storage unit 104, and the result is set as the value of that variable. Here, the values stored in the parameter storage unit 104 are the parameters that specify the shape of each prior probability distribution. Specifically, they are: for the initial probabilities of the conversation structure, the parameters of the Dirichlet distribution that is their prior; for the state transition probabilities of the conversation structure, the parameters of the Dirichlet distribution that is their prior; for the generation probabilities of the gaze patterns, the parameters of the Dirichlet distribution that is their prior; for the state transition probabilities of the gaze patterns, the parameters of the Dirichlet distribution that is their prior; for the mean of the head-direction likelihood distribution (a Gaussian distribution), the mean and variance of the Gaussian distribution that is its prior; and for the variance of the head-direction likelihood distribution (a Gaussian distribution), the degrees of freedom and scale parameter of the inverse chi-square distribution that is its prior.
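The initialization step above, drawing each unknown parameter once from its prior, might look like the following sketch. It uses only the Python standard library. The dictionary keys, the example hyperparameter values, and the use of a scaled inverse chi-square parameterization are assumptions made for illustration; the source does not give concrete values.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_inverse_chi_square(dof, scale, rng):
    """Draw a variance from a scaled inverse chi-square prior (assumed form)."""
    chi2 = rng.gammavariate(dof / 2.0, 2.0)  # chi-square with `dof` degrees of freedom
    return dof * scale / chi2

def sample_initial_values(hyper, rng):
    """Draw one initial value per parameter from its prior distribution."""
    return {
        "structure_init": sample_dirichlet(hyper["structure_init_alpha"], rng),
        "structure_trans": [sample_dirichlet(row, rng)
                            for row in hyper["structure_trans_alpha"]],
        "head_mean": rng.gauss(hyper["head_mean_mu"], hyper["head_mean_var"] ** 0.5),
        "head_var": sample_inverse_chi_square(hyper["head_var_dof"],
                                              hyper["head_var_scale"], rng),
        "speech_prob": rng.betavariate(*hyper["speech_beta"]),
    }

# Hypothetical hyperparameters for a model with three conversation structures.
hyper = {
    "structure_init_alpha": [1.0, 1.0, 1.0],
    "structure_trans_alpha": [[5.0, 1.0, 1.0], [1.0, 5.0, 1.0], [1.0, 1.0, 5.0]],
    "head_mean_mu": 0.0, "head_mean_var": 0.25,
    "head_var_dof": 4.0, "head_var_scale": 0.1,
    "speech_beta": (1.0, 10.0),
}
values = sample_initial_values(hyper, random.Random(1))
```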

Next, the Gibbs sampler 106 samples each unknown variable from its full conditional posterior distribution and updates the value of that variable. When the iteration count q has reached a set number, q ≥ Q', the resulting values are stored in the sample set storage unit 105. Because natural conjugate priors are used, each full conditional posterior distribution has the same functional form as the corresponding prior. Those of the initial probabilities and state transition probabilities of the conversation structure, and of the generation probabilities and state transition probabilities of the gaze patterns, are each independent Dirichlet distributions. Those of the mean and the variance of the head-direction likelihood distribution are a Gaussian distribution and an inverse chi-square distribution, respectively. That of the utterance probability is a beta distribution. Furthermore, the full conditional posterior distribution of the conversation structure at each time is

and the state is updated by sampling from this distribution. For the gaze pattern at each time, the full conditional posterior distribution is

and the state is updated by sampling from it.
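Because the priors are conjugate, the parameter updates above reduce to adding counts to the prior parameters and drawing again from the same family. The sketch below shows this for a Dirichlet prior on state transition probabilities and a beta prior on an utterance probability. The function names and the integer encoding of states are assumptions for illustration; in the actual model the counts would additionally be conditioned on the conversation structure and the person concerned.

```python
import random

def dirichlet_posterior(prior_alpha, counts):
    """Dirichlet prior plus multinomial counts gives Dirichlet(prior + counts)."""
    return [a + c for a, c in zip(prior_alpha, counts)]

def beta_posterior(prior_ab, n_speaking, n_silent):
    """Beta prior plus Bernoulli counts gives Beta(a + speaking, b + silent)."""
    a, b = prior_ab
    return (a + n_speaking, b + n_silent)

def transition_counts(states, n_states):
    """Count transitions prev -> cur in a state sequence of integer labels."""
    counts = [[0] * n_states for _ in range(n_states)]
    for prev, cur in zip(states, states[1:]):
        counts[prev][cur] += 1
    return counts

def resample_transition_matrix(prior_alpha_rows, states, rng):
    """Draw each row of the transition matrix from its Dirichlet full conditional."""
    counts = transition_counts(states, len(prior_alpha_rows))
    rows = []
    for alpha, row_counts in zip(prior_alpha_rows, counts):
        post = dirichlet_posterior(alpha, row_counts)
        draws = [rng.gammavariate(a, 1.0) for a in post]
        total = sum(draws)
        rows.append([d / total for d in draws])
    return rows

matrix = resample_transition_matrix([[1.0, 1.0], [1.0, 1.0]], [0, 0, 1, 0],
                                    random.Random(0))
```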

When the Gibbs sampler 106 has executed Q iterations of Gibbs sampling, it records the resulting values of each unknown variable in the sample set storage unit 105. For the q-th iteration, these values are Xa_{1:t}^(q), Sa_{1:t}^(q), and φ^(q), that is, the values of the unknown variables of Expression (17) at the q-th iteration. The statistic calculation unit 107 then reads the sample set (the values of each unknown variable output by the Gibbs sampler 106) from the sample set storage unit 105 and computes an estimate of each unknown variable. For example, for the conversation structure and the gaze pattern, the maximum a posteriori (MAP) estimate is computed as

Here,

and, in all other cases,

For the other unknown variables, the minimum mean-square-error estimate is computed from the sample set.
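The expressions for these estimates are not reproduced in this text, so the following is only a sketch of the standard sample-based approximations, which may differ in detail from the source: for a discrete variable, the MAP estimate is taken as the most frequent value among the retained samples at each time step, and for a continuous parameter, the minimum mean-square-error estimate is taken as the sample mean. The function names and the string labels for structures are hypothetical.

```python
from collections import Counter

def map_estimate(samples):
    """Most frequent value among the samples of one discrete variable."""
    return Counter(samples).most_common(1)[0][0]

def mmse_estimate(samples):
    """Sample mean, which minimizes the mean squared error over the samples."""
    return sum(samples) / len(samples)

def map_sequence(chains):
    """chains[q][t] is the value sampled at time step t in retained iteration q."""
    n_steps = len(chains[0])
    return [map_estimate([chain[t] for chain in chains]) for t in range(n_steps)]

# Three retained iterations of a two-step structure sequence (made-up labels).
retained = [["C1", "DL"], ["C1", "R0"], ["C2", "DL"]]
structure_hat = map_sequence(retained)
speech_prob_hat = mmse_estimate([0.2, 0.4, 0.6])
```

Only the iterations retained after the initial Q' iterations would be passed to these functions.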

The following describes some of the results obtained with the embodiment above.
The method of the present invention was applied to a conversation among four people arranged as shown in FIG. 3, over an interval of 10000 frames (about 5.6 minutes) sampled at intervals of 1/30 second.

FIG. 5 is a diagram showing part of the observation data (head direction and presence or absence of utterance).
FIG. 5 shows part of the observation data measured by the head direction measurement unit 101 and the utterance state measurement unit 102 and stored in the observation data storage unit 103.

FIG. 6 is a diagram showing part of the estimation results (gaze direction and conversation structure).
One way to present the estimation results is a time-series diagram such as FIG. 6, which displays the estimated conversation structure at each time. The diagram shows who is centrally involved in the conversation at each time. FIG. 6 shows an example of the estimation results output to the output unit 108 through the processing of the statistic calculation unit 107 after the Gibbs sampler 106 performed Q = 700 iterations (Q' = 500). It shows the gaze direction of each participant (persons 1 to 4 = P1 to P4) toward the other participants, and the conversation structure (presence or absence of conversation).

FIG. 6 is obtained by carrying out the following procedure at every time.
First, when the estimate Sb_t of the conversation state at time t is the convergence structure toward person i, that is, Sb_t = R_i^C (the same as Expression (1)), the output unit 108 displays a band at the position of person i (Sb denotes an estimate). When Sb_t is a dyad link of persons i and j, that is, Sb_t = R_{(i,j)}^{DL} (the same as Expression (2)), the output unit 108 displays a band at the position of each of the two persons i and j. When Sb_t is the divergence structure, the output unit 108 displays no band at that time.
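The band-drawing rule above can be sketched as a small text renderer. The tuple encoding of a structure ("convergence", i), ("dyad", i, j), ("divergence",), the function names, and the '#' and '.' characters are assumptions for illustration, not the output format of the output unit 108.

```python
def band_positions(structure):
    """Return the set of persons at whose positions a band is displayed."""
    kind = structure[0]
    if kind == "convergence":
        return {structure[1]}
    if kind == "dyad":
        return {structure[1], structure[2]}
    if kind == "divergence":
        return set()
    raise ValueError(f"unknown structure: {structure!r}")

def band_rows(estimates, n_people):
    """One string per person (numbered from 1); '#' marks a band, '.' marks none."""
    return ["".join("#" if p in band_positions(s) else "." for s in estimates)
            for p in range(1, n_people + 1)]

# Hypothetical estimates at three successive times.
estimates = [("convergence", 4), ("dyad", 4, 2), ("divergence",)]
rows = band_rows(estimates, n_people=4)
for person, row in enumerate(rows, start=1):
    print(f"P{person} {row}")
```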

As a way of presenting the estimated conversation structure other than FIG. 6, the output unit 108 can also indicate who is centrally involved in the conversation at each time by switching the displayed video to that of the corresponding person. By viewing such video, even a person who did not take part in the conversation can easily grasp its structure and understand its content more accurately.

In the gaze-direction graph for each participant in FIG. 6, the gaze direction of each person estimated by the conversation structure estimation apparatus (solid line) is overlaid with the manually labeled ground-truth gaze direction (broken line). Comparing the two, the agreement rate averaged about 70 percent, which is a reasonable estimation accuracy. In the part of FIG. 6 that shows the estimated conversation structure, looking at the bands at each time, a time with a single black band has the convergence structure in which the person numbered i, where that band lies, is the speaker, R_i^C,

and a time with a double band has the dyad-link structure between the two persons numbered i and j where those bands lie, R_{(i,j)}^{DL}.

A time at which no band exists has the divergence structure R^0.
In other words, in the conversation-structure graph of FIG. 6, a band for one of the persons P1 to P4 at a given time indicates a convergence structure, bands for two persons indicate a dyad link, and bands for three or more persons indicate a divergence structure.

In FIG. 6, when the estimated conversation state is the convergence structure toward person i, that is, Sb_t = R_i^C (the same as Expression (1)), the central person i is judged to be the speaker and the other persons are judged to be receivers. When the estimated conversation state is a dyad link of persons i and j, that is, Sb_t = R_{(i,j)}^{DL} (the same as Expression (2)), the two persons i and j are judged to be the speaker and the receiver (without distinguishing which is which), and the other persons are judged to be side participants. When the estimated conversation state is the divergence structure, no conversation by the group is judged to be taking place, so no speaker, receiver, or side participant is judged to be present.
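The role assignment described above amounts to a lookup from the estimated structure to a role per person. A minimal sketch follows, in which the tuple encoding of a structure, the function name assign_roles, and the role labels are hypothetical choices made for illustration.

```python
def assign_roles(structure, people):
    """Map each person to a role given one estimated conversation structure.

    structure: ("convergence", i), ("dyad", i, j) or ("divergence",)
    """
    kind = structure[0]
    if kind == "convergence":
        speaker = structure[1]
        return {p: "speaker" if p == speaker else "receiver" for p in people}
    if kind == "dyad":
        pair = {structure[1], structure[2]}
        # Within the pair, speaker and receiver are not distinguished.
        return {p: "speaker_or_receiver" if p in pair else "side_participant"
                for p in people}
    if kind == "divergence":
        return {p: None for p in people}  # no group conversation, so no roles
    raise ValueError(f"unknown structure: {structure!r}")

roles = assign_roles(("dyad", 4, 2), people=[1, 2, 3, 4])
```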

FIG. 7 shows images of the conversation participants at three times.
FIG. 8 is a diagram showing the gaze directions and the estimated conversation structure at three times.
Next, to illustrate the time transitions of the conversation structure more concretely, FIG. 7 shows images of the participants at three times (t_1 = 310, t_2 = 485, t_3 = 578), and FIG. 8 shows the gaze pattern and the estimated conversation structure at each of those times.

In FIG. 8, the thin arrows in the graph indicate the estimated gaze directions and the thick arrows indicate the ground-truth gaze directions. First, around time t_1, person 4 is expressing an opinion to everyone else. Then, around time t_2, person 2 makes an utterance agreeing with person 4's remark, person 4 turns attention to person 2, and the conversation proceeds only between persons 4 and 2, with persons 1 and 3 becoming side participants. Then, around time t_3, the floor passes to person 2, and person 3, who had been a side participant, also turns attention to person 2 and begins listening to person 2. As FIG. 8 shows, this progression of the conversation was correctly estimated as changes in gaze direction, and the state of the conversation structure was estimated in a way that matches these changes in the conversation.

Thus the present invention estimates the gaze direction of each participant and each participant's role in the conversation, both of which change as the conversation proceeds, and can thereby estimate the conversation structure appropriately.

As described above, the present invention targets face-to-face conversation among multiple persons and measures the head direction and utterance state of each person during the conversation. Based on this measured information and a probabilistic model of gaze direction and conversation structure, it jointly estimates the conversation structure and gaze directions at each time together with the model parameters. It can therefore automatically estimate the role of each participant during the conversation (speaker, receiver, or side participant) and the conversation structure expressed as the change of those roles over time.

The conversation structure estimation apparatus described above contains a computer system. The processing steps described above are stored in the form of a program on a computer-readable recording medium, and the processing is carried out when the computer reads and executes this program. Here, a computer-readable recording medium means a magnetic disk, a magneto-optical disk, a CD-ROM, a DVD-ROM, a semiconductor memory, or the like. The computer program may also be delivered to a computer over a communication line, and the computer that receives it may execute the program.

The program may implement only part of the functions described above. It may also be a so-called difference file (difference program), which implements those functions in combination with a program already recorded in the computer system.

FIG. 1 is a block diagram showing the configuration of a conversation structure estimation apparatus according to one embodiment of the present invention.
FIG. 2 is a diagram showing the processing flow of the conversation structure estimation apparatus according to one embodiment of the present invention.
FIG. 3 is a diagram showing the relative positions of the conversation participants according to one embodiment of the present invention.
FIG. 4 is a diagram showing the conversation model according to one embodiment of the present invention.
FIG. 5 is a diagram showing part of the observation data (head direction and presence or absence of utterance) according to one embodiment of the present invention.
FIG. 6 is a diagram showing part of the estimation results (gaze direction and conversation structure) according to one embodiment of the present invention.
FIG. 7 shows images of the conversation participants at three times, for explaining the transitions of the conversation structure in one example of the present invention.
FIG. 8 is a diagram showing the gaze directions and the estimated conversation structure at three times, for explaining the transitions of the conversation structure in one example of the present invention.

Explanation of symbols

1 ... Conversation structure estimation apparatus
101 ... Head direction measurement unit
102 ... Utterance state measurement unit
103 ... Observation data storage unit
104 ... Parameter storage unit
105 ... Sample set storage unit
106 ... Gibbs sampler
107 ... Statistic calculation unit
108 ... Output unit

Claims

1. A conversation structure estimation method in a conversation structure estimation apparatus, wherein:
a head direction measurement processing unit measures the head direction of each person participating in a conversation;
a gaze direction calculation processing unit calculates the gaze direction of each person based on the head direction of each person;
an utterance presence/absence measurement processing unit detects the presence or absence of each person's utterance; and
a conversation structure estimation processing unit calculates conversation structure information at each time point during the conversation based on the gaze direction of each person and the presence or absence of utterance.

2. The conversation structure estimation method according to claim 1, wherein the conversation structure estimation processing unit estimates conversation structure information indicating which of three roles each person participating in the conversation plays at each time point: a speaker, a receiver whom the speaker is addressing, or a side participant who is a conversation participant but is neither the speaker nor a receiver.

3. The conversation structure estimation method according to claim 2, wherein, based on the calculated conversation structure information, an output unit outputs a graph indicating any one of:
information indicating a situation in which one speaker addresses all the other conversation participants;
information indicating a situation in which two specific conversation participants are each a speaker or a receiver and the other persons are side participants; and
information indicating a situation that matches neither of those two situations.

4. The conversation structure estimation method according to any one of claims 1 to 3, wherein the calculation of the gaze direction of each person based on the head direction of each person, and the calculation of the conversation structure information at each time point during the conversation based on the gaze direction of each person and the presence or absence of utterance, are performed by calculating the joint posterior probability distribution of the gaze directions and conversation structure information at each time and the parameters of a probability model, using a model expression indicating the relationship between the head direction and the gaze direction, a model expression indicating the mutual relationship among the gaze direction, the presence or absence of utterance, and the conversation structure information, and probability model expressions for their temporal changes.

5. The conversation structure estimation method according to claim 4, wherein the probability model comprises:
a probability distribution of the gaze direction of each person that depends on the conversation structure;
a probability of each person uttering that depends on the conversation structure;
a probability distribution representing the head directions that each person can take for each gaze direction;
a transition probability that determines how the conversation structure changes over time; and
a transition probability that determines how the gaze direction of each person changes over time depending on the conversation structure.

6. The conversation structure estimation method according to claim 4 or 5, wherein the joint posterior probability distribution of the gaze directions and conversation structure at each time and the parameters of the probability model is calculated approximately as a sample set obtained by using a Gibbs sampler to generate random numbers from the full conditional posterior distribution of each of the unknown variables, namely the gaze direction and conversation structure at each time and the parameters of the probability model.

7. The conversation structure estimation method according to any one of claims 1 to 6, wherein the head direction of each person is measured using a magnetic sensor worn on the head of the conversation participant.

8. The conversation structure estimation method according to any one of claims 1 to 7, wherein the presence or absence of each person's utterance is detected based on the magnitude of an acoustic signal obtained from a microphone attached to each person.

9. A program to be executed by a computer of a conversation structure estimation apparatus, the program causing the computer to execute:
a process of measuring the head direction of each person participating in a conversation;
a process of calculating the gaze direction of each person based on the head direction of each person;
a process of detecting the presence or absence of each person's utterance; and
a process of calculating conversation structure information at each time point during the conversation based on the gaze direction of each person and the presence or absence of utterance.

10. A recording medium storing a program to be executed by a computer of a conversation structure estimation apparatus, the program causing the computer to execute:
a process of measuring the head direction of each person participating in a conversation;
a process of calculating the gaze direction of each person based on the head direction of each person;
a process of detecting the presence or absence of each person's utterance; and
a process of calculating conversation structure information at each time point during the conversation based on the gaze direction of each person and the presence or absence of utterance.