JP5437297B2

JP5437297B2 - Dialog state estimation apparatus, method, and program

Info

Publication number: JP5437297B2
Application number: JP2011049284A
Authority: JP
Inventors: Shiro Kumano; Kazuhiro Otsuka; Dan Mikami; Junji Yamato
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-03-07
Filing date: 2011-03-07
Publication date: 2014-03-12
Anticipated expiration: 2031-03-07
Also published as: JP2012185727A

Description

The present invention relates to a dialog state estimation apparatus, method, and program, and in particular to a dialog state estimation apparatus, method, and program for estimating the state of empathy among a plurality of interlocutors or the state of the facial expressions of a plurality of interlocutors.

Face-to-face dialog is the most basic form of communication through which people, in the course of social life, share information with others, understand others' emotions, and make decisions. Research on automatic meeting analysis has therefore attracted attention. If a system could grasp various kinds of information about a meeting, such as how turn-taking (speaker switching) proceeds, who the key persons are, and what emotions the interlocutors hold, and could support the meeting accordingly, meetings would likely become more enjoyable, meaningful, and productive. Among these meeting states, however, the emotions of the interlocutors have hardly been addressed in the field of automatic meeting analysis. Moreover, although automatically estimating the emotional states of interlocutors in a group meeting is important, no such attempt has been found.

Facial expression recognition is known as a closely related prior art. The targets of recent facial expression recognition have shifted from the six basic expressions of happiness, surprise, anger, sadness, disgust, and fear, which were the targets of early research, to spontaneous, natural expressions. For example, Non-Patent Document 1 proposes a method for discriminating three categories of expressions displayed during a meeting: smiles, loud laughter, and others.

Techniques for estimating interpersonal emotions are also known. Non-Patent Document 1 proposes a method for visualizing interpersonal emotions among interlocutors by automatically recognizing who is smiling at whom in a meeting scene, totaling how long each such smile lasted, and displaying the result as a network. Who sent a smile to whom is determined from each interlocutor's facial expression and head direction, both estimated automatically from video.

Techniques for estimating the state of a meeting as a whole group are also known. For example, a technique has been proposed that automatically estimates, as high-level states of a meeting, the state of the conversation regime, such as monologue or dialogue, and the person holding the initiative (for example, Patent Document 1). Patent Document 1 proposes a conversation model with a three-layer hierarchical structure consisting of the regime state, who is interacting with whom, and the behavior of the interlocutors.

Methods have also been proposed for estimating a network, called a social network or sociometry, that indicates whether direct relationships exist between persons in a large group (Non-Patent Documents 2 and 3). These networks are derived directly from relatively simple information such as physical proximity (Non-Patent Document 2) or motion information in images (Non-Patent Document 3).

Patent Document 1: JP 2006-338529 A

Non-Patent Document 1: S. Kumano, K. Otsuka, D. Mikami, and J. Yamato, "Recognizing communicative facial expressions for discovering interpersonal emotions in group meetings", In Proc. Int'l Conf. Multimodal Interfaces and Workshop on Machine Learning for Multimodal Interaction (ICMI-MLMI 2009), 2009.

Non-Patent Document 2: N. Eagle, A. Pentland, and D. Lazer, "Inferring social network structure using mobile phone data", In Proc. the National Academy of Sciences (PNAS), volume 106, pages 15274-15278, 2009.

Non-Patent Document 3: L. Ding and A. Yilmaz, "Learning relations among movie characters: A social network perspective", In Proc. European Conference on Computer Vision (ECCV2010), volume IV, pages 410-423, 2010.

However, the technique described in Non-Patent Document 1 attends only to the person whose facial expression is to be recognized and does not consider the context in which the expression was displayed. A further problem with that technique lies in its assumption that a facial expression directed at someone directly expresses an emotion toward that person. That is, the method cannot handle scenes in which an emotion held toward a third party or an object is conveyed to another person, for example, a scene in which two people empathize with each other by showing each other negative expressions about an object that neither of them likes.

The technique described in Patent Document 1 has the problem that it does not address states related to the emotions of the interlocutors.

The techniques described in Non-Patent Documents 2 and 3 have the problem that they pay little attention to group states related to short-term emotions.

The present invention has been made to solve the above problems, and its object is to provide a dialog state estimation apparatus, a dialog state estimation method, and a program capable of estimating the state of empathy among a plurality of interlocutors or the state of each interlocutor's facial expression.

To achieve the above object, a dialog state estimation apparatus according to the present invention includes: gaze state detection means for detecting the state of gaze between a plurality of interlocutors, taking as input an image capturing a region that includes the faces of the plurality of interlocutors; initial value setting means for setting initial values of parameters of a dialog model that represents the co-occurrence of facial expression states according to the combination of a state indicating empathy between the plurality of interlocutors and the state of gaze between the plurality of interlocutors, and for setting, in accordance with the dialog model, an initial value of the state indicating empathy between the plurality of interlocutors and an initial value of the state indicating the facial expression of each of the plurality of interlocutors; state determination means for determining, in accordance with the dialog model, the state indicating empathy between the plurality of interlocutors and the state indicating the facial expression of each of the plurality of interlocutors, based on (i) either the initial values of the parameters, the state indicating empathy, and the state indicating facial expression set by the initial value setting means, or the previously determined parameters, state indicating empathy, and state indicating facial expression, and (ii) the state of gaze between the plurality of interlocutors detected by the gaze state detection means; parameter determination means for determining the parameters of the dialog model based on the state indicating empathy between the plurality of interlocutors and the state indicating the facial expression of each of the plurality of interlocutors determined by the state determination means; and estimation means for repeating the determination by the state determination means and the determination by the parameter determination means until a predetermined convergence condition is satisfied, and estimating the state indicating empathy between the plurality of interlocutors or the state indicating the facial expression of each of the plurality of interlocutors.

A dialog state estimation method according to the present invention is a dialog state estimation method in a dialog state estimation apparatus that includes gaze state detection means, initial value setting means, state determination means, parameter determination means, and estimation means, and is characterized in that the dialog state estimation apparatus executes: a step in which the gaze state detection means detects the state of gaze between a plurality of interlocutors, taking as input an image capturing a region that includes the faces of the plurality of interlocutors; a step in which the initial value setting means sets initial values of parameters of a dialog model that represents the co-occurrence of facial expression states according to the combination of a state indicating empathy between the plurality of interlocutors and the state of gaze between the plurality of interlocutors, and sets, in accordance with the dialog model, an initial value of the state indicating empathy between the plurality of interlocutors and an initial value of the state indicating the facial expression of each of the plurality of interlocutors; a step in which the state determination means determines, in accordance with the dialog model, the state indicating empathy between the plurality of interlocutors and the state indicating the facial expression of each of the plurality of interlocutors, based on (i) either the initial values of the parameters, the state indicating empathy, and the state indicating facial expression set by the initial value setting means, or the previously determined parameters, state indicating empathy, and state indicating facial expression, and (ii) the state of gaze between the plurality of interlocutors detected by the gaze state detection means; a step in which the parameter determination means determines the parameters of the dialog model based on the state indicating empathy between the plurality of interlocutors and the state indicating the facial expression of each of the plurality of interlocutors determined by the state determination means; and a step in which the estimation means repeats the determination by the state determination means and the determination by the parameter determination means until a predetermined convergence condition is satisfied, and estimates the state indicating empathy between the plurality of interlocutors or the state indicating the facial expression of each of the plurality of interlocutors.

According to the present invention, the gaze state detection means detects the state of gaze between a plurality of interlocutors, taking as input an image capturing a region that includes the faces of the plurality of interlocutors. The initial value setting means sets initial values of parameters of a dialog model that represents the co-occurrence of facial expression states according to the combination of a state indicating empathy between the plurality of interlocutors and the state of gaze between the plurality of interlocutors, and sets, in accordance with the dialog model, an initial value of the state indicating empathy between the plurality of interlocutors and an initial value of the state indicating the facial expression of each of the plurality of interlocutors.

The state determination means then determines, in accordance with the dialog model, the state indicating empathy between the plurality of interlocutors and the state indicating the facial expression of each of the plurality of interlocutors, based on (i) either the initial values of the parameters, the state indicating empathy, and the state indicating facial expression set by the initial value setting means, or the previously determined parameters, state indicating empathy, and state indicating facial expression, and (ii) the state of gaze between the plurality of interlocutors detected by the gaze state detection means. The parameter determination means determines the parameters of the dialog model based on the state indicating empathy between the plurality of interlocutors and the state indicating the facial expression of each of the plurality of interlocutors determined by the state determination means.

The estimation means then repeats the determination by the state determination means and the determination by the parameter determination means until a predetermined convergence condition is satisfied, and estimates the state indicating empathy between the plurality of interlocutors or the state indicating the facial expression of each of the plurality of interlocutors.

In this way, by using a dialog model that represents the co-occurrence of facial expression states according to the combination of a state indicating empathy between a plurality of interlocutors and the state of gaze between the plurality of interlocutors, the state of empathy among the plurality of interlocutors or the state of each interlocutor's facial expression can be estimated.
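The alternation described above (determine the states, determine the parameters, repeat until a convergence condition holds) can be sketched as a generic control loop. This is only a minimal sketch of the control flow, not the dialog model itself: the function name `alternate_until_converged`, its callback arguments, and the toy one-dimensional clustering problem used to exercise it are all assumptions introduced for illustration.

```python
from typing import Callable, List, Tuple

States = List[int]
Params = Tuple[float, float]


def alternate_until_converged(
    states: States,
    params: Params,
    determine_states: Callable[[Params, States], States],
    determine_params: Callable[[States, Params], Params],
    converged: Callable[[Params, Params], bool],
    max_iters: int = 100,
) -> Tuple[States, Params]:
    """Alternate state and parameter determination until convergence."""
    for _ in range(max_iters):
        # State determination: uses the current parameters and previous states.
        new_states = determine_states(params, states)
        # Parameter determination: uses the states that were just determined.
        new_params = determine_params(new_states, params)
        done = converged(params, new_params)
        states, params = new_states, new_params
        if done:
            break
    return states, params


# Toy stand-in for the dialog model (hypothetical): two cluster centres play
# the role of "parameters" and cluster labels play the role of "states".
DATA = [0.0, 0.2, 0.4, 5.0, 5.2, 5.4]


def assign_labels(params: Params, _previous: States) -> States:
    return [0 if abs(x - params[0]) <= abs(x - params[1]) else 1 for x in DATA]


def update_centres(states: States, params: Params) -> Params:
    centres = []
    for k in (0, 1):
        members = [x for x, s in zip(DATA, states) if s == k]
        # Keep the old centre if a cluster happens to be empty.
        centres.append(sum(members) / len(members) if members else params[k])
    return (centres[0], centres[1])


def centres_stable(old: Params, new: Params) -> bool:
    return max(abs(a - b) for a, b in zip(old, new)) < 1e-9


labels, centres = alternate_until_converged(
    [0] * len(DATA), (0.0, 1.0), assign_labels, update_centres, centres_stable
)
```

In the apparatus described here, the states would be the empathy and facial expression states and the parameters would be those of the dialog model; the toy callbacks only show where each determination step plugs in.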

The state determination means according to the present invention may, based on (i) either the initial values of the parameters, the state indicating empathy, and the state indicating facial expression set by the initial value setting means, or the previously determined parameters, state indicating empathy, and state indicating facial expression, and (ii) the state of gaze between the plurality of interlocutors detected by the gaze state detection means, obtain, in accordance with the dialog model, a first probability distribution indicating the probability that the state indicating empathy between the plurality of interlocutors takes each state and a second probability distribution indicating the probability that the state indicating the facial expression of each of the plurality of interlocutors takes each state, determine the state indicating empathy between the plurality of interlocutors according to the obtained first probability distribution, and determine the state indicating the facial expression of each of the plurality of interlocutors according to the obtained second probability distribution. The parameter determination means may, based on the state indicating empathy between the plurality of interlocutors and the state indicating the facial expression of each of the plurality of interlocutors determined by the state determination means, obtain a third probability distribution indicating the probability that the parameters of the dialog model take each value, and determine the parameters of the dialog model according to the obtained third probability distribution.
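One plausible reading of "determining a state according to a probability distribution" is drawing a sample from a categorical distribution over the possible states. The sketch below illustrates that reading only; the helper names `normalize` and `determine_state`, the example weights, and the fixed random seed are assumptions for illustration, and how the distributions themselves are computed from the dialog model is not shown.

```python
import random
from typing import List, Sequence

EMPATHY_STATES = ("empathy", "indifference", "antipathy")


def normalize(weights: Sequence[float]) -> List[float]:
    """Turn non-negative weights into a probability distribution."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def determine_state(states: Sequence[str], probs: Sequence[float],
                    rng: random.Random) -> str:
    """Draw one state according to the given probability distribution."""
    return rng.choices(states, weights=probs, k=1)[0]


# Hypothetical "first probability distribution" for one interlocutor pair.
rng = random.Random(0)
first_distribution = normalize([1.0, 1.0, 2.0])
draws = [determine_state(EMPATHY_STATES, first_distribution, rng)
         for _ in range(1000)]
```

The same helper would apply unchanged to the second distribution (over facial expression states); the third distribution is over parameter values and would generally need a different sampler.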

The gaze state detection means according to the present invention may take time-series data of the image as input and detect time-series data of the state of gaze between the plurality of interlocutors; the initial value setting means may set the initial values of the parameters of the dialog model and also set initial values of time-series data of the state indicating empathy between the plurality of interlocutors and initial values of time-series data of the facial expression state of each of the plurality of interlocutors; the state determination means may determine time-series data of the state indicating empathy between the plurality of interlocutors and time-series data of the state indicating the facial expression of each of the plurality of interlocutors; and the estimation means may estimate the time-series data of the state indicating empathy between the plurality of interlocutors or the time-series data of the state indicating the facial expression of each of the plurality of interlocutors. This makes it possible to accurately estimate the state of empathy among the plurality of interlocutors or the state of each interlocutor's facial expression.

The dialog state estimation apparatus according to the present invention may further include facial expression estimation means for estimating the facial expression state of each of the plurality of interlocutors based on the image, and the state determination means may determine the state indicating the facial expression of each of the plurality of interlocutors based on the parameters set by the initial value setting means and the estimation result of the facial expression estimation means.

The states of gaze between the plurality of interlocutors according to the present invention may be mutual gaze, one-sided gaze, and mutual aversion, or mutual gaze and one-sided gaze. The states indicating empathy between the plurality of interlocutors may be empathy, indifference, and antipathy. The facial expression states may be positive, neutral, and negative, or expressionless, smile, surprise, and disgust.

The estimation means according to the present invention may, after repeating the determination by the state determination means and the determination by the parameter determination means until the predetermined convergence condition is satisfied, further repeat the determination by the state determination means and the determination by the parameter determination means a plurality of times, and estimate the state indicating empathy between the plurality of interlocutors or the state indicating the facial expression of each of the plurality of interlocutors based on the states indicating empathy between the plurality of interlocutors, or the states indicating the facial expression of each of the plurality of interlocutors, that were determined by the state determination means during those repetitions.
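The text does not say how the states determined during the post-convergence repetitions are combined into a final estimate. As one hedged possibility, the sketch below takes a per-time-step majority vote over the repeated determinations; the function name `aggregate_by_majority` and the example sequences are hypothetical, and majority voting is an assumption rather than something the source specifies.

```python
from collections import Counter
from typing import List, Sequence


def aggregate_by_majority(samples: Sequence[Sequence[str]]) -> List[str]:
    """Pick, for each time step, the state that occurs most often across samples."""
    if not samples:
        raise ValueError("at least one sample is required")
    length = len(samples[0])
    if any(len(sample) != length for sample in samples):
        raise ValueError("all samples must cover the same time steps")
    estimate = []
    for t in range(length):
        counts = Counter(sample[t] for sample in samples)
        estimate.append(counts.most_common(1)[0][0])
    return estimate


# Hypothetical empathy-state sequences for one interlocutor pair, one list
# per post-convergence repetition, each covering three time steps.
post_convergence_samples = [
    ["empathy", "indifference", "antipathy"],
    ["empathy", "antipathy", "antipathy"],
    ["indifference", "antipathy", "antipathy"],
]
estimate = aggregate_by_majority(post_convergence_samples)
```

Ties are resolved by first occurrence in this sketch; a real system might instead report the full empirical distribution per time step.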

A program according to the present invention is a program for causing a computer to function as each means of the dialog state estimation apparatus described above.

As described above, the dialog state estimation apparatus, dialog state estimation method, and program of the present invention use a dialog model that represents the co-occurrence of facial expression states according to the combination of a state indicating empathy between a plurality of interlocutors and the state of gaze between the plurality of interlocutors, and thereby provide the effect that the state of empathy among the plurality of interlocutors or the state of each interlocutor's facial expression can be estimated.

(A), (B) Diagrams for explaining the dialog model.
(A), (B) Diagrams for explaining the facial expression co-occurrence matrix.
(A), (B) Diagrams for explaining the method of estimating the facial expression state.
A schematic diagram showing the configuration of the dialog state estimation apparatus according to an embodiment of the present invention.
(A) A diagram showing how a dialog is filmed, and (B) an illustration of the images captured of each interlocutor.
A flowchart showing the contents of the dialog state estimation processing routine in the dialog state estimation apparatus according to the embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

<Overview>
First, an overview of the present invention is given. To understand a meeting deeply, it is important to grasp what interpersonal relationships are built, and how agreement is consequently formed, through the emotional messages exchanged between interlocutors. The basic units of such emotional messages in meetings are considered to be empathy, indifference, and antipathy. For example, empathy often indicates agreement or a close interpersonal relationship. Empathy induces synchronized behavior between interlocutors, known as the chameleon effect or motor mimicry. Empathy can induce not only behavior expressing positive emotions, such as smiling, but also behavior expressing negative emotions, such as pity. On the other hand, although they are less well understood than empathy, antipathy and indifference are expected to tend to induce unsynchronized behavior.

Among the interlocutor behaviors related to empathy and antipathy, the combination of facial expression and gaze is particularly important, because together they can convey a large amount of emotional messages instantly and directionally (to a specific partner). Facial expression is the primary means of expressing emotional messages. For example, a smile expresses positive emotion and also builds and demonstrates a close interpersonal relationship with the partner. A neutral (neither positive nor negative) expression, including a blank expression, can be regarded as an antipathetic reaction when the partner is showing a positive expression. Gaze, for its part, has various functions, but in empathic interaction monitoring and triggering are particularly important. Monitoring is essential for observing another person's facial expression and reading that person's emotions from it. Triggering causes a person to direct their gaze toward someone who is looking at them and, further, to react to that person. Through these two functions, the two parties engage in mutual gaze (eye contact), and emotional messages are instantly exchanged and shared via facial expressions. Mutual gaze in turn induces synchronized behavior more strongly.

In the following, the term "empathic interaction" is defined as a short event in which two parties exchange emotional messages by combining facial expression and gaze. The emotional message sent may, as a result, be shared, or it may be ignored or rejected. In short, empathic interaction represents "who is empathizing with, indifferent to, or antipathetic toward whom". The state of empathic interaction is an example of a state indicating empathy.

Here is an example of how emotional messages are exchanged between two parties through the combination of facial expression and gaze. Consider the case where person A directs a certain expression at person B. If the two are empathizing with each other, person B returns an expression similar to person A's, for example, a smile in response to a smile, or an expression of pity in response to a sad face. If the two are antipathetic toward each other, person B may display an expression different from person A's, for example, showing disgust in response to a smile, or a wry smile in response to a perplexed expression. Two parties who are indifferent to each other often show neutral expressions regardless of the other's expression.
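The qualitative tendencies in this example (similar expressions under empathy, differing expressions under antipathy, neutral expressions under indifference) can be turned into a toy co-occurrence table to see how a pair of expressions hints at the interaction type. This is a rough sketch under strong assumptions: the weights `HIGH` and `LOW` are arbitrary illustrative values and not data from the source, the uniform prior is an assumption, the gaze state is ignored, and the names `raw_weight`, `build_cooccurrence`, and `empathy_posterior` are hypothetical.

```python
from itertools import product
from typing import Dict, Tuple

EMPATHY_STATES = ("empathy", "indifference", "antipathy")
EXPRESSIONS = ("positive", "neutral", "negative")
HIGH, LOW = 3.0, 1.0  # arbitrary illustrative weights, not from the source

Table = Dict[str, Dict[Tuple[str, str], float]]


def raw_weight(e: str, f_a: str, f_b: str) -> float:
    """Encode the qualitative tendencies described in the text."""
    if e == "empathy":
        return HIGH if f_a == f_b else LOW          # similar expressions
    if e == "antipathy":
        return HIGH if f_a != f_b else LOW          # differing expressions
    return HIGH if f_a == f_b == "neutral" else LOW  # indifference: both neutral


def build_cooccurrence() -> Table:
    """Normalize the weights so each table is a distribution over (f_a, f_b)."""
    pairs = list(product(EXPRESSIONS, repeat=2))
    table: Table = {}
    for e in EMPATHY_STATES:
        total = sum(raw_weight(e, a, b) for a, b in pairs)
        table[e] = {(a, b): raw_weight(e, a, b) / total for a, b in pairs}
    return table


def empathy_posterior(table: Table, f_a: str, f_b: str) -> Dict[str, float]:
    """Bayes' rule with a uniform prior over the interaction types."""
    likelihood = {e: table[e][(f_a, f_b)] for e in EMPATHY_STATES}
    total = sum(likelihood.values())
    return {e: value / total for e, value in likelihood.items()}


TABLE = build_cooccurrence()
smiles = empathy_posterior(TABLE, "positive", "positive")
clash = empathy_posterior(TABLE, "positive", "negative")
```

With these made-up weights, two positive expressions favor empathy and a positive/negative pair favors antipathy; in the described apparatus the corresponding table entries would be model parameters that are determined iteratively rather than fixed by hand.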

Accordingly, the present invention simultaneously estimates facial expressions, which are the main means of expressing the emotional messages exchanged between interlocutors during a meeting, and the empathic interaction that takes place in close connection with them. As the interlocutor behaviors key to empathic interaction, the present invention focuses on facial expression and gaze, because their combination enables a large amount of empathic messages to be conveyed instantly and directionally. The present invention assumes that the empathic interaction of an interlocutor pair is closely related to their facial expressions, and that the expressions of the two co-occur according to the type of empathic interaction. Based on this assumption, the present invention uses a probabilistic model and simultaneously estimates empathic interaction and facial expressions with video data as the observation data.

<Overview of the dialog model>
Next, the dialog model used in the dialog state estimation apparatus proposed in the present invention is described.

As one way of modeling the relationship between empathic interaction and interlocutor behavior, the present embodiment uses a hierarchical dynamic Bayesian network. In this model, the state of the top layer is discrete and transitions according to a Markov process, and that state governs the state of the layer below it. Here, empathic interaction is the upper layer, and interlocutor behavior, including facial expression and gaze, is the middle layer. Interlocutor behavior in turn affects the video signal, which is the layer below it.

FIG. 1(A) shows a hierarchical dynamic Bayesian network for estimating the state of empathic interaction and the state of facial expression from a video signal. Here, $t$ denotes a discrete time step, $t = 1, \ldots, T$. Nodes represent variables, and arrows represent probabilistic causal relationships between variables. The model of FIG. 1(A) has three variables: empathic interaction $\{E_t\}_{t=1}^{T}$, facial expression $\{F_t\}_{t=1}^{T}$, and gaze $\{X_t\}_{t=1}^{T}$. As shown in FIG. 1(B), the state of empathic interaction and the state of gaze of each interlocutor pair are assumed to influence the facial expressions of both interlocutors. This relationship is modeled by the facial expression co-occurrence matrix described later. The type of empathic interaction is also assumed to influence the gaze state. This is because, in the data used for a preliminary study, mutual gaze and one-sided gaze occurred frequently in empathic and antipathetic interactions, whereas mutual aversion occurred frequently in indifference.

The set of empathic-interaction states of all interlocutor pairs at time t is denoted E_t = {e_t^{(i,j)}}_{(i,j)∈r}. Here, e^{(i,j)} ∈ {1, ..., N_e} denotes the type of interaction between interlocutors i and j, that is, the pair (i, j), and r denotes the set of the N × (N − 1) interlocutor pairs. N_e is the number of types of empathic interaction. In this embodiment, e^{(i,j)} ∈ {"empathy", "indifference", "antipathy"}, so N_e = 3. The set of facial expression states of all interlocutors at time t is denoted F_t = {f_{i,t}}_{i=1}^{N}. Here, f_i ∈ {1, ..., N_f} denotes the facial expression state of interlocutor i, and N_f is the number of facial expression states. In this embodiment, f_i ∈ {"positive", "neutral", "negative"}, so N_f = 3.

The set of gaze states of all interlocutor pairs at time t is denoted X_t = {x_t^{(i,j)}}. Here, x_t^{(i,j)} denotes the gaze state of the interlocutor pair (i, j) and takes exactly one of three states: mutual gaze, one-sided gaze, or mutual aversion. That is, the number of gaze states is N_x = 3.

The image at time t (one frame of the video data) is denoted I_t. All interlocutors are assumed to be included in this image. The image I serves as the observation data. In this embodiment only images are used as observation data, but the audio data in the video may be used as well.

<Facial expression co-occurrence matrix>
This section describes the facial expression co-occurrence matrix, an important part of the present invention, which models how the empathic-interaction state and the gaze state influence the pattern of co-occurrence of facial expressions between interlocutors. Here, gaze is the only interlocutor behavior other than facial expression included in the dialog model, but other interlocutor behaviors, such as head gestures, may also be included.

In this embodiment, empathic interaction is defined for each interlocutor pair, that is, between two persons, and takes discrete states. Here, there are assumed to be three empathic-interaction states: "empathy", "indifference", and "antipathy".

Facial expression is defined for each interlocutor and is assumed to be classifiable into a plurality of states. Any set of states can be used in the present invention; this embodiment assumes the three states "positive", "neutral", and "negative". Alternatively, states such as "expressionless", "smile", "surprise", and "disgust" may be used.

Here, three gaze states are defined: "mutual gaze", "one-sided gaze", and "mutual aversion". Mutual gaze is the state in which the two interlocutors are looking at each other, that is, making eye contact. One-sided gaze is the state in which only one interlocutor of the pair is looking at the other. Mutual aversion is the state in which neither interlocutor is looking at the other.
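
The three gaze states defined above can be derived mechanically once it is known whom each interlocutor is looking at. The following is a minimal sketch under that assumption; the `Gaze` enum, the `pair_gaze_state` function, and the `looks_at` mapping are hypothetical names introduced only for illustration and do not appear in the original description.

```python
from enum import Enum


class Gaze(Enum):
    MUTUAL_GAZE = "mutual gaze"
    ONE_SIDED_GAZE = "one-sided gaze"
    MUTUAL_AVERSION = "mutual aversion"


def pair_gaze_state(i, j, looks_at):
    """Classify the gaze state of the pair (i, j).

    looks_at maps each interlocutor to the person being looked at,
    or to None when the interlocutor looks at nobody (assumed input).
    """
    i_looks_at_j = looks_at.get(i) == j
    j_looks_at_i = looks_at.get(j) == i
    if i_looks_at_j and j_looks_at_i:
        return Gaze.MUTUAL_GAZE
    if i_looks_at_j or j_looks_at_i:
        return Gaze.ONE_SIDED_GAZE
    return Gaze.MUTUAL_AVERSION


# Four interlocutors: 1 and 2 look at each other, 3 looks at 1, 4 looks away.
looks_at = {1: 2, 2: 1, 3: 1, 4: None}
```

With this toy input, the pair (1, 2) is in mutual gaze, the pair (1, 3) is in one-sided gaze, and the pair (2, 4) is in mutual aversion.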

Here, the pattern of frequencies describing which facial expressions interlocutors tend to display to each other is called the facial expression co-occurrence matrix.

FIGS. 2(A) and 2(B) show the facial expression co-occurrence matrices for each combination of empathic-interaction state and gaze state. These matrices were created from recorded data of actual dialogs (about 30 minutes of dialog in total) using manually assigned labels for empathic interaction, facial expression, and gaze state. In FIG. 2, the brighter the color of a matrix element, the higher the frequency of that element. The "hard to judge" state means that the labeler judged it difficult to assign any of the empathy, indifference, or antipathy labels to the frame, for example when the two interlocutors are not interacting directly. By definition, the facial expression co-occurrence matrix is symmetric for mutual gaze and mutual aversion, but is generally asymmetric for one-sided gaze. In FIG. 2(B), for one-sided gaze, person i is the one who is looking and person j is the one being looked at. In FIG. 2(A), clear differences in pattern can be seen among the facial expression co-occurrence matrices. This supports the validity of the hypothesis of the present invention that the pattern of facial expression co-occurrence changes with the combination of empathic-interaction state and gaze state. Moreover, many features of these co-occurrence patterns are broadly consistent with established psychological findings. a) In the state of empathy, co-occurrence of positive expressions is frequent. b) Mutual gaze more strongly induces behavioral synchrony, that is, matching of facial expressions. For example, the frequency of co-occurring positive expressions follows the order mutual gaze, one-sided gaze, mutual aversion.

Other intuitively plausible features are also obtained. c) In antipathy, a negative expression tends to be displayed by one or both interlocutors. d) In indifference, particularly under mutual gaze, combinations of a neutral expression with another expression (positive or negative) are prominent. e) In scenes labeled hard to judge, co-occurrence of neutral expressions stands out.

In the above data, non-negligible differences among the facial expression co-occurrence matrices were also seen in the mutual-aversion state. The likely reason is that the words and actions of another person can be perceived through peripheral vision and auditory information even without looking at that person. In this embodiment, therefore, the mutual-aversion state is also included in the model. Within the framework of the present invention, however, the mutual-aversion state may be excluded from the model.

The facial expression co-occurrence matrix is generated by taking a frequency matrix of simple counts (called the original frequency matrix) and normalizing it by the occurrence frequencies of the facial expressions involved. Specifically, for the matrices shown in FIG. 2(B), the actual frequency of each combination of facial expressions of interlocutors i and j is first counted for a given combination of gaze state and empathic-interaction state (the original frequency matrix). That frequency matrix is then divided by the frequency vectors indicating how often interlocutor i and interlocutor j each displayed the respective expressions, and the result is taken as the normalized frequency matrix. This normalization makes the differences among the co-occurrence matrices of the different empathic-interaction states clearer.
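
The counting and normalization described above can be sketched as follows. The description does not spell out the exact form of the division, so this sketch assumes that each cell is divided by the product of the two interlocutors' overall frequencies for the corresponding expressions and that the result is rescaled to sum to 1. The function names `cooccurrence_counts` and `normalize_cooccurrence` and the toy labels are hypothetical.

```python
def cooccurrence_counts(labels_i, labels_j, n_f=3):
    """Original frequency matrix: counts[f][g] = frames where i shows f and j shows g."""
    counts = [[0] * n_f for _ in range(n_f)]
    for f, g in zip(labels_i, labels_j):
        counts[f][g] += 1
    return counts


def normalize_cooccurrence(counts, freq_i, freq_j):
    """Divide each cell by freq_i[f] * freq_j[g] (assumed form), then rescale to sum to 1."""
    n_f = len(counts)
    scaled = [
        [
            counts[f][g] / (freq_i[f] * freq_j[g]) if freq_i[f] and freq_j[g] else 0.0
            for g in range(n_f)
        ]
        for f in range(n_f)
    ]
    total = sum(map(sum, scaled))
    return [[value / total for value in row] for row in scaled]


# Expression labels per frame: 0 = positive, 1 = neutral, 2 = negative.
labels_i = [0, 0, 0, 0, 1, 2]
labels_j = [0, 0, 0, 1, 1, 2]
counts = cooccurrence_counts(labels_i, labels_j)

# Overall expression frequencies of each interlocutor (here taken from the same frames).
freq_i = [labels_i.count(f) for f in range(3)]
freq_j = [labels_j.count(f) for f in range(3)]
matrix = normalize_cooccurrence(counts, freq_i, freq_j)
```

In this toy data the raw count of positive/positive pairs is the largest, yet after normalization the negative/negative cell carries more weight, which illustrates how the normalization discounts expressions that are simply frequent overall.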

Conversely, the original frequency matrix is strongly affected by how often each facial expression category is displayed, which varies with the theme and type of the dialog and with individual personality. In particular, people tend to smile more when others are around, so the original frequency matrix shows a high co-occurrence of positive expressions.

In this embodiment, facial expression co-occurrence models which combinations of the two interlocutors' expressions occur at the same time, but the time lag between their expressions may also be taken into account. That is, co-occurrence may be modeled between the expression of one interlocutor i at a certain time and the expression of the other interlocutor j a time δt later. In this case, interlocutor i may be, for example, the interlocutor who first directed their gaze at the other, or the interlocutor who first changed from an expressionless face to some other expression.

Furthermore, co-occurrence of facial expressions may be considered among three or more persons. This is effective, for example, in scenes where many participants laugh when someone tells a joke, or where everyone wears a troubled expression because the discussion is not going well.

Although three empathic-interaction states are assumed, the number of states may be one. In that case the empathic-interaction state itself carries no meaning, but information on which expressions tend to co-occur between interlocutors can still be used, and each interlocutor's expression may then be estimated more accurately than by recognizing each person individually. Examples are scenes in which an interlocutor's face in the video is turned sideways or upward rather than toward the front, or in which the face is partly hidden by a hand or an object.

<Prior distribution>
In this embodiment, because it is mathematically convenient, the conjugate prior distribution for the posterior distribution, shown in the following equation, is used as the prior distribution of the parameters.

Here, k is an index used for convenience to indicate which parameter is meant, and φ_k stands for π, θ, or λ. The Dirichlet distribution, which is commonly used for discrete variables, is used as the parameter prior distribution P(φ_k).
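
The convenience of a conjugate Dirichlet prior can be illustrated with a short sketch. It relies on the standard result that a Dirichlet prior combined with categorical counts yields a Dirichlet posterior whose concentration parameters are the prior values plus the counts. The function names and the numeric values are hypothetical, and the sampler uses the usual construction from normalized gamma draws.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def posterior_alpha(alpha, counts):
    """Conjugate update: Dirichlet(alpha) prior plus category counts."""
    return [a + n for a, n in zip(alpha, counts)]


rng = random.Random(0)
alpha = [1.0, 1.0, 1.0]   # hyperparameters of the prior (illustrative values)
counts = [12, 5, 1]       # how often each of three states was observed (illustrative)
post = posterior_alpha(alpha, counts)
theta = sample_dirichlet(post, rng)  # one plausible parameter vector under the posterior
```

Because the posterior stays in the Dirichlet family, drawing a new parameter vector reduces to adding counts and calling the same sampler again.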

The dialog state estimation apparatus according to this embodiment takes the observation data sequence I_{1:T} as input and, based on the proposed dialog model, simultaneously estimates the empathic-interaction time series E_{1:T}, the facial expression time series F_{1:T}, and the model parameters φ. In this embodiment, the joint posterior probability distribution p(E_{1:T}, F_{1:T}, φ | I_{1:T}) of all these random variables is estimated within the framework of Bayesian inference. In Bayesian inference, the prior distribution p(φ) of the model parameters is introduced as prior information about the model.

The joint posterior probability distribution of all variables is expressed by the following equation (1), using the joint probability distribution and the prior probability of the observation data I_{1:T}.

p(E_{1:T}, F_{1:T}, X_{1:T}, φ | I_{1:T}) = p(E_{1:T}, F_{1:T}, X_{1:T}, I_{1:T}, φ) / p(I_{1:T})   ... (1)

Since the observation data I_{1:T} is known and p(I_{1:T}) can be regarded as a constant, the following equation (2) is obtained.

p(E_{1:T}, F_{1:T}, X_{1:T}, φ | I_{1:T}) ∝ p(E_{1:T}, F_{1:T}, X_{1:T}, I_{1:T}, φ)   ... (2)

Here, the symbol ∝ indicates that the left-hand side is proportional to the right-hand side. In accordance with the Bayesian network shown in FIG. 1(A), the joint probability distribution on the right-hand side of equation (2) is defined as shown in the following equation (3).

p(E_{1:T}, F_{1:T}, X_{1:T}, I_{1:T}, φ)
:= p(φ) P(E_{1:T} | φ) P(X_{1:T} | E_{1:T}, φ) P(F_{1:T} | E_{1:T}, X_{1:T}, φ) P(I_{1:T} | F_{1:T}, φ)   ... (3)

Hereinafter, unless specifically needed, the model parameters φ are omitted for brevity.

Assuming that empathic interaction is independent across interlocutor pairs and that each follows a first-order Markov process, the prior probability P(E_{1:T}) of the time series of empathic-interaction states is expressed by the following equation (4).

Here, two quantities are defined: the initial probability of the empathic-interaction state and the transition probability of the empathic-interaction state. These parameters are prepared for each interlocutor pair. Both probabilities are time-invariant and are further expressed by the following equation (7).
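
A prior of this kind, independent across pairs and first-order Markov within each pair, can be evaluated as in the sketch below. It is an illustration under the stated assumptions, not the patent's own formula: the function names, the dictionary layout keyed by pair, and the probability values are all hypothetical.

```python
import math


def markov_log_prior(states, initial, transition):
    """Log-probability of one pair's state sequence under a first-order Markov chain."""
    logp = math.log(initial[states[0]])
    for prev, cur in zip(states, states[1:]):
        logp += math.log(transition[prev][cur])
    return logp


def joint_log_prior(sequences, initial, transition):
    """Pairs are assumed independent, so their log-probabilities add up.

    Each pair has its own time-invariant initial and transition probabilities.
    """
    return sum(
        markov_log_prior(seq, initial[pair], transition[pair])
        for pair, seq in sequences.items()
    )


# States: 0 = empathy, 1 = indifference, 2 = antipathy (illustrative numbers).
pair = (1, 2)
initial = {pair: [0.5, 0.3, 0.2]}
transition = {pair: [[0.90, 0.05, 0.05],
                     [0.05, 0.90, 0.05],
                     [0.05, 0.05, 0.90]]}
sequences = {pair: [0, 0, 1]}
log_prior = joint_log_prior(sequences, initial, transition)
```

Working in the log domain avoids numerical underflow when T is large, which is a design choice of this sketch rather than something stated in the description.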

The likelihood P(X_{1:T} | E_{1:T}) of empathic interaction with respect to the gaze state is defined as a product of individual likelihoods, assuming independence across interlocutor pairs and across time. That is, P(X_{1:T} | E_{1:T}) is expressed by the following equation (8).
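
Under the two independence assumptions just stated, a product of this kind would take the following form. This is a reconstruction from the surrounding definitions of x_t^{(i,j)}, e_t^{(i,j)}, and r, offered as a sketch; it is not copied from the original equation.

```latex
P(X_{1:T} \mid E_{1:T})
  = \prod_{(i,j) \in r} \; \prod_{t=1}^{T}
    P\bigl(x_t^{(i,j)} \mid e_t^{(i,j)}\bigr)
```

Each factor depends only on the gaze state and the empathic-interaction state of the same pair at the same time step, which is what makes per-pair, per-frame evaluation possible.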

For this likelihood, each interlocutor pair has its own parameters, as shown in the following equation (9).

The parameter vectors π_· = {π_{·,e'} | e' ∈ {1, ..., N_e}} for the respective empathic-interaction states are each assumed to follow an independent Dirichlet distribution.

The conditional probability P(F_{1:T} | E_{1:T}, X_{1:T}) of the facial expressions, given the empathic-interaction states and the gaze states, is expressed as the product of the prior probabilities of the facial expressions and the facial expression co-occurrence matrices. That is, P(F_{1:T} | E_{1:T}, X_{1:T}) is expressed by the following equation (10).

Here, P(f_{i,1:T}) is the prior distribution of the facial expression of interlocutor i, and independence among interlocutors is assumed. Assuming further that facial expressions are independent across time, the prior distribution of each interlocutor's facial expression is expressed by the following equation (11).

M_{e,x}^{(i,j)}(f, f') denotes the (f, f') element of the facial expression co-occurrence matrix for the interlocutor pair (i, j) when the empathic-interaction state is e and the gaze state is x. A facial expression co-occurrence matrix M exists for each combination of empathic-interaction state and gaze state, giving N_e × N_x matrices in total. Each matrix M has size N_f × N_f.

Each interlocutor pair and each interlocutor has its own parameters for the facial expression co-occurrence matrices and the facial expression prior probabilities, respectively. For the prior distribution, the probability that the facial expression state of interlocutor i is f' is expressed by the following equation (12).

For the interlocutor pair (i, j), on the other hand, the probability M_{e,x}^{(i,j)}(f, f') that the facial expression of interlocutor i is f and, at the same time, the facial expression of interlocutor j is f', when the empathic-interaction state is e and the gaze state is x, is expressed by the following equation (13).

The elements of a facial expression co-occurrence matrix, which is an N_f × N_f matrix, sum to 1. Here, each facial expression co-occurrence matrix is regarded as a single vector with N_f × N_f elements whose sum is 1. Any form may be used as the model for the distribution of each parameter. In this embodiment, independent Dirichlet distributions are used for both the parameter vector of the facial expression prior distribution and the parameter vector of the facial expression co-occurrence matrix. Here, f'' ∈ {1, ..., N_f × N_f}.
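
The bookkeeping described here, one N_f × N_f matrix per combination of empathic-interaction state and gaze state, each viewed as a flat vector of N_f × N_f probabilities, can be sketched as follows. The container layout (a dictionary keyed by `(e, x)`) and the helper names are assumptions made for illustration.

```python
N_E, N_X, N_F = 3, 3, 3  # numbers of empathy, gaze, and expression states


def uniform_matrices():
    """One N_F x N_F matrix per (e, x) combination, each summing to 1."""
    cell = 1.0 / (N_F * N_F)
    return {
        (e, x): [[cell] * N_F for _ in range(N_F)]
        for e in range(N_E)
        for x in range(N_X)
    }


def flatten(matrix):
    """View a co-occurrence matrix as one vector with N_F * N_F elements."""
    return [value for row in matrix for value in row]


def unflatten(vector, n_f=N_F):
    """Inverse of flatten: rebuild the n_f x n_f matrix from the vector."""
    return [vector[k * n_f:(k + 1) * n_f] for k in range(n_f)]


matrices = uniform_matrices()
vector = flatten(matrices[(0, 0)])
```

Treating each matrix as a single probability vector is what allows one Dirichlet distribution to serve as its prior.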

Collecting all parameters of the dialog model gives φ = {Π, Θ}. The model parameters Π relating to empathic interaction are expressed by the following equation (16).

The model parameters Θ relating to facial expression are expressed by the following equation (17).

The prior probability distribution p(φ) of the parameters is the product of the prior distributions of these individual parameters.

In this embodiment, model parameters are prepared separately for each interlocutor pair and each interlocutor, but parameters shared across interlocutor pairs or across interlocutors may be used instead.

The likelihood P(I_{1:T} | F_{1:T}) of the facial expression states with respect to the observed video frames I is expressed by the following equation (18), assuming that facial expressions are independent across interlocutors and across time.

The likelihood P(I_t | f_{i,t}) of a facial expression state may be computed by any method. One option is a likelihood that can be computed directly from the video data, for example by the method described in JP 2009-110426 A. In that method, the distribution of the luminance (image brightness) of many points sparsely placed on the face is modeled for each facial expression with a normal distribution, and the facial expression of the target person and the position and pose of the face are estimated simultaneously from the image.

Alternatively, facial action units (Facial Action Units), which describe how individual facial parts such as the eyebrows, eyes, and mouth have moved, may be inserted between the video signal and the facial expression. In this case, the facial action units that have occurred are first detected from the video signal, and the probability of each facial expression given the detected action units is then computed as the likelihood. One way to detect facial action units is to track a plurality of feature points on the face in the image, as shown in FIGS. 3(A) and 3(B), and to detect the occurrence of an action unit from changes in the geometric arrangement of those feature points. Feature points can be tracked by template matching, using a small rectangular region centered on each feature point as the template. For detecting action units of the mouth, information on the presence or absence of speech, detected from the audio signal by, for example, thresholding the audio power, may be used in addition to the image information. Changes in the distance between two points or in the angle formed by three points are used as the changes in geometric arrangement, and detection is possible by modeling, with normal distributions, the values these take when a facial action unit is occurring and when it is not.
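
The geometric features and the two-class normal-distribution test mentioned above can be sketched as follows. The comparison of two likelihoods is one simple way to turn the two Gaussian models into a detector and is an assumption of this sketch; the function names and the model values (mean and standard deviation pairs) are hypothetical.

```python
import math


def distance(p, q):
    """Euclidean distance between two tracked feature points."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


def angle(a, b, c):
    """Angle at vertex b, in radians, formed by the three points a, b, c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    return abs(math.atan2(cross, dot))


def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def action_unit_active(change, on_model, off_model):
    """Decide whether an action unit occurs by comparing two Gaussian likelihoods.

    on_model and off_model are (mean, std) pairs for the feature change when
    the action unit is and is not occurring (illustrative values below).
    """
    return normal_pdf(change, *on_model) > normal_pdf(change, *off_model)


# Change in mouth-corner distance, in pixels, relative to a neutral frame.
on_model, off_model = (5.0, 1.5), (0.0, 1.0)
is_active = action_unit_active(4.0, on_model, off_model)
```

The same pattern applies to angle features; only the measured quantity and the two fitted models change.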

It is also possible to use a recognizer that outputs only a single estimated facial expression state for one input image, rather than a likelihood for each state. In this case, for example, the likelihood of the recognized state is set to a suitable constant τ (0 ≪ τ < 1), and the likelihood of every other facial expression state is set to (1 − τ) / (N_f − 1). The states other than the recognized one are given a small positive likelihood because, if a likelihood were 0, the posterior probability would necessarily be 0 as well and that state could never become the estimation result.
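
The conversion from a single recognized label to a likelihood vector follows directly from the two formulas above. In this minimal sketch, the function name `hard_label_likelihood` and the value τ = 0.8 are assumptions chosen for illustration.

```python
def hard_label_likelihood(recognized, n_f=3, tau=0.8):
    """Likelihood vector for a recognizer that outputs one label only.

    The recognized state gets tau; every other state gets (1 - tau) / (n_f - 1),
    so no state is ever assigned a likelihood of exactly zero.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    rest = (1.0 - tau) / (n_f - 1)
    return [tau if f == recognized else rest for f in range(n_f)]


# Recognizer output "neutral" (index 1) among positive / neutral / negative.
likelihood = hard_label_likelihood(1)
```

Keeping the remaining mass strictly positive lets the co-occurrence model overrule the recognizer when the other evidence is strong enough.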

<System configuration>
Next, an embodiment of the present invention is described, taking as an example the case where the invention is applied to a dialog state estimation apparatus that takes video data, that is, a time series of image data, as input and estimates the empathic interaction among, and the facial expressions of, a plurality of interlocutors.

As shown in FIG. 4, the dialog state estimation apparatus according to this embodiment includes an input unit 1 that accepts input of video data in which a plurality of interlocutors have been recorded, a calculation unit 2 that estimates the empathic-interaction state of every interlocutor pair and the facial expression state of each interlocutor, and an output unit 3 that outputs the estimation results.

The input unit 1 is connected to known imaging means using a CCD camera, a microphone, and the like, and accepts input of video data of a dialog, captured by the imaging means and containing the faces of the plurality of interlocutors. The input unit 1 is also connected to a keyboard and a mouse and accepts information entered by operating them.

Two main camera arrangements are conceivable: placing a single omnidirectional camera near the center of the interlocutors, and placing one camera in front of each interlocutor. Within the framework of the present invention, any arrangement may be used as long as the faces of all interlocutors can be captured. In this embodiment, an omnidirectional camera is used as the imaging means. The advantage of an omnidirectional camera is that the positional relationship among the interlocutors, which is needed to determine the direction of each interlocutor's gaze, can be grasped easily from the positions of the persons in the image, regardless of the number of interlocutors and without obtaining the camera's extrinsic parameters (translation and rotation relative to the world coordinate system). With multiple cameras, by contrast, it is easier to raise the resolution of each interlocutor's face, but the hardware configuration becomes more complex, and whenever the number of interlocutors or the seating arrangement differs between dialogs, calibration must be performed each time to obtain the extrinsic parameters of each camera. FIG. 5(A) shows an example of a recording scene, and FIG. 5(B) shows an example of an input image captured with the omnidirectional camera.

The intrinsic parameters of the camera connected to the input unit 1 are assumed to have been obtained in advance by calibration.

The output unit 3 is implemented by a display, a printer, a magnetic disk, or the like.

The calculation unit 2 is a computer comprising a CPU (Central Processing Unit), a RAM (Random Access Memory), and a ROM (Read Only Memory) that stores a program for executing the dialog state estimation processing routine described later. Functionally, it is configured as follows. The calculation unit 2 includes a data storage unit 21, an estimation unit 22, and a learning unit 23.

The data storage unit 21 stores the input video data and the hyperparameters learned by the learning unit 23. A hyperparameter is a model parameter of the prior distribution of a parameter estimated by the estimation unit 22.

The estimation unit 22 takes the video data and the hyperparameters as input and, based on the dialog model described above, simultaneously estimates the empathic-interaction state of each interlocutor pair and the facial expression state of each interlocutor in each frame of the video, together with the parameters of the dialog model.

The problem solved by the estimation unit 22 is to estimate, based on the dialog model described above, the joint posterior probability distribution p(E_{1:T}, F_{1:T}, φ | I_{1:T}) of the empathic-interaction time series E_{1:T}, the facial expression time series F_{1:T}, and the model parameters φ, given the observed signal I_{1:T}. In this embodiment, a Gibbs sampler is used as one way to solve this problem. The Gibbs sampler is a type of Markov chain Monte Carlo method and is well suited to obtaining approximate solutions for complex models. It treats one pass of sampling every variable to be estimated, one at a time in turn, as one set, and repeats this set many times. Each variable is sampled from its full conditional posterior distribution, in which all variables other than the one being sampled are held fixed. The stationary (invariant) distribution of the resulting chain then coincides with the posterior distribution to be estimated. The joint posterior probability distribution is obtained from statistics of the random variables sampled after the Markov chain has converged.
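
One step of such a sampler can be sketched for the empathic-interaction state of a single pair. The sketch assumes that the factor of the facial expression term that depends on e_t is the co-occurrence entry M_{e,x}(f_i, f_j), and it resamples only E while holding the facial expressions and all parameters fixed; a complete sampler would also resample F and φ in the same manner. All names and numeric values are hypothetical.

```python
import random


def empathy_conditional(prev_e, next_e, x, f_i, f_j,
                        initial, transition, gaze_lik, cooc):
    """Full conditional of one pair's state e_t with everything else held fixed."""
    weights = []
    for e in range(len(initial)):
        # Chain into e_t: initial probability at t = 1, transition otherwise.
        w = initial[e] if prev_e is None else transition[prev_e][e]
        if next_e is not None:
            w *= transition[e][next_e]      # chain out of e_t
        w *= gaze_lik[e][x]                 # P(x_t | e_t)
        w *= cooc[(e, x)][f_i][f_j]         # M_{e,x}(f_i, f_j), assumed factor
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]


def gibbs_sweep(e_seq, x_seq, fi_seq, fj_seq, params, rng):
    """Resample every e_t once, in order, each from its full conditional."""
    for t in range(len(e_seq)):
        prev_e = e_seq[t - 1] if t > 0 else None
        next_e = e_seq[t + 1] if t + 1 < len(e_seq) else None
        probs = empathy_conditional(prev_e, next_e, x_seq[t],
                                    fi_seq[t], fj_seq[t], *params)
        e_seq[t] = rng.choices(range(len(probs)), weights=probs)[0]
    return e_seq


# Toy parameters (illustrative values only); state 0 = empathy.
N = 3
initial = [1 / N] * N
transition = [[0.8 if a == b else 0.1 for b in range(N)] for a in range(N)]
gaze_lik = [[1 / N] * N for _ in range(N)]
uniform = [[1 / 9] * N for _ in range(N)]
empathy = [[0.5 if (f, g) == (0, 0) else 0.0625 for g in range(N)] for f in range(N)]
cooc = {(e, x): (empathy if e == 0 else uniform) for e in range(N) for x in range(N)}
params = (initial, transition, gaze_lik, cooc)

rng = random.Random(0)
e_seq = gibbs_sweep([1, 1, 1, 1], [0] * 4, [0] * 4, [0] * 4, params, rng)
```

Repeating `gibbs_sweep` many times and discarding the early sweeps gives samples from which state frequencies, and hence estimates, can be computed.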

The number of interlocutors N_i is assumed to be known. The present invention targets dialogs in which N_i is 2 or more. Although the invention can handle any number of interlocutors, this embodiment is described for a four-person dialog (N_i = 4). The number of interlocutors can also be computed automatically with an existing face detector, for example a cascaded AdaBoost detector based on Haar-like features. For the cascaded AdaBoost detector, the technique described in the non-patent document (P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features", In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 511-518, 2001.) may be used, so its description is omitted.

The estimation unit 22 includes a gaze direction detection unit 29, a facial expression estimation unit 30, an initial value setting unit 31, a state sampling unit 32, a parameter sampling unit 33, a convergence determination unit 34, and an estimated value calculation unit 35. The state sampling unit 32 is an example of the state determining means, and the parameter sampling unit 33 is an example of the parameter determining means. The convergence determination unit 34 and the estimated value calculation unit 35 are an example of the estimating means.

For each frame of the video data, the gaze direction detection unit 29 identifies the region representing each interlocutor's face, extracts the eye region, and detects the gaze direction from the extracted eye region using a conventionally known method. For each interlocutor, the gaze direction detection unit 29 also determines which interlocutor the gaze is directed at, based on the detected gaze direction and the positional relationship of all interlocutors, and detects, for every pair of interlocutors (i, j), the gaze state of the pair (mutual gaze, one-sided gaze, or mutual aversion).
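The mapping from per-person gaze targets to pairwise gaze states can be sketched as follows. The dictionary `target_of`, the state labels, and the function names are illustrative assumptions; the source does not specify a data representation.

```python
from itertools import combinations

def gaze_state(target_of, i, j):
    """Classify the gaze state of the pair (i, j).

    target_of maps each interlocutor to the interlocutor they are looking at,
    or to None when they are looking at nobody (hypothetical representation).
    """
    i_looks_at_j = target_of.get(i) == j
    j_looks_at_i = target_of.get(j) == i
    if i_looks_at_j and j_looks_at_i:
        return "mutual_gaze"
    if i_looks_at_j or j_looks_at_i:
        return "one_sided_gaze"
    return "mutual_aversion"

def all_pair_states(target_of):
    """Return the gaze state for every unordered pair of interlocutors."""
    people = sorted(target_of)
    return {(i, j): gaze_state(target_of, i, j)
            for i, j in combinations(people, 2)}

# Four-person example: 1 and 2 look at each other, 3 looks at 1, 4 at nobody.
states = all_pair_states({1: 2, 2: 1, 3: 1, 4: None})
```

For four interlocutors this yields six pairs per frame, which is the per-frame input the state sampling described later would consume.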

For each frame of the video data, the facial expression estimation unit 30 identifies the region representing each interlocutor's face, estimates the facial expression from the face region, and calculates the likelihood P(I_t | f_i,t) of each facial expression state.

The initial value setting unit 31 sets the initial values of the parameters φ = {Π, Θ} using hyperparameters given in advance. This embodiment describes the case where the values stored in the data storage unit 21 are used as the hyperparameter values, but suitable random numbers may be used instead.

The initial value setting unit 31 also uses the initial values of the parameters φ = {Π, Θ} to sample the initial value of each random variable, that is, the initial values e_t^(i,j) of the time-series data of the empathic interaction state for each interlocutor pair (i, j) and the initial values f_i,t of the time-series data of the facial expression state of each interlocutor i. Specifically, random numbers generated according to the distributions given by equations (5) and (6) above are used as the initial values of e_t^(i,j). Similarly, random numbers generated according to the distribution given by equation (12) are used as the initial values of f_i,t.

The state sampling unit 32 samples the empathic interaction state e_t^(i,j) at each time t and the facial expression state f_i,t at each time t from their full conditional probability distributions, and updates their values.

Because a conjugate prior is used as the prior distribution of each variable, the full conditional probability distribution of each variable has the same form as its conjugate prior. In this full conditional distribution, all variables other than the target variable are fixed, so it can be regarded as the joint posterior probability distribution of equation (3) above with every term that does not contain the target variable absorbed into the normalization constant.

The empathic interaction state e_t^(i,j) at time t is sampled according to the full conditional probability distribution expressed by equation (19) below.

The full conditional probability distribution expressed by equation (19) above is an example of the first probability distribution.

On the first execution by the state sampling unit 32, first, for time t = 1 (the first frame), the probability of each term on the right-hand side of equation (19) is obtained from equations (6), (13), and (9) above, using the values set by the initial value setting unit 31. A random number generated according to the resulting probability distribution is then taken as the updated e_1^(i,j). The same processing is repeated in order from time t = 2 to t = T, updating e_t^(i,j).
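The step of multiplying the terms of a full conditional, normalizing, and drawing a state can be sketched generically. Equation (19) itself is not reproduced here, so the three factor vectors below are placeholder numbers, and `full_conditional` and `draw_state` are hypothetical names; the sketch only assumes that each term gives one value per candidate state.

```python
import random

def full_conditional(factors):
    """Multiply the per-state factors elementwise and normalize to sum to 1."""
    n_states = len(factors[0])
    unnormalized = [1.0] * n_states
    for factor in factors:
        for s in range(n_states):
            unnormalized[s] *= factor[s]
    z = sum(unnormalized)  # normalization constant
    return [u / z for u in unnormalized]

def draw_state(probs, rng):
    """Draw one state index according to the normalized probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Placeholder terms for three candidate states (values are illustrative only).
transition_in = [0.8, 0.1, 0.1]    # stand-in for a term from the previous time
transition_out = [0.5, 0.3, 0.2]   # stand-in for a term toward the next time
co_occurrence = [0.2, 0.6, 0.2]    # stand-in for an expression/gaze term

probs = full_conditional([transition_in, transition_out, co_occurrence])
new_state = draw_state(probs, random.Random(0))
```

Because only the product's relative sizes matter, terms that do not depend on the target variable can be dropped before normalizing, which is the simplification the paragraph on conjugate priors describes.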

On the second and subsequent executions by the state sampling unit 32, the updated values of e_t^(i,j), f_i,t, and φ are used in place of the values set by the initial value setting unit 31, and the same sampling is performed to update the values of e_t^(i,j).

Similarly, the facial expression state f_i,t at time t is sampled according to the full conditional probability distribution expressed by equation (20) below.

The full conditional probability distribution expressed by equation (20) above is an example of the second probability distribution.

On the first execution by the state sampling unit 32, first, for time t = 1 (the first frame), the probability of each term on the right-hand side of equation (20) is obtained from equations (12) and (13) above, using the values set by the initial value setting unit 31 and the facial expression likelihoods calculated by the facial expression estimation unit 30. A random number generated according to the resulting probability distribution is then taken as the updated f_i,t. The same processing is repeated in order from time t = 2 to t = T, updating f_i,t.

On the second and subsequent executions by the state sampling unit 32, the updated values of e_t^(i,j), f_i,t, and φ are used in place of the values set by the initial value setting unit 31, and the same sampling is performed to update the values of f_i,t.

The parameter sampling unit 33 samples the parameters φ = {Π, Θ} from their full conditional probability distributions and updates their values.

Because all model parameters are assumed to follow Dirichlet distributions, the parameters of the posterior distribution can be obtained by adding, to the parameters of the prior distribution (that is, the hyperparameters), the number of times the corresponding event occurred in the immediately preceding sampling of empathic interactions and facial expressions by the state sampling unit 32.

Each model parameter is sampled according to the full conditional probability distributions expressed by equations (21) to (25) below.

Here, D(π_e | η_e^*) is the Dirichlet distribution over π_e with parameter η_e^*, and the symbol ~ means that the variable on the left-hand side follows the distribution on the right-hand side. For example, the parameters for the transition probabilities of the empathic interaction are obtained according to equations (26) and (27) below.

Here, η is a hyperparameter, and η^* denotes the value updated using the sampling results. n_e,e' (e = 0) is the number of times the empathic interaction state was e, and n_e,e' (e ≠ 0) is the number of times the empathic interaction state changed from e at one time to e' at the next time.

Similarly, η_x^* is {η_x,e'^*}_{e' ∈ {1, ..., N_e}} and represents the number of times the gaze state was x and the empathic interaction state was e'; η_e,x^* is {η_e,x,f"^*}_{e,x,f"} and represents the number of times the facial expressions of the interlocutor pair (i, j) were f_i and f_j, the gaze state was x, and the empathic interaction state was e. η_i^* is {η_i,f^*}_{i,f} and represents the number of times the facial expression state of interlocutor i was f.

The parameter sampling unit 33 updates the parameters φ = {Π, Θ} by setting random numbers generated according to the distributions on the right-hand sides of equations (21) to (25) above as the updated values of the parameters π, λ, and θ, respectively. The full conditional probability distributions expressed by equations (21) to (25) above are an example of the third probability distribution.
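The conjugate update described above (posterior Dirichlet parameter = hyperparameter + event count) and the subsequent draw can be sketched as follows. The function names and the example numbers are assumptions; the Dirichlet draw uses the standard construction from independent gamma variates, which is one common way to sample it and is not stated in the source.

```python
import random

def posterior_concentration(prior, counts):
    """Add observed event counts to the prior Dirichlet parameters."""
    return [eta + n for eta, n in zip(prior, counts)]

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

# Illustrative values: a flat prior over three next states, and the number of
# times each next state followed a given state in the latest sampled series.
eta = [1.0, 1.0, 1.0]
n = [5, 0, 2]
eta_star = posterior_concentration(eta, n)
pi_row = sample_dirichlet(eta_star, random.Random(0))
```

Each row of a transition table would be updated and drawn this way, with the counts recomputed from the state series sampled in the same iteration.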

The convergence determination unit 34 determines whether a predetermined convergence condition is satisfied. If it is not satisfied, the processing of the state sampling unit 32 and the parameter sampling unit 33 is repeated, updating the time series e_t^(i,j) of the empathic interaction state, the time series f_i,t of the facial expression state, and the parameters φ.

Here, the predetermined convergence condition is, for example, "whether a predetermined number of iterations has been reached" or "whether the error between each time series (or parameter) before the update and the corresponding time series (or parameter) after the update is at or below a predetermined threshold".
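The two example conditions can be combined in a small check like the one below. The source does not define how the error is measured; the maximum absolute difference between the old and new parameter vectors is an assumption made for this sketch, as are the function and argument names.

```python
def has_converged(iteration, max_iterations,
                  previous=None, current=None, tolerance=None):
    """Return True when either example convergence condition holds.

    Condition 1: the predetermined number of iterations has been reached.
    Condition 2: the change between the previous and current values is at or
    below a threshold (max absolute difference; an assumed error measure).
    """
    if iteration >= max_iterations:
        return True
    if tolerance is not None and previous is not None and current is not None:
        error = max(abs(p - c) for p, c in zip(previous, current))
        return error <= tolerance
    return False

reached_limit = has_converged(800, 800)
small_change = has_converged(10, 800, [0.5, 0.5], [0.5005, 0.4995], 1e-2)
```

Because Gibbs samples keep fluctuating even after the chain has converged, a fixed iteration count is the simpler of the two conditions, and it is the one the processing routine described later uses.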

After the iterations have been repeated until the convergence determination unit 34 finds the predetermined convergence condition satisfied and the parameters of the dialog model have converged, the estimated value calculation unit 35 performs a further M iterations and uses those M samples {E_1:T^(q), F_1:T^(q), φ^(q)}_{q=M'+1}^{M'+M} to calculate the estimates of the empathic interaction states and the facial expression states (M' is the number of iterations required for the parameters to converge in the convergence determination unit 34). The index (q) is the sample number (q = 1, ..., M). These iteration counts depend on the number of empathic interaction states and facial expression states to be estimated, but may be set to, for example, M' = 800 and M = 200.

The estimate ê_t^(i,j) of the empathic interaction state and the estimate f̂_i,t of the facial expression state at time t are calculated based on the joint posterior probability distribution p(f_t, e_t | I_1:t) at that time. For example, as shown in equations (28) and (29) below, these estimates are calculated as the states that maximize the marginal probability of the joint posterior distribution for the empathic interaction and for the facial expression, respectively. The parameter estimate φ̂ is calculated as the expected value of the marginal distribution of the joint posterior distribution over the parameters, as shown in equation (30) below.

Here, δ_y(z) is a function that returns 1 if y = z and 0 otherwise.
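A minimal sketch of how such estimates could be computed from the M retained samples is given below: counting matches with the indicator function and taking the most frequent state approximates the marginal maximum, and averaging the sampled parameters approximates the expected value. Equations (28) to (30) are not reproduced here, so the function names, the data layout, and the tie-breaking rule (first state encountered) are assumptions.

```python
from collections import Counter

def marginal_mode(state_samples):
    """Return the state that appears most often among the samples."""
    counts = Counter(state_samples)  # equivalent to summing delta_y(z) per state
    return max(counts, key=counts.get)

def estimate_series(sampled_series):
    """Apply marginal_mode independently at each time step.

    sampled_series is a list of M sequences of length T (assumed layout).
    """
    return [marginal_mode(column) for column in zip(*sampled_series)]

def posterior_mean(param_samples):
    """Average sampled parameter vectors elementwise."""
    m = len(param_samples)
    return [sum(column) / m for column in zip(*param_samples)]

e_hat = estimate_series([[0, 1, 1], [0, 1, 2], [1, 1, 2]])
phi_hat = posterior_mean([[0.2, 0.8], [0.4, 0.6]])
```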

The estimate ê_t^(i,j) of the empathic interaction state, the estimate f̂_i,t of the facial expression state, and the parameter estimate φ̂ are output to the user from the output unit 3.

The learning unit 23 includes a learning data storage unit 40, a teacher label creation unit 41, and a hyperparameter learning unit 42.

The hyperparameter values used by the initial value setting unit 31 may be set in advance to suitable values, but the hyperparameters of the model can also be learned using dialog video other than the dialog video to be estimated (training video) as input. Learning allows the initial values of the parameters φ to be set to more appropriate values, so the model can converge to a more accurate one within the predetermined number of iterations, which in turn improves the accuracy of estimating empathic interactions and facial expressions.

The learning data storage unit 40 stores the training video data.

The teacher label creation unit 41 takes the training video data as input and creates, as teacher labels, the empathic interaction state, the facial expression state, and the gaze state in each frame. For the empathic interactions and facial expressions, which are what the present invention is effective at estimating, teacher labels entered by an operator are accepted through the input unit 1 and assigned. If a method exists that can estimate these with high accuracy, even at the cost of enormous computation time, it may be used as the teacher label creation unit 41. The gaze state, on the other hand, may be labeled by performing the same processing as the gaze direction detection unit 29.

The hyperparameter learning unit 42 takes the teacher labels as input, learns the hyperparameters of the dialog model, and outputs them to the data storage unit 21. Here, the probability with which the event represented by each hyperparameter occurred in the training data is set as the value of that hyperparameter. For example, for the transition probability P(e_t = e' | e_t-1 = e) of the empathic interaction, the number of times the empathic interaction state changed from e to e' across two consecutive times in the teacher labels is counted, and that count divided by the sum of the counts of all transitions (a probability) is set as the value of the hyperparameter.
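The counting rule in the example above can be sketched as follows. The label sequence is made up for illustration and `transition_hyperparameters` is a hypothetical name; as the text states, each count is divided by the total number of transitions, not by the number of transitions leaving each state.

```python
from collections import Counter

def transition_hyperparameters(labels):
    """Count e -> e' transitions between consecutive frames and normalize.

    labels is the teacher-label sequence of empathic interaction states for
    one interlocutor pair (assumed to be a list of integer state indices).
    """
    counts = Counter(zip(labels[:-1], labels[1:]))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

# Illustrative label sequence over six frames.
eta = transition_hyperparameters([0, 0, 1, 1, 1, 0])
```

With several labeled pairs or videos, the counts would be accumulated over all sequences before normalizing.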

<Operation of the dialog state estimation apparatus>
Next, the operation of the dialog state estimation apparatus according to this embodiment is described. First, for the training video data stored in the learning data storage unit 40, the dialog state estimation apparatus has an operator enter, frame by frame, the state of empathic interaction between interlocutors and the facial expression state of each interlocutor, and labels the data accordingly; the calculation unit 2 of the apparatus detects the gaze state between interlocutors and labels it as well. The calculation unit 2 then learns the hyperparameters from the labeled results and stores the learning results in the data storage unit 21.

In addition, imaging means captures a plurality of interlocutors in conversation, and the captured video data is input to the dialog state estimation apparatus via the input unit 1 and stored in the data storage unit 21.

The calculation unit 2 of the dialog state estimation apparatus then executes the dialog state estimation processing routine shown in FIG. 6.

First, in step 100, the video data and hyperparameters stored in the data storage unit 21 are read. In step 102, the gaze states of all pairs of interlocutors are detected for each time (each frame) of the video data. In step 103, the facial expression state of each interlocutor is estimated for each time (each frame) of the video data, and the likelihood of each facial expression state is calculated.

In the next step 104, the initial values of the parameters of the dialog model are set based on the hyperparameters acquired in step 100, and, based on those initial parameter values, the initial values of the time-series data of the empathic interaction state for all pairs of interlocutors and of the time-series data of the facial expression state of each interlocutor are set.

In step 106, the probability distributions expressed by equations (19) and (20) above are obtained based on the following: either the initial values of the empathic interaction time-series data for all pairs of interlocutors, the facial expression time-series data of each interlocutor, and the dialog model parameters set in step 104, or the corresponding time-series data and parameters determined in the previous execution of step 106 and of step 108 (described below); the time-series data of the gaze states of all pairs of interlocutors detected in step 102; and the time-series data of the likelihoods of each facial expression state of each interlocutor calculated in step 103. The time-series data of the empathic interaction state for all pairs of interlocutors and of the facial expression state of each interlocutor are then determined by sampling from the obtained distributions.

In the next step 108, the probability distributions expressed by equations (21) to (25) above are obtained based on the empathic interaction time-series data for all pairs of interlocutors and the facial expression time-series data of each interlocutor determined in the immediately preceding step 106, the time-series data of the gaze states of all pairs of interlocutors detected in step 102, and the hyperparameters acquired in step 100. The values of the dialog model parameters are then determined by sampling from the obtained distributions.

In step 110, whether a predetermined number of iterations has been reached is checked as the predetermined convergence condition. If it has not been reached, the convergence condition is judged not to be satisfied, the routine returns to step 106, and steps 106 and 108 are repeated. If it has been reached, the convergence condition is judged to be satisfied, and the routine proceeds to step 112.

In step 112, as in step 106, the probability distributions expressed by equations (19) and (20) above are obtained based on the previously determined empathic interaction time-series data for all pairs of interlocutors, facial expression time-series data of each interlocutor, and dialog model parameters, together with the detected time-series data of the gaze states of all pairs of interlocutors and the time-series data of the likelihoods of each facial expression state of each interlocutor calculated in step 103. The time-series data of the empathic interaction state for all pairs of interlocutors and of the facial expression state of each interlocutor are then determined by sampling from the obtained distributions.

In step 114, as in step 108, the probability distributions expressed by equations (21) to (25) above are obtained based on the empathic interaction time-series data for all pairs of interlocutors and the facial expression time-series data of each interlocutor determined in the immediately preceding step 112, the time-series data of the gaze states of all pairs of interlocutors detected in step 102, and the hyperparameters acquired in step 100. The values of the dialog model parameters are then determined by sampling from the obtained distributions.

In step 116, it is determined whether steps 112 and 114 have been repeated M times. If not, the routine returns to step 112 and repeats steps 112 and 114. If they have been repeated M times, then in step 118 the estimates of the empathic interaction time-series data for all pairs of interlocutors, the facial expression time-series data of each interlocutor, and the dialog model parameters are calculated from the M sets of time-series data and parameters determined in steps 112 and 114. In step 120, the estimates calculated in step 118 are output by the output unit 3, and the dialog state estimation processing routine ends.
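The control flow of steps 104 to 118 can be summarized in a short skeleton. The callables `init`, `sample_states`, and `sample_params` are hypothetical placeholders for the processing of units 31, 32, and 33; the dummy versions at the bottom only count calls so that the loop structure can be checked, and they do not implement the dialog model.

```python
def run_estimation(init, sample_states, sample_params, n_burn_in, n_keep):
    """Run the burn-in loop, then collect n_keep samples for estimation."""
    states, params = init()                      # step 104
    for _ in range(n_burn_in):                   # steps 106 to 110
        states = sample_states(states, params)
        params = sample_params(states)
    kept = []
    for _ in range(n_keep):                      # steps 112 to 116
        states = sample_states(states, params)
        params = sample_params(states)
        kept.append((states, params))
    return kept                                  # input to step 118

# Dummy stand-ins that only record how often each step is executed.
calls = {"states": 0, "params": 0}

def dummy_init():
    return 0, 0.0

def dummy_sample_states(states, params):
    calls["states"] += 1
    return states + 1

def dummy_sample_params(states):
    calls["params"] += 1
    return float(states)

kept = run_estimation(dummy_init, dummy_sample_states, dummy_sample_params,
                      n_burn_in=2, n_keep=3)
```

With the example values from the description, `n_burn_in` would correspond to M' = 800 and `n_keep` to M = 200.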

As described above, the dialog state estimation apparatus according to this embodiment uses a dialog model representing the co-occurrence of facial expression states according to the combination of the empathic interaction state of an interlocutor pair and the gaze state of that pair, and can thereby estimate the empathic interaction state of each interlocutor pair and the facial expression state of each interlocutor.

In addition, for dialogs among multiple people, the apparatus takes into account the context in which facial expressions are displayed, and can correctly estimate, from video data capturing the dialog, the facial expression state displayed by each interlocutor while at the same time correctly estimating the state of empathic interaction between each pair of interlocutors.

The facial expression state of an interlocutor can be estimated with higher accuracy than with the conventional estimation framework, which considers only the target person. In particular, the facial expression state can be estimated accurately even in scenes where correct estimation was difficult with conventional methods, such as scenes in which a person's face is turned sideways or upward instead of facing front, or in which the face is partly hidden by a hand or an object.

Furthermore, as parameters of the dialog model, the following probabilities can be obtained for each target dialog video: how likely each state of empathic interaction is to occur (parameter Π_0), how likely each interlocutor behavior (= facial expression) is to occur (parameter Θ), and to what degree they co-occur (parameter λ). Differences in these parameters can be used to characterize each dialog.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, the estimate of only one of the empathic interaction state of the interlocutor pair and the facial expression state of each interlocutor may be output.

The empathic interaction state, the facial expression state of each interlocutor, and the parameters of the dialog model may also be estimated for only a specific pair of interlocutors among the plurality of interlocutors.

The empathic interaction state, the facial expression state of each interlocutor, and the parameters of the dialog model at a given time may also be estimated from a specific frame of the video data, that is, from still image data at a single time.

The dialog state estimation apparatus described above contains a computer system; when a WWW system is used, the term "computer system" also includes the homepage providing environment (or display environment).

Although this specification describes an embodiment in which the program is installed in advance, the program may also be provided stored on a computer-readable recording medium.

Description of reference numerals
1 Input unit
2 Calculation unit
3 Output unit
21 Data storage unit
22 Estimation unit
29 Gaze direction detection unit
30 Facial expression estimation unit
31 Initial value setting unit
32 State sampling unit
33 Parameter sampling unit
34 Convergence determination unit
35 Estimated value calculation unit

Claims

A dialog state estimation apparatus comprising:
gaze state detecting means for detecting the state of gaze between a plurality of interlocutors, taking as input images capturing a region that includes the faces of the plurality of interlocutors;
initial value setting means for setting initial values of parameters of a dialog model representing the co-occurrence of facial expression states according to the combination of a state indicating empathy between the plurality of interlocutors and the state of gaze between the plurality of interlocutors, and for setting, according to the dialog model, an initial value of the state indicating empathy between the plurality of interlocutors and an initial value of the state indicating the facial expression of each of the plurality of interlocutors;
state determining means for determining, according to the dialog model, the state indicating empathy between the plurality of interlocutors and the state indicating the facial expression of each of the plurality of interlocutors, based on either the initial values of the parameters, of the state indicating empathy, and of the state indicating facial expression set by the initial value setting means, or the previously determined parameters, state indicating empathy, and state indicating facial expression, together with the state of gaze between the plurality of interlocutors detected by the gaze state detecting means;
parameter determining means for determining the parameters of the dialog model based on the state indicating empathy between the plurality of interlocutors and the state indicating the facial expression of each of the plurality of interlocutors determined by the state determining means; and
estimating means for repeating the determination by the state determining means and the determination by the parameter determining means until a predetermined convergence condition is satisfied, and for estimating the state indicating empathy between the plurality of interlocutors or the state indicating the facial expression of each of the plurality of interlocutors.

The dialog state estimation apparatus according to claim 1, wherein
the state determination means obtains, in accordance with the dialog model, a first probability distribution indicating the probability that the state indicating empathy between the plurality of interlocutors takes each state, and a second probability distribution indicating the probability that the state indicating the facial expression of each of the plurality of interlocutors takes each state, based on the gaze state between the plurality of interlocutors detected by the gaze state detection means and on either the initial values of the parameter, the state indicating empathy, and the state indicating the facial expression set by the initial value setting means, or the previously determined parameter, state indicating empathy, and state indicating the facial expression; determines the state indicating empathy between the plurality of interlocutors according to the obtained first probability distribution; and determines the state indicating the facial expression of each of the plurality of interlocutors according to the obtained second probability distribution, and
the parameter determination means obtains a third probability distribution indicating the probability that each parameter of the dialog model takes each value, based on the state indicating empathy between the plurality of interlocutors and the state indicating the facial expression of each of the plurality of interlocutors determined by the state determination means, and determines the parameter of the dialog model according to the obtained third probability distribution.
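
For illustration only, the alternation between determining states from probability distributions and determining parameters from a further probability distribution can be read as a Gibbs-style sampler. The Python sketch below is not the claimed implementation. It assumes a single pair of interlocutors, three empathy states, three gaze states, nine joint facial expression states (3 x 3), a Dirichlet prior on the dialog model parameters, a hypothetical noisy expression observation with a fixed `accuracy`, and a fixed iteration count standing in for the convergence condition. All function and variable names are hypothetical.

```python
import random

# Assumed sizes: 3 empathy states, 3 gaze states, 9 joint expression states
# (3 x 3 expression pairs for two interlocutors).
N_EMPATHY, N_GAZE, N_EXPR = 3, 3, 9


def sample_dirichlet(alphas, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(weights, rng):
    """Draw an index with probability proportional to the given weights."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def gibbs(gaze, observed, n_iter=200, accuracy=0.8, alpha=1.0, seed=0):
    """Alternate state sampling and parameter sampling (toy model).

    gaze[t]     : detected gaze state at time t (0..N_GAZE-1)
    observed[t] : hypothetical noisy joint expression measurement (0..N_EXPR-1)
    """
    rng = random.Random(seed)
    T = len(gaze)
    # Initial values: parameters from the prior, random empathy states,
    # expression states copied from the noisy measurement.
    theta = [[sample_dirichlet([alpha] * N_EXPR, rng) for _ in range(N_GAZE)]
             for _ in range(N_EMPATHY)]
    empathy = [rng.randrange(N_EMPATHY) for _ in range(T)]
    expr = list(observed)
    history = []
    for _ in range(n_iter):  # fixed count, an assumed stand-in for convergence
        for t in range(T):
            g = gaze[t]
            # "First distribution": empathy state given gaze and expression
            # (uniform prior over empathy states assumed).
            w_e = [theta[e][g][expr[t]] for e in range(N_EMPATHY)]
            empathy[t] = sample_categorical(w_e, rng)
            # "Second distribution": expression state given empathy, gaze,
            # and the noisy measurement.
            w_f = [theta[empathy[t]][g][k]
                   * (accuracy if k == observed[t]
                      else (1.0 - accuracy) / (N_EXPR - 1))
                   for k in range(N_EXPR)]
            expr[t] = sample_categorical(w_f, rng)
        # "Third distribution": Dirichlet posterior over the parameters,
        # given co-occurrence counts of the sampled states.
        counts = [[[0] * N_EXPR for _ in range(N_GAZE)]
                  for _ in range(N_EMPATHY)]
        for t in range(T):
            counts[empathy[t]][gaze[t]][expr[t]] += 1
        theta = [[sample_dirichlet([alpha + c for c in counts[e][g]], rng)
                  for g in range(N_GAZE)] for e in range(N_EMPATHY)]
        history.append(list(empathy))
    return empathy, expr, theta, history
```

In this toy setting the empathy labels are interchangeable; a practical model would need informative priors or constraints to tie each label to a particular state.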

The dialog state estimation apparatus according to claim 1 or 2, wherein
the gaze state detection means takes time-series data of the image as input and detects time-series data of the gaze state between the plurality of interlocutors,
the initial value setting means sets the initial value of the parameter of the dialog model, an initial value of time-series data of the state indicating empathy between the plurality of interlocutors, and an initial value of time-series data of the state indicating the facial expression of each of the plurality of interlocutors,
the state determination means determines time-series data of the state indicating empathy between the plurality of interlocutors and time-series data of the state indicating the facial expression of each of the plurality of interlocutors, and
the estimation means estimates the time-series data of the state indicating empathy between the plurality of interlocutors, or the time-series data of the state indicating the facial expression of each of the plurality of interlocutors.

The dialog state estimation apparatus according to any one of claims 1 to 3, further comprising facial expression estimation means for estimating the facial expression state of each of the plurality of interlocutors based on the image,
wherein the state determination means determines the state indicating the facial expression of each of the plurality of interlocutors based on the parameter set by the initial value setting means and the estimation result of the facial expression estimation means.

The dialog state estimation apparatus according to any one of claims 1 to 4, wherein the gaze states between the plurality of interlocutors are mutual gaze, one-sided gaze, and mutual averted gaze, or are mutual gaze and one-sided gaze.

The dialog state estimation apparatus according to any one of claims 1 to 5, wherein the states indicating empathy between the plurality of interlocutors are empathy, indifference, and antipathy.

The dialog state estimation apparatus according to any one of claims 1 to 6, wherein the facial expression states are positive, neutral, and negative, or are no expression, smile, surprise, and disgust.

The dialog state estimation apparatus according to claim 1, wherein the estimation means repeats the determination by the state determination means and the determination by the parameter determination means until the predetermined convergence condition is satisfied, then further repeats the determination by the state determination means and the determination by the parameter determination means a plurality of times, and estimates the state indicating empathy between the plurality of interlocutors or the state indicating the facial expression of each of the plurality of interlocutors based on the states indicating empathy between the plurality of interlocutors, or the states indicating the facial expression of each of the plurality of interlocutors, determined by the state determination means during the further repetitions.
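
As a non-authoritative illustration of estimating a state from the determinations made after the convergence condition is met, the Python sketch below discards the iterations before convergence and takes, for each time step, the state that was determined most often afterwards. The claim does not specify how the repeated determinations are combined, so the majority vote is an assumption, and the function name and the `burn_in` argument are hypothetical.

```python
from collections import Counter


def estimate_from_samples(samples, burn_in):
    """Return, per time step, the most frequent state after burn-in.

    samples : list of state sequences, one per iteration
    burn_in : number of leading iterations to discard (assumed to be the
              iterations run before the convergence condition was met)
    """
    kept = samples[burn_in:]
    if not kept:
        raise ValueError("no samples left after burn-in")
    n_steps = len(kept[0])
    # Majority vote per time step (an assumed way to combine the samples).
    return [Counter(s[t] for s in kept).most_common(1)[0][0]
            for t in range(n_steps)]


# Four iterations over two time steps; the first iteration is discarded.
print(estimate_from_samples([[2, 2], [0, 1], [0, 1], [1, 1]], burn_in=1))
# prints [0, 1]
```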

A dialog state estimation method in a dialog state estimation apparatus including gaze state detection means, initial value setting means, state determination means, parameter determination means, and estimation means, wherein the dialog state estimation apparatus executes the method, the method comprising:
a step in which the gaze state detection means detects a gaze state between a plurality of interlocutors, taking as input an image of a region including the faces of the plurality of interlocutors;
a step in which the initial value setting means sets an initial value of a parameter of a dialog model that represents the co-occurrence of facial expression states according to the combination of a state indicating empathy between the plurality of interlocutors and the gaze state between the plurality of interlocutors, and sets, in accordance with the dialog model, an initial value of the state indicating empathy between the plurality of interlocutors and an initial value of the state indicating the facial expression of each of the plurality of interlocutors;
a step in which the state determination means determines, in accordance with the dialog model, the state indicating empathy between the plurality of interlocutors and the state indicating the facial expression of each of the plurality of interlocutors, based on the gaze state between the plurality of interlocutors detected by the gaze state detection means and on either the initial values of the parameter, the state indicating empathy, and the state indicating the facial expression set by the initial value setting means, or the previously determined parameter, state indicating empathy, and state indicating the facial expression;
a step in which the parameter determination means determines the parameter of the dialog model based on the state indicating empathy between the plurality of interlocutors and the state indicating the facial expression of each of the plurality of interlocutors determined by the state determination means; and
a step in which the estimation means repeats the determination by the state determination means and the determination by the parameter determination means until a predetermined convergence condition is satisfied, and estimates the state indicating empathy between the plurality of interlocutors or the state indicating the facial expression of each of the plurality of interlocutors.

A program for causing a computer to function as each means of the dialog state estimation apparatus according to any one of claims 1 to 8.