JP5909472B2 - Empathy interpretation estimation apparatus, method, and program

Info

Publication number: JP5909472B2
Application number: JP2013199558A
Authority: JP
Inventors: Shiro Kumano; Kazuhiro Otsuka; Junji Yamato
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2013-09-26
Filing date: 2013-09-26
Publication date: 2016-04-26
Anticipated expiration: 2033-09-26
Also published as: JP2015064827A

Description

The present invention relates to a technique for estimating empathy interpretation.

Non-Patent Document 1 describes a technique for measuring how well a user can imitate the movement of a model facial expression displayed on a screen.

Nintendo, Fumiko Inudo, [online], [July 31, 2013], Internet <URL: http://www.nintendo.co.jp/ds/ykoj/etc/index.html>

However, the background art only measures how well a user can imitate an expression. It does not estimate the empathy interpretation between a first person displayed on a screen and a second person who is watching the first person through the screen. Empathy interpretation refers to the dialogue state between two parties, such as empathy, antipathy, or neither.

An object of the present invention is to provide an empathy interpretation estimation apparatus, method, program, and recording medium for estimating the empathy interpretation between a first person displayed on a screen and a second person who is watching the first person through the screen.

A video generation device according to one aspect of the present invention includes: a person video presentation unit that presents, to a second person, a first video of a first person whose behavior changes; an action presentation unit that presents, to the second person, an action relating to empathy or antipathy that the second person should perform in response to the change in the first person's behavior; a video acquisition unit that acquires a second video capturing the head of the second person to whom the first video has been presented; a behavior recognition unit that generates a second behavior time series of the second person by detecting, in the second video, the second person's behavior in response to the change in the first person's behavior; a posterior probability estimation unit that estimates the empathy interpretation between the first person and the second person from the first behavior time series of the first person in the first video and the second behavior time series of the second person in the second video, using model parameters that include a timing model representing the likelihood of an empathy interpretation based on the time difference between behaviors and the consistency of the behaviors; and a display unit that presents the estimated empathy interpretation to the second person as the result for the change in the first person's behavior.

The empathy interpretation between a first person displayed on a screen and a second person who is watching the first person through the screen can be estimated. In addition, by using model parameters that include a timing model representing the likelihood of an empathy interpretation based on the time difference between behaviors and the consistency of the behaviors, the timing of expression can be taken into account, so the empathy interpretation can be estimated with high accuracy.

FIG. 1 is a block diagram showing an example of the empathy interpretation estimation apparatus.
FIG. 2 is a diagram for explaining an example of the first video, the first person, and the second person.
FIG. 3 is a diagram for explaining a change in the first person's behavior.
FIG. 4 is a diagram for explaining a change in the first person's behavior.
FIG. 5 is a diagram for explaining an example of an action presented by the action presentation unit.
FIG. 6 is a diagram for explaining a modification.
FIG. 7 is a diagram for explaining a modification.
FIG. 8 is a diagram showing the processing flow of the empathy interpretation estimation method.
FIG. 9 is a block diagram showing an example of the dialogue state estimation apparatus.
FIG. 10 is a block diagram showing an example of the parameter learning unit.
FIG. 11 is a diagram showing the processing flow of the learning phase.
FIG. 12 is a diagram for explaining the time difference function.
FIG. 13 is a diagram for explaining the time difference between an interlocutor's behavior and the empathy interpretation.
FIG. 14 is a diagram for explaining the change timing function.
FIG. 15 is a diagram for explaining the effective range of the change timing function.
FIG. 16 is a diagram for explaining the effective range of the change timing function.

Embodiments of the present invention are described in detail below.

[Empathy interpretation estimation apparatus and method]
As shown in FIG. 1, the empathy interpretation estimation apparatus includes, for example, a person video presentation unit 82, a video acquisition unit 81, an estimation video storage unit 72, a behavior recognition unit 20, a model parameter storage unit 74, a posterior probability estimation unit 50, a display unit 83, and an action presentation unit 84.

The empathy interpretation estimation method consists of, for example, steps S1 to S4 shown in FIG. 8.

The person video presentation unit 82 presents, to the second person 200, a first video of the first person 100 whose behavior changes (step S1). For example, the person video presentation unit 82 presents the first video to the second person 200 through the display unit 83. The display unit 83 is a display device such as a monitor. The first video is also provided to the behavior recognition unit 20 as part of the estimation video. The first video shows at least the head of the first person 100. The first person 100 need not be a human as long as it can show facial expressions. For example, the first person 100 may be a character or an object. Here, the head means a portion that includes enough of the facial region for facial expressions to be recognized.

The left side of FIG. 2 shows an example of the first video. The behavior of the first person 100 in the first video changes. For example, the first person 100 is initially expressionless, as shown on the left of FIG. 2, and at a certain point begins to smile, as shown in FIG. 3. This first video is presented to the second person 200 shown on the right of FIG. 2.

The video acquisition unit 81 acquires a second video capturing the behavior of the second person to whom the first video has been presented (step S2). The acquired second video is provided to the behavior recognition unit 20 as part of the estimation video. The video acquisition unit 81 is a device capable of capturing video, such as a video camera. The second video is assumed to show at least the head of the second person 200.

In FIG. 2, the action that the second person 200 should perform, "Please empathize", is presented within the first video. The second person 200 therefore behaves so as to empathize with the first person 100 who has begun to smile, for example by smiling as well. The video acquisition unit 81 acquires the second video capturing this behavior of the second person 200. The action that the second person 200 should perform is presented by the action presentation unit 84.

The behavior recognition unit 20 generates a first behavior time series of the first person 100 by detecting the behavior of the first person 100 in the first video, and a second behavior time series of the second person 200 by detecting the behavior of the second person 200 in the second video (step S3). The generated first and second behavior time series are provided to the posterior probability estimation unit 50. The detected behaviors include, for example, facial expression, gaze, head gestures, and the presence or absence of speech. The behavior recognition method used by the behavior recognition unit 20 is the same as the one described later in [Learning phase], so its description is omitted here.

Based on the first and second behavior time series, the posterior probability estimation unit 50 estimates the empathy interpretation between the first person 100 and the second person 200, using model parameters that include a timing model representing the likelihood of an empathy interpretation based on the time difference between behaviors and the consistency of the behaviors (step S4).

The posterior probability estimation unit 50 uses model parameters read from the model parameter storage unit 74. Before the posterior probability estimation unit 50 performs step S4, the model parameter storage unit 74 is assumed to already hold the model parameters learned and generated in the learning phase described later.
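Steps S3 and S4 can be pictured as a small pipeline. The following Python sketch is only an illustration under stated assumptions: the class name `EmpathyEstimator`, the callables `recognize` and `estimate`, and the use of string labels for behaviors are all hypothetical and do not appear in this description. Presentation (S1) and capture (S2) are assumed to have already produced the two videos.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Behavior = str                # hypothetical label, e.g. "neutral" or "smile"
Posterior = Dict[str, float]  # empathy-interpretation type -> probability


@dataclass
class EmpathyEstimator:
    # Stand-in for the behavior recognition unit 20 (assumed interface).
    recognize: Callable[[Sequence[object]], List[Behavior]]
    # Stand-in for the posterior probability estimation unit 50 (assumed interface).
    estimate: Callable[[List[Behavior], List[Behavior], dict], List[Posterior]]
    # Parameters read beforehand from the model parameter storage unit 74.
    model_params: dict

    def run(self, first_video: Sequence[object],
            second_video: Sequence[object]) -> List[Posterior]:
        # Step S3: one behavior time series per person.
        first_series = self.recognize(first_video)
        second_series = self.recognize(second_video)
        # Step S4: posterior over empathy interpretations at each time step.
        return self.estimate(first_series, second_series, self.model_params)
```

Passing the recognizer and estimator in as callables keeps the sketch independent of any particular recognition method or model, which this part of the description leaves to the learning phase.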

Specifically, using the model parameters stored in the model parameter storage unit 74, the posterior probability estimation unit 50 estimates the posterior probability distribution P(e_t|B) of the empathy interpretation between the interlocutors at time t from the first and second behavior time series, which together form the interlocutors' behavior time series B. More specifically, the posterior probability estimation unit 50 takes as input the behavior time series B and the model parameters, which comprise the parameters of the prior distribution, the timing model, and the static model, and computes the posterior probability distribution P(e_t|B) of the empathy interpretation e at time t according to equation (1) described later. The model parameters and equation (1) are explained in [Learning phase] below.

When the estimation result must be output as a single type rather than as a probability distribution, the type of empathy interpretation with the highest posterior probability, that is, e~_t = argmax_{e_t} P(e_t|B), may be output.

In addition to the type of empathy interpretation with the highest posterior probability, the strength of that empathy interpretation may also be output. For example, when the type is empathy, the strength can be obtained as strength = probability of empathy - probability of antipathy; when the type is antipathy, as strength = probability of antipathy - probability of empathy. In this case, the strength is a value between -1 and 1.
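The single-type output and the strength formulas above can be sketched as follows. This is a minimal illustration, assuming the posterior at one time step is held in a dictionary keyed by the hypothetical labels "empathy", "neither", and "antipathy"; the function name is likewise an assumption. The text gives no strength formula for the "neither" type, so the sketch returns `None` in that case.

```python
from typing import Dict, Optional, Tuple


def interpretation_with_strength(
    posterior: Dict[str, float],
) -> Tuple[str, Optional[float]]:
    """Return the most probable interpretation type and its strength."""
    # e~_t = argmax_{e_t} P(e_t|B): pick the type with the highest posterior.
    label = max(posterior, key=posterior.get)
    p_empathy = posterior["empathy"]
    p_antipathy = posterior["antipathy"]
    if label == "empathy":
        return label, p_empathy - p_antipathy
    if label == "antipathy":
        return label, p_antipathy - p_empathy
    # No formula is stated for other types such as "neither".
    return label, None
```

For instance, a posterior of 0.7 for empathy, 0.2 for neither, and 0.1 for antipathy yields the type "empathy" with a strength of about 0.6.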

Empathy interpretation refers to the dialogue state between two parties (in this example, between the first person 100 and the second person 200), such as empathy, antipathy, or neither.

According to the embodiment of the empathy interpretation estimation apparatus described above, the empathy interpretation between a first person displayed on a screen and a second person who is watching the first person through the screen can be estimated. In addition, by using model parameters that include a timing model representing the likelihood of an empathy interpretation based on the time difference between behaviors and the consistency of the behaviors, the timing of expression can be taken into account, so the empathy interpretation can be estimated with high accuracy.

[Modifications of the empathy interpretation estimation apparatus and method]
A first video generated by the person video presentation unit 82 itself may be provided to the second person 200, or a first video generated in advance by something other than the person video presentation unit 82 may be provided to the second person 200. For example, when the first video has been generated in advance and stored in the estimation video storage unit 72, the person video presentation unit 82 provides the second person 200 with the first video read from the estimation video storage unit 72.

The first behavior time series may be stored in advance in the storage unit 85 in association with the behavior of the first person in the first video. This is because, for example, when a video of a smiling character is presented as the first video, the behavior recognition unit 20 can generate the character's first behavior time series from that first video beforehand, so that it is ready in advance. In this case, the behavior recognition unit 20 generates only the second behavior time series, and the posterior probability estimation unit 50 estimates the empathy interpretation based on the first behavior time series read from the storage unit 85 and the second behavior time series generated by the behavior recognition unit 20.

The empathy interpretation estimated by the posterior probability estimation unit 50 may be displayed on the display unit 83. When information about the action that the second person 200 should perform is provided from the action presentation unit 84 to the posterior probability estimation unit 50, the posterior probability estimation unit 50 may cause the display unit 83 to display only the empathy interpretation related to that action. For example, when the action that the second person 200 should perform is "Please empathize", the posterior probability estimation unit 50 causes the display unit 83 to display only the estimation result for empathy.
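The display restriction described above amounts to filtering the posterior by the prompted action. The sketch below is illustrative only: the mapping `PROMPT_TO_TYPE`, the prompt strings, and the function name are assumptions made for this example, not names used by the apparatus.

```python
from typing import Dict, Optional

# Hypothetical mapping from a prompted action to the related interpretation type.
PROMPT_TO_TYPE = {
    "Please empathize": "empathy",
    "Please show antipathy": "antipathy",
}


def result_to_display(posterior: Dict[str, float],
                      prompted_action: Optional[str] = None) -> Dict[str, float]:
    """Select which part of the estimation result is shown on the display."""
    if prompted_action is None:
        # No action information was provided: show the whole distribution.
        return dict(posterior)
    related = PROMPT_TO_TYPE[prompted_action]
    # Show only the interpretation related to the prompted action.
    return {related: posterior[related]}
```

With the prompt "Please empathize", only the empathy entry of the posterior is returned for display.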

In the embodiment above, the first person 100 in the first video is initially expressionless, as shown on the left of FIG. 2, and at a certain point begins to smile, as shown in FIG. 3. However, the first person 100 may behave differently in the first video. For example, the first person 100 may show a sad expression, as shown in FIG. 4.

In FIG. 4, the action that the second person 200 should perform, "Please empathize", is presented within the first video. In general, the way to show empathy toward a sad expression is to show a similarly sad expression, usually promptly. In the case of FIG. 4, the second person 200 can therefore show empathy by showing a sad expression like that of the first person 100 at an appropriate time.

The action that the action presentation unit 84 presents to the second person 200 is not limited to "Please empathize". In other words, the action presentation unit 84 may present any action to the second person 200 as long as it is an action the second person 200 should perform. For example, the action presentation unit 84 may present the action "Please show antipathy" to the second person 200, as shown in FIG. 5. In FIG. 5, the first person 100 begins to smile. In general, the way to show antipathy toward a smile is to show an angry or sullen expression, usually promptly. Alternatively, returning the smile only after a delay can also convey antipathy. In the case of FIG. 5, the second person 200 can therefore show antipathy by promptly showing an angry or sullen expression, or by smiling late.
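The qualitative pattern in the examples of FIG. 4 and FIG. 5 (a matching expression shown promptly suggests empathy; an angry or sullen expression shown promptly, or a smile returned late, suggests antipathy) can be written as a toy rule. This is not the timing model of the invention, which is learned and probabilistic; the function name, the expression labels, and the threshold `quick_s` are all hypothetical choices made only to illustrate how time difference and consistency interact.

```python
# Hypothetical set of expressions treated as negative responses to a smile.
NEGATIVE = {"angry", "sullen"}


def toy_interpretation(first_expr: str, second_expr: str,
                       delay_s: float, quick_s: float = 1.0) -> str:
    """Toy rule combining expression consistency and response delay."""
    quick = delay_s <= quick_s        # quick_s is an arbitrary illustrative threshold
    same = first_expr == second_expr  # consistency of the two behaviors
    if same and quick:
        return "empathy"              # e.g. sad -> sad, shown promptly
    if first_expr == "smile" and quick and second_expr in NEGATIVE:
        return "antipathy"            # smile -> prompt angry or sullen expression
    if first_expr == "smile" and same and not quick:
        return "antipathy"            # smile returned only after a delay
    return "neither"                  # cases the text does not address
```

A hard threshold is used here only for readability. The description instead relies on a timing model that expresses a likelihood as a function of the time difference.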

The place where the action presentation unit 84 presents the action to the second person 200 is not limited to the first video. In other words, any method may be used as long as the action presentation unit 84 can present to the second person 200 the action the second person 200 should perform. For example, the action presentation unit 84 may present the action by displaying it on the display unit 83 before the first video is presented to the second person 200.

The action presentation unit 84 need not be provided in the empathy interpretation estimation apparatus. In this case, the second person 200 may or may not know in advance the action that the second person 200 should perform.

The person video presentation unit 82 may present to the second person 200 a video of each of a plurality of persons including the first person. For example, the person video presentation unit 82 presents the videos of the four persons shown in FIG. 6, including the first person, to the second person 200 via the display unit 83. In this case, the behavior of one of the four persons changes. In FIG. 6, the action presentation unit 84 shows the action "Please empathize" in the video of each of the four persons. The second person 200 therefore behaves so as to show empathy toward whichever of the four persons changes behavior. The person whose behavior changes is the first person 100.

In the example of FIG. 6, the action presented by the action presentation unit 84 for each of the plurality of persons is the common action "Please empathize", but the actions presented for the respective persons may differ.

For example, in FIG. 7, the action that the second person 200 should perform toward the two persons in the upper row is "Please empathize", whereas the action that the second person 200 should perform toward the two persons in the lower row is "Please show antipathy".

In this case, when the behavior of one or both of the two persons in the upper row changes, the second person 200 is required to show empathy, and when the behavior of one or both of the two persons in the lower row changes, the second person 200 is required to show antipathy.

As long as the action that the second person 200 should perform can be indicated, the behavior of all four persons may change.

For example, when the action that the second person 200 should perform toward all four persons is "Please empathize" and all four persons begin to smile, the second person 200 is required to show empathy, for example by promptly smiling.

Also, as illustrated in FIG. 7, suppose that the action toward the two persons in the upper row is "Please empathize" and the action toward the two persons in the lower row is "Please show antipathy". When the two persons in the upper row begin to smile and the two persons in the lower row turn sullen, the second person 200 is required to behave in a way that shows empathy, for example by promptly smiling.

In this way, the person video presentation unit 82 may present to the second person 200 a video of each of a plurality of persons, at least one of whom changes behavior. In this case, the behavior recognition unit 20 and the posterior probability estimation unit 50 treat each person whose behavior changes as a first person, and the empathy interpretation between each such first person and the second person is estimated. For example, when the action that the second person 200 should perform toward all four persons is "Please empathize" and all four persons begin to smile, each of the four persons is treated as a first person, and the empathy interpretation between each of them and the second person is estimated.
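The multi-person case can be sketched as a loop that treats every person whose behavior changes as a first person. In this illustration, the function name, the dictionary of per-person behavior series, the callable `estimate_pair`, and the rule used to decide that a behavior "changed" (more than one distinct label in the series) are all assumptions introduced for the example.

```python
from typing import Callable, Dict, List

Posterior = Dict[str, float]


def estimate_for_changed_persons(
    first_series_by_person: Dict[str, List[str]],
    second_series: List[str],
    estimate_pair: Callable[[List[str], List[str]], Posterior],
) -> Dict[str, Posterior]:
    """Estimate the empathy interpretation for each displayed person who changes behavior."""
    results = {}
    for person_id, series in first_series_by_person.items():
        # Assumed change test: the series contains more than one distinct behavior.
        if len(set(series)) > 1:
            # This person acts as a first person paired with the second person.
            results[person_id] = estimate_pair(series, second_series)
    return results
```

Persons whose behavior never changes are skipped, so the result contains one estimate per first person.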

The empathy interpretation estimation apparatus may further include the input unit 10, the learning video storage unit 70, the empathy interpretation assignment unit 30, and the parameter learning unit 40 described later in [Learning phase]. In this case, the apparatus may have the function of learning and generating the model parameters.

[Program, recording medium]
The present invention is not limited to the embodiment described above and can of course be modified as appropriate without departing from the spirit of the invention. The various processes described in the embodiment need not be executed in time series in the order described; they may be executed in parallel or individually according to the processing capability of the apparatus that executes them, or as needed.

When the various processing functions of each apparatus described in the embodiment are implemented by a computer, the processing content of the functions that each apparatus should have is described by a program. Executing this program on a computer implements those processing functions on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disk, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the read program. As another mode of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program each time the program is transferred to it from the server computer. Alternatively, the processing described above may be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized solely through execution instructions and retrieval of results. The program in this embodiment includes information that is provided for processing by a computer and is equivalent to a program (for example, data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the apparatus is configured by executing a predetermined program on a computer, but at least part of the processing content may instead be implemented in hardware.

[Learning phase]
The learning phase, in which the model parameters are learned and generated, is described below. The model parameters are learned by, for example, the dialogue state estimation apparatus 1 shown in FIG. 9.

<Configuration>
A configuration example of the dialogue state estimation apparatus 1 of this embodiment is described with reference to FIG. 9. The dialogue state estimation apparatus 1 includes an input unit 10, a behavior recognition unit 20, an empathy interpretation assignment unit 30, a parameter learning unit 40, a learning video storage unit 70, and a model parameter storage unit 74. The learning video storage unit 70 can be configured as, for example, a main storage device such as RAM (Random Access Memory), or an auxiliary storage device composed of a hard disk, an optical disk, or a semiconductor memory element such as flash memory. The model parameter storage unit 74 may be configured in the same way as the learning video storage unit 70, or by middleware such as a relational database or a key-value store.

A configuration example of the parameter learning unit 40 of this embodiment is described with reference to FIG. 10. The parameter learning unit 40 includes a prior distribution learning unit 42, a timing model learning unit 44, and a static model learning unit 46.

<Learning phase>
An operation example of the dialogue state estimation apparatus 1 in the learning phase is described with reference to FIG. 11.

A learning video is input to the input unit 10 (step S11). The learning video is a video of a situation in which a plurality of persons are conversing, and at least the heads of the interlocutors must be captured. The learning video may be a multiplexed video of footage shot by multiple cameras, with one camera prepared for each interlocutor, or a video in which all interlocutors are captured by a single omnidirectional camera, for example one using a fisheye lens. The input learning video is stored in the learning video storage unit 70.

The action recognition unit 20 takes as input the learning video stored in the learning video storage unit 70, detects facial expression, gaze, head gestures, presence or absence of utterance, and the like as the behaviors of each interlocutor captured in the learning video, and outputs the resulting time series of the interlocutors' behaviors (step S21). In this embodiment, four behavior channels are recognized: facial expression, gaze, head gesture, and presence or absence of utterance. A behavior channel is a form of behavior. Facial expression is the main pathway for expressing emotion. In this embodiment, six states are recognized for facial expression: neutral / smile / laughter / wry smile / thinking / other. Gaze indicates at least one of to whom a person is trying to convey emotion and that the person is observing the behavior of others. In this embodiment, the gaze states to be recognized are looking at one of the others (together with who that person is) and looking at no one. The number of states therefore equals the number of interlocutors. Here, the interlocutors are all participants in the dialogue, including the person whose gaze is being measured. Facial expression and gaze may be recognized by the methods described in Japanese Laid-Open Patent Publication No. 2012-185727 (Reference 1) or in Shiro Kumano, Kazuhiro Otsuka, Dan Mikami, Junji Yamato, "Estimation model of empathy/antipathy based on facial expression and gaze in multi-party conversations and its evaluation", IEICE Technical Report, Technical Committee on Human Communication Science, HCS 111(214), pp. 33-38, 2011 (Reference 2).

Head gestures are often displayed as an expression of attitude toward the opinions of others. In this embodiment, four states are recognized for head gestures: none / nod / shake / tilt / combinations of these. Any known method can be used to recognize head gestures, for example the method described in Yasushi Ejiri, Tetsunori Kobayashi, "Recognition of head gestures during dialogue", IEICE Technical Report, PRMU2002-61, pp. 31-36, Jul. 2002 (Reference 3). The presence or absence of utterance is a major indicator of the conversational role of speaker or listener. In this embodiment, two states are recognized: utterance / silence. The presence of utterance may be recognized by detecting the audio power in the video and judging that the person is speaking when it exceeds a predetermined threshold. Alternatively, it may be detected from the movement of the interlocutor's mouth in the video. All behaviors may be recognized by a single device, or a separate device may be used for each behavior. For facial expression recognition, for example, Japanese Patent No. 4942197 (Reference 4) may be used as an example of a behavior recognition device. The action recognition unit 20 may also output the result of manual labeling, in the same way as the empathy interpretation assigning unit 30.
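The power-threshold test for the presence or absence of utterance described above can be sketched as follows. This is only a minimal illustration: the function name `detect_utterance`, the frame length, and the threshold value are assumptions chosen for the example and are not specified in the embodiment.

```python
import math


def detect_utterance(samples, frame_len=400, power_threshold_db=-35.0):
    """Label each full frame as utterance (True) or silence (False).

    samples: audio samples normalized to the range [-1.0, 1.0].
    frame_len and power_threshold_db are illustrative assumptions.
    """
    labels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        # Mean power of the frame, converted to decibels.
        power = sum(x * x for x in frame) / frame_len
        power_db = 10.0 * math.log10(power) if power > 0.0 else float("-inf")
        labels.append(power_db > power_threshold_db)
    return labels
```

In practice the threshold would be tuned to the recording conditions; the value above merely separates digital silence from a clearly audible signal.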

For facial expressions and head gestures, an "intensity" may also be estimated and output. The intensity of a facial expression can be obtained as the probability that the expression is the target expression. The intensity of a head gesture can be obtained as the ratio of the amplitude of the observed motion to the maximum amplitude (for a nod, the maximum nodding angle).
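As a small illustration of the head gesture intensity described above, the ratio can be computed as follows. The function name and the clipping of the result to 1.0 are assumptions made for this sketch.

```python
def head_gesture_intensity(observed_amplitude_deg, max_amplitude_deg):
    """Ratio of the observed motion amplitude to the maximum amplitude.

    Clipping to 1.0 is an assumption of this sketch, so that an
    unexpectedly large motion does not yield an intensity above 1.
    """
    if max_amplitude_deg <= 0:
        raise ValueError("max_amplitude_deg must be positive")
    return min(observed_amplitude_deg / max_amplitude_deg, 1.0)
```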

Based on the learning video stored in the learning video storage unit 70, the empathy interpretation assigning unit 30 outputs a learning empathy interpretation time series in which a plurality of external observers have labeled the empathy interpretation (step S30). The learning empathy interpretation time series is obtained by presenting the learning video to a plurality of external observers and having them manually label the empathy interpretation between each pair of interlocutors at each time. In this embodiment, three states are considered as the dialogue state between two parties: empathy / antipathy / neither. The dialogue state between two parties is deeply related to conformity pressure (the feeling that one must go along when many others hold the same opinion that differs from one's own) and is a basic element in forming consensus and building relationships. These states as interpreted by external observers are collectively called the empathy interpretation. That is, the dialogue state interpretation in this embodiment is the empathy interpretation.

The learning action time series output by the action recognition unit 20 and the learning empathy interpretation time series output by the empathy interpretation assigning unit 30 are input to the parameter learning unit 40. The parameter learning unit 40 learns model parameters that relate the external observers' empathy interpretation to the interlocutors' behaviors. The model parameters include a prior distribution of the empathy interpretation between interlocutors, a timing model representing the likelihood of the empathy interpretation based on the time difference between the interlocutors' behaviors and the coincidence of those behaviors, and a static model representing the likelihood of the empathy interpretation based on the co-occurrence of the interlocutors' behaviors.

The prior distribution learning unit 42 of the parameter learning unit 40 learns the prior distribution using the learning empathy interpretation time series (step S42). The timing model learning unit 44 of the parameter learning unit 40 learns the timing model using the learning action time series and the learning empathy interpretation time series (step S44). The static model learning unit 46 of the parameter learning unit 40 learns the static model using the learning action time series and the learning empathy interpretation time series (step S46). The obtained model parameters are stored in the model parameter storage unit 74.

<< Overview of model >>
The model of this embodiment is described in detail below. In this embodiment, the empathy interpretation given by the external observers is assumed to be independent for each pair of interlocutors. The following therefore assumes that there are only two interlocutors. When there are three or more interlocutors, learning may be performed by considering each pair of interlocutors separately.

In this embodiment, the posterior probability distribution P(e_t|B) of the external observer's empathy interpretation e at each time t, given the time series B of the interlocutors' behaviors, is modeled and estimated using a naive Bayes model. The naive Bayes model assumes that the probabilistic dependencies between the dependent variable (here, the empathy interpretation) and each explanatory variable (here, the behavior of each interlocutor) are independent across the explanatory variables. Despite its simplicity, the naive Bayes model has been confirmed to show high estimation performance in many fields. Using the naive Bayes model in this invention has two advantages. First, because it does not model every co-occurrence across behavior channels (for example, the state in which facial expression, gaze, head gesture, and utterance all occur simultaneously), overfitting is easier to avoid. This is particularly effective when there are few learning samples relative to the variable space of interest. Second, behavior channels can easily be added or removed as observation information.

In the naive Bayes model of this embodiment, the posterior probability distribution P(e_t|B) is defined as in Equation (1).

Here, P(dt_t^b|c_t^b,e_t) is the timing model. It represents the likelihood that the external observer's empathy interpretation is e when, around time t, the behaviors of the two parties in behavior channel b have time difference dt_t^b and coincidence c_t^b. The coincidence c is a binary state indicating whether or not the behaviors of the two parties coincide, and it is judged by whether the behavior categories of the two interlocutors are the same. P(b_t|e_t) is the static model, which models how behavior channel b co-occurs between the two interlocutors at the instant of time t. These two models are described in turn below. P(e_t) is the prior distribution of the empathy interpretation e and represents the probability with which each empathy interpretation e is generated when behaviors are not taken into account.
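Equation (1) itself is not reproduced in this text. As a rough sketch only, assuming the usual naive Bayes form in which the posterior is proportional to the prior multiplied by one timing-model likelihood and one static-model likelihood per behavior channel, the computation could look like the following. The function name `posterior` and the dictionary layout are illustrative assumptions, not part of the embodiment.

```python
def posterior(prior, timing_likelihoods, static_likelihoods):
    """Naive Bayes posterior over empathy interpretations at one time t.

    prior: dict mapping each interpretation e to P(e).
    timing_likelihoods, static_likelihoods: lists with one dict per
        behavior channel, each mapping e to that channel's likelihood.
    The product form is an assumption of this sketch.
    """
    scores = {}
    for e, p_e in prior.items():
        score = p_e
        for channel in timing_likelihoods:
            score *= channel[e]
        for channel in static_likelihoods:
            score *= channel[e]
        scores[e] = score
    total = sum(scores.values())
    if total == 0:
        raise ValueError("all interpretations have zero probability")
    # Normalize so that the posterior sums to 1 over the interpretations.
    return {e: s / total for e, s in scores.items()}
```

Because each channel contributes an independent factor, adding or removing a behavior channel only means adding or removing one dictionary from the lists.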

<< Timing model >>
The timing model for behavior channel b in this embodiment is defined as in Equation (2).

As is clear from Equation (2), the timing model consists of a time difference function P(d~t_t^b|c_t^b,e_t), which represents the likelihood of the empathy interpretation e when the time difference between the two interlocutors' behaviors is dt and their coincidence is c, and a change timing function π_t, which represents at what timing around that interaction the empathy interpretation e changes. d~t_t^b is the bin number obtained when the time series of the external observers' empathy interpretation is converted into a histogram. The bin size is, for example, 200 milliseconds.

In this embodiment, a timing model between the two parties is built within each behavior channel, but a model across behavior channels may also be built. For example, the relationship among the time difference dt between a facial expression and a head gesture, their coincidence c, and the empathy interpretation e can be modeled. In this case, however, deciding the coincidence c requires newly introducing, for each behavior channel, a set of categories that allows c to be judged even across different behavior channels, for example positive / neutral / negative. These categories may be recognized when the behavior channels are detected from the video, or the behaviors may first be recognized with a different category set for each channel and their labels reclassified afterwards as positive / neutral / negative, for example treating a smile as positive.

<< Time difference function >>
The time difference function P(d~t_t^b|c_t^b,e_t) represents the likelihood of each type of empathy interpretation e, given the coincidence c, which indicates whether the behaviors of the two interlocutors coincide in behavior channel b, and the time difference dt between them. This embodiment uses the bin number d~t_t^b obtained when the time series of the external observers' empathy interpretation is converted into a histogram. The bin size is, for example, 200 milliseconds.

FIG. 12 shows an example of the time difference function of this embodiment. The time difference function P(d~t_t^b|c_t^b,e_t) determines the likelihood of the empathy interpretation e from the coincidence c of the interlocutors' behaviors and the time difference bin number d~t_t^b. FIG. 12(A) is an example of the time difference function when the interlocutors' behaviors coincide, and FIG. 12(B) is an example when they do not. For example, when the interlocutors' behaviors coincide and the time difference from the giver's behavior display to the receiver's reaction display is 500 milliseconds, the likelihood that the empathy interpretation e is "empathy" is about 0.3, the likelihood of "neither" is about 0.2, and the likelihood of "antipathy" is about 0.5. The time difference function is obtained by tallying the time series of empathy interpretations labeled by the external observers per time difference bin and normalizing so that, for each category of the empathy interpretation e, the likelihoods sum to 1 over all time difference bins.
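The tallying and normalization described above can be sketched as follows. This is a minimal illustration under assumptions: each learning sample is taken to be a tuple of a time difference in milliseconds, a coincidence flag, and a labeled interpretation, and the table is normalized separately for each combination of coincidence and interpretation. The function name and data layout are hypothetical.

```python
from collections import defaultdict

BIN_MS = 200  # bin size given as an example in the text


def learn_time_difference_function(samples, bin_ms=BIN_MS):
    """Histogram-based estimate of the time difference function.

    samples: iterable of (dt_ms, coincident, e) tuples (assumed layout),
        where dt_ms is a non-negative time difference in milliseconds.
    Returns a dict keyed by (coincident, e) whose values map a bin
    number to a likelihood that sums to 1 over all bins.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for dt_ms, coincident, e in samples:
        bin_no = int(dt_ms // bin_ms)
        counts[(coincident, e)][bin_no] += 1
    table = {}
    for key, bins in counts.items():
        total = sum(bins.values())
        table[key] = {b: n / total for b, n in bins.items()}
    return table
```

With a 200 millisecond bin, a time difference of 500 milliseconds falls into bin number 2.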

<< Change timing function >>
The change timing function π represents at what timing the empathy interpretation e changes. Viewed another way, the change timing function π determines over what range and with what strength the time difference function contributes to the estimation of the empathy interpretation e in Equation (1).

In this embodiment, the change timing function is modeled as in Equation (3).

Here, t_a denotes the time at which the giver's behavior display starts in the interaction of interest. The time t' denotes the relative time within the interaction, with the start of the giver's behavior display at t'=0 and the start of the receiver's reaction display at t'=1, and is calculated as t'=(t-t_a)/dt.

π=0 means that the timing model P(dt_t^b|c_t^b,e_t) makes no contribution at all to the posterior probability distribution P(e_t|B) of Equation (1). π=1 means that the timing model P(dt_t^b|c_t^b,e_t) contributes fully to the posterior probability distribution P(e_t|B).
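Equation (2) is not reproduced in this text. One form that is consistent with the two limiting cases just described, and that is assumed here purely for illustration, uses π as an exponent on the time difference function, so that the factor becomes 1 (no influence on the product) when π=0 and equals the time difference function itself when π=1. The function name is hypothetical.

```python
def weighted_timing_likelihood(time_diff_likelihood, pi):
    """Timing-model factor under the assumed exponent form.

    time_diff_likelihood: value of the time difference function.
    pi: value of the change timing function, expected in [0, 1].
    """
    if not 0.0 <= pi <= 1.0:
        raise ValueError("pi must lie in [0, 1]")
    return time_diff_likelihood ** pi
```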

The condition dt>L means that the receiver's reaction display is too late relative to the giver's behavior display. In this embodiment, for example, the threshold L is set to 2 seconds. This value is based on the research finding that a listener's facial expression in response to a lexically important phrase from the speaker occurs within roughly 500 to 2,500 milliseconds, and on the assumption that reactions in every behavior channel fall approximately within this range. For details of this research, see G. R. Jonsdottir, J. Gratch, E. Fast, and K. R. Thorisson, "Fluid semantic back-channel feedback in dialogue: Challenges & progress", International Conference Intelligent Virtual Agents (IVA), pp. 154-160, 2007 (Reference 5).

The condition t-t_a>W means that a long time has elapsed at time t since the giver's most recent preceding expression display. This models the fact that, once the two parties display behaviors to each other and an interaction takes place, the external observer's empathy interpretation is influenced by its timing for a certain period, but the influence disappears if no further interaction occurs for a while. The threshold W may be any positive value, and it may even be set to infinity when interactions between the two target parties occur continually, as in a two-party dialogue. However, when interactions are not necessarily frequent, for example the interaction between two listeners in a large-group conversation in which mainly one person speaks, an overly long W may be inappropriate. In this embodiment, the threshold W is empirically set to 4 seconds, based on the experimental result that estimation accuracy was highest when W was set to around 4 seconds.

FIG. 13 shows an example of the empathy interpretation, the giver's behavior display, and the receiver's reaction display. The fill patterns in FIG. 13 represent different categories of behavior or empathy interpretation. The values of α and β are set to, for example, α=0.2 and β=0.8. These values were chosen so that the change timing function π of Equation (3) best approximates the cumulative probability.

FIG. 14 shows an example of the change timing function π. The points plotted on the graph show, for the empathy interpretation labels that a total of nine external observers gave to conversation data from four conversation groups of four women each (16 people in total), the cumulative probability of where within the relative time t' the label changed. It can be seen that the change timing function approximates these points well. However, α and β need not be limited to these values, as long as α+β=1, 0≦α≦1, and 0≦β≦1 are satisfied. As a simple setting, α=0 and β=1 may also be used.
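Equation (3) is not reproduced in this text. The sketch below assumes a piecewise-linear form that is consistent with the surrounding description: π is 0 when dt>L or t-t_a>W, rises as α+βt' over 0≦t'≦1, and is 1 afterwards, which is continuous at t'=1 because α+β=1. Returning 0 before the giver's behavior display starts is a further assumption, and the function name and argument names are hypothetical.

```python
def change_timing(t, t_a, dt, alpha=0.2, beta=0.8, L=2.0, W=4.0):
    """Assumed piecewise-linear change timing function (times in seconds).

    t: current time; t_a: start of the giver's behavior display;
    dt: delay from the giver's display to the receiver's reaction.
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    elapsed = t - t_a
    # Reaction too late, interaction too old, or not yet started.
    if dt > L or elapsed > W or elapsed < 0:
        return 0.0
    t_rel = elapsed / dt  # relative time t' within the interaction
    if t_rel <= 1.0:
        return alpha + beta * t_rel
    return 1.0
```

With the example values α=0.2 and β=0.8, the function starts at 0.2 when the giver's display begins and reaches 1 when the receiver's reaction begins.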

FIGS. 15 and 16 schematically show examples of the effective range of the change timing function. Solid black indicates that no behavior is detected, and solid white and diagonal hatching indicate behavior categories. For the empathy interpretation, vertical hatching indicates empathy and horizontal hatching indicates antipathy. FIG. 15(A) shows the effective range when the interlocutors' behaviors coincide. Because the giver's behavior and the receiver's reaction coincide, "empathy" continues for the duration of the threshold W. FIG. 15(B) shows the effective range when the interlocutors' behaviors do not coincide. Because the giver's behavior and the receiver's reaction do not coincide, "antipathy" continues for the duration of the threshold W. FIG. 15(C) shows a situation in which the receiver's reaction display is too late relative to the giver's behavior display, that is, dt>L, so the change timing function is outside its effective range. In this case, the "neither" state continues throughout. FIG. 16 shows the effective range when the two interlocutors display behaviors alternately. The basic idea is the same as in FIGS. 15(A) to 15(C).

<< Static model >>
The static model P(b_t|e_t) models the likelihood with which the empathy interpretation e is generated when particular behaviors co-occur between the two interlocutors in behavior channel b at time t.

For facial expression and gaze, a modeling method has been proposed in Patent Document 1 and Non-Patent Document 1, so the descriptions in those documents may be followed: a model of the gaze state between the two interlocutors is combined with a model of the co-occurrence of facial expression states for each gaze state. Here, the gaze state between the two parties may take, for example, three states: mutual gaze / one-sided gaze / mutual averted gaze.
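The three pairwise gaze states named above can be derived from each person's individual gaze target. The following is a minimal sketch; the function name and the boolean inputs are assumptions for illustration.

```python
def pairwise_gaze_state(a_looks_at_b, b_looks_at_a):
    """Map two individual gaze flags to the gaze state of the pair."""
    if a_looks_at_b and b_looks_at_a:
        return "mutual gaze"
    if a_looks_at_b or b_looks_at_a:
        return "one-sided gaze"
    return "mutual averted gaze"
```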

The static model for head gestures is written P(g|e), where g denotes the combined head gesture state of the two parties. If the number of head gesture states considered is N_g, the number of combined states between the two parties is N_g×N_g. Any kind and number of categories may be used, but with too many categories the model is prone to overfitting when the number of learning samples is small. In that case, the initially prepared categories may be further grouped by clustering. One such method is Sequential Backward Selection (SBS). For example, for head gesture categories, estimation is performed using head gestures alone, that is, with the posterior probability defined as P(e|B):=P(e)P(g'|e), and from all the categories the two whose merging gives the highest estimation accuracy are selected and merged into one. Repeating this until just before the estimation accuracy deteriorates reduces the number of categories one at a time. Here, g' is the combined head gesture state of the two parties after grouping. For the presence or absence of utterance, the co-occurrence between the two parties is modeled in the same way as for head gestures.
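The category merging procedure described above can be sketched as a greedy loop. This is only an illustration: `evaluate` is a hypothetical callback that is assumed to train and test the head-gesture-only estimator for a given grouping and return its estimation accuracy, and the function name is likewise an assumption.

```python
from itertools import combinations


def merge_categories_greedy(categories, evaluate):
    """Merge categories pairwise until accuracy would deteriorate.

    categories: iterable of category names.
    evaluate: hypothetical callback taking a list of frozensets
        (the current grouping) and returning estimation accuracy.
    Returns the final grouping and its accuracy.
    """
    groups = [frozenset([c]) for c in categories]
    best_score = evaluate(groups)
    while len(groups) > 1:
        candidates = []
        for a, b in combinations(groups, 2):
            merged = [g for g in groups if g not in (a, b)] + [a | b]
            candidates.append((evaluate(merged), merged))
        # Pick the merge that gives the highest accuracy.
        score, merged = max(candidates, key=lambda item: item[0])
        if score < best_score:
            break  # stop just before the accuracy deteriorates
        best_score, groups = score, merged
    return groups, best_score
```

In this sketch a merge that leaves the accuracy unchanged is accepted, which favors fewer categories when the accuracy is tied.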

<< Model learning method >>
In this embodiment, every model is described in terms of discrete states. In the learning phase, it is therefore sufficient to count how many times each discrete state appears in the learning samples and finally to normalize the counts into probabilities.
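The count-and-normalize procedure can be sketched for the head gesture static model P(g|e) as follows. The function name and the representation of a pairwise state as a tuple of the two interlocutors' gesture categories are assumptions for illustration.

```python
from collections import Counter, defaultdict


def learn_static_model(pair_states, labels):
    """Estimate P(g|e) by counting co-occurrences and normalizing.

    pair_states: sequence of pairwise states g, one per time step.
    labels: sequence of empathy interpretations e, aligned in time.
    """
    counts = defaultdict(Counter)
    for g, e in zip(pair_states, labels):
        counts[e][g] += 1
    model = {}
    for e, state_counts in counts.items():
        total = sum(state_counts.values())
        model[e] = {g: n / total for g, n in state_counts.items()}
    return model
```

States that never appear in the learning samples receive no entry here; whether and how to smooth such zero counts is left open in this sketch.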

As a policy for preparing the models, if the group of interlocutors captured in the learning video used to learn the model parameters is the same as the group captured in the estimation video whose dialogue state is to be estimated, the parameters may be learned independently for each pair of interlocutors, and estimation for a given pair may use the parameters learned from that pair's data. On the other hand, if the group of interlocutors in the learning video differs from the group in the estimation video, a single model may be learned without distinguishing between pairs, and that single model may be used for estimation on the pair of interest.

The empathy interpretation estimation apparatus and method can be used, for example, in training or games for practicing the display of conversational behavior. For example, in a game such as a romance simulation, the empathy interpretation estimation apparatus and method may be used to estimate the empathy interpretation between a character appearing in the game and the player playing the game.

20 Action recognition unit
50 Posterior probability estimation unit
72 Estimation video storage unit
81 Video acquisition unit
82 Person video presentation unit
83 Display unit
84 Action presentation unit
100 First person
200 Second person

Claims

An empathy interpretation estimation apparatus comprising:
a person video presentation unit that presents, to a second person, a first video of a first person whose behavior changes;
an action presentation unit that presents, to the second person, an action relating to empathy or antipathy that the second person should perform in response to a change in the behavior of the first person;
a video acquisition unit that acquires a second video capturing the head of the second person to whom the first video is presented;
an action recognition unit that generates a second action time series of the second person by detecting, in the second video, the behavior of the second person in response to the change in the behavior of the first person;
a posterior probability estimation unit that estimates an empathy interpretation between the first person and the second person, based on a first action time series of the first person in the first video and the second action time series of the second person in the second video, using model parameters including a timing model representing the likelihood of the empathy interpretation based on the time difference between behaviors and the coincidence of behaviors; and
a display unit that presents the estimated empathy interpretation to the second person as a result for the change in the behavior of the first person.

The empathy interpretation estimation apparatus according to claim 1, wherein
the display unit presents, among the estimated empathy interpretations, only the empathy interpretation related to the action presented by the action presentation unit.

The empathy interpretation estimation apparatus according to claim 1 or 2, wherein
the action recognition unit further generates the first action time series of the first person by detecting the behavior of the first person in the first video, and
the posterior probability estimation unit estimates the empathy interpretation based on the first action time series and the second action time series generated by the action recognition unit.

The empathy interpretation estimation apparatus according to any one of claims 1 to 3, wherein
the person video presentation unit presents, to the second person, respective videos of a plurality of persons including at least one person whose behavior changes, and
the empathy interpretation between each first person and the second person is estimated by the processing of the action recognition unit and the posterior probability estimation unit, with each of the at least one person whose behavior changes treated as the first person.

A person video presentation step in which a person video presentation unit presents to a second person a first video of a first person whose behavior changes;
An action presentation step in which an action presentation unit presents to the second person an action, relating to empathy or antipathy, that the second person should perform in response to a change in the behavior of the first person;
A video acquisition step in which a video acquisition unit acquires a second video obtained by photographing the head of the second person to whom the first video is presented;
An action recognition step in which an action recognition unit generates a second action time series of the second person by detecting, in the second video, the behavior of the second person in response to a change in the behavior of the first person;
A posterior probability estimation step in which a posterior probability estimation unit estimates an empathy interpretation between the first person and the second person, based on the first action time series of the first person in the first video and the second action time series of the second person in the second video, using model parameters including a timing model that represents the likelihood of an empathy interpretation based on the time difference between behaviors and the coincidence of behaviors; and
A display step in which a display unit presents the estimated empathy interpretation to the second person as the result for the change in the behavior of the first person;
An empathy interpretation estimation method including the above steps.

In the empathy interpretation estimation method according to claim 5,
The display unit presents, among the estimated empathy interpretations, only the empathy interpretation related to the action presented by the action presentation unit,
Empathy interpretation estimation method.

In the empathy interpretation estimation method according to claim 5 or 6,
The action recognition unit further generates a first action time series of the first person by detecting the behavior of the first person in the first video,
The posterior probability estimation unit estimates the empathy interpretation based on the first action time series and the second action time series generated by the action recognition unit,
Empathy interpretation estimation method.

In the empathy interpretation estimation method according to any one of claims 5 to 7,
The person video presentation unit presents to the second person a video of each of a plurality of persons including at least one person whose behavior changes,
The empathy interpretation between each first person and the second person is estimated by the processing of the action recognition unit and the posterior probability estimation unit, with each of the at least one person whose behavior changes treated as the first person,
Empathy interpretation estimation method.

A program for causing a computer to function as each unit of the empathy interpretation estimation apparatus according to any one of claims 1 to 4.
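
For illustration only, and not as part of the claims, the posterior probability estimation recited in claims 1 and 5 might be sketched as follows. The sketch assumes that each action time series is a time-sorted list of (onset time in seconds, behavior label) pairs, that the timing model is a Gaussian over the time difference conditioned on the coincidence of behaviors, and that the prior is uniform. The state labels, all numeric parameters, and the function names pair_reactions and estimate_posterior are hypothetical and do not come from the claims.

```python
import math

# Hypothetical empathy-interpretation states (assumed labels).
STATES = ("empathy", "antipathy", "neither")

# Assumed uniform prior over the states.
PRIOR = {s: 1.0 / len(STATES) for s in STATES}

# Assumed probability that the two behaviors coincide, per state.
COINCIDENCE_PROB = {"empathy": 0.8, "antipathy": 0.2, "neither": 0.5}

# Assumed timing model: (mean, std) in seconds of the time difference,
# given the state and whether the behaviors coincide (True) or not (False).
TIMING_PARAMS = {
    "empathy": {True: (0.5, 0.4), False: (2.0, 1.5)},
    "antipathy": {True: (2.0, 1.5), False: (0.5, 0.4)},
    "neither": {True: (1.5, 2.0), False: (1.5, 2.0)},
}


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def pair_reactions(first_series, second_series):
    """Pair each first-person behavior change with the earliest second-person
    behavior that follows it and precedes the next first-person change.
    Returns a list of (time_difference, behaviors_coincide)."""
    pairs = []
    for i, (t1, label1) in enumerate(first_series):
        t_next = first_series[i + 1][0] if i + 1 < len(first_series) else math.inf
        for t2, label2 in second_series:
            if t1 <= t2 < t_next:
                pairs.append((t2 - t1, label1 == label2))
                break
    return pairs


def estimate_posterior(first_series, second_series):
    """Posterior over STATES, proportional to the prior times the
    timing-model likelihood of every paired reaction."""
    log_post = {s: math.log(PRIOR[s]) for s in STATES}
    for dt, coincide in pair_reactions(first_series, second_series):
        for s in STATES:
            p_c = COINCIDENCE_PROB[s] if coincide else 1.0 - COINCIDENCE_PROB[s]
            mean, std = TIMING_PARAMS[s][coincide]
            # Floor the likelihood so that math.log never receives zero.
            log_post[s] += math.log(max(p_c * gaussian_pdf(dt, mean, std), 1e-300))
    # Normalize in a numerically stable way.
    peak = max(log_post.values())
    weights = {s: math.exp(v - peak) for s, v in log_post.items()}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}


if __name__ == "__main__":
    first = [(1.0, "smile")]    # first person starts smiling at 1.0 s
    second = [(1.4, "smile")]   # second person smiles back at 1.4 s
    print(estimate_posterior(first, second))
```

Under these assumed parameters, a quick matching reaction favors the "empathy" state and a quick non-matching reaction favors "antipathy". A display restricted as in claims 2 and 6 would then show only the posterior entry corresponding to the action that was presented to the second person.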