JP2005199403A

JP2005199403A - Emotion recognition device and method, emotion recognition method of robot device, learning method of robot device and robot device

Info

Publication number: JP2005199403A
Application number: JP2004009690A
Authority: JP
Inventors: Kuniaki Noda; 邦昭野田; Masahiro Fujita; 雅博藤田; Tsutomu Sawada; 務澤田
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2004-01-16
Filing date: 2004-01-16
Publication date: 2005-07-28

Abstract

PROBLEM TO BE SOLVED: To stably estimate the emotional state of a recognition target by taking context into account.

SOLUTION: An emotion estimation system 50 includes: an emotion estimation device 1 that, based on time-series sensor information, outputs an estimated emotion Es obtained by estimating a user's emotion in consideration of the context of the sensor information; an emotion prediction device 40 that refers to a predicted emotion change database learned in consideration of the context of actions, finds the predicted emotion change dEb expected to occur after an action is executed, and outputs a predicted emotion Eb based on the estimation result Es' of the emotion estimation device 1 before the action; and an emotion integration part 52 that fuses the estimated emotion Es and the predicted emotion Eb using a parameter η that accounts for the decay of the predicted emotion Eb over time, and outputs a final estimated emotion E.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to an emotion recognition device and method that recognize the emotion of a recognition target based on feature quantities of that target extracted from sensor information, to a learning method for a robot apparatus equipped with the emotion recognition device, and to a robot apparatus that recognizes the emotion of an interaction target and its emotion recognition method.

Conventional emotion recognition technologies and systems that apply them include Patent Documents 1 to 4 listed below. For example, Patent Document 1 discloses an interactive movie system in which the story unfolds in response to the emotions contained in a user's voice. Patent Document 2 discloses a hierarchical emotion recognition apparatus that provides one recognizer that recognizes emotion from speech and another that recognizes emotion from images, and integrates their results with weights that depend on the recognized emotions.

Patent Document 3 discloses a presentation support apparatus that recognizes the emotions of a presenter, for example at an academic conference, according to the reactions of the audience and feeds them back to the presenter. Furthermore, Patent Document 4 discloses a behavior pattern processing device that recognizes human emotions by weighting and integrating the recognition results of a plurality of emotion recognizers, and proposes behavior patterns that take the recognized emotions into account.

The emotion recognition techniques in Patent Documents 1 to 4 have the following characteristics:
A dedicated sub-recognizer is provided for each emotion, and the outputs are logically combined into the final output.
Recognition results based on multiple feature quantities are combined by weighting to form the final output.
The weighting parameters used to integrate the recognition results are calculated from experimentally collected teacher data.
The parameters used by each recognizer are prepared for each individual to be recognized.

Patent Document 1: Japanese Patent No. 2874858
Patent Document 2: Japanese Patent No. 2967058
Patent Document 3: Japanese Patent No. 2960029
Patent Document 4: Japanese Patent Laid-Open No. 2002-73634

However, the techniques described in Patent Documents 1 to 4 have the following problem. Each recognizer determines its output solely from the vector of sensor input information at that instant. Consequently, when an emotion recognition device is built from a model that learns the mapping between sensor input vectors (for facial expressions, speech, gestures, and so on) and the corresponding emotions, it treats the sensor input at that moment, that is, only the superficial change in the state of the recognition target, as absolutely reliable when determining its output. The recognition result therefore changes easily with a slight change in sensor input and never stabilizes.

For example, if a person who is talking cheerfully pauses to think and frowns slightly, that instant alone is judged to express disgust. Conventional techniques propose using the information obtained from such fragile recognizers directly for selecting device functions or robot actions. In a real system, however, selecting functions in response to small changes in output causes functions to be switched frequently and hinders use of the device. What matters for device function selection, robot action selection, and communication is the emotional state that persistently underlies the recognition target. Fluctuations in the recognition result caused by a momentary change in facial expression or a slight change in tone of speech need to be smoothed by some form of processing.

The present invention has been proposed in view of these circumstances. Its object is to provide an emotion estimation device and method that stably estimate the emotional state of a recognition target by taking context into account, a robot apparatus equipped with such an emotion estimation device and its emotion recognition method, and a robot apparatus equipped with the emotion estimation device and its learning method.

To achieve the above object, an emotion recognition device according to the present invention comprises: feature quantity extraction means for extracting, from time-series sensor information, time-series feature quantities relating to a recognition target whose emotion is to be recognized; and emotion estimation means for estimating the emotion of the recognition target based on the time-series feature quantities, in consideration of the context of the sensor information.

In the present invention, the emotion of the recognition target is estimated based on time-series feature quantities, that is, emotion estimation takes the context of the sensor information into account. As a result, a momentary emotional change in the recognition target, or an erroneous estimate based on feature quantities extracted from instantaneous sensor information, is not reflected directly in the result, and the emotion of the recognition target can be recognized as a smoothed estimate.

The feature quantity extraction means may extract, for each of a plurality of modalities, the time-series feature quantities as a modality-specific feature quantity sequence. The emotion estimation means may comprise a plurality of modality-specific emotion recognition means, each recognizing the emotion for one modality based on its feature quantity sequence, and recognition result integration means for estimating the emotion based on the recognition results of the modality-specific emotion recognition means. This makes it possible to extract time-series feature quantities separately for modalities such as the facial expression, gestures, and speech of the recognition target and to recognize the emotion per modality. The recognition result integration means can either use the emotion estimated for a single modality as the recognition result or integrate the emotions estimated for multiple modalities.

Furthermore, the emotion estimation means may include prediction error calculation means for calculating the prediction error of the recognition result obtained by each modality-specific emotion recognition means, and the recognition result integration means may estimate the emotion based on the recognition results obtained by the modality-specific emotion recognition means and their prediction errors. When the emotion recognition rate differs from modality to modality, reflecting this difference in the prediction error further improves the emotion estimation rate (emotion recognition rate) of the emotion estimation means.

The recognition result integration means may also convert the prediction error obtained by the prediction error calculation means into a reliability and estimate the emotion based on weighted recognition results, in which the recognition result of each modality-specific emotion recognition means is weighted by its reliability. By deriving the reliability from the prediction error, for example by averaging it to some extent, and computing weighted recognition results, the emotion estimation means can smooth the estimation result further.

The emotion estimation means may be trained in advance by recurrent learning so that, when time-series feature quantities relating to the recognition target extracted from time-series sensor information are given as input data, the emotion of the recognition target at the time the sensor information was acquired is produced as output data. The emotion estimation means can be trained, for example, as a recurrent network.

A robot apparatus according to the present invention acts autonomously based on an internal state and/or external stimuli, and comprises: one or more sensors that acquire information about an interaction target; feature quantity extraction means that receives time-series sensor information from the sensors and extracts time-series feature quantities relating to the interaction target; and emotion estimation means that estimates the emotion of the interaction target based on the time-series feature quantities, in consideration of the context of the sensor information.

In the present invention, the robot apparatus has emotion estimation means that estimates the emotion of a user or other interaction target based on time-series data of the user's feature quantities. The user's emotion can therefore be estimated more reliably, without depending on misrecognitions or transient recognition results.

The robot apparatus may further comprise action execution means for selecting and executing one action from a plurality of actions, emotion prediction means for predicting the emotion of the interaction target according to the type of action executed by the action execution means, and emotion integration means for estimating the emotion of the interaction target based on the estimation result obtained by the emotion estimation means and the prediction result obtained by the emotion prediction means. This allows the emotion to be estimated more stably and accurately.

Furthermore, the emotion prediction means may refer to a predicted emotion change database, in which each action is associated with the emotion change expected to occur after that action is executed, to anticipate the emotion change following execution of an action, and may predict the emotion after the action based on this anticipated change and the estimation result obtained by the emotion estimation means before the action. In this way, the change in the interaction target's emotion across the execution of an action can be predicted.

The predicted emotion change database may be one in which the predicted emotion change for each action has been learned from the emotion changes of the interaction target before and after execution of that action. For example, by updating the database each time an action is executed, based on the estimation results of the emotion estimation means before and after the action or on an externally supplied emotion change before and after the action, a predicted emotion change database that takes the context of actions into account can be built up incrementally.
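As an illustration only, the lookup and incremental update of such a database might be sketched as follows. The class name `PredictedChangeDB`, the action name, the update rate, and the additive form (predicted emotion = pre-action estimate + expected change) are assumptions made for this sketch and are not taken from the specification.

```python
from collections import defaultdict

EMOTIONS = ("joy", "sadness", "anger", "surprise", "disgust", "fear")


class PredictedChangeDB:
    """Hypothetical table mapping an action name to an expected emotion change dEb."""

    def __init__(self, rate=0.2):
        self.rate = rate  # assumed update rate; the source gives no value
        self.table = defaultdict(lambda: [0.0] * len(EMOTIONS))

    def predict(self, action, es_before):
        """Predicted emotion Eb = pre-action estimate Es' + expected change dEb (assumed form)."""
        d_eb = self.table[action]
        return [e + d for e, d in zip(es_before, d_eb)]

    def update(self, action, es_before, es_after):
        """Move the stored change toward the change observed for this execution."""
        observed = [a - b for a, b in zip(es_after, es_before)]
        old = self.table[action]
        self.table[action] = [o + self.rate * (x - o) for o, x in zip(old, observed)]


db = PredictedChangeDB(rate=0.5)
before = [0.2, 0.1, 0.0, 0.0, 0.0, 0.0]   # estimate before the action
after = [0.6, 0.1, 0.0, 0.0, 0.0, 0.0]    # estimate after the action
db.update("dance", before, after)          # learn from one execution
predicted = db.predict("dance", before)    # joy component rises by half the observed change
```

A running average is only one way to update the table incrementally; any rule that is refreshed after each executed action would fit the description above.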

The emotion integration means may weight the estimation result and the prediction result using a parameter that changes so that, as time elapses after an action is executed, the estimation result of the emotion estimation means is given more weight than the prediction result of the emotion prediction means, and may estimate the emotion based on the weighted results. The prediction result of the emotion prediction means thus applies a top-down correction to the estimation result of the emotion estimation means, while the estimation result, which is always available regardless of whether an action is executed, is given more weight the longer the time elapsed since the action.

A learning method for a robot apparatus according to the present invention is a learning method for a robot apparatus equipped with an emotion recognition device that recognizes the emotion of an interaction target from given input data. The method comprises a learning step of training the emotion recognition device using, as input data, time-series feature quantities relating to the interaction target extracted from time-series sensor information and, as the target value of the output, the emotion of the interaction target at the time the sensor information was acquired.

In the present invention, the emotion recognition device of the robot apparatus is constructed from, for example, a recurrent network and trained to recognize emotion from time-series feature quantities given as input data. This yields a robot apparatus equipped with an emotion recognition device that can estimate the emotion of an interaction target in consideration of the context of the sensor information.

According to the emotion recognition device and method of the present invention, time-series feature quantities relating to the interaction target are extracted from time-series sensor information, and the emotion of the interaction target is estimated from them in consideration of the context of the sensor information. This prevents a momentary change in the interaction target's facial expression or tone of speech, captured in instantaneous sensor information, from being reflected in the recognition result, so that the emotion estimate is smoothed and recognized as something more natural.

Furthermore, according to the robot apparatus and its emotion recognition method of the present invention, the robot apparatus is equipped with an emotion recognition device that extracts time-series feature quantities relating to the interaction target from time-series sensor information and estimates the emotion of the interaction target from them in consideration of the context of the sensor information. The robot apparatus can therefore recognize the emotion of a user or other interaction target more correctly, without being swayed by a momentary change in the user's facial expression or tone of speech, and can interact with the user more skillfully. If, in addition to the time-series sensor information, the result of predicting the user's emotion change from the robot's own action history is taken into account, the recognition result based on sensor information can be corrected to recognize the user's emotion even more accurately, enabling the robot to behave in a more lifelike manner.

According to the robot apparatus and its learning method of the present invention, learning is performed by recurrent learning, for example with a recurrent network, using the time-series feature quantities of the interaction target extracted from time-series sensor information as input and the actual emotion of the interaction target at that time as the output target value. The robot apparatus can thereby train an emotion recognition device capable of recognizing the emotion of the interaction target in consideration of the context of the sensor information.

Specific embodiments to which the present invention is applied are described in detail below with reference to the drawings. In this embodiment, the present invention is applied to a robot apparatus equipped with an emotion recognition system that recognizes the emotion of a recognition target in consideration of context. The robot apparatus in this embodiment stably and correctly estimates the emotion of the interaction target whose emotion is to be recognized (hereinafter, the user) from time-series data of the external environment (external stimuli) and from its action history. The emotion recognition system is described first, followed by a configuration example of a robot apparatus suitable for mounting this emotion recognition device.

A: Emotion recognition system
As described above, a conventional emotion recognizer recognizes the emotion of an interaction target such as a user from feature quantities obtained from the target's instantaneous voice or image. Because it ignores context information, its recognition results are unstable. Context here means, specifically, what emotional state the interaction target was in just a moment ago, what action the robot apparatus itself has just performed toward the interaction target, and so on.

In order to perform emotion recognition in consideration of context, the robot apparatus in this embodiment uses a recurrent neural network or the like, as one algorithm that predicts the next state with reference to the state so far. It thereby estimates the emotion in consideration of the temporal change (context) of the input information, which in this embodiment is the feature quantities extracted from the sensor information.

In addition, a model is prepared that predicts the correspondence between the action the robot apparatus itself selects and the resulting change in the interaction target's emotion. The model predicts what emotion change a given action by the robot apparatus will cause in the interaction target, and this prediction is used to apply a top-down correction to the emotion estimate obtained from the sensor information, smoothing the estimate further.

That is, the emotion recognition system in this embodiment comprises: an emotion estimation device that estimates the user's emotion from time-series sensor information; an emotion prediction device that predicts the change in the user's emotion according to the type of action executed; and an emotion integration unit that multiplies the estimation result of the emotion estimation device (hereinafter, estimated emotion Es) and the prediction result of the emotion prediction device (hereinafter, predicted emotion Eb) by weights that account for the time decay of the action history, sums them, and outputs the sum as the final recognition result (hereinafter, recognized emotion E).
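The weighted sum performed by the emotion integration unit can be illustrated with a minimal sketch. The passage above does not give the form of the weights, so the convex combination E = η·Eb + (1 − η)·Es, the exponential decay of η, and the constants `eta0` and `tau` are all assumptions chosen for illustration.

```python
import math


def eta(elapsed, eta0=0.8, tau=5.0):
    """Assumed decay: the weight of the prediction shrinks with time since the action."""
    return eta0 * math.exp(-elapsed / tau)


def integrate(es, eb, elapsed):
    """E = eta * Eb + (1 - eta) * Es (assumed convex combination)."""
    w = eta(elapsed)
    return [w * b + (1.0 - w) * s for s, b in zip(es, eb)]


es = [0.1, 0.7]   # estimated emotion Es (two components shown for brevity)
eb = [0.9, 0.1]   # predicted emotion Eb
soon = integrate(es, eb, elapsed=0.0)    # right after the action: dominated by Eb
later = integrate(es, eb, elapsed=60.0)  # long after the action: close to Es
```

With these assumed values, the output immediately after an action leans toward the prediction, and as the elapsed time grows it converges to the sensor-based estimate, matching the qualitative behavior described above.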

(1) Emotion estimation device
First, the emotion estimation device, which outputs the estimated emotion Es from time-series sensor data in consideration of the context of the sensor information, is described. FIG. 1 is a block diagram showing the emotion estimation device in this embodiment. As shown in FIG. 1, the emotion estimation device 1 has: sensor units 2_m that detect external conditions; feature quantity extraction units 3_n that, for each modality such as gesture, speech, or facial expression, extract feature quantities from the sensor information of the sensor units 2_m as preprocessing; modality-specific emotion estimation units 4_n that, likewise for each modality, calculate an estimated emotion Es_n and a prediction error x_n^, described later (x_n^ denotes x_n with a hat; the same applies below); softmax calculators 7_n that calculate from the prediction error x_n^ a reliability λ_n for the estimated emotion Es_n; multipliers 8_n that multiply the estimated emotion Es_n by the reliability λ_n; and an adder 9 that sums the outputs of the multipliers 8_n and outputs the estimated emotion Es.

The sensor units 2_m consist of, for example, a camera 2_1 serving as imaging means for capturing images of the surroundings and a microphone 2_2 serving as sound input means for capturing surrounding sound, and output time-series sensor information (hereinafter also called a sensor information sequence).

The feature quantity extraction units 3_n perform preprocessing for each modality and include, for example, a facial expression analysis unit 3_1, a gesture analysis unit 3_2, and a speech analysis unit 3_3. From the time-series sensor information they extract, for each modality, time-series feature vectors of the recognition target and output them as modality-specific feature quantity sequences.

An emotion estimation unit 4_n is provided for each modality. It receives the feature vector sequence obtained by preprocessing in the feature quantity extraction unit 3_n for that modality, and estimates and outputs the estimated emotion Es_n from this sequence. Each emotion estimation unit 4_n has an emotion recognizer 5_n, serving as modality-specific emotion recognition means, which estimates the emotion with an emotion forward model (a model that predicts the current emotion in consideration of past emotions) and outputs it as the modality-specific estimated emotion Es_n, and an error predictor 6_n, which calculates the prediction error x_n^ of the estimated emotion Es_n output by the emotion recognizer 5_n. The error predictor 6_n may likewise calculate the prediction error x_n^ of the estimated emotion Es_n with a forward model that computes the current prediction error x_n^ in consideration of past prediction errors x_n^.

The forward model only needs to realize an input-output mapping, which in this embodiment is the mapping that outputs the estimated emotion Es_n when time-series feature vectors are input; it may be expressed in any form, such as a mathematical formula, a parameter matrix, or a correspondence table. In this embodiment, the direction in which emotion is predicted from sensor input, that is, with the feature vector sequence as input and the estimated emotion Es_n as output, is called the forward direction (emotion forward model). A model whose input and output data are reversed relative to the forward model is called an inverse model.

The softmax calculators 7_n convert the prediction error x_n^ into the reliability λ_n of the estimated emotion Es_n calculated by each emotion recognizer 5_n, using a softmax function described later, and the multipliers 8_n multiply the estimated emotion Es_n by the reliability λ_n. The adder 9 sums the weighted estimated emotions λ_n Es_n, weighted by the reliability λ_n in the multipliers 8_n, and outputs the sum as the estimated emotion Es. The softmax calculators 7_n, the multipliers 8_n, and the adder 9 together constitute recognition result integration means for integrating the recognition results of the modality-specific emotion recognizers 5_n.
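The reliability weighting and summation can be illustrated with a short sketch. The exact softmax function is given elsewhere in the specification and is not reproduced here; the form used below, a softmax over negated errors with a scale `sigma`, is an assumption chosen so that a smaller predicted error yields a larger reliability. The function names and sample values are likewise hypothetical.

```python
import math


def reliabilities(errors, sigma=1.0):
    """Softmax over negated errors: smaller predicted error -> larger reliability (assumed form)."""
    scores = [-x / (sigma ** 2) for x in errors]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(modal_estimates, errors):
    """Es = sum over n of lambda_n * Es_n, computed component by component."""
    lam = reliabilities(errors)
    dim = len(modal_estimates[0])
    return [sum(l * es[k] for l, es in zip(lam, modal_estimates)) for k in range(dim)]


face = [0.8, 0.1, 0.1]    # Es_1 from the facial-expression recognizer (illustrative)
voice = [0.2, 0.6, 0.2]   # Es_2 from the speech recognizer (illustrative)
lam = reliabilities([0.1, 2.0])         # face has the smaller predicted error
es = fuse([face, voice], [0.1, 2.0])    # fused estimate leans toward the face result
```

Because the reliabilities sum to one, the fused vector stays on the same scale as the per-modality estimates, and a modality whose error predictor reports a large error contributes little to the result.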

In this embodiment, the emotion recognizer (emotion forward model) and the error predictor are trained in advance using, for example, a recurrent neural network. The emotion recognizer 5_n is trained with learning data consisting of time-series feature vectors s_n^* and the corresponding time-series emotion classification information Es_n^*, which is teacher data in which the emotions have been classified in advance, so that it outputs time-series emotion information from time-series feature vectors given as input. That is, for each modality, training is performed in advance with the time-series feature vectors s_n^* of the learning data as input data and the time-series emotion information Es_n^*, the teacher data, as the target value of the output data.
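To show why a recurrent model takes context into account, the following sketch runs the forward pass of a small Elman-style recurrent network. It is not the network of the embodiment: the architecture, the two-dimensional features, and the fixed weights are arbitrary assumptions, and training is omitted entirely.

```python
import math


def rnn_step(x, h, w_in, w_rec):
    """One Elman-style step: new hidden state from the current input and the previous state."""
    return [
        math.tanh(sum(wi * xi for wi, xi in zip(w_in[j], x))
                  + sum(wr * hi for wr, hi in zip(w_rec[j], h)))
        for j in range(len(h))
    ]


def run(sequence, w_in, w_rec, w_out):
    """Feed a feature-vector sequence and return the output after the last step."""
    h = [0.0] * len(w_rec)
    for x in sequence:
        h = rnn_step(x, h, w_in, w_rec)
    return [sum(wo * hi for wo, hi in zip(row, h)) for row in w_out]


# Arbitrary fixed weights; a trained recognizer would learn these from the learning data.
W_IN = [[0.5, -0.3], [0.2, 0.4]]
W_REC = [[0.6, 0.1], [-0.2, 0.7]]
W_OUT = [[1.0, 0.0], [0.0, 1.0]]

# The same final feature vector, preceded by different histories.
calm_then_frown = run([[0.1, 0.1], [0.1, 0.1], [0.9, 0.2]], W_IN, W_REC, W_OUT)
frown_only = run([[0.9, 0.2]], W_IN, W_REC, W_OUT)
```

The two calls end with the same input but return different outputs, because the hidden state carries information from earlier time steps. This dependence on history is the property the text refers to as considering context.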

The error predictor 6_n is trained for each emotion recognizer 5_n using the same kind of learning data, consisting of time-series feature vectors s_n^* and the corresponding time-series emotion classification information Es_n^* (teacher data in which the emotions have been classified in advance), so that it outputs the prediction error from the output of the emotion recognizer 5_n (the estimated emotion Es_n) and/or the feature vectors s_n^* given as input. These learning methods are described first, followed by the emotion estimation method (reproduction method) of the emotion estimation device using emotion estimation units 4_n that have the trained emotion recognizers 5_n and error predictors 6_n.

(2) Learning method of the modality-specific emotion recognizers
In this embodiment, an emotion estimation unit 4_n, and hence an emotion recognizer 5_n, is prepared for each modality. A single emotion recognizer 5_n can recognize a plurality of emotions, for example joy, sadness, anger, surprise, disgust, and fear. The feature quantities corresponding to each emotion in a given modality are learned in advance; at recognition time, preprocessing extracts feature quantities from the sensor input for each modality, and the user's emotion is estimated from the extracted feature quantities and output. The emotion recognizers in this embodiment recognize these six basic emotions, so the estimated emotion Es_n output by each modality-specific emotion recognizer 5_n and the estimated emotion Es output by the emotion estimation device 1 are vectors whose elements are the six emotions. The kinds of emotion to be recognized are not limited to these.

Modalities handled by the emotion recognizers 5_n include, for example, facial expression, gesture, and speech. Training proceeds by first preparing learning data for the emotion recognizers 5_n and then training an emotion recognizer 5_n for each modality.

The learning data needed to train an emotion recognizer 5_n consist of a time-series feature vector, which serves as the input data, and the emotion of the interaction target (in this embodiment, the user) at the time the input data were acquired, which serves as the target value of the output data. The user's emotion is assumed to be suitably quantified.

To obtain the input data, time-series sensor information such as images is supplied to the feature extraction units 3_n comprising the facial expression analysis unit 3_1 and the gesture analysis unit 3_2, and time-series sensor information such as sound is supplied to the feature extraction unit 3_n comprising the speech analysis unit 3_3. From this time-series sensor information, time-series feature vectors of the user are collected for each modality.

For example, the facial expression analysis unit 3_1 may extract as feature quantities the result of filtering that extracts the frequency and orientation components of the whole image, or vector data that numerically represent features of elements visible on the face, such as the density and direction of wrinkles on the forehead, between the eyebrows, and on the cheeks, how wide the eyes are open, and the shape of the lips. It outputs time-series data of feature vectors (a feature vector sequence).

The gesture analysis unit 3_2 extracts feature quantities such as the displacement and speed of the hand position and the frequency at which the hand trajectory reverses direction, and outputs a feature vector sequence.

The speech analysis unit 3_3 uses data such as the average sound pressure (power), the fundamental frequency (the frequency at which a similar waveform pattern repeats), and the spectrum of the recognized utterance as feature quantities, and outputs the extracted feature vector sequence.

Learning data for the emotion recognizer 5_n of each modality can be collected experimentally, for example as follows. A person acts out a predetermined scenario in front of the feature extraction units 3_n; the feature extraction units 3_n extract a time-series feature vector for each modality from the observed data (sensor information sequence), and this is recorded together with information indicating the person's emotion at that time. When the emotion recognition system is mounted on the robot apparatus described later, the learning data can be collected in the same way by having a person act in front of a robot apparatus equipped with at least the feature extraction units 3_n. The information indicating the person's emotion when the feature vector was extracted is a vector in which the six basic emotions described above are quantified (emotion classification information); it serves as teacher data for the estimated emotion that the emotion recognizer 5_n outputs at recognition time. Below, this teacher data is written as the estimated emotion (teacher) Es_n^*, and the feature vector used for training as the feature vector (teacher) s_n^*.

Each emotion recognizer 5_n is then trained on learning data consisting of the time-series feature vector (teacher) s_n^* collected by the feature extraction unit 3_n, used as input data, and the user's estimated emotion (teacher) Es_n^* at the time s_n^* was acquired, used as the target value of the output.

The emotion estimation units 4_n are trained separately for each modality. A feature extraction unit 3_n and an emotion estimation unit 4_n are provided for each modality, and an emotion recognizer 5_n and an error predictor 6_n are provided in each emotion estimation unit 4_n. In the following description, where no distinction is needed, they are referred to simply as the feature extraction unit 3, the emotion estimation unit 4, the emotion recognizer 5, and the error predictor 6.

FIG. 2 schematically shows the emotion recognizer 5 during training. As shown in FIG. 2, during training a database 10 storing the learning data obtained by preprocessing for one modality, here "facial expression", is connected to the facial-expression emotion recognizer 5, which recognizes emotion from facial expression. Using the learning data supplied from the database 10, that is, the time-series feature vector (teacher) s_n^* and the corresponding time-series estimated emotion (teacher) Es_n^*, the emotion recognizer 5 learns an emotion model for estimating emotion from a given time-series feature vector. In other words, the emotion recognizer (emotion forward model) 5 of each modality is trained to output the estimated emotion (teacher) Es_n^* when the feature vector (teacher) s_n^* of the learning data is input.

In this embodiment, the emotion recognizer (emotion forward model) 5, which estimates emotion from time-series data, is implemented as a recurrent neural network. The algorithm used to learn and recognize emotion is not limited to a recurrent neural network; any algorithm that can reflect time-series data in its output may be used. Methods for clustering information that spans a time window include, for example, DP (dynamic programming) matching and hidden Markov models (HMMs). "Context" in this embodiment means not only the state vector at the current time but a sequence of state vectors extending back into the past over some time window; it suffices that the output (estimated emotion) can be obtained with such data as input.

Of the time-series learning data extracted by preprocessing for each modality, the feature vector (teacher) s_n^* is given to the emotion recognizer 5 of FIG. 2 as the input data of the recurrent neural network, and the estimated emotion (teacher) Es_n^* at the time the input data were collected is given as the output data; the emotion recognizer 5 is thereby trained as an emotion forward model. As shown in FIG. 3, the recurrent neural network has an input layer 11a that receives the feature vector of the corresponding modality, an intermediate layer 12 consisting of one or more layers, output layers 13a and 13b that output the estimated emotion for the corresponding modality, and an input layer 11b that receives the estimated emotion from the output layer 13b. That is, the feature vector at a given time step and the estimated emotion output from the output layer 13b at the previous time step are input to the input layers 11a and 11b; the result is output from the output layers 13a and 13b via the intermediate layer 12, and part of it is returned to the input layer 11b at the next time step.

Thus the recurrent neural network has a group of units (input layer 11b and output layer 13b) that feed back from the output layer to the input layer (or intermediate layer), and these units internally represent context information derived from the time-series input. The feedback from the output layer 13b to the input layer 11b is called a context loop. Using this recurrent neural network makes it possible to estimate the current emotion taking context information into account, rather than from an instantaneous input-output mapping alone. The recurrent neural network is described in detail later.
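The context loop can be illustrated with a minimal sketch. It assumes a discrete-time network with a single tanh intermediate layer and untrained random weights, and for simplicity lets one output vector play the roles of both output layers 13a and 13b; the function names, layer sizes, and activation are hypothetical and are not taken from the embodiment.

```python
import math
import random

def _layer(n_out, n_in, rng):
    """Return an n_out x n_in weight matrix with small random entries."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]

def _tanh_layer(weights, vec):
    """Compute tanh(W @ vec) using plain lists."""
    return [math.tanh(sum(w * v for w, v in zip(row, vec))) for row in weights]

def run_recognizer(features, n_hidden=8, n_emotions=6, seed=0):
    """Run an untrained context-loop network over a feature vector sequence.

    features: list of per-time-step feature vectors of equal length.
    Returns one estimated-emotion vector per time step.
    """
    rng = random.Random(seed)
    n_feat = len(features[0])
    # Input layers 11a (features) and 11b (context) feed the intermediate layer 12.
    w_hidden = _layer(n_hidden, n_feat + n_emotions, rng)
    w_out = _layer(n_emotions, n_hidden, rng)
    context = [0.0] * n_emotions          # assumed initial context: all zeros
    outputs = []
    for s_t in features:
        hidden = _tanh_layer(w_hidden, list(s_t) + context)
        estimate = _tanh_layer(w_out, hidden)
        outputs.append(estimate)
        context = estimate                # context loop: fed back at the next step
    return outputs
```

Because the previous estimate re-enters the network, the same feature vector presented at two different time steps generally yields different outputs, which is the sense in which the estimate depends on context.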

(3) Learning method of the error predictors
Experiments have shown that the emotions a person expresses through each modality do not contain the expression of every emotion to the same degree. For example, Patent Document 2 above exploits the fact that, when emotion is recognized from voice or images, "sadness" and "fear" are recognized well from voice alone, whereas "anger", "happiness", and "surprise" are recognized well from images alone. Emotion recognition units are provided for voice data and for image data; when the voice-based unit recognizes "sadness" or "fear" its weight is increased, and when the image-based unit recognizes "anger", "happiness", or "surprise" its weight is increased, and the recognition results of the two units are then integrated.

Because the sensor information best suited to recognition differs from emotion to emotion, how far the recognition result of each modality can be trusted differs for each emotion. By acquiring this difference in advance through learning as the reliability of each recognizer, and weighting the recognition result of each modality by it at recognition time to obtain the final recognition result, the reliability of the recognition result can be increased.
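As a rough illustration of such weighting, the sketch below combines per-modality emotion vectors by a weighted sum. The weighted sum, the function name `fuse`, and the example values are assumptions made for illustration; the text above only states that the results are weighted by reliability.

```python
def fuse(estimates, weights):
    """Weighted sum of per-modality emotion vectors.

    estimates: one emotion vector per modality (all the same length).
    weights:   one reliability weight per modality.
    """
    if len(estimates) != len(weights):
        raise ValueError("need exactly one weight per modality")
    n_emotions = len(estimates[0])
    return [sum(w * est[k] for w, est in zip(weights, estimates))
            for k in range(n_emotions)]

# Hypothetical two-emotion example: the first modality is trusted more.
fused = fuse([[1.0, 0.0], [0.0, 1.0]], [0.75, 0.25])
```

With weights that sum to 1, the fused vector stays in the same range as the individual estimates.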

As explained with reference to FIG. 1, the emotion estimation device of this embodiment estimates emotion from multimodal sensor information, using the emotion recognizer 5 of each modality as an element recognizer (emotion forward model). The emotion recognition result output by each emotion recognizer 5 is weighted according to the reliability of the recognition result in the current context to determine the final output. For this purpose, the module for each modality (emotion estimation unit 4) has an error predictor 6 paired with the emotion recognizer 5 for calculating the reliability. The context here may be, for example, the sensor input state (the time-series sensor information sequence) or the emotion recognition state (the time-series estimated emotion). The input data of the error predictor 6 are set accordingly: when the context of the sensor information is considered, the time-series feature vector extracted from the time-series sensor information is used as input data; when the context of the emotion recognition state is considered, the output of the emotion recognizer 5 is used as input data.

As described above, the error predictor 6 outputs how far the output of the element recognizer (emotion recognizer) trained individually for each modality should be trusted. Its training can likewise use a model that computes this from the history of the sensor input information or of the emotion recognition state: a model that takes the sensor information (or the feature quantities extracted from it) and/or the emotion recognition state as input and predicts the error of the paired element recognizer is implemented as a recurrent neural network. Described here is a model whose inputs are the time-series feature vector extracted from the time-series sensor information and the modality-specific estimated emotion recognized by the emotion recognizer.

That is, as shown in FIG. 4, the error predictor 6 reuses the learning data collected for each modality. Its input data are the feature vector (teacher) s_n^* and the estimated emotion Es_n that the emotion recognizer 5 outputs when s_n^* is input. The error information, namely the difference between the output Es_n of the emotion recognizer 5 and the estimated emotion (teacher) Es_n^* corresponding to the feature vector s_n^*, is taken as the ideal output x_n^*, and the error predictor 6 of each modality is trained with x_n^* as the target value of its output. The feature vector (teacher) s_n^* is a vector whose elements are a plurality of feature quantities, and the estimated emotion (teacher) Es_n^*, the estimated emotion Es_n, and the ideal output x_n^* are vectors whose elements are the six emotions described above. Learning data different from those used to train the emotion recognizer 5 may also be prepared for this training, which yields an error predictor 6 with better generalization.
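The construction of the training target x_n^* can be sketched as follows. The text specifies only a "difference" between the recognizer output and the teacher data; taking the element-wise absolute difference is an assumption, and the function name `error_targets` is hypothetical.

```python
def error_targets(estimated, teacher):
    """Build the ideal output x_n^* for the error predictor.

    estimated: sequence of recognizer outputs Es_n, one vector per time step.
    teacher:   sequence of teacher vectors Es_n^*, same shape.
    Returns the element-wise absolute difference per time step (assumed form).
    """
    if len(estimated) != len(teacher):
        raise ValueError("sequences must have the same length")
    return [[abs(e - t) for e, t in zip(est_t, tch_t)]
            for est_t, tch_t in zip(estimated, teacher)]
```

For instance, a recognizer output of 0.5 against a teacher value of 1.0 for one emotion gives a target error of 0.5 for that emotion at that time step.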

FIG. 5 shows how the error predictor is trained using a recurrent neural network. When the error predictor is trained, the time-series feature vector s_n of the corresponding modality and the estimated emotion Es_n, which is the emotion prediction data for that modality, are input to the input layers 21a and 21b, and the emotion prediction error x̂_n is output from the output layer 23a. The prediction error x̂_n output from the output layer 23b is fed back through the context loop and supplied to the input layer 21c. That is, the prediction error x̂_n at time t+1 is computed from the feature vector s_n, the estimated emotion Es_n, and the prediction error x̂_n output at time t.

(4) Recurrent neural network
An example of a recurrent neural network is described next. The network is not limited to the recurrent neural network of Japanese Patent Laid-Open No. 8-6916 described below; as noted above, any network that takes time-series data as input and can learn and recognize with the context taken into account may be used.

A neural network is a simplified model of the neural circuitry of the human brain: a network in which neurons are connected through synapses that pass signals in one direction only. Signals are transmitted between neurons through these synapses, and a wide variety of information processing becomes possible by suitably adjusting the synaptic resistance, that is, the weights. Each neuron takes the outputs of the neurons connected to it, weighted by the synapses, applies a nonlinear response function to their sum, and outputs the result to other neurons.

One neural network structure is the multilayer network shown in FIG. 6. This type of network has a layered structure in which only connections between layers are allowed; there are no connections within a layer and no recurrent connections. Multilayer networks are considered well suited to recognizing spatially extended patterns and to information compression.

By contrast, a network that imposes no such structural restriction and allows arbitrary connections between units, as shown in FIG. 7, is called a recurrent network. Strictly speaking, a recurrent network has no notion of layers, but the notion is adopted here for convenience, for correspondence with the multilayer network. Below, the group of units that receive the input data is called the input layer, the group of units that produce the network output is called the output layer, and the remaining units are called the intermediate layer.

In a recurrent network there are connections through which the past output of each unit is returned to other units in the network or to the unit itself. The network therefore has internal dynamics in which the state of each neuron changes over time. Because time is built into the network in this way, recurrent networks are considered well suited to recognizing and predicting time-series patterns. A multilayer network can be regarded as a special case of a recurrent network.

The state equations obeyed by each neuron of the recurrent neural network are given by equations (1) and (2).

Here x_i(t) is the internal state of unit i at time t, and the output value y_i is determined by a nonlinear transformation of the internal state. τ_i and X_i are the time constant and the external input of unit i, respectively, w_ij is the weight of the connection from unit j to unit i, and N is the total number of units.
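Equations (1) and (2) are not reproduced in this text. A form consistent with the symbols defined above, assuming the standard continuous-time recurrent network, would be the following; the exact form in the original, and the choice of the nonlinear function f, are assumptions.

```latex
\tau_i \frac{dx_i(t)}{dt} = -x_i(t) + \sum_{j=1}^{N} w_{ij}\, y_j(t) + X_i(t) \tag{1}

y_i(t) = f\bigl(x_i(t)\bigr) \tag{2}
```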

The state of the network at each time from t_0 to t_n can be obtained as follows. First, the internal state and output value of the network at time t_0 are set to suitable initial values. Equations (1) and (2) are then solved forward in time from t_0 to t_n.
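The forward solution can be sketched numerically. The sketch below assumes that the state equation has the standard continuous-time form tau_i dx_i/dt = -x_i + sum_j w_ij y_j + X_i with y_i = tanh(x_i), and integrates it with a simple Euler step; the equation form, the tanh nonlinearity, the step size `dt`, and the function name `simulate` are all assumptions for illustration.

```python
import math

def simulate(weights, tau, inputs, x0, dt=0.1):
    """Integrate the assumed state equation forward in time with Euler steps.

    weights[i][j]: weight of the connection from unit j to unit i.
    tau[i]:        time constant of unit i.
    inputs[t][i]:  external input X_i at time step t.
    x0[i]:         initial internal state of unit i.
    Returns (states, outputs), each holding one vector per time step.
    """
    x = list(x0)
    states, outputs = [], []
    for X_t in inputs:
        y = [math.tanh(v) for v in x]       # output from internal state
        states.append(list(x))
        outputs.append(y)
        # Euler update of the internal state of every unit.
        x = [x[i] + dt / tau[i] *
             (-x[i] + sum(w * yj for w, yj in zip(weights[i], y)) + X_t[i])
             for i in range(len(x))]
    return states, outputs
```

The stored states and outputs are what the backward pass of the learning procedure later reuses.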

When a recurrent neural network is used for recognizing or predicting time-series patterns, as in this embodiment, the network must be trained beforehand so that it produces the correct output. As described above, learning data are prepared that pair input data, which are time-series data such as a feature vector sequence, with teacher data (output data), which are time-series data such as the estimated emotion, and the weights, time constants, and initial state of the network are learned from them. This is normally done with an optimization method called backpropagation.

The characteristic of this method is that the weights are corrected by steepest descent so as to reduce the error over the finite time interval from t_0 to t_n given by equation (3).

Here Y_k(t) is the teacher data presented to unit k at time t, k ranges over the units belonging to the output layer, and N_v is the number of units belonging to the output layer.
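Equation (3) is not reproduced in this text. Assuming the usual squared-error criterion accumulated over the interval, a form consistent with the symbols above would be the following; the factor 1/2 and the use of an integral rather than a sum over sample points are assumptions.

```latex
E = \frac{1}{2} \int_{t_0}^{t_n} \sum_{k=1}^{N_v} \bigl( y_k(t) - Y_k(t) \bigr)^2 \, dt \tag{3}
```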

The steepest descent directions used to compute the corrections to the weights, time constants, and initial state are given by equations (4), (5), and (6).

Here P_i(t) is the back-propagated error of unit i, calculated according to equation (7).

Here δ_ik is the Kronecker delta. In equation (7), the back-propagated error of a neuron belonging to the output layer consists of the weighted sum of the back-propagated errors from other neurons plus the error of the output value at each time. The terms df(x_i)/dx_i and (y_i(t) - Y_i(t)) cannot be computed until x_i(t) and y_i(t) are known. Therefore x_i(t) and y_i(t) are first computed forward in time; then, assuming that the back-propagated error is 0 at time t_n, the boundary condition given by equation (8) is set and the back-propagated error is computed backward in time from t_n to t_0.

The steepest descent direction is computed as follows. First, the initial state of each unit at time t_0 is set to suitable values, the internal state and output value at each time are computed according to equations (1) and (2), and the values are stored. Next, the boundary condition for the back-propagated error at time t_n given by equation (8) is set. Then, using the stored internal states and output values, the back-propagated error at each time is computed backward in time according to equation (7) and stored. Finally, the steepest descent directions given by equations (4) to (6) are computed from the output values and back-propagated errors obtained at each time.

The backpropagation learning procedure is described in detail with reference to FIG. 8.
The learning data in FIG. 8 are assumed to be, for example, learning data for a network with two input-layer neurons and one output-layer neuron, as in FIG. 7. As shown in FIG. 9, the learning data consist of input data and teacher data, containing (number of input-layer neurons × number of time-series sample points) and (number of output-layer neurons × number of time-series sample points) values, respectively.

First, the weights, time constants, and initial state are initialized, for example with suitable random numbers (step S1). Steps S2 to S7 below are then repeated until the back-propagated error converges.

After the initial state of the network, that is, its internal state and output value, has been set to suitable values, the internal state and output value of each unit are computed forward in time from the input data according to equations (1) and (2) and stored (step S2).

The boundary condition of equation (8) is then set, and the back-propagated error P_i(t) of each unit is computed backward in time from the teacher data according to equation (7) and stored (step S3).

Next, the steepest descent directions are computed according to equations (4) to (6) from the internal states and output values obtained in step S2 and the back-propagated errors obtained in step S3 (step S4).

The current weight correction amounts are then computed according to equations (9) to (11) from the steepest descent directions obtained in step S4 and the previous weight correction amounts (step S5).

Here γ is the learning coefficient, α is the momentum coefficient, and n is the number of learning iterations. The second term on the right-hand side is called the momentum term and is added empirically to accelerate learning.

Next, the weights of each unit are corrected according to equations (12) to (14) (step S6).

In step S7 it is determined whether the error has converged to a given value or below; if the error is larger than that value, steps S2 to S6 are repeated. The emotion recognizer 5 and the error predictor 6 can be trained in this way, and the emotion estimation device 1 is built from emotion estimation units 4 each consisting of a trained emotion recognizer 5 and error predictor 6.
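The loop of steps S2 to S7 can be sketched in outline. The sketch assumes that the correction of step S5 has the usual momentum form (new correction = -γ × gradient + α × previous correction) and that step S6 adds the correction to the parameter; the gradient computation of steps S2 to S4 is abstracted into a caller-supplied function. The names `train`, `grad_fn`, and `error_fn` and all default values are hypothetical.

```python
def train(grad_fn, error_fn, params, gamma=0.1, alpha=0.5,
          tol=1e-6, max_iter=10000):
    """Gradient descent with a momentum term, following steps S2 to S7.

    grad_fn(params)  -> list of dE/dparam (stands in for steps S2 to S4).
    error_fn(params) -> scalar error E (used for the test of step S7).
    Returns the final parameters and the number of iterations performed.
    """
    delta = [0.0] * len(params)           # previous correction amounts
    for n in range(max_iter):
        grad = grad_fn(params)
        # Step S5: correction from the gradient and the previous correction.
        delta = [-gamma * g + alpha * d for g, d in zip(grad, delta)]
        # Step S6: apply the correction.
        params = [p + d for p, d in zip(params, delta)]
        # Step S7: stop once the error is at or below the threshold.
        if error_fn(params) <= tol:
            return params, n + 1
    return params, max_iter

# Toy usage: minimise E(p) = (p - 3)^2, whose gradient is 2 (p - 3).
final, iterations = train(lambda p: [2.0 * (p[0] - 3.0)],
                          lambda p: (p[0] - 3.0) ** 2,
                          [0.0])
```

In the actual procedure the same update would be applied to the weights, the time constants, and the initial states, with the gradients supplied by the backward pass.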

(5) Integration of the recognition results of the modalities (reproduction method)
The emotion estimation method of the emotion estimation device 1 is described next. First, the sensors of the robot apparatus, such as the camera 2_1 and the microphone 2_2, acquire time-series sensor information. The feature extraction units 3 then extract time-series feature vectors from this sensor information as preprocessing; that is, they extract a time-series feature vector as a modality-specific feature sequence for each modality, such as facial expression, gesture, and speech.

Next, the extracted feature quantities are supplied to the corresponding emotion estimation units 4. For example, the feature vector sequence is supplied to the facial-expression emotion estimation unit 4_1 if it consists of facial-expression features, to the gesture emotion estimation unit 4_2 if it consists of gesture features, and to the speech emotion estimation unit 4_3 if it consists of speech features.

In each emotion estimation unit 4, the emotion recognizer 5 outputs an estimated emotion Es_n from the time-series feature vector s_n, and the error predictor 6 calculates and outputs a prediction error x̂_n. Note that the feature quantity extraction unit 3 may be placed inside the emotion recognizer 5 or the error predictor 6. As described above, the modality-specific estimated emotion Es_n is an emotion estimation vector calculated by the emotion recognizer 5 provided for each modality, and it contains a value for each of the emotions defined as emotion elements (for example, the six basic emotions). The estimated emotion Es that forms the final output is likewise a vector containing a value for each emotion.

The prediction error x̂_n calculated by the error predictor 6 is supplied to the softmax calculator 7 for the corresponding modality. Based on the prediction error x̂_n output from the error predictor 6, the softmax calculator 7 calculates the reliability λ_n of the emotion recognizer 5 according to the softmax function of equation (15) below. Here, x̂_n and x̂_l are the outputs of the respective error predictors, and σ is a parameter that determines how strongly the errors of the recognizers are averaged when calculating the responsibility signal (reliability).

As shown in equation (16) below, each reliability λ_n is multiplied by the estimated emotion Es_n output by the corresponding emotion recognizer 5, and the sum of these products is the estimated emotion Es, the final emotion recognition result vector estimated from the time-series sensor input information.
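The reliability weighting and summation can be sketched as below. The weighted sum follows the description of equation (16) directly. The exact exponent of the softmax in equation (15) is not reproduced in this text, so the sketch assumes the form exp(−x̂_n/σ), in which a smaller predicted error gives a larger reliability and a larger σ flattens the weights. The function names `reliabilities` and `integrate` are hypothetical.

```python
import math


def reliabilities(pred_errors, sigma=1.0):
    """Softmax over negated prediction errors (assumed form of equation (15))."""
    scores = [-x / sigma for x in pred_errors]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def integrate(modal_emotions, pred_errors, sigma=1.0):
    """Es = sum over n of lambda_n * Es_n, computed element by element."""
    lam = reliabilities(pred_errors, sigma)
    dim = len(modal_emotions[0])
    return [sum(l * es[i] for l, es in zip(lam, modal_emotions)) for i in range(dim)]


# Two modalities with equal predicted error contribute equally.
es = integrate([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0])
```

Because the reliabilities sum to 1, the integrated vector Es stays within the range spanned by the per-modality estimates.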

Thus, in the emotion estimation device 1 of this embodiment, the emotion recognizer (element recognizer) 5 for each modality does not estimate the emotion from instantaneous sensor information alone. It also takes earlier sensor information into account, extracting feature quantities from sensor information that extends over time (time-series sensor information), and can therefore recognize the emotion in light of the context of the sensor information. Specifically, the emotion recognizer 5 and the error predictor 6 are trained in advance with a recursive learning method, for example an algorithm such as a recurrent neural network: the former learns a forward model that outputs an estimated emotion from time-series feature vectors, and the latter a forward model that outputs a prediction error from time-series feature vectors and/or the estimated emotion. The result is an emotion estimation device that takes the time series of feature vectors obtained from each sensor's input as its input and estimates the emotion with the context taken into account.

In addition, the prediction error obtained from the error predictor is converted into a reliability by the softmax function and used to weight the recognition result of each modality. The estimated emotion Es can thus be output as an integrated result that emphasizes the output of the modality expected to judge each emotion more accurately.

(6) Emotion prediction device
Next, an emotion prediction device that predicts the user's emotion according to an action, taking into account the context of the actions executed by the robot apparatus, will be described. The emotion prediction device is a module that estimates the robot apparatus's own actions and the corresponding emotion change of the interaction target (user); it operates in parallel with the module (emotion estimation device) that estimates the current emotion of the interaction target from the history of sensor input information described above. According to the executed action k, the emotion prediction device predicts the change in the user's emotion after the action is executed (hereinafter, predicted emotion change dEb_k), integrates it with the estimation result Es' of the emotion estimation device from before the action was executed, and outputs a predicted emotion Eb.

FIG. 10(a) is a diagram showing the main parts of the robot apparatus that relate to the emotion prediction device. As shown in FIG. 10(a), the robot apparatus executes actions autonomously based on an internal state 31 and/or an external stimulus 32, and includes a behavior controller 30 having a schema tree in which a plurality of element actions (schemas) 33 are arranged in a tree structure. Each schema 33 is a module (behavior description module) that, when the internal state 31 and the external stimulus 32 are input, calculates from them an action value (activation level) AL indicating the execution priority of the action described in that schema; the action to be executed is selected on the basis of this action value AL. A state machine is provided for each module; depending on the preceding actions (motions) and the situation, it classifies the recognition results of the external information input from the sensors and expresses the action on the robot body. For each schema 33, predetermined internal states and external stimuli are defined according to the action described in it.

Here, the external stimulus 32 is perceptual information or the like of the robot apparatus, for example object information such as color, shape, and face information obtained by processing an image input from the camera. Specific examples include color, shape, face, general 3D objects, and hand gestures, as well as motion, voice, contact, distance, location, time, and the number of interactions with the user.

The internal state 31 is an emotion, such as an instinct or feeling, managed by an internal state management unit (not shown); examples include fatigue (FATIGUE), pain (PAIN), nourishment (NOURISHMENT), thirst (THURST), affection (AFFECTION), and curiosity (CURIOSITY). For example, the internal state "nourishment" can be determined from the remaining battery charge, and the internal state "fatigue" from the power consumption.

For example, the schema 33 whose action output is "eat" handles the type, size, and distance of the object as external stimuli 32, and "NOURISHMENT", "FATIGUE", and the like as internal states 31. In this way, the types of external stimulus 32 and/or internal state 31 handled are defined for each schema 33, and the action value AL of the action (element action) corresponding to the relevant external stimulus 32 and/or internal state 31 is calculated. Needless to say, a single internal state or external stimulus may be associated with several element actions rather than only one.

The action value AL indicates how strongly the robot apparatus wants to execute the schema 33 (its execution priority). A schema 33 selected on the basis of this action value AL outputs the action described in it. The action value AL is calculated from the internal state 31 and the external stimulus 32. Specifically, for example, a motivation vector indicating how much the robot wants to perform the action is calculated from the internal state 31, a releasing vector indicating whether the action can be performed is calculated from the internal state 31 and the external stimulus 32, and the action value AL is calculated from these two vectors. Then, for example, the schema with the highest activation level can be selected, or two or more schemas whose activation levels exceed a predetermined threshold can be selected and executed in parallel. Parallel execution, however, presupposes that the schemas do not compete for hardware resources.
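The two selection policies just described (highest activation level, or every schema above a threshold when hardware resources do not conflict) might look like the following minimal sketch. The text does not say how conflicts are resolved, so the greedy highest-AL-first rule is an assumption, as are the function name `select_schemas`, the resource labels, and the activation values.

```python
def select_schemas(activation, resources, threshold=None):
    """Return the schema IDs to execute.

    activation: dict mapping schema ID -> action value AL
    resources:  dict mapping schema ID -> set of hardware resources it needs
    threshold:  None selects only the single highest-AL schema; otherwise all
                schemas with AL above the threshold are candidates.
    """
    ranked = sorted(activation, key=activation.get, reverse=True)
    if not ranked:
        return []
    if threshold is None:
        return [ranked[0]]
    chosen, in_use = [], set()
    for sid in ranked:
        if activation[sid] <= threshold:
            break  # the remaining schemas have lower AL still
        needed = resources.get(sid, set())
        if in_use.isdisjoint(needed):  # assumed rule: skip schemas that would conflict
            chosen.append(sid)
            in_use |= needed
    return chosen


# Hypothetical values for illustration only.
al = {1: 0.9, 2: 0.7, 6: 0.4}
res = {1: {"arms"}, 2: {"arms", "head"}, 6: {"voice"}}
parallel = select_schemas(al, res, threshold=0.3)
```

Here schema 2 is skipped because it needs the arms already claimed by schema 1, while schema 6 can run alongside schema 1.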

The behavior controller 30 outputs an identification ID (schema ID) that identifies the selected schema 33, and this schema ID is input to the emotion prediction device 40.

The emotion prediction device 40 has a predicted emotion change database 41 in which predicted emotion change values are stored and, based on the action indicated by the input schema ID, outputs the emotion change expected to occur after that action is executed (the predicted emotion change). The predicted emotion change is closely tied to the behavior control algorithm in the behavior controller 30 of the robot apparatus: a predicted emotion change value is defined for every schema 33 defined in the behavior control algorithm.

The predicted emotion change database 41 can be updated dynamically using values observed through actual interaction. Normally, a common predicted emotion change database is built for all interaction partners. Alternatively, data may be held for each person with whom the robot apparatus has actually interacted and whose face, voice, name, and so on it has stored, so that the database serving as the prediction model can be switched from person to person. The two approaches may also be combined: a per-person database for people the robot knows well, and a common database for others, such as people it is meeting for the first time.

Let dEb_k be the predicted emotion change vector defined for each action unit (schema) k, and let the emotion before the action be the estimated emotion Es' estimated from the sensor information observed before the action was executed, that is, the pre-action estimated emotion calculated by the emotion estimation device described above. Combining the two gives the predicted emotion Eb of equation (17) below, which is estimated from the action history information. The predicted emotion Eb is also a vector whose elements are the six basic emotions.

(7) Learning method for the predicted emotion change database
Next, the method of building (training) the predicted emotion change database in the emotion prediction device will be described. The predicted emotion change database represents the difference between the emotion before and after an action is executed. In the initial state, the predicted emotion change database 41 of the emotion prediction device 40 is initialized entirely to 0, and the emotion of the interaction target is judged using only the estimated emotion Es estimated from sensor information. Thereafter, whenever the robot apparatus actually selects an action and interacts, and emotion estimation results for the interaction target before and after the action are obtained, the database is updated as shown in equation (18) below, using the difference between the final emotion estimation result E_a after the action and the predicted emotion Eb based on the action history.

Here, dEb_k is the predicted emotion change vector for each action unit k; E_a is the estimated emotion E, the final output described later, obtained from the estimated emotion Es estimated from sensor information and the predicted emotion Eb estimated from action history information; α is the learning coefficient; time T' is the time at which action k was last performed; and time T is the time at which the same action k is next performed. Only the values of dEb corresponding to action k are updated. The database may also be updated using, as E_a^T' (the estimated emotion E_a at time T'), the estimated emotion Es estimated from the user's time-series feature quantities extracted from time-series sensor information after the action is executed (time T'). When a database is to be created artificially, for example, the user's emotion at time T' may be supplied from outside as teacher data or taught to the robot apparatus.

When the predicted emotion change dEb_k^T is updated, the target value is the estimated emotion E_a, whereas the value calculated using the previous predicted emotion change dEb_k^T' is the current predicted emotion Eb_k^T. The difference between the estimated emotion E_a^T' and the predicted emotion Eb_k^T' is therefore taken, as on the right-hand side of equation (18): a positive value indicates that the predicted emotion change dEb_k^T must be made larger, and a negative value that it must be made smaller.

Thus, the predicted emotion change dEb_k^T at time T equals the predicted emotion change dEb_k^T' at time T' plus the learning coefficient α multiplied by the difference between the user's emotion E_a^T' and the predicted emotion Eb_k^T' at time T'. It therefore reflects the action history, that is, the context of the actions.
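The prediction and update rules can be sketched as a small table keyed by schema ID. The sketch follows the verbal description: the predicted emotion is the pre-action estimate plus the stored change (equation (17)), and the stored change moves by α times the difference between the observed and predicted emotion (equation (18)). The class name, the method names, the value of `alpha`, and the emotion labels other than JOY and DISGUST are assumptions made for illustration.

```python
# Six basic emotions; only JOY and DISGUST are named in the text, the rest are assumed.
EMOTIONS = ("JOY", "SADNESS", "ANGER", "SURPRISE", "DISGUST", "FEAR")


class PredictedEmotionChangeDB:
    """Table of predicted emotion changes dEb_k, one vector per schema ID k."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha  # learning coefficient
        self.table = {}     # missing entries count as all zeros (the initial state)

    def change(self, k):
        return self.table.get(k, [0.0] * len(EMOTIONS))

    def predict(self, k, es_before):
        """Equation (17) as described: Eb = Es' + dEb_k."""
        return [e + d for e, d in zip(es_before, self.change(k))]

    def update(self, k, e_observed, eb_predicted):
        """Equation (18) as described: dEb_k <- dEb_k + alpha * (E_a - Eb_k).

        Only the entry for the executed action k is modified.
        """
        old = self.change(k)
        self.table[k] = [
            d + self.alpha * (ea - eb)
            for d, ea, eb in zip(old, e_observed, eb_predicted)
        ]


db = PredictedEmotionChangeDB(alpha=0.5)
es_before = [0.0] * 6
eb = db.predict(1, es_before)                    # all zeros before any learning
observed = [10.0, 0.0, 0.0, 0.0, -20.0, 0.0]     # hypothetical observed emotion
db.update(1, observed, eb)
```

After one update with these hypothetical numbers, the entry for schema 1 has moved halfway toward the observed change, and entries for other schemas are untouched.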

For example, as shown in FIG. 10(b), the predicted emotion change database 41 at a certain point in time indicates that executing the action described in schema ID (k) = 1 raises the internal state "JOY" (10) and lowers the internal state "DISGUST" (−25), and that executing the action described in schema ID (k) = 6 raises "JOY" (+5) and lowers "DISGUST" (−10).

In this way, the robot apparatus updates the predicted emotion change values by executing actions and can thereby train the predicted emotion change database 41. The predicted emotion change database 41 may be updated continuously, or updating may be stopped after it has been updated according to the action execution results over a predetermined period; alternatively, a database in which the predicted emotion changes dEb_k after action execution are defined in advance may be used.

(8) Method of estimating the estimated emotion E in the emotion recognition system
Next, the method of fusing the recognition result based on time-series sensor input information with the emotion prediction result based on the action history will be described. The estimated emotion Es, obtained by the emotion estimation device described above as the recognition result based on sensor input information, is output sequentially as the time-series sensor input arrives, whereas the predicted emotion Eb, obtained by the emotion prediction device as the result based on the action history, is a value obtained at the moment a given element action is completed. The reliability of the predicted emotion based on the action history therefore declines as time passes after the action is completed. An algorithm that reflects this effect and applies a top-down correction to the emotion recognition result calculated from sensor input (estimated emotion Es) is formulated by equations (19) and (20) below. These equations mean that immediately after an action is completed, weight is placed on the predicted emotion Eb, the emotion prediction result based on the action history, and that as time passes after completion, weight shifts to the estimated emotion Es, the emotion recognition result based on sensor information. Here, t is the time elapsed since an action was completed (it is reset to 0 when a different action is completed), and τ_0 and τ are parameters that determine the degree of decay over time.
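The time-dependent fusion can be illustrated as follows. The blend E = (1 − η)Es + ηEb matches the weighted terms (1 − η)Es and ηEb described for FIG. 12. The form of η is not reproduced in this text, so the sketch assumes, purely for illustration, that η stays at 1 until t reaches τ_0 and then decays exponentially with time constant τ. The function names and default parameter values are hypothetical.

```python
import math


def eta(t, tau0=0.0, tau=5.0):
    """Weight given to the predicted emotion Eb at elapsed time t.

    Assumed form: 1 until t reaches tau0, then exp(-(t - tau0) / tau).
    """
    if t <= tau0:
        return 1.0
    return math.exp(-(t - tau0) / tau)


def fuse(es, eb, t, tau0=0.0, tau=5.0):
    """E = (1 - eta) * Es + eta * Eb, element by element."""
    h = eta(t, tau0, tau)
    return [(1.0 - h) * s + h * b for s, b in zip(es, eb)]


es = [0.0, 1.0]   # sensor-based estimate (hypothetical two-element vector)
eb = [1.0, 0.0]   # action-history-based prediction
just_after = fuse(es, eb, t=0.0)      # equals Eb right after the action completes
much_later = fuse(es, eb, t=1000.0)   # approaches Es as time passes
```

Resetting `t` to 0 whenever a different action completes reproduces the behavior described above, in which each completed action briefly pulls the output toward the prediction.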

FIG. 11 is a diagram showing the part of the emotion recognition system relating to the emotion integration unit 52, which integrates the estimated emotion Es, the estimation result of the emotion estimation device 1, with the predicted emotion Eb, the prediction result of the emotion prediction device 40. As shown in FIG. 11, the emotion integration unit 52 receives the estimated emotion Es from the emotion estimation device 1 and the predicted emotion Eb from the emotion prediction device 40, and outputs the estimated emotion E according to equations (19) and (20) above.

Here, as described above, the emotion estimation device 1 acts as a sensor information analysis system that extracts time-series feature quantities from the time-series sensor information and outputs, as the estimated emotion Es, an estimation result that takes the context into account.

The emotion prediction device 40 has an emotion change prediction unit (action history analysis system) 42, which, after an action is executed, refers to the predicted emotion change database 41 trained with the action history (that is, the context of the actions) taken into account and outputs the predicted emotion change dEb_k, and a predicted emotion calculation unit 43, which calculates the predicted emotion from this predicted emotion change dEb_k and the estimated emotion Es' of the emotion estimation device 1' from before the action was executed. The emotion prediction device 40 outputs the predicted emotion Eb after the action is executed: it receives the pre-action estimated emotion Es' from the emotion estimation device 1', receives the predicted emotion change dEb_k associated with the executed action k from the emotion change prediction unit 42, and adds them to output the predicted emotion Eb. Although FIG. 11 shows an emotion estimation device 1' that outputs the estimated emotion Es', the estimated emotion Es' from before the start of the action may simply be input from the emotion estimation device 1.

These output data are as shown in FIG. 12. FIG. 12 is a graph showing how the estimated emotion changes when the robot apparatus executes three actions B1 to B3. The top row shows the estimated emotion Es calculated by the emotion estimation device from time-series sensor information, and the second row shows this estimated emotion Es weighted for time decay, (1−η)Es.

The third row shows the predicted emotion Eb predicted from the action history, and the fourth row shows this predicted emotion Eb weighted for time decay, ηEb. The fifth row shows the estimated emotion E, the recognition result of the emotion recognition device 50, obtained by integrating (1−η)Es and ηEb.

In this embodiment, an emotion recognizer is provided for each of a plurality of modalities; an estimated emotion is calculated from the time-series feature vectors extracted from the time-series sensor information input for each modality; and the prediction error of each recognition result is obtained, converted into a reliability parameter, and used to weight the recognition result. The emotion can therefore be estimated with the time series (context) of the sensor input information taken into account. This keeps sensor noise from destabilizing the estimated emotion, that is, from making the recognition result change incoherently, so that a very stable recognition result is obtained; and because the recognition results of the recognizers are weighted by the reliability λ and integrated into the estimated emotion Es, the emotion can be estimated very accurately.

Furthermore, in addition to the estimated emotion Es obtained by the recognizers from sensor input information, a predicted emotion Eb is obtained by predicting the emotion change from the context of the robot apparatus's own actions, and a top-down correction is applied to the estimated emotion Es. The time lag between the estimated emotion Es, which is available as soon as sensor information is input, and the predicted emotion Eb, which is output only after an action, is also taken into account: the parameter η is varied so that the predicted emotion Eb based on the action history is emphasized immediately after an action is completed and the emotion recognition result based on sensor information (estimated emotion Es) is emphasized increasingly as time passes. By fusing the sensor-based recognition result (estimated emotion Es) with the action-history-based prediction result (predicted emotion Eb) in this way, the estimated emotion finally obtained by the emotion estimation system responds to changes in the user's emotion in a natural way, closer to that of a living creature.

Moreover, the predicted emotion change database (forward model), which is used to estimate what emotion change the robot's own actions cause in the interaction target, allows the action-emotion prediction model to be trained sequentially in real time using the results obtained from the emotion estimation device 1 as teacher data, which can improve its prediction accuracy.

In this way, the robot apparatus relies not only on the estimated emotion Es obtained from sensor information but also on its experience, gained from past interaction with people, of how its own actions change the other party's emotion. That is, taking the context of its own actions into account, it builds a predicted emotion change database that serves as a model of the other person's emotion transitions, obtains the predicted emotion Eb, and judges this together with the estimated emotion Es from the emotion estimation device 1, thereby improving the accuracy of the estimated emotion E. The robot apparatus can therefore select actions according to this estimated emotion E, which indicates the emotion of the interaction target; for example, it can express actions that please or amuse the interaction target, further enhancing its entertainment value.

B: Robot apparatus
Next, a specific example of a robot apparatus equipped with the emotion recognition system described above will be described. This embodiment is described using a bipedal walking robot apparatus as an example, but it goes without saying that it is not limited to bipedal robots and can also be applied to robot apparatuses that move on four legs, wheels, or the like.

This humanoid robot apparatus is a practical robot that supports human activities in various situations in the living environment and other aspects of daily life, and is an entertainment robot that can act according to its internal state (anger, sadness, joy, pleasure, etc.) and can also perform the basic motions that humans make. It can also express actions according to the user's emotion recognized by the emotion recognition system described above, rather than its own internal state. FIG. 13 is a perspective view showing the external appearance of the robot apparatus in this embodiment.

As shown in FIG. 13, the robot apparatus 101 is constructed by connecting a head unit 103 to a predetermined position on a trunk unit 102 and connecting two arm units 104R/L (left and right) and two leg units 105R/L (left and right), where R and L are suffixes denoting right and left, respectively; the same applies hereinafter.

FIG. 14 schematically shows the joint degree-of-freedom configuration of the robot apparatus 101. The neck joint supporting the head unit 103 has three degrees of freedom: a neck joint yaw axis 111, a neck joint pitch axis 112, and a neck joint roll axis 113.

Each arm unit 104R/L constituting the upper limbs is composed of a shoulder joint pitch axis 117, a shoulder joint roll axis 118, an upper arm yaw axis 119, an elbow joint pitch axis 120, a forearm yaw axis 121, a wrist joint pitch axis 122, a wrist joint roll axis 123, and a hand 124. The hand 124 is actually a multi-joint, multi-degree-of-freedom structure including a plurality of fingers. However, because the motion of the hand 124 contributes little to, and has little influence on, the posture control and walking control of the robot apparatus 101, it is assumed in this specification, for simplicity, to have zero degrees of freedom. Each arm is therefore assumed to have seven degrees of freedom.

The trunk unit 102 has three degrees of freedom: a trunk pitch axis 114, a trunk roll axis 115, and a trunk yaw axis 116.

Each leg unit 105R/L constituting the lower limbs is composed of a hip joint yaw axis 125, a hip joint pitch axis 126, a hip joint roll axis 127, a knee joint pitch axis 128, an ankle joint pitch axis 129, an ankle joint roll axis 130, and a foot 131. In this specification, the intersection of the hip joint pitch axis 126 and the hip joint roll axis 127 defines the hip joint position of the robot apparatus 101. The foot of a human body is actually a structure that includes a multi-joint, multi-degree-of-freedom sole, but in this specification, for simplicity, the sole of the robot apparatus 101 is assumed to have zero degrees of freedom. Each leg therefore has six degrees of freedom.

Taken together, the robot apparatus 101 as a whole has a total of 3 + 7 × 2 + 3 + 6 × 2 = 32 degrees of freedom. However, the entertainment robot apparatus 101 is not necessarily limited to 32 degrees of freedom. The number of degrees of freedom, that is, the number of joints, can of course be increased or decreased as appropriate according to design and manufacturing constraints, required specifications, and the like.
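The tally above can be checked with a short sketch. The dictionary layout and the names `DOF_PER_UNIT`, `UNIT_COUNT`, and `total_dof` are illustrative assumptions and do not appear in the specification; only the per-unit counts (neck 3, each arm 7, trunk 3, each leg 6) come from the description.

```python
# Degrees of freedom per unit, taken from the description above.
# The names and the table layout are hypothetical, for illustration only.
DOF_PER_UNIT = {"neck": 3, "arm": 7, "trunk": 3, "leg": 6}
UNIT_COUNT = {"neck": 1, "arm": 2, "trunk": 1, "leg": 2}


def total_dof(dof_per_unit=DOF_PER_UNIT, unit_count=UNIT_COUNT):
    """Sum the degrees of freedom over all units (hands and soles count as zero)."""
    return sum(dof_per_unit[unit] * unit_count[unit] for unit in dof_per_unit)


if __name__ == "__main__":
    print(total_dof())
```

Changing an entry in either table shows how adding or removing joints changes the total, which mirrors the remark that the joint count may be adjusted to suit design constraints.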

Each degree of freedom of the robot apparatus 101 described above is actually implemented using an actuator. The actuators are preferably small and lightweight, both to eliminate excess bulges in the external appearance so that the robot approximates the natural shape of a human body and to allow posture control of an inherently unstable bipedal walking structure.

Such a robot apparatus includes a control system that controls the operation of the entire robot apparatus, located for example in the trunk unit 102. FIG. 15 is a schematic diagram showing the control system configuration of the robot apparatus 101. As shown in FIG. 15, the control system is composed of a thought control module 300, which responds dynamically to user input and the like and governs emotion judgment and emotional expression, and a motion control module 400, which controls the whole-body cooperative motion of the robot apparatus 101, such as driving the actuators 450.

The thought control module 300 is composed of a CPU (Central Processing Unit) 311 that executes arithmetic processing related to emotion judgment and emotional expression, a RAM (Random Access Memory) 312, a ROM (Read Only Memory) 313, an external storage device (such as a hard disk drive) 314, and the like. It is an independently driven information processing device capable of self-contained processing within the module.

The thought control module 300 determines the current emotion and intention of the robot apparatus 101 according to stimuli from the outside world, such as image data input from an image input device 351 and audio data input from an audio input device 352. That is, as described above, by recognizing the user's facial expression from the input image data and reflecting that information in the emotion and intention of the robot apparatus 101, the robot can express behavior that corresponds to the user's facial expression. Here, the image input device 351 includes, for example, a plurality of CCD (Charge Coupled Device) cameras, and the audio input device 352 includes, for example, a plurality of microphones.

The thought control module 300 also issues commands to the motion control module 400 so that a motion or behavior sequence based on its decision making, that is, movement of the limbs, is executed.

The motion control module 400, for its part, is composed of a CPU 411 that controls the whole-body cooperative motion of the robot apparatus 101, a RAM 412, a ROM 413, an external storage device (such as a hard disk drive) 414, and the like. It is an independently driven information processing device capable of self-contained processing within the module. The external storage device 414 can store, for example, walking patterns calculated offline, target ZMP (Zero Moment Point) trajectories, and other action plans.

Various devices are connected to the motion control module 400 via a bus interface (I/F) 401: the actuators 450 that realize the joint degrees of freedom distributed over the whole body of the robot apparatus 101 shown in FIG. 14, a distance measurement sensor (not shown) that measures the distance to an object, a posture sensor 451 that measures the posture and inclination of the trunk unit 102, ground contact confirmation sensors 452 and 453 that detect whether the left and right soles have left or touched the floor, load sensors provided on the soles of the feet 131, a power supply control device 454 that manages a power supply such as a battery, and the like. Here, the posture sensor 451 is configured, for example, as a combination of an acceleration sensor and a gyro sensor, and the ground contact confirmation sensors 452 and 453 are configured as proximity sensors, microswitches, or the like.

The thought control module 300 and the motion control module 400 are built on a common platform and are interconnected via bus interfaces 301 and 401.

The motion control module 400 controls the whole-body cooperative motion produced by the actuators 450 in order to embody the behavior instructed by the thought control module 300. That is, the CPU 411 retrieves a motion pattern corresponding to the behavior instructed by the thought control module 300 from the external storage device 414, or generates a motion pattern internally. The CPU 411 then sets the foot motion, ZMP trajectory, trunk motion, upper limb motion, horizontal position and height of the waist, and the like according to the specified motion pattern, and transfers command values instructing motions in accordance with these settings to each actuator 450.

The CPU 411 also detects the posture and inclination of the trunk unit 102 of the robot apparatus 101 from the output signal of the posture sensor 451, and detects from the output signals of the ground contact confirmation sensors 452 and 453 whether each leg unit 105R/L is a swing leg or a stance leg, which allows it to adaptively control the whole-body cooperative motion of the robot apparatus 101. Furthermore, the CPU 411 controls the posture and motion of the robot apparatus 101 so that the ZMP position always moves toward the center of the ZMP stable region.
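The two checks described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the control law of the embodiment: the function names, the proportional gain, and the approximation of the center of the ZMP stable region by the mean of its polygon vertices are all hypothetical and are not given in the specification.

```python
def leg_states(left_contact, right_contact):
    """Classify each leg as stance or swing from the ground contact sensor outputs."""
    return {
        "L": "stance" if left_contact else "swing",
        "R": "stance" if right_contact else "swing",
    }


def region_center(vertices):
    """Approximate the center of the stable region as the mean of its vertices.

    Assumption: the stable region is given as a list of (x, y) polygon vertices.
    """
    n = len(vertices)
    return (sum(x for x, _ in vertices) / n, sum(y for _, y in vertices) / n)


def zmp_correction(zmp, vertices, gain=0.5):
    """Return a correction vector that moves the ZMP toward the region center.

    Assumption: a simple proportional rule with a hypothetical gain.
    """
    cx, cy = region_center(vertices)
    return (gain * (cx - zmp[0]), gain * (cy - zmp[1]))


if __name__ == "__main__":
    region = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]
    print(leg_states(True, False))
    print(zmp_correction((3.0, 1.0), region))
```

A proportional rule is used here only because it is the simplest way to show the ZMP being driven toward the center; an actual controller would act through the joint command values sent to the actuators 450.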

The motion control module 400 also returns to the thought control module 300 the extent to which the behavior intended by the thought control module 300 has actually been expressed, that is, the processing status. In this way, the robot apparatus 101 can judge its own situation and its surroundings based on the control program and act autonomously.

Such a robot apparatus requires human interface technology that can respond within a fixed time in a dynamically changing work environment. By incorporating the emotion estimation system, the robot apparatus 101 according to the present embodiment recognizes the emotions of nearby users (an owner, a friend, or a legitimate user) and controls its reactions based on the recognition results, thereby achieving a higher level of entertainment value.
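As an illustration of how such an emotion estimation system might combine the recognition results of several modals, the following sketch converts predicted errors into reliabilities with a softmax and forms a weighted sum, in line with the softmax computing units, multipliers, and adder listed among the reference symbols. The function names, the use of the negated error as the softmax input, and the sample vectors are assumptions made for illustration; the specification does not fix these details here.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def integrate_modals(results, predicted_errors):
    """Weight each modal's emotion vector by a reliability and add them up.

    Assumption: reliability = softmax(-predicted error), so a modal with a
    smaller predicted error receives a larger weight.
    """
    weights = softmax([-e for e in predicted_errors])
    dim = len(results[0])
    return [sum(w * r[d] for w, r in zip(weights, results)) for d in range(dim)]


if __name__ == "__main__":
    face = [0.9, 0.1]    # hypothetical emotion vector from the image modal
    voice = [0.2, 0.8]   # hypothetical emotion vector from the audio modal
    print(integrate_modals([face, voice], [0.1, 1.5]))
```

Because the weights sum to one, the integrated result stays within the range spanned by the individual results, and a modal whose predicted error is large contributes little to the final estimate.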

The present invention is not limited to the embodiments described above, and various modifications can of course be made without departing from the gist of the present invention. The emotion recognition system in the above embodiments may be realized in hardware, or any of its processing may be realized by causing a processor to execute a computer program. In the latter case, the computer program can be provided recorded on a recording medium, or can be provided by transmission via the Internet or another transmission medium.
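As one hedged illustration of such a software realization, the sketch below fuses the estimated emotion Es with the predicted emotion Eb using a parameter η that decays with the time elapsed since the action, so that the estimate gains weight over the prediction as time passes. The exponential form of η, the time constant `tau`, the function names, and the sample values are assumptions; the document states only that η reflects the time deterioration of the predicted emotion and that Eb is obtained from the pre-action estimate Es' and the predicted change dEb.

```python
import math


def predicted_emotion(es_before, d_eb):
    """Eb = Es' + dEb: pre-action estimate plus the predicted emotion change."""
    return [e + d for e, d in zip(es_before, d_eb)]


def eta(elapsed, tau=5.0):
    """Hypothetical decay of trust in the prediction: 1 at t = 0, toward 0 later."""
    return math.exp(-elapsed / tau)


def integrate(es, eb, elapsed, tau=5.0):
    """E = eta * Eb + (1 - eta) * Es, evaluated component by component."""
    w = eta(elapsed, tau)
    return [w * b + (1.0 - w) * s for s, b in zip(es, eb)]


if __name__ == "__main__":
    eb = predicted_emotion([0.2, 0.5], [0.3, -0.1])  # hypothetical values
    es = [0.1, 0.6]
    for t in (0.0, 5.0, 50.0):
        print(t, integrate(es, eb, t))
```

Immediately after the action the output equals the prediction, and long after the action it approaches the sensor-based estimate, which is the behavior the weighting parameter is meant to produce.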

FIG. 1 is a block diagram showing an emotion estimation device according to an embodiment of the present invention.
FIG. 2 is a diagram schematically showing the emotion estimation unit 5 during learning in the emotion estimation device.
FIG. 3 is a diagram for explaining the learning method of the recurrent neural network of the emotion estimation unit in the emotion estimation device.
FIG. 4 is a diagram schematically showing the error predictor during learning in the emotion estimation device.
FIG. 5 is a diagram for explaining the learning method of the recurrent neural network of the error predictor in the emotion estimation device.
FIG. 6 is a diagram showing an example of a multilayer neural network.
FIG. 7 is a diagram showing an example of a recurrent network.
FIG. 8 is a flowchart showing a backpropagation learning method for the weights, time constants, and initial states of a recurrent network.
FIG. 9 is a diagram showing an example of learning data.
FIG. 10(a) is a diagram showing the main parts of the robot apparatus related to the emotion prediction device, and FIG. 10(b) is a diagram showing a specific example of the predicted emotion change database.
FIG. 11 is a diagram for explaining the main parts of the emotion recognition device according to the embodiment of the present invention.
FIG. 12 is a diagram showing an example of an emotion estimation result, an emotion prediction result, and the result of integrating them in the embodiment of the present invention.
FIG. 13 is a perspective view showing an overview of the robot apparatus according to the embodiment of the present invention.
FIG. 14 is a diagram schematically showing the joint degree-of-freedom configuration of the robot apparatus.
FIG. 15 is a schematic diagram showing the control system configuration of the robot apparatus.

Explanation of symbols

1: emotion estimation device, 2_m: sensor unit, 3_n: feature quantity extraction unit, 4_n: modal-specific emotion estimation unit, 7_n: softmax computing unit, 8_n: multiplier, 9: adder, 5_n: emotion recognizer, 6_n: error predictor, 10: database, 31: internal state, 32: external stimulus, 30: behavior controller, 33: schema, 40: emotion prediction device, 41: predicted emotion change database, 42: predicted emotion calculation unit, 50: emotion estimation system, 52: emotion integration unit

Claims

A feature quantity extraction means for extracting a time series feature quantity related to a recognition target for recognizing emotion from time series sensor information;
An emotion estimation device comprising: an emotion estimation unit that estimates the emotion of the recognition target in consideration of the context of the sensor information based on the time-series feature amount.

The feature amount extraction means extracts the time-series feature amount for each modal for a plurality of modals as a modal-specific feature amount sequence,
The emotion estimation apparatus according to claim 1, wherein the emotion estimation means includes a plurality of modal emotion recognition means for recognizing the emotion for each modal based on the modal feature quantity sequence, and recognition result integration means for estimating the emotion based on the recognition results of the modal emotion recognition means.

The emotion estimation means includes prediction error calculation means for calculating a prediction error of the recognition result obtained by the modal emotion recognition means,
The emotion estimation apparatus according to claim 2, wherein the recognition result integration unit estimates the emotion based on a recognition result obtained by the modal emotion recognition unit and a prediction error thereof.

The emotion estimation device wherein the prediction error calculation means calculates the prediction error of the recognition result based on the modal feature quantity sequence and/or the recognition result obtained by the modal emotion recognition means.

The emotion estimation apparatus according to claim 3, wherein the recognition result integration unit converts the prediction error obtained by the prediction error calculation means into a reliability, weights the recognition result obtained by each modal emotion recognition means, and estimates the emotion based on the weighted recognition results.

The emotion estimation apparatus according to claim 1, wherein the emotion estimation means has been trained in advance by recursive learning so that, when the time-series feature quantity related to the recognition target extracted from the time-series sensor information is used as input data, the emotion of the recognition target at the time the time-series sensor information was acquired becomes the output data.

The prediction error calculation means has been trained in advance by recursive learning so that, when the time-series feature quantity for each modal related to the recognition target extracted from the time-series sensor information and/or the recognition result obtained by the modal emotion recognition means is used as input data, the difference between the emotion of the recognition target at the time the time-series sensor information was acquired and the recognition result obtained by the modal emotion recognition means becomes the output data,
The emotion estimation apparatus according to claim 3, wherein the modal emotion recognition means has been trained in advance by recursive learning so that, when the time-series feature quantity for each modal related to the recognition target extracted from the time-series sensor information is used as input data, the emotion of the recognition target at the time the time-series sensor information was acquired becomes the output data.

The emotion estimation apparatus according to claim 6, wherein a neural network is used for the recursive learning.

A feature quantity extraction step for extracting a time series feature quantity related to a recognition target for recognizing emotion from time series sensor information;
An emotion estimation method comprising: an emotion estimation step of estimating the emotion of the recognition target in consideration of the context of the sensor information based on the time-series feature amount.

In the feature amount extraction step, the time-series feature amount for each modal is extracted as a feature amount sequence by modal for a plurality of modals,
The emotion estimation method according to claim 9, wherein the emotion estimation step includes a modal emotion recognition step of recognizing the emotion for each modal based on the modal feature quantity sequence, and a recognition result integration step of estimating the emotion based on the modal recognition results obtained in the modal emotion recognition step.

The emotion estimation method according to claim 10, further comprising a prediction error calculation step of calculating a prediction error of the recognition result obtained in the modal emotion recognition step,
wherein, in the recognition result integration step, the emotion is estimated based on the modal recognition result obtained in the modal emotion recognition step and its prediction error.

The emotion estimation method according to claim 11, wherein, in the recognition result integration step, the prediction error calculated in the prediction error calculation step is converted into a reliability, the recognition result obtained in each modal emotion recognition step is weighted, and the emotion is estimated based on the weighted recognition results.

In a robotic device that behaves autonomously based on internal conditions and / or external stimuli,
One or more sensors that obtain information about the interaction object;
Feature quantity extraction means for receiving time series sensor information from the sensor and extracting time series feature quantities relating to the interaction target;
A robot apparatus comprising: an emotion estimation unit configured to estimate the emotion of the interaction target in consideration of the context of the sensor information based on the time-series feature amount.

The feature amount extraction means extracts the time-series feature amount for each modal as a feature amount sequence by modal for a plurality of modals,
The robot apparatus according to claim 13, wherein the emotion estimation means includes a plurality of modal emotion recognition means for recognizing the emotion for each modal based on the modal feature quantity sequence, and recognition result integration means for estimating the emotion based on the recognition results obtained by the modal emotion recognition means.

The emotion estimation means includes prediction error calculation means for calculating a prediction error of the recognition result obtained by the modal emotion recognition means,
The robot apparatus according to claim 14, wherein the recognition result integrating unit estimates the emotion based on a recognition result obtained by the modal emotion recognition unit and a prediction error thereof.

Action execution means for selecting and executing one action from a plurality of actions;
An emotion prediction means for predicting the emotion of the interaction target according to the type of action executed by the action execution means;
The robot apparatus according to claim 13, further comprising emotion integration means for estimating the emotion of the interaction target based on the estimation result obtained by the emotion estimation means and the prediction result obtained by the emotion prediction means.

The emotion prediction means predicts an emotional change after execution of an action by referring to an expected emotional change database in which one action and an expected emotional change expected to change after the execution of the one action are associated with each other, The robot apparatus according to claim 16, wherein the emotion after the execution of the behavior is predicted based on the prediction result and the estimation result estimated by the emotion estimation means before the execution of the behavior.

The robot apparatus according to claim 17, wherein the predicted emotion change database is obtained by learning the expected emotion change based on an emotion change of the interaction target before and after the execution of each action.

The robot apparatus according to claim 17, further comprising database learning means for updating the predicted emotion change in the predicted emotion change database, each time an action is executed by the action execution means, based on the difference in the emotion of the interaction target before and after the execution of the action as estimated by the emotion estimation means.

The emotion integration unit weights the estimation result and the prediction result by a parameter that changes so that the estimation result by the emotion estimation unit is more important than the prediction result by the emotion prediction unit as time elapses after execution of the action, The robot apparatus according to claim 16, wherein the emotion is estimated based on the weighted result.

In an emotion recognition method for a robot apparatus that acts autonomously based on an internal state and / or an external stimulus,
A feature amount extraction step of extracting a time series feature amount related to an interaction target from time series sensor information;
An emotion recognition method for a robot apparatus, comprising: an emotion estimation step of estimating the emotion of the interaction target in consideration of the context of the sensor information based on the time-series feature quantity.

In the feature quantity extraction step, the time series feature quantity for each modal is extracted as a modal-specific feature quantity sequence for a plurality of modals,
The emotion recognition method for a robot apparatus according to claim 21, wherein the emotion estimation step includes a modal emotion recognition step of recognizing the emotion for each modal based on the modal feature quantity sequence, and a recognition result integration step of estimating the emotion based on the modal recognition results obtained in the modal emotion recognition step.

The emotion recognition method for a robot apparatus according to claim 22, further comprising a prediction error calculation step of calculating a prediction error of the recognition result obtained in the modal emotion recognition step,
wherein, in the recognition result integration step, the emotion is estimated based on the recognition result obtained in the modal emotion recognition step and its prediction error.

The emotion recognition method for a robot apparatus, further comprising: an emotion prediction step of predicting the emotion of the interaction target according to the type of the executed action when one action is executed by action execution means that selects and executes one action from a plurality of actions; and
an emotion integration step of estimating the emotion of the interaction target based on the estimation result obtained in the emotion estimation step and the prediction result obtained in the emotion prediction step.

In the emotion prediction step, an emotion change after the execution of an action is predicted by referring to an expected emotion change database in which one action and an expected emotion that is expected to change after the execution of the one action are associated with each other. The emotion recognition method for a robot apparatus according to claim 24, wherein the emotion after the execution of the action is predicted based on the result and the estimation result obtained in the emotion estimation step before the execution of the action.

The emotion recognition method for a robot apparatus according to claim 24, wherein, in the emotion integration step, the estimation result and the prediction result are weighted by a parameter that changes so that, as time elapses after execution of the action, the estimation result obtained in the emotion estimation step is emphasized over the prediction result obtained in the emotion prediction step, and the emotion is estimated based on the weighted result.

In the learning method of the robot apparatus equipped with the emotion recognition device that recognizes the emotion of the interaction target from the given input data,
A learning method for a robot apparatus, comprising a learning step of training the emotion recognition device using, as input data, the time-series feature quantity related to the interaction target extracted from time-series sensor information and using, as an output target value, the emotion of the interaction target at the time the time-series sensor information was acquired.

In a robotic device that behaves autonomously based on internal conditions and / or external stimuli,
An emotion recognition device for recognizing the emotion of an interaction target from given input data;
One or more sensors that obtain information about the interaction object;
Feature quantity extraction means for receiving time series sensor information from the sensor and extracting a time series feature quantity relating to the interaction target;
Learning means for training the emotion recognition device using, as input data, the time-series feature quantity related to the interaction target extracted from the time-series sensor information and using, as an output target value, the emotion of the interaction target at the time the time-series sensor information was acquired,
wherein the emotion recognition device recognizes the emotion using the time-series feature quantity extracted by the feature quantity extraction means as the input data.