JP7425681B2 - Social ability generation device, social ability generation method, and communication robot

Info

Publication number: JP7425681B2
Application number: JP2020108946A
Authority: JP
Inventors: Randy Gomez (ランディゴメス)
Original assignee: Honda Motor Co Ltd
Current assignee: Honda Motor Co Ltd
Priority date: 2020-06-24
Filing date: 2020-06-24
Publication date: 2024-01-31
Anticipated expiration: 2040-06-24
Also published as: JP2022006610A

Description

The present invention relates to a social ability generation device, a social ability generation method, and a communication robot.

Smart speakers and communication robots are currently under development. Such systems focus on functions performed in response to instructions, such as turning lights on or off, accessing a calendar, reading e-mail, and scheduling appointments. Because instruction input is limited to, for example, touch-panel selections or predetermined voice commands, it is difficult for these systems to build relationships with people.

For this reason, systems that can form relationships with people are desired. For example, Patent Document 1 proposes a system that engages a person in dialogue and interaction with a companion device. In the technology described in Patent Document 1, the companion device detects the user's utterances and actions and responds through movement, graphics, sound, light, and fragrance, thereby providing a companionable presence.

Patent Document 1: Japanese Translation of PCT International Application Publication No. 2019-521449

However, with the technology described in Patent Document 1, it has been difficult to form an emotional connection between a robot and a person.

The present invention has been made in view of the above problem, and an object thereof is to provide a social ability generation device, a social ability generation method, and a communication robot that can form an emotional connection between a robot and a person.

(1) To achieve the above object, a social ability generation device according to one aspect of the present invention includes: recognition means that acquires person information about a person, extracts feature information about the person from the acquired person information, recognizes interactions that occur between a person and a communication device that performs communication, and recognizes interactions that occur between people; learning means that multimodally learns human emotional interaction using the extracted feature information about the person; and motion generation means that generates behavior based on the learned emotional interaction information of the person.

(2) In the social ability generation device according to one aspect of the present invention, the learning means may perform learning using an implicit reward and an explicit reward, where the implicit reward is a reward learned multimodally using the feature information about the person, and the explicit reward is a reward based on the result of evaluating the behavior of the communication device toward the person, the behavior having been generated by the motion generation means.

(3) The social ability generation device according to one aspect of the present invention may further include a sound collection unit that collects an acoustic signal and an imaging unit that captures an image including the user. The recognition means performs speech recognition processing on the collected acoustic signal to extract feature information about speech, and performs image processing on the captured image to extract feature information about the behavior of a person in the image. The feature information about the person includes the feature information about speech and the feature information about human behavior. The feature information about speech is at least one of a speech signal, voice volume information, voice intonation information, and the meaning of an utterance. The feature information about human behavior is at least one of the person's facial expression information, information on gestures made by the person, head posture information, face orientation information, gaze information, and the distance between people.

(4) In the social ability generation device according to one aspect of the present invention, the learning means may perform learning using social norms, social components, psychological knowledge, and humanistic knowledge.

(5) To achieve the above object, in a social ability generation method according to one aspect of the present invention, recognition means acquires person information about a person, extracts feature information about the person from the acquired person information, recognizes interactions that occur between a person and a communication device that performs communication, and recognizes interactions that occur between people; learning means multimodally learns human emotional interaction using the extracted feature information about the person; and motion generation means generates behavior based on the learned emotional interaction information of the person.

(6) To achieve the above object, a communication robot according to one aspect of the present invention includes: recognition means that acquires person information about a person, extracts feature information about the person from the acquired person information, recognizes interactions that occur between a person and a communication device that performs communication, and recognizes interactions that occur between people; learning means that multimodally learns human emotional interaction using the extracted feature information about the person; and motion generation means that generates behavior based on the learned emotional interaction information of the person.

(7) The communication robot according to one aspect of the present invention may further include a display unit, and the motion generation means may generate an image that keeps the relationship with the person in a good state by causing the robot to behave in a way that maximizes the person's positive emotions, and may display the generated image on the display unit.

According to (1) to (7), an emotional connection can be formed between a robot and a person.
According to (2), learning can be performed without a large amount of teaching data.
According to (3), a large amount of information based on people's reactions can be acquired.
According to (4), the device is socially intelligent and can navigate social scenarios.
According to (7), the robot can be made to behave in a way that maximizes a person's positive emotions, and a good relationship with the person can be maintained.
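
Aspect (2) describes learning from two reward signals: an implicit reward learned multimodally from feature information about the person, and an explicit reward based on an evaluation of the communication device's behavior. The following is a minimal sketch of one way the two could be combined. The weighted sum, the weight values, and every name in the code are illustrative assumptions and are not taken from the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical mixing weights; the specification does not give values.
    implicit: float = 0.5
    explicit: float = 0.5


def combined_reward(implicit_reward: float,
                    explicit_reward: float,
                    weights: RewardWeights = RewardWeights()) -> float:
    """Blend the implicit reward (estimated from the person's multimodal
    reaction) with the explicit reward (an evaluation of the robot's action).
    A weighted sum is an assumption made for illustration only."""
    return (weights.implicit * implicit_reward
            + weights.explicit * explicit_reward)


# Example: a mildly positive reaction and a clearly positive evaluation.
reward = combined_reward(0.2, 1.0)
```

Keeping the two signals separate until the final blend would let the weighting shift toward the implicit reward when little explicit feedback is available, which is in line with the stated effect of learning without a large amount of teaching data.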

FIG. 1 is a diagram showing an example of communication by the communication robot according to the embodiment.
FIG. 2 is a block diagram showing a configuration example of the communication robot according to the embodiment.
FIG. 3 is a diagram showing an example of the external shape of the communication robot according to the embodiment.
FIG. 4 is a diagram showing the flow of recognition, learning, and social ability performed by the social ability generation device of the embodiment.
FIG. 5 is a diagram showing an example of data recognized by the recognition unit according to the embodiment.
FIG. 6 is a diagram showing an example of a system that performs deep reinforcement learning using raw data in a comparative example.
FIG. 7 is a diagram showing an example of an agent creation method used by the motion generation unit according to the embodiment.
FIG. 8 is a flowchart showing an example of the procedure of social ability generation processing according to the embodiment.
FIG. 9 is a diagram showing an example of communication between the communication robot and a user according to the embodiment.
FIG. 10 is a diagram showing an example of an image displayed on the display unit during communication between the communication robot and the user according to the embodiment.
FIG. 11 is a diagram showing an example of an image displayed on the display unit during communication between the communication robot and the user according to the embodiment.
FIG. 12 is a diagram showing an example of communication by the communication robot when the user communicates with a friend according to the embodiment.
FIG. 13 is a diagram showing an example in which the communication robot of the embodiment is applied to an in-vehicle car navigation system.
FIG. 14 is a diagram showing an example of connections with various devices in the home when the embodiment is applied to car navigation.

Embodiments of the present invention will be described below with reference to the drawings. In the drawings used in the following description, the scale of each member is changed as appropriate so that each member is large enough to be recognized.

<Overview>
FIG. 1 is a diagram showing an example of communication by the communication robot 1 according to the present embodiment. As shown in FIG. 1, the communication robot 1 communicates with an individual or a plurality of people 2. Communication consists mainly of dialogue g11 and gestures g12 (motions). Motions are expressed not only through actual movement but also through images displayed on the display unit. When an e-mail is sent to the user via an Internet connection or the like, the communication robot 1 receives the e-mail and informs the user that it has arrived and of its content (g14). When, for example, the e-mail requires a reply, the communication robot 1 communicates with the user to determine whether advice is needed and makes a proposal g14. The communication robot 1 then sends the reply (g15). The communication robot 1 also presents, for example, a weather forecast g19 for the relevant location according to the date, time, and place of the user's scheduled events.

In the present embodiment, the robot's social abilities are generated so that an emotional connection can be formed between the robot and a person, and the robot communicates with the person according to, for example, the person's reactions and actions. The person and the robot thus communicate with empathy at an emotional level. In effect, the present embodiment realizes something like the communication between a person and a pet, and does so by also learning social norms and the like. This is achieved by learning, during communication, the user's social background, interactions between people, and so on.

<Configuration example of communication robot 1>
Next, a configuration example of the communication robot 1 will be described.
FIG. 2 is a block diagram showing a configuration example of the communication robot 1 according to the present embodiment. As shown in FIG. 2, the communication robot 1 includes a receiving unit 101, an imaging unit 102, a sound collection unit 103, a sensor 104, a social ability generation device 100, a storage unit 106, a first database 107, a second database 109, a display unit 111, a speaker 112, an actuator 113, and a transmitting unit 114.

The social ability generation device 100 includes a recognition unit 105 (recognition means), a learning unit 108 (learning means), and a motion generation unit 110 (motion generation means).
The motion generation unit 110 includes an image generation unit 1101, an audio generation unit 1102, a drive unit 1103, and a transmission information generation unit 1104.

<Functions and operation of communication robot 1>
Next, the functions and operation of each functional unit of the communication robot 1 will be described with reference to FIG. 1.

The receiving unit 101 acquires information (for example, e-mail, blog information, news, and weather forecasts) from, for example, the Internet via a network, and outputs the acquired information to the recognition unit 105 and the motion generation unit 110. Alternatively, when the first database 107 is located in the cloud, for example, the receiving unit 101 acquires information from the first database 107 in the cloud and outputs the acquired information to the recognition unit 105.

The imaging unit 102 is, for example, a CMOS (Complementary Metal Oxide Semiconductor) imaging element or a CCD (Charge Coupled Device) imaging element. The imaging unit 102 outputs captured images (person information, that is, information about a person; still images, continuous still images, or moving images) to the recognition unit 105 and the motion generation unit 110. The communication robot 1 may include a plurality of imaging units 102. In that case, the imaging units 102 may be attached, for example, to the front and rear of the housing of the communication robot 1.

The sound collection unit 103 is, for example, a microphone array composed of a plurality of microphones. The sound collection unit 103 outputs the acoustic signals (person information) collected by the microphones to the recognition unit 105 and the motion generation unit 110. The sound collection unit 103 may sample the acoustic signals collected by the microphones using a common sampling signal, convert them from analog to digital signals, and then output them to the recognition unit 105.

The sensor 104 is, for example, a temperature sensor that detects the ambient temperature, an illuminance sensor that detects the ambient illuminance, a gyro sensor that detects the tilt of the housing of the communication robot 1, an acceleration sensor that detects the movement of the housing of the communication robot 1, or an atmospheric pressure sensor that detects atmospheric pressure. The sensor 104 outputs the detected values to the recognition unit 105 and the motion generation unit 110.

The storage unit 106 stores, for example, the items to be recognized by the recognition unit 105, various values used in recognition (thresholds and constants), and the algorithms used for recognition.

The first database 107 stores, for example, a language model database, an acoustic model database, a dialogue corpus database, and acoustic features used in speech recognition, as well as a comparison image database and image features used in image recognition. The individual data and features are described later. The first database 107 may be located in the cloud or may be connected via a network.

The second database 109 stores data on relationships between people that is used during learning, such as social components, social norms, social customs, psychology, and the humanities. The second database 109 may be located in the cloud or may be connected via a network.

The social ability generation device 100 recognizes interactions that occur between the communication robot 1 and a person, or interactions that occur among a plurality of people, and learns human emotional interaction based on what it has recognized and the data stored in the second database 109. The social ability generation device 100 then generates the social abilities of the communication robot 1 from what it has learned. Here, social ability refers to the ability to take part in interactions between people, such as dialogue, action, understanding, and empathy.

The recognition unit 105 recognizes interactions that occur between the communication robot 1 and a person, or interactions that occur among a plurality of people. The recognition unit 105 acquires the image captured by the imaging unit 102, the acoustic signal collected by the sound collection unit 103, and the values detected by the sensor 104. The recognition unit 105 may also acquire the information received by the receiving unit 101. Based on the acquired information and the data stored in the first database 107, the recognition unit 105 recognizes interactions that occur between the communication robot 1 and a person, or interactions that occur among a plurality of people. The recognition method is described later. The recognition unit 105 outputs the recognition results (feature quantities related to sound and feature information related to human behavior) to the learning unit 108. The recognition unit 105 performs well-known image processing (for example, binarization, edge detection, clustering, and image feature extraction) on the image captured by the imaging unit 102.
The recognition unit 105 performs well-known speech recognition processing (sound source identification, sound source localization, noise suppression, speech section detection, sound source extraction, acoustic feature calculation, and so on) on the acquired acoustic signal. Based on the recognition results, the recognition unit 105 extracts the speech signal (or acoustic signal) of the target person, animal, or object from the acquired acoustic signal, and outputs the extracted speech signal (or acoustic signal) to the motion generation unit 110 as a recognition result. Based on the recognition results, the recognition unit 105 also extracts an image of the target person or object from the acquired image, and outputs the extracted image to the motion generation unit 110 as a recognition result.
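
Binarization and edge detection are named above as examples of well-known image processing. The sketch below illustrates the two operations on a tiny grayscale image held as nested lists. The threshold value and the neighbour-difference edge detector are assumptions chosen for brevity; the specification does not state which algorithms the recognition unit 105 uses.

```python
def binarize(image, threshold):
    """Return a 0/1 image: 1 where the pixel value is >= threshold."""
    return [[1 if px >= threshold else 0 for px in row] for row in image]


def horizontal_edges(binary):
    """Mark positions where a pixel differs from its right-hand neighbour.
    A simple difference-based detector, used here only as a stand-in."""
    return [[1 if row[x] != row[x + 1] else 0 for x in range(len(row) - 1)]
            for row in binary]


# A 2x4 grayscale patch: dark on the left, bright on the right.
gray = [
    [10, 12, 200, 210],
    [11, 13, 205, 220],
]
binary = binarize(gray, threshold=128)   # hypothetical threshold
edges = horizontal_edges(binary)         # boundary between columns 1 and 2
```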

The learning unit 108 learns human emotional interaction using the recognition results output by the recognition unit 105 and the data stored in the second database 109. The learning unit 108 stores the model generated by learning. The learning method is described later.

The motion generation unit 110 acquires the received information from the receiving unit 101, the captured image from the imaging unit 102, the collected acoustic signal from the sound collection unit 103, and the recognition results from the recognition unit 105. Based on the learned results and the acquired information, the motion generation unit 110 generates behavior (utterances, gestures, and images) directed at the user.

The image generation unit 1101 generates an output image (a still image, continuous still images, or a moving image) to be displayed on the display unit 111 based on the learned results and the acquired information, and causes the display unit 111 to display the generated output image. In this way, the motion generation unit 110 communicates with the user by displaying expression-like animations on the display unit 111 and by presenting images to the user. The displayed images include images corresponding to the movement of human eyes, images corresponding to the movement of a human mouth, information about the user's destination and the like (maps, weather charts, weather forecasts, and information on shops and leisure spots), and images of a person who has placed a video call to the user over an Internet connection.

The audio generation unit 1102 generates an output audio signal to be output from the speaker 112 based on the learned results and the acquired information, and causes the speaker 112 to output the generated signal. In this way, the motion generation unit 110 communicates with the user by outputting audio from the speaker 112. The output audio signal is, for example, speech in the voice assigned to the communication robot 1, or the voice of a person who has placed a video call to the user over an Internet connection.

The drive unit 1103 generates a drive signal for driving the actuator 113 based on the learned results and the acquired information, and drives the actuator 113 with the generated drive signal. In this way, the motion generation unit 110 controls the movement of the communication robot 1 so that it expresses emotions and the like, and thereby communicates with the user.

Based on the learned results and the acquired information, the transmission information generation unit 1104 generates transmission information (audio signals and images) that the user wants to send, for example to another user with whom the user is conversing over the network, and causes the transmitting unit 114 to transmit the generated transmission information.

The display unit 111 is, for example, a liquid crystal display device or an organic EL (electroluminescence) display device. The display unit 111 displays the output image output by the image generation unit 1101 of the social ability generation device 100.

The speaker 112 outputs the output audio signal output by the audio generation unit 1102 of the social ability generation device 100.

The actuator 113 drives the moving parts according to the drive signal output by the drive unit 1103 of the social ability generation device 100.

The transmitting unit 114 transmits the transmission information output by the transmission information generation unit 1104 of the social ability generation device 100 to the destination via the network.
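
The data flow described above (the recognition unit 105 feeding the learning unit 108 and the motion generation unit 110, which in turn drives the display, speaker, and actuator) can be sketched as follows. This is a structural illustration only: the class names mirror the functional units, but every method, field, and rule in the code is a hypothetical placeholder rather than the processing described in the specification.

```python
from dataclasses import dataclass, field


@dataclass
class RecognitionResult:
    # Feature information named in the text; the field layout is hypothetical.
    voice_features: dict = field(default_factory=dict)
    behavior_features: dict = field(default_factory=dict)


class RecognitionUnit:
    def recognize(self, image, audio, sensor_values) -> RecognitionResult:
        # Placeholder: a real unit would run image processing and
        # speech recognition on the raw inputs.
        return RecognitionResult(
            voice_features={"volume": audio.get("volume", 0.0)},
            behavior_features={"smiling": image.get("smiling", False)},
        )


class LearningUnit:
    def __init__(self):
        self.history = []  # stands in for the stored learned model

    def learn(self, result: RecognitionResult) -> None:
        self.history.append(result)


class MotionGenerationUnit:
    def generate(self, learning: LearningUnit, result: RecognitionResult) -> dict:
        # Placeholder rule: mirror a smile, otherwise stay neutral.
        smiling = result.behavior_features.get("smiling")
        return {"image": "smile" if smiling else "neutral",
                "speech": "", "drive": None}


recognition = RecognitionUnit()
learning = LearningUnit()
motion = MotionGenerationUnit()

result = recognition.recognize({"smiling": True}, {"volume": 0.7}, {})
learning.learn(result)
action = motion.generate(learning, result)
```

The returned dictionary has one entry per output path (image generation, audio generation, drive), which mirrors how the motion generation unit 110 is divided into sub-units.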

<Example of external shape of communication robot 1>
Next, an example of the external shape of the communication robot 1 will be described.
FIG. 3 is a diagram showing an example of the external shape of the communication robot 1 according to the present embodiment. In the example shown in the front view g101 and side view g102 of FIG. 3, the communication robot 1 includes three display units 111 (111a, 111b, and 111c). In the example of FIG. 3, the imaging unit 102a is attached above the display unit 111a, and the imaging unit 102b is attached above the display unit 111b. The display units 111a and 111b correspond to human eyes and also present image information. The speaker 112 is attached to the housing 120 near the display unit 111c, which displays an image corresponding to a human mouth. The sound collection unit 103 is attached to the housing 120.

The communication robot 1 also includes a boom 121. The boom 121 is movably attached to the housing 120 via a movable part 131. A horizontal bar 122 is rotatably attached to the boom 121 via a movable part 132.
The display unit 111a is rotatably attached to the horizontal bar 122 via a movable part 133, and the display unit 111b is rotatably attached to the horizontal bar 122 via a movable part 134.
The external shape of the communication robot 1 shown in FIG. 3 is only an example, and the robot is not limited to this shape.

<Data stored in the first database>
Next, examples of the data stored in the first database will be described.
The language model database stores language models. A language model is a probabilistic model that assigns to an arbitrary character string the probability that it is, for example, a Japanese sentence. The language model is, for example, an N-gram model, a hidden Markov model, or a maximum entropy model.
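
To illustrate what it means for a language model to assign a probability to a string, the following is a minimal bigram (N = 2) model. The toy corpus, the add-one smoothing, and the function names are assumptions made for the example; the specification does not say how its language models are trained or smoothed.

```python
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences,
    with <s> and </s> marking sentence boundaries."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded[:-1])
        bigrams.update(zip(padded[:-1], padded[1:]))
    return unigrams, bigrams


def sentence_probability(tokens, unigrams, bigrams, vocab_size):
    """Probability of a token sequence under an add-one-smoothed bigram model."""
    padded = ["<s>"] + tokens + ["</s>"]
    prob = 1.0
    for prev, cur in zip(padded[:-1], padded[1:]):
        prob *= (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
    return prob


# Toy corpus (hypothetical data).
corpus = [["good", "morning"], ["good", "evening"], ["good", "morning"]]
unigrams, bigrams = train_bigram(corpus)
vocab = {tok for sent in corpus for tok in sent} | {"</s>"}

p_seen = sentence_probability(["good", "morning"], unigrams, bigrams, len(vocab))
p_odd = sentence_probability(["morning", "good"], unigrams, bigrams, len(vocab))
# p_seen is larger than p_odd: the word order seen in training is more probable.
```

In speech recognition, scores of this kind are used to prefer word sequences that are plausible in the language over acoustically similar but unlikely ones.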

The acoustic model database stores sound source models. A sound source model is a model used to identify the sound source of a collected acoustic signal.

The acoustic features are feature quantities calculated after the collected acoustic signal has been converted into a frequency-domain signal by a fast Fourier transform (FFT). As acoustic features, for example, a static mel-scale log spectrum (MSLS), a delta MSLS, and one delta power are calculated at predetermined intervals (for example, every 10 ms). MSLS uses spectral features as the features for acoustic recognition and is obtained by applying an inverse discrete cosine transform to the MFCCs (mel-frequency cepstral coefficients).
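
The relationship stated above, that MSLS is obtained by applying an inverse discrete cosine transform to MFCCs, can be checked with a small pure-Python sketch. It assumes an orthonormal DCT-II as the forward transform and uses a plain frame-to-frame difference for the delta feature. Both choices, and the sample values, are assumptions for illustration, since the specification does not fix the DCT normalization or the delta window.

```python
import math


def dct2(x):
    """Orthonormal DCT-II: log mel-band energies -> cepstral coefficients."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def idct2(c):
    """Inverse of dct2 (orthonormal DCT-III): cepstrum -> log spectrum."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(c[k] * math.sqrt(2.0 / n)
                 * math.cos(math.pi * (i + 0.5) * k / n)
                 for k in range(1, n))
        out.append(s)
    return out


def delta(frames):
    """First-order difference between consecutive feature frames."""
    return [[b - a for a, b in zip(prev, cur)]
            for prev, cur in zip(frames, frames[1:])]


log_mel_frame = [1.0, 2.0, 0.5, -1.0]   # made-up log mel-band energies
mfcc = dct2(log_mel_frame)              # cepstral coefficients
msls = idct2(mfcc)                      # recovers the mel-scale log spectrum
```

Because the inverse transform returns the representation to the mel-frequency domain, each MSLS coefficient corresponds to a frequency band, whereas each MFCC mixes all bands.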

The dialogue corpus database stores dialogue corpora. A dialogue corpus is a corpus used when the communication robot 1 and the user have a dialogue, for example a scenario corresponding to the content of the dialogue.

The comparison image database stores, for example, images used in pattern matching. These images include, for example, images of the user, the user's family, the user's pets, and the user's friends and acquaintances.

An image feature is, for example, a feature extracted from an image of a person or an object by well-known image processing.
Note that the above is only an example, and the first database 107 may store other data.

<Flow of cognition, learning, and social abilities>
Next, the flow of cognition and learning performed by the social ability generation device 100 of this embodiment is described. FIG. 4 is a diagram showing the flow of cognition, learning, and social ability in the social ability generation device 100 of this embodiment.

The recognition result 201 is an example of a result recognized by the recognition unit 105. The recognition result 201 is, for example, an interpersonal relationship or an interpersonal interaction.

Multimodal learning and understanding 211 is an example of the learning performed by the learning unit 108. The learning method 212 is machine learning or the like. The learning targets 213 are social components, social models, psychology, the humanities, and the like.

Social abilities 221 are social skills, such as empathy, individualization, adaptability, and emotional affordance.

<Data to be recognized>
Next, examples of the data recognized by the recognition unit 105 are described.
FIG. 5 is a diagram showing an example of data recognized by the recognition unit 105 according to this embodiment. In this embodiment, personal data 301 and interpersonal relationship data 351 are recognized, as shown in FIG. 5.

Personal data describes behavior that occurs within a single person. It consists of data acquired by the photographing unit 102 and the sound collection unit 103, and of data obtained by applying voice recognition processing, image recognition processing, and the like to the acquired data. Personal data includes, for example, voice data, semantic data resulting from voice processing, voice volume, voice intonation, spoken words, facial expression data, gesture data, head posture data, face direction data, gaze data, co-occurrence expression data, and physiological information (body temperature, heart rate, pulse rate, etc.). Which data to use may be selected, for example, by the designer of the communication robot 1. In this case, for example, the designer may determine which features of the personal data are important in communication by observing actual communication or a demonstration between two people. The recognition unit 105 also recognizes the user's emotions, as personal data, based on information extracted from the acquired utterances and images. In this case, the recognition unit 105 bases the recognition on, for example, the volume and intonation of the voice, the duration of the utterance, and facial expressions. The communication robot 1 of this embodiment is then controlled so as to act in a way that keeps the user's emotions positive and maintains a good relationship with the user.

Here, an example of a method for recognizing the user's social background is described.
The recognition unit 105 estimates the user's nationality, place of birth, and the like based on the acquired utterances and images and the data stored in the first database 107. The recognition unit 105 extracts the user's daily schedule, such as wake-up time, time of leaving home, time of returning home, and bedtime, based on the acquired utterances and images and the data stored in the first database 107. The recognition unit 105 estimates the user's gender, age, occupation, hobbies, personal history, preferences, family structure, religion, degree of attachment to the communication robot 1, and the like based on the acquired utterances, images, daily schedule, and the data stored in the first database 107. Because the social background may change, the communication robot 1 keeps updating the information on the user's social background based on conversations, images, and the data stored in the first database 107. To enable emotional sharing, the social background and the degree of attachment to the communication robot 1 are not limited to items that can be entered directly, such as age, gender, and personal history; they are also recognized from, for example, emotional ups and downs by time of day and the volume and intonation of the voice in response to particular topics. In this way, the recognition unit 105 learns even things that users themselves are not aware of, based on daily conversations and facial expressions during those conversations.

Interpersonal relationship data is data on the relationship between the user and other people. Using interpersonal relationship data makes it possible to use social data. Interpersonal relationship data includes, for example, the distance between people, whether the gazes of the people in a dialogue meet, voice intonation, and voice volume. As described later, the distance between people depends on their relationship. For example, the distance is L1 for a married couple or friends, and the distance between business associates is L2, which is larger than L1.

Note that, for example, the designer of the communication robot 1 may determine which features of the interpersonal data are important in communication by observing actual communication or a demonstration between two people. Such personal data, interpersonal relationship data, and information on the user's social background are stored in the first database 107 or the storage unit 106.

When there are multiple users, for example the user and the user's family, the recognition unit 105 collects and learns personal data for each user and estimates the social background of each person. Such social background may also be acquired, for example, via a network and the receiving unit 101; in that case, the user may enter their social background or select items using, for example, a smartphone.

Here, an example of a method for recognizing interpersonal relationship data is described.
The recognition unit 105 estimates the distance (interval) between the people who are communicating, based on the acquired utterances and images and the data stored in the first database 107. The recognition unit 105 detects whether the gazes of the people who are communicating meet, based on the acquired utterances and images and the data stored in the first database 107. The recognition unit 105 estimates relationships such as friends, work colleagues, relatives, and parent and child from the acquired utterances and the data stored in the first database 107, based on the content of the utterances, the volume and intonation of the voice, received and sent e-mails, and the senders and recipients of those e-mails.

In its initial state of use, the recognition unit 105 may select one combination, for example at random, from among several combinations of initial values for social background and personal data stored in the first database 107, and start communication with it. If the behavior generated from the randomly selected combination makes it difficult to continue communicating with the user, the recognition unit 105 may select a different combination.
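
The initial selection and reselection described above can be sketched as follows. This is a minimal illustration only: the profile names, the engagement score, and the threshold are hypothetical stand-ins for whatever signal indicates that communication is hard to continue.

```python
import random

# Hypothetical combinations of initial values for social background and personal data.
PROFILES = ["profile_a", "profile_b", "profile_c"]

def pick_profile(rng, exclude=None):
    """Choose one combination at random, optionally excluding the current one."""
    candidates = [p for p in PROFILES if p != exclude]
    return rng.choice(candidates)

def next_profile(current, engagement, rng, threshold=0.3):
    """Keep the current combination while engagement holds up; otherwise reselect."""
    if engagement >= threshold:
        return current
    return pick_profile(rng, exclude=current)

rng = random.Random(0)                 # seeded only to make the sketch repeatable
current = pick_profile(rng)
kept = next_profile(current, engagement=0.8, rng=rng)      # communication continues
switched = next_profile(current, engagement=0.1, rng=rng)  # communication stalls
```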

<Learning procedure>
In this embodiment, the learning unit 108 performs learning using the personal data 301 and interpersonal relationship data 351 recognized by the recognition unit 105 and the data stored in the second database 109.

Here, social structure and social norms are described. In a space where people take part in social interaction, interpersonal relationships differ depending on, for example, the distance between people. For example, a distance of 0 to 50 cm from another person corresponds to an intimate relationship, and a distance of 50 cm to 1 m corresponds to a personal relationship. A distance of 1 to 4 m corresponds to a social relationship, and a distance of 4 m or more corresponds to a public relationship. During learning, whether a gesture or utterance conforms to such social norms is used as a reward (an implicit reward).
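
The distance ranges above can be expressed as a small classifier, and conformity to the expected range can be turned into an implicit reward. The following Python sketch is illustrative only: the function names, the handling of the exact boundary values, and the reward values of +1 and -1 are assumptions.

```python
def classify_distance(distance_m):
    """Map a distance in meters to one of the four relationship zones."""
    if distance_m < 0:
        raise ValueError("distance must be non-negative")
    if distance_m < 0.5:
        return "intimate"
    if distance_m < 1.0:
        return "personal"
    if distance_m < 4.0:
        return "social"
    return "public"

def implicit_reward(distance_m, relationship):
    """Return +1.0 if the observed distance matches the zone expected for the
    relationship, and -1.0 otherwise (reward values are assumed)."""
    return 1.0 if classify_distance(distance_m) == relationship else -1.0

zones = [classify_distance(d) for d in (0.3, 0.8, 2.5, 6.0)]
```

For instance, a robot that keeps 2.5 m from a person it has a social relationship with would receive a positive reward, whereas approaching the same person to 0.3 m would be penalized.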

The interpersonal relationship may also be tailored to the environment of use and to the user by setting the reward features at learning time. Specifically, multiple levels of familiarity may be provided: for example, a rule that the robot seldom speaks to people who dislike robots, and a rule that it actively speaks to people who like robots. Then, in the real environment, the recognition unit 105 may recognize which type the user is based on the results of processing the user's utterances and images, and the learning unit 108 may select the corresponding rule.

A human trainer may also evaluate the behavior of the communication robot 1 and provide a reward (an implicit reward) according to the social structures and norms that the trainer knows.

<Data stored in the second database>
Next, examples of the data stored in the second database are described.
Social components are, for example, age, gender, occupation, and relationships among multiple people (parent and child, married couple, lovers, friends, acquaintances, work colleagues, neighbors, teacher and student, etc.).

Social norms are rules and manners for individuals and among multiple people, and are associated with utterances, gestures, and the like that depend on age, gender, occupation, and the relationships among multiple people.

Data on psychology is, for example, data on findings obtained through past experiments and verification (for example, the attachment relationship between mother and infant, complexes such as the Oedipus complex, conditioned reflexes, and fetishism).

Data on the humanities is, for example, data on religious rules, customs, national character, regional character, and acts, behaviors, and utterances characteristic of a country or region. One example is that Japanese people may express agreement by nodding rather than in words. Data on the humanities also includes, for example, what is regarded as important and what is given priority in each country or region.

FIG. 6 is a diagram showing an example of a system in a comparative example that performs deep reinforcement learning using raw data.
In the comparative example, when the raw data 902 of the captured image 901 and the collected acoustic signal 901 are used for learning, deep reinforcement learning 903 is required. The system of this comparative example is difficult to realize, because deep reinforcement learning requires a sufficient amount of training data, which is difficult to collect. It is difficult to collect because the required features appear in the raw data only a limited number of times.

For this reason, in this embodiment the raw data (audio signals and images) are not used directly for learning. Instead, features are detected from the raw data and used for learning, so that reinforcement learning suffices in place of deep reinforcement learning.

FIG. 7 is a diagram showing an example of the agent creation method used by the behavior generation unit 110 according to this embodiment.
The area indicated by reference numeral 300 shows the flow from input, through creation of the agent, to output (the agent).
The image captured by the photographing unit 102 and the information 310 collected by the sound collection unit 103 are information about people (the user, people related to the user, and others) and about the environment around them. The raw data 302 acquired by the photographing unit 102 and the sound collection unit 103 is input to the recognition unit 105.

The recognition unit 105 extracts and recognizes multiple pieces of information from the input raw data 302 (voice volume, voice intonation, utterance content, spoken words, the user's gaze, the user's head posture, the user's face direction, the user's biological information, the distance between people, whether people's gazes meet, etc.). Using the extracted and recognized information, the recognition unit 105 performs multimodal understanding, for example with a neural network.
The recognition unit 105 identifies individuals based on, for example, at least one of the audio signal and the image, and assigns identification information (ID) to each identified individual. The recognition unit 105 recognizes the actions of each identified person based on at least one of the audio signal and the image. The recognition unit 105 recognizes the gaze of an identified person by, for example, applying well-known image processing and tracking processing to the image. The recognition unit 105 recognizes speech by, for example, applying speech recognition processing (sound source identification, sound source localization, sound source separation, speech section detection, noise suppression, etc.) to the audio signal. The recognition unit 105 recognizes the head posture of an identified person by, for example, applying well-known image processing to the image. When, for example, two people appear in a captured image, the recognition unit 105 recognizes their interpersonal relationship based on the content of their utterances, the distance between them in the image, and the like. The recognition unit 105 recognizes (estimates) the social distance between the communication robot 1 and the user according to, for example, the results of processing the captured image and the collected audio signal.

The learning unit 108 performs reinforcement learning 304 rather than deep learning. In reinforcement learning, learning is performed so as to select the most relevant features (including social structures and social norms). In this case, the multiple pieces of information used in multimodal understanding are used as input features. The input to the learning unit 108 is, for example, the raw data itself, a name ID (identification information), facial affect, recognized gestures, or keywords from speech. The output of the learning unit 108 is a behavior of the communication robot. The output behavior may be anything that one wishes to define for the purpose at hand, such as a voice response, a robot routine, or the angle through which the robot should turn. In multimodal understanding, a neural network or the like may be used for detection. In that case, human activity may be detected using different modalities of the body. Which features to use may be selected in advance, for example by the designer of the communication robot 1. Furthermore, in this embodiment, using implicit and explicit rewards during learning makes it possible to incorporate social models and social constructs. The result of the reinforcement learning is the output, namely the agent 305. In this way, this embodiment creates the agent used by the behavior generation unit 110.
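
As a non-limiting illustration of reinforcement learning over extracted features rather than raw data, the following Python sketch uses tabular Q-learning. The embodiment does not name a specific algorithm, so Q-learning is a swapped-in example; the state features, the two actions, the simulated reward, and all hyperparameters are assumptions.

```python
import random
from collections import defaultdict

ACTIONS = ["speak", "stay_quiet"]  # hypothetical robot behaviors

class FeatureQLearner:
    """Tabular Q-learning whose states are tuples of discrete extracted features."""

    def __init__(self, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
        self.q = defaultdict(float)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = random.Random(seed)

    def best_action(self, state):
        return max(ACTIONS, key=lambda a: self.q[(state, a)])

    def act(self, state):
        # Epsilon-greedy: explore occasionally, otherwise take the best known action.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(ACTIONS)
        return self.best_action(state)

    def update(self, state, action, reward, next_state):
        target = reward + self.gamma * max(self.q[(next_state, a)] for a in ACTIONS)
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])

def simulated_reward(state, action):
    """Hypothetical reward: users who like robots reward speech, others reward quiet."""
    likes_robot, _zone = state
    return 1.0 if (action == "speak") == likes_robot else -1.0

learner = FeatureQLearner()
states = [(True, "personal"), (False, "social")]  # (likes_robot, distance zone)
for step in range(200):
    state = states[step % 2]
    action = learner.act(state)
    learner.update(state, action, simulated_reward(state, action), states[(step + 1) % 2])
```

Because each state is a short tuple of recognized features, the table stays small and can be learned from far fewer interactions than a network trained directly on images and audio.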

The area indicated by reference numeral 350 shows how rewards are used.
The implicit reward 362 is used to learn implicit reactions. In this case, the raw data 302 contains the user's reactions, and the multimodal understanding 303 described above is applied to this raw data 302. The learning unit 108 generates an implicit reaction system 372 using the implicit reward 362 and the social models and the like stored in the second database 109. The implicit reward may be obtained through reinforcement learning or may be given by a human. The implicit reaction system may be a model acquired through learning.

For learning explicit reactions, for example, a human trainer evaluates the behavior of the communication robot 1 and gives a reward 361 according to the social structures and social norms that the trainer knows. The agent adopts, for a given input, the action that maximizes the reward. As a result, the agent adopts behavior (utterances and gestures) that maximizes the user's positive emotions.

The learning unit 108 uses this explicit reward 361 to generate an explicit reaction system 371. The explicit reaction system may be a model acquired through learning. The explicit reward may be given by the user after evaluating the behavior of the communication robot 1, or the communication robot 1 may estimate the reward from the user's utterances and behavior (gestures, facial expressions, etc.), for example based on whether it was able to take the action the user wanted.
During operation, the learning unit 108 outputs the agent 305 using these learned models.

In this embodiment, the explicit reward, which is the user's reaction, takes priority over the implicit reward. This is because the user's reaction is more reliable in communication.
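
One possible reading of this priority is that an explicit reward, when one is available, overrides the implicit reward. The following Python sketch illustrates that reading only; the function name and the use of None to mean "no explicit reward was given" are assumptions.

```python
def combined_reward(explicit, implicit):
    """Use the explicit reward when present; otherwise fall back to the implicit one."""
    return explicit if explicit is not None else implicit

with_feedback = combined_reward(explicit=1.0, implicit=-1.0)      # user reacted
without_feedback = combined_reward(explicit=None, implicit=-0.5)  # no user reaction
```

A weighted sum that gives the explicit reward a larger weight would be another way to realize the same priority.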

<Processing procedure example>
Next, an example of the processing procedure is described. FIG. 8 is a flowchart showing an example of the procedure of the social ability generation processing according to this embodiment.

(Step S11) The recognition unit 105 acquires the image captured by the photographing unit 102 and the acoustic signal collected by the sound collection unit 103.

(Step S12) The recognition unit 105 recognizes, detects, or extracts feature information about the voice from the acoustic signal, and recognizes, detects, or extracts feature information about the person from the image. The feature information about the voice is at least one of the audio signal, voice volume information, voice intonation information, and the meaning of the utterance. The feature information about the person is at least one of the person's facial expression information, information on gestures made by the person, head posture information, face direction information, and gaze information.

(Step S13) The recognition unit 105 recognizes the interaction that occurs between the communication robot 1 and a person, or among multiple people, based on the acquired information and the data stored in the first database 107.

(Step S14) The learning unit 108 learns human emotional interaction using the recognition result output by the recognition unit 105 and the data stored in the second database 109.

(Step S15) The behavior generation unit 110 generates behavior (utterances, gestures, and images) directed at the user based on the learned result and the acquired information.
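
The data flow through steps S11 to S15 can be sketched as a chain of functions. This is a minimal illustration of the flow only: every function name, the dictionary-based inputs, and the contents of the two lookup tables are hypothetical, and the table lookups stand in for the actual recognition and learning.

```python
# Hypothetical stand-ins for the first and second databases.
FIRST_DB = {"smile": "friendly_approach", "frown": "displeasure"}
SECOND_DB = {"friendly_approach": "greet warmly", "displeasure": "apologize"}

def acquire(camera_frame, audio_frame):                # Step S11
    return {"image": camera_frame, "audio": audio_frame}

def extract_features(raw):                             # Step S12
    # The frames are assumed to be pre-annotated; real extraction is out of scope.
    return {"expression": raw["image"]["expression"], "volume": raw["audio"]["volume"]}

def recognize_interaction(features, first_db):         # Step S13
    return first_db.get(features["expression"], "unknown")

def learn(interaction, second_db, policy):             # Step S14
    # Stand-in for learning: record the response associated with the interaction.
    policy[interaction] = second_db.get(interaction, "wait and observe")
    return policy

def generate_behavior(policy, interaction, features):  # Step S15
    gesture = "lean back" if features["volume"] > 0.8 else "nod"
    return {"utterance": policy[interaction], "gesture": gesture}

raw = acquire({"expression": "smile"}, {"volume": 0.4})
features = extract_features(raw)
interaction = recognize_interaction(features, FIRST_DB)
policy = learn(interaction, SECOND_DB, {})
behavior = generate_behavior(policy, interaction, features)
```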

<Communication between the communication robot 1 and people>
Next, examples of communication between the communication robot 1 and a person are described.
The timing at which the communication robot 1 speaks is initially set to, for example, when the user returns home or wakes up. The communication robot 1 may then learn when to start speaking through repeated communication.
Alternatively, the communication robot 1 may start speaking in response to the user's utterances or actions. In this case, rather than starting a conversation in response to a command as smart speakers and the like do, the communication robot 1 starts a conversation upon recognizing, for example, that the user is speaking to it, the user's facial expression or face direction (for example, that the face is turned toward the communication robot 1), or that the room light has been turned on. The communication robot 1 also ends the conversation upon recognizing, for example, the content of the user's utterance or the user's facial expression. The communication robot 1 may switch the timing for starting and ending speech according to the user's social background (including age and gender), attachment to the communication robot 1, and the like.

Examples of communication between the communication robot 1 and the user are described with reference to FIGS. 9 to 12.
FIG. 9 is a diagram showing an example of communication between the communication robot 1 and the user according to this embodiment. The example in FIG. 9 shows the communication robot 1 speaking to the user when the user returns home and a conversation taking place. During communication, the communication robot 1 acts (through utterances, gestures, and image presentation) so as to maintain a good relationship with the person. The communication robot 1 controls its gestures and motions by driving the boom, the horizontal bar, and the like with the drive unit 1103 and the actuator 113.

FIGS. 10 and 11 are diagrams showing examples of images displayed on the display unit during communication between the communication robot 1 and the user according to this embodiment.
In the example of FIG. 10, heart marks expressing affection are displayed on the display units 111a and 111b, and an image corresponding to a smiling mouth is displayed on the display unit 111c.
In the example of FIG. 11, an illustration evoking the location of a scheduled event and the weather forecast for that day are presented, and the weather forecast is read aloud through the speaker 112. Although FIG. 11 shows an illustration evoking the location, the communication robot 1 may instead acquire an image (photograph or video) of the scheduled location via the receiving unit 101 and present the acquired image.
In this way, the display unit 111 displays images through which the robot behaves so as to maximize the person's positive emotions and thereby maintains a good relationship with the person. Thus, according to this embodiment, the robot can behave in a way that maximizes the person's positive emotions and can maintain a good relationship with the person.

FIG. 12 is a diagram showing an example of communication by the communication robot 1 when communicating with a friend of the user according to this embodiment. In the example of FIG. 12, the communication robot 1 sends a message to the user's friend through dialogue with the user. The terminal 200 carried by the user's friend receives this message and displays it on its display unit (g301). The terminal 200 then transmits a reply g302 addressed to the user to the communication robot 1 in accordance with the friend's operation. Based on the information received from the terminal 200, the communication robot 1 presents the reply from the user's friend through its actions (utterances, gestures, and image presentation).

上述した例では、コミュニケーションロボット１は、音声と動作（仕草）と画像を用いて利用者とのコミュニケーションを行う例を説明したが、これに限らない。利用者とのコミュニケーションを行うために、コミュニケーションロボット１が用いる出力手段は、２つ以上用いることが好ましく、音声と動作（仕草）と画像のうち２つ以上であればよい。または、出力手段は、例えばテキストと動作、テキストと音声等であってもよい。また、コミュニケーションロボット１に対して、利用者に飽きさせないため、出力手段は複数であることが好ましい。 In the example described above, the communication robot 1 communicates with the user using voice, motion (gesture), and images, but the present invention is not limited to this. In order to communicate with the user, it is preferable that the communication robot 1 uses two or more output means, and any two or more of voice, movement (gesture), and image may be used. Alternatively, the output means may be, for example, text and motion, text and voice, or the like. Further, in order to prevent the user from getting bored with the communication robot 1, it is preferable that there be a plurality of output means.

Furthermore, input from the user to the communication robot 1 is not limited to the voice and images described above. It suffices that the user's behavior can be acquired, and other information may also be acquired. Such other information is, for example, contact information indicating that the user has touched or hit the communication robot 1.

As described above, in the present embodiment, the communication robot 1 recognizes interactions that occur between the communication robot 1 and a person, or among a plurality of people. From the recognized content, human emotional interactions are learned using machine learning together with psychology, social conventions, the humanities, and the like, and the robot's social abilities are generated from what has been learned. Furthermore, in the present embodiment, explicit rewards are used in learning in addition to implicit rewards.
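The use of two reward signals in learning can be illustrated with a minimal sketch. The embodiment does not state how the implicit and explicit rewards are combined, so the weighted sum, the tabular Q-learning update, the action labels, and all parameter values below are assumptions made purely for illustration and are not the method of the embodiment.

```python
from collections import defaultdict

# Hypothetical action labels, loosely following the output means named in the text.
ACTIONS = ["speak", "gesture", "show_image"]


def combined_reward(implicit, explicit, w_explicit=0.5):
    """Blend the two reward signals with a weighted sum (the weighting is an assumption)."""
    return (1.0 - w_explicit) * implicit + w_explicit * explicit


class SocialLearner:
    """Tabular Q-learning over discrete (state, action) pairs."""

    def __init__(self, alpha=0.1, gamma=0.9):
        self.alpha = alpha            # learning rate
        self.gamma = gamma            # discount factor
        self.q = defaultdict(float)   # (state, action) -> estimated value

    def update(self, state, action, implicit, explicit, next_state):
        """Apply one Q-learning step using the blended reward and return the new value."""
        reward = combined_reward(implicit, explicit)
        best_next = max(self.q[(next_state, a)] for a in ACTIONS)
        td_target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (td_target - self.q[(state, action)])
        return self.q[(state, action)]

    def best_action(self, state):
        """Return the action with the highest estimated value in the given state."""
        return max(ACTIONS, key=lambda a: self.q[(state, a)])
```

In this sketch the implicit reward would come from the person's observed reaction and the explicit reward from an evaluation of the robot's behavior. Both are passed in as plain numbers, because how they are computed is outside the scope of the example.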

Thus, according to the present embodiment, the robot's social skills can be generated based on emotional interactions with people. According to the present embodiment, it is possible to provide a home social robot or agent that fosters relationships with people. According to the present embodiment, empathetic communication and interaction between machines and humans can be produced. According to the present embodiment, the concept of a "machine" friend akin to a pet friend, or a robot friend, can be provided. According to the present embodiment, it is possible to provide a machine that is socially intelligent and able to navigate social scenarios. Accordingly, the present embodiment makes it possible to form an emotional connection between the robot and a person.
Furthermore, according to the present embodiment, features are extracted from the collected acoustic signal and from the captured image, and reinforcement learning is performed using the extracted features. Learning can therefore be carried out without the large amount of training data required by deep machine learning on raw data.
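As a rough illustration of learning from extracted features rather than raw data, the following sketch reduces an audio frame to a single volume feature and combines it with a facial-expression label into a small discrete state. The function names, the thresholds, and the assumption that a separate image-processing step has already produced the expression label are all hypothetical and do not come from the embodiment.

```python
import math


def rms_volume(samples):
    """Root-mean-square amplitude of an audio frame (samples assumed to lie in [-1, 1])."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def volume_bucket(rms, quiet=0.05, loud=0.3):
    """Discretize the volume into three levels (the thresholds are assumed values)."""
    if rms < quiet:
        return "quiet"
    if rms < loud:
        return "normal"
    return "loud"


def make_state(samples, expression_label):
    """Build a compact discrete state from an audio frame and an expression label."""
    return (volume_bucket(rms_volume(samples)), expression_label)
```

A state such as `("loud", "smile")` has far fewer possible values than a raw waveform plus a raw image, which is why a learner working on such features can, in principle, get by with fewer examples.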

<Modified example>
In the embodiment, the communication robot 1 has been described as an example of a device that performs communication, but the present embodiment is also applicable to other devices, such as an in-vehicle navigation device, a smartphone, or a tablet terminal. For example, when applied to a smartphone, a still image of the communication robot 1 such as that shown in FIG. 3 may be displayed on the smartphone's display unit, with communication performed mainly by voice. Alternatively, the gestures of the communication robot 1 may be displayed as animation on the smartphone's display unit.

FIG. 13 is a diagram showing an example in which the communication robot of the present embodiment is applied to a car navigation system 300 in a vehicle. Note that the car navigation system 300 may be a smartphone, a tablet terminal, or the like.
The car navigation system 300 displays an image of the communication robot on its display unit. In this case, the communication robot operates as an agent. The agent realizes the functions of the communication robot 1 (except for the drive unit, the actuator, and the like) using the imaging unit, sound collection unit, display unit, speaker, and the like provided in the car navigation system 300.

When applied to the car navigation system 300, the communication robot displayed on the display unit may be a still image or an animation. In this case, the agent responds at least by voice dialogue. Even in this case, during communication the agent acts (speech, gestures, image presentation) so as to keep the relationship with the person in a good state.

FIG. 14 is a diagram showing an example of connections with various devices in the home when the present embodiment is applied to car navigation. In FIG. 14 as well, the car navigation system 300 may be a smartphone, a tablet terminal, or the like. It is assumed that the car navigation system 300 includes a communication unit (a receiving unit and a transmitting unit) and that the devices in the home are connected via a network. In accordance with communication with the user, the agent applied to the car navigation system 300 issues, for example, an instruction 401 to open or close the shutter of the parking space, an operation start instruction 402 for the rice cooker, an operation start instruction or a setting instruction (temperature, etc.) 403 for the air conditioner, a lighting start instruction 404 for lights in a room or the like, and an operation start instruction 405 for an automatic lawn mower. Rather than simply issuing operation instructions, the agent may learn through communication with the user, from the user's utterances, what time the user plans to return home, the user's preferred temperature setting, and the user's preferred room brightness setting. Based also on what it has learned, the agent may then issue each instruction with the timing and settings best suited to having these tasks completed when the user returns home.
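The idea of timing each instruction so that the task is finished when the user arrives home can be sketched as a simple backward calculation from the expected arrival time. The task names and durations below are hypothetical placeholders. The embodiment says only that the agent may learn the arrival time and preferences through communication, and does not specify how the timing is computed.

```python
from datetime import datetime, timedelta

# Hypothetical lead times in minutes. Real values would be learned or configured.
TASK_DURATION_MIN = {
    "rice_cooker_start": 50,
    "air_conditioner_start": 20,
    "lights_on": 0,
}


def schedule_instructions(arrival, durations=TASK_DURATION_MIN):
    """Return (send_time, task) pairs, earliest first, so each task completes at arrival."""
    plan = [(arrival - timedelta(minutes=minutes), task)
            for task, minutes in durations.items()]
    return sorted(plan)
```

For an expected arrival at 19:00, this sketch would send the rice cooker instruction at 18:10, the air conditioner instruction at 18:40, and the lighting instruction at 19:00.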

A program for realizing all or part of the functions of the social ability generation device 100 according to the present invention may be recorded on a computer-readable recording medium, and all or part of the processing performed by the social ability generation device 100 may be carried out by causing a computer system to read and execute the program recorded on this recording medium. The "computer system" here includes an OS and hardware such as peripheral devices. The "computer system" also includes a WWW system equipped with a home page providing environment (or display environment). The "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into computer systems. The "computer-readable recording medium" further includes media that hold the program for a certain period of time, such as volatile memory (RAM) inside a computer system that serves as a server or client when the program is transmitted via a network such as the Internet or a communication line such as a telephone line.

The program may also be transmitted from a computer system that stores the program in a storage device or the like to another computer system via a transmission medium, or by a transmission wave in a transmission medium. Here, the "transmission medium" that transmits the program refers to a medium having the function of transmitting information, such as a network (communication network) like the Internet or a communication line (communication wire) like a telephone line. The program may also be one that realizes only part of the functions described above. Furthermore, it may be a so-called difference file (difference program), which realizes the functions described above in combination with a program already recorded in the computer system.

Although modes for carrying out the present invention have been described above using embodiments, the present invention is in no way limited to these embodiments, and various modifications and substitutions can be made without departing from the gist of the present invention.

1... Communication robot, 101... Receiving unit, 102... Imaging unit, 103... Sound collection unit, 104... Sensor, 100... Social ability generation device, 106... Storage unit, 107... First database, 109... Second database, 111... Display unit, 112... Speaker, 113... Actuator, 114... Transmitting unit, 105... Recognition unit, 108... Learning unit, 110... Motion generation unit, 1101... Image generation unit, 1102... Audio generation unit, 1103... Drive unit, 1104... Transmission information generation unit

Claims

A social ability generation device comprising:
a recognition means that acquires person information about a person, extracts feature information about the person from the acquired person information, recognizes an interaction that occurs between the person and a communication device that performs communication, and recognizes an interaction that occurs between persons;
a learning means that multimodally learns emotional interaction of the person using the extracted feature information about the person; and
a behavior generation means that generates a behavior based on the learned emotional interaction information of the person,
wherein the learning means
performs learning using an implicit reward and an explicit reward,
generates an implicit reaction system using the implicit reward and a social model,
evaluates the behavior of the communication device and generates an explicit reaction system using the explicit reward in accordance with social composition and social norms, and
in use, generates, using the implicit reaction system and the explicit reaction system, an output that incorporates at least one of a social model and a social construct,
the implicit reward is a reward learned multimodally using the feature information about the person, and
the explicit reward is a reward based on a result of evaluating the behavior of the communication device toward the person generated by the behavior generation means.

The social ability generation device according to claim 1, further comprising:
a sound collection unit that collects an acoustic signal; and
an imaging unit that captures an image including a user,
wherein the recognition means performs speech recognition processing on the collected acoustic signal to extract feature information about the voice, and performs image processing on the captured image to extract feature information about human behavior included in the image,
the feature information about the person includes the feature information about the voice and the feature information about the human behavior,
the feature information about the voice is at least one of a voice signal, voice volume information, voice intonation information, and the meaning of an utterance, and
the feature information about the human behavior is at least one of facial expression information of a person, information on a gesture made by a person, head posture information of a person, face orientation information of a person, gaze information of a person, and the distance between persons.

The social ability generation device according to claim 1 or claim 2,
wherein the learning means
performs learning using social norms, social components, psychological knowledge, and humanities knowledge.

A social ability generation method in which:
a recognition means acquires person information about a person, extracts feature information about the person from the acquired person information, recognizes an interaction that occurs between the person and a communication device that performs communication, and recognizes an interaction that occurs between persons;
a learning means multimodally learns emotional interaction of the person using the extracted feature information about the person;
the learning means
performs learning using an implicit reward and an explicit reward,
generates an implicit reaction system using the implicit reward and a social model,
evaluates the behavior of the communication device and generates an explicit reaction system using the explicit reward in accordance with social composition and social norms, and
in use, generates, using the implicit reaction system and the explicit reaction system, an output that incorporates at least one of a social model and a social construct;
the implicit reward is a reward learned multimodally using the feature information about the person;
the explicit reward is a reward based on a result of evaluating the behavior of the communication device toward the person generated by a behavior generation means; and
the behavior generation means generates the behavior based on the learned emotional interaction information of the person.

A communication robot comprising:
a recognition means that acquires person information about a person, extracts feature information about the person from the acquired person information, recognizes an interaction that occurs between the person and a communication device that performs communication, and recognizes an interaction that occurs between persons;
a learning means that multimodally learns emotional interaction of the person using the extracted feature information about the person; and
a behavior generation means that generates a behavior based on the learned emotional interaction information of the person,
wherein the learning means
performs learning using an implicit reward and an explicit reward,
generates an implicit reaction system using the implicit reward and a social model,
evaluates the behavior of the communication device and generates an explicit reaction system using the explicit reward in accordance with social composition and social norms, and
in use, generates, using the implicit reaction system and the explicit reaction system, an output that incorporates at least one of a social model and a social construct,
the implicit reward is a reward learned multimodally using the feature information about the person, and
the explicit reward is a reward based on a result of evaluating the behavior of the communication device toward the person generated by the behavior generation means.

The communication robot according to claim 5, further comprising a display unit,
wherein the behavior generation means generates an image that maintains a good relationship with the person by leading the person to behave in a way that maximizes positive emotions, and causes the display unit to display the generated image.