JP2022006610A

JP2022006610A - Social capacity generation device, social capacity generation method, and communication robot

Info

Publication number: JP2022006610A
Application number: JP2020108946A
Authority: JP
Inventors: ランディゴメス; Gomez Randy
Original assignee: Honda Motor Co Ltd
Current assignee: Honda Motor Co Ltd
Priority date: 2020-06-24
Filing date: 2020-06-24
Publication date: 2022-01-13
Anticipated expiration: 2040-06-24
Also published as: JP7425681B2

Abstract

To provide a social capacity generation device, a social capacity generation method, and a communication robot capable of forming an emotional connection between a robot and a human. SOLUTION: A social capacity generation device includes: recognition means for acquiring human information on a human, extracting feature information on the human from the acquired human information, recognizing an approach that occurs between a human and a communication device that performs communication, and recognizing an approach that occurs between humans; learning means for learning human emotional interactions multimodally using the extracted feature information on the human; and motion generation means for generating an action on the basis of the learned information on human emotional interactions. SELECTED DRAWING: Figure 1

Description

The present invention relates to a social ability generation device, a social ability generation method, and a communication robot.

Smart speakers and communication robots are currently under development. Such systems focus on functions performed in response to instructions, such as turning lights on or off, accessing a calendar, reading e-mail, and scheduling appointments. In such systems, instructions can be entered only by, for example, selection on a touch panel or predefined voice commands, which makes it difficult to build a relationship with a person.

A system that can form a relationship with people is therefore desired. For example, Patent Document 1 proposes a system that engages a person in dialogue and operation with a companion device. In the technology described in Patent Document 1, the companion device detects the user's utterances and actions and expresses itself through movement, graphics, sound, light, and fragrance to provide a companionable presence.

Patent Document 1: Published Japanese Translation of PCT Application No. 2019-521449

However, with the technology described in Patent Document 1, it is difficult to form an emotional connection between a robot and a human.

The present invention has been made in view of the above problem, and its object is to provide a social ability generation device, a social ability generation method, and a communication robot capable of forming an emotional connection between a robot and a human.

(1) To achieve the above object, a social ability generation device according to one aspect of the present invention includes: a cognitive means that acquires human information about a person, extracts feature information about the person from the acquired human information, recognizes an action that occurs between a person and a communication device that performs communication, and recognizes an action that occurs between people; a learning means that learns human emotional interaction multimodally using the extracted feature information about the person; and a motion generation means that generates behavior based on the learned information on human emotional interaction.

(2) In the social ability generation device according to one aspect of the present invention, the learning means may perform learning using an implicit reward and an explicit reward, where the implicit reward is a reward learned multimodally using the feature information about the person, and the explicit reward is a reward based on the result of evaluating the behavior of the communication device toward the person generated by the motion generation means.

(3) The social ability generation device according to one aspect of the present invention may further include a sound collecting unit that collects an acoustic signal and a photographing unit that captures an image including a user. The cognitive means may perform speech recognition processing on the collected acoustic signal to extract feature information about speech, and perform image processing on the captured image to extract feature information about human behavior contained in the image. The feature information about the person includes the feature information about speech and the feature information about human behavior. The feature information about speech is at least one of a speech signal, voice loudness information, voice intonation information, and the meaning of an utterance. The feature information about human behavior is at least one of facial expression information, information on gestures made by the person, head posture information, face orientation information, gaze information, and the distance between people.

(4) In the social ability generation device according to one aspect of the present invention, the learning means may learn using social norms, social components, psychological knowledge, and knowledge from the humanities.

(5) To achieve the above object, in a social ability generation method according to one aspect of the present invention, a cognitive means acquires human information about a person, extracts feature information about the person from the acquired human information, recognizes an action that occurs between a person and a communication device that performs communication, and recognizes an action that occurs between people; a learning means learns human emotional interaction multimodally using the extracted feature information about the person; and a motion generation means generates behavior based on the learned information on human emotional interaction.

(6) To achieve the above object, a communication robot according to one aspect of the present invention includes: a cognitive means that acquires human information about a person, extracts feature information about the person from the acquired human information, recognizes an action that occurs between a person and a communication device that performs communication, and recognizes an action that occurs between people; a learning means that learns human emotional interaction multimodally using the extracted feature information about the person; and a motion generation means that generates behavior based on the learned information on human emotional interaction.

(7) The communication robot according to one aspect of the present invention may include a display unit, and the motion generation means may generate an image that keeps the relationship with the person in a good state by making the robot behave in a way that maximizes the person's positive emotions, and may display the generated image on the display unit.

According to (1) to (7), an emotional connection can be formed between a robot and a human.
According to (2), learning can be performed without a large amount of teaching data.
According to (3), a large amount of information based on human reactions can be acquired.
According to (4), the device can be socially intelligent and can navigate social scenarios.
According to (7), the robot can be made to behave in a way that maximizes a person's positive emotions, and a good relationship with the person can be maintained.
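As an illustration of aspect (2), the following minimal Python sketch shows one way an implicit reward (estimated from multimodal feature information) and an explicit reward (an evaluation of the robot's behavior) might be combined into a single learning signal. The weighted sum, the weight values, the feature names, and the simple value update are all assumptions made for illustration; the source does not specify how the two rewards are computed or combined.

```python
from dataclasses import dataclass


@dataclass
class Features:
    """Hypothetical multimodal features, each scaled to [0, 1]."""
    smile: float          # facial-expression score
    voice_warmth: float   # intonation-based score
    gaze_on_robot: float  # fraction of time the person looks at the robot


def implicit_reward(f: Features) -> float:
    """Assumed implicit reward: mean of the multimodal scores, mapped to [-1, 1]."""
    mean = (f.smile + f.voice_warmth + f.gaze_on_robot) / 3.0
    return 2.0 * mean - 1.0


def combined_reward(f: Features, explicit: float,
                    w_implicit: float = 0.5, w_explicit: float = 0.5) -> float:
    """Assumed combination: weighted sum of the implicit and explicit rewards."""
    return w_implicit * implicit_reward(f) + w_explicit * explicit


def update_value(values: dict, action: str, reward: float, lr: float = 0.1) -> None:
    """Move the stored value of `action` one step toward the observed reward."""
    old = values.get(action, 0.0)
    values[action] = old + lr * (reward - old)


values = {}
r = combined_reward(Features(smile=1.0, voice_warmth=0.5, gaze_on_robot=0.75),
                    explicit=1.0)
update_value(values, "nod_and_smile", r)
```

In this sketch the implicit term lets the value of an action change even when no explicit evaluation is given (the explicit term can simply be set to 0), which is one reading of why less teaching data would be needed.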

A diagram showing a communication example of the communication robot according to the embodiment.
A block diagram showing a configuration example of the communication robot according to the embodiment.
A diagram showing an example of the external appearance of the communication robot according to the embodiment.
A diagram showing the flow of cognition, learning, and social ability performed by the social ability generation device of the embodiment.
A diagram showing an example of data recognized by the cognitive unit according to the embodiment.
A diagram showing an example of a system that performs deep reinforcement learning using raw data in a comparative example.
A diagram showing an example of the agent creation method used by the motion generation unit according to the embodiment.
A flowchart showing an example of the procedure of the social ability generation processing according to the embodiment.
A diagram showing an example of communication between the communication robot and a user according to the embodiment.
A diagram showing an example of an image displayed on the display unit during communication between the communication robot and a user according to the embodiment.
A diagram showing an example of an image displayed on the display unit during communication between the communication robot and a user according to the embodiment.
A diagram showing an example of communication by the communication robot while the user communicates with a friend according to the embodiment.
A diagram showing an example in which the communication robot of the embodiment is applied to a car navigation system in a vehicle.
A diagram showing an example of connection with various devices in a home when the embodiment is applied to car navigation.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. In the drawings used in the following description, the scale of each member is changed as appropriate so that each member is large enough to be recognized.

<Overview>
FIG. 1 is a diagram showing a communication example of the communication robot 1 according to the present embodiment. As shown in FIG. 1, the communication robot 1 communicates with an individual or with a plurality of people 2. The communication consists mainly of dialogue g11 and gestures g12 (motion). Motion is expressed by images shown on the display unit in addition to actual physical motion. When an e-mail is sent to the user via an Internet line or the like, the communication robot 1 receives the e-mail and notifies the user that it has arrived and of its contents (g14). When, for example, the e-mail requires a reply, the communication robot 1 communicates with the user to ask whether advice is needed and makes a proposal g14. The communication robot 1 then sends the reply (g15). In addition, the communication robot 1 presents, for example, a weather forecast for the relevant place according to the date, time, and place of an appointment in the user's schedule (g19).

In the present embodiment, the social ability of the robot is generated so that an emotional connection can be formed between the robot and a human, and the robot communicates with the human according to, for example, the human's reactions and behavior. The human and the robot thereby communicate with empathy at the emotional level. The present embodiment realizes something like the communication between a person and a pet by also learning social norms and the like. Specifically, this is realized by learning, during communication, the user's social background, interactions between people, and so on.

<Configuration example of communication robot 1>
Next, a configuration example of the communication robot 1 will be described.
FIG. 2 is a block diagram showing a configuration example of the communication robot 1 according to the present embodiment. As shown in FIG. 2, the communication robot 1 includes a receiving unit 101, a photographing unit 102, a sound collecting unit 103, a sensor 104, a social ability generation device 100, a storage unit 106, a first database 107, a second database 109, a display unit 111, a speaker 112, an actuator 113, and a transmission unit 114.

The social ability generation device 100 includes a cognitive unit 105 (cognitive means), a learning unit 108 (learning means), and a motion generation unit 110 (motion generation means).
The motion generation unit 110 includes an image generation unit 1101, a voice generation unit 1102, a drive unit 1103, and a transmission information generation unit 1104.

<Functions and operations of communication robot 1>
Next, the functions and operations of each functional unit of the communication robot 1 will be described with reference to FIG. 1.

The receiving unit 101 acquires information (for example, e-mail, blog information, news, and weather forecasts) via a network, for example from the Internet, and outputs the acquired information to the cognitive unit 105 and the motion generation unit 110. Alternatively, when the first database 107 is on the cloud, for example, the receiving unit 101 acquires information from the first database 107 on the cloud and outputs the acquired information to the cognitive unit 105.

The photographing unit 102 is, for example, a CMOS (Complementary Metal Oxide Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor. The photographing unit 102 outputs captured images (human information, that is, information about a person; still images, continuous still images, or moving images) to the cognitive unit 105 and the motion generation unit 110. The communication robot 1 may include a plurality of photographing units 102. In this case, the photographing units 102 may be attached, for example, to the front and the rear of the housing of the communication robot 1.

The sound collecting unit 103 is, for example, a microphone array composed of a plurality of microphones. The sound collecting unit 103 outputs the acoustic signals (human information) collected by the microphones to the cognitive unit 105 and the motion generation unit 110. The sound collecting unit 103 may sample each of the acoustic signals collected by the microphones using the same sampling signal, convert them from analog signals to digital signals, and then output them to the cognitive unit 105.

The sensor 104 is, for example, a temperature sensor that detects the ambient temperature, an illuminance sensor that detects the ambient illuminance, a gyro sensor that detects the inclination of the housing of the communication robot 1, an acceleration sensor that detects the movement of the housing of the communication robot 1, or a barometric pressure sensor that detects atmospheric pressure. The sensor 104 outputs the detected values to the cognitive unit 105 and the motion generation unit 110.

The storage unit 106 stores, for example, the items to be recognized by the cognitive unit 105, various values used for recognition (threshold values and constants), and the algorithms used to perform recognition.

The first database 107 stores, for example, a language model database, an acoustic model database, a dialogue corpus database, and acoustic features used for speech recognition, as well as a comparison image database and image features used for image recognition. The individual data and features are described later. The first database 107 may be located on the cloud or may be connected via a network.

The second database 109 stores data used during learning that concerns relationships between people, such as social components, social norms, social customs, psychology, and the humanities. The second database 109 may be located on the cloud or may be connected via a network.

The social ability generation device 100 recognizes actions that occur between the communication robot 1 and a person, or actions that occur among a plurality of people, and learns human emotional interaction based on the recognized content and the data stored in the second database 109. The social ability generation device 100 then generates the social ability of the communication robot 1 from what it has learned. Social ability is, for example, the ability to take part in interactions between people, such as dialogue, behavior, understanding, and empathy.

The cognitive unit 105 recognizes actions that occur between the communication robot 1 and a person, or actions that occur among a plurality of people. The cognitive unit 105 acquires the image captured by the photographing unit 102, the acoustic signal collected by the sound collecting unit 103, and the values detected by the sensor 104. The cognitive unit 105 may also acquire the information received by the receiving unit 101. Based on the acquired information and the data stored in the first database 107, the cognitive unit 105 recognizes actions that occur between the communication robot 1 and a person, or actions that occur among a plurality of people. The recognition method is described later. The cognitive unit 105 outputs the recognition results (features related to sound and feature information related to human behavior) to the learning unit 108.
The cognitive unit 105 performs well-known image processing (for example, binarization, edge detection, clustering, and image feature extraction) on the image captured by the photographing unit 102. The cognitive unit 105 performs well-known speech recognition processing (sound source identification, sound source localization, noise suppression, speech section detection, sound source extraction, acoustic feature calculation, and the like) on the acquired acoustic signal. Based on the recognition results, the cognitive unit 105 extracts the speech signal (or acoustic signal) of a target person, animal, or object from the acquired acoustic signal and outputs the extracted speech signal (or acoustic signal) to the motion generation unit 110 as a recognition result. Likewise, based on the recognition results, the cognitive unit 105 extracts the image of a target person or object from the acquired image and outputs the extracted image to the motion generation unit 110 as a recognition result.
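Two of the image processing steps named above, binarization and edge detection, can be sketched in a few lines of Python. This is only a toy illustration on a tiny grayscale grid; the threshold value, the function names, and the simple neighbor-difference edge rule are assumptions and do not represent the processing actually used by the cognitive unit 105.

```python
def binarize(image, threshold):
    """Return a 0/1 image: 1 where the pixel value is >= threshold."""
    return [[1 if px >= threshold else 0 for px in row] for row in image]


def horizontal_edges(binary):
    """Mark pixels whose right-hand neighbour has a different value."""
    edges = []
    for row in binary:
        edge_row = [1 if row[x] != row[x + 1] else 0 for x in range(len(row) - 1)]
        edge_row.append(0)  # the last column has no right-hand neighbour
        edges.append(edge_row)
    return edges


# Hypothetical 2x4 grayscale image: dark on the left, bright on the right.
gray = [
    [10, 12, 200, 210],
    [11, 13, 205, 208],
]
binary = binarize(gray, threshold=128)
edges = horizontal_edges(binary)
```

Here the edge map marks the column where the dark region meets the bright region; a practical system would typically rely on an image processing library rather than nested lists.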

The learning unit 108 learns human emotional interaction using the recognition results output by the cognitive unit 105 and the data stored in the second database 109. The learning unit 108 stores the model generated by the learning. The learning method is described later.

The motion generation unit 110 acquires the information received by the receiving unit 101, the image captured by the photographing unit 102, the acoustic signal collected by the sound collecting unit 103, and the recognition results from the cognitive unit 105. Based on the learned results and the acquired information, the motion generation unit 110 generates behavior toward the user (utterances, gestures, and images).

The image generation unit 1101 generates an output image (a still image, continuous still images, or a moving image) to be displayed on the display unit 111 based on the learned results and the acquired information, and causes the display unit 111 to display the generated output image. In this way, the motion generation unit 110 communicates with the user by causing the display unit 111 to display animations resembling facial expressions and images to be presented to the user. The displayed images include images corresponding to the movement of human eyes, images corresponding to the movement of a human mouth, information about the user's destination and the like (maps, weather charts, weather forecasts, information on shops and recreational areas, and so on), and images of a person who has placed a video call to the user via an Internet line.

The voice generation unit 1102 generates an output voice signal to be output from the speaker 112 based on the learned results and the acquired information, and causes the speaker 112 to output the generated output voice signal. In this way, the motion generation unit 110 communicates with the user by outputting voice signals from the speaker 112. The output voice signals include a voice signal in the voice assigned to the communication robot 1 and the voice signal of a person who has placed a video call to the user via an Internet line.

The drive unit 1103 generates a drive signal for driving the actuator 113 based on the learned results and the acquired information, and drives the actuator 113 with the generated drive signal. In this way, the motion generation unit 110 communicates with the user by controlling the motion of the communication robot 1 so that it expresses emotions and the like.

The transmission information generation unit 1104 generates, based on the learned results and the acquired information, transmission information (a voice signal or an image) that the user wants to send, for example, to another user with whom the user is conversing over the network, and causes the transmission unit 114 to transmit the generated transmission information.

The display unit 111 is a liquid crystal display device, an organic EL (electroluminescence) display device, or the like. The display unit 111 displays the output image output by the image generation unit 1101 of the social ability generation device 100.

The speaker 112 outputs the output voice signal output by the voice generation unit 1102 of the social ability generation device 100.

The actuator 113 drives the moving parts in response to the drive signal output by the drive unit 1103 of the social ability generation device 100.

The transmission unit 114 transmits the transmission information output by the transmission information generation unit 1104 of the social ability generation device 100 to the destination via the network.
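The data flow among the cognitive unit 105, the learning unit 108, and the motion generation unit 110 described in this section can be pictured with the following Python sketch. The class names, method names, feature keys, and the trivial greeting rule are hypothetical placeholders chosen for illustration; the actual recognition, learning, and generation methods are described later in the document and are not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Recognition:
    """Hypothetical recognition result passed from unit 105 to units 108 and 110."""
    speech_features: dict
    behavior_features: dict


class CognitiveUnit:            # stands in for cognitive unit 105
    def recognize(self, image, audio, sensor_values) -> Recognition:
        # Placeholder feature extraction; real processing is out of scope here.
        return Recognition(
            speech_features={"loudness": max(audio, default=0.0)},
            behavior_features={"person_visible": bool(image)},
        )


@dataclass
class LearningUnit:             # stands in for learning unit 108
    history: list = field(default_factory=list)

    def learn(self, result: Recognition) -> None:
        self.history.append(result)   # placeholder for model training


class MotionGenerationUnit:     # stands in for motion generation unit 110
    def generate(self, learner: LearningUnit, result: Recognition) -> dict:
        greet = result.behavior_features["person_visible"]
        return {
            "image": "smile" if greet else "neutral",   # image generation unit 1101
            "speech": "Hello!" if greet else "",        # voice generation unit 1102
            "drive": "nod" if greet else "idle",        # drive unit 1103
        }


cognition, learner, motion = CognitiveUnit(), LearningUnit(), MotionGenerationUnit()
result = cognition.recognize(image=[[1]], audio=[0.1, 0.4],
                             sensor_values={"temp_c": 22.0})
learner.learn(result)
action = motion.generate(learner, result)
```

The returned dictionary mirrors the split of the motion generation unit 110 into image, voice, and drive outputs; in this sketch the learner only records history and does not yet influence the generated action.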

<External appearance example of communication robot 1>
Next, an example of the external appearance of the communication robot 1 will be described.
FIG. 3 is a diagram showing an example of the external appearance of the communication robot 1 according to the present embodiment. In the example of the front view g101 and the side view g102 in FIG. 3, the communication robot 1 includes three display units 111 (111a, 111b, and 111c). In the example of FIG. 3, the photographing unit 102a is attached to the upper part of the display unit 111a, and the photographing unit 102b is attached to the upper part of the display unit 111b. The display units 111a and 111b correspond to human eyes and also present image information. The speaker 112 is attached to the housing 120 near the display unit 111c, which displays an image corresponding to a human mouth. The sound collecting unit 103 is attached to the housing 120.

The communication robot 1 also includes a boom 121. The boom 121 is movably attached to the housing 120 via a movable portion 131. A horizontal bar 122 is rotatably attached to the boom 121 via a movable portion 132.
The display unit 111a is rotatably attached to the horizontal bar 122 via a movable portion 133, and the display unit 111b is rotatably attached to the horizontal bar 122 via a movable portion 134.
The external appearance of the communication robot 1 shown in FIG. 3 is an example, and the appearance is not limited to this.

<Data stored in the first database>
Next, examples of the data stored in the first database will be described.
The language model database stores a language model. The language model is a probabilistic model that assigns to an arbitrary character string the probability that it is a sentence in Japanese or another language. The language model is, for example, an N-gram model, a hidden Markov model, or a maximum entropy model.
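To make the idea of a language model concrete, the following Python sketch trains a bigram (N = 2) model on a tiny corpus and scores token sequences. The toy English corpus, the add-one (Laplace) smoothing, and the function names are assumptions for illustration only; the source does not state which N-gram order, smoothing, or training data is used.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count context unigrams and bigrams over sentences with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded[:-1])                 # tokens used as contexts
        bigrams.update(zip(padded[:-1], padded[1:]))
    vocab = {tok for s in sentences for tok in s} | {"</s>"}
    return unigrams, bigrams, vocab


def log_prob(tokens, unigrams, bigrams, vocab):
    """Log-probability of a token sequence with add-one (Laplace) smoothing."""
    padded = ["<s>"] + tokens + ["</s>"]
    total = 0.0
    for prev, cur in zip(padded[:-1], padded[1:]):
        num = bigrams[(prev, cur)] + 1
        den = unigrams[prev] + len(vocab)
        total += math.log(num / den)
    return total


corpus = [["turn", "on", "the", "light"], ["turn", "off", "the", "light"]]
model = train_bigram(corpus)
likely = log_prob(["turn", "on", "the", "light"], *model)
unlikely = log_prob(["light", "the", "on", "turn"], *model)
```

A word order seen in the corpus receives a higher log-probability than a scrambled one, which is the property a speech recognizer uses to prefer plausible sentences among competing hypotheses.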

The acoustic model database stores a sound source model. The sound source model is used to identify the sound source of a picked-up acoustic signal.

An acoustic feature is a feature calculated after the picked-up acoustic signal has been converted into a frequency-domain signal by a fast Fourier transform (FFT). As acoustic features, for example, a static Mel-Scale Log Spectrum (MSLS), a delta MSLS, and one delta power are calculated at predetermined intervals (for example, every 10 ms). The MSLS uses spectral features as the features for acoustic recognition and is obtained by applying an inverse discrete cosine transform to the MFCC (Mel Frequency Cepstrum Coefficient).
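As a rough illustration of computing a feature at fixed intervals, the following sketch covers only the framing step and a delta power term. It does not compute the MSLS itself. The 25 ms window length, the function names, and the use of a simple first-order difference as the delta are assumptions made for this example and are not taken from the embodiment.

```python
import math


def frame_log_power(signal, sample_rate, hop_ms=10, win_ms=25):
    """Return the log power of each analysis frame, one frame every hop_ms.

    win_ms (the window length) is an assumed value for illustration.
    """
    hop = int(sample_rate * hop_ms / 1000)
    win = int(sample_rate * win_ms / 1000)
    powers = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = signal[start:start + win]
        energy = sum(x * x for x in frame) / win
        powers.append(math.log(energy + 1e-12))  # small floor avoids log(0)
    return powers


def delta(values):
    """First-order difference between neighbouring frames (a simple delta)."""
    return [b - a for a, b in zip(values, values[1:])]
```

For a one-second signal sampled at 16 kHz, `frame_log_power` yields one value per 10 ms step, and `delta` of that sequence gives a per-frame delta power under these assumptions.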

The dialogue corpus database stores a dialogue corpus. The dialogue corpus is the corpus used when the communication robot 1 and the user have a dialogue, for example a scenario that depends on the content of the dialogue.

The comparison image database stores, for example, images used for pattern matching. These images include, for example, images of the user, the user's family, the user's pets, and the user's friends and acquaintances.

An image feature is, for example, a feature extracted from an image of a person or an object by well-known image processing.
The data described above are only examples, and the first database 107 may store other data.

<Flow of cognition, learning, and social ability>
Next, the flow of cognition and learning performed by the social ability generation device 100 of the present embodiment will be described. FIG. 4 is a diagram showing the flow of cognition, learning, and social ability in the social ability generation device 100 of the present embodiment.

The recognition result 201 is an example of a result recognized by the cognitive unit 105. The recognition result 201 is, for example, interpersonal relationships, interpersonal interactions, and the like.

Multimodal learning and understanding 211 is an example of the learning performed by the learning unit 108. The learning method 212 is machine learning or the like. The learning targets 213 are social components, social models, psychology, the humanities, and the like.

Social ability 221 is a set of social skills, such as empathy, individualization, adaptability, and emotional affordance.

<Data to be recognized>
Next, examples of the data recognized by the cognitive unit 105 will be described.
FIG. 5 is a diagram showing examples of the data recognized by the cognitive unit 105 according to the present embodiment. In the present embodiment, personal data 301 and interpersonal relationship data 351 are recognized, as shown in FIG. 5.

Personal data concerns behavior that occurs within a single person. It consists of data acquired by the photographing unit 102 and the sound collecting unit 103, and of data obtained by performing voice recognition processing, image recognition processing, and the like on the acquired data. Personal data includes, for example, voice data, semantic data resulting from voice processing, voice volume, voice intonation, spoken words, facial expression data, gesture data, head posture data, face orientation data, line-of-sight data, co-occurrence expression data, and physiological information (body temperature, heart rate, pulse rate, etc.). Which data to use may be selected, for example, by the designer of the communication robot 1. In this case, the designer may, for example, observe an actual communication or demonstration between two people and set which features of the personal data are important in communication. The cognitive unit 105 also recognizes the user's emotion, as personal data, based on information extracted from the acquired utterances and images. In this case, the cognitive unit 105 recognizes the emotion based on, for example, voice volume and intonation, utterance duration, and facial expression. The communication robot 1 of the present embodiment is then controlled so that it acts to keep the user's emotions positive and to maintain a good relationship with the user.

Here, an example of a method for recognizing the user's social background will be described.
The cognitive unit 105 estimates the user's nationality, birthplace, and the like based on the acquired utterances and images and the data stored in the first database 107. Based on the same inputs, the cognitive unit 105 extracts the user's daily schedule, such as wake-up time, time of leaving home, time of returning home, and bedtime. Based on the acquired utterances and images, the daily schedule, and the data stored in the first database 107, the cognitive unit 105 estimates the user's gender, age, occupation, hobbies, career, preferences, family structure, religion, degree of attachment to the communication robot 1, and the like. Because the social background may change, the communication robot 1 keeps updating the information on the user's social background based on conversations, images, and the data stored in the first database 107. To enable emotional sharing, the social background and the degree of attachment to the communication robot 1 are not limited to items that can be entered directly, such as age, gender, and career. They are also recognized from, for example, emotional ups and downs over the course of the day and the voice volume and intonation with which the user responds to a topic. In this way, the cognitive unit 105 also learns things that the users themselves are not aware of, based on daily conversations and facial expressions during conversation.

Interpersonal relationship data is data on the relationship between the user and other people. Using interpersonal relationship data in this way makes it possible to use social data. Interpersonal relationship data includes, for example, the distance between people, whether the lines of sight of people in a dialogue meet, voice intonation, and voice volume. As described later, the distance between people varies with their relationship. For example, the interpersonal distance between spouses or friends is L1, and that between business associates is L2, which is larger than L1.

The designer of the communication robot 1 may also, for example, observe an actual communication or demonstration between two people and set which features of the interpersonal data are important in communication. The personal data, the interpersonal relationship data, and the information on the user's social background are stored in the first database 107 or the storage unit 106.

When there are multiple users, for example the user and the user's family, the cognitive unit 105 collects and learns personal data for each user and estimates the social background of each person. Such a social background may also be acquired, for example, via a network and the receiving unit 101. In that case, the user may enter the social background or select items using, for example, a smartphone.

Here, an example of a method for recognizing interpersonal relationship data will be described.
The cognitive unit 105 estimates the distance (interval) between people who are communicating, based on the acquired utterances and images and the data stored in the first database 107. Based on the same inputs, the cognitive unit 105 detects whether the lines of sight of the people who are communicating meet. The cognitive unit 105 estimates friendships, work relationships, and family relationships such as relatives and parent and child, based on the acquired utterances and the data stored in the first database 107, using the utterance content, voice volume, voice intonation, received e-mails, sent e-mails, and the senders and recipients of those e-mails.

In its initial state of use, the cognitive unit 105 may select one combination, for example at random, from several combinations of initial values for social background and personal data stored in the first database 107, and start communication with it. If the behavior generated from the randomly selected combination makes it hard to continue communication with the user, the cognitive unit 105 may select a different combination.
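The initial selection and reselection described above could be sketched as follows. The profile names and the `choose_profile` helper are hypothetical, and excluding already rejected combinations is an assumption of this sketch, since the embodiment only says that another combination may be selected.

```python
import random


def choose_profile(profiles, rejected=(), rng=random):
    """Pick one candidate profile at random, skipping rejected ones.

    profiles: candidate combinations of initial values (hypothetical labels).
    rejected: combinations that already made communication hard to continue.
    """
    candidates = [p for p in profiles if p not in rejected]
    if not candidates:
        raise ValueError("no untried profiles remain")
    return rng.choice(candidates)


# Hypothetical labels for stored combinations of initial values.
PROFILES = ["profile_a", "profile_b", "profile_c"]
first = choose_profile(PROFILES)
# If communication does not continue well, try a different combination.
second = choose_profile(PROFILES, rejected=(first,))
```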

<Learning procedure>
In the present embodiment, the learning unit 108 learns using the personal data 301 and the interpersonal relationship data 351 recognized by the cognitive unit 105, together with the data stored in the second database 109.

Here, social structure and social norms will be described. In a space where people take part in social interaction, interpersonal relationships differ depending on, for example, the distance between people. For example, a distance of 0 to 50 cm from another person indicates an intimate relationship, and a distance of 50 cm to 1 m indicates a personal relationship. A distance of 1 to 4 m indicates a social relationship, and a distance of 4 m or more indicates a public relationship. During learning, such social norms are used as a reward (an implicit reward): whether or not gestures and utterances conform to the norms.
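The distance ranges above can be written as a small lookup, shown here as a minimal sketch. The function names, the handling of the exact boundary values at 50 cm and 1 m, and the reward values of +1 and -1 are assumptions for illustration. The embodiment only states that conformity to the norm is used as an implicit reward.

```python
def proxemic_zone(distance_m):
    """Map an interpersonal distance in metres to a relationship zone."""
    if distance_m < 0:
        raise ValueError("distance must be non-negative")
    if distance_m < 0.5:
        return "intimate"
    if distance_m < 1.0:
        return "personal"
    if distance_m < 4.0:
        return "social"
    return "public"  # 4 m or more


def implicit_reward(distance_m, expected_zone):
    """Assumed reward: +1 if the distance fits the expected zone, else -1."""
    return 1.0 if proxemic_zone(distance_m) == expected_zone else -1.0
```

For example, a robot that approaches to 0.3 m of a user whose relationship calls for a social distance would receive the negative reward under this scheme.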

The interpersonal relationship may also be tailored to the environment of use and to the user by setting the reward features at learning time. Specifically, multiple intimacy settings may be provided: for example, a rule of not speaking much to people who are uncomfortable with robots, and a rule of actively speaking to people who like robots. In a real environment, the cognitive unit 105 may then recognize which type the user is, based on the results of processing the user's utterances and images, and the learning unit 108 may select the rule.

A human trainer may also evaluate the behavior of the communication robot 1 and provide a reward (an implicit reward) according to the social structure and norms that the trainer knows.

<Data stored in the second database>
Next, examples of the data stored in the second database will be described.
Social components are, for example, age, gender, occupation, and relationships between people (parent and child, married couple, lovers, friends, acquaintances, colleagues, neighbors, teacher and student, etc.).

Social norms are rules and manners for individuals and between people. They are associated with utterances, gestures, and the like that depend on age, gender, occupation, and the relationships between people.

Data on psychology is, for example, data on findings obtained from past experiments and verification (for example, the attachment relationship between mother and infant, complexes such as the Oedipus complex, conditioned reflexes, and fetishism).

Data on the humanities is, for example, data on religious rules, customs, national character, regional character, and the acts, behaviors, and utterances characteristic of a country or region. One example is that Japanese people often express agreement by nodding rather than saying so in words. Data on the humanities also covers, for example, what is considered important and what is given priority in each country or region.

FIG. 6 is a diagram showing an example of a system that performs deep reinforcement learning using raw data, as a comparative example.
In the comparative example, when the raw data 902 of the captured image 901 and of the picked-up acoustic signal 901 is used for learning, deep reinforcement learning 903 is required. The system of this comparative example is difficult to realize. The reason is that deep reinforcement learning needs a sufficient amount of training data, which is hard to collect because the features of interest appear in the raw data only a limited number of times.

For this reason, the present embodiment does not use raw data (audio signals, images) directly for learning. Instead, features are detected from the raw data and used for learning, so that reinforcement learning suffices and deep reinforcement learning is not needed.

FIG. 7 is a diagram showing an example of the method for creating the agent used by the motion generation unit 110 according to the present embodiment.
The area indicated by reference numeral 300 shows the flow from input, through agent creation, to output (the agent).
The image captured by the photographing unit 102 and the information 310 collected by the sound collecting unit 103 are information about people (the user, people related to the user, and others) and information about the environment around them. The raw data 302 acquired by the photographing unit 102 and the sound collecting unit 103 is input to the cognitive unit 105.

The cognitive unit 105 extracts and recognizes multiple pieces of information from the input raw data 302 (voice volume, voice intonation, utterance content, spoken words, the user's line of sight, the user's head posture, the user's face orientation, the user's biometric information, the distance between people, whether people's lines of sight meet, and so on). The cognitive unit 105 uses the extracted and recognized information to perform multimodal understanding, for example with a neural network.
The cognitive unit 105 identifies an individual based on, for example, at least one of the audio signal and the image, and assigns identification information (ID) to the identified individual. The cognitive unit 105 recognizes the actions of each identified person based on at least one of the audio signal and the image. The cognitive unit 105 recognizes the line of sight of the identified person by, for example, performing well-known image processing and tracking processing on the image. The cognitive unit 105 recognizes speech by, for example, performing voice recognition processing (sound source identification, sound source localization, sound source separation, utterance section detection, noise suppression, etc.) on the audio signal. The cognitive unit 105 recognizes the head posture of the identified person by, for example, performing well-known image processing on the image. When two people appear in a captured image, for example, the cognitive unit 105 recognizes their interpersonal relationship based on the utterance content, the distance between the two people in the image, and the like. The cognitive unit 105 recognizes (estimates) the social distance between the communication robot 1 and the user according to, for example, the results of processing the captured image and the picked-up audio signal.

The learning unit 108 performs reinforcement learning 304 rather than deep learning. In the reinforcement learning, the unit learns to select the most relevant features (including social structure and social norms). In this case, the multiple pieces of information used in multimodal understanding are used as input features. The input of the learning unit 108 is, for example, the raw data itself, a name ID (identification information), facial affect, recognized gestures, or keywords from speech. The output of the learning unit 108 is the behavior of the communication robot. The output behavior may be anything that is to be defined for the purpose at hand, for example a voice response, a robot routine, or the angle through which the robot turns. In multimodal understanding, a neural network or the like may be used for detection. In this case, different modalities of the body may be used to detect human activity. Which features to use may be selected in advance, for example, by the designer of the communication robot 1. Furthermore, in the present embodiment, using implicit rewards and explicit rewards during learning makes it possible to incorporate social models and social constructs. The result of the reinforcement learning is the output, namely the agent 305. In this way, the present embodiment creates the agent used by the motion generation unit 110.
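One way to picture reinforcement learning over extracted features rather than raw data is a tabular learner whose state is a tuple of recognized features. The sketch below uses tabular Q-learning, which is a technique chosen for this example. The embodiment does not name a specific algorithm, and the class name, action labels, feature values, and hyperparameters are all hypothetical.

```python
import random
from collections import defaultdict


class FeatureAgent:
    """Tabular Q-learning over discrete feature tuples (illustrative only)."""

    def __init__(self, actions, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration probability
        self.q = defaultdict(float)
        self.rng = random.Random(seed)

    def act(self, state):
        """Pick a random action with probability epsilon, else the best known."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        """Standard one-step Q-learning update."""
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])


# State = features recognized from raw data (hypothetical values).
state = ("user_01", "smiling", "personal_distance")
next_state = ("user_01", "smiling", "intimate_distance")
agent = FeatureAgent(["greet", "stay_quiet"], epsilon=0.0)
agent.update(state, "greet", reward=1.0, next_state=next_state)
```

Because the state is a short tuple of recognized features, the table stays small, which is the practical point of learning from features instead of raw images and audio.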

The area indicated by reference numeral 350 shows how rewards are used.
The implicit reward 362 is used to learn implicit reactions. In this case, the raw data 302 includes the user's reaction, and the multimodal understanding 303 described above is applied to this raw data 302. The learning unit 108 generates the implicit reaction system 372 using the implicit reward 362 and the social models and the like stored in the second database 109. The implicit reward may be obtained by reinforcement learning or may be given by a human. The implicit reaction system may be a model acquired by learning.

To learn explicit reactions, for example, a human trainer evaluates the behavior of the communication robot 1 and gives a reward 361 according to the social structure and social norms that the trainer knows. For a given input, the agent adopts the action that maximizes the reward. As a result, the agent adopts behavior (utterances, gestures) that maximizes the user's positive emotions.

The learning unit 108 uses this explicit reward 361 to generate the explicit reaction system 371. The explicit reaction system may be a model acquired by learning. The explicit reward may be given by the user evaluating the behavior of the communication robot 1. Alternatively, the communication robot 1 may estimate the reward from the user's utterances and behavior (gestures, facial expressions, etc.), for example based on whether it was able to take the action the user wanted.
During operation, the learning unit 108 outputs the agent 305 using these learned models.

In the present embodiment, the explicit reward, which is the user's reaction, takes priority over the implicit reward. This is because the user's reaction is the more reliable signal in communication.
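The embodiment does not specify how the priority is applied. One simple reading, shown below as an assumption, is that an explicit reward overrides the implicit one whenever it is available. The function name and the default of 0.0 when neither reward exists are hypothetical.

```python
def combined_reward(explicit=None, implicit=None):
    """Return the explicit reward if present, otherwise the implicit one.

    Strict override is an assumed interpretation of "takes priority";
    a weighted sum would be another possible reading.
    """
    if explicit is not None:
        return explicit
    if implicit is not None:
        return implicit
    return 0.0  # assumed neutral value when no feedback is available
```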

<Processing procedure example>
Next, an example of the processing procedure will be described. FIG. 8 is a flowchart showing an example of the procedure of the social ability generation process according to the present embodiment.

(Step S11) The cognitive unit 105 acquires the image captured by the photographing unit 102 and the acoustic signal picked up by the sound collecting unit 103.

(Step S12) The cognitive unit 105 recognizes, detects, or extracts feature information on speech from the acoustic signal, and recognizes, detects, or extracts feature information on people from the image. The feature information on speech is at least one of the audio signal, voice volume information, voice intonation information, and the meaning of the utterance. The feature information on a person is at least one of the person's facial expression information, gesture information, head posture information, face orientation information, and line-of-sight information.

(Step S13) Based on the acquired information and the data stored in the first database 107, the cognitive unit 105 recognizes the interaction that occurs between the communication robot 1 and a person, or the interaction that occurs among multiple people.

(Step S14) The learning unit 108 learns human emotional interactions using the recognition results output by the cognitive unit 105 and the data stored in the second database 109.

(Step S15) The motion generation unit 110 generates behavior toward the user (utterances, gestures, images) based on the learned results and the acquired information.
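Steps S11 to S15 can be read as a simple pipeline. The skeleton below is only a sketch of that data flow. Every function name and argument is a hypothetical placeholder, and the actual processing of each unit is passed in as a callable rather than implemented.

```python
def run_step(capture, extract_features, recognize, learn, generate_action):
    """Run one pass of the S11-S15 flow with caller-supplied stages."""
    image, audio = capture()                   # S11: acquire image and sound
    features = extract_features(image, audio)  # S12: extract feature information
    interaction = recognize(features)          # S13: recognize the interaction
    learned = learn(interaction)               # S14: learn emotional interaction
    return generate_action(learned, features)  # S15: generate behavior


# Usage with trivial stand-in stages (illustrative values only).
action = run_step(
    capture=lambda: ("image", "audio"),
    extract_features=lambda img, snd: {"expression": "smile", "volume": "low"},
    recognize=lambda feats: "robot_user_greeting",
    learn=lambda interaction: {"robot_user_greeting": "greet"},
    generate_action=lambda learned, feats: learned["robot_user_greeting"],
)
```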

<Communication between the communication robot 1 and people>
Next, examples of communication between the communication robot 1 and a person will be described.
The timing at which the communication robot 1 speaks is initially set to, for example, the time the user returns home or wakes up. The communication robot 1 may then learn when to start speaking as communication is repeated.
Alternatively, the communication robot 1 may start speaking in response to the user's utterances or behavior. In this case, the communication robot 1 does not start a conversation in response to a command, as smart speakers and similar devices do. Instead it starts a conversation when it recognizes, for example, that the user is speaking to it, the user's facial expression or face orientation (for example, the face is turned toward the communication robot 1), or that the room light has been turned on. The communication robot 1 also ends a conversation by recognizing, for example, the content of the user's utterances and the user's facial expression. The communication robot 1 may switch the start and end timing of its utterances according to the user's social background (including age and gender), the user's attachment to the communication robot 1, and the like.

Examples of communication between the communication robot 1 and the user will be described with reference to FIGS. 9 to 12.
FIG. 9 is a diagram showing an example of communication between the communication robot 1 and the user according to the present embodiment. The example of FIG. 9 shows the communication robot 1 speaking to the user when the user returns home, and a conversation taking place. During communication, the communication robot 1 acts (utterances, gestures, image presentation) so as to keep its relationship with the person in a good state. The communication robot 1 controls its gestures and mannerisms by driving the boom, the horizontal bar, and the like with the drive unit 1103 and the actuator 113.

FIGS. 10 and 11 are diagrams showing examples of images displayed on the display units during communication between the communication robot 1 and the user according to the present embodiment.
In the example of FIG. 10, heart marks expressing affection are displayed on the display units 111a and 111b, and an image corresponding to a smiling mouth is displayed on the display unit 111c.
In the example of FIG. 11, an illustration evoking the location of a scheduled event and the weather forecast for that day are presented, and the weather forecast is read aloud through the speaker 112. Although FIG. 11 shows an illustration evoking the location, the communication robot 1 may instead acquire an image (photograph or video) of the scheduled location via the receiving unit 101 and present the acquired image.
In this way, the display unit 111 displays images with which the robot behaves so as to maximize the person's positive emotions, thereby keeping its relationship with the person in a good state. According to the present embodiment, the robot can therefore behave in a way that maximizes the person's positive emotions, and a good relationship with the person can be maintained.

FIG. 12 is a diagram showing an example of communication by the communication robot 1 when the user communicates with a friend, according to the present embodiment. In the example of FIG. 12, the communication robot 1 sends a message to the user's friend as a result of a dialogue with the user. The terminal 200 carried by the friend receives this message and displays it on its display unit (g301). The terminal 200 then sends a reply g302 addressed to the user to the communication robot 1 in response to the friend's operation. Based on the information received from the terminal 200, the communication robot 1 presents the friend's reply through its behavior (utterances, gestures, image presentation).

In the example described above, the communication robot 1 communicates with the user using voice, motion (gesture), and images, but the present invention is not limited to this. The communication robot 1 preferably uses two or more output means to communicate with the user, and any two or more of voice, motion (gesture), and image suffice. Alternatively, the output means may be, for example, text and motion, or text and voice. A plurality of output means is also preferable so that the user does not become bored with the communication robot 1.
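As an illustration only, the preference for two or more output means could be enforced as in the following Python sketch. The function name `select_output_means`, the fallback order `PREFERRED_ORDER`, and the padding behavior are assumptions made for this sketch and are not described in the embodiment.

```python
# Hypothetical fallback order; the embodiment names these output means
# but does not rank them.
PREFERRED_ORDER = ("voice", "gesture", "image", "text")


def select_output_means(requested, minimum=2):
    """Return at least `minimum` distinct output means.

    Unknown or duplicate entries in `requested` are dropped, and the list is
    padded from PREFERRED_ORDER when fewer than `minimum` remain.
    """
    chosen = []
    for means in requested:
        if means in PREFERRED_ORDER and means not in chosen:
            chosen.append(means)
    for means in PREFERRED_ORDER:
        if len(chosen) >= minimum:
            break
        if means not in chosen:
            chosen.append(means)
    return chosen
```

For example, a request for text alone would be padded with voice under these assumptions, which corresponds to the "text and voice" combination mentioned above.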

Input from the user to the communication robot 1 is not limited to the voice and images described above. It suffices that the user's behavior can be acquired, and other information may also be acquired. Such other information is, for example, contact information indicating that the user has touched or hit the communication robot 1.

As described above, in the present embodiment, the communication robot 1 recognizes actions that occur between the communication robot 1 and a person, or actions that occur among a plurality of people. From what it has recognized, it learns human emotional interaction by means of machine learning together with psychology, social customs, the humanities, and the like, and generates the social abilities of the robot from what it has learned. In addition, in the present embodiment, explicit rewards are used in learning in addition to implicit rewards.
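As an illustration only, learning with both an implicit and an explicit reward can be sketched in Python as follows. The text does not specify an update rule; the incremental action-value update (a simple bandit-style form of reinforcement learning), the blending weight `w_explicit`, and all state and action names are assumptions introduced for this sketch.

```python
from collections import defaultdict
from typing import Optional


def combined_reward(implicit: float, explicit: Optional[float],
                    w_explicit: float = 0.5) -> float:
    """Blend the implicit reward with the explicit one when it is available."""
    if explicit is None:
        return implicit
    return (1.0 - w_explicit) * implicit + w_explicit * explicit


class ActionValueLearner:
    """Keeps a running value estimate for each (state, action) pair."""

    def __init__(self, learning_rate: float = 0.1):
        self.learning_rate = learning_rate
        self.values = defaultdict(float)  # (state, action) -> estimated value

    def update(self, state, action, implicit, explicit=None):
        """Move the estimate toward the combined reward and return it."""
        reward = combined_reward(implicit, explicit)
        key = (state, action)
        self.values[key] += self.learning_rate * (reward - self.values[key])
        return self.values[key]

    def best_action(self, state, actions):
        """Pick the action with the highest current estimate."""
        return max(actions, key=lambda a: self.values[(state, a)])
```

Here the implicit reward would be a score estimated from the person's multimodal features, and the explicit reward would be an evaluation of the robot's behavior toward the person; how either score is obtained is outside the scope of the sketch.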

Thereby, according to the present embodiment, the social skills of the robot can be generated based on emotional interaction with people. According to the present embodiment, a home social robot or agent that fosters relationships with people can be provided. According to the present embodiment, empathetic communication and interaction between machines and humans can be produced. According to the present embodiment, the concept of a "machine" friend, akin to a pet friend, or a robot friend can be provided. According to the present embodiment, a machine that is socially intelligent and can navigate social scenarios can be provided. Thereby, according to the present embodiment, an emotional connection can be formed between the robot and a person.
Further, according to the present embodiment, features are extracted from each of the collected acoustic signal and the captured image, and reinforcement learning is performed using the extracted features. Learning can therefore be performed without the large amount of training data required by deep machine learning on raw data.
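The idea of learning from extracted features rather than raw data can be illustrated with the following minimal Python sketch. The fields are modeled on the kinds of characteristic information listed in the claims (voice loudness, facial expression, line of sight, distance); the thresholds, the category set, and the names `PersonFeatures` and `to_state` are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical expression categories used to keep the state space small.
EXPRESSIONS = ("neutral", "smile", "frown")


@dataclass(frozen=True)
class PersonFeatures:
    """Features already extracted from the acoustic signal and the image."""
    voice_loudness_db: float
    expression: str
    gaze_on_robot: bool
    distance_m: float


def to_state(f: PersonFeatures) -> tuple:
    """Discretize the features into a compact state for a learner."""
    loudness = "loud" if f.voice_loudness_db >= 60.0 else "quiet"  # assumed threshold
    proximity = "near" if f.distance_m <= 1.0 else "far"           # assumed threshold
    expression = f.expression if f.expression in EXPRESSIONS else "neutral"
    return (loudness, expression, f.gaze_on_robot, proximity)


# Number of distinct states a learner has to cover under these assumptions.
NUM_STATES = 2 * len(EXPRESSIONS) * 2 * 2
```

With only a few dozen discrete states instead of raw audio samples and pixels, a learner needs far fewer examples to see each situation, which is the benefit the paragraph above describes.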

<Modification example>
In the embodiment, the communication robot 1 has been described as an example of a device that performs communication, but the present embodiment is also applicable to other devices such as an in-vehicle navigation device, a smartphone, or a tablet terminal. When applied to a smartphone, for example, a still image of the communication robot 1 such as that shown in FIG. 3 may be displayed on the display unit of the smartphone, with communication performed mainly by voice. Alternatively, the gestures of the communication robot 1 may be displayed as an animation on the display unit of the smartphone.

FIG. 13 is a diagram showing an example in which the communication robot of the present embodiment is applied to a car navigation system 300 in a vehicle. The car navigation system 300 may be a smartphone, a tablet terminal, or the like.
The car navigation system 300 displays an image of the communication robot on its display unit. In this case, the communication robot operates as an agent. The agent realizes the functions of the communication robot 1 (excluding the drive unit, the actuator, and the like) by using the photographing unit, sound collecting unit, display unit, speaker, and the like provided in the car navigation system 300.

When the present embodiment is applied to the car navigation system 300, the communication robot displayed on the display unit may be a still image or an animation. In this case, the agent responds at least by voice dialogue. Even in this case, during communication the agent acts (utterance, gesture, image presentation) so as to maintain a good relationship with the person.

FIG. 14 is a diagram showing an example of connection with various devices in the home when the present embodiment is applied to a car navigation system. In FIG. 14 as well, the car navigation system 300 may be a smartphone, a tablet terminal, or the like. It is assumed that the car navigation system 300 includes a communication unit (a receiving unit and a transmitting unit) and that the devices at home are connected via a network. In accordance with communication with the user, the agent applied to the car navigation system 300 issues, for example, an instruction 401 to open or close the shutter of the parking space, an operation start instruction 402 for the rice cooker, an operation start instruction or a setting instruction 403 (for temperature and the like) for the air conditioner, a lighting start instruction 404 for the lights in a room or the like, and an operation start instruction 405 for the automatic lawn mower. Rather than simply issuing operation instructions, the agent may learn through communication, from the user's utterances, the time at which the user plans to return home, the user's preferred temperature setting, and the user's preferred room brightness setting, and, based also on what it has learned, may issue each instruction with the optimum timing and settings so that these tasks are completed by the time the user returns home.
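As an illustration only, timing the instructions so that each task is finished when the user arrives could be sketched in Python as follows. The task durations, the device names, and the function `plan_start_times` are hypothetical; the embodiment states only that the timing may be chosen so that the tasks are completed by the time the user returns home.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

# Hypothetical durations; a real system would learn or look these up.
TASK_DURATIONS: Dict[str, timedelta] = {
    "rice_cooker": timedelta(minutes=50),
    "air_conditioner": timedelta(minutes=20),
    "room_light": timedelta(minutes=1),
    "lawn_mower": timedelta(minutes=40),
}


def plan_start_times(
    arrival: datetime,
    durations: Dict[str, timedelta] = TASK_DURATIONS,
) -> List[Tuple[datetime, str]]:
    """Return (start_time, device) pairs, earliest first.

    Each device is started exactly its duration before the expected arrival,
    so that every task finishes at the moment the user gets home.
    """
    plan = [(arrival - duration, device) for device, duration in durations.items()]
    return sorted(plan)
```

The expected arrival time would come from what the agent has learned in conversation with the user; if that estimate changes during the drive, the plan can simply be recomputed.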

A program for realizing all or part of the functions of the social ability generation device 100 of the present invention may be recorded on a computer-readable recording medium, and all or part of the processing performed by the social ability generation device 100 may be carried out by causing a computer system to read and execute the program recorded on the recording medium. The term "computer system" as used here includes an OS and hardware such as peripheral devices. The "computer system" also includes a WWW system provided with a homepage providing environment (or display environment). The "computer-readable recording medium" refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, and to a storage device such as a hard disk built into a computer system. Further, the "computer-readable recording medium" also includes media that hold the program for a certain period of time, such as a volatile memory (RAM) inside a computer system that serves as a server or client when the program is transmitted via a network such as the Internet or a communication line such as a telephone line.

The program may also be transmitted from a computer system in which the program is stored in a storage device or the like to another computer system via a transmission medium or by a transmission wave in the transmission medium. Here, the "transmission medium" that transmits the program refers to a medium having a function of transmitting information, such as a network (communication network) like the Internet or a communication line such as a telephone line. The program may be one for realizing only a part of the functions described above. Further, it may be a so-called difference file (difference program), which realizes the functions described above in combination with a program already recorded in the computer system.

Although modes for carrying out the present invention have been described above using embodiments, the present invention is in no way limited to these embodiments, and various modifications and substitutions can be made without departing from the gist of the present invention.

1 ... Communication robot, 101 ... Receiving unit, 102 ... Photographing unit, 103 ... Sound collecting unit, 104 ... Sensor, 100 ... Social ability generation device, 106 ... Storage unit, 107 ... First database, 109 ... Second database, 111 ... Display unit, 112 ... Speaker, 113 ... Actuator, 114 ... Transmitting unit, 105 ... Cognitive unit, 108 ... Learning unit, 110 ... Motion generation unit, 1101 ... Image generation unit, 1102 ... Voice generation unit, 1103 ... Drive unit, 1104 ... Transmission information generation unit

Claims

A social ability generation device comprising:
cognitive means for acquiring human information about a person, extracting characteristic information about the person from the acquired human information, recognizing an action that occurs between the person and a communication device that performs communication, and recognizing an action that occurs between people;
learning means for multimodally learning the emotional interaction of the person using the extracted characteristic information about the person; and
motion generation means for generating an action based on information on the learned emotional interaction of the person.

The social ability generation device according to claim 1, wherein
the learning means learns using an implicit reward and an explicit reward,
the implicit reward is a reward learned multimodally using the characteristic information about the person, and
the explicit reward is a reward based on a result of evaluating the behavior of the communication device toward the person, the behavior being generated by the motion generation means.

The social ability generation device according to claim 1 or 2, further comprising:
a sound collecting unit that collects an acoustic signal; and
a photographing unit that captures an image including a user, wherein
the cognitive means performs voice recognition processing on the collected acoustic signal to extract characteristic information about voice, and performs image processing on the captured image to extract characteristic information about the behavior of a person included in the image,
the characteristic information about the person includes the characteristic information about voice and the characteristic information about human behavior,
the characteristic information about voice is at least one of a voice signal, voice loudness information, voice intonation information, and the meaning of an utterance, and
the characteristic information about human behavior is at least one of facial expression information of the person, gesture information of the person, head posture information of the person, face orientation information of the person, line-of-sight information of the person, and the distance between people.

The social ability generation device according to any one of claims 1 to 3, wherein
the learning means learns using social norms, social components, psychological knowledge, and humanistic knowledge.

A social ability generation method, wherein:
cognitive means acquires human information about a person, extracts characteristic information about the person from the acquired human information, recognizes an action that occurs between the person and a communication device that performs communication, and recognizes an action that occurs between people;
learning means multimodally learns the emotional interaction of the person using the extracted characteristic information about the person; and
motion generation means generates an action based on information on the learned emotional interaction of the person.

A communication robot comprising:
cognitive means for acquiring human information about a person, extracting characteristic information about the person from the acquired human information, recognizing an action that occurs between the person and a communication device that performs communication, and recognizing an action that occurs between people;
learning means for multimodally learning the emotional interaction of the person using the extracted characteristic information about the person; and
motion generation means for generating an action based on information on the learned emotional interaction of the person.

The communication robot according to claim 6, further comprising a display unit, wherein
the motion generation means generates an image that maintains a good relationship with a person by leading the person to act in a way that maximizes positive emotions, and displays the generated image on the display unit.