JP2021117371A - Information processor, information processing method and information processing program

Info

Publication number: JP2021117371A
Application number: JP2020011190A
Authority: JP
Inventors: 真里斎藤; Mari Saito
Original assignee: Sony Group Corp
Current assignee: Sony Group Corp
Priority date: 2020-01-27
Filing date: 2020-01-27
Publication date: 2021-08-10
Also published as: WO2021153101A1

Abstract

To achieve natural interaction that follows the utterance intent of a speaker. SOLUTION: An information processor comprises: a state estimation part for estimating a state of emotion understanding, in which emotion based on a speaker's utterance is understood; and a response generation part for generating output information based on the result of the estimation performed by the state estimation part. SELECTED DRAWING: Figure 10

Description

The present disclosure relates to an information processing device, an information processing method, and an information processing program.

In recent years, improvements in speech recognition accuracy have led to the spread of systems that understand a user's (speaker's) utterances and hold a dialogue with the speaker. For example, systems that convert an input utterance into text and display it, in order to show how well the speaker's utterance has been understood, have become common. Such a system is realized, for example, as a speaker-type dialogue agent such as a smart speaker, or as a humanoid dialogue agent such as Pepper (registered trademark).

Patent Document 1: JP-A-2018-40897

However, when an utterance is complex, the displayed text can become very long, which makes it hard to convey that the speaker's utterance has been understood. Moreover, displaying the recognition result of the speaker's utterance as-is on a display device is unnatural, and may leave the speaker anxious about whether the utterance has been understood.

In addition, in use cases that involve listening to non-purposive utterances such as everyday conversation, rather than purposive utterances such as commands and requests, the speaker may be unable to fully enjoy talking if it is unclear whether the utterances are being understood.

As described above, it has been difficult for dialogue agents according to the prior art to realize a natural dialogue in line with the intent of the speaker's utterance.

Therefore, the present disclosure proposes a new and improved information processing device, information processing method, and information processing program capable of realizing a natural dialogue in line with the intent of the speaker's utterance.

According to the present disclosure, there is provided an information processing device including: a state estimation unit that estimates a state of emotion understanding, in which an emotion based on a speaker's utterance is understood; and a response generation unit that generates output information based on the result of the estimation by the state estimation unit.

FIG. 1 is a diagram showing a configuration example of the information processing system according to the embodiment.
FIG. 2 is a diagram showing an outline of the functions of the information processing system according to the embodiment.
FIG. 3 is a diagram showing an example of the outline of the functions of the information processing system according to the embodiment.
FIG. 4 is a diagram showing an example of the outline of the functions of the information processing system according to the embodiment.
FIG. 5 is a diagram showing an example of the outline of the functions of the information processing system according to the embodiment.
FIG. 6 is a diagram showing an example of the outline of the functions of the information processing system according to the embodiment.
FIG. 7 is a diagram showing an example of the outline of the functions of the information processing system according to the embodiment.
FIG. 8 is a diagram showing an example of the outline of the functions of the information processing system according to the embodiment.
FIG. 9 is a diagram showing an example of the outline of the functions of the information processing system according to the embodiment.
FIG. 10 is a block diagram showing a configuration example of the information processing system according to the embodiment.
FIG. 11 is a diagram showing an example of the speaker information storage unit according to the embodiment.
FIG. 12 is a diagram showing an example of the emotion word information storage unit according to the embodiment.
FIG. 13 is a flowchart showing the flow of processing in the information processing device according to the embodiment.
FIG. 14 is a flowchart showing the flow of processing in the information processing device according to the embodiment.
FIG. 15 is a flowchart showing the flow of processing in the information processing device according to the embodiment.
FIG. 16 is a flowchart showing the flow of processing in the information processing device according to the embodiment.
FIG. 17 is a hardware configuration diagram showing an example of a computer that realizes the functions of the information processing device.

Preferred embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings. In the present specification and the drawings, components having substantially the same functional configuration are given the same reference numerals, and duplicate description of them is omitted.

The description will be given in the following order.
1. Embodiment of the present disclosure
1.1. Overview
1.2. Configuration of the information processing system
2. Functions of the information processing system
2.1. Overview of functions
2.2. Various use case examples
2.3. Functional configuration example
2.4. Processing of the information processing system
2.5. Variations of processing
3. Application examples
3.1. Persons with visual or hearing impairments
3.2. Elderly people
4. Hardware configuration example
5. Summary

<< 1. Embodiment of the present disclosure >>
<1.1. Overview>
In recent years, improvements in speech recognition accuracy have led to the spread of systems that understand a speaker's utterances and hold a dialogue with the speaker. For example, systems that convert an input utterance into text and display it, in order to show how well the speaker's utterance has been understood, have become common. Such a system is realized, for example, as a speaker-type dialogue agent such as a smart speaker, or as a humanoid dialogue agent such as Pepper (registered trademark).

If, during a speaker's utterance, a dialogue agent can produce, for example, fillers (connecting words unrelated to the content of the utterance), nods, or aizuchi (brief backchannel responses), the speaker can be made to feel that the agent understands the utterance. Accordingly, technology for dialogue agents that produce fillers, nods, aizuchi, and the like during a speaker's utterance is being developed.

In relation to the dialogue agent technology described above, Patent Document 1, for example, discloses a technique for controlling the operation of a dialogue agent when it can be estimated neither that the agent should wait for the speaker's utterance nor that the agent should speak.

However, the dialogue agent technology described above controls the agent's dialogue-related behavior without regard to the intent of the speaker's utterance, so the agent's behavior may, for example, get in the way of the speaker's utterance.

An embodiment of the present disclosure was conceived with the above points in mind, and proposes a technique that makes it possible to control the agent so that it gives an appropriate response in line with the intent of the speaker's utterance. The present embodiment is described in detail below, in order. In the following description, the terminal device 20 is used as an example of a dialogue agent.

<1.2. Configuration of the information processing system>
First, the configuration of the information processing system 1 according to the embodiment will be described. FIG. 1 is a diagram showing a configuration example of the information processing system 1. As shown in FIG. 1, the information processing system 1 includes an information processing device 10 and a terminal device 20. Various devices can be connected to the information processing device 10. For example, the terminal device 20 is connected to the information processing device 10, and information is shared between the devices. The terminal device 20 is connected to the information processing device 10 wirelessly. For example, the information processing device 10 performs short-range wireless communication with the terminal device 20 using Bluetooth (registered trademark). Note that the information processing device 10 and the terminal device 20 may be connected, whether by wire or wirelessly, via various interfaces such as I2C (Inter-Integrated Circuit) and SPI (Serial Peripheral Interface), or via various networks such as a LAN (Local Area Network), a WAN (Wide Area Network), the Internet, and a mobile communication network.

(1) Information processing device 10
The information processing device 10 is an information processing device that controls, for example, the terminal device 20 according to utterance data of the speaker's utterance (speech). Specifically, the information processing device 10 first estimates the state of emotion understanding, in which an emotion based on the speaker's utterance is understood, and generates output information based on the estimation result. The information processing device 10 then controls the terminal device 20 by transmitting the generated output information to, for example, the terminal device 20.

The information processing device 10 also has a function of controlling the overall operation of the information processing system 1. For example, the information processing device 10 controls the overall operation of the information processing system 1 based on the information shared between the devices. Specifically, the information processing device 10 controls the terminal device 20 based on information received from the terminal device 20.

The information processing device 10 is realized by a PC (personal computer), a WS (workstation), or the like. Note that the information processing device 10 is not limited to a PC, a WS, or the like. For example, the information processing device 10 may be an information processing device such as a PC or a WS in which the functions of the information processing device 10 are implemented as an application.

(2) Terminal device 20
The terminal device 20 is the information processing device to be controlled.

The terminal device 20 acquires utterance data. The terminal device 20 then transmits the acquired utterance data to the information processing device 10.

The terminal device 20 may be realized as any kind of device. For example, the terminal device 20 may be realized as a speaker-type device or as a humanoid device. The terminal device 20 may also be realized, for example, as a presentation device that presents visual information of a dialogue agent.

<< 2. Functions of the information processing system >>
The configuration of the information processing system 1 has been described above. Next, the functions of the information processing system 1 will be described.

<2.1. Overview of functions>
The information processing system 1 according to the embodiment generates responses, which are listening reactions to a speaker's utterance, by transitioning among three states. Specifically, the information processing system 1 generates responses by transitioning among estimation of a state of utterance recognition, in which the speaker's utterance is recognized; estimation of a state of emotion understanding based on the speaker's utterance; and estimation of a state of execution preparation, in which the system prepares to execute a process based on request-related information, that is, information about a request contained in the speaker's utterance. A response based on the estimated state of utterance recognition is, for example, a response that tells the speaker that the utterance has been received. A response based on the estimated state of emotion understanding is, for example, a response that tells the speaker that the system empathizes. A response based on the estimated state of execution preparation is, for example, a response for executing a process based on the request-related information contained in the speaker's utterance. By transitioning among these three states, the information processing system 1 can generate a response suited to each state.

FIG. 2 is a diagram showing an outline of the functions of the information processing system 1. The information processing system 1 first recognizes an utterance of the speaker U12 (S11). On recognizing the utterance of the speaker U12, the information processing system 1 estimates the state of utterance recognition. Next, the information processing system 1 recognizes an emotion word, that is, a word indicating an emotion, from the utterance of the speaker U12 (S12). On recognizing an emotion word, the information processing system 1 estimates the state of emotion understanding. The information processing system 1 then executes a process of repeating back the emotion word (S13). When it recognizes a further utterance of the speaker U12, the information processing system 1 estimates the state of utterance recognition. In the processing of S12, the information processing system 1 recognizes request-related information from the utterance of the speaker U12 (S14). On recognizing request-related information, the information processing system 1 estimates the state of execution preparation. The information processing system 1 then executes a process based on the request-related information (S15). If, in the processing of S15, the information processing system 1 does not execute the process based on the request-related information, it estimates the state of utterance recognition (S16).

The following describes the case where the same processing as S14 is performed after the state of emotion understanding has been estimated. The information processing system 1 recognizes request-related information from the utterance of the speaker U12 (S17). On recognizing request-related information, the information processing system 1 estimates the state of execution preparation. The information processing system 1 then executes a process based on the request-related information (S15). If, in the processing of S15, the information processing system 1 does not execute the process based on the request-related information, it estimates the state of emotion understanding (S18).

In this way, by grading the dialogue agent's responses such as aizuchi into stages, the information processing system 1 can convey the states "I am listening (your voice is reaching me)", "I understand your emotion", and "I will carry out your request" using different processes. This allows the information processing system 1 to convey that the dialogue agent is listening while following how the content of the speaker's utterance develops, so the speaker can speak with peace of mind.
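The transitions of FIG. 2 can be sketched as a small state machine. This is only an illustrative sketch, not the implementation of the disclosure: the class `ListeningAgent`, its method names, and the boolean inputs are hypothetical, and the behavior after a request has actually been executed is not specified in the text, so the sketch simply leaves the state unchanged in that case.

```python
from enum import Enum, auto


class State(Enum):
    UTTERANCE_RECOGNITION = auto()   # "I am listening"
    EMOTION_UNDERSTANDING = auto()   # "I understand your emotion"
    EXECUTION_PREPARATION = auto()   # "I will carry out your request"


class ListeningAgent:
    """Hypothetical tracker for the three states described for FIG. 2."""

    def __init__(self):
        self.state = State.UTTERANCE_RECOGNITION
        # State to fall back to if a recognized request is not executed.
        self._return_to = State.UTTERANCE_RECOGNITION

    def on_utterance(self, has_emotion_word=False, has_request=False):
        """Update the state for one recognized utterance."""
        if has_request:                                   # S14 or S17
            if self.state is not State.EXECUTION_PREPARATION:
                self._return_to = self.state
            self.state = State.EXECUTION_PREPARATION
        elif has_emotion_word:                            # S12
            self.state = State.EMOTION_UNDERSTANDING
        else:                                             # plain further utterance
            self.state = State.UTTERANCE_RECOGNITION
        return self.state

    def on_request_result(self, executed):
        """After S15: fall back to the previous state if nothing was executed."""
        if self.state is State.EXECUTION_PREPARATION and not executed:
            self.state = self._return_to                  # S16 or S18
        return self.state


if __name__ == "__main__":
    agent = ListeningAgent()
    print(agent.on_utterance(has_emotion_word=True))
    print(agent.on_utterance(has_request=True))
    print(agent.on_request_result(executed=False))
```

Remembering the state from which execution preparation was entered is one simple way to distinguish the S16 fallback (to utterance recognition) from the S18 fallback (to emotion understanding).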

(When estimating the state of utterance recognition)
FIG. 3 is a diagram showing an outline of the UI (user interface) when the state of utterance recognition is estimated. The terminal device 20 first detects an utterance TK11 of the speaker U12. On detecting the end SK11 of the utterance TK11, the information processing system 1 controls the terminal device 20 to give an aizuchi such as "Yeah" (S21). The terminal device 20 outputs a response RK11, which is the aizuchi to the utterance TK11. Next, the terminal device 20 detects an utterance TK12 of the speaker U12. While the speaker U12 is speaking the utterance TK12, the information processing system 1 controls the terminal device 20 to nod, for example by moving its head up and down, until the end SK12 of the utterance TK12 is detected (S22). In other words, the information processing system 1 controls the terminal device 20 so that it does not give an aizuchi while the speaker U12 is speaking the utterance TK12. On detecting the end SK12 of the utterance TK12, the information processing system 1 controls the terminal device 20 to give an aizuchi. The terminal device 20 outputs a response RK12, which is the aizuchi to the utterance TK12. Next, the terminal device 20 detects an utterance TK13 of the speaker U12. While the speaker U12 is speaking the utterance TK13, the information processing system 1 controls the terminal device 20 to nod until the end SK13 of the utterance TK13 is detected (S23). On detecting the end SK13 of the utterance TK13, the information processing system 1 controls the terminal device 20 to give an aizuchi. The terminal device 20 outputs a response RK13, which is the aizuchi to the utterance TK13. Because the information processing system 1 can thus give aizuchi at timings that do not interrupt the speaker, it can appropriately convey to the speaker that the utterance is being received.
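The timing rule described for FIG. 3 (nod silently while the speaker is talking, give a spoken aizuchi only once the end of the utterance is detected) can be sketched as a mapping from speech events to terminal actions. The event names `"speaking"` and `"end"` and the function `terminal_actions` are assumptions made for this sketch; the disclosure does not define such an interface.

```python
def terminal_actions(events):
    """Map speech events to actions of the terminal device.

    events: iterable of (kind, utterance_id) pairs, where kind is
            "speaking" (utterance in progress) or "end" (end detected).
    Returns a list of (action, utterance_id) pairs.
    """
    actions = []
    for kind, utterance_id in events:
        if kind == "speaking":
            # Non-verbal feedback only, so the speaker is not interrupted.
            actions.append(("nod", utterance_id))
        elif kind == "end":
            # Spoken aizuchi is allowed once the utterance has ended.
            actions.append(("aizuchi", utterance_id))
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
    return actions


if __name__ == "__main__":
    events = [("end", "TK11"), ("speaking", "TK12"), ("end", "TK12")]
    for action in terminal_actions(events):
        print(action)
```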

(When estimating the state of emotion understanding)
FIG. 4 is a diagram showing an outline of the UI when the state of emotion understanding is estimated. Descriptions that duplicate those of FIG. 3 are omitted below as appropriate. The terminal device 20 detects an utterance TK23 of the speaker U12. While the speaker U12 is speaking the utterance TK23, the information processing system 1 controls the terminal device 20 to nod until the end SK23 of the utterance TK23 is detected. The information processing system 1 also detects an emotion word KG11 from the utterance TK23 (S33). Specifically, the information processing system 1 performs language analysis processing on the utterance TK23. The information processing system 1 then detects the emotion word KG11 by comparing the linguistic information contained in the utterance TK23 with linguistic information predetermined as emotion words. For example, the information processing system 1 detects the emotion word KG11 by accessing a storage unit that stores emotion word information. On detecting the emotion word KG11, the information processing system 1 controls the terminal device 20 to repeat back the emotion indicated by the emotion word KG11 in an appropriate expression, using the emotion word KG11 together with the linguistic information in the utterance TK23 whose context is close to the emotion word KG11. Specifically, based on the emotion word KG11, "troubled", and the adjacent linguistic information, "long", the information processing system 1 repeats back the emotion "troubled" indicated by the emotion word KG11 in an appropriate expression. The terminal device 20 outputs a response RK23, which repeats back the utterance TK23. In this way, the information processing system 1 can repeat back the linguistic information in the context immediately before and after the emotion word KG11. This allows the information processing system 1 to appropriately convey to the speaker that it understands and empathizes with the speaker's emotion, so the speaker can speak with peace of mind.
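The detection and repeat-back step described for FIG. 4 can be illustrated with a minimal sketch. It assumes that language analysis has already reduced the utterance to a list of content words, that the emotion word storage is a plain set, and that the reply follows a fixed template; the function name `echo_emotion`, the `window` parameter, and the template are all hypothetical.

```python
def echo_emotion(content_words, lexicon, window=1):
    """Build a repeat-back reply from the first emotion word and its neighbors.

    content_words: content words of the utterance, in order (assumed to come
                   from a separate language analysis step).
    lexicon:       set of words registered as emotion words.
    window:        how many neighboring words on each side count as context.
    Returns a reply string, or None if no emotion word is found.
    """
    for i, word in enumerate(content_words):
        if word in lexicon:
            before = content_words[max(0, i - window):i]
            after = content_words[i + 1:i + 1 + window]
            context = before + after
            if context:
                return f"{' '.join(context)} and {word}, I see."
            return f"{word}, I see."
    return None


if __name__ == "__main__":
    lexicon = {"troubled"}  # hypothetical emotion word entry
    print(echo_emotion(["meeting", "long", "troubled"], lexicon))
```

A fixed template is used here only to keep the sketch short; generating a natural surface form from the emotion word and its context would need a proper language generation step.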

With reference to FIG. 5, an outline of the UI when the state of emotion understanding is estimated will be described, taking as an example a case where the speaker U12 makes an utterance different from that in FIG. 3. Descriptions that duplicate those of FIGS. 2 to 4 are omitted below as appropriate. The information processing system 1 detects an emotion word KG21 from an utterance TK33 (S43). On detecting the emotion word KG21, the information processing system 1 controls the terminal device 20 to repeat back the emotion indicated by the emotion word KG21 in an appropriate expression, using linguistic information predetermined as a synonym (near-synonym) of the emotion word KG21. Specifically, the information processing system 1 uses "sad", which is linguistic information predetermined as a synonym of the emotion word KG21, "worst", to repeat back the emotion "worst" indicated by the emotion word KG21 in an appropriate expression. In this way, the information processing system 1 generates an empathic utterance that repeats back linguistic information predetermined as a synonym of the emotion word KG21. The terminal device 20 outputs a response RK33, which repeats back the utterance TK33. As another example, the information processing system 1 uses "terrible", which is linguistic information predetermined as a synonym of the emotion word KG21, "worst", together with "I ran into them", which is the linguistic information in the utterance TK33 whose context is close to the emotion word KG21, and outputs "You ran into them? That's terrible." Note that the information processing system 1 need not output responses using only registered emotion words; for example, it may estimate the speaker's emotion using a sensor and output a response using an emotion word corresponding to the estimated emotion. The information processing system 1 may also learn responses based on, for example, utterances contained in conversations with other speakers. Furthermore, the information processing system 1 may, for example, update the learned and stored responses each time it detects a conversation with another speaker, and output the latest updated response.
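The synonym-based reply described for FIG. 5 can be sketched as a dictionary lookup. The table `SYNONYMS` holds only the pairs mentioned in the text ("worst" mapped to "sad" and "terrible"); the function `empathic_response`, the `choice` parameter, and the reply template are assumptions for illustration, and the sketch does not cover the sensor-based or learned responses that the text also mentions.

```python
# Synonym table built from the example in the text; a real table would be
# stored in advance and contain many more entries.
SYNONYMS = {"worst": ["sad", "terrible"]}


def empathic_response(emotion_word, synonyms, context=None, choice=0):
    """Return an empathic reply that uses a registered synonym.

    emotion_word: emotion word detected in the utterance.
    synonyms:     mapping from emotion word to a list of registered synonyms.
    context:      optional nearby phrase to repeat back before the synonym.
    choice:       index of the synonym to use (hypothetical selection rule).
    Returns None when no synonym is registered for the emotion word.
    """
    candidates = synonyms.get(emotion_word)
    if not candidates:
        return None
    synonym = candidates[choice % len(candidates)]
    if context:
        return f"{context}. That is {synonym}."
    return f"That is {synonym}."


if __name__ == "__main__":
    print(empathic_response("worst", SYNONYMS))
    print(empathic_response("worst", SYNONYMS, context="You ran into them", choice=1))
```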

(When estimating the state of execution preparation)
FIG. 6 is a diagram showing an outline of the UI when the state of execution preparation is estimated. Descriptions that duplicate those of FIGS. 2 to 5 are omitted below as appropriate. The terminal device 20 detects an utterance TK43 of the speaker U12. While the speaker U12 is speaking the utterance TK43, the information processing system 1 controls the terminal device 20 to nod until the end SK43 of the utterance TK43 is detected. The information processing system 1 also detects request-related information IG11 from the utterance TK43 (S53). On detecting the request-related information IG11, the information processing system 1 outputs a response RK43, such as "Understood", indicating that the request has been recognized. The information processing system 1 then controls the terminal device 20 to repeat back the content of the request indicated by the request-related information IG11. The terminal device 20 outputs a response RK44, which repeats back the utterance TK43. The information processing system 1 then executes a process based on the information about the request indicated by the request-related information IG11 (S54).

Further, in S53, the information processing system 1 determines whether the information regarding the request indicated by the request-related information IG11 is sufficient to execute the process. When it is not sufficient, the information processing system 1 controls the terminal device 20 so as to give an aizuchi (backchannel response) with an expression that is less recognizable than a predetermined reference. For example, by controlling the terminal device 20 so as to give the aizuchi at a low volume, the information processing system 1 can prompt the speaker to continue the utterance. When the information regarding the request is not sufficient to execute the process, overlapping utterances caused by the terminal device 20 may also occur. Because the information processing system 1 can prompt the speaker to continue the utterance, it can resolve problems such as this overlap. If the information processing system 1 cannot detect a continuation of the speaker's utterance, it outputs a notice to the speaker that the utterance is not sufficient. The information processing system 1 also prompts the speaker to continue by using the linguistic information of the hesitant (incomplete) sentence. This lets the information processing system 1 elicit a more natural utterance than it would by asking the speaker to state the missing information required to execute the process. On the other hand, when the information regarding the request indicated by the request-related information IG11 is sufficient to execute the process, the information processing system 1 outputs the response RK43 indicating that the request has been recognized. Here, when the utterance TK43 is the sentence end of the dialogue in which the request indicated by the request-related information IG11 is uttered, the information processing system 1 outputs the response RK43 with an expression whose recognizability is equivalent to the predetermined reference. For example, the information processing system 1 can output the response RK43 at a volume equivalent to the predetermined reference.
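The branch taken in S53 can be illustrated with a short sketch. This is only one possible reading of the description, not the disclosed implementation: the slot names in `REQUIRED_SLOTS`, the numeric volume values, and the function `respond_to_request` are all assumptions introduced for illustration.

```python
from dataclasses import dataclass

# Hypothetical: the pieces of information a request needs before it can run.
REQUIRED_SLOTS = {"action", "target"}

REFERENCE_VOLUME = 1.0  # assumed stand-in for the "predetermined reference"
LOW_VOLUME = 0.4        # assumed value below the reference


@dataclass
class Response:
    kind: str      # "aizuchi" or "acknowledge"
    volume: float


def respond_to_request(slots: dict) -> Response:
    """Choose a response depending on whether the request can be executed."""
    filled = {name for name, value in slots.items() if value}
    missing = REQUIRED_SLOTS - filled
    if missing:
        # Information is insufficient: a quiet aizuchi invites the speaker
        # to continue instead of asking for the missing items directly.
        return Response("aizuchi", LOW_VOLUME)
    # Information is sufficient: acknowledge at the reference volume.
    return Response("acknowledge", REFERENCE_VOLUME)
```

With these assumed slots, `respond_to_request({"action": "play"})` yields a low-volume aizuchi, while a request that fills both slots is acknowledged at the reference volume.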

<2.2. Various use case examples>
The outline of the functions according to the embodiment of the present disclosure has been described above. Next, use case examples of the information processing system 1 according to the embodiment of the present disclosure will be described.

(Nursing care facility, case 1)
FIG. 7 outlines the functions of the information processing system 1, taking as an example a case where the speaker U12 speaks at a nursing care facility. Hereinafter, descriptions that duplicate those of FIGS. 2 to 6 are omitted as appropriate. The terminal device 20 detects the emotional word KG31 from the utterance TK51 of the speaker U12 (S62). The terminal device 20 outputs a response RK52 that repeats, in an appropriate expression, the emotion "fun" indicated by the emotional word KG31. Specifically, the information processing system 1 outputs the response RK52 based on the emotional word KG31, "fun", and the adjacent linguistic information, "very".

(Nursing care facility, case 2)
FIG. 8 outlines the functions of the information processing system 1, taking as an example a case where the speaker U12 makes an utterance different from that in FIG. 7. Hereinafter, descriptions that duplicate those of FIGS. 2 to 7 are omitted as appropriate. The terminal device 20 detects the request-related information IG21 from the utterance TK63 of the speaker U12 (S73). In S73, the information processing system 1 determines that the information regarding the request indicated by the request-related information IG21 is not sufficient to execute the process. The information processing system 1 outputs a response RK63 indicating that the utterance is not sufficient. The terminal device 20 detects the utterance TK64 of the speaker U12. The information processing system 1 determines that the utterance TK64 of the speaker U12 contains sufficient information to execute the process based on the information regarding the request indicated by the request-related information IG21 (S74). The terminal device 20 outputs the response RK64, which repeats the utterance TK64. In response to the utterance TK65 of the speaker U12, the information processing system 1 controls the terminal device 20 so as to output the response RK65 and present the information regarding the request indicated by the request-related information IG21. After that, the information processing system 1 detects the emotional word KG41 from the utterance TK67 of the speaker U12 (S77). The terminal device 20 outputs a response RK67 that repeats, in an appropriate expression, the emotion "looks delicious" indicated by the emotional word KG41. Specifically, the information processing system 1 outputs the response RK67 based on the emotional word KG41, "it looks delicious".

(Case of working away from home alone)
FIG. 9 outlines the functions of the information processing system 1, taking as an example a case where the speaker U12 speaks while posted away from home for work. Hereinafter, descriptions that duplicate those of FIGS. 2 to 8 are omitted as appropriate. The terminal device 20 detects the emotional word KG51 from the utterance TK71 of the speaker U12 (S81). The terminal device 20 outputs a response RK71 that repeats, in an appropriate expression, the emotion "busy" indicated by the emotional word KG51. Specifically, the information processing system 1 outputs the response RK71 based on the emotional word KG51, "busy", and the adjacent linguistic information, "work".
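The pattern shared by FIGS. 7 and 9, in which a response is built from an emotional word together with the linguistic information adjacent to it, might be sketched as follows. The function name, the tokenized input, and the choice of "adjacent" as the immediately preceding token are assumptions made for illustration; the description does not specify how adjacency is determined.

```python
from typing import Optional, Sequence, Tuple


def emotion_with_context(
    tokens: Sequence[str], emotion_words: set
) -> Optional[Tuple[Optional[str], str]]:
    """Return (adjacent token, emotional word) for the first emotional word found.

    "Adjacent" is assumed here to mean the token immediately before the
    emotional word, or None when the emotional word comes first.
    """
    for i, token in enumerate(tokens):
        if token in emotion_words:
            adjacent = tokens[i - 1] if i > 0 else None
            return adjacent, token
    return None  # no emotional word in the utterance
```

A response generator could then phrase its repetition from the returned pair, for example from ("very", "fun") or ("work", "busy").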

<2.3. Functional configuration example>
FIG. 10 is a block diagram showing a functional configuration example of the information processing system 1 according to the first embodiment.

(1) Information processing device 10
As shown in FIG. 10, the information processing device 10 includes a communication unit 100, a control unit 110, and a storage unit 120. Note that the information processing device 10 includes at least the control unit 110.

(1-1) Communication unit 100
The communication unit 100 has a function of communicating with an external device. For example, in communication with an external device, the communication unit 100 outputs information received from the external device to the control unit 110. Specifically, the communication unit 100 outputs the utterance data received from the terminal device 20 to the control unit 110.

In communication with an external device, the communication unit 100 also transmits information input from the control unit 110 to the external device. Specifically, the communication unit 100 transmits information regarding the acquisition of utterance data, input from the control unit 110, to the terminal device 20.

(1-2) Control unit 110
The control unit 110 has a function of controlling the operation of the information processing device 10. For example, the control unit 110 detects the end of the utterance data. Further, the control unit 110 performs a process of controlling the operation of the terminal device 20 based on the information regarding the detected end.

To realize the above functions, as shown in FIG. 10, the control unit 110 includes a speaker identification unit 111, an utterance detection unit 112, an utterance recognition unit 113, a state estimation unit 114, a semantic analysis unit 115, a request processing unit 116, a response generation unit 117, an utterance execution unit 118, and a motion presentation unit 119.

・Speaker identification unit 111
The speaker identification unit 111 has a function of performing speaker identification processing. For example, the speaker identification unit 111 accesses the storage unit 120 (for example, the speaker information storage unit 121) and performs identification processing using the speaker information. Specifically, the speaker identification unit 111 identifies the speaker by comparing the imaging information transmitted from the imaging unit 212 via the communication unit 200 with the speaker information stored in the storage unit 120.

・Utterance detection unit 112
The utterance detection unit 112 has a function of detecting the utterance of the speaker. For example, the utterance detection unit 112 performs detection processing on the utterance data transmitted from the utterance acquisition unit 211 via the communication unit 200. The utterance detection unit 112 also detects the utterance of a specific speaker. For example, the utterance detection unit 112 detects the utterance of a specific speaker based on the imaging information transmitted from the imaging unit 212 via the communication unit 200.

・Utterance recognition unit 113
The utterance recognition unit 113 has a function of recognizing the utterance of the speaker. For example, the utterance recognition unit 113 performs recognition processing on the utterance data transmitted from the utterance acquisition unit 211 via the communication unit 200. Specifically, the utterance recognition unit 113 converts the utterance data into linguistic information.

The utterance recognition unit 113 also has a function of detecting the end of the utterance data. For example, the utterance recognition unit 113 performs a process of detecting the end of the utterance data transmitted from the utterance acquisition unit 211. Specifically, the utterance recognition unit 113 detects the end of the linguistic information.

・State estimation unit 114
The state estimation unit 114 has a function of estimating a state based on the utterance of the speaker. For example, the state estimation unit 114 performs estimation processing on the utterance data transmitted from the utterance acquisition unit 211 via the communication unit 200. Specifically, the state estimation unit 114 estimates the state of emotional understanding when the speaker's utterance includes an emotional word. The state estimation unit 114 accesses the storage unit 120 (for example, the emotional word information storage unit 122) and performs estimation processing using linguistic information. Specifically, the state estimation unit 114 estimates the state of emotional understanding by comparing the linguistic information included in the utterance data with the emotional words stored in the storage unit 120.

The state estimation unit 114 also estimates the state of emotional understanding according to an emotional word, that is, a word indicating an emotion, among the linguistic information included in the speaker's utterance. Further, the state estimation unit 114 estimates the state of emotional understanding according to linguistic information that is not an emotional word but nevertheless expresses the speaker's emotion, among the linguistic information included in the speaker's utterance.

Further, the state estimation unit 114 estimates the state of preparation for execution when the speaker's utterance includes request-related information. The state estimation unit 114 estimates the state of utterance recognition when the speaker's utterance includes neither an emotional word nor request-related information.

・Semantic analysis unit 115
The semantic analysis unit 115 has a function of analyzing the intention of the speaker's utterance from the linguistic information included in the utterance. Specifically, the semantic analysis unit 115 analyzes the intention of the speaker's utterance by classifying the linguistic information of the utterance into categories such as nouns, verbs, and modifiers.

・Request processing unit 116
The request processing unit 116 has a function of performing processing for executing a process based on the request-related information included in the speaker's utterance. For example, the request processing unit 116 generates control information for executing a process based on the request-related information.

・Response generation unit 117
The response generation unit 117 has a function of generating a response to be presented to the speaker. For example, the response generation unit 117 generates control information for performing a nod, an aizuchi (backchannel response), or the like as the response presented to the speaker. For example, by predetermining control information for nods of graded magnitude, such as large, medium, and small, the response generation unit 117 generates control information for a nod whose magnitude corresponds to the state based on the speaker's utterance. As another example, by predetermining a parameter that determines the magnitude of the nod, the response generation unit 117 generates, based on the value of the parameter, control information for a nod whose magnitude corresponds to the state based on the speaker's utterance. Likewise, by predetermining control information for aizuchi that differ in volume, tone, and the like, the response generation unit 117 generates control information for an aizuchi whose volume, tone, and the like correspond to the state based on the speaker's utterance. As another example, by predetermining parameters that determine the volume, tone, and the like of the aizuchi, the response generation unit 117 generates, based on the values of the parameters, control information for an aizuchi whose volume, tone, and the like correspond to the state based on the speaker's utterance. The response generation unit 117 generates control information for producing output that is relative to a reference set according to the speaker.
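The two ways of fixing the nod magnitude described above, graded presets and a continuous parameter, can be contrasted in a small sketch. The angle values, the `NodControl` type, and both function names are hypothetical; the description does not state units or value ranges.

```python
from dataclasses import dataclass

# Assumed presets for the graded "large, medium, small" nods (degrees).
NOD_PRESETS = {"small": 5.0, "medium": 15.0, "large": 30.0}


@dataclass
class NodControl:
    angle_deg: float  # hypothetical control value sent to the terminal device


def nod_from_preset(level: str) -> NodControl:
    """Graded variant: look up one of the predetermined nod sizes."""
    return NodControl(NOD_PRESETS[level])


def nod_from_parameter(intensity: float, max_angle_deg: float = 30.0) -> NodControl:
    """Continuous variant: an intensity in [0, 1] scales the nod angle."""
    clamped = min(max(intensity, 0.0), 1.0)
    return NodControl(clamped * max_angle_deg)
```

The preset table is easy to tune by hand, while the parameterized form lets the magnitude follow an estimated state more smoothly; the same split would apply to the volume and tone of the aizuchi.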

The response generation unit 117 determines whether the ambient sound other than the speaker's utterance is in its steady state. When it is, the response generation unit 117 generates control information for giving an aizuchi at, for example, the steady volume and tone. When the ambient sound other than the speaker's utterance is louder or quieter than in the steady state, the response generation unit 117 generates control information for giving an aizuchi at, for example, a relatively equivalent volume and tone. In this case, the response generation unit 117 generates control information for a nod whose magnitude corresponds to the volume, tone, and the like of the aizuchi.
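One way to read "relatively equivalent volume" is that the aizuchi (backchannel response) keeps the same margin over the ambient sound that it has in the steady state. The sketch below follows that reading; the decibel values, the tolerance, and the function name `aizuchi_volume` are assumptions, not figures from the disclosure.

```python
def aizuchi_volume(
    ambient_db: float,
    steady_ambient_db: float = 40.0,  # assumed steady ambient level
    steady_volume_db: float = 55.0,   # assumed aizuchi level in the steady state
    tolerance_db: float = 3.0,        # assumed band treated as "steady"
) -> float:
    """Return an output level (dB) for the aizuchi given the ambient level."""
    offset = ambient_db - steady_ambient_db
    if abs(offset) <= tolerance_db:
        # Ambient sound is in its steady state: use the steady volume.
        return steady_volume_db
    # Louder or quieter surroundings: shift the output by the same amount so
    # the aizuchi stays equally audible relative to the ambient sound.
    return steady_volume_db + offset
```

Under these assumed defaults, a noisy room at 50 dB raises the aizuchi to 65 dB and a quiet room at 30 dB lowers it to 45 dB.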

When generating control information for a large nod, the response generation unit 117 generates control information for an aizuchi whose volume, tone, and the like correspond to the magnitude of the nod. This allows the response generation unit 117 to synchronize the magnitudes of the nod and the aizuchi that the terminal device 20 is controlled to perform. For example, when controlling the terminal device 20 to perform a large nod, the response generation unit 117 controls the terminal device 20 so that the volume of the aizuchi increases. As another example, when controlling the terminal device 20 to perform a large nod, the response generation unit 117 controls the terminal device 20 so that the aizuchi become more frequent or the interval (timing) between aizuchi becomes shorter.
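The coupling between nod magnitude and aizuchi can be sketched by deriving both from a single scale value, so that a larger nod always comes with a louder and more frequent aizuchi. The field names and the linear mappings below are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class BackchannelControl:
    nod_scale: float           # 0.0 (smallest nod) to 1.0 (largest nod)
    aizuchi_volume: float      # relative to an assumed reference volume of 1.0
    aizuchi_interval_s: float  # seconds between successive aizuchi


def synchronized_control(nod_scale: float) -> BackchannelControl:
    """Derive aizuchi volume and timing from the nod magnitude."""
    s = min(max(nod_scale, 0.0), 1.0)
    return BackchannelControl(
        nod_scale=s,
        aizuchi_volume=0.5 + 0.5 * s,      # louder with larger nods
        aizuchi_interval_s=4.0 - 2.0 * s,  # shorter gaps with larger nods
    )
```

Deriving every output from one value is a simple way to keep the modalities from drifting apart; a real system might instead use separate curves per modality.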

When the speaker's utterance includes an emotional word that the speaker uses habitually, the response generation unit 117 generates control information for a steady response. When the speaker's utterance includes an emotional word that the speaker does not use habitually (one used infrequently) or one that appears for the first time, the response generation unit 117 generates control information for a non-steady response. For example, as a non-steady response, the response generation unit 117 generates control information for asking the speaker to repeat the utterance, leaning forward, showing a puzzled expression, or repeating the word with a rising intonation at the end.
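Distinguishing habitual emotional words from rare or first-time ones requires some per-speaker usage record. A minimal sketch using a counter is shown below; the class name, the threshold, and the response labels are assumptions, and the description does not say how usage frequency is measured.

```python
from collections import Counter


class EmotionWordHistory:
    """Tracks how often one speaker has used each emotional word."""

    def __init__(self, steady_threshold: int = 3):
        self.counts = Counter()
        # Assumed cut-off: words seen at least this often count as habitual.
        self.steady_threshold = steady_threshold

    def response_type(self, word: str) -> str:
        """Classify the word by prior use, then record this occurrence."""
        seen_before = self.counts[word]  # 0 for a first appearance
        self.counts[word] += 1
        if seen_before >= self.steady_threshold:
            return "steady"
        # Rare or first-time word: e.g. ask again, lean forward,
        # or repeat the word with a rising ending.
        return "non_steady"
```

One history object per speaker would be kept, keyed for example by the speaker ID obtained from speaker identification.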

The response generation unit 117 generates a response using the linguistic information included in the speaker's utterance. For example, the response generation unit 117 generates a response using the linguistic information analyzed by the semantic analysis unit 115.

The response generation unit 117 also generates an empathic utterance that repeats an emotional word, that is, a word indicating an emotion, among the linguistic information included in the speaker's utterance. Further, the response generation unit 117 generates an empathic utterance that repeats linguistic information that is not an emotional word but nevertheless expresses the speaker's emotion, among the linguistic information included in the speaker's utterance.

・Utterance execution unit 118
The utterance execution unit 118 has a function of presenting control information for executing an utterance of the terminal device 20 directed at the speaker. For example, the utterance execution unit 118 presents, to the terminal device 20 via the communication unit 100, control information for executing an utterance of the terminal device 20 directed at the speaker.

・Motion presentation unit 119
The motion presentation unit 119 has a function of presenting control information for controlling a motion of the terminal device 20 directed at the speaker. For example, the motion presentation unit 119 presents, to the terminal device 20 via the communication unit 100, control information for controlling a motion of the terminal device 20 directed at the speaker.

(1-3) Storage unit 120
The storage unit 120 is realized by, for example, a semiconductor memory element such as a RAM or a flash memory, or a storage device such as a hard disk or an optical disk. The storage unit 120 has a function of storing data related to processing in the information processing device 10. As shown in FIG. 10, the storage unit 120 includes a speaker information storage unit 121 and an emotional word information storage unit 122.

FIG. 11 shows an example of the speaker information storage unit 121. The speaker information storage unit 121 shown in FIG. 11 stores speaker information. As shown in FIG. 11, the speaker information storage unit 121 may have items such as "speaker ID" and "speaker information".

The "speaker ID" indicates identification information for identifying the speaker. The "speaker information" indicates the speaker information. The example in FIG. 11 shows conceptual values such as "speaker information #1" and "speaker information #2" stored in "speaker information", but in practice imaging information of the speaker and the like are stored.

FIG. 12 shows an example of the emotional word information storage unit 122. The emotional word information storage unit 122 shown in FIG. 12 stores information related to emotional words. As shown in FIG. 12, the emotional word information storage unit 122 may have items such as "emotional word information ID", "emotional word", "synonym", "general co-occurring word", and "speaker co-occurring word".

The "emotional word information ID" indicates identification information for identifying the emotional word information. The "emotional word" indicates the emotional word. The "synonym" indicates a synonym of the emotional word. The "general co-occurring word" indicates, among the words that co-occur with the emotional word, a co-occurring word in general use. The "speaker co-occurring word" indicates, among the words that co-occur with the emotional word, a co-occurring word specific to the speaker.

Here, the emotional words according to the embodiment will be described. An emotional word according to the embodiment need not be one defined in common for all speakers as a general emotional word; it may be linguistic information that frequently appears with a particular expression specific to the speaker. For example, the emotional word information storage unit 122 may store, as an emotional word, linguistic information that frequently appears with a particular expression specific to the speaker. In this case, the information processing system 1 does not repeat the detected expression but presents the linguistic information that co-occurs with the particular expression as the emotional word. For example, when a particular expression such as "busy", "I feel like I'm dying", or "I can't take this" is detected and the linguistic information that frequently appears with that expression is "tough", the information processing system 1 presents "tough" as the emotional word.
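The items of FIG. 12 and the lookup described above can be sketched as a small record type plus a search function. The field names mirror the listed items, but the Python types, the function `word_to_present`, and the idea of storing the speaker-specific expressions in the "speaker co-occurring word" field are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EmotionWordEntry:
    entry_id: str                                          # "emotional word information ID"
    emotion_word: str                                      # "emotional word"
    synonyms: set = field(default_factory=set)             # "synonym"
    general_cooccurring: set = field(default_factory=set)  # "general co-occurring word"
    speaker_cooccurring: set = field(default_factory=set)  # "speaker co-occurring word"


def word_to_present(expression: str, entries: list) -> Optional[str]:
    """Return the emotional word to present for a detected expression."""
    for entry in entries:
        if expression == entry.emotion_word or expression in entry.synonyms:
            # The expression is itself an emotional word: repeat that word.
            return entry.emotion_word
        if (expression in entry.speaker_cooccurring
                or expression in entry.general_cooccurring):
            # The expression co-occurs with an emotional word:
            # present the stored emotional word instead of the expression.
            return entry.emotion_word
    return None  # nothing stored for this expression
```

For instance, with an entry whose emotional word is "tough" and whose speaker co-occurring words include "busy", detecting "busy" would present "tough".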

(2) Terminal device 20
As shown in FIG. 10, the terminal device 20 includes a communication unit 200, a control unit 210, and a presentation unit 220.

(2-1) Communication unit 200
The communication unit 200 has a function of communicating with an external device. For example, in communication with an external device, the communication unit 200 outputs information received from the external device to the control unit 210. Specifically, the communication unit 200 outputs information regarding the acquisition of utterance data, received from the information processing device 10, to the control unit 210. The communication unit 200 also outputs the control information received from the information processing device 10 to the control unit 210.

The communication unit 200 also outputs the control information received from the information processing device 10 to the presentation unit 220.

Further, in communication with an external device, the communication unit 200 transmits information input from the control unit 210 to the external device. Specifically, the communication unit 200 transmits the utterance data input from the control unit 210 to the information processing device 10.

(2-2) Control unit 210
The control unit 210 has a function of controlling the overall operation of the terminal device 20. For example, the control unit 210 controls the utterance data acquisition process performed by the utterance acquisition unit 211. The control unit 210 also controls the process in which the communication unit 200 transmits the utterance data acquired by the utterance acquisition unit 211 to the information processing device 10.

・Utterance acquisition unit 211
The utterance acquisition unit 211 has a function of acquiring the utterance data of the speaker. For example, the utterance acquisition unit 211 acquires utterance data using an utterance (voice) detector provided in the terminal device 20.

・Imaging unit 212
The imaging unit 212 has a function of capturing an image of the speaker.

・Motion control unit 213
The motion control unit 213 has a function of controlling the motion of the terminal device 20. For example, the motion control unit 213 controls the motion of the terminal device 20 according to the acquired control information.

(2-3) Presentation unit 220
The presentation unit 220 has a function of controlling presentation in general. As shown in FIG. 10, the presentation unit 220 includes a voice presentation unit 221 and a motion presentation unit 222.

・Voice presentation unit 221
The voice presentation unit 221 has a function of presenting the voice of the terminal device 20. For example, the voice presentation unit 221 presents voice based on the control information received from the utterance execution unit 118 via the communication unit 200.

・Motion presentation unit 222
The motion presentation unit 222 has a function of presenting the motion of the terminal device 20. For example, the motion presentation unit 222 presents a motion based on the control information received from the motion presentation unit 119 via the communication unit 200.

<2.4. Processing of the information processing system>
The functions of the information processing system 1 according to the embodiment have been described above. Next, the processing of the information processing system 1 will be described.

(1) Processing related to state estimation in the information processing device 10
FIG. 13 is a flowchart showing the flow of processing related to state estimation in the information processing device 10 according to the embodiment. First, the information processing device 10 detects the utterance of the speaker based on the utterance data (S101). For example, the information processing device 10 detects the utterance of a specific speaker. The information processing device 10 then recognizes the utterance of the speaker (S102). For example, the information processing device 10 detects the end of the speaker's utterance. Next, the information processing device 10 determines whether the utterance includes an emotional word (S104). When the speaker's utterance includes an emotional word (S104; YES), the information processing device 10 estimates the state of emotional understanding (S106). When the speaker's utterance does not include an emotional word (S104; NO), the information processing device 10 determines whether the utterance includes request-related information (S108). When the speaker's utterance includes request-related information (S108; YES), the information processing device 10 estimates the state of preparation for execution (S110). When the speaker's utterance does not include request-related information (S108; NO), the information processing device 10 estimates the state of utterance recognition (S112).
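The decision order of FIG. 13 (emotional word first, then request-related information, otherwise plain utterance recognition) can be written as a short function. The word sets below are hypothetical stand-ins for the stored emotional words and for whatever marks request-related information; simple token matching is an assumption, not the disclosed detection method.

```python
from enum import Enum
from typing import Iterable


class State(Enum):
    EMOTION_UNDERSTANDING = "emotion understanding"
    EXECUTION_PREPARATION = "execution preparation"
    UTTERANCE_RECOGNITION = "utterance recognition"


# Hypothetical vocabularies used only for this sketch.
EMOTION_WORDS = {"fun", "busy", "delicious"}
REQUEST_MARKERS = {"please", "show", "play"}


def estimate_state(tokens: Iterable[str]) -> State:
    """Follow the branch order S104 -> S108 described for FIG. 13."""
    words = {token.lower() for token in tokens}
    if words & EMOTION_WORDS:            # S104: YES -> S106
        return State.EMOTION_UNDERSTANDING
    if words & REQUEST_MARKERS:          # S108: YES -> S110
        return State.EXECUTION_PREPARATION
    return State.UTTERANCE_RECOGNITION   # S108: NO -> S112
```

Because the emotional word check comes first, an utterance that contains both an emotional word and a request is classified as emotional understanding in this sketch, matching the order of the flowchart.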

(2) Processing when the state of utterance recognition is estimated
FIG. 14 is a flowchart showing the flow of processing when the information processing device 10 according to the embodiment has estimated the state of utterance recognition. First, the information processing device 10 determines whether the utterance has ended (S200). If the utterance has ended (S200; YES), the information processing device 10 controls the terminal device 20 to give an aizuchi (a backchannel response) by repeating the speaker's words or using a filler (S202). If the utterance has not ended (S200; NO), the information processing device 10 determines whether the speaker is at a pause in the utterance (S204). If the speaker is at a pause (S204; YES), the information processing device 10 controls the terminal device 20 to give an aizuchi at a low volume (S206). If the speaker is not at a pause (S204; NO), the information processing device 10 controls the terminal device 20 to nod with a small motion (S208).

(3) Processing when the state of emotion understanding is estimated
FIG. 15 is a flowchart showing the flow of processing when the information processing device 10 according to the embodiment has estimated the state of emotion understanding. First, the information processing device 10 determines whether the utterance has ended (S300). If the utterance has ended (S300; YES), the information processing device 10 controls the terminal device 20 to repeat the emotional word (S302). If the utterance has not ended (S300; NO), the information processing device 10 determines whether the speaker is at a pause in the utterance (S304). If the speaker is at a pause (S304; YES), the information processing device 10 controls the terminal device 20 to give an aizuchi at a high volume (S306). If the speaker is not at a pause (S304; NO), the information processing device 10 controls the terminal device 20 to nod with a large motion (S308). When the state of emotion understanding is estimated, the information processing device 10 generates control information that is more readily perceptible to the speaker than when the state of utterance recognition shown in FIG. 14 is estimated.
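
The flows of FIG. 14 and FIG. 15 share the same structure and differ mainly in how strong the response is, so they can be sketched with one function. The `Response` type, the string labels for `state` and `timing`, and the name `select_response` are hypothetical names chosen for this sketch; the disclosure only describes the branches, not a data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    kind: str        # "repeat_or_filler", "repeat_emotional_word", "aizuchi", or "nod"
    level: str = ""  # "small" or "large"; empty when not applicable
    text: str = ""   # word to repeat, if any


def select_response(state: str, timing: str, emotional_word: str = "") -> Response:
    """state: "recognition" (FIG. 14) or "emotion" (FIG. 15).
    timing: "end" of the utterance, "pause" within it, or "speaking"."""
    emotion = state == "emotion"
    level = "large" if emotion else "small"
    if timing == "end":                       # S200 / S300; YES
        if emotion:
            return Response("repeat_emotional_word", text=emotional_word)  # S302
        return Response("repeat_or_filler")   # S202
    if timing == "pause":                     # S204 / S304; YES
        return Response("aizuchi", level)     # S206 / S306
    return Response("nod", level)             # S208 / S308
```

For the same timing, the emotion-understanding branch always returns the "large" level, which is one way to express that its responses are easier for the speaker to perceive.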

(4) Processing when the state of execution preparation is estimated
FIG. 16 is a flowchart showing the flow of processing when the information processing device 10 according to the embodiment has estimated the state of execution preparation. First, the information processing device 10 determines whether it has acquired enough of the utterance to execute the request (S400). If it determines that it has (S400; YES), the information processing device 10 controls the terminal device 20 to execute processing based on the request-related information (S402). If it determines that it has not (S400; NO), the information processing device 10 determines whether it has acquired a cancel utterance, that is, an utterance indicating that execution is to be cancelled (S404). If it determines that a cancel utterance has been acquired (S404; YES), the information processing device 10 ends the processing. If it determines that no cancel utterance has been acquired (S404; NO), the information processing device 10 controls the terminal device 20 to make a prompting utterance, that is, an utterance that prompts the speaker to provide further information about the request (S406). The processing then returns to S400.
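
The loop of FIG. 16 can be sketched as below. The idea of "enough of the utterance" is modeled here as a fixed set of required slots (`REQUIRED_SLOTS`); that slot model, the dictionary layout of each utterance, and the `prompt` callback are assumptions for illustration, since the disclosure does not define how sufficiency is judged.

```python
from typing import Callable, Iterable

# Hypothetical definition of what a request needs before it can be executed.
REQUIRED_SLOTS = ("action", "target")


def run_execution_preparation(
    utterances: Iterable[dict],
    prompt: Callable[[str], None],
) -> str:
    """Return "executed", "cancelled", or "incomplete" (the input ran out)."""
    slots: dict = {}
    for utt in utterances:
        slots.update(utt.get("slots", {}))
        if all(name in slots for name in REQUIRED_SLOTS):  # S400; YES
            return "executed"                              # S402
        if utt.get("cancel", False):                       # S404; YES
            return "cancelled"
        missing = [n for n in REQUIRED_SLOTS if n not in slots]
        prompt("Please tell me the " + missing[0])         # S406, then back to S400
    return "incomplete"
```

The "incomplete" result is not part of FIG. 16; it exists only so that the sketch terminates when the finite list of test utterances is exhausted.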

<2.5. Variations of processing>
The embodiment of the present disclosure has been described above. Next, variations of the processing of the embodiment are described. The variations described below may be applied to the embodiment individually or in combination. A variation may also be applied in place of a configuration described in the embodiment, or in addition to it.

(1) Expression
The above embodiment describes the case in which the response generation unit 117 generates control information for responses that differ in the size of the nod and in the volume, tone, and the like of the aizuchi, but the disclosure is not limited to this example. The response generation unit 117 may generate control information for responses that differ in the intensity of a facial expression or in the scale of an animated expression. For example, the response generation unit 117 may generate control information for responses that differ in facial expression, in the movement of the tail or ears of an animal or similar character, or in the clothing and accessories worn. In this way, the response generation unit 117 may generate control information about on-screen visual expression.

The response generation unit 117 may also generate control information for responses that differ in the manner of nodding, aizuchi, and the like, depending on the character presented by the terminal device 20. For example, when the character presented by the terminal device 20 is businesslike, the response generation unit 117 may generate control information for responses with a small difference between strong and weak responses, and for aizuchi in polite language such as "hai" ("yes") or "sou desu ka" ("is that so?"). As another example, when the character presented by the terminal device 20 is casual, the response generation unit 117 may generate control information for responses with a large difference between strong and weak responses, and for aizuchi in everyday language such as "un" ("yeah"), "naruhodo" ("I see"), or "hee" ("oh, really").
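
A character-dependent response style could be held in a small lookup table, as in the following sketch. The aizuchi words follow the examples in the text (romanized); the numeric `weak` and `strong` levels, the table name `CHARACTER_STYLES`, and the round-robin choice of words are illustrative assumptions only.

```python
# Numeric levels are made up for the example; only their spread matters:
# the businesslike character has a narrow range, the casual one a wide range.
CHARACTER_STYLES = {
    "business": {"aizuchi": ["hai", "sou desu ka"], "weak": 0.4, "strong": 0.6},
    "casual": {"aizuchi": ["un", "naruhodo", "hee"], "weak": 0.2, "strong": 0.9},
}


def aizuchi_control(character: str, strong: bool, turn: int) -> dict:
    """Build control information for one aizuchi of the given character."""
    style = CHARACTER_STYLES[character]
    words = style["aizuchi"]
    return {
        "text": words[turn % len(words)],  # cycle through the character's words
        "level": style["strong" if strong else "weak"],
    }
```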

(2) Personalization
- Matching the pause to the individual
The response generation unit 117 may generate control information for responses whose pauses differ depending on the speaker. For example, the response generation unit 117 may estimate the pauses in a speaker's utterances by identifying the speaker from utterance data, imaging information, and the like, and storing the speaking rate, pauses, and so on for each speaker. The response generation unit 117 may then learn, as training data, dialogues in which responses such as aizuchi did not overlap with the speaker's utterance. This allows the response generation unit 117 to adapt so as to avoid overlapping responses. When the pause is uncertain, the response generation unit 117 may generate control information for, for example, a low-volume aizuchi or a small nod. This allows the information processing system 1 to present a response without disrupting the speaker's utterance.
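
One simple way to store per-speaker pauses and to fall back to a subdued response while the pause is still uncertain is sketched below. The class `PauseModel`, the use of a plain mean, the `min_samples` threshold, and the "normal" label are assumptions for this example; the disclosure does not state how the pause is estimated or when it counts as uncertain.

```python
from collections import defaultdict
from statistics import mean


class PauseModel:
    """Keeps observed pause lengths (in seconds) for each identified speaker."""

    def __init__(self, min_samples: int = 3):
        self.min_samples = min_samples
        self.pauses = defaultdict(list)

    def observe(self, speaker_id: str, pause_sec: float) -> None:
        self.pauses[speaker_id].append(pause_sec)

    def expected_pause(self, speaker_id: str):
        samples = self.pauses.get(speaker_id, [])
        if len(samples) < self.min_samples:
            return None  # too few observations: the pause is still uncertain
        return mean(samples)

    def response_level(self, speaker_id: str) -> str:
        # Uncertain pause -> low-volume aizuchi or small nod, as in the text.
        return "small" if self.expected_pause(speaker_id) is None else "normal"
```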

- Personalizing aizuchi patterns and the repetition used for emotion understanding
The response generation unit 117 may learn, for each speaker, the aizuchi patterns after which the speaker is most likely to keep talking, by varying the length of the aizuchi and its wording. The response generation unit 117 may also learn to use an aizuchi more often when the amount the speaker says increases after that aizuchi.
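
The second learning rule above, using an aizuchi more often when the speaker says more after it, can be sketched as a per-speaker weight table with weighted random selection. This is a deliberately simple stand-in, not the learning method of the disclosure, which is not specified; the class `AizuchiSelector`, the fixed increment of 1.0, and the word-count comparison are all assumptions.

```python
import random
from collections import defaultdict


class AizuchiSelector:
    def __init__(self, candidates, seed=None):
        self.candidates = list(candidates)
        # weights[speaker][aizuchi]; every candidate starts at 1.0
        self.weights = defaultdict(lambda: {c: 1.0 for c in self.candidates})
        self.rng = random.Random(seed)

    def choose(self, speaker_id):
        """Pick an aizuchi with probability proportional to its weight."""
        w = self.weights[speaker_id]
        return self.rng.choices(
            self.candidates, weights=[w[c] for c in self.candidates]
        )[0]

    def feedback(self, speaker_id, aizuchi, words_before, words_after):
        # Reinforce an aizuchi only when the speaker talked more after it.
        if words_after > words_before:
            self.weights[speaker_id][aizuchi] += 1.0
```

Keeping the weights per speaker means that an aizuchi that works well for one speaker does not change the selection for another.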

- Personalizing state transitions
For a speaker who uses many emotional words, the state estimation unit 114 may lower the frequency of transitions from the state of utterance recognition to the state of emotion understanding. This allows the information processing device 10 to control the terminal device 20 so that repetition does not become excessive. For such a speaker, the response generation unit 117 may also generate control information for responses that vary the way emotion understanding is expressed. For example, the response generation unit 117 may access the emotional word information storage unit 122 or the like and perform processing that uses synonyms.
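
Lowering the transition frequency could be done in many ways; the sketch below simply lets only every n-th emotional-word utterance trigger the transition for speakers whose past rate of emotional words is high. The parameters `rate_threshold` and `every_n`, and the function name, are hypothetical and chosen only to make the idea concrete.

```python
def should_enter_emotion_state(
    emotional_word_rate: float,
    emotional_hits: int,
    rate_threshold: float = 0.2,
    every_n: int = 3,
) -> bool:
    """emotional_word_rate: share of the speaker's past utterances that
    contained an emotional word. emotional_hits: running count of such
    utterances in the current dialogue, including the current one."""
    if emotional_word_rate < rate_threshold:
        return True                         # ordinary speaker: always transition
    return emotional_hits % every_n == 0    # frequent user: every n-th time only
```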

When the speaker is someone who is routinely busy, the response generation unit 117 may, in the state of execution preparation, generate control information for carrying out the processing without repeating the request back for confirmation. This allows the information processing device 10 to control the terminal device 20 so that the request is executed as soon as the speaker utters it.

- Restricting estimation of the state of emotion understanding
When the state estimation unit 114 determines that the speaker's emotion is in a steady (neutral) state, it need not estimate the state of emotion understanding even if the speaker's utterance includes an emotional word. For example, the state estimation unit 114 need not estimate the state of emotion understanding when it determines that the speaker's emotion is in a steady state based on the result of recognizing the speaker's facial expression from imaging information. As another example, the state estimation unit 114 need not estimate the state of emotion understanding when it determines that the speaker's emotion is in a steady state based on the result of utterance recognition processing that uses the intonation of the speaker's utterance, paralanguage, and the like. In addition, when language processing of the utterance indicates that an emotional word in the utterance does not express the speaker's own emotion but instead describes another person's emotion or is quoted from another person's text, the state estimation unit 114 need not estimate the state of emotion understanding.
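
The restrictions above amount to a gate placed in front of the emotion-understanding state. A minimal sketch follows; the string labels ("neutral", "unknown"), the argument names, and the choice to suppress the state when either the facial or the prosodic cue is neutral are assumptions, because the disclosure presents the two cues as separate examples and does not say how they combine.

```python
def should_estimate_emotion_understanding(
    has_emotional_word: bool,
    facial_emotion: str = "unknown",
    prosody_emotion: str = "unknown",
    emotional_word_is_quoted: bool = False,
) -> bool:
    """Return True only when the emotion-understanding state should be estimated."""
    if not has_emotional_word:
        return False
    if emotional_word_is_quoted:      # another person's emotion or quoted text
        return False
    if "neutral" in (facial_emotion, prosody_emotion):
        return False                  # speaker judged to be in a steady state
    return True
```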

<<3. Application examples>>
The embodiment of the present disclosure has been described above. Next, application examples of the information processing system 1 according to the embodiment are described.

<3.1. Visual and hearing impairment>
The above embodiment can also be applied in the medical field, for example for people with visual or hearing impairments. When the speaker is visually impaired, the speaker is unlikely to be able to perceive visual responses such as nods properly. Therefore, when the speaker is visually impaired, the information processing system 1 may respond with an aizuchi instead of a nod. In this case, the response generation unit 117 may generate control information for responding with an aizuchi, instead of a nod, at the timing at which it would otherwise respond with a nod. Conversely, when the speaker is hearing impaired, the speaker is unlikely to be able to perceive auditory responses such as aizuchi properly. Therefore, when the speaker is hearing impaired, the information processing system 1 may respond with a nod instead of an aizuchi. In this case, the response generation unit 117 may generate control information for responding with a nod, instead of an aizuchi, at the timing at which it would otherwise respond with an aizuchi.
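
The substitution described above keeps the response timing and changes only the modality. It can be sketched as a small mapping; the labels "visual", "hearing", "nod", and "aizuchi" and the function name `adapt_modality` are hypothetical names used for illustration.

```python
def adapt_modality(response_kind: str, impairment: str = "none") -> str:
    """Swap a response the speaker cannot perceive for one they can."""
    if impairment == "visual" and response_kind == "nod":
        return "aizuchi"   # audible response replaces the visual one
    if impairment == "hearing" and response_kind == "aizuchi":
        return "nod"       # visual response replaces the audible one
    return response_kind
```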

<3.2. Elderly people>
The above embodiment can also be applied in the field of nursing care, for example for elderly people. When the speaker is elderly, the information processing system 1 may slow the tempo of response actions such as nods and aizuchi. The information processing system 1 may also raise detection thresholds such as the length of the pause used for end-of-utterance detection. This allows the information processing system 1 to control the timing so that the speaker's utterance and the utterance of the terminal device 20 do not overlap. The information processing system 1 may also enlarge the changes in the facial expression shown by the terminal device 20. Furthermore, for an elderly speaker with reduced hearing, the information processing system 1 may enlarge changes in the response, such as the speech volume, even when the ambient sound is steady. In this way, even when the terminal device 20 also interacts with people other than the speaker (for example, the speaker's family), the information processing system 1 can respond in a way suited to the speaker by changing the responses of the terminal device 20 relative to those it gives to the other people.
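
The adjustments for an elderly speaker can be sketched as a transformation of a set of response parameters. The `ResponseParams` fields and every scaling factor below (0.7, 2.0, 1.5) are illustrative assumptions; the disclosure only says which quantities may be made slower or larger, not by how much.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ResponseParams:
    tempo: float = 1.0            # relative speed of nods and aizuchi
    end_silence_sec: float = 0.5  # silence needed before the end of an utterance is declared
    expression_gain: float = 1.0  # size of facial-expression changes
    volume_gain: float = 1.0      # size of volume changes


def adapt_for_elderly(base: ResponseParams, reduced_hearing: bool = False) -> ResponseParams:
    """Return a per-speaker copy of the parameters; `base` stays unchanged
    so that other speakers (for example, family members) keep the defaults."""
    params = replace(
        base,
        tempo=base.tempo * 0.7,
        end_silence_sec=base.end_silence_sec * 2.0,
        expression_gain=base.expression_gain * 1.5,
    )
    if reduced_hearing:
        params = replace(params, volume_gain=params.volume_gain * 1.5)
    return params
```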

<<4. Hardware configuration example>>
Finally, a hardware configuration example of the information processing device according to the embodiment is described with reference to FIG. 17. FIG. 17 is a block diagram showing a hardware configuration example of the information processing device according to the embodiment. The information processing device 900 shown in FIG. 17 can implement, for example, the information processing device 10 and the terminal device 20 shown in FIG. 10. Information processing by the information processing device 10 and the terminal device 20 according to the embodiment is realized through cooperation between software and the hardware described below.

As shown in FIG. 17, the information processing device 900 includes a CPU (Central Processing Unit) 901, a ROM (Read Only Memory) 902, and a RAM (Random Access Memory) 903. The information processing device 900 also includes a host bus 904a, a bridge 904, an external bus 904b, an interface 905, an input device 906, an output device 907, a storage device 908, a drive 909, a connection port 910, and a communication device 911. The hardware configuration shown here is an example, and some of the components may be omitted. The hardware configuration may also include components other than those shown here.

The CPU 901 functions, for example, as an arithmetic processing device or a control device, and controls all or part of the operation of each component based on various programs recorded in the ROM 902, the RAM 903, or the storage device 908. The ROM 902 is a means for storing programs read by the CPU 901, data used in computation, and the like. The RAM 903 temporarily or permanently stores, for example, programs read by the CPU 901 and various parameters that change as appropriate while those programs run. These components are connected to one another by the host bus 904a, which includes a CPU bus or the like. The CPU 901, the ROM 902, and the RAM 903 can, in cooperation with software, realize the functions of the control unit 110 and the control unit 210 described with reference to FIG. 10, for example.

The CPU 901, the ROM 902, and the RAM 903 are connected to one another via, for example, the host bus 904a, which is capable of high-speed data transmission. The host bus 904a is in turn connected, for example via the bridge 904, to the external bus 904b, which has a relatively low data transmission speed. The external bus 904b is connected to various components via the interface 905.

The input device 906 is realized by a device through which the speaker inputs information, such as a mouse, a keyboard, a touch panel, a button, a microphone, a switch, or a lever. The input device 906 may also be, for example, a remote control device that uses infrared rays or other radio waves, or an externally connected device such as a mobile phone or a PDA that supports operation of the information processing device 900. The input device 906 may further include, for example, an input control circuit that generates an input signal based on the information the speaker has input using the above input means and outputs the signal to the CPU 901. By operating the input device 906, the speaker using the information processing device 900 can input various data to the information processing device 900 and instruct it to perform processing operations.

The input device 906 may also be formed by a device that detects information about the speaker. For example, the input device 906 may include various sensors such as an image sensor (for example, a camera), a depth sensor (for example, a stereo camera), an acceleration sensor, a gyro sensor, a geomagnetic sensor, an optical sensor, a sound sensor, a distance measuring sensor (for example, a ToF (Time of Flight) sensor), and a force sensor. The input device 906 may also acquire information about the state of the information processing device 900 itself, such as its posture and moving speed, and information about the surrounding environment of the information processing device 900, such as the ambient brightness and noise. The input device 906 may also include a GNSS module that receives a GNSS signal from a GNSS (Global Navigation Satellite System) satellite (for example, a GPS signal from a GPS (Global Positioning System) satellite) and measures position information including the latitude, longitude, and altitude of the device. As for position information, the input device 906 may instead detect the position through Wi-Fi (registered trademark), transmission to and reception from a mobile phone, PHS, smartphone, or the like, or short-range communication. The input device 906 can realize, for example, the function of the utterance acquisition unit 211 described with reference to FIG. 10.

The output device 907 is formed by a device capable of notifying the speaker of acquired information visually or audibly. Such devices include display devices such as CRT display devices, liquid crystal display devices, plasma display devices, EL display devices, laser projectors, LED projectors, and lamps; audio output devices such as speakers and headphones; and printer devices. The output device 907 outputs, for example, results obtained by the various processes performed by the information processing device 900. Specifically, a display device visually displays the results obtained by the various processes performed by the information processing device 900 in various formats such as text, images, tables, and graphs. An audio output device converts an audio signal consisting of reproduced voice data, acoustic data, or the like into an analog signal and outputs it audibly. The output device 907 can realize, for example, the function of the presentation unit 220 described with reference to FIG. 10.

The storage device 908 is a data storage device formed as an example of the storage unit of the information processing device 900. The storage device 908 is realized by, for example, a magnetic storage device such as an HDD, a semiconductor storage device, an optical storage device, or a magneto-optical storage device. The storage device 908 may include a storage medium, a recording device that records data on the storage medium, a reading device that reads data from the storage medium, a deletion device that deletes data recorded on the storage medium, and the like. The storage device 908 stores programs executed by the CPU 901, various data, various data acquired from outside, and the like. The storage device 908 can realize, for example, the function of the storage unit 120 described with reference to FIG. 10.

The drive 909 is a reader/writer for storage media and is built into or externally attached to the information processing device 900. The drive 909 reads information recorded on a mounted removable storage medium, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, and outputs the information to the RAM 903. The drive 909 can also write information to the removable storage medium.

The connection port 910 is a port for connecting an externally connected device, such as a USB (Universal Serial Bus) port, an IEEE 1394 port, a SCSI (Small Computer System Interface) port, an RS-232C port, or an optical audio terminal.

The communication device 911 is, for example, a communication interface formed by a communication device or the like for connecting to the network 920. The communication device 911 is, for example, a communication card for wired or wireless LAN (Local Area Network), LTE (Long Term Evolution), Bluetooth (registered trademark), or WUSB (Wireless USB). The communication device 911 may also be a router for optical communication, a router for ADSL (Asymmetric Digital Subscriber Line), a modem for various types of communication, or the like. The communication device 911 can transmit and receive signals and the like to and from the Internet and other communication devices in accordance with a predetermined protocol such as TCP/IP. The communication device 911 can realize, for example, the functions of the communication unit 100 and the communication unit 200 described with reference to FIG. 10.

The network 920 is a wired or wireless transmission path for information transmitted from devices connected to the network 920. For example, the network 920 may include public networks such as the Internet, a telephone network, and a satellite communication network, as well as various LANs (Local Area Networks) including Ethernet (registered trademark), WANs (Wide Area Networks), and the like. The network 920 may also include a dedicated network such as an IP-VPN (Internet Protocol-Virtual Private Network).

An example of a hardware configuration capable of realizing the functions of the information processing device 900 according to the embodiment has been shown above. Each of the above components may be realized using general-purpose members or by hardware specialized for the function of that component. The hardware configuration to be used can therefore be changed as appropriate according to the technical level at the time the embodiment is implemented.

<<5. Summary>>
As described above, the information processing device 10 according to the embodiment generates output information based on the result of estimating the state of emotion understanding from the speaker's utterance. This allows the information processing device 10 to control the operation of the terminal device 20 based on that estimation result.

It is therefore possible to provide a new and improved information processing device, information processing method, and information processing program capable of realizing natural dialogue that follows the intent of the speaker's utterance.

Although preferred embodiments of the present disclosure have been described in detail with reference to the accompanying drawings, the technical scope of the present disclosure is not limited to these examples. It is clear that a person with ordinary knowledge in the technical field of the present disclosure could conceive of various changes or modifications within the scope of the technical ideas described in the claims, and these are naturally understood to belong to the technical scope of the present disclosure as well.

例えば、本明細書において説明した各装置は、単独の装置として実現されてもよく、一部または全部が別々の装置として実現されても良い。例えば、図１０に示した情報処理装置１０及び端末装置２０は、それぞれ単独の装置として実現されてもよい。また、例えば、情報処理装置１０及び端末装置２０とネットワーク等で接続されたサーバ装置として実現されてもよい。また、情報処理装置１０が有する制御部１１０の機能をネットワーク等で接続されたサーバ装置が有する構成であってもよい。 For example, each device described herein may be realized as a single device, or part or all of it may be realized as a separate device. For example, the information processing device 10 and the terminal device 20 shown in FIG. 10 may be realized as independent devices. Further, for example, it may be realized as a server device connected to the information processing device 10 and the terminal device 20 via a network or the like. Further, the server device connected by a network or the like may have the function of the control unit 110 of the information processing device 10.

また、本明細書において説明した各装置による一連の処理は、ソフトウェア、ハードウェア、及びソフトウェアとハードウェアとの組合せのいずれを用いて実現されてもよい。ソフトウェアを構成するプログラムは、例えば、各装置の内部又は外部に設けられる記録媒体（非一時的な媒体：ｎｏｎ−ｔｒａｎｓｉｔｏｒｙ ｍｅｄｉａ）に予め格納される。そして、各プログラムは、例えば、コンピュータによる実行時にＲＡＭに読み込まれ、ＣＰＵなどのプロセッサにより実行される。 In addition, the series of processes performed by each device described in the present specification may be realized using software, hardware, or a combination of software and hardware. The programs constituting the software are stored in advance, for example, on a recording medium (non-transitory media) provided inside or outside each device. Each program is then, for example, read into RAM when executed by a computer and executed by a processor such as a CPU.

また、本明細書においてフローチャートを用いて説明した処理は、必ずしも図示された順序で実行されなくてもよい。いくつかの処理ステップは、並列的に実行されてもよい。また、追加的な処理ステップが採用されてもよく、一部の処理ステップが省略されてもよい。 Further, the processes described with reference to the flowchart in the present specification do not necessarily have to be executed in the order shown in the drawings. Some processing steps may be performed in parallel. Further, additional processing steps may be adopted, and some processing steps may be omitted.

また、本明細書に記載された効果は、あくまで説明的または例示的なものであって限定的ではない。つまり、本開示に係る技術は、上記の効果とともに、または上記の効果に代えて、本明細書の記載から当業者には明らかな他の効果を奏しうる。 In addition, the effects described herein are merely explanatory or exemplary and are not limiting. That is, the technology according to the present disclosure may exhibit other effects that are apparent to those skilled in the art from the description herein, in addition to or in place of the above effects.

なお、以下のような構成も本開示の技術的範囲に属する。
（１）
話者の発話に基づく感情を理解する感情理解の状態を推定する状態推定部と、
前記状態推定部による推定結果に基づいた出力情報を生成する応答生成部と、
を備える、情報処理装置。
（２）
前記状態推定部は、
前記感情理解を含む複数の状態を推定する、
前記（１）に記載の情報処理装置。
（３）
前記状態推定部は、
前記複数の状態として、前記感情理解、前記話者の発話を認識する発話認識、及び、当該話者の発話に含まれる依頼に関する情報に基づく処理を実行するための準備である実行準備処理のうち少なくともいずれか一つの状態を推定する、
前記（２）に記載の情報処理装置。
（４）
前記状態推定部は、
前記話者の発話に含まれる言語情報のうち感情を示す感情語に応じた前記感情理解の状態を推定する、
前記（１）〜（３）のいずれか一項に記載の情報処理装置。
（５）
前記状態推定部は、
前記話者の発話に含まれる言語情報のうち感情を示す感情語以外の言語情報であって、当該話者の感情を表現する言語情報に応じた前記感情理解の状態を推定する、
前記（１）〜（４）のいずれか一項に記載の情報処理装置。
（６）
前記応答生成部は、
前記話者の発話の終端に関する情報に基づいて、当該話者の発話を認識する発話認識に基づいた前記出力情報を生成する、
前記（１）〜（５）のいずれか一項に記載の情報処理装置。
（７）
前記応答生成部は、
前記話者の発話の終端に関する情報に基づいて、前記感情理解に基づいた前記出力情報を生成する、
前記（１）〜（６）のいずれか一項に記載の情報処理装置。
（８）
前記応答生成部は、
前記感情理解に基づいた前記出力情報として、前記話者の発話に含まれる言語情報のうち感情を示す感情語を復唱するための共感発話を生成する、
前記（７）に記載の情報処理装置。
（９）
前記応答生成部は、
前記感情語に対応する同義語として予め定められた言語情報を復唱するための共感発話を生成する、
前記（８）に記載の情報処理装置。
（１０）
前記応答生成部は、
前記感情理解に基づいた前記出力情報として、前記話者の発話に含まれる言語情報のうち感情を示す感情語以外の言語情報であって、当該話者の感情を表現する言語情報を復唱するための共感発話を生成する、
前記（７）〜（９）のいずれか一項に記載の情報処理装置。
（１１）
前記応答生成部は、
前記話者の発話に含まれる依頼に関する情報が所定の条件を満たす場合、当該依頼に関する情報に基づいた前記出力情報を生成する、
前記（１）〜（１０）のいずれか一項に記載の情報処理装置。
（１２）
前記応答生成部は、
前記話者の発話に含まれる依頼に関する情報が所定の条件を満たさない場合、当該話者に対して当該依頼に関する情報を発話するよう促すための前記出力情報を生成する、
前記（１）〜（１１）のいずれか一項に記載の情報処理装置。
（１３）
前記応答生成部は、
前記出力情報として、音声情報、又は、動作情報を生成する、
前記（１）〜（１２）のいずれか一項に記載の情報処理装置。
（１４）
前記応答生成部は、
前記出力情報として、映像上の表現に関する前記動作情報を生成する、
前記（１３）に記載の情報処理装置。
（１５）
前記応答生成部は、
前記感情理解に基づいた前記出力情報として、前記話者の発話を認識する発話認識に基づいた前記出力情報よりも、当該話者にとって認識可能な前記音声情報、又は、前記動作情報を生成する、
前記（１３）又は（１４）に記載の情報処理装置。
（１６）
前記応答生成部は、
前記話者に応じた基準と比較して相対的な前記出力情報を生成する、
前記（１３）〜（１５）のいずれか一項に記載の情報処理装置。
（１７）
前記応答生成部は、
前記出力情報として、前記話者の周囲の環境に応じた音量での前記音声情報を生成する、
前記（１６）に記載の情報処理装置。
（１８）
コンピュータが、
話者の発話に基づく感情を理解する感情理解の状態を推定し、
推定された推定結果に基づいた出力情報を生成する、
情報処理方法。
（１９）
話者の発話に基づく感情を理解する感情理解の状態を推定する状態推定手順と、
推定された推定結果に基づいた出力情報を生成する応答生成手順と、
をコンピュータに実行させることを特徴とする情報処理プログラム。
The following configurations also belong to the technical scope of the present disclosure.
(1)
An information processing device comprising:
a state estimation unit that estimates a state of emotional understanding, in which an emotion based on a speaker's utterance is understood; and
a response generation unit that generates output information based on a result of the estimation by the state estimation unit.
(2)
The information processing device according to (1) above, wherein
the state estimation unit estimates a plurality of states including the emotional understanding.
(3)
The information processing device according to (2) above, wherein
the state estimation unit estimates, as the plurality of states, at least one of: the emotional understanding; utterance recognition, in which the speaker's utterance is recognized; and execution preparation processing, which is preparation for executing processing based on information regarding a request included in the speaker's utterance.
(4)
The information processing device according to any one of (1) to (3) above, wherein
the state estimation unit estimates the state of the emotional understanding according to an emotional word indicating an emotion among the linguistic information included in the speaker's utterance.
(5)
The information processing device according to any one of (1) to (4) above, wherein
the state estimation unit estimates the state of the emotional understanding according to linguistic information that is included in the speaker's utterance, is other than an emotional word indicating an emotion, and expresses the speaker's emotion.
(6)
The information processing device according to any one of (1) to (5) above, wherein
the response generation unit generates, based on information regarding the end of the speaker's utterance, the output information based on utterance recognition, in which the speaker's utterance is recognized.
(7)
The information processing device according to any one of (1) to (6) above, wherein
the response generation unit generates, based on information regarding the end of the speaker's utterance, the output information based on the emotional understanding.
(8)
The information processing device according to (7) above, wherein
the response generation unit generates, as the output information based on the emotional understanding, an empathic utterance that repeats an emotional word indicating an emotion among the linguistic information included in the speaker's utterance.
(9)
The information processing device according to (8) above, wherein
the response generation unit generates an empathic utterance that repeats linguistic information predetermined as a synonym corresponding to the emotional word.
(10)
The information processing device according to any one of (7) to (9) above, wherein
the response generation unit generates, as the output information based on the emotional understanding, an empathic utterance that repeats linguistic information that is included in the speaker's utterance, is other than an emotional word indicating an emotion, and expresses the speaker's emotion.
(11)
The information processing device according to any one of (1) to (10) above, wherein
the response generation unit generates, when information regarding a request included in the speaker's utterance satisfies a predetermined condition, the output information based on the information regarding the request.
(12)
The information processing device according to any one of (1) to (11) above, wherein
the response generation unit generates, when information regarding a request included in the speaker's utterance does not satisfy a predetermined condition, the output information for prompting the speaker to utter the information regarding the request.
(13)
The information processing device according to any one of (1) to (12) above, wherein
the response generation unit generates voice information or operation information as the output information.
(14)
The information processing device according to (13) above, wherein
the response generation unit generates, as the output information, the operation information regarding an expression on video.
(15)
The information processing device according to (13) or (14) above, wherein
the response generation unit generates, as the output information based on the emotional understanding, the voice information or the operation information that is more recognizable to the speaker than the output information based on utterance recognition, in which the speaker's utterance is recognized.
(16)
The information processing device according to any one of (13) to (15) above, wherein
the response generation unit generates the output information relative to a criterion that depends on the speaker.
(17)
The information processing device according to (16) above, wherein
the response generation unit generates, as the output information, the voice information at a volume that depends on the environment around the speaker.
(18)
An information processing method in which a computer:
estimates a state of emotional understanding, in which an emotion based on a speaker's utterance is understood; and
generates output information based on a result of the estimation.
(19)
An information processing program that causes a computer to execute:
a state estimation procedure of estimating a state of emotional understanding, in which an emotion based on a speaker's utterance is understood; and
a response generation procedure of generating output information based on a result of the estimation.
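
As an illustration of configurations (8), (9), (11) and (12), the following Python sketch shows one way an empathic utterance and a request prompt could be generated. It is only a sketch under stated assumptions: the emotional-word list, the synonym table, the required request fields, the choice of "all required fields are present" as the predetermined condition, and the response wording are all hypothetical and are not taken from the disclosure.

```python
from typing import Optional

# Hypothetical data; the disclosure does not specify these tables.
EMOTIONAL_WORDS = {"tired", "angry", "happy", "sad"}
SYNONYMS = {"exhausted": "tired", "furious": "angry"}  # word -> predetermined synonym
REQUIRED_FIELDS = ("what", "when")  # assumed "predetermined condition": all present


def empathic_utterance(utterance: str) -> Optional[str]:
    """Repeat the first emotional word found, or its predetermined synonym.

    Returns None when the utterance contains no known emotional word.
    """
    for raw in utterance.split():
        word = raw.strip(".,!?").lower()
        if word in SYNONYMS:          # configuration (9): repeat a synonym
            return f"{SYNONYMS[word].capitalize()}, I see."
        if word in EMOTIONAL_WORDS:   # configuration (8): repeat the word itself
            return f"{word.capitalize()}, I see."
    return None


def respond_to_request(request: dict) -> str:
    """Act on a complete request, or prompt the speaker for what is missing."""
    missing = [name for name in REQUIRED_FIELDS if not request.get(name)]
    if not missing:                   # configuration (11): condition satisfied
        return f"OK: {request['what']} at {request['when']}."
    # configuration (12): prompt the speaker to utter the missing information
    return f"Could you tell me {missing[0]}?"


if __name__ == "__main__":
    print(empathic_utterance("I'm exhausted today"))
    print(respond_to_request({"what": "a reminder"}))
```

Keeping the synonym lookup ahead of the direct match is a design choice of this sketch only; it lets a table entry override plain repetition when both would apply.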

１ 情報処理システム 1 Information processing system
１０ 情報処理装置 10 Information processing device
２０ 端末装置 20 Terminal device
１００ 通信部 100 Communication unit
１１０ 制御部 110 Control unit
１１１ 話者識別部 111 Speaker identification unit
１１２ 発話検出部 112 Speech detection unit
１１３ 発話認識部 113 Speech recognition unit
１１４ 状態推定部 114 State estimation unit
１１５ 意味解析部 115 Semantic analysis unit
１１６ 依頼処理部 116 Request processing unit
１１７ 応答生成部 117 Response generation unit
１１８ 発話実行部 118 Speech execution unit
１１９ 動作提示部 119 Motion presentation unit
１２０ 記憶部 120 Storage unit
２００ 通信部 200 Communication unit
２１０ 制御部 210 Control unit
２１１ 発話取得部 211 Speech acquisition unit
２１２ 撮像部 212 Imaging unit
２１３ 動作制御部 213 Motion control unit
２２０ 提示部 220 Presentation unit
２２１ 音声提示部 221 Voice presentation unit
２２２ 動作提示部 222 Motion presentation unit

Claims

1. An information processing device comprising:
a state estimation unit that estimates a state of emotional understanding, in which an emotion based on a speaker's utterance is understood; and
a response generation unit that generates output information based on a result of the estimation by the state estimation unit.

2. The information processing device according to claim 1, wherein
the state estimation unit estimates a plurality of states including the emotional understanding.

3. The information processing device according to claim 2, wherein
the state estimation unit estimates, as the plurality of states, at least one of: the emotional understanding; utterance recognition, in which the speaker's utterance is recognized; and execution preparation processing, which is preparation for executing processing based on information regarding a request included in the speaker's utterance.

4. The information processing device according to claim 1, wherein
the state estimation unit estimates the state of the emotional understanding according to an emotional word indicating an emotion among the linguistic information included in the speaker's utterance.

5. The information processing device according to claim 1, wherein
the state estimation unit estimates the state of the emotional understanding according to linguistic information that is included in the speaker's utterance, is other than an emotional word indicating an emotion, and expresses the speaker's emotion.

6. The information processing device according to claim 1, wherein
the response generation unit generates, based on information regarding the end of the speaker's utterance, the output information based on utterance recognition, in which the speaker's utterance is recognized.

7. The information processing device according to claim 1, wherein
the response generation unit generates, based on information regarding the end of the speaker's utterance, the output information based on the emotional understanding.

8. The information processing device according to claim 7, wherein
the response generation unit generates, as the output information based on the emotional understanding, an empathic utterance that repeats an emotional word indicating an emotion among the linguistic information included in the speaker's utterance.

9. The information processing device according to claim 8, wherein
the response generation unit generates an empathic utterance that repeats linguistic information predetermined as a synonym corresponding to the emotional word.

10. The information processing device according to claim 7, wherein
the response generation unit generates, as the output information based on the emotional understanding, an empathic utterance that repeats linguistic information that is included in the speaker's utterance, is other than an emotional word indicating an emotion, and expresses the speaker's emotion.

11. The information processing device according to claim 1, wherein
the response generation unit generates, when information regarding a request included in the speaker's utterance satisfies a predetermined condition, the output information based on the information regarding the request.

12. The information processing device according to claim 1, wherein
the response generation unit generates, when information regarding a request included in the speaker's utterance does not satisfy a predetermined condition, the output information for prompting the speaker to utter the information regarding the request.

13. The information processing device according to claim 1, wherein
the response generation unit generates voice information or operation information as the output information.

14. The information processing device according to claim 13, wherein
the response generation unit generates, as the output information, the operation information regarding an expression on video.

15. The information processing device according to claim 13, wherein
the response generation unit generates, as the output information based on the emotional understanding, the voice information or the operation information that is more recognizable to the speaker than the output information based on utterance recognition, in which the speaker's utterance is recognized.

16. The information processing device according to claim 13, wherein
the response generation unit generates the output information relative to a criterion that depends on the speaker.

17. The information processing device according to claim 16, wherein
the response generation unit generates, as the output information, the voice information at a volume that depends on the environment around the speaker.

18. An information processing method in which a computer:
estimates a state of emotional understanding, in which an emotion based on a speaker's utterance is understood; and
generates output information based on a result of the estimation.

19. An information processing program that causes a computer to execute:
a state estimation procedure of estimating a state of emotional understanding, in which an emotion based on a speaker's utterance is understood; and
a response generation procedure of generating output information based on a result of the estimation.