
JP2021086354A - Information processing system, information processing method, and program

Info

Publication number: JP2021086354A
Application number: JP2019214178A
Authority: JP
Inventors: 尚史福江; Naofumi Fukue; 啓介小西; Keisuke Konishi
Original assignee: TIS Inc
Current assignee: TIS Inc
Priority date: 2019-11-27
Filing date: 2019-11-27
Publication date: 2021-06-03
Anticipated expiration: 2039-11-27
Also published as: JP7123028B2

Abstract

To cause a speaker to actively output sounds to a user so that a conference or the like progresses smoothly. SOLUTION: An information processing system includes an acquisition unit configured to acquire a face image showing a user's face from an image acquisition device connected to a speaker device having a speaker, an emotion analysis unit configured to analyze an emotion of the user based on the face image, a line-of-sight determination unit configured to determine whether the user is looking at the speaker device based on the face image, a time calculation unit configured to calculate the duration for which the user is looking at the speaker device based on a determination result by the line-of-sight determination unit, a response content identification unit configured to identify a predetermined response content based on the user's emotion and the duration, and a transmission unit configured to transmit sound information indicating the predetermined response content to the speaker device in order to cause the speaker device to output sounds in accordance with the predetermined response content identified by the response content identification unit. SELECTED DRAWING: Figure 2

Description

The present invention relates to an information processing system, an information processing method, and a program.

A determination device that determines whether or not to cause a speaker to output voice based on image information detected by an image sensor has been disclosed (Patent Document 1).

Japanese Unexamined Patent Publication No. 2019-35897

Patent Document 1 discloses a determination device that determines, based on image information, the timing of voice output from a speaker installed in a house in which a user resides. The determination device described in Patent Document 1 causes the speaker to output voice at a timing when the voice information is interrupted. According to the determination device of Patent Document 1, the speaker can be made to output voice according to the situation in the house. However, the determination device described in Patent Document 1 cannot cause the speaker to actively output voice to the user in consideration of the user's emotions, and is therefore insufficient for carrying on a smooth conversation between the speaker and the user.

The present invention has been made in view of the above problem, and an object of the present invention is to provide a system that actively outputs voice from a speaker in consideration of the user's emotions.

An information processing system according to one aspect of the present invention includes: an acquisition unit that acquires a face image showing a user's face from an image acquisition device connected to a speaker device having a speaker; an emotion analysis unit that analyzes the user's emotion based on the face image; a line-of-sight determination unit that determines, based on the face image, whether or not the user is looking at the speaker device; a time calculation unit that calculates, based on the determination result of the line-of-sight determination unit, the duration for which the user is looking at the speaker device; a response content specifying unit that specifies a predetermined response content based on the user's emotion and the duration; and a transmission unit that transmits voice information indicating the predetermined response content to the speaker device so that the speaker device outputs voice in accordance with the predetermined response content specified by the response content specifying unit.

In an information processing method according to one aspect of the present invention, a computer executes: acquiring a face image showing a user's face from an image acquisition device connected to a speaker device having a speaker; analyzing the user's emotion based on the face image; determining, based on the face image, whether or not the user is looking at the speaker device; calculating, based on the determination result, the duration for which the user is looking at the speaker device; specifying a predetermined response content based on the user's emotion and the duration; and transmitting voice information indicating the predetermined response content to the speaker device so that the speaker device outputs voice in accordance with the specified predetermined response content.

A program according to one aspect of the present invention causes a computer to execute: acquiring a face image showing a user's face from an image acquisition device connected to a speaker device having a speaker; analyzing the user's emotion based on the face image; determining, based on the face image, whether or not the user is looking at the speaker device; calculating, based on the determination result, the duration for which the user is looking at the speaker device; specifying a predetermined response content based on the user's emotion and the duration; and transmitting voice information indicating the predetermined response content to the speaker device so that the speaker device outputs voice in accordance with the specified predetermined response content.

According to the present invention, the user can be encouraged to speak by actively outputting voice to the user based on the user's emotion and line of sight.

A diagram showing an example of the configuration of the voice notification system.
A diagram showing an outline of the processing in the voice notification system.
A diagram showing an example of the functional configuration of the response server device.
A diagram showing an example of the response information table.
A diagram showing an example of the functional configuration of the speaker device.
A flow diagram showing an example of the processing of the response server device.
A diagram showing an example of the hardware configuration of a computer.
A diagram showing an example of the functional configuration of the speaker device in another embodiment.
A diagram showing an example of the speaker response information table.

Hereinafter, the voice notification system 1 according to an embodiment of the present invention will be described in detail with reference to the drawings. The embodiment described below is merely an example, and is not intended to exclude various modifications or applications of techniques that are not explicitly described below. That is, the present invention can be implemented with various modifications, or by combining the embodiments, without departing from its spirit. In the following description of the drawings, the same or similar parts are denoted by the same or similar reference numerals.
== Configuration ==

FIG. 1 is a diagram showing an example of the configuration of the voice notification system 1. As shown in FIG. 1, the voice notification system 1 includes, for example, a response server device 10 and a speaker device 20. The functions of the response server device 10 and the speaker device 20 may be realized by a single system. Alternatively, the respective functions of the response server device 10 and the speaker device 20 may be realized by a plurality of other devices. Each component of the voice notification system 1 is described below.

The response server device 10 is a device that acquires a face image showing the user's face from the speaker device 20 and transmits predetermined response information to the speaker device 20 based on the face image, thereby causing the speaker device 20 to actively speak to the user. The response server device 10 is composed of an information processing device such as a server computer, and is connected to the speaker device 20 via a network 300. The transmission and reception of various data between the response server device 10 and the speaker device 20 will be described later.

The speaker device 20 is a device that has an image acquisition device 21 for acquiring a face image and transmits the face image acquired by the image acquisition device 21 to the response server device 10. The speaker device 20 then speaks based on the response information acquired from the response server device 10. The speaker device 20 is a so-called smart speaker.

Here, to aid understanding of the following description, an example of the hardware configuration of the speaker device 20 is described. The speaker device 20 includes, for example, a microphone (not shown) that detects voice and converts it into an electric signal, a speaker (not shown) that outputs, as voice, the voice information acquired from the response server device 10, a communication module (not shown) for communicating with external devices, a display unit 22 that visually displays the status of the speaker device 20 and icons imitating facial expressions, operation buttons (not shown) for giving various operation instructions, and a control unit (not shown) that controls each component. Various types of speaker device 20 exist, for example, one having a plurality of microphones and speakers, one having microphones arranged at equal intervals on the outer periphery of its upper surface, and one having speakers arranged at equal intervals on the outer periphery of its side surface, and the specifications are not limited.
== Overview of voice notification system 1 ==

FIG. 2 is a diagram showing an outline of the processing in the voice notification system 1. An outline of the operation of the voice notification system 1 is described with reference to FIG. 2.

First, in S1, the image acquisition device 21 of the speaker device 20 acquires the user's face image.

Next, in S2, the speaker device 20 transmits the acquired face image to the response server device 10. The response server device 10 then analyzes the user's facial expression and line of sight based on the face image. Based on the analyzed facial expression and line of sight, the response server device 10 specifies the user's emotion and the duration for which the user is looking at the speaker device 20.

Next, in S3, the response server device 10 specifies response information based on the specified emotion and duration, and transmits the response information to the speaker device 20. This enables the speaker device 20 to actively and appropriately speak to the user based on the response information.

Next, in S4, the speaker device 20 outputs voice corresponding to the response information to the user.
== Configuration of voice notification system 1 ==

The functions of the response server device 10 and the speaker device 20 are described below in turn.
<< Response server device 10 >>

The functional configuration of the response server device 10 is described with reference to FIG. 3. FIG. 3 is a diagram showing an example of the functional configuration of the response server device 10. As shown in FIG. 3, the response server device 10 has the functions of a storage unit 11, an acquisition unit 12, an emotion analysis unit 13, a line-of-sight determination unit 14, a time calculation unit 15, a timing specifying unit 16, a response content specifying unit 17, a facial expression icon specifying unit 18, and a transmission unit 19.

The storage unit 11 has, for example, a response information table 11a.

The response information table 11a is, for example, a table that stores response information indicating the content of a response to the user. As shown in FIG. 4, the response information table 11a is a collection of records that use an appropriate item, such as a response content ID, as the primary key and consist of data such as emotion, duration, timing, user attribute, response content, and facial expression icon. Here, the emotion is, for example, the user's emotion estimated from the face image. The duration is, for example, the time for which the user continuously looks at the speaker device 20. The timing is, for example, the timing at which the speaker device 20 speaks. The user attribute is, for example, the user's gender or job title. The response content is, for example, the content output as voice from the speaker device 20. The facial expression icon is, for example, an icon imitating a human facial expression that is displayed on the display unit 22 of the speaker device 20. The contents of the response information table 11a are updated as appropriate by, for example, the administrator of the response server device 10. The response information table 11a is only an example, and its contents are not limited.
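
The record structure described above can be sketched as follows. This is a minimal illustration, not the table of FIG. 4: the class name `ResponseRecord`, the function `find_response`, the IDs, the response sentences, and the icon names are all hypothetical. Only the field list and the three timing labels for a "momentary" duration are taken from the description.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ResponseRecord:
    """One record of a response information table (field names are assumptions)."""
    response_id: str     # primary key (response content ID)
    emotion: str         # emotion estimated from the face image
    duration: str        # label for how long the user looked at the speaker device
    timing: str          # when the speaker device should speak
    user_attribute: str  # e.g. gender or job title
    content: str         # text to be output as voice
    icon: str            # facial expression icon to display


# Hypothetical rows; the contents and icons are invented for illustration.
RESPONSE_TABLE: List[ResponseRecord] = [
    ResponseRecord("R001", "fun", "momentary", "immediately after", "any",
                   "You seem to be enjoying this.", "smile"),
    ResponseRecord("R002", "sorrow", "momentary", "around 3 seconds", "any",
                   "Is something on your mind?", "concerned"),
    ResponseRecord("R003", "anger", "momentary", "around 6 seconds", "any",
                   "Shall we take a short break?", "calm"),
]


def find_response(emotion: str, duration: str,
                  table: List[ResponseRecord] = RESPONSE_TABLE) -> Optional[ResponseRecord]:
    """Return the first record matching the emotion and duration, or None."""
    for record in table:
        if record.emotion == emotion and record.duration == duration:
            return record
    return None
```

A lookup such as `find_response("sorrow", "momentary")` returns the record whose timing is "around 3 seconds"; a combination with no matching row returns `None`.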

The storage unit 11 may also store, for example, the trained models used by the emotion analysis unit 13 and the time calculation unit 15, which are described later. Further, the storage unit 11 may store training data for generating the trained models. The training data consists of, for example, face images used for emotion analysis in the emotion analysis unit 13, and the user's emotions, lines of sight, and the like that are associated with those images and serve as teacher data. These are described later.

The acquisition unit 12 acquires the face image transmitted from the speaker device 20.

The emotion analysis unit 13 inputs feature amounts of the face image into a trained model to analyze the user's emotion, and outputs the result. Specifically, the emotion analysis unit 13 extracts regions of interest such as the eye region, the mouth region, the nose region, or the cheek region, and extracts feature points from each region of interest. The emotion analysis unit 13 specifies feature amounts from the distances between the extracted feature points, and inputs those feature amounts into the trained model to output the user's emotion. Here, the trained model is, for example, a convolutional neural network trained on pairs of a feature amount of a region of interest and the emotion corresponding to that feature amount (teacher data). In this way, the emotion analysis unit 13 can analyze, from the face image, the facial expression that appears through the contraction of facial muscles accompanying the deformation of facial elements such as the user's eyes, mouth, nose, and cheeks, and can specify the emotion that the facial expression represents.
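
The step of deriving feature amounts from distances between feature points can be illustrated with a short sketch. It assumes the feature points are already available as 2D coordinates; the function name `distance_features` and the sample coordinates are hypothetical, and the trained model that would consume the features is outside the scope of the sketch.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def distance_features(points: Sequence[Point]) -> List[float]:
    """Return the pairwise Euclidean distances between feature points.

    The pairs are produced in a fixed order (0-1, 0-2, ..., 1-2, ...), so the
    resulting vector has a stable layout for a downstream model.
    """
    return [math.dist(a, b) for a, b in combinations(points, 2)]


# Hypothetical feature points of a mouth region: left corner, upper lip, right corner.
mouth_points = [(0.0, 0.0), (3.0, 4.0), (6.0, 0.0)]
features = distance_features(mouth_points)
print(features)  # [5.0, 6.0, 5.0]
```

In practice the distances would usually be normalized (for example by the face width) before being passed to the model, so that the features do not depend on how far the user is from the camera; that normalization is an assumption and is not described in the source.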

The emotion may be, for example, one of the so-called 27 basic emotions, or an emotion combining those basic emotions. In the following, for convenience of explanation, the emotions specified by the emotion analysis unit 13 are limited to "fun", "anger", and "sorrow" as an example.

The line-of-sight determination unit 14 inputs feature amounts relating to the user's eyes, obtained from the face image, into a trained model to analyze the user's line of sight, and determines whether or not the user is looking at the speaker device 20. Specifically, the line-of-sight determination unit 14 extracts, for example, a partial image of the face image that includes at least one eye, and extracts the feature amounts indicated by the partial image. The line-of-sight determination unit 14 inputs the feature amounts into the trained model and determines whether or not the user is looking at the speaker device 20. Here, the trained model is, for example, a convolutional neural network trained on pairs of a feature amount indicated by the partial image and the line of sight corresponding to that feature amount (teacher data).

The techniques for extracting feature amounts in the emotion analysis unit 13 and the line-of-sight determination unit 14 described above, and the trained models that use them, are merely examples and are not limiting; other well-known techniques may be used in their place.

The time calculation unit 15 calculates the duration for which the line-of-sight determination unit 14 determines that the user is looking at the speaker device 20. Specifically, the time calculation unit 15 acquires determination results from the line-of-sight determination unit 14 at predetermined time intervals, and calculates, as the duration, the time from the point at which the user is determined to be looking at the speaker device 20 to the point at which the user is determined not to be looking at the speaker device 20.
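
The duration calculation described above can be sketched as follows, assuming the determination results arrive as time-ordered pairs of a timestamp in seconds and a boolean. The function name `gaze_durations` and the sample values are hypothetical.

```python
from typing import Iterable, List, Tuple


def gaze_durations(samples: Iterable[Tuple[float, bool]]) -> List[float]:
    """Compute gaze durations from periodic line-of-sight determinations.

    Each sample is (timestamp_seconds, is_looking). A duration runs from the
    first sample judged "looking" to the next sample judged "not looking".
    A gaze that is still ongoing at the end of the samples is not reported.
    """
    durations: List[float] = []
    start = None
    for timestamp, is_looking in samples:
        if is_looking and start is None:
            start = timestamp                     # gaze begins
        elif not is_looking and start is not None:
            durations.append(timestamp - start)   # gaze ends
            start = None
    return durations


# Determinations taken every 0.5 seconds (hypothetical values).
samples = [(0.0, False), (0.5, True), (1.0, True), (1.5, True),
           (2.0, False), (2.5, True), (3.0, False)]
print(gaze_durations(samples))  # [1.5, 0.5]
```

The resolution of the result is bounded by the polling interval: with determinations every 0.5 seconds, a gaze shorter than that interval may be missed entirely.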

The timing specifying unit 16 specifies the timing at which the speaker device 20 is to output the predetermined response content as voice, based on, for example, at least one of the emotion analyzed by the emotion analysis unit 13 and the duration calculated by the time calculation unit 15. This allows the speaker device 20 to output voice at a timing appropriate to the user's emotion.

The timing specifying unit 16 also specifies a first timing for a first emotion of the user analyzed by the emotion analysis unit 13, and specifies a second timing, for a second emotion that is analyzed by the emotion analysis unit 13 and differs from the first emotion, so that the second timing is longer than the first timing. Specifically, as shown in FIG. 4, when the duration is "momentary", for example, the first timing "immediately after" is specified if the first emotion is "fun", and a second timing of "around 3 seconds", which is longer than the first timing for "fun" (immediately after), is specified if the second emotion is "sorrow". If the second emotion is "anger", a second timing of "around 6 seconds", which is longer than the first timing for "fun" (immediately after), is specified. In this way, when the user has a positive emotion such as "fun", the speaker device 20 outputs the response content as voice as early as possible, and as negative emotions such as "sorrow" and "anger" grow stronger, the speaker device 20 delays the voice output of the response content further. Because the speaker device 20 can thus output voice at a timing appropriate to the user's emotion, communication between the user and the speaker device 20 can be promoted.

The timing specifying unit 16 may specify the first timing and the second timing by referring to the response information table 11a, in which timings are defined in advance, and reading the timing associated with the emotion analyzed by the emotion analysis unit 13 and the duration calculated by the time calculation unit 15.
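
The table-based timing lookup can be illustrated with a small sketch. The three entries mirror the example in the description for a "momentary" duration; representing "immediately after" as 0.0 seconds and "around 3 seconds" as exactly 3.0 seconds is an assumption, as are the names `TIMING_SECONDS` and `specify_timing`.

```python
from typing import Dict, Optional, Tuple

# Delay before voice output in seconds, keyed by (emotion, duration label).
# Numeric values are an assumed encoding of the labels in the description.
TIMING_SECONDS: Dict[Tuple[str, str], float] = {
    ("fun", "momentary"): 0.0,     # "immediately after"
    ("sorrow", "momentary"): 3.0,  # "around 3 seconds"
    ("anger", "momentary"): 6.0,   # "around 6 seconds"
}


def specify_timing(emotion: str, duration_label: str,
                   table: Dict[Tuple[str, str], float] = TIMING_SECONDS) -> Optional[float]:
    """Return the output delay for the emotion and duration, or None if undefined."""
    return table.get((emotion, duration_label))


first_timing = specify_timing("fun", "momentary")
second_timing = specify_timing("sorrow", "momentary")
print(first_timing, second_timing)  # 0.0 3.0
```

The only property the description requires is the ordering: the delay for a negative emotion is longer than the delay for a positive one, and it grows as the negative emotion becomes stronger.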

The response content specifying unit 17 refers to the response information table 11a and specifies the response content to be output as voice by the speaker device 20, based on at least the emotion and the duration. Because the speaker device 20 can thus output content appropriate to the user's emotion, communication between the user and the speaker device 20 can be promoted. In addition, in situations where multiple people gather, such as a meeting or a party, response content suited to the situation can be output as voice, so that the meeting or party can proceed smoothly.

The facial expression icon specifying unit 18 refers to the response information table 11a and specifies the facial expression icon corresponding to the response content. Displaying the facial expression icon specified by the facial expression icon specifying unit 18 on the display unit 22 creates an atmosphere of empathy with the user who is looking at the speaker device 20, so that communication between the user and the speaker device 20 can be promoted.

The functions of the timing specifying unit 16, the response content specifying unit 17, and the facial expression icon specifying unit 18 described above may also be realized using well-known trained models. In that case, the functional units do not need to refer to the response information table 11a.

The transmission unit 19 transmits response information, which includes information on the timing, the response content, and the facial expression icon, to the speaker device 20.
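
As an illustration of response information carrying these three items, the sketch below serializes them as JSON. The source does not specify a wire format, so JSON, the key names, and the function names `build_response_info` and `parse_response_info` are all assumptions.

```python
import json
from typing import Any, Dict


def build_response_info(timing_seconds: float, content: str, icon: str) -> str:
    """Serialize timing, response content, and facial expression icon as JSON."""
    return json.dumps(
        {"timing_seconds": timing_seconds, "content": content, "icon": icon},
        ensure_ascii=False,  # keep non-ASCII response text readable
    )


def parse_response_info(payload: str) -> Dict[str, Any]:
    """Decode response information on the receiving (speaker device) side."""
    return json.loads(payload)


payload = build_response_info(3.0, "Is something on your mind?", "concerned")
info = parse_response_info(payload)
print(info["timing_seconds"], info["icon"])  # 3.0 concerned
```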
<< Speaker device 20 >>

Next, the functional configuration of the speaker device 20 is described. The speaker device 20 has the functions of a transmission/reception unit 20a and a display control unit 20b.

The transmission/reception unit 20a controls the transmission and reception of data in the speaker device 20. For example, the transmission/reception unit 20a transmits the face image acquired from the image acquisition device 21 to an external device such as the response server device 10. The transmission/reception unit 20a also receives response information from an external device such as the response server device 10.

The display control unit 20b controls the display of the display unit 22 (see FIG. 1) that is included in, or connected to, the speaker device 20. The display control unit 20b causes the display unit 22 to display the facial expression icon included in the response information.
=== Processing procedure ===

FIG. 6 is a flow diagram showing an example of the processing of the response server device 10. An example of the processing executed by the response server device 10 is described with reference to FIG. 6.

First, in S100, the response server device 10 acquires a face image from the speaker device 20 (or the image acquisition device 21). Next, in S101, the response server device 10 inputs the face image into a trained model to specify the user's emotion. Next, in S102, the response server device 10 inputs a partial image of the face image, for example one that includes the user's eyes, into a trained model to determine whether or not the user is looking at the speaker device 20, and, if the user is determined to be looking, calculates the duration from that point to the point at which the user is determined not to be looking. Next, in S103, the response server device 10 specifies the timing of voice output from the speaker device 20 based on the emotion and the duration. Next, in S104, the response server device 10 specifies the response content based on at least the emotion and the duration. Next, in S105, the response server device 10 transmits the response information to the speaker device 20. As a result, the speaker device 20 can output suitable response content to the user as voice at an appropriate timing.
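
The flow of S100 to S105 can be sketched end to end as follows. This is a simplified illustration under several assumptions: the trained models are replaced by caller-supplied functions, face images are stand-in dictionaries, the 1.0-second threshold for labelling a duration "momentary" is invented, and transmission in S105 is replaced by returning the response information. All names are hypothetical.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Frame = Tuple[float, Any]  # (timestamp in seconds, face image)


def process_frames(
    frames: List[Frame],                          # S100: acquired face images
    analyze_emotion: Callable[[Any], str],        # stand-in for the emotion model
    is_looking: Callable[[Any], bool],            # stand-in for the gaze model
    table: Dict[Tuple[str, str], Tuple[float, str]],
    momentary_limit: float = 1.0,                 # assumed threshold in seconds
) -> Optional[Dict[str, Any]]:
    """Return response information, or None when no response applies."""
    if not frames:
        return None
    emotion = analyze_emotion(frames[0][1])       # S101: specify the emotion

    start, duration = None, None                  # S102: gaze duration
    for timestamp, image in frames:
        looking = is_looking(image)
        if looking and start is None:
            start = timestamp
        elif not looking and start is not None:
            duration = timestamp - start
            break
    if duration is None:
        return None

    label = "momentary" if duration < momentary_limit else "long"
    entry = table.get((emotion, label))
    if entry is None:
        return None
    delay, content = entry                        # S103: timing, S104: content
    return {"timing_seconds": delay, "content": content}  # S105 (returned, not sent)


# Stand-in "images" that already carry the values a model would infer.
frames = [(0.0, {"emotion": "sorrow", "looking": True}),
          (0.5, {"emotion": "sorrow", "looking": False})]
table = {("sorrow", "momentary"): (3.0, "Is something on your mind?")}
result = process_frames(frames, lambda img: img["emotion"],
                        lambda img: img["looking"], table)
print(result)  # {'timing_seconds': 3.0, 'content': 'Is something on your mind?'}
```

Passing the analyzers in as functions keeps the control flow testable without any model; a real deployment would substitute inference calls for the two lambdas.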
== Hardware configuration of voice notification system 1 ==

An example of the hardware configuration when the response server device 10 and the speaker device 20 are realized by a computer 100 is described with reference to FIG. 7. The functions of each device can also be divided among a plurality of devices. Part of the hardware configuration of the speaker device 20 is as described above. The functions of the response server device 10 and the speaker device 20 are realized by, for example, the processor 101 reading and executing a computer program stored in the storage device 103 and controlling each component of the response server device 10 and the speaker device 20.

FIG. 7 is a diagram showing an example of the hardware configuration of the computer. As shown in FIG. 7, the computer 100 includes a processor 101, a memory 102, a storage device 103, an input I/F unit 104, a data I/F unit 105, a communication I/F unit 106, and a display device 107.

The processor 101 is a control unit that controls various processes in the computer 100 by executing a program stored in the memory 102.

The memory 102 is a storage medium such as a RAM (Random Access Memory). The memory 102 temporarily stores the program code of the programs executed by the processor 101 and the data required while those programs run.

The storage device 103 is a non-volatile storage medium such as a hard disk drive (HDD) or a flash memory. The storage device 103 stores an operating system and the various programs for realizing each of the above configurations.

The input I/F unit 104 is a device for receiving input from the user. Specific examples of the input I/F unit 104 include a keyboard, a mouse, a touch panel, various sensors, and a wearable device. The input I/F unit 104 may be connected to the computer 100 via an interface such as USB (Universal Serial Bus).

The data I/F unit 105 is a device for inputting data from outside the computer 100. A specific example of the data I/F unit 105 is a drive device for reading data stored in various storage media. The data I/F unit 105 may also be provided outside the computer 100. In that case, the data I/F unit 105 is connected to the computer 100 via an interface such as USB.

The communication I/F unit 106 is a device for performing wired or wireless data communication with devices outside the computer 100 via the Internet N. The communication I/F unit 106 may also be provided outside the computer 100. In that case, the communication I/F unit 106 is connected to the computer 100 via an interface such as USB.

The display device 107 (display unit 22) is a device for displaying various types of information. Specific examples of the display device 107 include a liquid crystal display, an organic EL (Electro-Luminescence) display, and the display of a wearable device. The display device 107 may be provided outside the computer 100. In that case, the display device 107 is connected to the computer 100 via, for example, a display cable. When a touch panel is adopted as the input I/F unit 104, the display device 107 can be configured integrally with the input I/F unit 104.
=== Other embodiments ===
<< Response server device 210 (not shown) >>

In the above description, the emotion analysis unit 13 of the response server device 10 analyzes emotion in order to specify the response information. Instead, an emotion analysis unit 213 (not shown) may analyze the user's facial expression. In this case, for example, facial expressions such as "laughing", "angry", or "sad" may be analyzed, and the response content may be specified based on these facial expressions and the duration. That is, the response server device 210 (not shown) may specify the response information by associating user behavior that expresses emotion with the user's emotion.
<< Speaker device 220 >>

In the above description, the speaker device 20 transmits the user's face image to the response server device 10 and then outputs voice to the user upon receiving response information from the response server device 10. That is, the speaker device 20 has been described as outputting voice to the user without itself analyzing the face image, but the present invention is not limited to this.

For example, the speaker device 220 may respond to the user by identifying the user's emotion from the user's face image. FIG. 8 is a diagram showing the configuration of the speaker device 220 in another embodiment. As shown in FIG. 8, the speaker device 220 includes, for example, a storage unit 221, an acquisition unit 222, a speaker emotion analysis unit 223, a voice analysis unit 224, a timing specifying unit 225, a speaker response content specifying unit 226, a display control unit 227, and an output unit 228. In the following, for convenience, the description of components that are the same as those of the speaker device 20 is omitted, and the differences are mainly described.

The storage unit 221 has, for example, a speaker response information table 221a.

The speaker response information table 221a is, for example, a table that stores the content of responses to the user. FIG. 9 is a diagram showing the speaker response information table 221a. As shown in FIG. 9, the speaker response information table 221a is a collection of records that use an appropriate item such as a response content ID as the primary key and consist of data such as emotion, voice loudness, timing, response content, and facial expression icon. The emotion, timing, response content, and facial expression icon are the same as those in the response information table 11a described above, so their description is omitted. The voice loudness is the loudness of the user's voice, specified from information about the user's voice acquired by the microphone of the speaker device 220. By referring to the speaker response information table 221a, the speaker device 220 can immediately utter a back-channel response (aizuchi) or the like to the user, which promotes communication with the user. The table may also store the "duration" in place of, or together with, the voice loudness. The speaker response information table 221a is only an example, and its content is not particularly limited.
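One way to picture the record structure described above is the following sketch. The field names follow the items listed in the text, but the class name `SpeakerResponseRecord`, the helper `find_record`, the IDs, the response phrases, and the icon names are all hypothetical placeholders, not the contents of FIG. 9.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SpeakerResponseRecord:
    response_id: str          # primary key (response content ID)
    emotion: str              # e.g. "fun", "sorrow", "anger"
    loudness: Optional[str]   # "large", "small", or None for "any" (assumption)
    timing: str               # "immediately after" or "wait"
    content: Optional[str]    # text to speak; None when the device waits
    icon: Optional[str]       # facial expression icon name (placeholder)


# Placeholder rows for illustration only.
_ROWS = [
    SpeakerResponseRecord("R001", "fun", None, "immediately after", "That sounds great.", "smile"),
    SpeakerResponseRecord("R002", "sorrow", None, "immediately after", "I see.", "sad"),
    SpeakerResponseRecord("R003", "anger", "small", "immediately after", "I understand.", "concerned"),
    SpeakerResponseRecord("R004", "anger", "large", "wait", None, None),
]
TABLE: Dict[str, SpeakerResponseRecord] = {row.response_id: row for row in _ROWS}


def find_record(table: Dict[str, SpeakerResponseRecord],
                emotion: str, loudness: str) -> Optional[SpeakerResponseRecord]:
    """Return the first record matching the emotion and (if specified) the loudness."""
    for record in table.values():
        if record.emotion == emotion and record.loudness in (None, loudness):
            return record
    return None
```

Keeping the table as plain data means the "duration" column mentioned in the text could be added as another optional field without changing the lookup logic much.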

Further, the storage unit 221 may store, for example, the trained model used by the speaker emotion analysis unit 223, which will be described later. The storage unit 221 may also store the training data for generating the trained model.

The speaker emotion analysis unit 223 is the same as the emotion analysis unit 13 of the response server device 10, so its description is omitted.

The voice analysis unit 224 analyzes the loudness of the user's voice based on voice information, acquired by the microphone of the speaker device 220, that represents the user's voice. For example, the voice analysis unit 224 evaluates voice information whose loudness (voice volume) is equal to or greater than a predetermined threshold as "large" and voice information whose loudness is less than the threshold as "small". As will be described later, this makes it possible to specify an appropriate timing for voice output from the speaker device 220 according to, for example, the degree of the user's anger, so that communication with the user proceeds smoothly.
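The threshold evaluation can be sketched as follows. The embodiment does not say how loudness is measured, so this sketch assumes the root mean square (RMS) of normalized audio samples; the threshold value and the function names are hypothetical.

```python
import math
from typing import Sequence

# Hypothetical threshold for samples normalized to [-1.0, 1.0].
LOUDNESS_THRESHOLD = 0.1


def rms(samples: Sequence[float]) -> float:
    """Root mean square of the samples; 0.0 for an empty sequence."""
    if not samples:
        return 0.0
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def classify_loudness(samples: Sequence[float],
                      threshold: float = LOUDNESS_THRESHOLD) -> str:
    """Return "large" when loudness >= threshold, otherwise "small"."""
    return "large" if rms(samples) >= threshold else "small"
```

The comparison uses `>=` so that a value exactly at the threshold is evaluated as "large", matching the "equal to or greater than" wording of the text.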

The voice analysis unit 224 is not limited to evaluating "large" and "small" from the loudness (volume) of the voice. For example, it may analyze the words contained in the voice to distinguish between "large" and "small" emotional intensity. Specifically, for example, the voice information is converted into text and morphemes are extracted from the text; if a given sentence contains predetermined emotion terms, an emotion vector is calculated based on those terms. The representative emotional intensity indicated by the emotion vector, such as "large" or "small", is then specified. Well-known techniques can be used to specify "large" or "small" emotional intensity by analyzing the words contained in the voice.
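A heavily simplified sketch of the word-based analysis is shown below. It assumes the speech has already been converted to text, substitutes whitespace tokenization for morpheme extraction, and uses a tiny hypothetical lexicon and threshold; none of these details come from the embodiment.

```python
from typing import Dict, Tuple

# Hypothetical lexicon: emotion term -> (emotion label, weight).
EMOTION_LEXICON: Dict[str, Tuple[str, float]] = {
    "furious": ("anger", 2.0),
    "annoyed": ("anger", 1.0),
    "happy": ("fun", 1.0),
    "sad": ("sorrow", 1.0),
}
INTENSITY_THRESHOLD = 1.5  # hypothetical boundary between "small" and "large"


def emotion_vector(text: str) -> Dict[str, float]:
    """Sum lexicon weights per emotion label for the terms found in the text."""
    vector: Dict[str, float] = {}
    for raw in text.lower().split():      # stand-in for morpheme extraction
        token = raw.strip(".,!?")
        if token in EMOTION_LEXICON:
            label, weight = EMOTION_LEXICON[token]
            vector[label] = vector.get(label, 0.0) + weight
    return vector


def emotion_intensity(vector: Dict[str, float]) -> str:
    """Return "large" if the strongest component reaches the threshold, else "small"."""
    if not vector:
        return "small"
    return "large" if max(vector.values()) >= INTENSITY_THRESHOLD else "small"
```

For Japanese input, a real morphological analyzer would replace the `split()` call; the rest of the flow would stay the same.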

The timing specifying unit 225 specifies the timing at which the speaker device 220 is caused to output predetermined response content by voice, based on at least one of the emotion analyzed by the speaker emotion analysis unit 223 and the voice information analyzed by the voice analysis unit 224. Specifically, as shown in FIG. 9, it specifies a timing for each user emotion analyzed by the speaker emotion analysis unit 223. For example, when the emotion is "fun", it specifies the timing "immediately after", and when the emotion is "sorrow", it also specifies the timing "immediately after". When the emotion is "anger", it specifies the timing "immediately after" if the user's voice is "small" and the timing "wait" (no response) if the user's voice is "large". By specifying a simpler timing than the one specified in the response server device 10 in this way, a primary response to the user can be realized with hardware resources of lower performance than the response server device 10, which promotes communication between the user and the speaker device 220.
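The timing rules stated above can be written as a small lookup. The four rules are taken from the text; the fallback to "wait" for combinations the text does not mention is an assumption of this sketch, as are the names `TIMING_RULES` and `specify_timing`.

```python
from typing import Dict, Optional, Tuple

IMMEDIATE = "immediately after"
WAIT = "wait"

# (emotion, loudness) -> timing; loudness None means the rule ignores loudness.
TIMING_RULES: Dict[Tuple[str, Optional[str]], str] = {
    ("fun", None): IMMEDIATE,
    ("sorrow", None): IMMEDIATE,
    ("anger", "small"): IMMEDIATE,
    ("anger", "large"): WAIT,
}


def specify_timing(emotion: str, loudness: Optional[str]) -> str:
    """Look up the timing, trying the exact rule first and then the emotion-only rule."""
    for key in ((emotion, loudness), (emotion, None)):
        if key in TIMING_RULES:
            return TIMING_RULES[key]
    # Assumption: unknown combinations defer to the response server.
    return WAIT
```

Because the rules are plain data, the same function works whether the table is hard-coded as here or loaded from the speaker response information table.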

Here, "wait" in FIG. 9 means, for example, a process of waiting for response information to be transmitted from the response server device 210. For example, when it is determined that the user has a strong feeling of "anger", an immediate reply such as a back-channel response from the speaker device 220 is presumed to make the user's feelings more negative, so the speaker device 220 waits instead.

The timing specifying unit 225 may refer to the speaker response information table 221a, in which timings are defined in advance, and specify the timing associated with the emotion analyzed by the speaker emotion analysis unit 223 and the voice information analyzed by the voice analysis unit 224.

The speaker response content specifying unit 226 refers to the speaker response information table 221a and specifies the response content to be output by voice from the speaker device 220 based on at least the emotion and the voice information. The speaker response content specifying unit 226 specifies response content, such as a back-channel response (aizuchi), that is simpler than the response content specified by the response server device 210. As a result, a primary response to the user can be realized with hardware resources of lower performance than the response server device 210.

The speaker response content specifying unit 226 may refer to the speaker response information table 221a, in which response content is defined in advance, and specify the response content associated with the emotion analyzed by the speaker emotion analysis unit 223, the voice information analyzed by the voice analysis unit 224, and the timing specified by the timing specifying unit 225.

In this way, the voice notification system 200 can first have the speaker device 220 immediately give the user a reply such as a back-channel response, and then output an appropriate message via the response server device 10 according to the user's emotion. The system can therefore actively communicate with the user in an appropriate manner.
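The two-stage flow described above can be sketched as follows. The function `respond` and its parameters are hypothetical; the server call is represented by a plain callable, and `None` stands for the "wait" case in which the speaker device gives no immediate reply.

```python
from typing import Callable, List, Optional


def respond(local_reply: Optional[str],
            fetch_server_message: Callable[[], Optional[str]],
            speak: Callable[[str], None]) -> None:
    """Speak the local primary response first, then the server's message if any."""
    if local_reply is not None:
        speak(local_reply)              # stage 1: immediate back-channel response
    message = fetch_server_message()    # stage 2: response from the server
    if message is not None:
        speak(message)


spoken: List[str] = []
respond("I see.", lambda: "Shall we take a short break?", spoken.append)
```

In a real device the server request would likely run asynchronously so that the first reply is not delayed; the sequential form here is only meant to show the ordering.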

The output unit 228 outputs a voice signal indicating the response content to the speaker so that the speaker outputs voice in accordance with the response content specified by the speaker response content specifying unit 226.

Further, the speaker device 220 may include a position information generation unit (not shown). The position information generation unit generates, for example, position information indicating the position of each of a plurality of users. The position information generation unit may, for example, specify the position of a user based on the face image, or may specify the position of a user by acquiring the position information of a mobile terminal device carried by the user from a beacon wirelessly connected to that mobile terminal device. The method of specifying the user's position is not limited.

In this case, the facial expression icon display unit of the speaker device 220 may, for example, display a facial expression icon in a predetermined area of the display device that faces the position of each of the plurality of users, based on the position information generated by the position information generation unit. As a result, the speaker device 220 can create an atmosphere of being attuned to the feelings of a plurality of users at the same time, which promotes communication between the plurality of users and the speaker device 220. It is desirable that the display device be provided over 360 degrees in a predetermined region of the circumferential portion of the speaker device 220. It is also desirable that the image acquisition device 21 be provided in the speaker device 220 as a camera capable of capturing 360 degrees. As a result, the facial expression icon can be reliably displayed at a location corresponding to the user's position.
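As a rough illustration of choosing a display area that faces each user, the sketch below assumes that each user's position is reduced to an angle in degrees around the device and that the 360-degree display is divided into equal segments. The function names and the segment model are hypothetical.

```python
from typing import Dict


def display_segment(angle_deg: float, num_segments: int) -> int:
    """Map an angle around the device to the index of the display segment facing it."""
    if num_segments <= 0:
        raise ValueError("num_segments must be positive")
    width = 360.0 / num_segments
    index = int((angle_deg % 360.0) // width)
    # Guard against floating-point edge cases that could land on num_segments.
    return min(index, num_segments - 1)


def segments_for_users(user_angles: Dict[str, float],
                       num_segments: int) -> Dict[str, int]:
    """Return the segment in which each user's facial expression icon would be shown."""
    return {user: display_segment(angle, num_segments)
            for user, angle in user_angles.items()}
```

With eight segments, users at 10 and 200 degrees would get icons in segments 0 and 4, so each user sees an icon on the side of the device facing them.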
=== Summary ===

The voice notification system 1 according to the present embodiment includes: an acquisition unit 12 that acquires a face image showing the user's face from the image acquisition device 21 connected to the speaker device 20 having a speaker; an emotion analysis unit 13 that analyzes the user's emotion based on the face image; a line-of-sight determination unit 14 that determines, based on the face image, whether or not the user is looking at the speaker device 20; a time calculation unit 15 that calculates the duration for which the user is looking at the speaker device 20 based on the determination result of the line-of-sight determination unit 14; a response content specifying unit 17 that specifies predetermined response content based on the user's emotion and the duration; and a transmission unit 19 that transmits voice information indicating the predetermined response content to the speaker device 20 so that the speaker device 20 outputs voice in accordance with the predetermined response content specified by the response content specifying unit 17. According to the present embodiment, actively outputting voice to the user based on the user's emotion and line of sight can encourage the user to speak.

The voice notification system 1 according to the present embodiment further includes a timing specifying unit 16 that specifies the timing at which the speaker device 20 is caused to output the predetermined response content by voice, based on the user's emotion analyzed by the emotion analysis unit 13. The transmission unit 19 transmits the voice information to the speaker device 20 so that the speaker device 20 outputs voice at the timing specified by the timing specifying unit 16. According to the present embodiment, the speaker device 20 can output voice at an appropriate timing according to the user's emotion, which promotes communication between the user and the speaker device 20.

The timing specifying unit 16 of the voice notification system 1 according to the present embodiment specifies a first timing for a first emotion of the user analyzed by the emotion analysis unit 13, and specifies a second timing for a second emotion, which differs from the first emotion and is analyzed by the emotion analysis unit 13, such that the second timing is longer than the first timing. According to the present embodiment, the speaker device 20 can output voice at a more appropriate timing according to the user's emotion, which promotes communication between the user and the speaker device 20.

The voice notification system 1 according to the present embodiment further includes a facial expression icon specifying unit 18 that refers to the response information table 11a and specifies the facial expression icon (icon) corresponding to the response content, so that the display unit 22 of the speaker device 20 displays a facial expression icon (icon) imitating a facial expression. The transmission unit 19 transmits the facial expression icon (icon) specified by the facial expression icon specifying unit 18 to the speaker device 20. According to the present embodiment, it is possible to create an atmosphere of being attuned to the feelings of the user who is looking at the speaker device 20, which promotes communication between the user and the speaker device 20.

The voice notification system 1 according to the present embodiment further includes the speaker device 220. The speaker device 220 includes: a speaker emotion analysis unit 223 that analyzes the user's emotion based on the face image; a speaker response content specifying unit 226 that specifies predetermined response content based on the user's emotion analyzed by the speaker emotion analysis unit 223; and an output unit 228 that outputs a voice signal indicating the predetermined response content to the speaker so that the speaker outputs voice in accordance with the predetermined response content specified by the speaker response content specifying unit 226. According to the present embodiment, configuring a single system that includes the response server device 210 and the speaker device 220 allows the system design to be carried out efficiently.

The speaker device 220 of the voice notification system 1 according to the present embodiment has a cylindrical shape and has, on the side peripheral portion at the upper end of the cylindrical shape, a display unit 22 that displays a facial expression icon based on the analysis by at least one of the emotion analysis unit 13 and the speaker emotion analysis unit 223. According to the present embodiment, the facial expression icon can be reliably displayed at a location corresponding to the user's position, so communication between the user and the speaker device 220 can be promoted more reliably.

The embodiments described above are intended to facilitate understanding of the present invention and are not to be interpreted as limiting it. The present invention may be modified and improved without departing from its spirit, and the present invention also includes equivalents thereof.

1, 200 ... voice notification system, 10 ... response server device, 12 ... acquisition unit, 13 ... emotion analysis unit, 14 ... line-of-sight determination unit, 15 ... time calculation unit, 16 ... timing specifying unit, 17 ... response content specifying unit, 18 ... facial expression icon specifying unit, 19 ... transmission unit, 20, 220 ... speaker device, 21 ... image acquisition device, 223 ... speaker emotion analysis unit, 226 ... speaker response content specifying unit, 228 ... output unit

Claims

An information processing system comprising:
an acquisition unit that acquires a face image showing a user's face from an image acquisition device connected to a speaker device having a speaker;
an emotion analysis unit that analyzes the user's emotion based on the face image;
a line-of-sight determination unit that determines whether or not the user is looking at the speaker device based on the face image;
a time calculation unit that calculates the duration for which the user is looking at the speaker device based on the determination result of the line-of-sight determination unit;
a response content specifying unit that specifies predetermined response content based on the user's emotion and the duration; and
a transmission unit that transmits voice information indicating the predetermined response content to the speaker device so as to cause the speaker device to output voice in accordance with the predetermined response content specified by the response content specifying unit.

The information processing system according to claim 1, further comprising a timing specifying unit that specifies, based on the user's emotion analyzed by the emotion analysis unit, a timing at which the speaker device is caused to output the predetermined response content by voice,
wherein the transmission unit transmits the voice information to the speaker device so as to cause the speaker device to output voice at the timing specified by the timing specifying unit.

The information processing system according to claim 2, wherein the timing specifying unit specifies a first timing for a first emotion of the user analyzed by the emotion analysis unit, and specifies a second timing for a second emotion, which differs from the first emotion and is analyzed by the emotion analysis unit, such that the second timing is longer than the first timing.

The information processing system according to any one of claims 1 to 3, further comprising a facial expression icon specifying unit that refers to predetermined information and specifies an icon that imitates a human facial expression and corresponds to the response content, so as to display the icon on a display unit of the speaker device,
wherein the transmission unit transmits the icon specified by the facial expression icon specifying unit to the speaker device.

The information processing system according to claim 4, wherein the acquisition unit acquires the face image including position information of each of a plurality of users, and
the transmission unit transmits the icon specified by the facial expression icon specifying unit to the speaker device so that the icon is displayed to each of the plurality of users in a predetermined area of the display unit corresponding to the position information of each of the plurality of users.

The information processing system according to any one of claims 1 to 5, further comprising the speaker device,
wherein the speaker device includes:
a speaker emotion analysis unit that analyzes the user's emotion based on the face image;
a speaker response content specifying unit that specifies predetermined response content based on the user's emotion analyzed by the speaker emotion analysis unit; and
an output unit that outputs a voice signal indicating the predetermined response content to the speaker so as to cause the speaker to output voice in accordance with the predetermined response content specified by the speaker response content specifying unit.

The information processing system according to claim 6, wherein the speaker device has a cylindrical shape and has, on a side peripheral portion at an upper end portion of the cylindrical shape, a display unit that displays an icon imitating a human facial expression, the icon being based on the analysis by at least one of the emotion analysis unit and the speaker emotion analysis unit.

An information processing method in which a computer executes:
acquiring a face image showing a user's face from an image acquisition device connected to a speaker device having a speaker;
analyzing the user's emotion based on the face image;
determining whether or not the user is looking at the speaker device based on the face image;
calculating, based on the determination result, the duration for which the user is looking at the speaker device;
specifying predetermined response content based on the user's emotion and the duration; and
transmitting voice information indicating the predetermined response content to the speaker device so as to cause the speaker device to output voice in accordance with the specified predetermined response content.

A program that causes a computer to execute:
acquiring a face image showing a user's face from an image acquisition device connected to a speaker device having a speaker;
analyzing the user's emotion based on the face image;
determining whether or not the user is looking at the speaker device based on the face image;
calculating, based on the determination result, the duration for which the user is looking at the speaker device;
specifying predetermined response content based on the user's emotion and the duration; and
transmitting voice information indicating the predetermined response content to the speaker device so as to cause the speaker device to output voice in accordance with the specified predetermined response content.