JP6787957B2 - Utterance control device, utterance control method, and utterance control program

Info

Publication number: JP6787957B2
Application number: JP2018159400A
Authority: JP
Inventors: 孝太坪内; 山本　学; 学山本; 中村　浩樹; 浩樹中村; 太士岩▲瀬▼張
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2018-08-28
Filing date: 2018-08-28
Publication date: 2020-11-18
Anticipated expiration: 2038-03-20
Also published as: JP2019164321A

Description

The present invention relates to an utterance control device, an utterance control method, and an utterance control program.

Utterance control devices that control utterances from a device are conventionally known. For example, Patent Document 1 discloses an utterance control device that analyzes an operation log indicating the operating status of a device to determine how busy the user of the device is, and controls the timing of utterances from the device according to the determined busyness.

Japanese Unexamined Patent Application Publication No. 2017-151718 (JP-A-2017-151718)

However, the utterance control device described in Patent Document 1 controls the utterance timing under preset conditions based on the operating status of the device. Depending on the user of the device, the resulting utterance timing may not be appropriate, so there is room for further improvement.

The present application has been made in view of the above, and its object is to provide an utterance control device, an utterance control method, and an utterance control program capable of determining the utterance timing more appropriately.

The utterance control device according to the present application includes a context acquisition unit that acquires context information about a user, and a timing determination unit that determines, based on the context information acquired by the context acquisition unit, the timing of an utterance from a voice output device in consideration of the user's past reactions to utterances from the voice output device.

According to one aspect of the embodiment, it is possible to provide an utterance control device, an utterance control method, and an utterance control program capable of determining the utterance timing more appropriately.

FIG. 1 is a diagram showing a configuration example of an information processing system according to an embodiment.
FIG. 2 is an explanatory diagram of the utterance control process according to the embodiment.
FIG. 3 is a diagram showing a configuration example of the smart speaker according to the embodiment.
FIG. 4 is a diagram showing an example of an utterance table according to the embodiment.
FIG. 5 is a diagram showing a configuration example of the information providing device according to the embodiment.
FIG. 6 is a diagram showing an example of an utterance table stored in the utterance table storage unit according to the embodiment.
FIG. 7 is a diagram showing an example of a content table stored in the content storage unit according to the embodiment.
FIG. 8 is a diagram showing an example of a voice advertisement table stored in the voice advertisement storage unit according to the embodiment.
FIG. 9 is a diagram showing an example of a user information table stored in the user information storage unit according to the embodiment.
FIG. 10 is a flowchart (part 1) showing an example of the utterance control process by the information processing system according to the embodiment.
FIG. 11 is a flowchart (part 2) showing an example of the utterance control process by the information processing system according to the embodiment.
FIG. 12 is a flowchart showing an example of the output control process by the information processing system according to the embodiment.
FIG. 13 is a flowchart showing an example of the voice information effect determination process by the information processing system according to the embodiment.
FIG. 14 is a diagram showing an example of the hardware configuration of a computer that executes the program.

Hereinafter, a mode for implementing the utterance control device, the utterance control method, and the utterance control program according to the present application (hereinafter referred to as the "embodiment") will be described in detail with reference to the drawings. The embodiment does not limit the utterance control device, the utterance control method, or the utterance control program according to the present application.

[1. Information Providing System]
FIG. 1 is a diagram showing a configuration example of an information processing system according to an embodiment. As shown in FIG. 1, the information processing system 100 according to the embodiment includes a smart speaker 1, an information providing device 2, a terminal device 3, a plurality of devices 4_1 to 4_n (n is an integer of 2 or more), and a plurality of sensor devices 5_1 to 5_m (m is an integer of 2 or more). Hereinafter, when the devices 4_1 to 4_n are not distinguished from one another, each is referred to as the device 4, and when the sensor devices 5_1 to 5_m are not distinguished from one another, each is referred to as the sensor device 5.

The smart speaker 1, the information providing device 2, the terminal device 3, the device 4, and the sensor device 5 are connected via a network 6, wirelessly or by wire, so that they can communicate with one another. The network 6 is, for example, a LAN (Local Area Network) or a WAN (Wide Area Network) such as the Internet, and is composed of one or more networks.

The smart speaker 1, the terminal device 3, the device 4, and the sensor device 5 are arranged in an area AR_1 around a user U_1. The area AR_1 is, for example, the room or house of the user U_1. Although not shown, the smart speaker 1, the terminal device 3, the device 4, and the sensor device 5 are likewise arranged in each of the areas AR_2 to AR_k of users U_2 to U_k (k is an integer of 2 or more). Hereinafter, when the users U_1 to U_k are not distinguished from one another, each is referred to as the user U.

The smart speaker 1 is a speaker that can use an AI (Artificial Intelligence) assistant function supporting interactive voice operation, and the user U can obtain various kinds of information by interacting with the smart speaker 1. For example, the smart speaker 1 transmits input information indicating an instruction from the user U to the information providing device 2, acquires the content (for example, music, news, traffic information, weather, and other information) that the information providing device 2 provides via the network 6 in response to the input information, and outputs the acquired content from its built-in voice output device.

The smart speaker 1 can also control the device 4 according to an instruction from the user U. For example, when the device 4 is a lighting device, the smart speaker 1 can turn the lighting device on and off according to an instruction from the user U.

Based on information output from the smart speaker 1, the information providing device 2 can provide content corresponding to an instruction from the user U to the smart speaker 1 via the network 6. For example, the information providing device 2 can determine the instruction of the user U based on the utterance information of the user U transmitted from the smart speaker 1, and provide content corresponding to the determined instruction to the smart speaker 1.

The terminal device 3 is realized by, for example, a smartphone, a desktop PC (Personal Computer), a notebook PC, a tablet terminal, a mobile phone, or a PDA (Personal Digital Assistant). The terminal device 3 has, for example, a plurality of applications including an information notification application, and acquires and displays content notified from the information providing device 2. The terminal device 3 can notify the smart speaker 1 and the information providing device 2 of information indicating the operating state of the terminal device 3 and information indicating the history of operations performed on the terminal device 3 by the user U.

The device 4 is a device located around the user U. Examples of the device 4 include a refrigerator, a lighting device, a washing machine, an air conditioner, a television receiver, a dishwasher, a dish dryer, an induction cooker, and a microwave oven. The device 4 can notify the smart speaker 1 and the information providing device 2 of information indicating the operating state of the device 4 and information indicating the history of operations performed on the device 4 by the user U.

The sensor device 5 is a sensor located around the user U. The sensor device 5 includes, for example, a temperature sensor, a humidity sensor, an illuminance sensor, an atmospheric pressure sensor, and an open/close sensor that detects the opening and closing of a door. The sensor device 5 also includes an imaging unit that captures images of the user U. The sensor device 5 can notify the smart speaker 1 and the information providing device 2 of sensor information such as measured temperature, measured humidity, measured illuminance, measured atmospheric pressure, door open/close information, and captured image information. The sensor device 5 may be built into the smart speaker 1 or the device 4.

The information processing system 100 according to the embodiment can execute an utterance control process that controls spontaneous utterances from the voice output device of the smart speaker 1. The utterance control process is described below. FIG. 2 is an explanatory diagram of the utterance control process according to the embodiment. In the example shown in FIG. 2, the utterance control process is executed by the smart speaker 1, which is an example of the utterance control device.

The smart speaker 1 executes a context acquisition process for acquiring context information about the user U (step S1). The context information about the user U is information indicating the context of the user U. The context of the user U (hereinafter sometimes simply referred to as the context) is the situation relating to the user U. For example, the context includes the situation of the user U, the usage status of equipment (for example, the terminal device 3 or the device 4) by the user U, and the situation around the user U.

The situation of the user U includes, for example, the conversation state of the user U, the utterance content of the user U, the motion state of the user U, the current position of the user U, the attribute state of the user U, and the emotional state of the user U. The usage status of equipment by the user U includes, for example, the history of operations performed on the equipment by the user U and the usage time.

The situation around the user U includes, for example, the presence and state of other people around the user U, the physical environment in which the user U is placed, and the social environment in which the user U is placed. The physical environment includes the brightness, temperature, humidity, and weather around the user U. The social environment includes the operating status of transportation around the user U, the status of events being held around the user U, and the day of the week (including, for example, the distinction between holidays and weekdays).
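The three groups of context described above (the situation of the user U, the usage status of equipment, and the situation around the user U) could be held in a simple record structure. The following is a minimal sketch under stated assumptions: every class name, field name, and the choice of features in `to_features` is hypothetical and is not taken from the embodiment.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class UserSituation:
    """Situation of the user U (all field names are hypothetical)."""
    in_conversation: bool = False
    utterance_text: Optional[str] = None
    current_location: Optional[str] = None
    emotional_state: Optional[str] = None


@dataclass
class DeviceUsage:
    """Usage status of one terminal device 3 or device 4."""
    device_id: str
    operation_history: List[str] = field(default_factory=list)
    usage_seconds: float = 0.0


@dataclass
class Surroundings:
    """Physical and social environment around the user U."""
    other_people_present: bool = False
    brightness_lux: float = 0.0
    temperature_c: float = 0.0
    humidity_pct: float = 0.0
    is_holiday: bool = False


@dataclass
class ContextInfo:
    user: UserSituation
    devices: List[DeviceUsage]
    surroundings: Surroundings

    def to_features(self) -> List[float]:
        """Flatten a few fields into a numeric vector (an assumed encoding)."""
        total_usage = sum(d.usage_seconds for d in self.devices)
        return [
            float(self.user.in_conversation),
            float(self.surroundings.other_people_present),
            self.surroundings.brightness_lux,
            total_usage,
            float(self.surroundings.is_holiday),
        ]
```

A numeric vector of this kind is one possible input format for a model that judges whether the current context suits an utterance; the embodiment does not prescribe any particular encoding.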

For example, the smart speaker 1 can acquire, from the terminal device 3 and the device 4, information indicating the usage status of the terminal device 3 and the device 4 by the user U as context information. The smart speaker 1 can also acquire, from the sensor device 5 arranged around the user U, information indicating the physical environment in which the user U is placed as context information. The smart speaker 1 can further acquire utterances of the user U through a voice input device (not shown) and obtain information indicating the conversation state of the user U from the acquired utterances. In addition, the smart speaker 1 can acquire sounds other than utterances of the user U through the voice input device (not shown) and obtain information indicating the state of the surrounding sound from those sounds.

The smart speaker 1 can also acquire, from the information providing device 2 via the network 6, information indicating the social environment in which the user U is placed (for example, the operating status of transportation) and information indicating the physical environment in which the user U is placed (for example, the weather). When the terminal device 3 or the device 4 transmits information indicating its usage status by the user U to the information providing device 2, the smart speaker 1 can also acquire that information from the information providing device 2 via the network 6.

The smart speaker 1 repeatedly executes the context acquisition process described above. Based on the context information acquired by the repeatedly executed context acquisition process, the smart speaker 1 executes a timing determination process that determines the utterance timing in consideration of the past reactions of the user U to utterances from the voice output device 11 (step S2). The utterance timing is the timing at which the smart speaker 1 speaks spontaneously, without an instruction from the user U.

Based on the context information and on reaction information indicating the reactions of the user U to past utterances from the voice output device 11, the smart speaker 1 can determine whether or not the current context is suitable for an utterance.

For example, the smart speaker 1 has a model (hereinafter sometimes referred to as a timing determination model) generated by machine learning that uses the reaction information as teacher data and the context information as features. By inputting the context information acquired in the context acquisition process into the timing determination model, the smart speaker 1 can determine whether or not the current context is suitable for an utterance.

When the current context of the user U is suitable for an utterance, the smart speaker 1 determines that the utterance timing has arrived and performs an output control process that outputs a spontaneous utterance from the voice output device 11 (step S3). As a result, a spontaneous utterance is output from the voice output device 11 (step S4).
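Steps S1 to S4 can be read as a polling loop: acquire the context, ask whether it suits an utterance, and speak once it does. The sketch below illustrates only that control flow. The function name and the three callables passed in (`acquire_context`, `is_suitable`, `speak`) are hypothetical stand-ins, and the bounded iteration count is an assumption made so that the example terminates.

```python
from typing import Any, Callable


def run_utterance_control(
    acquire_context: Callable[[], Any],
    is_suitable: Callable[[Any], bool],
    speak: Callable[[], None],
    max_iterations: int,
) -> bool:
    """Poll the context until it suits an utterance, then speak once.

    Returns True if an utterance was output, False if the iteration
    budget ran out first.
    """
    for _ in range(max_iterations):
        context = acquire_context()   # step S1: context acquisition
        if is_suitable(context):      # step S2: timing determination
            speak()                   # steps S3 and S4: output control and utterance
            return True
    return False
```

For example, with a context source that reports "busy" twice and then "idle", and a suitability check that accepts only "idle", the loop speaks on the third iteration.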

In this way, the smart speaker 1 determines the timing of its spontaneous utterances in consideration of the past reactions of the user U to utterances from the voice output device 11. Compared with controlling the utterance timing under preset conditions, the smart speaker 1 can therefore speak at a more appropriate timing.

In the timing determination process, the smart speaker 1 can determine the utterance timing for each content item that can be output from the voice output device 11. This allows the smart speaker 1 to speak at a timing suited to each content item.

[2. Configuration of Smart Speaker 1]
Next, the configuration of the smart speaker 1 according to the embodiment will be described in detail. FIG. 3 is a diagram showing a configuration example of the smart speaker 1 according to the embodiment. As shown in FIG. 3, the smart speaker 1 includes a communication unit 10, a voice output device 11, a voice input device 12, an imaging unit 13, a storage unit 14, and a control unit 15.

The communication unit 10 is a communication interface capable of communicating with devices such as the information providing device 2, the terminal device 3, the device 4, and the sensor device 5 via the network 6. The control unit 15 can send and receive information to and from the information providing device 2, the terminal device 3, the device 4, and the sensor device 5 via the communication unit 10. The smart speaker 1 may also be configured to communicate with the terminal device 3, the device 4, and the sensor device 5 through a communication unit other than the communication unit 10.

For example, when the network 6 is composed of a LAN and a WAN and the communication unit 10 is connected to the LAN, the control unit 15 sends and receives information to and from the terminal device 3, the device 4, and the sensor device 5 via the LAN, and to and from the information providing device 2 via the LAN and the WAN.

The voice output device 11 vibrates according to the electric signal output from the control unit 15, thereby outputting a sound wave corresponding to the electric signal to the outside of the smart speaker 1. The voice output device 11 includes, for example, a diaphragm and a drive mechanism that vibrates the diaphragm in response to the electric signal. Although only one voice output device 11 is shown in the example of FIG. 3, the smart speaker 1 may be provided with a plurality of voice output devices 11.

The voice input device 12 is a microphone that converts sound waves input from the outside into an electric signal and outputs the converted electric signal to the control unit 15. Although only one voice input device 12 is shown in the example of FIG. 3, the smart speaker 1 may be provided with a plurality of voice input devices 12.

The imaging unit 13 has, for example, a CMOS (Complementary Metal Oxide Semiconductor) image sensor and captures images of the surroundings of the smart speaker 1. The imaging unit 13 outputs imaging information, which is the imaging result, to the control unit 15. The imaging information includes information on captured images of the surroundings of the smart speaker 1.

The storage unit 14 stores an operation history 20 of the operations performed on the smart speaker 1 by the user U, and an utterance table 21 used for spontaneous utterances.

The operation history 20 is information in which the content of an operation on the smart speaker 1 and the time of the operation are associated with each other for each operation by the user U. The operation content includes various operations, for example, a voice operation that starts the output of content (for example, schedule, mail, news, music, or traffic information) from the voice output device 11 and a voice operation that stops the output of content from the voice output device 11.
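One way to picture the operation history 20 is as a list of records, each pairing an operation with its time. The sketch below is illustrative only: the record layout, the operation labels "start_output" and "stop_output", and the helper `operations_for` are all assumptions and do not appear in the embodiment.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class OperationRecord:
    """One entry of the operation history (hypothetical layout)."""
    operated_at: datetime   # operation time
    operation: str          # e.g. "start_output" or "stop_output"
    content: str            # e.g. "news", "music", "schedule"


def operations_for(history: List[OperationRecord], content: str) -> List[OperationRecord]:
    """Return the records that concern one kind of content, oldest first."""
    matching = [r for r in history if r.content == content]
    return sorted(matching, key=lambda r: r.operated_at)
```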

FIG. 4 is a diagram showing an example of the utterance table 21 according to the embodiment. The utterance table 21 shown in FIG. 4 includes information in which a "content ID", an "utterance content", and a "model" are associated with one another. The "content ID" is identification information unique to a content item.

The "utterance content" is, for example, utterance information for asking whether the user wishes to use the content. For example, when the content is news, the "utterance content" is "How about today's news?" or "There are X news items". When the content is mail, the "utterance content" is, for example, "X emails have arrived", and when the content is a schedule, it is, for example, "You have X appointments today". The value of "X" can be acquired by the control unit 15 from the information providing device 2 and inserted into the utterance content.

The "utterance content" may also be the content itself. In this case, no utterance content is set in the utterance table 21, and the control unit 15 acquires the content from the information providing device 2 based on the content ID.
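The two cases described for the "utterance content" (a template whose "X" is filled in, or no template at all, in which case the content itself is spoken) can be sketched as a small lookup. The table contents, the content IDs "C1" to "C3", the model labels, and the callables `fetch_count` and `fetch_content` are hypothetical; the actual entries of FIG. 4 are not reproduced here.

```python
from typing import Callable, Dict, Optional

# Hypothetical utterance table: content ID -> utterance template and model label.
# A template of None means the content itself is the utterance.
UTTERANCE_TABLE: Dict[str, Dict[str, Optional[str]]] = {
    "C1": {"utterance": "There are {count} news items.", "model": "M1"},
    "C2": {"utterance": "{count} emails have arrived.", "model": "M2"},
    "C3": {"utterance": None, "model": "M3"},
}


def build_utterance(
    content_id: str,
    fetch_count: Callable[[str], int],
    fetch_content: Callable[[str], str],
) -> str:
    """Build the text to speak for one content ID.

    fetch_count and fetch_content stand in for requests to the
    information providing device 2.
    """
    template = UTTERANCE_TABLE[content_id]["utterance"]
    if template is None:
        # No utterance content is set: speak the content itself.
        return fetch_content(content_id)
    return template.format(count=fetch_count(content_id))
```

With `fetch_count` returning 3, `build_utterance("C2", ...)` yields "3 emails have arrived.", while "C3" falls through to whatever `fetch_content` returns.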

The "model" is a timing determination model generated based on the past reactions of the user U to utterances and the context information about the user U at the time of those utterances, and it differs for each content item. For example, the timing determination model is a regression model whose objective variable is the past reaction of the user U to an utterance and whose explanatory variables are the context of the user U. The objective variable is also called teacher data, and the explanatory variables are also called features. In the timing determination model, accuracy improves as the number of explanatory variables increases, but a single explanatory variable may also be used.
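A regression model of this kind can be illustrated with a small logistic regression trained by stochastic gradient descent. This is a sketch under stated assumptions, not the embodiment's method: the embodiment only says "regression model", so the logistic form, the learning rate, the epoch count, the 0.5 threshold mentioned afterwards, and the binary encoding of reactions (1 for a favorable reaction, 0 otherwise) are all choices made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Model = Tuple[List[float], float]  # (weights, bias)


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_timing_model(
    features: Sequence[Sequence[float]],
    reactions: Sequence[int],
    epochs: int = 500,
    learning_rate: float = 0.5,
) -> Model:
    """Fit a logistic regression: context features -> past reaction (0 or 1)."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, reactions):
            predicted = _sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            error = predicted - y  # gradient of the log loss w.r.t. the logit
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


def suitability(model: Model, x: Sequence[float]) -> float:
    """Estimated probability that the user reacts favorably in context x."""
    weights, bias = model
    return _sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
```

As a usage example, suppose each context is encoded as `[in_conversation, tv_on]` and the user reacted favorably only when not in conversation. After training on such data, `suitability` returns a value above 0.5 for `[0, 0]` and below 0.5 for `[1, 0]`, so a threshold of 0.5 would allow an utterance only in the first case.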

The control unit 15 shown in FIG. 3 includes an input processing unit 31, an information output unit 32, an information acquisition unit 33, an output processing unit 34, a context acquisition unit 35, and a timing determination unit 36. The input processing unit 31 recognizes the voice of the user U from the electric signal output from the voice input device 12.

The input processing unit 31 also determines gestures of the user U from the imaging information output from the imaging unit 13. The input processing unit 31 can also acquire imaging information from a sensor device 5 that includes an imaging unit and determine gestures of the user U based on the acquired imaging information.

The input processing unit 31 can also detect the mouth movements of the user U from the imaging information output from the imaging unit 13 and determine, from the detected mouth movements, what the user U is silently mouthing (unvoiced speech). In other words, the input processing unit 31 can perform lip reading on the imaging information output from the imaging unit 13. The input processing unit 31 treats the lip-reading result as utterance information of the user U. When the voice of the user U can be identified, the input processing unit 31 does not perform lip reading.
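The priority between voice recognition and lip reading can be expressed as a simple fallback. In the sketch below, `recognize_speech` and `read_lips` are hypothetical callables standing in for real recognizers, and the convention that an empty or missing result means "voice could not be identified" is an assumption.

```python
from typing import Any, Callable, Optional, Tuple


def get_utterance_info(
    recognize_speech: Callable[[Any], Optional[str]],
    read_lips: Callable[[Any], Optional[str]],
    audio: Any,
    frames: Any,
) -> Tuple[Optional[str], str]:
    """Return (utterance text, source), preferring recognized voice.

    Lip reading runs only when the voice could not be identified.
    """
    text = recognize_speech(audio)
    if text:
        # Voice identified: lip reading is skipped.
        return text, "voice"
    return read_lips(frames), "lip_reading"
```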

When the input processing unit 31 recognizes that the user U has spoken the wake-up word, the information output unit 32 outputs the utterance information of the user U that follows the wake-up word to the information providing device 2. The utterance information may be the voice information of the user U itself or text information. The information output unit 32 also outputs gesture information, which is information on the gesture determined by the input processing unit 31, to the information providing device 2. The wake-up word is a word preset in the smart speaker 1; however, the information output unit 32 can also output the utterance information and gesture information of the user U to the information providing device 2 regardless of whether the wake-up word has been spoken.
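When the utterance information is text, forwarding only what follows the wake-up word amounts to a substring extraction. The sketch below assumes a transcript is already available; the wake-up word "hello speaker" and the function name are hypothetical, and case-insensitive matching is an illustrative choice.

```python
from typing import Optional

WAKE_WORD = "hello speaker"  # hypothetical preset wake-up word


def extract_command(transcript: str, wake_word: str = WAKE_WORD) -> Optional[str]:
    """Return the text following the wake-up word, or None if there is none.

    Matching is case-insensitive; surrounding spaces, commas, and periods
    are trimmed from the result.
    """
    position = transcript.lower().find(wake_word.lower())
    if position < 0:
        return None  # wake-up word not spoken
    command = transcript[position + len(wake_word):].strip(" ,.")
    return command or None
```

For instance, `extract_command("Hello speaker, play the news")` returns `"play the news"`, while a transcript without the wake-up word returns `None`.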

The information output unit 32 can also continuously and repeatedly output the context information acquired by the context acquisition unit 35 to the information providing device 2, for example, when the context information is not acquired from the information providing device 2.

During a preset period that begins when the output of a voice advertisement from the voice output device 11 starts (hereinafter sometimes referred to as the advertisement output period), the information output unit 32 outputs, to the information providing device 2, sound information including the sound input to the voice input device 12 and imaging information output from the imaging unit 13. The preset period is, for example, a period that starts when the output of the voice advertisement from the voice output device 11 starts and ends when the output of the voice advertisement is stopped or finished, or a period that starts when the output of the voice advertisement starts and ends a fixed time after the output of the voice advertisement is stopped or finished. During the advertisement output period, the information output unit 32 can also acquire, from the terminal device 3 or the sensor device 5, sound information including the sound input to the voice input device of the terminal device 3 or to the sensor device 5, and output it to the information providing device 2.
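Both variants of the advertisement output period (ending when the advertisement stops, or a fixed time afterwards) reduce to one interval check. The sketch below is an illustration under assumptions: times are plain seconds, `grace_seconds` is a hypothetical parameter for the fixed time after the advertisement ends, and an end time of `None` is taken to mean the advertisement is still playing.

```python
from typing import Optional


def in_ad_output_period(
    now: float,
    ad_start: float,
    ad_end: Optional[float],
    grace_seconds: float = 0.0,
) -> bool:
    """Return True while sound and imaging information should be forwarded.

    grace_seconds = 0 gives the period that ends when the advertisement
    stops; a positive value extends it by a fixed time after that.
    """
    if now < ad_start:
        return False          # advertisement has not started yet
    if ad_end is None:
        return True           # advertisement is still being output
    return now <= ad_end + grace_seconds
```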

情報取得部３３は、情報提供装置２からコンテンツ毎のタイミング判定モデルを含むモデル情報を取得し、取得したタイミング判定モデル情報を発話テーブル２１に設定することができる。また、情報取得部３３は、情報提供装置２からコンテンツを取得する。出力処理部３４は、情報取得部３３によって取得されたコンテンツを電気信号へ変換して音声出力器１１へ出力する。これにより、スマートスピーカ１からコンテンツが音として出力される。 The information acquisition unit 33 can acquire model information including the timing determination model for each content from the information providing device 2, and set the acquired timing determination model information in the utterance table 21. In addition, the information acquisition unit 33 acquires the content from the information providing device 2. The output processing unit 34 converts the content acquired by the information acquisition unit 33 into an electric signal and outputs the content to the audio output device 11. As a result, the content is output as sound from the smart speaker 1.

なお、出力処理部３４は、コンテンツを情報提供装置２から文字情報として取得した場合、文字情報を音声合成処理によって音声信号（電気信号）へ変換して音声出力器１１へ出力する。また、出力処理部３４は、コンテンツを情報提供装置２から音声情報として取得した場合、音声情報をデジタルアナログ変換によって音声信号（電気信号）へ変換して音声出力器１１へ出力する。 When the content is acquired as character information from the information providing device 2, the output processing unit 34 converts the character information into a voice signal (electric signal) by voice synthesis processing and outputs the content to the voice output device 11. Further, when the content is acquired as audio information from the information providing device 2, the output processing unit 34 converts the audio information into an audio signal (electric signal) by digital-to-analog conversion and outputs the content to the audio output device 11.

コンテキスト取得部３５は、ユーザＵに関するコンテキスト情報を取得するコンテンツ取得処理を実行する。コンテキスト取得部３５は、端末装置３、機器４、およびセンサ装置５から直接または情報提供装置２を介してコンテキスト情報を取得することができる。 The context acquisition unit 35 executes a content acquisition process for acquiring context information regarding the user U. The context acquisition unit 35 can acquire context information directly from the terminal device 3, the device 4, and the sensor device 5 or via the information providing device 2.

具体的には、コンテキスト取得部３５は、ユーザＵの周囲の状況を示す周囲情報をコンテキスト情報の少なくとも一部として取得することができる。例えば、コンテキスト取得部３５は、ユーザＵの周囲に存在する１以上のセンサ装置５から出力されるセンサ情報から周囲情報を取得することができる。 Specifically, the context acquisition unit 35 can acquire surrounding information indicating the surrounding situation of the user U as at least a part of the context information. For example, the context acquisition unit 35 can acquire surrounding information from sensor information output from one or more sensor devices 5 existing around the user U.

センサ情報は、例えば、ユーザＵの周囲の明るさを示す照度情報、ユーザＵの周囲の気温を示す気温情報、およびユーザＵの周囲の湿度を示す湿度情報の少なくとも一つが含まれている。コンテキスト取得部３５は、照度情報、気温情報、および湿度情報を周囲情報として取得することができる。 The sensor information includes, for example, at least one of illuminance information indicating the brightness around the user U, temperature information indicating the air temperature around the user U, and humidity information indicating the humidity around the user U. The context acquisition unit 35 can acquire illuminance information, temperature information, and humidity information as ambient information.

また、コンテキスト取得部３５は、ユーザＵの周囲に存在する１以上の機器の状態を示す機器情報から周囲情報を取得することができる。ここで、１以上の機器とは、スマートスピーカ１、端末装置３、および機器４のうち１以上の機器である。機器情報は、例えば、機器のオン／オフといった機器の稼動状態を示す情報や、動作状態を示す情報である。 Further, the context acquisition unit 35 can acquire surrounding information from device information indicating the state of one or more devices existing around the user U. Here, the one or more devices are one or more devices among the smart speaker 1, the terminal device 3, and the device 4. The device information is, for example, information indicating the operating state of the device such as turning on / off the device, or information indicating the operating state.

例えば、端末装置３の場合、動作状態を示す情報には、端末装置３で表示中のアプリケーションの種別や表示中のコンテンツの内容などが含まれる。また、機器がエアコンである場合、動作状態を示す情報には、エアコンの設定風量や設定温度などの情報が含まれる。 For example, in the case of the terminal device 3, the information indicating the operating state includes the type of the application displayed on the terminal device 3 and the content of the displayed content. When the device is an air conditioner, the information indicating the operating state includes information such as the set air volume and the set temperature of the air conditioner.

また、コンテキスト取得部３５は、ユーザＵの周囲に存在する１以上の機器への操作履歴を示す操作履歴情報から周囲情報を取得することができる。１以上の機器とは、スマートスピーカ１、端末装置３、および機器４のうち１以上の機器である。操作履歴情報には、例えば、機器への操作内容と操作時刻とが関連付けられた情報がユーザＵの操作毎に含まれる。コンテキスト取得部３５は、スマートスピーカ１の操作履歴情報を記憶部１４から取得することができる。 Further, the context acquisition unit 35 can acquire surrounding information from the operation history information indicating the operation history of one or more devices existing around the user U. The one or more devices are one or more devices among the smart speaker 1, the terminal device 3, and the device 4. The operation history information includes, for example, information associated with the operation content of the device and the operation time for each operation of the user U. The context acquisition unit 35 can acquire the operation history information of the smart speaker 1 from the storage unit 14.

また、コンテキスト取得部３５は、例えば、ユーザＵの撮像情報を示す撮像情報を撮像部１３、端末装置３、機器４、またはセンサ装置５から取得することができる。コンテキスト取得部３５は、取得した撮像情報からユーザＵの状況を示す情報といったコンテキスト情報を取得することができる。 Further, the context acquisition unit 35 can acquire, for example, imaging information indicating the imaging information of the user U from the imaging unit 13, the terminal device 3, the device 4, or the sensor device 5. The context acquisition unit 35 can acquire context information such as information indicating the status of the user U from the acquired imaging information.

また、コンテキスト取得部３５は、例えば、音声入力器１２へ入力される音を含む音情報から、ユーザＵの会話の状態、ユーザＵの発話状態、ユーザＵの周囲の音（機器４の音を含む）などのコンテキスト情報を取得することができる。 Further, the context acquisition unit 35 can acquire context information such as the conversation state of the user U, the utterance state of the user U, and the sounds around the user U (including the sound of the device 4) from, for example, sound information including the sound input to the voice input device 12.

タイミング決定部３６は、コンテキスト取得部３５によって取得されたコンテキスト情報に基づいて、音声出力器１１からの発話に対する過去のユーザＵの反応を考慮した発話のタイミングである発話タイミングを決定する。 The timing determination unit 36 determines the utterance timing, which is the timing of the utterance in consideration of the past reaction of the user U to the utterance from the voice output device 11, based on the context information acquired by the context acquisition unit 35.

例えば、タイミング決定部３６は、記憶部１４に記憶された発話テーブルに含まれるコンテンツ毎のタイミング判定モデルにコンテキスト取得部３５で取得されたコンテキスト情報を入力情報として入力してモデルを用いた演算を行う。タイミング決定部３６は、タイミング判定モデルの演算結果であるスコアが予め設定された閾値以上であるか否かを判定する。 For example, the timing determination unit 36 inputs the context information acquired by the context acquisition unit 35 into the timing determination model for each content included in the utterance table stored in the storage unit 14, and performs a calculation using the model. The timing determination unit 36 then determines whether or not the score, which is the calculation result of the timing determination model, is equal to or greater than a preset threshold value.

タイミング決定部３６は、タイミング判定モデルの演算結果であるスコアが閾値以上であると判定した場合、発話テーブル２１において、スコアが閾値以上であるタイミング判定モデルに関連付けられた発話内容を出力するタイミングになったと判定する。また、タイミング決定部３６は、スコアが閾値以上であるタイミング判定モデルが同時に２以上ある場合、スコアが閾値以上であるタイミング判定モデルのうち最も高いスコアのタイミング判定モデルに関連付けられた発話内容を出力するタイミングになったと判定する。 When the timing determination unit 36 determines that the score, which is the calculation result of a timing determination model, is equal to or greater than the threshold value, it determines that the time has come to output the utterance content associated with that timing determination model in the utterance table 21. When two or more timing determination models have a score equal to or greater than the threshold value at the same time, the timing determination unit 36 determines that the time has come to output the utterance content associated with the timing determination model that has the highest score among them.
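The threshold comparison and the highest-score rule described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the row layout, the name `select_utterance`, and the representation of a timing determination model as a plain Python callable are all assumptions made for this example.

```python
from typing import Callable, Mapping, Optional, Sequence, Tuple

# Hypothetical types for illustration only.
Context = Mapping[str, float]
# One utterance-table row: (content ID, utterance content, timing determination model).
Row = Tuple[str, str, Callable[[Context], float]]


def select_utterance(
    table: Sequence[Row], context: Context, threshold: float
) -> Optional[Tuple[str, str, float]]:
    """Return (content ID, utterance content, score) for the model with the
    highest score among those at or above the threshold, or None if no
    model reaches the threshold."""
    best: Optional[Tuple[str, str, float]] = None
    for content_id, utterance, model in table:
        score = model(context)  # evaluate the model on the current context
        if score >= threshold and (best is None or score > best[2]):
            best = (content_id, utterance, score)
    return best
```

Returning `None` corresponds to the case where it is not yet the utterance timing for any content.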

なお、タイミング決定部３６は、タイミング判定モデルの演算結果であるスコアが閾値以上であると判定した場合でも、発話内容を出力するタイミングになったとは判定しないことができる。例えば、タイミング決定部３６は、スコアが閾値以上であるタイミング判定モデルに関連付けられた発話内容を前回出力してから予め設定した期間（以下、出力禁止期間と記載する）を経過していない場合、発話タイミングになったとは判定しないことができる。 Even when the timing determination unit 36 determines that the score, which is the calculation result of the timing determination model, is equal to or greater than the threshold value, it may refrain from determining that the time has come to output the utterance content. For example, when a preset period (hereinafter referred to as the output prohibition period) has not yet elapsed since the utterance content associated with the timing determination model whose score is equal to or greater than the threshold value was last output, the timing determination unit 36 may refrain from determining that the utterance timing has come.

また、タイミング決定部３６は、スコアが閾値以上であるタイミング判定モデルに関連付けられた発話内容を現時刻から予め設定された期間（以下、設定期間と記載する）前までの間に予め設定された回数（以下、出力上限回数と記載する）を超えた場合、発話タイミングになったとは判定しないことができる。 Further, when the utterance content associated with the timing determination model whose score is equal to or greater than the threshold value has been output more than a preset number of times (hereinafter referred to as the output upper limit count) between the current time and a point a preset period (hereinafter referred to as the set period) earlier, the timing determination unit 36 may refrain from determining that the utterance timing has come.
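The two suppression rules above (the output prohibition period and the output upper limit count within the set period) can be combined into one check. The sketch below is only an illustration: the function name, the use of timestamps in seconds, and the choice to block output as soon as the count reaches the limit (so that the limit is never exceeded) are assumptions, not details given in the description.

```python
from typing import Sequence


def is_output_allowed(
    past_outputs: Sequence[float],
    now: float,
    prohibition_period: float,
    set_period: float,
    max_outputs: int,
) -> bool:
    """past_outputs holds timestamps (seconds) at which the same utterance
    content was output before. Returns False if either rule suppresses output."""
    # Rule 1: the output prohibition period since the previous output has not elapsed.
    if past_outputs and now - max(past_outputs) < prohibition_period:
        return False
    # Rule 2: count outputs inside the set period that ends at the current time.
    recent = sum(1 for t in past_outputs if now - set_period <= t <= now)
    # Assumption: block once the count has reached the upper limit.
    return recent < max_outputs
```

A caller would run this check only for utterance content whose model score already met the threshold.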

また、タイミング決定部３６は、タイミング判定モデルの演算結果であるスコアが閾値以上であると判定した場合でも、ユーザＵがスマートスピーカ１を操作中の場合や音声出力器１１からコンテンツや発話が出力中であれば、発話タイミングになったとは判定しない。この場合、タイミング決定部３６は、ユーザＵによるスマートスピーカ１の操作が終了した時点で、継続してスコアが閾値以上であるタイミング判定モデルがあれば、ユーザＵによるスマートスピーカ１の操作が終了してから一定期間後に、発話内容を出力するタイミングになったと判定することができる。 Further, even when the timing determination unit 36 determines that the score, which is the calculation result of the timing determination model, is equal to or greater than the threshold value, it does not determine that the utterance timing has come while the user U is operating the smart speaker 1 or while content or an utterance is being output from the voice output device 11. In this case, if there is a timing determination model whose score is still equal to or greater than the threshold value when the user U finishes operating the smart speaker 1, the timing determination unit 36 can determine that the time has come to output the utterance content a fixed time after that operation ends.

また、タイミング決定部３６は、ユーザＵによる音声操作に基づいて、上述した出力禁止期間、および出力上限回数をコンテンツ毎に発話テーブル２１に設定することができる。なお、コンテンツ毎に設定可能な情報は、出力禁止期間、および出力上限回数に限定されない。また、出力禁止期間、および出力上限回数といった情報はユーザＵの設定によらず予め発話テーブル２１に設定されていてもよい。 Further, the timing determination unit 36 can set the above-mentioned output prohibition period and output upper limit number of times in the utterance table 21 for each content based on the voice operation by the user U. The information that can be set for each content is not limited to the output prohibition period and the maximum number of output times. Further, information such as the output prohibition period and the maximum number of times of output may be set in advance in the utterance table 21 regardless of the setting of the user U.

このように、タイミング決定部３６は、コンテキスト情報に基づいて、発話タイミングと、かかる発話タイミングで出力すると判定した発話内容（以下、出力対象発話内容と記載する場合がある）とを決定することができる。タイミング決定部３６は、発話タイミングと出力対象発話内容とを音声出力器１１からの発話に対する過去のユーザＵの反応を考慮して、発話タイミングと出力対象発話内容とを決定することから、発話タイミングをより適切に決定することができる。 In this way, the timing determination unit 36 can determine, based on the context information, the utterance timing and the utterance content to be output at that timing (hereinafter sometimes referred to as the output target utterance content). Because the timing determination unit 36 determines the utterance timing and the output target utterance content in consideration of the past reactions of the user U to utterances from the voice output device 11, it can determine the utterance timing more appropriately.

例えば、ユーザＵが暗い場所に位置し、ユーザＵの周囲に収集車（例えば、ゴミ収集車）がいる状況で何度発話しても、発話に対するユーザＵの反応がないとする。この場合、タイミング判定モデルは、ユーザＵが暗い場所に位置し、かつ、ユーザＵの周囲に収集車がいることを示す場合に出力するスコアが閾値よりも小さくなるように生成される。そのため、タイミング決定部３６は、ユーザＵが暗い場所に位置し、かつ、ユーザＵの周囲に収集車がいることをコンテキスト情報が示す場合、発話タイミングでないと判定する。 For example, suppose that the user U is located in a dark place and there is a collection vehicle (for example, a garbage truck) around the user U, and no matter how many times the user U speaks, there is no reaction of the user U to the utterance. In this case, the timing determination model is generated so that the score output when the user U is located in a dark place and indicates that there is a collection vehicle around the user U is smaller than the threshold value. Therefore, when the context information indicates that the user U is located in a dark place and there is a collecting vehicle around the user U, the timing determination unit 36 determines that the utterance timing is not reached.

また、食器洗浄機と電子レンジとが共に使用されている状態では、発話に対するユーザＵの反応がないとする。この場合、タイミング判定モデルは、食器洗浄機と電子レンジとが共に使用されている状態である場合に出力するスコアが閾値よりも小さくなるように生成される。そのため、タイミング決定部３６は、食器洗浄機と電子レンジとが共に使用されていることをコンテキスト情報が示す場合、発話タイミングでないと判定する。 Further, it is assumed that the user U does not respond to the utterance when both the dishwasher and the microwave oven are used. In this case, the timing determination model is generated so that the score output when both the dishwasher and the microwave oven are used is smaller than the threshold value. Therefore, when the context information indicates that the dishwasher and the microwave oven are used together, the timing determination unit 36 determines that it is not the utterance timing.

また、ユーザＵが端末装置３を操作中（例えば、端末装置３でウェブページを閲覧中、または端末装置３で音楽を再生中）である場合に、発話に対するユーザＵの反応がないとする。この場合、タイミング判定モデルは、ユーザＵが端末装置３を操作中である場合に出力するスコアが閾値よりも小さくなるように生成される。そのため、タイミング決定部３６は、ユーザＵが端末装置３を操作中であることをコンテキスト情報が示す場合、発話タイミングでないと判定する。 Further, when the user U is operating the terminal device 3 (for example, the terminal device 3 is browsing a web page or the terminal device 3 is playing music), it is assumed that the user U does not respond to the utterance. In this case, the timing determination model is generated so that the score output when the user U is operating the terminal device 3 is smaller than the threshold value. Therefore, when the context information indicates that the user U is operating the terminal device 3, the timing determination unit 36 determines that it is not the utterance timing.

また、例えば、発話が開始された後において、ユーザＵの会話が続く場合やユーザＵが「やめて」と発話した場合を不正解データとして、且つ発話時のコンテキスト情報を特徴量としてタイミング判定モデルが生成される。この場合、タイミング決定部３６は、ユーザＵがユーザＵの会話を続けるようなコンテキストやユーザＵが「やめて」と発話するようなコンテキストでは、発話タイミングでないと判定することができる。 Further, for example, the timing determination model is generated using, as incorrect-answer data, the cases where the conversation of the user U continues after the utterance has started or where the user U says "stop", and using the context information at the time of the utterance as the feature amount. In this case, the timing determination unit 36 can determine that it is not the utterance timing in a context in which the user U would continue a conversation or would say "stop".

このように、タイミング決定部３６は、現在のユーザＵに関するコンテキストが発話に適したコンテキストである場合に、発話タイミングであると決定することができる。また、タイミング判定モデルはコンテンツ毎に生成されているため、コンテンツ毎の適切な発話タイミングが決定される。例えば、朝の時間帯であれば、交通機関の運行状態に関するコンテンツやニュースのコンテンツにユーザＵが反応することが多い。そのため、交通機関の運行状態に関するコンテンツやニュースのコンテンツには、朝の時間帯が発話タイミングになりやすいタイミング判定モデルが生成される。 In this way, the timing determination unit 36 can determine the utterance timing when the current context regarding the user U is a context suitable for utterance. Further, since the timing determination model is generated for each content, an appropriate utterance timing for each content is determined. For example, in the morning hours, the user U often reacts to content related to the operating status of transportation or news content. Therefore, a timing determination model is generated in which the morning time zone is likely to be the utterance timing for the content related to the operation state of the transportation system and the content of the news.

出力処理部３４は、タイミング決定部３６によって決定された発話タイミングで、タイミング決定部３６によって決定された出力対象発話内容を音声出力器１１から出力する。例えば、発話テーブル２１が図４に示す状態で、出力対象発話内容が「発話内容ＸＡ」である場合、出力処理部３４は、発話内容ＸＡに基づく電気信号を音声出力器１１へ出力することで、発話内容ＸＡが音声出力器１１から音声で出力される。 The output processing unit 34 outputs the output target utterance content determined by the timing determination unit 36 from the voice output device 11 at the utterance timing determined by the timing determination unit 36. For example, when the utterance table 21 is in the state shown in FIG. 4 and the output target utterance content is "utterance content XA", the output processing unit 34 outputs an electric signal based on the utterance content XA to the voice output device 11, so that the utterance content XA is output as voice from the voice output device 11.

〔３．情報提供装置２の構成〕 [3. Configuration of information providing device 2]
次に、実施形態に係る情報提供装置２の構成について具体的に説明する。図５は、実施形態に係る情報提供装置２の構成例を示す図である。図５に示すように、情報提供装置２は、通信部４１と、記憶部４２と、制御部４３とを備える。 Next, the configuration of the information providing device 2 according to the embodiment will be specifically described. FIG. 5 is a diagram showing a configuration example of the information providing device 2 according to the embodiment. As shown in FIG. 5, the information providing device 2 includes a communication unit 41, a storage unit 42, and a control unit 43.

通信部４１は、ネットワーク６を介してスマートスピーカ１、端末装置３、機器４、およびセンサ装置５などの装置と通信可能な通信インターフェイスである。制御部４３は通信部４１を介して情報提供装置２、スマートスピーカ１、端末装置３、機器４、およびセンサ装置５と情報の送受信を行うことができる。 The communication unit 41 is a communication interface capable of communicating with devices such as the smart speaker 1, the terminal device 3, the device 4, and the sensor device 5 via the network 6. The control unit 43 can send and receive information to and from the information providing device 2, the smart speaker 1, the terminal device 3, the device 4, and the sensor device 5 via the communication unit 41.

記憶部４２は、発話テーブル記憶部５１と、コンテンツ記憶部５２と、音声広告記憶部５３と、ユーザ情報記憶部５４と、コンテキスト記憶部５５と、出力態様判定情報記憶部５６とを有する。 The storage unit 42 includes an utterance table storage unit 51, a content storage unit 52, a voice advertisement storage unit 53, a user information storage unit 54, a context storage unit 55, and an output mode determination information storage unit 56.

発話テーブル記憶部５１は、スマートスピーカ１毎の発話テーブル２１の情報を記憶する。図６は、実施形態に係る発話テーブル記憶部５１に記憶される発話テーブルの一例を示す図である。図６に示す発話テーブル７１は、「コンテンツＩＤ」と、「発話内容」と、「モデル」とが互いに関連付けられた情報を「機器ＩＤ」毎に含む。 The utterance table storage unit 51 stores the information of the utterance table 21 for each smart speaker 1. FIG. 6 is a diagram showing an example of an utterance table stored in the utterance table storage unit 51 according to the embodiment. The utterance table 71 shown in FIG. 6 includes information in which the “content ID”, the “utterance content”, and the “model” are associated with each other for each “device ID”.

発話テーブル７１における「コンテンツＩＤ」、「発話内容」、および「モデル」は、発話テーブル２１における「コンテンツＩＤ」、「発話内容」、および「モデル」と同様の情報である。「機器ＩＤ」は、スマートスピーカ１毎に固有の識別情報である。 The "content ID", "utterance content", and "model" in the utterance table 71 are the same information as the "content ID", "utterance content", and "model" in the utterance table 21. The "device ID" is identification information unique to each smart speaker 1.

図５に示すコンテンツ記憶部５２は、スマートスピーカ１へ提供する各種のコンテンツを記憶する。図７は、実施形態に係るコンテンツ記憶部５２に記憶されるコンテンツテーブルの一例を示す図である。図７に示すコンテンツテーブル７２は、「コンテンツＩＤ」と、「コンテンツ」とが互いに関連付けられた情報である。 The content storage unit 52 shown in FIG. 5 stores various contents to be provided to the smart speaker 1. FIG. 7 is a diagram showing an example of a content table stored in the content storage unit 52 according to the embodiment. The content table 72 shown in FIG. 7 is information in which the “content ID” and the “content” are associated with each other.

「コンテンツ」には、聴覚的出力用コンテンツと、視覚的出力用コンテンツとが含まれる。聴覚的出力用コンテンツは、音声で出力されるコンテンツであり、視覚的出力用コンテンツは文字、画像などといった音声以外の態様で出力されるコンテンツである。 The "content" includes a content for auditory output and a content for visual output. The auditory output content is content that is output by voice, and the visual output content is content that is output in a mode other than voice, such as characters and images.

図５に示す音声広告記憶部５３は、音声広告の情報などを記憶する。図８は、実施形態に係る音声広告記憶部５３に記憶される音声広告テーブルの一例を示す図である。図８に示す音声広告テーブル７３は、「広告ＩＤ」と、「音声広告」と、「出力回数」と、「受容回数」と、「受容率」とが互いに関連付けられた情報である。「広告ＩＤ」は、音声広告毎に固有の識別情報である。 The voice advertisement storage unit 53 shown in FIG. 5 stores voice advertisement information and the like. FIG. 8 is a diagram showing an example of a voice advertisement table stored in the voice advertisement storage unit 53 according to the embodiment. The voice advertisement table 73 shown in FIG. 8 is information in which the “advertisement ID”, the “voice advertisement”, the “output count”, the “acceptance count”, and the “acceptance rate” are associated with each other. The "advertisement ID" is identification information unique to each voice advertisement.

「音声広告」は、音声広告のコンテンツであり、例えば、スマートスピーカ１の音声出力器１１または端末装置３の音声出力器から出力される。なお、音声広告テーブル７３の「音声広告」は、音声広告のコンテンツそのものであるが、音声広告のコンテンツの格納場所を示す情報であってもよい。 The “voice advertisement” is the content of the voice advertisement, and is output from, for example, the voice output device 11 of the smart speaker 1 or the voice output device of the terminal device 3. The "voice advertisement" in the voice advertisement table 73 is the content of the voice advertisement itself, but may be information indicating a storage location of the content of the voice advertisement.

「出力回数」は、音声広告がユーザＵに提供された回数を示す情報であり、例えば、音声広告がスマートスピーカ１や端末装置３へ出力される度に制御部４３によってインクリメントされる。「受容回数」は、音声広告がユーザＵに受容された回数であり、例えば、音声広告がユーザＵに受容される度に制御部４３によってインクリメントされる。「受容率」は、出力回数に対する受容回数の割合であり、例えば、制御部４３によって演算される。 The “number of outputs” is information indicating the number of times the voice advertisement is provided to the user U, and is incremented by the control unit 43 each time the voice advertisement is output to the smart speaker 1 or the terminal device 3, for example. The “number of acceptances” is the number of times the voice advertisement is received by the user U, and is incremented by the control unit 43 each time the voice advertisement is received by the user U, for example. The "acceptance rate" is the ratio of the number of acceptances to the number of outputs, and is calculated by, for example, the control unit 43.

例えば、図８に示す音声広告テーブル７３において、広告ＩＤ「Ａ１０１」の音声広告は、出力回数が２９８１７回で、受容回数が８２７回で、受容率が０．０２７８であることを示している。また、広告ＩＤ「Ａ１０２」の音声広告は、出力回数が８３７２回で、受容回数が３５２回で、受容率が０．０４２０であることを示している。 For example, in the voice advertisement table 73 shown in FIG. 8, the voice advertisement of the advertisement ID “A101” shows that the number of outputs is 29817, the number of acceptances is 827, and the acceptance rate is 0.0278. Further, the voice advertisement of the advertisement ID "A102" shows that the number of outputs is 8372, the number of acceptances is 352, and the acceptance rate is 0.0420.
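The bookkeeping for one row of the voice advertisement table (output count, acceptance count, and acceptance rate as their ratio) can be sketched as below. The class and method names are hypothetical, and returning 0.0 when nothing has been output yet is an assumption of this sketch, since the description does not cover that case.

```python
from dataclasses import dataclass


@dataclass
class VoiceAdStats:
    """Illustrative stand-in for one row of the voice advertisement table."""
    ad_id: str
    output_count: int = 0
    acceptance_count: int = 0

    def record_output(self) -> None:
        # Incremented each time the voice advertisement is output.
        self.output_count += 1

    def record_acceptance(self) -> None:
        # Incremented each time the user accepts the voice advertisement.
        self.acceptance_count += 1

    @property
    def acceptance_rate(self) -> float:
        # Ratio of acceptances to outputs; 0.0 before any output (assumption).
        if self.output_count == 0:
            return 0.0
        return self.acceptance_count / self.output_count
```

With the figures given for advertisement ID "A102" (8372 outputs, 352 acceptances), `acceptance_rate` evaluates to roughly 0.0420, in line with the value in the table.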

図５に示すユーザ情報記憶部５４は、ユーザＵの情報を記憶する。図９は、実施形態に係るユーザ情報記憶部５４に記憶されるユーザ情報テーブルの一例を示す図である。図９に示すユーザ情報テーブル７４は、「ユーザＩＤ」と、「ユーザ属性」と、「機器ＩＤ」と、「機器アドレス」とが互いに関連付けられた情報である。 The user information storage unit 54 shown in FIG. 5 stores the information of the user U. FIG. 9 is a diagram showing an example of a user information table stored in the user information storage unit 54 according to the embodiment. The user information table 74 shown in FIG. 9 is information in which a "user ID", a "user attribute", a "device ID", and a "device address" are associated with each other.

「ユーザＩＤ」は、ユーザＵ毎に固有の識別情報である。「ユーザ属性」は、ユーザＵの属性を示す情報である。ユーザＵの属性は、例えば、性別、および年齢の他、住所、職業などのデモグラフィック属性であるが、ユーザＵの嗜好などを示すサイコグラフィック属性を含んでもよい。「機器ＩＤ」は、ユーザＵが所有するスマートスピーカ１に固有の識別情報、およびユーザＵが所有する端末装置３に固有の識別情報を含む。「機器アドレス」は、ユーザＵが所有するスマートスピーカ１または端末装置３のネットワーク６上のアドレスである。 The "user ID" is identification information unique to each user U. The "user attribute" is information indicating the attribute of the user U. The attributes of the user U are, for example, demographic attributes such as address and occupation in addition to gender and age, but may include psychographic attributes indicating the preferences of the user U and the like. The "device ID" includes identification information unique to the smart speaker 1 owned by the user U and identification information unique to the terminal device 3 owned by the user U. The “device address” is an address on the network 6 of the smart speaker 1 or the terminal device 3 owned by the user U.

出力態様判定情報記憶部５６は、スマートスピーカ１からコンテンツの出力要求があった場合に、出力要求の対象となるコンテンツの出力態様を決定するための出力態様判定情報を含む。出力態様には、上述したように、コンテンツの出力種別、およびコンテンツの出力先の少なくとも一つが含まれる。 The output mode determination information storage unit 56 includes output mode determination information for determining the output mode of the content subject to the output request when the smart speaker 1 requests the output of the content. As described above, the output mode includes at least one of the content output type and the content output destination.

出力態様判定情報は、例えば、ユーザＵに関するコンテキストと各出力態様との関係を規定する情報であり、モデルまたはテーブルを含む。出力態様判定情報に含まれるテーブルは、ユーザＵに関するコンテキストと各出力態様との関係を規定するテーブルである。また、出力態様判定情報に含まれるモデルは、学習部６４による学習によって生成されるモデルである。 The output mode determination information is, for example, information that defines the relationship between the context regarding the user U and each output mode, and includes a model or a table. The table included in the output mode determination information is a table that defines the relationship between the context regarding the user U and each output mode. Further, the model included in the output mode determination information is a model generated by learning by the learning unit 64.

図５に示す制御部４３は、情報取得部６１と、情報出力部６２と、コンテキスト取得部６３と、学習部６４と、出力態様決定部６５と、検出部６６と、判定部６７と、広告効果更新部６８とを備える。 The control unit 43 shown in FIG. 5 includes an information acquisition unit 61, an information output unit 62, a context acquisition unit 63, a learning unit 64, an output mode determination unit 65, a detection unit 66, a determination unit 67, and an advertisement. It is provided with an effect updating unit 68.

情報取得部６１は、スマートスピーカ１から送信される情報を取得する。例えば、情報取得部６１は、スマートスピーカ１からユーザＵの指示を示す入力情報（例えば、発話情報、ジェスチャーによる操作内容を示す情報）を取得することができる。また、情報取得部６１は、例えば、ユーザＵの撮像画像を示す撮像情報をスマートスピーカ１、端末装置３、機器４、またはセンサ装置５から取得することができる。 The information acquisition unit 61 acquires the information transmitted from the smart speaker 1. For example, the information acquisition unit 61 can acquire input information (for example, utterance information, information indicating the operation content by gesture) indicating the instruction of the user U from the smart speaker 1. Further, the information acquisition unit 61 can acquire, for example, imaging information indicating the captured image of the user U from the smart speaker 1, the terminal device 3, the device 4, or the sensor device 5.

また、情報取得部６１は、ユーザＵの周囲に存在する１以上の機器（例えば、スマートスピーカ１、端末装置３、機器４など）への操作履歴を示す操作履歴情報をスマートスピーカ１、端末装置３、または機器４から取得することができる。 Further, the information acquisition unit 61 provides operation history information indicating the operation history to one or more devices (for example, smart speaker 1, terminal device 3, device 4, etc.) existing around the user U to the smart speaker 1, terminal device. It can be obtained from 3 or the device 4.

情報出力部６２は、出力態様決定部６５によって決定される出力態様に基づいて、ユーザＵの入力情報に応じたコンテンツ（聴覚的出力用コンテンツ）をコンテンツ記憶部５２から取得する。例えば、情報出力部６２は、出力態様決定部６５によって決定される出力種別が聴覚的出力である場合、ユーザＵの入力情報に応じたコンテンツであって音声のコンテンツをコンテンツ記憶部５２から取得する。 The information output unit 62 acquires the content (auditory output content) corresponding to the input information of the user U from the content storage unit 52 based on the output mode determined by the output mode determination unit 65. For example, when the output type determined by the output mode determination unit 65 is auditory output, the information output unit 62 acquires, from the content storage unit 52, audio content that corresponds to the input information of the user U.

また、情報出力部６２は、出力態様決定部６５によって決定される出力種別が視覚的出力である場合、ユーザＵの入力情報に応じたコンテンツであって文字または画像のコンテンツ（視覚的出力用コンテンツ）をコンテンツ記憶部５２のコンテンツテーブル７２から取得する。また、情報出力部６２は、出力態様決定部６５によって決定される出力種別が聴覚的出力および視覚的出力である場合、ユーザＵの入力情報に応じたコンテンツであって音声および文字（または画像）を含むコンテンツをコンテンツ記憶部５２のコンテンツテーブル７２から取得する。 Further, when the output type determined by the output mode determination unit 65 is visual output, the information output unit 62 acquires, from the content table 72 of the content storage unit 52, character or image content (visual output content) that corresponds to the input information of the user U. When the output type determined by the output mode determination unit 65 is both auditory output and visual output, the information output unit 62 acquires, from the content table 72 of the content storage unit 52, content that corresponds to the input information of the user U and includes both audio and characters (or images).

情報出力部６２は、出力態様決定部６５によって決定される出力態様に基づいて、記憶部４２から取得したコンテンツをスマートスピーカ１および端末装置３の少なくとも一つに出力する。情報出力部６２は、出力態様決定部６５によって決定される出力先の機器アドレスを記憶部４２のユーザ情報テーブル７４から取得し、取得した機器アドレス宛にコンテンツを送信する。 The information output unit 62 outputs the content acquired from the storage unit 42 to at least one of the smart speaker 1 and the terminal device 3 based on the output mode determined by the output mode determination unit 65. The information output unit 62 acquires the device address of the output destination determined by the output mode determination unit 65 from the user information table 74 of the storage unit 42, and transmits the content to the acquired device address.

例えば、情報出力部６２は、出力態様決定部６５によって決定される出力先がスマートスピーカ１である場合、記憶部４２から取得したコンテンツをスマートスピーカ１の機器アドレス宛へ送信する。また、情報出力部６２は、出力態様決定部６５によって決定される出力先が端末装置３である場合、記憶部４２から取得したコンテンツを端末装置３の機器アドレス宛へ送信する。 For example, when the output destination determined by the output mode determination unit 65 is the smart speaker 1, the information output unit 62 transmits the content acquired from the storage unit 42 to the device address of the smart speaker 1. Further, when the output destination determined by the output mode determination unit 65 is the terminal device 3, the information output unit 62 transmits the content acquired from the storage unit 42 to the device address of the terminal device 3.
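The address lookup that precedes transmission can be sketched as a simple table query. The nested-dictionary layout, the destination keys, and the function name below are hypothetical; the description only states that the device address of the determined output destination is read from the user information table 74.

```python
from typing import Dict, List

# Hypothetical stand-in for the user information table:
# user ID -> {destination kind -> device address on the network}.
UserTable = Dict[str, Dict[str, str]]


def resolve_addresses(
    table: UserTable, user_id: str, destinations: List[str]
) -> List[str]:
    """Return the device address for each output destination chosen for the user.
    Raises KeyError if the user or a destination is not registered."""
    devices = table[user_id]
    return [devices[d] for d in destinations]
```

Because the output mode may name the smart speaker, the terminal device, or both, the function accepts a list of destinations and returns one address per destination.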

コンテキスト取得部６３は、ユーザＵに関するコンテキスト情報を取得するコンテンツ取得処理を実行する。コンテキスト取得部６３は、情報取得部６１で取得した情報からコンテキスト情報を取得することができる。コンテキスト取得部６３によって取得されるコンテキスト情報は、コンテキスト取得部３５によって取得されるコンテキスト情報と同じであるが、コンテキスト取得部３５によって取得されるコンテキスト情報と一部または全部が異なる情報であってもよい。 The context acquisition unit 63 executes a content acquisition process for acquiring context information regarding the user U. The context acquisition unit 63 can acquire context information from the information acquired by the information acquisition unit 61. The context information acquired by the context acquisition unit 63 is the same as the context information acquired by the context acquisition unit 35, but it may instead differ from that information in part or in whole.

なお、コンテキスト取得部６３が取得するコンテキスト情報には、ユーザＵの指示の種別を示す入力種別情報が含まれる。入力種別情報は、例えば、ユーザＵの指示が音声、ジェスチャー、および口の動きのいずれであるかを示す情報である。なお、コンテキスト取得部６３は、スマートスピーカ１、端末装置３、機器４、またはセンサ装置５からユーザＵの撮像画像が情報提供装置２へ送信される場合、ユーザＵの撮像画像からユーザＵのジェスチャーや口の動きを判定することで、入力種別情報を取得することもできる。 The context information acquired by the context acquisition unit 63 includes input type information indicating the type of instruction given by the user U. The input type information is, for example, information indicating whether the instruction of the user U is given by voice, by gesture, or by mouth movement. When a captured image of the user U is transmitted to the information providing device 2 from the smart speaker 1, the terminal device 3, the device 4, or the sensor device 5, the context acquisition unit 63 can also acquire the input type information by determining the gesture or mouth movement of the user U from that captured image.

学習部６４は、コンテキスト取得部６３によって取得されるユーザＵ毎のコンテキスト情報に基づいて、発話テーブル７１のタイミング判定モデルをユーザＵ毎且つコンテンツ毎に生成することができる。かかるタイミング判定モデルは、上述したように、発話に対する過去のユーザＵの反応と発話時の過去のユーザＵに関するコンテキスト情報とに基づいて生成されるモデルである。 The learning unit 64 can generate a timing determination model for the utterance table 71 for each user U and each content based on the context information for each user U acquired by the context acquisition unit 63. As described above, such a timing determination model is a model generated based on the reaction of the past user U to the utterance and the context information regarding the past user U at the time of the utterance.

学習部６４は、スマートスピーカ１からの自発的な発話を開始してから予め設定された期間においてコンテキスト情報に含まれるユーザＵの動作またはユーザＵの発話がスマートスピーカ１からの自発的な発話に対して肯定的な反応であるか否かを教師データとする。 The learning unit 64 uses, as teacher data, whether or not the action or utterance of the user U included in the context information during a preset period after the start of a spontaneous utterance from the smart speaker 1 is a positive reaction to that spontaneous utterance.

例えば、学習部６４は、スマートスピーカ１の自発的発話に対するユーザＵの発話が肯定的である場合、スマートスピーカ１の自発的発話に対して肯定的な反応であると判定する。例えば、学習部６４は、「今日のニュースはいかがですか？」などの自発的発話に対して、ユーザＵの発話が例えば「よろしく」、「うん」などである場合、肯定的な反応であると判定することができる。 For example, when the user U's utterance in response to a spontaneous utterance of the smart speaker 1 is positive, the learning unit 64 determines that the reaction to the spontaneous utterance is positive. For example, when the user U replies to a spontaneous utterance such as "How about today's news?" with "Yes, please" or "Yeah", the learning unit 64 can determine that the reaction is positive.

また、学習部６４は、スマートスピーカ１の自発的発話に対するユーザＵの発話がない場合、またはスマートスピーカ１の自発的発話に対するユーザＵの発話が否定的である場合、スマートスピーカ１の自発的発話に対して肯定的な反応ではないと判定する。例えば、学習部６４は、「今日のニュースはいかがですか？」などの自発的発話に対して、ユーザＵの発話がない場合、またはユーザＵの発話が例えば「いらない」、「やめて」などである場合、肯定的な反応ではないと判定することができる。 Further, when the user U makes no utterance in response to a spontaneous utterance of the smart speaker 1, or when the user U's utterance in response is negative, the learning unit 64 determines that the reaction to the spontaneous utterance is not positive. For example, when there is no utterance from the user U in response to a spontaneous utterance such as "How about today's news?", or when the user U's utterance is, for example, "I don't need it" or "Stop", the learning unit 64 can determine that the reaction is not positive.

なお、自発的発話に対するユーザＵの反応が肯定的であるか否かは、上述した例に限定されない。例えば、学習部６４は、ユーザＵが頷いた場合に、自発的発話に対して肯定的な反応であると判定することができる。また、学習部６４は、ユーザＵがスマートスピーカ１から遠ざかった場合に、自発的発話に対して肯定的な反応ではないと判定することができる。 Note that the way of determining whether or not the user U's reaction to a spontaneous utterance is positive is not limited to the examples described above. For example, the learning unit 64 can determine that the reaction to the spontaneous utterance is positive when the user U nods. Further, the learning unit 64 can determine that the reaction to the spontaneous utterance is not positive when the user U moves away from the smart speaker 1.
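The labeling of reactions described above can be sketched as follows. This is only a minimal illustration in Python, not the implementation of the embodiment: the function name `label_reaction`, the phrase sets, and the precedence between a nod and moving away are all assumptions introduced here.

```python
from typing import Optional

# Hypothetical phrase lists built from the examples in the description.
POSITIVE_PHRASES = {"よろしく", "うん"}
NEGATIVE_PHRASES = {"いらない", "やめて"}


def label_reaction(utterance: Optional[str],
                   nodded: bool = False,
                   moved_away: bool = False) -> int:
    """Return 1 for a positive reaction and 0 otherwise (teacher label)."""
    if moved_away:
        # Moving away from the smart speaker: not positive.
        # Checking this before the nod is an assumption of this sketch.
        return 0
    if nodded:
        # A nod counts as a positive reaction.
        return 1
    if utterance is None:
        # No utterance within the preset period: not positive.
        return 0
    text = utterance.strip()
    if text in NEGATIVE_PHRASES:
        return 0
    # Anything not recognized as positive is treated as not positive.
    return 1 if text in POSITIVE_PHRASES else 0
```

Labels produced this way would serve as the teacher data, paired with context features observed in the same period.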

学習部６４は、上述のように自発的発話に対して肯定的な反応であるか否かを教師データとし、自発的発話を開始してから予め設定された期間においてコンテキスト情報に含まれる１以上の情報を特徴量として機械学習を行ってタイミング判定モデルを生成および更新する。 As described above, the learning unit 64 uses whether or not the reaction to the spontaneous utterance is positive as teacher data, uses one or more pieces of information included in the context information during a preset period after the start of the spontaneous utterance as feature quantities, and performs machine learning to generate and update the timing determination model.

なお、タイミング判定モデルは、上述した例に限定されるものではなく、例えば、ＳＶＭ（Support Vector Machine）やその他の機械学習法を用いて生成されるモデルであってもよい。また、タイミング判定モデルの生成は、深層学習（ディープラーニング）の技術を用いて行われてもよい。例えば、タイミング判定モデルの生成は、ＤＮＮ（Deep Neural Network）やＲＮＮ（Recurrent Neural Network）やＣＮＮ（Convolutional Neural Network）等の種々のディープラーニングの技術を適宜用いて行われてもよい。 The timing determination model is not limited to the above-mentioned example, and may be, for example, a model generated by using SVM (Support Vector Machine) or another machine learning method. Further, the timing determination model may be generated by using a technique of deep learning. For example, the timing determination model may be generated by appropriately using various deep learning techniques such as DNN (Deep Neural Network), RNN (Recurrent Neural Network), and CNN (Convolutional Neural Network).

学習部６４は、生成したタイミング判定モデルを発話テーブル７１に設定する。また、学習部６４は、発話テーブル７１に設定されたタイミング判定モデルを、新たに取得される発話に対するユーザＵの反応とコンテキスト情報とに基づいてタイミング判定モデルを更新することができる。例えば、学習部６４は、情報提供装置２の処理負荷が少ない時間帯（例えば、深夜）などにタイミング判定モデルを更新することができる。 The learning unit 64 sets the generated timing determination model in the utterance table 71. Further, the learning unit 64 can update the timing determination model set in the utterance table 71 based on the reaction of the user U to the newly acquired utterance and the context information. For example, the learning unit 64 can update the timing determination model in a time zone (for example, midnight) when the processing load of the information providing device 2 is small.

また、学習部６４は、複数のユーザＵに共通のタイミング判定モデル（以下、共通判定モデルと記載する場合がある）をコンテンツ毎または特定のコンテンツについて生成することができる。この場合、学習部６４は、発話に対する過去の複数のユーザＵの反応と発話時の過去の複数のユーザＵに関するコンテキスト情報とに基づいて、共通判定モデルを生成することができる。 Further, the learning unit 64 can generate a timing determination model common to a plurality of users U (hereinafter, may be referred to as a common determination model) for each content or for a specific content. In this case, the learning unit 64 can generate a common determination model based on the reactions of the plurality of past users U to the utterance and the context information regarding the past plurality of users U at the time of the utterance.

また、学習部６４は、共通判定モデルをコンテンツ毎に生成した後、かかる共通判定モデルをベースにして新たに取得される発話に対する各ユーザＵの反応とコンテキスト情報とに基づいて、各ユーザＵに固有のタイミング判定モデルを生成することもできる。 Further, after generating a common determination model for each content, the learning unit 64 can also generate a timing determination model specific to each user U, using the common determination model as a base, based on each user U's newly acquired reactions to utterances and context information.

また、学習部６４は、複数のコンテンツに共通かつ複数のユーザＵに共通のタイミング判定モデルを生成することもできる。この場合、学習部６４は、任意の発話に対する過去の複数のユーザＵの反応と任意の発話時の過去の複数のユーザＵに関するコンテキスト情報とに基づいて、複数のユーザＵに共通のタイミング判定モデルを生成することができる。 Further, the learning unit 64 can also generate a timing determination model that is common to a plurality of contents and common to a plurality of users U. In this case, the learning unit 64 can generate the timing determination model common to the plurality of users U based on the past reactions of the plurality of users U to arbitrary utterances and the context information regarding the plurality of users U at the time of those utterances.

また、学習部６４は、コンテンツに対する過去のユーザＵの反応と過去のユーザＵに関するコンテキスト情報とに基づいて出力態様毎に出力態様判定モデルを生成することができる。例えば、ユーザＵの反応が否定的な反応であるか否かまたは肯定的な反応であるか否かを教師データとし、ユーザＵの反応時のコンテキスト情報を特徴量として機械学習を行うことができる。 Further, the learning unit 64 can generate an output mode determination model for each output mode based on the past reactions of the user U to content and the past context information regarding the user U. For example, the learning unit 64 can perform machine learning using, as teacher data, whether or not the user U's reaction is negative or whether or not it is positive, and using the context information at the time of the user U's reaction as feature quantities.

否定的な反応は、例えば、スマートスピーカ１からコンテンツが音声として出力された場合におけるユーザＵの否定的な発話（例えば、「いらない」や「やめて」など）である。また、否定的な反応は、例えば、端末装置３からコンテンツが文字または画像として出力された場合におけるユーザＵの端末装置３に対する非操作である。 The negative reaction is, for example, a negative utterance of the user U (for example, "don't need" or "stop") when the content is output as voice from the smart speaker 1. Further, the negative reaction is, for example, a non-operation of the user U with respect to the terminal device 3 when the content is output as a character or an image from the terminal device 3.

また、肯定的な反応は、例えば、スマートスピーカ１からコンテンツが音声として出力された場合におけるユーザＵの否定的な発話がない状態である。肯定的な反応は、例えば、端末装置３からコンテンツが文字または画像として出力された場合における端末装置３に対する操作である。 Further, the positive reaction is, for example, a state in which there is no negative utterance of the user U when the content is output as voice from the smart speaker 1. The positive reaction is, for example, an operation on the terminal device 3 when the content is output as a character or an image from the terminal device 3.

出力態様決定部６５は、コンテンツの出力要求があった場合、コンテキスト取得部６３によって取得されたコンテキスト情報に基づいて、ユーザＵに提供されるコンテンツの出力態様を決定する。出力態様には出力種別および出力先が含まれるが、出力態様決定部６５は、出力種別および出力先の一方のみを決定することもできる。出力態様決定部６５による出力態様の決定は、出力態様判定情報記憶部５６に記憶された上述の出力態様判定情報を用いて行われる。 When there is a content output request, the output mode determination unit 65 determines the output mode of the content provided to the user U based on the context information acquired by the context acquisition unit 63. The output mode includes an output type and an output destination, but the output mode determining unit 65 can also determine only one of the output type and the output destination. The output mode determination unit 65 determines the output mode using the above-mentioned output mode determination information stored in the output mode determination information storage unit 56.

出力態様決定部６５は、コンテキスト情報に含まれるユーザＵの状況を示す状況情報に基づいて、出力態様を決定することができる。例えば、出力態様決定部６５は、ユーザＵが移動中である場合、出力種別を聴覚的出力とし且つ出力先をスマートスピーカ１とする態様を、コンテンツの出力態様として決定することができる。これにより、ユーザＵは移動しながら端末装置３の画面を見ることなくコンテンツを把握することができる。 The output mode determination unit 65 can determine the output mode based on the status information indicating the status of the user U included in the context information. For example, when the user U is moving, the output mode determination unit 65 can determine a mode in which the output type is auditory output and the output destination is the smart speaker 1 as the content output mode. As a result, the user U can grasp the content without looking at the screen of the terminal device 3 while moving.

また、出力態様決定部６５は、スマートスピーカ１および端末装置３のうちユーザＵの現在位置に近い機器を出力先とすることができ、これにより、ユーザＵによるコンテンツの把握を容易にすることができる。 Further, the output mode determination unit 65 can set, as the output destination, whichever of the smart speaker 1 and the terminal device 3 is closer to the current position of the user U, which makes it easier for the user U to grasp the content.

また、出力態様決定部６５は、ユーザＵが会話中である場合、出力種別を視覚的出力とし且つ出力先を端末装置３とする態様を、コンテンツの出力態様として決定することができる。これにより、ユーザＵが会話を中断することなくコンテンツを把握することができる。 Further, when the user U is in a conversation, the output mode determining unit 65 can determine a mode in which the output type is visual output and the output destination is the terminal device 3 as the content output mode. As a result, the user U can grasp the content without interrupting the conversation.

また、出力態様決定部６５は、コンテキスト情報に含まれるユーザＵの周囲の状況を示す周囲情報に基づいて、出力態様を決定することができる。例えば、出力態様決定部６５は、ユーザＵの周囲に他人が存在する場合に、出力種別を視覚的出力とし且つ出力先を端末装置３とする態様を、コンテンツの出力態様として決定することができる。これにより、例えば、コンテンツがユーザＵのスケジュールやユーザＵへのメールである場合に、スマートスピーカ１からコンテンツで音声出力されないため、ユーザＵのスケジュールやメールを他人に知られることを防止することができる。 Further, the output mode determination unit 65 can determine the output mode based on surrounding information, included in the context information, that indicates the situation around the user U. For example, when another person is present near the user U, the output mode determination unit 65 can determine, as the output mode of the content, a mode in which the output type is visual output and the output destination is the terminal device 3. As a result, for example, when the content is the schedule of the user U or mail addressed to the user U, the content is not output as voice from the smart speaker 1, so the schedule or mail of the user U can be prevented from becoming known to others.

出力態様決定部６５は、ユーザＵの周囲に存在する他人が寝ている場合に、出力種別を視覚的出力とし且つ出力先を端末装置３とする態様を、コンテンツの出力態様として決定することができる。これにより、例えば、ユーザＵの周囲で寝ている他人をコンテンツの出力によって起こしてしまうといった事態を回避することができる。 When another person near the user U is sleeping, the output mode determination unit 65 can determine, as the output mode of the content, a mode in which the output type is visual output and the output destination is the terminal device 3. As a result, for example, it is possible to avoid waking another person sleeping near the user U by outputting the content.

また、出力態様決定部６５は、ユーザＵの周囲の音が大きい場合に、出力種別を視覚的出力とし且つ出力先を端末装置３とする態様を、コンテンツの出力態様として決定することができる。これにより、ユーザＵの周囲の騒音や機器４の発する音でコンテンツが把握できなくなるといった事態を回避することができる。 Further, the output mode determination unit 65 can determine a mode in which the output type is visual output and the output destination is the terminal device 3 as the content output mode when the sound around the user U is loud. As a result, it is possible to avoid a situation in which the content cannot be grasped due to the noise around the user U or the sound emitted by the device 4.

また、出力態様決定部６５は、ユーザＵの周囲に他人が存在しない場合や、ユーザＵの周囲の音が小さい場合、出力種別を聴覚的出力とし且つ出力先をスマートスピーカ１とする態様を、コンテンツの出力態様として決定することができる。これにより、ユーザＵは端末装置３の画面を見ることなく、コンテンツを把握することができる。 Further, the output mode determining unit 65 sets the output type to auditory output and the output destination to the smart speaker 1 when there is no other person around the user U or when the sound around the user U is small. It can be determined as the output mode of the content. As a result, the user U can grasp the content without looking at the screen of the terminal device 3.
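The situation-based rules described above can be summarized in a small rule-based sketch. The flag names, the function name `mode_from_context`, and the choice to let the visual-output conditions take precedence when several conditions hold at once are assumptions made for illustration; the description does not specify a priority among the rules.

```python
from typing import Dict, Tuple

# (output type, output destination)
AUDITORY = ("auditory", "smart_speaker")
VISUAL = ("visual", "terminal_device")

# Situations for which the description chooses visual output on the terminal.
VISUAL_FLAGS = ("in_conversation", "others_present",
                "others_sleeping", "loud_surroundings")


def mode_from_context(ctx: Dict[str, bool]) -> Tuple[str, str]:
    """Return (output type, output destination) for the given context flags."""
    if any(ctx.get(flag, False) for flag in VISUAL_FLAGS):
        return VISUAL
    # A moving user, or quiet surroundings with nobody else nearby,
    # leads to auditory output from the smart speaker.
    return AUDITORY
```

For instance, `mode_from_context({"moving": True})` yields auditory output, while adding `"others_present": True` switches the result to visual output under the precedence assumed here.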

また、出力態様決定部６５は、スマートスピーカ１への入力種別がジェスチャーまたは口の動きである場合、出力種別を視覚的出力とし且つ出力先を端末装置３とする態様を、コンテンツの出力態様として決定することができる。また、出力態様決定部６５は、スマートスピーカ１への発話（有音発話）である場合、出力種別を聴覚的出力とし且つ出力先をスマートスピーカ１とする態様を、コンテンツの出力態様として決定することができる。 Further, when the input type to the smart speaker 1 is a gesture or mouth movement, the output mode determination unit 65 can determine, as the output mode of the content, a mode in which the output type is visual output and the output destination is the terminal device 3. When the input to the smart speaker 1 is an utterance (a voiced utterance), the output mode determination unit 65 can determine, as the output mode of the content, a mode in which the output type is auditory output and the output destination is the smart speaker 1.

これにより、ユーザＵは、所望の出力態様に応じた入力種別でスマートスピーカ１へ入力することができ、ユーザＵは、スマートスピーカ１または端末装置３から所望の出力態様でコンテンツを確認することができる。出力態様決定部６５が出力態様判定モデルに基づいて入力種別に応じた出力態様を決定する場合、出力態様判定モデルは、例えば、入力種別を特徴量としての重みを大きくしたり、特徴量を入力種別のみとしたりすることで、出力態様決定部６５は、入力種別に応じた出力態様を決定することができる。なお、出力態様決定部６５は、入力種別と出力態様とが入力種別毎に対応付けられた出力態様判定テーブルに基づいて、入力種別に応じた出力態様を決定することもできる。 As a result, the user U can give input to the smart speaker 1 using the input type that corresponds to the desired output mode, and can check the content from the smart speaker 1 or the terminal device 3 in that output mode. When the output mode determination unit 65 determines the output mode according to the input type based on the output mode determination model, the output mode determination model is configured, for example, to give the input type a large weight as a feature quantity or to use the input type as the only feature quantity, so that the output mode determination unit 65 can determine the output mode according to the input type. The output mode determination unit 65 can also determine the output mode according to the input type based on an output mode determination table in which an output mode is associated with each input type.
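An output mode determination table of the kind mentioned above could look like the following sketch. The key strings, the `OutputMode` type, and the function name `mode_for_input_type` are hypothetical; only the mapping itself (voiced utterance to auditory output on the smart speaker, gesture or mouth movement to visual output on the terminal device) follows the description.

```python
from typing import NamedTuple


class OutputMode(NamedTuple):
    output_type: str   # "auditory" or "visual"
    destination: str   # "smart_speaker" or "terminal_device"


# One output mode per input type, as in an output mode determination table.
OUTPUT_MODE_TABLE = {
    "voice": OutputMode("auditory", "smart_speaker"),
    "gesture": OutputMode("visual", "terminal_device"),
    "mouth_movement": OutputMode("visual", "terminal_device"),
}


def mode_for_input_type(input_type: str) -> OutputMode:
    """Look up the output mode associated with the given input type."""
    return OUTPUT_MODE_TABLE[input_type]
```

A plain lookup table keeps the behavior predictable for the user, which matches the stated aim of letting the user choose the output mode through the way the input is given.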

また、出力態様決定部６５は、出力態様判定情報として出力態様毎の出力態様判定モデルを含む場合、出力態様毎の出力態様判定モデルにコンテキスト情報を入力する。出力態様決定部６５は、出力態様毎の出力態様判定モデルの出力に基づいて、コンテンツの出力態様を決定する。 Further, when the output mode determination unit 65 includes the output mode determination model for each output mode as the output mode determination information, the output mode determination unit 65 inputs the context information to the output mode determination model for each output mode. The output mode determination unit 65 determines the output mode of the content based on the output of the output mode determination model for each output mode.

例えば、ユーザＵの反応が否定的な反応であるか否かを教師データとして出力態様判定モデルが生成される場合、出力態様決定部６５は、出力するスコアが最も低い出力態様判定モデルに対応する出力態様を、コンテンツの出力態様として決定することができる。また、ユーザＵの反応が肯定的な反応であるか否かを教師データとして出力態様判定モデルが生成される場合、出力態様決定部６５は、出力するスコアが最も高い出力態様判定モデルに対応する出力態様を、コンテンツの出力態様として決定することができる。 For example, when the output mode determination model is generated using whether or not the reaction of the user U is a negative reaction as teacher data, the output mode determination unit 65 corresponds to the output mode determination model having the lowest output score. The output mode can be determined as the output mode of the content. Further, when the output mode determination model is generated using whether or not the reaction of the user U is a positive reaction as teacher data, the output mode determination unit 65 corresponds to the output mode determination model having the highest output score. The output mode can be determined as the output mode of the content.
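The selection rule described above can be sketched as follows, assuming each output mode determination model has already produced a score for the current context. The function name `select_output_mode` and the `trained_on` parameter are assumptions of this sketch.

```python
from typing import Dict


def select_output_mode(scores: Dict[str, float], trained_on: str) -> str:
    """Pick an output mode from per-mode model scores.

    trained_on is "negative" when the models were trained on whether the
    reaction was negative (lowest score wins), and "positive" when they were
    trained on whether the reaction was positive (highest score wins).
    """
    if not scores:
        raise ValueError("at least one output mode score is required")
    if trained_on == "negative":
        return min(scores, key=scores.get)
    if trained_on == "positive":
        return max(scores, key=scores.get)
    raise ValueError("trained_on must be 'negative' or 'positive'")
```

With scores `{"auditory": 0.8, "visual": 0.3}`, models trained on negative reactions would lead to the visual mode, and models trained on positive reactions would lead to the auditory mode.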

このように、出力態様決定部６５は、ユーザＵの状況やユーザＵの周囲の状況に応じてコンテンツの出力態様を決定することができるため、ユーザＵへのコンテンツの提供を適切に行うことができる。 In this way, the output mode determination unit 65 can determine the output mode of the content according to the situation of the user U and the situation around the user U, so that the content can be provided to the user U appropriately.

検出部６６は、スマートスピーカ１の音声出力器１１または端末装置３の音声出力器（図示せず）から音声広告が出力された場合のユーザＵの振る舞いを検出する。検出部６６は、情報取得部６１によって取得される撮像情報を画像解析することで、音声広告が出力された場合のユーザＵの振る舞いを検出することができる。 The detection unit 66 detects the behavior of the user U when a voice advertisement is output from the voice output device 11 of the smart speaker 1 or the voice output device (not shown) of the terminal device 3. The detection unit 66 can detect the behavior of the user U when the voice advertisement is output by performing image analysis of the image pickup information acquired by the information acquisition unit 61.

例えば、検出部６６は、情報取得部６１によって取得される撮像情報に基づいて、ユーザＵの目線の動き、ユーザＵの頭部の動き、ユーザＵの口の動き、ユーザＵの手の動き、およびユーザＵの足の動きのうち少なくとも一つの身体的振る舞いをユーザＵの振る舞いとして検出することができる。 For example, based on the imaging information acquired by the information acquisition unit 61, the detection unit 66 can detect, as the behavior of the user U, at least one physical behavior among the movement of the user U's line of sight, the movement of the user U's head, the movement of the user U's mouth, the movement of the user U's hands, and the movement of the user U's feet.

また、検出部６６は、情報取得部６１によって取得される撮像情報に基づいて、ユーザＵが行っている作業の状態を検出することができる。例えば、検出部６６は、ユーザＵが食器洗い、ミシンでの縫製、および料理といった作業を中断したか否かを検出することができる。 Further, the detection unit 66 can detect the state of the work performed by the user U based on the imaging information acquired by the information acquisition unit 61. For example, the detection unit 66 can detect whether or not the user U has interrupted operations such as washing dishes, sewing with a sewing machine, and cooking.

また、検出部６６は、情報取得部６１によって取得される音情報を音響解析することで、音声広告が出力された場合のユーザＵの振る舞いを検出することができる。例えば、検出部６６は、音情報に基づいて、ユーザＵの会話における振る舞い、ユーザＵによるスマートスピーカ１への発話による問いかけ、ユーザＵが行っていた作業における振る舞い、およびユーザＵの機器４への振る舞いなどを検出することができる。 Further, the detection unit 66 can detect the behavior of the user U when a voice advertisement is output by acoustically analyzing the sound information acquired by the information acquisition unit 61. For example, based on the sound information, the detection unit 66 can detect the behavior of the user U in conversation, a spoken inquiry by the user U to the smart speaker 1, the behavior of the user U in the work the user U was performing, the behavior of the user U toward the device 4, and the like.

具体的には、検出部６６は、音情報に基づいて、会話中のユーザＵが発話を止める、および会話中のユーザＵが発話の音量を下げるといった振る舞いを検出することができる。また、検出部６６は、ユーザＵが情報を検索するための発話、およびユーザＵが情報を確認するための発話といった振る舞いを検出することができる。 Specifically, the detection unit 66 can detect behaviors such as the user U in conversation stopping the utterance and the user U in the conversation lowering the volume of the utterance based on the sound information. In addition, the detection unit 66 can detect behaviors such as an utterance for the user U to search for information and an utterance for the user U to confirm the information.

また、検出部６６は、音情報に基づいて、食器洗いや料理といった作業をユーザＵが中断したか否かを検出することができる。例えば、検出部６６は、水道の蛇口から水が吐出する音が消えた場合や食器を洗う音が消えた場合、食器洗いを中断したと判定することができる。 Further, the detection unit 66 can detect whether or not the user U has interrupted the work such as washing dishes and cooking based on the sound information. For example, the detection unit 66 can determine that the dishwashing is interrupted when the sound of water discharged from the faucet disappears or the sound of washing the dishes disappears.

また、ユーザＵが機器４をオフすることで機器４から出力される音である機器音が停止するため、検出部６６は、音情報に基づいて、ユーザＵが機器４をオフする振る舞いを行ったことを検出することができる。検出部６６が音情報に基づいてオフを検出する機器４は、例えば、電子レンジ、洗濯機、食器洗浄機、ミシン、テレビジョン受像機、ラジオ受信器などの比較的大きな音を発する機器である。 Further, when the user U turns off the device 4, the device sound, which is the sound output from the device 4, stops. The detection unit 66 can therefore detect, based on the sound information, that the user U has turned off the device 4. The devices 4 whose turning off the detection unit 66 detects based on sound information are, for example, devices that emit a relatively loud sound, such as a microwave oven, a washing machine, a dishwasher, a sewing machine, a television receiver, and a radio receiver.

また、検出部６６は、情報取得部６１によって取得された操作履歴情報に基づいて、スマートスピーカ１、端末装置３、または機器４へのユーザＵの振る舞いを検出することができる。例えば、検出部６６は、操作履歴情報に基づいて、ユーザＵが機器をオフしたりオンしたりする振る舞いおよびユーザＵが情報を検索する振る舞いなどを検出することができる。 Further, the detection unit 66 can detect the behavior of the user U to the smart speaker 1, the terminal device 3, or the device 4 based on the operation history information acquired by the information acquisition unit 61. For example, the detection unit 66 can detect the behavior of the user U turning the device off and on, the behavior of the user U searching for information, and the like based on the operation history information.

判定部６７は、検出部６６によって検出されたユーザＵの振る舞いに基づいて音声広告がユーザＵに受容されたか否かを判定する。例えば、判定部６７は、検出部６６によって検出されたユーザＵの身体的な振る舞いが特定の振る舞いである場合に、音声広告がユーザＵに受容されたと判定する。 The determination unit 67 determines whether or not the voice advertisement is accepted by the user U based on the behavior of the user U detected by the detection unit 66. For example, the determination unit 67 determines that the voice advertisement has been accepted by the user U when the physical behavior of the user U detected by the detection unit 66 is a specific behavior.

特定の振る舞いは、例えば、広告出力期間においてユーザＵが一定時間以上視線をスマートスピーカ１に向ける、音声広告の出力開始時に移動中のユーザＵが広告出力期間において一定時間以上移動を停止する、および広告出力期間においてユーザＵが頷くなどといった振る舞いである。また、特定の振る舞いは、音声広告の出力開始前に継続的に手が動いていたユーザＵが広告出力期間において一定時間以上手を止める、およびユーザＵが特定のジェスチャーをしたなどといった振る舞いである。 The specific behavior is, for example, the user U directing the line of sight at the smart speaker 1 for a certain time or longer during the advertisement output period, a user U who was moving when output of the voice advertisement started stopping for a certain time or longer during the advertisement output period, or the user U nodding during the advertisement output period. The specific behavior may also be a user U whose hands were moving continuously before output of the voice advertisement started keeping the hands still for a certain time or longer during the advertisement output period, or the user U making a specific gesture.

また、判定部６７は、検出部６６が音情報に基づいて検出したユーザＵの振る舞いが特定の振る舞いである場合に、音声広告がユーザＵに受容されたと判定する。特定の振る舞いは、例えば、会話中のユーザＵが発話を止める、会話中のユーザＵが発話の音量を下げる、ユーザＵが音声広告に関する発話をする、ユーザＵが作業を中断する、およびユーザＵが機器をオフするなどといった振る舞いである。なお、ユーザＵによる音声広告に関する発話は、例えば、「それで？」、「続きは？」などである。 Further, the determination unit 67 determines that the voice advertisement has been accepted by the user U when the behavior of the user U detected by the detection unit 66 based on the sound information is a specific behavior. The specific behavior is, for example, the user U in conversation stopping speaking, the user U in conversation lowering the volume of speech, the user U making an utterance about the voice advertisement, the user U interrupting work, or the user U turning off a device. Utterances by the user U about the voice advertisement are, for example, "And then?" and "What comes next?".

また、判定部６７は、検出部６６が操作履歴情報に基づいて検出したユーザＵの振る舞いが特定の振る舞いである場合に、音声広告がユーザＵに受容されたと判定する。特定の振る舞いは、例えば、ユーザＵが音声広告の広告対象の商品またはサービスに関する検索をする、およびユーザＵが機器をオフするなどといった振る舞いである。 Further, the determination unit 67 determines that the voice advertisement has been accepted by the user U when the behavior of the user U detected by the detection unit 66 based on the operation history information is a specific behavior. The specific behavior is, for example, a behavior in which the user U searches for a product or service to be advertised in a voice advertisement, and the user U turns off the device.

判定部６７は、ユーザＵの振る舞いが否定的な振る舞いである場合に、音声広告がユーザＵに受容されていないと判定することができる。否定的な振る舞いは、例えば、ユーザＵが否定的な発話をする、およびユーザＵが否定的な身体的振る舞いをするなどといった振る舞いである。 The determination unit 67 can determine that the voice advertisement is not accepted by the user U when the behavior of the user U is a negative behavior. Negative behavior is, for example, behavior such that the user U makes a negative utterance and the user U makes a negative physical behavior.

例えば、判定部６７は、ユーザＵが「やめて」、「聞きたくない」、および「嫌い」といった否定的な発話をした場合に、ユーザＵが否定的な振る舞いをしたと判定することができる。また、例えば、判定部６７は、ユーザＵが耳を手で塞いだ場合に、ユーザＵが否定的な振る舞いをしたと判定することができる。 For example, the determination unit 67 can determine that the user U has behaved negatively when the user U makes negative utterances such as "stop", "do not want to hear", and "dislike". Further, for example, the determination unit 67 can determine that the user U has behaved negatively when the user U closes his / her ear with his / her hand.

なお、判定部６７は、ユーザＵの振る舞いが特定の振る舞いでないと判定した場合、音声広告がユーザＵに受容されていないと判定することもできる。これにより、判定部６７はユーザＵが否定的な振る舞いをしたか否かを判定しなくてもよく、処理負荷が軽減される。 If the determination unit 67 determines that the behavior of the user U is not a specific behavior, it can also determine that the voice advertisement is not accepted by the user U. As a result, the determination unit 67 does not have to determine whether or not the user U has behaved negatively, and the processing load is reduced.

また、判定部６７は、ユーザＵの振る舞いが特定の振る舞いでも否定的な振る舞いでもないと判定した場合、ユーザＵによる音声広告の非受容度が低いと判定し、ユーザＵの振る舞いが否定的な振る舞いである場合、ユーザＵによる音声広告の非受容度が高いと判定することもできる。 Further, when the determination unit 67 determines that the behavior of the user U is neither a specific behavior nor a negative behavior, the determination unit 67 determines that the non-acceptance of the voice advertisement by the user U is low, and the behavior of the user U is negative. In the case of behavior, it can be determined that the non-acceptance of the voice advertisement by the user U is high.
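The acceptance determination with two non-acceptance levels, as described above, can be sketched as a simple classification of one detected behavior. The behavior labels, the `Judgement` enum, and the function name `judge` are hypothetical names chosen for illustration; the sets list only the examples given in the description.

```python
from enum import Enum


class Judgement(Enum):
    ACCEPTED = "accepted"
    NON_ACCEPTANCE_LOW = "non_acceptance_low"
    NON_ACCEPTANCE_HIGH = "non_acceptance_high"


# Hypothetical labels for the "specific" and "negative" behaviors.
SPECIFIC_BEHAVIORS = frozenset({
    "gaze_at_speaker", "stop_moving", "nod", "stop_hands", "specific_gesture",
    "stop_talking", "lower_voice", "ask_about_ad", "pause_work",
    "turn_off_device", "search_advertised_product",
})
NEGATIVE_BEHAVIORS = frozenset({"negative_utterance", "cover_ears"})


def judge(behavior: str) -> Judgement:
    """Classify one detected behavior during the advertisement output period."""
    if behavior in SPECIFIC_BEHAVIORS:
        return Judgement.ACCEPTED
    if behavior in NEGATIVE_BEHAVIORS:
        # Negative behavior: high degree of non-acceptance.
        return Judgement.NON_ACCEPTANCE_HIGH
    # Neither specific nor negative: low degree of non-acceptance.
    return Judgement.NON_ACCEPTANCE_LOW
```

In practice the two sets would depend on the attributes of the user U and on the advertisement, so they are better treated as configuration than as constants.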

また、判定部６７は、ユーザＵの振る舞いが特定の振る舞いである場合に、音声広告がユーザＵに受容されたと判定する処理を行わないこともできる。例えば、判定部６７は、ユーザＵの振る舞いが否定的な振る舞いである場合に、音声広告がユーザＵに受容されていないと判定し、ユーザＵの振る舞いが否定的な振る舞いではない場合に、音声広告がユーザＵに受容されていると判定することができる。 Further, the determination unit 67 may omit the process of determining that the voice advertisement has been accepted by the user U when the behavior of the user U is a specific behavior. For example, the determination unit 67 can determine that the voice advertisement is not accepted by the user U when the behavior of the user U is negative, and that the voice advertisement is accepted by the user U when the behavior of the user U is not negative.

上述した特定の振る舞いおよび否定的な振る舞いは、ユーザＵの属性に応じて設定される。ユーザＵの属性は、例えば、性別、年齢、住所、および職業の少なくとも一つを含む。例えば、判定部６７は、ユーザＵが子供である場合、ユーザＵが飛び跳ねる、およびユーザＵが踊り出すといった振る舞いをした場合、音声広告がユーザＵに受容されたと判定する。 The specific behaviors and negative behaviors described above are set according to the attributes of the user U. The attributes of the user U include, for example, at least one of gender, age, address, and occupation. For example, when the user U is a child and jumps up and down or starts dancing, the determination unit 67 determines that the voice advertisement has been accepted by the user U.

なお、判定部６７は、ユーザＵに受容されたと判定する基準とする特定の振る舞いを音声広告の時間的長さや種類に応じて変更することもできる。また、判定部６７は、一つの音声広告を出力している期間（例えば、３０秒）における所定期間（例えば、５秒）毎に、音声広告がユーザＵに受容されているか否かを判定することもできる。 Note that the determination unit 67 can also change the specific behavior used as the criterion for determining acceptance by the user U according to the duration or type of the voice advertisement. The determination unit 67 can also determine, at every predetermined interval (for example, 5 seconds) within the period during which one voice advertisement is output (for example, 30 seconds), whether or not the voice advertisement is being accepted by the user U.

広告効果更新部６８は、音声広告がユーザＵに受容されたと判定すると、音声広告テーブル７３において音声広告の受容回数を更新する。これにより、テキスト広告またはバナー広告のクリックに相当する広告効果を音声広告に対して得ることができる。そして、広告効果更新部６８は、音声広告の出力回数に対する音声広告の受容回数の割合である受容率を演算し、演算した受容率を音声広告効果として音声広告テーブル７３に設定することができる。これにより、音声広告においてＣＴＲに相当する広告効果指標を得ることができる。 When the advertisement effect updating unit 68 determines that the voice advertisement has been accepted by the user U, the advertisement effect updating unit 68 updates the number of times the voice advertisement is received in the voice advertisement table 73. As a result, it is possible to obtain an advertising effect equivalent to a click of a text advertisement or a banner advertisement for a voice advertisement. Then, the advertisement effect updating unit 68 can calculate the acceptance rate, which is the ratio of the number of times the voice advertisement is received to the number of times the voice advertisement is output, and set the calculated acceptance rate in the voice advertisement table 73 as the voice advertisement effect. As a result, it is possible to obtain an advertising effectiveness index corresponding to CTR in voice advertising.

また、広告効果更新部６８は、音声広告のコンバージョン回数を外部装置から取得することができる。音声広告のコンバージョン回数とは、商品やサービスの購入、サンプルの申し込み、およびパンフレットの申し込みといった音声広告の目的を達成した回数である。広告効果更新部６８は、音声広告の受容回数に対する音声広告のコンバージョン回数の割合であるコンバージョン率を演算し、演算したコンバージョン率を音声広告効果とすることができる。これにより、音声広告においてＣＶＲ（Conversion Rate）に相当する広告効果指標を得ることができる。 In addition, the advertisement effect updating unit 68 can acquire the number of conversions of the voice advertisement from the external device. The number of conversions of an audio advertisement is the number of times that the purpose of an audio advertisement such as purchase of goods or services, application for samples, and application for pamphlets is achieved. The advertisement effect updating unit 68 can calculate the conversion rate, which is the ratio of the number of conversions of the voice advertisement to the number of times the voice advertisement is received, and can use the calculated conversion rate as the voice advertisement effect. As a result, it is possible to obtain an advertising effectiveness index corresponding to CVR (Conversion Rate) in voice advertising.

また、判定部６７によって非受容度が判定された場合、広告効果更新部６８は、音声広告の出力回数に対する音声広告の非受容度毎の非受容回数の割合を演算することができる。この場合、広告効果更新部６８は、例えば、非受容度が２の非受容回数を音声広告の受容回数から減算し、減算結果を音声広告の受容回数とすることもできる。 When the non-acceptance degree is determined by the determination unit 67, the advertisement effect updating unit 68 can calculate the ratio of the non-acceptance number for each non-acceptance degree of the voice advertisement to the output number of voice advertisements. In this case, the advertisement effect updating unit 68 may, for example, subtract the number of non-acceptances with a non-acceptance degree of 2 from the number of acceptances of the voice advertisement, and use the subtraction result as the number of acceptances of the voice advertisement.
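The advertising effect indicators described above reduce to a few ratios, sketched below. The function names are assumptions, and returning 0.0 when the denominator is zero is a choice made for this sketch rather than something stated in the description.

```python
from typing import Dict


def acceptance_rate(accepted: int, outputs: int) -> float:
    """Acceptance count divided by output count (a CTR-like indicator)."""
    return accepted / outputs if outputs else 0.0


def conversion_rate(conversions: int, accepted: int) -> float:
    """Conversion count divided by acceptance count (a CVR-like indicator)."""
    return conversions / accepted if accepted else 0.0


def adjusted_acceptance_count(accepted: int,
                              non_acceptance_counts: Dict[int, int],
                              degree: int = 2) -> int:
    """Subtract the non-acceptance count of the given degree from the
    acceptance count, as in the example with non-acceptance degree 2."""
    return accepted - non_acceptance_counts.get(degree, 0)
```

For example, an advertisement output 100 times and accepted 25 times has an acceptance rate of 0.25; with 5 conversions, its conversion rate relative to acceptances is 0.2.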

〔４．情報処理システム１００の処理フロー〕 [4. Processing flow of information processing system 100]
次に、実施形態に係る情報処理システム１００による発話制御処理の手順について説明する。図１０および図１１は、実施形態に係る情報処理システム１００による発話制御処理の一例を示すフローチャートである。 Next, the procedure of the utterance control process by the information processing system 100 according to the embodiment will be described. FIGS. 10 and 11 are flowcharts showing an example of the utterance control process by the information processing system 100 according to the embodiment.

まず、スマートスピーカ１の発話制御処理について説明する。図１０に示すように、スマートスピーカ１の制御部１５は、コンテンツ出力処理中か否かを判定する（ステップＳ１０）。制御部１５は、ステップＳ１０の処理において、例えば、制御部１５がユーザＵからコンテンツの出力要求を受け付けてから出力要求に対応するコンテンツの出力が完了するまでの間をコンテンツ出力処理中として扱う。 First, the utterance control process of the smart speaker 1 will be described. As shown in FIG. 10, the control unit 15 of the smart speaker 1 determines whether or not the content output processing is in progress (step S10). In the process of step S10, the control unit 15 treats, for example, the period from when the control unit 15 receives the content output request from the user U until the output of the content corresponding to the output request is completed as the content output process.

制御部１５は、コンテンツ出力中ではないと判定した場合（ステップＳ１０：Ｎｏ）、発話処理中か否かを判定する（ステップＳ１１）。制御部１５は、ステップＳ１１の処理において、例えば、音声入力器１２からの発話の出力を開始してから発話に対するユーザＵの要求を受け可能な期間が終了するまでの期間を発話処理中として扱う。 When the control unit 15 determines that content output is not in progress (step S10: No), it determines whether or not utterance processing is in progress (step S11). In the process of step S11, the control unit 15 treats, for example, the period from the start of output of an utterance from the voice input device 12 until the end of the period during which a request from the user U in response to the utterance can be accepted as utterance processing being in progress.

When the control unit 15 determines that utterance processing is not in progress (step S11: No), it acquires context information about the user U (step S12) and inputs the acquired context information into each timing determination model included in the utterance table 21 (step S13).

Subsequently, the control unit 15 determines whether or not there is a timing determination model that outputs a score equal to or greater than a preset threshold value (step S14). When the control unit 15 determines that there is such a timing determination model (step S14: Yes), it determines whether or not a plurality of timing determination models output a score equal to or greater than the threshold value (step S15).

When the control unit 15 determines that there are a plurality of such timing determination models (step S15: Yes), it selects the timing determination model with the highest score (step S16). When the control unit 15 determines that there is not a plurality of such timing determination models (step S15: No), it selects the one timing determination model that outputs a score equal to or greater than the threshold value (step S17). The control unit 15 acquires the utterance content associated with the selected timing determination model from the utterance table 21 and outputs the acquired utterance content from the voice output device 11 (step S18).

The control unit 15 ends the process shown in FIG. 10 when the process of step S18 is completed, when it determines that content output processing is in progress (step S10: Yes), when it determines that utterance processing is in progress (step S11: Yes), or when it determines that no timing determination model outputs a score equal to or greater than the threshold value (step S14: No).
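The flow of steps S10 to S18 can be sketched roughly as below. This is a minimal illustration, not the implementation in the specification: `UtteranceEntry`, `select_utterance`, the dictionary-based context, and the representation of each timing determination model as a callable returning a score are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Context = Dict[str, float]  # hypothetical encoding of context information


@dataclass
class UtteranceEntry:
    """One row of a hypothetical utterance table: a model and its utterance content."""
    content_id: str
    utterance: str
    model: Callable[[Context], float]  # returns a score; higher means better timing


def select_utterance(
    table: List[UtteranceEntry],
    context: Context,
    threshold: float,
    content_output_in_progress: bool = False,
    utterance_in_progress: bool = False,
) -> Optional[UtteranceEntry]:
    # Steps S10 and S11: do nothing while content output or an utterance is in progress.
    if content_output_in_progress or utterance_in_progress:
        return None
    # Steps S12 and S13: feed the acquired context to every timing determination model.
    scored = [(entry.model(context), entry) for entry in table]
    # Step S14: keep only models whose score is at or above the threshold.
    candidates = [(score, entry) for score, entry in scored if score >= threshold]
    if not candidates:
        return None
    # Steps S15 to S17: a single candidate is taken as is; among several,
    # the one with the highest score is selected.
    _, best = max(candidates, key=lambda pair: pair[0])
    # Step S18 would output best.utterance from the voice output device.
    return best
```

Returning `None` corresponds to ending the process of FIG. 10 without speaking.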

Next, the processing of the information providing device 2 will be described. As shown in FIG. 11, the control unit 43 of the information providing device 2 acquires reaction information indicating the reaction of the user U when the smart speaker 1 speaks spontaneously (step S20). The control unit 43 also acquires context information about the user U at the time the smart speaker 1 speaks spontaneously (step S21). The control unit 43 stores the reaction information acquired in step S20 and the context information acquired in step S21 in the storage unit 42 (step S22).

Subsequently, the control unit 43 generates or updates a timing determination model for each content based on the reaction information and the context information stored in the storage unit 42 (step S23), and ends the process shown in FIG. 11. For example, when the control unit 43 generates a timing determination model for content that does not yet have one in the utterance table 71, it adds the generated timing determination model to the utterance table 71. When the control unit 43 updates a timing determination model in the utterance table 71, it overwrites the model in the utterance table 71 with the updated one.
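One way to picture steps S20 to S23 is an online learner kept per content. The sketch below swaps in a tiny logistic regression purely for illustration; the specification does not state which learning algorithm is used, and `TimingModel`, `UtteranceTable`, `record_and_learn`, and the learning rate are hypothetical names and values.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Context = Dict[str, float]  # hypothetical encoding of context information


class TimingModel:
    """Tiny logistic-regression stand-in for a timing determination model."""

    def __init__(self) -> None:
        self.weights: Dict[str, float] = defaultdict(float)
        self.bias = 0.0

    def score(self, context: Context) -> float:
        z = self.bias + sum(self.weights[k] * v for k, v in context.items())
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, context: Context, positive: bool, lr: float = 0.5) -> None:
        # One stochastic-gradient step on the log loss, with the user's
        # reaction (positive or not) as the training label.
        error = (1.0 if positive else 0.0) - self.score(context)
        self.bias += lr * error
        for k, v in context.items():
            self.weights[k] += lr * error * v


class UtteranceTable:
    """Hypothetical per-content store of timing determination models."""

    def __init__(self) -> None:
        self.models: Dict[str, TimingModel] = {}
        self.log: List[Tuple[str, Context, bool]] = []

    def record_and_learn(self, content_id: str, context: Context, positive: bool) -> TimingModel:
        # Steps S20 to S22: store the reaction together with the context.
        self.log.append((content_id, dict(context), positive))
        # Step S23: generate a model if the content has none, otherwise update it.
        model = self.models.setdefault(content_id, TimingModel())
        model.update(context, positive)
        return model
```

Keeping one model per `content_id` mirrors the per-content generation described above; the stored log is what a batch learner would retrain from instead of the single-step update shown here.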

Next, the output control process performed by the information processing system 100 will be described. FIG. 12 is a flowchart showing an example of the output control process performed by the information processing system 100 according to the embodiment.

As shown in FIG. 12, the control unit 43 determines whether or not there is a content output request from the smart speaker 1 (step S30). When the control unit 43 determines that there is a content output request (step S30: Yes), it acquires context information about the user U of the smart speaker 1 that made the output request (step S31).

The control unit 43 then determines the output mode of the requested content based on the acquired context information (step S32) and outputs the content in the output mode determined in step S32 (step S33). The control unit 43 ends the process shown in FIG. 12 when the process of step S33 is completed or when it determines that there is no output request (step S30: No).

Next, the voice information effect determination process performed by the information processing system 100 will be described. FIG. 13 is a flowchart showing an example of the voice information effect determination process performed by the information processing system 100 according to the embodiment.

As shown in FIG. 13, the control unit 43 of the information providing device 2 determines whether or not it is time to output a voice advertisement (step S40). When the control unit 43 determines that it is time to output a voice advertisement (step S40: Yes), it outputs the voice advertisement to the smart speaker 1 (step S41).

Subsequently, the control unit 43 increments, in the voice advertisement table 73, the output count of the voice advertisement output in step S41 (step S42). The control unit 43 also acquires imaging information, sound information, and device operation information for the voice output period from at least one of the smart speaker 1, the terminal device 3, the device 4, and the sensor device 5 (step S43).

The control unit 43 detects the behavior of the user U based on the information acquired in step S43 (step S44) and determines whether or not the detected behavior is a specific behavior (step S45). When the control unit 43 determines that the behavior detected in step S44 is a specific behavior (step S45: Yes), it determines that the voice advertisement has been accepted by the user U (step S46) and increments, in the voice advertisement table 73, the acceptance count of the voice advertisement output in step S41 (step S47).

On the other hand, when the control unit 43 determines that the behavior detected in step S44 is not a specific behavior (step S45: No), it determines that the voice advertisement has not been accepted by the user U (step S48). The control unit 43 ends the process shown in FIG. 13 when the process of step S47 or step S48 is completed, or when it determines that it is not time to output a voice advertisement (step S40: No).
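The counting in steps S42 to S48 amounts to two counters per voice advertisement. A minimal sketch follows, assuming that detected behaviors are represented as strings and that the set of "specific" behaviors is supplied by the caller; `AdStats`, `record_ad_result`, and the behavior labels in the usage note are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass
class AdStats:
    """Hypothetical row of a voice advertisement table."""
    outputs: int = 0
    acceptances: int = 0

    @property
    def acceptance_rate(self) -> float:
        # Ratio of acceptances to outputs; 0.0 when nothing has been output yet.
        return self.acceptances / self.outputs if self.outputs else 0.0


def record_ad_result(
    stats: AdStats,
    detected_behaviors: Iterable[str],
    accepting_behaviors: Set[str],
) -> bool:
    """Update the counters for one output of a voice advertisement."""
    stats.outputs += 1  # step S42
    # Steps S44 and S45: is any detected behavior one of the specific behaviors?
    accepted = any(b in accepting_behaviors for b in detected_behaviors)
    if accepted:
        stats.acceptances += 1  # steps S46 and S47
    return accepted  # False corresponds to step S48
```

For example, after one output with a detected behavior in the accepting set and one output without, `acceptance_rate` is 0.5.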

[5. Modifications]
In the above example, the learning unit 64 that generates the timing determination model is provided in the information providing device 2, but the learning unit 64 may instead be provided in the smart speaker 1. Likewise, in the above example the context acquisition unit 35 and the timing determination unit 36 are provided in the smart speaker 1, but they may instead be provided in the information providing device 2.

In the above example, the utterance timing is determined using the timing determination model, but the approach is not limited to this example as long as the utterance timing can be determined in consideration of the reaction of the user U to past utterances. For example, instead of the timing determination model, the smart speaker 1 can determine whether or not the current context is suitable for an utterance by using determination condition information generated from the reaction information and the context information for past utterances. The determination condition information includes the context conditions under which it is determined to be an utterance timing.
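As a rough illustration of this rule-based variant, determination condition information could be derived by keeping the contexts in which past utterances mostly drew positive reactions. The sketch below is an assumption about one possible derivation, not the method of the specification; `build_conditions`, `is_utterance_timing`, the tuple-based context encoding, and the `min_rate` and `min_samples` parameters are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Set, Tuple


def build_conditions(
    history: Iterable[Tuple[Hashable, bool]],
    min_rate: float = 0.6,
    min_samples: int = 3,
) -> Set[Hashable]:
    """Derive the set of contexts that count as an utterance timing.

    `history` pairs a discretized context (for example a tuple of labels)
    with whether the user's reaction to the past utterance was positive.
    """
    totals: Dict[Hashable, int] = defaultdict(int)
    positives: Dict[Hashable, int] = defaultdict(int)
    for context, positive in history:
        totals[context] += 1
        if positive:
            positives[context] += 1
    # Keep contexts seen often enough whose positive-reaction rate is high enough.
    return {
        context
        for context, n in totals.items()
        if n >= min_samples and positives[context] / n >= min_rate
    }


def is_utterance_timing(current_context: Hashable, conditions: Set[Hashable]) -> bool:
    """Check the current context against the determination condition information."""
    return current_context in conditions
```

Compared with a learned scoring model, this form is easy to inspect but cannot generalize to contexts that never appeared in the history.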

In the above example, the gestures and mouth movements of the user U are detected by the smart speaker 1, but they may instead be detected by the information providing device 2.

In the above example, a voice advertisement is described as an example of the voice information output from the smart speaker 1, but the voice information output from the smart speaker 1 is not limited to voice advertisements. For example, the control unit 43 of the information providing device 2 can determine whether or not voice information has been accepted by the user U based on the behavior of the user U when push-type voice information such as "It's sunny today" or "You have the day off work today" is output from the smart speaker 1. The control unit 43 of the information providing device 2 can then calculate the ratio of the number of times the voice information was accepted to the number of times it was output, and use the calculated ratio as the voice information effect. This yields an effect index for voice information that corresponds to CTR.

The voice information whose effect is determined is not limited to voice information output from the information providing device 2 to the smart speaker 1, and may be voice information stored in the smart speaker 1.

In the above example, the information processing system 100 may determine that the user U has accepted the voice information even when the reaction of the user U is not positive, but it can also determine that the user U has accepted only when the reaction of the user U is positive. That is, in the voice information effect determination process, the information processing system 100 can treat a state of the user U that would be judged a positive reaction in the utterance control process and the output control process as acceptance by the user U, and treat any other state as non-acceptance. Conversely, in the utterance control process and the output control process, the information processing system 100 can treat a state of the user U that would be judged as acceptance in the voice information effect determination process as a positive reaction of the user U, and treat any other state as a non-positive reaction.

[6. Program]
Each of the smart speaker 1 and the information providing device 2 in the above-described embodiment is realized by, for example, a computer 200 configured as shown in FIG. 14 executing a program. FIG. 14 is a diagram showing an example of the hardware configuration of a computer that executes the program. The computer 200 includes a CPU (Central Processing Unit) 201, a RAM (Random Access Memory) 202, a ROM (Read Only Memory) 203, an HDD (Hard Disk Drive) 204, a communication interface (I/F) 205, an input/output interface (I/F) 206, and a media interface (I/F) 207.

The CPU 201 operates based on a program stored in the ROM 203 or the HDD 204 and controls each part. The ROM 203 stores a boot program executed by the CPU 201 when the computer 200 starts, programs that depend on the hardware of the computer 200, and the like. The HDD 204 stores data and the like used by the programs executed by the CPU 201. The communication interface 205 receives data from other devices via the network 6 and sends it to the CPU 201, and transmits data generated by the CPU 201 to other devices via the network 6.

The CPU 201 controls output devices such as a display and a printer, and input devices such as a keyboard and a mouse, via the input/output interface 206. The CPU 201 acquires data from the input devices via the input/output interface 206, and outputs generated data to the output devices via the input/output interface 206.

The media interface 207 reads a program or data stored in the recording medium 208 and provides it to the CPU 201 via the RAM 202. The CPU 201 loads the program from the recording medium 208 onto the RAM 202 via the media interface 207 and executes the loaded program. The recording medium 208 is, for example, an optical recording medium such as a DVD (Digital Versatile Disc) or a PD (Phase change rewritable Disk), a magneto-optical recording medium such as an MO (Magneto-Optical disk), a tape medium, a magnetic recording medium, or a semiconductor memory.

When the computer 200 functions as the smart speaker 1 according to the above-described embodiment, the CPU 201 of the computer 200 executes the program loaded on the RAM 202 to realize the functions of the input processing unit 31, the information output unit 32, the information acquisition unit 33, the output processing unit 34, the context acquisition unit 35, and the timing determination unit 36 shown in FIG. 3. In addition, for example, the HDD 204 stores the same information as the storage unit 14 shown in FIG. 3.

When the computer 200 functions as the information providing device 2 according to the above-described embodiment, the CPU 201 of the computer 200 executes the program loaded on the RAM 202 to realize the functions of the information acquisition unit 61, the information output unit 62, the context acquisition unit 63, the learning unit 64, the output mode determination unit 65, the detection unit 66, the determination unit 67, and the advertisement effect updating unit 68 shown in FIG. 5. In addition, for example, the HDD 204 stores the same information as the storage unit 42 shown in FIG. 5.

The CPU 201 of the computer 200 reads these programs from the recording medium 208 and executes them, but as another example, it may acquire them from another device via the network 6.

[7. Effects]
As described above, the information processing system 100 according to the embodiment (an example of an utterance control device) includes the context acquisition unit 35 that acquires context information about the user U, and the timing determination unit 36 that determines, based on the context information acquired by the context acquisition unit 35, the timing of an utterance in consideration of the past reaction of the user U to utterances from the voice output device 11. As a result, the utterance timing can be determined more appropriately than when the utterance timing is controlled under preset conditions.

The timing determination unit 36 also determines, based on the context information acquired by the context acquisition unit 35, the timing of the utterance for each content in consideration of the past reaction of the user U to the utterance for each content from the voice output device 11. As a result, an utterance timing appropriate to the content can be determined.

The timing determination unit 36 also determines the timing of the utterance by inputting the context information acquired by the context acquisition unit 35 into a timing determination model generated from the past reactions of the user U to utterances and the context information about the user U at the time of those utterances. As a result, the utterance timing can easily be determined based on the past reactions of the user U.

The context acquisition unit 35 also acquires surrounding information indicating the situation around the user U as at least a part of the context information. As a result, for example, a more appropriate utterance timing for the content can be determined from surrounding conditions that affect the reaction of the user U.

The context acquisition unit 35 also acquires the surrounding information from sensor information output from one or more sensor devices 5 present around the user U. As a result, for example, contexts such as the brightness, temperature, and humidity around the user U can be used, and a more appropriate utterance timing for the content can be determined.

The context acquisition unit 35 also acquires the surrounding information from device information indicating the state of one or more devices present around the user U (for example, the smart speaker 1, the terminal device 3, or the device 4). As a result, contexts such as the state of device operation by the user U can be used, and a more appropriate utterance timing for the content can be determined.

The context acquisition unit 35 also acquires the surrounding information from operation history information indicating the history of operations on one or more devices present around the user U (for example, the smart speaker 1, the terminal device 3, or the device 4). As a result, contexts such as logs of device operations by the user U can be used, and a more appropriate utterance timing for the content can be determined.

The timing determination model may be a model generated in common for a plurality of users U. As a result, the processing load for generating timing determination models can be reduced compared with generating a timing determination model for each user U.

The timing determination model may also be a model at least a part of which is specific to the user U. As a result, a more appropriate utterance timing that matches the characteristics of the user U can be determined.

The information processing system 100 also includes the learning unit 64, which updates the timing determination model based on the past reactions of the user U to utterances from the voice output device 11 and the context information. As a result, a more appropriate utterance timing can be determined even when the characteristics of the user U or of the surroundings of the user U change.

The reaction of the user U used in the timing determination model includes whether or not the user's speech in response to an utterance from the voice output device 11 is positive. As a result, a more appropriate utterance timing can be determined.

When the user U is in a conversation, the reaction of the user U used in the timing determination model is determined from the state of the user U's conversation after output of the utterance from the voice output device 11 has started. Examples of such reactions are the user U interrupting the conversation or changing its topic.

The utterance from the voice output device 11 is an utterance asking whether or not content may be output. As a result, even if the utterance comes at an inappropriate time, it is less bothersome to the user U than if the content itself were output spontaneously.

[8. Others]
Among the processes described in the above embodiment and modifications, all or part of the processes described as being performed automatically can be performed manually, and all or part of the processes described as being performed manually can be performed automatically by known methods.

In addition, the processing procedures, specific names, and information including various data and parameters shown in the above description and drawings can be changed as desired unless otherwise specified. For example, the various information shown in each figure is not limited to the illustrated information.

Each component of each illustrated device is a functional concept and does not necessarily have to be physically configured as illustrated. That is, the specific form of distribution and integration of each device is not limited to the illustrated one, and all or part of each device can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like.

For example, the information processing system 100 may be configured so that at least one of the input processing unit 31, the context acquisition unit 35, and the timing determination unit 36 is provided in the information providing device 2, which is separate from the smart speaker 1, or in another device. The smart speaker 1 may also be configured to perform part or all of the processing of the information providing device 2 described above. The information providing device 2 may be realized by a plurality of server computers, and depending on the function, it may be realized by calling an external platform or the like through an API (Application Programming Interface) or network computing, so the configuration can be changed flexibly.

The above-described embodiment and modifications can be combined as appropriate as long as the processing contents do not contradict each other. The term "unit" (section, module, unit) used above can be read as "means" or "circuit". For example, the input processing unit 31 can be read as input processing means or an input processing circuit.

1 Smart speaker
2 Information providing device
3 Terminal device
4, 4_1 to 4_n Device
5, 5_1 to 5_m Sensor device
6 Network
10, 41 Communication unit
11 Voice output device
12 Voice input device
13 Imaging unit
14, 42 Storage unit
15, 43 Control unit
20 Operation history
21 Utterance table
31 Input processing unit
32, 62 Information output unit
33, 61 Information acquisition unit
34 Output processing unit
35, 63 Context acquisition unit
36 Timing determination unit
51 Utterance table storage unit
52 Content storage unit
53 Voice advertisement storage unit
54 User information storage unit
55 Context storage unit
56 Output mode determination information storage unit
64 Learning unit
65 Output mode determination unit
66 Detection unit
67 Determination unit
68 Advertisement effect updating unit
71 Utterance table
72 Content table
73 Voice advertisement table
74 User information table
100 Information processing system

Claims

An utterance control device comprising:
a storage unit that stores a model that is generated by machine learning using past context information about a user, with whether or not the user's past reaction to an utterance from a voice output device was positive as teacher data, and that is used to determine the timing of the utterance;
a context acquisition unit that acquires context information about the user; and
a timing determination unit that inputs the context information acquired by the context acquisition unit into the model stored in the storage unit and determines the timing of the utterance.

The utterance control device according to claim 1, wherein
the storage unit stores the model for each content, and
the timing determination unit inputs the context information acquired by the context acquisition unit into the model for each content and determines the timing of the utterance for each content.

The utterance control device according to claim 1 or 2, wherein the context acquisition unit acquires surrounding information indicating the situation around the user as at least a part of the context information.

The utterance control device according to claim 3, wherein the context acquisition unit acquires the surrounding information from sensor information output from one or more sensor devices present around the user.

The utterance control device according to claim 3 or 4, wherein the context acquisition unit acquires the surrounding information from device information indicating the state of one or more devices present around the user.

The utterance control device according to any one of claims 3 to 5, wherein the context acquisition unit acquires the surrounding information from operation history information indicating the history of operations on one or more devices present around the user.

The utterance control device according to any one of claims 1 to 6, wherein the model is a model generated in common for a plurality of users.

The utterance control device according to any one of claims 1 to 6, wherein at least a part of the model is specific to the user.

The invention according to any one of claims 1 to 8 , further comprising a learning unit that updates the model based on the past user's reaction to the utterance from the voice output device and the context information. Speech control device.

The utterance control device according to any one of claims 1 to 8, wherein the reaction of the user includes whether or not the user's utterance in response to the utterance from the voice output device is positive.

The utterance control device according to any one of claims 1 to 8, wherein, when the user is in a conversation, the reaction of the user is determined from the state of the conversation after output of the utterance from the voice output device has started.

The utterance control device according to any one of claims 1 to 11, wherein the utterance from the voice output device is an utterance relating to whether or not the content can be output.

The utterance control device according to any one of claims 1 to 12, wherein the voice output device is a voice output device included in a smart speaker.

An utterance control method executed by a computer, comprising: a context acquisition step of acquiring context information about a user; and a timing determination step of inputting the context information acquired in the context acquisition step into a model stored in a storage unit to determine the timing of an utterance, the model being generated by machine learning that uses past context information about the user, with whether or not the user's past reaction to an utterance from a voice output device was positive as teacher data, and being used to determine the timing of the utterance.

An utterance control program causing a computer to execute: a context acquisition procedure of acquiring context information about a user; and a timing determination procedure of inputting the context information acquired in the context acquisition procedure into a model stored in a storage unit to determine the timing of an utterance, the model being generated by machine learning that uses past context information about the user, with whether or not the user's past reaction to an utterance from a voice output device was positive as teacher data, and being used to determine the timing of the utterance.