JP2018120203A

JP2018120203A - Information processing method and program

Info

Publication number: JP2018120203A
Application number: JP2017145707A
Authority: JP
Inventors: Yuri Nishikawa; Katsuyoshi Yamagami
Original assignee: Panasonic Intellectual Property Corp of America
Current assignee: Panasonic Intellectual Property Corp of America
Priority date: 2016-11-02
Filing date: 2017-07-27
Publication date: 2018-08-02

Abstract

PROBLEM TO BE SOLVED: To provide an information processing method and a program that reduce the response time of multi-turn dialogues.

SOLUTION: The information processing method outputs first phrase information generated from a first voice of a user to a server via a network; acquires, from the server, first response information indicating a first response message for the first voice; causes a speaker to output the first response message; acquires second response information from the server, the second response information containing a conditional branch instruction that branches to a plurality of instructions according to phrase information concerning selection candidates for the user's reply to the first response message; acquires, after the first response message is output, a second voice containing the user's reply to the first response message; and causes at least one of the speaker and a device to execute a predetermined operation according to the instruction of the conditional branch instruction determined by matching the phrase information against second phrase information generated from the second voice.

SELECTED DRAWING: Figure 13A

Description

The present disclosure relates to an information processing method and a program.

Patent Document 1, for example, discloses one example of voice recognition technology. The device and method of Patent Document 1 use voice commands to control devices related to consumer electronics. To shorten the time from recognizing a command to executing the corresponding operation when a server is used to recognize the user's commands, the device and method store the voice recognition commands that the user speaks frequently, together with the corresponding control command information, in a local "storage unit".

Patent Document 1: JP 2014-71457 A

Voice dialogue agents such as the device and method of Patent Document 1 assume only a single round-trip exchange, in the manner of one question and one answer.

The present disclosure provides an information processing method and a program that reduce the response time of multi-turn dialogues.

An information processing method according to one aspect of the present disclosure is executed by a processor that controls at least one device through dialogue with a user. The method includes: acquiring first voice information indicating a first voice of the user input from a microphone; outputting first phrase information generated from the first voice information to a server via a network; acquiring, from the server via the network, first response information corresponding to the first phrase information, the first response information indicating a first response message for the first voice; causing a speaker to output the first response message based on the first response information; acquiring, from the server via the network, second response information that is associated with the first response information on the server, the second response information including a conditional branch instruction that branches to a plurality of different instructions according to one or more pieces of phrase information, each of which relates to a selection candidate for the user's reply to the first response message; acquiring, after the first response message is output, second voice information indicating a second voice of the user input from the microphone, the second voice including the user's reply to the first response message; and causing at least one of the speaker and the at least one device to execute a predetermined operation according to the instruction, among those of the conditional branch instruction, that is determined by matching the one or more pieces of phrase information against second phrase information generated from the second voice information. Between acquiring the second voice information and causing the predetermined operation to be executed, neither the second voice information nor the second phrase information is output to the server.

A program according to one aspect of the present disclosure causes the processor to execute the above information processing method.

A dialogue processing method according to one aspect of the present disclosure is an information processing method executed by a second processor on a server. The second processor can communicate via a network with a first processor that controls at least one device through dialogue with a user. The method includes: acquiring, from the first processor via the network, first phrase information about a first voice of the user input from a microphone, the first phrase information indicating at least one of a character string corresponding to the first voice and semantic information of the character string; outputting first response information corresponding to the first phrase information to the first processor via the network, the first response information indicating a first response message for the first voice to be output from a speaker; and outputting, to the first processor via the network, second response information that is associated with the first response information on the server, the second response information including a conditional branch instruction that branches to a plurality of different instructions according to one or more pieces of phrase information, each of which relates to a selection candidate for the user's reply to the first response message.

A program according to one aspect of the present disclosure causes the second processor to execute the above information processing method.

The information processing method and program of the present disclosure make it possible to reduce the response time of multi-turn dialogues.

FIG. 1A is a diagram showing an example of an environment in which a voice dialogue agent system including the dialogue processing device according to the embodiment is deployed, and giving an overview of the service provided by an information management system that includes the voice dialogue agent system.
FIG. 1B is a diagram showing an example in which the data center operating company in FIG. 1A is a device manufacturer.
FIG. 1C is a diagram showing an example in which the data center operating company in FIG. 1A is either or both of a device manufacturer and a management company.
FIG. 2 is a schematic diagram showing the configuration of the voice dialogue agent system according to the embodiment.
FIG. 3 is a diagram showing an example of the hardware configuration of the voice input/output device according to the embodiment.
FIG. 4 is a diagram showing an example of the hardware configuration of the device according to the embodiment.
FIG. 5 is a diagram showing an example of the hardware configuration of the local server according to the embodiment.
FIG. 6 is a diagram showing an example of the hardware configuration of the cloud server according to the embodiment.
FIG. 7 is a diagram showing an example of the system configuration of the voice input/output device according to the embodiment.
FIG. 8 is a diagram showing an example of the system configuration of the device according to the embodiment.
FIG. 9 is a diagram showing an example of the system configuration of the local server according to the embodiment.
FIG. 10 is a diagram showing an example of the system configuration of the cloud server according to the embodiment.
FIG. 11 is a specific example of the local dictionary DB according to the embodiment.
FIG. 12A is a specific example of the dialogue rule DB according to the embodiment.
FIG. 12B is a specific example of the offload instruction generation DB according to the embodiment.
FIG. 13A is a sequence diagram of communication processing that shortens the response time of the voice dialogue agent system according to the embodiment.
FIG. 13B is a sequence diagram of communication processing that shortens the response time of the voice dialogue agent system according to the embodiment.
FIG. 14 is a diagram giving an overview of the service provided by the information management system in service type 1 (in-house data center type cloud service), to which the voice dialogue agent system according to the embodiment is applicable.
FIG. 15 is a diagram giving an overview of the service provided by the information management system in service type 2 (IaaS-based cloud service), to which the voice dialogue agent system according to the embodiment is applicable.
FIG. 16 is a diagram giving an overview of the service provided by the information management system in service type 3 (PaaS-based cloud service), to which the voice dialogue agent system according to the embodiment is applicable.
FIG. 17 is a diagram giving an overview of the service provided by the information management system in service type 4 (SaaS-based cloud service), to which the voice dialogue agent system according to the embodiment is applicable.

[Findings underlying the technology of the present disclosure]
The present inventors found that the conventional technology disclosed in Patent Document 1 has the following problem. The method and device of Patent Document 1 assume only one-question-one-answer dialogue. For dialogue that goes beyond a single question and answer, such a method and device may take longer to respond or may be unable to respond at all. The inventors therefore studied techniques for reducing the response time of the local-side and cloud-side devices in multi-turn dialogues. They found that, in a cloud-based voice dialogue agent that controls devices through multi-turn dialogue, the agent's response time can be shortened by offloading simple recognition commands and their corresponding control commands to the local side, that is, by handing them over so that the load is reduced. The inventors accordingly examined the following improvements.

A first information processing method according to one aspect of the present disclosure is executed by a processor that controls at least one device through dialogue with a user. The method includes: acquiring first voice information indicating a first voice of the user input from a microphone; outputting first phrase information (a first speech recognition result) generated from the first voice information to a server via a network; acquiring, from the server via the network, first response information corresponding to the first phrase information, the first response information indicating a first response message for the first voice; causing a speaker to output the first response message based on the first response information; acquiring, from the server via the network, second response information that is associated with the first response information on the server, the second response information including a conditional branch instruction (processing information) that branches to a plurality of different instructions according to one or more pieces of phrase information, each of which relates to a selection candidate for the user's reply to the first response message; acquiring, after the first response message is output, second voice information indicating a second voice of the user input from the microphone, the second voice including the user's reply to the first response message; and causing at least one of the speaker and the at least one device to execute a predetermined operation according to the instruction, among those of the conditional branch instruction, that is determined by matching the one or more pieces of phrase information (estimated reply information) against second phrase information (a second speech recognition result) generated from the second voice information. Between acquiring the second voice information and causing the predetermined operation to be executed, neither the second voice information nor the second phrase information is output to the server.

According to the above aspect, after the user's first voice is acquired, the server determines both the first response information corresponding to the first voice and the second response information corresponding to the user's reply to the first response message indicated by the first response information. Both are then obtained from the server. The first response information makes it possible to output the first response message in reply to the user's first voice, and the second response information makes it possible to output a message from the speaker and to control the device according to the user's reply to the first response message. Multi-turn dialogue with the user thus becomes possible. Moreover, because the second response information includes a conditional branch instruction, the processor can handle several different replies from the user to the first response message. After the user's reply to the first response message is acquired, no communication with the server is needed to generate response information for that reply. The response time of multi-turn dialogues can therefore be shortened.
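The local-side flow described above can be illustrated with a minimal Python sketch. It is not the implementation of the disclosure: the class names, the `fake_server` stand-in, the bath-heating dialogue, and the exact-match lookup of the reply phrase are all assumptions made for illustration. The sketch only shows that the second reply is resolved against the pre-fetched conditional branch instruction without a second server call. What happens when no candidate matches is not specified in this passage, so the sketch simply reports failure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Instruction:
    """One branch target: a spoken message, a device command, or both."""
    speak: Optional[str] = None
    device_command: Optional[str] = None


@dataclass
class SecondResponse:
    """Conditional branch instruction: candidate reply phrase -> instruction."""
    branches: Dict[str, Instruction]


# Hypothetical server interface: first phrase in, (first message, second response) out.
ServerCall = Callable[[str], Tuple[str, SecondResponse]]


class LocalProcessor:
    def __init__(self, server: ServerCall) -> None:
        self.server = server
        self.server_calls = 0          # counts network round trips
        self.actions: List[str] = []   # stands in for speaker/device output
        self.pending: Optional[SecondResponse] = None

    def on_first_phrase(self, phrase: str) -> None:
        # Single round trip: both response items arrive together here.
        self.server_calls += 1
        first_message, self.pending = self.server(phrase)
        self.actions.append(f"speaker: {first_message}")

    def on_second_phrase(self, phrase: str) -> bool:
        # Resolved locally; nothing is sent to the server in this step.
        if self.pending is None:
            return False
        instruction = self.pending.branches.get(phrase)
        if instruction is None:
            return False  # fallback behaviour is outside this sketch
        if instruction.speak is not None:
            self.actions.append(f"speaker: {instruction.speak}")
        if instruction.device_command is not None:
            self.actions.append(f"device: {instruction.device_command}")
        self.pending = None
        return True


def fake_server(phrase: str) -> Tuple[str, SecondResponse]:
    """Hypothetical stand-in for the cloud server."""
    branches = {
        "yes": Instruction(speak="Heating the bath.", device_command="bath/heat"),
        "no": Instruction(speak="Cancelled."),
    }
    return "Shall I heat the bath?", SecondResponse(branches)


processor = LocalProcessor(fake_server)
processor.on_first_phrase("heat the bath")
processor.on_second_phrase("yes")
```

After both turns, `processor.server_calls` is still 1, which is the point of the aspect: the second turn costs no network round trip.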

A second information processing method according to one aspect of the present disclosure is executed by a second processor (a dialogue processing device) on a server. The second processor can communicate via a network with a first processor that controls at least one device through dialogue with a user. The method includes: acquiring, from the first processor via the network, first phrase information (a first speech recognition result) about a first voice of the user input from a microphone, the first phrase information indicating at least one of a character string corresponding to the first voice and semantic information of the character string; outputting first response information corresponding to the first phrase information to the first processor via the network, the first response information indicating a first response message for the first voice to be output from a speaker; and outputting, to the first processor via the network, second response information that is associated with the first response information on the server, the second response information including a conditional branch instruction (processing information) that branches to a plurality of different instructions according to one or more pieces of phrase information (estimated reply information), each of which relates to a selection candidate for the user's reply to the first response message.

According to the above aspect, after the server acquires the first phrase information about the user's first voice, it determines both the first response information corresponding to the first voice and the second response information corresponding to the user's reply to the first response message indicated by the first response information. Both are then output from the server to the first processor. Based on the first response information, the first processor can output the first response message in reply to the user's first voice. Based on the second response information, the first processor can output a message from the speaker and control the device according to the user's reply to the first response message. Multi-turn dialogue with the user thus becomes possible. Moreover, because the second response information includes a conditional branch instruction, the first processor can handle several different replies from the user to the first response message. After the user's reply to the first response message is acquired, no communication between the first processor and the server is needed to generate response information for that reply. The response time of multi-turn dialogues can therefore be shortened.

For example, in the first information processing method according to one aspect of the present disclosure, the first response information and the second response information may be acquired from the server at the same time.

According to the above aspect, the frequency of communication between the server and the processor is reduced, and the response time to the user's voice can be shortened. Specifically, the second response information for the user's reply is obtained from the server before the user's reply to the first response message indicated by the first response information is confirmed. Communication for the server to confirm the user's reply is therefore unnecessary. This also prevents the processing delay that would arise if a reply to the first response message had been received from the user but the second response information had not yet been acquired, leaving the next operation unable to proceed.

For example, in the first information processing method according to one aspect of the present disclosure, after the first phrase information is output to the server, the second response information may be selected from among a plurality of pieces of second response information stored on the server, according to at least one of the first phrase information and the first response information.

For example, the first information processing method according to one aspect of the present disclosure may further include, after causing the predetermined operation to be executed, outputting to the server an execution notification indicating that the predetermined operation has been executed.

According to the above aspect, by acquiring the execution notification, the server can end its operations related to controlling the device through dialogue with the user. Unnecessary waiting and operation of the server can therefore be reduced.

For example, in the first information processing method according to one aspect of the present disclosure, at least one of the plurality of instructions may include an instruction to output a second response message for the second voice from the speaker.

According to the above aspect, acquiring the second response information makes it possible to respond by voice according to the content of the second voice, which includes the user's reply to the first response message.

For example, in the first information processing method according to one aspect of the present disclosure, at least one of the plurality of instructions may include an instruction to transmit a control command to the at least one device.

According to the above aspect, acquiring the second response information makes it possible to control the device according to the content of the second voice, which includes the user's reply to the first response message.

For example, in the first information processing method according to one aspect of the present disclosure, the network may be the Internet, and the information processing method may be executed on a local server that can communicate with the at least one device without going through the Internet.

According to the above aspect, an increase in response time caused by communication between the local server and the at least one device is suppressed.

A first program according to one aspect of the present disclosure causes the processor to execute the first information processing method described above.

For example, in the second information processing method according to one aspect of the present disclosure, the first processor may cause at least one of the speaker and the at least one device to execute a predetermined operation according to the instruction, among those of the conditional branch instruction, that is determined by matching the one or more pieces of phrase information against second phrase information generated from the second voice information.

According to the above aspect, the first processor can select, from the instructions of the conditional branch instruction, the one suited to the content of the user's second voice. This suppresses an increase in response time caused by a mismatch between the content of the user's second voice and the operation the first processor performs based on that instruction.

For example, in the second information processing method according to one aspect of the present disclosure, the second processor may refrain from outputting the second voice information and the second phrase information to the first processor during the period from when the first processor acquires the user's reply until the first processor causes the predetermined operation to be executed.

For example, the second information processing method according to one aspect of the present disclosure may further include acquiring, from the first processor, an execution notification indicating that the predetermined operation has been executed.

According to the above aspect, by acquiring the execution notification, the second processor of the server can end its operations related to controlling the device through dialogue with the user. Unnecessary waiting and operation of the second processor can therefore be reduced.
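As a rough illustration of how an execution notification lets the server stop waiting, the following Python sketch tracks open dialogue sessions and closes one when the notification arrives. The `DialogueServer` class, its method names, and the use of a session identifier are hypothetical; the disclosure states only that the notification allows the server-side operation to end.

```python
from typing import Set


class DialogueServer:
    """Hypothetical server-side bookkeeping for ongoing dialogues."""

    def __init__(self) -> None:
        self.open_sessions: Set[str] = set()

    def responses_sent(self, session_id: str) -> None:
        # Called once the first and second response information are output;
        # from here on the server is only waiting for the local side.
        self.open_sessions.add(session_id)

    def on_execution_notification(self, session_id: str) -> bool:
        # The local side reports that the predetermined operation ran,
        # so the session can be closed. Returns True if it was open.
        if session_id in self.open_sessions:
            self.open_sessions.remove(session_id)
            return True
        return False


server = DialogueServer()
server.responses_sent("session-1")
closed = server.on_execution_notification("session-1")
```

Without such a notification, a server in this sketch would have to keep the session open until some timeout, because it never sees the user's second reply.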

For example, in the second information processing method according to one aspect of the present disclosure, the first response information and the second response information may be output to the first processor at the same time.

According to the above aspect, the frequency of communication between the second processor and the first processor is reduced, and the response time to the user's voice can be shortened. Specifically, the second response information for the user's reply is output to the first processor before the user's reply to the first response message indicated by the first response information is confirmed. The second processor therefore does not need to communicate in order to acquire the user's reply. This also prevents the processing delay that would arise if the user had replied to the first response message but the first processor had not yet acquired the second response information, leaving the next operation unable to proceed.

For example, in the second information processing method according to one aspect of the present disclosure, at least one of the plurality of instructions may include an instruction to output a second response message for the second voice from the speaker.

According to the above aspect, by acquiring the second response information, the first processor can respond by voice according to the content of the second voice, which includes the user's reply to the first response message.

For example, in the second information processing method according to one aspect of the present disclosure, at least one of the plurality of instructions may include an instruction to output a control command to the at least one device.

According to the above aspect, by acquiring the second response information, the first processor can control the device according to the content of the second voice, which includes the user's reply to the first response message.

For example, in the second information processing method according to one aspect of the present disclosure, the server may store a database that includes a plurality of pieces of semantic information, a plurality of pieces of first response information respectively associated with the pieces of semantic information, and a plurality of pieces of second response information respectively associated with the pieces of first response information. After acquiring the first phrase information, the method may refer to the database to identify, from among the pieces of semantic information, the semantic information corresponding to the first voice; identify, from among the pieces of first response information, the first response information associated with the identified semantic information; and identify, from among the pieces of second response information, the second response information associated with the identified first response information.

According to the above aspect, the processing of generating the first response information and the second response information from the first phrase information can be performed on the server side alone. The server has high processing capability and a large data capacity. Having the server perform this generation processing makes it possible to shorten the processing time.

A second program according to one aspect of the present disclosure causes the second processor to execute the second information processing method described above.

Note that these general or specific aspects may be implemented as a system, a method, an integrated circuit, a computer program, or a computer-readable recording medium such as a CD-ROM, or as any combination of a system, a method, an integrated circuit, a computer program, and a recording medium.

Hereinafter, embodiments will be specifically described with reference to the drawings. Each of the embodiments described below shows one specific example of the technology of the present disclosure. The numerical values, shapes, constituent elements, steps, order of steps, and the like shown in the following embodiments are examples and are not intended to limit the present disclosure. Among the constituent elements in the following embodiments, those not recited in the independent claims representing the broadest concept are described as optional constituent elements. The contents of all the embodiments may also be combined with one another.

[Embodiment]
[1. Overview of services provided]
First, with reference to FIGS. 1A to 1C, an overview of the services provided by an information management system in which a voice interaction agent system 1 including a dialogue processing device according to the embodiment is deployed will be described. FIG. 1A is a diagram showing an example of an environment in which the voice interaction agent system 1 including the dialogue processing device according to the embodiment is deployed, and shows an overview of the services provided by the information management system including the voice interaction agent system. FIG. 1B is a diagram showing an example in which the data center operating company of FIG. 1A corresponds to a device manufacturer. FIG. 1C is a diagram showing an example in which the data center operating company of FIG. 1A corresponds to both or either one of a device manufacturer and a management company. The dialogue processing device may be the home gateway (also called a local server) 102 described later, may be the cloud server 111, or may include both the home gateway 102 and the cloud server 111.

As shown in FIG. 1A, the information management system 4000 includes a group 4100, a data center operating company 4110, and a service provider 4120. The group 4100 is, for example, a company, an organization, or a household, and may be of any size. The group 4100 includes a plurality of devices 101, including a first device 101a and a second device 101b, and a home gateway 102. The plurality of devices 101 are, for example, home appliances. The plurality of devices 101 may include devices that can connect to a communication network such as the Internet, for example a smartphone, a personal computer (PC), or a television, and may include devices that cannot connect to a communication network such as the Internet by themselves, for example a lighting fixture, a washing machine, or a refrigerator. The plurality of devices 101 may also include devices that cannot connect to a communication network such as the Internet by themselves but can do so via the home gateway 102. A user 5100 uses the plurality of devices 101 in the group 4100.

The data center operating company 4110 includes a cloud server 111. The cloud server 111 is a virtualized server that cooperates with various devices via a communication network such as the Internet. The cloud server 111 mainly manages huge volumes of data (big data) and the like that are difficult to handle with ordinary database management tools. The data center operating company 4110 manages the data, manages the cloud server 111, and operates the data center that performs these tasks. Details of the services performed by the data center operating company 4110 will be described later. In the following description, the Internet is assumed to be used as the communication network, but the communication network is not limited to the Internet.

Here, the data center operating company 4110 is not limited to a company that only manages data or manages the cloud server 111. For example, as shown in FIG. 1B, when a device manufacturer that develops or manufactures one of the plurality of devices 101 also manages the data or the cloud server 111, the device manufacturer corresponds to the data center operating company 4110. Further, the data center operating company 4110 is not limited to a single company. For example, as shown in FIG. 1C, when a device manufacturer and a management company manage the data or the cloud server 111 jointly or by dividing the work between them, both or either one of them corresponds to the data center operating company 4110.

The service provider 4120 includes a server 121. The server 121 referred to here may be of any scale and includes, for example, memory in a personal PC. In some cases, the service provider 4120 does not include the server 121.

Note that the home gateway 102 is not essential in the information management system 4000 described above. For example, when the cloud server 111 performs all data management, the home gateway 102 is unnecessary. There may also be cases in which no device is unable to connect to the Internet by itself, such as when all the devices 101 in a home are connected to the Internet.

Next, the flow of information in the information management system 4000 will be described. First, the first device 101a or the second device 101b of the group 4100 transmits its log information to the cloud server 111 of the data center operating company 4110. The cloud server 111 accumulates the log information of the first device 101a or the second device 101b (arrow 131 in FIG. 1A). Here, log information is information indicating, for example, the operating status and the operation date and time of the plurality of devices 101. For example, the log information may include a television viewing history, recording reservation information of a recorder, the date and time a washing machine was run, the amount of laundry, the dates and times a refrigerator was opened and closed, and the number of times the refrigerator was opened and closed. The log information is not limited to these items and may include any information that can be acquired from the various devices 101. The log information may be provided directly from the plurality of devices 101 themselves to the cloud server 111 via the Internet. Alternatively, the log information may first be accumulated from the plurality of devices 101 in the home gateway 102 and then provided from the home gateway 102 to the cloud server 111.
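As a rough illustration of the two delivery paths described above (directly from a device, or accumulated in the home gateway first), the following Python sketch models log records and their accumulation. The class names, the field names, and the `flush` method are hypothetical; the disclosure does not specify a log format or a transfer protocol.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LogRecord:
    device_id: str   # hypothetical identifier, e.g. "101a"
    kind: str        # hypothetical event name, e.g. "tv_viewing"
    timestamp: str   # operation date and time as an ISO 8601 string


@dataclass
class CloudServer:
    records: List[LogRecord] = field(default_factory=list)

    def accumulate(self, batch: List[LogRecord]) -> None:
        """Accumulate log information received from a device or a gateway."""
        self.records.extend(batch)


@dataclass
class HomeGateway:
    cloud: CloudServer
    buffer: List[LogRecord] = field(default_factory=list)

    def collect(self, record: LogRecord) -> None:
        """Temporarily accumulate a record from a device in the group."""
        self.buffer.append(record)

    def flush(self) -> None:
        """Provide the buffered records to the cloud server, then clear the buffer."""
        self.cloud.accumulate(self.buffer)
        self.buffer = []
```

A device that can reach the Internet by itself would call `CloudServer.accumulate` directly, while other devices would go through `HomeGateway.collect` followed by `HomeGateway.flush`.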

Next, the cloud server 111 of the data center operating company 4110 provides the accumulated log information to the service provider 4120 in fixed units. Here, a fixed unit may be a unit in which the data center operating company 4110 can organize the accumulated information and provide it to the service provider 4120, or a unit requested by the service provider 4120. Although the log information is described as being provided in fixed units, it need not be, and the amount of information provided may vary depending on the situation. The log information is stored, as necessary, in the server 121 held by the service provider 4120 (arrow 132 in FIG. 1A).

The service provider 4120 then organizes the log information into information suited to the service to be provided to the user, and provides it to the user. The user to whom the information is provided may be the user 5100 who uses the plurality of devices 101, or may be an outside user 5200. As a method of providing information to the users 5100 and 5200, for example, the information may be provided directly from the service provider 4120 to the users 5100 and 5200 (arrows 133 and 134 in FIG. 1A). As a method of providing information to the user 5100, for example, the information may be provided to the user 5100 by passing again through the cloud server 111 of the data center operating company 4110 (arrows 135 and 136 in FIG. 1A). Alternatively, the cloud server 111 of the data center operating company 4110 may organize the log information into information suited to the service to be provided to the user and provide it to the service provider 4120. The user 5100 may be different from or the same as the user 5200.

[2-1. Configuration of the voice interaction agent system according to the embodiment]
Hereinafter, the configuration of the voice interaction agent system 1 according to the embodiment will be described. The voice interaction agent system 1 is a system that shortens the response time to a user's voice instruction. In the present embodiment, the voice interaction agent system 1 is a cloud-type voice interaction agent system that controls devices through multiple rounds of dialogue, and it shortens the response time by offloading simple recognition commands and the corresponding control commands to the local side.
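The offloading idea can be sketched in Python as a local lookup that is tried before any round trip to the cloud. This is only a sketch under stated assumptions: the table `LOCAL_COMMANDS`, the command strings, and the `ask_cloud` callback are hypothetical names introduced for illustration, and the disclosure does not state that matching is done by normalized exact match.

```python
from typing import Callable, Tuple

# Hypothetical local table: simple recognized phrase -> control command.
LOCAL_COMMANDS = {
    "turn on the tv": "TV_POWER_ON",
    "turn off the tv": "TV_POWER_OFF",
}


def handle_phrase(phrase: str, ask_cloud: Callable[[str], str]) -> Tuple[str, str]:
    """Return (source, command), trying the local table before the cloud."""
    command = LOCAL_COMMANDS.get(phrase.strip().lower())
    if command is not None:
        # Resolved locally: no network round trip is needed.
        return "local", command
    # Not offloaded: fall back to the cloud side.
    return "cloud", ask_cloud(phrase)
```

Phrases found in the local table avoid the network delay entirely, which is the source of the shortened response time in this sketch.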

First, regarding the configuration of the voice interaction agent system 1, the following will be described in order: the configuration of the voice interaction agent system, the hardware configuration of the voice input/output device, the hardware configuration of the devices, the hardware configuration of the local server, the hardware configuration of the cloud server, the functional blocks of the voice input/output device, the functional blocks of the devices, the functional blocks of the local server, and the functional blocks of the cloud server. Also regarding the configuration of the voice interaction agent system 1, a specific example of the local dictionary DB, a specific example of the dialogue rule DB, and a specific example of the offload instruction generation DB will be described. Then, regarding the operation of the voice interaction agent system 1, the sequence of processing by which the voice interaction agent system 1 shortens the response time will be described.

With reference to FIG. 2, the configuration of the voice interaction agent system 1 including the dialogue processing device according to the embodiment will be described. FIG. 2 is a schematic diagram showing the configuration of the voice interaction agent system 1 according to the embodiment. The voice interaction agent system 1 includes a voice input/output device 240, a plurality of devices 101, a local server 102, an information communication network 220, and a cloud server 111. The local server 102 is an example of a home gateway. The information communication network 220 is, for example, the Internet, and is an example of a communication network. In the present embodiment, the plurality of devices 101 consist of a television 243, an air conditioner 244, and a refrigerator 245. The devices constituting the plurality of devices 101 are not limited to the television 243, the air conditioner 244, and the refrigerator 245, and may be any devices. The voice input/output device 240, the plurality of devices 101, and the local server 102 are located in the group 4100. Here, the local server 102 may constitute the dialogue processing device, the cloud server 111 may constitute the dialogue processing device, or the local server 102 and the cloud server 111 may together constitute the dialogue processing device.

In the example shown in FIG. 2, a user 5100, who is a human, is present in the group 4100 in which the voice interaction agent system 1 is deployed. The user 5100 is assumed to be the speaker addressing the voice interaction agent system 1.

The voice input/output device 240 is an example of a sound collection unit that acquires voice in the group 4100, and is also an example of a voice output unit that outputs voice into the group 4100. The voice input/output device 240 may acquire voice via a microphone and may output voice via a speaker. The microphone and the speaker may be provided in the voice input/output device 240, in an apparatus in which the voice input/output device 240 is mounted, or in an apparatus separate from both the voice input/output device 240 and that apparatus. The group 4100 is a space in which the voice input/output device 240 can provide information to the user by voice. The voice input/output device 240 recognizes the voice of the user 5100 in the group 4100 and, in accordance with the instruction of the user 5100 given by the recognized voice input, presents voice information from the voice input/output device 240 and controls the devices 101. More specifically, the voice input/output device 240 displays content, answers questions from the user 5100, or controls the devices 101 in accordance with instructions given by the user 5100 through voice input.

Here, wired or wireless connections can be used between the voice input/output device 240, the plurality of devices 101, and the local server 102. Various kinds of wireless communication are applicable to the wireless connections. For example, a wireless LAN (Local Area Network) such as Wi-Fi (registered trademark) (Wireless Fidelity) may be applied, or short-range wireless communication such as Bluetooth (registered trademark) or ZigBee (registered trademark) may be applied.

At least some of the voice input/output device 240, the devices 101, and the local server 102 may be integrated. For example, the function of the local server 102 may be incorporated into the voice input/output device 240, so that the voice input/output device 240 functions as a local terminal that communicates with the cloud server 111 by itself. Alternatively, the voice input/output device 240 may be incorporated into each of the plurality of devices 101 or into one of the plurality of devices 101. In the latter case, the device 101 into which the voice input/output device 240 is incorporated may control the other devices 101. Alternatively, of the function of the voice input/output device 240 and the function of the local server 102, at least the function of the local server 102 may be incorporated into each of the plurality of devices 101 or into one of the plurality of devices 101. In the former case, each device 101 may function as a local terminal that communicates with the cloud server 111 by itself. In the latter case, the other devices 101 may communicate with the cloud server 111 via the one device 101 that serves as a local terminal incorporating the function of the local server 102.

Next, the voice input/output device 240, the devices 101, the local server 102, and the cloud server 111 will be described from the viewpoint of hardware configuration. FIG. 3 shows an example of the hardware configuration of the voice input/output device 240 according to the embodiment. As shown in FIG. 3, the voice input/output device 240 includes a processing circuit 300, a sound collection circuit 301, an audio output circuit 302, and a communication circuit 303. The processing circuit 300, the sound collection circuit 301, the audio output circuit 302, and the communication circuit 303 are connected to one another by a bus 330 and can exchange data and commands with one another. Here, the cloud server 111 is an example of a server.

The processing circuit 300 can be implemented by a combination of a CPU (Central Processing Unit) 310 and a memory 320 that stores a device ID 341 and a computer program 342. The CPU 310 controls the operation of the voice input/output device 240, and may also control the operation of each device 101 connected via the local server 102. In this case, the processing circuit 300 transmits control commands for each device 101 via the local server 102, but may instead transmit them directly to each device 101. The CPU 310 executes the group of instructions described in the computer program 342 loaded into the memory 320, whereby the CPU 310 can implement various functions. The computer program 342 describes a group of instructions for implementing the operation of the voice input/output device 240 described later. The computer program 342 may be stored in advance in the memory 320 of the voice input/output device 240 as a product. Alternatively, the computer program 342 may be recorded on a recording medium such as a CD-ROM and distributed on the market as a product, or may be transmitted through a telecommunication line such as the Internet, and the computer program 342 acquired through the recording medium or the telecommunication line may then be stored in the memory 320.

Alternatively, the processing circuit 300 may be implemented by dedicated hardware configured to perform the operations described below. The device ID 341 is an identifier uniquely assigned to the device 101. The device ID 341 may be assigned independently by the manufacturer of the device 101, or may be a physical address that is, in principle, uniquely assigned on a network (a so-called MAC (Media Access Control) address).

In FIG. 3, the device ID 341 is shown as being stored in the memory 320 in which the computer program 342 is stored. However, this is only one example of the configuration of the processing circuit 300. For example, the computer program 342 may be stored in a RAM (Random Access Memory) or a ROM (Read Only Memory), and the device ID 341 may be stored in a flash memory.

The sound collection circuit 301 collects the user's voice to generate an analog audio signal, converts the analog audio signal into digital data, and transmits the digital data to the bus 330.

The audio output circuit 302 converts digital data received through the bus 330 into an analog audio signal and outputs the analog audio signal.
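The conversion between analog audio signals and digital data performed by the two circuits above can be illustrated numerically. The following Python sketch assumes signed 16-bit linear PCM and samples normalized to the range [-1.0, 1.0]; the disclosure does not specify a sample format, so both the format and the function names `to_pcm16` and `from_pcm16` are assumptions made for illustration.

```python
from typing import List

INT16_MAX = 32767  # largest value of a signed 16-bit sample


def to_pcm16(samples: List[float]) -> List[int]:
    """Quantize samples in [-1.0, 1.0] to signed 16-bit integers, clipping overshoot."""
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    return [int(round(s * INT16_MAX)) for s in clipped]


def from_pcm16(codes: List[int]) -> List[float]:
    """Map signed 16-bit integers back to approximate sample values."""
    return [c / INT16_MAX for c in codes]
```

Quantization is lossy, so a round trip through `to_pcm16` and `from_pcm16` reproduces each sample only to within one quantization step.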

The communication circuit 303 is a circuit that communicates with other devices (for example, the local server 102) via wired or wireless communication. Although not limited to this, in the present embodiment the communication circuit 303 communicates with other devices via a network, for example via a wired LAN such as a network compliant with the Ethernet (registered trademark) standard. The communication circuit 303 transmits the log information and ID information generated by the processing circuit 300 to the local server 102. The communication circuit 303 also transmits signals received from the local server 102 to the processing circuit 300 through the bus 330.

In addition to the illustrated components, the voice input/output device 240 may include other components for implementing the functions required of the voice input/output device 240.

FIG. 4 shows an example of the hardware configuration of the device 101 according to the embodiment. The television 243, the air conditioner 244, and the refrigerator 245 shown in FIG. 2 are examples of the device 101. As shown in FIG. 4, the device 101 includes an input/output circuit 410, a communication circuit 450, and a processing circuit 470. The input/output circuit 410, the communication circuit 450, and the processing circuit 470 are connected to one another by a bus 460 and can exchange data and commands with one another.

The processing circuit 470 can be implemented by a combination of a CPU 430 and a memory 440 that stores a device ID 441 and a computer program 442. The CPU 430 controls the operation of the device 101. The CPU 430 can implement various functions by executing the group of instructions described in the computer program 442 loaded into the memory 440. The computer program 442 describes a group of instructions for implementing the operation of the device 101. The computer program 442 may be stored in advance in the memory 440 of the device 101 as a product. Alternatively, the computer program 442 may be recorded on a recording medium such as a CD-ROM and distributed on the market as a product, or may be transmitted through a telecommunication line such as the Internet, and the computer program 442 acquired through the recording medium or the telecommunication line may then be stored in the memory 440.

Alternatively, the processing circuit 470 may be implemented by dedicated hardware configured to perform the operations described below. The device ID 441 is an identifier uniquely assigned to the device 101. The device ID 441 may be assigned independently by the manufacturer of the device 101, or may be a physical address that is, in principle, uniquely assigned on a network (a so-called MAC address).

In FIG. 4, the device ID 441 is shown as being stored in the memory 440 in which the computer program 442 is stored. However, this is only one example of the configuration of the processing circuit 470. For example, the computer program 442 may be stored in a RAM or a ROM, and the device ID 441 may be stored in a flash memory.

The input/output circuit 410 outputs the results of processing by the processing circuit 470. The input/output circuit 410 also converts an input analog signal into digital data and transmits the digital data to the bus 330.

The communication circuit 450 is a circuit that communicates with other apparatuses (for example, the local server 102) via wired or wireless communication. Although not limited to this, in the present embodiment the communication circuit 450 communicates with other apparatuses via a network, for example via a wired LAN such as a network compliant with the Ethernet (registered trademark) standard. The communication circuit 450 transmits the log information and ID information generated by the processing circuit 470 to the local server 102. The communication circuit 450 also transmits signals received from the local server 102 to the processing circuit 470 through the bus 460.

In addition to the illustrated components, the device 101 may include other components for implementing the functions required of the device 101.

FIG. 5 shows an example of the hardware configuration of the local server 102. The local server 102 constitutes a gateway between the voice input/output device 240, the devices 101, and the information communication network 220. As shown in FIG. 5, the local server 102 includes, as components, a first communication circuit 551, a second communication circuit 552, a processing circuit 570, an acoustic model DB (database) 580, a language model DB 581, a speech segment DB 582, a prosody control DB 583, and a local dictionary DB 584. These components are connected to one another by a bus 560 and can exchange data and commands with one another.

The processing circuit 570 is connected to the acoustic model DB 580, the language model DB 581, the speech segment DB 582, the prosody control DB 583, and the local dictionary DB 584, and can acquire and edit the management information stored in these DBs. In the present embodiment, the acoustic model DB 580, the language model DB 581, the speech segment DB 582, the prosody control DB 583, and the local dictionary DB 584 are internal components of the local server 102, but they may be provided outside the local server 102. In that case, the means of connection between each DB and the components of the local server 102 may include, in addition to the bus 560, a communication line such as an Internet line or a wired or wireless LAN.

第一通信回路５５１は、有線通信又は無線通信を介して、他の装置（例えば音声入出力装置２４０及び機器１０１）と通信を行う回路である。限定されるものではないが、本実施の形態では、第一通信回路５５１は、ネットワークを介して他の装置と通信を行い、例えばイーサネット（登録商標）規格に準拠したネットワーク等の有線ＬＡＮを介して通信を行う。第一通信回路５５１は、処理回路５７０によって生成されたログ情報及びＩＤ情報を音声入出力装置２４０及び機器１０１に送信する。また、第一通信回路５５１は、音声入出力装置２４０及び機器１０１より受信した信号を、バス５６０を通じて処理回路５７０に送信する。 The first communication circuit 551 is a circuit that communicates with other devices (for example, the voice input/output device 240 and the device 101) via wired or wireless communication. Although not limited to this, in the present embodiment the first communication circuit 551 communicates with other devices via a network, for example via a wired LAN such as a network compliant with the Ethernet (registered trademark) standard. The first communication circuit 551 transmits the log information and ID information generated by the processing circuit 570 to the voice input/output device 240 and the device 101. The first communication circuit 551 also passes signals received from the voice input/output device 240 and the device 101 to the processing circuit 570 through the bus 560.

第二通信回路５５２は、有線通信又は無線通信を介して、クラウドサーバ１１１と通信を行う回路である。第二通信回路５５２は、有線通信又は無線通信を介して、通信網に接続し、さらに、通信網を介してクラウドサーバ１１１と通信する。本実施の形態では、通信網は、情報通信ネットワーク２２０である。第二通信回路５５２は、例えばイーサネット（登録商標）規格に準拠したネットワーク等の有線ＬＡＮを介して通信を行う。第二通信回路５５２は、クラウドサーバ１１１との間で、種々の情報を送受信する。 The second communication circuit 552 is a circuit that communicates with the cloud server 111 via wired communication or wireless communication. The second communication circuit 552 is connected to a communication network via wired communication or wireless communication, and further communicates with the cloud server 111 via the communication network. In the present embodiment, the communication network is the information communication network 220. The second communication circuit 552 performs communication via a wired LAN such as a network conforming to the Ethernet (registered trademark) standard, for example. The second communication circuit 552 transmits / receives various information to / from the cloud server 111.

処理回路５７０は、ＣＰＵ５３０と、一意に識別可能なゲートウェイＩＤ（以下、ＧＷ−ＩＤとも呼ぶ）５４１及びコンピュータプログラム５４２を格納したメモリ５４０との組み合わせによって実現され得る。ＣＰＵ５３０は、ローカルサーバ１０２の動作を制御するが、音声入出力装置２４０及び機器１０１の動作も制御してもよい。ゲートウェイＩＤ５４１は、ローカルサーバ１０２に一意に付与された識別子である。ゲートウェイＩＤ５４１は、ローカルサーバ１０２のメーカによって独自に付与されてもよいし、或いは、原則としてネットワーク上で一意に割り当てられる物理アドレス（いわゆるＭＡＣアドレス）であってもよい。ＣＰＵ５３０は、メモリ５４０に展開されたコンピュータプログラム５４２に記述された命令群を実行し、種々の機能を実現することができる。コンピュータプログラム５４２には、ローカルサーバ１０２の動作を実現するための命令群が記述されている。上述のコンピュータプログラム５４２は、製品としてのローカルサーバ１０２のメモリ５４０に予め格納されていてもよい。又は、コンピュータプログラム５４２は、ＣＤ−ＲＯＭ等の記録媒体に記録されて製品として市場に流通され、若しくは、インターネット等の電気通信回線を通じて伝送され、記録媒体又は電気通信回線を通じて取得されたコンピュータプログラム５４２がメモリ５４０に格納されてもよい。ここで、処理回路５７０又はＣＰＵ５３０は、プロセッサ又は第１プロセッサの一例である。 The processing circuit 570 can be realized by a combination of the CPU 530 and a memory 540 that stores a uniquely identifiable gateway ID (hereinafter also referred to as GW-ID) 541 and a computer program 542. The CPU 530 controls the operation of the local server 102, and may also control the operations of the voice input/output device 240 and the device 101. The gateway ID 541 is an identifier uniquely assigned to the local server 102. The gateway ID 541 may be assigned independently by the manufacturer of the local server 102, or may be a physical address (a so-called MAC address), which is in principle unique on the network. The CPU 530 executes the group of instructions described in the computer program 542 loaded into the memory 540 and can thereby realize various functions. The computer program 542 describes a group of instructions for realizing the operation of the local server 102. The computer program 542 may be stored in advance in the memory 540 of the local server 102 as a product. Alternatively, the computer program 542 may be recorded on a recording medium such as a CD-ROM and distributed on the market as a product, or transmitted through a telecommunication line such as the Internet, and the computer program 542 acquired through the recording medium or the telecommunication line may then be stored in the memory 540. Here, the processing circuit 570 or the CPU 530 is an example of a processor or a first processor.
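As an illustration of the second option for the gateway ID 541 (a MAC address), the following minimal Python sketch formats a 48-bit hardware address as a gateway ID string. The function names and the colon-separated output format are assumptions made for this example; the description does not specify how the ID is encoded.

```python
import uuid


def format_gateway_id(node: int) -> str:
    """Format a 48-bit hardware address as six colon-separated hex bytes."""
    return ":".join(f"{(node >> shift) & 0xFF:02x}" for shift in range(40, -8, -8))


def gateway_id_from_mac() -> str:
    """Hypothetical helper: derive a GW-ID from this machine's MAC address.

    uuid.getnode() returns the hardware address as a 48-bit integer; if no
    address can be found it falls back to a random value, so a real product
    would more likely use a manufacturer-assigned ID in that case.
    """
    return format_gateway_id(uuid.getnode())
```

For example, `format_gateway_id(0x001A2B3C4D5E)` yields `00:1a:2b:3c:4d:5e`.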

或いは、処理回路５７０は、以下に説明する動作を実現するように構成された専用のハードウェアによって実現されていてもよい。ローカルサーバ１０２は、図示される構成要素以外にも、ローカルサーバ１０２に要求される機能を実現するための他の構成要素も含み得る。 Alternatively, the processing circuit 570 may be realized by dedicated hardware configured to realize the operation described below. In addition to the illustrated components, the local server 102 may include other components for realizing the functions required for the local server 102.

なお、図５では、コンピュータプログラム５４２が格納されているメモリ５４０にゲートウェイＩＤ５４１が格納されているとした。しかしながらこれは、処理回路５７０の構成の一例である。例えば、コンピュータプログラム５４２がＲＡＭ又はＲＯＭに格納され、ゲートウェイＩＤ５４１がフラッシュメモリに格納されてもよい。 In FIG. 5, it is assumed that the gateway ID 541 is stored in the memory 540 in which the computer program 542 is stored. However, this is an example of the configuration of the processing circuit 570. For example, the computer program 542 may be stored in RAM or ROM, and the gateway ID 541 may be stored in flash memory.

音響モデルＤＢ５８０は、音声の波形などの周波数パターン及び音声に対応する文字列等を含む種々の音響モデルを登録している。言語モデルＤＢ５８１は、単語とその並び方等を含む種々の言語モデルを登録している。音声素片ＤＢ５８２は、音素等を単位とし且つ音声の特徴を表現した種々の音声素片を登録している。韻律制御ＤＢ５８３は、文字列の韻律を制御するための種々の情報を登録している。ローカル辞書ＤＢ５８４は、種々の文字列と、文字列それぞれに対応する意味タグとを対応付けて登録している。文字列は、単語、文節などのフレーズ等で構成される。意味タグは、文字列の意味を表す論理表現を指す。例えば、文字列の意味が類似する複数の文字列には、同一の意味タグが共通して設定される。例えば、意味タグは、タスク対象の名称、タスク対象へのタスク内容等を、キーワードとして示す。例えば、図１１を参照すると、文字列と、文字列に対応する意味タグとの組み合わせの例が示されている。ここで、意味タグは、意味情報の一例である。 The acoustic model DB 580 registers various acoustic models including frequency patterns such as speech waveforms and character strings corresponding to speech. The language model DB 581 registers various language models including words and how to arrange them. The speech unit DB 582 registers various speech units that express phonetic features in units of phonemes and the like. The prosody control DB 583 registers various information for controlling the prosody of the character string. The local dictionary DB 584 registers various character strings and meaning tags corresponding to the character strings in association with each other. The character string is composed of phrases such as words and phrases. A semantic tag refers to a logical expression representing the meaning of a character string. For example, the same meaning tag is commonly set to a plurality of character strings having similar character string meanings. For example, the semantic tag indicates the name of the task target, the task content for the task target, and the like as keywords. For example, referring to FIG. 11, an example of a combination of a character string and a semantic tag corresponding to the character string is shown. Here, the semantic tag is an example of semantic information.

図６は、クラウドサーバ１１１のハードウェア構成の一例を示す。図６に示されるように、クラウドサーバ１１１は、通信回路６５０と、処理回路６７０と、対話ルールＤＢ６９１と、オフロード命令生成ＤＢ６９２とを、構成要素として備えている。これらの構成要素は、バス６８０で相互に接続されており、互いの間でデータ及び命令を授受することが可能である。 FIG. 6 shows an exemplary hardware configuration of the cloud server 111. As illustrated in FIG. 6, the cloud server 111 includes a communication circuit 650, a processing circuit 670, an interaction rule DB 691, and an offload instruction generation DB 692 as components. These components are connected to each other by a bus 680 and can exchange data and commands with each other.

処理回路６７０は、ＣＰＵ６７１と、プログラム６７３を格納したメモリ６７２とを有している。ＣＰＵ６７１は、クラウドサーバ１１１の動作を制御する。ＣＰＵ６７１は、メモリ６７２に展開されたコンピュータプログラム６７３に記述された命令群を実行する。これにより、ＣＰＵ６７１は種々の機能を実現することができる。コンピュータプログラム６７３には、クラウドサーバ１１１が後述する動作を実現するための命令群が記述されている。上述のコンピュータプログラム６７３は、ＣＤ−ＲＯＭ等の記録媒体に記録されて製品として市場に流通され、又は、インターネット等の電気通信回線を通じて伝送され得る。図６に示すハードウェアを備えた装置（例えばＰＣ）は、当該コンピュータプログラム６７３を読み込むことにより、本実施形態によるクラウドサーバ１１１として機能し得る。ここで、処理回路６７０又はＣＰＵ６７１は、第２プロセッサの一例である。 The processing circuit 670 includes a CPU 671 and a memory 672 that stores a program 673. The CPU 671 controls the operation of the cloud server 111. The CPU 671 executes an instruction group described in the computer program 673 expanded in the memory 672. Thus, the CPU 671 can realize various functions. The computer program 673 describes a group of instructions for the cloud server 111 to realize an operation described later. The computer program 673 described above can be recorded on a recording medium such as a CD-ROM and distributed as a product to the market, or can be transmitted through an electric communication line such as the Internet. A device (for example, a PC) having hardware shown in FIG. 6 can function as the cloud server 111 according to the present embodiment by reading the computer program 673. Here, the processing circuit 670 or the CPU 671 is an example of a second processor.

処理回路６７０は、対話ルールＤＢ６９１と、オフロード命令生成ＤＢ６９２とに接続されており、これらＤＢに格納された管理情報の取得及び編集を行うことができる。なお、本実施形態では、対話ルールＤＢ６９１及びオフロード命令生成ＤＢ６９２は、クラウドサーバ１１１の内部の構成要素であるが、クラウドサーバ１１１の外部に設けられていてもよい。その場合には、各ＤＢ及びクラウドサーバ１１１の構成要素の間の接続手段には、バス６８０に加えて、インターネット回線、有線又は無線ＬＡＮ等の通信回線が含まれ得る。詳細は後述するが、対話ルールＤＢ６９１は、種々の意味タグと、各意味タグに関する条件とを対応付けて登録している。オフロード命令生成ＤＢ６９２は、音声対話エージェントシステム１の応答メッセージと、応答メッセージに対応するローカルサーバ１０２に対する命令データ（本実施の形態では、オフロード命令データ）とを対応付けて登録している。オフロード命令データは、ユーザと音声対話エージェントシステム１との間の複数通りにわたる対話に対応した命令データであり、ローカルサーバ１０２の負荷を軽減するように構成された命令データである。オフロード命令データは、ユーザと音声対話エージェントシステム１との間の複数回の対話にも対応し得る命令データでもある。ここで、対話ルールＤＢ６９１及びオフロード命令生成ＤＢ６９２は、データベースの一例である。 The processing circuit 670 is connected to the dialogue rule DB 691 and the offload instruction generation DB 692 and can acquire and edit the management information stored in these DBs. In the present embodiment, the dialogue rule DB 691 and the offload instruction generation DB 692 are components inside the cloud server 111, but they may be provided outside the cloud server 111. In that case, the connection means between each DB and the components of the cloud server 111 may include, in addition to the bus 680, a communication line such as an Internet line or a wired or wireless LAN. Although details will be described later, the dialogue rule DB 691 registers various semantic tags in association with conditions related to each semantic tag. The offload instruction generation DB 692 registers response messages of the voice interaction agent system 1 in association with the command data for the local server 102 that corresponds to each response message (in this embodiment, offload command data). The offload command data is command data that covers a plurality of possible dialogue paths between the user and the voice interaction agent system 1, and is configured to reduce the load on the local server 102. The offload command data can also cover a plurality of dialogue turns between the user and the voice interaction agent system 1. Here, the dialogue rule DB 691 and the offload instruction generation DB 692 are examples of databases.

通信回路６５０は、有線通信又は無線通信を介して、他の装置（例えばローカルサーバ１０２）と通信を行う回路である。通信回路６５０は、有線通信又は無線通信を介して、通信網に接続し、さらに、通信網を介して他の装置（例えば、ローカルサーバ１０２）と通信する。本実施の形態では、通信網は、情報通信ネットワーク２２０である。通信回路６５０は、例えばイーサネット（登録商標）規格に準拠したネットワーク等の有線ＬＡＮを介して通信を行う。 The communication circuit 650 is a circuit that communicates with another device (for example, the local server 102) via wired communication or wireless communication. The communication circuit 650 is connected to a communication network via wired communication or wireless communication, and further communicates with another device (for example, the local server 102) via the communication network. In the present embodiment, the communication network is the information communication network 220. The communication circuit 650 performs communication via a wired LAN such as a network conforming to the Ethernet (registered trademark) standard, for example.

次いで、音声入出力装置２４０、機器１０１、ローカルサーバ１０２及びクラウドサーバ１１１について、システム構成の観点から説明する。図７は、音声入出力装置２４０のシステム構成の一例を示すブロック図である。図７に示されるように、音声入出力装置２４０は、集音部７００と、音声検出部７１０と、音声区間切り出し部７２０と、通信部７３０と、音声出力部７４０とを備える。 Next, the voice input / output device 240, the device 101, the local server 102, and the cloud server 111 will be described from the viewpoint of the system configuration. FIG. 7 is a block diagram illustrating an example of a system configuration of the voice input / output device 240. As illustrated in FIG. 7, the voice input / output device 240 includes a sound collection unit 700, a voice detection unit 710, a voice segment cutout unit 720, a communication unit 730, and a voice output unit 740.

集音部７００は、図３の集音回路３０１に対応する。集音部７００は、ユーザの音声を収集してアナログ音声信号を生成し、生成したアナログ音声信号をデジタルデータに変換し、変換したデジタルデータから音声信号を生成する。 The sound collection unit 700 corresponds to the sound collection circuit 301 in FIG. The sound collection unit 700 collects a user's voice to generate an analog voice signal, converts the generated analog voice signal into digital data, and generates a voice signal from the converted digital data.

音声検出部７１０及び音声区間切り出し部７２０は、図３の処理回路３００により実現される。コンピュータプログラム３４２を実行したＣＰＵ３１０は、ある時点では、例えば音声検出部７１０として機能し、異なる他の一時点では音声区間切り出し部７２０として機能する。なお、これら２つの構成要素のうち、少なくとも１つが、ＤＳＰ（ＤｉｇｉｔａｌＳｉｇｎａｌＰｒｏｃｅｓｓｏｒ）などの専用の処理を行うハードウェアによって実現されてもよい。 The voice detection unit 710 and the voice segment cutout unit 720 are realized by the processing circuit 300 in FIG. The CPU 310 that has executed the computer program 342 functions as, for example, a voice detection unit 710 at a certain point in time, and functions as a voice segment cutout unit 720 at another different point in time. Note that at least one of these two components may be realized by hardware that performs dedicated processing such as a DSP (Digital Signal Processor).

音声検出部７１０は、音声を検出したかどうかを判定する。例えば、検出した音声のレベルが所定値以下の場合には、音声検出部７１０は音声を検出していないと判断する。音声区間切り出し部７２０は、取得した音声信号の中から音声が存在する区間を検出する。例えば、当該区間は、時間区間である。 The voice detection unit 710 determines whether voice has been detected. For example, when the detected voice level is equal to or lower than a predetermined value, the voice detection unit 710 determines that no voice has been detected. The voice segment cutout unit 720 detects, within the acquired voice signal, the segments in which voice is present. Such a segment is, for example, a time interval.
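The level-based detection and segment cutout described above can be sketched as follows. This is a minimal Python illustration, assuming that the signal has already been reduced to one level value per frame; the function names, the frame representation, and the threshold value are hypothetical and are not taken from the description.

```python
from typing import List, Tuple


def detect_voice(level: float, threshold: float) -> bool:
    """A level at or below the threshold counts as 'no voice detected'."""
    return level > threshold


def cut_out_segments(levels: List[float], threshold: float) -> List[Tuple[int, int]]:
    """Return (start, end) frame index pairs, end exclusive, where voice is present."""
    segments: List[Tuple[int, int]] = []
    start = None
    for i, level in enumerate(levels):
        if detect_voice(level, threshold):
            if start is None:
                start = i  # a voiced segment begins here
        elif start is not None:
            segments.append((start, i))  # the voiced segment ended at frame i
            start = None
    if start is not None:
        segments.append((start, len(levels)))  # signal ended while still voiced
    return segments
```

With `levels = [0.0, 0.5, 0.6, 0.1, 0.7]` and a threshold of `0.1`, the frame at level `0.1` is treated as silence, so two segments are returned: frames 1 to 2 and frame 4.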

通信部７３０は、図３の通信回路３０３に対応する。通信部７３０は、ネットワーク等の有線通信又は無線通信を介して、音声入出力装置２４０の他の装置（例えばローカルサーバ１０２）と通信を行う。通信部７３０は、例えばイーサネット（登録商標）規格に準拠したネットワーク等の有線ＬＡＮを介して通信を行う。通信部７３０は、音声区間切り出し部７２０が検出した音声区間の音声信号を、他の装置に送信する。また、通信部７３０は、他の装置から受信した音声信号を音声出力部７４０に受け渡す。 The communication unit 730 corresponds to the communication circuit 303 in FIG. 3. The communication unit 730 communicates with devices other than the voice input/output device 240 (for example, the local server 102) via wired communication such as a network or via wireless communication. The communication unit 730 performs communication via, for example, a wired LAN such as a network conforming to the Ethernet (registered trademark) standard. The communication unit 730 transmits the voice signal of the voice segment detected by the voice segment cutout unit 720 to another device. The communication unit 730 also passes voice signals received from other devices to the voice output unit 740.

音声出力部７４０は、図３の音声出力回路３０２に対応する。音声出力部７４０は、通信部７３０が受信した音声信号をアナログ音声信号に変換し、そのアナログ音声信号を出力する。 The audio output unit 740 corresponds to the audio output circuit 302 of FIG. The audio output unit 740 converts the audio signal received by the communication unit 730 into an analog audio signal and outputs the analog audio signal.

図８は、機器１０１のシステム構成の一例を示すブロック図である。図８に示されるように、機器１０１は、通信部８００と、機器制御部８１０とを備える。 FIG. 8 is a block diagram illustrating an example of a system configuration of the device 101. As illustrated in FIG. 8, the device 101 includes a communication unit 800 and a device control unit 810.

通信部８００は、図４の通信回路４５０に対応する。通信部８００は、ネットワーク等の有線通信又は無線通信を介して、機器１０１の他の装置（例えばローカルサーバ１０２）と通信を行う。通信部８００は、例えばイーサネット（登録商標）規格に準拠したネットワーク等の有線ＬＡＮを介して通信を行う。 The communication unit 800 corresponds to the communication circuit 450 in FIG. 4. The communication unit 800 communicates with devices other than the device 101 (for example, the local server 102) via wired communication such as a network or via wireless communication. The communication unit 800 performs communication via, for example, a wired LAN such as a network conforming to the Ethernet (registered trademark) standard.

機器制御部８１０は、図４の入出力回路４１０及び処理回路４７０に対応する。機器制御部８１０は、通信部８００が受信した制御データを読み込み、機器１０１の動作を制御する。また、機器制御部８１０は、機器１０１の動作の制御上での処理結果の出力を制御する。例えば、機器制御部８１０は、通信部８００が受信した制御データの処理回路４７０による読み込み及び処理、入出力回路４１０の入出力制御等を実施する。 The device control unit 810 corresponds to the input / output circuit 410 and the processing circuit 470 of FIG. The device control unit 810 reads the control data received by the communication unit 800 and controls the operation of the device 101. In addition, the device control unit 810 controls output of a processing result in controlling the operation of the device 101. For example, the device control unit 810 performs reading and processing of the control data received by the communication unit 800 by the processing circuit 470, input / output control of the input / output circuit 410, and the like.

図９は、ローカルサーバ１０２のシステム構成の一例を示すブロック図である。図９に示されるように、ローカルサーバ１０２は、通信部９００と、受信データ解析部９１０と、音声認識部９２０と、ローカル辞書照合部９３０と、制御部９４０と、音声合成部９５０と、送信データ生成部９６０と、コマンド記録部９７０とを備える。 FIG. 9 is a block diagram illustrating an example of a system configuration of the local server 102. As shown in FIG. 9, the local server 102 includes a communication unit 900, a received data analysis unit 910, a speech recognition unit 920, a local dictionary collation unit 930, a control unit 940, a speech synthesis unit 950, and a transmission. A data generation unit 960 and a command recording unit 970 are provided.

通信部９００は、図５の第一通信回路５５１及び第二通信回路５５２に対応する。通信部９００は、ネットワーク等の有線通信又は無線通信を介して、ローカルサーバ１０２の他の装置（例えば音声入出力装置２４０及び機器１０１）と通信を行う。通信部９００はまた、有線通信又は無線通信を介して、情報通信ネットワーク２２０等の通信網に接続し、さらに、通信網を介してクラウドサーバ１１１とも通信する。通信部９００は、例えばイーサネット（登録商標）規格に準拠したネットワーク等の有線ＬＡＮを介して通信を行う。通信部９００は、他の装置及びクラウドサーバ１１１等から受信したデータを受信データ解析部９１０に受け渡す。また、通信部９００は、送信データ生成部９６０が生成したデータを、他の装置及びクラウドサーバ１１１等に送信する。 The communication unit 900 corresponds to the first communication circuit 551 and the second communication circuit 552 in FIG. The communication unit 900 communicates with other devices (for example, the voice input / output device 240 and the device 101) via the wired communication such as a network or wireless communication. The communication unit 900 is also connected to a communication network such as the information communication network 220 via wired communication or wireless communication, and further communicates with the cloud server 111 via the communication network. The communication unit 900 performs communication via a wired LAN such as a network conforming to the Ethernet (registered trademark) standard, for example. The communication unit 900 passes data received from another device, the cloud server 111, and the like to the received data analysis unit 910. In addition, the communication unit 900 transmits the data generated by the transmission data generation unit 960 to other devices, the cloud server 111, and the like.

受信データ解析部９１０は、図５の処理回路５７０に対応する。受信データ解析部９１０は、通信部９００が受信したデータの種別を解析する。また、受信データ解析部９１０は、受信したデータの種別を解析した結果、ローカルサーバ１０２内部にて更なる処理を行うか、それとも他の装置にデータを送信すべきかを判断する。前者の場合、受信データ解析部９１０は、受信したデータを音声認識部９２０等に受け渡す。後者の場合、受信データ解析部９１０は、次に送信すべき装置と、当該装置に送信すべきデータとの組み合わせを決定する。 The reception data analysis unit 910 corresponds to the processing circuit 570 in FIG. The reception data analysis unit 910 analyzes the type of data received by the communication unit 900. Also, the received data analysis unit 910 determines whether further processing is to be performed within the local server 102 or data should be transmitted to another device as a result of analyzing the type of received data. In the former case, the reception data analysis unit 910 delivers the received data to the voice recognition unit 920 or the like. In the latter case, the reception data analysis unit 910 determines a combination of a device to be transmitted next and data to be transmitted to the device.
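The two-way decision made by the reception data analysis unit 910 (process locally, or forward to another device) can be sketched as a small routing table. This is a minimal Python illustration only: the data type names, the destination labels, and the `Decision` structure are all assumptions introduced for the example, since the description does not enumerate the data types.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical routing table: data type -> next destination.
# None means "process further inside the local server 102".
ROUTES: Dict[str, Optional[str]] = {
    "voice": None,                          # handed to voice recognition locally
    "response_voice": "voice_io_device_240",
    "control": "device_101",
    "semantic_tag": "cloud_server_111",
}


@dataclass
class Decision:
    process_locally: bool
    destination: Optional[str] = None
    payload: Optional[bytes] = None


def analyze(kind: str, payload: bytes) -> Decision:
    """Decide, from the data type alone, where the received data goes next."""
    if kind not in ROUTES:
        raise ValueError(f"unknown data type: {kind}")
    destination = ROUTES[kind]
    if destination is None:
        return Decision(process_locally=True, payload=payload)
    # Otherwise fix the (destination, data) pair for the transmission data generator.
    return Decision(process_locally=False, destination=destination, payload=payload)
```

In this sketch, the pair of `destination` and `payload` plays the role of the combination of next device and data that the transmission data generation unit 960 later consumes.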

音声認識部９２０は、図５の処理回路５７０と、音響モデルＤＢ５８０と、言語モデルＤＢ５８１とにより実現される。音声認識部９２０は、音声信号から、文字列データに変換する。具体的には、音声認識部９２０は、予め登録された音響モデルの情報を音響モデルＤＢ５８０より取得し、音響モデルと音声データの周波数特性とから、音声データを音素データに変換する。さらに、音声認識部９２０は、予め登録された言語モデルの情報を言語モデルＤＢ５８１より取得し、言語モデルと音素データの並び方とから、音素データを特定の文字列データに変換する。音声認識部９２０は、変換した文字列データをローカル辞書照合部９３０に引き渡す。 The speech recognition unit 920 is realized by the processing circuit 570, the acoustic model DB 580, and the language model DB 581 shown in FIG. The voice recognition unit 920 converts the voice signal into character string data. Specifically, the speech recognition unit 920 acquires information on the acoustic model registered in advance from the acoustic model DB 580, and converts the speech data into phoneme data from the acoustic model and the frequency characteristics of the speech data. Furthermore, the speech recognition unit 920 acquires information on a language model registered in advance from the language model DB 581 and converts the phoneme data into specific character string data based on how the language model and the phoneme data are arranged. The voice recognition unit 920 delivers the converted character string data to the local dictionary collation unit 930.

ローカル辞書照合部９３０は、図５の処理回路５７０と、ローカル辞書ＤＢ５８４とにより実現される。ローカル辞書照合部９３０は、文字列データから、意味タグに変換する。意味タグとは、具体的には、制御対象となる機器及びタスク内容等を指すキーワードである。ローカル辞書照合部９３０は、受信した文字列データと、ローカル辞書ＤＢ５８４とを照合することで、当該文字列データと一致した意味タグを抽出する。なお、ローカル辞書ＤＢ５８４には、単語等の文字列と、文字列に対応する意味タグとが、対応付けられて収納されている。受信した文字列に一致する文字列を、ローカル辞書ＤＢ５８４内で探索することによって、受信した文字列と一致する、つまり適合する意味タグが抽出される。 The local dictionary collation unit 930 is realized by the processing circuit 570 in FIG. 5 and the local dictionary DB 584. The local dictionary collation unit 930 converts character string data into semantic tags. Specifically, a semantic tag is a keyword indicating the device to be controlled, the task content, and the like. The local dictionary collation unit 930 collates the received character string data against the local dictionary DB 584 and extracts the semantic tag that matches the character string data. The local dictionary DB 584 stores character strings such as words in association with their corresponding semantic tags. By searching the local dictionary DB 584 for a character string that matches the received character string, the semantic tag that corresponds to the received character string is extracted.

制御部９４０は、図５の処理回路５７０に対応する。制御部９４０は、コマンド記録部９７０に記録されたオフロード命令データの有無の判定処理、及び、コマンド記録部９７０に記録されたオフロード命令データと意味タグとの照合処理を行う。 The control unit 940 corresponds to the processing circuit 570 in FIG. The control unit 940 performs a process for determining whether or not there is offload instruction data recorded in the command recording unit 970 and a process for comparing the offload instruction data recorded in the command recording unit 970 with the semantic tag.

音声合成部９５０は、図５の処理回路５７０と、音声素片ＤＢ５８２と、韻律制御ＤＢ５８３とにより実現される。音声合成部９５０は、文字列データから、音声信号に変換する。具体的には、音声合成部９５０は、予め登録された音声素片モデル及び韻律制御モデルの情報をそれぞれ、音声素片ＤＢ５８２及び韻律制御ＤＢ５８３より取得し、音声素片モデル、韻律制御モデル及び文字列データから、文字列データを特定の音声信号に変換する。 The speech synthesis unit 950 is realized by the processing circuit 570, the speech segment DB 582, and the prosody control DB 583 in FIG. 5. The speech synthesis unit 950 converts character string data into a voice signal. Specifically, the speech synthesis unit 950 acquires information on the pre-registered speech segment model and prosody control model from the speech segment DB 582 and the prosody control DB 583, respectively, and converts the character string data into a specific voice signal based on the speech segment model, the prosody control model, and the character string data.

送信データ生成部９６０は、図５の処理回路５７０に対応する。送信データ生成部９６０は、受信データ解析部９１０が決定した、次に送信すべき装置及び当該装置に送信すべきデータの組み合わせから、送信データを生成する。 The transmission data generation unit 960 corresponds to the processing circuit 570 in FIG. The transmission data generation unit 960 generates transmission data from the combination of the device to be transmitted next and the data to be transmitted to the device determined by the reception data analysis unit 910.

コマンド記録部９７０は、図５のメモリ５４０に対応する。コマンド記録部９７０は、クラウドサーバ１１１がローカルサーバ１０２に送信したオフロード命令データを記録する。 The command recording unit 970 corresponds to the memory 540 in FIG. The command recording unit 970 records offload command data transmitted from the cloud server 111 to the local server 102.

図１０は、クラウドサーバ１１１のシステム構成の一例を示すブロック図である。図１０に示されるように、クラウドサーバ１１１は、通信部１０００と、対話ルール照合部１０２０と、応答生成部１０３０とを備える。 FIG. 10 is a block diagram illustrating an example of a system configuration of the cloud server 111. As illustrated in FIG. 10, the cloud server 111 includes a communication unit 1000, a dialogue rule matching unit 1020, and a response generation unit 1030.

通信部１０００は、図６の通信回路６５０に対応する。通信部１０００は、ネットワーク等の有線通信又は無線通信を介して、情報通信ネットワーク２２０等の通信網に接続し、さらに、通信網を介して、他の装置（例えばローカルサーバ１０２）と通信を行う。通信部１０００は、例えばイーサネット（登録商標）規格に準拠したネットワーク等の有線ＬＡＮを介して通信を行う。 The communication unit 1000 corresponds to the communication circuit 650 in FIG. 6. The communication unit 1000 connects to a communication network such as the information communication network 220 via wired or wireless communication, and communicates with other devices (for example, the local server 102) via that communication network. The communication unit 1000 performs communication via, for example, a wired LAN such as a network conforming to the Ethernet (registered trademark) standard.

対話ルール照合部１０２０は、図６の処理回路６７０と、対話ルールＤＢ６９１とにより実現される。対話ルール照合部１０２０は、意味タグとその条件とを照合し、オフロード命令生成ＤＢ参照用の処理ＩＤに変換する。 The dialogue rule matching unit 1020 is realized by the processing circuit 670 of FIG. 6 and the dialogue rule DB 691. The dialogue rule collation unit 1020 collates the semantic tag and its condition, and converts it into a processing ID for referring to the offload instruction generation DB.

応答生成部１０３０は、図６の処理回路６７０と、オフロード命令生成ＤＢ６９２とにより実現される。応答生成部１０３０は、オフロード命令生成ＤＢ６９２を照合し、制御対象となる機器１０１を制御する制御信号を生成する。さらに、応答生成部１０３０は、照合結果に基づき、ユーザ５１００に提供すべきテキスト情報の文字列データを生成する。 The response generation unit 1030 is realized by the processing circuit 670 of FIG. 6 and the offload instruction generation DB 692. The response generation unit 1030 collates the offload instruction generation DB 692 and generates a control signal for controlling the device 101 to be controlled. Furthermore, the response generation unit 1030 generates character string data of text information to be provided to the user 5100 based on the collation result.

図１１は、ローカル辞書ＤＢ５８４の登録内容の具体例を示す図である。ローカル辞書ＤＢ５８４には、単語等の文字列と意味タグの情報とが、互いに関連付けられて保持されている。意味タグとは、ある文字列の意味を表す論理表現を指す。また、意味タグは、変数として値を保持することができる。図１１のテーブルに示されるように、例えば、ローカル辞書ＤＢ５８４は、文字列「予約」と、意味タグ＜ｒｅｓｅｒｖｅ＞とを対応付けて保持している。また、例えば、ローカル辞書ＤＢ５８４は、文字列「音量２０」と、式を表わす意味タグ「［＜ｖｏｌｕｍｅ＞］＝２０」とを対応付けて保持している。＜ｖｏｌｕｍｅ＞は、文字列「音量」の意味を表す意味タグだが、鍵括弧「［］」で囲われた意味タグは、値を保持できる変数としての役割を備えることができる。 FIG. 11 is a diagram showing a specific example of the contents registered in the local dictionary DB 584. In the local dictionary DB 584, character strings such as words and semantic tag information are stored in association with each other. A semantic tag is a logical expression representing the meaning of a character string. A semantic tag can also hold a value as a variable. As shown in the table of FIG. 11, for example, the local dictionary DB 584 holds the character string "reservation" (予約) in association with the semantic tag <reserve>. Likewise, the local dictionary DB 584 holds the character string "volume 20" (音量２０) in association with the semantic tag "[<volume>] = 20", which represents an expression. <volume> is the semantic tag representing the meaning of the character string "volume" (音量), and a semantic tag enclosed in square brackets "[]" can serve as a variable that holds a value.
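The two kinds of entries above (a plain tag such as <reserve>, and a bracketed tag that holds a value) can be sketched as follows. This is a minimal Python illustration, not the layout of FIG. 11: the `SemanticTag` class, the split into plain and variable entries, and the use of a regular expression to capture the number in "音量２０" are assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional, Pattern, Tuple


@dataclass(frozen=True)
class SemanticTag:
    name: str
    value: Optional[int] = None  # set only for bracketed, value-holding tags

    def __str__(self) -> str:
        if self.value is None:
            return f"<{self.name}>"
        return f"[<{self.name}>] = {self.value}"


# Hypothetical dictionary contents modelled on the two examples in the text.
PLAIN_ENTRIES: Dict[str, SemanticTag] = {"予約": SemanticTag("reserve")}
VARIABLE_ENTRIES: List[Tuple[Pattern[str], str]] = [
    (re.compile(r"音量(\d+)"), "volume"),
]


def collate(text: str) -> Optional[SemanticTag]:
    """Look up the semantic tag for a recognized character string."""
    tag = PLAIN_ENTRIES.get(text)
    if tag is not None:
        return tag
    for pattern, name in VARIABLE_ENTRIES:
        match = pattern.fullmatch(text)
        if match:
            return SemanticTag(name, int(match.group(1)))
    return None  # no entry in the local dictionary
```

Here `collate("予約")` produces the tag `<reserve>` and `collate("音量20")` produces `[<volume>] = 20`; a string with no entry returns `None`.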

図１２Ａ及び図１２Ｂをそれぞれ参照すると、対話ルールＤＢ６９１及びオフロード命令生成ＤＢ６９２の登録内容の具体例が示されている。例えば、図１２Ａは、対話ルールＤＢ６９１の具体例を示す。対話ルールＤＢ６９１には、意味タグと、意味タグ値の条件と、処理ＩＤの情報とが、互いに関連付けられて保持されている。例えば、文字列「番組予約」に対応する意味タグ＜ｒｅｓｅｒｖｅ＞＜ｐｒｏｇｒａｍ＞のように、文字列が複数の単語を含む場合、このような文字列に対応する意味タグは、複数の意味タグの組み合わせである場合がある。意味タグ値の条件とは、意味タグに設定される条件であり、意味タグに代入された値に関する条件である。処理ＩＤとは、対話ルールＤＢ６９１の各項目に一意に割り当てられた番号である。 Referring to FIGS. 12A and 12B respectively, specific examples of registration contents of the dialogue rule DB 691 and the offload instruction generation DB 692 are shown. For example, FIG. 12A shows a specific example of the dialogue rule DB 691. In the dialogue rule DB 691, semantic tags, semantic tag value conditions, and processing ID information are stored in association with each other. For example, when a character string includes a plurality of words, such as a meaning tag <reserve> <program> corresponding to a character string “program reservation”, the meaning tag corresponding to such a character string includes a plurality of meaning tags. May be a combination. The condition of the semantic tag value is a condition set in the semantic tag, and is a condition regarding the value assigned to the semantic tag. The process ID is a number uniquely assigned to each item in the dialogue rule DB 691.

図１２Ｂは、オフロード命令生成ＤＢ６９２の具体例を示す。オフロード命令生成ＤＢ６９２には、処理ＩＤと、システム応答メッセージと、オフロード命令データとが、互いに関連付けられて保持されている。システム応答メッセージは、ユーザと音声対話エージェントシステム１との対話において、ユーザに発する応答メッセージである。オフロード命令データは、システム応答メッセージに対するユーザの返答に対する処理の命令データであり、ユーザと音声対話エージェントシステム１との複数通りにわたる対話に対応した命令データである。１つのオフロード命令データは、ユーザの複数通りの返答に対応するように構成されている。このようなオフロード命令データは、クラウドサーバ１１１とローカルサーバ１０２との間の通信を低減し、ローカルサーバ１０２の負荷を軽減する。さらに、本例では、オフロード命令データは、簡単な認識命令とこれに対応する制御命令とを含むため、オフロード命令データに基づく処理が簡易である。ここで、システム応答メッセージは、第１応答メッセージの一例であり、オフロード命令データは、条件分岐命令の一例である。ユーザと音声対話エージェントシステム１との複数通りにわたる対話は、第１応答メッセージに対するユーザの返答の選択候補の一例である。 FIG. 12B shows a specific example of the offload instruction generation DB 692. In the offload instruction generation DB 692, a process ID, a system response message, and offload instruction data are stored in association with each other. The system response message is a response message issued to the user in the dialog between the user and the voice interaction agent system 1. The offload command data is command data for processing in response to a user response to the system response message, and is command data corresponding to a plurality of dialogues between the user and the voice interaction agent system 1. One offload command data is configured to correspond to a plurality of user responses. Such offload command data reduces communication between the cloud server 111 and the local server 102 and reduces the load on the local server 102. Furthermore, in this example, since the offload instruction data includes a simple recognition instruction and a control instruction corresponding to the simple recognition instruction, processing based on the offload instruction data is simple. Here, the system response message is an example of a first response message, and the offload instruction data is an example of a conditional branch instruction. A plurality of dialogues between the user and the voice interaction agent system 1 are an example of selection candidates for user responses to the first response message.
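One way to picture offload command data as a conditional branch is a table from the semantic tag of each expected reply to a control instruction, held on the local server so that a matching reply needs no round trip to the cloud. The sketch below is a minimal Python illustration under that assumption; the class names, the tags `<yes>` and `<no>`, and the instruction strings are hypothetical and do not reproduce FIG. 12B.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class OffloadCommand:
    # Semantic tag of an expected user reply -> control instruction to run.
    branches: Dict[str, str]


class CommandRecorder:
    """Stand-in for the command recording unit 970."""

    def __init__(self) -> None:
        self._command: Optional[OffloadCommand] = None

    def record(self, command: OffloadCommand) -> None:
        self._command = command

    def current(self) -> Optional[OffloadCommand]:
        return self._command


def resolve_reply(tag: str, recorder: CommandRecorder) -> Optional[str]:
    """Return the instruction for this reply, or None if the cloud must be asked."""
    command = recorder.current()
    if command is None:
        return None  # no offload command data has been recorded
    return command.branches.get(tag)  # None when no branch matches the tag


recorder = CommandRecorder()
recorder.record(OffloadCommand({"<yes>": "start_reservation", "<no>": "cancel"}))
```

With this data recorded, `resolve_reply("<yes>", recorder)` resolves locally to `start_reservation`, while an unexpected tag returns `None` and would be sent on to the cloud server 111.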

処理ＩＤは、オフロード命令生成ＤＢ６９２の各項目に一意に割り当てられた番号である。オフロード命令生成ＤＢ６９２の処理ＩＤは、対話ルールＤＢ６９１の処理ＩＤと対応している。処理ＩＤが同一の場合、オフロード命令生成ＤＢ６９２のシステム応答メッセージ及びオフロード命令データは、対話ルールＤＢ６９１の意味タグ及び意味タグ値の条件に対応する。例えば、オフロード命令生成ＤＢ６９２のシステム応答メッセージは、対話ルールＤＢ６９１の意味タグに対する応答メッセージである。また、オフロード命令生成ＤＢ６９２のオフロード命令データの内容は、対話ルールＤＢ６９１の意味タグ及び意味タグ値の条件に関連する。 The process ID is a number uniquely assigned to each item in the offload instruction generation DB 692. The process ID of the offload instruction generation DB 692 corresponds to the process ID of the dialogue rule DB 691. When the process IDs are the same, the system response message and offload command data in the offload command generation DB 692 correspond to the semantic tag and semantic tag value conditions of the dialogue rule DB 691. For example, the system response message of the offload instruction generation DB 692 is a response message to the semantic tag of the dialogue rule DB 691. Further, the content of the offload instruction data in the offload instruction generation DB 692 is related to the meaning tag and the meaning tag value condition of the dialogue rule DB 691.
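The link between the two databases through the process ID can be sketched as a two-step lookup: semantic tags plus a value condition select a process ID, and the process ID selects the response message and offload command data. The Python below is a minimal illustration; the rule contents, the messages, the value range, and all names are hypothetical, since the actual entries of FIG. 12A and FIG. 12B are not reproduced in the text.

```python
from typing import Callable, Dict, List, NamedTuple, Optional, Sequence, Tuple


class Rule(NamedTuple):
    tags: Tuple[str, ...]                       # semantic tag or tag combination
    condition: Callable[[Optional[int]], bool]  # condition on the tag value
    process_id: int


class Response(NamedTuple):
    message: str              # system response message
    offload: Dict[str, str]   # offload command data (reply tag -> instruction)


# Hypothetical contents of the dialogue rule DB 691.
DIALOGUE_RULES: List[Rule] = [
    Rule(("<reserve>", "<program>"), lambda v: True, 1),
    Rule(("<volume>",), lambda v: v is not None and 0 <= v <= 100, 2),
]

# Hypothetical contents of the offload instruction generation DB 692.
OFFLOAD_DB: Dict[int, Response] = {
    1: Response("Shall I reserve the program?", {"<yes>": "reserve", "<no>": "cancel"}),
    2: Response("Volume changed.", {}),
}


def to_process_id(tags: Sequence[str], value: Optional[int] = None) -> Optional[int]:
    """Dialogue rule matching: tags and value condition -> process ID."""
    for rule in DIALOGUE_RULES:
        if rule.tags == tuple(tags) and rule.condition(value):
            return rule.process_id
    return None


def generate_response(tags: Sequence[str], value: Optional[int] = None) -> Optional[Response]:
    """Response generation: process ID -> message and offload command data."""
    process_id = to_process_id(tags, value)
    return OFFLOAD_DB.get(process_id) if process_id is not None else None
```

Keeping the process ID as the only shared key means the rule table and the response table can be edited independently, which matches the description of the two DBs as separately registered.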

［２−２．実施の形態に係る音声対話エージェントシステムの動作］ [2-2. Operation of the Voice Interaction Agent System According to the Embodiment]
次いで、音声対話エージェントシステム１の動作に関して、音声対話エージェントシステム１の応答時間を短縮する処理の流れを説明する。図１３Ａ及び図１３Ｂは、音声対話エージェントシステム１の応答時間を短縮する処理の一連のシーケンスを示す。このシーケンスは、ユーザ５１００が音声により音声入出力装置２４０に何らかの指示を開始したときに開始される。 Next, regarding the operation of the voice interaction agent system 1, the flow of processing for shortening the response time of the voice interaction agent system 1 will be described. FIG. 13A and FIG. 13B show a sequence of processing that shortens the response time of the voice interaction agent system 1. This sequence starts when the user 5100 begins giving some instruction to the voice input/output device 240 by voice.

図１３Ａに示すように、ユーザ５１００が音声入出力装置２４０に、マイクロホンなどから音声により指示を入力すると、ステップＳ１５０１において、音声入出力装置２４０はユーザ５１００の音声データを取得する。音声入出力装置２４０の通信回路３０３は、取得した音声データをローカルサーバ１０２に送信する。ローカルサーバ１０２は当該音声データを受信する。なお、上記指示は、ユーザ５１００と音声対話エージェントシステム１とが行う複数回にわたる一連の対話において、最初に発せられる音声であるとして、以降の説明を行う。よって、図１３Ａは、音声が初めて発せられた場合の音声対話エージェントシステム１のシーケンスを示す。ここで、上記ユーザの音声は、第１音声の一例であり、上記音声データは、第１音声情報の一例である。 As shown in FIG. 13A, when the user 5100 inputs an instruction to the voice input / output device 240 by voice from a microphone or the like, the voice input / output device 240 acquires the voice data of the user 5100 in step S1501. The communication circuit 303 of the voice input / output device 240 transmits the acquired voice data to the local server 102. The local server 102 receives the audio data. The following description will be given on the assumption that the above instruction is the voice that is initially issued in a series of conversations performed by the user 5100 and the voice conversation agent system 1 over a plurality of times. Therefore, FIG. 13A shows a sequence of the voice interactive agent system 1 when a voice is first uttered. Here, the user's voice is an example of the first voice, and the voice data is an example of the first voice information.

Next, in step S1502, the local server 102 receives the voice data from the voice input/output device 240 and performs voice recognition processing on it. The voice recognition processing is processing in which the voice recognition unit 920 of the local server 102 recognizes the user's voice. Specifically, the local server 102 holds the acoustic model and language model information registered in the acoustic model DB 580 and the language model DB 581. When the user 5100 inputs voice to the voice input/output device 240, the CPU 530 of the local server 102 extracts frequency characteristics from the voice of the user 5100 and, from the acoustic model held in the acoustic model DB 580, extracts the phoneme data corresponding to the extracted frequency characteristics. Next, the CPU 530 converts the phoneme data into specific character string data by checking which character string data of the language model held in the language model DB 581 the sequence of extracted phoneme data most closely matches. As a result, the voice data is converted into character string data.

次いで、ステップＳ１５０３において、ローカルサーバ１０２は、文字列データのローカル辞書照合処理を行う。ローカル辞書照合処理とは、ローカルサーバ１０２が有するローカル辞書照合部９３０によって、文字列データを意味タグに変換する処理である。具体的には、ローカルサーバ１０２は、ローカル辞書ＤＢ５８４に登録された辞書の情報を保持している。ローカルサーバ１０２のＣＰＵ５３０は、ステップＳ１５０２において変換された文字列データとローカル辞書ＤＢ５８４とを照合し、当該文字列データと一致する意味タグ、つまり当該文字列データに対応する意味タグを出力する。ここで、出力される意味タグは、第１フレーズ情報の一例である。 Next, in step S1503, the local server 102 performs local dictionary collation processing of character string data. The local dictionary collation process is a process in which character string data is converted into a semantic tag by the local dictionary collation unit 930 included in the local server 102. Specifically, the local server 102 holds dictionary information registered in the local dictionary DB 584. The CPU 530 of the local server 102 collates the character string data converted in step S1502 with the local dictionary DB 584, and outputs a semantic tag that matches the character string data, that is, a semantic tag corresponding to the character string data. Here, the output semantic tag is an example of first phrase information.
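As a rough illustration of the local dictionary collation in step S1503, the sketch below maps recognized text to a semantic tag by exact lookup. The dictionary contents and the function name `to_semantic_tag` are assumptions made for this example; the actual layout of the local dictionary DB 584 is not given here, and returning `None` for unknown text is a simplification.

```python
from typing import Optional

# Hypothetical contents standing in for the local dictionary DB 584.
LOCAL_DICTIONARY = {
    "はい": "<yes>",
    "yes": "<yes>",
    "いいえ": "<no>",
    "no": "<no>",
}


def to_semantic_tag(text: str) -> Optional[str]:
    """Return the semantic tag for recognized text, or None if no entry exists."""
    # Normalizing whitespace and case is an assumption of this sketch.
    return LOCAL_DICTIONARY.get(text.strip().lower())
```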

Further, in step S1504, the local server 102 determines whether offload instruction data received from the cloud server 111 is recorded in the command recording unit 970. If it is not recorded (No in step S1504), the local server 102 transmits the semantic tag together with its own gateway ID to the cloud server 111, and the processing proceeds to step S1505. Note that steps S1501 to S1503 described above process the first voice uttered by the user 5100 in a series of multiple dialogues between the user 5100 and the voice interaction agent system 1, so offload instruction data has not yet been generated and is not recorded in the command recording unit 970. For this reason, the series of steps S1501 to S1508, which is performed before the system response message is output to the user 5100, continues to be selected for the subsequent processing.

However, partway through a series of dialogues between the user 5100 and the voice interaction agent system 1, offload instruction data may already have been generated and recorded in the command recording unit 970. When offload instruction data is recorded in this way (Yes in step S1504), the series of steps S1509 to S1514 shown in FIG. 13B, which is performed after the system response message is output to the user 5100, is selected for the subsequent processing. In this case, the processing of the local server 102 proceeds from step S1504 to step S1513.

In step S1505 following step S1504, the cloud server 111 performs dialogue rule matching processing on the received semantic tag. The dialogue rule matching processing is processing in which the dialogue rule matching unit 1020 of the cloud server 111 converts the semantic tag into a process ID. Specifically, the cloud server 111 holds the dialogue rule information stored in the dialogue rule DB 691. As shown in FIG. 12A, the dialogue rule information includes a semantic tag, a semantic tag value condition, and a process ID. The CPU 671 of the cloud server 111 collates the semantic tag obtained in step S1503 with the dialogue rule DB 691, checks the semantic tag value condition that is held in the dialogue rule DB 691 and corresponds to the semantic tag, and converts the semantic tag into the corresponding process ID. For example, when the semantic tag is <reserve><program> shown in FIG. 12A, it is converted into the process ID "S001".
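The dialogue rule matching of step S1505 can be sketched as a scan over rules, each holding semantic tags, a condition on the tag values, and a process ID. The rule list, the always-true condition, and the name `match_dialogue_rule` are hypothetical; the concrete conditions of FIG. 12A are not reproduced here.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Rule = Tuple[Tuple[str, ...], Callable[[Dict[str, str]], bool], str]

# Hypothetical stand-in for the dialogue rule DB 691:
# (semantic tags, condition on semantic tag values, process ID).
DIALOGUE_RULES: List[Rule] = [
    # The condition is a placeholder that always holds; the real conditions
    # are those listed in FIG. 12A.
    (("<reserve>", "<program>"), lambda values: True, "S001"),
]


def match_dialogue_rule(
    tags: Sequence[str], values: Optional[Dict[str, str]] = None
) -> Optional[str]:
    """Return the process ID of the first rule whose tags and condition match."""
    values = values or {}
    for rule_tags, condition, process_id in DIALOGUE_RULES:
        if tuple(tags) == rule_tags and condition(values):
            return process_id
    return None
```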

Next, in step S1506, the cloud server 111 performs offload instruction generation processing based on the process ID. The offload instruction generation processing is processing in which the response generation unit 1030 of the cloud server 111 generates a system response message and offload instruction data. Specifically, the cloud server 111 holds the system response messages and offload instruction data stored in the offload instruction generation DB 692. The CPU 671 of the cloud server 111 collates the process ID obtained in step S1505 with the offload instruction generation DB 692 and outputs the system response message and offload instruction data corresponding to that process ID. For example, in the example of FIG. 12B, when the process ID is "S001", the system response message "Can I make a reservation?" and the offload instruction data "if (<yes>) then cmd = 0x10010f0a and reply "Then I will make the reservation" else reply "The reservation is canceled"" are output. The cloud server 111 transmits the generated system response message and offload instruction data, together with the gateway ID, to the local server 102.
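To make the shape of the example offload instruction data concrete, the sketch below parses a string of the form shown for process ID "S001" into its condition, control command, and two reply messages. The grammar is inferred from that single example only, and the names `Branch` and `parse_offload_instruction` are hypothetical; the disclosure does not specify a wire format or a parser.

```python
import re
from typing import NamedTuple


class Branch(NamedTuple):
    condition: str    # semantic tag that must match, e.g. "<yes>"
    command: int      # control command sent when the condition matches
    reply: str        # message output when the condition matches
    else_reply: str   # message output otherwise


# Assumed grammar: if (<tag>) then cmd = 0x... and reply "..." else reply "..."
_PATTERN = re.compile(
    r'if\s*\(\s*(?P<cond><[^>]+>)\s*\)\s*'
    r'then\s+cmd\s*=\s*(?P<cmd>0x[0-9a-fA-F]+)\s+and\s+reply\s+"(?P<reply>[^"]*)"\s*'
    r'else\s+reply\s+"(?P<else_reply>[^"]*)"'
)


def parse_offload_instruction(text: str) -> Branch:
    """Parse one single-condition offload instruction string."""
    match = _PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized offload instruction: {text!r}")
    return Branch(
        condition=match["cond"],
        command=int(match["cmd"], 16),
        reply=match["reply"],
        else_reply=match["else_reply"],
    )
```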

ステップＳ１５０７において、ローカルサーバ１０２は、クラウドサーバ１１１から受信したシステム応答メッセージ、オフロード命令データ及びゲートウェイＩＤを対応付けて、コマンド記録部９７０に記録する。具体的は、ローカルサーバ１０２のＣＰＵ５３０は、コマンド記録部９７０に対応するメモリ５４０に、システム応答メッセージ、オフロード命令データ及びゲートウェイＩＤを対応付けて記憶させる。ここで、システム応答メッセージ及びゲートウェイＩＤは、第１応答情報の一例であり、オフロード命令データ及びゲートウェイＩＤは、第２応答情報の一例である。 In step S1507, the local server 102 associates the system response message received from the cloud server 111, the offload command data, and the gateway ID, and records them in the command recording unit 970. Specifically, the CPU 530 of the local server 102 stores the system response message, the offload command data, and the gateway ID in the memory 540 corresponding to the command recording unit 970 in association with each other. Here, the system response message and the gateway ID are examples of first response information, and the offload command data and the gateway ID are examples of second response information.

次のステップＳ１５０８において、ローカルサーバ１０２は、音声合成処理を行う。音声合成処理とは、ローカルサーバ１０２が有する音声合成部９５０が、システム応答メッセージを音声データに変換する処理である。具体的には、ローカルサーバ１０２は、音声素片ＤＢ５８２に登録された音声素片の情報と、韻律制御ＤＢに登録された韻律情報とを保持している。ローカルサーバ１０２のＣＰＵ５３０は、音声素片ＤＢ５８２に登録された音声素片の情報と、韻律制御ＤＢに登録された韻律情報とを読み込み、システム応答メッセージの文字列データから特定の音声データに変換する。そして、ローカルサーバ１０２は、ステップＳ１５０８にて変換した音声データを、音声入出力装置２４０に送信する。さらに、音声データを受信した音声入出力装置２４０は、システム応答メッセージを音声として、スピーカなどからユーザ５１００へ出力する。 In the next step S1508, the local server 102 performs speech synthesis processing. The voice synthesis process is a process in which the voice synthesis unit 950 included in the local server 102 converts the system response message into voice data. Specifically, the local server 102 holds information on speech units registered in the speech unit DB 582 and prosody information registered in the prosody control DB. The CPU 530 of the local server 102 reads the speech unit information registered in the speech unit DB 582 and the prosody information registered in the prosody control DB, and converts the character string data of the system response message into specific speech data. . Then, the local server 102 transmits the audio data converted in step S1508 to the audio input / output device 240. Furthermore, the voice input / output device 240 that has received the voice data outputs a system response message as voice to the user 5100 from a speaker or the like.

Referring to FIG. 13B, when the user 5100 subsequently inputs voice to the voice input/output device 240, the voice input/output device 240 acquires the voice data of the user 5100 in step S1509. The voice input/output device 240 transmits the acquired voice data to the local server 102, and the local server 102 receives the voice data. The voice data may or may not contain an appropriate answer to the system response message. This is determined in later processing. From this point on, the series of steps S1509 to S1514, which is performed after the system response message is output to the user 5100, is carried out. Note that if the voice input/output device 240 does not detect voice input for, for example, a predetermined time or longer, the series of processing including steps S1501 to S1508 may be terminated. Here, the user's voice is an example of the second voice, and the voice data is an example of second voice information.

次いで、ステップＳ１５１０において、ローカルサーバ１０２は、音声入出力装置２４０から音声データを受信し、さらに、ステップＳ１５０２の処理と同様に、当該音声データを文字列データに変換する音声認識処理を行う。さらに、ステップＳ１５１１において、ローカルサーバ１０２は、ステップＳ１５０３の処理と同様に、変換した文字列データを意味タグに変換するローカル辞書照合処理を行う。 Next, in step S1510, the local server 102 receives voice data from the voice input / output device 240, and further performs voice recognition processing for converting the voice data into character string data in the same manner as the processing in step S1502. Further, in step S1511, the local server 102 performs a local dictionary collation process for converting the converted character string data into a semantic tag, as in the process of step S1503.

次のステップＳ１５１２において、ローカルサーバ１０２は、コマンド記録部９７０に、クラウドサーバ１１１から受信したオフロード命令データが記録されているか否かを判定する。記録されている場合（ステップＳ１５１２でＹｅｓ）、ローカルサーバ１０２は、ステップＳ１５１３の処理に進む。記録されているオフロード命令データは、ステップＳ１５０７において記録されたデータである。本例では、ステップＳ１５０７において、オフロード命令データがコマンド記録部９７０に記録され、ステップＳ１５０８において、システム応答メッセージが音声として出力されているため、ステップＳ１５１３の処理が行われる。 In the next step S1512, the local server 102 determines whether or not the offload command data received from the cloud server 111 is recorded in the command recording unit 970. If it is recorded (Yes in step S1512), the local server 102 proceeds to the process of step S1513. The recorded offload instruction data is the data recorded in step S1507. In this example, since the offload instruction data is recorded in the command recording unit 970 in step S1507, and the system response message is output as voice in step S1508, the process in step S1513 is performed.

However, if no offload instruction data is recorded (No in step S1512), then, for example, the series of steps S1501 to S1508 performed before the system response message is output to the user 5100 is selected for the subsequent processing.

In the next step S1513, the local server 102 determines whether the semantic tag obtained in step S1511 matches the condition included in the offload instruction data recorded in the command recording unit 970. If they match (Yes in step S1513), the local server 102 transmits to the device 101 the control command that the offload instruction data specifies for the case where the condition is satisfied, and outputs the character string data of the message that the offload instruction data specifies for that case. Further, the local server 102 transmits a control command execution notification together with the gateway ID to the cloud server 111. On receiving it, the cloud server 111 acknowledges that the control of the device 101 involving multiple dialogues between the user 5100 and the voice interaction agent system 1 is complete. The local server 102 then proceeds to the speech synthesis processing in step S1514. Here, the semantic tag is an example of second phrase information.

一方、マッチしない場合（ステップＳ１５１３でＮｏ）、ローカルサーバ１０２は、オフロード命令データに記述されたメッセージのうちの当該条件に不適合時のメッセージの文字列データを出力する。さらに、ローカルサーバ１０２は、オフロード命令データに記述された制御コマンドのうちの当該条件に不適合時の制御コマンドを出力するが、図１２Ｂに示されるような本例では、制御コマンドを出力しない。なお、オフロード命令データには、当該条件への適合時と異なる不適合時の制御コマンドが記述され、この制御コマンドをローカルサーバ１０２が出力してもよい。そして、ローカルサーバ１０２は、制御コマンドの不実行通知とゲートウェイＩＤとを組み合わせて、クラウドサーバ１１１に送信する。さらに、ローカルサーバ１０２は、ステップＳ１５１４の音声合成処理に進む。 On the other hand, if there is no match (No in step S1513), the local server 102 outputs the character string data of the message when the condition is not met among the messages described in the offload command data. Further, the local server 102 outputs a control command at the time of nonconformity to the condition among the control commands described in the offload instruction data, but does not output the control command in this example as shown in FIG. 12B. In the offload instruction data, a control command at the time of non-conformity different from that at the time of conforming to the condition may be described, and the local server 102 may output this control command. Then, the local server 102 combines the non-execution notification of the control command and the gateway ID, and transmits them to the cloud server 111. Further, the local server 102 proceeds to the speech synthesis process in step S1514.

For example, suppose that the offload instruction data "if (<yes>) then cmd = 0x10010f0a and reply "Then I will make the reservation" else reply "The reservation is canceled"" shown in FIG. 12B is recorded in the command recording unit 970. Suppose further that the system response message "Can I make a reservation?" has been output from the voice input/output device 240 in step S1508, which precedes step S1513. When the user 5100 utters "yes", "no", or some other phrase, the local server 102 converts the voice data into the character string "yes", "no", or "phrase content" in the voice recognition processing of step S1510. Then, in the local dictionary collation processing of step S1511, the local server 102 converts the character string "yes", "no", or "phrase content" into the semantic tag <yes>, <no>, or <keyword of the phrase content>. In step S1512, the local server 102 then determines that offload instruction data is recorded in the command recording unit 970.

In step S1513, the local server 102 determines whether the semantic tag <yes>, <no>, or <keyword of the phrase content> matches the condition of the offload instruction data. In this example, the condition of the offload instruction data is "if (<yes>)", so when the semantic tag is <yes>, the semantic tag matches the condition. When the semantic tag is <no> or <keyword of the phrase content>, that is, anything other than <yes>, the semantic tag does not match the condition. If they match, the local server 102 transmits the control command "cmd = 0x10010f0a" to the device 101 and outputs the character string data of the message "Then I will make the reservation". If they do not match, the local server 102 outputs only the character string data of the message "The reservation is canceled". In this way, the offload instruction data is an example of a conditional branch instruction and includes a plurality of different instructions that branch according to one or more pieces of phrase information consisting of the semantic tags <yes> and <no>. The message "Then I will make the reservation" and the message "The reservation is canceled" are examples of the second response message.
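The match determination of step S1513 for this example can be sketched as follows. The type `OffloadInstruction` and the function `evaluate` are hypothetical names; the sketch assumes a single condition and, as in the example of FIG. 12B, no control command on a mismatch.

```python
from typing import NamedTuple, Optional, Tuple


class OffloadInstruction(NamedTuple):
    condition: str    # e.g. "<yes>"
    command: int      # control command for the device when the condition matches
    reply: str        # message when the condition matches
    else_reply: str   # message when it does not


def evaluate(
    instruction: OffloadInstruction, semantic_tag: str
) -> Tuple[Optional[int], str, bool]:
    """Return (control command or None, message text, whether the condition matched)."""
    if semantic_tag == instruction.condition:
        return instruction.command, instruction.reply, True
    # Any tag other than the condition, including <no>, takes the else branch.
    return None, instruction.else_reply, False
```

The third element of the result corresponds to whether an execution or a non-execution notification would be reported to the cloud server 111.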

ステップＳ１５１４において、ローカルサーバ１０２は、音声合成処理を行う。この音声合成処理では、ローカルサーバ１０２が、ステップＳ１５１３の処理に関連して出力したメッセージを、ステップＳ１５０８の処理と同様に、音声データに変換する。例えば、上述で例示したメッセージの文字列データ「それでは予約します」又は「予約を中止します」が、音声データに変換される。ローカルサーバ１０２は、ステップＳ１５１４にて変換した音声データを、音声入出力装置２４０に送信する。さらに、音声データを受信した音声入出力装置２４０は、メッセージを音声として、スピーカなどからユーザ５１００へ出力する。 In step S1514, the local server 102 performs speech synthesis processing. In this voice synthesis process, the local server 102 converts the message output in relation to the process in step S1513 into voice data, as in the process in step S1508. For example, the character string data “Now make a reservation” or “Cancel reservation” of the message exemplified above is converted into voice data. The local server 102 transmits the audio data converted in step S1514 to the audio input / output device 240. Furthermore, the voice input / output device 240 that has received the voice data outputs the message as voice to the user 5100 from a speaker or the like.

As described above, in the processing of steps S1501 to S1514, the processing for the multiple dialogues between the user 5100 and the voice interaction agent system 1 is performed by the local server 102, while the offload instruction data is generated by the cloud server 111. Communication between the local server 102 and the cloud server 111 therefore takes place when the offload instruction data is generated and need not take place every time the user 5100 and the voice interaction agent system 1 interact. As a result, the number of communications between the local server 102 and the cloud server 111 is kept low and the communication time is shortened. The voice interaction agent system 1 according to the embodiment thus shortens its response time by offloading simple recognition instructions and their corresponding control instructions to the local side, that is, by handing them over so as to reduce the burden. In addition, because the cloud server 111, which can be equipped with a large-capacity database, generates the offload instruction data, the load on the local server 102 is kept low.
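The reduction in communication can be illustrated with a toy simulation of a two-turn dialogue in which only the first turn waits on the cloud. The class `LocalServer`, the function `fake_cloud`, and the counters are hypothetical. The sketch assumes that the recorded instruction is cleared once it has been used and that the execution notification is a one-way message the reply does not wait for; the description above does not state either point in this form.

```python
from typing import Callable, List, Optional, Tuple

# (condition tag, control command, reply on match, reply on mismatch)
Instruction = Tuple[str, int, str, str]


class LocalServer:
    def __init__(self, cloud: Callable[[str], Tuple[str, Instruction]]) -> None:
        self.cloud = cloud
        self.recorded: Optional[Instruction] = None  # command recording unit 970
        self.cloud_round_trips = 0                   # requests that wait on the cloud
        self.notifications: List[str] = []           # one-way result notifications

    def handle(self, tag: str) -> Tuple[Optional[int], str]:
        """Process one semantic tag; return (control command or None, message)."""
        if self.recorded is None:                    # No at S1504 / S1512
            self.cloud_round_trips += 1
            message, self.recorded = self.cloud(tag) # S1505 to S1507
            return None, message                     # spoken in S1508
        condition, command, reply, else_reply = self.recorded  # S1513
        self.recorded = None  # assumption: the instruction is used once
        if tag == condition:
            self.notifications.append("executed")
            return command, reply
        self.notifications.append("not executed")
        return None, else_reply


def fake_cloud(tag: str) -> Tuple[str, Instruction]:
    """Stand-in for the cloud server 111, knowing only the FIG. 12B example."""
    if tag == "<reserve><program>":
        return (
            "Can I make a reservation?",
            ("<yes>", 0x10010F0A, "Then I will make the reservation",
             "The reservation is canceled"),
        )
    raise KeyError(tag)
```

Under these assumptions, a reservation request followed by a "yes" reply involves a single blocking exchange with the cloud rather than one per utterance.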

The above description of the operation of the voice interaction agent system 1 concerned the case where the first voice of the user 5100 is uttered in step S1501; however, the voice in step S1501 need not be identified as occurring at any particular point in the dialogue.

また、実施の形態では、１つの条件と当該条件に対応する命令とを含むオフロード命令データが例示されていたが、これに限定されない。オフロード命令データは、２つ以上の条件と、各条件に対応する命令とを含んでもよい。これにより、ユーザ５１００の多様な回答に対応した機器１０１の制御が可能である。 In the embodiment, the offload instruction data including one condition and an instruction corresponding to the condition has been exemplified, but the present invention is not limited to this. The offload instruction data may include two or more conditions and an instruction corresponding to each condition. Thereby, it is possible to control the device 101 corresponding to various answers of the user 5100.
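Offload instruction data with two or more conditions could be evaluated as an ordered list of branches with a fallback, as in the sketch below. The function `evaluate_branches`, the branch layout, and the fallback message are all hypothetical and serve only to illustrate the variation described above.

```python
from typing import List, Optional, Tuple

# (condition tag, control command or None, reply message)
Branch = Tuple[str, Optional[int], str]


def evaluate_branches(
    branches: List[Branch], default_reply: str, semantic_tag: str
) -> Tuple[Optional[int], str]:
    """Return the command and reply of the first branch whose condition matches."""
    for condition, command, reply in branches:
        if semantic_tag == condition:
            return command, reply
    # No condition matched: no control command, fallback message only.
    return None, default_reply


# Hypothetical two-condition variant of the reservation example.
RESERVATION_BRANCHES: List[Branch] = [
    ("<yes>", 0x10010F0A, "Then I will make the reservation"),
    ("<no>", None, "The reservation is canceled"),
]
```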

また、実施の形態では、ユーザ５１００と音声対話エージェントシステム１との間の対話に関する処理に対して、１つのオフロード命令データが用いられていたが、２つ以上のオフロード命令データが用いられてもよい。例えば、第１のオフロード命令データの条件が満たされない場合、第２のオフロード命令データが用いられるように処理が行われてもよい。 Further, in the embodiment, one offload command data is used for the process related to the dialogue between the user 5100 and the voice dialogue agent system 1, but two or more offload command data are used. May be. For example, when the condition of the first offload instruction data is not satisfied, the process may be performed so that the second offload instruction data is used.

また、実施の形態に係る音声対話エージェントシステム１では、ローカルサーバ１０２は、ステップＳ１５０４において、意味タグをクラウドサーバ１１１に送信していたが、これに限定されない。ローカルサーバ１０２は、意味タグの代わりに又は意味タグに加えて、文字列データを送信してもよい。このような文字列データは、第１フレーズ情報の一例である。この場合、クラウドサーバ１１１が、ローカル辞書ＤＢ５８４のような辞書ＤＢを備えていれば、クラウドサーバ１１１が、受信した文字列データを意味タグに変換してもよい。クラウドサーバ１１１は、ローカルサーバ１０２よりも容量が大きい辞書ＤＢを有することができるため、多種多様な文字列データから意味タグへの変換が可能である。 In the spoken dialogue agent system 1 according to the embodiment, the local server 102 transmits the semantic tag to the cloud server 111 in step S1504, but the present invention is not limited to this. The local server 102 may transmit the character string data instead of or in addition to the semantic tag. Such character string data is an example of first phrase information. In this case, if the cloud server 111 includes a dictionary DB such as the local dictionary DB 584, the cloud server 111 may convert the received character string data into a semantic tag. Since the cloud server 111 can have a dictionary DB having a larger capacity than the local server 102, it is possible to convert a wide variety of character string data into semantic tags.

Further, in the voice interaction agent system 1 according to the embodiment, the cloud server 111 transmits the system response message and the offload instruction data to the local server 102 together, but it may transmit them separately with a time difference. For example, the cloud server 111 may transmit the offload instruction data after transmitting the system response message, and the offload instruction data may be transmitted at any time in the period from the transmission of the system response message in step S1506 until before the processing of step S1512 is performed. For example, the cloud server 111 may estimate this period, or store a period estimated in advance, and transmit the offload instruction data within the estimated period. Note that the offload instruction data is desirably transmitted before the voice input/output device 240 outputs the system response message as voice; such a timing may be estimated, and the offload instruction data may be transmitted based on the estimated timing.

[3. Effects, etc.]
As described above, the cloud server 111, which is one aspect of the dialogue processing device according to the embodiment of the present disclosure, includes the communication unit 1000 as an acquisition unit that acquires a speech recognition result, and the dialogue rule matching unit 1020 and the response generation unit 1030 as a dialogue processing unit. The dialogue rule matching unit 1020 and the response generation unit 1030 determine response information based on the speech recognition result acquired by the communication unit 1000, and determine processing information corresponding to the response information. The processing information includes estimated response information about the user's reply estimated for the response information, and operation control information corresponding to the estimated response information. The communication unit 1000 transmits the determined response information and processing information. For example, the response information may be a system response message. For example, the estimated response information may be information about the condition included in the offload instruction data, and the operation control information may include the control command and message, corresponding to the condition, that are included in the offload instruction data.

上述の構成において、クラウドサーバ１１１は、音声認識結果に基づく応答情報と、応答情報に応じた処理情報とを決定し、送信する。そして、クラウドサーバ１１１から応答情報及び処理情報を受信するローカルサーバ１０２等の装置は、応答情報に基づき、ユーザに対して、ユーザの音声の認識結果に対する応答メッセージを出力することができ、処理情報の推定返答情報及び動作制御情報に基づき、応答メッセージに対するユーザの返答に対応した動作制御を行うことができる。推定返答情報は、応答情報に基づき推定された情報であり、動作制御情報は、推定返答情報に対応して設定された情報である。このような処理情報は、ユーザから返答を受け取った後に返答に応じて生成される情報ではなく、予め設定された情報である。これにより、当該装置は、ユーザから返答を受け取ったとき、返答に対応する情報を取得するために、クラウドサーバ１１１と通信する必要がない。よって、ユーザと複数回の対話を行う際、当該装置とクラウドサーバ１１１との通信回数が低減し、当該装置及びクラウドサーバ１１１の応答時間の短縮が可能になる。 In the above configuration, the cloud server 111 determines and transmits response information based on the voice recognition result and processing information corresponding to the response information. Then, a device such as the local server 102 that receives the response information and the processing information from the cloud server 111 can output a response message for the user's voice recognition result to the user based on the response information. Based on the estimated response information and the operation control information, the operation control corresponding to the user's response to the response message can be performed. The estimated response information is information estimated based on the response information, and the operation control information is information set corresponding to the estimated response information. Such processing information is not information generated in response to a response after receiving a response from the user, but information set in advance. Thus, when the device receives a response from the user, the device does not need to communicate with the cloud server 111 in order to acquire information corresponding to the response. Therefore, when a plurality of dialogues with the user are performed, the number of communication between the device and the cloud server 111 is reduced, and the response time of the device and the cloud server 111 can be shortened.

実施の形態に係る対話処理装置の一態様のクラウドサーバ１１１において、通信部１０００は、動作制御情報を用いて行われた動作制御の結果を示す動作制御結果情報を受信する。例えば、動作制御結果情報は、制御コマンドの実行通知であってもよい。上述の構成において、クラウドサーバ１１１は、動作制御結果情報を受信することによって、ユーザとローカルサーバ１０２等の装置との間の対話処理に関連する動作を終了することができる。よって、クラウドサーバ１１１の不要な待機及び動作の低減が可能になる。 In cloud server 111 of one aspect of the dialog processing device according to the embodiment, communication unit 1000 receives operation control result information indicating a result of operation control performed using the operation control information. For example, the operation control result information may be a control command execution notification. In the above-described configuration, the cloud server 111 can end the operation related to the interactive process between the user and the device such as the local server 102 by receiving the operation control result information. Therefore, unnecessary standby and operation of the cloud server 111 can be reduced.

実施の形態に係る対話処理装置の一態様のクラウドサーバ１１１において、通信部１０００は、応答情報及び処理情報を共に送信する。上述の構成において、クラウドサーバ１１１の通信頻度が低減し、クラウドサーバ１１１及びローカルサーバ１０２等の装置の応答時間の短縮が可能になる。具体的には、クラウドサーバ１１１は、応答情報に対するユーザの返答を確認せずに処理情報を送信するため、ユーザの返答を確認するための通信が不要である。さらに、ローカルサーバ１０２等の装置が、応答情報に対する返答をユーザから受け取っているが、処理情報を未だ受信しておらず、次の動作を行えないという処理の遅延を防ぐことができる。 In the cloud server 111 of one aspect of the interactive processing device according to the embodiment, the communication unit 1000 transmits both response information and processing information. In the above configuration, the communication frequency of the cloud server 111 is reduced, and the response time of devices such as the cloud server 111 and the local server 102 can be shortened. Specifically, since the cloud server 111 transmits the processing information without confirming the user's response to the response information, communication for confirming the user's response is unnecessary. Furthermore, although a device such as the local server 102 has received a response to the response information from the user, the processing information has not yet been received and processing delays such that the next operation cannot be performed can be prevented.

In the cloud server 111, which is one aspect of the dialogue processing device according to the embodiment, the speech recognition result is acquired via the communication unit 1000. In this configuration, the cloud server 111 acquires the speech recognition result from a remote device such as the local server 102. The cloud server 111 can therefore be configured without being affected by the environment in which the voice is captured.

The local server 102, which is another aspect of the dialogue processing device according to the embodiment, includes the speech recognition unit 920 as an acquisition unit that acquires speech recognition results, the communication unit 900, and the control unit 940 as an operation control unit. The communication unit 900 receives response information based on a first speech recognition result acquired by the speech recognition unit 920, together with processing information corresponding to the response information. The processing information includes estimated reply information about the user's reply as estimated from the response information, and operation control information corresponding to the estimated reply information. The control unit 940 performs an operation based on the operation control information according to the result of collating a second speech recognition result acquired by the speech recognition unit 920 with the estimated reply information.

In this configuration, the local server 102 can use the response information, which is derived from the first speech recognition result of the user's voice, to output to the user a response message to that first result. It can then use the estimated reply information and operation control information in the processing information to perform operation control corresponding to the user's reply to the response message. In other words, the local server 102 can perform an operation based on the operation control information that corresponds to the second speech recognition result of the user's voice. The processing information is set in advance; it is not generated in response to a reply after that reply has been received from the user. As a result, when the local server 102 acquires the second speech recognition result, it does not need to communicate in order to obtain information for handling that result. The response time of the local server 102 can therefore be shortened when it carries out multiple dialogue turns with the user.
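The saving in round trips described above can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in rather than the embodiment's actual interface: `FakeCloudServer`, `run_dialogue`, the tag strings, and the command names are assumed for illustration only. The point is that the second recognition result is collated locally, so the cloud server is contacted once per two-turn dialogue.

```python
class FakeCloudServer:
    """Hypothetical stand-in for cloud server 111; counts how often it is contacted."""

    def __init__(self):
        self.requests = 0

    def handle(self, first_result):
        self.requests += 1
        # Response information and processing information are returned together.
        return {
            "response": "Do you want to turn on the light?",
            "processing": {"<yes>": "light_on", "<no>": "no_op"},
        }


def run_dialogue(cloud, first_result, second_result):
    """Two-turn dialogue on the local server side (names are assumptions)."""
    reply = cloud.handle(first_result)   # the only network round trip
    spoken = reply["response"]           # response message played to the user
    # Collate the second recognition result locally; None means no match.
    command = reply["processing"].get(second_result)
    return spoken, command


cloud = FakeCloudServer()
spoken, command = run_dialogue(cloud, "<light>", "<yes>")
print(spoken, command, cloud.requests)
```

In this sketch an unmatched reply simply yields `None`; how the embodiment handles a reply that matches no estimated reply is not covered here.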

In the local server 102, which is another aspect of the dialogue processing device according to the embodiment, the communication unit 900 receives the response information and the processing information together. In this configuration, the local server 102 communicates less frequently, and its response time can be shortened. Specifically, because the local server 102 receives the processing information before acquiring the user's reply to the response information, it does not need to communicate in order to handle that reply. This also prevents the processing delay that would arise if the local server 102 had received the user's reply to the response information but had not yet received the processing information and therefore could not perform the next operation.

In the local server 102, which is another aspect of the dialogue processing device according to the embodiment, the communication unit 900 transmits the first speech recognition result. In this configuration, processing based on the first speech recognition result is performed by a device other than the local server 102, such as the cloud server 111. This reduces the processing load on the local server 102 and allows its processing speed to improve.

In the local server 102, which is another aspect of the dialogue processing device according to the embodiment, the second speech recognition result is a speech recognition result obtained after a response based on the response information received by the communication unit 900 has been played back. In this configuration, the second speech recognition result can include the user's reply to the response message based on the response information. The local server 102 can therefore perform processing that appropriately corresponds to the user's reply.

In the cloud server 111 and the local server 102 according to the embodiment, the processing information includes a plurality of pairs of estimated reply information and operation control information. This configuration enables control that handles a variety of user replies.

In the cloud server 111 and the local server 102 according to the embodiment, the operation control information includes second response information for the user's reply and operation instruction information containing an operation instruction for the control target. For example, the user's reply may include a message, the estimated reply information may include message information corresponding to the content of that message, and the operation control information may include second response information and operation instruction information corresponding to the message information. In this configuration, an operation based on the operation control information can control the control target while presenting the user with a response to the reply and obtaining confirmation. For example, the message information may be a semantic tag that corresponds to the semantic tag of the user's message and is included in a condition in the offload instruction data. The second response information corresponding to the message information may be the message associated with that condition in the offload instruction data, and the operation instruction information corresponding to the message information may be the control command associated with that condition.
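One way to picture the pairs of estimated reply information and operation control information is as a list of branches keyed by semantic tag. The sketch below is illustrative only: the `Branch` class, its field names, the tag strings, and the command names are assumptions and do not reproduce the actual layout of the offload instruction data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Branch:
    """One condition of hypothetical offload instruction data."""
    semantic_tag: str               # estimated reply information
    response_message: str           # second response information
    control_command: Optional[str]  # operation instruction information, if any


def select_branch(branches: List[Branch], recognized_tag: str) -> Optional[Branch]:
    """Return the branch whose semantic tag equals the recognized tag, else None."""
    for branch in branches:
        if branch.semantic_tag == recognized_tag:
            return branch
    return None


# Hypothetical processing information for the question "Turn on the heater?"
offload = [
    Branch("<yes>", "Turning on the heater.", "heater_on"),
    Branch("<no>", "Okay, leaving it off.", None),
]

chosen = select_branch(offload, "<yes>")
if chosen is not None:
    print(chosen.response_message, chosen.control_command)
```

A branch may carry a response message with no control command, which corresponds to a reply that only needs a spoken acknowledgement.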

The dialogue processing method of the cloud server 111 according to one aspect of the embodiment acquires a speech recognition result, determines response information based on the acquired speech recognition result, determines processing information corresponding to the response information, and transmits the determined response information and processing information. The processing information includes estimated reply information about the user's reply as estimated for the response information, and operation control information corresponding to the estimated reply information.
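The cloud-side steps above (look up a response, look up the matching processing information, send both at once) can be sketched as follows. The rule table `DIALOGUE_RULES`, the function `build_reply`, the JSON field names, and all tag and command strings are hypothetical; the embodiment's dialogue rule DB and message format are not specified here.

```python
import json
from typing import Optional

# Hypothetical dialogue rules: recognized tag -> response and processing information.
DIALOGUE_RULES = {
    "<heat>": {
        "response": "For how many minutes?",
        "processing": [
            {"estimated_reply": "<1min>", "message": "Heating for one minute.",
             "command": "heat_60s"},
            {"estimated_reply": "<2min>", "message": "Heating for two minutes.",
             "command": "heat_120s"},
        ],
    },
}


def build_reply(recognized_tag: str) -> Optional[str]:
    """Bundle response information and processing information into one message."""
    rule = DIALOGUE_RULES.get(recognized_tag)
    if rule is None:
        return None  # no rule for this recognition result
    return json.dumps({
        "response_information": rule["response"],
        "processing_information": rule["processing"],
    })


payload = build_reply("<heat>")
print(payload)
```

Sending a single serialized message is one simple way to realize "transmitted together"; two messages sent back to back before the user replies would serve the same purpose.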

The dialogue processing method of the local server 102 according to another aspect of the embodiment acquires and transmits a first speech recognition result, and receives response information based on the transmitted first speech recognition result together with processing information corresponding to the response information. Here, the processing information includes estimated reply information about the user's reply as estimated from the response information, and operation control information corresponding to the estimated reply information. The method further acquires a second speech recognition result and performs an operation based on the operation control information according to the result of collating the acquired second speech recognition result with the estimated reply information.

The dialogue processing methods described above provide the same effects as the dialogue processing device according to the embodiment. These methods may be implemented by a circuit such as an MPU (Micro Processing Unit), a CPU, a processor, or an LSI (Large Scale Integration), by an IC card (Integrated Circuit Card), by a standalone module, or the like.

The processing in the embodiment may also be implemented by a software program or by a digital signal composed of a software program. For example, the processing in the embodiment is implemented by the following programs or by digital signals composed of those programs.

A program that implements the processing of the cloud server 111 according to one aspect of the embodiment causes a computer to acquire a speech recognition result, determine response information based on the acquired speech recognition result, determine processing information corresponding to the response information, and transmit the determined response information and processing information. The processing information includes estimated reply information about the user's reply as estimated for the response information, and operation control information corresponding to the estimated reply information.

A program that implements the processing of the local server 102 according to another aspect of the embodiment causes a computer to acquire a speech recognition result and to receive response information based on the acquired first speech recognition result together with processing information corresponding to the response information. Here, the processing information includes estimated reply information about the user's reply as estimated from the response information, and operation control information corresponding to the estimated reply information. The program further causes the computer to perform control based on the operation control information according to the result of collating an acquired second speech recognition result with the estimated reply information.

[Others]
The dialogue processing device and related items according to the embodiment have been described above as examples of the technology disclosed in the present application, but the present disclosure is not limited to the embodiment. The technology of the present disclosure is also applicable to variations of the embodiment, or to other embodiments, in which changes, replacements, additions, omissions, and the like have been made as appropriate. The constituent elements described in the embodiment may also be combined to form a new embodiment or variation.

As described above, general or specific aspects of the present disclosure may be implemented as a system, a method, an integrated circuit, a computer program, or a recording medium such as a computer-readable CD-ROM. They may also be implemented as any combination of a system, a method, an integrated circuit, a computer program, and a recording medium.

For example, each processing unit included in the dialogue processing device according to the above embodiment is typically implemented as an LSI, which is an integrated circuit. These units may each be made into a separate chip, or some or all of them may be integrated into a single chip.

Circuit integration is not limited to LSI and may be achieved with a dedicated circuit or a general-purpose processor. An FPGA (Field Programmable Gate Array) that can be programmed after the LSI is manufactured, or a reconfigurable processor in which the connections and settings of circuit cells inside the LSI can be reconfigured, may also be used.

In the above embodiment, each constituent element may be configured with dedicated hardware or may be implemented by executing a software program suited to that element. Each constituent element may be implemented by a program execution unit, such as a CPU or a processor, reading and executing a software program recorded on a recording medium such as a hard disk or a semiconductor memory.

Furthermore, the technology of the present disclosure may be a program, or a non-transitory computer-readable recording medium on which the program is recorded. The program can, of course, be distributed via a transmission medium such as the Internet.

The ordinals, quantities, and other numbers used above are all examples given to describe the technology of the present disclosure concretely, and the present disclosure is not limited to them. Likewise, the connection relationships between constituent elements are examples given to describe the technology concretely, and the connection relationships that realize the functions of the present disclosure are not limited to them.

The division into functional blocks in the block diagrams is one example. A plurality of functional blocks may be implemented as a single functional block, a single functional block may be divided into several, and some functions may be moved to other functional blocks. The functions of a plurality of functional blocks having similar functions may also be processed by a single piece of hardware or software, in parallel or by time division.

The dialogue processing device and related items according to one aspect have been described above based on the embodiment, but the present disclosure is not limited to the embodiment. Provided they do not depart from the spirit of the present disclosure, forms obtained by applying to the embodiment various modifications that a person skilled in the art would conceive of, and forms constructed by combining constituent elements of different embodiments, may also fall within the scope of one aspect.

The present disclosure is applicable to anything that involves dialogue between a voice interaction agent system and a user. For example, it is effective when a user operates a home appliance or the like through the voice interaction agent system. Suppose that a user operating a voice-controlled microwave oven or oven gives the instruction "Heat this up". The voice interaction agent system can then ask the user for a specific instruction, such as "For how many minutes?" or "To what temperature?". The only user who can reply to this question (that is, the user from whom the agent system accepts an instruction in answer to its question) is the user who originally gave the instruction "Heat this up".

The present disclosure is also applicable to other operations in which the voice interaction agent system asks for specifics in response to an abstract instruction from the user. What the voice interaction agent system asks the user may also be, for example, confirmation that an operation should be executed.

In the above aspects, voice input from the user may be performed through a microphone provided in the system or in each home appliance. Questions from the voice interaction agent system may be conveyed to the user through a speaker or the like provided in the system or in each home appliance. In the present disclosure, the "predetermined operation" may be, for example, an operation of outputting voice to the user via a speaker. That is, the "device" to be controlled may be a voice input/output device (for example, a speaker). The "processor", "microphone", and/or "speaker" may, for example, be built into the "device" to be controlled. In the present disclosure, "phrase information" is information indicating a character string or its meaning. The "character string data" and "semantic tag" in the above aspects are examples of phrase information.

The technology described in the above aspects can be implemented, for example, in the following types of cloud service. However, the types of cloud service in which the technology can be implemented are not limited to these.

The following sections describe, in order, an overview of the service provided by an information management system using each of four service types: service type 1 (in-house data center cloud service), service type 2 (IaaS-based cloud service), service type 3 (PaaS-based cloud service), and service type 4 (SaaS-based cloud service).

[Service type 1: In-house data center cloud service]
FIG. 14 shows an overview of the service provided by an information management system in service type 1 (in-house data center cloud service), to which the voice interaction agent system according to the embodiment is applicable. As shown in FIG. 14, in this type the service provider 4120 acquires information from the group 4100 and provides a service to the user. The service provider 4120 also has the functions of a data center operating company. That is, the service provider 4120 owns the cloud server 111 that manages big data. Accordingly, there is no data center operating company.

In this type, the service provider 4120 operates and manages the data center (cloud server) 4203. The service provider 4120 also manages the operating system (OS) 4202 and the application 4201. The service provider 4120 provides the service using the OS 4202 and the application 4201 that it manages (arrow 204).

[Service type 2: IaaS-based cloud service]
FIG. 15 shows an overview of the service provided by an information management system in service type 2 (IaaS-based cloud service), to which the voice interaction agent system according to the embodiment is applicable. IaaS stands for Infrastructure as a Service, a cloud service provision model in which the infrastructure for building and running a computer system is itself provided as a service over the Internet.

As shown in FIG. 15, in this type the data center operating company 4110 operates and manages the data center (cloud server) 4203. The service provider 4120 manages the OS 4202 and the application 4201. The service provider 4120 provides the service using the OS 4202 and the application 4201 that it manages (arrow 204).

[Service type 3: PaaS-based cloud service]
FIG. 16 shows an overview of the service provided by an information management system in service type 3 (PaaS-based cloud service), to which the voice interaction agent system according to the embodiment is applicable. PaaS stands for Platform as a Service, a cloud service provision model in which the platform that serves as the foundation for building and running software is provided as a service over the Internet.

As shown in FIG. 16, in this type the data center operating company 4110 manages the OS 4202 and operates and manages the data center (cloud server) 4203. The service provider 4120 manages the application 4201. The service provider 4120 provides the service using the OS 4202 managed by the data center operating company 4110 and the application 4201 managed by the service provider 4120 (arrow 204).

[Service type 4: SaaS-based cloud service]
FIG. 17 shows an overview of the service provided by an information management system in service type 4 (SaaS-based cloud service), to which the voice interaction agent system according to the embodiment is applicable. SaaS stands for Software as a Service. A SaaS-based cloud service is, for example, a cloud service provision model in which users such as companies or individuals that do not own a data center (cloud server) can use, via a network such as the Internet, applications provided by a platform provider that does own a data center (cloud server).

As shown in FIG. 17, in this type the data center operating company 4110 manages the application 4201, manages the OS 4202, and operates and manages the data center (cloud server) 4203. The service provider 4120 provides the service using the OS 4202 and the application 4201 managed by the data center operating company 4110 (arrow 204).

In every one of the cloud service types above, the service provider 4120 provides the service. The service provider or the data center operating company may, for example, develop the OS, the application, the big data database, and so on itself, or may outsource that development to a third party.
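The division of responsibility across the four service types can be summarized as a small lookup table. The sketch below only restates who manages the application 4201, the OS 4202, and the data center 4203 in each type as described above; the names `MANAGED_BY` and `layers_managed_by` are hypothetical and introduced for illustration.

```python
SP = "service provider 4120"
DC = "data center operating company 4110"

# Who manages each layer in service types 1 to 4.
# In type 1 the service provider also acts as the data center operator.
MANAGED_BY = {
    1: {"application": SP, "os": SP, "data_center": SP},  # in-house data center
    2: {"application": SP, "os": SP, "data_center": DC},  # IaaS-based
    3: {"application": SP, "os": DC, "data_center": DC},  # PaaS-based
    4: {"application": DC, "os": DC, "data_center": DC},  # SaaS-based
}


def layers_managed_by(service_type, party):
    """Return the sorted list of layers that `party` manages in a service type."""
    layers = MANAGED_BY[service_type]
    return sorted(layer for layer, owner in layers.items() if owner == party)


for service_type in sorted(MANAGED_BY):
    print(service_type, layers_managed_by(service_type, SP))
```

Moving from type 1 to type 4, each step hands one more layer from the service provider to the data center operating company, while the service itself is always provided by the service provider.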

The technology of the present disclosure can be applied to a voice interaction agent.

101, 101a, 101b Device
102 Local server
111 Cloud server
240 Voice input/output device
300 Processing circuit of voice input/output device
301 Sound collection circuit of voice input/output device
302 Voice output circuit of voice input/output device
303 Communication circuit of voice input/output device
310 CPU of voice input/output device
320 Memory of voice input/output device
330 Bus of voice input/output device
341 Device ID of voice input/output device
342 Program of voice input/output device
410 Input/output circuit of device
430 CPU of device
440 Memory of device
441 Device ID of device
442 Program of device
450 Communication circuit of device
460 Bus of device
470 Processing circuit of device
530 CPU of local server
540 Memory of local server
541 Gateway ID of local server
542 Program of local server
551 First communication circuit of local server
552 Second communication circuit of local server
560 Bus of local server
570 Processing circuit of local server
580 Acoustic model DB of local server
581 Language model DB of local server
582 Speech segment DB of local server
583 Prosody control DB of local server
584 Local dictionary DB of local server
650 Communication circuit of cloud server
670 Processing circuit of cloud server
671 CPU of cloud server
672 Memory of cloud server
680 Bus of cloud server
691 Dialogue rule DB of cloud server
692 Offload instruction generation DB of cloud server
700 Sound collection unit of voice input/output device
710 Voice detection unit of voice input/output device
720 Voice section extraction unit of voice input/output device
730 Communication unit of voice input/output device
740 Voice output unit of voice input/output device
800 Communication unit of device
810 Device control unit of device
900 Communication unit of local server
910 Received data analysis unit of local server
920 Speech recognition unit of local server
930 Local dictionary collation unit of local server
940 Control unit of local server
950 Speech synthesis unit of local server
960 Transmission data generation unit of local server
970 Command recording unit of local server
1000 Communication unit of cloud server
1020 Dialogue rule collation unit of cloud server
1030 Response generation unit of cloud server

Claims

An information processing method executed by a processor that controls at least one device through dialogue with a user,
Obtaining first voice information indicating the first voice of the user input from a microphone;
Outputting the first phrase information generated from the first voice information to a server via a network;
Acquiring, from the server via the network, first response information corresponding to the first phrase information, the first response information indicating a first response message for the first voice;
Causing a speaker to output the first response message based on the first response information;
Acquiring, from the server via the network, second response information associated with the first response information on the server, the second response information including a conditional branch instruction for branching to a plurality of instructions that differ according to one or more pieces of phrase information, each of the one or more pieces of phrase information relating to a selection candidate for the user's response to the first response message;
Acquiring, after the first response message is output, second voice information indicating a second voice of the user input from the microphone, the second voice including the user's response to the first response message;
Causing at least one of the speaker and the at least one device to perform a predetermined operation in accordance with the instruction of the conditional branch instruction that is determined by collating the one or more pieces of phrase information with second phrase information generated from the second voice information,
Wherein neither the second voice information nor the second phrase information is output to the server from when the second voice information is acquired until the predetermined operation is performed.
Information processing method.
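
As an illustrative sketch only, and not part of the claims, the local handling of the conditional branch instruction might look like the following Python. The names Action, BranchInstruction, and handle_second_voice, the exact-match collation with lowercase normalization, and the air conditioner example are all assumptions made for illustration; the claims do not prescribe a data format or a matching rule.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Action:
    """One instruction of the conditional branch (hypothetical format)."""
    target: str   # "speaker" or a device identifier
    payload: str  # message to speak or control command to send


@dataclass
class BranchInstruction:
    """Maps candidate phrase information to an instruction (hypothetical)."""
    branches: Dict[str, Action]


def handle_second_voice(
    second_phrase: str,
    branch: BranchInstruction,
    execute: Callable[[Action], None],
) -> Optional[Action]:
    """Collate the second phrase against the candidates and act locally.

    Nothing is sent to the server here: the decision uses only the
    branch instruction that was received earlier.
    """
    key = second_phrase.strip().lower()  # assumed normalization
    action = branch.branches.get(key)
    if action is not None:
        execute(action)
    return action


# Hypothetical example: the first response message asked
# "Shall I turn on the air conditioner?"
executed: List[Action] = []
branch = BranchInstruction(
    branches={
        "yes": Action(target="air_conditioner", payload="power_on"),
        "no": Action(target="speaker", payload="OK, leaving it off."),
    }
)
result = handle_second_voice(" Yes ", branch, executed.append)
```

In this sketch an utterance that matches no candidate returns None and executes nothing; how such a case is actually handled is outside what the claim states.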

The first response information and the second response information are acquired from the server at the same time.
The information processing method according to claim 1.

After the first phrase information is output to the server, the second response information is selected from a plurality of pieces of second response information stored on the server, according to at least one of the first phrase information and the first response information,
The information processing method according to claim 1 or 2.

Furthermore, after executing the predetermined operation, an execution notification indicating that the predetermined operation has been executed is output to the server.
The information processing method according to any one of claims 1 to 3.

At least one of the plurality of instructions includes an instruction to output, from the speaker, a second response message for the second voice.
The information processing method according to any one of claims 1 to 4.

At least one of the plurality of instructions includes an instruction that causes a control command to be transmitted to the at least one device.
The information processing method according to any one of claims 1 to 5.

The network is the Internet;
The information processing method is executed on a local server capable of communicating with the at least one device without going through the Internet.
The information processing method according to any one of claims 1 to 6.

A program causing the processor to execute the information processing method according to any one of claims 1 to 7.

An information processing method executed by a second processor on a server, the second processor being capable of communicating via a network with a first processor that controls at least one device through dialogue with a user,
Acquiring, from the first processor via the network, first phrase information related to a first voice of the user input from a microphone, the first phrase information indicating at least one of a character string corresponding to the first voice and semantic information of the character string;
Outputting, to the first processor via the network, first response information corresponding to the first phrase information, the first response information indicating a first response message for the first voice to be output from a speaker;
Outputting, to the first processor via the network, second response information associated with the first response information on the server, the second response information including a conditional branch instruction for branching to a plurality of instructions that differ according to one or more pieces of phrase information, each of the one or more pieces of phrase information relating to a selection candidate for the user's response to the first response message.
Information processing method.

The first processor causes at least one of the speaker and the at least one device to perform a predetermined operation in accordance with the instruction of the conditional branch instruction that is determined by collating the one or more pieces of phrase information with second phrase information generated from second voice information,
The information processing method according to claim 9.

The second processor does not output the second voice information or the second phrase information to the first processor from when the first processor acquires the user's response until the predetermined operation is executed,
The information processing method according to claim 10.

Further, an execution notification indicating that the predetermined operation has been executed is acquired from the first processor.
The information processing method according to claim 10 or 11.

The first response information and the second response information are simultaneously output to the first processor.
The information processing method according to any one of claims 9 to 12.

At least one of the plurality of instructions includes an instruction to output, from the speaker, a second response message for the second voice.
The information processing method according to any one of claims 9 to 13.

At least one of the plurality of instructions includes an instruction that causes the at least one device to output a control command.
The information processing method according to any one of claims 9 to 14.

The server has a database containing a plurality of pieces of semantic information, a plurality of pieces of first response information respectively associated with the plurality of pieces of semantic information, and a plurality of pieces of second response information respectively associated with the plurality of pieces of first response information,
After the first phrase information is acquired, by referring to the database,
Identifying semantic information corresponding to the first voice from the plurality of semantic information;
Identifying the first response information associated with the identified semantic information from the plurality of first response information;
Identifying the second response information associated with the identified first response information from the plurality of second response information;
The information processing method according to any one of claims 9 to 15.
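
The three identification steps above can be sketched as a chain of lookups. The dictionaries below are a hypothetical in-memory stand-in for the server database, and the keys, identifiers, and the look_up function are assumptions for illustration only.

```python
from typing import Dict, Optional, Tuple

# Hypothetical stand-in for the server database.
SEMANTICS: Dict[str, str] = {  # character string -> semantic information
    "it is hot": "request_cooling",
}
FIRST_RESPONSES: Dict[str, str] = {  # semantic information -> first response id
    "request_cooling": "ask_air_conditioner_on",
}
SECOND_RESPONSES: Dict[str, Dict[str, str]] = {  # first response id -> branches
    "ask_air_conditioner_on": {"yes": "power_on", "no": "do_nothing"},
}


def look_up(first_phrase: str) -> Optional[Tuple[str, str, Dict[str, str]]]:
    """Resolve semantic information, then first response, then second response."""
    semantic = SEMANTICS.get(first_phrase.strip().lower())
    if semantic is None:
        return None
    first = FIRST_RESPONSES.get(semantic)
    if first is None:
        return None
    # An empty branch table means no follow-up is associated (assumed behavior).
    second = SECOND_RESPONSES.get(first, {})
    return semantic, first, second


found = look_up("It is hot")
```

Because each table is keyed by the result of the previous lookup, the second response information follows from the first response information alone, matching the association stated in the claim.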

A program causing the second processor to execute the information processing method according to any one of claims 9 to 16.