JP2021173880A

JP2021173880A - Information processing unit, program and information processing method

Info

Publication number: JP2021173880A
Application number: JP2020078049A
Authority: JP
Inventors: 啓介小西; Keisuke Konishi; 尚史福江; Naofumi Fukue
Original assignee: TIS Inc
Current assignee: TIS Inc
Priority date: 2020-04-27
Filing date: 2020-04-27
Publication date: 2021-11-01

Abstract

To respond to user's voice even when communication with a device which recognizes voice is disabled. SOLUTION: In an interaction system 1, an interaction device 100 is an information processing device which connects with a voice recognition system recognizing voice through a network, and comprises: a voice acquisition part which acquires user's voice; a determination part which determines whether communication with the voice recognition system is available; a transmission part which transmits voice data on the acquired voice to the voice recognition system if the communication with the voice recognition system is available; a reception part which receives first recognition information representing a recognition result of the voice data from the voice recognition system; a voice recognition part which recognizes the acquired voice and generates second recognition information representing a recognition result if the communication with the voice recognition system is not available; a response generation part which generates first response information for responding to voice based upon the first recognition information or second recognition information; and an output part which outputs a response to the voice based upon the first response information. SELECTED DRAWING: Figure 3

Description

本発明は、情報処理装置、プログラム、および情報処理方法に関する。 The present invention relates to an information processing device, a program, and an information processing method.

従来、ユーザの音声を取得し、取得した音声に応答して様々な操作をする装置、いわゆるスマートスピーカーの技術が知られている。 Conventionally, there is known a technique of a so-called smart speaker, which is a device that acquires a user's voice and performs various operations in response to the acquired voice.

下記特許文献１に開示されているスマートスピーカーでは、ユーザの音声を示す音声情報を入力して、ネットワークを介して接続される音声出力装置にこの音声情報を送信する。音声出力装置は受信した音声情報に基づいて音声を認識し、認識結果に基づいてユーザの音声に対して発話するための発話データを生成する。音声出力装置がこの発話データをスマートスピーカーに送信して、スマートスピーカーは発話データに基づいて音声を出力する。 In the smart speaker disclosed in Patent Document 1 below, voice information indicating a user's voice is input, and this voice information is transmitted to a voice output device connected via a network. The voice output device recognizes the voice based on the received voice information, and generates utterance data for speaking to the user's voice based on the recognition result. The voice output device transmits this utterance data to the smart speaker, and the smart speaker outputs voice based on the utterance data.

特開２０２０−２１０４０号公報 Japanese Unexamined Patent Publication No. 2020-21040

特許文献１のスマートスピーカーでは、音声出力装置の通信が不可能となった場合、ユーザの音声に応答できなくなるという問題がある。 The smart speaker of Patent Document 1 has a problem that it cannot respond to the user's voice when the communication of the voice output device becomes impossible.

そこで、本発明は、音声を認識する装置との通信が不可能となった場合でもユーザの音声に応答することができる情報処理装置、プログラム、および情報処理方法を提供することを目的とする。 Therefore, an object of the present invention is to provide an information processing device, a program, and an information processing method capable of responding to a user's voice even when communication with a device that recognizes voice becomes impossible.

本発明の一態様に係る情報処理装置は、音声を認識する音声認識システムとネットワークを介して接続する情報処理装置であって、ユーザの音声を取得する音声取得部と、音声認識システムとの通信が可能か否か判定する判定部と、音声認識システムとの通信が可能な場合、取得された音声の音声データを音声認識システムに送信する送信部と、音声認識システムから、音声データの認識結果を示す第１認識情報を受信する受信部と、音声認識システムとの通信が不可能な場合、取得された音声を認識し、認識結果を示す第２認識情報を生成する音声認識部と、第１認識情報または第２認識情報に基づき、音声に対して応答するための第１応答情報を生成する応答生成部と、第１応答情報に基づき、音声に対する応答を出力する出力部と、を備える。 The information processing device according to one aspect of the present invention is an information processing device that connects, via a network, to a voice recognition system that recognizes voice, and includes: a voice acquisition unit that acquires a user's voice; a determination unit that determines whether or not communication with the voice recognition system is possible; a transmission unit that transmits voice data of the acquired voice to the voice recognition system when communication with the voice recognition system is possible; a receiving unit that receives, from the voice recognition system, first recognition information indicating the recognition result of the voice data; a voice recognition unit that recognizes the acquired voice and generates second recognition information indicating the recognition result when communication with the voice recognition system is impossible; a response generation unit that generates, based on the first recognition information or the second recognition information, first response information for responding to the voice; and an output unit that outputs a response to the voice based on the first response information.

本発明の一態様に係るプログラムは、音声を認識する音声認識システムとネットワークを介して接続する情報処理装置に、ユーザの音声を取得する音声取得機能と、音声認識システムとの通信が可能か否か判定する判定機能と、音声認識システムとの通信が可能な場合、取得された音声の音声データを音声認識システムに送信する送信機能と、音声認識システムから、音声データの認識結果を示す第１認識情報を受信する受信機能と、音声認識システムとの通信が不可能な場合、取得された音声を認識し、認識結果を示す第２認識情報を生成する音声認識機能と、第１認識情報または第２認識情報に基づき、音声に対して応答するための第１応答情報を生成する応答生成機能と、第１応答情報に基づき、音声に対する応答を出力する出力機能と、を実現させる。 The program according to one aspect of the present invention causes an information processing device that connects, via a network, to a voice recognition system that recognizes voice to realize: a voice acquisition function that acquires a user's voice; a determination function that determines whether or not communication with the voice recognition system is possible; a transmission function that transmits voice data of the acquired voice to the voice recognition system when communication with the voice recognition system is possible; a reception function that receives, from the voice recognition system, first recognition information indicating the recognition result of the voice data; a voice recognition function that recognizes the acquired voice and generates second recognition information indicating the recognition result when communication with the voice recognition system is impossible; a response generation function that generates, based on the first recognition information or the second recognition information, first response information for responding to the voice; and an output function that outputs a response to the voice based on the first response information.

本発明の一態様に係る情報処理方法は、音声を認識する音声認識システムとネットワークを介して接続する情報処理装置が、ユーザの音声を取得し、音声認識システムとの通信が可能か否か判定し、音声認識システムとの通信が可能な場合、取得された音声の音声データを音声認識システムに送信し、音声認識システムから、音声データの認識結果を示す第１認識情報を受信し、音声認識システムとの通信が不可能な場合、取得された音声を認識し、認識結果を示す第２認識情報を生成し、第１認識情報または第２認識情報に基づき、音声に対して応答するための第１応答情報を生成し、第１応答情報に基づき、音声に対する応答を出力する。 In the information processing method according to one aspect of the present invention, an information processing device that connects, via a network, to a voice recognition system that recognizes voice acquires a user's voice and determines whether or not communication with the voice recognition system is possible. When communication with the voice recognition system is possible, the device transmits voice data of the acquired voice to the voice recognition system and receives, from the voice recognition system, first recognition information indicating the recognition result of the voice data. When communication with the voice recognition system is impossible, the device recognizes the acquired voice and generates second recognition information indicating the recognition result. The device then generates, based on the first recognition information or the second recognition information, first response information for responding to the voice, and outputs a response to the voice based on the first response information.

上記の態様によれば、情報処理装置と音声認識システムとの通信が不可能な場合でも、情報処理装置内の音声認識部によりユーザの音声を認識することができる。このため情報処理装置は、音声認識システムとの通信が不可能な場合でもユーザの音声に応答することができる。 According to the above aspect, even when communication between the information processing device and the voice recognition system is impossible, the voice recognition unit in the information processing device can recognize the user's voice. Therefore, the information processing device can respond to the user's voice even when communication with the voice recognition system is impossible.
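The fallback described in this aspect can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation disclosed in the publication: the function name `respond` and the injected callables (`can_communicate`, `remote_recognize`, `local_recognize`, `generate_response`) are hypothetical names that stand in for the determination unit, the transmission/receiving units, the on-device voice recognition unit, and the response generation unit.

```python
from typing import Callable


def respond(
    voice_data: bytes,
    can_communicate: Callable[[], bool],
    remote_recognize: Callable[[bytes], str],
    local_recognize: Callable[[bytes], str],
    generate_response: Callable[[str], str],
) -> str:
    """Return first response information for the acquired voice (hypothetical sketch)."""
    if can_communicate():
        # Communication is possible: send the voice data and receive
        # the first recognition information from the recognition system.
        recognition = remote_recognize(voice_data)
    else:
        # Communication is impossible: recognize on the device itself
        # and produce the second recognition information.
        recognition = local_recognize(voice_data)
    # Either kind of recognition information feeds the same response step.
    return generate_response(recognition)
```

Passing the collaborators in as callables keeps the sketch independent of any particular recognition service; a real device would presumably bind them to its own components.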

本発明によれば、音声を認識する装置との通信が不可能となった場合でもユーザの音声に応答することができる情報処理装置、プログラム、および情報処理方法を提供することができる。 According to the present invention, it is possible to provide an information processing device, a program, and an information processing method capable of responding to a user's voice even when communication with a device that recognizes voice becomes impossible.

第１実施形態に係る対話システムのシステム構成例を説明するための図である。 A diagram illustrating a system configuration example of the dialogue system according to the first embodiment.
第１実施形態に係る対話システムの概要を説明するための図である。 A diagram illustrating an outline of the dialogue system according to the first embodiment.
第１実施形態に係る対話システムの概要を説明するための図である。 A diagram illustrating an outline of the dialogue system according to the first embodiment.
第１実施形態に係る対話装置の機能構成の一例を示す図である。 A diagram showing an example of the functional configuration of the dialogue device according to the first embodiment.
第１実施形態に係るローカル指示・リモート指示とその実行・キューイングとの関係の一例を説明する図である。 A diagram illustrating an example of the relationship between local and remote instructions and their execution and queuing according to the first embodiment.
第１実施形態に係るサーバ装置の機能構成の一例を示す図である。 A diagram showing an example of the functional configuration of the server device according to the first embodiment.
第１実施形態に係る対話装置の動作例を示す図である。 A diagram showing an operation example of the dialogue device according to the first embodiment.
第１実施形態に係る対話装置およびサーバ装置のハードウェア構成の一例を示す図である。 A diagram showing an example of the hardware configuration of the dialogue device and the server device according to the first embodiment.
第２実施形態に係る対話システムの概要を説明するための図である。 A diagram illustrating an outline of the dialogue system according to the second embodiment.
第２実施形態に係る対話システムの概要を説明するための図である。 A diagram illustrating an outline of the dialogue system according to the second embodiment.
第２実施形態に係る対話装置の動作例を示す図である。 A diagram showing an operation example of the dialogue device according to the second embodiment.

添付図面を参照して、本発明の好適な実施形態について説明する。なお、各図において、同一の符号を付したものは、同一または同様の構成を有する。 Preferred embodiments of the present invention will be described with reference to the accompanying drawings. In each figure, those having the same reference numerals have the same or similar configurations.

本実施形態において、「部」や「手段」、「装置」、「システム」とは、単に物理的手段を意味するものではなく、その「部」や「手段」、「装置」、「システム」が有する機能をソフトウェアによって実現する場合も含む。また、１つの「部」や「手段」、「装置」、「システム」が有する機能が２つ以上の物理的手段や装置により実現されても、２つ以上の「部」や「手段」、「装置」、「システム」の機能が１つの物理的手段や装置により実現されてもよい。 In the present embodiment, the "part", "means", "device", and "system" do not simply mean physical means, but the "part", "means", "device", and "system". Including the case where the function of is realized by software. Further, even if the functions of one "part", "means", "device", or "system" are realized by two or more physical means or devices, two or more "parts" or "means", The functions of "device" and "system" may be realized by one physical means or device.

［第１実施形態］ [First Embodiment]
本発明の第１実施形態（以下、「本実施形態」という）について説明する。本実施形態では、本実施形態に係る対話システム１が（１）ユーザと対話する、（２）ユーザの音声を議事録に記録する、（３）ユーザの音声指示により家電などの装置の動作を制御する、例を用いて説明するが、これに限る趣旨ではない。 The first embodiment of the present invention (hereinafter referred to as "the present embodiment") will be described. The description uses an example in which the dialogue system 1 according to the present embodiment (1) converses with the user, (2) records the user's voice in the minutes, and (3) controls the operation of a device such as a home appliance according to the user's voice instruction; however, the invention is not limited to this example.

＜１．システム構成＞ <1. System configuration>
図１を参照して、対話システム１のシステム構成例を説明する。対話システム１は、ユーザの音声に応じて動作するシステムである。対話システム１は、上記（１）〜（３）の機能をユーザに提供する。 A system configuration example of the dialogue system 1 will be described with reference to FIG. 1. The dialogue system 1 is a system that operates in response to a user's voice. The dialogue system 1 provides the functions (1) to (3) above to the user.

図１に示すように、対話システム１は、対話装置１００と、サーバ装置２００と、を含む。また対話システム１は、第１ネットワークＮ１を介して音声認識システム３００と接続さている。また対話システム１の対話装置１００は、第２ネットワークＮ２を介してローカル装置４００ａと接続されている。また対話システム１は、第１ネットワークＮ１を介してリモート装置４００ｂと接続されている。 As shown in FIG. 1, the dialogue system 1 includes a dialogue device 100 and a server device 200. Further, the dialogue system 1 is connected to the voice recognition system 300 via the first network N1. Further, the dialogue device 100 of the dialogue system 1 is connected to the local device 400a via the second network N2. Further, the dialogue system 1 is connected to the remote device 400b via the first network N1.

第１ネットワークＮ１は、広域通信網のネットワークであり、例えば、インターネット、移動体通信網、電話回線などを含む。また、第１ネットワークＮ１は、例えば、３Ｇ（第３世代移動通信システム）回線、４Ｇ（第４世代移動通信システム）回線、５Ｇ（第５世代移動通信システム）回線、またはＬＴＥ（登録商標）（ＬｏｎｇＴｅｒｍＥｖｏｌｕｔｉｏｎ）回線などを用いた無線通信方式を用いてもよい。 The first network N1 is a wide area communication network, and includes, for example, the Internet, a mobile communication network, a telephone line, and the like. The first network N1 may also use a wireless communication method using, for example, a 3G (third-generation mobile communication system) line, a 4G (fourth-generation mobile communication system) line, a 5G (fifth-generation mobile communication system) line, or an LTE (registered trademark) (Long Term Evolution) line.

第２ネットワークＮ２は、所定の施設や室内に対して独自に構築された通信網であり、ＬＡＮ（ＬｏｃａｌＡｒｅａＮｅｔｗｏｒｋ）である。言い換えれば、対話装置１００とローカル装置４００ａとは、同一のＬＡＮ内に設置されている。第２ネットワークＮ２は、有線および／または無線により、対話装置１００とローカル装置４００ａとが互いに通信できるものであれば、任意の通信方式を用いることができる。また第２ネットワークＮ２は、複数の通信方式を用いるものであってもよい。第２ネットワークＮ２は、例えば、Ｗｉ−Ｆｉ（登録商標）規格に準拠した無線ＬＡＮを含み、ルータが中継することで、これらの相互接続を実現させてもよい。 The second network N2 is a communication network independently constructed for a predetermined facility or room, and is a LAN (Local Area Network). In other words, the dialogue device 100 and the local device 400a are installed in the same LAN. The second network N2 can use any communication method as long as the dialogue device 100 and the local device 400a can communicate with each other by wire and / or wirelessly. Further, the second network N2 may use a plurality of communication methods. The second network N2 may include, for example, a wireless LAN conforming to the Wi-Fi (registered trademark) standard, and these interconnections may be realized by relaying by a router.

第２ネットワークＮ２は、例えば、ローカル装置４００ａと直接接続するためのネットワークであってもよい。第２ネットワークＮ２は、例えば、Ｂｌｕｅｔｏｏｔｈ（登録商標）や赤外線通信等の１０ｍ程度の近距離無線通信を実現するネットワークを含んでもよい。 The second network N2 may be, for example, a network for directly connecting to the local device 400a. The second network N2 may include, for example, a network that realizes short-range wireless communication of about 10 m such as Bluetooth (registered trademark) and infrared communication.

対話装置１００は、サーバ装置２００や音声認識システム３００、または装置４００との通信が可能な情報処理装置である。対話装置１００は、ユーザの音声を取得して、取得した音声に対話などで応答する、いわゆるスマートスピーカーである。対話装置１００は、例えば、汎用のタブレット端末やスマートフォンなどであってもよい。対話装置１００は、例えば、汎用のタブレット端末に専用のプログラムをインストールし、このプログラムを実行させることにより、タブレット端末を対話装置１００として使用してもよい。 The dialogue device 100 is an information processing device capable of communicating with the server device 200, the voice recognition system 300, or the device 400. The dialogue device 100 is a so-called smart speaker that acquires a user's voice and responds to the acquired voice by dialogue or the like. The dialogue device 100 may be, for example, a general-purpose tablet terminal or a smartphone. The dialogue device 100 may use the tablet terminal as the dialogue device 100 by, for example, installing a dedicated program on a general-purpose tablet terminal and executing this program.

サーバ装置２００は、対話装置１００との通信や議事録の管理が可能な情報処理装置である。サーバ装置２００は、所定のプログラムを実行することにより、対話装置１００と連携して、ユーザの音声に対する応答や議事録の新規登録、変更並びに削除（以下、これらの処理をまとめて「更新」ともいう）し、またはこれらの履歴を管理するサーバ機能を実現する。 The server device 200 is an information processing device capable of communicating with the dialogue device 100 and managing minutes. By executing a predetermined program, the server device 200 realizes a server function that, in cooperation with the dialogue device 100, responds to the user's voice, newly registers, changes, and deletes minutes (hereinafter, these processes are also collectively referred to as "update"), or manages the history of these operations.

音声認識システム３００は、対話装置１００やサーバ装置２００と通信の通信が可能なシステムである。音声認識システム３００は、対話装置１００またはサーバ装置２００から受信したユーザの音声を示す音声データ（以下、単に「音声データ」ともいう）に基づいてユーザの音声を認識する。 The voice recognition system 300 is a system capable of communicating with the dialogue device 100 and the server device 200. The voice recognition system 300 recognizes the user's voice based on the voice data (hereinafter, also simply referred to as “voice data”) indicating the user's voice received from the dialogue device 100 or the server device 200.

ローカル装置４００ａおよびリモート装置４００ｂは、ユーザの音声指示に応じて対話装置１００により動作を制御される装置である。ローカル装置４００ａは、対話装置１００と同一のＬＡＮ（第２ネットワークＮ２）内の装置である。リモート装置４００ｂは、対話装置が接続するＬＡＮ（第２ネットワークＮ２）外の装置である。ローカル装置４００ａとリモート装置４００ｂとは、特に区別の必要がない場合に総称して「装置４００」という。 The local device 400a and the remote device 400b are devices whose operations are controlled by the dialogue device 100 in response to a user's voice instruction. The local device 400a is a device in the same LAN (second network N2) as the dialogue device 100. The remote device 400b is a device outside the LAN (second network N2) to which the dialogue device is connected. The local device 400a and the remote device 400b are collectively referred to as "device 400" when there is no particular need to distinguish them.

＜２．システム概要＞ <2. System overview>
図２〜３を参照して、対話システム１の概要を、（Ａ）対話装置１００がオンラインのとき、（Ｂ）対話装置１００がオンラインからオフラインに切り替わったとき、という二つの場面に分けて説明する。 With reference to FIGS. 2 and 3, the outline of the dialogue system 1 will be described for two separate situations: (A) when the dialogue device 100 is online, and (B) when the dialogue device 100 has switched from online to offline.

図２を参照して、まず上記（Ａ）の場面について説明する。対話装置１００は、オンラインの状態であり、サーバ装置２００および音声認識システム３００との通信が可能な状態である。 First, the scene (A) will be described with reference to FIG. The dialogue device 100 is in an online state, and is in a state in which communication with the server device 200 and the voice recognition system 300 is possible.

（１）図２に示すように、対話装置１００の音声取得部１２０は、ユーザの音声「議事録を開始」を取得する。（２）対話装置１００の音声取得部１２０が取得した音声の音声データを音声認識システム３００に送信するため、判定部１１１は、音声認識システム３００との通信が可能か否か判定する。 (1) As shown in FIG. 2, the voice acquisition unit 120 of the dialogue device 100 acquires the user's voice "start minutes". (2) In order to transmit the voice data of the voice acquired by the voice acquisition unit 120 of the dialogue device 100 to the voice recognition system 300, the determination unit 111 determines whether or not communication with the voice recognition system 300 is possible.

（３）判定部１１１の判定により音声認識システム３００との通信が可能な場合、対話装置１００の通信部１３０は音声データを音声認識システム３００に送信する。（４）対話装置１００の通信部１３０は、音声認識システム３００から第１認識情報を受信する。 (3) When the determination unit 111 determines that communication with the voice recognition system 300 is possible, the communication unit 130 of the dialogue device 100 transmits the voice data to the voice recognition system 300. (4) The communication unit 130 of the dialogue device 100 receives the first recognition information from the voice recognition system 300.

「第１認識情報」とは、音声認識システム３００による音声データの認識結果を示す情報である。第１認識情報は、例えば、音声の内容（「議事録開始」）をテキストで表したものでもよい。なお第１認識情報と第２認識情報とは、いずれもユーザの音声を認識した結果を示す情報であるため、特に区別の必要がなければ以下総称して「認識情報」という。 The "first recognition information" is information indicating the recognition result of the voice data by the voice recognition system 300. The first recognition information may be, for example, a text representation of the content of the voice (“start of minutes”). Since the first recognition information and the second recognition information are both information indicating the result of recognizing the user's voice, they are collectively referred to as "recognition information" unless there is a particular need to distinguish them.

（５）通信部１３０が受信した第１認識情報をサーバ装置２００に送信するため、対話装置１００の判定部１１１は、サーバ装置２００との通信が可能か判定する。 (5) In order to transmit the first recognition information received by the communication unit 130 to the server device 200, the determination unit 111 of the dialogue device 100 determines whether communication with the server device 200 is possible.

（６）判定部１１１の判定によりサーバ装置２００との通信が可能な場合、対話装置１００の通信部１３０は第１認識情報をサーバ装置２００に送信する。（７）サーバ装置２００の通信部２３０は、対話装置１００から第１認識情報を受信する。 (6) When communication with the server device 200 is possible by the determination of the determination unit 111, the communication unit 130 of the dialogue device 100 transmits the first recognition information to the server device 200. (7) The communication unit 230 of the server device 200 receives the first recognition information from the dialogue device 100.

（８）サーバ装置２００の応答生成部２１３は、通信部２３０が受信した第１認識情報に基づき第２応答情報を生成する。ここで「第２応答情報」とは、サーバ装置２００が生成する、ユーザの音声に対して応答するための情報である。第２応答情報は、例えば、対話装置１００から出力する音声の内容「議事録を開始します」をテキストで表したものでもよく、またこの内容を出力するための音声のデータであってもよい。なお第１応答情報と第２応答情報とは、いずれもユーザの音声に対する応答の内容を示す情報であるため、特に区別の必要がなければ以下総称して「応答情報」という。 (8) The response generation unit 213 of the server device 200 generates the second response information based on the first recognition information received by the communication unit 230. Here, the "second response information" is information generated by the server device 200 for responding to the user's voice. The second response information may be, for example, a text representation of the content of the voice to be output from the dialogue device 100 ("Starting the minutes"), or voice data for outputting this content. Since the first response information and the second response information are both information indicating the content of the response to the user's voice, they are hereinafter collectively referred to as "response information" unless there is a particular need to distinguish them.

（９）サーバ装置２００の通信部２３０は、第２応答情報を対話装置１００に送信する。（１０）対話装置１００の通信部１３０は、サーバ装置２００から第２応答情報を受信する。 (9) The communication unit 230 of the server device 200 transmits the second response information to the dialogue device 100. (10) The communication unit 130 of the dialogue device 100 receives the second response information from the server device 200.

（１１）対話装置１００の出力部１４０は、通信部１３０が受信した第２応答情報に基づき、音声に対する応答を出力する。出力部１４０は、具体的には、「議事録を開始します」とする音声を出力する。 (11) The output unit 140 of the dialogue device 100 outputs a response to the voice based on the second response information received by the communication unit 130. Specifically, the output unit 140 outputs a voice saying "Start the minutes".

（１２）サーバ装置２００の装置制御部２１５は、第２応答情報に基づいて、自装置の動作を制御する。装置制御部２１５は、第２応答情報に基づいて、第１認識情報に示された音声の内容を議事録として記録を開始する。装置制御部２１５は、この音声の内容を議事録のフォーマットに合わせるよう調整して記憶部２５０に記憶する。 (12) The device control unit 215 of the server device 200 controls the operation of the own device based on the second response information. The device control unit 215 starts recording the contents of the voice indicated in the first recognition information as minutes based on the second response information. The device control unit 215 adjusts the content of this voice to match the format of the minutes and stores it in the storage unit 250.

つぎに図３を参照して、上記（Ｂ）の場面について説明する。 Next, the scene of (B) above will be described with reference to FIG.

（１）図３に示すように、対話装置１００の音声取得部１２０は、ユーザの音声「議事録を開始」を取得する。（２）対話装置１００の音声取得部１２０が取得した音声の音声データを音声認識システム３００に送信するため、判定部１１１は、音声認識システム３００との通信が可能か否か判定する。 (1) As shown in FIG. 3, the voice acquisition unit 120 of the dialogue device 100 acquires the user's voice "start minutes". (2) In order to transmit the voice data of the voice acquired by the voice acquisition unit 120 of the dialogue device 100 to the voice recognition system 300, the determination unit 111 determines whether or not communication with the voice recognition system 300 is possible.

（３）判定部１１１の判定により音声認識システム３００との通信が不可能な場合、対話装置１００の音声認識部１１２は、取得された音声を認識し、第２認識情報を生成する。またこの際音声認識部１１２は、通信部１３０と音声認識システム３００との間の音声データ送信の際のセッションに関する第１セッション情報を参照して、送信途中の音声データを引き継いでもよい。第１セッション情報の詳細は後述する。 (3) When communication with the voice recognition system 300 is impossible due to the determination of the determination unit 111, the voice recognition unit 112 of the dialogue device 100 recognizes the acquired voice and generates the second recognition information. Further, at this time, the voice recognition unit 112 may take over the voice data in the middle of transmission by referring to the first session information regarding the session at the time of voice data transmission between the communication unit 130 and the voice recognition system 300. Details of the first session information will be described later.

「第２認識情報」とは、対話装置１００の音声認識部１１２による音声データの認識結果を示す情報である。第２認識情報は、例えば、音声の内容（「議事録開始」）をテキストで表してもよい。 The "second recognition information" is information indicating the recognition result of the voice data by the voice recognition unit 112 of the dialogue device 100. The second recognition information may represent, for example, the content of the voice (“start of minutes”) in text.

（４）対話装置１００の応答生成部１１３は、第２認識情報に基づいて第１応答情報を生成する。ここで「第１応答情報」とは、応答生成部１１３によるユーザの音声に対して応答するための情報である。第１応答情報は、例えば、対話装置１００から出力する音声の内容「議事録を開始します」をテキストで表してもよく、またこの内容を出力するための音声データのファイルであってもよい。またこの際応答生成部１１３は、通信部１３０とサーバ装置２００との間の認識情報送信の際のセッションに関する第２セッション情報を参照して、送信途中の認識情報を引き継いでもよい。第２セッション情報の詳細は後述する。 (4) The response generation unit 113 of the dialogue device 100 generates the first response information based on the second recognition information. Here, the "first response information" is information generated by the response generation unit 113 for responding to the user's voice. The first response information may be, for example, a text representation of the content of the voice to be output from the dialogue device 100 ("Starting the minutes"), or a file of voice data for outputting this content. At this time, the response generation unit 113 may also take over recognition information that was in the middle of transmission by referring to the second session information, which concerns the session used for transmitting recognition information between the communication unit 130 and the server device 200. Details of the second session information will be described later.

（５）対話装置１００の出力部１４０は、第１応答情報に基づいて、音声に対する応答を出力する。出力部１４０は、具体的には、「議事録を開始します」とする音声を出力する。 (5) The output unit 140 of the dialogue device 100 outputs a response to the voice based on the first response information. Specifically, the output unit 140 outputs a voice saying "Start the minutes".

（６）対話装置１００の識別部１１４は、第２認識情報に基づき、ユーザの音声がローカル指示であることを識別する。対話装置１００の装置制御部１１６は、自装置に対するローカル指示の場合、第１応答情報に基づいて、自装置の動作を制御する。装置制御部１１６は、第１応答情報に基づいて第２認識情報の音声の内容を議事録として記録を開始する。装置制御部１１６は、第２認識情報を所定の議事録フォーマットに合わせるよう調整して議事録として記憶部１５０に記憶する。 (6) The identification unit 114 of the dialogue device 100 identifies that the user's voice is a local instruction based on the second recognition information. In the case of a local instruction to the own device, the device control unit 116 of the dialogue device 100 controls the operation of the own device based on the first response information. The device control unit 116 starts recording the contents of the voice of the second recognition information as minutes based on the first response information. The device control unit 116 adjusts the second recognition information to match the predetermined minutes format and stores it in the storage unit 150 as the minutes.

「ローカル指示」とは、ユーザの音声のうち対話装置１００が接続する所定のネットワーク内の装置に対する指示をいう。ここで「所定のネットワーク」とは、例えば、第２ネットワークＮ２である。またこの装置には、ローカル４００ａと、自装置（対話装置１００）とが含まれる。 A "local instruction" is an instruction, among the user's voice inputs, addressed to a device in a predetermined network to which the dialogue device 100 is connected. Here, the "predetermined network" is, for example, the second network N2. The devices in question include the local device 400a and the device itself (the dialogue device 100).

上記記憶された議事録は、対話装置１００とサーバ装置２００との通信が可能になった場合に、サーバ装置２００の記憶部２５０に通信が不能となるまで記憶されていた議事録に加えるために送信してもよい。 When communication between the dialogue device 100 and the server device 200 becomes possible again, the minutes stored as described above may be transmitted to the server device 200 so that they are added to the minutes that had been stored in the storage unit 250 of the server device 200 up to the point at which communication became impossible.
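The deferred transfer of locally stored minutes can be sketched as follows. This is a minimal Python illustration under assumptions, not the disclosed implementation: the class `MinutesBuffer` and the callables `server_reachable` and `send` are hypothetical names, and the publication does not specify how the transfer is batched or retried.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MinutesBuffer:
    """Holds minutes entries recorded while the server is unreachable (hypothetical)."""

    pending: List[str] = field(default_factory=list)

    def record(self, text: str) -> None:
        # Store the recognized text locally, as the device does while offline.
        self.pending.append(text)

    def flush(
        self,
        server_reachable: Callable[[], bool],
        send: Callable[[List[str]], None],
    ) -> int:
        """Send pending entries once communication is possible; return the count sent."""
        if not self.pending or not server_reachable():
            return 0
        batch = list(self.pending)
        send(batch)  # if this raises, the entries stay in `pending`
        self.pending.clear()
        return len(batch)
```

Clearing the buffer only after `send` returns means a failed transfer leaves the entries in place for a later attempt.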

上記構成によれば、対話装置１００は、音声認識システムとの通信が不可能な場合でも、対話装置１００内の音声認識部１１２によりユーザの音声を認識することができる。上記構成によれば、対話装置１００は、さらにサーバ装置２００との通信が不可能な場合でも自装置における音声認識の結果に基づいてユーザの音声に応答することができる。このため上記構成によれば、ユーザは、オフライン環境でも通常どおり対話装置１００を利用することができる。 According to the above configuration, the dialogue device 100 can recognize the user's voice by the voice recognition unit 112 in the dialogue device 100 even when communication with the voice recognition system is impossible. According to the above configuration, the dialogue device 100 can respond to the user's voice based on the result of voice recognition in the own device even when communication with the server device 200 is impossible. Therefore, according to the above configuration, the user can use the dialogue device 100 as usual even in an offline environment.

＜３．機能構成＞ <3. Functional configuration>
図４を参照して、本実施形態に係る対話装置１００の機能構成を説明する。図４に示すように、対話装置１００は、通信部１３０と、制御部１１０と、音声取得部１２０と、出力部１４０と、記憶部１５０と、を備える。 The functional configuration of the dialogue device 100 according to the present embodiment will be described with reference to FIG. 4. As shown in FIG. 4, the dialogue device 100 includes a communication unit 130, a control unit 110, a voice acquisition unit 120, an output unit 140, and a storage unit 150.

制御部１１０は、判定部１１１と、音声認識部１１２と、応答生成部１１３と、識別部１１４と、を備える。また制御部１１０は、例えば、特定部１１５または装置制御部１１６を備えてもよい。 The control unit 110 includes a determination unit 111, a voice recognition unit 112, a response generation unit 113, and an identification unit 114. The control unit 110 may also include, for example, a specifying unit 115 or a device control unit 116.

判定部１１１は、音声認識システム３００との通信が可能か否か判定する。判定部１１１は、例えば、サイクリックまたはイベントドリブンでネットワーク接続の状態（オフラインまたはオンライン）を監視して、監視の結果に基づいて音声認識システム３００との通信が可能か否か判定してもよい。 The determination unit 111 determines whether or not communication with the voice recognition system 300 is possible. The determination unit 111 may, for example, monitor the network connection status (offline or online) cyclically or in an event-driven manner, and determine whether or not communication with the voice recognition system 300 is possible based on the monitoring result.

判定部１１１は、例えば、通信部１３０によって第１ネットワークＮ１を介してサーバ装置２００に通信接続要求を送信し、この通信接続要求の応答を受信した場合に通信が可能と判定してもよい。他方判定部１１１は、例えば、通信部１３０によって第１ネットワークＮ１を介してサーバ装置２００に通信接続要求を送信し、一定の時間、この通信接続要求の応答を受信しなかった場合に通信が不可能と判定してもよい。 For example, the determination unit 111 may have the communication unit 130 transmit a communication connection request to the server device 200 via the first network N1, and determine that communication is possible when a response to this request is received. Conversely, the determination unit 111 may determine that communication is impossible when the communication unit 130 has transmitted a communication connection request to the server device 200 via the first network N1 and no response to the request has been received for a certain period of time.

判定部１１１は、例えば、第１ネットワークＮ１の中継・転送機器（不図示）に、サーバ装置２００の第１ネットワークＮ１への接続状況を問合せて、この問合せに対する応答によって通信が可能か不可能か判定してもよい。 For example, the determination unit 111 inquires the relay / transfer device (not shown) of the first network N1 about the connection status of the server device 200 to the first network N1, and whether communication is possible or impossible by the response to this inquiry. You may judge.

判定部１１１の上記判定の態様は、音声認識システム３００との通信だけではなく、サーバ装置２００、またはローカル装置４００ａやリモート装置４００ｂとの通信に対する判定においても適用できる。 The above-mentioned determination mode of the determination unit 111 can be applied not only to the communication with the voice recognition system 300 but also to the determination for the communication with the server device 200, the local device 400a, or the remote device 400b.
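The request-and-timeout determination described above can be sketched as follows. This is a hedged Python illustration, not the disclosed implementation: the function name `can_communicate`, the default timeout of 2 seconds, and the use of a plain TCP connection attempt as the "communication connection request" are all assumptions made for the example. The `connect` parameter is a hypothetical injection point so that the check can be exercised without a real network.

```python
import socket
from typing import Callable


def can_communicate(
    host: str,
    port: int,
    timeout_s: float = 2.0,
    connect: Callable[..., socket.socket] = socket.create_connection,
) -> bool:
    """Return True if a connection request is answered within timeout_s (assumed policy)."""
    try:
        conn = connect((host, port), timeout=timeout_s)
    except OSError:
        # Covers refused connections, unreachable networks, and timeouts:
        # no response arrived in time, so treat communication as impossible.
        return False
    conn.close()
    return True
```

Because the target is just a host and port, the same check could be pointed at the voice recognition system, the server device, or a local or remote device, in line with the paragraph above.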

判定部１１１は、例えば、サーバ装置２００との通信が可能か否か判定してもよい。 The determination unit 111 may determine, for example, whether or not communication with the server device 200 is possible.

判定部１１１は、例えば、ユーザの音声がリモート指示の場合、リモート装置４００ｂとの通信が可能か判定してもよい。判定部１１１は、例えば、ユーザの音声がローカル指示の場合、ローカル装置４００ａとの通信が可能か判定してもよい。 For example, when the user's voice is a remote instruction, the determination unit 111 may determine whether communication with the remote device 400b is possible. For example, when the user's voice is a local instruction, the determination unit 111 may determine whether communication with the local device 400a is possible.

音声認識部１１２は、音声取得部１２０により取得されたユーザの音声を認識する。音声認識部１１２は、この認識の結果を示す第２認識情報を生成する。音声認識部１１２は、例えば、音声取得部１２０が取得した音声データを音声認識技術によりテキスト情報に変換する。この変換したテキスト情報が、第２認識情報に相当する。 The voice recognition unit 112 recognizes the user's voice acquired by the voice acquisition unit 120. The voice recognition unit 112 generates second recognition information indicating the result of this recognition. The voice recognition unit 112 converts, for example, the voice data acquired by the voice acquisition unit 120 into text information by voice recognition technology. This converted text information corresponds to the second recognition information.

音声認識部１１２は、送信部１３１が音声認識システム３００に音声データを送信している途中で音声認識システム３００との通信が不可能になった場合、第１セッション情報を参照して、未送信の音声データの音声に基づいて第２認識情報を生成してもよい。 When communication with the voice recognition system 300 becomes impossible while the transmission unit 131 is transmitting voice data to the voice recognition system 300, the voice recognition unit 112 may refer to the first session information and generate the second recognition information based on the voice of the untransmitted voice data.

「第１セッション情報」とは、送信部１３１が音声認識システム３００に音声データを送信する際に確立したセッションに関する情報である。第１セッション情報は、例えば、送信していた音声データの各パケットや各セグメントがどこまで送信完了したかを示してもよい。第１セッション情報は、例えば、音声データの全セグメントのうち最後に送信完了したセグメントのＴＣＰヘッダのシーケンス番号やＡＣＫ番号、または最初の未送信セグメントのＴＣＰヘッダのシーケンス番号などを示してもよい。 The "first session information" is information about a session established when the transmission unit 131 transmits voice data to the voice recognition system 300. The first session information may indicate, for example, how far each packet or each segment of the transmitted voice data has been transmitted. The first session information may indicate, for example, the sequence number or ACK number of the TCP header of the last segment of all the segments of the voice data that has been transmitted, or the sequence number of the TCP header of the first untransmitted segment.

上記構成によれば、音声認識部１１２は、音声認識システム３００に送信途中の音声データを引き継いで、第２認識情報を生成することができる。このため上記構成によれば、音声認識部１１２は、音声認識システム３００との通信が遮断されても、円滑に精度よくユーザの音声を認識することができる。 According to the above configuration, the voice recognition unit 112 can take over the voice data being transmitted to the voice recognition system 300 and generate the second recognition information. Therefore, according to the above configuration, the voice recognition unit 112 can smoothly and accurately recognize the user's voice even if the communication with the voice recognition system 300 is interrupted.
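As a rough illustration of this hand-over, the Python sketch below keeps a record of how much audio the remote side has acknowledged and returns the remainder for local recognition. The class `FirstSessionInfo`, its field `acked_bytes`, and the function `untransmitted_audio` are hypothetical names. The sketch also assumes an application-level byte offset, whereas the description mentions TCP sequence or ACK numbers as one possible form of the first session information.

```python
from dataclasses import dataclass


@dataclass
class FirstSessionInfo:
    """Hypothetical record of upload progress for one utterance."""
    acked_bytes: int = 0  # bytes the voice recognition system has acknowledged


def untransmitted_audio(audio: bytes, session: FirstSessionInfo) -> bytes:
    """Return the part of the audio that was not delivered before the link dropped."""
    # Clamp the offset so a stale or corrupt value cannot index outside the buffer.
    offset = max(0, min(session.acked_bytes, len(audio)))
    return audio[offset:]
```

The returned bytes would then be passed to the on-device recognizer to produce the second recognition information for the remaining portion.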

応答生成部１１３は、第１認識情報または第２認識情報に基づき、第１応答情報を生成する。応答生成部１１３は、例えば、自然言語処理を用いて認識情報を解析してもよい。応答生成部１１３は、この解析により、ユーザの音声に対する応答の内容を特定する。応答生成部１１３は、具体的には、図２〜３に示すように、ユーザの音声の内容「議事録を開始」を形態素解析して「議事録」および「開始」という単語を抽出する。次いで応答生成部１１３は、抽出したこれらの単語を検索キーとして、辞書情報を検索して該当する応答の内容を特定する。この応答の内容とは、（ア）議事録として記憶部１５０または記憶部２５０への認識情報の記録を開始する処理を実行、（イ）ユーザに「議事録を開始します」とする音声を出力する処理を実行、である。 The response generation unit 113 generates the first response information based on the first recognition information or the second recognition information. The response generation unit 113 may analyze the recognition information by using, for example, natural language processing. Through this analysis, the response generation unit 113 identifies the content of the response to the user's voice. Specifically, as shown in FIGS. 2 to 3, the response generation unit 113 performs morphological analysis on the content of the user's voice, "start minutes", and extracts the words "minutes" and "start". Next, the response generation unit 113 searches the dictionary information using these extracted words as search keys and identifies the content of the corresponding response. The content of this response is (a) executing a process that starts recording the recognition information in the storage unit 150 or the storage unit 250 as minutes, and (b) executing a process that outputs the voice message "Starting the minutes" to the user.

「辞書情報」とは、単語または複数の単語の組み合わせと、応答の内容とローカル指示かリモート指示かを示すフラグとを関連付ける情報である。辞書情報は、例えば、「議事録」および「開始」とする単語の組み合わせと、上記（ア）および（イ）の処理の実行とする応答の内容と、リモート指示を示すフラグと、を関連付ける。なおこのフラグにおいて、リモート指示についてローカルで代替可能か否かでさらに分けて設けてもよい。すなわち、フラグ情報は、「ローカル指示」、「リモート指示（ローカル代替可）」、「リモート指示（ローカル代替不可）」とする３種類のフラグ（例えば、「１」〜「３」）のいずれかを示してもよい。 "Dictionary information" is information that associates a word, or a combination of a plurality of words, with the content of a response and with a flag indicating whether the instruction is a local instruction or a remote instruction. The dictionary information associates, for example, the combination of the words "minutes" and "start" with the response content of executing the processes (a) and (b) above and with a flag indicating a remote instruction. This flag may be further subdivided according to whether or not a remote instruction can be substituted locally. That is, the flag information may indicate one of three types of flags (for example, "1" to "3"): "local instruction", "remote instruction (local substitution possible)", and "remote instruction (local substitution not possible)".
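One way to picture the dictionary information is as a lookup table keyed by an unordered set of words. The Python sketch below is illustrative only: the names `InstructionFlag`, `DictionaryEntry`, `DICTIONARY`, and `look_up`, the action strings, and the choice of the "local substitution possible" variant for the minutes example are all assumptions, since the description only states that the entry carries a flag indicating a remote instruction.

```python
from enum import Enum
from typing import Iterable, NamedTuple, Optional, Tuple


class InstructionFlag(Enum):
    LOCAL = 1                       # "local instruction"
    REMOTE_LOCAL_SUBSTITUTABLE = 2  # "remote instruction (local substitution possible)"
    REMOTE_NOT_SUBSTITUTABLE = 3    # "remote instruction (local substitution not possible)"


class DictionaryEntry(NamedTuple):
    actions: Tuple[str, ...]  # hypothetical identifiers for the response content
    flag: InstructionFlag


# Keys are frozensets so that word order in the utterance does not matter.
DICTIONARY = {
    frozenset({"minutes", "start"}): DictionaryEntry(
        actions=("start_recording_minutes", "say:Starting the minutes"),
        flag=InstructionFlag.REMOTE_LOCAL_SUBSTITUTABLE,  # assumed variant
    ),
}


def look_up(words: Iterable[str]) -> Optional[DictionaryEntry]:
    """Return the entry for the extracted words, or None if nothing matches."""
    return DICTIONARY.get(frozenset(words))
```

The same lookup could serve both the response generation unit 113 (reading `actions`) and the identification unit 114 (reading `flag`).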

応答生成部１１３は、送信部１３１がリモート指示の送信を取り止めた場合、リモート指示の音声に対する応答としてリモート指示を取り止めた旨の第１応答情報を生成してもよい。 When the transmission unit 131 cancels the transmission of the remote instruction, the response generation unit 113 may generate the first response information indicating that the remote instruction is canceled as a response to the voice of the remote instruction.

上記構成によれば、応答生成部１１３は、リモート装置４００ｂへのリモート指示を取り止めたことをユーザに応答することができる。このため上記構成によれば、応答生成部１１３は、リモート指示を取り止めたことをユーザに把握させることができる。 According to the above configuration, the response generation unit 113 can respond to the user that the remote instruction to the remote device 400b has been canceled. Therefore, according to the above configuration, the response generation unit 113 can make the user know that the remote instruction has been canceled.

応答生成部１１３は、送信部１３１がサーバ装置２００に第１認識情報または第２認識情報を送信している途中でサーバ装置２００との通信が不可能になった場合、第２セッション情報を参照して、未送信の第１認識情報または第２認識情報に基づいて第１応答情報を生成してもよい。 When communication with the server device 200 becomes impossible while the transmission unit 131 is transmitting the first recognition information or the second recognition information to the server device 200, the response generation unit 113 may refer to the second session information and generate the first response information based on the untransmitted first recognition information or second recognition information.

「第２セッション情報」とは、送信部１３１がサーバ装置２００に第１認識情報または第２認識情報を送信する際に確立したセッションに関する情報である。第２セッション情報は、例えば、送信していた認識情報の各パケットや各セグメントがどこまで送信完了したかを示してもよい。第２セッション情報は、例えば、認識情報の全セグメントのうち最後に送信完了したセグメントのＴＣＰヘッダのシーケンス番号やＡＣＫ番号、または最初の未送信セグメントのＴＣＰヘッダのシーケンス番号を示してもよい。 The "second session information" is information about a session established when the transmission unit 131 transmits the first recognition information or the second recognition information to the server device 200. The second session information may indicate, for example, how far each packet or each segment of the transmitted recognition information has been transmitted. The second session information may indicate, for example, the sequence number or ACK number of the TCP header of the last segment of the recognition information that has been transmitted, or the sequence number of the TCP header of the first untransmitted segment.

上記構成によれば、応答生成部１１３は、サーバ装置２００に送信途中の認識情報を引き継いで、第１応答情報を生成することができる。このため上記構成によれば、応答生成部１１３は、サーバ装置２００との通信が遮断されても、円滑に精度よくユーザの音声に対して応答することができる。 According to the above configuration, the response generation unit 113 can generate the first response information by taking over the recognition information during transmission to the server device 200. Therefore, according to the above configuration, the response generation unit 113 can respond smoothly and accurately to the user's voice even if the communication with the server device 200 is interrupted.

識別部１１４は、第１認識情報または第２認識情報に基づき、ユーザの音声のうち、情報処理装置が接続する所定のネットワーク外のリモート装置４００ｂに対するリモート指示を識別する。 Based on the first recognition information or the second recognition information, the identification unit 114 identifies a remote instruction to the remote device 400b outside the predetermined network to which the information processing device is connected in the user's voice.

識別部１１４は、第１認識情報または第２認識情報に基づき、ユーザの音声のうち、情報処理装置が接続する所定のネットワーク内のローカル装置４００ａまたは自装置に対するローカル指示を識別する。 Based on the first recognition information or the second recognition information, the identification unit 114 identifies, in the user's voice, a local instruction addressed to the local device 400a within the predetermined network to which the information processing device is connected, or to the own device.

識別部１１４は、例えば、応答生成部１１３と同様に自然言語処理を用いて解析を行って単語を抽出してもよい。識別部１１４は、抽出した単語に基づき、応答生成部１１３と同様に辞書情報の検索・特定によりリモート指示かローカル指示かを識別してもよい。 The identification unit 114 may extract words by performing analysis using natural language processing in the same manner as the response generation unit 113, for example. Based on the extracted word, the identification unit 114 may identify whether it is a remote instruction or a local instruction by searching / specifying dictionary information in the same manner as the response generation unit 113.

特定部１１５は、リモート指示における指示の実行タイミングを特定する。特定部１１５は、例えば、リモート指示に含まれる時刻または実行までの期間を表す情報に基づきリモート指示の実行タイミングを特定する。この「時刻または実行までの期間を表す情報」とは、例えば、「朝７時」や「１８：００」または「５分後」などを示す情報である。 The specifying unit 115 specifies the execution timing of the instruction in a remote instruction. The specifying unit 115 specifies the execution timing of the remote instruction based on, for example, information included in the remote instruction that represents a time or a period until execution. This "information representing a time or a period until execution" is, for example, information indicating "7:00 am", "18:00", "5 minutes later", or the like.
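A minimal Python sketch of resolving such expressions into a concrete execution time is shown below. It handles only two English-language forms, "HH:MM" and "N minutes later"; the function name `resolve_execution_time`, the accepted formats, and the rule that a clock time already past today rolls over to the next day are assumptions made for illustration, not details from the description.

```python
import re
from datetime import datetime, timedelta


def resolve_execution_time(expr: str, acquired_at: datetime) -> datetime:
    """Resolve a timing expression relative to the moment the voice was acquired."""
    # Relative form, e.g. "5 minutes later".
    m = re.fullmatch(r"(\d+) minutes? later", expr)
    if m:
        return acquired_at + timedelta(minutes=int(m.group(1)))

    # Absolute clock time, e.g. "18:00" or "7:00".
    m = re.fullmatch(r"(\d{1,2}):(\d{2})", expr)
    if m:
        target = acquired_at.replace(
            hour=int(m.group(1)), minute=int(m.group(2)), second=0, microsecond=0
        )
        if target <= acquired_at:
            target += timedelta(days=1)  # assumption: a past time means tomorrow
        return target

    raise ValueError(f"unsupported timing expression: {expr!r}")
```

In practice the expression would come out of the natural language analysis of the recognition information rather than being passed in as a pre-normalized string.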

装置制御部１１６は、第１応答情報に基づいて、自装置の動作を制御する。装置制御部１１６は、第１応答情報に基づいて、ユーザの音声指示に対する応答が議事録の開始の場合、第２認識情報の音声の内容を議事録として記憶部１５０に記録する。 The device control unit 116 controls the operation of the own device based on the first response information. Based on the first response information, when the response to the user's voice instruction is the start of the minutes, the device control unit 116 records the contents of the voice of the second recognition information in the storage unit 150 as the minutes.

通信部１３０は、ネットワークＮを介して、サーバ装置２００、音声認識システム３００、装置４００などとの間で各種情報・データを送受信する。通信部１３０は、送信部１３１と、受信部１３２と、を備える。 The communication unit 130 transmits and receives various information and data to and from the server device 200, the voice recognition system 300, the device 400, and the like via the network N. The communication unit 130 includes a transmission unit 131 and a reception unit 132.

送信部１３１は、音声認識システム３００との通信が可能な場合、音声取得部１２０により取得された音声の音声データを音声認識システム３００に送信する。 When communication with the voice recognition system 300 is possible, the transmission unit 131 transmits the voice data of the voice acquired by the voice acquisition unit 120 to the voice recognition system 300.

送信部１３１は、例えば、サーバ装置２００との通信が可能な場合、第１認識情報または第２認識情報をサーバ装置２００に送信してもよい。 The transmission unit 131 may transmit the first recognition information or the second recognition information to the server device 200, for example, when communication with the server device 200 is possible.

送信部１３１は、例えば、判定部１１１の判定によりリモート装置４００ｂとの通信が可能な場合、リモート指示をリモート装置４００ｂに送信してもよい。他方送信部１３１は、例えば、リモート装置４００ｂとの通信が不可能な場合、リモート指示をキューイングし、その後リモート装置４００ｂとの通信が可能となった際にキューイングされたリモート指示を読み出してリモート装置４００ｂに送信してもよい。 For example, when the determination unit 111 determines that communication with the remote device 400b is possible, the transmission unit 131 may transmit the remote instruction to the remote device 400b. Conversely, when communication with the remote device 400b is impossible, the transmission unit 131 may, for example, queue the remote instruction and, once communication with the remote device 400b becomes possible, read the queued remote instruction and transmit it to the remote device 400b.

上記構成によれば、送信部１３１は、一時的にリモート装置４００ｂとの通信が不可能な場合でもその後通信が可能となった際にリモート指示をリモート装置４００ｂに送信することができる。 According to the above configuration, even if communication with the remote device 400b is temporarily impossible, the transmission unit 131 can transmit a remote instruction to the remote device 400b when communication becomes possible thereafter.

送信部１３１は、例えば、判定部１１１の判定によりローカル装置４００ａとの通信が可能な場合、ローカル指示をローカル装置４００ａに送信してもよい。他方送信部１３１は、例えば、ローカル装置４００ａとの通信が不可能な場合、ローカル指示をキューイングし、その後ローカル装置４００ａとの通信が可能となった際にキューイングされたローカル指示を読み出してローカル装置４００ａに送信してもよい。 For example, when the determination unit 111 determines that communication with the local device 400a is possible, the transmission unit 131 may transmit the local instruction to the local device 400a. Conversely, when communication with the local device 400a is impossible, the transmission unit 131 may, for example, queue the local instruction and, once communication with the local device 400a becomes possible, read the queued local instruction and transmit it to the local device 400a.

送信部１３１は、例えば、特定部１１５により特定されたリモート指示の実行タイミングにおいてリモート装置４００ｂとの通信が不可能な場合、リモート装置４００ｂへのリモート指示の送信を取り止めてもよい。送信部１３１は、例えば、実行タイミングが「（音声の取得時点から）５分後」でかつこの取得時点から５分を超えてリモート装置４００ｂとの通信が不可能な場合、リモート装置４００ｂへのリモート指示の送信を取り止める。 For example, when communication with the remote device 400b is impossible at the execution timing of the remote instruction specified by the specifying unit 115, the transmission unit 131 may cancel transmission of the remote instruction to the remote device 400b. For example, when the execution timing is "5 minutes later (from the time the voice was acquired)" and communication with the remote device 400b remains impossible for more than 5 minutes from that time, the transmission unit 131 cancels transmission of the remote instruction to the remote device 400b.

上記構成によれば、送信部１３１は、実行の時期を逸したリモート指示をリモート装置４００ｂに送信しないことができる。このため上記構成によれば、送信部１３１は、リモート装置４００ｂへの余計・冗長な指示の送信を抑止することができる。 According to the above configuration, the transmission unit 131 can refrain from transmitting to the remote device 400b a remote instruction whose time for execution has already passed. Therefore, according to the above configuration, the transmission unit 131 can suppress the transmission of unnecessary or redundant instructions to the remote device 400b.
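The queuing and cancellation behaviour of the transmission unit 131 can be sketched in Python roughly as follows. All names (`RemoteInstruction`, `RemoteInstructionQueue`, `flush`) are hypothetical. As a simplifying assumption, every instruction that is still valid is sent as soon as communication is restored, including instructions scheduled for a later time, which the remote device is assumed to schedule on its own.

```python
from collections import deque
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Deque, List, Optional


@dataclass
class RemoteInstruction:
    command: str
    execute_at: Optional[datetime] = None  # None: no execution timing was specified


class RemoteInstructionQueue:
    """Holds remote instructions while the remote device is unreachable."""

    def __init__(self, send: Callable[[RemoteInstruction], None]) -> None:
        self._send = send
        self._pending: Deque[RemoteInstruction] = deque()

    def enqueue(self, instruction: RemoteInstruction) -> None:
        self._pending.append(instruction)

    def flush(self, now: datetime) -> List[RemoteInstruction]:
        """Call when communication is restored.

        Sends instructions that are still valid, in arrival order, and returns
        the ones cancelled because their execution timing has already passed.
        """
        cancelled: List[RemoteInstruction] = []
        while self._pending:
            instruction = self._pending.popleft()
            if instruction.execute_at is not None and instruction.execute_at < now:
                cancelled.append(instruction)
            else:
                self._send(instruction)
        return cancelled
```

The list returned by `flush` is what a response generator could use to tell the user that a remote instruction was cancelled.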

送信部１３１は、例えば、バッチ処理時間帯を記憶する記憶部を参照して、リモート指示の実行タイミングが特定の日時および即時ではない場合、その後リモート装置との通信が可能となった際にバッチ処理時間帯にキューイングされたリモート指示を読み出してリモート装置に一括または順次送信してもよい。この「バッチ処理時間帯」とは、相対的に負荷の高い通信処理や緊急度の低い通信処理を行うための時間帯である。バッチ処理時間帯は、例えば、オンラインリアルタイム処理が少ない夜間や休日などの時間帯が設定されてもよい。またここでいう「記憶部」は、自装置の記憶部１５０であってもよいし、他の装置の記憶部であってもよい。 For example, the transmission unit 131 may refer to a storage unit that stores a batch processing time zone and, when the execution timing of a remote instruction is neither a specific date and time nor immediate, read the queued remote instructions during the batch processing time zone after communication with the remote device becomes possible, and transmit them to the remote device collectively or sequentially. This "batch processing time zone" is a time zone for performing communication processing with a relatively high load or a low degree of urgency. The batch processing time zone may be set, for example, to a time zone such as nighttime or holidays when there is little online real-time processing. The "storage unit" referred to here may be the storage unit 150 of the own device or a storage unit of another device.

上記構成によれば、送信部１３１は、特定の日時および即時ではないリモート指示に関してはバッチ処理時間帯にリモート装置４００ｂに送信することができる。このため上記構成によれば、送信部１３１は、通信が復旧した際にキューイングされたリモート指示を全量送信することなく優先度（重要度・緊急度）の高い一部の指示に限定して送信し、その後のバッチ処理時間帯に残りを回すことができる。 According to the above configuration, the transmission unit 131 can transmit remote instructions that are neither tied to a specific date and time nor immediate to the remote device 400b during the batch processing time zone. Therefore, according to the above configuration, when communication is restored the transmission unit 131 need not transmit all of the queued remote instructions; it can limit transmission to the subset of instructions with high priority (importance or urgency) and defer the rest to a subsequent batch processing time zone.
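Checking whether the current time falls inside the batch processing time zone is a small but error-prone step, because a nighttime window usually wraps past midnight. The Python sketch below shows one way to handle that. The function name `in_batch_window` and the half-open interval convention are assumptions, and holiday handling is left out.

```python
from datetime import time


def in_batch_window(now: time, start: time, end: time) -> bool:
    """Return True if `now` lies in the half-open window [start, end).

    A window whose start is later than its end (for example 22:00 to 05:00)
    is treated as wrapping past midnight.
    """
    if start <= end:
        return start <= now < end
    return now >= start or now < end
```

With a window of 22:00 to 05:00, for example, deferred remote instructions would be held at noon and released at 23:30.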

送信部１３１は、例えば、バッチ処理時間帯を記憶する記憶部を参照して、所定期間蓄積された音声データを、その後リモート装置との通信が可能となった際にバッチ処理時間帯にサーバ装置２００に送信してもよい。 For example, the transmission unit 131 may refer to the storage unit that stores the batch processing time zone and transmit voice data accumulated over a predetermined period to the server device 200 during the batch processing time zone after communication with the remote device becomes possible.

上記構成によれば、送信部１３１は、例えばオフラインで大量の議事録音声を取得したものを、サーバ２００において一括で文字起こしする場合に、他の優先する処理やユーザに対する応答に影響を与えずに処理を行うことができる。 According to the above configuration, when, for example, a large amount of minutes audio acquired offline is transcribed in a batch on the server 200, the transmission unit 131 can carry out the processing without affecting other higher-priority processing or responses to the user.

ここで図５を参照して、自装置に対するローカル指示・リモート装置４００ｂに対するリモート指示と、その実行・キューイングとの関係の一例を説明する。図５に示すように、対話装置１００は、自装置に対するローカル指示の実行タイミングが即時の場合、ローカル指示をキューイングすることなく即時実行してもよい。また対話装置１００は、ローカル指示の実行タイミングが特定の日時に指定または何ら指定がない場合、ローカル指示をキューイングして、指定された日時または順次キューイングから読み出して実行してもよい。 Here, with reference to FIG. 5, an example of the relationship between local instructions to the own device and remote instructions to the remote device 400b on the one hand, and their execution and queuing on the other, will be described. As shown in FIG. 5, when the execution timing of a local instruction to the own device is immediate, the dialogue device 100 may execute the local instruction immediately without queuing it. When the execution timing of the local instruction is specified as a particular date and time, or is not specified at all, the dialogue device 100 may queue the local instruction and then read it from the queue and execute it, either at the specified date and time or in order.

対話装置１００は、リモート指示の実行タイミングが即時の場合、リモート指示をキューイングすることなく即時リモート装置４００ｂに送信してもよい。また対話装置１００は、リモート指示の実行タイミングが特定の日時に指定または何ら指定がない場合、リモート指示をキューイングして、指定された日時または順次キューイングから読み出してリモート装置４００ｂに送信してもよい。 When the execution timing of a remote instruction is immediate, the dialogue device 100 may transmit the remote instruction to the remote device 400b immediately without queuing it. When the execution timing of the remote instruction is specified as a particular date and time, or is not specified at all, the dialogue device 100 may queue the remote instruction and then read it from the queue and transmit it to the remote device 400b, either at the specified date and time or in order.
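The relationship described with reference to FIG. 5 amounts to a small decision table, sketched below in Python. The enum and function names (`Target`, `Timing`, `plan`) and the returned labels are hypothetical. The table reflects only the text above, not the figure itself.

```python
from enum import Enum


class Target(Enum):
    LOCAL = "local"    # instruction to the own device
    REMOTE = "remote"  # instruction to the remote device 400b


class Timing(Enum):
    IMMEDIATE = "immediate"
    SPECIFIED = "specified"      # a specific date and time was given
    UNSPECIFIED = "unspecified"  # no timing was given


def plan(target: Target, timing: Timing) -> str:
    """Return how the dialogue device handles an instruction."""
    if timing is Timing.IMMEDIATE:
        # Immediate instructions bypass the queue entirely.
        return "execute_now" if target is Target.LOCAL else "send_now"
    # Everything else is queued and later read out at the specified
    # date and time, or in order.
    return "queue"
```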

図４に戻って説明を続ける。受信部１３２は、音声認識システム３００から、送信部１３１が送信した音声データの認識結果を示す第１認識情報を受信する。 The explanation will be continued by returning to FIG. The receiving unit 132 receives the first recognition information indicating the recognition result of the voice data transmitted by the transmitting unit 131 from the voice recognition system 300.

受信部１３２は、例えば、サーバ装置２００から、第１認識情報または第２認識情報に基づき生成された第２応答情報を受信してもよい。 The receiving unit 132 may receive, for example, the second response information generated based on the first recognition information or the second recognition information from the server device 200.

音声取得部１２０は、ユーザの音声を取得する。 The voice acquisition unit 120 acquires the user's voice.

出力部１４０は、第１応答情報または第２応答情報に基づき、音声に対する応答を出力する。出力部１４０の出力態様は、どのような態様でもよい。出力部１４０の出力態様は、例えば、音声出力、画面出力、ファイル出力またはメッセージ出力などが考えられる。 The output unit 140 outputs a response to the voice based on the first response information or the second response information. The output mode of the output unit 140 may be any mode. The output mode of the output unit 140 may be, for example, voice output, screen output, file output, message output, or the like.

上記構成によれば、出力部１４０は、サーバ装置２００や音声認識システム３００との通信が不可能な場合でも、ユーザの音声に応答することができる。このため上記構成によれば、ユーザは、オフライン環境でも通常どおり対話装置１００を利用することができる。また上記構成によれば、出力部１４０は、自装置で生成した第１応答情報だけではなく、サーバ装置２００が生成した第２応答情報を利用することもできる。 According to the above configuration, the output unit 140 can respond to the user's voice even when communication with the server device 200 or the voice recognition system 300 is impossible. Therefore, according to the above configuration, the user can use the dialogue device 100 as usual even in an offline environment. Further, according to the above configuration, the output unit 140 can use not only the first response information generated by the own device but also the second response information generated by the server device 200.

記憶部１５０は、音声データ、第１認識情報、第２認識情報、第１応答情報、第２応答情報、第１セッション情報、第２セッション情報、記録された議事録を示す議事録情報または設定情報などを記憶する。ここで「設定情報」とは、対話装置１００が動作するために設定されているパラメータを示す情報である。設定情報は、バッチ処理時間帯を含んでもよい。 The storage unit 150 stores voice data, the first recognition information, the second recognition information, the first response information, the second response information, the first session information, the second session information, minutes information indicating recorded minutes, setting information, and the like. Here, the "setting information" is information indicating parameters set for the dialogue device 100 to operate. The setting information may include a batch processing time zone.

記憶部１５０は、データベースマネジメントシステム（ＤＢＭＳ）を利用して上記の情報を記憶してもよいし、ファイルシステムを利用して上記の情報を記憶してもよい。ＤＢＭＳを利用する場合は、上記の情報ごとにテーブルを設けて、テーブル間を関連付けてこれらの情報を管理してもよい。 The storage unit 150 may use a database management system (DBMS) to store the above information, or may use a file system to store the above information. When using the DBMS, a table may be provided for each of the above information, and the tables may be associated with each other to manage the information.

図６を参照して、本実施形態に係るサーバ装置２００の機能構成を説明する。図６に示すように、サーバ装置２００は、制御部２１０と、通信部２３０と、記憶部２５０と、を備える。通信部２３０と記憶部２５０の機能は、対話装置１００の通信部１３０と記憶部１５０と同様のため説明を割愛する。 The functional configuration of the server device 200 according to the present embodiment will be described with reference to FIG. As shown in FIG. 6, the server device 200 includes a control unit 210, a communication unit 230, and a storage unit 250. Since the functions of the communication unit 230 and the storage unit 250 are the same as those of the communication unit 130 and the storage unit 150 of the dialogue device 100, the description thereof will be omitted.

制御部２１０は、判定部２１１と、音声認識部２１２と、応答生成部２１３と、を備える。また制御部２１０は、例えば、装置制御部２１５を備えてもよい。各機能部の機能は、対話装置１００の判定部１１１と、音声認識部１１２と、応答生成部１１３と、装置制御部１１６と同様のため説明を割愛する。 The control unit 210 includes a determination unit 211, a voice recognition unit 212, and a response generation unit 213. Further, the control unit 210 may include, for example, a device control unit 215. Since the functions of the functional units are the same as those of the determination unit 111 of the dialogue device 100, the voice recognition unit 112, the response generation unit 113, and the device control unit 116, the description thereof will be omitted.

＜４．動作例＞ <4. Operation example>
図７を参照して、対話装置１００の動作例を説明する。なお、以下に示す図７の動作例の処理の順番は一例であって、適宜、変更されてもよい。 An operation example of the dialogue device 100 will be described with reference to FIG. 7. The order of processing of the operation example of FIG. 7 shown below is an example, and may be changed as appropriate.

図７に示すように、対話装置１００の音声取得部１２０は、ユーザの音声を取得する（Ｓ１０）。次いで判定部１１１は、音声認識システム３００との通信が可能か否か判定する（Ｓ１１）。 As shown in FIG. 7, the voice acquisition unit 120 of the dialogue device 100 acquires the user's voice (S10). Next, the determination unit 111 determines whether or not communication with the voice recognition system 300 is possible (S11).

判定部１１１の判定により音声認識システム３００との通信が可能な場合（Ｓ１２のＹｅｓ）、音声取得部１２０により取得された音声の音声データを音声認識システムに送信する（Ｓ１３）。音声認識システム３００から、この音声データの認識結果を示す第１認識情報を受信する（Ｓ１４）。 When communication with the voice recognition system 300 is possible by the determination of the determination unit 111 (Yes in S12), the voice data of the voice acquired by the voice acquisition unit 120 is transmitted to the voice recognition system (S13). The first recognition information indicating the recognition result of the voice data is received from the voice recognition system 300 (S14).

判定部１１１の判定により音声認識システム３００との通信が不可能な場合（Ｓ１２のＮｏ）、音声認識部１１２は、音声取得部１２０により取得された音声を認識し、この認識結果を示す第２認識情報を生成する（Ｓ１５）。 When the determination unit 111 determines that communication with the voice recognition system 300 is impossible (No in S12), the voice recognition unit 112 recognizes the voice acquired by the voice acquisition unit 120 and generates the second recognition information indicating the recognition result (S15).

応答生成部１１３は、第１認識情報または第２認識情報に基づき、第１応答情報を生成する（Ｓ１６）。出力部１４０は、第１応答情報に基づき、ユーザの音声に対する応答を出力する（Ｓ１７）。 The response generation unit 113 generates the first response information based on the first recognition information or the second recognition information (S16). The output unit 140 outputs a response to the user's voice based on the first response information (S17).
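The flow of steps S10 to S17 can be summarized in a short Python sketch. The function `handle_utterance` and its callback parameters are hypothetical stand-ins for the functional units. Real recognizers and response generators are outside the scope of this illustration.

```python
from typing import Callable


def handle_utterance(
    audio: bytes,                                # S10: voice already acquired
    is_online: Callable[[], bool],               # determination unit 111
    recognize_remote: Callable[[bytes], str],    # voice recognition system 300
    recognize_local: Callable[[bytes], str],     # voice recognition unit 112
    generate_response: Callable[[str], str],     # response generation unit 113
    output: Callable[[str], None],               # output unit 140
) -> str:
    """Run one pass of the flow and return the first response information."""
    if is_online():                              # S11, S12
        recognition = recognize_remote(audio)    # S13, S14: first recognition information
    else:
        recognition = recognize_local(audio)     # S15: second recognition information
    response = generate_response(recognition)    # S16
    output(response)                             # S17
    return response
```

Because the response is generated from whichever recognition result is available, the caller sees the same interface whether the device is online or offline.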

＜５．ハードウェア構成＞ <5. Hardware configuration>
図８を参照して、上述してきた対話装置１００およびサーバ装置２００をコンピュータ８００により実現する場合のハードウェア構成の一例を説明する。なお、それぞれの装置の機能は、複数台の装置に分けて実現することもできる。 With reference to FIG. 8, an example of the hardware configuration in the case where the above-described dialogue device 100 and server device 200 are realized by the computer 800 will be described. The function of each device can be realized by dividing it into a plurality of devices.

図８に示すように、コンピュータ８００は、プロセッサ８０１と、メモリ８０３と、記憶装置８０５と、入力Ｉ／Ｆ部８０７と、データＩ／Ｆ部８０９と、通信Ｉ／Ｆ部８１１、表示装置８１３、音声入力装置８１７および音声出力装置８１９を含む。 As shown in FIG. 8, the computer 800 includes a processor 801, a memory 803, a storage device 805, an input I / F unit 807, a data I / F unit 809, a communication I / F unit 811, a display device 813, a voice input device 817, and an audio output device 819.

プロセッサ８０１は、メモリ８０３に記憶されているプログラムを実行することによりコンピュータ８００における様々な処理を制御する。例えば、対話装置１００の制御部１１０やサーバ装置２００の制御部２１０が備える各機能部などは、メモリ８０３に一時記憶されたプログラムをプロセッサ８０１が実行することにより実現可能である。 The processor 801 controls various processes in the computer 800 by executing a program stored in the memory 803. For example, each functional unit included in the control unit 110 of the dialogue device 100 and the control unit 210 of the server device 200 can be realized by the processor 801 executing a program temporarily stored in the memory 803.

メモリ８０３は、例えばＲＡＭ（ＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）等の記憶媒体である。メモリ８０３は、プロセッサ８０１によって実行されるプログラムのプログラムコードや、プログラムの実行時に必要となるデータを一時的に記憶する。 The memory 803 is, for example, a storage medium such as a RAM (Random Access Memory). The memory 803 temporarily stores the program code of the program executed by the processor 801 and the data required when the program is executed.

記憶装置８０５は、例えばハードディスクドライブ（ＨＤＤ）やフラッシュメモリ等の不揮発性の記憶媒体である。記憶装置８０５は、オペレーティングシステムや、上記各構成を実現するための各種プログラムを記憶する。この他、記憶装置８０５は、音声データ、第１認識情報、第２認識情報、第１応答情報、第２応答情報、第１セッション情報、第２セッション情報、議事録情報または設定情報などを登録するテーブルと、このテーブルを管理するＤＢを記憶することも可能である。このようなプログラムやデータは、必要に応じてメモリ８０３にロードされることにより、プロセッサ８０１から参照される。 The storage device 805 is a non-volatile storage medium such as a hard disk drive (HDD) or a flash memory. The storage device 805 stores an operating system and various programs for realizing each of the above configurations. In addition, the storage device 805 registers voice data, first recognition information, second recognition information, first response information, second response information, first session information, second session information, minutes information, setting information, and the like. It is also possible to store the table to be used and the DB that manages this table. Such programs and data are referred to by the processor 801 by being loaded into the memory 803 as needed.

入力Ｉ／Ｆ部８０７は、ユーザからの入力を受け付けるためのデバイスである。入力Ｉ／Ｆ部８０７の具体例としては、キーボードやマウス、タッチパネル、各種センサ、ウェアラブル・デバイス等が挙げられる。入力Ｉ／Ｆ部８０７は、例えばＵＳＢ（ＵｎｉｖｅｒｓａｌＳｅｒｉａｌＢｕｓ）等のインタフェースを介してコンピュータ８００に接続されても良い。 The input I / F unit 807 is a device for receiving input from the user. Specific examples of the input I / F unit 807 include a keyboard, a mouse, a touch panel, various sensors, a wearable device, and the like. The input I / F unit 807 may be connected to the computer 800 via an interface such as USB (Universal Serial Bus).

データＩ／Ｆ部８０９は、コンピュータ８００の外部からデータを入力するためのデバイスである。データＩ／Ｆ部８０９の具体例としては、各種記憶媒体に記憶されているデータを読み取るためのドライブ装置等がある。データＩ／Ｆ部８０９は、コンピュータ８００の外部に設けられることも考えられる。その場合、データＩ／Ｆ部８０９は、例えばＵＳＢ等のインタフェースを介してコンピュータ８００へと接続される。 The data I / F unit 809 is a device for inputting data from the outside of the computer 800. Specific examples of the data I / F unit 809 include a drive device for reading data stored in various storage media. It is also conceivable that the data I / F unit 809 is provided outside the computer 800. In that case, the data I / F unit 809 is connected to the computer 800 via an interface such as USB.

通信Ｉ／Ｆ部８１１は、コンピュータ８００の外部の装置と有線または無線により、インターネットＮを介したデータ通信を行うためのデバイスである。通信Ｉ／Ｆ部８１１は、コンピュータ８００の外部に設けられることも考えられる。その場合、通信Ｉ／Ｆ部８１１は、例えばＵＳＢ等のインタフェースを介してコンピュータ８００に接続される。 The communication I / F unit 811 is a device for performing data communication via the Internet N by wire or wirelessly with an external device of the computer 800. It is also conceivable that the communication I / F unit 811 is provided outside the computer 800. In that case, the communication I / F unit 811 is connected to the computer 800 via an interface such as USB.

表示装置８１３は、各種情報を表示するためのデバイスである。表示装置８１３の具体例としては、例えば液晶ディスプレイや有機ＥＬ（Ｅｌｅｃｔｒｏ−Ｌｕｍｉｎｅｓｃｅｎｃｅ）ディスプレイ、ウェアラブル・デバイスのディスプレイ等が挙げられる。表示装置８１３は、コンピュータ８００の外部に設けられても良い。その場合、表示装置８１３は、例えばディスプレイケーブル等を介してコンピュータ８００に接続される。また、入力Ｉ／Ｆ部８０７としてタッチパネルが採用される場合には、表示装置８１３は、入力Ｉ／Ｆ部８０７と一体化して構成することが可能である。 The display device 813 is a device for displaying various information. Specific examples of the display device 813 include a liquid crystal display, an organic EL (Electro-Luminescence) display, a display of a wearable device, and the like. The display device 813 may be provided outside the computer 800. In that case, the display device 813 is connected to the computer 800 via, for example, a display cable or the like. Further, when the touch panel is adopted as the input I / F unit 807, the display device 813 can be integrally configured with the input I / F unit 807.

音声入力装置８１７は、マイクなどの音声を検出するための入力装置である。音声入力装置は、例えば、音声信号を含めたアナログ振動信号を取得するマイクロフォン、アナログ振動信号を増幅するアンプ、アナログ振動信号をデジタル信号に変換するＡ／Ｄ変換部などを備える。音声入力装置８１７は、例えば、ユーザが発する音声を検出する。 The voice input device 817 is an input device for detecting voice such as a microphone. The voice input device includes, for example, a microphone that acquires an analog vibration signal including a voice signal, an amplifier that amplifies the analog vibration signal, and an A / D conversion unit that converts the analog vibration signal into a digital signal. The voice input device 817 detects, for example, the voice emitted by the user.

音声出力装置８１９は、音声を出力するための出力装置であり、例えば、スピーカなどである。また音声出力装置８１９は、ヘッドフォンまたはイヤフォンに音を出力するための装置であってもよい。 The audio output device 819 is an output device for outputting audio, and is, for example, a speaker or the like. Further, the audio output device 819 may be a device for outputting sound to headphones or earphones.

［第２実施形態］ [Second Embodiment]
次に、本発明の第２実施形態（以下、「本実施形態」という）について説明する。本実施形態では、本実施形態に係る対話システム１ａが（４）ユーザと遠隔にいる相手（以下、「会話相手」という）との間の音声による会話を実現する例を用いて説明するが、これに限る趣旨ではない。以下、第１実施形態と異なる点を中心に説明する。 Next, a second embodiment of the present invention (hereinafter referred to as "the present embodiment") will be described. The present embodiment is described using an example in which the dialogue system 1a according to the present embodiment (4) realizes a voice conversation between a user and a remote party (hereinafter referred to as the "conversation partner"), but the embodiment is not limited to this example. The following description focuses on the points that differ from the first embodiment.

＜１．システム構成および概要＞ <1. System configuration and overview>
本実施形態に係る対話装置１００ａは、第１ネットワークＮ１を介して相手装置と接続されている。 The dialogue device 100a according to the present embodiment is connected to the partner device via the first network N1.

図９〜１０を参照して、対話システム１ａの概要を説明する。 The outline of the dialogue system 1a will be described with reference to FIGS. 9 to 10.

図９〜１０に示すように、先ず第１ネットワークＮ１において相手装置と通信可能（以下、単に「オンライン」ともいう）となっている場合、対話装置１００ａは相手装置に音声データをリアルタイムで送信する。すなわちユーザと会話相手とは、対話装置１００ａと相手装置を介してリアルタイムに通話することが可能である。このとき対話装置１００ａは、ユーザの発話データを記録しない。ここで「発話データ」とは、ユーザの音声の音声データから無音区間の少なくとも一部を除いたデータである。 As shown in FIGS. 9 to 10, first, when communication with the other device is possible on the first network N1 (hereinafter also simply referred to as "online"), the dialogue device 100a transmits voice data to the other device in real time. That is, the user and the conversation partner can talk in real time via the dialogue device 100a and the other device. At this time, the dialogue device 100a does not record the user's utterance data. Here, the "utterance data" is data obtained by removing at least a part of the silent sections from the voice data of the user's voice.

次に第１ネットワークＮ１においてオンラインから相手装置と通信不可能（以下、単に「オフライン」ともいう）となった場合、対話装置１００は、ユーザの発話データを記録する。このとき相手装置では当然にユーザの音声は出力されない。 Next, when the first network N1 goes from online to a state in which communication with the other device is impossible (hereinafter also simply referred to as "offline"), the dialogue device 100 records the user's utterance data. At this time, the other device naturally does not output the user's voice.

次に第１ネットワークＮ１においてオフラインから再びオンラインとなった場合、対話装置１００ａは、発話データを相手装置に送信する。またこのとき対話装置１００ａは、ユーザの発話データを引き続き記録する。対話装置１００ａはユーザの発話のタイミングと相手装置の発話データの再生タイミングのずれが解消されると発話データの記録を停止する。対話装置１００ａは発話データの送信が完了した後に相手装置への音声データの送信を再開する。すなわちユーザと会話相手とは、対話装置１００ａと相手装置を介してリアルタイムに通話することが再び可能となる。 Next, when the first network N1 goes from offline to online again, the dialogue device 100a transmits the utterance data to the other device. At this time, the dialogue device 100a continues to record the user's utterance data. The dialogue device 100a stops recording the utterance data when the difference between the utterance timing of the user and the reproduction timing of the utterance data of the other device is resolved. The dialogue device 100a resumes the transmission of voice data to the other device after the transmission of the utterance data is completed. That is, the user and the conversation partner can talk again in real time via the dialogue device 100a and the other party device.

上記構成によれば、対話システム１ａは、オンラインからオフラインとなった場合に、録音した音声データのうち無音区間の全部または一部を除いたデータを相手装置に送信することができる。このため上記構成によれば、音声データから無音区間を除いて相手装置に送信することにより、オフラインになったことによるユーザの発話タイミングと相手装置の再生タイミングのずれを解消することができる。したがって上記構成によれば、ユーザと会話相手との遠隔会話の途中で、相手装置との通信が不可能となった場合でも、対話システム１ａは、通信が可能となった後に会話相手にユーザが発話した情報をスムーズに伝えることができる。 According to the above configuration, when the connection goes from online to offline, the dialogue system 1a can transmit to the other device the recorded voice data with all or a part of the silent sections removed. Therefore, according to the above configuration, transmitting the voice data to the other device with the silent sections removed makes it possible to eliminate the difference between the user's utterance timing and the reproduction timing of the other device caused by going offline. Consequently, even if communication with the other device becomes impossible in the middle of a remote conversation between the user and the conversation partner, the dialogue system 1a can smoothly convey what the user said to the conversation partner after communication becomes possible again.

＜２．機能構成＞ <2. Functional configuration>
対話装置１００ａの機能構成の一例について説明する。対話装置１００ａは、第１実施形態に係る対話装置１００の音声取得部１２０、通信部１３０、出力部１４０および記憶部１５０を共通して備え、制御部１１０においては、判定部１１１を共通して備え、これらの機能部に加えて検出部と、を備える。 An example of the functional configuration of the dialogue device 100a will be described. The dialogue device 100a includes, in common with the dialogue device 100 according to the first embodiment, the voice acquisition unit 120, the communication unit 130, the output unit 140, and the storage unit 150. Its control unit 110 includes the determination unit 111 in common and, in addition to these functional units, a detection unit.

判定部１１１は、相手装置との通信が可能か否か判定する。 The determination unit 111 determines whether or not communication with the other device is possible.

検出部は、音声取得部１２０により取得された音声データから発話区間と無音区間とを検出する。検出部は、例えば、この音声データを一度録音し、録音された音声データから発話区間と無音区間とを検出してもよい。この「無音区間」とは、音声データにおいて音声レベルがゼロとなる区間である。また検出部は、検出の前処理として、音声データに対してイコライザー処理やタイムアライメント処理などの各種音響処理を行ってもよい。 The detection unit detects the utterance section and the silent section from the voice data acquired by the voice acquisition unit 120. For example, the detection unit may record this voice data once and detect the utterance section and the silence section from the recorded voice data. This "silence section" is a section in which the voice level becomes zero in the voice data. Further, the detection unit may perform various acoustic processes such as an equalizer process and a time alignment process on the voice data as a pre-process for detection.
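The section detection described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `detect_sections`, the integer sample representation, and the `threshold` parameter are assumptions made for illustration. With `threshold=0` the sketch follows the definition in the text (a silent section is one where the level is zero); a practical device would probably need a positive threshold.

```python
from typing import List, Tuple

# (kind, start_index, end_index_exclusive); kind is "utterance" or "silence"
Section = Tuple[str, int, int]


def detect_sections(samples: List[int], threshold: int = 0) -> List[Section]:
    """Split a sample sequence into alternating utterance and silent sections.

    A sample is treated as silent when abs(sample) <= threshold.
    The threshold parameter is a hypothetical addition; 0 matches the text.
    """
    def is_silent(value: int) -> bool:
        return abs(value) <= threshold

    sections: List[Section] = []
    start = 0
    for i in range(1, len(samples) + 1):
        # Close the current run at the end of the input or when the kind changes.
        if i == len(samples) or is_silent(samples[i]) != is_silent(samples[start]):
            kind = "silence" if is_silent(samples[start]) else "utterance"
            sections.append((kind, start, i))
            start = i
    return sections
```

For the input `[0, 0, 5, -3, 0, 7]` the sketch returns a silent section over indices 0 to 2, an utterance section over 2 to 4, a silent section over 4 to 5, and an utterance section over 5 to 6. Utterance data in the sense of the text would then be the samples of the utterance sections only.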

記録部は、相手装置との通信が不可能な場合、発話データを記録する。記録部は、例えば、録音された音声データから発話区間のデータを抽出して発話データを生成してもよい。記録部は、この生成した発話データを記憶部１５０に記録する。 The recording unit records the utterance data when communication with the other device is impossible. For example, the recording unit may extract the data of the utterance section from the recorded voice data to generate the utterance data. The recording unit records the generated utterance data in the storage unit 150.

記録部は、例えば、ユーザの発話タイミングと相手装置のユーザの発話の再生タイミングのずれの少なくとも一部が解消されるまで発話データを記録してもよい。 The recording unit may record the utterance data until, for example, at least a part of the difference between the utterance timing of the user and the reproduction timing of the user's utterance of the other device is eliminated.

記録部は、例えば、相手装置との通信が不可能となった期間（以下、「オフライン期間」という）の開始から合計した無音区間の長さが、オフライン期間の長さを超えるまで（例えば、図９の時点Ｐまで）発話データを記録してもよい。また記録部は、他の例として、オフライン期間の開始から合計した無音区間の長さが、オフライン期間における合計した発話区間の長さを超えるまで（例えば、図１０の時点Ｐ’まで）発話データを記録してもよい。 For example, the recording unit may record the utterance data until the total length of the silent sections, accumulated from the start of the period during which communication with the other device is impossible (hereinafter referred to as the "offline period"), exceeds the length of the offline period (for example, up to the time point P in FIG. 9). As another example, the recording unit may record the utterance data until the total length of the silent sections accumulated from the start of the offline period exceeds the total length of the utterance sections in the offline period (for example, up to the time point P' in FIG. 10).
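The two stop conditions can be made concrete with a small Python sketch. The function names, the `(kind, duration)` representation of sections, and the choice to report the instant at which the accumulated silence reaches the target are all assumptions for illustration; the figures themselves are not reproduced here.

```python
from typing import List, Optional, Tuple

# (kind, duration_in_seconds), listed in time order from the start of the
# offline period; kind is "utterance" or "silence". Hypothetical representation.
Section = Tuple[str, float]


def utterance_within_offline(sections: List[Section], offline_len: float) -> float:
    """Total utterance time that falls inside the offline period [0, offline_len)."""
    t = 0.0
    total = 0.0
    for kind, dur in sections:
        if kind == "utterance":
            total += max(0.0, min(t + dur, offline_len) - t)
        t += dur
    return total


def recording_stop_time(sections: List[Section], target: float) -> Optional[float]:
    """Time since the offline start at which accumulated silence reaches target.

    Returns None if the silence never accumulates past target.
    """
    t = 0.0
    silence = 0.0
    for kind, dur in sections:
        if kind == "silence":
            if silence + dur > target:
                return t + (target - silence)
            silence += dur
        t += dur
    return None


sections = [("utterance", 3), ("silence", 2), ("utterance", 4),
            ("silence", 5), ("utterance", 2), ("silence", 6)]
offline_len = 8.0

# First example in the text: silence accumulates to the offline period length.
p = recording_stop_time(sections, offline_len)
# Second example: silence accumulates to the utterance time within the offline period.
p_dash = recording_stop_time(sections, utterance_within_offline(sections, offline_len))
```

With these made-up durations, the first condition stops recording 17 seconds after the offline period began, and the second stops at 13 seconds, since only 6 seconds of speech fall inside the 8-second offline period. The second condition therefore ends recording earlier whenever part of the offline period was itself silent.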

送信部１３１は、音声取得部１２０により取得された音声データを相手装置に送信する。送信部１３１は、相手装置との通信が可能となった場合相手装置に発話データを送信し、発話データの送信が完了した後に音声データの送信を再開する。 The transmission unit 131 transmits the voice data acquired by the voice acquisition unit 120 to the other device. The transmission unit 131 transmits the utterance data to the other device when communication with the other device becomes possible, and resumes the transmission of the voice data after the transmission of the utterance data is completed.

上記構成によれば、対話システム１ａは、オンラインからオフラインとなった場合に、録音した音声データのうち無音区間の全部または一部を除いたデータを相手装置に送信することができる。このため上記構成によれば、音声データから無音区間を除いて相手装置に送信することにより、オフラインになったことによるユーザの発話タイミングと相手装置の再生タイミングのずれを解消することができる。 According to the above configuration, the dialogue system 1a can transmit the recorded voice data excluding all or a part of the silent section to the other device when going from online to offline. Therefore, according to the above configuration, by transmitting the voice data to the other device by removing the silent section, it is possible to eliminate the difference between the user's utterance timing and the reproduction timing of the other device due to being offline.

＜３．動作例＞ <3. Operation example>
図１１を参照して、対話装置１００ａの動作例を説明する。なお、以下に示す図１１の動作例の処理の順番は一例であって、適宜、変更されてもよい。 An operation example of the dialogue device 100a will be described with reference to FIG. 11. The order of processing in the operation example of FIG. 11 shown below is an example, and may be changed as appropriate.

図１１に示すように、対話装置１００の音声取得部１２０は、ユーザの音声を取得する（Ｓ２０）。次いで判定部１１１は、相手装置との通信が可能か否か判定する（Ｓ２１）。 As shown in FIG. 11, the voice acquisition unit 120 of the dialogue device 100 acquires the user's voice (S20). Next, the determination unit 111 determines whether or not communication with the other device is possible (S21).

判定部１１１の判定により相手装置との通信が不可能な場合（Ｓ２２のＮｏ）、記録部は発話データを記録する（Ｓ２３）。 When the determination unit 111 determines that communication with the other device is impossible (No in S22), the recording unit records the utterance data (S23).

判定部１１１の判定により相手装置との通信が可能な場合（Ｓ２２のＹｅｓ）、かつ相手装置への発話データの送信が完了していない場合（Ｓ２４のＮｏ）、送信部１３１は発話データを送信する（Ｓ２５）。またユーザの発話タイミングと相手装置のユーザの発話の再生タイミングのずれの少なくとも一部が解消されていない場合（Ｓ２６のＮｏ）、記録部は発話データを記録する（Ｓ２７）。 When the determination unit 111 determines that communication with the other device is possible (Yes in S22) and the transmission of the utterance data to the other device is not completed (No in S24), the transmission unit 131 transmits the utterance data (S25). Further, when at least a part of the difference between the user's utterance timing and the timing at which the other device reproduces the user's utterance has not been resolved (No in S26), the recording unit records the utterance data (S27).

判定部１１１の判定により相手装置との通信が可能な場合（Ｓ２２のＹｅｓ）、かつ相手装置への発話データの送信が完了した場合（Ｓ２４のＹｅｓ）、送信部１３１は音声データを送信する（Ｓ２８）。 When the determination unit 111 determines that communication with the other device is possible (Yes in S22) and the transmission of the utterance data to the other device is completed (Yes in S24), the transmission unit 131 transmits the voice data (S28).

ユーザと会話相手が会話を続ける場合（Ｓ２９のＹｅｓ）、ステップＳ２０の前に戻る。 When the user and the conversation partner continue the conversation (Yes in S29), the process returns to before step S20.
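The flow of steps S20 to S29 can be sketched as a per-frame loop in Python. This is a simplified illustration under stated assumptions, not the device's actual control logic: the class `DialogueDevice`, its `step` method, and the `backlog` queue are hypothetical names; one audio frame is handled per call; one recorded frame is sent per call while catching up; and an empty backlog is used as a stand-in both for "transmission of utterance data completed" (S24) and for "timing difference resolved" (S26).

```python
from collections import deque
from typing import Deque, List, Tuple


class DialogueDevice:
    """Hypothetical model of the S20 to S29 loop, one audio frame per step."""

    def __init__(self) -> None:
        # Recorded utterance data not yet transmitted (stands in for storage unit 150).
        self.backlog: Deque[bytes] = deque()
        # What the other device has received, in order, tagged by data type.
        self.sent: List[Tuple[str, bytes]] = []

    def step(self, frame: bytes, is_silent: bool, online: bool) -> None:
        # S20: a frame of the user's voice has been acquired (passed in as frame).
        # S21/S22: online carries the result of the communication check.
        if not online:
            # S23: record utterance data, i.e. skip silent frames.
            if not is_silent:
                self.backlog.append(frame)
            return
        if self.backlog:
            # S24 No -> S25: transmit the oldest recorded utterance data.
            self.sent.append(("utterance", self.backlog.popleft()))
            # S26 No -> S27: still behind real time, so keep recording speech.
            if not is_silent:
                self.backlog.append(frame)
            return
        # S24 Yes -> S28: backlog drained, so transmit live voice data again.
        self.sent.append(("voice", frame))


device = DialogueDevice()
frames = [
    (b"a", False, True),      # online: sent live
    (b"b", False, False),     # offline: recorded
    (b"\x00", True, False),   # offline silence: dropped
    (b"c", False, False),     # offline: recorded
    (b"d", False, True),      # back online: send b, record d
    (b"\x00", True, True),    # silence lets the backlog shrink: send c
    (b"\x00", True, True),    # send d, backlog now empty
    (b"e", False, True),      # live transmission resumes
]
for frame, silent, online in frames:  # S29: repeat while the conversation continues
    device.step(frame, silent, online)
```

In this run the other device receives `a` live, then `b`, `c`, and `d` as recorded utterance data, then `e` live again. The backlog only shrinks during silent frames, which is how removing silent sections lets the playback on the other device catch up with the user.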

なお、本実施形態は、本発明を説明するための例示であり、本発明をその実施の形態のみに限定する趣旨ではない。また、本発明は、その要旨を逸脱しない限り、さまざまな変形が可能である。さらに、当業者であれば、以下に述べる各要素を均等なものに置換した実施の形態を採用することが可能であり、かかる実施の形態も本発明の範囲に含まれる。 It should be noted that the present embodiment is an example for explaining the present invention, and is not intended to limit the present invention to that embodiment alone. Further, the present invention can be modified in various ways as long as it does not deviate from the gist thereof. Furthermore, those skilled in the art can adopt embodiments in which each element described below is replaced with an equivalent, and such embodiments are also included in the scope of the present invention.

１、１ａ…対話システム、１００、１００ａ…対話装置、１１０…制御部、１１１…判定部、１１２…音声認識部、１１３…応答生成部、１１４…識別部、１１５…特定部、１１６…装置制御部、１２０…音声取得部、１３０…通信部、１３１…送信部、１３２…受信部、１４０…出力部、１５０…記憶部、２００…サーバ装置、２１０…制御部、２１１…判定部、２１２…音声認識部、２１３…応答生成部、２１４…識別部、２１５…装置制御部、２３０…通信部、２５０…記憶部、３００…音声認識システム、４００…装置、４００ａ…ローカル装置、４００ｂ…リモート装置、８００…コンピュータ、８０１…プロセッサ、８０３…メモリ、８０５…記憶装置、８０７…入力Ｉ／Ｆ部、８０９…データＩ／Ｆ部、８１１…通信Ｉ／Ｆ部、８１３…表示装置、８１７…音声入力装置、８１９…音声出力装置。 1, 1a ... Dialogue system, 100, 100a ... Dialogue device, 110 ... Control unit, 111 ... Determination unit, 112 ... Voice recognition unit, 113 ... Response generation unit, 114 ... Identification unit, 115 ... Specifying unit, 116 ... Device control unit, 120 ... Voice acquisition unit, 130 ... Communication unit, 131 ... Transmission unit, 132 ... Receiving unit, 140 ... Output unit, 150 ... Storage unit, 200 ... Server device, 210 ... Control unit, 211 ... Determination unit, 212 ... Voice recognition unit, 213 ... Response generation unit, 214 ... Identification unit, 215 ... Device control unit, 230 ... Communication unit, 250 ... Storage unit, 300 ... Voice recognition system, 400 ... Device, 400a ... Local device, 400b ... Remote device, 800 ... Computer, 801 ... Processor, 803 ... Memory, 805 ... Storage device, 807 ... Input I/F unit, 809 ... Data I/F unit, 811 ... Communication I/F unit, 813 ... Display device, 817 ... Voice input device, 819 ... Audio output device.

Claims

An information processing device that connects via a network to a voice recognition system that recognizes voice, the information processing device comprising:
a voice acquisition unit that acquires a user's voice;
a determination unit that determines whether communication with the voice recognition system is possible;
a transmission unit that transmits voice data of the acquired voice to the voice recognition system when communication with the voice recognition system is possible;
a receiving unit that receives, from the voice recognition system, first recognition information indicating a recognition result of the voice data;
a voice recognition unit that recognizes the acquired voice and generates second recognition information indicating a recognition result when communication with the voice recognition system is impossible;
a response generation unit that generates, based on the first recognition information or the second recognition information, first response information for responding to the voice; and
an output unit that outputs a response to the voice based on the first response information.

The information processing device is connected via the network to a server device that generates, based on the first recognition information or the second recognition information, second response information for responding to the voice.
The determination unit determines whether communication with the server device is possible.
When communication with the server device is possible, the transmission unit transmits the first recognition information or the second recognition information to the server device.
The receiving unit receives, from the server device, the second response information generated based on the first recognition information or the second recognition information.
The output unit outputs a response to the voice based on the received second response information.
The information processing device according to claim 1.

An identification unit is further provided that identifies, based on the first recognition information or the second recognition information, whether the user's voice is a remote instruction to a remote device outside a predetermined network to which the information processing device is connected.
When the voice is the remote instruction, the determination unit determines whether communication with the remote device is possible.
The transmission unit:
When communication with the remote device is possible, transmits the remote instruction to the remote device.
When communication with the remote device is not possible, queues the remote instruction and, once communication with the remote device becomes possible, reads the queued remote instruction and transmits it to the remote device.
The information processing device according to claim 1 or 2.

A specifying unit that specifies the execution timing of the remote instruction is further provided.
When communication with the remote device is not possible at the specified execution timing, the transmission unit cancels transmission of the remote instruction to the remote device.
The response generation unit generates, as a response to the voice of the remote instruction, the first response information indicating that the remote instruction has been canceled.
The information processing device according to claim 3.

The transmission unit refers to a storage unit that stores a batch processing time zone and, when the execution timing of the remote instruction is neither a specific date and time nor immediate and communication with the remote device subsequently becomes possible, reads the queued remote instruction during the batch processing time zone and transmits it to the remote device.
The information processing device according to claim 4.

When communication with the voice recognition system becomes impossible while the transmission unit is transmitting the voice data to the voice recognition system, the voice recognition unit refers to first session information regarding the session established when the transmission unit transmits the voice data to the voice recognition system, and generates the second recognition information based on the voice of the voice data that has not been transmitted.
The information processing device according to any one of claims 1 to 5.

When communication with the server device becomes impossible while the transmission unit is transmitting the first recognition information or the second recognition information to the server device, the response generation unit refers to second session information regarding the session established when the transmission unit transmits the first recognition information or the second recognition information to the server device, and generates the first response information based on the untransmitted first recognition information or second recognition information.
The information processing device according to claim 2.

The information processing device is connected via the network to the other device of a party with whom the user holds a voice conversation.
The determination unit determines whether communication with the other device is possible.
The transmission unit transmits the voice data of the acquired voice to the other device.
The information processing device further comprises:
A detection unit that detects an utterance section and a silent section from the voice data, and
A recording unit that, when communication with the other device is impossible, records utterance data obtained by removing at least a part of the silent section from the voice data.
The transmission unit transmits the utterance data to the other device when communication with the other device becomes possible, and resumes the transmission of the voice data after the transmission of the utterance data is completed.
The information processing device according to any one of claims 1 to 7.

A program that causes an information processing device that connects via a network to a voice recognition system that recognizes voice to realize:
a voice acquisition function that acquires a user's voice;
a determination function that determines whether communication with the voice recognition system is possible;
a transmission function that transmits voice data of the acquired voice to the voice recognition system when communication with the voice recognition system is possible;
a receiving function that receives, from the voice recognition system, first recognition information indicating a recognition result of the voice data;
a voice recognition function that recognizes the acquired voice and generates second recognition information indicating a recognition result when communication with the voice recognition system is impossible;
a response generation function that generates, based on the first recognition information or the second recognition information, first response information for responding to the voice; and
an output function that outputs a response to the voice based on the first response information.

An information processing method in which an information processing device that connects via a network to a voice recognition system that recognizes voice:
acquires a user's voice;
determines whether communication with the voice recognition system is possible;
transmits voice data of the acquired voice to the voice recognition system when communication with the voice recognition system is possible;
receives, from the voice recognition system, first recognition information indicating a recognition result of the voice data;
recognizes the acquired voice and generates second recognition information indicating a recognition result when communication with the voice recognition system is impossible;
generates, based on the first recognition information or the second recognition information, first response information for responding to the voice; and
outputs a response to the voice based on the first response information.