JP2020021040A - Information processing unit, sound output method, and sound output program

Info

Publication number: JP2020021040A
Application number: JP2018147243A
Authority: JP
Inventors: 達郎五十嵐; Tatsuro Igarashi; 大樹坂内; Daiki Sakauchi
Original assignee: SoftBank Corp
Current assignee: SoftBank Corp
Priority date: 2018-08-03
Filing date: 2018-08-03
Publication date: 2020-02-06
Anticipated expiration: 2038-08-03
Also published as: JP6761007B2

Abstract

To provide a sound output device that executes control in response to a sound instruction made by a user. SOLUTION: The sound output device includes an acceptance unit that accepts input of sound information indicating speech sound made by a user, a generation unit that generates robot speech data corresponding to a response to the speech sound on the basis of the sound information, and an output unit that outputs robot speech data. In a case where second sound information is accepted within a prescribed time after first sound information is accepted, and when a second word in the same category as a first word included in the first sound information is included in the second sound information, the generation unit generates the robot speech data at least on the basis of the second sound information. SELECTED DRAWING: Figure 1

Description

The present invention relates to an information processing device that performs control based on voice instructions from a user, a voice output method, and a voice output program.

The development of devices that use artificial intelligence has been remarkable. Some of these devices carry out what a user instructs by voice. For example, Patent Literature 1 discloses an information processing device (smart speaker) that plays music, sets off alarms, performs calculations, and controls other devices (for example, lighting devices) in accordance with voice instructions from a user.

JP 2017-032895 A

A user giving an instruction sometimes misspeaks or changes their mind and then restates the instruction. A conventional smart speaker, however, responds to what the user said first, and therefore does not respond to the restatement. In such cases the user also has the nuisance of giving the instruction again, and cannot give the next instruction until the spoken response to the first instruction has finished.

The present invention has been made in view of the above problems, and its object is to provide a voice output device, a voice output method, and a voice output program that can respond appropriately even when a user restates an instruction.

To solve the above problems, an information processing device according to one aspect of the present invention includes: a reception unit that accepts input of voice information indicating speech uttered by a user; a generation unit that generates, based on the voice information, robot utterance data corresponding to a reply to the speech; and an output unit that outputs the robot utterance data. When second voice information is accepted within a predetermined time after first voice information is accepted, and the second voice information includes a second word in the same category as a first word included in the first voice information, the generation unit generates the robot utterance data based on at least the second voice information.

To solve the above problems, a voice output method according to one aspect of the present invention includes: a reception step of accepting input of voice information indicating speech uttered by a user; a generation step of generating, based on the voice information, robot utterance data corresponding to a reply to the speech; and an output step of outputting the robot utterance data. When second voice information is accepted within a predetermined time after first voice information is accepted, and the second voice information includes a second word in the same category as a first word included in the first voice information, the generation step generates the robot utterance data based on at least the second voice information.

To solve the above problems, a voice output program according to one aspect of the present invention causes a computer to realize: a reception function of accepting input of voice information indicating speech uttered by a user; a generation function of generating, based on the voice information, robot utterance data corresponding to a reply to the speech; and an output function of outputting the robot utterance data. When second voice information is accepted within a predetermined time after first voice information is accepted, and the second voice information includes a second word in the same category as a first word included in the first voice information, the generation function generates the robot utterance data based on at least the second voice information.

In the above information processing device, when the second voice information includes a word that negates the first voice information, the generation unit may generate robot utterance data based only on the second voice information.

In the above information processing device, when the second voice information includes a word that connects it to the first voice information, the generation unit may generate robot utterance data that responds to both the first voice information and the second voice information.

In the above information processing device, when the second voice information includes a second word of the same type as a first word included in the first voice information, the generation unit may generate robot utterance data that asks which of the two is correct.
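The three optional behaviors above (respond to the second utterance only, respond to both, or ask which is correct) can be sketched as a small decision function. This is only an illustration: the cue-word lists, the function name `choose_response_mode`, and the order in which the cues are checked are assumptions, since the text does not enumerate the words or state a precedence.

```python
# Hypothetical cue-word lists; the patent text does not enumerate them.
NEGATION_WORDS = {"no", "not", "wrong", "instead"}
CONNECTING_WORDS = {"and", "also", "plus"}


def choose_response_mode(second_words):
    """Pick a response mode for a second utterance already judged a restatement candidate.

    second_words: lower-case tokens of the second voice information.
    The check order (negation first) is an assumption of this sketch.
    """
    tokens = set(second_words)
    if tokens & NEGATION_WORDS:
        return "second_only"   # the second utterance negates the first
    if tokens & CONNECTING_WORDS:
        return "both"          # the second utterance adds to the first
    return "ask_which"         # no cue found: ask the user which one is correct


if __name__ == "__main__":
    print(choose_response_mode(["no", "shinagawa", "weather"]))
```

A real system would work on the output of morphological analysis rather than on plain tokens, but the branching would have the same shape.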

The above information processing device may further include a voice output unit that outputs voice based on the robot utterance data and a voice collection unit that collects the user's speech. The output unit outputs the robot utterance data to the voice output unit, and the reception unit receives, as voice information, the speech collected by the voice collection unit. If the second voice information is accepted within the predetermined time while the output unit is outputting robot utterance data based on the first voice information, and the second voice information includes a second word of the same type as a first word included in the first voice information, the output unit may stop outputting to the voice output unit the robot utterance data generated for the first voice information.

In the above information processing device, after stopping the output of the robot utterance data to the voice output unit, the output unit may output new robot utterance data based on the second voice information.

In the above information processing device, the output unit may output the robot utterance data to an external speaker, and the reception unit may accept, as voice information, speech collected by an external microphone. If second voice information based on speech from the user is accepted while the external speaker is outputting voice based on the robot utterance data for the first voice information, and the second voice information includes a second word of the same type as a first word included in the first voice information, the output unit may output a stop instruction that stops the output of the voice based on the robot utterance data generated for the first voice information.

In the above information processing device, after outputting the stop instruction, the output unit may output new robot utterance data based on the second voice information.
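The stop-then-replace behavior described for the external speaker can be sketched as a function that returns the messages to send when a second utterance arrives. The message format (dictionaries with a `type` key) and the function name are hypothetical; the text only states that a stop instruction is output first and new robot utterance data afterwards.

```python
def messages_for_second_input(outputting_first, is_restatement, new_utterance):
    """Return the messages an output unit might send to an external speaker.

    outputting_first: True while the reply to the first utterance is still playing.
    is_restatement:   True if the second utterance contains a same-type word.
    new_utterance:    robot utterance data generated for the second utterance.
    """
    messages = []
    if is_restatement and outputting_first:
        # Interrupt the reply that is currently being spoken.
        messages.append({"type": "stop"})
    if is_restatement:
        # Then deliver the reply to the restated instruction.
        messages.append({"type": "speak", "data": new_utterance})
    return messages
```

Keeping the stop instruction and the new utterance as two separate messages matches the two-step wording of the text, and lets the speaker cut off playback before the new data arrives.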

In the above information processing device, the generation unit may generate robot utterance data for asking the user a question and determine the category of word desired as an answer to that question. When the first voice information and the second voice information are accepted as answers to the question, the first voice information includes, as the first word, a word belonging to the determined category, and the second voice information includes, as the second word, a word belonging to the determined category, the generation unit may generate robot utterance data based on at least the second voice information.
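A minimal sketch of this inquiry case, assuming the device has asked a question whose expected answer category is known (for example, a place). The lexicon, the function names, and the fallback to the first answer when the second contains no matching word are assumptions made for illustration.

```python
# Hypothetical word-to-category lexicon.
LEXICON = {"tokyo": "place", "shinagawa": "place", "sunny": "weather"}


def words_in_category(words, category, lexicon=LEXICON):
    """Return the words that belong to the given category, in order."""
    return [w for w in words if lexicon.get(w) == category]


def pick_answer(expected_category, first_words, second_words):
    """Choose the answer word for an inquiry that expects `expected_category`."""
    first = words_in_category(first_words, expected_category)
    second = words_in_category(second_words, expected_category)
    if second:
        # Covers the restated case: the later answer takes priority.
        return second[0]
    # Assumption of this sketch: otherwise fall back to the first answer, if any.
    return first[0] if first else None
```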

An information processing device according to one aspect of the present invention can respond appropriately even when the user restates an instruction.

FIG. 1 is a diagram showing a configuration example of a communication system.
FIG. 2 is a block diagram showing a configuration example of a voice server.
FIG. 3 is a block diagram showing a configuration example of a smart speaker.
FIG. 4 is a data conceptual diagram showing a configuration example of a control model.
FIG. 5 is a sequence diagram showing an example of the exchange between the voice server and the smart speaker when the user does not restate an instruction.
FIG. 6 is a sequence diagram showing an example of the exchange between the voice server and the smart speaker when the user restates an instruction.
FIG. 7 is a flowchart showing an operation example of the voice server.

<Embodiment>
An embodiment of the present invention will be described with reference to the drawings.

The voice output device 100, which can serve as the information processing device according to the present invention, outputs speech that answers the content of a voice instruction from the user 10. When the user gives a voice instruction (inquiry), the voice output device 100 generates and outputs robot utterance data that answers it. In doing so, the voice output device 100 determines whether the user has restated the instruction and generates robot utterance data indicating the answer estimated to be appropriate. The voice output device 100 may be realized in any form: as a server device or computer system as shown in FIG. 1, or built into a speaker such as a smart speaker, a robot, or the like. The voice output device 100 may also be a control device for controlling a smart speaker, a robot, or an AI assistant.

Such a voice output device 100 is described below.

(System configuration)
As shown in FIG. 1, the communication system 1 includes a smart speaker 200, which is a device that accepts voice instructions (inquiries) from the user 10, and the voice output device 100, which generates robot utterance data indicating responses to those voice instructions.

The smart speaker 200 has a built-in microphone, continuously collects surrounding sound including the user's speech, and transmits the resulting voice data to the voice output device 100. The smart speaker 200 also outputs voice based on the robot utterance data transmitted from the voice output device 100.

The voice output device 100 receives the voice data, extracts the instruction of the user 10 from it, and generates robot utterance data indicating an answer to that instruction. It then transmits the generated robot utterance data to the smart speaker 200.

In the example of FIG. 1, the user 10 asks "Tell me the weather in Tokyo?" and then restates the request as "Oh, actually, tell me the weather in Shinagawa?", and the smart speaker 200 answers "The weather in Shinagawa, right? Today's weather in Shinagawa is...". Note that "Shinagawa" and "Shimbashi" are Japanese place names. In this way, under instruction from the voice output device 100, the smart speaker 200 according to the present embodiment can handle a restatement by the user 10 and respond to the restated instruction.

As shown in FIG. 1, the voice output device 100 is communicably connected to the smart speaker 200 via a network 300. Although not shown, an information processing device from which the voice output device 100 collects information may also be communicably connected to the network 300.

The network 300 interconnects the voice output device 100 and various devices, and is, for example, a wireless network or a wired network. Specifically, the network 300 may be a wireless LAN (WLAN), a wide area network (WAN), ISDN (integrated services digital network), LTE (Long Term Evolution), LTE-Advanced, fourth generation (4G), fifth generation (5G), CDMA (code division multiple access), WCDMA (registered trademark), Ethernet (registered trademark), or the like.

The network 300 is not limited to these examples and may be any network, for example a public switched telephone network (PSTN), Bluetooth (registered trademark), Bluetooth Low Energy, an optical line, an ADSL (Asymmetric Digital Subscriber Line) line, or a satellite communication network. When installed in the residence of the user 10, the network 300 may be called a home network.

The network 300 may also be, for example, NB-IoT (Narrow Band IoT) or eMTC (enhanced Machine Type Communication). NB-IoT and eMTC are wireless communication schemes for IoT that enable long-distance communication at low cost and with low power consumption.

The network 300 may be a combination of these, and may include a plurality of different networks that combine these examples. For example, the network 300 may include an LTE wireless network and a wired network such as an intranet, which is a closed network.

(Configuration example of voice output device)
FIG. 2 is a block diagram showing a configuration example of the voice output device 100. As shown in FIG. 2, the voice output device 100 includes, for example, a receiving unit 110, a storage unit 120, a control unit 130, and a transmitting unit 140. The voice output device 100 recognizes whether what the user has said is a restatement, identifies from that recognition the answer the user is seeking, and generates and outputs robot utterance data indicating that answer. That is, the voice output device 100 analyzes first voice information based on an utterance of the user and second voice information based on a subsequent utterance, and recognizes a restatement when the second voice information contains wording in the same category as wording contained in the first voice information. Here, a category is the type of information the user wants to know, as expressed in at least the first voice information and the second voice information. For example, when the user wants to know the weather at a certain place, the information indicating the place is the category; when the information the user wants concerns shops and the user specifies a type of shop by voice (for example, menswear, accessories, or furniture), the type of shop is the category. Categories are not limited to these examples.
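The same-category test described above can be sketched with a word-to-category lexicon. The lexicon contents and the function name `find_restated_pair` are hypothetical; the text does not say how categories are stored or looked up.

```python
# Hypothetical lexicon mapping recognized words to categories.
CATEGORY_LEXICON = {
    "tokyo": "place",
    "shinagawa": "place",
    "shimbashi": "place",
    "menswear": "shop_type",
    "furniture": "shop_type",
}


def find_restated_pair(first_words, second_words, lexicon=CATEGORY_LEXICON):
    """Return (first_word, second_word, category) if both utterances contain
    a word of the same category, otherwise None."""
    for second in second_words:
        category = lexicon.get(second)
        if category is None:
            continue  # this word carries no category
        for first in first_words:
            if lexicon.get(first) == category:
                return first, second, category
    return None


if __name__ == "__main__":
    print(find_restated_pair(["tokyo", "weather"], ["shinagawa", "weather"]))
```

In the FIG. 1 example, "Tokyo" and "Shinagawa" both fall under the place category, so the second utterance would be treated as a restatement.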

The receiving unit 110 is a communication interface that receives voice data from the smart speaker 200 via the network 300. The receiving unit 110 receives voice data indicating an instruction given by the user's voice and passes it to the control unit 130. The receiving unit 110 also receives information transmitted from other information processing devices (not shown) connected to the network 300.

The storage unit 120 stores the various programs and data that the voice output device 100 needs in order to operate. The storage unit 120 is realized by any of various storage media such as an HDD, an SSD, or a flash memory. The voice output device 100 may store a program in the storage unit 120 and execute it so that the control unit 130 performs the processing of each functional unit it contains. This program causes the voice output device 100 to realize each function executed by the control unit 130.

The storage unit 120 stores a voice analysis program that analyzes received voice data to estimate the content of the user's instruction, and a voice data generation program that generates robot utterance data indicating robot voice based on the analysis result. The storage unit 120 also stores answer model information 121, which is used to generate robot utterance data suited to the situation for the input voice. The answer model information 121 is described in detail later.

The control unit 130 controls each unit of the voice output device 100 and may be, for example, a central processing unit (CPU), a microprocessor, an ASIC, or an FPGA. The control unit 130 is not limited to these examples and may be of any type.

The control unit 130 includes a voice analysis unit 131 and a generation unit 132.

The voice analysis unit 131 analyzes the voice represented by the accepted voice information and passes the analysis result to the generation unit 132. The analysis may use conventional speech recognition technology: the input voice is converted into text data and its context is analyzed. Conventional morphological analysis, for example, can be used to analyze the context.

The generation unit 132 generates robot utterance data based on the analysis result passed to it. As in conventional devices, the generation unit 132 generates robot utterance data indicating an answer that corresponds to the analysis result of the accepted voice information. In addition to this conventional function, when the generation unit 132 receives an analysis result from the voice analysis unit 131, it generates the robot utterance data according to whether it received another analysis result within the preceding predetermined time. That is, when the analysis result of second voice information is received within the predetermined time after the first analysis result, obtained by analyzing first voice information, the generation unit 132 considers the flow of the conversation so far, decides whether to give priority to the first or the second voice information or to reply to both, and then generates robot utterance data that gives an answer suited to the situation. When the second voice information is accepted after the predetermined time has elapsed since the first voice information, the generation unit 132 may or may not create robot utterance data corresponding to the second voice information. The generation unit 132 decides what kind of answer to give in each situation by referring to the answer model information 121 stored in the storage unit 120, and generates the robot utterance data accordingly. The generation unit 132 passes the generated robot utterance data to the transmitting unit 140 and instructs it to transmit the data to the smart speaker 200.
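The timing check performed by the generation unit 132 can be sketched as a small tracker that classifies each incoming analysis result by how long ago the previous one arrived. The class name, the labels, and the 5-second default are assumptions; the text only speaks of a "predetermined time" without giving a value.

```python
class RestatementWindow:
    """Classify analysis results by their arrival time relative to the previous one."""

    def __init__(self, window_seconds=5.0):
        # Assumed value; the text does not give the length of the predetermined time.
        self.window_seconds = window_seconds
        self.last_time = None

    def observe(self, timestamp):
        """Record an arrival time (in seconds) and return its classification."""
        previous = self.last_time
        self.last_time = timestamp
        if previous is None:
            return "first"
        if timestamp - previous <= self.window_seconds:
            return "within_window"   # candidate restatement of the previous utterance
        return "after_window"        # treated as outside the predetermined time


if __name__ == "__main__":
    window = RestatementWindow()
    for t in (0.0, 3.0, 20.0):
        print(t, window.observe(t))
```

Passing timestamps in explicitly, instead of reading a clock inside the class, keeps the sketch deterministic and easy to test.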

The transmitting unit 140 is a communication interface that, in accordance with instructions from the control unit 130 (generation unit 132), transmits to the smart speaker 200 the robot utterance data that the smart speaker 200 is to speak.

The above is a configuration example of the voice output device 100.

(Configuration example of smart speaker)
FIG. 3 is a block diagram showing a configuration example of the smart speaker 200. As shown in FIG. 3, the smart speaker 200 includes a receiving unit 210, a storage unit 220, a speaker 230, a microphone 240, and a transmitting unit 250.

The receiving unit 210 is a communication interface that receives a control signal (voice data) from the voice output device 100 and passes the received control signal (voice data) to the speaker 230.

The storage unit 220 stores the various programs and data that the smart speaker 200 needs in order to operate. The storage unit 220 is realized by any of various storage media such as an HDD, an SSD, or a flash memory. The smart speaker 200 may store a program in the storage unit 220 and execute it so that a control unit (not shown) realizes the functions the smart speaker 200 is to provide. The storage unit 220 stores, for example, voice data collected by the microphone 240.

The speaker 230 plays back the control signal (voice data) received from the voice output device 100.

The microphone 240 collects sound around the smart speaker 200. The microphone 240 may consist of a single microphone or a plurality of microphones, and may be a directional microphone whose sound collection direction is limited. The microphone 240 stores voice data representing the collected sound in the storage unit 220.

The transmitting unit 250 is a communication interface that transmits the voice data stored in the storage unit 220 to the voice output device 100. The transmitting unit 250 may transmit the stored voice data continuously, or, when it detects that the user has given a voice instruction, may transmit voice data of a predetermined length before and after that instruction.
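The second transmission option, sending only a predetermined length of voice data around a detected instruction, amounts to clipping a buffer with a margin on each side. The sketch below assumes the stored voice data is a list of samples and that the start and end indices of the instruction are already known; how the instruction is detected is not covered, and the function name is hypothetical.

```python
def clip_around_instruction(samples, start, end, margin):
    """Return the samples from `margin` before `start` up to `margin` after `end`.

    samples: buffered voice data (any sequence that supports slicing).
    start, end: indices delimiting the detected instruction (end is exclusive).
    margin: number of extra samples to keep on each side.
    """
    lo = max(0, start - margin)            # do not run past the start of the buffer
    hi = min(len(samples), end + margin)   # do not run past the end of the buffer
    return samples[lo:hi]
```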

The above is a configuration example of the smart speaker 200.

（回答モデル情報１２１の構成例）
次に、回答モデル情報１２１の一例を、図４を用いて説明する。図４は、回答モデル情報１２１のデータ構成例を示すデータ概念図である。 (Example of configuration of answer model information 121)
Next, an example of the answer model information 121 will be described with reference to FIG. FIG. 4 is a data conceptual diagram showing a data configuration example of the answer model information 121.

図４に示すように、回答モデル情報１２１は、状況情報４１０と、対応情報４２０とが対応付けられた情報である。 As shown in FIG. 4, the answer model information 121 is information in which status information 410 and correspondence information 420 are associated with each other.

状況情報４１０は、音声情報の入力を受け付けている状況、受け付けた音声情報の解析結果から示される状況を示す情報である。 The status information 410 is information indicating a status in which input of voice information is being received and a status indicated from an analysis result of the received voice information.

対応情報４２０は、対応する状況情報４１０に応じて、音声出力装置１００が、どのような基準でロボット音声発話データを生成する（あるいは、生成しない）かを規定する情報である。 The correspondence information 420 is information that defines the basis on which the voice output device 100 generates (or does not generate) the robot voice utterance data according to the corresponding status information 410.

For example, when the status information 410 indicates that "the second voice information was acquired after a predetermined time had elapsed since the first voice information was received" and that "the second voice information contains no word corresponding to a question," the voice output device 100 responds as indicated by the corresponding correspondence information 420: it either "does not generate robot utterance data corresponding to the second voice information" or "generates robot utterance data that asks" what the utterance was.

As another example, when the status information 410 indicates that "the second voice information was acquired within a predetermined time after the first voice information was received," that "the second voice information contains a word in the same category as a word contained in the first voice information," and that "the second voice information contains a word that negates the first voice information," the voice output device 100 responds by "generating only robot utterance data corresponding to the second voice information."

In this way, when the situation set in the status information 410 arises, the voice output device 100 takes the action indicated by the corresponding correspondence information 420 and generates robot utterance data (or, in some cases, does not generate it).
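The association between status information 410 and correspondence information 420 can be pictured as a small rule table. The following Python sketch is only an illustration under assumptions: the names `Situation`, `Rule`, `ANSWER_MODEL`, and `look_up`, as well as the action labels, are hypothetical and do not appear in the specification, and only the two rows described above are encoded.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Situation:
    """Hypothetical summary of what the analysis found."""
    within_window: bool      # second voice information arrived within the predetermined time
    same_category: bool      # it shares a word category with the first voice information
    negates_first: bool      # it contains a word that negates the first voice information
    has_question_word: bool  # it contains a word corresponding to a question


@dataclass(frozen=True)
class Rule:
    matches: Callable[[Situation], bool]  # plays the role of status information 410
    action: str                           # plays the role of correspondence information 420


# Only the two rows described in the text; a real table would hold more.
ANSWER_MODEL: List[Rule] = [
    Rule(lambda s: not s.within_window and not s.has_question_word,
         "ignore_or_ask_again"),
    Rule(lambda s: s.within_window and s.same_category and s.negates_first,
         "answer_second_only"),
]


def look_up(situation: Situation, default: str = "no_matching_rule") -> str:
    """Return the action of the first rule whose situation matches."""
    for rule in ANSWER_MODEL:
        if rule.matches(situation):
            return rule.action
    return default
```

Keeping the rules as data rather than as nested conditionals mirrors the way the specification describes the answer model as stored information that the device consults.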

(Example of exchanges in communication system 1)
FIG. 5 is a sequence diagram showing the exchange between the smart speaker 200 and the voice output device 100 in communication system 1 when the user does not rephrase.

As shown in FIG. 5, the smart speaker 200 receives input of a voice (hereinafter, the first voice) from the user (step S501). The smart speaker 200 converts the received voice into digital data and transmits the resulting first voice information to the voice output device 100 (step S502).

Upon receiving the first voice information, the voice output device 100 analyzes its content (step S503). The voice output device 100 then generates robot utterance data representing a response based on the analysis result, that is, on the first voice information (step S504), and transmits the generated robot utterance data to the smart speaker 200 (step S505).

Having received the robot utterance data, the smart speaker 200 outputs a voice based on it (step S506), thereby responding to the instruction (inquiry) received from the user in step S501.

The processing shown in FIG. 5 is an operation that conventional smart speakers can also perform.

In contrast, FIG. 6 is a sequence diagram showing the exchange between the smart speaker 200 and the voice output device 100 in communication system 1 when the user rephrases. In the sequence diagram of FIG. 6, the processing of steps S501 to S503 is the same as that of steps S501 to S503 in FIG. 5, and its description is therefore omitted.

After transmitting the first voice information, the smart speaker 200 receives input of a further voice (hereinafter, the second voice) from the user (step S601). The smart speaker 200 then converts the second voice into digital data and transmits the resulting second voice information to the voice output device 100 (step S602).

The voice output device 100 then analyzes the second voice information (step S603). Suppose the analysis shows that the second voice information was received within a predetermined time after the first voice information was received (a time appropriate for accepting a rephrasing, for example, 5 seconds), and that the second voice information contains a word in the same category as a word contained in the instruction of the first voice information.

In such a case, the voice output device 100 generates robot utterance data based on at least the second voice information (step S604). Here, generating "based on at least the second voice information" covers both generating the data based on the second voice information alone and generating it based on both the first voice information and the second voice information.

The voice output device 100 transmits the generated robot utterance data to the smart speaker 200 (step S605).

Upon receiving the robot utterance data, the smart speaker 200 outputs a voice based on it (step S606).

In this way, when the user makes utterances in quick succession, the voice output device 100 can determine whether the later utterance is a rephrasing, based on whether it falls within the predetermined time from the preceding utterance and whether the utterances contain words of a common category, and can respond appropriately.

(Operation example of voice output device 100)
FIG. 7 is a flowchart showing the operation of the voice output device 100 when it controls a device.

As shown in FIG. 7, the receiving unit 110 of the voice output device 100 receives, from the smart speaker 200, first voice information representing a voice uttered by the user (step S701). The receiving unit 110 passes the received first voice information to the control unit 130.

The voice analysis unit 131 of the control unit 130 analyzes the first voice information passed to it (step S702) and identifies the content of the instruction (inquiry). The instruction can be identified by storing in the storage unit 120, in advance, a list of words that can constitute an inquiry. The voice analysis unit 131 passes the analysis result to the generation unit 132.
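As a rough illustration of identifying an inquiry from a pre-stored word list, the sketch below scans an already transcribed utterance for known inquiry words. The list contents, the name `identify_inquiry`, and the use of plain substring matching on English text are assumptions made for this example; the specification does not say how the list is structured or matched.

```python
from typing import Optional

# Hypothetical word list standing in for the list held in storage unit 120.
INQUIRY_WORDS = ("weather", "music", "alarm")


def identify_inquiry(utterance: str) -> Optional[str]:
    """Return the first known inquiry word found in the utterance, if any."""
    lowered = utterance.lower()
    for word in INQUIRY_WORDS:
        if word in lowered:
            return word
    return None
```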

Based on the analysis result, that is, the result of analyzing the first voice information, the generation unit 132 then starts generating robot utterance data that responds to the instruction (inquiry) indicated by the first voice information (step S703).

After generation of the robot utterance data has started, the control unit 130 determines whether second voice information, that is, new voice information, has been received from the receiving unit 110 (step S704).

If no second voice information has been received (NO in step S704), the generation unit 132 goes on to generate the robot utterance data responding to the first voice information, has the transmitting unit 140 transmit it to the smart speaker 200 (step S705), and ends the processing.

If, on the other hand, second voice information has been received (YES in step S704), the voice analysis unit 131 determines whether it was received within a predetermined time after the first voice information was received (step S706). This can be determined by taking the difference between the reception time of the first voice information and that of the second voice information and comparing it with a threshold representing the predetermined time. Note that, at the time the second voice information is received, generation of the robot utterance data responding to the first voice information and its transmission to the smart speaker 200 may or may not have been completed.
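The step S706 comparison amounts to a timestamp difference checked against a threshold. A minimal Python sketch follows; the function name is hypothetical, the 5-second value is the example given elsewhere in the text, and treating the boundary as inclusive is an assumption.

```python
from datetime import datetime, timedelta

# Example value from the text; the actual predetermined time is a design choice.
REPHRASE_WINDOW = timedelta(seconds=5)


def received_within_window(first_received_at: datetime,
                           second_received_at: datetime,
                           window: timedelta = REPHRASE_WINDOW) -> bool:
    """True if the second reception falls within `window` after the first."""
    elapsed = second_received_at - first_received_at
    return timedelta(0) <= elapsed <= window
```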

If the voice analysis unit 131 determines that the second voice information was received within the predetermined time after the first voice information was received (YES in step S706), it determines whether the second voice information contains a word in the same category as a word contained in the first voice information (step S707). Whether such a word exists can be determined, for example, by checking whether any attribute of the word serving as the object of the inquiry in the first voice information matches an attribute of the word serving as the object of the same inquiry in the second voice information. As a specific example, if the first voice information is the inquiry "Tell me the weather in Tokyo," then "weather" is the content of the inquiry and "Tokyo" is its object. Here, "Tokyo" can have attributes such as place name, city name, and location. If the second voice information is the inquiry "Tell me the weather in Shinagawa," then, likewise, "weather" is the content of the inquiry and "Shinagawa" is its object. Since "Shinagawa" can also have attributes such as place name, city name, and location, it can be determined that the second voice information contains a word in the same category as a word contained in the first voice information.
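The attribute comparison of step S707 can be sketched as a set intersection. The attribute dictionary below is a hypothetical stand-in for whatever lexicon the device would use; only the Tokyo and Shinagawa attributes come from the text, and the "24 degrees" entry borrows the temperature category mentioned in the modifications.

```python
from typing import Dict, Set

# Hypothetical attribute lexicon; a real system would hold far more entries.
WORD_ATTRIBUTES: Dict[str, Set[str]] = {
    "Tokyo": {"place name", "city name", "location"},
    "Shinagawa": {"place name", "city name", "location"},
    "24 degrees": {"temperature"},
}


def shared_categories(first_object: str, second_object: str) -> Set[str]:
    """Attributes that the two inquiry objects have in common."""
    first = WORD_ATTRIBUTES.get(first_object, set())
    second = WORD_ATTRIBUTES.get(second_object, set())
    return first & second


def has_same_category_word(first_object: str, second_object: str) -> bool:
    """True if at least one attribute matches (the step S707 check)."""
    return bool(shared_categories(first_object, second_object))
```

Words missing from the lexicon are treated as sharing no category, so an unrecognized word never triggers the rephrasing path in this sketch.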

If it is determined that the second voice information contains a word in the same category as a word contained in the first voice information (YES in step S707), the voice analysis unit 131 passes to the generation unit 132 the analysis result for the second voice information, together with its analysis of the context between the first voice information and the second voice information. Based on the analysis result passed to it, the generation unit 132 generates robot utterance data based on at least the second voice information, that is, robot utterance data that responds to the second voice information. The generation unit 132 then transmits the generated robot utterance data to the smart speaker 200 via the transmitting unit 140 (step S709) and ends the processing. Here, robot utterance data that responds to at least the second voice information includes a response to the inquiry contained in the second voice information and, in some cases, may also include a response to the inquiry contained in the first voice information. If, at this point, the generation unit 132 has not yet finished generating and transmitting the robot utterance data based on the first voice information, it cancels that generation and transmission and then generates and outputs the robot utterance data based on at least the second voice information. This may be implemented as follows: when the user 10 speaks while the smart speaker 200 is outputting, as voice, the robot utterance data based on the first voice information, so that second voice information is obtained, and the second voice information is determined to contain a word in the same category as the first voice information, the voice output device 100 may instruct the smart speaker 200 to stop the voice output based on the robot utterance data for the first voice information. After this stop instruction, the generation unit 132 may generate robot utterance data based on the second voice information, and the voice output device 100 may transmit that data to the smart speaker 200.

On the other hand, if the voice analysis unit 131 determines in step S706 that the second voice information was not received within the predetermined time after the first voice information was received (NO in step S706), or determines in step S707 that the second voice information contains no word in the same category as a word contained in the first voice information (NO in step S707), the voice analysis unit 131 determines whether the second voice information contains a word corresponding to a question (step S708). The predetermined time here may be, for example, the period during which the smart speaker 200 is outputting, as voice, the robot utterance data based on the first voice information.

If the second voice information contains a word corresponding to a question (YES in step S708), the generation unit 132 generates robot utterance data that answers the question, transmits it to the smart speaker 200 via the transmitting unit 140 (step S710), and ends the processing.

If the second voice information contains no word corresponding to a question (NO in step S708), the processing proceeds to step S705. In this case, instead of proceeding to step S705, the device may be configured to generate robot utterance data that asks the user to state the inquiry once more and to transmit it to the smart speaker 200.
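The branching of steps S704 to S710 can be condensed into one decision function. This is a sketch under assumptions: the function name and the returned labels are hypothetical, the four inputs are taken to be precomputed by the analysis, and the optional "ask the user again" variant is left out.

```python
def decide_response(second_received: bool,
                    within_window: bool,
                    same_category: bool,
                    has_question_word: bool) -> str:
    """Map the checks of steps S704, S706, S707 and S708 to the step taken."""
    if not second_received:                # S704: NO
        return "S705: respond to the first voice information"
    if within_window and same_category:    # S706: YES and S707: YES
        return "S709: respond based on at least the second voice information"
    if has_question_word:                  # S708: YES
        return "S710: answer the question in the second voice information"
    return "S705: respond to the first voice information"  # S708: NO
```

For instance, `decide_response(True, True, True, False)` selects the step S709 path, which is the rephrasing case of FIG. 6.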

The operation of the smart speaker 200 consists only of receiving the user's voice with the microphone 240, transmitting the voice information from the transmitting unit 250 to the voice output device 100, receiving with the receiving unit 210 the robot utterance data output from the voice output device 100, and outputting (announcing) it from the speaker 230; a detailed description is therefore omitted.

(Specific response examples)
The processing that the voice output device 100 executes based on the voice data picked up by the smart speaker 200 is described concretely below. The following examples use the case of a user asking about the weather.

(Example 1) When the user says, "Tell me the weather in Tokyo? ... Oh, after all, tell me the weather in Shinagawa?"

Example 1 corresponds to the case in which, in the answer model information 121 shown in FIG. 4, the situation in column 412 of the status information 410 is satisfied. In this case, the action shown in column 422 of the correspondence information 420 is taken. Specifically, the information "Tell me the weather in Tokyo?" is first passed to the voice output device 100 as the first voice information and is analyzed. From the word "weather," the device accesses a weather forecast server and acquires the weather information for "Tokyo." It then generates robot utterance data based on that weather information. At this point, the information "Oh, after all, tell me the weather in Shinagawa?" is passed to the voice output device 100 as the second voice information.

The voice output device 100 then determines whether the second voice information was received within the predetermined time (for example, 5 seconds) after the first voice information was received; suppose it determines that it was. The generation unit 132 then detects that the first voice information and the second voice information both contain the word "weather," which designates an application, and that both contain words of the same category, "place name," namely "Tokyo" and "Shinagawa."

In this case, the generation unit 132 determines that the second voice information is a rephrasing, cancels generation of the robot utterance data for the first voice information, and generates robot utterance data giving the answer to the second voice information. For example, the generation unit 132 accesses the weather forecast server, acquires the weather information for Shinagawa, and generates robot utterance data according to its content (for example, "The weather in Shinagawa is sunny. The temperature is XX degrees.").

As a result, the smart speaker 200 announces only the weather in Shinagawa, so the user obtains the information actually wanted without having to give the instruction again.

At this time, the generation unit 132 may additionally analyze the context of the user's utterance to increase the confidence of the determination that the latter utterance is a rephrasing. In the example above, the expression "after all" (やっぱり) negates the former utterance, which increases the confidence that the latter is a rephrasing.

(Example 2) When the user says, "Tell me the weather in Tokyo? ... Oh, and then also Osaka"

Example 2 corresponds to the case in which, in the answer model information 121 shown in FIG. 4, the situation in column 413 of the status information 410 is satisfied. In this case, the action shown in column 423 of the correspondence information 420 is taken. Specifically, the information "Tell me the weather in Tokyo?" is first passed to the voice output device 100 as the first voice information and is analyzed. From the word "weather," the device accesses a weather forecast server and acquires the weather information for "Tokyo." It then generates robot utterance data based on that weather information. At this point, the information "Oh, and then also Osaka" is passed to the voice output device 100 as the second voice information.

The voice output device 100 then determines whether the second voice information was received within the predetermined time (for example, 5 seconds) after the first voice information was received; suppose it determines that it was. The generation unit 132 then detects that the two contain words of the same category, "place name": "Tokyo" in the first voice information and "Osaka" in the second voice information.

In this case, however, the voice analysis unit 131 analyzes the context of the second voice information and interprets it as containing words with an additive meaning, such as "and then" (それから) and "also" (も). In such a case, the generation unit 132 determines that the second voice information is not a rephrasing but a request for additional information; it generates and transmits the robot utterance data for the first voice information, and also generates and transmits robot utterance data giving the answer to the second voice information. The smart speaker 200 therefore receives the weather information for both Tokyo and Osaka and announces both.
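The difference between Examples 1 and 2 lies in the cue words of the second utterance. The sketch below distinguishes them with simple string checks on the transcribed Japanese text. The cue lists, the function names, and the matching heuristic (substring search plus a trailing も check) are assumptions for illustration; the specification only states that the context is analyzed.

```python
from typing import List

NEGATION_CUES = ("やっぱり",)         # "after all": negates the earlier utterance
ADDITIVE_CUES = ("それから", "あと")   # "and then", "and also": adds to it


def classify_follow_up(second_utterance: str) -> str:
    """Label the second utterance as 'rephrase', 'addition' or 'unmarked'."""
    text = second_utterance.strip().rstrip("。？?！!")
    if any(cue in text for cue in NEGATION_CUES):
        return "rephrase"
    # A trailing particle も ("also") is treated as an additive cue.
    if any(cue in text for cue in ADDITIVE_CUES) or text.endswith("も"):
        return "addition"
    return "unmarked"


def objects_to_answer(first_object: str, second_object: str, label: str) -> List[str]:
    """Additions answer both objects; otherwise only the later one is answered."""
    if label == "addition":
        return [first_object, second_object]
    return [second_object]
```

An "unmarked" follow-up is answered like a rephrasing here, which matches the later example in which a bare place name replaces the earlier one without any negating word.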

(Example 3) When the user says, "Tell me the weather in Tokyo?" and, after the predetermined time has elapsed, says, "And also Osaka"

Example 3 corresponds to the case in which, in the answer model information 121 shown in FIG. 4, the situation in column 411 of the status information 410 is satisfied. In this case, the action shown in column 421 of the correspondence information 420 is taken. Specifically, the information "Tell me the weather in Tokyo?" is first passed to the voice output device 100 as the first voice information and is analyzed. The voice output device 100 then generates, from its content, robot utterance data containing information on the weather in Tokyo. If the information "And also Osaka" is passed to the voice output device 100 as the second voice information after the predetermined time has elapsed, the voice output device 100 determines that the two are unrelated, because the predetermined time has elapsed since the first question.

In such a case, the voice output device 100 may be configured either to generate only the robot utterance data responding to the first voice information, or to generate and transmit that data and additionally generate and transmit robot utterance data making the request "Please ask your question again." Alternatively, it may be configured to generate and transmit only the robot utterance data making the request "Please ask your question again," without generating the robot utterance data responding to the first voice information.

(Example 4) When the smart speaker 200 asks, "Where do you want to know the weather for?" and the user says, "Shimbashi ..., Shinagawa"

In this case, the voice output device 100 receives the voice "Shimbashi" as the first voice information and then the voice "Shinagawa" as the second voice information. The user's utterance is an answer to the question asked by the smart speaker 200, and the voice analysis unit 131 can determine that "Shimbashi" and "Shinagawa" are both words of the same category, "place name." In such a case, the voice output device 100 may, as in (Example 1) above, determine that the utterance is a rephrasing, take the latter, "Shinagawa," as the place whose weather the user wants to know, and generate and transmit robot utterance data giving the weather in "Shinagawa." Alternatively, if the distance between the two places is within a predetermined distance, it may acquire the weather for an area containing both places and announce that information. In this example, the voice output device 100 may generate robot utterance data giving the weather information for the area "Tokyo," which contains both "Shimbashi" and "Shinagawa," and transmit it to the smart speaker 200. The answer model information 121 for Example 4 is not shown in FIG. 4, but the status information 410 would be, for example:
"・The second voice information is acquired within a predetermined time after the first voice information is received.
・The second voice information contains a word corresponding to the question,"
and the correspondence information 420 would be:
"・Only robot utterance data corresponding to the second voice information is generated."
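The distance-based variant of Example 4 can be sketched as follows. Everything concrete here is an assumption for illustration: the coordinates are rough approximations, the 10 km threshold stands in for the unspecified "predetermined distance," the area table is hypothetical, and the haversine formula is simply one common way to estimate the distance between two points on the globe.

```python
import math
from typing import Dict, Tuple

# Approximate (latitude, longitude) values, for illustration only.
COORDS: Dict[str, Tuple[float, float]] = {
    "Shimbashi": (35.666, 139.758),
    "Shinagawa": (35.628, 139.739),
    "Osaka": (34.694, 135.502),
}
# Hypothetical table of the wider area that contains each place.
ENCLOSING_AREA = {"Shimbashi": "Tokyo", "Shinagawa": "Tokyo", "Osaka": "Osaka"}
MAX_DISTANCE_KM = 10.0  # hypothetical "predetermined distance"


def haversine_km(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def weather_target(first_place: str, second_place: str) -> str:
    """Pick the shared area if the places are close, else the later place."""
    near = haversine_km(COORDS[first_place], COORDS[second_place]) <= MAX_DISTANCE_KM
    same_area = ENCLOSING_AREA[first_place] == ENCLOSING_AREA[second_place]
    if near and same_area:
        return ENCLOSING_AREA[second_place]
    return second_place
```

With these assumed values, Shimbashi and Shinagawa fall within the threshold and share the area "Tokyo," so the wider area is chosen; for distant places the later word is used, as in the rephrasing case.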

As described above, the voice output device 100 can generate robot utterance data for responding naturally to inquiries of various forms from the user. Moreover, according to Example 4, the voice output device 100 can recognize a rephrasing by the user and generate robot utterance data for a natural response even when the user's voice contains no wording that negates the first voice information, that is, no wording with a specific vocabulary.

(Example 5) When the smart speaker 200 asks, "Where do you want to know the weather for?" and the user says, "Shimbashi, ..., Shinagawa! Shinagawa! Shinagawa!"

In this case, the voice output device 100 receives the voice "Shimbashi" as the first voice information and then the voice "Shinagawa" as the second voice information. The user's utterance is an answer to the question asked by the smart speaker 200, and the voice analysis unit 131 can determine that "Shimbashi" and "Shinagawa" are both words of the same category, "place name." In Example 5, however, unlike Example 4, the voice output device 100 can recognize through analysis that the user has uttered the same word multiple times. In such a case, when the answer to a question contains multiple words of the same category within the predetermined time, the voice output device 100 may identify the word the user uttered most often as the answer to the question and generate robot utterance data accordingly. In particular, when the analysis shows that the user repeated the same word in succession, the repeated word may be identified as the answer to the question and used to generate the robot utterance data.

The answer model information 121 for Example 5 is not shown in FIG. 4, but the status information 410 may be, for example:
"・The second voice information is acquired within a predetermined time after the first voice information is received.
・The first voice information and the second voice information contain multiple words that could be the answer,"
and the correspondence information 420 may be:
"・Only robot utterance data corresponding to the word that appears most often among the candidate answers in the first voice information and the second voice information is generated."
Alternatively, the status information 410 may be, for example:
"・The second voice information is acquired within a predetermined time after the first voice information is received.
・The second voice information contains a word repeated multiple times,"
and the correspondence information 420 may be:
"・Robot utterance data based on the word repeated multiple times in the second voice information is generated."

An answer emphasized multiple times in this way can be presumed to be the answer the user wants, so the voice output device 100 can identify such a word as the answer concerning the information the user wants to know and generate robot utterance data for a natural response. When robot utterance data is generated based on the answer the user emphasizes, the emphasis may be judged by the volume of the user's voice rather than by the number of times the answer is given. That is, the answer spoken at the higher volume may be identified as the answer concerning what the user wants to know.
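Both ways of picking the emphasized answer, by repetition count and by volume, are easy to sketch. The function names, the tie-breaking rule (prefer the word uttered last), and the volume figures in the usage note are assumptions for this illustration; both functions expect a non-empty list of candidate answers of the same category.

```python
from collections import Counter
from typing import List, Tuple


def most_repeated_answer(words: List[str]) -> str:
    """Return the candidate uttered most often; on a tie, the one uttered last."""
    counts = Counter(words)
    best = max(counts.values())
    for word in reversed(words):
        if counts[word] == best:
            return word
    raise ValueError("words must not be empty")  # unreachable for non-empty input


def loudest_answer(words_with_volume: List[Tuple[str, float]]) -> str:
    """Return the candidate spoken at the highest volume."""
    return max(words_with_volume, key=lambda pair: pair[1])[0]
```

For the utterance of Example 5, `most_repeated_answer(["Shimbashi", "Shinagawa", "Shinagawa", "Shinagawa"])` selects "Shinagawa"; with made-up volume readings, `loudest_answer([("Shimbashi", 52.0), ("Shinagawa", 68.5)])` selects it as well.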

(Summary)
As described above, when the user restates something, the voice output device 100 can determine whether the second voice information is a restatement based on whether the restatement was made within a predetermined time from when the words before the restatement were uttered, and on whether a word in the restatement belongs to the same category as a word in the words before the restatement. The voice output device 100 can therefore decide whether it needs to generate robot utterance data for both the first voice information and the second voice information, or only robot utterance data giving the answer to the second voice information. When it determines that the second voice information is a restatement, the voice output device 100 responds on the basis that the second voice information is what the user actually wants to ask, which makes it possible to provide a voice output device 100 capable of more natural conversational responses.
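The restatement test summarized above combines two conditions: the second utterance arrives within a predetermined time, and it shares a word category with the first. The sketch below illustrates that logic under stated assumptions: the `Utterance` type, the `CATEGORY_OF` table, and the 5-second window are hypothetical and do not come from the embodiment.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical word-to-category table (assumption for illustration).
CATEGORY_OF = {"Tokyo": "place", "Osaka": "place",
               "today": "date", "tomorrow": "date"}

# Assumed value for the "predetermined time"; the source gives no figure here.
RESTATEMENT_WINDOW_SEC = 5.0


@dataclass
class Utterance:
    words: List[str]
    received_at: float  # seconds on an arbitrary clock


def categories(utterance):
    """Collect the categories of all known words in the utterance."""
    return {CATEGORY_OF[w] for w in utterance.words if w in CATEGORY_OF}


def is_restatement(first, second, window_sec=RESTATEMENT_WINDOW_SEC):
    """True if `second` follows `first` within the window and shares a category."""
    elapsed = second.received_at - first.received_at
    within_window = 0 <= elapsed <= window_sec
    return within_window and bool(categories(first) & categories(second))


def utterances_to_answer(first, second):
    """Answer only the second utterance on a restatement, otherwise both."""
    return [second] if is_restatement(first, second) else [first, second]
```

A lookup table is used here only to keep the sketch short; how words are actually mapped to categories is outside what this passage specifies.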

(Supplement)
Needless to say, the device according to the above embodiment is not limited to the above embodiment and may be realized by other methods. Various modifications are described below.

(1) The above embodiment shows an example of an inquiry about the weather, but it goes without saying that the voice output device 100 can also handle cases other than the weather. It can be used not only for weather inquiries from the user but also, for example, for home appliance operation, music playback, and management of lists such as shopping lists. For home appliance operation, suppose as an example that the instruction "Turn on the cooling at 24 degrees, ... 26 degrees" is received. If the time between the utterance of "24 degrees" and the utterance of "26 degrees" is within the predetermined time, the voice output device 100 interprets "26 degrees" as the information specified by the user, because "24 degrees" and "26 degrees" belong to the same category, temperature. Acting as an information processing device that operates home appliances, it can then turn on the cooling at a setting of 26 degrees. For music playback, suppose the user says "Play A. ... Maybe B would be better." If the time between the utterance "Play A" and the utterance "Maybe B would be better" is within the predetermined time, the voice output device 100 instructs the smart speaker to play song B, because "A" and "B" both belong to the category of music (songs). For management of a shopping list or the like, suppose the user says "Carrots, eggplants, potatoes, ... no, sweet potatoes." If the time from the utterance "potatoes" to the utterance "sweet potatoes" is within the predetermined time, the voice output device 100, acting as an information processing device that manages the shopping list, adds only the sweet potatoes to the shopping list. In this way, the voice output device 100 can respond to restatements by the user in various situations other than weather inquiries and perform the processing presumed to be desirable for the user. The same applies to other examples, such as place names in route guidance.
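The air-conditioner case in modification (1) can be illustrated with a short sketch. It assumes the utterance has already been split into timed segments, that temperatures are phrased as "N degrees", and that a temperature spoken after the window has elapsed counts as a separate instruction; none of these details are specified in the source, and the function names are hypothetical.

```python
import re

# Matches temperatures such as "24 degrees"; the phrasing is an assumption.
TEMPERATURE = re.compile(r"(\d+)\s*degrees")


def extract_temperature(text):
    """Return the first temperature mentioned in `text`, or None."""
    match = TEMPERATURE.search(text)
    return int(match.group(1)) if match else None


def resolve_setpoints(segments, window_sec=5.0):
    """Collapse restated temperatures into the value the user settled on.

    `segments` is a list of (text, spoken_at) pairs in spoken order.
    A temperature spoken within `window_sec` of the previous one replaces
    it (restatement); otherwise it is kept as a separate instruction.
    """
    resolved = []  # list of (temperature, spoken_at)
    for text, spoken_at in segments:
        temperature = extract_temperature(text)
        if temperature is None:
            continue
        if resolved and spoken_at - resolved[-1][1] <= window_sec:
            resolved[-1] = (temperature, spoken_at)
        else:
            resolved.append((temperature, spoken_at))
    return [temperature for temperature, _ in resolved]
```

With the segments "turn on the cooling at 24 degrees" at 0.0 s and "26 degrees" at 1.5 s, only 26 remains as the setting to apply.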

(2) In the above embodiment, some of the functions of the voice output device 100 may be held by another device, and that device may take on part of the processing executed by the voice output device 100. For example, another information processing device having a voice analysis function may first analyze the user's voice received by the smart speaker 200 and transfer the analysis result to the voice output device 100. The voice output device 100 may then be configured to generate robot utterance data based on the transferred analysis result.

(3) In the above embodiment, there may be a plurality of voice output devices 100, one for each type of processing to be executed. Various devices are conceivable, for example a device that provides weather information, a device that provides cooking information, and a device that operates home appliances. In this case, the communication system 1 may further include an information processing device that first analyzes the voice received by the smart speaker 200 and determines which device the voice-based inquiry should be directed to. In such a configuration, that information processing device may use the restatement determination described in the above embodiment to help designate the device to which the voice is forwarded. For example, suppose the user restates the type of inquiry, as in "The weather... no, tell me the train departure time." The device that manages weather information and the device that manages train times cannot both respond at the same time, so only one of them is selected; when the utterance is determined to be a restatement, the voice inquiry may be forwarded to the device corresponding to the latter.
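The routing idea in modification (3) can be sketched as a small dispatcher that forwards the inquiry only to the device for the restated type. The topic keywords, the backend names, and the `select_backends` function are hypothetical, and the restatement decision is passed in as a flag rather than computed here.

```python
# Hypothetical mapping from inquiry type to the device that handles it.
BACKEND_FOR_TOPIC = {
    "weather": "weather_device",
    "train": "train_device",
    "cooking": "cooking_device",
}


def select_backends(topics, is_restatement):
    """Choose which devices should receive the voice inquiry.

    `topics` lists the inquiry types detected in spoken order. When the
    utterance was judged to be a restatement, only the device for the
    latest topic is returned; otherwise every matching device is returned.
    """
    backends = [BACKEND_FOR_TOPIC[t] for t in topics if t in BACKEND_FOR_TOPIC]
    if is_restatement and backends:
        return backends[-1:]
    return backends
```

For "The weather... no, tell me the train departure time", `select_backends(["weather", "train"], True)` yields only the train device.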

(4) In the above embodiment, the smart speaker 200 and the voice output device 100 have been described as separate devices, but they may be formed integrally. That is, the smart speaker 200 may have some or all of the functions of the voice output device 100.

(5) In the above embodiment, whether words of the same category are included in the first voice information and the second voice information was determined from the context of the dialogue with the user and from the wording desirable as an answer to a question. However, when the voice output device 100 side (the smart speaker 200 side) asks the user a question, the generation unit 132, when generating the robot utterance data for that question, also determines the category expected to be appropriate for an answer to the question and stores the determined category in the storage unit 120. Then, when the first voice information and, within a predetermined time after it, the second voice information are obtained as the user's answer to the question, the control unit 130 of the voice output device 100 determines whether the first voice information includes a word belonging to the determined (stored) category. It likewise determines whether the second voice information includes a word belonging to the determined (stored) category. If both pieces of voice information include a word belonging to the determined category, the generation unit 132 may generate robot utterance data based on at least the second voice information. This makes it easier to narrow down the candidates when verifying whether the first voice information and the second voice information include words of the same category, and can shorten the processing time.
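Modification (5) can be illustrated as follows: the expected answer category is stored when the question is asked, and both answers are then checked against that single category. The `QuestionContext` class and the `CATEGORY_OF` table are hypothetical stand-ins (the attribute plays the role of the storage unit 120), and the time-window check is assumed to have been done beforehand.

```python
# Hypothetical word-to-category table (assumption for illustration).
CATEGORY_OF = {"Tokyo": "place", "Osaka": "place", "tomorrow": "date"}


class QuestionContext:
    """Remembers which answer category the last question expects."""

    def __init__(self):
        # Stands in for the category stored in the storage unit 120.
        self.expected_category = None

    def ask(self, question, expected_category):
        """Record the expected category and return the question text."""
        self.expected_category = expected_category
        return question

    def _has_expected(self, words):
        return any(CATEGORY_OF.get(w) == self.expected_category for w in words)

    def answer_from_second(self, first_words, second_words):
        """True if both answers contain a word of the expected category.

        Assumes the second answer already arrived within the predetermined
        time after the first; that check is not repeated here.
        """
        if self.expected_category is None:
            return False
        return self._has_expected(first_words) and self._has_expected(second_words)
```

Because only one category has to be checked, the comparison avoids looking up every category of every word, which is the narrowing effect the passage describes.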

(6) The program of each embodiment of the present disclosure may be provided stored on a computer-readable storage medium. The storage medium can store the program on a "non-transitory tangible medium". The storage medium may include any suitable storage medium such as an HDD or an SSD, or a suitable combination of two or more of these. The storage medium may be volatile, non-volatile, or a combination of volatile and non-volatile. The storage medium is not limited to these examples and may be any device or medium capable of storing the program.

The voice output device 100 can realize the functions of the plurality of functional units described in each embodiment by, for example, reading the program stored in the storage medium and executing it. The program may also be provided to the voice output device 100 via any transmission medium (such as a communication network or broadcast waves). For example, the voice output device 100 realizes the functions of the plurality of functional units described in each embodiment by executing a program downloaded via the Internet or the like.

The program can be implemented using, for example, a scripting language such as ActionScript or JavaScript (registered trademark), an object-oriented programming language such as Objective-C or Java (registered trademark), or a markup language such as HTML5.

At least part of the processing in the voice output device 100 may be realized by cloud computing made up of one or more computers. Each functional unit of the voice output device 100 may be realized by one or more circuits that implement the functions described in the above embodiment, and a single circuit may implement the functions of a plurality of functional units.

(7) Although the embodiments of the present disclosure have been described based on the drawings and examples, it should be noted that those skilled in the art can easily make various changes and modifications based on the present disclosure. Such changes and modifications are therefore included in the scope of the present disclosure. For example, the functions included in each means, each step, and so on can be rearranged so long as no logical inconsistency results, and a plurality of means or steps can be combined into one or divided. The configurations shown in the embodiments may also be combined as appropriate.

Reference Signs List
100 voice output device
110 reception unit
120 storage unit
130 control unit
131 voice analysis unit
132 generation unit
140 transmission unit

Claims

An information processing device comprising:
a receiving unit that receives input of voice information indicating a voice uttered by a user;
a generation unit that generates, based on the voice information, robot utterance data corresponding to a reply to the uttered voice; and
an output unit that outputs the robot utterance data,
wherein, in a case where second voice information is received within a predetermined time after first voice information is received, the generation unit generates robot utterance data based on at least the second voice information when the second voice information includes a second word of the same category as a first word included in the first voice information.

The information processing device according to claim 1, wherein the generation unit generates robot utterance data based only on the second voice information when the second voice information includes a word that negates the first voice information.

The information processing device according to claim 1 or 2, wherein the generation unit generates robot utterance data for both the first voice information and the second voice information when the second voice information includes a word that connects it to the first voice information.

The information processing device according to claim 1, wherein robot utterance data inquiring which is correct is generated when the second voice information includes a second word of the same type as the first word included in the first voice information.

The information processing device according to claim 1, further comprising:
a voice output unit that outputs voice based on the robot utterance data; and
a voice collection unit that collects the voice of the user,
wherein the output unit outputs the robot utterance data to the voice output unit,
the receiving unit receives the uttered voice collected by the voice collection unit as the voice information, and
when the second voice information is received within the predetermined time while the robot utterance data based on the first voice information is being output, and the second voice information includes a second word of the same type as the first word included in the first voice information, the output unit stops outputting, to the voice output unit, the robot utterance data generated for the first voice information.

The information processing device according to claim 5, wherein the output unit outputs new robot utterance data based on the second voice information after stopping the output of the robot utterance data to the voice output unit.

The information processing device according to claim 1, wherein
the output unit outputs the robot utterance data to an external speaker,
the receiving unit receives, as the voice information, an uttered voice collected by an external microphone, and
when the second voice information, based on the uttered voice received from the user, is received while the external speaker is outputting voice based on the robot utterance data for the first voice information, and the second voice information includes a second word of the same type as the first word included in the first voice information, the output unit outputs a stop instruction to stop the output of the voice based on the robot utterance data generated for the first voice information.

The information processing device according to claim 7, wherein the output unit outputs new robot utterance data based on the second voice information after outputting the stop instruction.

The information processing device according to any one of claims 1 to 8, wherein
the generation unit, when generating robot utterance data for making an inquiry to the user, determines a category of words desired as an answer to the inquiry, and
the generation unit, upon receiving the first voice information and the second voice information as answers to the inquiry, generates robot utterance data based on at least the second voice information when the first voice information includes a word belonging to the determined category as the first word and the second voice information includes a word belonging to the determined category as the second word.

A voice output method executed by a computer, the method comprising:
a receiving step of receiving input of voice information indicating a voice uttered by a user;
a generation step of generating, based on the voice information, robot utterance data corresponding to a reply to the uttered voice; and
an output step of outputting the robot utterance data,
wherein, in a case where second voice information is received within a predetermined time after first voice information is received, the generation step generates robot utterance data based on at least the second voice information when the second voice information includes a second word of the same category as a first word included in the first voice information.

A voice output program causing a computer to realize:
a reception function of receiving input of voice information indicating a voice uttered by a user;
a generation function of generating, based on the voice information, robot utterance data corresponding to a reply to the uttered voice; and
an output function of outputting the robot utterance data,
wherein, in a case where second voice information is received within a predetermined time after first voice information is received, the generation function generates robot utterance data based on at least the second voice information when the second voice information includes a second word of the same category as a first word included in the first voice information.