JP6649200B2 - Utterance generation device, method, and program

Info

Publication number: JP6649200B2
Application number: JP2016153957A
Authority: JP
Inventors: 東中 竜一郎; 松尾 義博; 牧野 俊朗; 福冨 隆朗
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2016-08-04
Filing date: 2016-08-04
Publication date: 2020-02-19
Anticipated expiration: 2036-08-04
Also published as: JP2018022075A

Description

The present invention relates to an utterance generation device, method, and program, and in particular to an utterance generation device, method, and program for conversing with a user.

A spoken dialogue system performs speech recognition on a user utterance, processes the recognition result to understand it, decides what to say based on that understanding, and responds to the user by synthesizing the decided content as speech. The basic configuration of a dialogue system is described in Non-Patent Document 1.

One problem with conventional spoken dialogue systems is the gap between the end of the user's utterance and the start of the system's utterance. As a way to respond promptly, frameworks have been proposed that tightly couple the speech recognizer with the other modules, so that the system can understand the input incrementally and the user can immediately correct his or her own utterance.

Non-Patent Document 1: Mikio Nakano, Kazunori Komatani, Kotaro Funakoshi, Yukiko Nakano, Manabu Okumura (supervising editor). Dialogue Systems (対話システム). Corona Publishing, 2015.

Non-Patent Document 2: Schlangen, David, and Gabriel Skantze. "A general, abstract model of incremental dialogue processing." Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2009.

Non-Patent Document 3: Skantze, Gabriel, and Anna Hjalmarsson. "Towards incremental speech generation in dialogue systems." Proceedings of the 11th Annual Meeting of the Special Interest Group on Discourse and Dialogue. Association for Computational Linguistics, 2010.

However, although the conventional methods propose frameworks that can react immediately to user utterances, reacting immediately is not by itself sufficient to convey whether the system has understood the user utterance.

The present invention has been made to solve the above problem, and its object is to provide an utterance generation device, method, and program that can convey to the user what the system has understood and thereby enable smooth dialogue.

To achieve the above object, an utterance generation device according to a first invention includes: a voice section detection unit that receives as input a stream of sound representing a user utterance, detects the start and the end of speech, sequentially outputs the speech of the section defined by the detected start and end, and outputs a detection result when it detects the start of speech or the end of speech; a speech recognition unit that performs speech recognition on the speech of the section corresponding to the start of speech detected by the voice section detection unit, and sequentially outputs recognition results up to the end of the section, including intermediate recognition results obtained partway through the section; an utterance generation unit that includes at least one of a back-channel generation unit that generates an utterance indicating that speech recognition is being performed, based on the detection result of the voice section detection unit and the recognition result of the speech recognition unit, a focus extraction unit that generates an utterance indicating a recognized character string, based on the recognition result of the speech recognition unit, and a repetition generation unit that generates an utterance indicating what the system has understood, based on the recognition result of the speech recognition unit; an utterance output unit; and a communication unit that outputs the output from either or both of the speech recognition unit and the voice section detection unit to the utterance generation unit, and outputs the output from the utterance generation unit to the utterance output unit.

An utterance generation device according to a second invention includes: a voice section detection unit that receives as input a stream of sound representing a user utterance, detects the start and the end of speech, sequentially outputs the speech of the section defined by the detected start and end, and outputs a detection result when it detects the start of speech or the end of speech; a speech recognition unit that performs speech recognition on the speech of the section corresponding to the start of speech detected by the voice section detection unit, and sequentially outputs recognition results up to the end of the section, including intermediate recognition results obtained partway through the section; an utterance generation unit that includes at least one of a back-channel generation unit that generates an utterance indicating that speech recognition is being performed, based on the detection result of the voice section detection unit and the recognition result of the speech recognition unit, a focus extraction unit that generates an utterance indicating a recognized character string, based on the recognition result of the speech recognition unit, and a repetition generation unit that generates an utterance indicating what the system has understood, based on the recognition result of the speech recognition unit; a response unit that generates a response utterance corresponding to the user utterance, based on the recognition result of the speech recognition unit; an utterance output unit; and a communication unit that outputs the output from either or both of the speech recognition unit and the voice section detection unit to the utterance generation unit and the response unit, and outputs the output from either or both of the utterance generation unit and the response unit to the utterance output unit.

In the utterance generation device according to the first invention, when a new utterance generated by the utterance generation unit is input while the utterance output unit is outputting an utterance generated by the utterance generation unit, the utterance output unit may refrain from outputting the new utterance.

In the utterance generation device according to the second invention, when a new utterance generated by the utterance generation unit is input while the utterance output unit is outputting an utterance generated by the utterance generation unit, the utterance output unit may refrain from outputting the new utterance; and when a response utterance generated by the response unit is input while the utterance output unit is outputting an utterance generated by the utterance generation unit, the utterance output unit may output the response utterance after it has finished outputting the utterance generated by the utterance generation unit.
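The output policy described for the second invention can be sketched as a small state machine. This is an illustrative sketch only: the class name `UtteranceOutput`, its methods, and the `source` labels are hypothetical, playback is simulated rather than real audio, and the handling of input that arrives while a response utterance is playing is an assumption the text does not specify.

```python
from collections import deque


class UtteranceOutput:
    """Toy model of the output policy; playback is simulated, not real audio."""

    def __init__(self):
        self.current = None      # utterance being output right now, if any
        self.pending = deque()   # response utterances waiting for their turn
        self.log = []            # everything actually output, in order

    def receive(self, text, source):
        """source is 'generator' (utterance generation unit) or 'response'."""
        if self.current is None:
            self._start(text)
        elif source == "response":
            # Deferred until the current utterance has finished.
            self.pending.append(text)
        # Otherwise: a new generated utterance during playback is discarded.

    def finish_current(self):
        """Call when playback of the current utterance has ended."""
        self.current = None
        if self.pending:
            self._start(self.pending.popleft())

    def _start(self, text):
        self.current = text
        self.log.append(text)


out = UtteranceOutput()
out.receive("yes", "generator")                    # output immediately
out.receive("uh-huh", "generator")                 # dropped: output is busy
out.receive("That sounds fun.", "response")        # queued behind "yes"
out.finish_current()                               # "yes" ends, response starts
```

Dropping overlapping generated utterances keeps short acknowledgements from piling up, while queuing the response ensures that the system's substantive reply is never lost.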

In the utterance generation devices according to the first and second inventions, the repetition generation unit may extract a predicate-argument structure from the character string representing the recognition result of the speech recognition unit, and generate the utterance indicating what the system has understood based on the extracted predicate-argument structure.

In the utterance generation devices according to the first and second inventions, the back-channel generation unit may generate the utterance indicating that speech recognition is being performed when the voice section detection unit detects the start of speech.

In the utterance generation devices according to the first and second inventions, the speech recognition unit may output a recognition result when it detects a short pause, or at fixed intervals during the period up to the end of the section, and the back-channel generation unit may generate the utterance indicating that speech recognition is being performed when the speech recognition unit outputs a recognition result upon detecting a short pause or outputs a recognition result at a fixed interval.

In the utterance generation devices according to the first and second inventions, the back-channel generation unit may generate the utterance indicating that speech recognition is being performed when the speech recognition unit outputs the recognition result for the end of the section.

In the utterance generation devices according to the first and second inventions, the utterance output unit may output utterances as speech.

In the utterance generation devices according to the first and second inventions, the utterance output unit may output utterances by showing them on a display.

An utterance generation method according to a third invention is an utterance generation method in an utterance generation device that includes a voice section detection unit, an utterance generation unit including at least one of a speech recognition unit, a back-channel generation unit, and a focus extraction unit, a repetition generation unit, an utterance output unit, and a communication unit. The method includes: a step in which the voice section detection unit receives as input a stream of sound representing a user utterance, detects the start and the end of speech, sequentially outputs the speech of the section defined by the detected start and end, and outputs a detection result when it detects the start of speech or the end of speech; a step in which the speech recognition unit performs speech recognition on the speech of the section corresponding to the start of speech detected by the voice section detection unit, and sequentially outputs recognition results up to the end of the section, including intermediate recognition results obtained partway through the section; a step in which the communication unit outputs the output from either or both of the speech recognition unit and the voice section detection unit to the utterance generation unit; a step in which the utterance generation unit executes at least one of a step in which the back-channel generation unit generates an utterance indicating that speech recognition is being performed, based on the detection result of the voice section detection unit and the recognition result of the speech recognition unit, a step in which the focus extraction unit generates an utterance indicating a recognized character string, based on the recognition result of the speech recognition unit, and a step in which the repetition generation unit generates an utterance indicating what the system has understood, based on the recognition result of the speech recognition unit; and a step in which the communication unit outputs the output from the utterance generation unit to the utterance output unit.

An utterance generation method according to a fourth invention is an utterance generation method in an utterance generation device that includes a voice section detection unit, an utterance generation unit including at least one of a speech recognition unit, a back-channel generation unit, and a focus extraction unit, a repetition generation unit, an utterance output unit, a response unit, and a communication unit. The method includes: a step in which the voice section detection unit receives as input a stream of sound representing a user utterance, detects the start and the end of speech, sequentially outputs the speech of the section defined by the detected start and end, and outputs a detection result when it detects the start of speech or the end of speech; a step in which the speech recognition unit performs speech recognition on the speech of the section corresponding to the start of speech detected by the voice section detection unit, and sequentially outputs recognition results up to the end of the section, including intermediate recognition results obtained partway through the section; a step in which the communication unit outputs the output from either or both of the speech recognition unit and the voice section detection unit to the utterance generation unit and the response unit; a step in which the utterance generation unit executes at least one of a step in which the back-channel generation unit generates an utterance indicating that speech recognition is being performed, based on the detection result of the voice section detection unit and the recognition result of the speech recognition unit, a step in which the focus extraction unit generates an utterance indicating a recognized character string, based on the recognition result of the speech recognition unit, and a step in which the repetition generation unit generates an utterance indicating what the system has understood, based on the recognition result of the speech recognition unit; a step in which the response unit generates a response utterance corresponding to the user utterance, based on the recognition result of the speech recognition unit; and a step in which the communication unit outputs the output from either or both of the utterance generation unit and the response unit to the utterance output unit.

A program according to a fifth invention is a program for causing a computer to function as each unit of the utterance generation devices according to the first and second inventions.

According to the utterance generation device, method, and program of the present invention, a stream of sound representing a user utterance is received as input; the start and the end of speech are detected; the speech of the section defined by the detected start and end is sequentially output; a detection result is output when the start of speech or the end of speech is detected; speech recognition is performed on the speech of the section corresponding to the detected start of speech; and recognition results up to the end of the section, including intermediate recognition results obtained partway through the section, are sequentially output. An utterance generation unit is provided that includes at least one of a back-channel generation unit that generates an utterance indicating that speech recognition is being performed, based on the detection result of the voice section detection unit and the recognition result of the speech recognition unit, a focus extraction unit that generates an utterance indicating a recognized character string, based on the recognition result of the speech recognition unit, and a repetition generation unit that generates an utterance indicating what the system has understood, based on the recognition result of the speech recognition unit. The output from either or both of the speech recognition unit and the voice section detection unit is output to the utterance generation unit, and the output from the utterance generation unit is output to the utterance output unit. As a result, what the system has understood can be conveyed to the user, and smooth dialogue becomes possible.

FIG. 1 is a block diagram showing the configuration of an utterance generation device according to an embodiment of the present invention.
FIG. 2 is a diagram showing the channels to which the voice section detection unit and the speech recognition unit publish, and the timing of each.
FIG. 3 is a diagram showing an example of CRF training data.
FIG. 4 is a flowchart showing the utterance generation processing routine in the utterance generation device according to the embodiment of the present invention.

Embodiments of the present invention are described in detail below with reference to the drawings.

<Configuration of the Utterance Generation Device According to the Embodiment of the Present Invention>

First, the configuration of the utterance generation device according to the embodiment of the present invention is described. As shown in FIG. 1, the utterance generation device 100 according to the embodiment can be implemented as a computer that includes a CPU, a RAM, and a ROM storing various data and a program for executing the utterance generation processing routine described later. Functionally, as shown in FIG. 1, the utterance generation device 100 includes an input unit 10, a calculation unit 20, and an utterance output unit 50.

The input unit 10 accepts as input a stream of sound representing a user utterance.

The calculation unit 20 includes a communication unit 28, a voice section detection unit 30, a speech recognition unit 32, an utterance generation unit 40, and a response unit 48.

The communication unit 28 is described first, because it bridges all the other units.

The communication unit 28 outputs the output from either or both of the speech recognition unit 32 and the voice section detection unit 30, both described later, to the utterance generation unit 40 and the response unit 48. Specifically, it outputs the section detection results of the voice section detection unit 30 and the recognition results of the speech recognition unit 32 to the utterance generation unit 40, and outputs the recognition results of the speech recognition unit 32 to the response unit 48. It also outputs the output from either or both of the utterance generation unit 40 and the response unit 48 to the utterance output unit 50. Because the utterance generation unit 40 and the response unit 48 generate utterances and response utterances one after another, the communication unit 28 only needs to forward them in the order in which it receives them.

The modules of the spoken dialogue system communicate through the communication unit 28 on the basis of the publisher-subscriber model. In this model, a module acting as a publisher sends information to a specific channel (this is called publishing). A module acting as a subscriber that has subscribed to a channel in advance receives the information sent to that channel. A module may be both a publisher and a subscriber. A single module may also publish to multiple channels and receive information from multiple channels. The communication unit 28 manages the publishers and subscribers and controls the channels. ActiveMQ is free software that performs this kind of control, and that implementation is used here as the communication unit. All modules described below are publishers or subscribers and cooperate by sending information to specific channels. Other models of module cooperation, such as a peer-to-peer model, may of course be used instead.
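The publish/subscribe pattern described above can be illustrated with a minimal in-process broker. This is a sketch under stated assumptions: the `Broker` class and its methods are hypothetical stand-ins written for illustration and are not the ActiveMQ API; only the channel names are taken from the description.

```python
from collections import defaultdict


class Broker:
    """Minimal in-process stand-in for the communication unit (not ActiveMQ)."""

    def __init__(self):
        # Maps a channel name to the callbacks subscribed to it.
        self._subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        self._subscribers[channel].append(callback)

    def publish(self, channel, message=""):
        # Deliver the message to every subscriber of this channel.
        for callback in list(self._subscribers[channel]):
            callback(channel, message)


broker = Broker()
received = []

# One module may subscribe to several channels.
broker.subscribe("VAD START", lambda ch, msg: received.append((ch, msg)))
broker.subscribe("RECG SP", lambda ch, msg: received.append((ch, msg)))

broker.publish("VAD START")                  # empty message body is allowed
broker.publish("RECG SP", "partial result")
broker.publish("UTT BC", "yes")              # nobody subscribed, so not delivered
```

Because publishers never address subscribers directly, a module such as a back-channel generator can be added or removed without changing the detector or the recognizer.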

The voice section detection unit 30 receives as input the stream of sound representing the user utterance accepted by the input unit 10, detects the start and the end of speech, sequentially outputs the speech of the section defined by the detected start and end, and outputs (publishes) a detection result when it detects the start of speech or the end of speech.

The voice section detection unit 30 sequentially sends the speech of the detected section to the speech recognition unit 32. Voice sections can be detected using common audio features such as signal power and zero crossings. At the start and at the end of speech, the unit publishes a message, via the communication unit 28, to the channels VAD START and VAD END, respectively. Because these messages only need to tell the other modules that speech has started or ended, their bodies may be empty.
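A frame-based detector using power and zero crossings might look like the following sketch. The function names, the thresholds `power_thresh` and `zcr_max`, and the rule that speech means high power with a low zero-crossing rate are all assumptions made for illustration; the description only says that such general features may be used. A practical detector would also smooth its decisions over several frames.

```python
def frame_features(frame):
    """Return (mean power, zero-crossing rate) for one frame of samples."""
    power = sum(s * s for s in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return power, crossings / (len(frame) - 1)


def detect_sections(frames, publish, power_thresh=0.01, zcr_max=0.5):
    """Publish VAD START / VAD END when the speech state changes."""
    in_speech = False
    for frame in frames:
        power, zcr = frame_features(frame)
        is_speech = power > power_thresh and zcr < zcr_max
        if is_speech and not in_speech:
            publish("VAD START", "")   # message body may be empty
        elif not is_speech and in_speech:
            publish("VAD END", "")
        in_speech = is_speech
    if in_speech:
        # Assumption: close the section if the stream ends during speech.
        publish("VAD END", "")


events = []
silence = [0.0] * 8
voiced = [0.5, 0.5, 0.5, 0.5, -0.5, -0.5, -0.5, -0.5]  # loud, few crossings
detect_sections([silence, voiced, voiced, silence],
                lambda channel, message: events.append((channel, message)))
```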

As described below, the speech recognition unit 32 performs speech recognition on the speech of the section corresponding to the start of speech detected by the voice section detection unit 30, and sequentially outputs (publishes) recognition results up to the end of the section, including intermediate recognition results obtained partway through the section. The speech recognition unit 32 outputs (publishes) a recognition result when it detects a short pause, or at fixed intervals during the period up to the end of the section.

For speech recognition, the speech recognition unit 32 uses a speech recognizer developed by NTT (R). It is a high-accuracy speech recognizer that uses deep learning and large-scale weighted finite-state automata. Details are given in the following non-patent document.

Non-Patent Document 4: Yotaro Kubo, Atsunori Ogawa, Takaaki Hori, Atsushi Nakamura. Speech recognition technology based on integrated learning of speech and language. NTT Technical Journal, Vol. 25, No. 9, pp. 22-25, 2013.

Internally, speech recognition is a search over a network. The recognizer expands multiple word-sequence hypotheses and searches while retaining those with high acoustic and linguistic likelihood. The search advances each time audio is input. Referring to the internal state of the speech recognizer, the speech recognition unit 32 outputs recognition results at the start of speech recognition, when a short pause is detected (a pause of roughly 100 ms to 200 ms), when a long pause is detected (a pause of 200 ms or longer, at which point speech recognition ends), and at fixed intervals (for example, every 100 ms). At these timings it publishes the recognition result, via the communication unit 28, to the channels RECG START, RECG SP, RECG LP, and RECG NP, respectively. SP stands for short pause, LP for long pause, and NP for "not a pause".
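The mapping from pause length to output channel can be written down directly. In this sketch the function name `channel_for_pause` and the idea of passing the trailing silence length in milliseconds are assumptions; the thresholds come from the description, with exactly 200 ms treated as a long pause because the text says "200 ms or longer".

```python
SHORT_PAUSE_MS = 100   # lower bound of a short pause (roughly 100-200 ms)
LONG_PAUSE_MS = 200    # a pause this long or longer ends recognition


def channel_for_pause(pause_ms):
    """Pick the channel for a recognition result given the trailing pause."""
    if pause_ms >= LONG_PAUSE_MS:
        return "RECG LP"   # long pause: final result, recognition ends
    if pause_ms >= SHORT_PAUSE_MS:
        return "RECG SP"   # short pause: partial result
    return "RECG NP"       # not a pause: periodic partial result


channels = [channel_for_pause(ms) for ms in (0, 150, 200, 350)]
```

A subscriber that wants only final results would listen to RECG LP alone, while one that reacts during the utterance would also listen to RECG SP and RECG NP.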

FIG. 2 shows the channels to which the voice section detection unit 30 and the speech recognition unit 32 publish, and the timing of each.

The utterance generation unit 40 includes a back-channel generation unit 42, a focus extraction unit 44, and a repetition generation unit 46. The components of the utterance generation unit 40 run asynchronously, and each publishes the utterances it generates, one after another, via the communication unit 28. The utterance generation unit 40 only needs to include at least one of the back-channel generation unit 42, the focus extraction unit 44, and the repetition generation unit 46.

As described below, the back-channel generation unit 42 generates an utterance indicating that speech recognition is being performed, based on the detection result of the voice section detection unit 30 and the recognition result of the speech recognition unit 32. It generates such an utterance when the voice section detection unit 30 detects the start of speech. It also generates such an utterance when the speech recognition unit 32 outputs a recognition result upon detecting a short pause, or outputs a recognition result at a fixed interval. It further generates such an utterance when the speech recognition unit 32 outputs the recognition result for the end of the section.

The back-channel generation unit 42 subscribes to at least one of VAD START, RECG SP, RECG LP, and RECG NP. When a message arrives on one of these channels, it generates a back-channel utterance to signal that a voice section has been detected correctly or that recognition is proceeding correctly. Specifically, it selects one item at random from a list of several back-channel utterances, including 「はい」 ("yes") and 「ええ」 ("yeah"), and uses it as the back-channel utterance. It then publishes the content of the utterance, via the communication unit 28, to the channel UTT BC. UTT stands for utterance and BC for back-channel. Back-channel utterances may also be generated probabilistically, for example at random about once every two opportunities. A back-channel on VAD START tells the user that the system has noticed that speech has started. A back-channel on RECG SP or RECG NP tells the user that the system is listening even in the middle of the utterance. A back-channel on RECG LP tells the user that the system has understood that the utterance has ended.
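The back-channel behaviour described above (random selection from a short list, optionally firing only some of the time) can be sketched as a subscriber callback. The names `make_backchannel_handler`, `BACKCHANNELS`, and `TRIGGERS` and the `probability` parameter are hypothetical; the list contains only the two utterances named in the description, whereas the actual list is said to hold several.

```python
import random

BACKCHANNELS = ["はい", "ええ"]   # the two examples named in the description
TRIGGERS = {"VAD START", "RECG SP", "RECG LP", "RECG NP"}


def make_backchannel_handler(publish, probability=0.5, rng=None):
    """Build a subscriber callback that sometimes publishes a back-channel."""
    rng = rng or random.Random()

    def on_message(channel, message):
        # React only to the subscribed channels, and only some of the time.
        if channel in TRIGGERS and rng.random() < probability:
            publish("UTT BC", rng.choice(BACKCHANNELS))

    return on_message


published = []
always = make_backchannel_handler(
    lambda channel, text: published.append((channel, text)),
    probability=1.0,
    rng=random.Random(0),
)
always("VAD START", "")   # speech started: acknowledge it
always("VAD END", "")     # not a trigger channel: ignored
```

Setting `probability` to about 0.5 corresponds to the "once every two opportunities" example and keeps the system from acknowledging every periodic RECG NP message.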

The focus extraction unit 44 generates an utterance indicating the recognized character string, based on the recognition result from the speech recognition unit 32, as described below.

The focus extraction unit 44 subscribes to at least one of RECG SP, RECG LP, and RECG NP. When a message arrives from one of these channels, it performs focus extraction on the message content, which is a partial or final speech recognition result. A focus is a word sequence or phrase contained in the utterance string that is suitable as a topic of the dialogue.

Because focus extraction can be treated as the problem of finding a substring of the utterance string, the focus is extracted by sequence labeling. Specifically, a large number of utterances are collected in advance, and the words and phrases in them that are suitable as a focus are labeled. The following is an example of focus-tagged data.

<cand>キャンプ</cand>みたいですね。 (It looks like <cand>camping</cand>.)
今週末とかはうってつけですね。 (This weekend would be perfect.)
<cand>バラエティー</cand>もよく見るんですよ。 (I often watch <cand>variety shows</cand>, too.)

Here, the spans enclosed by <cand> and </cand> are the words and phrases judged suitable as a focus. A large amount of such data is created, and a model that can identify these spans in unseen utterances is trained by sequence labeling. Specifically, a method called the conditional random field (CRF) was used. This is a general method for labeling specific sequences contained in a text.

For training, the tagged data described above is morphologically analyzed, and each morpheme is given a label indicating whether it is part of a word or phrase suitable as a focus, which yields CRF training data such as that shown in FIG. 4.

Here, the columns represent, in order, the surface form of the word, the part of speech, the semantic attribute for common nouns in NTT(R)'s Japanese lexicon Nihongo Goi-Taikei, the semantic attribute for proper nouns in the same lexicon, and the focus label: start of a focus (B-cand), inside a focus (I-cand), or not a focus (O). A blank line marks the boundary between utterances. A semantic attribute is a number representing a concept. When a word has no applicable semantic attribute, O is assigned.
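As a rough illustration of how <cand>-tagged text could be turned into per-token focus labels, the sketch below converts a tagged utterance into (token, label) pairs. It covers only the surface form and the focus label; the part-of-speech and semantic-attribute columns are omitted. The function name `to_bio` is hypothetical, and whitespace splitting is used purely as a stand-in for a real morphological analyzer, so the input is assumed to be pre-segmented.

```python
import re

CAND = re.compile(r"<cand>(.*?)</cand>")


def to_bio(tagged, tokenize=str.split):
    """Convert a <cand>-tagged utterance into (token, label) rows.

    `tokenize` stands in for morphological analysis; the default simply
    splits on whitespace, so the input must already be segmented.
    """
    rows = []
    pos = 0
    for match in CAND.finditer(tagged):
        # Tokens before the tagged span are outside any focus.
        for tok in tokenize(tagged[pos:match.start()]):
            rows.append((tok, "O"))
        # First token of the span starts the focus; the rest continue it.
        for i, tok in enumerate(tokenize(match.group(1))):
            rows.append((tok, "B-cand" if i == 0 else "I-cand"))
        pos = match.end()
    # Remaining tokens after the last tagged span.
    for tok in tokenize(tagged[pos:]):
        rows.append((tok, "O"))
    return rows
```

For example, `to_bio("<cand>キャンプ</cand> みたい です ね 。")` labels the first token B-cand and the remaining four tokens O.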

Training may use an existing tool such as CRF++. Other free software such as CRFSuite may also be used.

A CRF model can be trained from this data, and the focus of an unseen sentence can be obtained by applying the model. Specifically, the unseen sentence is morphologically analyzed and converted into data in the same format as the training data (since the correct answer is unknown, all focus labels are set to O). The CRF model then estimates, for each morpheme, the label (start of a focus, inside a focus, or not a focus) that maximizes the likelihood.

For "ラーメンが食べたい" ("I want to eat ramen"), if B-cand is assigned to the word "ラーメン" (ramen), then "ラーメン" is extracted as the focus. For a noun phrase, the concatenation of the B-cand morpheme and the following I-cand morphemes is extracted as the focus.
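The step of turning predicted labels back into focus strings can be sketched as below. The sketch assumes the labels have already been produced by some trained model; `extract_foci` is a hypothetical helper name, and joining morphemes without a separator is an assumption that suits Japanese text.

```python
def extract_foci(tokens, labels):
    """Collect focus strings from parallel lists of tokens and B/I/O labels."""
    foci, current = [], []
    for tok, label in zip(tokens, labels):
        if label == "B-cand":
            # A new focus starts; close any focus still open.
            if current:
                foci.append("".join(current))
            current = [tok]
        elif label == "I-cand" and current:
            # Continue the focus that is currently open.
            current.append(tok)
        else:
            # "O" (or a stray I-cand with no open focus) ends the current focus.
            if current:
                foci.append("".join(current))
            current = []
    if current:
        foci.append("".join(current))
    return foci
```

With the tokens ラーメン / が / 食べ / たい and the labels B-cand / O / O / O, the function returns a list containing only "ラーメン".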

In this way, the focus extraction unit 44 performs focus extraction on the recognition result of the input speech and, when a focus is obtained, publishes the focus string to the UTT BC channel via the communication unit 28.

The repetition generation unit 46 generates an utterance indicating the content understood by the system, based on the recognition result from the speech recognition unit 32, as described below. Here, it extracts a predicate-argument structure from the character string representing the recognition result and generates the utterance based on the extracted predicate-argument structure.

The repetition generation unit 46 performs predicate-argument structure analysis on the speech recognition result obtained from at least one of RECG SP, RECG LP, and RECG NP, and from the resulting structure generates a sentence that repeats (echoes) what the other party said.

Predicate-argument structure analysis extracts "what did what": it extracts from a sentence the predicate and the noun phrases that serve as its arguments. Non-Patent Document 5 below describes the processing in detail.

Non-Patent Document 5: Kenji Imamura, Ryuichiro Higashinaka, Tomoko Izumi. Predicate-argument structure analysis with zero-pronoun anaphora resolution for dialogue analysis. Journal of Natural Language Processing, 2015, 22.1: 3-26.

For the sentence "太郎が動物園に行くんです" ("Taro is going to the zoo"), "行く" (go) is analyzed as the predicate, "太郎" (Taro) as the subject (the ga-case argument), and "動物園" (zoo) as the (indirect) object (the ni-case argument).

Once the predicate-argument structure of the input sentence is known, a sentence can be generated from it. For example, arranging the ga-case argument, the ni-case argument (and any other case arguments), and the predicate in that order yields the sentence "太郎が動物園に行く" ("Taro goes to the zoo"). Applying hand-written sentence-ending rules to this sentence produces empathetic sentences suitable for repetition, such as "太郎が動物園に行くんだね" and "太郎が動物園に行くんですね" (both roughly "So Taro is going to the zoo"). This can be done, for example, with a rule that appends "んだね" or "んですね". More complex rules may also be used to convert the sentence to any desired ending. For example, the sentence may be turned into a question, or a dialect or distinctive sentence ending may be added to give the system a character.
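The generation step can be illustrated with a small sketch that assembles a repetition sentence from a predicate-argument structure. The dictionary layout (keys `ga`, `ni`, `predicate`) and the function name `restate` are assumptions made for illustration; only the ga and ni cases and a single appended ending are handled, whereas the text allows other cases and more complex ending rules.

```python
def restate(pas, ending="んだね"):
    """Build a repetition sentence from a predicate-argument structure.

    `pas` is a hypothetical dict with a required "predicate" and optional
    "ga" and "ni" arguments. Arguments are emitted in the order
    ga case, ni case, predicate, followed by the sentence ending.
    """
    parts = []
    if "ga" in pas:
        parts.append(pas["ga"] + "が")
    if "ni" in pas:
        parts.append(pas["ni"] + "に")
    parts.append(pas["predicate"])
    return "".join(parts) + ending
```

Passing `ending="んですね"` gives the polite variant, and an empty ending gives the plain sentence before any ending rule is applied.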

A method for generating utterances from a predicate-argument structure is described in Non-Patent Document 6 below.

Non-Patent Document 6: Higashinaka, Ryuichiro, et al. Towards an open-domain conversational system fully based on natural language processing. In: COLING. 2014. pp. 928-939.

If the predicate-argument structure contains a first-person expression, it is converted to the second person, and if it contains a second-person expression, it is converted to the first person. As a result, when the user says "私は元気です" ("I am fine"), the system can produce an appropriate repetition such as "あなたは元気なんですね" ("So you are fine").
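The person conversion can be sketched as a simple lookup over the arguments of the structure. The pronoun table below is an assumption that contains only the pair used in the example; a practical table would need many more Japanese first- and second-person forms. The dict layout and the name `swap_person` are likewise hypothetical.

```python
# Hypothetical pronoun table: only the pair from the example is included.
PERSON_SWAP = {"私": "あなた", "あなた": "私"}


def swap_person(pas):
    """Return a copy of the structure with first and second person exchanged."""
    return {role: PERSON_SWAP.get(value, value) for role, value in pas.items()}
```

Values that are not in the table, such as the predicate or ordinary nouns, pass through unchanged.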

After generating the repetition sentence as described above, the repetition generation unit 46 publishes it to the UTT BC channel via the communication unit 28.

The response unit 48 generates a response utterance corresponding to the user utterance, based on the recognition result from the speech recognition unit 32, as described below.

The response unit 48 subscribes to RECG LP (the speech recognition result after the voice section has ended, which is basically the most reliable recognition result). On receiving it, the unit generates the next system utterance based on the utterance content. The response unit 48 may be any dialogue system; for example, one that responds by a method based on response rules (Non-Patent Document 6) or by an utterance generation method based on large-scale text data (Non-Patent Document 5 mentioned above).

Non-Patent Document 7: R. S. Wallace, The Anatomy of A.L.I.C.E., A.L.I.C.E. Artificial Intelligence Foundation, Inc., 2004.

The response utterance generated by the response unit 48 is preferably not a feedback utterance meant to tell the other party that understanding is progressing, but an utterance that the system has deliberated on in order to advance the dialogue. The generated utterance is published to the UTT GEN channel via the communication unit 28. GEN means an utterance generated as the system's next utterance.

The utterance output unit 50 outputs the utterance generated by the utterance generation unit 40 or the response utterance generated by the response unit 48, either as speech by speech synthesis or on a display. In the present embodiment, the utterance is synthesized and output as speech. If a new utterance generated by the utterance generation unit 40 is input while an utterance generated by the utterance generation unit 40 is being output, the new utterance is not output. In other words, while an utterance from any of the back-channel generation unit 42, the focus extraction unit 44, and the repetition generation unit 46 is being spoken, no other such utterance is made. If a response utterance generated by the response unit 48 is input while an utterance generated by the utterance generation unit 40 is being output, the response utterance is output after the utterance generated by the utterance generation unit 40 has been output.

In the present embodiment, the utterance output unit 50 functions as a speech synthesis unit. The utterance output unit 50 subscribes to the channels UTT BC (utterances generated by the utterance generation unit 40) and UTT GEN (response utterances generated by the response unit 48). When it receives a message from UTT BC via the communication unit 28, it synthesizes the utterance content as speech. If UTT BC content is already being synthesized or spoken, synthesis of the new UTT BC message is canceled. When it receives a message on the UTT GEN channel, it always synthesizes and speaks the message, because it is a response the system has deliberated on. If a UTT BC message is being processed, the unit waits for that processing to finish and then speaks the UTT GEN content. A commercially available speech synthesis engine may be used for speech synthesis.
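The arbitration between the two channels can be sketched as a small state machine. This is an illustrative model only: the `speak` callback stands in for a speech synthesis engine, and `on_finished` is assumed to be called when playback ends. Two behaviors are assumptions not stated in the text: a UTT BC message arriving while a UTT GEN message is being spoken is also dropped, and multiple waiting UTT GEN messages are queued in arrival order.

```python
from collections import deque


class UtteranceOutput:
    """Drops UTT BC messages while busy; always speaks UTT GEN, waiting if needed."""

    def __init__(self, speak):
        self.speak = speak      # hypothetical hook: speak(text) starts synthesis
        self.current = None     # (channel, text) being spoken, or None when idle
        self.pending = deque()  # UTT GEN messages waiting for their turn

    def on_message(self, channel, text):
        """Return True if the message was accepted, False if it was dropped."""
        if channel == "UTT BC":
            if self.current is not None:
                return False    # already speaking: cancel the new back-channel
            self._start(channel, text)
            return True
        if channel == "UTT GEN":
            if self.current is not None:
                self.pending.append(text)  # wait for the current utterance
            else:
                self._start(channel, text)
            return True
        return False            # unknown channel

    def _start(self, channel, text):
        self.current = (channel, text)
        self.speak(text)

    def on_finished(self):
        """Called when playback ends; starts the next waiting response, if any."""
        self.current = None
        if self.pending:
            self._start("UTT GEN", self.pending.popleft())
```

Keeping the waiting responses in a queue rather than a single slot is a design choice of this sketch; the description only requires that a UTT GEN message is never discarded.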

When outputting on a display, the system's utterance may, for example, be displayed and then, after the time it would have taken to output the utterance as speech has elapsed, the latest message may be output (this corresponds to the behavior in which messages arriving while the utterance output unit 50 is outputting a message are skipped). Alternatively, the characters may be displayed one at a time at the standard speed at which a person reads text aloud (for example, 4 characters per second). Displaying utterances has the advantage that the user can check the system's previous utterances. Displaying utterances together with speech output is also preferable because it prevents the user from mishearing.
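The character-by-character display can be illustrated by computing when each prefix of the utterance should appear. The function names and the idea of returning a schedule instead of driving a real display are assumptions of this sketch; the default rate of 4 characters per second is the example value given in the text.

```python
def reveal_schedule(text, chars_per_second=4.0):
    """Return (time offset in seconds, text shown so far) for each character."""
    step = 1.0 / chars_per_second
    return [(i * step, text[: i + 1]) for i in range(len(text))]


def display_duration(text, chars_per_second=4.0):
    """Time for which the display is treated as busy with this utterance."""
    return len(text) / chars_per_second
```

A display loop could treat `display_duration` as the busy period and skip messages that arrive within it, mirroring the skipping behavior of speech output.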

＜Operation of the utterance generation device according to the embodiment of the present invention＞

Next, the operation of the utterance generation device 100 according to the embodiment of the present invention will be described. When the input unit 10 receives a stream of sound representing a user utterance as input, the utterance generation device 100 executes the utterance generation processing routine shown in FIG. 5.

First, in step S100, the voice section detection unit 30 takes as input the stream of sound representing the user utterance received by the input unit 10, detects the start and the end of speech, sequentially outputs the speech in the section defined by the detected start and end, and outputs (publishes) a detection result when the start of speech is detected or when the end of speech is detected.

Next, in step S102, the speech recognition unit 32 performs speech recognition on the speech in the section corresponding to the start of speech detected in step S100, and sequentially outputs (publishes) recognition results up to the end of the section, including recognition results obtained partway through the section. It also outputs (publishes) a recognition result when a short pause is detected, or at regular intervals during the period up to the end of the section.

In step S104, the communication unit 28 outputs the section detection result of step S100 and the speech recognition result of step S102 to the utterance generation unit 40, and outputs the speech recognition result of step S102 to the response unit 48.

In step S106, in the utterance generation unit 40, the back-channel generation unit 42 generates an utterance indicating that speech recognition is being performed, based on the detection result from the voice section detection unit 30 and the recognition result from the speech recognition unit 32; the focus extraction unit 44 generates an utterance indicating the recognized character string, based on the recognition result from the speech recognition unit 32; and the repetition generation unit 46 generates an utterance indicating the content understood by the system, based on the recognition result from the speech recognition unit 32. The utterances generated by each unit are published in the order in which they are generated.

In step S108, the response unit 48 generates and publishes a response utterance corresponding to the user utterance, based on the recognition result from the speech recognition unit 32.

In step S110, the communication unit 28 sequentially outputs the output from both or either of the utterance generation unit 40 and the response unit 48 to the utterance output unit 50.

In step S112, the utterance generated by the utterance generation unit 40 in step S106 or the response utterance generated by the response unit 48 in step S108 is synthesized and output as speech. Here, if a new utterance generated by the utterance generation unit 40 is input while an utterance generated by the utterance generation unit 40 is being output, the new utterance is not output. If a response utterance generated by the response unit 48 is input while an utterance generated by the utterance generation unit 40 is being output, the response utterance is output after the utterance generated by the utterance generation unit 40 has been output.

As described above, the utterance generation device according to the embodiment of the present invention takes as input a stream of sound representing a user utterance, detects the start and the end of speech, sequentially outputs the speech in the section defined by the detected start and end, and outputs a detection result when the start or the end of speech is detected. It performs speech recognition on the speech in the section corresponding to the detected start of speech and sequentially outputs recognition results up to the end of the section, including recognition results obtained partway through the section. It has an utterance generation unit 40 that includes at least one of the following: the back-channel generation unit 42, which generates an utterance indicating that speech recognition is being performed, based on the detection result from the voice section detection unit 30 and the recognition result from the speech recognition unit 32; the focus extraction unit 44, which generates an utterance indicating the recognized character string, based on the recognition result from the speech recognition unit 32; and the repetition generation unit 46, which generates an utterance indicating the content understood by the system, based on the recognition result from the speech recognition unit 32. The response unit 48 generates a response utterance corresponding to the user utterance. The output from both or either of the speech recognition unit 32 and the voice section detection unit 30 is output to the utterance generation unit 40 or the response unit 48, and the output from the utterance generation unit 40 or the response unit 48 is output to the utterance output unit 50. In this way, the device conveys to the user what the system has understood and enables smooth dialogue.

In conversations with a dialogue system that actually implemented the technique of the embodiment, a spoken dialogue system was realized that conveys to the user that the voice section is being properly detected or that speech recognition is progressing. In addition, because focus extraction can be performed even in the middle of speech recognition, it was confirmed that the user can tell whether the words suitable as dialogue topics in what the user said are being recognized correctly.

Furthermore, it was confirmed that, even in the middle of speech recognition, repetition sentence generation lets the user tell whether the proposition the user stated (what did what) is being recognized correctly. With these mechanisms, the user can converse while checking whether the system is recognizing and understanding correctly, and a system capable of smooth spoken communication with the user was realized.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

Reference Signs List
10 Input unit
20 Arithmetic unit
28 Communication unit
30 Voice section detection unit
32 Speech recognition unit
40 Utterance generation unit
42 Back-channel generation unit
44 Focus extraction unit
46 Repetition generation unit
48 Response unit
50 Utterance output unit
100 Utterance generation device

Claims

An utterance generation device comprising:
a voice section detection unit that receives as input a stream of sound representing a user utterance, detects the start and the end of speech, sequentially outputs the speech in the section defined by the detected start and end of speech, and outputs a detection result when the start of speech is detected or when the end of speech is detected;
a speech recognition unit that performs speech recognition on the speech in the section from the start to the end of speech detected by the voice section detection unit, and sequentially outputs recognition results up to the end of the section, including recognition results obtained partway through the section;
an utterance generation unit that includes a back-channel generation unit that generates, based on the detection result from the voice section detection unit and the recognition results sequentially output from the speech recognition unit, an utterance indicating that the speech recognition is being performed, a focus extraction unit that generates, based on the recognition results sequentially output from the speech recognition unit, an utterance containing a substring that is extracted from the character string representing the recognition result and is the focus of the user utterance, and a repetition generation unit that extracts, based on the recognition results sequentially output from the speech recognition unit, a predicate-argument structure from the character string representing the recognition result and generates, based on the predicate-argument structure, an utterance that repeats the user utterance and indicates the content understood by the system, wherein the back-channel generation unit, the focus extraction unit, and the repetition generation unit each sequentially generate utterances indicating that understanding of the user utterance is progressing;
an utterance output unit; and
a communication unit that outputs the output from both or either of the speech recognition unit and the voice section detection unit to the utterance generation unit, and outputs the utterances sequentially generated by the utterance generation unit to the utterance output unit in the order in which they were generated.

An utterance generation device comprising:
a voice section detection unit that receives as input a stream of sound representing a user utterance, detects the start and the end of speech, sequentially outputs the speech in the section defined by the detected start and end of speech, and outputs a detection result when the start of speech is detected or when the end of speech is detected;
a speech recognition unit that performs speech recognition on the speech in the section from the start to the end of speech detected by the voice section detection unit, and sequentially outputs recognition results up to the end of the section, including recognition results obtained partway through the section;
an utterance generation unit that includes a back-channel generation unit that generates, based on the detection result from the voice section detection unit and the recognition results sequentially output from the speech recognition unit, an utterance indicating that the speech recognition is being performed, a focus extraction unit that generates, based on the recognition results sequentially output from the speech recognition unit, an utterance indicating a character string containing a substring that is extracted from the character string representing the recognition result and is the focus of the user utterance, and a repetition generation unit that extracts, based on the recognition results sequentially output from the speech recognition unit, a predicate-argument structure from the character string representing the recognition result and generates, based on the predicate-argument structure, an utterance that repeats the user utterance and indicates the content understood by the system, wherein the back-channel generation unit, the focus extraction unit, and the repetition generation unit each sequentially generate utterances indicating that understanding of the user utterance is progressing;
a response unit that generates a response utterance for advancing the dialogue in response to the user utterance, based only on the final recognition result, obtained when the section has ended, among the recognition results from the speech recognition unit;
an utterance output unit; and
a communication unit that outputs the output from both or either of the speech recognition unit and the voice section detection unit to the utterance generation unit and the response unit, sequentially outputs the utterances sequentially generated by the utterance generation unit to the utterance output unit in the order in which they were generated, and outputs the response utterance generated by the response unit after the end of the section to the utterance output unit.

The utterance generation device according to claim 2, wherein, when the response utterance generated by the response unit is input while an utterance generated by the utterance generation unit is being output, the utterance output unit outputs the response utterance after outputting the utterance generated by the utterance generation unit.

The utterance generation device according to any one of claims 1 to 3, wherein the back-channel generation unit generates an utterance indicating that the speech recognition is being performed when the start of speech is detected by the voice section detection unit.

The utterance generation device according to any one of claims 1 to 4, wherein the speech recognition unit outputs a recognition result when a short pause is detected or at regular intervals during the period up to the end of the section, and
the back-channel generation unit generates an utterance indicating that the speech recognition is being performed when the speech recognition unit outputs a recognition result upon detecting a short pause, or when the speech recognition unit outputs a recognition result at regular intervals.

The utterance generation device according to any one of claims 1 to 5, wherein the utterance output unit outputs the utterance as speech.

The utterance generation device according to any one of claims 1 to 5, wherein the utterance output unit outputs the utterance on a display.

An utterance generation method in an utterance generation device including a voice section detection unit, a speech recognition unit, an utterance generation unit including a back-channel generation unit, a focus extraction unit, and a repetition generation unit, an utterance output unit, and a communication unit, the method comprising:
a step in which the voice section detection unit receives as input a stream of sound representing a user utterance, detects the start and the end of speech, sequentially outputs the speech in the section defined by the detected start and end of speech, and outputs a detection result when the start of speech is detected or when the end of speech is detected;
a step in which the speech recognition unit performs speech recognition on the speech in the section from the start to the end of speech detected by the voice section detection unit, and sequentially outputs recognition results up to the end of the section, including recognition results obtained partway through the section;
a step in which the communication unit outputs the output from both or either of the speech recognition unit and the voice section detection unit to the utterance generation unit;
a step in which the back-channel generation unit, the focus extraction unit, and the repetition generation unit of the utterance generation unit each sequentially generate utterances indicating that understanding of the user utterance is progressing, wherein the back-channel generation unit generates, based on the detection result from the voice section detection unit and the recognition results sequentially output from the speech recognition unit, an utterance indicating that the speech recognition is being performed, the focus extraction unit generates an utterance containing a substring that is extracted from the character string representing the recognition result and is the focus of the user utterance, and the repetition generation unit extracts, based on the recognition results sequentially output from the speech recognition unit, a predicate-argument structure from the character string representing the recognition result and generates, based on the predicate-argument structure, an utterance that repeats the user utterance and indicates the content understood by the system; and
a step in which the communication unit outputs the utterances sequentially generated by the utterance generation unit to the utterance output unit in the order in which they were generated.

An utterance generation method in an utterance generation device including a voice section detection unit, a voice recognition unit, an utterance generation unit having a companion generation unit, a focus extraction unit, and a repetition generation unit, an utterance output unit, a response unit, and a communication unit, the method including:
a step in which the voice section detection unit receives as input a stream of sound representing a user utterance, detects the start and end of voice, sequentially outputs the voice in the section defined by the detected start and end, and outputs a detection result when it detects the start of the voice or when it detects the end of the voice;
a step in which the voice recognition unit performs voice recognition on the voice in the section from the start to the end of the voice detected by the voice section detection unit, and sequentially outputs recognition results up to the end of the section, including intermediate recognition results obtained partway through the section;
a step in which the communication unit outputs, to the utterance generation unit and the response unit, the output from both or either of the voice recognition unit and the voice section detection unit;
a step in which each of the companion generation unit, the focus extraction unit, and the repetition generation unit of the utterance generation unit sequentially generates an utterance indicating that understanding of the user utterance is progressing, wherein the companion generation unit generates an utterance indicating that voice recognition is being performed, based on the detection result from the voice section detection unit and the recognition results sequentially output from the voice recognition unit; the focus extraction unit generates an utterance containing a partial character string that is extracted from the character string represented by the recognition result and that is the focus of the user utterance; and the repetition generation unit, based on the recognition results sequentially output from the voice recognition unit, extracts a predicate term structure from the character string represented by the recognition result and generates, based on the predicate term structure, an utterance that repeats the user utterance and indicates the content understood by the system;
a step in which the response unit generates a response utterance for advancing the dialogue in response to the user utterance, based only on the final recognition result, obtained when the section ends, among the recognition results from the voice recognition unit; and
a step in which the communication unit sequentially outputs the utterances sequentially generated by the utterance generation unit to the utterance output unit in the order in which they were generated, and outputs the response utterance generated by the response unit after the end of the section to the utterance output unit.
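The distinctive point of this method is that the response unit uses only the final recognition result, while incremental utterances are produced from intermediate results. A minimal Python sketch of that split follows. The `Recognition` tuple, the function names, the wording of the utterances, and the stub that stands in for the whole utterance generation unit are assumptions made for illustration only.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple

# Hypothetical representation: (text recognized so far, True once the section has ended).
Recognition = Tuple[str, bool]


def incremental_stub(result: Recognition) -> Optional[str]:
    # Stand-in for the utterance generation unit: react to intermediate results only.
    _text, is_final = result
    return None if is_final else "Uh-huh."


def response_unit(result: Recognition) -> Optional[str]:
    text, is_final = result
    if not is_final:
        return None  # intermediate recognition results are ignored here
    # Assumed response template; a real response unit would drive the dialogue.
    return f"Why do you say '{text}'?"


def communication_unit(
    results: Iterable[Recognition],
    generate_incremental: Callable[[Recognition], Optional[str]],
    respond: Callable[[Recognition], Optional[str]],
) -> Iterator[str]:
    # Incremental utterances go out in generation order; the response utterance
    # follows once the final result of the section has arrived.
    for result in results:
        incremental = generate_incremental(result)
        if incremental is not None:
            yield incremental
        response = respond(result)
        if response is not None:
            yield response


results = [("i like", False), ("i like ramen", False), ("i like ramen", True)]
for utterance in communication_unit(results, incremental_stub, response_unit):
    print(utterance)
```

Keeping the response unit blind to intermediate results means a later correction by the recognizer cannot change a response that has already been spoken. Only the lightweight incremental utterances depend on partial text.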

A program for causing a computer to function as each unit of the utterance generation device according to any one of claims 1 to 7.