JP7361988B2

JP7361988B2 - Voice dialogue system, voice dialogue method, and voice dialogue management device

Info

Publication number: JP7361988B2
Application number: JP2023508340A
Authority: JP
Inventors: 啓吾川島
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2021-03-25
Filing date: 2021-03-25
Publication date: 2023-10-16
Anticipated expiration: 2041-03-25
Also published as: JPWO2022201458A1; WO2022201458A1

Description

The present disclosure relates to a voice dialogue system, a voice dialogue method, and a voice dialogue management device.

In voice dialogue systems equipped with a voice recognition function, typified by car navigation systems, smart speakers, and automated telephone answering systems, a barge-in function (hereinafter, barge-in) has been developed that allows the user of the system to interrupt with voice input even while the system is outputting a response voice. Permitting barge-in, however, can have side effects in interactive processing. For example, when the system fails to recognize speech and asks the user to repeat an utterance, it may misrecognize the continuation of the previous utterance; or the user may listen to only part of the response voice and speak while misunderstanding the question. Such deviations in the timing at which voice recognition starts, in other words the low accuracy of barge-in acceptance determination, have reduced the usability of voice dialogue systems.

To address these problems, a conventional voice dialogue system takes the generated response voice signal as input, calculates the utterance duration of the response voice from the size of the signal data file, and, based on the calculated duration, controls the timing at which voice recognition starts so that it begins before output of the response voice is complete (see, for example, Patent Document 1).

Patent Document 1: Japanese Patent Application Publication No. 2007-155986

However, when the conventional voice dialogue system described above is applied to a system in which the voice dialogue management unit and the voice input/output unit are separate, independent components, the two units must operate in step with the output completion timing (output completion time) of the response voice output by the voice dialogue management unit, yet such units are often interconnected by an asynchronous communication network. In that case, because the transmission delay of the network fluctuates from moment to moment, the output completion timing of the response voice differs between the voice dialogue management unit, which generates it, and the voice input/output unit. It is therefore difficult to accurately detect the completion time of the response voice actually output to the user.
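The effect of a fluctuating transmission delay can be illustrated with a small calculation. The sketch below is not taken from the disclosure; the function names and the numeric values are assumptions chosen only to show that an estimate made on the management side, which cannot see the network delay, is off by exactly that delay.

```python
def estimated_completion_at_manager(send_time_s, duration_s):
    """Completion time as the management side would estimate it,
    knowing only when it sent the audio and how long the audio is."""
    return send_time_s + duration_s


def actual_completion_at_device(send_time_s, network_delay_s, duration_s):
    """Completion time at the input/output side, where playback can
    only start after the audio has crossed the network."""
    return send_time_s + network_delay_s + duration_s


if __name__ == "__main__":
    send_time_s = 10.0   # hypothetical time at which the response is sent
    duration_s = 2.0     # hypothetical length of the response voice
    for delay_s in (0.05, 0.20, 0.80):  # hypothetical, varying delays
        est = estimated_completion_at_manager(send_time_s, duration_s)
        act = actual_completion_at_device(send_time_s, delay_s, duration_s)
        print(f"delay={delay_s:.2f}s estimate={est:.2f}s "
              f"actual={act:.2f}s error={act - est:.2f}s")
```

Because the error equals the unknown delay, it cannot be corrected from the management side alone, which is why the disclosure has the voice input/output unit report its own output status.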

Furthermore, because the voice dialogue management unit and the voice input/output unit may handle voice data differently, for example by using different sampling frequencies, it is difficult to accurately detect the output completion time of the response voice from the size of the signal data file. It is also difficult to attach output setting information, such as the output data file size, to the output signal of the response voice.
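As a worked example of the sampling-frequency problem, the duration of uncompressed PCM audio is its size in bytes divided by the byte rate. The sketch below assumes 16-bit mono PCM and hypothetical sizes and rates; it only shows that the same file size maps to different durations when the two sides assume different sampling frequencies.

```python
def duration_from_size(num_bytes, sample_rate_hz, bytes_per_sample=2, channels=1):
    """Duration in seconds of uncompressed PCM audio of the given size.
    The 16-bit mono defaults are assumptions made for this illustration."""
    return num_bytes / (sample_rate_hz * bytes_per_sample * channels)


if __name__ == "__main__":
    size_bytes = 64000  # hypothetical size of a response voice file
    print(duration_from_size(size_bytes, 16000))  # duration if data is 16 kHz
    print(duration_from_size(size_bytes, 8000))   # duration if data is 8 kHz
```

A completion time derived from file size is therefore only as good as the assumed audio format, which the management side may not know.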

In short, the output completion time of the response voice cannot be calculated from response voice data whose output timing differs. The voice dialogue management unit therefore cannot accurately detect when the response voice output to the user by the voice dialogue system finished, and as a result the accuracy of barge-in acceptance determination degrades and the usability of the voice dialogue system declines.

The present disclosure has been made to solve the above problems. Even in a voice dialogue system in which the voice dialogue management unit and the voice input/output unit are independent components, the voice dialogue management unit receives the output completion time of the response voice that the voice input/output unit output to the user, and can thereby accurately detect that time. The object is to improve the accuracy of barge-in acceptance determination in voice recognition and thus the usability of the voice dialogue system.

The voice dialogue system according to the present disclosure
includes a voice input/output unit and a voice dialogue management unit,
and is a voice dialogue system in which a response voice generated by the voice dialogue management unit is output to a user with a delay, wherein
the voice input/output unit includes:
a voice input unit that acquires the user's uttered voice; and
a voice output unit that outputs the response voice to the user and outputs a voice output status of the response voice to the voice dialogue management unit, and
the voice dialogue management unit includes:
a voice recognition unit that performs voice recognition on the user's uttered voice and outputs a voice recognition result;
an intention understanding unit that estimates the user's utterance intention from the voice recognition result and outputs an intention understanding result;
a dialogue management unit that outputs response content information for the user based on the intention understanding result;
a voice generation unit that generates an audio signal of the response voice based on the response content information and outputs it to the voice input/output unit;
a voice output information generation unit that generates, from the voice output status, voice output information indicating whether the response voice is being output; and
an input acceptance determination unit that uses the voice output information to determine whether input to the intention understanding unit is accepted.

The voice dialogue method according to the present disclosure is executed by a voice dialogue system including a voice input/output device and a voice dialogue management device that generates a response voice. The voice input/output device acquires the user's uttered voice, outputs the response voice to the user, and outputs a voice output status of the response voice to the voice dialogue management device. The voice dialogue management device performs voice recognition on the user's uttered voice, estimates the user's utterance intention from the voice recognition result, determines the content of a response to the user based on the intention understanding result obtained by the estimation, generates an audio signal of the response voice based on response content information derived from that response content, and outputs the signal to the voice input/output device. When the voice output status is input, the voice dialogue management device generates, from the voice output status, voice output information indicating whether the response voice is being output, and uses the voice output information to determine whether to perform the estimation.

The voice dialogue management device according to the present disclosure is a device that generates a response voice, and includes:
a voice recognition unit that performs voice recognition on a user's uttered voice and outputs a voice recognition result;
an intention understanding unit that estimates the user's utterance intention from the voice recognition result and outputs an intention understanding result;
a dialogue management unit that outputs response content information for the user based on the intention understanding result;
a voice generation unit that generates and outputs an audio signal of the response voice based on the response content information;
a voice output information generation unit that receives a voice output status, which is the status of outputting the audio signal of the response voice to the user, and generates voice output information indicating whether the response voice is being output; and
an input acceptance determination unit that uses the voice output information to determine whether input to the intention understanding unit is accepted.

According to the present disclosure, the output completion time of the response voice of the voice dialogue system can be accurately detected even in a voice dialogue system in which the voice dialogue management unit and the voice input/output unit are separate, independent components. As a result, the accuracy of barge-in acceptance determination in voice recognition can be improved, which improves the usability of the voice dialogue system and the voice dialogue method.

FIG. 1 is a block configuration diagram of the voice dialogue system in Embodiment 1.
FIG. 2 is a hardware configuration diagram of the voice dialogue system in Embodiment 1.
FIG. 3 is a flowchart showing the operation of the voice dialogue system in Embodiment 1.
FIG. 4 is an example of the operation of the input acceptance determination unit in Embodiment 1.
FIG. 5 is a block configuration diagram of the voice dialogue system in Embodiment 2.
FIG. 6 is a hardware configuration diagram of the voice dialogue system in Embodiment 2.
FIG. 7 is a block configuration diagram of the voice dialogue system in Embodiment 3.
FIG. 8 is a flowchart showing the operation of the voice dialogue system in Embodiment 3.
FIG. 9 is a block configuration diagram of the voice dialogue system in Embodiment 4.
FIG. 10 is a flowchart showing the operation of the voice dialogue system in Embodiment 4.

Embodiment 1.
<<1-1>> Configuration
The voice dialogue system in Embodiment 1 will be described with reference to FIGS. 1 to 4. FIG. 1 is a block configuration diagram of the voice dialogue system according to Embodiment 1.

In FIG. 1, the voice dialogue system 1000 comprises a voice input/output unit 200, a voice dialogue management unit 300, and a network NW.

The voice input/output unit 200 faces the user U; it handles voice input to the voice dialogue system 1000 and presents the response voice from the voice dialogue system 1000 to the user U. The voice input/output unit 200 is built into, for example, the voice input/output device of a smart speaker.

The voice dialogue management unit 300 obtains the voice signal uttered by the user U through the network NW described below, performs voice recognition and intention understanding on that voice, and generates a response voice corresponding to the intention of the user U. The generated response voice is output to the network NW. The voice dialogue management unit 300 is built into, for example, a server device in a data center located away from the user U.

The network NW is communication equipment that transfers data between the voice input/output unit 200 and the voice dialogue management unit 300, for example wired or wireless digital communication equipment such as the Internet or a LAN (Local Area Network). The network NW may also be communication equipment that transmits voice in analog form over a telephone line and a modem.

The voice input/output unit 200 comprises a voice input unit 1 and a voice output unit 7. The voice dialogue management unit 300 comprises a voice recognition unit 2, an input acceptance determination unit 3, an intention understanding unit 4, a dialogue management unit 5, a voice generation unit 6, and a voice output information generation unit 8.

The voice input unit 1 uses a microphone (not shown) to acquire the voice uttered by the user U of the voice dialogue system 1000. The acquired analog voice waveform is sampled by an analog/digital converter at a sampling frequency of, for example, 16 kHz and converted into a digital voice data sequence. Acoustic analysis is then performed on the digital voice data sequence to convert it into feature parameters used in voice recognition, for example 20th-order MFCCs (Mel Frequency Cepstrum Coefficients). The resulting MFCC feature parameters are output to the network NW as input voice information D1.

The input voice information D1 is not limited to MFCC feature parameters. It may be any information on which the voice recognition unit 2, described below, can perform voice recognition processing; for example, it may be a digital voice data sequence representing the voice waveform, or the analog voice signal as is. In that case, the acoustic analysis in the voice input unit 1 can be omitted, reducing the amount of processing.

The voice recognition unit 2 receives the input voice information D1 obtained through the network NW, detects the utterance start timing and utterance completion timing of the user U by voice section detection processing, and cuts out only the utterance section of the user U. It performs voice recognition processing on the cut-out utterance to recognize what the user U said, and outputs text data representing the utterance content together with the utterance start timing and utterance completion timing as a voice recognition result D2.
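The disclosure does not specify how voice section detection is performed. As one common approach, the sketch below thresholds the short-time energy of each frame and reports the start and end of the active region. It assumes that D1 is a raw waveform (a list of floats at 16 kHz) rather than MFCCs; the function name, frame length, and threshold are all hypothetical.

```python
def detect_speech_section(samples, sample_rate_hz=16000, frame_ms=10, threshold=0.01):
    """Return (start_s, end_s) spanning the first to the last frame whose
    mean energy exceeds the threshold, or None if no frame is active.
    A simple energy-based stand-in, not the method of the disclosure."""
    frame_len = sample_rate_hz * frame_ms // 1000
    active = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy > threshold:
            active.append(i // frame_len)
    if not active:
        return None
    start_s = active[0] * frame_ms / 1000
    end_s = (active[-1] + 1) * frame_ms / 1000
    return start_s, end_s


if __name__ == "__main__":
    # 0.1 s of silence, 0.2 s of constant-amplitude "speech", 0.1 s of silence
    samples = [0.0] * 1600 + [0.5] * 3200 + [0.0] * 1600
    print(detect_speech_section(samples))
```

The two returned times correspond to the utterance start timing and utterance completion timing that accompany the text in D2.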

The utterance content in the voice recognition result D2 may consist only of text data representing a specific keyword contained in the utterance of the user U. It may also be numerical data such as an ID indicating a predetermined keyword.

The input acceptance determination unit 3 takes as input the voice recognition result D2 and the voice output information D8 described below, determines whether to accept the voice input uttered by the user U, and, if the input is accepted, outputs an accepted voice recognition result D3.
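The passage above states only the inputs and output of the input acceptance determination unit 3, not its decision rule. The sketch below shows one possible rule, assumed purely for illustration: a recognition result is accepted when no response output is known, or when the utterance began no earlier than a configurable margin before output completion. The class names, fields, and the `early_margin_s` parameter are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecognitionResult:
    """Stand-in for the voice recognition result D2."""
    text: str
    start_s: float
    end_s: float


@dataclass
class OutputInfo:
    """Stand-in for the voice output information D8."""
    output_start_s: float
    output_end_s: Optional[float]  # None while output is still in progress


def accept_input(d2: RecognitionResult, d8: Optional[OutputInfo],
                 early_margin_s: float = 0.0) -> Optional[RecognitionResult]:
    """Return d2 (playing the role of D3) if accepted, otherwise None."""
    if d8 is None:
        return d2    # no response voice has been output yet
    if d8.output_end_s is None:
        return None  # response voice still being output: reject
    if d2.start_s >= d8.output_end_s - early_margin_s:
        return d2    # utterance began at or near output completion
    return None


if __name__ == "__main__":
    d8 = OutputInfo(output_start_s=1.0, output_end_s=3.0)
    print(accept_input(RecognitionResult("yes", 3.2, 3.8), d8))
    print(accept_input(RecognitionResult("yes", 2.0, 2.5), d8))
```

Whatever rule is used, its accuracy depends on `output_end_s` reflecting when the user actually finished hearing the response, which is the point of reporting the output status from the input/output side.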

The intention understanding unit 4 takes the accepted voice recognition result D3 as input, estimates the intention of the input content, and outputs it as an intention understanding result D4. The intention understanding result D4 may be any information representing the utterance intention or operation content of the user U, such as text data or numerical data such as an ID indicating the content of the text.

The dialogue management unit 5 takes the intention understanding result D4 as input and outputs response content information D5 when a response to the user U is required.

The response content information D5 may be any information needed to generate a response sentence, such as the type and content of the response, and may take any format, such as text data or numerical data.

The voice generation unit 6 takes the response content information D5 as input, generates a response voice, and outputs it to the network NW as output voice D6. The output voice D6 is a data sequence representing a voice waveform.

The voice output unit 7 receives the output voice D6 obtained through the network NW and converts it into an analog voice signal using a digital/analog converter. The converted output voice D6 is output to the user U as the response voice from the voice dialogue system 1000 using a voice notification device such as a speaker (not shown).

The voice output unit 7 also outputs to the network NW a voice output status D7, which is information indicating the voice output start time or the voice output completion time of the output voice D6. The voice output status D7 may instead be the voice output start time of the output voice D6 and the elapsed time since the start of voice output.

The voice output information generation unit 8 takes as input the voice output status D7 obtained through the network NW, and generates and outputs voice output information D8, which indicates whether the voice output unit 7 is currently outputting voice. The voice output information D8 need only be able to express at least whether voice is being output, and is not limited to a time value itself. For example, D8 may be flag information (for example, 1 during voice output and 0 while output is stopped) that is output at a predetermined period (for example, 0.25 msec) and thereby indicates the timing at which voice output completes. Alternatively, it may be any signal from which the completion timing of voice output can be determined, such as a numerical value of the relative time from the start of voice output to its completion, text information representing a time, or a count of voice data frames since system startup.
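The periodic-flag form of D8 can be sketched as follows. This is a minimal illustration, not the implementation of the disclosure: it assumes that D7 supplies a start time and either a completion time or an elapsed time, that all times are integers in microseconds, and that the flag is 1 for ticks at or after the start and before the completion. The function names are hypothetical.

```python
def completion_time_us(start_us, elapsed_us):
    """Completion time when D7 carries a start time plus an elapsed time."""
    return start_us + elapsed_us


def output_flags(start_us, end_us, period_us, t0_us, t1_us):
    """One flag per tick in [t0_us, t1_us): 1 while voice output is in
    progress (start_us <= t < end_us), 0 otherwise."""
    flags = []
    t = t0_us
    while t < t1_us:
        flags.append(1 if start_us <= t < end_us else 0)
        t += period_us
    return flags


if __name__ == "__main__":
    # Output runs from 1000 us to 2000 us; flags are sampled every 250 us
    # (0.25 msec, the example period in the text) from 500 us to 2500 us.
    end_us = completion_time_us(start_us=1000, elapsed_us=1000)
    print(output_flags(1000, end_us, 250, 500, 2500))
    # → [0, 0, 1, 1, 1, 1, 0, 0]
```

The transition from 1 to 0 in the flag sequence marks the output completion timing that the input acceptance determination unit 3 relies on.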

<<1-2>> Hardware Configuration
Each component of the voice dialogue system 1000 shown in FIG. 1 can be implemented by a computer, that is, an information processing device with a built-in CPU (Central Processing Unit). Examples of such a computer include stationary computers such as personal computers and server computers, portable computers such as smartphones and tablet computers, microcomputers embedded in devices such as in-vehicle information systems (for example, car navigation systems), and SoCs (System on Chip).

Each component of the voice dialogue system 1000 shown in FIG. 1 may instead be implemented by an LSI (Large Scale Integrated circuit), that is, an electric circuit such as a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), or an FPGA (Field-Programmable Gate Array). Each component may also be a combination of a computer and an LSI.

FIG. 2 is a block diagram showing an example hardware configuration of the voice dialogue system 1000 built using an information processing device such as a computer.

In the example of FIG. 2, the voice input/output unit 200 of the voice dialogue system 1000 includes a memory 101A, a processor 102A containing a CPU 110A, a recording medium 103A, an acoustic interface 104 (labeled "acoustic I/F" in FIG. 2), and a signal path 108A such as a bus.

Also in the example of FIG. 2, the voice dialogue management unit 300 of the voice dialogue system 1000 includes a memory 101B, a processor 102B containing a CPU 110B, a recording medium 103B, a network interface 105B (labeled "network I/F" in FIG. 2), a text interface 106 (labeled "text I/F" in FIG. 2), a display interface 107 (labeled "display I/F" in FIG. 2), and a signal path 108B such as a bus.

The memory 101A and the memory 101B are storage devices such as ROM (Read Only Memory) and RAM (Random Access Memory), used as program memory that stores the various programs for realizing the voice dialogue processing of Embodiment 1, as work memory used by the processor during data processing, and as memory into which signal data is loaded.

More specifically, the memory 101A can store the programs of the voice input unit 1 and the voice output unit 7. It can also store intermediate data such as the input voice information D1, the output voice D6, and the voice output status D7.

More specifically, the memory 101B can store the programs of the voice recognition unit 2, the input acceptance determination unit 3, the intention understanding unit 4, the dialogue management unit 5, the voice generation unit 6, and the voice output information generation unit 8. It can also store intermediate data such as the input voice information D1, the voice recognition result D2, the accepted voice recognition result D3, the intention understanding result D4, the response content information D5, the output voice D6, the voice output status D7, and the voice output information D8.

The processor 102A uses the CPU 110A and, as working memory, the RAM in the memory 101A, and operates according to a computer program (that is, a voice dialogue program) read from the ROM in the memory 101A.

More specifically, the processor 102A reads the programs corresponding to the processes of the voice input unit 1 and the voice output unit 7 from the memory 101A and executes them on the CPU 110A, thereby performing the voice input/output processing of the voice dialogue processing described in Embodiment 1.

The processor 102B uses the CPU 110B and, as working memory, the RAM in the memory 101B, and operates according to a computer program (that is, a voice dialogue program) read from the ROM in the memory 101B.

More specifically, the processor 102B reads the programs corresponding to the processes of the voice recognition unit 2, the input acceptance determination unit 3, the intention understanding unit 4, the dialogue management unit 5, the voice generation unit 6, and the voice output information generation unit 8 from the memory 101B and executes them on the CPU 110B, thereby performing the voice dialogue management processing of the voice dialogue processing described in Embodiment 1.

The recording medium 103A is used to store various data such as setting data and signal data for the processor 102A. The recording medium 103A may be, for example, a volatile memory such as SDRAM (Synchronous DRAM) or a nonvolatile memory such as an HDD (Hard Disk Drive) or SSD (Solid State Drive). It can store, for example, a startup program including an OS (Operating System), the program of the voice dialogue system, initial state and setting data, constant data for control, acoustic signal data, and logs of error information. The various data in the memory 101A can also be stored on the recording medium 103A.

The recording medium 103B is used to store various data such as setting data and signal data for the processor 102B. The recording medium 103B may be, for example, a volatile memory such as SDRAM or a nonvolatile memory such as an HDD or SSD. It can store, for example, a startup program including an OS, the program of the voice dialogue system, initial state and setting data, constant data for control, acoustic signal data, and logs of error information. The various data in the memory 101B can also be stored on the recording medium 103B.

The acoustic interface 104 consists of a microphone that acquires the voice signal uttered by the user U and a speaker that presents the output voice D6 to the user U.

Instead of acquiring the voice uttered by the user U with a microphone, stream data acquired from another device may be input using a network interface 105A, described below. Recorded voice data stored in an external device may also be selected and read in through the network interface 105A. Likewise, instead of presenting the output voice D6 to the user U through a speaker, it may be sent as data to another device using the network interface 105A. In a system that inputs and outputs voice via wired or wireless communication rather than a microphone and speaker, the acoustic interface 104 can be omitted.

The network interfaces 105A and 105B are communication interfaces that transmit and receive external data over wired or wireless communication, for example when the input voice information D1, the output voice D6, and the voice output status D7 are referenced from data on the network or are input and output as stream data.

テキストインタフェース１０６は、応答音声内容等を人の手によって文字入力するための入力機器であり、キーボード、タッチパネル、マウスなどの入力装置で構成される。なお、人による入力を必要としないシステムであれば、テキストインタフェース１０６は省略することが可能である。 The text interface 106 is an input device for manually inputting text such as response voice content, and is composed of input devices such as a keyboard, a touch panel, and a mouse. Note that if the system does not require human input, the text interface 106 can be omitted.

表示インタフェース１０７は、入力音声の音声認識結果、応答音声の出力内容等の表示機器であり、ディスプレイ等の表示装置で構成される。なお、表示装置での表示を必要としないシステムであれば、表示インタフェース１０７は省略することが可能である。 The display interface 107 is a display device for displaying the voice recognition results of the input voice, the output contents of the response voice, etc., and is composed of a display device such as a display. Note that if the system does not require display on a display device, the display interface 107 can be omitted.

以上のように、図２に示される、音声入力部１、音声認識部２、入力受付判定部３、意図理解部４、対話管理部５、音声生成部６、音声出力部７、音声出力情報生成部８の各機能は、メモリ１０１Ａ、メモリ１０１Ｂ、プロセッサ１０２Ａ、プロセッサ１０２Ｂ、記録媒体１０３Ａ、及び記録媒体１０３Ｂで実現することができる。 As described above, the functions of the voice input unit 1, speech recognition unit 2, input acceptance determination unit 3, intention understanding unit 4, dialogue management unit 5, voice generation unit 6, voice output unit 7, and audio output information generation unit 8 shown in FIG. 2 can be realized by the memory 101A, the memory 101B, the processor 102A, the processor 102B, the recording medium 103A, and the recording medium 103B.

なお、音声対話システム１０００を実行するプログラムは、ソフトウエアプログラムを実行するコンピュータ内部の記憶装置に記憶していてもよいし、ＣＤ－ＲＯＭあるいはフラッシュメモリ等のコンピュータで読み取り可能な外部記憶媒体にて配布される形式で保持され、コンピュータ起動時に読み込んで動作させてもよい。また、ＬＡＮ（ＬｏｃａｌＡｒｅａＮｅｔｗｏｒｋ）等の無線または有線ネットワークを通じて他のコンピュータからプログラムを取得することも可能である。 Note that the program for executing the voice dialogue system 1000 may be stored in a storage device inside the computer that executes the software program, or may be stored in a computer-readable external storage medium such as a CD-ROM or flash memory. It may be maintained in a distributed format and loaded and activated when the computer starts. It is also possible to obtain programs from other computers via a wireless or wired network such as a LAN (Local Area Network).

また、音声対話システム１０００を実行するプログラムは、外部で実行されるプログラム、例えば、カーナビゲーションシステム、自動電話応答システムを実行するプログラムとソフトウェア上で結合し、同一のコンピュータで動作させることも可能であるし、又は、複数のコンピュータ上で分散処理することも可能である。
Furthermore, the program that executes the voice dialogue system 1000 can be combined with a program that is executed externally, for example, a program that executes a car navigation system or an automatic telephone answering system, and run on the same computer. Alternatively, it is possible to perform distributed processing on multiple computers.

《１－３》処理動作 <<1-3>> Processing Operation
続いて、実施の形態１の音声対話システムの処理動作について図３を用いて説明する。図３は、本実施の形態１を示す音声対話システム１０００の処理の流れを示すフローチャートである。なお、以下の各ステップにおける「部」を「工程」と読み替えてもよい。 Next, the processing operation of the voice dialogue system of Embodiment 1 will be described using FIG. 3. FIG. 3 is a flowchart showing the processing flow of the voice dialogue system 1000 according to Embodiment 1. Note that "unit" in each step below may be read as "process".

ステップＳＴ１で、音声入力部１は、ユーザＵが発話した入力音声を取得して音響分析が行われ、得られた特徴量パラメータＭＦＣＣを入力音声情報Ｄ１として音声認識部２へ出力する（ステップＳＴ１）。 In step ST1, the voice input unit 1 acquires the input voice uttered by the user U, performs acoustic analysis, and outputs the obtained feature parameter MFCC as input voice information D1 to the voice recognition unit 2 (step ST1 ).

ステップＳＴ２で、音声認識部２は、まず、入力音声の音声区間検出により、入力音声の発話開始タイミングならびに発話完了タイミングを検出し、入力音声の特徴量パラメータからユーザＵの発話音声のみを切り出す。続いて、切り出された発話音声に対して音声認識処理が行われることで、入力音声情報Ｄ１からユーザＵの発話内容を認識し、発話内容の認識結果と発話開始タイミングならびに発話完了タイミングとを音声認識結果Ｄ２として入力受付判定部３へ出力する（ステップＳＴ２）。ここで、音声認識は公知の音声認識技術を用いればよく、例えば、非特許文献１に記載されているように、ＨＭＭ（ＨｉｄｄｅｎＭａｒｋｏｖＭｏｄｅｌ；隠れマルコフモデル）法に基づく音声認識方法により、単語単位、あるいは文単位の音声認識を行えばよい。また、入力音声の音声区間検出方法として、音声の短時間パワーと所定の閾値との比較、あるいは、入力音声のケプストラム分析などの公知の手法を用いることができる。 In step ST2, the speech recognition unit 2 first detects the utterance start timing and utterance completion timing of the input speech by speech section detection, and extracts only the speech uttered by the user U from the feature parameters of the input speech. Next, speech recognition is performed on the extracted speech to recognize the content of the user U's utterance from the input voice information D1, and the recognition result of the utterance content, together with the utterance start timing and utterance completion timing, is output to the input acceptance determination unit 3 as the speech recognition result D2 (step ST2). Here, a known speech recognition technique may be used; for example, as described in Non-Patent Document 1, word-by-word or sentence-by-sentence speech recognition may be performed by a method based on the HMM (Hidden Markov Model). As a method for detecting speech sections of the input speech, a known method can be used, such as comparing the short-time power of the speech with a predetermined threshold or cepstral analysis of the input speech.
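The short-time power comparison mentioned above can be illustrated with a minimal sketch. It is not taken from the patent: the function name `detect_speech_sections`, the frame length of 160 samples, and the threshold value are all assumptions chosen for illustration, and a practical detector would typically add smoothing and hangover logic.

```python
from typing import List, Sequence, Tuple


def detect_speech_sections(
    samples: Sequence[float],
    frame_len: int = 160,            # assumed: 10 ms at 16 kHz
    power_threshold: float = 0.01,   # assumed threshold, tuned per system
) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame) pairs, end exclusive, for runs of
    frames whose mean power exceeds the threshold."""
    sections: List[Tuple[int, int]] = []
    start = None
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        power = sum(x * x for x in frame) / frame_len
        if power > power_threshold:
            if start is None:
                start = i            # utterance start timing
        elif start is not None:
            sections.append((start, i))  # utterance completion timing
            start = None
    if start is not None:
        sections.append((start, n_frames))
    return sections


# Two silent frames, three loud frames, one silent frame.
signal = [0.0] * 320 + [0.5] * 480 + [0.0] * 160
sections = detect_speech_sections(signal)
```

The frame indices returned here play the role of the utterance start and completion timings that step ST2 attaches to the speech recognition result D2.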

古井貞熙著、「音声情報処理」、第１版、森北出版株式会社、１９９８年６月３０日発行、ｐ．９６－１０５ Sadahiro Furui, "Speech Information Processing", 1st edition, Morikita Publishing Co., Ltd., published June 30, 1998, pp. 96-105

ステップＳＴ３で、入力受付判定部３は、音声認識結果Ｄ２及び音声出力情報Ｄ８を入力し、ユーザＵの発話音声の入力を受け付けるか否かを判定する（ステップＳＴ３）。ここで、音声出力情報Ｄ８は、応答音声を出力中か否かを表す情報であり、例えば、応答音声を出力中か否かであることを示すフラグであり、例えば、フラグの値が１の場合、応答音声出力中とし、フラグの値が０であれば応答音声が出力されていない状態である。あるいは、応答音声出力開始時刻からの出力経過時間であってもよく、経過時間が０でなければ、応答音声出力中であると判断することができる。なお、応答音声が出力完了した場合、出力結果時間は０にリセットされる。 In step ST3, the input acceptance determination unit 3 receives the speech recognition result D2 and the audio output information D8, and determines whether to accept the input of the speech uttered by the user U (step ST3). Here, the audio output information D8 is information indicating whether or not a response voice is being output. For example, it is a flag: if the flag value is 1, the response voice is being output, and if the flag value is 0, the response voice is not being output. Alternatively, it may be the elapsed output time since the response voice output start time; if the elapsed time is not 0, it can be determined that the response voice is being output. Note that when output of the response voice is completed, the elapsed output time is reset to 0.
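As a rough illustration of the two representations of the audio output information D8 described above (a 0/1 flag, or an elapsed output time that is reset to 0 on completion), the following sketch shows how a receiver might interpret either one. The class and field names are hypothetical and do not appear in the patent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AudioOutputInfo:
    """Hypothetical container for D8; exactly one field is expected to be set."""
    output_flag: Optional[int] = None     # 1: response voice playing, 0: not playing
    elapsed_sec: Optional[float] = None   # time since output start; 0 when idle


def is_response_playing(info: AudioOutputInfo) -> bool:
    """Interpret D8 as 'the response voice is currently being output'."""
    if info.output_flag is not None:
        return info.output_flag == 1
    if info.elapsed_sec is not None:
        return info.elapsed_sec != 0.0
    raise ValueError("audio output information carries neither flag nor elapsed time")


playing_by_flag = is_response_playing(AudioOutputInfo(output_flag=1))
idle_by_elapsed = is_response_playing(AudioOutputInfo(elapsed_sec=0.0))
```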

図４に、ステップＳＴ３の入力受付判定部３における具体的な動作の一例を示す。以下、音声対話システム１０００がユーザＵへ出力する応答音声を“システム発話”と略し、ユーザＵが音声対話システム１０００へ入力する発話音声を“ユーザ発話”と略する。この一例では、システム発話の開始及び完了のタイミングを音声出力情報Ｄ８として入力される。また、この一例では、システム発話開始から発話完了までの区間におけるユーザ発話の入力を受け付けないように動作する。 FIG. 4 shows an example of a specific operation in the input reception determining section 3 in step ST3. Hereinafter, the response voice that the voice dialogue system 1000 outputs to the user U will be abbreviated as "system utterance", and the voice that the user U inputs to the voice dialogue system 1000 will be abbreviated as "user utterance". In this example, the timing of the start and completion of system utterances is input as audio output information D8. Furthermore, in this example, the system operates so as not to accept user utterance input in the period from the start of the system utterance to the completion of the utterance.

本発明の実施の形態１の効果を具体的に比較可能とするため、（ａ）に音声生成部６が出力するシステム発話に基づく動作の一例、（ｂ）に本発明の実施の形態１による動作の一例をそれぞれ示す。なお、音声生成部６が出力するシステム発話の音声を、上段（Ａ）の音声内容として図示し、音声出力部７がユーザＵへ出力するシステム発話の音声を、下段（Ｂ）の音声内容として図示する。また、”ユーザ発話”はユーザＵが発話した音声内容、”発話状況”はシステム発話の出力状況、”受理結果”は入力受付判定部３での入力音声の受け付け結果をそれぞれ表す。横軸は音声対話管理部３００における時間である。 To allow the effect of Embodiment 1 of the present invention to be compared concretely, (a) shows an example of operation based on the system utterance output by the voice generation unit 6, and (b) shows an example of operation according to Embodiment 1 of the present invention. The audio of the system utterance output by the voice generation unit 6 is shown as the audio content in the upper row (A), and the audio of the system utterance output by the voice output unit 7 to the user U is shown as the audio content in the lower row (B). "User utterance" represents the content of the speech uttered by the user U, "utterance status" represents the output status of the system utterance, and "acceptance result" represents the result of input acceptance by the input acceptance determination unit 3. The horizontal axis is time in the voice dialogue management unit 300.

また、図４に示す動作の一例では、音声出力部７がユーザＵへ出力するシステム発話（（Ｂ）の音声内容）の発話開始時刻と発話完了時刻は、音声生成部６が出力する応答音声のデータがネットワークＮＷの伝送遅延等の影響を受けるため、音声生成部６が出力するシステム発話（（Ａ）の音声内容）と異なるタイミングとなる。具体的には、時間軸上に示す”ＳＴ（Ａ）”が、音声生成部６の音声データから得られるシステム発話の開始時刻、同じく”ＥＮ（Ａ）”が音声生成部６の音声データから得られるシステム発話の完了時刻である。また、時間軸上に示す”ＳＴ（Ｂ）”は、音声出力部７がユーザＵに出力する応答音声であるシステム発話の開始時刻、すなわち、本発明の実施の形態１における発話開始時刻、同じく”ＥＮ（Ｂ）”は、音声出力部７がユーザＵに出力する応答音声であるシステム発話の完了時刻、すなわち、本発明の実施の形態１における発話完了時刻である。 In the example of operation shown in FIG. 4, the utterance start time and utterance completion time of the system utterance output by the voice output unit 7 to the user U (the audio content of (B)) differ from those of the system utterance output by the voice generation unit 6 (the audio content of (A)), because the response voice data output by the voice generation unit 6 is affected by the transmission delay of the network NW and the like. Specifically, "ST(A)" on the time axis is the start time of the system utterance obtained from the audio data of the voice generation unit 6, and "EN(A)" is the completion time of the system utterance obtained from the same audio data. "ST(B)" on the time axis is the start time of the system utterance that the voice output unit 7 outputs to the user U as the response voice, that is, the utterance start time in Embodiment 1 of the present invention, and "EN(B)" is the completion time of that system utterance, that is, the utterance completion time in Embodiment 1 of the present invention.

なお、ユーザＵが発話開始するタイミングは、音声出力部７が出力する応答音声の出力完了後、すなわち、ユーザＵに対し報知されたシステム発話（すなわち、（Ｂ）の音声内容）をユーザＵが聴取した後であるため、（ａ）の音声生成部６が出力するシステム発話に基づく動作の一例と（ｂ）の本発明の実施の形態１による動作の一例とは同じになる。 Note that the timing at which the user U starts speaking is after the output of the response voice output by the voice output unit 7 is completed, that is, when the user U hears the system utterance (i.e., the voice content in (B)) notified to the user U. Since this is after listening, the example of the operation based on the system utterance output by the voice generation unit 6 in (a) is the same as the example of the operation according to the first embodiment of the present invention in (b).

図４において、まず、音声対話システム１０００は、ユーザＵに対して音声入力を促すシステム発話である「ご用件をお話しください。」を出力する（［１］発話開始）。システム発話完了後（［１］発話完了）、ユーザＵが「宅配を、えーと、お願いします」と発話する。 In FIG. 4, the voice dialogue system 1000 first outputs "Please tell me your business.", a system utterance that prompts the user U for voice input ([1] utterance start). After the system utterance is completed ([1] utterance completion), the user U utters "Home delivery, um, please."

音声入力部１がユーザ発話を取得後、音声認識部２において、ユーザ発話が「宅配を、」と「えーと、」と「お願いします。」とに発話区間が分割されて入力された場合、音声認識部２はまず「宅配を、」という入力を受け付け、音声対話システム１０００はユーザＵの発話途中であるがユーザの発話意図を理解し、「住所をお話しください。」とシステム発話を開始する（［２］発話開始）。 After the voice input unit 1 acquires the user's utterance, when the voice recognition unit 2 inputs the user's utterance with the utterance section divided into "delivery," "um," and "please." The speech recognition unit 2 first receives the input "Home delivery," and the voice dialogue system 1000 understands the user's utterance intention even though the user U is in the middle of speaking, and starts the system utterance, "Please tell me your address." ([2] Start of speech).

「住所をお話しください。」のシステム発話中に、「えーと、」「お願いします。」というユーザ発話が入力された場合、（ａ）に示す動作の一例では、「えーと」のユーザ発話はシステム発話中（［２］発話開始の”ＳＴ（Ａ）”から［２］発話完了の”ＥＮ（Ａ）”の間）であると判断できるので、ユーザ発話「えーと、」の入力受付は棄却される。しかし、ユーザ発話「お願いします。」の語尾部分に関しては、システム発話完了時刻（”ＥＮ（Ａ）”印）よりも後に発話したものと見做される。このユーザ発話の語尾部分は、システム発話完了後のユーザ発話「東京都・・・」と共に誤って受け付けられてしまい、その結果、誤認識となってしまう。 When the user utterances "Um," and "please." are input during the system utterance "Please tell me your address.", in the example of operation shown in (a), the user utterance "Um," can be judged to fall within the system utterance (between "ST(A)" of [2] utterance start and "EN(A)" of [2] utterance completion), so input of the user utterance "Um," is rejected. However, the tail of the user utterance "please." is regarded as having been uttered after the system utterance completion time (marked "EN(A)"). This tail is erroneously accepted together with the user utterance "Tokyo..." made after the system utterance is completed, resulting in misrecognition.

一方、（ｂ）に示す本発明の動作の一例では、システム発話「住所をお話しください。」の開始及び完了のタイミングを含む音声出力情報Ｄ８の入力を受けることで、ユーザ発話「えーと、」「お願いします。」は、システム発話開始時刻（［２］発話開始の”ＳＴ（Ｂ）”）から発話完了時刻（［２］発話完了の”ＥＮ（Ｂ）”）までの区間の入力であることが分かるので、前のシステム発話「ご用件をお話ください。」に対する入力であると音声対話システム１０００は判断し、ユーザ発話「えーと、」「お願いします。」の入力受付を棄却する。そして、システム発話完了後に入力された「東京都・・・」というユーザ発話に対し、システム発話「住所をお話しください。」の入力を正しく受け付けることができ、その結果、正しく認識することができる。 On the other hand, in the example of operation of the present invention shown in (b), the audio output information D8 including the start and completion timing of the system utterance "Please tell me your address." is received, so it is known that the user utterances "Um," and "please." were input in the interval from the system utterance start time ("ST(B)" of [2] utterance start) to the utterance completion time ("EN(B)" of [2] utterance completion). The voice dialogue system 1000 therefore judges them to be input for the previous system utterance "Please tell me your business." and rejects them. The user utterance "Tokyo...", input after the system utterance is completed, is then correctly accepted as input for the system utterance "Please tell me your address." and, as a result, is correctly recognized.

つまり、本発明の実施の形態１に示すように、音声出力情報Ｄ８を用いることで、音声生成部６が生成したシステム発話の出力完了時刻と、音声出力部７がユーザＵに出力したシステム発話の出力完了時刻との時間差を吸収あるいは補正できるので、音声対話システム１０００は、ユーザＵに出力したシステム発話完了時刻（すなわち、音声入出力部７でのシステム発話出力が完了するタイミング）が正確に分かる。よって、システム発話完了直後にユーザが発話したとしても、そのユーザ発話を受け付けすることが可能である。この動作により、音声対話システム１０００がユーザＵの発話途中に意図を理解し、次の対話に進んでしまった場合にも、前の質問に対するユーザＵの発話による誤認識を精度良く防止する効果がある。 In other words, as shown in Embodiment 1 of the present invention, using the audio output information D8 makes it possible to absorb or correct the time difference between the output completion time of the system utterance generated by the voice generation unit 6 and the output completion time of the system utterance output to the user U by the voice output unit 7, so the voice dialogue system 1000 knows accurately the completion time of the system utterance output to the user U (that is, the timing at which system utterance output at the voice input/output unit 7 is completed). Therefore, even if the user speaks immediately after the system utterance is completed, that user utterance can be accepted. With this operation, even when the voice dialogue system 1000 understands the intention partway through the user U's utterance and proceeds to the next dialogue turn, misrecognition caused by the user U's utterance directed at the previous question is prevented with high accuracy.
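The difference between cases (a) and (b) of FIG. 4 can be reproduced with a small sketch. The patent does not state the exact comparison rule, so the rule used here (reject a user utterance whose completion time falls inside the system utterance interval) is an assumption chosen so that the outcome matches the described example; the time values are likewise invented for illustration.

```python
def accept_utterance(utt_end: float, sys_start: float, sys_end: float) -> bool:
    """Assumed rule: reject a user utterance that ends while the system
    utterance interval [sys_start, sys_end] is still in progress."""
    return not (sys_start <= utt_end <= sys_end)


# Hypothetical timings in seconds: (text, utterance start, utterance end).
user_utterances = [("um", 2.5, 2.9), ("please", 3.0, 3.8), ("Tokyo...", 4.2, 5.5)]

ST_A, EN_A = 2.0, 3.5   # timing taken from the voice generation unit (case (a))
ST_B, EN_B = 2.3, 4.0   # timing reported by the voice output unit via D8 (case (b))

accepted_a = [t for t, _, end in user_utterances if accept_utterance(end, ST_A, EN_A)]
accepted_b = [t for t, _, end in user_utterances if accept_utterance(end, ST_B, EN_B)]
```

With the generation-side timing, the tail of "please" slips past EN(A) and is accepted along with "Tokyo...". With the output-side timing carried by D8, only "Tokyo..." is accepted.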

なお、上記したステップＳＴ３の動作の一例では、システム発話の開始時刻から完了時刻までの区間のユーザ発話を受け付けないように動作しているが、これに限られるものではない。例えば、システム発話完了後から所定の時間内はユーザ発話を受け付けないようにしても良く、システム発話開始時刻とシステム発話完了時刻から発話時間長を算出し、発話時間長のうち所定の割合時間が経過するまで、ユーザ発話を受け付けないようにしても良い。 Note that in the example of operation in step ST3 described above, user utterances in the interval from the start time to the completion time of the system utterance are not accepted, but the operation is not limited to this. For example, user utterances may be rejected for a predetermined time after the system utterance is completed. Alternatively, the utterance duration may be calculated from the system utterance start time and completion time, and user utterances may be rejected until a predetermined proportion of that duration has elapsed.
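The two variants described above differ only in where the rejection window ends. A minimal sketch, with hypothetical function names and parameter values (the hold-off time and ratio are not specified in the patent):

```python
def blocked_until_with_holdoff(sys_end: float, holdoff_sec: float) -> float:
    """Variant 1: keep rejecting for a fixed time after the system utterance ends."""
    return sys_end + holdoff_sec


def blocked_until_with_ratio(sys_start: float, sys_end: float, ratio: float) -> float:
    """Variant 2: reject only until a given proportion of the utterance has played."""
    return sys_start + (sys_end - sys_start) * ratio


def accept(utt_start: float, sys_start: float, blocked_until: float) -> bool:
    """Accept a user utterance unless it starts inside the rejection window."""
    return not (sys_start <= utt_start < blocked_until)


# System utterance from t=2.0 s to t=4.0 s.
holdoff_limit = blocked_until_with_holdoff(4.0, 0.5)
ratio_limit = blocked_until_with_ratio(2.0, 4.0, 0.5)
```

The ratio variant would suit prompts whose essential content comes early, since it allows barge-in during the latter part of the system utterance.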

また、図４において、入力受付判定時にシステム発話開始を利用する動作の一例を示したが、ネットワークＮＷの伝送遅延、音声認識の処理遅延が少なく、音声認識が完了した時点がシステム発話開始時刻と見なせる場合には、音声出力状況Ｄ７及び音声出力情報Ｄ８にシステム発話開始時刻に関する情報が無くても良い、すなわち、応答音声出力開始時刻に関する情報が含まれなくても良い。 In addition, although FIG. 4 shows an example of an operation that uses the system utterance start when determining input acceptance, the transmission delay of the network NW and the processing delay of voice recognition are small, and the system utterance start time is the time when voice recognition is completed. In this case, the audio output status D7 and the audio output information D8 may not include information regarding the system utterance start time, that is, the information regarding the response audio output start time may not be included.

ステップＳＴ４で、意図理解部４は、音声認識結果Ｄ２を入力とし、音声対話システム１０００に対するユーザＵの発話意図・操作内容を推定し、意図理解結果Ｄ４を出力する（ステップＳＴ４）。なお、意図理解部４における意図理解処理は公知の意図理解方法を用いれば良く、例えば、複数の意図のそれぞれを示す複数の意図情報毎に、ユーザ発話に基づいて入力された音声信号の意図情報に対する適合度を示すスコアを算出し、算出されたスコアに基づいて、複数の意図情報の中から、ユーザ発話の意図を示す意図情報を選択する意図理解方法を用いることができる。 In step ST4, the intention understanding unit 4 inputs the speech recognition result D2, estimates the user U's utterance intention/operation content with respect to the voice dialogue system 1000, and outputs the intention understanding result D4 (step ST4). Note that the intention understanding processing in the intention understanding unit 4 may be performed using a known intention understanding method. For example, for each of a plurality of intention information indicating each of a plurality of intentions, intention information of an audio signal input based on a user's utterance is An intention understanding method can be used in which a score indicating the degree of suitability for a user's utterance is calculated, and intention information indicating the intention of the user's utterance is selected from a plurality of pieces of intention information based on the calculated score.
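The score-based selection described above can be sketched as follows. The scoring function here is a deliberately simple keyword overlap, which is an assumption for illustration only; the patent leaves the scoring method to known techniques, and the intent names and keyword sets are hypothetical.

```python
from typing import Dict, Set, Tuple

# Hypothetical intent inventory with illustrative keywords.
INTENT_KEYWORDS: Dict[str, Set[str]] = {
    "request_delivery": {"delivery", "parcel"},
    "give_address": {"tokyo", "street", "address"},
}


def score(text: str, keywords: Set[str]) -> float:
    """Fraction of the intent's keywords that appear in the recognized text."""
    words = set(text.lower().split())
    return len(words & keywords) / len(keywords)


def select_intent(text: str) -> Tuple[str, float]:
    """Score every intent and return the best one with its score."""
    scores = {name: score(text, kws) for name, kws in INTENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


intent, intent_score = select_intent("home delivery please")
```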

ステップＳＴ５で、対話管理部５は、ユーザ発話の意図理解結果に基づき応答内容を決定し、応答内容情報Ｄ５として出力する（ステップＳＴ５）。ここで、対話管理部５における対話管理処理は公知の対話管理方法を用いれば良く、例えば、予め定められた対話状態に対応する応答テンプレートの中から、ユーザとの対話状態に対応する応答テンプレートを選択し、選択した応答テンプレートに含まれる用語シンボルを出力する対話管理方法を用いることができる。 In step ST5, the dialogue management unit 5 determines the response content based on the result of understanding the intention of the user's utterance, and outputs it as response content information D5 (step ST5). Here, the dialogue management process in the dialogue management section 5 may use a known dialogue management method. For example, a response template corresponding to the dialogue state with the user is selected from response templates corresponding to predetermined dialogue states. A dialog management method may be used that selects and outputs terminology symbols included in the selected response template.
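A template-based dialogue manager of the kind described above might look like the sketch below. The state names, the transition table, and the intent labels are hypothetical; only the two prompt texts come from the example in FIG. 4.

```python
from typing import Dict, Tuple

# Response templates keyed by dialogue state.
RESPONSE_TEMPLATES: Dict[str, str] = {
    "ask_request": "Please tell me your business.",
    "ask_address": "Please tell me your address.",
}

# Assumed transition table: (current state, understood intent) -> next state.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("ask_request", "request_delivery"): "ask_address",
}


def next_response(state: str, intent: str) -> Tuple[str, str]:
    """Advance the dialogue state and return the template for the new state.
    Unknown (state, intent) pairs keep the current state, i.e. re-prompt."""
    new_state = TRANSITIONS.get((state, intent), state)
    return new_state, RESPONSE_TEMPLATES[new_state]


state, response = next_response("ask_request", "request_delivery")
```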

ステップＳＴ６で、音声生成部６は、応答内容情報Ｄ５に応じてユーザＵに提示する応答音声を生成し、出力音声Ｄ６として出力する（ステップＳＴ６）。応答内容情報Ｄ５が、発話内容を示すテキストである場合、音声生成部６は公知の音声合成方法を用いれば良く、例えば、ＰＳＯＬＡ（ＰｉｔｃｈＳｙｎｃｈｒｏｎｏｕｓＯｖｅｒｌａｐａｎｄＡｄｄ；ピッチ同期重畳加算）方式に基づくテキスト音声合成方法、あるいは、非特許文献２に記載されているような、波形編集型テキスト音声合成方法を用いれば良い。また、応答内容情報Ｄ５が予め用意された音声データに紐づくＩＤであった場合、音声生成部６が内蔵する記憶装置（図示せず）から、ＩＤに対応する音声データを読み込んで出力音声Ｄ６として出力することもできる。 In step ST6, the voice generation unit 6 generates a response voice to be presented to the user U according to the response content information D5, and outputs it as the output audio D6 (step ST6). When the response content information D5 is text indicating the utterance content, the voice generation unit 6 may use a known speech synthesis method, for example, a text-to-speech synthesis method based on the PSOLA (Pitch Synchronous Overlap and Add) method, or a waveform-editing text-to-speech synthesis method as described in Non-Patent Document 2. Further, when the response content information D5 is an ID linked to voice data prepared in advance, the voice generation unit 6 can read the voice data corresponding to the ID from a built-in storage device (not shown) and output it as the output audio D6.

古井貞熙著、「音声情報処理」、第１版、森北出版株式会社、１９９８年６月３０日、ｐ．７３－７８ Sadahiro Furui, "Speech Information Processing", 1st edition, Morikita Publishing Co., Ltd., June 30, 1998, pp. 73-78

ステップＳＴ７で、音声出力部７は、生成した出力音声Ｄ６をシステム発話としてユーザＵへ報知する（ステップＳＴ７）。また、音声出力部７は応答音声の音声データの送出が完了した時点で、システム発話である応答音声の音声出力完了時刻を示す情報である音声出力状況Ｄ７を、ネットワークＮＷを通じて音声出力情報生成部８に出力する（ステップＳＴ７）。 In step ST7, the voice output unit 7 notifies the user U of the generated output audio D6 as a system utterance (step ST7). In addition, when transmission of the audio data of the response voice is completed, the voice output unit 7 outputs the audio output status D7, which is information indicating the audio output completion time of the response voice (the system utterance), to the audio output information generation unit 8 through the network NW (step ST7).

ここで、音声出力状況Ｄ７として音声出力完了時刻を示す情報を送出するタイミングは、例えば、スピーカ出力時の音声出力用バッファ、あるいはネットワークＮＷへのデータ送信時の音声送信用バッファにすべての音声データを書き込み終わった時点であれば良い。また、音声出力状況Ｄ７として音声出力開始時刻を示す情報を送出するタイミングは、スピーカ出力時の音声出力用バッファ、あるいはネットワークＮＷへのデータ送信時の音声送信用バッファに音声データを書き込み始めた時点であれば良い。 Here, the information indicating the audio output completion time may be sent as the audio output status D7 at, for example, the point when all of the audio data has been written to the audio output buffer for speaker output or to the audio transmission buffer for data transmission to the network NW. Likewise, the information indicating the audio output start time may be sent as the audio output status D7 at the point when writing of audio data to the audio output buffer for speaker output or to the audio transmission buffer for data transmission to the network NW begins.
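The timing described above, where the status is reported when buffer writing begins and when the last data has been written, can be sketched as below. The function `play_response`, the event names, and the callback-based design are assumptions for illustration; the buffer and the transport for D7 are abstracted as plain callables.

```python
import time
from typing import Callable, Dict, Iterable


def play_response(
    chunks: Iterable[bytes],
    write_chunk: Callable[[bytes], None],   # writes into the output or transmission buffer
    notify: Callable[[Dict], None],         # sends audio output status D7
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Write response audio chunk by chunk, reporting start and completion."""
    started = False
    for chunk in chunks:
        if not started:
            # Reported just before the first write to the buffer.
            notify({"event": "output_start", "time": clock()})
            started = True
        write_chunk(chunk)
    if started:
        # Reported once every chunk has been written.
        notify({"event": "output_complete", "time": clock()})


# Usage with in-memory stand-ins for the buffer, the D7 channel, and the clock.
written, events = [], []
ticks = iter([1.0, 2.0])
play_response([b"a", b"b"], written.append, events.append, clock=lambda: next(ticks))
```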

ステップＳＴ８で、音声出力情報生成部８は、入力された音声出力状況Ｄ７から音声出力情報Ｄ８を生成し、入力受付判定部３へ出力する（ステップＳＴ８）。 In step ST8, the audio output information generation unit 8 generates audio output information D8 from the input audio output situation D7, and outputs it to the input acceptance determination unit 3 (step ST8).

ここで、ステップＳＴ８での動作の一例として、音声出力開始時刻を示す信号、あるいは、音声出力完了時刻を示す信号を音声出力状況Ｄ７として受け取り次第、音声出力情報Ｄ８としてそのまま出力すればよく、音声出力部７が出力する音声出力状況Ｄ７を音声出力情報Ｄ８としても良い。また、音声出力部７が複数存在するようにシステムが構成されている場合には、音声出力部７のそれぞれの音声出力状況が区別できるようにすれば良く、例えば、音声出力部７のＩＤ等を付与した音声出力情報Ｄ８を生成するようにすればよい。 Here, as an example of the operation in step ST8, upon receiving a signal indicating the audio output start time or a signal indicating the audio output completion time as the audio output status D7, the unit may output it as-is as the audio output information D8; that is, the audio output status D7 output by the voice output unit 7 may be used as the audio output information D8. If the system is configured with a plurality of voice output units 7, it suffices to make the audio output status of each voice output unit 7 distinguishable, for example by generating audio output information D8 to which the ID or the like of the voice output unit 7 is attached.

この実施の形態１では、ステップＳＴ２の音声認識部２での処理後に、ステップＳＴ３の入力受付判定部３での処理を行うように構成したが、ステップＳＴ４の意図理解部４での処理の後に、ステップＳＴ３の入力受付判定部３での処理を実行するように構成しても良い。この場合には、すべての音声認識結果Ｄ２に対して意図理解部４における意図理解処理を実行するが、入力受付判定部３では、意図理解内容を踏まえた上で入力受付判定処理を実行することができるので、入力受付判定処理の精度を高めることが可能となる。 In the first embodiment, after the processing in the speech recognition section 2 in step ST2, the processing in the input acceptance determination section 3 in step ST3 is performed, but after the processing in the intention understanding section 4 in step ST4, , it may be configured to execute the process in the input reception determining unit 3 in step ST3. In this case, the intention understanding unit 4 executes the intention understanding process for all the speech recognition results D2, but the input acceptance judgment unit 3 executes the input acceptance judgment process based on the intention understanding contents. Therefore, it is possible to improve the accuracy of the input acceptance determination process.

また、ステップＳＴ４の意図理解部４で得られた意図理解結果Ｄ４が、音声対話システム１０００との対話内容に応じた内容であれば、音声出力情報Ｄ８に応じた入力受付判定を行い、対話内容とは関係のない意図理解結果Ｄ４であれば、音声出力情報Ｄ８に影響されず常時入力を受け付けるように動作させても良い。 Further, if the intention understanding result D4 obtained by the intention understanding unit 4 in step ST4 is content that corresponds to the content of the dialogue with the voice dialogue system 1000, an input acceptance determination is performed according to the audio output information D8, and the content of the dialogue is determined based on the audio output information D8. If the intention understanding result D4 is unrelated to the above, it may be operated so as to always accept input without being influenced by the audio output information D8.
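The intent-dependent behaviour described above reduces to a small decision function. This is a sketch under the assumption that the intention understanding result can be classified as related or unrelated to the current dialogue; how that classification is made is not specified in the patent, and the function name is hypothetical.

```python
def accept_with_intent(intent_related_to_dialogue: bool, system_is_speaking: bool) -> bool:
    """Gate dialogue-related input on the audio output information D8;
    always accept input whose intent is unrelated to the dialogue."""
    if not intent_related_to_dialogue:
        return True                   # e.g. an unrelated command: accept at any time
    return not system_is_speaking     # dialogue answer: only after the prompt has ended


unrelated_during_prompt = accept_with_intent(False, True)
answer_during_prompt = accept_with_intent(True, True)
```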

以上のように、この実施の形態１では、音声出力情報生成部が、システム発話を出力中か否かを示す情報である音声出力情報を生成し、入力受付判定部は、受け取った音声出力情報に基づいてシステム発話の出力完了時刻を補正し、ユーザ発話を受け付けるか否かを判定するように構成したので、ユーザＵが最後まで発話内容を聞く必要がある、システム発話に対する音声入力について、入力受付判定部がシステム発話完了のタイミングを正確に把握することが可能となる。 As described above, in Embodiment 1, the audio output information generation unit generates audio output information, which is information indicating whether or not a system utterance is being output, and the input acceptance determination unit corrects the output completion time of the system utterance based on the received audio output information and determines whether or not to accept a user utterance. Therefore, for voice input in response to a system utterance that the user U needs to hear to the end, the input acceptance determination unit can accurately grasp the timing of system utterance completion.

すなわち、この実施の形態１の構成を為すことにより、入力受付判定部は、ユーザが実際に聞いた応答音声と、音声生成部が生成した応答音声との時間差がある場合であってもその影響を吸収し、システム発話完了のタイミングを正確に把握することが可能となる。言い換えれば、音声対話管理部と音声入出力部が別の独立した構成で、応答音声の伝送遅延がある音声対話システムにおいても、音声対話システムは応答音声の出力完了時刻を正確に検出することができる。その結果、音声認識のバージインの受付判定精度を改善することが可能となり、音声対話システムのユーザビリティが向上する効果を有する。 That is, with the configuration of Embodiment 1, even when there is a time difference between the response voice actually heard by the user and the response voice generated by the voice generation unit, the input acceptance determination unit can absorb its influence and accurately grasp the timing of system utterance completion. In other words, even in a voice dialogue system in which the voice dialogue management unit and the voice input/output unit are separate, independent components and the response voice is subject to transmission delay, the voice dialogue system can accurately detect the output completion time of the response voice. As a result, the accuracy of barge-in acceptance determination in speech recognition can be improved, which improves the usability of the voice dialogue system.

また、音声出力情報生成部が、システム発話完了のタイミングの情報を音声出力情報として出力するように構成したので、入力受付判定部で応答音声を受信する必要は無くなり、入力受付判定部にて改めて応答音声を分析して発話時間を算出する場合と比べ、応答音声データ分析のための処理量が削減できるという効果がある。 In addition, since the voice output information generation unit is configured to output information on the timing of system utterance completion as voice output information, there is no need for the input reception determination unit to receive the response voice, and the input reception determination unit Compared to calculating the speaking time by analyzing the response voice, this method has the effect of reducing the amount of processing required to analyze the response voice data.

更に、ネットワークＮＷの通信において伝送遅延が生じ、入力受付判定部で応答音声の受信に遅延が生じた場合、改めて応答音声を分析する場合と比べ、正確なシステム発話完了のタイミングが得られるために入力受付の判定精度が維持できる効果がある。 Furthermore, if there is a transmission delay in communication on the network NW and there is a delay in receiving the response voice at the input acceptance determination unit, it is possible to obtain a more accurate system utterance completion timing than when the response voice is analyzed again. This has the effect of maintaining the judgment accuracy of input reception.

また、入力受付判定部が応答音声の音声データを受信する必要が無いので、音声出力部における応答音声の音声データ送信も不要であり、そのための処理コスト及び装置コストを削減可能であるという効果がある上、応答音声の音声データの送受信が不要なことから、音声入出力部が出力する音声データと、音声対話管理部が受信する音声データとのサンプリング周波数が異なるなど、音声入出力設定に差異があっても影響されず、音声対話システムの設計自由度が増す効果も奏する。 In addition, since the input acceptance determination section does not need to receive the voice data of the response voice, there is no need for the voice output section to transmit the voice data of the response voice, which has the effect of reducing the processing cost and device cost. Moreover, since it is not necessary to send and receive audio data for response voices, there may be differences in the audio input/output settings, such as different sampling frequencies between the audio data output by the audio input/output unit and the audio data received by the audio dialogue management unit. This has the effect of increasing the degree of freedom in designing the voice dialogue system.

実施の形態２． Embodiment 2.
《２－１》構成 <<2-1>> Configuration
上記した実施の形態１では、音声入出力部２００と音声対話管理部３００との音声データ送受をネットワークＮＷを介して行っていたが、これに限ることは無い。例えば、音声入出力部２００と音声対話管理部３００は同一の装置内に配置されているが、音声入出力部２００と音声対話管理部３００とが独立した構成の場合、音声入出力部が出力する音声データと、音声対話管理部が受信する音声データの規格（例えば、サンプリング周波数）が異なることが多い。このような場合でも、音声入出力部２００と音声対話管理部３００とを直接接続することも可能である。これを実施の形態２として説明する。 In Embodiment 1 described above, voice data is transmitted and received between the voice input/output unit 200 and the voice dialogue management unit 300 via the network NW, but the configuration is not limited to this. For example, even when the voice input/output unit 200 and the voice dialogue management unit 300 are arranged in the same device, if they are configured independently of each other, the standard (for example, the sampling frequency) of the audio data output by the voice input/output unit often differs from that of the audio data received by the voice dialogue management unit. Even in such a case, the voice input/output unit 200 and the voice dialogue management unit 300 can be connected directly. This will be described as Embodiment 2.

実施の形態２における音声対話システムについて図５を用いて説明する。図５は実施の形態２を示す音声対話システムのブロック構成図である。図５中、図１と同一符号を付したものは同一または相当部分を示す。またそれらの構成は実施の形態１で示したのと同等であるので説明を省略する。 The voice dialogue system in Embodiment 2 will be explained using FIG. 5. FIG. 5 is a block diagram of a voice dialogue system according to a second embodiment. In FIG. 5, the same reference numerals as in FIG. 1 indicate the same or corresponding parts. Furthermore, since their configurations are the same as those shown in Embodiment 1, their explanations will be omitted.

音声入力部１は、マイクロフォン（図示せず）を用いて、音声対話システム１０００の利用者であるユーザＵが発話した音声を取得する。取得したアナログ音声波形は、例えば１６ｋＨｚのサンプリング周波数でサンプリングされ、デジタル音声データ列に変換される。続いて、変換されたデジタル音声データ列の音響分析が行われて、例えば、音声認識で使用される特徴量パラメータである２０次のＭＦＣＣに変換される。得られた特徴量パラメータＭＦＣＣを入力音声情報Ｄ１として音声対話管理部３００内の音声認識部２へ出力する。 The voice input unit 1 uses a microphone (not shown) to acquire the voice uttered by the user U, who is the user of the voice dialogue system 1000. The obtained analog audio waveform is sampled at a sampling frequency of 16 kHz, for example, and converted into a digital audio data string. Subsequently, the converted digital audio data string is subjected to acoustic analysis and converted into, for example, 20th-order MFCC, which is a feature parameter used in speech recognition. The obtained feature parameter MFCC is output to the speech recognition section 2 in the speech dialogue management section 300 as input speech information D1.

音声認識部２は、入力音声情報Ｄ１を入力し、例えば、ユーザＵの発話区間の切り出しと、切り出された発話音声の発話内容を音声認識し、発話内容を表すテキストデータと発話開始タイミングおよび発話完了タイミングとを音声認識結果Ｄ２として出力する。 The speech recognition unit 2 receives the input voice information D1, performs, for example, extraction of the user U's utterance section and speech recognition of the content of the extracted speech, and outputs text data representing the utterance content together with the utterance start timing and utterance completion timing as the speech recognition result D2.

入力受付判定部３は、音声認識結果Ｄ２、及び音声出力情報Ｄ８を入力として、ユーザＵが発話した音声の入力を受け付けるかを判定し、入力を受け付ける場合に受理した音声認識結果Ｄ３を出力する。 The input acceptance determination unit 3 receives the voice recognition result D2 and the voice output information D8 as input, determines whether to accept the input of the voice uttered by the user U, and outputs the accepted voice recognition result D3 if the input is accepted. .

意図理解部４は、受理した音声認識結果Ｄ３を入力とし、入力内容の意図を推定し意図理解結果Ｄ４として出力する。 The intention understanding unit 4 inputs the received speech recognition result D3, estimates the intention of the input content, and outputs it as an intention understanding result D4.

音声生成部６は、応答内容情報Ｄ５を入力とし、応答音声を生成し出力音声Ｄ６として音声入出力部２００内の音声出力部７へ出力する。 The voice generation section 6 inputs the response content information D5, generates a response voice, and outputs it to the voice output section 7 in the voice input/output section 200 as an output voice D6.

音声出力部７は、音声生成部６から得られた出力音声Ｄ６を入力し、スピーカ（図示せず）等の音声報知装置により音声対話システム１０００からの応答音声をユーザＵへ出力すると共に、音声出力状況Ｄ７を音声出力情報生成部８へ出力する。 The audio output unit 7 receives the output audio D6 obtained from the audio generation unit 6, outputs the response audio from the voice dialogue system 1000 to the user U through an audio notification device such as a speaker (not shown), and outputs the audio output status D7 to the audio output information generation unit 8.

音声出力情報生成部８は、音声出力部７から得られた音声出力状況Ｄ７を入力とし、音声出力部７が音声出力中か否かを示す情報である、音声出力情報Ｄ８を生成し出力する。 The audio output information generation unit 8 receives the audio output status D7 obtained from the audio output unit 7 as input, and generates and outputs audio output information D8, which is information indicating whether or not the audio output unit 7 is outputting audio.
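The mapping from the audio output status D7 to the audio output information D8 can be pictured with a minimal sketch. The `OutputStatus` record, its field names, and the use of timestamps in seconds are assumptions made for illustration; the source only states that D8 indicates whether audio is being output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OutputStatus:
    """Hypothetical stand-in for the audio output status D7."""
    started_at: Optional[float]    # output start time in seconds, None if not started
    completed_at: Optional[float]  # output completion time in seconds, None if unknown


def make_output_info(status: OutputStatus, now: float) -> bool:
    """Return True while the system utterance is being output (a stand-in for D8)."""
    if status.started_at is None or now < status.started_at:
        return False
    # Output is in progress until a completion time is reported and reached.
    return status.completed_at is None or now < status.completed_at
```

For example, `make_output_info(OutputStatus(1.0, None), 2.0)` reports that output is still in progress because no completion time has been reported yet.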

《２－２》ハードウェア構成 <<2-2>> Hardware Configuration
図５に示される音声対話システム１０００の各構成は、実施の形態１で示したのと同様に、ＣＰＵ内蔵の情報処理装置であるコンピュータで実現可能である。ＣＰＵ内蔵のコンピュータは、例えば、パーソナルコンピュータ、サーバ型コンピュータなどの据え置き型コンピュータ、スマートフォン、タブレット型コンピュータなどの可搬型コンピュータ、あるいは、カーナビゲーションシステムなどの車載情報システムの機器組み込み用途のマイクロコンピュータ、及びＳｏＣなどである。 Each component of the voice dialogue system 1000 shown in FIG. 5 can be realized, as in Embodiment 1, by a computer, which is an information processing device with a built-in CPU. Computers with a built-in CPU include, for example, stationary computers such as personal computers and server-type computers, portable computers such as smartphones and tablet computers, and microcomputers and SoCs for embedding in in-vehicle information systems such as car navigation systems.

また、図５に示される音声対話システム１０００の各構成は、ＤＳＰ、ＡＳＩＣ、又はＦＰＧＡなどの電気回路であるＬＳＩにより実現されてもよい。また、図５に示される音声対話システム１０００の各構成は、コンピュータとＬＳＩの組み合わせであってもよい。 Further, each configuration of the voice dialogue system 1000 shown in FIG. 5 may be realized by an LSI that is an electric circuit such as a DSP, an ASIC, or an FPGA. Moreover, each configuration of the voice dialogue system 1000 shown in FIG. 5 may be a combination of a computer and an LSI.

図６は、コンピュータ等の情報処理装置を用いて構成される音声対話システム１０００のハードウェア構成の例を示すブロック図である。図６中、図２と同一符号を付したものは同一または相当部分を示すものとし、またそれらの構成は実施の形態１で示したのと同等であるので説明を省略する。 FIG. 6 is a block diagram showing an example of a hardware configuration of a voice dialogue system 1000 configured using an information processing device such as a computer. In FIG. 6, the same reference numerals as those in FIG. 2 indicate the same or corresponding parts, and their configurations are the same as those shown in Embodiment 1, so the explanation will be omitted.

図６の例では、音声対話システム１０００は、メモリ１０１、ＣＰＵ１１０を内蔵するプロセッサ１０２、記録媒体１０３、音響インタフェース１０４（図６中では音響Ｉ／Ｆと記載）、ネットワークインタフェース１０５（図６中ではネットワークＩ／Ｆと記載）、テキストインタフェース１０６（図６中ではテキストＩ／Ｆと記載）、表示インタフェース１０７（図６中では表示Ｉ／Ｆと記載）、及びバスなどの信号路１０８を備えている。 In the example of FIG. 6, the voice dialogue system 1000 includes a memory 101, a processor 102 containing a CPU 110, a recording medium 103, an acoustic interface 104 (labeled "acoustic I/F" in FIG. 6), a network interface 105 (labeled "network I/F" in FIG. 6), a text interface 106 (labeled "text I/F" in FIG. 6), a display interface 107 (labeled "display I/F" in FIG. 6), and a signal path 108 such as a bus.

メモリ１０１は、実施の形態２の音声対話処理を実現するための各種プログラムを記憶するプログラムメモリ、プロセッサがデータ処理を行う際に使用するワークメモリ、及び信号データを展開するメモリ等として使用するＲＯＭ及びＲＡＭ等の記憶装置である。 The memory 101 is a storage device such as a ROM and a RAM, used as a program memory that stores the various programs for realizing the voice dialogue processing of Embodiment 2, a work memory used when the processor performs data processing, a memory into which signal data is loaded, and the like.

メモリ１０１には、より具体的に言えば、音声入力部１、音声認識部２、入力受付判定部３、意図理解部４、対話管理部５、音声生成部６、音声出力部７、音声出力情報生成部８の各プログラムを記憶することができる。また、メモリ１０１には、入力音声情報Ｄ１、音声認識結果Ｄ２、受理した音声認識結果Ｄ３、意図理解結果Ｄ４、応答内容情報Ｄ５、出力音声Ｄ６、音声出力状況Ｄ７、音声出力情報Ｄ８などの中間データを記憶することができる。 More specifically, the memory 101 can store the programs of the voice input unit 1, the voice recognition unit 2, the input acceptance determination unit 3, the intention understanding unit 4, the dialogue management unit 5, the voice generation unit 6, the voice output unit 7, and the voice output information generation unit 8. The memory 101 can also store intermediate data such as the input speech information D1, the speech recognition result D2, the accepted speech recognition result D3, the intention understanding result D4, the response content information D5, the output speech D6, the speech output status D7, and the speech output information D8.

プロセッサ１０２は、ＣＰＵ１１０と、作業用メモリとしてメモリ１０１中のＲＡＭを使用し、メモリ１０１中のＲＯＭから読み出されたコンピュータ・プログラム（すなわち、音声対話プログラム）に従って動作する。 Processor 102 uses CPU 110 and RAM in memory 101 as a working memory, and operates according to a computer program (ie, a voice interaction program) read from ROM in memory 101.

プロセッサ１０２は、より具体的に言えば、音声入力部１、音声認識部２、入力受付判定部３、意図理解部４、対話管理部５、音声生成部６、音声出力部７、音声出力情報生成部８の各処理に対応するプログラムをメモリ１０１から読み出し、ＣＰＵ１１０で処理を行うことで、本実施の形態２に示す音声対話処理を実行することができる。 More specifically, the processor 102 reads from the memory 101 the programs corresponding to the processes of the voice input unit 1, the voice recognition unit 2, the input acceptance determination unit 3, the intention understanding unit 4, the dialogue management unit 5, the voice generation unit 6, the voice output unit 7, and the voice output information generation unit 8, and executes them on the CPU 110, thereby performing the voice dialogue processing shown in Embodiment 2.

記録媒体１０３は、プロセッサ１０２の各種設定データ及び信号データなどの各種データを蓄積するために使用される。記録媒体１０３としては、例えば、ＳＤＲＡＭなどの揮発性メモリ、ＨＤＤ又はＳＳＤ等の不揮発性メモリを使用することが可能である。記録媒体１０３には、例えば、ＯＳを含む起動プログラム及び、音声対話システムのプログラム、初期状態及び各種設定データ、制御用の定数データ、音響信号データ、エラー情報のログ等の各種データを蓄積することができる。なお、この記録媒体１０３に、メモリ１０１内の各種データを蓄積しておくこともできる。 The recording medium 103 is used to store various data such as the setting data and signal data of the processor 102. As the recording medium 103, it is possible to use, for example, a volatile memory such as SDRAM, or a nonvolatile memory such as an HDD or SSD. The recording medium 103 can store various data such as a startup program including the OS, the program of the voice dialogue system, initial state and setting data, constant data for control, acoustic signal data, and error information logs. The various data in the memory 101 can also be stored in this recording medium 103.

ユーザＵが発話した音声をマイクロフォンで取得する代わりに、後述するネットワークインタフェース１０５を用い、他の装置から取得したストリームデータを入力するようにしても良い。また、ネットワークインタフェース１０５を通じて外部装置に記憶されている録音済みの音声データを選択し、読み込むようにしても良い。また、出力音声Ｄ６をスピーカによりユーザＵに報知する代わりに、ネットワークインタフェース１０５を用い、他の装置へデータとして送出しても構わない。なお、マイクロフォン及びスピーカを用いる代わりに、有線あるいは無線等の通信を介して音声を入出力するシステムであれば、音響インタフェース１０４は省略することが可能である。 Instead of acquiring the voice spoken by the user U using a microphone, stream data acquired from another device may be input using the network interface 105, which will be described later. Alternatively, recorded audio data stored in an external device may be selected and read through the network interface 105. Furthermore, instead of notifying the user U of the output audio D6 through a speaker, the output audio D6 may be sent as data to another device using the network interface 105. Note that the acoustic interface 104 can be omitted if the system inputs and outputs audio via wired or wireless communication instead of using a microphone and a speaker.

ネットワークインタフェース１０５は、入力音声情報Ｄ１、出力音声Ｄ６、及び音声出力状況Ｄ７をネットワーク上のデータから参照する場合、ストリームデータとして入出力する場合など、外部データの送受信を有線又は無線通信にて行う通信インタフェースである。なお、外部データの送受信を行わない場合、ネットワークインタフェース１０５は省略することが可能である。 The network interface 105 transmits and receives external data using wired or wireless communication, such as when referring to input audio information D1, output audio D6, and audio output status D7 from data on a network, or when inputting and outputting as stream data. It is a communication interface. Note that if external data is not transmitted or received, the network interface 105 can be omitted.

以上のように、図５に示される、音声入力部１、音声認識部２、入力受付判定部３、意図理解部４、対話管理部５、音声生成部６、音声出力部７、音声出力情報生成部８の各機能は、メモリ１０１、プロセッサ１０２、及び記録媒体１０３で実現することができる。 As described above, the functions of the voice input unit 1, the voice recognition unit 2, the input acceptance determination unit 3, the intention understanding unit 4, the dialogue management unit 5, the voice generation unit 6, the voice output unit 7, and the voice output information generation unit 8 shown in FIG. 5 can be realized by the memory 101, the processor 102, and the recording medium 103.

なお、音声対話システム１０００を実行するプログラムは、ソフトウエアプログラムを実行するコンピュータ内部の記憶装置に記憶していてもよいし、ＣＤ－ＲＯＭあるいはフラッシュメモリ等のコンピュータで読み取り可能な外部記憶媒体にて配布される形式で保持され、コンピュータ起動時に読み込んで動作させてもよい。また、ＬＡＮ等の無線または有線ネットワークを通じて他のコンピュータからプログラムを取得することも可能である。 Note that the program for executing the voice dialogue system 1000 may be stored in a storage device inside the computer that executes the software program, or may be stored in a computer-readable external storage medium such as a CD-ROM or flash memory. It may be maintained in a distributed format and loaded and activated when the computer starts. It is also possible to obtain programs from other computers via a wireless or wired network such as a LAN.

上記したように、音声入出力部２００と音声対話管理部３００とが独立した構成の場合、音声入出力部が出力する音声データと、音声対話管理部が受信する音声データの規格、例えば、サンプリング周波数が異なることが多い。音声入出力部と音声対話管理部とを相互接続するためには、両者が送受信する音声データのサンプリング周波数を同一にする必要があり、サンプリング周波数変換に伴う音声データの時間遅延が生じるが、この実施の形態２の構成を為すことで、システム発話の時間遅延が生じても、入力受付判定部３は音声出力情報Ｄ８を用いることで、システム発話完了時刻（システム発話の出力完了タイミング）を正確に検出することが可能となる。 As described above, when the audio input/output unit 200 and the voice dialogue management unit 300 are configured independently, the specifications of the audio data output by the audio input/output unit and of the audio data received by the voice dialogue management unit, for example the sampling frequency, often differ. To interconnect the two units, the sampling frequency of the audio data they exchange must be made the same, and the sampling frequency conversion introduces a time delay in the audio data. With the configuration of Embodiment 2, even if such a time delay occurs in the system utterance, the input acceptance determination unit 3 can accurately detect the system utterance completion time (the output completion timing of the system utterance) by using the audio output information D8.

以上のように、この実施の形態２では、音声出力情報生成部が、システム発話を出力中か否かを示す情報である音声出力情報を生成し、入力受付判定部は、受け取った音声出力情報に基づいて、ユーザ発話を受け付けるか否かを判定するように構成したので、ユーザＵが最後まで発話内容を聞く必要がある、システム発話に対する音声入力について、入力受付判定部がシステム発話完了のタイミングを正確に把握することが可能となる。 As described above, in Embodiment 2, the voice output information generation unit generates voice output information, which indicates whether or not a system utterance is being output, and the input acceptance determination unit determines whether or not to accept a user utterance based on the received voice output information. Consequently, for voice input in response to a system utterance that the user U needs to hear to the end, the input acceptance determination unit can accurately grasp the timing at which the system utterance completes.
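The acceptance decision summarized above can be sketched as follows. This is only an illustration under stated assumptions: the `must_hear_to_end` flag, the parameter names, and the comparison of timestamps in seconds are hypothetical, and the source does not specify the exact decision rule.

```python
from typing import Optional


def accept_user_utterance(utterance_start: float,
                          output_done_at: Optional[float],
                          must_hear_to_end: bool) -> bool:
    """Decide whether a recognized user utterance should be accepted."""
    if not must_hear_to_end:
        # Barge-in is allowed for this prompt, so accept at any time.
        return True
    if output_done_at is None:
        # The system utterance has not been reported as finished yet.
        return False
    # Accept only utterances that began after the prompt finished.
    return utterance_start >= output_done_at
```

The point of the sketch is that the decision depends on the completion time reported by the output side, not on when the response audio was generated.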

すなわち、この実施の形態２の構成を為すことにより、入力受付判定部は、ユーザが実際に聞いた応答音声と、音声生成部が生成した応答音声との時間差がある場合であってもその影響を吸収し、システム発話完了のタイミングを正確に把握することが可能となる。言い換えれば、音声対話管理部と音声入出力部が別の独立した構成で、応答音声の伝送遅延がある音声対話システムにおいても、音声対話システムの応答音声の出力完了時刻（システム発話の出力完了タイミング）を正確に検出することができる。その結果、音声認識のバージインの受付判定精度を改善することが可能となり、音声対話システムのユーザビリティが向上する効果を有する。 In other words, with the configuration of Embodiment 2, the input acceptance determination unit can absorb the effect of any time difference between the response voice actually heard by the user and the response voice generated by the voice generation unit, and can accurately grasp the timing at which the system utterance completes. Put differently, even in a voice dialogue system in which the voice dialogue management unit and the voice input/output unit are separate, independent components and the response voice is subject to transmission delay, the output completion time of the response voice (the output completion timing of the system utterance) can be detected accurately. As a result, the accuracy of barge-in acceptance determination in voice recognition can be improved, which improves the usability of the voice dialogue system.

また、音声出力情報生成部が、システム発話完了のタイミングの情報を音声出力情報として出力するように構成したので、入力受付判定部で応答音声を受信する必要は無くなり、入力受付判定部にて改めて応答音声を分析して発話時間を算出する場合と比べ、応答音声データ分析のための処理量が削減できるという効果も有する。 In addition, since the voice output information generation unit is configured to output information on the timing of system utterance completion as voice output information, there is no need for the input reception determination unit to receive the response voice, and the input reception determination unit This method also has the effect that the amount of processing for analyzing response voice data can be reduced compared to the case where the speaking time is calculated by analyzing the response voice.

なお、この実施の形態２では、音声入出力部２００と音声対話管理部３００とが独立した構成について説明したが、これに限ることは無く、音声入出力部２００と音声対話管理部３００とを同じシステム内で動作させることも可能であり、独立した構成の場合と同様の効果を奏する。 In the second embodiment, the voice input/output unit 200 and the voice dialogue management unit 300 are configured to be independent, but the configuration is not limited to this, and the voice input/output unit 200 and the voice dialogue management unit 300 are It is also possible to operate within the same system, and the same effects as in the case of independent configurations are achieved.

実施の形態３． Embodiment 3.
《３－１》構成 <<3-1>> Configuration
上記した実施の形態１では、音声出力部７が生成する音声出力状況Ｄ７のみから応答音声の出力開始時刻、あるいは出力完了時刻を検出していたが、これに限ることはなく、出力音声Ｄ６を併せて分析して、応答音声の出力開始時刻あるいは出力完了時刻を検出することも可能であり、これを実施の形態３として説明する。 In Embodiment 1 described above, the output start time or output completion time of the response voice is detected only from the voice output status D7 generated by the voice output unit 7. However, the present disclosure is not limited to this; the output voice D6 can also be analyzed in combination to detect the output start time or output completion time of the response voice. This is described as Embodiment 3.

実施の形態３における音声対話システムについて図７を用いて説明する。図７は実施の形態３を示す音声対話システムのブロック構成図である。図７中、図１と同一符号を付したものは同一または相当部分を示す。またそれらの構成は実施の形態１で示したのと同等であるので説明を省略する。 The voice dialogue system in Embodiment 3 will be explained using FIG. 7. FIG. 7 is a block diagram of a voice dialogue system according to a third embodiment. In FIG. 7, the same reference numerals as those in FIG. 1 indicate the same or corresponding parts. Furthermore, since their configurations are the same as those shown in Embodiment 1, their explanations will be omitted.

音声生成部６は、応答内容情報Ｄ５を入力とし、応答音声を生成し出力音声Ｄ６としてネットワークＮＷへ出力する。また、出力音声Ｄ６の時間長を、例えば、音声データのサイズから算出し、得られた時間長を音声長情報Ｄ９として出力する。 The voice generation unit 6 receives the response content information D5, generates a response voice, and outputs it to the network NW as an output voice D6. Further, the time length of the output audio D6 is calculated, for example, from the size of the audio data, and the obtained time length is output as audio length information D9.

音声出力部７は、音声生成部６からネットワークＮＷを通じて得られた出力音声Ｄ６を入力し、スピーカ（図示せず）等の音声報知装置により音声対話システム１０００からの応答音声をユーザＵへ出力すると共に、音声出力状況Ｄ７を音声出力情報生成部８へ出力する。 The audio output unit 7 inputs the output audio D6 obtained from the audio generator 6 through the network NW, and outputs the response audio from the audio dialogue system 1000 to the user U using an audio notification device such as a speaker (not shown). At the same time, the audio output status D7 is output to the audio output information generation section 8.

また、音声出力部７は、音声生成部６からネットワークＮＷを通じて得られた音声長情報Ｄ９を入力とし、ディスプレイ（図示せず）等の情報提示装置を用いて、出力音声Ｄ６の時間長に関する情報、例えば、応答音声出力完了までの残り時間をテキスト表示することで、ユーザＵへ提示することも可能である。ユーザＵへ出力音声Ｄ６の時間長に関する情報をユーザＵに提示することで、ユーザＵは自身の発話タイミングを図ることが可能となり、音声対話システムのユーザビリティが向上する。 The audio output unit 7 also inputs the audio length information D9 obtained from the audio generator 6 through the network NW, and uses an information presentation device such as a display (not shown) to provide information regarding the time length of the output audio D6. For example, it is also possible to present the remaining time until the output of the response voice is completed to the user U by displaying it in text. By presenting information regarding the time length of the output voice D6 to the user U, the user U can plan the timing of his or her own utterance, and the usability of the voice dialogue system is improved.

あるいは、ランプ等の発光装置を用いて、ランプの点滅周期の速度によってユーザＵへ発話タイミングを提示してもよい。例えば、応答音声出力開始時はランプを全点灯し、応答音声出力完了までの残り時間が少なくなるにしたがって点滅周期を早くし、ランプが消灯した時点で応答音声出力完了とすることで、ユーザＵへ発話タイミングを提示しても良い。ユーザＵへ出力音声Ｄ６の時間長に関する情報をユーザＵに提示することで、ユーザＵは自身の発話タイミングを図ることが可能となり、音声対話システムのユーザビリティが向上する上、ディスプレイよりも簡易な情報提示装置でユーザＵに発話タイミングを通知することができるので、装置コストを削減することができる。 Alternatively, a light emitting device such as a lamp may be used to present the utterance timing to the user U through the speed of the lamp's blinking cycle. For example, the lamp is fully lit when output of the response voice starts, the blinking cycle is shortened as the remaining time until output completes decreases, and the lamp going out indicates that output of the response voice is complete. Presenting information on the time length of the output audio D6 to the user U allows the user U to time his or her own utterances, which improves the usability of the voice dialogue system. In addition, because the user U can be notified of the utterance timing by an information presentation device simpler than a display, device costs can be reduced.
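One possible way to derive a blink period from the remaining output time is sketched below. The linear mapping and the `slowest` and `fastest` period values are assumptions chosen for illustration, and the steady-on state at the start of output is approximated here by the slowest blink period; the source does not prescribe a particular formula.

```python
from typing import Optional


def blink_period(remaining: float, total: float,
                 slowest: float = 1.0, fastest: float = 0.1) -> Optional[float]:
    """Return the blink period in seconds, or None when the lamp is off.

    The period shrinks linearly from `slowest` to `fastest` as the
    remaining output time approaches zero.
    """
    if total <= 0 or remaining <= 0:
        return None  # output finished: lamp goes out
    ratio = min(remaining / total, 1.0)
    return fastest + (slowest - fastest) * ratio
```

A caller would re-evaluate `blink_period` periodically with the remaining time computed from the voice length information D9.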

音声出力情報生成部８は、ネットワークＮＷを通じて得られた音声出力状況Ｄ７から応答音声の音声出力開始時刻を取得する。取得した応答音声の音声出力開始時刻に、音声長情報Ｄ９の時間長を加算した時間を応答音声の音声出力完了時刻とし、音声出力開始時刻及び音声出力完了時刻を音声出力情報Ｄ８として出力する。 The voice output information generation unit 8 acquires the voice output start time of the response voice from the voice output situation D7 obtained through the network NW. The time obtained by adding the time length of the voice length information D9 to the acquired voice output start time of the response voice is set as the voice output completion time of the response voice, and the voice output start time and the voice output completion time are output as voice output information D8.

また、音声出力情報生成部８では、音声出力状況Ｄ７の応答音声の音声出力完了時刻と音声長情報Ｄ９により音声出力状況Ｄ７の補正を行うことも可能である。 Furthermore, the audio output information generation unit 8 can also correct the audio output situation D7 using the audio output completion time of the response voice of the audio output situation D7 and the audio length information D9.

ここで、音声長情報Ｄ９による音声出力状況Ｄ７の補正とは、例えば、音声出力状況Ｄ７に記録されている応答音声の出力完了時刻と、音声長情報Ｄ９に記録されている音声長（すなわち、出力信号の出力完了時刻）との時間のずれを所定の時間毎に測定し、測定された時間のずれに基づいてリアルタイムに補正することである。このように、音声長情報Ｄ９の出力完了時刻の情報に基づいて、音声出力状況Ｄ７の出力完了時刻を所定時間毎にリアルタイムに補正することで、ネットワークＮＷの輻輳あるいは再送によって生じる送出した応答音声のデータ長変動、すなわち伝送の“ゆらぎ”の影響を抑制することができ、音声対話システムの応答音声の出力完了時刻を正確に検出することができる。 Here, correcting the audio output status D7 using the audio length information D9 means, for example, measuring at predetermined intervals the time difference between the output completion time of the response voice recorded in the audio output status D7 and the audio length recorded in the audio length information D9 (that is, the output completion time of the output signal), and correcting D7 in real time based on the measured difference. By correcting the output completion time of the audio output status D7 in real time at predetermined intervals based on the output completion time information of the audio length information D9, the influence of fluctuations in the data length of the transmitted response voice caused by congestion or retransmission in the network NW, that is, transmission jitter, can be suppressed, and the output completion time of the response voice of the voice dialogue system can be detected accurately.

また、音声出力状況Ｄ７がネットワークＮＷの影響で受信が不可能である場合、あるいは、データ伝送誤りにより応答音声の出力完了時刻データが壊れるなどした場合には、音声長情報Ｄ９から得られる音声出力完了時刻を、音声出力状況Ｄ７の音声出力完了時刻に置き換える補正も可能であり、音声出力状況Ｄ７が得られない場合でも音声対話システムの応答音声の出力完了時刻を正確に検出することができる。 In addition, if the audio output status D7 cannot be received because of the network NW, or if the output completion time data of the response voice is corrupted by a data transmission error, the audio output completion time of the audio output status D7 can be replaced with the audio output completion time obtained from the audio length information D9. Thus, even when the audio output status D7 is not available, the output completion time of the response voice of the voice dialogue system can be detected accurately.
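The completion-time calculation and the fallback described above can be illustrated with a short sketch. The function name and parameters are hypothetical, all times are assumed to be in seconds on a shared clock, and the output start time is assumed to be known even when the completion report is lost.

```python
from typing import Optional


def resolve_completion_time(start_time: float,
                            duration: float,
                            reported_done: Optional[float]) -> float:
    """Return the output completion time of the response voice.

    start_time    : output start time taken from the output status D7
    duration      : time length carried by the voice length information D9
    reported_done : completion time from D7, or None if it was lost or corrupted
    """
    estimated_done = start_time + duration
    if reported_done is None:
        # D7 is unavailable: fall back to the estimate derived from D9.
        return estimated_done
    return reported_done
```

For instance, with a start time of 10.0 s and a D9 duration of 2.5 s, a missing report yields a completion time of 12.5 s.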

《３－２》処理動作 <<3-2>> Processing Operation
続いて、実施の形態３の音声対話システムの処理動作について図８を用いて説明する。図８は、本実施の形態３を示す音声対話システム１０００の処理の流れを示すフローチャートである。なお、以下の各ステップにおける「部」を「工程」と読み替えてもよい。ステップＳＴ１からステップＳＴ６までの動作は、実施の形態１と同様であるので説明を省略する。 Next, the processing operation of the voice dialogue system according to Embodiment 3 will be explained using FIG. 8. FIG. 8 is a flowchart showing the processing flow of the voice dialogue system 1000 according to Embodiment 3. Note that "unit" in each step below may be read as "process". The operations from step ST1 to step ST6 are the same as in Embodiment 1, so their explanation is omitted.

ステップＳＴ９で、音声生成部６は、出力音声Ｄ６の音声データの時間長を算出し、音声長情報Ｄ９として音声出力情報生成部８へ出力する（ステップＳＴ９）。この時、音声データの時間長は生成された音声データのサイズとサンプリング周波数等の音声フォーマット、ファイル形式から算出することが可能である。また、音声合成方法により出力音声Ｄ６の音声データを生成する場合、音声合成方法が指定する合成音声継続時間長を音声長情報Ｄ９とすれば良い。 In step ST9, the audio generation section 6 calculates the time length of the audio data of the output audio D6, and outputs it to the audio output information generation section 8 as audio length information D9 (step ST9). At this time, the time length of the audio data can be calculated from the size of the generated audio data, the audio format such as the sampling frequency, and the file format. Furthermore, when the audio data of the output audio D6 is generated by the audio synthesis method, the synthesized audio duration specified by the audio synthesis method may be used as the audio length information D9.
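As a rough illustration of computing the time length from the data size and audio format, the sketch below assumes headerless, uncompressed linear PCM. The default values (16 kHz, mono, 16-bit samples) are assumptions for the example, not values stated for the output audio D6.

```python
def pcm_duration_seconds(num_bytes: int,
                         sample_rate_hz: int = 16000,
                         channels: int = 1,
                         bytes_per_sample: int = 2) -> float:
    """Duration of raw linear PCM audio, derived from its size in bytes."""
    if sample_rate_hz <= 0 or channels <= 0 or bytes_per_sample <= 0:
        raise ValueError("audio format parameters must be positive")
    bytes_per_frame = channels * bytes_per_sample
    return num_bytes / (bytes_per_frame * sample_rate_hz)
```

Under these assumptions, 64,000 bytes of audio correspond to 2.0 seconds. Compressed formats or files with headers would need format-specific handling instead.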

また、音声合成方法が、出力音声Ｄ６の音声データ末尾の無音区間（無音時間長）を取得可能な場合、音声データ末尾の無音時間長を削除した時間長を音声長情報Ｄ９としても良い。また、音声データ末尾において、例えば、所定の閾値以下の振幅値となった場合に無音区間と見なし、無音区間を削除した時間長を音声長情報Ｄ９としても良い。なお、無音区間を判定する方法は、所定の閾値以下の振幅値により判断する方法の他、公知の無音区間判定方法を用いることができる。 Furthermore, if the voice synthesis method is capable of acquiring a silent section (silent time length) at the end of the audio data of the output audio D6, the time length obtained by removing the silent time length at the end of the audio data may be used as the audio length information D9. Further, at the end of the audio data, for example, if the amplitude value is less than or equal to a predetermined threshold value, it may be regarded as a silent section, and the time length obtained by removing the silent section may be set as the audio length information D9. In addition to the method of determining a silent section using an amplitude value that is less than or equal to a predetermined threshold, a known method for determining a silent section can be used.
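The amplitude-threshold approach to excluding trailing silence can be sketched as follows. The function name, the representation of samples as signed integers, and the threshold value are illustrative assumptions; any known silence detection method could be substituted.

```python
from typing import Sequence


def trimmed_duration(samples: Sequence[int],
                     sample_rate_hz: int,
                     threshold: int) -> float:
    """Duration in seconds after dropping trailing samples at or below the threshold."""
    end = len(samples)
    # Walk backwards while the absolute amplitude stays within the silence threshold.
    while end > 0 and abs(samples[end - 1]) <= threshold:
        end -= 1
    return end / sample_rate_hz
```

Only silence at the end of the data is removed; quiet passages in the middle of the utterance still count toward the duration.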

また、音声合成方法が、予め用意された音声データを２つ以上連結して出力する場合には、連結する音声データの時間長を合算した値を音声長情報Ｄ９とすれば良い。更に、音声長情報Ｄ９は、音声生成が完了する前に算出できる場合には、その時点で出力するようにしても良い。その場合には、音声生成と音声出力を並列に処理するような構成において、遅延なく音声長情報Ｄ９を音声出力情報生成部８へ出力することが可能である。 Furthermore, when the voice synthesis method concatenates and outputs two or more pieces of voice data prepared in advance, the voice length information D9 may be the sum of the time lengths of the concatenated voice data. Furthermore, if the voice length information D9 can be calculated before voice generation is completed, it may be output at that point. In that case, in a configuration in which audio generation and audio output are processed in parallel, it is possible to output the audio length information D9 to the audio output information generation section 8 without delay.

ステップＳＴ１０で、音声出力部７は、生成した出力音声Ｄ６をシステム発話としてユーザＵへ報知する（ステップＳＴ１０）。また、音声出力部７は応答音声の音声データの送出が完了した時点で、システム発話である応答音声の音声出力開始時刻あるいは音声出力完了時刻を示す情報である音声出力状況Ｄ７を、ネットワークＮＷを通じて音声出力情報生成部８に出力する（ステップＳＴ１０）。 In step ST10, the audio output unit 7 outputs the generated output audio D6 to the user U as a system utterance. When transmission of the audio data of the response voice is complete, the audio output unit 7 also outputs the audio output status D7, which is information indicating the audio output start time or the audio output completion time of the response voice (the system utterance), to the audio output information generation unit 8 through the network NW (step ST10).

ステップＳＴ１１で、音声出力情報生成部８は、ネットワークＮＷを通じて得られた音声出力状況Ｄ７から応答音声の音声出力開始時刻を取得する。取得した応答音声の音声出力開始時刻に、音声長情報Ｄ９の時間長を加算した時間を応答音声の音声出力完了時刻とし、音声出力開始時刻及び音声出力完了時刻を含むタイミングを音声出力情報Ｄ８として出力する（ステップＳＴ１１）。 In step ST11, the audio output information generation unit 8 acquires the audio output start time of the response voice from the audio output status D7 obtained through the network NW. The time obtained by adding the time length of the voice length information D9 to the voice output start time of the acquired response voice is the voice output completion time of the response voice, and the timing including the voice output start time and the voice output completion time is the voice output information D8. Output (step ST11).

この実施の形態３では、音声生成部６が音声長情報Ｄ９を生成するように構成したが、対話管理部５が所望の音声長情報Ｄ９を生成し、音声生成部６は、生成された音声長情報Ｄ９と同一の音声長となるように出力音声Ｄ６を生成するようにしても良い。この場合、音声生成部６は話速やポーズ長を増減させることで音声長を調整すれば良い。その他、公知の波形変換方法により音声長を調整しても良い。 In this third embodiment, the voice generation section 6 is configured to generate the voice length information D9, but the dialogue management section 5 generates the desired voice length information D9, and the voice generation section 6 generates the voice length information D9. The output audio D6 may be generated to have the same audio length as the length information D9. In this case, the voice generation unit 6 may adjust the voice length by increasing or decreasing the speaking speed or pause length. Alternatively, the audio length may be adjusted using a known waveform conversion method.
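Adjusting the speaking rate to hit a target voice length can be illustrated with a simple ratio. The function name and the clamping limits `min_rate` and `max_rate` are hypothetical; the source only states that the speaking rate or pause length may be increased or decreased.

```python
def speaking_rate_factor(natural_s: float, target_s: float,
                         min_rate: float = 0.5, max_rate: float = 2.0) -> float:
    """Rate multiplier (>1 means faster) that maps a natural duration to a target duration."""
    if natural_s <= 0 or target_s <= 0:
        raise ValueError("durations must be positive")
    rate = natural_s / target_s
    # Clamp so that the synthesized speech stays intelligible.
    return max(min_rate, min(max_rate, rate))
```

For example, an utterance that would naturally last 3.0 s needs a factor of 1.5 to fit into 2.0 s. When the clamp applies, the target length cannot be met by rate alone and pause lengths would also have to change.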

また、音声出力情報生成部８は、音声長情報Ｄ９を対話管理部５から直接入力するようにしても良い。 Further, the voice output information generation section 8 may directly input the voice length information D9 from the dialogue management section 5.

以上のように、この実施の形態３では、音声出力情報生成部が、ネットワークＮＷを通じて得られた音声出力状況と、音声生成部が算出した音声長情報とを入力とし、音声出力状況の情報を音声長情報により補正を行うことで、ネットワークＮＷあるいはデータ伝送誤りの影響があっても、音声対話システムの応答音声の出力完了時刻（システム発話の出力完了タイミング）を正確に検出することができる。その結果、音声認識のバージインの受付判定精度を改善することが可能となり、音声対話システムのユーザビリティが向上する効果を有する。 As described above, in Embodiment 3, the audio output information generation unit receives the audio output status obtained through the network NW and the audio length information calculated by the audio generation unit, and corrects the audio output status using the audio length information. Thus, even under the influence of the network NW or data transmission errors, the output completion time of the response voice of the voice dialogue system (the output completion timing of the system utterance) can be detected accurately. As a result, the accuracy of barge-in acceptance determination in voice recognition can be improved, which improves the usability of the voice dialogue system.

また、この実施の形態３では、応答音声出力完了後に音声出力情報を生成する実施の形態１の構成と比べて、実際に応答音声出力が完了してからの遅延が発生することを抑制できるので、音声対話システムの応答音声の出力完了時刻を更に正確に検出することができる顕著な効果を有する。 Furthermore, in the third embodiment, compared to the configuration of the first embodiment in which the audio output information is generated after the output of the response voice is completed, it is possible to suppress the occurrence of a delay after the output of the response voice is actually completed. , it has a remarkable effect that the output completion time of the response voice of the voice dialogue system can be detected more accurately.

また、この実施の形態３では、音声生成部において、末尾の無音時間長を削除した時間長を音声長情報とするように構成したので、音声データ列は存在するがユーザＵには聴こえない末尾の時間はシステム発話が出力完了済みと見なすことができる。したがって、ユーザＵの聴感に近い音声出力情報に従って入力受付判定を行うことが可能となる。よって、音声認識のバージインの受付判定精度を改善することが可能となり、音声対話システムのユーザビリティが更に向上する効果を奏する。 In addition, in Embodiment 3, the voice generation unit is configured to use, as the voice length information, the time length obtained by removing the trailing silence. Therefore, the trailing period, in which audio data exists but is inaudible to the user U, can be regarded as a period in which output of the system utterance has already been completed. Input acceptance determination can thus be performed according to audio output information that is close to what the user U actually hears. This makes it possible to improve the accuracy of barge-in acceptance determination in voice recognition, which further improves the usability of the voice dialogue system.

また、この実施の形態３では、音声出力部が、音声生成部から音声長情報を入力とし、ディスプレイ等により出力音声の時間長に関する情報をユーザＵへ提示するように構成したので、ユーザＵは自身の発話タイミングを図ることができ、入力受付判定部は、ユーザＵがシステム発話の音声出力の残り時間を把握していることを前提とした入力受付判定を行うことが可能となる。よって、音声認識のバージインの受付判定精度を改善することが可能となり、音声対話システムのユーザビリティが更に向上する効果を奏する。 Furthermore, in the third embodiment, the audio output section is configured to receive audio length information from the audio generation section and present information regarding the duration of the output audio to the user U on a display or the like. It is possible to time the user's own utterance, and the input acceptance determination unit can make an input acceptance determination on the premise that the user U knows the remaining time for audio output of the system utterance. Therefore, it is possible to improve the accuracy of barge-in acceptance determination by voice recognition, and the usability of the voice dialogue system is further improved.

また、この実施の形態３では、音声生成部が、対話管理部において設定した音声長情報に従って出力音声を生成するように構成したので、システム発話の音声長を考慮した入力受付判定を行うことが可能となる。よって、音声認識のバージインの受付判定精度を改善することが可能となり、音声対話システムのユーザビリティが更に向上する効果を奏する。 Furthermore, in this third embodiment, the voice generation section is configured to generate output voice according to the voice length information set in the dialogue management section, so that it is possible to make an input acceptance determination that takes into account the voice length of system utterances. It becomes possible. Therefore, it is possible to improve the accuracy of barge-in acceptance determination by voice recognition, and the usability of the voice dialogue system is further improved.

実施の形態４． Embodiment 4.
《４－１》構成 <<4-1>> Configuration
上記した実施の形態１の別の構成例として、入力受付判定部３は、音声出力部７に対して応答音声の出力状況を確認するための信号を出力し、任意のタイミングで応答音声の出力状況を確認できるように構成することも可能であり、これを実施の形態４として説明する。 As another configuration example of Embodiment 1 described above, the input acceptance determination unit 3 can be configured to output to the audio output unit 7 a signal for checking the output status of the response voice, so that the output status of the response voice can be checked at any time. This is described as Embodiment 4.

実施の形態４における音声対話システムについて図９を用いて説明する。図９は実施の形態４を示す音声対話システムのブロック構成図である。図９中、図１と同一符号を付したものは同一または相当部分を示す。またそれらの構成は実施の形態１で示したのと同等であるので説明を省略する。 The voice dialogue system in Embodiment 4 will be explained using FIG. 9. FIG. 9 is a block diagram of a voice dialogue system according to a fourth embodiment. In FIG. 9, the same reference numerals as in FIG. 1 indicate the same or corresponding parts. Furthermore, since their configurations are the same as those shown in Embodiment 1, their explanations will be omitted.

入力受付判定部３は、音声認識結果Ｄ２、及び音声出力情報Ｄ８を入力として、ユーザＵが発話した音声の入力を受け付けるかを判定し、入力を受け付ける場合に受理した音声認識結果Ｄ３を出力する。また、音声出力部７に対し、応答音声の出力状況を問い合わせるための信号である、出力状況確認命令Ｄ１０を出力する。 The input acceptance determination unit 3 receives the voice recognition result D2 and the voice output information D8 as input, determines whether to accept the voice input uttered by the user U, and, if the input is accepted, outputs the accepted voice recognition result D3. It also outputs to the audio output unit 7 an output status confirmation command D10, which is a signal for inquiring about the output status of the response voice.

音声出力部７は、出力音声Ｄ６を入力とし、ユーザＵに対し応答音声出力を行うとともに、入力受付判定部３からの出力状況確認命令Ｄ１０に応じて音声出力状況Ｄ７を出力する。 The voice output unit 7 receives the output voice D6 as input, outputs the response voice to user U, and outputs the voice output status D7 in response to the output status confirmation command D10 from the input acceptance determination unit 3.

《４－２》処理動作 <<4-2>> Processing Operation
続いて、実施の形態４の音声対話システムの処理動作について図１０を用いて説明する。図１０は、本実施の形態４を示す音声対話システム１０００の処理の流れを示すフローチャートである。なお、以下の各ステップにおける「部」を「工程」と読み替えてもよい。ステップＳＴ１からステップＳＴ２までの動作は、実施の形態１と同様であるので説明を省略する。 Next, the processing operation of the voice dialogue system of Embodiment 4 will be described using FIG. 10. FIG. 10 is a flowchart showing the processing flow of the voice dialogue system 1000 according to Embodiment 4. Note that "unit" in each of the following steps may be read as "process". The operations from step ST1 to step ST2 are the same as in Embodiment 1, so their description is omitted.

ステップＳＴ１２で、入力受付判定部３は、ユーザＵの発話開始を判断し、音声出力部７に対して出力状況確認命令Ｄ１０を出力する（ステップＳＴ１２）。 In step ST12, the input acceptance determination unit 3 detects that user U has started speaking and outputs the output status confirmation command D10 to the voice output unit 7 (step ST12).

ステップＳＴ１３で、音声出力部７は、出力状況確認命令Ｄ１０を受信し、現在音声出力中であるか、音声出力完了済みかの情報を音声出力状況Ｄ７としてネットワークＮＷを通じて音声出力情報生成部８へ出力する（ステップＳＴ１３）。 In step ST13, the voice output unit 7 receives the output status confirmation command D10 and outputs information indicating whether voice output is currently in progress or has been completed, as the voice output status D7, to the voice output information generation unit 8 via the network NW (step ST13).

なお、音声出力部７が、出力状況確認命令Ｄ１０に対し音声出力中か否かを示す音声出力状況Ｄ７を出力するようにしたが、出力状況確認命令Ｄ１０を受信した時点以降の、初めて応答音声出力が完了状態になっている時点で、音声出力が完了した旨を示す音声出力状況Ｄ７を生成するようにしても良く、情報伝送のための処理量を更に削減可能である。 Note that, although the voice output unit 7 is described here as outputting, in response to the output status confirmation command D10, the voice output status D7 indicating whether or not voice is being output, it may instead generate the voice output status D7 indicating that voice output has been completed at the first point in time, after receiving the output status confirmation command D10, at which the response voice output is in the completed state. This further reduces the amount of processing for information transmission.
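
The deferred-reply variation described above can be sketched roughly as follows. This is only an illustrative sketch, not the implementation of the disclosure: the class `DeferredStatusOutputUnit`, its method names, the `send_status` callback (standing in for sending D7 over the network NW), and the string value `"completed"` are all hypothetical.

```python
from typing import Callable, List


class DeferredStatusOutputUnit:
    """Hypothetical stand-in for voice output unit 7 (deferred-reply variant)."""

    def __init__(self, send_status: Callable[[str], None]) -> None:
        # send_status is an assumed callback that transmits voice output status D7.
        self._send_status = send_status
        self._playing = False
        self._query_pending = False

    def start_playback(self) -> None:
        """Begin outputting a response voice."""
        self._playing = True

    def finish_playback(self) -> None:
        """Finish outputting; answer a pending query, if any."""
        self._playing = False
        if self._query_pending:
            self._query_pending = False
            self._send_status("completed")

    def on_status_query(self) -> None:
        """Handle output status confirmation command D10."""
        if self._playing:
            # Do not reply yet; reply once, when output first reaches completion.
            self._query_pending = True
        else:
            # Output is already complete, so reply immediately.
            self._send_status("completed")


# Usage: a query arriving during playback produces no message until completion.
sent: List[str] = []
unit = DeferredStatusOutputUnit(sent.append)
unit.start_playback()
unit.on_status_query()
messages_while_playing = list(sent)
unit.finish_playback()
messages_after_completion = list(sent)
```

In this sketch only one message is transmitted per query, and none is transmitted while output is still in progress, which is one way to read the reduction in transmission processing mentioned above.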

ステップＳＴ１４で、音声出力情報生成部８は、入力された音声出力状況Ｄ７から音声出力情報Ｄ８を生成し、入力受付判定部３へ出力する（ステップＳＴ１４）。 In step ST14, the voice output information generation unit 8 generates the voice output information D8 from the input voice output status D7 and outputs it to the input acceptance determination unit 3 (step ST14).

続くステップＳＴ３からステップＳＴ６の処理は、実施の形態１と同様であるので説明を省略する。 The subsequent processes from step ST3 to step ST6 are the same as in the first embodiment, and therefore the description thereof will be omitted.

ステップＳＴ１５で、音声出力部７は、生成した出力音声Ｄ６をシステム発話としてユーザＵへ報知する（ステップＳＴ１５）。 In step ST15, the voice output unit 7 outputs the generated output voice D6 to user U as a system utterance (step ST15).
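
The query flow of steps ST12 to ST14 can be sketched as below. This is a minimal sketch under stated assumptions, not the implementation of the disclosure: all class, field, and method names are hypothetical, the network NW is replaced by direct method calls, and the acceptance rule (reject while the response voice is still being output) is assumed for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoiceOutputStatus:
    """Hypothetical representation of voice output status D7."""
    outputting: bool


@dataclass
class VoiceOutputInfo:
    """Hypothetical representation of voice output information D8."""
    response_in_progress: bool


class VoiceOutputUnit:
    """Stand-in for voice output unit 7."""

    def __init__(self) -> None:
        self.outputting = False

    def handle_status_query(self) -> VoiceOutputStatus:
        # ST13: answer command D10 with the current output status D7.
        return VoiceOutputStatus(outputting=self.outputting)


class VoiceOutputInfoGenerator:
    """Stand-in for voice output information generation unit 8."""

    def generate(self, status: VoiceOutputStatus) -> VoiceOutputInfo:
        # ST14: derive D8 from D7.
        return VoiceOutputInfo(response_in_progress=status.outputting)


class InputAcceptanceUnit:
    """Stand-in for input acceptance determination unit 3."""

    def __init__(self, output_unit: VoiceOutputUnit,
                 info_generator: VoiceOutputInfoGenerator) -> None:
        self._output_unit = output_unit
        self._info_generator = info_generator

    def on_user_speech_start(self, recognition_result: str) -> Optional[str]:
        # ST12: user speech detected, so query the output status (D10).
        status = self._output_unit.handle_status_query()
        info = self._info_generator.generate(status)
        # Assumed rule: reject input while the response voice is being output.
        if info.response_in_progress:
            return None
        return recognition_result


# Usage: the same utterance is rejected during output and accepted afterwards.
output_unit = VoiceOutputUnit()
acceptance = InputAcceptanceUnit(output_unit, VoiceOutputInfoGenerator())
output_unit.outputting = True
result_during = acceptance.on_user_speech_start("deliver tomorrow")
output_unit.outputting = False
result_after = acceptance.on_user_speech_start("deliver tomorrow")
```

The point of the sketch is that the status is fetched at the moment the acceptance decision is needed, rather than relying on status messages pushed earlier.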

以上のように、本実施の形態４では、入力受付判定部は、音声出力部に対し出力状況確認命令を出力し、任意のタイミングで応答音声の出力状況を確認できるように構成したので、入力受付判定部は、ユーザ発話の受付判定処理が必要な時点で、応答音声出力状況に関する情報を即座に入手をすることが可能となるので、音声認識のバージインの受付判定精度を改善することが可能となり、音声対話システムのユーザビリティが更に向上する効果を奏する。 As described above, in Embodiment 4, the input acceptance determination unit is configured to output an output status confirmation command to the voice output unit so that the output status of the response voice can be checked at any timing. The input acceptance determination unit can therefore immediately obtain information on the response voice output status at the point when acceptance determination of a user utterance is needed. This improves the accuracy of barge-in acceptance determination in voice recognition and further improves the usability of the voice dialogue system.

また、この実施の形態４では、音声出力部が、応答音声出力完了時刻を送出する必要が無くなるので、情報伝送等の処理量を削減できる更なる副次効果も奏する。 Furthermore, in the fourth embodiment, since the voice output section does not need to transmit the response voice output completion time, an additional side effect of reducing the amount of processing such as information transmission is achieved.

上記した実施の形態のそれぞれにおいて、入力音声のサンプリング周波数を１６ｋＨｚとして用いたが、これに限ることは無く、例えば、サンプリング周波数２２ｋＨｚなどの異なるサンプリング周波数の音声信号を用いてもよく、上述した各実施の形態のそれぞれにおいて同様の効果を奏する。 In each of the embodiments described above, the sampling frequency of the input voice is 16 kHz, but the sampling frequency is not limited to this. For example, a voice signal with a different sampling frequency, such as 22 kHz, may be used, and the same effects are obtained in each of the embodiments described above.

上記した実施の形態のそれぞれにおいて、ユーザ発話及びシステム発話の言語に日本語を用いて動作を例示したが、本開示に係る音声対話システムは日本語に限らず適用可能であり、その場合は適用する言語に対応した音声認識方法、意図理解方法、及び対話処理方法を用いればよい。 In each of the embodiments described above, the operation was illustrated using Japanese as the language of user utterances and system utterances, but the voice dialogue system according to the present disclosure is not limited to Japanese. When another language is used, a voice recognition method, an intention understanding method, and a dialogue processing method corresponding to that language may be used.

上記以外にも、本開示はその開示の範囲内において、実施の形態の任意の構成要素の変形、もしくは実施の形態の任意の構成要素の省略が可能である。 In addition to the above, any component of the embodiments of the present disclosure may be modified or any component of the embodiments may be omitted within the scope of the disclosure.

本開示に係る音声対話システムは、例えば、商品配送を受け付けるコールセンタの自動音声応答システムに用いられるのに適している。例えば、実施の形態１に係る音声対話システム１０００において、音声入出力部２００が、ユーザＵに対面して設置されているスマートスピーカの音声入出力装置に内蔵され、また、音声対話管理部３００が、ユーザＵと離れた位置にあるデータセンタのサーバ装置に内蔵されているとする。 The voice dialogue system according to the present disclosure is suitable for use in, for example, an automatic voice response system of a call center that accepts product delivery requests. For example, suppose that, in the voice dialogue system 1000 according to Embodiment 1, the voice input/output unit 200 is built into the voice input/output device of a smart speaker installed facing user U, and the voice dialogue management unit 300 is built into a server device in a data center located away from user U.

ユーザＵが、例えば、購入した商品の配送手配をスマートスピーカに対して発話（ユーザ発話）すると、音声対話管理部３００は、ユーザ発話の音声認識と意図理解を行い、ユーザＵの意図に対応した応答音声（システム発話）を生成する処理を行い、生成されたシステム発話はネットワークＮＷへ出力される。 For example, when user U speaks to the smart speaker to arrange delivery of a purchased product (user utterance), the voice dialogue management unit 300 performs voice recognition and intention understanding on the user utterance and generates a response voice (system utterance) corresponding to user U's intention. The generated system utterance is output to the network NW.

システム発話中にユーザ発話が入力される場合、システム発話開始から発話完了までの区間にユーザが発話していることからその入力を棄却する。そして、システム発話完了後に入力されたユーザ発話の入力を受け付けるように動作する。この動作により、音声対話システムがユーザＵの発話途中に意図を理解し、次の対話に進んでしまった場合にも、前の質問に対するユーザＵの発話による誤認識を防止することができるので、ユーザＵに対して適切な応答音声出力とユーザ発話受付ができるので、更に機能が向上した自動音声応答システムとして利用することができる。 If a user utterance is input during a system utterance, the input is rejected because the user is speaking in the interval from the start to the completion of the system utterance. The system then accepts user utterances input after the system utterance is completed. With this operation, even if the voice dialogue system understands user U's intention partway through user U's utterance and proceeds to the next dialogue turn, misrecognition caused by user U's utterance in response to the previous question can be prevented. Because an appropriate response voice can be output to user U and user utterances can be accepted appropriately, the system can be used as an automatic voice response system with further improved functionality.
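
The rejection rule described above (reject input that falls between the start and the completion of a system utterance, accept input after completion) can be sketched as a simple interval check. This is a rough sketch under assumptions: the function name, the use of seconds as the time unit, the handling of utterances that begin before the system utterance starts, and the treatment of the exact completion instant as acceptable are all illustrative choices not specified in the text.

```python
from typing import List, Optional


def accept_user_utterance(utterance_start: float,
                          output_start: Optional[float],
                          output_end: Optional[float]) -> bool:
    """Return True if a user utterance should be accepted.

    utterance_start: time (s) at which the user began speaking.
    output_start:    time (s) at which the system utterance began, or None
                     if no system utterance has been output.
    output_end:      time (s) at which the system utterance completed, or
                     None if it is still being output.
    """
    # Assumption: utterances that begin before any system utterance are accepted.
    if output_start is None or utterance_start < output_start:
        return True
    # System utterance still in progress: reject.
    if output_end is None:
        return False
    # Accept only utterances that begin at or after completion (assumed boundary).
    return utterance_start >= output_end


# Usage with a system utterance that runs from t = 2.0 s to t = 5.0 s.
decisions: List[bool] = [
    accept_user_utterance(3.0, 2.0, 5.0),   # during the system utterance
    accept_user_utterance(3.0, 2.0, None),  # system utterance not yet complete
    accept_user_utterance(5.5, 2.0, 5.0),   # after completion
]
```

A real system would also have to account for the transmission delay between the data center and the smart speaker, which is why the timings used here should be those observed at the voice output side.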

１音声入力部、２音声認識部、３入力受付判定部、４意図理解部、５対話管理部、６音声生成部、７音声出力部、８音声出力情報生成部、
１０１、１０１Ａ、１０１Ｂメモリ、
１０２、１０２Ａ、１０２Ｂプロセッサ、
１０３、１０３Ａ、１０３Ｂ記録媒体、
１０４音響インタフェース、
１０５、１０５Ａ、１０５Ｂネットワークインタフェース、
１０６テキストインタフェース、
１０７表示インタフェース、
１０８、１０８Ａ、１０８Ｂ信号路、
１１０、１１０Ａ、１１０ＢＣＰＵ、
２００音声入出力部、３００音声対話管理部、１０００音声対話システム
1 speech input section, 2 speech recognition section, 3 input acceptance judgment section, 4 intention understanding section, 5 dialogue management section, 6 speech generation section, 7 speech output section, 8 speech output information generation section,
101, 101A, 101B memory,
102, 102A, 102B processor,
103, 103A, 103B recording medium,
104 acoustic interface,
105, 105A, 105B network interface,
106 text interface,
107 display interface,
108, 108A, 108B signal path,
110, 110A, 110B CPU,
200 voice input/output unit, 300 voice dialogue management unit, 1000 voice dialogue system

Claims

A voice dialogue system comprising a voice input/output unit and a voice dialogue management unit, in which a response voice generated by the voice dialogue management unit is output to a user with a delay, wherein
the voice input/output unit includes:
a voice input unit that acquires the user's uttered voice; and
a voice output unit that outputs the response voice to the user and outputs a voice output status of the response voice to the voice dialogue management unit, and
the voice dialogue management unit includes:
a voice recognition unit that performs voice recognition on the user's uttered voice and outputs a voice recognition result;
an intention understanding unit that estimates the user's utterance intention from the voice recognition result and outputs an intention understanding result;
a dialogue management unit that outputs response content information to the user based on the intention understanding result;
a voice generation unit that generates a voice signal of the response voice based on the response content information and outputs it to the voice input/output unit;
a voice output information generation unit that generates, from the voice output status, voice output information indicating whether or not the response voice is being output; and
an input acceptance determination unit that uses the voice output information to determine whether input to the intention understanding unit can be accepted.

2. The voice dialogue system according to claim 1, wherein the voice output information includes at least output start timing and output completion timing of the response voice.

3. The voice dialogue system according to claim 2, wherein information on the output completion timing of the response voice is corrected based on voice length information of the voice signal generated by the voice generation unit.

The input acceptance determination unit outputs a signal for inquiring the output status of the response voice to the voice output unit, so that the output status of the response voice can be confirmed. 3. The voice dialogue system according to any one of 3.

4. The voice dialogue system according to claim 3, wherein the voice output unit visually presents the voice utterance timing to the user.

A voice dialogue management device that generates a response voice, comprising:
a voice recognition unit that performs voice recognition on the user's uttered voice and outputs a voice recognition result;
an intention understanding unit that estimates the user's utterance intention from the voice recognition result and outputs an intention understanding result;
a dialogue management unit that outputs response content information to the user based on the intention understanding result;
a voice generation unit that generates and outputs a voice signal of the response voice based on the response content information;
a voice output information generation unit that receives a voice output status, which is the status in which the voice signal of the response voice is output to the user, and generates voice output information indicating whether or not the response voice is being output; and
an input acceptance determination unit that uses the voice output information to determine whether input to the intention understanding unit can be accepted.

A voice dialogue method executed in a voice dialogue system including a voice input/output device and a voice dialogue management device that generates a response voice, the method comprising:
the voice input/output device
acquiring the user's uttered voice, and
outputting the response voice to the user and outputting the voice output status of the response voice to the voice dialogue management device; and
the voice dialogue management device
performing voice recognition on the user's uttered voice,
estimating the user's utterance intention from the voice recognition result,
determining the content of the response to the user based on the intention understanding result that is the result of the estimation,
generating a voice signal of the response voice based on response content information based on the response content and outputting it to the voice input/output device, and
when the voice output status is input, generating, from the voice output status, voice output information indicating whether or not the response voice is being output, and using the voice output information to determine whether or not to execute the estimation.

8. The voice interaction method according to claim 7, wherein the voice output information includes at least output start timing and output completion timing of the response voice.

9. The voice interaction method according to claim 8, wherein the voice dialogue management device generates voice length information of the voice signal, and information on the output completion timing of the response voice is corrected based on the voice length information of the voice signal.

The voice interaction method according to any one of claims 7 to 9, wherein the voice dialogue management device outputs, to the voice input/output device, a signal for inquiring about the output status of the response voice.

10. The voice interaction method according to claim 9, wherein the voice input/output device visually presents the voice utterance timing to the user.

A voice dialogue method in which a voice dialogue management device that generates a response voice:
performs voice recognition on the user's uttered voice,
estimates the user's utterance intention from the voice recognition result,
determines the content of the response to the user based on the intention understanding result that is the result of the estimation,
generates a voice signal of the response voice based on response content information based on the response content, and
when a voice output status, which is the status in which the voice signal of the response voice is being output to the user, is input, generates, from the voice output status, voice output information indicating whether or not the response voice is being output, and uses the voice output information to determine whether or not to execute the estimation.