JP2014119716A

JP2014119716A - Interaction control method and computer program for interaction control

Info

Publication number: JP2014119716A
Application number: JP2012277171A
Authority: JP
Inventors: Yuichiro Noguchi; 祐一郎野口; Jun Takahashi; 潤高橋; Kentaro Murase; 健太郎村瀬; Toshiyuki Fukuoka; 俊之福岡
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2012-12-19
Filing date: 2012-12-19
Publication date: 2014-06-30
Anticipated expiration: 2032-12-19
Also published as: JP6028556B2

Abstract

PROBLEM TO BE SOLVED: To provide an interaction control method for a voice interaction device that improves the user's quality of experience during the response time following a voice input, by appropriately setting the length of the message voice reproduced during that response time.

SOLUTION: An interaction control method includes the steps of: extracting a predetermined keyword from a voice signal representing the voice of a user collected by a voice input part; retrieving content according to the keyword; estimating the response time from the end of the user's voice input until the content is presented to the user; setting the length of a message voice by referring to a table representing the correspondence between the user's quality of experience and a plurality of combinations of a sample value of the length of a silent period, in which no voice is output, and a sample value of the length of the message voice, which together occupy the response time; generating a message voice having that length; and reproducing the message voice during the response time.

Description

The present invention relates to, for example, a dialogue control method and a computer program for dialogue control in a voice dialogue device that recognizes a voice uttered by a user and executes processing according to the recognition result.

In recent years, voice dialogue systems have been developed in which a voice signal input via a terminal is transmitted to a server via a communication network, the server selects the desired content according to the voice signal, and the content is returned to the terminal. In such a voice dialogue system, the delay in returning information from the server to the terminal can become large, for example when the load on the server exceeds a certain level or when the communication network is congested. When the delay increases, the response time becomes longer; this is the time from the user's voice input until the terminal presents the content corresponding to that voice, during which the terminal does not respond. As a result, the user may feel anxious or increasingly uncomfortable, and the convenience of the voice dialogue system for the user may be reduced. In view of this, a voice dialogue device has been proposed that predicts the time required to read out a recognition dictionary based on size information of the recognition dictionary, and causes a speaker to output a response voice whose length corresponds to the predicted time (see, for example, Patent Document 1). By outputting such a response voice, this voice dialogue device prevents a long silence from continuing while the recognition dictionary is read out.

Patent Document 1: Japanese Laid-open Patent Publication No. 2001-22384 (JP 2001-22384 A)

However, the user also feels uncomfortable if the message voice reproduced while the user is waiting for a response is too long. Therefore, simply lengthening the message voice because the silent time is long does not necessarily reduce the user's anxiety and discomfort.

Accordingly, an object of the present specification is to provide a dialogue control method that, in a voice dialogue device, can improve the user's quality of experience during the response time by appropriately setting the length of the message voice reproduced during the response time after the user performs voice input.

According to one embodiment, a dialogue control method is provided. The dialogue control method includes: extracting a predetermined keyword from a voice signal representing the user's voice collected by a voice input unit; searching for content corresponding to the keyword; estimating the response time from the end of the user's voice input until the content is presented to the user; setting the length of a message voice by referring to a table representing the correspondence between the user's quality of experience and a plurality of combinations of a sample value of the length of a silent period, during which no voice is output, and a sample value of the length of the message voice, both of which occupy the response time; generating a message voice having that length; and reproducing the message voice during the response time.

The objects and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It should be understood that both the foregoing general description and the following detailed description are exemplary and explanatory, and are not restrictive of the invention as claimed.

The dialogue control method disclosed in this specification can improve the user's quality of experience during the response time by appropriately setting the length of the message voice reproduced during the response time.

A schematic configuration diagram of the voice dialogue system according to the first embodiment.
A functional block diagram of the control unit of the server according to the first embodiment.
A diagram showing an example of the information source table.
A diagram showing an example of the response time table.
A diagram showing an example of the relationship between the message voice and the silent periods within the response time.
(a) is a schematic diagram showing the relationship between the message voice length, the silent period, and the user's quality of experience. (b) is a schematic diagram showing the relationship between the response time and the user's quality of experience when the message voice length is L. (c) is a schematic diagram showing the relationship between the message voice length and the user's quality of experience.
(a) is a conceptual diagram showing an example of MOS value calculation by linear interpolation. (b) is a conceptual diagram showing an example of determining the message voice length and calculating the MOS value when there are a plurality of candidate message voice lengths that fit within the estimated response time.
A diagram showing an example of the quality evaluation table.
An operation flowchart of the message voice generation process.
An operation sequence diagram of the dialogue control process according to the first embodiment.
A functional block diagram of the control unit of the server according to the second embodiment.
A diagram showing an example of the preference information table.
An operation sequence diagram of the dialogue control process according to the second embodiment.

Hereinafter, voice dialogue devices according to various embodiments, and dialogue control methods for those devices, will be described with reference to the drawings.
For any given response time, there are multiple combinations of the length of the message voice reproduced during the response time and the length of the silent period during which no message voice is output. The user's quality of experience differs for each combination. However, the inventors found that, when the length of the message voice is constant, the user's quality of experience hardly changes while the silent period is 2 seconds or less, and decreases linearly with the length of the silent period when the silent period is between 2 seconds and 4 seconds. The inventors therefore determined in advance, by experiment, the user's quality of experience at silent periods of 2 seconds and 4 seconds for each of a plurality of message voice lengths, and created a quality evaluation table representing the relationship between the message voice length, the silent period, and the user's quality of experience. By referring to the quality evaluation table, the voice dialogue device obtains, for any response time, the message voice length that gives the highest quality of experience, and reproduces a message having that length within the response time.
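The finding about the silent period can be illustrated with a minimal sketch. It assumes a single fixed message voice length and takes the two experimentally measured quality values (at 2 seconds and 4 seconds of silence) as inputs; the function name and parameters are hypothetical, and treating the quality as exactly constant below 2 seconds is a simplification of "hardly changes".

```python
def mos_for_silence(silence_s, mos_at_2s, mos_at_4s):
    """Estimate the quality of experience for one fixed message length.

    silence_s  : total silent period within the response time, in seconds
    mos_at_2s  : measured quality at 2 s of silence (hypothetical input)
    mos_at_4s  : measured quality at 4 s of silence (hypothetical input)
    """
    if silence_s < 0:
        raise ValueError("silent period cannot be negative")
    if silence_s <= 2.0:
        # Assumption: quality is treated as constant up to 2 s of silence.
        return mos_at_2s
    if silence_s <= 4.0:
        # Linear decline between the two measured sample points.
        ratio = (silence_s - 2.0) / 2.0
        return mos_at_2s + ratio * (mos_at_4s - mos_at_2s)
    # The passage above gives no rule beyond 4 s, so none is invented here.
    raise ValueError("silent periods longer than 4 s are not covered")
```

For example, with measured values of 4.0 at 2 seconds and 3.0 at 4 seconds, a silent period of 3 seconds falls halfway along the linear segment.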

FIG. 1 is a schematic configuration diagram of a voice dialogue system according to a first embodiment of the voice dialogue device. In this embodiment, the voice dialogue system 1 includes a terminal 2, through which a user inputs voice and receives content corresponding to the input voice, and a server 3, which selects content corresponding to the voice input via the terminal 2 and returns the selected content to the terminal 2.

The terminal 2 and the server 3 can communicate with each other via a communication network 4 such as a public communication line. The server 3 may also be able to communicate, via the communication network 4, with an external information source 5 such as a Web server or an ftp server. Note that the terminal 2 and the server 3 may instead be connected by a communication line separate from the communication network 4, such as a dedicated line.

The terminal 2 is, for example, a mobile phone, a portable information terminal, or a fixed terminal, and includes a voice input unit 21, a voice output unit 22, a communication unit 23, a storage unit 24, and a processing unit 25. The terminal 2 may further include a display unit (not shown) such as a liquid crystal display.

The voice input unit 21 includes, for example, a microphone and an analog/digital converter. The voice uttered by the user is converted by the microphone into a voice signal, which is an analog electric signal. The analog voice signal is sampled at a predetermined sampling frequency by the analog/digital converter, converted into a digital voice signal, and then sent to the processing unit 25.

The voice output unit 22 is an example of an output unit that presents content to the user, and includes, for example, a digital/analog converter and a speaker. The voice output unit 22 converts the voice signal received from the processing unit 25 into an analog signal with the digital/analog converter; the speaker then converts the analog voice signal into sound and outputs it toward the user.

The communication unit 23 includes an interface circuit for connecting the terminal 2 to the communication network 4. The communication unit 23 transmits the voice signal received from the processing unit 25 to the server 3 via the communication network 4, in accordance with the communication scheme to which the communication network 4 conforms. The communication unit 23 also receives, from the server 3 via the communication network 4, the content, the message voice to be reproduced during the response time, and the playback delay period, and passes them to the processing unit 25.

The storage unit 24 includes, for example, a semiconductor memory. The storage unit 24 stores various programs with which the processing unit 25 controls the terminal 2, application programs that run on the terminal 2, and various data required to execute those programs.

The processing unit 25 includes one or more processors and peripheral circuits. The processing unit 25 outputs the voice signal input via the voice input unit 21 to the communication unit 23. The processing unit 25 is also an example of a playback time control unit: it starts timing when the user's voice input ends and, once the playback delay period notified by the server 3 has elapsed from that end point, reproduces the message voice via the voice output unit 22. To determine whether the voice input has ended, the processing unit 25, for example, applies a time-frequency transform to the input voice signal in units of predetermined frames to obtain a power spectrum for each frame. The processing unit 25 then determines that the voice input has ended when a frame in which the average power over the frequency bands is equal to or greater than a predetermined threshold is followed by a predetermined number or more of consecutive frames whose average power is at or below that level.
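The end-of-input decision can be sketched as follows. This is a minimal illustration, not the device's implementation: it uses the mean squared amplitude of each frame as a stand-in for the average of the frame's power spectrum (the two are proportional by Parseval's theorem), it assumes that quiet frames are those whose power is below the same threshold used to detect speech, and all function and parameter names are hypothetical.

```python
def frame_powers(samples, frame_len):
    """Mean squared amplitude of each complete frame.

    By Parseval's theorem this is proportional to the average of the
    frame's power spectrum, so no explicit transform is computed here.
    """
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        powers.append(sum(x * x for x in frame) / frame_len)
    return powers


def find_end_of_speech(powers, threshold, min_quiet_frames):
    """Return the index of the first frame of the closing quiet run.

    Speech is considered present once some frame reaches the threshold.
    Input is considered finished when min_quiet_frames consecutive
    frames after that stay below the threshold. Returns None otherwise.
    """
    speech_seen = False
    quiet_run = 0
    for i, power in enumerate(powers):
        if power >= threshold:
            speech_seen = True
            quiet_run = 0
        elif speech_seen:
            quiet_run += 1
            if quiet_run >= min_quiet_frames:
                return i - min_quiet_frames + 1
    return None
```

The returned frame index marks the point from which the terminal would start timing the playback delay period.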

Further, when the processing unit 25 receives the content from the server 3 after reproduction of the message voice has finished, it reproduces the voice signal included in the content via the voice output unit 22.

If the content includes an image or a video stream, the processing unit 25 may display the image or video stream on the display unit (not shown). Likewise, if the content includes text information, the processing unit 25 may display the text information on the display unit.
Furthermore, if the content includes an application program such as a game, the processing unit 25 may execute the application program and cause the display unit to display an image corresponding to the execution result, or cause the voice output unit 22 to reproduce a voice signal corresponding to the execution result.

Referring again to FIG. 1, the server 3 includes a communication unit 31, a storage unit 32, and a control unit 33.

The communication unit 31 includes an interface circuit for connecting the server 3 to the communication network 4. In accordance with the communication scheme to which the communication network 4 conforms, the communication unit 31 passes the voice signal received from the terminal 2 via the communication network 4 to the control unit 33. The communication unit 31 also transmits the content, the message voice, and other data received from the control unit 33 to the terminal 2 via the communication network 4. Further, the communication unit 31 may transmit an information acquisition request signal to the external information source 5 via the communication network 4 and receive data including predetermined content from the external information source 5.
In the following, for convenience of explanation, the user's voice signal that the server receives from the terminal 2 is referred to as the input voice signal.

The storage unit 32 includes, for example, at least one of a semiconductor memory, a magnetic recording device, and an optical recording device. The storage unit 32 stores various programs with which the control unit 33 controls the server 3, application programs that run on the server 3, and various data required to execute those programs. For example, the storage unit 32 stores various information used for voice recognition or voice synthesis, various content requested by users, and addresses of external information sources. The storage unit 32 further stores various information used to determine the length and the content of the message voice reproduced during the response time.

The control unit 33 includes one or more processors, a readable and writable semiconductor memory, and peripheral circuits. The control unit 33 extracts a keyword from the input voice signal and, based on the keyword, searches for the content the user wants. The control unit 33 further estimates the response time based on the keyword, sets the length of the message voice to be reproduced within the response time according to the estimated response time, and creates a message voice of that length. The control unit 33 then returns the retrieved content and the message voice to the terminal 2.

FIG. 2 is a functional block diagram of the control unit 33 of the server 3 included in the voice dialogue system. The control unit 33 includes a voice recognition unit 34, a search unit 35, a delay estimation unit 36, a message generation unit 37, and a voice synthesis unit 38.
Each of these units is, for example, a functional module implemented by a computer program running on a processor of the control unit 33. Alternatively, these units may be mounted on the server 3 as a single integrated circuit that implements the functions of the units.

The voice recognition unit 34 extracts, from the input voice signal, keywords for content search that have been registered in advance. To extract keywords, the voice recognition unit 34 may use any of various voice recognition methods, such as a method using acoustic models or a method using dynamic time warping. In this embodiment, the voice recognition unit 34 extracts keywords using a word dictionary that stores the keywords to be recognized and acoustic models created in advance.

An acoustic model is generated, for example, by dividing the pronunciation of a keyword into unit sounds such as phonemes or syllables and concatenating the unit acoustic models corresponding to those unit sounds in their order of occurrence; it is stored in the storage unit 32 in advance. The unit acoustic models and the acoustic models are each represented by, for example, a hidden Markov model (HMM). Using the HMMs representing the unit acoustic models, the voice recognition unit 34 calculates, based on one or more feature amounts extracted from a predetermined section of the input voice signal, the probability or likelihood that the section corresponds to a particular unit sound.

Specifically, the voice recognition unit 34 extracts feature amounts used for voice recognition from the input voice signal. To do so, the voice recognition unit 34, for example, applies a frequency transform such as a fast Fourier transform to the input voice signal for each frame of a predetermined length to obtain a spectrum for each frame. The frame length is set to, for example, about 10 milliseconds to 100 milliseconds. Based on the spectrum, the voice recognition unit 34 obtains, for each frame, a feature amount such as Mel frequency cepstral coefficients (MFCC) or the difference in power between frames. When calculating MFCC as the feature amount, the voice recognition unit 34, for example, converts the spectrum of each frame into power values on the mel scale and then applies a further frequency transform, such as a discrete cosine transform, to the logarithm of those power values. When obtaining the difference in power between frames as the feature amount, the voice recognition unit 34, for example, takes the sum of the squares of the spectrum over the frequency bands of each frame as the power of that frame, and obtains the difference in power between two consecutive frames.
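The simpler of the two feature amounts, the difference in power between consecutive frames, can be sketched as below. This is an illustrative sketch only: it uses a naive discrete Fourier transform in place of a fast Fourier transform to stay dependency-free, uses non-overlapping frames, and omits MFCC entirely; the function names are hypothetical.

```python
import cmath


def dft_power(frame):
    """Sum of squared spectral magnitudes of one frame (naive O(N^2) DFT)."""
    n = len(frame)
    total = 0.0
    for k in range(n):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(frame))
        total += abs(coeff) ** 2
    return total


def power_deltas(samples, frame_len):
    """Power difference between each pair of consecutive complete frames."""
    powers = [dft_power(samples[i:i + frame_len])
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    return [later - earlier for earlier, later in zip(powers, powers[1:])]
```

A real recognizer would typically use overlapping, windowed frames and an FFT routine; those details are not specified in the passage above and are left out.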

Note that the voice recognition unit 34 may extract, as the feature amount, any of various other feature amounts used in voice recognition with acoustic models (for example, the fundamental frequency). The voice recognition unit 34 may also extract a plurality of types of feature amounts for each frame from the input voice signal.

The voice recognition unit 34 matches the acoustic models against the set of feature amounts obtained from one or more frames, and thereby obtains, for each keyword in the word dictionary, the likelihood of the keyword model generated by concatenating unit sounds. The voice recognition unit 34 then extracts a predetermined number of keywords in descending order of likelihood. The predetermined number is set to, for example, about 1 to 5. Each time a keyword is detected, the voice recognition unit 34 notifies the search unit 35, the delay estimation unit 36, and the message generation unit 37 of the keyword.
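The selection step, taking a fixed number of keywords in descending order of likelihood, can be sketched as follows. The likelihood values are assumed to have been computed already by the acoustic model matching; the function name and the default count are hypothetical.

```python
import heapq


def top_keywords(likelihoods, count=3):
    """Return up to `count` keywords, highest likelihood first.

    likelihoods: dict mapping each dictionary keyword to its likelihood
                 (for example a log-likelihood from HMM matching).
    """
    return heapq.nlargest(count, likelihoods, key=likelihoods.get)
```

With log-likelihoods, larger (less negative) values rank first, so no sign handling is needed.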

The search unit 35 identifies the information source corresponding to the detected keyword by referring to an information source table that represents the correspondence between keywords and information sources. The search unit 35 then acquires the content corresponding to the keyword from that information source. The content includes, for example, at least one of text information, image data, and an application program.

FIG. 3 is a diagram showing an example of the information source table. Each cell in the left column of the information source table 300 stores at least one keyword. Each cell in the right column stores address information indicating the information source that corresponds to the keyword in the same row. The address information is, for example, a Uniform Resource Locator (URL) identifying the external information source 5, or the name of a file stored in the storage unit 32 that contains specific content. One piece of address information may be associated with a plurality of keywords, and one keyword may be associated with a plurality of pieces of address information.

The search unit 35 refers to the information source table and identifies the address information corresponding to the input keyword. The search unit 35 then accesses the information source indicated by the address information and acquires the content corresponding to the keyword. For example, the search unit 35 reads, from the storage unit 32, the file having the file name indicated by the address information. Alternatively, the search unit 35 receives, via the communication network 4, the web page identified by the URL in the address information from the external information source 5 identified by that URL. The search unit 35 then analyzes the source of the received web page and extracts, as the content, the text information and image information displayed on the screen within the web page.
The search unit 35 outputs the content to the communication unit 31. If the content includes text information, the search unit 35 also passes the text information to the voice synthesis unit 38.
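The lookup against the information source table can be sketched as below. The table contents, URLs, and file name are made-up placeholders, and distinguishing a URL from a file name by its scheme is an assumption for illustration; the sketch only resolves addresses and does not fetch anything.

```python
from urllib.parse import urlparse

# Hypothetical table: keyword -> list of address entries (URL or file name).
INFO_SOURCE_TABLE = {
    "news": ["http://news.example.com/top"],
    "weather": ["http://weather.example.com/today", "weather_cache.txt"],
}


def resolve_sources(keyword, table=INFO_SOURCE_TABLE):
    """Return (kind, address) pairs for a keyword.

    kind is "url" for an external information source and "file" for
    a file assumed to be held in the server's own storage.
    """
    sources = []
    for address in table.get(keyword, []):
        scheme = urlparse(address).scheme
        kind = "url" if scheme in ("http", "https", "ftp") else "file"
        sources.append((kind, address))
    return sources
```

A list per keyword reflects that one keyword may map to several pieces of address information; several keywords may also share the same address.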

The delay estimation unit 36 obtains an estimate of the response time corresponding to the extracted keyword. In this embodiment, the delay estimation unit 36 estimates the response time for the extracted keyword by referring to a response time table, stored in the storage unit 32, that represents the correspondence between keywords and estimated response times. The estimated response time for each keyword is set to, for example, a statistical representative value (such as the mean, median, or maximum) of the response times observed when the server 3 previously received input voice signals containing that keyword.

FIG. 4 is a diagram showing an example of the response time table. Each cell in the left column of the response time table 400 stores at least one keyword. Each cell in the right column stores the response time corresponding to the keyword in the same row. For example, when the keyword is "news", the response time is 5 seconds.
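Building such a table from past observations can be sketched as follows. The history data, function names, and the `representative` parameter are hypothetical; how the server records past response times is not described here.

```python
import statistics


def build_response_time_table(history, representative="mean"):
    """Map each keyword to a representative past response time in seconds.

    history: dict mapping a keyword to a non-empty list of observed
             response times for requests that contained that keyword.
    representative: "mean", "median", or "max".
    """
    pick = {
        "mean": statistics.mean,
        "median": statistics.median,
        "max": max,
    }[representative]
    return {keyword: pick(times) for keyword, times in history.items()}


def estimate_response_time(keyword, table):
    """Look up the estimate for a keyword; None if the keyword is unknown."""
    return table.get(keyword)
```

For instance, a history of 4, 5, and 6 seconds for "news" yields an estimate of 5 seconds with the mean, consistent with the example value in the table.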

The delay estimation unit 36 notifies the message generation unit 37 of the estimated response time.

The message generation unit 37 determines the length of the message voice to be reproduced during the response time so that the user's quality of experience is the best, and generates the text information of the message from which a message voice of that length is produced.

FIG. 5 is a diagram showing an example of the relationship between the response time and the message voice. In this embodiment, within the response time 500, the message voice 503 is reproduced between a silent period 501 and a silent period 502. Also in this embodiment, the silent period 501 and the silent period 502 are set to the same length. This prevents either silent period from becoming long and degrading the user's quality of experience.
In the following description, the length of the silent period means the total length of the silent period 501 and the silent period 502.

FIG. 6(a) is a schematic diagram showing the relationship between the time length of the message voice, the silent period, and the user's quality of experience. In FIG. 6(a), the vertical axis represents the response time and the horizontal axis represents the time length of the message voice. The response time minus the time length of the message voice is the silent period contained in that response time. Although not drawn, the direction perpendicular to the page represents the user's quality-of-experience value, which is expressed as a mean opinion score (MOS). The MOS value ranges from 1 to 5, and a larger MOS value indicates a higher quality of experience. Line 601 represents the case where the response time equals the time length of the message voice and the silent period is 0. Lines 602 and 603 represent the cases where the response time contains, in addition to the message voice, a silent period of 2 seconds and of 4 seconds, respectively. The values on line 602 are the MOS values for each message voice length when the silent period is 2 seconds, and the values on line 603 are the MOS values for each message voice length when the silent period is 4 seconds. Each MOS value is the quality of experience at the playback start position that maximizes the quality of experience for the given response time, that is, the position at which the silences before and after the message voice are equal (1 second each for line 602 and 2 seconds each for line 603). In FIG. 6(b), graph 610 represents the relationship between the response time and the user's quality of experience when the time length of the message voice is L. In FIG. 6(b), the horizontal axis represents the response time and the vertical axis represents the MOS value. Q1 represents the quality of experience when the response time contains a 2-second silent period, and Q2 represents the quality of experience when it contains a 4-second silent period. That is, the MOS values on lines 602 and 603 in FIG. 6(a) correspond to Q1 and Q2 in FIG. 6(b) for each L.

As shown in FIG. 6(b), when the length of the message voice is fixed and the silent period is 2 seconds or less, the user's quality of experience is nearly constant at Q1, the value at 2 seconds. When the silent period is between 2 and 4 seconds, the quality of experience decreases linearly with the length of the silent period, from Q1 at 2 seconds to Q2 at 4 seconds. Although not illustrated, the quality of experience drops sharply once the silent period exceeds 4 seconds, so a message voice must be selected for which the silent period is 4 seconds or less for the given response time. Accordingly, if Q1 and Q2 have been measured in advance for each message voice length, the message generation unit 37 can estimate the user's quality of experience for any message voice length at any response time.
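The piecewise behavior just described (flat up to 2 seconds of silence, linear between 2 and 4 seconds, sharply degraded beyond 4 seconds) can be written as a small function. This is a sketch under stated assumptions: the function and constant names are illustrative, and the source gives no value for the region beyond 4 seconds, so the floor of 1.0 (the bottom of the MOS scale) is an assumption.

```python
FLAT_LIMIT = 2.0   # seconds of silence up to which quality stays at Q1
MAX_SILENCE = 4.0  # seconds of silence beyond which quality drops sharply
FLOOR_MOS = 1.0    # assumed stand-in for the sharply degraded region

def estimate_mos(q1, q2, silence):
    """Estimate the MOS for one message length from its measured Q1 and Q2.

    q1: MOS measured with a 2-second silent period
    q2: MOS measured with a 4-second silent period
    silence: total silent period in seconds (response time minus message length)
    """
    if silence < 0:
        raise ValueError("message voice is longer than the response time")
    if silence <= FLAT_LIMIT:
        return q1
    if silence <= MAX_SILENCE:
        fraction = (silence - FLAT_LIMIT) / (MAX_SILENCE - FLAT_LIMIT)
        return q1 + (q2 - q1) * fraction
    return FLOOR_MOS
```

Only two measured values per message length are needed, which is why the quality evaluation table can stay small.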

For example, when the estimated response time is D, the message generation unit 37 only needs to find, along line 604, the message voice length that maximizes the MOS value. In FIG. 6(c), graph 620 represents the relationship between the length of the message voice and the user's quality of experience along line 604. In FIG. 6(c), the horizontal axis represents the length of the message voice and the vertical axis represents the MOS value. In this example, the MOS value is maximal when the length of the message voice falls within period 621 or within period 622. When several message voice lengths maximize the MOS value in this way, the message generation unit 37 selects the shortest one.

To set the length of the message voice as described above, the message generation unit 37 uses a quality evaluation table that represents the relationship between the length of the message voice, the silent period, and the user's quality of experience.

FIG. 8 is a diagram illustrating an example of the quality evaluation table representing the relationship between the length of the message voice, the silent period, and the user's quality of experience. In the quality evaluation table 800, each cell of the leftmost column, "message length," stores a sample value of the message voice length. Each cell of the center column, "Q1," stores the MOS value (first quality value) obtained when the maximum silent period over which the user's quality of experience stays substantially constant (the silent period of the first time), that is, a 2-second silent period, is added to the sample length in the same row. Similarly, each cell of the rightmost column, "Q2," of the quality evaluation table 800 stores the MOS value (second quality value) obtained when the maximum silent period over which the user's quality of experience decreases at a constant rate (the silent period of the second time), that is, a 4-second silent period, is added to the sample length in the same row.

Once the length of the message voice is determined, the message generation unit 37 creates message text information corresponding to that length. In the present embodiment, the message generation unit 37 creates the text information by combining one of a plurality of pre-registered template messages with the input keyword. A template message is a phrase that follows the keyword, such as 「です」 ("is") or 「をお伝えいたします」 ("will now be presented"), and is stored in the storage unit 32 in advance. The storage unit 32 also stores a template message table representing the correspondence between each template message and its time length. The message generation unit 37 sets the time length of the template message to the length of the message voice minus the length of the input keyword. It then refers to the template message table and selects the template message whose time length is closest to the set time length. The message generation unit 37 creates the text information by combining the input keyword with the template message, for example by inserting the keyword at a designated position in the template message. For example, if the input keyword is 「ニュース」 ("news") and the selected template message is 「をお伝えいたします」, the message generation unit 37 uses 「ニュースをお伝えいたします」 ("The news will now be presented") as the message text information.
The message generation unit 37 notifies the speech synthesis unit 38 of the message text information.

FIG. 9 is an operation flowchart of the message generation process executed by the message generation unit 37.
The message generation unit 37 calculates the shortest candidate length L_min (= D - 4) of the message voice, which is the estimated response time D minus 4 seconds, the maximum allowable silent period. The message generation unit 37 also calculates the reference length L_ref (= D - 2) of the message voice, which is D minus 2 seconds, the maximum silent period over which the quality of experience is substantially constant (step S101).

Next, the message generation unit 37 calculates the maximum value Q_a of the MOS values for the message voice lengths that lie between the shortest candidate length L_min and the reference length L_ref, among the lengths stored in the quality evaluation table (for example, FIG. 8) (step S102). For example, if L_2, L_3, and L_4 in the quality evaluation table lie between L_min and L_ref, the message generation unit 37 calculates a MOS value for each of L_min, L_2, L_3, L_4, and L_ref. For a message length L_m, the message generation unit 37 calculates the corresponding MOS value Q_m by linear interpolation according to equation (1), based on the MOS value Q_1 for a 2-second silent period and the MOS value Q_2 for a 4-second silent period.
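Equation (1) is not reproduced in this text. A form consistent with the description, offered as a reconstruction rather than the patent's own notation, interpolates linearly in the silent period D - L_m between the two measured values. The symbols Q_{m,1} and Q_{m,2}, standing for the table's Q1 and Q2 entries for length L_m, are assumed here for illustration:

```latex
Q_m = Q_{m,1} - \left(Q_{m,1} - Q_{m,2}\right)\,\frac{(D - L_m) - 2}{4 - 2},
\qquad L_{\min} \le L_m \le L_{\mathrm{ref}}
```

At L_m = L_ref = D - 2 this form yields Q_{m,1}, and at L_m = L_min = D - 4 it yields Q_{m,2}, matching the endpoints described for FIG. 6(b).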

When the shortest candidate length L_min lies between two message voice lengths L_i and L_{i+1} stored in the quality evaluation table, the message generation unit 37 calculates the MOS value Q_i for L_i and the MOS value Q_{i+1} for L_{i+1} according to equation (1). Then, as in equation (1), the message generation unit 37 calculates the MOS value Q_min for L_min by linearly interpolating between Q_i and Q_{i+1} according to the ratio of the distance between L_min and L_i to the difference between L_{i+1} and L_i (FIG. 7(a)). The message generation unit 37 can calculate the MOS value Q_ref for the reference length L_ref in the same way.

Next, the message generation unit 37 calculates the maximum value Q_b of the MOS values for the message voice lengths L_n, ..., L_{n+k} that lie between the reference length L_ref and the estimated response time D, among the lengths stored in the quality evaluation table (step S103). In this range the silent period is less than 2 seconds, so for each length L_j (n ≤ j ≤ n+k) the message generation unit 37 can use Q_{j1}, the MOS value for a 2-second silent period, as the MOS value for L_j. Moreover, because the quality of experience is higher for shorter message voices, the MOS value Q_n of the shortest length L_n among L_n, ..., L_{n+k} is the maximum value Q_b (FIG. 7(b)).

The message generation unit 37 sets the length L of the message voice to be created to the message voice length corresponding to the larger of the MOS values Q_a and Q_b, and sets the silent period to the estimated response time D minus that length L (step S104). The message generation unit 37 further sets half of the silent period as the reproduction delay period, that is, the time from the end of the user's voice input to the terminal 2 until the message voice is reproduced.
When several message voice lengths give the maximum MOS value, the message generation unit 37 preferably selects the shortest. Alternatively, the message generation unit 37 may choose, from among those lengths, one that matches the length of a combination of a template message and the keyword.
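Steps S101 to S104 can be sketched end to end as below. This is a simplified illustration, not the flowchart's exact procedure: the table values are invented, the Q1 and Q2 columns are interpolated over message length and clamped at the table ends (both assumptions), and the two maxima Q_a and Q_b are found in a single pass over all candidates rather than separately.

```python
from bisect import bisect_left

# Hypothetical quality evaluation table rows:
# (message length in s, Q1 at 2 s of silence, Q2 at 4 s of silence)
QUALITY_TABLE = [
    (1.0, 4.0, 3.0),
    (2.0, 4.2, 3.4),
    (3.0, 4.0, 3.2),
    (4.0, 3.6, 2.8),
    (5.0, 3.0, 2.2),
]

def columns_at(table, length):
    """Q1 and Q2 at an arbitrary length, interpolated between neighboring rows.
    Clamping outside the table range is an assumption of this sketch."""
    lengths = [row[0] for row in table]
    if length <= lengths[0]:
        return table[0][1], table[0][2]
    if length >= lengths[-1]:
        return table[-1][1], table[-1][2]
    hi = bisect_left(lengths, length)
    l0, a1, a2 = table[hi - 1]
    l1, b1, b2 = table[hi]
    t = (length - l0) / (l1 - l0)
    return a1 + (b1 - a1) * t, a2 + (b2 - a2) * t

def mos_for(table, length, response_time):
    """MOS for one candidate: flat up to 2 s of silence, linear from 2 s to 4 s."""
    q1, q2 = columns_at(table, length)
    silence = response_time - length
    if silence <= 2.0:
        return q1
    return q1 + (q2 - q1) * (silence - 2.0) / 2.0

def select_message_length(table, response_time):
    """Return (message length, silent period, reproduction delay period)."""
    l_min = max(response_time - 4.0, 0.0)  # S101; clamping at 0 is an assumption
    l_ref = max(response_time - 2.0, 0.0)
    candidates = {l_min, l_ref}
    candidates.update(l for l, _, _ in table if l_min < l <= response_time)
    # S102 to S104: highest MOS wins, and ties go to the shortest length.
    best = min(candidates, key=lambda l: (-mos_for(table, l, response_time), l))
    silence = response_time - best
    return best, silence, silence / 2.0

length, silence, delay = select_message_length(QUALITY_TABLE, 5.0)
```

With this invented table and a 5-second response time, the sketch selects a 3-second message with 1 second of silence on each side.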

The message generation unit 37 calculates the time length of the template message by subtracting the time length W of the keyword from the length L of the message voice (step S105). The message generation unit 37 determines W by, for example, referring to a keyword table representing the correspondence between keywords and their time lengths. Alternatively, the message generation unit 37 may determine W by multiplying the number of syllables in the keyword by a unit time per syllable.

The message generation unit 37 refers to the template message table and selects the template message whose time length is closest to the set time length (step S106). The message generation unit 37 then generates the message text information by combining the keyword with the selected template message (step S107).
The message generation unit 37 outputs the message text information to the speech synthesis unit 38, outputs the reproduction delay period to the communication unit 31, and ends the message generation process.
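Steps S105 to S107 can be illustrated as follows. This is a sketch under stated assumptions: the template durations, the third template, the per-syllable unit time, and the shortcut of counting one kana character as one syllable are all hypothetical values chosen for the example.

```python
# Hypothetical template message table: template text -> spoken duration (s).
# The first two templates appear in the description; the third is invented.
TEMPLATES = {
    "です": 0.5,
    "をお伝えいたします": 1.5,
    "について詳しくお伝えいたします": 2.5,
}
SECONDS_PER_SYLLABLE = 0.15  # assumed unit time per syllable

def keyword_duration(keyword_reading):
    """Approximate W as syllable count times a unit time.
    Counting one kana character as one syllable is a simplification."""
    return len(keyword_reading) * SECONDS_PER_SYLLABLE

def compose_message(keyword, message_length):
    """Pick the template closest to the remaining time and append it to the keyword."""
    target = message_length - keyword_duration(keyword)                   # S105
    template = min(TEMPLATES, key=lambda t: abs(TEMPLATES[t] - target))   # S106
    return keyword + template                                             # S107

text = compose_message("ニュース", 2.0)
```

Because the nearest template rarely matches the target exactly, the realized message length can differ slightly from L.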

In a variation, the message generation unit 37 may use a template message that itself has the message voice length L as the message text information. In this case, the template message is one that is meaningful on its own, such as 「少々お待ち下さい」 ("Please wait a moment") or 「ただいま検索中です」 ("Searching now"). In this case, steps S105 and S107 are omitted.

In another variation, the message generation unit 37 may calculate quality values for all message voice lengths recorded in the quality evaluation table that are shorter than the estimated response time D. In this case, the message generation unit 37 need not calculate the shortest candidate length L_min. Instead, the message generation unit 37 preferably sets the quality value for any message voice length whose silent period would exceed 4 seconds to a very low value, for example 1.0.
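This variation can be sketched as a single scan over the table rows, with no interpolation at L_min. The table values and the function name are assumptions for illustration; the floor of 1.0 follows the example value given above.

```python
# Hypothetical quality evaluation table rows:
# (message length in s, Q1 at 2 s of silence, Q2 at 4 s of silence)
ROWS = [
    (1.0, 4.0, 3.0),
    (2.0, 4.2, 3.4),
    (3.0, 4.0, 3.2),
    (4.0, 3.6, 2.8),
    (5.0, 3.0, 2.2),
]

def best_length_all_rows(rows, response_time, floor=1.0):
    """Score every table length shorter than the response time.
    Returns the shortest length with the highest score, or None if no row fits."""
    scored = []
    for length, q1, q2 in rows:
        if length >= response_time:
            continue
        silence = response_time - length
        if silence > 4.0:
            mos = floor                                   # too much silence
        elif silence > 2.0:
            mos = q1 + (q2 - q1) * (silence - 2.0) / 2.0  # linear region
        else:
            mos = q1                                      # flat region
        scored.append((mos, length))
    if not scored:
        return None
    best_mos = max(mos for mos, _ in scored)
    return min(length for mos, length in scored if mos == best_mos)
```

Compared with the flowchart of FIG. 9, this trades a little extra computation for simpler control flow, since only table rows are ever evaluated.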

The speech synthesis unit 38 creates a synthesized speech signal for the text included in the content. The speech synthesis unit 38 also synthesizes the message voice based on the text information of the message to be reproduced during the response time.
The speech synthesis unit 38 first converts the text to be synthesized into phonetic information. The phonetic information represents, for example, the reading of the original sentences in the text, written in katakana, with accent positions and break positions added.

To convert text into phonetic information, the speech synthesis unit 38 reads a language dictionary stored in the storage unit 32. The language dictionary registers, for example, various words expected to appear in text information, together with their readings, parts of speech, and conjugated forms. Using the language dictionary, the speech synthesis unit 38 performs, for example, morphological analysis on the original sentences in the text and determines their reading, accent positions, and break positions. In doing so, the speech synthesis unit 38 treats, for example, the positions of punctuation marks in the original sentences as break positions.

For the morphological analysis, the speech synthesis unit 38 can use, for example, a method based on dynamic programming. The speech synthesis unit 38 then creates the phonetic information according to the reading of each word, the accent positions, and the break positions.

Next, the speech synthesis unit 38 generates, based on the phonetic information, the target prosody used to generate the synthesized speech. To do so, the speech synthesis unit 38 reads a plurality of prosody dictionaries from the storage unit 32. Each prosody dictionary stores prosody models that represent how voice pitch and phoneme length change over time. From the prosody dictionaries, the speech synthesis unit 38 applies the prosody model that best matches the position in the sentence, the accent position indicated in the phonetic information, and so on. The speech synthesis unit 38 then creates the target prosody corresponding to the phonetic information according to the applied prosody model and preset synthesis parameters. The synthesis parameters include, for example, a parameter for speech rate and a parameter for voice pitch, and may further include parameters for intonation, volume, and the like. The target prosody includes, for each phoneme (the unit that determines the speech waveform), the phoneme length and the pitch frequency. The target prosody may also include amplitude information for the waveform of each phoneme. A phoneme can be, for example, a single vowel or a single consonant.

Once the target prosody is determined, the speech synthesis unit 38 creates the synthesized speech signal by, for example, a vocoder method or a waveform editing method.
For each phoneme, the speech synthesis unit 38 selects, for example by pattern matching, the speech waveform closest to the phoneme length and pitch frequency of the target prosody from among the speech waveforms registered in a speech waveform dictionary. To do so, the speech synthesis unit 38 reads the speech waveform dictionary from the storage unit 32. The speech waveform dictionary records a plurality of speech waveforms and the identification number of each. The speech waveforms are, for example, waveform signals extracted phoneme by phoneme from recordings of one or more narrators reading various texts aloud.
In addition, so that the speech waveforms selected for the individual phonemes can be connected along the target prosody, the speech synthesis unit 38 may calculate, as waveform conversion information, the deviation between each selected speech waveform and the waveform pattern of the corresponding phoneme indicated in the target prosody.
The speech synthesis unit 38 creates waveform generation information that includes the identification number of the speech waveform selected for each phoneme. The waveform generation information may further include the waveform conversion information.
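The per-phoneme waveform selection can be illustrated with a nearest-match search. This is only a sketch: the description says "pattern matching" without specifying a distance, so the normalized cost below, the `Unit` record, and the sample dictionary entries are all assumptions, and the returned deviations merely stand in for the waveform conversion information.

```python
from dataclasses import dataclass

@dataclass
class Unit:
    """One entry of a hypothetical speech waveform dictionary."""
    unit_id: int
    phoneme: str
    duration_ms: float
    pitch_hz: float

def select_unit(units, phoneme, target_ms, target_hz, pitch_weight=1.0):
    """Return (unit id, (duration deviation, pitch deviation)) for the closest unit.
    The cost function is an assumed relative distance, not the patent's method."""
    candidates = [u for u in units if u.phoneme == phoneme]
    if not candidates:
        raise KeyError(phoneme)

    def cost(u):
        return (abs(u.duration_ms - target_ms) / target_ms
                + pitch_weight * abs(u.pitch_hz - target_hz) / target_hz)

    best = min(candidates, key=cost)
    deviation = (target_ms - best.duration_ms, target_hz - best.pitch_hz)
    return best.unit_id, deviation

DICTIONARY = [
    Unit(1, "a", 80.0, 120.0),
    Unit(2, "a", 100.0, 200.0),
    Unit(3, "k", 40.0, 150.0),
]
unit_id, deviation = select_unit(DICTIONARY, "a", 95.0, 190.0)
```

The deviations show how far the chosen waveform would need to be stretched or pitch-shifted to follow the target prosody.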

The speech synthesis unit 38 creates the synthesized speech signal based on the waveform generation information. To do so, the speech synthesis unit 38 reads, from the speech waveform dictionary stored in the storage unit 32, the speech waveform signal corresponding to the identification number of each phoneme's speech waveform included in the waveform generation information. The speech synthesis unit 38 then creates the synthesized speech signal by connecting the speech waveform signals in sequence. When the waveform generation information includes waveform conversion information, the speech synthesis unit 38 corrects each speech waveform signal according to the waveform conversion information obtained for the corresponding phoneme before connecting the signals in sequence.
The speech synthesis unit 38 outputs the synthesized speech signal for the text included in the content and the message voice to the communication unit 31. The communication unit 31 then transmits these synthesized speech signals, together with the reproduction delay period, to the terminal 2 via the communication network 4.

FIG. 10 is an operation sequence diagram of the dialogue control process in the voice dialogue system 1.
First, the processing unit 25 of the terminal 2 acquires, via the voice input unit 21, an input voice signal corresponding to the voice uttered by the user (step S201). The processing unit 25 then transmits the input voice signal to the server 3 via the communication unit 23 and the communication network 4. The processing unit 25 also starts measuring the time elapsed since the end of the input voice signal.

When the server 3 receives the input voice signal, the voice recognition unit 34 of the control unit 33 extracts the keyword contained in the input voice signal (step S202).

The delay estimation unit 36 of the control unit 33 then obtains the estimated response time D corresponding to the keyword (step S203). The message generation unit 37 of the control unit 33 executes the message generation process to obtain the message to be reproduced during the response time and the reproduction delay period (step S204).

The speech synthesis unit 38 of the control unit 33 synthesizes the message voice (step S205). The server 3 then sends the synthesized message voice and the reproduction delay period to the terminal 2.

When the reproduction delay period has elapsed since the end of the user's voice input, the processing unit 25 of the terminal 2 causes the voice output unit 22 to reproduce the message voice (step S206).

Meanwhile, the search unit 35 of the control unit 33 identifies the address information of the information source corresponding to the keyword (step S207). The search unit 35 then acquires the content from the information source indicated by that address information (step S208).

The speech synthesis unit 38 of the control unit 33 synthesizes a speech signal corresponding to the text included in the content (step S209). The server 3 then sends the content and the synthesized speech signal for the text included in the content to the terminal 2.
When the processing unit 25 of the terminal 2 receives the content and the synthesized speech signal for the text included in the content, it causes the voice output unit 22 to reproduce the synthesized speech signal (step S210).
The voice dialogue system 1 then ends the dialogue control process.

As described above, this voice dialogue system and its dialogue control method determine the length of the message voice reproduced during the response time by referring to a quality evaluation table that represents the relationship between that length, the silent period, and the user's quality of experience. The voice dialogue system and the dialogue control method can therefore set the length of the message voice so that the user's quality of experience is best. Furthermore, they determine the user's quality of experience based on the finding that it is substantially constant when the silent period is less than 2 seconds and decreases linearly with the length of the silent period when the silent period is between 2 and 4 seconds. As a result, they can determine an appropriate message voice length for any response time from a relatively small number of samples.

Next, a voice dialogue system according to a second embodiment and the dialogue control method in that system will be described. The voice dialogue system and dialogue control method according to the second embodiment adjust the length of the message voice according to the user's preference.

The voice dialogue system according to the second embodiment differs from that of the first embodiment in some of the functions realized by the control unit of the server. The following description therefore covers only those functions of the server's control unit that differ from the first embodiment. For the other components of the voice dialogue system, refer to the description and figures of the corresponding components in the first embodiment.

FIG. 11 is a functional block diagram of the control unit 40 of the server 3 according to the second embodiment. The control unit 40 includes a voice recognition unit 34, a search unit 35, a delay estimation unit 36, a message generation unit 37, a speech synthesis unit 38, and a preference determination unit 39.
Each of these units of the control unit 40 is, for example, a functional module realized by a computer program running on a processor included in the control unit 40. Alternatively, these units may be implemented in the server 3 as a single integrated circuit that realizes their functions.

The control unit 40 according to the second embodiment differs from the control unit 33 according to the first embodiment in that it includes the preference determination unit 39. The preference determination unit 39 and its related parts are therefore described below.

In the second embodiment, the user's identification number is transmitted from the terminal 2 to the server 3 before voice input. When the communication network 4 is a telephone line, for example, the telephone number of the terminal 2 can serve as the user's identification number. In that case, the server 3 obtains the user's identification number because the telephone number is transmitted from the terminal 2 to the server 3 when the call between them is connected. The user's identification number may instead be a number assigned separately from the terminal 2. In that case, after the connection between the terminal 2 and the server 3 is established, the terminal 2 may transmit to the server 3 the user's identification number entered, for example, via an operation unit (not shown) of the terminal 2.

The preference determination unit 39 identifies the preference of the user corresponding to the identification number received from the terminal 2 by referring to a preference information table that represents the correspondence between user identification numbers and message length preferences.

FIG. 12 shows an example of the preference information table. Each cell in the left column of the preference information table 1200 stores a user's identification number. Each cell in the right column stores the preference type corresponding to the identification number stored in the adjacent cell to its left. In this embodiment, three preference types are defined: "concise type", "balanced type", and "verbose type". Each user selects one of these three types, for example, when registering with the voice dialogue system 1. The preference information table is then created or updated based on that selection.

The "concise type" is a type for which the message voice may occupy a relatively short part of the response period, as in "Here is the news", with a correspondingly longer silent period. The "verbose type" is a type that prefers the message voice to occupy a relatively long part of the response period, with a short silent period, as in "We are searching for news now. Please wait a moment". The "balanced type" is a type that prefers the message voice to occupy a part of the response period that is longer than for the "concise type" and shorter than for the "verbose type", as in "I will tell you the news".

In the second embodiment, a plurality of users are classified into one of the three types above, and a quality evaluation table is created in advance for each type by obtaining, for each of a plurality of message voice lengths, the MOS value when the silent period is set to 2 seconds and when it is set to 4 seconds. Each quality evaluation table is stored in the storage unit 32 in association with its preference type.
In general, for the same message voice duration, the "verbose type" yields, compared with the other types, a higher MOS value the shorter the silent period and a lower MOS value the longer the silent period. Conversely, the "concise type" yields, compared with the other types, a lower MOS value the shorter the silent period and a higher MOS value the longer the silent period.

FIG. 13 is an operation sequence diagram of the dialogue control process according to the second embodiment. The operation sequence shown in FIG. 13 differs from that of the first embodiment shown in FIG. 10 in the processing of step S301, so step S301 is described below.

After step S203, the preference determination unit 39 of the control unit 40 refers to the preference information table to identify the preference type corresponding to the user's identification number, and reads the quality evaluation table corresponding to the identified type from the storage unit 32 (step S301). The preference determination unit 39 then passes the quality evaluation table it has read to the message generation unit 37.
In step S204, the message generation unit 37 determines the length of the message voice and the reproduction delay period using the quality evaluation table received from the preference determination unit 39. The message generation unit 37 then creates message text that matches that length. In this embodiment as well, the reproduction delay period is preferably set to 1/2 of the silent period included in the response time.
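Step S301 and the reproduction delay rule can be sketched as follows. This is only an illustration under stated assumptions: the identification numbers, the table contents, the English type names (renderings of 簡潔型, 均等型, and 冗長型), the fallback to a default type for unregistered users, and the function names are all hypothetical.

```python
# Hypothetical preference information table: user identification number -> type.
PREFERENCE_TABLE = {
    "0001": "concise",
    "0002": "balanced",
    "0003": "verbose",
}

# Hypothetical per-type quality evaluation tables:
# message voice length (s) -> (MOS with 2 s of silence, MOS with 4 s of silence).
QUALITY_TABLES = {
    "concise": {1.0: (3.8, 3.2), 3.0: (3.0, 2.6)},
    "balanced": {1.0: (3.4, 2.6), 3.0: (3.4, 2.8)},
    "verbose": {1.0: (3.0, 2.0), 3.0: (3.8, 3.0)},
}


def select_quality_table(user_id, default_type="balanced"):
    """Step S301: look up the user's type, then the table for that type.

    Falling back to a default type for unknown users is an assumption.
    """
    preference = PREFERENCE_TABLE.get(user_id, default_type)
    return QUALITY_TABLES[preference]


def reproduction_delay(response_time, message_length):
    """Return half of the silent period left in the response time."""
    silence = max(response_time - message_length, 0.0)
    return silence / 2.0
```

For example, a 5-second response time with a 3-second message leaves 2 seconds of silence, so reproduction would start 1 second after the user finishes speaking.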

According to the second embodiment, the voice dialogue system and the dialogue control method can adjust the length of the message voice inserted during the response time according to the user's preference, and can thus further improve the user's quality of experience with the voice dialogue system.

According to a modification of each of the above embodiments, the delay estimation unit 36 may estimate the delay time based on information other than the keyword. For example, the delay estimation unit 36 may acquire, via the communication unit 31, information representing the traffic status of the communication network 4 at the time the input voice was received from a management device (not shown) that manages the communication network 4. The delay estimation unit 36 may then estimate the delay time by referring to a table that represents the correspondence between traffic status and estimated delay time. Alternatively, the delay estimation unit 36 may estimate the delay time according to the number of terminals 2 with which the server 3 has an established connection.
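The two alternatives in this modification can be sketched as follows. The specification does not state how traffic status is encoded or how the delay depends on the number of connected terminals, so the traffic levels, the delay values, and the linear model below are purely illustrative assumptions.

```python
# Hypothetical table: traffic status -> estimated delay time (s).
TRAFFIC_DELAY_TABLE = {
    "low": 1.0,
    "medium": 2.5,
    "high": 5.0,
}


def estimate_delay_from_traffic(traffic_status):
    """Look up the estimated delay for a reported traffic status."""
    return TRAFFIC_DELAY_TABLE[traffic_status]


def estimate_delay_from_terminals(connected_terminals,
                                  base_delay=1.0,
                                  delay_per_terminal=0.25):
    """Estimate the delay from the number of connected terminals.

    The linear relationship is an assumption made for this sketch only.
    """
    return base_delay + delay_per_terminal * connected_terminals
```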

According to another modification, the control unit of the server 3 may update the response time table. In this case, for example, each time the user performs voice input, the terminal 2 measures the response time from that voice input to content reproduction and transmits the measured value to the server 3 via the communication unit 23. Each time the control unit of the server 3 receives a measured response time for a keyword, it calculates the average of the past measured response times for that keyword and the latest measured response time, and updates the response time value for that keyword in the response time table with that average.
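One way to realize this update is an incremental mean per keyword, sketched below. Reading "the average of the past measured values and the latest measured value" as the mean over all measurements received so far is an interpretation, and the class and method names are hypothetical.

```python
class ResponseTimeTable:
    """Keeps a running mean of measured response times per keyword."""

    def __init__(self):
        self._mean = {}   # keyword -> current mean response time (s)
        self._count = {}  # keyword -> number of measurements received

    def update(self, keyword, measured_time):
        """Fold a new measurement into the mean and return the new value."""
        count = self._count.get(keyword, 0) + 1
        previous = self._mean.get(keyword, 0.0)
        # Incremental mean: avoids storing every past measurement.
        self._mean[keyword] = previous + (measured_time - previous) / count
        self._count[keyword] = count
        return self._mean[keyword]

    def lookup(self, keyword):
        """Return the stored response time, or None if never measured."""
        return self._mean.get(keyword)
```

The incremental form keeps only a mean and a count per keyword; a weighted or windowed average would be an alternative if recent measurements should count more.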

According to still another modification, the processing unit 25 of the terminal 2 may have the functions of the voice recognition unit and the voice synthesis unit. This reduces the processing load on the server 3.

According to still another modification, the units of the terminal and the units of the server in the above embodiments or their modifications may be mounted in a single device. One or more processors of that device may then execute the functions of the processing unit of the terminal and the functions of the control unit of the server.

Furthermore, a computer program that causes a computer to realize the functions of the control unit of the server according to each of the above embodiments or their modifications may be provided in a form recorded on a computer-readable recording medium. The computer-readable recording medium can be, for example, a magnetic recording medium, an optical recording medium, or a semiconductor memory. The recording medium does not, however, include a carrier wave.

All examples and specific terms recited herein are intended for instructional purposes, to help the reader understand the present invention and the concepts contributed by the inventor to furthering the art. They are to be construed as not being limited to such specifically recited examples and conditions, nor to the organization of any example in this specification that relates to showing the superiority or inferiority of the present invention. Although the embodiments of the present invention have been described in detail, it should be understood that various changes, substitutions, and modifications can be made to them without departing from the spirit and scope of the present invention.

The following supplementary notes are further disclosed regarding the embodiments described above and their modifications.
(Appendix 1)
A dialogue control method comprising:
extracting a predetermined keyword from a voice signal representing the voice of a user collected by a voice input unit;
searching for content according to the keyword;
estimating a response time from the end of the input of the user's voice until the content is presented to the user;
setting the length of a message voice by referring to a table representing the correspondence between the user's quality of experience and a plurality of combinations of a sample value of the length of a silent period, in which no voice is output, and a sample value of the length of the message voice, both occupying the response time;
generating the message voice having the length; and
reproducing the message voice during the response time.
(Appendix 2)
The dialogue control method according to appendix 1, wherein the table stores, for each sample value of the message voice length, a first quality value representing the user's quality of experience when a silent period of a first time is added, the first time being the maximum length of the silent period for which the quality of experience is constant, and a second quality value representing the user's quality of experience when a silent period of a second time is added, the second time being longer than the first time and being the maximum length of the silent period for which the quality of experience decreases at a constant rate, and
setting the length of the message voice includes calculating, for each sample value of the message voice length that is equal to or less than the response time, a quality value representing the user's quality of experience for that length from the first quality value and the second quality value, and taking the length for which the quality value is largest as the length of the message voice.
(Appendix 3)
The dialogue control method according to appendix 2, wherein setting the length of the message voice includes, when the difference obtained by subtracting the sample value of the message voice length from the response time is shorter than the first time, using the first quality value for that sample value as the quality value for that sample value, and, when the difference is equal to or longer than the first time and equal to or shorter than the second time, calculating the quality value by linear interpolation between the first quality value and the second quality value.
(Appendix 4)
The dialogue control method according to any one of appendices 1 to 3, further comprising selecting a table corresponding to the user's preference from among a plurality of the tables, each corresponding to a preference regarding the length of the message voice,
wherein setting the length of the message voice includes setting the length of the message voice by referring to the selected table.
(Appendix 5)
The dialogue control method according to any one of appendices 1 to 4, wherein setting the length of the message voice includes setting the timing for reproducing the message voice to the point at which half of the difference between the response time and the length of the message voice has elapsed since the user's voice ended.
(Appendix 6)
The dialogue control method according to any one of appendices 1 to 4, wherein generating the message voice includes selecting, from among a plurality of fixed messages stored in advance in a storage unit, a fixed message whose duration equals the length of the message voice minus the length of the keyword, and generating the message voice by combining the selected fixed message with the keyword.
(Appendix 7)
A computer program for dialogue control that causes a computer to execute:
extracting a predetermined keyword from a voice signal representing the voice of a user collected by a voice input unit;
searching for content according to the keyword;
estimating a response time from the end of the input of the user's voice until the content is presented to the user;
setting the length of a message voice by referring to a table representing the correspondence between the user's quality of experience and a plurality of combinations of a sample value of the length of a silent period, in which no voice is output, and a sample value of the length of the message voice, both occupying the response time;
generating the message voice having the length; and
reproducing the message voice during the response time.
(Appendix 8)
A voice dialogue device comprising:
a voice recognition unit that extracts a predetermined keyword from a voice signal representing the voice of a user collected by a voice input unit;
a search unit that searches for content according to the keyword;
a delay estimation unit that estimates a response time from the end of the input of the user's voice until the content is presented to the user;
a message generation unit that sets the length of a message voice by referring to a table representing the correspondence between the user's quality of experience and a plurality of combinations of a sample value of the length of a silent period, in which no voice is output, and a sample value of the length of the message voice, both occupying the response time, and generates the message voice having that length; and
a reproduction time control unit that causes a voice output unit to reproduce the message voice during the response time.
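The fixed-message selection of appendix 6 can be sketched as follows. The appendix calls for a fixed message whose duration equals the message voice length minus the keyword length. Because a stored set may not contain an exact match, this sketch picks the longest fixed message that does not exceed that target, which is an assumption. The message texts, the durations in milliseconds, and the function name are hypothetical.

```python
# Hypothetical fixed messages: (text with a keyword slot, duration in ms
# of the fixed part, excluding the keyword itself).
FIXED_MESSAGES = [
    ("Here is {keyword}.", 600),
    ("I will tell you about {keyword}.", 1500),
    ("Searching for {keyword} now. Please wait a moment.", 3000),
]


def build_message(keyword, keyword_ms, message_ms):
    """Combine the keyword with the best-fitting fixed message.

    Returns None when no fixed message fits in the remaining time.
    """
    target_ms = message_ms - keyword_ms
    candidates = [m for m in FIXED_MESSAGES if m[1] <= target_ms]
    if not candidates:
        return None
    # Assumption: prefer the longest fixed message that still fits.
    text, _ = max(candidates, key=lambda m: m[1])
    return text.format(keyword=keyword)
```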

Description of Symbols
1 Voice dialogue system
2 Terminal
3 Server
4 Communication network
5 External information source
21 Voice input unit
22 Voice output unit
23 Communication unit
24 Storage unit
25 Processing unit
31 Communication unit
32 Storage unit
33, 40 Control unit
34 Voice recognition unit
35 Search unit
36 Delay estimation unit
37 Message generation unit
38 Voice synthesis unit
39 Preference determination unit

Claims

A dialogue control method comprising:
extracting a predetermined keyword from a voice signal representing the voice of a user collected by a voice input unit;
searching for content according to the keyword;
estimating a response time from the end of the input of the user's voice until the content is presented to the user;
setting the length of a message voice by referring to a table representing the correspondence between the user's quality of experience and a plurality of combinations of a sample value of the length of a silent period, in which no voice is output, and a sample value of the length of the message voice, both occupying the response time;
generating the message voice having the length; and
reproducing the message voice during the response time.

The dialogue control method according to claim 1, wherein the table stores, for each sample value of the message voice length, a first quality value representing the user's quality of experience when a silent period of a first time is added, the first time being the maximum length of the silent period for which the quality of experience is constant, and a second quality value representing the user's quality of experience when a silent period of a second time is added, the second time being longer than the first time and being the maximum length of the silent period for which the quality of experience decreases at a constant rate, and
setting the length of the message voice includes calculating, for each sample value of the message voice length that is equal to or less than the response time, a quality value representing the user's quality of experience for that length from the first quality value and the second quality value, and taking the length for which the quality value is largest as the length of the message voice.

The dialogue control method according to claim 2, wherein setting the length of the message voice includes, when the difference obtained by subtracting the sample value of the message voice length from the response time is shorter than the first time, using the first quality value for that sample value as the quality value for that sample value, and, when the difference is equal to or longer than the first time and equal to or shorter than the second time, calculating the quality value by linear interpolation between the first quality value and the second quality value.

The dialogue control method according to any one of claims 1 to 3, further comprising selecting a table corresponding to the user's preference from among a plurality of the tables, each corresponding to a preference regarding the length of the message voice,
wherein setting the length of the message voice includes setting the length of the message voice by referring to the selected table.

A computer program for dialogue control that causes a computer to execute:
extracting a predetermined keyword from a voice signal representing the voice of a user collected by a voice input unit;
searching for content according to the keyword;
estimating a response time from the end of the input of the user's voice until the content is presented to the user;
setting the length of a message voice by referring to a table representing the correspondence between the user's quality of experience and a plurality of combinations of a sample value of the length of a silent period, in which no voice is output, and a sample value of the length of the message voice, both occupying the response time;
generating the message voice having the length; and
reproducing the message voice during the response time.