JP2001022384A

JP2001022384A - Voice interactive device

Info

Publication number: JP2001022384A
Application number: JP11196387A
Authority: JP
Inventors: Takeshi Ono; 健大野; Masayuki Takada; 雅行高田
Original assignee: Nissan Motor Co Ltd
Current assignee: Nissan Motor Co Ltd
Priority date: 1999-07-09
Filing date: 1999-07-09
Publication date: 2001-01-26

Abstract

PROBLEM TO BE SOLVED: To reduce the discomfort and anxiety that a long silence causes the user, by giving a voice response of suitable length even when reading a new recognition dictionary takes a long time and keeps the user waiting. SOLUTION: Dictionary size information is stored in advance in a recognition dictionary storage 4 together with each of several kinds of recognition dictionaries. When a signal processor 3 newly reads a recognition dictionary from the storage 4, it reads the dictionary size information of that dictionary and predicts the time required to read the dictionary from that information. It then constructs a response voice of optimum length based on the predicted read time and outputs it from a speaker 9. This avoids a long silence while the signal processor 3 reads the new recognition dictionary; the response voice is output during the read time, which reduces the user's discomfort and anxiety.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001] [Technical Field of the Invention] The present invention relates to a voice interactive device.

[0002] [Description of the Related Art] Conventionally, a voice interactive device is known that performs speech recognition on a voice input by referring to a recognition dictionary, loads a further recognition dictionary corresponding to the recognized input, and outputs an appropriate response voice to prompt the user for the next voice input. By repeating this procedure, the user can search for the required information by voice input alone.

[0003] In such a voice interactive device, the recognition rate of the speech recognition process generally decreases as the number of words to be recognized increases. To compensate for this drawback, the speech dictionary is organized hierarchically as shown in FIG. 4, and the user advances through the hierarchy by repeating voice input several times, which narrows down the number of words to be recognized and raises the recognition rate. For example, to look up the location of the Sogo department store in Yokohama, the following procedure is used.

[0004] Only "address" and "facility" are registered in the first-level recognition dictionary a. When the user utters "shisetsu" (facility), the device computes the degree of match between this voice input and each of the words "address" and "facility" in the recognition dictionary and determines the word with the highest degree of match to be the recognized word. It outputs a response voice such as "Facility, is that right?", moves to the second-level recognition dictionary e for "facility", and then outputs the response voice "Please say the facility name, for example station, department store, or hotel."

[0005] If the user then utters "department store", the device likewise computes the degree of match between the input voice and each word registered in the second-level recognition dictionary e, determines the word with the highest degree of match to be the recognized word, and outputs a response voice such as "Department store, is that right?" It then moves to the third-level recognition dictionary i for "department store" and outputs the response voice "Please say the prefecture name."

[0006] The device advances through the levels of the recognition dictionary in the same way. On finally reaching the fifth-level recognition dictionary k, it outputs the response voice "Please say the department store name." If the user utters "Sogo" and the word "Sogo" in the recognition dictionary has the highest degree of match, the device judges that the Sogo department store has been selected.

[0007] In this way, the device finally determines that "the Sogo department store in Yokohama" has been input by voice as the search target. The navigation device then determines the location of that facility from its map information, runs a route search with that location as the destination, and shows the result on the display.
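The level-by-level narrowing described in [0003] to [0007] can be sketched as follows. This is only an illustration under stated assumptions: the `HIERARCHY` contents, the function names, and the use of text similarity (`difflib.SequenceMatcher`) in place of an acoustic degree of match are all hypothetical and are not taken from the patent.

```python
from difflib import SequenceMatcher

# Hypothetical fragment of a hierarchy like FIG. 4: each key is a
# recognizable word, each value is the next-level dictionary
# (None when there is no deeper level).
HIERARCHY = {
    "address": None,
    "facility": {
        "station": None,
        "department store": {"kanagawa": {"yokohama": {"sogo": None}}},
        "hotel": None,
    },
}


def best_match(utterance, dictionary):
    """Return the word in `dictionary` with the highest degree of match.

    Text similarity stands in for the acoustic matching of a real recognizer.
    """
    return max(dictionary,
               key=lambda word: SequenceMatcher(None, utterance, word).ratio())


def search(utterances, dictionary=HIERARCHY):
    """Walk down the hierarchy, one utterance per level."""
    path = []
    for utterance in utterances:
        word = best_match(utterance, dictionary)
        path.append(word)
        dictionary = dictionary[word]  # load the next-level dictionary
        if dictionary is None:         # no deeper level: search is complete
            break
    return path
```

Because each level only compares the input against the few words of one small dictionary, the candidate set stays small at every step, which is the motivation the text gives for the hierarchy.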

[0008] [Problems to Be Solved by the Invention] However, when such a conventional voice interactive device recognizes an input voice and then reads a recognition dictionary at a deeper level for recognizing the next voice input, it either outputs no response voice while the dictionary is being read or outputs only a fixed, short one. As a result, a long silence continues while the new recognition dictionary is being read. The user often cannot tell whether the utterance was not recognized correctly and the device has stopped, or whether it was recognized and a new dictionary is being read, and this can cause discomfort and anxiety.

[0009] The present invention has been made in view of these conventional problems. Its object is to provide a voice interactive device that, even when reading a new recognition dictionary takes time and keeps the user waiting, gives a voice response of appropriate length and thereby reduces the discomfort and anxiety that a long silence would otherwise cause the user.

[0010] [Means for Solving the Problems] The voice interactive device of the invention of claim 1 comprises voice input means, recognition dictionary storage means for storing a plurality of recognition dictionaries, response voice storage means for storing response voices, voice output means, and control means. The control means computes the degree of match between the input voice from the voice input means and the recognition target words in the recognition dictionary, reads the next recognition dictionary based on the result, selects an appropriate response voice, and instructs the voice output means to output it. In this device, the recognition dictionary storage means stores dictionary size information for each of the plurality of recognition dictionaries. When reading a recognition dictionary, the control means reads the dictionary size information of that dictionary, predicts the time required to read the dictionary based on that information, constructs a response voice of optimum length based on the predicted read time, and instructs the voice output means to output it.

[0011] In the voice interactive device of the invention of claim 2, according to claim 1, the control means instructs the voice output means to construct and output the response voice of optimum length by adjusting the reading speed of the response voice.

[0012] In the voice interactive device of the invention of claim 3, according to claim 1, the control means instructs the voice output means to construct and output the response voice of optimum length by inserting silent portions at predetermined positions set in advance in the response voice.

[0013] In the voice interactive device of the invention of claim 4, according to claim 1, the control means instructs the voice output means to select and output a response voice of optimum length from among a plurality of response voices of different lengths.

[0014] In the voice interactive device of the invention of claim 5, according to any of claims 1 to 4, the recognition dictionary storage means also serves as data storage means holding data requested by another device, and the control means predicts the time required to read the recognition dictionary from that data storage means.

[0015] In the voice interactive device of the invention of claim 6, according to any of claims 1 to 4, the recognition dictionary storage means is connected to the control means by a network, and the control means predicts the time required to read the recognition dictionary taking the load on the network into account as well.

[0016] [Effects of the Invention] In the voice interactive device of the invention of claim 1, the recognition dictionary storage means stores the dictionary size information of each of the several kinds of recognition dictionaries together with the dictionaries themselves. When the control means newly reads a recognition dictionary, it reads the dictionary size information of that dictionary, predicts the time required to read the dictionary based on that information, constructs a response voice of optimum length based on the predicted read time, and instructs the voice output means to output it.

[0017] As a result, when the control means newly reads a recognition dictionary, the long silence that occurred in conventional devices is avoided and a response voice is output during the read time, so the user is not made to feel discomfort or anxiety.

[0018] In the voice interactive device of the invention of claim 2, when the control means newly reads a recognition dictionary, it instructs the voice output means to construct and output a response voice of optimum length by adjusting the reading speed of the response voice. In addition to the effect of the invention of claim 1, this yields a natural response voice with high continuity.

[0019] In the voice interactive device of the invention of claim 3, when the control means newly reads a recognition dictionary, it instructs the voice output means to construct and output a response voice of optimum length by inserting silent portions at predetermined positions set in advance in the response voice. In addition to the effect of the invention of claim 1, this yields a natural response voice without changing the sound quality of the synthesized voice.

[0020] In the voice interactive device of the invention of claim 4, the control means instructs the voice output means to select and output a response voice of optimum length from among a plurality of response voices of different lengths. In addition to the effect of the invention of claim 1, this yields a natural response voice with high continuity and without any change in sound quality.

[0021] In the voice interactive device of the invention of claim 5, the recognition dictionary storage means also serves as data storage means holding data requested by another device, for example the map information storage means of a navigation device, and the control means predicts the time required to read the recognition dictionary from such data storage means. In addition to the effects of the inventions of claims 1 to 4, even a device that works together with another system can accurately predict the time required to newly read a recognition dictionary and output a response voice of optimum length according to the predicted time.

[0022] In the voice interactive device of the invention of claim 6, the control means predicts the time required to newly read a recognition dictionary from recognition dictionary storage means connected by a network, taking the load on that network into account as well. In addition to the effects of the inventions of claims 1 to 4, the time required to newly read a recognition dictionary can therefore be predicted accurately, including the effect of network load, and a response voice of optimum length can be output according to the predicted time.

[0023] [Embodiments of the Invention] Embodiments of the present invention are described in detail below with reference to the drawings. FIG. 1 shows the functional configuration of the voice interactive device of the first embodiment of the present invention. In the voice interactive device shown in FIG. 1, the microphone 1 is for voice input, and the A/D converter 2 converts the voice input from the microphone 1 into a digital signal and supplies it to the signal processing device 3. The signal processing device 3 includes a CPU 3A, which controls the device as a whole and executes the necessary arithmetic processing, and an internal memory 3B of the required capacity. The external storage device 4 stores the hierarchically structured speech recognition dictionary shown in FIG. 4, the response voice data, and other necessary information. The switch 6 is the start/stop switch for the voice input function.

[0024] The D/A converter 7 performs D/A conversion on the response voice data that the signal processing device 3 has selected and edited by the processing described later. The audio signal output of the D/A converter 7 is amplified by the amplifier 8 and output from the speaker 9 as audible sound.

[0025] Next, the voice response operation of the voice interactive device configured as above is described. FIG. 2 is a flowchart showing the voice response processing performed by the voice interactive device of this embodiment. As in the conventional example, the external storage device 4 stores the hierarchically structured recognition dictionary shown in FIG. 4. In addition, dictionary size data S (bytes) is stored for the recognition dictionary of each level.

[0026] In the initial state of voice input, the signal processing device 3 sets the partial speech dictionary a at the top level of the hierarchical speech dictionary as the first recognition target (step S10). When the switch 6 is operated, speech recognition processing starts (step S11), and the processing of steps S12 to S20 is repeated each time the dictionary level advances.

[0027] In step S12, the CPU 3A starts reading the partial speech dictionary from the external storage device 4 into the internal memory 3B. At the start of this read, it reads the dictionary size data S and holds it in the internal memory 3B. Because the dictionary read itself is handled by parallel processing, step S12 does not need to wait for the read to finish; processing moves to the next step S13 as soon as the dictionary size has been read.
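The non-blocking read in step S12 can be sketched with a background thread. This is a minimal sketch, not the patent's implementation: the `store` layout (a dict with hypothetical `"size"` and `"words"` fields), the function name, and the use of Python's `threading` module are all assumptions made for illustration.

```python
import threading


def start_dictionary_load(store, name, memory):
    """Read the size header first, then load the dictionary body in parallel.

    `store` plays the role of the external storage device and `memory`
    the role of the internal memory; both are plain dicts here.
    Returns the dictionary size and the worker thread doing the load.
    """
    size_bytes = store[name]["size"]  # small header: available immediately

    def load_body():
        # The (slow) body read runs while the caller goes on to build
        # and play the response voice.
        memory[name] = list(store[name]["words"])

    worker = threading.Thread(target=load_body)
    worker.start()
    return size_bytes, worker
```

The caller can use `size_bytes` right away to plan the response voice and call `worker.join()` only just before recognition needs the dictionary.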

[0028] In step S13, the CPU 3A adjusts the reading speed of the response voice and outputs it. The message content of the response voice prompts the user for input against the dictionary. Depending on the dictionary level it is, for example, "Is your destination an address or a facility?", "Please say the name of the prefecture where it is located", "Please say the type of facility", or "Please say the name of the department store".

[0029] In step S13, the CPU 3A first predicts the time T required to read the dictionary from the dictionary size S stored in the memory 3B. With the dictionary size S (bytes), the time T (msec) required for the read is calculated, using a coefficient k determined by the access speeds of the external storage device 4, the CPU 3A, the internal memory 3B, and so on, as

    T = k × S    (1)

The coefficient k is either stored in the memory 3B in advance for each device specification or registered at setup.
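Equation (1) is a simple linear model, and a minimal sketch of it looks like this. The function name and the example value of k are hypothetical; the patent only states that k depends on the access speeds of the hardware.

```python
def predict_read_time_ms(size_bytes, k_ms_per_byte):
    """Predicted dictionary read time T = k * S, in milliseconds."""
    if size_bytes < 0 or k_ms_per_byte < 0:
        raise ValueError("size and coefficient must be non-negative")
    return k_ms_per_byte * size_bytes
```

For instance, with an assumed k of 0.5 msec per byte, a 4000-byte dictionary would be predicted to take 2000 msec to read.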

[0030] Next, the CPU 3A adjusts the reading speed according to the required read time T to construct and output a response voice of optimum duration. To do this, the CPU 3A first selects from the memory 3B the message content of the response voice to be output in step S13. If the length of this message is Tm (msec) and the normal D/A conversion frequency for output is Fm (kHz), the frequency F (kHz) actually used for output is

    F = Fm × (Tm / T)    (2)

The CPU 3A sets the frequency F in the D/A converter 7 and outputs the response voice via the D/A converter 7, the amplifier 8, and the speaker 9. As a result, reading of the recognition dictionary is nearly complete by the time the response voice ends (step S14).
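Equation (2) can be checked with a small sketch. The function names are hypothetical. The second function only verifies the arithmetic: a message recorded at Fm that lasts Tm contains Fm × Tm samples, so playing it at F takes Fm × Tm / F, which equals T.

```python
def playback_frequency_khz(fm_khz, tm_ms, t_ms):
    """D/A output frequency F = Fm * (Tm / T) from equation (2)."""
    if t_ms <= 0:
        raise ValueError("predicted read time T must be positive")
    return fm_khz * (tm_ms / t_ms)


def stretched_duration_ms(fm_khz, tm_ms, f_khz):
    """Duration of the message when its samples are played at F instead of Fm."""
    return tm_ms * fm_khz / f_khz
```

As an assumed example, a 2000 msec message normally played at 16 kHz, with a predicted read time of 2500 msec, would be played at 12.8 kHz and would then last 2500 msec. Changing the D/A clock in this way also shifts the pitch of the voice, which is presumably why the silence-insertion and message-selection alternatives are also offered.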

[0031] Another way to adjust the length of the response voice to the time T required to read the recognition dictionary is to insert silent portions at predetermined positions set in advance for each response voice. For this purpose, the message content of the response voice carries a special mark at each place where a silent portion may be inserted, for example "Please say △ the name △ of the station", where △ marks an insertable position. The CPU 3A inserts silent portions when Tm < T holds. With N the number of insertable positions and δT the length of silence per position, a silent portion of length δT given by

    δT = (T − Tm) / N    (3)

is inserted at each insertable position △. The response voice with the silent portions inserted is then output via the D/A converter 7, the amplifier 8, and the speaker 9.
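The silence-insertion rule of equation (3) can be sketched as a function that turns a marked message into a playback plan. The plan format (a list of `("speech", text)` and `("silence", msec)` tuples) and the function name are assumptions for illustration; only the △ marker and the formula come from the text.

```python
def insert_pauses(message, t_ms, tm_ms, marker="△"):
    """Build a playback plan with equal pauses at each marked position.

    Each pause lasts delta_T = (T - Tm) / N, where N is the number of
    markers. If Tm >= T or there are no markers, no silence is added.
    """
    n = message.count(marker)
    if n == 0 or t_ms <= tm_ms:
        return [("speech", message.replace(marker, ""))]

    delta_ms = (t_ms - tm_ms) / n
    segments = message.split(marker)
    plan = []
    for index, segment in enumerate(segments):
        if segment:
            plan.append(("speech", segment))
        if index < n:  # a marker followed this segment
            plan.append(("silence", delta_ms))
    return plan
```

With T = 2600 msec, Tm = 2000 msec, and two markers, each pause is 300 msec, so speech plus silence adds up to the predicted read time.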

[0032] After the response voice has been constructed and output in steps S13 and S14 and the required recognition dictionary has been read from the external storage device 4 into the internal memory 3B, capture of the user's voice input starts in step S15. To keep the explanation simple, assume below that the first response voice output was "Please say how you want to set the destination."

[0033] Until the dialogue start switch 6 is pressed, the CPU 3A computes the average power of the digital signal arriving through the A/D converter 2. After the dialogue start switch 6 has been pressed, when the instantaneous power of the digital signal exceeds that average power by a predetermined value or more, the CPU 3A judges that the user has spoken and starts capturing the voice. Assume here that the user has uttered "shisetsu" (facility) in reply to the response voice.

[0034] Next, in step S16, the CPU 3A computes the degree of match between the partial speech dictionary read into the memory 3B and the speech segment of the captured voice input. Here it computes the degree of match between the input speech segment "shisetsu" and each of "address" and "facility" in the recognition dictionary. Voice capture continues by parallel processing while this step is being executed.

[0035] In the following step S17, when the instantaneous power of the input digital signal stays at or below a predetermined value for a predetermined time or longer, the CPU 3A judges that the user's utterance has ended and stops accepting input.
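The start and end conditions described for steps S15 and S17 amount to a simple power-based endpoint detector, sketched below. Working on per-frame power values, and all of the parameter names and thresholds, are assumptions for illustration; the text only says "a predetermined value" and "a predetermined time".

```python
def detect_utterance(frame_powers, noise_power, start_margin, stop_level, hold_frames):
    """Return (start_index, end_index) of an utterance in a power sequence.

    The utterance starts at the first frame whose power exceeds the
    background average `noise_power` by at least `start_margin`. It ends
    at the first frame of a run of `hold_frames` consecutive frames at or
    below `stop_level`. Either index is None if not found.
    """
    start = None
    quiet_run = 0
    for index, power in enumerate(frame_powers):
        if start is None:
            if power - noise_power >= start_margin:
                start = index
        elif power <= stop_level:
            quiet_run += 1
            if quiet_run >= hold_frames:
                return start, index - hold_frames + 1
        else:
            quiet_run = 0  # a short dip inside the utterance does not end it
    return start, None
```

Requiring the low power to persist for several frames keeps a brief pause between syllables from being mistaken for the end of the utterance.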

[0036] In step S18, the CPU 3A waits for the match computation to finish and then selects the word with the highest degree of match. Here "facility" should have the higher degree of match, so the word "facility" is displayed on the monitor 5 as the speech recognition result (step S19).

[0037] The CPU 3A also judges whether the recognized word points to a partial speech dictionary at a lower level of the recognition dictionary. If it does, the CPU 3A sets that partial speech dictionary as the new target and returns to step S12. The processing of steps S12 to S20 is repeated until the selected word no longer points to a lower-level partial speech dictionary (step S20).

[0038] Thus, for example, when speech recognition proceeds through "facility", "station", "Kanagawa Prefecture", "JR", and "Sakuragicho", the word "Sakuragicho" is finally output (step S21).

[0039] Once a single word has been determined by voice dialogue in this way, it can be used as a search key, for example for setting a destination in a navigation device or for searching the surrounding area.

[0040] The following method can also be used to make the length of the response voice match the predicted time T required to read the recognition dictionary. Several kinds of response voices are prepared according to the dictionary size of the partial speech dictionary to be read. They are either read from the external storage device 4 into the memory 3B first, before the partial speech dictionary itself is read, or stored in the memory 3B in advance at setup.

[0041] For example, when a station name is to be input by voice using the recognition dictionary shown in FIG. 4, the number of JR stations differs greatly from prefecture to prefecture, so the size of the JR station name dictionary also differs greatly by prefecture. Response voices of several different lengths are therefore prepared, such as:

A: "Please say the name of the station."
B: "Please say the name of the station clearly."
C: "Please say the name of the JR station clearly."
D: "Please say the name of the JR station in Kanagawa Prefecture clearly."

The one whose length best fits the size of the partial dictionary to be read, and hence the predicted read time, is selected and output.

[0042] That is, the CPU 3A computes the predicted time T for reading the recognition dictionary from the external storage device 4 and selects and outputs message A if 0 < T ≤ Tma, message B if Tma < T ≤ Tmb, message C if Tmb < T ≤ Tmc, and message D if Tmc < T. Alternatively, message D may be selected if Tmc < T ≤ Tmd, and if Tmd < T the output speed of the response voice may additionally be adjusted as in the first embodiment, or silent portions may be inserted before output. Here Tma, Tmb, Tmc, and Tmd are values set in advance according to device performance. This approach also ensures that reading of the required partial recognition dictionary is nearly complete when output of the response voice ends.
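The threshold rule in [0042] is a lookup over sorted boundaries, which can be sketched with the standard `bisect` module. The threshold values below are hypothetical placeholders for Tma, Tmb, and Tmc, and the function name is an assumption; the messages are those listed in [0041].

```python
from bisect import bisect_left

MESSAGES = [
    "Please say the name of the station.",                                    # A
    "Please say the name of the station clearly.",                            # B
    "Please say the name of the JR station clearly.",                         # C
    "Please say the name of the JR station in Kanagawa Prefecture clearly.",  # D
]

# Hypothetical values for (Tma, Tmb, Tmc) in msec; the patent says only that
# they are set in advance according to device performance.
THRESHOLDS_MS = [1500, 2000, 2500]


def select_message(t_ms, thresholds_ms=THRESHOLDS_MS, messages=MESSAGES):
    """Pick A if T <= Tma, B if Tma < T <= Tmb, C if Tmb < T <= Tmc, else D."""
    # bisect_left counts the thresholds strictly below T, so a value of T
    # equal to a threshold still falls in the lower band, as in the text.
    return messages[bisect_left(thresholds_ms, t_ms)]
```

A longer predicted read time therefore maps to a longer prompt, with the longest message used for everything above Tmc.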

[0043] Next, the voice interactive device of the second embodiment of the present invention is described with reference to FIG. 3. The voice interactive device of the second embodiment is characterized in that it is connected to a navigation device 10 by a network 11 and shares the external storage device 4 with it.

[0044] The navigation device 10 detects the absolute current position of the vehicle based on information from a GPS sensor and the vehicle's acceleration sensor, reads the map information for the area around the current position from the map information stored in the external storage device 4 through the network 11, and displays it on the monitor 5.

[0045] On the voice interactive device side, the CPU 3A accesses the external storage device 4 through the network 11, reads the speech recognition dictionary, stores it in the internal memory 3B, and uses it for speech recognition.

[0046] Next, the speech recognition processing of the second embodiment is described. The flowchart of the speech recognition processing in the voice interactive device of the second embodiment is the one shown in FIG. 2, as in the first embodiment. However, the processing performed by the CPU 3A in steps S12 and S13 is as follows.

[0047] In step S12, the CPU 3A starts reading the required recognition dictionary from the external storage device 4 via the network 11. This read is performed in two stages: acquiring advance information, and reading the recognition dictionary body.

[0048] Acquisition of the advance information comprises acquiring the time Tn until access to the external storage device 4 is permitted, and acquiring the recognition dictionary size information. The CPU 3A asks the navigation device 10, via a network interface (not shown), for permission to access the external storage device 4. On receiving the request, the navigation device 10 replies with the waiting time Tn until permission is granted. When the navigation device 10 is in the middle of accessing map information in the external storage device 4, Tn is the time until that access ends.

[0049] The CPU 3A holds this waiting time Tn in the internal memory 3B. It also reads from the external storage device 4 only the dictionary size S (bytes) of the recognition dictionary to be read, and holds it in the memory 3B.

[0050] Because the reading of the recognition dictionary itself is handled by parallel processing, there is no need to wait in step S12 for the reading to finish; the process proceeds to the next step S13 as soon as the advance information has been acquired.
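The two-stage read of step S12 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation disclosed in the patent: the class `SharedStorage`, its methods `request_access`, `dictionary_size` and `read_dictionary`, and the function `start_dictionary_read` are all hypothetical names, and a plain dictionary stands in for the memory 3B.

```python
import threading
import time


class SharedStorage:
    """Hypothetical stand-in for the external storage device 4,
    reached through the navigation device 10 over the network 11."""

    def __init__(self, dictionaries, busy_ms=0):
        self._dictionaries = dictionaries  # name -> dictionary bytes
        self._busy_ms = busy_ms            # remaining map-access time

    def request_access(self):
        # The navigation device replies with the waiting time Tn (ms).
        return self._busy_ms

    def dictionary_size(self, name):
        # Only the size S (bytes) is read at this stage.
        return len(self._dictionaries[name])

    def read_dictionary(self, name):
        # Simulate waiting for access, then return the dictionary body.
        time.sleep(self._busy_ms / 1000.0)
        return self._dictionaries[name]


def start_dictionary_read(storage, name, memory):
    """Stage 1: store Tn and S. Stage 2: load the body in the background."""
    memory["Tn"] = storage.request_access()
    memory["S"] = storage.dictionary_size(name)

    def load():
        memory["dictionary"] = storage.read_dictionary(name)

    worker = threading.Thread(target=load)
    worker.start()
    # Return immediately so the caller can go on to step S13.
    return worker
```

The caller can build the response voice from `memory["Tn"]` and `memory["S"]` while the worker thread is still loading, and call `worker.join()` only when the dictionary is actually needed for recognition.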

[0051] In step S13, the CPU 3A adjusts the reading speed of the response voice and outputs it. The message of the response voice prompts the user for input against the dictionary, for example "Please say the name of the station", as in the first embodiment.

[0052] For this purpose, the time T (msec) required for the reading is predicted from the waiting time Tn and the dictionary size S stored in the memory 3B.

[0053]
T = Tn + k × S    (1′)

Here, k is a constant as in the first embodiment, but it takes a different value for each device depending on factors such as the speed of access to the external storage device 4 via the network 11. Using this predicted time T, the reading speed is adjusted, as in the first embodiment, to construct and output a response voice of the optimum duration.
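As a rough illustration of equation (1′) and the speed adjustment, the following Python sketch computes the predicted time and a reading-rate factor. The function names, the value of k, and the rate limits `min_rate` and `max_rate` are assumptions made for this example; the text does not specify how far the reading speed may be changed.

```python
def predict_read_time_ms(tn_ms, size_bytes, k_ms_per_byte):
    """Equation (1'): T = Tn + k * S, in milliseconds."""
    return tn_ms + k_ms_per_byte * size_bytes


def reading_rate(nominal_ms, predicted_ms, min_rate=0.5, max_rate=1.0):
    """Rate factor that stretches a prompt of nominal_ms toward predicted_ms.

    A rate of 1.0 means normal speed; 0.8 means the prompt takes
    nominal_ms / 0.8 to read. The clamp limits are assumed values that
    keep the speech intelligible; they are not taken from the patent.
    """
    if predicted_ms <= 0:
        return max_rate
    rate = nominal_ms / predicted_ms
    return max(min_rate, min(max_rate, rate))
```

For example, with Tn = 500 ms, S = 200,000 bytes and an assumed k = 0.01 ms per byte, the predicted time is about 2,500 ms, and a prompt that nominally lasts 2,000 ms would be read at roughly 0.8 times normal speed.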

[0054] The processing from step S15 onward is the same as in the first embodiment. Thus, in the voice interactive device of the second embodiment, even in an environment where the external storage device 4 is shared with another device and the time required to read the recognition dictionary depends on the access state of that other device, the output time of the response voice can be adjusted according to the time required for the reading. Furthermore, even in an environment where the device is connected to the external storage device 4 via the network 11, the time required to read the dictionary can be calculated taking the characteristics of the network 11 into account, and the output time of the response voice can be adjusted accordingly. This eliminates the problem of subjecting the user to a long silence while the dictionary is being read.

[0055] In the second embodiment as well, as another way of adjusting the length of the response voice according to the time T required to read the recognition dictionary, silent portions can be inserted at predetermined positions set in advance for each response voice.
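The silence-insertion alternative can be sketched as below. This is a hypothetical illustration: the function `insert_silence`, the representation of the response voice as a list of PCM samples, the 8000 Hz sample rate, and the even split of the padding across the preset positions are all assumptions, since the text only states that silence is inserted at predetermined positions.

```python
def insert_silence(samples, pause_positions, nominal_ms, predicted_ms,
                   sample_rate_hz=8000):
    """Pad a response voice with silence at preset sample positions.

    samples         -- PCM samples of the response voice (list of ints)
    pause_positions -- preset sample indices where silence may be inserted
    nominal_ms      -- duration of the unmodified response voice
    predicted_ms    -- predicted dictionary reading time T
    """
    extra_ms = max(0, predicted_ms - nominal_ms)
    if not pause_positions or extra_ms == 0:
        return list(samples)

    # Total number of zero samples needed, split evenly over the positions.
    total_pad = int(extra_ms * sample_rate_hz / 1000)
    base, remainder = divmod(total_pad, len(pause_positions))

    out = []
    prev = 0
    for i, pos in enumerate(sorted(pause_positions)):
        out.extend(samples[prev:pos])
        out.extend([0] * (base + (1 if i < remainder else 0)))
        prev = pos
    out.extend(samples[prev:])
    return out
```

Because only zero-valued samples are added, the spoken segments themselves are left untouched, which is the property the text relies on when it says the sound quality does not change.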

[0056] Furthermore, a plurality of response voices may be prepared according to the dictionary size of the partial speech dictionary to be read. These are either read from the external storage device 4 first, before the partial speech dictionary is read, and stored in the memory 3B, or stored in the memory 3B in advance at setup. The response voice whose content has the optimum length for the size of the partial dictionary to be read, and hence for the predicted reading time, is then selected and output.
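Selecting among prepared response voices might look like the following sketch. The selection rule used here (the shortest prompt that still covers the predicted reading time, falling back to the longest one) is an assumed interpretation of "optimum length", and the function name and the tuple layout are hypothetical.

```python
def select_response(variants, predicted_ms):
    """Pick a prepared response voice for the predicted reading time.

    variants     -- list of (duration_ms, text) tuples held in memory
    predicted_ms -- predicted dictionary reading time T

    Assumed rule: choose the shortest variant that lasts at least
    predicted_ms; if none is long enough, choose the longest variant.
    """
    ordered = sorted(variants, key=lambda v: v[0])
    for duration_ms, text in ordered:
        if duration_ms >= predicted_ms:
            return duration_ms, text
    return ordered[-1]
```

A rule that picks the variant closest in duration to the predicted time would be an equally plausible reading; the version above simply favors covering the whole reading time so that no silence remains.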

[0057] In this way, according to the voice interactive device of the present invention, when a new recognition dictionary is to be read, the dictionary size information of that recognition dictionary is read, the time required to read the recognition dictionary is predictively calculated from that dictionary size information, and a response voice of optimum length is constructed and output based on the predicted reading time. Consequently, when the CPU 3A reads a new recognition dictionary from the external storage device 4, the long silence of conventional devices is avoided and a response voice is output during the reading time, so that the user is not made to feel discomfort or unease.

[0058] When reading a recognition dictionary from the external storage device 4, the CPU 3A adjusts the reading speed of the response voice so that a response voice of optimum length is output, whereby a natural response voice with high continuity can be output.

[0059] Also, when reading a recognition dictionary from the external storage device 4, the CPU 3A constructs and outputs a response voice of optimum length by inserting as much silence as necessary at preset positions in the response voice, whereby a natural response voice can be output without changing the sound quality of the synthesized speech.

[0060] Also, when reading a recognition dictionary from the external storage device 4, the CPU 3A selects and outputs a response voice whose length corresponds to the size of that recognition dictionary, whereby a natural response voice with high continuity can be output without changing the sound quality.

[0061] Furthermore, in the voice interactive device of the present invention, when the external storage device 4 also serves as data storage means holding data requested by another device 10, the CPU 3A predictively calculates the time required to read the recognition dictionary from such an external storage device 4. Thus, even in a device that cooperates with other devices, the time required to read a new recognition dictionary can be predicted accurately, and a response voice of optimum length can be output according to the predicted time.

[0062] In addition, in the voice interactive device of the present invention, the time required to read a new recognition dictionary from the external storage device 4 connected via the network 11 is predictively calculated with the load on the network 11 also taken into account. The time required to read a new recognition dictionary can therefore be predicted accurately, and a response voice of optimum length can be output according to the predicted time.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of the first embodiment of the present invention.

FIG. 2 is a flowchart of the speech recognition processing according to the above embodiment.

FIG. 3 is a block diagram showing the configuration of the second embodiment of the present invention.

FIG. 4 is a structural diagram of a speech recognition dictionary with a typical hierarchical structure.

[Explanation of symbols]

1 microphone
2 A/D converter
3 signal processing device
3A CPU
3B memory
4 external storage device
5 monitor
6 switch
7 D/A converter
8 amplifier
9 speaker
10 navigation device
11 network

Continued from the front page: (51) Int. Cl.⁷, identification symbol, FI, theme code (reference): G10L 3/00 571U

Claims

[Claims]

1. A voice interactive device comprising voice input means, recognition dictionary storage means for storing a plurality of recognition dictionaries, response voice storage means for storing response voices, voice output means, and control means, the control means calculating the degree of coincidence between the input voice from the voice input means and the recognition target words in a recognition dictionary, reading the next recognition dictionary based on the calculation result, and selecting an appropriate response voice and instructing the voice output means to output it, wherein the recognition dictionary storage means stores dictionary size information for each of the plurality of recognition dictionaries, and the control means, when reading a recognition dictionary, reads the dictionary size information of that recognition dictionary, predictively calculates the time required to read the recognition dictionary based on the read dictionary size information, constructs a response voice of optimum length based on the predicted time required for the reading, and instructs the voice output means to output it.

2. The voice interactive device according to claim 1, wherein the control means instructs the voice output means to construct and output the response voice of the optimum length by adjusting the reading speed of the response voice.

3. The voice interactive device according to claim 1, wherein the control means instructs the voice output means to construct and output the response voice of the optimum length by inserting a silent portion at a predetermined position in the response voice.

4. The voice interactive device according to claim 1, wherein the control means instructs the voice output means to select and output the response voice of the optimum length from among a plurality of response voices of different lengths.

5. The voice interactive device according to any one of claims 1 to 4, wherein the recognition dictionary storage means also serves as data storage means holding data requested by another device, and the control means predictively calculates the time required to read the recognition dictionary from the data storage means.

6. The voice interactive device according to claim 1, wherein the recognition dictionary storage means is connected to the control means by a network, and the control means predictively calculates the time required to read the recognition dictionary taking the load on the network into account.