JP2003058188A

JP2003058188A - Voice interaction system

Info

Publication number: JP2003058188A
Application number: JP2001245707A
Authority: JP
Inventors: Hideki Nakamura; 英樹中村
Original assignee: Denso Ten Ltd
Current assignee: Denso Ten Ltd
Priority date: 2001-08-13
Filing date: 2001-08-13
Publication date: 2003-02-28

Abstract

PROBLEM TO BE SOLVED: To provide a voice interaction system with an improved speech recognition rate. SOLUTION: The voice interaction system comprises a voice recognition engine for recognizing a user's utterance, a dialogue processing engine for preparing an utterance to the user according to the recognition result, a voice synthesis engine for synthesizing the prepared utterance into voice, a voice recognition dictionary for storing dictionary data containing voice patterns for voice recognition, and voice input/output means. The voice recognition engine searches the voice recognition dictionary for the recognition result of the previous user utterance, extracts the dictionary data related to the found recognition result, and sets that data in its own storage area. When the next user utterance is recognized, the dictionary data related to the previous recognition result is used, so the recognition rate is improved.

Description

Detailed Description of the Invention

[0001]

[Field of the Invention] The present invention relates to a voice dialogue system that automatically operates various devices, for example a vehicle-mounted car navigation system or audio equipment, based on a dialogue with the user.

[0002]

[Description of the Related Art] Devices that are complicated to operate, such as car navigation systems and audio equipment, use a voice dialogue system for operation. In such a system, the system utters a question created according to predetermined steps, recognizes the user's spoken answer to that question, and creates a new utterance based on the recognition result. Through this procedure, the system obtains from the user the information needed to operate the device.

[0003]

[Problem to be Solved by the Invention] However, users' utterances vary widely in speed, voice quality, intonation, and so on, so speech recognition based on comparing and matching voice patterns often produces recognition errors. When such an error occurs, the user must first make an utterance to correct the error and then repeat the same utterance, which greatly slows the dialogue. Consequently, frequent recognition errors greatly reduce the operability of the device.

[0004] The present invention addresses the above problem of conventional voice dialogue systems, and its object is to provide a voice dialogue system with a high recognition rate and therefore good operability.

[0005]

[Means for Solving the Problem] To solve the above problem, a first aspect of the present invention provides a voice dialogue system comprising: a voice recognition engine for recognizing a user's utterance; a dialogue processing engine that creates an utterance to the user according to the recognition result; a voice synthesis engine for synthesizing the created utterance into voice; a voice recognition dictionary that stores dictionary data including voice patterns for the voice recognition; and voice input/output means. The voice recognition engine searches the voice recognition dictionary for the recognition result of the previous user utterance, extracts the dictionary data related to the found recognition result, sets it in the voice recognition engine, and then performs voice recognition of the next user utterance.

[0006] In this voice dialogue system, the voice recognition engine searches the voice recognition dictionary for the previous recognition result, extracts from the dictionary only the dictionary data related to the found recognition result, and sets it as a dictionary in its own storage area. The next voice recognition is performed using the dictionary set in this way. This prevents recognition errors in which a result unrelated to the previous recognition result is obtained. In addition, because the dictionary data used for recognition is greatly narrowed down, the recognition speed improves, which in turn improves the response speed of the dialogue.

[0007] In the voice dialogue system of the above aspect, when the previous user utterance yields a plurality of recognition results, the voice recognition engine searches the voice recognition dictionary for every one of them and sets in the voice recognition engine the dictionary data that is related to all of the recognition results.

[0008] This further narrows down the data incorporated into the voice recognition engine as a dictionary, which lowers the rate of recognition errors and improves the response speed of the dialogue.

[0009] In the voice dialogue system of the above aspect, the voice recognition engine has a list of the word attributes that are to be searched for among recognition results. When a word that is not on this list is recognized in the previous voice recognition, that word is excluded and only the words on the list are searched for in the dictionary. This makes it possible to cope with a user utterance that contains words which need not be searched for.

[0010] In the voice dialogue system of the above aspect, when the recognition result of the previous user utterance contains a word attribute corresponding to the list, the voice recognition engine saves it. If an error then occurs in recognizing the next user utterance, the saved word attribute can be used to search the dictionary and extract the related dictionary data. As a result, recognition errors can be dealt with promptly.

[0011] A second aspect of the present invention provides a voice dialogue system comprising: a voice recognition engine for recognizing a user's utterance; a dialogue processing engine that creates an utterance to the user according to the recognition result; a voice synthesis engine for synthesizing the created utterance into voice; a voice recognition dictionary that stores dictionary data including voice patterns for the voice recognition; and voice input/output means. The voice recognition dictionary contains a plurality of types of dictionaries, and the voice recognition engine has a correspondence table between dialogue states and the types of dictionary used in each dialogue state. When recognizing a user utterance in each dialogue state, the voice recognition engine extracts the corresponding type of dictionary from the voice recognition dictionary according to the correspondence table and uses it.

[0012] In this voice dialogue system, the voice recognition engine also extracts a plurality of types of dictionaries from the voice recognition dictionary according to the correspondence table and uses them.

[0013] According to this voice dialogue system, the voice recognition engine extracts only the required types of dictionary from the voice recognition dictionary according to the correspondence table preset for each dialogue state, and sets them in its own storage area. Because nothing other than the required dictionaries is subject to the recognition work, the load on the voice recognition engine is reduced. As a result, the processing time for voice recognition is shortened, which in turn improves the responsiveness of the voice dialogue.

[0014]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a block diagram showing the basic configuration of a voice dialogue system 10 according to one embodiment of the present invention. This system provides a voice guidance function for a car navigation system and comprises a display 12 as a video output device, a speaker 14 as a voice output device, a microphone 16 as a voice input device, a voice dialogue start switch 18, and a processing device 20.

[0015] As hardware, the processing device 20 consists, as is well known, of a central processing unit (CPU), a main storage device, and the like. As software running in the main storage device, the processing device 20 includes, as shown in FIG. 1, an image display engine 202 that displays images on the display 12, a voice recognition engine 204 that recognizes the patterns of voice input from the microphone 16, and a voice synthesis engine 206 that electronically synthesizes the voice to be output from the speaker 14.

[0016] The processing device 20 further includes, as software and as shown in FIG. 1, a dedicated dialogue processing engine 210 that controls the image display engine, voice recognition engine, and voice synthesis engine described above in response to instructions from an application program 208. The processing device 20 also includes a dialogue database 212, which is referred to by the voice recognition engine 204, the voice synthesis engine 206, and the dedicated dialogue processing engine 210, and a voice recognition dictionary 214, which is referred to mainly by the voice recognition engine 204.

[0017] Naturally, an operating system (OS) 216 that comprehensively manages and controls the hardware and software also runs on the processing device 20. The dialogue database 212 and the voice recognition dictionary 214 need not be built into the processing device 20; they may of course reside on external memory such as a CD-ROM, a DVD-ROM, or a memory card.

[0018] As shown in FIG. 2, the voice recognition engine 204 has a processing program for voice recognition (not shown), a dictionary 204a of basic terms used to operate the car navigation system, and a RAM area 204b into which predetermined portions extracted from the voice recognition dictionary 214 are set. The basic term dictionary 204a stores, together with their voice patterns, the basic terms needed for dialogue, for example words such as "yes", "no", "go back", and "go", as well as prefecture names, the names of frequently used large cities, and facility names such as department stores and restaurants. The basic term dictionary 204a is always set in the RAM area 204b, so its words can always be recognized. Alternatively, the prefecture names and large city names can be removed from the basic term dictionary 204a and treated as the top level of the place-name dictionary.

[0019] The voice recognition engine also provides an area 204c that stores the search target word attribute list and tables such as the one listing dialogue states and the types of dictionary used in each, and a storage area 204d for saving a word attribute from the search target word attribute list when one is recognized. These are described in detail in the description of each embodiment.

[0020] As shown in FIG. 3, the voice recognition dictionary 214 stores an address dictionary, a facility name dictionary, and so on. In the present invention, as shown in FIG. 4, the address dictionary has a hierarchical structure in which words whose attributes stand in a conceptual superior-subordinate relationship are arranged in levels. Each word in an upper level serves as a node from which the related words of the next lower level branch, giving a tree structure.

[0021] In the illustrated example, the top level consists of words with the attribute "prefecture name", the second level consists of words with the attribute "city name", and the third level consists of words with the attribute "ward name". Each prefecture name in the top level is a node from which the names of the cities in that prefecture branch. Likewise, each word in the city name level is a node from which the ward names branch. This forms the tree-shaped three-level grammar illustrated in FIG. 4.
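The three-level address dictionary described above can be pictured as an ordinary tree of labelled nodes. The following is only a minimal sketch under stated assumptions: the `Node` class, the `add` and `vocabulary_below` helpers, and the attribute strings are hypothetical names chosen for illustration, and the place names are the ones used in the dialogue examples of this document. Voice patterns are omitted, and only the direct children of a node are treated as candidates for the next utterance.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One word of the address dictionary plus the words that branch from it."""
    word: str
    attribute: str  # "prefecture", "city", or "ward" (hypothetical labels)
    children: dict = field(default_factory=dict)

    def add(self, word, attribute):
        """Attach a lower-level word to this node and return the new node."""
        child = Node(word, attribute)
        self.children[word] = child
        return child


# Build a tiny fragment of the three-level tree.
root = Node("", "root")
hyogo = root.add("Hyogo Prefecture", "prefecture")
osaka = root.add("Osaka Prefecture", "prefecture")
kobe = hyogo.add("Kobe City", "city")
hyogo.add("Akashi City", "city")
osaka.add("Sakai City", "city")
kobe.add("Nada Ward", "ward")


def vocabulary_below(node):
    """Words that branch directly from a node, in sorted order."""
    return sorted(node.children)
```

Calling `vocabulary_below(hyogo)` yields only the city names under Hyogo Prefecture, which is the kind of narrowed vocabulary the following paragraphs rely on.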

[0022] Given a voice recognition dictionary 214 that stores an address recognition dictionary with this grammar structure, the voice recognition engine 204 refers to the recognition result of the previous user utterance, extracts the grammar below the corresponding node from the dictionary 214, and sets it in its own RAM area 204b. The voice recognition engine 204 then recognizes the next user utterance using the dictionary set in this way. In this recognition, therefore, no data other than the word attribute data below the node determined by the previous recognition result is set in the voice recognition engine 204, so that, for example, a city name in Hyogo Prefecture that the user utters cannot be misrecognized as a city name in Osaka Prefecture. The recognition rate therefore improves.

[0023] The operation of various embodiments of the voice dialogue system having the above basic structure is described in detail below with reference to flowcharts.

[0024] FIG. 5 is a flowchart explaining the operation of the voice dialogue system according to the first embodiment of the present invention. FIG. 6 shows an example dialogue between the voice dialogue system and the user (dialogue example 1) used to explain this flowchart.

[0025] When the system is started, the dialogue processing engine 210 operates and creates the first question for setting the destination, "Where do you want to go?". The voice synthesis engine 206 converts this question into voice, which is output from the speaker 14. When the user answers "Hyogo Prefecture", the voice recognition engine 204 recognizes the user utterance by referring to its own basic dictionary 204a (step S1 in FIG. 5).

[0026] Next, the voice recognition engine 204 searches the address dictionary in the voice recognition dictionary 214 for the word attribute "Hyogo Prefecture", which is the first recognition result (step S2), extracts the dictionary data below the found word attribute, and sets it in the RAM area 204b of the voice recognition engine 204 (step S3). Specifically, the dictionary data below node 10 in FIG. 4 is set in the area 204b.

[0027] The system next asks "Where in Hyogo Prefecture?". When the user answers "Kobe City", the voice recognition engine 204 recognizes the answer using the dictionary data set in the RAM area 204b (step S4). After confirming in step S5 that the dialogue has not ended, the voice recognition engine 204 returns to step S2, searches the dictionary of FIG. 4 for the word attribute of the recognition result, "Kobe City", and in step S3 sets the dictionary data below node 20 in the RAM area 204b.

[0028] The system next asks "Where in Kobe City?", and the user answers "Nada Ward". This user utterance is recognized using the dictionary data below node 20 set in the RAM area 204b. Repeating these operations until the dialogue ends, that is, until the destination has been fully uttered, completes the setting of the destination.
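The loop of steps S1 to S5 can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: the nested-dictionary `ADDRESS_TREE`, the `recognize` stand-in (which simply checks membership instead of matching voice patterns), and `run_dialogue` are all hypothetical names, and the dialogue is assumed to end when a leaf of the tree is reached.

```python
# Address dictionary as nested dicts: each key is a word, each value its subtree.
ADDRESS_TREE = {
    "Hyogo Prefecture": {"Kobe City": {"Nada Ward": {}}, "Akashi City": {}},
    "Osaka Prefecture": {"Sakai City": {}},
}


def recognize(utterance, active):
    """Stand-in for acoustic matching: only words in the active dictionary match."""
    return utterance if utterance in active else None


def run_dialogue(utterances):
    """Follow steps S1 to S5: recognize, find the node, load its subtree, repeat."""
    active = ADDRESS_TREE  # top level, standing in for the basic dictionary
    path = []
    for utterance in utterances:
        word = recognize(utterance, active)  # S1 (first turn) or S4 (later turns)
        if word is None:
            break  # nothing in the narrowed dictionary matched
        path.append(word)
        active = active[word]  # S2 and S3: search the node, set its subtree
        if not active:  # S5: no lower level left, so the dialogue ends
            break
    return path
```

With this sketch, `run_dialogue(["Osaka Prefecture", "Akashi City"])` stops after the prefecture, because "Akashi City" is simply not in the dictionary that is active at the second turn.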

[0029] In the above embodiment, when "Kobe City" is recognized, only the dictionary of city names within Hyogo Prefecture is set in the RAM area 204b of the voice recognition engine 204, so "Kobe City" cannot be misrecognized as a city name in another prefecture. Similarly, when "Nada Ward" is recognized, only the ward names within Kobe City are set in the RAM area 204b, which eliminates the possibility of misrecognizing it as a ward name in another city.

[0030] This prevents unnatural exchanges between the system and the user such as: system, "Where are you going?"; user, "Osaka Prefecture"; system, "Where in Osaka Prefecture?"; user, "Sakai City"; system, "Where in Akashi City?" (the user utterance "Sakai City" having been misrecognized as "Akashi City"). In addition, because the words subject to recognition are narrowed down at each recognition, the recognition speed improves greatly.

[0031] FIG. 7 is a flowchart explaining the operation of the voice dialogue system according to the second embodiment of the present invention, and FIG. 8 shows the example dialogue (dialogue example 2) used to explain this flowchart.

[0032] This embodiment is configured to cope with the case where the user answers the system's question "Where do you want to go?" with two words, "Hyogo Prefecture, Kobe City".

[0033] First, the user's initial utterance "Hyogo Prefecture, Kobe City" is recognized using, for example, the basic dictionary 204a (step S10). Next, the voice recognition engine 204 searches the recognition dictionary 214 for the word attribute "Hyogo Prefecture" in the recognition result (step S11) and extracts the dictionary below the found word attribute, that is, the dictionary below node 10 in FIG. 4 (step S12). It then determines whether any recognition result remains unsearched (step S13); if so, it returns to step S11 and searches the dictionary for the unsearched recognition result. In the case of FIG. 8, the unsearched recognition result "Kobe City" remains, so it is searched for within the dictionary below node 10, and the dictionary below node 20 is extracted.

[0034] This operation is repeated until no unsearched recognition results remain. Then, in step S14, the finally extracted dictionary data is set in the RAM area 204b of the voice recognition engine 204. After that, the next user utterance, "Nada Ward", is recognized using the dictionary data thus set (step S15).

[0035] Repeating the above operations until the dialogue ends (step S16) completes the setting of the destination in the system.
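Steps S11 to S14 amount to descending the tree once per recognized word before anything is loaded. A minimal sketch, assuming the address dictionary is held as nested dictionaries; `ADDRESS_TREE` and `narrow` are hypothetical names introduced only for this illustration.

```python
ADDRESS_TREE = {
    "Hyogo Prefecture": {"Kobe City": {"Nada Ward": {}}, "Akashi City": {}},
    "Osaka Prefecture": {"Sakai City": {}},
}


def narrow(tree, recognition_results):
    """Descend one level per recognized word (S11 to S13) and return the
    subtree that would finally be set in the RAM area (S14)."""
    subtree = tree
    for word in recognition_results:
        if word not in subtree:
            # The word is not below the node selected so far.
            raise KeyError(word)
        subtree = subtree[word]
    return subtree


# "Hyogo Prefecture, Kobe City" leaves only the wards of Kobe City active.
active = narrow(ADDRESS_TREE, ["Hyogo Prefecture", "Kobe City"])
```

Because each word is searched for only inside the subtree selected by the previous one, a combination such as "Osaka Prefecture" followed by "Kobe City" raises `KeyError` in this sketch instead of silently selecting the wrong branch.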

[0036] In this embodiment, when the user utters a plurality of word attributes, the voice recognition engine narrows down the dictionary to be set repeatedly, once per word attribute. As in the first embodiment, misrecognition during voice input of the destination is therefore prevented and a high response speed is obtained.

[0037] FIG. 9 is a diagram for explaining the third embodiment of the present invention; specifically, it shows the search target word attribute list preset in the voice recognition engine 204. This embodiment deals with the case where, for a user utterance such as "go to Osaka Prefecture", the first voice recognition yields a recognition result consisting of three word attributes: "Osaka Prefecture", "to", and "go".

[0038] In this embodiment, each recognized word attribute is first checked to see whether it is in the search target word attribute list. In the illustrated example, only "Osaka Prefecture" is in the list, so the processing shown in the flowchart of FIG. 5 is performed for the word attribute "Osaka Prefecture". This prevents the next user utterance from being recognized as a result whose word attribute is anything other than a municipality in Osaka Prefecture.
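The filtering step of the third embodiment can be sketched as a simple membership test. The contents of the list in FIG. 9 are not reproduced in this text, so the set `SEARCH_TARGET_ATTRIBUTES` below, the attribute labels attached to each word, and the function `search_targets` are assumptions made purely for illustration.

```python
# Hypothetical contents of the search target word attribute list (FIG. 9).
SEARCH_TARGET_ATTRIBUTES = {"prefecture", "city", "ward"}


def search_targets(recognized):
    """Keep only the recognized words whose attribute is in the list.

    `recognized` is a list of (word, attribute) pairs from the recognizer.
    """
    return [word for word, attribute in recognized
            if attribute in SEARCH_TARGET_ATTRIBUTES]


# "go to Osaka Prefecture" recognized as three words with assumed attributes.
result = [("Osaka Prefecture", "prefecture"), ("to", "particle"), ("go", "verb")]
targets = search_targets(result)
```

Only the words that survive this filter are then searched for in the address dictionary, so function words in a natural sentence do not disturb the narrowing.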

[0039] FIG. 10 shows an example dialogue (dialogue example 3) for explaining the configuration of the fourth embodiment of the present invention. In this embodiment, when a previously obtained recognition result contains a word attribute belonging to the search target word attribute list shown in FIG. 9, that word attribute is saved in the storage area 204d of the recognition engine 204. Afterwards, when the dictionary data for recognizing a user utterance is created, the saved result is used in addition to the previous recognition result.

[0040] That is, as shown in FIG. 10, when the user's answer "Hyogo Prefecture" to the system's question "Where are you going?" is recognized, the dictionary data below the node "Hyogo Prefecture" is set in the RAM area 204b of the voice recognition engine 204, as shown in the flowchart of FIG. 5 or FIG. 7. Because the recognized word attribute "Hyogo Prefecture" is included in the search target word attribute list shown in FIG. 9, it is saved in the storage area 204d of the voice recognition engine 204.

[0041] The system next asks "Where in Hyogo Prefecture?". The user answers "ramen shop". The word attribute "ramen shop" is included in the facility type dictionary within the basic dictionary of the voice recognition engine 204, so "ramen shop" is recognized using its waveform data.

[0042] When "ramen shop" is recognized, the voice recognition engine extracts the ramen shop dictionary contained in the facility name dictionary of the voice recognition dictionary 214, and then narrows the ramen shop dictionary down further using the word attribute "Hyogo Prefecture" saved in the storage area 204d. A dictionary of ramen shops in Hyogo Prefecture is thereby extracted and set in the RAM area 204b of the voice recognition engine 204. The next user utterance, "XX Ramen", is recognized using the dictionary data set in this way.

[0043] Each facility name word attribute has its address attached as data, so searching this data makes it possible to identify the ramen shops in a specific area.
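Combining the facility type just recognized with the saved word attribute can be sketched as a two-condition filter over facility records that carry their address. Everything below is hypothetical illustration: the `FACILITIES` records, the shop names other than the placeholder "XX Ramen", and the `facility_vocabulary` function are not taken from the patent.

```python
# Hypothetical facility name dictionary; each entry carries its address data.
FACILITIES = [
    {"name": "XX Ramen", "type": "ramen shop",
     "address": ("Hyogo Prefecture", "Kobe City")},
    {"name": "YY Ramen", "type": "ramen shop",
     "address": ("Osaka Prefecture", "Sakai City")},
    {"name": "ZZ Department Store", "type": "department store",
     "address": ("Hyogo Prefecture", "Kobe City")},
]


def facility_vocabulary(facility_type, saved_attributes):
    """Names of facilities of the given type whose address contains every
    saved word attribute (the contents of the storage area)."""
    return [facility["name"] for facility in FACILITIES
            if facility["type"] == facility_type
            and all(attr in facility["address"] for attr in saved_attributes)]


# "ramen shop" was just recognized; "Hyogo Prefecture" was saved earlier.
active_names = facility_vocabulary("ramen shop", ["Hyogo Prefecture"])
```

Without the saved attribute, both ramen shops would remain candidates; with it, only the shop in Hyogo Prefecture is loaded for the next recognition.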

[0044] FIG. 11 shows an example dialogue (dialogue example 4) for explaining the configuration of the fifth embodiment of the present invention. In this embodiment, when a previously obtained recognition result contains a word attribute belonging to the search target word attribute list shown in FIG. 9, that word attribute is saved in the storage area 204d of the recognition engine 204. If a misrecognition later occurs in the course of the dialogue between the system and the user, the dictionary is narrowed down using the word attribute saved in the storage area 204d.

[0045] That is, as shown in FIG. 11, when the user's answer "Hyogo Prefecture" to the system's question "Where are you going?" is recognized, the dictionary data below the node "Hyogo Prefecture" is set in the RAM area 204b of the voice recognition engine 204, as shown in the flowchart of FIG. 5 or FIG. 7. Because the recognized word attribute "Hyogo Prefecture" is included in the search target word attribute list shown in FIG. 9, it is saved in the storage area 204d of the voice recognition engine 204.

[0046] The system next asks "Where in Hyogo Prefecture?". The user answers "Kobe City", but the system misrecognizes this as "Akashi City" and utters the next question, "Where in Akashi City?". From this question the user realizes that the previous answer was misrecognized and replies "That's wrong". When the system recognizes this utterance using the basic dictionary, it searches the storage area 204d and finds the saved word attribute "Hyogo Prefecture". It then narrows down the dictionary using this word attribute and creates the question "Where in Hyogo Prefecture?" again.

[0047] As indicated by the double-line arrow in FIG. 11, this operation returns the dialogue to the point where the misrecognition occurred. After the system asks the question again, the processing of the above embodiments is carried out, as indicated by the double-line arrow in the figure.
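The recovery behaviour of the fifth embodiment can be sketched as a small session object that keeps the saved word attributes and rebuilds the active dictionary from them when the user rejects a result. This is a sketch under assumptions: the `Session` class and its methods are hypothetical, and it assumes that the rejected result is discarded so that only correctly recognized attributes remain saved, which the text does not state explicitly.

```python
ADDRESS_TREE = {
    "Hyogo Prefecture": {"Kobe City": {"Nada Ward": {}}, "Akashi City": {}},
    "Osaka Prefecture": {"Sakai City": {}},
}


class Session:
    """Tracks saved word attributes and the currently active dictionary."""

    def __init__(self):
        self.saved = []             # stands in for the storage area 204d
        self.active = ADDRESS_TREE  # stands in for the RAM area 204b

    def accept(self, word):
        """Record a recognition result and narrow the active dictionary."""
        self.saved.append(word)
        self.active = self.active[word]

    def reject_last(self):
        """Handle "That's wrong": drop the last result, rebuild the dictionary
        from what remains saved, and return the question to ask again."""
        self.saved.pop()
        self.active = ADDRESS_TREE
        for word in self.saved:
            self.active = self.active[word]
        if self.saved:
            return f"Where in {self.saved[-1]}?"
        return "Where are you going?"


session = Session()
session.accept("Hyogo Prefecture")
session.accept("Akashi City")      # misrecognition of "Kobe City"
question = session.reject_last()   # user answered "That's wrong"
```

After the rejection, the session is back at the point where the misrecognition occurred, with the cities of Hyogo Prefecture active again.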

[0048] FIGS. 12 to 14 are diagrams for explaining the sixth embodiment of the present invention. FIG. 12 shows a dialogue example (dialogue example 5) applied to this embodiment, and FIGS. 13 and 14 conceptually show the structures of the voice recognition engine 204 and the voice recognition dictionary 214 in this embodiment.

[0049] This embodiment is characterized in that the type of dictionary to be used is set in advance for each dialogue state between the user and the system (see FIG. 12), and when each dialogue state is reached, the data of the preset dictionary type is extracted from the voice recognition dictionary and set in the RAM area 204b of the voice recognition engine 204.

[0050] For example, if a correspondence table 204e such as the one shown in FIG. 13 is prepared in advance for dialogue states 1 to 4, then in dialogue state 4 the data of the facility name dictionary is retrieved from the voice recognition dictionary 214 according to the correspondence table and set in the RAM area 204b, and the next user utterance is recognized using this dictionary.

[0051] Alternatively, in dialogue state 2, as shown in FIG. 14, the address dictionary and the facility type dictionary are retrieved from the voice recognition dictionary 214 according to the correspondence table 204e, and both are set in the RAM area 204b. As a result, even if the user's next utterance is "ramen shop", it is easily recognized.
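The table lookup of paragraphs [0050] and [0051] can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the publication: the function name, the dictionary type keys, and the word entries are hypothetical. Only dialogue states 2 and 4 are described in the text, so the entries for the other states are left out rather than guessed.

```python
# Hypothetical sketch of the state-to-dictionary lookup; names and data are illustrative.

# Toy stand-in for the voice recognition dictionary 214, keyed by dictionary type.
VOICE_RECOGNITION_DICTIONARY = {
    "address": ["Hyogo Prefecture", "Kobe City"],
    "facility_type": ["ramen shop", "gas station"],
    "facility_name": ["Example Ramen"],
}

# Stand-in for the correspondence table 204e. Only states 2 and 4 are
# described in the text; other states are omitted here.
CORRESPONDENCE_TABLE = {
    2: ["address", "facility_type"],
    4: ["facility_name"],
}


def load_dictionaries(dialogue_state):
    """Return the words to set in the RAM area 204b for the given dialogue state."""
    active = []
    for dict_type in CORRESPONDENCE_TABLE.get(dialogue_state, []):
        active.extend(VOICE_RECOGNITION_DICTIONARY[dict_type])
    return active
```

With this toy data, `load_dictionaries(2)` returns the address and facility type entries together, so an utterance such as "ramen shop" is among the candidates, while `load_dictionaries(4)` returns only the facility names.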

[0052] According to this embodiment, in each dialogue state, only the dictionaries containing the utterance patterns that the user is expected to produce are extracted from the voice recognition dictionary and set in the voice recognition engine, which reduces the load (processing amount) on the engine. As a result, the response time of voice recognition is shortened and the responsiveness of the voice dialogue improves.

[0053] As still another embodiment of the present invention, it is also possible, in a given state, to automatically select the voice recognition dictionaries that contain the word attributes still needed to accomplish the dialogue task, that is, those other than the word attributes already obtained, and to set them in the voice recognition engine. For example, in a destination-setting dialogue in which a recognition result with the attribute "prefecture name" has already been obtained, the other attributes needed to accomplish the task are "city name", "town name", "street number", and so on. Dictionaries containing these word attributes are searched for among the plurality of dictionaries, and the matching dictionaries are combined and set in the voice recognition engine.

[0054] This yields the same effect as the sixth embodiment without holding a correspondence table as that embodiment does, and a reduction in the processing amount (data amount) can be expected.
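The automatic selection of paragraphs [0053] and [0054] can be sketched as a set difference followed by a search over the available dictionaries. The names `TASK_ATTRIBUTES`, `DICTIONARIES`, and `select_dictionaries`, as well as the way attributes are grouped into dictionaries, are assumptions made for this illustration; the publication does not specify them.

```python
# Hypothetical sketch of attribute-based dictionary selection; names and data are illustrative.

# Word attributes needed to accomplish each dialogue task (attribute names from the text).
TASK_ATTRIBUTES = {
    "destination": ["prefecture name", "city name", "town name", "street number"],
}

# Available dictionaries and the word attributes each one contains (grouping assumed).
DICTIONARIES = {
    "prefectures": {"prefecture name"},
    "cities_and_towns": {"city name", "town name"},
    "street_numbers": {"street number"},
    "facility_types": {"facility type"},
}


def select_dictionaries(task, obtained):
    """Return the names of the dictionaries that cover attributes not yet obtained."""
    missing = {attr for attr in TASK_ATTRIBUTES[task] if attr not in obtained}
    return [name for name, attrs in DICTIONARIES.items() if attrs & missing]
```

For a destination-setting task in which "prefecture name" is already known, `select_dictionaries("destination", {"prefecture name"})` picks the city and town dictionary and the street number dictionary, without consulting any state table.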

[0055]

[Effects of the Invention] As described above with reference to the embodiments, in the voice dialogue system of the present invention, dictionary data related to the voice recognition results already obtained is extracted from the voice recognition dictionary and set in the voice recognition engine before the next user utterance is recognized, so the misrecognition rate of voice recognition falls. At the same time, the amount of data set in the voice recognition engine decreases, so the load on the recognition engine decreases and recognition becomes faster. As a result, the response speed of the dialogue improves.

[Brief description of drawings]

FIG. 1 is a block diagram showing the configuration of a voice dialogue system according to one embodiment of the present invention.

FIG. 2 is a diagram showing the configuration of the voice recognition engine shown in FIG. 1.

FIG. 3 is a diagram showing the configuration of the voice recognition dictionary shown in FIG. 1.

FIG. 4 is a diagram showing the configuration of an address dictionary having a tree-like hierarchical structure.

FIG. 5 is a flowchart for explaining the operation of the voice dialogue system according to the first embodiment of the present invention.

FIG. 6 is a first dialogue example used in explaining the flowchart of FIG. 5.

FIG. 7 is a flowchart for explaining the operation of the voice dialogue system according to the second embodiment of the present invention.

FIG. 8 is a second dialogue example used in explaining the flowchart of FIG. 7.

FIG. 9 is a diagram showing an example of a search target word attribute list.

FIG. 10 is a third dialogue example used in explaining the third embodiment of the present invention.

FIG. 11 is a fourth dialogue example used in explaining the fourth embodiment of the present invention.

FIG. 12 is a fifth dialogue example used in explaining the fifth embodiment of the present invention.

FIG. 13 is a diagram used in explaining the sixth embodiment of the present invention.

FIG. 14 is a diagram used in explaining the sixth embodiment of the present invention.

[Explanation of symbols]

10 ... Voice dialogue system
12 ... Display
14 ... Speaker
16 ... Microphone
18 ... Switch
20 ... Processing device
202 ... Image display engine
204 ... Voice recognition engine
206 ... Voice synthesis engine
208 ... Application program
210 ... Dialogue processing engine
212 ... Dialogue database
214 ... Voice recognition dictionary

Continuation of front page: (51) Int. Cl.⁷, identification code, FI, theme code (reference): G10L 15/18 G10L 3/00 537F 15/28 521W

Claims

1. A voice dialogue system comprising: a voice recognition engine for recognizing a user's utterance; a dialogue processing engine for creating an utterance to the user according to the recognition result; a voice synthesis engine for synthesizing the created utterance into voice; a voice recognition dictionary that stores dictionary data including voice patterns for the voice recognition; and voice input/output means, wherein the voice recognition engine searches the voice recognition dictionary for the recognition result of the previous user utterance, extracts the dictionary data related to the found recognition result, sets that data in the voice recognition engine, and performs voice recognition of the next user utterance.

2. The voice dialogue system according to claim 1, wherein, when there are a plurality of recognition results for the previous user utterance, the voice recognition engine searches the voice recognition dictionary for any of the recognition results and sets the related dictionary data in the voice recognition engine.

3. The voice dialogue system according to claim 1, wherein the voice recognition engine has a list of word attributes to be searched for in a recognition result.

4. The voice dialogue system according to claim 3, wherein, when a word attribute corresponding to the list exists in the recognition result of the previous user utterance, the voice recognition engine stores that word attribute.

5. A voice dialogue system comprising: a voice recognition engine for recognizing a user's utterance; a dialogue processing engine for creating an utterance to the user according to the recognition result; a voice synthesis engine for synthesizing the created utterance into voice; a voice recognition dictionary that stores dictionary data including voice patterns for the voice recognition; and voice input/output means, wherein the voice recognition dictionary has a plurality of types of dictionaries, and the voice recognition engine has a correspondence table between dialogue states and the types of dictionary used in those states and, when recognizing a user utterance in each dialogue state, extracts the dictionary of the corresponding type from the voice recognition dictionary according to the correspondence table and uses it.

6. The voice dialogue system according to claim 5, wherein the voice recognition engine extracts and uses a plurality of types of dictionaries from the voice recognition dictionary according to the correspondence table.

7. A voice dialogue system comprising: a voice recognition engine for recognizing a user's utterance; a dialogue processing engine for creating an utterance to the user according to the recognition result; a voice synthesis engine for synthesizing the created utterance into voice; a voice recognition dictionary that stores dictionary data including voice patterns for the voice recognition; and voice input/output means, wherein, among a plurality of word attributes needed to accomplish a predetermined dialogue task, the voice recognition engine extracts from the voice recognition dictionary the dictionary data containing the word attributes other than those already used in recognizing user utterances, and uses that data for recognizing the next user utterance.