JP2009025538A

JP2009025538A - Voice interactive device

Info

Publication number: JP2009025538A
Application number: JP2007188284A
Authority: JP
Inventors: Daisuke Saito; 大介斎藤; Minoru Togashi; 実冨樫; Takeshi Ono; 健大野; Keiko Katsuragawa; 景子桂川; Eiji Tonozuka; 英治外塚
Original assignee: Nissan Motor Co Ltd
Current assignee: Nissan Motor Co Ltd
Priority date: 2007-07-19
Filing date: 2007-07-19
Publication date: 2009-02-05

Abstract

PROBLEM TO BE SOLVED: To provide a voice interactive device capable of reducing the probability that false recognition occurs repeatedly even when words having similar phoneme sequences arise. SOLUTION: The voice interactive device includes a voice recognition part 102 which compares a speech by a user with the vocabulary of a recognition dictionary 103 to acquire at least one combination of recognition candidates as a recognition result; an understanding part 104 which generates an understanding state of the system based on the recognition result and determines the control task intended by the user from the understanding state; and an answer generation part 106 which returns an answer to the user based on the control task. The device further includes a recognition feature extraction part 109 which monitors the series of conversations leading to control task attainment, extracts a combination related to the recognition result as a recognition pattern, and extracts the control task related to the recognition pattern; and a dictionary control part 111 which prioritizes the control task based on the recognition pattern and the control task for the recognition pattern. COPYRIGHT: (C)2009,JPO&INPIT

Description

The present invention relates to a voice interaction device used in automatic voice response systems and the like.

Voice interfaces that recognize a user's utterance with speech recognition technology and operate a device according to the recognition result are known. They are applied, for example, to car navigation systems and to IVR (Interactive Voice Response) telephone answering systems. In such a voice interface, a spoken dialogue takes place mainly between the user and the system. That is, the user speaks in reply to the system's responses about the operation task (hereinafter referred to as a task) that the user wants to determine and control, and the system recognizes that speech. If the user's task can be uniquely determined from the speech recognition result, the system proceeds to execute the operation. If it cannot be uniquely determined, the system returns a response prompting the user to rephrase or to speak additional information.

Speech recognition has limited accuracy, so misrecognition is unavoidable. Recognition performance also depends heavily on the user's speaking tendencies and on the usage environment. An automobile in particular is an inherently noisy environment, so misrecognition can occur frequently depending on the environment and the user. Moreover, as in-vehicle IT equipment has become more sophisticated in recent years, the number of voice-controlled tasks keeps increasing and the range of recognition targets keeps widening, which can also induce misrecognition. That is, given that each recognition vocabulary entry registered in the recognition dictionary is a sequence of phonemes, adding entries makes it more likely that some entries have similar phoneme sequences, so that a given entry tends to be mistaken for another specific entry. When the user's speaker characteristics (speaking tendencies and habits) are added to this, the same misrecognition occurs again and again for a particular speaker.

As a way of addressing this problem, a method has been proposed that analyzes in advance the recognition error tendencies of the entire recognition vocabulary and assigns priorities to the vocabulary (see Patent Document 1). In this method, misrecognition tendencies are held in advance for the phonemes in the recognition dictionary, and when a correction operation by the user is detected, correction candidates are presented based on those tendencies. A recognition error tendency records pairs of a correct pattern and the misrecognized patterns that are easily matched to it, and is called a confusion matrix or the like. Knowing the recognition errors in advance makes it possible to correct recognition results, for example by weighting or prioritizing them according to how error-prone the vocabulary in the recognition dictionary is, and improved recognition performance can be expected.

Recognition vocabulary with similar phoneme sequences arises not only because the number of voice-controlled tasks grows, but also because vocabulary is added to handle varied phrasings, since a user does not always phrase a task in the same way. For example, consider the task "start the air conditioner" in an automotive voice interface. Phrasings that reflect the intention of a user who wants to turn on the air conditioner include "turn on the air conditioner", "please turn on the air conditioner", and "I want to turn on the air conditioner", which differ only in the sentence ending. Beyond the ending, phrasings such as "air conditioner ON", "air conditioner activate", "air conditioner start", "turn on the cooling", and "it's hot" are also conceivable. One way to handle this is to analyze the speech of a large number of users (generally called a corpus) and select frequent phrasings as recognition vocabulary. This makes it possible to handle the phrasings of the great majority of users.
JP 2004-227156 A

However, with the recognition method of Patent Document 1, which reflects recognition error tendencies, it is difficult to grasp all the error tendencies in advance for a large-scale recognition dictionary in which a very large number of vocabulary entries are registered.

The present invention has been made in view of these problems, and its object is to provide a voice interaction device that can reduce the possibility of misrecognition occurring repeatedly even when vocabulary with similar phoneme sequences arises.

To achieve this object, the voice interaction device according to the present invention comprises: understanding means that determines the task intended by the user from an understanding state generated from a combination of recognition candidates obtained by comparing the user's utterance with the vocabulary of a recognition dictionary; response means that returns a response to the user based on the task; recognition characteristic extraction means that monitors the dialogue leading to achievement of a series of tasks, extracts a combination related to the recognition result as a recognition pattern, and extracts the task related to that recognition pattern; and dictionary control means that prioritizes the task based on the recognition pattern and the task related to the recognition pattern.

According to the present invention, the possibility of misrecognition occurring repeatedly can be reduced even when vocabulary with similar phoneme sequences arises.

Voice interaction devices according to first to fourth embodiments of the present invention are described below with reference to FIGS. 1 to 26. These devices are mainly intended for operating various devices in an automobile (navigation device, audio device, air conditioner, and so on) by voice. However, the scope of the invention is not limited to this; it can be applied to any interface that operates devices by voice.

(First embodiment)
The first embodiment focuses on the case where no denial or correction is made to the response based on the recognition result of the initial utterance and a series of control tasks is finally achieved. In this case, the recognition characteristic extraction unit 109 (see FIG. 1) extracts the combination related to the recognition result as a recognition pattern, and extracts the finally achieved control task as the control task related to that pattern. The recognition characteristic extraction unit 109 then stores the recognition pattern and the control task in the recognition pattern table 110 (see FIG. 1). Thereafter, when the new recognition candidate group for a new utterance by the user is the same as a recognition pattern, regardless of order, the dictionary control unit 111 (see FIG. 1) causes the finally achieved control task to be executed with priority. Specifically, under the control of the dictionary control unit 111, the understanding unit 104 (see FIG. 1) adds a correction value to the recognition score of the new recognition candidate that is the same as the candidate with the highest recognition score in the recognition pattern (the correct vocabulary described later). The understanding unit 104 then determines whether the highest recognition score among the new recognition candidates after the addition exceeds a threshold. If it does, the understanding unit 104 generates an understanding result, which is the system's understanding state, from the new recognition candidate with the highest score, and determines the control task from that understanding result. This reduces the possibility of misrecognition occurring repeatedly.
In the first embodiment, the recognition pattern is the recognition candidate group that the recognition characteristic extraction unit 109 extracts and stores from the recognition result for the user's initial utterance, that is, from the N-best recognition candidate group. Each recognition candidate is assigned a recognition score; as described later, a likelihood or reliability measure can be used as the score. The control task related to the recognition pattern in the first embodiment is the finally achieved control task.

FIG. 1 is a block diagram showing the basic configuration of the voice interaction device according to the first embodiment of the present invention. In FIG. 1, arrow (a) indicates the user's speech, and arrow (b) indicates the system's output speech or the execution of a task based on the recognition result. As shown in FIG. 1, the voice interaction device of the first embodiment includes a voice input unit 101, a voice recognition unit 102 serving as voice recognition means, a recognition dictionary 103, an understanding unit 104 serving as understanding means, a function table 105, and a response generation unit 106 serving as response means. It further includes a response table 107, an output unit 108, a recognition characteristic extraction unit 109 serving as recognition characteristic extraction means, a recognition pattern table 110, and a dictionary control unit 111 serving as dictionary control means.
<Basic functions and implementation means>
The basic function and a concrete implementation of each unit shown in FIG. 1 are described with reference to FIG. 2, a block diagram showing the means for implementing the voice interaction device of FIG. 1. The voice input unit 101 acquires the user's initial utterance (see arrow (a) in FIG. 1) about the operation task (hereinafter referred to as a task) that the user wants to determine and control. It can be implemented, for example, by combining a microphone 201 and an AD conversion unit 202. The voice recognition unit 102 performs feature extraction on part or all of the initial utterance acquired through the voice input unit 101. It compares the features of the initial utterance with the features of each vocabulary entry registered in the recognition dictionary 103 described later, that is, each vocabulary entry to be recognized (hereinafter referred to as recognition vocabulary). The voice recognition unit 102 then performs ordinary speech recognition processing: it acquires a plurality of recognition vocabulary entries, in descending order of likelihood (the similarity of the features), as a combination of recognition candidates, that is, as the recognition result. The voice recognition unit 102 can be implemented by combining an arithmetic device 203 and a storage device 204. This combination of recognition candidates is called the N-best.
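The N-best selection described above can be sketched as follows. This is only an illustration under stated assumptions: the likelihoods are taken as already computed by some recognizer, and the names `Candidate` and `n_best` are hypothetical, not taken from the patent.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """One recognition candidate: a vocabulary entry and its recognition score."""
    text: str
    score: float


def n_best(likelihoods: dict[str, float], n: int) -> list[Candidate]:
    """Return the n vocabulary entries with the highest likelihood, best first."""
    ranked = sorted(likelihoods.items(), key=lambda item: item[1], reverse=True)
    return [Candidate(text, score) for text, score in ranked[:n]]


# Hypothetical likelihoods for three entries of an initial dictionary.
RESULT = n_best(
    {"return home": 0.31, "go to registered location": 0.58, "surrounding facility search": 0.44},
    n=2,
)
```

Here `RESULT` holds the two best-scoring entries in descending order of score, which is the form of recognition result the later units work with.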

The recognition dictionary 103 registers the recognition vocabulary used in the speech recognition processing of the voice recognition unit 102, and can be implemented with the storage device 204. FIG. 3 shows an example of the recognition dictionary 103 of the first embodiment: FIG. 3(a) is the "initial dictionary", FIG. 3(b) is the "number selection dictionary", and FIG. 3(c) is the "facility type selection dictionary". As shown in FIG. 3, the recognition dictionary 103 consists of the three dictionaries of FIGS. 3(a) to 3(c). This is an example of a dictionary configuration in which several dictionaries are used while being switched according to the understanding result. For example, when the user makes the initial utterance, the "initial dictionary" of FIG. 3(a) is active. If the understanding result of the initial utterance is "go to registered location", the system moves to the next control task, "select registered location number 1 to 5", in order to achieve the series of control tasks called destination setting (see FIG. 4). With this transition the dictionary is switched from FIG. 3(a) to FIG. 3(b), and the system waits for the user's utterance. If, on the other hand, the understanding result of the initial utterance is "surrounding facility search", the system moves to the next control task, "select the type of surrounding facility", switches the dictionary from FIG. 3(a) to FIG. 3(c), and waits for the user's utterance.
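The dictionary switching described above amounts to a small transition table. A minimal sketch, assuming the three dictionaries are identified by their names and that any understanding result without a registered transition leaves the active dictionary unchanged (an assumption; the text does not say what happens in that case). `NEXT_DICTIONARY` and `next_dictionary` are hypothetical names.

```python
INITIAL = "initial dictionary"
NUMBER = "number selection dictionary"
FACILITY = "facility type selection dictionary"

# (active dictionary, understanding result) -> dictionary to activate next.
NEXT_DICTIONARY = {
    (INITIAL, "go to registered location"): NUMBER,
    (INITIAL, "surrounding facility search"): FACILITY,
}


def next_dictionary(active: str, understanding_result: str) -> str:
    """Return the dictionary to wait on after the given understanding result."""
    return NEXT_DICTIONARY.get((active, understanding_result), active)
```

For example, `next_dictionary(INITIAL, "go to registered location")` selects the number selection dictionary, matching the switch from FIG. 3(a) to FIG. 3(b).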

In the recognition dictionary 103 of the first embodiment, each recognition vocabulary entry is registered as a single item, but an entry can also be registered as a sequence of connected words. This is generally called a network grammar. For example, the recognition vocabulary entry "return home" in FIG. 3 would be registered as the word sequence "home / house" - "garbage" - "return / go back". The garbage here is defined as vocabulary that absorbs conjunctions and interjections such as "ga", "no", and "eh". When the entry is registered in this way, phrasings such as "jitaku e kaeru", "jitaku made kaeru", "ie ni kaeru", and "ie ni modoru" (all meaning "return home") can be recognized.
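To illustrate how a network grammar covers several phrasings with one entry, the sketch below enumerates the word sequences that a three-slot grammar accepts. It is a simplification under stated assumptions: the slot contents are romanized examples chosen for illustration, and the garbage slot is modeled as a short list of optional filler words, whereas a real garbage model absorbs arbitrary filler sounds.

```python
from itertools import product

# Hypothetical slots for the entry "return home"; the empty string means
# that no filler word is spoken.
SLOTS = [
    ["jitaku", "ie"],          # "home / house"
    ["", "e", "ni", "made"],   # garbage slot (filler words)
    ["kaeru", "modoru"],       # "return / go back"
]


def expand(slots: list[list[str]]) -> list[str]:
    """Return every phrase obtained by picking one word from each slot."""
    phrases = []
    for words in product(*slots):
        phrases.append(" ".join(word for word in words if word))
    return phrases


PHRASES = expand(SLOTS)
```

With these slots the grammar accepts 2 x 4 x 2 = 16 phrasings, including "jitaku e kaeru" and "ie ni modoru", without registering each one separately.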

The understanding unit 104 generates an understanding result, which is the system's understanding state, based on the N-best obtained by speech recognition using the recognition dictionary 103. It can be implemented by combining the arithmetic device 203 and the storage device 204. The understanding result is generally generated using recognition scores. One known method uses as the recognition score the "likelihood", a measure of how closely the features resemble a pattern in the recognition dictionary 103 acoustically and linguistically, and takes the recognition candidate with the highest score as the understanding result. Another known method uses the "reliability" as the recognition score and likewise takes the candidate with the highest score as the understanding result. Here, "reliability" is a measure that reflects how many similar or competing vocabulary entries exist, that is, a measure of how far a recognition candidate can be trusted. One method of calculating the reliability is disclosed in Japanese Patent Laid-Open No. 11-85188. That method uses the target recognition dictionary and a competing dictionary that competes with it, and thus two models: the model used for speech recognition and a competing model. A likelihood ratio is calculated from the likelihoods obtained from the two models and assigned to the recognition candidate as its reliability.

Another method of calculating the reliability is disclosed in Frank Wessel, Ralf Schlüter, Klaus Macherey, Hermann Ney: "Confidence Measures for Large Vocabulary Continuous Speech Recognition", IEEE Transactions on Speech and Audio Processing, Vol. 9, No. 3, pp. 288-298, 2001. In that method the reliability is calculated from the N-best: recognition candidates are generated down to rank N using acoustic likelihood, language likelihood, and so on, and the reliability is calculated from those candidates. Words that appear in many of the recognition candidates are regarded as highly reliable. A further method is disclosed in Thomas Kemp, Thomas Schaaf: "Estimating confidence using word lattices", Proc. 5th Eurospeech, pp. 827-830, 1997. That method calculates the reliability from word posterior probabilities: the reliability of a word in a sentence is computed from the acoustic likelihood of the word, the language likelihood of the word, the forward probability, and the backward probability. Yet another method of determining the reliability is disclosed in Takehito Utsuro, Hiromitsu Nishizaki, Yasuhiro Kodama, Seiichi Nakagawa: "Estimation of high-reliability parts using common parts of outputs of multiple large vocabulary continuous speech recognition models", IEICE Transactions, D-II, Vol. J86-D-II, No. 7, pp. 974-987, 2003. That method determines the reliability using several speech recognition models: recognition is performed with two or more models, and the common part that all of the models judge to be reliable is judged to be reliable.
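The idea that words appearing in many recognition candidates are more reliable can be illustrated with a deliberately simplified sketch. It is not the algorithm of any of the cited papers, which weight hypotheses by likelihoods or posterior probabilities; here every hypothesis counts equally, which is an assumption made for brevity, and `word_reliability` is a hypothetical name.

```python
from collections import Counter


def word_reliability(hypotheses: list[str]) -> dict[str, float]:
    """Return, for each word, the fraction of hypotheses that contain it."""
    counts: Counter[str] = Counter()
    for hypothesis in hypotheses:
        # Count each word at most once per hypothesis.
        counts.update(set(hypothesis.split()))
    return {word: count / len(hypotheses) for word, count in counts.items()}


# Hypothetical 3-best list for one utterance.
SCORES = word_reliability(["go home", "go to home", "go hone"])
```

In this example "go" occurs in all three hypotheses and "home" in two of them, so "go" receives the higher value.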

The function table 105 stores the correspondence between understanding results and the functions issued by the understanding unit 104, which include dictionary switching, response output, and control task commands for controlling in-vehicle devices. It can be implemented with the storage device 204. FIG. 4 shows an example of the function table 105. The table registers, for each vocabulary entry of the initial dictionary (see FIG. 3(a)) that may be determined as the understanding result, the corresponding control task and the control content actually executed when that control task is determined. For example, when the understanding unit 104 generates the understanding result "go to registered location", it refers to the function table 105 and determines the control task to be "destination setting (registered location)". In the function table 105 of the first embodiment, control tasks are registered in the form "function name (detailed condition)".

Once that control task is determined, the understanding unit 104 issues the functions "registered location number selection response" and "dictionary switching" as the control content. As the "registered location number selection response" function, it has the response generation unit 106 output by voice a response prompting the user to select a registered location number. As the "dictionary switching" function, it has the recognition dictionary 103 switched from the initial dictionary (see FIG. 3(a)) to the number selection dictionary (see FIG. 3(b)); the switching itself is performed by the voice recognition unit 102. Similarly, when the understanding unit 104 generates the understanding result "return home", it determines the control task to be "destination setting (home)". A command to search for a route from the current location to home is then issued to the navigation device (see the function "issue current location to destination route search command" in FIG. 4). At this time the response "Searching for a route home" (see FIG. 5) is output as the response indicating completion of the control task.

The response generation unit 106 refers to the response table 107 described later based on the function issued by the understanding unit 104, generates the response to be returned to the user, and finalizes it. It can be implemented by combining the arithmetic device 203 and the storage device 204. The response table 107 is what the response generation unit 106 refers to when generating a response, and can be implemented with the storage device 204. FIG. 5 shows an example of the response table 107. As shown in FIG. 5, the table registers the functions issued by the understanding unit 104 and the response content corresponding to each function. For example, if the issued function is "registered location number selection response", the response "Please select registered location number 1 to 5" is output. In this example the response prompts a further utterance from the user, so voice output is considered appropriate. When the function "issue CD next track command" is issued, on the other hand, the function is complete and no further utterance by the user is needed, so the response "Moving to the next track" may be shown on the screen only, without voice output. The output unit 108 outputs the response generated by the response generation unit 106 to the user, and can be implemented by combining a DA conversion unit 205 and a speaker/display device 206.
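The two lookups described above, from understanding result to control task and functions, and from function to response, can be sketched with two dictionaries. The table contents below are illustrative assumptions modeled on the examples in the text; the actual contents of FIGS. 4 and 5 are not reproduced here, and `FUNCTION_TABLE`, `RESPONSE_TABLE`, and `handle` are hypothetical names.

```python
# Understanding result -> (control task, functions to issue).
FUNCTION_TABLE = {
    "go to registered location": (
        "destination setting (registered location)",
        ["registered location number selection response", "dictionary switching"],
    ),
    "return home": (
        "destination setting (home)",
        ["issue current location to destination route search command"],
    ),
}

# Function -> response content; functions without an entry produce no response.
RESPONSE_TABLE = {
    "registered location number selection response":
        "Please select registered location number 1 to 5",
    "issue current location to destination route search command":
        "Searching for a route home",
}


def handle(understanding_result: str) -> tuple[str, list[str], list[str]]:
    """Return the control task, the issued functions, and the responses to output."""
    task, functions = FUNCTION_TABLE[understanding_result]
    responses = [RESPONSE_TABLE[f] for f in functions if f in RESPONSE_TABLE]
    return task, functions, responses
```

Keeping the two tables separate mirrors the split between the understanding unit 104, which decides what to do, and the response generation unit 106, which decides what to say.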

The recognition characteristic extraction unit 109 monitors the dialogue leading to achievement of a series of control tasks, that is, the combinations of recognition candidates (recognition candidate groups) and the understanding results. It extracts the combination of recognition candidates for the initial utterance as a recognition pattern and also extracts the finally achieved control task. It can be implemented by combining the arithmetic device 203 and the storage device 204. In the first embodiment, extraction takes place when the understanding result could not be uniquely determined from the recognition result of the initial utterance, a confirmation response was therefore output, no denial or correction of that confirmation response occurred, and the series of control tasks was finally achieved. After extraction, the recognition characteristic extraction unit 109 stores in the recognition pattern table 110 the extracted control task, the recognition pattern, and, for each control task, the appearance frequency, which is how often a new recognition candidate group has been the same as the recognition pattern regardless of order. It also stores in the recognition pattern table 110, in association with the recognition candidate that has the highest recognition score in the recognition pattern, a correction value calculated from the difference between that highest recognition score and a predetermined threshold. The recognition pattern table 110 can be implemented with the storage device 204.
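A possible shape for an entry of the recognition pattern table 110 is sketched below. Several points are assumptions, since this passage does not fix them: the pattern is keyed by the unordered set of candidate texts, the correction value is the shortfall of the top score below the threshold plus a small `margin`, and the frequency simply counts how often the same pattern has been recorded. `PatternEntry` and `record_pattern` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class PatternEntry:
    control_task: str    # finally achieved control task
    top_candidate: str   # candidate with the highest score in the pattern
    correction: float    # value to add to that candidate's score later
    frequency: int = 1   # how often this pattern has been seen


def record_pattern(table, n_best, control_task, threshold, margin=0.01):
    """Store or update the entry for an N-best list of (text, score) pairs."""
    key = frozenset(text for text, _ in n_best)  # order-independent key
    top_text, top_score = max(n_best, key=lambda candidate: candidate[1])
    if key in table:
        table[key].frequency += 1
    else:
        # Assumed formula: just enough to lift the top score over the threshold.
        correction = max(0.0, threshold - top_score) + margin
        table[key] = PatternEntry(control_task, top_text, correction)
    return table[key]
```

Using a `frozenset` as the key means that two N-best lists with the same candidates in a different order map to the same entry, which is what "the same regardless of order" requires.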

The dictionary control unit 111 refers to the recognition pattern table 110 when the speech recognition unit 102 acquires a new recognition candidate group for a new user utterance. If the table contains a recognition pattern that is the same as the new recognition candidate group irrespective of order, the unit gives priority to the control task associated with that recognition pattern. Specifically, the dictionary control unit 111 has the correction value associated with the recognition pattern added to the recognition score of the new recognition candidate that is the same as the candidate with the highest recognition score in the recognition pattern, thereby prioritizing the control task for that pattern. The dictionary control unit 111 can be realized by combining the arithmetic device 203 and the storage device 204. This priority processing is preferably applied when the combination of new recognition candidates matches, irrespective of order, a recognition pattern in the recognition pattern table 110 and, in addition, the understanding result cannot be uniquely determined from the recognition scores of the new recognition candidate group. The understanding result cannot be uniquely determined when no recognition candidate with a sufficient recognition score is obtained. One example is when a recognition candidate group is obtained but none of the candidates is sufficiently similar to the uttered speech (the likelihood does not exceed a predetermined threshold). Another is when several candidates in the group are similar to one another and it cannot be determined which is correct (the confidence does not exceed a predetermined threshold).
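The lookup performed by the dictionary control unit 111 can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `pattern_key` and `find_pattern`, the dictionary-based table layout, and the use of a `frozenset` for order-independent matching are all assumptions, and the threshold of 0.70 is borrowed from the dialogue example of FIG. 6.

```python
from typing import Optional

THRESHOLD = 0.70  # assumed value, taken from the dialogue example


def pattern_key(candidates: list[str]) -> frozenset[str]:
    """Build an order-independent key for a recognition candidate group."""
    return frozenset(candidates)


def find_pattern(table: dict, n_best: list[tuple[str, float]]) -> Optional[dict]:
    """Return the stored entry only when priority processing should apply."""
    # Skip the lookup when some candidate already exceeds the threshold,
    # because the understanding result is then uniquely determined.
    if any(score > THRESHOLD for _, score in n_best):
        return None
    # Otherwise look for a stored pattern with the same candidates in any order.
    return table.get(pattern_key([word for word, _ in n_best]))
```

A set-based key makes the comparison independent of candidate order, which is the property the description requires; a sorted tuple would work equally well.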

The operation of the recognition characteristic extraction unit 109 and the dictionary control unit 111 is described below using a concrete dialogue example. This example considers the top three recognition candidates in the N-best list. Because the accuracy of the priority processing and how often it is applied depend on how many top candidates are used to analyze the recognition tendency, the number is preferably chosen in view of the recognition tendency of the system. If lower-ranked candidates are also used, the priority processing is executed only when the match extends to those lower-ranked candidates, so the accuracy of application may improve. However, recognition patterns that match down to lower-ranked candidates are likely to be rarer than patterns in which only the top candidates match, so the situations in which the processing applies are more limited.

FIG. 6 shows the storage conditions and the data to be stored in the dialogue example of the first embodiment. It shows a dialogue in which the user says “go home” and a route search to the user's home is eventually executed. As shown in FIG. 6, when the system says “How can I help you?” (step S11), the user makes the initial utterance “go home” (step U11). The response in step S11 is output, for example, when the user presses a voice operation switch (not shown; also called a PTT switch or PTA switch). The system recognizes the initial utterance and obtains the recognition candidate group “go home” (0.55), “nearby convenience store” (0.40), and “TV ON” (0.01) (step S12). The values in parentheses are recognition scores. The understanding unit 104 compares each recognition score with a threshold and, if a candidate exceeds the threshold, generates an understanding result from that candidate. The threshold is determined in advance, for example from recognition results on corpus data, so as to maximize the recognition accuracy of the system. If no candidate exceeds the threshold, the system concludes that no vocabulary item is sufficiently reliable and issues a confirmation or a question. In this example the threshold is 0.70, so the understanding unit 104 determines that no candidate exceeds the threshold. Adjusting the threshold changes the response tendency of the system. A low threshold produces a dialogue that commits to a single interpretation readily, accepting the risk of misrecognition; a high threshold produces a cautious dialogue that commits only when the result is sufficiently reliable. The threshold can also be varied, for example, according to user attributes.
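The threshold comparison made by the understanding unit 104 can be sketched as below. The function name `understand` and the tuple-based return value are hypothetical; the scores and the threshold of 0.70 are those of the dialogue example, and the choice of the top-scoring candidate for the confirmation follows the continuation of the example in FIG. 6.

```python
def understand(n_best, threshold=0.70):
    """Decide between executing a task and asking for confirmation.

    n_best is a list of (word, recognition_score) pairs.
    """
    best_word, best_score = max(n_best, key=lambda c: c[1])
    if best_score > threshold:
        # A sufficiently reliable candidate exists: commit to it.
        return ("execute", best_word)
    # No candidate is reliable enough: confirm the top candidate with the user.
    return ("confirm", best_word)


fig6 = [("go home", 0.55), ("nearby convenience store", 0.40), ("TV ON", 0.01)]
decision = understand(fig6)
```

Lowering `threshold` makes the same input commit immediately, which illustrates the trade-off between an assertive and a cautious dialogue described above.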

Because no candidate exceeds the threshold, the system outputs a confirmation response using the candidate with the highest recognition score. In FIG. 6 this is shown as confirmation (route search, destination = home). The response generation unit 106 therefore outputs the spoken response “Search for a route home?” The user answers “Yes” to this confirmation (step U12). The system recognizes the utterance, and the understanding unit 104 determines that a candidate exceeding the threshold, “Yes”, has been found (step S13). As the understanding result, it then issues the command to search for a route home (execution (route search, destination = home) in FIG. 6). The response generation unit 106 outputs the spoken response “Searching for a route home.” The understanding results confirmation (route search, destination = home) and execution (route search, destination = home) indicate how the dialogue proceeds next. Confirmation (XX) means a transition to a phase that checks whether the control task command in parentheses may be executed, and execution (XX) means a transition to execution of the control task command in parentheses.

The recognition characteristic extraction unit 109 monitors the dialogue leading to achievement of the control task and extracts the data to be stored, shown in the third column of FIG. 6. In this dialogue example, it determines whether the following storage conditions are satisfied:
- The understanding result for the initial utterance was not uniquely determined (established from part (a) of FIG. 6).
- No denial or correction was detected in response to the confirmation (established from part (b) of FIG. 6).
- The control task was finally determined (established from part (c) of FIG. 6).
The recognition characteristic extraction unit 109 determines that these conditions are satisfied and extracts the recognition pattern for this case, namely the three recognition candidates, together with the finally achieved control task (route search, destination = home). It then stores the extracted control task and recognition pattern in the recognition pattern table 110, along with the bonus value and appearance frequency described later.
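The three storage conditions can be expressed as a single predicate. The sketch below is illustrative only; the `DialogueLog` record and its field names are assumptions introduced here and do not appear in the description.

```python
from dataclasses import dataclass


@dataclass
class DialogueLog:
    initial_uniquely_determined: bool  # negated in condition (a)
    denial_or_correction: bool         # negated in condition (b)
    task_achieved: bool                # condition (c)


def should_store(log: DialogueLog) -> bool:
    """Return True only when all three storage conditions hold."""
    return (
        not log.initial_uniquely_determined
        and not log.denial_or_correction
        and log.task_achieved
    )


fig6_log = DialogueLog(
    initial_uniquely_determined=False,
    denial_or_correction=False,
    task_achieved=True,
)
```

For the dialogue of FIG. 6 the predicate holds, so the recognition pattern and the achieved control task would be stored.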

FIG. 7 shows an example of the contents of the recognition pattern table 110 shown in FIG. 1. The control task and recognition pattern extracted from the dialogue example of FIG. 6 are stored in row No. 1. Of the three candidates in that recognition pattern, the top-ranked candidate (“go home” in FIG. 7) is the correct vocabulary item. Here, the correct vocabulary item is the recognition candidate with the highest recognition score in the pattern, in the case where the understanding result was generated from that candidate and the control task was achieved. As shown in FIG. 7, the recognition pattern table 110 stores the control task, the recognition pattern, the bonus value (the correction value described above), and the appearance frequency. The bonus value is the amount added to the recognition score of the new recognition candidate that is the same as the correct vocabulary item when the new recognition candidate group for a new utterance matches the recognition pattern irrespective of order. It is calculated from the difference between the highest recognition score in the pattern, that is, the recognition score of the correct vocabulary item, and the predetermined threshold (the shortfall that the correct vocabulary item must make up to exceed the threshold).

Specifically, in the dialogue example of FIG. 6 the score of “go home” is 0.55, which is 0.15 short of the threshold of 0.70. Adding a bonus value of 0.15 or more therefore makes it more likely that the understanding result can be uniquely determined when a later new recognition candidate group matches the recognition pattern irrespective of order. Even when it matches, however, the recognition score of “go home” may be lower than 0.55. In that case, adding a bonus value of 0.15 to the score of the new recognition candidate “go home” does not bring it above the threshold of 0.70, so the understanding result still cannot be uniquely determined. For this reason a small margin is added to the bonus value. That is, the bonus value is determined by a calculation such as bonus value = threshold - recognition score + α, where α is the margin. For example, with α = 0.10,
Bonus value = 0.70 - 0.55 + 0.10 = 0.25.
With continued use, new recognition candidate groups that match the recognition pattern for the control task, irrespective of order, are likely to recur. Each time such a group is obtained, the bonus value is preferably updated based on the latest recognition score. Possible update methods include determining the value from the latest recognition score alone, averaging the past and latest bonus values, or comparing the past and latest bonus values and taking the larger of the two.
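The bonus calculation and the three update methods mentioned above can be sketched as follows. The function names and the `method` strings are hypothetical; the formula and the default margin of 0.10 come from the description.

```python
def bonus_value(threshold: float, score: float, margin: float = 0.10) -> float:
    """Bonus value = threshold - recognition score + margin (alpha)."""
    return threshold - score + margin


def update_bonus(old: float, new: float, method: str = "latest") -> float:
    """Combine the stored bonus value with the one from the latest utterance."""
    if method == "latest":
        return new                # use only the latest recognition score
    if method == "average":
        return (old + new) / 2    # mean of past and latest bonus values
    if method == "max":
        return max(old, new)      # keep the larger of the two
    raise ValueError(f"unknown update method: {method}")


first = bonus_value(0.70, 0.55)   # the FIG. 6 utterance
later = bonus_value(0.70, 0.48)   # a later utterance with a lower score
```

Taking the maximum is the most conservative choice in the sense that the bonus never shrinks, whereas the average smooths out a single unusually low score.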

As shown in FIG. 7, the recognition pattern table 110 also stores, for each control task, the appearance frequency, which is how often a new recognition candidate group has matched the recognition pattern irrespective of order. In FIG. 7, for example, a new recognition result matching the recognition pattern in row No. 1 (the pattern extracted in the dialogue example of FIG. 6) has occurred five times in the past, and one matching the pattern in row No. 2 has occurred once. The appearance frequency can be used, for example, to add the bonus value only when the frequency exceeds a predetermined value. Control may also be performed so that the more frequently a recognition pattern appears, the larger the bonus margin (α in the formula above). This makes it more likely, as the user continues to use the system, that the task the user intends is achieved correctly.
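Both uses of the appearance frequency can be sketched in a few lines. The description does not give concrete values, so the minimum frequency, the base margin, the step, and the cap below are all hypothetical, as are the function names.

```python
def bonus_if_frequent(bonus: float, frequency: int, min_frequency: int = 3) -> float:
    """Apply the bonus only when the pattern has appeared more than min_frequency times."""
    return bonus if frequency > min_frequency else 0.0


def margin_for(frequency: int, base: float = 0.10, step: float = 0.01, cap: float = 0.20) -> float:
    """Grow the margin (alpha) with the appearance frequency, up to a cap."""
    return min(base + step * frequency, cap)


row1 = bonus_if_frequent(0.25, frequency=5)  # pattern No. 1 in FIG. 7
row2 = bonus_if_frequent(0.25, frequency=1)  # pattern No. 2 in FIG. 7
```

Capping the margin is a design choice made here so that a very frequent pattern cannot push an implausibly low score over the threshold.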

FIG. 8 shows a dialogue example in which the bonus value of FIG. 7 is applied. In this example, after the recognition characteristic extraction unit 109 has stored the recognition pattern, control task, bonus value, and appearance frequency in the recognition pattern table 110, a new utterance yields a new recognition candidate group that matches the recognition pattern irrespective of order. As shown in FIG. 8, the user makes the new utterance “go home” (step U14). The system recognizes it and obtains the new recognition candidate group “nearby convenience store” (0.50), “go home” (0.48), and “TV ON” (0.02) (step S14). The understanding unit 104 compares the recognition scores with the threshold of 0.70 and determines that no candidate has a score exceeding it. The dictionary control unit 111 therefore refers to the recognition pattern table 110 and checks whether it contains a recognition pattern matching the new recognition candidate group irrespective of order (step S14'). Because such a pattern exists, the dictionary control unit 111 looks up its bonus value in the recognition pattern table 110.

Under the direction of the dictionary control unit 111, the understanding unit 104 adds the bonus value of 0.25 to the recognition score of the new recognition candidate “go home”, which is the same as the correct vocabulary item. The score of “go home” then becomes 0.48 + 0.25 = 0.73 and exceeds the threshold of 0.70, so the understanding unit 104 issues, as the understanding result, the command to search for a route home (execution (route search, destination = home) in FIG. 8). The response generation unit 106 outputs the response “Searching for a route home.” In the otherwise similar dialogue of FIG. 6, a confirmation response (step S12 in FIG. 6) was required for the recognition candidate group of the initial utterance; with the present invention that confirmation is omitted, which substantially shortens the time needed to achieve the task. Note also that in the new recognition candidate group, “nearby convenience store” has the highest recognition score by a small margin. Without the correction, the understanding unit 104 would select “nearby convenience store”, the candidate with the highest score, as the understanding result, which would be a misrecognition. With the present invention, when a recognition pattern matches the new recognition candidate group irrespective of order, the bonus value is added to the score of the new candidate “go home”, which is the same as the correct vocabulary item, so the likelihood of misrecognition is reduced. The likelihood that the same misrecognition occurs repeatedly is therefore also reduced.
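The rescoring in the FIG. 8 example can be reproduced with the short sketch below. The function name `apply_bonus` and the list-of-pairs representation are assumptions; the scores, the bonus value of 0.25, and the threshold of 0.70 are taken from the description.

```python
def apply_bonus(n_best, correct_word, bonus):
    """Add the bonus to the correct vocabulary item and re-rank by score."""
    boosted = [
        (word, score + bonus if word == correct_word else score)
        for word, score in n_best
    ]
    return sorted(boosted, key=lambda c: c[1], reverse=True)


THRESHOLD = 0.70
fig8 = [("nearby convenience store", 0.50), ("go home", 0.48), ("TV ON", 0.02)]
rescored = apply_bonus(fig8, correct_word="go home", bonus=0.25)
top_word, top_score = rescored[0]
```

After rescoring, the candidate for the user's intended task both moves to the top and clears the threshold, so no confirmation turn is needed.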

On the other hand, even when the recognition pattern table 110 contains a recognition pattern matching the new recognition candidate group irrespective of order, the possibility that the corresponding control task is different must be considered. In the dialogue example of FIG. 8, this is the case where, after step S14', the user denies or corrects the result and the control task finally determined corresponds to one of the three candidates in the pattern other than “go home”, for example “nearby convenience store”. The control task, bonus value, and appearance frequency stored for that recognition pattern in the recognition pattern table 110 then become inconsistent. When such an inconsistency arises, it is preferable either to reset the data or to update it with the latest recognition pattern, corresponding control task, bonus value, and appearance frequency.
<Specific control processing flow using the above configuration>
The specific control processing flow of the voice interaction device according to the first embodiment with the above configuration is described with reference to FIG. 9, which is a flowchart of that control processing. First, the voice input unit 101 acquires the user's uttered speech. The speech recognition unit 102 performs speech recognition on the speech acquired via the voice input unit 101 and obtains an N-best list (step S101). Next, the understanding unit 104 calculates a recognition score for each recognition candidate based on the N-best list, and determines whether any of the calculated scores exceeds the threshold (step S102). If so (step S102: Yes), the understanding result can be uniquely determined, and the process proceeds to step S113. If not (step S102: No), the understanding result cannot be uniquely determined, so the dictionary control unit 111 refers to the recognition pattern table 110 (step S103) and determines whether it contains a recognition pattern that matches the N-best list irrespective of order (step S104).

If the dictionary control unit 111 determines that the recognition pattern table 110 contains no recognition pattern matching the N-best list irrespective of order (step S104: No), the process proceeds to step S108. If it determines that such a pattern exists (step S104: Yes), the recognition characteristic extraction unit 109 increments the appearance frequency of the matching pattern by 1 and stores it in the recognition pattern table 110 (step S105). Next, the dictionary control unit 111 obtains from the recognition pattern table 110 the bonus value associated with the correct vocabulary item of the matching pattern (step S106). The dictionary control unit 111 then has that bonus value added to the recognition score of the new recognition candidate that is the same as the correct vocabulary item; that is, the understanding unit 104 adds the bonus value to the score of that candidate. The understanding unit 104 then determines whether any of the recalculated recognition scores exceeds the threshold (step S107).

If the understanding unit 104 determines that a recognition score exceeds the threshold (step S107: Yes), the process proceeds to step S113. If it determines that none does (step S107: No), the recognition characteristic extraction unit 109 sets the recognition characteristic extraction flag to ON (step S108). As described later, this causes the unit to extract the N-best list as a recognition pattern once the control task has finally been achieved. Next, because the understanding result cannot be uniquely determined, the understanding unit 104 has the response generation unit 106 output a spoken confirmation response to the user (step S109). In the first embodiment, the system first asks whether the candidate with the highest recognition score is correct. For example, if that candidate is “go home”, it outputs a response such as “Shall I search for a route home?” and waits for the user's reaction. Next, the recognition characteristic extraction unit 109 determines, from the recognition result of the user's reply to the confirmation, whether the user has denied it (step S110). Besides denials, the system may also recognize a directly corrected utterance, in which case the response requesting a corrected re-utterance in step S112, described later, can be omitted.

If the recognition characteristic extraction unit 109 determines that the confirmation response was not denied (step S110: No), the process proceeds to step S113. If it determines that the confirmation response was denied (step S110: Yes), it sets the negation flag to ON (step S111). At the same time, the unit preferably attaches a cancellation flag to the denied candidate, that is, the candidate that currently has the highest recognition score. Then, even if the recognition result for the corrected utterance following the denial includes the flagged candidate with a high recognition score, a predetermined value can be subtracted from that score so that the flagged candidate does not rank near the top of the recognition result. Alternatively, the vocabulary item matching the flagged candidate may be excluded from the recognition dictionary 103, so that the flagged candidate does not appear in the recognition result at all.
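The two ways of handling a cancelled candidate can be sketched as follows. The penalty of 0.30, the function names, and the set-based dictionary are assumptions for illustration; the description only speaks of “a predetermined value” and of excluding the vocabulary item from the recognition dictionary 103.

```python
def penalize_cancelled(n_best, cancelled, penalty=0.30):
    """Subtract a penalty from cancelled candidates and re-rank by score."""
    adjusted = [
        (word, max(score - penalty, 0.0) if word in cancelled else score)
        for word, score in n_best
    ]
    return sorted(adjusted, key=lambda c: c[1], reverse=True)


def exclude_from_dictionary(dictionary: set, cancelled: set) -> set:
    """Return the recognition dictionary without the cancelled vocabulary items."""
    return dictionary - cancelled


cancelled = {"nearby convenience store"}
retry = [("nearby convenience store", 0.60), ("go home", 0.35)]
reranked = penalize_cancelled(retry, cancelled)
```

The penalty keeps the denied item available as a low-ranked fallback, whereas exclusion removes it from consideration for the remainder of the dialogue.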

Next, in response to the user's denial, the understanding unit 104 issues to the response generation unit 106 the function for a response prompting a corrected re-utterance (step S112). Based on that function, the response generation unit 106 generates and outputs a spoken response prompting the user to restate the request, for example “I'm sorry. Please say the command again.” The system then waits for the user's corrected re-utterance. Alternatively, the candidate with the next-highest recognition score may be assumed to be the understanding result and the process may return to the confirmation response of step S109. In that case the process may proceed to step S112 when, for example, both the first- and second-ranked candidates have been denied.

When the understanding unit 104 determines in the control process of step S102 or S107 that a recognition score exceeds the threshold, the understanding unit 104 determines the understanding result (step S113). Likewise, when the recognition characteristic extraction unit 109 determines in the control process of step S110 that the confirmation response was not denied, the understanding unit 104 determines the understanding result (step S113). The understanding unit 104 then refers to the function table 105, determines the control task corresponding to the determined understanding result, and issues the function of that control task (step S113). For example, as shown in FIG. 4, if the understanding result is "go home", the control task is "destination setting (home)", and the function of the control task is the control task command "issue current location to destination route search command". When this control task command is issued, the navigation device starts the route search process from the current location to the home. Next, the response generation unit 106 generates and outputs a response based on the function of the control task (step S114). The correspondence between control task functions and response contents is given by the response table 107 shown in FIG. 5. For example, if the function of the control task is the control task command "issue current location to destination route search command", the response content is "Searching for a route home."
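The two table lookups in steps S113 and S114 can be illustrated with a small sketch. The dictionaries below are hypothetical stand-ins for the function table 105 and the response table 107; their layout and the helper name `decide_and_respond` are assumptions, and only the single "go home" entry comes from the text.

```python
# Hypothetical stand-in for function table 105:
# understanding result -> (control task, function of the control task).
FUNCTION_TABLE = {
    "go home": (
        "destination setting (home)",
        "issue current location to destination route search command",
    ),
}

# Hypothetical stand-in for response table 107:
# (function, control task) -> response content.
RESPONSE_TABLE = {
    (
        "issue current location to destination route search command",
        "destination setting (home)",
    ): "Searching for a route home.",
}


def decide_and_respond(understanding_result):
    """Return (control_task, function, response) for an understanding result."""
    control_task, function = FUNCTION_TABLE[understanding_result]  # step S113
    response = RESPONSE_TABLE[(function, control_task)]            # step S114
    return control_task, function, response


task, function, response = decide_and_respond("go home")
```

Keying the response table on both the function and the control task is a design choice made here so that the same command can yield different wording for different destinations.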

Next, the recognition characteristic extraction unit 109 determines whether the recognition characteristic extraction flag is ON (step S115). If it determines that the flag is not ON (step S115: No), there is no need to extract the finally achieved control task and the N-best list. The recognition characteristic extraction unit 109 therefore turns OFF all flags (the recognition characteristic extraction flag, the negative flag, and the cancellation flag, if present) and ends this control process. If it determines that the recognition characteristic extraction flag is ON (step S115: Yes), the recognition characteristic extraction unit 109 determines whether the negative flag is OFF, that is, whether the user made a denial during the dialogue (step S116). If it determines that the negative flag is ON, that is, that a denial occurred (step S116: No), there is no need to extract the control task and the N-best list. The recognition characteristic extraction unit 109 therefore turns OFF all flags (the recognition characteristic extraction flag, the negative flag, and the cancellation flag, if present) and ends this control process.

If, on the other hand, the recognition characteristic extraction unit 109 determines that the negative flag is OFF, that is, that no denial occurred (step S116: Yes), it extracts the N-best list as the recognition pattern (step S117). It also extracts the control task. The recognition characteristic extraction unit 109 then calculates a bonus value from the difference between the maximum recognition score in the extracted recognition pattern and the threshold (step S118). If the recognition pattern table 110 already stores a recognition pattern identical to the N-best list regardless of order, together with the finally achieved control task and a bonus value, the stored bonus value is updated, for example, by taking the maximum of the stored and new bonus values. The details are as described above. Next, the extracted recognition pattern, the control task, and the calculated bonus value are stored in the recognition pattern table 110 (step S119). If the recognition pattern table 110 does not yet store the recognition pattern for this control task, the pattern is stored as a new entry with an appearance frequency of 1. The recognition characteristic extraction unit 109 then turns OFF all flags (the recognition characteristic extraction flag, the negative flag, and the cancellation flag, if present) and ends this control process.
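Steps S115 to S119 could be sketched roughly as below. Several points are assumptions rather than statements from the source: the bonus is taken to be `threshold - top_score` floored at zero (the text only says it is calculated from the difference), the appearance frequency is incremented when an entry already exists, and the names `Entry` and `update_pattern_table` are hypothetical. A `frozenset` key is used so that patterns match regardless of candidate order.

```python
from dataclasses import dataclass

THRESHOLD = 0.70  # threshold value used as an example in the first embodiment


@dataclass
class Entry:
    bonus: float
    frequency: int


def update_pattern_table(table, n_best, control_task,
                         extraction_flag, negative_flag, threshold=THRESHOLD):
    """Store the N-best list as a recognition pattern; return True if stored."""
    # S115 / S116: nothing is stored unless extraction is ON and no denial occurred.
    if not extraction_flag or negative_flag:
        return False
    pattern = frozenset(word for word, _ in n_best)   # S117, order-insensitive
    top_score = max(score for _, score in n_best)
    bonus = max(0.0, threshold - top_score)           # S118 (assumed formula)
    key = (pattern, control_task)
    entry = table.get(key)
    if entry is None:                                 # S119: new entry, frequency 1
        table[key] = Entry(bonus=bonus, frequency=1)
    else:                                             # keep the larger bonus
        entry.bonus = max(entry.bonus, bonus)
        entry.frequency += 1                          # assumed update rule
    return True


table = {}
first = [("go home", 0.60), ("nearby convenience store", 0.30), ("TV ON", 0.10)]
second = [("TV ON", 0.05), ("go home", 0.55), ("nearby convenience store", 0.40)]
update_pattern_table(table, first, "destination setting (home)", True, False)
update_pattern_table(table, second, "destination setting (home)", True, False)
```

After both calls the table holds a single entry, because the two N-best lists contain the same candidates in a different order.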

The tendency for the recognition patterns described above to occur is thought to depend strongly on the user and the utterance environment. That is, utterance A by a certain user a tends to produce recognition pattern α, or utterance B in a certain noise environment b tends to produce recognition pattern β. In automobiles in particular, the users are very limited, mainly the driver, and the noise environment can be identified to some extent from the driving environment of the vehicle, so the voice interaction device according to the first embodiment is expected to be especially effective when used in an automobile. Accordingly, when the speaker can be identified from user identification information based on the user's voice features or obtained from a camera or other personal authentication device, it is desirable to store the user identification information in association with the recognition pattern and the finally achieved control task. In this case, when the user identification information obtained for a new utterance matches the stored user identification information, the dictionary control unit 111 can have a predetermined value set for that user additionally added to the recognition score of the new recognition candidate of the new utterance that matches the correct vocabulary. This enables speech recognition processing suited to the characteristics of the user's utterances and further reduces the possibility of misrecognition. Improved recognition performance can therefore be expected.

Similarly, a mechanism may be added that classifies the noise environment, based on the driving state and the like, into a plurality of segments and determines which segment the current noise environment belongs to. In this case, it is desirable to store noise environment information indicating the segment to which the current noise environment belongs, in association with the recognition pattern and the finally achieved control task. Then, when the noise environment information for a new utterance matches the stored noise environment information, the dictionary control unit 111 can have a predetermined value set for that noise environment additionally added to the recognition score of the new recognition candidate of the new utterance that matches the correct vocabulary. This enables speech recognition processing suited to the characteristics of the noise environment and further reduces the possibility of misrecognition. Improved recognition performance can therefore be expected.
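The user-dependent and noise-dependent additions described above might be combined as in the following sketch. The function name `context_extra`, the identifiers such as `"driver_a"` and `"highway"`, and the value 0.05 for each predetermined value are purely illustrative assumptions; the source does not specify how users or noise segments are encoded or how large the added values are.

```python
def context_extra(stored_user, stored_noise, current_user, current_noise,
                  user_value=0.05, noise_value=0.05):
    """Return the extra score to add for matching user and noise context.

    stored_*: context saved with the recognition pattern (None if not stored).
    current_*: context observed for the new utterance.
    user_value / noise_value: hypothetical predetermined values.
    """
    extra = 0.0
    if stored_user is not None and stored_user == current_user:
        extra += user_value
    if stored_noise is not None and stored_noise == current_noise:
        extra += noise_value
    return extra


same_user_only = context_extra("driver_a", "highway", "driver_a", "city")
both_match = context_extra("driver_a", "highway", "driver_a", "highway")
no_match = context_extra("driver_a", "highway", "driver_b", "city")
```

The returned value would be added on top of the stored bonus value for the candidate that matches the correct vocabulary.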

As described above, the voice interaction device according to the first embodiment includes the recognition characteristic extraction unit 109, which monitors the dialogue leading to the achievement of a series of control tasks, extracts the recognition result of the initial utterance as a recognition pattern, and extracts the finally achieved control task. It also includes the dictionary control unit 111, which prioritizes that control task based on the recognition pattern and the control task. Further, the dictionary control unit 111 prioritizes the control task when the group of new recognition candidates for a new utterance is identical to the recognition pattern regardless of order. Consequently, even when words with similar phoneme sequences occur, if a recognition result identical, regardless of order, to one obtained for a past utterance of the user is obtained, the control task achieved for the past utterance is prioritized, which reduces the possibility that misrecognition occurs repeatedly.

In the first embodiment, when there is no denial or correction of the response that the response generation unit 106 produces from the recognition result of the initial utterance and the series of control tasks is finally achieved, the recognition characteristic extraction unit 109 stores the recognition pattern and the control task in association with each other in the recognition pattern table 110. In addition, the recognition characteristic extraction unit 109 stores in the recognition pattern table 110 the bonus value, calculated from the difference between the maximum recognition score of the recognition pattern and the threshold, in association with the recognition candidate having the maximum recognition score, that is, the correct vocabulary. As a result, when the group of new recognition candidates is identical to the recognition pattern regardless of order, the dictionary control unit 111 can have the bonus value added to the recognition score of the new recognition candidate that matches the correct vocabulary, and can thereby prioritize the control task.
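The order-independent match and the bonus addition summarized above can be sketched as follows. The helper name `rescore_with_bonus`, the example scores, and the bonus of 0.10 are assumptions for illustration; only the 0.70 threshold appears in the source as an example value.

```python
THRESHOLD = 0.70  # example threshold from the first embodiment


def rescore_with_bonus(new_n_best, stored_pattern, correct_word, bonus):
    """Add the bonus to the correct word if the candidate sets match.

    The comparison uses sets, so the order of candidates is ignored.
    When the sets differ, the N-best list is returned unchanged.
    """
    if {word for word, _ in new_n_best} != set(stored_pattern):
        return list(new_n_best)
    return [(w, s + bonus) if w == correct_word else (w, s)
            for w, s in new_n_best]


stored_pattern = {"go home", "nearby convenience store", "TV ON"}
new_n_best = [("nearby convenience store", 0.30), ("go home", 0.62), ("TV ON", 0.08)]

rescored = rescore_with_bonus(new_n_best, stored_pattern, "go home", bonus=0.10)
top_word, top_score = max(rescored, key=lambda c: c[1])
needs_confirmation = top_score <= THRESHOLD
```

In this example the boosted score of "go home" exceeds the threshold, so the system could determine the understanding result without a confirmation response.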

(Second Embodiment)
The second embodiment focuses on the case in which the response based on the recognition result of the n-th utterance is denied or corrected, the responses based on the recognition results of the (n+1)-th and subsequent utterances are neither denied nor corrected, and the series of control tasks is finally achieved. In this case, the recognition characteristic extraction unit 209 (see FIG. 10) extracts the combination of recognition results as a recognition pattern and extracts the control task related to that recognition pattern. The recognition characteristic extraction unit 209 then stores the recognition pattern and the control task in association with each other in the recognition pattern table 110 (see FIG. 10). In the second embodiment, the recognition pattern refers to the group of recognition candidates that the recognition characteristic extraction unit 209 extracts and stores from the recognition result of the user's n-th utterance, that is, from the N-best list of recognition candidates. As in the first embodiment, each recognition candidate is assigned a recognition score, for which a measure of likelihood or confidence can be used. The control task related to the recognition pattern in the second embodiment is, among the tasks that were neither denied nor corrected, a control task related to a recognition candidate included in the recognition pattern.

Thereafter, when the group of new recognition candidates for a new utterance by the user is identical to the recognition pattern regardless of order, the dictionary control unit 211 (see FIG. 10) causes the control task related to the recognition pattern to be executed with priority. Specifically, under the control of the dictionary control unit 211, the understanding unit 104 (see FIG. 10) adds a correction value to the recognition score of the new recognition candidate that matches the recognition candidate associated with the control task related to the recognition pattern. The understanding unit 104 then determines whether the maximum recognition score after the addition exceeds the threshold. If so, the understanding unit 104 generates an understanding result, which is the system's understanding state, from the new recognition candidate having the maximum recognition score, and determines the control task from that understanding result. This reduces the possibility that misrecognition occurs repeatedly.

The voice interaction device according to the second embodiment is described below, focusing on its differences from the voice interaction device according to the first embodiment. Components that are the same as in the first embodiment are given the same reference numerals, and their description is omitted. FIG. 10 is a block diagram showing the basic configuration of the voice interaction device according to the second embodiment of the present invention. As shown in FIG. 10, the configuration of the voice interaction device according to the second embodiment is basically the same as that of the first embodiment. The only differences are the recognition characteristic extraction unit 209, which serves as the recognition characteristic extraction means, and the dictionary control unit 211, which serves as the dictionary control means. Therefore, only the recognition characteristic extraction unit 209 and the dictionary control unit 211 are described.

The recognition characteristic extraction unit 209 of the second embodiment monitors the dialogue leading to the achievement of a series of control tasks, that is, the recognition candidate groups and the understanding results. It extracts the combination 1001 of recognition candidates for the n-th utterance (see FIG. 11) as the recognition pattern 1002 (see FIG. 11), and extracts the control task 1006 (see FIG. 11) related to the recognition pattern 1002. In the second embodiment, when a denial or correction occurs during the dialogue, the recognition result 1001 immediately preceding the denial or correction is assumed to contain a misrecognition. That is, the recognition candidate "nearby convenience store" (see FIG. 11), which has the maximum recognition score in the recognition result 1001, is assumed to be a misrecognition. Thereafter, when the responses based on the recognition results of the (n+1)-th and subsequent utterances are neither denied nor corrected, the series of control tasks is finally achieved, and the function of the control task 1006 is issued, the recognition characteristic extraction unit 209 determines whether the recognition pattern 1002 contains a recognition candidate corresponding to a control task in the series. Specifically, it determines whether the recognition candidate "destination setting" (see FIG. 11), which corresponds to the control task 1004 (see FIG. 11) that was neither denied nor corrected, is included in the recognition pattern 1002. It likewise determines whether a recognition candidate corresponding to the control task 1006, which was neither denied nor corrected, is included in the recognition pattern 1002. When the recognition characteristic extraction unit 209 determines that such a recognition word is included in the recognition pattern 1002, it extracts the recognition pattern 1002 together with the control task 1006 related to the recognition pattern 1002.

The recognition pattern 1002 and the control task 1006 are then stored in association with each other in the recognition pattern table 110. The recognition characteristic extraction unit 209 also stores in the recognition pattern table 110, for each control task related to a recognition pattern, the appearance frequency, that is, how often a group of new recognition candidates has been identical to the recognition pattern regardless of order. In addition, the recognition characteristic extraction unit 209 calculates a bonus value, which is a correction value for the recognition candidate "go home" (see FIG. 11) in the recognition pattern 1002, and stores it in the recognition pattern table 110. The bonus value is calculated based on the difference between the recognition score 0.40 (see FIG. 11) of the recognition candidate "go home", which corresponds to the control task 1006 related to the recognition pattern 1002, and the maximum recognition score 0.46 (see FIG. 11) of the recognition pattern 1002. The recognition pattern table 110 of the second embodiment therefore has the same format as in the first embodiment.
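As a worked illustration of the bonus value for the second embodiment, the sketch below uses the scores 0.40 and 0.46 from FIG. 11. The source states only that the bonus is calculated based on the difference between these scores; the specific formula (gap plus a small margin) and the margin of 0.05 are assumptions, as is the function name `bonus_from_gap`.

```python
def bonus_from_gap(correct_score, top_score, margin=0.05):
    """Bonus that lifts the correct candidate above the misrecognized top one.

    Assumed formula: the score gap (floored at zero) plus a hypothetical margin.
    """
    return max(0.0, top_score - correct_score) + margin


correct_score = 0.40   # "go home" in recognition pattern 1002
top_score = 0.46       # "nearby convenience store", assumed misrecognized

bonus = bonus_from_gap(correct_score, top_score)
boosted = correct_score + bonus
```

With this assumed formula the boosted score of "go home" is larger than 0.46, so the same N-best list would now rank "go home" first.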

As in the first embodiment, the recognition pattern table 110 may store user identification information in association with the control task related to the recognition pattern. This enables speech recognition processing suited to the characteristics of the user's utterances and further reduces the possibility of misrecognition. Similarly, the recognition pattern table 110 may store noise environment information in association with the control task related to the recognition pattern. This enables speech recognition processing suited to the characteristics of the noise environment and further reduces the possibility of misrecognition.

As in the first embodiment, the dictionary control unit 211 of the second embodiment refers to the recognition pattern table 110 when the speech recognition unit 102 acquires a group of new recognition candidates for a new utterance by the user. If a recognition pattern identical to the new recognition candidate group regardless of order exists, the control task related to that recognition pattern is prioritized. Specifically, the dictionary control unit 211 prioritizes the control task related to the recognition pattern by having the bonus value added to the recognition score of the new recognition candidate that matches the recognition candidate "go home", which corresponds to the control task 1006 related to the recognition pattern 1002. Because the recognition characteristic extraction unit 209 stores the appearance frequency in the recognition pattern table 110 as in the first embodiment, the dictionary control unit 211 may use it, for example, to add the bonus value only when the appearance frequency exceeds a predetermined value. It may also perform control such as enlarging the margin of the bonus value for recognition patterns that appear more frequently. As a result, the likelihood that the task intended by the user is achieved correctly increases as the user continues to use the device. In addition, when the understanding unit 104 calculates recognition scores for the recognition results of the (n+1)-th and subsequent utterances, the dictionary control unit 211 has a predetermined value subtracted from the recognition score of any recognition candidate "nearby convenience store" that matches the recognition candidate "nearby convenience store" having the maximum recognition score in the recognition result 1001 of the n-th utterance. This reduces the possibility that a response based on the recognition result of the (n+1)-th or a subsequent utterance is the same as the response based on the recognition result of the n-th utterance.
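The frequency-based controls mentioned above could look like the following sketch. The text presents gating on the appearance frequency and enlarging the margin as separate options; combining them in one function, the parameter names, and all numeric defaults are assumptions made here for illustration.

```python
def effective_bonus(base_bonus, frequency, min_frequency=2,
                    margin_step=0.01, max_margin=0.05):
    """Return the bonus to apply given how often the pattern has appeared.

    No bonus is applied until the appearance frequency exceeds
    min_frequency; beyond that, the margin grows with the frequency
    up to max_margin. All defaults are hypothetical.
    """
    if frequency <= min_frequency:
        return 0.0
    extra_margin = min(max_margin, margin_step * (frequency - min_frequency))
    return base_bonus + extra_margin


rare = effective_bonus(0.06, frequency=2)       # not yet frequent enough
regular = effective_bonus(0.06, frequency=3)    # bonus plus a small margin
frequent = effective_bonus(0.06, frequency=20)  # margin capped at max_margin
```

Capping the margin is a design choice intended to keep a frequently seen pattern from overriding clearly different recognition results.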

The dialogue between the system and the user after a denial or correction is not limited to speech. That is, the case in which speech is not recognized well and the user switches to touch-panel operation or the like to set the task is also subject to storage. Moreover, the control task to be stored is not necessarily the finally achieved control task. For example, in a dialogue that descends through a hierarchical structure, control tasks also arise at intermediate points (intermediate layers) of the dialogue. Accordingly, in the second embodiment, when a re-utterance following a denial or correction, that is, the (n+1)-th or a subsequent utterance, is neither denied nor corrected, or is affirmed, the recognition result of that utterance is assumed to contain the correct answer. Control tasks generated from that recognition result are also subject to storage.

The operation of the recognition characteristic extraction unit 209 and the dictionary control unit 211 is described below using a concrete dialogue example. FIG. 11 shows the storage conditions and the data to be stored in a dialogue example of the second embodiment. FIG. 11 shows a dialogue from the user's initial utterance "go home" until the route search to the home is executed. In this dialogue example, the response based on the recognition result of the initial utterance is denied or corrected, the responses based on the recognition results of the subsequent utterances are neither denied nor corrected, and the series of control tasks is finally achieved; however, the embodiment is not limited to this case. It is also applicable when a denial or correction occurs partway through the dialogue leading to the achievement of a series of control tasks, the responses based on the recognition results of the subsequent utterances are neither denied nor corrected, and the series of control tasks is finally achieved. In that case, the recognition result associated with the response immediately preceding the denial or correction operation may be extracted as the recognition pattern.

As shown in FIG. 11, when the system utters "How may I help you?" (step S21), the user makes the initial utterance "go home" (step U21). The system recognizes the user's initial utterance and obtains the recognition candidate group 1001 for it: "nearby convenience store" (0.46), "go home" (0.40), and "TV ON" (0.10) (step S22). The values in parentheses are recognition scores. As in the first embodiment, this dialogue example considers the top three recognition candidates in the N-best list. However, the number of recognition candidates to be stored is preferably determined in view of the recognition performance of the voice interaction device. Alternatively, only recognition candidates whose recognition scores exceed a predetermined threshold may be stored.

The understanding unit 104 determines the understanding result based on the recognition candidate "nearby convenience store", which has the maximum recognition score in the recognition result of the initial utterance. In the dialogue example of FIG. 11, it issues, as the understanding result, a command that executes a nearby-facility search (in FIG. 11, execute (facility search, destination = nearby facility (convenience store))). At the same time, the response generation unit 106 outputs the response "Searching for nearby convenience stores." The first embodiment showed an example of a dialogue strategy in which a predetermined threshold of 0.70 is set and a confirmation response is output unless a recognition candidate with a recognition score exceeding the threshold is found. The second embodiment shows an example of a dialogue strategy in which, at least for the recognition result of the initial utterance, the understanding result is determined proactively using the recognition candidate with the maximum recognition score. However, the second embodiment can also be applied to the same dialogue strategy as the first embodiment.

The user responds to this system response with a denial or correction operation (step U22). The denial or correction operation may be a spoken utterance such as "No" or "Go back", or the pressing of a correction switch. In addition to explicit denial or correction operations, a simple reset or a cancel operation performed immediately after a control task is achieved can also be treated as a denial or correction operation. In response to the user's denial or correction operation, the system obtains "correction detected" as the understanding result, and the response generation unit 106 outputs, based on that understanding result, the response "I'm sorry. Please speak again." (step S23). The user then speaks again (step U23). In this dialogue example, the user attempts to achieve the task with a next utterance, "destination setting", whose content differs from the initial utterance. The user may also attempt to achieve the task with an utterance of the same content as the initial utterance.

The system recognizes the user's next utterance and obtains its recognition candidate group: "destination setting" (0.70), "search by station" (0.10), and "registered place" (0.10) (step S24). Based on this recognition candidate group, the understanding unit 104 issues, as the understanding result 1003, a command that sets a destination (in FIG. 11, execute (destination setting method selection)). At the same time, the response generation unit 106 outputs the response "Setting the destination. You can set it from home, facility name, facility address, facility phone number, history, or registered places." Prompted to select a destination setting method, the user then makes the utterance after next, "go home" (step U24). At this point, the options may also be presented on the screen so that they can be selected by screen operation. The system recognizes this utterance and obtains the following recognition candidate group (step S25):
go home (0.50)
nearby convenience store (0.40 - penalty (0.70) = 0.0)
registered place (0.05)

In this dialogue example, the system reduces the possibility that the response based on the recognition result of the user's utterance after next will be the same as the response based on the recognition result of the initial utterance. Specifically, a penalty, which is a predetermined value (0.70 in this dialogue example), is subtracted from the recognition score 0.40 of the recognition candidate "Nearby convenience store", because it is the same as the candidate that had the highest recognition score in the recognition result of the initial utterance. As a result, the recognition candidate with the highest recognition score in the recognition result of the utterance after next is "Return home". The understanding unit 104 issues, as the understanding result 1005, a command for searching for a route home (in FIG. 11, Execute (route search, destination = home)). At the same time, the response generation unit 106 outputs the response "Searching for a route home." Furthermore, as in the first embodiment, the dialogue strategy may issue a confirmation response when no recognition candidate exceeding a threshold (for example, 0.70) is found in a recognition result obtained after a negation or correction operation. In that case, since no candidate in the recognition result of the utterance after next exceeds the threshold (for example, 0.70), the understanding unit 104 outputs a confirmation response using "Return home", the candidate with the highest recognition score in that result. The response generation unit 106 thus outputs the response "Do you want to search for a route home?" The understanding unit 104 then fixes the understanding result 1005 once an affirmative response to this confirmation is obtained from the user.
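The penalty and confirmation steps described above can be sketched as follows. This is only an illustration: the function names are hypothetical, and flooring the penalized score at 0.0 is an assumption inferred from the "0.40 - penalty (0.70) = 0.0" entry in the candidate list.

```python
def rescore_after_correction(nbest, rejected, penalty=0.70):
    """Subtract `penalty` from the previously rejected candidate.

    Flooring at 0.0 is an assumption inferred from "0.40 - 0.70 = 0.0".
    """
    return {
        word: max(0.0, score - penalty) if word == rejected else score
        for word, score in nbest.items()
    }


def decide(nbest, confirm_threshold=0.70):
    """Pick the top candidate; ask for confirmation unless it exceeds the threshold."""
    best = max(nbest, key=nbest.get)
    action = "execute" if nbest[best] > confirm_threshold else "confirm"
    return best, action


# Scores obtained in step S25, before the penalty is applied.
nbest = {
    "Return home": 0.50,
    "Nearby convenience store": 0.40,
    "Registered location": 0.05,
}
rescored = rescore_after_correction(nbest, rejected="Nearby convenience store")
best, action = decide(rescored)
print(best, action)
```

With the optional confirmation strategy enabled, "Return home" is selected but only confirmed rather than executed, because its score of 0.50 does not exceed the 0.70 threshold.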

When the above dialogue example is monitored, the recognition characteristic extraction unit 209 determines whether the following storage conditions are satisfied:
- A negation or correction (step U22) was detected in response to the response based on the recognition result of the initial utterance (step S22) (established at (a) in FIG. 11).
- No negation or correction was detected for the responses (steps S24, S25) based on the recognition results of the subsequent utterances (steps U23, U24) that followed the negation (step U22) (established at (b) in FIG. 11).
- A control task was finally determined (no negation was detected for the final determination of the control task) (established at (c) in FIG. 11).
The recognition characteristic extraction unit 209 determines that these storage conditions are satisfied and extracts the recognition result 1001 of the initial utterance as the recognition pattern 1002. It also treats as storage targets the control task 1004 (destination setting method selection), corresponding to the understanding result 1003 determined in the intermediate dialogue, and the control task 1006 (route search, destination = home), corresponding to the finally determined understanding result 1005. In this dialogue example, however, two candidate control tasks have been obtained for the recognition pattern 1002. The recognition characteristic extraction unit 209 therefore determines which of the two candidates is the control task related to the recognition pattern 1002, and then stores the control task so determined in the recognition pattern table 110.
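The three storage conditions can be expressed as a single predicate. The sketch below assumes a hypothetical representation in which each utterance of a dialogue is marked with whether the system's response to it was negated or corrected; the patent does not prescribe this data layout.

```python
def satisfies_storage_condition(negated, task_achieved):
    """Check conditions (a) to (c) for one dialogue.

    negated[i] is True if the user negated or corrected the system response
    to the i-th utterance (index 0 is the initial utterance).
    """
    return (
        len(negated) >= 2
        and negated[0]              # (a) the initial response was rejected
        and not any(negated[1:])    # (b) later responses were accepted
        and task_achieved           # (c) a control task was finally determined
    )


# Dialogue of FIG. 11: initial utterance rejected, U23 and U24 accepted.
print(satisfies_storage_condition([True, False, False], task_achieved=True))
```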

The method for determining the control task related to the recognition pattern 1002 is as follows. If the utterance (initial utterance) that led to the response immediately preceding the user's negation or correction operation is "Return home", as shown in FIG. 11, the control task related to the recognition pattern 1002 is determined to be the control task 1006. If instead that utterance had been "Destination setting", the control task related to the recognition pattern 1002 would be determined to be the control task 1004. In the dialogue example of FIG. 11, the recognition result 1001 of the initial utterance contains no recognition candidate related to the control task 1004 (destination setting method selection); it contains only the recognition candidate "Return home", which relates to the control task 1006 (route search, destination = home). The control task related to the recognition pattern 1002 can therefore be determined to be the control task 1006. In other words, by checking whether a recognition candidate related to each control task is included in the recognition pattern 1002, the recognition characteristic extraction unit 209 can automatically determine which control task relates to the recognition pattern 1002. Going a step further, consider the case where the recognition pattern 1002 contains both a recognition candidate related to the control task 1004 (for example, "Destination setting") and a recognition candidate related to the control task 1006 (for example, "Return home"). In that case, the recognition characteristic extraction unit 209 may take, of the two, the control task whose recognition candidate has the higher recognition score in the recognition pattern 1002 as the control task related to the recognition pattern 1002.
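The selection rule above (membership in the recognition pattern, with the higher recognition score as a tie-breaker) might look like the following sketch. The task identifiers and the scores of the recognition pattern are hypothetical, since this passage does not list the scores of the initial recognition result 1001.

```python
def related_control_task(pattern, task_vocabulary):
    """Return the task whose vocabulary appears in `pattern` with the highest score.

    Returns None if no candidate task has any vocabulary in the pattern.
    """
    best_task, best_score = None, float("-inf")
    for task, words in task_vocabulary.items():
        for word in words:
            score = pattern.get(word)
            if score is not None and score > best_score:
                best_task, best_score = task, score
    return best_task


# Hypothetical scores for the initial recognition result (pattern 1002).
pattern_1002 = {
    "Nearby convenience store": 0.60,
    "Return home": 0.30,
    "Registered location": 0.05,
}
candidates = {
    "task_1004_destination_setting": ["Destination setting"],
    "task_1006_route_search_home": ["Return home"],
}
print(related_control_task(pattern_1002, candidates))
```

In this example only "Return home" occurs in the pattern, so the task corresponding to control task 1006 is chosen.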
<Specific control processing flow using the above configuration>
The flow of the specific control processing of the voice interactive apparatus according to the second embodiment, using the above configuration, is as follows. It is similar to the control processing according to the first embodiment. Specifically, the control processing of steps S101 to S115 in the flowchart of FIG. 9 is exactly the same. Next, as in the first embodiment, the recognition characteristic extraction unit 209 determines whether the negation flag is OFF, that is, whether the dialogue contained no negation by the user (step S116). If the negation flag is OFF, that is, the unit determines that no negation occurred (step S116: Yes), then, unlike the first embodiment, the recognition characteristic extraction unit 209 sets all flags to OFF and ends the control processing. If the negation flag is ON, that is, the unit determines that a negation occurred (step S116: No), then, unlike the first embodiment, the unit extracts the N-best as the recognition pattern (step S117).

Furthermore, as described above, the recognition characteristic extraction unit 209 determines and extracts the control task related to the recognition pattern. It calculates a bonus value based on the difference between the recognition score of the recognition candidate related to the extracted control task and the highest recognition score in the recognition pattern (step S118). If the recognition pattern table 110 already stores a bonus value, the unit takes, for example, the maximum of the stored and new bonus values, as in the first embodiment. Next, as in the first embodiment, the extracted recognition pattern, the control task, and the calculated bonus value are stored in the recognition pattern table 110 (step S119). If the recognition pattern table 110 does not yet store that recognition pattern for that control task, it is newly stored with an appearance frequency of 1, as in the first embodiment. The recognition characteristic extraction unit 209 then sets all flags to OFF, as in the first embodiment, and ends the control processing. In this way, the recognition patterns that arise when misrecognition occurs are extracted from past dialogues in which a negation or correction occurred on the way to achieving a series of control tasks. When the new recognition candidate group for a new utterance by the user is the same as the recognition pattern, regardless of order, the bonus value is added to the recognition score of the new recognition candidate that matches the recognition candidate related to the control task for that recognition pattern. That is, the bonus value is added to the recognition score of the new recognition candidate that matches the recognition candidate associated with the bonus value stored in the recognition pattern table 110. The control task related to the recognition pattern is thereby given priority.
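Steps S118 and S119 and the later bonus application can be sketched as below. Several details are assumptions made only for illustration: the text says the bonus is "based on" the score difference without giving a formula, so the sketch uses the difference plus a small hypothetical margin; the table layout, the frequency increment for an existing entry, and the example scores are likewise hypothetical.

```python
def bonus_value(pattern, task_word, margin=0.05):
    """Assumed formula: gap to the top score plus a small margin."""
    return max(pattern.values()) - pattern[task_word] + margin


def store(table, pattern, task, task_word):
    """Store the pattern (as an unordered word set), task, and bonus (step S119)."""
    key = (frozenset(pattern), task)
    bonus = bonus_value(pattern, task_word)
    if key in table:
        table[key]["bonus"] = max(table[key]["bonus"], bonus)
        table[key]["frequency"] += 1  # assumed update for an existing entry
    else:
        table[key] = {"word": task_word, "bonus": bonus, "frequency": 1}


def apply_bonus(new_nbest, table):
    """Add the stored bonus when the new N-best matches a pattern in any order."""
    rescored = dict(new_nbest)
    for (words, _task), entry in table.items():
        if words == frozenset(new_nbest):
            rescored[entry["word"]] += entry["bonus"]
    return rescored


table = {}
pattern = {"Nearby convenience store": 0.60, "Return home": 0.30, "Registered location": 0.05}
store(table, pattern, "route_search_home", "Return home")

new_nbest = {"Return home": 0.35, "Registered location": 0.10, "Nearby convenience store": 0.55}
rescored = apply_bonus(new_nbest, table)
print(max(rescored, key=rescored.get))
```

Comparing word sets rather than ordered lists reflects the "regardless of order" condition; the margin is there only so that the boosted candidate ends up strictly ahead when the same scores recur.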

The need for such recognition score correction based on learned misrecognition patterns arises because acoustically similar vocabulary exists in the dictionary. With this in mind, one could instead select the vocabulary registered in the recognition dictionary so as to exclude acoustically similar words in the first place. In practice, however, selecting vocabulary that way risks lowering usability: if vocabulary is chosen for the convenience of the system, the system may end up accepting vocabulary that is unnatural for the user. The vocabulary registered in the recognition dictionary should therefore cover the expressions that users readily think of for a given task, and the misrecognition patterns that result from the acoustic similarity this introduces are preferably corrected by the method shown in the second embodiment.

As described above, the voice interactive apparatus according to the second embodiment includes the recognition characteristic extraction unit 209, which monitors the dialogue leading to the achievement of a series of control tasks and extracts, as a recognition pattern, the recognition result of the n-th utterance, which contains the recognition candidate that was negated or corrected. The recognition characteristic extraction unit 209 further extracts, from among the control tasks for which no negation or correction occurred, the control task related to a recognition candidate included in the recognition pattern. The apparatus also includes the dictionary control unit 211, which gives priority to that control task based on the recognition pattern and the control task. The dictionary control unit 211 gives priority to the control task when the new recognition candidate group for a new utterance is the same as the recognition pattern, regardless of order. Consequently, even when vocabulary with similar phoneme sequences exists, the control task related to the recognition pattern is given priority whenever a recognition result is obtained that is the same, regardless of order, as the recognition result of a past utterance by the user, so the possibility of repeated misrecognition can be reduced.

In the second embodiment, the recognition characteristic extraction unit 209 stores the recognition pattern and the control task in association with each other in the recognition pattern table 110 when a negation or correction occurred in response to the response generated by the response generation unit 106 based on the recognition result of the n-th utterance, no negation or correction occurred in response to the responses based on the recognition results of the (n+1)-th and subsequent utterances, and the series of control tasks was finally achieved. The recognition characteristic extraction unit 209 further stores in the recognition pattern table 110, in association with the recognition candidate related to the control task, a bonus value calculated from the difference between the recognition score of that recognition candidate and the highest recognition score in the recognition pattern. Thus, when a new recognition candidate group is the same as the recognition pattern, regardless of order, the dictionary control unit 211 can have the bonus value added to the recognition score of the new recognition candidate that matches the recognition candidate related to the control task, and can thereby give priority to the control task.

In the second embodiment, when the understanding unit 104 calculates recognition scores for the recognition results of the (n+1)-th and subsequent utterances, the dictionary control unit 211 causes a predetermined value to be subtracted from the recognition score of any recognition candidate that is the same as the candidate that had the highest recognition score in the recognition result of the n-th utterance. This reduces the possibility that the response based on the recognition result of the (n+1)-th or a subsequent utterance will be the same as the response based on the recognition result of the n-th utterance.

(Third embodiment)
As noted in the second embodiment, one way to improve the usability of a voice interactive apparatus is to accept a variety of phrasings. That is, by providing multiple expressions for the command that executes a single task, the apparatus accepts the various expressions that users employ. However, registering multiple expressions increases the vocabulary, so reductions in recognition speed and recognition accuracy are unavoidable. A technique that properly balances this trade-off is therefore needed. The third embodiment focuses on the observation that a user's expressions converge as the user continues to use the interactive apparatus.

Specifically, as in the first embodiment, the third embodiment focuses on the results obtained when no negation or correction occurred in response to the response generated by the response generation unit 106 based on the recognition result of the initial utterance, and the series of control tasks was finally achieved. In this case, the recognition dictionary 303 (see FIG. 12) registers multiple vocabulary items to be recognized (hereinafter, recognition vocabulary) for each control task. The recognition characteristic extraction unit 309 (see FIG. 12) analyzes the recognition adoption frequency of each vocabulary item in a control task, based on how often, for each finally achieved control task, the candidate with the highest recognition score in the new recognition result for a new utterance by the user is the same as the candidate with the highest recognition score in the recognition pattern. The dictionary control unit 311 (see FIG. 12) then causes the speech recognition unit 302 (see FIG. 12) to exclude from the recognition targets any vocabulary item whose recognition adoption frequency falls below a threshold. The dictionary control unit 311 thereby gives priority to the control task related to the recognition pattern. In this way, reductions in recognition speed and recognition accuracy are prevented, and the possibility of repeated misrecognition is reduced.

As in the first embodiment, the recognition pattern in the third embodiment refers to the recognition candidate group that the recognition characteristic extraction unit 309 has extracted and stored from the recognition result of the user's initial utterance, that is, from the N-best recognition candidate group. Also as in the first embodiment, each recognition candidate is assigned a recognition score, for which a measure such as likelihood or confidence can be used. The control task related to the recognition pattern in the third embodiment is the control task that was finally achieved.

The voice interactive apparatus according to the third embodiment is described below, focusing on its differences from the voice interactive apparatus according to the first embodiment. Structures that are the same as in the first embodiment are given the same reference numerals, and their description is omitted. FIG. 12 is a block diagram showing the basic configuration of the voice interactive apparatus according to the third embodiment of the present invention. As shown in FIG. 12, the configuration is basically the same as that of the voice interactive apparatus according to the first embodiment. The only differences are the speech recognition unit 302 (speech recognition means), the recognition dictionary 303, the recognition pattern table 310, the recognition characteristic extraction unit 309 (recognition characteristic extraction means), and the dictionary control unit 311 (dictionary control means). Accordingly, only the speech recognition unit 302, the recognition dictionary 303, the recognition pattern table 310, the recognition characteristic extraction unit 309, and the dictionary control unit 311 are described.

The speech recognition unit 302 of the third embodiment performs general speech recognition processing, as in the first embodiment. At product shipment, the speech recognition unit 302 treats all vocabulary in the recognition dictionary 303 as recognition targets, but vocabulary can be excluded from the recognition targets under the control of the dictionary control unit 311. The recognition dictionary 303 registers multiple recognition vocabulary items for each control task. This vocabulary is chosen to accommodate the variety of users' phrasings. Specifically, the dictionary creator devises multiple vocabulary items in advance so as to:
A. cover vocabulary that evokes the same meaning or function as the vocabulary related to the control task; and
B. cover the stylistic variations that arise when the control task is put into words.
In addition, it is desirable to include the following process:
C. collect spontaneous utterances (called a corpus) from a large number of subjects who want to execute a given control task, and select vocabulary based on appearance frequency and the like.
By understanding actual usage and selecting vocabulary in a way that reflects it, the size of the dictionary used as the initial setting can be narrowed down to some extent.

For example, to increase the variations for the control task "destination setting in a navigation device" based on method A, a representative command for the function of the control task is first decided, such as "目的地設定" (destination setting), and words that evoke an equivalent function are then selected. For the word "目的地" (destination), words such as "行き先" (where to go) and "行く" (go) can be generated, and for the word "設定" (setting), words such as "探す" (look for), "検索" (search), and "探索" (explore). Combining these finally yields vocabulary such as "目的地を探す", "目的地探索", "行き先設定", and "行き先を探す". Similarly, for the control task "switch on the air conditioner", a representative command such as "エアコンオン" (air conditioner on) is decided. For the word "エアコン" (air conditioner), "冷房（暖房）" (cooling (heating)), "クーラー（ヒーター）" (cooler (heater)), and "空調" (air conditioning) are selected, and for "オン" (on), "つける" (turn on), "入れる" (switch on), and the like. Finally, commands such as "エアコンを入れる", "エアコンをつける", and "冷房オン" can be generated.
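The combination step of method A can be illustrated with a Cartesian product over the synonym lists. This is a simplified sketch: the function name is hypothetical, and the particle "を" is attached to the verb form by hand rather than derived by any grammatical rule.

```python
from itertools import product


def combine(objects, actions):
    """Join every object word with every action word to form command variants."""
    return [obj + act for obj, act in product(objects, actions)]


objects = ["目的地", "行き先"]          # "destination" and a synonym
actions = ["設定", "探索", "を探す"]     # "setting" and words with an equivalent function
vocabulary = combine(objects, actions)
for phrase in vocabulary:
    print(phrase)
```

The six generated phrases include the four given in the text ("目的地を探す", "目的地探索", "行き先設定", "行き先を探す") along with the representative command itself.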

Next, consider the stylistic variations of method B. Typical styles that appear in expressions for device operation include the noun-ending style (体言止め), the imperative form, the wish form, and the polite style. For the command "エアコンオン" (air conditioner on) above, the variations are as follows:
Noun-ending style: エアコンオン, エアコンをつける (air conditioner on; turn on the air conditioner)
Imperative form: エアコンをオンにしろ, エアコンをつけろ (turn the air conditioner on!)
Wish form: エアコンをオンにしたい, エアコンをつけたい (I want to turn the air conditioner on)
For the polite style, the endings of each of the above are changed, as in "オンにしてください", "つけてください", and "つけたいです".

FIG. 13 shows an example of the recognition dictionary 303 of FIG. 12, in which the vocabulary selected by the above procedure is registered. In FIG. 13(a), each control task name is registered together with the multiple vocabulary items corresponding to it. If any of the vocabulary items related to a control task is determined as the understanding result, that control task is executed. FIG. 13(b) shows an example registered in the form of a word network. In this method, vocabulary containing multiple phrasings is divided into words, interjections, conjunctions, and the like, and the connection relations among them are registered as a network. Every combination of connections becomes recognition vocabulary. FIGS. 13(a) and 13(b) can recognize substantially the same vocabulary.
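To illustrate why the list form and the word-network form can cover the same vocabulary, the sketch below expands a small network into the flat list of phrases it accepts. The network contents and the start and end markers are hypothetical and are not taken from FIG. 13(b); the network is assumed to be acyclic.

```python
def expand(network, node="<s>", end="</s>"):
    """Enumerate every phrase spelled by a path from `node` to `end`."""
    phrases = []
    for nxt in network[node]:
        if nxt == end:
            phrases.append("")
        else:
            phrases.extend(nxt + tail for tail in expand(network, nxt, end))
    return phrases


# Hypothetical word network for the "air conditioner on" task.
network = {
    "<s>": ["エアコン", "冷房"],
    "エアコン": ["オン", "を"],
    "冷房": ["オン", "を"],
    "オン": ["</s>"],
    "を": ["つける", "入れる"],
    "つける": ["</s>"],
    "入れる": ["</s>"],
}
for phrase in expand(network):
    print(phrase)
```

Seven nodes here describe six phrases, which is the practical appeal of the network form as the number of synonyms grows.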

The recognition characteristic extraction unit 309 monitors the dialogue leading to the achievement of a series of control tasks, that is, the recognition candidate groups and the understanding results. When no negation or correction occurred in response to the response generated by the response generation unit 106 based on the recognition result of the initial utterance, and the series of control tasks was finally achieved, the unit extracts the recognition pattern, which is the recognition candidate group of the initial utterance. It also extracts the control task that was finally achieved. As in the first embodiment, each recognition candidate in the recognition pattern is assigned a recognition score, and the candidates in the pattern are sorted in descending order of recognition score. As in the first embodiment, the recognition characteristic extraction unit 309 stores the score-ordered recognition pattern in the recognition pattern table 310 in association with the control task. It also stores the appearance frequency in the recognition pattern table 310. The recognition pattern table 310 therefore has the same format as the recognition pattern table 110 of the first embodiment shown in FIG. 7; the only difference is that it has no bonus value column.

Here, the appearance frequency is, for each control task, the frequency with which the candidate with the highest recognition score in the new recognition result for a new utterance by the user is the same as the candidate with the highest recognition score in the recognition pattern. By storing this appearance frequency each time the control task is achieved, the recognition adoption frequency, which indicates which vocabulary items are used most often in speech recognition for each control task, can be determined. FIG. 14 shows the recognition adoption frequency of each vocabulary item as analyzed by the recognition characteristic extraction unit 309 of FIG. 12. FIGS. 14(a), 14(b), and 14(c) show the recognition adoption frequencies of the vocabulary items for control tasks A, B, and C, respectively. In FIG. 14(a), the recognition adoption frequencies are low overall, whereas in FIGS. 14(b) and 14(c) usage is concentrated on a few vocabulary items with high recognition adoption frequencies. Based on the recognition adoption frequencies analyzed by the recognition characteristic extraction unit 309, the dictionary control unit 311, described below, narrows down the recognition vocabulary. The appearance frequency itself may be used as the numerical value of the recognition adoption frequency. Alternatively, the value for the vocabulary item with the highest appearance frequency may be taken as the reference value 1, and the other items expressed as a ratio relative to it. For example, a vocabulary item used only half as often as the item with the highest appearance frequency has an appearance ratio of 0.5.
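The normalized form of the recognition adoption frequency can be computed as in the following sketch, which counts how often each vocabulary item was the top-scoring candidate for one control task and scales the counts so that the most frequent item is 1. The function name and the usage log are hypothetical.

```python
from collections import Counter


def adoption_ratios(top_candidates):
    """Scale per-word counts so that the most frequently adopted word is 1.0."""
    counts = Counter(top_candidates)
    peak = max(counts.values())
    return {word: n / peak for word, n in counts.items()}


# Hypothetical log of top-scoring candidates for one control task.
log = ["エアコンオン"] * 8 + ["冷房オン"] * 4 + ["エアコンをつける"] * 2
ratios = adoption_ratios(log)
print(ratios)
```

Here "冷房オン" is adopted half as often as "エアコンオン", so its ratio is 0.5, matching the example in the text.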

The dictionary control unit 311 determines, based on the recognition adoption frequency analyzed by the recognition characteristic extraction unit 309, whether the per-vocabulary recognition adoption frequency is biased. For example, the determination is made by the method shown in FIG. 15. FIG. 15 shows an example of comparing the recognition adoption frequency with thresholds in the dictionary control unit 311 shown in FIG. 12, applied to control tasks A, B, and C of FIG. 14. As shown in FIG. 15, the dictionary control unit 311 has two thresholds, Th_1 and Th_2. When the recognition adoption frequency of some vocabulary item exceeds the threshold Th_1, the dictionary control unit 311 determines whether there are vocabulary items whose recognition adoption frequency is below the threshold Th_2. If it determines that such vocabulary items exist, the dictionary control unit 311 controls the speech recognition unit 302 so that those items are excluded from the recognition targets. For example, in the determination result for control task A shown in FIG. 15(a), no vocabulary item exceeds the threshold Th_1, so the dictionary control unit 311 does not check whether any vocabulary item falls below the threshold Th_2. In the determination result for control task B shown in FIG. 15(b), on the other hand, the recognition adoption frequency of vocabulary 1301 exceeds the threshold Th_1, and the recognition adoption frequencies of all other vocabulary items are below the threshold Th_2. Accordingly, the dictionary control unit 311 excludes every vocabulary item other than vocabulary 1301 from the recognition targets. In the determination result for control task C shown in FIG. 15(c), the recognition adoption frequencies of vocabularies 1302 and 1304 exceed the threshold Th_1, and vocabulary 1303 is above the threshold Th_2. Accordingly, the dictionary control unit 311 excludes every vocabulary item other than vocabularies 1302, 1303, and 1304 from the recognition targets.
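The two-threshold decision described for FIG. 15 can be sketched as follows. This is an illustrative reading of the text only: the function name, the numeric frequencies, the threshold values, and the labels "other_1" and "other_2" are assumptions, since the patent gives no concrete numbers.

```python
def excluded_by_two_thresholds(freq, th1, th2):
    """Return the vocabulary items to exclude from recognition for one control task.

    Exclusion is considered only once at least one item exceeds th1; in that
    case every item whose adoption frequency is below th2 is excluded.
    """
    if not any(f > th1 for f in freq.values()):
        return set()  # like control task A: usage has not settled yet
    return {word for word, f in freq.items() if f < th2}


# Hypothetical frequencies shaped like control tasks B and C in the text.
task_b = {"1301": 12, "other_1": 1, "other_2": 2}
task_c = {"1302": 12, "1303": 6, "1304": 11, "other_1": 1}
print(excluded_by_two_thresholds(task_b, th1=10, th2=4))
print(excluded_by_two_thresholds(task_c, th1=10, th2=4))
```

Gating on Th_1 keeps the full vocabulary available until one phrasing has clearly become dominant, so a rarely used task is not pruned on thin evidence.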

When the appearance magnification described above is used, the dictionary control unit 311 sets the appearance magnification of a vocabulary item whose appearance count exceeds a certain number to 1.0, and then calculates the appearance magnification of the other vocabulary items relative to it. The dictionary control unit 311 further determines whether the appearance magnification of any vocabulary item is below a predetermined threshold, for example 0.2. If it determines that such a vocabulary item exists, the dictionary control unit 311 causes that item to be excluded from the recognition targets. At product shipment, the appearance magnification of every vocabulary item is set to 1.0. FIG. 16 shows an example of determination by appearance magnification; it is another example of comparing the recognition adoption frequency with a threshold in the dictionary control unit 311 shown in FIG. 12. In FIG. 16, the predetermined threshold is Th_3. In the determination result for control task D shown in FIG. 16(a), not enough data has yet been stored in the recognition pattern table 310, so no vocabulary item has an appearance magnification below the threshold Th_3. All vocabulary items for control task D therefore remain recognition targets. In the determination result for control task E shown in FIG. 16(b), the appearance magnifications of all vocabulary items other than vocabulary 1401 (appearance magnification 1.0) are below the threshold Th_3. All vocabulary items other than vocabulary 1401 are therefore excluded from the recognition targets. In the determination result for control task F shown in FIG. 16(c), taking the appearance magnification of vocabulary 1402 as the reference, the appearance magnifications of vocabularies 1403 and 1404 exceed the threshold Th_3. Vocabulary items other than vocabularies 1402, 1403, and 1404 are therefore excluded from the recognition targets.
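The single-threshold check on appearance magnification can be sketched as below. The counts, the labels "other_1" and "other_2", and the function name are assumptions made for illustration; only the example threshold 0.2 and the vocabulary numbers 1401 to 1404 come from the text.

```python
def excluded_by_magnification(counts, th3=0.2):
    """Exclude items whose appearance magnification (count / max count) is below th3."""
    peak = max(counts.values(), default=0)
    if peak == 0:
        return set()  # nothing observed yet: keep every item as a recognition target
    return {word for word, n in counts.items() if n / peak < th3}


# Hypothetical counts shaped like control tasks D, E and F in the text.
task_d = {"other_1": 1, "other_2": 1}
task_e = {"1401": 20, "other_1": 1, "other_2": 2}
task_f = {"1402": 10, "1403": 6, "1404": 4, "other_1": 1}
for task in (task_d, task_e, task_f):
    print(sorted(excluded_by_magnification(task)))
```

Because the ratio is relative to the most used item, the same threshold can be applied to tasks with very different absolute usage counts.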

FIG. 17 shows an example of the recognition dictionary 303 after the recognition vocabulary has been controlled by the above method in the dictionary control unit 311 shown in FIG. 12. As in FIG. 13, FIG. 17(a) is an example in which each control task name and the vocabulary corresponding to it are registered, and FIG. 17(b) is an example registered in the form of a word network. In FIG. 17(a), the vocabulary items excluded from the recognition dictionary 303 are shown in italics. Similarly, in FIG. 17(b), the excluded words are shown with dotted lines, and their connections to the preceding and following words are deleted. The exclusion of recognition vocabulary by the dictionary control unit 311 is applied only while the dialogue between the user and the system contains no negation or correction. If a negation or correction occurs during the dialogue, the user may be uttering a vocabulary item that has been excluded from the recognition targets. Therefore, in the third embodiment, when a response based on the new recognition result for a user's new utterance is negated or corrected, the dictionary control unit 311 controls the speech recognition unit 302 so that the excluded vocabulary is restored to the recognition targets. It further controls the speech recognition unit 302 so that the new utterance immediately preceding the negated or corrected response is recognized again. If the control task is then achieved through a vocabulary item that had been excluded from the recognition targets, the appearance frequency in the recognition pattern table 310 is also corrected.

An example of how the recognition adoption frequency changes over time under this correction method is described with reference to FIG. 18, which shows the change over time of the recognition adoption frequency shown in FIG. 16. FIG. 18 illustrates the case in which the recognition adoption frequency is calculated as an appearance magnification. FIG. 18(a) is identical to the per-vocabulary recognition adoption frequency for control task E shown in FIG. 16(b); that is, the frequencies in FIG. 18(a) have not yet been corrected by the above method. In FIG. 18(a), the appearance magnification of vocabulary 1405 is below the threshold Th_3, so vocabulary 1405 is excluded from the recognition targets. Consider the case where a negation occurs in the dialogue between the user and the system in this situation. The dictionary control unit 311 then temporarily restores all excluded vocabulary to the recognition targets and causes the new utterance immediately preceding the negated response to be recognized again. If this result is also negated, the system asks the user to speak again and acquires a new utterance. When the control task is eventually achieved, the dictionary control unit 311 determines whether it was achieved through a vocabulary item that had been excluded from the recognition targets. If so, it corrects the appearance frequency in the recognition pattern table 310.

The appearance frequency can be corrected, for example, as shown in FIG. 18(b): when it is determined that the control task was achieved through vocabulary 1405, which had been excluded from the recognition targets at the initial utterance, a bonus is given so that the appearance magnification of vocabulary 1405 exceeds the threshold Th_3. As a result, vocabularies 1401 and 1405 become recognition targets, and all other vocabulary items are excluded. Suppose that the user then continues to utter vocabulary 1405, so that the recognition adoption frequency (appearance magnification) becomes as shown in FIG. 18(c). This time the appearance magnification of vocabulary 1401 has fallen below the threshold Th_3, so vocabulary 1401 is excluded from the recognition targets, and only vocabulary 1405 remains as recognition vocabulary.
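The bonus step can be sketched as follows. The patent states only that a bonus lifts the excluded item's appearance magnification above Th_3; the specific rule used here (raise the count to the smallest integer whose ratio to the peak exceeds the threshold), the function names, the counts, and the label "other_1" are all assumptions for illustration.

```python
import math


def apply_bonus(counts, word, th3):
    """Return new counts in which `word` has an appearance magnification above th3.

    Assumed rule: count the successful use, then lift the count to the smallest
    integer whose ratio to the current peak exceeds th3.
    """
    peak = max(counts.values())
    needed = math.floor(th3 * peak) + 1
    updated = dict(counts)
    updated[word] = max(updated.get(word, 0) + 1, needed)
    return updated


def active_vocabulary(counts, th3):
    """Items that remain recognition targets (magnification not below th3)."""
    peak = max(counts.values())
    return {word for word, n in counts.items() if n / peak >= th3}


before = {"1401": 20, "1405": 1, "other_1": 2}   # shaped like FIG. 18(a)
after = apply_bonus(before, "1405", th3=0.2)      # shaped like FIG. 18(b)
print(sorted(active_vocabulary(after, th3=0.2)))
```

If the user then keeps using vocabulary 1405, its count eventually becomes the peak, and the same threshold check removes vocabulary 1401, which is the drift shown in FIG. 18(c).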

Through this series of processes, the settling of a user's phrasing can be detected from the user's dialogues, and the vocabulary to be recognized can be narrowed down appropriately. Reducing the recognition vocabulary makes it possible to recognize the user's utterances more accurately and, because there are fewer recognition targets, also improves recognition speed. As in the first and second embodiments, this vocabulary bias is considered to stem largely from the individual characteristics of the user. That is, a user a tends to prefer phrasing α for task A, while a user b tends to prefer phrasing β for the same task. A voice interaction device in an automobile operates in an environment where the set of users tends to be very limited. The voice interaction device according to the third embodiment can therefore be expected to work particularly effectively when used in a vehicle. If a mechanism for identifying the user by voice features, a camera, or another personal authentication device can be provided, it is desirable to manage the per-vocabulary recognition adoption frequency together with the user's identification information.

As described above, the voice interaction device according to the third embodiment includes the recognition dictionary 303, in which a plurality of vocabulary items to be recognized are registered for each control task, the recognition characteristic extraction unit 309, and the dictionary control unit 311. The recognition characteristic extraction unit 309 monitors the dialogue leading to achievement of a series of control tasks and, when the response based on the recognition result of the initial utterance is neither negated nor corrected and the series of control tasks is finally achieved, extracts the recognition result of the initial utterance as a recognition pattern. At the same time, it extracts the control task that was finally achieved. The recognition characteristic extraction unit 309 then stores the extracted recognition pattern and control task in the recognition pattern table 310. It also stores in the recognition pattern table 310 the appearance frequency, that is, for each control task, the frequency with which the recognition candidate having the highest recognition score in the new recognition result for a user's new utterance matches the recognition candidate having the highest recognition score in the recognition pattern. Based on this appearance frequency, the recognition characteristic extraction unit 309 analyzes the recognition adoption frequency of each vocabulary item for the control task. For vocabulary items whose recognition adoption frequency is below a threshold, the dictionary control unit 311 causes the speech recognition unit 302 to exclude them from the recognition targets. Reducing the recognition vocabulary in this way makes it possible to recognize the user's utterances more accurately and, because there are fewer recognition targets, also improves recognition speed. Furthermore, when a misrecognition occurs, the appearance frequency is not updated and the recognition adoption frequency of the misrecognized vocabulary item remains low, so that with continued use the misrecognized item is excluded from the recognition targets for that control task. The possibility of the same misrecognition occurring repeatedly can thus be reduced.

In addition, when a response based on the new recognition result for a user's new utterance is negated or corrected, the dictionary control unit 311 restores the excluded vocabulary to the recognition targets and causes the new utterance immediately preceding the negated or corrected response to be recognized again. Consequently, even if narrowing the recognition targets causes a misrecognition, the utterance can be recognized correctly after the negation or correction. Moreover, even if the user's speaking tendencies change over time with continued use, the device follows the change automatically, reducing the possibility of repeated misrecognition.

(Fourth embodiment)
In the third embodiment, the settling of phrasing that accompanies continued use, that is, the bias in the vocabulary contained in the user's utterances when a control task is achieved, is determined separately for each control task. Therefore, for a task the user performs often, once settling of the phrasing is detected, vocabulary that is hardly ever uttered for that task can be excluded from the recognition targets. For a task that is used infrequently, however, settling of the phrasing cannot be determined, so all vocabulary for that task must remain recognition targets.

Looking more closely at how a user's phrasing settles, the phrasing a user prefers is likely to be common not just to one specific task but to tasks in general. For example, a user who often says "air conditioner on" to start the air conditioner is expected to be more likely to say "CD on" or "radio on" to start the CD player or radio than to say "put in the CD" or "turn the radio on". Furthermore, a user who often uses the word "on" is expected to be more likely to use "off" as its antonym than to use "turn off" or "switch off". This is thought to be because the experience of success that accumulates as the user keeps operating the system, that is, the experience of operations completing as intended, fosters a notion about the system (a mental model) along the lines of "if I say it this way, it will work correctly". Indeed, our experiments also show that a user is likely to reuse in other tasks the phrasing used for a particular task. By exploiting this user characteristic, the vocabulary can be narrowed down even for tasks that have not been used enough.

For the voice interaction device according to the fourth embodiment, a dictionary construction method and a dictionary control method based on this idea are described. Specifically, the vocabulary is classified by commonality of phrasing into a plurality of dictionaries, and the dictionaries are used for recognition simultaneously and in parallel. In addition, the methods of the first and second embodiments for correcting recognition score bias and misrecognition arising from acoustic characteristics are extended and applied to the divided dictionaries. In the first and second embodiments, correction focused on misrecognition patterns at the word level. In the fourth embodiment, the same approach is used to extract misselection patterns at the dictionary level (events in which vocabulary from the wrong dictionary is obtained as the recognition result). The priority of the dictionaries is determined based on these misselection patterns.

The voice interaction device according to the fourth embodiment is described below, focusing on the differences from the device according to the first embodiment. Structures identical to those of the first embodiment are given the same reference numerals, and their description is omitted. FIG. 19 is a block diagram showing the basic configuration of the voice interaction device according to the fourth embodiment of the present invention. As shown in FIG. 19, the configuration of the device according to the fourth embodiment is basically the same as that of the first embodiment. It differs only in the speech recognition unit 402 (speech recognition means), the recognition dictionary 403, the understanding unit 404 (understanding means), the recognition pattern table 410, the recognition characteristic extraction unit 409 (recognition characteristic extraction means), and the dictionary control unit 411 (dictionary control means). Only these components are therefore described.

The speech recognition unit 402 compares the user's utterance simultaneously and in parallel with the vocabulary registered in the plurality of dictionaries of the recognition dictionary 403, described later, and obtains recognition results from the plurality of dictionaries. It is desirable to obtain a plurality of recognition candidates, that is, an N-best list, from each dictionary. Parallel recognition with multiple dictionaries is described in detail in Takuma, Iwano, and Furui, "Examination of a spoken dialogue system using a parallel processing computer", Japanese Society for Artificial Intelligence SIG technical report, SIG-SLUD-A201-04, pp. 21-26, 2002. In that work, a dictionary is provided for each of a plurality of dialogue domains (the dialogue content to be accomplished). Recognizing with these dictionaries in parallel lets the user choose the dialogue domain freely; that is, the user can speak without being conscious of the order of the dialogue domains, and can switch tasks at any time. Parallel recognition is also described in detail in Japanese Unexamined Patent Application Publication No. 2004-258289. There, a plurality of dictionaries adapted to different noise environments are held together and used in parallel, so that the recognition result of the most suitable dictionary is extracted under a variety of noise conditions and recognition accuracy is improved.

As in the third embodiment, the recognition dictionary 403 holds a plurality of vocabulary items to be recognized for each control task so that the user's varied phrasings can be accepted. In addition, the selected vocabulary is classified according to the commonality of these phrasings and registered in separate dictionaries. Configuration examples of the recognition dictionary 403 are shown in FIG. 20 and FIG. 21: FIG. 20 shows one example of the recognition dictionary 403 shown in FIG. 19, and FIG. 21 shows another. FIG. 20 shows an example in which the recognition dictionary 403 is classified by the sentence style that appears in expressions used for device operation. Sentence styles in the fourth embodiment include those based mainly on differences in verb conjugation, such as the noun-ending form, the imperative form, and the desiderative form described in the third embodiment. They also include styles that depend on the hierarchical relationship the user perceives toward the addressee, such as the plain style ("do ...") and the polite style ("please do ..."). Based on these differences, the recognition dictionary 403 is divided into dictionary A, dictionary B, and so on. FIG. 21 shows an example in which, in addition to the sentence styles of FIG. 20, the dictionaries are divided more finely, mainly by the verb or verbal noun they share. For example, dictionary A (noun-ending form) is classified by the common parts "go", "search", "listen", and "on/off", and divided into dictionary A-1, dictionary A-2, and so on. In this case, depending on the control task, not every dictionary contains vocabulary for that task. In FIG. 21, entries with no vocabulary are written as "null".

The understanding unit 404 generates an understanding result based on the N-best lists that the speech recognition unit 402 obtains from the dictionaries. Specifically, it determines the most reliable recognition candidate in the N-best lists as the understanding result. As in the first embodiment, the understanding unit 404 then refers to the function table 105 to determine the control task corresponding to the understanding result. A commonly used way of determining the understanding result is to calculate a recognition score for each recognition candidate in the N-best list obtained from each dictionary and to select the candidate with the highest recognition score.
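The selection of an understanding result from several per-dictionary N-best lists can be sketched as follows. The data layout (a mapping from dictionary name to a list of word and score pairs), the function name, and the sample words and scores are assumptions for illustration.

```python
def pick_understanding(nbest_by_dictionary):
    """Return (dictionary, word, score) for the highest-scoring candidate, or None.

    `nbest_by_dictionary` maps a dictionary name to its N-best list of
    (word, recognition_score) pairs.
    """
    best = None
    for dictionary, nbest in nbest_by_dictionary.items():
        for word, score in nbest:
            if best is None or score > best[2]:
                best = (dictionary, word, score)
    return best


# Hypothetical N-best lists from two style-specific dictionaries.
result = pick_understanding({
    "A": [("switch to TV", 0.62), ("TV on", 0.40)],
    "B": [("display convenience stores", 0.55)],
})
```

Returning the dictionary name along with the word matters in this embodiment, because later processing tracks which dictionary supplied the winning candidate.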

The recognition characteristic extraction unit 409 performs the following processes:
A. Analysis of dictionary misselection (misrecognition caused by selecting vocabulary from the wrong dictionary), based on analysis of the dialogue history and recognition patterns.
B. Analysis of the recognition adoption frequency across dictionaries, based on analysis of the dialogue history and recognition patterns.

Based on the results of these processes, the dictionary control unit 411 performs the following processes:
A. Prioritization of dictionaries based on the recognition characteristics between dictionaries (how readily dictionary misselection occurs).
B. Prioritization and exclusion of dictionaries based on the recognition adoption frequency across dictionaries.
Processes A and B are described in detail below.

<Process A>
Process A monitors the dialogue leading to achievement of a series of control tasks and, when a misrecognition occurs, extracts a recognition pattern and the control task related to it; it is similar to the processing of the second embodiment. Specifically, extraction takes place when the response based on the recognition result of the n-th utterance is negated or corrected, the responses based on the recognition results of the (n+1)-th and subsequent utterances are neither negated nor corrected, and the series of control tasks is finally achieved. In process A of the fourth embodiment, the recognition pattern is the combination of the dictionaries in which the recognition candidates for the n-th utterance are registered and the highest recognition score for each of those dictionaries. The correct control task (see FIG. 24), which is the control task related to the recognition pattern, is the control task that, among the control tasks that were neither negated nor corrected, corresponds to a recognition candidate included in the recognition candidate group of the n-th utterance. The recognition characteristic extraction unit 409 stores the recognition pattern in the recognition pattern table 410 in association with the correct control task. It also stores in the recognition pattern table 410 the misrecognized control task (see FIG. 24), which is the control task determined from the recognition candidate group of the n-th utterance. In addition, it stores in the recognition pattern table 410 the priority dictionary (see FIG. 24), which is the dictionary in which the recognition candidate related to the correct control task is registered.
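One way to picture the record that process A stores is sketched below. The field names, the `Candidate` type, the task labels, and the tie-breaking rule (when several dictionaries contain a candidate for the correct task, take the one with the highest score) are assumptions; the patent specifies only which pieces of information are stored.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    word: str
    score: float
    task: str  # control task this vocabulary item maps to


@dataclass
class PatternRecord:
    pattern: dict            # dictionary name -> highest recognition score
    correct_task: str
    misrecognized_task: str
    priority_dictionary: str


def extract_record(nbest_by_dict, misrecognized_task, achieved_task):
    """Build the record for a negated n-th utterance once the dialogue has succeeded."""
    pattern = {name: max(c.score for c in cands)
               for name, cands in nbest_by_dict.items() if cands}
    matches = [(c.score, name) for name, cands in nbest_by_dict.items()
               for c in cands if c.task == achieved_task]
    if not matches:
        return None  # the achieved task was not among the n-th utterance's candidates
    _, priority = max(matches)
    return PatternRecord(pattern, achieved_task, misrecognized_task, priority)


nbest = {
    "A": [Candidate("switch to TV", 0.62, "audio_switch")],
    "B": [Candidate("display convenience stores", 0.55, "facility_display")],
}
record = extract_record(nbest, "audio_switch", "facility_display")
```

Here dictionary B would be recorded as the priority dictionary, since it held the candidate for the task the user actually wanted even though dictionary A scored higher.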

The operation of the recognition characteristic extraction unit 409 in process A is described below using specific dialogue examples. FIG. 22 shows the storage conditions and the data to be stored in one dialogue example of the fourth embodiment, and FIG. 23 shows them for another dialogue example. As in the second embodiment, the dialogue examples in FIG. 22 and FIG. 23 show the case in which the response based on the recognition result of the initial utterance is negated or corrected, the responses based on the recognition results of subsequent utterances are neither negated nor corrected, and the series of control tasks is finally achieved; however, the process is not limited to this case. It also applies when a negation or correction occurs partway through the dialogue, the responses based on the recognition results of subsequent utterances are neither negated nor corrected, and the series of control tasks is finally achieved. In that case, the combination of the dictionaries in which the recognition result for the response immediately preceding the negation or correction is registered and the highest recognition score for each of those dictionaries is extracted as the recognition pattern.

FIG. 22 shows the storage conditions and the data to be stored for a dialogue in which the user has the system display convenience store icons on the navigation screen. As shown in FIG. 22, when the system utters “Please state your request” (step S411), the user makes the initial utterance “display convenience store” (step U411). The system recognizes the user's initial utterance and obtains its recognition result 1901 (step S412). The values in parentheses in FIG. 22 are recognition scores. The understanding unit 404 compares the recognition scores with a threshold and, when a recognition candidate exceeds the threshold, generates an understanding result from that candidate. In this dialogue example the threshold is 0.70. Here the understanding unit 404 determines that no recognition candidate exceeds the threshold and outputs a confirmation response using the recognition candidate with the maximum recognition score in the recognition result 1901. In FIG. 22 this is shown as confirmation (audio switching, audio type = television). As a result, the response generation unit 106 outputs the spoken response “Do you want to switch to television?”.
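The threshold comparison described above can be sketched as follows. This is only an illustrative Python sketch: the function name `decide_response` and the candidate scores are hypothetical (the actual scores appear only in FIG. 22), while the threshold 0.70 is the value stated for this dialogue example.

```python
from typing import List, Tuple

THRESHOLD = 0.70  # threshold stated for the FIG. 22 dialogue example


def decide_response(candidates: List[Tuple[str, float]],
                    threshold: float = THRESHOLD) -> Tuple[str, str]:
    """Return (action, text) for the top-scoring recognition candidate.

    action is "execute" when the top score exceeds the threshold and
    "confirm" otherwise, in which case the system asks for confirmation.
    """
    if not candidates:
        raise ValueError("no recognition candidates")
    text, score = max(candidates, key=lambda c: c[1])
    return ("execute" if score > threshold else "confirm", text)


# Hypothetical scores: no candidate exceeds 0.70, so the system confirms.
first_turn = decide_response(
    [("TV ON", 0.62), ("display convenience store", 0.58)])
```

With these assumed scores the top candidate is “TV ON”, which stays below the threshold, so the sketch returns a confirmation rather than an execution, mirroring step S412.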

The user responds to this by pressing the correction switch (step U412). A negative utterance such as “That's wrong” or “No” may be used instead. On the press of the correction switch, the system obtains “correction” as the understanding result, and the response generation unit 106 outputs, based on that understanding result, the response “Sorry, please speak again” (step S413). The user then utters “Convenience store display” (step U413). The system recognizes this next utterance and obtains its recognition result (step S414). The understanding unit 404 compares the recognition scores with the threshold, determines that no recognition candidate exceeds the threshold, and outputs a confirmation response using the recognition candidate with the maximum recognition score in the recognition candidate group. In FIG. 22 this is shown as confirmation (facility display, type = convenience store). As a result, the response generation unit 106 outputs the spoken response “Do you want to display a convenience store?”.

The user answers this confirmation response with “Yes” (step U414). The system recognizes the utterance, and the understanding unit 404 determines that a recognition candidate “Yes” exceeding the threshold has been found (step S415). It then issues, as the understanding result, a command to display convenience stores (execution (facility display, type = convenience store) in FIG. 22). As a result, the response generation unit 106 outputs the spoken response “I will display a convenience store”. The final control task, namely displaying convenience stores, is thereby achieved. As in the second embodiment, when the user's next utterance in step U413 is recognized, it is desirable to correct the recognition scores so that recognition candidates related to the control task for which the correction switch was just pressed (audio switching, audio type = television), for example “TV ON”, “Turn on TV” and “Watch TV”, are not obtained as top-ranked candidates. Specifically, this can be realized by having the dictionary control unit 411, when the understanding unit 404 calculates recognition scores for the recognition results of the (n+1)-th and subsequent utterances (the next utterance), subtract a predetermined value from the recognition score of any candidate identical to “TV ON”, the candidate with the maximum recognition score in the recognition result immediately before the negation or correction, that is, in the recognition result of the n-th utterance (the initial utterance).
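The score correction for a just-rejected candidate can be sketched as below. This is a minimal illustration under assumptions: the function name `penalize_rejected`, the penalty of 0.25, and the scores are all hypothetical; the text only specifies that a predetermined value is subtracted.

```python
from typing import List, Tuple


def penalize_rejected(candidates: List[Tuple[str, float]],
                      rejected_text: str,
                      penalty: float = 0.25) -> List[Tuple[str, float]]:
    """Subtract `penalty` from the candidate the user just rejected, then re-rank."""
    adjusted = [
        (text, score - penalty if text == rejected_text else score)
        for text, score in candidates
    ]
    # Highest adjusted score first.
    return sorted(adjusted, key=lambda c: c[1], reverse=True)


# Hypothetical scores for the re-spoken utterance; "TV ON" was rejected before.
reranked = penalize_rejected(
    [("TV ON", 0.625), ("display convenience store", 0.5)], "TV ON")
```

Under these assumed numbers “TV ON” would again have ranked first without the correction; after it, “display convenience store” moves to the top.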

In this dialogue example, the recognition characteristic extraction unit 409 determines whether the following storage conditions are satisfied:
- A correction was detected for the understanding result 1902 generated from the recognition result 1901 of the initial utterance (established at (a) in FIG. 22).
- An affirmation was detected (or no negation was detected) for the recognition result (step S414) of the next utterance (step U413) immediately after the correction (established at (b) in FIG. 22).
- The control task was finally determined (established at (c) in FIG. 22).
The recognition characteristic extraction unit 409 determines that these storage conditions are satisfied and extracts a recognition pattern from the initial utterance recognition candidate group 1903. It further extracts the misrecognition control task 1904 corresponding to the misrecognized understanding result 1902 and the correct answer control task 1906 corresponding to the finally determined understanding result 1905. It also extracts, as the priority dictionary, the dictionary C in which the recognition candidate “display convenience store” related to the correct answer control task 1906 is registered. Unlike the first and second embodiments, the recognition pattern is extracted from the initial utterance recognition candidate group 1903 as pairs of a recognition candidate's recognition score and the name of the dictionary in which that candidate is registered. The extracted misrecognition control task 1904, correct answer control task 1906 and priority dictionary are then stored in the recognition pattern table 410 in association with the recognition pattern.
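The storage-condition check and the extraction of a recognition pattern (dictionary name paired with the maximum score in that dictionary) might look like the following sketch. The names `Candidate`, `extract_pattern` and `build_row`, the row layout, and all scores are assumptions made for illustration; only the assignment of the television candidates to dictionary B and the convenience store candidate to dictionary C follows the description.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Candidate:
    text: str
    dictionary: str
    score: float


def extract_pattern(candidates: List[Candidate]) -> Dict[str, float]:
    """Recognition pattern: dictionary name -> maximum score in that dictionary."""
    pattern: Dict[str, float] = {}
    for c in candidates:
        pattern[c.dictionary] = max(c.score, pattern.get(c.dictionary, c.score))
    return pattern


def build_row(candidates: List[Candidate], corrected: bool, next_accepted: bool,
              task_achieved: bool, wrong_task: str, correct_task: str,
              correct_text: str) -> Optional[dict]:
    """Return a table row if all three storage conditions hold, else None."""
    if not (corrected and next_accepted and task_achieved):
        return None
    # Priority dictionary: the one holding the candidate tied to the correct task.
    priority = next((c.dictionary for c in candidates if c.text == correct_text), None)
    if priority is None:
        return None
    return {"pattern": extract_pattern(candidates), "wrong_task": wrong_task,
            "correct_task": correct_task, "priority_dictionary": priority}


initial_1903 = [Candidate("TV ON", "B", 0.62), Candidate("Turn on TV", "B", 0.55),
                Candidate("display convenience store", "C", 0.58)]
row = build_row(initial_1903, True, True, True,
                "audio switching", "facility display", "display convenience store")
```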

FIG. 23, by contrast, shows the storage conditions and the data to be stored for a dialogue in which the user sets a destination by entering an address. Unlike the dialogue example of FIG. 22, the dialogue example of FIG. 23 yields two candidates for the correct answer control task. This corresponds to a case where the user carries on several exchanges with the system while following a menu hierarchy. As shown in FIG. 23, when the system utters “Please state your request” (step S421), the user makes the initial utterance “Find a destination” (step U421). The system recognizes the initial utterance and obtains its recognition result 2001 (step S422). The values in parentheses in FIG. 23 are recognition scores. The understanding unit 404 determines the understanding result 2002 based on “return to home”, the recognition candidate with the maximum recognition score in the recognition result of the initial utterance. In the dialogue example of FIG. 23 it issues, as the understanding result, a command to search for a route home (execution (route search, destination = home) in FIG. 23). At the same time, the response generation unit 106 outputs the response “Searching for a route home”. Unlike the dialogue example of FIG. 22, and as in the second embodiment, the dialogue example of FIG. 23 uses a dialogue strategy that sets no threshold and, at least for the recognition result of the initial utterance, actively determines the understanding result using the recognition candidate with the maximum recognition score.

The user responds to this by pressing the correction switch (step U422). A negative utterance such as “That's wrong” or “No” may be used instead. On the press of the correction switch, the system obtains “correction” as the understanding result, and the response generation unit 106 outputs, based on that understanding result, the response “Sorry, please speak again” (step S423). The user then utters “Destination setting” (step U423). In the dialogue example of FIG. 23, the user attempts to achieve the control task with a next utterance, “Destination setting”, whose wording differs from the initial utterance. The system recognizes this next utterance and obtains its recognition result (step S424). Based on that recognition result, the understanding unit 404 issues, as the understanding result 2005, a command to set a destination (execution (destination setting method selection) in FIG. 23). At the same time, the response generation unit 106 outputs the response 2006, “Setting a destination. You can set it from home, facility name, facility address, facility phone number, history, or registered locations”. Prompted to choose a destination setting method, the user then makes the utterance after next, “Find by address” (step U424).

The system recognizes the user's utterance after next and obtains its recognition result (step S425). Based on that recognition result, the understanding unit 404 issues, as the understanding result 2007, a command to set the destination by address (execution (destination setting method selection, method = address) in FIG. 23). At the same time, the response generation unit 106 outputs the response “Please say the address, starting with the prefecture name”. The user then makes a further utterance, “Kanagawa Prefecture ...” (step U425). The dialogue about address input between the system and the user continues from there, and the understanding unit 404 issues, as the understanding result 2009, a command to execute a route search to Kanagawa Prefecture XX (execution (route search, destination = Kanagawa Prefecture XX) in FIG. 23). At the same time, the response generation unit 106 outputs the response “Setting Kanagawa Prefecture XX as the destination”. Through this dialogue, destination setting by address is finally completed.

In the dialogue example of FIG. 23, the recognition characteristic extraction unit 409 determines whether the following storage conditions are satisfied:
- A correction was detected for the understanding result 2002 generated from the recognition result 2001 of the initial utterance (established at (a) in FIG. 23).
- An affirmation was detected (or no negation was detected) for the recognition result (step S424) of the next utterance (step U423) immediately after the correction (established at (b) in FIG. 23).
- The control task was finally determined (established at (c) in FIG. 23).
The recognition characteristic extraction unit 409 determines that these storage conditions are satisfied and extracts a recognition pattern from the initial utterance recognition candidate group 2003. It further extracts the misrecognition control task 2004 corresponding to the misrecognized understanding result 2002. As candidates for the correct answer control task, it takes the control task 2008 corresponding to the intermediate understanding result 2005 and the control task 2010 corresponding to the finally determined understanding result 2009 as objects to be stored.

As noted above, control tasks 2008 and 2010 are the candidates for the correct answer control task. The recognition characteristic extraction unit 409 therefore determines which of the two candidates 2008 and 2010 is the correct answer control task. As in the second embodiment, this determination is made from the contents of the initial utterance recognition candidate group 2003. That is, for each control task that was not negated or corrected by the user, the unit determines whether a recognition candidate related to that control task is included in the initial utterance recognition candidate group 2003, the candidate group associated with the misrecognition control task. A control task for which such a recognition candidate is included is determined to be the correct answer control task. When recognition candidates related to both control tasks 2008 and 2010 are included in the initial utterance recognition candidate group 2003, the one with the higher recognition score is determined to be the correct answer control task. When neither is included, no correct answer control task exists, so the recognition pattern, the misrecognition control task, the correct answer control task and the priority dictionary described later are not extracted.

In the dialogue example of FIG. 23, no recognition candidate related to the control task 2010 (for example, “Kanagawa Prefecture XX City”) exists in the initial utterance recognition candidate group 2003, the candidate group associated with the misrecognition control task. The recognition candidate “Find a destination” related to the control task 2008, on the other hand, does exist in the initial utterance recognition candidate group 2003. The control task 2008 is therefore determined to be the correct answer control task and is extracted. The recognition characteristic extraction unit 409 also extracts, as the priority dictionary, the dictionary F in which the recognition candidate “Find a destination” related to the correct answer control task 2008 is registered. The extracted misrecognition control task 2004, correct answer control task 2008 and priority dictionary are then stored in the recognition pattern table 410 in association with the recognition pattern.
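The rule for choosing the correct answer control task among several non-negated tasks can be sketched as follows. The function name `pick_correct_task`, the task labels, the vocabulary lists per task, and the scores are hypothetical; the rule itself (a task qualifies if one of its candidates appears in the initial candidate group, the higher score wins, and nothing is stored if none appears) follows the description.

```python
from typing import Dict, List, Optional


def pick_correct_task(initial_candidates: Dict[str, float],
                      task_vocabulary: Dict[str, List[str]]) -> Optional[str]:
    """Pick the task whose vocabulary appears in the initial candidate group.

    initial_candidates maps candidate text to its score in the misrecognized
    utterance; task_vocabulary maps each non-negated task to its vocabulary.
    Returns None when no task has a candidate in the initial group.
    """
    best_task: Optional[str] = None
    best_score = float("-inf")
    for task, vocabulary in task_vocabulary.items():
        for text in vocabulary:
            score = initial_candidates.get(text)
            if score is not None and score > best_score:
                best_task, best_score = task, score
    return best_task


initial_2003 = {"return to home": 0.81, "Find a destination": 0.74}  # assumed scores
tasks = {
    "task 2008": ["Find a destination", "Destination setting"],
    "task 2010": ["Kanagawa Prefecture XX City"],
}
chosen = pick_correct_task(initial_2003, tasks)
```

In this sketch only task 2008 has a vocabulary item in the initial candidate group, so it is selected, as in the FIG. 23 example.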

An example of the recognition pattern table 410 stored from the dialogue examples of FIG. 22 and FIG. 23 is shown next. FIG. 24 shows an example of the recognition pattern table 410 shown in FIG. 19. The recognition pattern, misrecognition control task 1904, correct answer control task 1906 and priority dictionary extracted from the dialogue example of FIG. 22 are stored in row No. 1. The recognition pattern, misrecognition control task 2004, correct answer control task 2008 and priority dictionary extracted from the dialogue example of FIG. 23 are stored in row No. 2.

As described above, the recognition pattern of the fourth embodiment is the combination of the dictionaries in which the recognition results underlying the response immediately preceding the negation or correction are registered and the maximum recognition score for each of those dictionaries. The priority dictionary name is the name of the dictionary to be prioritized when a new recognition result for a new utterance of the user is obtained. In the dialogue example of FIG. 22, for instance, the understanding result should have been generated from a vocabulary item (recognition candidate) registered in dictionary C, but it was generated from one registered in dictionary B, which resulted in a misrecognition. To correct this, dictionary C is stored as the priority dictionary name. Likewise, in the dialogue example of FIG. 23, the understanding result should have been generated from a vocabulary item registered in dictionary F, but it was generated from one registered in dictionary A, which resulted in a misrecognition. To correct this, dictionary F is stored as the priority dictionary name.

As in the first embodiment, the dictionary control unit 411 of the fourth embodiment refers to the recognition pattern table 410 when the speech recognition unit 402 obtains a new recognition candidate group for a new utterance of the user. The dictionary control unit 411 gives priority to the correct answer control task when all of the following hold: the set of dictionaries in which the new recognition candidate group is registered is the same as the set of dictionaries in the recognition pattern, regardless of order; the control task determined from the new recognition candidate group is the same as the misrecognition control task; and the difference between the recognition scores of the new recognition candidate group and those of the recognition pattern is within a predetermined range. Specifically, in that case the dictionary control unit 411 adds a predetermined value to the recognition scores of the vocabulary items (recognition candidates) registered in the priority dictionary stored in the recognition pattern table 410, so that those items are recognized preferentially. Alternatively, it may subtract a predetermined value from the recognition scores of vocabulary items (recognition candidates) not registered in the priority dictionary, or it may refer to the correct answer control task stored in the recognition pattern table 410 and rewrite the control task directly. This reduces the likelihood that the same misrecognition occurs repeatedly.

The recognition scores of the new recognition candidate group will rarely match those of the recognition pattern exactly. In the fourth embodiment, therefore, the correct answer control task is prioritized when the difference between the recognition scores of the new recognition candidate group and those of the recognition pattern is within a predetermined range, for example ±α. The predetermined value α need not be used, however. Because the dictionary control unit 411 can set dictionary priorities appropriately when it detects a pattern like the dictionary appearance pattern of a past misrecognition, recognition performance can be expected to improve on the basis of the user's correction operations.
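The three matching conditions and the score boost for the priority dictionary can be sketched as below. The names `matches_row` and `boost_priority`, the row layout, α = 0.05, the bonus of 0.2, and all scores are assumptions chosen for illustration, not values from the specification.

```python
from typing import Dict, List, Tuple


def matches_row(new_pattern: Dict[str, float], new_task: str,
                row: dict, alpha: float = 0.05) -> bool:
    """True when dictionary set, determined task, and scores (within alpha) match."""
    if set(new_pattern) != set(row["pattern"]):  # same dictionaries, any order
        return False
    if new_task != row["wrong_task"]:  # same task as the past misrecognition
        return False
    return all(abs(new_pattern[d] - row["pattern"][d]) <= alpha for d in new_pattern)


def boost_priority(candidates: List[Tuple[str, str, float]], row: dict,
                   bonus: float = 0.2) -> List[Tuple[str, str, float]]:
    """Add `bonus` to candidates from the priority dictionary, then re-rank."""
    boosted = [(text, dic, score + bonus if dic == row["priority_dictionary"] else score)
               for text, dic, score in candidates]
    return sorted(boosted, key=lambda c: c[2], reverse=True)


stored_row = {"pattern": {"B": 0.62, "C": 0.58}, "wrong_task": "audio switching",
              "priority_dictionary": "C"}
new_candidates = [("TV ON", "B", 0.64), ("display convenience store", "C", 0.60)]
new_pattern = {"C": 0.60, "B": 0.64}

if matches_row(new_pattern, "audio switching", stored_row):
    new_candidates = boost_priority(new_candidates, stored_row)
```

Adding to the priority dictionary's scores is one of the three options the text lists; subtracting from the other dictionaries or rewriting the control task directly would replace `boost_priority` in this sketch.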

In the fourth embodiment, a recognition pattern is extracted for each control task, and the recognition pattern, control tasks and priority dictionary are stored in the recognition pattern table 410. Depending on how the vocabulary registered in the dictionaries is organized, however, a “score bias” may arise in which the recognition scores of a particular dictionary tend to come out high or low regardless of the control task. In particular, when dictionaries are classified by shared phrasing, as described later, vocabulary items that are also alike in length and acoustic characteristics are likely to be grouped together. In that case a particular dictionary may be recognized with high scores all the time, causing frequent misrecognition. It is therefore also possible to ignore the control task and simply record and accumulate pairs consisting of the dictionary in which the misrecognized recognition candidate (vocabulary item) is registered and the dictionary in which the correct vocabulary item (“Destination setting” in the dialogue example of FIG. 23) is registered. By analyzing these records periodically, a penalty can be applied to any dictionary for which the number of misrecognitions is found to exceed a predetermined value.
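The accumulation of (misrecognized dictionary, correct dictionary) pairs and the periodic check against a limit could be sketched as follows. The class name `ConfusionLog`, its methods, the recorded pairs, and the limit are all hypothetical.

```python
from collections import Counter
from typing import List


class ConfusionLog:
    """Accumulates (wrong dictionary, correct dictionary) pairs."""

    def __init__(self) -> None:
        self.pairs: Counter = Counter()

    def record(self, wrong_dictionary: str, correct_dictionary: str) -> None:
        self.pairs[(wrong_dictionary, correct_dictionary)] += 1

    def over_limit(self, limit: int) -> List[str]:
        """Dictionaries that caused more than `limit` misrecognitions in total."""
        per_dictionary: Counter = Counter()
        for (wrong, _correct), count in self.pairs.items():
            per_dictionary[wrong] += count
        return sorted(d for d, n in per_dictionary.items() if n > limit)


log = ConfusionLog()
for wrong, correct in [("A", "F"), ("A", "C"), ("A", "F"), ("B", "C")]:
    log.record(wrong, correct)

to_penalize = log.over_limit(2)  # dictionaries that would receive a penalty
```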

<Process B>
Process B monitors the dialogue leading to the achievement of a series of control tasks and accumulates information on the dictionaries used in recognition, thereby detecting which dictionaries are used frequently and which are used rarely. It reflects this in the priority order of the recognition dictionaries, or excludes dictionaries from recognition. Specifically, in process B of the fourth embodiment, as in the first embodiment, when the response produced by the response generation unit 106 based on the recognition result of the initial utterance is not negated or corrected and the series of control tasks is finally achieved, the recognition characteristic extraction unit 409 extracts a recognition pattern and the control task related to that recognition pattern. In process B of the fourth embodiment, the recognition pattern is the dictionary in which the recognition candidate with the maximum recognition score in the recognition result of the initial utterance is registered. The control task related to the recognition pattern is the control task finally achieved. The recognition characteristic extraction unit 409 stores the recognition pattern and the finally achieved control task in a recognition pattern table (not shown) in association with each other. The recognition characteristic extraction unit 409 also analyzes the recognition adoption frequency of each dictionary based on how often, for each finally achieved control task, the dictionary in which the top-scoring recognition candidate of a new recognition result for a new user utterance is registered is the same as the recognition pattern.

For example, consider the case where, in the recognition dictionary 403 shown in FIG. 20, an understanding result is generated from the vocabulary item (recognition candidate) “destination home” registered in dictionary A. When the control task related to that item (route search, destination = home) is not negated and is finally achieved, dictionary A, in which the recognition candidate “destination home” is registered, is stored as the recognition pattern in a recognition pattern table (not shown). The recognition characteristic extraction unit 409 also stores the control task in the recognition pattern table (not shown) in association with the recognition pattern (dictionary A). In addition, it stores in the recognition pattern table (not shown) the appearance frequency, that is, how often, for each control task, the dictionary in which the top-scoring recognition candidate of a new recognition result for a new user utterance is registered is the same as the recognition pattern (dictionary A). From this appearance frequency it calculates the recognition adoption frequency of each of the divided dictionaries. FIG. 25 shows an example of this analysis: it shows the recognition adoption frequency of each dictionary in the recognition characteristic extraction unit 409 shown in FIG. 19. As shown in FIG. 25, the recognition adoption frequency of the most used dictionary is set to 1.0 and the recognition adoption frequencies of the other dictionaries are calculated on that basis. FIG. 25 shows that the user makes heavy use of the vocabulary (phrasings) registered in dictionaries C and E and hardly uses the vocabulary (phrasings) registered in dictionaries A, B and D.
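One plausible way to obtain adoption frequencies in which the most used dictionary has the value 1.0 is to divide each dictionary's appearance count by the largest count. The sketch below assumes exactly that; the function name `adoption_frequencies` and the counts are hypothetical, and the specification does not state the exact formula.

```python
from typing import Dict


def adoption_frequencies(counts: Dict[str, int]) -> Dict[str, float]:
    """Scale appearance counts so that the most used dictionary maps to 1.0."""
    top = max(counts.values(), default=0)
    if top == 0:
        # Nothing has been adopted yet: every frequency is zero.
        return {name: 0.0 for name in counts}
    return {name: n / top for name, n in counts.items()}


# Hypothetical appearance counts per dictionary.
frequencies = adoption_frequencies({"A": 2, "B": 1, "C": 20, "D": 0, "E": 16})
```

With these assumed counts, dictionaries C and E come out high and A, B and D come out low, which is the qualitative picture described for FIG. 25.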

The dictionary control unit 411 gives recognition candidates registered in dictionaries with a high recognition adoption frequency priority over those registered in dictionaries with a low recognition adoption frequency. Specifically, for any dictionary whose recognition adoption frequency is below a threshold, the dictionary control unit 411 causes the speech recognition unit 402 to exclude that dictionary from the recognition targets. This prevents recognition speed and recognition accuracy from degrading. In the analysis example of FIG. 25, for instance, Th_4 is set as the threshold. The dictionary control unit 411 determines, for each dictionary, whether its recognition adoption frequency is below the threshold Th_4. When it is, the dictionary control unit 411 controls the speech recognition unit 402 so that the dictionary is excluded from the recognition targets. Alternatively, when the dictionaries in which the new recognition candidate group for a new user utterance is registered include a dictionary below the threshold Th_4, a penalty may be subtracted from the recognition scores of the recognition candidates registered in that dictionary.
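The threshold test against Th_4 can be sketched as a simple partition of the dictionaries. The function name `split_by_threshold`, the value 0.3 used for Th_4, and the frequencies are assumptions for illustration; the specification gives no numeric value for Th_4.

```python
from typing import Dict, List, Tuple


def split_by_threshold(frequencies: Dict[str, float],
                       th_4: float) -> Tuple[List[str], List[str]]:
    """Return (active, excluded): dictionaries at or above th_4, and those below it."""
    active = sorted(d for d, f in frequencies.items() if f >= th_4)
    excluded = sorted(d for d, f in frequencies.items() if f < th_4)
    return active, excluded


# Hypothetical adoption frequencies and threshold.
active, excluded = split_by_threshold(
    {"A": 0.1, "B": 0.05, "C": 1.0, "D": 0.0, "E": 0.8}, th_4=0.3)
```

For the penalty variant mentioned in the text, the `excluded` list would instead identify the dictionaries whose candidates have a penalty subtracted from their scores.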

FIG. 26 shows an example of the recognition dictionary 403 after dictionaries whose recognition adoption frequency is below the threshold Th_4 have been excluded from the recognition targets; it illustrates how the dictionary control unit 411 shown in FIG. 19 controls the recognition dictionary 403. FIG. 26(a) shows the recognition dictionary 403 in its initial state (at product shipment), in which all dictionaries are loaded into memory as recognition targets. FIG. 26(b) shows the recognition dictionary 403 after dictionaries with a low recognition adoption frequency have been excluded from the recognition targets based on the analysis example of FIG. 25. In the case of FIG. 26(b), dictionaries A, B and D are not loaded into memory and are not recognition targets. Because dictionary priorities are thus determined as the user's phrasing habits settle, the likelihood that a dictionary of phrasings the user almost never utters supplies a recognition candidate and causes a misrecognition can be reduced. Moreover, because process B reduces the number of vocabulary items to be recognized, the resources required for speech recognition (memory footprint, computation time and so on) can be reduced substantially.

Process B is preferably suspended temporarily when the user issues a negation or correction. That is, when a response based on the new recognition result for a user's new utterance is negated or corrected, the dictionary control unit 411 controls the speech recognition unit 402 to restore all dictionaries that had been excluded from the recognition targets. The dictionary control unit 411 also controls the speech recognition unit 402 to perform speech recognition again on the new utterance that immediately preceded the negated or corrected response. If the final control task is then determined in the subsequent dialogue on the basis of vocabulary (recognition candidates) from a dictionary that had been excluded, the recognition adoption frequency of that dictionary is corrected using this information. Likewise, in the method that subtracts a penalty from the recognition score as described above, the penalty subtraction is suspended when a response based on the new recognition result for a user's new utterance is negated or corrected, and the recognition adoption frequency of the dictionary concerned is corrected in the same way.
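The suspension of process B on a negation or correction can be sketched as a small state holder. The class and method names are hypothetical, and the specification does not state how the adoption frequency is "corrected"; incrementing a per-dictionary count is an assumption made purely for illustration.

```python
class DictionaryControl:
    """Illustrative sketch of suspending process B (names are hypothetical)."""

    def __init__(self, all_dicts, adoption_count, th_4):
        self.all_dicts = set(all_dicts)
        self.adoption_count = dict(adoption_count)
        self.th_4 = th_4
        self.suspended = False

    def targets(self):
        """Dictionaries currently used as recognition targets."""
        if self.suspended:
            return set(self.all_dicts)
        return {d for d in self.all_dicts
                if self.adoption_count.get(d, 0) >= self.th_4}

    def on_negation_or_correction(self):
        """Restore every excluded dictionary; the caller then re-recognizes
        the utterance that preceded the rejected response."""
        self.suspended = True
        return self.targets()

    def on_task_achieved(self, adopted_dict):
        """Correct the adoption count of the dictionary that led to the final
        task (assumed here to be a simple increment) and resume process B.
        Returns True if that dictionary had been excluded."""
        was_excluded = self.adoption_count.get(adopted_dict, 0) < self.th_4
        self.adoption_count[adopted_dict] = self.adoption_count.get(adopted_dict, 0) + 1
        self.suspended = False
        return was_excluded
```

A dictionary that repeatedly turns out to be needed after corrections would, under this assumption, eventually climb back above the threshold and be restored permanently.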

Processes A and B described above can be used together. Using both corrects misrecognitions caused by acoustic similarity while also enabling and disabling dictionaries appropriately as the user's phrasing becomes established. Although the parallel recognition scheme of the fourth embodiment has been described, for convenience, with the recognition dictionary 403 itself divided, physical division is not the only option: an equivalent function can be realized with a single dictionary by assigning each vocabulary item a group name based on shared phrasing and using that name for identification.

As described above, the voice interaction device according to the fourth embodiment includes a recognition dictionary 403 comprising a plurality of dictionaries, divided by classification of the vocabulary to be recognized and each registering a plurality of vocabulary items for one control task, and a speech recognition unit 402 that recognizes against these dictionaries in parallel. It further includes a recognition characteristic extraction unit 409. When a response by the response generation unit 106 based on the recognition result of the n-th utterance is negated or corrected, no negation or correction occurs for responses based on the recognition results of the (n+1)-th and subsequent utterances, and a series of control tasks is finally achieved, the recognition characteristic extraction unit 409 extracts, as a recognition pattern, the combination of the dictionaries in which the recognition candidate group of the n-th utterance is registered and the maximum recognition score for each of those dictionaries. The recognition characteristic extraction unit 409 also extracts the correct control task, which is the control task, among those that were not negated or corrected, that relates to a recognition candidate included in the recognition candidate group of the n-th utterance. It further extracts the misrecognized control task, which is the control task determined on the basis of the recognition candidate group of the n-th utterance. The recognition characteristic extraction unit 409 stores the correct control task and the misrecognized control task in the recognition pattern table 410 in association with the recognition pattern. It also stores the priority dictionary, which is the dictionary in which the recognition candidate related to the correct control task is registered, in the recognition pattern table 410 in association with the recognition pattern. The device further includes a dictionary control unit 411 that prioritizes the correct control task based on the recognition pattern and the correct control task. Specifically, the dictionary control unit 411 prioritizes the correct control task when (i) the set of dictionaries in which the new recognition candidate group for a user's new utterance is registered is the same, irrespective of order, as the set of dictionaries of the recognition pattern, (ii) the control task determined from the new recognition candidate group is the same as the misrecognized control task, and (iii) the difference between the recognition scores of the new recognition candidate group and those of the recognition pattern is within a predetermined range. This reduces the likelihood that the same misrecognition occurs repeatedly.
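The three conditions under which the dictionary control unit 411 prioritizes the correct control task can be sketched as a single check. The data layout, the names, and the reading of "difference within a predetermined range" as a per-dictionary absolute difference are assumptions made for this illustration; the specification does not fix them.

```python
from typing import Dict, NamedTuple, Optional


class RecognitionPattern(NamedTuple):
    scores: Dict[str, float]   # dictionary name -> max score in the n-th utterance
    misrecognized_task: str    # task decided from the n-th utterance (rejected)
    correct_task: str          # task finally achieved without negation
    priority_dictionary: str   # dictionary holding the correct candidate


def prioritized_task(new_scores: Dict[str, float],
                     decided_task: str,
                     pattern: RecognitionPattern,
                     tolerance: float) -> Optional[str]:
    """Return the correct task when the new utterance matches the stored
    pattern, otherwise None (leave the normal decision untouched)."""
    same_dicts = set(new_scores) == set(pattern.scores)       # order-independent
    same_task = decided_task == pattern.misrecognized_task
    if not (same_dicts and same_task):
        return None
    close = all(abs(new_scores[d] - pattern.scores[d]) <= tolerance
                for d in pattern.scores)
    return pattern.correct_task if close else None
```

Requiring all three conditions keeps the override narrow: the stored correction is applied only when the new utterance looks like a repeat of the earlier misrecognition.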

In the fourth embodiment, moreover, when the understanding unit 404 calculates recognition scores for the recognition results of the (n+1)-th and subsequent utterances, the dictionary control unit 411 causes a predetermined value to be subtracted from the recognition score of any recognition candidate identical to the one that had the maximum recognition score in the recognition result immediately preceding the negation or correction, that is, in the recognition result of the n-th utterance. This reduces the likelihood that the response based on the recognition results of the (n+1)-th and subsequent utterances is the same as the response based on the recognition result of the n-th utterance.
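The effect of subtracting a predetermined value from the previously rejected top candidate can be illustrated with a short rescoring sketch. The function name, the `(word, score)` representation, and the numeric values are hypothetical.

```python
from typing import List, Tuple


def rescore_after_rejection(n_best: List[Tuple[str, float]],
                            rejected_word: str,
                            penalty: float) -> List[Tuple[str, float]]:
    """Subtract `penalty` from the candidate that topped the rejected n-th
    utterance, then re-rank the candidates by score (highest first)."""
    rescored = [(w, s - penalty if w == rejected_word else s) for w, s in n_best]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Hypothetical (n+1)-th result: "word_a" was the rejected top candidate.
ranked = rescore_after_rejection([("word_a", 0.70), ("word_b", 0.65)], "word_a", 0.10)
print(ranked[0][0])
```

With these invented scores the penalty is large enough for the second candidate to overtake the rejected one, so the system does not repeat the same response.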

In the fourth embodiment, the recognition characteristic extraction unit 409 also extracts the finally achieved control task when no negation or correction occurs for the response by the response generation unit 106 based on the recognition result of the initial utterance and a series of control tasks is finally achieved. It further extracts, as the recognition pattern, the dictionary in which the recognition candidate having the maximum recognition score among the recognition results of the initial utterance is registered. The recognition characteristic extraction unit 409 stores the finally achieved control task and the recognition pattern in association with each other in a recognition pattern table (not shown). In addition, for each finally achieved control task, it stores in the recognition pattern table the appearance frequency, that is, the frequency with which the dictionary registering the top-scoring recognition candidate among the new recognition results for a user's new utterance is the same as the recognition pattern. The recognition characteristic extraction unit 409 analyzes the recognition adoption frequency of each dictionary on the basis of this appearance frequency. The dictionary control unit 411 gives priority to recognition candidates registered in dictionaries with a high recognition adoption frequency over those registered in dictionaries with a low recognition adoption frequency. This improves recognition performance as the user's phrasing becomes established. Specifically, for a dictionary whose recognition adoption frequency is below the threshold Th_4, the dictionary control unit 411 causes the speech recognition unit 402 to exclude that dictionary from the recognition targets. Because the vocabulary to be recognized is thereby narrowed as the user's phrasing becomes established, both recognition performance and recognition speed can be improved.
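One way to derive a per-dictionary recognition adoption frequency from logged dialogues is sketched below. The specification says only that the frequency is analyzed from the appearance frequency; normalizing counts into a relative frequency, as well as the function name and input format, are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, Iterable, List


def adoption_frequency(all_dicts: Iterable[str],
                       top_dictionaries: List[str]) -> Dict[str, float]:
    """Relative frequency with which each dictionary supplied the top-scoring
    candidate in dialogues that completed without negation or correction.

    `top_dictionaries` holds one dictionary name per logged utterance.
    Dictionaries that never supplied a top candidate get 0.0.
    """
    counts = Counter(top_dictionaries)
    total = len(top_dictionaries)
    return {d: (counts[d] / total if total else 0.0) for d in all_dicts}


print(adoption_frequency(["A", "C", "E"], ["C", "C", "E", "C"]))
```

The resulting mapping can be compared directly against a threshold such as Th_4 to decide which dictionaries remain recognition targets.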

In the fourth embodiment, moreover, when a response based on the new recognition result for a user's new utterance is negated or corrected, the dictionary control unit 411 causes the dictionaries excluded from the recognition targets to be restored and causes speech recognition to be performed again on the new utterance immediately preceding the negated or corrected response. Thus, even when an utterance could not be recognized because the target vocabulary had been narrowed, it can be recognized correctly after the negation or correction.

The embodiments described above are examples of implementing the present invention; the scope of the invention is not limited to them, and the invention is applicable to various other embodiments within the scope of the claims. For example, in the voice interaction devices according to the first and second embodiments, the appearance frequency is stored in the recognition pattern table 110, but this is not essential and the appearance frequency need not be stored. Storing it, however, makes it possible, for example, to add the bonus value only when the appearance frequency exceeds a predetermined value, or to enlarge the margin of the bonus value for recognition patterns with a higher appearance frequency. Similarly, the bonus value is stored in the recognition pattern table 110, but this is not essential and the bonus value need not be stored. In that case, for example, when the new recognition candidate group for a user's new utterance is the same as the recognition pattern irrespective of order, the result may simply be rewritten directly to the control task associated with the recognition pattern.
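The optional gating of the bonus value by appearance frequency can be sketched as follows. All names and the `(word, score)` representation are hypothetical; the order-independent comparison of the candidate group with the recognition pattern follows the description above.

```python
from typing import Iterable, List, Tuple


def apply_bonus(n_best: List[Tuple[str, float]],
                pattern_words: Iterable[str],
                target_word: str,
                bonus: float,
                appearance: int,
                min_appearance: int) -> List[Tuple[str, float]]:
    """Add `bonus` to `target_word` only when the new candidate group equals
    the stored pattern (ignoring order) and the pattern has appeared more
    than `min_appearance` times; otherwise return the scores unchanged."""
    same_group = {w for w, _ in n_best} == set(pattern_words)
    if not same_group or appearance <= min_appearance:
        return list(n_best)
    return [(w, s + bonus if w == target_word else s) for w, s in n_best]
```

Gating on the appearance count keeps a pattern that was observed only once or twice from altering scores before it has proved to be stable.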

In the voice interaction devices according to the first and second embodiments, after the speech recognition unit 102 acquires the N-best list, the understanding unit 104 calculates a recognition score for each recognition candidate based on the N-best list and determines whether any of the calculated recognition scores exceeds a threshold (step S102 in FIG. 9). This is not essential, however, and the control processing of step S102 may be omitted. In that case, though, the recognition pattern table 110 is always consulted and the bonus value is added whenever the new recognition candidate group is the same as a recognition pattern irrespective of order, so an increase in computation time is a concern.

In the second and fourth embodiments, when the understanding units 104 and 404 calculate recognition scores for the recognition results of the (n+1)-th and subsequent utterances, a predetermined value, that is, a penalty, is subtracted from the recognition score of the recognition candidate identical to the one that had the maximum recognition score in the recognition result of the n-th utterance. This is not essential, and the penalty may be omitted. Subtracting the predetermined value, however, reduces the likelihood that the response based on the recognition results of the (n+1)-th and subsequent utterances is the same as the response based on the recognition result of the n-th utterance.

In the second embodiment, the recognition pattern table 110 stores a bonus value, which is a correction value calculated from the difference between the recognition score of the recognition candidate related to the control task associated with the recognition pattern and the maximum recognition score of the recognition pattern. This is not essential; a penalty, which is a correction value calculated from the difference between the maximum recognition score of the recognition pattern and a predetermined threshold, may be stored instead. In that case, the dictionary control unit 211 causes the penalty to be subtracted from the recognition score of the new recognition candidate identical to the recognition candidate having the maximum recognition score in the recognition pattern.

FIG. 1 is a block diagram showing the basic configuration of the voice interaction device according to the first embodiment of the present invention.
FIG. 2 is a block diagram showing means for implementing the voice interaction device shown in FIG. 1.
FIG. 3 shows an example of the recognition dictionary shown in FIG. 1.
FIG. 4 shows an example of the function table shown in FIG. 1.
FIG. 5 shows an example of the response table shown in FIG. 1.
FIG. 6 shows the storage conditions and the data to be stored in a dialogue example of the first embodiment.
FIG. 7 shows an example of the recognition pattern table shown in FIG. 1.
FIG. 8 shows a dialogue example in which the bonus value shown in FIG. 7 is applied.
FIG. 9 is a flowchart showing the flow of control processing of the voice interaction device according to the first embodiment.
FIG. 10 is a block diagram showing the basic configuration of the voice interaction device according to the second embodiment of the present invention.
FIG. 11 shows the storage conditions and the data to be stored in a dialogue example of the second embodiment.
FIG. 12 is a block diagram showing the basic configuration of the voice interaction device according to the third embodiment of the present invention.
FIG. 13 shows an example of the recognition dictionary shown in FIG. 12.
FIG. 14 shows the recognition adoption frequency of each vocabulary item in the recognition characteristic extraction unit shown in FIG. 12.
FIG. 15 shows an example comparison of the recognition adoption frequency with the threshold in the dictionary control unit shown in FIG. 12.
FIG. 16 shows another example comparison of the recognition adoption frequency with the threshold in the dictionary control unit shown in FIG. 12.
FIG. 17 shows an example of the recognition dictionary after the dictionary control unit shown in FIG. 12 has controlled the recognition vocabulary.
FIG. 18 shows the change over time of the recognition adoption frequency shown in FIG. 16.
FIG. 19 is a block diagram showing the basic configuration of the voice interaction device according to the fourth embodiment of the present invention.
FIG. 20 shows an example of the recognition dictionary shown in FIG. 19.
FIG. 21 shows another example of the recognition dictionary shown in FIG. 19.
FIG. 22 shows the storage conditions and the data to be stored in a dialogue example of the fourth embodiment.
FIG. 23 shows the storage conditions and the data to be stored in another dialogue example of the fourth embodiment.
FIG. 24 shows an example of the recognition pattern table shown in FIG. 19.
FIG. 25 shows the recognition adoption frequency of each dictionary in the recognition characteristic extraction unit shown in FIG. 19.
FIG. 26 shows an example of how the dictionary control unit shown in FIG. 19 controls the recognition dictionary.

Explanation of symbols

101 voice input unit, 102 speech recognition unit,
103 recognition dictionary, 104 understanding unit, 105 function table,
106 response generation unit, 107 response table, 108 output unit,
109 recognition characteristic extraction unit, 110 recognition pattern table,
111 dictionary control unit,
201 microphone, 202 A/D converter, 203 arithmetic unit,
204 storage device, 205 D/A converter, 206 speaker/display device,
209 recognition characteristic extraction unit, 211 dictionary control unit,
302 speech recognition unit, 303 recognition dictionary, 309 recognition characteristic extraction unit,
310 recognition pattern table, 311 dictionary control unit,
402 speech recognition unit, 403 recognition dictionary, 404 understanding unit,
409 recognition characteristic extraction unit, 410 recognition pattern table,
411 dictionary control unit,
1001, 1901, 2001 recognition result,
1002 recognition pattern, 1003, 1005 understanding result,
1004, 1006 control task,
1301, 1302, 1303, 1304 vocabulary,
1401, 1402, 1403, 1404, 1405 vocabulary,
1902, 1905, 2002, 2005, 2007, 2009 understanding result,
1903, 2003 initial-utterance recognition candidate group,
1904, 1906, 2004, 2008, 2010 control task,
2006 response

Claims

A voice interaction device comprising:
speech recognition means for comparing a user's utterance with the vocabulary of a recognition dictionary and acquiring a combination of at least one recognition candidate as a recognition result;
understanding means for generating an understanding state of the system based on the recognition result and determining, from the understanding state, a task intended by the user;
response means for returning a response to the user based on the task;
recognition characteristic extraction means for monitoring a series of dialogues leading to achievement of the task, extracting a combination related to the recognition result as a recognition pattern, and extracting a task related to the recognition pattern; and
dictionary control means for prioritizing the task based on the recognition pattern and the task related to the recognition pattern.

The voice interaction device according to claim 1, wherein
the understanding means calculates a recognition score for each recognition candidate and generates the understanding state from the recognition candidate having the largest recognition score, and
the recognition characteristic extraction means stores the extracted recognition pattern and the task related to the recognition pattern in a recognition pattern table.

The voice interaction device according to claim 2, wherein the recognition characteristic extraction means performs the storing when no negation or correction occurs for a response by the response means based on the recognition result of an initial utterance and a series of the tasks is finally achieved.

The voice interaction device according to claim 3, wherein
the recognition pattern is the combination of the recognition candidates of the initial utterance,
the task related to the recognition pattern is the task finally achieved, and
the dictionary control means, when a new recognition candidate group for a new utterance of the user is the same as the recognition pattern, prioritizes the new recognition candidate that is identical to the recognition candidate having the largest recognition score in the recognition pattern.

The voice interaction device according to claim 4, wherein
the recognition characteristic extraction means stores, in the recognition pattern table, for each task finally achieved, the frequency with which the new recognition candidate group is the same as the recognition pattern, and
the dictionary control means performs the prioritization only when the frequency exceeds a predetermined value.

The voice interaction device according to claim 4 or 5, wherein
the recognition characteristic extraction means stores, in the recognition pattern table, a correction value calculated from the difference between the maximum recognition score of the recognition pattern and a predetermined threshold, in association with the recognition candidate having the maximum recognition score, and
the dictionary control means adds the correction value to the recognition score of the new recognition candidate that is identical to the recognition candidate having the maximum recognition score.

The voice interaction device according to claim 6, wherein
the recognition characteristic extraction means stores user identification information in the recognition pattern table in association with the recognition pattern and the task finally achieved, and
the dictionary control means, when the user identification information based on the new utterance is the same as the stored user identification information, further adds a predetermined value set according to the user to the recognition score of the new recognition candidate that is identical to the recognition candidate having the maximum recognition score.

The voice interaction device according to claim 6 or 7, wherein
the recognition characteristic extraction means stores noise environment information in the recognition pattern table in association with the recognition pattern and the task finally achieved, and
the dictionary control means, when the noise environment information based on the new utterance is the same as the stored noise environment information, further adds a predetermined value set according to the noise environment to the recognition score of the new recognition candidate that is identical to the recognition candidate having the maximum recognition score.

The voice interaction device according to claim 3, wherein
the recognition dictionary registers a plurality of vocabulary items to be recognized for one task,
the recognition pattern is the combination of the recognition candidates of the initial utterance,
the task related to the recognition pattern is the task finally achieved,
the recognition characteristic extraction means analyzes the recognition adoption frequency of each vocabulary item within the task based on the frequency, for each task finally achieved, with which the recognition candidate having the largest recognition score among the new recognition results for a new utterance of the user is the same as the recognition candidate having the largest recognition score in the recognition pattern, and
the dictionary control means causes the speech recognition means to exclude from the recognition targets any vocabulary item whose recognition adoption frequency is below a threshold.

The voice interaction device according to claim 9, wherein the dictionary control means, when a response based on the new recognition result for the new utterance of the user is negated or corrected, causes the vocabulary excluded from the recognition targets to be restored to the recognition targets and causes speech recognition to be performed again on the new utterance immediately preceding the negated or corrected response.

The voice interaction device according to claim 3, wherein
the recognition dictionary comprises a plurality of dictionaries divided by classification of the vocabulary to be recognized,
each dictionary registers a plurality of the vocabulary items for one task,
the speech recognition means recognizes against the dictionaries in parallel,
the recognition pattern is the dictionary in which the recognition candidate having the maximum recognition score among the recognition results of the initial utterance is registered,
the task related to the recognition pattern is the task finally achieved,
the recognition characteristic extraction means analyzes the recognition adoption frequency of each dictionary based on the frequency, for each task finally achieved, with which the dictionary registering the recognition candidate having the maximum recognition score among the new recognition results for a new utterance of the user is the same as the recognition pattern, and
the dictionary control means gives priority to recognition candidates registered in a dictionary with a high recognition adoption frequency over recognition candidates registered in a dictionary with a low recognition adoption frequency.

The voice interaction device according to claim 11, wherein the dictionary control means causes the speech recognition means to exclude from the recognition targets any dictionary whose recognition adoption frequency is below a threshold.

The voice interaction device according to claim 12, wherein the dictionary control means, when a response based on the new recognition result for the new utterance of the user is negated or corrected, causes the dictionary excluded from the recognition targets to be restored to the recognition targets and causes speech recognition to be performed again on the new utterance immediately preceding the negated or corrected response.

The voice interaction device according to claim 2, wherein the recognition characteristic extraction means performs the storing when a response based on the recognition result of an n-th utterance is negated or corrected, no negation or correction occurs for responses based on the recognition results of the (n+1)-th and subsequent utterances, and a series of the tasks is finally achieved.

The voice interaction device according to claim 14, wherein
the recognition pattern is the recognition result of the n-th utterance,
the task related to the recognition pattern is the task, among tasks for which no negation or correction occurred, that relates to a recognition candidate included in the recognition pattern, and
the dictionary control means, when a new recognition candidate group for a new utterance of the user is the same as the recognition pattern, prioritizes the new recognition candidate that is identical to that recognition candidate.

The voice interaction device according to claim 15, wherein
the recognition characteristic extraction means stores, in the recognition pattern table, for each task related to the recognition pattern, the frequency with which the new recognition candidate group is the same as the recognition pattern, and
the dictionary control means performs the prioritization only when the frequency exceeds a predetermined value.

The voice interaction device according to claim 15 or 16, wherein
the recognition characteristic extraction means stores, in the recognition pattern table, a correction value calculated based on the difference between the recognition score of the recognition candidate related to the task related to the recognition pattern and the maximum recognition score of the recognition pattern, in association with that recognition candidate, and
the dictionary control means adds the correction value to the recognition score of the new recognition candidate that is identical to the recognition candidate related to the task related to the recognition pattern.

The voice interaction device according to claim 15 or 16, wherein
the recognition characteristic extraction means stores, in the recognition pattern table, a correction value calculated from the difference between the maximum recognition score of the recognition pattern and a predetermined threshold, in association with the recognition candidate having the maximum recognition score, and
the dictionary control means subtracts the correction value from the recognition score of the new recognition candidate that is identical to the recognition candidate having the maximum recognition score.

The recognition dictionary includes a plurality of dictionaries divided for each vocabulary classification to be recognized,
The dictionary registers a plurality of the vocabularies for one task,
The speech recognition means recognizes the dictionary in parallel,
The recognition pattern is a combination of a dictionary in which the recognition candidate group of the nth utterance is registered and a maximum recognition score for each dictionary,
The task related to the recognition pattern is a task related to a recognition candidate included in the recognition candidate group of the utterance of the nth time among tasks in which negation and correction did not exist,
The recognition characteristic extracting means includes a task determined based on the recognition candidate group of the n-th utterance and a priority dictionary that is a dictionary in which the recognition candidates related to the task related to the recognition pattern are registered. Remember to the table,
When the dictionary in which a new recognition candidate group for a new utterance of the user is registered is the same as the dictionary of the recognition pattern, the task determined based on the new recognition candidate group is the same as the task determined based on the recognition candidate group of the nth utterance, and the difference between the recognition score of the new recognition candidate group and the recognition score of the recognition pattern is within a predetermined range,
The spoken dialogue apparatus according to claim 14, wherein the dictionary control means gives priority to the recognition candidates registered in the priority dictionary.
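
The three conditions under which the priority dictionary is used can be sketched as follows. The data layout, the field names, the dictionary and task names, and the per-dictionary absolute-difference test are assumptions chosen for illustration; the claim only requires that the score difference be within a predetermined range.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PatternEntry:
    # Recognition pattern: maximum recognition score per dictionary (nth utterance).
    dict_scores: dict[str, float]
    # Task that was wrongly determined from the nth utterance.
    decided_task: str
    # Dictionary holding the candidates of the task the user actually wanted.
    priority_dictionary: str


def select_priority_dictionary(entry: PatternEntry,
                               new_scores: dict[str, float],
                               new_task: str,
                               tolerance: float) -> Optional[str]:
    """Return the priority dictionary when all three claimed conditions hold."""
    same_dictionaries = set(new_scores) == set(entry.dict_scores)
    same_task = new_task == entry.decided_task
    scores_close = same_dictionaries and all(
        abs(new_scores[d] - entry.dict_scores[d]) <= tolerance
        for d in new_scores
    )
    if same_dictionaries and same_task and scores_close:
        return entry.priority_dictionary
    return None


entry = PatternEntry({"facility": 0.75, "address": 0.5},
                     decided_task="search_facility",
                     priority_dictionary="address")
hit = select_priority_dictionary(entry, {"facility": 0.75, "address": 0.625},
                                 "search_facility", tolerance=0.125)
miss = select_priority_dictionary(entry, {"facility": 0.75, "address": 0.625},
                                  "search_address", tolerance=0.125)
```

Here `hit` names the dictionary whose candidates should be preferred, while `miss` is `None` because the newly determined task differs from the stored one.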

The spoken dialogue apparatus according to any one of claims 14 to 17, wherein, when the understanding means calculates the recognition score for the recognition result of the (n+1)th or a subsequent utterance, the dictionary control means subtracts a predetermined value from the recognition score of the recognition candidate that is the same as the recognition candidate having the maximum recognition score among the recognition results of the nth utterance.

The recognition characteristic extraction unit stores user identification information in the recognition pattern table in association with the recognition pattern and the task related to the recognition pattern,
The spoken dialogue apparatus according to claim 17, wherein, when the user identification information based on the new utterance and the stored user identification information are the same, the dictionary control means further adds a predetermined value set in accordance with the user to the recognition score of the new recognition candidate that is the same as the recognition candidate related to the task related to the recognition pattern.

The recognition characteristic extraction unit stores noise environment information in the recognition pattern table in association with the task related to the recognition pattern and the recognition pattern,
The spoken dialogue apparatus according to claim 17 or 21, wherein, when the noise environment information based on the new utterance and the stored noise environment information are the same, the dictionary control means further adds a predetermined value set according to the noise environment to the recognition score of the new recognition candidate that is the same as the recognition candidate related to the task related to the recognition pattern.
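
The user-dependent and noise-dependent additions recited in the last two claims can be combined in one minimal sketch. The bonus tables, their values, the identifiers, and the record layout are hypothetical; the claims specify only that a predetermined value set per user or per noise environment is added when the stored and current information match.

```python
from dataclasses import dataclass


@dataclass
class StoredPattern:
    candidate: str     # candidate related to the task of the recognition pattern
    correction: float  # correction value stored with the pattern
    user_id: str       # user identification information stored with the pattern
    noise_env: str     # noise environment information stored with the pattern


# Assumed predetermined values per user and per noise environment.
USER_BONUS = {"driver_a": 0.125}
NOISE_BONUS = {"highway": 0.0625}


def rescore(candidate: str, score: float, stored: StoredPattern,
            user_id: str, noise_env: str) -> float:
    """Add the stored correction, then the context bonuses that match."""
    if candidate != stored.candidate:
        return score
    score += stored.correction
    if user_id == stored.user_id:
        score += USER_BONUS.get(user_id, 0.0)
    if noise_env == stored.noise_env:
        score += NOISE_BONUS.get(noise_env, 0.0)
    return score


stored = StoredPattern("Yokosuka", 0.25, "driver_a", "highway")
same_context = rescore("Yokosuka", 0.5, stored, "driver_a", "highway")
other_user = rescore("Yokosuka", 0.5, stored, "driver_b", "highway")
```

Keeping the user and noise conditions as independent additions mirrors the claim structure, in which each condition contributes its own predetermined value on top of the base correction.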