JP2008286930A - Voice interactive device

Info

Publication number: JP2008286930A
Application number: JP2007130585A
Authority: JP
Inventors: Shinichi Satomi; 真一里見; Noriyoshi Matsuo; 典義松尾; Masashi Nishida; 昌史西田; Yasuo Horiuchi; 靖雄堀内; Akira Ichikawa; 熹市川
Original assignee: Chiba University NUC; Fuji Heavy Industries Ltd
Current assignee: Subaru Corp; Chiba University NUC
Priority date: 2007-05-16
Filing date: 2007-05-16
Publication date: 2008-11-27

Abstract

PROBLEM TO BE SOLVED: To make interaction with a user proceed smoothly by setting the system utterance appropriately.

SOLUTION: Prediction recognition processing, which recognizes the user utterance based on a predicted sentence, is performed (Step 2), and large vocabulary recognition processing, which recognizes the user utterance based on a large vocabulary dictionary, is performed (Step 3). The prediction likelihood calculated in the prediction recognition processing is compared with the large vocabulary likelihood calculated in the large vocabulary recognition processing (Step 4). When the prediction likelihood is determined to exceed the large vocabulary likelihood, partial word recognition processing for recognizing the user utterance is performed (Step 6). The sentence-unit prediction recognition result calculated in the prediction recognition processing is then compared with the word-unit partial word recognition result calculated in the partial word recognition processing, and the degree of matching between the recognition results (complete match, partial match, or complete mismatch) is determined (Step 7). The content of the system utterance is set separately according to the determined degree of matching (Steps 8 to 10).

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a voice interaction apparatus that recognizes a user utterance and responds to it.

In recent years, in fields such as car navigation systems and call centers, speech recognition apparatuses have been developed that automatically recognize speech uttered by a user (hereinafter referred to as a user utterance) and execute various processes. Such an apparatus extracts feature amounts from the input speech by speech analysis and then recognizes the content of the user utterance by searching for the sequence of acoustic models that matches those feature amounts. To improve recognition accuracy, a language model that defines word occurrence probabilities based on language statistics, grammar, and the like is also provided, and the search for the matching sequence of acoustic models is carried out under the constraints of this language model. Furthermore, voice interaction apparatuses have been developed that do not merely recognize the user utterance and output the recognition result, but also set the speech emitted from the apparatus side (hereinafter referred to as a system utterance) according to the content of the user utterance, so that a dialogue proceeds between the user and the apparatus (see, for example, Patent Document 1).
Patent Document 1: JP 2001-175279 A

However, the voice interaction apparatus described in Patent Document 1 sets the content of the system utterance that responds to a user utterance separately for only two cases: the case where recognition of the user utterance succeeded and the case where it failed. When the content of the system utterance is set merely according to whether recognition succeeded, it is difficult to set that content appropriately, so the dialogue between the user and the apparatus may not proceed smoothly and the user may find the interaction bothersome. That is, recognition can fail because part of the user utterance could not be recognized, or because none of it could be recognized, and setting the same system utterance in both situations has been a factor that hinders smooth dialogue with the user.

An object of the present invention is to make the dialogue with the user proceed smoothly by setting the system utterance appropriately.

The voice interaction apparatus of the present invention comprises: prediction recognition means that recognizes a user utterance based on predicted utterance content and outputs a prediction recognition result; word recognition means that recognizes the user utterance based on a predetermined word dictionary and outputs a word recognition result; matching degree determination means that compares the prediction recognition result with the word recognition result and determines the degree of matching between the recognition results; and system utterance setting means that sets the content of the system utterance based on the degree of matching between the recognition results.

In the voice interaction apparatus of the present invention, the matching degree determination means determines the degree of matching to be one of complete match, partial match, or complete mismatch.

The voice interaction apparatus of the present invention further comprises: large vocabulary recognition means that recognizes the user utterance based on a predetermined large vocabulary dictionary and outputs the likelihood of a large vocabulary recognition result; and likelihood determination means that compares the likelihood of the large vocabulary recognition result with the likelihood of the prediction recognition result output from the prediction recognition means. The matching degree determination means starts determining the degree of matching after the likelihood of the prediction recognition result is determined to exceed the likelihood of the large vocabulary recognition result.

According to the present invention, the degree of matching between the prediction recognition result and the word recognition result is determined, and the content of the system utterance is set based on the determined degree of matching. The content of the system utterance can therefore be set appropriately, and the dialogue with the user can proceed smoothly.

FIG. 1 is a block diagram showing the configuration of a voice interaction apparatus 10 according to an embodiment of the present invention. As shown in FIG. 1, the voice interaction apparatus 10 includes a microphone 11 that converts speech uttered by the user, that is, a user utterance, into an electric signal, and an acoustic analysis unit 12 that extracts speech feature amounts from this electric signal. The voice interaction apparatus 10 is also provided with a prediction recognition unit (prediction recognition means) 13, a large vocabulary recognition unit (large vocabulary recognition means) 14, and a partial word recognition unit (word recognition means) 15, each of which recognizes the user utterance. The speech feature amounts are input to each of the recognition units 13 to 15 via the acoustic analysis unit 12.

FIG. 2 is a block diagram showing the configurations of the prediction recognition unit 13, the large vocabulary recognition unit 14, and the partial word recognition unit 15. As shown in FIG. 2, the recognition units 13 to 15 respectively include acoustic models 16a to 16c, which are statistical models (hidden Markov models or the like) of speech feature patterns in units of phonemes or syllables, and language models 17a to 17c, which are statistical models (N-gram models or the like) of word connection probabilities and occurrence probabilities. To define the vocabulary to be recognized and its pronunciation, the recognition units 13 to 15 are also provided with dictionaries 18a to 18c containing a large number of words. The words registered in the dictionaries 18a to 18c are collected from a large text database composed of newspaper articles, conference lectures, Web pages, and the like.

Each of the recognition units 13 to 15 is also provided with a decoder 19a to 19c that executes speech recognition processing using the acoustic models 16a to 16c and the language models 17a to 17c (see Tatsuya Kawahara and Akinobu Lee, "Continuous Speech Recognition Software Julius," Journal of the Japanese Society for Artificial Intelligence, Vol. 20, No. 1, pp. 41-49). The decoders 19a to 19c convert the input speech X into the optimal word string W. This speech recognition processing searches for the word string W that maximizes the posterior probability p(W|X) for the input speech X: the posterior probability p(W|X) is calculated for various word strings W using equation (1) below, which follows from Bayes' theorem, and the word string W that yields the highest posterior probability is adopted as the recognition result (see equation (2) below). This posterior probability p(W|X), referred to here as the likelihood, is an index of how plausible the recognition result for the input speech is. The term p(X) in the denominator of equation (1) does not affect the choice of the word string W and can therefore be ignored. The probability p(W) in equation (2) corresponds to the language models 17a to 17c, which express the occurrence probability of the word string W, and the probability p(X|W) in equation (2) corresponds to the acoustic models 16a to 16c, which express the probability that the input speech X is obtained from the word string W.
p(W|X) = p(W) * p(X|W) / p(X)   (1)
W = argmax_W p(W) * p(X|W)   (2)
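Equation (2) can be illustrated with a small sketch. This is not the decoder described in the cited Julius article, only a brute-force argmax over a hand-written candidate table. The function name `best_word_string` and the probability values in `toy_scores` are hypothetical and chosen purely for illustration. Log probabilities are summed instead of multiplying raw probabilities, which leaves the argmax unchanged.

```python
import math


def best_word_string(scores):
    """Return the word string W that maximizes p(W) * p(X|W).

    `scores` maps each candidate word string W to a pair
    (p(W), p(X|W)): the language-model and acoustic-model probabilities.
    p(X) is omitted because it is the same for every candidate.
    """
    def log_score(w):
        p_w, p_x_given_w = scores[w]
        return math.log(p_w) + math.log(p_x_given_w)

    return max(scores, key=log_score)


# Hypothetical probabilities for three candidate word strings.
toy_scores = {
    "AB university": (0.20, 0.030),
    "AB bank": (0.10, 0.020),
    "XY university": (0.05, 0.010),
}
best = best_word_string(toy_scores)
```

A real decoder searches a far larger hypothesis space incrementally rather than enumerating every candidate, so this sketch only shows the selection criterion.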

Next, the prediction recognition processing executed by the prediction recognition unit 13, the large vocabulary recognition processing executed by the large vocabulary recognition unit 14, and the partial word recognition processing executed by the partial word recognition unit 15 are described. First, the prediction recognition unit 13 predicts the user utterance that the user will make based on the dialogue situation and registers the predicted user utterance in the dictionary 18a in sentence units. The prediction recognition unit 13 then divides the registered user utterances into words or phrases and creates a language model 17a whose recognition target is the sentence unit. It executes the speech recognition processing described above under the constraints of this language model 17a and outputs a sentence-unit recognition result W1 for the user utterance (hereinafter referred to as the prediction recognition result) and the likelihood R1 of this prediction recognition result W1 (hereinafter referred to as the prediction likelihood). Alternatively, a grammar may be created by dividing the registered user utterances into words or phrases, and the speech recognition processing may be executed using this grammar. Because the prediction recognition processing performs speech recognition after narrowing down the recognition targets, the recognition range becomes narrower but the recognition accuracy can be improved.

The large vocabulary recognition unit 14 creates a language model 17b whose recognition target is the sentence unit, using the words in the dictionary (large vocabulary dictionary) 18b collected from the various text data described above. The large vocabulary recognition unit 14 then executes the speech recognition processing described above under the constraints of the created language model 17b and outputs a sentence-unit recognition result W2 for the user utterance (hereinafter referred to as the large vocabulary recognition result) and the likelihood R2 of this large vocabulary recognition result W2 (hereinafter referred to as the large vocabulary likelihood). Because the large vocabulary recognition processing performs speech recognition without excessively narrowing down the recognition targets, the recognition accuracy decreases but the recognition range can be widened.

The partial word recognition unit 15 creates a language model 17c whose recognition target is the word unit, using the words in the dictionary (word dictionary) 18c collected from the various text data described above. The partial word recognition unit 15 then executes the speech recognition processing described above under the constraints of the created language model 17c and outputs a word-unit recognition result W3 for the user utterance (hereinafter referred to as the partial word recognition result). In other words, the prediction recognition processing and the large vocabulary recognition processing recognize the user utterance in sentence units, whereas the partial word recognition processing recognizes the user utterance in word units.

The partial word recognition unit 15 also functions as the matching degree determination means. The prediction recognition result W1 is input to it from the prediction recognition unit 13, and the partial word recognition unit 15 determines and outputs the degree of matching between the prediction recognition result W1, recognized in sentence units, and the partial word recognition result (word recognition result) W3, recognized in word units. Specifically, when the prediction recognition result W1 is "A, B, C" and the partial word recognition result W3 is "A", "B", "C", the partial word recognition unit 15 determines a complete match. When W1 is "A, B, C" and W3 is "A", "B", "D", it determines a partial match. When W1 is "A, B, C" and W3 is "D", "E", "F", it determines a complete mismatch. Here, A to F each denote a different word.
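The three-way determination can be sketched as follows. The text gives only the three examples with words A to F and does not state the comparison rule, so the rule used here is an assumption: identical word sequences count as a complete match, at least one shared word counts as a partial match, and no shared word counts as a complete mismatch. The names `MatchDegree` and `match_degree` are hypothetical.

```python
from enum import Enum


class MatchDegree(Enum):
    COMPLETE_MATCH = "complete match"
    PARTIAL_MATCH = "partial match"
    COMPLETE_MISMATCH = "complete mismatch"


def match_degree(predicted_words, partial_words):
    """Compare the words of W1 with the word-unit result W3.

    Assumed rule (not specified in the text): equal sequences are a
    complete match, any shared word is a partial match, and no shared
    word is a complete mismatch.
    """
    predicted = list(predicted_words)
    partial = list(partial_words)
    if predicted == partial:
        return MatchDegree.COMPLETE_MATCH
    if set(predicted) & set(partial):
        return MatchDegree.PARTIAL_MATCH
    return MatchDegree.COMPLETE_MISMATCH


# The three examples from the text, with W1 = "A, B, C".
w1 = ["A", "B", "C"]
example_results = [
    match_degree(w1, ["A", "B", "C"]),
    match_degree(w1, ["A", "B", "D"]),
    match_degree(w1, ["D", "E", "F"]),
]
```

A position-by-position comparison would be an equally plausible reading of the examples; the set-based check is used here only because it is the shortest rule that reproduces them.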

The voice interaction apparatus 10 is also provided with a prediction inside/outside determination unit 20 serving as the likelihood determination means. The prediction inside/outside determination unit 20 compares the prediction likelihood R1 with the large vocabulary likelihood R2 and determines whether the content of the user utterance is within the prediction range of the prediction recognition unit 13. When the prediction likelihood R1 is greater than the large vocabulary likelihood R2, the prediction inside/outside determination unit 20 outputs to the system utterance selection unit 21, described later, a determination result indicating that the user utterance is within the prediction. When the prediction likelihood R1 is smaller than the large vocabulary likelihood R2, the prediction inside/outside determination unit 20 outputs to the system utterance selection unit 21 a determination result indicating that the user utterance is outside the prediction.
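The comparison performed by the prediction inside/outside determination unit 20 reduces to a single inequality, sketched below. The function name `is_within_prediction` and the example log-likelihood values are hypothetical. The text does not say what happens when R1 equals R2; treating a tie as outside the prediction is an assumption of this sketch.

```python
def is_within_prediction(prediction_likelihood, large_vocab_likelihood):
    """Return True when R1 exceeds R2, i.e. the utterance is within the prediction.

    A tie (R1 == R2) is treated as outside the prediction; the text
    does not cover that case, so this is an assumption.
    """
    return prediction_likelihood > large_vocab_likelihood


# Hypothetical log-likelihood scores for one utterance.
inside = is_within_prediction(-120.5, -134.2)
outside = is_within_prediction(-140.0, -134.2)
```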

The voice interaction apparatus 10 is further provided with a system utterance selection unit 21 serving as the system utterance setting means. The system utterance selection unit 21 receives the prediction recognition result W1, the large vocabulary recognition result W2, the partial word recognition result W3, the degree of matching from the partial word recognition unit 15, and the determination result from the prediction inside/outside determination unit 20. A plurality of contents and grammars for the speech emitted from the apparatus side, that is, the system utterance, are registered in the system utterance selection unit 21, which selects the content of the system utterance based on the various items of input information. The system utterance selection unit 21 outputs the text data of the selected system utterance to a speech synthesis unit 22, and the speech synthesis unit 22 analyzes the text data and generates an electric signal of a speech waveform. This electric signal is then input from the speech synthesis unit 22 to a speaker 23, and the system utterance is emitted from the speaker 23 to the user.

Next, the procedure of the voice dialogue processing performed by the voice interaction apparatus 10 is described in detail with reference to flowcharts. FIG. 3 is a flowchart showing the procedure of the voice dialogue processing. FIG. 4(A) is an explanatory diagram showing an example of a prediction recognition result W1 and a large vocabulary recognition result W2 determined to be outside the prediction, and FIG. 4(B) is an explanatory diagram showing an example of a prediction recognition result W1 and a large vocabulary recognition result W2 determined to be within the prediction. FIG. 5(A) is an explanatory diagram showing an example of a prediction recognition result W1 and a partial word recognition result W3 determined to be a complete match, FIG. 5(B) shows an example determined to be a partial match, and FIG. 5(C) shows an example determined to be a complete mismatch.

As shown in FIG. 3, a user utterance (for example, "AB University") is captured in step S1, prediction recognition processing is executed on the input speech in step S2, and large vocabulary recognition processing is executed on the input speech in step S3. The process then proceeds to step S4, where it is determined whether the prediction likelihood R1 exceeds the large vocabulary likelihood R2. When it is determined in step S4 that the prediction likelihood R1 is below the large vocabulary likelihood R2 (see FIG. 4(A)), the content of the recognized user utterance is judged to be outside the prediction of the speech recognition apparatus. The process therefore proceeds to step S5, where a system utterance for identifying the field of the user utterance from a wide range (for example, "Please say the genre") is emitted, and the user utterance is captured again in step S1.

On the other hand, when it is determined in step S4 that the prediction likelihood R1 exceeds the large vocabulary likelihood R2 (see FIG. 4(B)), the content of the recognized user utterance is judged to be within the prediction of the speech recognition apparatus. The process therefore proceeds directly to step S6, where partial word recognition processing is executed on the input speech. The process then proceeds to step S7, where the prediction recognition result W1 and the partial word recognition result W3 are compared and the degree of matching between the recognition results W1 and W3 (complete match, partial match, or complete mismatch) is determined.

When it is determined in step S7 that the recognition results W1 and W3 match completely (see FIG. 5(A)), the confirmation of the recognition result is omitted and the system utterance for the next stage (for example, "Which faculty?") is emitted. When the user utterance is determined to be within the prediction based on the prediction likelihood R1 and the large vocabulary likelihood R2, and the prediction recognition result W1 and the partial word recognition result W3 are also determined to match completely, the accuracy of the prediction recognition result W1 for the user utterance can be judged to be extremely high. Moving the dialogue to the next stage immediately therefore allows the dialogue to proceed without bothering the user with confirmation responses and the like.

On the other hand, the case where the recognition results W1 and W3 are determined in step S7 to match partially (see FIG. 5(B)) is the case where the user utterance has been determined to be within the prediction based on the prediction likelihood R1 and the large vocabulary likelihood R2, but the prediction recognition result W1 and the partial word recognition result W3 differ in part. Since the accuracy of the prediction recognition result W1 for the user utterance is somewhat low in this state, the process proceeds to step S9, where partial match processing is executed to settle the difference (the undetermined part) between the prediction recognition result W1 and the partial word recognition result W3.

FIG. 6 is a flowchart showing the procedure of the partial match processing. As shown in FIG. 6, a system utterance that prompts the user to speak about the undetermined part (for example, "AB what?") is first emitted in step S21. A user utterance (for example, "University") is captured in step S22, and in step S23 the undetermined part is added to the dictionary and prediction recognition processing is executed on the input speech. Next, in step S24, a system utterance for confirming the prediction recognition result W1 with the user (for example, "AB University, correct?") is emitted, and in step S25 the user utterance answering this system utterance (for example, "Yes") is captured.

Prediction recognition processing is then executed on the user utterance in step S26, and in step S27 it is determined whether the recognition result of the user utterance is affirmative (for example, "Yes"). When the recognition result is determined in step S27 to be affirmative, the difference between the prediction recognition result W1 and the partial word recognition result W3 is judged to have been resolved, and the process proceeds to step S28, where the system utterance for the next stage (for example, "Which faculty?") is emitted. When the recognition result is determined in step S27 to be negative, the difference between the prediction recognition result W1 and the partial word recognition result W3 is judged not to have been resolved, and the process proceeds to step S29, where a system utterance prompting the user to speak again (for example, "Please say it again") is emitted.
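Steps S21 to S29 can be sketched as a short sub-dialogue. This is a simplified illustration, not the apparatus itself: speech output and recognition are replaced by the hypothetical callables `say`, `listen`, and `is_affirmative`, the recognized answer is assumed to arrive as plain text, and the English prompt wording is illustrative only.

```python
def partial_match_processing(confirmed_part, say, listen, is_affirmative):
    """Resolve the undetermined part of a partially matched utterance.

    `say` emits a system utterance, `listen` returns the recognized
    user reply as text, and `is_affirmative` classifies a reply as yes/no.
    All three are hypothetical stand-ins for the units of the apparatus.
    Returns the confirmed utterance, or None if the user rejected it.
    """
    say(f"{confirmed_part} what?")                # S21: ask about the undetermined part
    missing_part = listen()                       # S22-S23: capture and recognize the reply
    candidate = f"{confirmed_part} {missing_part}"
    say(f"{candidate}, correct?")                 # S24: confirm the result with the user
    answer = listen()                             # S25-S26: capture and recognize the answer
    if is_affirmative(answer):                    # S27: affirmative?
        say("Which faculty?")                     # S28: move to the next stage
        return candidate
    say("Please say it again.")                   # S29: ask the user to repeat
    return None


def run_scripted(replies):
    """Run the sub-dialogue against a fixed list of user replies."""
    spoken = []
    reply_iter = iter(replies)
    result = partial_match_processing(
        "AB", spoken.append, lambda: next(reply_iter), lambda a: a == "yes"
    )
    return result, spoken
```

Passing the I/O functions in as arguments keeps the control flow testable without a microphone or speaker, which is the only reason for that design here.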

In this way, when the user utterance is determined to be within the prediction of the voice interaction apparatus 10 and the recognition results W1 and W3 are then determined to match partially, the actual user utterance and the prediction recognition result W1 differ slightly, so a system utterance for settling this difference is set. The difference can thus be resolved with a minimum of dialogue, and the dialogue can proceed without causing the user discomfort.

As shown in FIG. 3, the case where the recognition results W1 and W3 are determined in step S7 to be a complete mismatch (see FIG. 5(C)) is the case where the user utterance has been determined to be within the prediction based on the prediction likelihood R1 and the large vocabulary likelihood R2, but the prediction recognition result W1 and the partial word recognition result W3 differ completely. Since the prediction recognition result W1 is off the mark for the user utterance in this state, the process proceeds to step S10, where complete mismatch processing is executed to settle the difference between the prediction recognition result W1 and the partial word recognition result W3.
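The branching of FIG. 3 from step S4 onward can be summarized in one function. This is a sketch under stated assumptions: `select_next_action` and its return strings are hypothetical, the word-level comparison rule (equal sequences, shared word, no shared word) is assumed rather than taken from the text, and a tie between R1 and R2 is treated as outside the prediction. Partial word recognition is passed in as a callable so that it runs only after the utterance has been judged to be within the prediction, as in step S6.

```python
def select_next_action(r1, r2, predicted_words, recognize_partial_words):
    """Choose the next dialogue action from likelihoods and recognition results.

    r1, r2: prediction likelihood R1 and large vocabulary likelihood R2.
    predicted_words: the words of the prediction recognition result W1.
    recognize_partial_words: callable that performs partial word
    recognition (step S6) and returns the words of W3.
    """
    if not r1 > r2:                                   # S4: outside the prediction
        return "S5: ask the user for a genre"
    partial_words = list(recognize_partial_words())   # S6
    predicted = list(predicted_words)
    if predicted == partial_words:                    # S7: complete match
        return "next-stage utterance without confirmation"
    if set(predicted) & set(partial_words):           # S7: partial match
        return "S9: partial match processing"
    return "S10: complete mismatch processing"        # S7: complete mismatch


# Hypothetical inputs: the utterance is within the prediction but one word differs.
action = select_next_action(
    -120.5, -134.2, ["AB", "university"], lambda: ["AB", "bank"]
)
```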

FIG. 7 is a flowchart showing the procedure of the complete mismatch processing. As shown in FIG. 7, system utterances that prompt the user to speak about the undetermined part (for example, "Is it a university? Or a bank?" and "Is it AB? Or XY?") are first emitted in step S31. A user utterance (for example, "AB University", "University", or "AB") is captured in step S32, and prediction recognition processing is executed on the input speech in step S33. Next, in step S34, a system utterance for confirming with the user the words recognized by the prediction recognition processing (for example, "AB University, correct?", "University, correct?", or "AB, correct?") is emitted, and in step S35 a user utterance (for example, "Yes") is captured.

Prediction recognition processing is then executed on the user utterance in step S36, and in step S37 it is determined whether the recognition result of the user utterance is affirmative (for example, "Yes"). When the recognition result is determined in step S37 to be affirmative, the difference between the prediction recognition result W1 and the partial word recognition result W3 has been resolved, so the process proceeds to step S38, where the system utterance for the next stage (for example, "Which faculty?") is emitted. When the recognition result is determined in step S37 to be negative, the difference between the prediction recognition result W1 and the partial word recognition result W3 has not been resolved, so the process proceeds to step S39, where a system utterance prompting the user to speak again (for example, "Please say it again") is emitted.
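The either-or prompts of step S31 ("Is it a university? Or a bank?", "Is it AB? Or XY?") suggest that the two recognition results are offered to the user as alternatives. How the apparatus actually builds these prompts is not described, so the sketch below is an assumption: it pairs the words of W1 and W3 by position and asks about each pair that differs. The function name `clarification_prompts` and the English wording are hypothetical.

```python
def clarification_prompts(predicted_words, partial_words):
    """Build one either-or question per position where W1 and W3 differ.

    Position-wise pairing is an assumption; when the two results have
    different lengths, the extra words are ignored by zip().
    """
    return [
        f"Is it {predicted}? Or {partial}?"
        for predicted, partial in zip(predicted_words, partial_words)
        if predicted != partial
    ]


# Hypothetical complete mismatch: W1 = "AB university", W3 = "XY", "bank".
prompts = clarification_prompts(["AB", "university"], ["XY", "bank"])
```

The remaining steps S32 to S39 follow the same capture, confirm, and branch pattern as the partial match processing.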

In this way, when the user utterance has been determined to be within the prediction of the voice interaction device 10 and the recognition results W1 and W3 are then determined to be a complete mismatch, the actual user utterance differs from the prediction recognition result W1 more widely than in the partial match case, so a system utterance is set to pin down the differing parts. Because the differences can be resolved with a minimum of dialogue, the dialogue can proceed without causing the user discomfort.
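The flow of FIG. 7 (steps S31 to S39) can be sketched roughly as follows. This is only an illustrative sketch, not the implementation described in the publication: the function name `complete_mismatch_step`, the `speak`, `listen`, and `recognize` callbacks, the English prompt strings, and the `AFFIRMATIVE` set are all assumptions introduced here.

```python
from typing import Callable

# Assumed set of replies treated as affirmative; the text only gives "Yes" as an example.
AFFIRMATIVE = {"yes", "はい"}


def complete_mismatch_step(
    speak: Callable[[str], None],      # hypothetical: outputs a system utterance
    listen: Callable[[], str],         # hypothetical: captures a user utterance
    recognize: Callable[[str], str],   # hypothetical: prediction recognition processing
    clarify_prompt: str,
    next_prompt: str,
    retry_prompt: str = "Please say that again.",
) -> bool:
    """Run one pass of the complete mismatch flow; return True if the difference is resolved."""
    speak(clarify_prompt)                      # S31: ask about the unconfirmed part
    heard = recognize(listen())                # S32, S33: capture and recognize the answer
    speak(f"{heard}, correct?")                # S34: confirm the recognized words
    reply = recognize(listen())                # S35, S36: capture and recognize the reply
    if reply.strip().lower() in AFFIRMATIVE:   # S37: is the reply affirmative?
        speak(next_prompt)                     # S38: move on to the next stage
        return True
    speak(retry_prompt)                        # S39: ask the user to speak again
    return False
```

Passing the input and output functions as callbacks keeps the sketch independent of any particular speech recognizer or synthesizer, so it can be exercised with scripted text replies.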

As described above, the prediction likelihood R1 is compared with the large vocabulary likelihood R2 to determine whether the content of the user utterance is within the prediction of the voice interaction device 10, and the content of the system utterance is changed according to the result. Recognition processing for the user utterance can therefore be executed appropriately even at a stage where the range of possible user utterances has not yet been narrowed down.
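The comparison of R1 and R2 might be sketched as below. The `Recognition` class, the function names, the branch labels, and the treatment of the likelihoods as log-likelihoods (higher is better) are assumptions made for illustration. The strict "greater than" test follows the wording of the claims; how an out-of-prediction utterance is handled is left as a placeholder label.

```python
from dataclasses import dataclass


@dataclass
class Recognition:
    text: str
    likelihood: float  # assumed to be a log-likelihood; higher means a better fit


def is_within_prediction(predicted: Recognition, large_vocab: Recognition) -> bool:
    """True when the prediction likelihood R1 is greater than the large vocabulary likelihood R2."""
    return predicted.likelihood > large_vocab.likelihood


def next_stage(predicted: Recognition, large_vocab: Recognition) -> str:
    """Pick the next processing stage from the within/outside prediction decision."""
    if is_within_prediction(predicted, large_vocab):
        # Within prediction: continue to partial word recognition and the matching check.
        return "partial_word_recognition"
    # Outside prediction: placeholder label for whatever handling the system defines.
    return "out_of_prediction_handling"
```

With this sketch, equal likelihoods are treated as outside the prediction, since R1 is then not greater than R2.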

In addition, the content of the subsequent system utterance is changed according to the degree of matching between the sentence-level prediction recognition result W1 obtained by the prediction recognition processing and the word-level partial word recognition result W3 obtained by partial word recognition. An appropriate system utterance can therefore be set according to how well the user utterance has been recognized, and the dialogue with the user can proceed smoothly. That is, even when the recognition results W1 and W3 differ, the content of the system utterance issued to lead the recognition result to the correct answer can be varied depending on whether W1 and W3 differ partially or differ completely, so the differences between W1 and W3 can be resolved with a minimum of dialogue. This allows the dialogue to proceed smoothly without causing the user discomfort.
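One way to picture the three-way classification of W1 against W3 is the sketch below. The comparison rule is an assumption made for illustration, not taken from the publication: both results are treated as already split into aligned word slots, identical sequences count as a complete match, at least one agreeing slot counts as a partial match, and no agreeing slot counts as a complete mismatch. The names `MatchDegree`, `match_degree`, and `differing_slots` are hypothetical.

```python
from enum import Enum
from typing import List, Sequence


class MatchDegree(Enum):
    COMPLETE_MATCH = "complete match"
    PARTIAL_MATCH = "partial match"
    COMPLETE_MISMATCH = "complete mismatch"


def differing_slots(w1: Sequence[str], w3: Sequence[str]) -> List[int]:
    """Indices of aligned word slots where W1 and W3 disagree (assumed alignment)."""
    return [i for i, (a, b) in enumerate(zip(w1, w3)) if a != b]


def match_degree(w1: Sequence[str], w3: Sequence[str]) -> MatchDegree:
    """Classify W1 (sentence-level result) against W3 (word-level result)."""
    if list(w1) == list(w3):
        return MatchDegree.COMPLETE_MATCH
    # Count the aligned slots on which the two results agree.
    agreeing = sum(1 for a, b in zip(w1, w3) if a == b)
    if agreeing > 0:
        return MatchDegree.PARTIAL_MATCH
    return MatchDegree.COMPLETE_MISMATCH
```

For example, under these assumptions `["AB", "university"]` against `["AB", "bank"]` is a partial match with slot 1 still unconfirmed, while `["AB", "university"]` against `["XY", "bank"]` is a complete mismatch, which corresponds to the case where the system asks about each differing part in turn.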

Needless to say, the present invention is not limited to the embodiment described above and can be modified in various ways without departing from its gist. The illustrated voice interaction device 10 can be applied effectively in a variety of dialogue fields, such as car navigation systems and call centers.

FIG. 1 is a block diagram showing the configuration of a voice interaction device according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the configurations of the prediction recognition unit, the large vocabulary recognition unit, and the partial word recognition unit.
FIG. 3 is a flowchart showing the procedure for voice interaction processing.
FIG. 4(A) is an explanatory diagram showing an example of a prediction recognition result and a large vocabulary recognition result determined to be outside the prediction, and FIG. 4(B) is an explanatory diagram showing an example of a prediction recognition result and a large vocabulary recognition result determined to be within the prediction.
FIG. 5(A) is an explanatory diagram showing an example of a prediction recognition result and a partial word recognition result determined to be a complete match, FIG. 5(B) is an explanatory diagram showing an example determined to be a partial match, and FIG. 5(C) is an explanatory diagram showing an example determined to be a complete mismatch.
FIG. 6 is a flowchart showing the procedure for partial match processing.
FIG. 7 is a flowchart showing the procedure for complete mismatch processing.

Explanation of symbols

10 Voice interaction device
13 Prediction recognition unit (prediction recognition means)
14 Large vocabulary recognition unit (large vocabulary recognition means)
15 Partial word recognition unit (word recognition means, matching degree determination means)
18a Dictionary (large vocabulary dictionary)
18b Dictionary (word dictionary)
20 Prediction inside/outside determination unit (likelihood determination means)
21 System utterance selection unit (system utterance setting means)

Claims

A voice interaction device comprising:
prediction recognition means for recognizing a user utterance based on predicted utterance content and outputting a prediction recognition result;
word recognition means for recognizing the user utterance based on a predetermined word dictionary and outputting a word recognition result;
matching degree determination means for comparing the prediction recognition result with the word recognition result to determine a degree of matching between the recognition results; and
system utterance setting means for setting the content of a system utterance based on the degree of matching between the recognition results.

The voice interaction device according to claim 1, wherein
the matching degree determination means determines the degree of matching to be one of complete match, partial match, and complete mismatch.

The voice interaction device according to claim 1 or 2, further comprising:
large vocabulary recognition means for recognizing the user utterance based on a predetermined large vocabulary dictionary and outputting a likelihood of a large vocabulary recognition result; and
likelihood determination means for comparing the likelihood of the large vocabulary recognition result with a likelihood of the prediction recognition result output from the prediction recognition means,
wherein the matching degree determination means starts determining the degree of matching after the likelihood of the prediction recognition result has been determined to be greater than the likelihood of the large vocabulary recognition result.