JPH08234789A

JPH08234789A - Integrated recognition interactive device

Info

Publication number: JPH08234789A
Application number: JP7038581A
Authority: JP
Inventors: Natsuki Yuasa; 夏樹湯浅
Original assignee: GIJUTSU KENKYU KUMIAI SHINJOHO SHIYORI KAIHATSU KIKO; Sharp Corp
Current assignee: GIJUTSU KENKYU KUMIAI SHINJOHO SHIYORI KAIHATSU KIKO; Sharp Corp
Priority date: 1995-02-27
Filing date: 1995-02-27
Publication date: 1996-09-13
Anticipated expiration: 2018-02-10
Also published as: JP3375449B2

Abstract

PURPOSE: To provide an integrated recognition interactive device capable of more natural interaction by handling inputs from plural input channels in an integrated manner, based on information from a multimodal interactive database. CONSTITUTION: The device comprises plural channels of recognizing means 105 to 108 that recognize input data 101 to 104 (a voice signal, movements of the face, line of sight, and body, etc.) including time information; a time obtaining means 109 that outputs time information; an integrated processing means 110 that recognizes the user's intention by integrally processing the recognition results output in parallel from the recognizing means 105 to 108; a context information acquiring means 111 that outputs context information; an interaction managing means 112 that advances the interaction based on the user's intention recognized by the integrated processing means; and an output means 113 that outputs output data.

Description

Detailed Description of the Invention

[0001]

[Industrial Field of Application] The present invention relates to an integrated recognition dialogue device, and in particular to an integrated recognition dialogue device that integrates and recognizes multi-channel information such as human speech and motion, enabling natural dialogue with the user.

[0002]

[Prior Art] To make dialogue with a computer as natural as dialogue with a person, it is necessary to use multiple channels such as speech, facial movement, gestures, and line of sight, and to integrate their recognition results. The multi-channel synchronization and integration device using time tags, disclosed in Japanese Unexamined Patent Publication No. 5-307432, integrates recognition results by having each channel output, together with its recognition result, the time information (time tag) of the input data.

[0003]

[Problems to Be Solved by the Invention] However, the multi-channel synchronization and integration device using time tags disclosed in Japanese Unexamined Patent Publication No. 5-307432 does not make clear how the time information is used to integrate the recognition results of the channels.

[0004] The present invention has been made in view of the above circumstances. Its object is to provide an integrated recognition dialogue device that can conduct more natural dialogue by handling inputs from multiple input channels in an integrated manner, based on information from a multimodal dialogue database.

[0005]

[Means for Solving the Problems] The integrated recognition dialogue device according to claim 1 comprises: time acquisition means for outputting time information; a plurality of recognition means, each recognizing input data including the user's speech signal, facial movement, line of sight, body motion, and the like; context information acquisition means for outputting context information used to identify words from the speech signal; integrated processing means for recognizing the user's intention by integrating the time information, the context information, and the recognition results output in parallel from the plurality of recognition means; dialogue management means for advancing the dialogue based on the user's intention recognized by the integrated processing means; and output means for outputting to the user the output data passed from the dialogue management means.

[0006] In the integrated recognition dialogue device according to claim 2, the integrated processing means sets a response window, which is the period during which the user's intention is recognized from the input data, based on information from the dialogue management means and information from the plurality of recognition means.

[0007] In the integrated recognition device according to claim 3, the integrated processing means sets a predetermined number of keyword groups for recognizing the speech signal, based on information from the dialogue management means and information from the context information acquisition means.

[0008] In the integrated recognition device according to claim 4, the keyword groups include an "affirmative keyword group", by which the user intends affirmation, and a "negative keyword group", by which the user intends negation.

[0009] In the integrated recognition device according to claim 5, the context information acquisition means uses the similarity between feature vectors created from the co-occurrence relations between words in a predetermined document database.

[0010] In the integrated recognition device according to claim 6, the integrated processing means uses data from a predetermined dialogue database as learning data for recognizing the user's intention.

[0011] In the integrated recognition device according to claim 7, the integrated processing means, after recognizing the user's intention, adds the recognition result to the dialogue database as learning data.

[0012] In the integrated recognition device according to claim 8, the integrated processing means ignores vertical nods of the user's head that occur in the latter half of the user's utterance.

[0013]

[Operation] In the integrated recognition dialogue device according to claim 1, the start time and end time of the signal recognized by each recognition means are obtained from the time acquisition means, and the recognition result is passed to the integrated processing means together with its start and end times. The integrated processing means identifies the user's utterance intention and passes the identification result to the dialogue management means. Based on this result, the dialogue management means transitions to a new state, and the content to be uttered next through the output means is determined. With this configuration, the integrated recognition dialogue device according to claim 1 lets the user hold a natural dialogue, as if conversing with a person.

[0014] In the integrated recognition dialogue device according to claim 2, the response window is set based on the recognition results passed from the recognition means and the time information passed from the dialogue management means. Because the response window is set to match the natural pauses in human-to-human dialogue, the user can have a comfortable dialogue.

[0015] In the integrated recognition dialogue device according to claim 3, when options are presented to the user, the integrated processing means sets a keyword group corresponding to each option. These keyword groups make the judgment of the user's intention reliable.

[0016] In the integrated recognition dialogue device according to claim 4, when the integrated processing means judges whether the user's intention is affirmative or negative, "affirmative keywords" and "negative keywords" are set using the keyword passed from the dialogue management means and the context information acquisition means. These keywords make the judgment of the user's affirmative or negative intention reliable.

[0017] In the integrated recognition dialogue device according to claim 5, the context information acquisition means uses feature vectors created from the co-occurrence relations between words in a document database. By preparing a document database suited to the user and to how the system is used, the device can therefore accommodate systems used in specific situations and the speech habits of individual users.

[0018] In the integrated recognition dialogue device according to claim 6, the integrated processing means uses learning data to recognize the user's intention, so it can handle any dialogue of the kind found in the dialogue database.

[0019] In the integrated recognition dialogue device according to claim 7, the integrated processing means adds the recognition result to the learning data after recognizing the user's intention, so integrated recognition can be adapted to the user over time.

[0020] In the integrated recognition device according to claim 8, vertical nods of the user's head that occur in the latter half of the user's utterance are ignored. Such nods are often the user nodding at his or her own speech, so ignoring them prevents the misidentification they would otherwise cause.

[0021]

[Embodiments] The configuration of a first embodiment of the integrated recognition dialogue device of the present invention is described below with reference to FIG. 1.

[0022] The integrated recognition dialogue device of this embodiment comprises multi-channel recognition means 105 to 108, which recognize input data 101 to 104 (speech signals, facial movement, line of sight, body motion, and the like) including time information. Connected to the recognition means 105 to 108 are time acquisition means 109, which outputs time information, and integrated processing means 110, which integrates the recognition results output in parallel from the recognition means to recognize the user's intention. Connected to the integrated processing means 110 are context information acquisition means 111, which outputs context information, and dialogue management means 112, which advances the dialogue based on the user's intention recognized by the integrated processing means. Output means 113, which outputs the output data, is connected to the dialogue management means 112.

[0023] Each of the recognition means 105 to 108 has a recognition algorithm suited to the data it recognizes, and is configured to obtain the start time and end time of each recognition result from the time acquisition means 109. The context information acquisition means 111 stores context information corresponding to the "words" obtained from the speech recognition means. The context information is constructed so that "words" used in similar situations, scenes, and contexts have similar values.

[0024] How the context information is constructed is explained below using word feature vectors as an example.

[0025] First, a document database and a word dictionary are prepared. The number of dimensions of the feature vector is set to a suitable value, and that many words are selected. Normally the words may simply be selected in descending order of their frequency of occurrence in the database. Each selected word corresponds to one element of the feature vector. A word's feature vector is obtained by accumulation: for each unit of the document database (a sentence, paragraph, article, or the like) in which the word occurs, the frequency distribution of the words contained in that unit is multiplied by the number of times the word occurs in that unit, and the product is added to the word's feature vector.

[0026] This is explained with a more concrete example.

[0027] The following explains how word feature vectors are created from document data consisting of example sentence A, "The US government has proposed to the major advanced countries a fundamental review of the COCOM regulations.", and example sentence B, "It appears to intend to substantially reduce the items regulated by COCOM, on condition that the regulated countries regulate exports of industrial products that could lead to weapons manufacture." Here the document data is read one sentence at a time, but another unit, such as a paragraph or an article, may be used instead.

[0028] In this example the feature vector has 21 dimensions; that is, 21 words are used to generate the feature vectors, and the elements correspond, in order, to the words "America, government, advanced, major, country, COCOM, regulation, fundamental, review, proposal, target, weapon, manufacturing, industry, product, export, condition, item, substantial, reduction, intention".

[0029] Under these conditions, when example sentence A is read and morphological analysis is performed, "America, government, advanced, major, country, COCOM, regulation, fundamental, review, proposal" are extracted. The resulting word frequency distribution is (1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0). Therefore, as shown in FIG. 2, (1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0) is added to the feature vector of each word that appears in example sentence A, such as "America" and "government".

[0030] Next, when example sentence B is read and morphological analysis is performed, "regulation, target, country, weapon, manufacturing, industry, product, export, regulation, condition, COCOM, regulation, item, substantial, reduction, intention" are extracted. The resulting word frequency distribution is (0,0,0,0,1,1,3,0,0,0,1,1,1,1,1,1,1,1,1,1,1). Because "regulation" occurs three times, three times this distribution, (0,0,0,0,3,3,9,0,0,0,3,3,3,3,3,3,3,3,3,3,3), is added to the feature vector of "regulation". As shown in FIG. 3, (0,0,0,0,1,1,3,0,0,0,1,1,1,1,1,1,1,1,1,1,1) is added to the feature vector of each word that occurs only once in example sentence B, such as "target" and "country".

[0031] Many sentences are read in while this processing is performed. The feature vectors finally obtained are normalized to a magnitude of 1 and stored in the context information acquisition means 111.
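The accumulation described in paragraphs [0025] to [0031] can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `build_feature_vectors` and `normalize` are hypothetical, the token lists are English glosses standing in for the output of a Japanese morphological analyzer, and vectors are built only for words in the 21-word vocabulary.

```python
from collections import Counter
import math

# The 21 vocabulary words of the worked example, as English glosses.
VOCAB = ["America", "government", "advanced", "major", "country", "COCOM",
         "regulation", "fundamental", "review", "proposal", "target", "weapon",
         "manufacturing", "industry", "product", "export", "condition", "item",
         "substantial", "reduction", "intention"]

def build_feature_vectors(units, vocab):
    """Accumulate co-occurrence feature vectors over tokenized text units."""
    index = {w: i for i, w in enumerate(vocab)}
    vectors = {}
    for tokens in units:
        # Frequency distribution of vocabulary words within this unit.
        counts = Counter(t for t in tokens if t in index)
        freq = [0] * len(vocab)
        for w, c in counts.items():
            freq[index[w]] = c
        # Each word receives the distribution scaled by its own count.
        for w, c in counts.items():
            vec = vectors.setdefault(w, [0] * len(vocab))
            for i, f in enumerate(freq):
                vec[i] += c * f
    return vectors

def normalize(vec):
    """Scale a vector to magnitude 1 (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

SENTENCE_A = VOCAB[:10]
SENTENCE_B = ["regulation", "target", "country", "weapon", "manufacturing",
              "industry", "product", "export", "regulation", "condition",
              "COCOM", "regulation", "item", "substantial", "reduction",
              "intention"]

raw = build_feature_vectors([SENTENCE_A, SENTENCE_B], VOCAB)
stored = {w: normalize(v) for w, v in raw.items()}
```

After both sentences are read, `raw["regulation"]` holds the sentence A distribution plus three times the sentence B distribution, and `stored` holds the unit-length vectors that would be kept by the context information acquisition means 111.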

[0032] Next, as one embodiment, the dialogue management means 112 and the integrated processing means 110 are described for a system that integrates speech recognition, head-movement recognition, and gaze-direction recognition to judge whether the user's reaction to a question uttered by the system is "affirmative" or "negative". In this example, speech recognition is performed by word spotting within a predetermined set of keywords; head-movement recognition recognizes "vertical nod", "horizontal shake", and "head tilt"; and gaze-direction recognition recognizes "front (eye contact)" and "other than front (wandering gaze)".

[0033] As shown in FIG. 8, the dialogue management means 112 keeps track of the current state, that is, how far the dialogue between the system and the user has progressed, and determines the content of the next utterance. It passes the keyword of that utterance, the keyword utterance time, and the utterance end time to the integrated processing means 110, after which the utterance is output from the output means 113.

[0034] Based on the keyword KW passed from the dialogue management means 112, the integrated processing means 110 creates "affirmative keywords" and "negative keywords". The affirmative keywords are the keyword KW itself and stock phrases used for affirmation, such as "hai" (yes), "un" (yeah), and "sō desu" (that's right). The negative keywords are keywords that are semantically opposite or similar to KW, together with stock phrases used for negation, such as "iie" (no). A "keyword semantically opposite or similar to KW" is any keyword, among all the keywords that speech recognition can recognize, whose similarity to KW is at or above a certain threshold. The similarity is computed by obtaining the feature vectors from the context information acquisition means 111 and taking, for each keyword, the inner product of its feature vector with that of KW after normalizing both to a magnitude of 1.
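A minimal sketch of this keyword grouping, under stated assumptions: the function names, the toy three-dimensional vectors, and the threshold of 0.8 are all hypothetical, and the stock phrases are given as English glosses. The only parts taken from the text are the normalized inner product as the similarity and the threshold test.

```python
import math

def unit(vec):
    """Scale a vector to magnitude 1 (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def similarity(a, b):
    """Inner product of two vectors after normalizing each to magnitude 1."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))

def keyword_groups(kw, vectors, threshold,
                   yes_phrases=("yes", "yeah", "that's right"),
                   no_phrases=("no",)):
    """Return (affirmative, negative) keyword sets for the keyword kw."""
    affirmative = {kw, *yes_phrases}
    negative = set(no_phrases)
    for word, vec in vectors.items():
        # Any other recognizable keyword close to KW counts as negative.
        if word != kw and similarity(vectors[kw], vec) >= threshold:
            negative.add(word)
    return affirmative, negative

# Hypothetical toy feature vectors for three recognizable keywords.
VECTORS = {"red": [3, 1, 0], "blue": [2, 1, 0], "tomorrow": [0, 1, 4]}
affirmative, negative = keyword_groups("red", VECTORS, threshold=0.8)
```

With these toy values, "blue" lands in the negative group for the keyword "red" because the two words occur in similar contexts, while "tomorrow" falls below the threshold and stays in neither group.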

[0035] If an electronic thesaurus or antonym dictionary is available, it can also be used to look up keywords that are semantically opposite or similar to the keyword KW. The integrated processing means 110 also receives from the dialogue management means 112 the utterance start time T1 of the keyword KW and the end time T2 of the utterance itself. Instead of the utterance start time of KW, T1 may be the time at which enough of KW has been uttered for it to be identifiable as KW. However, the point at which KW becomes identifiable varies with context, so a simpler approach is to take as T1 the later of two times: the utterance end time of KW minus MT (for example, 0.5 seconds), and the utterance start time of KW.
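The simple rule for T1 is a single comparison. The sketch below assumes times in seconds; the function name `keyword_reference_time` is hypothetical.

```python
def keyword_reference_time(kw_start, kw_end, mt=0.5):
    """Return T1: the later of (keyword end time - MT) and keyword start time."""
    return max(kw_start, kw_end - mt)

# A keyword spoken from 2.0 s to 3.0 s gives T1 = 2.5 s; a short keyword
# spoken from 2.0 s to 2.25 s falls back to its own start time, 2.0 s.
t1_long = keyword_reference_time(2.0, 3.0)
t1_short = keyword_reference_time(2.0, 2.25)
```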

[0036] The user's intention is recognized from the user's utterances, head movements, and the like recognized between time T1 and time T2 + WT (WT being, for example, 0.5 seconds). As shown in FIG. 4A, the period from T1 to T2 + WT is called the "response window" here. The response window is shortened or extended according to the user's utterances and actions. It is shortened, as shown in FIG. 4B, when no further utterance or action is observed within WT after the user makes a meaningful utterance or action. It is extended when a meaningful utterance or action by the user (including a head tilt, a wandering gaze, or the utterance of a filler word such as "ēto" (um)) is observed at the right edge of the response window. In that case the device waits until WT has elapsed after those actions end, and if no user utterance or action is observed at that point, the response window runs up to that point, as shown in FIG. 4C.
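One possible reading of the shortening and extension rules is sketched below. It is an interpretation, not the patent's algorithm: the `Event` type and `response_window` function are hypothetical, events are assumed to be known in full (a real system would process them as they arrive), and user reactions are split into "decisive" ones (keyword, nod, shake), which close the window WT after they end, and "holding" ones (head tilt, wandering gaze, filler), which can only push the closing time later.

```python
from dataclasses import dataclass

@dataclass
class Event:
    st: float        # start time in seconds
    et: float        # end time in seconds
    decisive: bool   # True: keyword, nod, shake; False: tilt, gaze, filler

def response_window(t1, t2, events, wt=0.5):
    """Return (open, close) of the response window after adjustment."""
    close = t2 + wt                      # default window is [T1, T2 + WT]
    latest_end = float("-inf")
    for ev in sorted(events, key=lambda e: e.st):
        if ev.et < t1:
            continue                     # finished before the window opened
        if ev.st > close:
            break                        # the window has already closed
        latest_end = max(latest_end, ev.et)
        if ev.decisive:
            close = latest_end + wt      # shorten or extend to WT after it
        else:
            close = max(close, latest_end + wt)  # holding actions only extend
    return t1, close
```

For example, with T1 = 1.0 and T2 = 4.0, an early nod ending at 2.0 closes the window at 2.5 (the FIG. 4B case), while a filler word lasting until 5.0 keeps it open until 5.5 (the FIG. 4C case).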

[0037] The value of 0.5 seconds given as an example of WT is based on analysis of a multimodal dialogue database. Analysis of human-to-human dialogue in the multimodal dialogue database found no situation in which there was no reaction at all for 0.5 seconds or longer; for example, within 0.5 seconds after a question is uttered, some reaction occurs, such as an utterance, a wandering gaze, or a head tilt. This is thought to constitute the natural pause ("ma") in human-to-human dialogue. Closing the response window when no reaction occurs for 0.5 seconds or more is therefore thought to promote natural dialogue, which is why 0.5 seconds is given as an example of WT. The appropriate value of WT differs between individuals and also depends on the system's speech rate, so it should be made adjustable as needed.

[0038] Each piece of information passed from the recognition means 105 to 108 is represented as a five-tuple: start time (st), end time (et), mode (md), recognition result (rs), and likelihood (sc). The start time and end time are values passed from the time acquisition means 109 and represent the start and end of the input data from which the recognition result was obtained. The mode is the type of user output, several of which can occur simultaneously: "speech", "head movement", "face direction", "gaze direction", "facial expression", "gesture", and so on. The recognition result depends on the mode: for "speech", the recognized word; for "head movement", "vertical nod", "horizontal shake", "head tilt", etc.; for "face direction", "front", "right", "left", "up", "down", "upper right", etc.; for "gaze direction", "front (eye contact)", "other than front (wandering gaze)", "right", "left", "up", "down", "upper right", etc.; and for "facial expression", "laughter", "anger", "sadness", etc. The likelihood is a numerical value indicating how certain the recognition result is, and is obtained, for example, from the distance between the recognition template and the input actually recognized.

[0039] The integrated processing means 110 first looks at the start time and end time in each piece of information passed from the recognition means 105 to 108, and uses for integrated recognition only those items whose start time and end time both fall within the response window.
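The five-tuple and the window filter can be sketched as below. The field names follow the abbreviations in the text (st, et, md, rs, sc); the `Recognition` type, the `in_window` function, and the sample values are hypothetical.

```python
from typing import NamedTuple

class Recognition(NamedTuple):
    st: float   # start time of the input data, in seconds
    et: float   # end time of the input data, in seconds
    md: str     # mode, e.g. "speech" or "head movement"
    rs: str     # recognition result, e.g. a word or "vertical nod"
    sc: float   # likelihood of the recognition result

def in_window(items, w_open, w_close):
    """Keep only items whose start and end both lie inside the window."""
    return [r for r in items if w_open <= r.st and r.et <= w_close]

# Hypothetical recognition results from three channels.
items = [
    Recognition(1.2, 1.6, "speech", "yes", 0.9),
    Recognition(0.5, 1.1, "head movement", "vertical nod", 0.8),
    Recognition(4.0, 4.8, "gaze direction", "wandering gaze", 0.7),
]
kept = in_window(items, w_open=1.0, w_close=4.5)
```

Here only the speech result survives: the nod starts before the window opens and the gaze result ends after it closes.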

[0040] In this embodiment, the keywords obtained from the speech recognition means can be limited to three types: "affirmative keywords", "negative keywords", and "other keywords" (keywords that are neither affirmative nor negative). Depending on the application, it may be better not to use "other keywords". In that case, when an "other" keyword is recognized, an affirmative/negative judgment is made based on the other recognition results, and the user is asked whether that judgment is correct. If it is, the keyword is placed among either the affirmative keywords or the negative keywords according to the judgment, and thereafter, whenever the same question is used, the keyword is treated as an affirmative or negative keyword accordingly. However, the same keyword may be affirmative for one person and negative for another, or may change for the same person over time, so it is advisable to recognize and distinguish users, or to apply Bayes classification or the like to the judgments made so far.

[0041] Whether the user's intention is "affirmative" or "negative" is judged by Bayes classification over five features observed within the response window: "affirmative keyword uttered", "negative keyword uttered", "other keyword uttered", "vertical nod", and "horizontal shake". Because the question uttered by the system contains a keyword that is the target of affirmation or negation, the response window is set from the time that keyword is uttered (T1) to the end time of the utterance itself (T2) plus WT, and the judgment is made from the five features above within that window. Although only these five features are used for Bayes classification, the response window is extended later in time when a head tilt, a wandering gaze, or the utterance of a filler word such as "ēto" (um) is recognized. Conversely, if no further utterance or head movement is detected within WT after an affirmative or negative keyword is uttered or a vertical nod or horizontal shake occurs, the response window is cut off at that point.

[0042] The above is explained in more detail below.

[0043] Once the system begins uttering the question, the user's "affirmative keyword utterances", "negative keyword utterances", "other keyword utterances", "vertical nods", and "horizontal shakes" occurring between T1 and T2 + WT are examined. If none of these reactions is observed by time T2 and a head tilt, a wandering gaze, or a filler word such as "ēto" (um) is recognized, the response window is extended. For a head tilt, the device waits until the head straightens or a vertical nod or horizontal shake occurs; for a wandering gaze, it waits until the user looks to the front; for a filler word, it waits for WT after the filler utterance ends. If no user utterance or head movement is occurring at that point, whatever has occurred so far is used for Bayes classification. If such an utterance or movement is occurring, the response window continues to be extended. However, when multiple vertical nods occur in the latter half of the user's utterance, they are nods directed at the user's own speech, so they are ignored and not used in Bayes classification.
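The nod-suppression rule can be sketched as follows. The text does not define "latter half" precisely, so this sketch assumes it means nods that start between the midpoint and the end of the user's utterance; the function name `drop_self_nods` and the `min_count` parameter (set to 2 to reflect "multiple" nods) are hypothetical.

```python
def drop_self_nods(nods, speech_st, speech_et, min_count=2):
    """Remove nods in the latter half of the user's own utterance.

    nods is a list of (start, end) intervals for vertical nods; the
    utterance runs from speech_st to speech_et (all times in seconds).
    """
    mid = (speech_st + speech_et) / 2
    late = [n for n in nods if mid <= n[0] <= speech_et]
    if len(late) < min_count:
        return list(nods)            # too few late nods: keep everything
    return [n for n in nods if n not in late]

# Utterance from 2.0 s to 4.0 s: the two nods after 3.0 s are dropped,
# while the nod before the utterance is kept for classification.
filtered = drop_self_nods([(1.0, 1.3), (3.2, 3.4), (3.6, 3.8)], 2.0, 4.0)
```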

【００４４】Ｔ２＋ＷＴの時刻までの間にこれらの反応
が見られず、ユーザの「かしげ」や「目の泳ぎ」あるい
は「えーと」などの不要語の発話（レスポンスウィンド
ウ伸長動作）も認識されなかった場合や、これらのレス
ポンスウィンドウ伸長動作が認識されて待機された後に
ＷＴの時間がたってもユーザの発話や顔の振り等の動作
が発生されなかった場合は、統合処理手段１１０により
「ユーザが何の反応もしない」という旨が対話管理手段
１１２へ伝えられる。すると、対話管理手段１１２によ
り現在の状況に応じて「もしもし」、「何か答えてくだ
さい」等の発話が出力手段１１３を通じて行なわれる。
なお、べイズ識別ではマルチモーダル対話データベース
の情報が用いられる。If none of these reactions is seen by time T2 + WT and no response-window-extending behavior (a head tilt, a wandering gaze, or a filler word such as "um") is recognized either, or if such behavior is recognized and waited out but no user utterance or action such as a face swing occurs even after WT has elapsed, the integrated processing means 110 notifies the dialogue management means 112 that "the user shows no reaction". The dialogue management means 112 then has the output means 113 utter something such as "Hello?" or "Please answer" according to the current situation. Note that the Bayes identification uses information from the multi-modal dialogue database.

【００４５】次に、他の実施例として、音声認識と顔の
振りの認識と顔や視線の向きの認識が統合されること
で、システムにより発話される質問文に対するユーザの
反応が「肯定」なのか「否定」なのかが判定される以外
に、右／左等の向きが認識されるシステムの場合によ
り、対話管理手段１１２及び統合処理手段１１０を説明
する。Next, as another embodiment, the dialogue management means 112 and the integrated processing means 110 are described for a system in which voice recognition, face swing recognition, and recognition of face and gaze direction are integrated, so that, in addition to determining whether the user's reaction to a question uttered by the system is "affirmative" or "negative", a direction such as right or left is also recognized.

【００４６】この場合は対話管理手段１１２により統合
処理手段１１０から「肯定／否定」を答として受けとり
たいのか、「右／左等の向き」を答として受けとりたい
のかが、キーワードＫＷや発話時刻が統合処理手段１１
０に送られる時に一緒に送られる必要がある。対話管理
手段１１２によるそれ以外の点では、上述実施例と同様
である。また、出力手段１１３も上述実施例と同様であ
る。In this case, when the dialogue management means 112 sends the keyword KW and the utterance time to the integrated processing means 110, it must also indicate whether it wants to receive "affirmative/negative" or "a direction such as right/left" as the answer. In all other respects the dialogue management means 112 is the same as in the above embodiment. The output means 113 is also the same as in the above embodiment.

【００４７】統合処理手段１１０については、「肯定／
否定」を答として受けとりたい場合の処理は前述のシス
テムと同様にすれば良い。As for the integrated processing means 110, when "affirmative/negative" is wanted as the answer, the processing may be the same as in the system described above.

【００４８】「右／左等の向き」を答として受けとりた
い場合の統合処理手段１１０の処理は、ユーザからのデ
ータとして、例えば音声としては「それ」等の指示語や
「各方向に特有のキーワード発話」（「右」「左」等）
や「画面に表示されている物の名前とそれに類似した単
語」等が認識され、他のモードとしては「顔の向き」、
「視線の向き」、「手を伸ばした方向」等が認識され、
やはりマルチモーダル対話データベースの情報が用いら
れてレスポンスウィンドウ内でのべイズ識別が行なわれ
る。レスポンスウィンドウの設定方法は上述実施例と同
様である。When "a direction such as right/left" is wanted as the answer, the integrated processing means 110 handles the following data from the user. From speech, it recognizes demonstratives such as "that", keyword utterances specific to each direction ("right", "left", etc.), and the names of objects displayed on the screen and words similar to them. From the other modes, it recognizes "face direction", "gaze direction", "direction of an outstretched hand", and so on. Bayes identification within the response window is again performed using information from the multi-modal dialogue database. The response window is set in the same way as in the above embodiment.

【００４９】「画面に表示されている物の名称に類似し
た単語」は、対話管理手段１１２から「両面に表示され
ている物の名称」を受け取り、これをＤＷ１，ＤＷ
２，．．．ＤＷｎとすると、音声認識できるすべてのキ
ーワードの中でＤＷｉとの類似度（これは文脈情報取得
手段の情報から得られる。例えば、類似度を求めたい単
語の特徴べクトルとＤＷｉの特徴べクトルとの内積を取
れば良い）がある閾値以上のキーワードのことである。
このＤＷｉとの類似度がある閾値以上になるキーワード
群が「キーワード群ｉ」となる。"Words similar to the name of an object displayed on the screen" are determined as follows. The names of the objects displayed on the screen are received from the dialogue management means 112 and denoted DW1, DW2, ..., DWn. Among all keywords that can be recognized by voice, those whose similarity to DWi is at or above a certain threshold are selected. The similarity is obtained from the information of the context information acquisition means; for example, it may be the inner product of the feature vector of the word in question and the feature vector of DWi. The set of keywords whose similarity to DWi is at or above the threshold forms "keyword group i".
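The threshold rule for forming keyword groups can be sketched as follows. This is only an illustration: the three-dimensional feature vectors, the word list, and the threshold of 0.8 are made-up values, and the function names are hypothetical; the patent only states that similarity may be an inner product of feature vectors compared against a threshold.

```python
def dot(u, v):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def build_keyword_groups(display_words, vocabulary, vectors, threshold):
    """Map each displayed object name DWi to its keyword group i.

    A vocabulary word joins group i when the inner product of its
    feature vector with that of DWi is at or above the threshold.
    """
    return {
        dw: [w for w in vocabulary if dot(vectors[w], vectors[dw]) >= threshold]
        for dw in display_words
    }


# Toy feature vectors (assumed values, for illustration only).
vectors = {
    "notebook": (1.0, 0.0, 0.0),
    "notepad": (0.9, 0.1, 0.0),
    "pencil": (0.0, 1.0, 0.0),
    "pen": (0.1, 0.9, 0.0),
    "eraser": (0.0, 0.0, 1.0),
}
vocabulary = list(vectors)
groups = build_keyword_groups(
    ["notebook", "pencil", "eraser"], vocabulary, vectors, threshold=0.8
)
print(groups)
```

With these toy vectors, "notepad" falls into the group of "notebook" and "pen" into the group of "pencil", while "eraser" forms a group on its own.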

【００５０】ここで、べイズ識別の方法について説明す
る。The method of Bayes identification is now described.

【００５１】マルチモーダル対話データベースには、図
７に示すように、人間同士の対話（各人の役割がシステ
ムとユーザとにそれぞれ対応しているものもある）や、
システムとユーザとの対話の様子を様々なモードでとら
えたものが記録されている。肯定／否定を識別するため
のべイズ識別を行なうには、マルチモーダル対話データ
ベース中から、ユーザが肯定／否定で答える対話のもの
だけが抜き出され、その対話データのレスポンスウィン
ドウ内での「肯定キーワード」「否定キーワード」「そ
の他のキーワード」「顔の縦振り」「顔の横振り」の存
在の有無が調査され、その調査結果が一つの学習データ
とされる。なお、対話データの中に「かしげ」や「目の
泳ぎ」がある場合はそれらがなくなるまでレスポンスウ
ィンドウが拡張されて調査される。As shown in FIG. 7, the multi-modal dialogue database records, in various modes, dialogues between humans (in some of which each person's role corresponds to the system or the user) and dialogues between the system and a user. To perform Bayes identification of affirmative/negative, only the dialogues in which the user answers affirmatively or negatively are extracted from the multi-modal dialogue database. For each such dialogue, the presence or absence of an "affirmative keyword", a "negative keyword", an "other keyword", a "vertical face swing", and a "horizontal face swing" within the response window is examined, and the result becomes one item of training data. If the dialogue data contains a head tilt or a wandering gaze, the response window is extended until these cease before the examination is made.

【００５２】例えば、「今日は暑いですね」というシス
テムからの問いかけに対するユーザの応答データがある
とする。この場合、キーワードＫＷは「暑い」であり、
肯定キーワードとしては「はい」「うん」「そうです」
「暑い」等が考えられ、否定キーワードとしては「いい
え」「暑くない」「涼しい」等が考えられる。肯定の答
のデータ例として、レスポンスウィンドウ内で「はい」
という発話があり、「顔の縦振り」が見られたという場
合は、Ｙ１００１０という学習データが得られる。先頭のＹは肯定の答えを
意味し、次の１と０は、それぞれ「肯定キーワード」
「否定キーワード」「その他のキーワード」「顔の縦振
り」「顔の横振り」が存在するなら１、存在しないなら
０である。For example, suppose there is user response data for the system's remark "It's hot today, isn't it?". In this case the keyword KW is "hot". Possible affirmative keywords include "yes" ("hai"), "yeah" ("un"), "that's right", and "hot"; possible negative keywords include "no", "not hot", and "cool". As an example of data for an affirmative answer, if the utterance "yes" and a vertical face swing are observed within the response window, the training data Y10010 is obtained. The leading Y means an affirmative answer, and each of the following digits is 1 if the "affirmative keyword", "negative keyword", "other keyword", "vertical face swing", or "horizontal face swing", respectively, is present and 0 if it is absent.

【００５３】また、否定の答のデータ例として、レスポ
ンスウィンドウ内で「いいえ暑くないです」という発話
があり、顔の動きは特に見られなかった場合は、Ｎ０１０００という学習データが得られる。先頭のＮは否定の答えを
意味する。このような学習データをたくさん用意してお
き、認識データとして例えば「１００１０」（「肯定キ
ーワード」の発話と「顔の縦振り」が見られた）が与え
られたら学習データの中の「Ｙ１００１０」と「Ｎ１０
０１０」の個数が比べられ、「Ｙ１００１０」の方が多
ければ、その時のユーザの意図は「肯定」であるとみな
され、「Ｎ１００１０」の方が多ければ、その時のユー
ザの意図は「否定」であるとみなされる。もしも同数
（両方とも０だった場合を含む）だった場合は「不明」
なので、その旨が対話管理手段１１２に返信され、対話
管理手段１１２により、その場合はもう一度質問がし直
される。また、個数の差が小さい場合もユーザに意図の
識別が正しかったかが確認されるようにすると良い場合
がある。この「１００１０」のような識別結果の先頭に
認識データ（ＹかＮ）を加えたものを学習データに加え
ることで、ユーザが使用すればするほど学習データが増
えて認識率が高まる。As an example of data for a negative answer, if the utterance "No, it's not hot" is observed within the response window with no particular face movement, the training data N01000 is obtained. The leading N means a negative answer. A large amount of such training data is prepared in advance. When recognition data such as "10010" (an "affirmative keyword" utterance and a "vertical face swing" were observed) is given, the numbers of "Y10010" and "N10010" items in the training data are compared. If "Y10010" is more numerous, the user's intention is taken to be "affirmative"; if "N10010" is more numerous, it is taken to be "negative". If the counts are equal (including the case where both are 0), the result is "unknown"; this is reported to the dialogue management means 112, which then asks the question again. When the difference between the counts is small, it may also be advisable to confirm with the user whether the identified intention was correct. By adding each recognition pattern such as "10010", prefixed with its identification result (Y or N), to the training data, the training data grows the more the user uses the system, and the recognition rate improves.
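The counting rule described above can be sketched in a few lines. This is a minimal illustration, assuming the training data is held as a plain list of strings such as "Y10010"; the function name and the toy data are hypothetical and not taken from the patent.

```python
from collections import Counter


def classify(pattern, training, labels=("Y", "N")):
    """Return the label whose labelled pattern is most frequent.

    Returns "unknown" when the top count is shared, which also covers
    the case where no matching training item exists at all.
    """
    counts = Counter(training)
    tallies = {label: counts[label + pattern] for label in labels}
    best = max(tallies.values())
    winners = [label for label, c in tallies.items() if c == best]
    return winners[0] if len(winners) == 1 else "unknown"


# Toy training data (assumed, for illustration only).
training = ["Y10010", "Y10010", "N10010", "N01000"]

result_nod = classify("10010", training)      # 2 x Y10010 vs 1 x N10010
result_no = classify("01000", training)       # 0 x Y01000 vs 1 x N01000
result_unseen = classify("00001", training)   # both counts are 0

# Learning step: append the identified label plus the pattern, so the
# training data grows with use.
if result_nod != "unknown":
    training.append(result_nod + "10010")

print(result_nod, result_no, result_unseen)
```

Returning "unknown" on a tie mirrors the case in which the dialogue management means 112 is told to ask again; a confirmation step for small count differences could be added in the same place.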

【００５４】次に「右／左等の向き」を答えとして受け
とりたい場合のべイズ識別の例を説明する。なお、説明
の都合上「右」と「左」と「上」の３つを識別する場合
について説明するが、方向が増えたりしても考え方は同
じである。この場合はマルチモーダル対話データベース
の中から、システムにより方向をたずねている対話のも
のだけが抜き出され、その対話データのレスポンスウィ
ンドウ内での「『右』や右に表示されている物の名称、
及び右に表示されている物の名称に類似した単語の発
話」「『左』や左に表示されている物の名称、及び左に
表示されている物の名称に類似した単語の発話」
「『上』や上に表示されている物の名称、及び上に表示
されている物の名称に類似した単語の発話」「指示語発
話と同時に顔の向きが右」「指示語発話と同時に顔の向
きが左」「指示語発話と同時に顔の向きが上」「指示語
発話と同時に視線の向きが右」「指示語発話と同時に視
線の向きが左」「指示語発話と同時に視線の向きが上」
「指示語発話と同時に手を伸ばした方向が右」「指示語
発話と同時に手を伸ばした方向が左」「指示語発話と同
時に手を伸ばした方向が上」「顔の向きが右」「顔の向
きが左」「顔の向きが上」「視線の向きが右」「視線の
向きが左」「視線の向きが上」「手を伸ばした方向が
右」「手を伸ばした方向が左」「手を伸ばした方向が
上」等の存在の有無を調査（これらの中の一部だけしか
使わないようにしても良い）し、その調査結果を一つの
学習データとする。なお、対話データの中に「かしげ」
や「目の泳ぎ」がある場合はそれらがなくなるまでレス
ポンスウィンドウが伸長されて調査される。Next, an example of Bayes identification when "a direction such as right/left" is wanted as the answer is described. For convenience, the case of distinguishing three directions, "right", "left", and "up", is described, but the idea is the same with more directions. In this case, only the dialogues in which the system asks about a direction are extracted from the multi-modal dialogue database, and the presence or absence of the following within the response window of each dialogue is examined (only some of them may be used): "utterance of 'right', the name of the object displayed on the right, or a word similar to that name"; "utterance of 'left', the name of the object displayed on the left, or a word similar to that name"; "utterance of 'up', the name of the object displayed at the top, or a word similar to that name"; "face direction is right at the same time as a demonstrative is uttered"; "face direction is left at the same time as a demonstrative is uttered"; "face direction is up at the same time as a demonstrative is uttered"; "gaze direction is right at the same time as a demonstrative is uttered"; "gaze direction is left at the same time as a demonstrative is uttered"; "gaze direction is up at the same time as a demonstrative is uttered"; "direction of the outstretched hand is right at the same time as a demonstrative is uttered"; "direction of the outstretched hand is left at the same time as a demonstrative is uttered"; "direction of the outstretched hand is up at the same time as a demonstrative is uttered"; "face direction is right"; "face direction is left"; "face direction is up"; "gaze direction is right"; "gaze direction is left"; "gaze direction is up"; "direction of the outstretched hand is right"; "direction of the outstretched hand is left"; "direction of the outstretched hand is up"; and so on. The result of the examination becomes one item of training data. If the dialogue data contains a head tilt or a wandering gaze, the response window is extended until these cease before the examination is made.

【００５５】例えば、システムの出力画面の右側に手帳
が、左に鉛筆が、上に消しゴムが表示されている場合
に、「どれが一番欲しいですか」というシステムからの
問いかけに対するユーザの応答データがあるとする。こ
の場合、キーワードＫＷは「欲しい」であり、キーワー
ドＤＷ１は「手帳」であり、キーワードＤＷ２は「鉛
筆」であり、キーワードＤＷ３は「消しゴム」である。
すると、キーワード群１としては「手帳」「ノート」な
どが入ることが考えられ、キーワード群２としては「鉛
筆」「ペン」などが入ることが考えられ、キーワード群
３としては「消しゴム」「イレーサ」などが入ることが
考えられる。なお、どのような単語が入るかは文脈情報
取得手段１１１からの情報に左右される。すると、
「『右』や右に表示されている物の名称、及び右に表示
されている物の名称に類似した単語の発話」としては
「右」「手帳」「ノート」等が考えられ、「『左』や左
に表示されている物の名称、及び左に表示されている物
の名称に類似した単語の発話」としては「左」「鉛筆」
「ペン」等が考えられ、「『上』や上に表示されている
物の名称、及び上に表示されている物の名称に類似した
単語の発話」としては「上」「消しゴム」「イレーサ」
等が考えられる。For example, suppose a notebook is displayed on the right of the system's output screen, a pencil on the left, and an eraser at the top, and there is user response data for the system's question "Which one do you want most?". In this case the keyword KW is "want", keyword DW1 is "notebook" ("techo"), keyword DW2 is "pencil", and keyword DW3 is "eraser" ("keshigomu"). Keyword group 1 might then contain words such as "techo" and its near-synonym "noto"; keyword group 2 might contain "pencil" and "pen"; and keyword group 3 might contain "keshigomu" and the loanword "ireesa". Which words are included depends on the information from the context information acquisition means 111. Accordingly, the "utterance of 'right', the name of the object displayed on the right, or a word similar to that name" could be "right", "techo", "noto", etc.; the "utterance of 'left', the name of the object displayed on the left, or a word similar to that name" could be "left", "pencil", "pen", etc.; and the "utterance of 'up', the name of the object displayed at the top, or a word similar to that name" could be "up", "keshigomu", "ireesa", etc.

【００５６】「右」が答であるデータ例として、レスポ
ンスウィンドウ内で「指示語発話と同時に顔の向きが
右」「顔の向きが右」「顔の向きが左」「顔の向きが
上」が見られた場合は、Ｒ０００１００００００００１１１００００００という学習データが得られる。先頭のＲは「右」が答で
あることを意味し、次の１と０は、それぞれ上記の状態
が存在するなら１、存在しないなら０である。As an example of data for which "right" is the answer, if "face direction is right at the same time as a demonstrative is uttered", "face direction is right", "face direction is left", and "face direction is up" are observed within the response window, the training data R000100000000111000000 is obtained. The leading R means that "right" is the answer, and each of the following digits is 1 if the corresponding condition listed above is present and 0 if it is absent.
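How such a 21-digit pattern is assembled can be sketched as below. The channel names, the tuple representation of an observation, and the `encode` function are hypothetical choices made for this illustration; the only thing taken from the description is the order of the 21 conditions (three directions for each of seven kinds of observation).

```python
DIRECTIONS = ("right", "left", "up")

# Seven kinds of observation, in the order in which the conditions are
# listed in the description (names are assumptions for this sketch).
CHANNELS = (
    "keyword",             # direction word, object name, or similar word
    "demonstrative+face",  # face direction while a demonstrative is uttered
    "demonstrative+gaze",
    "demonstrative+hand",
    "face",
    "gaze",
    "hand",
)

# 7 channels x 3 directions = 21 binary features.
FEATURES = [(channel, d) for channel in CHANNELS for d in DIRECTIONS]


def encode(label, observed):
    """Build a training item: answer label followed by 21 presence bits."""
    bits = "".join("1" if feature in observed else "0" for feature in FEATURES)
    return label + bits


observed_right = {
    ("demonstrative+face", "right"),
    ("face", "right"),
    ("face", "left"),
    ("face", "up"),
}
sample_right = encode("R", observed_right)

observed_up = {
    ("demonstrative+gaze", "up"),
    ("demonstrative+hand", "up"),
    ("gaze", "right"),
    ("gaze", "left"),
    ("gaze", "up"),
    ("hand", "up"),
}
sample_up = encode("U", observed_up)

print(sample_right)
print(sample_up)
```

The first observation set corresponds to the "right" example in this paragraph; the second corresponds to the "up" example given in paragraph 0058.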

【００５７】また、「左」が答であるデータ例として、
レスポンスウィンドウ内で「『左』や左に表示されてい
る物の名称の発話、及び左に表示されている物の名称に
類似した単語の発話」「顔の向きが右」「顔の向きが
左」「顔の向きが上」が見られた場合は、Ｌ０１００００００００００１１１００００００という学習データが得られる。先頭のＬは「左」が答で
あることを意味する。As an example of data for which "left" is the answer, if "utterance of 'left', the name of the object displayed on the left, or a word similar to that name", "face direction is right", "face direction is left", and "face direction is up" are observed within the response window, the training data L010000000000111000000 is obtained. The leading L means that "left" is the answer.

【００５８】また、「上」が答であるデータ例として、
レスポンスウィンドウ内で「指示語発話と同時に視線の
向きが上」「指示語発話と同時に手を伸ばした方向が
上」「視線の向きが右」「視線の向きが左」「視線の向
きが上」「手を伸ばした方向が上」が見られた場合は、Ｕ００００００００１００１０００１１１００１という学習データが得られる。先頭のＵは「上」が答え
であることを意味する。As an example of data for which "up" is the answer, if "gaze direction is up at the same time as a demonstrative is uttered", "direction of the outstretched hand is up at the same time as a demonstrative is uttered", "gaze direction is right", "gaze direction is left", "gaze direction is up", and "direction of the outstretched hand is up" are observed within the response window, the training data U000000001001000111001 is obtained. The leading U means that "up" is the answer.

【００５９】このような学習データをたくさん用意して
おき、認識データとして例えば「１００１００００００
００１１１００００００」（「『右』や右に表示されて
いる物の名称、及び右に表示されている物の名称に類似
した単語の発話」と「指示語発話と同時に顔の向きが
右」と「顔の向きが右」と「顔の向きが左」と「顔の向
きが上」が見られた）が与えられたら、学習データの中
の「Ｒ１００１００００００００１１１００００００」
と「Ｌ１００１００００００００１１１００００００」
と「Ｕ１００１００００００００１１１００００００」
の個数が比べられ、最も多いデータの先頭の文字によっ
て、「Ｒ」なら「右」、「Ｌ」なら「左」、「Ｕ」なら
「上」であるとみなされる。もしも、同数（三つとも０
だった場合を含む）だった場合は「不明」なので、その
旨が対話管理手段１１２に返信され、対話管理手段１１
２により、その場合はもう一度質問をし直されたりす
る。また、個数の差が小さい場合もユーザに方向の識別
が正しかったかどうかが確認されるようにすると良い場
合がある。これらの識別結果の先頭に認識データ（Ｒか
ＬかＵ）を加えたものを学習データに加えることで、ユ
ーザが使用すればするほど学習データが増えるようにす
ることができる。A large amount of such training data is prepared in advance. When recognition data such as "100100000000111000000" ("utterance of 'right', the name of the object displayed on the right, or a word similar to that name", "face direction is right at the same time as a demonstrative is uttered", "face direction is right", "face direction is left", and "face direction is up" were observed) is given, the numbers of "R100100000000111000000", "L100100000000111000000", and "U100100000000111000000" items in the training data are compared, and the answer is taken from the leading character of the most numerous pattern: "right" for "R", "left" for "L", and "up" for "U". If the counts are equal (including the case where all three are 0), the result is "unknown"; this is reported to the dialogue management means 112, which then, for example, asks the question again. When the difference between the counts is small, it may also be advisable to confirm with the user whether the identified direction was correct. By adding each recognition pattern, prefixed with its identification result (R, L, or U), to the training data, the training data can be made to grow the more the user uses the system.

【００６０】図５は、本発明を「商品紹介システム」に
応用した実施例である。この場合は、レスポンスウィン
ドウの伸長のために「顔の振りのかしげ」や「視線の泳
ぎ（視線が正面を向いていない）」を用い、肯定／否定
のべイズ識別において「肯定キーワードの発話」「否定
キーワードの発話」「その他のキーワードの発話」「顔
の縦振り」「顔の横振り」を用い、方向の判定に「指示
語の発話」「方向キーワードの発話」「商品名、及び商
品名に類似した単語の発話」「顔の向き」が用いられて
いる。FIG. 5 shows an embodiment in which the present invention is applied to a "product introduction system". In this case, a head tilt and a wandering gaze (the gaze not facing the front) are used to extend the response window; "affirmative keyword utterance", "negative keyword utterance", "other keyword utterance", "vertical face swing", and "horizontal face swing" are used for the affirmative/negative Bayes identification; and "demonstrative utterance", "direction keyword utterance", "utterance of a product name or a word similar to a product name", and "face direction" are used to determine direction.

【００６１】この「商品紹介システム」の対話例を図６
Ａから図６Ｄを用いて詳細に説明する。なお、ここでは
システムの発話を『Ｓ「発話内容」』、ユーザの発話や
動作を『Ｕ「発話や動作内容」』で表している。An example dialogue with this "product introduction system" is described in detail with reference to FIGS. 6A to 6D. Here, a system utterance is written as S "utterance content", and a user utterance or action as U "utterance or action content".

【００６２】Ｓ「こんにちは」Ｓ「商品の紹介を致します。よろしいでしょうか？」Ｕ「『はい』と発話してうなずく」（図６Ａ参照）Ｓ「どちらの商品に興味がありますか？」Ｕ「右を見てうなずきながら『それです』と発話する」
（図６Ｂでは顔が右を向いているかどうかが一定の間
隔で出力されている）Ｓ「右ですね？」Ｕ「『はい』と発話してうなずく」（図６Ｃ参照）Ｓ「これは、ＦＡＸ機能を内蔵した電子手帳ザウルスで
す。」Ｓ「こちらのビューカムには興味はございますか？」Ｕ「顔を横振り」（図６Ｄ参照）Ｓ「ありがとうざいました。」なお、本実施例ではべイズ識別に用いるデータは「肯定
キーワードの発話」「否定キーワードの発話」「その他
キーワードの発話」「顔の縦振り」「顔の構振り」
「『右』や右に表示されている物の名称、及び右に表示
されている物の名称に類似した単語の発話」「『左』や
左に表示されている物の名称、及び左に表示されている
物の名称に類似した単語の発話」「『上』や上に表示さ
れている物の名称、及び上に表示されている物の名称に
類似した単語の発話」「指示語発話と同時に顔の向きが
右」「指示語発話と同時に顔の向きが左」「指示語発話
と同時に顔の向きが上」「指示語発話と同時に視線の向
きが右」「指示語発話と同時に視線の向きが左」「指示
語発話と同時に視線の向きが上」「指示語発話と同時に
手を伸ばした方向が右」「指示語発話と同時に手を伸ば
した方向が左」「指示語発話と同時に手を伸ばした方向
が上」「顔の向きが右」「顔の向きが左」「顔の向きが
上」「視線の向きが右」「視線の向きが左」「視線の向
きが上」「手を伸ばした方向が右」「手を伸ばした方向
が左」「手を伸ばした方向が上」等が使われているが、
これはマルチモーダル対話データベースから得られる情
報なら何を使っても良く、より一般化して書けば、モードＭ₁，モードＭ₂，．．．，モードＭ_n で、それぞれのモードの有無が調査されれば良い。例え
ば「肯定キーワードの発話」「否定キーワードの発話」
「その他キーワードの発話」「顔の縦振り」「顔の横振
り」が使われた場合というのはｎ＝５で、モードＭ₁ ＝
「肯定キーワードの発話」、モードＭ₂＝「否定キーワ
ードの発話」、モードＭ₃＝「その他キーワードの発
話」、モードＭ₄＝「顔の縦振り」、モードＭ₅＝「顔の
構振り」とした場合ということになる。[0062]
S "Hello."
S "Let me introduce our products. Is that all right?"
U "Says 'yes' and nods" (see FIG. 6A)
S "Which product are you interested in?"
U "Looks to the right and, nodding, says 'That one'" (in FIG. 6B, whether the face is turned to the right is output at regular intervals)
S "The one on the right?"
U "Says 'yes' and nods" (see FIG. 6C)
S "This is the Zaurus electronic organizer with a built-in fax function."
S "Are you interested in this Viewcam?"
U "Shakes head sideways" (see FIG. 6D)
S "Thank you very much."
In this embodiment, the data used for Bayes identification are "affirmative keyword utterance", "negative keyword utterance", "other keyword utterance", "vertical face swing", and "horizontal face swing"; the utterance of "right", "left", or "up", of the name of the object displayed in that position, or of a word similar to that name; face direction, gaze direction, or direction of the outstretched hand being right, left, or up at the same time as a demonstrative is uttered; face direction, gaze direction, or direction of the outstretched hand being right, left, or up; and so on. However, any information obtainable from the multi-modal dialogue database may be used. Stated more generally, it suffices to examine the presence or absence of each of modes M_1, M_2, ..., M_n. For example, using "affirmative keyword utterance", "negative keyword utterance", "other keyword utterance", "vertical face swing", and "horizontal face swing" corresponds to n = 5 with mode M_1 = "affirmative keyword utterance", mode M_2 = "negative keyword utterance", mode M_3 = "other keyword utterance", mode M_4 = "vertical face swing", and mode M_5 = "horizontal face swing".

【００６３】また、各モードの有無についても、単なる
０，１を用いる以外に、尤度として０〜１の実数値をと
らせることもできる。この場合のべイズの識別は、学習
データから線形補間を行なえば良い。例えばレスポンス
ウィンドウ内で０．８，０，０，０．７，０という認
識データが得られた場合には、学習データ中の「Ｙ１０
０１０」の個数をｙ₁₁、「Ｎ１００１０」の個数を
ｎ₁₁、「Ｙ０００１０」の個数をｙ₀₁、「Ｎ０００１
０」の個数をｎ₀₁、「Ｙ１００００」の個数をｙ₁₀、
「Ｎ１００００」の個数をｎ₁₀、「Ｙ０００００」の個
数をｙ₀₀、「Ｎ０００００」の個数をｎ₀₀、とすると、ｎ_Y＝０.８×０．７×ｙ₁₁＋（１−０．８）×０．７×
ｙ₀₁＋０．８×（１−０．７）ｙ₁₀＋（１−０．８）×
（１−０．７）ｙ₀₀ ｎ_N＝０.８×０．７×ｎ₁₁＋（１−０．８）×０．７×
ｎ₀₁＋０．８×（１−０．７）ｎ₁₀＋（１−０．８）×
（１−０．７）ｎ₀₀ が計算され、ｎ_Yとｎ_Nの大小が比較され、ｎ_Yの方が大
きければ、その時のユーザの意図は「肯定」であるとみ
なされ、ｎ_Nの方が大きければ、その時のユーザの意図
は「否定」であるとみなされる。The presence or absence of each mode need not be a plain 0 or 1; it can instead take a real value from 0 to 1 as a likelihood. Bayes identification in this case can be done by linear interpolation over the training data. For example, if the recognition data 0.8, 0, 0, 0.7, 0 is obtained within the response window, let $y_{11}$ be the number of "Y10010" items in the training data, $n_{11}$ the number of "N10010", $y_{01}$ the number of "Y00010", $n_{01}$ the number of "N00010", $y_{10}$ the number of "Y10000", $n_{10}$ the number of "N10000", $y_{00}$ the number of "Y00000", and $n_{00}$ the number of "N00000". Then

$$n_Y = 0.8 \times 0.7 \, y_{11} + (1 - 0.8) \times 0.7 \, y_{01} + 0.8 \times (1 - 0.7) \, y_{10} + (1 - 0.8) \times (1 - 0.7) \, y_{00}$$

$$n_N = 0.8 \times 0.7 \, n_{11} + (1 - 0.8) \times 0.7 \, n_{01} + 0.8 \times (1 - 0.7) \, n_{10} + (1 - 0.8) \times (1 - 0.7) \, n_{00}$$

are calculated and compared. If $n_Y$ is larger, the user's intention is taken to be "affirmative"; if $n_N$ is larger, it is taken to be "negative".
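The interpolation can be written generally for n modes by weighting every binary pattern with the product of the per-mode likelihoods. The sketch below assumes the training counts are available as a dictionary from labelled pattern to count; the function name and the toy counts are hypothetical. With likelihoods 0.8, 0, 0, 0.7, 0, every pattern that has a 1 in the second, third, or fifth position gets weight 0, so the sum reduces to the four terms of the formulas above.

```python
from itertools import product


def interpolated_count(label, likelihoods, counts):
    """Likelihood-weighted count of training items carrying this label.

    Each binary pattern is weighted by the product over modes of p
    (bit is 1) or 1 - p (bit is 0), then multiplied by its count.
    """
    total = 0.0
    for bits in product((0, 1), repeat=len(likelihoods)):
        weight = 1.0
        for bit, p in zip(bits, likelihoods):
            weight *= p if bit else (1.0 - p)
        pattern = "".join(str(bit) for bit in bits)
        total += weight * counts.get(label + pattern, 0)
    return total


# Toy training counts (assumed values, for illustration only).
counts = {
    "Y10010": 10, "Y00010": 2, "Y10000": 3, "Y00000": 1,
    "N10010": 1, "N00010": 4, "N10000": 2, "N00000": 5,
}
likelihoods = (0.8, 0.0, 0.0, 0.7, 0.0)

n_y = interpolated_count("Y", likelihoods, counts)
n_n = interpolated_count("N", likelihoods, counts)
intention = "affirmative" if n_y > n_n else "negative" if n_n > n_y else "unknown"
print(n_y, n_n, intention)
```

For these toy counts the weighted totals are about 6.66 for Y and 1.90 for N, so the intention is taken to be affirmative. Enumerating all 2^n patterns is fine for n = 5; for the 21-feature direction case one would iterate over the patterns actually present in the training data instead.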

【００６４】また、「キーワード群」は、対話管理手段
１１２から与えられた各キーワードＤＷｉをもとに、あ
らかじめ求めておく場合で説明したが、単語Ｗを音声認
識後に、各キーワードＤＷｉとの類似度から、単語Ｗが
どのキーワード群に入っているかを求めることもでき
る。これは単語Ｗの特徴べクトルとＤＷｉの特徴べクト
ルとの内積が最大となるｉをｍとすると、単語Ｗはキー
ワード群ｍに属することにすれば良い。または、所定閾
値を定めておき、類似度がこの閾値以上になるキーワー
ド群に属する（複数のキーワード群に属する場合もあ
る）とみなしてもよい。The "keyword groups" were described above as being obtained in advance from the keywords DWi given by the dialogue management means 112, but it is also possible to determine, after a word W has been recognized by voice, which keyword group W belongs to from its similarity to each keyword DWi. Letting m be the value of i that maximizes the inner product of the feature vector of W and the feature vector of DWi, the word W may be taken to belong to keyword group m. Alternatively, a predetermined threshold may be set, and W may be regarded as belonging to every keyword group for which the similarity is at or above the threshold (so W may belong to more than one keyword group).
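Both assignment rules for a word recognized at run time can be sketched as follows. The vectors and threshold are made-up values and the function names are hypothetical; group indices are 1-based to match DW1, DW2, and so on.

```python
def dot(u, v):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def best_group(word_vec, display_vecs):
    """Return the 1-based index m of the DWi most similar to the word."""
    scores = [dot(word_vec, dw) for dw in display_vecs]
    return max(range(len(scores)), key=scores.__getitem__) + 1


def groups_over_threshold(word_vec, display_vecs, threshold):
    """Return every 1-based group index whose similarity meets the threshold."""
    return [
        i + 1
        for i, dw in enumerate(display_vecs)
        if dot(word_vec, dw) >= threshold
    ]


# Toy feature vectors for DW1..DW3 and for a recognized word W (assumed).
display_vecs = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
word_vec = (0.1, 0.9, 0.0)

m = best_group(word_vec, display_vecs)
members = groups_over_threshold(word_vec, display_vecs, threshold=0.05)
print(m, members)
```

The argmax rule always yields exactly one group, whereas the threshold rule may yield several groups or none, which is the trade-off the paragraph points to.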

【００６５】肯定キーワード／否定キーワードの場合も
同様で、対話管理手段１１２から与えられたキーワード
ＫＷに対し、あらかじめ「肯定キーワード」「否定キー
ワード」を求めておかなくても、単語Ｗを音声認識後
に、単語ＷとキーワードＫＷが同じであれば、単語Ｗは
「肯定キーワード」とみなせるし、単語Ｗとキーワード
ＫＷとの類似度が所定閾値以上であれば単語Ｗは「否定
キーワード」とみなせる。The same applies to affirmative and negative keywords. Even without obtaining the "affirmative keywords" and "negative keywords" in advance for the keyword KW given by the dialogue management means 112, after a word W has been recognized by voice, W can be regarded as an "affirmative keyword" if W is identical to KW, and as a "negative keyword" if the similarity between W and KW is at or above a predetermined threshold.

【００６６】また、「レスポンスウィンドウ」は、一対
の対話に対して一つだけ存在する場合について説明した
が、システム側の発話にキーワードＫＷが複数ある場合
などは複数のレスポンスウィンドウを設定することもで
きる。この場合、レスポンスウィンドウｉは、キーワー
ドＫＷｉの発話開始時刻からＫＷ（ｉ＋１）の発話開始
時刻＋ＷＴまでの時間となる。ただし、最後のレスポン
スウィンドウの終了時刻は、システムの発話自体の終了
時刻＋ＷＴとなる。これを、キーワードＫＷが、ＫＷ
１，ＫＷ２，ＫＷ３の三つある場合で説明すると、図９
に示すように、ＫＷ１の発話開始時刻からＫＷ２の発話
開始時刻＋ＷＴまでの時間を「レスポンスウィンドウ
１」、ＫＷ２の発話開始時刻からＫＷ３の発話開始時刻
＋ＷＴまでの時間を「レスポンスウィンドウ２」、ＫＷ
３の発話開始時刻からシステムの発話自体の終了時刻＋
ＷＴまでの時間を「レスポンスウィンドウ３」に設定す
れば良い。この場合、各レスポンスウィンドウ間で重複
している時間が生じるが、これによるあいまい性は、シ
ステムからもう一度聞き直す等をして確認を取るように
すれば良い。The "response window" was described above for the case where there is only one per dialogue exchange, but several response windows can be set, for example when the system's utterance contains several keywords KW. In this case, response window i runs from the utterance start time of keyword KWi to the utterance start time of KW(i+1) + WT. The last response window, however, ends at the end time of the system utterance itself + WT. Taking the case of three keywords KW1, KW2, and KW3, as shown in FIG. 9, "response window 1" runs from the utterance start time of KW1 to the utterance start time of KW2 + WT, "response window 2" from the utterance start time of KW2 to the utterance start time of KW3 + WT, and "response window 3" from the utterance start time of KW3 to the end time of the system utterance itself + WT. The response windows then overlap in time; the resulting ambiguity can be resolved by, for example, having the system ask again to confirm.
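The window boundaries for several keywords follow directly from the rule above. The sketch below assumes times are given in seconds as floats; the function name and the example times are hypothetical.

```python
def response_windows(keyword_starts, utterance_end, wt):
    """Return one (start, end) pair per keyword KWi.

    Window i starts when KWi starts and ends WT after the start of
    KW(i+1); the last window ends WT after the system utterance ends.
    """
    windows = []
    for i, start in enumerate(keyword_starts):
        if i + 1 < len(keyword_starts):
            end = keyword_starts[i + 1] + wt
        else:
            end = utterance_end + wt
        windows.append((start, end))
    return windows


# KW1, KW2, KW3 start at 1.0 s, 2.5 s, 4.0 s; the utterance ends at 5.0 s.
windows = response_windows([1.0, 2.5, 4.0], utterance_end=5.0, wt=1.0)
print(windows)
```

With these example times, window 1 covers 1.0 to 3.5 s and window 2 covers 2.5 to 5.0 s, so they overlap between 2.5 and 3.5 s; a reaction falling in that interval is the ambiguous case for which the description suggests asking again.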

【００６７】[0067]

【発明の効果】請求項１に記載の統合認識装置によれ
ば、統合処理手段においてユーザの発話意図の識別が行
われ、その識別結果が対話管理手段に渡され、対話管理
手段により新たな状態に遷移され、出力手段によってつ
ぎに発話される内容が決定されるので、ユーザはあたか
も人間と対話をするかのような感覚で自然な対話を行う
ことができる。According to the integrated recognition device of claim 1, the integrated processing means identifies the user's utterance intention and passes the result to the dialogue management means, the dialogue management means transitions to a new state, and the content to be uttered next through the output means is determined, so the user can hold a natural dialogue that feels like talking with a human.

【００６８】請求項２に記載の統合認識対話装置によれ
ば、統合処理手段において人間同士の対話における自然
な間に合わせてレスポンスウインドウが設定されるの
で、ユーザは気持ちの良い対話を行うことができる。According to the integrated recognition dialogue device of claim 2, the integrated processing means sets the response window to match the natural pauses of human-to-human dialogue, so the user can have a comfortable dialogue.

【００６９】請求項３に記載の統合認識対話装置によれ
ば、統合処理手段において選択肢がユーザに示される場
合には、各選択肢に対応したキーワード群が設定される
ので、このキーワード群によりユーザの意図の判断が確
実に行われる。According to the integrated recognition dialogue apparatus of the third aspect, when the integrated processing means shows the options to the user, a keyword group corresponding to each option is set. The judgment of intention is surely made.

【００７０】請求項４に記載の統合認識対話装置によれ
ば、統合処理手段においてユーザの肯定／否定の意図の
判定が行なわれる場合には、「肯定キーワード」と「否
定キーワード」が設定されるので、これらのキーワード
によりユーザの肯定及び否定の意図の判断が確実に行わ
れる。According to the integrated recognition dialogue device of the fourth aspect, when the integrated processing means determines the positive / negative intention of the user, the "affirmative keyword" and the "negative keyword" are set. Therefore, these keywords ensure the determination of the user's affirmative and negative intentions.

【００７１】請求項５に記載の統合認識対話装置によれ
ば、文脈情報取得手段が、文書データベース中の単語間
の共起関係をもとにして作成した特徴ベクトルを使用す
るので、ユーザやシステムの使用状況にあった文書デー
タベースを用意しておくことで、特定の状況で使用され
るシステムやユーザの発話の癖に対応することができ
る。According to the integrated recognition dialogue apparatus of the fifth aspect, the context information acquisition means uses the feature vector created based on the co-occurrence relation between words in the document database. It is possible to deal with the habit of the system or user's utterances used in a specific situation by preparing a document database suitable for the situation of use.

【００７２】請求項６に記載の統合認識対話装置によれ
ば、統合処理手段がユーザの意図の認識に学習データを
使用するので、データベースの中にあるような対話であ
れば、どのような対話に対しても対応できる。According to the integrated recognition dialogue device of claim 6, the integrated processing means uses training data to recognize the user's intention, so it can handle any kind of dialogue, provided the dialogue is of a kind found in the database.

【００７３】請求項７に記載の統合認識対話装置によれ
ば、統合処理手段がユーザの意図を認識した後、その認
識結果を学習データに追加するので、統合認識をユーザ
に対応させていくことができる。According to the integrated recognition dialogue device of claim 7, after the integrated processing means recognizes the user's intention, it adds the recognition result to the training data, so integrated recognition can be adapted to the user over time.

【００７４】請求項８に記載の統合認識装置によれば、
ユーザの発話の後半で出現したユーザの顔の縦振り動作
を無視するので、ユーザ自身の発話に対してのうなづき
であることが多い顔の縦振り動作に起因する誤識別を防
ぐことができる。According to the integrated recognition device of claim 8,
Since the vertical swing motion of the user's face that appears in the latter half of the user's utterance is ignored, it is possible to prevent erroneous identification due to the vertical swing motion of the face, which is often a nod to the user's own utterance.

[Brief description of drawings]

FIG. 1 is a block diagram showing the basic configuration of the integrated recognition dialogue device of the present invention.

FIG. 2 is a diagram explaining the generation of word feature vectors in the present invention.

FIG. 3 is a diagram explaining the generation of word feature vectors in the present invention.

FIG. 4A is a diagram explaining the response window of the present invention.

FIG. 4B is a diagram explaining the shortening of the response window of the present invention.

FIG. 4C is a diagram explaining the extension of the response window of the present invention.

FIG. 5 is a block diagram showing the configuration when the integrated recognition dialogue device of the present invention is applied to a "product introduction system".

FIG. 6A is a diagram showing an example of a dialogue in the product introduction system.

FIG. 6B is a diagram showing an example of a dialogue in the product introduction system.

FIG. 6C is a diagram showing an example of a dialogue in the product introduction system.

FIG. 6D is a diagram showing an example of a dialogue in the product introduction system.

FIG. 7 is a diagram showing the multimodal dialogue database.

FIG. 8 is a diagram showing the state transitions performed by the dialogue management means.

FIG. 9 is a diagram showing a plurality of response windows.

[Explanation of symbols]

105 recognition means
106 recognition means
107 recognition means
108 recognition means
109 time acquisition means
110 integrated processing means
111 context information acquisition means
112 dialogue management means
113 output means

Continuation of the front page: (51) Int. Cl.⁶ G06T 1/00; FI G06F 15/62 380

Claims

[Claims]

1. An integrated recognition dialogue device comprising: time acquisition means for outputting time information; a plurality of recognition means for respectively recognizing input data including a user's voice signal, face movement, line of sight, body movement, and the like; context information acquisition means for outputting context information for identifying words from the voice signal; integrated processing means for recognizing the user's intention by integrally processing the time information, the context information, and the recognition results output in parallel by the plurality of recognition means; dialogue management means for advancing a dialogue based on the user's intention recognized by the integrated processing means; and output means for outputting to the user the output data passed from the dialogue management means.

2. The integrated recognition dialogue device according to claim 1, wherein the integrated processing means sets a response window, which is a period in which the user's intention is recognized from the input data, based on information from the dialogue management means and information from the plurality of recognition means.

3. The integrated recognition dialogue device according to claim …, wherein the integrated processing means sets a predetermined number of keyword groups for recognizing the voice signal, based on information from the dialogue management means and information from the context information acquisition means.

4. The integrated recognition dialogue device according to claim 3, wherein the keyword groups include an “affirmative keyword group” indicating that the user intends to affirm and a “negative keyword group” indicating that the user intends to negate.

5. The integrated recognition dialogue device according to claim 1, wherein the context information acquisition means uses the similarity between feature vectors created based on co-occurrence relations between words in a predetermined document database.

6. The integrated recognition dialogue device according to claim 1, wherein said integrated processing means uses data of a predetermined dialogue database as learning data for recognition of a user's intention.

7. The integrated recognition dialogue device according to claim 6, wherein the integrated processing means, after recognizing the user's intention, adds the recognition result to the dialogue database as learning data.

8. The integrated recognition dialogue device according to claim 6, wherein the integrated processing means ignores a vertical swing of the user's face that appears in the latter half of the user's utterance.
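
As an illustration of the feature-vector similarity recited in claim 5, the following is a minimal sketch and not the method of the specification. It assumes that each entry of the document database is a whitespace-separated string, that two words co-occur when they appear in the same entry, and that similarity is measured by the cosine of the two co-occurrence count vectors. The function names `build_feature_vectors` and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List


def build_feature_vectors(documents: List[str]) -> Dict[str, Counter]:
    """Build one co-occurrence count vector per word.

    Assumption: two words co-occur if they appear in the same document entry.
    """
    vectors: Dict[str, Counter] = defaultdict(Counter)
    for doc in documents:
        words = set(doc.split())
        for w in words:
            for other in words:
                if other != w:
                    vectors[w][other] += 1
    return vectors


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy document database (made up for illustration only).
documents = [
    "tv screen large",
    "television screen large",
    "camera lens small",
]
vectors = build_feature_vectors(documents)
sim_tv_television = cosine_similarity(vectors["tv"], vectors["television"])
sim_tv_camera = cosine_similarity(vectors["tv"], vectors["camera"])
```

In this toy database "tv" and "television" share the same co-occurring words and therefore receive a high similarity, whereas "tv" and "camera" share none. A context information acquisition means could use such a score to prefer the word candidate closest to the current topic, although the exact weighting used in the specification may differ.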