JPH09114634A

JPH09114634A - Multi-modal information integrated analysis device

Info

Publication number: JPH09114634A
Application number: JP7267000A
Authority: JP
Inventors: Takeshi Mizunashi; 豪水梨; Rooken Kimu Kiyunho; キュンホ・ローケン・キム; Mutsuko Tomokiyo; 睦子友清; Takuma Morimoto; 逞森元
Original assignee: ATR ONSEI HONYAKU TSUSHIN KENKYUSHO KK; ATR Interpreting Telecommunications Research Laboratories
Current assignee: ATR ONSEI HONYAKU TSUSHIN KENKYUSHO KK; ATR Interpreting Telecommunications Research Laboratories
Priority date: 1995-10-16
Filing date: 1995-10-16
Publication date: 1997-05-02
Anticipated expiration: 2015-10-16
Also published as: JP2993872B2

Abstract

PROBLEM TO BE SOLVED: To analyze, in an integrated manner, the speech uttered by a person and that person's gestures, and to output the analysis result. SOLUTION: A speech recognition unit 11 recognizes the uttered speech and outputs the recognition result together with its corresponding time, and a language analysis unit 12 performs language analysis using linguistic knowledge and outputs the semantic structure of the recognition result together with its corresponding time. A GUI control unit 13 outputs the on-screen position of the trajectory of an input gesture together with its corresponding time. A gesture analysis unit 14 analyzes the information from the GUI control unit 13 using knowledge about a diagram containing a plurality of referent candidates, and outputs the type of the gesture, its corresponding time, and the referent, that is, the referent candidate indicated by the gesture. An integrated analysis unit 15 detects the temporal relationship between the retrieved word or phrase corresponding to the gesture and the referent, and generates a semantic structure that integrates the semantic structure of the recognition result with the semantic structure of the gesture type.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a multimodal information integrated analysis device that performs an integrated analysis of input human speech and input human gestures and outputs the analysis result.

[0002]

[Description of the Related Art] FIG. 16 shows a conventional multimodal dialogue system disclosed in the prior-art document: Tsuneo Nitta et al., "A Study of a Multimodal Dialogue System Using Free-Speech Voice Input and Direct Pointing," IEICE Technical Report SP92-120, January 1993. That document describes a prototype real-time dialogue system in which both the input means (free speech plus direct pointing by touch) and the output means (synthesized spoken responses plus graphics) are multi-channel. While monitoring the user's situation with several sensors, the system gives instructions, gives guidance, and changes screens according to a "plan" formed for an "expectation" about the user. The document also reports the results of examining the system's application to an information guidance system.

[0003] As shown in FIG. 16, this conventional multimodal dialogue system includes a user situation detection unit 101, a word spotter 102, an input management unit 103, an "expectation" generation unit 104 for user actions, a plan generation unit 105, a response strategy generation unit 106, a sentence-to-speech conversion unit 107, and a task model 110 stored in memory. Three sensors S1, S2, and S3 are connected to the user situation detection unit 101, and the input management unit 103 is connected to a touch panel T1 and to the word spotter 102. A CRT display C1 is connected to the response strategy generation unit 106, and an external speaker H1 and a handset H2 are connected to the sentence-to-speech conversion unit 107.

[0004] The operation of the conventional multimodal dialogue system configured as described above is as follows.
(1) When a user comes in front of the system, sensor S1 detects this. The system then displays on the CRT display C1 the message "Please hold the handset to your ear" and a screen showing how to hold the handset H2. At the same time, the instruction "Please hold the handset to your ear" is output as synthesized speech from the external speaker H1.
(2) When the user picks up the handset H2, sensor S2 detects this and an "utterance guidance screen" is displayed.
(3) When the user then holds the handset H2 to the ear, sensor S3 detects this and the instruction "Please say the place you want" is displayed. At the same time, the audio output is switched from the external speaker H1 to the speaker built into the handset H2, and the same content is given as voice guidance.
(4) When the user utters words naming a guidance target, for example "I would like to go to a department store," several department stores around Tokyo Station are displayed on the map.
(5) When the user touches the displayed name of an individual department store with a finger, information about that store (for example, "It is closed today") is given as synthesized speech.

[0005]

[Problems to Be Solved by the Invention] In the conventional multimodal dialogue system described above, the speech input section uses a word-spotting speech recognizer that recognizes only synthesized sounds, and no language analyzer is used, so the overall meaning of the user's utterance cannot be interpreted in detail. As for gestures, the system accepts only the action of touching graphics on the screen to select them. That is, it can recognize which item has been selected, but it cannot interpret the meaning of more complex human gestures such as "circling" something. Moreover, the conventional system cannot analyze the integrated meaning of combinations of diverse utterances and complex gestures, such as those actually used out on city streets.

[0006] An object of the present invention is to solve the above problems and to provide a multimodal information integrated analysis device that can analyze human speech and human gestures in an integrated manner and output the analysis result.

[0007]

[Means for Solving the Problems] The multimodal information integrated analysis device according to claim 1 of the present invention comprises:
timekeeping means for outputting time information indicating the time elapsed from a predetermined reference time;
speech recognition means for recognizing input uttered speech on the basis of the time information output from the timekeeping means, and outputting the speech recognition result together with the time information corresponding to it;
language analysis means for performing language analysis, using knowledge about a predetermined language, on the basis of the speech recognition result and the corresponding time information output from the speech recognition means, and outputting the semantic structure of the speech recognition result together with the time information corresponding to it;
input means for displaying on a screen a diagram containing a plurality of referent candidates and for inputting a human gesture on the displayed screen;
interface control means for outputting, on the basis of the time information output from the timekeeping means, the on-screen position of the trajectory of the gesture input through the input means together with the time information corresponding to it;
gesture analysis means for analyzing the on-screen position of the gesture trajectory output from the interface control means, using knowledge about the diagram containing the plurality of referent candidates, and outputting the type of the gesture, the time information corresponding to it, and information on the referent, that is, the referent candidate indicated by the gesture among the plurality of referent candidates; and
integrated analysis means for, on the basis of the semantic structure of the speech recognition result and the corresponding time information output from the language analysis means, and the gesture type, the corresponding time information, and the referent information output from the gesture analysis means, retrieving from the semantic structure of the speech recognition result a word or phrase corresponding to the gesture, detecting the temporal relationship between the retrieved word or phrase and the referent information, and, on the basis of the detected temporal relationship, generating and outputting a semantic structure in which the semantic structure of the speech recognition result and the semantic structure of the gesture type are integrated.
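The integrated analysis step of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Word` and `Gesture` records, the `is_deictic` flag, and the choice of "largest time overlap" as the temporal relationship are all assumptions made for illustration. The source says only that a temporal relationship between the gesture-related word and the referent is detected.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    text: str
    is_deictic: bool  # assumed flag: True for words that can pair with a gesture
    t_start: int      # milliseconds from the start of the turn
    t_end: int


@dataclass
class Gesture:
    kind: str         # gesture type, e.g. "CIRCLING"
    referent: str     # referent already chosen by gesture analysis
    t_start: int
    t_end: int


def overlap_ms(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Length of the intersection of two time intervals (0 if disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def integrate(words: List[Word], gesture: Gesture) -> Optional[dict]:
    """Pair the gesture with the deictic word it overlaps most in time.

    Assumed rule for this sketch only; returns None when the utterance
    contains no deictic word.
    """
    deictics = [w for w in words if w.is_deictic]
    if not deictics:
        return None
    best = max(
        deictics,
        key=lambda w: overlap_ms(w.t_start, w.t_end, gesture.t_start, gesture.t_end),
    )
    return {
        "deictic": best.text,
        "gesture": gesture.kind,
        "referent": gesture.referent,
        "overlap_ms": overlap_ms(best.t_start, best.t_end, gesture.t_start, gesture.t_end),
    }
```

A real system would merge full semantic structures rather than return a flat dictionary; the flat result here only makes the pairing visible.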

[0008] The multimodal information integrated analysis device according to claim 2 is the device according to claim 1, wherein the word corresponding to the gesture is a demonstrative. The multimodal information integrated analysis device according to claim 3 is the device according to claim 1 or 2, wherein the gesture types analyzed by the gesture analysis means include a "circling" gesture, a "line-drawing" gesture, a "dotting" gesture, a "marking" gesture, and a "scrambling" gesture, which is a drawing with random movement.

[0009] Furthermore, the multimodal information integrated analysis device according to claim 4 is the device according to claim 1, 2, or 3, wherein the gesture analysis means divides a rectangle enclosing the gesture trajectory into a plurality of regions by a plurality of lines passing through the center of the rectangle, and determines the gesture type on the basis of the relationship between the divided regions and the gesture trajectory.
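The region-based classification of claim 4 might be sketched as follows. The text here states only that the enclosing rectangle is divided by lines through its center and that the gesture type is judged from the relationship between the regions and the trajectory. The specific choices below (two axis-aligned dividing lines, a count of visited regions, a 2-pixel size threshold, and the three returned labels) are hypothetical and serve only to make the idea concrete.

```python
from typing import List, Set, Tuple

Point = Tuple[int, int]


def visited_regions(points: List[Point]) -> Set[Tuple[bool, bool]]:
    """Return the quadrants of the bounding rectangle touched by the trajectory.

    The rectangle is split by a vertical and a horizontal line through its
    center; each quadrant is identified by (right-half?, upper-half?).
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    cx = (min(xs) + max(xs)) / 2
    cy = (min(ys) + max(ys)) / 2
    return {(x >= cx, y >= cy) for x, y in points}


def classify(points: List[Point]) -> str:
    """Assumed rule: tiny extent -> DOT, all four quadrants -> CIRCLING, else LINE."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    if max(xs) - min(xs) <= 2 and max(ys) - min(ys) <= 2:
        return "DOT"
    if len(visited_regions(points)) == 4:
        return "CIRCLING"
    return "LINE"
```

For example, eight points sampled around a circle touch all four quadrants, whereas a diagonal stroke touches only two.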

[0010]

[Embodiments of the Invention] An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram of a multimodal information integrated analysis device according to one embodiment of the present invention. The device of this embodiment is, for example, a route guidance system that uses a map. The description assumes a scene in which a map of the area around Kyoto Station is displayed on the CRT display 33 and the user says "Is Kyoto Station here?" into the microphone 31 while making a pointing gesture with a finger, for example circling Kyoto Station. The device receives as input the information of a gesture made by drawing a line or the like over a figure or picture on the screen of the CRT display 33, together with the information of the speech uttered at the same time. It analyzes the semantic structure of the input speech and the semantic structure of the gesture separately, using the knowledge data about the objects displayed on the screen of the CRT display 33, stored in advance in the map database 24, and the linguistic knowledge stored in advance in respective memories, such as the hidden Markov network (hereinafter, HM network) 21, the context-free grammar 22, and the word dictionary 23. It then analyzes the two semantic structures in an integrated manner along the time axis and outputs the analysis result.

[0011] As shown in FIG. 1, the multimodal information integrated analysis device of this embodiment includes, as processing units that execute the various control processes, a monitoring control unit 10, a speech recognition unit 11, a language analysis unit 12, a graphical user interface control unit (hereinafter, GUI control unit) 13, a gesture analysis unit 14, and an integrated analysis unit 15 (hereinafter referred to collectively as the processing units 10-15). The speech recognition unit 11, the language analysis unit 12, the GUI control unit 13, the gesture analysis unit 14, and the integrated analysis unit 15 are connected to the monitoring control unit 10, which monitors and controls the processing units 11-15 as a whole. A clock signal generated by a clock signal generator 30, whose setting and resetting are controlled by the monitoring control unit 10, is input to each of the processing units 11-14. From this clock signal, each of the processing units 11-14 calculates the clock time elapsed since the start button 32a was turned on, that is, since the onset time (expressed in milliseconds in this embodiment). This clock time serves as the time reference for the multimodal information of the device.

[0012] As input devices for the user, the device has a microphone 31; a keyboard 32 with a start button 32a, a stop button 32b, and a quit button 32c; a CRT display 33 whose screen is a touch panel; and a mouse 34. The microphone 31 is connected to the speech recognition unit 11, the keyboard 32 is connected to the monitoring control unit 10, and the CRT display 33 and the mouse 34 are connected to the GUI control unit 13. As an output device, a CRT display 35 is provided and connected to the integrated analysis unit 15.

[0013] The HM network 21 and the context-free grammar 22 are connected to the speech recognition unit 11, and the context-free grammar 22 and the word dictionary 23 are connected to the language analysis unit 12. The map database 24 and the gesture dictionary 25 are connected to the gesture analysis unit 14, and the gesture dictionary 25 is also connected to the integrated analysis unit 15.

[0014] The monitoring control unit 10, the speech recognition unit 11, the language analysis unit 12, the GUI control unit 13, the gesture analysis unit 14, and the integrated analysis unit 15 are each implemented, for example, as a digital computer. Each of the processing units 10-15 has a CPU, a ROM that stores the operating program and the data needed to execute it, and a RAM used as working memory. The six processing units 10-15 may also be implemented on a single digital computer. The HM network 21, the context-free grammar 22, the word dictionary 23, the map database 24, and the gesture dictionary 25 are stored in a memory such as a hard disk.

[0015] First, the databases connected to the processing units 11-15 are described. The HM network 21, the context-free grammar 22, and the word dictionary 23 are databases of linguistic knowledge for speech recognition and language analysis. The map database 24 is a database of knowledge about the objects displayed on the screen of the CRT display 33, that is, the referents that can be pointed to. The gesture dictionary 25 is a database of knowledge for identifying the type of gesture that the user makes on the screen of the CRT display 33 (which operates as a touch panel).

[0016] The word dictionary 23 contains 43 words related to the route guidance task. The words and their attributes are represented as feature structures, augmented with temporal information for capturing acoustic information and with spatial information about gestures. Table 1 shows an example from the word dictionary 23.

[0017]

[Table 1] Feature structure of a deictic expression in the word dictionary 23
───────────────────────────────────
(deflex-named このあたり-1 このあたり n-deictic
 !(lex-phon-orth "konoatari" "このあたり")
 (<!m sem> == [[RELN DEITIC-PLACE]
               [AGEN *SPEAKER*]
               [RECP *HEARER*]
               [OBJE [[RELN このあたり]]]])
 (<!m time-stamp> == [[SPEECH [[tS ?X1] [tE ?X2]]]])
 (<!m gesture> == [[RELN CIRCLING-3]
                   [LOCATION [[lS [[X ?X][Y ?Y]]]
                              [lE [[X ?X][Y ?Y]]]]]
                   [TIME-STAMP [[mouse [[tS ?X1]
                                        [tE ?X2]]]]]])
 (<!m prag> == [[iterr agen]]))
───────────────────────────────────

[0018] The contents of Table 1 are as follows. The table is the definition of the demonstrative "このあたり" (konoatari, "around here") in the dictionary used for language analysis.
Line 1, "(deflex-named このあたり-1 このあたり n-deictic", defines the word "このあたり-1", which has the index "このあたり" and the part of speech n-deictic.
Line 2, "!(lex-phon-orth "konoatari" "このあたり")", states that the pronunciation and the written form are "konoatari" and "このあたり", respectively.
Lines 3 to 6 ([Equation 1], the "(<!m sem> == ...)" entry of Table 1) define the semantic (sem) attribute: the relation name DEITIC-PLACE, the agent *SPEAKER*, the recipient *HEARER*, and an object with the relation "このあたり".
Line 7, "(<!m time-stamp> == [[SPEECH [[tS ?X1] [tE ?X2]]]])", defines the time information (time-stamp) attribute: the start time and end time of the utterance.
Lines 8 to 12 ([Equation 2], "(<!m gesture> == [[RELN CIRCLING-3] [LOCATION [[lS [[X ?X][Y ?Y]]] [lE [[X ?X][Y ?Y]]]]] [TIME-STAMP [[mouse [[tS ?X1] [tE ?X2]]]]]])") define the gesture information (gesture) attribute: the relation name CIRCLING-3, the location on the display where the gesture was made, the means by which the gesture was made, and its start and end times.
Line 13, "(<!m prag> == [[iterr agen]]))", defines the pragmatic (prag) attribute: the holder of the information.
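As an illustration of how such a feature structure could be held in a program, the sketch below mirrors the entry of Table 1 as a nested Python dictionary and fills in one time-stamp slot. The dictionary layout and the helper `with_value` are assumptions for illustration; the source does not describe its internal representation. The variables (`?X1`, `?X`, and so on) are kept as placeholder strings, and slots are addressed by path, so that filling the speech start time does not touch the gesture time stamp that happens to use the same variable name.

```python
import copy
from typing import Any, List

# Nested-dictionary rendering of the Table 1 entry (assumed representation).
KONOATARI_1 = {
    "sem": {
        "RELN": "DEITIC-PLACE",
        "AGEN": "*SPEAKER*",
        "RECP": "*HEARER*",
        "OBJE": {"RELN": "このあたり"},
    },
    "time-stamp": {"SPEECH": {"tS": "?X1", "tE": "?X2"}},
    "gesture": {
        "RELN": "CIRCLING-3",
        "LOCATION": {
            "lS": {"X": "?X", "Y": "?Y"},
            "lE": {"X": "?X", "Y": "?Y"},
        },
        "TIME-STAMP": {"mouse": {"tS": "?X1", "tE": "?X2"}},
    },
    "prag": {"iterr": "agen"},
}


def with_value(fs: dict, path: List[str], value: Any) -> dict:
    """Return a copy of the feature structure with one slot filled in.

    `path` is the list of attribute names leading to the slot; the
    original structure is left unchanged.
    """
    result = copy.deepcopy(fs)
    node = result
    for key in path[:-1]:
        node = node[key]
    node[path[-1]] = value
    return result
```

For instance, `with_value(KONOATARI_1, ["time-stamp", "SPEECH", "tS"], 700)` records that the word started 700 ms into the turn.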

[0019] The gesture dictionary 25 of this embodiment contains only eight entries (headwords). The features in a gesture's feature structure are designed to capture the temporal and spatial information of the gesture. Table 2 shows an example from the gesture dictionary 25.

[0020]

[Table 2] Feature structure of a gesture in the gesture dictionary 25
───────────────────────────────────
(deflex-named CIRCLING-3 CIRCLING gesture
 (<!m sem> == [[RELN CIRCLING-3]
               [LOCATION [[lS [[X ?X][Y ?Y]]]
                          [lE [[X ?X][Y ?Y]]]]]
               [TIME-STAMP [[mouse [[tS ?X1]
                                    [tE ?X2]]]]]])
───────────────────────────────────

[0021] In Table 2, line 1, "(deflex-named CIRCLING-3 CIRCLING gesture", defines the gesture CIRCLING-3 ("circling" number 3), which has the index CIRCLING and the part of speech gesture. Lines 2 to 6 ([Equation 3]) are the same as in Table 1: as the gesture information attribute, they define the relation name CIRCLING-3, the location on the display where the gesture was made, the means by which the gesture was made, and its start and end times.

[0022] In the map database 24, each object on the map, that is, each referent candidate, is represented by a list of attributes. Table 3 shows an example from the map database 24.

[0023]

[Table 3] Representation of the map in the map database 24
───────────────────────────────────
[Object number][min X][min Y][max X][max Y][kind of object][name of object]
example: [1][56][145][70][178][hotel][kyoto-hotel]
───────────────────────────────────

[0024] In the example of Table 3, the referent candidate number is 1. On the screen of the CRT display 33, the minimum x coordinate of this referent candidate (min x) is 56, its minimum y coordinate (min y) is 145, its maximum x coordinate (max x) is 70, and its maximum y coordinate (max y) is 178. The kind of the referent candidate is "hotel" and its name is "Kyoto Hotel".
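A record in the format of Table 3 is easy to parse, and its bounding box can be compared with the extent of a gesture. The following Python sketch shows one way to do so. The `MapObject` type, the function names, and the simple rectangle-overlap test are assumptions for illustration; the source does not specify here how the gesture analysis unit 14 matches a trajectory against the map.

```python
import re
from typing import NamedTuple


class MapObject(NamedTuple):
    number: int
    min_x: int
    min_y: int
    max_x: int
    max_y: int
    kind: str
    name: str


def parse_record(line: str) -> MapObject:
    """Parse one '[n][min X][min Y][max X][max Y][kind][name]' record."""
    fields = re.findall(r"\[([^\]]*)\]", line)
    if len(fields) != 7:
        raise ValueError(f"expected 7 bracketed fields, got {len(fields)}")
    number, min_x, min_y, max_x, max_y = (int(f) for f in fields[:5])
    return MapObject(number, min_x, min_y, max_x, max_y, fields[5], fields[6])


def overlaps(obj: MapObject, gx0: int, gy0: int, gx1: int, gy1: int) -> bool:
    """True if the object's box intersects the box (gx0, gy0)-(gx1, gy1)."""
    return (
        obj.min_x <= gx1 and gx0 <= obj.max_x
        and obj.min_y <= gy1 and gy0 <= obj.max_y
    )
```

With the example record from Table 3, a circling gesture whose bounding box spans (50, 140) to (80, 180) would overlap the hotel's box, so the hotel would be a candidate referent under this assumed test.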

[0025] The HM network 21 connected to the speech recognition unit 11 uses an efficient representation of phoneme-context-dependent hidden Markov models and is expressed as a set of networks whose nodes are states. Each state holds the following information:
(a) state number
(b) acceptable context class
(c) lists of preceding and succeeding states
(d) parameters of the output probability density distribution
(e) self-transition probability and transition probabilities to succeeding states
The context-free grammar 22 contains 114 grammar rules used for both speech recognition and language analysis. The vocabulary size is 43 words, and the phoneme perplexity is 1.74.
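The five items (a) to (e) held by each state can be pictured as a small record type. The Python sketch below is an assumed layout, not the format used by the HM network 21: the field names, the string encoding of the context class, and the mean/variance dictionary for the output density are all illustrative. The check that the outgoing transition probabilities sum to 1 reflects the usual property of hidden Markov models rather than anything stated in the source.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class HMNetState:
    number: int                         # (a) state number
    context_class: str                  # (b) acceptable context class (assumed encoding)
    predecessors: List[int]             # (c) preceding states
    successors: List[int]               # (c) succeeding states
    output_pdf: Dict[str, List[float]]  # (d) density parameters, e.g. mean and variance
    self_transition: float              # (e) self-transition probability
    next_transitions: Dict[int, float]  # (e) probabilities of moving to each successor

    def transitions_normalized(self) -> bool:
        """Check that the outgoing probabilities, self-loop included, sum to 1."""
        total = self.self_transition + sum(self.next_transitions.values())
        return math.isclose(total, 1.0)
```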

[0026] The monitoring control unit 10 is a multi-function module that controls all the processing units 11-15 and governs the data flow. FIG. 2 shows the monitoring control process executed by the monitoring control unit 10, which is described below with reference to FIG. 2.

[0027] First, in step S1, it is determined whether the start button 32a has been turned on. If so, the process proceeds to step S2; if not, step S1 is repeated. In step S2, all the processing units 11-15 are initialized. At this point, the speech recognition unit 11 begins detecting speech input from the microphone 31, executes speech recognition, and outputs the recognition result to the monitoring control unit 10. Meanwhile, the GUI control unit 13 begins detecting gesture data entered by the user on the touch-panel screen of the CRT display 33 and gesture data entered by operating the mouse 34, and outputs the detected gesture data (specifically, the coordinates of each point of the gesture trajectory on the screen) to the monitoring control unit 10. At the same time, in step S3, the clock signal generator 30 is reset: the system clock time that it generates (hereinafter, the clock time) is reset to 0, and timing is then started. The system clock time is output from the clock signal generator 30 to the processing units 11-14. This starts a turn, which is one processing period.

[0028] In step S4, data transfer is executed. The monitoring control unit 10 transfers the speech recognition result data with clock time information, output from the speech recognition unit 11, to the language analysis unit 12, and transfers the gesture data with clock time information, output from the GUI control unit 13, to the gesture analysis unit 14. The language analysis unit 12 then executes the language analysis described later on the speech recognition result data with clock time information, and outputs the result, data containing the semantic structure of the speech with clock time information, to the monitoring control unit 10. Likewise, the gesture analysis unit 14 analyzes the gesture type, as described later, from the gesture data with clock time information, and outputs the result, data containing the semantic structure of the gesture with clock time information, to the monitoring control unit 10.

[0029] Next, in step S5, it is determined whether the stop button 32b has been turned on. If not, the turn is still in progress, so step S5 is repeated. If so, the process proceeds to step S6, where the processing units 11-14 are notified that the turn has ended. In step S7, the clock signal generator 30 stops timing. In step S8, data transfer is executed: the monitoring control unit 10 outputs to the integrated analysis unit 15 the data containing the semantic structure of the speech with clock time information from the language analysis unit 12 and the data containing the semantic structure of the gesture with clock time information from the gesture analysis unit 14. In step S9, the integrated analysis unit 15 is made to execute the integrated analysis described later and to output the analysis result to the CRT display 35 for display. In step S10, it is determined whether the quit button 32c has been turned on. If so, the monitoring control process ends; if not, the process returns to step S1 and the above processing is repeated.

【００３０】One of the most important processes of the monitoring control unit 10 is the "event collection" in step S8. That is, as shown in FIG. 4, all the surrounding events (speech, gestures, etc.) that occur in one turn (between the "onset time" and the "offset time") are collected and passed to the integrated analysis unit 15. In the data of the semantic structure of the speech, a start time and an end time are assigned to each word; in the data of the semantic structure of the gesture, a start time and an end time are assigned to each gesture. Here, the onset time is the start time of one turn, that is, the time when the user turns on the start button 32a. The offset time is the end time of the turn, that is, the time when the user turns on the stop button 32b.

【００３１】The speech recognition unit 11 employs a prior-art continuous speech recognizer based on the phoneme-synchronous SSS-LR technology researched and developed at ATR Interpreting Telecommunications Research Laboratories (see, for example, the prior-art document: Harald Singer et al., "A Modular Speech Recognition System Architecture", Proceedings of Acoustic Society, Japan, Fall, 1994). The recognizer was developed with an emphasis on modularity so that new modules can be added easily.

【００３２】The user's speech is input to the microphone 31 and converted into a speech signal, after which feature extraction processing is executed. In this processing, the speech signal is A/D converted and then, for example, LPC analysis is performed to extract 34-dimensional feature parameters comprising log power, 16th-order cepstrum coefficients, Δ log power, and 16th-order Δ cepstrum coefficients. Next, phoneme matching processing and LR parser processing are executed on the extracted feature parameters. Here, the likelihood of the data in the phoneme matching interval is calculated using the HM network 21, which is a speaker-independent model, and this likelihood value is returned to the LR parser as a phoneme matching score. The LR parser processes the input phoneme prediction data from left to right, without backtracking, with reference to the context-free grammar 22. If there is syntactic ambiguity, the stack is split and all candidates are analyzed in parallel. The LR parser predicts the next phoneme based on the context-free grammar 22 and outputs phoneme prediction data; the phoneme matching processing performs matching with reference to the information in the HM network 21 corresponding to that phoneme and returns the likelihood to the LR parser as a speech recognition score. By concatenating phonemes in sequence in this way, continuous speech recognition is performed, and the speech recognition result data is output to the monitoring control unit 10 with clock time information. In this continuous speech recognition processing, when multiple phonemes are predicted, the existence of all of them is checked, and high-speed processing is achieved by beam-search pruning, which keeps only the partial trees with a high partial speech recognition likelihood.

【００３３】Most of the sentences recognized by the speech recognition unit 11 of this embodiment are short and simple, and they include examples of deictic expressions such as 「京都ホテルはこのあたりです」 ("Kyoto Hotel is around here"). Sentences can be uttered in either a continuous mode or a connected mode. The user is free to utter one sentence in one breath, or to place a pause between two phrases. For each word, the output from the speech recognition unit 11 consists of three elements: the recognized word, the start time, and the end time. Table 4 shows an example of the speech recognition result output from the speech recognition unit 11.

【００３４】[0034]

【表４】[Table 4] Speech recognition results
───────────────────────────────────
sentence: 京都ホテルはこのあたりですか ("Is Kyoto Hotel around here?")
recognition results:
1135                 : time elapsed since the turn "onset time"
京都ホテル 0 830      : speech onset & offset time
は 830 920
3842                 : time elapsed since the turn "onset time"
このあたり 0 780      : speech onset time reset due to the pause
で 780 860
す 860 1050
か 1050 1200
京都ホテルはこのあたりですか -32.115994
───────────────────────────────────

【００３５】The contents of Table 4 are as follows. The sentence of the speech recognition result is 「京都ホテルはこのあたりですか」 ("Is Kyoto Hotel around here?"). The start time of 「京都ホテル」 ("Kyoto Hotel") in this sentence (hereinafter, the first start time) is 1135 milliseconds after the onset time. Relative to the first start time, 「京都ホテル」 spans 0 to 830 milliseconds and 「は」 spans 830 to 920 milliseconds. The start time of 「このあたり」 ("around here") in the sentence (hereinafter, the second start time) is 3842 milliseconds after the onset time. Relative to the second start time, 「このあたり」 spans 0 to 780 milliseconds, 「で」 spans 780 to 860 milliseconds, 「す」 spans 860 to 1050 milliseconds, and 「か」 spans 1050 to 1200 milliseconds. The score of the recognized sentence 「京都ホテルはこのあたりですか」 is -32.115994.
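The relative word times of Table 4 can be converted to times measured from the turn onset by adding each segment's start offset. The following is a minimal sketch of that arithmetic; the `segments` layout and the function name `to_turn_times` are illustrative assumptions, not part of the described device.

```python
# Table 4 as segments: (segment start in ms from the turn onset, [(word, start, end), ...]).
# Word times are in ms relative to the start of their own segment.
segments = [
    (1135, [("京都ホテル", 0, 830), ("は", 830, 920)]),
    (3842, [("このあたり", 0, 780), ("で", 780, 860),
            ("す", 860, 1050), ("か", 1050, 1200)]),
]


def to_turn_times(segments):
    """Return (word, start, end) tuples with times measured from the turn onset."""
    result = []
    for base, words in segments:
        for word, start, end in words:
            result.append((word, base + start, base + end))
    return result


absolute = to_turn_times(segments)
for word, start, end in absolute:
    print(word, start, end)
```

Under this reading, 「このあたり」 occupies 3842 to 4622 milliseconds of the turn, which agrees with the tS and tE values of the SPEECH time stamp in Table 7.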

【００３６】The language analysis unit 12 was developed using a parsing (sentence analysis) toolkit (see the prior-art document: Toshihisa Tashiro et al., "A Parsing Toolkit for Spoken Language Processing", WGNL Meeting of IPSJ, 95-NLP-106, 1995). This parsing toolkit was developed with an emphasis on efficient unification and modularity in order to handle the phonemes of many words in instantaneous (spontaneous) speech. The input data to the language analysis unit 12 is the speech recognition result. On receiving it, the language analysis unit 12 first generates a parse tree using the grammar rules in the context-free grammar 22, then converts the tree into a dependency structure, and finally generates data that includes the semantic feature structure of the utterance (Table 5), that is, the semantic structure of the speech, together with the clock time information. The data is then handed to the integrated analysis unit 15 via the monitoring control unit 10. Table 5 shows an example of the output of the language analysis unit 12.

【００３７】[0037]

【表５】[Table 5] Output of the language analysis unit 12
───────────────────────────────────
sentence: 京都ホテルはこのあたりですか ("Is Kyoto Hotel around here?")
[SEM [[RELN *YN-QUESTION*]
      [AGEN *SPEAKER*]
      [RECP *HEARER*]
      [OBJE [[RELN *BE-LOCATED*]
             [IDEN [[RELN *京都ホテル*]]]
             [PLACE [[RELN *DEICTIC-PLACE*]
                     [AGEN *SPEAKER*]
                     [RECP *SPEAKER*]
                     [OBJE [[RELN *このあたり*]
                            [PRAG [[ITERR *SPEAKER*]]]
                            [TIME-STAMP [[SPEECH [[tS ?X1] [tE ?X2]]]]]
                            [GESTURE [[RELN CIRCLING-3]
                                      [LOCATION [[lS [[X ?X][Y ?Y]]]
                                                 [lE [[X ?X][Y ?Y]]]]]
                                      [TIME-STAMP [[mouse [[tS ?X1] [tE ?X2]]]]]
                                      ]]]]]]]]]]
───────────────────────────────────

【００３８】In Table 5, the first line shows that the input sentence is 「京都ホテルはこのあたりですか」 ("Is Kyoto Hotel around here?"), and "[SEM [[RELN *YN-QUESTION*]" on the second line means that the utterance intention of this sentence is *YN-QUESTION* (a question answered with yes or no). The third and fourth lines,

【数４】[Equation 4]
[AGEN *SPEAKER*]
[RECP *HEARER*]

define, as in Table 1, the agent *SPEAKER* and the recipient *HEARER*. "[OBJE [[RELN *BE-LOCATED*]" on the fifth line means that the content of the question is "a thing (IDEN) is located at a place (PLACE)". "[IDEN [[RELN *京都ホテル*]]]" on the sixth line means that the thing is "Kyoto Hotel". Further, the seventh to seventeenth lines,

【数５】[Equation 5]
[PLACE [[RELN *DEICTIC-PLACE*]
        [AGEN *SPEAKER*]
        [RECP *SPEAKER*]
        [OBJE [[RELN *このあたり*]
               [PRAG [[ITERR *SPEAKER*]]]
               [TIME-STAMP [[SPEECH [[tS ?X1] [tE ?X2]]]]]
               [GESTURE [[RELN CIRCLING-3]
                         [LOCATION [[lS [[X ?X][Y ?Y]]]
                                    [lE [[X ?X][Y ?Y]]]]]
                         [TIME-STAMP [[mouse [[tS ?X1] [tE ?X2]]]]]
                         ]]]]]]]]]]

indicate that the place is the one denoted by 「このあたり」 ("around here").

【００３９】The GUI control unit 13 manages the user interface by displaying, for example, the graphics screen shown in FIG. 6, and monitors screen events on the screen of the CRT display 33 (for example, gestures on the touch panel). Specifically, the GUI control unit 13 executes the following processing.
(a) As shown in the lower part of FIG. 6, it displays a map containing multiple referent candidates (buildings, stations, etc.) and other graphics.
(b) It reads the coordinate values corresponding to the trajectory of the user's gesture on the map.
(c) It detects events of the push buttons 32a, 32b, and 32c of the keyboard 32.
(d) As shown in the upper part of FIG. 6, it displays the integrated analysis result produced by the integrated analysis unit 15 (input from the integrated analysis unit 15 to the GUI control unit 13 via the monitoring control unit 10), namely the result of the temporal matching between the speech recognition result and the gesture analysis result. It also displays the unified semantic representation of the utterance and the gesture.

【００４０】The main processing of the gesture analysis unit 14 is as follows.
1) Recognizing the type of deictic gesture ("circling", "drawing a line", etc.);
2) selecting a referent candidate (target object); and
3) generating the temporal and spatial information of the gesture (shown, for example, in Table 6).
As shown in FIG. 3, the gesture analysis processing executed by the gesture analysis unit 14 consists of the gesture recognition processing of step S11 and the referent selection processing of step S12.

【００４１】In the gesture recognition processing of step S11, the following processing is executed. First, the points of the entire trajectory of one gesture (x, y coordinate values on the screen of the CRT display 33), input from the GUI control unit 13 via the monitoring control unit 10, are stored in memory. Next, as shown in FIG. 5, the minimum values (min x) and (min y) and the maximum values (max x) and (max y) of the x, y coordinate values of the stored trajectory points are calculated, and the center point O is found. Then, as shown in FIG. 5, taking the "circling" gesture 600 as an example, the region of the gesture 600, which lies within the rectangle defined by the minimum values (min x), (min y) and the maximum values (max x), (max y) of the trajectory points, is divided into eight regions A1 to A8, and the coordinate values belonging to each of the regions A1 to A8 are calculated.

【００４２】If trajectory points of the gesture 600 exist in all of the regions A1 to A8 and the Euclidean distance between the start point 601 and the end point 602 of the gesture 600 is smaller than 50 (the currently assigned design value), the gesture 600 is determined to be "circling". If multiple trajectory points exist in only one region, the gesture is determined to be "pointing" (indicating, or tapping a point). If no trajectory points exist in the regions A6 and A7 and the Euclidean distance between the start point 601 and the end point 602 of the gesture 600 is smaller than 3 (the currently assigned design value), the gesture is determined to be "marking". In all remaining cases, the gesture is determined to be "drawing a line".

【００４３】Next, in the referent selection processing of step S12, the following processing is executed. Here, referent candidates are the buildings and stations on the map on the CRT display 33, for example Kyoto Hotel and Kyoto Station.
(a) When the gesture is determined to be "circling", the referent closest to the center is selected from among all referent candidates lying inside or on the perimeter of the circle.
(b) When the gesture is determined to be "pointing", the referent located at the pointed target is selected.
(c) When the gesture is determined to be "drawing a line", the referent located on the trajectory is selected.
(d) When the gesture is determined to be "marking", the referent closest to the center is selected.
Table 6 shows an example of the temporal and spatial information of a gesture (also called the semantic structure of the gesture), which is the analysis result output from the gesture analysis unit 14.
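The selection rules (a) to (d) can be sketched as a nearest-candidate search. This is only an illustration under stated assumptions: the candidate dictionary, the function names, approximating "inside or on the circle" by the bounding rectangle of the trajectory, and treating "on the trajectory" as "nearest to any trajectory point" are all choices made here, not details given in the text.

```python
import math


def bounding_box(points):
    """Return (min_x, min_y, max_x, max_y) of the trajectory points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def select_referent(gesture_type, points, candidates):
    """Pick a referent name from candidates (hypothetical dict: name -> (x, y))."""
    min_x, min_y, max_x, max_y = bounding_box(points)
    center = ((min_x + max_x) / 2, (min_y + max_y) / 2)

    if gesture_type == "drawing a line":
        # Assumption: "on the trajectory" is read as nearest to any trajectory point.
        def distance(name):
            return min(math.dist(candidates[name], p) for p in points)
        pool = candidates
    else:
        if gesture_type == "circling":
            # Assumption: the circle is approximated by its bounding rectangle.
            pool = {n: p for n, p in candidates.items()
                    if min_x <= p[0] <= max_x and min_y <= p[1] <= max_y}
        else:  # "pointing" and "marking": nearest to the center O
            pool = candidates

        def distance(name):
            return math.dist(candidates[name], center)

    if not pool:
        return None
    return min(pool, key=distance)


# Hypothetical map positions; the gesture corners reuse the coordinates of Table 6.
candidates = {"Kyoto Hotel": (500, 550), "Kyoto Station": (100, 100)}
circle = [(128, 164), (897, 164), (897, 921), (128, 921)]
print(select_referent("circling", circle, candidates))
```

Returning `None` when no candidate qualifies is also an assumption; the text does not say what happens when a gesture indicates nothing.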

【００４４】[0044]

【表６】[Table 6] Temporal and spatial information of a gesture
───────────────────────────────────
3                     : turn I.D
circle                : gesture analysis result
3119                  : gesture onset time
4864                  : gesture offset time
(897,921) (128,164)   : object coordinates
───────────────────────────────────

【００４５】In Table 6, the first line is the ID number of the turn, and the second line indicates that the gesture was determined to be "circling". The third line means that the time of the start point of the gesture is 3119 milliseconds measured from the onset time, and the fourth line means that the time of the end point of the gesture is 4864 milliseconds measured from the onset time. The fifth line gives the coordinate values of the referent indicated by the gesture, specifically the pair (max x, max y) and (min x, min y).

【００４６】That is, the gesture analysis unit 14 recognizes the type of gesture from the coordinate information and, from that result and its knowledge of the diagrams and pictures on the screen of the CRT display 33, infers and determines the referent that the gesture indicates. In the case of the gesture in FIGS. 5 and 6, it can be determined from the positional relationship between the gesture's "circle" and the map on the screen that "Kyoto Hotel" is indicated. Finally, a semantic structure containing information on the type of gesture, its time, and the referent is generated.

【００４７】The integrated analysis unit 15 executes the following processing.
(a) It receives the semantic feature structure of the utterance, input from the language analysis unit 12 via the monitoring control unit 10, and the temporal and spatial information of the gesture, input from the gesture analysis unit 14 via the monitoring control unit 10.
(b) It searches the semantic feature structure of the utterance for deictic features (for example, the demonstrative 「ここ」 ("here")).
(c) It checks the temporal alignment between the deictic feature and the gesture. For example, as shown in FIG. 6, it examines how the utterance "here" and the "circling" gesture are aligned in time; specifically, for example, whether the time of the "circling" gesture is contained within the utterance time of "here". If it is, a direct referential relationship is judged to exist. A direct referential relationship is also judged to exist when "here" is uttered immediately after the "circling" gesture.
(d) It displays on the CRT display 35 or 33 the integrated analysis result, namely the deictic feature structure holding the temporal and spatial values of the gestures. Table 7 shows an example. Note that the alignment of deictic expressions and gestures proceeds from the beginning of the utterance and of the gestures; one gesture is assigned to one deictic expression, and any remaining gestures are ignored.
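The temporal check in (c) and the one-to-one assignment noted in (d) can be sketched as follows. This is a hedged illustration, not the device's actual procedure: the function names, treating any overlap of the two intervals as a direct relationship, and the `max_gap_ms` tolerance for "immediately after" are assumptions introduced here.

```python
def related(deictic, gesture, max_gap_ms=500):
    """Decide whether a deictic word and a gesture are directly related.

    Both arguments are (start, end) in ms from the turn onset.
    Assumptions: any overlap counts, and "immediately after" means the word
    starts at most max_gap_ms after the gesture ends (hypothetical value).
    """
    d_start, d_end = deictic
    g_start, g_end = gesture
    overlaps = d_start <= g_end and g_start <= d_end
    follows = 0 <= d_start - g_end <= max_gap_ms
    return overlaps or follows


def align(deictics, gestures, max_gap_ms=500):
    """Pair each deictic with at most one gesture, scanning both in time order."""
    pairs = []
    remaining = sorted(gestures)
    for d in sorted(deictics):
        for g in remaining:
            if related(d, g, max_gap_ms):
                pairs.append((d, g))
                remaining.remove(g)  # each gesture is used once
                break
    return pairs  # gestures left in `remaining` are ignored


# Values from Tables 4, 6 and 7: the word spans 3842-4622 ms, the gesture 3119-4864 ms.
print(align([(3842, 4622)], [(3119, 4864)]))
```

In the example from the tables the gesture interval encloses the utterance interval rather than the reverse, which is why the sketch tests for overlap instead of strict containment.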

【００４８】[0048]

【表７】[Table 7] Semantic representation of the utterance generated by the integrated analysis unit 15
───────────────────────────────────
[SEM [[RELN *YN-QUESTION*]
      [AGEN *SPEAKER*]
      [RECP *HEARER*]
      [OBJE [[RELN *BE-LOCATED*]
             [IDEN [[RELN *京都ホテル*]]]
             [PLACE [[RELN *DEICTIC-PLACE*]
                     [AGEN *SPEAKER*]
                     [RECP *SPEAKER*]
                     [OBJE [[RELN *このあたり*]
                            [PRAG [[ITERR *SPEAKER*]]]
                            [TIME-STAMP [[SPEECH [[tS 3842] [tE 4622]]]]]
                            [GESTURE [[RELN CIRCLING-3]
                                      [LOCATION [[lS [[X 897][Y 921]]]
                                                 [lE [[X 128][Y 164]]]]]
                                      [TIME-STAMP [[mouse [[tS 3119] [tE 4864]]]]
                                      ]]]]]]]]]]]
───────────────────────────────────

【００４９】Table 7 is Table 5 with concrete values filled in for the "?" variables. That is, the integrated analysis unit 15 receives the semantic structure of the speech and the semantic structure of the gesture; based on the time information of the speech and the gesture and on what the gesture indicates, it searches the semantic structure of the speech for the part corresponding to the gesture (the demonstrative "here"), attaches the semantic structure of the gesture there, and finally generates and outputs a semantic structure in which the meanings of the speech and the gesture are integrated. In summary, the integrated analysis unit 15 searches the semantic structure of the speech recognition result for a demonstrative, detects the temporal relationship between the retrieved demonstrative and the referent information, and, based on the detected temporal relationship, generates and outputs a semantic structure in which the semantic structure of the speech recognition result and the semantic structure of the gesture type are integrated.

【００５０】Modifications of the embodiment according to the present invention are described below.

【００５１】<First Modification> FIG. 7 is a flowchart showing the gesture analysis processing of a modification executed by the gesture analysis unit 14 of FIG. 1. The gesture analysis processing of this modification is broadly divided into the processing for recognizing the gesture, steps S21 to S27, and the referent selection processing of step S28. When the type of gesture is first determined in any of steps S22-S23 and S26-S28, which determine the gesture type, the control flow proceeds to step S29 at that point, although this is not shown in FIG. 7.

【００５２】In FIG. 7, preprocessing is first executed in step S21. Here, the x, y coordinate values of the whole trajectory of one gesture (hereinafter, gesture points), input from the GUI control unit 13 via the monitoring control unit 10, are stored in the memory of the gesture analysis unit 14. If multiple gesture points share the same x, y coordinate values, only one of them is stored and the rest are discarded.

【００５３】Next, in step S22, the pointing determination processing is executed. In this processing, first, as shown in FIG. 8, the minimum values (min x, min y) and the maximum values (max x, max y) of the x, y coordinate values are calculated from the x, y coordinate values of the trajectory stored in the memory, and one rectangle 500 enclosing all the gesture points (hereinafter, the minimum rectangle) is virtually drawn. Next, the density ratio DR defined by the following equation is calculated.

【００５４】[0054]

【数６】[Equation 6] DR = {(number of gesture points) / (area of the minimum rectangle 500)} × 100

【００５５】Here, the area is calculated in units of predetermined x, y coordinate values. If the density ratio DR is 90% or more, the input gesture is determined to be a gesture that points at a referent candidate, that is, a "pointing gesture". Also, if the number of gesture points is less than 5 and the density ratio is 10% or more, the input gesture is determined to be a "pointing gesture".
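The pointing test of step S22 can be sketched directly from Equation 6 and the two thresholds. As an assumption made here, the area of the minimum rectangle is counted in whole coordinate cells with both edges included, so that a single point has area 1 instead of 0; the function names are illustrative.

```python
def density_ratio(points):
    """Return (DR in percent, number of distinct gesture points)."""
    unique = set(points)  # step S21 keeps one point per coordinate pair
    xs = [x for x, _ in unique]
    ys = [y for _, y in unique]
    # Assumption: area measured in coordinate cells, edges included.
    width = max(xs) - min(xs) + 1
    height = max(ys) - min(ys) + 1
    return len(unique) / (width * height) * 100, len(unique)


def is_pointing(points):
    """Apply the two pointing rules: DR >= 90, or fewer than 5 points with DR >= 10."""
    dr, count = density_ratio(points)
    return dr >= 90 or (count < 5 and dr >= 10)


print(is_pointing([(10, 10), (10, 10), (11, 10)]))      # tight cluster
print(is_pointing([(i, i) for i in range(10)]))         # long diagonal stroke
```

With this cell-based area, a tap that produces two adjacent points fills its rectangle completely, while a long diagonal stroke covers only a small fraction of it.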

【００５６】Next, in step S23, the marking determination processing is executed. Here, as shown in FIG. 9, all the gesture points are connected, and the cosine value cos θ of the angle θ between each two adjacent connecting lines is calculated. As shown in FIG. 9, the angles θ between each pair of connecting lines are numbered in ascending order from the start point. When all of the following four conditions (the first to fourth conditions) hold, the input gesture is determined to be "marking".

【００５７】(a) As shown in FIGS. 10(a) and 10(b), a peak is defined as a point where the angle θ < 90° or the angle θ > 270°. The first condition is that an angle θ whose cosine value cos θ exceeds 0 (shown as θp in FIGS. 10(a) and 10(b)) exists, that is, that a peak exists.
(b) As shown in FIG. 11, x1 is defined as the length obtained by subtracting the minimum x coordinate (min x) from the x coordinate of the point 302, at which the line connected from the point 301 of the maximum x value (max x) meets the x-axis-direction side of the minimum rectangle 500, and x2 is defined as the length obtained by subtracting the minimum x coordinate (min x) from the maximum x coordinate (max x). The length ratio LR defined by the following equation is then calculated.

【００５８】[0058]

【数７】[Equation 7] LR = (x1 / x2) × 100 [%]

【００５９】The second condition is that the calculated length ratio LR is 70% or more.
(c) As shown in FIG. 12(a) or 12(b), the third condition is that the position of the peak corresponds to the lowermost part (bottom) or the uppermost part of the minimum rectangle 500 (indicated by 201 and 202).
(d) As shown in FIG. 13, the fourth condition is that the start point and the end point of the gesture are located in the region covering the uppermost 25% of the area on the side opposite the peak.
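The first two marking conditions lend themselves to a short sketch: the cosine at each interior gesture point follows from the dot product of the two connecting lines that meet there, and LR is the ratio of Equation 7. The function names are illustrative, and x1 and x2 are taken as given inputs because the construction of point 302 depends on FIG. 11, which is not reproduced here. Conditions (c) and (d) are not covered.

```python
import math


def vertex_cosines(points):
    """Return (index, cos θ) for each interior point of the connected gesture.

    θ is the angle at points[index] between the lines to its two neighbours.
    """
    cosines = []
    for i in range(1, len(points) - 1):
        ax, ay = points[i - 1]
        bx, by = points[i]
        cx, cy = points[i + 1]
        v1 = (ax - bx, ay - by)
        v2 = (cx - bx, cy - by)
        norm = math.hypot(*v1) * math.hypot(*v2)
        if norm == 0:
            continue  # coincident neighbours: angle undefined, skipped
        cosines.append((i, (v1[0] * v2[0] + v1[1] * v2[1]) / norm))
    return cosines


def has_peak(points):
    """First condition: some angle has cos θ > 0 (θ < 90° or θ > 270°)."""
    return any(c > 0 for _, c in vertex_cosines(points))


def length_ratio(x1, x2):
    """Equation 7: LR = (x1 / x2) × 100 [%]."""
    return x1 / x2 * 100


print(has_peak([(0, 0), (5, 10), (10, 0)]))   # V-shaped stroke
print(has_peak([(0, 0), (5, 0), (10, 0)]))    # straight stroke
print(length_ratio(70, 100) >= 70)            # second condition
```

Because cos θ is the same for θ and 360° − θ, the single test cos θ > 0 covers both branches of the peak definition.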

[0060] Next, in step S24, "draw line" determination processing is executed. Here, if the gesture is neither a "pointing" gesture nor a "marking" gesture and the number of gesture points is less than 3, the input gesture is determined to be a "draw line" gesture.

[0061] Next, in step S25, intermediate processing is executed. Here, all the gesture points (FIG. 14(a)) are connected as shown in FIG. 14(b), and, as shown in FIG. 14(c), a plurality of points are interpolated between each pair of adjacent gesture points. The number of interpolated points depends on the distance between the two gesture points.
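The intermediate processing of step S25 can be illustrated with a minimal sketch. The specification states only that the number of interpolated points depends on the distance between two gesture points. The linear interpolation, the function name `interpolate`, and the `spacing` parameter that converts a distance into a point count are assumptions made for illustration.

```python
import math

def interpolate(points, spacing=5.0):
    """Connect consecutive gesture points and insert evenly spaced points between them."""
    if not points:
        return []
    dense = [points[0]]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dist = math.hypot(x1 - x0, y1 - y0)
        # The number of inserted points grows with the distance between the pair.
        n_new = int(dist // spacing)
        for k in range(1, n_new + 1):
            t = k / (n_new + 1)
            dense.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
        dense.append((x1, y1))
    return dense
```

With `spacing=5.0`, two gesture points 10 units apart receive two interpolated points, while a pair closer than 5 units receives none.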

[0062] Next, in step S26, determination processing for "circle", "draw line", and "scrambling" is executed. Here, "scrambling" refers to a drawing input with random movement that does not have a predetermined shape such as a circle or a line. In the processing of step S26, first, as shown in FIG. 14(d), the center O of the smallest rectangle 500 is connected to each gesture point by a line, and each connecting line is extended until it reaches a side of the smallest rectangle 500. A line extended in this way is hereinafter referred to as an extended line. Next, the number of times each extended line intersects the gesture lines is counted; such an intersection is hereinafter referred to as a gesture line intersection.

[0063] If the number of gesture line intersections for each gesture point is 3 or more, the input gesture is determined to be "scrambling". If 85.5% or more of the extended lines have two gesture line intersections, the input gesture is determined to be a "circle" gesture. If less than 15% of the extended lines have one or fewer gesture line intersections and 75% or more of the extended lines have two gesture line intersections, the input gesture is determined to be a "circle" gesture. Further, if there is no extended line without a gesture line intersection and 40% or more of the extended lines have two gesture line intersections, the input gesture is determined to be a "circle" gesture. Furthermore, if 70% or more of the extended lines have no gesture line intersection, the input gesture is determined to be a "draw line" gesture.
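The threshold rules of step S26 can be written as a short decision function. The sketch below assumes that the gesture line intersections have already been counted for each extended line and are passed in as a list. The function name `classify_s26`, the evaluation of the rules in the order in which they are stated, and the reading of the first rule as "every extended line has 3 or more intersections" are assumptions.

```python
def classify_s26(counts):
    """Apply the step S26 thresholds to per-extended-line intersection counts.

    counts[i] is the number of gesture line intersections of the i-th extended line.
    Returns a gesture name, or None when none of the rules applies.
    """
    n = len(counts)
    if n == 0:
        return None

    def share(pred):
        # Percentage of extended lines whose count satisfies pred.
        return 100.0 * sum(1 for c in counts if pred(c)) / n

    if all(c >= 3 for c in counts):
        return "scrambling"
    two = share(lambda c: c == 2)
    if two >= 85.5:
        return "circle"
    if share(lambda c: c <= 1) < 15.0 and two >= 75.0:
        return "circle"
    if share(lambda c: c == 0) == 0.0 and two >= 40.0:
        return "circle"
    if share(lambda c: c == 0) >= 70.0:
        return "draw line"
    return None
```

A return value of `None` corresponds to a gesture that step S26 leaves undecided and that is passed on to the later determination steps.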

[0064] Next, in step S27, determination processing for "draw line" and "circle" is executed. In this processing, as shown in FIG. 15, a horizontal line (a line parallel to the x-axis direction) and a vertical line (a line parallel to the y-axis direction) (hereinafter referred to as parallel lines) are virtually drawn so as to intersect at each gesture point. If 70% or more of the extended lines intersect at only one place, the input gesture is determined to be a "draw line" gesture. If no extended line intersects at three or more places, or if less than 30% of the extended lines intersect at only one place, the input gesture is determined to be a "circle" gesture.

[0065] Further, in step S28, "draw line" determination processing is executed. In this processing, when the input gesture matches none of the above conditions, it is determined to be a "draw line" gesture.
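The determination steps form a cascade in which each step either decides the gesture type or passes the gesture on, with step S28 acting as the final fallback. The following sketch shows that control flow only. The `features` dictionary, the rule-function interface, and the names `classify_gesture` and `rule_s24` are hypothetical and do not appear in the specification.

```python
def rule_s24(features):
    """Step S24: a gesture with fewer than 3 gesture points is a "draw line" gesture.

    The rule is assumed to run after the "pointing" and "marking" rules have
    already returned None for this gesture.
    """
    return "draw line" if features["num_points"] < 3 else None

def classify_gesture(features, rules):
    """Run the determination rules in order and fall back to "draw line" (step S28)."""
    for rule in rules:
        result = rule(features)
        if result is not None:
            return result
    # Step S28: no earlier condition matched.
    return "draw line"
```

Usage: `classify_gesture({"num_points": 2}, [rule_s24])` returns `"draw line"` through step S24, and a gesture rejected by every rule in the list reaches the same result through the fallback.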

[0066] Next, in step S29, pointing object selection processing is executed. The selection processing differs as follows according to the gesture type determined by the gesture type determination processing described above. When a pointing object has been selected, the gesture analysis unit 14 outputs data indicating the semantic structure of the gesture to the integrated analysis unit 15 via the monitoring control unit 10.

[0067] (a) When the gesture is determined to be a "circle" gesture, among all the pointing object candidates located inside or on the periphery of the circle, the one candidate closest to the center O of the smallest rectangle 500 is selected as the pointing object designated by the user.
(b) When the gesture is determined to be a "pointing" gesture, the pointing object candidate located at the pointed position is selected as the pointing object designated by the user.
(c) When the gesture is determined to be a "draw line" gesture, the pointing object candidate located on the trajectory is selected as the pointing object designated by the user.
(d) When the gesture is determined to be a "marking" gesture, the pointing object candidate closest to the center O of the smallest rectangle 500 is selected as the pointing object designated by the user.
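The selection rules (a) to (d) can be sketched as a single dispatch on the gesture type. This is a simplified illustration under several assumptions: candidates are modelled as named points in a dictionary, "inside or on the periphery of the circle" is approximated by containment in the smallest rectangle around the trajectory, "located on the trajectory" is approximated by a distance tolerance `tol`, and when several candidates lie on the trajectory the one nearest the first gesture point is chosen. None of these details or names comes from the specification.

```python
import math

def bounding_box(points):
    """Smallest axis-aligned rectangle (x0, y0, x1, y1) around the gesture points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)

def nearest(candidates, target):
    """Name of the candidate whose position is closest to target, or None."""
    if not candidates:
        return None
    return min(candidates, key=lambda name: math.dist(candidates[name], target))

def select_object(gesture_type, points, candidates, tol=5.0):
    """candidates maps a name to an (x, y) screen position (hypothetical layout)."""
    x0, y0, x1, y1 = bounding_box(points)
    center = ((x0 + x1) / 2, (y0 + y1) / 2)
    if gesture_type == "circle":
        # Stand-in for "inside or on the circle": inside the smallest rectangle.
        inside = {n: p for n, p in candidates.items()
                  if x0 <= p[0] <= x1 and y0 <= p[1] <= y1}
        return nearest(inside, center)
    if gesture_type == "marking":
        return nearest(candidates, center)
    if gesture_type in ("pointing", "draw line"):
        # Candidates within tol of the pointed position or of the trajectory.
        hits = {n: p for n, p in candidates.items()
                if any(math.dist(p, q) <= tol for q in points)}
        return nearest(hits, points[0])
    return None
```

For instance, with candidates `{"Kyoto Hotel": (50, 50), "Station": (200, 200)}`, a "circle" gesture drawn around the point (50, 50) selects `"Kyoto Hotel"`, and a "pointing" gesture at (201, 199) selects `"Station"`.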

[0068] <Second Modification> Assume, for example, that a form such as an application form is to be filled in. Consider a case where the user says "Do I write my name here?" and at the same time circles one field in the form with a finger. As in the above embodiment, the multimodal information integrated analysis device analyzes the gesture type and determines that it is a "circle" gesture. Then, when analyzing what is indicated, it takes into account that in this case the screen shows a form rather than a map, and finally determines that the gesture indicates a specific field in the form. That is, the map database 24 of FIG. 1 is replaced by a form database containing the form layout, while the rest of the configuration is the same as in FIG. 1. Then, as in the above embodiment, the integrated analysis unit 15 analyzes and identifies that what is indicated in correspondence with "here" is one field in the form, and displays the analysis result on the screen of the CRT display 35.

[0069] <Third Modification> Assume that a three-dimensional object A is displayed on the screen of the CRT display 33. The user says "Please rotate this this way." and at the same time traces the screen with a finger in the desired direction of rotation (for example, clockwise), for example by drawing an arc toward the right. At this time, the gesture analysis unit 14 first determines that the gesture is a "draw line" gesture. Next, based on a database of object shapes and positions that replaces the map database 24, it determines from the positional relationship between the object A on the screen of the CRT display 33 and the gesture that what the gesture indicates is the object A, and finally passes the data on the gesture type, the time, and the indicated object to the integrated analysis unit 15. From the semantic structure of the voice input from the language analysis unit 12 via the monitoring control unit 10, the integrated analysis unit 15 knows that an utterance meaning "rotate something" has been made. It therefore determines and identifies that the "draw line" gesture corresponds to "this way", which indicates how to rotate, and means the rotation direction "to the right". Finally, it analyzes and identifies the integrated meaning "Please rotate the object A to the right." and displays the analysis result on the screen of the CRT display 35.

[0070] <Fourth Modification> Assume that a map is displayed on the screen of the CRT display 33 and that the user, while saying "I go like this, right?", makes a gesture of drawing a line along a road on the map. In this case, the coordinate values of the start point, the end point, and the vicinity of the passing points of the line are extracted as what the "draw line" gesture indicates. The integrated analysis unit 15 then associates the phrase "like this" with the extracted coordinate values of the start point, the end point, and the vicinity of the passing points, analyzes and identifies the integrated meaning "go from the start point through the passing points to the end point" on the map, and displays the analysis result on the screen of the CRT display 35.
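The extraction of the start point, passing points, and end point in the fourth modification can be sketched as follows. The specification does not state how many passing points are taken or how they are chosen. The even sampling, the `num_via` parameter, and the function name `route_from_line` are assumptions for illustration.

```python
def route_from_line(points, num_via=3):
    """Return the start point, a few evenly sampled passing points, and the end point."""
    if len(points) < 2:
        raise ValueError("a line gesture needs at least two points")
    inner = points[1:-1]
    if len(inner) <= num_via:
        via = list(inner)
    else:
        # Sample num_via points at roughly equal index intervals along the line.
        step = len(inner) / (num_via + 1)
        via = [inner[int(step * k)] for k in range(1, num_via + 1)]
    return {"start": points[0], "via": via, "end": points[-1]}
```

The resulting dictionary is the kind of data that could be associated with the phrase "like this" when the integrated meaning "go from the start point through the passing points to the end point" is formed.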

[0071] As described above, the multimodal information integrated analysis device of the present embodiment can analyze voice uttered by a human and a gesture of the human in an integrated manner and output the analysis result. This makes it possible to analyze and determine more complex and specific human input information based on the uttered voice and the gesture. Further, by applying the multimodal information integrated analysis device of the present embodiment to, for example, a voice dialogue system, a pointing gesture can be analyzed as input simultaneously with voice, so that a more flexible dialogue between a human and the system can be realized. Furthermore, by applying the multimodal information integrated analysis device of the present embodiment to, for example, input analysis in a multimodal translation dialogue system, translation can be performed based on a semantic structure in which utterance information and gesture information are organically integrated, so that more accurate translation into other languages can be achieved.

[0072] In the above embodiment and modifications, the CRT display 35 is used as the output device for outputting the integrated semantic structure that is the analysis result. However, the present invention is not limited to this, and another information output device such as another image display device or a printer may be provided.

[0073] In the above embodiment and modifications, various parameter values are used in determining the gesture type. These values are design values and may be changed as necessary.

[0074] In the above embodiment and modifications, the integrated analysis unit 15 searches the semantic structure of the voice recognition result for a demonstrative corresponding to the gesture, and detects the temporal relationship between the retrieved demonstrative and the information of the pointing object. The present invention is not limited to this. Instead of the demonstrative, a word or phrase corresponding to the gesture may be searched for in the semantic structure of the voice recognition result, and the temporal relationship between the retrieved word or phrase and the information of the pointing object may be detected. The word or phrase corresponding to the gesture is, for example, as follows. When the user marks "Kyoto Hotel" on the map while uttering "Kyoto Hotel has vacant rooms.", "Kyoto Hotel" is the word corresponding to the gesture.
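The pairing of a word or phrase with a gesture on the basis of time information can be illustrated with a small sketch. The rule used here, choosing the word whose utterance interval overlaps the gesture interval the most, is an assumption made for illustration and is not stated as the method of the integrated analysis unit 15. The tuple layouts and function names are likewise hypothetical.

```python
def overlap(a, b):
    """Length of the overlap between two (start, end) time intervals, 0 if disjoint."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def match_gesture_to_word(words, gesture):
    """Pick the word whose utterance interval overlaps the gesture interval the most.

    words: list of (text, start, end) tuples from the voice recognition side.
    gesture: (kind, start, end, pointing_object) from the gesture analysis side.
    Returns None when no word overlaps the gesture in time.
    """
    kind, g_start, g_end, target = gesture
    span = (g_start, g_end)
    best = max(words, key=lambda w: overlap((w[1], w[2]), span), default=None)
    if best is None or overlap((best[1], best[2]), span) == 0.0:
        return None
    return {"word": best[0], "gesture": kind, "object": target}
```

In the "Kyoto Hotel" example, if the words "Kyoto Hotel" are uttered from 0.0 s to 0.8 s and the marking gesture lasts from 0.2 s to 0.7 s, the function pairs the marking gesture and its pointing object with the word "Kyoto Hotel".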

[0075] In the above embodiment, the keyboard 32 is connected to the monitoring control unit 10. However, the present invention is not limited to this, and the keyboard 32 may be connected to the GUI control unit 13 so that input information entered with the keyboard 32 is transferred to the monitoring control unit 10 via the GUI control unit 13. The keyboard 32 may also be a touch-panel keyboard on the CRT display 33.

[0076]

[Effects of the Invention]

As described in detail above, according to the present invention, a word or phrase corresponding to the gesture is searched for in the semantic structure of the voice recognition result, based on the semantic structure of the voice recognition result and the corresponding time information output from the language analysis means, and on the gesture type, the corresponding time information, and the information of the pointing object output from the gesture analysis means. The temporal relationship between the retrieved word or phrase corresponding to the gesture and the information of the pointing object is detected, and, based on the detected temporal relationship, a semantic structure in which the semantic structure of the voice recognition result and the semantic structure of the gesture type are integrated is generated and output. Accordingly, it is possible to provide a multimodal information integrated analysis device that can analyze voice uttered by a human and a gesture of the human in an integrated manner and output the analysis result. This makes it possible to analyze and determine more complex and specific human input information based on the uttered voice and the gesture.

[Brief description of the drawings]

FIG. 1 is a block diagram of a multimodal information integrated analysis device according to an embodiment of the present invention.

FIG. 2 is a flowchart showing monitoring control processing executed by the monitoring control unit of FIG. 1.

FIG. 3 is a flowchart showing gesture analysis processing executed by the gesture analysis unit of FIG. 1.

FIG. 4 is a timing chart showing the collection of voice and gesture information executed in the multimodal information integrated analysis device of FIG. 1.

FIG. 5 is a front view showing the processing of gesture recognition and pointing object selection executed in the multimodal information integrated analysis device of FIG. 1.

FIG. 6 is a front view showing an example of the screen of the CRT display of FIG. 1.

FIG. 7 is a flowchart showing gesture analysis processing of a modification executed by the gesture analysis unit of FIG. 1.

FIG. 8 is a diagram showing the smallest rectangle in one process of the gesture analysis processing of FIG. 7.

FIG. 9 is a diagram showing a plurality of angles between two gesture lines in one process of the gesture analysis processing of FIG. 7.

FIG. 10 is a diagram showing the process of marking a gesture in the gesture analysis processing of FIG. 7.

FIG. 11 is a diagram showing the length ratio in one process of the gesture analysis processing of FIG. 7.

FIG. 12 is a diagram showing the position of the peak point of a gesture in one process of the gesture analysis processing of FIG. 7.

FIG. 13 is a diagram showing the upper 25% region in one process of the gesture analysis processing of FIG. 7.

FIGS. 14(a), 14(b), 14(c), and 14(d) are diagrams showing the process of determining "circle", "draw line", and "scrambling" in the gesture analysis processing of FIG. 7.

FIG. 15 is a diagram showing the process of determining "draw line" and "circle" in the gesture analysis processing of FIG. 7.

FIG. 16 is a block diagram of a conventional multimodal interactive geographic guidance system.

[Explanation of symbols]

10 ... monitoring control unit, 11 ... voice recognition unit, 12 ... language analysis unit, 13 ... graphical user interface control unit (GUI control unit), 14 ... gesture analysis unit, 15 ... integrated analysis unit, 21 ... HM network, 22 ... context-free grammar, 23 ... word dictionary, 24 ... map database, 25 ... gesture dictionary, 30 ... clock signal generator, 31 ... microphone, 32 ... keyboard, 32a ... start button, 32b ... stop button, 32c ... quit button, 33, 35 ... CRT display, 34 ... mouse.

Continuation of the front page
(51) Int. Cl.⁶ (identification symbol, internal reference number, FI, technical indication): G10L 3/00 571 G06F 15/38 Z 15/62 322M
(72) Inventor: Kyunho Loken Kim, c/o ATR Interpreting Telecommunications Research Laboratories, 5 Koaza Sanpeidani, Oaza Inuidani, Seika-cho, Soraku-gun, Kyoto
(72) Inventor: Mutsuko Tomokiyo, c/o ATR Interpreting Telecommunications Research Laboratories, 5 Koaza Sanpeidani, Oaza Inuidani, Seika-cho, Soraku-gun, Kyoto
(72) Inventor: Takuma Morimoto, c/o ATR Interpreting Telecommunications Research Laboratories, 5 Koaza Sanpeidani, Oaza Inuidani, Seika-cho, Soraku-gun, Kyoto

Claims

[Claims]

1. A multimodal information integrated analysis device comprising:
timing means for outputting time information elapsed from a predetermined reference time;
voice recognition means for performing voice recognition on input uttered voice based on the time information output from the timing means, and outputting a voice recognition result together with time information corresponding to the voice recognition result;
language analysis means for performing language analysis, using knowledge about a predetermined language, based on the voice recognition result and the corresponding time information output from the voice recognition means, and outputting a semantic structure of the voice recognition result together with the corresponding time information;
input means for displaying a plurality of pointing object candidates on a screen and for inputting a human gesture on the displayed screen;
interface control means for outputting, based on the time information output from the timing means, the position on the screen of the trajectory of the gesture input via the input means together with the corresponding time information;
gesture analysis means for analyzing the position on the screen of the trajectory of the gesture output from the interface control means, using knowledge about a diagram including the plurality of pointing object candidates, and outputting the type of the gesture, the corresponding time information, and information of a pointing object, which is the pointing object candidate indicated by the gesture among the plurality of pointing object candidates; and
integrated analysis means for searching the semantic structure of the voice recognition result for a word or phrase corresponding to the gesture, based on the semantic structure of the voice recognition result and the corresponding time information output from the language analysis means and on the type of the gesture, the corresponding time information, and the information of the pointing object output from the gesture analysis means, detecting the temporal relationship between the retrieved word or phrase corresponding to the gesture and the information of the pointing object, and generating and outputting, based on the detected temporal relationship, a semantic structure in which the semantic structure of the voice recognition result and the semantic structure of the type of the gesture are integrated.

2. The multimodal information integrated analysis device according to claim 1, wherein the word corresponding to the gesture is a demonstrative word.

3. The multimodal information integrated analysis device according to claim 1, wherein the gesture types analyzed by the gesture analysis means include a "circle" gesture, a "draw line" gesture, a "pointing" gesture, a "marking" gesture, and a "scrambling" gesture, which is a drawing with random movement.

4. The multimodal information integrated analysis device according to claim 1, 2, or 3, wherein the gesture analysis means divides a rectangle surrounding the trajectory of the gesture into a plurality of regions by a plurality of lines passing through the center of the rectangle, and determines the type of the gesture based on the relationship between the divided regions and the trajectory of the gesture.