JPH1124813A

JPH1124813A - Multi-modal input integration system

Info

Publication number: JPH1124813A
Application number: JP17799997A
Authority: JP
Inventors: Keiju Okabayashi; 桂樹岡林
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1997-07-03
Filing date: 1997-07-03
Publication date: 1999-01-29

Abstract

PROBLEM TO BE SOLVED: To provide a multi-modal input integration system, having a user interface for handling information devices, that lets the user have an information device carry out the desired processing through input that places no burden on the user. SOLUTION: An input part 1 recognizes content entered through input means of plural modes and outputs the recognition result. A planner part 2 receives the recognized words from the input part 1 and converts them into a generalized command script. An agent part 4 interprets and executes the command script sent from the planner part 2. In this way, inputs of various types are integrated into a single meaningful instruction and issued to the information device.

Description

DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

[Field of the Invention] The present invention relates to a multimodal input integration system having a user interface for handling information devices.

[0002] In recent years, terminals such as computers, electronic organizers, and ATMs, and information devices such as robots, have come into wide use, with keyboards, mice, and the like serving as their input devices. As multimedia has developed, information has diversified, and new input devices such as touch panels and tablets have come into use. In addition, input methods of a different modality (type), such as voice input and non-verbal input such as gestures, are being researched and developed.

[0003] An interface that lets the user choose among such different types of input according to the situation or personal preference, or use them redundantly, is called a multimodal interface (modal: of a type or mode). The present invention relates to an input integration method and a system architecture for realizing such a multimodal interface.

【０００４】[0004]

[Description of the Related Art] On personal computers and the like, an application is generally started by typing its name on the keyboard or by clicking its icon with the mouse.

[0005] FIG. 16 shows a conventional method of starting an application, for the case where the user's goal on an information device such as a personal computer is "I want to read mail". The user first works out and decides which application handles mail. The user then operates the keyboard or mouse to locate that application among those shown on the display, and starts it either by typing the command that launches it at the prescribed location or by finding and clicking the corresponding icon on the desktop.

[0006] As another example, take the Internet. For the goal "I want to access the Internet (WWW)", the user conventionally has to decide to use an application called a browser and then start the browser personally by taking an action such as typing on the keyboard or clicking the mouse.

【０００７】[0007]

[Problem to Be Solved by the Invention] The conventional methods described above require the user to remember the exact spelling of the name of the application to be started, or the shape of its icon, which places a heavy burden on the user.

[0008] An object of the present invention is to provide a multimodal input integration system that lets the user have an information device execute the desired processing through input that places no burden on the user. A further object is to provide a multimodal input integration system with which not only information processing devices such as personal computers but also various robots can be operated as information devices without burdening the user.

【０００９】[0009]

[Means for Solving the Problem] In the present invention, to start the intended application the user merely provides input by performing an action, and the system automatically selects an application capable of achieving that purpose and automatically starts it.

[0010] FIG. 1 shows the principle of the present invention. In the figure, 1 is a multimodal input unit having a plurality of input means 1a to 1c that accept input of various types (modalities), such as the user's voice and gestures (hand movements and the like). 2 is a planner unit that, in order to integrate input arriving in various modalities (types), identifies the input content, translates it into words of a superordinate concept (superordinate words), and generates a command script (character string) expressing the target and purpose of execution. 3 is a word dictionary, based on an object-oriented concept in which each entry has several attributes, used to convert (translate) input into superordinate-concept words. 4 is an agent unit that executes the script generated by the planner unit 2 and starts the corresponding application. 5 is an application group made up of many applications.

[0011] The input means 1a to 1c that make up the input unit 1 may each have their own means for identifying input of the corresponding modality (voice, gesture, and so on), or a single input unit 1 may be configured to identify input of several modalities.

[0012] To have the desired processing executed, the user enters an expression corresponding to the goal in one of the various modalities (types), and the corresponding input means 1a to 1c recognizes the content of that expression. The expression differs from one input modality to another, but in every case the recognition result is supplied to the planner unit 2. The word dictionary 3 defines in advance standardized superordinate-concept words for unifying the concepts identified from the various input modalities, and is built on an object-oriented concept in which each word has several attributes. The planner unit 2 examines the recognition result from the input unit 1 and, using the word dictionary 3, generates a command script (character string) that the agent unit 4 can execute.

[0013] On receiving the command script, the agent unit 4 executes it and, to "read" the "mail" in question, starts the corresponding application in the application group 5. Conversation between the user and the system takes place at several levels: a top (system) level concerning commands to the system, an agent level concerning conversation with the agent unit, and an application level concerning conversation with an application.

【００１４】[0014]

[Embodiments of the Invention] FIG. 2 shows the configuration of an embodiment. In the figure, 10 denotes the real input devices: 10-1 is a voice input device that recognizes the user's voice input and outputs the recognized word; 10-2 is a motion input device that takes movements of the user's hand or facial expression as image input, recognizes them, and outputs the recognized word; and 10-3 is another input device (including a mouse, keyboard, or the like). 11 denotes the input models: input models 11-1 to 11-3 are provided for the real input devices 10-1 to 10-3, each functions as the driver of its input device, and each holds a dictionary (or templates) of the words its input device handles. 12 is a planner unit that uses a word dictionary 120 to replace the words entered by the user through the various real input devices with superordinate-concept words and generates a script that the agent unit can execute. 13 is an agent unit that executes the script generated by the planner unit.

[0015] 14 is an application model, which stores the name and options of the application to be started, the message the agent outputs when the application starts, and the message the agent outputs when it ends. 15 is the group of applications that carry out the desired processing as directed by the application model 14. 16 denotes the real computer system (hardware, OS, and so on). 17 is a system model that monitors the computer's internal information and the state of external sensors; when the agent unit 13 needs such system information while executing a script, it can obtain it by querying this object. Specifically, the date, the time, whether the user is seated, and the like can be obtained.

[0016] 18 is a user model, which accumulates personal information such as the user's preferences or habits regarding the form in which output is received (screen display, voice output, printed output, e-mail, and so on). Because user-specific habits and preferences are accumulated in the user model 18, a response can be returned in the modality (type) suited to each user. Consequently the agent unit 13 need not concern itself with output devices at all when it responds, which simplifies agent processing. Decoupling the agent unit 13 from the output devices (described later) also makes it easy to change output devices, because the description of the agent unit 13 does not have to be changed to do so. With this configuration, output devices can be arranged freely and the system can be applied in a variety of settings.

[0017] 19 denotes the output models, comprising a plurality of models 19-1 and 19-2. Each output model provides the driver function for the corresponding real output device 20-1 or 20-2 making up the real output devices 20, and every command to a real output device 20 is replaced by a command to this object. 20 denotes the real output devices, consisting of a display device, a voice output device, and the like, which output execution results, refusals, and other responses to the user's instructions in different forms such as images, print, voice, and e-mail.

[0018] The parts of the configuration shown in FIG. 2 are described below in the order in which the system operates. The voice input device 10-1 is an input device that recognizes speech uttered by the user. It uses a known speech recognition technique such as pattern matching, together with a speech dictionary held in the voice input device, and outputs the recognized word. The motion input device 10-2 takes images of the user's hand movement or facial expression as input, tracks the movement, matches it against templates (not shown) set in advance in the input model 11-2 to recognize each movement, and outputs the recognition result. Such a technique was previously proposed by the applicant of the present invention as a "correlation tracking system" (Japanese Patent Application No. 7-342320), and the device can be implemented using it. That technique is described with reference to FIGS. 3 and 4.

[0019] FIG. 3 is a flowchart of the proposed correlation tracking. First, an image of the object (hand or face) captured by a video camera is acquired (S1 in FIG. 3), and the whole screen is specified as the search range (S2). A plurality of templates representing the different shapes the object takes as it changes (hand shapes or facial expressions) are prepared in advance (an example appears in FIG. 4, described later). One template is selected (S3 in FIG. 3), pattern matching is performed while shifting the template step by step over the whole screen and computing the degree of match (S4), and the best degree of match is stored (S5). When one template is finished, the same processing is applied to the next template. When all templates have been processed, the template with the highest degree of match and its position are detected, and tracking starts from that position (S7).

[0020] That is, an image is acquired (S8), a search block of a predetermined size is set (S9), one template is selected (S10), the position within the search block that gives the highest degree of match is sought (S11), and, once found, the degree of match and the position are stored (S12). When every template has been processed, the reference block is moved to the position indicated by the template with the highest degree of match, and the movement vector at that time is obtained (S14). In this way the object is tracked and, when the shape of the object changes, the match with one of the templates is detected.
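The search-and-track loop of FIG. 3 can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: images are small lists of integers, the "degree of match" is taken to be a negated sum of absolute differences, and the names `match_degree`, `best_match`, and `track_step` are hypothetical. The source does not specify the metric or any implementation detail.

```python
from typing import Dict, List, Optional, Tuple

Image = List[List[int]]
Match = Tuple[str, Tuple[int, int], int]  # (template name, (top, left), degree)


def match_degree(image: Image, template: Image, top: int, left: int) -> int:
    """Assumed metric: negated sum of absolute differences (0 = perfect match)."""
    return -sum(
        abs(image[top + dy][left + dx] - value)
        for dy, row in enumerate(template)
        for dx, value in enumerate(row)
    )


def best_match(
    image: Image,
    templates: Dict[str, Image],
    around: Optional[Tuple[int, int]] = None,
    radius: int = 1,
) -> Optional[Match]:
    """Try every template at every allowed position and keep the best one.

    With around=None the whole screen is searched (S2 to S7); otherwise only
    a search block of +/- radius around the given position (S9 to S12).
    """
    best: Optional[Match] = None
    for name, template in templates.items():  # select one template at a time
        max_top = len(image) - len(template)
        max_left = len(image[0]) - len(template[0])
        if around is None:
            tops, lefts = range(max_top + 1), range(max_left + 1)
        else:
            tops = range(max(0, around[0] - radius), min(max_top, around[0] + radius) + 1)
            lefts = range(max(0, around[1] - radius), min(max_left, around[1] + radius) + 1)
        for top in tops:  # shift the template step by step
            for left in lefts:
                degree = match_degree(image, template, top, left)
                if best is None or degree > best[2]:  # remember the best match
                    best = (name, (top, left), degree)
    return best


def track_step(
    image: Image, templates: Dict[str, Image], previous: Tuple[int, int]
) -> Tuple[str, Tuple[int, int], Tuple[int, int]]:
    """One tracking step: best template near the previous position, plus the
    movement vector from the previous position to the new one (S14)."""
    match = best_match(image, templates, around=previous, radius=1)
    if match is None:
        raise ValueError("no template fits inside the search block")
    name, position, _ = match
    vector = (position[0] - previous[0], position[1] - previous[1])
    return name, position, vector
```

Calling `best_match` once on the first frame gives the starting position; calling `track_step` on each later frame then yields both the movement vector and the name of the best-matching template, which is how a change of hand shape would be noticed in this sketch.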

[0021] FIG. 4 shows an example of hand-shape recognition based on the proposed principle: (a) is the immediately preceding frame, (b) is the current frame, and (c) shows example templates. In the preceding frame the object in the reference block matches the open-hand template best, whereas in the current frame shown in (b) the object in the reference block is calculated to match best the template with two fingers extended. In this case the movement vector is indicated by the arrowed line in FIG. 4, and the numeral "2" is output as the recognition result. These finger shapes can be defined in advance to express OK (or GOOD), NOGOOD, OPEN, CLOSE, and so on, so that each finger shape is recognized and the corresponding recognition result is output.

[0022] The configuration of the word dictionary 120, which the planner unit 12 of FIG. 2 uses to generate scripts, is now described. FIG. 5 shows the data structure of the word dictionary, FIG. 6 shows how words sharing the same superordinate concept (synonyms) are organized, and FIG. 7 shows an example of the attributes of a superordinate-concept word.

[0023] The word dictionary has the data structure shown in FIG. 5A: for each word it holds the part of speech, the superordinate-concept word, and an issue-destination list (a list of the targets of the action). As the example in FIG. 5B shows, the word "proposal" is a noun, its superordinate concept is document, and its issue destination is the "document agent" (meaning the agent that handles documents). The word "read" is a verb, its superordinate-concept word is "read", and its issue destinations are the document agent and the assistant agent. "Assistant" is a proper noun, its superordinate-concept word is the assistant agent, and it has no issue destination.

[0024] FIG. 6 is an example of words sharing the same superordinate concept. In this example the user's voice inputs a "hai" (Japanese for "yes") and b "yes", and the finger-shape inputs c "good" and "ok", are defined as words of the same meaning (synonyms): for each, the part of speech is a verb, the superordinate-concept word is yes, and the issue destination is the document agent or the assistant agent.
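The dictionary structure of FIG. 5 and the synonym grouping of FIG. 6 might be modeled as follows. This is only a sketch: the class name `WordEntry`, the identifier strings for the agents, and the helper `to_concept` are hypothetical, and the entries merely restate the examples given in the text.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class WordEntry:
    """One dictionary row: part of speech, superordinate concept, destinations."""
    part_of_speech: str
    concept: str
    destinations: Tuple[str, ...]


# Hypothetical identifiers for the two agents named in the text.
DOC, ASSIST = "document_agent", "assistant_agent"

WORD_DICTIONARY: Dict[str, WordEntry] = {
    # Examples from FIG. 5B.
    "proposal": WordEntry("noun", "document", (DOC,)),
    "read": WordEntry("verb", "read", (DOC, ASSIST)),
    "assistant": WordEntry("proper noun", "assistant_agent", ()),
    # Synonyms from FIG. 6: two spoken words and two finger shapes,
    # all mapped to the same superordinate-concept word "yes".
    "hai": WordEntry("verb", "yes", (DOC, ASSIST)),
    "yes": WordEntry("verb", "yes", (DOC, ASSIST)),
    "good": WordEntry("verb", "yes", (DOC, ASSIST)),
    "ok": WordEntry("verb", "yes", (DOC, ASSIST)),
}


def to_concept(word: str) -> str:
    """Translate a recognized word into its superordinate-concept word."""
    return WORD_DICTIONARY[word].concept
```

Because every modality ends up as a key in the same table, the planner in such a design would not need to know whether "ok" came from a gesture or from speech.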

[0025] FIG. 7 shows an example of the attributes of a superordinate-concept word. As shown in FIG. 7A, each superordinate-concept word has the following attributes: (1) a character string expressing its meaning, (2) a list of superordinate-concept words it can be paired with (combinable superordinate-concept words), (3) the path of the application to start, and (4) the name of the application to start. FIG. 7B gives a concrete example of the data for items (1) to (4) for the superordinate word "mail".
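The four attributes could be represented as a small record type, sketched below. The text does not reproduce the actual values of FIG. 7B, so the meaning string, pairable words, path, and application name used here are invented placeholders, and `ConceptAttributes` and `can_combine` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class ConceptAttributes:
    meaning: str              # character string expressing the meaning
    pairable: FrozenSet[str]  # superordinate-concept words it can be paired with
    app_path: str             # path of the application to start
    app_name: str             # name of the application to start


# Placeholder data only; the real contents of FIG. 7B are not given in the text.
CONCEPTS: Dict[str, ConceptAttributes] = {
    "mail": ConceptAttributes(
        meaning="electronic mail",
        pairable=frozenset({"read", "write"}),
        app_path="/path/to/apps",
        app_name="mailer",
    ),
}


def can_combine(noun_concept: str, verb_concept: str) -> bool:
    """True if the verb concept appears in the noun concept's pairable list."""
    attributes = CONCEPTS.get(noun_concept)
    return attributes is not None and verb_concept in attributes.pairable
```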

[0026] FIGS. 8 and 9 show the processing flow of the planner unit (parts 1 and 2). First, the input port (the port on which the output of the input model 11 of FIG. 2 appears) is checked (S1 in FIG. 8) to determine whether a word has been input (S2). If so, the word is stored on a stack (not shown) in the planner unit 12 (S3 in FIG. 8), and it is determined whether the stack now holds two words (S4). Two words are needed because both a noun (the target of processing) and a verb (the method, that is, the processing to perform) must be input. When two words have been input, their parts of speech are checked (S5 in FIG. 8) to determine whether they form a noun-verb combination (S6); if they do not, the stack is cleared (S7) and the user is notified. If they do, the superordinate-concept words are taken from the attributes of the input words (S8) and the combination is checked (S9).

[0027] In this check, the attributes of FIG. 7 are consulted to determine whether the noun and verb can be combined (S10 in FIG. 9). If they can, the issue destination (the agent unit that will perform the execution) is determined from the noun's issue-destination attribute (S11), and the method (the processing to perform) is determined from the verb's superordinate-concept word attribute (S12). Next, the argument (the target of processing) is determined from the noun's superordinate-concept word attribute (S13), a script (character string) is created by combining the destination, method, and argument (S14), the stack is cleared (S15), and processing ends.

[0028] FIG. 10 shows an example of script generation by this processing in the planner unit 12. In the example, as shown in FIG. 10B, the input words are "mail" and "read". According to the word dictionary, "mail" is a noun and "read" is a verb, and they can be combined, so a script is created from their superordinate-concept word attributes according to the rules shown in FIG. 10A. Under these rules, as the script format example in FIG. 10C shows, a script is generated whose issue destination is "assist Agent" (the proper noun naming the issue destination in the attributes of mail), whose method (verb) is read, and whose method argument (the object of read) is mail.
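The planner flow of FIGS. 8 to 10 can be sketched as a small class. Several points are assumptions rather than facts from the source: the script is rendered as the plain string "destination method argument" (the exact format of FIG. 10C is not reproduced in the text), the first listed destination of the noun is used, errors are signalled with `ValueError` instead of a user notification, and the table contents beyond "mail" and "read" are invented for illustration.

```python
from typing import Dict, List, NamedTuple, Optional, Set, Tuple


class WordEntry(NamedTuple):
    part_of_speech: str
    concept: str
    destinations: Tuple[str, ...]


# Illustrative tables; only "mail" and "read" follow the example in the text.
DICTIONARY: Dict[str, WordEntry] = {
    "mail": WordEntry("noun", "mail", ("assist Agent",)),
    "read": WordEntry("verb", "read", ("document Agent", "assist Agent")),
    "report": WordEntry("noun", "report", ("document Agent",)),
}
PAIRABLE: Dict[str, Set[str]] = {"mail": {"read"}, "report": {"change"}}


class Planner:
    def __init__(self) -> None:
        self.stack: List[str] = []

    def feed(self, word: str) -> Optional[str]:
        """Accept one recognized word; return a script once two words fit."""
        self.stack.append(word)                    # store the word (S3)
        if len(self.stack) < 2:                    # two words yet? (S4)
            return None
        words = self.stack[:]
        self.stack.clear()                         # cleared on failure or success (S7, S15)
        try:
            entries = [DICTIONARY[w] for w in words]
        except KeyError as exc:
            raise ValueError(f"unknown word: {exc.args[0]}") from None
        nouns = [e for e in entries if e.part_of_speech == "noun"]
        verbs = [e for e in entries if e.part_of_speech == "verb"]
        if len(nouns) != 1 or len(verbs) != 1:     # noun + verb check (S5, S6)
            raise ValueError("a noun and a verb are required")
        noun, verb = nouns[0], verbs[0]
        if verb.concept not in PAIRABLE.get(noun.concept, set()):  # S9, S10
            raise ValueError("this noun and verb cannot be combined")
        destination = noun.destinations[0]         # from the noun (S11); first entry assumed
        method = verb.concept                      # from the verb (S12)
        argument = noun.concept                    # from the noun (S13)
        return f"{destination} {method} {argument}"  # assumed script format (S14)
```

In this sketch the word order does not matter: feeding "read" and then "mail" yields the same script as "mail" followed by "read", which matches the idea that the planner only needs one noun and one verb.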

[0029] When the script generated by the planner unit 12 is supplied to the agent unit 13, the agent unit 13 checks the script and, if necessary information is missing, outputs a message to that effect to the user model 18. The user model 18 passes the message to the output model 19 corresponding to the output form that matches the user's stored preferences (habits), and that output model 19 drives the corresponding real output device 20. Having perceived the response on the real output device 20, the user enters the necessary information or instruction through a real input device 10. Once the necessary information has been input, the agent unit 13 issues an instruction to the corresponding application model 14; the application model 14 generates, from a pre-registered method, the command that drives the indicated application (in the application group 15), and the application executes the command it receives.

[0030] FIG. 11 is the processing flow for presenting output to the user. The user model (18 in FIG. 2) extracts the output content from the output instruction of the agent unit (13 in FIG. 2) (S1 in FIG. 11) and determines the output destination from the information accumulated in the user model (preferences, habits, and so on) (S2). It then determines whether the output is to be a screen display and, if so, outputs to the screen (S3, S4 in FIG. 11); in this case the message to be displayed is sent to the output model (19 in FIG. 2) corresponding to the real output device that performs screen display. Next it determines whether the output is voice and, if so, has the corresponding output model produce voice output (S5, S6 in FIG. 11). It then determines whether the output is e-mail and, if so, outputs by e-mail (S7, S8), and further determines whether the output is FAX and, if so, outputs to the FAX (S9, S10). Finally it determines whether the output is a file and, if so, outputs to a file (S11, S12 in FIG. 11).
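The output routing of FIG. 11 amounts to checking each output form in a fixed order and handing the message to the matching output model. The sketch below assumes that the user's preferences are a set of form names, that each output model is a callable taking the message text, and that the agent's output instruction is a dictionary with a "message" key. None of these representations, nor the names `UserModel` and `present`, come from the source.

```python
from typing import Callable, Dict, List, Set

# Order in which FIG. 11 tests the output forms.
OUTPUT_ORDER = ("screen", "voice", "email", "fax", "file")


class UserModel:
    def __init__(
        self,
        preferences: Set[str],
        output_models: Dict[str, Callable[[str], None]],
    ) -> None:
        self.preferences = preferences      # accumulated likes and habits
        self.output_models = output_models  # one driver-like object per form

    def present(self, instruction: Dict[str, str]) -> List[str]:
        """Route the agent's output to every form the user prefers."""
        message = instruction["message"]     # extract the output content (S1)
        destinations = self.preferences      # decide the destinations (S2)
        used: List[str] = []
        for form in OUTPUT_ORDER:            # screen, voice, e-mail, FAX, file (S3 to S12)
            if form in destinations and form in self.output_models:
                self.output_models[form](message)
                used.append(form)
        return used
```

The agent only ever calls `present`; swapping or adding an output device means changing the `output_models` mapping, which mirrors the decoupling between the agent unit and the output devices described for the user model.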

[0031] In the present invention, through conversation between the user and the multimodal input integration system, the user directs the system to start and run the required application, and the system's output is presented to the user. For this purpose, conversation between the user and the system is classified into three levels.

[0032] FIG. 12 shows the levels of conversation between the user and the system: 30 is the top level, 31 the agent level, and 32 the application level. The top level 30 is the conversation between the user and the planner; statements of purpose such as "mail" and "read" belong here. The agent level 31 is the conversation between the user and the agent unit, which takes place when the agent unit lacks information it needs to start an application.

[0033] For example, when the user enters the top-level instruction "report", "change", the target file is unknown at the agent level, so the message "Which of these files do you mean?" is presented to the user. The user then responds with, for example, "file number 3". The application level is the conversation with an application and concerns operating it: for example, instructions from the user to the application such as "quit", "open", and "search", and output from the application to the user.

[0034] As described above, conversation between the user and the system is classified into three levels, and each level uses different words. In an input device that performs pattern matching, such as voice input, the more words are registered, the wider the template search range becomes and the lower the recognition rate. Therefore, as shown in FIG. 13, a group of voice templates is prepared for each level and switched each time the level changes, which narrows the search range and improves the recognition rate.

[0035] FIG. 13 illustrates input operation according to conversation level. The example uses a voice input device as the real input device, and the top level, agent level, and application level shown in FIG. 12 are referred to as level 1, level 2, and level 3. In FIG. 13, 40 is the voice input device, 41 is the speech dictionary (templates), 42 to 44 are the speech dictionaries for levels 1 to 3, and 45 is the planner unit.

[0036] FIG. 13A shows the relationship between the voice input device and the planner unit. The voice input device holds speech dictionaries corresponding to the three conversation levels. The planner unit accepts level-information queries from the input device and returns the current level. When the voice input device is to recognize the user's voice input, it first asks the planner unit for the current conversation level and limits the search range of the speech dictionary according to the level information obtained. For example, as shown in FIG. 13B, if the level information obtained from the planner unit is level 3, the search uses only the speech dictionary 44 corresponding to level 3 to recognize the speech.
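The level query between the voice input device and the planner can be sketched as follows. A real recognizer would match audio against templates; here exact string lookup stands in for that matching, and the class names, method names, and vocabulary sets are all hypothetical.

```python
from typing import Dict, Optional, Set


class Planner:
    """Holds the current conversation level and answers level queries."""

    def __init__(self) -> None:
        self.level = 1  # 1 = top, 2 = agent, 3 = application

    def current_level(self) -> int:
        return self.level


class VoiceInputDevice:
    def __init__(self, planner: Planner, dictionaries: Dict[int, Set[str]]) -> None:
        self.planner = planner
        self.dictionaries = dictionaries  # one speech dictionary per level

    def recognize(self, utterance: str) -> Optional[str]:
        """Search only the dictionary for the level the planner reports."""
        level = self.planner.current_level()   # ask the planner first
        candidates = self.dictionaries[level]  # restricted search range
        # Stand-in for template matching: exact lookup in the small set.
        return utterance if utterance in candidates else None


# Illustrative vocabularies drawn from the words mentioned in the text.
DICTIONARIES: Dict[int, Set[str]] = {
    1: {"mail", "read", "report", "change"},
    2: {"file number 3"},
    3: {"quit", "open", "search"},
}
```

With this arrangement a word such as "quit" is simply not a candidate while the conversation is at level 1, so it cannot be confused with a top-level word.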

[0037] In this way the search range for speech recognition is reduced, which speeds up recognition, prevents recognition errors, and improves the recognition rate. FIG. 14 shows an example configuration of a multimodal input integration system that issues commands to a remote agent.

[0038] In the figure, 10 to 20 are the same as the parts with the same reference numerals in the embodiment of FIG. 2, and their description is omitted. 21 is a communication network, which includes the Internet. The configuration of FIG. 14 differs from that of FIG. 2 in that the planner unit 12 and the user model 18 are connected to the agent unit 13 through the communication network 21. With this multimodal input integration system, the user can converse by multimodal input with the agent unit 13 at a remote location (together with the application model 14 and the application group 15), have various kinds of processing executed, and obtain the results.

[0039] Next, FIG. 15 shows a configuration example of a multimodal input integration system that issues commands to a group of remote robots. In FIG. 15, reference numerals 10 to 12 and 18 to 20 denote the same parts as those bearing the same numerals in the embodiment shown in FIG. 2, and their description is omitted. Reference numeral 21 denotes a communication network including the Internet, and 22 and 23 denote robots A and B, each of which incorporates a mechanism having the functions of the agent unit (13 in FIG. 2) and performs various tasks. In effect, two agent units are provided.

[0040] In this system, the real user (robot operator) gives commands to the remote robots 22 and 23 using a preferred input form such as voice or hand gestures. Each command is sent to the destination robot 22 or 23 via the actual input device 10, the input model 11, the planner unit 12, and the communication network 21 such as the Internet. The state of the robot 22 or 23 can be displayed on the actual output device 20. Specifically, instructions can be given with intuitive words such as "robot A" and "move", and the script generated by the planner unit 12 is sent to each robot. Multimodal input thus reduces the burden on the operator.
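The routing described above can be sketched as follows: recognized words are combined into a script by the planner, and the script is delivered to the named robot. This is a minimal Python sketch under stated assumptions; the `Script` fields, the `Robot` class, the word tables, and the in-memory `dispatch` function standing in for the communication network 21 are all hypothetical and do not reproduce the patent's script format.

```python
from dataclasses import dataclass, field


@dataclass
class Script:
    """Hypothetical generalized command script produced by the planner unit."""
    target: str
    action: str


@dataclass
class Robot:
    """Hypothetical robot with a built-in agent that executes scripts."""
    name: str
    log: list[str] = field(default_factory=list)

    def execute(self, script: Script) -> str:
        # The embedded agent interprets the script; here it only records the action.
        self.log.append(script.action)
        return f"{self.name}: {script.action} done"


# Assumed vocabularies mapping recognized words to script fields.
TARGET_WORDS = {"robot A": "A", "robot B": "B"}
ACTION_WORDS = {"move": "move", "stop": "stop"}


def plan(words: list[str]) -> Script:
    """Combine recognized words (from any input mode) into one script."""
    target = next(TARGET_WORDS[w] for w in words if w in TARGET_WORDS)
    action = next(ACTION_WORDS[w] for w in words if w in ACTION_WORDS)
    return Script(target=target, action=action)


def dispatch(script: Script, robots: dict[str, Robot]) -> str:
    """Stand-in for the network: deliver the script to the destination robot."""
    return robots[script.target].execute(script)


robots = {"A": Robot("A"), "B": Robot("B")}
status = dispatch(plan(["robot A", "move"]), robots)
```

Here `status` holds the text that an output device could show to the operator. In this sketch a missing target or action word raises `StopIteration`; a real planner would need to handle that case.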

[0041]

[Effects of the Invention] According to the present invention, the user can provide input by voice or gesture that corresponds to the purpose for which the computation is to be used, using familiar speech or motions of the kind used in ordinary conversation between people. The burden on the user of remembering commands and application names is therefore greatly reduced.

[Brief description of the drawings]

FIG. 1 is a diagram showing the principle configuration of the present invention.

FIG. 2 is a configuration diagram of an embodiment.

FIG. 3 is a flowchart of the proposed correlation tracking.

FIG. 4 is a diagram showing an example of hand shape recognition according to the proposed principle.

FIG. 5 is a diagram showing the data structure of the word dictionary.

FIG. 6 is a diagram showing the organization of words that share the same superordinate concept.

FIG. 7 is a diagram showing an example of the attributes of a superordinate-concept word.

FIG. 8 is a diagram showing the processing flow of the planner unit (part 1).

FIG. 9 is a diagram showing the processing flow of the planner unit (part 2).

FIG. 10 is a diagram showing an example of script generation.

FIG. 11 is a diagram showing the processing flow for presenting output to the user.

FIG. 12 is a diagram showing the levels of conversation between the user and the system.

FIG. 13 is an explanatory diagram of the input operation according to the conversation level.

FIG. 14 is a diagram showing a configuration example of a multimodal input integration system that issues commands to a remote agent.

FIG. 15 is a diagram showing a configuration example of a multimodal input integration system that issues commands to a group of remote robots.

FIG. 16 is a diagram showing a conventional application starting method.

[Explanation of symbols]

1 Multimodal input unit
2 Planner unit
3 Word dictionary
4 Agent unit
5 Application group

Claims

[Claims]

1. A multimodal input integration system comprising: an input unit that recognizes any one of a plurality of modes of input, such as voice and gesture, and outputs a recognition result; a planner unit that evaluates the recognition result output from each such input unit and converts it into a generalized command script; and an agent unit that interprets and executes the command script from the planner unit, wherein inputs of different forms are integrated into a higher-level meaning and issued as a command to the information device.

2. The multimodal input integration system according to claim 1, wherein the input unit recognizes an input by voice or gesture expressing a purpose of use of the information device, generates a corresponding word, and outputs the word to the planner unit.

3. The multimodal input integration system according to claim 2, wherein the planner unit comprises a word dictionary in which words representing superordinate concepts and attributes associated with those words are set, and the planner unit uses the word dictionary to generate a script representing a command to be executed by the agent unit.

4. The multimodal input integration system according to claim 3, wherein a list of words that can be combined with a word representing a superordinate concept is set as an attribute associated with that word in the word dictionary of the planner unit, and whether a combination of words is possible is determined when a word is generated.

5. The multimodal input integration system according to claim 1, wherein the agent unit generates output information, including necessary input instructions and the processing state, while executing the application specified by the input script; the system further comprises a user model that receives the output and holds in advance information indicating the user's preferred output format; and the output information from the agent unit is output in a format corresponding to the held information.

6. The multimodal input integration system according to claim 1, wherein the dialogue between the user's input from the input unit and the information device is divided into a plurality of levels, the input unit comprises dictionaries for recognizing input such as voice corresponding to each level, and the input unit identifies the current level of the input and performs recognition using the corresponding dictionary.

7. The multimodal input integration system according to claim 1, wherein the planner unit and the agent unit are connected by a network, and a script generated by the planner unit is supplied to an agent unit provided at a remote location.