JP2007033478A

JP2007033478A - Multi-modal dialog system and multi-modal application generation wizard

Info

Publication number: JP2007033478A
Application number: JP2005212055A
Authority: JP
Inventors: Toshihiro Kujirai (鯨井 俊宏)
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2005-07-22
Filing date: 2005-07-22
Publication date: 2007-02-08

Abstract

PROBLEM TO BE SOLVED: To enable even an application developer who is unfamiliar with how to implement voice dialogue applications and multimodal applications to easily generate a prototype of a basic multimodal application, thereby improving development efficiency.
SOLUTION: The source code of a graphical user interface (GUI) application, a dialogue scenario, and a speech recognition grammar that together make up the multimodal application are generated automatically from the number of question items and, for each question item, a name, guidance information, answer examples, and a correction method.
COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a technique for automatically generating multimodal application programs that can be operated by both voice and on-screen operations.

Graphical user interfaces (GUIs), which allow applications to be operated on a screen, are widely used in personal computers, car navigation systems, mobile phones, and similar devices. Many tools and reference books for developing such application programs are available, and there are a great many developers.

By contrast, multimodal application programs that can be operated by both voice and on-screen operations are not yet widespread, and neither are the tools for developing them.

Furthermore, GUI applications use an event-driven programming style: GUI components such as buttons are placed on the screen, and the developer describes how the system behaves when the user operates each component. Voice dialogue, by contrast, requires sequential processing that follows a dialogue flow, which presents a high barrier to entry for GUI application developers.

Tools that allow users who cannot program to create multimodal applications have been disclosed, for example in Patent Document 1.

Patent Document 1: JP-A-9-251368

However, the method disclosed in Patent Document 1 is still relatively labor-intensive: to specify the timing of multiple input and output modalities, the user must either create timing charts directly or create operation logs from which the timing charts are learned.

The present invention has been made in view of the above problems. It provides a mechanism by which a GUI application and a voice dialogue scenario cooperate to realize a multimodal application. In addition, given only the number of question items, the name of each question item, guidance for each question item, answer examples for each question item, and the correction method to use on misrecognition, it generates templates for the source code of a GUI application that realizes the multimodal application, a dialogue scenario that works in conjunction with the GUI application, and the speech recognition grammars used by the dialogue scenario.

With this method, even application developers who are unfamiliar with how to implement voice dialogue applications or multimodal applications can easily generate a template for a basic multimodal application, which improves development efficiency.

FIG. 1 shows an embodiment of the configuration of a multimodal application according to the present invention. The multimodal application consists of a speech recognition engine 101 and a speech synthesis engine 102 running on an OS 100; one or more dialogue scenarios 105; one or more speech recognition grammars 106; a dialogue control unit 103 that controls the dialogue with the user by using the speech recognition engine 101 and the speech synthesis engine 102 according to the content of the dialogue scenario 105; and a GUI application 104.

Any existing OS that supports a GUI and audio input/output can be used as the OS 100. Any existing engines can be used as the speech recognition engine 101 and the speech synthesis engine 102. The dialogue control unit 103 can be a VoiceXML interpreter implemented according to the VoiceXML 2.0 specification recommended by the World Wide Web Consortium (hereinafter W3C), but a program that controls dialogue based on dialogue scenarios in another format can also be used. The dialogue scenario 105 can be written according to the VoiceXML 2.0 specification, or in another format. The speech recognition grammar 106 can be written according to the Speech Recognition Grammar Specification (hereinafter SRGS), a W3C Recommendation, or in another format.

The GUI application 104 is an application whose behavior the user controls by selecting and operating GUI components such as buttons, drop-down lists, and radio buttons with a pointing device or the like. It can be written in a language such as C, C++, or Java (registered trademark).

The GUI application 104 and the dialogue control unit 103 run in separate threads or processes. For them to cooperate so that the user can control the application through either the GUI or voice dialogue, there must be either a mechanism by which the GUI application 104 controls the dialogue control unit 103 or, conversely, a mechanism by which the dialogue control unit 103 controls the GUI application 104. In general, an instruction given in a single utterance can be more complex than one given in a single GUI operation. For example, where a GUI requires four steps, such as "File" → "Print" → "Select printer" → "OK", a single utterance such as "Print on printer XXX" conveys the same instruction. Voice dialogue therefore requires more complex control, so the latter mechanism, in which the dialogue control unit 103 controls the more simply controlled GUI application 104, is appropriate.

Accordingly, a mechanism by which the dialogue control unit 103 controls the GUI application 104 is described below as one embodiment.

The dialogue control unit 103 controls the dialogue according to the content of the given dialogue scenario 105. FIG. 2 is an example of a dialogue scenario 105 written in VoiceXML. Following this scenario, the dialogue control unit 103 uses the speech synthesis engine 102 to play the spoken guidance "Welcome to the Shinkansen reservation system" and "Please say the name of your departure station". Next, it uses the speech recognition engine 101 to recognize the user's utterance according to the speech recognition grammar 106 specified by the file name "departure.abnf". When the speech recognition engine 101 reports the speech recognition result 107, the dialogue proceeds to the next question, which asks for the destination, if the user said a valid station name. If the user said nothing (<noinput>), the guidance "Please say, for example, 'Tokyo Station'" is played and recognition is performed again. In this way, the dialogue control unit 103 can control the dialogue independently of the GUI application 104.

In a typical GUI application, when the user selects a GUI component such as a menu or button on the screen with a pointing device, the processing associated with that component is executed. In the GUI application 104 of the present invention, by contrast, when a GUI component is selected, the application sends pseudo speech recognition result information 108 to the dialogue control unit 103. This information is equivalent to the recognition result that would be obtained if the user had spoken a voice command with the same meaning as the GUI operation. As a result, the dialogue control unit 103 can control the voice dialogue in the same way whether the user speaks a voice command or performs a GUI operation.

Conversely, when a GUI operation is performed or a voice command is spoken, the result must also be reflected in the GUI application 104. The dialogue control unit 103 therefore needs a function for providing the GUI application 104 with information about the progress of the dialogue. An event system can be used for this purpose: the dialogue control unit 103 exposes an API (Application Program Interface) for registering event listeners, and events 109 about the progress of the dialogue that occur while the dialogue control unit 103 is running are delivered to the GUI application 104.

When the dialogue scenario 105 is written in VoiceXML, possible events about the progress of the dialogue include an event indicating the start of the dialogue, an event indicating the start of execution of a form, an event indicating the start of execution of a specific item in a form, an event indicating that a value has been assigned by user input to a field defined in a form, an event indicating that the value of a variable defined in the dialogue scenario has changed, and an event indicating the end of the dialogue. The GUI application receives these events and, based on their content, performs processing such as updating the screen. For example, FIG. 3 shows a screen of the GUI application 104 that works with the dialogue scenario of FIG. 2 to realize a multimodal application. The value of the list box 301, which corresponds to the "departure station" selection, should be changed when an event is received indicating that a value has been assigned to the field departure by user input. If the user answers several items in a single utterance, issuing this field-assignment event several times in succession has the same effect as operating the GUI application several times.
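
The listener mechanism described above can be sketched as follows. This is a minimal illustration, not the patent's API: the names `DialogEventSketch`, `FieldFilledListener`, `addFieldFilledListener`, and `applyRecognitionResult` are hypothetical, and only the field-assignment event is modeled. It shows one recognition result that fills two fields being delivered as two consecutive events.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in for the dialogue control unit 103, reduced to event delivery.
class DialogEventSketch {
    // Hypothetical listener type; the text only states that a registration API exists.
    interface FieldFilledListener {
        void fieldFilled(String fieldName, String value);
    }

    private final List<FieldFilledListener> listeners = new ArrayList<>();

    // Hypothetical registration method offered to the GUI application.
    void addFieldFilledListener(FieldFilledListener listener) {
        listeners.add(listener);
    }

    // One recognition result may fill several fields: issue one event per field, in order.
    void applyRecognitionResult(Map<String, String> filledFields) {
        for (Map.Entry<String, String> field : filledFields.entrySet()) {
            for (FieldFilledListener listener : listeners) {
                listener.fieldFilled(field.getKey(), field.getValue());
            }
        }
    }

    // Simulates one utterance that answers two items and returns the events the GUI side sees.
    static List<String> demo() {
        List<String> received = new ArrayList<>();
        DialogEventSketch controller = new DialogEventSketch();
        controller.addFieldFilledListener((name, value) -> received.add(name + "=" + value));
        Map<String, String> result = new LinkedHashMap<>();
        result.put("departure", "Tokyo");
        result.put("destination", "Osaka");
        controller.applyRecognitionResult(result);
        return received;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In a real application the listener would update the corresponding GUI component instead of collecting strings.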

The dialogue scenario illustrated in FIG. 2 asks the user the question items one at a time. The GUI application 104 in FIG. 3 displays all three input items. However, because the dialogue is controlled by the dialogue control unit 103, the GUI application 104 should not accept any input that the dialogue control unit 103 cannot accept. While the departure station is being requested, it is therefore desirable to disable the list boxes 302 and 303 and the buttons 304 and 305, as shown in FIG. 3. Similarly, if the voice dialogue scenario 105 and the speech recognition grammar 106 are written to accept input for several items at once, it is desirable for the GUI application 104 to enable only the GUI components for the items currently being accepted and to disable the others.

Next, an embodiment of the multimodal application generation wizard according to the present invention is described with reference to the drawings.

FIG. 11 shows an embodiment of the flow of the multimodal application generation wizard according to the present invention. The formats of the source code, dialogue scenario, and speech recognition grammars to be generated are explained concretely by following this flow.

When the multimodal application generation wizard starts (S01), it reads an application definition file that contains the information needed to generate the application (S02). The application definition file must contain at least three kinds of information: (1) the number of question items, (2) the name of each question item, and (3) typical answer examples for each question item.

Optionally, the application definition file may also contain (4) whether correction of the immediately preceding input is allowed and (5) guidance information for each question item.
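
As an illustration of the information listed above, the following sketch holds a parsed application definition in memory and checks the three mandatory items. The file format itself is not specified in the text, so parsing is omitted; the class and method names (`AppDefinitionSketch`, `QuestionItem`, `validate`) are hypothetical, and the optional item (4) is left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;

class AppDefinitionSketch {
    // One question item: name (2), answer examples (3), optional guidance (5).
    static final class QuestionItem {
        final String name;
        final List<String[]> answerExamples; // each entry: {pronunciation, written form}
        final String guidance;               // null when no guidance was given

        QuestionItem(String name, List<String[]> answerExamples, String guidance) {
            this.name = name;
            this.answerExamples = answerExamples;
            this.guidance = guidance;
        }
    }

    // Checks the three mandatory kinds of information and returns any problems found.
    static List<String> validate(int declaredCount, List<QuestionItem> items) {
        List<String> problems = new ArrayList<>();
        if (declaredCount != items.size()) {
            problems.add("declared " + declaredCount + " question items but found " + items.size());
        }
        for (QuestionItem item : items) {
            if (item.name == null || item.name.isEmpty()) {
                problems.add("a question item has no name");
            } else if (item.answerExamples == null || item.answerExamples.isEmpty()) {
                problems.add("question item '" + item.name + "' has no answer example");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        List<String[]> stations = new ArrayList<>();
        stations.add(new String[] {"toukyou", "Tokyo"});
        List<QuestionItem> items = new ArrayList<>();
        items.add(new QuestionItem("departure", stations, null));
        System.out.println(validate(1, items));
    }
}
```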

FIGS. 4(a) and 4(b) show examples of dialogue sequences of multimodal applications generated by the multimodal application generation wizard according to the present invention. In both examples, the items the user must enter are (1) the departure station name (departure), (2) the arrival station name (destination), and (3) the seat class (class), and both require the user to enter the items in order. The difference between (a) and (b) is whether the immediately preceding input can be corrected on the spot when the speech recognition engine 101 misrecognizes an utterance or the user makes an input mistake. Item (4) in the application definition file determines which form of dialogue is generated. If this option is not given, either form can be generated as the default.

Next, the speech recognition grammar 106 for the first question item is generated (S03). Because no correction utterance is possible at the first question item, a grammar of the same form can be generated for both dialogue forms (a) and (b) of FIG. 4.

The information in (3), typical answer examples for each question item, can be used to generate the speech recognition grammar 106. When several pairs of pronunciation and written form are given as answer examples, a speech recognition grammar 106 like that in FIG. 6(a) can be generated. This example grammar uses the ABNF form defined in the W3C-recommended SRGS and embeds the information given by the developer in the form "pronunciation { question item name = "written form" }". When only one pair of pronunciation and written form is given, a speech recognition grammar 106 like that in FIG. 6(b) can be generated. Generating grammars in this way shows the developer what to write when adding more utterance examples later.
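
The grammar generation step can be sketched as below. FIG. 6 is not reproduced in this text, so the exact layout is an assumption; the sketch emits an SRGS ABNF header followed by one alternative per answer example in the "pronunciation { question item name = "written form" }" form. The class name `GrammarSketch`, the method `toAbnf`, and the `language` parameter are hypothetical, and whether a given recognizer accepts this tag content is not confirmed by the source.

```java
import java.util.ArrayList;
import java.util.List;

class GrammarSketch {
    // examples: each entry is {pronunciation, written form}; at least one is required.
    static String toAbnf(String itemName, String language, List<String[]> examples) {
        if (examples.isEmpty()) {
            throw new IllegalArgumentException("at least one answer example is required");
        }
        StringBuilder sb = new StringBuilder();
        sb.append("#ABNF 1.0 UTF-8;\n");
        sb.append("language ").append(language).append(";\n");
        sb.append("root $").append(itemName).append(";\n");
        sb.append("public $").append(itemName).append(" =\n");
        for (int i = 0; i < examples.size(); i++) {
            String[] example = examples.get(i);
            // The first alternative is indented; later ones are introduced with "|".
            sb.append(i == 0 ? "    " : "  | ");
            sb.append(example[0]).append(" { ").append(itemName)
              .append(" = \"").append(example[1]).append("\" }\n");
        }
        sb.append(";\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String[]> examples = new ArrayList<>();
        examples.add(new String[] {"toukyou", "Tokyo"});
        examples.add(new String[] {"oosaka", "Osaka"});
        System.out.print(toAbnf("departure", "ja-JP", examples));
    }
}
```

Emitting one alternative per line keeps the template easy to extend by hand, which matches the stated goal of showing the developer where to add utterance examples.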

Next, processing branches depending on whether a dialogue that allows correction of the immediately preceding input is to be generated (S04). The generation flow for the dialogue form of FIG. 4(a), which does not allow such correction, is described first.

In the dialogue form of FIG. 4(a), correction utterances need not be accepted, so grammars of the form shown in FIG. 6 are generated for all question items. In addition, a speech recognition grammar 106 whose recognition vocabulary is "yes" and "no" is output for the final confirmation (S05).

Next, the dialogue scenario 105 is generated (S06). FIG. 5 is an example of a dialogue scenario 105, written in VoiceXML, that the multimodal application generation wizard according to the present invention generated for a multimodal application with the dialogue form of FIG. 4(a). In this format, (1) the number of question items corresponds to the number of <field> elements in the generated VoiceXML document. (2) The name of each question item is used as the value of the name attribute of each <field> element and as the name of the speech recognition grammar 106 it references. (5) The voice guidance for each question item is used as the content of the <prompt> element in each <field> element; if none is specified, a placeholder string of the form "guidance prompting the user to say [question item name]" is output, as in FIG. 5, so that the developer can easily edit it later. Automatically generating an example of the guidance played when nothing is said, an example of the confirmation guidance played after all question items have been entered, and example handling for both positive and negative confirmations further reduces developer effort. For the confirmation guidance, it is effective to read the content of each entered question item back to the user, so it is desirable to generate an example guidance sentence that references each input using <value> elements; the developer remains free to revise the guidance.
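
The mapping from the definition items to VoiceXML elements can be sketched as follows. FIG. 5 is not reproduced here, so the output layout is an assumption: the field name "confirm", the grammar file name "yesno.abnf", and the English placeholder wording are hypothetical, and the handling of positive and negative confirmations is omitted. Question item names are assumed to be plain identifiers that need no escaping.

```java
import java.util.Arrays;
import java.util.List;

class ScenarioSketch {
    // Minimal escaping for text placed inside XML elements.
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // names: question item names (2). guidance: texts from (5); an entry may be null.
    static String toVoiceXml(List<String> names, List<String> guidance) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<vxml version=\"2.0\" xmlns=\"http://www.w3.org/2001/vxml\">\n");
        sb.append("  <form>\n");
        for (int i = 0; i < names.size(); i++) {
            String name = names.get(i);
            String text = guidance.get(i) != null
                    ? guidance.get(i)
                    : "Guidance prompting the user to say " + name; // placeholder to edit later
            sb.append("    <field name=\"").append(name).append("\">\n");
            sb.append("      <prompt>").append(escape(text)).append("</prompt>\n");
            sb.append("      <grammar src=\"").append(name).append(".abnf\"/>\n");
            sb.append("      <noinput><reprompt/></noinput>\n");
            sb.append("    </field>\n");
        }
        // Final confirmation reads every entered value back with <value>.
        sb.append("    <field name=\"confirm\">\n");
        sb.append("      <prompt>");
        for (String name : names) {
            sb.append("<value expr=\"").append(name).append("\"/>, ");
        }
        sb.append("is that correct?</prompt>\n");
        sb.append("      <grammar src=\"yesno.abnf\"/>\n");
        sb.append("    </field>\n");
        sb.append("  </form>\n");
        sb.append("</vxml>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(toVoiceXml(
                Arrays.asList("departure", "destination"),
                Arrays.asList("Please say your departure station", null)));
    }
}
```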

Next, the source code of the GUI application 104 is generated (S07). FIG. 7 is an example of GUI application 104 source code generated by the multimodal application generation wizard according to the present invention; it is written in Java.

FIG. 7(a) shows the code that creates the GUI components corresponding to the question items. With the performance of current speech recognition systems, it is difficult to support free-form input like that of a GUI text box, and even if the speech were recognized accurately, it would be difficult to understand the content and advance the dialogue. The GUI components suited to the question items are therefore those that select from a predetermined set of options, such as drop-down lists and radio buttons. The wizard may ask the developer which component to use, or it may ask for the number of options and use a drop-down list when there are many and radio buttons when there are few. The example in FIG. 7(a) creates drop-down lists. The question item name from (2) is used as the title of the label component that describes each drop-down list, and the pronunciation and written-form information from the answer examples in (3) is used for the options. A speech recognition grammar may provide several pronunciations for one option, as in FIG. 6(a), but this would be redundant in a GUI component, so it is desirable to use only one entry for answer examples that share the same written form. The component creation code is generated once for each of the input items counted in (1).
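
The two decisions described above, collapsing answer examples that share a written form and choosing a component type by option count, can be sketched as follows. The threshold `RADIO_LIMIT` and the names `ComponentChoiceSketch`, `distinctWrittenForms`, and `componentFor` are hypothetical; the text gives no specific cut-off.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

class ComponentChoiceSketch {
    static final int RADIO_LIMIT = 4; // hypothetical cut-off between "few" and "many"

    // Keeps one entry per written form, in first-seen order.
    static List<String> distinctWrittenForms(List<String[]> examples) {
        LinkedHashSet<String> seen = new LinkedHashSet<>();
        for (String[] example : examples) {
            seen.add(example[1]); // example = {pronunciation, written form}
        }
        return new ArrayList<>(seen);
    }

    // Many options: drop-down list. Few options: radio buttons.
    static String componentFor(List<String> options) {
        return options.size() > RADIO_LIMIT ? "drop-down list" : "radio buttons";
    }

    public static void main(String[] args) {
        List<String[]> examples = new ArrayList<>();
        examples.add(new String[] {"toukyou", "Tokyo"});
        examples.add(new String[] {"toukyoueki", "Tokyo"});
        examples.add(new String[] {"oosaka", "Osaka"});
        List<String> options = distinctWrittenForms(examples);
        System.out.println(options + " -> " + componentFor(options));
    }
}
```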

FIG. 7(b) is an example of the source code that registers these GUI components and defines what happens when they are operated. In general, the processing performed when a GUI component is operated differs from application to application and is hard to generate automatically. Here, however, only code that notifies the dialogue control unit 103 of a pseudo speech recognition result is generated. This lets the dialogue advance after a GUI operation just as it does after a voice operation. Whether it receives a real or a pseudo speech recognition result, the dialogue control unit 103 simply assigns the value to the appropriate field according to the VoiceXML specification. Speech recognition results can be expressed in a notation based on SISR (Semantic Interpretation for Speech Recognition), which the W3C is standardizing. For example, if "Tokyo" is entered by GUI or by voice for the question item named "departure", the result can be expressed as a pair of the category the input belongs to and its value: "{ departure : "Tokyo" }". The source code in FIG. 7(b) likewise generates a string expressing such a recognition result for the item selected in the GUI. Using pairs of category and value makes it clear which question item an input belongs to regardless of the input method, and also handles the case where a single voice input fills several question items.
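
A minimal sketch of building the pseudo speech recognition result string for a GUI selection follows. It reproduces the "{ category : "value" }" notation quoted in the text; the names `PseudoResultSketch`, `format`, and `forSelection` are hypothetical, separating several pairs with a space is an assumption, and quotation marks inside values are not escaped.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class PseudoResultSketch {
    // Formats category/value pairs as { name : "value" name : "value" }.
    static String format(Map<String, String> slots) {
        StringJoiner joiner = new StringJoiner(" ", "{ ", " }");
        for (Map.Entry<String, String> slot : slots.entrySet()) {
            joiner.add(slot.getKey() + " : \"" + slot.getValue() + "\"");
        }
        return joiner.toString();
    }

    // What a selection handler for one GUI component would send to the dialogue control unit.
    static String forSelection(String itemName, String selectedValue) {
        Map<String, String> slot = new LinkedHashMap<>();
        slot.put(itemName, selectedValue);
        return format(slot);
    }

    public static void main(String[] args) {
        System.out.println(forSelection("departure", "Tokyo"));
    }
}
```

Because the GUI path and the speech path produce the same string form, the receiving side needs only one code path to fill fields.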

When a value is assigned to a field during processing by the dialogue control unit 103, the corresponding event is delivered to the event handler of the GUI application 104. FIG. 7(c) is an example of the generated event handler. The dialogue control unit 103 reports the name of the field that was filled and the value assigned to it. Because the field names are the question item names given by the developer, the generated code uses those names to determine which question item received input and dispatches accordingly. For each question item, the generated code changes the selected item of the GUI component to match the reported input (choice1.select(argValue)) and calls the code that changes the enabled or disabled state of the GUI components according to the progress of the dialogue (setDialogStatus(1)).
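
The dispatch logic of such an event handler can be sketched as below. FIG. 7(c) is not reproduced here, so this is an assumption about its shape: a map stands in for the drop-down lists so the sketch runs without a GUI toolkit, and the class name `FilledHandlerSketch`, the method name `fieldFilled`, and the status numbering are hypothetical. Only `choice1.select(argValue)` and `setDialogStatus(1)` come from the text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FilledHandlerSketch {
    // Stands in for the current selection of the drop-down lists choice1..choice3.
    final Map<String, String> selected = new LinkedHashMap<>();
    int dialogStatus = 0; // index of the question item currently being asked

    // Called when the dialogue control unit reports that a field was filled.
    void fieldFilled(String argName, String argValue) {
        if (argName.equals("departure")) {
            selected.put("departure", argValue);   // generated code: choice1.select(argValue)
            setDialogStatus(1);
        } else if (argName.equals("destination")) {
            selected.put("destination", argValue); // assumed analogue for the second list
            setDialogStatus(2);
        } else if (argName.equals("class")) {
            selected.put("class", argValue);       // assumed analogue for the third list
            setDialogStatus(3);
        }
    }

    // Placeholder: the generated version also enables and disables components.
    void setDialogStatus(int status) {
        dialogStatus = status;
    }

    public static void main(String[] args) {
        FilledHandlerSketch handler = new FilledHandlerSketch();
        handler.fieldFilled("departure", "Tokyo");
        System.out.println(handler.selected + " status=" + handler.dialogStatus);
    }
}
```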

FIG. 7(d) is the body of the code that changes the enabled or disabled state of the GUI components according to the progress of the dialogue. In a dialogue of the form in FIG. 4(a), everything except the item currently being asked should be disabled. The generated code therefore disables all GUI components first and then enables the component corresponding to the current question item.
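
The disable-all-then-enable-one pattern can be sketched as follows. A list of booleans stands in for the enabled state of the GUI components so that the sketch runs without a GUI toolkit; the names `DialogStatusSketch` and `enabledFor` and the zero-based status numbering are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class DialogStatusSketch {
    // Returns the enabled flag of each component when question item `status` is being asked.
    static List<Boolean> enabledFor(int status, int componentCount) {
        List<Boolean> enabled = new ArrayList<>();
        for (int i = 0; i < componentCount; i++) {
            enabled.add(false);        // first disable every component
        }
        if (status >= 0 && status < componentCount) {
            enabled.set(status, true); // then enable only the current question item
        }
        return enabled;
    }

    public static void main(String[] args) {
        System.out.println(enabledFor(1, 3));
    }
}
```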

Next, the flow of the wizard is described for the case where the developer instructs it to allow utterances and GUI operations that correct the immediately preceding input.

In the dialogue form of FIG. 4(b), at the stage where the departure station has been entered and the arrival station is being requested, it is desirable to generate the multimodal application so that the user is allowed the following four operations: (A) selecting the arrival station (the current question item) by voice, (B) correcting the departure station (the immediately preceding input) by voice, (C) selecting the arrival station (the current question item) by GUI operation, and (D) correcting the departure station (the immediately preceding input) by GUI operation. The same applies at the subsequent steps of the dialogue sequence.

First, a voice dialogue scenario 105 that realizes (A) and (B) is generated (S08). FIG. 8 shows an example of the dialogue scenario 105 automatically generated by the dialogue wizard according to the present invention. As in the example of the dialogue scenario 105 in FIG. 5, the number of <field> elements, the value of each element's name attribute, the content of each <prompt>, and so on are generated from the information entered by the developer. The differences from FIG. 5 are that the immediately preceding input is presented to the user, that a speech recognition grammar accepting correction utterances in addition to answers to the current question item is specified, and that processing branches depending on whether the utterance is an input for the current question item or a correction utterance. Specifically, in addition to the question text supplied by the developer as the content of <prompt>, it is desirable to automatically generate a sentence that confirms the preceding input ("<value expr="name of the preceding question item"/>, is that right?"). In addition, naming the referenced speech recognition grammar 106 "current question item name_correct.abnf" makes it easy for a developer reading the generated dialogue scenario to see that corrections are allowed. To determine whether the user's utterance is an input for the current question item or a correction of the preceding input, the recognition result format described earlier, "{ destination : "Osaka" }", does not carry enough information. A correction utterance is therefore expressed in a form such as "{ destination : { correct : "true" departure : "Nagoya" } }". This example means that, in the step for entering destination, a correction utterance was made and the preceding departure value was corrected to Nagoya. An input for the current question item, on the other hand, is expressed as "{ destination : { correct : "false" destination : "Tokyo" } }". Within each <field> element that accepts correction utterances, it is desirable to automatically generate an <if> element that tests the value of correct, together with two descriptions. When the utterance is judged to be a correction, an <assign> element overwrites the preceding input with the variable "current question item name.preceding question item name" and the field variable of the current question item, whose input is not yet complete, is cleared. When the utterance is judged not to be a correction, the field variable of the current question item is overwritten with the value of the variable "current question item name.current question item name".
In the example of FIG. 8, the preceding "class" cannot be corrected at the final confirmation step, but it is desirable that the developer be able to choose whether correction utterances are allowed at that step. If correction utterances for the preceding input are allowed at the final confirmation step, the referenced speech recognition grammar 106 can be given an easily understood name such as "yesno_correct.abnf", and an <if> element that determines whether the recognition result is a correction utterance can also be generated automatically.
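The branching on the correct value described above can be illustrated with a minimal sketch. It is written in Python rather than VoiceXML, it models the recognition result as a nested dictionary, and the function name apply_result and the fields dictionary are assumptions introduced only for illustration; the actual generated scenario of FIG. 8 is not reproduced here.

```python
from typing import Dict, Optional

Fields = Dict[str, Optional[str]]


def apply_result(fields: Fields, current: str, result: dict) -> Fields:
    """Apply one recognition result received at the step for `current`.

    `result` follows the nested form described in the text, e.g.
    {"destination": {"correct": "true", "departure": "Nagoya"}}.
    (Hypothetical helper; not taken from the patent figures.)
    """
    payload = result[current]
    if payload.get("correct") == "true":
        # Correction utterance: overwrite the preceding item(s) named in
        # the payload and clear the current, not-yet-completed field.
        for name, value in payload.items():
            if name != "correct":
                fields[name] = value
        fields[current] = None
    else:
        # Ordinary input: take the value stored under the current item name.
        fields[current] = payload[current]
    return fields


fields: Fields = {"departure": "Osaka", "destination": None}
apply_result(fields, "destination",
             {"destination": {"correct": "true", "departure": "Nagoya"}})
apply_result(fields, "destination",
             {"destination": {"correct": "false", "destination": "Tokyo"}})
```

After the first call the departure value is replaced and destination stays unfilled, so the dialogue would ask for the destination again; the second call fills destination in the ordinary way.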

Next, the speech recognition grammar 106 that accepts correction utterances for the preceding input is described. A correction utterance for the preceding input might be phrased as "Nagoya, not Osaka". One option is to automatically generate a speech recognition grammar 106 that accepts phrases of the form "yyy, not xxx" for every combination. This is not optimal, however, because an utterance such as "Nagoya, not Tokyo" would be accepted even when the preceding input was "Osaka", and because accepting every combination makes the speech recognition grammar 106 large. Another option is to determine the name of the referenced speech recognition grammar 106 dynamically from the preceding input. For example, with the grammar specification method described in the W3C VoiceXML 2.1 Working Draft, writing "<grammar srcexpr="departure + '_correct.abnf'"/>" makes the step for the next question item reference the speech recognition grammar 106 named "Osaka_correct.abnf" when the preceding input was "Osaka". This method is advantageous because inappropriate phrasings can be excluded. Yet another option is to fix the name of the referenced speech recognition grammar 106 as "current question item name_correct.abnf" and have the GUI application 104 automatically generate a speech recognition grammar 106 with that file name. The description here is based on this last method, so the next step in the flow of the multimodal generation application is the generation of source code that automatically generates the speech recognition grammar (S09).
FIG. 9 shows an example of a speech recognition grammar 106, automatically generated by the GUI application 104, that takes the preceding input into account. It is written in SRGS format so that the recognition result has the form described above, from which it can be determined whether the utterance is a correction. The structure of the grammar is the same regardless of the preceding input. The only parts that vary are that the first word of $correct is the preceding input and that the preceding input is excluded from the list of correction candidates at the end of $correct. The wizard according to the present invention can automatically generate, within the GUI application 104, code that produces such a speech recognition grammar 106 when given the reading and notation of the preceding input (S09).
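As a rough illustration of the two parts that vary with the preceding input, the following Python sketch builds an SRGS ABNF grammar as text. It is an assumption-laden sketch, not the grammar of FIG. 9: the function names, the rule name $input, and the connective token are hypothetical, the semantic tags that would yield the nested recognition result are omitted, and the distinction between reading and notation is ignored.

```python
from typing import List


def grammar_file_name(current_item: str) -> str:
    # Fixed naming scheme from the text: "<current question item name>_correct.abnf"
    return f"{current_item}_correct.abnf"


def make_correct_grammar(previous_value: str,
                         previous_candidates: List[str],
                         current_candidates: List[str],
                         connective: str = "ではなくて") -> str:
    """Build an ABNF grammar accepting '<previous> <connective> <other>'
    (a correction) or a plain answer to the current question item.
    Hypothetical helper; semantic tags are intentionally left out."""
    # The preceding input itself is excluded from the correction candidates.
    corrections = [c for c in previous_candidates if c != previous_value]
    if not corrections:
        raise ValueError("no correction candidates remain")
    correct_rule = (f"$correct = {previous_value} {connective} "
                    f"({' | '.join(corrections)});")
    input_rule = f"$input = {' | '.join(current_candidates)};"
    return "\n".join([
        "#ABNF 1.0 UTF-8;",
        "language ja-JP;",
        "root $main;",
        "public $main = $correct | $input;",
        correct_rule,
        input_rule,
    ])


grammar = make_correct_grammar("Osaka",
                               ["Osaka", "Tokyo", "Nagoya"],
                               ["Tokyo", "Kyoto"])
print(grammar_file_name("destination"))
print(grammar)
```

Because the grammar is rebuilt from the preceding input each time, an utterance that names a different place as the value to be replaced is simply not covered by $correct.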

Next, the automatic generation (S10) of source code for the GUI application 104, which allows the preceding input to be corrected and can cooperate with the dialogue scenario 105, is described.
FIG. 10 shows an example of the source code of the GUI application 104 automatically generated by the wizard according to the present invention. In FIG. 10(a), drop-down lists are generated as in the example of FIG. 7(a). The name of the question item from (2) is used as the title of the label component that describes each drop-down list, and the reading and notation information of the typical answer examples for each question item from (3) is also used. FIG. 10(b) is the event handler called when the GUI component for the first question item is operated, and FIG. 10(d) is the event handler for the GUI components of the subsequent question items. The variable state represents the progress of the dialogue, so that even when the same GUI component is operated, an input for the current question item can be distinguished from a correction input made after the dialogue has moved on to the next step. Because FIG. 10(b) is the event handler for the first question item, there is nothing to correct while the dialogue is at the step of the first question item (state=0), so code that produces a recognition result expression such as "{ departure : "Osaka" }" is generated. When the dialogue has moved on to the next step (state=1), code is generated that produces a recognition result expression such as "{ destination : { correct : "true" departure : "Nagoya" } }", which is the same as the result of a correction utterance for the preceding input made in the next step. In FIG. 10(d), the generated code likewise expresses an input for the current question item in a form from which correction input can be distinguished, such as "{ destination : { correct : "false" destination : "Tokyo" } }". The event handler code of FIG. 10(c), which is called when a value is assigned to a <field> element, can be almost identical to the code of FIG. 7(c). The difference is the addition of code (makeDic(argValue)) that calls the code which automatically generates the speech recognition grammar 106 for the next step according to the input. FIG. 10(e) is code for realizing restrictions (c) and (d) on user operation described above: the wizard generates code that disables all GUI components other than those for the current question item and the preceding question item. To this end, for the first question item (step=0) it generates code that enables only the GUI component for the first question item, and for subsequent states it generates code that enables only the components for the current question item and the preceding question item.
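The event handler logic and the component restriction described for FIG. 10 can be sketched as follows. This is a simplified Python model under stated assumptions, not the generated source code: the function names pseudo_result and enabled_components are hypothetical, question items are identified by name, and state is taken to be the zero-based index of the current question item.

```python
from typing import List, Set


def pseudo_result(items: List[str], state: int,
                  operated: str, value: str) -> dict:
    """Build the pseudo speech recognition result for a GUI operation.

    `items` is the ordered list of question item names and `state` the
    index of the current item. (Hypothetical helper for illustration.)
    """
    current = items[state]
    if operated == current:
        if state == 0:
            # First question item: nothing can be corrected yet.
            return {current: value}
        # Ordinary input for the current item, in the distinguishable form.
        return {current: {"correct": "false", current: value}}
    if state > 0 and operated == items[state - 1]:
        # Same shape as a correction utterance made at the current step.
        return {current: {"correct": "true", operated: value}}
    raise ValueError(f"component '{operated}' is not operable at state {state}")


def enabled_components(items: List[str], state: int) -> Set[str]:
    # Only the current item and the immediately preceding item are operable.
    return set(items[max(0, state - 1): state + 1])


items = ["departure", "destination", "class"]
print(pseudo_result(items, 0, "departure", "Osaka"))
print(pseudo_result(items, 1, "departure", "Nagoya"))
print(pseudo_result(items, 1, "destination", "Tokyo"))
print(enabled_components(items, 2))
```

Producing the same structure as a spoken correction lets the dialogue control unit handle GUI operations and utterances through a single code path, which is the point of the pseudo speech recognition result.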

The present invention can be used to develop programs for information devices equipped with a voice dialogue function and a GUI operation function, such as personal computers, PDAs, and car navigation systems.

FIG. 1 is a diagram showing an example of the configuration of a multimodal application.
FIG. 2 is a diagram showing an example of the description of a dialogue scenario.
FIG. 3 is a diagram showing an example of the screen of a GUI application.
FIG. 4 is a diagram showing an example of the dialogue sequence of an automatically generated multimodal application.
FIG. 5 is a diagram showing an example of the dialogue scenario of an automatically generated multimodal application.
FIG. 6 is a diagram showing an example of the speech recognition grammar of an automatically generated multimodal application.
FIG. 7 is a diagram showing an example of the GUI application source code of an automatically generated multimodal application.
FIG. 8 is a diagram showing an example of the dialogue scenario of an automatically generated multimodal application.
FIG. 9 is a diagram showing an example of a speech recognition grammar automatically generated by the GUI application of an automatically generated multimodal application.
FIG. 10 is a diagram showing an example of the GUI application source code of an automatically generated multimodal application.
FIG. 11 is a diagram showing an example of the automatic generation flow of a multimodal application.

Explanation of symbols

100 ... OS
101 ... Speech recognition engine
102 ... Speech synthesis engine
103 ... Dialogue control unit
104 ... GUI application
105 ... Dialogue scenario
106 ... Speech recognition grammar
107 ... Speech recognition result
108 ... Pseudo speech recognition result
109 ... Event indicating the progress of the dialogue
301 ... Drop-down list that is currently the input target
302 ... Drop-down list that is not currently an input target
303 ... Drop-down list that is not currently an input target
304 ... Button that is not currently an input target
305 ... Button that is not currently an input target

Claims

1. A multimodal dialog system comprising at least a GUI application, a voice dialogue control unit that controls voice dialogue based on a voice dialogue scenario, and a voice recognition unit, wherein the GUI application notifies the voice dialogue control unit of user operation content as pseudo speech recognition result information in the same format as that output as a speech recognition result, and the voice dialogue control unit notifies the GUI application of information on the progress of the dialogue when a speech recognition result is obtained from the voice recognition unit or when pseudo speech recognition result information is obtained from the GUI application.

2. The multimodal dialog system according to claim 1, wherein the format shared by the speech recognition result and the pseudo speech recognition result information includes at least information on category name / value pairs.

3. The multimodal dialog system according to claim 1, wherein the information on the progress of the dialogue that is notified to the GUI application, when a speech recognition result is obtained from the voice recognition unit or when pseudo speech recognition result information is obtained from the GUI application, takes the form of a GUI application operation sequence.

4. A multimodal application generation wizard program that receives as input at least the number of question items, the name of each question item, and a list of typical answer examples for each question item, and generates:
source code of a GUI application program including a process of transmitting pseudo speech recognition information to a voice dialogue control unit when an answer to any of the plurality of question items is input through the GUI;
a voice dialogue scenario for having a user answer the plurality of question items; and
a speech recognition grammar including the typical answer examples for each of the plurality of question items.

5. The multimodal application generation wizard program according to claim 4, wherein the source code is generated so as to include code that dynamically generates a speech recognition grammar which, when both a correction utterance for the immediately preceding speech recognition result and an utterance for the current input item are accepted, allows the correction utterance to be distinguished from the input utterance.

6. The multimodal application generation wizard program according to claim 4, wherein the source code is generated so as to include code that prohibits operation of GUI components corresponding to input items that the user cannot enter, according to the progress of the voice dialogue scenario.