JP2004246865A

JP2004246865A - Audio response web system and its input/output control method

Info

Publication number: JP2004246865A
Application number: JP2003365945A
Authority: JP
Inventors: Kenji Sugie; 健司杉江; Makoto Kakizaki; 誠柿崎
Original assignee: OMEGA SYSTEM DESIGN KK
Current assignee: OMEGA SYSTEM DESIGN KK
Priority date: 2002-10-25
Filing date: 2003-10-27
Publication date: 2004-09-02

Abstract

PROBLEM TO BE SOLVED: To provide an audio response web system, and an input/output control method for it, that improve the usability of a terminal device using multimodal technology.

SOLUTION: This audio response web system, accessed from a terminal device 2 via a network 3, is provided with a web server 4 that executes a web application and an audio server 5 that executes an audio application. The web server 4 is provided with a control part 12 that notifies a program running on the audio server 5 of the state of a program running on the web server 4. The audio server 5 is provided with a control part 15 that notifies the web server 4 of the state of a program running on the audio server 5.

COPYRIGHT: (C)2004,JPO&NCIPI

Description

The present invention relates to a voice response web system that synchronizes a web server with a voice server, thereby enabling seamless cooperation between voice access and web access from a terminal device, and to an input/output control method for such a voice response web system.

As mobile phones and PDAs (personal digital assistants) have developed, mobile phones that can access websites on the Internet and display screens in addition to transmitting voice (so-called browser phones), and PDAs that offer voice input/output and voice transmission in addition to character input and screen display, have come into practical use. On these terminal devices, characters and commands can be entered by several methods, such as a keypad or keyboard, a stylus pen, or voice commands. A technology that enables multiple input/output methods on one device in this way is called multimodal technology.

Conventionally, web access from a mobile phone has generally relied on the keypad to enter the characters needed for access. Applying multimodal technology makes it possible to enter characters by voice recognition and to use voice synthesis for web output. Accordingly, voice response web systems have been proposed that, for web access from a mobile phone or the like, use voice recognition as the character input means and voice synthesis as the text output means. A voice response web system executes a web application to process the data input/output that accompanies ordinary web access (web data input/output), and also performs the processing for voice recognition and voice output.

One such voice response web system combines a web server with a voice server that performs voice recognition and voice synthesis. In this system, a program running on the web server handles web access, and a program running on the voice server handles voice recognition and voice synthesis. The start of web access processing is notified from the terminal device to the web server, and the start of voice recognition and voice synthesis processing is notified from the terminal device to the voice server via the web server.

In such a voice response web system, however, the program running on the web server cannot take the state of the program running on the voice server as input, so the terminal device had to notify the web server separately of the start of voice recognition and of the display of the recognition result. The user operating the terminal device therefore had to perform two operations: one to start voice recognition and one to display the result. In other words, the user first performs, for example, a key (or button) operation to have the voice server recognize the input voice, and then, once the recognition result has been returned from the voice server, performs a further key (or button) operation to carry out web access based on that result. Likewise, when voice recognition is to follow the output of synthesized voice from the terminal device, the program running on the voice server cannot take the state of the program running on the web server as input, so the terminal device had to issue the notification to start voice recognition only after the synthesized voice had been output. The user therefore had to wait for the voice output to finish before performing the key (or button) operation that starts voice recognition for the next character input.

Another way to build a voice response web system is to give the terminal device itself voice recognition and voice synthesis functions, have it access the web server according to its own recognition result, and have it synthesize voice from the character information returned by the web server. With this approach, however, the processing load on the terminal device becomes too large in both hardware and software. If the terminal device is a mobile phone, for example, voice recognition is possible only with a small vocabulary, continuous operating time becomes markedly shorter, and reductions in the size and weight of the terminal device are hindered.

Among web applications that use terminal devices such as mobile phones and PDAs, an application that handles voice recognition and voice output in addition to web data input/output is called a multimodal application. There are two ways to implement a multimodal application. One uses a language dedicated to multimodal applications together with either a dedicated web browser or an existing web browser extended with additional functions. The other, as described above, requires neither a dedicated language nor a dedicated web browser and instead links a general-purpose voice server, which performs voice recognition and voice output, with a general-purpose web server.

When a multimodal application is written in a dedicated language, the application description for screen data and the application description for voice processing can be mixed in a single web page, that is, in one XML (Extensible Markup Language)-based text file. With the method that links a voice server and a web server, by contrast, the two applications cannot be written together. Here, the application for screen data and the application for voice processing correspond to so-called content (the information itself). In the method that links a voice server and a web server, the two applications taken together constitute the multimodal application. That is, a multimodal application is an application in which an ordinary web page (or web content) consisting only of text and images is associated with a web page (or web content) that performs voice response processing. (A multimodal application is activated under the control of the voice response web system. The program that performs this activation control is a system application that implements the voice response web system described here; in the following description, to distinguish it from multimodal applications, it is called a module rather than an application.)

A conventional multimodal application executed on a system that links a voice server and a web server typically produces screen output in response to voice input. It has been difficult to build more complex applications that dynamically execute voice applications as the displayed web page changes, output voice and screen data simultaneously, control voice recognition and voice output in response to web data input, or perform context-dependent processing.

Furthermore, the way in which the screen data processing application and the voice processing application to be activated are specified has generally been fixed inside the program of the device that links the two servers. The application creator therefore could not write, in the programs and files being created, a description specifying which screen data processing application and voice processing application to activate, and it was difficult to create an application in which these could be specified freely according to user operations. In addition, the device that links the two servers and the screen data processing and voice processing applications had to reside on the same local host or within the same local network, and could not be run fully separated from one another. As a result, application creators could not use a device that links a web server and a voice server over the Internet.

As described above, conventional voice response web systems based on multimodal technology offer poor operability on the terminal device that accesses them: keys or buttons must be operated on the terminal device at every event, such as starting voice recognition or displaying a voice recognition result. Moreover, a voice response web system that links a web server and a voice server typically produces screen output in response to voice input, and could not produce both voice output and screen output in response to voice input, web data input, or both.

Accordingly, an object of the present invention is to provide a voice response web system that can improve usability on a terminal device, and an input/output control method for it.

Another object of the present invention is to provide a voice response web system that, in response to voice input and web data input, outputs voice and screen simultaneously and controls voice recognition and voice output, and for which such applications can be created easily, together with an input/output control method for it.

A first voice response web system of the present invention is a voice response web system accessed from a terminal device. It comprises a web server and a voice server, and has first control means for notifying a program running on the voice server of the state of a program running on the web server, and second control means for notifying the web server of the state of a program running on the voice server.

A second voice response web system of the present invention is a voice response web system accessed from a terminal device. It comprises a web server that manages execution of a web application and a voice server that manages execution of a voice application, and has first control means for notifying the voice server of the state of a program running on the web server, and second control means for notifying the web server of the state of a program running on the voice server.

In the present invention, when an operation requesting input by voice recognition or requesting voice output is performed on the terminal device, activation means can activate a voice application based on identification information obtained through web data input. As the identification information, for example, an attribute can be set indicating that the voice application to be activated for input processing by voice recognition is being specified. With the identification information set in this way, the activation means that handles voice application activation executes the specified voice application when it finds that attribute in a value obtained through web data input. By adding the attribute to the identifier of a voice application, the application creator can specify the voice application to be executed when input by voice recognition is requested.
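The attribute lookup described above can be sketched as follows. This is a minimal illustration only: the text does not define a concrete syntax, so the `identifier;attribute` value format, the attribute name `asr`, and the function name `find_voice_app` are all assumptions made for the example.

```python
from typing import Optional

# Hypothetical attribute name marking "activate this voice application
# for input by voice recognition". The source does not name the attribute.
RECOGNITION_ATTR = "asr"


def find_voice_app(web_input: dict) -> Optional[str]:
    """Return the identifier of the voice application carrying the attribute.

    web_input maps form field names to values received through web data
    input. A value of the assumed form "identifier;attribute" designates a
    voice application; any other value is ignored.
    """
    for value in web_input.values():
        identifier, separator, attribute = value.partition(";")
        if separator and attribute.strip() == RECOGNITION_ATTR:
            return identifier.strip()
    return None
```

With this convention, an application creator would only need to append the attribute to the voice application's identifier in the page; a value without the attribute leaves the activation means idle.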

Also, in the present invention, the first control means may activate, via a network such as the Internet, a web application held on a first application server different from the web server, and the activation means may activate a voice application held on a second application server different from the voice server. With this configuration, the application creator can freely specify the applications and files to be activated. For example, if a URL (Uniform Resource Locator) with an absolute address is written as the identification information (for example, the identifier) for activating an application, and the program of the voice response web system is called by absolute address, the voice application and web application to be activated no longer need to reside on the same host as the voice response web system and can be placed on any host. Likewise, if the initial page that specifies these application calls is written in a web-accessible language such as HTML (HyperText Markup Language), XML, or XHTML (Extensible HTML), it too can be placed on any host, so that execution can be distributed over the Internet.
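As a rough illustration of why absolute addressing frees the applications from the system's own host, the sketch below decides which host serves an application from its identifier. The function name `host_of` and the example host names are hypothetical; only `urllib.parse.urlparse` from the Python standard library is relied on.

```python
from urllib.parse import urlparse


def host_of(identifier: str, system_host: str) -> str:
    """Return the host that holds the application named by identifier.

    An identifier written as an absolute URL (scheme and host present)
    names its own host. Anything else is treated as relative, so the
    application is assumed to live on the voice response web system's host.
    """
    parts = urlparse(identifier)
    if parts.scheme and parts.netloc:
        return parts.netloc
    return system_host
```

For example, `host_of("http://apps.example.com/voice/order.vxml", "vrws.example.net")` yields the external host, while a relative identifier such as `voice/order.vxml` falls back to the system host.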

In the input/output control method of the present invention, the identification information written in a web page can include, for example, an attribute indicating that the web application activated from that web page (an existing web application activated via the activation control module of the voice response web system) is to be voice-enabled, and an identifier specifying a file containing the information for voice enablement (voice operation tags). The data read from the voice enablement file specified by the identifier is dynamically embedded into the web data that the web application activated from the web page outputs at run time, and the combined result becomes the final web data output. In this way a voice input/output function can be added to an existing application without modifying it. For a web application that is already voice-enabled, no voice enablement file is specified, so both voice-enabled and non-voice-enabled web applications can be handled.
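The dynamic embedding step can be sketched as below. This is an assumption-laden illustration: the text does not say where in the output the voice operation tags are placed, so inserting them just before the closing `</body>` tag is a choice made for the example, and the function name `embed_voice_tags` is hypothetical.

```python
from typing import Optional


def embed_voice_tags(web_output: str, voice_tags: Optional[str]) -> str:
    """Combine a web application's output with voice operation tags.

    web_output is the HTML produced at run time by the existing web
    application. voice_tags is the content read from the voice enablement
    file, or None when the application is already voice-enabled and no
    file was specified.
    """
    if not voice_tags:
        # Already voice-enabled: pass the output through unchanged.
        return web_output
    index = web_output.lower().rfind("</body>")
    if index == -1:
        # No body end tag found: append the tags at the end.
        return web_output + voice_tags
    return web_output[:index] + voice_tags + web_output[index:]
```

Because the combination happens on the output, the existing web application itself is never edited, which is the point the paragraph above makes.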

The present invention can be applied both when the terminal device can access the web server and the voice server simultaneously and when the terminal device can access only one of the web server and the voice server at any given time.

According to the present invention, the web server and the voice server can operate in a linked manner, which improves the usability of the terminal device in the voice response web system.

Further, the present invention provides a voice response web system that links a web application and a voice application. With this voice response web system, by using identification information for controlling application activation (for example, identifiers and attributes), an application creator can develop a multimodal application that outputs voice and screen simultaneously and controls voice recognition and voice output without developing a new control program, which allows multimodal application development with greater freedom. If the application creator writes the identification information for activating an application (for example, a file name serving as the identifier) as a URL with an absolute address, and calls the program of the voice response web system by absolute address, the voice application and web application to be activated can be placed on any host and can be executed in a distributed manner, separately from the main host (main server) of the voice response web system. The examples described below use HTML data, but other scripting or markup languages can be used to describe applications, identifiers, and the like; the invention can be adapted to XML data or XHTML data, for example. VoiceXML data, which is currently being specified as an extension of XML to voice data, can also be used.

Next, a preferred embodiment of the present invention will be described with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of a voice response web system according to an embodiment of the present invention. The voice response web system 1 is accessed from a terminal device 2 via a network 3 such as the Internet, and includes a web server 4 and a voice server 5. The web server 4 and the voice server 5 operate in cooperation with each other. In terms of hardware, the web server 4 and the voice server 5 may be a single unit or may be separate units closely coupled to each other by a network. For the purposes of the present invention the terminal device 2 need only have at least one of a voice transmission function and a screen display function, but in this embodiment it is assumed to have both. A preferred example of such a terminal device 2 is a mobile phone, called a browser phone or smartphone, that has a keyboard or keypad, a display screen, a microphone for voice input, and a speaker for voice output, and that supports an Internet connection and a voice call at the same time.

The web server 4 includes an application execution unit 11 that executes web applications, and a control unit 12 that performs attribute determination and controls the activation of web applications. The voice server 5 includes an application execution unit 13 that executes voice applications, an activation unit 14 that activates voice applications, and a control unit 15 that transmits voice recognition results and controls voice output. Because the control units 12 and 15 and the activation unit 14 are in practice implemented by software executed on the web server 4 or the voice server 5, in the following description the control unit 12 in the web server 4 is called the attribute determination / web application activation control module, and the activation unit 14 and the control unit 15 in the voice server 5 are called the voice application activation module and the recognition result transmission / voice output control module, respectively.

First, the operation of this voice response web system is described, focusing on the state notification processing between the web server 4 and the voice server 5 and its timing. FIG. 2 shows the processing when the operation to start voice recognition is performed after voice output has finished, and FIG. 3 shows the processing when the operation to start voice recognition is performed during voice output. In these figures, time runs horizontally from left to right, and the flow of data and the like between the servers is shown in chronological order.

As shown in FIG. 2, in the initial processing for accessing the voice response web system from the terminal device 2, when a voice response request access is issued by a user operation on the terminal device 2, a voice response request access 104 is sent from the terminal device 2 to the voice server 5 and is notified to the program 103 running on the voice server. In response, the program 103 running on the voice server sends a voice synthesis state notification 106 to the program 102 running on the web server 4 and sends a voice output 105 to the terminal device 2. After the voice output 105 ends, the program 103 running on the voice server sends a voice recognition start waiting state notification 107 to the program 102 running on the web server 4. The voice synthesis state notification 106 indicates that the voice server 5 is performing voice synthesis processing, and the voice recognition start waiting state notification 107 indicates that the voice server 5 is waiting for voice recognition to start.

When a user operation on the terminal device 2 causes a voice recognition request web access 108 to be sent from the terminal device 2 to the web server 4, the voice recognition request web access 108 is notified to the program 102 running on the web server, and the program 102 sends a voice recognition start notification 109 and a voice recognition result waiting state notification 110 to the program 103 running on the voice server. The voice recognition start notification 109 indicates that voice recognition should be started, and the voice recognition result waiting state notification 110 indicates that the web server 4 is waiting for the voice recognition result.

In this case the voice recognition request web access is assumed to occur after voice output has finished, so the program 103 running on the voice server has already sent the voice recognition start waiting state notification 107 to the program 102 running on the web server before the voice recognition request web access takes place. Because the program 103 running on the voice server is in the waiting state at this point, it starts voice recognition processing immediately.

The program 103 running on the voice server then executes the voice recognition processing. When the processing is complete, the program 103 sends the program 102 running on the web server a voice recognition completion state notification 111, indicating that voice recognition processing has been completed. On receiving the voice recognition completion state notification 111, the program 102 running on the web server sends a voice recognition web display 112 to the terminal device 2. The voice recognition web display 112 is a display corresponding to the voice recognition result.
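The web-server side of the FIG. 2 exchange can be sketched as a small state machine. This is a simplified model under stated assumptions, not the patented implementation: the class `WebProgram`, its method names, the notification strings, and the HTML returned for the display are all hypothetical stand-ins for program 102 and the numbered messages 108 to 112.

```python
from enum import Enum, auto


class WebState(Enum):
    IDLE = auto()
    WAITING_FOR_RESULT = auto()


class WebProgram:
    """Hypothetical stand-in for program 102 running on the web server."""

    def __init__(self):
        self.state = WebState.IDLE
        self.outbox = []  # notifications sent to the voice server, in order

    def on_recognition_request(self):
        # Web access 108 arrives from the terminal: send the start
        # notification (109) and the result waiting state notification (110).
        self.outbox.append("recognition_start_109")
        self.outbox.append("result_waiting_110")
        self.state = WebState.WAITING_FOR_RESULT

    def on_recognition_complete(self, result: str) -> str:
        # Completion notification 111 arrives from the voice server:
        # return the web display (112) with no further user operation.
        if self.state is not WebState.WAITING_FOR_RESULT:
            raise RuntimeError("no recognition request is pending")
        self.state = WebState.IDLE
        return "<p>Recognized: " + result + "</p>"
```

The point of the design is visible in `on_recognition_complete`: the display is produced when the voice server reports completion, so the user performs one operation (the request) instead of two.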

If, on the other hand, the voice recognition request web access occurs before voice output has finished, then, as shown in FIG. 3, the program 103 running on the voice server has not yet issued the voice recognition start waiting state notification 107 at the time the voice recognition request web access 108 is issued. The program 103 running on the voice server receives the voice recognition start notification 109 and the voice recognition result waiting state notification 110 from the program 102 running on the web server, but does not start voice recognition processing because voice synthesis processing is still in progress. When voice synthesis processing ends and the program 103 finds the voice recognition start notification 109, it starts voice recognition processing, since the program 102 running on the web server is in the voice recognition result waiting state. Thereafter, as in the case shown in FIG. 2, the voice recognition completion state notification 111 is sent to the program 102 running on the web server. If, however, the voice server 5 uses a voice synthesis system that can cancel processing during voice output, the program 103 running on the voice server may, on receiving the voice recognition start notification 109 during voice output, cancel the voice output at that point and start voice recognition processing immediately.
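The voice-server behavior in FIG. 2 and FIG. 3, including the optional cancellation, can be sketched as follows. The class `VoiceProgram`, its state names, and the `can_cancel_output` flag are hypothetical names introduced for this illustration of program 103; the sketch models only the ordering of events, not actual synthesis or recognition.

```python
from enum import Enum, auto


class VoiceState(Enum):
    SYNTHESIZING = auto()
    WAITING_FOR_START = auto()
    RECOGNIZING = auto()


class VoiceProgram:
    """Hypothetical stand-in for program 103 running on the voice server."""

    def __init__(self, can_cancel_output: bool = False):
        self.state = VoiceState.SYNTHESIZING  # voice output 105 in progress
        self.can_cancel_output = can_cancel_output
        self.start_pending = False      # notification 109 received early
        self.output_cancelled = False

    def on_recognition_start(self):
        # Start notification 109 arrives from the web server.
        if self.state is VoiceState.WAITING_FOR_START:
            # FIG. 2: output already finished, so start immediately.
            self.state = VoiceState.RECOGNIZING
        elif self.state is VoiceState.SYNTHESIZING:
            if self.can_cancel_output:
                # Cancellable synthesis: stop the output and start now.
                self.output_cancelled = True
                self.state = VoiceState.RECOGNIZING
            else:
                # FIG. 3: remember the request until the output ends.
                self.start_pending = True

    def on_output_finished(self):
        # Voice synthesis and output have ended.
        if self.state is not VoiceState.SYNTHESIZING:
            return
        if self.start_pending:
            self.start_pending = False
            self.state = VoiceState.RECOGNIZING
        else:
            # Corresponds to sending waiting state notification 107.
            self.state = VoiceState.WAITING_FOR_START
```

In all three orderings the user's single request is enough: an early request is either deferred until the output ends or, with a cancellable synthesizer, acted on at once.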

In both the example shown in FIG. 2 and the example shown in FIG. 3, the program 102 running on the web server refers to the state of the program 103 running on the voice server from the time it is notified of the voice recognition request web access 108. When the voice recognition completion state is reached, the program 102 running on the web server sends the voice recognition web display 112 to the terminal device 2. As a result, the screen of the terminal device 2 shows a display corresponding to the voice recognition result, based on the voice recognition web display 112.

The program 102 running on the web server may execute an application that outputs voice at the same time as the voice recognition web display appears on the screen of the terminal device 2. In this case, the program 102 sends the voice recognition web display 112 to the terminal device 2 and, at the same time, sends the program 103 running on the voice server a voice synthesis start notification 113 announcing the start of voice synthesis. On receiving the voice synthesis start notification 113, the program 103 starts the voice synthesis processing and performs voice output 114 to the terminal device 2. The voice output 114 is of the same kind as the voice output 105 described earlier, and once this voice synthesis processing starts, the processing from the voice output 105 onward is repeated in the same way.

Next, activation control of the voice application and the web application in the voice response web system shown in FIG. 1 will be described. In this embodiment, application activation is controlled using identifiers and attributes as identification information. FIG. 4 shows in more detail the portion of the processing flow of FIG. 2 from the voice recognition request web access to the voice recognition web display, and represents the flow of the basic processing. FIG. 5 shows description examples of the attributes and identifiers used to control application activation; in FIG. 5, the attributes and identifiers are written as HTML tags. FIG. 6 is a flowchart showing the processing in this voice response web system. As will become clear from the following description, the system may output voice and screen simultaneously based on voice input, output voice only from web data input, or output voice and screen simultaneously from web data input, and the flowchart of FIG. 6 is drawn so as to cover all of these cases. Which case applies is determined around steps S6, S12, and S14. Naturally, the order in which the determination steps S6, S12, and S14 are performed may be changed as appropriate.

In the following description, a module is a program created by the system developer who implements the voice response web system described here, and is modularized by function, such as activation control of voice applications and web applications. Normally, an application developer cannot modify a module and is only permitted to execute it. In contrast, voice applications and web applications are created mainly by application developers. Voice output data is likewise created mainly by application developers.

Here, a voice application is an XML-based text file written in a language executable by a voice-processing browser, or a program that outputs such an XML-based text file. A web application is, for example, an HTML text file or XML-based text file describing data that a web browser can display, or a program that outputs HTML text or XML-based text. The voice application and the web application may reside on the same local host as the voice response web system or on a different host. Although this description may read as if the voice application itself performs voice recognition and voice synthesis, in practice the voice application is started by the control module, after which the recognition and synthesis are carried out by a voice recognition engine and a voice synthesis engine.

《Basic processing》
An example of the basic processing will be described. As shown in FIG. 4, when a voice recognition request web access 208 is generated at the terminal device 2 by a user operation, such as clicking a button or link on a voice-enabled web page displayed on the screen of the terminal device 2, the web application 205 accepts this voice recognition request web access 208 and sends an identifier/attribute notification 209 to the attribute determination and web application activation control module 204 (hereinafter simply "activation control module 204"). In terms of FIG. 1, the web application 205 is executed by the application execution unit 11, and the function of the activation control module 204 is realized by the control unit 12. Provided that the voice server 5 is in the voice recognition start waiting state, the activation control module 204 sends a voice recognition start notification and an identifier/attribute notification 210 to the voice application activation module 201 (hereinafter simply "activation module 201"), instructing it to start the voice application 202. The function of the activation module 201 is realized by the activation unit 14 of the voice server 5, and the voice application 202 is executed on the application execution unit 13 of the voice server 5.

The activation module 201 starts the voice application 202, which performs voice recognition, in accordance with the instruction from the activation control module 204 and the contents of the identifier/attribute notification 210. If the identifier/attribute notification 210 contains no particular designation, the default voice application, in other words a predetermined one, is started.

When a voice recognition result is produced after the voice application 202 has run, the value of the recognition result is sent from the voice application 202 to the recognition result transmission and voice output control module 203 (hereinafter simply "control module 203"). As shown in FIG. 2, the activation control module 204 refers to the state of the control module 203. When the control module 203 sends the voice recognition completion status notification 211 to the activation control module 204, the activation control module 204 receives the voice recognition result and, depending on its content, immediately performs web application activation 212, so that the web application 206 is started. As a result, the voice recognition web display 213 is output from the web application 206, and the result of the voice recognition is shown on the screen of the terminal device 2 as the voice recognition web display. The value of the voice recognition result contains an identifier from which the information needed to start the web application can be determined, such as a file name on the local host, a URL designating a web application on the Internet, or parameters (query data) for the web application.
That is, the identifier may be the file name of the web application itself, or the URL or file name needed to start the web application may be looked up, based on the identifier, in a data holding device such as a database or text file. The identifier may also be composed of a plurality of keywords or parameters, as long as the web application to be started can be specified.

The activation control module 204 starts the web application 206 specified by these identifiers, and control of the web server 4 passes to the web application 206. In general the web application 205 and the web application 206 are different applications, but they may be the same application.

Control of the voice server 5 passes to the activation module 201 once the control module 203 has finished communicating with the activation control module 204 of the web server 4. The activation module 201 then sends the voice recognition start waiting state notification 107 of FIG. 2 to the activation control module 204 and waits for a notification from the activation control module 204 in preparation for the next voice processing request.

When the web application 206 has been started and the screen output has been produced as a result of the above processing, the user operates the web page displayed on the screen of the terminal device 2. If the user then performs an operation requesting voice processing, the web application 206 takes the role of the web application 205 described above, and the basic processing shown in FIG. 4 is repeated.

The identifier/attribute notification 209 that the web application 205 sends to the activation control module 204 may contain identifiers for starting a voice application or a web application, and parameters consisting of values to be added to an identifier. As described later, these parameters can be used in several ways. To distinguish between these uses, an identifier may be given an attribute in this embodiment. An identifier carrying attribute information, or a value added to an identifier, is used by the activation control module 204 of the web server 4 and the activation module 201 of the voice server 5 to perform processing according to the attribute. The attribute information may be a value written directly in the web page, or an identifier used to look up the value in a data holding device such as a file or database.

《Use of attributes》
Next, how attributes are used will be described.

If the value of a voice recognition result is long, as with a URL, writing out the entire character string for each individual recognition result, as in 《Basic processing》 above, is tedious for the application creator. A means by which the application creator can write the recognition result strings concisely is therefore described.

As one of the attributes mentioned above, an attribute is defined which indicates that the acquired value is "a value to be added in front of the value of the voice recognition result". The processing performed when this attribute is set is described using the flowchart shown in FIG. 6.

First, when the user performs a voice recognition request operation, parameters are acquired from the web page and sent to the activation control module 204 (step S1). In step S2, it is determined whether there is a value to be added to the voice recognition result; if there is, it is held in step S3 as recognition result addition data. Steps S4 to S9 are not relevant here and are described later. When voice input is subsequently performed, the voice recognition result is acquired in step S10 and sent to the activation control module 204. In step S11, the voice recognition result and the recognition result addition data are combined, and in step S16 the combined value is used to start the web application 205.

FIG. 5 (1) shows an example of how an attribute and a value are written. Here, "page1.html" is the voice recognition result sent from the voice recognition application 202. The INPUT tag (the tag beginning with "<INPUT") is written in the web application 205, and the value of its Name attribute is used to indicate the attribute of the value given by its Value attribute. Instead of an INPUT tag, the value may be passed as a query string parameter in an HTTP (hypertext transfer protocol) request. In this example, "add" in the INPUT tag is an attribute indicating that the value written after "value=" is to be added in front of the voice recognition result, and "http://aaaaa.co.jp/app1/" is the value actually added to "page1.html". The activation control module 204 combines these values and starts the web application.
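The combination performed in steps S2, S3, and S11 amounts to simple string concatenation. The sketch below illustrates this under the assumption that the page parameters have already been parsed into a dictionary keyed by attribute name; the function name `build_launch_target` is hypothetical.

```python
def build_launch_target(params, recognition_result):
    """Combine the "add" prefix, if any, with the voice recognition result.

    params: dict of attribute name -> value taken from the web page
            (for example the Name/Value pairs of INPUT tags).
    """
    prefix = params.get("add", "")      # steps S2/S3: hold the value to prepend
    return prefix + recognition_result  # step S11: join the two values


page_params = {"add": "http://aaaaa.co.jp/app1/"}
print(build_launch_target(page_params, "page1.html"))
# prints http://aaaaa.co.jp/app1/page1.html
```

When no "add" attribute is present, the recognition result is used unchanged, which corresponds to the basic processing.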

Next, another example of the use of attributes will be described. This is a procedure in which the content written as the voice recognition result is used not as an identifier specifying the web application to be started, but as a parameter passed to the web application to be started.

As one of the attributes mentioned in 《Basic processing》 above, an attribute is defined which indicates "an identifier designating a web application that requires parameters". FIG. 5 (2) is an example of the description used when the recognition result is handled as a parameter. Because the attribute in this case only needs to add a value in front of the voice recognition result, "add" in the INPUT tag can be used in the same way as above. The application developer writes the web application that requires parameters in the form "http://aaaaa.co.jp/app2/servlet/stock?" and writes, as the value of the voice recognition result, the parameter to be passed to that web application, such as "code=9999".

In the flowchart shown in FIG. 6, when the activation control module 204 finds the "add" attribute in step S2, it holds "http://cywiz.co.jp/app2/servlet/stock?" as recognition result addition data in step S3. In step S11, it combines this value with the parameter "code=9999" sent as the value of the voice recognition result, and starts the application.

《Voice support for existing applications》
One feature of the present invention is that an existing general-purpose application can be made voice-enabled as it is. A method of voice-enabling an existing general-purpose application without modifying it is described here. Only the first page that calls the existing application needs to be newly created. In this page, a URL is specified as explained in 《Use of attributes》 above so that the existing application is called. In addition, the data that must be added to the existing application to make it voice-enabled is prepared, and the identifier of that data (of the file in which it is written) is written in the page together with an attribute indicating that it is "an identifier of data to be added to the output data for voice support".

In the flowchart of FIG. 6, after executing step S2 or S3, the activation control module 204 determines in step S4, on receiving the parameter values, whether there is an include file, that is, a file describing the data that needs to be added to the existing application. If it finds one, it treats the obtained value as the "identifier of data to be added". Then, in step S5, it reads the file based on that identifier and holds the contents as data to be added to the output of the existing application.

An example of how the identifier of the data to be added is written is shown in FIG. 5 (3). In this example, "include" in the INPUT tag is an attribute indicating that the value "voice.html" written after "value=" is to be handled as an identifier for reading in a file, so as to provide voice support. The content of "voice.html" describes a button or link that calls the activation control module 204 and triggers voice input, and it can also contain the values described in 《Use of attributes》 and the values for voice output and for voice and screen output described later. The "include" designation is required for as long as applications that are not voice-enabled continue to be called. When the activation control module 204 calls an existing application as described in 《Use of attributes》, it reads the application's output data (normally HTML data) from the stream connected via the URL, combines the data of "voice.html" with it, and outputs the result (step S16). As for how the data is combined, if, for example, a voice operation button is to be displayed at the end of the screen data, the data is embedded immediately before "</BODY>" in the HTML text. If the existing application is instead modified in advance, data equivalent to the content of "voice.html" can be written into it, and the include attribute then need not be specified.
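The embedding step described for "voice.html" can be illustrated as follows. This is a minimal sketch, assuming both the application output and the voice fragment are already available as strings; the function name `merge_voice_fragment` and the fallback of appending when no closing BODY tag exists are assumptions of the sketch, not part of the specification.

```python
import re


def merge_voice_fragment(app_html, voice_fragment):
    """Insert voice_fragment just before the closing BODY tag of app_html.

    If no closing BODY tag is found, the fragment is appended at the end
    (a fallback chosen for this sketch only).
    """
    match = re.search(r"</body\s*>", app_html, flags=re.IGNORECASE)
    if match is None:
        return app_html + voice_fragment
    cut = match.start()
    return app_html[:cut] + voice_fragment + app_html[cut:]


page = "<HTML><BODY><P>Stock</P></BODY></HTML>"
button = "<A href='#'>Speak</A>"
print(merge_voice_fragment(page, button))
# prints <HTML><BODY><P>Stock</P><A href='#'>Speak</A></BODY></HTML>
```

A regular expression is enough for this illustration; output from arbitrary existing applications might call for a proper HTML parser.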

《Designation of a voice application》
Next, the designation of a voice application will be described.

When the user requests input by voice recognition in the web application 205, the application creator may want to start not the default voice application but a voice application of the creator's choosing, for example one containing a voice recognition grammar that corresponds to the displayed web page. In that case, the creator writes the identifier of the voice application to be started in the web application 205. This identifier is written together with an attribute indicating that it is "an identifier designating the voice application to be started for input processing by voice recognition". An example is shown in FIG. 5 (3). "vxml" in the INPUT tag is an attribute indicating that the value written after "value=" is to be handled as the identifier of the voice application executed at the next input processing by voice recognition, and "query.vxml" is the identifier of the voice application to be started.

In the flowchart shown in FIG. 6, when an input request by voice recognition is made by a user operation (step S6), the activation control module 204 receives the identifier of the voice application to be executed next ("query.vxml") as a parameter from the web application 205. It attaches to the received identifier the attribute designating it as the voice application to be executed next ("vxml"), and sends the identifier to the activation module 201, which is waiting for data from the activation control module 204. In step S7, the activation module 201 determines whether a "vxml" attribute is present. If it finds one, it starts the corresponding voice application 202 ("query.vxml") in step S8 based on the received identifier. If no "vxml" attribute is found in step S7, the activation module 201 starts the default voice application in step S9. The processing after steps S8 and S9 is the same as the processing from step S10 onward described earlier for 《Basic processing》. If the determination in step S6 shows that the request is not a voice input request, the processing moves to step S12, described later.
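The decision made by the activation module in steps S7 to S9 can be sketched as below. The dictionary representation of the notification and the name `default.vxml` for the default voice application are assumptions made for illustration; the specification does not name the default application.

```python
DEFAULT_VOICE_APP = "default.vxml"  # hypothetical name of the default application


def select_voice_app(notification):
    """Return the voice application the activation module should start.

    notification: dict of attribute -> identifier as forwarded by the
    activation control module.
    """
    identifier = notification.get("vxml")  # step S7: is "vxml" present?
    if identifier:
        return identifier                  # step S8: application named by the page
    return DEFAULT_VOICE_APP               # step S9: fall back to the default


print(select_voice_app({"vxml": "query.vxml"}))  # prints query.vxml
print(select_voice_app({}))                      # prints default.vxml
```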

《Simultaneous output of voice and screen from voice input》
Next, the processing for outputting voice and screen simultaneously based on voice input will be described with reference to FIG. 7.

There are two methods of outputting voice and screen simultaneously. The first is explicit designation, in which the identifier of the voice data to be output is written in the web application 205. The identifier/attribute notification 209 sent from the web application 205 to the activation control module 204 then contains the identifier of the voice output data 214. On receiving the voice recognition completion status notification 211, the activation control module 204 sends the control module 203 a voice output identifier notification 216 together with a voice synthesis start notification. The control module 203 determines the voice output data 214 based on the voice output identifier and performs the voice output 217. The voice output data 214 is data that can be read by file access, database access, or web access, and it may reside on the same local host as the voice response web system or on another host reachable over the network.

The voice output data 214 is, for example, a message file to be processed by the voice synthesis engine of the voice server 5, or an audio file for voice output. In that case, the control module 203 reads in this data and executes it as a voice application. Alternatively, the voice output data 214 may from the outset be in the form of an executable voice output application. The web application 206 may also dynamically generate data equivalent to the voice output data 214, or the control module 203 or the activation control module 204 may create the voice output data 214 by dynamically extracting data from the voice recognition web display 213. The voice output data 214 may also be determined by combining the voice output identifier with the value of the voice recognition result. The voice recognition completion status notification and voice recognition result notification 211 and the voice synthesis start notification and voice output identifier notification 216 are exchanged as messages at almost the same time. Although FIG. 7 may give the impression that the voice output 217 is executed later, the voice recognition web display 213 and the voice output 217 are in fact executed simultaneously.

The other method of outputting voice and screen simultaneously uses a common identifier that associates the voice output data 214 with the web application 205 started from the voice recognition result. The application creator creates the voice output data 214 so that it has the same identifier as the web application 205 started from the voice recognition result. The difference from 《Basic processing》 above is that, as shown in the flowchart of FIG. 6, when the control module 203 sends the voice recognition result to the activation control module 204, it checks in step S17 whether voice output data 214, or a voice output designation, having the same identifier as the web application 205 exists. If so, it reads that data in step S18 and performs the voice output. As a result, the screen output produced when the activation control module 204 starts the web application (step S19) and the voice output produced by the control module 203 take place simultaneously. In the flowchart shown in FIG. 6, the processing moves to step S17 after step S16 has been executed. If no such voice output data 214 exists in step S17, the processing moves directly to step S19.
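The lookup in steps S17 and S18 can be sketched as follows. The specification does not say how the common identifier is derived, so this sketch assumes, purely for illustration, that it is the web application's file name without its extension and that the available voice output data is indexed in a dictionary; both the convention and the function name `find_voice_output` are hypothetical.

```python
def find_voice_output(web_app_id, voice_data_index):
    """Look up voice output data that shares the web application's identifier.

    The shared identifier is taken here to be the file name without its
    extension (an assumption of this sketch).
    Returns the matching entry, or None when only the screen is output.
    """
    file_name = web_app_id.rsplit("/", 1)[-1]  # drop any leading path or URL
    key = file_name.rsplit(".", 1)[0]          # drop the extension, if any
    return voice_data_index.get(key)           # step S17: does a match exist?


index = {"page1": "page1_message.txt"}
print(find_voice_output("http://aaaaa.co.jp/app1/page1.html", index))
# prints page1_message.txt
```

A result of `None` corresponds to moving directly to the screen output of step S19.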

《Voice and screen output from simultaneous input of voice and web data》
Next, the processing performed when voice and web data are input simultaneously and voice and screen are output on that basis will be described.

In the processing of FIG. 7 for simultaneous voice and screen output from voice input, when the user operates the web page displayed on the screen of the terminal device 2, web data entered by the user can be included in the identifier/attribute notification 209 sent from the web application 205 to the activation control module 204. Examples of such data are the position on a map that the user pointed to, a button or click point that the user selected, and text that the user typed. Including this data makes simultaneous voice and screen output possible from simultaneous input of voice and web data. If the voice output processing is omitted, as in 《Basic processing》 above, screen output alone is obtained from the simultaneous input of voice and web data. For processing that uses both the web data and the value of the voice recognition result, the combination of identifier values described in 《Use of attributes》 can be used.

《Voice-only output from web data input》
Next, the processing for outputting voice only from web data input is described. Each of the examples above produced simultaneous voice and screen output through voice recognition, but there are also cases where voice output or screen output should be triggered by an operation on the displayed web screen, without any voice input. FIG. 8 shows the processing for controlling voice output only from web data input. Here, instead of a voice recognition request web access, a voice response request web access 218, which involves no voice recognition, is performed. As in 《Designation of voice application》 above, the web application 205 contains the identifier of the voice application 207 to be executed. FIG. 5 (4) shows an example of an identifier description that causes voice output when the user clicks a link.

When a voice output request operation occurs in the web application 205, the activation control module 204 receives the identifier of the voice application 207 ("explain.vxml" in this example) from the web application 205 as a parameter. The parameter carries an attribute ("vout" in this example) indicating that it is an "identifier designating an application for voice output". As shown in the flowchart of FIG. 6, the activation control module 204 determines in step S14 whether the "vout" attribute is present. If it is, the module attaches the "vxml" attribute shown in the 《Designation of voice application》 example and sends the identifier "explain.vxml" of the voice application 207 to the activation module 201. As in 《Designation of voice application》, the activation module 201 launches the voice application 207 in step S15. Unlike the voice application 202 described in 《Basic processing》, which requests voice input, the voice application 207 performs voice output only and is written so that it returns control to the activation module 201 after it finishes. In addition, when the activation control module 204 finds the "vout" attribute, it does not acquire a voice recognition result. It either leaves the current screen data unchanged or, for terminals that require a screen update after an operation on the web screen, redisplays the same web application 205 on which the operation was performed.
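
The handling of the "vout" attribute can be modeled roughly as below. The function `handle_vout`, its return shape, the flag `terminal_needs_refresh`, and the page name `index.html` used in the example are assumptions for illustration; only the attribute names "vout" and "vxml" and the identifier "explain.vxml" come from the description.

```python
from typing import Optional, Tuple


def handle_vout(identifier: str, current_app: str,
                terminal_needs_refresh: bool) -> Tuple[Tuple[str, str], Optional[str]]:
    """Model what the activation control module does once it finds "vout"."""
    # S14 -> S15: forward the identifier with the "vxml" attribute attached.
    to_activation_module = ("vxml", identifier)
    # No recognition result is fetched. The screen is left alone unless the
    # terminal needs a refresh, in which case the same web application is shown.
    screen = current_app if terminal_needs_refresh else None
    return to_activation_module, screen
```

Calling `handle_vout("explain.vxml", "index.html", False)` yields the message for the activation module and `None` for the screen, meaning the displayed page stays as it is.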

《Simultaneous voice and screen output from web data input》
Next, the processing for simultaneous voice and screen output from web data input is described with reference to FIG. 9.

To produce screen output together with voice output from web data input, the following processing is executed. When a user operation occurs in the web application 205, the web application 205 sends to the activation control module 204, as a parameter, the same data that the voice application 202 would send as a voice recognition result. FIG. 5 (5) shows an example of an identifier description on a web page that causes simultaneous voice and screen output when the user clicks a link. In the figure, "kinou.html" is a value equivalent to a voice recognition result, and "vdout" is an attribute indicating that the acquired value is "a value equivalent to a voice recognition result".

As shown in the flowchart of FIG. 6, when the activation control module 204 acquires a value such as "kinou.html", it determines in step S12 whether the "vdout" attribute is present. If it finds the "vdout" attribute, it sends the value to the activation module 201, which is waiting for data from the activation control module 204. When the "vdout" attribute has been found, the activation module 201 does not launch the voice application 202 in step S13; instead it sends "kinou.html" directly to the control module 203 as the recognition result value (arrow B in FIG. 9). On receiving the value, the control module 203 performs the same processing as in 《Simultaneous voice and screen output from voice input》 above, that is, it proceeds to step S11, and as a result both voice output and screen output take place. Consequently, the voice recognition completion state notification / voice recognition result notification 211 also carries the same data as if voice recognition had been performed, even though no voice recognition actually took place. If the "vdout" attribute is not found in step S12, processing proceeds to step S14 described earlier. If the "vout" attribute is not found in step S14 either, the request matches none of the cases described so far ("simultaneous voice and screen output based on voice input", "voice-only output from web data input", and "simultaneous voice and screen output from web data input"), so error processing or other processing is executed in step S20, as shown in FIG. 6.
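
The branch through steps S12, S14, and S20 can be summarized in a short sketch. It covers only the two attributes discussed in this part of the description; whatever FIG. 6 handles before step S12 is outside the sketch, and the function name `route_web_parameter` and the destination labels are assumptions chosen for illustration.

```python
from typing import Tuple


def route_web_parameter(attribute: str, value: str) -> Tuple[str, str]:
    """Route one web parameter through steps S12, S14 and S20 of FIG. 6."""
    if attribute == "vdout":
        # S12 -> S13: skip the voice application and hand the value to the
        # control module as if it were a recognition result.
        return ("control_module", value)
    if attribute == "vout":
        # S14 -> S15: launch a voice application that only outputs voice.
        return ("activation_module", value)
    # S20: none of the described cases applies.
    return ("error", value)
```

Checking "vdout" before "vout" mirrors the order of the flowchart, where step S14 is reached only when step S12 finds no "vdout" attribute.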

With the above configuration, an application author who wants simultaneous screen and voice output in response to a user operation on a web screen only needs to write into the web application 205 the value that the voice application 202 would send as a voice recognition result, or program the web application 205 to send that value as a parameter.

《Context processing》
According to the present invention, processing that takes the context and the history of the dialogue with the user into account is also possible. Such processing is referred to here as context processing and is described below.

Information obtained from the execution result of the voice application 202 is classified as "information that determines a topic", "information that belongs to a topic", or "a command that controls the dialogue". Which class a voice recognition result falls into is determined from an identifier included in the voice input result, and "information that determines a topic" is then held in the activation control module 204. A topic here is, for example, the name of a product, a person, or some other individual item. Information about a topic is attribute information about that item, such as its nature, performance, price, or features.

With this configuration, the activation control module 204 holds the "information that determines a topic", and the held topic does not change as long as information belonging to that topic continues to be input. When information belonging to a topic is sent to the activation control module 204, the module combines it with the current topic and treats the result as an identifier for launching a web application. This identifier is used in the same way as described in 《Basic processing》 above. For example, if the currently held topic is "car" and the user, while talking about that topic, omits the name of the item and simply asks "What are its features?", the activation control module 204 uses the keywords "car" and "features" as the identifier and launches the web application 205. Besides launching the web application, voice can be output at the same time, as shown in 《Simultaneous voice and screen output from voice input》. A context-dependent dialogue also requires switching to the corresponding grammar, which can be done with the technique shown in 《Designation of voice application》.
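
The "car" and "features" example can be sketched as a small state holder. The class `TopicContext`, the method `identifier_for`, and the labels "topic" and "attribute" for the classification are assumptions for illustration; the description only says that the classification comes from an identifier in the recognition result and that the held topic is combined with the incoming attribute.

```python
from typing import Optional, Tuple


class TopicContext:
    """Holds the current topic, as the activation control module 204 does."""

    def __init__(self) -> None:
        self.topic: Optional[str] = None

    def identifier_for(self, kind: str, value: str) -> Tuple[str, ...]:
        """Return the keywords used as the web application launch identifier."""
        if kind == "topic":
            self.topic = value      # a new topic replaces the held one
            return (value,)
        if self.topic is None:
            raise ValueError("no topic has been set yet")
        return (self.topic, value)  # attribute combined with the held topic
```

After the topic "car" has been set, a later utterance classified as an attribute, such as "features" or "price", is resolved against "car" without the user naming it again.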

Furthermore, to control the dialogue itself rather than converse about a particular topic, the activation control module 204 can keep a history of the topics that have been input, in addition to the current topic. Then, when the user utters a command that controls the dialogue and a voice recognition result such as "go back to the previous topic" or "return to the subject we were just discussing" arrives, the module can look up the corresponding topic information in the history and launch the web application. More complex context processing can also be achieved by combining the values in a history that holds multiple topics with a newly input voice recognition result.
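
A topic history with a "go back" command might look like the following. This is a minimal sketch under the assumption that the history behaves like a stack; the class `TopicHistory` and its method names are hypothetical, and the patent does not specify the data structure.

```python
from typing import List, Optional


class TopicHistory:
    """Keeps the topics entered so far, most recent last."""

    def __init__(self) -> None:
        self._topics: List[str] = []

    def enter(self, topic: str) -> None:
        """Record a newly determined topic."""
        self._topics.append(topic)

    def current(self) -> Optional[str]:
        """Return the topic currently held, or None if there is none."""
        return self._topics[-1] if self._topics else None

    def back(self) -> Optional[str]:
        """Handle a "go back to the previous topic" command."""
        if len(self._topics) > 1:
            self._topics.pop()  # drop the current topic, exposing the previous one
        return self.current()
```

The topic returned by `back()` would then be used as the identifier for launching the corresponding web application, in the same way as a freshly recognized topic.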

《Distributed execution》
As is clear from the above description, in the voice response web system of this embodiment, the web applications and voice applications written by application authors can be separated from the modules, such as the activation control module, written by the developer of the voice response web system. Moreover, if references between these applications and the modules use absolute URLs, a web application or voice application can be executed even when it resides on an arbitrary host on the Internet. FIG. 10 shows a configuration in which the web application and the voice application are stored on servers independent of the main body of the voice response web system. Here, an application server 6 that holds and executes the web application and an application server 7 that holds the voice application are each connected to the network 3. The application servers 6 and 7 are provided with application execution units 16 and 17 for executing the web application and the voice application, respectively. If the network 3 is the Internet, resources on each of the servers 4 to 7 can be identified by URL, so the web server 4 and the voice server 5 can launch and execute the applications held on the application servers 6 and 7.
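
The effect of absolute URL references can be illustrated with Python's standard `urllib.parse.urljoin`. The host names and the base directory below are placeholders invented for the example, and `resolve_application` is a hypothetical helper, not part of the described system.

```python
from urllib.parse import urljoin

# Hypothetical base URL of the voice server's own application directory.
VOICE_SERVER_BASE = "http://voice.example.com/apps/"


def resolve_application(identifier: str, base: str = VOICE_SERVER_BASE) -> str:
    """Return the absolute URL of an application identifier.

    A relative identifier is resolved against the local base; an identifier
    that is already an absolute URL is returned unchanged, so it can point
    at a separate application server.
    """
    return urljoin(base, identifier)
```

With this helper, `resolve_application("explain.vxml")` stays on the voice server, while an absolute identifier on another host is passed through untouched, which is what allows applications to live on separate application servers.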

The preferred embodiment of the present invention has been described above, but the present invention is not limited to it. The terminal device 2 has been described as having both a voice input/output function and a web display function, but a single terminal device need not provide both functions for the present invention to apply. In the present invention, launch control of voice applications and web applications is achieved through cooperation between servers, so no special program or plug-in needs to be installed on the terminal device. It is therefore possible to combine any terminal that has a web display function with any terminal that has a voice input function. For example, the terminal with a voice input function can be a fixed-line telephone, a mobile phone (no web display function required), an IP (Internet Protocol) telephone, or a softphone, and the terminal with a web display function can be a personal computer, a PDA, a mobile phone with a web display function, a fixed-line telephone with a web display function, a display device such as a PDP (plasma display panel) or LCD (liquid crystal display) with a web display function, or a kiosk terminal with a web display function. The invention can also be applied to a mobile phone that can access a web server and can exchange voice input/output data or extracted voice feature data over IP.

Also, in the present invention, a voice recognition request from the terminal device does not necessarily have to be conveyed to the voice response web system through web access.

The voice response web system of the present invention described above can typically be implemented as a software program executed on a server computer. Such a program is generally loaded into the computer from a recording medium such as magnetic tape or a CD-ROM, or over a network such as the Internet. By executing the loaded program, the computer functions as the voice response web system described above.

FIG. 1 is a block diagram showing the configuration of a voice response web system according to an embodiment of the present invention.
FIG. 2 is a diagram showing the state notification processing performed between the voice server and the web server, and its timing, when a voice recognition start operation is performed after voice output has finished.
FIG. 3 is a diagram showing the state notification processing performed between the voice server and the web server, and its timing, when a voice recognition start operation is performed during voice output.
FIG. 4 is a diagram showing the processing for controlling the launch of voice applications and web applications using identifiers and attributes, and its timing.
FIG. 5 is a diagram showing description examples of the attributes and identifiers used to control application launch.
FIG. 6 is a flowchart showing the detailed flow of processing in the voice response web system.
FIG. 7 is a diagram showing the processing for simultaneous voice and screen output from voice input, and its timing.
FIG. 8 is a diagram showing the processing for voice-only output from web data input, and its timing.
FIG. 9 is a diagram showing the processing for simultaneous voice and screen output from web data input, and its timing.
FIG. 10 is a block diagram showing the configuration of the voice response web system when the voice application and the web application are placed on separate servers.

Explanation of reference numerals

1 Voice response web system
2 Terminal device
3 Network
4 Web server
5 Voice server
6, 7 Application server
11, 13, 16, 17 Application execution unit
12, 15 Control unit
14 Activation unit

Claims

A voice response web system accessed from a terminal device, comprising: a web server; a voice server; a first control unit for notifying a program running on the voice server of the state of a program running on the web server; and a second control unit for notifying the web server of the state of a program running on the voice server.

A voice response web system accessed from a terminal device, comprising:
a web server that manages execution of a web application; a voice server that manages execution of a voice application;
a first control unit for notifying the voice server of the state of a program running on the web server; and a second control unit for notifying the web server of the state of a program running on the voice server.

The voice response web system according to claim 1 or 2, wherein the content of processing in the voice server is determined based on the state notified by the first control unit, and the content of processing in the web server is determined based on the state notified by the second control unit.

The voice response web system according to claim 2, further comprising an activation unit that activates the voice application based on a notification from the first control unit.

The voice response web system according to any one of claims 2 to 4, wherein the voice application and the web application are associated with each other based on a value obtained as a result of voice recognition and a value obtained through web data input.

The voice response web system according to any one of claims 2 to 4, wherein the first control unit activates a corresponding web application by combining a value obtained from a voice recognition result and a value obtained from web data input.

The voice response web system according to claim 4, wherein, when an operation requesting input by voice recognition or requesting voice output is performed on the terminal device, the activation unit activates a voice application based on identification information obtained by web data input.

The voice response web system according to any one of claims 2 to 4, wherein, when an operation requesting simultaneous output of voice and screen is performed on the terminal device, or when voice input is performed, output of screen data from the web application and output of voice data from the voice application are performed simultaneously based on identification information obtained by web data input.

The voice response web system according to claim 4, wherein the first control unit activates, via a network, a web application held by a first application server different from the web server, and the activation unit activates a voice application held by a second application server different from the voice server.

An input/output control method in a voice response web system that is accessed from a terminal device and has a web server that manages execution of a web application and a voice server that manages execution of a voice application, the method comprising:
setting, for a value obtained from web data input, identification information indicating that voice support is to be performed, and specifying a file for voice operation;
dynamically referring to the screen output data output by calling the web application and, based on the identification information, embedding data read from the file; and
outputting the screen output data in which the data is embedded as the final screen output.

An input/output control method in a voice response web system that is accessed from a terminal device and has a web server that manages execution of a web application and a voice server that manages execution of a voice application, the method comprising:
performing context processing that takes into account topics and dialogue history during a dialogue with the user; and
activating the web application and outputting voice in accordance with the result of the context processing.