JP3725566B2 - Speech recognition interface

Info

Publication number: JP3725566B2
Application number: JP35314293A
Authority: JP
Inventors: 秀樹橋本; 仁史永田; 重宣瀬戸; 洋一竹林; 浩司山口; 秀昭新地
Original assignee: Toshiba Corp; Toshiba Digital Media Engineering Corp
Current assignee: Toshiba Corp; Toshiba Development and Engineering Corp
Priority date: 1992-12-28
Filing date: 1993-12-28
Publication date: 2005-12-14
Anticipated expiration: 2020-12-14
Also published as: JPH07140998A

Abstract

PURPOSE: To provide an easy-to-use speech recognition interface that can handle multiple application programs simultaneously from a single speech recognition system. CONSTITUTION: A speech recognition system 1 is connected to multiple application programs 2 and manages information about the programs 2 in an application program management table 13. Based on the information in the table 13, a message processing section 11 determines, for the multiple application programs, the recognition vocabulary for a speech input, the destinations of the recognition results produced by a speech recognition section 12, and the speech focus that designates the target of speech input.

Description

【０００１】
【Field of Industrial Application】
The present invention relates to a speech recognition interface for use in personal computers, workstations, and the like.
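【０００２】
【Prior Art】
In recent years, computers equipped with multiple input means, such as a keyboard, mouse, speech, and images, have been proposed so that a variety of instructions and data can be entered.
【０００３】
Of these, speech is a natural and promising input means for humans, but problems such as the computational cost of speech processing and the recognition rate have kept it from coming into wide use.
【０００４】
The following configurations of an application program and a speech recognition system have conventionally been considered for speech recognition interfaces.
【０００５】
Fig. 122 shows a configuration in which a speech recognition system SRS is built into an application program AP. Because the speech recognition function cannot be separated from the application program AP, it is difficult for other application programs to use it.
【０００６】
Fig. 123 shows a configuration consisting of one speech recognition system SRS and one application program AP connected to each other. Here the SRS is monopolized by the connected AP, so using the same SRS from another application program requires changing the connection to that program, and reconnecting takes time and effort. In addition, the only data exchanged between the SRS and the AP is the recognition result sent from the SRS to the AP, so the SRS cannot know the internal state of the AP. The recognition vocabulary therefore cannot be changed automatically according to the internal state of the AP; the user has to change the vocabulary, which makes the system inconvenient to use.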
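【０００７】
Fig. 124 shows a configuration consisting of one speech recognition system SRS and one application program AP that are connected bidirectionally and exchange information such as the recognition vocabulary and recognition results. Here the SRS can know the internal state and recognition vocabulary of the AP, so the recognition vocabulary can be changed automatically. However, the SRS is monopolized by the AP, so other application programs cannot use the SRS at the same time.
【０００８】
Fig. 125 shows the configuration of the system described in [Schmandt et al., "Augmenting a window system with speech input", COMPUTER, Vol. 23, pp. 50-58, 1990], in which one speech recognition system SRS sends recognition results one way to multiple application programs AP. This system uses a window system and enters speech by translating recognition results into mouse and keyboard input. With this configuration multiple application programs AP can use the speech recognition function simultaneously, but the SRS cannot know the internal state of the APs and therefore cannot perform recognition processing that depends on that state.
【０００９】
Fig. 126 shows the configuration of the system described in [Rudnicky et al., "Spoken language recognition in an office management domain", Proc. ICASSP '91, S12.12, pp. 829-832, 1991], which consists of one speech recognition system SRS and multiple application programs AP that exchange information with each other to perform speech recognition. A feature of this system is that multiple application programs can share continuous speech recognition, which is a useful way of exploiting expensive speech recognition equipment, but real-time processing and use on a workstation were not examined sufficiently. In this configuration multiple programs can use the speech recognition function, and the SRS can process speech according to the internal state of an AP. However, the SRS can be connected to only one AP at a time, so it cannot exploit the property of speech that several application programs can be handled simultaneously. Furthermore, because the SRS decides which AP receives a recognition result, an AP may fail to obtain a result even when it needs one.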
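【００１０】
【Problems to Be Solved by the Invention】
With conventional speech recognition interfaces, therefore, the application program AP cannot manage the target of speech recognition, so speech input cannot be controlled on the initiative of the AP. Even when the AP wants to prompt the user to speak, it has to wait until it receives a speech input permission command from the speech recognition system SRS. In addition, a single utterance cannot control multiple application programs AP simultaneously; for example, a single spoken "Exit" cannot terminate several application programs. Because speech input cannot be distributed among multiple application programs according to the recognition result, the input target has to be specified before speaking. Finally, only one speech recognition system operates on a given speech input, so different recognition methods, such as isolated word recognition and continuous speech recognition, cannot coexist and be used at the same time.
【００１１】
The present invention was made in view of the above circumstances, and its object is to provide an easy-to-use speech recognition interface that can handle multiple application programs simultaneously from a speech recognition system.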
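【００１２】
【Means for Solving the Problems】
The present invention is a speech recognition interface in which multiple application programs are connected to a speech recognition system, wherein the speech recognition system comprises: speech recognition means for recognizing speech; application program management means for managing, for each of the application programs, at least first information indicating whether that application program is a target of speech input and second information indicating one or more recognition vocabulary items to be recognized for that application program; and message processing means for determining the recognition vocabulary for a speech input on the basis of the second information managed for the one or more application programs whose first information indicates that they are targets of speech input and, when any item of the determined vocabulary is recognized by the speech recognition means, identifying as destinations of the recognized item the one or more application programs whose first information indicates that they are targets of speech input and whose second information includes the recognized item. The speech recognition system also manages third information indicating vocabulary items that each correspond uniquely to an individual application program and are always to be recognized regardless of which application program is the target of speech input. When any item in the third information is recognized by the speech recognition means, the system sets the first information of the application program uniquely corresponding to the recognized item to the state indicating that the application program is a target of speech input.
Alternatively, in a speech recognition interface in which multiple application programs are connected to a speech recognition system comprising the same speech recognition means, application program management means, and message processing means as described above, each application program, when it becomes the target of keyboard input, requests the speech recognition system to make the application program itself a target of speech input, and the speech recognition system, on receiving the request, sets the first information of that application program to the state indicating that it is a target of speech input.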
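Preferably, when a predetermined event occurs, the speech recognition system may, according to the content of the event and predetermined rules, change the first information of a given application program to the state indicating that it is a target of speech input and change the first information of another given application program to the state indicating that it is not.
Preferably, the speech recognition system may notify each application program that has requested notification of information from which it can at least determine whether it is itself currently a target of speech input. Preferably, the speech recognition system may display on the screen the windows of application programs whose first information indicates that they are targets of speech input in a display form different from that of the windows of other application programs whose first information indicates that they are not.
Preferably, for an application program whose first information indicates that it is a target of speech input, the speech recognition system may display on the screen the one or more recognition vocabulary items indicated by that application program's second information.
Preferably, the speech recognition system may display on the screen the recognized vocabulary item that was sent to the application program identified as its destination.
Preferably, the second information may be supplied to the speech recognition system by each application program.
Preferably, the speech recognition system may manage the second information separately for each of several regions into which an application program's window is divided, and use, as the second information of that application program, the second information managed for the region in which the mouse pointer is currently located.
Preferably, for at least some of the application programs, the speech recognition system may manage the first information and the second information for each of the one or more windows belonging to the individual application program. For such application programs, the system determines the recognition vocabulary for a speech input on the basis of the second information managed for the one or more windows whose first information indicates that they are targets of speech input and, when any item of the determined vocabulary is recognized by the speech recognition means, identifies as destinations of the recognized item the one or more windows whose first information indicates that they are targets of speech input and whose second information includes the recognized item.
Preferably, for an application program whose first and second information are managed per window, the speech recognition system may use, in a window whose first information indicates that it is a target of speech input, not only the second information managed for that window but also those vocabulary items in the second information managed for the other windows of the same application program that are designated for use in the other windows of the application program as well.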
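【００１３】
【Operation】
As a result, according to the present invention, each application program can decide whether it receives recognition results from the speech recognition system. An application program can therefore freely control speech input for itself and for other application programs, which makes it possible to build a flexible, easy-to-use speech recognition interface.
【００１４】
Because the speech recognition system can send a recognition result to multiple application programs simultaneously, a single speech input can operate on several application programs at once, which also improves the operability of a computer by speech.
【００１５】
Furthermore, because the speech recognition system performs recognition for multiple application programs, speech input can be distributed to the appropriate application program on the basis of the recognition result without the input target being designated explicitly, which reduces the burden on the user.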
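【００１６】
【Embodiments】
Embodiments of the present invention are described below with reference to the drawings.
【００１７】
(First Embodiment)
Fig. 1 shows the schematic configuration of this embodiment. In the figure, reference numeral 1 denotes a speech recognition system, which consists of a message processing unit 11, a speech recognition unit 12, and an application program management table 13. Multiple application programs 2 are connected to the message processing unit 11.
【００１８】
The speech recognition system 1 performs speech recognition according to instructions contained in messages from the application programs 2 and sends the recognition results to them as messages. Each application program 2 uses the recognition results for its own application-specific processing. The speech recognition system 1 can exchange messages with, and send recognition results to, multiple application programs 2 at the same time.
【００１９】
The message processing unit 11 relays messages between the application programs 2 and the speech recognition unit 12 and controls the speech recognition system 1 as a whole. The speech recognition unit 12 exchanges messages with the message processing unit 11, recognizes input speech according to the information it receives from the message processing unit 11, and reports the result to the message processing unit 11.
【００２０】
The application program management table 13 holds information on every application program 2 that communicates with the speech recognition system 1. It is used to determine the recognition vocabulary when speech is input and to determine where recognition results are sent, and this is what allows the speech recognition system 1 to exchange messages with multiple application programs 2 simultaneously. Each entry in the table has a program ID, an input mask, a recognition vocabulary list, and a speech input flag. The program ID is an identification number that the speech recognition system 1 assigns uniquely to each application program 2. The input mask restricts the kinds of messages that the speech recognition system 1 sends to the application program 2. The recognition vocabulary list is a table of the vocabulary that the application program 2 has asked the speech recognition system 1 to recognize, and it is used to determine the recognition vocabulary when speech is input. The speech input flag indicates whether the application program 2 has the speech focus. Here, saying that an application program 2 has the speech focus means that it is a target of speech input; in other words, the speech focus identifies the destinations of recognition results.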
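The four fields of the application program management table described above can be pictured as a small record type. The following is a minimal sketch, not the patent's implementation: the class names `AppEntry` and `AppTable`, the field names, and the use of a Python set for the input mask are all assumptions made for illustration. Assigning program IDs from 0 in start-up order follows the example given later in the description.

```python
from dataclasses import dataclass, field


@dataclass
class AppEntry:
    """One row of the management table (hypothetical field names)."""
    program_id: int                                   # unique ID assigned by the system
    name: str
    input_mask: set = field(default_factory=set)      # kinds of messages the app wants
    vocabulary: list = field(default_factory=list)    # recognition vocabulary list
    speech_input_flag: bool = False                   # True while the app has the speech focus


class AppTable:
    """Holds one entry per connected application program."""

    def __init__(self):
        self._entries = {}
        self._next_id = 0

    def register(self, name):
        # IDs are handed out from 0 in the order the programs connect.
        entry = AppEntry(self._next_id, name)
        self._entries[entry.program_id] = entry
        self._next_id += 1
        return entry

    def focused(self):
        # Entries that are currently targets of speech input.
        return [e for e in self._entries.values() if e.speech_input_flag]


table = AppTable()
shell = table.register("shell tool")
editor = table.register("text editor")
shell.speech_input_flag = True
```

Keeping the focus as a per-entry flag rather than a single "current program" variable is what lets more than one application program be a target of speech input at once.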
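【００２１】
Fig. 2 shows the schematic configuration of the speech recognition unit 12.
【００２２】
The speech recognition unit 12 consists of a speech detection unit 121, a speech analysis unit 122, a recognition dictionary matching unit 123, and a speech recognition dictionary 124.
【００２３】
The speech detection unit 121 detects speech by a known method, for example one based on the power of the input speech in each fixed time interval (Nagata et al., "Development of a speech recognition function for workstations", IEICE Technical Report, HC9119, pp. 63-70 (1991)). The speech analysis unit 122 performs frequency analysis, for example with an FFT or band-pass filters, on the speech segment detected by the speech detection unit 121 and extracts the feature parameters of the spoken word. The recognition dictionary matching unit 123 uses the parameters output by the speech analysis unit 122 to match against the recognition dictionary 124 by a technique such as the multiple similarity method (see the report cited above), HMM, or DP matching, and outputs the word with the highest score as the recognition result.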
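As a rough illustration of detection based on the power of the input in fixed time intervals, the sketch below marks a run of frames as speech when the mean frame power stays at or above a threshold. It is not the method of the cited report, whose details are not given here; the frame length, threshold, and minimum duration are assumed values chosen only for the example.

```python
def detect_speech_segments(samples, frame_len=160, threshold=0.01, min_frames=3):
    """Return (start_frame, end_frame) pairs whose mean power is >= threshold.

    end_frame is exclusive. All parameter values are illustrative assumptions.
    """
    segments = []
    start = None
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        power = sum(x * x for x in frame) / frame_len
        if power >= threshold:
            if start is None:
                start = i                      # a loud run begins
        elif start is not None:
            if i - start >= min_frames:        # ignore very short bursts
                segments.append((start, i))
            start = None
    if start is not None and n_frames - start >= min_frames:
        segments.append((start, n_frames))     # run continues to the end of input
    return segments


silence = [0.0] * 800
tone = [0.5, -0.5] * 400                       # 800 samples with constant power 0.25
segments = detect_speech_segments(silence + tone + silence)
```

A practical detector would also need hangover time and noise-adaptive thresholds, which are outside the scope of this sketch.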
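【００２４】
When the recognition dictionary matching unit 123 matches speech feature parameters against the recognition dictionary 124, it first asks the message processing unit 11 which words in the dictionary it should match against at that moment, so as to avoid unnecessary processing, and then matches according to the reply. Whether recognition succeeds or fails, the result is sent to the message processing unit 11, which forwards it to the application programs 2 according to the contents of the application program management table 13.
【００２５】
In Fig. 2 all elements of the recognition unit are integrated and can run as a single process, but the speech detection unit 121 can also be separated, as shown in Fig. 3. If the speech detection unit 121 and the subsequent speech analysis unit 122 and recognition dictionary matching unit 123 run as separate processes, for example, and exchange data by interprocess communication, the speech detection unit 121 can be handled independently. As shown in Fig. 4, for instance, the outputs of several speech detection units 121 can then be handled by a common speech analysis unit 122 and recognition dictionary matching unit 123. As shown in Fig. 5, the speech detection unit 121 and speech analysis unit 122 can also be integrated, with the recognition dictionary matching unit 123 and recognition dictionary 124 separated.
【００２６】
Fig. 6 shows the schematic configuration of an application program 2.
【００２７】
The application program 2 consists of a message input/output unit 21 and a program body 22. The message input/output unit 21 handles all message exchange with the speech recognition system 1 and gives the author of the application program 2 a standard means of speech input. It also hides the complicated message transmission and reception protocol from application authors and provides all of them with a uniform communication procedure. The program body 22 carries out the application-dependent processing. It includes the commands issued to the speech recognition system 1 according to the application's own internal state and the procedures followed when a recognition result is received from the speech recognition system 1.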
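【００２８】
Next, the operation of the embodiment configured as above is described.
【００２９】
Information is exchanged between the speech recognition system 1 and the application programs 2 by messages. "Message" is used here as a general term for data passed from one component to another, such as commands, the results of executing those commands, and recognition results.
【００３０】
Message-based communication is implemented, for example, by making the speech recognition system 1 a server and the application programs 2 its clients and using a byte-stream protocol such as TCP, DECnet, or Stream between them. The messages exchanged among the components of the speech recognition interface are shown in Fig. 7, described next. All of these messages are handled by the message processing unit 11 of the speech recognition system. Although the embodiment above is described with the whole speech recognition system of Fig. 1 running as a single process, its components (the speech recognition unit, the message processing unit, and the application program management table) can also each run as a separate program.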
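【００３１】
[Messages between the speech recognition system 1 and the application programs 2]
Messages from an application program 2 to the speech recognition system 1 are of the kinds shown in Fig. 7(a). They are basically commands from the application program 2 to the speech recognition system 1.
【００３２】
A communication channel connect/disconnect request opens or releases the channel over which the application program 2 exchanges messages with the speech recognition system 1. A recognition dictionary load/release request asks the speech recognition system 1 to load or release a recognition dictionary containing the vocabulary that the application program 2 wants to use. A recognition vocabulary setting request tells the speech recognition system 1 which words of which recognition dictionary the application program 2 wants recognized. An input mask setting request sets the kinds of messages that the application program 2 wants to receive from the speech recognition system 1. An input task setting request moves the speech focus to a specified application program 2. A recognition start/end request asks the speech recognition system 1 to start or stop speech recognition.
【００３３】
Messages from the speech recognition system 1 to an application program 2 are of the kinds shown in Fig. 7(b) and fall into two classes. One class consists of responses to requests from the application program 2, such as commands and data queries; these correspond to the request messages above. The other class consists of messages generated by the speech recognition system itself, carrying recognition results or reporting changes in the system's internal state.
【００３４】
A recognition result message reports the result of recognition using the vocabulary that the application program 2 asked to have set. When recognition succeeds, the message contains at least one recognized word, together with information such as what the word is, which dictionary contains it, and the score obtained in recognition. When recognition fails (for example because the speech level was too high or too low), it contains information on the cause of the failure. An input task change notification is sent to application programs 2 when the speech focus has actually been changed, for example by an input task setting request, and contains the task ID before the change and the task ID after it. A recognition dictionary load/release notification is sent when a recognition dictionary has been newly loaded or released, for example by a recognition dictionary load/release request. A communication channel connect/disconnect notification is generated when an application program 2 issues a connect/disconnect request to the speech recognition system 1, and also when an application program 2 drops the channel unilaterally without a request. A recognition vocabulary change notification is generated when the recognition vocabulary of an application program has been changed by a recognition vocabulary setting request.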
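【００３５】
These messages can be sent from the speech recognition system 1 to all application programs 2 whenever speech input is accepted and recognized, the speech focus changes, an application program 2 connects to the speech recognition system 1, or the recognition vocabulary changes. An application program 2 does not, however, need to receive every message all the time. It chooses which messages to receive by notifying the speech recognition system 1 of the input masks corresponding to those messages (the input mask setting request). In this way an application program 2 has the speech recognition system 1 send it only the messages it needs.
【００３６】
Fig. 8 shows the kinds of input masks. They correspond to the kinds of messages that an application program 2 may want to receive, and several masks can be set at the same time.
【００３７】
Once the masks have been reported to the speech recognition system 1, the application program receives the corresponding messages each time they are generated inside the speech recognition system 1. For example, if the recognition result mask is set, the application program obtains a recognition result every time speech is input, and if the input task change mask is set, it is notified every time the speech focus changes.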
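Because several input masks can be set at once, they map naturally onto bit flags. The sketch below is an assumption-laden illustration: Fig. 8 is not reproduced in the text, so the mask names are inferred from the notification messages listed above, and `InputMask` and `recipients` are hypothetical names.

```python
from enum import Flag, auto


class InputMask(Flag):
    """Hypothetical mask kinds, one per notification message named in the text."""
    RECOGNITION_RESULT = auto()
    INPUT_TASK_CHANGE = auto()
    DICTIONARY_LOAD_RELEASE = auto()
    CONNECTION = auto()
    VOCABULARY_CHANGE = auto()


def recipients(masks, message_kind):
    """Names of the programs whose mask includes the given message kind."""
    return [app for app, mask in masks.items() if message_kind in mask]


masks = {
    "shell tool": InputMask.INPUT_TASK_CHANGE,
    "mail tool": InputMask.RECOGNITION_RESULT | InputMask.INPUT_TASK_CHANGE,
    "text editor": InputMask.RECOGNITION_RESULT,
}
```

Note that this filter covers only the mask; as described later, delivery of recognition results additionally depends on the speech input flag.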
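【００３８】
Besides the two kinds of messages above (request messages and response messages), error messages can be exchanged between the speech recognition system 1 and the application programs 2. An error message reports the failure of a one-way message from an application program 2 that needs no response when it succeeds, or reports that a critical condition has arisen in the recognition system. Various other messages are also conceivable, such as messages for accessing internal information of the speech recognition system 1 and messages for configuring the speech recognition system 1 or speech input/output, for example changing the speech input level.
【００３９】
Because an application program 2 can have changes in the internal state of the speech recognition system 1 reported to it as messages, it can control the speech recognition system 1 on that basis and can even control other application programs 2. This yields a flexible interface with a high degree of freedom that can be controlled by speech.
【００４０】
The speech recognition system 1 contains the message processing unit 11 and the speech recognition unit 12, and these also exchange information by messages. All messages between the speech recognition system 1 and the application programs 2 are handled by the message processing unit 11.
【００４１】
[Messages between the speech recognition unit 12 and the message processing unit 11]
Messages from the speech recognition unit 12 to the message processing unit 11 are of the kinds shown in Fig. 7(c). A recognition vocabulary query request is issued when speech is input to the speech recognition system, in order to determine which recognition vocabulary the input speech should be matched against. A recognition result message reports to the message processing unit 11 the result of matching the input speech against the vocabulary to be recognized at that moment.
【００４２】
Messages from the message processing unit 11 to the speech recognition unit 12 are of the kinds shown in Fig. 7(d). A recognition dictionary load/release request here is the request issued by an application program 2 to the speech recognition system 1, passed on unchanged to the speech recognition unit 12. Recognition vocabulary information is the response to a recognition vocabulary query request sent from the speech recognition unit 12 to the message processing unit 11.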
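【００４３】
Processing thus proceeds through the exchange of messages among the parts of the speech recognition system. How processing proceeds in the speech recognition interface as a whole is described next with reference to Fig. 9, a time chart covering the period from the start-up of an application program 2 until it receives its first recognition result.
【００４４】
The application program 2 first sends a connection request (a) to the speech recognition system 1. Once connected, it issues a request (b) to load the recognition dictionary containing its vocabulary and a setting request (c) that designates, as the recognition vocabulary, the words in the loaded dictionary that it wants to use for speech input. For (a), the message processing unit 11 connects the communication channel to the application program 2 and returns the result. For (b), it passes the message unchanged to the speech recognition unit 12, waits for the dictionary to be loaded, and returns the result of loading to the application program 2. For (c), it writes the specified recognition vocabulary into the application program management table 13 and returns the result. Once the recognition vocabulary has been set successfully, the application program 2 sends an input mask setting request (d) and an input task setting request (e). The message processing unit 11 receives (d) and (e) and writes each into the application program management table 13.
【００４５】
These are the initial setup requests from the application program 2 to the speech recognition system 1. When initial setup is complete, the application program waits for messages from the speech recognition system 1 and meanwhile carries out the processing specific to its own task. As its internal state changes, it can send the speech recognition system 1 any request its processing calls for, such as a request to change the recognition vocabulary or a request to move the input task to itself or to another application program 2, so that the speech recognition system 1 can be controlled from the application side.
【００４６】
Suppose now that speech is input for the application program 2. The speech recognition unit 12 first detects the speech segment and analyzes it. Once analysis is finished, the speech recognition unit 12 sends a recognition vocabulary query request (f) to the message processing unit 11 to learn which vocabulary is to be recognized at that moment. On receiving it, the message processing unit 11 consults the application program management table 13 to find the vocabulary for which recognition should be performed in this situation and returns it to the speech recognition unit 12 as recognition vocabulary information. The speech recognition unit 12 matches the recognition dictionary data for the specified vocabulary against the analyzed data and sends the result (g) to the message processing unit 11. The message processing unit 11 looks for the word with the highest likelihood in (g) among the recognition vocabulary lists in the application program management table 13. If the application program 2 that holds the word has its speech input flag set to 1 and has the recognition result notification mask set as an input mask, the recognition result is sent to that application program.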
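【００４７】
The processing described with Fig. 9 is now explained with a concrete example.
【００４８】
When the application programs 2 connected to the speech recognition system 1 are a shell tool and a text editor, the application program management table 13 is as shown in Fig. 10(a).
【００４９】
Consider what happens when a mail tool is newly started. The mail tool first sends a communication channel connection request (a); an area for the mail tool is allocated in the application program management table 13 and the mail tool is given a program ID. Program IDs are assigned, for example, from 0 in the order in which the application programs 2 are started. The mail tool next sends a recognition dictionary load request (b). Here the recognition dictionary is already loaded, and the speech recognition system 1 informs the application program 2 of that. The mail tool then sends "Top", "Last", "Previous", "Next", "Send", and "Exit" as its recognition vocabulary in a recognition vocabulary setting request (c) and sends the recognition result notification mask as its input mask (d). As its input task setting request (e), it asks that every speech focus currently set be cancelled and that the speech focus be given to the mail tool.
【００５０】
In this embodiment a single recognition dictionary is shared by all application programs 2. Fig. 10 therefore omits the information, needed when several dictionaries are used, that shows which dictionary contains each word.
【００５１】
As a result of this processing the application program management table 13 becomes as shown in Fig. 10(b). The speech focus that was on the shell tool moves to the newly started mail tool, and the mail tool becomes able to accept speech input.
【００５２】
Suppose that the word "Next" is now spoken. The speech recognition unit 12 performs speech segment detection and analysis on the input and obtains the speech feature parameters. To learn which dictionary data to match the parameters against, it sends a recognition vocabulary query request (f) to the message processing unit 11. On receiving the request, the message processing unit 11 consults the application program management table 13 to find the recognition vocabulary at that moment. Here the mail tool has its speech input flag set to 1 and the recognition result notification mask set as an input mask, so all the words in its recognition vocabulary list, "Top", "Last", "Previous", "Next", "Send", and "Exit", are the words that can be input at that moment. These six words are reported to the speech recognition unit 12, which matches the dictionary data for them against the analyzed feature parameters and sends the result to the message processing unit 11 (g).
On receiving the recognition result, the message processing unit 11 searches for the recognized word in the recognition vocabulary lists of those application programs 2 whose speech input flag is 1 and whose input mask includes the recognition result notification mask. If it finds the word, it sends the recognition result to the application program 2 that holds that vocabulary list.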
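The two table look-ups in this walkthrough, collecting the active vocabulary before matching and finding the destinations afterwards, can be sketched as follows. This is an illustrative sketch only: the dictionary keys, the string `"recognition_result"` standing in for the recognition result notification mask, and the English glosses of the vocabulary are assumptions, and the table contents merely mimic the state described for Fig. 10(b).

```python
RESULT_MASK = "recognition_result"  # stand-in for the recognition result notification mask


def is_listening(app):
    # An entry takes part in recognition only if it has the speech focus
    # and has asked to receive recognition results.
    return app["speech_flag"] and RESULT_MASK in app["mask"]


def active_vocabulary(table):
    """Words to match against for the current input (reply to the vocabulary query)."""
    vocab = []
    for app in table:
        if is_listening(app):
            for word in app["vocabulary"]:
                if word not in vocab:
                    vocab.append(word)
    return vocab


def destinations(table, word):
    """Programs that should receive the recognized word."""
    return [app["name"] for app in table
            if is_listening(app) and word in app["vocabulary"]]


table = [
    {"name": "shell tool", "speech_flag": False, "mask": {"input_task_change"},
     "vocabulary": ["history", "list", "home", "process", "exit"]},
    {"name": "text editor", "speech_flag": False, "mask": {RESULT_MASK},
     "vocabulary": ["cut", "copy", "paste", "undo", "exit"]},
    {"name": "mail tool", "speech_flag": True, "mask": {RESULT_MASK},
     "vocabulary": ["top", "last", "previous", "next", "send", "exit"]},
]
```

Restricting matching to the active vocabulary is what keeps the recognizer from spending effort on words that no focused program could accept.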
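【００５３】
If the recognition result for the speech input above is "Next", it is sent to the mail tool. On receiving the result "Next" through its message input/output unit 21, the application program 2 performs processing such as displaying the mail that follows the received mail currently shown.
【００５４】
In Fig. 10(a) and (b) the input task change mask is set as an input mask of the shell tool. With this mask set, the shell tool is notified every time the speech focus changes.
【００５５】
In the example above, when the speech recognition system 1 receives the input task setting request (e) from the mail tool and the message processing unit 11 changes the speech focus, an input task change notification message is sent to the shell tool. Input masks other than the recognition result notification mask do not depend on the value of the speech input flag. If the input task change mask is set, therefore, a speech focus change message is sent to the application program 2 every time the focus changes, whatever the value of its speech input flag. By learning of such changes in the internal state of the speech recognition system 1 through messages, an application program 2 can respond flexibly in many ways. The shell tool can, for example, tell the user that it has lost the speech focus through a screen display, synthesized speech, or a beep.
【００５６】
In this way an application program 2 can freely control the speech recognition system 1 through messages, which gives a flexible, application-driven speech recognition interface.
【００５７】
According to the first embodiment, therefore, in a multitasking environment in which multiple application programs 2 run in parallel, each application program 2 exchanges messages directly with the speech recognition system 1 and the two sides can exchange data such as recognition vocabulary and recognition results directly. Every application program 2 can thus be equipped with speech input as a standard input means, just like a keyboard or mouse. This makes full-scale use of speech input possible in multitasking environments such as workstations and can be expected to improve the usability of man-machine interfaces that include speech.
【００５８】
Although this embodiment describes a speech recognition interface based on isolated word recognition, connected word recognition or continuous speech recognition can also be used.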
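(Second Embodiment)
The second embodiment improves the user's working environment by using a window system together with speech recognition in a multitasking computer environment.
【００５９】
Fig. 11 shows the configuration when a window system is used as well. It consists of a speech recognition system 3 that handles speech input, a window system 4 that handles keyboard and mouse input, and one or more application programs 5 that exchange messages with both the speech recognition system 3 and the window system 4. In other words, this embodiment adds a window system to the first embodiment and gives the application programs a means of communicating with it.
【００６０】
The window system 4 and the speech recognition system 3 are independent of each other. The messages between the window system 4 and the application programs 5 concern the creation of windows in the multi-window environment and the processing of keyboard and mouse input.
【００６１】
Before this embodiment is described, window systems that provide multiple windows are explained briefly. A window system that provides multiple windows in a multitasking computer environment such as a workstation communicates with the application programs running in that environment and presents each of them in abstracted form on a display screen called a bitmap display. Basically, one window is assigned to each application program.
【００６２】
Fig. 12 shows an example screen of a typical window system. In this example three application programs, A, B, and C, are running in parallel. The window system manages input devices such as the keyboard and mouse and lets the application programs share them. The mouse is represented on the screen as an arrow-shaped mouse pointer and is used for operating windows, designating the input target, and so on.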
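【００６３】
The embodiments in this application are described throughout with a mouse as the pointing device, but other pointing devices such as a pen or touch panel can also be used, and everything stated in the embodiments applies to them in exactly the same way.
【００６４】
The target of keyboard input is determined by the keyboard focus, which is generally designated with the mouse pointer. The application program that has the keyboard focus is indicated by drawing its window frame thicker than the others or by changing the color of the title bar at the top of its window. Fig. 12 shows application program B with the keyboard focus. In general the keyboard focus is on exactly one window at any time.
【００６５】
The explanation again uses the three programs from the first embodiment: the shell tool, the text editor, and the mail tool. The window system presents each program as one window. Each program also communicates with the speech recognition system and, at start-up, sets its recognition vocabulary by the procedure shown in the first embodiment. The recognition vocabulary of each application program is shown in Fig. 13.
【００６６】
In existing window systems an application program can generally receive notification of changes in the keyboard focus. To make the target of speech input the same application program as the target of keyboard input, an application program asks the speech recognition system to give it the speech focus when it gains the keyboard focus and asks to have the speech focus removed when it loses the keyboard focus. It does this by sending the input task setting request described in the first embodiment. In what follows, the keyboard focus and the speech focus are treated as coinciding and are together called the input focus. The input focus is operated with the mouse.
【００６７】
Fig. 14 shows how the recognition vocabulary changes as the input focus moves; Fig. 14(a) shows state 1 and Fig. 14(b) shows state 2. In state 1 the input focus (and with it the speech focus) is on the text editor. The vocabulary that can be recognized in this state is therefore the text editor's recognition vocabulary: the five words "Cut", "Copy", "Paste", "Undo", and "Exit". When the user speaks one of these five words, the recognition result is sent to the text editor. When the shell tool is designated with the mouse pointer, the input focus moves to the shell tool (and with it the speech focus), and the recognizable vocabulary changes to the shell tool's five words: "History", "List", "Home", "Process", and "Exit".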
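【００６８】
Application programs are free to use any words as their recognition vocabulary, and having to memorize and keep track of each program's vocabulary is a heavy burden on the user. On the other hand, giving every application program its own means of displaying its vocabulary is a burden on application authors. Moreover, unlike keyboard input, speech input is inherently ambiguous, so it is important that the user be able to confirm that the input speech was recognized correctly.
【００６９】
One way to solve this problem is to write a program that displays the recognition vocabulary (a vocabulary display program), as shown in Fig. 15, as a standard application program of the speech recognition interface. This program asks the speech recognition system to send it the messages generated whenever any application program, including a newly started one, connects or disconnects its communication channel, requests a vocabulary change, or changes the speech focus (that is, it sets the input masks for receiving those messages). The vocabulary display program can thus always display all the words that can be recognized at that moment. It also learns of every recognition and can display the recognition result sent to an application program, for example in a different color as in Fig. 15, so that the user can confirm which speech input the speech recognition system accepted. The vocabulary display program reduces the burden on both users and authors of application programs and gives users a more convenient speech input environment.
【００７０】
Besides changing a color in the vocabulary display program's list, the recognition result can be conveyed to the user in other ways.
【００７１】
For example, the recognition result can be displayed at a particular position on the display screen or in the application's window. This display area may belong to each application or to the speech recognition system itself. In a window system environment, a window for displaying the recognition result can be created and positioned at a particular place, such as the center of the application's window, along its top, bottom, left, or right edge, or near the mouse pointer or the keyboard input cursor.
【００７２】
The recognition result may remain displayed until the next result is obtained, or it may be displayed only immediately after it is obtained and then hidden after a certain time until the next result arrives. Displaying near the mouse pointer or keyboard cursor has the advantage of requiring little eye movement, but a permanent display close to the working area can get in the way, so displaying the result only immediately after recognition is effective there. This may be combined with permanent display of the result at a particular position on the screen or in the application.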
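【００７３】
The recognition vocabulary can be changed according to the mouse position not only between application programs but also within a single application program, which cuts unnecessary recognition processing and makes speech input more reliable. For example, as shown in Fig. 16(a) and (b), the mail tool is divided into a list display area and a text display area, and its recognition vocabulary (eight words in this example) changes according to which area contains the mouse pointer. This suppresses needless recognition processing and makes recognition errors less likely.
【００７４】
The first embodiment explained that when a new application is started the speech focus moves to it. Similarly, usability can be improved by making rules that move the speech focus when the state of an application's windows changes (a window is created or destroyed or its geometry changes), whether at start-up or exit of the application or as a result of processing triggered by input from a pointing device such as a mouse or pen, from the keyboard, or from speech recognition.
【００７５】
For example, each application can acquire and release the focus in response to changes in window state according to a rule such as: "A window loses the speech focus when it is destroyed, iconified, or hidden behind another window, and acquires the speech focus when it is created, changed from hidden to displayed, brought in front of other windows, or enlarged." Of course, such changes in window state need not be managed by each application individually; they can be managed centrally by a program that manages the speech focus. In that case the management program has the program that manages the window system (for example, the system's window server) report state changes of the application windows it wants to manage, and applies rules like the one above to change the speech focus whenever it receives such a report.
【００７６】
With a speech focus management program, usability can likewise be improved by making rules about which application the speech focus moves to when the application that held it loses it because the application exits, its window is destroyed, or the like.
【００７７】
For example, a rule can be made that "the speech focus management program keeps a history of the speech focus, and when the application that held the speech focus loses it for a reason other than a focus acquisition request from another application, the focus is returned to the application that held it before." If the speech focus management program changes the speech focus according to this rule, it avoids the state in which no application holds the speech focus, that is, the state in which no application receives the output of the speech recognition system.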
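The history rule quoted above amounts to keeping the past focus holders on a stack. The following is a minimal sketch under that reading; `SpeechFocusManager`, `acquire`, and `lose` are hypothetical names, and how the manager learns of window events is left out.

```python
class SpeechFocusManager:
    """Keeps a history of speech focus holders, most recent last (illustrative)."""

    def __init__(self):
        self.history = []

    @property
    def current(self):
        # The present focus holder, or None if no application has ever held it.
        return self.history[-1] if self.history else None

    def acquire(self, app):
        # An explicit focus acquisition request: app becomes the newest holder.
        if app in self.history:
            self.history.remove(app)
        self.history.append(app)

    def lose(self, app):
        """app lost the focus for a reason other than another app's request
        (for example it exited or its window was destroyed). The focus falls
        back to the previous holder, which is returned."""
        if app in self.history:
            self.history.remove(app)
        return self.current


manager = SpeechFocusManager()
manager.acquire("shell tool")
manager.acquire("mail tool")
restored = manager.lose("mail tool")
```

Removing an application from the whole history, not just from the top, keeps the focus from later falling back to a program that no longer exists.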
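【００７８】
In this embodiment the speech recognition system and the window system are independent, but a speech recognition interface in which the two systems are integrated is also possible.
【００７９】
(Third Embodiment)
In the second embodiment the speech recognition system was combined with a window system, the speech focus and keyboard focus were made to coincide as a single input focus, and the recognition vocabulary was changed by designating the input focus with the mouse pointer. With that arrangement, however, the user has to take a hand off the keyboard every time the input focus is changed. If the input focus can be changed by speech, the user can change the input task without leaving the keyboard, which can be expected to improve usability in a multi-window environment.
【００８０】
To allow the input focus to be changed by speech, the first embodiment is extended so that each recognition vocabulary item can be given one of two values, local or global. A local item is recognized only while the application program that set it has the speech focus. A global item is recognized regardless of which application program has the speech focus.
【００８１】
The three application programs (shell tool, text editor, and mail tool) are used again for the explanation.
【００８２】
The recognition vocabulary of each application program is shown in Fig. 17. To support the local/global setting, a flag indicating local or global is provided for each word in the recognition vocabulary lists of the application program management table, which becomes as shown in Fig. 18. When speech is input, the message processing unit uses this table to obtain the recognition vocabulary as follows. It first consults the table and picks out the local vocabulary of the application program that has the speech focus. It then collects the global vocabulary of all application programs. Together these are the words that the recognition system can recognize at that moment. If the text editor has the speech focus, for example, the recognition vocabulary at that moment consists of eight items: "Cut", "Copy", "Paste", "Undo", "Exit", "Shell Tool", "Mail Tool", and "Text Editor". Recognition results for "Cut", "Copy", "Paste", "Undo", "Exit", and "Text Editor" are sent to the text editor, while "Mail Tool" and "Shell Tool" are sent to the mail tool and the shell tool respectively. If, for example, the user says "Mail Tool" in this state and the mail tool responds by moving the input focus (the speech focus and the keyboard focus) to itself, the target of speech and key input changes without the user's hands leaving the keyboard.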
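The two-step vocabulary collection described above (local words of the focused program, then global words of every program) can be sketched as below. The table layout, the `"local"`/`"global"` strings, and the function names are assumptions for illustration, and the vocabulary is given in English gloss.

```python
def recognizable(table, focused):
    """Local words of the focused program plus the global words of all programs."""
    words = []
    for app, vocab in table.items():
        for word, scope in vocab.items():
            if (scope == "global" or app == focused) and word not in words:
                words.append(word)
    return words


def route(table, focused, word):
    """Programs that receive the recognized word."""
    return [app for app, vocab in table.items()
            if word in vocab and (vocab[word] == "global" or app == focused)]


def with_name(name, local_words):
    # Each program registers its own name as a global item.
    vocab = {w: "local" for w in local_words}
    vocab[name] = "global"
    return vocab


table = {
    "shell tool": with_name("shell tool", ["history", "list", "home", "process", "exit"]),
    "text editor": with_name("text editor", ["cut", "copy", "paste", "undo", "exit"]),
    "mail tool": with_name("mail tool", ["top", "last", "previous", "next", "send", "exit"]),
}

vocab_now = recognizable(table, focused="text editor")
```

With the text editor focused, `vocab_now` holds the eight items listed in the text, and saying a program's name routes the result to that program, which can then request the focus for itself.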
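【００８３】
In other words, this amounts to giving each window a name. If the window name is shown in the title area at the top of the window, the user can tell from it what to call the window.
【００８４】
As described above, in this embodiment the recognition vocabulary is given a local/global attribute so that windows can be named. By speaking a window's name, the user can change the focus and switch application programs without using the hands.
(Fourth Embodiment)
In the second and third embodiments the speech focus and keyboard focus were made to coincide, so that only one window at a time accepted both kinds of input exclusively.
【００８５】
Making the two focuses coincide allowed one application program to take charge of both kinds of input, but it also meant that the two input means could not be directed at different application programs. In this embodiment, to separate the two focuses, the speech focus is not operated directly with the mouse pointer (the keyboard focus still uses the mouse pointer).
Even when the mouse pointer enters a window and the application program is notified, the application program does not move the speech focus. Instead, as described in the third embodiment, each window is given a name that is registered as global recognition vocabulary, and the speech focus is changed by speaking that name.
【００８６】
When the input focus is split in two, the user will be confused when entering input unless the two focuses are presented clearly. In this embodiment the keyboard focus is indicated by a thicker window frame and the speech focus by a change in the color of the window title.
【００８７】
Fig. 19 shows an example in which the input focus is split in two and each focus is moved separately. In Fig. 19(a) both focuses are on the text editor. When the mail tool is designated with the mouse pointer, the keyboard focus moves to the mail tool but the speech focus stays on the text editor (Fig. 19(b)). When "Mail Tool" is spoken in the state of Fig. 19(a), the speech focus moves to the mail tool but the keyboard focus stays where it is (Fig. 19(c)). In Fig. 19(b) and (c) the keyboard focus and the speech focus are on different application programs, so two application programs can be operated at exactly the same time through separate input channels. In the state of Fig. 19(c), for example, the user can type text into the text editor with the keyboard while operating the mail tool by speech and reading received e-mail.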
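【００８８】
An application program that controls the speech focus, a speech focus manager, is also provided so that the speech focus can be moved by means other than speech. The speech focus manager is shown on the right of Fig. 19. It learns the state of the application programs currently running by communicating with the speech recognition system and displays them in a form such as a list.
【００８９】
The speech focus is indicated, for example, by showing the application program's name in reverse video, and it can be changed by designating an entry in the list with the mouse pointer. Means of input to an application program are not limited to the keyboard and speech; a pen, for example, is also conceivable. Usability can be improved by displaying which input means an application program accepts and what can be input, for example by showing an icon for each available means.
【００９０】
By separating the target of speech input from the target of input by other means in this way, multiple input means can be assigned to multiple application programs, and people can carry out tasks in parallel in a natural manner.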
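【００９１】
(Fifth Embodiment)
Fig. 20 shows the schematic configuration of this embodiment. Multiple application programs 7 are connected to a speech recognition system 6, and each application program 7 has a message input/output unit 71.
【００９２】
Each time speech is input, the speech recognition system 6 performs recognition on it and sends the result to the application programs 7. Each application program 7 notifies the speech recognition system 6 of its recognition vocabulary, and the speech recognition system 6 sends the application program 7 the result of recognition using that vocabulary.
【００９３】
The message input/output unit 71 of an application program 7 decides whether the application program 7 receives recognition results and makes the corresponding request to the speech recognition system 6. On instruction from the application program 7, it asks the speech recognition system 6 to perform speech recognition for the application program 7, and it receives the recognition results sent by the speech recognition system 6 and either passes them on to the application program 7 or blocks them. It can also change the recognition vocabulary.
【００９４】
Because it has the message input/output unit 71, an application program 7 can accept or decline speech input (recognition results) according to its own state, without prompting from outside.
【００９５】
Take as an example an e-mail system that can be controlled by speech (called voice mail here). To prevent malfunctions caused by misrecognition, voice mail is started and left running in a state in which it does not accept speech input. When a mail arrives, it notifies the user with synthesized speech, for example "You have new mail. Would you like to read it now?", and notifies the speech recognition system 6 of a confirmation vocabulary such as "Yes" and "No" and of the fact that recognition is to be performed with it. If the user says "Yes", voice mail displays the new mail or reads it aloud with synthesized speech. If the user says "No", voice mail asks the speech recognition system 6 not to send it recognition results and returns to its original state.
【００９６】
The "You have new mail ..." message need not be synthesized speech; it may be displayed as in Fig. 21. The "Yes" and "No" in the figure are there so that the confirmation can also be operated with a mouse or the like.
【００９７】
If, in Fig. 20, the message input/output unit 71 of one application program 7 is given the ability to enable and block the speech input of other application programs 7, then in the e-mail example the e-mail program can temporarily block the speech input of the other speech-controllable application programs 7 while it waits for the spoken confirmation and restore it when confirmation is complete.
【００９８】
When such blocking operations by application programs 7 conflict, an application program 7 that entered block mode later can wait until the application program 7 that entered block mode earlier releases its block.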
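【００９９】
By giving the application programs 7, rather than the speech recognition system 6, the means to manage tasks in this way, an application program 7 does not merely follow instructions from the speech recognition system 6 but can use speech input according to its own internal state.
【０１００】
One particular application program 7 can also be made to manage the tasks of all the other application programs 7 (whether recognition results are sent to them, which recognition vocabulary is used for recognition, and so on).
【０１０１】
Fig. 22 shows a mail tool, a shell tool, a text editor, and a task management program, all operable by speech, in a multi-window environment such as a workstation. Here only one of the application programs 7 accepts speech input at a time. In the figure the text editor is the target of speech input, as indicated by the changed color of its title. The task management program can show in the same way which program is the target of speech input. In this example the target of speech input can be changed by designating an entry in the task management program's display with a pointing device such as a mouse.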
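【０１０２】
(Sixth Embodiment)
In the fifth embodiment only one application program 7 was the target of speech input, but several application programs 7 can also be made recognition targets at the same time.
【０１０３】
The speech recognition system 6 of Fig. 20 is given an application program management table such as that shown in Fig. 23. The table holds, for every application program 7 connected to the speech recognition system 6, whether recognition is enabled and what its recognition vocabulary is.
【０１０４】
The information in this table is changed by requests from the message input/output unit 71 of each application program 7. In Fig. 23 the mail tool and the shell tool are enabled for speech input. The state of Fig. 23 can be presented, for example, as shown in Fig. 24.
【０１０５】
The speech recognition system 6 can now distribute recognition results automatically, sending speech input such as "Process" and "Home" to the shell tool and speech input such as "Top" and "Next" to the mail tool. "Exit" can be sent to the mail tool and the shell tool at the same time, so each of these application programs 7 can receive it and terminate itself.
【０１０６】
Given that several application programs 7 can be targets of speech input, the following operations also become possible. Fig. 25 shows an example in which the functions of the task management program are extended. "Exclusive" is the conventional function that keeps the number of application programs 7 targeted by speech input at one. "All" makes every application program 7 connected to the speech recognition system 6 a target of speech input. "Invert" reverses the set of targets: if the mail tool and the shell tool are the targets, "Invert" makes the text editor the target, and a second "Invert" restores the original state. These operations can be performed not only with a pointing device such as a mouse but also with other input means such as speech or keys, for example by speaking while holding down some button or key.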
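The three task management functions are simple set operations on the speech input targets. The sketch below assumes the targets are kept as a Python set of program names; the function names are hypothetical.

```python
def exclusive(app):
    """Exactly one program is the target of speech input."""
    return {app}


def select_all(apps):
    """Every connected program becomes a target."""
    return set(apps)


def invert(apps, targets):
    """Targets become non-targets and vice versa."""
    return set(apps) - set(targets)


apps = ["mail tool", "shell tool", "text editor"]
targets = {"mail tool", "shell tool"}
inverted = invert(apps, targets)
restored = invert(apps, inverted)
```

Because inversion is a set complement, applying it twice always restores the original targets, which matches the behavior described for the "Invert" function.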
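【０１０７】
Speaking while holding down the "All" button makes all application programs 7 targets of speech input, speaking while holding down the "Invert" button inverts the targets, and releasing the button restores the previous state.
【０１０８】
This embodiment makes it possible to speak without designating one particular target and still have the input processed appropriately. In a multi-window environment such as a workstation, even if several speech-operable application programs 7 are running, from the user's point of view there is only one counterpart: the computer. It is natural for a person to expect the computer to process an utterance automatically and appropriately without special operations such as task switching, and doing so makes good use of the characteristics of speech as a medium.
【０１０９】
(Seventh Embodiment)
In the sixth embodiment the user cannot tell what the recognition vocabulary of each application program 7 is. The task management program (or a separate application program 7) is therefore made to display the recognition vocabulary of each application program 7. An application program 7 can produce this display by requesting from the speech recognition system 6 the information in its application program management table (Fig. 23), as shown in Fig. 26.
【０１１０】
Automatically displaying the recognition vocabulary of the application programs 7 that are targets of speech input in this way means that the user no longer needs to memorize the vocabulary used for each application program 7, which lightens the user's burden. It also lightens the burden on the authors of application programs 7, who no longer need to provide their own means of displaying the vocabulary. The vocabulary can also be displayed together with the indication of which application programs 7 are input targets (Fig. 27). In Fig. 27 the changed colors of the mail tool and the shell tool indicate that they are the input targets.
【０１１１】
(Eighth Embodiment)
Controlling multiple application programs 7 does not necessarily require a screen display or a pointing device such as a mouse. For example, while the user is controlling by telephone a VTR control program that accepts spoken recording reservations, the voice mail program described in the fifth embodiment can temporarily interrupt the VTR control program and announce with synthesized speech, "Urgent mail has arrived. Would you like to check its contents?" A user who confirms can then hear the contents of the received mail in synthesized speech.
【０１１２】
When the mail task is finished, the recording reservation task resumes. If the VTR control program prepares for interruptions by providing vocabulary such as "Confirm reservation" so that the reservations made before the interruption can be checked, the interface becomes easier to use. With a telephone, input devices such as the push buttons can be used as well as speech. While making use of the natural qualities of speech input, the user can fall back on the push buttons to make input reliable when, for example, a temporary increase in ambient noise interferes with speech input.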
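【０１１３】
(Ninth Embodiment)
Next, an embodiment concerning the training of recognition vocabulary in the speech recognition interface of the present invention is described.
【０１１４】
Conventionally, when training recognition vocabulary, the user selects the words to be trained from a list of the training vocabulary. When the vocabulary is large, finding the desired word takes time and effort, which makes the system inconvenient. In the training program of a speech recognition device sold for workstations, for example, the recognition vocabulary used by all the various application programs is displayed, so the user has to pick the word to be trained from a list of several hundred words.
【０１１５】
This embodiment uses the recognition vocabulary information from the application programs to reduce the number of words presented to the user, so that the target word can be selected easily, and allows training to be carried out on the spot even while an application program is in use.
【０１１６】
As shown in Fig. 28, this embodiment adds a training data collection unit 8 and a dictionary creation unit 9 to the speech recognition system 1 and application programs 2 described with Fig. 1.
【０１１７】
The training data collection unit 8 exchanges messages with the speech recognition system 1 to receive vocabulary information on the application programs 2, displays the vocabulary to the user, and has the user select recognition vocabulary. It also asks the speech recognition system 1 to make the settings needed for training, for example to output training data, and saves the data it receives in a file. The dictionary creation unit 9 creates a recognition dictionary with that file as input.
【０１１８】
To perform these operations the training data collection unit 8 consists, as shown in Fig. 29, of a word speech feature data storage unit 81, a training vocabulary display and selection unit 82, a training data collection control unit 83, and a training vocabulary guide display unit 84.
【０１１９】
The training vocabulary display and selection unit 82 displays vocabulary to the user and has the user select the words to be trained. It stores the recognition vocabulary of the application programs 2 sent from the speech recognition system 1 in an internal training vocabulary table 821. When, for example, a set of document editing commands is the recognition target, the training vocabulary table 821 looks like this:
Recognition vocabulary: Undo, Cut, Copy, Paste, Font
These contents are displayed, for example as in Fig. 33, and the user can select the target word on the spot while using the application program. Only the vocabulary needed for the application program's current internal state is displayed, so far fewer words are shown than if everything were displayed together, and the target word is easy to select. The word speech feature data storage unit 81 saves the word speech feature data sent from the speech recognition system 1 through the message processing unit, for example on a magnetic disk. The training data collection control unit 83 controls data collection as a whole and holds a data collection flag that indicates the start and end of data collection. Messages are exchanged with the speech recognition system 1 using the messages shown in Fig. 30.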
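【０１２０】
For training data collection, the speech recognition system 1 has two operating modes: the normal recognition operation, in which it performs recognition and sends the result to the application programs 2, and a data collection operation, in which it returns the word speech feature data obtained by speech analysis to the data collection unit 8. These are called recognition mode and training mode below.
【０１２１】
The data collection procedure is described next with reference to Fig. 31 and Fig. 32.
【０１２２】
Fig. 31 is a flowchart of the speech recognition system 1 during data collection.
【０１２３】
It is assumed that, before training, the recognition vocabulary has already been set in the speech recognition system through communication with the application programs (step 3101). On receiving a training mode setting request message from the data collection unit 8 (step 3102), the system carries out the operations needed for training (step 3103).
【０１２４】
The operations needed for training include, for example, preventing the speech focus from moving so that the set vocabulary is retained during data collection, and withholding recognition results from the application programs 2 during data collection so that a recognition result does not change the state of an application program 2 and thereby change the set vocabulary.
【０１２５】
Next, the speech recognition system 1 sends the list of recognition vocabulary to the data collection unit 8 (step 3104) and then receives a message from the data collection unit 8 (step 3105). If the message is a speech feature data transmission request, the system sends feature data to the data collection unit 8 each time speech is input (step 3107). If it is a training mode release request, the system releases training mode and returns to normal recognition mode (step 3108).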
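【０１２６】
Fig. 32 is a flowchart of the training data collection unit.
【０１２７】
In the initial state the flag that directs data collection is set to OFF (step 3200). When the user sets data collection to ON, a training mode setting request message is sent to the speech recognition system 1 (step 3201). The unit then requests the current recognition vocabulary from the speech recognition system 1 and stores it in the training vocabulary table 821 of the training vocabulary display and selection unit 82.
【０１２８】
The training vocabulary guide display unit 84 displays the vocabulary, for example as in Fig. 33 (step 3202), and has the user select the words to be trained with a mouse or the like (step 3203). More than one word may be selected, and the background color of a selected word can be changed, for example from white to green, to make it easy to see. Fig. 33 shows the case in which "Copy" and "Paste" have been selected as training vocabulary from the document editing menu vocabulary.
【０１２９】
Next, after a word speech feature data transmission request is issued to the speech recognition system 1 (step 3204), the training vocabulary guide display unit 84 displays the word to be spoken, as in Fig. 34, to prompt the user to say it (step 3205). The guide can also be omitted. As supplementary information, the number of utterances can be displayed, and the word to be spoken can be played in synthesized speech. This reduces wrong utterances caused by misreading compared with showing the guide on the screen alone.
【０１３０】
After the user speaks, the word speech feature data sent from the speech recognition system 1 is written to a file, and the data collection flag set by the training data collection control unit 83 is used to decide whether to continue or end data collection (step 3207). If the flag is ON, the steps from the word speech feature data transmission request through data collection and file output are repeated via step 3209; if it is OFF, a request to release training mode is issued to the speech recognition system 1 (step 3208).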
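【０１３１】
Next, the overall flow of processing in the speech recognition interface during data collection is described with reference to Fig. 35.
【０１３２】
In initial setup, when the user gives the instruction to collect data (a), the data collection unit 8 issues a training mode setting request to the speech recognition system 1 (b). In response, the speech recognition system 1 sends the recognition vocabulary it is currently using to the data collection unit 8 (c).
【０１３３】
The data collection unit 8 displays the recognition vocabulary to the user and prompts the user to select the words to be trained. When the words have been selected (d), the data collection unit 8 asks the speech recognition system 1 to send word speech feature data (f), displays the selected word as an utterance guide (e), and prompts the user to speak.
【０１３４】
The speech recognition system 1 processes the user's utterance and then sends the word speech feature data to the data collection unit 8 (g), which writes the data to a file.
【０１３５】
At the end of training, the user first enters the instruction to end data collection (h), and the data collection unit 8 asks the speech recognition system 1 to release training mode (i). The speech recognition system 1 releases training mode in response.
【０１３６】
After data collection has ended, the user can create a recognition dictionary as needed. The dictionary creation unit 9 creates the dictionary using the data from the word speech feature data storage unit 81 and outputs it to a file.
【０１３７】
In this way the target word can be selected easily, and recognition vocabulary can be trained easily even while an application program is in use.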
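【０１３８】
(Tenth Embodiment)
Next, an embodiment is described that carries out time-consuming dictionary creation in the background, creating dictionaries during data collection or while other application programs are running, so that the user does not have to wait for dictionary creation to finish and the speech recognition interface is convenient to use.
【０１３９】
Known pattern matching methods for speech recognition include the DP method, HMM, and the multiple similarity method, all of which match against a reference recognition dictionary. With the multiple similarity method, which requires eigenvalue expansion and the like for high-accuracy recognition (Nagata et al., "Development of a speech recognition function for workstations", IEICE Technical Report, HC9119, pp. 63-70 (1991)), dictionary creation involves a large amount of computation. Even on a workstation regarded as fast at present, for example a machine with a processing capacity of 20 MIPS, it takes considerable time, from several seconds to several tens of seconds per word, and the resulting waiting time noticeably degrades the usability of the training interface. Dictionary creation is therefore computed in the background while training data is being collected, which reduces waiting time and improves usability.
【０１４０】
This embodiment accordingly describes a speech recognition system that improves the interface by creating dictionaries in the background.
【０１４１】
The dictionary creation unit 9 described with Fig. 28 consists, as shown in Fig. 36, of a dictionary creation management unit 91, a dictionary creation control unit 92, a data input unit 93, a dictionary creation main unit 94, and a file output unit 95.
【０１４２】
The dictionary creation management unit 91 receives messages from the data collection unit 8, instructs the dictionary creation control unit 92 to create the word recognition dictionaries for the requested vocabulary, and notifies the data collection unit 8 by message when creation is complete.
【０１４３】
So that multiple dictionary creation requests are carried out in order, dictionaries are created in order of request date and time in a dictionary creation management table such as that in Fig. 37. Fig. 37 shows, as an example, the contents of the management table when dictionary creation has been requested for the document editing commands "Copy", "Paste", and "Cut", in that order. Conditions such as the vocabulary are registered in the management table together with the date and time of the request, dictionaries are created in that order, and requests whose creation is complete are deleted from the table.
【０１４４】
A dictionary creation request can specify not only vocabulary, as above, but also other information recorded in the word speech feature data itself as attributes. For example, a speaker's name can be specified to create a speaker-dependent dictionary for that person, as in Fig. 38, or a date can be specified to create a dictionary from new data only, as in Fig. 39.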
【０１４５】
そして、辞書作成管理部９１と辞書作成制御部９２の間はメッセージ交換でやりとりを行う。
【０１４６】
次に、図４０、図４１を用いて辞書作成の流れについて説明する。
【０１４７】
まず、図４０は辞書作成管理テーブルへの登録の手順である。この場合、辞書作成要求のメッセージがあったかどうかを判断し（ステップ４００１）、なければ要求を待ち、あれば語彙やユーザ名などの条件を辞書作成管理テーブルに登録する（ステップ４００２）。
【０１４８】
一方、図４１は辞書作成の手順である。この場合、辞書作成管理テーブル上に登録されている辞書作成要求を検索し、要求がなければ登録を待ち、あれば最も古い日時の要求を選ぶ（ステップ４１０１）。次に単語音声特徴データを入力し（ステップ４１０２）、上記要求の条件に適合するデータを選択する（ステップ４１０３）。選択したデータのみを用いて辞書を作成しファイル出力する（ステップ４１０４、４１０５）。上記要求を管理テーブルから削除し、管理テーブルの検索（ステップ４１０１）へ戻る。以上を繰り返す。また、すべての辞書作成要求が削除された時点で、辞書作成が終了したことを学習データ収集部に通知しても良い。
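The two procedures of FIGS. 40 and 41 can be sketched as follows. This is only a minimal illustration: the class names, the field names and the counting of samples in place of real dictionary creation are assumptions for the example, not part of the original description.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class FeatureData:
    """One collected word-speech feature record (fields are illustrative)."""
    word: str
    speaker: str
    recorded: datetime


@dataclass
class DictionaryRequest:
    """One row of the dictionary creation management table."""
    requested_at: datetime
    word: str
    speaker: Optional[str] = None      # None means "any speaker"
    since: Optional[datetime] = None   # None means "no date restriction"

    def matches(self, d: FeatureData) -> bool:
        return (d.word == self.word
                and (self.speaker is None or d.speaker == self.speaker)
                and (self.since is None or d.recorded >= self.since))


def process_requests(table: List[DictionaryRequest], data: List[FeatureData]):
    """Serve requests oldest first; return (word, samples used) per request."""
    built = []
    while table:
        request = min(table, key=lambda r: r.requested_at)   # step 4101
        selected = [d for d in data if request.matches(d)]   # steps 4102-4103
        built.append((request.word, len(selected)))          # stand-in for 4104-4105
        table.remove(request)                                # delete the finished request
    return built
```

A real implementation would wait for new registrations instead of returning when the table is empty, and would run as a separate background process.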
【０１４９】
認識辞書の作成は、データ収集時にバックグラウンドで行うため、辞書作成の進行状況は利用者にとって分かりにくい。そこで、辞書作成の進行状況を例えば図４２（ａ）（ｂ）に示すように全処理量に対する終了した処理量の割合を表示することによって利用者に分かりやすいインターフェースを提供できるようにしている。この場合、辞書作成の開始や終了の際には、ビープ音などにより通知することも可能である。また、辞書作成処理の速度を表示することも可能で、例えば図４３に示すように速度を４段階に分けたり、図４４（ｂ）に示す色分けを用いて同図（ａ）のように色で処理速度を表示したりでき、計算機の負荷が大きくて辞書作成の処理が進まない場合には、処理が停滞していることを表示することにより、利用者に計算機の負荷の分散を促すようにもできる。
【０１５０】
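As described above, by creating the dictionary in the background while the time-consuming collection of speech data is in progress, waiting time is reduced and an easy-to-use interface can be realized.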
【０１５１】
また、以上述べた辞書作成は、独立したプロセスとして動作することが可能で、データ収集部８からの要求だけでなく、音声認識システムやその他の応用プログラムからも辞書作成要求を受け付けることが可能であり、学習データ収集処理時のみに限らず、いつ辞書作成を行ってもよい。
【０１５２】
（第１１実施例）
認識対象を単語または文節などとする音声認識においては、従来より入力音声のパワーの変化、音声ピッチの変化、あるいは零交差回数などの特徴パラメータを用いて単語境界を検出し、この音声特徴ベクトルと認識語彙セットについての認識辞書とを照合することにより行われていた。しかし、実際の作業環境では、背景雑音やユーザの不用意な発話（他のユーザとの会話や独り言など）の影響により誤った単語境界が検出されることが少なくない。このため、音声認識システムのユーザは現在何が認識対象になっているかを常に意識し、それ以外の言葉を発声しないようにする必要がある。
【０１５３】
一方、音声を計算機への入力手段の一つとして他の入力手段（例えばキーボードやマウス）と合わせて作業を行う場合、ユーザは、入力内容や作業の状況に応じてそれぞれの入力手段を使い分けることが考えられる。
【０１５４】
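In this embodiment, therefore, as shown in FIG. 45, a speech recognition automatic stop unit 10 is added to the speech recognition system 1 and the application programs 2 described with FIG. 1. Recognition processing is given two modes: a mode that performs normal recognition processing (recognition over all vocabulary currently subject to recognition) and a mode that performs recognition only for specific keywords. Normal recognition processing is performed for a while after recognition starts. If no speech input is made within a predetermined time, the recognition vocabulary set used until then is saved and the system switches to a mode in which only specific keywords (for example, "start recognition") form the recognition vocabulary set. When such a keyword is subsequently input, the saved recognition vocabulary set is set again and the system moves to the normal recognition mode. This switching of recognition modes can also be triggered by, for example, a change of speech focus or an instruction from an input means other than speech, and the mode transition is conveyed to the user by a message or icon display, a beep, or the like. As a result, when the user has not used speech for a while, the recognition mode switches automatically, and because speech other than the specific keywords is ignored, unexpected task switching and malfunctions caused by detection errors can be avoided.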
【０１５５】
また、ユーザはキーワードを発声するか、音声以外の入力手段により音声認識処理モードの切り替えを意識的に行うことができる。上記の処理は、例えばインターバル・タイマ機構を用いることにより実現できる。これは、現在時刻から時間切れになる時間を秒数で指定するもので、時間切れになると、その旨を通知するシグナルが渡される。このシグナルを受信した時点で音声認識のモードの切り替えを行う。
【０１５６】
以下、図４６に示すフローチャートに従って説明する。
【０１５７】
まず、最初にタイマが時間切れになるまでの秒数を設定し（ステップ４６０１）、時間切れか否かを示すフラグを０にする。このフラグは、時間切れになった旨を通知するシグナルを受信した際に呼び出されるシグナルハンドラ内で１がセットされるようにしておき、認識処理の最初にその値が調べられる。なお、タイマの機能は、計算機に通常内蔵されている時計の機能により容易に実現可能である。また、シグナルハンドラは、音声認識自動停止部１０の中にプログラムとして書くことができる。
【０１５８】
次に、認識対象とする語彙セットを設定した後（ステップ４６０２）、時間切れか否かを調べて（ステップ４６０３）、時間切れでなければ、その語彙セットに対する認識処理を行う。
【０１５９】
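In the recognition processing, the start and end of a speech segment are first detected using feature parameters such as changes in the power of the input speech, changes in speech pitch, or the zero-crossing count (step 4604). If the end is detected, a speech feature vector is extracted from the speech segment defined by that start and end and is matched against the recognition dictionary of the current recognition vocabulary set to obtain the similarity of each recognition vocabulary item. The item with the maximum similarity, provided that its value is at or above a predetermined threshold, is output as the recognition result, and the recognition processing ends (steps 4605 to 4609).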
なお、図４６では、音声特徴ベクトルの抽出から、認識辞書との照合およびしきい値による判定までを認識処理としている。終端が検出されない場合や、認識結果が得られない場合は（ステップ４６０５、４６０７）、語彙セットの設定に戻り、必要に応じて（例えばクライアントから音声フォーカスの変更や認識語彙の変更要求があった場合）認識語彙セットの変更を行い、時間切れか否かを調べて、時間切れでなければ再び現在の認識語彙セットに対する認識処理を行う。時間切れになった場合は、それまでの認識語彙セットを保存し、特定のキーワードを認識語彙とするモードに移行する。そのキーワードが検出されるか、クライアントから認識処理モードの切り替え指示があれば、保存していた認識語彙セットを復元し、タイマを再設定して通常の認識処理に復帰する（ステップ４６１０〜４６１７）。
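The mode switching of FIG. 46 can be sketched as a small state machine. The description uses an interval timer and a signal handler. The sketch below instead passes timestamps in explicitly so that it stays deterministic, and it restarts the timer after each accepted word. Both choices, and all names, are assumptions made for illustration.

```python
class AutoStopRecognizer:
    """Switches to keyword-only recognition after a period without speech input."""

    def __init__(self, vocabulary, keyword="start recognition",
                 timeout_s=30.0, now=0.0):
        self.vocabulary = set(vocabulary)   # current recognition vocabulary set
        self.saved = None                   # saved set while in keyword mode
        self.keyword = keyword
        self.timeout_s = timeout_s
        self.deadline = now + timeout_s

    @property
    def keyword_mode(self):
        return self.saved is not None

    def check_timeout(self, now):
        """On timeout, save the vocabulary and recognize only the keyword."""
        if not self.keyword_mode and now >= self.deadline:
            self.saved = self.vocabulary
            self.vocabulary = {self.keyword}

    def on_word(self, word, now):
        """Handle one detected word; return it if accepted, otherwise None."""
        self.check_timeout(now)
        if word not in self.vocabulary:
            return None                      # ignored (e.g. stray speech)
        if self.keyword_mode:
            # The keyword restores the saved set and restarts the timer.
            self.vocabulary, self.saved = self.saved, None
            self.deadline = now + self.timeout_s
            return None
        self.deadline = now + self.timeout_s
        return word
```

With a 10-second timeout, a word spoken 15 seconds after the last input is ignored until the keyword has been uttered.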
【０１６０】
以上述べた認識機能の自動停止機能により背景雑音やユーザの不用意な発話による誤動作を防ぎ、使い勝手のよい音声認識インターフェースを実現することができる。
【０１６１】
また、背景雑音やユーザの発話による誤動作をユーザが意識的に避ける方法として、従来からマウスやキーを押し下げている間だけ、音声入力を行う方法が使われているが、音声入力ごとに毎回マウスを操作するのは煩わしいという問題がある。そこで、常時音声入力中として、マウスを押し下げている間だけ音声入力を受け付けないことにすれば、発声ごとにマウスを操作しなければならないといった煩わしさを軽減できる。
【０１６２】
（第１２実施例）
ところで、音声メールツールは、音声入力可能な電子メールシステムであり、音声を使って受信したメールのリストを移動して内容を確認したり、そのメールに対する返事を送信することができる。
【０１６３】
この場合、ツールは、リスト表示部、受信メール表示部、送信メール編集部からなり、リスト中の反転表示されたメールが受信メール表示部に表示される。そして、例えば、音声を使って以下のような操作ができる。ここでは、上司からの緊急のメールに対して返事を出すまでを示している。
【０１６４】
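"Mail tool" (Brings the voice mail tool in front of all windows.)
"First" (Moves the list pointer to the top of the received-mail list.)
"Next" (Moves the list pointer to the next mail.)
"Last" (Moves the list pointer to the end of the received-mail list.)
"Previous" (Moves the list pointer to the previous mail.)
"Boss" (Lists only mail from the boss.)
"Urgent" (Of those, lists only urgent mail.)
"Reply" (Replies to the urgent mail. "To: boss's name" and "Subject: Re: subject of the boss's mail" are entered in the outgoing-mail display area.)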
メールシステムの初期状態を図４７に示す。メールリストの表示部には、全てのメールリストを一度に表示できないため、所望のメールを探すのにマウスを使う場合には、表示部の右側にあるスライド用のバーを使う必要がある。特に大量のメールが来た時などは、メール探しに多くの労力を必要とし、操作性は十分であるといえない。しかし、ここで音声を用いることにより、直接所望のメールを検索でき、作業の大幅な効率化が図れる。
【０１６５】
ここで、例えば上司からの緊急のメールを選択する場合、「上司」「緊急」と発声するだけで、選択することができる。図４８に上司からの緊急のメールの検索結果を示す。この例では２通のメールがきているものとすると、次のようになる。
【０１６６】
「コピー」（メッセージをコピーする。）
「ペースト」（コピーしたメッセージを受信メールにペーストする。）
「引用」（そのメッセージに引用符を付ける。）
ここで、そのメッセージに対する返事を書き、
「サイン」（必要があれば自分のシグネチャをメールの最後に付ける。）
「送信」（返信メールを送信する。）
ここで使われている「上司」や「緊急」は、音声マクロコマンドとして実装されており、メールのヘッダや内容を用いて照合した結果を用いてリストを限定するものである。すなわち、電子メールの発信者の名前、所属、標題、差出日、本文の内容は、テキスト（文字データ）で書かれており、その内容を理解し、キーワードや内容の照合を行うことにより、音声での効率的な電子メールの取り出しが可能になる。これはフルテキストサーチなどの情報検索技術や文脈解析技術を用いて、ＷＳ上で実現でき、音声入力インターフェースの利用により音声メールの使い勝手が大幅に向上する。また、テキストの一部を音声合成で読み上げたり、強調したり、スピードを変化させることも可能である。また、図４７に示すように認識語彙の表示や現在音声フォーカスが当たっているクライアントの表示、認識が動作中であるか否かの表示などを行い、ユーザにシステムの状態をできるだけ伝えるように考慮し、作業の効率化を可能にしている。
【０１６７】
（第１３実施例）
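Using the speech recognition server, existing applications can be controlled by voice. This is possible by creating a client that substitutes speech for keyboard input to the existing application. Here, an example is shown in which an existing DTP (DeskTop Publishing) system is voice-controlled using a voice macro program that enables voice control of existing applications.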
【０１６８】
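The voice macro program holds knowledge of the recognition vocabulary of the existing application in menu form and uses that menu hierarchy to restrict the recognition vocabulary. For example:
“Figure” menu
“Undo”
“Group”
“Ungroup”
“Front”
“Back”
“Flip vertical”
“Flip horizontal”
“Rotate”
“Top level” menu
“Document”
“Edit”
“Figure”
The root of the menu hierarchy is called the "top level". Commands are executed by uttering words starting from the top level and following the menu hierarchy. Each time the user moves through the menu hierarchy, the items of the current menu and the current position in the hierarchy, expressed as a path, are presented to the user in a window.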
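How a menu hierarchy restricts the recognition vocabulary can be sketched as below. The menu contents follow the listing above. The empty "document" and "edit" menus, the rule that uttering "top level" returns to the root, and all identifiers are assumptions for the example.

```python
MENUS = {
    "top level": ["document", "edit", "figure"],
    "document": [],   # contents not given in the description
    "edit": [],       # contents not given in the description
    "figure": ["undo", "group", "ungroup", "front", "back",
               "flip vertical", "flip horizontal", "rotate"],
}


class VoiceMenu:
    """Tracks the current menu and exposes only its items for recognition."""

    def __init__(self, menus, root="top level"):
        self.menus = menus
        self.path = [root]

    def vocabulary(self):
        """Words recognizable at the current position in the hierarchy."""
        words = list(self.menus[self.path[-1]])
        if len(self.path) > 1:
            words.append(self.path[0])       # assumed way back to the root
        return words

    def location(self):
        """Current position expressed as a path, for display to the user."""
        return " / ".join(self.path)

    def say(self, word):
        """Return a command to execute, or None if the word only navigates."""
        if word not in self.vocabulary():
            return None
        if word == self.path[0]:
            self.path = [self.path[0]]
            return None
        if word in self.menus:
            self.path.append(word)
            return None
        return word
```

At the top level "group" is not recognizable. After "figure" has been uttered it becomes a command.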
【０１６９】
そして、以下のように操作される。ここでは、文書ウインドウに存在する複数個の図形を取り扱う例を示している（図４９参照）。
【０１７０】
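To handle figures, open the figure menu from the top level.
"Figure" (The menu items are listed in the voice commander.)
Here, select several figures in the document window with the mouse.
"Group" (Combines the figures so that they are handled as one figure.)
"Flip vertical" (Flips the grouped figure vertically.)
"Rotate" (Rotates the figure.)
"Ungroup" (Cancels the grouping.)
Next, select with the mouse one of the figures that had been grouped.
"Back" (Sends the selected figure behind all other figures.)
"Undo" (Undoes the operation performed by "Back".)
"Front" (Brings the figure to the very front.)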
これをマウスを使って操作する場合には、
・メニューバーをクリックしてメニューを表示する。
【０１７１】
・メニューをプルダウンし、実行したいコマンドの項目を選択する。
【０１７２】
・マウスボタンから手を離してコマンドを実行。
の少なくとも３アクション必要であり、マウスポインタの移動の手間を考えると、それ以上のアクションを行っていると考えられる。
【０１７３】
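With speech, however,
・Utter the word for the operation.
a single action suffices, which shows the usefulness of speech. When operating by selecting menus with the mouse, the above operations must always be carried out even if the user knows in advance what is to be done. Combined with other input means, speech makes for a more effective interface.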
【０１７４】
ここで、キーボードマクロを使えば、音声と同様に１回だけの操作で済むが、キーボードマクロは基本的に一つの文字で表現するため、キーボードマクロが多ければ多いほど対応付けのしにくい文字とコマンドの組み合わせを記憶することが要求され、ユーザの負担になる。
【０１７５】
そこで、コマンドを、ただ１つの文字でなく、そのコマンドの意味をも自然に表現し得る音声と結び付けることで、アプリケーションは、ユーザに対して、より自然なインターフェースが提供できる。
【０１７６】
また、単語認識の際に上述した図形メニューのなかで、例えば「グループ化」と「グループ解除」のように前半部分が同じカテゴリに存在する場合には、部分抽象化により単語の後半部分のパターンを用いて認識を行うことにより、認識精度の向上を図ることができる。また、「上下反転」「左右反転」のように後半部分が同じ場合には、単独の前半部分のパターンを用いて認識を行うことも可能である。要するに、パターンの違いがより明確になるように様々な視点から認識のための単語パターンを取り出し、認識を行うことにより認識性能の向上が可能になる。
【０１７７】
（第１４実施例）
以上、述べてきた音声認識インターフェースは、音声の入力にのみ注目してきたが、音声の出力機能をインターフェース内に取り入れ、テキストからの音声合成や音声データの再生を行なうようにすれば、音声の入出力を統合して行なうことができるため、複数の応用プログラムへの音声入力とそれらからの音によるメッセージの出力を簡単に行なうことができ、ユーザにとって取扱い易いインターフェースを実現することができる。
【０１７８】
以下に、音声合成機能を備えた音声認識インターフェースである音声入出力インターフェースの構成について説明する。
【０１７９】
図５０は音声合成部を備えた音声入出力システムの概略構成を示しており、図１で述べた音声認識システム１に音声合成部１４を付加した構成になっている。この場合、音声合成部１４はメッセージ処理部１１からの指示に従ってテキスト情報から合成音声生成を行い、音声出力を行なうようになっている。また、応用プログラム管理テーブル１３は、複数の応用プログラム２からの音声出力を制御するため、図５５に示すように応用プログラム２の音声出力に関する情報を収納するフィールドを持っている。これにより、複数の応用プログラム２からの音声出力に対する制御を行なうことができる。ここでの音声出力に関する情報としては、特定の音声出力に対して音声出力を優先的に行なうことを指示するための音声出力優先度などがある。
【０１８０】
図５１は、音声合成部１４の概略構成を示しており、全体制御部５６１、波形重畳部５６２、音声出力管理テーブル５６３、波形合成部５６４からなっている。
【０１８１】
全体制御部５６１はメッセージ処理部１１から合成音声の出力要求とともに文字列を受けとり波形合成部５６４に送って音声合成を行ない音声出力する。この場合、音声合成部１４によって出力する音響信号は合成音のみでなく、録音された音声や音声以外であってもよく、その場合は音声の合成を必要としない。このときは波形合成は行なわずにメッセージ処理部から受けとった波形データをそのまま音声出力するようにしている。
【０１８２】
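The waveform synthesis unit 564 receives character string data from the overall control unit 561 and performs speech synthesis. Various speech synthesis methods are known; for example, the method of the literature (D. Klatt: "Review of text-to-speech conversion for English", J. Acoust. Soc. Am., 82, 3, pp. 737-793 (Sept. 1987)) can be used.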
【０１８３】
音声出力管理テーブル５６３はメッセージ処理部１１からの音声出力の要求を登録するテーブルであり、このテーブルに登録された順番に従って音声出力を行なうことにより、複数の音声出力要求に対して時間的な整合性を保ちながら音声出力を行なうことができる。
【０１８４】
音声合成部１４は独立したプロセスとして動作させることが可能で、メッセージ処理部１１とは、音声認識システム１と応用プログラム２の間のメッセージで述べたように、プロセス通信によるメッセージ交換によりデータのやりとりを行なう。ここでのメッセージとしては図５３に示すようなものがある。
【０１８５】
同図（ａ）の応用プログラム２からメッセージ処理部１１へのメッセージは応用プログラム２からの命令を意味している。ここでの音声合成要求は、応用プログラムがテキスト内容を合成音声に変換させる要求で、合成するテキストデータと共に要求を出し、その結果合成音声データが通知される。波形再生要求は応用プログラムが録音等により既に波形の形で音声データを持っている際、それをそのまま再生するための要求で、再生データと共に送信する。音声合成・再生要求は、音声の合成とその再生をまとめて行なう要求であり、合成音声データは通知されない。
【０１８６】
優先度設定要求は、特定の応用プログラムからの出力音を優先させるための要求であり、例えば出力音のレベルと音声合成処理の優先度、中断出力の有無、などに関して、設定できるようになっている。
【０１８７】
音声出力要求の優先度は、例えば緊急を要する場合に、高い値に設定することにより、直ちにユーザの注意を向けることができるため効果的である。
【０１８８】
先に述べたように、音声出力管理テーブル５６３はメッセージ処理部１１からの音声出力要求を登録するテーブルであり、このテーブルに登録された順番に従って音声出力を行なうことにより、複数の音声出力要求に対して時間的な整合性を保ちながら、音声出力を行なうことができる。
【０１８９】
音声出力管理テーブル５６３の例を図５２（ａ）（ｂ）に示している。テーブルに記録するデータはデータＩＤ、波形かテキストかを表す入力データの種類、出力要求のテーブルへの登録時刻、テキストデータの内容、音声出力の際の音量などがある。図の例では、データＩＤ＃１、＃２、＃３がテキストデータであり、＃０〜２のデータに対しては処理が終了しているが、＃３のデータは現在処理中、＃４のデータはまだ処理が行なわれていないことを示している。
【０１９０】
一方、メッセージ処理部１１から応用プログラム２へのメッセージは図５３の（ｂ）に示すような種類がある。音声出力状況通知は、要求された音声出力が終了したことを通知し、優先度設定通知は、優先度設定要求に従って音声出力の優先度が設定されたことを通知する。いずれも要求に対する確認のメッセージである。
【０１９１】
応用プログラム２がどのメッセージを受け取るかの設定は、先の音声認識システム１と応用プログラム２の間のメッセージに関する説明で既に述べた通りで、入力マスクによって設定することができる。この場合、音声合成部１４が加わったことにより、図５４に示すような種類からなっている。
【０１９２】
また、上述したようなメッセージ以外にも、エラーメッセージや音声出力レベルの設定メッセージ、音声合成部１４の内部情報にアクセスするメッセージなどさまざまなメッセージが設定可能である。
【０１９３】
音声合成部１４とメッセージ処理部１１との間もメッセージによって情報交換が行なわれる。この場合のメッセージは図５３の（ｃ）（ｄ）に示す種類がある。このうちの（ｄ）のメッセージ処理部１１から音声合成部１４へのメッセージは、（ａ）の応用プログラム２からメッセージ処理部１１への要求メッセージとほぼ同じであり、（ｃ）の音声合成部１４からメッセージ処理部１１へのメッセージは、（ｂ）のメッセージ処理部１１から応用プログラム２への通知メッセージとほぼ同じ種類のものを使うようにしている。
【０１９４】
以上、述べたように音声合成部１４を有する音声認識システム１の各部においてメッセージをやりとりすることによって、複数の応用プログラム２からの要求による音声出力処理が進められるが、次に、音声認識インターフェース全体としての処理の流れを図５６、５７に従って説明する。
【０１９５】
図５６では、既に第１実施例で述べた手続に従って応用プログラム２と音声認識システム１との接続処理と音声認識に関する初期設定をステップ６１０１で既に完了しているものとする。そして、ステップ６１０１の終了後、応用プログラム２は音声出力処理に関する初期設定を後述の図５７の（ａ）に従って行なう（ステップ６１０２）。初期設定としては、音声合成部１４における音声出力管理テーブル５６３の初期化、応用プログラム管理テーブル１３の音声出力優先度情報の初期化などがある。そして、音声入力および音声出力の処理を実行する（ステップ６１０３）。
【０１９６】
次に、応用プログラム２からの音声出力に関する要求ごとの音声出力処理について説明する。
【０１９７】
まず、図５７の（ｂ−１）の音声合成要求が応用プログラム２から出された場合、メッセージ処理部１１は要求をそのまま音声合成部１４へ音声合成要求として送る。そして、音声合成部１４は音声出力管理テーブル５６３へのメッセージの登録を行なう。音声合成要求は波形の再生処理を含まないため、例えば図５２の出力管理テーブルメッセージＩＤ＃１のように、出力ありなしの項は出力なし（＝０）となる。この場合、音声出力優先度情報は使われない。合成処理が終了後は、音声合成部１４は終了したことを音声出力状況通知によってメッセージ処理部１１へ通知し、メッセージ処理部１１はそれを応用プログラム２へ通知する。応用プログラム２はこの通知の後音声波形データ要求を出し、合成音声ごとに受け取る。
【０１９８】
次に、図５７の（ｂ−２）の波形再生要求があった場合、メッセージ処理部は図５５に示す応用プログラム管理テーブルに登録してある優先度情報を検索し、要求を行なった応用プログラムに関する情報を付加して音声合成部１４へ波形再生要求を行なう。
【０１９９】
音声合成部１４では、音声出力管理テーブルにメッセージの登録を行なうが、この場合は、例えば図５２のメッセージＩＤ＃０または＃４のような内容が登録される。波形再生終了後に音声合成部１４は、音声出力状況通知により、再生が終了したことをメッセージ処理部１１に送り、メッセージ処理部１１はそれを応用プログラム２へ送る。
【０２００】
次に、図５７の（ｂ−３）の音声合成再生要求があった場合は、波形再生の場合と同様な処理で音声の合成および再生の処理を行なう。
【０２０１】
また、図５７の（ｂ−４）の優先度設定要求によって音声出力優先度を変更することができる。音声出力優先度は先に述べたように、音声出力のレベル、音声合成処理の優先度、中断処理の有無等がある。出力音声のレベルを高くすれば、その出力メッセージに対する注意を引きつけるのに役立ち、音声合成処理の優先度を高くすれば、その音声データが音声合成後出力されるまでの時間遅れを小さくできる。又、中断処理は、特定の音声出力データ以外の音声出力を一時中断し、そのデータのみを出力する処理であり、これらを組み合わせて使用することにより、重要なメッセージを優先的に出力するなどの処理が可能である。
【０２０２】
例えば図５２では、メッセージＩＤ＃０の波形再生要求に対しては、出力レベル＝３、中断出力なし、合成処理優先度−（値なし）が設定されている。この場合、優先度の値は０〜１０の範囲で設定するようになっており、出力レベル３は、比較的小さい値である。又、中断出力なしのため、この波形データは他の音と重なって聞こえて来る。これに対し、＃２の音声合成・再生要求に対しては、出力レベルは最大の１０であり、かつ音声合成処理の優先度も最大であるため、合成音データが直ちに出力される。又、中断出力ありのため、この間に他の音は出力中断状態にある。この合成音を出力中は、他の音に邪魔されずに音を聞くことができる。
【０２０３】
次に、以上述べたような音声出力要求を順次処理する方法について説明する。
【０２０４】
複数の音声出力要求は音声合成部１４の音声出力管理テーブル５６３に従って処理を行なう。音声出力管理テーブル５６３には要求のあった順番に要求のＩＤ、入力データの種類（波形／テキスト）、要求受付時刻、データ内容、処理状態、音量、出力中断処理のあり／なし、音声合成処理の優先度、排他処理の係数、等が登録される。
【０２０５】
図５８に示すように、まず、全体制御部５６１は、音声出力管理テーブル５６３の処理状態の項を参照し（ステップ６３０１）、「未処理」となっているデータを探し、あれば処理状態を「処理中」に更新し（ステップ６３０２）、データの種類を参照する（ステップ６３０３）。そして、データがテキストであればテキストデータを波形合成部５６４へ送って音声合成を行い（ステップ６３０４）、合成音データを波形重畳部５６２へ渡し、波形データであればそのまま波形データを波形重畳部へ渡す（ステップ６３０５）。そして処理状態を「終了」に更新して（ステップ６３０６）、次の未処理データの処理を行なう。
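The loop of FIG. 58 can be sketched as follows. The `synthesize` function is only a placeholder that returns silence of a length proportional to the text, and the record fields are assumed names. A real waveform synthesis unit would replace the placeholder.

```python
from dataclasses import dataclass


@dataclass
class OutputRequest:
    """One row of the speech output management table (fields are illustrative)."""
    data_id: int
    kind: str                 # "text" or "waveform"
    payload: object           # text string or list of samples
    state: str = "unprocessed"


def synthesize(text):
    """Placeholder for the waveform synthesis unit: one silent sample per character."""
    return [0.0] * len(text)


def process_table(table, superpose):
    """Process unprocessed rows in registration order (steps 6301 to 6306)."""
    for req in table:
        if req.state != "unprocessed":
            continue
        req.state = "processing"                      # step 6302
        if req.kind == "text":                        # step 6303
            wave = synthesize(req.payload)            # step 6304
        else:
            wave = req.payload                        # waveform passed through as is
        superpose(req.data_id, wave)                  # step 6305
        req.state = "done"                            # step 6306
```

The `superpose` callback stands in for the waveform superposition unit 562.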
【０２０６】
波形合成部５６４では、処理を行なっているデータに関する合成処理優先度情報をもとに、合成演算を行なう処理の他の処理に対する優先度を設定して演算を行なう。優先度の設定は、例えばワークステーションのオペレーティングシステムとして一般的であるＵＮＩＸのシステムコールを用い、合成プロセスに対する演算装置の割り当て時間を変更させたり、処理量の異なる複数の音声合成器を用意して優先度に応じて使う合成器を変えたりすることにより行なえる。
【０２０７】
波形重畳部５６２では、波形データと共に音量、出力中断処理のあり／なし、排他処理の係数などの情報に基づいて複数の波形を重畳する。重畳の際には、時刻と波形データのサンプルの対応を常に監視し、複数の音声出力要求の間の時間とそれらの要求に対応する複数の波形データの出力される間隔が、なるべく等しくなるようにしている。また、重畳の処理は単位時間、例えば１０ｍｓｅｃごとのブロック処理によって行なうことが可能である。
【０２０８】
次に、図５９により、中断処理のある音声データを重畳する際の例を説明する。この場合、データは図５２の音声出力管理テーブル５６３にあるデータＩＤ＃１〜３であり、簡単のため、登録から波形重畳までは時間遅れがないものとしたが、実際には用いる計算機の処理能力に応じて、音声合成やデータの移動による時間遅れがある。音声出力管理テーブル５６３に記録された時刻どおりで、かつ出力中断処理を行なわずに音声データを出力する場合には、図５９（ａ）のように、データどうしが時間的に重なっているため、緊急なメッセージであるデータ＃２の音声は、先頭部がデータ＃１の最後と、後半部がデータ＃３の前半部と重なって出力されることになる。これに対し、出力中断処理を行なう場合の（ｂ）では、データ＃２の「緊急です」が始まる時点でデータ＃１の重畳を中断し、＃２の処理終了後、＃１の中断された時点から残りを重畳することになる。又、データ＃３は、＃２が終了後に重畳される。データ＃１のように、中断処理によって時間的に分割されるデータは、上述のように分割したまま出力しても良いが、中断処理後にもう一度最初から出力し直したり、又、分割された後半部は出力しない、あるいは徐々に音量を下げて重畳するなど様々な処理が考えられる。
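The interrupt behaviour of FIG. 59(b) can be sketched with block-based scheduling. The timings below are hypothetical, since the figure's values are not reproduced in the text. The policy shown is the first variant described above (the interrupted data resumes from where it stopped), and the tuple layout is an assumption.

```python
def schedule(items):
    """Decide which data plays in each output block.

    items: list of (data_id, start_block, n_blocks, interrupts) in
    registration order. Returns {data_id: [block indexes in which it plays]}.
    While data with interrupts=True is pending, only that data is output;
    everything else is paused and resumes afterwards.
    """
    remaining = {data_id: n for data_id, _, n, _ in items}
    played = {data_id: [] for data_id, _, _, _ in items}
    t = 0
    while any(remaining.values()):
        pending = [it for it in items if it[1] <= t and remaining[it[0]] > 0]
        urgent = [it for it in pending if it[3]]
        active = urgent[:1] if urgent else pending   # no urgent data: superpose all
        for data_id, _, _, _ in active:
            played[data_id].append(t)
            remaining[data_id] -= 1
        t += 1
    return played
```

For example, with data #1 starting at block 0 (4 blocks), urgent data #2 at block 2 (3 blocks) and data #3 at block 4 (2 blocks), #1 plays in blocks 0, 1, 5 and 6, #2 in blocks 2 to 4, and #3 in blocks 5 and 6.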
【０２０９】
（第１５実施例）
第１４実施例に記述したように、音声認識システムは、音声合成部１４を組み入れ、マルチタスク環境において、複数のタスクから音声認識および合成機能の利用を可能にすることで、ユーザが応用プログラム２を使用する際の使い勝手が向上する。本実施例においては、第１４実施例をふまえ、具体的なシステムの応用例として、音声メールツールについて音声合成機能を追加した際の効果を中心に述べる。
【０２１０】
図６０は、第１５実施例の概略構成を示しており、音声入出力システム６５１、ウィンドウシステム６５２、音声メールツール６５３から構成している。また、音声メールツール６５３は、電子メール処理部６５３１とメッセージ入出力部６５３２からなっている。
【０２１１】
この場合、音声入出力システム６５１は、第１４実施例に述べた、音声合成機能を持つシステムである。ウィンドウシステム６５２は、応用プログラムに関する情報をＧＵＩ（Graphical User Interface）を通じてユーザに提供する。そして、これら音声入出力システム６５１及びウィンドウシステム６５２を利用することで、音声メールツール６５３で、音声入力をマウスやキーボードと同様に扱え、音声合成をも統一的に扱えるようにしている。
【０２１２】
通常、音声メールシステムで送受信されるデータはテキストデータであるが、テキストデータだけではなく、音声データや画像データ等をメールの中に混在させることができる。音声データを含むメールを送受信するために、メールツールは生の音声データを録音・再生する機能が必要となる。
【０２１３】
応用プログラム２が生の音声データを扱えるようにするために、応用プログラム２と音声入出力システム６５１間で交わされるメッセージとして、図６１に示すものを追加する。これらのメッセージを利用して、メールツールが音声データを録音する手順を図６２の（ａ）に、再生する手順を図６２の（ｂ）に示している。また、今述べた音声の録音・再生機能を持つ音声メールツールの画面表示例を図６３に示す。この表示例は、上述した第１２実施例の図４８とほぼ同じ表示画面を持つ。ここでは、ツールのリスト表示部の行の先頭に＊印の付いたものがあるが、これは音声データを含むメール文書を識別する印である。受信メール表示部に、音声データ付メール文書の表示例を示す。メール文書中の音声データは、例えばボタン様の形式でユーザに提示する。
【０２１４】
図６３においては、緊急とラベル付けられたボタンが、音声データである。音声データをマウス等で指定し、マウスやキーや音声入力を使って再生する。音声データ付のボタンは、メールのテキスト中の任意の位置に任意の個数作成し、配置できる。
【０２１５】
メール中の音声データの録音・再生・編集は、図６４のような、音声データ編集用のサブウィンドウを用いて行なう。図の上部の２つのスライダーはそれぞれ音声データの入力・出力時のボリュームを設定するものである。その下のボタンは、それぞれ音声データの録音、再生、録音／再生の停止、音声データの編集、メールへの音声データの追加を行なうボタンである。編集ボタンには、カット、コピー、ペーストなどを行なう編集用のサブメニューが存在する。ボタン列の右端の「緊急」は、ユーザが任意に入れることのできる文字で、音声データ作成時に、ボタンのラベルとして表示される。図６４の下部が音声波形データを編集する所である。データをマウスを用いて選択し、音声入力を用いてカット、コピー、ペーストを行ったり、エコーをかけたり、ピッチを変化させたりなどの効果を音声データに加えることが可能である。また、音声データの編集やデータに対する効果の付加は、メールツールでなく、専用の音声データ編集ツールで行ってもよい。それを用いて音声を編集する際に、メールツールとの間で音声データの受け渡しを行う必要があるが、その受け渡しを音声入力を使ったカット＆ペーストによって行えば音声データに対する編集操作が簡単に行えるようになる。
【０２１６】
音声入力を使ったカット＆ペーストは、音声データに対してだけでなく、テキストやグラフィックなど様々な形態のデータに対して適用し、応用プログラム向けのデータの受け渡しに用いることができる。
【０２１７】
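When replying to a mail with the functions described above, uttering "reply" automatically copies all of the mail that was read, or part of its text, adds a quotation mark to each copied line, and further appends the user's signature and a recorded message automatically before sending, so that a reply can be sent almost without touching the keyboard. The recorded message may be one recorded in advance. Alternatively, the tool may enter recording mode automatically and, when "send" is uttered, attach the recorded data and send the mail automatically. For example, FIG. 65 shows the text of a reply to a farewell party announcement. In this example, the lines up to the 8th are a copy of the received announcement with quotation marks (》) added, and lines 9 to 11 add the user's signature and a mark for the recorded message.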
【０２１８】
また、図６４で示す音声データの録再・編集機能の一部または全部を図６６のように、受信メール表示部や、送信メール編集部に並べて配置することで、メール中の音声データに対する操作性が向上するとも考えられる。
【０２１９】
録音データはそのまま全部をメール用のデータとして用いてもよいが、データ中には言い澱みなどにより不要な無音部があり、必要以上にデータ量が多くなってしまうことがある。
【０２２０】
そのような場合、無音部を自動的に検出して一定の長さ、例えば１秒以上の無音部をカットすることも可能である。
【０２２１】
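In addition, movement of the user during recording changes the distance between the mouth and the microphone, so the recording level may not be constant and the data may become hard to listen to.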
【０２２２】
そのような場合、録音データのパワーを調べて全体に亘ってレベルを均一にし、聞きやすくすることができる。レベルの均一化の処理は、ある単位ごと、例えば単語、文ごとのレベルを求め、最大のレベルを持つものに他を合わせるようにするなどにより実現可能である。
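The level equalization described here can be sketched as follows, assuming the recording has already been split into units (words or sentences) and using the RMS value as the "level". Both are assumptions, since the description does not fix how the level is measured.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of samples (0.0 for an empty list)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def equalize_levels(units):
    """Scale every unit so that its level matches that of the loudest unit."""
    levels = [rms(u) for u in units]
    target = max(levels, default=0.0)
    result = []
    for unit, level in zip(units, levels):
        gain = target / level if level > 0 else 1.0   # leave silent units untouched
        result.append([s * gain for s in unit])
    return result
```

A further overall gain could then be applied when the resulting level is too small or too large.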
【０２２３】
また、データ全体あるいは上述の最大レベルが小さすぎたり、大きすぎたりした場合には、データ全体のレベルをそれに応じて変えることにより、聞き苦しくないようにできる。
【０２２４】
さて、本実施例のメールツールを使うことで、テキストと音声の混在したメール文書を読み上げることができる。
【０２２５】
図６３の受信メール部のメールを読み上げることとすると、
「田村殿」（音声合成）
「先週の出張報告書を至急提出のこと」（〃）
（緊急ボタンの音声データを再生）
「沢田」（音声合成）
と、このように、データの出現順に、データの種類に応じた処理（テキストデータは音声合成し、音声データはそのまま再生する）を行なうことで、テキスト以外のデータをも読み上げることができる。また、テキストデータだけの読み上げや、音声データだけの読み上げを可能にすることもユーザにとって有用になる。テキスト以外のデータ形式としては、音声以外のものでも、そのデータ形式に従った処理を行なえば良い（動画なら動画の再生を行なう）。
【０２２６】
メールの読み上げは、本文だけではなく、題や発信者や送受信の時間を示すメールのヘッダに関しても行なって良い。
【０２２７】
ここで、全てのメール文書に対して、同一の読み上げ方をする必要はない。例えば、メールアドレスと、合成音声の属性を図６７に示すようにデータベース化することによって、発信者毎にメール文書読み上げの際の音声の特徴を変化させることができる。図６７の設定では、Ｔａｍｕｒａ氏からのメールは、低くゆっくりと話す男性の声で、Ｎａｋａｙａｍａ氏からのメールは、高く早口の女性の声で、それ以外のメールは、標準的な声の高さを持つ男性の声で、標準的スピードにより読み上げられる。
【０２２８】
さらに、発信者情報だけではなく、１つの文書内の情報を使って合成部を変化させることが考えられる。例えば、引用符に囲まれた部分のみに関して、男女の性別を入れ替えるとか、声の高さや読み上げの速度を変化させることが可能である。
【０２２９】
また、メールの受信者が、合成音声によるメールの読み上げを行なうことを想定し、メール本文中のテキストに、音声合成用の制御コードを付加して、メールの読み上げ方を指定することが考えられる。制御コード交じりのメールの例を図７６に示す。
【０２３０】
この場合、＠＜…＞で囲まれた部分が、制御コードおよびその指定で読み上げられる部分である。ｍａｌｅ、５、５、９は、特に性別（男性）、声の高さ、速度、声の大きさを示し、ここでは、「絶対に遅れないように」の部分だけが、その他の部分よりも大きな声で読まれる。このように、メール本文中の部分に対し、音声合成の細かな設定を可能にすることで、メール中の重要な所を強調したり、文章の抑揚を変えたり、引用した言葉を本人に近い特徴の合成音声で読ませて変化をつけるといった事が可能となる。
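Splitting a mail body into plain and controlled segments might look like the sketch below. The exact syntax of FIG. 76 is not reproduced in the text, so the form `@<gender pitch speed volume text>` used here is purely an assumed notation for illustration.

```python
import re

# Assumed notation: @<gender pitch speed volume text to be read>
CONTROL = re.compile(r"@<(\w+)\s+(\d+)\s+(\d+)\s+(\d+)\s+([^>]*)>")


def split_segments(body):
    """Return a list of (text, settings); settings is None for default reading."""
    segments = []
    pos = 0
    for m in CONTROL.finditer(body):
        if m.start() > pos:
            segments.append((body[pos:m.start()], None))
        gender, pitch, speed, volume, text = m.groups()
        segments.append((text, {"gender": gender, "pitch": int(pitch),
                                "speed": int(speed), "volume": int(volume)}))
        pos = m.end()
    if pos < len(body):
        segments.append((body[pos:], None))
    return segments
```

Each segment would then be passed to the synthesizer with either the default settings or the settings given by its control code.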
【０２３１】
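Because the mail tool described above can be controlled by voice in a multitasking environment, mail can conveniently be read by voice while a document is being written or a program edited with the keyboard and mouse.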
【０２３２】
なお、メールツールだけでなく、情報検索のためのツール、例えば英和、和英などの電子辞書や対訳辞書、類似表現、言い換えなどを引くための類似語辞書などのデータベースを本発明によるインターフェースにより音声で操作すれば、文書やメール作成中に調べたい単語などを音声による操作で引くことができるため、文書作成の中断を少なくできて便利である。
【０２３３】
メールの内容の確認を、表示によらず、音声読み上げを使って行なう際に、１つのメール全体を読み上げの対象とすることは、特に、大量のメールの中から所望のメール文書を検索する場合などには、効率が悪くなると考えられる。そこで、メールの読み上げの最中にメールツールに対するコマンドを発行可能にする。特に、そのコマンドは、音声入力によって行なえれば都合が良い。
【０２３４】
まず、読み上げモードを設け、メールを読み上げる際の単位を設定可能にしておく。読み上げモードには、全文、段落、文の３つのモードがある。図６３の右上の「読上」ボタンのとなりの「全文」の表示が読み上げモードを示す。「読上」ボタンにより、モードに従った音声合成を行なう。メール読み上げ時に使用する音声コマンドを、図６８に示す。
【０２３５】
ユーザは、モードを設定し、「読上」ボタンあるいは「読み上げ」と発声することにより、メールの読み上げを開始する。音声コマンド「ストップ」、「続行」により、読み上げの一時停止と再開を行なえる。「もう一度」は最後に読み上げた単位をもう一度読み上げる。「前の〜」および「次の〜」の「〜」は読み上げの単位であり、メールツールはコマンドに従ってモードを自動的に変更する。例えばモードが「全文」の時に「次の文」と入力すれば、モードは自動的に「文」に変わる。「次」および「前」は、「次の〜」および「前の〜」の省略表現であり、それらのコマンドで扱われる単位はモードとして現在設定されている単位である。「速く」「ゆっくり」は読み上げ速度の設定、「高く」「低く」は読み上げ合成音の声の高さ設定、「男性」「女性」は合成音声の性別の設定を行なう音声コマンドである。
【０２３６】
このように、メールの内容の音声による読み上げを可能にし、読み上げの制御を音声を使って行なうことで、マウスおよびキーボードのみを使って制御する時よりも、使い勝手が向上すると考えられる。特に、マルチウィンドウ環境において、聴覚と音声入力を音声メールツールの制御に使い、視覚とキー入力を別のタスク（例えばテキストエディタ）に使うことで、１人のユーザによる複数のタスクの同時制御が可能となる。
【０２３７】
音声合成機能は、メール文書の読み上げだけではなく、メールツールからユーザに対して提供されるメッセージにも利用可能である。例えば、マルチウィンドウ環境において、動作するメールツールがメッセージの出力に合成音声を利用する場合を考えてみる。まず、メールツールをその起動時にアイコン化しておく。メールツールが新規メールを受信すると、「××さんから新しいメールが届きました。未読分は全部で５通あります」といったメッセージを合成音声を使ってユーザに提供する。もちろんこのメッセージは、録音された音声データでも良いが、メッセージ文の変更し易さや、任意の数値データの読み上げを考えると、合成音声の方がメールツール等の応用プログラムの作成者にとっては都合が良い。新規メール受信通知のメッセージをいつも同じ様に出力するのではなく、例えば、メールに重要度を設定し、その重要度に従って音声メッセージを出力しなかったり、「××さんから緊急のメールが届きました」と、メッセージ文を変えたり、音声合成のパラメータを変更して声のトーンを変えることができる。メッセージとして、「サブジェクトは、会議通知です」と、メールの題についての情報を提供してもよい。このように、合成音声をメールツールのメッセージ出力に利用することで、ユーザは、メールツールを直接見ることなく、受信メールを読むか否かの決定ができる。
【０２３８】
新規メール受信のメッセージは、ユーザが計算機上で行なっている作業に割り込むメッセージであり、ユーザの作業に割り込んで欲しいか否かは、作業内容によりけりである。例えば、何らかのプログラムのデモンストレーション中には、メールに割り込んで欲しくないであろう。そこで作業の重要度を設定し、作業の重要度とメールの重要度を比較して、メールの重要度が作業の重要度以上なら音声メッセージを出力し、それに満たない場合は出力しない、といった事を行なう。作業の重要度は、作業環境全体に設定したり、個々のプログラムに設定したり、プログラム内のサブタスク毎に設定する事が考えられる。
【０２３９】
作業の重要度とメールの重要度を比較し、メールの受信の通知方法を決定するために、音声メールシステムを図６９に示す構成とする。メールシステム６９１は、メッセージ入出力部６９１１の介在によって、音声入出力システム６９２やウィンドウシステム６９３と接続されている。音声入出力システム６９２やウィンドウシステム６９３からのメッセージは、メッセージの内容に従い、メッセージ入出力部６９１１によってふりわけられ、そのメッセージを処理すべき所において処理が行なわれる。
【０２４０】
電子メール処理部６９１２は、外部の公衆回線やＬＡＮを通じ、電子メール文書の送受信や、受信したメールに対する処理を行なう。タスク重要度管理テーブル６９１３は、音声入出力システムに接続したすべての応用プログラムの作業の重要度を音声入出力システムから受け取り、管理する。このタスクの重要度と、受信したメールの重要度から、受信したメールをユーザに対してどのように知らせるかの役割も、電子メール処理部６９１２が担う。
【０２４１】
この機能を実現するために第１４実施例で述べた音声入出力システムの持つ応用プログラム管理テーブルを拡張し、項目として、タスク優先度を新たに設定する。図７０に拡張した応用プログラム管理テーブルを示す。ここでは、シェルツールのタスク優先度が「２」、ＤＴＰシステムのが「５」に設定されている。
【０２４２】
さらに、この応用プログラム管理テーブルに値を設定したり、値を読み取るためのメッセージとして、図７１に示すメッセージを新たに設ける。また、タスク優先度変更のたびにその通知をメールシステムが受け取れるようにするために、入力マスクとして、タスク優先度変更マスクを新たに設ける。
【０２４３】
メールシステムは、入力マスクとして、タスク優先度変更マスクと、入力タスク変更マスクを設定することにより、音声入出力システムに接続されているすべての応用プログラムのタスク優先度と、音声フォーカスの有無を得、図７２に示すようにその情報をタスク重要度管理テーブルに動的に反映することが可能である。電子メールの優先度は、例えば、”Ｐｒｅｆｅｒｅｎｃｅ：３”のようなヘッダ情報をメール文書に付加し、メール自体に重要度を設定することも可能であるし、発行者毎にメールの優先度を設定しても良い。メールシステムの電子メール処理部は、電子メールを受信するたびに図７３に示す処理を行なう。
【０２４４】
この場合、音声フォーカスが１つのタスクに当たっているか調べ（ステップ７８０１）、ＹＥＳならば音声フォーカスのあるタスクの優先度を選択し、ＮＯならば音声フォーカスの当たっているすべてのタスクの優先度の平均を選択する。例えば、その中で一番高い優先度を選択しても良い。そして、これらがメールの優先度より低いか調べ（ステップ７８０４）、ＹＥＳならば音声を使って通知し（ステップ７８０５）、ＮＯならば何も通知しない（ステップ７８０６）。この場合、アイコンの表示を変化させたり、動画像を用いたりといった様々な方法をユーザへのメールの受信通知に用いることができる。
【０２４５】
応用プログラムとして、メールシステム以外に、シェルツールとＤＴＰシステムが、音声入出力システムに接続している時の画面の表示例を図７４に示す。図７４（ａ）は、タスク重要度管理テーブルが図７２の状態の時の画面表示例である。ここで、重要度３を持つメールを受信したとすると、図７３に示した処理によればここで音声フォーカスの当たっているシェルツールの重要度が、メールの重要度より高い（小さい値ほど重要度が高いと見做す）ため、メールシステムはメールの受信をユーザに通知しない。これに対して、タスク重要度管理テーブルが図７５の状態にある（対応する画面表示例は図７４（ｂ））時に、先ほどと同様に、重要度３のメールを受信した際には、メールシステムは「新しいメールを受信しました」という音声出力を行ない、メールの受信をユーザに通知する。また、通知と同時にメールシステムは、自身に対して音声フォーカスを設定することによってユーザの作業に割り込み、ユーザにメールシステムを使わせることが可能である。
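The decision of FIG. 73 can be sketched as follows, treating smaller values as more important, as stated above. The table layout and function names are assumptions. The second task table in the usage note is a hypothetical state, since FIG. 75 is not reproduced here.

```python
def representative_priority(tasks, use_highest=False):
    """Priority value of the task(s) holding the speech focus."""
    focused = [t["priority"] for t in tasks if t["focus"]]
    if not focused:
        raise ValueError("no task has the speech focus")
    if len(focused) == 1:
        return focused[0]
    if use_highest:
        return min(focused)              # smaller value = more important
    return sum(focused) / len(focused)   # average over all focused tasks


def should_notify(tasks, mail_priority, use_highest=False):
    """Notify by voice only if the focused work is less important than the mail."""
    return representative_priority(tasks, use_highest) > mail_priority
```

With the shell tool (priority 2) holding the focus, a mail of importance 3 is not announced. If the focus were on the DTP system (priority 5), the same mail would be announced.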
【０２４６】
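In this way, by varying messages such as new-mail notifications according to the importance of the mail and the importance of the task, a flexible interface that does not hinder the user's work can be provided.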
【０２４７】
（第１６実施例）
第１５実施例におけるメール文書の読み上げ機能は、受信したメールの一部あるいは全部をその文面に対して何の変更も加えず、合成音声を使ってそのまま読み上げるものであった。この方法は、メール文書が少なく、おしなべて小さい場合には問題は少ないが、メールが多く、大きくなるに従い、その機能だけでは不十分である。
【０２４８】
図７７は、音声メールシステムの概略構成を示すもので、音声入出力システム８２１に接続される音声メールシステム８２２を電子メール処理部８２２１、文書要約部８２２２、メッセージ入出力部８２２３より構成している。この場合、図７８に示すように文書要約部８２２２を音声メールシステム８２２の外に設けるようにしてもよい。
【０２４９】
ここで、メールシステム８２２は、音声入出力システム８２１と接続してその音声入出力機能を用いる。電子メール処理部８２２１は、外部の公衆回線やＬＡＮを通じ、電子メール文書の送受信や、受信したメールに対する処理を行なう。文書要約部８２２２は、電子メールなどの文書を要約するシステムである。テキスト文を要約する技術としては、「石橋ほか、英文要約システム「ＤＩＥＴ」、情報処理学会第４８回全国大会、６Ｄ−９（１９８９）」や、「喜多、説明文を要約するシステム、情報処理学会自然言語処理研究会、６３−３（１９８７）」などが知られており、この技術を応用して、文書要約部を構成できる。
【０２５０】
文書要約部８２２２は、電子メール処理部８２２１から要約前のメール文書を受け取り、要約して返す。電子メール処理部８２２１は、受信したメールの重要度や、文書の長さや文書の内容などに従って、そのメール文書を要約するか否か、また、どのような要約を行なうかを決定し、要約方法の情報とともにメールを文書要約部にひきわたす。電子メール処理部８２２１は、メールを受信するたびに、例えば図７９に示すような処理を行ない、受信メールに対する要約方法を決定する。
【０２５１】
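In this case, it is determined whether the importance of the mail is "3" or higher (step 8401). If so, the mail is not summarized (step 8402). If it is not "3" or higher, the mail body is checked for the word 「至急」 ("urgent") (step 8403). If the body contains it, the length of the document is further checked (step 8404): a document that is not long is not summarized (step 8402), and a long one is summarized (step 8405). If the body does not contain 「至急」, summarization is limited to the first line (step 8406). The summarization chosen for the mail is then carried out (step 8407).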
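The decision procedure of FIG. 79 can be written compactly as below. The threshold for a "long" document is not given in the description, so the 20-line limit, the function name and the returned labels are assumptions.

```python
def choose_summary(importance, body, long_lines=20):
    """Return "none", "full" or "first_line" for a received mail."""
    if importance >= 3:                       # step 8401
        return "none"                         # step 8402
    if "至急" in body:                         # step 8403 ("urgent")
        if len(body.splitlines()) > long_lines:   # step 8404
            return "full"                     # step 8405
        return "none"                         # step 8402
    return "first_line"                       # step 8406
```

The returned label would be handed to the document summarization unit together with the mail.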
【０２５２】
メールのような文書の場合、その内容が完結していなかったり、短すぎたりして、要約に適さないこともあると考えられるが、その場合には、短いメールに対しては、要約を行なわない（必要がない）こともできるし、完結していなくて要約に失敗したメール文書に対しては、例えば、最初や最後の数行を取り出して読み上げるようにすれば、すべてのメールに対して何らかの要約処理をほどこすことができるといえる。要約は、例えば、音声による「要約」コマンドの形でユーザが指示することによってもできるし、あるいは、メールシステムが、受信メールの全てを（あるいは長いものだけを）自動的に要約しておくことによってもできる。
【０２５３】
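Thus, by providing the voice mail tool with a mail summarization function, mail processing can be made more efficient, which is convenient particularly for busy users and for users who must handle a large volume of mail.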
【０２５４】
（第１７実施例）
第１５および第１６実施例においては、音声入出力システムの提供する音声認識および合成機能の利用に関して、音声メールツールを使って述べた。
【０２５５】
これらは、ＧＵＩおよび音声出力を使ってユーザに情報を提供していたが、電話インターフェースなどの、ＧＵＩを利用できない環境において第１５および１６実施例で述べた機能はより有用である。本実施例では、ＧＵＩを利用しない電話を介した音声入出力インターフェースについて、音声メールシステムの例を使って述べる。
【０２５６】
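FIG. 80 shows the schematic configuration of the seventeenth embodiment. Here, a mail address table 853 is connected to the voice mail system 852, which is in turn connected to the speech input/output system 851.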
【０２５７】
この場合、音声入出力システム８５１は電話回線に接続されるが、この電話回線との接続は、既存技術を使えば可能であり、ここでは述べない。電話からの音声メールシステム８５２への入力は、音声およびプッシュボタンにより行なえるとする。
【０２５８】
メールは個人情報であるため、電話でメールの内容を確認する前にあらかじめ個人情報の認証手続が必要である。これは電話のプッシュボタン等で行なうかパスワードの音声認識、あるいは話者照合技術により行なう。
【０２５９】
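After the user has been verified by the authentication procedure, access to mail proceeds interactively using speech recognition. The voice mail system 852 described here can use all of the speech recognition and speech synthesis functions described in the fifteenth and sixteenth embodiments. That is, by voice input, the user can check the whole of a mail, a part of it, or a summary of it. Operation of the voice mail system 852 is basically done entirely by voice, so mail is also sent by voice. Since entering mail content with push buttons is not practical over a telephone interface, the mail content itself is also speech. A mail document can be created by voice by performing speech recognition and speech recording at the same time. The configuration of FIG. 80 does not prevent recognition and recording from being performed simultaneously. FIG. 81 shows an example of creating a mail document by voice. The scenario is that the user has checked the content of a received mail by voice (synthesized speech or recorded natural voice) and is now replying to it.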
【０２６０】
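First, the mail system recognizes the user's utterance "start recording" in (1) and records the user's subsequent speech (2), "This is ○○ ... please pass it on", as the mail document. The final "stop, stop" in (2) is the command that stops recording. "Stop" is repeated in order to distinguish "stop" as a command from a "stop" occurring in the mail body. The whole phrase "stop, stop" may also be treated as a single recognition vocabulary item. The mail system cuts the "stop, stop" section from the recorded data. The user checks the content (4) of the mail document with "check content" in (3) and sends the mail with "send" in (5). Finally, the message in (6) confirms that the mail has been sent.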
【０２６１】
ここで、（２）でユーザがデータを録音する際に、音声認識システムの音声認識部の中の音声検出部に音声データの先頭を検出させれば、「録音開始」から本文の入力までの間に間があいても、その無音区間を録音せずに済む。
【０２６２】
また、録音終了を指定するための「ストップ、ストップ」などの単語の代わりに「送信」と発声し、「送信」を認識したならば、録音内容をメールデータとして自動的に送信してしまうこともできる。こうすれば、録音の停止を指定する「ストップ」の発声が不要になり、簡単にメールを送信できる。この時、送信したメールの内容は、「内容確認」などの発声により確認しなくとも、自動的に録音内容を再生することによって確認できる。
【０２６３】
また、「録音開始」後、１つの音声区間を録音するようにすると、「ストップ、ストップ」のような録音停止命令は不要となる。音声区間の終端は、例えば「３秒間無音であれば音声データの入力終了とみなす」のように、余裕をもたせた設定にすれば、ユーザが一息でメッセージを入力しなければならないというような制約が緩和される。
【０２６４】
このように、データとしての音声区間を検出するために、応用プログラムと音声認識システムとの間のメッセージとして、図８２のメッセージを追加する。この音声区間検出メッセージは往復メッセージであり、図８３に示すような手順でもって、音声区間のデータを入力音声から切り出すことができる。音声区間検出メッセージでは、パラメータとして、音声の終端を検出するための時間（例えば、３秒間無音区間が続いたら、その無音区間の前を音声区間とみなす）や、入力音声がない場合のタイムアウト指定（要求を発信してから３０秒たったら、音声区間は検出されなかったとみなす）ができる。
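The two parameters of the speech segment detection message can be illustrated with the sketch below. It works on a per-frame speech/non-speech decision, which is assumed to come from the speech detection unit, and expresses both parameters in frames. With 10 ms frames, the 3-second and 30-second examples above correspond to 300 and 3000 frames. All names are hypothetical.

```python
def detect_segment(frames, end_silence_frames=300, timeout_frames=3000):
    """Cut one speech segment out of a stream of per-frame decisions.

    frames: iterable of booleans (True = speech detected in that frame).
    Returns (start, end) frame indexes with end exclusive, or None if no
    speech starts before the timeout.
    """
    start = None
    last_speech = None
    for i, is_speech in enumerate(frames):
        if is_speech:
            if start is None:
                start = i
            last_speech = i
        elif start is None:
            if i >= timeout_frames:
                return None                   # timeout: no segment detected
        elif i - last_speech >= end_silence_frames:
            return (start, last_speech + 1)   # enough trailing silence
    if start is None:
        return None
    return (start, last_speech + 1)           # input ended during the segment
```

Short pauses inside the message do not end the segment, which relaxes the constraint of having to speak in one breath.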
【０２６５】
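As described here, when replying to a received mail, a subject for the reply can be filled in: in UNIX mail terms, "Subject: re: hello" for a received mail with "Subject: hello". When a new mail is created over the telephone, however, no subject can be attached to it. To make this possible, speech recognition is combined. An example is shown in FIG. 84.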
【０２６６】
この場合、ユーザの（１）「サブジェクト」という音声をメールシステムが認識すると、メールシステムは、サブジェクト入力モードになる。このモードでは、予め決められたサブジェクト（題）用の単語が認識対象語彙となる。例えば、「こんにちは」「お知らせ」「至急連絡下さい」「ごくろうさま」「会議通知」などが考えられる。図８４の例では、（２）「会議通知」を入力する。メールシステムは「会議通知」を認識すると、メール文書にテキスト“ Subject :会議通知“を挿入し（３）、（４）のような確認のメッセージを合成音声により行なう。
【０２６７】
サブジェクト入力モード時の認識結果をうけて行なうのは、メールの題の挿入だけではなく、例えば、定型的なメール文書の入力が可能である。図８５は、「ごくろうさま」という入力に対して、メールの本文として挿入される定型メールの例である。文書中の｛receiver｝と｛sender｝は、受信者、発信者の代入される変数を表している。この変数により、誰にでも同一の文面のメールを音声だけで送信できる。定型メールをデータベース化し、そのデータを音声で呼び出すことが可能であれば、便利であると考えられる。
【０２６８】
また、第１５実施例においては、メール文書中の任意の場所に音声データを追加・挿入可能としたが、サブジェクト入力モードにおいて、サブジェクト自体に音声データを付けることを可能とし、例えば、メールの受信と同時に音声サブジェクトを出力すれば、メールの発信者やメールの内容が受信者に伝わり易いと考えられる。もちろん、音声認識によるサブジェクトの挿入と音声サブジェクトの録音を同時に行なっても良い。
【０２６９】
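To specify a destination from the telephone, rather than sending a reply to a received mail, speech recognition is used. For this purpose, words are registered beforehand by applying the learning function, and recognition vocabulary items are linked to mail addresses. For example, the mail system is given an address book with the appearance shown in FIG. 86, and a mail address is linked to speech by the mail address registration function shown in FIG. 87.
The registration procedure is:
・Open the mail address book (FIG. 86).
・Open the registration window (FIG. 87) and start registering a new mail address.
・Enter the name and address with the keyboard.
・Utter the new word (Suzuki in this example) as many times as learning requires (several to several tens of times).
・Press the OK button to complete the registration.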
【０２７０】
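In this way, the recognition vocabulary item (Suzuki) is linked to the mail address (Suzuki@aaa.bbb.ccc.co.jp) and used over the telephone, for example following the procedure of FIG. 88. First, in (1) the user utters "destination". When this is recognized, the mail system outputs message (2) by voice and prompts the user. In (3), the vocabulary registered by means of FIGS. A and B and the like is subject to recognition. In this example, when "Suzuki" is recognized, to: Suzuki@aaa.bbb.ccc.co.jp is inserted into the mail document.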
【０２７１】
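(4) and (5) show how the mail address is confirmed. As with the speech "Suzuki" in (4), one of the utterances used at registration in FIG. 87, for example, can be recorded automatically and used to confirm the recognition.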
【０２７２】
"Suzuki@..." in (5) is an example of confirmation by spelling out the address letter by letter in synthesized speech.
【０２７３】
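With this method, specifying a mail address by voice applies only to addresses registered in advance. However, as described next, mail addresses that have not been registered in advance can also be specified by voice. For this, a function is first added that automatically builds a database of mail addresses from mail the user received in the past. In UNIX mail, mail addresses are contained in the mail header, so building a database from them is not difficult. A mail address has, for example, a structure such as
user name@section name.organization name.organization type.country
so a database with a tree-like hierarchical structure can be built in the reverse order of the address (country → user name).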
【０２７４】
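Starting from the country, the mail system traverses the mail address step by step, reading the choices aloud with synthesized speech as in FIG. 89. In the example of FIG. 89, when a wrong node (a branch point reached while traversing the address) has been selected, a word such as "cancel" returns to the previous (upper) node, and a word such as "abort" abandons address entry. In addition, a recognition vocabulary item can be linked to any node in advance so that, for example, uttering a company name jumps directly to that company's mail address node.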
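Building and traversing such an address tree can be sketched as follows, assuming the usual dot-separated form of the domain part. The `"@users"` key used to store user names at a node, the function names and the second example address are arbitrary choices for this illustration.

```python
def build_tree(addresses):
    """Build a nested dict keyed by domain labels, from the country downwards."""
    tree = {}
    for address in addresses:
        user, domain = address.split("@", 1)
        node = tree
        for label in reversed(domain.split(".")):   # country first
            node = node.setdefault(label, {})
        node.setdefault("@users", []).append(user)
    return tree


def choices(tree, path):
    """Return (child labels, user names) available at the node given by path."""
    node = tree
    for label in path:
        node = node[label]
    labels = sorted(k for k in node if k != "@users")
    return labels, node.get("@users", [])
```

At each step the system would read the child labels aloud and make them the recognition vocabulary. Going back one node corresponds to dropping the last element of `path`.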
【０２７５】
このような方法をとれば、過去にメールをくれた人に対してならば、音声をつかってそのメールアドレスを指定することが可能となる。
【０２７６】
また、単語単位の認識辞書が不要な音韻認識をベースとした音声認識システムが広く研究されているが、これを用いることにより過去に届いたメール中に該当するアドレスがない場合でも、音声によってアドレスを入力し、メールを転送することが可能である。
【０２７７】
（第１８実施例）
本発明の第１実施例や第１４実施例で述べた音声認識インターフェースでは、音声認識システムあるいは音声入出力システム専用に開発した応用プログラムを対象として、音声認識や音声合成のサービスを提供するものであった。本実施例では、上記のような専用プログラムに対する音声による制御に加えて、前記音声認識システムあるいは音声入出力システムと直接メッセージをやり取りすることのできないような任意の応用プログラムに対する音声による制御を可能とする拡張を前記音声認識インターフェースに施すものである。これによって、音声認識の応用分野とユーザの拡大を図ることができる。本実施例では、第１４実施例に上記拡張を施した例を説明するが、同様の拡張を第１実施例に施すことが可能であることは明らかである。
【０２７８】
以下、本実施例について説明する。
図９０は、本実施例の音声入出力インターフェースの全体構成であり、第１４実施例で述べたものと同一の音声入出力システム１と、そのメッセージ処理部１１（図示せず）に応用プログラムとして接続された音声インターフェース管理システム（以下、ＳＩＭと呼ぶ）１０４からなる。
【０２７９】
汎用応用プログラム（以下、ＧＡＰと呼ぶ）１０３は、音声入出力システム１と直接接続されていない応用プログラムであり、音声入出力システム１とは全く独立して動作可能なプログラムである。これに対して、専用応用プログラム（以下、ＳＡＰと呼ぶ）１０２は、音声入出力システム１と直接接続して動作するものである。
【０２８０】
ＳＩＭ１０４は、ＳＡＰの一つであり、音声入出力システム１とＧＡＰ１０３との仲立ちをして、ＧＡＰ１０３に対する音声による操作を可能にする応用プログラムである。音声フォーカスの表示も、ＳＩＭ１０４が行なう。なお、ＳＡＰ１０２は、図５０の応用プログラム２に対応するものである。ＳＡＰおよびＧＡＰは、１つの音声入出力システムに対してそれぞれ複数個存在することが可能である。
【０２８１】
次に、ＳＩＭ１０４による、ＧＡＰ１０３に対する操作について説明する。ＧＡＰ１０３は、ＳＡＰ１０２と異なり音声入出力システムと直接接続されてはおらず、ＧＡＰ１０３が受け付けられる入力は、音声以外のキーボードやマウスといった入力装置からのものである。従って、ＳＩＭ１０４は、音声によるＧＡＰ１０３の操作を実現するために、音声入力をＧＡＰ１０３の受理できる形の入力、例えばキーボード入力やマウス入力等に変換する。
本実施例では、ＳＩＭ１０４は、図９０に示すように、音声インターフェース管理部１４１、プログラム操作登録部１４２、メッセージ変換部１４３から構成される。音声インターフェース管理部１４１内には、応用プログラムごとの音声認識結果と操作との対応表が設けられており、この対応表（以下、音声インターフェース管理テーブルと言う）の情報は、プログラム操作登録部１４２によって登録される。前記メッセージ処理部１１と直接接続されるメッセージ変換部１４３は、音声入出力システム１とのメッセージのやり取りを行なう機能、つまり図６のメッセージ入出力部２１の機能を包含するものであり、認識結果を受信した際に、音声インターフェース管理テーブルを参照して、該認識結果をＧＡＰ１０３に対する操作コマンドに変換し、ＧＡＰ１０３に送信する。
【０２８２】
ＳＩＭ１０４からＧＡＰ１０３に操作コマンドを送るには、ＧＡＰ１０３自身が他のアプリケーションからの操作の手段を提供していなければならない。
【０２８３】
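In the case of an application that uses a window system, the SIM 104 sends the GAP 103, via the window system, the same messages that are generated when operation commands are entered to that GAP 103 from input devices such as keys and the mouse. Such message sending is easily implemented with functions in the libraries provided by window systems such as the X Window System. In practice, in a window system the destination of a message may be not the GAP 103 itself but an object such as a window created inside the GAP 103. In some cases the identifier of that object is needed when the message is sent, but the identifier of the destination object is easily determined from the contents of the program operation registration described later or from identifier information obtained by querying the window system.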
【０２８４】
次に、具体例をあげて説明する。図９１に示すように、１つの音声入出力システム１に対して、音声インターフェース管理システム１０４とメールツール１２０が直接接続して動作し、また音声入出力システム１と直接接続できないＧＡＰであるシェルツール１３０とエディタ１３１が並行して動作しているとする。このときの画面表示は、例えば図９２のように行なえる。
【０２８５】
この場合のＳＩＭ１０４の音声インターフェース管理テーブルの一例を図９３に示す。この表における“プログラム名”は、認識対象語彙であり、ユーザがプログラム名を発声することで応用プログラムに対する疑似音声フォーカスを切り換えることができる。“応用プログラム”は、応用プログラム自体の識別子であると共に、コマンドの送信対象を表す。
【０２８６】
上記の疑似音声フォーカスは、応用プログラムに対して疑似的に設けた音声フォーカスである。ＧＡＰは音声入出力システム１と直接接続しておらず、従って、音声入出力システム１はＧＡＰの存在を関知しないため、ＧＡＰに対して本当の音声フォーカスは設定されない。ＳＩＭ１０４は、「シェルツール」や「エディタ」等、ＧＡＰの名前を認識結果として受け取ると、そのプログラムについて定義されているコマンド名を認識対象語彙とする設定要求を、音声入出力システムに対して行なう（例えば、「シェルツール」の場合、「エルエス」や「プロセス」）。そして、図１２や図１９等で示したような音声フォーカスの表示をそのプログラムに対して行なう。
【０２８７】
図９４に示すように、ＧＡＰ１０３に関係する真の音声フォーカスはＳＩＭ１０４に設定され、実際に画面に表示されるのは疑似音声フォーカスである。ＳＩＭ１０４が、プログラム名の認識をきっかけにして、認識のコンテキストを切り換えるのである。なお、メールツールにみるように、ＳＡＰの疑似音声フォーカスと真の音声フォーカスは合致する。
【０２８８】
ＳＩＭおよびＧＡＰのコマンド名の属性は、ＳＩＭに対してローカルである。すなわち、ＳＩＭに音声フォーカスが設定されているときに認識対象となる。ＳＡＰにコマンドを送信する際、ＳＩＭ１０４に音声フォーカスが設定されない状態であるため、ＳＡＰ１０２に関するコマンド名は、グローバル属性を持つ。例えば、図９３のメールツールのコマンド名「終了」の属性がグローバルである。なお、図９３で、ローカル，グローバルといった認識対象語彙の属性は、プログラム名および認識対象語彙の欄の括弧内に示されている。属性値は、“０”がローカル、“１”がグローバルである。
【０２８９】
このようなメッセージ変換部１４３の処理手順の一例を図９５に示す。すなわち、音声入出力システム１のメッセージ処理部１１から受信した認識結果がプログラム名である場合、直前の疑似フォーカスに関するコマンド名を認識対象からはずし（ステップ９００３）、認識したプログラム名を持つ応用プログラムに疑似フォーカスを設定し（ステップ９００４）、その応用プログラムのコマンド名を認識対象として設定（追加）する（ステップ９００５）。
【０２９０】
一方、受信した認識結果がプログラム名でない場合（ステップ９００２）、コマンド名に対応するコマンドを、疑似フォーカスの設定されている応用プログラムに送信する（ステップ９００６）。
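The procedure of FIG. 95 can be sketched as below. The table contents stand in for FIG. 93, which is not reproduced here. The key sequences, the rule that program names are always recognizable, and the `sent` list that replaces actual delivery through the window system are all assumptions.

```python
# Hypothetical speech interface management table:
# program name -> {command word: operation sent to that program}
TABLE = {
    "shell tool": {"ls": "ls\n", "process": "ps\n"},
    "editor": {"cut": "^W", "copy": "^C", "paste": "^Y"},
}


class MessageConverter:
    """Converts recognition results into operations on the focused program."""

    def __init__(self, table):
        self.table = table
        self.focus = None                 # program holding the pseudo speech focus
        self.vocabulary = set(table)      # program names
        self.sent = []                    # (program, operation) pairs "sent" so far

    def on_result(self, word):
        if word in self.table:                                    # step 9002
            if self.focus is not None:
                self.vocabulary -= set(self.table[self.focus])    # step 9003
            self.focus = word                                     # step 9004
            self.vocabulary |= set(self.table[word])              # step 9005
        elif self.focus is not None and word in self.table[self.focus]:
            operation = self.table[self.focus][word]
            self.sent.append((self.focus, operation))             # step 9006
```

In an actual system, the vocabulary changes would be issued as requests to the speech input/output system and the operation would be delivered as key or mouse events.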
【０２９１】
以上述べたように、本実施例のような構成をとることにより、既に存在する音声入力（認識）を用いない応用プログラム（ＧＡＰ）に対しても、音声認識の利用が可能となり、ユーザの拡大と使い勝手の向上が実現できる。
【０２９２】
（第１９実施例）
ウィンドウベースのＧＵＩ（グラフィカル・ユーザ・インターフェース）を持つシステム下では、１つのプログラムを複数のウィンドウを使って構成することができる。本実施例では、上記第１８実施例をもとに、複数のウィンドウを持つ応用プログラムの個々のウィンドウに対する音声入力を可能にするべく、システムを拡張した例を説明する。これにより、よりきめ細かい音声認識の利用が可能となり、操作性が向上する。
【０２９３】
これまで説明してきた実施例においては、音声入出力システム１によって音声フォーカスが設定可能な単位は、“応用プログラム”であったが、本実施例では、その単位を“音声ウィンドウ”とする。音声ウィンドウは、応用プログラム中に複数個作成可能であり、個々の音声ウィンドウは、音声ウィンドウ名、入力マスク、および認識対象語彙セットを持つ。
【０２９４】
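FIG. 96 shows the speech input/output system 1 described in the fourteenth embodiment (see FIG. 50), extended to handle speech windows. The application program management table 13 in FIG. 96 is extended as described later. A speech window 23 has been added to the application program 2, but the substance of the speech window 23 resides in the application program management table 13 of the speech input/output system 1.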
【０２９５】
以下、具体例をあげて説明する。第１８実施例と同様に、応用プログラムとして、ＳＩＭ（１０４）、シェルツール、エディタ、およびメールツールの４つが動作しているとする。このうち、ＳＩＭとメールツールはＳＡＰであり、シェルツールとエディタはＧＡＰである。図９７のように、シェルツールとエディタをそれぞれ２つのウィンドウから構成し、それ以外を１つのウィンドウから構成したとする。この場合の音声入出力インターフェース全体の構成を図９８に示す。専用プログラム（ＳＡＰ）であるメールツール１２０は、自分用の音声ウィンドウ２２３を持ち、ＳＩＭ１０４は、自分用の音声ウィンドウ０（１４４₀）に加えて、汎用プログラム用の音声ウィンドウ１〜４（１４４₁〜１４４₄）を持つ。この音声ウィンドウは、図９７に示すようないわゆるウィンドウシステム（図示せず）やＯＳ（図示せず）におけるウィンドウとは異なり、ビジュアルな属性を持たないものである。ウィンドウシステムのウィンドウは、通常、ツリー構造を持ち、その構造やウィンドウシステムの内部状態の変化を応用プログラム内部から知ることができる。ＳＩＭ１０４は、そのようなウィンドウシステムの情報と、音声入出力システム１の情報にアクセスし、ウィンドウと音声ウィンドウとを結びつけて協調的に動作させ、統一的なユーザインターフェースを提供する。ウィンドウと音声ウィンドウとの結び付けは、両者にウィンドウ名などの一意かつ同一の属性を付与したり、プログラム操作登録部１４２で対話的に行なうことで可能である。
【０２９６】
音声ウィンドウはその属性として、ウィンドウ名、認識対象語彙、入力マスク等を持ち、音声入出力システム１はこの音声ウィンドウ単位で音声フォーカスの設定を行なう。ウィンドウ名やコマンド名などの認識対象語彙の属性として、ローカル、グローバルに加え、ウィンドウを設ける。ローカル属性を持つ語彙は、それが属する音声ウィンドウに音声フォーカスが設定されている時に認識対象となる。グローバル属性を持つ語彙は、音声フォーカスがどこに設定されていようと常に認識対象となる。ウィンドウ属性を持つ語彙は、それが属する音声ウィンドウに音声フォーカスが設定されていなくとも、その音声ウィンドウと同じ応用プログラムに属する音声ウィンドウに音声フォーカスが設定されている時に認識対象となる。
【０２９７】
また、複数の音声ウィンドウをグループ化して認識語彙を混合し、認識結果に応じて自動的にその認識語彙の属する音声ウィンドウへ結果を送信することもできる。例えば、応用プログラム管理テーブルが図１０２の状態の場合に、シェルツールとエディタをグループ化してエルエス、プロセス、カット、コピー、ペーストを１度に認識し、エルエスまたはプロセスが認識された場合はシェルツールへ認識結果を送り、カット、コピー、またはペーストが認識された場合にはエディタへ認識結果を送るようにする。
【０２９８】
これにより、シェルツールとエディタの間の音声フォーカスの移動を省略して効率的に両者の作業を行うことができる。複数の音声ウィンドウの語彙の中にも同じものがある場合には、それを語彙として持つ複数の音声ウィンドウへ同時に認識結果を送信しても良いし、音声フォーカスの当たっている音声ウィンドウを優先させることにしても良い。なお、グループ化は、図１０２の応用プログラム管理テーブルのグループ化ＩＤの属性により、行うかどうかを決めることができる。
【０２９９】
また、音声ウィンドウのグループ化の一方法として、音声ウィンドウに親子関係を導入し、親ウィンドウと子ウィンドウをグループ化して両者の語彙を同時に認識することもできる。例えば、応用プログラム管理テーブルが図１０２の状態の場合に、シェルツールの設定ウィンドウに関して、その親のシェルツールウィンドウと設定ウィンドウをグループ化する。そして、設定ウィンドウに音声フォーカスが当たったときに両者の混合した語彙によって認識を行う。
【０３００】
これにより、子音声ウィンドウに音声フォーカスが当たっている場合に、音声フォーカスの移動を省略してその親ウィンドウへの音声入力を行うことができ、作業が効率化できる。なお、親ウィンドウと子ウィンドウで同じ語彙を持つ場合には、音声フォーカスの当たっている子ウィンドウに優先して認識結果を送るようにできる。
【０３０１】
図９８の状態の時、ＳＩＭ１０４の音声インターフェース管理部１４１内の音声インターフェース管理テーブルは、図９９のようになる。図９３のテーブルにウィンドウＩＤを加え、プログラム名の替りにウィンドウ名を追加した形である。ウィンドウＩＤとは、ウィンドウシステムにおけるウィンドウの識別子である（図９７参照）。図９９に示すように、ウィンドウＩＤと音声ウィンドウＩＤとは一対一に対応しており、この表を用いてＳＩＭ１０４はウィンドウと音声ウィンドウとを連動させる。例えば、この例でいうと「シェルツール」を認識したならば、ＳＩＭ１０４はＩＤ＝１の音声ウィンドウに音声フォーカスを設定し、ＩＤ＝１０１のウィンドウの表示を図１９に示したように音声フォーカスの設定された状態にする。
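As a rough sketch of how such a table can link a window to its speech window, the following assumes a simple list of rows. The class and function names are hypothetical. Only the shell tool row (window ID 101, speech window ID 1) is taken from the text; the editor's speech window ID is an assumption for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterfaceRow:
    window_name: str
    window_id: int          # identifier in the window system
    speech_window_id: int   # identifier in the speech input/output system


TABLE = [
    InterfaceRow("shell tool", 101, 1),  # pairing stated in the text
    InterfaceRow("editor", 103, 3),      # speech window ID assumed
]


def focus_targets(recognized_name, table):
    """Return (speech_window_id, window_id) for a recognized window name.

    The caller would set the speech focus on the first ID and change the
    display of the window with the second ID. Returns None if unknown.
    """
    for row in table:
        if row.window_name == recognized_name:
            return row.speech_window_id, row.window_id
    return None
```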
【０３０２】
ウィンドウシステムやＯＳによっては、他の応用プログラムウィンドウの表示を変更できない場合があるが、そのときには図１００の斜線部ｗ１で示すような形で独立した別のウィンドウを他の応用プログラムのウィンドウに貼り付け、音声フォーカスの所在を示す。この外付けウィンドウの表示の例を図１０１に示す。図のように、応用プログラムの上部に音声フォーカスを示す表示（ウィンドウ）が示される。なお、このウィンドウの位置は、音声フォーカスが明示できればどこでも良く、また数もいくつでも良い。また、静止画だけでなく、動画像を使うことで、音声フォーカスの位置がより分かり易くなる。
【０３０３】
ここで、図１８で示した音声入出力システム１の応用プログラム管理テーブル１３は、図１０２に示すように拡張される。新たな欄として音声ウィンドウＩＤおよびウィンドウ名が付加されている。音声ウィンドウＩＤは、音声フォーカスの設定されている音声ウィンドウの識別子であり、ウィンドウ名はその名前である。ローカル，グローバルといった認識対象語彙の属性は、ウィンドウ名および認識対象語彙の欄の括弧内に示されている。属性値は、“０”がローカル、“２”がグローバル、“１”がウィンドウである。音声入出力インターフェース１の構成が図９８である場合の音声入出力システム１の応用プログラム管理テーブル１３は図１０２に示す状態にあり、音声インターフェース管理システム１０４の音声インターフェース管理テーブルが図９９に示す状態にある。この時、疑似音声フォーカスによって、ユーザには、音声フォーカスが“シェルツール”（ウィンドウＩＤ＝１０１）に設定されているように見えている。一方、真の音声フォーカスは、ウィンドウ（ＩＤ＝１０１）と対応付けられた所の音声ウィンドウ（ＩＤ＝１）に設定されており、その音声ウィンドウは、ＳＩＭ１０４に属している。例えば、この状態で認識可能な語彙は、「エルエス」、「プロセス」、「シェルツール」、「エディタ」、「メールツール」、「システム」、および「設定」である。
【０３０４】
上記構成において、音声入出力システム１が認識処理を行い、その認識結果が、それぞれの語彙が設定されている音声ウィンドウに送られる。図１０３に、この認識処理の手順の一例を示す。
【０３０５】
まず、ウィンドウ（０）について、音声フォーカスが設定されている場合、当該ウィンドウ（０）に設定されている語彙を認識語彙リストに追加する（ステップ９１０３）。一方、音声フォーカスが設定されていない場合、当該ウィンドウ（０）が音声フォーカスの設定されている音声ウィンドウと同じ応用プログラムに属すときは、当該ウィンドウ（０）の語彙のうち属性値が“１”であるものを認識語彙リストに追加し（ステップ９１０５）、属しないときは、当該ウィンドウ（０）の語彙のうち属性値が“２”であるものを認識語彙リストに追加する（ステップ９１０６）。
【０３０６】
以上の処理を、ウィンドウ（１）をはじめとする他の全ウィンドウについて行う。
【０３０７】
そして、認識処理を行い（ステップ９１０８）、第１位の認識結果がウィンドウ名である場合、第１位の語彙が設定されていたウィンドウに音声フォーカスを設定し（ステップ９１１０）、ウィンドウ名でない場合、第１位の語彙が設定されていたウィンドウに上記認識結果を送信する（ステップ９１１１）。
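The procedure of steps 9103 to 9111 can be sketched as below. This is an illustrative reading rather than the flowchart of FIG. 103 itself: the class, the field names, and the sample windows are assumptions, and only the attribute values (0 = local, 1 = window, 2 = global) come from the text. The sketch also admits global words in the same-application branch, following the definition that global vocabulary is always recognizable, whereas step 9105 as worded mentions only attribute "1".

```python
from dataclasses import dataclass, field

LOCAL, WINDOW, GLOBAL = 0, 1, 2  # attribute values given for FIG. 102


@dataclass
class SpeechWindow:
    window_id: int
    app: str    # owning application (hypothetical field)
    name: str   # window name, itself a recognizable word
    vocabulary: dict = field(default_factory=dict)  # word -> attribute


def build_recognition_list(windows, focus_id):
    """Map each currently recognizable word to the speech window owning it."""
    focused = next(w for w in windows if w.window_id == focus_id)
    active = {}
    # Visit the focused window first so it wins when a word is duplicated.
    for w in sorted(windows, key=lambda w: w.window_id != focus_id):
        for word, attr in w.vocabulary.items():
            if (w.window_id == focus_id
                    or (w.app == focused.app and attr in (WINDOW, GLOBAL))
                    or attr == GLOBAL):
                active.setdefault(word, w.window_id)
    return active


def dispatch(top_word, windows, focus_id):
    """Return (new_focus_id, delivery); delivery is (window_id, word) or None."""
    target = build_recognition_list(windows, focus_id).get(top_word)
    if target is None:
        return focus_id, None
    owner = next(w for w in windows if w.window_id == target)
    if top_word == owner.name:
        return target, None              # window name: move the speech focus
    return focus_id, (target, top_word)  # otherwise: send the result


WINDOWS = [
    SpeechWindow(1, "shell", "shell tool",
                 {"shell tool": GLOBAL, "ls": LOCAL, "process": LOCAL}),
    SpeechWindow(2, "shell", "settings", {"settings": WINDOW}),
    SpeechWindow(3, "editor", "editor", {"editor": GLOBAL, "cut": LOCAL}),
    SpeechWindow(4, "editor", "settings", {"settings": WINDOW}),
]
```

Because "settings" carries the window attribute in both speech windows 2 and 4, only the one belonging to the application that currently holds the focus enters the list, which is what lets two applications use the same window name.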
【０３０８】
例えば、図１０２において、認識可能な語彙の１つである「設定」の設定されている音声ウィンドウは２つ（ＩＤ＝２とＩＤ＝４）あるが、それぞれの語彙の属性が“１”（＝ウィンドウ）であることから、ここで認識した結果「設定」は、音声ウィンドウＩＤ＝２に送られる。これに対して、音声フォーカスが音声ウィンドウＩＤ＝３に設定されている場合に認識された「設定」は、音声ウィンドウＩＤ＝４に送られる。ウィンドウ名を認識した際に音声入出力システム１の動作としては、単に認識結果をウィンドウ名の属する音声ウィンドウに送ることもできるし、送らずに音声フォーカスをその音声ウィンドウに設定することもできる。
【０３０９】
このように、認識対象語彙にウィンドウ属性を持たせることで、複数の応用プログラムのウィンドウに同一の名前を付け、操作することが可能となる。本実施例により音声認識インターフェースとしての使い勝手が大幅に向上する。
【０３１０】
（第２０実施例）
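As described in the eighteenth and nineteenth embodiments, the speech interface management system 104 converts the speech messages from the speech recognition system and forwards them, which makes speech input possible even for existing application programs that have no means of communicating directly with the speech input/output interface.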
【０３１１】
既存の応用プログラムに本発明の音声入出力インターフェースを適用する場合には、既存のプログラムの操作と、それを行うための語彙との対応を、音声入出力インターフェース専用の応用プログラムとは別個にとる必要がある。この実施例では、“語彙”と“プログラムの操作”との対応をとるためのプログラム操作の登録について説明する。
【０３１２】
プログラム操作の登録では、音声フォーカスを目的の応用プログラムに移動させるのに用いるプログラム名またはウィンドウ名の登録と、既存の応用プログラムの操作を行なうためのキー入力またはマウス入力イベントの系列と語彙との対応づけを行なう。例えば、シェルツールのウィンドウを２つ使う場合には、ウィンドウ名として「シェル１」、「シェル２」と付け、シェルツールの中で行なう操作、例えば画面上の文字を全部消去するクリア（clear ）コマンドを行なうためのキー入力系列に対し「クリア」という単語を割り当て、登録する。
【０３１３】
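Ordinarily, a general application program does not hold names for the windows it displays. To designate a window by name, the window must therefore be given a name so that the target window can be identified by window name from the speech interface management table. For this reason, as shown in FIG. 99 of the nineteenth embodiment, the speech interface management table has fields that store the window ID, which is the window identifier in the window system, and the window name. Using this table, when for example "editor" is sent as a recognition result, the speech interface management unit 141 sets the pseudo speech focus on the window whose window ID is 103. The window ID is obtained by accessing information held by the window system (not shown), for example by querying the window system server (not shown) for information on the window structure, although the window name cannot always be obtained at the same time. One way to obtain the window ID and the window name together is to start the program with a window name specified. However, when a program that is already running creates a new window, such as a pop-up window, it is difficult to assign the name before startup. In such a case, the window can be named by clicking it with the mouse to acquire its window ID and then associating a window name with that window ID. The ID of the clicked window is easily obtained by querying the window system server.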
【０３１４】
次に、ウィンドウへの名前付けとプログラム操作の登録方法について以下に説明する。
図１０４は、前記プログラム操作登録部１４２の構成である。このプログラム操作登録部１４２は、登録内容の画面への表示とユーザからの入力を行なうプログラム操作表示編集部１５１と、登録内容をファイル２００に保存する登録内容保存部１５２と、ウィンドウシステムからウィンドウＩＤを取得するウィンドウＩＤ取得部１５３からなる。
【０３１５】
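The program operation display/editing unit 151 displays a registration screen such as that of FIG. 105, accepts input of window names, program operations, word names and the like, and writes the registered contents into the speech interface management table in the speech interface management unit 141. The registered contents storage unit 152 saves the registered program operations in the file 200. The window ID is easily acquired by querying the window system server.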
【０３１６】
図１０５の登録画面は、プログラム操作登録内容を音声インターフェース管理テーブルに書き込む「登録」ボタン、入力内容を取り消して入力前の状態に戻すための「取り消し」ボタン、登録を終了するための「終了」ボタン、対象とする一般応用プログラムのウィンドウＩＤを取得するための「ウィンドウＩＤ取得」ボタン、応用プログラムの種類を入力する「応用プログラムクラス」（ＡＰクラス）ウィンドウ、ウィンドウ名を入力する「ウィンドウ名」ウィンドウ、および語彙とそれに対応したプログラム操作を表すキー入力系列またはマウス入力系列を入力するプログラム操作入力ウィンドウからなる。
【０３１７】
図１０５では、応用プログラムクラスとして「シェル」、シェルのウィンドウ名として「シェル１」が選択され、背景色が反転しており、シェル１に対する操作として単語「エルエス」と「クリア」に相当するキー入力操作と、それらの語彙のスコープとしてローカル（０）が、編集用ウィンドウに入力された状態を示している。
【０３１８】
次に、プログラム操作の登録手順について図１０６を用いて説明する。プログラム操作登録部１４２は、メッセージ変換部１４３から起動され、まず、プログラム操作登録内容を保存した登録内容ファイル２００から登録内容を読み出し（ステップ９２０１）、画面表示を行ないユーザの入力待ちの状態（ステップ９２０２）になる。
【０３１９】
ここで、ユーザが、ＡＰクラス、ウィンドウ名、語彙、プログラム操作などの入力、あるいは、登録ボタン、取り消しボタン、終了ボタン、ウィンドウＩＤ取得ボタン等の入力を行なう。
【０３２０】
入力が登録ボタンであった場合には（ステップ９２０３）、画面に表示されている編集結果を保存ファイル２００へ保存し、更に音声インターフェース管理テーブル１４１へ書き込んで登録内容を音声入出力インターフェースの動作に反映させる（ステップ９２０４）。
【０３２１】
入力が取り消しボタンであった場合には（ステップ９２０５）、再度、保存ファイル２００から登録内容を読み込んで表示し、入力待ちの状態に戻る（ステップ９２０２）。
【０３２２】
入力が既に登録済みの応用プログラムクラス（ＡＰクラス）であった場合（ステップ９２０６）選択されたＡＰクラスのウィンドウ名の一覧と語彙、プログラム操作を画面表示し（ステップ９２０７）、入力待ちの状態に戻る（ステップ９２０２）。
【０３２３】
入力がウィンドウＩＤ取得ボタンであった場合（ステップ９２０８）、まず、ウィンドウ名が選択されているか判別し（ステップ９２０９）、選択されていない場合には入力待ちに戻り（ステップ９２０２）、選択されている場合にはマウスでウィンドウがクリックされるのを待ち、クリックされたウィンドウのＩＤを取得して、図９９に示すような音声インターフェース管理テーブルに選択されているウィンドウ名とウィンドウＩＤを書き込む（ステップ９２１０）。
【０３２４】
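If the input is the end button (step 9211), the contents displayed on the screen are written to the speech interface management table and saved to the file 200 (step 9212), and registration ends.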
【０３２５】
以上述べたように、プログラム操作登録の際、応用プログラムの種類を指定することにより、同一のプログラム操作を入力せずに、自動的に指定することが可能になり、登録が効率的に行なえるようになる。
【０３２６】
また、名前を指定して起動することが困難な応用プログラムのウィンドウに対しても、マウスのクリックされたウィンドウのＩＤを取得してウィンドウ名と結び付けるようにすることにより、容易にウィンドウ名を付けて音声入力を行なえるようになる。
【０３２７】
上述の登録の例では、すでに生成されているウインドウのＩＤを利用して、操作コマンドと認識結果の対応をとっていたが、一般にウインドウ等のオブジェクトＩＤは生成時に決定され、同じ種類のアプリケーションであっても異なるＩＤが付与される。したがって、登録時にウインドウ階層やウインドウ名など、同じ種類のアプリケーションで共通のウインドウ属性値をウインドウシステムに問い合わせて登録内容に付加しておけば、これらの属性値を照合することによって同種のアプリケーションで共通に登録内容を反映させることができる。
【０３２８】
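Furthermore, if a plurality of window names are registered for the target application at registration time, then when another application of the same type is started, a window name that is not in use can be assigned as the speech window name of the newly started application (by querying the speech recognition system for the speech window names already in use), which avoids collisions between speech window names.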
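The name selection just described amounts to picking the first registered name that is not yet taken. A minimal sketch, assuming the names already in use have been obtained from the speech recognition system as a set (the function name and the argument layout are hypothetical):

```python
def pick_window_name(registered_names, names_in_use):
    """Return the first registered speech window name that is still free.

    registered_names: names registered in advance for this application type.
    names_in_use: names currently used by running speech windows.
    Returns None when every registered name is already taken.
    """
    for name in registered_names:
        if name not in names_in_use:
            return name
    return None
```

For example, with "shell 1" and "shell 2" registered and "shell 1" already in use, a second shell tool would be named "shell 2".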
【０３２９】
（第２１実施例）
次に、音声入出力インターフェースにおいて音声の認識を行なうための認識辞書の編集機能に関する実施例について説明する。
【０３３０】
図１０７は、辞書編集部１４４を持つ音声インターフェース管理システム１０４の構成である。辞書編集部１４４は、メッセージ変換部１４３から起動され、編集を終了すると終了メッセージをメッセージ変換部１４３へ返す。この終了メッセージを受けて音声インターフェース管理部１４１は、音声入出力システム１へ、編集した後の新しい辞書のロード命令を出すことができる。
【０３３１】
ここで、図１０８は、認識辞書の構成の例である。認識辞書には単語ごとに、パターンマッチング用のテンプレートの他、単語名や単語ＩＤ、あるいは認識パラメータ等のデータがヘッダに格納されている。これらのデータの内容を表示し、編集する機能を備えることにより、使わない単語の辞書を削除して辞書に要する実行時のメモリ量を減らしたり、単語名やＩＤを付け替えたりすることが容易に行なえるようになる。
【０３３２】
次に、辞書編集部１４４の構成について説明する。辞書編集部１４４は、図１０９に示すように、辞書内容を表示してユーザが編集を行なえるようにする辞書内容表示編集部４４１と、辞書内容のチェックや検索を行なう辞書内容検索部４４２からなる。
【０３３３】
辞書内容は、例えば図１１０のような画面に表示される。画面中には、辞書名を表示する辞書名ウィンドウ、語彙番号、単語ＩＤ、単語、パラメータ、辞書番号を表示する辞書内容ウィンドウ、辞書の削除を行なう「削除」ボタン、パラメータの検索を行なう「検索」ボタン、内容の全表示を行なう「全表示」ボタン、辞書編集を終了する「終了」ボタン、辞書内容チェック結果を表示するステータスウィンドウ、検索の際の値を入力する検索値ウィンドウなどがある。辞書内容ウィンドウのパラメータの項目はメニューになっており、マウスでクリックすると図に示すようなパラメータ内容が表示されて表示する内容を選択するようにできる。
【０３３４】
辞書内容のチェックは、辞書名を選択したときに自動的に動作するようにでき、例えば、同じＩＤの単語がないか、あるいは同じ単語名の辞書がないか等のチェックや、認識パラメータのくい違いがないか等のチェックが行なわれ、結果がステータスウィンドウに表示される。
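The checks named above (duplicate IDs, duplicate word names, inconsistent recognition parameters) could look roughly like this. The entry layout, with `word_id`, `word` and a `params` mapping, is an assumption for the sketch and is not the dictionary format of FIG. 108.

```python
from collections import Counter


def check_dictionary(entries):
    """Return a list of problem descriptions for a merged recognition dictionary.

    Each entry is assumed to be a dict with keys "word_id", "word" and
    "params" (a dict of recognition parameters shared by all words).
    """
    problems = []
    for word_id, count in Counter(e["word_id"] for e in entries).items():
        if count > 1:
            problems.append(f"duplicate word ID: {word_id}")
    for word, count in Counter(e["word"] for e in entries).items():
        if count > 1:
            problems.append(f"duplicate word name: {word}")
    # All entries should have been built with the same recognition parameters.
    if len({tuple(sorted(e["params"].items())) for e in entries}) > 1:
        problems.append("recognition parameters differ between entries")
    return problems
```

The returned list corresponds to what would be shown in the status window; an empty list means no inconsistency was found.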
【０３３５】
図１１０の項目では、辞書として、“common”および“usr.１”というファイル名の辞書が選択され、辞書内容としてその２つの内容がマージして表示される。例えば、語彙Ｎｏ．“１”はＩＤ＝１のオープンで辞書作成に使ったデータ数が１００であることを示している。また、語彙Ｎｏ．“２”はＩＤ＝２のクリアでこの単語が選択されて背景色が暗く変わっていることを示している。
【０３３６】
次に、辞書編集の処理の手順を、図１１１を用いて説明する。辞書編集部が起動されるとまず、辞書ファイルから辞書内容を読み出し（ステップ９３０１）、画面に内容を表示して入力待ちする状態になる（ステップ９３０２）。
【０３３７】
入力が削除ボタンであった場合には（ステップ９３０３）、ユーザが指定した辞書Ｎｏの辞書をファイルから削除し（ステップ９３０４）、入力待ちに戻る（ステップ９３０２）。
【０３３８】
入力が全表示ボタンであった場合には（ステップ９３０５）、辞書内容を再度読み出して（ステップ９３０１）、入力待ちに戻る（ステップ９３０２）。
【０３３９】
入力が検索ボタンであった場合には、パラメータメニューからのパラメータの指定を待ち（ステップ９３０７）、指定されたパラメータと検索値ウィンドウに入力された値に合致する辞書のみ辞書内容として表示して（ステップ９３０８）、入力待ちに戻る（ステップ９３０２）。
【０３４０】
入力が終了ボタンであった場合には、画面に入力した内容から辞書ファイルを更新し（ステップ９３１０）終了したことをメッセージ変換部へ知らせて（ステップ９３１１）終了する。
【０３４１】
以上に述べた辞書編集部により、不要な単語辞書の削除や内容の確認、単語名の変更などの編集が容易に行なえ、また同じＩＤや単語の２重使用や認識パラメータの不統一のチェック等が容易に行なえる。
【０３４２】
（第２２実施例）
本発明の第１８，１９実施例で述べた音声入出力インタフェースでは、ユーザの発声の認識結果の確認および認識結果により引きおこされる応用プログラムの動作の確認は、応用プログラムの提示する画面情報を通じて行っている。例えば、認識結果（および認識失敗）を文字情報としてユーザに提示する。「シェルツール」などプログラム名を呼んだ時にシェルツールの表示を第１９実施例の図１００，１０１のように変更する。「アイコン化」の発声に対して、音声フォーカスの当たったウィンドウをアイコン化する等、音声による応用プログラムへの働きかけは、応用プログラムの行う画面表示の変化としてユーザへフィードバックされる。しかし、応用プログラムによっては、操作によりその表示が殆んどあるいは全く変化しない事も考えられる。また、キーボードフォーカスと音声フォーカスを分離できるという本発明の特長を生かして音声フォーカスを当てた応用プログラムを表示しない状態で使用することも考えられる。このような場合には、認識結果やそれによる操作の確認を画面出力ではなく、第１４実施例で述べた、音声合成機能を利用した音声出力によって行うことで、ユーザの応用プログラム操作上の利便性が向上する。
【０３４３】
動作確認を音声出力によって行うために、第１９実施例の音声インタフェースマネージャ（図９８）を図１１２のように拡張する。すなわち、音声インタフェース管理システム（ＳＩＭ）に応答音声管理部４０１と応答音声登録部４０３を追加する。
【０３４４】
ユーザの行った発声に対してどのような応答音声を返すかを定義するのが、応答音声管理部４０１であり、その登録を行うのが応答音声登録部４０３である。そして、動作（すなわちメッセージ）が発生した際に応答音声管理部４０１を参照して音声応答を出力するのが、メッセージ変換部１４３である。
【０３４５】
応答音声管理部４０１の例を図１１３に示す。応答音声管理部４０１は、音声応答を出力するきっかけとなる動作と、動作時に行う応答コマンドおよび、その設定を実際に適用するか否かを決定するフラグから成る。動作は、音声によらないものでもよい。応答には、コマンドが記述される。ｓｙｎｔｈ（）は、その引数をテキストとして合成音声を出力するコマンド、ｐｌａｙ（）は、引数を波形データと見做し、出力するコマンドである。
【０３４６】
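The message conversion unit 143 refers to the data in the response speech management unit 401 and performs processing according to the flow shown in FIG. 114. First, it determines whether the message received from the speech input/output system is a recognition result (step 10001), and then whether the recognition processing succeeded (step 10002). It then executes the speech response command corresponding to success or failure (steps 10003 and 10004). Step 10005 is the stage that outputs response speech for events other than recognition success or failure, and corresponds to the settings in the third and subsequent rows of FIG. 113. Following this flow, when recognition fails, for example because recognition was carried out but the similarity was low, or because the speech input level was too high (or too low), speech data such as "Huh?" is output, and when an application program name such as "mail" is recognized, synthesized speech such as "Yes, this is mail" is output. (Here, $<cat> in FIG. 113 is replaced by the vocabulary name of the recognition result.)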
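A small sketch of this lookup follows. The table rows, the trigger names, and the message fields are assumptions made for illustration; only the `synth()` and `play()` command names, the enable flag, and the `$<cat>` placeholder come from the description. The function returns the commands to run instead of producing audio, so the selection logic can be read on its own.

```python
# Rows: (trigger, command, argument, enabled). Contents are illustrative.
RESPONSES = [
    ("recognition_success", "synth", "Yes, this is $<cat>", True),
    ("recognition_failure", "play", "huh.wav", True),
    ("mail_received", "synth", "You have new mail", False),
]


def select_responses(message, table):
    """Choose the response commands for one incoming message.

    message is assumed to be a dict with a "type" key and, for
    recognition results, "success" and "word" keys.
    """
    if message["type"] == "recognition_result":           # step 10001
        trigger = ("recognition_success" if message["success"]
                   else "recognition_failure")             # step 10002
    else:
        trigger = message["type"]                          # step 10005
    word = message.get("word", "")
    return [(command, argument.replace("$<cat>", word))
            for row_trigger, command, argument, enabled in table
            if row_trigger == trigger and enabled]
```

Rows whose flag is off are skipped, which mirrors the check box on the registration screen that decides whether a setting is actually applied.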
【０３４７】
応答音声管理部４０１のコマンドを登録するのが、図１１５に示す応答音声登録部４０３である。各動作に対してコマンドを記述し、また適用するか否かのチェックボックスをチェックし、ＯＫボタンを押すことで登録を確認する。
【０３４８】
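The response commands of the response speech management unit 401 are processed by the message conversion unit 143 and can be written as commands in the speech interface management table shown in FIG. 99 of the nineteenth embodiment. By writing play() and synth() commands there, response speech output suited to the application program can be defined for the operations of a GAP, which cannot exchange information directly with the speech input/output system 1.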
【０３４９】
このように、音声入力によって行われる（あるいは行われない）動作に対して、動作毎に意味のある音声応答を返す機構をＳＩＭに設け、音声入力に対しては音声で応答するという自然な方法で、ユーザが画面の表示の変化を注視しなくとも（あるいは全くみなくとも）応用プログラムの実行した動作を確認できるため、音声入出力インタフェースの操作性が向上する。
【０３５０】
（第２３実施例）
本発明の第９実施例では、認識辞書作成のためのデータ収集について説明したが、収集データの中には、間違った語彙の発声や音声区間の検出誤りなどにより、誤りデータが含まれることがある。例えば「ひらく」という単語は「く」の音が小さく発声されることがあり、「く」が抜けて「ひら」のみ音声区間として検出されることがある。このような誤ったデータによる認識辞書の学習は認識精度を大きく低下させるため、データの確認を行って誤りデータを取り除くことが必要である。そこで本実施例では、データの確認を容易に且つ確実に行なえるように、音を再生して聞くことによりデータ確認するようにしている。
【０３５１】
従来、収集した音声データを再生して確認する方法では、検出された音声区間のみを再生する場合が多いが、語彙によっては、音声の始終端が誤って検出されている場合でもユーザがそれを聞きもらしてしまうという問題があった。例えば上に述べた「ひらく」の語尾の「く」が抜けて「ひら」だけになってしまった場合でも、「ひら」の再生音が「ひらく」と聞こえてしまうことがある。本実施例では、このような始終端の確認のミスを少なくするため、音声の始終端位置を音により分り易く提示するようにしている。これにより、音声データの確認が音により容易に且つ確実に行なえるようになるため、学習データの収集が簡単でミスなく行なえ、音声入出力インタフェースの使い勝手の向上と認識精度の向上が実現できる。
【０３５２】
始終端位置を分り易くする方法としては、
（方法１）検出された音声区間の前後に白色雑音や正弦波など既知の音を付加して再生する方法、
（方法２）始終端位置にクリック音を乗せて再生する方法、
（方法３）始端よりも一定時間前から終端よりも一定時間後までの発声全体を再生した後、音声区間のみを再生する方法、
などが考えられる。
【０３５３】
上記方法１によれば、先程述べた「ひらく」の例では、「ひら」の後にすぐ別の音が続くため、「く」が抜けていることを容易に聞き取ることができる。上記方法２によれば、「ひら」の後に続いて、クリック音が来るため「く」が抜けていることが分る。また、上記方法３によれば、発声全体と音声区間とを比較して聞くことができるため、「く」の有無を容易に識別することができる。
【０３５４】
ここで、本実施例による拡張したデータ収集部８の構成を図１１６に示す。
【０３５５】
データ収集部８は、図１１６に示すように、第９実施例の図２９のデータ収集部８に、音声データ確認部４１１、データ使用可否入力部４１３を加え、学習データ収集制御部８３を介して音声特徴データを音声特徴データ保存部に送るような構成になっている。すなわち、音声データ確認部４１１で提示された再生音を聞いて、ユーザがその音声データを辞書作成に使うか否かをデータ使用可否入力部４１３から指定できるような構成になっている。
【０３５６】
このデータ収集部８の処理の流れを図１１７に従って説明する。
【０３５７】
まず、初期設定では、ユーザからのデータ収集の指示により、データ収集部８から音声認識システム１に対して学習モード設定要求が出され（ステップ１１００１）、これを受けて音声認識システムは認識対象語彙をデータ収集部８に送る。データ収集部８では認識対象語彙がユーザに表示される（ステップ１１００２）。
【０３５８】
ユーザにより学習語彙が選択されると（ステップ１１００３）、データ収集部８は音声認識システム１に単語音声特徴データと単語音声波形データの送信を要求し（ステップ１１００４）、選択された語彙を発声のガイドとして発声ガイド表示部４１５に表示し（ステップ１１００５）、ユーザに発声を促す。音声認識システム１では発声されたユーザの音声を処理した後、データ収集部８に単語特徴データと波形データを送信する。そして、データ収集部８はそのデータを受信し、内部メモリに一時格納する（ステップ１１００６）。
【０３５９】
音声波形データは音声データ確認部４１１に送られ、ユーザがそのデータを確認し、辞書作成に使うか否かを、データ使用可否入力部４１３により入力する（ステップ１１００７）。データを使用するとした場合には単語音声特徴データが磁気ディスク上などにファイル出力され（ステップ１１００８でＹＥＳの場合およびステップ１１００９）、使用しないとした場合にはファイル出力しない（ステップ１１００８でＮＯの場合）。
【０３６０】
学習終了時にはユーザがデータ収集終了の指示を入力し、データ収集指示フラグがＯＦＦならば（ステップ１１０１０でＹｅｓの場合）、データ収集部８は学習モードの解除を音声認識システム１に要求する（ステップ１１０１２）。音声認識システム１では、それを受けて学習モードを解除する。一方、学習を終了しないときは、データ収集指示フラグを検査し（ステップ１１０１１）、上記ステップ１１００４以下の処理を繰り返す。データ収集指示フラグは、学習データ収集制御部の中に設定されており、図に示すようなデータ収集ボタンにより、ユーザが入力可能とすることができる。
【０３６１】
次に、本実施例の音声データ確認部４１１の構成を図１１８に示す。
【０３６２】
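The speech data confirmation unit 411 comprises a speech data memory 421 that stores the speech data, a speech data processing unit 422 that processes the speech data, an additional sound generation unit 424 that generates the additional sound used in the processing, and a playback unit 423 that plays the processed speech data as sound. It receives the speech data and the information on the start and end positions from the learning data collection control unit 83, processes the data, and outputs it as sound. If the processed sound is instead sent to the speech input/output system for playback, the playback unit 423 may be omitted.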
【０３６３】
次に、図１１９に従って処理の流れについて説明する。
【０３６４】
まず、学習データ収集制御部８３から音声データと始終端情報を受け取り、音声データメモリ４２１に格納する（ステップ１２００１，ステップ１２１０１，ステップ１２２０１）。この音声データは、音声区間の前後に一定時間、例えば２４０ｍｓｅｃの余裕を付けた波形データであり、例えば図１２０に示すようなものである。図のデータは「ひらく」の「ひら」が音声区間として検出されたため、「く」の音は終端の余裕の中に入っている。
【０３６５】
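Next, in the case of method 1 above, in which additional sounds are attached before and after the speech section, the additional sound is generated by the additional sound generation unit 424 (step 12002), and the speech data processing unit 422 attaches it before the start position and after the end position (steps 12003 and 12004). As a result, the speech data becomes as shown in FIG. 121(a).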
【０３６６】
付加音データは白色ノイズでも良いし、正弦波でも良く、これらは乱数発生ルーチンや三角関数のルーチンを使って容易に作成できる。又、録音データを単に読み出すだけでも良い。
【０３６７】
始終端位置にクリック音を付加する上記方法２の場合では、クリック音を付加音生成部４２４で作り（ステップ１２１０２）、始終端位置に付加する（ステップ１２１０３，ステップ１２１０４）。この結果、音声データは図１２１の（ｂ）に示すようなものになる。ここでクリック音は短時間、例えば数１０ｍｓｅｃ幅のパルスや三角波等で良い。
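Methods 1 and 2 can be sketched on a plain list of samples as below. This is only an illustration under stated assumptions: the sampling rate, tone frequency, amplitudes, and function names are not given in the text, and a sine wave is used as the known sound although white noise would serve equally.

```python
import math


def sine_tone(n_samples, freq_hz=1000.0, rate_hz=12000, amplitude=0.3):
    """Known additional sound: a sine wave (rate and frequency are assumed)."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / rate_hz)
            for i in range(n_samples)]


def add_marker_sounds(samples, start, end, marker):
    """Method 1: marker sound + detected speech section + marker sound."""
    return marker + samples[start:end] + marker


def add_clicks(samples, start, end, click):
    """Method 2: overlay a short click at the start and end positions."""
    out = list(samples)
    for pos in (start, end):
        for i, value in enumerate(click):
            if 0 <= pos + i < len(out):
                out[pos + i] += value
    return out
```

In method 1 only the detected section is kept between the two marker sounds, so a clipped final syllable is followed at once by the marker; in method 2 the margins are kept and the clicks show where the detected boundaries fall.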
【０３６８】
発声の全体と音声区間の両方を再生する上記方法３の場合では、まず、音声区間外の平均パワーを計算し（ステップ１２２０２）、この値が、しきい値、例えば雑音レベル＋２ｄＢよりも大きければ（ステップ１２２０３でＹＥＳの場合）、音声区間の前後についた余裕と音声区間とを合わせた音声全体を再生する（ステップ１２２０４）。一方、計算した平均パワーがしきい値よりも小さければ（ステップ１２２０３でＮＯの場合）、音声区間のみ再生する（ステップ１２２０５）。雑音レベルは音声認識システム１で音声検出のために常時測定しているため（永田、他“ワークステーションにおける音声認識機能の開発”，電子情報通信学会技術報告、ＨＣ９１１９，ｐｐ．６３−７０，（１９９１）、参照）それを用いれば良い。発声全体の再生と音声区間の再生の２回の再生を、発声の毎に行なうのは煩しいため、上述のように音声区間の外の音声パワーが大きいときに、始終端位置を誤った可能性が大きいと見なして、そのときのみ２回の再生を行なうようにすれば、煩しさを軽減できる。
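The decision in steps 12202 to 12205 can be sketched as follows. The function names and the representation of the waveform as a list of floats are assumptions; the 2 dB margin over the noise level is the example value from the text, and power is expressed here as 10·log10 of the mean squared sample value.

```python
import math


def mean_power_db(samples):
    """Average power of the samples in dB; -inf for silence or no samples."""
    if not samples:
        return float("-inf")
    power = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(power) if power > 0 else float("-inf")


def playback_plan(samples, start, end, noise_level_db, margin_db=2.0):
    """Decide what to play back for one utterance (method 3).

    samples holds the speech section plus the margins before and after it;
    start and end are the detected section boundaries (sample indices).
    """
    outside = samples[:start] + samples[end:]
    if mean_power_db(outside) > noise_level_db + margin_db:
        # Loud margins suggest a boundary error: play whole, then section.
        return ["whole utterance", "speech section"]
    return ["speech section"]
```

Playing the utterance twice only when the margins are loud keeps the confirmation step short for correctly detected data.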
【０３６９】
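In this case, as shown in FIG. 121(c), playback of the whole utterance reproduces the entire word "hiraku", whereas playback of the speech section alone reproduces only "hira". By listening to the two in succession and comparing them, the user can easily tell that "ku" is missing.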
【０３７０】
以上に述べたように、音声データが正しいか否かをユーザが再生音により容易に判断することができ、データを辞書作成に使用するか否かをデータ収集部で直ちに入力することができるため、音声データ収集を簡単に、且つ確実に行なうことができる。
【０３７１】
これにより、誤ったデータを除いて認識辞書を作成することができる。
【０３７２】
【発明の効果】
本発明によれば、各応用プログラムにより音声認識システムに対する音声認識結果の受信の可否を決定できるので、応用プログラムが自分や他の応用プログラムの音声入力に関する制御を自由に行うことができ、柔軟で使いやすい音声認識インターフェースが構築できる。また、音声認識システムがその音声認識結果を同時に複数の応用プログラムに送信できるので、一つの音声入力による操作を同時に複数の応用プログラムに対して行うこともでき、音声入力による計算機の操作性も向上する。さらに音声認識システムが複数の応用プログラムに対する音声認識を行えるので、音声入力対象の明示的な指定をせずに音声認識結果に基づき音声入力を各応用プログラムに振り分けることができ、利用者の負担を軽減できる。
【図面の簡単な説明】
【図１】本発明の一実施例の概略構成を示す図。
【図２】音声認識部の概略構成を示す図。
【図３】音声認識部の他例の概略構成を示す図。
【図４】音声認識部の他例の概略構成を示す図。
【図５】音声認識部の他例の概略構成を示す図。
【図６】応用プログラムの概略構成を示す図。
【図７】構成要素間で伝送されるメッセージを説明する図。
【図８】入力マスクの種類を示す図。
【図９】音声認識インターフェース各部の処理のタイムチャートを示す図。
【図１０】応用プログラム管理テーブルを説明する図。
【図１１】本発明の第２実施例の概略構成を示す図。
【図１２】一般的なウィンドウシステムの画面表示例を示す図。
【図１３】応用プログラムの認識語彙を説明する図。
【図１４】入力フォーカスの移動に伴う音声認識語彙の変化を説明する図。
【図１５】認識語彙の表示例を説明する図。
【図１６】マウスの位置により認識語彙を変更する状態を説明する図。
【図１７】本発明の第３実施例での応用プログラムの認識語彙を説明する図。
【図１８】応用プログラム管理テーブルを説明する図。
【図１９】本発明の第４実施例を説明する図。
【図２０】本発明の第５実施例の概略構成を示す図。
【図２１】メッセージ表示例を示す図。
【図２２】ワークステーションなどのマルチウィンドウ環境を示す図。
【図２３】本発明の第６実施例での応用プログラム管理テーブルを示す図。
【図２４】図２３の応用プログラム管理テーブルに基づく表現を説明する図。
【図２５】タスク管理プログラム機能の拡張例を示す図。
【図２６】本発明の第７実施例での表示例を説明する図。
【図２７】同第７実施例での表示例を説明する図。
【図２８】本発明の第９実施例の概略構成を示す図。
【図２９】学習データ収集部の概略構成を示す図。
【図３０】音声認識システムとのメッセージ交換を説明する図。
【図３１】音声認識システムのデータ収集時のフローチャートを示す図。
【図３２】学習データ収集部のフローチャートを示す図。
【図３３】学習語彙ガイド表示部での表示例を示す図。
【図３４】学習語彙ガイド表示部での表示例を示す図。
【図３５】データ収集時の音声認識インターフェースの処理の流れを示す図。
【図３６】本発明の第１０実施例の概略構成を示す図。
【図３７】辞書作成管理テーブルを示す図。
【図３８】辞書作成管理テーブルを示す図。
【図３９】辞書作成管理テーブルを示す図。
【図４０】辞書作成管理テーブルへの登録手順を説明する図。
【図４１】辞書作成の手順を説明する図。
【図４２】辞書作成の進行状況の表示例を示す図。
【図４３】辞書作成処理の速度表示の例を示す図。
【図４４】辞書作成処理の速度表示の例を示す図。
【図４５】本発明の第１１実施例の概略構成を示す図。
【図４６】音声認識自動停止処理を説明する図。
【図４７】本発明の第１２実施例を説明する図。
【図４８】同第１２実施例を説明する図。
【図４９】本発明の第１３実施例を説明する図。
【図５０】本発明の第１４実施例の概略構成を示す図。
【図５１】音声合成部の概略構成を示す図。
【図５２】音声出力管理テーブルを説明する図。
【図５３】音声入力に対するメッセージを説明する図。
【図５４】音声出力に対する入力マスクを説明する図。
【図５５】応用プログラム管理テーブルを説明する図。
【図５６】音声出力処理のフローチャートを示す図。
【図５７】音声出力処理のタイムチャートを示す図。
【図５８】音声出力要求処理のフローチャートを示す図。
【図５９】中断処理のある音声データを重畳する際の一例を説明する図。
【図６０】本発明の第１５実施例の概略構成を示す図。
【図６１】応用プログラムと音声入出力システム間で交わされるメッセージを説明する図。
【図６２】音声メールツールが音声データを録音する処理のタイムチャートを示す図。
【図６３】音声メールツールの画面表示例を示す図。
【図６４】音声データ編集用のサブウィンドウを示す図。
【図６５】メール送信による返信の文面例を示す図。
【図６６】音声データ編集用のサブウィンドウを示す図。
【図６７】合成音声の属性のデータベースの一例を示す図。
【図６８】メール読み上げ時に使用する音声コマンドの例を示す図。
【図６９】音声メールシステムの概略構成を示す図。
【図７０】応用プログラム管理テーブルを説明する図。
【図７１】メールシステムと音声入出力システム間のメッセージを説明する図。
【図７２】タスク重要度管理テーブルを説明する図。
【図７３】音声メールシステムの電子メール処理のフローチャートを示す図。
【図７４】受信メールの通知例を示す図。
【図７５】タスク重要度管理テーブルを説明する図。
【図７６】制御コード交じりのメール例を示す図。
【図７７】本発明の第１６実施例の概略構成を示す図。
【図７８】本発明の第１６実施例の概略構成を示す図。
【図７９】要約設定処理のフローチャートを示す図。
【図８０】本発明の第１７実施例の概略構成を示す図。
【図８１】音声を使ったメール文書作成例を示す図。
【図８２】応用プログラムと音声認識システムの間のメッセージ例を示す図。
【図８３】音声区間データを入力音声から切り出す処理のタイムチャートを示す図。
【図８４】音声によるメール題の入力を説明する図。
【図８５】定型的なメール文書の入力を説明する図。
【図８６】メールアドレスブックの画面表示例を示す図。
【図８７】音声入力可能なメールアドレスの登録例を示す図。
【図８８】音声によるメール送付先指定の手順を説明する図。
【図８９】メールアドレスのデータベースを用いたメール送付先指定を説明する図。
【図９０】本発明の第１８実施例の概略構成を示す図。
【図９１】同第１８実施例におけるシステム構成を示す図。
【図９２】同第１８実施例での画面表示例を示す図。
【図９３】音声インターフェース管理テーブルの一例を示す図。
【図９４】疑似音声フォーカスと音声フォーカスとの対応関係を示す図。
【図９５】メッセージ変換部のフローチャートを示す図。
【図９６】本発明の第１９実施例の概略構成を示す図。
【図９７】同第１９実施例での画面表示例を示す図。
【図９８】同第１９実施例のより詳細な構成を示す図。
【図９９】音声インターフェース管理テーブルの一例を示す図。
【図１００】音声フォーカスの表示方法を説明するための図。
【図１０１】外付けウィンドウの表示例を示す図。
【図１０２】応用プログラム管理テーブルの一例を示す図。
【図１０３】音声入出力システムの認識処理のフローチャートを示す図。
【図１０４】本発明の第２０実施例の概略構成を示す図。
【図１０５】プログラム操作の登録画面の一例を示す図。
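[FIG. 106] A diagram showing the processing procedure for program operation registration.
[FIG. 107] A diagram showing the schematic configuration of the twenty-first embodiment of the present invention.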
【図１０８】認識辞書の構成の一例を示す図。
【図１０９】辞書編集部の概略構成を示す図。
【図１１０】辞書編集画面の一例を示す図。
【図１１１】辞書編集部の処理のフローチャートを示す図。
【図１１２】本発明の第２２実施例の概略構成を示す図。
【図１１３】応答音声管理部の概略構成を示す図。
【図１１４】メッセージ変換部の処理のフローチャートを示す図。
【図１１５】応答音声登録部の概略構成を示す図。
【図１１６】拡張したデータ収集部の概略構成を示す図。
【図１１７】図１１６のデータ収集部の処理のフローチャートを示す図。
【図１１８】音声データ確認部の概略構成を示す図。
【図１１９】音声データ確認部の処理のフローチャートを示す図。
【図１２０】音声データの一例を示す図。
【図１２１】加工後の音声データの様子を示す図。
【図１２２】従来の音声認識インターフェースを示す図。
【図１２３】従来の音声認識インターフェースを示す図。
【図１２４】従来の音声認識インターフェースを示す図。
【図１２５】従来の音声認識インターフェースを示す図。
【図１２６】従来の音声認識インターフェースを示す図。
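[Explanation of reference numerals]
1, 3, 6 … speech recognition system; 11 … message processing unit; 12 … speech recognition unit; 121 … speech detection unit; 122 … speech analysis unit; 123 … recognition dictionary matching unit; 124 … speech recognition dictionary; 13 … application program management table; 2, 5, 7 … application program; 21, 71 … message input/output unit; 22 … program body; 4 … window system; 8 … data collection unit; 81 … word speech feature data holding unit; 82 … learning vocabulary display/selection unit; 83 … learning data collection control unit; 84 … learning vocabulary guide display unit; 9 … dictionary creation unit; 91 … dictionary creation management unit; 92 … dictionary creation control unit; 93 … data input unit; 94 … dictionary creation unit body; 95 … file output unit; 10 … automatic speech recognition stop unit; 14 … speech synthesis unit; 561 … overall control unit; 562 … waveform superposition unit; 563 … speech output management table; 564 … waveform synthesis unit; 651 … speech input/output system; 652 … window system; 653 … voice mail tool; 6531 … electronic mail processing unit; 6532 … message input/output unit; 821 … speech input/output system; 822 … voice mail system; 8221 … electronic mail processing unit; 8222 … document summarization unit; 8223 … message input/output unit; 851 … speech recognition system; 852 … voice mail system; 853 … mail address table; 103 … general-purpose application program (GAP); 102 … dedicated application program (SAP); 104 … speech interface management system (SIM); 141 … speech interface management unit; 142 … program operation registration unit; 143 … message conversion unit; 23 … speech window; 144₀ to 144₄ … speech window; 151 … program operation display/editing unit; 152 … registered contents storage unit; 153 … window ID acquisition unit; 144 … dictionary editing unit; 441 … dictionary contents display/editing unit; 442 … dictionary contents search unit; 401 … response speech management unit; 403 … response speech registration unit; 411 … speech data confirmation unit; 413 … data usability input unit; 415 … utterance guide display unit; 421 … speech data memory; 422 … speech data processing unit; 423 … playback unit; 424 … additional sound data storage unit.
[0001]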
[Industrial application fields]
The present invention relates to a voice recognition interface used in personal computers and workstations.
[0002]
[Prior art]
In recent years, computers have been proposed that are equipped with a plurality of input means, such as a keyboard, a mouse, speech, and images, so that various instructions and data can be entered.
[0003]
Of these, voice input is a natural and powerful input means for humans, but has problems in terms of calculation amount and recognition rate for voice processing, and has not been widely used as an input means.
[0004]
Conventionally, the following configuration is considered as a configuration of the application program and the speech recognition system in the speech recognition interface.
[0005]
In FIG. 122, the speech recognition system SRS is incorporated in the application program AP. In this case, since the voice recognition function cannot be separated from the application program AP, it is difficult to use the voice recognition function from other application programs.
[0006]
FIG. 123 shows a configuration in which one speech recognition system SRS and one application program AP are connected to each other. In this configuration, the speech recognition system SRS is occupied by the connected application program AP, so to use the same speech recognition system SRS from another application program the connection must be switched to that program, and reconnecting takes effort. Further, since the only data exchanged between the speech recognition system SRS and the application program AP is the recognition result sent from the speech recognition system SRS to the application program AP, the speech recognition system SRS cannot know the internal state of the application program AP. For this reason, the recognition target vocabulary cannot be changed automatically according to the internal state of the application program AP, and the user has to change the vocabulary, resulting in a system that is not easy to use.
[0007]
FIG. 124 shows a configuration composed of one speech recognition system SRS and one application program AP, which are connected to each other and exchange information such as the recognition vocabulary and the recognition results. In this configuration, the speech recognition system SRS can know the internal state of the application program AP and the recognition vocabulary, so the recognition vocabulary can be changed automatically. However, since the speech recognition system SRS is occupied by the connected application program AP, other application programs cannot use the speech recognition system SRS at the same time.
[0008]
FIG. 125 shows the system described in the document [Schmandt et al., "Augmenting a window system with speech input", COMPUTER, Vol. 23, pp. 50-58, 1990], in which speech recognition results are sent one way from one speech recognition system SRS to a plurality of application programs AP. This system uses a window system and realizes speech input by translating the speech recognition result into mouse or keyboard input. In this configuration, a plurality of application programs AP can use the speech recognition function simultaneously, but since the speech recognition system SRS cannot know the internal state of the application programs AP, recognition processing according to the internal state of the application programs AP cannot be performed.
[0009]
FIG. 126 shows the system described in the document [Rudnicky et al., Spoken language recognition in an office management domain, Proc. ICASSP '91, S12.12, pp. 829-832, 1991], which is composed of one speech recognition system SRS and a plurality of application programs AP, and in which the speech recognition system SRS and the application programs AP exchange information to perform speech recognition. This system is characterized by the fact that multiple application programs can share continuous speech recognition, and it can be regarded as a useful way of using an expensive speech recognition device, but the consideration given to it is not sufficient. In this configuration, a plurality of programs can use the speech recognition function, and the speech recognition system SRS can perform processing according to the internal state of the application program AP, but only one application program AP can be connected at a time. For this reason, it has not been possible to perform processing that exploits the characteristic of speech that multiple application programs AP can be handled simultaneously. In addition, since the speech recognition system SRS decides which application program AP the speech recognition result is sent to, an application program AP may fail to obtain a recognition result even when it needs one.
[0010]
[Problems to be solved by the invention]
As described above, with the conventional speech recognition interfaces the application program AP cannot manage the target of speech recognition. Speech input therefore cannot be controlled at the initiative of the application program AP: even when the application program wants to prompt the user for speech input, it has to wait until it receives a speech input permission command from the speech recognition system SRS. In addition, since a plurality of application programs AP cannot be controlled simultaneously by a single utterance, it is not possible, for example, to terminate a plurality of application programs AP with the single speech input "end". Further, since speech input cannot be distributed among a plurality of application programs AP according to the recognition result, the input target must be specified before speaking. Finally, since only one speech recognition system operates for a given speech input, different types of recognition methods, such as isolated word recognition and continuous speech recognition, cannot coexist and be used simultaneously.
[0011]
The present invention has been made in view of the above circumstances, and an object thereof is to provide a speech recognition interface that can handle a plurality of application programs simultaneously from a speech recognition system and is excellent in usability.
[0012]
[Means for Solving the Problems]
The present invention provides a speech recognition interface in which a plurality of application programs are connected to a speech recognition system, wherein the speech recognition system comprises: speech recognition means for recognizing speech; application program management means for managing, for each of the plurality of application programs, at least first information indicating whether or not the application program is a target of speech input and second information indicating one or more recognition target words to be recognized for the application program; and message processing means for specifying the recognition target vocabulary for speech input based on the second information managed for the one or more application programs whose first information indicates that they are targets of speech input and, when any word of the specified recognition target vocabulary is recognized by the speech recognition means, specifying as the transmission destination of the recognized word the one or more application programs whose first information indicates that they are targets of speech input and whose second information indicates that the recognized word is a recognition target. The speech recognition system further manages third information indicating vocabulary that uniquely corresponds to each application program and that should always be a recognition target regardless of which application program is the target of speech input, and, when a word included in the third information is recognized by the speech recognition means, sets the first information of the application program uniquely corresponding to the recognized word to the state indicating that the application program is a target of speech input.
The present invention also provides a speech recognition interface in which a plurality of application programs are connected to a speech recognition system, wherein the speech recognition system comprises: speech recognition means for recognizing speech; application program management means for managing, for each of the plurality of application programs, at least first information indicating whether or not the application program is a target of speech input and second information indicating one or more recognition target words to be recognized for the application program; and message processing means for specifying the recognition target vocabulary for speech input based on the second information managed for the one or more application programs whose first information indicates that they are targets of speech input and, when any word of the specified recognition target vocabulary is recognized by the speech recognition means, specifying as the transmission destination of the recognized word the one or more application programs whose first information indicates that they are targets of speech input and whose second information indicates that the recognized word is a recognition target. When an application program becomes a target of keyboard input, it requests the speech recognition system to make it a target of speech input, and the speech recognition system, on receiving the request from the application program, sets the first information corresponding to that application program to the state indicating that the application program is a target of speech input.
Preferably, when a predetermined event occurs, the speech recognition system may, according to the content of the event and a predetermined rule, change the first information corresponding to a predetermined application program to the state indicating that the application program is a target of speech input, and change the first information corresponding to another predetermined application program to the state indicating that the application program is not a target of speech input.
Preferably, the speech recognition system may notify an application program that has issued a notification request of information from which the application program can at least determine whether it is itself currently a target of speech input.
Preferably, the speech recognition system may display the window of an application program whose first information indicates that it is a target of speech input on the display screen in a display form different from that of the windows of the other application programs, whose first information indicates that they are not targets of speech input.
Preferably, for an application program whose first information indicates that it is a target of speech input, the speech recognition system may display on the display screen the one or more recognition target words indicated by the second information corresponding to that application program.
Preferably, the speech recognition system may display on the display screen the recognized vocabulary transmitted to the application program specified as the transmission destination.
Preferably, the second information may be given to the speech recognition system from each application program.
Preferably, the speech recognition system may manage the second information for each of a plurality of divided areas obtained by dividing the window of the corresponding application program, and use, as the second information corresponding to the application program, the second information managed for the divided area in which the mouse pointer is currently located.
Preferably, for at least some of the plurality of application programs, the speech recognition system may manage the first information and the second information for each of the one or more windows of the individual application program. For such an application program, the recognition target vocabulary for speech input is specified based on the second information managed for each of the one or more windows whose first information indicates that they are targets of speech input, and, when any word of the specified recognition target vocabulary is recognized by the speech recognition means, the one or more windows whose first information indicates that they are targets of speech input and whose second information indicates that the recognized word is a recognition target are specified as the transmission destination of the recognized word.
Preferably, for an application program whose first information and second information are managed for each window, when specifying the recognition target vocabulary for a window whose first information indicates that it is a target of speech input, the speech recognition system may use, in addition to the second information managed for that window, the vocabulary that is included in the second information managed for other windows of the same application program and is designated as also usable from the other windows of that application program.
[0013]
[Action]
As a result, according to the present invention, each application program can decide whether or not to receive speech recognition results from the speech recognition system, so an application program can freely control speech input for itself and for other application programs, and a flexible and easy-to-use speech recognition interface can be constructed.
[0014]
In addition, since the voice recognition system can send a voice recognition result to multiple application programs at the same time, a single voice input can operate multiple application programs simultaneously.
[0015]
Furthermore, since the voice recognition system can perform voice recognition for multiple application programs, voice input can be distributed to the appropriate application program based on the recognition result without the voice input target being specified explicitly, which reduces the burden on the user.
[0016]
[Embodiments]
Embodiments of the present invention will be described below with reference to the drawings.
[0017]
(First embodiment)
FIG. 1 shows a schematic configuration of the embodiment. In the figure, reference numeral 1 denotes a speech recognition system. This speech recognition system 1 includes a message processing unit 11, a speech recognition unit 12, and an application program management table 13. A plurality of application programs 2 are connected to the message processing unit 11.
[0018]
In this case, the voice recognition system 1 performs voice recognition according to an instruction included in the message from the application program 2 and sends the recognition result to the application program 2 as a message. The application program 2 performs a specific process depending on the application using the voice recognition result. In addition, the voice recognition system 1 can simultaneously exchange messages with a plurality of application programs 2 and transmit a voice recognition result.
[0019]
The message processing unit 11 constituting the speech recognition system 1 exchanges messages with the application programs 2 and with the speech recognition unit 12, and performs overall control of the speech recognition system 1. The voice recognition unit 12 exchanges messages with the message processing unit 11, performs voice recognition on the input voice according to the information sent from the message processing unit 11, and notifies the message processing unit 11 of the result.
[0020]
The application program management table 13 is a table that stores information on all application programs 2 that communicate with the speech recognition system 1. This table is used to determine the recognition target vocabulary when speech is input and to determine the destination of recognition results, so that the speech recognition system 1 can exchange messages with a plurality of application programs 2 simultaneously. For each application program, the application program management table 13 holds a program ID, an input mask, a recognition target vocabulary list, and a voice input flag. The program ID is an identification number uniquely assigned to the application program 2 by the voice recognition system 1. The input mask limits the types of messages transmitted from the speech recognition system 1 to the application program 2. The recognition target vocabulary list is a table of the recognition vocabulary that the application program 2 has requested of the speech recognition system 1, and is used to determine the recognition target vocabulary when voice is input. The voice input flag indicates whether the application program 2 has the voice focus. Saying that the voice focus is on an application program 2 means that the application program 2 is a target of voice input. That is, the voice focus specifies the destination of the recognition result.
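As an illustrative sketch only, the table described above can be modeled as one record per connected program. The class names, the field names, and the rule that program IDs count up from 0 are assumptions made for this example; only the four fields themselves come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class AppEntry:
    """One row of the (hypothetical) application program management table."""
    program_id: int                                   # assigned by the speech recognition system
    input_mask: set = field(default_factory=set)      # message types the program wants to receive
    vocabulary: list = field(default_factory=list)    # recognition target vocabulary list
    voice_input_flag: int = 0                         # 1 while the program has the voice focus


class AppManagementTable:
    def __init__(self):
        self._entries = {}
        self._next_id = 0   # assumption: IDs are handed out sequentially from 0

    def register(self):
        """Create a row for a newly connected program and return it."""
        entry = AppEntry(program_id=self._next_id)
        self._entries[entry.program_id] = entry
        self._next_id += 1
        return entry

    def focused(self):
        """Rows whose voice input flag is 1, i.e. current voice input targets."""
        return [e for e in self._entries.values() if e.voice_input_flag == 1]
```

Keeping the flag per row rather than as a single "current program" value allows several programs to hold the voice focus at once, which matches the statement that one result can be sent to multiple application programs.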
[0021]
FIG. 2 shows a schematic configuration of the voice recognition unit 12.
[0022]
In this case, the speech recognition unit 12 includes a speech detection unit 121, a speech analysis unit 122, a recognition dictionary collation unit 123, and a speech recognition dictionary 124.
[0023]
For example, the voice detection unit 121 detects speech sections based on the power of the input voice measured at regular time intervals (Nagata, et al. “Development of voice recognition function at a workstation”, IEICE Technical Report, HC9119, pp 63-70, (1991)). The voice analysis unit 122 performs frequency analysis on the speech section detected by the voice detection unit 121 using, for example, an FFT or a bandpass filter, and extracts feature parameters of the word speech. The recognition dictionary collation unit 123 collates the feature parameters output from the voice analysis unit 122 with the recognition dictionary 124 using a method such as the composite similarity method (see the report cited above), HMM, or DP matching, and outputs the vocabulary item with the highest score as the recognition result.
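The idea of detecting speech from the input power at regular intervals can be sketched as follows. This is a minimal illustration, assuming fixed-length frames and a single fixed power threshold; the function names, the frame length, and the threshold value are hypothetical, and the detector in the cited report is not reproduced here.

```python
def frame_powers(samples, frame_len):
    """Mean squared amplitude of each complete frame of `frame_len` samples."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_segments(samples, frame_len=4, threshold=0.01):
    """Return (start_frame, end_frame_exclusive) pairs whose power >= threshold."""
    segments = []
    start = None
    powers = frame_powers(samples, frame_len)
    for idx, power in enumerate(powers):
        if power >= threshold and start is None:
            start = idx                      # speech section begins
        elif power < threshold and start is not None:
            segments.append((start, idx))    # speech section ends
            start = None
    if start is not None:                    # input ended while still in speech
        segments.append((start, len(powers)))
    return segments
```

A practical detector would normally add minimum-duration and hangover rules so that short pauses inside a word do not split the section; those are omitted to keep the sketch short.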
[0024]
To avoid useless processing when collating the speech feature parameters with the recognition dictionary 124, the recognition dictionary collation unit 123 queries the message processing unit 11 before collation and performs collation with the recognition dictionary 124 according to the information returned. Regardless of whether recognition succeeds or fails, the recognition result is sent to the message processing unit 11, which forwards it to the application programs 2 according to the contents of the application program management table 13.
[0025]
Here, in FIG. 2, all the elements of the recognition unit are integrated and can operate as one process, but a configuration in which the voice detection unit 121 is separated as shown in FIG. 3 is also possible. If the voice detection unit 121, the subsequent voice analysis unit 122, and the recognition dictionary collation unit 123 perform, for example, data exchange between the two as inter-process communication, the voice detection unit 121 can be handled independently. For example, as shown in FIG. 4, outputs from a plurality of voice detection units 121 can be handled by a common voice analysis unit 122 and a recognition dictionary collation unit 123. Further, as shown in FIG. 5, a configuration in which the speech detection unit 121 and the speech analysis unit 122 are integrated and the recognition dictionary collation unit 123 and the recognition dictionary 124 are separated is possible.
[0026]
FIG. 6 shows a schematic configuration of the application program 2.
[0027]
In this case, the application program 2 includes a message input / output unit 21 and a program body 22. The message input / output unit 21 handles all message exchange with the voice recognition system 1 and provides the creator of the application program 2 with a standard means for voice input. It also serves to conceal the complicated message transmission / reception protocol from application program creators and to provide all of them with a unified communication procedure. The program body 22 is a program describing the processing specific to the application program, including the commands issued to the speech recognition system 1 according to the internal state of the application program and the procedure to follow when a speech recognition result is received from the speech recognition system 1.
[0028]
Next, the operation of the embodiment configured as described above will be described.
[0029]
In this case, information exchange between the speech recognition system 1 and the application program 2 is performed by message exchange. Here, the message is a collective term for data such as a command passed from one component to another component, the execution result of the command, and a voice recognition result.
[0030]
Communication by message is implemented, for example, by using the voice recognition system 1 as a server and the application program 2 as a client of the voice recognition system, over a byte stream protocol such as TCP, DECnet, or Stream. The messages exchanged between the components of the voice recognition interface are shown in FIG. These messages are all processed by the message processing unit 11 of the voice recognition system. In the above-described embodiment, the speech recognition system of FIG. 1 has been described as being executed as a single process, but the speech recognition unit, the message processing unit, and the application program management table, which are the components of the speech recognition system, can also each be executed as a separate program.
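Because a byte stream protocol delivers an undivided stream of bytes, the two sides need some way to mark where each message ends. The sketch below shows one common approach, a 4-byte length prefix followed by a JSON body. The text does not specify any framing or encoding, so the wire format and the function names here are assumptions for illustration only.

```python
import json
import socket
import struct


def send_message(sock, message):
    """Send one message as a 4-byte big-endian length followed by UTF-8 JSON."""
    payload = json.dumps(message).encode("utf-8")
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def _recv_exact(sock, n):
    """Read exactly n bytes from the stream, or raise if the peer closes early."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the stream")
        buf += chunk
    return buf


def recv_message(sock):
    """Receive one length-prefixed JSON message and decode it."""
    (length,) = struct.unpack("!I", _recv_exact(sock, 4))
    return json.loads(_recv_exact(sock, length).decode("utf-8"))
```

In the client and server arrangement described above, the message input / output unit of each application program would be the natural place to hide functions of this kind from the program body.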
[0031]
[Message between voice recognition system 1 and application program 2]
The messages from the application program 2 to the voice recognition system 1 are of the types shown in FIG. These are basically commands from the application program 2 to the speech recognition system 1.
[0032]
Here, the communication channel connection / disconnection request is a request to connect / release the communication channel over which the application program 2 exchanges messages with the voice recognition system 1. The speech recognition dictionary load / release request is a request to load / release, in the speech recognition system 1, the speech recognition dictionary containing the vocabulary that the application program 2 wants to use. The recognition vocabulary setting request tells the speech recognition system 1 which vocabulary in which recognition dictionary is to be used for recognition on behalf of the application program 2. The input mask setting request is a request by which the application program 2 sets the types of messages it wants to receive from the speech recognition system 1. The input task setting request is a request to move the voice focus to a specified application program 2. The recognition start / end request asks the voice recognition system 1 to start / end voice recognition.
[0033]
On the other hand, the messages from the speech recognition system 1 to the application program 2 have the types shown in FIG. 7B and can be classified into two. One is a response to a request from the application program 2 such as a command or data inquiry, which corresponds to the above request message. The other message is a message generated by the speech recognition system in accordance with information on the speech recognition result or a change in the internal state of the speech recognition system.
[0034]
Here, the voice recognition result is a message notifying the result recognized by the voice recognition system 1 using the recognition vocabulary that the application program 2 requested to be set. If recognition succeeds, the message contains at least one recognized vocabulary item, together with information such as which item it is, which dictionary it belongs to, and its score from the recognition processing. If recognition fails (for example, because the input level is too high or too low), the message contains information about the cause of the failure. The input task change notification is a message transmitted to the application program 2 when the voice focus is actually changed by an input task setting request or the like, and includes the task ID before the change and the task ID after the change. The recognition dictionary load / release notification is a message transmitted when a recognition dictionary is newly loaded or released by a recognition dictionary load / release request or the like. The communication path connection / disconnection notification is a message generated when the application program 2 issues a communication path connection / disconnection request to the speech recognition system 1. It is also generated when the application program 2 disconnects the communication path unilaterally without issuing a request. The recognition vocabulary change notification is a message generated when the recognition vocabulary of an application program is changed by a recognition vocabulary setting request.
[0035]
These messages can be transmitted from the voice recognition system 1 to all application programs 2 when, for example, voice input is received and recognized, the voice focus is changed, an application program 2 connects to the voice recognition system 1, or the recognition vocabulary is changed. However, an application program 2 does not need to receive all of them. The application program 2 sets which messages it receives by notifying the voice recognition system 1 of the input mask corresponding to each message (input mask setting request). As a result, the application program 2 is notified by the voice recognition system 1 of only the messages that it needs.
[0036]
FIG. 8 shows the types of input masks. These correspond to the types of messages that the application program 2 wants to receive, and a plurality of masks can be set simultaneously.
[0037]
By notifying the voice recognition system 1 of this setting, the application program can receive the message corresponding to an input mask every time that message is generated inside the voice recognition system 1. For example, if the voice recognition result mask is set, a voice recognition result is obtained each time voice is input, and if the input task change mask is set, the application program is notified each time the voice focus is changed.
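Since several masks can be set at the same time, an input mask behaves like a set of bit flags. The sketch below illustrates this with Python's `enum.Flag`. The member names are assumptions modeled on the notification messages described earlier; the actual mask types listed in FIG. 8 are not reproduced here.

```python
from enum import Flag, auto


class InputMask(Flag):
    """Hypothetical input mask bits, one per notification type."""
    NONE = 0
    RECOGNITION_RESULT = auto()
    INPUT_TASK_CHANGE = auto()
    DICTIONARY_LOAD_RELEASE = auto()
    CONNECTION_CHANGE = auto()
    VOCABULARY_CHANGE = auto()


def wants(mask, message_type):
    """True if a program with this input mask should be sent `message_type`."""
    return bool(mask & message_type)


# A program interested in recognition results and voice focus changes.
example_mask = InputMask.RECOGNITION_RESULT | InputMask.INPUT_TASK_CHANGE
```

With this representation the recognition system can decide per message, using a single bitwise test, whether a given application program should be notified.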
[0038]
In addition to the above two types of messages (request messages and response messages), error messages are conceivable as messages between the speech recognition system 1 and the application program 2. An error message notifies the failure of a one-way message from the application program 2 that requires no response on success, or the occurrence of a critical state in the recognition system. Beyond the messages described above, various other messages are conceivable, such as messages for accessing the internal information of the voice recognition system 1 and messages for configuring the voice recognition system 1 and voice input / output, for example changing the voice input level.
[0039]
Thus, since the application program 2 can be notified of changes in the internal state of the voice recognition system 1 in the form of messages, it can control the voice recognition system 1, and even other application programs 2, on the basis of those changes, so that a flexible voice interface with a high degree of freedom can be realized.
[0040]
Now, the voice recognition system 1 has the message processing unit 11 and the voice recognition unit 12, and information is exchanged between them by messages. Note that the message processing unit 11 handles all messages with the application program 2 in the voice recognition system 1.
[0041]
[Message between voice recognition unit 12 and message processing unit 11]
The messages from the voice recognition unit 12 to the message processing unit 11 are of the types shown in FIG. Here, the recognition vocabulary query request is a request issued, when speech is input to the speech recognition system, to determine which recognition vocabulary the input speech should be collated with. The speech recognition result notifies the message processing unit 11 of the result of collating the input speech with the recognition target vocabulary at that time.
[0042]
On the other hand, the message from the message processing unit 11 to the voice recognition unit 12 has a type as shown in FIG. Here, the recognition dictionary load / release request is a message where the recognition dictionary load / release request issued by the application program 2 to the speech recognition system 1 is directly delivered to the speech recognition unit 12. The recognized vocabulary information is a response to the recognized vocabulary query request from the voice recognition unit 12 to the message processing unit 11.
[0043]
In this way, processing proceeds through the exchange of messages among the parts of the speech recognition system. Next, how processing proceeds in the speech recognition interface as a whole will be described with reference to the time chart in the figure, which covers the period from when the application program 2 is activated until it receives its first speech recognition result.
[0044]
In this case, the application program 2 first sends a connection request (a) to the speech recognition system 1. Once the connection is established, it issues a load request (b) for the recognition dictionary containing the vocabulary to be recognized, and a setting request (c) that designates, as the recognition vocabulary, the vocabulary in the loaded dictionary to be used for voice input. For (a), the message processing unit 11 performs communication path connection processing with the application program 2 and returns the result to the application program 2. For (b), it passes the message as it is to the speech recognition unit 12, waits for the dictionary to be loaded, and returns the result of loading to the application program 2. For (c), it writes the specified recognition vocabulary into the application program management table 13 and returns the processing result. If the recognition target vocabulary is set successfully, the application program 2 sends an input mask setting request (d) and an input task setting request (e). The message processing unit 11 receives (d) and (e) and writes each of them into the application program management table 13.
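The initial requests (a) to (e) can be sketched as methods on a small stand-in for the message processing unit. This is a simplified illustration, not the implementation described in the embodiment: the class and method names, the dictionary-based table rows, and the policy that an input task setting request clears every other voice input flag are all assumptions.

```python
class MessageProcessor:
    """Toy message processing unit holding a management table in memory."""

    def __init__(self):
        self.table = {}       # program_id -> {"mask": set, "vocab": list, "flag": int}
        self.loaded = set()   # names of recognition dictionaries already loaded

    def connect(self):                                  # request (a)
        pid = len(self.table)   # simplification: no IDs are reused after disconnect
        self.table[pid] = {"mask": set(), "vocab": [], "flag": 0}
        return pid

    def load_dictionary(self, name):                    # request (b)
        # A real system would forward this request to the recognition unit.
        already = name in self.loaded
        self.loaded.add(name)
        return "already_loaded" if already else "loaded"

    def set_vocabulary(self, pid, words):               # request (c)
        self.table[pid]["vocab"] = list(words)

    def set_input_mask(self, pid, mask):                # request (d)
        self.table[pid]["mask"] = set(mask)

    def set_input_task(self, pid):                      # request (e)
        # Assumed policy: the requested program becomes the only voice input target.
        for entry in self.table.values():
            entry["flag"] = 0
        self.table[pid]["flag"] = 1
```

A client would call these in the order (a), (b), (c), (d), (e), checking each result before moving on, as in the time chart.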
[0045]
The above are the initial setting requests from the application program 2 to the speech recognition system 1. When the initial setting is completed, the application program 2 waits for messages from the voice recognition system 1 while performing the task-specific processing of the application. In accordance with the internal state transitions that accompany that processing, it can send any request to the speech recognition system 1, such as a request to change the recognition vocabulary or a request to change the input task to itself or to another application program 2, so that the voice recognition system 1 can be controlled from the application program 2 side.
[0046]
Here, it is assumed that voice is input for the application program 2. The speech recognition unit 12 first detects the speech section of the input speech and analyzes it. After completing the analysis, the speech recognition unit 12 sends a recognition vocabulary query request (f) to the message processing unit 11 in order to learn the vocabulary to be recognized at that time. Upon receiving this, the message processing unit 11 refers to the application program management table 13 to find the vocabulary that should be subject to speech recognition in the current situation, and returns it to the speech recognition unit 12 as recognized vocabulary information. The speech recognition unit 12 collates the analyzed speech data with the recognition dictionary data corresponding to the specified recognition target vocabulary and sends the result (g) to the message processing unit 11. The message processing unit 11 searches the recognition target vocabulary lists in the application program management table 13 for the vocabulary item with the highest likelihood in (g), and if the voice input flag of the application program 2 having that item is 1 and the recognition result notification mask is set in its input mask, it transmits the recognition result to that application program.
[0047]
The process described with reference to FIG. 9 will be further described using a specific example.
[0048]
The application program management table 13 when the application programs 2 connected to the speech recognition system 1 are the shell tool and the text editor is as shown in FIG.
[0049]
Here, the processing performed when a mail tool is newly started will be described. When the activated mail tool first transmits a communication path connection request (a), an area for the mail tool is allocated in the application program management table 13 and a program ID is assigned to the mail tool. For example, it is assumed that program IDs are assigned from 0 in the order in which the application programs 2 are started. Next, a recognition dictionary load request (b) is sent. Here, the recognition dictionary is already loaded, and the speech recognition system 1 informs the application program 2 of this fact. Next, “first”, “last”, “previous”, “next”, “send”, and “end” are sent as the recognition vocabulary in the recognition vocabulary setting request (c), and the recognition result notification mask is sent as the input mask (d). As the input task setting request (e), the mail tool requests that the voice focus be removed from every application program that currently has it and placed on the mail tool.
[0050]
In this embodiment, one recognition dictionary is used in common by all application programs 2. Therefore, FIG. 10 omits the information, needed when a plurality of dictionaries are used, on which dictionary contains each vocabulary item.
[0051]
With the above processing, the application program management table 13 becomes as shown in FIG. 10B, the voice focus that was on the shell tool moves to the newly activated mail tool, and the mail tool becomes ready for voice input.
[0052]
Here, for example, it is assumed that the voice “next” is input. The input speech is subjected to speech section detection and analysis in the speech recognition unit 12 to obtain speech feature parameters. The voice recognition unit 12 sends a recognition vocabulary query request (f) to the message processing unit 11 in order to learn which dictionary data should be collated with the speech feature parameters. Upon receiving this request, the message processing unit 11 refers to the application program management table 13 to determine the recognition target vocabulary at that time. Here, the vocabulary that can be input is every item in the recognition target vocabulary list of the mail tool, whose voice input flag is 1 and whose input mask has the recognition result notification mask set: “first”, “last”, “previous”, “next”, “send”, and “end”. These six items are notified to the speech recognition unit 12, which collates the dictionary data for them with the analyzed feature parameters and sends the result to the message processing unit 11 (g).
When the message processing unit 11 receives the recognition result, it searches for the recognized item in the recognition target vocabulary lists of the application programs 2 whose voice input flag is 1 and whose input mask has the recognition result notification mask set. If the item is found, the recognition result is transmitted to the application program 2 that has that vocabulary list.
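The two table lookups in this example, answering the vocabulary query and then choosing the destination of the result, can be sketched as below. The dictionary layout, the mask name, and the function names are assumptions for illustration, and the table is abbreviated to two programs; the vocabulary items are those given for the mail tool and the shell tool in this document.

```python
RESULT_MASK = "recognition_result"   # hypothetical name for the recognition result notification mask


def is_listening(entry):
    """A program is a candidate only if it has the voice focus and the result mask."""
    return entry["flag"] == 1 and RESULT_MASK in entry["mask"]


def active_vocabulary(table):
    """Reply to the vocabulary query (f): the union of vocabulary of listening programs."""
    words = set()
    for entry in table.values():
        if is_listening(entry):
            words.update(entry["vocab"])
    return words


def route_result(table, word):
    """Program IDs that should receive the recognized word after result (g) arrives."""
    return [pid for pid, entry in table.items()
            if is_listening(entry) and word in entry["vocab"]]


table = {
    0: {"name": "shell tool", "flag": 0, "mask": {RESULT_MASK},
        "vocab": ["history", "list", "home", "process", "end"]},
    1: {"name": "mail tool", "flag": 1, "mask": {RESULT_MASK},
        "vocab": ["first", "last", "previous", "next", "send", "end"]},
}
```

With the voice focus on the mail tool only, `active_vocabulary(table)` yields the six mail tool items and `route_result(table, "next")` selects program 1, matching the example in the text.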
[0053]
If the recognition result for the above voice input is “next”, it is transmitted to the mail tool. The application program 2 that has received the recognition result “next” via the message input / output unit 21 then performs processing such as displaying the mail that follows the currently displayed received mail.
[0054]
In FIGS. 10A and 10B, the input mask of the shell tool includes an input task change mask. With this mask set, the shell tool is notified every time the voice focus changes.
[0055]
In the above example, when the voice recognition system 1 receives the input task setting request (e) from the mail tool and the message processing unit 11 changes the voice focus, an input task change notification message is sent to the shell tool. Since input masks other than the recognition result notification mask do not depend on the value of the voice input flag, an application program 2 that has set the input task change mask is notified whenever the voice focus changes, regardless of the value of its own voice input flag. By learning of such changes in the internal state of the voice recognition system 1 through messages, the application program 2 can perform various flexible processes. For example, the shell tool can notify the user that it has lost the voice focus through a screen display, synthesized voice, a beep sound, or the like.
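The point that the input task change notification is sent independently of the voice input flag can be made concrete with a small sketch. The table layout, the mask name, and the shape of the notification are assumptions for illustration; the text states only that the notification carries the task IDs before and after the change.

```python
TASK_CHANGE_MASK = "input_task_change"   # hypothetical mask name


def change_voice_focus(table, new_pid):
    """Move the voice focus to `new_pid` and work out who must be told about it."""
    before = [pid for pid, entry in table.items() if entry["flag"] == 1]
    for entry in table.values():
        entry["flag"] = 0
    table[new_pid]["flag"] = 1
    notice = {"type": "input_task_change", "before": before, "after": [new_pid]}
    # Recipients are chosen by input mask alone; the voice input flag is not consulted.
    recipients = [pid for pid, entry in table.items()
                  if TASK_CHANGE_MASK in entry["mask"]]
    return notice, recipients
```

Here a shell tool that has just lost the focus still receives the notice as long as it set the mask, which is what allows it to tell the user about the loss.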
[0056]
In this way, the application program 2 can freely control the speech recognition system 1 through the message, and a flexible speech recognition interface led by the application program can be obtained.
[0057]
Therefore, according to the first embodiment, in a multitasking environment in which a plurality of application programs 2 operate in parallel, each application program 2 exchanges messages directly with the speech recognition system 1, so that data such as recognition vocabulary and recognition results can be passed directly between them. All application programs 2 can therefore be provided with voice input as a standard input means alongside the keyboard and mouse, which makes full-fledged use of voice input possible in a multitasking environment such as a workstation and can be expected to improve the usability of man-machine interfaces that include voice.
[0058]
Although this embodiment is an example of a speech recognition interface using isolated word recognition, continuous word speech recognition and continuous speech recognition can also be applied.
(Second embodiment)
In the second embodiment, the user environment is improved by using a window system together with the speech recognition system in a multitasking computer environment.
[0059]
FIG. 11 shows the configuration when a window system is used together with the speech recognition system. The configuration consists of the voice recognition system 3, which handles voice input; the window system 4, which handles keyboard input and mouse input; and one or more application programs 5, which exchange messages with both the voice recognition system 3 and the window system 4. That is, in this embodiment a window system is added to the first embodiment described above, and the application program is provided with a means of communicating with the window system.
[0060]
The window system 4 and the voice recognition system 3 are independent of each other. Messages between the window system 4 and the application program 5 relate to processing such as window generation, keyboard input, and mouse input in a multi-window environment.
[0061]
Before describing this embodiment, a window system that realizes multiple windows will be briefly described. A window system that realizes multiple windows in a multitasking computer environment such as a workstation communicates with the multiple application programs operating in that environment, and each application program is abstracted and displayed on a display screen called a bitmap display. One window is basically assigned to each application program.
[0062]
FIG. 12 is a screen display example of a general window system. In this example, three application programs A, B, and C are operating in parallel. The window system manages input devices such as the keyboard and mouse, and allows a plurality of application programs to share them. The mouse is abstracted on the screen as an arrow-shaped mouse pointer and is used for window operations and for designating the input target.
[0063]
In the embodiments of the present application, the description uses a mouse as the pointing device throughout, but other pointing devices such as a pen or a touch panel can also be used, and the descriptions in all the embodiments apply equally to those other pointing devices.
[0064]
The target of keyboard input is called the keyboard focus. The keyboard focus is generally specified with the mouse pointer. An application program that has the keyboard focus is indicated by making its window frame thicker than those of other windows or by changing the color of the title bar at the top of its window. FIG. 12 shows a state in which application program B has the keyboard focus. In general, the keyboard focus is on exactly one window at any time.
[0065]
Here, the three programs described in the first embodiment, that is, the shell tool, the text editor, and the mail tool will be described again. In this case, each program is abstracted and represented as one window by the window system. Further, it communicates with the voice recognition system, and at the time of activation, the recognition vocabulary is set for the voice recognition system according to the procedure shown in the first embodiment. The recognition vocabulary of each application program is as shown in FIG.
[0066]
Generally, in an existing window system, an application program receives a notification when the keyboard focus changes. To make the keyboard input target and the voice input target the same application program, the application program requests the speech recognition system to place the voice focus on it when it gains the keyboard focus, and to remove the voice focus when it loses the keyboard focus. This can be done by transmitting the input task setting request described in the first embodiment. Hereinafter, the keyboard focus and the voice focus are treated as coinciding, and the two together are referred to as the input focus. The input focus is operated with the mouse.
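On the application side, tying the voice focus to the keyboard focus amounts to two small event handlers. The sketch below assumes that the window toolkit calls these handlers on focus changes and that `send` delivers a message to the speech recognition system; the class name, the handler names, and the message fields are hypothetical.

```python
class VoiceFocusClient:
    """Keeps the voice focus in step with the keyboard focus of one program."""

    def __init__(self, program_id, send):
        self.program_id = program_id
        self.send = send   # callable that transmits a message to the recognition system

    def on_keyboard_focus_in(self):
        # Gained keyboard focus: ask for the voice focus as well.
        self.send({"type": "input_task_setting",
                   "program_id": self.program_id, "focus": True})

    def on_keyboard_focus_out(self):
        # Lost keyboard focus: ask for the voice focus to be removed.
        self.send({"type": "input_task_setting",
                   "program_id": self.program_id, "focus": False})
```

For example, with `sent = []` and `client = VoiceFocusClient(1, sent.append)`, calling the two handlers in turn appends one request that sets the focus and one that removes it.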
[0067]
FIG. 14 shows how the speech recognition vocabulary changes as the input focus moves. FIG. 14A shows state 1 and FIG. 14B shows state 2. In state 1, the input focus (and therefore the voice focus) is on the text editor, so the vocabulary that can be recognized consists of the five recognition vocabulary items of the text editor: “cut”, “copy”, “paste”, “resolve”, and “end”. When the user utters one of these five items, the speech recognition result is sent to the text editor. When the shell tool is then specified with the mouse pointer, the input focus, and with it the voice focus, moves to the shell tool, and the recognizable vocabulary becomes the recognition vocabulary of the shell tool: “history”, “list”, “home”, “process”, and “end”.
[0068]
Each application program is free to choose its speech recognition vocabulary, so memorizing the recognition vocabulary of every application program and judging what can be said is a heavy burden on the user. At the same time, requiring each application program to provide its own means of displaying the recognition vocabulary is a burden on the creator of the application program. Also, since voice input, unlike input means such as a keyboard, is inherently ambiguous, it is important that the user can confirm whether the input voice has been recognized correctly.
[0069]
As a means of solving this problem, it is conceivable to provide a program that displays the recognition vocabulary (a vocabulary display program), as shown in FIG. 15, as a standard application program of the speech recognition interface. This program requests that it be sent the messages generated whenever an application program connects or disconnects a communication path, requests a vocabulary change, or changes the voice focus (that is, it sets the input masks needed to receive them). The vocabulary display program can thus always display all the vocabulary that can be recognized at that time. Moreover, by being notified every time voice is recognized and displaying the recognition result transmitted to the application program in a different color, for example as shown in FIG. 15, it lets the user confirm the voice input accepted by the voice recognition system. The vocabulary display program reduces the burden on both the users and the creators of application programs and gives the user a voice input environment that is easier to use.
[0070]
In addition to changing the color in the list of vocabulary display programs, the recognition result can be notified to the user by another method.
[0071]
For example, the recognition result may be displayed at a specific position on the display screen or in an application window. This display area may be provided by each application or owned by the speech recognition system itself. In a window system environment, a recognition result display window may be created and its position adjusted so that it appears at a specific position, such as the center of the application window, its periphery (top, bottom, left, or right), or near a pointer such as the mouse pointer or the keyboard input cursor.
[0072]
The recognition result may remain displayed until the next recognition result is obtained, or it may be displayed only immediately after it is obtained and then hidden after a certain period until the next result arrives. Displaying the result near the mouse pointer or the keyboard input cursor has the advantage of requiring little movement of the line of sight, but a result that is always displayed near the work area may interfere with the work, so in that case displaying it only immediately after recognition is effective. This method may be combined with the method of always displaying the recognition result at a specific position on the screen or in the application.
[0073]
By changing the speech recognition vocabulary according to the position of the mouse, not only between application programs but also within a single application program, unnecessary recognition processing can be avoided and voice input made more reliable. For example, as shown in FIGS. 16A and 16B, the mail tool is divided into two parts, a list display part and a text display part, and the recognition vocabulary is switched according to which part contains the mouse pointer (the recognition vocabulary here has 8 words). This suppresses unnecessary recognition processing and makes recognition errors on the input speech less likely.
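The region-dependent switching described above can be sketched as follows. This is a minimal illustration in Python, not the implementation of the embodiment: the `Region` class, the window geometry, and the words assigned to each part are assumptions made for the example and are not taken from FIG. 16.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A rectangular part of a window with its own recognition vocabulary."""
    name: str
    x: int
    y: int
    width: int
    height: int
    vocabulary: tuple

    def contains(self, px: int, py: int) -> bool:
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


# Hypothetical layout and words for the two parts of the mail tool.
MAIL_TOOL_REGIONS = [
    Region("list display part", 0, 0, 400, 100, ("next", "previous", "delete")),
    Region("text display part", 0, 100, 400, 300, ("scroll up", "scroll down")),
]


def active_vocabulary(regions, px, py):
    """Return only the vocabulary of the region under the mouse pointer."""
    for region in regions:
        if region.contains(px, py):
            return list(region.vocabulary)
    return []  # pointer is outside every region: nothing to recognize here
```

Because only the words of one part are active at a time, the recognizer compares the input against fewer candidates, which is the source of the reduced processing and lower error rate mentioned above.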
[0074]
Further, the first embodiment described shifting the voice focus to an application when it is newly activated. Similarly, usability can be improved by defining rules that move the voice focus when the state of an application's window changes (for example, destruction or a geometry change), whether at the start or end of the application or as a result of processing executed in response to input from a pointing device such as a mouse or pen, input from the keyboard, or a voice recognition result.
[0075]
For example, each application can acquire or release the focus in response to changes in the state of its own window according to a rule such as “The voice focus is lost when the window is destroyed, iconified, or hidden behind another window, and the voice focus is acquired when the window is created, changes from the non-display state to the display state, is displayed above other windows, or is enlarged.” Of course, such window state changes need not be handled by each application individually and may instead be managed collectively by a program that manages the voice focus. In that case, the management program asks the window system program (for example, the window server of the system) to notify it of changes in the window state of the applications it manages, and when a notification is received it applies rules such as the above to change the voice focus.
[0076]
Also, when a voice focus management program exists, usability can be improved by likewise defining a rule for deciding which application receives the voice focus when the application that held it loses it because the application terminated or its window was destroyed.
[0077]
For example, the rule may be “The voice focus management program keeps a history of the voice focus, and when the application holding the voice focus loses it for a reason other than a focus acquisition request from another application, the voice focus is returned to the application that held it immediately before.” If the voice focus management program changes the voice focus according to such a rule, it is possible to avoid a situation in which not a single application holds the voice focus, that is, in which no application receives the output of the speech recognition system.
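A history-based fallback of this kind can be sketched as below. The class name `VoiceFocusManager` and its methods are hypothetical, and the choice of "the most recent previous holder that still exists" as the fallback target is an assumption made for the sketch.

```python
class VoiceFocusManager:
    """Tracks which application holds the voice focus, keeping a history."""

    def __init__(self):
        # Oldest holder first; the last entry currently holds the focus.
        self.history = []

    @property
    def focus(self):
        return self.history[-1] if self.history else None

    def acquire(self, app):
        """Give `app` the focus (start-up or an explicit acquisition request)."""
        if app in self.history:
            self.history.remove(app)
        self.history.append(app)

    def lose(self, app):
        """Handle termination or window destruction of `app`.

        If `app` held the focus, the focus falls back to the most recent
        previous holder that is still present. Returns the new holder.
        """
        if app in self.history:
            self.history.remove(app)
        return self.focus
```

With this rule the focus is empty only when no application remains in the history at all.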
[0078]
In the present embodiment, the voice recognition system and the window system are configured independently, but it is also possible to realize a voice recognition interface in a form in which both systems are integrated.
[0079]
(Third embodiment)
In the second embodiment, the speech recognition system and the window system were combined, the voice focus and the keyboard focus were merged into a single input focus, and the speech recognition target vocabulary was changed by designating the input focus with the mouse pointer. However, the user must take a hand off the keyboard each time the input focus is changed. If the input focus can be changed by voice, the user can switch input tasks without taking a hand off the keyboard, which can be expected to improve usability in a multi-window environment.
[0080]
In order to change the input focus by voice input, the first embodiment is extended so that one of two attribute values, local or global, can be set for each recognition vocabulary item. A local recognition vocabulary item is recognized only when the voice focus is on the application program that registered it. A global recognition vocabulary item is recognized regardless of which application program holds the voice focus.
[0081]
Here, description will be made again using three application programs (shell tool, text editor, and mail tool).
[0082]
The recognition vocabulary of each application program is shown in the corresponding figure. To hold the local/global setting, a flag indicating local or global is provided for each item in the recognition target vocabulary list of the application program management table; the resulting table is likewise shown in the corresponding figure. When voice input is given, the message processing unit uses the application program management table to determine the recognition vocabulary as follows. First, referring to the table, it picks up the local recognition vocabulary of the application program that holds the voice focus. It then collects the global recognition vocabulary of all application programs. Together these form the vocabulary that the recognition system can recognize at that moment. For example, if the voice focus is on the text editor, the recognition vocabulary consists of eight words: “cut”, “copy”, “paste”, “cancel”, “end”, “shell tool”, “mail tool”, and “text editor”. Recognition results for “cut”, “copy”, “paste”, “cancel”, “end”, and “text editor” are sent to the text editor, while “mail tool” and “shell tool” are sent to the mail tool and the shell tool, respectively. If, for example, the user says “mail tool” in this state and the mail tool responds by moving the input focus (voice focus and keyboard focus) to itself, the target of voice input and key input can be changed without taking the hands off the keyboard.
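The vocabulary selection just described (local words of the focused program plus global words of every program) can be illustrated with a short Python sketch. The table layout and the function name `recognizable` are assumptions for illustration. The text editor's words and the three window names follow the example in the text, while the local words given to the shell tool and the mail tool are placeholders.

```python
# Application program management table: each word carries a local/global flag.
MANAGEMENT_TABLE = {
    "shell tool": {
        "process": "local", "home": "local", "end": "local",
        "shell tool": "global",
    },
    "text editor": {
        "cut": "local", "copy": "local", "paste": "local",
        "cancel": "local", "end": "local",
        "text editor": "global",
    },
    "mail tool": {
        "start": "local", "next": "local", "end": "local",
        "mail tool": "global",
    },
}


def recognizable(table, focused):
    """Map each currently recognizable word to the program that receives it.

    A word is recognizable if it is global (for any program) or if it is a
    local word of the program that holds the voice focus.
    """
    routes = {}
    for app, words in table.items():
        for word, scope in words.items():
            if scope == "global" or app == focused:
                routes[word] = app
    return routes
```

With the voice focus on the text editor, `recognizable(MANAGEMENT_TABLE, "text editor")` yields the eight words listed above, with “mail tool” and “shell tool” routed to their own programs so that each can pull the focus to itself.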
[0083]
In other words, each window is given a name. If this name is shown in the title display at the top of the window, the user can tell what to call the window.
[0084]
As described above, in this embodiment, giving local/global attributes to the recognition vocabulary makes it possible to name each window and switch the focus without using the hands, simply by speaking that name.
(Fourth embodiment)
In the second and third embodiments, the voice focus and the keyboard focus are made to coincide with each other, and only one window accepts both inputs exclusively.
[0085]
By matching the two input focuses, one application program could receive both inputs at once. However, although two input means were available, they could not be directed to different application programs. In this embodiment, to separate the two focuses, the voice focus is not operated directly by the mouse pointer (the keyboard focus still follows the mouse pointer).
Even if the mouse pointer enters a window and the application program is notified, the application program does not move the voice focus. Instead, the voice focus is changed by giving each window a name as described in the third embodiment, registering those names as global recognition vocabulary, and speaking the name.
[0086]
When the input focus is separated, the user will be confused during input unless the two focuses are presented in an easy-to-understand manner. In this embodiment, the keyboard focus is indicated by making the window frame thick, and the voice focus is indicated by changing the color of the window title.
[0087]
FIG. 19 shows an example in which the input focus is divided into two and each is moved separately. In FIG. 19(a), both focuses are on the text editor. When the mail tool is designated with the mouse pointer, the keyboard focus moves to the mail tool, but the voice focus remains on the text editor, as in FIG. 19(b). If “mail tool” is spoken from the state of FIG. 19(a), the voice focus moves to the mail tool, but the keyboard focus remains unchanged, as in FIG. 19(c). In FIGS. 19(b) and 19(c), the keyboard focus and the voice focus are on different application programs, so two application programs can be operated at the same time through completely separate input channels. For example, in the state of FIG. 19(c), the user can read received e-mail by operating the mail tool by voice while typing a sentence into the text editor with the keyboard.
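The independent movement of the two focuses can be modeled in a few lines. The sketch below is illustrative only; the class `InputFocus` and its method names are hypothetical.

```python
class InputFocus:
    """Keeps the keyboard focus and the voice focus as separate values."""

    def __init__(self, windows, initial):
        self.windows = set(windows)
        self.keyboard_focus = initial
        self.voice_focus = initial

    def pointer_entered(self, window):
        """The mouse pointer moves only the keyboard focus."""
        self.keyboard_focus = window

    def recognized(self, word):
        """A spoken window name (global vocabulary) moves only the voice focus."""
        if word in self.windows:
            self.voice_focus = word
```

Starting with both focuses on the text editor, calling `recognized("mail tool")` leaves the keyboard focus on the text editor while the voice focus moves to the mail tool, which is the state in which the user types text while operating the mail tool by voice.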
[0088]
In addition, a voice focus manager, an application program for controlling the voice focus, is provided so that the voice focus can also be moved by means other than voice. The right side of FIG. 19 shows the voice focus manager. By communicating with the voice recognition system, the voice focus manager learns the state of the application programs that are running at the same time and displays it, for example as a list.
[0089]
The voice focus is indicated, for example, by highlighting the name of the application program, and it can be changed by pointing at an entry in the list with the mouse pointer. Besides the keyboard and voice, a pen or the like is conceivable as a means of input to an application program. Displaying which input means each application program accepts improves user convenience. For example, the availability of each input means can be shown as an icon.
[0090]
In this way, by separating the target of voice input from the target of input by means other than voice, a plurality of input means can be assigned to a plurality of application programs, and the user can carry out tasks in parallel in a natural way.
[0091]
(Fifth embodiment)
FIG. 20 shows a schematic configuration of this embodiment. In this case, a plurality of application programs 7 are connected to the voice recognition system 6. Each of these application programs 7 has a message input / output unit 71.
[0092]
Each time there is voice input, the voice recognition system 6 performs recognition processing on the speech and transmits the recognition result to the application program 7. The application program 7 notifies the speech recognition system 6 of its recognition target vocabulary, and the speech recognition system 6 performs recognition using that vocabulary and transmits the result to the application program 7.
[0093]
The application program 7 has a message input / output unit 71, which decides whether or not the application program 7 receives recognition results and makes the corresponding request to the voice recognition system 6. In response to an instruction from the application program 7, the message input / output unit 71 requests the voice recognition system 6 to perform voice recognition for the application program 7, and it either passes the recognition result received from the voice recognition system 6 to the application program 7 or blocks it. It can also change the recognition target vocabulary.
[0094]
Since the application program 7 has the message input / output unit 71, the application program 7 can receive or not receive voice input (recognition result) according to its own state, regardless of external action.
[0095]
Consider, for example, an electronic mail system that can be controlled by voice (referred to here as voice mail). To prevent malfunctions caused by misrecognition, voice mail is started and normally runs with voice input disabled. When mail is received, voice mail announces it, for example by outputting the synthesized speech “New mail has been received. Do you want to read it now?”, and for confirmation it notifies the speech recognition system 6 of the recognition target vocabulary “Yes” and “No” and requests speech recognition. If the user says “Yes”, the newly received mail is displayed or read out by synthesized speech. If the user says “No”, voice mail requests the voice recognition system 6 to stop sending it recognition results and returns to its original state.
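The way the message input / output unit lets an application accept voice input only when its own state calls for it can be sketched as follows. This is a simplified, single-process illustration: the classes `MessageIOUnit` and `VoiceMail` and their methods are hypothetical, and the requests that the unit would send to the speech recognition system 6 are reduced here to local state changes.

```python
class MessageIOUnit:
    """Sits between the recognition system and one application program."""

    def __init__(self, deliver):
        self.deliver = deliver  # callback into the application program
        self.accepting = False
        self.vocabulary = []

    def enable(self, vocabulary):
        """Start accepting results for the given recognition target vocabulary."""
        self.vocabulary = list(vocabulary)
        self.accepting = True

    def disable(self):
        """Stop accepting recognition results."""
        self.accepting = False
        self.vocabulary = []

    def on_result(self, word):
        """Pass a recognition result to the application, or block it."""
        if self.accepting and word in self.vocabulary:
            self.deliver(word)
            return True
        return False


class VoiceMail:
    """Runs with voice input disabled except while asking for confirmation."""

    def __init__(self):
        self.io = MessageIOUnit(self.on_word)
        self.log = []

    def mail_arrived(self):
        self.log.append("ask: read new mail now?")
        self.io.enable(["yes", "no"])

    def on_word(self, word):
        if word == "yes":
            self.log.append("read mail")
        self.io.disable()  # either answer returns to the original state
```

Before `mail_arrived()` is called, a stray “yes” is blocked by `on_result`, which is the protection against misrecognition that the paragraph describes.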
[0096]
The message “New message ...” may be displayed as shown in FIG. 21 instead of the synthesized speech. “Yes” and “No” in the figure are for enabling operation with a mouse or the like.
[0097]
Also, in FIG. 20, if the message input / output unit 71 of one application program 7 is given a function that enables or blocks the voice input of other application programs 7, then in the e-mail example, while voice mail is waiting for the confirmation voice input, it can temporarily block voice input to the other voice-controllable application programs 7 and restore it when the confirmation is complete.
[0098]
When operations by which application programs 7 block the voice input of other application programs 7 conflict, the application program 7 that entered block mode later can wait until the application program 7 that entered block mode earlier releases its block.
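One simple way to resolve such competing block requests is a first-come, first-served queue. The sketch below assumes this policy; the class `BlockArbiter` and its methods are hypothetical names introduced for the example.

```python
from collections import deque


class BlockArbiter:
    """Grants block mode to one application at a time, in request order."""

    def __init__(self):
        self.holder = None      # application currently blocking the others
        self.waiting = deque()  # later requesters, oldest first

    def request_block(self, app):
        """Return True if `app` now blocks the others, False if it must wait."""
        if self.holder is None:
            self.holder = app
            return True
        self.waiting.append(app)
        return False

    def release(self, app):
        """Release the block held by `app`; the earliest waiter takes over."""
        if app != self.holder:
            raise ValueError(f"{app} does not hold the block")
        self.holder = self.waiting.popleft() if self.waiting else None
        return self.holder
```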
[0099]
In this way, because not only the voice recognition system 6 but also the application program 7 has a means for managing tasks, the application program 7 does not merely follow the instructions of the voice recognition system 6 but can use voice input according to its own internal state.
[0100]
It is also possible to have one specific application program 7 manage the tasks of all other application programs 7 (decisions such as whether each receives speech recognition results and which vocabulary is to be recognized).
[0101]
FIG. 22 shows a mail tool, a shell tool, a text editor, and a task management program that can be operated by voice in a multi-window environment such as a workstation. Here, only one of the application programs 7 at a time accepts voice input. In this example, the text editor is the voice input target (indicated by a change in its title color). The task management program can show in the same way which program is the voice input target. In this example, the voice input target can be changed by pointing at the display of the task management program with a pointing device such as a mouse.
[0102]
(Sixth embodiment)
In the fifth embodiment, only one application program 7 is a voice input target, but a plurality of application programs 7 can also be voice input targets at the same time.
[0103]
The voice recognition system 6 in FIG. 20 has an application program management table as shown in FIG. 23, for example. For each application program 7 connected to the speech recognition system 6, this table records whether voice input is enabled and what the recognition target vocabulary is.
[0104]
The information in this table is changed in response to requests from the message input / output unit 71 of each application program 7. In FIG. 23, the mail tool and the shell tool accept voice input. The state of FIG. 23 can be presented as shown in FIG. 24, for example.
[0105]
Here, the speech recognition system 6 can automatically route recognition results: “process” and “home” are sent to the shell tool, and “start” and “next” are sent to the mail tool. Further, “end” can be sent to the mail tool and the shell tool simultaneously, so each application program 7 can receive it and terminate itself.
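This routing can be sketched as a lookup over the management table. The table below is an illustrative stand-in for FIG. 23: the words for the mail tool and the shell tool follow the example in the text, the text editor's entries are placeholders, and the names `ROUTING_TABLE` and `destinations` are hypothetical.

```python
# For each program: whether voice input is enabled, and its vocabulary.
ROUTING_TABLE = {
    "mail tool": {"enabled": True, "vocabulary": ["start", "next", "end"]},
    "shell tool": {"enabled": True, "vocabulary": ["process", "home", "end"]},
    "text editor": {"enabled": False, "vocabulary": ["cut", "copy", "paste", "end"]},
}


def destinations(table, word):
    """Return every enabled program whose vocabulary contains `word`."""
    return [
        app
        for app, entry in table.items()
        if entry["enabled"] and word in entry["vocabulary"]
    ]
```

A word shared by several enabled programs, such as “end”, is delivered to all of them, while words of a disabled program are not delivered at all.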
[0106]
Furthermore, once a plurality of application programs 7 can be voice input targets, the following operations become possible. FIG. 25 shows an example in which the functions of the task management program are extended. “Exclusive control” is a function that keeps exactly one application program 7 as the voice input target at all times, as before. “All” is a function that sets every application program 7 connected to the speech recognition system 6 as a voice input target. “Reverse” is a function that inverts the voice input targets: when the mail tool and the shell tool are the voice input targets, “reverse” makes the text editor the voice input target, and applying “reverse” again restores the previous state. These operations can be performed not only with a pointing device such as a mouse but also by voice or with keys. For example, the user may speak while holding down one of the buttons or a key.
[0107]
If the user speaks while pressing the “all” button, all the application programs 7 become voice input targets. If the user speaks while pressing the “reverse” button, the voice input targets are inverted. When the button is released, the original state is restored.
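The three operations can be expressed as functions over a mapping from each program to an enabled flag. The sketch below is illustrative: the function names and the way the press-and-hold behavior is modeled (a temporary view that leaves the stored state untouched) are assumptions.

```python
def exclusive(targets, chosen):
    """Keep exactly one program as the voice input target."""
    return {app: app == chosen for app in targets}


def select_all(targets):
    """Make every connected program a voice input target."""
    return {app: True for app in targets}


def reverse(targets):
    """Invert which programs are voice input targets."""
    return {app: not enabled for app, enabled in targets.items()}


BUTTONS = {"all": select_all, "reverse": reverse}


def targets_while_pressed(targets, button):
    """Targets in effect while `button` is held down.

    `targets` itself is not modified, so releasing the button simply means
    going back to using `targets`.
    """
    return BUTTONS[button](targets)
```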
[0108]
In this embodiment, input can be given without specifying a single target, and the input is processed appropriately. Consider a multi-window environment such as a workstation: even if several voice-operable application programs 7 are running on it, the user who regards the computer as a conversational partner has only one partner. It is natural for the user to expect the computer to process an utterance appropriately and automatically without special operations such as task switching, and doing so takes advantage of the characteristics of speech as a medium.
[0109]
(Seventh embodiment)
In the sixth embodiment described above, the user cannot tell what the recognition target vocabulary of each application program 7 is. Therefore, the recognition target vocabulary of each application program 7 is displayed by the task management program (or another application program 7). The application program 7 can display this by requesting the information in the application program management table (FIG. 23) from the speech recognition system 6 (FIG. 26).
[0110]
By automatically displaying the recognition target vocabulary of the application programs 7 that are voice input targets in this way, the user no longer needs to memorize the recognition target vocabulary of each application program 7, which reduces the user's burden. The burden on the creator of the application program 7 is also reduced, because no separate means of displaying the recognition target vocabulary has to be prepared. This display can also be combined with the display of which application programs 7 are input targets, for example (FIG. 27). In FIG. 27, the mail tool and the shell tool are shown in a different color to indicate that they are input targets.
[0111]
(Eighth embodiment)
Controlling a plurality of application programs 7 does not necessarily require a screen display or a pointing device such as a mouse. For example, while a VTR control program that accepts video recording reservations by voice is being operated by telephone, the voice mail program described in the fifth embodiment can temporarily interrupt the processing of the VTR control program and ask for confirmation by synthesized speech, for example “An urgent mail has been received. Do you want to check its contents?” A user who confirms can learn the content of the received mail from the synthesized speech.
[0112]
When the work with the mail is finished, the video reservation task is resumed. The interface becomes easier to use if, in preparation for such interruptions, the VTR control program provides a vocabulary item such as “reservation content confirmation” that lets the user check the reservation contents entered before the interruption. With a telephone, not only voice but also an input device such as the telephone's push buttons can be used. While making use of the naturalness of voice input, the user can ensure reliable input by using the push buttons as appropriate when, for example, environmental noise temporarily increases and voice input becomes unreliable.
[0113]
(Ninth embodiment)
Next, an embodiment relating to learning of recognized vocabulary by the speech recognition program according to the present invention will be described.
[0114]
Conventionally, when learning the recognition vocabulary, the user selects the vocabulary to be learned from a list of learning vocabulary. However, if the list is long, finding the desired vocabulary takes time and effort. For example, the learning program of a speech recognition device released for workstations displays all the recognition vocabulary used by the various application programs, so the vocabulary to be learned has to be selected from a list of hundreds of words.
[0115]
In this embodiment, the recognition vocabulary information from the application program is used to reduce the number of words in the list presented to the user, so that the target vocabulary can be selected easily and learning can be performed on the spot even while the application program is in use.
[0116]
As shown in FIG. 28, this embodiment has a configuration in which a learning data collection unit 8 and a dictionary creation unit 9 are added to the speech recognition system 1 and application program 2 described in FIG.
[0117]
Here, the learning data collection unit 8 exchanges messages with the speech recognition system 1 to receive vocabulary information related to the application program 2, and displays a vocabulary for the user to select a recognized vocabulary. In addition, the voice recognition system 1 is requested to perform settings necessary for learning, for example, output of learning data, and the received data is stored in a file. The dictionary creation unit 9 creates a recognition dictionary using the file as an input.
[0118]
In order to perform the above operation, the learning data collection unit 8 comprises a word voice feature data storage unit 81, a learning vocabulary display selection unit 82, a learning data collection control unit 83, and a learning vocabulary guide display unit 84, as shown in the corresponding figure.
[0119]
Here, the learning vocabulary display selection unit 82 displays the vocabulary to the user and lets the user select the learning vocabulary; the recognition vocabulary of the application program 2 sent from the speech recognition system 1 is stored in the learning vocabulary table 821 that it contains. For example, when a group of commands used for document editing is the recognition target, the learning vocabulary table 821 holds speech recognition target vocabulary such as cancel, cut, copy, paste, and font. The contents are displayed as shown in FIG. 33, for example, so that the user can select the target vocabulary while using the application program. Because only the vocabulary that is a recognition target in the current internal state of the application program is displayed, the list is much shorter than one that shows all vocabulary together, and the target vocabulary can be selected easily. The word speech feature data storage unit 81 stores the word speech feature data sent from the speech recognition system 1 via the message processing unit, for example on a magnetic disk. The learning data collection control unit 83 performs overall control of data collection and has a data collection instruction flag that indicates the start and end of data collection. Messages are exchanged with the speech recognition system 1 using the messages shown in the corresponding figure.
[0120]
In order to collect learning data, the speech recognition system 1 supports two operation modes: one in which it performs speech recognition and sends the recognition results to the application program 2, and one in which it performs a data collection operation, returning the word speech feature data obtained by speech analysis to the data collection unit 8. Hereinafter these are referred to as the recognition mode and the learning mode, respectively.
[0121]
Next, a data collection procedure will be described with reference to FIGS.
[0122]
FIG. 31 is a flowchart when data is collected by the speech recognition system 1.
[0123]
In this case, it is assumed that the recognition vocabulary has already been set by communication with the application program in the speech recognition system before learning (step 3101). When a learning mode setting request message is received from the data collection unit 8 (step 3102), an operation necessary for learning is performed (step 3103).
[0124]
The operations necessary for learning include, for example, fixing the voice focus so that the set of vocabulary does not change during data collection, and withholding recognition results from the application program 2 during data collection, because if a result were sent, the application program 2 might change state in response and thereby change the vocabulary that has been set.
[0125]
Next, the speech recognition system 1 transmits the list of recognition target words to the data collection unit 8 (step 3104) and then receives a message from the data collection unit 8 (step 3105). If the message is a speech feature data transmission request, the feature data is transmitted to the data collection unit 8 each time voice input is made (step 3107). If the message requests cancellation of the learning mode, the learning mode is cancelled and the system returns to the normal recognition mode (step 3108).
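The two operation modes and the messages that switch between them can be sketched as a small state machine. The sketch is illustrative only: the message names (`set_learning_mode`, `send_feature_data`, `cancel_learning_mode`) and the class layout are assumptions, not the message format of the embodiment.

```python
class SpeechRecognitionSystem:
    """Switches between the recognition mode and the learning mode."""

    def __init__(self, vocabulary):
        self.vocabulary = list(vocabulary)  # set beforehand by the application
        self.mode = "recognition"
        self.send_features = False

    def handle_message(self, message):
        """Process one message from the data collection unit."""
        if message == "set_learning_mode":
            self.mode = "learning"
            return list(self.vocabulary)  # reply with the recognition targets
        if message == "send_feature_data":
            self.send_features = self.mode == "learning"
            return None
        if message == "cancel_learning_mode":
            self.mode = "recognition"
            self.send_features = False
            return None
        raise ValueError(f"unknown message: {message}")

    def on_voice_input(self, features, recognized_word):
        """Return (destination, payload) for one utterance, or None."""
        if self.mode == "learning":
            # Results are withheld from the application during collection.
            if self.send_features:
                return ("data collection unit", features)
            return None
        return ("application program", recognized_word)
```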
[0126]
FIG. 32 is a flowchart of the learning data collection unit.
[0127]
First, as the initial state, the flag that instructs execution of data collection is set to OFF (step 3200). When the user sets data collection to ON, a learning mode setting request message is sent to the speech recognition system 1 (step 3201). Next, the recognition target vocabulary at that time is requested from the speech recognition system 1, and the vocabulary is stored in the learning vocabulary table 821 of the learning vocabulary display selection unit 82.
[0128]
The learning vocabulary guide display unit 84 displays the vocabulary, for example as shown in FIG. 33 (step 3202), and the user selects the learning vocabulary with a mouse or the like (step 3203). More than one item may be selected. A selected item is made easy to see, for example by changing its background color from white to green. FIG. 33 illustrates a case where “copy” and “paste” are selected as the learning vocabulary from the vocabulary of the document editing menu.
[0129]
Next, after a word speech feature data transmission request is issued to the speech recognition system 1 (step 3204), the learning guide display unit 84 displays the vocabulary to be uttered, as shown in the corresponding figure, to prompt the user to utter the learning vocabulary (step 3205). The guide may also be omitted. In addition, the number of utterances and the like can be displayed as auxiliary information, and the vocabulary to be uttered can be presented by synthesized speech. This reduces erroneous utterances caused by misreading and the like compared with displaying the guide on the screen alone.
[0130]
After the user speaks, the word speech feature data sent from the speech recognition system 1 is output to a file, and whether to continue or end data collection is determined from the data collection instruction flag set by the learning data collection control unit 83 (step 3207). If the flag is ON, the processing from the word speech feature data transmission request through data collection and file output is repeated via step 3209; if it is OFF, a request to cancel the learning mode setting is issued to the speech recognition system 1 (step 3208).
[0131]
Next, a processing flow of the entire voice recognition interface at the time of data collection will be described with reference to FIG.
[0132]
First, in the initial setting, when a data collection instruction is issued from the user (a), a learning mode setting request is issued from the data collection unit 8 to the speech recognition system 1 (b). In response, the recognition target vocabulary currently used for recognition by the speech recognition system 1 is sent to the data collection unit 8 (c).
[0133]
The data collection unit 8 prompts the user to select a vocabulary for learning by displaying the recognition target vocabulary to the user. When the learning vocabulary is selected (d), the data collection unit 8 requests the speech recognition system 1 to transmit the word speech feature data (f), and displays the selected vocabulary as an utterance guide ( e) Encourage the user to speak.
[0134]
After processing the user's utterance, the voice recognition system 1 transmits the word voice feature data to the data collection unit 8 (g), and the data collection unit 8 outputs the data to a file.
[0135]
At the end of learning, first, the user inputs an instruction to end data collection (h), and the data collection unit 8 requests the speech recognition system 1 to cancel the learning mode (i). In response to this, the speech recognition system 1 cancels the learning mode.
[0136]
After data collection is completed, the user can create a recognition dictionary as needed. The dictionary creation unit 9 creates a dictionary using data from the word speech feature data storage unit 81 and outputs the dictionary as a file.
[0137]
Accordingly, the target vocabulary can be easily selected in this way, and the recognized vocabulary can be easily learned even while the application program is being used.
[0138]
(Tenth embodiment)
Next, an embodiment will be described that realizes a user-friendly speech recognition interface in which the user need not wait for dictionary creation to finish: the time-consuming dictionary creation is performed in the background, so that the dictionary is created while data is being collected or other application programs are running.
[0139]
Conventionally, the DP method, the HMM, the composite similarity method, and others are known as pattern matching methods for speech recognition, and all of them perform pattern matching using a standard recognition dictionary. In the composite similarity method (Nagata, et al. “Development of speech recognition function at a workstation”, IEICE Technical Report, HC9119, pp. 63-70, (1991)), which requires eigenvalue expansion, dictionary creation involves a large amount of calculation. Even on a workstation currently regarded as fast, for example a computer with a processing capacity of 20 MIPS, it takes a considerable time, for example several seconds to several tens of seconds per word, so the loss of usability caused by waiting during learning cannot be ignored. Therefore, by creating the dictionary in the background while learning data is being collected, the waiting time is reduced and the usability of the interface is improved.
[0140]
Therefore, in this embodiment, a voice recognition system that improves the interface by creating a dictionary in the background will be described.
[0141]
In this case, as shown in FIG. 36, the dictionary creation unit 9 described in FIG. 28 comprises a dictionary creation management unit 91, a dictionary creation control unit 92, a data input unit 93, a dictionary creation unit main body 94, and a file output unit 95.
[0142]
Here, the dictionary creation management unit 91 receives a message from the data collection unit 8, instructs the dictionary creation control unit 92 to create a word recognition dictionary for the requested vocabulary, and notifies the data collection unit 8 by message when creation is complete.
[0143]
When there are a plurality of dictionary creation requests, they are executed one at a time in order of the request date and time recorded in the dictionary creation management table as shown in FIG. FIG. 37 shows, as an example, the contents of the management table when dictionary creation is requested, in this order, for the words “copy”, “paste”, and “cut”, which are commands for editing a document. Conditions such as the vocabulary are registered in the management table together with the date and time of the request, dictionaries are created in that order, and requests whose creation has been completed are deleted from the management table.
[0144]
A dictionary creation request can specify not only the vocabulary as described above but also other information registered in the data as attributes of the word voice feature data. For example, the name of the speaker can be specified as shown in FIG. to create a dictionary for a specific speaker, or a date can be specified as shown in FIG. 39 to create a dictionary only from new data.
[0145]
The dictionary creation management unit 91 and the dictionary creation control unit 92 exchange information by exchanging messages.
[0146]
Next, the flow of creating a dictionary will be described with reference to FIGS.
[0147]
First, FIG. 40 shows a procedure for registration in the dictionary creation management table. In this case, it is determined whether or not a dictionary creation request message has been received (step 4001). If not, the request is awaited; if a request has been received, its conditions, such as the vocabulary and the user name, are registered in the dictionary creation management table (step 4002).
[0148]
On the other hand, FIG. 41 shows a procedure for creating a dictionary. In this case, the dictionary creation management table is searched for registered dictionary creation requests. If there is no request, registration is awaited; if there are requests, the one with the oldest date and time is selected (step 4101). Next, word voice feature data is input (step 4102), and data that meets the conditions of the request is selected (step 4103). A dictionary is created using only the selected data and output as a file (steps 4104 and 4105). The request is then deleted from the management table, and the process returns to the management table search (step 4101). The above is repeated. Further, when all the dictionary creation requests have been deleted, the learning data collection unit may be notified that dictionary creation has been completed.
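The procedure of FIG. 41 can be sketched roughly as follows. This is only an illustrative sketch: the names `Request`, `Sample`, `matches`, and `process_requests` are hypothetical, the management table is modeled as a plain list, and the actual dictionary computation (for example, the composite similarity method) is replaced by a caller-supplied `build` function.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, List, Optional, Tuple


@dataclass
class Request:
    """One row of the dictionary creation management table (hypothetical layout)."""
    requested_at: datetime
    vocabulary: str
    speaker: Optional[str] = None      # optional condition: specific speaker
    since: Optional[datetime] = None   # optional condition: only data from this date on


@dataclass
class Sample:
    """One item of word voice feature data with its attributes (hypothetical layout)."""
    word: str
    speaker: str
    recorded_at: datetime


def matches(req: Request, s: Sample) -> bool:
    """Step 4103: does this sample satisfy the conditions of the request?"""
    if s.word != req.vocabulary:
        return False
    if req.speaker is not None and s.speaker != req.speaker:
        return False
    if req.since is not None and s.recorded_at < req.since:
        return False
    return True


def process_requests(
    table: List[Request],
    samples: List[Sample],
    build: Callable[[List[Sample]], object],
) -> List[Tuple[str, object]]:
    """Drain the table oldest request first and return the created dictionaries."""
    created = []
    while table:
        req = min(table, key=lambda r: r.requested_at)       # step 4101
        selected = [s for s in samples if matches(req, s)]   # steps 4102-4103
        created.append((req.vocabulary, build(selected)))    # steps 4104-4105
        table.remove(req)                                    # delete the finished request
    return created
```

In a real system this loop would run as a separate background process and block while the table is empty; here it simply returns once the table has been drained.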
[0149]
Since the recognition dictionary is created in the background during data collection, the progress of dictionary creation is difficult for the user to see. Therefore, by displaying the progress of dictionary creation as the ratio of the completed processing amount to the total processing amount, for example, as shown in FIGS. 42 (a) and 42 (b), an interface that is easy for the user to understand can be provided. In this case, it is also possible to notify the user with a beep sound at the start or end of dictionary creation. The speed of the dictionary creation process can also be displayed, for example, in four stages as shown in FIG. 43, or by color as shown in FIG. In addition, if the load on the computer is heavy and the dictionary creation process is not advancing, the fact that the process is stalled can be displayed to encourage the user to redistribute the load on the computer.
[0150]
As described above, by performing time-consuming dictionary creation in the background while audio data is being collected, it is possible to realize a user-friendly interface with less waiting time.
[0151]
The dictionary creation described above can operate as an independent process, and can accept dictionary creation requests not only from the data collection unit 8 but also from the speech recognition system and other application programs. A dictionary can therefore be created not only during learning data collection but also at other times.
[0152]
(Eleventh embodiment)
In speech recognition in which the recognition target is a word or a phrase, word boundaries have been detected using feature parameters such as changes in input speech power, changes in speech pitch, or the number of zero crossings, and recognition has been performed by matching the speech feature vector against the recognition dictionary of the recognition vocabulary set. However, in an actual working environment, erroneous word boundaries are often detected because of background noise and the user's careless utterances (conversation with other users, talking to oneself, etc.). For this reason, the user of a speech recognition system must always be aware of what is currently the recognition target and take care not to utter anything else.
[0153]
On the other hand, when voice is used as one of several input means to the computer alongside other input means (for example, a keyboard or mouse), the user can be expected to use each input means selectively according to the input contents and the work status.
[0154]
Therefore, in this embodiment, as shown in FIG. 45, the speech recognition automatic stop unit 10 is added to the speech recognition system 1 and the application program 2 described in FIG., and two modes are provided: a normal recognition mode (recognition processing for all vocabulary that is the recognition target) and a mode in which recognition processing is performed only for a specific keyword. Normal recognition processing is performed for a while after recognition processing starts. If no voice input is made within a predetermined time, the recognition vocabulary set in effect up to that point is saved, and the mode is switched to one in which only a specific keyword (for example, “recognition start”) is set as the recognition vocabulary set. After that, when this keyword is input, the saved recognition vocabulary set is set again, and the mode returns to the normal recognition processing mode. The recognition processing mode can also be switched by, for example, a change of the voice focus or an instruction from an input means other than voice, and each transition of the recognition mode is conveyed to the user by a message or icon display or by a beep sound. As a result, when the user has not used voice for a while, the voice recognition mode is switched automatically, and because speech other than the specific keyword is ignored, unexpected task switching or malfunction due to a detection error can be avoided.
[0155]
In addition, the user can switch the speech recognition processing mode deliberately by uttering the keyword or by using an input means other than speech. The above processing can be realized by using, for example, an interval timer mechanism. With this mechanism, the time until expiry is specified in seconds from the current time, and when that time elapses, a signal indicating expiry is delivered. When this signal is received, the voice recognition mode is switched.
[0156]
Hereinafter, description will be given with reference to the flowchart shown in FIG.
[0157]
First, the number of seconds until the timer expires is set (step 4601), and a flag indicating whether the timer has expired is set to zero. This flag is set to 1 in a signal handler that is called when the signal notifying that the time has expired is received, and its value is checked at the beginning of recognition processing. Note that the timer function can be easily realized with the clock function normally built into the computer. The signal handler can be written as a program in the speech recognition automatic stop unit 10.
[0158]
Next, after setting a vocabulary set to be recognized (step 4602), it is checked whether or not the time has expired (step 4603). If the time has not expired, recognition processing is performed for the vocabulary set.
[0159]
In the recognition process, first, the start and end of a speech section are detected using feature parameters such as changes in the power of the input speech, changes in speech pitch, or the number of zero crossings (step 4604). When the end is determined, the speech feature vector is extracted from the speech section and collated with the recognition dictionary of the current recognition vocabulary set to obtain the similarity of each recognition vocabulary word. The word whose similarity is the maximum and exceeds a predetermined threshold is output as the recognition result, and the recognition process ends (steps 4605-4609).
In FIG. 46, the process from the extraction of the voice feature vector through the collation with the recognition dictionary to the determination by the threshold value is the recognition process. When the end is not detected or no recognition result is obtained (steps 4605 and 4607), processing returns to the setting of the vocabulary set, the recognition vocabulary set is changed if necessary (for example, when a client requests a change of the voice focus or of the recognition vocabulary), and it is checked whether or not the time has expired. If the time has not expired, recognition processing for the current recognition vocabulary set is performed again. When the time expires, the recognition vocabulary set in effect up to that point is saved, and the mode shifts to one in which only a specific keyword is the recognition vocabulary. If the keyword is detected or a client instructs a switch of the recognition processing mode, the saved recognition vocabulary set is restored, the timer is reset, and normal recognition processing is resumed (steps 4610-4617).
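The mode switching described above can be sketched as a small state machine. The following is a simplified illustration, not the flowchart of FIG. 46 itself: the class name `AutoStopRecognizer` is hypothetical, time is passed in explicitly instead of using an interval timer and signal handler, each call represents one already-recognized word, and it is assumed that any accepted voice input restarts the timer.

```python
class AutoStopRecognizer:
    """Switches to keyword-only recognition after a period without voice input."""

    def __init__(self, vocabulary, timeout_s, keyword="recognition start"):
        self.active = set(vocabulary)   # current recognition vocabulary set
        self.saved = None               # vocabulary saved while in keyword-only mode
        self.timeout_s = timeout_s
        self.keyword = keyword
        self.deadline = timeout_s       # timer is started at time 0

    def feed(self, now, word):
        """Handle one utterance at time `now` (seconds); return the accepted word or None."""
        if self.saved is None and now >= self.deadline:
            # Timer expired: save the vocabulary set and accept only the keyword.
            self.saved = self.active
            self.active = {self.keyword}
        if word not in self.active:
            return None                 # outside the current vocabulary: ignored
        if self.saved is not None:
            # Keyword heard: restore the saved vocabulary set and resume normal mode.
            self.active, self.saved = self.saved, None
        self.deadline = now + self.timeout_s   # restart the timer (assumption)
        return word
```

For example, with a 10-second timeout, an utterance of “paste” arriving 17 seconds after the last accepted input is ignored, and normal recognition resumes only after “recognition start” is spoken.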
[0160]
The automatic stop function of the recognition function described above can prevent malfunction due to background noise or user's careless utterance, and can realize a user-friendly speech recognition interface.
[0161]
In addition, as a method for the user to consciously avoid malfunctions caused by background noise or the user's own utterances, a method of accepting voice input only while a mouse button or key is held down has conventionally been used, but it has the problem of being troublesome to operate. Therefore, if voice input is accepted at all times and is rejected only while the mouse button is being pressed, the trouble of having to operate the mouse for each utterance can be reduced.
[0162]
(Twelfth embodiment)
The voice mail tool is an electronic mail system that accepts voice input. Using voice, the user can move through the list of received mail, check the contents, and send replies.
[0163]
In this case, the tool includes a list display section, a received mail display section, and a transmitted mail editing section, and the highlighted mail in the list is displayed on the received mail display section. For example, the following operations can be performed using voice. The example here shows how to respond to an urgent mail from the boss.
[0164]
“Mail Tool” (Bring the whole voice mail tool to the front of the windows.)
“Top” (Move the list pointer to the top of the received mail list.)
“Next” (Move the list pointer to the next mail.)
“Last” (Move the list pointer to the end of the received mail list.)
“Previous” (Move the list pointer to the previous mail.)
“Boss” (List only emails from your boss.)
“Emergency” (List only urgent emails.)
“Reply” (Respond to an urgent mail. “To: supervisor name” and “Subject: Re: subject of the email from the supervisor” are entered in the sent mail display section.)
The initial state of the mail system is shown in FIG. Since not all mail lists can be displayed at once in the mail list display section, when using the mouse to search for a desired mail, it is necessary to use the slide bar on the right side of the display section. In particular, when a large amount of e-mails arrives, it takes a lot of labor to search for e-mails, and the operability is not sufficient. However, by using the voice here, it is possible to directly search for a desired mail, and the work efficiency can be greatly improved.
[0165]
Here, for example, when selecting an urgent mail from the boss, it can be selected simply by saying “boss” or “emergency”. FIG. 48 shows an emergency mail search result from the boss. In this example, assuming that two emails are received, the following occurs.
[0166]
"Copy" (Copy the message.)
"Paste" (The copied message is pasted into the received mail.)
"Quote" (quotes the message)
Now write a reply to the message,
"Sign" (add your signature to the end of the email if necessary)
"Send" (Send reply email)
“Boss” and “emergency” used here are implemented as voice macro commands, and the list is narrowed using the result of collation against the header and contents of each mail. That is, the sender's name, affiliation, and title, the date of sending, and the body of an e-mail are written as text (character data), so by understanding the content and collating it with keywords, e-mail can be retrieved efficiently by voice. This can be realized on a workstation (WS) using information retrieval techniques such as full text search and context analysis, and the use of a voice input interface greatly improves the usability of voice mail. It is also possible to read out part of the text by speech synthesis, to emphasize it, or to change the reading speed. In addition, as shown in FIG. 47, displaying the recognition vocabulary, the client that currently has the voice focus, whether or not recognition is in operation, and so on conveys the system status to the user as far as possible, which makes it possible to improve work efficiency.
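As a rough illustration of such voice macro commands, the narrowing of the mail list can be sketched as a chain of filters over the header and body text. Everything here is hypothetical: the mail representation, the address in `BOSS`, the keyword “urgent”, and the names `MACROS` and `narrow` are assumptions for the sketch, and simple substring matching stands in for full text search or context analysis.

```python
BOSS = "boss@example.com"  # hypothetical sender address of the user's boss

# Each voice macro maps a spoken word to a predicate over one mail (hypothetical rules).
MACROS = {
    "boss": lambda m: m["from"] == BOSS,
    "emergency": lambda m: "urgent" in (m["subject"] + " " + m["body"]).lower(),
}


def narrow(mails, spoken_words):
    """Apply the macro for each recognized word in turn, narrowing the list."""
    for word in spoken_words:
        predicate = MACROS[word]
        mails = [m for m in mails if predicate(m)]
    return mails
```

Saying “boss” and then “emergency” corresponds to `narrow(mails, ["boss", "emergency"])`, which leaves only urgent mails from the boss in the list.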
[0167]
(Thirteenth embodiment)
The voice recognition server can be used to control existing applications with voice. This can be done by creating a client that substitutes voice for keyboard input of an existing application. Here, an example is shown in which a voice macro program that enables voice control for an existing application is used for voice control of an existing DTP (Desk Top Publishing) system.
[0168]
The voice macro program holds knowledge about the recognition vocabulary of the existing application in a menu format, and uses the menu hierarchy to limit the recognition vocabulary. Examples of the menus are as follows:
“Figure” menu
"cancel"
“Grouping”
“Ungroup”
"front"
"back"
“Up / Down inversion”
“Right / Left (Migihidari) inversion”
"rotation"
“Top Level” menu
"documents"
“Edit”
“Figure”
The root of the menu hierarchy is called the “top level”. The user utters a word starting from the top level, and a command is executed by following the menu hierarchy. Each time the position in the menu hierarchy changes, the items of the current menu and the current position in the hierarchy, expressed in the form of a path, are presented to the user.
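The way the menu hierarchy limits the recognition vocabulary can be sketched as follows. This is an illustrative sketch only: the nested dictionary `MENUS` and the class `MenuNavigator` are hypothetical, the contents of the “documents” and “edit” menus are not given in the text and are left empty, and it is assumed that the position stays in the current menu after a command is executed.

```python
# A dict value is a submenu; None marks a command that can be executed.
MENUS = {
    "documents": {},
    "edit": {},
    "figure": {
        "cancel": None, "grouping": None, "ungroup": None,
        "front": None, "back": None,
        "up / down inversion": None, "right / left inversion": None,
        "rotation": None,
    },
}


class MenuNavigator:
    def __init__(self, root):
        self.root = root
        self.path = ["top level"]

    def _node(self):
        node = self.root
        for name in self.path[1:]:
            node = node[name]
        return node

    def vocabulary(self):
        """Only the items of the current menu are recognition targets."""
        return sorted(self._node())

    def say(self, word):
        node = self._node()
        if word not in node:
            return None                      # not in the current vocabulary: ignored
        if isinstance(node[word], dict):
            self.path.append(word)           # descend into the submenu
            return "/".join(self.path)       # current position shown as a path
        return f"execute {word}"             # leaf item: run the command
```

At the top level only “documents”, “edit”, and “figure” are recognized, so a word such as “rotation” is ignored until the figure menu has been opened.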
[0169]
And it operates as follows. Here, an example of handling a plurality of figures existing in the document window is shown (see FIG. 49).
[0170]
Open the drawing menu from the top level to work with shapes.
"Figure" (menu items are listed in the voice commander)
Here, a plurality of figures on the document window are selected with the mouse.
"Grouping" (combines multiple figures into one figure)
"Vertical flip" (flips the grouped figure upside down)
"Rotate" (rotates the figure)
"Ungroup" (cancels the grouping)
Next, one of the previously grouped figures is selected with the mouse.
"Back" (Send selected figure behind all figures)
"Cancel" (cancel the operation performed by "Back")
"Front" (Send to the front)
If you want to operate this with a mouse,
-Click the menu bar to display the menu.
[0171]
-Pull down the menu and select the command item you want to execute.
[0172]
-Release the mouse button to execute the command.
At least three actions are necessary. Considering the time and effort of moving the mouse pointer, it is considered that more actions are being performed.
[0173]
However, when using voice,
・ Utter the word for the operation.
Since this one action is sufficient, the usefulness of voice can be understood. When an operation is performed by selecting a menu with a mouse, the above actions must be executed even if the user knows in advance what operation he or she wants to perform. Voice becomes a more effective interface when combined with other input means.
[0174]
Here, if a keyboard macro is used, a single operation suffices, as with voice. However, since a keyboard macro is basically represented by a single character, the more keyboard macros there are, the more combinations of characters and commands the user is required to memorize, which is a burden on the user.
[0175]
Therefore, the application can provide a more natural interface to the user by combining the command with a voice that can naturally express the meaning of the command, not just a single character.
[0176]
In addition, in word recognition for the figure menu described above, when words in the same category share the same first half, as with “grouping” and “ungrouping”, recognition accuracy can be improved by extracting only the pattern of the second half of each word and performing recognition with it. Likewise, when the second half is the same, as with “up / down inversion” and “left / right inversion”, recognition can be performed using only the first-half pattern. In short, recognition performance can be improved by extracting the word patterns used for recognition from various viewpoints so that the differences between patterns become clearer.
[0177]
(14th embodiment)
The speech recognition interface described above has focused only on speech input. However, if a speech output function is incorporated into the interface to perform speech synthesis from text and playback of speech data, speech input and output can be integrated. Voice input to a plurality of application programs and output of spoken messages from them can then be performed easily, and an interface that is easy for the user to handle can be realized.
[0178]
The configuration of a voice input / output interface, which is a voice recognition interface having a voice synthesis function, will be described below.
[0179]
FIG. 50 shows a schematic configuration of a voice input / output system provided with a voice synthesizer. The voice synthesizer 14 is added to the voice recognition system 1 described in FIG. In this case, the voice synthesizer 14 generates a synthesized voice from the text information in accordance with an instruction from the message processor 11 and outputs a voice. In addition, the application program management table 13 has a field for storing information related to the audio output of the application program 2 as shown in FIG. 55. This makes it possible to control the audio output from the plurality of application programs 2. The information related to the audio output includes an audio output priority, which instructs that a specific audio output be given preference.
[0180]
FIG. 51 shows a schematic configuration of the voice synthesizer 14, which includes an overall controller 561, a waveform superimposing unit 562, a voice output management table 563, and a waveform synthesizer 564.
[0181]
The overall control unit 561 receives a character string from the message processing unit 11 together with an output request for synthesized speech, sends it to the waveform synthesis unit 564 for speech synthesis, and outputs the resulting speech. The sound signal output by the speech synthesizer 14 is not limited to synthesized sound; it may also be recorded voice or a sound other than voice, in which case speech synthesis is not required. In that case, the waveform data received from the message processing unit is output as it is, without waveform synthesis.
[0182]
In addition, the waveform synthesis unit 564 receives character string data from the overall control unit 561 and performs speech synthesis. Various methods are known as speech synthesis methods. For example, the method described in the literature (D. Klatt: "Review of text-to-speech conversion for English", J. Acoust. Soc. Am., 82, 3, pp. 737-793 (Sept. 1987)) can be used.
[0183]
The voice output management table 563 is a table for registering voice output requests from the message processing unit 11. By performing voice output in the order registered in the table, audio output can be performed for a plurality of voice output requests while maintaining temporal consistency.
[0184]
The speech synthesizer 14 can be operated as an independent process, and it exchanges data with the message processor 11 by exchanging messages through inter-process communication, as described for the messages between the speech recognition system 1 and the application program 2. The message here is as shown in FIG.
[0185]
The message from the application program 2 to the message processing unit 11 in FIG. The speech synthesis request here is a request for the application program to convert the text content into synthesized speech, and the request is issued together with the text data to be synthesized. As a result, the synthesized speech data is notified. The waveform reproduction request is a request for reproducing the waveform data as it is when the application program already has audio data in the form of a waveform by recording or the like, and is transmitted together with the reproduction data. The voice synthesis / playback request is a request for performing voice synthesis and playback together, and no synthesized voice data is notified.
[0186]
The priority setting request is a request for giving priority to the output sound from a specific application program. For example, the output sound level, the priority of the speech synthesis process, and the presence / absence of interruption output can be set.
[0187]
Setting the priority of a voice output request is effective because, for example, when urgency is required, setting a high priority draws the user's attention immediately.
[0188]
As described above, the voice output management table 563 is a table for registering voice output requests from the message processing unit 11. By performing voice output in the order registered in this table, audio output can be performed for a plurality of voice output requests while maintaining temporal consistency.
[0189]
An example of the audio output management table 563 is shown in FIGS. The data recorded in the table include a data ID, the type of input data (waveform or text), the time at which the output request was registered in the table, the contents of the text data, the volume at the time of voice output, and the like. In the example of the figure, data IDs #1, #2, and #3 are text data; processing of data #0 to #2 has finished, data #3 is currently being processed, and data #4 has not yet been processed.
[0190]
On the other hand, the message from the message processing unit 11 to the application program 2 has a type as shown in FIG. The audio output status notification notifies that the requested audio output has ended, and the priority setting notification notifies that the audio output priority has been set according to the priority setting request. Both are confirmation messages for the request.
[0191]
The setting of which message is received by the application program 2 can be set by the input mask as already described in the explanation regarding the message between the speech recognition system 1 and the application program 2 above. In this case, since the speech synthesizer 14 is added, the type is as shown in FIG.
[0192]
In addition to the messages described above, various messages such as an error message, a voice output level setting message, and a message for accessing internal information of the voice synthesizer 14 can be set.
[0193]
Information exchange is also performed between the voice synthesis unit 14 and the message processing unit 11 by messages. The messages in this case have the types shown in FIGS. 53 (c) and 53 (d). Of these, the message from the message processing unit 11 of (d) to the speech synthesis unit 14 is substantially the same as the request message from the application program 2 to the message processing unit 11 of (a), and the speech synthesis unit of (c). The message from 14 to the message processing unit 11 is almost the same type as the notification message from the message processing unit 11 to the application program 2 in (b).
[0194]
As described above, by exchanging messages in each part of the speech recognition system 1 having the speech synthesizer 14, the speech output processing according to requests from a plurality of application programs 2 is advanced. The processing flow will be described with reference to FIGS.
[0195]
In FIG. 56, it is assumed that the initial setting regarding the connection processing and speech recognition between the application program 2 and the speech recognition system 1 has already been completed in step 6101 according to the procedure described in the first embodiment. After step 6101 is completed, the application program 2 performs initial settings relating to the audio output processing according to FIG. 57A described later (step 6102). Initial settings include initialization of the voice output management table 563 in the voice synthesizer 14 and initialization of voice output priority information in the application program management table 13. Then, voice input and voice output processing is executed (step 6103).
[0196]
Next, an audio output process for each request related to audio output from the application program 2 will be described.
[0197]
First, when the speech synthesis request of (b-1) in FIG. 57 is issued from the application program 2, the message processing unit 11 passes the request as it is to the speech synthesis unit 14 as a speech synthesis request. The voice synthesizer 14 then registers the message in the voice output management table 563. Since a speech synthesis request does not include waveform reproduction, the output field is set to no output (= 0), as for message ID #1 in the output management table of FIG. 52, for example. In this case, the audio output priority information is not used. After the synthesis is complete, the voice synthesizing unit 14 notifies the message processing unit 11 of the completion by an audio output status notification, and the message processing unit 11 relays it to the application program 2. After this notification, the application program 2 issues a voice waveform data request and receives the synthesized voice data.
[0198]
Next, when there is a waveform reproduction request of (b-2) in FIG. 57, the message processing unit 11 looks up the priority information registered for the requesting application program in the application program management table shown in FIG. 55, attaches it to the request, and issues a waveform reproduction request to the speech synthesizer 14.
[0199]
The voice synthesizer 14 registers a message in the voice output management table. In this case, for example, contents such as message ID # 0 or # 4 in FIG. 52 are registered. After the waveform reproduction is completed, the speech synthesizer 14 sends a message indicating that the reproduction is completed to the message processing unit 11 by the notification of the voice output status, and the message processing unit 11 sends it to the application program 2.
[0200]
Next, when there is a voice synthesis / playback request in (b-3) of FIG. 57, voice synthesis and playback are performed in the same process as in the waveform playback.
[0201]
Also, the audio output priority can be changed by the priority setting request in (b-4) of FIG. As described above, the voice output priority includes the level of voice output, the priority of voice synthesis processing, the presence or absence of interruption processing, and the like. Increasing the level of the output speech helps to draw attention to the output message, and increasing the priority of speech synthesis processing reduces the delay until the synthesized speech data is output. In addition, the interruption process temporarily suspends all voice output other than specific voice output data and outputs only that data. By using these in combination, processing that outputs an important message preferentially becomes possible.
[0202]
For example, in FIG. 52, output level = 3, no interruption output, and synthesis processing priority- (no value) are set for the waveform reproduction request of message ID # 0. In this case, the priority value is set in the range of 0 to 10, and the output level 3 is a relatively small value. Moreover, since there is no interruption output, this waveform data is heard overlapping with other sounds. On the other hand, for the voice synthesis / playback request of # 2, since the output level is 10 at the maximum and the priority of the voice synthesis process is also the maximum, the synthesized sound data is output immediately. Since there is an interruption output, other sounds are in an output interruption state during this period. While outputting this synthesized sound, it is possible to hear the sound without being disturbed by other sounds.
[0203]
Next, a method for sequentially processing the voice output requests as described above will be described.
[0204]
Multiple voice output requests are processed according to the voice output management table 563 of the voice synthesizer 14. In the voice output management table 563, the request ID, input data type (waveform / text), request reception time, data contents, processing status, volume, presence / absence of output interruption processing, voice synthesis processing priority, exclusive processing coefficient, and so on are registered in the order in which the requests were made.
[0205]
As shown in FIG. 58, the overall control unit 561 first refers to the processing status field of the audio output management table 563 (step 6301) and searches for data that is “unprocessed”. If such data exists, its processing status is updated to “processing” (step 6302) and the type of the data is checked (step 6303). If the data is text, the text data is sent to the waveform synthesizer 564 for speech synthesis (step 6304), and the synthesized sound data is passed to the waveform superimposing unit 562 (step 6305). The processing status is then updated to “end” (step 6306), and the next unprocessed data is processed.
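The loop of FIG. 58 can be sketched as follows. This is a minimal sketch under several assumptions: `OutputRequest` and `process_table` are hypothetical names, the table is a list kept in registration order, and the synthesizer and the waveform superimposing unit are represented by caller-supplied functions. The pass-through of waveform data without synthesis follows the earlier description of the overall control unit.

```python
from dataclasses import dataclass


@dataclass
class OutputRequest:
    """One row of the voice output management table (hypothetical layout)."""
    data_id: int
    kind: str                     # "text" or "waveform"
    content: object
    status: str = "unprocessed"   # "unprocessed" -> "processing" -> "end"


def process_table(table, synthesize, superimpose):
    """Process every unprocessed request in registration order."""
    for req in table:
        if req.status != "unprocessed":           # step 6301: look for unprocessed data
            continue
        req.status = "processing"                 # step 6302
        if req.kind == "text":                    # step 6303: check the data type
            waveform = synthesize(req.content)    # step 6304: text-to-speech
        else:
            waveform = req.content                # recorded waveform: no synthesis needed
        superimpose(req.data_id, waveform)        # step 6305: hand over for output
        req.status = "end"                        # step 6306
```

Because the rows are visited in the order in which they were registered, the output order follows the request order, which is how the table keeps multiple requests temporally consistent.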
[0206]
In the waveform synthesis unit 564, the synthesis computation is performed with its priority relative to other processing set according to the synthesis processing priority information of the data being processed. The priority can be set, for example, by using a system call of UNIX, an operating system commonly used on workstations, to change the processor time allocated to the synthesis process, or by preparing multiple speech synthesizers with different processing amounts and switching the synthesizer used according to the priority.
[0207]
The waveform superimposing unit 562 superimposes a plurality of waveforms based on the waveform data together with information such as volume, presence / absence of output interruption processing, and the exclusive processing coefficient. During superimposition, the correspondence between time and the samples of the waveform data is monitored constantly so that the intervals between the plurality of audio output requests and the output intervals of the corresponding waveform data are as nearly equal as possible. The superimposing process can be performed as block processing for each unit time, for example, every 10 msec.
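Block-wise superimposition can be illustrated with the following sketch. It is a simplified model under stated assumptions: the function name `superimpose` is hypothetical, each track is given as a start position in samples, a linear gain standing in for the volume setting, and a list of samples, and the block length is expressed in samples rather than in milliseconds. Interruption and exclusive processing are not modeled here.

```python
def superimpose(tracks, block=4):
    """Mix tracks block by block; each track is (start_sample, gain, samples)."""
    end = max(start + len(samples) for start, _, samples in tracks)
    blocks = []
    for b in range(0, end, block):
        mixed = [0.0] * min(block, end - b)        # one output block
        for start, gain, samples in tracks:
            for i in range(len(mixed)):
                j = b + i - start                  # index into this track's samples
                if 0 <= j < len(samples):
                    mixed[i] += gain * samples[j]  # add the volume-scaled sample
        blocks.append(mixed)
    return blocks
```

Keeping each track's start position in samples ties the waveform data to the time of its request, so the spacing of the outputs matches the spacing of the requests.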
[0208]
Next, an example of superimposing audio data with interruption processing will be described with reference to FIG. In this case, the data are data IDs #1 to #3 in the audio output management table 563 in FIG. 52. For simplicity, it is assumed that there is no time delay from registration to waveform superimposition, although in practice, depending on the capability of the computer, delays arise from speech synthesis and data movement. When the audio data are output according to the times recorded in the audio output management table 563 without output interruption processing, the data overlap in time as shown in FIG. The voice of data #2, which is an urgent message, is output with its beginning overlapping the end of data #1 and its latter half overlapping the first half of data #3. In contrast, in (b), where output interruption processing is performed, the superimposition of data #1 is interrupted when the “emergency” data #2 starts, and after #2 has been processed, the rest of #1 is superimposed from the point at which it was interrupted. Data #3 is superimposed after #2 ends. Data that is divided in time by the interruption process, such as data #1, may be output in divided form as described above, but various other treatments are possible: after the interruption, the data may be output again from the beginning, the latter part after the division may not be output at all, or the data may be superimposed with its volume gradually lowered.
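The effect of output interruption processing on the timeline can be sketched with a small discrete-time simulation. This is only an illustration: the function `schedule`, the request fields, and the tick-based timing are hypothetical, and the sketch implements just the first treatment described above, in which interrupted data resumes from the point of interruption.

```python
def schedule(requests, horizon):
    """Return, for each request id, the ticks at which its data is output.

    Each request is a dict with "id", "start" (tick), "length" (ticks), and
    "interrupt" (True if its output suspends all other output).
    """
    remaining = {r["id"]: r["length"] for r in requests}
    played = {r["id"]: [] for r in requests}
    for t in range(horizon):
        ready = [r for r in requests
                 if r["start"] <= t and remaining[r["id"]] > 0]
        urgent = [r for r in ready if r["interrupt"]]
        # While interrupting data is pending, only that data is output;
        # otherwise all ready data is superimposed.
        for r in (urgent or ready):
            played[r["id"]].append(t)
            remaining[r["id"]] -= 1
    return played
```

With data #1 starting first, an interrupting #2 starting before #1 ends, and #3 starting while #2 is still playing, #1 is split around #2 and #3 begins only after #2 has finished.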
[0209]
(15th embodiment)
As described in the fourteenth embodiment, incorporating the speech synthesizer 14 into the speech recognition system makes the speech recognition and synthesis functions available to a plurality of tasks in a multitask environment, which improves usability when the user works with the application programs 2. This embodiment builds on the fourteenth embodiment and, as a specific application of the system, mainly describes the effect of adding the speech synthesis function to the voice mail tool.
[0210]
FIG. 60 shows a schematic configuration of the fifteenth embodiment, which includes a voice input / output system 651, a window system 652, and a voice mail tool 653. The voice mail tool 653 includes an email processing unit 6531 and a message input / output unit 6532.
[0211]
In this case, the voice input / output system 651 is a system having a voice synthesis function described in the fourteenth embodiment. The window system 652 provides information related to the application program to the user through a GUI (Graphical User Interface). By using the voice input / output system 651 and the window system 652, the voice mail tool 653 can handle voice input in the same manner as a mouse and keyboard and can handle voice synthesis in a unified manner.
[0212]
Normally, data transmitted and received by the voice mail system is text data, but not only text data but also voice data, image data, and the like can be mixed in the mail. In order to send and receive e-mail containing audio data, the e-mail tool needs a function for recording and reproducing raw audio data.
[0213]
The messages shown in FIG. 61 are added to the messages exchanged between the application program 2 and the voice input / output system 651 so that the application program 2 can handle raw audio data. FIG. 62 (a) shows the procedure by which the mail tool records voice data using these messages, and FIG. 62 (b) shows the playback procedure. FIG. 63 shows a screen display example of the voice mail tool having this voice recording / playback function. The display is substantially the same as that of FIG. 48 of the twelfth embodiment described above. Here, an asterisk (*) at the beginning of a line in the list display section of the tool identifies a mail document that includes voice data. The received mail display section shows an example of a mail document with voice data. The voice data in the mail document are presented to the user in, for example, a button-like form.
[0214]
In FIG. 63, the button labeled “emergency” represents voice data. The voice data are selected with the mouse or the like and played back by mouse, key, or voice input. Any number of voice data buttons can be created and placed at any position in the mail text.
[0215]
Recording, playback, and editing of voice data in a mail are performed using a sub-window for voice data editing as shown in FIG. The two sliders at the top of the figure set the input and output volumes for the audio data, respectively. The buttons below the sliders are for recording, playing back, stopping recording / playback, editing the voice data, and adding the voice data to the mail. The edit button has an edit submenu for cutting, copying, and pasting. “Emergency” at the right end of the button row is text that the user can enter freely, and it is displayed as the button label when the voice data are created. The lower part of FIG. 64 is the area where the voice waveform data are edited. Data can be selected with the mouse, cut, copied, and pasted by voice input, and effects such as echo and pitch change can be applied to the audio data. Editing of audio data and the addition of effects may also be performed with a dedicated audio data editing tool instead of the mail tool. In that case audio data must be transferred to and from the mail tool, but if the transfer is performed by cut and paste using voice input, the editing operation on the audio data remains easy.
[0216]
Cut and paste using voice input can be applied not only to voice data but also to data in various other forms, such as text and graphics, and can be used to pass data between application programs.
[0217]
When replying to a mail using the functions described above, uttering “reply” can automatically copy all or part of the text of the mail just read, add quotation marks to it, append the user's signature and a recorded message, and send the result, so that the user can respond to mail almost without touching the keyboard. The recorded message may be recorded in advance; alternatively, the tool may enter the recording mode automatically and, when “send” is uttered, automatically attach the recorded data and send the mail. For example, FIG. 65 shows the text of a reply to a farewell party notice. In this example, lines 1 to 8 are a copy of the received notice with the quote mark (>>) added, and lines 9 to 11 contain the signature and the mark of the recorded message.
[0218]
Usability is also expected to improve if some or all of the audio data recording / playback / editing functions shown in FIG. 64 are arranged side by side in the received mail display section and the outgoing mail editing section as shown in FIG.
[0219]
All of the recorded data may be used as mail data as they are, but the data may contain unnecessary silent parts caused by hesitation and the like, so that the amount of data becomes larger than necessary.
[0220]
In such a case, it is also possible to automatically detect a silence part and cut a silence part having a certain length, for example, 1 second or more.
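A minimal sketch of such silence removal is given below. It assumes that the recording is a list of samples normalized to the range -1.0 to 1.0 and that a frame counts as silent when its peak amplitude is below a threshold. The function name, the threshold value, and the frame length are illustrative assumptions and are not specified in this description.

```python
def cut_long_silences(samples, rate, min_silence_sec=1.0,
                      threshold=0.01, frame_ms=10):
    """Drop every silent run of at least min_silence_sec seconds.

    samples: list of floats in [-1.0, 1.0]; rate: samples per second.
    Shorter pauses are kept so that the speech still sounds natural.
    """
    frame = max(1, rate * frame_ms // 1000)   # samples per analysis frame
    min_len = int(rate * min_silence_sec)     # run length that gets cut
    kept, run = [], []
    for i in range(0, len(samples), frame):
        block = samples[i:i + frame]
        if max(abs(s) for s in block) < threshold:
            run.extend(block)                 # extend the current silent run
            continue
        if len(run) < min_len:
            kept.extend(run)                  # short pause: keep it
        run = []
        kept.extend(block)
    if len(run) < min_len:
        kept.extend(run)                      # handle a trailing pause
    return kept


speech = [0.5] * 50
audio = speech + [0.0] * 150 + speech + [0.0] * 30 + speech
trimmed = cut_long_silences(audio, rate=100)
```

In this example the 1.5 second pause is removed and the 0.3 second pause is kept.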
[0221]
In addition, due to the movement of the user during recording, the distance between the mouth and the microphone may change, the recording level may not be constant, and the data may be difficult to hear.
[0222]
In such a case, the power of the recording data can be checked to make the level uniform throughout and make it easier to hear. The level equalization process can be realized by obtaining a level for each unit, for example, a word or a sentence, and matching the others with the one having the maximum level.
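The level equalization described above can be sketched as follows, assuming that the recording has already been divided into units (for example words or sentences) and that the level of a unit is measured as its RMS amplitude. Both the segmentation and the choice of RMS as the level measure are assumptions made for illustration.

```python
import math


def rms(unit):
    """Root-mean-square level of one unit (a non-empty list of samples)."""
    return math.sqrt(sum(s * s for s in unit) / len(unit))


def equalize_levels(units):
    """Scale every unit so that its level matches the loudest unit."""
    levels = [rms(u) for u in units]
    target = max(levels)
    result = []
    for unit, level in zip(units, levels):
        gain = target / level if level > 0 else 1.0  # leave pure silence as is
        result.append([s * gain for s in unit])
    return result


quiet = [0.2, -0.2, 0.2, -0.2]
loud = [0.8, -0.8, 0.8, -0.8]
leveled = equalize_levels([quiet, loud])
```

After the call both units have the level of the loudest one; a further overall gain can then be applied if that level is itself too small or too large.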
[0223]
Further, when the entire data or the above-mentioned maximum level is too small or too large, the level of the entire data is changed accordingly, so that it is not difficult to hear.
[0224]
By using the mail tool of this embodiment, it is possible to read out a mail document in which text and voice are mixed.
[0225]
When the mail in the received mail display section of FIG. is read out, the output is as follows:
"Tamura-dono" (speech synthesis)
“Must submit last week's business trip report” (speech synthesis)
(Play emergency button audio data)
"Sawada" (speech synthesis)
As described above, by processing each item in the order in which it appears according to its type (text data are synthesized into speech and voice data are reproduced as they are), mail containing data other than text can also be read out. It is also useful to let the user read out only the text data or only the audio data. For data formats other than text and audio, processing appropriate to the format may be performed (for example, a moving image is played back as a moving image).
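The read-out order described above can be sketched as follows. A mail body is assumed to be available as a list of (type, payload) pairs in order of appearance. The names `plan_readout`, "synthesize", and "play" are hypothetical and do not correspond to actual messages of the voice input / output system.

```python
def plan_readout(parts, include=("text", "audio")):
    """Turn mail parts into an ordered list of output actions.

    parts: list of (kind, payload), kind being "text" or "audio".
    include: kinds to read out, e.g. ("text",) for text only.
    """
    actions = []
    for kind, payload in parts:
        if kind not in include:
            continue                                  # skipped by the user's choice
        if kind == "text":
            actions.append(("synthesize", payload))  # speech synthesis
        elif kind == "audio":
            actions.append(("play", payload))        # reproduce as recorded
    return actions


mail = [
    ("text", "Tamura-dono"),
    ("text", "Must submit last week's business trip report"),
    ("audio", "emergency"),
    ("text", "Sawada"),
]
all_actions = plan_readout(mail)
text_only = plan_readout(mail, include=("text",))
```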
[0226]
The reading out of the mail may be performed not only on the text but also on the header of the mail indicating the title, sender, and transmission / reception time.
[0227]
It is not necessary to read out all mail documents in the same way. For example, by creating a database of mail addresses and synthesized voice attributes as shown in FIG. 67, the voice characteristics used for reading a mail document can be changed for each sender. With the settings of FIG. 67, mail from Mr. Tamura is read by a male voice that speaks slowly, mail from Mr. Nakayama by a high-pitched female voice, and all other mail by a male voice of standard pitch at standard speed.
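Such a database can be sketched as a simple lookup table with a default entry. The mail addresses, the attribute names, and the pitch assigned to Mr. Tamura's voice are assumptions made for illustration; the actual contents of FIG. 67 are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceAttributes:
    gender: str
    pitch: str
    speed: str


# Used for every sender that has no entry of its own.
DEFAULT_VOICE = VoiceAttributes("male", "standard", "standard")

# Hypothetical addresses; the pitch for Tamura is an assumption.
VOICE_DB = {
    "tamura@example.co.jp": VoiceAttributes("male", "standard", "slow"),
    "nakayama@example.co.jp": VoiceAttributes("female", "high", "standard"),
}


def voice_for(sender):
    """Return the synthesis attributes to use for mail from this sender."""
    return VOICE_DB.get(sender.strip().lower(), DEFAULT_VOICE)
```

For example, `voice_for("Nakayama@example.co.jp")` yields the high-pitched female voice, while an unknown sender falls back to `DEFAULT_VOICE`.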
[0228]
Furthermore, the synthesis method may be changed using not only the sender information but also information within a single document. For example, the voice may be switched between male and female, or the pitch and reading speed may be changed, only for the part enclosed in quotation marks.
[0229]
In addition, on the assumption that the recipient will read out the mail using synthesized speech, the sender can specify how the mail is to be read by adding control codes for speech synthesis to the text in the body of the mail. FIG. 76 shows an example of mail containing control codes.
[0230]
In this case, a portion enclosed in @<...> consists of a control code and the text to be read out under its designation. Here male, 5, 5, and 9 indicate the gender (male), voice pitch, speed, and loudness, so that only the part “Do not delay” is read more loudly than the other parts. Enabling such fine-grained speech synthesis settings for parts of the mail body makes it possible to emphasize important passages, to vary the intonation of the text, and to read quoted words in a synthesized voice whose characteristics are closer to those of the person quoted.
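A minimal sketch of how a mail tool might separate such control codes from ordinary text is shown below. The exact syntax used in FIG. 76 is not reproduced in this text, so the form `@<gender pitch speed loudness text>` with space-separated fields is an assumption, as are the function and key names.

```python
import re

# Assumed syntax: @<gender pitch speed loudness text to read>
CONTROL = re.compile(r"@<\s*(\w+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(.*?)>", re.S)


def split_control_codes(body):
    """Split a mail body into (settings, text) segments.

    settings is None for text read with the default voice, otherwise a
    dict holding the values given in the control code.
    """
    segments, pos = [], 0
    for m in CONTROL.finditer(body):
        if m.start() > pos:
            segments.append((None, body[pos:m.start()]))
        gender, pitch, speed, loudness, text = m.groups()
        settings = {"gender": gender, "pitch": int(pitch),
                    "speed": int(speed), "loudness": int(loudness)}
        segments.append((settings, text))
        pos = m.end()
    if pos < len(body):
        segments.append((None, body[pos:]))
    return segments


body = "Please submit the report. @<male 5 5 9 Do not delay> Thank you."
segments = split_control_codes(body)
```

Each segment can then be passed to the speech synthesizer with its own settings, in order of appearance.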
[0231]
Since the mail tool described above can be controlled by voice in a multitasking environment, mail can conveniently be read out by voice while the user creates a document or edits a program with the keyboard or mouse.
[0232]
Not only mail tools but also information retrieval tools, for example electronic dictionaries such as English-Japanese, Japanese-English, and other bilingual dictionaries, and dictionaries of similar words for looking up similar expressions and paraphrases, can be operated by voice through the interface according to the present invention. The user can then look up a document or a word by voice while writing a mail, which is convenient because interruptions to document creation are reduced.
[0233]
When the contents of mail are checked by speech read-out without relying on the display, efficiency may suffer, particularly when searching for a desired mail document among a large number of mails. It is therefore made possible to issue commands to the mail tool while a mail is being read out. It is particularly convenient if these commands can be given by voice input.
[0234]
First, a reading mode is provided so that the unit in which mail is read out can be set. There are three reading modes: full text, paragraph, and sentence. The indication “full text” next to the “read” button at the upper right of FIG. 63 shows the current reading mode. The “read” button starts speech synthesis according to the mode. FIG. 68 shows the voice commands used when reading a mail.
[0235]
The user sets a mode and starts reading a mail by pressing the “read” button or by saying “read”. The voice commands “stop” and “continue” pause and resume reading. “One more time” reads the most recently read unit again. In “previous ...” and “next ...”, the “...” names a reading unit, and the mail tool automatically changes the mode according to the command. For example, if “next sentence” is input while the mode is “full text”, the mode automatically changes to “sentence”. “Next” and “previous” on their own are abbreviations of “next ...” and “previous ...”, and the unit they act on is the unit currently set as the mode. “Fast” and “slow” set the reading speed, “high” and “low” set the pitch of the synthesized voice, and “male” and “female” set its gender.
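The automatic mode change can be sketched as a small command handler. The class name, the returned action names, and the exact command spellings are assumptions made for illustration; the actual command set is that of FIG. 68. The commands for speed, pitch, and gender are omitted.

```python
UNITS = ("full text", "paragraph", "sentence")


class ReadingController:
    """Tracks the reading mode and maps voice commands to actions."""

    def __init__(self, mode="full text"):
        self.mode = mode

    def handle(self, command):
        text = command.strip().lower()
        simple = {"read": "start", "stop": "pause",
                  "continue": "resume", "one more time": "repeat"}
        if text in simple:
            return (simple[text], self.mode)
        direction, _, unit = text.partition(" ")
        if direction not in ("next", "previous"):
            raise ValueError(f"unknown command: {command!r}")
        if unit:
            if unit not in UNITS:
                raise ValueError(f"unknown unit: {unit!r}")
            self.mode = unit      # e.g. "next sentence" switches the mode
        # Bare "next" / "previous" act on the unit currently set as the mode.
        return ("forward" if direction == "next" else "back", self.mode)


controller = ReadingController()
first = controller.handle("next sentence")
second = controller.handle("previous")
```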
[0236]
Thus, enabling the contents of mail to be read out by voice and the read-out to be controlled by voice is considered to improve usability compared with using only the mouse and keyboard. In particular, in a multi-window environment, hearing and voice input can be used to control the voice mail tool while sight and key input are used for a different task (for example, a text editor), so that a single user can control a plurality of tasks simultaneously.
[0237]
The speech synthesis function can be used not only for reading mail documents but also for messages that the mail tool presents to the user. For example, consider a running mail tool that uses synthesized speech for message output in a multi-window environment. First, the mail tool is iconified at start-up. When it receives a new mail, it presents the user with a message such as “New mail has arrived from Mr. XX. There are 5 unread messages.” This message could of course be recorded voice data, but considering the ease of changing the message text and of reading out arbitrary numerical data, synthesized speech is more convenient for the developers of application programs such as mail tools. Rather than always outputting the new mail notification in the same way, it is possible, for example, to set an importance level for each mail and suppress the voice message depending on the importance, to change the message text, as in “An urgent mail has arrived from Mr. XX”, or to change the tone of voice by changing the speech synthesis parameters. The message may also give information about the subject of the mail, as in “The subject is a meeting notification”. By using synthesized speech for the message output of the mail tool in this way, the user can decide whether to read a received mail without looking directly at the mail tool.
[0238]
The new mail reception message interrupts the work that the user is performing on the computer, and whether such an interruption is acceptable depends on the content of the work. For example, during a demonstration of a program, the user may not want to be interrupted by mail. Therefore, an importance level is set for the work and compared with the importance level of the mail; the voice message is output if the mail is more important than the work and is not output otherwise. The importance of work can be set for the entire work environment, for individual programs, or for each subtask within a program.
[0239]
The voice mail system is configured as shown in FIG. 69 in order to compare the importance of work with the importance of mail and determine the mail reception notification method. The mail system 691 is connected to the voice input / output system 692 and the window system 693 through the message input / output unit 6911. Messages from the voice input / output system 692 and the window system 693 are distributed by the message input / output unit 6911 in accordance with the content of the message, and processing is performed at a place where the message is to be processed.
[0240]
The e-mail processing unit 6912 performs transmission / reception of e-mail documents and processing for received mails through an external public line or LAN. The task importance level management table 6913 receives and manages the importance levels of all application programs connected to the voice input / output system from the voice input / output system. The e-mail processing unit 6912 also plays a role of notifying the user of the received mail based on the importance of the task and the importance of the received mail.
[0241]
In order to realize this function, the application program management table of the voice input / output system described in the fourteenth embodiment is expanded, and a task priority is newly set as an item. FIG. 70 shows an extended application program management table. Here, the task priority of the shell tool is set to “2”, and the DTP system is set to “5”.
[0242]
Furthermore, a message shown in FIG. 71 is newly provided as a message for setting a value in this application program management table and reading the value. In addition, a task priority change mask is newly provided as an input mask so that the mail system can receive the notification every time the task priority is changed.
[0243]
By setting the task priority change mask and the input task change mask as input masks, the mail system obtains the task priorities of all application programs connected to the voice input / output system and the presence or absence of voice focus for each. As shown in FIG. 72, this information can be reflected dynamically in the task importance management table. The priority of a mail can be set, for example, by adding header information such as “Preference: 3” to the mail document, so that the importance is set on the mail itself. The e-mail processing unit of the mail system performs the process shown in FIG. 73 every time a mail is received.
[0244]
In this process, it is first checked whether the voice focus is on a single task (step 7801). If YES, the priority of the task having the voice focus is selected; if NO, the average of the priorities of all tasks having the voice focus is calculated and selected. Alternatively, the highest of those priorities may be selected, for example. It is then checked whether the selected priority is lower than the priority of the mail (step 7804). If YES, the user is notified by voice (step 7805); if NO, no notification is made (step 7806). Besides voice, various other methods, such as changing the icon display or using a moving image, can be used to notify the user of mail reception.
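The decision of FIG. 73 can be sketched as follows. The sketch assumes the convention stated for the example screens, namely that a smaller value means a higher importance, so a task is "lower" than the mail when its value is larger. The function names and the behaviour when no task has the voice focus are assumptions made for illustration.

```python
def focused_task_priority(tasks, use_highest=False):
    """Select the priority value that represents the focused tasks.

    tasks: list of (name, priority, has_voice_focus).
    Smaller values mean higher importance (assumed convention).
    """
    focused = [priority for _, priority, has_focus in tasks if has_focus]
    if not focused:
        return None
    if len(focused) == 1:
        return focused[0]
    if use_highest:
        return min(focused)                 # most important focused task
    return sum(focused) / len(focused)      # average of the focused tasks


def should_notify_by_voice(tasks, mail_priority, use_highest=False):
    """True if the mail is more important than the work in progress."""
    task_priority = focused_task_priority(tasks, use_highest)
    if task_priority is None:
        return True   # assumption: nothing is focused, so nothing is interrupted
    return task_priority > mail_priority


shell_focused = [("shell tool", 2, True), ("DTP system", 5, False)]
dtp_focused = [("shell tool", 2, False), ("DTP system", 5, True)]
```

With a mail of importance 3, `should_notify_by_voice(shell_focused, 3)` is false and `should_notify_by_voice(dtp_focused, 3)` is true.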
[0245]
A display example of the screen when, in addition to the mail system, a shell tool and a DTP system are connected to the voice input / output system as application programs is shown in FIG. FIG. 74A shows a screen display example when the task importance management table is in the state shown in FIG. Suppose that a mail of importance 3 is received. According to the process shown in FIG. 73, the importance of the shell tool, which has the voice focus here, is higher than the importance of the mail (the smaller the value, the higher the importance), so the mail system does not notify the user that the mail has been received. On the other hand, when the task importance management table is in the state of FIG. 75 (the corresponding screen display example is FIG. 74 (b)) and a mail of importance 3 is received, the mail system outputs the voice message “New mail received” and notifies the user of the mail reception. At the time of the notification, the mail system can also set the voice focus on itself, thereby interrupting the user's work and prompting the user to use the mail system.
[0246]
As described above, by changing the notification of newly received mail according to the importance of the mail and the importance of the work, the user can be provided with a flexible interface that does not obstruct the user's work.
[0247]
(Sixteenth embodiment)
In the fifteenth embodiment, the read-out function reads part or all of a received mail using synthesized speech exactly as written, without any change to the text. This approach causes few problems when mails are few and generally short, but as mails become more numerous and longer, this function alone is insufficient.
[0248]
FIG. 77 shows a schematic configuration of a voice mail system. A voice mail system 822 connected to the voice input / output system 821 is composed of an e-mail processing unit 8221, a document summarizing unit 8222, and a message input / output unit 8223. The document summarizing unit 8222 may instead be provided outside the voice mail system 822 as shown in FIG.
[0249]
Here, the mail system 822 is connected to the voice input / output system 821 and uses its voice input / output function. The e-mail processing unit 8221 performs transmission / reception of e-mail documents and processing for received mail via an external public line or LAN. The document summarizing unit 8222 is a system that summarizes documents such as electronic mail. Known techniques for summarizing text include “Ishibashi et al., English Summarization System ‘DIET’, Information Processing Society of Japan 48th National Convention, 6D-9 (1989)” and “Kita, Explanation Summarization System, Information Processing Society of Japan, Natural Language Processing, 63-3 (1987)”, and the document summarizing unit can be constructed by applying these techniques.
[0250]
The document summarizing unit 8222 receives a mail document from the e-mail processing unit 8221, summarizes it, and returns the result. The e-mail processing unit 8221 decides whether to summarize the mail document, and by which method, according to the importance of the received mail, the length of the document, the contents of the document, and the like, and sends the mail to the document summarizing unit together with that information. Each time it receives a mail, the e-mail processing unit 8221 performs a process such as that shown in FIG. 79 to determine the summarization method for the received mail.
[0251]
In this process, it is determined whether the importance of the mail is “3” or more (step 8401); if so, no summarization is performed (step 8402). If the importance is less than “3”, it is checked whether “urgent” is included in the mail (step 8403). If “urgent” is included, it is checked whether the document is long (step 8404); if the document is not long, it is not summarized (step 8402), and if it is long, it is summarized (step 8405). If “urgent” is not included in the body of the mail, only the first line is summarized (step 8406). Summary processing according to the decision is then performed (step 8407).
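The branching of steps 8401 to 8406 can be sketched as a single decision function. The criterion for a "long" document is not specified in this description, so the line-count threshold used below is a hypothetical parameter, as are the function name and the returned labels.

```python
def choose_summary_method(importance, body, long_threshold=20):
    """Return "none", "summarize", or "first_line" for a received mail.

    importance: importance value of the mail.
    long_threshold: hypothetical line count above which a mail is "long".
    """
    if importance >= 3:                       # step 8401
        return "none"                         # step 8402
    if "urgent" in body.lower():              # step 8403
        is_long = len(body.splitlines()) > long_threshold   # step 8404
        return "summarize" if is_long else "none"           # 8405 / 8402
    return "first_line"                       # step 8406


short_urgent = "urgent: please call me today"
long_urgent = "\n".join(["urgent matter, details follow"] * 30)
ordinary = "Hello\nThe meeting room has changed."
```

The returned label would then be sent to the document summarizing unit together with the mail (step 8407).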
[0252]
A document such as e-mail may be incomplete or too short and therefore unsuitable for summarization. For mail documents that could not be summarized (or need not be), reading out, for example, only the first few lines can itself be regarded as a kind of summarization. Summarization may be invoked by the user, for example with a voice command such as “summary”, or the mail system may automatically summarize all incoming mail (or only long mail).
[0253]
Giving the voice mail tool a mail document summarization function in this way makes mail processing more efficient, which is convenient for users who are busy or who must process a large amount of mail.
[0254]
(Seventeenth embodiment)
In the fifteenth and sixteenth embodiments, the use of voice recognition and synthesis functions provided by the voice input / output system has been described using a voice mail tool.
[0255]
These embodiments provide information to the user through both the GUI and audio output, but the functions described in the fifteenth and sixteenth embodiments are even more useful in an environment where a GUI is not available, such as a telephone interface. In this embodiment, a voice input / output interface over the telephone that does not use a GUI is described, taking a voice mail system as an example.
[0256]
FIG. 80 shows a schematic configuration of the seventeenth embodiment. In this case, the mail address table 853 is connected to the voice mail system 852 connected to the voice recognition system 851.
[0257]
In this case, the voice input / output system 851 is connected to a telephone line, but the connection with the telephone line is possible using existing technology, and is not described here. It is assumed that input to the voice mail system 852 from the telephone can be performed by voice and a push button.
[0258]
Since e-mail is personal information, a user authentication procedure is required before the contents of the e-mail can be checked by telephone. This can be done with the push buttons of the telephone, by speech recognition of a password, or by speaker verification technology.
[0259]
After the user has been confirmed by the authentication procedure, the mail is accessed interactively using voice recognition. The voice mail system 852 described here can use all the voice recognition and voice synthesis functions described in the fifteenth and sixteenth embodiments. That is, all or part of a mail, or its summarized contents, can be checked by voice input. The voice mail system 852 is basically operated by voice, so mail is also sent by voice. Because it is not realistic to input the content of a mail with push buttons over a telephone interface, the content of the mail itself is also voice. A voice mail document can be created by performing voice recognition and voice recording simultaneously. The configuration of FIG. 80 does not preclude simultaneous recognition and recording. FIG. 81 shows an example of creating a mail document by voice. The scene is one in which the user replies to a mail after checking the content of the received mail by voice.
[0260]
First, the user's utterance “recording start” is recognized in (1), and the mail system records the user's following speech (2), “Please tell me!”, as the mail document. The final “stop, stop” in (2) is the instruction to stop recording. “Stop” is repeated in order to distinguish “stop” as a command from “stop” occurring in the mail text. The whole phrase “stop, stop” may be registered as a recognition target word. The mail system cuts the “stop, stop” section from the recorded data. The user checks the content (4) of the mail document with “content confirmation” in (3) and sends the mail with “transmission” in (5). Finally, the user confirms from message (6) that the mail has been sent.
[0261]
When the user records data in (2), if the beginning of the voice data is detected by the voice detection unit in the voice recognition unit of the voice recognition system, the silent interval need not be recorded even if there is a pause between “recording start” and the start of the message.
[0262]
Alternatively, the end of recording may be specified by saying “send” instead of words such as “stop, stop”; when “send” is recognized, the recorded content is automatically sent as mail data. This eliminates the need to say “stop” to end recording and makes sending mail simpler. In this case, the recorded content can be played back automatically so that the user can check the content of the mail being sent without uttering “content confirmation” or the like.
[0263]
Further, if exactly one voice section is recorded after “recording start”, a recording stop command such as “stop, stop” becomes unnecessary. If the end of the voice section is detected with a margin, for example “if silence continues for 3 seconds, the voice data input is regarded as finished”, the constraint that the user must utter the message in a single breath is relaxed.
[0264]
To detect a voice section as data in this way, the message shown in FIG. 82 is added to the messages exchanged between the application program and the voice recognition system. This voice section detection message is a round-trip message, and the voice section data can be cut out from the input voice by the procedure shown in FIG. The parameters of the voice section detection message are the time used to detect the end of the voice (for example, if silence continues for 3 seconds, the section before the silence is regarded as the voice section) and a timeout for the case where no voice is input (for example, if no voice section has been detected 30 seconds after the request was sent, it is considered that there was none).
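The two parameters can be illustrated with a minimal sketch of voice section detection. It assumes that the input has already been classified frame by frame into speech and silence; the function name and the frame representation are hypothetical and do not describe the message format of FIG. 82.

```python
def detect_voice_section(frames, frame_sec,
                         end_silence_sec=3.0, timeout_sec=30.0):
    """Return (start, end) frame indices of the voice section, end exclusive.

    frames: sequence of booleans, True where the frame contains speech.
    Returns None if no speech starts within timeout_sec, or if the input
    ends before the end of the section has been confirmed.
    """
    end_frames = round(end_silence_sec / frame_sec)
    timeout_frames = round(timeout_sec / frame_sec)
    start = None
    silent_run = 0
    for i, is_speech in enumerate(frames):
        if start is None:
            if is_speech:
                start = i                      # beginning of the voice section
            elif i + 1 >= timeout_frames:
                return None                    # timeout: no input voice
        elif is_speech:
            silent_run = 0                     # a short pause has ended
        else:
            silent_run += 1
            if silent_run >= end_frames:       # silence long enough: section ends
                return (start, i + 1 - silent_run)
    return None


# 0.5 s frames: 2 s silence, speech with a 1 s pause, then 3 s of silence.
frames = [False] * 4 + [True] * 5 + [False] * 2 + [True] * 3 + [False] * 6
section = detect_voice_section(frames, frame_sec=0.5)
```

The 1 second pause inside the utterance does not end the section, which is what relieves the user from speaking in a single breath.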
[0265]
As for the subject of a mail document, when replying to a received mail, in UNIX mail terms, a reply to “Subject: hello” can be given the subject “Subject: re: hello”. When a new mail is created from the telephone, however, a subject cannot be given in this way, so speech recognition is combined to make this possible. An example is shown in FIG.
[0266]
When the mail system recognizes the user's utterance (1) “subject”, it enters the subject input mode. In this mode, predetermined subject words are the recognition target vocabulary, for example “Hello”, “News”, “Please contact me urgently”, “cheers for hard work”, and “meeting notification”. In the example of FIG. 84, (2) “meeting notification” is input. When the mail system recognizes “meeting notification”, the text “Subject: meeting notification” is inserted into the mail document (3), and a confirmation message such as (4) is output by synthesized speech.
[0267]
The recognition result in the subject input mode can be used not only to insert the mail title but also, for example, to insert a standard mail document. FIG. 85 shows an example of a standard mail inserted as the body of the mail in response to the input “Gokuro-sama” (“cheers for hard work”). {receiver} and {sender} in the document are variables to which the receiver and the sender are assigned. With these variables, anyone can send a mail with the same text by voice alone. It is convenient if standard mails are stored in a database and can be called up by voice.
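The substitution of the receiver and sender variables can be sketched with an ordinary string template. The template text below is invented for illustration and is not the text of FIG. 85; the dictionary and function names are likewise hypothetical.

```python
# Hypothetical database of standard mails, keyed by the recognized word.
TEMPLATES = {
    "cheers for hard work": (
        "Dear {receiver},\n\n"
        "Thank you for your hard work.\n\n"
        "{sender}\n"
    ),
}


def build_mail(recognized_word, receiver, sender):
    """Insert the subject line and the standard body for a recognized word."""
    body = TEMPLATES[recognized_word].format(receiver=receiver, sender=sender)
    return f"Subject: {recognized_word}\n\n{body}"


mail_text = build_mail("cheers for hard work", receiver="Suzuki", sender="Tamura")
```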
[0268]
In the fifteenth embodiment, voice data can be added or inserted at any position in a mail document. In the subject input mode, voice data can likewise be attached to the subject itself. If, for example, the voice subject is output when the mail is received, the sender and the content of the mail are considered to be conveyed more readily to the receiver. Subject insertion by voice recognition and recording of a voice subject may of course be performed simultaneously.
[0269]
When sending a mail that is not a reply to a received mail, voice recognition is used to specify the destination from the telephone. For this purpose, words are registered in advance using the learning function, and each recognition target word is linked to a mail address. For example, an address book such as that shown in FIG. 86 is provided in the mail system, and a mail address is linked to a spoken word by the mail address registration function shown in FIG.
The registration procedure is as follows:
(1) Open the mail address book (FIG. 86).
(2) Open the registration window (FIG. 87) and start registering a new mail address.
(3) Enter the name and address using the keyboard.
(4) Say the new word (“Suzuki” in this example) the number of times required for learning (several to several tens of times).
(5) Press the OK button to complete registration.
[0270]
In this way, the recognition target word (“Suzuki”) and the mail address (Suzuki@aaa.bbb.ccc.co.jp) are linked and can be used from the telephone, for example by the procedure shown in FIG. 88. First, in (1), when the user utters “destination” and it is recognized, the mail system outputs message (2) by voice to prompt the user. In (3), the vocabulary registered by the above procedure is the recognition target. In this example, when “Suzuki” is recognized, “to: Suzuki@aaa.bbb.ccc.co.jp” is inserted.
[0271]
(4) and (5) show how the mail address is confirmed. For the “Suzuki” voice in (4), for example, one of the utterances used for registration in FIG. 87 can be recorded automatically and used to confirm the recognition result.
[0272]
“Suzuki @...” in (4) is an example in which confirmation is performed by reading out the address letter by letter in synthesized speech.
[0273]
With this method, a mail address can be designated by voice only if it has been registered in advance. However, as described below, a mail address that has not been registered in advance can also be designated by voice. For this purpose, a function is first added that automatically creates a database of mail addresses from the mails the user has received in the past. In UNIX mail, the mail address is included in the mail header, so creating a database from it is not difficult. A mail address has, for example, the structure
user name @ department name . organization name . organization classification . country classification
so a database with a tree-like hierarchical structure can be created by taking the elements of the mail address in reverse order (country → user name).
[0274]
The mail system traces the mail address in order from the country classification, reading out the choices by synthesized speech as shown in FIG. In the example of FIG. 89, when a wrong node (one segment of a mail address being traced in order) is selected, the user can return to the previous (higher) node with a word such as “cancel”. Similar vocabulary can also be used to abort address entry altogether. In addition, by associating a recognition vocabulary with an arbitrary node in advance, the user can jump directly to, for example, the mail address node of a company by uttering the company name.
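The node-by-node tracing, the return to a higher node, and the jump to a pre-registered node might look like the following sketch. The class AddressNavigator, the word “cancel” as the only return word, the shortcut word “acme”, and the sample tree are hypothetical and serve only to illustrate the idea.

```python
class AddressNavigator:
    """Trace a country-first address tree one recognized word at a time."""

    def __init__(self, tree, shortcuts=None):
        self.tree = tree
        self.path = []                    # labels chosen so far, country first
        self.shortcuts = shortcuts or {}  # spoken word -> path to jump to

    def _node(self):
        node = self.tree
        for label in self.path:
            node = node[label]
        return node

    def choices(self):
        """Vocabulary to recognize at the current node."""
        return sorted(self._node())

    def hear(self, word):
        if word == "cancel":              # go back to the previous (higher) node
            if self.path:
                self.path.pop()
        elif word in self.shortcuts:      # jump directly to a registered node
            self.path = list(self.shortcuts[word])
        elif word in self._node():        # descend one level
            self.path.append(word)
        return self.address()

    def address(self):
        """Return the full address once a leaf (user name) is reached, else None."""
        if self.path and not self._node():
            *domain, user = self.path
            return user + "@" + ".".join(reversed(domain))
        return None


# Hypothetical tree; leaves (user names) are represented by empty dicts.
tree = {"jp": {"co": {"acme": {"sales": {"suzuki": {}}}},
               "ac": {"univ": {"lab": {"sato": {}}}}}}
nav = AddressNavigator(tree, shortcuts={"acme": ["jp", "co", "acme"]})
```

Unrecognized words are simply ignored in this sketch; a real system would presumably ask the user to repeat the utterance.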
[0275]
With this method, a mail address can be designated by voice as long as it belongs to someone who has sent the user mail in the past.
[0276]
In addition, speech recognition systems based on phoneme recognition, which do not require word-based recognition dictionaries, have been widely researched. By using such a system, even an address that does not appear in any previously received e-mail can be entered by speech, and the mail can then be sent.
[0277]
(Eighteenth embodiment)
The speech recognition interfaces described in the first and fourteenth embodiments of the present invention provided speech recognition and speech synthesis services to application programs developed specifically for the speech recognition system or the speech input/output system. In this embodiment, in addition to voice control of such dedicated programs, the voice recognition interface is extended so that any application program that cannot directly exchange messages with the voice recognition system or the voice input/output system can also be controlled by voice. As a result, both the fields of application of voice recognition and its user base can be expanded. In the present embodiment, an example in which this extension is applied to the fourteenth embodiment will be described, but the same extension can obviously be applied to the first embodiment.
[0278]
Hereinafter, this embodiment will be described.
FIG. 90 shows the overall configuration of the voice input/output interface of the present embodiment. It comprises the same voice input/output system 1 as described in the fourteenth embodiment and a voice interface management system (hereinafter referred to as SIM) 104, which is connected as an application program to the message processing unit 11 (not shown) of that system.
[0279]
A general-purpose application program (hereinafter referred to as GAP) 103 is an application program that is not directly connected to the voice input / output system 1 and is a program that can operate completely independent of the voice input / output system 1. On the other hand, a dedicated application program (hereinafter referred to as SAP) 102 operates by directly connecting to the voice input / output system 1.
[0280]
The SIM 104 is one of the SAPs; it is an application program that enables voice operation of the GAP 103 by mediating between the voice input/output system 1 and the GAP 103. The SIM 104 also displays the voice focus. The SAP 102 corresponds to the application program 2 in FIG. A plurality of SAPs and GAPs can exist for one voice input/output system.
[0281]
Next, operation of the GAP 103 through the SIM 104 will be described. Unlike the SAP 102, the GAP 103 is not directly connected to the voice input/output system, and the GAP 103 accepts input only from non-voice input devices such as a keyboard or mouse. Therefore, to realize voice operation of the GAP 103, the SIM 104 converts voice input into input that the GAP 103 can accept, such as keyboard input or mouse input.
In this embodiment, the SIM 104 includes a voice interface management unit 141, a program operation registration unit 142, and a message conversion unit 143, as shown in FIG. The voice interface management unit 141 holds, for each application program, a correspondence table between voice recognition results and operations. The information in this correspondence table (hereinafter referred to as the voice interface management table) is registered by the program operation registration unit 142. The message conversion unit 143, which is directly connected to the message processing unit 11, includes the function of exchanging messages with the voice input/output system 1, that is, the function of the message input/output unit 21 of FIG. When it receives a recognition result, it refers to the voice interface management table, converts the recognition result into an operation command for the GAP 103, and transmits the command to the GAP 103.
[0282]
In order to send an operation command from the SIM 104 to the GAP 103, the GAP 103 itself must provide means for operation from another application.
[0283]
If the application uses a window system, the SIM 104 sends the GAP 103 the same message that is generated when an operation command is input to the GAP 103 from an input device such as a keyboard or mouse. Such message transmission can be implemented easily with the functions in the library provided by each window system, such as the X Window System. In practice, in a window system the destination of a message may be not the GAP 103 itself but an object, such as a window, created by the GAP 103. In some cases the identifier of that object is needed to send the message, but the identifier of the destination object is easy to determine from the contents of the program operation registration, described later, and from identifier information obtained by querying the window system.
[0284]
Next, a specific example will be described. As shown in FIG. 91, assume that the voice interface management system 104 and the mail tool 120 operate while directly connected to one voice input/output system 1, and that the shell tool 130 and the editor 131, which are GAPs that cannot be connected directly to the voice input/output system 1, are operating in parallel. The screen display at this time can be as shown in FIG. 92, for example.
[0285]
An example of the voice interface management table of the SIM 104 in this case is shown in FIG. “Program name” in this table is a vocabulary to be recognized, and the pseudo voice focus for the application program can be switched by the user uttering the program name. The “application program” is an identifier of the application program itself and represents a command transmission target.
[0286]
The pseudo voice focus mentioned above is a voice focus that is set for an application program in a simulated manner. A GAP is not directly connected to the voice input/output system 1, so the voice input/output system 1 is unaware of its existence and no real voice focus is set for it. When the SIM 104 receives the name of a GAP such as “shell tool” or “editor” as a recognition result, it requests the voice input/output system to set the command names defined for that program as the recognition target vocabulary (for example, “LS” and “process” in the case of “shell tool”). The voice focus display shown in FIG. 12, FIG. 19, etc. is then applied to that program.
[0287]
As shown in FIG. 94, the true voice focus for the GAP 103 is set on the SIM 104, and what is actually displayed on the screen is the pseudo voice focus. The SIM 104 switches the recognition context, using recognition of a program name as the trigger. For an SAP such as the mail tool, the pseudo voice focus and the true voice focus coincide.
[0288]
The command names of the SIM and the GAPs have the local attribute with respect to the SIM. That is, they become recognition targets when the voice focus is set on the SIM. Since the voice focus is not set on the SIM 104 when a command is transmitted to an SAP, the command names of the SAP 102 have the global attribute. For example, the attribute of the command name “end” of the mail tool in FIG. 93 is global. In FIG. 93, the attributes of the recognition target vocabulary, such as local and global, are shown in parentheses in the program name and recognition target vocabulary columns. As attribute values, “0” denotes local and “1” denotes global.
[0289]
An example of the processing procedure of the message conversion unit 143 is shown in FIG. When the recognition result received from the message processing unit 11 of the voice input/output system 1 is a program name, the command names associated with the immediately preceding pseudo focus are removed from the recognition targets (step 9003), the pseudo focus is set on the application program having the recognized program name (step 9004), and the command names of that application program are added to the recognition targets (step 9005).
[0290]
On the other hand, if the received recognition result is not the program name (step 9002), the command corresponding to the command name is transmitted to the application program in which the pseudo focus is set (step 9006).
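The procedure of steps 9002 to 9006 can be sketched as follows. This is only an illustration under stated assumptions: the class MessageConverter, the send_command callback, and the key sequences in the table are hypothetical stand-ins for the voice interface management table and for the window-system messages actually sent to a GAP.

```python
class MessageConverter:
    """Switch the pseudo focus on program names; forward commands otherwise."""

    def __init__(self, table, send_command):
        self.table = table                # program name -> {command name: operation}
        self.send_command = send_command  # hypothetical callback: (program, operation)
        self.pseudo_focus = None
        self.vocabulary = set(table)      # program names are recognizable from the start

    def on_recognition(self, word):
        if word in self.table:            # step 9002: the result is a program name
            if self.pseudo_focus is not None:
                # step 9003: drop the commands of the previously focused program
                self.vocabulary -= set(self.table[self.pseudo_focus])
            self.pseudo_focus = word      # step 9004: set the pseudo focus
            # step 9005: add the commands of the newly focused program
            self.vocabulary |= set(self.table[word])
        elif self.pseudo_focus is not None and word in self.table[self.pseudo_focus]:
            # step 9006: send the corresponding operation to the focused program
            self.send_command(self.pseudo_focus, self.table[self.pseudo_focus][word])


# Hypothetical table; the operations stand in for key or mouse event sequences.
table = {
    "shell tool": {"LS": "ls\n", "process": "ps\n"},
    "editor": {"cut": "<cut keys>", "copy": "<copy keys>", "paste": "<paste keys>"},
}
sent = []
converter = MessageConverter(table, lambda program, op: sent.append((program, op)))
```

For example, feeding the converter “shell tool”, “LS”, and then “editor” sends one operation to the shell tool and leaves the editor commands as the active vocabulary.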
[0291]
As described above, the configuration of the present embodiment makes speech recognition usable even for existing application programs (GAPs) that were not designed for speech input (recognition), which broadens the user base and improves usability.
[0292]
(Nineteenth embodiment)
Under a system having a window-based GUI (graphical user interface), one program can be configured using a plurality of windows. In the present embodiment, an example will be described in which the system is extended to enable voice input to individual windows of an application program having a plurality of windows based on the eighteenth embodiment. This makes it possible to use finer speech recognition and improve operability.
[0293]
In the embodiments described so far, the unit in which the voice focus can be set by the voice input / output system 1 is “application program”, but in this embodiment, the unit is “voice window”. A plurality of voice windows can be created in an application program, and each voice window has a voice window name, an input mask, and a recognition target vocabulary set.
[0294]
FIG. 96 shows the voice input/output system 1 described in the fourteenth embodiment (see FIG. 50), extended so that voice windows can be handled. Here, the application program management table 13 of FIG. 96 is extended as will be described later. Further, a voice window 23 is added to the application program 2, and the entity of the voice window 23 resides in the application program management table 13 of the voice input/output system 1.
[0295]
Hereinafter, a specific example will be described. As in the eighteenth embodiment, it is assumed that four application programs are operating: the SIM (104), the shell tool, the editor, and the mail tool. Of these, the SIM and the mail tool are SAPs, and the shell tool and the editor are GAPs. As shown in FIG. 97, assume that the shell tool and the editor each consist of two windows, and the others of one window each. FIG. 98 shows the configuration of the entire voice input/output interface in this case. The mail tool 120, which is a dedicated program (SAP), has its own voice window 223, and the SIM 104 has its own voice window 0 (144₀) and voice windows 1 to 4 (144₁ to 144₄) for the general-purpose programs. Unlike the windows of a window system (not shown) or OS (not shown) such as those shown in FIG. 97, a voice window has no visual attributes. Windows in a window system usually form a tree structure, and that structure and changes in the internal state of the window system can be observed from within an application program. The SIM 104 accesses the information of the window system and of the voice input/output system 1, links windows with voice windows so that they operate in a coordinated manner, and provides a unified user interface. A window and a voice window can be linked by giving them a unique, identical attribute such as a window name, or interactively by using the program operation registration unit 142.
[0296]
A voice window has a window name, a recognition target vocabulary, an input mask, and the like as its attributes. As attributes of recognition target vocabulary such as window names and command names, “window” is provided in addition to “local” and “global”. A vocabulary with the local attribute becomes a recognition target when the voice focus is set on the voice window to which it belongs. A vocabulary with the global attribute is always a recognition target, regardless of where the voice focus is set. A vocabulary with the window attribute becomes a recognition target, even when the voice focus is not set on the voice window to which it belongs, as long as the voice focus is set on a voice window belonging to the same application program.
[0297]
It is also possible to group a plurality of voice windows, merge their recognition vocabularies, and automatically send each recognition result to the voice window to which the recognized word belongs. For example, when the application program management table is in the state shown in FIG. 102, the shell tool and the editor can be grouped so that “LS”, “process”, “cut”, “copy”, and “paste” are recognized at the same time. When “LS” or “process” is recognized, the recognition result is sent to the shell tool, and when “cut”, “copy”, or “paste” is recognized, the recognition result is sent to the editor.
[0298]
This makes it unnecessary to move the voice focus between the shell tool and the editor, so both can be operated efficiently. If the same word exists in more than one voice window, the recognition result may be sent to all of the voice windows that have that word, or priority may be given to the voice window that has the voice focus. Whether or not grouping is performed can be determined from the grouping ID attribute in the application program management table of FIG.
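Grouping and result routing can be illustrated with the short sketch below. The field names (name, group, vocabulary), the function names, and the shared word “end” are assumptions made for the example; they are not taken from the application program management table itself.

```python
def merged_vocabulary(windows, group_id):
    """Map each word to the voice windows in the group that registered it."""
    owners = {}
    for window in windows:
        if window["group"] == group_id:
            for word in window["vocabulary"]:
                owners.setdefault(word, []).append(window["name"])
    return owners


def route(word, owners, focus=None, prefer_focus=True):
    """Return the voice windows that should receive a recognized word."""
    targets = owners.get(word, [])
    if prefer_focus and focus in targets:
        return [focus]          # the window holding the voice focus wins
    return targets              # otherwise every window that has the word


# Hypothetical grouped windows; "end" is deliberately shared by both.
windows = [
    {"name": "shell tool", "group": 1, "vocabulary": ["LS", "process", "end"]},
    {"name": "editor", "group": 1, "vocabulary": ["cut", "copy", "paste", "end"]},
]
owners = merged_vocabulary(windows, group_id=1)
```

The prefer_focus flag selects between the two policies mentioned in the text: delivery to every voice window that has the word, or priority for the voice window with the voice focus.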
[0299]
As one way of grouping voice windows, a parent-child relationship can be introduced among voice windows, and a parent window and its child window can be grouped so that both vocabularies are recognized at the same time. For example, when the application program management table is in the state shown in FIG. 102, the setting window of the shell tool is grouped with its parent, the shell tool window. Then, when the voice focus is on the setting window, recognition is performed using the merged vocabulary of both.
[0300]
As a result, while the voice focus is on the child voice window, voice input to the parent window can be performed without moving the voice focus, which improves work efficiency. If the parent window and the child window have the same word, the recognition result can be sent preferentially to the child window, which has the voice focus.
[0301]
In the state of FIG. 98, the voice interface management table in the voice interface management unit 141 of the SIM 104 is as shown in FIG. Compared with the table of FIG. 93, a window ID is added and the program name is replaced by a window name. The window ID is the identifier of a window in the window system (see FIG. 97). As shown in FIG. 99, window IDs and voice window IDs correspond one to one, and the SIM 104 links windows and voice windows using this table. In this example, when “shell tool” is recognized, the SIM 104 sets the voice focus on the voice window with ID = 1 and changes the display of the window with ID = 101 to the state indicating that the voice focus is set, as shown in FIG.
[0302]
Depending on the window system and OS, it may not be possible to change the display of another application program's window. In that case, the location of the voice focus is indicated by attaching a separate, independent window to the window of the other application program, as shown by the hatched portion w1 in FIG. An example of the display of this external window is shown in FIG. As shown in the figure, a display (window) indicating the voice focus is placed at the top of the application program. This window may be placed anywhere, and any number of such windows may be used, as long as the voice focus is clearly indicated. Using moving images instead of still images makes the position of the voice focus even easier to recognize.
[0303]
Here, the application program management table 13 of the voice input/output system 1 shown in FIG. 18 is extended as shown in FIG. A voice window ID and a window name are added as new fields. The voice window ID is the identifier of a voice window on which the voice focus is set, and the window name is its name. The attributes of the recognition target vocabulary, such as local and global, are shown in parentheses in the window name and recognition target vocabulary columns. As attribute values, “0” denotes local, “1” denotes window, and “2” denotes global. When the voice input/output interface is configured as in FIG. 98, the application program management table 13 of the voice input/output system 1 is in the state shown in FIG. 102, and the voice interface management table of the voice interface management system 104 is in the state shown in FIG. At this time, because of the pseudo voice focus, it appears to the user that the voice focus is set on “shell tool” (window ID = 101). The true voice focus, on the other hand, is set on the voice window (ID = 1) associated with that window (ID = 101), and this voice window belongs to the SIM 104. The vocabulary that can be recognized in this state is, for example, “LS”, “process”, “shell tool”, “editor”, “mail tool”, “system”, and “setting”.
[0304]
In the above configuration, the speech input / output system 1 performs recognition processing, and the recognition result is sent to the speech window in which each vocabulary is set. FIG. 103 shows an example of the procedure of this recognition process.
[0305]
First, when the voice focus is set on window (0), the vocabulary set for window (0) is added to the recognition vocabulary list (step 9103). When the voice focus is not set on it but window (0) belongs to the same application program as the voice window on which the voice focus is set, the vocabulary of window (0) with attribute value “1” is added to the recognition vocabulary list (step 9105); if it does not belong to the same application program, the vocabulary with attribute value “2” is added to the recognition vocabulary list (step 9106).
[0306]
The above processing is performed for all other windows including the window (1).
[0307]
Then, recognition processing is performed (step 9108). If the top-ranked recognition result is a window name, the voice focus is set on the window in which that top-ranked word is registered (step 9110). The recognition result is then transmitted to the window in which the top-ranked word is registered (step 9111).
[0308]
For example, in FIG. 102 there are two voice windows (ID = 2 and ID = 4) in which “setting”, one of the recognizable words, is registered. Because the attribute of the word is “1” (= window) in both, “setting” recognized in the state described above is sent to the voice window ID = 2, whereas “setting” recognized while the voice focus is set on the voice window ID = 3 is sent to the voice window ID = 4. When a window name is recognized, the voice input/output system 1 can either simply send the recognition result to the voice window to which the window name belongs, or set the voice focus on that voice window without sending the result.
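The selection of recognizable vocabulary and the destination of “setting” can be illustrated as follows. The table contents are a hypothetical reconstruction, not the actual contents of FIG. 102: the app field, the voice window ID 5 for the mail tool, and the words “cut” and “send” are assumptions. The sketch also assumes that global vocabulary of a window in the same application program stays recognizable, in line with the statement that global vocabulary is always a recognition target.

```python
LOCAL, WINDOW, GLOBAL = 0, 1, 2   # attribute values as given in the text

# Hypothetical management table; "app" marks which program a voice window serves.
WINDOWS = [
    {"id": 0, "app": "SIM", "vocabulary": {"system": GLOBAL}},
    {"id": 1, "app": "shell", "vocabulary": {"shell tool": GLOBAL, "LS": LOCAL, "process": LOCAL}},
    {"id": 2, "app": "shell", "vocabulary": {"setting": WINDOW}},
    {"id": 3, "app": "editor", "vocabulary": {"editor": GLOBAL, "cut": LOCAL}},
    {"id": 4, "app": "editor", "vocabulary": {"setting": WINDOW}},
    {"id": 5, "app": "mail", "vocabulary": {"mail tool": GLOBAL, "send": LOCAL}},
]


def recognizable(windows, focus_id):
    """Return {word: destination voice window ID} for the current voice focus."""
    focus_app = next(w["app"] for w in windows if w["id"] == focus_id)
    active = {}
    for w in windows:
        if w["id"] == focus_id:
            allowed = {LOCAL, WINDOW, GLOBAL}   # focused window: all of its vocabulary
        elif w["app"] == focus_app:
            allowed = {WINDOW, GLOBAL}          # same program: window (and global) words
        else:
            allowed = {GLOBAL}                  # other programs: global words only
        for word, attr in w["vocabulary"].items():
            if attr in allowed:
                active.setdefault(word, w["id"])
    return active
```

With the voice focus on voice window 1 the word “setting” resolves to voice window 2, and with the focus on voice window 3 it resolves to voice window 4, which is the behavior described in the text.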
[0309]
As described above, by giving the recognition target vocabulary a window attribute, it is possible to give the same name to the windows of a plurality of application programs and operate them. This embodiment greatly improves the usability as a voice recognition interface.
[0310]
(Twentieth embodiment)
As described in the eighteenth and nineteenth embodiments, having the voice interface management system 104 convert and forward messages from the voice recognition system makes voice input possible even for existing application programs that have no means of communicating directly with the voice input/output interface.
[0311]
When the voice input/output interface of the present invention is applied to an existing application program, the correspondence between the operations of the existing program and the vocabulary that triggers them must be established separately, unlike with an application program written specifically for the voice input/output interface. In this embodiment, the registration of program operations, which establishes the correspondence between “vocabulary” and “program operation”, will be described.
[0312]
In program operation registration, the program name or window name used to move the voice focus to the target application program is registered, and the key input or mouse input event sequences used to operate the existing application program are associated with vocabulary. For example, when two shell tool windows are used, “shell 1” and “shell 2” are assigned as window names, and for an operation performed in the shell tool, such as clearing all characters on the screen, the word “clear” is assigned to and registered with the key input sequence that executes the clear command.
[0313]
Normally, a general application program does not give names to the windows it displays. Therefore, to designate a window by name, the window must be given a name, and it must be possible to identify the window from the window name in the voice interface management table. For this reason, as shown in FIG. 99 of the nineteenth embodiment, the voice interface management table has fields for storing the window ID, which is the identifier of the window in the window system, and the window name. With this table, for example, when “editor” is sent as a recognition result, the voice interface management unit 141 sets the pseudo voice focus on the window whose window ID is 103. The window ID is obtained by accessing information held by the window system (not shown). For example, it can be obtained by querying the window system server (not shown) for information on the window structure, but the window name cannot always be obtained at the same time. One way to obtain the window ID and the window name together is to start the program with the window name specified. However, for a window such as a pop-up that an already running program newly creates, it is difficult to assign a name before startup. In such a case, the window name can be assigned by clicking the window with the mouse to obtain its window ID and then associating a window name with that ID. The ID of the window that was clicked can easily be obtained by querying the window system server.
[0314]
Next, a method for registering names and program operations for windows will be described below.
FIG. 104 shows the configuration of the program operation registration unit 142. The program operation registration unit 142 includes a program operation display editing unit 151 that displays the registration contents on the screen and accepts input from the user, a registration contents storage unit 152 that stores the registration contents in the file 200, and a window ID acquisition unit 153 that acquires window IDs from the window system.
[0315]
For example, the program operation display editing unit 151 displays a registration screen as shown in FIG. 105, accepts input of a window name, a program operation, a word name, and the like, and writes the registered contents into the voice interface management table in the voice interface management unit 141. The registration contents storage unit 152 stores the registered program operations in the file 200. The window ID can easily be obtained by querying the window system server.
[0316]
The registration screen of FIG. 105 has a “Register” button for writing the program operation registration contents into the voice interface management table, a “Cancel” button for discarding the input and returning to the state before input, an “End” button for ending registration, a “window ID acquisition” button for acquiring the window ID of the target general application program, an “application program class” (AP class) window for entering the type of application program, a “window name” window for entering the window name, and a program operation input window for entering vocabulary together with the key input sequence or mouse input sequence that represents the corresponding program operation.
[0317]
FIG. 105 shows a state in which “Shell” is selected as the application program class and “Shell 1” as the window name of the shell (both displayed with the background color inverted), and in which, as operations for shell 1, the key input operations corresponding to the words “LS” and “Clear”, and local (0) as the scope of those words, have been entered in the editing window.
[0318]
Next, a program operation registration procedure will be described with reference to FIG. The program operation registration unit 142 is activated by the message conversion unit 143. First, the program operation registration unit 142 reads the registration content from the registration content file 200 storing the program operation registration content (step 9201), displays the screen, and waits for user input (step 9202).
[0319]
Here, the user enters an AP class, window name, vocabulary, program operation, and so on, or presses the Register button, Cancel button, End button, window ID acquisition button, or the like.
[0320]
If the input is the Register button (step 9203), the editing result displayed on the screen is saved in the file 200 and also written into the voice interface management table of the voice interface management unit 141, so that the registered contents are reflected in the operation of the voice input/output interface (step 9204).
[0321]
If the input is a cancel button (step 9205), the registered contents are again read from the saved file 200 and displayed, and the process waits for input (step 9202).
[0322]
If the input is an already registered application program class (AP class) (step 9206), a list of the window names, vocabulary, and program operations of the selected AP class is displayed on the screen (step 9207), and the process returns to waiting for input (step 9202).
[0323]
If the input is the window ID acquisition button (step 9208), it is first determined whether a window name is selected (step 9209). If none is selected, the process returns to waiting for input (step 9202). If one is selected, then when a window is clicked with the mouse, the ID of the clicked window is acquired, and the selected window name and the window ID are written into the voice interface management table as shown in FIG. (step 9210).
[0324]
If the input is the End button (step 9211), the contents displayed on the screen are written into the voice interface management table and stored in the file 200 (step 9212), and registration ends.
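The registration procedure of steps 9202 to 9212 amounts to a small event loop, which can be sketched as below. The state layout (screen, saved, table), the event names, and the “edit” and “select_window” events are assumptions introduced for the illustration; AP class selection (steps 9206 and 9207) is omitted for brevity.

```python
import copy


def _entry(contents, name):
    """Return the registration entry for a window name, creating it if needed."""
    return contents.setdefault(name, {"window_id": None, "ops": {}})


def handle_input(state, event, value=None):
    """Apply one user input to the registration state; return False when done."""
    screen = state["screen"]
    if event == "register":                       # steps 9203-9204
        state["saved"] = copy.deepcopy(screen)    # stands in for the file 200
        state["table"] = copy.deepcopy(screen)    # voice interface management table
    elif event == "cancel":                       # step 9205: reload saved contents
        state["screen"] = copy.deepcopy(state["saved"])
    elif event == "select_window":
        state["selected"] = value
    elif event == "edit":                         # enter a word and its key sequence
        name, word, keys = value
        _entry(screen, name)["ops"][word] = keys
    elif event == "window_id":                    # steps 9208-9210
        name = state.get("selected")
        if name is not None:                      # step 9209: a name must be selected
            _entry(screen, name)["window_id"] = value
            _entry(state["table"], name)["window_id"] = value
    elif event == "end":                          # steps 9211-9212
        state["table"] = copy.deepcopy(screen)
        state["saved"] = copy.deepcopy(screen)
        return False
    return True


state = {"screen": {}, "saved": {}, "table": {}}
```

A caller would invoke handle_input repeatedly for each button press or edit until it returns False, which corresponds to the End button.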
[0325]
As described above, by specifying the type of application program at the time of program operation registration, program operations already registered for that type are supplied automatically and need not be entered again, so registration can be performed efficiently.
[0326]
In addition, even for application program windows that are difficult to start with a specified name, a window name can easily be assigned by acquiring the ID of the window clicked with the mouse and associating it with the window name, which makes voice input to such windows possible.
[0327]
In the registration example described above, the ID of a window that has already been created is used to associate operation commands with recognition results. In general, however, the ID of an object such as a window is determined when the object is created, and different IDs are assigned even to applications of the same type. Therefore, if window attribute values that are common to applications of the same type, such as the window hierarchy and the window name, are obtained from the window system at registration time and added to the registered contents, the registered contents can be applied to all applications of the same type by comparing these attribute values.
[0328]
In addition, a plurality of window names can be registered for the application at registration time. When an application of the same type is started, the voice recognition system is queried for the voice window names already in use, and an unused window name is assigned as the voice window name of the newly started application, so that collisions of voice window names can be avoided.
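Choosing an unused voice window name is a simple lookup, sketched here. The function name and the way the names in use are passed in are assumptions; in the embodiment the names in use would be obtained by querying the voice recognition system.

```python
def pick_voice_window_name(registered_names, names_in_use):
    """Return the first registered name not already used, or None if all are taken."""
    for name in registered_names:
        if name not in names_in_use:
            return name
    return None


# Hypothetical example: "shell 1" is already used by a running shell tool.
chosen = pick_voice_window_name(["shell 1", "shell 2"], {"shell 1", "editor"})
```

Returning None when every registered name is taken leaves it to the caller to decide whether to refuse voice input for the new window or to ask the user for another name.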
[0329]
(Twenty-first embodiment)
Next, an embodiment relating to a recognition dictionary editing function for recognizing speech in a speech input / output interface will be described.
[0330]
FIG. 107 shows the configuration of the voice interface management system 104 having the dictionary editing unit 144. The dictionary editing unit 144 is activated by the message conversion unit 143 and returns an end message to the message conversion unit 143 when editing is completed. Upon receiving this end message, the voice interface management unit 141 can issue to the voice input/output system 1 a command to load the newly edited dictionary.
[0331]
FIG. 108 shows an example of the configuration of a recognition dictionary. For each word, the recognition dictionary stores, in addition to the template for pattern matching, data such as the word name, the word ID, and the recognition parameters in a header. Providing a function to display and edit this data makes it easy to delete the dictionaries of unused words, thereby reducing the memory required for dictionaries, and to change word names and IDs.
[0332]
Next, the configuration of the dictionary editing unit 144 will be described. As shown in FIG. 109, the dictionary editing unit 144 consists of a dictionary content display editing unit 441 that displays the dictionary contents and allows the user to edit them, and a dictionary content search unit 442 that checks and searches the dictionary contents.
[0333]
The dictionary contents are displayed on a screen as shown in FIG. 110, for example. The screen contains a dictionary name window for displaying dictionary names; a dictionary content window for displaying the vocabulary number, word ID, word, parameters, and dictionary number; a “delete” button for deleting a dictionary; a “search” button for searching by parameter; an “all display” button for displaying all contents; an “end” button for ending dictionary editing; a status window for displaying the result of the dictionary content check; a search value window for entering a search value; and so on. The parameter item in the dictionary content window is a menu: when it is clicked with the mouse, the parameter choices are displayed as shown in the figure, and the parameter to be displayed can be selected.
[0334]
The dictionary contents can be checked automatically when a dictionary name is selected. For example, it is checked whether the same ID or the same word name is used more than once and whether the recognition parameters differ, and the results are displayed in the status window.
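The automatic check can be sketched as follows. The entry fields (id, word, params), the parameter name dim, and the message wording are hypothetical; the sketch only shows one way to detect duplicate IDs, duplicate word names, and differing recognition parameters in merged dictionary contents.

```python
from collections import Counter


def check_dictionary(entries):
    """Return a list of problems found in merged dictionary entries."""
    problems = []
    id_counts = Counter(entry["id"] for entry in entries)
    word_counts = Counter(entry["word"] for entry in entries)
    problems += [f"duplicate ID {i}" for i, n in id_counts.items() if n > 1]
    problems += [f"duplicate word '{w}'" for w, n in word_counts.items() if n > 1]
    # Entries are expected to share the same recognition parameters.
    distinct_params = {tuple(sorted(entry["params"].items())) for entry in entries}
    if len(distinct_params) > 1:
        problems.append("inconsistent recognition parameters")
    return problems


# Hypothetical merged contents of two dictionaries.
entries = [
    {"id": 1, "word": "LS", "params": {"dim": 16}},
    {"id": 2, "word": "clear", "params": {"dim": 16}},
    {"id": 2, "word": "LS", "params": {"dim": 12}},
]
status = check_dictionary(entries)
```

The returned list corresponds to what would be shown in the status window; an empty list means no problem was found.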
[0335]
In the example of FIG. 110, the dictionaries with the file names “common” and “usr.1” are selected, and their contents are merged and displayed as the dictionary contents. For example, vocabulary No. “1” shows that the dictionary with ID = 1 was created from 100 data samples. Vocabulary No. “2” has ID = 2 and the word “clear”; this word is selected, so its background is displayed in a darker color.
[0336]
Next, the procedure of dictionary editing processing will be described with reference to FIG. When the dictionary editing unit is activated, first, the contents of the dictionary are read from the dictionary file (step 9301), the contents are displayed on the screen, and input is waited for (step 9302).
[0337]
If the input is the delete button (step 9303), the dictionary with the No. specified by the user is deleted from the file (step 9304), and the process returns to waiting for input (step 9302).
[0338]
If the input is the all display button (step 9305), the contents of the dictionary are read again (step 9301), and the process returns to waiting for input (step 9302).
[0339]
If the input is the search button, the system waits for a parameter to be specified from the parameter menu (step 9307), displays as the dictionary contents only those dictionaries that match the specified parameter and the value entered in the search value window (step 9308), and returns to waiting for input (step 9302).
[0340]
If the input is the end button, the dictionary file is updated with the contents entered on the screen (step 9310), the message conversion unit is notified of the end (step 9311), and the process ends.
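The button handling in steps 9301 to 9311 can be pictured as a simple event loop. The sketch below is an assumption-laden illustration, not the implementation of the dictionary editing unit: entries are plain dictionaries, button presses arrive as a list of `(button, argument)` pairs instead of window-system events, and the function name `edit_dictionary` is hypothetical.

```python
def edit_dictionary(entries, events):
    """Process (button, argument) events and return (stored, shown).

    stored: the entries that would be written back to the dictionary file.
    shown:  the entries currently displayed in the dictionary content window.
    """
    stored = list(entries)   # step 9301: contents read from the dictionary file
    shown = list(stored)     # step 9302: contents displayed, waiting for input
    for button, arg in events:
        if button == "delete":            # steps 9303-9304: delete by dictionary No.
            stored = [e for e in stored if e["no"] != arg]
            shown = [e for e in shown if e["no"] != arg]
        elif button == "all display":     # step 9305: show everything again
            shown = list(stored)
        elif button == "search":          # steps 9307-9308: filter by parameter and value
            param, value = arg
            shown = [e for e in stored if e.get(param) == value]
        elif button == "end":             # steps 9310-9311: finish editing
            break
    return stored, shown
```

For example, the event list `[("search", ("word", "clear")), ("end", None)]` leaves the stored dictionary unchanged and displays only the entries whose word is “clear”.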
[0341]
The dictionary editing unit described above makes it easy to delete unnecessary word dictionaries, to check their contents, and to change word names, and also makes it easy to check for duplicate use of the same ID or word and for inconsistent recognition parameters.
[0342]
(Twenty-second embodiment)
In the voice input/output interfaces described in the eighteenth and nineteenth embodiments of the present invention, the user confirms the recognition result of an utterance, and the operation of the application program caused by that result, through screen information presented by the application program. For example, the recognition result (or a recognition failure) is presented to the user as character information. When a program name such as “shell tool” is called, the display of the shell tool is changed as shown in FIGS. 100 and 101 of the nineteenth embodiment. In response to the utterance “iconization”, an action on the application program by voice, such as turning the window that has the voice focus into an icon, is fed back to the user as a change in the screen display made by the application program. However, depending on the application program, an operation may change the display only slightly or not at all. Further, taking advantage of the feature of the present invention that the keyboard focus and the voice focus can be separated, the application program that has the voice focus may also be used while it is not displayed. In such cases, confirming the recognition result and the operation by voice output, using the voice synthesis function described in the fourteenth embodiment, instead of by screen output makes the application program more convenient for the user to operate.
[0343]
The voice interface manager (FIG. 98) of the nineteenth embodiment is expanded as shown in FIG. 112 in order to check the operation by voice output. That is, a response voice management unit 401 and a response voice registration unit 403 are added to the voice interface management system (SIM).
[0344]
The response voice management unit 401 defines what type of response voice is returned in response to the utterance made by the user, and the response voice registration unit 403 performs registration thereof. The message conversion unit 143 outputs a voice response with reference to the response voice management unit 401 when an operation (that is, a message) occurs.
[0345]
An example of the response voice management unit 401 is shown in FIG. The response voice management unit 401 holds, for each entry, an operation that triggers the output of a voice response, a response command that is executed when the operation occurs, and a flag that determines whether or not the setting is actually applied. The operation need not be one caused by voice. The response is described as a command: synth() outputs synthesized speech using its argument as text, and play() outputs its argument as waveform data.
[0346]
The message conversion unit 143 refers to the data of the response voice management unit 401 and performs processing according to the flow shown in FIG. First, it is determined whether or not the message received from the voice input/output system is a recognition result (step 10001), and, if so, whether or not the recognition process succeeded (step 10002). Next, a voice response command is executed according to the success or failure (steps 10003 and 10004). Step 10005 outputs response voices for operations other than the success or failure of the recognition process, and corresponds to the settings in the third and subsequent lines of FIG. 113. According to this flow, when recognition fails, for example because the similarity is low even though the input could be processed, or because the voice input level is too high or too low, voice data such as “Huh?” is output. When an application program name, for example “mail”, is recognized, “Yes, it is mail” is output by synthesized speech. (Here, $<cat> in FIG. 113 is replaced with the vocabulary name of the recognition result.)
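The lookup performed in steps 10001 to 10005 might be sketched as follows. Everything here beyond the synth() and play() command names and the $<cat> placeholder is an assumption made for illustration: the `ResponseRule` record, the operation names, the message format, and the file name `huh.wav` are hypothetical, and the function returns the command strings rather than actually producing sound.

```python
from dataclasses import dataclass


@dataclass
class ResponseRule:
    operation: str   # operation that triggers the voice response
    command: str     # response command, e.g. synth("...") or play("...")
    enabled: bool    # flag: whether the setting is actually applied


# Hypothetical table contents; "huh.wav" is an assumed waveform file name.
RULES = [
    ResponseRule("recognition_success", 'synth("Yes, it is $<cat>")', True),
    ResponseRule("recognition_failure", 'play("huh.wav")', True),
]


def respond(message, rules=RULES):
    """Return the response commands to execute for a received message."""
    if message.get("type") == "recognition_result":        # step 10001
        if message.get("success"):                         # step 10002
            operation = "recognition_success"              # handled in step 10003
        else:
            operation = "recognition_failure"              # handled in step 10004
    else:
        operation = message.get("type")                    # step 10005: other operations
    commands = []
    for rule in rules:
        if rule.enabled and rule.operation == operation:
            # $<cat> is replaced with the vocabulary name of the recognition result.
            commands.append(rule.command.replace("$<cat>", message.get("word", "")))
    return commands
```

With this table, a successful recognition of “mail” yields the single command `synth("Yes, it is mail")`, and a failed recognition yields `play("huh.wav")`.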
[0347]
The response voice registration unit 403 shown in FIG. 115 registers the commands of the response voice management unit 401. The user describes a command for each operation, marks the check box that indicates whether or not the setting is applied, and confirms the registration by pressing the OK button.
[0348]
The response commands of the response voice management unit 401 are processed by the message conversion unit 143, and can also be described as commands in the voice interface management table shown in FIG. 99 of the nineteenth embodiment. By describing play() and synth() commands there, a response voice output suited to the application program can be defined even for operations of a GAP, which cannot exchange information directly with the voice input/output system 1.
[0349]
As described above, the SIM is provided with a mechanism for returning a meaningful voice response for each operation performed (or not performed) by voice input, which is a natural way of responding to voice input by voice. The user can therefore confirm the operation executed by the application program while paying little or no attention to changes in the screen display, and the operability of the voice input/output interface is improved.
[0350]
(23rd embodiment)
In the ninth embodiment of the present invention, data collection for creating a recognition dictionary has been described. However, the collected data may contain erroneous data caused by utterance of the wrong vocabulary or by errors in detecting the speech section. For example, when the word “hiraku” (“open”) is uttered with a weak “ku” sound, the “ku” may be missed and only “hira” detected as the speech section. Since training the recognition dictionary on such erroneous data greatly reduces the recognition accuracy, it is necessary to check the data and remove the erroneous data. Therefore, in this embodiment, the data is checked by reproducing the sound and listening to it, so that the data can be confirmed easily and reliably.
[0351]
Conventionally, when collected voice data is played back for confirmation, only the detected voice section is usually played back. With some vocabulary, however, the playback sounds correct to the user even if the start or end of the voice was detected incorrectly. For example, even when the “ku” at the end of “hiraku” described above is missing and only “hira” has been detected, the playback of “hira” may be heard as “hiraku”. In this embodiment, in order to reduce such mistakes in confirming the start and end, the start and end positions of the voice are presented by sound in an easy-to-understand manner. As a result, the voice data can be confirmed easily and reliably by sound, so the learning data can be collected easily and without mistakes, and the usability of the voice input/output interface and the recognition accuracy can be improved.
[0352]
The following methods, among others, can be used to make the start and end positions easier to hear:
(Method 1) A method in which a known sound such as white noise or a sine wave is added before and after the detected speech section and reproduced.
(Method 2) A method of playing back with a click sound added at the start and end positions.
(Method 3) A method of playing back only the voice section after playing back the entire utterance, from a certain time before the start to a certain time after the end.
[0353]
According to the above method 1, in the “hiraku” example described above, another sound immediately follows “hira”, so it is easy to hear that “ku” is missing. According to the above method 2, it can be understood that “ku” is missing because a click sound comes right after “hira”. Further, according to the above method 3, the entire utterance and the voice section can be compared by ear, so the presence or absence of “ku” can be easily identified.
[0354]
FIG. 116 shows the configuration of the expanded data collection unit 8 according to this embodiment.
[0355]
As shown in FIG. 116, this data collection unit 8 is obtained by adding a voice data confirmation unit 411 and a data use availability input unit 413 to the data collection unit 8 of FIG. 29 of the ninth embodiment, and the voice feature data is sent to the voice feature data storage unit via the learning data collection control unit 83. That is, the user listens to the reproduced sound presented by the voice data confirmation unit 411 and specifies, from the data use availability input unit 413, whether or not the voice data is to be used for dictionary creation.
[0356]
The processing flow of the data collection unit 8 will be described with reference to FIG.
[0357]
First, in the initial setting, the data collection unit 8 issues a learning mode setting request to the speech recognition system 1 in response to a data collection instruction from the user (step 11001). In response, the recognition target vocabulary is sent to the data collection unit 8. The data collection unit 8 displays the recognition target vocabulary to the user (step 11002).
[0358]
When the learning vocabulary is selected by the user (step 11003), the data collection unit 8 requests the speech recognition system 1 to transmit the word speech feature data and the word speech waveform data (step 11004), displays a guide for uttering the selected vocabulary on the utterance guide display unit 415 (step 11005), and prompts the user to speak. The voice recognition system 1 processes the user's utterance and then transmits the word feature data and waveform data to the data collection unit 8. The data collection unit 8 receives the data and temporarily stores it in the internal memory (step 11006).
[0359]
The speech waveform data is sent to the speech data confirmation unit 411, and the user checks the data and inputs, through the data use availability input unit 413, whether or not to use it for dictionary creation (step 11007). If the data is to be used, the word voice feature data is output to a file on a magnetic disk or the like (YES in step 11008, then step 11009); if not, no file output is performed (NO in step 11008).
[0360]
At the end of learning, when the user inputs an instruction to end data collection and the data collection instruction flag is OFF (Yes in step 11010), the data collection unit 8 requests the speech recognition system 1 to cancel the learning mode (step 11012). In response to this, the speech recognition system 1 cancels the learning mode. On the other hand, when the learning is not finished, the data collection instruction flag is inspected (step 11011), and the processing after step 11004 is repeated. The data collection instruction flag is set in the learning data collection control unit, and can be input by the user with a data collection button as shown in the figure.
[0361]
Next, FIG. 118 shows the configuration of the audio data confirmation unit 411 of this embodiment.
[0362]
The voice data confirmation unit 411 comprises a voice data memory 421 that stores voice data, a voice data processing unit 422 that processes the voice data, an additional sound generation unit 424 that generates the additional sound used for the processing, and a reproduction unit 423 that reproduces the processed voice data. It receives the voice data and the start/end position information from the learning data collection control unit 83, processes the data, and outputs it as sound. If the processed sound is sent to the voice input/output system for reproduction, the reproduction unit 423 is not necessary.
[0363]
Next, the flow of processing will be described with reference to FIG.
[0364]
First, the voice data and the start/end information are received from the learning data collection control unit 83 and stored in the voice data memory 421 (steps 12001, 12101, 12201). This voice data is waveform data that includes a margin of a certain time, for example 240 msec, before and after the voice section, and is as shown in FIG. 120, for example. In the data shown in the figure, only “hira” of “hiraku” has been detected as the voice section, so the “ku” sound falls within the margin after the end.
[0365]
Next, in the case of the above method 1, in which an additional sound is added before and after the voice section, the additional sound is generated by the additional sound generation unit 424 (step 12002), and the voice data processing unit 422 adds it before the start position and after the end position (steps 12003 and 12004). As a result, the audio data becomes as shown in FIG.
[0366]
The additional sound data may be white noise or a sine wave, and these can be easily created using a random number generation routine or a trigonometric function routine. Alternatively, the recorded data may be simply read out.
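As a rough illustration of method 1, the additional sound can be produced with a random number routine or a trigonometric function and joined to the detected section. The sketch below works on plain lists of samples; the function names, the 12 kHz sampling rate, the 1 kHz tone, and the amplitude are all assumptions chosen for the example, not values from the specification.

```python
import math
import random


def white_noise(n, amplitude=0.1, seed=0):
    """Generate n samples of uniform white noise with a random number routine."""
    rng = random.Random(seed)
    return [amplitude * rng.uniform(-1.0, 1.0) for _ in range(n)]


def sine_wave(n, freq_hz=1000.0, rate_hz=12000, amplitude=0.1):
    """Generate n samples of a sine wave with a trigonometric function."""
    return [amplitude * math.sin(2.0 * math.pi * freq_hz * i / rate_hz) for i in range(n)]


def add_marker_sounds(samples, start, end, marker):
    """Return the detected section samples[start:end] with the marker sound
    placed immediately before its start and immediately after its end."""
    return marker + samples[start:end] + marker
```

Because the marker follows the detected section directly, a section that ends at “hira” is heard as “hira” plus the marker, with no “ku” in between.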
[0367]
In the case of the above method 2, in which a click sound is added at the start and end positions, the click sound is generated by the additional sound generation unit 424 (step 12102) and added at the start and end positions (steps 12103 and 12104). As a result, the audio data is as shown in FIG. Here, the click sound may be any short sound, for example a pulse or a triangular wave with a width of several tens of msec.
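Method 2 can be sketched in the same style. The specification does not say whether the click is mixed into the waveform or inserted into it, nor whether the margins are kept; the sketch assumes that a triangular pulse is inserted at the start and end positions and that the margins are kept. The function names `click` and `add_clicks` are hypothetical.

```python
def click(width):
    """Return a triangular pulse of `width` samples with a peak value of 1.0."""
    half = width / 2.0
    return [1.0 - abs(i - half) / half for i in range(width)]


def add_clicks(samples, start, end, pulse):
    """Insert the pulse just before the start position and just after the
    end position of the detected section, keeping the margins (assumption)."""
    return samples[:start] + pulse + samples[start:end] + pulse + samples[end:]
```

With the margins kept, a truncated detection is heard as “hira”, click, “ku”, which makes the misplaced end position audible.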
[0368]
In the case of the above method 3, which reproduces both the entire utterance and the voice section, first, the average power outside the voice section is calculated (step 12202). If this value is larger than a threshold, for example the noise level + 2 dB (YES in step 12203), the entire utterance, including the margins before and after the voice section, and the voice section are both reproduced (step 12204). On the other hand, if the calculated average power is smaller than the threshold (NO in step 12203), only the voice section is reproduced (step 12205). The noise level is constantly measured by the voice recognition system 1 for voice detection (Nagata, et al., “Development of voice recognition function at a workstation”, IEICE Technical Report, HC9119, pp. 63-70 (1991)), and that value may be used. Since reproducing both the entire utterance and the voice section for every utterance is cumbersome, the start/end positions are assumed likely to be wrong when the voice power outside the voice section is large, as described above, and the reproduction is performed twice only in that case, which reduces the burden on the user.
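The decision in steps 12202 to 12205 amounts to comparing the average power of the margins with a threshold. A minimal sketch follows, assuming samples are floating-point values, power is expressed in dB relative to full scale, and the noise level is supplied by the caller; the function names and the dB reference are assumptions, while the 2 dB margin is the example value given in the text.

```python
import math


def average_power_db(samples):
    """Average power of the samples in dB (negative infinity for silence)."""
    if not samples:
        return float("-inf")
    mean_square = sum(x * x for x in samples) / len(samples)
    return 10.0 * math.log10(mean_square) if mean_square > 0 else float("-inf")


def playback_plan(samples, start, end, noise_level_db, margin_db=2.0):
    """Return the list of waveforms to reproduce, in order."""
    outside = samples[:start] + samples[end:]    # margins before and after the section
    section = samples[start:end]                 # detected voice section
    # Step 12203: is the power outside the section above noise level + margin?
    if average_power_db(outside) > noise_level_db + margin_db:
        return [samples, section]                # step 12204: whole utterance, then section
    return [section]                             # step 12205: section only
```

When a “ku” sound has slipped into the trailing margin, the power outside the section rises well above the noise level, so both waveforms are reproduced and the user can compare them.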
[0369]
In this case, as shown in (c) of FIG. 121, the entire utterance is reproduced as “hiraku”, whereas the playback of only the voice section is heard as “hira”. By listening to and comparing these two reproduced sounds, it can be easily identified that “ku” is missing.
[0370]
As described above, the user can easily determine from the reproduced sound whether or not the voice data is correct, and can immediately input, through the data collection unit, whether or not the data is to be used for creating a dictionary, so that voice data can be collected easily and reliably.
[0371]
Thereby, a recognition dictionary can be created excluding erroneous data.
[0372]
[Effect of the invention]
According to the present invention, each application program can determine whether or not it receives speech recognition results from the speech recognition system, so that an application program can freely control voice input for itself and for other application programs, and a flexible, easy-to-use speech recognition interface can be constructed. In addition, since the voice recognition system can send a voice recognition result to multiple application programs at the same time, a single voice input can operate multiple application programs at the same time. Furthermore, since the voice recognition system can perform voice recognition for multiple application programs, voice input can be distributed to the appropriate application program based on the voice recognition result without the voice input target being specified explicitly, which reduces the burden on the user.
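The distribution of recognition results summarized above can be illustrated with a small table lookup. This is a sketch under stated assumptions, not the claimed implementation: the `AppEntry` record stands in for one row of the application program management table, with a `voice_focus` flag (whether the program is a target of voice input) and a `vocabulary` set (its recognition target vocabulary), and the function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AppEntry:
    name: str
    voice_focus: bool                              # is the program a target of voice input?
    vocabulary: set = field(default_factory=set)   # recognition target vocabulary


def active_vocabulary(table):
    """Vocabulary to recognize: the union over programs that accept voice input."""
    vocabulary = set()
    for app in table:
        if app.voice_focus:
            vocabulary |= app.vocabulary
    return vocabulary


def destinations(table, recognized_word):
    """Programs that accept voice input and list the recognized word."""
    return [app.name for app in table
            if app.voice_focus and recognized_word in app.vocabulary]
```

For example, if “mail” and “shell” both accept voice input and both list “open”, a single utterance of “open” is delivered to both, while a program without voice focus receives nothing.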
[Brief description of the drawings]
FIG. 1 is a diagram showing a schematic configuration of an embodiment of the present invention.
FIG. 2 is a diagram showing a schematic configuration of a voice recognition unit.
FIG. 3 is a diagram showing a schematic configuration of another example of a speech recognition unit.
FIG. 4 is a diagram showing a schematic configuration of another example of a voice recognition unit.
FIG. 5 is a diagram showing a schematic configuration of another example of a voice recognition unit.
FIG. 6 is a diagram showing a schematic configuration of an application program.
FIG. 7 is a diagram illustrating a message transmitted between components.
FIG. 8 is a diagram showing types of input masks.
FIG. 9 is a diagram showing a time chart of processing of each part of the voice recognition interface.
FIG. 10 is a diagram illustrating an application program management table.
FIG. 11 is a diagram showing a schematic configuration of a second embodiment of the present invention.
FIG. 12 is a diagram showing a screen display example of a general window system.
FIG. 13 is a diagram for explaining a recognition vocabulary of an application program.
FIG. 14 is a diagram for explaining a change in a speech recognition vocabulary associated with movement of an input focus.
FIG. 15 is a diagram for explaining a display example of a recognized vocabulary.
FIG. 16 is a diagram for explaining a state in which the recognized vocabulary is changed depending on the position of the mouse.
FIG. 17 is a diagram for explaining a recognition vocabulary of an application program in the third embodiment of the present invention.
FIG. 18 is a diagram illustrating an application program management table.
FIG. 19 is a diagram for explaining a fourth embodiment of the present invention.
FIG. 20 is a diagram showing a schematic configuration of a fifth embodiment of the present invention.
FIG. 21 is a diagram showing a message display example.
FIG. 22 is a diagram showing a multi-window environment such as a workstation.
FIG. 23 is a diagram showing an application program management table in the sixth embodiment of the present invention.
FIG. 24 is a diagram for explaining an expression based on the application program management table of FIG.
FIG. 25 is a diagram showing an example of expansion of the task management program function.
FIG. 26 is a view for explaining a display example in the seventh embodiment of the present invention.
FIG. 27 is a view for explaining a display example in the seventh embodiment;
FIG. 28 is a diagram showing a schematic configuration of a ninth embodiment of the present invention.
FIG. 29 is a diagram showing a schematic configuration of a learning data collection unit.
FIG. 30 is a diagram for explaining message exchange with the voice recognition system;
FIG. 31 is a diagram showing a flowchart at the time of data collection in the voice recognition system.
FIG. 32 is a diagram showing a flowchart of a learning data collection unit.
FIG. 33 is a diagram showing a display example on a learning vocabulary guide display unit.
FIG. 34 is a diagram showing a display example on the learning vocabulary guide display unit.
FIG. 35 is a diagram showing a flow of processing of a voice recognition interface at the time of data collection.
FIG. 36 is a diagram showing a schematic configuration of a tenth embodiment of the present invention.
FIG. 37 is a diagram showing a dictionary creation management table.
FIG. 38 shows a dictionary creation management table.
FIG. 39 is a diagram showing a dictionary creation management table;
FIG. 40 is a diagram for explaining a registration procedure in a dictionary creation management table.
FIG. 41 is a diagram for explaining a procedure for creating a dictionary;
FIG. 42 is a diagram showing a display example of the progress of dictionary creation.
FIG. 43 is a diagram showing an example of speed display for dictionary creation processing;
FIG. 44 is a diagram showing an example of speed display for dictionary creation processing;
FIG. 45 is a diagram showing a schematic configuration of an eleventh embodiment of the present invention.
FIG. 46 is a diagram for explaining voice recognition automatic stop processing;
FIG. 47 is a diagram for explaining a twelfth embodiment of the present invention.
FIG. 48 is a view for explaining the twelfth embodiment;
FIG. 49 is a diagram for explaining a thirteenth embodiment of the present invention.
FIG. 50 is a diagram showing a schematic configuration of a fourteenth embodiment of the present invention.
FIG. 51 is a diagram showing a schematic configuration of a speech synthesis unit.
FIG. 52 is a diagram illustrating an audio output management table.
FIG. 53 is a diagram illustrating a message for voice input.
FIG. 54 is a diagram illustrating an input mask for audio output.
FIG. 55 is a diagram illustrating an application program management table.
FIG. 56 is a view showing a flowchart of an audio output process.
FIG. 57 shows a time chart of audio output processing.
FIG. 58 is a flowchart showing audio output request processing.
FIG. 59 is a diagram for explaining an example when superimposing audio data with interruption processing;
FIG. 60 is a diagram showing a schematic configuration of a fifteenth embodiment of the present invention.
FIG. 61 is a diagram for explaining messages exchanged between the application program and the voice input / output system.
FIG. 62 is a diagram showing a time chart of a process in which the voice mail tool records voice data.
FIG. 63 is a diagram showing a screen display example of the voice mail tool.
FIG. 64 is a diagram showing a sub-window for editing audio data.
FIG. 65 is a diagram showing a text example of a reply by mail transmission.
FIG. 66 is a diagram showing a sub-window for editing audio data.
FIG. 67 is a diagram showing an example of a synthetic speech attribute database;
FIG. 68 is a diagram showing an example of a voice command used when reading a mail.
FIG. 69 is a diagram showing a schematic configuration of a voice mail system.
FIG. 70 is a diagram illustrating an application program management table.
FIG. 71 is a diagram for explaining a message between the mail system and the voice input / output system.
FIG. 72 is a diagram for explaining a task importance level management table;
FIG. 73 is a view showing a flowchart of electronic mail processing in the voice mail system.
FIG. 74 is a diagram showing a notification example of received mail.
FIG. 75 is a diagram for explaining a task importance level management table;
FIG. 76 is a view showing an example of a control code mixed mail.
FIG. 77 is a diagram showing a schematic configuration of a sixteenth embodiment of the present invention.
FIG. 78 is a diagram showing a schematic configuration of a sixteenth embodiment of the present invention.
FIG. 79 is a flowchart showing summary setting processing;
FIG. 80 is a diagram showing a schematic configuration of a seventeenth embodiment of the present invention.
FIG. 81 is a diagram showing an example of creating a mail document using voice.
FIG. 82 is a diagram showing an example of a message between the application program and the voice recognition system.
FIG. 83 is a diagram showing a time chart of a process of cutting out voice segment data from input voice.
FIG. 84 is a diagram for explaining input of a mail title by voice.
FIG. 85 is a view for explaining typical mail document input.
FIG. 86 is a diagram showing a screen display example of a mail address book.
FIG. 87 is a diagram showing an example of registration of a mail address that can be input by voice;
FIG. 88 is a diagram for explaining a procedure for designating a mail destination by voice.
FIG. 89 is a diagram for explaining mail destination specification using a mail address database;
FIG. 90 is a diagram showing a schematic configuration of an eighteenth embodiment of the present invention.
FIG. 91 is a view showing a system configuration in the eighteenth embodiment;
FIG. 92 is a view showing a screen display example in the eighteenth embodiment.
FIG. 93 is a diagram showing an example of an audio interface management table.
FIG. 94 is a diagram showing a correspondence relationship between pseudo audio focus and audio focus.
FIG. 95 is a view showing a flowchart of a message conversion unit.
FIG. 96 is a diagram showing a schematic configuration of a nineteenth embodiment of the present invention.
FIG. 97 is a view showing a screen display example in the nineteenth embodiment;
FIG. 98 is a view showing a more detailed configuration of the nineteenth embodiment;
FIG. 99 is a diagram showing an example of a voice interface management table.
FIG. 100 is a diagram for explaining a voice focus display method;
FIG. 101 is a diagram showing a display example of an external window.
FIG. 102 is a diagram showing an example of an application program management table.
FIG. 103 is a diagram showing a flowchart of recognition processing of the voice input / output system.
FIG. 104 is a diagram showing a schematic configuration of a twentieth embodiment of the present invention.
FIG. 105 is a diagram showing an example of a registration screen for program operation.
FIG. 106 is a diagram showing a processing procedure for program operation registration;
FIG. 107 is a diagram showing a schematic configuration of a twentieth embodiment of the present invention.
FIG. 108 is a diagram showing an example of the configuration of a recognition dictionary.
FIG. 109 is a diagram showing a schematic configuration of a dictionary editing unit.
FIG. 110 is a diagram showing an example of a dictionary editing screen.
FIG. 111 is a diagram showing a flowchart of processing of a dictionary editing unit.
FIG. 112 is a diagram showing a schematic configuration of a twenty-second embodiment of the present invention.
FIG. 113 is a diagram showing a schematic configuration of a response voice management unit.
FIG. 114 is a diagram showing a flowchart of processing of a message conversion unit.
FIG. 115 is a diagram showing a schematic configuration of a response voice registration unit.
FIG. 116 is a diagram showing a schematic configuration of an extended data collection unit.
FIG. 117 is a view showing a flowchart of processing of the data collection unit in FIG. 116;
FIG. 118 is a diagram showing a schematic configuration of an audio data confirmation unit.
FIG. 119 is a diagram showing a flowchart of processing of an audio data confirmation unit.
FIG. 120 is a diagram showing an example of audio data.
FIG. 121 is a view showing a state of audio data after processing.
FIG. 122 is a diagram showing a conventional voice recognition interface.
FIG. 123 is a diagram showing a conventional voice recognition interface.
FIG. 124 is a diagram showing a conventional voice recognition interface.
FIG. 125 is a diagram showing a conventional voice recognition interface.
FIG. 126 is a diagram showing a conventional voice recognition interface.
[Explanation of symbols]
1, 3, 6 ... Voice recognition system, 11 ... Message processing unit, 12 ... Voice recognition unit, 121 ... Voice detection unit, 122 ... Voice analysis unit, 123 ... Recognition dictionary collation unit, 124 ... Voice recognition dictionary, 13 ... Application program management table, 2, 5, 7 ... Application program, 21, 71 ... Message input/output unit, 22 ... Program body, 4 ... Window system, 8 ... Data collection unit, 81 ... Word voice feature data holding unit, 82 ... Learning vocabulary display selection unit, 83 ... Learning data collection control unit, 84 ... Learning vocabulary guide display unit, 9 ... Dictionary creation unit, 91 ... Dictionary creation management unit, 92 ... Dictionary creation control unit, 93 ... Data input unit, 94 ... Dictionary creation unit main body, 95 ... File output unit, 10 ... Voice recognition automatic stop unit, 14 ... Voice synthesis unit, 561 ... Overall control unit, 562 ... Waveform superposition unit, 563 ... Voice output management table, 564 ... Waveform synthesis unit, 651 ... Voice input/output system, 652 ... Window system, 653 ... Voice mail tool, 6531 ... Email processing unit, 6532 ... Message input/output unit, 821 ... Voice input/output system, 822 ... Voice mail system, 8221 ... Email processing unit, 8222 ... Document summary unit, 8223 ... Message input/output unit, 851 ... Voice recognition system, 852 ... Voice message system, 853 ... Mail address table, 103 ... General application program (GAP), 102 ... Dedicated application program (SAP), 104 ... Voice interface management system (SIM), 141 ... Voice interface management unit, 142 ... Program operation registration unit, 143 ... Message conversion unit, 23 ... Voice window, 1440₀ ~ 1440₄ ... Voice window, 151 ... Program operation display editing unit, 152 ... Registered content storage unit, 153 ... Window ID acquisition unit, 144 ... Dictionary editing unit, 441 ... Dictionary content display editing unit, 442 ... Dictionary content search unit, 401 ... Response voice management unit, 403 ... Response voice registration unit, 411 ... Voice data confirmation unit, 413 ... Data use availability input unit, 415 ... Utterance guide display unit, 421 ... Voice data memory, 422 ... Voice data processing unit, 423 ... Reproduction unit, 424 ... Additional sound data storage unit.

Claims

In a speech recognition interface that connects multiple application programs to a speech recognition system,
The voice recognition system includes:
Speech recognition means for recognizing speech;
application program management means for managing, in correspondence with each of the plurality of application programs, at least first information indicating whether or not the application program is a target of voice input and second information indicating one or more recognition target vocabularies to be recognized for the application program; and
means for specifying the recognition target vocabulary for voice input based on the second information managed in correspondence with the one or more application programs for which the first information managed by the application program management means indicates that the application program is a target of voice input, and, when any of the specified recognition target vocabulary is recognized by the speech recognition means, identifying, as transmission destinations of the recognized vocabulary, the one or more application programs for which the first information indicates that the application program is a target of voice input and the second information indicates that the recognized vocabulary is a recognition target vocabulary,
wherein the voice recognition system manages third information indicating a vocabulary that uniquely corresponds to each application program and that should always be a recognition target regardless of which application program is a target of voice input, and,
when any of the vocabulary included in the third information is recognized by the speech recognition means, sets the first information corresponding to the application program that uniquely corresponds to the recognized vocabulary to a state indicating that the application program is a target of voice input.

In a speech recognition interface that connects a plurality of application programs to a speech recognition system,
The speech recognition system comprises:
Speech recognition means for recognizing speech;
Application program management means for managing, for each of the plurality of application programs, at least first information indicating whether or not the application program is a target of voice input, and second information indicating one or more recognition target vocabularies to be recognized for the application program; and
Means for identifying the recognition target vocabulary for voice input based on the second information managed for the one or more application programs whose first information, as managed by the application program management means, indicates that they are targets of voice input, and for identifying, when any of the identified recognition target vocabulary is recognized by the speech recognition means, the one or more application programs whose first information indicates that they are targets of voice input and whose second information indicates that the recognized vocabulary is a recognition target vocabulary, as transmission destinations of the recognized vocabulary,
Wherein the application program, when it becomes the target of keyboard input, requests the speech recognition system to make the application program itself a target of voice input, and
When the speech recognition system receives the request from the application program, the speech recognition system sets the first information corresponding to the application program to a state indicating that the application program is a target of voice input.
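The request path in the claim above can be illustrated with a small Python sketch. The names `FocusTable`, `App`, `handle_focus_request`, and `on_keyboard_focus_in` are hypothetical, and the window system that would deliver the keyboard focus event is assumed rather than modeled.

```python
from typing import Dict


class FocusTable:
    """Hypothetical store of the per-application speech focus flag."""

    def __init__(self) -> None:
        self.speech_focus: Dict[str, bool] = {}

    def handle_focus_request(self, app_name: str) -> None:
        # The system marks the requesting application as a voice input target.
        self.speech_focus[app_name] = True


class App:
    def __init__(self, name: str, table: FocusTable) -> None:
        self.name = name
        self.table = table
        table.speech_focus[name] = False

    def on_keyboard_focus_in(self) -> None:
        # Assumed to be called when the application gains keyboard focus;
        # the application then asks for speech focus for itself.
        self.table.handle_focus_request(self.name)
```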

The speech recognition interface according to claim 1, wherein, when a predetermined event occurs, the speech recognition system, in accordance with the content of the event and a predetermined rule, changes the first information corresponding to a predetermined application program to a state indicating that the application program is a target of voice input, and changes the first information corresponding to another predetermined application program to a state indicating that the application program is not a target of voice input.
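One way to picture the event-driven change of focus described above is a rule table that maps each event to the applications that gain and lose speech focus. The following Python sketch assumes such a table; the `Rule` layout, the `apply_event` function, and the event name used in the test are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical rule format: (applications that gain focus, applications that lose it)
Rule = Tuple[List[str], List[str]]


def apply_event(focus: Dict[str, bool], rules: Dict[str, Rule], event: str) -> None:
    """Update the per-application focus flags according to the rule for `event`."""
    gain, lose = rules.get(event, ([], []))  # events without a rule change nothing
    for app in gain:
        focus[app] = True
    for app in lose:
        focus[app] = False
```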

The speech recognition interface according to claim 1, wherein the speech recognition system notifies an application program that has made a notification request of information from which it can at least be determined whether or not the application program itself is currently a target of voice input.

The speech recognition interface according to claim 1, wherein the speech recognition system displays, on a display screen, the window of the application program whose first information indicates that it is a target of voice input in a display form different from the display form of the windows of the other application programs whose first information indicates that they are not targets of voice input.

The speech recognition interface according to claim 1, wherein the speech recognition system displays on a display screen, for the application program whose first information indicates that it is a target of voice input, the one or more recognition target vocabularies indicated by the second information corresponding to that application program.

The speech recognition interface according to claim 6, wherein the speech recognition system displays on a display screen the recognized vocabulary transmitted to the application program identified as the transmission destination.

The speech recognition interface according to claim 1, wherein the second information is given to the speech recognition system from each application program.

The speech recognition interface according to claim 1, wherein the speech recognition system:
Manages the second information for each of a plurality of divided areas obtained by dividing the window of the corresponding application program, and
Uses, as the second information corresponding to the application program, the second information managed for the divided area in which the mouse pointer is currently located among the divided areas in the window of the application program.
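The per-area vocabulary lookup described above can be sketched as a simple hit test. This Python example assumes rectangular, non-overlapping areas given in window coordinates; `Region` and `vocabulary_at` are hypothetical names, and returning an empty vocabulary when the pointer is outside every area is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Region:
    """A rectangular divided area of a window with its own vocabulary."""
    x: int
    y: int
    width: int
    height: int
    vocabulary: Set[str]

    def contains(self, px: int, py: int) -> bool:
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def vocabulary_at(regions: List[Region], px: int, py: int) -> Set[str]:
    """Return the vocabulary of the area under the mouse pointer."""
    for region in regions:
        if region.contains(px, py):
            return region.vocabulary
    return set()  # pointer is outside every divided area
```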

The speech recognition interface according to claim 1, wherein the speech recognition system:
Manages, for at least some of the plurality of application programs, the first information and the second information for each of one or more windows corresponding to each of those application programs, and
For an application program whose first information and second information are managed for each window, identifies the recognition target vocabulary for voice input based on the second information managed for the one or more windows whose first information indicates that they are targets of voice input, and, when any of the identified recognition target vocabulary is recognized by the speech recognition means, identifies the one or more windows whose first information indicates that they are targets of voice input and whose second information indicates that the recognized vocabulary is a recognition target vocabulary, as transmission destinations of the recognized vocabulary.
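The window-level variant above follows the same pattern as the application-level routing, with the table keyed by window instead of by application. A minimal Python sketch, assuming a table keyed by a hypothetical `(application name, window name)` pair; `WindowEntry`, `window_vocabulary`, and `window_destinations` are illustrative names only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

WindowKey = Tuple[str, str]  # (application name, window name), hypothetical


@dataclass
class WindowEntry:
    focused: bool = False                               # "first information"
    vocabulary: Set[str] = field(default_factory=set)   # "second information"


def window_vocabulary(table: Dict[WindowKey, WindowEntry]) -> Set[str]:
    """Union of the vocabularies of all windows that hold speech focus."""
    words: Set[str] = set()
    for entry in table.values():
        if entry.focused:
            words |= entry.vocabulary
    return words


def window_destinations(table: Dict[WindowKey, WindowEntry], word: str) -> List[WindowKey]:
    """Focused windows whose vocabulary contains the recognized word."""
    return [key for key, entry in table.items()
            if entry.focused and word in entry.vocabulary]
```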

The speech recognition interface according to claim 1 or 2, wherein, for an application program whose first information and second information are managed for each window, when the first information of a window of the application program indicates that the window is a target of voice input, the speech recognition system uses, in addition to the second information managed for that window, the vocabulary that is included in the second information managed for the other windows of the application program having that window and that is designated as also usable for other windows of the application program.
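The vocabulary sharing between windows of one application can be sketched as follows. This Python example assumes each window records which of its words are designated as usable from sibling windows; `WindowVocab`, its `shared` field, and `effective_vocabulary` are hypothetical names chosen for the illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class WindowVocab:
    vocabulary: Set[str]                            # all words of this window
    shared: Set[str] = field(default_factory=set)   # words usable from sibling windows


def effective_vocabulary(windows: Dict[str, WindowVocab], focused: str) -> Set[str]:
    """Vocabulary for the focused window plus shared words of its siblings."""
    words = set(windows[focused].vocabulary)
    for name, window in windows.items():
        if name != focused:
            # Only words that the sibling window both owns and marks as shared.
            words |= window.vocabulary & window.shared
    return words
```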