JP3927800B2

JP3927800B2 - Voice recognition apparatus and method, program, and storage medium

Info

Publication number: JP3927800B2
Application number: JP2001370354A
Authority: JP
Inventors: 寛樹山本
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2001-12-04
Filing date: 2001-12-04
Publication date: 2007-06-13
Anticipated expiration: 2021-12-04
Also published as: JP2003167600A

Description

【０００１】
【発明の属する技術分野】
本発明は、音声認識の結果を、グラフィックユーザインタフェース（ＧＵＩ）を用いて表示する分野に関する。
【０００２】
【従来の技術】
従来より、入力された音声を、複数の語が記憶されている認識辞書を利用して認識する音声認識技術が提案されており、このような音声認識技術を利用して、情報処理装置等に対して、ユーザ所望のコマンドを入力する技術も提案されている。
【０００３】
これらの技術に基づく音声認識装置や音声認識システムにおいて、入力された音声の認識処理が終了するのに応じて、その認識結果に基づいた他の処理が行われる構成の場合には、その認識処理において誤認識された結果が含まれると、係る他の処理の結果にも影響を与えることになる。
【０００４】
このため、音声認識処理の後で行われる処理に重要なコマンドや、誤認識が頻繁に起ることが想定される場合には、音声認識処理が終了した時点で、その認識結果が正しいか否かを確認する処理が必要になる。
【０００５】
このような認識結果の確認処理の一例としては、ある認識結果に対して、例えば、「 ○○でよろしいですか？」というユーザに確認を促すためのメッセージをディスプレイに表示すると共に、「はい」及び「いいえ」のソフトウエア・ボタン（以下、単に「ボタン」と称する）を表示することにより、ユーザに確認のためのボタン操作を促す方法が一般的である。或いは、合成音で同様のメッセージをユーザに対して通知し、比較的認識精度の良いことが一般に知られている「はい」及び「いいえ」の２種類の単語の音声認識を利用して、認識結果のユーザによる確認を行なう方法等がある。
【０００６】
このような手順で音声認識の結果確認を行なう方法によれば、誤認識による誤った処理の実行を防ぐことは可能である。しかしながら、音声認識処理におけるスコア１位の認識結果のみを使用することに起因する低い認識精度の影響によって誤認識が繰り返される場合には、音声認識処理の後で行われる処理においてユーザ所望のコマンドの実行に至るまでに、
（１）システムからの認識結果の確認のためユーザへの通知、
（２）ユーザが認識結果に満足できない場合に行なわなければならない「いいえ」ボタンの選択操作、並びに再認識のための同一音声を再入力、
等の手順の繰り返し作業をユーザに対して強いることになる。
【０００７】
また、従来の音声認識処理においては、スコア１位の結果がユーザにとって誤認識である場合であっても、２位以下の認識結果に正解（ユーザにとって正しい認識結果）が含まれる場合もあるため、この場合、複数の認識結果を、選択候補としてユーザに対して同時に提示すれば、ユーザ所望の何れかの候補を選択可能であるため、上述した煩わしい手順を減らすことができる。
【０００８】
更に、これらの認識結果候補をボタンとして表示し、ボタンが押された際に対応するコマンドを実行するようにすれば、システムからの確認の通知やユーザの「はい」「いいえ」等の応答の手順を省くことができ、所望のコマンドの実行に至るまでの音声入力を含むユーザの操作回数を減らすことができる。
【０００９】
上述した如く音声認識の結果確認を、グラフィックユーザインタフェース（ＧＵＩ）としてディスプレイ上のボタンで表示する方法は、例えば、特開平１０-２１２５４号公報で提案されている。同公報には、音声認識機能を有する情報検索装置が提案されており、その発明の詳細な説明によれば、検索するキーワードを音声で入力し、認識されたキーワードによる検索結果を表示するに際して、音声認識の結果得られる１位の認識結果に対する検索結果が表示されると共に、その１位以下の複数の認識結果候補が候補順に選択ボタンで選択可能に表示される。そして、ユーザは、１位の認識候補が誤りの場合にはマウスを用いて正しい認識結果のボタンを選択する操作を行なうことにより、情報検索用のキーワードとして、正しい音声認識結果を選択することができる。
【００１０】
【発明が解決しようとする課題】
しかしながら、上述した複数の認識結果候補が表示される従来の技術においては、グラフィックユーザインタフェースにおいてユーザがボタンやスイッチを操作する際に、表示画面上に多くのボタンが並んでいたり、ボタンそのものが小さい場合には操作を誤る可能性が高く、特に重要な処理を行なうべく音声認識結果を選択するためのボタンを操作する場合等は注意が必要である。
【００１１】
これに対して、物理的に実在する一般の機器（例えば家電製品等の操作パネルや生産設備の制御パネル等）に付属するボタンやスイッチ等は、例えば重要な処理を行なうためのボタンは大きくしたり、使用頻度の高いボタンを押しやすい位置に配置する等、使用頻度や機能等のボタンの属性によって配置や大きさ、形状、色等を工夫することで、誤操作を防ぐと同時に操作性の向上を図っている。従って、グラフィックユーザインタフェースを用いて音声認識結果を表示する構成の場合にも、同様の工夫を取り入れて、例えば、ある特定のコマンドの認識結果に関しては、表示方法を他の認識結果と変えたり、認識結果のスコアに応じて、表示するＧＵＩのサイズを変えたりすることで誤操作を防いだり、ユーザの操作性を向上することができると考えられるが、従来は、そのような提案はなされていない。
【００１２】
本発明は、上述した課題に鑑みてなされたものであって、音声認識の結果候補をＧＵＩを用いてユーザに提示する際の表示態様を改善することにより、良好な操作性を実現する音声認識装置及び方法、ページ記述言語表示装置及びその制御方法、並びにコンピュータ・プログラムの提供を目的とする。
【００１３】
【課題を解決するための手段】
上記の目的を達成するため、本発明に係る音声認識装置は、以下の構成を特徴とする。
【００１４】
例えば本発明の一側面に係る音声認識装置は、入力された音声を認識することにより認識結果候補及び各認識結果候補の認識スコアを取得する取得手段と、前記取得手段で取得した認識結果候補毎に、前記認識スコアに基づいて、表示手段に選択可能に表示する際の、選択できる領域の大きさを決定する決定手段と、前記決定手段で決定された大きさで、前記認識結果候補を表示するよう制御する表示制御手段とを備えることを特徴とする。
【００２４】
【発明の実施の形態】
以下、本発明に係る音声認識装置の実施形態を、図面を参照して詳細に説明する。
【００２５】
［第１の実施形態］
図１は、第１の実施形態に係る音声認識装置の概略構成を示すブロック図である。
【００２６】
同図に示す音声認識装置において、１００は、マイクロフォン等の音声を入力する音声入力装置である。２００は、本装置を動作させるプログラムおよび本装置の動作に必要なデータや動作の過程で生成されるデータを一時的に格納するROM、RAM、ハードディスク等の記憶装置である。
【００２７】
また、３００は、主に認識結果候補等を表示するために用いるディスプレイ等の表示装置である。４００は、ユーザが操作を入力する際に用いるマウス、キーボード等の操作入力装置である。
【００２８】
５０１は、入力された音声を認識する音声認識部である。５０２は、認識結果の表示態様（表示書式）の決定および表示を制御する表示制御部である。５０３は、ユーザの操作に応じて認識結果を選択する認識結果選択部である。
【００２９】
５０４は、認識結果選択部５０３にて選択された認識結果に基づいて、その認識結果に対応する処理を実行あるいは処理を実行するように他のプログラムを制御する処理制御部である。５０５は、表示態様を設定する表示態様設定部である。
【００３０】
記憶装置２００には、音声認識を行なう際に参照するＨＭＭ等の音響モデル２０１、認識対象となる語の発音情報等を記述した認識辞書２０２、表示制御部５０２で表示態様を決定する方法を記述した表示ルール２０３、並びに認識結果に対応する処理方法を記述した処理ルール２０４が記憶されている。
【００３１】
ここで、本実施形態に係る音声認識装置のハードウエアには、音声入力可能なパーソナル・コンピュータ、携帯情報端末（ＰＤＡ）等の情報処理装置を採用することができる。
【００３２】
次に、上述した構成を備える音声認識装置の動作について、図２を参照して説明する。
【００３３】
図２は、第１の実施形態における音声認識装置の制御処理を示すフローチャートであり、当該音声認識装置の不図示のＣＰＵが行なうところの、図１に示す各処理部に対応するソフトウエア・プログラムに記述された処理手順を示す。
【００３４】
同図において、ステップＳ１０１においてユーザがマイクロフォン等の音声入力装置１００を用いて入力した音声は、記憶装置２００内に記憶されている音響モデル２０１及び認識辞書２０２を用いて音声認識部５０１の機能によって認識されることにより、単数または複数の認識結果が得られる（ステップＳ１０２）。
【００３５】
ステップＳ１０３では、ステップＳ１０２において取得した認識結果を、表示制御部５０２の機能により、表示装置３００への表示態様を決定し、その表示形式に従って、認識結果の選択候補（認識結果候補）として、表示装置３００上に表示する。このとき、表示態様を決定する方法は、表示制御部５０２の機能を記述したプログラム中に記述しても良いし、例えば図５に示すような表示ルール２０３を、記憶装置２００に予め記憶しておいても良い。
【００３６】
図５は、表示ルール２０３の一例を説明する図であり、このルールには、音声認識のスコアを基準に、表示するＧＵＩの種類および表示サイズがルール１として規定され、表示に際しての配置がルール２として規定されている。このような表示ルール２０３の設定は、表示設定部５０５の機能によってユーザが設定することも可能である。
【００３７】
例えば、ステップＳ１０２において、図３に例示するような認識語彙と発音が記述された認識辞書２０２を用いて音声認識処理が行なわれ、スコアの大きい方から上位４つの候補を取得した結果が図４に例示する如くであったとする。
【００３８】
上記の場合、ステップＳ１０３では、図５に例示した表示ルール２０３が参照されることにより、図６に例示するＧＵＩの如く複数の認識結果選択用の候補を例示する図が表示される。即ち、図６の例では、ルール１及び２に従って、スコアが最も大きい「印刷」が大きいサイズのボタンとして表示され、以下３つの候補（認識結果候補）が、順次スコアの値に応じた表示態様のボタンとして表示されている。
【００３９】
ステップＳ１０４では、図６に例示する如く表示された複数のボタンの中から、マウス等の操作入力装置４００を用いて、ユーザによって何れか所望のボタンが選択され、選択されたボタンに対応する語彙（認識結果候補）が、正しい認識結果として設定される。
【００４０】
そしてステップＳ１０５では、選択操作に応じて正しい認識結果として設定語彙に従って、記憶装置２００に記憶されている処理ルール２０４が参照されることにより、該当する処理が実行される。処理ルール２０４は、例えば、図７に例示する如く、設定された認識結果が「印刷」であれば印刷処理が行われる等、認識辞書２０２に記述されている語毎に規定される。
【００４１】
このような本実施形態によれば、音声認識の結果候補をＧＵＩを用いてユーザに提示する際の表示態様が改善されるので、良好な操作性を実現することができる。
【００４２】
尚、上述した本実施形態では、図４に示す認識結果の属性として、個々の語彙の認識結果のスコアを利用したが、これに限られるものではなく、個々の語の重要度、品詞の種類、所定の処理を指示するためのコマンドか否か、情報検索用のキーワードか否か、入力された音声にて使用される頻度、英語・日本語等の言語の種類等の各種の属性を採用することができ、それら属性のうち少なくとも何れか１種類が各認識結果に共通に採用されれば良い（以下の各実施形態においても同様である）。
【００４３】
［第２の実施形態］
次に、上述した第１の実施形態に係る音声認識装置を基本とする第２の実施形態を説明する。以下の説明においては、第１の実施形態と同様な構成については重複する説明を省略し、本実施形態における特徴的な部分を中心に説明する。
【００４４】
第１の実施形態では、主に認識スコアによって表示態様を変更する場合について説明したが、本実施形態では、例えば図８に示す如く、「東京」、「大阪」等の地名は、小さいサイズのノーマルフォントを利用してテキストとして表示し、「終了」、「印刷」等の処理の選択するためのコマンドに対応する語は、ボタンとして表示すると共に、表示に際してのフォントやボタンの大きさもコマンドの重要さの度合いに応じて適宜設定される等のように、各語毎に表示態様が設定されるような表示ルールを用いて表示態様を制御する。
【００４５】
このような本実施形態によっても、音声認識の結果候補をＧＵＩを用いてユーザに提示する際の表示態様が改善されるので、良好な操作性を実現することができる。
【００４６】
［第３の実施形態］
次に、上述した第１の実施形態に係る音声認識装置を基本とする第３の実施形態を説明する。以下の説明においては、第１の実施形態と同様な構成については重複する説明を省略し、本実施形態における特徴的な部分を中心に説明する。
【００４７】
一般に、例えばテキストを編集するアプリケーション等では、音声認識をテキストの入力やアプリケーションを操作するためのコマンドの入力等のように、異なる目的で使用する場合がある。このような場合には、認識結果候補の表示方法として、コマンドはボタンで表示し、テキストはテキストとして表示する方がユーザにとって操作し易い。
【００４８】
また、メニューの表示等のように、ユーザによって選択されたあるコマンドが誤って実行されても操作上特に問題ない場合がある一方で、アプリケーションの終了等にように、ユーザの本来の希望とは異なるコマンドが実行されると復帰するのが困難なコマンドもある。
【００４９】
そこで本実施形態では、実行されるコマンドによって、表示態様を変更することにより、操作性の向上を図る。例えば、アプリケーションを終了するコマンドは、表示する際に他の認識結果候補よりもサイズを大きくする等して視認性を良くすることで、誤操作を避けることができる。
【００５０】
このような表示態様を実現すべく、本実施形態では、図１０に例示するような表示ルールに基づく制御制御を行なう。
【００５１】
図１０に例示した表示ルールにおいて、ルール１では、認識結果候補に対して対応する処理がある場合はボタンで表示し、それ以外の場合はテキストで表示するよう記述されている。また、ルール２では、重要度という認識結果候補の属性を用いて、表示する際のサイズやフォントが規定されている。そして、ルール３では、複数の認識結果候補を表示する際の配置が規定されている。
【００５２】
ここで、重要度は、例えば、図１１のように各語毎に事前に付与しておき、表示ルールに含めても良いし、認識辞書自体に重要度を記述して記憶装置２００に予め記憶しておいても良い。
【００５３】
即ち、図１１の例では、「終了」や「印刷」、「削除」といった誤操作を避けたいコマンドに対応する語には重要度が大きく設定され、図１０に例示した表示ルールで重要度の大きい語についてはサイズを大きく、且つフォントを太くして視認性を良くするよう記述されている。
【００５４】
図１２は、第３の実施形態において、図１０に示す表示ルールに基づいて、図９に示すスコアに基づく認識結果候補が得られた場合に表示される認識結果選択用のＧＵＩを例示する図であり、対応する処理のある認識結果候補についてはボタンが表示され、それ以外の認識結果候補はテキストとして表示されると共に、「終了」や「削除」といった重要度の大きい語の表示サイズが大きくされている。
【００５５】
このような本実施形態によっても、音声認識の結果候補をＧＵＩを用いてユーザに提示する際の表示態様が改善されるので、良好な操作性を実現することができる。
【００５６】
［第４の実施形態］
次に、上述した第１乃至第３の実施形態に係る音声認識装置を基本とする第４の実施形態を説明する。以下の説明においては、上述した各実施形態と同様な構成については重複する説明を省略し、本実施形態における特徴的な部分を中心に説明する。
【００５７】
本実施形態では、上述した第１乃至第３の実施形態で説明した音声認識装置を、ＷＷＷ（ World Wide Web ）ブラウザ等のページ記述言語を表示するページ記述言語表示装置において実現した例について、図１３及び図１４を参照して説明する。
【００５８】
図１３は、第４の実施形態に係るページ記述言語表示装置の概略構成を示すブロック図であり、基本的な装置構成は、図１に示す音声認識装置と同様である。本実施形態においても、ハードウエアには、音声入力可能なパーソナル・コンピュータ、携帯情報端末（ＰＤＡ）等の情報処理装置を採用することができる。
【００５９】
即ち、図１３において、１００は、マイクロフォン等の音声を入力する音声入力装置である。２００は、本装置を動作するプログラムおよび本装置の動作に必要なデータや動作の過程で生成されるデータを一時的に格納するROM、RAM、ハードディスク等の記憶装置である。
【００６０】
３００は、主に認識結果候補等を表示するために用いるディスプレイ等の表示装置である。４００は、ユーザが操作を入力する際に用いるマウス、キーボード等の操作入力装置である。
【００６１】
５０７は、所定のページ記述言語形式のデータ入力を制御し、入力されたページ記述言語を解析し、その解析結果に基づいて、表示装置３００上にページを表示するページ記述制御部である。５０１は、入力された音声を認識する音声認識部である。５０６は、音声認識部５０１による音声認識結果を表示するためのページ記述言語のデータを生成するページ記述データ生成部である。
【００６２】
５０３は、表示装置３００上に表示された認識結果候補の中からユーザの操作に応じて、所望の何れかを選択する認識結果選択部である。５０４は、認識結果選択部５０３において選択された認識結果候補に基づいて、その認識結果候補に対応する処理を実行あるいは処理を実行するように他のプログラムを制御する処理制御部である。
【００６３】
記憶装置２００には、音声認識を行なう際に参照するＨＭＭ等の音響モデル２０１、認識対象となる語の発音情報等を記述した認識辞書２０２、ページ記述データ生成部５０６で表示態様を決定するための方法を記述した表示ルール２０３、並びに認識結果に対応する処理方法を記述した処理ルール２０４が記憶されている。ここで、記憶装置２００に記憶されている上記２０１乃至２０４の音響モデルをはじめとする各種データは、本装置とは構成を別にするＷＷＷサーバ等の記憶装置から、インターネット等の通信ネットワークを介して読み出し可能な構成としても良い。
【００６４】
次に、上述した構成を備えるページ記述言語表示装置の動作について、第１の実施形態と同じ表示ルール、処理ルール、並びに認識結果の例を用いて、図１４を参照して説明する。
【００６５】
図１４は、第４の実施形態におけるページ記述言語表示装置の制御処理を示すフローチャートであり、当該ページ記述言語表示装置の不図示のＣＰＵが行なうところの、図１３に示す各処理部に対応するソフトウエア・プログラムに記述された処理手順を示す。
【００６６】
同図において、ステップＳ２０１では、本装置にインターネットのサーバ上あるいは本装置内の記憶装置２００上に記憶されたページ記述言語のデータが入力される。入力されたページ記述言語のデータは、ページ記述言語制御部５０６の機能によって解析され（ステップＳ２０２）、ステップＳ２０３では、その解析結果に基づいて、当該ページ記述言語のデータの記述内容に応じたページが表示装置３００上に表示される。
【００６７】
次に、ステップＳ２０２におけるページ記述言語の解析の過程で、そのページ記述言語のデータの中に、音声認識を行なうタグが記述されていたかを判断し（ステップＳ２０４）、記述されていなかった場合には処理を終了し、記述されていた場合には、ステップＳ２０５において、ユーザがマイクロフォン等の音声入力装置１００を用いて入力した音声を受け付ける。
【００６８】
尚、本実施形態において、ページ記述言語のデータに含まれる音声認識を行なうタグ以外のタグについては、ＷＷＷブラウザ等の一般的なページ記述言語の表示装置と同様な機能を有するものとする。
【００６９】
ステップＳ２０６において、ステップＳ２０５において入力された音声は、記憶装置２００内に記憶されている音響モデル２０１及び認識辞書２０２を用いて、音声認識部５０１の機能によって認識されることにより、単数または複数の認識結果候補が得られる。
【００７０】
ステップＳ２０７では、ページ記述データ生成部５０６の機能により、音声認識部５０１にて取得した認識結果候補（図４）に基づいて、第１の実施形態と同様に表示装置３００への表示態様が決定されると共に、更に、決定された表示態様の内容に対応するページ記述言語のデータが生成される。ページ記述データ生成部５０６の機能によって生成されたデータは、ページ記述制御部５０７に設定される。
【００７１】
ステップＳ２０８では、ページ記述制御部５０７の機能により、ステップＳ２０７にて生成されたページ記述言語のデータが解析され、ステップＳ２０９では、その解析結果に基づいて、表示装置３００上にページが表示される。このとき、表示装置３００上に表示するページの表示態様を決定する方法は、第１の実施形態と同様にプログラム中に記述しても良いし、図５に示したような表示ルール２０３を記憶装置２００に記憶しておき、その表示ルールを参照しても良い。表示用ルール２０３の設定は、表示設定部５０５の機能を利用してユーザが設定する構成としても良い。
【００７２】
図１５は、第４の実施形態において音声認識を実行するためのタグを例示する図である。
【００７３】
同図において、斜体で示した部分が本実施形態に係る音声認識タグの一例であり、「＜SpeechRcog ..... ＞」が音声認識による入力を実行するための記述であり、本実施形態において、「＜SpeechRcog ..... ＞」は、「音声認識して、認識した結果を表示する」と解釈するものとする。
【００７４】
また、本実施形態に係るページ記述言語表示装置では、音声認識で使用する認識辞書２０１及び音響モデル２０２を、「 grammar 」、「 acousticmodel 」なる記述によって指定することが可能である。更に、第１の実施形態で述べた表示ルール２０３及び処理ルール２０４を、「 resulttemplate 」、「 actiontable」なる記述によって指定できるものとする。
【００７５】
即ち、図１５に示す例では、音声認識部５０１の機能により、「＜SpeechRcog ..... ＞」というタグに従って、認識辞書「 command.gra 」及び音響モデル「 phone.mdl 」を用いて音声認識を行なうと共に、ページ記述データ生成部５０６の機能により、表示ルール「 type１.dat 」及び処理ルール「 command.tbl」を参照して、認識結果候補を表示するためのページ記述言語のデータを生成することが表わされている。
【００７６】
図１６は、第４の実施形態において、図５に示した表示ルール２０３に基づいて、ページ記述データ生成部５０６の機能によって生成されたページ記述言語データを例示する図である。
【００７７】
同図において、斜体で示した部分は、一般のページ記述言語の仕様を拡張した部分であり、本実施形態におけるページ記述言語表示装置では、「 input type = mybotton 」なる記述と、「 size 」なる記述とによって、ボタンの表示と、表示する際のボタンのサイズを指定することが可能である。また、「＜p＞……＜/p＞」で囲まれた範囲が、一行のボタンとして表示されるように解釈される。本実施形態では、係る拡張仕様が解釈されることにより、図１６に従って表示装置３００に認識結果候補が表示されると、第１の実施形態の場合と同様に、図６に示す表示例が表示される。
【００７８】
ここで、再び図１４のフローチャートの説明に戻る。ステップＳ２１０では、表示装置３００に表示された認識結果候補に対して、認識結果選択部５０３の機能により、マウス等の操作入力装置４００を利用して、ユーザが所望の認識結果候補を選択する。
【００７９】
ステップＳ２１１では、処理制御部５０４の機能により、選択された認識結果候補に対応する処理が実行される。ここで、認識結果候補と処理との対応関係は、プログラム中に記述しても良いし、図７に例示するような対応関係が記述された処理ルール２０４を、記憶装置２００上に予め記憶しておき、処理制御部５０４によって処理が実行される際に参照するようにしても良い。
【００８０】
図１６に示したページ記述言語のデータの例では、表示装置３００に表示されたボタンがユーザによって押下されたときに、「 name 」で指定された環境変数「 com 」に、「 value 」で指定された値（同図では検索、印刷設定、編集に相当）が代入され、「 ExecuteCommand 」というプログラムの実行が開始される。この「 ExecuteCommand 」なる記述は、処理制御部５０４に該当し、係る「 ExecuteCommand 」及びプログラムは、「 name 」で指定された環境変数から「 com」に代入された値を取り出し、取り出した値に該当する処理を、処理ルール２０４を参照することによって特定した上で実行する。
【００８１】
このような本実施形態によっても、音声認識の結果候補をＧＵＩを用いてユーザに提示する際の表示態様が改善されるので、良好な操作性を実現することができる。
【００８２】
以上説明したように、上述した各実施形態によれば、単数または複数の音声認識結果の候補を、表示装置３００にソフトウエアボタンやテキストを含むＧＵＩを用いて表示する場合に、個々の音声認識結果候補の認識スコアや重要度等の属性に基づいて、認識結果候補毎に表示する書式が決定されるので、音声認識を用いたユーザインタフェースやコマンド入力の操作性を向上することができる。
【００８３】
尚、上述した各実施形態において、図６及び図１２に例示したＧＵＩでは、そのＧＵＩに表示された複数の候補の中からユーザ所望のものを、大きさやフォントが異なるソフトウエアボタンを用いて選択可能に構成したが、この装置構成に限られるものではなく、例えば、大きさやフォントが異なる各選択候補の表示エリア内または近傍に設けた所謂ラジオボタンやチェックボックス等によって選択可能に構成しても良い。
【００８４】
また、上述した各実施形態において、図６及び図１２に例示したＧＵＩでは、そのＧＵＩに表示された複数の候補の中からユーザ所望のものを選択するに際して、大きさやフォントが異なるソフトウエアボタンを用いて選択可能に構成することによってユーザに対する操作性を向上したが、この装置構成に限られるものではなく、例えば、ボタンの表示色、表示するボタンの形状等を適宜変更することによっても、操作性を向上することができる。
【００８５】
【他の実施形態】
尚、本発明の目的は、前述した実施形態の機能を実現するソフトウェアのプログラムコードを記録した記憶媒体（または記録媒体）を、上述した音声認識装置として動作するパーソナル・コンピュータや携帯情報端末等の情報処理装置に供給し、それらシステムあるいは装置のコンピュータ（またはCPUやMPU）が記憶媒体に格納されたプログラムコードを読み出し実行することによっても達成される。この場合、記憶媒体から読み出されたプログラムコード自体が前述した実施形態の機能を実現することになり、そのプログラムコードを記憶した記憶媒体、並びに電気通信回線等を介してコンピュータ・プログラム製品として取得した当該プログラムコードは、本発明を構成することになる。
【００８６】
また、コンピュータが読み出したプログラムコードを実行することにより、前述した実施形態の機能が実現されるだけでなく、そのプログラムコードの指示に基づき、コンピュータ上で稼働しているオペレーティングシステム(OS)等が実際の処理の一部または全部を行い、その処理によって前述した実施形態の機能が実現される場合も含まれる。
【００８７】
更に、記憶媒体から読み出されたプログラムコードが、コンピュータに挿入された機能拡張カードやコンピュータに接続された機能拡張ユニットに備わるメモリに書込まれた後、そのプログラムコードの指示に基づき、その機能拡張カードや機能拡張ユニットに備わるCPU等が実際の処理の一部または全部を行い、その処理によって前述した実施形態の機能が実現される場合も含まれる。
【００８８】
【発明の効果】
以上説明した本発明によれば、音声認識の結果候補をＧＵＩを用いてユーザに提示する際の表示態様を改善することにより、良好な操作性を実現する音声認識装置及び方法、ページ記述言語表示装置及びその制御方法、並びにコンピュータ・プログラムの提供が実現する。
【図面の簡単な説明】
【図１】第１の実施形態に係る音声認識装置の概略構成を示すブロック図である。
【図２】第１の実施形態における音声認識装置の制御処理を示すフローチャートである。
【図３】第１の実施形態における認識辞書２０２の構成例を説明する図である。
【図４】第１の実施形態におけるスコアに基づく認識結果候補の一例を説明する図である。
【図５】第１の実施形態における表示ルール２０３の一例を説明する図である。
【図６】第１の実施形態において表示装置３００に表示される認識結果選択用のＧＵＩの表示態様を例示する図である。
【図７】第１の実施形態における処理ルール２０４の構成例を説明する図である。
【図８】第２の実施形態における表示ルール２０３の一例を説明する図である。
【図９】第２の実施形態におけるスコアに基づく認識結果候補の一例を説明する図である。
【図１０】第３の実施形態における表示ルール２０３の一例を説明する図である。
【図１１】第３の実施形態における語毎の重要度の設定例を説明する図である。
【図１２】第３の実施形態において表示装置３００に表示される認識結果選択用のＧＵＩの表示態様を例示する図である。
【図１３】第４の実施形態に係るページ記述言語表示装置の概略構成を示すブロック図である。
【図１４】第４の実施形態におけるページ記述言語表示装置の制御処理を示すフローチャートである。
【図１５】第４の実施形態において音声認識を実行するためのタグを例示する図である。
【図１６】第４の実施形態において、図５に示した表示ルール２０３に基づいて、ページ記述データ生成部５０６の機能によって生成されたページ記述言語データを例示する図である。
[0001]
[Technical Field of the Invention]
The present invention relates to displaying speech recognition results using a graphical user interface (GUI).
[0002]
[Prior art]
Speech recognition technology that recognizes input speech by using a recognition dictionary in which a plurality of words is stored has conventionally been proposed. Techniques that use such speech recognition technology to input a user-desired command to an information processing apparatus or the like have also been proposed.
[0003]
In a speech recognition device or speech recognition system based on these technologies, other processing based on the recognition result may be performed once recognition of the input speech ends. In such a configuration, if the recognition processing yields a misrecognized result, the result of that other processing is affected as well.
[0004]
For this reason, when the command is important to the processing performed after speech recognition, or when misrecognition is expected to occur frequently, processing that confirms whether the recognition result is correct is needed at the point when the speech recognition processing ends.
[0005]
As an example of such recognition result confirmation processing, a common method is to display on the display a message that prompts the user to confirm a given recognition result, for example “Is ○○ correct?”, together with “Yes” and “No” software buttons (hereinafter simply referred to as “buttons”), thereby prompting the user to confirm by operating a button. Alternatively, there is a method in which a similar message is conveyed to the user by synthesized speech and the user confirms the recognition result by voice, using speech recognition of the two words “Yes” and “No”, which are generally known to be recognized with relatively high accuracy.
[0006]
Confirming the speech recognition result by such a procedure makes it possible to prevent erroneous processing caused by misrecognition. However, when misrecognition is repeated because of the low recognition accuracy that results from using only the top-scoring recognition result, the user is forced to repeat procedures such as the following before the desired command is finally executed in the processing performed after speech recognition:
(1) notification from the system to the user for confirmation of the recognition result,
(2) selection of the “No” button, which the user must perform when not satisfied with the recognition result, and re-input of the same speech for re-recognition.
[0007]
Further, in conventional speech recognition processing, even if the top-scoring result is a misrecognition from the user's point of view, the correct answer (the recognition result that is correct for the user) may be included among the results ranked second or lower. In this case, if a plurality of recognition results are presented to the user simultaneously as selection candidates, the user can select the desired candidate, so the troublesome procedure described above can be reduced.
[0008]
Furthermore, if these recognition result candidates are displayed as buttons and the corresponding command is executed when a button is pressed, the confirmation notification from the system and the user's “Yes” or “No” response can be omitted, and the number of user operations, including voice input, needed to execute a desired command can be reduced.
[0009]
A method for displaying the results of speech recognition as buttons on a display in a graphical user interface (GUI), as described above, is proposed in, for example, Japanese Patent Laid-Open No. 10-21254. That publication proposes an information search device having a speech recognition function. According to its detailed description, a keyword to be searched is input by voice, and when the search result for the recognized keyword is displayed, the search result for the first-ranked recognition result is displayed together with a plurality of recognition result candidates ranked first and lower, shown in candidate order so as to be selectable with selection buttons. When the first-ranked recognition candidate is incorrect, the user can select the correct speech recognition result as the keyword for the information search by using a mouse to select the button for the correct recognition result.
[0010]
[Problems to be solved by the invention]
However, in the conventional technique described above in which a plurality of recognition result candidates are displayed, when the user operates buttons or switches in the graphical user interface, operating errors are likely if many buttons are lined up on the display screen or if the buttons themselves are small. Care is required particularly when operating a button that selects a speech recognition result in order to perform important processing.
[0011]
In contrast, for the buttons and switches attached to ordinary physical devices (for example, operation panels of home appliances and control panels of production equipment), the arrangement, size, shape, color, and so on are devised according to button attributes such as frequency of use and function; for example, buttons for important processing are made larger and frequently used buttons are placed where they are easy to press. This prevents erroneous operation and at the same time improves operability. Accordingly, it is considered that similar measures could be adopted when speech recognition results are displayed using a graphical user interface, for example by displaying the recognition result for a specific command differently from other recognition results, or by changing the size of the displayed GUI element according to the score of the recognition result, thereby preventing erroneous operation and improving user operability. However, no such proposal has been made so far.
[0012]
The present invention has been made in view of the above-described problems, and its object is to provide a speech recognition apparatus and method, a page description language display apparatus and control method thereof, and a computer program that realize good operability by improving the display mode used when speech recognition result candidates are presented to the user using a GUI.
[0013]
[Means for Solving the Problems]
In order to achieve the above object, a speech recognition apparatus according to the present invention is characterized by the following configuration.
[0014]
For example, a speech recognition apparatus according to one aspect of the present invention comprises: acquisition means for acquiring recognition result candidates, and a recognition score for each recognition result candidate, by recognizing input speech; determination means for determining, for each recognition result candidate acquired by the acquisition means and based on the recognition score, the size of the selectable area used when the candidate is displayed selectably on display means; and display control means for controlling display so that the recognition result candidates are displayed at the size determined by the determination means.
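The determination means described above can be pictured with a short sketch. This is a minimal illustration under stated assumptions, not the claimed implementation: the `Candidate` type, the `determine_area` function, the pixel bounds, and the linear score-to-size mapping are all hypothetical, since the text only requires that the size of the selectable area be determined from the recognition score.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    word: str     # recognition result candidate
    score: float  # recognition score (higher is better)


def determine_area(cands, min_px=40, max_px=120):
    """Map each candidate's score to a selectable-area width in pixels.

    Hypothetical rule: scale linearly between min_px and max_px over the
    score range of the current candidate list.
    """
    if not cands:
        return {}
    lo = min(c.score for c in cands)
    hi = max(c.score for c in cands)
    span = hi - lo
    sizes = {}
    for c in cands:
        # A single candidate (or equal scores) gets the maximum size.
        ratio = 1.0 if span == 0 else (c.score - lo) / span
        sizes[c.word] = round(min_px + ratio * (max_px - min_px))
    return sizes


# Scores are invented for illustration only.
candidates = [Candidate("print", 90), Candidate("search", 60), Candidate("edit", 30)]
sizes = determine_area(candidates)
```

A display control step would then draw each candidate with the width found in `sizes`, so that the likeliest candidate presents the largest target.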
[0024]
[Embodiments of the Invention]
Hereinafter, embodiments of a speech recognition apparatus according to the present invention will be described in detail with reference to the drawings.
[0025]
[First Embodiment]
FIG. 1 is a block diagram illustrating a schematic configuration of the speech recognition apparatus according to the first embodiment.
[0026]
In the speech recognition apparatus shown in FIG. 1, reference numeral 100 denotes a voice input device, such as a microphone, for inputting speech. Reference numeral 200 denotes a storage device such as a ROM, a RAM, or a hard disk that temporarily stores the program for operating the apparatus, the data necessary for its operation, and data generated in the course of operation.
[0027]
Reference numeral 300 denotes a display device such as a display used mainly for displaying recognition result candidates and the like. Reference numeral 400 denotes an operation input device such as a mouse or a keyboard used when a user inputs an operation.
[0028]
Reference numeral 501 denotes a voice recognition unit that recognizes input speech. Reference numeral 502 denotes a display control unit that determines the display mode (display format) of the recognition results and controls their display. Reference numeral 503 denotes a recognition result selection unit that selects a recognition result in accordance with a user operation.
[0029]
Reference numeral 504 denotes a process control unit that executes a process corresponding to the recognition result based on the recognition result selected by the recognition result selection unit 503 or controls another program so as to execute the process. Reference numeral 505 denotes a display mode setting unit that sets a display mode.
[0030]
The storage device 200 stores an acoustic model 201, such as an HMM, that is referred to when speech recognition is performed; a recognition dictionary 202 that describes pronunciation information and the like for the words to be recognized; a display rule 203 that describes the method by which the display control unit 502 determines the display mode; and a processing rule 204 that describes the processing method corresponding to each recognition result.
[0031]
Here, an information processing apparatus such as a personal computer capable of inputting voice or a personal digital assistant (PDA) can be employed as the hardware of the voice recognition apparatus according to the present embodiment.
[0032]
Next, the operation of the speech recognition apparatus having the above-described configuration will be described with reference to FIG. 2.
[0033]
FIG. 2 is a flowchart showing the control processing of the speech recognition apparatus in the first embodiment. It shows the processing procedure, described in the software programs corresponding to the processing units shown in FIG. 1, that is performed by a CPU (not shown) of the speech recognition apparatus.
[0034]
In the figure, speech input by the user in step S101 using the voice input device 100, such as a microphone, is recognized by the voice recognition unit 501 using the acoustic model 201 and the recognition dictionary 202 stored in the storage device 200, which yields one or more recognition results (step S102).
[0035]
In step S103, the display control unit 502 determines the display mode on the display device 300 for the recognition results acquired in step S102 and, according to that display format, displays them on the display device 300 as selection candidates for the recognition result (recognition result candidates). The method of determining the display mode may be written into the program that implements the function of the display control unit 502, or a display rule 203 such as the one shown in FIG. 5 may be stored in the storage device 200 in advance.
[0036]
FIG. 5 is a diagram for explaining an example of the display rule 203. In this rule, the type of GUI element to be displayed and its display size are defined as rule 1 on the basis of the speech recognition score, and the arrangement at the time of display is defined as rule 2. The display rule 203 can also be set by the user through the function of the display mode setting unit 505.
[0037]
For example, assume that in step S102 the speech recognition processing is performed using a recognition dictionary 202 in which recognition vocabulary and pronunciations are described as illustrated in FIG. 3, and that the top four candidates in descending order of score are obtained as illustrated in FIG. 4.
[0038]
In this case, in step S103 the display rule 203 illustrated in FIG. 5 is referred to, and a plurality of candidates for selecting the recognition result are displayed as in the GUI illustrated in FIG. 6. That is, in the example of FIG. 6, in accordance with rules 1 and 2, “print”, which has the highest score, is displayed as a large button, and the following three candidates (recognition result candidates) are displayed as buttons whose display mode corresponds in turn to their score values.
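The way a stored display rule drives step S103 can be sketched as follows. The contents of FIG. 5 are not reproduced in the text, so the score thresholds, size labels, the `DISPLAY_RULE` structure, and the function name `apply_display_rule` are all assumptions; the scores are likewise invented, and only the fact that “print” ranks first among four candidates comes from the description.

```python
# Hypothetical stand-in for display rule 203 (FIG. 5 thresholds are not given).
DISPLAY_RULE = {
    # Rule 1: (minimum score, GUI type, size label), checked from the top.
    "rule1": [(80, "button", "large"), (50, "button", "medium"), (0, "button", "small")],
    # Rule 2: arrangement of the candidates on screen.
    "rule2": "descending_score",
}


def apply_display_rule(results, rule=DISPLAY_RULE):
    """Return display entries (word, GUI type, size) in display order."""
    if rule["rule2"] == "descending_score":
        results = sorted(results, key=lambda r: r[1], reverse=True)
    layout = []
    for word, score in results:
        for min_score, gui, size in rule["rule1"]:
            if score >= min_score:
                layout.append({"word": word, "gui": gui, "size": size})
                break  # first matching row of rule 1 wins
    return layout


# Invented scores for the four candidates.
layout = apply_display_rule(
    [("search", 62), ("print", 91), ("edit", 30), ("print settings", 45)]
)
```

Keeping the rule as data rather than code mirrors the option, mentioned in the description, of storing display rule 203 in the storage device so that it can be changed without touching the program.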
[0039]
In step S104, the user selects a desired button from among the plurality of buttons displayed as illustrated in FIG. 6 by using the operation input device 400 such as a mouse, and the word (recognition result candidate) corresponding to the selected button is set as the correct recognition result.
[0040]
In step S105, the processing rule 204 stored in the storage device 200 is referred to in accordance with the word set as the correct recognition result by the selection operation, and the corresponding process is executed. The processing rule 204 is defined for each word described in the recognition dictionary 202; for example, as illustrated in FIG. 7, printing is performed if the set recognition result is “print”.
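Steps S104 and S105 amount to a table lookup followed by a call. The sketch below assumes a processing rule held as a dictionary from word to handler; the handler functions, their return strings, and the name `execute_selected` are hypothetical, and FIG. 7 itself is not reproduced in the text.

```python
def do_print():
    # Placeholder for the real printing process.
    return "printing"


def do_search():
    # Placeholder for the real search process.
    return "searching"


# Hypothetical stand-in for processing rule 204: one entry per dictionary word.
PROCESSING_RULE = {"print": do_print, "search": do_search}


def execute_selected(word, rule=PROCESSING_RULE):
    """Run the process registered for the word the user selected (step S105)."""
    handler = rule.get(word)
    if handler is None:
        return None  # no process is defined for this word
    return handler()


result = execute_selected("print")
```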
[0041]
According to the present embodiment as described above, since the display mode when the result of speech recognition is presented to the user using the GUI is improved, good operability can be realized.
[0042]
In the present embodiment described above, the recognition score of each word is used as the attribute of the recognition results shown in FIG. 4. However, the attribute is not limited to this. Various attributes can be adopted, such as the importance of each word, the part of speech, whether the word is a command for instructing a predetermined process, whether it is a keyword for information retrieval, how frequently it is used in input speech, and the language (English, Japanese, and so on). It suffices that at least one of these attributes is applied in common to all recognition results (the same applies to the following embodiments).
[0043]
[Second Embodiment]
Next, a second embodiment based on the speech recognition apparatus according to the first embodiment described above will be described. In the following description, redundant description of configurations similar to those of the first embodiment is omitted, and the description focuses on the characteristic parts of the present embodiment.
[0044]
The first embodiment mainly described the case where the display mode is changed according to the recognition score. In the present embodiment, the display mode is controlled using a display rule in which a display mode is set for each word, as shown for example in FIG. 8: place names such as “Tokyo” and “Osaka” are displayed as text in a small normal font, while words corresponding to commands for selecting a process, such as “End” and “Print”, are displayed as buttons, with the font and button size at the time of display set as appropriate according to the importance of the command.
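A per-word display rule of this kind can be sketched as a lookup table. The four words and the text/button split follow the description; the particular size and font values given to “End” and “Print”, the fallback style, and the names `WORD_DISPLAY_RULE`, `DEFAULT_STYLE`, and `style_for` are assumptions, since FIG. 8 is not reproduced in the text.

```python
# Hypothetical per-word display rule in the spirit of FIG. 8.
WORD_DISPLAY_RULE = {
    "Tokyo": {"gui": "text", "size": "small", "font": "normal"},
    "Osaka": {"gui": "text", "size": "small", "font": "normal"},
    # Commands are shown as buttons; size and font reflect assumed importance.
    "End": {"gui": "button", "size": "large", "font": "bold"},
    "Print": {"gui": "button", "size": "medium", "font": "normal"},
}

# Assumed fallback for words that have no entry in the rule.
DEFAULT_STYLE = {"gui": "text", "size": "small", "font": "normal"}


def style_for(word):
    """Look up the display mode set for a recognized word."""
    return WORD_DISPLAY_RULE.get(word, DEFAULT_STYLE)


styles = [style_for(w) for w in ("Tokyo", "End")]
```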
[0045]
Also according to this embodiment, since the display mode when presenting the speech recognition result candidates to the user using the GUI is improved, good operability can be realized.
[0046]
[Third Embodiment]
Next, a third embodiment based on the speech recognition apparatus according to the first embodiment described above will be described. In the following description, redundant description of configurations similar to those of the first embodiment is omitted, and the description focuses on the characteristic parts of the present embodiment.
[0047]
In general, in an application that edits text, for example, speech recognition may be used for different purposes, such as inputting text and inputting commands for operating the application. In such a case, the recognition result candidates are easier for the user to operate if commands are displayed as buttons and text is displayed as text.
[0048]
In addition, while some commands, such as displaying a menu, cause no particular operational problem even if executed by mistake, there are also commands, such as terminating an application, from which it is difficult to recover if a command other than the one the user actually wanted is executed.
[0049]
Therefore, in the present embodiment, operability is improved by changing the display mode according to the command to be executed. For example, erroneous operation can be avoided by improving the visibility of the command that terminates an application, such as by displaying it at a larger size than the other recognition result candidates.
[0050]
In order to realize such a display mode, in the present embodiment, control based on a display rule as illustrated in FIG. 10 is performed.
[0051]
In the display rule illustrated in FIG. 10, rule 1 specifies that a recognition result candidate is displayed as a button when there is a process corresponding to it, and as text otherwise. Rule 2 defines the size and font used for display by means of an attribute of the recognition result candidate called importance. Rule 3 defines the arrangement used when a plurality of recognition result candidates are displayed.
[0052]
Here, the importance may be assigned to each word in advance as shown in FIG. 11 and included in the display rule, or the importance may be described in the recognition dictionary itself and stored in the storage device 200 in advance.
[0053]
That is, in the example of FIG. 11, a high importance is set for words corresponding to commands for which erroneous operation should be avoided, such as “End”, “Print”, and “Delete”, and the display rule illustrated in FIG. 10 specifies that words of high importance are displayed at a large size and in a bold font to improve visibility.
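The three rules of this embodiment can be combined in a few lines. This is a sketch under stated assumptions: the importance values for “End”, “Print”, and “Delete” follow the description, while the “Menu” entry, the two-level high/low scale, the use of descending score order for rule 3, the scores, and the name `plan_display` are hypothetical.

```python
# Hypothetical importance table in the spirit of FIG. 11.
IMPORTANCE = {"End": "high", "Print": "high", "Delete": "high", "Menu": "low"}
# Words that have a corresponding process (assumed to be the keys above).
HAS_PROCESS = set(IMPORTANCE)


def plan_display(candidates):
    """Apply rules 1 to 3 and return (word, GUI type, size, font) tuples."""
    plan = []
    # Rule 3 (arrangement): assumed here to be descending score order.
    for word, _score in sorted(candidates, key=lambda c: c[1], reverse=True):
        # Rule 1: button if a process exists for the word, otherwise text.
        gui = "button" if word in HAS_PROCESS else "text"
        # Rule 2: high-importance words are drawn large and bold.
        important = IMPORTANCE.get(word) == "high"
        size = "large" if important else "normal"
        font = "bold" if important else "normal"
        plan.append((word, gui, size, font))
    return plan


# Invented scores for illustration.
plan = plan_display([("End", 70), ("Tokyo", 85), ("Menu", 40)])
```

Note that size here depends on importance rather than score, so a hard-to-undo command such as “End” stays conspicuous even when it is not the top-ranked candidate.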
[0054]
FIG. 12 is a diagram illustrating the recognition result selection GUI displayed in the third embodiment when recognition result candidates based on the scores shown in FIG. 9 are obtained and the display rule shown in FIG. 10 is applied. Recognition result candidates that have a corresponding process are displayed as buttons, the other recognition result candidates are displayed as text, and words of high importance such as “End” and “Delete” are displayed at a larger size.
[0055]
Also according to this embodiment, since the display mode when presenting the speech recognition result candidates to the user using the GUI is improved, good operability can be realized.
[0056]
[Fourth Embodiment]
Next, a fourth embodiment based on the speech recognition apparatus according to the first to third embodiments described above will be described. In the following description, the description similar to that of each of the above-described embodiments will be omitted, and description will be made focusing on characteristic portions in the present embodiment.
[0057]
In this embodiment, an example will be described in which the speech recognition apparatus of the first to third embodiments is realized in a page description language display device, such as a WWW (World Wide Web) browser, that displays pages written in a page description language. The description refers to FIG. 13 and the subsequent figures.
[0058]
FIG. 13 is a block diagram showing a schematic configuration of a page description language display device according to the fourth embodiment. The basic device configuration is the same as that of the speech recognition apparatus described above. Also in the present embodiment, an information processing apparatus capable of voice input, such as a personal computer or a personal digital assistant (PDA), can be employed as hardware.
[0059]
That is, in FIG. 13, reference numeral 100 denotes a voice input device, such as a microphone, for inputting voice. Reference numeral 200 denotes a storage device, such as a ROM, a RAM, or a hard disk, that temporarily stores a program for operating the apparatus, data necessary for the operation of the apparatus, and data generated in the course of the operation.
[0060]
Reference numeral 300 denotes a display device such as a display mainly used for displaying recognition result candidates and the like. Reference numeral 400 denotes an operation input device such as a mouse or a keyboard used when a user inputs an operation.
[0061]
A page description control unit 507 controls data input in a predetermined page description language format, analyzes the input page description language, and displays a page on the display device 300 based on the analysis result. Reference numeral 501 denotes a voice recognition unit that recognizes input voice. Reference numeral 506 denotes a page description data generation unit that generates page description language data for displaying a voice recognition result by the voice recognition unit 501.
[0062]
Reference numeral 503 denotes a recognition result selection unit that selects any desired one of the recognition result candidates displayed on the display device 300 in accordance with a user operation. Based on the recognition result candidate selected by the recognition result selection unit 503, a processing control unit 504 executes a process corresponding to the recognition result candidate or controls another program so as to execute the process.
[0063]
The storage device 200 stores an acoustic model 201, such as an HMM, that is referred to when performing speech recognition; a recognition dictionary 202 describing pronunciation information of the words to be recognized; a display rule 203 describing the method by which the page description data generation unit 506 determines the display mode; and a processing rule 204 describing the processing method corresponding to each recognition result. The various data stored in the storage device 200, including the acoustic model 201 through the processing rule 204, may instead be read via a communication network such as the Internet from a storage device, such as a WWW server, that is configured separately from the present device.
[0064]
Next, the operation of the page description language display device having the above-described configuration will be described with reference to FIG. 14 using the same display rules, processing rules, and recognition results as those in the first embodiment.
[0065]
FIG. 14 is a flowchart showing a control process of the page description language display device in the fourth embodiment. It shows the processing procedure of a software program that is executed by a CPU (not shown) of the page description language display device and that corresponds to the processing units shown in FIG. 13.
[0066]
In the figure, in step S201, page description language data stored on an Internet server or in the storage device 200 in the apparatus is input to the apparatus. The input page description language data is analyzed by the function of the page description control unit 507 (step S202). In step S203, a page corresponding to the description content of the page description language data is displayed on the display device 300 based on the analysis result.
[0067]
Next, it is determined whether a tag for performing speech recognition was found in the page description language data during the analysis in step S202 (step S204). If no such tag is described, the process ends; if one is described, in step S205 the voice input by the user using the voice input device 100, such as a microphone, is accepted.
[0068]
In the present embodiment, tags other than the tag for performing speech recognition included in the page description language data have the same functions as those of a general page description language display device such as a WWW browser.
[0069]
In step S206, the voice input in step S205 is recognized by the function of the voice recognition unit 501 using the acoustic model 201 and the recognition dictionary 202 stored in the storage device 200, so that one or a plurality of recognition result candidates are obtained.
[0070]
In step S207, the function of the page description data generation unit 506 determines the display mode on the display device 300 as in the first embodiment, based on the recognition result candidates (FIG. 4) acquired by the voice recognition unit 501, and generates page description language data corresponding to the determined display mode. The data generated by the function of the page description data generation unit 506 is set in the page description control unit 507.
[0071]
In step S208, the page description language data generated in step S207 is analyzed by the function of the page description control unit 507. In step S209, a page is displayed on the display device 300 based on the analysis result. At this time, the method for determining the display mode of the page to be displayed on the display device 300 may be described in the program as in the first embodiment, or a display rule 203 as shown in FIG. 5 may be stored in the storage device 200 and referred to. The display rule 203 may be set by the user using the function of the display setting unit 505.
[0072]
FIG. 15 is a diagram illustrating a tag for executing speech recognition in the fourth embodiment.
[0073]
In the figure, the part shown in italics is an example of a speech recognition tag according to the present embodiment, and “<SpeechRcog .....>” is a description for executing input by speech recognition. In the above, “<SpeechRcog .....>” is interpreted as “recognize speech and display the recognized result”.
[0074]
In the page description language display device according to the present embodiment, the recognition dictionary 202 and the acoustic model 201 used for speech recognition can be specified by the descriptions “grammar” and “acousticmodel”. Furthermore, the display rule 203 and the processing rule 204 described in the first embodiment can be specified by the descriptions “resulttemplate” and “actiontable”.
[0075]
That is, the example shown in FIG. 15 indicates that, in accordance with the tag “<SpeechRcog .....>”, the function of the voice recognition unit 501 performs speech recognition using the recognition dictionary “command.gra” and the acoustic model “phone.mdl”, and the function of the page description data generation unit 506 generates page description language data for displaying the recognition result candidates with reference to the display rule “type1.dat” and the processing rule “command.tbl”.
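As a rough illustration of the determination in step S204, the following Python sketch scans page description language data for a speech recognition tag and collects its attributes. FIG. 15 is not reproduced in this text, so the exact tag syntax in `PAGE` is an assumption; only the tag name, the four attribute names, and the four file names come from the description. The class `SpeechTagFinder` and the function `find_speech_tags` are hypothetical names.

```python
from html.parser import HTMLParser


class SpeechTagFinder(HTMLParser):
    """Collects the attributes of every speech recognition tag in a page."""

    def __init__(self):
        super().__init__()
        self.speech_tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before this call.
        if tag == "speechrcog":
            self.speech_tags.append(dict(attrs))


def find_speech_tags(page: str):
    """Return one attribute dict per speech recognition tag (empty list if none)."""
    finder = SpeechTagFinder()
    finder.feed(page)
    finder.close()
    return finder.speech_tags


# Assumed tag layout; the attribute and file names follow the description.
PAGE = (
    '<html><body>'
    '<SpeechRcog grammar="command.gra" acousticmodel="phone.mdl" '
    'resulttemplate="type1.dat" actiontable="command.tbl">'
    '</body></html>'
)
```

An empty result corresponds to the branch in which the process ends, and a non-empty result corresponds to proceeding to voice input in step S205 with the files named by the attributes.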
[0076]
FIG. 16 is a diagram illustrating page description language data generated by the function of the page description data generation unit 506 based on the display rule 203 shown in FIG. 5 in the fourth embodiment.
[0077]
In the figure, the italicized part is an extension of the specification of a general page description language. In the page description language display device according to this embodiment, the descriptions “input type = mybotton” and “size” specify that a button is displayed and the size of that button. In addition, the range enclosed by “<p> ... </p>” is interpreted as buttons to be displayed on a single line. In this embodiment, when the recognition result candidates are displayed on the display device 300 according to FIG. 16 by interpreting this extended specification, the display example shown in FIG. 6 is obtained, as in the first embodiment.
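The generation of such data by the page description data generation unit 506 might be sketched in Python as follows. FIG. 16 is not reproduced in this text, so the overall markup layout, the enclosing form element, and the function name `generate_page` are assumptions. Only “input type = mybotton”, “size”, “name”, “value”, the variable “com”, the program name “ExecuteCommand”, and the one-line meaning of “<p> ... </p>” come from the description.

```python
from html import escape
from typing import List, Tuple


def generate_page(rows: List[List[Tuple[str, int]]]) -> str:
    """Build page description data with one <p>...</p> per line of buttons.

    Each row is a list of (word, size) pairs, where size is the button size
    already chosen from the display rule.
    """
    lines = ['<form action="ExecuteCommand">']  # assumed wrapper element
    for row in rows:
        buttons = "".join(
            f'<input type="mybotton" name="com" '
            f'value="{escape(word, quote=True)}" size="{size}">'
            for word, size in row
        )
        # One paragraph corresponds to one line of buttons.
        lines.append(f"<p>{buttons}</p>")
    lines.append("</form>")
    return "\n".join(lines)
```

For example, `generate_page([[("search", 3)], [("print setting", 2), ("editing", 2)]])` places the first candidate alone on the first line with a larger size and the remaining two on the second line. The sizes 3 and 2 are illustrative values only.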
[0078]
Here, the description returns to the flowchart. In step S210, by the function of the recognition result selection unit 503, the user selects a desired recognition result candidate from those displayed on the display device 300 using the operation input device 400 such as a mouse.
[0079]
In step S211, processing corresponding to the selected recognition result candidate is executed by the function of the processing control unit 504. Here, the correspondence between the recognition result candidates and the processes may be described in the program, or a processing rule 204 describing the correspondence illustrated in FIG. 7 may be stored in the storage device 200 in advance and referred to when the process is executed by the processing control unit 504.
[0080]
In the example of the page description language data shown in FIG. 16, when the user presses a button displayed on the display device 300, the value specified by “value” (corresponding to search, print setting, and editing in the figure) is assigned to the environment variable “com” specified by “name”, and execution of the program “ExecuteCommand” is started. The description “ExecuteCommand” corresponds to the processing control unit 504. The program “ExecuteCommand” extracts the value assigned to “com”, the environment variable specified by “name”, identifies the process corresponding to the extracted value by referring to the processing rule 204, and executes it.
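The lookup performed by “ExecuteCommand” can be sketched in Python as a table dispatch. The format of the processing rule 204 (FIG. 7) is not reproduced in this text, so representing it as a dictionary from values to callables is an assumption, as are the function name `execute_command`, the table `PROCESSING_RULE`, and the strings returned by the placeholder actions.

```python
from typing import Callable, Dict, Mapping


def execute_command(
    env: Mapping[str, str],
    processing_rule: Dict[str, Callable[[], str]],
    name: str = "com",
) -> str:
    """Read the value assigned to the variable `name` and run its process."""
    value = env[name]
    try:
        action = processing_rule[value]
    except KeyError:
        # No process is registered for this recognition result candidate.
        raise ValueError(f"no process registered for {value!r}") from None
    return action()


# Hypothetical processing rule; the keys are the values named in the text.
PROCESSING_RULE: Dict[str, Callable[[], str]] = {
    "search": lambda: "search started",
    "print setting": lambda: "print settings opened",
    "editing": lambda: "editor opened",
}
```

Keeping the correspondence in a table rather than in the program itself reflects the second of the two options described for step S211, in which the processing rule is stored in advance and referred to at execution time.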
[0081]
Also according to this embodiment, since the display mode when presenting the speech recognition result candidates to the user using the GUI is improved, good operability can be realized.
[0082]
As described above, according to each of the above-described embodiments, when one or a plurality of speech recognition result candidates are displayed on the display device 300 using a GUI including software buttons and text, the display format of each recognition result candidate is determined based on attributes of that candidate, such as its recognition score and importance. The operability of a user interface using voice recognition and of command input can therefore be improved.
[0083]
In each of the above-described embodiments, the GUI illustrated in FIGS. 6 and 12 lets the user select a desired candidate from a plurality of displayed candidates by means of software buttons having different sizes and fonts. However, the present invention is not limited to this configuration. For example, each selection candidate, displayed in a different size or font, may be made selectable by a so-called radio button or check box provided in or near its display area.
[0084]
Further, in each of the above-described embodiments, operability for the user is improved by making software buttons of different sizes and fonts selectable in the GUI illustrated in FIGS. 6 and 12, but the present invention is not limited to this configuration. For example, operability can also be improved by appropriately changing the display color or the shape of the displayed buttons.
[0085]
[Other Embodiments]
The object of the present invention is also achieved by supplying a storage medium (or recording medium) on which the program code of software that realizes the functions of the above-described embodiments is recorded to an information processing apparatus, such as a personal computer or a portable information terminal, that operates as the above-described voice recognition device, and by having the computer (or CPU or MPU) of that system or apparatus read and execute the program code stored in the storage medium. In this case, the program code itself read from the storage medium realizes the functions of the above-described embodiments, and both the storage medium storing the program code and program code acquired as a computer program product via a telecommunication line constitute the present invention.
[0086]
Further, the present invention includes not only the case where the functions of the above-described embodiments are realized by the computer executing the read program code, but also the case where an operating system (OS) or the like running on the computer performs part or all of the actual processing based on the instructions of the program code, and the functions of the above-described embodiments are realized by that processing.
[0087]
Further, the present invention includes the case where, after the program code read from the storage medium is written in a memory provided in a function expansion card inserted into the computer or a function expansion unit connected to the computer, a CPU or the like provided in the function expansion card or function expansion unit performs part or all of the actual processing based on the instructions of the program code, and the functions of the above-described embodiments are realized by that processing.
[0088]
[Effects of the Invention]
According to the present invention described above, it is possible to provide a speech recognition apparatus and method, a page description language display device and a control method thereof, and a computer program that realize good operability by improving the display mode used when speech recognition result candidates are presented to the user using a GUI.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a schematic configuration of a speech recognition apparatus according to a first embodiment.
FIG. 2 is a flowchart showing a control process of the speech recognition apparatus in the first embodiment.
FIG. 3 is a diagram illustrating a configuration example of a recognition dictionary 202 in the first embodiment.
FIG. 4 is a diagram illustrating an example of a recognition result candidate based on a score according to the first embodiment.
FIG. 5 is a diagram for explaining an example of a display rule 203 in the first embodiment.
FIG. 6 is a diagram exemplifying a display mode of a recognition result selection GUI displayed on the display device 300 in the first embodiment.
FIG. 7 is a diagram illustrating a configuration example of a processing rule 204 in the first embodiment.
FIG. 8 is a diagram for explaining an example of a display rule 203 in the second embodiment.
FIG. 9 is a diagram illustrating an example of a recognition result candidate based on a score according to the second embodiment.
FIG. 10 is a diagram for explaining an example of a display rule 203 in the third embodiment.
FIG. 11 is a diagram illustrating an example of setting importance for each word in the third embodiment.
FIG. 12 is a diagram illustrating a display mode of a recognition result selection GUI displayed on the display device in the third embodiment.
FIG. 13 is a block diagram showing a schematic configuration of a page description language display device according to a fourth embodiment.
FIG. 14 is a flowchart illustrating a control process of the page description language display device according to the fourth embodiment.
FIG. 15 is a diagram illustrating a tag for executing speech recognition in the fourth embodiment.
FIG. 16 is a diagram illustrating page description language data generated by the function of the page description data generation unit 506 based on the display rule 203 shown in FIG. 5 in the fourth embodiment.

Claims

A speech recognition apparatus comprising:
acquisition means for acquiring recognition result candidates and a recognition score of each recognition result candidate by recognizing input voice;
determination means for determining, for each recognition result candidate acquired by the acquisition means, the size of a selectable region used when the candidate is displayed selectably on display means, based on the recognition score; and
display control means for controlling so that the recognition result candidates are displayed with the size determined by the determination means.

A speech recognition apparatus comprising:
acquisition means for acquiring recognition result candidates by recognizing input voice;
determination means for determining, for each recognition result candidate acquired by the acquisition means, the size of a selectable region used when the candidate is displayed selectably on display means, based on the importance of the recognition result candidate; and
display control means for controlling so that the recognition result candidates are displayed with the size determined by the determination means.

The speech recognition apparatus according to claim 1 or 2, further comprising control means for controlling, in accordance with a recognition result candidate selected from the recognition result candidates displayed by the display control means, so that a process corresponding to the selected recognition result candidate is executed in the apparatus itself or in an external apparatus.

The speech recognition apparatus according to claim 2, further comprising control means for controlling, in accordance with a recognition result candidate selected from the recognition result candidates displayed by the display control means, so that a process corresponding to the selected recognition result candidate is executed in the apparatus itself or in an external apparatus,
wherein the importance is set high for a recognition result candidate corresponding to a process for which erroneous operation is to be avoided.

The speech recognition apparatus according to any one of claims 1 to 4, further comprising selection means for selecting, by a pointing device, a recognition result candidate displayed on the display means.

A speech recognition method comprising:
an acquisition step of acquiring recognition result candidates and a recognition score of each recognition result candidate by recognizing input voice;
a determination step of determining, for each recognition result candidate acquired in the acquisition step, the size of a selectable region used when the candidate is displayed selectably on display means, based on the recognition score; and
a display control step of controlling so that the recognition result candidates are displayed with the size determined in the determination step.

A speech recognition method comprising:
an acquisition step of acquiring recognition result candidates by recognizing input voice;
a determination step of determining, for each recognition result candidate acquired in the acquisition step, the size of a selectable region used when the candidate is displayed selectably on display means, based on the importance of the recognition result candidate; and
a display control step of controlling so that the recognition result candidates are displayed with the size determined in the determination step.

The speech recognition method according to claim 6 or 7, further comprising a control step of controlling, in accordance with a recognition result candidate selected from the recognition result candidates displayed in the display control step, so that a process corresponding to the selected recognition result candidate is executed in the apparatus itself or in an external apparatus.

The speech recognition method according to claim 7, further comprising a control step of controlling, in accordance with a recognition result candidate selected from the recognition result candidates displayed in the display control step, so that a process corresponding to the selected recognition result candidate is executed in the apparatus itself or in an external apparatus,
wherein the importance is set high for a recognition result candidate corresponding to a process for which erroneous operation is to be avoided.

The speech recognition method according to claim 6, further comprising a selection step of selecting, by a pointing device, a recognition result candidate displayed on the display means.

A program for causing a computer to execute the speech recognition method according to any one of claims 6 to 10.

A computer-readable storage medium storing the program according to claim 11.