JP3888584B2

JP3888584B2 - Speech recognition apparatus, speech recognition method, and speech recognition program

Info

Publication number: JP3888584B2
Application number: JP2003095772A
Authority: JP
Inventors: 伸昌荒木; 一郎森; 康弘小池; 潔山端; 亮輔磯谷; 健花沢; 誠也長田
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2003-03-31
Filing date: 2003-03-31
Publication date: 2007-03-07
Anticipated expiration: 2023-03-31
Also published as: JP2004302196A

Description

【０００１】
【発明の属する技術分野】
本発明は、音声認識装置及びそれに用いる音声認識方法並びに音声認識プログラムに関し、特に、入力される音声を認識する音声認識装置における認識対象となる音声の特定方法に関する。
【０００２】
【従来の技術】
入力された音声を音声認識する音声認識装置が知られている。このような音声認識装置では、入力音声のパワー値等を用いてユーザが発声した音声を検出し、音声認識を行うが、ユーザが実際に発声した音声以外にも、周囲で生じたノイズ等が検出され音声認識されてしまい、誤入力されてしまうことがある。そのため、音声認識装置では、認識対象となる音声を特定するために、ユーザが発声の直前にボタンを押す等、ユーザがボタン等で入力操作を行う方式がよく用いられる。その代表的な方法としては、その決定の仕方として、ＰＴＴ（Ｐｕｓｈ−Ｔｏ−Ｔａｌｋ）方式と、ＰＴＡ（Ｐｕｓｈ−Ｔｏ−Ａｃｔｉｖａｔｅ）方式とが知られている。
【０００３】
ＰＴＴ方式では、ボタンが押された時刻からボタンが放された時刻までに入力された音声が音声認識の対象になる。ＰＴＡ方式では、ボタンが押された時刻から所定の時間以上に音声が途切れるまでに入力された音声が音声認識の対象になる。しかしながら、このような音声認識では、ユーザが音声の発声に後れてボタンを押したときに、音声の先頭部分が欠落されて音声認識される。そのため、ユーザが音声の発声に後れてボタンを押したときにも、先頭部分を欠落しないで音声認識する音声認識装置が望まれている。
【０００４】
特開平２−１８４９１５号公報には、認識処理動作の開始コマンドが与えられる直前までの入力音声データを順次リングバッファに格納しておき、このリングバッファに格納された入力音声データを適宜用いて音声区間検出と、その音声データの抽出処理を行う音声認識装置が開示されている。このような音声認識装置は、非常に簡易にして効果的に先頭部分の欠落のない音声データによる音声認識処理を実行することができる。
【０００５】
特開平４−２４６９４号公報には、発声された音声を電気信号に変換するマイクロフォンと、電気信号に変換された音声信号から音声が入力されていることを検出する音声検出部と、音声信号を所定時間遅延させて出力する遅延回路とを設けるように構成され、音声検出部の検出結果を基準にして、遅延回路で遅延された音声信号の認識を開始する音声入力回路が開示されている。このような音声入力回路は、音声の先頭部分を欠落することなく、操作者が自由なタイミングで音声を発声すれば音声認識ができる。
【０００６】
特開平８−１８５１９６号公報には、音声信号が入力され、検出範囲を指定するスイッチ操作に応じて音声区間を抽出して出力する音声区間検出装置において、上記入力音声信号を一定時間記憶する記憶手段と、上記スイッチ操作で指定される検出範囲よりも広い範囲で上記記憶手段に記憶されている上記入力音声信号から一つだけ音声区間を抽出し、出力する制御手段とを有する音声区間検出装置が開示されている。
【０００７】
このような音声区間検出装置は、有声音と無声音の判別による音声区間判定を常に行いながら、その判定結果と一定時間までの入力信号を記憶しておくことで、スイッチが押されるより早いタイミングで発声が行われた場合でも、語頭を欠くことなく音声区間を検出することが可能となる。又、一つのスイッチ指定区間に対して一つの音声区間だけを検出し送信するようにしたことで、話者が発声の少し前や、少し後に関係のない言葉を発声したとしても、これを音声区間として検出することが無くなるため、例えば音声認識装置に適用した場合等に誤動作を起こし難くなる。
【０００８】
特開２００１−６７０９１号公報には、音声認識を行う発話区間の特定方法としてキーワード方式、トリガ方式、ＰＴＴ方式のうち少なくとも２方式を兼用して利便性を向上する音声認識装置が開示されている。このような音声認識装置は、音声認識動作が開始されると、キーワードが発話されたか、トークボタンを操作されたかを判定する。ここで、キーワードが発話された場合には、キーワード方式の音声認識処理に分岐し、音声区間検出部による発話区間の開始点の推定と、発話区間の終了点の推定を行う。また、トークボタンが操作された場合には、その時点で発話区間の開始点を特定する。そしてその操作が瞬間的なものか否かを判断し、瞬間的なものであれば、トリガ方式の音声認識処理に分岐し、音声検出部による発話区間の終了点の推定を行う。またトークボタンの操作が継続的なものであれば、ＰＴＴ方式の音声認識に分岐し、トークボタンの操作解除時点で発話区間の終了点を特定する。
【０００９】
【特許文献１】
特開平２−１８４９１５号公報
【特許文献２】
特開平４−２４６９４号公報
【特許文献３】
特開平８−１８５１９６号公報
【特許文献４】
特開２００１−６７０９１号公報
【００１０】
【発明が解決しようとする課題】
ところで、上記の音声認識装置では、以下のような問題点がある。
第一に、ユーザが発声した音声の開始位置や終了位置が正しく判定されない場合があるということである。その理由は、マイクボタンを押すのが発声よりも遅れた場合、リングバッファや遅延回路等を用いて発声の先頭の欠落を防止する方法では、そのリングバッファや遅延回路の処理する長さ以上に遅れた場合、先頭の欠落を完全に防止することはできないためである。
第二に、発声に間が空いたために装置により発声が途中で区切られても、認識結果を見るまでユーザにそれが伝わらないということである。その理由は、現在入力中の音声に対して音声認識処理が行われているかどうかをユーザに伝える手段がなく、ユーザは認識結果が表示されてからその文字列を見て判断することしかできないためである。
第三に、発声に間が空き音声認識が途中で途切れたときに、そのまま発声しても認識されず、再度発声のし直しが必要となることである。その理由は、装置により発声が途中で区切られても、その時点ではユーザには伝わらず、また、それを訂正する手段もないため、ユーザは認識結果が表示されてから途中で途切れたことを知り、その後で再度発声する必要があるためである。
第四に、ノイズ等余計な音声が認識されたり、音声認識が途中で途切れて認識されない部分があったりしたときに、容易に修正することができないということである。その理由は、認識結果の文字列を修正する場合、余計な音声が認識された場合は対応する文字列を選択して削除するという一連の操作を行う必要があり、また、音声認識が途中で途切れて認識されなかった部分については、再度発声して入力する必要があるためである。
本発明の目的は、認識対象の開始位置および終了位置をより適切に決定する音声認識装置を提供することである。
本発明の他の目的は、ユーザが現在発声している音声が認識されているかどうかを、ユーザに分かりやすく伝える音声認識装置を提供することである。
本発明の他の目的は、発声に間が空いたために音声認識が途中で途切れた場合でも、簡単な操作でそのまま発声を続けることができる音声認識装置を提供することである。
本発明の他の目的は、ノイズ等余計な音声が認識されたり、音声認識が途中で途切れて認識されない部分があったりしたときでも、ユーザが容易に修正することができる音声認識装置を提供することである。
本発明の他の目的は、自動翻訳装置に好適である音声認識装置を提供することである。
【００１１】
【課題を解決するための手段】
以下に、［発明の実施の形態］で使用される番号・符号を括弧付きで用いて、課題を解決するための手段を説明する。これらの番号・符号は、［特許請求の範囲］の記載と［発明の実施の形態］の記載との対応を明らかにするために付加されたものであり、［特許請求の範囲］に記載されている発明の技術的範囲の解釈に用いてはならない。
【００１２】
本発明による音声認識装置（１）は、入力された音声を認識する音声認識装置であって、入力音声から利用者が発声した音声部分を検出して、入力音声を音声部分と非音声部分に分割する音声検出部（１１）と、音声部分に対して認識処理を行う音声認識部（１５）と、入力されている音声に対して音声認識部（１５）により認識処理が行われているかどうかの状態を随時表示して利用者に伝える認識状態表示部（１４）とを備えている。
【００１３】
このとき、ユーザは、発声した音声を音声認識装置（１）が音声認識しているかどうかを確認することができる。たとえば、認識状態表示部（１４）は、オン時刻の後で音声部分が入力される時刻に、すなわち、認識処理の開始時刻（ｔ１、ｔ８、ｔ１２）にランプを点灯し、終了時刻にランプを消灯する。また、認識状態表示部（１４）は、認識処理の開始時刻（ｔ１、ｔ８、ｔ１２）から終了時刻までレベルメータ（２２）に第１表示形態で音声の音量を表示し、終了時刻より後にレベルメータ（２２）に第１表示形態と異なる第２表示形態で音量を表示する。
【００１４】
レベルメータ（２２）は、音量が音声認識部（１５）により音声認識するために必要である必要音量より大きいか小さいかを識別可能に表示することが好ましい。このとき、ユーザは、発音すべき音声の音量がよくわかり、音量が適正な音声を音声認識装置（１）に入力することができる。この結果、音声認識装置（１）は、音量が適正な音声が入力されて、より正確に音声認識することができる。
【００１５】
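The speech recognition device (1) according to the present invention further comprises a microphone button detection unit (13) that detects an ON operation, or both an ON operation and an OFF operation, of the microphone button (4) that the user operates when speaking. Based on the time of the ON operation, or the times of both the ON and OFF operations, detected by the microphone button detection unit (13), the speech recognition unit (15) selects the speech portions to be recognized from among the speech portions detected by the speech detection unit (11) and performs speech recognition on them. Furthermore, when recognition of the speech portions has finished, the recognition state display unit has indicated the end of recognition processing, and the microphone button detection unit (13) then detects another ON operation within a predetermined time, the speech recognition unit (15) additionally selects speech portions to be recognized based on the time of the newly detected ON operation (or the times of both the ON and OFF operations), and displays both the recognition result for the earlier ON operation and the recognition result for the later ON operation.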
【００１６】
すなわち、ユーザはマイクボタン（４）の操作を２通りの方法で行うことができる。なお、第１の操作方法と第２の操作方法を区別する手段としては、押下時間と設定時間との比較だけでなく、例えば以下のような手段でもよい。
・マイクボタンを２つ用意し、一方のマイクボタンの押下を第１の操作方法、他方のマイクボタンの押下を第２の操作方法とする。
・マイクボタンの押下が２段階の深さで行え、深く押下する操作を第１の操作方法、浅く押下する方法を第２の操作方法とする。
・マイクボタンは押し込む操作と回転させて上下に倒す操作を行うことができ、押し込む操作を第１の操作方法、回転させて上下に倒す操作を第２の操作方法とする。
・マイクボタンは、回転させて上下に倒す、もしくは上下にスライドさせる操作を行うことができ、上に倒すもしくはスライドさせる操作を第１の操作方法、下に倒すもしくはスライドさせる操作を第２の操作方法とする。
【００１７】
本発明による音声認識装置（１）は、非音声部分（５１〜５５、９１〜９４）を記録しないで音声部分（４１〜４５、８１〜８４）を記録する音声記録部（１２）を更に備えていることが好ましい。このとき、認識対象部分（６１〜６４、１０１〜１０４、１１１〜１１４）は、音声部分（４１〜４５、８１〜８４）のうちオン時刻（ｔ１、ｔ８、ｔ１２）より前に入力され音声記録部（１２）により記録される部分を含んでいる。
【００１８】
音声認識部（１５）は、認識状態表示部（１４）が認識処理が行われていることを表示している際に入力された音声部分の認識結果を第１の表示形態で表示し、また、認識状態表示部（１４）が終了を伝えた後の所定時間内に再度音声部分が検出された場合に、所定時間内に検出された音声部分に対しても音声認識を行い、認識結果を第１の表示形態とは異なる第２の表示形態で表示することが好ましい。
【００１９】
すなわち、認識対象部分（６１〜６４、１０１〜１０４、１１１〜１１４）は、終了時刻から追加可能時間だけ経過するより前にマイクボタン（４）が押されるときに、音声部分（４１〜４５、８１〜８４）のうちの最初の終了時刻までの音声部分に加えて、後にマイクボタン（４）が押されたオン時刻（ｔ１２）により認識対象として選択される音声部分を含んでいることが好ましい。このとき、ユーザは、発声している途中で音声認識装置（１）が音声認識を終了したときに、再度マイクボタン（４）を押すことにより、音声を追加して入力することができる。
【００２０】
本発明による音声認識装置（１）は、利用者の操作により指示されたときに、第１の表示形態の認識結果を第２の表示形態に変換し、また、第２の表示形態の認識結果を第１の表示形態に変換する文節選択部（１６）と、第１の表示形態にて表示されている認識結果に対して処理を行う処理部（１７）とを備えている。この処理部（１７）としては、認識結果を翻訳する翻訳部が例示される。
【００２１】
すなわち、文節選択部（１６）は、表示装置（５）に認識文字列を表示し、認識文字列のうちの操作により選択される選択文字列が表示される表示形式を変換することにより、対象文に対応する文字列を対象表示形式により表示し、対象外文に対応する文字列を対象表示形式と異なる対象外表示形式により表示する。音声認識装置（１）は、不必要である音声も音声認識したり、逆に必要な音声を認識しなかったりすることがある。このような音声認識装置（１）は、認識された文字列のうち、ユーザが必要な発声部分を選択して使用することができる。
【００２２】
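The microphone button detection unit (13) preferably detects the ON and OFF operations of the microphone button (4) and further detects whether they were performed as a first operation or as a second operation. When the first operation was performed, the speech recognition unit (15) performs recognition processing on the speech portions detected up to a time determined with reference to the time of the OFF operation. When the second operation was performed, it performs recognition processing on the speech portions detected after the ON time until the speech detection unit (11) detects a non-speech portion of at least a predetermined length.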
【００２３】
例えば、認識対象部分（６１〜６５、７１〜７５、１０１〜１０３、１１１〜１１４）は、マイクボタン（４）を操作される状態により決定する終了時刻より前に入力された音声部分（４１〜４５、８１〜８４）である。その終了時刻は、マイクボタン（４）が押されるオン時刻（ｔ１、ｔ５、ｔ８）からマイクボタン（４）が放されるオフ時刻（ｔ２、ｔ６、ｔ９）までの押下時間（Δｔ１、Δｔ３、Δｔ５）が設定時間より長いときに、オフ時刻から予め設定された認識延長時間だけ経過した時刻（ｔ７）であり、押下時間（Δｔ１、Δｔ３、Δｔ５）が設定時間より短いときに、非音声部分（５１〜５５、９１〜９４）のうちの予め設定された空白判別時間より長い部分（５５、９３、９４）がオフ時刻より後に入力される時刻（ｔ４、ｔ１１、ｔ１５）である。
【００２４】
マイクボタン（４）が押されるオン時刻（ｔ１、ｔ８、ｔ１２）から放されるオフ時刻（ｔ２、ｔ９、ｔ１３）までの押下時間（Δｔ１、Δｔ５、Δｔ７）が設定時間より長い操作を第１の操作方法とし、押下時間（Δｔ１、Δｔ５、Δｔ７）が設定時間より短い操作を第２の操作方法とする。そして、第１の操作方法が行われたときは、マイクボタン（４）のオフ時刻から認識延長時間だけ経過した時刻、もしくは、オフ時刻より後に空白判別時間より長い非音声部分が入力される時刻を音声認識の終了時刻とし、第２の操作方法が行われたときは、マイクボタン（４）のオン時刻より後に空白判別時間より長い非音声部分が入力される時刻を音声認識の終了時刻とする。音声認識部（１５）は、終了時刻までに入力された音声に対して認識処理を行う。
【００２５】
このとき、ユーザは、マイクボタン（４）を適正に操作することにより音声認識する認識対象の終了位置をより適切に決定することができる。すなわち、ユーザは、マイクボタン（４）を押し続けることにより、押している間に発声された音声をより確実に音声認識させ、マイクボタン（４）を押し続けた後に放すことにより、認識対象の終了位置をより適切に決定させることができる。ユーザは、音声認識装置（１）の取り扱い方法をよく理解していないときに、マイクボタン（４）を瞬間的に押したり、音声の発声が終了する前にマイクボタン（４）を放したりすることが多い。音声認識装置（１）は、さらに、このような操作がされたときに、ユーザが発声した音声をより確実に音声認識することができる。
【００２６】
本発明による音声認識方法は、入力された音声を認識する音声認識方法であって、入力音声から利用者が発声した音声部分を検出して、入力音声を音声部分と非音声部分に分割するステップ（Ｓ１）と、音声部分に対して認識処理を行うステップ（Ｓ１３、１５、１６）と、入力されている音声に対して認識処理が行われているかどうかの状態を随時表示して利用者に伝えるステップ（Ｓ１１、１７）とを備えている。
【００２７】
このとき、ユーザは、発声した音声を音声認識装置（１）が音声認識しているかどうかを確認することができる。たとえば、認識状態表示部（１４）は、オン時刻の後で音声部分が入力される時刻に、すなわち、認識処理の開始時刻（ｔ１、ｔ８、ｔ１２）にランプを点灯し、終了時刻にランプを消灯する。また、認識状態表示部（１４）は、認識処理の開始時刻（ｔ１、ｔ８、ｔ１２）から終了時刻までレベルメータ（２２）に第１表示形態で音声の音量を表示し、終了時刻より後にレベルメータ（２２）に第１表示形態と異なる第２表示形態で音量を表示する。
【００２８】
レベルメータ（２２）は、音量が音声認識部（１５）により音声認識するために必要である必要音量より大きいか小さいかを識別可能に表示することが好ましい。このとき、ユーザは、発音すべき音声の音量がよくわかり、音量が適正な音声を音声認識装置（１）に入力することができる。この結果、音声認識装置（１）は、音量が適正な音声が入力されて、より正確に音声認識することができる。
【００２９】
本発明による音声認識方法は、利用者が発声を行う際に操作するマイクボタンのオン操作またはオン操作とオフ操作の両方を検出するステップ（Ｓ９、１４、１９）と、検出されたオン操作またはオン操作とオフ操作の両方の時刻に基づき、音声検出部により検出された音声部分から認識対象とする音声部分を選択して音声認識を行うステップ（Ｓ１３、１５、１６）と、また、音声部分の認識処理が終了し、認識処理の終了が表示された後、所定時間内に再度オン操作が検出された場合に、再度検出されたオン操作またはオン操作とオフ操作の両方の時刻に基づき認識対象とする音声部分を更に選択し、先のオン操作時の認識結果と後のオン操作の認識結果との両方を表示するステップとを更に備えていることが好ましい。
【００３０】
本発明による音声認識方法は、認識処理が行われていることを表示している際に入力された音声部分の認識結果を第１の表示形態で表示するステップと、また、認識状態表示部が終了を伝えた後の所定時間内に再度音声部分が検出された場合に、所定時間内に検出された音声部分に対しても音声認識を行い、認識結果を第１の表示形態とは異なる第２の表示形態で表示するステップとを更に備えている。
【００３１】
すなわち、認識対象部分（６１〜６４、１０１〜１０４、１１１〜１１４）は、終了時刻から追加可能時間だけ経過するより前にマイクボタン（４）が押されるときに、音声部分（４１〜４５、８１〜８４）のうちの最初の終了時刻までの音声部分に加えて、後にマイクボタン（４）が押されたオン時刻（ｔ１２）により認識対象として選択される音声部分を含んでいることが好ましい。このとき、ユーザは、発声している途中で音声認識装置（１）が音声認識を終了したときに、再度マイクボタン（４）を押すことにより、音声を追加して入力することができる。
【００３２】
本発明による音声認識方法は、利用者の操作により指示されたときに、第１の表示形態の認識結果を第２の表示形態に変換し、また、第２の表示形態の認識結果を第１の表示形態に変換するステップ（Ｓ３２）と、第１の表示形態にて表示されている認識結果に対して処理を行うステップ（Ｓ３４）とを備えている。この処理としては、認識結果の翻訳が例示される。
【００３３】
すなわち、文節選択部（１６）は、表示装置（５）に認識文字列を表示し、認識文字列のうちの操作により選択される選択文字列が表示される表示形式を変換することにより、対象文に対応する文字列を対象表示形式により表示し、対象外文に対応する文字列を対象表示形式と異なる対象外表示形式により表示する。音声認識装置（１）は、不必要である音声も音声認識したり、逆に必要な音声を認識しなかったりすることがある。このような音声認識装置（１）は、認識された文字列のうち、ユーザが必要な発声部分を選択して使用することができる。
【００３４】
本発明による音声認識方法は、マイクボタンのオン操作とオフ操作を検出するステップ（Ｓ９）と、さらに、これらの操作が第１の操作と第２の操作のどちらで行われたかを検出するステップ（Ｓ１４）と、第１の操作が行われた場合には、オフ操作が行われた時刻を基準に決定した時刻までに検出された音声部分に対して認識処理を行い、第２の操作が行われた場合には、オン時刻より後に音声検出部により所定以上の非音声部分が検出されるまでに検出された音声部分に対して認識処理を行うステップ（Ｓ１５、１６）とを更に備えていることが好ましい。
【００３５】
例えば、認識対象部分（６１〜６５、７１〜７５、１０１〜１０３、１１１〜１１４）は、マイクボタン（４）を操作される状態により決定する終了時刻より前に入力された音声部分（４１〜４５、８１〜８４）である。その終了時刻は、マイクボタン（４）が押されるオン時刻（ｔ１、ｔ５、ｔ８）からマイクボタン（４）が放されるオフ時刻（ｔ２、ｔ６、ｔ９）までの押下時間（Δｔ１、Δｔ３、Δｔ５）が設定時間より長いときに、オフ時刻から予め設定された認識延長時間だけ経過した時刻（ｔ７）であり、押下時間（Δｔ１、Δｔ３、Δｔ５）が設定時間より短いときに、非音声部分（５１〜５５、９１〜９４）のうちの予め設定された空白判別時間より長い部分（５５、９３、９４）がオフ時刻より後に入力される時刻（ｔ４、ｔ１１、ｔ１５）である。
【００３６】
マイクボタン（４）が押されるオン時刻（ｔ１、ｔ８、ｔ１２）から放されるオフ時刻（ｔ２、ｔ９、ｔ１３）までの押下時間（Δｔ１、Δｔ５、Δｔ７）が設定時間より長い操作を第１の操作方法とし、押下時間（Δｔ１、Δｔ５、Δｔ７）が設定時間より短い操作を第２の操作方法とする。そして、第１の操作方法が行われたときは、マイクボタン（４）のオフ時刻から認識延長時間だけ経過した時刻、もしくは、オフ時刻より後に空白判別時間より長い非音声部分が入力される時刻を音声認識の終了時刻とし、第２の操作方法が行われたときは、マイクボタン（４）のオン時刻より後に空白判別時間より長い非音声部分が入力される時刻を音声認識の終了時刻とする。音声認識部（１５）は、終了時刻までに入力された音声に対して認識処理を行う。
【００３７】
このとき、ユーザは、マイクボタン（４）を適正に操作することにより音声認識する認識対象の終了位置をより適切に決定することができる。すなわち、ユーザは、マイクボタン（４）を押し続けることにより、押している間に発声された音声をより確実に音声認識させ、マイクボタン（４）を押し続けた後に放すことにより、認識対象の終了位置をより適切に決定させることができる。ユーザは、音声認識装置（１）の取り扱い方法をよく理解していないときに、マイクボタン（４）を瞬間的に押したり、音声の発声が終了する前にマイクボタン（４）を放したりすることが多い。音声認識装置（１）は、さらに、このような操作がされたときに、ユーザが発声した音声をより確実に音声認識することができる。
【００３８】
本発明による音声認識プログラムは、入力された音声を認識する音声認識方法をコンピュータに実行させるコンピュータプログラムであって、入力音声から利用者が発声した音声部分を検出して、入力音声を音声部分と非音声部分に分割するステップ（Ｓ１）と、音声部分に対して認識処理を行うステップ（Ｓ１３、１５、１６）と、入力されている音声に対して認識処理が行われているかどうかの状態を随時表示して利用者に伝えるステップ（Ｓ１１、１７）とを備えている。
【００３９】
このとき、ユーザは、発声した音声を音声認識装置（１）が音声認識しているかどうかを確認することができる。たとえば、認識状態表示部（１４）は、オン時刻の後で音声部分が入力される時刻に、すなわち、認識処理の開始時刻（ｔ１、ｔ８、ｔ１２）にランプを点灯し、終了時刻にランプを消灯する。また、認識状態表示部（１４）は、認識処理の開始時刻（ｔ１、ｔ８、ｔ１２）から終了時刻までレベルメータ（２２）に第１表示形態で音声の音量を表示し、終了時刻より後にレベルメータ（２２）に第１表示形態と異なる第２表示形態で音量を表示する。
【００４０】
レベルメータ（２２）は、音量が音声認識部（１５）により音声認識するために必要である必要音量より大きいか小さいかを識別可能に表示することが好ましい。このとき、ユーザは、発音すべき音声の音量がよくわかり、音量が適正な音声を音声認識装置（１）に入力することができる。この結果、音声認識装置（１）は、音量が適正な音声が入力されて、より正確に音声認識することができる。
【００４１】
本発明による音声認識プログラムは、利用者が発声を行う際に操作するマイクボタンのオン操作またはオン操作とオフ操作の両方を検出するステップ（Ｓ９、１４、１９）と、検出されたオン操作またはオン操作とオフ操作の両方の時刻に基づき、音声検出部により検出された音声部分から認識対象とする音声部分を選択して音声認識を行うステップ（Ｓ１３、１５、１６）と、また、音声部分の認識処理が終了し、認識処理の終了が表示された後、所定時間内に再度オン操作が検出された場合に、再度検出されたオン操作またはオン操作とオフ操作の両方の時刻に基づき認識対象とする音声部分を更に選択し、先のオン操作時の認識結果と後のオン操作の認識結果との両方を表示するステップとを更に備えていることが好ましい。
【００４２】
本発明による音声認識プログラムは、認識処理が行われていることを表示している際に入力された音声部分の認識結果を第１の表示形態で表示するステップと、また、認識状態表示部が終了を伝えた後の所定時間内に再度音声部分が検出された場合に、所定時間内に検出された音声部分に対しても音声認識を行い、認識結果を第１の表示形態とは異なる第２の表示形態で表示するステップとを更に備えている。
【００４３】
すなわち、認識対象部分（６１〜６４、１０１〜１０４、１１１〜１１４）は、終了時刻から追加可能時間だけ経過するより前にマイクボタン（４）が押されるときに、音声部分（４１〜４５、８１〜８４）のうちの最初の終了時刻までの音声部分に加えて、後にマイクボタン（４）が押されたオン時刻（ｔ１２）により認識対象として選択される音声部分を含んでいることが好ましい。このとき、ユーザは、発声している途中で音声認識装置（１）が音声認識を終了したときに、再度マイクボタン（４）を押すことにより、音声を追加して入力することができる。
【００４４】
本発明による音声認識プログラムは、利用者の操作により指示されたときに、第１の表示形態の認識結果を第２の表示形態に変換し、また、第２の表示形態の認識結果を第１の表示形態に変換するステップ（Ｓ３２）と、第１の表示形態にて表示されている認識結果に対して処理を行うステップ（Ｓ３４）とを備えている。この処理としては、認識結果の翻訳が例示される。
【００４５】
すなわち、文節選択部（１６）は、表示装置（５）に認識文字列を表示し、認識文字列のうちの操作により選択される選択文字列が表示される表示形式を変換することにより、対象文に対応する文字列を対象表示形式により表示し、対象外文に対応する文字列を対象表示形式と異なる対象外表示形式により表示する。音声認識装置（１）は、不必要である音声も音声認識したり、逆に必要な音声を認識しなかったりすることがある。このような音声認識装置（１）は、認識された文字列のうち、ユーザが必要な発声部分を選択して使用することができる。
【００４６】
本発明による音声認識プログラムは、マイクボタンのオン操作とオフ操作を検出するステップ（Ｓ９）と、さらに、これらの操作が第１の操作と第２の操作のどちらで行われたかを検出するステップ（Ｓ１４）と、第１の操作が行われた場合には、オフ操作が行われた時刻を基準に決定した時刻までに検出された音声部分に対して認識処理を行い、第２の操作が行われた場合には、オン時刻より後に音声検出部により所定以上の非音声部分が検出されるまでに検出された音声部分に対して認識処理を行うステップ（Ｓ１５、１６）とを更に備えていることが好ましい。
【００４７】
例えば、認識対象部分（６１〜６５、７１〜７５、１０１〜１０３、１１１〜１１４）は、マイクボタン（４）を操作される状態により決定する終了時刻より前に入力された音声部分（４１〜４５、８１〜８４）である。その終了時刻は、マイクボタン（４）が押されるオン時刻（ｔ１、ｔ５、ｔ８）からマイクボタン（４）が放されるオフ時刻（ｔ２、ｔ６、ｔ９）までの押下時間（Δｔ１、Δｔ３、Δｔ５）が設定時間より長いときに、オフ時刻から予め設定された認識延長時間だけ経過した時刻（ｔ７）であり、押下時間（Δｔ１、Δｔ３、Δｔ５）が設定時間より短いときに、非音声部分（５１〜５５、９１〜９４）のうちの予め設定された空白判別時間より長い部分（５５、９３、９４）がオフ時刻より後に入力される時刻（ｔ４、ｔ１１、ｔ１５）である。
【００４８】
マイクボタン（４）が押されるオン時刻（ｔ１、ｔ８、ｔ１２）から放されるオフ時刻（ｔ２、ｔ９、ｔ１３）までの押下時間（Δｔ１、Δｔ５、Δｔ７）が設定時間より長い操作を第１の操作方法とし、押下時間（Δｔ１、Δｔ５、Δｔ７）が設定時間より短い操作を第２の操作方法とする。そして、第１の操作方法が行われたときは、マイクボタン（４）のオフ時刻から認識延長時間だけ経過した時刻、もしくは、オフ時刻より後に空白判別時間より長い非音声部分が入力される時刻を音声認識の終了時刻とし、第２の操作方法が行われたときは、マイクボタン（４）のオン時刻より後に空白判別時間より長い非音声部分が入力される時刻を音声認識の終了時刻とする。音声認識部（１５）は、終了時刻までに入力された音声に対して認識処理を行う。
【００４９】
このとき、ユーザは、マイクボタン（４）を適正に操作することにより音声認識する認識対象の終了位置をより適切に決定することができる。すなわち、ユーザは、マイクボタン（４）を押し続けることにより、押している間に発声された音声をより確実に音声認識させ、マイクボタン（４）を押し続けた後に放すことにより、認識対象の終了位置をより適切に決定させることができる。ユーザは、音声認識装置（１）の取り扱い方法をよく理解していないときに、マイクボタン（４）を瞬間的に押したり、音声の発声が終了する前にマイクボタン（４）を放したりすることが多い。音声認識装置（１）は、さらに、このような操作がされたときに、ユーザが発声した音声をより確実に音声認識することができる。
【００５０】
【発明の実施の形態】
図面を参照して、本発明による音声認識装置の実施の形態を説明する。ここでは例として、本発明による音声認識装置を、入力された音声を音声認識して別の言語に翻訳して読み上げる自動翻訳装置に適用して説明する。その音声認識装置が適用される自動翻訳装置１は、図１に示されているように、自動翻訳装置本体２とマイク３とマイクボタン４とタッチパネルディスプレイ５とＬＥＤ６とスピーカ７とを備えている。自動翻訳装置本体２は、情報処理装置（コンピュータ）であり、記録されるコンピュータプログラムを実行して、マイク３とマイクボタン４とタッチパネルディスプレイ５とＬＥＤ６とスピーカ７とを制御する。
【００５１】
マイク３は、入力される音声を電気信号に変換して自動翻訳装置本体２に入力する。マイクボタン４は、ユーザにより押され、または、放されて、その状態を示す電位信号を自動翻訳装置本体２に入力する。タッチパネルディスプレイ５は、自動翻訳装置本体２から出力される電気信号に示される画面を表示する。タッチパネルディスプレイ５は、さらに、ユーザによりペンまたは指でタップされた（触れられた）位置を示す電気信号を自動翻訳装置本体２に入力する。ＬＥＤ６は、自動翻訳装置本体２から出力される電気信号に応答して点灯し、または、消灯する。スピーカ７は、自動翻訳装置本体２から出力される電気信号に応答して音声を出力する。
【００５２】
図２は、自動翻訳装置本体２を詳細に示している。自動翻訳装置本体２は、コンピュータプログラムである音声検出部１１と音声記録部１２とマイクボタン検出部１３と認識状態表示部１４と音声認識部１５と文節選択部１６と翻訳部１７とお手本部１８とガイド表示部１９とを備えている。
【００５３】
音声検出部１１は、マイク３から入力される音声を音声部分と非音声部分とに分離する。その音声部分は、人が発声した音声を示している。その非音声部分は、音声がない状態、または、人が発声した音声以外の音声、すなわち、ノイズを示している。音声検出部１１は、さらに、特定のユーザの声紋を記録し、マイク３から入力される音声を、そのユーザが発声した音声である音声部分と、それ以外の非音声部分とに分離することもできる。
【００５４】
音声記録部１２は、図示されていないバッファに音声検出部１１により分離された音声部分を常時に記録する。音声記録部１２は、さらに、そのバッファに新規に記録する領域が不足したとき、バッファのサイズを拡張して音声を記録する。また、過去に記録された音声のうち、利用されずに残っている充分古い音声があった場合、その音声を削除してバッファの空き領域を増やす。マイクボタン検出部１３は、マイクボタン４がユーザにより押されているか放されているかを判別する。音声認識部１５は、マイクボタン４の状態に基づいてマイク３から入力され音声記録部１２に記録された音声から認識対象となる部分を判別し、その認識対象を音声認識して文字言語を生成する。
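The recording behaviour described above for the speech recording unit 12 (store each detected speech portion, grow the buffer when space runs out, discard sufficiently old unused speech) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the class name `SpeechBuffer`, the `max_age` threshold standing in for "sufficiently old", and the use of a Python list (which grows on demand) in place of an explicitly resized buffer are all hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) of one speech portion, in seconds


class SpeechBuffer:
    """Holds detected speech portions that have not been used yet."""

    def __init__(self, max_age: float = 30.0) -> None:
        # max_age is an assumed value; the source only says "sufficiently old".
        self.max_age = max_age
        self.segments: List[Segment] = []

    def record(self, start: float, end: float, now: float) -> None:
        """Store one speech portion, then free space taken by stale ones."""
        self.segments.append((start, end))  # the list grows when needed
        self.prune(now)

    def prune(self, now: float) -> None:
        """Drop portions that ended more than max_age seconds before now."""
        self.segments = [s for s in self.segments if now - s[1] <= self.max_age]
```

Because only speech portions are stored, the memory cost depends on how much the user speaks rather than on elapsed time, which is what allows speech uttered well before the button press to remain available.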
【００５５】
認識状態表示部１４は、マイクボタン４の状態に基づいてそのときに入力されている音声が音声認識する認識対象になっているかどうかをタッチパネルディスプレイ５またはＬＥＤ６を用いてユーザに通知する。すなわち、認識状態表示部１４は、初期的に、入力されている音声の音量をタッチパネルディスプレイ５に常時に表示し、ＬＥＤ６を消灯している。認識状態表示部１４は、マイクボタン４が押され、その音声に対して音声認識部１５により音声認識が行われていると、入力されている音声の音量をタッチパネルディスプレイ５に色を変えて表示し、ＬＥＤ６を点灯する。認識状態表示部１４は、音声認識部１５により入力されている音声が認識対象ではないと判別されたときに、入力されている音声の音量をタッチパネルディスプレイ５に初期の色に変えて表示し、ＬＥＤ６を消灯する。
【００５６】
文節選択部１６は、音声認識部１５により生成された文字言語をタッチパネルディスプレイ５に表示する。その文字言語は、黒字で表示される黒文字列と濃い灰色で表示される濃文字列と淡い灰色で表示される淡文字列とから形成されている。すなわち、文字列は、その音声がユーザが入力したかったものであるかどうかの推定により色分けされて表示される。黒文字列はユーザの入力したかった発声である可能性が非常に高いと思われる音声部分の文字列、濃文字列はユーザの入力したかった発声である可能性が高いと思われる音声部分の文字列、淡文字列は、音声認識処理を行ったがユーザの入力したかった発声である可能性が低いと思われる音声部分の文字列である。これらの推定は、例えば次のように行う。マイクボタン４を押している間あるいは押した直後の音声部分はユーザの入力したかった発声である可能性が非常に高いと推定して黒文字列とする。それらの黒文字列とした音声部分に短い時間隔で連続している音声部分は、ユーザが黒文字列の部分に連続して入力した発声と推定して濃文字列とする。一方、それらの黒文字列・濃文字列とした音声部分から間隔を置いて入力された音声部分は、周囲のノイズや、装置への入力を意図していないユーザの発声の可能性が高いと推定して淡文字列とする。
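The three-way estimate described above (black for speech during or just after the button press, dark gray for speech joined to it by short pauses, light gray for speech separated by a longer pause) can be sketched as follows. The function name `classify`, the thresholds `grace` and `short_gap`, and the choice to extend the chain backward to speech uttered before the press are assumptions made for illustration; the source gives the rule only as an example and names no concrete values.

```python
from typing import List, Optional, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds, sorted and non-overlapping


def classify(segments: List[Segment], on: float, off: float,
             grace: float = 0.5, short_gap: float = 1.0) -> List[str]:
    """Return "black", "dark" or "light" for each speech portion."""
    colors: List[Optional[str]] = [None] * len(segments)

    # Black: spoken while the button was held, or just after its release.
    for i, (start, end) in enumerate(segments):
        if start <= off + grace and end >= on:
            colors[i] = "black"

    # Dark: joined to an already selected portion by a short pause.
    # A forward pass extends the chain to later portions, a backward
    # pass extends it to earlier ones (the backward direction is assumed).
    for i in range(1, len(segments)):
        gap = segments[i][0] - segments[i - 1][1]
        if colors[i] is None and colors[i - 1] is not None and gap <= short_gap:
            colors[i] = "dark"
    for i in range(len(segments) - 2, -1, -1):
        gap = segments[i + 1][0] - segments[i][1]
        if colors[i] is None and colors[i + 1] is not None and gap <= short_gap:
            colors[i] = "dark"

    # Light: everything separated from the selection by a longer pause.
    return [c if c is not None else "light" for c in colors]
```

With four closely spaced utterances followed by a late noise burst, and a short press during the second utterance, this yields one black string, three dark gray strings, and one light gray string for the noise.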
【００５７】
文節選択部１６は、さらに、文字列の元となる音声部分の単位で、ユーザが文字列の一部をタップすることにより、その文字列の色を切り替えて表示する。すなわち、文節選択部１６は、黒文字列がタップされると、その黒文字列を淡い灰色で表示する。文節選択部１６は、濃文字列がタップされると、その濃文字列を淡い灰色で表示する。文節選択部１６は、淡文字列がタップされると、その淡文字列を濃い灰色で表示する。
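The tap behaviour described above is a small transition table: black and dark gray strings become light gray, and light gray strings become dark gray. A minimal sketch, in which the names `AFTER_TAP` and `tap` are hypothetical:

```python
from typing import List

# Color shown after a tap, keyed by the color shown before it.
AFTER_TAP = {"black": "light", "dark": "light", "light": "dark"}


def tap(colors: List[str], index: int) -> List[str]:
    """Return the colors after the string at index has been tapped."""
    updated = list(colors)
    updated[index] = AFTER_TAP[updated[index]]
    return updated
```

Note that a black string tapped twice comes back as dark gray rather than black. Since both black and dark gray strings are translated, the distinction affects only the display.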
【００５８】
翻訳部１７は、タッチパネルディスプレイ５に黒字で表示され、または、濃い灰色で表示されている文字列を他の言語に翻訳して翻訳文を生成する。翻訳部１７は、さらに、その翻訳文をタッチパネルディスプレイ５に表示し、その翻訳文をスピーカ７から出力する。
【００５９】
お手本部１８は、タッチパネルディスプレイ５にお手本ボタンを表示して、そのお手本ボタンがタップされたときに、スピーカ７から所定の音量の音声を出力して、ユーザにマイク３から入力する音声の音量のお手本を示す。
【００６０】
ガイド表示部１９は、ユーザに対して発声の間を空けずに入力してもらうよう促す等の注意点を表示する。
【００６１】
図３は、タッチパネルディスプレイ５に表示される画面を示している。その画面２１は、レベルメータ２２と認識結果表示欄２３と翻訳結果表示欄２４と翻訳ボタン２５とお手本ボタン２６とを備えている。レベルメータ２２は、指示棒２７と目盛り２８とから形成されている。
【００６２】
指示棒２７は、左端が固定されて、入力される音声の音量が大きいときに、左右に伸びて表示され、入力される音声の音量が小さいときに、左右に縮んで表示される。目盛り２８は、音声認識部１５が十分に認識することができる音声の音量を示している。すなわち、マイク３に入力される音声は、指示棒２７の右端が目盛り２８より右に表示されるときに、音声認識部１５により十分に認識され、指示棒２７の右端が目盛り２８より左に表示されるときに、音声認識部１５により十分に認識されないことがある。
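The level meter logic described above compares the right end of the indicator bar 27 with the scale mark 28. The following sketch renders that comparison as text. It assumes that the volume and the required volume are normalized to the range 0 to 1; the function name `render_meter` and the characters `#`, `.` and `|` are hypothetical choices for illustration only.

```python
from typing import Tuple


def render_meter(volume: float, required: float, width: int = 20) -> Tuple[str, bool]:
    """Draw the bar and report whether it extends past the scale mark."""
    volume = min(max(volume, 0.0), 1.0)
    filled = round(volume * width)      # cells covered by the bar, from the left
    mark = round(required * width)      # position of the scale mark
    cells = ["#" if i < filled else "." for i in range(width)]
    bar = "".join(cells[:mark]) + "|" + "".join(cells[mark:])
    return bar, filled > mark
```

The boolean is true only when the bar reaches beyond the mark, matching the statement that speech is recognized adequately when the right end of the bar is displayed to the right of the scale.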
【００６３】
認識結果表示欄２３には、音声認識部１５により生成された認識結果が表示される。その認識結果は、文字列３１〜３４から形成されている。文字列３１〜３４は、黒字、濃い灰色または淡い灰色により表示されている。文字列３１〜３４のうちの黒字の文字列は、タップされると、淡い灰色で表示される。文字列３１〜３４のうちの濃い灰色の文字列は、タップされると、淡い灰色で表示される。文字列３１〜３４のうちの淡い灰色の文字列は、タップされると、濃い灰色で表示される。
【００６４】
翻訳ボタン２５がタップされると、認識結果表示欄２３に表示される認識結果のうちの黒字または濃い灰色で表示されている文字列が翻訳部１７により翻訳されて、その翻訳文が翻訳結果表示欄２４に表示され、その翻訳文がスピーカ７から発声される。お手本ボタン２６がタップされると、スピーカ７から所定の音量の音声を出力して、自動翻訳装置１は、ユーザにマイク３から入力する音声の音量のお手本を示す。
【００６５】
自動翻訳装置１の動作の実施の形態は、入力される音声を監視する動作と、マイクボタン４が押されたときの動作と、自動翻訳する動作と、ユーザにお手本を示す動作とを備えている。
【００６６】
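FIG. 4 shows the operation of monitoring input speech. The automatic translation device 1 executes this monitoring operation at all times, and it can run in parallel with the operation performed when the microphone button 4 is pressed, the automatic translation operation, and the operation of showing the user a model. The automatic translation device 1 first determines whether the sound input from the microphone 3 is speech uttered by a person (step S1). When speech uttered by a person is input (step S1; YES), the automatic translation device 1 records that speech in the buffer as a recognition target (steps S2, S3). The speech is recorded as a single unit extending from where it starts to where it is interrupted. If the buffer lacks space for a new recording, the device extends the buffer size to make room and then records the speech. In addition, if speech recorded in the buffer earlier remains unused and is sufficiently old, the device deletes it to free buffer space (steps S4, S5).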
【００６７】
自動翻訳装置１は、人により発声された音声が入力されないときに（ステップＳ１；ＮＯ）、その音声が入力されることを待機する。自動翻訳装置１は、さらに、入力された音声の音量を灰色でレベルメータ２２に表示する。
【００６８】
図５は、マイクボタン４が押されたときの動作を示している。自動翻訳装置１は、マイクボタン４が押されたときに（ステップＳ９；ＹＥＳ）、まず、バッファに音声が記録されているかを調べ（ステップＳ１０）、記録されていなければ、その入力される音声を監視する動作により、バッファに音声が記録されるまで待機する。自動翻訳装置１は、音声が記録されている場合、ＬＥＤ６を点灯し、入力された音声の音量を黄色でレベルメータ２２に表示する（ステップＳ１１）。自動翻訳装置１は、次いで、マイクボタン４が押される前に入力されバッファに記録されている音声も含めて認識対象となる音声を選択する（ステップＳ１２）。自動翻訳装置１は、バッファより認識対象となる音声を、これから入力されてバッファに記録される音声も含めて順次読み込み、文字言語に音声認識し、その認識結果をタッチパネルディスプレイ５に表示する（ステップＳ１３）。
【００６９】
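In the automatic translation device 1, the microphone button 4 supports a first operation method, in which it is pressed for 1 second or longer, and a second operation method, in which it is pressed for 1 second or less.
The automatic translation device 1 determines whether the microphone button 4 was pressed for 1 second or longer (step S14). When the first operation method is performed, that is, when the microphone button 4 is pressed for 1 second or longer (step S14; YES), the device determines the end time from the time at which the microphone button 4 was released and ends speech recognition at that end time. For example, the automatic translation device 1 recognizes the speech input up to 1 second after the microphone button 4 is released (step S15).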
【００７０】
なお、ステップＳ１５では、自動翻訳装置１は、マイクボタンが放されてから２秒以上続く非音声部分より前に入力された音声までを音声認識してもよい。
【００７１】
自動翻訳装置１は、マイクボタン４が１秒以下だけ押されていたときに（ステップＳ１４；ＮＯ）、マイクボタン４が放された後に３秒以上続く非音声部分より前に入力された音声を音声認識する（ステップＳ１６）。自動翻訳装置１は、音声認識を終了すると、ＬＥＤ６を消灯し、入力された音声の音量を灰色でレベルメータ２２に表示する（ステップＳ１７）。
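The end-time rule of steps S14 to S16 can be sketched as follows: a long press ends recognition a fixed time after release, while a short press ends it once a sufficiently long pause has followed the release. This is an offline illustration that assumes every speech portion is already known, whereas the device works on live input. The function name and parameter names are hypothetical, the defaults (1 s press threshold, 1 s extension, 3 s pause) are the example values from the text, and the alternative rule of paragraph 0070 (a 2 s pause after a long press) is not modeled.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) of a speech portion, in seconds


def recognition_end_time(on: float, off: float, segments: List[Segment],
                         set_time: float = 1.0, extension: float = 1.0,
                         blank_time: float = 3.0) -> float:
    """Return the time at which recognition stops for one button press."""
    if off - on >= set_time:
        # First operation method (long press): stop a fixed time after release.
        return off + extension

    # Second operation method (short press): stop once a pause, timed from
    # the release at the earliest, has lasted blank_time seconds.
    silence_start = off
    for start, end in sorted(segments):
        if end <= silence_start:
            continue                      # finished before the pause being timed
        if start - silence_start >= blank_time:
            break                         # the pause before this portion was long enough
        silence_start = end
    return silence_start + blank_time
```

Speech portions that begin before the returned time are the ones handed to the recognizer. Passing `blank_time=1.0` reproduces the 1 second pause used in the timing charts of FIG. 7 and FIG. 8.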
【００７２】
なお、第１の操作方法と第２の操作方法を区別する手段としては、マイクボタン４を押す時間による区別だけでなく、例えば以下のような手段でもよい。
・マイクボタンを２つ用意し、一方のマイクボタンの押下を第１の操作方法、他方のマイクボタンの押下を第２の操作方法とする。
・マイクボタンの押下が２段階の深さで行え、深く押下する操作を第１の操作方法、浅く押下する方法を第２の操作方法とする。
・マイクボタンは押し込む操作と回転させて上下に倒す操作を行うことができ、押し込む操作を第１の操作方法、回転させて上下に倒す操作を第２の操作方法とする。
・マイクボタンは、回転させて上下に倒す、もしくは上下にスライドさせる操作を行うことができ、上に倒すもしくはスライドさせる操作を第１の操作方法、下に倒すもしくはスライドさせる操作を第２の操作方法とする。
【００７３】
自動翻訳装置１は、次いで、ステップＳ１３、ステップＳ１５またはステップＳ１６で音声認識された認識結果をタッチパネルディスプレイ５に表示する（ステップＳ１８）。自動翻訳装置１は、音声認識が終了してＬＥＤ６が消灯してから３秒後までに、再度マイクボタン４が押されるかどうかを判別する（ステップＳ１９）。
【００７４】
自動翻訳装置１は、再度マイクボタン４が押されたときに（ステップＳ１９；ＹＥＳ）、ステップＳ１０〜ステップＳ１７を実行してさらに入力される音声を音声認識して、その認識結果を先に音声認識された認識結果とともにタッチパネルディスプレイ５に表示する（ステップＳ１８）。
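The re-press rule of step S19 can be sketched as follows: if the button is pressed again within 3 seconds of the LED turning off, the new recognition result is displayed together with the earlier one. The function name `displayed_results` and the parameter `addable_time` are hypothetical, and the behaviour for a press outside the window is an assumption, because the source does not describe that case.

```python
from typing import List


def displayed_results(previous: List[str], new: List[str],
                      led_off_time: float, press_time: float,
                      addable_time: float = 3.0) -> List[str]:
    """Return the recognized strings to show after another button press."""
    if 0.0 <= press_time - led_off_time <= addable_time:
        # Pressed again soon enough: show the earlier and later results together.
        return previous + new
    # Assumption: a later press starts a fresh input (not stated in the source).
    return list(new)
```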
【００７５】
あるいは、自動翻訳装置１は、ＬＥＤ６をステップＳ１１、Ｓ１７に示される形態で点灯させる動作に限らないで、他の形態で点灯してもよい。ＬＥＤ６は、発光する色を変えて点灯してもよく、たとえば、ステップＳ１１で緑色に点灯し、ステップＳ１７で黄色に点灯してもよい。ＬＥＤ６は、発光する明るさを変えて点灯してもよく、たとえば、ステップＳ１１で明るく点灯し、ステップＳ１７で暗く点灯してもよい。ＬＥＤ６は、ステップＳ１１で点滅し、ステップＳ１７で点灯してもよい。
【００７６】
このとき、ユーザは、マイクボタン４を適正に操作することにより音声認識する認識対象の終了位置をより適切に決定することができる。
【００７７】
図６は、自動翻訳する動作を示している。自動翻訳する動作は、タッチパネルディスプレイ５に認識結果が表示されているときに、自動翻訳装置１により実行される。自動翻訳装置１は、淡い灰色の文字列がタップされると（ステップＳ３１；ＹＥＳ）、その文字列の元となる音声部分を単位としてその淡い灰色の文字列を濃い灰色に変えて表示する（ステップＳ３２）。自動翻訳装置１は、黒色または濃い灰色の文字列がタップされると（ステップＳ３３；ＹＥＳ）、その黒色または濃い灰色の文字列を淡い灰色に変えて表示する（ステップＳ３４）。ステップＳ３１〜ステップＳ３４の動作は、翻訳ボタン２５がタップされるまで繰り返される。
【００７８】
自動翻訳装置１は、翻訳ボタン２５がタップされると（ステップＳ３５；ＹＥＳ）、まず、最初に淡い灰色であった文字列をタップして濃い灰色に変えたかどうかを調べ（ステップＳ３６）、変えていた場合は、ガイドメッセージを表示する（ステップＳ３７）。このガイドメッセージは、ユーザに対して、その後の入力で正しく入力するように注意点を示すもので、例えば、発声中に間を空けすぎないように促す内容などである。自動翻訳装置１はその後、タッチパネルディスプレイ５に黒色または濃い灰色で表示されている文字列を他の言語たとえば、英語の翻訳文に翻訳する（ステップＳ３８）。自動翻訳装置１は、その翻訳文をタッチパネルディスプレイ５に表示して、その翻訳文をスピーカ７から発声する（ステップＳ３９）。
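Steps S35 to S38 can be sketched as follows: when the translate button is tapped, the device checks whether any string that started as light gray was promoted to dark gray, shows the guide message if so, and translates only the black and dark gray strings. The function name, the dictionary keys, and the `translate` callable are hypothetical placeholders; the actual translation engine is outside the scope of this sketch.

```python
from typing import Callable, Dict, List, Tuple


def on_translate_button(strings: List[Dict[str, str]],
                        translate: Callable[[str], str]) -> Tuple[bool, str]:
    """Return (show_guide, translation) for the current recognition result.

    Each entry has "text", "color" and "initial_color" keys.
    """
    # Steps S36 and S37: a string promoted from light to dark gray means the
    # user had to rescue speech that was first treated as unintended.
    show_guide = any(s["initial_color"] == "light" and s["color"] == "dark"
                     for s in strings)
    # Step S38: only black and dark gray strings are translated.
    selected = "".join(s["text"] for s in strings
                       if s["color"] in ("black", "dark"))
    return show_guide, translate(selected)
```

Keeping the initial color alongside the current one is what makes the guide check possible; the display color alone cannot tell a rescued string from one that was dark gray from the start.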
【００７９】
ユーザにお手本を示す動作は、お手本ボタン２６がタップされたときに、実行される。すなわち、自動翻訳装置１は、お手本ボタン２６がタップされたときに、スピーカ７から所定の音量の音声を出力して、ユーザにマイク３から入力する音声の音量のお手本を示す。このような動作によれば、ユーザは、発音すべき音声の音量がよくわかり、音量が適正な音声を自動翻訳装置１に入力することができる。この結果、自動翻訳装置１は、音量が適正な音声が入力されて、より正確に音声認識することができる。
【００８０】
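Next, the operation of the automatic translation device 1 when speech is actually input is described with reference to the drawings.
FIG. 7 shows the relationship between the state of the microphone button 4 and the recognition targets in one scene. In this scene, the user says "ええと" (um), "いちばん近い" (the nearest), "駅は" (station), and "どこですか？" (where is it?) in separate chunks in a noisy environment. The automatic translation device 1 detects "ええと" as speech portion 41, "いちばん近い" as speech portion 42, "駅は" as speech portion 43, "どこですか？" as speech portion 44, and the noise as speech portion 45. The automatic translation device 1 also detects non-speech portion 51 after speech portion 41, non-speech portion 52 after speech portion 42, non-speech portion 53 after speech portion 43, non-speech portion 54 after speech portion 44, and non-speech portion 55 after speech portion 45. Here, non-speech portions 51 to 53 each last less than 1 second, and non-speech portions 54 and 55 each last 1 second or longer.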
【００８１】
自動翻訳装置１は、その場面で、ユーザが「いちばん近い」と発声している途中の時刻ｔ１から時刻ｔ２までマイクボタン４を１秒以内の時間Δｔ１だけ押したときに、ＬＥＤ６を時刻ｔ１から音声部分４４が終了した時刻ｔ３から所定時間Δｔ２だけ経過した時刻ｔ４まで点灯する。このとき、自動翻訳装置１は、音声部分４１〜４４をそれぞれ認識対象６１〜６４として音声認識し、音声部分４５に対しては音声認識を行わない。また、仮に非音声部分５４が１秒以内であったため、ノイズである音声部分４５が認識対象となってしまった場合でも、翻訳を行う前に、ユーザが音声部分４５の認識結果の文字列をタップして淡文字列に変更することで、ノイズが認識された文字列部分を容易に除外することができる。
【００８２】
Next, the operation in another case where speech is input to the automatic translation device 1 is described.
図８は、ある場面でのマイクボタン４が押される状態と認識対象との関係を示している。その場面では、ユーザにより「ええと」、「いちばん近い」、「駅は」、「どこですか？」と区切りながら発音され、さらに「駅は」と「どこですか？」との間に間をおいて発音されている。さらに、自動翻訳装置１が、「ええと」を音声部分８１として検出し、「いちばん近い」を音声部分８２として検出し、「駅は」を音声部分８３として検出し、「どこですか？」を音声部分８４として検出している。このとき、自動翻訳装置１は、さらに、音声部分８１の後に非音声部分９１を検出し、音声部分８２の後に非音声部分９２を検出し、音声部分８３の後に非音声部分９３を検出し、音声部分８４の後に非音声部分９４を検出する。非音声部分９１、９２は、それぞれ１秒以内だけ続き、非音声部分９３、９４は、１秒以上続いている。
【００８３】
自動翻訳装置１は、その場面で、ユーザが「いちばん近い」と発声している途中の時刻ｔ８から時刻ｔ９までマイクボタン４を１秒以内の時間Δｔ５だけ押したときに、ＬＥＤ６を時刻ｔ８から時刻ｔ１１まで点灯する。時刻ｔ１１は、音声部分８３が終了した時刻ｔ１０から所定時間Δｔ６（１秒）だけ経過した時刻である。このとき、自動翻訳装置１は、音声部分８１〜８３をそれぞれ認識対象１０１〜１０３として音声認識する。
【００８４】
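Suppose the user then notices from the LED 6 being off that speech recognition has ended, and presses the microphone button 4 from time t12 to time t13, for a duration Δt7 of less than 1 second, while saying "どこですか？" (where is it?). The LED 6 then lights again from time t12 to time t15, where time t15 is the time at which the predetermined time Δt8 (1 second) has elapsed since time t14, when speech portion 84 ended. In this case, the automatic translation device 1 recognizes speech portion 84 in addition to speech portions 81 to 83, treating them as recognition targets 111 to 114, respectively. As a result, speech portions 81 through 84 are all recognized.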
【００８５】
すなわち、ユーザは、発声の途中でＬＥＤ６が消灯したときに、再度マイクボタン４を押すことにより、自動翻訳装置１により１つの文として音声認識される。ユーザは、発声している途中で音声認識が終了したときに、再度マイクボタン４を押すことにより、音声を追加して入力することができ好ましい。
【００８６】
次に、本発明による自動翻訳装置の他の実施の形態について説明する。図９に、マイクボタン４が押されたときの動作を示す。マイクボタン４が押されたとき、まず、図５に示した動作に基づき、音声認識が行われ、結果が表示される（ステップＳ２１）。その後、３秒以内に音声検出により音声部分が検出された場合、その音声部分に対して認識処理を行い、結果の文字列を淡文字列として、それまでの文字列の後に追加して表示する。このとき、後から追加された文字列がノイズ等のユーザが意図しないものであれば、そのまま翻訳ボタン２５をタップすることで、後から追加された文字列は除外して翻訳を行わせることができ、また、後から追加された文字列がユーザの意図した発声であれば、その文字列をタップして濃文字列としてから翻訳させることで、後から発声した音声部分を含めて翻訳を行わせることができる。
【００８７】
図１０は、マイクボタン４が押される状態と認識対象との関係を示している。マイクボタンが押され、音声部分８１〜８３が音声認識されるのは図８の場合と同様である。「ええと」「いちばん近い」「駅は」の発声が音声認識され、表示される。このとき、認識結果として「ええと」「いちばん近い」「駅は」までが一旦表示されているが、この後、ユーザが「どこですか」と続けて発声すると、音声部分８４が入力される。このとき、自動翻訳装置１は、音声部分８４を認識対象１０４として音声認識する。音声認識された結果「どこですか」は、淡い灰色で画面に追加する。
【００８８】
このときユーザは、「どこですか」の文字列をタップして濃文字列にすることで、その部分も含めて翻訳させることができる。また、仮に、後から入力された部分が、ノイズ等のユーザが意図しない音声であっても、淡文字列の状態のままで翻訳ボタン２５をタップすれば、後から入力された部分を除外して翻訳させることができる。
【００８９】
【発明の効果】
以上説明したように、本発明においては、次のような効果を奏する。
第一の効果は、ユーザがマイクボタンを押すタイミングが適切でない場合でも、音声認識の開始位置と終了位置をより適切に決定することができることである。第二の効果は、ユーザが現在発声している音声が認識されているかどうかや、発声に間が空いたときに音声認識が終了したかどうかが、ユーザに伝わるということである。
第三の効果は、仮に発声に間が空き音声認識が終了した場合でも、ユーザがそれを知り、続けてマイクボタンを押すことで、そのまま音声認識を続けることができるということである。
第四の効果は、仮にノイズ等の余計な音声が認識されたり、前の発声から間を空けて発声したりした場合でも、認識結果の採用・不採用を切り替えて、意図して発声した部分のみを容易に選択することができるということである。
本発明の音声認識装置では、音声認識する認識対象をより適切に決定することができる。
また、現在発声中の内容が認識されているかどうかをユーザが容易に知ることができ、発声の途中で間が空き認識処理が途切れてしまった場合でも、ボタンを押すことで認識を継続させることができる。
本発明の音声認識装置は、さらに、自動翻訳装置に適用されることが好適である。
【図面の簡単な説明】
【図１】図１は、本発明による自動翻訳装置の実施の形態を示す斜視図である。
【図２】図２は、自動翻訳装置本体の実施の形態を示すブロック図である。
【図３】図３は、タッチパネルディスプレイに表示される画面の実施の形態を示す図である。
【図４】図４は、音声を監視する動作の実施の形態を示すフローチャートである。
【図５】図５は、マイクボタンが押されたときの動作の実施の形態を示すフローチャートである。
【図６】図６は、自動翻訳する動作の実施の形態を示すフローチャートである。
【図７】図７は、認識対象の位置を示すタイムチャートである。
【図８】図８は、認識対象の位置を示すタイムチャートである。
【図９】図９は、マイクボタンが押されたときの動作の実施の他の形態を示すフローチャートである。
【図１０】図１０は、認識対象の位置を示すタイムチャートである。
【符号の説明】
１：自動翻訳装置
２：自動翻訳装置本体
３：マイク
４：マイクボタン
５：タッチパネルディスプレイ
６：ＬＥＤ
７：スピーカ
１１：音声検出部
１２：音声記録部
１３：マイクボタン検出部
１４：認識状態表示部
１５：音声認識部
１６：文節選択部
１７：翻訳部
１８：お手本部
１９：ガイド表示部
２１：画面
２２：レベルメータ
２３：認識結果表示欄
２４：翻訳結果表示欄
２５：翻訳ボタン
２６：お手本ボタン
２７：指示棒
２８：目盛り
３１〜３４：文字列
４１〜４５：音声部分
５１〜５５：非音声部分
６１〜６５：認識対象
７１〜７５：認識対象
８１〜８４：音声部分
９１〜９４：非音声部分
１０１〜１０３：認識対象
１１１〜１１４：認識対象
[0001]
[Technical field to which the invention belongs]
The present invention relates to a speech recognition device, a speech recognition method and a speech recognition program used therefor, and more particularly to a method for identifying speech to be recognized in a speech recognition device that recognizes input speech.
[0002]
[Prior art]
Speech recognition devices that recognize input speech are known. Such a device detects the speech uttered by the user by using, for example, the power value of the input speech, and then performs speech recognition. However, besides the speech the user actually uttered, noise arising in the surroundings may also be detected and recognized, resulting in erroneous input. For this reason, speech recognition devices often use a scheme in which the user performs an input operation with a button or the like, such as pressing a button immediately before speaking, in order to specify the speech to be recognized. Two typical schemes, which differ in how the target speech is determined, are the PTT (Push-To-Talk) scheme and the PTA (Push-To-Activate) scheme.
[0003]
In the PTT scheme, the speech input from the time the button is pressed until the time it is released is the target of speech recognition. In the PTA scheme, the speech input from the time the button is pressed until the speech pauses for at least a predetermined time is the target. In either scheme, however, when the user presses the button after starting to speak, the beginning of the speech is cut off before recognition. There is therefore a demand for a speech recognition device that recognizes speech without losing its beginning even when the user presses the button late.
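The two prior-art schemes, and the clipping problem they share, can be illustrated with a short sketch. It models speech portions as time intervals and assumes they are all known in advance; the function names `ptt_window`, `pta_window` and `clip`, and the `pause` parameter, are hypothetical and do not come from the cited publications.

```python
from typing import List, Optional, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def ptt_window(on: float, off: float) -> Interval:
    """PTT: recognize what is input between press and release."""
    return (on, off)


def pta_window(on: float, segments: List[Interval], pause: float) -> Interval:
    """PTA: recognize from the press until speech pauses for `pause` seconds."""
    last_end: Optional[float] = None
    for start, end in sorted(segments):
        if end <= on:
            continue                       # over before the button was pressed
        if last_end is not None and start - last_end >= pause:
            break                          # long pause: recognition has ended
        last_end = end
    return (on, on if last_end is None else last_end)


def clip(segment: Interval, window: Interval) -> Optional[Interval]:
    """Return the part of a speech portion that falls inside the window."""
    start, end = max(segment[0], window[0]), min(segment[1], window[1])
    return (start, end) if start < end else None
```

For an utterance that starts at 0.0 s with the button pressed at 0.4 s, both windows begin at 0.4 s, so `clip` returns an interval whose first 0.4 s is missing. That lost head is the problem the following paragraphs address.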
[0004]
Japanese Patent Laid-Open No. 2-184915 discloses a speech recognition device that sequentially stores in a ring buffer the input speech data received up to immediately before a command to start the recognition processing operation is given, and that uses the input speech data stored in the ring buffer as needed to detect the speech section and extract its speech data. Such a speech recognition device can, very simply and effectively, execute speech recognition processing on speech data whose beginning is not missing.
[0005]
Japanese Laid-Open Patent Publication No. 4-24694 discloses a speech input circuit comprising a microphone that converts uttered speech into an electric signal, a speech detection unit that detects from that signal that speech is being input, and a delay circuit that delays the speech signal by a predetermined time before outputting it. Recognition of the speech signal delayed by the delay circuit is started with reference to the detection result of the speech detection unit. With such a speech input circuit, the operator can speak at any timing and have the speech recognized without its beginning being lost.
[0006]
Japanese Patent Application Laid-Open No. 8-185196 discloses a speech section detection device that receives a speech signal and extracts and outputs a speech section in response to a switch operation designating a detection range. The device comprises storage means that stores the input speech signal for a fixed time, and control means that extracts and outputs exactly one speech section from the input speech signal held in the storage means, searching a range wider than the detection range designated by the switch operation.
[0007]
Such a speech section detection device continuously judges speech sections by discriminating voiced from unvoiced sound, and stores the judgment results together with the input signal for up to a fixed time. As a result, even when the utterance begins before the switch is pressed, the speech section can be detected without losing the beginning of the word. In addition, because only one speech section is detected and transmitted for each switch-designated section, unrelated words that the speaker utters shortly before or after the intended utterance are not detected as a speech section, which makes malfunctions less likely when the device is applied to, for example, a speech recognition device.
[0008]
Japanese Laid-Open Patent Publication No. 2001-67091 discloses a speech recognition device that improves convenience by combining at least two of the keyword scheme, the trigger scheme, and the PTT scheme as the method of specifying the utterance section to be recognized. When the speech recognition operation starts, the device determines whether a keyword has been uttered or the talk button has been operated. If a keyword has been uttered, processing branches to keyword-scheme speech recognition, and the speech section detection unit estimates the start point and the end point of the utterance section. If the talk button has been operated, the start point of the utterance section is fixed at that moment. The device then judges whether the operation was momentary. If it was, processing branches to trigger-scheme speech recognition, and the speech detection unit estimates the end point of the utterance section. If the talk button is operated continuously, processing branches to PTT-scheme speech recognition, and the end point of the utterance section is fixed when the talk button is released.
[0009]
[Patent Document 1]
JP-A-2-184915
[Patent Document 2]
JP-A-4-24694
[Patent Document 3]
JP-A-8-185196
[Patent Document 4]
JP 2001-67091 A
[0010]
[Problems to be solved by the invention]
The speech recognition devices described above, however, have the following problems.
First, the start and end positions of the speech uttered by the user may not be determined correctly. When the microphone button is pressed later than the start of the utterance, methods that use a ring buffer, a delay circuit, or the like to keep the beginning of the utterance from being lost cannot fully prevent that loss if the delay exceeds the length that the ring buffer or delay circuit can handle.
Second, even if the device splits the utterance partway through because of a pause, the user does not learn of this until seeing the recognition result. There is no means of telling the user whether speech recognition is being performed on the speech currently being input, so the user can only judge from the character string after the recognition result is displayed.
Third, when a pause in the utterance causes speech recognition to stop partway through, continued speech is not recognized, and the user must repeat the utterance. Even if the device splits the utterance partway through, the user is not told at that moment and has no means of correcting it; the user learns of the interruption only after the recognition result is displayed and must then speak again.
Fourth, when extraneous sound such as noise is recognized, or when speech recognition stops partway through and leaves part of the utterance unrecognized, the result cannot be corrected easily. To correct the recognized character string when extraneous sound was recognized, the user must go through a series of operations to select and delete the corresponding characters, and any part left unrecognized because recognition stopped partway through must be spoken again.
An object of the present invention is to provide a speech recognition apparatus that more appropriately determines a start position and an end position of a recognition target.
Another object of the present invention is to provide a voice recognition device that tells a user whether or not the voice currently uttered by the user is recognized.
Another object of the present invention is to provide a speech recognition apparatus that can continue speech with a simple operation even when speech recognition is interrupted midway due to a gap in speech.
Another object of the present invention is to provide a voice recognition device whose result the user can easily correct even when extraneous voice such as noise is recognized or when part of the voice is not recognized because voice recognition was interrupted.
Another object of the present invention is to provide a speech recognition apparatus suitable for an automatic translation apparatus.
[0011]
[Means for Solving the Problems]
Hereinafter, means for solving the problems will be described using, in parentheses, the numbers and symbols used in [Embodiments of the Invention]. These numbers and symbols are added to clarify the correspondence between the description of [Claims] and the description of [Embodiments of the Invention], and should not be used to interpret the technical scope of the invention described in [Claims].
[0012]
A speech recognition device (1) according to the present invention is a speech recognition device that recognizes input speech, and includes: a speech detection unit (11) that detects a speech portion uttered by a user from the input speech and divides the input speech into a speech portion and a non-speech portion; a speech recognition unit (15) that performs recognition processing on the speech portion; and a recognition state display unit (14) that displays at all times whether the speech recognition unit (15) is performing recognition processing on the input speech, thereby informing the user.
[0013]
At this time, the user can confirm whether the voice recognition device (1) is recognizing the uttered voice. For example, the recognition state display unit (14) lights a lamp at the time a voice part is input after the on time, that is, at the start time (t1, t8, t12) of the recognition process, and turns the lamp off at the end time. The recognition state display unit (14) also displays the volume on the level meter (22) in a first display form from the start time (t1, t8, t12) to the end time of the recognition process, and after the end time displays the volume on the level meter (22) in a second display form different from the first display form.
[0014]
It is preferable that the level meter (22) displays, in a distinguishable manner, whether the volume is larger or smaller than the volume required for voice recognition by the voice recognition unit (15). At this time, the user can see how loudly to speak and can input voice at an appropriate volume to the voice recognition device (1). As a result, the voice recognition device (1) receives voice at an appropriate volume and can recognize it more accurately.
[0015]
The voice recognition device (1) according to the present invention further includes a microphone button detection unit (13) that detects the on operation, or both the on operation and the off operation, of the microphone button (4) operated when the user speaks. The voice recognition unit (15) selects the voice part to be recognized from the voice parts detected by the voice detection unit (11), based on the time of the on operation, or of both the on operation and the off operation, detected by the microphone button detection unit (13), and performs voice recognition on it. If the microphone button detection unit (13) detects an on operation again within a predetermined time after the recognition process for that voice part has ended and the recognition state display unit (14) has displayed the end of the recognition process, the voice recognition unit (15) selects a further voice part to be recognized based on the time of the newly detected on operation, or of both the on operation and the off operation, and displays both the recognition result for the earlier on operation and the recognition result for the later on operation.
[0016]
That is, the user can operate the microphone button (4) by two methods. The means for distinguishing the first operation method from the second operation method is not limited to comparing the pressing time with the set time; for example, any of the following may be used.
- Two microphone buttons are provided; pressing one is the first operation method and pressing the other is the second operation method.
- The microphone button can be pressed to two depths; pressing it deeply is the first operation method and pressing it lightly is the second operation method.
- The microphone button can be pushed in and can also be rotated or tilted up and down; pushing it in is the first operation method, and rotating it or tilting it up or down is the second operation method.
- The microphone button can be rotated, tilted up and down, or slid up and down; tilting or sliding it upward is the first operation method, and tilting or sliding it downward is the second operation method.
[0017]
The voice recognition device (1) according to the present invention preferably further includes a voice recording unit (12) that records the voice parts (41-45, 81-84) without recording the non-voice parts (51-55, 91-94). At this time, the recognition target portions (61 to 64, 101 to 104, 111 to 114) include those voice parts (41 to 45, 81 to 84) that were input before the on time (t1, t8, t12) and recorded by the voice recording unit (12).
[0018]
The voice recognition unit (15) preferably displays, in a first display form, the recognition result of the voice part input while the recognition state display unit (14) indicates that the recognition process is being performed. When a voice part is detected again within a predetermined time after the recognition state display unit (14) has indicated the end, the voice recognition unit (15) preferably also performs voice recognition on the voice part detected within the predetermined time and displays its recognition result in a second display form different from the first display form.
[0019]
That is, when the microphone button (4) is pressed again before an additional time has elapsed from the end time, the recognition target portions (61 to 64, 101 to 104, 111 to 114) preferably include, among the voice parts (41 to 45, 81 to 84), not only the voice parts up to the first end time but also the voice parts selected as recognition targets by the on time (t12) at which the microphone button (4) was pressed later. At this time, when the voice recognition device (1) ends voice recognition while the user is still speaking, the user can input additional voice by pressing the microphone button (4) again.
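The additional-input rule in this paragraph can be illustrated with a minimal Python sketch. The function names, the list representation of recognition targets, and the behavior when the button is pressed after the additional time has elapsed (treated here as a fresh input) are assumptions made for illustration and are not taken from the specification.

```python
def within_additional_time(end_time, on_time, additional_time):
    """True if the button is pressed again before the additional time
    has elapsed from the end time of the previous recognition."""
    return 0 <= on_time - end_time < additional_time


def combine_targets(first_targets, later_targets, end_time, on_time,
                    additional_time):
    """Return the recognition targets after a second button press.

    Assumption: a press outside the additional time starts a fresh
    input, so only the later targets are kept in that case.
    """
    if within_additional_time(end_time, on_time, additional_time):
        return first_targets + later_targets
    return later_targets


# Hypothetical labels standing in for recognition target portions.
print(combine_targets(["part A", "part B"], ["part C"], 10.0, 11.0, 2.0))
```

With these hypothetical values the second press arrives one second after the end time, inside the two-second additional time, so the earlier and later targets are kept together.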
[0020]
The voice recognition device (1) according to the present invention includes a phrase selection unit (16) that, when instructed by a user operation, converts a recognition result in the first display form into the second display form and a recognition result in the second display form into the first display form, and a processing unit (17) that processes the recognition result displayed in the first display form. An example of the processing unit (17) is a translation unit that translates the recognition result.
[0021]
That is, the phrase selection unit (16) displays the recognized character strings on the display device (5) and switches the display format of a character string selected from them by a user operation. A character string corresponding to a target sentence is displayed in a target display format, and a character string corresponding to a non-target sentence is displayed in a non-target display format different from the target display format. The voice recognition device (1) may recognize unnecessary voice or may fail to recognize necessary voice. With such a voice recognition device (1), the user can select and use only the utterances that are needed from among the recognized character strings.
[0022]
The microphone button detection unit (13) preferably detects the on and off operations of the microphone button (4) and further detects whether these operations were performed as a first operation or a second operation. When the first operation is performed, the voice recognition unit (15) performs recognition processing on the voice parts detected up to a time determined from the time of the off operation. When the second operation is performed, it performs recognition processing on the voice parts detected after the on time until the voice detection unit (11) detects a non-voice part of at least a predetermined length.
[0023]
For example, the recognition target parts (61-65, 71-75, 101-103, 111-114) are the voice parts (41-45, 81-84) input before an end time determined by how the microphone button (4) is operated. When the pressing time (Δt1, Δt3, Δt5) from the on time (t1, t5, t8) at which the microphone button (4) is pressed to the off time (t2, t6, t9) at which it is released is longer than a set time, the end time is the time (t7) at which a preset recognition extension time has elapsed from the off time. When the pressing time (Δt1, Δt3, Δt5) is shorter than the set time, the end time is the time (t4, t11, t15) at which, among the non-voice parts (51-55, 91-94), a part (55, 93, 94) longer than a preset blank determination time is input after the off time.
[0024]
An operation in which the pressing time (Δt1, Δt5, Δt7) from the on time (t1, t8, t12) at which the microphone button (4) is pressed to the off time (t2, t9, t13) at which it is released is longer than the set time is defined as the first operation method, and an operation in which the pressing time (Δt1, Δt5, Δt7) is shorter than the set time is defined as the second operation method. When the first operation method is performed, the voice recognition end time is the time at which the recognition extension time has elapsed from the off time of the microphone button (4), or the time at which a non-voice part longer than the blank determination time is input after the off time. When the second operation method is performed, the voice recognition end time is the time at which a non-voice part longer than the blank determination time is input after the on time of the microphone button (4). The voice recognition unit (15) performs recognition processing on the voice input by the end time.
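The end-time rule of this paragraph can be sketched in Python as follows. This is only an illustration under stated assumptions: the function and parameter names are hypothetical, "the time at which a non-voice part longer than the blank determination time is input" is taken to mean the moment the silence has lasted for the blank determination time, the first operation method uses whichever of its two candidate times comes first, and a pressing time exactly equal to the set time is treated as the second operation method.

```python
def first_long_silence(silences, after, blank_time):
    """Time at which a silence has lasted blank_time, counting only
    silence that occurs at or after `after`. Returns None if none."""
    for start, end in silences:
        s = max(start, after)
        if end - s > blank_time:
            return s + blank_time
    return None


def recognition_end_time(on_time, off_time, set_time, extension_time,
                         blank_time, silences):
    """silences: (start, end) non-voice parts sorted by start time."""
    pressing_time = off_time - on_time
    if pressing_time > set_time:
        # First operation method (long press).
        candidates = [off_time + extension_time]
        t = first_long_silence(silences, off_time, blank_time)
        if t is not None:
            candidates.append(t)
        return min(candidates)
    # Second operation method (short press).
    return first_long_silence(silences, on_time, blank_time)


# Long press: ends when the extension time elapses after release.
print(recognition_end_time(0.0, 3.0, 1.0, 0.5, 1.0, [(3.2, 5.0)]))
# Short press: ends at the first sufficiently long silence.
print(recognition_end_time(0.0, 0.2, 1.0, 0.5, 1.0, [(0.5, 0.8), (2.0, 4.0)]))
```

For a short press with no sufficiently long silence yet observed, the sketch returns `None`, meaning the end time is not yet determined.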
[0025]
At this time, the user can more appropriately determine the end position of the recognition target by operating the microphone button (4) appropriately. That is, the user can either keep the microphone button (4) pressed so that the voice uttered while it is pressed is recognized, or press and then release the microphone button (4), and can thereby determine the end position more appropriately. A user who does not understand well how to handle the voice recognition device (1) often presses the microphone button (4) only momentarily or releases the microphone button (4) before finishing speaking. The voice recognition device (1) can more reliably recognize the voice uttered by the user even when such an operation is performed.
[0026]
The speech recognition method according to the present invention is a speech recognition method for recognizing input speech, and includes: a step (S1) of detecting a speech portion uttered by a user from the input speech and dividing the input speech into a speech portion and a non-speech portion; a step (S13, 15, 16) of performing recognition processing on the speech portion; and a step (S11, 17) of displaying at all times whether recognition processing is being performed on the input speech, thereby informing the user.
[0027]
At this time, the user can confirm whether the voice recognition device (1) is recognizing the uttered voice. For example, the recognition state display unit (14) lights a lamp at the time a voice part is input after the on time, that is, at the start time (t1, t8, t12) of the recognition process, and turns the lamp off at the end time. The recognition state display unit (14) also displays the volume on the level meter (22) in a first display form from the start time (t1, t8, t12) to the end time of the recognition process, and after the end time displays the volume on the level meter (22) in a second display form different from the first display form.
[0028]
It is preferable that the level meter (22) displays, in a distinguishable manner, whether the volume is larger or smaller than the volume required for voice recognition by the voice recognition unit (15). At this time, the user can see how loudly to speak and can input voice at an appropriate volume to the voice recognition device (1). As a result, the voice recognition device (1) receives voice at an appropriate volume and can recognize it more accurately.
[0029]
The speech recognition method according to the present invention preferably further includes: a step (S9, 14, 19) of detecting the on operation, or both the on operation and the off operation, of a microphone button operated when the user speaks; a step (S13, 15, 16) of selecting the speech portion to be recognized from the speech portions detected by the speech detection unit, based on the time of the detected on operation, or of both the on operation and the off operation, and performing speech recognition on it; and a step of, when an on operation is detected again within a predetermined time after the recognition process for the speech portion has ended and the end of the recognition process has been displayed, selecting a further speech portion to be recognized based on the time of the newly detected on operation, or of both the on operation and the off operation, and displaying both the recognition result for the earlier on operation and the recognition result for the later on operation.
[0030]
The speech recognition method according to the present invention includes a step of displaying, in a first display form, the recognition result of the speech portion input while it is indicated that the recognition process is being performed, and a step of, when a speech portion is detected again within a predetermined time after the recognition state display unit has indicated the end, also performing speech recognition on the speech portion detected within the predetermined time and displaying its recognition result in a second display form different from the first display form.
[0031]
That is, when the microphone button (4) is pressed again before an additional time has elapsed from the end time, the recognition target portions (61 to 64, 101 to 104, 111 to 114) preferably include, among the voice parts (41 to 45, 81 to 84), not only the voice parts up to the first end time but also the voice parts selected as recognition targets by the on time (t12) at which the microphone button (4) was pressed later. At this time, when the voice recognition device (1) ends voice recognition while the user is still speaking, the user can input additional voice by pressing the microphone button (4) again.
[0032]
The voice recognition method according to the present invention includes a step (S32) of converting, when instructed by a user operation, a recognition result in the first display form into the second display form and a recognition result in the second display form into the first display form, and a step (S34) of processing the recognition result displayed in the first display form. An example of this processing is translation of the recognition result.
[0033]
That is, the phrase selection unit (16) displays the recognized character strings on the display device (5) and switches the display format of a character string selected from them by a user operation. A character string corresponding to a target sentence is displayed in a target display format, and a character string corresponding to a non-target sentence is displayed in a non-target display format different from the target display format. The voice recognition device (1) may recognize unnecessary voice or may fail to recognize necessary voice. With such a voice recognition device (1), the user can select and use only the utterances that are needed from among the recognized character strings.
[0034]
The speech recognition method according to the present invention preferably further includes: a step (S9) of detecting the on and off operations of the microphone button; a step (S14) of detecting whether these operations were performed as a first operation or a second operation; and a step (S15, 16) of performing recognition processing, when the first operation is performed, on the speech portions detected up to a time determined from the time of the off operation, and, when the second operation is performed, on the speech portions detected after the on time until the speech detection unit detects a non-speech portion of at least a predetermined length.
[0035]
For example, the recognition target parts (61-65, 71-75, 101-103, 111-114) are the voice parts (41-45, 81-84) input before an end time determined by how the microphone button (4) is operated. When the pressing time (Δt1, Δt3, Δt5) from the on time (t1, t5, t8) at which the microphone button (4) is pressed to the off time (t2, t6, t9) at which it is released is longer than a set time, the end time is the time (t7) at which a preset recognition extension time has elapsed from the off time. When the pressing time (Δt1, Δt3, Δt5) is shorter than the set time, the end time is the time (t4, t11, t15) at which, among the non-voice parts (51-55, 91-94), a part (55, 93, 94) longer than a preset blank determination time is input after the off time.
[0036]
An operation in which the pressing time (Δt1, Δt5, Δt7) from the on time (t1, t8, t12) at which the microphone button (4) is pressed to the off time (t2, t9, t13) at which it is released is longer than the set time is defined as the first operation method, and an operation in which the pressing time (Δt1, Δt5, Δt7) is shorter than the set time is defined as the second operation method. When the first operation method is performed, the voice recognition end time is the time at which the recognition extension time has elapsed from the off time of the microphone button (4), or the time at which a non-voice part longer than the blank determination time is input after the off time. When the second operation method is performed, the voice recognition end time is the time at which a non-voice part longer than the blank determination time is input after the on time of the microphone button (4). The voice recognition unit (15) performs recognition processing on the voice input by the end time.
[0037]
At this time, the user can more appropriately determine the end position of the recognition target by operating the microphone button (4) appropriately. That is, the user can either keep the microphone button (4) pressed so that the voice uttered while it is pressed is recognized, or press and then release the microphone button (4), and can thereby determine the end position more appropriately. A user who does not understand well how to handle the voice recognition device (1) often presses the microphone button (4) only momentarily or releases the microphone button (4) before finishing speaking. The voice recognition device (1) can more reliably recognize the voice uttered by the user even when such an operation is performed.
[0038]
A speech recognition program according to the present invention is a computer program that causes a computer to execute a speech recognition method for recognizing input speech, the method including: a step (S1) of detecting a speech portion uttered by a user from the input speech and dividing the input speech into a speech portion and a non-speech portion; a step (S13, 15, 16) of performing recognition processing on the speech portion; and a step (S11, 17) of displaying at all times whether recognition processing is being performed on the input speech, thereby informing the user.
[0039]
At this time, the user can confirm whether the voice recognition device (1) is recognizing the uttered voice. For example, the recognition state display unit (14) lights a lamp at the time a voice part is input after the on time, that is, at the start time (t1, t8, t12) of the recognition process, and turns the lamp off at the end time. The recognition state display unit (14) also displays the volume on the level meter (22) in a first display form from the start time (t1, t8, t12) to the end time of the recognition process, and after the end time displays the volume on the level meter (22) in a second display form different from the first display form.
[0040]
It is preferable that the level meter (22) displays, in a distinguishable manner, whether the volume is larger or smaller than the volume required for voice recognition by the voice recognition unit (15). At this time, the user can see how loudly to speak and can input voice at an appropriate volume to the voice recognition device (1). As a result, the voice recognition device (1) receives voice at an appropriate volume and can recognize it more accurately.
[0041]
The speech recognition program according to the present invention preferably further includes: a step (S9, 14, 19) of detecting the on operation, or both the on operation and the off operation, of a microphone button operated when the user speaks; a step (S13, 15, 16) of selecting the speech portion to be recognized from the speech portions detected by the speech detection unit, based on the time of the detected on operation, or of both the on operation and the off operation, and performing speech recognition on it; and a step of, when an on operation is detected again within a predetermined time after the recognition process for the speech portion has ended and the end of the recognition process has been displayed, selecting a further speech portion to be recognized based on the time of the newly detected on operation, or of both the on operation and the off operation, and displaying both the recognition result for the earlier on operation and the recognition result for the later on operation.
[0042]
The speech recognition program according to the present invention includes a step of displaying, in a first display form, the recognition result of the speech portion input while it is indicated that the recognition process is being performed, and a step of, when a speech portion is detected again within a predetermined time after the recognition state display unit has indicated the end, also performing speech recognition on the speech portion detected within the predetermined time and displaying its recognition result in a second display form different from the first display form.
[0043]
That is, when the microphone button (4) is pressed again before an additional time has elapsed from the end time, the recognition target portions (61 to 64, 101 to 104, 111 to 114) preferably include, among the voice parts (41 to 45, 81 to 84), not only the voice parts up to the first end time but also the voice parts selected as recognition targets by the on time (t12) at which the microphone button (4) was pressed later. At this time, when the voice recognition device (1) ends voice recognition while the user is still speaking, the user can input additional voice by pressing the microphone button (4) again.
[0044]
The speech recognition program according to the present invention includes a step (S32) of converting, when instructed by a user operation, a recognition result in the first display form into the second display form and a recognition result in the second display form into the first display form, and a step (S34) of processing the recognition result displayed in the first display form. An example of this processing is translation of the recognition result.
[0045]
That is, the phrase selection unit (16) displays the recognized character strings on the display device (5) and switches the display format of a character string selected from them by a user operation. A character string corresponding to a target sentence is displayed in a target display format, and a character string corresponding to a non-target sentence is displayed in a non-target display format different from the target display format. The voice recognition device (1) may recognize unnecessary voice or may fail to recognize necessary voice. With such a voice recognition device (1), the user can select and use only the utterances that are needed from among the recognized character strings.
[0046]
The speech recognition program according to the present invention preferably further includes: a step (S9) of detecting the on and off operations of the microphone button; a step (S14) of detecting whether these operations were performed as a first operation or a second operation; and a step (S15, 16) of performing recognition processing, when the first operation is performed, on the speech portions detected up to a time determined from the time of the off operation, and, when the second operation is performed, on the speech portions detected after the on time until the speech detection unit detects a non-speech portion of at least a predetermined length.
[0047]
For example, the recognition target parts (61-65, 71-75, 101-103, 111-114) are the voice parts (41-45, 81-84) input before an end time determined by how the microphone button (4) is operated. When the pressing time (Δt1, Δt3, Δt5) from the on time (t1, t5, t8) at which the microphone button (4) is pressed to the off time (t2, t6, t9) at which it is released is longer than a set time, the end time is the time (t7) at which a preset recognition extension time has elapsed from the off time. When the pressing time (Δt1, Δt3, Δt5) is shorter than the set time, the end time is the time (t4, t11, t15) at which, among the non-voice parts (51-55, 91-94), a part (55, 93, 94) longer than a preset blank determination time is input after the off time.
[0048]
An operation in which the pressing time (Δt1, Δt5, Δt7) from the on time (t1, t8, t12) at which the microphone button (4) is pressed to the off time (t2, t9, t13) at which it is released is longer than the set time is defined as the first operation method, and an operation in which the pressing time (Δt1, Δt5, Δt7) is shorter than the set time is defined as the second operation method. When the first operation method is performed, the voice recognition end time is the time at which the recognition extension time has elapsed from the off time of the microphone button (4), or the time at which a non-voice part longer than the blank determination time is input after the off time. When the second operation method is performed, the voice recognition end time is the time at which a non-voice part longer than the blank determination time is input after the on time of the microphone button (4). The voice recognition unit (15) performs recognition processing on the voice input by the end time.
[0049]
At this time, the user can more appropriately determine the end position of the recognition target by operating the microphone button (4) appropriately. That is, the user can either keep the microphone button (4) pressed so that the voice uttered while it is pressed is recognized, or press and then release the microphone button (4), and can thereby determine the end position more appropriately. A user who does not understand well how to handle the voice recognition device (1) often presses the microphone button (4) only momentarily or releases the microphone button (4) before finishing speaking. The voice recognition device (1) can more reliably recognize the voice uttered by the user even when such an operation is performed.
[0050]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of a speech recognition apparatus according to the present invention will be described with reference to the drawings. Here, as an example, the speech recognition apparatus according to the present invention is applied to an automatic translation apparatus that recognizes input speech, translates it into another language, and reads it out. The automatic translation apparatus 1 to which the speech recognition apparatus is applied includes an automatic translation apparatus main body 2, a microphone 3, a microphone button 4, a touch panel display 5, an LED 6, and a speaker 7, as shown in FIG. . The automatic translation apparatus main body 2 is an information processing apparatus (computer), and executes a recorded computer program to control the microphone 3, the microphone button 4, the touch panel display 5, the LED 6, and the speaker 7.
[0051]
The microphone 3 converts the input voice into an electrical signal and inputs it to the automatic translation apparatus body 2. The microphone button 4 is pressed or released by the user and inputs a potential signal indicating the state to the automatic translation apparatus main body 2. The touch panel display 5 displays a screen indicated by an electric signal output from the automatic translation apparatus main body 2. The touch panel display 5 further inputs an electric signal indicating a position tapped (touched) by a user with a pen or a finger to the automatic translation apparatus main body 2. The LED 6 is turned on or off in response to an electrical signal output from the automatic translation apparatus main body 2. The speaker 7 outputs a sound in response to the electric signal output from the automatic translation apparatus main body 2.
[0052]
FIG. 2 shows the automatic translation apparatus main body 2 in detail. The automatic translation apparatus main body 2 includes, as computer programs, a voice detection unit 11, a voice recording unit 12, a microphone button detection unit 13, a recognition state display unit 14, a voice recognition unit 15, a phrase selection unit 16, a translation unit 17, a model unit 18, and a guide display unit 19.
[0053]
The voice detection unit 11 separates the voice input from the microphone 3 into a voice part and a non-voice part. The voice part is voice uttered by a person. The non-voice part is a state in which there is no sound, or a sound other than voice uttered by a person, that is, noise. The voice detection unit 11 can further record the voiceprint of a specific user and separate the voice input from the microphone 3 into a voice part, which is the voice uttered by that user, and the remaining non-voice part.
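The specification does not say how the voice detection unit 11 tells voice from non-voice. As a rough illustration only, the following Python sketch assumes a simple per-frame power threshold (the kind of power-based detection mentioned for conventional devices) and groups consecutive frames into voice and non-voice parts. The function name, the frame representation, and the threshold are all hypothetical; a practical detector, and the voiceprint-based separation described above, would need considerably more than this.

```python
def split_voice_parts(frame_powers, threshold):
    """Group consecutive frames into voice / non-voice parts.

    Assumption: a frame whose power is at or above `threshold` counts
    as voice. Returns (is_voice, first_frame, end_frame_exclusive).
    """
    parts = []
    for i, power in enumerate(frame_powers):
        is_voice = power >= threshold
        if parts and parts[-1][0] == is_voice:
            # Extend the current part by one frame.
            parts[-1] = (is_voice, parts[-1][1], i + 1)
        else:
            # The voice / non-voice state changed: start a new part.
            parts.append((is_voice, i, i + 1))
    return parts


for part in split_voice_parts([0.1, 0.9, 0.8, 0.2, 0.1, 0.7], 0.5):
    print(part)
```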
[0054]
The voice recording unit 12 always records the voice parts separated by the voice detection unit 11 in a buffer (not shown). When the buffer does not have enough free area for a new recording, the voice recording unit 12 expands the buffer and records the voice. When the voice recorded in the past includes sufficiently old voice that has not been used, the voice recording unit 12 deletes that voice to increase the free space in the buffer. The microphone button detection unit 13 determines whether the microphone button 4 is pressed or released by the user. Based on the state of the microphone button 4, the voice recognition unit 15 determines the part to be recognized from the voice input from the microphone 3 and recorded by the voice recording unit 12, and performs voice recognition on that recognition target to generate a character language.
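The buffering behavior of the voice recording unit 12 can be sketched in Python as below. The class and field names, the `max_age` parameter, and the choice to keep parts that have already been used are assumptions for illustration. A Python list grows on demand, which stands in here for expanding the buffer when free area runs out.

```python
from dataclasses import dataclass


@dataclass
class VoicePart:
    start: float          # time the voice part began (seconds)
    end: float            # time the voice part ended (seconds)
    samples: list         # recorded audio samples
    used: bool = False    # set once the part has been recognized


class VoiceRecorder:
    """Keeps voice parts only; non-voice parts are never passed in."""

    def __init__(self, max_age):
        self.max_age = max_age   # hypothetical "sufficiently old" limit
        self.parts = []

    def record(self, part, now):
        # Delete sufficiently old parts that were never used.
        self.parts = [p for p in self.parts
                      if p.used or now - p.end <= self.max_age]
        # The list grows as needed, like an expanding buffer.
        self.parts.append(part)


recorder = VoiceRecorder(max_age=10.0)
recorder.record(VoicePart(0.0, 1.0, []), now=1.0)
recorder.record(VoicePart(20.0, 21.0, []), now=21.0)
print(len(recorder.parts))
```

In this usage the first part is 20 seconds old and unused when the second part arrives, so it is dropped and only the newer part remains.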
[0055]
Based on the state of the microphone button 4, the recognition state display unit 14 uses the touch panel display 5 or the LED 6 to notify the user whether the voice being input at that time is a recognition target for voice recognition. That is, in the initial state the recognition state display unit 14 always displays the volume of the input voice on the touch panel display 5 and keeps the LED 6 turned off. When the microphone button 4 is pressed and the voice recognition unit 15 performs voice recognition on the voice, the recognition state display unit 14 displays the volume of the input voice on the touch panel display 5 in a different color and turns on the LED 6. When the voice recognition unit 15 determines that the input voice is not a recognition target, the recognition state display unit 14 displays the volume of the input voice on the touch panel display 5 in the initial color again and turns off the LED 6.
[0056]
The phrase selection unit 16 displays the character language generated by the voice recognition unit 15 on the touch panel display 5. The character language is formed of black character strings displayed in black, dark character strings displayed in dark gray, and light character strings displayed in light gray. That is, the character strings are color-coded according to an estimate of whether the user wanted to input the corresponding voice. A black character string is the character string of a voice part that is very likely to be an utterance the user wanted to input. A dark character string is the character string of a voice part that is likely to be an utterance the user wanted to input. A light character string is the character string of a voice part that has undergone voice recognition processing but is considered unlikely to be an utterance the user wanted to input. These estimates are made, for example, as follows. A voice part input while the microphone button 4 is pressed or immediately after it is pressed is assumed to be very likely an utterance the user wanted to input, and is made a black character string. A voice part that is continuous with the voice part of a black character string at a short time interval is assumed to be an utterance the user input in continuation of the black character string, and is made a dark character string. On the other hand, a voice part input at an interval from the voice parts of the black and dark character strings is presumed likely to be ambient noise, or a voice of the user not intended as input to the device, and is made a light character string.
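The color estimation described in this paragraph can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `classify_voice_parts`, the thresholds `short_gap` and `just_after`, and the decision to spread the dark color both before and after a black voice part are hypothetical choices, since the specification gives no concrete values for "a short time interval" or "immediately after".

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) of one voice part, in seconds


def classify_voice_parts(
    segments: List[Segment],
    press: float,
    release: float,
    short_gap: float = 1.0,   # assumed meaning of "a short time interval"
    just_after: float = 0.5,  # assumed meaning of "immediately after"
) -> List[str]:
    """Return "black", "dark", or "light" for each voice part.

    The segments are assumed to be sorted by time and non-overlapping.
    """
    colors = ["light"] * len(segments)

    # Voice parts overlapping the press, or starting right after it: black.
    for i, (start, end) in enumerate(segments):
        if end >= press and start <= release + just_after:
            colors[i] = "black"

    # A light part adjacent to a black or dark part becomes dark when the
    # silence between the two is shorter than short_gap; repeat until stable.
    changed = True
    while changed:
        changed = False
        for i in range(len(segments)):
            if colors[i] != "light":
                continue
            for j in (i - 1, i + 1):
                if not 0 <= j < len(segments) or colors[j] == "light":
                    continue
                first, second = (j, i) if j < i else (i, j)
                gap = segments[second][0] - segments[first][1]
                if gap < short_gap:
                    colors[i] = "dark"
                    changed = True
                    break
    return colors
```

With made-up timings in which the button is pressed during the second of five voice parts and the fifth follows a long silence, the second part comes out black, its close neighbors dark, and the isolated fifth part light.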
[0057]
The phrase selection unit 16 further switches and displays the color of the character string when the user taps a part of the character string in units of the voice part that is the source of the character string. That is, when the black character string is tapped, the phrase selection unit 16 displays the black character string in light gray. When the dark character string is tapped, the phrase selection unit 16 displays the dark character string in light gray. When the light character string is tapped, the phrase selection unit 16 displays the light character string in dark gray.
[0058]
The translation unit 17 translates a character string displayed in black on the touch panel display 5 or displayed in dark gray into another language to generate a translated sentence. The translation unit 17 further displays the translated sentence on the touch panel display 5 and outputs the translated sentence from the speaker 7.
[0059]
The model unit 18 displays a model button on the touch panel display 5 and, when the model button is tapped, outputs a sound of a predetermined volume from the speaker 7, thereby showing the user a model of the volume of voice to be input from the microphone 3.
[0060]
The guide display unit 19 displays precautions, for example a message prompting the user to speak without leaving gaps in the utterance.
[0061]
FIG. 3 shows a screen displayed on the touch panel display 5. The screen 21 includes a level meter 22, a recognition result display field 23, a translation result display field 24, a translation button 25, and a model button 26. The level meter 22 is formed of an indicator bar 27 and a scale 28.
[0062]
The left end of the indicator bar 27 is fixed. When the volume of the input voice is high, the indicator bar 27 is displayed extended in the horizontal direction, and when the volume of the input voice is low, the indicator bar 27 is displayed contracted in the horizontal direction. The scale 28 indicates the volume of voice that the voice recognition unit 15 can recognize sufficiently. That is, the voice input to the microphone 3 is recognized sufficiently by the voice recognition unit 15 when the right end of the indicator bar 27 is displayed to the right of the scale 28, and may not be recognized sufficiently by the voice recognition unit 15 when the right end of the indicator bar 27 is displayed to the left of the scale 28.
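As a rough text-mode illustration of the level meter 22, the following sketch draws a bar whose left end is fixed and marks the scale position. The function names, the character-cell rendering, and the normalized volume range are all hypothetical; the specification describes only the on-screen behavior.

```python
def render_level_meter(volume: float, scale: float,
                       width: int = 20, max_volume: float = 1.0) -> str:
    """Draw the bar as '#' cells from the left, with '|' at the scale."""
    clamped = min(max(volume, 0.0), max_volume)
    filled = round(clamped / max_volume * width)   # length of the bar
    mark = round(scale / max_volume * width)       # position of the scale
    cells = ["#" if i < filled else "." for i in range(width)]
    cells.insert(mark, "|")
    return "".join(cells)


def is_loud_enough(volume: float, scale: float) -> bool:
    """True when the right end of the bar lies to the right of the scale."""
    return volume > scale
```

For example, `render_level_meter(0.5, 0.25, width=8)` gives `##|##....`, where the bar extends past the scale mark.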
[0063]
In the recognition result display field 23, the recognition result generated by the voice recognition unit 15 is displayed. The recognition result is formed from character strings 31-34. The character strings 31 to 34 are displayed in black, dark gray, or light gray. A black character string among the character strings 31 to 34 is displayed in light gray when tapped. A dark gray character string in the character strings 31 to 34 is displayed in a light gray color when tapped. A light gray character string in the character strings 31 to 34 is displayed in a dark gray color when tapped.
[0064]
When the translation button 25 is tapped, the character strings displayed in black or dark gray among the recognition results displayed in the recognition result display field 23 are translated by the translation unit 17, the translated sentence is displayed in the translation result display field 24, and the translated sentence is uttered from the speaker 7. When the model button 26 is tapped, a sound of a predetermined volume is output from the speaker 7, and the automatic translation apparatus 1 thereby shows the user a model of the volume of voice to be input from the microphone 3.
[0065]
The operation of this embodiment of the automatic translation apparatus 1 includes an operation for monitoring the input voice, an operation when the microphone button 4 is pressed, an automatic translation operation, and an operation of showing a model to the user.
[0066]
FIG. 4 shows the operation for monitoring the input voice. This operation is always executed by the automatic translation apparatus 1 and can be performed in parallel with the operation when the microphone button 4 is pressed, the automatic translation operation, or the operation of showing a model to the user. The automatic translation apparatus 1 first determines whether or not the voice input from the microphone 3 is a voice uttered by a person (step S1). When a voice uttered by a person is input (step S1; YES), the automatic translation apparatus 1 records the voice in a buffer as a recognition target (steps S2 and S3). At this time, the automatic translation apparatus 1 records the voice from when it starts until it is interrupted as a single unit. If the buffer has insufficient room for the new recording, the buffer is enlarged before the voice is recorded. In addition, when a sufficiently old voice recorded in the past remains in the buffer without having been used, that voice is deleted to increase the free space in the buffer (steps S4 and S5).
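The buffer handling in this monitoring operation can be sketched as below. This is a minimal illustration, not the recorded implementation: the class names, the doubling growth policy, and the `max_age` threshold for "sufficiently old" are assumptions, because the specification does not state how much the buffer grows or how old a voice must be before it is deleted.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VoiceSegment:
    start: float        # time the voice started, in seconds
    end: float          # time the voice was interrupted, in seconds
    used: bool = False  # True once the recognizer has consumed it


@dataclass
class VoiceBuffer:
    capacity: int = 4       # number of segments that fit (hypothetical unit)
    max_age: float = 30.0   # assumed threshold for "sufficiently old"
    segments: List[VoiceSegment] = field(default_factory=list)

    def record(self, segment: VoiceSegment, now: float) -> None:
        # Enlarge the buffer when there is no room for the new segment.
        if len(self.segments) >= self.capacity:
            self.capacity *= 2
        # Record the whole utterance, start to interruption, as one unit.
        self.segments.append(segment)
        # Delete old segments that were never used to free up space.
        self.segments = [
            s for s in self.segments
            if s.used or now - s.end <= self.max_age
        ]
```

Only unused old segments are dropped here because that is the only deletion rule the text states; what happens to segments that have already been recognized is left open.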
[0067]
When the voice uttered by a person is not input (step S1; NO), the automatic translation apparatus 1 waits for the input of the voice. The automatic translation apparatus 1 further displays the volume of the input voice on the level meter 22 in gray.
[0068]
FIG. 5 shows the operation when the microphone button 4 is pressed. When the microphone button 4 is pressed (step S9; YES), the automatic translation apparatus 1 first checks whether or not voice is recorded in the buffer (step S10). If no voice is recorded, it waits until the monitoring operation described above records voice in the buffer. When voice is recorded, the automatic translation apparatus 1 turns on the LED 6 and displays the volume of the input voice on the level meter 22 in yellow (step S11). Next, the automatic translation apparatus 1 selects the voice to be recognized, including voice that was input and recorded in the buffer before the microphone button 4 was pressed (step S12). The automatic translation apparatus 1 sequentially reads the voice to be recognized from the buffer, including voice that will be input and recorded in the buffer from then on, recognizes it as a character language, and displays the recognition result on the touch panel display 5 (step S13).
[0069]
In the automatic translation apparatus 1, the microphone button 4 can be operated by a first operation method of pressing it for 1 second or more and a second operation method of pressing it for 1 second or less.
The automatic translation apparatus 1 determines whether or not the microphone button 4 has been pressed for 1 second or longer (step S14). When the first operation method, in which the microphone button 4 is pressed for 1 second or longer, is performed (step S14; YES), the automatic translation apparatus 1 determines the end time based on the time when the microphone button 4 is released, and ends voice recognition. For example, the automatic translation apparatus 1 recognizes the voice input up to 1 second after the microphone button 4 is released (step S15).
[0070]
Note that in step S15, the automatic translation apparatus 1 may instead recognize the voice input up to the start of a non-voice part that continues for 2 seconds or more after the microphone button 4 is released.
[0071]
When the microphone button 4 is pressed for 1 second or less (step S14; NO), the automatic translation apparatus 1 performs voice recognition on the voice input up to the start of a non-voice part that continues for 3 seconds or more after the microphone button 4 is released (step S16). When the automatic translation apparatus 1 completes the voice recognition, it turns off the LED 6 and displays the volume of the input voice in gray on the level meter 22 (step S17).
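The choice of end position in steps S14 to S16 can be sketched as follows. The function name `recognition_cutoff` and its parameter names are hypothetical, and the sketch assumes that the silence is measured from the later of the button release and the end of the last voice part, which the specification does not spell out.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) of one voice part, in seconds


def recognition_cutoff(
    segments: List[Segment],
    press: float,
    release: float,
    hold_threshold: float = 1.0,  # step S14: long press vs. short press
    tail: float = 1.0,            # step S15: margin after the release
    silence: float = 3.0,         # step S16: length of the closing silence
) -> float:
    """Return the time up to which input voice is recognized."""
    if release - press >= hold_threshold:
        # First operation method: the end is tied to the release time.
        return release + tail

    # Second operation method: continue until a sufficiently long silence.
    cutoff = release
    for start, end in sorted(segments):
        if end <= cutoff:
            continue          # this voice part is already over
        if start - cutoff >= silence:
            break             # long non-voice part found; stop here
        cutoff = end          # voice part belongs to the recognition target
    return cutoff
```

A long press therefore gives a cutoff that depends only on the release time, while a short press lets recognition run on through closely spaced voice parts.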
[0072]
The means for distinguishing the first operation method from the second operation method is not limited to the time when the microphone button 4 is pressed. For example, the following means may be used.
- Two microphone buttons are provided; pressing one microphone button is the first operation method, and pressing the other microphone button is the second operation method.
- The microphone button can be pressed to two depths; pressing the microphone button deeply is the first operation method, and pressing it lightly is the second operation method.
- The microphone button can be pushed in and can also be rotated or tilted up and down; pushing it in is the first operation method, and rotating it or tilting it up or down is the second operation method.
- The microphone button can be rotated or tilted up and down, or slid up and down; tilting or sliding it upward is the first operation method, and tilting or sliding it downward is the second operation method.
[0073]
Next, the automatic translation apparatus 1 displays the recognition result recognized in step S13, step S15 or step S16 on the touch panel display 5 (step S18). The automatic translation apparatus 1 determines whether or not the microphone button 4 is pressed again within 3 seconds after the voice recognition is completed and the LED 6 is turned off (step S19).
[0074]
When the microphone button 4 is pressed again (step S19; YES), the automatic translation apparatus 1 executes steps S10 to S17 to recognize the voice that is further input, and displays that recognition result on the touch panel display 5 together with the recognition result obtained earlier (step S18).
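The re-press behavior of steps S18 and S19 can be sketched as a small state holder. The class `RecognitionDisplay` and its methods are hypothetical, and clearing the display when the button is pressed after the 3-second window is an assumption; the specification states only that a press within the window adds to the earlier result.

```python
from typing import List, Optional


class RecognitionDisplay:
    """Keeps the recognition results shown on the screen."""

    def __init__(self, repress_window: float = 3.0) -> None:
        self.repress_window = repress_window
        self.shown: List[str] = []
        self.finished_at: Optional[float] = None  # when the LED went off

    def press(self, now: float) -> None:
        """Handle a press of the microphone button at time `now`."""
        continuing = (
            self.finished_at is not None
            and now - self.finished_at <= self.repress_window
        )
        if not continuing:
            self.shown = []  # assumed: a late press starts a new sentence
        self.finished_at = None

    def finish(self, recognized: List[str], now: float) -> None:
        """Append newly recognized strings when recognition ends."""
        self.shown.extend(recognized)
        self.finished_at = now
```

Pressing again one second after recognition ends keeps the earlier strings and appends the new ones, so both results are displayed together.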
[0075]
The automatic translation apparatus 1 is not limited to lighting the LED 6 in the form shown in steps S11 and S17, and may light it in another form. The LED 6 may be lit with a different emission color; for example, the LED 6 may be lit in step S11 and lit in yellow in step S17. The LED 6 may be lit with a different brightness; for example, the LED 6 may be lit brightly in step S11 and lit dimly in step S17. The LED 6 may also blink in step S11 and stay lit in step S17.
[0076]
In this way, the user can determine the end position of the recognition target for voice recognition more appropriately by operating the microphone button 4 appropriately.
[0077]
FIG. 6 shows the automatic translation operation. The automatic translation operation is executed by the automatic translation apparatus 1 while the recognition result is displayed on the touch panel display 5. When a light gray character string is tapped (step S31; YES), the automatic translation apparatus 1 changes that character string to dark gray and displays it, in units of the voice part from which the character string originated (step S32). When a black or dark gray character string is tapped (step S33; YES), the automatic translation apparatus 1 changes that character string to light gray and displays it (step S34). The operations from step S31 to step S34 are repeated until the translation button 25 is tapped.
[0078]
When the translation button 25 is tapped (step S35; YES), the automatic translation apparatus 1 first checks whether or not the character string that was initially light gray is tapped to change to dark gray (step S36). If yes, a guide message is displayed (step S37). This guide message indicates a point of caution so that the user can input correctly in the subsequent input. For example, the guide message prompts the user not to leave too much space during utterance. The automatic translation apparatus 1 then translates the character string displayed in black or dark gray on the touch panel display 5 into another language, for example, an English translation (step S38). The automatic translation apparatus 1 displays the translated sentence on the touch panel display 5 and utters the translated sentence from the speaker 7 (step S39).
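The tap and translate flow of steps S31 to S39 can be sketched as follows. The `Phrase` class, the function names, and the `translate` callable passed in by the caller are hypothetical stand-ins; no real translation engine is implied.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Phrase:
    text: str
    color: str               # "black", "dark", or "light"
    initial_color: str = ""  # color when the phrase was first displayed

    def __post_init__(self) -> None:
        if not self.initial_color:
            self.initial_color = self.color


def tap(phrase: Phrase) -> None:
    """Steps S31 to S34: light becomes dark; black or dark becomes light."""
    phrase.color = "dark" if phrase.color == "light" else "light"


def on_translate_button(
    phrases: List[Phrase], translate: Callable[[str], str]
) -> Tuple[str, bool]:
    """Steps S35 to S39: return the translation and whether to show a guide."""
    # Steps S36 and S37: a guide is shown if an initially light phrase
    # was tapped to dark.
    show_guide = any(
        p.initial_color == "light" and p.color == "dark" for p in phrases
    )
    # Step S38: only black and dark gray phrases are translated.
    source = " ".join(p.text for p in phrases if p.color in ("black", "dark"))
    return translate(source), show_guide
```

In a quick check, `str.upper` can be passed as a placeholder for `translate` to confirm which phrases are selected.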
[0079]
The operation of showing the model to the user is executed when the model button 26 is tapped. That is, the automatic translation apparatus 1 outputs a sound of a predetermined volume from the speaker 7 when the model button 26 is tapped, and shows a model of the sound volume input from the microphone 3 to the user. According to such an operation, the user can understand the sound volume of the sound to be pronounced and can input the sound with the appropriate sound volume to the automatic translation apparatus 1. As a result, the automatic translation apparatus 1 can recognize the voice more accurately when the voice with the appropriate volume is input.
[0080]
Next, the operation when speech is actually input in the automatic translation apparatus 1 will be described with reference to the drawings.
FIG. 7 shows the relationship between the state in which the microphone button 4 is pressed and the recognition target in a certain scene. In that scene, the user speaks in a noisy environment, separating the utterance into “um”, “closest”, “station is”, and “where is?”. The automatic translation apparatus 1 detects “um” as the voice part 41, “closest” as the voice part 42, “station is” as the voice part 43, and “where is?” as the voice part 44, and detects noise as the voice part 45. The automatic translation apparatus 1 further detects the non-voice part 51 after the voice part 41, the non-voice part 52 after the voice part 42, the non-voice part 53 after the voice part 43, the non-voice part 54 after the voice part 44, and the non-voice part 55 after the voice part 45. Here, it is assumed that the non-voice parts 51 to 53 each last less than one second, and that the non-voice parts 54 and 55 each last one second or more.
[0081]
In the scene, the user presses the microphone button 4 for a time Δt1 of one second or less, from time t1 to time t2, while uttering “closest”. The automatic translation apparatus 1 lights the LED 6 from time t1, when the microphone button 4 is pressed, until time t4, when a predetermined time Δt2 has elapsed from time t3, at which the voice part 44 ends. At this time, the automatic translation apparatus 1 performs voice recognition on the voice parts 41 to 44 as recognition targets 61 to 64, and does not perform voice recognition on the voice part 45. Even if the non-voice part 54 lasted less than one second, so that the voice part 45, which is noise, became a recognition target, the user could easily exclude the character string in which the noise was recognized by tapping the character string of the recognition result of the voice part 45 before translation and changing it to a light character string.
[0082]
Next, the operation in other cases where speech is input to the automatic translation apparatus 1 will be described.
FIG. 8 shows the relationship between the state in which the microphone button 4 is pressed and the recognition target in another scene. In that scene, the user says “um”, “closest”, “station is”, and “where is?”, leaving a pause between “station is” and “where is?”. The automatic translation apparatus 1 detects “um” as the voice part 81, “closest” as the voice part 82, “station is” as the voice part 83, and “where is?” as the voice part 84. The automatic translation apparatus 1 further detects the non-voice part 91 after the voice part 81, the non-voice part 92 after the voice part 82, the non-voice part 93 after the voice part 83, and the non-voice part 94 after the voice part 84. The non-voice parts 91 and 92 each last one second or less, and the non-voice parts 93 and 94 each last one second or more.
[0083]
In the scene, the user presses the microphone button 4 for a time Δt5 of one second or less, from time t8 to time t9, while uttering “closest”. The automatic translation apparatus 1 lights the LED 6 from time t8, when the microphone button 4 is pressed, until time t11. The time t11 is the time at which a predetermined time Δt6 (1 second) has elapsed from time t10, at which the voice part 83 ended. At this time, the automatic translation apparatus 1 performs voice recognition on the voice parts 81 to 83 as recognition targets 101 to 103, respectively.
[0084]
Thereafter, the user notices from the LED 6 turning off that the voice recognition has ended, and presses the microphone button 4 for a time Δt7 of one second or less, from time t12 to time t13, while saying “where is?”. At this time, the LED 6 is lit again from time t12 to time t15. The time t15 is the time at which a predetermined time Δt8 (1 second) has elapsed from time t14, at which the voice part 84 ends. At this time, the automatic translation apparatus 1 recognizes the voice part 84 in addition to the voice parts 81 to 83, as recognition targets 111 to 114. As a result, the voice parts 81 to 84 are all recognized.
[0085]
That is, by pressing the microphone button 4 again when the LED 6 turns off in the middle of an utterance, the user can have the automatic translation apparatus 1 recognize the utterance as one sentence. This is preferable because the user can input additional voice by pressing the microphone button 4 again when voice recognition ends in the middle of speaking.
[0086]
Next, another embodiment of the automatic translation apparatus according to the present invention will be described. FIG. 9 shows an operation when the microphone button 4 is pressed. When the microphone button 4 is pressed, voice recognition is first performed based on the operation shown in FIG. 5, and the result is displayed (step S21). After that, if a voice part is detected by voice detection within 3 seconds, that voice part is recognized and the resulting character string is displayed as a light character string after the previous character strings. If the character string added later is not intended by the user, for example because it is noise, the user can tap the translation button 25 as it is, and the translation is performed with the later character string excluded. If the character string added later is an utterance intended by the user, the user can tap that character string to make it a dark character string and then have it translated.
[0087]
FIG. 10 shows the relationship between the state in which the microphone button 4 is pressed and the recognition target. As in the case of FIG. 8, the microphone button 4 is pressed and the voice parts 81 to 83 are recognized, so that “um”, “closest”, and “station is” are displayed as recognition results. After that, when the user goes on to say “where is?”, the voice part 84 is input. At this time, the automatic translation apparatus 1 performs voice recognition on the voice part 84 as a recognition target 104. The recognition result “where is?” is added to the screen in light gray.
[0088]
At this time, the user can have that part included in the translation by tapping the character string “where is?” to make it a dark character string. Conversely, if the part input later is a voice not intended by the user, such as noise, tapping the translation button 25 while it is still a light character string excludes that part from the translation.
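The handling of late voice parts in this embodiment can be sketched as below. The function name and the tuple layouts are hypothetical, and the 3-second window is taken from the description of FIG. 9; the recognized text is passed in directly because the recognizer itself is outside the scope of the sketch.

```python
from typing import List, Tuple

Displayed = Tuple[str, str]  # (text, color)
LatePart = Tuple[float, str]  # (start time in seconds, recognized text)


def add_late_voice_parts(
    displayed: List[Displayed],
    finished_at: float,
    late_parts: List[LatePart],
    window: float = 3.0,
) -> List[Displayed]:
    """Append voice parts that start within `window` seconds of the end
    of recognition, marking each of them as a light character string."""
    result = list(displayed)
    for start, text in late_parts:
        if 0.0 <= start - finished_at <= window:
            result.append((text, "light"))
    return result
```

Because the added strings are light, they are left out of the translation unless the user taps them to make them dark.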
[0089]
[Effect of the invention]
As described above, the present invention has the following effects.
The first effect is that the voice recognition start position and end position can be determined more appropriately even when the timing at which the user presses the microphone button is not appropriate. The second effect is that the user is notified whether or not the voice currently being uttered by the user is recognized, and whether or not the voice recognition is finished when there is a gap in the utterance.
The third effect is that, even if voice recognition ends because of a gap in the utterance, the user can notice this and then press the microphone button to continue the voice recognition from that point.
The fourth effect is that, even if unrelated sound such as noise is recognized, or the user speaks after a gap from the previous utterance, the user can easily select only the intentionally uttered parts by switching each recognition result between adopted and not adopted.
In the speech recognition apparatus of the present invention, the recognition target for speech recognition can be determined more appropriately.
In addition, the user can easily know whether or not the content currently being spoken is being recognized, and even if the recognition process is interrupted by a gap in the middle of an utterance, the user can continue the recognition by pressing the button.
The speech recognition device of the present invention is preferably applied to an automatic translation device.
[Brief description of the drawings]
FIG. 1 is a perspective view showing an embodiment of an automatic translation apparatus according to the present invention.
FIG. 2 is a block diagram showing an embodiment of an automatic translation apparatus main body.
FIG. 3 is a diagram showing an embodiment of a screen displayed on a touch panel display.
FIG. 4 is a flowchart illustrating an embodiment of an operation for monitoring audio.
FIG. 5 is a flowchart showing an embodiment of an operation when a microphone button is pressed.
FIG. 6 is a flowchart showing an embodiment of an automatic translation operation.
FIG. 7 is a time chart showing a position of a recognition target.
FIG. 8 is a time chart showing a position of a recognition target.
FIG. 9 is a flowchart showing another embodiment of the operation when the microphone button is pressed.
FIG. 10 is a time chart showing a position of a recognition target.
[Explanation of symbols]
1: Automatic translation device
2: Automatic translation device main body
3: Microphone
4: Microphone button
5: Touch panel display
6: LED
7: Speaker
11: Voice detection unit
12: Voice recording unit
13: Microphone button detection unit
14: Recognition state display unit
15: Voice recognition unit
16: Phrase selection unit
17: Translation unit
18: Model unit
19: Guide display unit
21: Screen
22: Level meter
23: Recognition result display field
24: Translation result display field
25: Translation button
26: Model button
27: Indicator bar
28: Scale
31-34: Character string
41-45: Voice part
51-55: Non-voice part
61-65: Recognition target
71-64: Recognition target
81-84: Voice part
91-94: Non-voice part
101-103: Recognition target
111-114: Recognition target

Claims

A speech recognition device that recognizes input speech,
A voice detection unit that detects a voice part uttered by a user from the input voice and divides the input voice into a voice part and a non-voice part;
A voice recognition unit that performs a recognition process on the voice part;
A recognition status display unit that displays the status of whether or not recognition processing is being performed by the voice recognition unit for the input voice and tells the user at any time,
A microphone button detection unit that detects an on operation or both an on operation and an off operation of a microphone button that is operated when a user speaks,
The voice recognition unit selects a voice part to be recognized from the voice parts detected by the voice detection unit, based on the time of the on operation, or of both the on operation and the off operation, detected by the microphone button detection unit, and performs voice recognition,
In addition, when the on operation is detected again by the microphone button detection unit within a predetermined time after the recognition process of the voice part is finished and the recognition state display unit displays the end of the recognition process, the voice recognition unit further selects a voice part to be recognized based on the time of the on operation, or of both the on operation and the off operation, detected again, and displays both the recognition result of the previous on operation and the recognition result of the subsequent on operation.

The voice recognition device according to claim 1, wherein the microphone button detection unit detects an on operation and an off operation of the microphone button and further detects whether these operations are performed by a first operation or a second operation, and the voice recognition unit, when the first operation is performed, performs recognition processing on the voice parts detected up to a time determined based on the time of the off operation, and, when the second operation is performed, performs recognition processing on the voice parts detected until the voice detection unit detects, after the time of the on operation, a non-voice part of a predetermined length or longer.

A speech recognition device that recognizes input speech,
A voice detection unit that detects a voice part uttered by a user from the input voice and divides the input voice into a voice part and a non-voice part;
A voice recognition unit that performs a recognition process on the voice part;
A recognition status display unit that displays the status of whether or not recognition processing is being performed by the voice recognition unit for the input voice and tells the user at any time,
The voice recognition unit displays, in a first display form, a recognition result of the voice part input while the recognition state display unit displays that the recognition process is being performed, and, when a voice part is detected again within a predetermined time after the recognition state display unit notifies the end of the recognition process, also performs voice recognition on the voice part detected within the predetermined time and displays its recognition result in a second display form different from the first display form,
A phrase selection unit that, when instructed by a user operation, converts a recognition result in the first display form into the second display form and converts a recognition result in the second display form into the first display form;
A speech recognition device further comprising a processing unit that performs processing on the recognition result displayed in the first display form.

A speech recognition method for recognizing input speech,
Detecting a voice portion uttered by a user from the input voice and dividing the input voice into a voice portion and a non-voice portion;
Performing recognition processing on the audio portion;
A step of displaying the status of whether or not the recognition process is being performed on the input voice at any time and telling the user;
Detecting an on operation, or both an on operation and an off operation, of a microphone button operated when a user speaks;
Selecting a voice part to be recognized from the detected voice parts based on the time of the detected on operation, or of both the on operation and the off operation, and performing speech recognition;
Further, when an on operation is detected again within a predetermined time after the recognition process of the voice part is completed and the completion of the recognition process is displayed, further selecting a voice part to be recognized based on the time of the on operation, or of both the on operation and the off operation, detected again, and displaying both the recognition result of the previous on operation and the recognition result of the subsequent on operation.

The speech recognition method according to claim 4, further comprising:
Detecting the on operation and the off operation of the microphone button;
Detecting whether these operations are performed by a first operation or a second operation;
When the first operation is performed, performing recognition processing on the voice parts detected up to a time determined based on the time of the off operation, and, when the second operation is performed, performing recognition processing on the voice parts detected until a non-voice part of a predetermined length or longer is detected after the time of the on operation.

A speech recognition method for recognizing input speech, comprising:
Detecting a voice portion uttered by a user from the input speech and dividing the input speech into voice portions and non-voice portions;
Performing recognition processing on the voice portion;
Displaying, at any time, a status indicating whether recognition processing is being performed on the input speech, so as to inform the user;
Displaying, in a first display form, the recognition result of a voice portion that was input while the display indicated that recognition processing was being performed;
When a voice portion is detected again within a predetermined time after the end of recognition processing has been indicated, also performing recognition processing on that voice portion and displaying its recognition result in a second display form different from the first display form;
When instructed by a user operation, converting the recognition result in the first display form to the second display form and the recognition result in the second display form to the first display form; and
Processing the recognition result displayed in the first display form.
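
The handling of the two display forms can be sketched as follows. The names `DisplayForm`, `ShownResult`, `swap_forms` and `results_to_process` are hypothetical, and what a display form looks like on screen (colour, position, and so on) is left open here because the claim does not define it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class DisplayForm(Enum):
    FIRST = "first"    # result that will be processed
    SECOND = "second"  # result detected shortly after the end was indicated


@dataclass
class ShownResult:
    text: str
    form: DisplayForm


def swap_forms(results: List[ShownResult]) -> List[ShownResult]:
    """Exchange the two display forms on every result (user-requested swap)."""
    other = {DisplayForm.FIRST: DisplayForm.SECOND,
             DisplayForm.SECOND: DisplayForm.FIRST}
    return [ShownResult(r.text, other[r.form]) for r in results]


def results_to_process(results: List[ShownResult]) -> List[str]:
    """Only results currently shown in the first display form are processed."""
    return [r.text for r in results if r.form is DisplayForm.FIRST]
```

In this sketch the user operation does not re-run recognition. It only exchanges which result is marked for processing.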

A computer program for causing a computer to execute a speech recognition method for recognizing input speech, the method comprising:
Detecting a voice portion uttered by a user from the input speech and dividing the input speech into voice portions and non-voice portions;
Performing recognition processing on the voice portion;
Displaying, at any time, a status indicating whether recognition processing is being performed on the input speech, so as to inform the user;
Detecting an ON operation, or both an ON operation and an OFF operation, of a microphone button that the user operates when speaking;
Selecting, from the detected voice portions, a voice portion to be recognized based on the time of the detected ON operation, or on the times of both the ON and OFF operations, and performing speech recognition on it; and
Further, when an ON operation is detected again within a predetermined time after recognition processing of the voice portion has finished and its completion has been displayed, selecting a voice portion to be recognized based on the time of the newly detected ON operation, or on the times of both the ON and OFF operations, and displaying both the recognition result for the earlier ON operation and the recognition result for the later ON operation.
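
The step of dividing the input into voice and non-voice portions could, for example, be based on frame power, which the background section names as one way of detecting speech. The sketch below is a minimal illustration under that assumption. The function name `split_voice_portions`, the fixed threshold, and the frame length are hypothetical and are not taken from the claims.

```python
from typing import List, Optional, Sequence, Tuple


def split_voice_portions(frame_power: Sequence[float],
                         threshold: float,
                         frame_s: float = 0.01) -> List[Tuple[float, float]]:
    """Return (start_s, end_s) for each run of frames whose power exceeds threshold."""
    portions: List[Tuple[float, float]] = []
    start: Optional[int] = None  # index of the first frame of the current run
    for i, power in enumerate(frame_power):
        if power > threshold and start is None:
            start = i                                   # voice portion begins
        elif power <= threshold and start is not None:
            portions.append((start * frame_s, i * frame_s))  # voice portion ends
            start = None
    if start is not None:
        # The input ended while a voice portion was still running.
        portions.append((start * frame_s, len(frame_power) * frame_s))
    return portions
```

Everything outside the returned spans is treated as non-voice. A practical detector would usually add smoothing or hangover frames, which this sketch omits.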

The speech recognition program according to claim 7, wherein the method further comprises:
Detecting the ON and OFF operations of the microphone button;
Detecting whether these operations were performed as a first operation or as a second operation; and
When the first operation was performed, performing recognition processing on the voice portion detected up to a time determined from the time of the OFF operation, and, when the second operation was performed, performing recognition processing on the voice portion detected after the time of the ON operation until a non-voice portion of a predetermined length or longer is detected.

A computer program for causing a computer to execute a speech recognition method for recognizing input speech, the method comprising:
Detecting a voice portion uttered by a user from the input speech and dividing the input speech into voice portions and non-voice portions;
Performing recognition processing on the voice portion;
Displaying, at any time, a status indicating whether recognition processing is being performed on the input speech, so as to inform the user;
Displaying, in a first display form, the recognition result of a voice portion that was input while the display indicated that recognition processing was being performed;
When a voice portion is detected again within a predetermined time after the end of recognition processing has been indicated, also performing recognition processing on that voice portion and displaying its recognition result in a second display form different from the first display form;
When instructed by a user operation, converting the recognition result in the first display form to the second display form and the recognition result in the second display form to the first display form; and
Processing the recognition result displayed in the first display form.