JP4475380B2 - Speech recognition apparatus and speech recognition program

Info

Publication number: JP4475380B2
Application number: JP2003132640A
Authority: JP
Inventors: 聡一外山; 光弥駒村; 孝一長岐; 佳洋川添; 載小林; 育雄藤田
Original assignee: Pioneer Corp
Current assignee: Pioneer Corp
Priority date: 2002-05-15
Filing date: 2003-05-12
Publication date: 2010-06-09
Anticipated expiration: 2023-05-12
Also published as: JP2004046106A

Description

【０００１】
【発明の属する技術分野】
本発明は、マイクなどを介して入力された人間の音声を認識する音声認識手法に関する。
【０００２】
【従来の技術】
一般的に、音声認識装置は、ユーザの発話に基づいて生成された音声入力信号を音響的に分析し、予め用意された複数の候補の単語モデルと比較してそれぞれの音響的尤度（類似度）を算出し、音響的尤度の最も高い候補（「１位候補」と呼ぶ。）を認識結果として決定する。また、１位候補の認識信頼度が十分に高くない場合は、音声認識装置は、正しい認識結果が無いと判定し、「もう一度発話して下さい」などとトークバックしてユーザの再発話を促し、再度認識処理を行う。
【０００３】
従来の音声認識装置は、認識結果の信頼度が低く、ユーザに再発話を要求する場合でも、前回と同じ候補を用いて再度認識処理を行っていた。よって、ユーザが前回と同じ言い方で発話を繰り返せば、認識結果は前回と同じとなってしまうため、結局、再発話に対する認識率はあまり改善されない。
【０００４】
【発明が解決しようとする課題】
このような点を改善した音声認識手法の１つが特許第３１１２０３７号公報に記載されている。この音声認識手法は、ユーザの１回目の発話に対する認識処理で十分に信頼度の高い認識結果が得られない場合には、信頼度の高い幾つかの候補に絞り込みを行いユーザに再発話を促す。さらに、１回目の発話に対する認識処理において得られた信頼度が上位である候補について、それらの同意味語を候補に加えてユーザに再発話を促し、再度の認識処理を行う。
【０００５】
しかし、この方法では１回目の認識結果により絞り込まれた上位候補中に正解が含まれていない場合には認識を行うことができなくなってしまう。また、上位候補の同意味語を候補に加えたとしても、ユーザが再発話において前回と同じ単語を使用した場合は、同意味語を候補に加えた意味がなくなってしまう。
【０００６】
また、別の音声認識手法が特開平１１−１１９７９２号公報に記載されている。この公報に記載の方法では、音響的に相互に類似したコマンド（「類似タイプコマンド」と呼んでいる。）のセットと、それに対応する言い換えコマンドのセットを予め定義して記憶しておく。例えば類似タイプコマンドとして「窓を上げる」と「窓を下げる」がある場合、これに対する言い換えコマンドとして「窓を開ける」と「窓を閉める」を用意しておく。そして、ユーザが類似タイプコマンドを発話したときには、それに対する言い換えコマンドを使用して再度発話するように要求する。
【０００７】
しかし、この方法では、類似タイプコマンドとそれに対応する言い換えコマンドとの対応関係を予め規定し、メモリなどに記憶しておく必要がある。よって、システムで使用するコマンドが多数になると、そのために必要とされる記憶容量が増大し、コストの上昇などを招く。
【０００８】
本発明は、以上の点に鑑みてなされたものであり、ユーザに対する再発話の要求をなるべく少なくし、効率的かつ正確な認識を可能とする音声認識装置及びプログラムを提供することを課題とする。
【０００９】
【課題を解決するための手段】
本発明の１つの観点によれば、音声認識装置は、ユーザの音声入力を受け取る音声入力手段と、予め設定された待ち受け単語グループ中の各待ち受け単語とのマッチング処理により、前記音声入力に対応する複数の単語候補を決定する認識処理を行う認識処理手段と、前記複数の単語候補中に正解が含まれるか否かを判定する判定手段と、前記複数の単語候補及びそれらの同意単語候補の各々について、各単語候補を構成する音素を分析する手段を具備する設定手段と、を備え、前記設定手段は、前記複数の単語候補中に正解が含まれないと前記判定手段が判定した場合に、前記複数の単語候補及びそれらの同意単語候補から、少なくとも前記単語候補に対応する同意単語候補を一つのグループとした場合の前記単語候補毎に設定される各グループ間において、（１）音素の重複が少ない組み合わせを抽出し、（２）前記抽出された組み合わせのうち、相対的に総音素数が多い組み合わせとなるように前記各グループで１つの候補を決定し、次回の認識処理において使用される前記待ち受け単語グループに設定することを特徴とする。なお前記設定手段は、前記複数の単語候補中に正解が含まれないと前記判定手段が判定した場合に、前記複数の単語候補及びそれらの同意単語候補から、前記単語候補およびその同意単語候補を一つのグループとして、前記各グループ間でそれぞれ候補を決定することが好ましい。
【００１０】
上記の音声認識装置は、コマンドなどのユーザによる音声入力を受け取り、予め設定された待ち受け単語とのマッチング処理により、ユーザによる音声入力に対応する単語候補を決定する。そして、その単語候補中に正解が含まれるか否かを判定する。正解が含まれると判定手段が判定した場合、その単語候補が認識結果として出力される。一方、正解が含まれないと判定手段が判定した場合、それら単語候補と、各単語候補と意味が同一である同意単語候補とのうちから、少なくとも単語候補に対応する同意単語候補を一つのグループとした場合の前記単語候補毎に設定される各グループ間において、（１）音素の重複が少ない組み合わせを抽出し、（２）前記抽出された組み合わせのうち、相対的に総音素数が多い組み合わせとなるように各グループ間で候補を決定し、これらが次回の認識処理において使用される。よって、同意単語を含む単語候補中から、識別しやすい単語候補を利用して次回の認識処理が行われるので、ユーザによる再発話の認識率を向上させることができる。
【００１１】
上記の音声認識装置の一態様では、前記設定手段は、前記複数の単語候補及びそれらの同意単語候補の各々について、各単語候補を構成する音素を分析する手段と、音素の重複が最も少ない単語候補の組み合わせを前記待ち受け単語として設定する手段と、を有することができる。
【００１２】
この態様によれば、同意単語候補を含む単語候補を、それらの構成要素である音素の面から分析し、音素の重複が最も少ない単語候補の組み合わせを待ち受け単語として使用する。よって、音声認識処理上、相互に区別しやすい状態で認識処理を行うことができる。
【００１３】
上記の音声認識装置の他の一態様では、前記設定手段は、前記複数の単語候補及びそれらの同意単語候補の各々について、各単語候補を構成する音素を分析する手段と、音素の重複が最も少なく、かつ、総音素数が最も多い単語候補の組み合わせを前記待ち受け単語として設定する手段と、を有することができる。
【００１４】
この態様によれば、同意単語候補を含む単語候補を、それらの構成要素である音素の面から分析し、音素の重複が最も少なく、かつ、総音素数が最も多い単語候補の組み合わせを待ち受け単語として使用する。よって、音声認識処理上、さらに区別しやすい状態で認識処理を行うことができる。
【００１５】
上記の音声認識装置の他の一態様では、前記設定手段は、前記ユーザの音声入力が、前記待ち受け単語に含まれる単語候補以外の単語候補に対応することを意味する待ち受け誤り単語を前記待ち受け単語グループに含めることができる。よって、現在の待ち受け単語が正解を含んでいない場合にはユーザはその待ち受け誤り単語を発話することになるので、現在の待ち受け単語が正解を含んでいるか否かを判定することができる。
【００１６】
上記の音声認識装置のさらに他の一態様では、前記設定手段は、過去に使用した待ち受け単語グループを記憶する記憶手段を有し、前記判定手段が前記待ち受け誤り単語を正解と判定した場合、前記設定手段は、前記記憶手段に記憶されている１回前の待ち受け単語グループを、次回の認識処理において使用する待ち受け単語グループに設定することができる。これにより、現在の待ち受け単語グループに正解が含まれていない場合には、待ち受け単語の範囲を広げて、正解を探すことができる。
【００１７】
上記の音声認識装置のさらに他の一態様では、前記待ち受け誤り単語は、「その他」、またはその同義語とすることができる。
【００１８】
上記の音声認識装置のさらに他の一態様では、前記ユーザの音声入力が前記待ち受け誤り単語であった場合には、その時点における前記待ち受け単語グループ中の単語候補のうち、前記待ち受け誤り単語に対応する単語候補以外の単語候補を、次回の待ち受け単語グループに含める単語候補から除外することができる。待ち受け誤り単語は、現在の待ち受け単語グループ中の単語候補には正解が無いことを示すので、それらを次回の待ち受け単語グループに含める意味はない。よって、それら正解でないことがわかっている単語候補を次回の単語候補から除外することにより単語候補を絞り込み、効率的に正解を得ることができるようになる。
【００１９】
上記の音声認識装置のさらに他の一態様は、前記複数の単語候補中に正解が含まれないと前記判定手段が判定した場合に、前記設定手段が設定した待ち受け単語グループに属する待ち受け単語を、合成音声出力又は文字表示の少なくとも一方により前記ユーザに通知する通知手段を備えることができる。これにより、待ち受け単語が合成音声によりユーザに知らされるので、ユーザは再発話すべき単語を容易に知ることができる。
【００２０】
上記の音声認識装置のさらに他の一態様では、前記判定手段は、前記認識処理が繰り返されるたびに、前記単語候補を正解と判定する基準を緩和していくことができる。これにより、認識処理を繰り返すにつれて正解を得やすくし、認識処理の効率を上げることができる。なお、１つの好適な例では、前記判定手段は、前記単語候補の信頼度が所定のしきい値以上である場合に当該単語候補を正解であると判定し、前記認識処理が繰り返されるたびに、前記しきい値を低下させていくことができる。
【００２１】
上記の音声認識装置のさらに他の一態様では、前記設定手段は、前記複数の単語候補、それらの同意単語候補及び前記待ち受け誤り単語から、最も識別しやすい候補の組み合わせを決定し、次回の認識処理において使用される前記待ち受け単語グループに設定することができる。よって、音声認識処理上、さらに区別しやすい状態で認識処理を行うことができる。
【００２２】
本発明の他の観点では、コンピュータにより実行される音声認識プログラムは、前記コンピュータを、ユーザの音声入力を受け取る音声入力手段、予め設定された待ち受け単語グループ中の各待ち受け単語とのマッチング処理により、前記音声入力に対応する複数の単語候補を決定する認識処理を行う認識処理手段、前記複数の単語候補中に正解が含まれるか否かを判定する手段、前記複数の単語候補及びそれらの同意単語候補の各々について、各単語候補を構成する音素を分析する手段を具備する設定手段、として機能させ、かつ、前記設定手段をして、前記複数の単語候補中に正解が含まれないと前記判定手段が判定した場合に、前記複数の単語候補及びそれらの同意単語候補から、少なくとも前記単語候補に対応する同意単語候補を一つのグループとした場合の各グループ間において（１）音素の重複が少ない組み合わせを抽出し、（２）前記抽出された組み合わせのうち、相対的に総音素数が多い組み合わせとなるように前記各グループで１つの候補を決定し、次回の認識処理において使用される前記待ち受け単語グループに設定するよう機能させる。
【００２３】
【発明の実施の形態】
以下、図面を参照して本発明の好適な実施の形態について説明する。
【００２４】
［音声認識装置の構成］
図１に、本発明の実施形態にかかる音声認識装置の機能的構成を示す。図１において、音声認識装置１０は、サブワード音響モデル記憶部１と、辞書２と、単語モデル生成部３と、音響分析部４と、認識処理部５と、付加情報収集部６と、認識信頼度演算部７と、再発話制御部８と、合成音声生成部９と、スピーカ１１と、マイク１２と、スイッチＳＷ１とを備える。
【００２５】
サブワード音響モデル記憶部１は、予め学習された音素などの、サブワード単位の音響モデルを記憶している。ここで、「音素」とは、ある一つの言語で用いる音を弁別機能の見地から分析・規定した最小単位であり、子音、母音などに分類される。また、「サブワード」とは、個々の単語を構成する単位であり、サブワードの集合により１つの単語が構成される。サブワード音響モデル記憶部１には、母音、子音などの各音素に対応するサブワード音響モデルが記憶されている。例えば「あか（aka）」という単語の場合、これを構成するサブワードは"a"、"k"、"a"である。
【００２６】
辞書２には、音声認識の対象となる単語に関する単語情報が記憶されている。具体的には、複数の単語に対して、その単語を構成するサブワードの組み合わせが記憶されている。例えば「あか」という単語については、それを構成するサブワードが"a"、"k"、"a"であることが情報として記憶されている。
【００２７】
単語モデル生成部３は、各単語の音響モデルである単語モデルを生成する。具体的には、ある単語について、辞書２に記憶されている単語情報と、サブワード音響モデル記憶部１に記憶されているサブワード音響モデルとを利用して、その単語の単語モデルを生成する。例えば「あか」という単語の場合、辞書２には「あか」という単語がサブワード"a"、"k"、"a"により構成されることが単語情報として記憶されている。また、サブワード"a"、"k"、"a"に対応するサブワード音響モデルがサブワード音響モデル記憶部１に記憶されている。よって、単語モデル生成部３は、辞書２を参照して単語「あか」を構成するサブワードを調べ、それらに対応するサブワード音響モデルをサブワード音響モデル記憶部１から取得して組み合わせることにより単語「あか（aka）」の単語モデルを生成する。
【００２８】
音響分析部４は、マイク１２を介して音声認識装置１０に入力された発話音声信号を音響的に分析して特徴ベクトル系列に変換する。認識処理部５は、音響分析部４により得られた発話音声の特徴ベクトルと、単語モデル生成部３が生成した複数の単語モデルとを照合し（マッチング処理）、ユーザの発話音声に対する各単語モデルの音響的尤度を計算する。この際に照合される単語モデルを「単語候補」とも呼ぶ。認識処理部５は、予め決められた複数の単語候補をユーザの発話音声に対応する特徴ベクトル系列とマッチングし、各単語候補についての音響的尤度をそれぞれ算出する。
【００２９】
実際には、ユーザがある単語を発話する際には、その状況においてユーザが発話するであろうと予測される幾つかの単語が単語候補として決定される（これを、「待ち受け単語」とも呼ぶ）。そして、ユーザの発話に対応する特徴ベクトル系列が得られると、それを予め決定された単語候補（待ち受け単語）とマッチングし、各単語候補に対する音響的尤度を個別に算出する。
【００３０】
付加情報収集部６は、ユーザの過去の発話履歴などの付加情報を収集する。本発明の音声入力装置がカーナビゲーション装置のコマンド入力部に使用される場合には、付加情報にはカーナビゲーション装置を搭載した車両の位置情報などが含まれる。認識信頼度演算部７は、認識処理部５により算出されたユーザ発話に対する各単語候補の音響的尤度を元にして、各単語候補の認識信頼度を計算する。認識信頼度は、その単語候補が、ユーザが実際に発話した単語とどの程度の確からしさで一致しているかを示す指標である。認識信頼度が高いほど、その単語候補がユーザの発話した単語と一致している確率、即ち正解である確率が高い。認識信頼度が低いほど、その単語候補が正解である確率が低い。
【００３１】
具体的には、認識信頼度演算部７は、認識処理部５により算出された各単語候補の音響的尤度に対して、付加情報収集部６で得られた付加情報を用いて重み付けを行い、ユーザの発話音声に対する各単語候補の認識信頼度を算出する。例えば、付加情報収集部６が収集した付加情報に、そのユーザが過去にある単語を頻繁に発話しているという履歴がある場合には、それと同一の単語候補の認識信頼度は高く設定される。また、車両の現在位置に関連する単語が発話された場合には、その単語の信頼度を高く設定することができる。なお、上記の認識信頼度の算出方法は一例であり、本発明では、他の各種の認識信頼度の算出方法を使用することができる。
【００３２】
再発話制御部８は本発明の中心的な役割を果たす要素であり、再発話における単語候補の制御などを行う。図２に再発話制御部８の内部構成を示す。図２に示すように、再発話制御部８は、信頼度分析部８１と、候補選択部８２と、待ち受け単語選択部８３と、１位候補情報抽出部８４と、合成音声情報生成部８５と、スイッチＳＷ２とを備える。
【００３３】
再発話制御部８には、認識信頼度演算部７から信頼度情報２０が入力される。この信頼度情報２０は、ユーザの発話に対する複数の単語候補を示す単語候補情報と、それら各単語候補について認識信頼度演算部７が算出した認識信頼度情報とを含む。即ち、信頼度情報２０は、どの単語候補がどの程度の信頼度を有するかを示している。
【００３４】
信頼度分析部８１は、信頼度情報２０に含まれる複数の単語候補のうち、最も高い信頼度を有する単語候補（以下、「１位単語候補」と呼ぶ。）を認識結果と決定してよいか否か、即ち、１位単語候補を正解として良いか否かを判定する。この判定は、例えば１位単語候補の信頼度と２位単語候補の信頼度とを利用して行うことができる。即ち、１位単語候補の信頼度が十分高く、ある所定のしきい値α以上であること（条件１）、及び、１位単語候補の信頼度と２位単語候補の信頼度との差が十分に大きく、ある所定のしきい値β以上であること（条件２）の２つの条件を具備する場合には、その１位単語候補を正解と判定する。一方、条件１及び２のいずれか一方でも具備されない場合は、その１位単語候補を正解とはしない。なお、１位単語候補を正解と判定する方法は、上記以外の方法を採用することもできる。例えば、上位の所定数ｎ個の単語候補の信頼度を使用して、１位単語候補が正解であるか否かを判定することもできる。
【００３５】
１位単語候補が正解であると判定した場合、認識度分析部８１は、図１に示すスイッチＳＷ１及び図２に示すスイッチＳＷ２をいずれも端子Ｔ１に接続する制御信号をスイッチＳＷ１及びＳＷ２に供給する。一方、１位単語候補が正解でないと判定した場合、認識度分析部８１はスイッチＳＷ１及びＳＷ２をともに端子Ｔ２に接続する制御信号をスイッチＳＷ１及びＳＷ２に供給する。
【００３６】
１位候補情報抽出部８４は、信頼度分析部８１が１位単語候補を正解であると判定した場合に、認識信頼度演算部７からスイッチＳＷ２を介して信頼度情報２０を受け取る。そして、１位単語候補が正解であることを示す情報、正解と判定された１位候補単語が何であるかを示す情報、及び、１位単語候補に対応する発音情報などを合成音声情報生成部８５へ供給する。また、１位候補情報抽出部８４は、１位候補単語が何であるかを示す情報を、認識結果として外部へ出力する。
【００３７】
合成音声情報生成部８５は、１位単語候補が正解である場合には、１位候補情報抽出部８４からの情報に基づいて、認識結果をユーザへ通知するための合成音声情報を生成して合成音声生成部９へ出力する。
【００３８】
図１に示す合成音声生成部９は、合成音声情報生成部８５から入力された合成音声情報に基づいて、正解と判定された単語を含む合成音声を生成し、スピーカ１１から出力することにより認識結果をユーザへ通知する。認識結果をユーザへ通知するとは、例えば正解と判定された単語候補が「あか」である場合、「あかですね？」などの合成音声を出力することを意味する。これによりユーザは認識結果の確認を行う。なお、本実施形態では、スピーカ１１からの音声出力によりユーザに認識結果を通知する方法を採用しているが、そのかわりに又はそれに加えて、ディスプレイなどにより視覚的にユーザに認識結果を通知するように構成することもできる。
【００３９】
一方、１位候補が正解でないと認識度分析部８１が判定した場合には、音声認識装置１０はユーザに再発話を要求することになる。その場合、スイッチＳＷ２は端子Ｔ２に接続され、信頼度情報２０が候補選択部８２に供給される。また、スイッチＳＷ１も端子Ｔ２に接続され、待ち受け単語選択部８３が単語モデル生成部３に接続される。候補選択部８２は、信頼度が算出された全ての単語候補のうちから、信頼度の高い幾つかの単語候補（以下、「正解単語候補」と呼ぶ。）に絞り込みを行う。例えば、１位単語候補との信頼度の差が所定のしきい値γ以下である単語候補を正解単語候補に設定する。そして、決定された正解単語候補の識別情報を待ち受け単語選択部８３へ供給する。
【００４０】
待ち受け単語選択部８３は、ユーザの再発話に対する待ち受け単語グループ（即ち、ユーザの再発話に対する認識処理において単語候補として使用する単語の組み合わせ）を決定する。この最も典型的な方法は、候補選択部８２が選択した正解単語候補を待ち受け単語に設定する方法である。これにより、前回の発話の認識処理において、認識信頼度が高かった候補が待ち受け単語に設定される。しかし、これでは、ユーザの前回の発話と再発話とがまったく同一であった場合（例えば、繰り返し「あか」と発話した場合）には前回の発話と同様に認識結果を正解と判定できなくなる可能性がある。そこで、本発明では、再発話において待ち受け単語として使用される単語を、正解単語候補の同義語などであって認識処理により識別しやすい別な単語とすることにより、再発話における認識率を高めるようにしている。即ち、待ち受け単語選択部８３は、候補選択部８２から供給される正解単語候補に基づいて、それらの同義語などであって識別しやすい単語の組み合わせを再発話用の待ち受け単語として設定する。ここで、「識別しやすい単語の組み合わせ」の１つの好適な例は、正解単語候補の同義語であって、音素の重複が少なく（条件Ａ）、かつ、総音素数が多い（条件Ｂ）単語の組み合わせである。これは、音声認識の観点から単語同士を音響的に比較した場合、一般的に音素の重複が少なく、かつ、単語の音素数が多いほど、単語の識別が容易になるからである。
【００４１】
これについて、より詳しく説明する。辞書２には、１つの単語について意味が同じで発音の異なる同義語（同意単語）を用意しておく。いま、候補選択部８２が選択した正解候補単語が「あか（aka）」と「あお（ao）」の２つであったと仮定する。また、「あか」の同意単語として「れっど（reqdo）」が辞書２内に記憶されており、「あお」の同意単語として「ぶるー（buruu）」が辞書２に記憶されていると仮定する。この場合、「あか」と「あお」とでは音素"a"が重複し、「れっど」と「あお」では音素"o"が重複するので、条件Ａによれば、識別しやすい単語の組み合わせは「あか」と「ぶるー」、又は、「れっど」と「ぶるー」となる。そして、さらに条件Ｂを考慮すると、これらの組み合わせのうちでは、「れっど」と「ぶるー」の組み合わせの方が総音素数が多いので、最終的に「れっど」と「ぶるー」の組み合わせが待ち受け単語に設定される。また、別の例として、例えば「あお」の同意単語としてさらに「みずいろ（mizuiro）」が辞書２に記憶されている場合には、同一音素の数が最も少ない組み合わせのなかで総音素数がもっとも多い「あか」と「みずいろ」の組み合わせが待ち受け単語に設定される。このように、本発明では、正解単語候補及びそれらの同意単語のうち、最も識別しやすい組み合わせを次回の再発話に対する待ち受け単語に設定する。これにより、再発話の認識処理における認識精度を改善することができる。
【００４２】
また、本発明では、再発話の際に、再発話を促すトークバックに含まれる単語が正解の単語以外であることを示す「その他」、「それ以外」、「ちがう」などの単語を、再発話を促すトークバックに含めることを特徴とする。これにより、再発話を促すトークバックでユーザに尋ねた単語中に正解が含まれていない場合には、音声認識装置１０はそれを知ることができる。例えば、初回発話の結果、正解候補単語が「あか」と「あお」に絞られ、さらに上述のように最終的に「あか」と「みずいろ」が待ち受け単語に決定されたとする。その場合、再発話を促すトークバックでは、音声認識装置１０はユーザに例えば「「あか」ですか、「みずいろ」ですか、「その他」ですか？」と尋ねる。これに対してユーザが「その他」と再発話したとすれば、ユーザの発話した単語は「あか」でも「みずいろ」でもないことがわかる。そして、音声認識装置１０は前回の絞り込みが誤りだったことを認識し、「あか」、「みずいろ」以外の単語候補を探すことが可能となる。
【００４３】
こうして、待ち受け単語選択部８３は、再発話用の待ち受け単語の数、発音、意味（元単語の読み）などを含む情報を待ち受け単語情報８３ａとしてスイッチＳＷ１を介して単語モデル生成部３に供給するとともに、合成音声情報生成部８５に供給する。その場合、単語モデル生成部３は、待ち受け単語情報８３ａに含まれる待ち受け単語の単語モデルを生成し、再発話の認識処理において認識処理部５によるマッチング処理に使用させる。即ち、前述の例では、再発話された単語の認識処理において「あか」、「みずいろ」、「その他」の単語モデルがマッチング処理の対象となる。また、合成音声情報生成部８５は、待ち受け単語情報８３ａに基づき、再発話を促すトークバックとして「「あか」ですか、「みずいろ」ですか、「その他」ですか？」という合成音声情報を生成する。この合成音声情報は、合成音声生成部９によりスピーカ１１から合成音声として出力される。
【００４４】
こうして、音声認識装置１０は、再発話の際にはトークバック中に、正解候補単語のうち識別しやすい組み合わせの単語を含め、さらに、それら以外の単語を示す「その他」などの単語を含めてユーザに再発話を促す。これにより、再発話時の認識精度を上げることができる。
【００４５】
なお、再発話後の認識処理においても依然として１位単語候補を正解と判定できない場合は、さらに同様の再発話処理を繰り返すことができる。また、再発話処理においては、信頼度分析部８１が１位単語候補を正解と判定する際に使用するしきい値を徐々に緩和して、正解との判定がされやすくなるようにすることもできる。
【００４６】
また、再発話（複数回の場合を含む）において単語「その他」に対応する単語候補が正解であると判定された場合、即ち、ユーザがトークバック中で指定された現在の待ち受け候補単語中に正解が無いと判断した場合には、待ち受け単語選択部８３は、待ち受け単語を１回前の発話時の状態に戻す。この理由は以下の通りである。例えばｍ回目の発話に対する認識処理において１位単語候補が正解でないと判定された場合、（ｍ＋１）回目の発話に対する待ち受け単語は、候補選択部８２により上位候補のみに絞り込みがなされている。しかし、（ｍ＋１）回目の発話でユーザが「その他」と発話したということは、その際に設定されている待ち受け単語中には正解の単語が無いということであり、絞り込みが誤りであった（待ち受け誤り）ということを意味している。よって、待ち受け単語を絞り込み以前の状態（ｍ回目の発話時の状態）に戻して単語候補の範囲を広げ、必要であれば再発話を促すのである。
【００４７】
この場合、信頼度分析部８１はスイッチＳＷ１及びＳＷ２を端子Ｔ２に接続する。待ち受け単語選択部８３は、次回の発話用の待ち受け単語グループを決定する際に、前回の待ち受け単語グループを記憶する。即ち、待ち受け単語選択部８３は、過去の待ち受け単語グループを全て記憶しておき、待ち受け誤りの場合には、１回前の待ち受け単語グループを次回の再発話及び認識処理において使用する。
【００４８】
こうして、必要に応じて再発話を繰り返し、最終的に信頼度分析部８１が、ある１位単語候補を正解であると判定すると、その１位単語候補は認識結果として音声認識装置１０から外部装置へと送られる。外部装置とは、音声認識装置１０による認識結果をコマンドなどとして使用する装置である。例えば、前述のようにカーナビゲーション装置の入力部に音声認識装置１０を使用する場合、認識結果はカーナビゲーション装置のコントローラなどに供給され、その内容（コマンド）に対応する処理が実行される。
【００４９】
［音声認識処理］
次に、図３を参照して、上記の音声認識装置１０により実行される音声認識処理について説明する。図３は音声認識処理のフローチャートである。
【００５０】
まず、ステップＳ１において、ユーザによる初回発話を認識するための初期設定を行う。具体的には、再発話制御部８は、スイッチＳＷ１を端子Ｔ１側に接続し、認識を行う単語候補情報が格納された辞書２内の全単語を、初回発話に対する待ち受け単語として設定する。そして、発話カウンタｃを１に設定する。なお、発話カウンタは、認識を行う発話に対する待ち受け単語グループを示す。つまり、発話カウンタｃ＝１は初回発話に対する待ち受け単語グループ（上記の例では、辞書２に格納された全単語）に対応し、発話カウンタｃ＝２は初回発話後に１回絞り込みを行った後の待ち受け単語グループに対応する。
【００５１】
次に、ステップＳ２において、待ち受け単語グループに基づいて、単語モデル生成部３がサブワード音響モデル記憶部１内に記憶されているサブワード音響モデルを使用して、単語モデルを生成する。これにより、初回発話に対する待ち受け単語グループに対応する単語モデルが全て用意されたことになる。
【００５２】
次に、ステップＳ３において音声認識処理が行われる。即ち、ユーザによる発話が行われ、対応する発話音声信号がマイク１２を介して音響分析部４に入力される。音響分析部４は発話音声信号の音響分析を行い、その特徴ベクトル系列を得る。そして、認識処理部５は、発話音声信号の特徴ベクトルと、ステップＳ２において用意された各単語モデルとのマッチング処理を行い、両者間の音響的尤度を単語モデル毎に算出する。
【００５３】
次に、ステップＳ４において、認識信頼度演算部７は、ステップＳ３で算出された各単語候補についての音響的尤度を、付加情報収集部６が収集した付加情報を用いて重み付けすることにより、各単語候補の認識信頼度を算出する。なお、付加情報は、例えばユーザの過去の発話履歴やナビゲーション装置を搭載した車両の位置情報などである。
【００５４】
次に、ステップＳ５において、信頼度分析部８１は、各単語候補の認識信頼度に基づいて、最も認識信頼度が高い１位単語候補が正解であるか否かを分析する。この分析は、前述のように、例えば１位単語候補の信頼度と２位単語候補の信頼度を利用して行うことができる。
【００５５】
次に、ステップＳ６において、信頼度分析部８１は、ステップＳ５における分析の結果に基づいて１位単語候補が正解であるか否かを判定する。１位単語候補が正解であると判定された場合、処理はステップＳ７に進む。一方、１位単語候補が正解ではないと判定された場合、処理はステップＳ１４へ進む。
【００５６】
ステップＳ６において１位単語候補が正解であると判定された場合、ステップＳ７において、信頼度分析部８１は、その１位単語候補が「その他」に対応する単語であるか否かを判別する。前述のように「その他」に対応する単語候補は、待ち受け単語の絞り込みを行った結果、正解の単語が待ち受け単語に含まれなくなった場合に、待ち受け単語グループを修正するために使用される。１位単語候補が「その他」に対応している場合、処理はステップＳ１０へ進む。１位単語候補が「その他」に対応していない場合、処理はステップＳ８へ進む。
【００５７】
ステップＳ８に処理が進んだ場合、それは１位単語候補が正解であり、かつ、「その他」の単語候補ではないことを意味する。即ち、その１位単語候補を認識結果として良いことになる。よって、１位候補情報抽出部８４は信頼度情報２０から１位単語候補を抽出し、１位単語候補が正解であることを示す情報、正解と判定された１位候補単語が何であるかを示す情報、及び、１位単語候補に対応する発音情報などを合成音声情報生成部８５へ供給するとともに、正解と判定された１位候補単語が何であるかを示す情報を外部へ認識結果として出力する。
【００５８】
ステップＳ９では、１位単語候補に対応する発音情報に基づいて、合成音声情報生成部８５が合成音声情報を生成して合成音声生成部９へ供給し、合成音声生成部９がスピーカ１１から１位単語候補の読みを合成音声として出力する。例えば、１位単語候補が「あか」である場合、スピーカからは「あかですね？」というように、ユーザに対して認識結果の通知がなされる。
【００５９】
一方、ステップＳ６において、１位単語候補が正解でないと判定された場合、ステップＳ１４において候補選択部８２が正解単語候補を選択する。具体的には、候補選択部８２は、１位単語候補の認識信頼度などを利用して正解単語候補を選択する。この処理により、次回発話時の認識処理において使用される単語候補の絞り込みがなされる。
【００６０】
次に、ステップＳ１５において、候補選択部８２により選択された正解単語候補に基づいて、待ち受け単語選択部８３は、識別しやすく発音の異なる単語の組み合わせを生成する。具体的には、待ち受け単語選択部８３は、正解候補単語に対応する同意単語の組み合わせのうち、同一音素の数が最も少なく、総音素数の多い単語候補を待ち受け単語として決定する。そして、それらの待ち受け単語を含む待ち受け単語グループを設定する。なお、この待ち受け単語グループ中には、上記の単語の他に、「その他」に対応する単語が含められる。そして、待ち受け単語選択部８３は、それらの待ち受け単語に対応する単語情報を辞書２から取得して単語モデル生成部３へ送り、対応する単語モデルを生成させる。こうして、待ち受け単語グループが更新される。
【００６１】
また、待ち受け単語選択部８３は、更新前の待ち受け単語グループを記憶する。これは、次回の発話においてユーザが「その他」と発話した場合には、１回前の待ち受け単語グループを再度使用する必要が生じるからである。また、待ち受け単語選択部８３は、選択した待ち受け単語グループを合成音声情報生成部８５へも供給する。
【００６２】
ステップＳ１６においては、合成音声情報生成部８５及び合成音声生成部９が、再発話を促すトークバックとして、ステップＳ１５で決定された待ち受け単語の合成音声を出力する。例えば、ステップＳ１５において待ち受け単語が「あか」、「あお」、「その他」に決定されたとすると、再発話を促すトークバックとして「あかですか、あおですか、その他ですか？」などの合成音声が出力される。
【００６３】
次に、ステップＳ１７において、発話カウンタｃを１だけ増加する。その結果、増加後の発話カウンタｃは、待ち受け単語グループが、前回の待ち受け単語グループに対して１回の更新後の状態に移行したことを示すようになる。そして、処理はステップＳ２へ戻り、ステップＳ１５で決定された待ち受け単語グループに含まれる単語の単語モデルが生成され、再発話に対する認識処理が行われる。
【００６４】
また、ステップＳ７において、１位単語候補が「その他」に対応すると判定された場合、それは、その際の待ち受け単語グループ中に正解の単語が含まれていない、即ち待ち受け誤りであることを示している。よって、処理はステップＳ１０へ進み、発話カウンタｃの値が１であるか否かを判定する。発話カウンタｃ＝１である場合、現在の認識処理は初回発話に対して行われたものであり、その際の待ち受け単語の組み合わせは、辞書に含まれる全ての単語候補に設定されている。よって、ユーザが発話した単語がもともと辞書２に含まれていないことになる。この場合は、候補なしとして認識処理を終了する。
【００６５】
一方、発話カウンタｃが１でない場合、処理はステップＳ１１へ進む。ステップＳ１１では、待ち受け単語選択部８３は、発話カウンタｃの値から１を減算し、先に記憶しておいた前回の待ち受け単語グループを設定する。ユーザが「その他」と発話したということは、正解の単語が現在の待ち受け単語グループ中に含まれていないわけであるから、１回前の認識処理において使用した待ち受け単語グループに戻して再度認識処理を行うのである。なお、待ち受け単語選択部８３はステップＳ１４において、待ち受け単語の更新を行った際に、更新前の状態の待ち受け単語グループを記憶しているので、これを読み出して設定すればよい。その際、待ち受け単語選択部８３は、「その他」に対応する単語（「待ち受け誤り単語」とも呼ぶ。）が待ち受け単語グループに含まれるようにする。
【００６６】
次に、ステップＳ１２において、待ち受け単語選択部８３は、そのようにして決定された待ち受け単語グループを単語モデル生成部３及び合成音声情報生成部８５へ供給する。単語モデル生成部３は、それらの待ち受け単語に対応する単語モデルを生成し、次回の認識処理において使用できるようにする。また、合成音声情報生成部８５及び合成音声生成部９は、供給された待ち受け単語の情報を使用して、対応する単語の合成音声出力を行う。
【００６７】
以上のようにして、１位単語候補が正解と判定され、認識結果として出力される（ステップＳ９）か、又は、候補なしとして認識処理が終了される（ステップＳ１０：Yes）まで、ユーザの発話内容に応じて待ち受け単語グループが更新されつつ認識処理が行われる。１位単語候補の信頼度が正解と判定できる程度まで高くない場合には、信頼度などに基づいて待ち受け単語の絞り込みが行われ、さらに、絞り込まれた単語の同意単語であって音響的に識別しやすい組み合わせの単語を次回の発話時の待ち受け単語として設定することにより待ち受け単語グループが更新される。従って、再発話による認識率を改善することができ、ユーザの発話音声を迅速かつ効率的に認識することが可能となる。
【００６８】
［変形例］
図２に示す再発話制御部８においては、信頼度分析部８１は、１位単語候補及び２位単語候補を用いて１位単語候補が正解であるか否かを決定していた。その代わりに、信頼度分析部８１は、認識信頼度が上位ｎ個の単語候補を使用して１位単語候補が正解であるか否かを決定するように構成することができる。その場合には、１位単語候補が正解であるか否かを決定する処理中に、認識信頼度が上位であるｎ個の単語候補が決定される。よって、それら認識信頼度が上位であるｎ個の単語候補が決定された時点で、それらを絞り込み後の正解単語候補とすることができる。こうすれば、候補選択部８２の処理を信頼度分析部８１が行うことができ、候補選択部８２を省略することができる。その場合、正解単語候補の情報は信頼度分析部８１から待ち受け単語選択部８３へ入力されることになる。
【００６９】
図３に示す音声認識処理においては、ステップＳ７で１位単語候補が「その他」に対応すると判定され、かつ、ステップＳ８で発話カウンタｃが１でないと判定された場合、発話カウンタを１だけ減算して、前回の待ち受け単語グループを次回発話で使用するように設定している。しかし、ステップＳ７の判定がＹｅｓであるということは、前回の待ち受け単語グループ中に正解の単語が無かったことを示しているのであるから、次回の待ち受け単語グループ中にそれらの単語を含めることに意味はない。例えば、「あか」、「あお」、「その他」という待ち受け単語グループを使用した発話において、ユーザが「その他」と発話したということは、ユーザの発話した単語は「あか」、「あお」のいずれでもない。よって、待ち受け単語選択部８３は、ステップＳ１１で取得した１回前の待ち受け単語グループ中から、「あか」、「あお」及びそれらの同義語を除いて待ち受け単語グループを設定することができる。これにより、正解ではないことが明白である単語を待ち受け単語グループから除くことにより、認識処理のさらなる効率化が可能となる。
【００７０】
なお、上記の音声認識装置１０を構成する各要素をコンピュータプログラムとして構成し、コンピュータを備える機器において実行することにより、上記の音声認識装置を実現することが可能である。例えば、コンピュータを備えるカーナビゲーション装置やＡＶ機器などにおいて、上記のコンピュータプログラムを使用することにより、音声入力機能を実現することが可能となる。
【００７１】
なお、上記実施形態においては、正解単語候補及びそれらの同意単語のうち、最も識別しやすい組み合わせを次回の再発話に対する待ち受け単語に設定したが、正解単語候補の同意単語のみから最も識別しやすい組み合わせを決定してもよい。
【００７２】
また、正解単語候補及びそれらの同意単語に、再発話を促すトークバックに含まれる単語が正解の単語以外であることを示す待ち受け誤り単語も加えて、最も識別しやすい組み合わせを決定してもよい。
【００７３】
【発明の効果】
以上説明したように、本発明によれば、認識結果が誤りである可能性が高い場合は、ユーザに再発話を促すことにより、誤認識の可能性を減少させることができる。また、ある発話に対する認識結果が正解であると判定できない場合、その際に使用した待ち受け単語と同意単語であって、音響的に識別が容易な単語を次回の発話の際の待ち受け単語に設定することにより、同じ認識結果が繰り返されることがなくなり、次回の発話による認識率が改善される。また、再発話を促す確認のトークバック中に、現在の待ち受け単語以外の単語を示す「その他」などの単語を含めることにより、正解ではない単語を除去していくことができ、効率的かつ迅速に正解に至ることが可能となる。
【図面の簡単な説明】
【図１】本発明の実施形態にかかる音声認識装置の概略構成を示すブロック図である。
【図２】図１に示す再発話制御部の内部構成を示すブロック図である。
【図３】図１に示す音声認識装置による音声認識処理を示すフローチャートである。
【符号の説明】
１サブワード音響モデル記憶部
２辞書
３単語モデル生成部
４音響分析部
５認識処理部
６付加情報収集部
７認識信頼度演算部
８再発話制御部
９合成音声生成部
１０音声認識装置
１１スピーカ
１２マイク
８１信頼度分析部
８２候補選択部
８３待ち受け単語選択部
８４１位候補情報抽出部
８５合成音声情報生成部
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a speech recognition method for recognizing human speech input via a microphone or the like.
[0002]
[Prior art]
In general, a speech recognition apparatus acoustically analyzes a speech input signal generated from a user's utterance, compares it with a plurality of candidate word models prepared in advance to calculate an acoustic likelihood (degree of similarity) for each, and determines the candidate with the highest acoustic likelihood (referred to as the “first candidate”) as the recognition result. If the recognition reliability of the first candidate is not sufficiently high, the apparatus determines that there is no correct recognition result, talks back with a message such as “Please speak again” to prompt the user to re-utter, and performs the recognition process again.
[0003]
Even when the reliability of the recognition result is low and the user is asked to re-utter, a conventional speech recognition apparatus performs the recognition process again using the same candidates as before. Therefore, if the user repeats the utterance in the same way as before, the recognition result is the same as before, and in the end the recognition rate for the re-utterance improves little.
[0004]
[Problems to be solved by the invention]
One speech recognition method that improves on this point is described in Japanese Patent No. 3112037. In this method, when a sufficiently reliable recognition result is not obtained in the recognition process for the user's first utterance, the candidates are narrowed down to a few with high reliability and the user is prompted to re-utter. Further, for the candidates that ranked high in reliability in the recognition process for the first utterance, their synonyms are added to the candidates, the user is prompted to re-utter, and the recognition process is performed again.
[0005]
However, in this method, if the correct answer is not included in the top candidates narrowed down by the first recognition result, recognition becomes impossible. Moreover, even if synonyms of the top candidates are added to the candidates, adding them serves no purpose when the user uses the same word as before in the re-utterance.
[0006]
Another speech recognition method is described in Japanese Patent Application Laid-Open No. 11-119792. In the method described in this publication, a set of acoustically similar commands (referred to as “similar type commands”) and a corresponding set of paraphrase commands are defined and stored in advance. For example, when the similar type commands are “raise the window” and “lower the window”, “open the window” and “close the window” are prepared as the corresponding paraphrase commands. When the user utters a similar type command, the user is asked to speak again using the corresponding paraphrase command.
[0007]
However, in this method, it is necessary to preliminarily define the correspondence between the similar type command and the paraphrase command corresponding thereto and store it in a memory or the like. Therefore, when a large number of commands are used in the system, the storage capacity required for that purpose increases, leading to an increase in cost.
[0008]
The present invention has been made in view of the above points, and its object is to provide a speech recognition apparatus and program that enable efficient and accurate recognition while keeping requests for re-utterance to the user to a minimum.
[0009]
[Means for Solving the Problems]
According to one aspect of the present invention, a speech recognition apparatus comprises: speech input means for receiving a user's speech input; recognition processing means for performing a recognition process that determines a plurality of word candidates corresponding to the speech input by matching against each standby word in a preset standby word group; determination means for determining whether or not a correct answer is included in the plurality of word candidates; and setting means comprising means for analyzing, for each of the plurality of word candidates and their synonym candidates, the phonemes constituting each word candidate. When the determination means determines that no correct answer is included in the plurality of word candidates, the setting means, treating at least the synonym candidates corresponding to a word candidate as one group so that a group is set for each word candidate, (1) extracts, from the plurality of word candidates and their synonym candidates, combinations across the groups with little phoneme overlap, (2) determines, from among the extracted combinations, one candidate in each group so as to obtain a combination with a relatively large total number of phonemes, and sets these in the standby word group used in the next recognition process. Preferably, when the determination means determines that no correct answer is included in the plurality of word candidates, the setting means treats each word candidate together with its synonym candidates as one group and determines a candidate from each of the groups.
[0010]
The speech recognition apparatus described above receives a speech input from the user, such as a command, and determines word candidates corresponding to the speech input by matching against preset standby words. It then determines whether or not a correct answer is included in the word candidates. When the determination means determines that a correct answer is included, that word candidate is output as the recognition result. On the other hand, when the determination means determines that no correct answer is included, then from among the word candidates and the synonym candidates having the same meaning as each word candidate, with at least the synonym candidates corresponding to a word candidate treated as one group so that a group is set for each word candidate, (1) combinations across the groups with little phoneme overlap are extracted, and (2) from among the extracted combinations, candidates are determined across the groups so as to obtain a combination with a relatively large total number of phonemes, and these are used in the next recognition process. Because the next recognition process uses easily distinguished word candidates chosen from among word candidates that include synonyms, the recognition rate for the user's re-utterance can be improved.
[0011]
In one aspect of the speech recognition apparatus, the setting means may include means for analyzing, for each of the plurality of word candidates and their synonym candidates, the phonemes constituting each word candidate, and means for setting the combination of word candidates with the least phoneme overlap as the standby words.
[0012]
According to this aspect, the word candidates, including synonym candidates, are analyzed in terms of their constituent phonemes, and the combination of word candidates with the least phoneme overlap is used as the standby words. The recognition process can therefore be performed with words that are easy to distinguish from one another in speech recognition.
[0013]
In another aspect of the speech recognition apparatus described above, the setting means may include means for analyzing, for each of the plurality of word candidates and their synonym candidates, the phonemes constituting each word candidate, and means for setting the combination of word candidates with the least phoneme overlap and the largest total number of phonemes as the standby words.
[0014]
According to this aspect, the word candidates, including synonym candidates, are analyzed in terms of their constituent phonemes, and the combination of word candidates with the least phoneme overlap and the largest total number of phonemes is used as the standby words. The recognition process can therefore be performed with words that are even easier to distinguish in speech recognition.
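The two-stage selection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the overlap measure (distinct phonemes shared by each pair of words, summed over all pairs) and the function names `phoneme_overlap` and `select_standby_words` are assumptions, and each letter of the romanization is treated as one phoneme. The words are taken from the example in paragraph 【００４１】 of the description.

```python
from itertools import combinations, product


def phoneme_overlap(words):
    """Sum, over every pair of words, the number of distinct phonemes they share."""
    return sum(len(set(a) & set(b)) for a, b in combinations(words, 2))


def select_standby_words(groups):
    """Pick one word per group: least phoneme overlap first, then most phonemes."""
    candidates = list(product(*groups))
    least = min(phoneme_overlap(c) for c in candidates)
    low_overlap = [c for c in candidates if phoneme_overlap(c) == least]
    # Among the low-overlap combinations, prefer the largest total phoneme count.
    return max(low_overlap, key=lambda c: sum(len(w) for w in c))


# One group per word candidate: the candidate plus its synonyms.
# Each word is a tuple of phonemes (assumed: one letter = one phoneme).
aka, reqdo = tuple("aka"), tuple("reqdo")
ao, buruu, mizuiro = tuple("ao"), tuple("buruu"), tuple("mizuiro")
groups = [[aka, reqdo], [ao, buruu, mizuiro]]

chosen = select_standby_words(groups)
print(["".join(w) for w in chosen])  # ['aka', 'mizuiro']
```

With this overlap measure, "aka" shares no phoneme with "buruu" or "mizuiro", and "mizuiro" has more phonemes, so the pair "aka" and "mizuiro" is chosen, which agrees with the second example in paragraph 【００４１】.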
[0015]
In another aspect of the speech recognition apparatus, the setting means may include in the standby word group a standby error word, which signifies that the user's speech input corresponds to a word candidate other than those included in the standby words. When the current standby words do not include the correct answer, the user utters the standby error word, so the apparatus can determine whether or not the current standby words include the correct answer.
[0016]
In still another aspect of the speech recognition apparatus, the setting means has storage means for storing standby word groups used in the past, and when the determination means determines that the standby error word is the correct answer, the setting means may set the immediately preceding standby word group stored in the storage means as the standby word group to be used in the next recognition process. Thus, when the correct answer is not included in the current standby word group, the range of standby words can be widened to search for the correct answer.
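One way to picture the stored standby word groups is as a stack that is pushed on each narrowing and popped when the standby error word is recognized. The sketch below is only an illustration under that assumption; the class name `StandbyHistory`, its methods, and the extra words "kiiro" and "midori" are hypothetical and do not come from the description.

```python
class StandbyHistory:
    """Keeps past standby word groups so that a rejected narrowing can be undone."""

    def __init__(self, initial_group):
        self._past = []  # groups used before the current one, oldest first
        self.current = list(initial_group)

    def narrow(self, new_group):
        """Store the current group, then switch to the narrowed group."""
        self._past.append(self.current)
        self.current = list(new_group)

    def roll_back(self):
        """Restore the immediately preceding group; return False if there is none."""
        if not self._past:
            return False
        self.current = self._past.pop()
        return True


history = StandbyHistory(["aka", "ao", "kiiro", "midori"])
history.narrow(["aka", "mizuiro", "other"])

# The user answers "other": the narrowing was wrong, so widen the range again.
if history.roll_back():
    print(history.current)  # ['aka', 'ao', 'kiiro', 'midori']
```

When `roll_back` returns False there is no earlier group to return to, which corresponds to the case in paragraph 【００６４】 where the process ends with no candidate.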
[0017]
In still another aspect of the speech recognition apparatus, the standby error word may be “other” or a synonym thereof.
[0018]
In still another aspect of the speech recognition apparatus, when the user's speech input is the standby error word, the word candidates in the standby word group at that time, other than the word candidate corresponding to the standby error word, may be excluded from the word candidates to be included in the next standby word group. Since the standby error word indicates that there is no correct answer among the word candidates in the current standby word group, there is no point in including them in the next standby word group. Excluding word candidates known not to be correct from the next word candidates narrows the candidates and allows the correct answer to be reached efficiently.
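The exclusion step can be sketched as a simple filter. This is an illustration only: the function name `exclude_rejected`, the shape of the `synonyms` mapping (each word mapped to the other words with the same meaning), and the extra word "kiiro" are assumptions. Dropping the synonyms of the rejected words follows the variation in paragraph 【００６９】.

```python
def exclude_rejected(previous_group, rejected_group, synonyms, error_word="other"):
    """Remove words the user ruled out by uttering the standby error word."""
    ruled_out = set()
    for word in rejected_group:
        if word == error_word:
            continue  # the error word itself stays available
        ruled_out.add(word)
        ruled_out.update(synonyms.get(word, ()))
    return [w for w in previous_group if w not in ruled_out]


previous_group = ["aka", "ao", "kiiro", "other"]
rejected_group = ["aka", "mizuiro", "other"]  # the user answered "other" to this group
synonyms = {"aka": ["reqdo"], "mizuiro": ["ao", "buruu"]}

print(exclude_rejected(previous_group, rejected_group, synonyms))  # ['kiiro', 'other']
```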
[0019]
Still another aspect of the speech recognition apparatus may comprise notification means for notifying the user, by at least one of synthesized speech output and character display, of the standby words belonging to the standby word group set by the setting means when the determination means determines that no correct answer is included in the plurality of word candidates. Since the standby words are made known to the user by synthesized speech, the user can easily know which word to re-utter.
[0020]
In still another aspect of the speech recognition apparatus, the determination means may relax the criterion for determining a word candidate to be correct each time the recognition process is repeated. This makes a correct answer easier to obtain as the recognition process is repeated and improves the efficiency of the recognition process. In one preferred example, the determination means determines that a word candidate is correct when its reliability is equal to or higher than a predetermined threshold, and may lower the threshold each time the recognition process is repeated.
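As a rough illustration of the relaxed criterion, the threshold can be lowered by a fixed step per repetition. The description does not give a schedule, so the linear decrease, the lower bound, and all numeric values below are assumptions, as are the names `acceptance_threshold` and `is_correct`.

```python
def acceptance_threshold(attempt, initial=0.8, step=0.1, floor=0.5):
    """Threshold for the given attempt (1 = first utterance), never below the floor."""
    return max(floor, initial - step * (attempt - 1))


def is_correct(reliability, attempt):
    """Accept the candidate when its reliability reaches the current threshold."""
    return reliability >= acceptance_threshold(attempt)


# A candidate with reliability 0.72 is rejected on the first utterance
# but accepted once the threshold has been lowered for the re-utterance.
print(is_correct(0.72, attempt=1))  # False
print(is_correct(0.72, attempt=2))  # True
```

The floor keeps the criterion from relaxing indefinitely; whether such a bound is used is a design choice that the description leaves open.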
[0021]
In still another aspect of the speech recognition apparatus, the setting means may determine the most easily distinguished combination of candidates from among the plurality of word candidates, their synonym candidates, and the standby error word, and set it as the standby word group used in the next recognition process. The recognition process can therefore be performed with words that are even easier to distinguish in speech recognition.
[0022]
In another aspect of the present invention, a speech recognition program executed by a computer causes the computer to function as: speech input means for receiving a user's speech input; recognition processing means for performing a recognition process that determines a plurality of word candidates corresponding to the speech input by matching against each standby word in a preset standby word group; means for determining whether or not a correct answer is included in the plurality of word candidates; and setting means comprising means for analyzing, for each of the plurality of word candidates and their synonym candidates, the phonemes constituting each word candidate. The program further causes the setting means, when the determination means determines that no correct answer is included in the plurality of word candidates, to (1) extract, from the plurality of word candidates and their synonym candidates, combinations with little phoneme overlap across the groups formed by treating at least the synonym candidates corresponding to a word candidate as one group, (2) determine, from among the extracted combinations, one candidate in each group so as to obtain a combination with a relatively large total number of phonemes, and set these in the standby word group used in the next recognition process.
[0023]
DETAILED DESCRIPTION OF THE INVENTION
Preferred embodiments of the present invention will be described below with reference to the drawings.
[0024]
[Configuration of speech recognition apparatus]
FIG. 1 shows the functional configuration of a speech recognition apparatus according to an embodiment of the present invention. In FIG. 1, the speech recognition apparatus 10 includes a subword acoustic model storage unit 1, a dictionary 2, a word model generation unit 3, an acoustic analysis unit 4, a recognition processing unit 5, an additional information collection unit 6, a recognition reliability calculation unit 7, a re-utterance control unit 8, a synthesized speech generation unit 9, a speaker 11, a microphone 12, and a switch SW1.
[0025]
The subword acoustic model storage unit 1 stores previously trained acoustic models in units of subwords such as phonemes. Here, a “phoneme” is the smallest unit of sound in a given language, analyzed and defined in terms of its distinguishing function, and phonemes are classified into consonants and vowels. A “subword” is a unit constituting a word, and one word is made up of a sequence of subwords. The subword acoustic model storage unit 1 stores subword acoustic models corresponding to phonemes such as vowels and consonants. For example, the subwords constituting the word “aka” (red) are “a”, “k”, and “a”.
[0026]
The dictionary 2 stores word information about the words that are subject to speech recognition. Specifically, for each of a plurality of words, it stores the combination of subwords constituting that word. For example, for the word “aka” (red), it stores the information that the constituent subwords are “a”, “k”, and “a”.
[0027]
The word model generation unit 3 generates a word model, which is an acoustic model of a word. Specifically, for a given word, it generates the word model using the word information stored in the dictionary 2 and the subword acoustic models stored in the subword acoustic model storage unit 1. For example, for the word “aka” (red), the dictionary 2 stores, as word information, the fact that the word is composed of the subwords “a”, “k”, and “a”. The subword acoustic models corresponding to the subwords “a” and “k” are stored in the subword acoustic model storage unit 1. The word model generation unit 3 therefore refers to the dictionary 2 to find the subwords constituting the word “aka”, obtains the corresponding subword acoustic models from the subword acoustic model storage unit 1, and combines them to generate the word model for “aka”.
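By way of a non-limiting illustration only, the lookup-and-combine step described above may be sketched in Python as follows. The names `DICTIONARY`, `SUBWORD_MODELS`, and `generate_word_model` are hypothetical and do not appear in the embodiment, and each subword acoustic model is represented by a placeholder string rather than a trained model.

```python
# Hypothetical stand-ins for the dictionary 2 and the subword acoustic
# model storage unit 1. Real subword models would be trained acoustic
# models; placeholder strings are used here for illustration only.
DICTIONARY = {"aka": ["a", "k", "a"]}
SUBWORD_MODELS = {"a": "model(a)", "k": "model(k)"}


def generate_word_model(word):
    """Return the word model as the ordered sequence of subword models."""
    subwords = DICTIONARY[word]                    # look up the subword sequence
    return [SUBWORD_MODELS[s] for s in subwords]   # fetch and combine in order


word_model = generate_word_model("aka")
print(word_model)
```

In this sketch the word model is simply the ordered list of subword models; how the models are actually combined is not specified in the description.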
[0028]
The acoustic analysis unit 4 acoustically analyzes the speech signal input to the speech recognition apparatus 10 via the microphone 12 and converts it into a feature vector sequence. The recognition processing unit 5 matches the feature vectors of the uttered speech obtained by the acoustic analysis unit 4 against the plurality of word models generated by the word model generation unit 3 (matching process) and calculates the acoustic likelihood of each word model for the user's uttered speech. A word model matched at this time is also called a “word candidate”. In other words, the recognition processing unit 5 matches a plurality of predetermined word candidates against the feature vector sequence corresponding to the user's uttered speech and calculates an acoustic likelihood for each word candidate.
[0029]
In practice, when the user is to utter a word, several words that the user is expected to utter in that situation are determined in advance as word candidates (these are also referred to as “standby words”). When the feature vector sequence corresponding to the user's utterance is obtained, it is matched against these predetermined word candidates (standby words), and the acoustic likelihood is calculated individually for each word candidate.
[0030]
The additional information collection unit 6 collects additional information such as the user's past utterance history. When the speech recognition apparatus of the present invention is used as the command input unit of a car navigation device, the additional information includes position information of the vehicle on which the car navigation device is mounted. The recognition reliability calculation unit 7 calculates the recognition reliability of each word candidate based on the acoustic likelihood of each word candidate for the user's utterance calculated by the recognition processing unit 5. The recognition reliability is an index of how probable it is that a word candidate matches the word actually spoken by the user. The higher the recognition reliability, the higher the probability that the word candidate matches the word spoken by the user, that is, the probability that it is correct. The lower the recognition reliability, the lower the probability that the word candidate is correct.
[0031]
Specifically, the recognition reliability calculation unit 7 weights the acoustic likelihood of each word candidate calculated by the recognition processing unit 5 using the additional information obtained by the additional information collection unit 6, and thereby calculates the recognition reliability of each word candidate for the user's utterance. For example, when the additional information collected by the additional information collection unit 6 contains a history showing that the user has frequently uttered a certain word in the past, the recognition reliability of the corresponding word candidate is set high. Likewise, when a word related to the current position of the vehicle is uttered, the reliability of that word can be set high. Note that the above method of calculating the recognition reliability is only an example, and various other methods of calculating the recognition reliability can be used in the present invention.
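As one possible illustration of such weighting, the following Python sketch raises the acoustic likelihood of words that the user has uttered frequently in the past. The linear weighting formula, the parameter `boost`, and the function name `recognition_reliability` are assumptions introduced here for illustration; the description does not specify a formula.

```python
def recognition_reliability(likelihoods, utterance_counts, boost=0.1):
    """Weight each acoustic likelihood by the word's past utterance frequency.

    The linear form and the `boost` value are illustrative assumptions.
    """
    total = sum(utterance_counts.values()) or 1   # avoid division by zero
    reliability = {}
    for word, likelihood in likelihoods.items():
        frequency = utterance_counts.get(word, 0) / total
        reliability[word] = likelihood * (1.0 + boost * frequency)
    return reliability


# Two candidates with equal likelihood; "aka" was uttered more often before.
scores = recognition_reliability({"aka": 0.50, "ao": 0.50}, {"aka": 3, "ao": 1})
print(scores)
```

With equal acoustic likelihoods, the candidate with the richer utterance history receives the higher reliability, which is the behavior the paragraph above describes.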
[0032]
The re-utterance control unit 8 is the element that plays the central role in the present invention, and it controls the word candidates used for re-utterance. FIG. 2 shows the internal configuration of the re-utterance control unit 8. As shown in FIG. 2, the re-utterance control unit 8 includes a reliability analysis unit 81, a candidate selection unit 82, a standby word selection unit 83, a first candidate information extraction unit 84, a synthesized speech information generation unit 85, and a switch SW2.
[0033]
The re-utterance control unit 8 receives the reliability information 20 from the recognition reliability calculation unit 7. The reliability information 20 includes word candidate information indicating the plurality of word candidates for the user's utterance and the recognition reliability calculated by the recognition reliability calculation unit 7 for each word candidate. That is, the reliability information 20 indicates the degree of reliability of each word candidate.
[0034]
The reliability analysis unit 81 determines whether the word candidate having the highest reliability among the plurality of word candidates included in the reliability information 20 (hereinafter referred to as the “first word candidate”) may be adopted as the recognition result, that is, whether the first word candidate is correct. This determination can be made using, for example, the reliability of the first word candidate and the reliability of the second word candidate. That is, the first word candidate is determined to be correct when two conditions are both satisfied: the reliability of the first word candidate is sufficiently high, being greater than or equal to a predetermined threshold α (condition 1), and the difference between the reliability of the first word candidate and that of the second word candidate is sufficiently large, being greater than or equal to a predetermined threshold β (condition 2). On the other hand, if at least one of conditions 1 and 2 is not satisfied, the first word candidate is not regarded as correct. Methods other than the above can also be adopted for determining whether the first word candidate is correct. For example, the determination can be made using the reliabilities of a predetermined number n of word candidates.
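The two-threshold determination described above may, for example, be sketched in Python as follows. The function name `first_candidate_is_correct` and the numeric values used for α and β in the usage lines are hypothetical; only the two conditions themselves come from the description.

```python
def first_candidate_is_correct(reliabilities, alpha, beta):
    """Return (first_word, accepted) according to conditions 1 and 2."""
    ranked = sorted(reliabilities.items(), key=lambda kv: kv[1], reverse=True)
    first_word, first = ranked[0]
    # With a single candidate there is no runner-up, so condition 2 holds.
    second = ranked[1][1] if len(ranked) > 1 else float("-inf")
    condition_1 = first >= alpha              # reliability is high enough
    condition_2 = (first - second) >= beta    # margin over second is large enough
    return first_word, condition_1 and condition_2


# Assumed thresholds for illustration: alpha = 0.7, beta = 0.2.
print(first_candidate_is_correct({"aka": 0.9, "ao": 0.3}, 0.7, 0.2))
print(first_candidate_is_correct({"aka": 0.9, "ao": 0.85}, 0.7, 0.2))
```

The second call illustrates a case in which the first word candidate is reliable in absolute terms but too close to the second candidate, so that re-utterance would be requested.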
[0035]
When it determines that the first word candidate is correct, the reliability analysis unit 81 supplies the switches SW1 and SW2 with control signals that connect the switch SW1 shown in FIG. 1 and the switch SW2 shown in FIG. 2 to the terminal T1. On the other hand, when it determines that the first word candidate is not correct, the reliability analysis unit 81 supplies the switches SW1 and SW2 with control signals that connect them to the terminal T2.
[0036]
When the reliability analysis unit 81 determines that the first word candidate is correct, the first candidate information extraction unit 84 receives the reliability information 20 from the recognition reliability calculation unit 7 via the switch SW2. It then supplies the synthesized speech information generation unit 85 with information indicating that the first word candidate is correct, information indicating which word is the first word candidate determined to be correct, pronunciation information corresponding to the first word candidate, and the like. The first candidate information extraction unit 84 also outputs the information indicating the first word candidate to the outside as the recognition result.
[0037]
When the first word candidate is correct, the synthesized speech information generation unit 85 generates, based on the information from the first candidate information extraction unit 84, synthesized speech information for notifying the user of the recognition result, and outputs it to the synthesized speech generation unit 9.
[0038]
The synthesized speech generation unit 9 shown in FIG. 1 generates, based on the synthesized speech information input from the synthesized speech information generation unit 85, synthesized speech including the word determined to be correct, and notifies the user of the recognition result by outputting it from the speaker 11. Notifying the user of the recognition result means that, for example, when the word candidate determined to be correct is “aka”, synthesized speech such as “aka” is output. The user can thereby confirm the recognition result. In this embodiment, the user is notified of the recognition result by audio output from the speaker 11, but instead of or in addition to this, the apparatus can be configured to notify the user of the recognition result visually, for example on a display.
[0039]
On the other hand, when the reliability analysis unit 81 determines that the first word candidate is not correct, the speech recognition apparatus 10 requests the user to speak again. In that case, the switch SW2 is connected to the terminal T2, and the reliability information 20 is supplied to the candidate selection unit 82. The switch SW1 is also connected to the terminal T2, so that the standby word selection unit 83 is connected to the word model generation unit 3. The candidate selection unit 82 narrows all the word candidates for which the reliability was calculated down to a few candidates with high reliability (hereinafter referred to as “correct word candidates”). For example, a word candidate whose reliability differs from that of the first word candidate by a predetermined threshold γ or less is taken as a correct word candidate. The identification information of the correct word candidates thus determined is then supplied to the standby word selection unit 83.
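The narrowing by the threshold γ may be illustrated, for example, by the following Python sketch. The function name `select_correct_candidates`, the value of γ, and the word “kiiro” in the usage lines are hypothetical and serve only to show the rule.

```python
def select_correct_candidates(reliabilities, gamma):
    """Keep candidates whose reliability is within gamma of the best one."""
    best = max(reliabilities.values())
    kept = [w for w, r in reliabilities.items() if best - r <= gamma]
    # Return the kept candidates ordered from most to least reliable.
    return sorted(kept, key=lambda w: reliabilities[w], reverse=True)


# Assumed reliabilities and threshold, for illustration only.
candidates = select_correct_candidates(
    {"aka": 0.60, "ao": 0.55, "kiiro": 0.20}, gamma=0.1
)
print(candidates)
```

Note that the first word candidate itself always satisfies the rule, since its difference from the best reliability is zero.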
[0040]
The standby word selection unit 83 determines the standby word group for the user's re-utterance (that is, the combination of words used as word candidates in the recognition process for the user's re-utterance). The most typical method is to set the correct word candidates selected by the candidate selection unit 82 as the standby words. In this way, the candidates that had high recognition reliability in the previous speech recognition process are set as standby words. In this case, however, when the user's previous utterance and re-utterance are exactly the same (for example, when the user repeatedly utters “aka”), it is possible that, as with the previous utterance, the recognition result cannot be determined to be correct. Therefore, in the present invention, the words used as standby words for the re-utterance are different words that are synonyms of the correct word candidates and that can be easily distinguished by the recognition process, thereby increasing the recognition rate for the re-utterance. That is, based on the correct word candidates supplied from the candidate selection unit 82, the standby word selection unit 83 sets a combination of words that are synonyms of those candidates and are easy to distinguish as the standby words for the re-utterance. Here, one suitable example of “a combination of easily distinguished words” is a combination of words that are synonyms of the correct word candidates and that have little phoneme overlap (condition A) and a large total number of phonemes (condition B). This is because, when words are compared acoustically for speech recognition, in general the less their phonemes overlap and the more phonemes they contain, the easier they are to tell apart.
[0041]
This will be described in more detail. In the dictionary 2, synonyms having the same meaning but a different pronunciation are prepared for each word. Assume that the candidate selection unit 82 has selected two correct word candidates, “aka” (red) and “ao” (blue). Assume also that “reddo (reqdo)” is stored in the dictionary 2 as a synonym of “aka” and that “buruu” is stored in the dictionary 2 as a synonym of “ao”. In this case, the phoneme “a” overlaps between “aka” and “ao”, and the phoneme “o” overlaps between “reddo” and “ao”, so the combinations with little phoneme overlap are “aka” and “buruu”, or “reddo” and “buruu”. Further, considering condition B, the combination of “reddo” and “buruu” has the larger total number of phonemes among these combinations, so “reddo” and “buruu” are set as the standby words. As another example, when “mizuiro” is additionally stored in the dictionary 2 as a synonym of “ao”, the combination of “aka” and “mizuiro”, which has the largest total number of phonemes among the combinations with the fewest identical phonemes, is set as the standby words. Thus, in the present invention, the most easily distinguished combination among the correct word candidates and their synonyms is set as the standby words for the next re-utterance. The recognition accuracy of the recognition process for the re-utterance can thereby be improved.
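Purely as an illustrative sketch, the selection by condition A and condition B may be expressed in Python as follows, using the “mizuiro” example. The phoneme transcriptions in `PHONEMES` are assumed for illustration, and the function names are hypothetical. Condition A is interpreted here as the strict minimum number of distinct shared phonemes, which is only one possible reading of “little overlap”; a looser reading could select a different combination.

```python
from itertools import product

# Assumed phoneme transcriptions ("q" marks the geminate in "reddo").
PHONEMES = {
    "aka": ["a", "k", "a"],
    "reddo": ["r", "e", "q", "d", "o"],
    "ao": ["a", "o"],
    "buruu": ["b", "u", "r", "u", "u"],
    "mizuiro": ["m", "i", "z", "u", "i", "r", "o"],
}


def overlap(combo):
    """Count distinct phonemes shared by any two words in the combination."""
    shared = set()
    for i, first in enumerate(combo):
        for second in combo[i + 1:]:
            shared |= set(PHONEMES[first]) & set(PHONEMES[second])
    return len(shared)


def total_phonemes(combo):
    """Total number of phonemes over all words in the combination."""
    return sum(len(PHONEMES[word]) for word in combo)


def select_standby_words(groups):
    """Pick one word per group: least overlap first, then most phonemes."""
    combos = list(product(*groups))                 # one word from each group
    fewest = min(overlap(c) for c in combos)        # condition A (strict minimum)
    least_overlapping = [c for c in combos if overlap(c) == fewest]
    return max(least_overlapping, key=total_phonemes)   # condition B


# Each group holds a correct word candidate together with its synonyms.
groups = [["aka", "reddo"], ["ao", "buruu", "mizuiro"]]
print(select_standby_words(groups))
```

Under these assumptions the sketch selects “aka” and “mizuiro”, in agreement with the second example in the paragraph above.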
[0042]
Further, the present invention is characterized in that the talkback prompting re-utterance includes a word such as “other”, “something else”, or “different”, indicating that the correct word is none of the words included in the talkback. As a result, when the correct word is not among the words presented to the user in the talkback prompting re-utterance, the speech recognition apparatus 10 can learn of this. For example, suppose that, as a result of the first utterance, the correct word candidates are narrowed down to “aka” and “ao”, and that “aka” and “mizuiro” are finally determined as the standby words as described above. In that case, in the talkback prompting re-utterance, the speech recognition apparatus 10 asks, for example, “Is it ‘aka’, ‘mizuiro’, or ‘other’?” If the user then utters “other”, it is understood that the word the user uttered is neither “aka” nor “mizuiro”. The speech recognition apparatus 10 then recognizes that the previous narrowing was an error and can search for word candidates other than “aka” and “mizuiro”.
[0043]
In this way, the standby word selection unit 83 supplies information including the number, pronunciation, meaning (reading of the original word), and the like of the standby words for the re-utterance, as standby word information 83a, to the word model generation unit 3 via the switch SW1 and also to the synthesized speech information generation unit 85. The word model generation unit 3 then generates word models of the standby words included in the standby word information 83a, and these are used for the matching process by the recognition processing unit 5 in the recognition process for the re-utterance. In the above example, the word models for “aka”, “mizuiro”, and “other” are subjected to the matching process in the recognition process for the re-uttered word. In addition, based on the standby word information 83a, the synthesized speech information generation unit 85 generates synthesized speech information such as “Is it ‘aka’, ‘mizuiro’, or ‘other’?” as the talkback prompting re-utterance. The synthesized speech information is output as synthesized speech from the speaker 11 by the synthesized speech generation unit 9.
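As a simple, non-limiting illustration, appending the “other” word to the selected standby words and forming the talkback text may be sketched in Python as follows. The function name `build_standby_info` and the exact wording of the prompt are assumptions made here.

```python
def build_standby_info(selected_words, other_word="other"):
    """Return the standby word list (with the "other" word) and a prompt."""
    standby = list(selected_words) + [other_word]
    # The prompt wording is an illustrative assumption.
    prompt = ", ".join(selected_words) + ", or " + other_word + "?"
    return standby, prompt


standby, prompt = build_standby_info(["aka", "mizuiro"])
print(standby)
print(prompt)
```

The same standby list would be passed both to word model generation and to talkback generation, so that the words the user hears are exactly the words being matched.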
[0044]
In this way, the speech recognition apparatus 10 includes in the talkback a combination of easily distinguished words chosen on the basis of the correct word candidates, together with a word such as “other” indicating that the correct word is something else, and prompts the user to speak again. The recognition accuracy for the re-utterance can thereby be improved.
[0045]
If the first word candidate cannot be determined to be correct in the recognition process after the re-utterance either, the same re-utterance process can be repeated. In the re-utterance process, the thresholds used by the reliability analysis unit 81 to determine that the first word candidate is correct may be gradually relaxed so that a correct answer is more readily determined.
[0046]
Further, when the word candidate corresponding to “other” is determined to be correct for a re-utterance (including a second or later re-utterance), that is, when the user has indicated that the correct word is not among the current standby words presented in the talkback, the standby word selection unit 83 returns the standby words to their state at the time of the previous utterance. The reason is as follows. For example, when it is determined in the recognition process for the m-th utterance that the first word candidate is not correct, the candidate selection unit 82 narrows the standby words for the (m + 1)-th utterance down to only the top candidates. However, the fact that the user uttered “other” in the (m + 1)-th utterance means that the correct word is not among the standby words set at that time, that is, that the narrowing was incorrect (a “standby error”). Therefore, the standby words are returned to the state before the narrowing (the state at the m-th utterance) to widen the range of word candidates, and re-utterance is prompted again if necessary.
[0047]
In this case, the reliability analysis unit 81 connects the switches SW1 and SW2 to the terminal T2. When determining the standby word group for the next utterance, the standby word selection unit 83 stores the previous standby word group. That is, the standby word selection unit 83 stores all past standby word groups, and in the case of a standby error, the previous standby word group is used for the next re-utterance and recognition process.
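The storing and restoring of past standby word groups may be illustrated, for example, by the following Python sketch, in which a list is used as a stack. The class name `StandbyWordHistory`, its method names, and the word “kiiro” are hypothetical.

```python
class StandbyWordHistory:
    """Keeps every past standby word group so a narrowing can be undone."""

    def __init__(self, initial_group):
        self._groups = [list(initial_group)]   # group for the first utterance

    @property
    def current(self):
        return self._groups[-1]

    def narrow(self, new_group):
        """Store the narrowed group; earlier groups remain on the stack."""
        self._groups.append(list(new_group))

    def roll_back(self):
        """On a standby error, return to the previous group (None if none)."""
        if len(self._groups) == 1:
            return None
        self._groups.pop()
        return self.current


history = StandbyWordHistory(["aka", "ao", "kiiro"])
history.narrow(["aka", "mizuiro", "other"])
print(history.roll_back())
```

Because every past group is retained, repeated standby errors can be undone one narrowing at a time.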
[0048]
Thus, re-utterance is repeated as necessary, and when the reliability analysis unit 81 finally determines that a certain first word candidate is correct, that first word candidate is sent from the speech recognition apparatus 10 to an external device as the recognition result. The external device is a device that uses the recognition result from the speech recognition apparatus 10 as a command or the like. For example, when the speech recognition apparatus 10 is used as the input unit of a car navigation device as described above, the recognition result is supplied to the controller of the car navigation device, and processing corresponding to its content (command) is executed.
[0049]
[Voice recognition processing]
Next, the speech recognition process executed by the speech recognition apparatus 10 will be described with reference to FIG. 3. FIG. 3 is a flowchart of the speech recognition process.
[0050]
First, in step S1, initial settings for recognizing the user's first utterance are made. Specifically, the re-utterance control unit 8 connects the switch SW1 to the terminal T1 and sets all the words in the dictionary 2, which stores the word information to be recognized, as the standby words for the first utterance. The utterance counter c is then set to 1. The utterance counter indicates which standby word group applies to the utterance being recognized. That is, the utterance counter c = 1 corresponds to the standby word group for the first utterance (in the above example, all the words stored in the dictionary 2), and the utterance counter c = 2 corresponds to the standby word group narrowed down once after the first utterance.
[0051]
Next, in step S2, based on the standby word group, the word model generation unit 3 generates word models using the subword acoustic models stored in the subword acoustic model storage unit 1. All the word models corresponding to the standby word group for the first utterance are thereby prepared.
[0052]
Next, speech recognition processing is performed in step S3. That is, the user speaks, and the corresponding speech signal is input to the acoustic analysis unit 4 via the microphone 12. The acoustic analysis unit 4 performs acoustic analysis of the speech signal and obtains a feature vector sequence. The recognition processing unit 5 then matches the feature vectors of the uttered speech signal against each word model prepared in step S2 and calculates the acoustic likelihood between the two for each word model.
[0053]
Next, in step S4, the recognition reliability calculation unit 7 weights the acoustic likelihood of each word candidate calculated in step S3 using the additional information collected by the additional information collection unit 6, and calculates the recognition reliability of each word candidate. The additional information is, for example, the user's past utterance history or the position information of the vehicle on which the navigation device is mounted.
[0054]
Next, in step S5, the reliability analysis unit 81 analyzes whether the first word candidate with the highest recognition reliability is a correct answer based on the recognition reliability of each word candidate. As described above, this analysis can be performed using the reliability of the first word candidate and the reliability of the second word candidate, for example.
[0055]
Next, in step S6, the reliability analysis unit 81 determines whether or not the first word candidate is correct based on the result of the analysis in step S5. If it is determined that the first word candidate is correct, the process proceeds to step S7. On the other hand, if it is determined that the first word candidate is not correct, the process proceeds to step S14.
[0056]
When it is determined in step S6 that the first word candidate is correct, in step S7 the reliability analysis unit 81 determines whether the first word candidate is the word corresponding to “other”. As described above, the word candidate corresponding to “other” is used to correct the standby word group when, as a result of narrowing down the standby words, the correct word is no longer included among them. If the first word candidate corresponds to “other”, the process proceeds to step S10. If it does not, the process proceeds to step S8.
[0057]
When the process proceeds to step S8, the first word candidate is correct and is not the “other” word candidate. That is, the first word candidate may be adopted as the recognition result. Therefore, the first candidate information extraction unit 84 extracts the first word candidate from the reliability information 20, supplies the synthesized speech information generation unit 85 with information indicating that the first word candidate is correct, information indicating which word is the first word candidate determined to be correct, and pronunciation information corresponding to the first word candidate, and outputs the information indicating the first word candidate determined to be correct to the outside as the recognition result.
[0058]
In step S9, the synthesized speech information generation unit 85 generates synthesized speech information based on the pronunciation information corresponding to the first word candidate and supplies it to the synthesized speech generation unit 9. The synthesized speech generation unit 9 outputs the reading of the first word candidate as synthesized speech. For example, when the first word candidate is “aka”, the speaker 11 notifies the user of the recognition result with “Aka?”.
[0059]
On the other hand, if it is determined in step S6 that the first word candidate is not correct, the candidate selection unit 82 selects the correct word candidate in step S14. Specifically, the candidate selection unit 82 selects a correct word candidate using the recognition reliability of the first word candidate. By this processing, word candidates used in the recognition processing at the next utterance are narrowed down.
[0060]
Next, in step S15, based on the correct word candidates selected by the candidate selection unit 82, the standby word selection unit 83 generates a combination of words that have different pronunciations and are easy to distinguish. Specifically, among the combinations of synonyms corresponding to the correct word candidates, the standby word selection unit 83 determines as standby words the words forming the combination with the fewest identical phonemes and a large total number of phonemes. It then sets a standby word group including those standby words. In addition to the above words, this standby word group includes the word corresponding to “other”. The standby word selection unit 83 then acquires the word information corresponding to those standby words from the dictionary 2 and sends it to the word model generation unit 3, which generates the corresponding word models. The standby word group is thus updated.
[0061]
The standby word selection unit 83 stores the standby word group before update. This is because when the user speaks “other” in the next utterance, it is necessary to use the standby word group from the previous one again. The standby word selection unit 83 also supplies the selected standby word group to the synthesized speech information generation unit 85.
[0062]
In step S16, the synthesized speech information generation unit 85 and the synthesized speech generation unit 9 output the standby words determined in step S15 as synthesized speech, as a talkback prompting re-utterance. For example, if the standby words are determined to be “aka”, “ao”, and “other” in step S15, synthesized speech such as “Aka, ao, or other?” is output.
[0063]
Next, in step S17, the utterance counter c is incremented by one. The incremented utterance counter c thus indicates that the standby word group has moved to the state after one more update relative to the previous standby word group. The process then returns to step S2, where word models of the words included in the standby word group determined in step S15 are generated, and the recognition process for the re-utterance is performed.
[0064]
If it is determined in step S7 that the first word candidate corresponds to “other”, this indicates that the standby word group at that time does not include the correct word, that is, that a standby error has occurred. The process therefore proceeds to step S10, where it is determined whether the value of the utterance counter c is 1. When the utterance counter c = 1, the current recognition process was performed for the first utterance, and the standby words at that time are all the word candidates included in the dictionary 2. The word spoken by the user is therefore not included in the dictionary 2 in the first place. In this case, the recognition process is terminated with no candidate.
[0065]
On the other hand, if the utterance counter c is not 1, the process proceeds to step S11. In step S11, the standby word selection unit 83 subtracts 1 from the value of the utterance counter c and sets the previous standby word group, which was stored in advance. When the user utters “other”, the correct word is not included in the current standby word group, so the recognition process is performed again after returning to the standby word group used in the previous recognition process. Note that when the standby words are updated in step S14, the standby word selection unit 83 stores the standby word group in its state before the update, so it is sufficient to read out and set that group. At that time, the standby word selection unit 83 includes the word corresponding to “other” (also referred to as the “standby error word”) in the standby word group.
[0066]
Next, in step S12, the standby word selection unit 83 supplies the standby word group thus determined to the word model generation unit 3 and the synthesized speech information generation unit 85. The word model generation unit 3 generates word models corresponding to these standby words so that they can be used in the next recognition process. In addition, the synthesized speech information generation unit 85 and the synthesized speech generation unit 9 use the supplied standby word information to output synthesized speech of the corresponding words.
[0067]
As described above, the recognition process is performed while the standby word group is updated according to the content of the user's utterances, until either the first word candidate is determined to be correct and is output as the recognition result (step S9) or the recognition process is terminated with no candidate (step S10: Yes). If the reliability of the first word candidate is not high enough for it to be determined correct, the standby words are narrowed down based on the reliability and the like, and the standby word group is updated by setting, as the standby words for the next utterance, a combination of synonyms of the narrowed-down words that is acoustically easy to distinguish. The recognition rate for re-utterances can therefore be improved, and the user's utterance can be recognized quickly and efficiently.
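As a rough, non-limiting sketch of the control flow of FIG. 3 as described above, the following Python function drives steps S1 to S17 using caller-supplied stand-ins. The function name `run_dialogue`, the callbacks `recognize` and `narrow`, the safeguard `max_turns`, and the word “kiiro” are all hypothetical; the acoustic processing of steps S2 to S6 and the selection of steps S14 and S15 are replaced by those callbacks.

```python
def run_dialogue(all_words, recognize, narrow, other_word="other", max_turns=10):
    """Simulate steps S1-S17 with stand-in callbacks.

    recognize(standby) -> (first_word, accepted) stands in for steps S2-S6.
    narrow(standby)    -> list of new standby words stands in for S14-S15.
    The utterance counter c corresponds to len(history).
    """
    history = [list(all_words)]                 # S1: c = 1, all dictionary words
    for _ in range(max_turns):                  # max_turns is an added safeguard
        first_word, accepted = recognize(history[-1])
        if not accepted:                        # S6: No -> S14 to S17
            narrowed = [w for w in narrow(history[-1]) if w != other_word]
            history.append(narrowed + [other_word])   # c += 1
            continue
        if first_word != other_word:            # S7: No -> S8, S9
            return first_word
        if len(history) == 1:                   # S10: c == 1 -> no candidate
            return None
        history.pop()                           # S11: c -= 1, previous group
        if other_word not in history[-1]:
            history[-1].append(other_word)      # include the standby error word
    return None


# Scripted recognizer: rejected, then "other" accepted, then "kiiro" accepted.
script = iter([("aka", False), ("other", True), ("kiiro", True)])
result = run_dialogue(
    ["aka", "ao", "kiiro"],
    recognize=lambda standby: next(script),
    narrow=lambda standby: ["aka", "mizuiro"],
)
print(result)
```

In the scripted run, the first narrowing turns out to be a standby error, the previous standby word group is restored, and the third utterance is then accepted.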
[0068]
[Modification]
In the re-utterance control unit 8 illustrated in FIG. 2, the reliability analysis unit 81 determines whether the first word candidate is correct using the first word candidate and the second word candidate. Instead, the reliability analysis unit 81 can be configured to determine whether the first word candidate is correct using the top n word candidates with the highest recognition reliability. In that case, the n word candidates with the highest recognition reliability are determined in the course of determining whether the first word candidate is correct. Once they have been determined, they can therefore be used as the narrowed-down correct word candidates. In this way, the reliability analysis unit 81 can perform the processing of the candidate selection unit 82, and the candidate selection unit 82 can be omitted. In this case, the information on the correct word candidates is input from the reliability analysis unit 81 to the standby word selection unit 83.
[0069]
In the speech recognition process shown in FIG. 3, when it is determined in step S7 that the first word candidate corresponds to “other” and it is determined in step S10 that the utterance counter c is not 1, the utterance counter is decremented by 1 and the previous standby word group is set for use with the next utterance. However, if the determination in step S7 is “Yes”, the correct word is not among the words in the standby word group just used, so there is no point in including those words in the next standby word group. For example, if the user utters “other” in response to the standby word group “aka”, “ao”, and “other”, the word the user originally uttered is neither “aka” nor “ao”. Therefore, the standby word selection unit 83 can set the standby word group by excluding “aka”, “ao”, and their synonyms from the previous standby word group acquired in step S11. Removing words that are clearly not correct from the standby word group in this way makes the recognition process more efficient.
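This modification may be illustrated, for example, by the following Python sketch, which removes the rejected words and their synonyms from the restored standby word group. The function name `exclude_rejected`, the synonym table, and the word “kiiro” are hypothetical.

```python
def exclude_rejected(previous_group, rejected_words, synonyms):
    """Drop rejected words and their synonyms from the restored group."""
    excluded = set(rejected_words)
    for word in rejected_words:
        excluded.update(synonyms.get(word, []))   # also drop each synonym
    return [w for w in previous_group if w not in excluded]


# Assumed synonym table and previous standby word group, for illustration.
synonyms = {"aka": ["reddo"], "ao": ["buruu"]}
previous_group = ["aka", "reddo", "ao", "buruu", "kiiro"]
print(exclude_rejected(previous_group, ["aka", "ao"], synonyms))
```

The order of the remaining words is preserved, so the restored group differs from the stored one only by the removed entries.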
[0070]
The speech recognition apparatus described above can also be realized by implementing each element of the speech recognition apparatus 10 as a computer program and executing it on an apparatus equipped with a computer. For example, in a car navigation device or an AV device equipped with a computer, the voice input function can be realized by using such a computer program.
[0071]
In the above embodiment, among the correct word candidates and their synonyms, the most easily distinguishable combination is determined and set as the standby words for the next recurrent utterance.
[0072]
In addition to the correct word candidates and their synonyms, the standby error word, which is included in the talkback prompting the recurrent utterance and indicates that the utterance is something other than the correct word candidates, may be added when determining the most easily distinguishable combination.
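One way to read "most easily distinguishable combination" is the two-step rule in the claims: prefer combinations with little phoneme overlap between groups, then prefer a larger total number of phonemes. The sketch below is an assumption-laden illustration of that rule, not the method of the embodiment. It picks the minimum-overlap combination and breaks ties by total phoneme count; the overlap measure (shared phonemes counted as a multiset intersection), the function names, and the romanized phoneme transcriptions are all hypothetical.

```python
from collections import Counter
from itertools import combinations, product
from typing import Dict, List, Tuple

Phonemes = Tuple[str, ...]


def overlap(a: Phonemes, b: Phonemes) -> int:
    """Number of phonemes shared by two words (assumed overlap measure)."""
    return sum((Counter(a) & Counter(b)).values())


def select_standby_words(groups: List[Dict[str, Phonemes]]) -> List[str]:
    """Pick one word per group: least pairwise overlap, then most phonemes.

    Each group holds one word candidate and its synonyms, mapped to
    their phoneme sequences.
    """
    best_key = None
    best_combo = ()
    for combo in product(*(g.items() for g in groups)):
        total_overlap = sum(
            overlap(p, q) for (_, p), (_, q) in combinations(combo, 2)
        )
        total_phonemes = sum(len(p) for _, p in combo)
        key = (total_overlap, -total_phonemes)  # smaller key is better
        if best_key is None or key < best_key:
            best_key, best_combo = key, combo
    return [word for word, _ in best_combo]


# Hypothetical transcriptions of Japanese words for "red" and "blue".
red_group = {"aka": ("a", "k", "a"), "reddo": ("r", "e", "d", "d", "o")}
blue_group = {"ao": ("a", "o"), "buruu": ("b", "u", "r", "u", "u")}
print(select_standby_words([red_group, blue_group]))  # ['aka', 'buruu']
```

To take the standby error word into account, it can be passed as an additional group with a single member, so that the chosen words are also easy to tell apart from it.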
[0073]
[Effect of the invention]
As described above, according to the present invention, when there is a high possibility that the recognition result is erroneous, the possibility of misrecognition can be reduced by prompting the user to speak again. Also, when the recognition result for an utterance cannot be determined to be correct, those of the standby words used at that time and their synonyms that are acoustically easy to distinguish are set as the standby words for the next utterance. The same recognition result is therefore not repeated, and the recognition rate for the next utterance improves. In addition, by including a word such as “other”, which indicates a word outside the current standby words, in the confirmation talkback that prompts the recurrent utterance, words that are not correct can be eliminated, so that the correct answer can be reached efficiently and quickly.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a schematic configuration of a speech recognition apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the internal configuration of the recurrent speech control unit shown in FIG. 1.
FIG. 3 is a flowchart showing the speech recognition processing performed by the speech recognition apparatus shown in FIG. 1.
[Explanation of symbols]
1 Subword acoustic model storage unit
2 Dictionary
3 Word model generation unit
4 Acoustic analysis unit
5 Recognition processing unit
6 Additional information collection unit
7 Recognition reliability calculation unit
8 Recurrent speech control unit
9 Synthesized speech generation unit
10 Speech recognition apparatus
11 Speaker
12 Microphone
81 Reliability analysis unit
82 Candidate selection unit
83 Standby word selection unit
84 First candidate information extraction unit
85 Synthesized speech information generation unit

Claims

A speech recognition apparatus comprising:
voice input means for receiving a voice input of a user;
recognition processing means for performing recognition processing that determines a plurality of word candidates corresponding to the voice input by matching processing against each standby word in a preset standby word group;
determination means for determining whether or not a correct answer is included in the plurality of word candidates; and
setting means comprising means for analyzing, for each of the plurality of word candidates and their synonym candidates, the phonemes constituting that candidate,
wherein, when the determination means determines that a correct answer is not included in the plurality of word candidates, the setting means, taking as one group at least the synonym candidates corresponding to a word candidate so that a group is set for each word candidate, (1) extracts, from the plurality of word candidates and their synonym candidates, combinations having little phoneme overlap between the groups, (2) determines one candidate in each group so as to form, among the extracted combinations, a combination having a relatively large total number of phonemes, and sets the determined candidates as the standby word group to be used in the next recognition processing.

The speech recognition apparatus according to claim 1, wherein, when the determination means determines that a correct answer is not included in the plurality of word candidates, the setting means takes each word candidate and its synonym candidates as one group and determines, from the plurality of word candidates and their synonym candidates, a candidate for each of the groups.

The speech recognition apparatus according to claim 1 or 2, wherein the standby word group includes a standby error word meaning that the voice input of the user corresponds to a word candidate other than the word candidates included in the standby word group.

The speech recognition apparatus according to claim 3, wherein the setting means includes storage means for storing standby word groups used in the past, and
when the determination means determines that the standby error word is correct, the setting means sets the previous standby word group stored in the storage means as the standby word group to be used in the next recognition processing.

The speech recognition apparatus according to claim 3, wherein the standby error word is “other” or a synonym thereof.

The speech recognition apparatus according to claim 3, wherein, when the voice input of the user is the standby error word, the word candidates in the standby word group at that time, other than the word candidate corresponding to the standby error word, are excluded from the word candidates to be included in the next standby word group.

The speech recognition apparatus according to claim 1, further comprising notification means that, when the determination means determines that a correct answer is not included in the plurality of word candidates, notifies the user of the standby words belonging to the standby word group set by the setting means by at least one of synthesized speech output and character display.

The speech recognition apparatus according to claim 1, wherein the determination means relaxes the criterion for determining that a word candidate is a correct answer each time the recognition processing is repeated.

The speech recognition apparatus according to claim 8, wherein the determination means determines that a word candidate is correct when the reliability of the word candidate is equal to or higher than a predetermined threshold, and lowers the threshold each time the recognition processing is repeated.

A speech recognition program executed by a computer, the program causing the computer to function as:
voice input means for receiving a voice input of a user;
recognition processing means for performing recognition processing that determines a plurality of word candidates corresponding to the voice input by matching processing against each standby word in a preset standby word group;
determination means for determining whether or not a correct answer is included in the plurality of word candidates; and
setting means comprising means for analyzing, for each of the plurality of word candidates and their synonym candidates, the phonemes constituting that candidate,
wherein, when the determination means determines that a correct answer is not included in the plurality of word candidates, the program causes the setting means to function so as to, taking as one group at least the synonym candidates corresponding to a word candidate, (1) extract, from the plurality of word candidates and their synonym candidates, combinations having little phoneme overlap between the groups, (2) determine one candidate in each group so as to form, among the extracted combinations, a combination having a relatively large total number of phonemes, and set the determined candidates as the standby word group to be used in the next recognition processing.