JP4282354B2

JP4282354B2 - Voice recognition device

Info

Publication number: JP4282354B2
Application number: JP2003087565A
Authority: JP
Inventors: 伸洋田崎; 武志橋本; 正樹芦澤
Original assignee: Clarion Co Ltd
Current assignee: Faurecia Clarion Electronics Co Ltd
Priority date: 2003-03-27
Filing date: 2003-03-27
Publication date: 2009-06-17
Anticipated expiration: 2023-03-27
Also published as: JP2004294803A

Description

【０００１】
【発明の属する技術分野】
本発明は、ユーザからの入力音声に対して、辞書を用いて単語認識を行う音声認識装置に関し、特に、ユーザからの入力音声に対して単語認識および音素認識を行い、これらの認識結果に基づいた音声認識を行うことで認識率を向上した音声認識装置に関する。
【０００２】
【従来の技術】
人間の話した音声を言葉として認識する音声認識装置が各種方面で実用化されている。この音声認識装置は、例えば、工場における各種装置に対応する指示をはなれた場所から音声で指示する入力装置として実用化されており、また、自動車のナビゲーション装置において、目的地や指示情報等を音声入力する場合の音声入力装置としても実用化されている。このような音声認識装置では、一般に入力された音声を特定するために、予め認識対象となる音声の周波数分布を分析することで、例えば、スペクトルや基本周波数の時系列情報等を特徴として抽出し、そのパターンを各単語に対応させて格納する音声認識用単語辞書を備えている。
【０００３】
認識するべき音声が入力されると、入力された音声の周波数パターンと辞書に格納された各単語のパターンをパターンマッチングにより比較照合し、各単語に対する類似度を算出する。つぎに算出された類似度が最も高い単語（パターンが最も近い単語）を、入力された音声であると認識し、その単語を出力するようにしている。つまり、入力された単語の周波数分布のパターンがどの単語パターンに最もよく似ているかを調べることによって、入力音声を判定している。
【０００４】
このような音声認識において、さらに、出力された認識結果に対する話者からの応答に基づいて一致率の履歴を更新し、より一層認識率を高めた音声認識装置が提案されている（特許文献１参照）。
【０００５】
【特許文献１】
特開平８−１６０９８６号公報
【０００６】
【発明が解決しようとする課題】
このような単語認識に基づく音声認識は、特に、カーナビゲーション装置等において音声に基づいたコマンド入力時に利用されている。このような音声認識においては、特定の単語が認識されにくい状況や、誤認識されやすい状況等が生じるが、これらの状況は、類似した単語が辞書に登録されている場合に特に生じやすい。従って、このような状況は、辞書に登録する単語が類似しないように選定することによりある程度回避することができるが、認識結果は話者により異なることから、多くの話者についてテストを行い、単語の登録と削除を繰り返す等、時間をかけて辞書の最適化を行う必要があり、実用的な使用に適した、高認識率を有する音声認識装置が望まれている。
【０００７】
そこで、本発明の目的は、単語認識用辞書の最適化を行うことにより、単語認識の性能を向上させ、音声認識の認識率を高めた音声認識装置を提供することにある。
【０００８】
【課題を解決するための手段】
以上の目的を達成するために、請求項１記載の発明は、ユーザからの入力音声に対して、辞書を用いて単語認識を行う音声認識装置であって、入力音声に対して単語認識と音素認識とを行い、これにより得られた単語認識結果と音素認識結果とが不一致であり、かつ、当該単語認識結果が正解である場合であって、かつ、当該音素認識結果の、当該正解となる単語認識結果に対応した単語に対する類似度と、他の単語に対する類似度との関係が所定の条件を満たす場合に、当該音素認識結果を、当該正解となる単語認識結果に対応する単語に対する同義語として辞書へ登録することを特徴とする。
【０００９】
また、請求項２記載の発明は、前記所定の条件とは、前記音素認識結果の、当該正解となる単語認識結果に対応した単語に対する類似度と、前記他の単語に対する類似度との差が所定値以上であることを特徴とする。
【００１０】
また、請求項３記載の発明は、前記所定の条件とは、前記音素認識結果の、当該正解となる単語認識結果に対応した単語に対する類似度と、前記他の単語に対する類似度との比が所定値以上であることを特徴とする。
【００１１】
また、請求項４記載の発明は、前記同義語と認識された前記音素認識結果が、同じ単語認識結果の同義語として所定回数または所定確率で認識された場合に、当該同義語を前記辞書に登録することを特徴とする。
【００１２】
また、請求項５記載の発明は、当該辞書に複数個の同義語が存在する場合に、単語認識時における、前記辞書に登録された同義語毎の検索回数と正解回数とを計数し、当該同義語が正解となる確率が所定値を下回ったときに、当該同義語を前記辞書より削除することを特徴とする。
【００１３】
また、請求項６記載の発明は、前記音声入力に基づいた単語認識の後に、前記ユーザによる操作が、あらかじめ定められた正解後の操作の候補と一致した場合に、当該単語認識結果を正解と判定することを特徴とする。
【００１４】
また、請求項７記載の発明は、ユーザからの入力音声を入力する音声入力部と、前記入力音声に対して単語認識を行う単語認識部と、前記入力音声に対して音素認識とを行う音素認識部と、前記単語認識部により得られた単語認識結果と、前記音素認識部により得られた音素認識結果との不一致であり、かつ、当該単語認識結果が正解である場合に、当該音素認識結果を、当該正解となる単語認識結果に対応する単語に対する同義語として辞書へ登録するかどうかを判定する辞書登録部とを備え、前記辞書登録部は、前記音素認識結果の、当該正解となる単語認識結果に対応した単語に対する類似度と、前記他の単語に対する類似度との差が所定値以上または、前記音素認識結果の、当該正解となる単語認識結果に対応した単語に対する類似度と、前記他の単語に対する類似度との比が所定値以上である場合に、当該音素認識結果を同義語として認識して仮登録し、さらに、同じ単語認識結果に対して前記同義語と認識された前記音素認識結果に対する仮登録回数を計数し、当該仮登録回数が所定値以上であった場合に、当該音素認識結果を辞書に登録し、次回の音声認識処理に利用することを特徴とする。
【００１５】
【発明の実施の形態】
以下、本発明の実施態様による音声認識装置について説明する。
【００１６】
図１は本実施態様の音声認識装置１の構成を示す機能ブロック図である。
【００１７】
音声認識装置１は、制御部２，音声入力部３，単語認識部４、音素認識部５及び辞書管理部６から構成されている。制御部２は、例えばナビゲーション装置等の外部装置と接続されて、外部装置からの音声認識コマンド情報等を入力し、さらに、音声認識装置１における最終的な音声認識結果を外部装置に送信する。また、制御部２は音声認識装置全体の制御をも行っている。音声入力部３は、例えばマイク等から構成されており、制御部２による制御に基づいてユーザの音声を入力する。単語認識部４は、ユーザからの入力音声を単語を基本単位として認識処理し、入力音声に対する最適な単語を選択するものである。具体的には、辞書管理部６に備えられ、あらかじめ単語（単語モデル）が登録されている単語辞書を用いて、入力音声と単語辞書における候補（単語）との類似度を算出し、最も類似度の高い候補を選択することにより、入力音声を候補中の単語として認識する。さらに、音素認識部５は、ユーザからの入力音声を音素に分け、最も近い音素を選択することにより、入力音声を任意の文字列からなる単語として認識するものである。辞書管理部は、単語認識用の単語辞書を管理し、候補となる単語の登録、削除、統計等を行うものである。なお、上述した構成要素に加えて、音声認識結果等をユーザに表示する表示部をさらに設けていても良い。
【００１８】
次に、上述した本実施態様の音声認識装置の動作について図２を参照して説明する。
【００１９】
図２は本実施態様の音声認識装置の音声認識動作を示すフローチャートである。ここでは、例としてカーナビゲーション装置における音声認識装置について説明する。すなわち、図１における外部装置としてカーナビゲーション装置が用いられるが、本発明の音声認識装置はカーナビゲーション装置に限定されるものではなく、音声認識の必要なあらゆる装置に適応可能であることは言うまでもない。
【００２０】
まず、制御部２はカーナビゲーション装置からの指示に従って、音声入力部３へ音声入力の指示を行う。制御部２からの指示に基づいて、音声入力部３はユーザからの入力音声を取得し、単語認識部４及び音素認識部５の各々へ入力音声を出力する（Ｓ１）。
【００２１】
音声認識は、単語認識部４と音素認識部５での認識結果による総合的な判断に基づいて行われる。すなわち、入力音声に基づいて単語認識部４での単語認識結果と、音素認識部５での音素認識結果とが一致したかどうかが判断され、一致した場合は、これをユーザに表示して次の音声認識を行うが、一致していなければ本実施態様の同義語登録処理に移る。従って、本実施態様の同義語登録（辞書登録）処理は、単語認識結果と音素認識結果とが不一致であり、かつ、単語認識結果が正解の場合に行われる。
【００２２】
さて、音声入力部３により入力音声が取得されると（Ｓ１）、単語認識部４および音素認識部５では、各々、音声入力部３からの入力音声に対して単語認識処理および音素認識処理を行う（Ｓ２）。具体的には、単語認識部４では、辞書管理部６に備えられている単語辞書を用いて、入力音声と単語辞書内の単語（単語モデル）とを比較し、これにより最も高い類似度を有する単語を単語認識結果として辞書管理部６へ出力する。なお、認識処理としては、入力音声に対する特徴抽出処理により得られた特徴データと、あらかじめ単語辞書に登録された単語の特徴データとの照合により入力音声の単語認識（照合）が行われている。また、音素認識部５では、入力音声を各音素に分けて、各音素毎に音素認識を行い、得られる単語を音素認識結果として辞書管理部６へ出力する。これらの単語認識処理と音素認識処理とは同時に並行して行われている。
【００２３】
次に、辞書管理部６では、単語認識部４の単語認識結果が正解か否かの判定が行われる（Ｓ３）。以下に単語認識結果の正解判定について説明する。
【００２４】
通常、カーナビゲーションの音声認識において、コマンドは階層化されており、走査には幾つかのステップが必要となる。従って、単語認識結果をユーザに通知した後、ユーザが続けて次の階層のコマンドの発話を行うか、あるいは、ユーザが次のステップの操作を行う等、その後の操作があらかじめ定められた正解後の操作の候補と一致した場合は、単語認識結果を正解と判定する（Ｓ３：Ｙ）。一方、単語認識結果をユーザに通知した後、ユーザからもう一度同じ単語認識を行うか、あるいは、キャンセルの操作を行う等、その後の操作があらかじめ定められた正解後の操作の候補と一致しなかった場合には（Ｓ３：Ｎ）、単語認識結果を不正解と判定する。
【００２５】
ここで、単語認識の判定の結果、単語認識結果が不正解の場合（Ｓ３：Ｎ）は、辞書管理部６は通常の辞書管理処理（Ｓ８）へ移行する。この場合、ユーザに音声認識処理の失敗を報知するエラーメッセージを出力するか、あるいは、再度の音声入力を催促する等して、単語認識処理が不正解である旨伝えてもよい。一方、単語認識結果が正解の場合（Ｓ３：Ｙ）、すなわち、単語を基本単位とする音声認識が正解であった場合は、辞書管理部６は単語認識結果と音素認識結果による単語とが一致するか否かの判定を行う（Ｓ４）。
【００２６】
判定の結果、単語認識結果と音素認識結果とが一致した場合（Ｓ４：Ｙ）、辞書管理部６は辞書管理処理（Ｓ８）へ移行する。一方、単語認識結果と音素認識結果とが一致しなかった場合（Ｓ４：Ｎ）、辞書管理部６は音素結果が単語認識結果の同義語として適当であるかどうかの判定を行う（Ｓ５）。この判定処理は、具体的には、音素認識結果を単語として登録することによる他の単語への影響を調べることで行われる。単語認識の過程において得られた類似度において、音素認識結果に基づいて得られた単語の、ステップＳ３で正解と判定された単語に対する類似度と、他の単語に対する類似度との差もしくは比が所定の値を超えておれば、同義語として登録することによる他の単語への影響が小さいので、同義語として認識する（Ｓ５：Ｙ）。
【００２７】
判定の結果、同義語と認識されなかった場合（Ｓ５：Ｎ）、辞書管理部６は辞書管理処理（Ｓ８）へ移行する。一方、同義語と判定された場合（Ｓ５：Ｙ）、同義語の辞書への登録の判定を行う（Ｓ６）。この登録の判定処理は以下のようにして行われる。
【００２８】
ある同義語について、同じ単語認識結果（単語）の同義語と認識された回数をカウントしておく。回数が所定の値を超え、且つ、選択される確立が所定の値を超えたときに、単語認識結果の同義語として辞書へ登録する（Ｓ７）。
【００２９】
次に、辞書管理部６は辞書管理処理（Ｓ８）を行い、この処理を終了する。辞書管理（Ｓ８）は、単語認識の結果から単語毎の統計情報を算出し、不要な単語の削除等を行う処理である。単語認識において検索された単語は検索回数をカウントし、正解として選択された単語は正解回数をカウントされる。ここで、複数の同義語があり、そのうち正解として選択される確立が所定の値を下回った単語は辞書から削除される。
【００３０】
以上のような動作で単語認識を繰り返すことにより、単語認識用の単語辞書が最適化されていくことになる。なお、ステップＳ３とステップＳ４とは前後を入れ替えてそれらの処理を行っても同様な結果が得られる。
【００３１】
ここで、上述した音声認識装置を、テレビ受像機能、ラジオ受信機能並びに電話機能等が備えられているカーナビ装置に接続して利用する場合を例として、上述の同義語辞書登録処理をより詳細に説明する。
【００３２】
コマンドの階層化の例として、例えば、第一の発話の階層に「ラジオ」、「デンワ」、「テレビ」が登録されている場合を考える。この時、「テレビ」に対する第二の発話の階層の辞書には、チャンネルが登録されており、「ラジオ」に対する第二の発話の階層の辞書には、放送局名が登録されており、「デンワ」に対する第二の発話の階層の辞書には、電話番号が登録されているものとする。ここで、第一の発話の後にユーザーが行う操作（第二の発話）と、音声入力の第一の発話による認識結果に基づいて推測される、第二の発話の階層の辞書に登録されている内容とが一致した場合、第一の発話による認識結果を正解と判定する。例えば、ユーザの第一の発話に基づいた認識結果が「テレビ」であった場合、ユーザの第二の発話がチャンネルを示すものであったときは、この場合の認識結果「テレビ」は正解と判定される。ここで、第一の発話による認識結果（単語）の、正解の単語に対する類似度と他の単語に対する類似度との関係が所定の関係（例えば、類似度の差が所定値以上（例えば、０．５以上）か、あるいは類似度の比が所定値以上（例えば２倍以上）であれば、この第一の発話による認識結果（単語）を正解の単語に対する同義語として適当と判定し、仮登録する。これは、この第一の発話による認識結果（単語）を同義語として判定することによって、他の単語の誤認識を招くようでは困るので、類似度が所定条件を満たす単語のみを適当と判断するからである。
【００３３】
さらに、この様に同義語として判定（仮登録）された単語に対して、同義語として判定された回数が所定回数（例えば、３回）以上であり、かつ、同義語のうちその単語が選択された確率が所定値（例えば、５０％）以上であれば、辞書に登録と判定される。これは、高い確率で選択される同義語は辞書で利用できるからである。
【００３４】
一方、同義語の何れかが正解として選択された回数が所定数（例えば、１０回）以上であり、かつ、同義語のうちその単語が正解として選択された確率が所定値（例えば、３０％）未満であれば、その単語を辞書より削除する。
【００３５】
以上のような音声認識装置の設定の基で、いま、ユーザーがカーナビ装置に備えられたテレビの６チャンネルを見ようとした場合、ユーザの第一の発話は「テレビ」であり、この第一の発話に基づいた音素認識結果が「テレイ」、単語認識結果が「テレビ」であったとする。また、この時、単語認識の過程で得られる類似度が、「テレビ」が０．８、「デンワ」が０．３および「ラジオ」が０．２であったとする。さらに、ユーザの第二の発話が「ロクチャンネル」であったとき、単語認識結果が「ロクチャンネル」であったとする。
【００３６】
これらの状況下での同義語登録処理を図２を参照にして説明する。
【００３７】
まず、ユーザーが第一の音声認識結果に対してキャンセル等を行わず、続けて正常に第二の音声認識が行われたことから、辞書登録部６は第一の発話に対する単語認識結果「テレビ」を正解と判定する（Ｓ３：Ｙ）。次に、辞書登録部６は音素認識結果「テレイ」と単語認識結果「テレビ」とが一致するかどうかを判定する（Ｓ４）。この場合、不一致であるので（Ｓ４：Ｎ）、辞書登録部６は単語認識の過程における類似度を比較する。すなわち、音素認識結果「テレイ」が、正解の単語「テレビ」の同義語として登録可能かどうかを調べる。ここでは、正解の単語「テレビ」に対する類似度が０．８であるのに対して、他の単語「デンワ」、「ラジオ」に対する類似度が正解の単語の類似度の５０％以下（０．３／０．８、０．２／０．８）であることから、誤認識の影響は小さく、「テレイ」を「テレビ」の同義語として適当であると判定し（Ｓ５：Ｙ）、同義語「テレイ」を記録する（仮登録）。
【００３８】
本音声認識装置の使用により以上のような動作（ステップＳ３，Ｓ４，Ｓ５）が繰り返されて、「テレイ」に対する仮登録の回数が計数されていくことになる。仮登録された同義語およびそれらの仮登録回数は、辞書管理部６内に設けられた（あるいは、別途備えられた）所定のメモリー領域に一時的に記録されることになる。この繰り返し動作の結果、音素認識結果が「テレビ」であり、かつ、単語認識結果が「テレビ」となった回数が２回であり、音素認識結果が「テレイ」であり、かつ、単語認識結果が「テレビ」となった回数が３回となったとする。この場合、「テレイ」が同義語と判定された回数が３回（以上）となり、かつ、同じ単語の同義語（「テレビ」、「テレイ」）のうち「テレイ」が選択される確率が５０％以上（３／５＝６０％）であることから、「テレイ」を「テレビ」の同義語として辞書に登録すると判定し（Ｓ６）、辞書に登録する（Ｓ７）。
【００３９】
その結果、辞書には、「ラジオ」、「デンワ」、「テレビ」、「テレイ」が登録されることになる。ただし、「テレビ」と「テレイ」とは同じコマンドをあらわしている。
【００４０】
さらに、以上のような動作（ステップＳ３乃至ステップＳ７）が繰り返された結果、単語認識過程において「テレビ」と「テレイ」の何れかが正解として選択された回数が１０回で、そのうち「テレビ」が正解となった回数が２回、「テレイ」が正解となった回数が８回であっとする。この場合、同義語の何れか（「テレビ」または「テレイ」）が正解として選択された回数が（２＋８＝）１０回で、かつ、同義語のうち「テレビ」が選択された回数が３０％未満（２／１０＝２０％）であるので、同義語「テレビ」は辞書から削除されることになる。
【００４１】
従って、辞書には、「ラジオ」、「デンワ」および「テレイ」が登録されることになる。ただし、「テレイ」は「テレビ」を意味するコマンドを表している。
【００４２】
上述したように、本実施態様の音声認識装置の利用を繰り返すことで、必要性の高い同義語は辞書へ登録し、必要性が低い同義語は辞書から削除されるという動作が行われていくことになり、結果として辞書が最適化され、認識率を向上することになる。従って、本実施態様では、もともと辞書になく意味を持たないような単語（例えば「テレイ」）でも認識率を向上するものであれば、その単語に意味を与えて辞書に登録することが可能となり、ユーザ固有の発話に対しても正確な音声認識を行うことが可能となる。
【００４３】
【発明の効果】
本発明によれば、使用しながら単語認識用辞書を最適化することができ、設計コストが低減できる。また、話者によって、認識されにくい単語や誤認識されやすい単語が現れる場合でも、単語の認識の性能を向上させることができる。
【図面の簡単な説明】
【図１】本実施態様による音声認識装置の構成を示した機能ブロック図である。
【図２】図１で示した音声認識装置による音声認識動作を示したフローチャートである。
【符号の説明】
１音声認識装置
２制御部
３音声入力部
４単語認識部
５音素認識部
６辞書管理部
[0001]
[Technical field of the invention]
The present invention relates to a speech recognition apparatus that performs word recognition on input speech from a user using a dictionary. In particular, it relates to a speech recognition apparatus that performs both word recognition and phoneme recognition on the input speech and improves the recognition rate by performing speech recognition based on these recognition results.
[0002]
[Prior art]
Speech recognition devices that recognize human speech as words have been put into practical use in various fields. For example, such a device is used as an input device for giving spoken instructions to various equipment in a factory from a remote location, and as a voice input device for entering destinations, instruction information, and the like in a car navigation device. To identify the input speech, such a device generally includes a speech recognition word dictionary. The frequency distribution of each utterance to be recognized is analyzed in advance, features such as time-series information of the spectrum and the fundamental frequency are extracted, and the resulting pattern is stored in association with each word.
[0003]
When the speech to be recognized is input, the frequency pattern of the input speech is compared with the pattern of each word stored in the dictionary by pattern matching, and the similarity to each word is calculated. Next, the word having the highest degree of similarity calculated (the word having the closest pattern) is recognized as the input voice, and the word is output. That is, the input speech is determined by examining which word pattern is most similar to the frequency distribution pattern of the input word.
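The matching step described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the patent does not name a similarity measure, so cosine similarity is used as a stand-in, and the feature vectors, the word list, and the function names `cosine_similarity` and `recognize_word` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors (stand-in measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recognize_word(input_pattern, dictionary):
    """Return (best_word, similarities) for a dict mapping word -> stored pattern."""
    similarities = {
        word: cosine_similarity(input_pattern, pattern)
        for word, pattern in dictionary.items()
    }
    # The recognized word is the one whose stored pattern is closest to the input.
    best_word = max(similarities, key=similarities.get)
    return best_word, similarities


# Hypothetical three-dimensional patterns, for illustration only.
word_dictionary = {
    "TV": [0.9, 0.1, 0.3],
    "Denwa": [0.1, 0.8, 0.2],
    "Radio": [0.2, 0.3, 0.9],
}
best_word, sims = recognize_word([0.8, 0.2, 0.3], word_dictionary)
```

The per-word similarities are returned alongside the best word because the later synonym check relies on how far the best word is from its competitors.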
[0004]
In such speech recognition, a speech recognition apparatus has also been proposed that updates a history of matching rates based on the speaker's response to the output recognition result, thereby further increasing the recognition rate (see Patent Document 1).
[0005]
[Patent Document 1]
JP-A-8-160986
[0006]
[Problems to be solved by the invention]
Such speech recognition based on word recognition is used in particular for voice command input in car navigation devices and the like. In such speech recognition, situations arise in which a specific word is difficult to recognize or is easily misrecognized, and these situations are particularly likely to occur when similar words are registered in the dictionary. They can be avoided to some extent by selecting the words registered in the dictionary so that they are not similar to one another. However, because recognition results differ from speaker to speaker, the dictionary must be optimized over time, for example by testing with many speakers and repeatedly registering and deleting words. A speech recognition device with a high recognition rate that is suitable for practical use is therefore desired.
[0007]
Accordingly, an object of the present invention is to provide a speech recognition device that improves the performance of word recognition and increases the recognition rate of speech recognition by optimizing a dictionary for word recognition.
[0008]
[Means for Solving the Problems]
In order to achieve the above object, the invention according to claim 1 is a speech recognition apparatus that performs word recognition on input speech from a user using a dictionary. The apparatus performs word recognition and phoneme recognition on the input speech. When the word recognition result and the phoneme recognition result thus obtained do not match, the word recognition result is correct, and the relationship between the similarity of the phoneme recognition result to the word corresponding to the correct word recognition result and its similarity to other words satisfies a predetermined condition, the apparatus registers the phoneme recognition result in the dictionary as a synonym for the word corresponding to the correct word recognition result.
[0009]
In the invention according to claim 2, the predetermined condition is that the difference between the similarity of the phoneme recognition result to the word corresponding to the correct word recognition result and its similarity to the other words is equal to or greater than a predetermined value.
[0010]
In the invention according to claim 3, the predetermined condition is that the ratio of the similarity of the phoneme recognition result to the word corresponding to the correct word recognition result to its similarity to the other words is equal to or greater than a predetermined value.
[0011]
In the invention according to claim 4, when the phoneme recognition result recognized as a synonym has been recognized as a synonym of the same word recognition result a predetermined number of times or with a predetermined probability, the synonym is registered in the dictionary.
[0012]
In the invention according to claim 5, when a plurality of synonyms exist in the dictionary, the number of searches and the number of correct answers are counted for each synonym registered in the dictionary at the time of word recognition, and when the probability that a synonym is the correct answer falls below a predetermined value, that synonym is deleted from the dictionary.
[0013]
In the invention according to claim 6, when the user's operation after word recognition based on the voice input matches a predetermined candidate for an operation that follows a correct answer, the word recognition result is determined to be correct.
[0014]
The invention according to claim 7 comprises: a voice input unit that receives input speech from a user; a word recognition unit that performs word recognition on the input speech; a phoneme recognition unit that performs phoneme recognition on the input speech; and a dictionary registration unit that, when the word recognition result obtained by the word recognition unit and the phoneme recognition result obtained by the phoneme recognition unit do not match and the word recognition result is correct, determines whether to register the phoneme recognition result in the dictionary as a synonym for the word corresponding to the correct word recognition result. When the difference between the similarity of the phoneme recognition result to the word corresponding to the correct word recognition result and its similarity to the other words is equal to or greater than a predetermined value, or when the ratio of those similarities is equal to or greater than a predetermined value, the dictionary registration unit recognizes the phoneme recognition result as a synonym and temporarily registers it. The dictionary registration unit further counts the number of temporary registrations of the phoneme recognition result recognized as a synonym for the same word recognition result and, when that number is equal to or greater than a predetermined value, registers the phoneme recognition result in the dictionary for use in subsequent speech recognition processing.
[0015]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, a speech recognition apparatus according to an embodiment of the present invention will be described.
[0016]
FIG. 1 is a functional block diagram showing the configuration of the speech recognition apparatus 1 of this embodiment.
[0017]
The speech recognition apparatus 1 includes a control unit 2, a voice input unit 3, a word recognition unit 4, a phoneme recognition unit 5, and a dictionary management unit 6. The control unit 2 is connected to an external device such as a navigation device, receives voice recognition command information and the like from the external device, and transmits the final voice recognition result of the speech recognition apparatus 1 to the external device. The control unit 2 also controls the speech recognition apparatus as a whole. The voice input unit 3 includes, for example, a microphone, and captures the user's voice under the control of the control unit 2. The word recognition unit 4 performs recognition on the input speech from the user using the word as the basic unit and selects the most suitable word for the input speech. Specifically, using the word dictionary provided in the dictionary management unit 6, in which words (word models) are registered in advance, it calculates the similarity between the input speech and each candidate (word) in the word dictionary and selects the candidate with the highest similarity, thereby recognizing the input speech as one of the candidate words. The phoneme recognition unit 5 divides the input speech from the user into phonemes and selects the closest phoneme for each, thereby recognizing the input speech as a word consisting of an arbitrary character string. The dictionary management unit 6 manages the word dictionary for word recognition and performs registration, deletion, statistics, and the like for candidate words. In addition to the components described above, a display unit that displays the voice recognition result and the like to the user may also be provided.
[0018]
Next, the operation of the speech recognition apparatus of this embodiment described above will be described with reference to FIG. 2.
[0019]
FIG. 2 is a flowchart showing the speech recognition operation of the speech recognition apparatus according to this embodiment. Here, a voice recognition device in a car navigation device is described as an example. That is, a car navigation device is used as the external device in FIG. 1. It goes without saying, however, that the voice recognition device of the present invention is not limited to car navigation devices and can be applied to any device that requires voice recognition.
[0020]
First, the control unit 2 instructs the voice input unit 3 to input a voice in accordance with an instruction from the car navigation device. Based on the instruction from the control unit 2, the voice input unit 3 acquires the input voice from the user, and outputs the input voice to each of the word recognition unit 4 and the phoneme recognition unit 5 (S1).
[0021]
Speech recognition is performed based on an overall judgment of the recognition results of the word recognition unit 4 and the phoneme recognition unit 5. That is, it is determined whether the word recognition result of the word recognition unit 4 and the phoneme recognition result of the phoneme recognition unit 5 for the input speech match. If they match, the result is displayed to the user and the next speech recognition is performed; if they do not match, the process moves to the synonym registration process of this embodiment. The synonym registration (dictionary registration) process of this embodiment is therefore performed when the word recognition result and the phoneme recognition result do not match and the word recognition result is correct.
[0022]
When the input speech is acquired by the voice input unit 3 (S1), the word recognition unit 4 and the phoneme recognition unit 5 perform word recognition processing and phoneme recognition processing, respectively, on the input speech from the voice input unit 3 (S2). Specifically, the word recognition unit 4 compares the input speech with the words (word models) in the word dictionary provided in the dictionary management unit 6 and outputs the word with the highest similarity to the dictionary management unit 6 as the word recognition result. In this recognition processing, word recognition (matching) of the input speech is performed by matching feature data obtained by feature extraction from the input speech against the feature data of the words registered in advance in the word dictionary. The phoneme recognition unit 5 divides the input speech into phonemes, performs phoneme recognition on each phoneme, and outputs the resulting word to the dictionary management unit 6 as the phoneme recognition result. The word recognition processing and the phoneme recognition processing are performed in parallel.
[0023]
Next, the dictionary management unit 6 determines whether or not the word recognition result of the word recognition unit 4 is correct (S3). The correct answer determination of the word recognition result will be described below.
[0024]
Normally, in car navigation voice recognition, commands are organized hierarchically, and an operation requires several steps. Therefore, if, after the user is notified of the word recognition result, the user's subsequent operation matches a predetermined candidate for an operation that follows a correct answer, for example the user goes on to utter a command in the next level of the hierarchy or performs the next step of the operation, the word recognition result is determined to be correct (S3: Y). On the other hand, if, after the user is notified of the word recognition result, the subsequent operation does not match any predetermined candidate for an operation that follows a correct answer, for example the user repeats the same word recognition or performs a cancel operation (S3: N), the word recognition result is determined to be incorrect.
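A minimal sketch of this correctness check (S3), assuming the follow-up operations can be represented as simple labels. The table `EXPECTED_FOLLOW_UPS`, its labels, and the function name `is_word_result_correct` are hypothetical; the entries mirror the TV / Radio / Denwa (telephone) hierarchy used as an example later in the description.

```python
# Hypothetical command hierarchy: first-level word -> operations expected next.
EXPECTED_FOLLOW_UPS = {
    "TV": {"channel"},
    "Radio": {"station"},
    "Denwa": {"phone_number"},
}


def is_word_result_correct(recognized_word, next_operation):
    """S3 sketch: treat the result as correct only if the user's next operation
    is one of the operations expected after that word. Operations such as
    'retry' or 'cancel' are not in any expected set, so they yield False."""
    return next_operation in EXPECTED_FOLLOW_UPS.get(recognized_word, set())
```

For instance, `is_word_result_correct("TV", "channel")` is true, while a cancel operation after "TV" is treated as an incorrect recognition.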
[0025]
If the word recognition result is determined to be incorrect (S3: N), the dictionary management unit 6 proceeds to the normal dictionary management process (S8). In this case, the user may be informed that the word recognition was incorrect, for example by outputting an error message reporting that the voice recognition process failed or by prompting the user to speak again. On the other hand, if the word recognition result is correct (S3: Y), that is, if the speech recognition using the word as the basic unit was correct, the dictionary management unit 6 determines whether the word recognition result matches the word given by the phoneme recognition result (S4).
[0026]
If the word recognition result and the phoneme recognition result match (S4: Y), the dictionary management unit 6 proceeds to the dictionary management process (S8). If they do not match (S4: N), the dictionary management unit 6 determines whether the phoneme recognition result is suitable as a synonym of the word recognition result (S5). Specifically, this determination examines how registering the phoneme recognition result as a word would affect other words. Among the similarities obtained in the word recognition process, if the difference or ratio between the similarity of the word obtained from the phoneme recognition result to the word determined to be correct in step S3 and its similarity to other words exceeds a predetermined value, registering it as a synonym has little effect on other words, so it is recognized as a synonym (S5: Y).
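The suitability check in step S5 might look like the following sketch. The default thresholds (a difference of 0.5 or a ratio of 2) are the example values given later in the description; comparing against the single closest competing word, accepting either condition, using "equal to or greater than", and the function name `is_suitable_synonym` are assumptions.

```python
def is_suitable_synonym(similarities, correct_word, min_diff=0.5, min_ratio=2.0):
    """S5 sketch: accept the phoneme result as a synonym only if the correct
    word is clearly separated from every other word in the dictionary.

    similarities: dict mapping each dictionary word to its similarity score.
    """
    correct = similarities[correct_word]
    others = [s for word, s in similarities.items() if word != correct_word]
    if not others:
        return True  # no competing word that could be confused
    closest = max(others)  # assumption: the nearest competitor decides
    diff_ok = (correct - closest) >= min_diff
    ratio_ok = closest == 0 or (correct / closest) >= min_ratio
    return diff_ok or ratio_ok
```

With similarities of 0.8, 0.3, and 0.2, the ratio to the nearest competitor is about 2.7, so the candidate is accepted; with 0.6 against 0.5 it is rejected.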
[0027]
If it is not recognized as a synonym (S5: N), the dictionary management unit 6 proceeds to the dictionary management process (S8). If it is determined to be a synonym (S5: Y), the dictionary management unit 6 determines whether to register the synonym in the dictionary (S6). This registration determination is performed as follows.
[0028]
For each synonym, the number of times it has been recognized as a synonym of the same word recognition result (word) is counted. When this count exceeds a predetermined value and the probability of the synonym being selected exceeds a predetermined value, it is registered in the dictionary as a synonym of the word recognition result (S7).
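The counting behind the registration decision (S6, S7) can be sketched as below. The class name `TemporaryRegistry` and its methods are hypothetical. The thresholds of 3 occurrences and 50% are the example values given later in the description, and the sketch uses "equal to or greater than" as in that example. It also assumes that every correct recognition is recorded, including those where the phoneme result equals the word itself, so that a selection probability can be computed.

```python
from collections import Counter


class TemporaryRegistry:
    """S6 sketch: per correct word, count how often each phoneme result occurred."""

    def __init__(self, min_count=3, min_probability=0.5):
        self.min_count = min_count
        self.min_probability = min_probability
        self.counts = {}  # correct word -> Counter of phoneme results

    def record(self, correct_word, phoneme_result):
        """Record one correct recognition and the phoneme result that came with it."""
        self.counts.setdefault(correct_word, Counter())[phoneme_result] += 1

    def should_register(self, correct_word, candidate):
        """True when the candidate was seen often enough and with a high enough share."""
        counter = self.counts.get(correct_word, Counter())
        total = sum(counter.values())
        if total == 0:
            return False
        count = counter[candidate]
        return count >= self.min_count and count / total >= self.min_probability
```

Keeping the counts per correct word means a phoneme result that is frequent for one command does not influence the decision for another.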
[0029]
Next, the dictionary management unit 6 performs the dictionary management process (S8) and ends this processing. Dictionary management (S8) calculates statistical information for each word from the word recognition results and deletes unnecessary words, among other things. For each word searched during word recognition, the number of searches is counted, and for each word selected as the correct answer, the number of correct answers is counted. When there are a plurality of synonyms, any word among them whose probability of being selected as the correct answer falls below a predetermined value is deleted from the dictionary.
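The deletion part of dictionary management (S8) can be sketched as follows. The class name `SynonymStats` is hypothetical, only the correct-answer counts are modeled, and the thresholds (at least 10 correct selections in total, share below 30%) are the example values given later in the description. The guard that refuses to delete every synonym of a command is an added assumption, not something the patent states.

```python
class SynonymStats:
    """S8 sketch: correct-answer counts for the synonyms of one command."""

    def __init__(self, synonyms, min_total=10, min_probability=0.3):
        self.correct = {s: 0 for s in synonyms}
        self.min_total = min_total
        self.min_probability = min_probability

    def record_correct(self, synonym):
        """Count one recognition in which this synonym was the correct answer."""
        self.correct[synonym] += 1

    def prune(self):
        """Delete and return synonyms whose share of correct answers is too low."""
        total = sum(self.correct.values())
        if len(self.correct) < 2 or total < self.min_total:
            return []
        removed = [s for s, n in self.correct.items()
                   if n / total < self.min_probability]
        if len(removed) == len(self.correct):
            return []  # assumption: never leave a command without any word
        for s in removed:
            del self.correct[s]
        return removed
```

Pruning is skipped until enough correct selections have accumulated, so a word is not deleted on the basis of a handful of observations.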
[0030]
By repeating word recognition with the operations described above, the word dictionary for word recognition is progressively optimized. Note that the same result is obtained even if the order of step S3 and step S4 is reversed.
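The branching of the flowchart after recognition (S2) can be summarized in a small function. This sketch only traces which steps are visited; the four boolean inputs and the name `steps_after_recognition` are hypothetical stand-ins for the determinations made in S3 to S6.

```python
def steps_after_recognition(word_correct, results_match,
                            suitable_synonym, registration_ok):
    """Return the flowchart steps visited after S2, in order.

    word_correct:      outcome of S3 (word recognition result judged correct)
    results_match:     outcome of S4 (word and phoneme results are identical)
    suitable_synonym:  outcome of S5 (similarity condition satisfied)
    registration_ok:   outcome of S6 (count and probability thresholds met)
    """
    steps = ["S3"]
    if word_correct:
        steps.append("S4")
        if not results_match:
            steps.append("S5")
            if suitable_synonym:
                steps.append("S6")
                if registration_ok:
                    steps.append("S7")  # synonym is written to the dictionary
    steps.append("S8")  # dictionary management always runs last
    return steps
```

Every path ends in S8, which matches the description that dictionary management runs regardless of how the earlier determinations turn out.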
[0031]
Here, the synonym dictionary registration process described above is explained in more detail, taking as an example the case where the voice recognition apparatus is connected to and used with a car navigation apparatus equipped with a television reception function, a radio reception function, a telephone function, and the like.
[0032]
As an example of hierarchical commands, consider a case where “Radio”, “Denwa” (telephone), and “TV” are registered in the first utterance level. Assume that channels are registered in the second-utterance-level dictionary for “TV”, broadcasting station names in the second-utterance-level dictionary for “Radio”, and telephone numbers in the second-utterance-level dictionary for “Denwa”. If the operation the user performs after the first utterance (the second utterance) matches the content registered in the second-utterance-level dictionary that is expected from the recognition result of the first utterance, the recognition result of the first utterance is determined to be correct. For example, if the recognition result based on the user's first utterance is “TV” and the user's second utterance indicates a channel, the recognition result “TV” is determined to be correct. Then, if the relationship between the similarity of the recognition result (word) of the first utterance to the correct word and its similarity to other words satisfies a predetermined relationship (for example, the difference in similarity is equal to or greater than a predetermined value, such as 0.5, or the ratio of the similarities is equal to or greater than a predetermined value, such as 2), the recognition result (word) of the first utterance is judged suitable as a synonym for the correct word and is temporarily registered. Only words whose similarities satisfy the predetermined condition are judged suitable, because treating the recognition result (word) of the first utterance as a synonym must not cause other words to be misrecognized.
[0033]
Furthermore, if the number of times a word has been judged to be a synonym (temporarily registered) in this way is equal to or greater than a predetermined number (for example, three) and the probability that the word was selected from among the synonyms is equal to or greater than a predetermined value (for example, 50%), it is determined that the word is to be registered in the dictionary. This is because a synonym that is selected with high probability is useful in the dictionary.
[0034]
On the other hand, if the number of times any one of the synonyms has been selected as the correct answer is equal to or greater than a predetermined number (for example, 10) and the probability that a given word among the synonyms was selected as the correct answer is less than a predetermined value (for example, 30%), that word is deleted from the dictionary.
[0035]
With the voice recognition device configured as described above, suppose the user wants to watch channel 6 on the television provided in the car navigation device. The user's first utterance is “TV”; assume that the phoneme recognition result for this utterance is “Terei” and the word recognition result is “TV”. Assume also that the similarities obtained in the word recognition process are 0.8 for “TV”, 0.3 for “Denwa”, and 0.2 for “Radio”. Further assume that the user's second utterance is “Roku channel” (channel 6) and that the word recognition result is “Roku channel”.
[0036]
The synonym registration process under these circumstances will be described with reference to FIG. 2.
[0037]
First, because the user did not cancel the first speech recognition result and the second speech recognition then proceeded normally, the dictionary management unit 6 determines that the word recognition result “TV” for the first utterance is correct (S3: Y). Next, the dictionary management unit 6 determines whether the phoneme recognition result “Terei” matches the word recognition result “TV” (S4). Because they do not match (S4: N), the dictionary management unit 6 compares the similarities obtained in the word recognition process; that is, it checks whether the phoneme recognition result “Terei” can be registered as a synonym for the correct word “TV”. Here, the similarity to the correct word “TV” is 0.8, whereas the similarities to the other words “Denwa” and “Radio” are 50% or less of that value (0.3 / 0.8 = 37.5% and 0.2 / 0.8 = 25%). The influence of misrecognition is therefore small, so “Terei” is judged suitable as a synonym for “TV” (S5: Y) and the synonym “Terei” is recorded (temporary registration).
[0038]
As the voice recognition apparatus is used, the above operations (steps S3, S4, and S5) are repeated, and the number of temporary registrations of “Terei” is counted. The temporarily registered synonyms and their temporary registration counts are held temporarily in a predetermined memory area provided in the dictionary management unit 6 (or provided separately). Suppose that, as a result of this repetition, the phoneme recognition result was “TV” with the word recognition result “TV” two times, and the phoneme recognition result was “Terei” with the word recognition result “TV” three times. In this case, “Terei” has been judged to be a synonym three times (or more), and the probability that “Terei” is selected from among the synonyms of the same word (“TV” and “Terei”) is 50% or more (3/5 = 60%). It is therefore determined that “Terei” is to be registered in the dictionary as a synonym for “TV” (S6), and it is registered in the dictionary (S7).
[0039]
As a result, “Radio”, “Denwa”, “TV”, and “Terei” are registered in the dictionary. Note that “TV” and “Terei” represent the same command.
[0040]
Suppose further that, as a result of repeating the above operations (steps S3 to S7), either “TV” or “Terei” was selected as the correct answer in the word recognition process 10 times in total, of which “TV” was correct 2 times and “Terei” was correct 8 times. In this case, the number of times any one of the synonyms (“TV” or “Terei”) was selected as the correct answer is 10 (2 + 8), and the proportion of those in which “TV” was selected is less than 30% (2/10 = 20%), so the synonym “TV” is deleted from the dictionary.
[0041]
Therefore, “Radio”, “Denwa”, and “Tele” remain registered in the dictionary. However, “Tele” represents the command meaning “TV”.
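One way to picture the resulting dictionary is as a mapping from spoken forms to commands, so that deleting one spoken form leaves the command reachable through its remaining synonym. The command identifiers (`CMD_TV` and so on) and the helper `delete_entry` are hypothetical; the source names only the spoken words.

```python
# Hypothetical command identifiers; the source names only the spoken words.
dictionary = {
    "Radio": "CMD_RADIO",
    "Denwa": "CMD_PHONE",
    "TV": "CMD_TV",
    "Tele": "CMD_TV",
}

def delete_entry(dictionary, word):
    """Return a copy of the dictionary without one spoken form;
    other spoken forms of the same command are left in place."""
    remaining = dict(dictionary)
    remaining.pop(word, None)
    return remaining

after_deletion = delete_entry(dictionary, "TV")
command = after_deletion["Tele"]
```

After the deletion the dictionary holds three spoken forms, and “Tele” still resolves to the television command.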
[0042]
As described above, repeated use of the speech recognition apparatus of this embodiment registers synonyms that are highly needed in the dictionary and deletes synonyms that are rarely needed. As a result, the dictionary is optimized and the recognition rate improves. In this embodiment, even a word that originally has no meaning in the dictionary (for example, “Tele”) can be registered by assigning it a meaning, as long as doing so improves the recognition rate. In addition, accurate speech recognition can be performed for utterances peculiar to the user.
[0043]
[Effect of the invention]
According to the present invention, the word recognition dictionary can be optimized while being used, and the design cost can be reduced. Moreover, even when a word that is difficult to recognize or easily misrecognized by a speaker appears, the performance of word recognition can be improved.
[Brief description of the drawings]
FIG. 1 is a functional block diagram showing a configuration of a speech recognition apparatus according to an embodiment.
FIG. 2 is a flowchart showing a speech recognition operation by the speech recognition apparatus shown in FIG. 1.
[Explanation of symbols]
1 Voice recognition apparatus
2 Control part
3 Voice input part
4 Word recognition part
5 Phoneme recognition part
6 Dictionary management part

Claims

A speech recognition apparatus that performs word recognition on input speech from a user using a dictionary, wherein word recognition and phoneme recognition are performed on the input speech, and, when the word recognition result and the phoneme recognition result thus obtained do not match, the word recognition result is correct, and the relationship between the similarity of the phoneme recognition result to the word corresponding to the correct word recognition result and its similarity to other words satisfies a predetermined condition, the phoneme recognition result is registered in the dictionary as a synonym for the word corresponding to the correct word recognition result.

The speech recognition apparatus according to claim 1, wherein the predetermined condition is that the difference between the similarity of the phoneme recognition result to the word corresponding to the correct word recognition result and its similarity to the other words is equal to or greater than a predetermined value.

The speech recognition apparatus according to claim 1, wherein the predetermined condition is that the ratio between the similarity of the phoneme recognition result to the word corresponding to the correct word recognition result and its similarity to the other words is equal to or greater than a predetermined value.

The speech recognition apparatus according to any one of claims 1 to 3, wherein the synonym is registered in the dictionary when the phoneme recognition result recognized as the synonym has been recognized as a synonym for the same word recognition result a predetermined number of times or with a predetermined probability.

The speech recognition apparatus according to claim 1, wherein, when a plurality of synonyms exist in the dictionary, the number of searches and the number of correct answers are counted for each synonym registered in the dictionary at the time of word recognition, and a synonym is deleted from the dictionary when the probability that the synonym is the correct answer is less than a predetermined value.

The speech recognition apparatus according to any one of claims 1 to 5, wherein the word recognition result is determined to be correct when the user's operation following the word recognition based on the voice input matches a predetermined candidate for an operation that follows a correct answer.

A speech recognition apparatus comprising: a speech input unit that receives input speech from a user; a word recognition unit that performs word recognition on the input speech; a phoneme recognition unit that performs phoneme recognition on the input speech; and a dictionary registration unit that, when the word recognition result obtained by the word recognition unit does not match the phoneme recognition result obtained by the phoneme recognition unit and the word recognition result is correct, determines whether to register the phoneme recognition result in the dictionary as a synonym for the word corresponding to the correct word recognition result, wherein the dictionary registration unit recognizes the phoneme recognition result as a synonym and temporarily registers it when the difference between the similarity of the phoneme recognition result to the word corresponding to the correct word recognition result and its similarity to other words is equal to or greater than a predetermined value, or when the ratio between those similarities is equal to or greater than a predetermined value, counts the number of temporary registrations of the phoneme recognition result recognized as a synonym for the same word recognition result, and, when the number of temporary registrations is equal to or greater than a predetermined value, registers the phoneme recognition result in the dictionary for use in subsequent speech recognition processing.