JP4212947B2

JP4212947B2 - Speech recognition system and speech recognition correction / learning method

Info

Publication number: JP4212947B2
Application number: JP2003127376A
Authority: JP
Inventors: 光章渡邉; 望齊藤
Original assignee: Alpine Electronics Inc
Current assignee: Alpine Electronics Inc
Priority date: 2003-05-02
Filing date: 2003-05-02
Publication date: 2009-01-21
Anticipated expiration: 2023-05-02
Also published as: JP2004333703A

Abstract

PROBLEM TO BE SOLVED: To reliably and simply correct a recognition result when speech is misrecognized.
SOLUTION: When misrecognition of uttered speech by a speech recognition engine 3 is detected, a dialogue processing unit 4 reads from a recognized word link DB 7 the words to which the user previously corrected that misrecognized word and presents them as correct-answer candidates. It also newly registers the misrecognized word in association with the correct word chosen by the user. Thus, when misrecognition occurs, only words with a high probability of being correct are presented to the user as correction candidates, and the recognition result can be reliably corrected by the simple operation of selecting from the presented candidates.
COPYRIGHT: (C)2005,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は音声認識システムおよび音声認識の訂正・学習方法に関し、特に、認識対象の文字列とその音声パターンとの対応を音声辞書として登録しておき、入力音声との類似度が高い音声パターンを有する文字列を入力音声の文字列であると認識するように成された音声認識システムに用いて好適なものである。
【０００２】
【従来の技術】
最近の車両の殆どには、オーディオ装置、エアーコンディショナ、ナビゲーション装置など各種の電子機器が搭載されている。また、最近では、これらの電子機器を操作する際の片手運転等を回避するために、電子機器の操作を音声認識により行えるようにしたシステムも提供されている。この音声認識技術を用いれば、運転者は、ハンドルから手を離すことなく（リモートコントローラや操作パネル等の操作部を手動で操作せずに）各種電子機器の操作を行うことができる。
【０００３】
音声認識システムは通常、ユーザが発声した特定の単語や熟語、簡単な命令文など（本明細書ではこれらを単に「単語」と表現する）を発話コマンドとして認識し、認識単語を音声合成してトークバックする。ユーザは、トークバックされた認識単語の確認を行い、正しければその旨の入力を行う。これに応じてシステムは、認識単語に応じた制御を行う。一方、システムよりトークバックされた認識単語がユーザ発声の単語と異なる場合には、ユーザは再度音声入力を行う。
【０００４】
かかる音声認識システムでは、認識対象単語の文字列とその音声パターンとを対応付けた音響モデルを音声辞書データベースにあらかじめ登録しておく。そして、ユーザの入力音声から算出した特徴量と音響モデルの特徴量とを比較して類似度が最も高い音声パターンを検索し、その音声パターンを有する文字列を入力音声の文字列であると認識する。
【０００５】
このような音声認識システムにおいて、発話音声の誤認識は避けられない。ユーザの発声する音声によっては、誤認識が連続して発生する場合もある。この場合の対策として、類似度が最高位の１単語だけでなく、類似度が上位の複数単語をユーザに提示し、この中から何れかを選択してもらうようにした機能を有するものも提供されている。また、誤認識とされた最高位の認識結果以降の認識結果を順次最高位に導出することにより、見かけ上の認識性能を向上させるようにした技術も提案されている（例えば、特許文献１参照）。
【０００６】
【特許文献１】
特開平１０−６３２９５号公報
【０００７】
また、認識率そのものを上げるための技術として、個々の話者に対応して音響モデルをチューニングし、誤認識の発生を低減するようにした「話者適応化」という手法も種々検討されている（例えば、特許文献２参照）。話者適応化の代表的なものとして、「エンロール」と呼ばれる手法がある。エンロールは、システムの使い始めの段階で、システムからの指示に従ってあらかじめ用意した単語をユーザに読み上げてもらい、その指示単語の音声パターンと話者入力に係る音声パターンとを用いて学習を行うものである（例えば、特許文献３，４参照）。
【０００８】
【特許文献２】
特開平７−２３０２９５号公報
【特許文献３】
特開２００２−１３２２８８号公報
【特許文献４】
特開２０００−１４８１９８号公報
【０００９】
【発明が解決しようとする課題】
しかしながら、類似度が上位の複数単語を提示する機能を有していても、システムが連続して誤認識するような場合は、認識エンジンの音響モデルとユーザの音声パターンとが大きくかけ離れていることが多く、類似度により提示した正解候補の全てが誤認識であることが多い。そのため、ユーザは複数の正解候補が提示されてもそれを利用することができず、音声入力を何度もやり直すか、音声入力を諦めてリモコン等によりコマンド入力せざるを得ないという問題があった。
【００１０】
また、特許文献１のように見かけ上の認識性能を向上させたとしても、実際の認識率が向上する訳ではない。実際に認識率を上げるためには、話者適応化の処理を行う必要がある。ところが、誤認識が発生することのある通常の使用状態では、常に「システム側の認識単語＝ユーザが入力したい正解単語」であるとは限らない。よって、音声入力の結果のみを頼りにして話者適応化を行っても、うまく認識率を上げることができないという問題があった。
【００１１】
また、ユーザがリモコン等を操作して入力したコマンドを話者適応化の正解値として利用することも考えられる。しかし、システム側では、リモコン入力されたコマンドが、音声の誤認識が連続した結果リモコン操作に切り替えて訂正入力されたものなのか、音声認識とは関係なくユーザの任意操作により入力されたものなのかを把握できない。そのため、リモコン等による入力コマンドを話者適応化の正解値としては利用することができなかった。
【００１２】
このような実情から、車載用の電子機器では、話者適応化の手法として、正解の単語があらかじめ分かっているエンロールが一般的に用いられてきた。ところが、エンロールを用いて音声の認識率を上げる場合には、システムの使い始めの段階で、システム側であらかじめ用意されたいくつかの単語をユーザがわざわざ読み上げなければならない。そのため、ユーザが電子機器に対して実際に行いたい操作とは直接関係のないことで、ユーザに余計な負担が生じてしまうという問題があった。
【００１３】
本発明は、このような問題を解決するために成されたものであり、音声の誤認識が発生した場合に、確実かつ簡単に認識結果を訂正できるようにすることを目的とする。
また、本発明は、音声認識の結果を確実かつ簡単に訂正することができ、しかも、時間と労力がかかるエンロールを行うことなく音声認識性能を実際に向上できるようにすることも目的としている。
【００１４】
【課題を解決するための手段】
上記した課題を解決するために、本発明では、誤認識された単語とユーザにより訂正された正解単語とを対にしてデータベースに実績として登録しておき、次に同じ発話音声に対して誤認識が発生したときは、その実績に基づいて、ユーザが以前に訂正した正解を今回の正解候補として提示するようにしている。このように構成した本発明によれば、誤認識が発生した場合に、過去の訂正実績からして正解の確率が高いものだけをユーザに提示することが可能となる。
【００１５】
本発明の他の態様では、上述のようにして正解候補を提示した後にユーザが選択した候補を、本来認識すべき認識結果として話者適応化手段に提供するようにしている。このように構成した本発明によれば、通常の使用状態においても「ユーザが入力したい正解」をシステム側で正確に把握することが可能となり、その正解と発話音声とを用いて話者適応化を適切に行うことが可能となる。
【００１６】
【発明の実施の形態】
以下、本発明の一実施形態を図面に基づいて説明する。図１は、本実施形態による音声認識システムの構成例を示すブロック図である。
【００１７】
図１に示すように、本実施形態の音声認識システムは、リモコン１などの操作部と、マイク２と、音声認識エンジン３と、対話処理部４と、音声合成エンジン５と、スピーカ６と、認識単語リンクＤＢ（データベース）７と、画面表示制御部８と、ディスプレイ９と、話者適応化モジュール１０とを備えて構成されている。上記リモコン１は、発話ボタン１ａ、訂正ボタン１ｂ、誤認識ボタン１ｃ、ジョイスティック１ｄおよびＯＫボタン１ｅを備えている。
【００１８】
リモコン１は、本実施形態の音声認識システムを利用する電子機器（オーディオ装置やナビゲーション装置など）に対してユーザが各種の操作を行うための操作子であり、音声認識を行う際の操作もこのリモコン１によって行う。発話ボタン１ａは、発話による音声認識処理の開始を指示するためのボタンである。すなわち、この発話ボタン１ａを操作したタイミングに合わせて、発話による音声入力受付状態となる。ジョイスティック１ｄは、音声の誤認識が発生した場合に、その誤認識単語を正しい単語に訂正する際に使う操作子である。
【００１９】
訂正ボタン１ｂは、本来言いたかったものとは違う単語を間違って発声してしまったようなときなどに、音声入力のやり直しを指示するためのボタンである。誤認識ボタン１ｃは、誤認識が発生した場合、すなわち、システムよりトークバックされた認識単語がユーザ発声の単語と異なる場合に、ジョイスティック１ｄを使って誤認識単語の訂正を行うことを指示するためのボタンである。
【００２０】
本実施形態ではこのように、音声入力のやり直しや誤認識単語の訂正を指示するために従来は１つのボタンでしかなかった「戻りボタン」を、訂正ボタン１ｂと誤認識ボタン１ｃとの２つに分けている。これにより、音声の誤認識が発生した結果としてその認識単語の訂正が指示されたということを、システム側で明確に判別できるようにしている。
【００２１】
ＯＫボタン１ｅは、ジョイスティック１ｄを使って選択した所望のメニュー項目の決定を指示したり、音声認識処理を利用して入力した情報の最終的な内容が正しい場合にその入力情報（例えば目的地など）をシステムに設定することを指示したりするためのボタンである。このＯＫボタン１ｅは、図１のようにこれ単独で専用のボタンとして設けても良いし、発話ボタン１ａあるいはジョイスティック１ｄと兼用するように構成しても良い。
【００２２】
音声認識エンジン３は、マイク２より入力された発話音声とあらかじめ用意されている音声辞書とを比較して、当該発話音声に係る単語を認識する。そして、その発話音声に対応するコマンドを、対話処理部４を通じて図示しないオーディオ装置やナビゲーション装置に対して実行する。
【００２３】
音声合成エンジン５は、音声認識エンジン３により認識された単語を音声合成してスピーカ６からトークバックする。これに応じてユーザは、トークバックされた認識単語を聞いて、誤認識が発生したかどうかを確認する。誤認識がなければ、ユーザは次の処理の音声入力を行う。一方、誤認識があった場合は、ユーザは誤認識ボタン１ｃを押して認識単語の訂正を行う。画面表示制御部８は、認識単語の訂正を行う際のリモコン操作画面をディスプレイ９に表示する制御を行う。
【００２４】
対話処理部４は、音声認識を行う際におけるユーザとの一連の対話処理を実行する。すなわち、ユーザによる発話ボタン１ａの操作に応じて音声認識処理の開始を音声認識エンジン３に対して指示する処理、音声認識エンジン３より認識された単語を音声合成エンジン５に供給してユーザにトークバックする処理、トークバックの結果としてユーザにより誤認識ボタン１ｃが押された場合に画面表示制御部８を制御してリモコン操作画面をユーザに提供する処理などを実行する。
【００２５】
また、対話処理部４は、音声認識エンジン３による発話音声の誤認識を検知した場合、すなわち、ユーザにより誤認識ボタン１ｃが押された場合に、当該誤認識された単語（音声認識エンジン３による認識結果）と、誤認識ボタン１ｃの操作後にジョイスティック１ｄを用いてユーザにより訂正された正解単語とを対応付けて認識単語リンクＤＢ７に登録する処理も行う。このように対話処理部４は、本発明の正解単語登録手段を構成する。
【００２６】
対話処理部４が認識単語の訂正時に画面表示制御部８を制御してディスプレイ９に上述のリモコン操作画面を提示する際には、そのとき誤認識した単語に対してユーザが以前に訂正したことのある単語を認識単語リンクＤＢ７から読み出し、これを正解候補のリストとしてユーザに提示する。このように、対話処理部４および画面表示制御部８は、本発明の正解候補提示手段を構成する。
【００２７】
さらに、対話処理部４は、マイク２より入力された発話音声とそれに対応する正解単語（誤認識がない場合の認識結果、もしくは誤認識があった場合の訂正結果）とを話者適応化モジュール１０に提供する処理も行う。
【００２８】
例えば、音声認識エンジン３による発話音声の誤認識を検知しなかった場合、すなわち、誤認識ボタン１ｃが押されずに発話ボタン１ａが押された場合、対話処理部４は、そのときの発話音声と音声認識エンジン３による認識結果とを話者適応化モジュール１０に提供する。また、誤認識ボタン１ｃが押されて認識単語の訂正が行われた場合には、そのときの発話音声とその訂正結果（正解候補からの選択結果）とを話者適応化モジュール１０に提供する。このように、対話処理部４は、本発明の情報提供手段も構成する。
【００２９】
話者適応化モジュール１０は、対話処理部４より提供されるマイク２からの発話音声のパターンと正解音声のパターンとを用いて話者適応化処理を行う。正解音声のパターンは、話者適応化モジュール１０が音響モデルとしてあらかじめ備えており、対話処理部４より通知される正解単語に基づき該当する音声パターンを利用して話者適応化を行う。なお、この話者適応化処理の内容については種々の手法を適用することができるが、何れも公知の手法を適用できるので、ここではその詳細な説明を割愛する。
【００３０】
図２は、認識単語リンクＤＢ７のデータ構造を示す概念図である。図２において、「リンク単語」は、認識結果に対してユーザがリモコン１を用いて以前に訂正を行ったことのある単語である。すなわち、例えば音声認識エンジン３によって「福島県」と誤認識された結果に対して、ユーザが以前にリモコン１を用いて「佐賀県」あるいは「千葉県」と訂正したことのある実績がこの認識単語リンクＤＢ７に登録されている。
【００３１】
次に、上記のように構成した本実施形態による音声認識システムの動作を説明する。なお、音声認識システムの動作を説明する前に、その前提となる発話コマンドの状態遷移について説明しておく。通常、システムに用意されている複数の発話コマンドは、当該システムに対する操作内容に応じて複数の階層に分けて管理されている。例えば、ナビゲーション装置において住所で目的地を設定する場合は、図３に示すように、住所を３階層に分けて入力し、最後にＯＫボタン１ｅを押すことによって、入力された住所を目的地として設定する。
【００３２】
すなわち、図３の例において、初期状態の階層では「住所」「電話番号」・・・などの単語を管理している。この階層で例えば「住所」と発話すると、１つ下の階層１に進む。この階層１では都道府県名を管理しており、「福島県」「佐賀県」「千葉県」・・・などの単語を発話コマンドとして入力することが可能である。この階層１で所望の都道府県名を発話すると、更に１つ下の階層２に進む。この階層２では市区町村名を発話コマンドとして入力することが可能である。
【００３３】
同様に、階層２で所望の市区町村名を発話すると、更に１つ下の階層３に進む。この階層３では住所の残り部分を発話コマンドとして入力することが可能である。住所の残り部分を発話すると、最終の階層４へと進む。この階層４ではＯＫボタン１ｅを押すことによって、発話によって入力された住所を目的地に設定する。以上のような各階層１〜４において、訂正ボタン１ｂや誤認識ボタン１ｃを押すと戻り処理が行われ、１つ上の階層に戻る。
【００３４】
図４および図５は、本実施形態による音声認識処理の動作例を示すフローチャートである。このうち図４は、図３に示した各階層の中で行われる階層処理の動作を示すフローチャート、図５は、図４中に含まれる誤認識訂正処理の動作を示すフローチャートである。
【００３５】
図４において、対話処理部４は、発話ボタン１ａが押されたかどうかを判断する（ステップＳ１）。発話ボタン１ａが押されたと判断した場合、対話処理部４は音声認識エンジン３をアクティブにして音声入力受付モードに設定し、図３の初期状態にあるかどうかを更に判断する（ステップＳ２）。
【００３６】
初期状態でなければ、対話処理部４は前階層での音声認識により正解が得られたものと判断して、以下の情報を学習データとして保持し（ステップＳ３）、話者適応化モジュール１０に送信する（ステップＳ４）。
ｉ）発話音声の波形データ（例：「滋賀県」と発声した際のユーザの音声波形）
ii）認識結果（例：「滋賀県」）
iii）「認識結果＝正解」という情報
【００３７】
その後ユーザは、所望の単語を発声してマイク２から入力する（ステップＳ５）。これを受けて音声認識エンジン３は、音声入力受付モードを一旦抜けて、上記入力された単語の認識処理を行う。そして、その認識結果を音声合成エンジン５が音声合成してスピーカ６からトークバックする（ステップＳ６）。トークバックの後は、対話処理部４は次階層に遷移する処理を実行する（ステップＳ７）。
【００３８】
なお、話者適応化モジュール１０は、上記ステップＳ４で対話処理部４より提供されたｉ）〜iii）の情報に基づいて、例えば、パラメータ更新に基づく話者適応化アルゴリズムにより話者適応化処理を実行する。
【００３９】
上記ステップＳ１で発話ボタン１ａが押されていないと判断した場合、対話処理部４は、訂正ボタン１ｂが押されたかどうかを判断する（ステップＳ８）。訂正ボタン１ｂが押された場合は、対話処理部４は前階層に遷移する戻り処理を実行する（ステップＳ９）。
【００４０】
一方、訂正ボタン１ｂも押されていないと判断した場合、対話処理部４は、誤認識ボタン１ｃが押されたかどうかを更に判断する（ステップＳ１０）。誤認識ボタン１ｃが押された場合は、対話処理部４は、前階層での音声認識により得られた結果は誤りであると判断して、以下の情報を学習データとして保持する（ステップＳ１１）。
Ｉ）発話音声の波形データ（例：「滋賀県」と発声した際のユーザの音声波形）
II）認識結果（例：「福島県」）
III）「認識結果＝誤り」という情報
そして、対話処理部４は前階層に遷移する戻り処理を実行した後（ステップＳ１２）、図５に示す誤認識訂正処理を実行する（ステップＳ１３）。
【００４１】
図５において、対話処理部４は音声認識エンジン３からの誤認識単語（上述の例では「福島県」）をキーとして認識単語リンクＤＢ７の検索を行う（ステップＳ２１）。この検索の結果、当該誤認識単語に対して以前にユーザが訂正を行ったことのあるリンク単語が認識単語リンクＤＢ７に登録されているかどうかを判断する（ステップＳ２２）。
【００４２】
そして、そのようなリンク単語が１つ以上見つかった場合は、そのリンク単語を正解候補として含み、更に「その他」の単語を含んだ図６（ａ）のようなリモコン操作画面をディスプレイ９上に提示する（ステップＳ２３）。この正解候補の中に実際の正解があれば、ユーザはジョイスティック１ｄを操作してそれを選択する。この場合、対話処理部４は、図６（ａ）に示すリモコン操作画面中から何らかの単語が選択されたことを確認して（ステップＳ２４）、選択された単語が「その他」か否かを判断し（ステップＳ２５）、「その他」以外の正解候補中から何れかのリンク単語が選択されていれば、ステップＳ２９にジャンプする。
【００４３】
一方、図６（ａ）の画面に示される正解候補中に実際の正解がない場合（ユーザがジョイスティック１ｄを操作して「その他」を選択した場合）、もしくは、ステップＳ２２で認識単語リンクＤＢ７にリンク単語が１つも登録されていないと判断した場合には、その場面で選択可能な単語を全て取り出して図６（ｂ）のようにリスト表示する（ステップＳ２６）。ユーザは、このリストの中から正解の単語をジョイスティック１ｄの操作により選択する（ステップＳ２７，Ｓ２８）。
【００４４】
なお、その場面で選択可能な単語とは、該当する階層の単語を言う。図６（ｂ）の例は、「福島県」「佐賀県」「千葉県」などの都道府県名を管理している図３の階層１の単語を全てリストとして表示している。
【００４５】
上記図６（ａ）もしくは（ｂ）のリモコン操作画面で何れかの単語が選択されると、対話処理部４は、その選択された単語を認識単語リンクＤＢ７に登録する（ステップＳ２９）。
【００４６】
図７は、認識単語リンクＤＢ７に対する選択単語の登録例を示す図である。例えば、図６（ｂ）のリモコン操作画面から「滋賀県」が正解単語として選択された場合、その選択単語をリンク単語の最上位（リンク単語１）に登録する。リンク単語１に新たな単語である「滋賀県」が登録された場合、それまで登録されていた「佐賀県」「千葉県」の単語は、リンク単語２以降に移動する。
【００４７】
このようなリンク単語の更新処理後に対話処理部４は、以下の情報を学習データとして保持し（ステップＳ３０）、Ｉ）〜Ｖ）の情報が揃った段階でこれらを話者適応化モジュール１０に送信する（ステップＳ３１）。
IV）選択単語（例：ジョイスティック１ｄで選択した「滋賀県」）
Ｖ）「選択結果＝正解」という情報
そして、対話処理部４は次階層に遷移する処理を実行し（ステップＳ３２）、誤認識訂正処理を終了する。なお、話者適応化モジュール１０は、対話処理部４から受け取ったＩ）〜Ｖ）の情報に基づいて話者適応化処理を実行する。
【００４８】
以上詳しく説明したように、本実施形態によれば、誤認識が発生した場合に、ユーザが過去にリモコン１を使って行った訂正結果を正解候補として提示するようにしたので、正解の確率が高い適切な訂正候補をユーザに提示することができる。これによりユーザは、件数の絞られた少ない正解候補の中から何れかを選択するという簡単な操作のみで、音声認識エンジン３の認識結果を確実に訂正することができるようになる。
【００４９】
また、本実施形態によれば、音声認識エンジン３による認識で正解が得られた単語および誤認識ボタン１ｃの操作後にリモコン操作画面で選択した単語を話者適応化モジュール１０に提供するようにしたので、これらの単語をユーザが本来入力したかった正解単語として用いることが可能となる。これにより、システムの通常の使用状態で話者適応化の学習を行うことができ、時間と労力が取られるエンロールをユーザがわざわざ行わなくても済む。しかも、音声認識処理のバックグラウンドで個々のユーザに適するように音響モデルをチューニングすることが可能となるので、ただの「不特定話者用音声認識」を用いた場合に比べて音声認識性能も良くなる。
【００５０】
なお、上記実施形態では操作部としてリモコン１を用いているが、タッチパネルであっても良い。
また、上記実施形態では、図６（ａ）の画面で「その他」を選択した場合に該当する階層の単語をリスト表示する例について説明したが、５０音を個別に入力するためのソフトウェアキーボードを表示するようにしても良い。
【００５１】
その他、上記各実施形態は、何れも本発明を実施するにあたっての具体化の一例を示したものに過ぎず、これらによって本発明の技術的範囲が限定的に解釈されてはならないものである。すなわち、本発明はその精神、またはその主要な特徴から逸脱することなく、様々な形で実施することができる。
【００５２】
【発明の効果】
本発明は上述したように、誤認識が発生した場合に、ユーザが過去に訂正していた結果を正解候補として提示するようにしたので、正解の確率が高い単語だけを適切な訂正候補としてユーザに提示することができる。これによりユーザは、音声の誤認識が発生した場合に、提示された正解候補の中から何れかを選択するという簡単な操作のみで認識結果を確実に訂正することができる。
【００５３】
また、本発明の他の特徴によれば、音声認識で正解が得られた単語および誤認識発生後に正解候補の中から選択された単語を、本来認識すべき認識結果として話者適応化手段に提供するようにしたので、通常の使用状態においても正解の単語を話者適応化手段で正確に把握することができ、話者適応化処理を適切に行うことが可能となる。これにより、時間と労力が取られるエンロールをユーザがわざわざ行わなくても、音声認識性能を確実に向上させることができる。
【図面の簡単な説明】
【図１】本実施形態による音声認識システムの構成例を示すブロック図である。
【図２】本実施形態による認識単語リンクＤＢの構造を示す概念図である。
【図３】本実施形態の音声認識システムに用意されている複数の発話コマンドに関する階層遷移状態を示す図である。
【図４】本実施形態による音声認識処理のうち階層処理の動作を示すフローチャートである。
【図５】本実施形態による音声認識処理のうち誤認識訂正処理の動作を示すフローチャートである。
【図６】本実施形態の誤認識ボタンの操作時に提示されるリモコン操作画面を示す図である。
【図７】本実施形態の認識単語リンクＤＢに対する選択単語の登録動作例を示す図である。
【符号の説明】
１リモコン
１ａ発話ボタン
１ｂ訂正ボタン
１ｃ誤認識ボタン
１ｄジョイスティック
１ｅＯＫボタン
２マイク
３音声認識エンジン
４対話処理部
５音声合成エンジン
６スピーカ
７認識単語リンクＤＢ
８画面表示制御部
９ディスプレイ
１０話者適応化モジュール
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a speech recognition system and to a method for correcting and learning from speech recognition results. It is particularly suitable for a speech recognition system in which the correspondence between each character string to be recognized and its speech pattern is registered as a speech dictionary, and the character string whose speech pattern has a high similarity to the input speech is recognized as the character string of the input speech.
[0002]
[Prior art]
Most recent vehicles are equipped with various electronic devices such as an audio device, an air conditioner, and a navigation device. Recently, systems have also been provided that allow these devices to be operated by speech recognition, in order to avoid one-handed driving and the like while operating them. With this speech recognition technology, the driver can operate the various electronic devices without taking his or her hands off the steering wheel (that is, without manually operating an operation unit such as a remote controller or an operation panel).
[0003]
A speech recognition system usually recognizes a specific word, phrase, or simple command sentence spoken by the user (all of which are simply referred to as a “word” in this specification) as an utterance command, and talks back the recognized word using speech synthesis. The user checks the recognized word that has been talked back and, if it is correct, makes an input to that effect. In response, the system performs control according to the recognized word. On the other hand, when the recognized word talked back by the system differs from the word the user spoke, the user performs voice input again.
[0004]
In such a speech recognition system, an acoustic model that associates the character string of each recognition target word with its speech pattern is registered in advance in a speech dictionary database. The feature amount calculated from the user's input speech is then compared with the feature amounts of the acoustic model to find the speech pattern with the highest similarity, and the character string having that speech pattern is recognized as the character string of the input speech.
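As a rough illustration of the similarity search described above, the following Python sketch picks the dictionary entry whose stored pattern is closest to the input. It is only a sketch under stated assumptions: the description does not specify the feature representation or the similarity measure, so the fixed-length feature vectors, the cosine similarity, and the names `cosine_similarity`, `recognize`, and `speech_dictionary` are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recognize(input_features, speech_dictionary):
    """Return the word whose registered pattern is most similar to the input."""
    return max(
        speech_dictionary,
        key=lambda word: cosine_similarity(input_features, speech_dictionary[word]),
    )

# Hypothetical two-dimensional "speech patterns", for illustration only.
speech_dictionary = {
    "Fukushima Prefecture": [1.0, 0.0],
    "Shiga Prefecture": [0.0, 1.0],
}
best = recognize([0.9, 0.2], speech_dictionary)
```

Because the result is simply the nearest registered pattern, a speaker whose voice sits far from every stored pattern can be misrecognized repeatedly.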
[0005]
In such a speech recognition system, misrecognition of uttered speech is unavoidable. Depending on the user's voice, misrecognition may even occur repeatedly. As a countermeasure, some systems present to the user not only the single word with the highest similarity but several words with high similarity, and have the user select one of them. A technique has also been proposed that improves the apparent recognition performance by successively promoting the recognition results ranked below the top result, once that top result has been judged a misrecognition, to the top position (see, for example, Patent Document 1).
[0006]
[Patent Document 1]
Japanese Patent Application Laid-Open No. 10-63295
[0007]
In addition, as techniques for raising the recognition rate itself, various “speaker adaptation” methods, which tune the acoustic model to each individual speaker to reduce misrecognition, have been studied (see, for example, Patent Document 2). A typical form of speaker adaptation is a method called “enrollment”. In enrollment, when the user first starts using the system, the user reads aloud words prepared in advance as instructed by the system, and learning is performed using the speech pattern of each instructed word and the speech pattern of the speaker's input (see, for example, Patent Documents 3 and 4).
[0008]
[Patent Document 2]
Japanese Patent Laid-Open No. 7-230295
[Patent Document 3]
JP 2002-132288 A
[Patent Document 4]
JP 2000-148198 A
[0009]
[Problems to be solved by the invention]
However, even with a function that presents several words with high similarity, when the system misrecognizes repeatedly, the acoustic model of the recognition engine and the user's speech pattern are often far apart, so all of the correct-answer candidates presented on the basis of similarity are often wrong. Consequently, the user cannot make use of the candidates even though several are presented, and must either repeat the voice input many times or give up on voice input and enter the command with a remote controller or the like. This has been a problem.
[0010]
Moreover, even if the apparent recognition performance is improved as in Patent Document 1, the actual recognition rate does not improve. To actually raise the recognition rate, speaker adaptation processing is required. However, in normal use, where misrecognition can occur, it does not always hold that “the word recognized by the system = the correct word that the user wants to input”. Speaker adaptation that relies only on the results of voice input therefore cannot reliably raise the recognition rate.
[0011]
It is also conceivable to use a command that the user inputs by operating a remote controller or the like as the correct value for speaker adaptation. However, the system cannot tell whether a command entered with the remote controller was a correction entered after the user switched to remote controller operation because of repeated misrecognition, or was entered by an arbitrary user operation unrelated to speech recognition. For this reason, commands input with a remote controller or the like could not be used as correct values for speaker adaptation.
[0012]
Under these circumstances, in-vehicle electronic devices have generally used enrollment, in which the correct words are known in advance, as the speaker adaptation technique. However, to raise the recognition rate by enrollment, the user must go to the trouble of reading aloud a number of words prepared in advance by the system when first starting to use it. This places an extra burden on the user for something not directly related to the operation the user actually wants to perform on the electronic device.
[0013]
The present invention has been made to solve such a problem, and it is an object of the present invention to make it possible to reliably and easily correct a recognition result when erroneous recognition of speech occurs.
Another object of the present invention is to make it possible to reliably and easily correct the result of speech recognition, and to actually improve speech recognition performance without performing enrollment that takes time and effort.
[0014]
[Means for Solving the Problems]
To solve the above problems, in the present invention a misrecognized word and the correct word with which the user corrected it are registered as a pair in a database as a correction record. The next time misrecognition occurs for the same uttered speech, the correct answer that the user previously chose is presented, on the basis of that record, as a correct-answer candidate. According to the present invention configured in this way, when misrecognition occurs, only words that are likely to be correct in light of past corrections can be presented to the user.
[0015]
In another aspect of the present invention, the candidate that the user selects after the correct-answer candidates have been presented as described above is provided to the speaker adaptation means as the recognition result that should originally have been obtained. According to the present invention configured in this way, the system can accurately grasp “the correct answer that the user wants to input” even in normal use, and speaker adaptation can be performed appropriately using that correct answer and the uttered speech.
[0016]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, an embodiment of the present invention will be described with reference to the drawings. FIG. 1 is a block diagram illustrating a configuration example of the speech recognition system according to the present embodiment.
[0017]
As shown in FIG. 1, the speech recognition system of the present embodiment includes an operation unit such as a remote controller 1, a microphone 2, a speech recognition engine 3, a dialogue processing unit 4, a speech synthesis engine 5, a speaker 6, a recognized word link DB (database) 7, a screen display control unit 8, a display 9, and a speaker adaptation module 10. The remote controller 1 includes an utterance button 1a, a correction button 1b, a misrecognition button 1c, a joystick 1d, and an OK button 1e.
[0018]
The remote controller 1 is an operating device with which the user performs various operations on an electronic device (such as an audio device or a navigation device) that uses the speech recognition system of the present embodiment; operations for speech recognition are also performed with the remote controller 1. The utterance button 1a is a button for instructing the start of speech recognition processing. That is, the system enters the voice input acceptance state at the moment the utterance button 1a is operated. The joystick 1d is a control used to correct a misrecognized word to the correct word when misrecognition occurs.
[0019]
The correction button 1b is a button for instructing that voice input be redone, for example when the user has mistakenly uttered a word different from the one intended. The misrecognition button 1c is a button for instructing that a misrecognized word be corrected using the joystick 1d when misrecognition occurs, that is, when the recognized word talked back by the system differs from the word the user spoke.
[0020]
In this embodiment, the “return button”, which was conventionally a single button used to instruct both redoing voice input and correcting a misrecognized word, is thus split into two buttons: the correction button 1b and the misrecognition button 1c. This allows the system to determine unambiguously that correction of the recognized word has been requested because speech was misrecognized.
[0021]
The OK button 1e is a button for confirming a desired menu item selected with the joystick 1d, or for instructing the system to set information entered through speech recognition (for example, a destination) once its final content is correct. The OK button 1e may be provided as a dedicated button as shown in FIG. 1, or may be combined with the utterance button 1a or the joystick 1d.
[0022]
The speech recognition engine 3 compares the uttered speech input from the microphone 2 with a speech dictionary prepared in advance and recognizes the word corresponding to the uttered speech. The command corresponding to the uttered speech is then executed, via the dialogue processing unit 4, on an audio device or navigation device (not shown).
[0023]
The speech synthesis engine 5 synthesizes speech for the word recognized by the speech recognition engine 3 and talks it back through the speaker 6. The user listens to the recognized word that has been talked back and checks whether misrecognition has occurred. If there is no misrecognition, the user performs voice input for the next process. If there is misrecognition, the user presses the misrecognition button 1c and corrects the recognized word. The screen display control unit 8 controls the display 9 to show the remote control operation screen used when correcting the recognized word.
[0024]
The dialogue processing unit 4 executes the series of dialogue processes with the user during speech recognition. These include instructing the speech recognition engine 3 to start speech recognition processing in response to the user's operation of the utterance button 1a, supplying the word recognized by the speech recognition engine 3 to the speech synthesis engine 5 so that it is talked back to the user, and controlling the screen display control unit 8 to provide the remote control operation screen to the user when the user presses the misrecognition button 1c after the talkback.
[0025]
In addition, when the dialogue processing unit 4 detects that the speech recognition engine 3 has misrecognized the uttered speech, that is, when the user presses the misrecognition button 1c, it registers the misrecognized word (the recognition result of the speech recognition engine 3) in the recognized word link DB 7 in association with the correct word to which the user corrected it using the joystick 1d after operating the misrecognition button 1c. The dialogue processing unit 4 thus constitutes the correct word registration means of the present invention.
[0026]
When the dialogue processing unit 4 controls the screen display control unit 8 to present the above remote control operation screen on the display 9 for correcting a recognized word, it reads from the recognized word link DB 7 the words to which the user has previously corrected the word misrecognized this time, and presents them to the user as a list of correct-answer candidates. The dialogue processing unit 4 and the screen display control unit 8 thus constitute the correct candidate presenting means of the present invention.
[0027]
Further, the dialogue processing unit 4 also provides the speaker adaptation module 10 with the uttered speech input from the microphone 2 and the corresponding correct word (the recognition result when there is no misrecognition, or the correction result when there is).
[0028]
For example, when misrecognition of the uttered speech by the speech recognition engine 3 is not detected, that is, when the utterance button 1a is pressed without the misrecognition button 1c having been pressed, the dialogue processing unit 4 provides the speaker adaptation module 10 with the uttered speech at that time and the recognition result of the speech recognition engine 3. When the misrecognition button 1c is pressed and the recognized word is corrected, it provides the speaker adaptation module 10 with the uttered speech at that time and the correction result (the word selected from the correct-answer candidates). The dialogue processing unit 4 thus also constitutes the information providing means of the present invention.
[0029]
The speaker adaptation module 10 performs speaker adaptation processing using the pattern of the uttered speech from the microphone 2, provided by the dialogue processing unit 4, and the pattern of the correct speech. The speaker adaptation module 10 holds the correct speech patterns in advance as an acoustic model, and performs speaker adaptation using the speech pattern that corresponds to the correct word notified by the dialogue processing unit 4. Various methods can be used for the speaker adaptation processing itself; since all of them are known, a detailed description is omitted here.
[0030]
FIG. 2 is a conceptual diagram showing the data structure of the recognized word link DB 7. In FIG. 2, a “link word” is a word to which the user has previously corrected a recognition result using the remote controller 1. For example, the recognized word link DB 7 records that, for results the speech recognition engine 3 misrecognized as “Fukushima Prefecture”, the user has previously made corrections to “Saga Prefecture” or “Chiba Prefecture” using the remote controller 1.
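The data structure of FIG. 2 can be pictured as a mapping from each misrecognized word to an ordered list of link words. The Python sketch below is a minimal illustration, not the patented implementation: the class name `RecognizedWordLinkDB` and its methods are hypothetical. Placing a newly selected word at the top (link word 1) and shifting the existing link words down follows the registration example given in the description; removing an earlier copy of the same word before re-inserting it is an assumption, since the description does not say how repeats are handled.

```python
class RecognizedWordLinkDB:
    """Maps a misrecognized word to the words the user corrected it to."""

    def __init__(self):
        self._links = {}  # misrecognized word -> list of link words, newest first

    def lookup(self, misrecognized):
        """Return a copy of the link words for this word (empty list if none)."""
        return list(self._links.get(misrecognized, []))

    def register(self, misrecognized, correct):
        """Register the corrected word as link word 1; older entries shift down."""
        words = self._links.setdefault(misrecognized, [])
        if correct in words:      # assumption: avoid duplicate entries
            words.remove(correct)
        words.insert(0, correct)

db = RecognizedWordLinkDB()
db.register("Fukushima Prefecture", "Chiba Prefecture")
db.register("Fukushima Prefecture", "Saga Prefecture")
db.register("Fukushima Prefecture", "Shiga Prefecture")
candidates = db.lookup("Fukushima Prefecture")
```

Keeping the most recent correction first means the candidate the user chose last time appears at the top of the list presented next time.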
[0031]
Next, the operation of the speech recognition system of the present embodiment configured as described above will be described. Before doing so, the state transitions of the utterance commands on which that operation is premised are explained. Usually, the utterance commands prepared in a system are managed in several levels according to the operations performed on the system. For example, when a destination is set by address in the navigation device, the address is input in three levels as shown in FIG. 3, and finally the OK button 1e is pressed to set the input address as the destination.
[0032]
That is, in the example of FIG. 3, the initial-state level manages words such as “address” and “telephone number”. If, for example, “address” is uttered at this level, the process moves down one level to level 1. Level 1 manages prefecture names, and words such as “Fukushima Prefecture”, “Saga Prefecture”, and “Chiba Prefecture” can be input as utterance commands. When the desired prefecture name is uttered at level 1, the process moves down one more level to level 2, where a municipality name can be input as an utterance command.
[0033]
Similarly, when the desired municipality name is uttered at level 2, the process moves down one more level to level 3, where the remaining part of the address can be input as an utterance command. When the remaining part of the address is uttered, the process moves to the final level 4. At level 4, pressing the OK button 1e sets the address input by utterance as the destination. When the correction button 1b or the misrecognition button 1c is pressed at any of levels 1 to 4, return processing is performed and the process goes back up one level.
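The level transitions of FIG. 3 amount to a small state machine. The following Python sketch is one hypothetical way to model it; the class `CommandHierarchy`, the `LEVELS` labels, and the clamping at the first and last levels are illustrative assumptions rather than details taken from the description.

```python
LEVELS = [
    "initial (address / telephone number ...)",
    "level 1 (prefecture)",
    "level 2 (municipality)",
    "level 3 (rest of address)",
    "level 4 (confirm with OK button)",
]

class CommandHierarchy:
    """Tracks which level of the utterance-command hierarchy is active."""

    def __init__(self):
        self.level = 0  # start in the initial state

    def advance(self):
        """Move one level down after an utterance is accepted."""
        self.level = min(self.level + 1, len(LEVELS) - 1)
        return LEVELS[self.level]

    def go_back(self):
        """Return one level up (correction or misrecognition button)."""
        self.level = max(self.level - 1, 0)
        return LEVELS[self.level]

hierarchy = CommandHierarchy()
hierarchy.advance()               # "address" uttered: initial -> level 1
hierarchy.advance()               # prefecture uttered: level 1 -> level 2
current = hierarchy.go_back()     # misrecognition button: back to level 1
```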
[0034]
FIGS. 4 and 5 are flowcharts showing an example of the speech recognition processing according to the present embodiment. FIG. 4 shows the level processing performed within each level shown in FIG. 3, and FIG. 5 shows the misrecognition correction processing included in FIG. 4.
[0035]
In FIG. 4, the dialogue processing unit 4 determines whether the utterance button 1a has been pressed (step S1). If it determines that the utterance button 1a has been pressed, the dialogue processing unit 4 activates the speech recognition engine 3, sets it to the voice input acceptance mode, and further determines whether the system is in the initial state of FIG. 3 (step S2).
[0036]
If the system is not in the initial state, the dialogue processing unit 4 judges that the speech recognition at the previous level produced the correct answer, holds the following information as learning data (step S3), and transmits it to the speaker adaptation module 10 (step S4).
i) Waveform data of the uttered speech (e.g., the user's voice waveform when uttering “Shiga Prefecture”)
ii) Recognition result (e.g., “Shiga Prefecture”)
iii) The information “recognition result = correct answer”
[0037]
Thereafter, the user utters the desired word, which is input from the microphone 2 (step S5). In response, the speech recognition engine 3 temporarily leaves the voice input acceptance mode and performs recognition processing on the input word. The speech synthesis engine 5 then synthesizes the recognition result and talks it back through the speaker 6 (step S6). After the talkback, the dialogue processing unit 4 executes processing to move to the next level (step S7).
[0038]
Note that the speaker adaptation module 10 executes speaker adaptation processing based on the information i) to iii) provided by the dialogue processing unit 4 in step S4, using, for example, a speaker adaptation algorithm based on parameter updates.
[0039]
If it is determined in step S1 that the utterance button 1a has not been pressed, the dialogue processing unit 4 determines whether the correction button 1b has been pressed (step S8). When the correction button 1b has been pressed, the dialogue processing unit 4 executes return processing to move to the previous level (step S9).
[0040]
On the other hand, when determining that the correction button 1b has not been pressed, the dialogue processing unit 4 further determines whether or not the misrecognition button 1c has been pressed (step S10). When the misrecognition button 1c has been pressed, the dialogue processing unit 4 determines that the result obtained by the speech recognition in the previous hierarchy is an error, and holds the following information as learning data (step S11).
I) Waveform data of the speech (e.g., the user's voice waveform when uttering “Shiga Prefecture”)
II) Recognition result (e.g., “Fukushima Prefecture”)
III) The information “recognition result = error”
Then, the dialogue processing unit 4 executes the return process for transitioning to the previous hierarchy (step S12), and then executes the misrecognition correction process shown in FIG. 5 (step S13).
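The button handling of FIG. 4 described above (steps S1 to S13) can be summarized in a small dispatch function. The following is a minimal sketch under stated assumptions: hierarchies are modeled as integers with 0 as the initial state, and the Button enum, the handle_button function and the action strings are all hypothetical names introduced for illustration.

```python
from enum import Enum, auto
from typing import List, Optional, Tuple


class Button(Enum):
    SPEECH = auto()          # speech button 1a
    CORRECTION = auto()      # correction button 1b
    MISRECOGNITION = auto()  # misrecognition button 1c


def handle_button(button: Optional[Button], layer: int) -> Tuple[int, List[str]]:
    """Return the new hierarchy index and the actions to carry out.

    Assumption: layer 0 is the initial state; the action names are
    placeholders for the processing described in the text.
    """
    actions: List[str] = []
    if button is Button.SPEECH:
        if layer > 0:
            # S3, S4: the previous result is treated as correct
            actions.append("send_correct_learning_data")
        actions.append("recognize_and_talk_back")   # S5, S6
        return layer + 1, actions                   # S7: next hierarchy
    if button is Button.CORRECTION:
        return max(layer - 1, 0), actions           # S9: previous hierarchy
    if button is Button.MISRECOGNITION:
        actions.append("hold_error_learning_data")  # S11
        actions.append("run_misrecognition_correction")  # S13, after S12
        return max(layer - 1, 0), actions           # S12: previous hierarchy
    return layer, actions  # no button pressed: nothing changes
```

The three buttons are checked in the same order as in the flowchart: speech first, then correction, then misrecognition.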
[0041]
In FIG. 5, the dialogue processing unit 4 searches the recognized word link DB 7 using the word erroneously recognized by the speech recognition engine 3 (“Fukushima Prefecture” in the above example) as a key (step S21). Based on this search, it determines whether or not a link word, that is, a word with which the user previously corrected the erroneously recognized word, is registered in the recognized word link DB 7 (step S22).
[0042]
When one or more such link words are found, a remote control operation screen such as that shown in FIG. 6A, which includes the link words as correct answer candidates and additionally the word “other”, is presented on the display 9 (step S23). If the actual correct answer is among the correct answer candidates, the user operates the joystick 1d to select it. In this case, the dialogue processing unit 4 confirms that a word has been selected on the remote control operation screen shown in FIG. 6A (step S24), and determines whether or not the selected word is “other”. If a link word other than “other” is selected from among the correct answer candidates, the process jumps to step S29.
[0043]
On the other hand, when the actual correct answer is not among the correct answer candidates shown on the screen of FIG. 6A (that is, when the user selects “other” by operating the joystick 1d), or when it is determined in step S22 that no link word is registered in the recognized word link DB 7, all the words selectable in that scene are retrieved and displayed as a list as shown in FIG. 6B (step S26). The user selects the correct word from the list by operating the joystick 1d (steps S27 and S28).
[0044]
Here, the words selectable in the scene are the words of the hierarchy concerned. In the example of FIG. 6B, all the words in hierarchy 1 of FIG. 3, which manages prefecture names such as “Fukushima Prefecture”, “Saga Prefecture”, and “Chiba Prefecture”, are displayed as a list.
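The choice between the two screens (FIG. 6A and FIG. 6B) in steps S21 to S26 can be sketched as below. This is a sketch under assumptions: the recognized word link DB 7 is modeled as a plain dictionary mapping a misrecognized word to its list of link words, and the function name candidates_to_show and its parameters are hypothetical.

```python
from typing import Dict, List

OTHER = "other"  # the extra entry shown on the FIG. 6A screen


def candidates_to_show(link_db: Dict[str, List[str]],
                       layer_words: List[str],
                       misrecognized: str,
                       chose_other: bool = False) -> List[str]:
    """Return the list of words to display for a misrecognized word.

    link_db      -- assumed dictionary form of the recognized word link DB 7
    layer_words  -- all words selectable in the current hierarchy
    chose_other  -- True once the user has picked "other" on the first screen
    """
    link_words = link_db.get(misrecognized, [])   # S21: search by key
    if link_words and not chose_other:            # S22: link words found
        return link_words + [OTHER]               # S23: FIG. 6A screen
    return list(layer_words)                      # S26: FIG. 6B full list
```

With a DB entry of “Saga Prefecture” and “Chiba Prefecture” for the key “Fukushima Prefecture”, the first screen shows those two words plus “other”, and the full prefecture list appears only when “other” is chosen or when no entry exists.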
[0045]
When any word is selected on the remote control operation screen shown in FIG. 6A or 6B, the dialogue processing unit 4 registers the selected word in the recognized word link DB 7 (step S29).
[0046]
FIG. 7 is a diagram illustrating an example of registering the selected word in the recognized word link DB 7. For example, when “Shiga Prefecture” is selected as the correct word on the remote control operation screen of FIG. 6B, the selected word is registered at the top of the link words (link word 1). When the new word “Shiga Prefecture” is registered as link word 1, the previously registered words “Saga Prefecture” and “Chiba Prefecture” move down to link word 2 and later.
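The registration behavior of FIG. 7 amounts to inserting the selected word at the head of the link word list. A minimal sketch follows, again modeling the recognized word link DB 7 as a dictionary; the function name register_link_word is hypothetical, and the handling of a word that is already registered (moving it to the top rather than duplicating it) is an assumption, since the text does not describe that case.

```python
from typing import Dict, List


def register_link_word(link_db: Dict[str, List[str]],
                       misrecognized: str, selected: str) -> None:
    """Register `selected` as link word 1 for `misrecognized` (step S29)."""
    link_words = link_db.setdefault(misrecognized, [])
    if selected in link_words:
        # Assumption: an already registered word is moved, not duplicated.
        link_words.remove(selected)
    # Existing entries shift down to link word 2 and later.
    link_words.insert(0, selected)


db = {"Fukushima Prefecture": ["Saga Prefecture", "Chiba Prefecture"]}
register_link_word(db, "Fukushima Prefecture", "Shiga Prefecture")
print(db["Fukushima Prefecture"])
```

Keeping the most recent correction first means the word the user most recently needed is the easiest one to reach on the next candidate screen.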
[0047]
After the link word update processing, the dialogue processing unit 4 holds the following information as learning data (step S30), and once all of the information I) to V) is available, transmits it to the speaker adaptation module 10 (step S31).
IV) Selected word (e.g., “Shiga Prefecture” selected with the joystick 1d)
V) The information “selection result = correct answer”
Then, the dialogue processing unit 4 executes a process of transitioning to the next layer (step S32), and ends the misrecognition correction process. The speaker adaptation module 10 executes speaker adaptation processing based on the information I) to V) received from the dialogue processing unit 4.
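The five items I) to V) together pair the original utterance with the word the user actually meant. One way to picture this record is sketched below. The class CorrectionSample and the function training_label are hypothetical names, and how the speaker adaptation module 10 uses the resulting label internally is not specified in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CorrectionSample:
    """Hypothetical container for learning data I) to V)."""
    waveform: bytes                      # I) waveform data of the utterance
    recognized_word: str                 # II) (erroneous) recognition result
    recognition_is_error: bool = True    # III) "recognition result = error"
    selected_word: Optional[str] = None  # IV) word chosen with the joystick
    selection_is_correct: bool = False   # V) "selection result = correct answer"

    def is_complete(self) -> bool:
        # Step S31 transmits only once I) to V) are all available.
        return self.selected_word is not None and self.selection_is_correct


def training_label(sample: CorrectionSample) -> str:
    """Return the word the waveform should be associated with."""
    if not sample.is_complete():
        raise ValueError("items IV) and V) are not yet available")
    # The user-selected word, not the recognizer's output, is the label.
    return sample.selected_word
```

In the running example, the waveform recorded for the utterance is labeled “Shiga Prefecture” even though the engine output “Fukushima Prefecture”.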
[0048]
As described above in detail, according to the present embodiment, when a misrecognition occurs, the corrections that the user previously made using the remote controller 1 are presented as correct answer candidates, so that only words with a high probability of being correct are presented to the user as appropriate correction candidates. As a result, the user can reliably correct the recognition result of the speech recognition engine 3 by the simple operation of selecting one of a small number of correct answer candidates.
[0049]
In addition, according to the present embodiment, both the words correctly recognized by the speech recognition engine 3 and the words selected on the remote control operation screen after operation of the misrecognition button 1c are provided to the speaker adaptation module 10, so these words can be used as the correct words that the user originally intended to input. As a result, speaker adaptation learning can be performed during normal use of the system, and the user does not have to perform time-consuming enrollment. Furthermore, because the acoustic model can be tuned to the individual user in the background of the speech recognition process, speech recognition performance is better than when only speaker-independent speech recognition is used.
[0050]
In the above embodiment, the remote controller 1 is used as the operation unit, but a touch panel may be used instead.
In the above embodiment, an example has been described in which a list of the words of the corresponding hierarchy is displayed when “other” is selected on the screen of FIG. 6A. However, a software keyboard for entering the individual characters of the 50-sound table (the Japanese syllabary) may be displayed instead.
[0051]
In addition, each of the above-described embodiments is merely an example of the embodiment for carrying out the present invention, and the technical scope of the present invention should not be construed in a limited manner. In other words, the present invention can be implemented in various forms without departing from the spirit or main features thereof.
[0052]
[Effects of the invention]
As described above, in the present invention, when a misrecognition occurs, the results of corrections made by the user in the past are presented as correct answer candidates, so that only words with a high probability of being correct are presented as appropriate correction candidates. As a result, when a speech misrecognition occurs, the user can reliably correct the recognition result by the simple operation of selecting one of the presented correct answer candidates.
[0053]
According to another feature of the present invention, a word correctly obtained by speech recognition and a word selected from the correct answer candidates after a misrecognition are provided to the speaker adaptation means as the results that should have been recognized. The speaker adaptation means can therefore accurately identify the correct word even during normal use and perform the speaker adaptation process appropriately. Thus, speech recognition performance can be reliably improved without the user having to perform time-consuming enrollment.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a configuration example of a speech recognition system according to an embodiment.
FIG. 2 is a conceptual diagram showing a structure of a recognized word link DB according to the present embodiment.
FIG. 3 is a diagram showing a hierarchy transition state related to a plurality of utterance commands prepared in the voice recognition system of the present embodiment.
FIG. 4 is a flowchart showing the operation of hierarchical processing in the speech recognition processing according to the present embodiment.
FIG. 5 is a flowchart showing an operation of an erroneous recognition correction process in the voice recognition process according to the present embodiment.
FIG. 6 is a diagram showing a remote control operation screen presented when operating a misrecognition button according to the present embodiment.
FIG. 7 is a diagram illustrating an operation example of registering a selected word with respect to a recognized word link DB according to the present embodiment.
[Explanation of symbols]
1 Remote controller
1a Speech button
1b Correction button
1c Misrecognition button
1d Joystick
1e OK button
2 Microphone
3 Speech recognition engine
4 Dialogue processing unit
5 Speech synthesis engine
6 Speaker
7 Recognized word link DB
8 Screen display controller
9 Display
10 Speaker adaptation module

Claims

A speech recognition system comprising:
speech recognition means for recognizing a word related to uttered speech by comparing the input uttered speech with a speech dictionary prepared in advance;
correct answer candidate presentation means for, when misrecognition of the uttered speech by the speech recognition means is detected, reading from a recognized word link database a word with which the user previously corrected the misrecognized word, presenting it as a correct answer candidate, and prompting the user to select a correct word;
correct word registration means for associating the word erroneously recognized by the speech recognition means with the correct word selected by the user through the processing of the correct answer candidate presentation means, and registering them in the recognized word link database;
speech synthesis means for synthesizing as speech and talking back the word recognized by the speech recognition means;
a misrecognition button to be operated when the user who input the uttered speech checks the recognized speech talked back by the speech synthesis means and judges it to be misrecognized; and
a correction button to be operated when the user who input the uttered speech instructs re-input of the uttered speech,
wherein the presence or absence of misrecognition of the uttered speech by the speech recognition means is detected according to whether or not the misrecognition button has been operated, and
the system transitions to a state in which the user can re-input the uttered speech in response to operation of the correction button.

The speech recognition system according to claim 1, further comprising:
speaker adaptation means for performing speaker adaptation processing using the input speech pattern and a correct speech pattern; and
information providing means for providing the speaker adaptation means with the input uttered speech and the recognition result of the speech recognition means when misrecognition of the uttered speech is not detected, and with the input uttered speech and the correction result made by the user through the processing of the correct answer candidate presentation means when misrecognition of the uttered speech is detected.

A speech recognition correction / learning method comprising:
a first step of recognizing a word related to uttered voice by comparing the input uttered voice with a voice dictionary prepared in advance;
a second step of synthesizing as speech and talking back the word recognized in the first step;
a third step of detecting whether or not a correction button, to be operated when the user who input the uttered voice instructs re-input of the uttered voice, has been operated;
a fourth step of transitioning to a state in which the user can re-input the uttered voice when operation of the correction button is detected in the third step;
a fifth step of detecting the presence or absence of misrecognition of the uttered voice according to whether or not the user who input the uttered voice, having checked the recognized voice talked back in the second step, has judged it to be misrecognized;
a sixth step of, when misrecognition of the uttered voice is detected in the fifth step, reading from a recognized word link database a word with which the user previously corrected the misrecognized word, presenting it as a correct answer candidate, and prompting the user to select a correct word from among the correct answer candidates;
a seventh step of registering, in association with each other in the recognized word link database, the word detected as misrecognized in the fifth step and the correct word selected by the user through the processing of the sixth step;
an eighth step of providing a speaker adaptation unit with the uttered voice input in the first step and the correction result made by the user through the processing of the sixth step; and
a ninth step in which the speaker adaptation unit performs speaker adaptation processing using the speech pattern input in the first step and the correct speech pattern based on the correction result provided in the eighth step.

The speech recognition correction / learning method according to claim 3, wherein, when no misrecognition of the uttered voice is detected, the processing of the sixth step and the seventh step is not performed, and in the eighth step the uttered voice input in the first step and the recognition result obtained in the first step are provided to the speaker adaptation unit.