JP4209247B2 - Speech recognition apparatus and method

Info

Publication number: JP4209247B2
Application number: JP2003127378A
Authority: JP
Inventors: 修一松本; 徹丸本
Original assignee: Alpine Electronics Inc
Current assignee: Alpine Electronics Inc
Priority date: 2003-05-02
Filing date: 2003-05-02
Publication date: 2009-01-14
Anticipated expiration: 2023-05-02
Also published as: DE602004014675D1; US20040260549A1; EP1475781A3; EP1475781B1; CN1258753C; EP1475781A2; US7552050B2; JP2004333704A; CN1542734A

Description

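[0001]
[Technical field of the invention]
The present invention relates to a speech recognition apparatus and method that recognize voice commands uttered by a user in order to control a device, and is particularly suited to a speech recognition apparatus having a talkback function that feeds the recognized utterance back to the user.
[0002]
[Prior art]
Conventionally, in fields such as vehicle-mounted navigation devices, hands-free devices, and personal computers, a speech recognition apparatus can be used in addition to a remote control, touch panel, keyboard, or mouse so that the user can operate the device by voice input.
[0003]
In this type of speech recognition apparatus, pressing a provided speech button puts the apparatus into speech recognition mode, in which it recognizes the user's utterance and executes a command. There are two main utterance methods. In the first method, pressing the speech button once puts the apparatus into speech recognition mode, and the apparatus prompts the user for voice input as needed, so that the user and the apparatus interact conversationally. In the second method, voice input is enabled for a predetermined time each time the user presses the speech button.
[0004]
Most speech recognition apparatuses have a talkback function that feeds the recognized utterance back to the user through a speaker or the like. The user listens to the talkback to check whether it is correct, repeats the voice input if it is wrong, and tells the apparatus so if it is correct. On receiving this instruction, the apparatus executes the corresponding control.
[0005]
The voice commands provided by a speech recognition apparatus are usually managed in multiple layers according to the operations to be performed on the controlled device. For example, to set a destination by address in a navigation device, the address is entered by voice in several layers, such as "prefecture → municipality → rest of the address".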
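The layered command structure described above can be pictured as a small tree. The following is only a minimal sketch: the place names, the `COMMANDS` dictionary, and the `candidates` helper are hypothetical and do not come from the patent.

```python
from typing import Dict, List, Optional

# Hypothetical command tree: each key is an accepted utterance,
# each value holds the utterances accepted at the next layer.
COMMANDS: Dict[str, dict] = {
    "Tokyo": {"Shinagawa": {}, "Minato": {}},
    "Fukushima": {"Iwaki": {}},
}

def candidates(path: List[str]) -> Optional[List[str]]:
    """Return the utterances accepted after `path`, or None if `path` is invalid."""
    node = COMMANDS
    for word in path:
        if word not in node:
            return None
        node = node[word]
    return sorted(node)
```

Each recognized command narrows the vocabulary for the next layer, which is why one talkback per layer adds up to a long dialog.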
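[0006]
In this case, talkback is performed after the voice input at each layer, so completing a series of voice inputs often takes a long time. Attempts have been made to shorten the recognition time. As one example, an apparatus has been proposed that reduces the amount of computation for talkback in order to shorten the recognition time (see, for example, Patent Document 1).
[0007]
[Patent Document 1]
Japanese Unexamined Patent Application Publication No. H6-149287
[0008]
[Problems to be solved by the invention]
However, a conventional speech recognition apparatus does not accept the next voice input while talkback is in progress, because talkback voice mixed into the utterance makes misrecognition more likely. FIG. 4(a) is a timing chart showing how the voice input acceptance state changes in a conventional speech recognition apparatus. FIG. 4(a) shows the change for the first utterance method described above.
[0009]
As shown in FIG. 4(a), in the first utterance method, when the user first presses the speech button the apparatus enters speech recognition mode and accepts voice input for a predetermined time. The user utters the desired voice command during that time. Once the command has been uttered, the apparatus performs recognition processing and talkback processing on the input speech, and during this period it does not accept voice input. When the talkback ends, the apparatus accepts voice input again and the next voice input becomes possible.
[0010]
Thus, in the first utterance method, the next voice input is not accepted until the talkback ends, so the user cannot speak at a time of the user's choosing. That is, the user has to wait until the talkback has finished, so a series of voice inputs takes a long time.
[0011]
With the second utterance method, on the other hand, the user can interrupt the talkback by pressing the speech button and then make the next voice input. In this case, however, when voice input spans multiple layers, the speech button must be pressed for the voice input at every layer, which makes the operation very cumbersome.
[0012]
The present invention has been made to solve these problems, and its object is to shorten the operation time of speech recognition without cumbersome operations such as pressing the speech button many times.
[0013]
[Means for solving the problems]
To solve the above problems, the speech recognition apparatus of the present invention uses adaptive filter means to simulate the talkback voice that is output from the speaker and picked up by the microphone, and subtracts this simulated talkback voice from the microphone input, thereby extracting only the uttered speech from a microphone input in which uttered speech and talkback voice are mixed.
[0014]
With the present invention configured in this way, even if voice input is made while talkback is in progress, the talkback voice is removed and only the uttered speech is extracted and supplied to the speech recognition engine. This suppresses misrecognition of speech uttered during talkback and makes it possible to accept voice input at any time, including during talkback.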
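[0015]
[Embodiments of the invention]
(First embodiment)
A first embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing the main configuration of the speech recognition apparatus according to the first embodiment.
[0016]
As shown in FIG. 1, the speech recognition apparatus 100 of this embodiment comprises a volume or equalizer (hereinafter simply "volume etc.") 1, a gain control unit 2, an output amplifier 3, an adaptive filter (ADF) 4, a subtractor 5, a voice output unit 51, a speaker 52, a microphone 53, and a speech recognition engine 54.
[0017]
The voice output unit 51 generates and outputs the talkback voice. The speaker 52 outputs the talkback voice after its gain has been controlled by the volume etc. 1 and it has been amplified by the output amplifier 3. The microphone 53 is intended for inputting uttered speech, but in practice it picks up not only the uttered voice command but also the talkback voice output from the speaker 52 and ambient noise such as driving noise. The speech recognition engine 54 recognizes the uttered speech input through the microphone and executes the corresponding command on a controlled device (for example, a navigation device), which is not shown.
[0018]
As shown in FIG. 2, the adaptive filter 4 includes a coefficient identification unit 21 and a voice correction filter 22. The coefficient identification unit 21 is a filter for identifying the transfer function of the acoustic path between the speaker 52 and the microphone 53 (that is, the filter coefficients of the voice correction filter 22), and uses an adaptive filter based on the LMS (Least Mean Square) algorithm or the N-LMS (Normalized LMS) algorithm. The coefficient identification unit 21 operates so as to minimize the power of the error e(n) output from the subtractor 5, and thereby identifies the impulse response of the acoustic path.
[0019]
The voice correction filter 22 convolves the filter coefficients w(n) determined by the coefficient identification unit 21 with the talkback voice x(n) to be controlled, which gives the talkback voice x(n) the same transfer characteristic as the acoustic path described above. This generates the simulated talkback voice y(n), which simulates the talkback voice at the position of the microphone 53. The adaptive filter 4 thus constitutes the adaptive filter means of the present invention.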
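The interplay of coefficient identification, convolution, and subtraction can be sketched as a plain N-LMS loop. This is only an illustration under stated assumptions: the function name `nlms_cancel`, the tap count, the step size `mu`, and the impulse response `h` are hypothetical, and the microphone signal contains talkback only (no speech or noise) so that convergence is easy to see.

```python
import random

def nlms_cancel(x, d, taps=8, mu=0.5, eps=1e-8):
    """x: talkback reference x(n); d: microphone samples. Returns (e, w)."""
    w = [0.0] * taps            # filter coefficients w(n)
    buf = [0.0] * taps          # most recent reference samples, newest first
    e_out = []
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]
        y = sum(wi * bi for wi, bi in zip(w, buf))   # simulated talkback y(n)
        e = dn - y                                   # subtractor output e(n)
        norm = sum(b * b for b in buf) + eps         # N-LMS normalization
        w = [wi + mu * e * bi / norm for wi, bi in zip(w, buf)]
        e_out.append(e)
    return e_out, w

random.seed(0)
h = [0.6, 0.3, -0.2, 0.1]       # hypothetical speaker-to-microphone impulse response
x = [random.uniform(-1, 1) for _ in range(4000)]
d = [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
     for n in range(len(x))]
e, w = nlms_cancel(x, d)        # w approaches h, and e(n) decays toward zero
```

When the user speaks, the speech is uncorrelated with x(n), so it stays in e(n) while the talkback component is subtracted; real systems usually slow or freeze adaptation during such double-talk, a detail the description does not go into.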
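[0020]
The subtractor 5 subtracts the simulated talkback voice y(n) generated by the adaptive filter 4 from the sound input through the microphone 53 (a mixture of the voice command, the talkback voice, and ambient noise), thereby extracting the voice command (uttered speech) and the ambient noise (for example, driving noise). The subtractor 5 thus constitutes the uttered-speech extraction means of the present invention.
[0021]
The mixture of uttered speech and ambient noise extracted by the subtractor 5 is supplied to the speech recognition engine 54. The speech recognition engine 54 performs noise processing and then recognizes the voice command. The noise processing here is typical conventional processing such as filtering or spectral subtraction. The mixture of uttered speech and ambient noise extracted by the subtractor 5 is also fed back, as the error e(n), to the coefficient identification unit 21 of the adaptive filter 4 and to the gain control unit 2.
[0022]
Based on the simulated talkback voice y(n) output from the adaptive filter 4 and the mixture e(n) of uttered speech and ambient noise output from the subtractor 5, the gain control unit 2 calculates the optimum gain to apply to the talkback voice output from the voice output unit 51, and outputs the calculated gain value to the volume etc. 1. Here the mixture e(n) of uttered speech and ambient noise is regarded as noise with respect to the talkback voice, and the gain of the talkback voice is adjusted so that the talkback voice output from the speaker 52 is clearly audible to the user.
[0023]
The volume etc. 1 applies gain correction to the talkback voice output from the voice output unit 51. That is, it corrects the talkback voice input from the voice output unit 51 by applying the gain calculated by the gain control unit 2. This correction is performed, for example, separately for each of a plurality of frequency bands.
[0024]
Next, the operation of the speech recognition apparatus 100 configured as described above is briefly described. The gain of the talkback voice output from the voice output unit 51 is adjusted by the volume etc. 1 and the gain control unit 2, which improves its clarity. The talkback voice output from the volume etc. 1 is amplified by a predetermined factor in the output amplifier 3 and then output from the speaker 52.
[0025]
The talkback voice output from the speaker 52 is picked up by the microphone 53. If the user utters a voice command at this time, the uttered speech is also picked up by the microphone 53. If the vehicle is moving, ambient noise such as engine sound and road noise is picked up as well. The microphone 53 therefore receives a mixture of talkback voice, uttered speech, and ambient noise. This mixture is input to the positive terminal of the subtractor 5, while the simulated talkback voice (the estimate of the talkback voice) generated by the adaptive filter 4 is input to the negative terminal.
[0026]
The subtractor 5 calculates the error by subtracting the simulated talkback voice supplied by the adaptive filter 4 from the mixture of talkback voice, uttered speech, and ambient noise input through the microphone 53, and thereby extracts the uttered speech and the ambient noise. The extracted uttered speech and ambient noise are supplied to the speech recognition engine 54, which reduces the ambient noise and executes the processing corresponding to the voice command. The extracted uttered speech and ambient noise are also fed back to the gain control unit 2 and the adaptive filter 4, where they are used for improving the clarity of the talkback voice and for estimating the talkback voice, respectively.
[0027]
FIG. 3 is a flowchart showing the speech recognition processing of the first embodiment. Although not shown in FIG. 1, the speech recognition apparatus 100 includes a controller that performs overall control of speech recognition, and the flowchart of FIG. 3 is executed under the control of this controller.
[0028]
In FIG. 3, when the controller detects a trigger for starting speech recognition (for example, a press of the speech button or voice input of a predetermined keyword) (step S1), it activates the speech recognition engine 54 and enters the voice input acceptance state (step S2). In this state the user utters a first command, which belongs to the top layer of the voice commands managed in multiple layers (step S3).
[0029]
The uttered voice command is input through the microphone 53 and supplied to the speech recognition engine 54 via the subtractor 5. The speech recognition engine 54 then executes speech recognition processing, including noise reduction (step S4). At this point the controller returns the speech recognition engine 54 to the inactive state and cancels the voice input acceptance state. Next, the volume etc. 1 and the gain control unit 2 start the processing that improves the clarity of the talkback voice (step S5). In this state the voice output unit 51 starts talkback of the result recognized by the speech recognition engine 54 together with a guidance message (step S6).
[0030]
While this talkback is in progress, the controller determines whether further voice operation is required (step S7), that is, whether it is necessary to move to a lower layer and continue voice command input. If further voice operation is required, the controller activates the speech recognition engine 54 again and enters the voice input acceptance state (step S8). The subtractor 5 then obtains from the adaptive filter 4 the estimate of the talkback voice output in step S6 and subtracts it from the microphone 53 input, thereby removing the talkback voice from the microphone input (step S9).
[0031]
The controller then determines whether a voice command has been uttered (step S10). If not, the process returns to step S9, and this loop is repeated until an utterance occurs. If no utterance is made within a certain time, timeout processing is performed. When a voice command is uttered, the talkback is interrupted at that point (step S11) and the process returns to step S4. Although the talkback is interrupted here when an utterance occurs, the talkback voice is removed and only the uttered speech is extracted even while talkback continues, so interrupting the talkback is not strictly necessary.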
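The control flow of steps S1 to S11 can be traced with a small simulation. This is a sketch under assumptions, not the controller of the patent: `run_dialog`, the `polls` list (one entry per pass through the S9/S10 loop, `None` meaning silence), the `depth` of the command hierarchy, and the `max_idle` timeout are all hypothetical. The description also does not say what happens when step S7 finds no further input is needed; the sketch simply stops.

```python
from typing import Iterable, List, Optional

def run_dialog(first_command: str, polls: Iterable[Optional[str]],
               depth: int, max_idle: int = 3) -> List[str]:
    """Return a log of the flowchart steps visited for one dialog."""
    log = ["S1 trigger", "S2 accept input", f"S3 utter {first_command}"]
    polls = iter(polls)
    level = 1
    while True:
        log += ["S4 recognize", "S5 clarity on", "S6 talkback start"]
        if level >= depth:                    # S7: no lower layer remains
            log.append("S7 done")
            return log
        log += ["S7 continue", "S8 accept input"]
        idle = 0
        while True:
            log.append("S9 cancel talkback")  # subtract simulated talkback
            cmd = next(polls, None)           # S10: was a command uttered?
            if cmd is not None:
                break
            idle += 1
            if idle >= max_idle:              # timeout processing
                log.append("timeout")
                return log
        log.append(f"S11 stop talkback ({cmd})")
        level += 1

trace = run_dialog("Fukushima", [None, "Iwaki"], depth=2)
```

Input acceptance is re-enabled at S8 while the talkback started at S6 is still playing, which is the behavior that removes the waiting time.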
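[0032]
FIG. 4 is a timing chart comparing how the voice input acceptance state changes in this embodiment and in the prior art; (a) shows the prior art and (b) shows this embodiment. The operation of FIG. 4(a) has already been described.
[0033]
As shown in FIG. 4(b), in this embodiment, when the user first presses the speech button the apparatus enters speech recognition mode and accepts voice input for a predetermined time. The user utters the desired voice command during that time. When the voice command is input, recognition processing and talkback processing are performed on it. Up to this point the operation is the same as in the prior art shown in FIG. 4(a).
[0034]
In the prior art of FIG. 4(a), voice input is not accepted while talkback is in progress. In this embodiment, shown in FIG. 4(b), the apparatus automatically returns to the voice input acceptance state as soon as recognition processing ends, so the user can make the next voice input at any time without waiting for the talkback to finish. This reduces the waiting time.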
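The difference between the two timing charts can be put into rough numbers. The durations below are invented for illustration and do not come from the patent, and the two helper functions are hypothetical; the sketch assumes talkback is interrupted when the next utterance starts and that the final talkback plays in full.

```python
def conventional_time(layers: int, speak: float, recognize: float,
                      talkback: float) -> float:
    """Every layer waits for its whole talkback before the next utterance."""
    return layers * (speak + recognize + talkback)

def barge_in_time(layers: int, speak: float, recognize: float,
                  talkback: float, wait: float) -> float:
    """The user starts the next utterance `wait` seconds into each talkback
    (0 <= wait <= talkback); only the last talkback plays to the end."""
    return layers * (speak + recognize) + (layers - 1) * wait + talkback

# Three layers, 1.5 s utterance, 0.5 s recognition, 3 s talkback.
before = conventional_time(3, 1.5, 0.5, 3.0)      # 15.0 seconds
after = barge_in_time(3, 1.5, 0.5, 3.0, 0.5)      # 10.0 seconds
```

With `wait` equal to the talkback length the two totals coincide, so the saving depends entirely on how early the user chooses to speak.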
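[0035]
As described in detail above, this embodiment accepts voice input at any time, including during talkback, so the user can make a voice input whenever desired without waiting for the talkback to end. Nor does the user need to press the speech button for every utterance. The operation time for a series of speech recognition steps can therefore be shortened without cumbersome button operations.
[0036]
In this embodiment, the talkback voice is removed from the microphone input by using the simulated voice estimated by the adaptive filter 4, which is already provided to improve the clarity of the talkback voice. No separate dedicated adaptive filter is therefore needed to remove the talkback voice. As a result, the clarity of the talkback voice can be improved and the speech recognition operation time shortened at the same time, without increasing cost.
[0037]
(Second embodiment)
Next, a second embodiment of the present invention is described. FIG. 5 is a block diagram showing the main configuration of the speech recognition apparatus according to the second embodiment. Components in FIG. 5 bearing the same reference numerals as in FIG. 1 have the same functions, and their description is not repeated here.
[0038]
As shown in FIG. 5, the speech recognition apparatus 200 of this embodiment comprises, in addition to the configuration shown in FIG. 1, output amplifiers 6-1 and 6-2, second adaptive filters 7-1 and 7-2, an adder 8, a subtractor 9, an audio playback unit 61, and multi-channel (right-channel and left-channel) speakers 62-1 and 62-2.
[0039]
The audio playback unit 61 plays various audio sources such as CD (Compact Disc), MD (Mini Disc), DVD (Digital Versatile Disk), and radio broadcasts. The output amplifiers 6-1 and 6-2 amplify the left- and right-channel audio sound played by the audio playback unit 61 by a predetermined factor and output it from the speakers 62-1 and 62-2 of the respective channels. The audio sound output from the speakers 62-1 and 62-2 is picked up by the microphone 53 together with the uttered speech and the talkback voice from the speaker 52.
[0040]
The second adaptive filters 7-1 and 7-2 are also configured as shown in FIG. 2. The adaptive filter 7-1 identifies filter coefficients that simulate the transfer path from the right-channel speaker 62-1 to the microphone 53, and generates simulated right-channel audio sound by filtering the right-channel audio sound.
[0041]
Likewise, the adaptive filter 7-2 identifies filter coefficients that simulate the transfer path from the left-channel speaker 62-2 to the microphone 53, and generates simulated left-channel audio sound by filtering the left-channel audio sound.
[0042]
Thus, in the second embodiment, the adaptive filter 4 constitutes the first adaptive filter means of the present invention, and the second adaptive filters 7-1 and 7-2 constitute the second adaptive filter means. The adder 8 adds the simulated left- and right-channel audio sounds output from the second adaptive filters 7-1 and 7-2 and outputs the sum to the subtractor 9.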
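[0043]
In this embodiment, the subtractor 5 subtracts the simulated talkback voice generated by the adaptive filter 4 from the sound input through the microphone 53 (a mixture of the voice command, talkback voice, audio sound, and ambient noise), thereby extracting the voice command, the audio sound, and the ambient noise. The subtractor 9 then subtracts the simulated audio sound generated by the adaptive filters 7-1 and 7-2 and the adder 8 from the output of the subtractor 5, thereby extracting the voice command (uttered speech) and the ambient noise. The subtractors 5 and 9 thus constitute the uttered-speech extraction means of the present invention.
[0044]
Of the mixture of voice command, audio sound, and ambient noise extracted by the subtractor 5, the ambient noise is reduced by the speech recognition engine 54 and only the voice command is recognized. The mixture of uttered speech, audio sound, and ambient noise extracted by the subtractor 5 is fed back to the gain control unit 2 and the adaptive filter 4. The mixture of uttered speech and ambient noise extracted by the subtractor 9 is supplied to the speech recognition engine 54 and also fed back to the second adaptive filters 7-1 and 7-2.
[0045]
Next, the operation of the speech recognition apparatus 200 of the second embodiment configured as described above is briefly described. The gain of the talkback voice output from the voice output unit 51 is adjusted by the volume etc. 1 and the gain control unit 2, which improves its clarity. The talkback voice output from the volume etc. 1 is amplified by a predetermined factor in the output amplifier 3 and then output from the speaker 52.
[0046]
The audio sound output from the audio playback unit 61 is amplified by a predetermined factor in the output amplifiers 6-1 and 6-2 and then output from the speakers 62-1 and 62-2.
[0047]
The talkback voice output from the speaker 52 and the audio sound output from the speakers 62-1 and 62-2 are picked up by the microphone 53. If the user utters a voice command at this time, the uttered speech is also picked up by the microphone 53. If the vehicle is moving, ambient noise such as engine sound and road noise is picked up as well. The microphone 53 therefore receives a mixture of talkback voice, audio sound, uttered speech, and ambient noise.
[0048]
This mixture is input to the positive terminal of the subtractor 5, while the simulated talkback voice generated by the adaptive filter 4 is input to its negative terminal. The subtractor 5 calculates the error by subtracting the simulated talkback voice output from the adaptive filter 4 from the mixture input through the microphone 53, and thereby extracts the audio sound, the uttered speech, and the ambient noise.
[0049]
The extracted mixture of audio sound, uttered speech, and ambient noise is input to the positive terminal of the subtractor 9, while the simulated audio sound generated by the adaptive filters 7-1 and 7-2 and the adder 8 is input to its negative terminal. The subtractor 9 calculates the error by subtracting the simulated audio sound supplied by the adder 8 from the mixture supplied by the subtractor 5, and thereby extracts the uttered speech and the ambient noise.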
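The two-stage subtraction of the second embodiment can be sketched by chaining three small N-LMS filters. Everything specific here is an assumption made for illustration: the `Nlms` class, the step size, the tap counts, and the three impulse responses are hypothetical, and the microphone signal contains no speech or ambient noise so that the residual powers are easy to compare.

```python
import random

class Nlms:
    """Minimal N-LMS FIR filter: predict() simulates, adapt() updates from the error."""
    def __init__(self, taps: int, mu: float) -> None:
        self.w = [0.0] * taps
        self.buf = [0.0] * taps
        self.mu = mu

    def predict(self, x: float) -> float:
        self.buf = [x] + self.buf[:-1]
        return sum(w * b for w, b in zip(self.w, self.buf))

    def adapt(self, err: float) -> None:
        norm = sum(b * b for b in self.buf) + 1e-8
        self.w = [w + self.mu * err * b / norm for w, b in zip(self.w, self.buf)]

def fir(h, x):
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

def power(v):
    return sum(s * s for s in v) / len(v)

random.seed(1)
N = 6000
talk, right, left = ([random.uniform(-1, 1) for _ in range(N)] for _ in range(3))
# Hypothetical acoustic paths from speakers 52, 62-1, and 62-2 to microphone 53.
mic = [a + b + c for a, b, c in zip(fir([0.5, 0.3, 0.1], talk),
                                    fir([0.4, -0.2], right),
                                    fir([0.3, 0.2], left))]

f4, f71, f72 = Nlms(4, 0.02), Nlms(4, 0.02), Nlms(4, 0.02)
e1_out, e2_out = [], []
for n in range(N):
    e1 = mic[n] - f4.predict(talk[n])                          # subtractor 5
    sim_audio = f71.predict(right[n]) + f72.predict(left[n])   # adder 8
    e2 = e1 - sim_audio                                        # subtractor 9
    f4.adapt(e1)                      # fed back to adaptive filter 4
    f71.adapt(e2)                     # fed back to adaptive filters 7-1, 7-2
    f72.adapt(e2)
    e1_out.append(e1)
    e2_out.append(e2)
```

After convergence, `e1_out` still carries the audio sound while `e2_out` is close to silence, mirroring the description: each stage removes only the component whose reference signal it receives.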
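[0050]
The extracted uttered speech and ambient noise are supplied to the speech recognition engine 54, which reduces the ambient noise and executes the processing corresponding to the voice command. The audio sound, uttered speech, and ambient noise extracted by the subtractor 5 are also fed back to the gain control unit 2 and the adaptive filter 4, where they are used for improving the clarity of the talkback voice and for estimating the talkback voice. The uttered speech and ambient noise extracted by the subtractor 9 are also fed back to the adaptive filters 7-1 and 7-2, where they are used for estimating the audio sound.
[0051]
FIG. 6 is a flowchart showing the speech recognition processing of the second embodiment. Steps in FIG. 6 bearing the same step numbers as in FIG. 3 perform the same processing, and their description is not repeated here. FIG. 6 differs from FIG. 3 only in that audio sound removal (steps S21 and S22) is inserted between steps S2 and S3 and between steps S9 and S10, respectively.
[0052]
In the audio sound removal of steps S21 and S22, the subtractor 9 subtracts the estimate of the audio sound supplied by the adder 8 from the output of the subtractor 5, thereby removing the audio sound from the mixture of audio sound, uttered speech, and ambient noise and extracting the uttered speech and the ambient noise.
[0053]
As described in detail above, according to the second embodiment, even if voice input is made while both talkback and audio playback are in progress, the talkback voice and the audio sound are removed from the microphone input, and the uttered speech and ambient noise are extracted and supplied to the speech recognition engine 54. Voice input can therefore be accepted at any time, even during talkback and audio playback, so the user can speak whenever desired and the operation time of speech recognition is shortened.
[0054]
(Third embodiment)
Next, a third embodiment of the present invention is described. FIG. 7 is a block diagram showing the main configuration of the speech recognition apparatus according to the third embodiment. Components in FIG. 7 bearing the same reference numerals as in FIG. 5 have the same functions, and their description is not repeated here.
[0055]
The second embodiment shown in FIG. 5 covers the case in which the talkback voice and the audio sound are output to different destinations. The third embodiment shown in FIG. 7, by contrast, covers the case in which they are output to the same destination.
[0056]
That is, the speech recognition apparatus 300 of the third embodiment shown in FIG. 7 lacks the output amplifier 3 shown in FIG. 5 and has only the two output amplifiers 6-1 and 6-2. It also has a variable filter 10 in place of the adaptive filter 4 shown in FIG. 5, and additionally has an adder 11. The rest of the configuration is the same as in FIG. 5.
[0057]
In FIG. 7, the adder 11 adds the talkback voice output from the volume etc. 1 and the right-channel audio sound played by the audio playback unit 61, and outputs the sum to the output amplifier 6-1 and the adaptive filter 7-1. The output amplifier 6-1 amplifies the output of the adder 11 by a predetermined factor and outputs it from the right-channel speaker 62-1.
[0058]
The adaptive filter 7-1 identifies filter coefficients that simulate the transfer path from the right-channel speaker 62-1 to the microphone 53. Using the identified coefficients, it filters the mixture of talkback voice and right-channel audio sound output from the adder 11 and thereby generates a simulation of that mixture.
[0059]
The variable filter 10 is a voice correction filter whose filter coefficients are variable; it is set by copying the filter coefficients identified by the right-channel adaptive filter 7-1. It then filters the talkback voice output from the volume etc. 1 to generate the simulated talkback voice at the position of the microphone 53. The variable filter 10 constitutes the variable filter means of the present invention.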
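The coefficient copy into the variable filter can be sketched as a plain FIR filter with a `load` method. The class name `FirFilter`, the two-tap length, and the coefficient values are hypothetical; in the apparatus the coefficients would come from the adaptive filter 7-1 rather than from a literal list.

```python
from typing import List, Sequence

class FirFilter:
    """FIR filter with overwritable coefficients (stand-in for variable filter 10)."""
    def __init__(self, taps: int) -> None:
        self.w: List[float] = [0.0] * taps
        self.buf: List[float] = [0.0] * taps

    def load(self, coeffs: Sequence[float]) -> None:
        # Copy the values so later changes to the source list do not leak in.
        if len(coeffs) != len(self.buf):
            raise ValueError("coefficient count must match the tap count")
        self.w = list(coeffs)

    def step(self, x: float) -> float:
        self.buf = [x] + self.buf[:-1]
        return sum(w * b for w, b in zip(self.w, self.buf))

identified = [0.5, 0.25]          # hypothetical coefficients from adaptive filter 7-1
variable_filter = FirFilter(taps=2)
variable_filter.load(identified)
# Filter the talkback voice alone to get the simulated talkback at the microphone.
simulated = [variable_filter.step(x) for x in [1.0, 0.0, 2.0]]
```

The variable filter performs only the convolution; no error signal is fed back to it, which is where the saving in cost and processing load comes from.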
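[0060]
The right-channel adaptive filter 7-1, from which the coefficients of the variable filter 10 are copied, is the adaptive filter that simulates the transfer path from the right-channel speaker 62-1, which outputs the talkback voice, to the microphone 53. For example, when the speech recognition apparatus 300 of this embodiment is applied to a navigation device, the talkback voice is output from the right-channel speaker 62-1 installed near the driver's seat, and the microphone 53 that picks it up is also installed near the driver's seat. In this case it is therefore preferable to copy the filter coefficients of the right-channel adaptive filter 7-1 to the variable filter 10. If the driver's seat is on the left side, it is preferable to copy the filter coefficients of the left-channel adaptive filter 7-2 instead.
[0061]
Next, the operation of the speech recognition apparatus 300 of the third embodiment configured as described above is briefly described. The gain of the talkback voice output from the voice output unit 51 is adjusted by the volume etc. 1 and the gain control unit 2, which improves its clarity.
[0062]
The talkback voice output from the volume etc. 1 is added by the adder 11 to the right-channel audio sound played by the audio playback unit 61, amplified by a predetermined factor in the output amplifier 6-1, and output from the speaker 62-1. The left-channel audio sound played by the audio playback unit 61 is amplified by a predetermined factor in the output amplifier 6-2 and output from the speaker 62-2.
[0063]
The sound output from the speaker 62-1 (a mixture of talkback voice and right-channel audio sound) and the left-channel audio sound output from the speaker 62-2 are picked up by the microphone 53. If the user utters a voice command at this time, the uttered speech is also picked up by the microphone 53. If the vehicle is moving, ambient noise such as engine sound and road noise is picked up as well. The microphone 53 therefore supplies a mixture of talkback voice, left- and right-channel audio sound, uttered speech, and ambient noise.
[0064]
This mixture is input to the positive terminals of the subtractors 5 and 9. The simulated talkback voice generated by the variable filter 10 is input to the negative terminal of the subtractor 5. The subtractor 5 calculates the error by subtracting the simulated talkback voice output from the variable filter 10 from the mixture input through the microphone 53, and thereby extracts the audio sound, uttered speech, and ambient noise. The extracted mixture of audio sound, uttered speech, and ambient noise is fed back to the gain control unit 2 and used for improving the clarity of the talkback voice.
[0065]
The mixture of talkback voice and right-channel audio sound output from the adder 11 is also input to the adaptive filter 7-1, which generates a simulated mixture of talkback voice and right-channel audio sound. The adaptive filter 7-2, meanwhile, generates the simulated left-channel audio sound.
[0066]
The simulated sounds generated by the adaptive filters 7-1 and 7-2 are added by the adder 8, and the result is input to the negative terminal of the subtractor 9. The subtractor 9 calculates the error by subtracting the simulated mixture of talkback voice and audio sound supplied by the adder 8 from the mixture supplied by the subtractor 5, and thereby extracts the uttered speech and ambient noise.
[0067]
The uttered speech and ambient noise extracted by the subtractor 9 are supplied to the speech recognition engine 54, which reduces the ambient noise and executes the processing corresponding to the voice command. The uttered speech and ambient noise extracted by the subtractor 9 are also fed back to the adaptive filters 7-1 and 7-2, where they are used for estimating the audio sound.
[0068]
The speech recognition processing of the third embodiment is the same as the flowchart shown in FIG. 6, so its description is omitted here.
[0069]
As described in detail above, the third embodiment, like the second, accepts voice input at any time even while talkback and audio playback are in progress, so the user can speak whenever desired. In addition, the third embodiment does not require a sophisticated adaptive filter, including a coefficient identification algorithm, dedicated to estimating the talkback voice, which reduces cost accordingly. Furthermore, the variable filter 10 only needs to have filter coefficients copied into it and performs no computation to identify them, which also reduces the processing load.
[0070]
The first to third embodiments described above are merely specific examples of implementing the present invention and must not be used to interpret its technical scope restrictively. The present invention can be implemented in various forms without departing from its spirit or principal features.
[0071]
[Effects of the invention]
As described above, the present invention estimates, with an adaptive filter, the talkback voice that is output from the speaker and picked up by the microphone, and subtracts the estimate from the microphone input, thereby extracting only the uttered speech from a microphone input in which uttered speech and other sounds are mixed. Voice input can therefore be made at any time during talkback without troublesome operations such as pressing the speech button at every utterance to interrupt the talkback. The operation time of speech recognition can thus be shortened without cumbersome operations.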
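[Brief description of the drawings]
[FIG. 1] A block diagram showing the main configuration of the speech recognition apparatus according to the first embodiment.
[FIG. 2] A diagram showing the configuration of the adaptive filter.
[FIG. 3] A flowchart showing the speech recognition processing of the first embodiment.
[FIG. 4] A timing chart comparing how the voice input acceptance state changes in this embodiment and in the prior art.
[FIG. 5] A block diagram showing the main configuration of the speech recognition apparatus according to the second embodiment.
[FIG. 6] A flowchart showing the speech recognition processing of the second embodiment.
[FIG. 7] A block diagram showing the main configuration of the speech recognition apparatus according to the third embodiment.
[Explanation of reference numerals]
1 Volume or equalizer
2 Gain control unit
3 Output amplifier
4 Adaptive filter
5 Subtractor
6-1, 6-2 Output amplifiers
7-1, 7-2 Adaptive filters
8 Adder
9 Subtractor
10 Variable filter
11 Adder
[0001]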
[Technical field of the invention]
The present invention relates to a voice recognition apparatus and method for controlling a device by recognizing a voice command uttered by a user, and is particularly suitable for a voice recognition apparatus having a talkback function for feeding back a recognized utterance voice to the user.
[0002]
[Prior art]
Conventionally, in fields such as vehicle-mounted navigation devices, hands-free devices, and personal computers, a voice recognition device can be used in addition to a remote control, touch panel, keyboard, or mouse, so that a device can be operated by the user's voice input.
[0003]
In this type of speech recognition apparatus, a speech recognition mode is entered by pressing a provided speech button, and a command is executed by recognizing the user's speech. There are two main ways of speaking. The first method is a method in which when the user presses the utterance button once, the voice recognition mode is set, and the user and the device interact with each other by prompting the user to input a voice as necessary. The second method is a method that enables voice input for a predetermined time each time the user presses the utterance button.
[0004]
Most speech recognition apparatuses have a talkback function that feeds back the recognized speech sound to a user from a speaker or the like. The user listens to the talk-backed voice to check whether it is correct. If the voice is incorrect, the user inputs the voice again, and if correct, instructs the voice recognition device to that effect. The voice recognition apparatus executes various controls upon receiving this instruction.
[0005]
Usually, a plurality of voice commands prepared in the voice recognition apparatus are managed in a plurality of layers according to the operation contents for the device to be controlled. For example, when a destination is set by an address in the navigation device, the address is divided into a plurality of layers and input by voice, such as “prefecture → city / town → remaining address”.
[0006]
In this case, talkback is performed every time voice is input in each layer, and thus it often takes a long time to complete a series of voice inputs. In contrast, attempts have been made to shorten the speech recognition time. As an example, an apparatus has been proposed in which the amount of talkback computation is reduced to shorten the recognition time (see, for example, Patent Document 1).
[0007]
[Patent Document 1]
JP-A-6-149287
[0008]
[Problems to be solved by the invention]
However, in the conventional speech recognition apparatus, the next speech input is not accepted during talkback. This is because when the talkback sound is mixed with the utterance voice, erroneous recognition of the utterance voice is likely to occur. FIG. 4A is a timing chart showing a state of change in the voice input acceptance state related to the conventional voice recognition apparatus. FIG. 4A shows the change in the voice input acceptance state related to the first utterance method described above.
[0009]
As shown in FIG. 4A, in the first utterance method, when the user first presses the utterance button, the voice recognition mode is set, and the voice input acceptance state is set for a predetermined time. The user utters a desired voice command while in the voice input acceptance state. When speech is performed, the speech recognition apparatus performs recognition processing and talkback processing of the input speech, but during this time, speech input is not accepted. When the talkback ends, the voice input acceptance state is entered again, and the next voice input becomes possible.
[0010]
As described above, in the first utterance method, the next voice input is not accepted until the talkback is completed, and therefore the voice cannot be uttered at the user's favorite timing. That is, there is a problem that it takes a long time to input a series of voices because it is necessary to wait until the talkback is finished.
[0011]
On the other hand, according to the second speech method, the talkback can be interrupted by pressing the speech button, and the next voice input can be performed. However, in this case, when performing voice input over a plurality of hierarchies, there is a problem that the speech button must be pressed every time voice input is performed in each hierarchy, and the operation becomes very complicated.
[0012]
The present invention has been made to solve such problems, and its object is to shorten the operation time of voice recognition without complicated operations such as pressing a speech button many times.
[0013]
[Means for Solving the Problems]
In order to solve the above-described problems, in the speech recognition apparatus of the present invention, the talkback speech output from the speaker and input to the microphone is simulated by the adaptive filter means, and the talkback simulated speech is subtracted from the microphone input speech. By doing so, only the utterance voice is extracted from the microphone input voice in which the utterance voice and the talkback voice are mixed.
[0014]
According to the present invention configured as described above, even if speech input is performed during talkback, the talkback speech is removed, and only the utterance speech is extracted and supplied to the speech recognition engine. As a result, erroneous recognition of the uttered voice is suppressed even when voice input is performed during talkback, and voice input can be accepted at any time, including during talkback.
[0015]
[Embodiments of the invention]
(First embodiment)
Hereinafter, a first embodiment of the invention will be described with reference to the drawings. FIG. 1 is a block diagram showing a main configuration of the speech recognition apparatus according to the first embodiment.
[0016]
As shown in FIG. 1, the speech recognition apparatus 100 of the present embodiment includes a volume or equalizer (hereinafter simply referred to as a volume or the like) 1, a gain control unit 2, an output amplifier 3, an adaptive filter (ADF) 4, a subtracter 5, a voice output unit 51, a speaker 52, a microphone 53, and a voice recognition engine 54.
[0017]
The voice output unit 51 performs processing for generating and outputting a talkback sound. The speaker 52 outputs the talkback sound that has been gain-controlled by the volume or the like 1 and further amplified by the output amplifier 3. The microphone 53 is intended for inputting an utterance voice, but in practice not only the uttered voice command but also the talkback sound output from the speaker 52 and ambient noise such as running noise are all input to the same microphone 53. The voice recognition engine 54 recognizes the uttered voice input by the microphone and executes a command corresponding to the uttered voice on a control target device (for example, a navigation device), which is not shown.
[0018]
As shown in FIG. 2, the adaptive filter 4 includes a coefficient identification unit 21 and a sound correction filter 22. The coefficient identification unit 21 is a filter for identifying the transfer function of the acoustic system between the speaker 52 and the microphone 53 (the filter coefficient of the sound correction filter 22), and uses an adaptive filter based on the LMS (Least Mean Square) algorithm or the N-LMS (Normalized LMS) algorithm. The coefficient identification unit 21 operates so as to minimize the power of the error e(n) output from the subtracter 5 and identifies the impulse response of the acoustic system.
[0019]
The sound correction filter 22 performs a convolution operation using the filter coefficient w(n) determined by the coefficient identification unit 21 and the talkback sound x(n) to be controlled, thereby giving the talkback sound x(n) the same transfer characteristic as the above-described acoustic system. As a result, a talkback simulated sound y(n) that simulates the talkback sound at the position of the microphone 53 is generated. Thus, the adaptive filter 4 constitutes the adaptive filter means of the present invention.
[0020]
The subtracter 5 subtracts the talkback simulated sound y(n) generated by the adaptive filter 4 from the sound input from the microphone 53 (a mixture of the voice command, the talkback sound, and ambient noise), thereby extracting the voice command (uttered voice) and ambient noise (for example, running noise). In this way, the subtracter 5 constitutes the speech voice extraction means of the present invention.
[0021]
The mixed sound of the uttered voice and the ambient noise extracted by the subtracter 5 is supplied to the voice recognition engine 54. The voice recognition engine 54 performs noise processing and then voice command recognition processing. The noise processing here is typical conventional processing such as filtering or spectral subtraction. Note that the mixed sound of the uttered voice and ambient noise extracted by the subtracter 5 is also fed back, as the error e(n), to the coefficient identification unit 21 of the adaptive filter 4 and to the gain control unit 2.
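Spectral subtraction, named above as one example of conventional noise processing, can be sketched on magnitude spectra. This is a generic textbook form, not a method taken from the patent: the function name, the `floor` parameter, and the sample values are hypothetical, and the transform to and from the frequency domain is left out.

```python
from typing import List

def spectral_subtract(mag: List[float], noise_mag: List[float],
                      floor: float = 0.1) -> List[float]:
    """Subtract an estimated noise magnitude from each frequency bin,
    never going below floor * mag to avoid negative magnitudes."""
    return [max(m - n, floor * m) for m, n in zip(mag, noise_mag)]

# One frame with three bins: the noise estimate exceeds the signal in the last bin.
cleaned = spectral_subtract([1.0, 0.5, 0.2], [0.25, 0.5, 0.5])
```

The spectral floor keeps bins where the noise estimate is too large from collapsing to zero or negative values.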
[0022]
Based on the talkback simulated sound y(n) output from the adaptive filter 4 and the mixed sound e(n) of the uttered voice and ambient noise output from the subtracter 5, the gain control unit 2 calculates an optimum gain to be applied to the talkback sound output from the voice output unit 51, and outputs the calculated gain value to the volume or the like 1. Here, the mixed sound e(n) of the uttered voice and ambient noise is regarded as noise with respect to the talkback sound, and the gain of the talkback sound is adjusted so that the talkback sound output from the speaker 52 can be clearly heard by the user.
[0023]
The volume or the like 1 performs gain correction on the talkback sound output from the sound output unit 51. That is, the talkback sound input from the sound output unit 51 is corrected by giving the gain calculated by the gain control unit 2 to the talkback sound. This correction is performed, for example, for each of the frequency bands divided into a plurality.
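A per-band gain rule of the kind the gain control unit 2 might apply can be sketched as follows. The patent does not give the formula, so this rule is purely an assumption for illustration: the function name, the target signal-to-noise ratio, and the boost limit are hypothetical. Band powers of y(n) stand for the talkback level at the microphone and band powers of e(n) for the masking noise.

```python
import math
from typing import List

def band_gains_db(talkback_pow: List[float], noise_pow: List[float],
                  target_snr_db: float = 6.0,
                  max_boost_db: float = 12.0) -> List[float]:
    """Boost each band just enough to reach target_snr_db, up to max_boost_db."""
    gains = []
    for py, pe in zip(talkback_pow, noise_pow):
        snr_db = 10.0 * math.log10(py / pe)       # current talkback-to-noise ratio
        boost = min(max(target_snr_db - snr_db, 0.0), max_boost_db)
        gains.append(boost)
    return gains

# Three bands with equal talkback power and increasingly strong noise.
gains = band_gains_db([1.0, 1.0, 1.0], [0.1, 1.0, 100.0])
```

Bands where the talkback is already well above the noise get no boost, while heavily masked bands are capped so the output does not become uncomfortably loud.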
[0024]
Next, the operation of the speech recognition apparatus 100 configured as described above will be briefly described. The talkback sound output from the voice output unit 51 is gain-adjusted by the volume or the like 1 and the gain control unit 2 to improve the clarity of the talkback sound. The talkback sound output from the volume or the like 1 is amplified by the output amplifier 3 at a predetermined magnification and then output from the speaker 52.
[0025]
The talkback sound output from the speaker 52 is input from the microphone 53. At this time, if the user utters a voice command, the uttered voice is also input from the microphone 53. If the vehicle is running, ambient noise such as engine sound and road noise is also input from the microphone 53. Therefore, the microphone 53 receives a mixture of the talkback sound, the uttered voice, and ambient noise. This mixed sound is input to the plus end of the subtracter 5. On the other hand, the talkback simulated sound (the estimated value of the talkback sound) generated by the adaptive filter 4 is input to the minus end of the subtracter 5.
[0026]
The subtracter 5 calculates an error by subtracting the talkback simulated sound input from the adaptive filter 4 from the mixed sound of the talkback sound, the uttered voice, and the ambient noise input from the microphone 53, and thereby extracts the uttered voice and the ambient noise. The extracted uttered voice and ambient noise are supplied to the voice recognition engine 54. As a result, processing for reducing ambient noise and processing corresponding to the voice command are executed. The extracted uttered voice and ambient noise are also fed back to the gain control unit 2 and the adaptive filter 4 and used for talkback sound clarity improvement processing and talkback sound estimation processing.
[0027]
FIG. 3 is a flowchart showing the operation of the speech recognition process according to the first embodiment. Although not shown in FIG. 1, the speech recognition apparatus 100 includes a controller that performs overall control related to speech recognition, and the flowchart shown in FIG. 3 is executed according to the control of this controller.
[0028]
In FIG. 3, when the controller detects a voice recognition start trigger (for example, pressing of a speech button or voice input of a predetermined keyword) (step S1), it activates the voice recognition engine 54 to enter a voice input acceptance state (step S2). In this state, the user utters the first command, which corresponds to the uppermost layer of the voice commands managed in a plurality of layers (step S3).
[0029]
The voice command uttered here is input from the microphone 53 and supplied to the voice recognition engine 54 via the subtracter 5. In response to this, the voice recognition engine 54 executes speech recognition processing (including noise reduction processing) (step S4). At this time, the controller returns the voice recognition engine 54 to inactive and cancels the voice input acceptance state. Next, the volume or the like 1 and the gain control unit 2 start a process of improving the clarity of the talkback sound (step S5). In this state, the voice output unit 51 starts talkback of the recognition result of the voice recognition engine 54 and a guidance sentence (step S6).
[0030]
While this talkback is being performed, the controller determines whether or not a voice operation is still necessary (step S7). Here, it is determined whether or not it is necessary to move to a lower layer and continue to input voice commands. If the voice operation is still necessary, the voice recognition engine 54 is activated again to enter the voice input acceptance state (step S8). Thereafter, the subtracter 5 obtains the estimated value of the talkback sound output in step S6 from the adaptive filter 4 and subtracts it from the input sound of the microphone 53, thereby removing the talkback sound from the microphone input sound (step S9).
[0031]
Then, the controller determines whether or not a voice command has been uttered (step S10). If there is no utterance, the process returns to step S9, and this loop is repeated until there is an utterance. If no utterance is made within a certain time, a timeout process is performed. On the other hand, when a voice command is uttered, the talkback is interrupted at that time (step S11), and the process returns to step S4. Note that the talkback is interrupted here when an utterance is made, but even while talkback continues its sound is removed and only the uttered voice is extracted, so it is not always necessary to interrupt the talkback.
[0032]
FIG. 4 is a timing chart showing the state of change in the voice input acceptance state according to the present embodiment in comparison with the prior art. FIG. 4A shows the prior art, and FIG. 4B shows the present embodiment. Note that the operation of FIG. 4A has already been described.
[0033]
As shown in FIG. 4B, in this embodiment, when the user first presses the speech button, the voice recognition mode is set, and the voice input acceptance state is set for a predetermined time. The user utters a desired voice command while in the voice input acceptance state. When a voice command is input, recognition processing and talkback processing for the input voice are performed. The operation so far is the same as that of the prior art shown in FIG. 4A.
[0034]
In the prior art shown in FIG. 4A, a voice input is not accepted while talkback is being performed. On the other hand, in the present embodiment shown in FIG. 4B, the voice input acceptance state is entered automatically when the recognition process is completed, and the next voice input can be made at a desired timing without waiting until the talkback is finished. This can reduce the waiting time.
[0035]
As described above in detail, according to the present embodiment, voice input can be received at any time even during talkback, and voice input can be performed at any timing without waiting for the talkback to end. Moreover, it is not necessary to press the utterance button every time an utterance is made. Thereby, the operation time required for a series of voice recognition can be shortened without performing complicated button operations.
[0036]
In this embodiment, the talkback sound is removed from the microphone input sound by using the simulated sound estimated by the adaptive filter 4 provided for improving the clarity of the talkback sound. Therefore, it is not necessary to separately introduce a dedicated adaptive filter for removing talkback sound. As a result, the intelligibility of the talkback voice can be improved and the voice recognition operation time can be shortened without increasing the cost.
[0037]
(Second Embodiment)
Next, a second embodiment of the present invention will be described. FIG. 5 is a block diagram showing a main configuration of the speech recognition apparatus according to the second embodiment. In FIG. 5, those given the same reference numerals as those shown in FIG. 1 have the same functions, and therefore redundant description is omitted here.
[0038]
As shown in FIG. 5, the speech recognition apparatus 200 of the present embodiment has output amplifiers 6-1 and 6-2, second adaptive filters 7-1 and 7-2, addition in addition to the configuration shown in FIG. 1. 8, a subtracter 9, an audio playback unit 61, and a plurality of channels (right channel, left channel) speakers 62-1 and 62-2.
[0039]
The audio playback unit 61 plays back various audio sources such as a CD (Compact Disc), an MD (Mini Disc), a DVD (Digital Versatile Disk), and a radio broadcast. The output amplifiers 6-1 and 6-2 amplify the audio sound of the left and right channels reproduced by the audio reproducing unit 61 at a predetermined magnification, and output the amplified sound from the speakers 62-1 and 62-2 of each channel. The audio sound output from the speakers 62-1 and 62-2 is input to the microphone 53 together with the speech sound and the talkback sound from the speaker 52.
[0040]
The second adaptive filters 7-1 and 7-2 are also configured as shown in FIG. One adaptive filter 7-1 identifies a filter coefficient that simulates the transmission system from the right channel speaker 62-1 to the microphone 53, and filters the right channel audio sound to generate a right channel audio simulated sound.
[0041]
The other adaptive filter 7-2 identifies a filter coefficient that simulates the transmission system from the left channel speaker 62-2 to the microphone 53, and filters the left channel audio sound to generate a left channel audio simulated sound.
[0042]
Thus, in the second embodiment, the adaptive filter 4 constitutes the first adaptive filter means according to the present invention, and the second adaptive filters 7-1 and 7-2 constitute the second adaptive filter means according to the present invention. The adder 8 adds the left and right channel audio simulated sounds output from the second adaptive filters 7-1 and 7-2 and outputs the result to the subtracter 9.
[0043]
In the present embodiment, the subtracter 5 subtracts the talkback simulated voice generated by the adaptive filter 4 from the voice input from the microphone 53 (a voice in which the voice command, the talkback voice, the audio sound, and the ambient noise are mixed), thereby extracting the voice command, the audio sound, and the ambient noise. The subtracter 9 then subtracts the audio simulated sound generated by the adaptive filters 7-1 and 7-2 and the adder 8 from the voice output from the subtracter 5, thereby extracting the voice command (speech voice) and the ambient noise. As described above, the subtracters 5 and 9 constitute the speech voice extraction means of the present invention.
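The two subtraction stages can be written compactly as follows. The symbols are introduced here only for illustration and do not appear in the original: m(n) is the microphone input, s(n) the speech voice, v(n) the ambient noise, y_tb, y_R, and y_L the talkback and right and left audio components at the microphone, and hatted quantities are the simulated sounds produced by the adaptive filter 4 and the second adaptive filters 7-1 and 7-2.

```latex
\begin{align*}
m(n)   &= s(n) + y_{tb}(n) + y_R(n) + y_L(n) + v(n) \\
e_5(n) &= m(n) - \hat{y}_{tb}(n)
          && \text{(output of subtracter 5)} \\
e_9(n) &= e_5(n) - \bigl[\hat{y}_R(n) + \hat{y}_L(n)\bigr]
          && \text{(output of subtracter 9)}
\end{align*}
```

If the simulated sounds matched the true components exactly, e_5(n) would reduce to s(n) + y_R(n) + y_L(n) + v(n) and e_9(n) to s(n) + v(n), which corresponds to the mixtures named in the text.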
[0044]
Of the mixed voice of the voice command, the audio sound, and the ambient noise extracted by the subtracter 5, the ambient noise is reduced by the voice recognition engine 54, and only the voice command is recognized. The mixed voice of the speech voice, the audio sound, and the ambient noise extracted by the subtracter 5 is also fed back to the gain control unit 2 and the adaptive filter 4. In addition, the mixed voice of the speech voice and the ambient noise extracted by the subtracter 9 is supplied to the voice recognition engine 54 and fed back to the second adaptive filters 7-1 and 7-2.
[0045]
Next, the operation of the speech recognition apparatus 200 according to the second embodiment configured as described above will be briefly described. The talkback sound output from the sound output unit 51 is gain-adjusted by the volume 1 and the gain control unit 2 to improve the clarity of the talkback sound. The talkback sound output from the volume 1 or the like is amplified by the output amplifier 3 at a predetermined magnification and then output from the speaker 52.
[0046]
The audio sound output from the audio playback unit 61 is amplified by the output amplifiers 6-1 and 6-2 at a predetermined magnification, and then output from the speakers 62-1 and 62-2.
[0047]
The talkback sound output from the speaker 52 and the audio sound output from the speakers 62-1 and 62-2 are input from the microphone 53. At this time, if the user utters a voice command, the uttered voice is also input from the microphone 53. If the vehicle is running, ambient noise such as engine sound and road noise is also input from the microphone 53. Therefore, the microphone 53 is input with the talkback sound, the audio sound, the speech sound, and the ambient noise mixed.
[0048]
This mixed sound is input to the plus end of the subtracter 5. On the other hand, the talkback simulated voice generated by the adaptive filter 4 is input to the minus end of the subtracter 5. The subtractor 5 calculates an error by subtracting the talkback simulated sound output from the adaptive filter 4 from the mixed sound input from the microphone 53, and extracts the audio sound, the speech sound, and the ambient noise.
[0049]
The extracted mixed sound of the audio sound, the speech sound and the ambient noise is input to the plus end of the subtractor 9. On the other hand, the simulated audio generated by the adaptive filters 7-1 and 7-2 and the adder 8 is input to the minus end of the subtracter 9. The subtractor 9 calculates an error by subtracting the simulated audio sound input from the adder 8 from the mixed sound input from the subtracter 5, and extracts the speech sound and ambient noise.
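The cascade of the subtracters 5 and 9 can be sketched in Python as below. To keep the signal flow visible, the sketch assumes the filter coefficients have already been identified and treats them as fixed FIR coefficients. The function names `fir` and `cascade_extract` and all parameter names are hypothetical.

```python
def fir(signal, coeffs):
    """Filter `signal` with the FIR coefficients `coeffs` (zero initial state)."""
    return [
        sum(c * signal[n - k] for k, c in enumerate(coeffs) if n - k >= 0)
        for n in range(len(signal))
    ]


def cascade_extract(mic, talkback, audio_r, audio_l, h_tb, h_r, h_l):
    """Two-stage removal: talkback first, then the summed stereo audio.

    h_tb, h_r and h_l stand in for the coefficients of the adaptive filter 4
    and the second adaptive filters 7-1 and 7-2 (assumed already converged).
    """
    tb_sim = fir(talkback, h_tb)                          # adaptive filter 4
    stage1 = [m - t for m, t in zip(mic, tb_sim)]         # subtracter 5
    audio_sim = [r + l for r, l in
                 zip(fir(audio_r, h_r), fir(audio_l, h_l))]   # adder 8
    stage2 = [s - a for s, a in zip(stage1, audio_sim)]   # subtracter 9
    return stage1, stage2


if __name__ == "__main__":
    h_tb, h_r, h_l = [0.5, 0.25], [0.25], [0.125, 0.5]
    speech = [1.0, -2.0, 3.0, 0.0]
    talkback = [4.0, 0.0, -4.0, 8.0]
    audio_r = [8.0, 8.0, -8.0, 0.0]
    audio_l = [16.0, 0.0, 0.0, -16.0]
    mic = [s + a + b + c for s, a, b, c in
           zip(speech, fir(talkback, h_tb), fir(audio_r, h_r), fir(audio_l, h_l))]
    stage1, stage2 = cascade_extract(mic, talkback, audio_r, audio_l, h_tb, h_r, h_l)
    print(stage2)
```

Here `stage1` corresponds to the signal fed back to the gain control unit 2 and the adaptive filter 4, and `stage2` to the signal supplied to the voice recognition engine 54.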
[0050]
The extracted speech voice and ambient noise are supplied to the voice recognition engine 54. As a result, processing for reducing ambient noise and processing corresponding to the voice command are executed. The audio sound, the speech sound, and the ambient noise extracted by the subtracter 5 are fed back to the gain control unit 2 and the adaptive filter 4 and are used for the talkback sound intelligibility improvement process and the talkback sound estimation calculation process. The speech voice and ambient noise extracted by the subtracter 9 are also fed back to the adaptive filters 7-1 and 7-2 and used for the audio sound estimation calculation process.
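The feedback of the subtracter 9 output to both second adaptive filters can be illustrated with a two-channel update that shares a single error signal. As before, the update rule (NLMS normalized by the combined power of both channels) is an assumption, since the text does not specify one, and `identify_stereo_paths` and `simulate_path` are hypothetical names.

```python
import random


def simulate_path(signal, path):
    """Hypothetical acoustic path: convolve `signal` with the impulse response `path`."""
    return [
        sum(h * signal[n - k] for k, h in enumerate(path) if n - k >= 0)
        for n in range(len(signal))
    ]


def identify_stereo_paths(audio_r, audio_l, stage1, taps=3, step=0.5, eps=1e-8):
    """Adapt two FIR filters from one shared error signal.

    `stage1` plays the role of the subtracter 5 output (talkback already removed).
    Returns the right and left coefficients and the residual (subtracter 9 output).
    """
    w_r, w_l = [0.0] * taps, [0.0] * taps
    x_r, x_l = [0.0] * taps, [0.0] * taps
    residual = []
    for r, l, m in zip(audio_r, audio_l, stage1):
        x_r = [r] + x_r[:-1]
        x_l = [l] + x_l[:-1]
        sim = (sum(w * x for w, x in zip(w_r, x_r))
               + sum(w * x for w, x in zip(w_l, x_l)))     # adder 8
        e = m - sim                                         # subtracter 9
        norm = sum(x * x for x in x_r) + sum(x * x for x in x_l) + eps
        g = step * e / norm
        w_r = [w + g * x for w, x in zip(w_r, x_r)]         # feedback to filter 7-1
        w_l = [w + g * x for w, x in zip(w_l, x_l)]         # feedback to filter 7-2
        residual.append(e)
    return w_r, w_l, residual


if __name__ == "__main__":
    rng = random.Random(0)
    audio_r = [rng.uniform(-1.0, 1.0) for _ in range(3000)]
    audio_l = [rng.uniform(-1.0, 1.0) for _ in range(3000)]
    stage1 = [a + b for a, b in zip(simulate_path(audio_r, [0.4, 0.2, 0.1]),
                                    simulate_path(audio_l, [0.3, -0.2, 0.05]))]
    w_r, w_l, _ = identify_stereo_paths(audio_r, audio_l, stage1)
    print(w_r, w_l)
```

The sketch uses independent random signals for the two channels. Real stereo material is usually correlated between channels, which tends to make joint identification slower and less well conditioned.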
[0051]
FIG. 6 is a flowchart showing the operation of the speech recognition process according to the second embodiment. In FIG. 6, steps given the same numbers as the steps shown in FIG. 3 indicate the same processing, and duplicate description is therefore omitted here. FIG. 6 differs from FIG. 3 only in that audio sound removal processes (steps S21 and S22) are included between steps S2 and S3 and between steps S9 and S10.
[0052]
In the audio sound removal processing in steps S21 and S22, the subtracter 9 subtracts the estimated value of the audio sound input from the adder 8 from the output sound of the subtracter 5, thereby removing the audio sound from the mixed voice of the audio sound, the speech voice, and the ambient noise, and extracting the speech voice and the ambient noise.
[0053]
As described above in detail, according to the second embodiment, even if voice input is performed while talkback and audio playback are in progress, the talkback voice and the audio sound can be removed from the microphone input voice, and the speech voice and the ambient noise can be extracted and supplied to the voice recognition engine 54. Therefore, even during talkback and audio playback, voice input can be received at any time and performed at any timing, and the operation time for voice recognition can be shortened.
[0054]
(Third embodiment)
Next, a third embodiment of the present invention will be described. FIG. 7 is a block diagram showing a main configuration of the speech recognition apparatus according to the third embodiment. In FIG. 7, components having the same reference numerals as those shown in FIG. 5 have the same functions, and thus redundant description is omitted here.
[0055]
In the second embodiment shown in FIG. 5, the case where the output destination of the talkback sound is different from the output destination of the audio sound has been described. On the other hand, the third embodiment shown in FIG. 7 shows a case where the output destination of the talkback sound and the output destination of the audio sound are the same.
[0056]
That is, in the speech recognition apparatus 300 according to the third embodiment shown in FIG. 7, the output amplifier 3 shown in FIG. 5 is not provided and only two output amplifiers 6-1 and 6-2 are provided. In addition, the speech recognition apparatus 300 according to the present embodiment includes the variable filter 10 instead of the adaptive filter 4 illustrated in FIG. 5, and further includes an adder 11. Other configurations are the same as those in FIG.
[0057]
In FIG. 7, the adder 11 adds the talkback sound output from the volume 1 and the right channel audio sound reproduced by the audio reproduction unit 61, and outputs the result to the output amplifier 6-1 and the adaptive filter 7-1. The output amplifier 6-1 amplifies the sound output from the adder 11 at a predetermined magnification and outputs it from the right channel speaker 62-1.
[0058]
The adaptive filter 7-1 identifies filter coefficients that simulate the transmission system from the right channel speaker 62-1 to the microphone 53. Then, using the identified filter coefficient, the mixed sound of the talkback sound output from the adder 11 and the audio sound of the right channel is filtered to generate a sound simulating the mixed sound.
[0059]
The variable filter 10 is an audio correction filter having a variable filter coefficient, and copies and sets the filter coefficient identified by the right channel adaptive filter 7-1. Then, the talkback sound output from the volume 1 or the like is filtered to generate a talkback simulated sound at the position of the microphone 53. The variable filter 10 constitutes variable filter means of the present invention.
[0060]
Here, the right-channel adaptive filter 7-1, which is the copy source of the filter coefficient for the variable filter 10, is the adaptive filter that simulates the transmission system from the right-channel speaker 62-1, from which the talkback sound is output, to the microphone 53. For example, when the speech recognition apparatus 300 according to the present embodiment is applied to a navigation apparatus, the talkback voice is output from the right channel speaker 62-1 installed near the driver's seat, and the microphone 53 to which the talkback voice is input is also installed near the driver's seat. In this case, it is therefore preferable to copy the filter coefficient of the adaptive filter 7-1 for the right channel to the variable filter 10. When the driver's seat is on the left side, it is preferable to copy the filter coefficient of the adaptive filter 7-2 for the left channel to the variable filter 10.
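The coefficient copy from the adaptive filter 7-1 to the variable filter 10 might look like the following Python sketch. The class `FirFilter`, the function `copy_coefficients`, and the coefficient values are hypothetical. The point is only that the variable filter performs plain filtering and needs no identification algorithm of its own.

```python
class FirFilter:
    """Plain FIR filter; it filters only and has no adaptation logic."""

    def __init__(self, taps):
        self.coeffs = [0.0] * taps
        self.history = [0.0] * taps   # recent input samples, newest first

    def process(self, sample):
        """Shift in one input sample and return one filtered output sample."""
        self.history = [sample] + self.history[:-1]
        return sum(c * x for c, x in zip(self.coeffs, self.history))


def copy_coefficients(source, target):
    """Copy coefficients by value so later updates to `source` do not leak into `target`."""
    target.coeffs = list(source.coeffs)


if __name__ == "__main__":
    adaptive_7_1 = FirFilter(3)
    adaptive_7_1.coeffs = [0.5, 0.25, 0.125]   # pretend these were identified on talkback + audio

    variable_10 = FirFilter(3)
    copy_coefficients(adaptive_7_1, variable_10)

    talkback = [1.0, 0.0, 0.0, 0.0]            # talkback alone, without audio
    print([variable_10.process(s) for s in talkback])
```

Because the speaker-to-microphone path is modelled as linear, coefficients identified on the mixed talkback and audio sound can be reused to simulate the talkback alone. In practice the copy would presumably be repeated whenever the adaptive filter updates its coefficients, although the text does not state how often this happens.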
[0061]
Next, the operation of the speech recognition apparatus 300 according to the third embodiment configured as described above will be briefly described. The talkback sound output from the sound output unit 51 is gain-adjusted by the volume 1 and the gain control unit 2 to improve the clarity of the talkback sound.
[0062]
The talkback sound output from the volume 1 or the like is added by the adder 11 to the audio sound of the right channel reproduced by the audio reproducing unit 61, amplified by the output amplifier 6-1 at a predetermined magnification, and then output from the speaker 62-1. The audio sound of the left channel reproduced by the audio reproducing unit 61 is amplified at a predetermined magnification by the output amplifier 6-2 and then output from the speaker 62-2.
[0063]
The sound output from the speaker 62-1 (mixed sound of talkback sound and right channel audio sound) and the left channel audio sound output from the speaker 62-2 are input from the microphone 53. At this time, if the user utters a voice command, the uttered voice is also input from the microphone 53. If the vehicle is running, ambient noise such as engine sound and road noise is also input from the microphone 53. Therefore, the talkback sound, the left and right channel audio sounds, the speech sound, and the ambient noise are input from the microphone 53 in a mixed state.
[0064]
This mixed sound is input to the positive ends of the subtracters 5 and 9. The talkback simulated voice generated by the variable filter 10 is input to the minus end of the subtracter 5. The subtractor 5 calculates an error by subtracting the talkback simulated sound output from the variable filter 10 from the mixed sound input from the microphone 53, and extracts the audio sound, the speech sound, and the ambient noise. The extracted mixed sound of the audio sound, the uttered sound and the ambient noise is fed back to the gain control unit 2 and used for the process of improving the clarity of the talkback sound.
[0065]
The mixed sound of the talkback sound and the right channel audio sound output from the adder 11 is also input to the adaptive filter 7-1. The adaptive filter 7-1 generates a mixed simulated sound of talkback sound and right channel audio sound. On the other hand, in the adaptive filter 7-2, an audio simulation sound of the left channel is generated.
[0066]
The simulated voices generated by the adaptive filters 7-1 and 7-2 are added by the adder 8, and the result is input to the minus end of the subtracter 9. The subtracter 9 calculates an error by subtracting the mixed simulated voice of the talkback voice and the audio sound input from the adder 8 from the mixed voice input from the microphone 53, and extracts the speech voice and the ambient noise.
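The signal flow of the third embodiment can be summarized with the following relations, which follow the statement that the microphone mixture is input to the plus ends of both subtracters 5 and 9. The notation is introduced here for illustration only: m(n) is the microphone input, t(n) the talkback sound from the volume 1, a_R(n) the right channel audio sound, w_k the coefficients identified by the adaptive filter 7-1 (and copied to the variable filter 10), and \hat{y}_L(n) the left channel simulated sound from the adaptive filter 7-2.

```latex
\begin{align*}
\hat{y}_{mix}(n) &= \sum_k w_k \,\bigl[t(n-k) + a_R(n-k)\bigr]
                 && \text{(adaptive filter 7-1)} \\
\hat{y}_{tb}(n)  &= \sum_k w_k \, t(n-k)
                 && \text{(variable filter 10)} \\
e_5(n) &= m(n) - \hat{y}_{tb}(n)
                 && \text{(subtracter 5, fed back to gain control unit 2)} \\
e_9(n) &= m(n) - \bigl[\hat{y}_{mix}(n) + \hat{y}_L(n)\bigr]
                 && \text{(subtracter 9, supplied to engine 54)}
\end{align*}
```

Written this way, the talkback estimate is subtracted only once on each path, and the two filters 7-1 and 10 differ only in their input signal.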
[0067]
The speech voice and ambient noise extracted by the subtracter 9 are supplied to the voice recognition engine 54. As a result, processing for reducing ambient noise and processing corresponding to the voice command are executed. The speech and ambient noise extracted by the subtracter 9 are also fed back to the adaptive filters 7-1 and 7-2 and used for audio sound estimation calculation processing.
[0068]
Note that the operation of the speech recognition process according to the third embodiment is the same as that in the flowchart shown in FIG.
[0069]
As described above in detail, in the third embodiment, as in the second embodiment, voice input is accepted at any time even while talkback and audio playback are in progress, and voice input can be performed at a desired timing. Further, according to the third embodiment, it is not necessary to prepare an advanced adaptive filter including an algorithm for identifying filter coefficients for talkback speech estimation, and the cost can be reduced correspondingly. Furthermore, it is only necessary to copy the filter coefficient to the variable filter 10, and it is not necessary to perform arithmetic processing for identifying the filter coefficient. Therefore, there is an advantage that the processing load can be reduced.
[0070]
Although the first to third embodiments have been described above, they are merely examples of how the present invention may be carried out, and the technical scope of the present invention must not be interpreted restrictively because of them. In other words, the present invention can be implemented in various forms without departing from the spirit or main features thereof.
[0071]
[Effect of the invention]
In the present invention, as described above, the talkback sound that is output from the speaker and input to the microphone is estimated by the adaptive filter, and the estimated value is subtracted from the microphone input sound, so that only the speech voice is extracted from the microphone input sound in which the speech voice and other sounds are mixed. Consequently, voice input can be performed at any desired timing, even during talkback, without having to press the utterance button to interrupt for each utterance. Thereby, the operation time for voice recognition can be shortened without performing a complicated operation.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a main configuration of a speech recognition apparatus according to a first embodiment.
FIG. 2 is a diagram illustrating a configuration of an adaptive filter.
FIG. 3 is a flowchart showing an operation of speech recognition processing according to the first embodiment.
FIG. 4 is a timing chart showing the state of change in the voice input acceptance state according to the present embodiment in comparison with the prior art.
FIG. 5 is a block diagram showing a main configuration of a speech recognition apparatus according to a second embodiment.
FIG. 6 is a flowchart showing an operation of speech recognition processing according to the second embodiment.
FIG. 7 is a block diagram showing a main configuration of a speech recognition apparatus according to a third embodiment.
[Explanation of symbols]
1 Volume or equalizer
2 Gain controller
3 Output amplifier
4 Adaptive filter
5 Subtractor
6-1, 6-2 Output amplifier
7-1, 7-2 Adaptive filter
8 Adder
9 Subtractor
10 Variable filter
11 Adder

Claims

A speech recognition device having a function of recognizing a speech voice input from a microphone and talking back from a speaker,
Adaptive filter means in which a filter coefficient simulating a transmission system through which the talkback sound output from the speaker is input to the microphone is set, and which generates a talkback simulated sound at the position of the microphone by filtering the talkback sound before it is output from the speaker;
Utterance voice extraction means for extracting the utterance voice by subtracting the talkback simulated voice from the voice input from the microphone;
Clarity improvement processing means for performing clarity improvement processing on the talkback sound before it is output from the speaker, by using the talkback simulated sound generated by the adaptive filter means and an error sound obtained by subtracting the talkback simulated sound from the sound input from the microphone, treating the error sound as noise with respect to the talkback sound before it is output from the speaker, calculating a gain value to be applied to that talkback sound, and performing gain correction on that talkback sound with the calculated gain value,
wherein the adaptive filter means sets the filter coefficient by operating so as to minimize the power of the error sound.

A speech recognition device having a function of recognizing a speech voice input from a microphone and talking back from a speaker,
First adaptive filter means in which a first filter coefficient simulating a transmission system through which the talkback sound output from a first speaker is input to the microphone is set, and which generates a talkback simulated sound at the position of the microphone by filtering the talkback sound before it is output from the first speaker;
Second adaptive filter means in which a second filter coefficient simulating a transmission system through which the audio sound output from a second speaker is input to the microphone is set, and which generates an audio simulated sound at the position of the microphone by filtering the audio sound before it is output from the second speaker;
Utterance voice extraction means for extracting utterance voice by subtracting the talkback simulation voice and the audio simulation sound from the voice input from the microphone;
Clarity improvement processing means for performing clarity improvement processing on the talkback sound before it is output from the first speaker, by using the talkback simulated sound generated by the first adaptive filter means and an error sound obtained by subtracting the talkback simulated sound from the sound input from the microphone, treating the error sound as noise with respect to the talkback sound before it is output from the first speaker, calculating a gain value to be applied to that talkback sound, and performing gain correction on that talkback sound with the calculated gain value,
wherein the first adaptive filter means sets the first filter coefficient by operating so as to minimize the power of the error sound, and
the second adaptive filter means sets the second filter coefficient by operating so as to minimize the power of a second error sound obtained by subtracting the talkback simulated sound and the audio simulated sound from the sound input from the microphone.

3. The voice recognition apparatus according to claim 2, wherein a plurality of the second adaptive filter means are provided corresponding to a plurality of channels of audio sounds output from the plurality of second speakers.

A speech recognition device having a function of recognizing a speech voice input from a microphone and talking back from a speaker,
Adaptive filter means in which a filter coefficient simulating a transmission system through which a mixed sound of talkback sound and audio sound output from the speaker is input to the microphone is set, and which generates a mixed simulated sound at the position of the microphone by filtering the mixed sound before it is output from the speaker;
Utterance voice extraction means for extracting the utterance voice by subtracting the mixed simulated voice from the voice input from the microphone;
Variable filter means for generating a talkback simulated sound at the position of the microphone by copying and setting the filter coefficient set by the adaptive filter means and filtering the talkback sound before it is output from the speaker; and
Clarity improvement processing means for performing clarity improvement processing on the talkback sound before it is output from the speaker, by using the talkback simulated sound and an error sound obtained by subtracting the talkback simulated sound from the sound input from the microphone, treating the error sound as noise with respect to the talkback sound before it is output from the speaker, calculating a gain value to be applied to that talkback sound, and performing gain correction on that talkback sound with the calculated gain value,
wherein the adaptive filter means sets the filter coefficient by operating so as to minimize the power of a third error sound obtained by subtracting the mixed simulated sound from the sound input from the microphone.

5. The speech recognition apparatus according to claim 4, wherein audio sounds of a plurality of channels are output from a plurality of speakers, the talkback sound is also output from at least one of the speakers, and the adaptive filter means is provided corresponding to a mixed sound of the audio sound of a certain channel output from the at least one speaker and the talkback sound.

6. The speech recognition apparatus according to claim 5, further comprising other adaptive filter means in which a second filter coefficient simulating a transmission system through which the audio sound of another channel output from another speaker among the plurality of speakers is input to the microphone is set, and which generates an audio simulated sound at the position of the microphone by filtering the audio sound before it is output from the other speaker,
wherein the other adaptive filter means sets the second filter coefficient by operating so as to minimize the power of a fourth error sound obtained by subtracting the mixed simulated sound and the audio simulated sound from the sound input from the microphone, and
the utterance voice extraction means extracts the utterance voice by subtracting the mixed simulated sound and the audio simulated sound from the sound input from the microphone.

The voice recognition device according to claim 1, further comprising a controller that sets an inactive state in which voice input is not accepted when the recognition processing unit performs recognition processing on the uttered voice input from the microphone, and sets an active state in which voice input is accepted when starting a process of talking back, from the speaker, the uttered voice recognized by the recognition processing unit.

A step of setting an inactive state in which voice input is not accepted when the recognition processing unit performs recognition processing on speech sound input from a microphone;
A step of setting an active state for receiving voice input when starting a process of talking back from the speaker the uttered voice recognized by the recognition processing unit;
A step of setting, in an adaptive filter, a filter coefficient simulating a transmission system through which the talkback sound output from the speaker is input to the microphone, and generating a talkback simulated sound at the position of the microphone by filtering the talkback sound before it is output from the speaker;
Extracting the utterance voice by subtracting the talkback simulated voice from the voice inputted from the microphone when the active state is set, and supplying the utterance voice to the recognition processing unit;
A step of performing clarity improvement processing on the talkback sound before it is output from the speaker, by using the talkback simulated sound and an error sound obtained by subtracting the talkback simulated sound from the sound input from the microphone, treating the error sound as noise with respect to the talkback sound before it is output from the speaker, calculating a gain value to be applied to that talkback sound, and performing gain correction on that talkback sound with the calculated gain value,
wherein, in the step of generating the talkback simulated sound, the adaptive filter sets the filter coefficient by operating so as to minimize the power of the error sound.

A step of setting an inactive state in which voice input is not accepted when the recognition processing unit performs recognition processing on speech sound input from a microphone;
A step of setting an active state for receiving a voice input when starting a process of talking back the uttered voice recognized by the recognition processing unit from the first speaker;
A step of setting, in a first adaptive filter, a first filter coefficient simulating a transmission system through which the talkback sound output from the first speaker is input to the microphone, and generating a talkback simulated sound at the position of the microphone by filtering the talkback sound before it is output from the first speaker;
A step of setting, in a second adaptive filter, a second filter coefficient simulating a transmission system through which the audio sound output from a second speaker is input to the microphone, and generating an audio simulated sound at the position of the microphone by filtering the audio sound before it is output from the second speaker;
Extracting the utterance voice by subtracting the talkback simulation voice and the audio simulation sound from the voice input from the microphone at the time of setting the active state, and supplying the utterance voice to the recognition processing unit;
A step of performing clarity improvement processing on the talkback sound before it is output from the first speaker, by using the talkback simulated sound and an error sound obtained by subtracting the talkback simulated sound from the sound input from the microphone, treating the error sound as noise with respect to the talkback sound before it is output from the first speaker, calculating a gain value to be applied to that talkback sound, and performing gain correction on that talkback sound with the calculated gain value,
wherein, in the step of generating the talkback simulated sound, the first adaptive filter sets the first filter coefficient by operating so as to minimize the power of the error sound, and
in the step of generating the audio simulated sound, the second adaptive filter sets the second filter coefficient by operating so as to minimize the power of a second error sound obtained by subtracting the talkback simulated sound and the audio simulated sound from the sound input from the microphone.

A step of setting an inactive state in which voice input is not accepted when the recognition processing unit performs recognition processing on speech sound input from a microphone;
A step of setting an active state for receiving voice input when starting a process of talking back from the speaker the uttered voice recognized by the recognition processing unit;
A step of setting, in an adaptive filter, a filter coefficient simulating a transmission system through which a mixed sound of talkback sound and audio sound output from the speaker is input to the microphone, and generating a mixed simulated sound at the position of the microphone by filtering the mixed sound before it is output from the speaker;
Extracting the uttered voice by subtracting the mixed simulated voice from the voice input from the microphone when the active state is set, and supplying the extracted voice to the recognition processing unit;
A step of copying the filter coefficient set in the adaptive filter to a variable filter and setting it there, and generating a talkback simulated sound at the position of the microphone by filtering, with the variable filter, the talkback sound before it is output from the speaker;
A step of performing clarity improvement processing on the talkback sound before it is output from the speaker, by using the talkback simulated sound and an error sound obtained by subtracting the talkback simulated sound from the sound input from the microphone, treating the error sound as noise with respect to the talkback sound before it is output from the speaker, calculating a gain value to be applied to that talkback sound, and performing gain correction on that talkback sound with the calculated gain value,
wherein, in the step of generating the mixed simulated sound, the adaptive filter sets the filter coefficient by operating so as to minimize the power of a third error sound obtained by subtracting the mixed simulated sound from the sound input from the microphone.

The speech recognition method according to claim 10, wherein audio sounds of a plurality of channels are output from a plurality of speakers, the talkback sound is also output from at least one of the speakers, and the mixed simulated sound at the position of the microphone is generated by performing the filtering process on a mixed sound of the audio sound of a certain channel and the talkback sound before it is output from the at least one speaker.