JP3916861B2 - Voice recognition device

Info

Publication number: JP3916861B2
Application number: JP2000278399A
Authority: JP
Inventors: 真吾木内; 孝一中田
Original assignee: Alpine Electronics Inc
Current assignee: Alpine Electronics Inc
Priority date: 2000-09-13
Filing date: 2000-09-13
Publication date: 2007-05-23
Anticipated expiration: 2020-09-13
Also published as: JP2002091489A

Description

【０００１】
【発明の属する技術分野】
本発明は、入力される音声に対応する文字列を特定し、その内容に応じた応答を返す音声認識装置に関する。
【０００２】
【従来の技術】
従来から、音声によって各種の操作指示等の入力を行うための音声認識装置が実用化されており、各種の装置やシステムに採用されている。例えば、音声認識装置を搭載した車載用のナビゲーション装置では、経路探索における目的地の設定等の操作指示を音声により入力できるようになっている。また、パーソナルコンピュータ（以下、「パソコン」と称する。）において所定のプログラムを実行することにより、パソコン上で音声認識装置を実現し、マイクロホンによって集音された音声に対応して文章の入力等の操作を行っているものもある。
【０００３】
ところで、一般に音声認識技術は、単語音声認識技術と連続語音声認識技術とに分類することができる。前者の単語音声認識技術は、単語毎に区切って発声された音声を認識し、対応する単語の文字列を特定する技術である。また、後者の連続語音声認識技術は、複数の単語等が連続して発声された音声を認識し、対応する複数の単語の文字列を特定する技術である。
【０００４】
従来は、比較的に処理が容易な単語音声認識技術を採用した音声認識装置が主流であったが、利用者の立場から考えると、複数の単語を連続して入力することができるほうが操作性がよく好ましいことから、近年では、連続語音声認識技術を採用した音声認識装置が普及しつつある。このような連続語音声認識技術を採用した音声認識装置をナビゲーション装置に搭載した場合には、例えば、経路探索の目的地設定等において、「○○県××市△△・・・」というように都道府県名、市町村名、地名等の単語を連続して入力して認識させることができるので、これら都道府県名等の単語を１つずつ入力する場合に比べて入力操作を快適に行うことができる。
【０００５】
【発明が解決しようとする課題】
ところで、上述した連続語音声認識技術を採用した音声認識装置では、利用者が発声した音声を取り込む際に、ほぼ無音と見なせる状態（以後、この無音状態を「ブランク」と呼ぶ。）が予め設定した一定時間を超えた場合に、その時点を区切りとしてそれまでに入力された音声に対して音声認識処理を行い、認識結果を利用者に対して応答している。
【０００６】
例えば、入力された音声において時間Ｔ１以上のブランクが含まれていることを検出した時点で、この音声に対応した音声合成処理を行って所定の応答を出力する場合を考えると、有効な音声（ブランクを除いた音声）の入力が終わってから対応する応答が出力されるまでの間に必要な時間は、ブランクに対応する時間Ｔ１と音声合成処理に必要な時間を合計した所定時間Ｔとなる。したがって、利用者の立場からすれば、この所定時間Ｔが音声入力時に許容される見かけ上のブランクであり、この所定時間Ｔよりも短いブランクしか含まずに音声入力を行った場合には、連続語として音声認識処理が行われるものと考えるのが普通である。
【０００７】
ところが、従来の音声認識装置では、所定時間Ｔよりも短い時間Ｔ１のブランクを検出した時点で音声認識処理を終了して応答処理を開始していたため、この時間Ｔ１の経過後に音声が入力されても認識されない、いわゆる「取りこぼし」が生じるという問題があった。一般に、普段言い慣れていない単語は、流暢に発声することはできず、単語間にブランクが含まれる場合が多いため、上述した取りこぼしが生じやすい。
【０００８】
例えば、音声認識装置を搭載したナビゲーション装置に対して、普段言い慣れていない住所等を入力する場合を考えると、利用者自身は、「○○県××市・・・」というように住所を連続して入力しているつもりであるにも関わらず、実際には、「○○県」と「××市」の間など各単語の間にブランクを挿入してしまい、このため、例えば「○○県」までで認識処理が中断されて対応する応答が行われ、それ以降に発声された「××市・・・」の一部が取りこぼしとなってしまうことがある。また、上述したような取りこぼしが生じた場合には、例えば、「○○県。市町村名をどうぞ。」といった応答が行われることとなるので、利用者の立場から考えると、一度入力したはずである市町村名以降の音声が無視され、再度入力を要求されるので、このような応答に対して利用者は、違和感を感じることが多い。
【０００９】
本発明は、このような点に鑑みて創作されたものであり、その目的は、応答を返すまでに入力された音声に対して取りこぼしをなくすことができる音声認識装置を提供することにある。また、本発明の他の目的は、違和感のない応答を返すことができる音声認識装置を提供することにある。
【００１０】
【課題を解決するための手段】
上述した課題を解決するために、本発明の音声認識装置では、マイクロホンにより音声を集音し、集音された音声に対して音声認識処理手段によって音声認識処理を行い、認識された内容に基づいて応答手段により応答音声を生成し、出力する場合に、中断決定手段は、マイクロホンによって集音される音声に含まれる無音状態を検出し、この無音状態が時間ｔ１以上継続したときに、音声認識処理の中断を決定する。そして、音圧レベル検出手段は、マイクロホンによって集音される音声の音圧レベルを検出しており、上述した無音状態が時間ｔ１を経過した後の時間ｔ２の間に、音圧レベル検出手段によって検出された音圧レベルが所定値を超えたときに、再開決定手段は、音声認識処理の再開を決定する。
【００１１】
音声に含まれる無音状態が時間ｔ１を経過して所定の応答処理が開始された後にも、所定の時間ｔ２が経過するまでの間に所定の音圧レベルを超える音声が入力された場合には音声認識処理手段による処理が再開されるので、応答を返すまでに入力された音声に対して取りこぼしをなくすことができる。
【００１２】
また、上述した再開決定手段は、入力音声に含まれる時間ｔ１以上の最初の無音状態に対応して音声認識処理手段に対して１回だけ処理の再開を決定することが望ましい。一般に、最初の無音状態が検出されて応答が返された場合に、利用者がこの応答と並行して音声入力を行い続けるということはあまりないので、最初の無音状態に対応して１回だけ音声認識処理手段の処理を再開するだけでも、応答を返すまでに入力された音声の取りこぼしをほとんどなくすことができる。
【００１３】
また、上述した再開決定手段は、音声認識処理手段に対して処理の再開を指示する動作とともに、応答手段に対して応答音声の出力を中止する指示を送ることが望ましい。音声認識処理手段の処理が再開された場合に応答音声の出力を中止することにより、利用者自身が発声した音声と応答音声とが重なることを防ぐことができる。特に、応答音声を返すことなく音声認識処理が再開されるため、利用者によって発声される音声に時間（ｔ１＋ｔ２）のブランクが含まれるまで連続語に対する音声認識処理を継続することができ、効率よい音声入力を行うことができる。
【００１４】
また、上述した時間ｔ２は、無音状態の継続時間が時間ｔ１となって、中断決定手段によって音声認識処理の中断が決定されてから、応答手段によって応答音声を出力するまでの時間にほぼ等しい値に設定することが望ましい。無音状態の継続時間が時間ｔ１となってから、応答手段による応答音声が出力されるまでの時間と上述した時間ｔ２をほぼ等しい値とすることにより、応答音声が出力される以前に音声入力が行われた場合に、この音声入力に確実に対応して音声認識処理を継続させることができる。したがって、利用者自身は連続して音声を入力しているつもりであるにも関わらず、入力途中の音声に対応して音声認識処理が開始されて応答音声が出力されてしまうことがなく、利用者が違和感を感じることを防ぐことができる。
【００１５】
また、応答音声やその他の音源から出力される音声を出力するスピーカと、マイクロホンによって集音される音声に含まれる音声認識対象外の成分を除去する除去手段とをさらに備えておいて、除去手段から出力される音声認識対象の音声を音声認識処理手段に入力することが望ましい。音声認識対象外の成分を除去することにより、音声認識処理の精度を向上させることができるので、車載用のナビゲーション装置等に本発明の音声認識装置を搭載する場合など、音声認識対象外の音声がマイクロホンによって集音される音声に含まれやすい環境において特に有効である。
【００１６】
また、上述した応答手段は、時間ｔ１以上の無音状態が検出された後に再開された音声認識処理手段による音声認識処理の成否に応じて異なる内容の応答音声を生成することが望ましい。具体的には、例えば、再開後の音声認識処理が成功した場合には認識結果に基づいた応答音声を出力し、音声認識処理が失敗した場合には「利用者による入力音声の存在は認識しているが音声認識処理には失敗した」という内容を含む応答音声を出力するというように、音声認識処理の成否に応じて応答音声の内容を異ならせることにより、自分の行った音声入力が無視され、あるいは途中で遮られているといった悪い印象を利用者に対して与えることがなく、利用者が応答音声に対して感じる違和感をなくすことができる。
【００１７】
【発明の実施の形態】
以下、本発明を適用した一実施形態の音声認識装置について、図面を参照しながら説明する。
図１は、本実施形態の音声認識装置の構成を示す図である。同図に示す音声認識装置１００は、車載用のナビゲーション装置３００に対して音声により操作指示を与えるために用いられるものであり、トークスイッチ１０、マイクロホン１２、制御部１４、遅延素子１６、適応フィルタ（ＡＤＦ）１７、演算部１８、音声認識処理部２０、レベルメータ３０、音声合成処理部３２、合成部３４、スピーカ３６を含んで構成されている。なお、本実施形態の音声認識装置は、連続語音声認識技術を採用しているものとする。
【００１８】
トークスイッチ１０は、利用者が音声入力を行う前に操作されるものであり、操作状況が制御部１４に出力される。マイクロホン１２は、利用者が発声した音声を集音し、これを電気信号（音声信号）に変換して出力する。
制御部１４は、音声認識装置１００の全体動作を制御するものであり、音声認識処理を行った結果得られた文字列等の情報をナビゲーション装置３００に出力する。制御部１４の動作の詳細については後述する。
【００１９】
遅延素子１６は、マイクロホン１２から出力される音声信号を所定時間だけ遅延した信号を出力する。この遅延素子１６は、例えば、伝達特性Ｚ^-mを有するＦＩＲ（Finite Impulse Response ）型のデジタルフィルタを用いて、遅延時間ｔに対応するフィルタ係数を１、それ以外のフィルタ係数を０に設定することにより実現される。
【００２０】
適応フィルタ１７は、車室内の音響空間の伝達特性、具体的には、スピーカ３６から放射される音がマイクロホン１２に到達するまでの間の伝達特性を模擬するためのものであり、フィルタ係数Ｗを有するＦＩＲ型のデジタルフィルタと、このデジタルフィルタのフィルタ係数を設定するフィルタ係数設定部とを含んで構成されている。例えば、ＬＭＳ（Least Mean Square ）アルゴリズムを用いて、スピーカ３６に入力される音声信号（後述する）を参照信号として適応等化処理を行うことによりフィルタ係数Ｗが決定され、マイクロホン１２の出力信号に含まれるスピーカ３６の出力音成分を除去する処理が演算部１８によって行われる。
【００２１】
このようにして、本実施形態では、スピーカ３６の出力音成分をマイクロホン１２から出力される音声信号から除去しているので、音声認識処理時における応答音声やオーディオ装置２００から出力されるオーディオ音などが利用者の入力した音声と重なった場合にも、利用者の音声のみを確実に抽出することができ、音声認識処理の認識率を向上させることができる。
【００２２】
音声認識処理部２０は、入力される音声に対応して文字列を特定する所定の音声認識処理を行うものであり、２つのリングバッファ２２、２４、特徴量抽出部２６、照合処理部２８を含んで構成されている。
リングバッファ２２は、演算部１８から出力される雑音成分（オーディオ音や応答音声等）除去後の音声信号を入力順に取り込んで格納する。この格納された音声信号は、格納順に読み出されて、特徴量抽出部２６に入力される。
【００２３】
特徴量抽出部２６は、音声認識処理を行うために必要な各種の音声特徴量を抽出する。特徴量抽出部２６によって抽出された音声特徴量は、制御部１４からの指示に応じて、照合処理部２８に向けて直接出力されるか、またはリングバッファ２４に格納される。
【００２４】
リングバッファ２４は、特徴量抽出部２６から出力される音声特徴量をその入力順に格納しており、照合処理部２８から読み出し要求が与えられると、この格納された音声特徴量が格納順に読み出される。
照合処理部２８は、予め音素や単語などを単位とする標準パターンを用意しており、特徴量抽出部２６によって抽出された音声特徴量とこの標準パターンとを照合することにより、入力音声に対応する文字列を特定して制御部１４に出力する。
【００２５】
レベルメータ３０は、特徴量抽出部２６から出力される音声特徴量に基づいて音声の音圧レベルを計測し、計測結果を制御部１４に出力する。
音声合成処理部３２は、制御部１４からの指示に従い、照合処理部２８から出力された認識結果に対応した応答音声を出力するための音声信号を生成し、出力する。
【００２６】
合成部３４は、音声合成処理部３２から出力される音声信号と、オーディオ装置２００から出力されるオーディオ音信号とを合成してスピーカ３６に出力する。スピーカ３６は、合成部３４からの出力信号に対応して、応答音声やオーディオ音を出力する。
【００２７】
上述した音声認識処理部２０が音声認識処理手段に、音声合成処理部３２、スピーカ３６が応答手段に、照合処理部２８が中断決定手段に、制御部１４が再開決定手段に、レベルメータ３０が音圧レベル検出手段にそれぞれ対応している。また、遅延素子１６、適応フィルタ１７、演算部１８が除去手段に対応している。
【００２８】
本実施形態の音声認識装置はこのような構成を有しており、次にその動作を説明する。
〔第１の動作手順〕
図２は、音声認識装置１００における第１の動作手順を示す流れ図である。なお、以下の説明では、ナビゲーション装置３００において目的地などを設定する場合を想定し、操作指示として「○○県××市△△……」という音声、すなわち、“都道府県名”と“市町村名”、“地名”、……と続く複数の単語で構成される連続語音声に対して音声認識処理を行うものとして説明を行う。
【００２９】
制御部１４は、トークスイッチ１０が押下されたか否かを判定しており（ステップ１００）、トークスイッチ１０が押下されると、音声認識処理部２０に対して起動指示を出力する。
音声認識処理部２０が起動した後に、マイクロホン１２に対して利用者により音声入力が行われると（ステップ１０１）、この音声入力に対応して、音声認識処理部２０により所定の音声認識処理が行われる（ステップ１０２）。具体的には、リングバッファ２２に格納される音声信号に基づいて、特徴量抽出部２６により音声特徴量が抽出され、照合処理部２８により音声特徴量と標準パターンとの照合処理が行われることにより、入力された音声に対応する文字列（単語）が順次、特定される。
【００３０】
次に、音声認識処理部２０内の照合処理部２８は、入力された音声に時間ｔ１以上のブランク（無音状態）が含まれているか否かを判定する（ステップ１０３）。時間ｔ１以上のブランクが含まれていない場合には、ステップ１０３で否定判断がなされ、ステップ１０２に戻り、所定の音声認識処理が継続される。
【００３１】
また、入力された音声に時間ｔ１以上のブランクが含まれている場合には、ステップ１０３で肯定判断がなされ、照合処理部２８は、音声認識処理部２０による音声認識処理の中断を決定するとともに、ブランク検出時点までの音声に対する認識結果を制御部１４に出力する。
【００３２】
制御部１４は、照合処理部２８から受け取った認識結果を音声合成処理部３２に出力することにより、ブランク検出時点までの音声に対応する応答音声を出力する（ステップ１０４）。
また、ステップ１０４に示した処理と並行して、制御部１４は、レベルメータ３０からの出力信号が所定値を超えたか否かを調べることにより、時間ｔ１以上のブランク検出時から時間ｔ２以内に音声入力が行われたか否かを判定する（ステップ１０５）。なお、以後の説明では、時間ｔ１以上のブランク検出時から時間ｔ２以内に行われる音声入力を「追加の音声入力」と称することとする。
【００３３】
追加の音声入力が行われた場合には、ステップ１０５で肯定判断がなされ、制御部１４は、音声認識処理の再開を決定し、音声認識処理部２０に対して再度、起動指示を出力する。この起動指示に従って、音声認識処理部２０による所定の音声認識処理が再開され（ステップ１０６）、入力音声に時間ｔ１以上の２度目のブランクが含まれるまで（ステップ１０７）、ステップ１０６に示した音声認識処理が継続される。
【００３４】
入力された音声に時間ｔ１以上の２度目のブランクが含まれる場合には、ステップ１０７で肯定判断がなされ、照合処理部２８は、ブランク検出時点までの音声に対する認識結果を制御部１４に出力する。制御部１４は、照合処理部２８から出力される認識結果に基づいて、追加の音声入力を正常に認識することができたか否かを判定する（ステップ１０８）。具体的には、追加の音声入力に対応して何らかの文字列（単語）を特定することができた場合にはその文字列、追加の音声入力に対応する文字列を特定することができなかった場合にはその旨、すなわち、認識を正常に行えなかった旨のエラー通知がそれぞれ照合処理部２８から出力されるので、制御部１４は、照合処理部２８からのエラー通知の有無に基づいて、追加の音声入力を正常に認識することができたか否かを判定する。
【００３５】
追加の音声入力を認識することができた場合には、ステップ１０８で肯定判断がなされ、制御部１４は、照合処理部２８から受け取った認識結果の文字列を音声合成処理部３２に出力することにより、追加の音声入力に対応する応答音声を出力する（ステップ１０９）。
【００３６】
また、追加の音声入力を認識できなかった場合には、ステップ１０８で否定判断がなされ、制御部１４は、音声合成処理部３２に指示を送り、追加の音声入力が存在することは認識している旨を含む応答音声を出力する（ステップ１１０）。
【００３７】
具体的には、例えば、上述したように、利用者が「○○県××市……」と入力しようとしたが、「○○県」に対応した応答音声がステップ１０４に示した処理によって出力されてしまったために、「××市……」の入力を途中でやめてしまった場合などで、追加の音声入力を正常に認識することができなかった場合には、「○○県まで認識できました。もう一度、○○県以降をお願いします」といった内容の応答音声が出力される。このように、追加の音声入力の存在を認識している旨を含む応答音声を出力することにより、再度音声入力を促す場合であっても利用者の不快感や違和感を軽減することができる。
【００３８】
また、追加の音声入力が行われなかった場合には、上述したステップ１０５で否定判断がなされ、制御部１４は、必要に応じて追加の音声入力を促す応答を出力する（ステップ１１１）。具体的には、例えば、利用者により「○○県」だけが入力された場合であれば、「○○県。市町村名以降をどうぞ」といった応答音声が出力される。
【００３９】
〔第２の動作手順〕
ところで、上述した図２に示した第１の動作手順では、入力された音声に所定の時間ｔ１以上のブランクが含まれる場合にこれを検出し、その後の時間ｔ２以内に再び音声入力が行われた場合に１回だけ音声認識処理を再開するようにしていたが、時間ｔ１以上のブランクを検出した後の時間ｔ２以内に再び音声入力が行われた場合に、その都度音声認識処理が再開されるようにしてもよい。
【００４０】
図３は、音声認識装置１００における第２の動作手順を示す流れ図であり、所定の時間ｔ２以内に再び音声入力が行われた場合に、その都度音声認識処理を再開する場合の動作手順が示されている。なお、以下の説明においても、ナビゲーション装置３００において目的地などを設定する場合を想定し、操作指示として「○○県××市△△……」という音声、すなわち、“都道府県名”と“市町村名”、“地名”、……と続く複数の単語で構成される連続語音声に対して音声認識処理を行うものとして説明を行う。また、図３に示す第２の動作手順では、上述した図２に示した第１の動作手順における動作と重複している部分が多いので、重複部分に関しては適宜、簡略化して説明を行う。
【００４１】
制御部１４は、トークスイッチ１０が押下されたか否かを判定しており（ステップ２００）、トークスイッチ１０が押下されると、音声認識処理部２０に対して起動指示を出力する。
音声認識処理部２０が起動した後に、マイクロホン１２に対して利用者により音声入力が行われると（ステップ２０１）、この音声入力に対応して、音声認識処理部２０により所定の音声認識処理が行われる（ステップ２０２）。
【００４２】
次に、音声認識処理部２０内の照合処理部２８は、入力された音声に時間ｔ１以上のブランクが含まれているか否かを判定する（ステップ２０３）。時間ｔ１以上のブランクが含まれていない場合には、ステップ２０３で否定判断がなされ、ステップ２０２に戻り、所定の音声認識処理が継続される。
【００４３】
また、入力された音声に時間ｔ１以上のブランクが含まれている場合には、ステップ２０３で肯定判断がなされ、照合処理部２８は、音声認識処理部２０による音声認識処理の中断を決定するとともに、ブランク検出時点までの音声に対する認識結果を制御部１４に出力する。
【００４４】
次に、制御部１４は、音声認識処理部２０内の特徴量抽出部２６に対して指示を送ることにより、音声特徴量をリングバッファ２４に格納し（ステップ２０４）、これと並行して、照合処理部２８から取得した認識結果を音声合成処理部３２に出力することにより、ブランク検出時点までの音声に対応する応答音声の出力処理を開始するよう指示する（ステップ２０５）。
【００４５】
次に、制御部１４は、レベルメータ３０の出力信号に基づいて、時間ｔ１以上のブランク検出時点から所定の時間ｔ２以内に音声入力が行われたか否かを判定する（ステップ２０６）。
ここで、第２の動作手順における時間ｔ１およびｔ２について説明する。図４は、第２の動作手順における時間ｔ１およびｔ２について説明する図である。同図（Ａ）に示すように、最初に入力された音声において時間ｔ１以上のブランクが含まれている場合にこのブランクが検出され、それまでに入力された音声に対応した所定の応答音声が出力されるので、ブランクの開始時点から応答音声が出力されるまでの間に必要な時間（以後、これを「応答時間」と称する。）ｔは、ブランクに対応する時間ｔ１と応答音声を出力するための処理（応答処理）に必要な時間の合計に等しくなる。上述したように、利用者の立場からすれば、この応答時間ｔが音声入力時に許容されるブランク、すなわち見かけ上のブランクに対応しており、この応答時間ｔよりも短いブランクしか含まずに音声入力を行った場合には、連続語として音声認識処理が行われるものと認識されている場合が多い。
【００４６】
したがって、本実施形態では、時間ｔ１以上のブランクを検出した後に音声入力が行われたか否かを判定する時間ｔ２を、応答処理に必要な時間とほぼ等しい値に設定している。これにより、図４（Ｂ）に示すように、第１の音声入力（音声入力１）が行われ、ブランクが検出された後に、この第１の音声入力に対応する応答音声が出力される以前、すなわち時間ｔ２が経過する以前に第２の音声入力（音声入力２）が行われた場合には、第１の音声入力に対応する応答処理が中断されて、第２の音声入力に対応する音声認識処理が開始されることとなる。すなわち、応答時間ｔよりも短いブランクしか含まずに音声入力が行われた場合には、連続語として音声認識処理を行うことができるので、利用者の認識している見かけ上のブランクと音声認識装置１００において実際に許容されるブランク時間とをほぼ等しくすることができる。
【００４７】
時間ｔ２以内に音声入力が行われた場合には、ステップ２０６で肯定判断がなされ、制御部１４は、音声合成処理部３２に指示を送り、ブランク検出時点までに入力された音声に対応する応答音声を出力する処理を中止するとともに、音声認識処理の再開を決定し、音声認識処理部２０に対して所定の起動指示を送って照合処理部２８を起動する（ステップ２０７）。
【００４８】
起動指示を受けた照合処理部２８は、リングバッファ２４に格納された音声特徴量を読み出し（ステップ２０８）、その後、ステップ２０２に戻り、読み出した音声特徴量などに基づいて所定の音声認識処理を行う。
また、時間ｔ１以上のブランク検出時点から時間ｔ２以内に音声入力が行われなかった場合には、上述したステップ２０６で否定判断がなされ、制御部１４は、特徴量抽出部２６に対して指示を送り、音声特徴量をリングバッファ２４に格納する動作を中止する（ステップ２０９）。
【００４９】
また、制御部１４は、ステップ２０６に示した判定処理と並行して、照合処理部２８から出力される認識結果に基づいて、入力された音声を正常に認識することができたか否かを判定しており（ステップ２１０）、音声を正常に認識することができた場合には、ステップ２１０で肯定判断を行って、照合処理部２８から受け取った認識結果の文字列を音声合成処理部３２に出力することにより、入力された音声に対応する応答音声を出力する（ステップ２１１）。
【００５０】
具体的には、照合処理部２８は、入力された音声の全てに対応して何らかの文字列（単語）を特定することができた場合にはその文字列を出力し、音声の一部、あるいは全てに対応する文字列を特定することができなかった場合には、その旨（エラー通知）と特定することができた分の文字列を出力する。したがって、制御部１４は、照合処理部２８からのエラー通知の有無に基づいて、音声を正常に認識することができたか否かを判定する。
【００５１】
また、音声の一部あるいは全部を認識できなかった場合には、ステップ２１０で否定判断がなされ、制御部１４は、音声合成処理部３２に指示を送り、認識できた分の音声に対応する応答と、それ以外の音声（他の音声）が入力されたことも認識している旨の応答を出力する（ステップ２１２）。具体的には、利用者が「○○県××市……」と入力したが、「○○県」だけを認識することができ、後の「××市……」を認識することができなかった場合であれば、「○○県まで認識できました。もう一度、○○県以降をお願いします」といった内容の応答が出力される。このように、認識できなかった分の音声についても、その存在を認識している旨の応答を行うことにより、再度音声入力を促す場合における利用者の不快感や違和感を軽減することができる。
【００５２】
このように、本実施形態の音声認識装置１００は、音声に含まれるブランクが時間ｔ１を経過して所定の応答処理が開始された後にも、所定の時間ｔ２が経過するまでの間に音声が入力された場合には、所定の音声認識処理を再開しているので、応答音声を返すまでに入力された音声に対して取りこぼしをなくすことができる。また、時間ｔ１以上のブランクを検出した後に再開された音声認識処理の成否に応じて、音声認識処理が失敗した場合には「利用者による音声入力の存在は認識しているのだが音声認識処理には失敗した」という内容を含む応答音声を出力しているので、自分の音声入力が無視され、あるいは途中で遮られているといった悪い印象を利用者に対して与えてしまうことがなく、利用者が応答に対して違和感を感じることを防ぐことができる。
【００５３】
なお、本発明は上記実施形態に限定されるものではなく、本発明の要旨の範囲内において種々の変形実施が可能である。例えば、上述した実施形態では、本発明を適用した音声認識装置１００を車載用のナビゲーション装置３００と組み合わせて用いる場合の例を説明していたが、本発明の適用範囲は車載用に限定されるものではなく、他にも種々の装置やシステム、例えば、家庭用のパーソナルコンピュータ等を用いて実現される音声認識装置などに対しても適用することができる。
【００５４】
また、上述した実施形態では、車載用の用途を想定していたために、オーディオ音等を除去するための除去手段を備えた音声認識装置１００について説明したが、家庭用の用途等において、音声認識処理の対象とする音声以外の音がほとんど影響しないような場合には、除去手段を省略して構成の簡略化、低コスト化を図るようにしてもよい。
【００５５】
【発明の効果】
上述したように、本発明によれば、音声に含まれる無音状態が時間ｔ１を経過して所定の応答処理が開始された後にも、所定の時間ｔ２が経過するまでの間に所定の音圧レベルを超える音声が入力された場合には、音声認識処理手段による処理を再開しているので、応答を返すまでに入力された音声に対して取りこぼしをなくすことができる。また、時間ｔ１以上の無音状態が検出された後に再開された音声認識処理の成否に応じて、応答音声の内容を異ならせているので、自分の音声入力が無視され、あるいは途中で遮られているといった悪い印象を利用者に対して与えてしまうことがなく、利用者が応答に対して感じる違和感をなくすことができる。
【図面の簡単な説明】
【図１】一実施形態の音声認識装置の構成を示す図である。
【図２】音声認識装置における第１の動作手順を示す流れ図である。
【図３】音声認識装置における第２の動作手順を示す流れ図である。
【図４】第２の動作手順における時間ｔ１およびｔ２について説明する図である。
【符号の説明】
１０トークスイッチ
１２マイクロホン
１４制御部
１６遅延素子
１７適応フィルタ（ＡＤＦ）
１８演算部
２０音声認識処理部
２２、２４リングバッファ
２６特徴量抽出部
２８照合処理部
３０レベルメータ
３２音声合成処理部
３４合成部
３６スピーカ
１００音声認識装置
２００オーディオ装置
３００ナビゲーション装置
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a speech recognition device that specifies a character string corresponding to input speech and returns a response according to the content.
[0002]
[Prior art]
Speech recognition devices that accept various operation instructions and other input by voice have long been in practical use and are employed in a variety of devices and systems. For example, an in-vehicle navigation device equipped with a speech recognition device allows operation instructions, such as setting a destination for a route search, to be entered by voice. In other cases, a speech recognition device is realized on a personal computer (hereinafter, “PC”) by executing a predetermined program on it, and operations such as text input are performed in response to speech picked up by a microphone.
[0003]
Speech recognition technology can generally be classified into word speech recognition and continuous word speech recognition. The former recognizes speech uttered one word at a time, with a break between words, and specifies the character string of the corresponding word. The latter recognizes speech in which a plurality of words are uttered continuously and specifies the character strings of the corresponding words.
[0004]
Speech recognition devices employing word speech recognition technology, which is relatively easy to process, used to be the mainstream. From the user's standpoint, however, being able to input several words in succession is more convenient, so speech recognition devices employing continuous word speech recognition technology have become increasingly common in recent years. When such a device is installed in a navigation device, words such as a prefecture name, a municipality name, and a place name can be input and recognized in succession, for example as “○○ Prefecture, ×× City, △△ ...” when setting the destination of a route search. Input is therefore more comfortable than when such words are entered one at a time.
[0005]
[Problems to be solved by the invention]
In a speech recognition device employing the continuous word speech recognition technology described above, when a state that can be regarded as nearly silent (hereinafter, this silent state is called a “blank”) lasts longer than a preset fixed time while the user's speech is being captured, that point is treated as a break: speech recognition processing is performed on the speech input up to then, and the recognition result is returned to the user.
[0006]
For example, consider a case where, upon detecting that the input speech contains a blank of time T1 or longer, the device performs speech synthesis processing corresponding to that speech and outputs a predetermined response. The time required from the end of the valid speech input (speech excluding blanks) until the corresponding response is output is a predetermined time T, the sum of the time T1 corresponding to the blank and the time required for the speech synthesis processing. From the user's standpoint, therefore, this predetermined time T is the apparent blank permitted during voice input, and a user would normally expect speech containing only blanks shorter than T to be recognized as continuous words.
[0007]
In conventional speech recognition devices, however, the speech recognition process is terminated and the response process is started as soon as a blank of time T1, which is shorter than the predetermined time T, is detected. Speech input after time T1 has elapsed is therefore not recognized, a problem known as a “miss” (dropped input). Words that a user is not accustomed to saying generally cannot be uttered fluently, and blanks often fall between them, so such misses occur easily.
[0008]
For example, consider entering an unfamiliar address into a navigation device equipped with a speech recognition device. The user intends to input the address continuously, as in “○○ Prefecture, ×× City ...”, but in practice inserts blanks between words, such as between “○○ Prefecture” and “×× City”. As a result, the recognition process may be interrupted after “○○ Prefecture” and the corresponding response issued, so that part of the subsequently uttered “×× City ...” is missed. When such a miss occurs, the device responds with, for example, “○○ Prefecture. Please say the municipality name.” From the user's standpoint, the speech from the municipality name onward, which has already been input once, is ignored and requested again, so users often find such a response unnatural.
[0009]
The present invention has been made in view of these points. One object is to provide a speech recognition device that can eliminate misses for speech input before a response is returned. Another object is to provide a speech recognition device that can return responses that do not feel unnatural to the user.
[0010]
[Means for Solving the Problems]
To solve the above problems, the speech recognition device of the present invention collects speech with a microphone, performs speech recognition processing on the collected speech by speech recognition processing means, and generates and outputs a response voice by response means based on the recognized content. In doing so, interruption determination means detects a silent state in the speech collected by the microphone and decides to interrupt the speech recognition processing when the silent state has continued for time t1 or longer. Sound pressure level detection means detects the sound pressure level of the speech collected by the microphone, and when the detected sound pressure level exceeds a predetermined value during a time t2 following the point at which the silent state reached time t1, restart determination means decides to restart the speech recognition processing.
[0011]
Even after the silent state in the speech has exceeded time t1 and the predetermined response processing has started, processing by the speech recognition processing means is restarted if speech exceeding the predetermined sound pressure level is input before the predetermined time t2 elapses. Misses can therefore be eliminated for speech input before the response is returned.
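The interruption and restart decisions described above can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not the patented implementation: it assumes one sound pressure level value per fixed-length frame, expresses t1 and t2 as frame counts, and permits a restart after every interruption. The function name `track_pauses` and all parameter names are hypothetical.

```python
def track_pauses(levels, threshold, t1_frames, t2_frames):
    """Return (frame_index, event) tuples for a sequence of frame levels.

    Events (hypothetical labels):
      'interrupt' - silence has lasted t1, recognition is interrupted
      'resume'    - level exceeded the threshold within t2, recognition restarts
      'respond'   - t2 elapsed in silence, the response is output
    """
    events = []
    state = "listening"
    silent = 0            # consecutive silent frames while listening
    waited = 0            # frames elapsed since the interruption
    heard_speech = False  # ignore silence before the first utterance

    for i, level in enumerate(levels):
        loud = level > threshold
        if state == "listening":
            if loud:
                heard_speech = True
                silent = 0
            elif heard_speech:
                silent += 1
                if silent >= t1_frames:
                    events.append((i, "interrupt"))
                    state, waited = "waiting", 0
        elif state == "waiting":
            if loud:
                events.append((i, "resume"))
                state, silent = "listening", 0
            else:
                waited += 1
                if waited >= t2_frames:
                    events.append((i, "respond"))
                    break
    return events
```

With `t1_frames=3` and `t2_frames=2`, a pause of three silent frames followed immediately by speech produces an `'interrupt'` and then a `'resume'`, so the later speech is not dropped.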
[0012]
It is desirable that the restart determination means decide to restart processing by the speech recognition processing means only once, in response to the first silent state of time t1 or longer in the input speech. In general, once the first silent state has been detected and a response returned, users rarely keep speaking in parallel with the response. Restarting the speech recognition processing means only once, for the first silent state, is therefore enough to eliminate almost all misses for speech input before the response is returned.
[0013]
It is also desirable that the restart determination means, together with instructing the speech recognition processing means to restart processing, send the response means an instruction to stop outputting the response voice. Stopping the output of the response voice when processing by the speech recognition processing means is restarted prevents the user's own speech and the response voice from overlapping. In particular, because speech recognition processing is restarted without a response voice being returned, recognition of continuous words can continue until the user's speech contains a blank of time (t1 + t2), which makes voice input efficient.
[0014]
It is also desirable that time t2 be set to a value approximately equal to the time from when the silent state has lasted for time t1 and the interruption determination means decides to interrupt the speech recognition processing until the response means outputs the response voice. With t2 approximately equal to that interval, any voice input made before the response voice is output reliably causes the speech recognition processing to continue. Consequently, a response voice based on a partially input utterance is not output while the user believes the input is still continuous, and the user is spared the resulting sense of incongruity.
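The effect of matching t2 to the response processing time can be illustrated with a small classification of pause lengths. The sketch below is an assumption-laden illustration: the function name, the outcome labels, and the example values t1 = 0.5 s and t2 = 0.7 s are all hypothetical and do not come from the specification.

```python
def pause_outcome(pause_s, t1_s, t2_s, restart_enabled):
    """Classify what happens to speech that follows a pause of pause_s seconds."""
    if pause_s < t1_s:
        # The blank is too short to be detected: recognition simply continues.
        return "continuous"
    if pause_s < t1_s + t2_s:
        # The blank was detected, but the response has not been output yet.
        # With the restart mechanism the speech is recognized; without it,
        # the speech is dropped.
        return "resumed" if restart_enabled else "missed"
    # The response was already output before the user spoke again.
    return "after_response"
```

For a 0.8 s pause with the example values, a device without the restart mechanism yields `"missed"`, while a device with it yields `"resumed"`; the tolerated blank grows from t1 to roughly t1 + t2.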
[0015]
It is also desirable to further provide a speaker that outputs the response voice and sound from other sound sources, and removing means that removes components not subject to speech recognition from the sound collected by the microphone, and to input the speech output from the removing means, which is the target of speech recognition, to the speech recognition processing means. Removing components that are not subject to speech recognition improves the accuracy of the speech recognition processing. This is particularly effective in environments where such sound is easily picked up by the microphone, for example when the speech recognition device of the present invention is installed in an in-vehicle navigation device.
[0016]
It is also desirable that the response means generate response voices with different content depending on whether the speech recognition processing restarted after detection of a silent state of time t1 or longer succeeds or fails. Specifically, when the restarted speech recognition processing succeeds, a response voice based on the recognition result is output; when it fails, a response voice is output conveying that the presence of the user's input speech was recognized but the speech recognition processing failed. Varying the response in this way avoids giving the user the bad impression that the input was ignored or cut off partway, and eliminates the sense of incongruity the user would otherwise feel toward the response voice.
[0017]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, a speech recognition apparatus according to an embodiment to which the present invention is applied will be described with reference to the drawings.
FIG. 1 is a diagram illustrating the configuration of the speech recognition apparatus according to the present embodiment. The speech recognition apparatus 100 shown in the figure is used to give operation instructions by voice to an in-vehicle navigation apparatus 300, and comprises a talk switch 10, a microphone 12, a control unit 14, a delay element 16, an adaptive filter (ADF) 17, a calculation unit 18, a speech recognition processing unit 20, a level meter 30, a speech synthesis processing unit 32, a synthesis unit 34, and a speaker 36. The speech recognition apparatus of this embodiment is assumed to employ continuous word speech recognition technology.
[0018]
The talk switch 10 is operated before the user performs voice input, and the operation status is output to the control unit 14. The microphone 12 collects voice uttered by the user, converts it into an electric signal (voice signal), and outputs it.
The control unit 14 controls the overall operation of the voice recognition device 100 and outputs information such as a character string obtained as a result of the voice recognition processing to the navigation device 300. Details of the operation of the control unit 14 will be described later.
[0019]
The delay element 16 outputs a signal obtained by delaying the audio signal output from the microphone 12 by a predetermined time. The delay element 16 is realized, for example, by an FIR (Finite Impulse Response) digital filter having the transfer characteristic Z^-m, in which the filter coefficient corresponding to the delay time t is set to 1 and all other filter coefficients are set to 0.
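A pure delay built from an FIR filter with a single non-zero coefficient can be sketched as follows. This is only an illustration of the idea in the paragraph above; the function names `delay_coeffs` and `fir_filter`, and the choice of a direct-form convolution over a Python list, are assumptions rather than details from the specification.

```python
def delay_coeffs(m, length):
    """FIR coefficients for a delay of m samples: 1 at tap m, 0 elsewhere."""
    coeffs = [0.0] * length
    coeffs[m] = 1.0
    return coeffs


def fir_filter(x, coeffs):
    """Direct-form FIR filter: y[n] = sum_k coeffs[k] * x[n - k]."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, c in enumerate(coeffs):
            if n - k >= 0:  # samples before the start are treated as zero
                acc += c * x[n - k]
        y.append(acc)
    return y
```

Filtering `[1, 2, 3, 4, 5]` with `delay_coeffs(2, 4)` shifts the signal by two samples, giving `[0, 0, 1, 2, 3]`.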
[0020]
The adaptive filter 17 simulates the transfer characteristic of the acoustic space in the vehicle interior, specifically the transfer characteristic of the path over which sound radiated from the speaker 36 reaches the microphone 12. It comprises an FIR digital filter having filter coefficients W and a filter coefficient setting unit that sets the coefficients of this digital filter. For example, the filter coefficients W are determined by adaptive equalization using the LMS (Least Mean Square) algorithm, with the audio signal input to the speaker 36 (described later) as the reference signal, and the calculation unit 18 removes the output sound component of the speaker 36 contained in the output signal of the microphone 12.
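The following sketch shows how an LMS-adapted FIR filter can estimate the speaker-to-microphone path and subtract the estimated speaker component from the microphone signal. It is a textbook LMS loop written under assumptions: the delay element 16 is omitted, the step size `mu` and tap count are arbitrary, and the names `lms_echo_cancel` and `simulate_echo` are hypothetical.

```python
def simulate_echo(reference, path):
    """Convolve the speaker signal with a (hypothetical) acoustic path."""
    out = []
    for n in range(len(reference)):
        out.append(sum(h * reference[n - k]
                       for k, h in enumerate(path) if n - k >= 0))
    return out


def lms_echo_cancel(reference, mic, num_taps, mu):
    """Remove the speaker component from mic using an LMS adaptive filter.

    reference: samples fed to the speaker (the reference signal)
    mic:       samples picked up by the microphone
    Returns (residual, weights); residual is the mic signal after subtraction.
    """
    w = [0.0] * num_taps        # filter coefficients W
    history = [0.0] * num_taps  # recent reference samples, newest first
    residual = []
    for x, d in zip(reference, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(wi * xi for wi, xi in zip(w, history))
        e = d - echo_estimate   # subtraction performed by the calculation unit
        w = [wi + mu * e * xi for wi, xi in zip(w, history)]  # LMS update
        residual.append(e)
    return residual, w
```

When the microphone signal contains only the speaker component, the weights converge toward the simulated path and the residual approaches zero; when the user speaks as well, the residual is an estimate of the user's speech.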
[0021]
In this embodiment, the output sound component of the speaker 36 is thus removed from the audio signal output from the microphone 12. Even when a response voice or audio sound output from the audio device 200 overlaps the user's speech during speech recognition processing, only the user's speech can be reliably extracted, which improves the recognition rate of the speech recognition processing.
[0022]
The speech recognition processing unit 20 performs predetermined speech recognition processing that specifies a character string corresponding to the input speech, and comprises two ring buffers 22 and 24, a feature amount extraction unit 26, and a collation processing unit 28.
The ring buffer 22 takes in and stores the audio signal after removal of noise components (audio sound, response sound, etc.) output from the calculation unit 18 in the order of input. The stored audio signals are read out in the order of storage and input to the feature amount extraction unit 26.
[0023]
The feature quantity extraction unit 26 extracts various voice feature quantities necessary for performing the voice recognition process. The voice feature amount extracted by the feature amount extraction unit 26 is output directly to the collation processing unit 28 or stored in the ring buffer 24 in accordance with an instruction from the control unit 14.
[0024]
The ring buffer 24 stores the voice feature amounts output from the feature amount extraction unit 26 in the order of input, and when a read request is issued by the collation processing unit 28, the stored voice feature amounts are read out in the order in which they were stored.
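A ring buffer that stores items in input order and returns them in the same order on request can be sketched with a bounded double-ended queue. The class name `FeatureRingBuffer`, the capacity parameter, and the choice to overwrite the oldest entry when full are assumptions made for illustration; the specification does not state the buffer size or its overflow behavior.

```python
from collections import deque


class FeatureRingBuffer:
    """Fixed-capacity FIFO store for feature frames (hypothetical sketch)."""

    def __init__(self, capacity):
        # A deque with maxlen silently discards the oldest item when full.
        self._frames = deque(maxlen=capacity)

    def store(self, frame):
        """Append one feature frame in input order."""
        self._frames.append(frame)

    def read_all(self):
        """Return the stored frames in storage order and empty the buffer."""
        frames = list(self._frames)
        self._frames.clear()
        return frames
```

In the second operation procedure of the Japanese text, a buffer of this kind holds the feature amounts captured while the response is being prepared, so that they can be handed to the collation processing when recognition is restarted.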
The collation processing unit 28 holds standard patterns prepared in advance in units such as phonemes or words. By collating the voice feature amounts extracted by the feature amount extraction unit 26 against these standard patterns, it specifies the character string corresponding to the input speech and outputs it to the control unit 14.
[0025]
The level meter 30 measures the sound pressure level of the voice based on the voice feature amount output from the feature amount extraction unit 26 and outputs the measurement result to the control unit 14.
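One simple way to obtain a level value that can be compared against a predetermined threshold is the mean-square level of a frame in decibels. This is a sketch under assumptions: the specification derives the level from the voice feature amounts rather than from raw samples, and the function names, the dB scale relative to full scale, and the floor value are all hypothetical.

```python
import math


def frame_level_db(samples, floor_db=-120.0):
    """Mean-square level of one frame in dB relative to full scale (1.0)."""
    if not samples:
        return floor_db
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square <= 0.0:
        return floor_db  # digital silence
    return max(floor_db, 10.0 * math.log10(mean_square))


def exceeds_threshold(samples, threshold_db):
    """True when the frame level is above the predetermined value."""
    return frame_level_db(samples) > threshold_db
```

A frame of amplitude 0.1 sits near -20 dB and would exceed a -30 dB threshold, while a frame of amplitude 0.001 (near -60 dB) would not.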
The voice synthesis processing unit 32 generates and outputs a voice signal for outputting a response voice corresponding to the recognition result output from the collation processing unit 28 in accordance with an instruction from the control unit 14.
[0026]
The synthesizing unit 34 synthesizes the audio signal output from the audio synthesis processing unit 32 and the audio sound signal output from the audio device 200 and outputs the synthesized audio signal to the speaker 36. The speaker 36 outputs response sound and audio sound in response to the output signal from the synthesis unit 34.
[0027]
The speech recognition processing unit 20 described above corresponds to the speech recognition processing means; the speech synthesis processing unit 32 and the speaker 36 to the response means; the collation processing unit 28 to the interruption determination means; the control unit 14 to the restart determination means; and the level meter 30 to the sound pressure level detection means. The delay element 16, the adaptive filter 17, and the calculation unit 18 correspond to the removing means.
[0028]
The voice recognition apparatus of this embodiment has such a configuration, and the operation thereof will be described next.
[First operation procedure]
FIG. 2 is a flowchart showing a first operation procedure of the speech recognition apparatus 100. The following description assumes that a destination or the like is being set in the navigation apparatus 300, and that speech recognition processing is performed on the operation instruction “○○ Prefecture, ×× City, △△ ...”, that is, continuous word speech composed of a plurality of words in the order “prefecture name”, “municipality name”, “place name”, and so on.
[0029]
The control unit 14 determines whether or not the talk switch 10 has been pressed (step 100), and outputs an activation instruction to the voice recognition processing unit 20 when the talk switch 10 is pressed.
When the user performs voice input to the microphone 12 after the voice recognition processing unit 20 is activated (step 101), the voice recognition processing unit 20 performs predetermined voice recognition processing on that input (step 102). Specifically, the feature amount extraction unit 26 extracts voice feature amounts from the audio signal stored in the ring buffer 22, and the collation processing unit 28 collates the voice feature amounts against the standard patterns, so that the character strings (words) corresponding to the input voice are identified one after another.
[0030]
Next, the collation processing unit 28 in the speech recognition processing unit 20 determines whether the input speech contains a blank (silent state) of time t1 or more (step 103). If no blank of time t1 or more is contained, a negative determination is made in step 103, the process returns to step 102, and the predetermined voice recognition processing continues.
[0031]
If the input voice includes a blank for time t1 or more, an affirmative determination is made in step 103, and the collation processing unit 28 determines to interrupt the voice recognition processing by the voice recognition processing unit 20. The recognition result for the voice up to the time of blank detection is output to the control unit 14.
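The blank check of step 103 can be pictured with a short sketch. This is only an illustration under stated assumptions, not the implementation of the embodiment: the per-frame level list, the silence threshold, the fixed frame length, and the function name `find_blank` are all hypothetical and do not appear in the specification.

```python
def find_blank(levels, threshold, frame_ms, t1_ms):
    """Return the index of the frame at which a silent run first lasts
    t1_ms or more, or None if the input contains no such blank.

    levels    -- per-frame sound levels (hypothetical units)
    threshold -- level below which a frame counts as nearly silent
    frame_ms  -- frame length in milliseconds (assumed fixed)
    t1_ms     -- blank duration that triggers the interruption
    """
    frames_needed = -(-t1_ms // frame_ms)  # ceiling division on integers
    silent_run = 0
    for index, level in enumerate(levels):
        if level < threshold:
            silent_run += 1
            if silent_run >= frames_needed:
                return index  # affirmative determination in step 103
        else:
            silent_run = 0  # voice resumed, so the silent run starts over
    return None  # negative determination: recognition simply continues


# Two silent frames (too short), then three silent frames (a blank of 300 ms).
levels = [5, 5, 0, 0, 5, 0, 0, 0, 5]
print(find_blank(levels, threshold=1, frame_ms=100, t1_ms=300))  # prints 7
```

Counting whole frames in integer milliseconds avoids floating-point drift when the silent run is accumulated; any real device would choose its own frame length and threshold.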
[0032]
The control unit 14 outputs the recognition result received from the collation processing unit 28 to the voice synthesis processing unit 32, thereby causing a response voice corresponding to the voice up to the time of blank detection to be output (step 104).
In parallel with the processing shown in step 104, the control unit 14 checks whether or not the output signal from the level meter 30 has exceeded a predetermined value, and thereby determines whether or not voice input has been performed within time t2 from the detection of a blank of time t1 or more (step 105). In the following description, voice input performed within the time t2 after the detection of a blank of time t1 or more is referred to as “additional voice input”.
[0033]
If an additional voice input has been performed, an affirmative determination is made in step 105, and the control unit 14 determines to restart the voice recognition process and again outputs an activation instruction to the voice recognition processing unit 20. In accordance with this activation instruction, the predetermined voice recognition processing by the voice recognition processing unit 20 is resumed (step 106), and the voice recognition processing shown in step 106 continues until the input voice includes a second blank of time t1 or more (step 107).
[0034]
If the input voice includes a second blank of time t1 or more, an affirmative determination is made in step 107, and the collation processing unit 28 outputs the recognition result for the voice up to the time of blank detection to the control unit 14. Based on the recognition result output from the collation processing unit 28, the control unit 14 determines whether or not the additional voice input has been normally recognized (step 108). Specifically, when some character string (word) can be identified for the additional voice input, the collation processing unit 28 outputs that character string; when no character string corresponding to the additional voice input can be identified, it outputs an error notification indicating that recognition could not be performed normally. The control unit 14 therefore determines whether or not the additional voice input has been normally recognized by checking whether there is an error notification from the collation processing unit 28.
[0035]
If the additional speech input can be recognized, an affirmative determination is made in step 108, and the control unit 14 outputs the recognition result character string received from the collation processing unit 28 to the speech synthesis processing unit 32. Thus, a response voice corresponding to the additional voice input is output (step 109).
[0036]
If the additional voice input cannot be recognized, a negative determination is made in step 108, and the control unit 14 sends an instruction to the voice synthesis processing unit 32 so that a response voice indicating that the presence of the additional voice input has been recognized is output (step 110).
[0037]
Specifically, suppose, as described above, that the user intended to input “XX prefecture XX city ……” but stopped the input of “XX city ……” partway because the response voice corresponding to “XX prefecture” was output by the processing shown in step 104, so that the additional voice input could not be recognized normally. In that case, a response voice such as “Recognized up to XX prefecture. Please say the part after XX prefecture again.” is output. By outputting a response voice that acknowledges the presence of the additional voice input in this way, the user's sense of incongruity and discomfort can be reduced even when the user is prompted to input voice again.
[0038]
If no additional voice input has been made, a negative determination is made in step 105 described above, and the control unit 14 outputs a response prompting additional voice input as necessary (step 111). Specifically, for example, when only “XX prefecture” is input by the user, a response voice such as “XX prefecture.
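The branching among steps 109, 110, and 111 that follows the step 104 response can be summarized in a few lines. The sketch below is an illustration only; the function name `follow_up_step` and its two arguments are hypothetical, and it returns the flowchart step number rather than any response wording.

```python
def follow_up_step(additional_input_heard, additional_words):
    """Return the flowchart step (109, 110 or 111) that follows step 104.

    additional_input_heard -- True if the level meter exceeded its
                              predetermined value within t2 (step 105)
    additional_words       -- words recognized from the additional input,
                              or None when an error notification was sent
    """
    if not additional_input_heard:
        return 111  # no additional input: prompt for more if necessary
    if additional_words:
        return 109  # additional input recognized: respond to it
    return 110      # input was heard but not recognized: acknowledge it


print(follow_up_step(True, ["XX city"]))  # prints 109
```

Keeping step 110 separate from step 111 is what lets the device acknowledge that something was said even when it could not be recognized.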
[0039]
[Second operation procedure]
In the first operation procedure shown in FIG. 2 described above, the voice recognition process is resumed only once, when a blank of the predetermined time t1 or more is detected in the input voice and voice is input again within the subsequent time t2. Alternatively, the voice recognition process may be resumed every time voice is input again within the time t2 after a blank of time t1 or more is detected.
[0040]
FIG. 3 is a flowchart showing a second operation procedure in the speech recognition apparatus 100, namely the operation procedure in which the speech recognition process is restarted every time voice is input again within the predetermined time t2. The following description again assumes that a destination is being set in the navigation apparatus 300 and that, as the operation instruction, speech recognition processing is performed on continuous-word speech such as “XX prefecture XX city △△ ……”, that is, speech composed of a plurality of words in the order “prefecture name”, “city / town / village name”, “place name”, and so on. Since the second operation procedure shown in FIG. 3 overlaps in many respects with the first operation procedure shown in FIG. 2, the description of the overlapping portions is simplified as appropriate.
[0041]
The control unit 14 determines whether or not the talk switch 10 has been pressed (step 200). When the talk switch 10 is pressed, the control unit 14 outputs an activation instruction to the voice recognition processing unit 20.
When the user performs voice input to the microphone 12 after the voice recognition processing unit 20 is activated (step 201), a predetermined voice recognition process is performed by the voice recognition processing unit 20 in response to the voice input (step 202).
[0042]
Next, the collation processing unit 28 in the speech recognition processing unit 20 determines whether or not the input speech includes a blank of time t1 or more (step 203). If no blank of time t1 or more is included, a negative determination is made in step 203, the process returns to step 202, and the predetermined voice recognition process is continued.
[0043]
If the input voice includes a blank for time t1 or more, an affirmative determination is made in step 203, and the collation processing unit 28 determines to interrupt the voice recognition processing by the voice recognition processing unit 20. The recognition result for the voice up to the time of blank detection is output to the control unit 14.
[0044]
Next, the control unit 14 sends an instruction to the feature amount extraction unit 26 in the voice recognition processing unit 20 so that storage of the voice feature amount in the ring buffer 24 is started (step 204). The control unit 14 also outputs the recognition result acquired from the collation processing unit 28 to the speech synthesis processing unit 32, thereby instructing it to start the output processing of the response voice corresponding to the voice up to the time of blank detection (step 205).
[0045]
Next, the control unit 14 determines, based on the output signal of the level meter 30, whether or not voice input has been performed within a predetermined time t2 from a blank detection time of time t1 or more (step 206).
Here, the times t1 and t2 in the second operation procedure will be described. FIG. 4 is a diagram for explaining times t1 and t2 in the second operation procedure. As shown in FIG. 4A, when the first input voice includes a blank of time t1 or more, this blank is detected, and a predetermined response voice corresponding to the voice input so far is output. The time required from when the blank starts until the response voice is output (hereinafter referred to as “response time”) t is therefore equal to the sum of the time t1 corresponding to the blank and the time required for the processing that outputs the response voice (response processing). As described above, from the user's point of view, this response time t corresponds to the blank allowed at the time of voice input, that is, the apparent blank, and the user usually assumes that voice input containing only blanks shorter than the response time t will be recognized as continuous words.
[0046]
Therefore, in the present embodiment, the time t2 used to determine whether or not voice input has been performed after a blank of time t1 or more is detected is set to a value substantially equal to the time required for the response processing. Thereby, as shown in FIG. 4B, when the second voice input (voice input 2) is performed after the first voice input (voice input 1) has been performed and a blank has been detected but before the response voice corresponding to the first voice input is output, that is, before the time t2 elapses, the response processing corresponding to the first voice input is interrupted and the voice recognition processing corresponding to the second voice input is started. In other words, when voice input containing only blanks shorter than the response time t is performed, voice recognition processing can be performed on it as continuous words, so the apparent blank perceived by the user and the actual allowable blank time in the speech recognition apparatus 100 can be made approximately equal.
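The relationship between t1, t2, and the response time t can be made concrete with a small sketch. It is only an illustration: the function name `classify_gap`, the returned labels, and the millisecond values are assumptions chosen for the example, since the specification gives no numerical values for t1 or t2.

```python
def classify_gap(gap_ms, t1_ms, t2_ms):
    """Classify a pause between two voice inputs in the second procedure.

    gap_ms -- length of the pause, measured from the start of the blank
    t1_ms  -- blank duration that triggers interruption of recognition
    t2_ms  -- window after blank detection in which voice restarts recognition
    """
    if gap_ms < t1_ms:
        return "continue"  # no blank detected (negative in step 203)
    if gap_ms < t1_ms + t2_ms:
        return "resume"    # voice within t2 of detection (affirmative in step 206)
    return "respond"       # t2 elapsed, so the response voice is output


T1_MS, T2_MS = 500, 700  # illustrative values only, not from the specification
# With t2 set equal to the response processing time, the response time t
# (the apparent blank seen by the user) is the sum of the two.
response_time_ms = T1_MS + T2_MS

print(classify_gap(900, T1_MS, T2_MS))  # prints resume
```

Under these assumed values, any pause shorter than 1200 ms either never interrupts recognition or restarts it, which is the sense in which the apparent blank and the actual allowable blank coincide.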
[0047]
If a voice input is made within time t2, an affirmative determination is made in step 206. The control unit 14 sends an instruction to the voice synthesis processing unit 32 to stop the processing of outputting the response voice corresponding to the voice input up to the time of blank detection, determines to resume the voice recognition processing, and sends a predetermined activation instruction to the voice recognition processing unit 20 to activate the collation processing unit 28 (step 207).
[0048]
Upon receiving the activation instruction, the collation processing unit 28 reads out the voice feature amount stored in the ring buffer 24 (step 208), and the process then returns to step 202, where the predetermined voice recognition processing is performed based on the read voice feature amount.
If no voice is input within the time t2 from the detection of a blank of time t1 or more, a negative determination is made in step 206 described above, and the control unit 14 sends an instruction to the feature amount extraction unit 26 to stop the operation of storing the voice feature amount in the ring buffer 24 (step 209).
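The role of the ring buffer 24 in steps 204, 208, and 209 can be sketched as follows. This is a minimal illustration under assumptions, not the structure of the embodiment: the class name `FeatureBuffer`, its method names, the fixed capacity, and the use of strings as stand-ins for feature amounts are all hypothetical. The sketch also assumes that the oldest entries are overwritten when the buffer is full, which is the usual behavior of a ring buffer but is not stated in the specification.

```python
from collections import deque


class FeatureBuffer:
    """Hypothetical stand-in for the ring buffer 24."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item when full.
        self._frames = deque(maxlen=capacity)
        self.storing = False

    def start(self):
        """Begin storing feature amounts (corresponds to step 204)."""
        self.storing = True

    def stop(self):
        """Stop storing feature amounts (corresponds to step 209)."""
        self.storing = False

    def push(self, feature):
        """Store one feature amount, but only while storing is enabled."""
        if self.storing:
            self._frames.append(feature)

    def read_all(self):
        """Hand the stored features to the collation step (step 208)."""
        frames = list(self._frames)
        self._frames.clear()
        return frames


buf = FeatureBuffer(capacity=3)
buf.push("f0")            # ignored: storing has not started yet
buf.start()
for f in ["f1", "f2", "f3", "f4"]:
    buf.push(f)           # "f1" is overwritten once the buffer is full
print(buf.read_all())     # prints ['f2', 'f3', 'f4']
```

Buffering the features while the response is being prepared is what allows the restarted recognition to cover voice that arrived before the collation processing unit 28 was reactivated.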
[0049]
Further, in parallel with the determination process shown in step 206, the control unit 14 determines whether or not the input voice has been normally recognized, based on the recognition result output from the collation processing unit 28 (step 210). If the voice has been normally recognized, an affirmative determination is made in step 210, and the control unit 14 outputs the recognition result character string received from the collation processing unit 28 to the speech synthesis processing unit 32, so that a response voice corresponding to the input voice is output (step 211).
[0050]
Specifically, the collation processing unit 28 outputs the character string when some character string (word) can be identified for all of the input voice, and outputs a notification to that effect (error notification) when a character string cannot be identified for part or all of the voice. Therefore, the control unit 14 determines whether or not the voice has been normally recognized based on the presence or absence of an error notification from the collation processing unit 28.
[0051]
If part or all of the voice cannot be recognized, a negative determination is made in step 210, and the control unit 14 sends an instruction to the speech synthesis processing unit 32 so that a response corresponding to the voice that was recognized is output together with a response indicating that other voice was also input (step 212). Specifically, when the user inputs “XX prefecture XX city ……” but only “XX prefecture” can be recognized and the subsequent “XX city ……” cannot, a response such as “I was able to recognize XX prefecture. In this way, by responding in a manner that acknowledges the presence of the voice that could not be recognized, the user's sense of incongruity and discomfort when prompted to input voice again can be reduced.
[0052]
As described above, in the speech recognition apparatus 100 according to the present embodiment, when voice is input after a blank included in the voice has lasted for the time t1 and the predetermined response processing has started, but before the predetermined time t2 elapses, the predetermined voice recognition processing is resumed, so voice input before the response voice is returned is no longer missed. In addition, the content of the response voice depends on the success or failure of the voice recognition processing resumed after detection of a blank of time t1 or more: if that processing fails, a response voice is output to the effect that the presence of the user's voice input was recognized but the voice recognition processing failed. This avoids giving the user the bad impression that the voice input was ignored or cut off partway, and prevents the user from feeling that the response is unnatural.
[0053]
The present invention is not limited to the above embodiment, and various modifications are possible within the scope of the gist of the present invention. For example, in the above-described embodiment, the example in which the voice recognition device 100 to which the present invention is applied is used in combination with the in-vehicle navigation device 300 has been described. However, the scope of application of the present invention is not limited to in-vehicle use; the present invention can also be applied to various devices and systems, such as a speech recognition device realized by using a home personal computer or the like.
[0054]
Further, in the above-described embodiment, since in-vehicle use is assumed, the speech recognition apparatus 100 including the removing means for removing audio sound and the like has been described. However, in home use or the like, where sounds other than the voice to be recognized have little influence, the removing means may be omitted to simplify the configuration and reduce the cost.
[0055]
[Effect of the invention]
As described above, according to the present invention, when a voice exceeding the predetermined sound pressure level is input after the silent state included in the voice has lasted for the time t1 and the predetermined response processing has started, but before the predetermined time t2 elapses, the processing by the voice recognition processing means is resumed, so voice input before the response is returned is no longer missed. Moreover, since the content of the response voice differs depending on the success or failure of the voice recognition processing resumed after a silent state of time t1 or more is detected, the user is not given the bad impression that the voice input was ignored or cut off partway, and the sense of incongruity that the user feels toward the response can be eliminated.
[Brief description of the drawings]
FIG. 1 is a diagram illustrating a configuration of a speech recognition apparatus according to an embodiment.
FIG. 2 is a flowchart showing a first operation procedure in the speech recognition apparatus.
FIG. 3 is a flowchart showing a second operation procedure in the speech recognition apparatus.
FIG. 4 is a diagram for explaining times t1 and t2 in the second operation procedure.
[Explanation of symbols]
10 Talk switch
12 Microphone
14 Control unit
16 Delay element
17 Adaptive filter (ADF)
18 Calculation unit
20 Speech recognition processing unit
22, 24 Ring buffer
26 Feature amount extraction unit
28 Collation processing unit
30 Level meter
32 Speech synthesis processing unit
34 Synthesizer
36 Speaker
100 Voice recognition device
200 Audio equipment
300 Navigation device

Claims

A speech recognition apparatus comprising:
a microphone that collects voice;
speech recognition processing means for performing speech recognition processing on voice including a plurality of words collected by the microphone;
response means for generating and outputting a response voice based on the content recognized by the speech recognition processing means;
interruption determining means for detecting a silent state included in the voice collected by the microphone and determining interruption of the speech recognition processing when the silent state continues for a time t1 or more;
sound pressure level detection means for detecting the sound pressure level of the voice collected by the microphone; and
resumption determining means for determining resumption of the processing by the speech recognition processing means when the sound pressure level detected by the sound pressure level detection means exceeds a predetermined value during a time t2 after the duration of the silent state has reached the time t1,
wherein the resumption determining means sends an instruction to stop the output of the response voice to the response means, together with the operation of instructing the speech recognition processing means to resume the processing, and
the time t2 is set to a value substantially equal to the time from when the duration of the silent state reaches the time t1 and the interruption determining means interrupts the speech recognition processing until the response means outputs the response voice.

The speech recognition apparatus according to claim 1, further comprising:
a speaker for outputting the response voice and sound output from another sound source; and
removing means for removing components other than the voice to be recognized from the sound collected by the microphone,
wherein the voice to be recognized output from the removing means is input to the speech recognition processing means.

The speech recognition apparatus according to claim 1 or 2,
wherein the response means generates the response voice with different content according to the success or failure of the speech recognition processing by the speech recognition processing means resumed after the silent state of the time t1 or more is detected.