JP4270770B2 - Speech recognition apparatus, speech recognition method, and speech recognition program

Info

Publication number: JP4270770B2
Application number: JP2001123317A
Authority: JP
Inventors: 知弘岩▲さき▼
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2001-04-20
Filing date: 2001-04-20
Publication date: 2009-06-03
Anticipated expiration: 2021-04-20
Also published as: JP2002318596A

Description

【０００１】
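【Technical Field of the Invention】
This invention relates to a speech recognition apparatus, a speech recognition method, and a speech recognition program.
【０００２】
【Prior Art】
A speech recognition apparatus is an efficient means of entering data. When the speaker's utterance is misrecognized, however, correcting the misrecognized part takes effort. A speech recognition apparatus therefore needs a means of correcting misrecognized parts easily.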
【０００３】
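FIG. 21 is a block diagram showing the configuration of the conventional speech recognition apparatus disclosed in Japanese Unexamined Patent Publication No. H4-181299. In the figure, 101 is the speech recognition apparatus; 102 is speech input means that outputs input speech as a speech signal; 103 is word dictionary storage means that stores a word dictionary containing information on the words to be recognized; 104 is model matching means that performs model matching between the speech signal (first speech signal) of an utterance of several words to be recognized (first utterance) input to the speech input means 102 and the word dictionary stored in the word dictionary storage means 103, detects in the first speech signal the segment corresponding to each word of the first utterance, and obtains one or more ranked candidates for each segment; 105 is speech signal storage means that stores the first speech signal matched by the model matching means 104; 106 is spotting means that performs spotting between the first speech signal stored in the speech signal storage means 105 and the speech signal (second speech signal) of an utterance of the misrecognized word of the first utterance (second utterance) input to the speech input means 102, and obtains the acoustic similarity between each segment of the first speech signal and the second speech signal; and 107 is recognition result replacement means that displays the recognition result of the first utterance on recognition result display means 108 and, if that result is incorrect, replaces the candidate for the segment of the first speech signal that has high acoustic similarity to the second speech signal with another candidate, displays the new recognition result of the first utterance on the recognition result display means 108, and, once the correct result is obtained, confirms and outputs the recognition result of the first utterance.
【０００４】
When the correction key is pressed, the speech input means 102 switches the destination of the speech signal from the model matching means 104 to the spotting means 106.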
【０００５】
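The operation is described next.
When the speaker utters several words to be recognized (first utterance) and the first utterance is input to the speech input means 102, the speech input means 102 outputs its speech signal (first speech signal), which is input to the model matching means 104. The model matching means 104 performs model matching by continuous DP matching between the first speech signal and the word dictionary stored in the word dictionary storage means 103, detects in the first speech signal the segment corresponding to each word of the first utterance, obtains one or more ranked candidates for each segment, and outputs them to the recognition result replacement means 107. The recognition result replacement means 107 stores the matching result of the first speech signal and displays the recognition result of the first utterance on the recognition result display means 108. The speech signal storage means 105 stores the first speech signal that was output from, and matched by, the model matching means 104.
【０００６】
The speaker checks the recognition result of the first utterance displayed on the recognition result display means 108 and, if it is correct, presses the confirm key to confirm it. The recognition result replacement means 107 outputs the confirmed recognition result of the first utterance.
【０００７】
If the displayed recognition result of the first utterance is incorrect, the speaker presses the correction key and utters the misrecognized word of the first utterance (second utterance). When the second utterance is input to the speech input means 102, the speech input means 102 outputs its speech signal (second speech signal). Because pressing the correction key switches the destination of the speech signal from the model matching means 104 to the spotting means 106, the second speech signal is input to the spotting means 106.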
【０００８】
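The spotting means 106 performs spotting by continuous DP matching between the first speech signal stored in the speech signal storage means 105 and the second speech signal, obtains the acoustic similarity between each segment of the first speech signal and the second speech signal, and outputs it to the recognition result replacement means 107.
【０００９】
The recognition result replacement means 107 detects the segment of the first speech signal that has high acoustic similarity to the second speech signal, replaces the candidate for that segment with another candidate, and displays the new recognition result of the first utterance on the recognition result display means 108.
【００１０】
The speaker checks the new recognition result of the first utterance displayed on the recognition result display means 108 and, if it is correct, presses the confirm key to confirm it. The recognition result replacement means 107 outputs the confirmed recognition result of the first utterance.
【００１１】
If the new recognition result of the first utterance is incorrect, the speaker presses the next-candidate key. The recognition result replacement means 107 replaces the candidate for the detected segment of the first speech signal with yet another candidate and displays the new recognition result of the first utterance on the recognition result display means 108.
【００１２】
If the candidates for the detected segment of the first speech signal do not include the correct one, the speaker presses the correction key to cancel the first speech signal and utters the first utterance again.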
【００１３】
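The operation described above is explained below with a concrete example.
Here the recognition targets are the addresses shown in FIG. 22. The speaker uttered "神奈川県横浜市中区", "中区" was misrecognized as "西区", and the speaker then uttered "中区" again.
【００１４】
When the speaker uttered "神奈川県横浜市中区", as shown in FIG. 23, the model matching means 104 detected, in the first speech signal S1 output from the speech input means 102, three segments S11 to S13 corresponding to the three words of the first utterance. It obtained "神奈川県" as the first-ranked candidate for segment S11; "横浜市" as the first-ranked and "川崎市" as the second-ranked candidate for segment S12; and "西区" as the first-ranked, "多摩区" as the second-ranked, and "中区" as the third-ranked candidate for segment S13. These were stored in the recognition result replacement means 107, and "神奈川県横浜市西区" was displayed on the recognition result display means 108.
【００１５】
Because "中区" was misrecognized as "西区", the speaker pressed the correction key and uttered "中区" again. The spotting means 106 then performed spotting by continuous DP matching between the first speech signal S1 and the second speech signal S2, and obtained the acoustic similarity between each of the segments S11 to S13 of the first speech signal S1 and the second speech signal S2. As shown in FIG. 24, the recognition result replacement means 107 detected segment S13 of the first speech signal S1 as the segment with high acoustic similarity to the second speech signal S2. Then, as shown in FIG. 25, the recognition result replacement means 107 replaced "西区", the first-ranked candidate for the detected segment S13, with "多摩区", the second-ranked candidate, and the new recognition result of the first utterance, "神奈川県横浜市多摩区", was displayed on the recognition result display means 108.
【００１６】
Because the new recognition result of the first utterance was incorrect, the speaker pressed the next-candidate key and, as shown in FIG. 26, the recognition result replacement means 107 replaced "多摩区", the second-ranked candidate for the detected segment S13, with "中区", the third-ranked candidate. The new candidate recognition result of the first utterance, "神奈川県横浜市中区", was then displayed on the recognition result display means 108.
【００１７】
Because the new recognition result of the first utterance was correct, the speaker pressed the confirm key; the recognition result of the first utterance was confirmed and output from the recognition result replacement means 107.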
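As an illustration only, the conventional correction flow above can be sketched in Python. The function name `cycle_lower_candidates` and the list-of-ranked-lists layout are assumptions made for this sketch and do not come from the cited publication; the candidate lists are the ones the text gives for FIG. 23, and the index of the flagged segment is supplied by hand instead of being computed by spotting.

```python
from typing import Iterator, List


def cycle_lower_candidates(candidates: List[List[str]], flagged: int) -> Iterator[List[str]]:
    """Yield the result shown after the correction key and after each next-candidate key press.

    candidates[k] is the ranked candidate list of segment k of the first utterance
    (hypothetical layout); flagged is the index of the segment picked by spotting.
    """
    result = [ranked[0] for ranked in candidates]  # initially displayed result
    for lower in candidates[flagged][1:]:          # 2nd rank, 3rd rank, ...
        result = result[:flagged] + [lower] + result[flagged + 1:]
        yield result
    # Once the generator is exhausted, the first utterance has to be spoken again.


first_pass = [["神奈川県"], ["横浜市", "川崎市"], ["西区", "多摩区", "中区"]]
shown = ["".join(r) for r in cycle_lower_candidates(first_pass, flagged=2)]
print(shown)
```

The sketch makes the limitation visible: only candidates already obtained from the first utterance can ever be displayed, so a missing correct candidate forces a complete re-utterance.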
【００１８】
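【Problems to Be Solved by the Invention】
Because the conventional speech recognition apparatus is configured as described above, if the candidates for the segment of the first speech signal that corresponds to the misrecognized part do not include the correct one, the first speech signal has to be cancelled and the first utterance spoken again.
【００１９】
In addition, when a speaker pauses partway through a long continuous sentence and a misrecognition occurs at that point, people tend to re-utter the misrecognized part and continue straight on with the rest of the sentence. The conventional speech recognition apparatus assumes that only the misrecognized part is uttered again, so in such a case it cannot correct the misrecognized part properly.
【００２０】
This invention was made to solve the problems described above, and its object is to provide a speech recognition apparatus, a speech recognition method, and a speech recognition program that can correct misrecognized parts efficiently.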
【００２１】
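【Means for Solving the Problems】
The speech recognition apparatus according to this invention comprises: word dictionary storage means that stores a word dictionary containing information on the words to be recognized; first matching means that matches a first speech signal against the word dictionary, detects in the first speech signal the segment corresponding to each word of the first utterance, and obtains candidates for each segment; second matching means that matches a second speech signal against the word dictionary, detects in the second speech signal the segment corresponding to each word of the second utterance, and obtains candidates for each segment; spotting means that obtains the acoustic similarity between each segment of the first speech signal and each segment of the second speech signal; and recognition result replacement means that uses the acoustic similarity obtained by the spotting means to detect the segment of the first speech signal and the segment of the second speech signal that correspond to the misrecognized word of the first utterance, and replaces the candidate for the detected segment of the first speech signal with the candidate for the detected segment of the second speech signal. The word dictionary storage means stores a word dictionary that contains the information on the words to be recognized in accordance with syntax rules defining their connection relationships. The recognition result replacement means detects the segment of the first speech signal and the segment of the second speech signal that correspond to the misrecognized word of the first utterance, judges whether the candidate for the detected segment of the second speech signal can be connected, under the syntax rules in the word dictionary, to the candidates for the segments before and after the detected segment of the first speech signal, and replaces the candidate for the detected segment of the first speech signal with a candidate for the detected segment of the second speech signal that can be connected to the candidates for those neighboring segments.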
【００２２】
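In the speech recognition apparatus according to this invention, when the second utterance consists only of the misrecognized word of the first utterance, the recognition result replacement means detects the segment of the first speech signal that has high acoustic similarity to the segment of the second speech signal as the segment of the first speech signal corresponding to the misrecognized word of the first utterance, and replaces the candidate for the detected segment of the first speech signal with the candidate for the segment of the second speech signal.
【００２３】
In the speech recognition apparatus according to this invention, when the second utterance consists of the misrecognized word of the first utterance and one or more words that follow it, the recognition result replacement means detects a segment of the first speech signal and a segment of the second speech signal that have high acoustic similarity to each other as the segments corresponding to the misrecognized word of the first utterance, replaces the candidate for the detected segment of the first speech signal with the candidate for the detected segment of the second speech signal, and appends to it the candidates for the segments of the second speech signal that were not detected.
【００２４】
In the speech recognition apparatus according to this invention, the recognition result replacement means detects the segment of the first speech signal and the segment of the second speech signal that correspond to the misrecognized word of the first utterance, judges whether the candidate for the detected segment of the second speech signal is the same as the candidate for the detected segment of the first speech signal, and replaces the candidate for the detected segment of the first speech signal with a candidate for the detected segment of the second speech signal that differs from it.
【００２６】
In the speech recognition apparatus according to this invention, the first matching means matches the first speech signal against the word dictionary, detects in the first speech signal the segment corresponding to each word of the first utterance, obtains candidates for each segment, and also obtains a matching score for each segment of the first speech signal; and the recognition result replacement means uses the acoustic similarity obtained by the spotting means and the matching score obtained by the first matching means to detect the segment of the first speech signal and the segment of the second speech signal that correspond to the misrecognized word of the first utterance, and replaces the candidate for the detected segment of the first speech signal with the candidate for the detected segment of the second speech signal.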
【００２７】
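The speech recognition method according to this invention comprises: a first matching step of matching a first speech signal against a word dictionary containing information on the words to be recognized, detecting in the first speech signal the segment corresponding to each word of the first utterance, and obtaining candidates for each segment; a second matching step of matching a second speech signal against the word dictionary, detecting in the second speech signal the segment corresponding to each word of the second utterance, and obtaining candidates for each segment; a spotting step of obtaining the acoustic similarity between each segment of the first speech signal and each segment of the second speech signal; and a recognition result replacement step of using the acoustic similarity obtained in the spotting step to detect the segment of the first speech signal and the segment of the second speech signal that correspond to the misrecognized word of the first utterance, and replacing the candidate for the detected segment of the first speech signal with the candidate for the detected segment of the second speech signal. The recognition result replacement step detects the segment of the first speech signal and the segment of the second speech signal that correspond to the misrecognized word of the first utterance, judges whether the candidate for the detected segment of the second speech signal can be connected, under the syntax rules of a word dictionary that contains the information on the words to be recognized in accordance with syntax rules defining their connection relationships, to the candidates for the segments before and after the detected segment of the first speech signal, and replaces the candidate for the detected segment of the first speech signal with a candidate for the detected segment of the second speech signal that can be connected to the candidates for those neighboring segments.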
【００２８】
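In the speech recognition method according to this invention, when the second utterance consists only of the misrecognized word of the first utterance, the recognition result replacement step detects the segment of the first speech signal that has high acoustic similarity to the segment of the second speech signal as the segment of the first speech signal corresponding to the misrecognized word of the first utterance, and replaces the candidate for the detected segment of the first speech signal with the candidate for the segment of the second speech signal.
【００２９】
In the speech recognition method according to this invention, when the second utterance consists of the misrecognized word of the first utterance and one or more words that follow it, the recognition result replacement step detects a segment of the first speech signal and a segment of the second speech signal that have high acoustic similarity to each other as the segments corresponding to the misrecognized word of the first utterance, replaces the candidate for the detected segment of the first speech signal with the candidate for the detected segment of the second speech signal, and appends to it the candidates for the segments of the second speech signal that were not detected.
【００３０】
In the speech recognition method according to this invention, the recognition result replacement step detects the segment of the first speech signal and the segment of the second speech signal that correspond to the misrecognized word of the first utterance, judges whether the candidate for the detected segment of the second speech signal is the same as the candidate for the detected segment of the first speech signal, and replaces the candidate for the detected segment of the first speech signal with a candidate for the detected segment of the second speech signal that differs from it.
【００３２】
In the speech recognition method according to this invention, the first matching step matches the first speech signal against the word dictionary, detects in the first speech signal the segment corresponding to each word of the first utterance, obtains candidates for each segment, and also obtains a matching score for each segment of the first speech signal; and the recognition result replacement step uses the acoustic similarity obtained in the spotting step and the matching score obtained in the first matching step to detect the segment of the first speech signal and the segment of the second speech signal that correspond to the misrecognized word of the first utterance, and replaces the candidate for the detected segment of the first speech signal with the candidate for the detected segment of the second speech signal.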
【００３３】
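The speech recognition program according to this invention causes a computer to implement: a first matching function of matching a first speech signal against a word dictionary containing information on the words to be recognized, detecting in the first speech signal the segment corresponding to each word of the first utterance, and obtaining candidates for each segment; a second matching function of matching a second speech signal against the word dictionary, detecting in the second speech signal the segment corresponding to each word of the second utterance, and obtaining candidates for each segment; a spotting function of obtaining the acoustic similarity between each segment of the first speech signal and each segment of the second speech signal; and a recognition result replacement function of using the acoustic similarity obtained by the spotting function to detect the segment of the first speech signal and the segment of the second speech signal that correspond to the misrecognized word of the first utterance, and replacing the candidate for the detected segment of the first speech signal with the candidate for the detected segment of the second speech signal. The recognition result replacement function detects the segment of the first speech signal and the segment of the second speech signal that correspond to the misrecognized word of the first utterance, judges whether the candidate for the detected segment of the second speech signal can be connected, under the syntax rules of a word dictionary that contains the information on the words to be recognized in accordance with syntax rules defining their connection relationships, to the candidates for the segments before and after the detected segment of the first speech signal, and replaces the candidate for the detected segment of the first speech signal with a candidate for the detected segment of the second speech signal that can be connected to the candidates for those neighboring segments.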
【００３４】
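In the speech recognition program according to this invention, when the second utterance consists only of the misrecognized word of the first utterance, the recognition result replacement function detects the segment of the first speech signal that has high acoustic similarity to the segment of the second speech signal as the segment of the first speech signal corresponding to the misrecognized word of the first utterance, and replaces the candidate for the detected segment of the first speech signal with the candidate for the segment of the second speech signal.
【００３５】
In the speech recognition program according to this invention, when the second utterance consists of the misrecognized word of the first utterance and one or more words that follow it, the recognition result replacement function detects a segment of the first speech signal and a segment of the second speech signal that have high acoustic similarity to each other as the segments corresponding to the misrecognized word of the first utterance, replaces the candidate for the detected segment of the first speech signal with the candidate for the detected segment of the second speech signal, and appends to it the candidates for the segments of the second speech signal that were not detected.
【００３６】
In the speech recognition program according to this invention, the recognition result replacement function detects the segment of the first speech signal and the segment of the second speech signal that correspond to the misrecognized word of the first utterance, judges whether the candidate for the detected segment of the second speech signal is the same as the candidate for the detected segment of the first speech signal, and replaces the candidate for the detected segment of the first speech signal with a candidate for the detected segment of the second speech signal that differs from it.
【００３８】
In the speech recognition program according to this invention, the first matching function matches the first speech signal against the word dictionary, detects in the first speech signal the segment corresponding to each word of the first utterance, obtains candidates for each segment, and also obtains a matching score for each segment of the first speech signal; and the recognition result replacement function uses the acoustic similarity obtained by the spotting function and the matching score obtained by the first matching function to detect the segment of the first speech signal and the segment of the second speech signal that correspond to the misrecognized word of the first utterance, and replaces the candidate for the detected segment of the first speech signal with the candidate for the detected segment of the second speech signal.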
【００３９】
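【Embodiments of the Invention】
An embodiment of this invention is described below.
Embodiment 1.
FIG. 1 is a block diagram showing the configuration of a speech recognition apparatus according to Embodiment 1 of this invention. In the figure, 1 is the speech recognition apparatus; 2 is speech input means that outputs input speech as a speech signal; 3 is word dictionary storage means that stores a word dictionary containing information on the words to be recognized; 4 is first model matching means that performs model matching between the speech signal (first speech signal) of an utterance of several words to be recognized (first utterance) input to the speech input means 2 and the word dictionary stored in the word dictionary storage means 3, detects in the first speech signal the segment corresponding to each word of the first utterance, and obtains one or more ranked candidates for each segment; 5 is speech signal storage means that stores the first speech signal matched by the first model matching means 4; 6 is second model matching means that performs model matching between the speech signal (second speech signal) of an utterance of the misrecognized word of the first utterance (second utterance) input to the speech input means 2 and the word dictionary stored in the word dictionary storage means 3, detects one segment in the second speech signal, and obtains one or more ranked candidates for it; 7 is spotting means that performs spotting between the first speech signal stored in the speech signal storage means 5 and the second speech signal, and obtains the acoustic similarity between each segment of the first speech signal and the segment of the second speech signal; and 8 is recognition result replacement means that displays the recognition result of the first utterance on recognition result display means 9; if that result is incorrect, detects the segment of the first speech signal that has high acoustic similarity to the segment of the second speech signal, replaces the candidate for that segment with a candidate for the segment of the second speech signal, and displays the new recognition result of the first utterance on the recognition result display means 9; if the new result is incorrect, replaces the candidate for that segment with another candidate for the segment of the second speech signal; and, once the correct result is obtained, confirms and outputs the recognition result of the first utterance.
【００４０】
When the correction key is pressed, the speech input means 2 switches the destination of the speech signal from the first model matching means 4 to the second model matching means 6 and the spotting means 7.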
【００４１】
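The operation is described next.
FIGS. 2 to 4 are flowcharts explaining the operation of the speech recognition apparatus according to Embodiment 1 of this invention.
【００４２】
When the speaker utters several words to be recognized (first utterance) and the first utterance is input to the speech input means 2 (step ST1), the speech input means 2 outputs its speech signal (first speech signal), which is input to the first model matching means 4. The first model matching means 4 performs model matching by continuous DP matching between the first speech signal and the word dictionary stored in the word dictionary storage means 3 (step ST2), detects in the first speech signal the segment corresponding to each word of the first utterance, obtains one or more ranked candidates for each segment, and outputs them to the recognition result replacement means 8. The recognition result replacement means 8 stores the matching result of the first speech signal, which consists of the first-ranked candidate for each segment (step ST3), and displays the recognition result of the first utterance on the recognition result display means 9 (step ST4). The speech signal storage means 5 stores the first speech signal that was output from, and matched by, the first model matching means 4 (step ST5).
【００４３】
The speaker checks the recognition result of the first utterance displayed on the recognition result display means 9 and judges whether it is correct (step ST6); if it is, the speaker presses the confirm key to confirm it. The recognition result replacement means 8 outputs the confirmed recognition result of the first utterance (step ST7).
【００４４】
If the displayed recognition result of the first utterance is incorrect, the speaker presses the correction key and utters the misrecognized word of the first utterance (second utterance). When the second utterance is input to the speech input means 2 (step ST8), the speech input means 2 outputs its speech signal (second speech signal). Because pressing the correction key switches the destination of the speech signal from the first model matching means 4 to the second model matching means 6 and the spotting means 7, the second speech signal is input to the second model matching means 6 and the spotting means 7.
【００４５】
The second model matching means 6 performs model matching by continuous DP matching between the second speech signal and the word dictionary stored in the word dictionary storage means 3 (step ST9), detects one segment in the second speech signal, obtains one or more ranked candidates for it, and outputs them to the recognition result replacement means 8. The recognition result replacement means 8 stores the matching result of the second speech signal (step ST10).
【００４６】
The spotting means 7 performs spotting by continuous DP matching between the first speech signal stored in the speech signal storage means 5 and the second speech signal (step ST11), obtains the acoustic similarity between each segment of the first speech signal and the segment of the second speech signal, and outputs it to the recognition result replacement means 8.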
【００４７】
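The recognition result replacement means 8 detects the segment of the first speech signal that has high acoustic similarity to the segment of the second speech signal (step ST12), sets M = 1 (step ST13), replaces the candidate for that segment with the first-ranked candidate for the segment of the second speech signal (step ST14), and displays the new recognition result of the first utterance on the recognition result display means 9 (step ST15).
【００４８】
The speaker checks the new recognition result of the first utterance displayed on the recognition result display means 9 and judges whether it is correct (step ST16); if it is, the speaker presses the confirm key to confirm the recognition result of the first utterance. The recognition result replacement means 8 outputs the confirmed recognition result of the first utterance (step ST17).
【００４９】
If the new recognition result of the first utterance is incorrect, the speaker presses the next-candidate key. The recognition result replacement means 8 judges whether a lower-ranked candidate exists for the segment of the second speech signal (step ST18); if one exists, it sets M = 2 (step ST19), replaces the candidate for the detected segment of the first speech signal with the second-ranked candidate for the segment of the second speech signal (step ST14), and displays the new recognition result of the first utterance on the recognition result display means 9 (step ST15).
【００５０】
Thereafter, until the correct recognition result of the first utterance is displayed on the recognition result display means 9, the candidate for the detected segment of the first speech signal is replaced with successively lower-ranked candidates for the segment of the second speech signal. When no lower-ranked candidate remains, the speaker presses the correction key to cancel the second speech signal and utters the second utterance again.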
【００５１】
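The operation described above is explained below with a concrete example.
Here the recognition targets are the addresses shown in FIG. 22. The speaker uttered "神奈川県横浜市中区", "中区" was misrecognized as "西区", and the speaker then uttered "中区" again.
【００５２】
When the speaker uttered "神奈川県横浜市中区", as shown in FIG. 5, the first model matching means 4 detected, in the first speech signal S1 output from the speech input means 2, three segments S11 to S13 corresponding to the three words of the first utterance, and obtained "神奈川県" as the first-ranked candidate for segment S11, "横浜市" as the first-ranked candidate for segment S12, and "西区" as the first-ranked candidate for segment S13. These were stored in the recognition result replacement means 8, and "神奈川県横浜市西区" was displayed on the recognition result display means 9.
【００５３】
Because "中区" was misrecognized as "西区", the speaker pressed the correction key and uttered "中区" again. As shown in FIG. 6, the second model matching means 6 detected one segment S21 in the second speech signal S2 output from the speech input means 2, and obtained "中区" as the first-ranked, "多摩区" as the second-ranked, and "西区" as the third-ranked candidate, which were stored in the recognition result replacement means 8. The spotting means 7 performed spotting by continuous DP matching between the first speech signal S1 and the second speech signal S2, and obtained the acoustic similarity between each of the segments S11 to S13 of the first speech signal S1 and the segment S21 of the second speech signal S2. As shown in FIG. 6, the recognition result replacement means 8 detected segment S13 of the first speech signal S1 as the segment with high acoustic similarity to segment S21 of the second speech signal S2. Then, as shown in FIG. 7, the recognition result replacement means 8 replaced "西区", the candidate for the detected segment S13 of the first speech signal S1, with "中区", the first-ranked candidate for segment S21 of the second speech signal S2, and the new recognition result of the first utterance, "神奈川県横浜市中区", was displayed on the recognition result display means 9.
【００５４】
Because the new recognition result of the first utterance was correct, the speaker pressed the confirm key; the recognition result of the first utterance was confirmed and output from the recognition result replacement means 8.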
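The replacement of steps ST12 to ST19 can be pictured with a minimal Python sketch. The function name `correct_with_second_utterance`, the argument layout, and the similarity values are hypothetical; the patent gives no numeric similarities for this example, so the numbers below only serve to make segment S13 the most similar one.

```python
from typing import List, Optional, Sequence


def correct_with_second_utterance(
    first_result: Sequence[str],
    similarity: Sequence[float],
    second_candidates: Sequence[str],
    rank: int = 1,
) -> Optional[List[str]]:
    """Replace the most similar first-utterance segment with the rank-M second-utterance candidate.

    Returns None when no candidate of that rank exists, i.e. the speaker must re-utter.
    """
    if not 1 <= rank <= len(second_candidates):
        return None
    target = max(range(len(similarity)), key=lambda k: similarity[k])  # cf. step ST12
    corrected = list(first_result)
    corrected[target] = second_candidates[rank - 1]                    # cf. step ST14
    return corrected


first_result = ["神奈川県", "横浜市", "西区"]    # first-ranked candidates of S11..S13
second_candidates = ["中区", "多摩区", "西区"]    # ranked candidates of S21
similarity = [120.0, 90.0, 810.0]                # hypothetical values, not from the source
print(correct_with_second_utterance(first_result, similarity, second_candidates, rank=1))
```

Calling the function again with `rank=2`, `rank=3`, and so on corresponds to successive presses of the next-candidate key.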
【００５５】
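As described above, according to Embodiment 1, when the first utterance is misrecognized, the misrecognized word of the first utterance is uttered as the second utterance, and the misrecognized word is corrected using the candidates for the segment of the second speech signal. Moreover, speakers generally tend to articulate the second utterance more carefully after a misrecognition, so the recognition rate for the second utterance is higher than for the first. The misrecognized word can therefore be corrected efficiently, which has the effect of providing a speech recognition apparatus that is easy to use.
【００５６】
This embodiment uses continuous DP matching to match the speech signals, but the same effect is obtained with other matching methods.
【００５７】
This embodiment matches the first speech signal and the second speech signal with different model matching means, but the same effect is obtained when the same model matching means is used repeatedly.
【００５８】
In this embodiment the destination of the speech signal output from the speech input means 2 is switched to the second model matching means 6 and the spotting means 7 when the correction key is pressed, but the same effect is obtained when the switch is made automatically after the first utterance.
【００５９】
In this embodiment the process of correcting a misrecognized word is advanced by pressing the correction key, the confirm key, and the next-candidate key, but the same effect is obtained when the process is advanced by voice alone, with the speaker answering spoken confirmations from the speech recognition apparatus by saying "ハイ", "イエス", or the like.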
【００６０】
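Embodiment 2.
When a speaker pauses partway through a long continuous sentence and a misrecognition occurs at that point, people tend to re-utter the misrecognized word and continue with the one or more words that follow it. Embodiment 2 describes a configuration that can correct the misrecognized word properly even in this case.
【００６１】
FIG. 8 is a block diagram showing the configuration of a speech recognition apparatus according to Embodiment 2 of this invention. In the figure, 21 is the speech recognition apparatus; 22 is second model matching means that performs model matching between the speech signal (second speech signal) of an utterance of the misrecognized word and the one or more words that follow it (second utterance) input to the speech input means 2 and the word dictionary stored in the word dictionary storage means 3, detects in the second speech signal the segment corresponding to each word of the second utterance, and obtains one or more ranked candidates for each segment; 23 is segment storage means that stores the second speech signal matched by the second model matching means 22; 24 is spotting means that performs spotting between the first speech signal stored in the speech signal storage means 5 and the second speech signal stored in the segment storage means 23, and obtains the acoustic similarity between each segment of the first speech signal and each segment of the second speech signal; and 25 is recognition result replacement means that displays the recognition result of the first utterance on the recognition result display means 9; if that result is incorrect, detects a segment of the first speech signal and a segment of the second speech signal that have high acoustic similarity to each other, replaces the candidate for the detected segment of the first speech signal with a candidate for the detected segment of the second speech signal, appends to it the candidates for the segments of the second speech signal that were not detected, and displays the new recognition result of the first utterance and the recognition result of the second utterance on the recognition result display means 9; if the new recognition result of the first utterance is incorrect, replaces the candidate for the detected segment of the first speech signal with another candidate for the detected segment of the second speech signal; and, once the correct recognition result of the first utterance is obtained, confirms and outputs the recognition results of the first and second utterances.
【００６２】
The other components are identical or equivalent to those given the same reference numerals in FIG. 1, so their detailed description is omitted.
【００６３】
When the correction key is pressed, the speech input means 2 switches the destination of the speech signal from the first model matching means 4 to the second model matching means 22.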
【００６４】
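The operation is described next.
FIGS. 9 and 10 are flowcharts explaining the operation of the speech recognition apparatus according to Embodiment 2 of this invention.
【００６５】
The steps up to step ST7 are performed as in Embodiment 1.
If the recognition result of the first utterance displayed on the recognition result display means 9 is incorrect, the speaker presses the correction key and utters the misrecognized word and the one or more words that follow it (second utterance). When the second utterance is input to the speech input means 2 (step ST21), the speech input means 2 outputs its speech signal (second speech signal). Because pressing the correction key switches the destination of the speech signal from the first model matching means 4 to the second model matching means 22, the second speech signal output from the speech input means 2 is input to the second model matching means 22.
【００６６】
The second model matching means 22 performs model matching by continuous DP matching between the second speech signal and the word dictionary stored in the word dictionary storage means 3 (step ST22), detects in the second speech signal the segment corresponding to each word of the second utterance, obtains one or more ranked candidates for each segment, and outputs them to the recognition result replacement means 25. The recognition result replacement means 25 stores the matching result of the second speech signal obtained by the second model matching means 22 (step ST23). The segment storage means 23 stores the second speech signal that was output from, and matched by, the second model matching means 22 (step ST24).
【００６７】
The spotting means 24 performs spotting by continuous DP matching between the first speech signal stored in the speech signal storage means 5 and the second speech signal stored in the segment storage means 23 (step ST25), obtains the acoustic similarity between each segment of the first speech signal and each segment of the second speech signal, and outputs it to the recognition result replacement means 25.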
【００６８】
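The recognition result replacement means 25 detects a segment of the first speech signal and a segment of the second speech signal that have high acoustic similarity to each other (step ST26), sets M = 1 (step ST27), replaces the candidate for the detected segment of the first speech signal with the first-ranked candidate for the detected segment of the second speech signal (step ST28), appends to it the candidates for the segments of the second speech signal that were not detected (step ST29), and displays the new recognition result of the first utterance and the recognition result of the second utterance on the recognition result display means 9 (step ST30).
【００６９】
The speaker checks the new recognition result of the first utterance displayed on the recognition result display means 9 and judges whether it is correct (step ST31); if it is, the speaker presses the confirm key to confirm the recognition results of the first and second utterances. The recognition result replacement means 25 outputs the confirmed recognition results of the first and second utterances (step ST32).
【００７０】
If the new recognition result of the first utterance is incorrect, the speaker presses the next-candidate key. The recognition result replacement means 25 judges whether a lower-ranked candidate exists for the detected segment of the second speech signal (step ST33); if one exists, it sets M = 2 (step ST34), replaces the candidate for the detected segment of the first speech signal with the second-ranked candidate for the detected segment of the second speech signal (step ST28), and displays the new recognition result of the first utterance and the recognition result of the second utterance on the recognition result display means 9 (step ST30).
【００７１】
Thereafter, until the correct recognition result of the first utterance is displayed on the recognition result display means 9, the candidate for the detected segment of the first speech signal is replaced with successively lower-ranked candidates for the detected segment of the second speech signal. When no lower-ranked candidate remains, the speaker presses the correction key to cancel the second speech signal and utters the second utterance again.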
【００７２】
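The operation described above is explained below with a concrete example.
Here the recognition targets are the addresses shown in FIG. 22. The speaker, intending to say "神奈川県横浜市中区石川町", had uttered as far as "神奈川県横浜市中区" when "中区" was misrecognized as "西区", and then uttered "中区石川町".
【００７３】
When the speaker had uttered as far as "神奈川県横浜市中区", as shown in FIG. 5, the first model matching means 4 detected, in the first speech signal S1 output from the speech input means 2, three segments S11 to S13 corresponding to the three words of the first utterance, and obtained "神奈川県" as the first-ranked candidate for segment S11, "横浜市" as the first-ranked candidate for segment S12, and "西区" as the first-ranked candidate for segment S13. These were stored in the recognition result replacement means 25, and "神奈川県横浜市西区" was displayed on the recognition result display means 9.
【００７４】
Because "中区" was misrecognized as "西区", the speaker pressed the correction key and uttered "中区石川町". As shown in FIG. 11, the second model matching means 22 detected, in the second speech signal S2 output from the speech input means 2, two segments S21 and S22 corresponding to the two words of the second utterance, and obtained "中区" as the first-ranked, "多摩区" as the second-ranked, and "西区" as the third-ranked candidate for segment S21, and "石川町" as the first-ranked candidate for segment S22. These were stored in the recognition result replacement means 25. The spotting means 24 performed spotting by continuous DP matching between the first speech signal S1 and the second speech signal S2, and obtained the acoustic similarity between each of the segments S11 to S13 of the first speech signal S1 and each of the segments S21 and S22 of the second speech signal S2. As shown in FIG. 11, the recognition result replacement means 25 detected segment S13 of the first speech signal S1 and segment S21 of the second speech signal S2 as having high acoustic similarity to each other. Then, as shown in FIG. 11, the recognition result replacement means 25 replaced "西区", the candidate for the detected segment S13 of the first speech signal S1, with "中区", the first-ranked candidate for the detected segment S21 of the second speech signal S2, and appended "石川町", the candidate for the undetected segment S22 of the second speech signal S2. The new recognition result of the first utterance together with the recognition result of the second utterance, "神奈川県横浜市中区石川町", was displayed on the recognition result display means 9.
【００７５】
Because the new recognition result of the first utterance was correct, the speaker pressed the confirm key; the recognition results of the first and second utterances were confirmed and output from the recognition result replacement means 25.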
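A rough Python sketch of the replace-and-append behaviour of steps ST26 to ST29 follows. The helper names `best_pair` and `replace_and_append`, the similarity matrix layout (rows are first-utterance segments, columns are second-utterance segments), and every numeric value are assumptions for illustration; the patent does not state how the most similar pair is selected beyond "high acoustic similarity".

```python
from typing import List, Sequence, Tuple


def best_pair(similarity: Sequence[Sequence[float]]) -> Tuple[int, int]:
    """Return (i, j) maximizing similarity[i][j] (first segment i, second segment j)."""
    return max(
        ((i, j) for i, row in enumerate(similarity) for j in range(len(row))),
        key=lambda ij: similarity[ij[0]][ij[1]],
    )


def replace_and_append(
    first_result: Sequence[str],
    second_candidates: Sequence[Sequence[str]],
    similarity: Sequence[Sequence[float]],
    rank: int = 1,
) -> List[str]:
    i, j = best_pair(similarity)                      # cf. step ST26
    replacement = second_candidates[j][rank - 1]      # cf. step ST28
    # Top candidates of the second-utterance segments that were not detected (cf. step ST29).
    appended = [c[0] for k, c in enumerate(second_candidates) if k != j]
    # In the patent's example the detected first segment is the last one,
    # so first_result[i + 1:] is empty there.
    return list(first_result[:i]) + [replacement] + list(first_result[i + 1:]) + appended


first_result = ["神奈川県", "横浜市", "西区"]
second_candidates = [["中区", "多摩区", "西区"], ["石川町"]]
similarity = [[100, 60], [80, 120], [820, 90]]        # hypothetical values
print("".join(replace_and_append(first_result, second_candidates, similarity)))
```

Passing `rank=2` models one press of the next-candidate key; a rank beyond the candidate list raises `IndexError` in this sketch, which corresponds to the case where the second utterance has to be repeated.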
【００７６】
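As described above, according to Embodiment 2, when the first utterance is misrecognized, the misrecognized word of the first utterance and the one or more words that follow it are uttered as the second utterance and the misrecognized word is corrected. The misrecognized word can therefore be corrected efficiently, which has the effect of providing a speech recognition apparatus that is easy to use.
【００７７】
If the acoustic similarity is high between segment S13 of the first speech signal S1 and the combined span of segments S21 and S22 of the second speech signal S2, segments S21 and S22 are treated as a segment corresponding to a single word and processed as in Embodiment 1. That is, the candidate for segment S13 of the first speech signal S1 is replaced with the candidates for segments S21 and S22 of the second speech signal S2. Specifically, "西区", the candidate for segment S13 of the first speech signal S1, is replaced with "中区石川町", the candidates for segments S21 and S22 of the second speech signal S2.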
【００７８】
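Embodiment 3.
In speech recognition, some words are prone to misrecognition because of voice quality, speaking style, and similar factors, so the misrecognition that occurred in the recognition result of the first utterance may occur again in the recognition result of the second utterance. Embodiment 3 describes a configuration that can correct the misrecognized word efficiently even in this case.
【００７９】
The configuration of the speech recognition apparatus of Embodiment 3 is the same as that of Embodiment 2 shown in FIG. 8. In Embodiment 3, however, the recognition result replacement means 25 displays the recognition result of the first utterance on the recognition result display means 9; if that result is incorrect, detects a segment of the first speech signal and a segment of the second speech signal that have high acoustic similarity to each other, replaces the candidate for the detected segment of the first speech signal with a candidate for the detected segment of the second speech signal that differs from it, appends to it the candidates for the segments of the second speech signal that were not detected, and displays the new recognition result of the first utterance and the recognition result of the second utterance on the recognition result display means 9; if the new recognition result of the first utterance is incorrect, replaces the candidate for the detected segment of the first speech signal with another candidate for the detected segment of the second speech signal that differs from it; and, once the correct recognition result of the first utterance is obtained, confirms and outputs the recognition results of the first and second utterances.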
【００８０】
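The operation is described next.
FIG. 12 is a flowchart explaining the operation of the speech recognition apparatus according to Embodiment 3 of this invention.
【００８１】
The steps up to step ST25 are performed as in Embodiment 2.
The recognition result replacement means 25 detects a segment of the first speech signal and a segment of the second speech signal that have high acoustic similarity to each other (step ST41), sets M = 1 (step ST42), and judges whether the candidate for the detected segment of the first speech signal is the same as the first-ranked candidate for the detected segment of the second speech signal (step ST43). If they are not the same, it replaces the candidate for the detected segment of the first speech signal with the first-ranked candidate for the detected segment of the second speech signal (step ST44), appends to it the candidates for the segments of the second speech signal that were not detected (step ST45), and displays the new recognition result of the first utterance and the recognition result of the second utterance on the recognition result display means 9 (step ST46).
【００８２】
If the candidate for the detected segment of the first speech signal is the same as the M-th ranked candidate for the detected segment of the second speech signal, the recognition result replacement means 25 judges whether a lower-ranked candidate exists for the detected segment of the second speech signal (step ST47); if one exists, it sets M = M + 1 (step ST48) and returns to step ST43. If none exists, the speaker presses the correction key to cancel the second speech signal and utters the second utterance again.
【００８３】
The speaker checks the new recognition result of the first utterance displayed on the recognition result display means 9 and judges whether it is correct (step ST49); if it is, the speaker presses the confirm key to confirm the recognition results of the first and second utterances. The recognition result replacement means 25 outputs the confirmed recognition results of the first and second utterances (step ST50).
【００８４】
If the new recognition result of the first utterance is incorrect, the speaker presses the next-candidate key. The recognition result replacement means 25 judges whether a lower-ranked candidate exists for the detected segment of the second speech signal (step ST51); if one exists, it sets M = 2 (step ST52) and judges whether the candidate for the detected segment of the first speech signal is the same as the second-ranked candidate for the detected segment of the second speech signal (step ST43).
【００８５】
Thereafter, until the correct recognition result of the first utterance is displayed on the recognition result display means 9, the candidate for the detected segment of the first speech signal is replaced with successively lower-ranked candidates for the detected segment of the second speech signal. When no lower-ranked candidate remains, the speaker presses the correction key to cancel the second speech signal and utters the second utterance again.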
【００８６】
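The operation described above is explained below with a concrete example.
Here the recognition targets are the addresses shown in FIG. 22. The speaker, intending to say "神奈川県横浜市中区石川町", had uttered as far as "神奈川県横浜市中区" when "中区" was misrecognized as "西区", and then uttered "中区石川町".
【００８７】
When the speaker had uttered as far as "神奈川県横浜市中区", as shown in FIG. 5, the first model matching means 4 detected, in the first speech signal S1 output from the speech input means 2, three segments S11 to S13 corresponding to the three words of the first utterance, and obtained "神奈川県" as the first-ranked candidate for segment S11, "横浜市" as the first-ranked candidate for segment S12, and "西区" as the first-ranked candidate for segment S13. These were stored in the recognition result replacement means 25, and "神奈川県横浜市西区" was displayed on the recognition result display means 9.
【００８８】
Because "中区" was misrecognized as "西区", the speaker pressed the correction key and uttered "中区石川町". As shown in FIG. 13, the second model matching means 22 detected, in the second speech signal S2 output from the speech input means 2, two segments S21 and S22 corresponding to the two words of the second utterance, and obtained "西区" as the first-ranked, "中区" as the second-ranked, and "多摩区" as the third-ranked candidate for segment S21, and "石川町" as the first-ranked candidate for segment S22. These were stored in the recognition result replacement means 25. The spotting means 24 performed spotting by continuous DP matching between the first speech signal S1 and the second speech signal S2, and obtained the acoustic similarity between each of the segments S11 to S13 of the first speech signal S1 and each of the segments S21 and S22 of the second speech signal S2. As shown in FIG. 13, the recognition result replacement means 25 detected segment S13 of the first speech signal S1 and segment S21 of the second speech signal S2 as having high acoustic similarity to each other. Then, as shown in FIG. 13, because the candidate for the detected segment S13 of the first speech signal S1 was the same as the first-ranked candidate for the detected segment S21 of the second speech signal S2, the recognition result replacement means 25 replaced "西区", the candidate for segment S13, with "中区", the second-ranked candidate for segment S21, and appended "石川町", the candidate for the undetected segment S22 of the second speech signal S2. The new recognition result of the first utterance together with the recognition result of the second utterance, "神奈川県横浜市中区石川町", was displayed on the recognition result display means 9.
【００８９】
Because the new recognition result of the first utterance was correct, the speaker pressed the confirm key; the recognition results of the first and second utterances were confirmed and output from the recognition result replacement means 25.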
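The loop over steps ST43, ST47, and ST48, which skips any second-utterance candidate identical to the candidate already shown, might look like the following in Python. The function name `next_differing_rank` and its signature are hypothetical; the candidate list is the one given for segment S21 in this example.

```python
from typing import Optional, Sequence


def next_differing_rank(old: str, second_candidates: Sequence[str], start_rank: int = 1) -> Optional[int]:
    """Return the first rank M >= start_rank whose candidate differs from `old`, or None."""
    for rank in range(start_rank, len(second_candidates) + 1):
        if second_candidates[rank - 1] != old:   # comparison of step ST43
            return rank
    return None  # no lower candidate left: the second utterance must be repeated


rank = next_differing_rank("西区", ["西区", "中区", "多摩区"])
print(rank)
```

Skipping the identical candidate saves the speaker one press of the next-candidate key whenever the same misrecognition recurs in the second utterance.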
【００９０】
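As described above, according to Embodiment 3, when the first utterance is misrecognized, the misrecognized word of the first utterance is uttered as the second utterance, and the candidate for the segment of the first speech signal corresponding to the misrecognized word is replaced with a different candidate for the segment of the second speech signal corresponding to the misrecognized word, thereby correcting the misrecognized word. The misrecognized word can therefore be corrected efficiently, which has the effect of providing a speech recognition apparatus that is easy to use.
【００９１】
This embodiment has been described for an apparatus configured as in Embodiment 2, but the same effect is obtained with an apparatus configured as in Embodiment 1.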
【００９２】
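Embodiment 4.
The configuration of the speech recognition apparatus of Embodiment 4 is the same as that of Embodiment 1 shown in FIG. 1. In Embodiment 4, however, the word dictionary storage means 3 stores a word dictionary that contains the information on the words to be recognized in accordance with syntax rules defining their connection relationships.
【００９３】
In addition, the recognition result replacement means 8 displays the recognition result of the first utterance on the recognition result display means 9; if that result is incorrect, detects the segment of the first speech signal that has high acoustic similarity to the segment of the second speech signal, replaces the candidate for that segment with a candidate for the segment of the second speech signal that can be connected, under the syntax rules in the word dictionary, to the candidates for the segments before and after it, and displays the new recognition result of the first utterance on the recognition result display means 9; if the new result is incorrect, replaces the candidate for that segment with another candidate for the segment of the second speech signal that can be connected, under the syntax rules in the word dictionary, to the candidates for the segments before and after it; and, once the correct result is obtained, confirms and outputs the recognition result of the first utterance.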
【００９４】
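The operation is described next.
FIG. 14 is a flowchart explaining the operation of the speech recognition apparatus according to Embodiment 4 of this invention.
【００９５】
The steps up to step ST11 are performed as in Embodiment 1.
The recognition result replacement means 8 detects the segment of the first speech signal that has high acoustic similarity to the segment of the second speech signal (step ST61), sets M = 1 (step ST62), and judges whether the first-ranked candidate for the segment of the second speech signal can be connected, under the syntax rules in the word dictionary, to the candidates for the segments before and after the detected segment of the first speech signal (step ST63). If it can, the recognition result replacement means 8 replaces the candidate for the detected segment of the first speech signal with the first-ranked candidate for the segment of the second speech signal (step ST64) and displays the new recognition result of the first utterance on the recognition result display means 9 (step ST65).
【００９６】
If the M-th ranked candidate for the segment of the second speech signal cannot be connected, under the syntax rules in the word dictionary, to the candidates for the segments before and after the detected segment of the first speech signal, the recognition result replacement means 8 judges whether a lower-ranked candidate exists for the segment of the second speech signal (step ST66); if one exists, it sets M = M + 1 (step ST67) and returns to step ST63. If none exists, the speaker presses the correction key to cancel the second speech signal and utters the second utterance again.
【００９７】
The speaker checks the new recognition result of the first utterance displayed on the recognition result display means 9 and judges whether it is correct (step ST68); if it is, the speaker presses the confirm key to confirm the recognition result of the first utterance. The recognition result replacement means 8 outputs the confirmed recognition result of the first utterance (step ST69).
【００９８】
If the new recognition result of the first utterance is incorrect, the speaker presses the next-candidate key. The recognition result replacement means 8 judges whether a lower-ranked candidate exists for the segment of the second speech signal (step ST70); if one exists, it sets M = 2 (step ST71) and judges whether the second-ranked candidate for the segment of the second speech signal can be connected, under the syntax rules in the word dictionary, to the candidates for the segments before and after the detected segment of the first speech signal (step ST63).
【００９９】
Thereafter, until the correct recognition result of the first utterance is displayed on the recognition result display means 9, the candidate for the detected segment of the first speech signal is replaced with successively lower-ranked candidates for the segment of the second speech signal. When no lower-ranked candidate remains, the speaker presses the correction key to cancel the second speech signal and utters the second utterance again.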
【０１００】
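The operation described above is explained below with a concrete example.
Here the recognition targets are the addresses shown in FIG. 22. The speaker uttered "神奈川県横浜市中区", "中区" was misrecognized as "西区", and the speaker then uttered "中区" again. The word dictionary storage means 3 is assumed to store a word dictionary that contains the information on the words to be recognized in accordance with the syntax rules shown in FIG. 15, in which connection relationships are represented by arrows.
【０１０１】
When the speaker uttered "神奈川県横浜市中区", as shown in FIG. 5, the first model matching means 4 detected, in the first speech signal S1 output from the speech input means 2, three segments S11 to S13 corresponding to the three words of the first utterance, and obtained "神奈川県" as the first-ranked candidate for segment S11, "横浜市" as the first-ranked candidate for segment S12, and "西区" as the first-ranked candidate for segment S13. These were stored in the recognition result replacement means 8, and "神奈川県横浜市西区" was displayed on the recognition result display means 9.
【０１０２】
Because "中区" was misrecognized as "西区", the speaker pressed the correction key and uttered "中区" again. As shown in FIG. 16, the second model matching means 6 detected one segment S21 in the second speech signal S2 output from the speech input means 2, and obtained "多摩区" as the first-ranked, "中区" as the second-ranked, and "西区" as the third-ranked candidate, which were stored in the recognition result replacement means 8. The spotting means 7 performed spotting by continuous DP matching between the first speech signal S1 and the second speech signal S2, and obtained the acoustic similarity between each of the segments S11 to S13 of the first speech signal S1 and the segment S21 of the second speech signal S2. As shown in FIG. 16, the recognition result replacement means 8 detected segment S13 of the first speech signal S1 as the segment with high acoustic similarity to segment S21 of the second speech signal S2. As shown in FIG. 15, "多摩区", the first-ranked candidate for segment S21 of the second speech signal S2, cannot be connected to "横浜市", the candidate for segment S12 preceding the detected segment S13 of the first speech signal S1, whereas "中区", the second-ranked candidate for segment S21, can be connected to it. Therefore, as shown in FIG. 16, the recognition result replacement means 8 replaced "西区", the candidate for the detected segment S13 of the first speech signal S1, with "中区", the second-ranked candidate for segment S21 of the second speech signal S2, and the new recognition result of the first utterance, "神奈川県横浜市中区", was displayed on the recognition result display means 9.
【０１０３】
Because the new recognition result of the first utterance was correct, the speaker pressed the confirm key; the recognition result of the first utterance was confirmed and output from the recognition result replacement means 8.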
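The connectivity check of steps ST63, ST66, and ST67 can be sketched in Python as follows. The `GRAMMAR` table is a hypothetical stand-in for the syntax rules of FIG. 15, whose full contents are not reproduced in the text; it is only arranged so that "多摩区" cannot follow "横浜市" while "中区" can, as the example states. The function names are likewise assumptions.

```python
from typing import Dict, Optional, Sequence, Set

# Hypothetical excerpt of the syntax rules (word -> words allowed to follow it).
GRAMMAR: Dict[str, Set[str]] = {
    "神奈川県": {"横浜市", "川崎市"},
    "横浜市": {"中区", "西区", "南区"},
    "川崎市": {"多摩区"},
}


def connectable(word: str, prev_word: Optional[str], next_word: Optional[str]) -> bool:
    """True if `word` may follow prev_word and precede next_word (None means no neighbor)."""
    ok_prev = prev_word is None or word in GRAMMAR.get(prev_word, set())
    ok_next = next_word is None or next_word in GRAMMAR.get(word, set())
    return ok_prev and ok_next


def first_connectable_rank(
    result: Sequence[str], target: int, second_candidates: Sequence[str], start_rank: int = 1
) -> Optional[int]:
    """Return the first rank M whose candidate fits between the neighbors of result[target]."""
    prev_word = result[target - 1] if target > 0 else None
    next_word = result[target + 1] if target + 1 < len(result) else None
    for rank in range(start_rank, len(second_candidates) + 1):
        if connectable(second_candidates[rank - 1], prev_word, next_word):  # cf. step ST63
            return rank
    return None  # nothing connectable: the second utterance must be repeated


rank = first_connectable_rank(["神奈川県", "横浜市", "西区"], 2, ["多摩区", "中区", "西区"])
print(rank)
```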
【０１０４】
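As described above, according to Embodiment 4, when the first utterance is misrecognized, the misrecognized word of the first utterance is uttered as the second utterance, and the candidate for the segment of the first speech signal corresponding to the misrecognized word is replaced, in accordance with the syntax rules in the word dictionary, with a candidate for the segment of the second speech signal corresponding to the misrecognized word, thereby correcting the misrecognized word. The misrecognized word can therefore be corrected efficiently, which has the effect of providing a speech recognition apparatus that is easy to use.
【０１０５】
This embodiment has been described for an apparatus configured as in Embodiment 1, but the same effect is obtained with an apparatus configured as in Embodiment 2.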
【０１０６】
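Embodiment 5.
Embodiment 1 described detecting, from the acoustic similarity obtained by continuous DP matching, the segment of the first speech signal that corresponds to the misrecognized word of the first utterance, and replacing the candidate for that segment with a candidate for the segment of the second speech signal. Embodiment 5 describes detecting that segment using both the acoustic similarity and a matching score, and replacing the candidate for it with a candidate for the segment of the second speech signal.
【０１０７】
The configuration of the speech recognition apparatus of Embodiment 5 is the same as that of Embodiment 1 shown in FIG. 1. In Embodiment 5, however, the first model matching means 4 performs model matching between the first speech signal and the word dictionary stored in the word dictionary storage means 3, detects in the first speech signal the segment corresponding to each word of the first utterance, obtains one or more ranked candidates for each segment, and also obtains a matching score for each segment of the first speech signal.
【０１０８】
In addition, the recognition result replacement means 8 displays the recognition result of the first utterance on the recognition result display means 9; if that result is incorrect, obtains, for each segment of the first speech signal, a difference score between its acoustic similarity to the segment of the second speech signal and its matching score, detects the segment of the first speech signal with a high difference score, replaces the candidate for that segment with a candidate for the segment of the second speech signal, and displays the new recognition result of the first utterance on the recognition result display means 9; if the new result is incorrect, replaces the candidate for that segment with another candidate for the segment of the second speech signal; and, once the correct result is obtained, confirms and outputs the recognition result of the first utterance.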
【０１０９】
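The operation is described next.
FIGS. 17 and 18 are flowcharts explaining the operation of the speech recognition apparatus according to Embodiment 5 of this invention.
【０１１０】
When the speaker utters several words to be recognized (first utterance) and the first utterance is input to the speech input means 2 (step ST81), the speech input means 2 outputs its speech signal (first speech signal), which is input to the first model matching means 4. The first model matching means 4 performs model matching by continuous DP matching between the first speech signal and the word dictionary stored in the word dictionary storage means 3 (step ST82), detects in the first speech signal the segment corresponding to each word of the first utterance, obtains one or more ranked candidates for each segment, obtains a matching score for each segment of the first speech signal, and outputs these to the recognition result replacement means 8.
Steps ST3 to ST11 are then performed as in Embodiment 1.
【０１１１】
The recognition result replacement means 8 obtains, for each segment of the first speech signal, the difference score between its acoustic similarity to the segment of the second speech signal and its matching score (step ST83), detects the segment of the first speech signal with a high difference score (step ST84), sets M = 1 (step ST85), replaces the candidate for that segment with the first-ranked candidate for the segment of the second speech signal (step ST86), and displays the new recognition result of the first utterance on the recognition result display means 9 (step ST87).
Steps ST16 to ST19 are then performed as in Embodiment 1.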
【０１１２】
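The operation described above is explained below with a concrete example.
Here the recognition targets are the addresses shown in FIG. 22. The speaker uttered "神奈川県横浜市南区本牧", "本牧" was misrecognized as "中里", and the speaker then uttered "本牧" again. The matching score and the acoustic similarity are expressed as values in the range 0 to 1000, and a larger value indicates a better match or a higher similarity.
【０１１３】
When the speaker uttered "神奈川県横浜市南区本牧", as shown in FIG. 19, the first model matching means 4 detected, in the first speech signal S1 output from the speech input means 2, four segments S11 to S14 corresponding to the four words of the first utterance, and obtained "神奈川県" as the first-ranked candidate for segment S11, "横浜市" for segment S12, "南区" for segment S13, and "中里" for segment S14. These were stored in the recognition result replacement means 8. As shown in FIG. 19, the first model matching means 4 also obtained the matching scores C2[i] for segments S11 to S14 of the first speech signal S1 as 800, 750, 800, and 400, respectively. Because segment S14 corresponds to the misrecognized word of the first utterance, its matching score is lower than those of the other segments. "神奈川県横浜市南区中里" was displayed on the recognition result display means 9.
【０１１４】
Because "本牧" was misrecognized as "中里", the speaker pressed the correction key and uttered "本牧" again. As shown in FIG. 19, the second model matching means 6 detected one segment S21 in the second speech signal S2 output from the speech input means 2, and obtained "本牧" as the first-ranked, "中区" as the second-ranked, and "多摩区" as the third-ranked candidate, which were stored in the recognition result replacement means 8. As shown in FIG. 19, the spotting means 7 performed spotting by continuous DP matching between the first speech signal S1 and the second speech signal S2, and obtained the acoustic similarities C1[i] between segments S11 to S14 of the first speech signal S1 and segment S21 of the second speech signal S2 as 100, 150, 800, and 780, respectively. As shown in FIG. 19, the recognition result replacement means 8 obtained the difference scores C3[i] between the acoustic similarity to segment S21 and the matching score for segments S11 to S14 as −700, −600, 0, and 380, respectively, and detected segment S14 of the first speech signal S1 as the segment with a high difference score. Then, as shown in FIG. 19, the recognition result replacement means 8 replaced "中里", the candidate for the detected segment S14 of the first speech signal S1, with "本牧", the first-ranked candidate for segment S21 of the second speech signal S2, and the new recognition result of the first utterance, "神奈川県横浜市南区本牧", was displayed on the recognition result display means 9.
【０１１５】
Because the new recognition result of the first utterance was correct, the speaker pressed the confirm key; the recognition result of the first utterance was confirmed and output from the recognition result replacement means 8.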
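The arithmetic of this example can be checked with a few lines of Python. The values are the ones quoted from FIG. 19, and the difference is taken as C3[i] = C1[i] − C2[i], which is the sign convention the quoted numbers imply; the variable names are chosen for this sketch only.

```python
similarity = [100, 150, 800, 780]        # C1[i] for S11..S14 against S21
matching_score = [800, 750, 800, 400]    # C2[i] for S11..S14

# Difference score C3[i] = C1[i] - C2[i]
difference = [c1 - c2 for c1, c2 in zip(similarity, matching_score)]

by_difference = max(range(len(difference)), key=lambda i: difference[i])
by_similarity = max(range(len(similarity)), key=lambda i: similarity[i])

print(difference)                       # difference scores per segment
print(by_difference, by_similarity)     # zero-based indices of the selected segments
```

With these numbers the acoustic similarity alone is highest for S13 (zero-based index 2), whereas the difference score singles out S14 (index 3), the segment that was actually misrecognized. This is the situation the embodiment is designed to handle.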
【０１１６】
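The matching score is explained here.
FIG. 20 shows the result of model matching between the speech signal obtained when "神奈川県横浜市南区本牧" was uttered and a word dictionary containing, in sequence, the information on the words "神奈川県", "横浜市", "南区", and "中里". The horizontal axis represents the speech signal in units of frames t. The vertical axis represents the word dictionary in units of states u. The speech signal consists of T frames in total, and the word dictionary consists of U states in total.
【０１１７】
The length of a speech signal changes from utterance to utterance, and the signal also stretches and contracts locally. During model matching, therefore, the correspondence between the speech signal and the word dictionary is computed to find the optimal correspondence. This correspondence can be computed efficiently by the method known as dynamic programming, or the Viterbi algorithm. The optimal path in FIG. 20 shows the optimal correspondence obtained in this way between the frames t of the speech signal and the states u of the word dictionary. The optimal correspondence of state u to frame t is given by equation (1).
【０１１８】
$$u = G(t) \quad \cdots (1)$$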
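For readers unfamiliar with this kind of alignment, a simplified dynamic-programming sketch in Python is given below. It is not the continuous DP matching of the patent, whose details are not spelled out in the text. It assumes a local distance table `D[t][u]`, a path pinned to the first state at the first frame and the last state at the last frame, and transitions that either stay in the same state or advance by one. The function name `align` is hypothetical.

```python
from typing import List, Sequence


def align(D: Sequence[Sequence[float]]) -> List[int]:
    """Return G, where G[t] is the state aligned to frame t on the minimum-cost path.

    D[t][u] is the local distance between frame t and state u (requires T >= U).
    """
    T, U = len(D), len(D[0])
    INF = float("inf")
    cost = [[INF] * U for _ in range(T)]
    back = [[0] * U for _ in range(T)]
    cost[0][0] = D[0][0]                        # path starts at (frame 0, state 0)
    for t in range(1, T):
        for u in range(U):
            stay = cost[t - 1][u]               # remain in state u
            step = cost[t - 1][u - 1] if u > 0 else INF  # advance from state u - 1
            if step < stay:
                cost[t][u], back[t][u] = step + D[t][u], u - 1
            else:
                cost[t][u], back[t][u] = stay + D[t][u], u
    G = [0] * T
    u = U - 1                                   # path ends at the last state
    for t in range(T - 1, -1, -1):              # trace the path backwards
        G[t] = u
        u = back[t][u]
    return G


D = [[0, 5], [1, 4], [6, 0], [7, 1]]            # toy distances: 4 frames, 2 states
print(align(D))
```

In this toy table the cheapest path keeps the first two frames in state 0 and the last two in state 1.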
【０１１９】
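The acoustic similarity between the speech signal at frame t and the word dictionary at state u is expressed by the local distance D(t, u). The smaller the local distance, the higher the acoustic similarity between the speech signal and the word dictionary. The matching score C2[i] of word i is the average, over frames, of the local distances on the portion of the optimal path that belongs to word i. If, as shown in FIG. 20, the frames of the speech signal that correspond to the states belonging to word i run from ts(i) to te(i), the matching score C2[i] for word i is computed by equation (2).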
【０１２０】
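【数１】
$$C2[i] = \frac{1}{t_e(i) - t_s(i) + 1} \sum_{t = t_s(i)}^{t_e(i)} D\bigl(t, G(t)\bigr) \quad \cdots (2)$$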
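Following the verbal definition in paragraph 0119 (the average, over the frames of word i, of the local distances on the optimal path), the per-word score can be sketched in Python as below. The function name `matching_score`, the nested-list layout of `D`, and the toy numbers are assumptions. Note that the quantity computed here is a mean distance, for which smaller means closer, whereas the worked example of Embodiment 5 uses scores where larger is better; the text does not state how the two scales are related, so no conversion is attempted.

```python
from typing import Sequence


def matching_score(D: Sequence[Sequence[float]], G: Sequence[int], ts: int, te: int) -> float:
    """Mean of D[t][G[t]] over the frames ts..te (inclusive) that belong to one word."""
    total = sum(D[t][G[t]] for t in range(ts, te + 1))
    return total / (te - ts + 1)


D = [[2, 9], [4, 8], [7, 1], [6, 3]]   # toy local distances: 4 frames, 2 states
G = [0, 0, 1, 1]                        # toy optimal path: state index per frame
score_word0 = matching_score(D, G, ts=0, te=1)   # frames 0..1 lie in word 0
score_word1 = matching_score(D, G, ts=2, te=3)   # frames 2..3 lie in word 1
print(score_word0, score_word1)
```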
【０１２１】
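As described above, according to Embodiment 5, when the first utterance is misrecognized, the segment of the first speech signal that corresponds to the misrecognized word of the first utterance is detected using both the acoustic similarity and the matching score, and the candidate for that segment is replaced with a candidate for the segment of the second speech signal. Even when fluctuations in the speech signal or the like raise the acoustic similarity of a segment other than the one corresponding to the misrecognized word, the misrecognized word can be corrected efficiently, which has the effect of providing a speech recognition apparatus that is easy to use.
【０１２２】
This embodiment detects the segment of the first speech signal that corresponds to the misrecognized word of the first utterance by using the difference score between the acoustic similarity and the matching score, but the same effect is obtained when the segment is detected using a value obtained by another calculation method.
【０１２３】
This embodiment has been described for an apparatus configured as in Embodiment 1, but the same effect is obtained with an apparatus configured as in Embodiment 2.
【０１２４】
The speech recognition apparatus and speech recognition method described in each of the above embodiments can also be obtained by installing a speech recognition program on a computer.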
【０１２５】
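【Effects of the Invention】
As described above, according to this invention, the speech recognition apparatus comprises: word dictionary storage means that stores a word dictionary containing information on the words to be recognized; first matching means that matches a first speech signal against the word dictionary, detects in the first speech signal the segment corresponding to each word of the first utterance, and obtains candidates for each segment; second matching means that matches a second speech signal against the word dictionary, detects in the second speech signal the segment corresponding to each word of the second utterance, and obtains candidates for each segment; spotting means that obtains the acoustic similarity between each segment of the first speech signal and each segment of the second speech signal; and recognition result replacement means that uses the acoustic similarity obtained by the spotting means to detect the segment of the first speech signal and the segment of the second speech signal that correspond to the misrecognized word of the first utterance, and replaces the candidate for the detected segment of the first speech signal with the candidate for the detected segment of the second speech signal. This has the effect of providing a speech recognition apparatus that can correct misrecognized parts efficiently.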
【０１２６】
この発明によれば、２回目の音声が、１回目の音声中の誤認識された単語の音声のみからなる場合、認識結果入れ替え手段を、２回目の音声信号の部分区間と音響的類似度が高い１回目の音声信号の部分区間を、１回目の音声中の誤認識された単語に対応する１回目の音声信号の部分区間として検出し、検出された１回目の音声信号の部分区間に対する候補を、２回目の音声信号の部分区間に対する候補に入れ替えるものとするように音声認識装置を構成したので、誤認識された部分を効率的に修正することができる音声認識装置が得られる効果がある。
【０１２７】
この発明によれば、２回目の音声が、１回目の音声中の誤認識された単語及びそれに後続する１又は複数の単語の音声からなる場合、認識結果入れ替え手段を、音響的類似度が高い１回目の音声信号の部分区間及び２回目の音声信号の部分区間を、１回目の音声中の誤認識された単語に対応する１回目の音声信号の部分区間及び２回目の音声信号の部分区間として検出し、検出された１回目の音声信号の部分区間に対する候補を、検出された２回目の音声信号の部分区間に対する候補に入れ替え、検出されなかった２回目の音声信号の部分区間に対する候補をそれに付加するものとするように音声認識装置を構成したので、誤認識された単語及びそれに後続する１または複数の単語の音声を２回目の音声として発声した場合でも、誤認識された部分を効率的に修正することができる音声認識装置が得られる効果がある。
【０１２８】
この発明によれば、認識結果入れ替え手段を、１回目の音声中の誤認識された単語に対応する１回目の音声信号の部分区間及び２回目の音声信号の部分区間を検出し、検出された２回目の音声信号の部分区間に対する候補が、検出された１回目の音声信号の部分区間に対する候補と同じか否かを判断し、検出された１回目の音声信号の部分区間に対する候補を、その候補と異なる検出された２回目の音声信号の部分区間に対する候補に入れ替えるものとするように音声認識装置を構成したので、誤認識された部分を効率的に修正することができる音声認識装置が得られる効果がある。
【０１２９】
この発明によれば、単語辞書記憶手段を、認識対象となる単語の情報を接続関係を規定する構文規則に従って含む単語辞書を記憶するものとし、認識結果入れ替え手段を、１回目の音声中の誤認識された単語に対応する１回目の音声信号の部分区間及び２回目の音声信号の部分区間を検出し、検出された２回目の音声信号の部分区間に対する候補が、単語辞書中の構文規則に従って、検出された１回目の音声信号の部分区間の前後の部分区間に対する候補と接続可能であるか否かを判断し、検出された１回目の音声信号の部分区間に対する候補を、その前後の部分区間に対する候補と接続可能な検出された２回目の音声信号の部分区間に対する候補に入れ替えるものとするように音声認識装置を構成したので、誤認識された部分を効率的に修正することができる音声認識装置が得られる効果がある。
【０１３０】
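According to the present invention, the first matching means performs matching between the first speech signal and the word dictionary, detects from the first speech signal the partial section corresponding to each word in the first speech, obtains candidates for each partial section, and also obtains a matching score for each partial section of the first speech signal. The recognition result replacement means detects, using both the acoustic similarity obtained by the spotting means and the matching score obtained by the first matching means, the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, and replaces the candidate for the detected partial section of the first speech signal with the candidate for the detected partial section of the second speech signal. This provides a speech recognition apparatus that can correct a misrecognized portion efficiently even when, because of fluctuations in the speech signal or the like, a partial section other than the one corresponding to the misrecognized portion has a high acoustic similarity.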
【０１３１】
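According to the present invention, the speech recognition method comprises: a first matching step of performing matching between the first speech signal and a word dictionary containing information on the words to be recognized, detecting from the first speech signal the partial section corresponding to each word in the first speech, and obtaining candidates for each partial section; a second matching step of performing matching between the second speech signal and the word dictionary, detecting from the second speech signal the partial section corresponding to each word in the second speech, and obtaining candidates for each partial section; a spotting step of obtaining the acoustic similarity between each partial section of the first speech signal and each partial section of the second speech signal; and a recognition result replacement step of detecting, using the acoustic similarity obtained in the spotting step, the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, and replacing the candidate for the detected partial section of the first speech signal with the candidate for the detected partial section of the second speech signal. This provides a speech recognition method that can correct a misrecognized portion efficiently.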
【０１３２】
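According to the present invention, when the second speech consists only of the misrecognized word in the first speech, the recognition result replacement step detects the partial section of the first speech signal that has a high acoustic similarity to the partial section of the second speech signal as the partial section of the first speech signal corresponding to the misrecognized word in the first speech, and replaces the candidate for the detected partial section of the first speech signal with the candidate for the partial section of the second speech signal. This provides a speech recognition method that can correct a misrecognized portion efficiently.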
【０１３３】
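According to the present invention, when the second speech consists of the misrecognized word in the first speech and one or more words that follow it, the recognition result replacement step detects the partial section of the first speech signal and the partial section of the second speech signal that have a high acoustic similarity to each other as the partial sections corresponding to the misrecognized word in the first speech, replaces the candidate for the detected partial section of the first speech signal with the candidate for the detected partial section of the second speech signal, and appends to it the candidates for the partial sections of the second speech signal that were not detected. This provides a speech recognition method that can correct a misrecognized portion efficiently even when the speaker utters, as the second speech, the misrecognized word together with one or more words that follow it.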
【０１３４】
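According to the present invention, the recognition result replacement step detects the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, determines whether each candidate for the detected partial section of the second speech signal is the same as the candidate for the detected partial section of the first speech signal, and replaces the candidate for the detected partial section of the first speech signal with a candidate for the detected partial section of the second speech signal that differs from it. This provides a speech recognition method that can correct a misrecognized portion efficiently.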
【０１３５】
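According to the present invention, the recognition result replacement step detects the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, determines, according to the syntax rules in a word dictionary that contains information on the words to be recognized organized according to syntax rules defining how words may be connected, whether each candidate for the detected partial section of the second speech signal can be connected to the candidates for the partial sections before and after the detected partial section of the first speech signal, and replaces the candidate for the detected partial section of the first speech signal with a candidate for the detected partial section of the second speech signal that can be connected to the candidates for those preceding and following partial sections. This provides a speech recognition method that can correct a misrecognized portion efficiently.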
【０１３６】
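According to the present invention, the first matching step performs matching between the first speech signal and the word dictionary, detects from the first speech signal the partial section corresponding to each word in the first speech, obtains candidates for each partial section, and also obtains a matching score for each partial section of the first speech signal. The recognition result replacement step detects, using both the acoustic similarity obtained in the spotting step and the matching score obtained in the first matching step, the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, and replaces the candidate for the detected partial section of the first speech signal with the candidate for the detected partial section of the second speech signal. This provides a speech recognition method that can correct a misrecognized portion efficiently even when, because of fluctuations in the speech signal or the like, a partial section other than the one corresponding to the misrecognized portion has a high acoustic similarity.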
【０１３７】
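According to the present invention, the speech recognition program causes a computer to implement: a first matching function of performing matching between the first speech signal and a word dictionary containing information on the words to be recognized, detecting from the first speech signal the partial section corresponding to each word in the first speech, and obtaining candidates for each partial section; a second matching function of performing matching between the second speech signal and the word dictionary, detecting from the second speech signal the partial section corresponding to each word in the second speech, and obtaining candidates for each partial section; a spotting function of obtaining the acoustic similarity between each partial section of the first speech signal and each partial section of the second speech signal; and a recognition result replacement function of detecting, using the acoustic similarity obtained by the spotting function, the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, and replacing the candidate for the detected partial section of the first speech signal with the candidate for the detected partial section of the second speech signal. This provides a speech recognition program that can correct a misrecognized portion efficiently.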
【０１３８】
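According to the present invention, when the second speech consists only of the misrecognized word in the first speech, the recognition result replacement function detects the partial section of the first speech signal that has a high acoustic similarity to the partial section of the second speech signal as the partial section of the first speech signal corresponding to the misrecognized word in the first speech, and replaces the candidate for the detected partial section of the first speech signal with the candidate for the partial section of the second speech signal. This provides a speech recognition program that can correct a misrecognized portion efficiently.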
【０１３９】
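According to the present invention, when the second speech consists of the misrecognized word in the first speech and one or more words that follow it, the recognition result replacement function detects the partial section of the first speech signal and the partial section of the second speech signal that have a high acoustic similarity to each other as the partial sections corresponding to the misrecognized word in the first speech, replaces the candidate for the detected partial section of the first speech signal with the candidate for the detected partial section of the second speech signal, and appends to it the candidates for the partial sections of the second speech signal that were not detected. This provides a speech recognition program that can correct a misrecognized portion efficiently even when the speaker utters, as the second speech, the misrecognized word together with one or more words that follow it.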
【０１４０】
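According to the present invention, the recognition result replacement function detects the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, determines whether each candidate for the detected partial section of the second speech signal is the same as the candidate for the detected partial section of the first speech signal, and replaces the candidate for the detected partial section of the first speech signal with a candidate for the detected partial section of the second speech signal that differs from it. This provides a speech recognition program that can correct a misrecognized portion efficiently.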
【０１４１】
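According to the present invention, the recognition result replacement function detects the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, determines, according to the syntax rules in a word dictionary that contains information on the words to be recognized organized according to syntax rules defining how words may be connected, whether each candidate for the detected partial section of the second speech signal can be connected to the candidates for the partial sections before and after the detected partial section of the first speech signal, and replaces the candidate for the detected partial section of the first speech signal with a candidate for the detected partial section of the second speech signal that can be connected to the candidates for those preceding and following partial sections. This provides a speech recognition program that can correct a misrecognized portion efficiently.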
【０１４２】
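According to the present invention, the first matching function performs matching between the first speech signal and the word dictionary, detects from the first speech signal the partial section corresponding to each word in the first speech, obtains candidates for each partial section, and also obtains a matching score for each partial section of the first speech signal. The recognition result replacement function detects, using both the acoustic similarity obtained by the spotting function and the matching score obtained by the first matching function, the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, and replaces the candidate for the detected partial section of the first speech signal with the candidate for the detected partial section of the second speech signal. This provides a speech recognition program that can correct a misrecognized portion efficiently even when, because of fluctuations in the speech signal or the like, a partial section other than the one corresponding to the misrecognized portion has a high acoustic similarity.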
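[Brief Description of the Drawings]
[FIG. 1] Block diagram showing the configuration of the speech recognition apparatus according to Embodiment 1 of the present invention.
[FIG. 2] Flowchart explaining the operation of the speech recognition apparatus according to Embodiment 1 of the present invention (part 1).
[FIG. 3] Flowchart explaining the operation of the speech recognition apparatus according to Embodiment 1 of the present invention (part 2).
[FIG. 4] Flowchart explaining the operation of the speech recognition apparatus according to Embodiment 1 of the present invention (part 3).
[FIG. 5] Diagram explaining a specific operation of the speech recognition apparatus according to Embodiment 1 of the present invention (part 1).
[FIG. 6] Diagram explaining a specific operation of the speech recognition apparatus according to Embodiment 1 of the present invention (part 2).
[FIG. 7] Diagram explaining a specific operation of the speech recognition apparatus according to Embodiment 1 of the present invention (part 3).
[FIG. 8] Block diagram showing the configuration of the speech recognition apparatus according to Embodiment 2 of the present invention.
[FIG. 9] Flowchart explaining the operation of the speech recognition apparatus according to Embodiment 2 of the present invention (part 1).
[FIG. 10] Flowchart explaining the operation of the speech recognition apparatus according to Embodiment 2 of the present invention (part 2).
[FIG. 11] Diagram explaining a specific operation of the speech recognition apparatus according to Embodiment 2 of the present invention.
[FIG. 12] Flowchart explaining the operation of the speech recognition apparatus according to Embodiment 3 of the present invention.
[FIG. 13] Diagram explaining a specific operation of the speech recognition apparatus according to Embodiment 3 of the present invention.
[FIG. 14] Flowchart explaining the operation of the speech recognition apparatus according to Embodiment 4 of the present invention.
[FIG. 15] State diagram of the word dictionary stored in the word dictionary storage means of the speech recognition apparatus according to Embodiment 4 of the present invention.
[FIG. 16] Diagram explaining a specific operation of the speech recognition apparatus according to Embodiment 4 of the present invention.
[FIG. 17] Flowchart explaining the operation of the speech recognition apparatus according to Embodiment 5 of the present invention (part 1).
[FIG. 18] Flowchart explaining the operation of the speech recognition apparatus according to Embodiment 5 of the present invention (part 2).
[FIG. 19] Diagram explaining a specific operation of the speech recognition apparatus according to Embodiment 5 of the present invention.
[FIG. 20] Diagram explaining how the matching score is calculated.
[FIG. 21] Block diagram showing the configuration of the conventional speech recognition apparatus disclosed in Japanese Patent Laid-Open No. 4-181299.
[FIG. 22] Diagram showing a specific example of the recognition targets of a speech recognition apparatus.
[FIG. 23] Diagram explaining a specific operation of the conventional speech recognition apparatus (part 1).
[FIG. 24] Diagram explaining a specific operation of the conventional speech recognition apparatus (part 2).
[FIG. 25] Diagram explaining a specific operation of the conventional speech recognition apparatus (part 3).
[FIG. 26] Diagram explaining a specific operation of the conventional speech recognition apparatus (part 4).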
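[Description of Reference Numerals]
1, 21: speech recognition apparatus; 2: speech input means; 3: word dictionary storage means; 4: first model matching means; 5: speech signal storage means; 6, 22: second model matching means; 7, 24: spotting means; 8, 25: recognition result replacement means; 9: recognition result display means; 23: partial section storage means.
[0001]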
[Technical Field of the Invention]
The present invention relates to a speech recognition apparatus, a speech recognition method, and a speech recognition program.
[0002]
[Prior art]
A speech recognition apparatus is an efficient means of data input. However, when speech uttered by the speaker is misrecognized, correcting the misrecognized portion takes effort. A speech recognition apparatus therefore needs a means of correcting misrecognized portions easily.
[0003]
FIG. 21 is a block diagram showing the configuration of the conventional speech recognition apparatus disclosed in Japanese Patent Laid-Open No. 4-181299. In the figure, 101 is the speech recognition apparatus; 102 is speech input means that outputs input speech as a speech signal; 103 is word dictionary storage means that stores a word dictionary containing information on the words to be recognized; 104 is model matching means that performs model matching between the speech signal (first speech signal) of speech consisting of a plurality of words to be recognized (first speech), input to the speech input means 102, and the word dictionary stored in the word dictionary storage means 103, detects from the first speech signal the partial section corresponding to each word in the first speech, and obtains one or more ranked candidates for each partial section; 105 is speech signal storage means that stores the first speech signal matched by the model matching means 104; 106 is spotting means that performs spotting between the first speech signal stored in the speech signal storage means 105 and the speech signal (second speech signal) of the speech of the misrecognized word in the first speech (second speech), input to the speech input means 102, and obtains the acoustic similarity between each partial section of the first speech signal and the second speech signal; and 107 is recognition result replacement means that displays the recognition result of the first speech on recognition result display means 108, and, when that result is incorrect, replaces the candidate for the partial section of the first speech signal having a high acoustic similarity to the second speech signal with another candidate, displays the new recognition result of the first speech on the recognition result display means 108, confirms the recognition result of the first speech once a correct result is obtained, and outputs the confirmed result.
[0004]
When the correction key is pressed, the speech input means 102 switches the output destination of the speech signal from the model matching means 104 to the spotting means 106.
[0005]
Next, the operation will be described.
When the speaker utters speech consisting of a plurality of words to be recognized (first speech) and the first speech is input to the speech input means 102, the speech input means 102 outputs the speech signal of the first speech (first speech signal). The first speech signal output from the speech input means 102 is input to the model matching means 104. The model matching means 104 performs model matching by continuous DP matching between the first speech signal and the word dictionary stored in the word dictionary storage means 103, detects from the first speech signal the partial section corresponding to each word in the first speech, obtains one or more ranked candidates for each partial section, and outputs them to the recognition result replacement means 107. The recognition result replacement means 107 stores the matching result for the first speech signal and displays the recognition result of the first speech on the recognition result display means 108. The speech signal storage means 105 stores the first speech signal that was matched by the model matching means 104 and output from it.
[0006]
The speaker looks at the recognition result of the first speech displayed on the recognition result display means 108 and judges whether it is correct. If it is correct, the speaker presses the confirm key to confirm the recognition result of the first speech. The recognition result replacement means 107 then outputs the confirmed recognition result of the first speech.
[0007]
If the recognition result of the first speech displayed on the recognition result display means 108 is incorrect, the speaker presses the correction key and utters the misrecognized word in the first speech (second speech). When the second speech is input to the speech input means 102, the speech input means 102 outputs the speech signal of the second speech (second speech signal). Because pressing the correction key switches the output destination of the speech input means 102 from the model matching means 104 to the spotting means 106, the second speech signal output from the speech input means 102 is input to the spotting means 106.
[0008]
The spotting means 106 performs spotting by continuous DP matching between the first speech signal stored in the speech signal storage means 105 and the second speech signal, obtains the acoustic similarity between each partial section of the first speech signal and the second speech signal, and outputs it to the recognition result replacement means 107.
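As a rough illustration only, the spotting step can be sketched in Python. The text does not give the continuous DP matching formulation, so the sketch below substitutes a plain dynamic-time-warping distance over one-dimensional feature values; the function names `dtw_distance` and `section_similarities`, the scalar features, and the length normalization are all assumptions made for this example.

```python
def dtw_distance(a, b):
    """Dynamic-programming alignment cost between two 1-D feature sequences.

    Stand-in for continuous DP matching (an assumption, not the patent's formula).
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor paths.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def section_similarities(first_sections, second_signal):
    """Return one similarity per partial section; a higher value means more alike."""
    return [
        -dtw_distance(section, second_signal) / (len(section) + len(second_signal))
        for section in first_sections
    ]
```

The partial section with the largest returned value would then be treated as the one that the re-uttered word corresponds to.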
[0009]
The recognition result replacement means 107 detects the partial section of the first speech signal that has a high acoustic similarity to the second speech signal, replaces the candidate for that partial section with another candidate, and displays the new recognition result of the first speech on the recognition result display means 108.
[0010]
The speaker looks at the new recognition result of the first speech displayed on the recognition result display means 108 and judges whether it is correct. If it is correct, the speaker presses the confirm key to confirm the recognition result of the first speech. The recognition result replacement means 107 then outputs the confirmed recognition result of the first speech.
[0011]
When the new first speech recognition result displayed on the recognition result display means 108 is not correct, the speaker presses the next candidate key. The recognition result replacement unit 107 replaces the candidate for the detected partial section of the first speech signal with another candidate, and displays the new first speech recognition result on the recognition result display unit 108.
[0012]
When the correct candidate is not included among the candidates for the detected partial section of the first speech signal, the speaker presses the correction key to cancel the first speech signal and utters the first speech again.
[0013]
Hereinafter, the operation described above will be described using a specific example.
Here, the recognition targets are the addresses shown in FIG. 22. The case described is one in which, when the speaker uttered “Kanagawa-ken, Yokohama-shi, Naka-ku, Ishikawa-cho”, “Naka-ku” was misrecognized as “Nishi-ku”, and the speaker then uttered “Naka-ku” again.
[0014]
When the speaker utters “Kanagawa-ken, Yokohama-shi, Naka-ku”, the model matching means 104 detects, as shown in the figure, three partial sections S11 to S13 corresponding to the three words from the first speech signal S1 output from the speech input means 102. It obtains “Kanagawa-ken” as the first candidate for partial section S11; “Yokohama-shi” as the first candidate and “Kawasaki-shi” as the second candidate for partial section S12; and “Nishi-ku” as the first candidate, “Tama-ku” as the second candidate, and “Naka-ku” as the third candidate for partial section S13. These candidates are stored in the recognition result replacement means 107, and “Kanagawa-ken, Yokohama-shi, Nishi-ku” is displayed on the recognition result display means 108.
[0015]
In this case, “Naka-ku” has been misrecognized as “Nishi-ku”, so the speaker presses the correction key and utters “Naka-ku” again. The spotting means 106 performs spotting by continuous DP matching between the first speech signal S1 and the second speech signal S2, and obtains the acoustic similarity between each of the partial sections S11 to S13 of the first speech signal S1 and the second speech signal S2. As shown in FIG. 24, the recognition result replacement means 107 detects partial section S13 of the first speech signal S1 as having a high acoustic similarity to the second speech signal S2. Then, as shown in FIG. 25, the recognition result replacement means 107 replaces “Nishi-ku”, the first candidate for the detected partial section S13 of the first speech signal S1, with “Tama-ku”, the second candidate, and “Kanagawa-ken, Yokohama-shi, Tama-ku”, the new recognition result of the first speech, is displayed on the recognition result display means 108.
[0016]
Because the new recognition result of the first speech displayed on the recognition result display means 108 is incorrect, the speaker presses the next candidate key. As shown in the figure, the recognition result replacement means 107 then replaces “Tama-ku”, the second candidate for the detected partial section S13 of the first speech signal S1, with “Naka-ku”, the third candidate, and “Kanagawa-ken, Yokohama-shi, Naka-ku”, the new recognition result of the first speech, is displayed on the recognition result display means 108.
[0017]
Because the new recognition result of the first speech displayed on the recognition result display means 108 is correct, the speaker presses the confirm key. The recognition result of the first speech is thereby confirmed, and the confirmed result is output from the recognition result replacement means 107.
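The conventional correction flow in the example above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name `replace_with_next_candidate`, the list-based data layout, and the similarity values used to drive it are hypothetical and are not taken from the cited publication.

```python
def replace_with_next_candidate(candidates, ranks, similarities):
    """Advance the displayed candidate of the section most similar to the re-utterance.

    candidates[i] is the ranked candidate list for partial section i,
    ranks[i] is the index of the candidate currently displayed for it, and
    similarities[i] is its acoustic similarity to the second speech signal.
    Returns the new recognition result, or None when the candidates for the
    detected section are exhausted (the first speech must then be uttered again).
    """
    target = max(range(len(similarities)), key=similarities.__getitem__)
    if ranks[target] + 1 >= len(candidates[target]):
        return None
    ranks[target] += 1
    return [cands[rank] for cands, rank in zip(candidates, ranks)]
```

Called repeatedly with the candidate lists of the example, the function steps section S13 from “Nishi-ku” to “Tama-ku” and then to “Naka-ku”. It returns `None` once the list is exhausted, which is the situation the invention sets out to avoid.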
[0018]
[Problems to be solved by the invention]
Because the conventional speech recognition apparatus is configured as described above, when the correct candidate is not among the candidates for the partial section of the first speech signal corresponding to the misrecognized portion, the first speech signal has to be cancelled and the first speech uttered again.
[0019]
In addition, when a speaker utters a long sentence in separate stretches and a misrecognition occurs at a break, people tend to utter the misrecognized portion together with the words that follow it. Because the conventional speech recognition apparatus assumes that only the misrecognized portion is uttered again, it cannot correct the misrecognized portion properly in such a case.
[0020]
The present invention has been made to solve the above problems, and its object is to obtain a speech recognition apparatus, a speech recognition method, and a speech recognition program that can correct a misrecognized portion efficiently.
[0021]
[Means for Solving the Problems]
The speech recognition apparatus according to the present invention comprises: word dictionary storage means for storing a word dictionary containing information on the words to be recognized; first matching means for performing matching between the first speech signal and the word dictionary, detecting from the first speech signal the partial section corresponding to each word in the first speech, and obtaining candidates for each partial section; second matching means for performing matching between the second speech signal and the word dictionary, detecting from the second speech signal the partial section corresponding to each word in the second speech, and obtaining candidates for each partial section; spotting means for obtaining the acoustic similarity between each partial section of the first speech signal and each partial section of the second speech signal; and recognition result replacement means for detecting, using the acoustic similarity obtained by the spotting means, the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, and for replacing the candidate for the detected partial section of the first speech signal with the candidate for the detected partial section of the second speech signal. The word dictionary storage means stores a word dictionary that contains information on the words to be recognized, organized according to syntax rules that define how words may be connected. The recognition result replacement means detects the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, determines, according to the syntax rules in the word dictionary, whether each candidate for the detected partial section of the second speech signal can be connected to the candidates for the partial sections before and after the detected partial section of the first speech signal, and replaces the candidate for the detected partial section of the first speech signal with a candidate for the detected partial section of the second speech signal that can be connected to the candidates for those preceding and following partial sections.
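The connectability check can be illustrated with a small Python sketch. The syntax rules are modelled here as a map from each word to the set of words allowed to follow it; this representation, the map contents in `ALLOWED_NEXT`, and the function names are assumptions made for the example and do not reproduce the word dictionary of the embodiments.

```python
# Hypothetical syntax rules: each word maps to the words that may follow it.
ALLOWED_NEXT = {
    "Kanagawa-ken": {"Yokohama-shi", "Kawasaki-shi"},
    "Yokohama-shi": {"Nishi-ku", "Naka-ku"},
    "Kawasaki-shi": {"Tama-ku", "Nakahara-ku"},
}


def connectable(prev_word, word, next_word, allowed_next):
    """True if `word` may follow `prev_word` and precede `next_word`."""
    ok_prev = prev_word is None or word in allowed_next.get(prev_word, set())
    ok_next = next_word is None or next_word in allowed_next.get(word, set())
    return ok_prev and ok_next


def pick_connectable(first_result, index, second_candidates, allowed_next):
    """Return the best-ranked second-utterance candidate that fits its neighbours."""
    prev_word = first_result[index - 1] if index > 0 else None
    next_word = first_result[index + 1] if index + 1 < len(first_result) else None
    for candidate in second_candidates:
        if connectable(prev_word, candidate, next_word, allowed_next):
            return candidate
    return None  # no candidate satisfies the syntax rules
```

With these illustrative rules, a candidate such as “Tama-ku” is skipped after “Yokohama-shi” because the rules do not allow that connection, and the next connectable candidate is used instead.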
[0022]
In the speech recognition apparatus according to the present invention, when the second speech consists only of the misrecognized word in the first speech, the recognition result replacement means detects the partial section of the first speech signal that has a high acoustic similarity to the partial section of the second speech signal as the partial section of the first speech signal corresponding to the misrecognized word in the first speech, and replaces the candidate for the detected partial section of the first speech signal with the candidate for the partial section of the second speech signal.
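A minimal Python sketch of this single-word case follows. Unlike the conventional flow, the replacement word comes from recognizing the second speech itself rather than from the remaining candidates of the first speech. The function name `replace_from_second_utterance` and the data layout are hypothetical.

```python
def replace_from_second_utterance(first_result, similarities, second_candidates):
    """Swap the most similar section's word for the second utterance's top candidate.

    first_result      -- words currently displayed for the first speech
    similarities[i]   -- acoustic similarity of section i to the second speech
    second_candidates -- ranked candidates obtained by recognizing the second speech
    """
    target = max(range(len(similarities)), key=similarities.__getitem__)
    corrected = list(first_result)  # leave the caller's list untouched
    corrected[target] = second_candidates[0]
    return corrected
```

Because the candidates are taken from the second speech, a correct word can be recovered even when it was absent from the candidate list produced for the first speech.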
[0023]
In the speech recognition apparatus according to the present invention, when the second speech consists of the misrecognized word in the first speech and one or more words that follow it, the recognition result replacement means detects the partial section of the first speech signal and the partial section of the second speech signal that have a high acoustic similarity to each other as the partial sections corresponding to the misrecognized word in the first speech, replaces the candidate for the detected partial section of the first speech signal with the candidate for the detected partial section of the second speech signal, and appends to it the candidates for the partial sections of the second speech signal that were not detected.
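The replace-and-append behaviour can be sketched in Python as below. The sketch assumes that the words of the second speech after the detected one are simply appended to the end of the corrected result, which matches the case where the misrecognized word was the last word of the first speech. The function name, the similarity matrix layout, and the example values are hypothetical.

```python
def correct_and_append(first_result, second_result, similarity):
    """Replace the misrecognized word and append the words uttered after it.

    similarity[i][j] is the acoustic similarity between partial section i of
    the first speech signal and partial section j of the second speech signal.
    """
    pairs = (
        (i, j)
        for i in range(len(first_result))
        for j in range(len(second_result))
    )
    # The most similar pair is taken to mark the misrecognized word.
    i, j = max(pairs, key=lambda p: similarity[p[0]][p[1]])
    corrected = list(first_result)
    corrected[i] = second_result[j]
    # Assumption: the remaining words of the second speech follow the result.
    corrected.extend(second_result[j + 1:])
    return corrected
```

For a first result of “Kanagawa-ken, Yokohama-shi, Nishi-ku” and a second speech recognized as “Naka-ku, Ishikawa-cho”, a similarity matrix that peaks at the pair (S13, first section of the second speech) yields “Kanagawa-ken, Yokohama-shi, Naka-ku, Ishikawa-cho”.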
[0024]
In the speech recognition apparatus according to the present invention, the recognition result replacement means detects the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, determines whether each candidate for the detected partial section of the second speech signal is the same as the candidate for the detected partial section of the first speech signal, and replaces the candidate for the detected partial section of the first speech signal with a candidate for the detected partial section of the second speech signal that differs from it.
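The same-or-different check amounts to skipping any second-speech candidate that repeats the word already known to be wrong. A minimal Python sketch, with a hypothetical function name, might look like this:

```python
def pick_different_candidate(current_word, second_candidates):
    """Return the best-ranked second-utterance candidate that differs from the
    word currently displayed for the misrecognized section, or None if every
    candidate repeats it."""
    for candidate in second_candidates:
        if candidate != current_word:
            return candidate
    return None
```

If the second speech is itself recognized as “Nishi-ku” first and “Naka-ku” second, the function skips “Nishi-ku”, which the speaker has already rejected, and returns “Naka-ku”.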
[0026]
In the speech recognition apparatus according to the present invention, the first matching means performs matching between the first speech signal and the word dictionary, detects from the first speech signal the partial section corresponding to each word in the first speech, obtains candidates for each partial section, and also obtains a matching score for each partial section of the first speech signal. The recognition result replacement means detects, using both the acoustic similarity obtained by the spotting means and the matching score obtained by the first matching means, the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, and replaces the candidate for the detected partial section of the first speech signal with the candidate for the detected partial section of the second speech signal.
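One way to combine the two values is a difference score, which the description of the embodiments mentions without the formula being reproduced here. The Python sketch below therefore assumes a simple form, similarity minus matching score, on the reasoning that a misrecognized section tends to have a poor matching score; the function name, the score scales, and the example values are all hypothetical.

```python
def detect_misrecognized(similarities, matching_scores):
    """Return the index of the section with the largest difference score.

    Assumed difference score: acoustic similarity to the second speech minus
    the matching score of the first recognition (both on comparable scales).
    """
    diffs = [s - m for s, m in zip(similarities, matching_scores)]
    return max(range(len(diffs)), key=diffs.__getitem__)
```

With similarities `[0.2, 0.8, 0.7]` and matching scores `[0.9, 0.9, 0.3]`, similarity alone would point to the second section, whereas the difference score points to the third, whose first recognition matched poorly.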
[0027]
The speech recognition method according to the present invention comprises: a first matching step of performing matching between the first speech signal and a word dictionary containing information on the words to be recognized, detecting from the first speech signal the partial section corresponding to each word in the first speech, and obtaining candidates for each partial section; a second matching step of performing matching between the second speech signal and the word dictionary, detecting from the second speech signal the partial section corresponding to each word in the second speech, and obtaining candidates for each partial section; a spotting step of obtaining the acoustic similarity between each partial section of the first speech signal and each partial section of the second speech signal; and a recognition result replacement step of detecting, using the acoustic similarity obtained in the spotting step, the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, and replacing the candidate for the detected partial section of the first speech signal with the candidate for the detected partial section of the second speech signal. The recognition result replacement step detects the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, determines, according to the syntax rules in a word dictionary that contains information on the words to be recognized organized according to syntax rules defining how words may be connected, whether each candidate for the detected partial section of the second speech signal can be connected to the candidates for the partial sections before and after the detected partial section of the first speech signal, and replaces the candidate for the detected partial section of the first speech signal with a candidate for the detected partial section of the second speech signal that can be connected to the candidates for those preceding and following partial sections.
[0028]
In the speech recognition method according to the present invention, when the second speech consists only of the misrecognized word in the first speech, the recognition result replacement step detects the partial section of the first speech signal that has a high acoustic similarity to the partial section of the second speech signal as the partial section of the first speech signal corresponding to the misrecognized word in the first speech, and replaces the candidate for the detected partial section of the first speech signal with the candidate for the partial section of the second speech signal.
[0029]
In the speech recognition method according to the present invention, when the second speech consists of the misrecognized word in the first speech and one or more words that follow it, the recognition result replacement step detects the partial section of the first speech signal and the partial section of the second speech signal that have a high acoustic similarity to each other as the partial sections corresponding to the misrecognized word in the first speech, replaces the candidate for the detected partial section of the first speech signal with the candidate for the detected partial section of the second speech signal, and appends to it the candidates for the partial sections of the second speech signal that were not detected.
[0030]
In the speech recognition method according to the present invention, the recognition result replacement step detects the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, determines whether each candidate for the detected partial section of the second speech signal is the same as the candidate for the detected partial section of the first speech signal, and replaces the candidate for the detected partial section of the first speech signal with a candidate for the detected partial section of the second speech signal that differs from it.
[0032]
In the speech recognition method according to the present invention, the first matching step performs matching between the first speech signal and the word dictionary, detects from the first speech signal the partial section corresponding to each word in the first speech, obtains candidates for each partial section, and also obtains a matching score for each partial section of the first speech signal. The recognition result replacement step detects, using both the acoustic similarity obtained in the spotting step and the matching score obtained in the first matching step, the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, and replaces the candidate for the detected partial section of the first speech signal with the candidate for the detected partial section of the second speech signal.
[0033]
The speech recognition program according to the present invention causes a computer to implement: a first matching function of performing matching between the first speech signal and a word dictionary containing information on the words to be recognized, detecting from the first speech signal the partial section corresponding to each word in the first speech, and obtaining candidates for each partial section; a second matching function of performing matching between the second speech signal and the word dictionary, detecting from the second speech signal the partial section corresponding to each word in the second speech, and obtaining candidates for each partial section; a spotting function of obtaining the acoustic similarity between each partial section of the first speech signal and each partial section of the second speech signal; and a recognition result replacement function of detecting, using the acoustic similarity obtained by the spotting function, the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, and replacing the candidate for the detected partial section of the first speech signal with the candidate for the detected partial section of the second speech signal. The recognition result replacement function detects the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, determines, according to the syntax rules in a word dictionary that contains information on the words to be recognized organized according to syntax rules defining how words may be connected, whether each candidate for the detected partial section of the second speech signal can be connected to the candidates for the partial sections before and after the detected partial section of the first speech signal, and replaces the candidate for the detected partial section of the first speech signal with a candidate for the detected partial section of the second speech signal that can be connected to the candidates for those preceding and following partial sections.
[0034]
In the speech recognition program according to the present invention, when the second speech consists only of the misrecognized word in the first speech, the recognition result replacement function detects the partial section of the first speech signal that has a high acoustic similarity to the partial section of the second speech signal as the partial section of the first speech signal corresponding to the misrecognized word in the first speech, and replaces the candidate for the detected partial section of the first speech signal with the candidate for the partial section of the second speech signal.
[0035]
In the speech recognition program according to the present invention, when the second speech consists of the misrecognized word in the first speech and one or more words that follow it, the recognition result replacement function detects the partial section of the first speech signal and the partial section of the second speech signal that have a high acoustic similarity to each other as the partial sections corresponding to the misrecognized word in the first speech, replaces the candidate for the detected partial section of the first speech signal with the candidate for the detected partial section of the second speech signal, and appends to it the candidates for the partial sections of the second speech signal that were not detected.
[0036]
In the speech recognition program according to the present invention, the recognition result replacement function detects the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, determines whether each candidate for the detected partial section of the second speech signal is the same as the candidate for the detected partial section of the first speech signal, and replaces the candidate for the detected partial section of the first speech signal with a candidate for the detected partial section of the second speech signal that differs from it.
[0038]
In the speech recognition program according to the present invention, the first matching function performs matching between the first speech signal and the word dictionary, detects from the first speech signal the partial section corresponding to each word in the first speech, obtains candidates for each partial section, and also obtains a matching score for each partial section of the first speech signal. The recognition result replacement function detects, using both the acoustic similarity obtained by the spotting function and the matching score obtained by the first matching function, the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, and replaces the candidate for the detected partial section of the first speech signal with the candidate for the detected partial section of the second speech signal.
[0039]
DETAILED DESCRIPTION OF THE INVENTION
An embodiment of the present invention will be described below.
Embodiment 1.
FIG. 1 is a block diagram showing the configuration of a speech recognition apparatus according to Embodiment 1 of the present invention. In the figure, 1 is the speech recognition apparatus; 2 is speech input means for outputting input speech as a speech signal; 3 is word dictionary storage means for storing a word dictionary containing information on the words to be recognized; 4 is first model matching means for performing model matching between the speech signal (first speech signal) of speech consisting of a plurality of recognition target words (first speech) input to the speech input means 2 and the word dictionary stored in the word dictionary storage means 3, detecting from the first speech signal the partial section corresponding to each word in the first speech, and obtaining one or more ranked candidates for each partial section; 5 is speech signal storage means for storing the first speech signal that was matched by the first model matching means 4; 6 is second model matching means for performing model matching between the speech signal (second speech signal) of the speech of the misrecognized word in the first speech (second speech) input to the speech input means 2 and the word dictionary stored in the word dictionary storage means 3, detecting one partial section from the second speech signal, and obtaining one or more ranked candidates for it; 7 is spotting means for performing spotting between the first speech signal stored in the speech signal storage means 5 and the second speech signal, and obtaining the acoustic similarity between each partial section of the first speech signal and the partial section of the second speech signal; and 8 is recognition result replacement means, which displays the recognition result of the first speech on recognition result display means 9; if that result is not correct, detects the partial section of the first speech signal that has high acoustic similarity to the partial section of the second speech signal, replaces the candidate for that partial section with a candidate for the partial section of the second speech signal, and displays the new recognition result of the first speech on the recognition result display means 9; if the new result is still not correct, replaces the candidate for that partial section with another candidate for the partial section of the second speech signal; and, once the correct recognition result is obtained, confirms the recognition result of the first speech and outputs it.
[0040]
The voice input means 2 changes the output destination of the voice signal from the first model matching means 4 to the second model matching means 6 and the spotting means 7 when a correction key is input.
[0041]
Next, the operation will be described.
FIGS. 2 to 4 are flowcharts for explaining the operation of the speech recognition apparatus according to Embodiment 1 of the present invention.
[0042]
When the speaker utters speech consisting of a plurality of recognition target words (first speech) and the first speech is input to the speech input means 2 (step ST1), the speech input means 2 outputs the speech signal of the first speech (first speech signal). The first speech signal output from the speech input means 2 is input to the first model matching means 4. The first model matching means 4 performs model matching by continuous DP matching between the first speech signal and the word dictionary stored in the word dictionary storage means 3 (step ST2), detects from the first speech signal the partial section corresponding to each word in the first speech, obtains one or more ranked candidates for each partial section, and outputs them to the recognition result replacement means 8. The recognition result replacement means 8 stores the matching result for the first speech signal, including the first candidate for each partial section (step ST3), and displays the recognition result of the first speech on the recognition result display means 9 (step ST4). The speech signal storage means 5 stores the first speech signal that is output from the first model matching means 4 and was matched by it (step ST5).
[0043]
The speaker looks at the recognition result of the first speech displayed on the recognition result display means 9 and judges whether it is correct (step ST6). If it is correct, the speaker presses the confirmation key to confirm the recognition result of the first speech. The recognition result replacement means 8 outputs the confirmed recognition result of the first speech (step ST7).
[0044]
If the recognition result of the first speech displayed on the recognition result display means 9 is not correct, the speaker presses the correction key and utters the misrecognized word in the first speech (second speech). When the second speech is input to the speech input means 2 (step ST8), the speech input means 2 outputs the speech signal of the second speech (second speech signal). Because the speech input means 2 switches the output destination of the speech signal from the first model matching means 4 to the second model matching means 6 and the spotting means 7 when the correction key is pressed, the second speech signal output from the speech input means 2 is input to the second model matching means 6 and the spotting means 7.
[0045]
The second model matching means 6 performs model matching by continuous DP matching between the second speech signal and the word dictionary stored in the word dictionary storage means 3 (step ST9), detects one partial section from the second speech signal, obtains one or more ranked candidates for it, and outputs them to the recognition result replacement means 8. The recognition result replacement means 8 stores the matching result for the second speech signal (step ST10).
[0046]
The spotting means 7 performs spotting by continuous DP matching between the first speech signal stored in the speech signal storage means 5 and the second speech signal (step ST11), obtains the acoustic similarity between each partial section of the first speech signal and the partial section of the second speech signal, and outputs it to the recognition result replacement means 8.
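As a rough illustration of how an acoustic similarity between partial sections might be computed, the following Python sketch scores each partial section of the first speech signal against the re-uttered section using a plain dynamic time warping (DTW) distance. This is a simplification and not the method of the embodiment: the embodiment runs continuous DP matching over the stored first speech signal, whereas the sketch assumes the partial sections are already cut out. The function names, the Euclidean frame distance, the length normalisation, and the toy feature values are all assumptions made for illustration.

```python
from math import inf


def dtw_distance(a, b):
    """Length-normalised DTW distance between two sequences of feature vectors."""
    n, m = len(a), len(b)
    # cost[i][j]: best accumulated distance when aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Euclidean distance between the two frames (an assumed local cost)
            d = sum((x - y) ** 2 for x, y in zip(a[i - 1], b[j - 1])) ** 0.5
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)


def acoustic_similarities(first_sections, second_section):
    """Similarity (negated distance) of each first-speech section to the re-utterance."""
    return [-dtw_distance(section, second_section) for section in first_sections]


def most_similar_section(first_sections, second_section):
    """Index of the first-speech section that best matches the re-utterance."""
    sims = acoustic_similarities(first_sections, second_section)
    return max(range(len(sims)), key=sims.__getitem__)


# Toy one-dimensional features standing in for S11, S12, S13 and S21 (hypothetical values).
s11 = [[0.0], [0.1], [0.0]]
s12 = [[5.0], [5.2], [5.1]]
s13 = [[1.0], [2.0], [3.0], [2.0]]
s21 = [[1.0], [2.1], [2.9], [3.0], [2.0]]
detected = most_similar_section([s11, s12, s13], s21)  # index of the section resembling S21
```

In a real system the frames would be spectral feature vectors, and the section with the highest similarity is the one handed to the recognition result replacement step.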
[0047]
The recognition result replacement means 8 detects the partial section of the first speech signal that has high acoustic similarity to the partial section of the second speech signal (step ST12), sets M = 1 (step ST13), replaces the candidate for the detected partial section with the first candidate for the partial section of the second speech signal (step ST14), and displays the new recognition result of the first speech on the recognition result display means 9 (step ST15).
[0048]
The speaker looks at the new recognition result of the first speech displayed on the recognition result display means 9 and judges whether it is correct (step ST16). If it is correct, the speaker presses the confirmation key to confirm the recognition result of the first speech. The recognition result replacement means 8 outputs the confirmed recognition result of the first speech (step ST17).
[0049]
If the new recognition result of the first speech displayed on the recognition result display means 9 is not correct, the speaker presses the next-candidate key. The recognition result replacement means 8 determines whether a lower-ranked candidate exists for the partial section of the second speech signal (step ST18). If one exists, it sets M = 2 (step ST19), replaces the candidate for the detected partial section of the first speech signal with the second candidate for the partial section of the second speech signal (step ST14), and displays the new recognition result of the first speech on the recognition result display means 9 (step ST15).
[0050]
Thereafter, until the correct recognition result of the first speech is displayed on the recognition result display means 9, the candidate for the detected partial section of the first speech signal is replaced with successively lower-ranked candidates for the partial section of the second speech signal. If no lower-ranked candidate remains, the speaker presses the correction key to discard the second speech signal and utters the second speech again.
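The candidate replacement loop of steps ST13 to ST19 can be sketched in Python as follows. This is only an illustrative sketch: the function name and the `is_correct` callback, which stands in for the speaker pressing the confirmation key or the next-candidate key, are assumptions and do not appear in the embodiment.

```python
def correct_with_candidates(first_result, target_index, second_candidates, is_correct):
    """Try the ranked candidates of the re-uttered word one by one.

    first_result      -- first candidates for the partial sections of the first speech
    target_index      -- index of the detected (misrecognized) partial section
    second_candidates -- ranked candidates for the partial section of the second speech
    is_correct        -- hypothetical callback modelling the speaker's judgement

    Returns (M, corrected_result), or None when the candidates are exhausted
    and the second speech has to be uttered again.
    """
    for m, candidate in enumerate(second_candidates, start=1):  # M = 1, 2, ...
        new_result = list(first_result)          # leave the stored result untouched
        new_result[target_index] = candidate     # replace with the M-th candidate
        if is_correct(new_result):               # confirmation key pressed
            return m, new_result
        # otherwise the next-candidate key is pressed and the loop continues
    return None


first = ["Kanagawa-ken", "Yokohama-shi", "Nishi-ku"]
intended = ["Kanagawa-ken", "Yokohama-shi", "Naka-ku"]
outcome = correct_with_candidates(
    first, 2, ["Naka-ku", "Tama-ku", "Nishi-ku"], lambda r: r == intended
)
```

Copying the result list on each pass keeps the original recognition result available in case the speaker cancels and re-utters the second speech.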
[0051]
Hereinafter, the operation described above will be described using a specific example.
Here, the recognition targets are the addresses shown in FIG. 22. The case described is one in which the speaker utters “Kanagawa-ken Yokohama-shi Naka-ku Ishikawa-cho” as far as “Kanagawa-ken Yokohama-shi Naka-ku”, “Naka-ku” is misrecognized as “Nishi-ku”, and the speaker therefore newly utters “Naka-ku”.
[0052]
When the speaker utters up to “Kanagawa-ken Yokohama-shi Naka-ku”, the first model matching means 4, as shown in FIG. 5, detects from the first speech signal S1 output from the speech input means 2 three partial sections S11 to S13 corresponding to the three words in the first speech, obtains “Kanagawa-ken” as the first candidate for partial section S11, “Yokohama-shi” as the first candidate for partial section S12, and “Nishi-ku” as the first candidate for partial section S13, and stores them in the recognition result replacement means 8. “Kanagawa-ken Yokohama-shi Nishi-ku” is displayed on the recognition result display means 9.
[0053]
In this case, “Naka-ku” has been misrecognized as “Nishi-ku”. When the speaker presses the correction key and newly utters “Naka-ku”, the second model matching means 6, as shown in the drawings, detects one partial section S21 from the second speech signal S2 output from the speech input means 2, obtains “Naka-ku” as the first candidate, “Tama-ku” as the second candidate, and “Nishi-ku” as the third candidate, and stores them in the recognition result replacement means 8. The spotting means 7 performs spotting by continuous DP matching between the first speech signal S1 and the second speech signal S2 and obtains the acoustic similarity between each of the partial sections S11 to S13 of the first speech signal S1 and the partial section S21 of the second speech signal S2. As shown in FIG. 6, the recognition result replacement means 8 detects the partial section S13 of the first speech signal S1, which has high acoustic similarity to the partial section S21 of the second speech signal S2. Then, as shown in FIG. 7, the recognition result replacement means 8 replaces “Nishi-ku”, the candidate for the detected partial section S13 of the first speech signal S1, with “Naka-ku”, the first candidate for the partial section S21 of the second speech signal S2, and displays the new recognition result of the first speech, “Kanagawa-ken Yokohama-shi Naka-ku”, on the recognition result display means 9.
[0054]
Since the new recognition result of the first speech displayed on the recognition result display means 9 is correct, the speaker presses the confirmation key, whereupon the recognition result of the first speech is confirmed and output from the recognition result replacement means 8.
[0055]
As described above, according to Embodiment 1, when the first speech is misrecognized, the misrecognized word in the first speech is uttered as the second speech, and the misrecognized word is corrected using the candidates for the partial section of the second speech signal. In general, when the first speech has been misrecognized, the speaker tends to utter the second speech more carefully, so the recognition rate for the second speech is higher than that for the first speech. The misrecognized word can therefore be corrected efficiently, yielding an easy-to-use speech recognition apparatus.
[0056]
In this embodiment, the case where continuous DP matching is used as the collation method of the audio signal has been described, but the same effect can be obtained even when another collation method is used.
[0057]
In this embodiment, the case where the first speech signal and the second speech signal are matched by separate model matching means has been described, but the same effect can be obtained when a single model matching means is used for both.
[0058]
In this embodiment, the case where the output destination of the audio signal output from the audio input unit 2 is changed to the second model matching unit 6 and the spotting unit 7 by pressing the correction key has been described. The same effect can be obtained even when it is automatically changed after the second voice is uttered.
[0059]
Further, in this embodiment, the case where a misrecognized word is corrected by pressing the correction key, the confirmation key, and the next-candidate key has been described, but the same effect can be obtained when the correction is carried out by speech alone, with the speaker answering spoken confirmation prompts from the speech recognition apparatus by saying “hai” or “yes”.
[0060]
Embodiment 2.
When a speaker who is uttering a long sentence continuously pauses partway through and a misrecognition has occurred in the part uttered so far, the speaker tends to re-utter the misrecognized word together with one or more words that follow it. Embodiment 2 describes a case in which the misrecognized word can be corrected properly even then.
[0061]
FIG. 8 is a block diagram showing the configuration of a speech recognition apparatus according to Embodiment 2 of the present invention. In the figure, 21 is the speech recognition apparatus; 22 is second model matching means for performing model matching between the speech signal (second speech signal) of speech consisting of the misrecognized word in the first speech and one or more words following it (second speech), input to the speech input means 2, and the word dictionary stored in the word dictionary storage means 3, detecting from the second speech signal the partial section corresponding to each word in the second speech, and obtaining one or more ranked candidates for each partial section; 23 is partial section storage means for storing the second speech signal matched by the second model matching means 22; 24 is spotting means for performing spotting between the first speech signal stored in the speech signal storage means 5 and the second speech signal stored in the partial section storage means 23, and obtaining the acoustic similarity between each partial section of the first speech signal and each partial section of the second speech signal; and 25 is recognition result replacement means, which displays the recognition result of the first speech on the recognition result display means 9; if that result is not correct, detects the partial section of the first speech signal and the partial section of the second speech signal that have high acoustic similarity, replaces the candidate for the detected partial section of the first speech signal with a candidate for the detected partial section of the second speech signal, appends to it the candidates for the partial sections of the second speech signal that were not detected, and displays the new recognition result of the first speech and the recognition result of the second speech on the recognition result display means 9; if the new result is still not correct, replaces the candidate for the detected partial section of the first speech signal with another candidate for the detected partial section of the second speech signal; and, once the correct recognition result is obtained, confirms the recognition results of the first and second speech and outputs them.
[0062]
The other components are the same as or equivalent to those shown with the same reference numerals in FIG. 1.
[0063]
The voice input means 2 changes the output destination of the voice signal from the first model matching means 4 to the second model matching means 22 when a correction key is input.
[0064]
Next, the operation will be described.
FIGS. 9 and 10 are flowcharts for explaining the operation of the speech recognition apparatus according to Embodiment 2 of the present invention.
[0065]
The process up to step ST7 is performed in the same manner as in the first embodiment.
If the recognition result of the first speech displayed on the recognition result display means 9 is not correct, the speaker presses the correction key and utters the misrecognized word together with one or more words following it (second speech). When the second speech is input to the speech input means 2 (step ST21), the speech input means 2 outputs the speech signal of the second speech (second speech signal). Because the speech input means 2 switches the output destination of the speech signal from the first model matching means 4 to the second model matching means 22 when the correction key is pressed, the second speech signal output from the speech input means 2 is input to the second model matching means 22.
[0066]
The second model matching means 22 performs model matching by continuous DP matching between the second speech signal and the word dictionary stored in the word dictionary storage means 3 (step ST22), detects from the second speech signal the partial section corresponding to each word in the second speech, obtains one or more ranked candidates for each partial section, and outputs them to the recognition result replacement means 25. The recognition result replacement means 25 stores the matching result for the second speech signal obtained by the second model matching means 22 (step ST23). The partial section storage means 23 stores the second speech signal that is output from the second model matching means 22 and was matched by it (step ST24).
[0067]
The spotting means 24 performs spotting by continuous DP matching between the first speech signal stored in the speech signal storage means 5 and the second speech signal stored in the partial section storage means 23 (step ST25), obtains the acoustic similarity between each partial section of the first speech signal and each partial section of the second speech signal, and outputs it to the recognition result replacement means 25.
[0068]
The recognition result replacement means 25 detects the partial section of the first speech signal and the partial section of the second speech signal that have high acoustic similarity (step ST26), sets M = 1 (step ST27), replaces the candidate for the detected partial section of the first speech signal with the first candidate for the detected partial section of the second speech signal (step ST28), appends to it the candidates for the partial sections of the second speech signal that were not detected (step ST29), and displays the new recognition result of the first speech and the recognition result of the second speech on the recognition result display means 9 (step ST30).
[0069]
The speaker looks at the new recognition result of the first speech displayed on the recognition result display means 9 and judges whether it is correct (step ST31). If it is correct, the speaker presses the confirmation key to confirm the recognition results of the first and second speech. The recognition result replacement means 25 outputs the confirmed recognition results of the first and second speech (step ST32).
[0070]
If the new recognition result of the first speech displayed on the recognition result display means 9 is not correct, the speaker presses the next-candidate key. The recognition result replacement means 25 determines whether a lower-ranked candidate exists for the detected partial section of the second speech signal (step ST33). If one exists, it sets M = 2 (step ST34), replaces the candidate for the detected partial section of the first speech signal with the second candidate for the detected partial section of the second speech signal (step ST28), and displays the new recognition result of the first speech and the recognition result of the second speech on the recognition result display means 9 (step ST30).
[0071]
Thereafter, until the correct recognition result of the first speech is displayed on the recognition result display means 9, the candidate for the detected partial section of the first speech signal is replaced with successively lower-ranked candidates for the detected partial section of the second speech signal. If no lower-ranked candidate remains, the speaker presses the correction key to discard the second speech signal and utters the second speech again.
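The replace-and-append behaviour of steps ST28 and ST29 can be sketched in Python as follows. The sketch is illustrative only and makes two assumptions that the embodiment does not state in this form: the detected partial section of the first speech is the last recognized word, and the undetected partial sections of the second speech are the ones that follow the detected one. The function and parameter names are hypothetical.

```python
def replace_and_append(first_result, first_index, second_candidates, second_index, m=1):
    """Replace the misrecognized word and append the words uttered after it.

    first_result      -- first candidates for the partial sections of the first speech
    first_index       -- index of the detected partial section of the first speech
    second_candidates -- one ranked candidate list per partial section of the second speech
    second_index      -- index of the detected partial section of the second speech
    m                 -- rank (1-based) of the candidate to use for the replacement
    """
    new_result = list(first_result)
    # Step ST28: replace with the M-th candidate of the detected second-speech section.
    new_result[first_index] = second_candidates[second_index][m - 1]
    # Step ST29: append the first candidate of each undetected (following) section.
    for section in second_candidates[second_index + 1:]:
        new_result.append(section[0])
    return new_result


first = ["Kanagawa-ken", "Yokohama-shi", "Nishi-ku"]
second = [["Naka-ku", "Tama-ku", "Nishi-ku"], ["Ishikawa-cho"]]
combined = replace_and_append(first, 2, second, 0)
```

Pressing the next-candidate key corresponds to calling the function again with `m` increased by one, as long as a candidate of that rank exists.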
[0072]
Hereinafter, the operation described above will be described using a specific example.
Here, the recognition targets are the addresses shown in FIG. 22. The case described is one in which the speaker utters “Kanagawa-ken Yokohama-shi Naka-ku Ishikawa-cho” as far as “Kanagawa-ken Yokohama-shi Naka-ku”, “Naka-ku” is misrecognized as “Nishi-ku”, and the speaker therefore newly utters “Naka-ku Ishikawa-cho”.
[0073]
When the speaker utters up to “Kanagawa-ken Yokohama-shi Naka-ku”, the first model matching means 4, as shown in FIG. 5, detects from the first speech signal S1 output from the speech input means 2 three partial sections S11 to S13 corresponding to the three words in the first speech, obtains “Kanagawa-ken” as the first candidate for partial section S11, “Yokohama-shi” as the first candidate for partial section S12, and “Nishi-ku” as the first candidate for partial section S13, and stores them in the recognition result replacement means 25. “Kanagawa-ken Yokohama-shi Nishi-ku” is displayed on the recognition result display means 9.
[0074]
In this case, “Naka-ku” has been misrecognized as “Nishi-ku”. When the speaker presses the correction key and newly utters “Naka-ku Ishikawa-cho”, the second model matching means 22, as shown in the drawings, detects from the second speech signal S2 output from the speech input means 2 two partial sections S21 and S22 corresponding to the two words in the second speech, obtains “Naka-ku” as the first candidate, “Tama-ku” as the second candidate, and “Nishi-ku” as the third candidate for partial section S21, and “Ishikawa-cho” as the first candidate for partial section S22, and stores them in the recognition result replacement means 25. The spotting means 24 performs spotting by continuous DP matching between the first speech signal S1 and the second speech signal S2 and obtains the acoustic similarity between each of the partial sections S11 to S13 of the first speech signal S1 and each of the partial sections S21 and S22 of the second speech signal S2. As shown in FIG. 11, the recognition result replacement means 25 detects the partial section S13 of the first speech signal S1 and the partial section S21 of the second speech signal S2, which have high acoustic similarity. Then, as shown in FIG. 11, the recognition result replacement means 25 replaces “Nishi-ku”, the candidate for the detected partial section S13 of the first speech signal S1, with “Naka-ku”, the first candidate for the detected partial section S21 of the second speech signal S2, appends to it “Ishikawa-cho”, the candidate for the undetected partial section S22 of the second speech signal S2, and displays the new recognition result of the first speech and the recognition result of the second speech, “Kanagawa-ken Yokohama-shi Naka-ku Ishikawa-cho”, on the recognition result display means 9.
[0075]
Since the new recognition result of the first speech displayed on the recognition result display means 9 is correct, the speaker presses the confirmation key, whereupon the recognition results of the first and second speech are confirmed and output from the recognition result replacement means 25.
[0076]
As described above, according to Embodiment 2, when the first speech is misrecognized, the misrecognized word in the first speech and one or more words following it are uttered as the second speech, and the misrecognized word is corrected from that second speech. The misrecognized word can therefore be corrected efficiently, yielding an easy-to-use speech recognition apparatus.
[0077]
If the acoustic similarity between the partial section S13 of the first speech signal S1 and the section formed by the partial sections S21 and S22 of the second speech signal S2 together is high, the partial sections S21 and S22 are treated as a single partial section corresponding to one word, and processing proceeds as in Embodiment 1. That is, the candidate for the partial section S13 of the first speech signal S1 is replaced with the candidates for the partial sections S21 and S22 of the second speech signal S2. Specifically, “Nishi-ku”, the candidate for the partial section S13 of the first speech signal S1, is replaced with “Naka-ku Ishikawa-cho”, the candidates for the partial sections S21 and S22 of the second speech signal S2.
[0078]
Embodiment 3.
In speech recognition, there are words that are easily misrecognized due to voice quality, utterance mode, etc., and therefore, the same misrecognition as that generated in the first speech recognition result may occur in the second speech recognition result. In the third embodiment, a case will be described in which a misrecognized word can be efficiently corrected even in such a case.
[0079]
The configuration of the speech recognition apparatus according to Embodiment 3 is the same as that of the speech recognition apparatus according to Embodiment 2 shown in FIG. 8. In Embodiment 3, however, the recognition result replacement means 25 displays the recognition result of the first speech on the recognition result display means 9; if that result is not correct, detects the partial section of the first speech signal and the partial section of the second speech signal that have high acoustic similarity, replaces the candidate for the detected partial section of the first speech signal with a candidate for the detected partial section of the second speech signal that differs from it, appends to it the candidates for the partial sections of the second speech signal that were not detected, and displays the new recognition result of the first speech and the recognition result of the second speech on the recognition result display means 9; if the new result is still not correct, replaces the candidate for the detected partial section of the first speech signal with another candidate for the detected partial section of the second speech signal that differs from it; and, once the correct recognition result is obtained, confirms the recognition results of the first and second speech and outputs them.
[0080]
Next, the operation will be described.
FIG. 12 is a flowchart for explaining the operation of the speech recognition apparatus according to the third embodiment of the present invention.
[0081]
The process up to step ST25 is performed in the same manner as in the second embodiment.
The recognition result replacement means 25 detects the partial section of the first speech signal and the partial section of the second speech signal that have high acoustic similarity (step ST41), sets M = 1 (step ST42), and determines whether the candidate for the detected partial section of the first speech signal is the same as the first candidate for the detected partial section of the second speech signal (step ST43). If they differ, it replaces the candidate for the detected partial section of the first speech signal with the first candidate for the detected partial section of the second speech signal (step ST44), appends to it the candidates for the partial sections of the second speech signal that were not detected, and displays the new recognition result of the first speech and the recognition result of the second speech on the recognition result display means 9 (step ST46).
[0082]
If the candidate for the detected partial section of the first speech signal is the same as the M-th candidate for the detected partial section of the second speech signal, the recognition result replacement means 25 determines whether a lower-ranked candidate exists for the detected partial section of the second speech signal (step ST47). If one exists, it sets M = M + 1 (step ST48) and returns to step ST43. If none exists, the speaker presses the correction key to discard the second speech signal and utters the second speech again.
[0083]
The speaker looks at the new recognition result of the first speech displayed on the recognition result display means 9 and judges whether it is correct (step ST49). If it is correct, the speaker presses the confirmation key to confirm the recognition results of the first and second speech. The recognition result replacement means 25 outputs the confirmed recognition results of the first and second speech (step ST50).
[0084]
If the new recognition result of the first speech displayed on the recognition result display means 9 is not correct, the speaker presses the next-candidate key. The recognition result replacement means 25 determines whether a lower-ranked candidate exists for the detected partial section of the second speech signal (step ST51). If one exists, it sets M = 2 (step ST52) and then determines whether the candidate for the detected partial section of the first speech signal is the same as the second candidate for the detected partial section of the second speech signal (step ST43).
[0085]
Thereafter, until the correct recognition result of the first speech is displayed on the recognition result display means 9, the candidate for the detected partial section of the first speech signal is replaced with successively lower-ranked candidates for the detected partial section of the second speech signal. If no lower-ranked candidate remains, the speaker presses the correction key to discard the second speech signal and utters the second speech again.
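The candidate selection of steps ST43, ST47, and ST48, which skips any candidate identical to the misrecognized one, can be sketched in Python as follows. The function name and the `start_rank` parameter are assumptions introduced for illustration; the embodiment expresses the same idea through the counter M.

```python
def next_different_candidate(current, second_candidates, start_rank=1):
    """Find the highest-ranked candidate that differs from the misrecognized word.

    current           -- candidate currently assigned to the detected first-speech section
    second_candidates -- ranked candidates for the detected second-speech section
    start_rank        -- rank M (1-based) at which to start the search

    Returns (M, candidate), or None when no lower-ranked candidate remains
    and the second speech has to be uttered again.
    """
    for rank in range(start_rank, len(second_candidates) + 1):
        candidate = second_candidates[rank - 1]
        if candidate != current:      # step ST43: same as the misrecognized word?
            return rank, candidate    # step ST44 would use this candidate
        # steps ST47 and ST48: a lower candidate exists, so M = M + 1 and retry
    return None


candidates_s21 = ["Nishi-ku", "Naka-ku", "Tama-ku"]
choice = next_different_candidate("Nishi-ku", candidates_s21)
```

When the speaker presses the next-candidate key, the search can be resumed from the rank after the one just shown, for example `start_rank=3` in the usage above.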
[0086]
Hereinafter, the operation described above will be described using a specific example.
Here, the recognition targets are the addresses shown in FIG. 22. The case described is one in which the speaker utters “Kanagawa-ken Yokohama-shi Naka-ku Ishikawa-cho” as far as “Kanagawa-ken Yokohama-shi Naka-ku”, “Naka-ku” is misrecognized as “Nishi-ku”, and the speaker therefore newly utters “Naka-ku Ishikawa-cho”.
[0087]
When the speaker utters up to “Kanagawa-ken Yokohama-shi Naka-ku”, the first model matching means 4, as shown in FIG. 5, detects from the first speech signal S1 output from the speech input means 2 three partial sections S11 to S13 corresponding to the three words in the first speech, obtains “Kanagawa-ken” as the first candidate for partial section S11, “Yokohama-shi” as the first candidate for partial section S12, and “Nishi-ku” as the first candidate for partial section S13, and stores them in the recognition result replacement means 25. “Kanagawa-ken Yokohama-shi Nishi-ku” is displayed on the recognition result display means 9.
[0088]
In this case, “Naka-ku” has been misrecognized as “Nishi-ku”. When the speaker presses the correction key and newly utters “Naka-ku Ishikawa-cho”, the second model matching means 22, as shown in the drawings, detects from the second speech signal S2 output from the speech input means 2 two partial sections S21 and S22 corresponding to the two words in the second speech, obtains “Nishi-ku” as the first candidate, “Naka-ku” as the second candidate, and “Tama-ku” as the third candidate for partial section S21, and “Ishikawa-cho” as the first candidate for partial section S22, and stores them in the recognition result replacement means 25. The spotting means 24 performs spotting by continuous DP matching between the first speech signal S1 and the second speech signal S2 and obtains the acoustic similarity between each of the partial sections S11 to S13 of the first speech signal S1 and each of the partial sections S21 and S22 of the second speech signal S2. As shown in FIG. 13, the recognition result replacement means 25 detects the partial section S13 of the first speech signal S1 and the partial section S21 of the second speech signal S2, which have high acoustic similarity. Then, as shown in FIG. 13, because the candidate for the detected partial section S13 of the first speech signal S1 is the same as the first candidate for the detected partial section S21 of the second speech signal S2, the recognition result replacement means 25 replaces “Nishi-ku”, the candidate for the detected partial section S13, with “Naka-ku”, the second candidate for the detected partial section S21, appends to it “Ishikawa-cho”, the candidate for the undetected partial section S22 of the second speech signal S2, and displays the new recognition result of the first speech and the recognition result of the second speech, “Kanagawa-ken Yokohama-shi Naka-ku Ishikawa-cho”, on the recognition result display means 9.
[0089]
Since the new first speech recognition result displayed on the recognition result display means 9 is correct, the speaker presses the confirmation key. The first speech recognition result and the second speech recognition result are thereby confirmed, and the confirmed results are output from the recognition result replacing means 25.
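The replacement performed in this example can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of the apparatus: the function name `correct_with_reutterance`, its parameters, and the choice to append the undetected candidates at the end of the result are assumptions made for the example.

```python
def correct_with_reutterance(first_result, detected_first, second_candidates,
                             detected_second=0):
    """Hypothetical helper illustrating the Embodiment 3 replacement.

    first_result      -- words currently shown for the first speech
    detected_first    -- index of the detected section of the first signal
    second_candidates -- ranked candidate lists, one per section of the
                         second signal
    detected_second   -- index of the detected section of the second signal
    """
    current = first_result[detected_first]
    # Take the highest-ranked candidate that differs from the word that
    # was misrecognized; an identical candidate would change nothing.
    replacement = next(
        (c for c in second_candidates[detected_second] if c != current), None
    )
    if replacement is None:
        return None  # no usable candidate; the speaker would re-utter
    corrected = list(first_result)
    corrected[detected_first] = replacement
    # Assumption: first candidates of the sections that were not detected
    # are appended after the corrected result.
    for j, candidates in enumerate(second_candidates):
        if j != detected_second:
            corrected.append(candidates[0])
    return corrected


result = correct_with_reutterance(
    ["Kanagawa", "Yokohama City", "Nishi Ward"],
    detected_first=2,
    second_candidates=[["Nishi Ward", "Naka Ward", "Tama Ward"], ["Ishikawacho"]],
)
print(result)
```

With the candidates from the example, the first candidate “Nishi Ward” is skipped because it equals the misrecognized word, so the second candidate “Naka Ward” is used and “Ishikawacho” is appended.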
[0090]
As described above, according to the third embodiment, when the first speech is misrecognized, the misrecognized word in the first speech is uttered again as the second speech, and the candidate for the partial section of the first speech signal corresponding to the misrecognized word is replaced with a candidate for the partial section of the second speech signal corresponding to that word that differs from the original candidate. The misrecognized word can therefore be corrected efficiently, and an easy-to-use speech recognition apparatus is obtained.
[0091]
In this embodiment, the case where the configuration of the speech recognition apparatus is the same as that of the speech recognition apparatus according to the second embodiment has been described, but the same effect is obtained in other cases as well.
[0092]
Embodiment 4.
The configuration of the speech recognition apparatus according to the fourth embodiment is the same as that of the speech recognition apparatus according to the first embodiment shown in FIG. However, in the speech recognition apparatus according to the fourth embodiment, the word dictionary storage unit 3 stores a word dictionary that includes information about words to be recognized in accordance with a syntax rule that defines a connection relationship.
[0093]
In addition, the recognition result replacing means 8 displays the first speech recognition result on the recognition result display means 9. If the first speech recognition result is not correct, it detects the partial section of the first speech signal that has a high acoustic similarity with the partial section of the second speech signal, and replaces the candidate for that partial section with a candidate for the partial section of the second speech signal that can be connected, according to the syntax rules in the word dictionary, to the candidates for the preceding and following partial sections. It then displays the new first speech recognition result on the recognition result display means 9. If the new first speech recognition result is not correct, it replaces the candidate for that partial section with another candidate for the partial section of the second speech signal that can likewise be connected to the candidates for the preceding and following partial sections. When the correct first speech recognition result is obtained, it confirms the first speech recognition result and outputs the confirmed result.
[0094]
Next, the operation will be described.
FIG. 14 is a flowchart for explaining the operation of the speech recognition apparatus according to the fourth embodiment of the present invention.
[0095]
The process up to step ST11 is performed in the same manner as in the first embodiment.
The recognition result replacing means 8 detects the partial section of the first speech signal having a high acoustic similarity with the partial section of the second speech signal (step ST61), and sets M = 1 (step ST62). It then determines whether the Mth candidate for the partial section of the second speech signal can be connected, according to the syntax rules in the word dictionary, to the candidates for the partial sections before and after the detected partial section of the first speech signal (step ST63). If the connection is possible, it replaces the candidate for the detected partial section of the first speech signal with the Mth candidate for the partial section of the second speech signal (step ST64), and displays the new first speech recognition result on the recognition result display means 9 (step ST65).
[0096]
If the Mth candidate for the partial section of the second speech signal cannot be connected, according to the syntax rules in the word dictionary, to the candidates for the partial sections before and after the detected partial section of the first speech signal, the recognition result replacing means 8 determines whether there is a lower-ranked candidate for the partial section of the second speech signal (step ST66). If there is, it sets M = M + 1 (step ST67) and returns to step ST63. If there is not, the speaker presses the correction key to cancel the second speech signal and utters the second speech again.
[0097]
The speaker looks at the new first speech recognition result displayed on the recognition result display means 9 and judges whether it is correct (step ST68). If it is correct, the speaker presses the confirmation key to confirm the first speech recognition result. The recognition result replacing means 8 outputs the confirmed first speech recognition result (step ST69).
[0098]
When the new first speech recognition result displayed on the recognition result display means 9 is not correct, the speaker presses the next candidate key. The recognition result replacing means 8 determines whether there is a lower-ranked candidate for the partial section of the second speech signal (step ST70). If there is, it sets M = 2 (step ST71) and determines whether the second candidate for the partial section of the second speech signal can be connected, according to the syntax rules in the word dictionary, to the candidates for the partial sections before and after the detected partial section of the first speech signal (step ST63).
[0099]
Thereafter, until the correct recognition result of the first speech is displayed on the recognition result display means 9, the detected candidate for the partial section of the first speech signal is replaced with a lower candidate for the partial section of the second speech signal. If there are no lower candidates, the speaker presses the correction key to cancel the second sound signal and re-utter the second sound.
[0100]
Hereinafter, the operation described above will be described using a specific example.
Here, the recognition targets are the addresses shown in FIG. 22. The case described is one in which the speaker utters “Ishikawacho, Naka-ku, Yokohama-shi, Kanagawa”, “Naka-ku” is misrecognized as “Nishi-ku”, and the speaker therefore utters “Naka-ku” again. In addition, it is assumed that the word dictionary storage means 3 stores a word dictionary in which information about words to be recognized is included according to the syntax rules shown in FIG.
[0101]
When the speaker utters up to “Naka Ward, Yokohama, Kanagawa Prefecture”, the first model matching means 4 detects three partial sections S11 to S13, corresponding to the three words in the first speech, from the first speech signal S1 output from the speech input means 2, as shown in FIG. It obtains “Kanagawa” as the first candidate for the partial section S11, “Yokohama City” as the first candidate for the partial section S12, and “Nishi-ku” as the first candidate for the partial section S13, and stores them in the recognition result replacing means 8. “Nishi-ku, Yokohama-shi, Kanagawa” is then displayed on the recognition result display means 9.
[0102]
In this case, “Naka Ward” has been misrecognized as “Nishi Ward”. When the speaker presses the correction key and newly utters “Naka Ward”, one partial section S21 is detected from the second speech signal S2 output from the speech input means 2, as shown in FIG. “Tama Ward” is obtained as the first candidate, “Naka Ward” as the second candidate, and “Nishi-ku” as the third candidate, and these are stored in the recognition result replacing means 8. The spotting means 7 performs spotting processing by continuous DP matching between the first speech signal S1 and the second speech signal S2, and obtains the acoustic similarity between each of the partial sections S11 to S13 of the first speech signal S1 and the partial section S21 of the second speech signal S2. As shown in FIG. 16, the recognition result replacing means 8 detects the partial section S13 of the first speech signal S1, which has a high acoustic similarity with the partial section S21 of the second speech signal S2. As shown in FIG. 15, “Tama Ward”, the first candidate for the partial section S21, cannot be connected to “Yokohama City”, the candidate for the partial section S12 that precedes the detected partial section S13, whereas “Naka Ward”, the second candidate for the partial section S21, can be connected to “Yokohama City”. Therefore, as shown in FIG. 16, the recognition result replacing means 8 replaces “Nishi-ku”, the candidate for the detected partial section S13, with “Naka Ward”, the second candidate for the partial section S21, and displays the new first speech recognition result, “Naka-ku, Yokohama-shi, Kanagawa Pref.”, on the recognition result display means 9.
[0103]
Since the new first speech recognition result displayed on the recognition result display means 9 is correct, the speaker presses the confirmation key. The first speech recognition result is thereby confirmed, and the confirmed result is output from the recognition result replacing means 8.
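The candidate selection of steps ST62 to ST67 can be sketched as follows. This is a minimal sketch under stated assumptions: the syntax rules are modeled as a set of permitted (preceding word, following word) pairs, and the function name `select_connectable` and the contents of `ALLOWED_PAIRS` are hypothetical and are not taken from the word dictionary of the embodiment.

```python
# Hypothetical syntax rules: pairs of words that may appear consecutively.
ALLOWED_PAIRS = {
    ("Kanagawa", "Yokohama City"),
    ("Yokohama City", "Naka Ward"),
    ("Yokohama City", "Nishi-ku"),
}


def select_connectable(candidates, prev_word, next_word, allowed_pairs,
                       start_rank=1):
    """Return (M, candidate) for the first rank M >= start_rank whose
    candidate connects to both neighbors, or None if none remains."""
    for rank in range(start_rank, len(candidates) + 1):
        cand = candidates[rank - 1]
        # A missing neighbor (start or end of the utterance) imposes
        # no constraint.
        ok_prev = prev_word is None or (prev_word, cand) in allowed_pairs
        ok_next = next_word is None or (cand, next_word) in allowed_pairs
        if ok_prev and ok_next:
            return rank, cand
    return None


second_candidates = ["Tama Ward", "Naka Ward", "Nishi-ku"]
choice = select_connectable(second_candidates, "Yokohama City", None,
                            ALLOWED_PAIRS)
print(choice)
```

Under these assumed rules, “Tama Ward” is rejected because it cannot follow “Yokohama City”, and the second candidate “Naka Ward” is selected. Passing `start_rank` one higher than the rank last shown corresponds to the speaker pressing the next candidate key.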
[0104]
As described above, according to the fourth embodiment, when the first speech is misrecognized, the misrecognized word in the first speech is uttered again as the second speech, and the candidate for the partial section of the first speech signal corresponding to the misrecognized word is replaced, in accordance with the syntax rules in the word dictionary, with a candidate for the partial section of the second speech signal corresponding to that word. The misrecognized word can therefore be corrected efficiently, and an easy-to-use speech recognition apparatus is obtained.
[0105]
In this embodiment, the case where the configuration of the speech recognition apparatus is the same as that of the speech recognition apparatus according to the first embodiment has been described, but the same effect is obtained in other cases as well.
[0106]
Embodiment 5.
In the first embodiment, the case was described in which the partial section of the first speech signal corresponding to the misrecognized word in the first speech is detected from the acoustic similarity obtained by continuous DP matching, and the candidate for that partial section is replaced with a candidate for the partial section of the second speech signal. In the fifth embodiment, the case is described in which the partial section of the first speech signal corresponding to the misrecognized word in the first speech is detected using both the acoustic similarity and the matching score, and the candidate for that partial section is replaced with a candidate for the partial section of the second speech signal.
[0107]
The configuration of the speech recognition apparatus according to the fifth embodiment is the same as that of the speech recognition apparatus according to the first embodiment shown in FIG. However, in the speech recognition apparatus according to the fifth embodiment, the first model matching means 4 performs model matching processing between the first speech signal and the word dictionary stored in the word dictionary storage means 3, detects the partial section corresponding to each word in the first speech from the first speech signal, obtains one or more ranked candidates for each partial section, and obtains a matching score for each partial section of the first speech signal.
[0108]
In addition, the recognition result replacing means 8 displays the first speech recognition result on the recognition result display means 9. If the first speech recognition result is not correct, it obtains, for each partial section of the first speech signal, a difference score between the acoustic similarity with the partial section of the second speech signal and the matching score, detects the partial section of the first speech signal having a high difference score, replaces the candidate for that partial section with a candidate for the partial section of the second speech signal, and displays the new first speech recognition result on the recognition result display means 9. If the new first speech recognition result is not correct, it replaces the candidate for that partial section with another candidate for the partial section of the second speech signal. When the correct first speech recognition result is obtained, it confirms the first speech recognition result and outputs the confirmed result.
[0109]
Next, the operation will be described.
FIGS. 17 and 18 are flowcharts for explaining the operation of the speech recognition apparatus according to the fifth embodiment of the present invention.
[0110]
When the speaker utters the speech of a plurality of words to be recognized (the first speech) and the first speech is input to the speech input means 2 (step ST81), the speech input means 2 outputs the speech signal of the first speech (the first speech signal). The first speech signal output from the speech input means 2 is input to the first model matching means 4. The first model matching means 4 performs model matching processing by continuous DP matching between the first speech signal and the word dictionary stored in the word dictionary storage means 3 (step ST82), detects the partial section corresponding to each word in the first speech from the first speech signal, obtains one or more ranked candidates for each partial section, obtains a matching score for each partial section of the first speech signal, and outputs them to the recognition result replacing means 8.
Thereafter, steps ST3 to ST11 are performed in the same manner as in the first embodiment.
[0111]
The recognition result replacing means 8 obtains, for each partial section of the first speech signal, a difference score between the acoustic similarity with the partial section of the second speech signal and the matching score (step ST83), and detects the partial section of the first speech signal having a high difference score (step ST84). After setting M = 1 (step ST85), it replaces the candidate for that partial section with the first candidate for the partial section of the second speech signal (step ST86), and displays the new first speech recognition result on the recognition result display means 9 (step ST87).
Thereafter, steps ST16 to ST19 are performed in the same manner as in the first embodiment.
[0112]
Hereinafter, the operation described above will be described using a specific example.
Here, the recognition targets are the addresses shown in FIG. 22. The case described is one in which the speaker utters “Honmoku, Minami-ku, Yokohama-shi, Kanagawa”, “Honmoku” is misrecognized as “Nakazato”, and the speaker therefore utters “Honmoku” again. The matching score and the acoustic similarity are expressed as numerical values in the range of 0 to 1000; the larger the value, the higher the degree of matching and the degree of similarity.
[0113]
When the speaker utters “Honmoku, Minami-ku, Yokohama, Kanagawa”, the first model matching means 4, as shown in FIG. 19, detects four partial sections S11 to S14, corresponding to the four words in the first speech, from the first speech signal S1 output from the speech input means 2. It obtains “Kanagawa” as the first candidate for the partial section S11, “Yokohama City” as the first candidate for the partial section S12, “Minami Ward” as the first candidate for the partial section S13, and “Nakazato” as the first candidate for the partial section S14, and stores them in the recognition result replacing means 8. As shown in FIG. 19, the first model matching means 4 also obtains the matching scores C2[i] of the partial sections S11 to S14 as “800”, “750”, “800”, and “400”, respectively. Since the partial section S14 corresponds to the misrecognized word in the first speech, its matching score is smaller than the matching scores of the other partial sections. “Nakazato, Minami-ku, Yokohama, Kanagawa” is then displayed on the recognition result display means 9.
[0114]
In this case, “Honmoku” has been misrecognized as “Nakazato”. When the speaker presses the correction key and newly utters “Honmoku”, the second model matching means 6, as shown in FIG. 19, detects one partial section S21 from the second speech signal S2 output from the speech input means 2, obtains “Honmoku” as the first candidate, “Naka Ward” as the second candidate, and “Tama Ward” as the third candidate, and stores them in the recognition result replacing means 8. As shown in FIG. 19, the spotting means 7 performs spotting processing by continuous DP matching between the first speech signal S1 and the second speech signal S2, and obtains the acoustic similarities C1[i] between the partial sections S11 to S14 of the first speech signal S1 and the partial section S21 of the second speech signal S2 as “100”, “150”, “800”, and “780”, respectively. As shown in FIG. 19, the recognition result replacing means 8 obtains, for each of the partial sections S11 to S14 of the first speech signal S1, the difference score between the acoustic similarity with the partial section S21 of the second speech signal S2 and the matching score as “−700”, “−600”, “0”, and “380”, respectively, and detects the partial section S14 of the first speech signal S1, which has a high difference score. Then, as shown in FIG. 19, the recognition result replacing means 8 replaces “Nakazato”, the candidate for the detected partial section S14 of the first speech signal S1, with “Honmoku”, the first candidate for the partial section S21 of the second speech signal S2, and displays the new first speech recognition result, “Honmoku, Minami-ku, Yokohama-shi, Kanagawa”, on the recognition result display means 9.
[0115]
Since the new first speech recognition result displayed on the recognition result display means 9 is correct, the speaker presses the confirmation key. The first speech recognition result is thereby confirmed, and the confirmed result is output from the recognition result replacing means 8.
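The detection by difference score in steps ST83 and ST84 can be checked with the numbers of this example. The following is a small sketch; the function name `detect_misrecognized_section` is hypothetical, and the difference score is taken as acoustic similarity minus matching score, which agrees with the values given above.

```python
def detect_misrecognized_section(acoustic_similarity, matching_score):
    """Return (index, difference_scores), where index is the partial
    section of the first signal with the highest difference score."""
    diff = [c1 - c2 for c1, c2 in zip(acoustic_similarity, matching_score)]
    index = max(range(len(diff)), key=diff.__getitem__)
    return index, diff


c1 = [100, 150, 800, 780]  # acoustic similarity C1[i] for S11..S14
c2 = [800, 750, 800, 400]  # matching score C2[i] for S11..S14
index, diff = detect_misrecognized_section(c1, c2)
print(diff)       # difference scores for S11..S14
print(index + 1)  # 1-based section number within S11..S14
```

Using the acoustic similarity alone would select S13 (similarity 800), whereas the difference score selects S14 (380), because S13 was matched well in the first pass and S14 was not.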
[0116]
Here, the matching score will be described.
FIG. 20 shows the result of model matching processing between the speech signal obtained when “Honmoku, Minami-ku, Yokohama-shi, Kanagawa” is uttered and a word dictionary containing, in succession, information on the words “Kanagawa-ken”, “Yokohama-city”, “Minami-ku”, and “Nakazato”. The horizontal axis represents the speech signal in units of frames t. The vertical axis represents the word dictionary in units of states u. The speech signal has T frames in total, and the word dictionary has U states in total.
[0117]
The length of a speech signal changes from utterance to utterance, and the signal expands and contracts locally. For this reason, the model matching processing computes the correspondence between the speech signal and the word dictionary and finds the optimum correspondence. This correspondence can be computed efficiently by dynamic programming, or by the computation known as the Viterbi algorithm. The optimum path in FIG. 20 shows the optimum correspondence between the frames t of the speech signal and the states u of the word dictionary. The state u that optimally corresponds to the frame t is expressed by equation (1).
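As an illustration of how such an optimum path can be computed by dynamic programming, the following sketch finds a mapping from frames to states that minimizes the accumulated local distance. It rests on assumptions not stated in the description: the path starts in the first state, ends in the last state, and at each frame either stays in the same state or advances by one state. The function name `align` is hypothetical.

```python
def align(local_distance):
    """local_distance[t][u] is the distance between frame t and state u.
    Returns a list G with G[t] = state assigned to frame t."""
    T, U = len(local_distance), len(local_distance[0])
    INF = float("inf")
    cost = [[INF] * U for _ in range(T)]
    back = [[0] * U for _ in range(T)]
    cost[0][0] = local_distance[0][0]
    for t in range(1, T):
        for u in range(U):
            stay = cost[t - 1][u]
            advance = cost[t - 1][u - 1] if u > 0 else INF
            # Keep the cheaper predecessor (stay wins ties).
            best, prev = (advance, u - 1) if advance < stay else (stay, u)
            if best < INF:
                cost[t][u] = best + local_distance[t][u]
                back[t][u] = prev
    if cost[T - 1][U - 1] == INF:
        raise ValueError("no valid path: fewer frames than states")
    # Trace back from the last state at the last frame.
    path = [U - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Four frames, two states: frames 0-1 are close to state 0, 2-3 to state 1.
example = [[0, 5], [0, 5], [5, 0], [5, 0]]
print(align(example))
```

The returned list plays the role of the optimum correspondence between frames and states in this simplified setting.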
[0118]
u = G(t)    (1)
[0119]
Meanwhile, the acoustic similarity between the speech signal at frame t and the word dictionary at state u is represented by a local distance D(t, u). The smaller the local distance, the higher the acoustic similarity between the speech signal and the word dictionary. The matching score C2[i] of the word i is the average, over frames, of the local distances on the optimum path that belong to the word i. As shown in FIG. 20, when the frames of the speech signal corresponding to the states belonging to the word i run from ts(i) to te(i), the matching score C2[i] for the word i is calculated by equation (2).
[0120]
[Expression 1]
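The expression itself is not reproduced in this text. A form consistent with the description above (an average over frames of the local distances on the optimum path u = G(t), for the frames ts(i) to te(i)) would be the following; the inclusive frame count te(i) − ts(i) + 1 in the denominator is an assumption.

```latex
C2[i] = \frac{1}{te(i) - ts(i) + 1} \sum_{t = ts(i)}^{te(i)} D\bigl(t, G(t)\bigr) \qquad (2)
```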

[0121]
As described above, according to the fifth embodiment, when the first speech is misrecognized, the partial section of the first speech signal corresponding to the misrecognized word in the first speech is detected using the acoustic similarity and the matching score, and the candidate for that partial section is replaced with a candidate for the partial section of the second speech signal. Therefore, even when fluctuation of the speech signal or the like raises the acoustic similarity of a partial section other than the one corresponding to the misrecognized word, the misrecognized word can be corrected efficiently, and an easy-to-use speech recognition apparatus is obtained.
[0122]
In this embodiment, the case where the partial section of the first speech signal corresponding to the misrecognized word in the first speech is detected using the difference score between the acoustic similarity and the matching score has been described. However, the same effect is obtained even when the partial section corresponding to the misrecognized word is detected using a value obtained by another calculation method.
[0123]
In this embodiment, the case where the configuration of the speech recognition apparatus is the same as that of the speech recognition apparatus according to the second embodiment has been described. However, the same effect is obtained even when the configuration is the same as that of the speech recognition apparatus according to the first embodiment.
[0124]
The speech recognition apparatus and speech recognition method described in each of the above-described embodiments can also be obtained by incorporating a speech recognition program into a computer.
[0125]
【Effects of the Invention】
As described above, according to the present invention, the speech recognition apparatus comprises: word dictionary storage means for storing a word dictionary including information on the words to be recognized; first matching means for performing matching processing between the first speech signal and the word dictionary, detecting the partial section corresponding to each word in the first speech from the first speech signal, and obtaining candidates for each partial section; second matching means for performing matching processing between the second speech signal and the word dictionary, detecting the partial section corresponding to each word in the second speech from the second speech signal, and obtaining candidates for each partial section; spotting means for obtaining the acoustic similarity between each partial section of the first speech signal and each partial section of the second speech signal; and recognition result replacing means for detecting, using the acoustic similarity obtained by the spotting means, the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, and replacing the candidate for the detected partial section of the first speech signal with a candidate for the detected partial section of the second speech signal. This has the effect of providing a speech recognition apparatus that can efficiently correct a misrecognized portion.
[0126]
According to the present invention, when the second speech consists only of the speech of the word misrecognized in the first speech, the recognition result replacing means detects the partial section of the first speech signal having a high acoustic similarity with the partial section of the second speech signal as the partial section of the first speech signal corresponding to the misrecognized word in the first speech, and replaces the candidate for the detected partial section of the first speech signal with a candidate for the partial section of the second speech signal. This has the effect of providing a speech recognition apparatus that can efficiently correct a misrecognized portion.
[0127]
According to the present invention, when the second speech consists of the speech of the word misrecognized in the first speech and of one or more words following it, the recognition result replacing means detects the partial section of the first speech signal and the partial section of the second speech signal that have a high acoustic similarity as the partial sections corresponding to the misrecognized word in the first speech, replaces the candidate for the detected partial section of the first speech signal with a candidate for the detected partial section of the second speech signal, and appends to it the candidates for the undetected partial sections of the second speech signal. This has the effect of providing a speech recognition apparatus that can efficiently correct a misrecognized portion even when the misrecognized word and one or more words following it are uttered as the second speech.
[0128]
According to the present invention, the recognition result replacing means detects the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, determines whether a candidate for the detected partial section of the second speech signal is the same as the candidate for the detected partial section of the first speech signal, and replaces the candidate for the detected partial section of the first speech signal with a candidate for the detected partial section of the second speech signal that differs from it. This has the effect of providing a speech recognition apparatus that can efficiently correct a misrecognized portion.
[0129]
According to the present invention, the word dictionary storage means stores a word dictionary that includes information on the words to be recognized in accordance with syntax rules defining their connection relationships. The recognition result replacing means detects the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, determines whether a candidate for the detected partial section of the second speech signal can be connected, according to the syntax rules in the word dictionary, to the candidates for the partial sections before and after the detected partial section of the first speech signal, and replaces the candidate for the detected partial section of the first speech signal with a candidate for the detected partial section of the second speech signal that can be connected to the candidates for the preceding and following partial sections. This has the effect of providing a speech recognition apparatus that can efficiently correct a misrecognized portion.
[0130]
According to the present invention, the first matching means performs matching processing between the first speech signal and the word dictionary, detects the partial section corresponding to each word in the first speech from the first speech signal, obtains candidates for each partial section, and obtains a matching score for each partial section of the first speech signal. The recognition result replacing means detects, using the acoustic similarity obtained by the spotting means and the matching score obtained by the first matching means, the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, and replaces the candidate for the detected partial section of the first speech signal with a candidate for the detected partial section of the second speech signal. This has the effect of providing a speech recognition apparatus that can efficiently correct a misrecognized portion even when the acoustic similarity of a partial section other than the one corresponding to the misrecognized portion becomes high.
[0131]
According to the present invention, the speech recognition method comprises: a first matching step of performing matching processing between the first speech signal and a word dictionary including information on the words to be recognized, detecting the partial section corresponding to each word in the first speech from the first speech signal, and obtaining candidates for each partial section; a second matching step of performing matching processing between the second speech signal and the word dictionary, detecting the partial section corresponding to each word in the second speech from the second speech signal, and obtaining candidates for each partial section; a spotting step of obtaining the acoustic similarity between each partial section of the first speech signal and each partial section of the second speech signal; and a recognition result replacing step of detecting, using the acoustic similarity obtained in the spotting step, the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, and replacing the candidate for the detected partial section of the first speech signal with a candidate for the detected partial section of the second speech signal. This has the effect of providing a speech recognition method that can efficiently correct a misrecognized portion.
[0132]
According to the present invention, when the second speech consists only of the speech of the word misrecognized in the first speech, the recognition result replacing step detects the partial section of the first speech signal having a high acoustic similarity with the partial section of the second speech signal as the partial section of the first speech signal corresponding to the misrecognized word in the first speech, and replaces the candidate for the detected partial section of the first speech signal with a candidate for the partial section of the second speech signal. This has the effect of providing a speech recognition method that can efficiently correct a misrecognized portion.
[0133]
According to the present invention, when the second speech consists of the speech of the word misrecognized in the first speech and of one or more words following it, the recognition result replacing step detects the partial section of the first speech signal and the partial section of the second speech signal that have a high acoustic similarity as the partial sections corresponding to the misrecognized word in the first speech, replaces the candidate for the detected partial section of the first speech signal with a candidate for the detected partial section of the second speech signal, and appends to it the candidates for the undetected partial sections of the second speech signal. This has the effect of providing a speech recognition method that can efficiently correct a misrecognized portion even when the misrecognized word and one or more words following it are uttered as the second speech.
[0134]
According to the present invention, the recognition result replacement step detects the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, determines whether the candidate for the detected partial section of the second speech signal is the same as the candidate for the detected partial section of the first speech signal, and replaces the candidate for the detected partial section of the first speech signal with a candidate for the detected partial section of the second speech signal that differs from it. This provides a speech recognition method that can efficiently correct a misrecognized portion.
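The "different candidate" check described above can be sketched as follows. This is a hedged illustration under stated assumptions: the function name `pick_different_candidate` and the ranked-list representation of the candidates are hypothetical, and the words are placeholders.

```python
def pick_different_candidate(current_candidate, ranked_second_candidates):
    """Return the best second-utterance candidate that differs from the
    candidate currently assigned to the detected partial section of the
    first speech signal, or None if every candidate is identical to it.

    ranked_second_candidates: candidates for the detected partial
        section of the second speech signal, best first (assumed layout).
    """
    for candidate in ranked_second_candidates:
        # The current candidate is already known to be wrong, so a
        # second-utterance candidate equal to it is skipped.
        if candidate != current_candidate:
            return candidate
    return None


# The second utterance was misrecognized the same way at rank 1,
# so the rank-2 candidate is used instead.
replacement = pick_different_candidate("w_wrong", ["w_wrong", "w_right", "w_other"])
print(replacement)
```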
[0135]
According to the present invention, the recognition result replacement step detects the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech; determines, in accordance with the syntax rules in a word dictionary that contains information on the words to be recognized and follows syntax rules defining connection relationships, whether the candidate for the detected partial section of the second speech signal can be connected to the candidates for the partial sections before and after the detected partial section of the first speech signal; and replaces the candidate for the detected partial section of the first speech signal with a candidate for the detected partial section of the second speech signal that can be connected to the candidates for the preceding and following partial sections. This provides a speech recognition method that can efficiently correct a misrecognized portion.
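The connectability check described above can be sketched as follows. This is a minimal illustration, assuming the syntax rules are available as a set of allowed (preceding word, following word) pairs; that representation, the function name `pick_connectable_candidate`, and the placeholder words are assumptions rather than details taken from the word dictionary of the invention.

```python
def pick_connectable_candidate(prev_word, next_word, ranked_candidates, allowed_pairs):
    """Return the best candidate that the syntax rules allow between
    prev_word and next_word, or None if no candidate fits.

    prev_word / next_word: candidates for the partial sections before
        and after the detected one (None at the start or end).
    ranked_candidates: candidates for the detected partial section of
        the second speech signal, best first.
    allowed_pairs: set of (preceding, following) word pairs permitted
        by the syntax rules (hypothetical representation).
    """
    for candidate in ranked_candidates:
        fits_before = prev_word is None or (prev_word, candidate) in allowed_pairs
        fits_after = next_word is None or (candidate, next_word) in allowed_pairs
        if fits_before and fits_after:
            return candidate
    return None


# Placeholder syntax rules: "w1" may be followed by "w2a" or "w2b",
# but only "w2b" may be followed by "w3".
rules = {("w1", "w2a"), ("w1", "w2b"), ("w2b", "w3")}
chosen = pick_connectable_candidate("w1", "w3", ["w2a", "w2b"], rules)
print(chosen)
```

In this sketch the top-ranked candidate is rejected because it cannot be followed by the next word, so the lower-ranked but connectable candidate is chosen.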
[0136]
According to the present invention, the first matching step performs matching between the first speech signal and the word dictionary, detects in the first speech signal the partial section corresponding to each word in the first speech, obtains candidates for each partial section, and also obtains a matching score for each partial section of the first speech signal; and the recognition result replacement step uses both the acoustic similarity obtained in the spotting step and the matching score obtained in the first matching step to detect the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, and replaces the candidate for the detected partial section of the first speech signal with the candidate for the detected partial section of the second speech signal. This provides a speech recognition method that can efficiently correct a misrecognized portion even when fluctuation of the speech signal raises the acoustic similarity of a partial section other than the one corresponding to the misrecognized portion.
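One possible way to combine the two quantities is sketched below. The combination rule is not taken from this document: the sketch assumes that a higher matching score means a more reliable first recognition result, and it uses a simple weighted difference chosen purely for illustration. The function name `detect_misrecognized_section`, the `weight` parameter, and the numeric values are hypothetical.

```python
def detect_misrecognized_section(similarities, matching_scores, weight=1.0):
    """Pick the partial section of the first speech signal most likely
    to correspond to the misrecognized word.

    similarities: acoustic similarity of each first-utterance partial
        section to the second utterance (higher = more similar).
    matching_scores: matching score of each first-utterance partial
        section (ASSUMPTION: higher = more reliable recognition).
    weight: how strongly a reliable first result counts against a
        section (illustrative parameter, not from the source).
    """
    if not similarities or len(similarities) != len(matching_scores):
        raise ValueError("need one similarity and one matching score per section")

    # A section is a good correction target when it sounds like the
    # second utterance AND its first recognition was unreliable.
    combined = [s - weight * m for s, m in zip(similarities, matching_scores)]
    return max(range(len(combined)), key=lambda i: combined[i])


similarities = [0.80, 0.75, 0.10]
matching_scores = [0.90, 0.30, 0.85]

by_similarity_only = detect_misrecognized_section(similarities, matching_scores, weight=0.0)
by_both = detect_misrecognized_section(similarities, matching_scores)
print(by_similarity_only, by_both)
```

With these placeholder values, similarity alone would select section 0, whose similarity happens to be slightly higher, whereas taking the matching score into account selects section 1, whose first recognition was less reliable.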
[0137]
According to the present invention, the speech recognition program causes a computer to realize: a first matching function of performing matching between the first speech signal and a word dictionary containing information on the words to be recognized, detecting in the first speech signal the partial section corresponding to each word in the first speech, and obtaining candidates for each partial section; a second matching function of performing matching between the second speech signal and the word dictionary, detecting in the second speech signal the partial section corresponding to each word in the second speech, and obtaining candidates for each partial section; a spotting function of obtaining the acoustic similarity between each partial section of the first speech signal and each partial section of the second speech signal; and a recognition result replacement function of using the acoustic similarity obtained by the spotting function to detect the partial section of the first speech signal and the partial section of the second speech signal that correspond to a misrecognized word in the first speech, and replacing the candidate for the detected partial section of the first speech signal with the candidate for the detected partial section of the second speech signal. This provides a speech recognition program that can efficiently correct a misrecognized portion.
[0138]
According to the present invention, when the second speech consists only of the speech of the misrecognized word in the first speech, the recognition result replacement function detects the partial section of the first speech signal that has a high acoustic similarity to the partial section of the second speech signal as the partial section of the first speech signal corresponding to the misrecognized word in the first speech, and replaces the candidate for the detected partial section of the first speech signal with the candidate for the partial section of the second speech signal. This provides a speech recognition program that can efficiently correct a misrecognized portion.
[0139]
According to the present invention, when the second speech consists of the speech of a misrecognized word in the first speech and of one or more words that follow it, the recognition result replacement function detects the partial section of the first speech signal and the partial section of the second speech signal that have a high acoustic similarity to each other as the partial sections corresponding to the misrecognized word in the first speech, replaces the candidate for the detected partial section of the first speech signal with the candidate for the detected partial section of the second speech signal, and appends to it the candidates for the partial sections of the second speech signal that were not detected. This provides a speech recognition program that can efficiently correct a misrecognized portion even when the misrecognized word and one or more words following it are uttered as the second speech.
[0140]
According to the present invention, the recognition result replacement function detects the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, determines whether the candidate for the detected partial section of the second speech signal is the same as the candidate for the detected partial section of the first speech signal, and replaces the candidate for the detected partial section of the first speech signal with a candidate for the detected partial section of the second speech signal that differs from it. This provides a speech recognition program that can efficiently correct a misrecognized portion.
[0141]
According to the present invention, the recognition result replacement function detects the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech; determines, in accordance with the syntax rules in a word dictionary that contains information on the words to be recognized and follows syntax rules defining connection relationships, whether the candidate for the detected partial section of the second speech signal can be connected to the candidates for the partial sections before and after the detected partial section of the first speech signal; and replaces the candidate for the detected partial section of the first speech signal with a candidate for the detected partial section of the second speech signal that can be connected to the candidates for the preceding and following partial sections. This provides a speech recognition program that can efficiently correct a misrecognized portion.
[0142]
According to the present invention, the first matching function performs matching between the first speech signal and the word dictionary, detects in the first speech signal the partial section corresponding to each word in the first speech, obtains candidates for each partial section, and also obtains a matching score for each partial section of the first speech signal; and the recognition result replacement function uses both the acoustic similarity obtained by the spotting function and the matching score obtained by the first matching function to detect the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, and replaces the candidate for the detected partial section of the first speech signal with the candidate for the detected partial section of the second speech signal. This provides a speech recognition program that can efficiently correct a misrecognized portion even when fluctuation of the speech signal raises the acoustic similarity of a partial section other than the one corresponding to the misrecognized portion.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a speech recognition apparatus according to Embodiment 1 of the present invention.
FIG. 2 is a flowchart (part 1) explaining the operation of the speech recognition apparatus according to Embodiment 1 of the present invention.
FIG. 3 is a flowchart (part 2) explaining the operation of the speech recognition apparatus according to Embodiment 1 of the present invention.
FIG. 4 is a flowchart (part 3) explaining the operation of the speech recognition apparatus according to Embodiment 1 of the present invention.
FIG. 5 is a diagram (part 1) explaining a specific operation of the speech recognition apparatus according to Embodiment 1 of the present invention.
FIG. 6 is a diagram (part 2) explaining a specific operation of the speech recognition apparatus according to Embodiment 1 of the present invention.
FIG. 7 is a diagram (part 3) explaining a specific operation of the speech recognition apparatus according to Embodiment 1 of the present invention.
FIG. 8 is a block diagram showing a configuration of a speech recognition apparatus according to Embodiment 2 of the present invention.
FIG. 9 is a flowchart (part 1) explaining the operation of the speech recognition apparatus according to Embodiment 2 of the present invention.
FIG. 10 is a flowchart (part 2) explaining the operation of the speech recognition apparatus according to Embodiment 2 of the present invention.
FIG. 11 is a diagram explaining a specific operation of the speech recognition apparatus according to Embodiment 2 of the present invention.
FIG. 12 is a flowchart explaining the operation of the speech recognition apparatus according to Embodiment 3 of the present invention.
FIG. 13 is a diagram explaining a specific operation of the speech recognition apparatus according to Embodiment 3 of the present invention.
FIG. 14 is a flowchart explaining the operation of the speech recognition apparatus according to Embodiment 4 of the present invention.
FIG. 15 is a state diagram of the word dictionary stored in the word dictionary storage means of the speech recognition apparatus according to Embodiment 4 of the present invention.
FIG. 16 is a diagram explaining a specific operation of the speech recognition apparatus according to Embodiment 4 of the present invention.
FIG. 17 is a flowchart (part 1) explaining the operation of the speech recognition apparatus according to Embodiment 5 of the present invention.
FIG. 18 is a flowchart (part 2) explaining the operation of the speech recognition apparatus according to Embodiment 5 of the present invention.
FIG. 19 is a diagram explaining a specific operation of the speech recognition apparatus according to Embodiment 5 of the present invention.
FIG. 20 is a diagram explaining a method of calculating the matching score.
FIG. 21 is a block diagram showing the configuration of the conventional speech recognition apparatus disclosed in Japanese Patent Laid-Open No. 4-181299.
FIG. 22 is a diagram showing a specific example of the recognition targets of a speech recognition apparatus.
FIG. 23 is a diagram (part 1) explaining a specific operation of the conventional speech recognition apparatus.
FIG. 24 is a diagram (part 2) explaining a specific operation of the conventional speech recognition apparatus.
FIG. 25 is a diagram (part 3) explaining a specific operation of the conventional speech recognition apparatus.
FIG. 26 is a diagram (part 4) explaining a specific operation of the conventional speech recognition apparatus.
[Explanation of symbols]
1, 21 speech recognition apparatus; 2 speech input means; 3 word dictionary storage means; 4 first model collation means; 5 speech signal storage means; 6, 22 second model collation means; 7, 24 spotting means; 8, 25 recognition result replacement means; 9 recognition result display means; 23 partial section storage means.

Claims

A speech recognition apparatus comprising: word dictionary storage means for storing a word dictionary containing information on words to be recognized;
first matching means for performing matching between a speech signal (hereinafter referred to as the first speech signal) of the speech of a plurality of words to be recognized (hereinafter referred to as the first speech) and the word dictionary, detecting in the first speech signal the partial section corresponding to each word in the first speech, and obtaining candidates for each partial section;
second matching means for performing matching between a speech signal (hereinafter referred to as the second speech signal) of the speech of one or more words including a misrecognized word in the first speech (hereinafter referred to as the second speech) and the word dictionary, detecting in the second speech signal the partial section corresponding to each word in the second speech, and obtaining candidates for each partial section;
spotting means for obtaining the acoustic similarity between each partial section of the first speech signal and each partial section of the second speech signal; and
recognition result replacement means for using the acoustic similarity obtained by the spotting means to detect the partial section of the first speech signal and the partial section of the second speech signal that correspond to a misrecognized word in the first speech, and replacing the candidate for the detected partial section of the first speech signal with the candidate for the detected partial section of the second speech signal,
wherein the word dictionary storage means stores a word dictionary that contains information on the words to be recognized and follows syntax rules defining connection relationships, and
the recognition result replacement means detects the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, determines, in accordance with the syntax rules in the word dictionary, whether the candidate for the detected partial section of the second speech signal can be connected to the candidates for the partial sections before and after the detected partial section of the first speech signal, and replaces the candidate for the detected partial section of the first speech signal with a candidate for the detected partial section of the second speech signal that can be connected to the candidates for the preceding and following partial sections.

The speech recognition apparatus according to claim 1, wherein, when the second speech consists only of the speech of the misrecognized word in the first speech, the recognition result replacement means detects the partial section of the first speech signal that has a high acoustic similarity to the partial section of the second speech signal as the partial section of the first speech signal corresponding to the misrecognized word in the first speech, and replaces the candidate for the detected partial section of the first speech signal with the candidate for the partial section of the second speech signal.

The speech recognition apparatus according to claim 1, wherein, when the second speech consists of the speech of a misrecognized word in the first speech and of one or more words that follow it, the recognition result replacement means detects the partial section of the first speech signal and the partial section of the second speech signal that have a high acoustic similarity to each other as the partial sections corresponding to the misrecognized word in the first speech, replaces the candidate for the detected partial section of the first speech signal with the candidate for the detected partial section of the second speech signal, and appends to it the candidates for the partial sections of the second speech signal that were not detected.

The speech recognition apparatus according to claim 1, wherein the recognition result replacement means detects the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, determines whether the candidate for the detected partial section of the second speech signal is the same as the candidate for the detected partial section of the first speech signal, and replaces the candidate for the detected partial section of the first speech signal with a candidate for the detected partial section of the second speech signal that differs from it.

The speech recognition apparatus according to claim 1, wherein the first matching means performs matching between the first speech signal and the word dictionary, detects in the first speech signal the partial section corresponding to each word in the first speech, obtains candidates for each partial section, and also obtains a matching score for each partial section of the first speech signal, and
the recognition result replacement means uses the acoustic similarity obtained by the spotting means and the matching score obtained by the first matching means to detect the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, and replaces the candidate for the detected partial section of the first speech signal with the candidate for the detected partial section of the second speech signal.

A speech recognition method comprising: a first matching step of performing matching between a speech signal (hereinafter referred to as the first speech signal) of the speech of a plurality of words to be recognized (hereinafter referred to as the first speech) and a word dictionary containing information on the words to be recognized, detecting in the first speech signal the partial section corresponding to each word in the first speech, and obtaining candidates for each partial section;
a second matching step of performing matching between a speech signal (hereinafter referred to as the second speech signal) of the speech of one or more words including a misrecognized word in the first speech (hereinafter referred to as the second speech) and the word dictionary, detecting in the second speech signal the partial section corresponding to each word in the second speech, and obtaining candidates for each partial section;
a spotting step of obtaining the acoustic similarity between each partial section of the first speech signal and each partial section of the second speech signal; and
a recognition result replacement step of using the acoustic similarity obtained in the spotting step to detect the partial section of the first speech signal and the partial section of the second speech signal that correspond to a misrecognized word in the first speech, and replacing the candidate for the detected partial section of the first speech signal with the candidate for the detected partial section of the second speech signal,
wherein the recognition result replacement step detects the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, determines, in accordance with the syntax rules in a word dictionary that contains information on the words to be recognized and follows syntax rules defining connection relationships, whether the candidate for the detected partial section of the second speech signal can be connected to the candidates for the partial sections before and after the detected partial section of the first speech signal, and replaces the candidate for the detected partial section of the first speech signal with a candidate for the detected partial section of the second speech signal that can be connected to the candidates for the preceding and following partial sections.

The speech recognition method according to claim 6, wherein, when the second speech consists only of the speech of the misrecognized word in the first speech, the recognition result replacement step detects the partial section of the first speech signal that has a high acoustic similarity to the partial section of the second speech signal as the partial section of the first speech signal corresponding to the misrecognized word in the first speech, and replaces the candidate for the detected partial section of the first speech signal with the candidate for the partial section of the second speech signal.

The speech recognition method according to claim 6, wherein, when the second speech consists of the speech of a misrecognized word in the first speech and of one or more words that follow it, the recognition result replacement step detects the partial section of the first speech signal and the partial section of the second speech signal that have a high acoustic similarity to each other as the partial sections corresponding to the misrecognized word in the first speech, replaces the candidate for the detected partial section of the first speech signal with the candidate for the detected partial section of the second speech signal, and appends to it the candidates for the partial sections of the second speech signal that were not detected.

The speech recognition method according to claim 6, wherein the recognition result replacement step detects the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, determines whether the candidate for the detected partial section of the second speech signal is the same as the candidate for the detected partial section of the first speech signal, and replaces the candidate for the detected partial section of the first speech signal with a candidate for the detected partial section of the second speech signal that differs from it.

The speech recognition method according to claim 6, wherein the first matching step performs matching between the first speech signal and the word dictionary, detects in the first speech signal the partial section corresponding to each word in the first speech, obtains candidates for each partial section, and also obtains a matching score for each partial section of the first speech signal, and
the recognition result replacement step uses the acoustic similarity obtained in the spotting step and the matching score obtained in the first matching step to detect the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, and replaces the candidate for the detected partial section of the first speech signal with the candidate for the detected partial section of the second speech signal.

A speech recognition program for causing a computer to realize:
a first matching function of performing matching between a speech signal (hereinafter referred to as the first speech signal) of the speech of a plurality of words to be recognized (hereinafter referred to as the first speech) and a word dictionary containing information on the words to be recognized, detecting in the first speech signal the partial section corresponding to each word in the first speech, and obtaining candidates for each partial section;
a second matching function of performing matching between a speech signal (hereinafter referred to as the second speech signal) of the speech of one or more words including a misrecognized word in the first speech (hereinafter referred to as the second speech) and the word dictionary, detecting in the second speech signal the partial section corresponding to each word in the second speech, and obtaining candidates for each partial section;
a spotting function of obtaining the acoustic similarity between each partial section of the first speech signal and each partial section of the second speech signal; and
a recognition result replacement function of using the acoustic similarity obtained by the spotting function to detect the partial section of the first speech signal and the partial section of the second speech signal that correspond to a misrecognized word in the first speech, and replacing the candidate for the detected partial section of the first speech signal with the candidate for the detected partial section of the second speech signal,
wherein the recognition result replacement function detects the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, determines, in accordance with the syntax rules in a word dictionary that contains information on the words to be recognized and follows syntax rules defining connection relationships, whether the candidate for the detected partial section of the second speech signal can be connected to the candidates for the partial sections before and after the detected partial section of the first speech signal, and replaces the candidate for the detected partial section of the first speech signal with a candidate for the detected partial section of the second speech signal that can be connected to the candidates for the preceding and following partial sections.

The speech recognition program according to claim 11, wherein, when the second speech consists only of the speech of the misrecognized word in the first speech, the recognition result replacement function detects the partial section of the first speech signal that has a high acoustic similarity to the partial section of the second speech signal as the partial section of the first speech signal corresponding to the misrecognized word in the first speech, and replaces the candidate for the detected partial section of the first speech signal with the candidate for the partial section of the second speech signal.

The speech recognition program according to claim 11, wherein, when the second speech consists of the speech of a misrecognized word in the first speech and of one or more words that follow it, the recognition result replacement function detects the partial section of the first speech signal and the partial section of the second speech signal that have a high acoustic similarity to each other as the partial sections corresponding to the misrecognized word in the first speech, replaces the candidate for the detected partial section of the first speech signal with the candidate for the detected partial section of the second speech signal, and appends to it the candidates for the partial sections of the second speech signal that were not detected.

The speech recognition program according to claim 11, wherein the recognition result replacement function detects the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, determines whether the candidate for the detected partial section of the second speech signal is the same as the candidate for the detected partial section of the first speech signal, and replaces the candidate for the detected partial section of the first speech signal with a candidate for the detected partial section of the second speech signal that differs from it.

The speech recognition program according to claim 11, wherein the first matching function performs matching between the first speech signal and the word dictionary, detects in the first speech signal the partial section corresponding to each word in the first speech, obtains candidates for each partial section, and also obtains a matching score for each partial section of the first speech signal, and
the recognition result replacement function uses the acoustic similarity obtained by the spotting function and the matching score obtained by the first matching function to detect the partial section of the first speech signal and the partial section of the second speech signal that correspond to the misrecognized word in the first speech, and replaces the candidate for the detected partial section of the first speech signal with the candidate for the detected partial section of the second speech signal.