JP3962904B2

JP3962904B2 - Speech recognition system

Info

Publication number: JP3962904B2
Application number: JP2002015705A
Authority: JP
Inventors: 伸昌荒木
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2002-01-24
Filing date: 2002-01-24
Publication date: 2007-08-22
Anticipated expiration: 2022-01-24
Also published as: JP2003216179A

Description

【０００１】
【発明の属する技術分野】
本発明は、音声認識システムに関し、特に、音声検出誤りによる誤認識を改良した音声認識システムに関するものである。
【０００２】
【従来の技術】
従来の音声認識システムとしては、例えば、日本電気株式会社より発売されているパソコン用音声認識ソフト「SmartVoice」や日本アイ・ビー・エム株式会社より発売されているパソコン用音声認識ソフト「ViaVoice」等が挙げられる。図１０は従来例の音声認識システムを示すブロック図である。図１０において、音声入力装置１は、例えば、マイクロフォンを用いて音声を受け取り、マイクロフォンの音声信号のＡ／Ｄ変換等の処理を行う。音声入力装置１でデジタル化された入力データは音声検出部２に渡される。音声検出部２は受け取った入力データに関し、データのパワーの変化等に基づいてユーザが発声した音声データがその中に存在するかどうかを判断し、ユーザが発声したと判断した音声データを切り出して音声認識部３に渡す。
【０００３】
音声認識部３は受け取った音声データと認識辞書４を用いて音声認識処理を行い、認識結果を認識結果処理部８に渡す。認識辞書４は音声認識システムが受け付けるコマンド語を登録した辞書である。認識結果処理部８は受け取った認識結果に基づいた処理を行う。
【０００４】
具体的に説明すると、例えば、認識辞書４に「ファイルを開く」というコマンド語が登録されていて、ユーザが音声入力装置１を通して「ファイルを開く」と発声したとする。音声入力装置１はユーザの発声を含む音声を受け取り、Ａ／Ｄ変換等の処理を行い、入力データを音声検出部２に渡す。音声検出部２は入力データに関しパワーの変化等を調べ、ユーザが発声した「ファイルを開く」の部分の入力データが、発声された音声データであると判断して、その部分を音声データとして切り出し、音声認識部３に渡す。
【０００５】
音声認識部３はユーザが「ファイルを開く」と発声した音声データと、認識辞書４に登録されているコマンド語を比較して音声認識処理を行い、音声データが認識辞書４に登録されている「ファイルを開く」というコマンド語と一致するため、「ファイルを開く」が発声されたと判断する。音声認識部３は認識結果を「ファイルを開く」として認識結果処理部８に渡す。認識結果処理部８は受け取った「ファイルを開く」という認識結果に基づいて実際にファイルを開くといった処理を行う。
【０００６】
【発明が解決しようとする課題】
しかしながら、上述した従来の音声認識システムでは、次のような２つの問題点があった。即ち、音声検出部２はユーザの発声を１発声毎に検出・音声データの切り出しを行い、音声認識部３は受け取った音声データと認識辞書４とを比較してどのコマンド語が発声されたかを判断しているため、ユーザは認識辞書４に格納されているコマンド語の通りに、間を空けずに１発声で発声しなければならない。そのため、間を空けずに１発声で発声しないと正しく認識されない。
【０００７】
例えば、「ファイルを開く」の代わりに「ファイルオープン」と発声したり、間を空けて「ファイルを」と「開く」の２発声で発声した場合は、それぞれ、「ファイルオープン」の音声データと認識辞書４、「ファイルを」の音声データと認識辞書４、「開く」の音声データと認識辞書４とで認識処理を行うため、正しく認識されなかった。
【０００８】
また、音声検出部２はユーザが１発声と意識した音声の範囲に拘わらず、音声検出部２自身の判断により１発声の範囲を検出し、音声認識部３は受け取ったその音声データ毎に認識処理を行うため、ユーザが迷ったり言いよどんだために発声に間が空いて、音声検出が誤ってユーザの１発声を複数の発声に分割してしまった場合は、正しく認識されないことがあった。例えば、ユーザが迷って「ファイルを・・・開く」と発声し、音声検出部２が「ファイルを」の音声データと「開く」の音声データを別々に切り出してしまった場合には、正しく認識されなかった。
【０００９】
一方、音声検出で誤って発声が分割されたり、途切れたりする問題に対処する方法としては、例えば、特開平９−１９８０７７号公報に記載の音声認識システムがある。しかし、同公報の音声認識システムは、音声検出が無音で区切られてしまうのを防ぐ方法であり、仮に、ユーザがシステムの想定を超えた一定時間以上の無音を発声中に入れた場合には、分割されて音声が検出されるため、対処することができなかった。
【００１０】
また、言いよどみが入力された場合への対処方法としては、例えば、特開平６−１１８９８９号公報に記載の連続音声認識方法があるが、同公報の方法の場合も、発声中に無音が入り、分割されて音声検出された場合には対処することができなかった。
【００１１】
本発明は、上記従来の問題点に鑑みなされたもので、その目的は、ユーザがコマンド語を発声する時に間を空けて発声しても正しく認識できる音声認識システムを提供することにある。
【００１２】
【課題を解決するための手段】
本発明の音声認識システムは、音声入力手段と、前記音声入力手段から受け取った入力データから音声を検出する手段と、先頭のノードから終端のノードに繋がる一本のアークで表され、各々に識別子が付与されたコマンド語を登録する認識辞書と、前記認識辞書のコマンド語を発声を受け付ける所定単位に分解すると共に、前記所定単位が各々独立したアークとなるようにネットワーク文法を変更し、且つ、前記コマンド語の各アークに対して、元のアークの識別子と元のアークが何個のアークに区切られたかを示す個数とそのアークが元のアークの先頭から何番目のアークであるかを示す順番とを含む識別子を対応させたテーブルを作成する手段と、前記検出手段により検出された音声に対して認識を行い、認識結果を保持する手段と、前記テーブルを参照して前記認識結果に対して可能性のある識別子を保存する手段と、前記保存された識別子を組み合わせ、その組み合わせ結果が前記認識辞書に登録されている元のコマンド語と一致した時に当該コマンド語を認識する手段とを備えたことを特徴とする。
【００１３】
本発明においては、コマンド語が複数の発声に分割して発声された場合においても、そのコマンド語を正しく認識できるという効果が得られる。
【００１４】
【発明の実施の形態】
次に、本発明の実施の形態について図面を参照して詳細に説明する。
【００１５】
（第１の実施形態）
図１は本発明の音声認識システムの第１の実施形態の構成を示すブロック図である。なお、図１では図１０の従来の音声認識システムと同一部分は同一符号を付している。図１において、音声入力装置１はマイクロフォン等を用いて音声を受け取り、マイクロフォンからの音声信号のＡ／Ｄ変換等を行い、入力された音声をデジタルの入力データとして生成する。音声検出部２は音声入力装置１から受け取った入力データに対し、データのパワーの変化等を計算することによりユーザの発声した音声が含まれているかどうかを判断し、ユーザの発声した音声が含まれていると判断した場合、その部分を音声データとして切り出して音声認識部３に渡す。
【００１６】
音声認識部３は受け取った音声データと認識辞書変更部５から渡された認識辞書とを用いて音声認識処理を行い、認識結果を認識結果保持部６に格納する。認識辞書４は認識対象となるコマンド語が登録された辞書である。認識辞書変更部５は詳しく後述するように認識辞書４に対し、登録されているコマンド語を単語或いは音節等の所定の単位に分解し、その単語列／音節列／その他の単位による列の部分列も認識対象となるように認識辞書４を変更する。
【００１７】
認識結果保持部６は音声認識部３より認識結果を受け取り、複数の認識結果を保持する。認識結果制御部７は認識結果保持部６に保持されている認識結果を参照し、それらの組み合わせが元の認識辞書４に格納されているコマンド語になるかどうかを判断し、認識結果の組み合わせがコマンド語になると判断した場合は、そのコマンド語を認識結果処理部８に渡し、認識結果保持部６に格納されている内容を空にする。認識結果処理部８は認識結果のコマンド語を受け取り、そのコマンド語に対応した処理を行う。
【００１８】
次に、本実施形態の動作を具体例を挙げて詳細に説明する。まず、認識辞書４は図２（ａ）に示すようなネットワーク文法で与えられているとする。即ち、この認識辞書４は先頭のノード１０１から終端のノード１０２までを繋ぐパスが認識対象となるコマンド語を表しており、「ファイルを開く」、「ファイルを閉じる」、「図を開く」、「図を閉じる」の４通りのコマンド語が認識対象となっている。また、認識辞書４はネットワーク文法の各アークが単語で構成されている。
【００１９】
認識辞書変更部５はこの４通りのコマンド語に対し、単語を境界として部分的に発声したものも受け付けるようにするため、先頭のノード１０１と終端のノード１０２以外のノードに対し、先頭のノード１０１及び終端のノード１０２に繋がるアークを作成する。例えば、図２（ａ）に示すネットワーク文法の場合、図２（ｂ）に示すようにアーク１０５、１０６、１０７、１０８の４つのアークを追加する。
【００２０】
これらの追加されたアークは、認識されるコマンド語はなし（図２（ｂ）ではφで表現する）となっている。これにより、図２（ｂ）のネットワーク文法では、「ファイル」、「ファイルを」、「図」、「図を」、「開く」、「を開く」、「閉じる」、「を閉じる」等、元のコマンド語を部分的に発声したものも受け付けられる。
【００２１】
ここで、ユーザが音声入力装置１を通して、まず、「ファイルを」と発声したとする。この発声は音声検出部２を通して処理され、音声検出部２では「ファイルを」が発声された音声データと判断し、その部分を音声認識部３に渡す。一方、認識辞書変更部５は図２（ｂ）に示すように変更した認識辞書を作成しており、音声認識部３では変更した認識辞書を用いて認識処理を行う。この場合、図２（ｂ）の認識辞書では「ファイルを」のコマンド語が受け付けられるため、音声認識部３は「ファイルを」を認識し、認識結果保持部６に「ファイルを」が認識結果として保持される。
【００２２】
認識結果保持部６に認識結果が保持されると、認識結果制御部７は認識結果保持部６に保持されている認識結果の組み合わせが、元の認識辞書４により受け付けられるかどうかを調べる。この時点では、「ファイルを」は図２（ａ）のネットワーク文法では受け付けられないため、元のコマンド語はまだ認識されていないと判断し、認識結果処理部８には何も渡さない。
【００２３】
次に、ユーザが続けて音声入力装置１を通して「開く」と発声したとする。この時、同様に音声検出部２、音声認識部３により処理され、この時も、図２（ｂ）の認識辞書では「開く」のコマンド語が受け付けられるため、音声認識部３は「開く」を認識し、これが認識結果として認識結果保持部６に保持される。認識結果制御部７は認識結果保持部６に認識結果が追加されたため、再度、認識結果の組み合わせが、元の認識辞書４により受け付けられるかどうかを調べる。
【００２４】
今回は、「ファイルを」と「開く」を組み合わせた「ファイルを開く」が図２（ａ）のネットワーク文法で受け付けられるため、認識結果処理部８に「ファイルを開く」を認識結果として渡す。また、認識結果保持部６に保持されている内容を空にする。このように本実施形態では、ユーザが「ファイルを」、「開く」と区切って発声したものを「ファイルを開く」のコマンド語として正しく認識することができる。
【００２５】
（第２の実施形態）
次に、本発明の第２実施形態について説明する。第２の実施形態の基本的構成は図１の第１の実施形態と同様であるが、認識辞書４の構造が異なっている。それに伴い、認識辞書変更部５、認識結果制御部７の動作が異なっている。本実施形態では、認識辞書４が図３（ａ）に示すように与えられているとする。この認識辞書４では、コマンド語が先頭のノードから終端のノードに繋がる一本のアークで表されている。また、各コマンド語には、#W001、#W002、#W003の識別子が付与されている。
【００２６】
認識辞書変更部５は、まず、認識辞書４のコマンド語を発声を受け付ける単位に分解する。この場合は、単語単位の区切りを受け付けるものとし、コマンド語を形態素解析する等の手段により単語単位に分解する。そして、図３（ｂ）に示すように各単語が独立したアークとなるようにネットワーク文法を変更する。その際、元のアークに付与されていた識別子を残しておくため、各アークに対し元となったアークの識別子と、そのアークが元のアークの先頭から何番目かと、いくつのアークに区切られたかの情報を付与する。
【００２７】
例えば、図３（ｂ）において「保存」というアークの「#W003#3#4」のうち「#W003」は、元のアークが「#W003」であることを表し、「#3#4」は４つのアークに分解されたうちの先頭から３番目であることを表している。
【００２８】
このようにして変更された図３（ｂ）のネットワーク文法に対し、認識辞書変更部５は、更に、第１の実施形態と同様に先頭と終端以外の各ノードに対し先頭のノード及び終端のノードに繋がるコマンド語のないアークを追加する。図４は図３（ｂ）に対してこの処理を行ったネットワーク文法を示す。また、この際、同じコマンド語が複数のアークに出現する可能性もあるため、認識辞書変更部５は、図５に示すように認識されるコマンド語とアークに付与された識別子のテーブル５１を保持する。
【００２９】
ここで、まず、ユーザが音声入力装置１を通して「ファイルを」と発声したとする。この音声は、音声検出部２、音声認識部３を通して処理され、音声認識部３では「ファイルを」が認識結果として得られる。この認識結果は、認識結果保持部６に保持される。この時点では、図６に示すように認識結果保持部６に“ファイルを”が保持される。
【００３０】
認識結果制御部７は、認識結果保持部６に保持されている認識結果が、元の認識辞書４で受け付けられるかどうかを調べる。まず、図５のテーブル５１を参照し、認識結果に対し可能性のある識別子を調べ、認識結果保持部６に追加情報として保持する。この時の認識結果保持部６には、図６に１１１として示すように「#W001#1#3」、「#W001#2#3」、「#W002#1#3」、「#W002#2#3」、「#W003#2#4」が保持された状態となる。この状態では、１１１に保持されている識別子を組み合わせても、元の識別子にならないため、まだ元のコマンド語は認識されていないと判断し、認識結果処理部８には何も通知しない。
【００３１】
次に、ユーザが続けて「保存」と発声したとする。この「保存」はユーザが言い間違えて発声したものとする。この場合も同様に入力音声が処理され、図６に示すように認識結果保持部６に“保存”の認識結果が保持される。認識結果制御部７は、同様にテーブル５１の識別子の情報を認識結果保持部６に追加し、図６に１１２として示すように保存に対応する識別子「#W003#3#4」が保持される。この時の認識結果保持部６の状態は、図６の１１１と１１２を合わせたものになる。認識結果制御部７はこの状態でも元の識別子を構成できないため、まだ元のコマンド語は認識されていないと判断する。
【００３２】
次いで、ユーザが「閉じる」と発声したとする。同様に入力音声が認識され、図６に示すように認識結果保持部６に“閉じる”の認識結果が保持される。認識結果制御部７は、同様にテーブル５１の識別子の情報を認識結果保持部６に追加し、図６に１１３として示すように認識結果保持部６に「閉じる」に対応する識別子「#W002#3#3」が追加される。この時の認識結果保持部６の状態は、図６の１１１、１１２、１１３を合わせたものになる。
【００３３】
この時、認識結果保持部６に付与された識別子を見ると、先頭から「#W002#1#3」、「#W002#2#3」、「#W002#3#3」の３つを繋げて「#W002」が完成する。よって、この時点で認識結果制御部７は、「#W002」のコマンド語、即ち「ファイルを閉じる」のコマンド語を認識できたと判断し、認識結果処理部８に「ファイルを閉じる」の認識結果を渡す。また、認識結果制御部７は認識結果保持部６に保持している内容を空にする。
【００３４】
このように本実施形態では、ユーザが「ファイルを」、「保存」、「閉じる」のように途中で言い間違いや言いよどみ等余計な発声も含めて、区切って発声したものを、「ファイルを閉じる」のコマンド語として正しく認識することができる。
【００３５】
また、以上の実施形態では、コマンド語を単語を境界として区切って発声できると説明したが、本発明は、単語ではなく、例えば、音節を境界として区切って発声できるシステムとしてもよい。例えば、図７（ａ）に示すような認識辞書があった場合、コマンド語の読みを音節単位に分解し、図７（ｂ）に示すようにネットワーク文法の各アークも音節単位に分解する。以降の処理は図３の認識辞書に対して行った処理と同様である。このような実施形態では、「さく」、「せい」や「へん」、「こう」のようにコマンド語を音節部分で区切って発声しても認識することが可能である。
【００３６】
更に、単語単位や音節単位以外にも、システムで定義した語句を部分的に発声できる境界としてもよい。例えば、２音節程度からなる語句を定義し、その語句によりコマンド語を分解するという方法や、文節によりコマンド語を分解するという方法等である。前者の方法としては、例えば、「おう」、「こう」、「そう」、「とう」等の２音節程度の語句を定義する。そして、例えば、これらの定義された語句を単位とすることで、「応答」というコマンド語は「おう」と「とう」に分解される。
【００３７】
また、このようにネットワーク文法を拡張すると、ネットワーク文法が複雑になり、処理速度や認識性能に影響を与えることも考えられる。そのため、例えば、図３（ｂ）のネットワーク文法に対し、同じコマンド語の付いたアークをまとめることにより、図８に示すようにネットワーク文法を単純化するという方法も考えられる。
【００３８】
また、このようにコマンド語を部分的に発声したものも受け付けるように認識辞書を変更した場合、コマンド語の数の増加や、似たコマンド語の増加により誤認識が増えることが考えられる。例えば、図９（ａ）のような認識辞書の場合、「ファイル」と「入る」というコマンド語が受け付けられ、「ファイル」と「入る」は音響的に似ているため、「ファイル」と発声しても「入る」と誤認識されてしまうことが考えられる。このような問題の対策として、例えば、認識結果の第１位候補だけでなく、第２位以降の候補も使用するという方法がある。これは、音声認識手段３が音声認識時に複数の候補を認識結果保持部６に保持しておく。
【００３９】
図９（ｂ）はこの場合の認識結果保持部６に保持された認識結果の例を示す。ここでは、ユーザが「ファイル」と発声したものが、第１位候補では「入る」と認識され、「ファイル」は第２位候補として認識されたものとする。このような場合、図９（ｂ）に示すように認識結果保持部６に第１位候補の他に第２位候補も含めて保持し、認識結果制御部７が第２位候補も含めて元のコマンド語の完成を判断することで正しく認識でき、誤認識が増える場合に対処できる。また、図９（ｂ）に示すように「を」と「と」、「開く」と「入る」等のような場合も、同様に第１位候補、第２位候補を保持し、第２位候補も含めて元のコマンド語になるかを判断することで正しく認識できる。
【００４０】
【発明の効果】
以上説明したように本発明によれば、認識辞書に登録されているコマンド語を所定の単位に分解し、複数の認識結果が得られた時にそれらの組み合わせが元のコマンド語になるかどうかを判断しているので、ユーザがコマンド語を発声する時に元々定義されているコマンド語の通りに１発声で発声するのではなく、長いコマンド語や１発声では言いにくいコマンド語を分割して発声した場合であっても、音声を正しく認識することができる。
【００４１】
また、ユーザの意図とは関係なく音声検出部自身で１発声の範囲を判断し、音声データを切り出すが、音声検出部がユーザの意図しないように１発声を複数の発声に分けて切り出した場合でも、元のコマンド語を部分的に発声したものも受け付けられるようになり、認識結果保持部には元のコマンド語が認識されるまで複数の認識結果が保持され、それらの複数の認識結果を組み合わせて元のコマンド語になるかを判断するため、ユーザが発声の途中で言いよどんだり、詰まったりし、音声検出で区切られるくらいの大きな間が空いてしまった場合でも正しく認識することができる。
【図面の簡単な説明】
【図１】本発明による音声認識システムの第１の実施形態の構成を示すブロック図である。
【図２】図１の実施形態に用いる認識辞書のネットワーク文法の一例及びそれを認識辞書変更部により変更した認識辞書を示す図である。
【図３】本発明の第２の実施形態で用いる認識辞書のネットワーク文法の一例及びそれを認識辞書変更部により変更した認識辞書を示す図である。
【図４】図３のネットワーク文法に対しコマンド語がなしのアークが付与された認識辞書を示す図である。
【図５】コマンド語と識別子の対応を示すテーブルの図である。
【図６】認識結果保持部に保持された認識結果の例を示す図である。
【図７】認識辞書のコマンド語を音節単位で分解する場合の例を説明する図である。
【図８】コマンド語が分解された認識辞書に対し、同じコマンド語のアークをまとめて単純化した例を示す図である。
【図９】複数の認識候補を用いて認識処理を行う場合の例を説明する図である。
【図１０】従来例の音声認識システムの構成を示すブロック図である。
【符号の説明】
１音声入力装置
２音声検出部
３音声認識部
４認識辞書
５認識辞書変更部
６認識結果保持部
７認識結果制御部
８認識結果処理部
[0001]
[Technical Field of the Invention]
The present invention relates to a speech recognition system, and more particularly to a speech recognition system that reduces misrecognition caused by speech detection errors.
[0002]
[Prior art]
Conventional speech recognition systems include, for example, the PC speech recognition software “SmartVoice” sold by NEC Corporation and the PC speech recognition software “ViaVoice” sold by IBM Japan, Ltd. FIG. 10 is a block diagram showing a conventional speech recognition system. In FIG. 10, the voice input device 1 receives voice using, for example, a microphone, and performs processing such as A/D conversion of the microphone's voice signal. The input data digitized by the voice input device 1 is passed to the voice detection unit 2. Based on changes in the power of the data and similar cues, the voice detection unit 2 determines whether the received input data contains voice data uttered by the user, cuts out the voice data judged to have been uttered by the user, and passes it to the voice recognition unit 3.
[0003]
The voice recognition unit 3 performs voice recognition processing using the received voice data and the recognition dictionary 4 and passes the recognition result to the recognition result processing unit 8. The recognition dictionary 4 is a dictionary in which command words accepted by the voice recognition system are registered. The recognition result processing unit 8 performs processing based on the received recognition result.
[0004]
More specifically, suppose for example that the command word “open file” (ファイルを開く) is registered in the recognition dictionary 4 and the user utters “open file” through the voice input device 1. The voice input device 1 receives the voice including the user's utterance, performs processing such as A/D conversion, and passes the input data to the voice detection unit 2. The voice detection unit 2 examines changes in power and the like in the input data, determines that the portion in which the user uttered “open file” is uttered voice data, cuts out that portion as voice data, and passes it to the voice recognition unit 3.
[0005]
The voice recognition unit 3 performs voice recognition processing by comparing the voice data in which the user uttered “open file” with the command words registered in the recognition dictionary 4. Because the voice data matches the command word “open file” registered in the recognition dictionary 4, it determines that “open file” was uttered. The voice recognition unit 3 passes “open file” to the recognition result processing unit 8 as the recognition result. The recognition result processing unit 8 then performs processing based on the received recognition result “open file”, such as actually opening a file.
[0006]
[Problems to be solved by the invention]
However, the conventional speech recognition system described above has the following two problems. First, the voice detection unit 2 detects the user's speech and cuts out voice data one utterance at a time, and the voice recognition unit 3 compares the received voice data with the recognition dictionary 4 to determine which command word was uttered. The user must therefore speak the command word exactly as it is stored in the recognition dictionary 4, in a single utterance without pauses; otherwise it is not recognized correctly.
[0007]
For example, suppose the user says “file open” (ファイルオープン) instead of “open file” (ファイルを開く), or pauses and speaks “file” (ファイルを) and “open” (開く) as two utterances. Recognition processing is then performed on the voice data of “file open” against the recognition dictionary 4, or on the voice data of “file” and of “open” each separately against the recognition dictionary 4, so the command is not recognized correctly.
[0008]
Second, the voice detection unit 2 detects the range of one utterance by its own judgment, regardless of the range the user regards as one utterance, and the voice recognition unit 3 performs recognition processing on each piece of voice data it receives. Consequently, when the user hesitates or falters and leaves a pause in the utterance, and voice detection mistakenly divides the user's single utterance into multiple utterances, the command may not be recognized correctly. For example, if the user hesitates and says “file ... open” (ファイルを・・・開く), and the voice detection unit 2 cuts out the voice data of “file” and the voice data of “open” separately, the command is not recognized correctly.
[0009]
As a method for coping with utterances being divided or interrupted by voice detection, there is, for example, the speech recognition system described in Japanese Patent Laid-Open No. 9-198077. However, the system of that publication is a method for preventing voice detection from splitting an utterance at silence. If the user inserts silence into an utterance that lasts longer than the fixed period the system assumes, the voice is still detected in divided form, and the system cannot cope.
[0010]
As a method for dealing with input that contains hesitations, there is, for example, the continuous speech recognition method described in Japanese Patent Laid-Open No. 6-118989. However, that method likewise cannot cope when silence occurs during an utterance and the voice is detected in divided form.
[0011]
The present invention has been made in view of the conventional problems described above, and its object is to provide a speech recognition system that can recognize a command word correctly even when the user pauses while uttering it.
[0012]
[Means for Solving the Problems]
The speech recognition system of the present invention comprises: voice input means; means for detecting voice from input data received from the voice input means; a recognition dictionary that registers command words, each represented by a single arc connecting a first node to a last node and each assigned an identifier; means for decomposing the command words of the recognition dictionary into predetermined units in which utterances are accepted, changing the network grammar so that each predetermined unit becomes an independent arc, and creating a table that associates each arc of a command word with an identifier containing the identifier of the original arc, a count indicating how many arcs the original arc was divided into, and an order indicating the position of the arc counted from the head of the original arc; means for performing recognition on the voice detected by the detecting means and holding the recognition results; means for referring to the table and storing the identifiers that are possible for each recognition result; and means for combining the stored identifiers and recognizing a command word when the combined result matches the original command word registered in the recognition dictionary.
[0013]
According to the present invention, a command word can be recognized correctly even when it is spoken as a plurality of separate utterances.
[0014]
[Embodiments of the Invention]
Next, embodiments of the present invention will be described in detail with reference to the drawings.
[0015]
(First embodiment)
FIG. 1 is a block diagram showing the configuration of the first embodiment of the speech recognition system of the present invention. In FIG. 1, parts identical to those of the conventional speech recognition system of FIG. 10 are denoted by the same reference numerals. In FIG. 1, the voice input device 1 receives voice using a microphone or the like, performs A/D conversion and similar processing on the voice signal from the microphone, and generates digital input data from the input voice. The voice detection unit 2 determines whether the input data received from the voice input device 1 contains voice uttered by the user, for example by calculating changes in the power of the data. When it determines that such voice is contained, it cuts out that portion as voice data and passes it to the voice recognition unit 3.
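The power-based detection performed by the voice detection unit 2 can be illustrated with a minimal Python sketch. The patent does not give an algorithm, so everything here is an assumption made for illustration: the per-frame power list, the `threshold` value, the `max_pause` parameter, and the function name `detect_utterances` are all hypothetical.

```python
def detect_utterances(power, threshold=0.1, max_pause=3):
    """Return (start, end) frame ranges judged to contain speech.

    power:     per-frame power values (hypothetical, assumed precomputed)
    threshold: frames at or above this power count as speech (assumed)
    max_pause: quiet frames tolerated inside one utterance (assumed)
    """
    segments = []
    start = None      # first frame of the utterance being built
    last_loud = None  # most recent frame at or above the threshold
    for i, p in enumerate(power):
        if p >= threshold:
            if start is None:
                start = i
            last_loud = i
        elif start is not None and i - last_loud > max_pause:
            # The pause is longer than tolerated: close the utterance.
            segments.append((start, last_loud + 1))
            start = None
    if start is not None:
        segments.append((start, last_loud + 1))
    return segments


short_pause = [0, 0.5, 0.6, 0, 0, 0.7, 0.4, 0]
long_pause = [0, 0.5, 0.6, 0, 0, 0, 0, 0, 0.7, 0.4, 0]
print(detect_utterances(short_pause))  # one segment
print(detect_utterances(long_pause))   # two segments
```

With the longer pause the same words come out as two separate segments, which is the situation the rest of the description is designed to handle.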
[0016]
The voice recognition unit 3 performs voice recognition processing using the received voice data and the recognition dictionary passed from the recognition dictionary changing unit 5, and stores the recognition result in the recognition result holding unit 6. The recognition dictionary 4 is a dictionary in which the command words to be recognized are registered. As described in detail later, the recognition dictionary changing unit 5 decomposes the command words registered in the recognition dictionary 4 into predetermined units such as words or syllables, and changes the recognition dictionary 4 so that subsequences of the resulting word sequence, syllable sequence, or sequence of other units also become recognition targets.
[0017]
The recognition result holding unit 6 receives recognition results from the voice recognition unit 3 and holds a plurality of them. The recognition result control unit 7 refers to the recognition results held in the recognition result holding unit 6 and determines whether a combination of them forms a command word stored in the original recognition dictionary 4. When it determines that a combination forms a command word, it passes that command word to the recognition result processing unit 8 and empties the contents of the recognition result holding unit 6. The recognition result processing unit 8 receives the command word as the recognition result and performs the processing corresponding to it.
[0018]
Next, the operation of the present embodiment will be described in detail using a specific example. First, assume that the recognition dictionary 4 is given as a network grammar as shown in FIG. 2(a). In this recognition dictionary 4, each path connecting the first node 101 to the last node 102 represents a command word to be recognized, and four command words are recognition targets: “open file” (ファイルを開く), “close file” (ファイルを閉じる), “open figure” (図を開く), and “close figure” (図を閉じる). Each arc of the network grammar in the recognition dictionary 4 consists of one word.
[0019]
So that partial utterances of these four command words, divided at word boundaries, are also accepted, the recognition dictionary changing unit 5 creates, for each node other than the first node 101 and the last node 102, arcs connecting that node to the first node 101 and to the last node 102. For example, for the network grammar shown in FIG. 2(a), four arcs 105, 106, 107, and 108 are added as shown in FIG. 2(b).
[0020]
No command word is recognized on these added arcs (they are denoted by φ in FIG. 2(b)). As a result, the network grammar of FIG. 2(b) also accepts partial utterances of the original command words, such as “ファイル” (file), “ファイルを” (file plus the object particle), “図” (figure), “図を”, “開く” (open), “を開く”, “閉じる” (close), and “を閉じる”.
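The effect of the added φ arcs can be checked with a small Python sketch that models the network grammar as a list of arcs. The layout of FIG. 2(a) is not reproduced in the text, so the intermediate node numbers 103 and 104, the arc list, and the function names `add_phi_arcs` and `accepted` are assumptions made for illustration only.

```python
START, END = 101, 102

# (source node, destination node, word). Node numbers 103 and 104 are assumed.
ARCS = [
    (101, 103, "ファイル"), (101, 103, "図"),
    (103, 104, "を"),
    (104, 102, "開く"), (104, 102, "閉じる"),
]


def add_phi_arcs(arcs):
    """Connect each intermediate node to START and END with empty (φ) arcs."""
    nodes = {n for src, dst, _ in arcs for n in (src, dst)} - {START, END}
    extra = []
    for n in sorted(nodes):
        extra.append((START, n, ""))  # allows an utterance to start mid-command
        extra.append((n, END, ""))    # allows an utterance to stop mid-command
    return arcs + extra


def accepted(arcs, node=START, prefix=""):
    """Enumerate every non-empty word string on a START-to-END path."""
    if node == END:
        return {prefix} if prefix else set()
    found = set()
    for src, dst, word in arcs:
        if src == node:
            found |= accepted(arcs, dst, prefix + word)
    return found


original = accepted(ARCS)
extended = accepted(add_phi_arcs(ARCS))
print(sorted(original))
print(sorted(extended - original))
```

Under these assumptions two intermediate nodes yield four φ arcs, matching the count of added arcs 105 to 108, and the extended grammar accepts the partial utterances in addition to the four full command words.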
[0021]
Here, suppose the user first utters “file” (ファイルを) through the voice input device 1. This utterance is processed by the voice detection unit 2, which determines that “file” is uttered voice data and passes that portion to the voice recognition unit 3. Meanwhile, the recognition dictionary changing unit 5 has created the changed recognition dictionary shown in FIG. 2(b), and the voice recognition unit 3 performs recognition processing using this changed dictionary. Because the recognition dictionary of FIG. 2(b) accepts “file” as a command word, the voice recognition unit 3 recognizes “file”, and “file” is held in the recognition result holding unit 6 as the recognition result.
[0022]
When a recognition result is held in the recognition result holding unit 6, the recognition result control unit 7 checks whether the combination of recognition results held there is accepted by the original recognition dictionary 4. At this point, “file” (ファイルを) alone is not accepted by the network grammar of FIG. 2(a), so the unit determines that the original command word has not yet been recognized and passes nothing to the recognition result processing unit 8.
[0023]
Next, suppose the user goes on to utter “open” (開く) through the voice input device 1. The utterance is processed by the voice detection unit 2 and the voice recognition unit 3 in the same way. Because the recognition dictionary of FIG. 2(b) also accepts “open” as a command word, the voice recognition unit 3 recognizes “open”, and it is held in the recognition result holding unit 6 as a recognition result. Since a recognition result has been added to the recognition result holding unit 6, the recognition result control unit 7 checks again whether the combination of recognition results is accepted by the original recognition dictionary 4.
[0024]
This time, “open file” (ファイルを開く), formed by combining “file” and “open”, is accepted by the network grammar of FIG. 2(a), so “open file” is passed to the recognition result processing unit 8 as the recognition result, and the contents held in the recognition result holding unit 6 are emptied. Thus, in this embodiment, an utterance that the user splits into “file” and “open” is correctly recognized as the command word “open file”.
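The interplay between the recognition result holding unit 6 and the recognition result control unit 7 in this walkthrough can be sketched in Python as follows. This is a simplified illustration, not the patent's implementation: recognition is abstracted to text strings, "combination" is taken to mean concatenation of all held results in order of arrival, and the class name `ResultController` is hypothetical.

```python
ORIGINAL_COMMANDS = {"ファイルを開く", "ファイルを閉じる", "図を開く", "図を閉じる"}


class ResultController:
    """Holds partial recognition results until they form an original command."""

    def __init__(self, commands):
        self.commands = commands
        self.held = []  # plays the role of the recognition result holding unit 6

    def add(self, fragment):
        """Store one recognition result; return a command word once complete."""
        self.held.append(fragment)
        combined = "".join(self.held)  # assumed: simple in-order concatenation
        if combined in self.commands:
            self.held.clear()          # empty the holding unit after success
            return combined
        return None


controller = ResultController(ORIGINAL_COMMANDS)
print(controller.add("ファイルを"))  # None: not yet a full command
print(controller.add("開く"))        # the completed command word
```

Concatenating every held result is the simplest reading of the text; the second embodiment describes a more tolerant scheme based on identifiers.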
[0025]
(Second Embodiment)
Next, a second embodiment of the present invention will be described. The basic configuration of the second embodiment is the same as that of the first embodiment of FIG. 1, but the structure of the recognition dictionary 4 is different. Accordingly, the operations of the recognition dictionary changing unit 5 and the recognition result control unit 7 are different. In the present embodiment, assume that the recognition dictionary 4 is given as shown in FIG. 3(a). In this recognition dictionary 4, each command word is represented by a single arc connecting the first node to the last node. The command words are assigned the identifiers #W001, #W002, and #W003.
[0026]
The recognition dictionary changing unit 5 first decomposes the command words in the recognition dictionary 4 into the units in which utterances are accepted. Here, breaks at word boundaries are accepted, and each command word is decomposed into words by means such as morphological analysis. Then, as shown in FIG. 3(b), the network grammar is changed so that each word becomes an independent arc. To preserve the identifier assigned to the original arc, each new arc is given the identifier of the arc it came from, its position counted from the head of the original arc, and the number of arcs into which the original arc was divided.
[0027]
For example, in the identifier “#W003#3#4” of the arc “save” (保存) in FIG. 3(b), “#W003” indicates that the original arc is “#W003”, and “#3#4” indicates that this arc is the third from the head of the four arcs into which it was decomposed.
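The identifier format described above can be generated and parsed with a few lines of Python. This is a sketch only: the function names `split_identifiers` and `parse_identifier` are hypothetical, and the word segmentation is passed in by hand rather than produced by morphological analysis.

```python
def split_identifiers(command_id, words):
    """Pair each word with '<original id>#<position>#<number of arcs>'."""
    total = len(words)
    return [(word, f"{command_id}#{pos}#{total}")
            for pos, word in enumerate(words, start=1)]


def parse_identifier(identifier):
    """Split '#W003#3#4' into ('#W003', 3, 4)."""
    command, position, total = identifier.lstrip("#").split("#")
    return "#" + command, int(position), int(total)


# "close file" (#W002) decomposed into three word arcs.
print(split_identifiers("#W002", ["ファイル", "を", "閉じる"]))
print(parse_identifier("#W003#3#4"))
```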
[0028]
To the network grammar of FIG. 3(b) changed in this way, the recognition dictionary changing unit 5 further adds, as in the first embodiment, arcs without command words that connect each node other than the first and last nodes to the first node and to the last node. FIG. 4 shows the network grammar obtained by applying this processing to FIG. 3(b). Because the same command word may now appear on more than one arc, the recognition dictionary changing unit 5 also holds a table 51, shown in FIG. 5, of the recognized command words and the identifiers assigned to the arcs.
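A table like table 51 can be sketched in Python as a mapping from each word to every identifier under which it appears. FIG. 5 is not reproduced in the text, so this is an assumed shape. The wording of #W003 below is invented for illustration; the text states only that “保存” is its third of four arcs and that “を” is its second. The name `build_table` is hypothetical.

```python
from collections import defaultdict

COMMANDS = {
    "#W001": ["ファイル", "を", "開く"],
    "#W002": ["ファイル", "を", "閉じる"],
    "#W003": ["図", "を", "保存", "する"],  # wording assumed for illustration
}


def build_table(commands):
    """Map each word to the identifiers of all arcs that carry it."""
    table = defaultdict(list)
    for command_id, words in commands.items():
        for pos, word in enumerate(words, start=1):
            table[word].append(f"{command_id}#{pos}#{len(words)}")
    return dict(table)


TABLE = build_table(COMMANDS)
for word, identifiers in TABLE.items():
    print(word, identifiers)
```

A word such as “を” maps to several identifiers because it occurs in more than one command, which is why the control unit has to keep all possibilities until a command is completed.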
[0029]
Here, suppose the user first utters “file” (ファイルを) through the voice input device 1. This voice is processed by the voice detection unit 2 and the voice recognition unit 3, and the voice recognition unit 3 obtains “file” as the recognition result. This recognition result is held in the recognition result holding unit 6. At this point, “file” is held in the recognition result holding unit 6 as shown in FIG. 6.
[0030]
The recognition result control unit 7 checks whether the recognition result held in the recognition result holding unit 6 is accepted by the original recognition dictionary 4. It first refers to the table 51 of FIG. 5, looks up the identifiers that are possible for the recognition result, and stores them in the recognition result holding unit 6 as additional information. The recognition result holding unit 6 then holds “#W001#1#3”, “#W001#2#3”, “#W002#1#3”, “#W002#2#3”, and “#W003#2#4”, as indicated by 111 in FIG. 6. In this state, the identifiers held in 111 cannot be combined into any original identifier, so the unit determines that no original command word has been recognized yet and notifies nothing to the recognition result processing unit 8.
[0031]
Next, suppose the user goes on to say “save” (保存), and that this “save” is a slip of the tongue. The input voice is processed in the same way, and the recognition result “save” is held in the recognition result holding unit 6 as shown in FIG. 6. The recognition result control unit 7 likewise adds the identifier information from the table 51 to the recognition result holding unit 6, so that the identifier “#W003#3#4” corresponding to “save” is held, as indicated by 112 in FIG. 6. The state of the recognition result holding unit 6 at this time is the combination of 111 and 112 in FIG. 6. Because no original identifier can be constructed in this state either, the recognition result control unit 7 determines that no original command word has been recognized yet.
[0032]
Next, suppose the user utters “close” (閉じる). The input voice is recognized in the same way, and the recognition result “close” is held in the recognition result holding unit 6 as shown in FIG. 6. The recognition result control unit 7 likewise adds the identifier information from the table 51, so that the identifier “#W002#3#3” corresponding to “close” is added to the recognition result holding unit 6, as indicated by 113 in FIG. 6. The state of the recognition result holding unit 6 at this time is the combination of 111, 112, and 113 in FIG. 6.
[0033]
Looking at the identifiers now stored in the recognition result holding unit 6, the three identifiers “#W002#1#3”, “#W002#2#3”, and “#W002#3#3” can be joined in order from the head to complete “#W002”. At this point, the recognition result control unit 7 therefore determines that the command word “#W002”, that is, “close file” (ファイルを閉じる), has been recognized, and passes the recognition result “close file” to the recognition result processing unit 8. The recognition result control unit 7 also empties the contents held in the recognition result holding unit 6.
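The completion check in this walkthrough can be sketched in Python. The text does not spell out the exact combination rule, so the in-order matching below (position 1 first, then 2, and so on, with unrelated identifiers ignored) is one assumed interpretation, and the names `parse` and `completed_command` are hypothetical.

```python
def parse(identifier):
    """Split '#W002#3#3' into ('#W002', 3, 3)."""
    command, position, total = identifier.lstrip("#").split("#")
    return "#" + command, int(position), int(total)


def completed_command(held):
    """Return the id of a command whose arcs 1..n were all seen in order.

    held: one list of identifiers per utterance, oldest first.
    """
    progress = {}  # command id -> number of leading positions matched so far
    for identifiers in held:
        for command, position, total in sorted(map(parse, identifiers)):
            if position == progress.get(command, 0) + 1:
                progress[command] = position
                if position == total:
                    return command
    return None


held = [
    ["#W001#1#3", "#W001#2#3", "#W002#1#3", "#W002#2#3", "#W003#2#4"],  # 111
    ["#W003#3#4"],                                                      # 112
    ["#W002#3#3"],                                                      # 113
]
print(completed_command(held[:1]))  # None
print(completed_command(held[:2]))  # None
print(completed_command(held))      # "#W002"
```

The stray “#W003#3#4” from the mistaken utterance never advances any command, so it is simply ignored when “#W002” is completed.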
[0034]
Thus, in this embodiment, an utterance that the user splits into “file”, “save”, and “close”, including superfluous speech such as slips of the tongue or hesitations along the way, is correctly recognized as the command word “close file”.
[0035]
The embodiments above describe command words that can be spoken in parts divided at word boundaries, but the present invention may also be a system in which command words can be divided at other boundaries, for example syllable boundaries. For example, given a recognition dictionary as shown in FIG. 7(a), the reading of each command word is decomposed into syllable units, and each arc of the network grammar is likewise decomposed into syllable units as shown in FIG. 7(b). The subsequent processing is the same as that performed on the recognition dictionary of FIG. 3. In such an embodiment, a command word can be recognized even when it is spoken in parts divided at syllable boundaries, such as “saku” and “sei”, or “hen” and “kou”.
[0036]
Besides word units and syllable units, the boundaries at which partial utterances are allowed may also be defined by phrases that the system specifies. Examples include defining phrases of about two syllables and decomposing command words by those phrases, or decomposing command words by bunsetsu (文節, Japanese phrasal units). For the former, phrases of about two syllables such as “おう” (ou), “こう” (kou), “そう” (sou), and “とう” (tou) are defined. Using these defined phrases as units, the command word “応答” (outou, “response”) is decomposed into “おう” and “とう”.
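Decomposition by system-defined phrases can be sketched in Python as a search for a sequence of defined units that exactly covers the reading of a command word. The patent does not specify the procedure, so the backtracking search and the function name `decompose` are assumptions; the unit list and the reading “おうとう” of “応答” come from the example above.

```python
UNITS = ["おう", "こう", "そう", "とう"]


def decompose(reading, units):
    """Split `reading` into a sequence of defined units, or return None."""
    if not reading:
        return []
    # Try longer units first, backtracking if the remainder cannot be split.
    for unit in sorted(units, key=len, reverse=True):
        if reading.startswith(unit):
            rest = decompose(reading[len(unit):], units)
            if rest is not None:
                return [unit] + rest
    return None


print(decompose("おうとう", UNITS))  # reading of 応答
print(decompose("おうた", UNITS))    # cannot be covered by the defined units
```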
[0037]
Extending the network grammar in this way makes it more complex, which may affect processing speed and recognition performance. One conceivable countermeasure is therefore to simplify the network grammar as shown in FIG. 8, for example by merging the arcs of the network grammar of FIG. 3(b) that carry the same command word.
[0038]
When the recognition dictionary is changed so that partial utterances of command words are also accepted, misrecognition may increase because there are more command words and more command words that resemble one another. For example, with a recognition dictionary such as that of FIG. 9(a), the command words “ファイル” (fairu, “file”) and “入る” (hairu, “enter”) are both accepted. Because the two are acoustically similar, an utterance of “ファイル” may be misrecognized as “入る”. One countermeasure is to use not only the first-ranked candidate of the recognition result but also the second-ranked and lower candidates. In this method, the voice recognition unit 3 stores a plurality of candidates in the recognition result holding unit 6 at recognition time.
[0039]
FIG. 9(b) shows an example of the recognition results held in the recognition result holding unit 6 in this case. Here, the user's utterance of “ファイル” is recognized as “入る” in the first-ranked candidate, and “ファイル” is recognized as the second-ranked candidate. In such a case, as shown in FIG. 9(b), the recognition result holding unit 6 holds the second-ranked candidate as well as the first, and the recognition result control unit 7 includes the second-ranked candidate when judging whether an original command word has been completed. The command is then recognized correctly, which copes with the increase in misrecognition. Similarly, for pairs such as “を” (o) and “と” (to), or “開く” (hiraku) and “入る” (hairu), as shown in FIG. 9(b), the first-ranked and second-ranked candidates are both held, and the command is recognized correctly by judging whether a combination that includes the second-ranked candidates forms an original command word.
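Using lower-ranked candidates when judging completion can be sketched in Python by trying every combination of one candidate per utterance. The contents of FIG. 9(a) and FIG. 9(b) are not reproduced in the text, so the command set, the candidate lists, and the function name `complete_with_candidates` below are assumptions built from the word pairs mentioned above.

```python
from itertools import product

COMMANDS = {"ファイルを開く"}  # assumed dictionary content for illustration


def complete_with_candidates(held, commands):
    """Return a command formed by picking one candidate per utterance.

    held: one ranked candidate list per utterance, oldest first.
    """
    for choice in product(*held):
        combined = "".join(choice)
        if combined in commands:
            return combined
    return None


# First-ranked candidates are misrecognitions; the correct words rank second.
held = [["入る", "ファイル"], ["と", "を"], ["入る", "開く"]]
first_only = [[candidates[0]] for candidates in held]

print(complete_with_candidates(first_only, COMMANDS))  # None
print(complete_with_candidates(held, COMMANDS))        # command recovered
```

Trying every combination grows quickly with the number of utterances and candidates, so a real system would probably bound both; that limit is not discussed in the text.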
[0040]
[Effects of the Invention]
As described above, according to the present invention, the command words registered in the recognition dictionary are decomposed into predetermined units, and when a plurality of recognition results are obtained, it is determined whether their combination forms an original command word. Therefore, the voice is recognized correctly even when the user does not utter a command word in a single utterance exactly as originally defined, but instead divides a long command word, or one that is hard to say in one breath, into several utterances.
[0041]
In addition, the voice detection unit determines the range of one utterance by its own judgment, regardless of the user's intention, and cuts out the voice data accordingly. Even when the voice detection unit thereby cuts out a single utterance as a plurality of utterances against the user's intention, a part of the original command word can be accepted, the recognition result holding unit holds the plurality of recognition results until the original command word is recognized, and it is determined whether their combination forms the original command word. Consequently, the speech can be recognized correctly even if the user hesitates or falters in the middle of an utterance and the resulting pause causes voice detection to split the utterance.
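Holding recognition results until the original command word is complete might look like the following Python sketch. It is one possible reading, not the control logic of the embodiment: the class name `RecognitionResultHolder`, the identifier layout (original identifier, number of arcs, order), and the rule that the most recent results must appear in order are assumptions.

```python
class RecognitionResultHolder:
    """Accumulates recognition results across separately detected utterances."""

    def __init__(self, table):
        self.table = table  # unit -> list of (original id, arc count, order)
        self.held = []      # possible identifiers, one list per utterance

    def add(self, unit):
        """Hold the result for one utterance; return a completed command or None."""
        self.held.append(self.table.get(unit, []))
        return self._completed()

    def _completed(self):
        # A command is complete when the latest results supply
        # orders 1..count of the same original command, in sequence.
        for original_id, count, order in self.held[-1]:
            if order != count or len(self.held) < count:
                continue
            tail = self.held[-count:]
            if all((original_id, count, i) in ids
                   for i, ids in enumerate(tail, start=1)):
                self.held.clear()
                return original_id
        return None


# Hypothetical table for a command split into two units.
table = {
    "file": [("open_file", 2, 1)],
    "open": [("open_file", 2, 2)],
}
holder = RecognitionResultHolder(table)
print(holder.add("file"))  # first part only, nothing is complete yet
print(holder.add("open"))  # second part completes the command
```

In this sketch a pause between "file" and "open" does not matter, because the first result stays in the holder until the second one arrives.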
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a first embodiment of a speech recognition system according to the present invention.
FIG. 2 is a diagram illustrating an example of a network grammar of a recognition dictionary used in the embodiment of FIG. 1 and a recognition dictionary in which the recognition dictionary is changed by a recognition dictionary changing unit.
FIG. 3 is a diagram illustrating an example of a network grammar of a recognition dictionary used in the second embodiment of the present invention and a recognition dictionary obtained by changing the recognition grammar using a recognition dictionary changing unit.
FIG. 4 is a diagram showing a recognition dictionary in which an arc having no command word is added to the network grammar of FIG. 3.
FIG. 5 is a table showing the correspondence between command words and identifiers.
FIG. 6 is a diagram illustrating an example of a recognition result held in a recognition result holding unit.
FIG. 7 is a diagram for explaining an example in the case of decomposing a command word in a recognition dictionary in units of syllables.
FIG. 8 is a diagram illustrating an example in which arcs of the same command word are collectively simplified for a recognition dictionary in which the command word is decomposed.
FIG. 9 is a diagram for explaining an example in a case where recognition processing is performed using a plurality of recognition candidates.
FIG. 10 is a block diagram showing a configuration of a conventional speech recognition system.
[Explanation of symbols]
1 Voice input device
2 Voice detection unit
3 Voice recognition unit
4 Recognition dictionary
5 Recognition dictionary changing unit
6 Recognition result holding unit
7 Recognition result control unit
8 Recognition result processing unit

Claims

A speech recognition system comprising:
voice input means;
means for detecting voice from input data received from the voice input means;
a recognition dictionary in which command words are registered, each represented by a single arc connecting the first node to the last node and each assigned an identifier;
means for decomposing the command words in the recognition dictionary into predetermined units for accepting utterances, changing the network grammar so that each predetermined unit becomes an independent arc, and creating a table that assigns to each arc of the decomposed command words an identifier indicating the original arc, the number of arcs into which the original arc was divided, and the order of the arc counted from the beginning of the original arc;
means for recognizing the voice detected by the detecting means and holding a recognition result;
means for storing, with reference to the table, the identifiers that the recognition result can take; and
means for combining the stored identifiers and recognizing the command word when the result of the combination matches the original command word registered in the recognition dictionary.

The speech recognition system according to claim 1, wherein the predetermined unit for decomposing the command word in the recognition dictionary is a word unit, a syllable unit, a phrase unit, or a predefined phrase unit.

The speech recognition system according to claim 1, wherein a plurality of candidates are held as recognition results for the speech detected by the detecting means, and whether the original command word is matched is determined based on the plurality of candidates.