JP2004117771A

JP2004117771A - Method and device for preparing dictionary for speech recognition, method and device for speech recognition, speech recognition program, and voice recognition system

Info

Publication number: JP2004117771A
Application number: JP2002280300A
Authority: JP
Inventors: Hiroyuki Aizu; 会津　宏幸
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2002-09-26
Filing date: 2002-09-26
Publication date: 2004-04-15

Abstract

PROBLEM TO BE SOLVED: To provide a method and a device for preparing a dictionary for speech recognition that achieve a higher recognition rate by simplifying the selection of words and sentences to be stored in the erroneous recognition countermeasure dictionary needed for speech recognition and by selecting words and sentences prone to erroneous recognition from commonly used word groups, and to provide a speech recognition device and a speech recognition system equipped with this method and device.

SOLUTION: In the speech recognition device, terms with a prescribed reading length or shorter are extracted from a database listing everyday terms consisting of words and sentences, excluding terms that are judged to be correct speech recognition results, and the erroneous recognition countermeasure dictionary is prepared from them. When speech recognition yields a term entered in this dictionary, that term is excluded from the recognition result.

COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、音声認識装置及び音声認識システムに関するものであって、特に音声認識に使用する辞書の作成に関するものである。
【０００２】
【従来の技術】
音声による操作指示を認識し機器を制御する技術として音声認識技術がある。音声認識は人が発する音声という不安定な条件に基づいて機器を制御するものであって、キーボードを叩く、操作スイッチを入れるといった電気的に安定した信号で制御するものではない。一例を挙げれば、声を聞いただけでそれが誰であるかを判別できるように、音声には大きな個人差が存在する。さらには同一人物が同じ単語を発声したとしても、日によって聞こえ方が異なることもあるほど不安定なものである。
【０００３】
このため従来では音声認識する単語から特徴を抽出してパラメータ化し、入力された音声信号と比較することで、このような不安定要素を極力排除した音声認識手法がとられるようになってきた（たとえば、特許文献１を参照）。しかしながら特許文献１の方法による音声認識装置でも未だ認識できない場合があり、発声の仕方や周囲の雑音などによっては、発声した単語とは違う単語と認識してしまうことも少なからずあった。
【０００４】
たとえば特許文献２のように、認識したい単語を登録した音声認識辞書とは別に、認識結果から排除する単語を登録した辞書、つまり誤認識対策辞書を設ける方法をとるようになった。誤って認識しやすいが、およそ操作指示とはなりえない単語を予め登録しておくことにより、誤って認識した単語を事前に察知し認識結果から排除するものである。
【０００５】
しかしながら特許文献２の場合、誤認識しやすい単語の抽出を［００１４］段落に記載された方法で行っている。まず認識したい単語について読みの上げたときの文字ごとに母音を抽出し、その母音の数を文字の出現位置ごとに累計する。文字の出現位置で見たときに、その出現位置に出現しない母音をもつ単語は認識したい単語ではないはずだから、それぞれの母音の出現位置にこのような文字を持つ単語を作成して、これを誤認識対策辞書として蓄積する。
【０００６】
この方法は認識したい単語が多くなると母音の抽出や検討に時間がかかり、処理が複雑になってしまうといった問題がある。また音声認識したい単語はある機器を操作するための操作指示となりうる単語であるから、その数はおのずと限られるはずである。その音声認識したい単語をもとに誤認識対策辞書に登録する単語を生成するのだから、生成する単語のバリエーションもおのずと限定される。
【０００７】
しかしながら、本来音声認識は利用者の生活の中で使用されるものであるから、誤認識しやすい単語を効果的に排除するためには、利用者が日常的に使用している単語全般から、誤認識対策辞書に登録する単語を選択すべきである。
【０００８】
また特許文献２の方法でも誤認識対策辞書に登録する単語数を数多く生成すれば広い範囲の単語をカバーできるが、誤認識対策辞書が必要とする記憶量はできうる限り小さくしたいのが普通である。単語を大量に生成したとき、それが普段使用する可能性のない単語ばかりであれば記憶量は無駄に増加するばかりである。
【０００９】
【特許文献１】特開平１１−１４３４８５号公報
【００１０】
【特許文献２】特開２００１−１８４０８５公報
【００１１】
【発明が解決しようとする課題】
本発明は音声認識に必要な誤認識対策辞書へ記憶する用語の選定を簡略化し、かつ誤認識されやすい用語を常用される用語群から選定することで、より高い認識率を実現する音声認識用辞書の作成方法、作成装置、音声認識方法及びこれらを利用した音声認識装置、音声認識システムを提供することを目的とする。
【００１２】
【課題を解決するための手段】
本発明の方法によれば、
音声認識結果が誤認識であると判断するために使用する誤認識対策辞書を作成する音声認識用辞書作成方法であって、
日常用語を列記したデータベースから、用語の読みが所定の長さ以下の用語を抽出し、
音声認識した結果、所定の単語や文章からなる用語が得られたら正しい音声認識が行われたと判断する前記所定の用語を列記した音声認識辞書を用いて、
前記抽出した用語のうち、前記音声認識辞書に列記した用語の読みの文字並び又は文字長とが異なる用語を選択し
この選択した用語を用いて誤認識対策辞書を作成することを特徴とする音声認識用辞書作成方法
が提供される。
また、
音声認識結果として正しいと判断する用語を列記した音声認識辞書と、音声認識結果が誤認識であると判断するために使用する誤認識対策辞書とを用いた音声認識方法であって、
単語や文章からなる日常用語を列記したデータベースから、用語の読みが所定の長さ以下の用語を抽出し、
この抽出した用語のうち、前記音声認識辞書に列記した用語の読みの文字並び又は長さとが異なる用語を選択し
この選択した用語を用いて前記誤認識対策辞書を作成し、
音声認識を行う場合に、
音声認識した結果について、前記誤認識対策辞書に列記したものと同じ用語を示す結果であったときは、この結果を認識結果から排除する
ことを特徴とする音声認識方法
が提供される。
及び、これらの方法を実現する音声認識用辞書作成装置、音声認識装置、音声認識プログラム及び音声認識システムが提供される。
【００１３】
【発明の実施の形態】
図１に本発明の１実施形態にかかる音声認識システムの一例を示す。図１には集音装置１０１、音声認識装置１０２、出力装置１０３、辞書作成装置１０４及びネットワーク１０５が示されている。集音装置１０１と音声認識装置１０２及び辞書作成装置１０４はネットワーク１０５を介して通信ができるように接続されている。
【００１４】
集音装置１０１はマイクロフォンなどで周囲の音声などの音情報を集め、この音情報を音声信号としてネットワークを介して音声認識装置１０２へ送信する機能を持つ。
【００１５】
音声認識装置１０２は、集音装置１０１が送信した音声信号を受信し、その信号に含まれる音声を解析する。解析の結果から音声であると判断した場合には、その音声がいかなる語であるかを認識し、その認識結果に基づいた信号を出力する機能を持つ。
【００１６】
出力装置１０３は、音声認識装置１０２の出力信号を受けて、音声認識結果にしたがった動作をする装置である。たとえばディスプレイ装置やブザーといった表示機器、テレビといった家電製品、あるいはリレーといったスイッチに相当する。
【００１７】
辞書作成装置１０４は音声認識装置１０２が音声認識に使用する辞書を作成する機能を有する。作成した辞書はネットワーク１０５を介して音声認識装置１０２へ送信される。
【００１８】
ネットワーク１０５は集音装置１０１、音声認識装置１０２及び辞書作成装置１０４との間の通信を実現する。このネットワークにはたとえば、Ｅｔｈｅｒｎｅｔ（Ｒ）といった有線ＬＡＮ、ＩＥＥＥ８０２．１１ｂで規定されるような無線ＬＡＮ、近距離通信を目的として開発されたＢｌｕｅｔｏｏｔｈ^（ＴＭ）、または赤外線通信規格であるＩｒＤＡ（Ｉｎｆｒａｒｅｄ　Ｄａｔａ　Ａｓｓｏｃｉａｔｉｏｎ）といったものも適用可能である。通信の形態は上記したものに限られず、これらが互いに通信ができるネットワークであれば良い。
【００１９】
図２に本発明の１実施形態にかかる音声認識装置１０２のブロック構成図の一例を示す。図２には音声処理部２０１、認識部２０２、出力部２０３、通信処理部２０４、辞書更新部２０５、認識辞書２０６及び誤認識対策辞書２０７が示されている。
【００２０】
音声処理部２０１は集音装置１０１から得られた音声信号を音声認識に適した信号へ整形する機能を持っている。まず集音装置１０１が送信した音声信号が、通信処理部２０４によって音声処理部２０１へ入力される。この音声信号は通信時に紛れ込んだひずみやノイズが乗っている。音声処理部２０１は、このようなひずみを修正すると共に、フィルタ回路を介して音声認識に不要なノイズを除去する。その後、出力信号レベルなどを整え認識部２０２へ出力する。
【００２１】
認識部２０２は音声処理部２０１が出力する音声信号を受けて、この音声信号に含まれる音声を認識する機能を持つ。音声信号にどのような単語や文章の信号が含まれているかを判断するために行う信号解析の方法が多く知られている。本実施形態では、たとえば特許文献１に記載されるような単語を発声したときの音声信号に含まれる特徴を抽出した特徴パラメータによる解析方法などを利用すればよい。本実施形態の認識部２０２では、音声信号からこれに含まれる発声された単語や文章を解析する方法は、上記した方法に限らずどのような方法であってもかまわない。音声信号に含まれる発声された単語あるいは文章について、単語あるいは文章が抽出できる方法であれば用いることができる。
【００２２】
抽出した単語あるいは文章は、誤認識対策辞書２０７に記憶した情報と突き合わされる。突合せの結果該当するものが無い場合には、さらに認識辞書２０６に列記された情報と突き合わされる。突合せた結果、ここに列記されている単語や文章と同じものがあれば、抽出された単語あるいは文章は正しい指示であると判断する。このように判断した単語や文章による指示以外は、正しい指示ではないと判断する。
【００２３】
特徴抽出した単語や文章が正しい指示であると判断されたときは、出力部２０３に対し、その単語や文章によって実行される処理に沿う信号を出力する。
【００２４】
出力部２０３は認識部２０２の音声認識結果の信号を受けて、この信号に基づいた動作を行う機能を持つ。出力部２０３は、音声認識装置１０２に接続した出力装置１０３の制御をする。たとえば出力装置１０３の電源の入り切りなどを行う。出力装置１０３がテレビであれば電源の入り切りのほかに、受像チャンネルを切り替える、音量を上げるといった操作を行っても良い。
【００２５】
通信処理部２０４はネットワーク１０５を介して、集音装置１０１や辞書作成装置１０４と通信する機能を持つ。集音装置１０１から送信された音声信号については音声処理部２０１へ出力する。辞書作成装置１０４で作成された辞書情報を受信したときは、受信した辞書情報を辞書更新部２０５へ向けて出力する。
【００２６】
辞書更新部２０５は、辞書作成装置１０４からネットワーク１０５を介して得られた辞書情報を元に認識辞書２０６及び誤認識対策辞書２０７を更新する機能を持つ。
【００２７】
認識辞書２０６は認識部２０２で音声認識を行うときの、操作指示である単語や文章が記憶された、たとえばメモリといった記憶装置からなる辞書である。認識部２０２で音声信号に含まれる単語や文章を抽出したときに参照され、この辞書に列記されている単語や文章と同じものがある場合には、認識部２０２が音声による正しい指示がなされたと判断する。
【００２８】
たとえば出力装置１０３がテレビである場合、テレビの音量を上げるために「大きく」と発声するとする。認識部２０２は、利用者が発した「大きく」という音声信号を受けると、この単語や文章が認識辞書２０６にあるときはテレビの音量を大きくするように出力部２０３を介して出力装置１０３に信号を出力する、利用者が音声による指示をするとき、発声する単語や文章によって出力装置１０３に対して行う操作が異なるのが普通なので、列記された音声や文章ごとに認識部２０２が出力部２０３に与える出力信号（コマンド）も関連付けられて列記されている。好ましくは認識辞書２０６に記憶する単語や文章、及び出力信号を入れ替えられるようにすべきである。このように構成すれば、認識したい単語や文章、及び出力信号が異なる出力装置１０３であっても適用することができる。
【００２９】
認識部２０２の音声認識の際、認識辞書２０６に予め記憶されていない単語や文章については、正しい指示とは認めない。
【００３０】
誤認識対策辞書２０７は、認識部２０２で特徴抽出した単語や文章のうち、明らかに出力装置１０３への操作指示とはならない単語や文章を列記したものである。ここに記憶する単語や文章は、認識部２０２による単語や文章の特徴抽出後、出力装置１０３への操作指示ではないとしてまず先に認識部２０２によって排除されるべき単語や文書である。誤認識対策辞書２０７に記憶する単語や文章の選定については以降の辞書情報作成方法で説明する。好ましくは誤認識対策辞書２０７に記憶する単語や文章は入れ替えられるようにすべきである。このように構成すれば、出力装置１０３ごとあるいは任意に誤認識と判定したい単語や文章を変えたい場合でも適用することができるようになる。
【００３１】
図３は本発明の１実施形態にかかる辞書作成装置１０４のブロック構成図の一例を示している。図３には単語辞書３０１、誤認識対策辞書作成部３０２、認識辞書作成部３０３、通信制御部３０４、認識辞書情報３０５及び誤認識対策辞書情報３０６が示されている。
【００３２】
単語辞書３０１は日常会話で用いられている単語や文章を列記したデータベースである。ここに記憶する単語や文章は、たとえば市販されている国語辞書の掲載項目とすることができる。市販されている国語辞書は、普段の会話中で使用されている頻度が高い単語や文章が多く記載されている。日常会話等で発声される単語や文章をカバーするには好適である。市販の国語辞書の掲載項目から、話し言葉として発声する頻度が低いと思われるものを除いた単語辞書３０１を作成し、これを使用することもできる。このようにすると単語辞書３０１を構成するデータベースの容量を減少することができる。単語辞書３０１は磁気ディスク装置、メモリ、ＣＤ−ＲＯＭといったあらゆる記憶装置で構成することができる。
【００３３】
誤認識対策辞書作成部３０２は、単語辞書３０１に記憶された単語や文章から所定の規則にしたがって単語や文章を選択し、誤認識対策辞書情報３０６に記憶する機能を持つ。誤認識対策辞書作成部３０２が行う、単語辞書３０１から単語や文章を抽出する規則については、以降の辞書情報作成方法で説明する。
【００３４】
認識辞書作成部３０３は、音声認識装置１０２が行う音声認識で認識したい単語や文章を、認識辞書情報３０５に記憶する機能を持つ。認識したい単語や文章とは、音声認識装置１０２によって出力装置１０３を制御するために利用者が発する単語や文章のことである。認識辞書作成部３０３による認識辞書情報３０５へ単語や文章を入力する方法は、辞書作成装置１０４に備えた図示しないキーボードから打ち込む方法が考えられる。制御対象である出力装置１０３が定まれば、音声による制御に適した単語や文章が決まってくる。前出のテレビを例にとると、「つける」、「切る」、「○チャンネル」、「大きく」、「小さく」といった単語や文章を定めれば基本的な制御には十分である。
【００３５】
制御対象に応じて音声認識で認識したい単語や文章、およびそれらが音声信号として与えられたとき出力装置１０３にどのような出力信号を出力するかが決まったところで、利用者あるいは開発者がこれらの情報を前出のキーボードを叩き入力する。入力した情報は認識辞書作成部３０３が処理し、認識辞書情報３０５に記憶する。
【００３６】
あるいは認識辞書作成部３０３が、ネットワーク１０５を介して集音装置１０１から認識辞書情報３０５に記憶する単語や文章の音声信号を取得し、これを音声認識したものを認識辞書情報３０５に記憶するように構成しても良い。
【００３７】
通信処理部３０４はネットワーク１０５を介して集音装置１０１及び音声認識装置１０２と通信する機能を持つ。通信する情報には、集音装置１０１から送信される音声信号や、認識辞書情報３０５及び誤認識対策辞書情報３０６を音声認識装置１０２に送信する辞書情報がある。
【００３８】
認識辞書情報３０５は、認識辞書作成部３０３によって、音声認識装置１０２に出力装置１０３への制御指示として認識したい単語や文章が列記されるデータベースである。認識辞書情報３０５は磁気ディスク、メモリ、ＣＤ−ＲＯＭといったあらゆる記憶装置により構成することができる。
【００３９】
認識辞書情報３０５に列記される単語や文章は、制御対象である出力装置１０３に応じて入れ替えられるようになっている。このように構成すれば制御対象を制御するために発声する単語や文章、および出力信号が異なっていても本実施形態のまま適用することができる。
【００４０】
認識辞書情報３０５に記憶した単語や文章の情報は、通信処理部３０４が、音声認識装置１０２が備える認識辞書２０６の更新のための辞書情報としてネットワーク１０５を介して送信する。
【００４１】
誤認識対策辞書３０６は、誤認識対策辞書作成部３０２によって、音声認識装置１０２に認識結果から排除したい単語や文章の読みが列記されるデータベースである。誤認識対策辞書３０６は磁気ディスク、メモリ、ＣＤ−ＲＯＭといったあらゆる記憶装置により構成することができる。
【００４２】
誤認識辞書情報３０６に列記される単語や文章は、必要に応じて書き換え可能になっている。このように構成すれば制御対象を制御する際に認識したくない単語や文章が変わっても本実施形態のまま適用することができる。
【００４３】
誤認識対策辞書情報３０６に記憶した単語や文章の情報は、認識辞書情報３０５と同様に、通信処理部３０４によりネットワーク１０５を介して音声認識装置１０２が備える誤認識対策辞書２０７の更新のための辞書情報として送信される。
【００４４】
図４は本発明の１実施形態における集音装置１０１のブロック構成図の一例を示す図である。図４には、マイクロフォン４０１、音声処理部４０２及び通信処理部４０３が示されている。
【００４５】
マイクロフォン４０１は周囲の音声を音声信号に変換し、音声処理部４０２に出力する。音声信号には、利用者が出力装置１０３を制御するために発した単語や文章の音声信号が含まれている。
【００４６】
音声処理部４０２は、マイクロフォン４０１から得られた音声信号からひずみを修正すると共に、フィルタ回路を介して音声認識に不要なノイズを除去する。その後、ノイズが除去された信号の音声信号出力の信号レベルなどを整え、通信処理部４０３へ出力する。
【００４７】
通信処理部４０３は音声処理部４０２が出力した音声信号を、ネットワーク１０５を介して音声認識装置１０２及び辞書作成装置１０４に送信する機能を持つ。
【００４８】
次に、図５に本発明の１実施形態における辞書作成装置１０４の辞書情報の作成方法を示す動作フローの一例を示す。
【００４９】
誤認識対策辞書情報３０６の作成に先立って、最初に認識辞書情報３０５を作成する。利用者あるいは開発者は、出力装置１０３を制御する音声で使用する単語や文章を決め、この単語や文章によって出力装置１０３にどのような出力信号を出力すべきかを定める。これらが定まったとことで利用者あるいは開発者は認識辞書作成部３０３を介して、認識辞書情報３０５に単語や文章、出力信号の情報を入力する（ステップＳ０１）。
【００５０】
つぎに音声認識装置１０２で操作指示として認識しない単語や文章である誤認識単語を、単語辞書３０１から抽出する（ステップＳ０２）。
【００５１】
最終的に誤認識対策辞書情報３０６に列記される単語や文章は、次にあげる点を考慮して抽出する。
ａ）操作指示として使用する単語や文章を除いたもの。
ｂ）操作指示ではない単語や文章であるにもかかわらず、音声認識装置１０２で操作指示と認識されるもの。
ｃ）単語長が短いために、発声したときの特徴情報が少ないもの。
【００５２】
ａ）は、認識させたい単語や文章は、どのような事情があるにせよ誤認識対策辞書情報３０６及び誤認識対策辞書２０７には含めないという意味である。
【００５３】
ｂ）は発声した単語や文章が、操作指示では無いにもかかわらず音声認識装置１０２に操作指示として認識され易い単語や文章であった場合に、これらの単語や文章を個別に抽出対象とする。こうすることにより音声認識装置１０２によって誤って認識されやすい単語や文章を積極的に排除することができる。
【００５４】
ｃ）は、ある単語や文章を発声したときに、その音声信号に含まれる特長点が少なく音声認識の際に他の単語や文章と分別がつきにくいものを誤認識対策辞書情報として抽出するという意味である。
【００５５】
ステップＳ０２では上記のｃ）の点に注目して単語や文章を抽出する。本実施形態では、音声信号に含まれる特長点が少ないという基準を、１乃至４文字で記述される単語や文章とする。単語辞書３０１に列記された単語や文章の一部から抽出候補を選定する一例を図６に示す。図６の表は単語辞書３０１に列記した単語の読みの文字並びを示す登録単語の欄と、その登録単語のそれぞれの文字長、及び誤認識単語の判定結果である誤認判定の欄が示されている。これ以降、音声信号に含まれる特長点が少ないと判断される単語や文章を、誤認単語と呼ぶ。
【００５６】
誤認単語と判断された登録項目は「あーす」、「あーち」、「あい」、「あいかぎ」、「あいけん」、「あおかび」及び「あおぐ」の８個である。誤認識対策辞書作成部３０２は、この８個の単語は発声したときに音声的な特徴が少なく、音声認識装置１０２が出力装置１０３への操作指示を含む他の単語や文章と誤認識しやすい単語だと判断する。一方、「あいいれない」、「あいかわらず」、「あいえんか」、「あいきどう」、「あいきょう」、「あいしょう」及び「あおざめる」の７個は誤認識しにくい単語と判断し、誤認識対策辞書情報３０６には記憶しない。
【００５７】
このように単純に文字数を基準に、他の単語や文章と誤認識されやすい単語や文章を抽出すると、誤認識対策辞書２０７の作成が容易になる。また、音声によって操作を指示する利用者が日常的に使用する単語群や文章を列記した単語辞書３０１から抽出すれば、発声される確率の高い単語や文章をカバーした誤認識対策辞書２０７とすることが可能となる。
【００５８】
本実施形態では特徴点が少ない単語や文章の判断基準を読みで１乃至４文字で表現されるものとした。たとえば、音声認識装置１０２の、音声信号に含まれる単語や文章の特徴抽出能力が高く音声認識率が高い場合は、誤認識対策辞書情報３０６として抽出する単語や文章の長さを１乃至３文字あるいは１乃至２文字とし、誤認単語として用意する単語や文章の数を減らすことができる。
【００５９】
利用者が音声認識装置１０２に求める音声認識率にもよるが、誤認単語に該当する文字長を減ずると誤認識対策辞書情報３０６に列記する単語や文章の数を削減することが可能である。記憶数が削減されることにより、誤認識対策辞書情報３０６及び誤認識対策辞書２０７に必要な記憶容量をより小さいもので構成することもできる。
【００６０】
つぎに単語辞書３０１から誤認識対策辞書作成部３０２がすべての誤認単語を抽出したかを判断する（ステップＳ０３）。すべての誤認単語の抽出が終わったら、認識辞書情報３０５と誤認識対策辞書情報３０６に列記された単語や文章の情報を音声認識装置１０２に送信する（ステップＳ０６）。
【００６１】
一方、単語辞書３０１にまだ誤認単語がある場合は、抽出した誤認単語が認識辞書情報３０５に含まれているかを判断する（ステップＳ０４）。これはステップＳ０２の説明で既述したように、操作指示に該当するものは誤認識対策辞書には含めないというａ）のために行う処理である。このとき、誤認識対策辞書作成部３０２は抽出した誤認単語が認識辞書情報３０５にある場合には、この誤認単語の処理を止め新たな誤認単語を抽出する（ステップＳ０２）。
【００６２】
ステップＳ０４での誤認単語の同一判断は、単語や文章の読みの文字並び及び文字長で見たときに、その両方が同じものを同じ単語や文章であると判断する。このように単に読みが同じかどうかによって単語や文章の同一性が判断される。
【００６３】
誤認単語が認識辞書情報３０５にない単語や文章であると判断したときは、この誤認単語をその読みと共に誤認識対策辞書情報３０６に記憶する（ステップＳ０５）。記憶した後は再び単語辞書３０１から新たな誤認単語を抽出する（ステップＳ０２）。
【００６４】
続いて、図７に本発明の１実施形態にかかる音声認識装置１０２の動作フローを示す。音声認識に先立って、辞書作成装置１０４から送信された辞書情報により、認識辞書２０６及び誤認識対策辞書２０７の更新が正常に終わっているものとする。
【００６５】
認識部２０２は、集音装置１０１が送信した音声信号を受けて、この音声信号に含まれる音声の特徴から単語や文章を抽出する（ステップＳ１１）。
【００６６】
次に認識部２０２は音声信号に含まれる発声された単語や文章を正常に抽出できたかどうかを判断する（ステップＳ１２）。判断の結果、正常に単語や文章を抽出できなかったときは次に送られてくる音声信号の音声認識処理を行う（ステップＳ１１）。
【００６７】
入力音声の単語や文章が正常に抽出できたときは、認識部２０２は抽出した単語や文章が誤認識対策辞書２０７に列記されているかどうかを判断する（ステップＳ１３）。ここでの判断は、単語や文章の読みの文字並び及び文字長が同じであるかどうかによって、単語や文章同士の同一性を判断する。列記されているときは、この音声信号に含まれる単語や文章は誤認単語、つまり誤って認識した可能性が高いと判断されるため処理を中止し、次に送られてくる音声信号の音声認識処理を行う（ステップＳ１１）。
【００６８】
認識部２０２は、抽出した単語や文章が誤認識対策辞書２０７に列記されていないと判断したときは、次に抽出した単語や文章が認識辞書２０６に列記されているかどうかを判断する（ステップＳ１４）。認識辞書２０６に列記されていない単語や文書であれば、出力装置１０３への操作指示ではないから処理を中止し、次に送られてくる音声信号の音声認識処理を行う（ステップＳ１１）。
【００６９】
認識辞書２０６に列記されている単語や文章の中に、抽出した単語や文章と同じものがあったときは、その単語や文章と関連付けられた出力信号情報を認識辞書２０６から取り出す。そして取り出した出力信号を出力部２０３を介して出力装置１０３へ出力する（ステップＳ１５）。
【００７０】
上記したような構成とすることで、音声認識システムで使用する誤認識対策辞書の作成が容易になる。また、音声によって操作を指示する利用者が日常的に使用する単語群や文章を単語辞書から誤認単語を抽出するので、発声される確率の高い単語や文章をカバーした誤認識対策辞書とすることができる。よって誤認識対策辞書の作成コスト低減し、かつ音声認識率の向上を実現することができる。
【００７１】
（実施形態の変形例１）
図１に示した集音装置１０１、音声認識装置１０２及び辞書作成装置１０４の機能を一つの筐体で実現することもできる。このとき誤認識対策辞書２０７と誤認識対策辞書情報３０６、および認識辞書２０６と認識辞書情報３０５をそれぞれ１つの記憶装置で実現しても良い。誤認識対策辞書作成部３０２は誤認識対策辞書２０７を直接更新し、認識辞書作成部３０３も同様に認識辞書２０６を直接更新するようにすれば良い。
【００７２】
またネットワーク１０５を介した更新も不要となるので、この場合は通信制御部２０４、３０４及び４０３と、辞書更新部２０５を構成に含めなくとも、本発明にかかる音声認識システムを構成できる。
【００７３】
加えて、出力装置１０３に音声認識装置１０２の機能を含めるなど、図１に示したそれぞれの構成要素を適宜組み合わせて構成しても本発明の効果が得られることに変わりは無い。
【００７４】
たとえば家庭内の複数の場所に集音装置１０１を設置し、各々の集音装置１０１が送信する音声信号を一つの音声認識装置１０２で処理すれば、家庭内のどこからでも操作指示ができる環境を利用者に提供することができる。または集音装置１０１と音声認識装置１０２の機能を一つの筐体に収納すれば、音声認識機能を持った携帯型のたとえばリモコン装置とすることもできる。
【００７５】
このように構成すると、音声認識システムの利用場面や製造コストに応じて適切なシステムの構成とすることができる。
【００７６】
（実施形態の変形例２）
辞書作成装置１０４で作成した認識辞書情報３０５及び誤認識対策辞書情報３０６を、通信処理部３０４及び２０４を介して、辞書更新部２０５により認識辞書２０６及び誤認識対策辞書２０７を更新した後に、音声認識装置１０２によって認識辞書２０６及び誤認識対策辞書２０７に情報を追加する場合である。
【００７７】
たとえば操作指示に使用する単語や文章の一部を変えたいとき、または誤認識対策辞書作成部３０２で抽出されなかった単語や文章を誤認単語として誤認識対策辞書２０７に記憶したいときに、わざわざ辞書作成装置１０４で辞書情報の再作成をしたくない、あるいは辞書作成装置１０４が遠方に設置されているなどで辞書情報の再作成ができないときに有効である。
【００７８】
図８に本変形例にかかる音声認識装置１０２のブロック構成図の一例を示す。図２に示したブロック構成図との違いは、入力部２０８が新たに追加された点である。
【００７９】
入力部２０８はたとえばキーボードであり、認識辞書２０６や誤認識対策辞書２０７に列記されている情報を更新するために、利用者がこのキーボードから入力した情報が辞書更新部２０５に出力される。辞書更新部２０５は、入力部２０８から得られた情報をもとに認識辞書２０６や誤認識対策辞書２０７を更新する。
【００８０】
あるいは辞書更新部２０５が、ネットワーク１０５を介して集音装置１０１からそれぞれの辞書に列記する単語や文章の音声信号を取得し、これを解析した単語や文章を更新用の辞書情報として用いても良い。
【００８１】
なお、上記した本発明の１実施形態及び変形例の構成は実施形態に挙げた構成限られず、同様の機能をもつ構成によって一部あるいは全部を置き換えても、本発明の効果を得ることができる。また本発明の実施形態に示した音声認識装置１０２及び辞書作成装置１０４の動作フローを実現するプログラムコードを実装した計算機によって構成しても同様に本発明の効果を得ることができる。
【００８２】
【発明の効果】
音声認識に必要な誤認識対策辞書へ記憶する単語の選定を簡略化し、かつ誤認識されやすい単語や文章を常用される単語群から選定することで、より高い認識率を実現する音声認識用辞書の作成方法、作成装置及びこれらを備えた音声認識装置、音声認識システムとすることができる。
【図面の簡単な説明】
【図１】本発明の実施形態における音声認識システムの一例を示す図である。
【図２】本発明の実施形態における音声認識装置１０２のブロック構成図の一例を示す図である。
【図３】本発明の実施形態における辞書作成装置１０４のブロック構成図の一例を示す図である。
【図４】本発明の実施形態における集音装置１０１のブロック構成図の一例を示す図である。
【図５】本発明の実施形態における辞書作成装置１０４の動作フローの一例を示す図である。
【図６】本発明の実施形態における誤認単語の選定方法の一例を示す図である。
【図７】本発明の実施形態における音声認識装置１０２の動作フローの一例を示す図である。
【図８】本発明の実施形態の変形例１における音声認識装置１０２のブロック図の一例を示す図である。
【符号の説明】
１０１　　　　　集音装置
１０２　　　　　音声認識装置
１０３　　　　　出力装置
１０４　　　　　辞書作成装置
１０５　　　　　ネットワーク
２０２　　　　　認識部
２０５　　　　　辞書更新部
２０６　　　　　認識辞書
２０７　　　　　誤認識対策辞書
２０８　　　　　入力部
３０１　　　　　単語辞書
３０２　　　　　誤認識対策辞書作成部
３０６　　　　　誤認識対策辞書情報
[0001]
[Technical field of the invention]
The present invention relates to a speech recognition device and a speech recognition system, and more particularly to creation of a dictionary used for speech recognition.
[0002]
[Prior art]
Speech recognition is a technology for recognizing spoken operation instructions and controlling a device accordingly. It controls a device on the basis of an inherently unstable input, the human voice, rather than an electrically stable signal such as a keystroke or the flip of an operation switch. For example, voices differ so much between individuals that a listener can tell who is speaking from the voice alone. Moreover, even when the same person utters the same word, it may sound different from one day to the next.
[0003]
For this reason, conventional speech recognition methods extract features from each word to be recognized, convert them into parameters, and compare these with the input audio signal, so as to eliminate such instability as far as possible (see, for example, Patent Document 1). However, even a speech recognition device using the method of Patent Document 1 still fails to recognize some utterances, and depending on the manner of utterance and the surrounding noise, it not infrequently recognizes a word different from the one uttered.
[0004]
Consequently, as in Patent Document 2 for example, a method came to be adopted in which, separately from the speech recognition dictionary that registers the words to be recognized, a dictionary registering words to be excluded from the recognition result, that is, a misrecognition countermeasure dictionary, is provided. By registering in advance words that are easily recognized by mistake but could hardly be operation instructions, an erroneously recognized word is detected beforehand and excluded from the recognition result.
[0005]
However, Patent Document 2 extracts easily misrecognized words by the method described in its paragraph [0014]. First, for each word to be recognized, the vowel of each character of its reading is extracted, and the vowels are tallied for each character position. A word that has, at some character position, a vowel that never appears at that position cannot be a word to be recognized, so words having such characters at the respective positions are generated and stored as the misrecognition countermeasure dictionary.
[0006]
This method has a problem that if the number of words to be recognized increases, it takes time to extract and examine vowels, and the processing becomes complicated. In addition, words to be recognized by voice are words that can be used as operation instructions for operating a certain device, and the number of words should be naturally limited. Since the words to be registered in the misrecognition countermeasure dictionary are generated based on the words to be recognized, the variations of the words to be generated are naturally limited.
[0007]
However, speech recognition is by nature used in the course of the user's daily life. To eliminate easily misrecognized words effectively, the words registered in the misrecognition countermeasure dictionary should therefore be selected from the full range of words that the user uses every day.
[0008]
The method of Patent Document 2 can also cover a wide range of words by generating a large number of words to register in the misrecognition countermeasure dictionary, but the storage that this dictionary requires should normally be kept as small as possible. If a large number of words are generated and most of them are words unlikely to be used in practice, the storage requirement simply grows to no purpose.
[0009]
[Patent Document 1] JP-A-11-143485
[0010]
[Patent Document 2] JP-A-2001-184085
[0011]
[Problems to be solved by the invention]
An object of the present invention is to provide a method and a device for creating a dictionary for speech recognition, a speech recognition method, and a speech recognition device and a speech recognition system using them, that achieve a higher recognition rate by simplifying the selection of terms to be stored in the misrecognition countermeasure dictionary required for speech recognition and by selecting easily misrecognized terms from a group of commonly used terms.
[0012]
[Means for Solving the Problems]
According to the present invention, there is provided
a speech recognition dictionary creation method for creating a misrecognition countermeasure dictionary used to determine that a speech recognition result is a misrecognition, the method comprising:
extracting, from a database that lists everyday terms, terms whose reading is no longer than a predetermined length;
using a speech recognition dictionary that lists predetermined terms, each consisting of a word or sentence, whose appearance as a speech recognition result is taken to mean that speech was recognized correctly,
selecting, from the extracted terms, those terms whose reading differs in character sequence or character length from the readings of the terms listed in the speech recognition dictionary; and
creating the misrecognition countermeasure dictionary from the selected terms.
There is also provided
a speech recognition method using a speech recognition dictionary that lists terms judged to be correct as speech recognition results and a misrecognition countermeasure dictionary used to determine that a speech recognition result is a misrecognition, the method comprising:
extracting, from a database that lists everyday terms consisting of words and sentences, terms whose reading is no longer than a predetermined length;
selecting, from the extracted terms, those terms whose reading differs in character sequence or length from the readings of the terms listed in the speech recognition dictionary;
creating the misrecognition countermeasure dictionary from the selected terms; and,
when speech recognition is performed,
excluding a speech recognition result from the recognition result if it indicates the same term as one listed in the misrecognition countermeasure dictionary.
Further provided are a speech recognition dictionary creation device, a speech recognition device, a speech recognition program, and a speech recognition system that implement these methods.
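The dictionary creation method described in this section can be sketched in a few lines. This is only an illustrative sketch, not the patented implementation: the function name, the use of kana readings as plain strings, and the sample commands are assumptions, and the default limit of four characters is borrowed from the embodiment, which treats readings of one to four characters as short.

```python
def build_misrecognition_dictionary(everyday_readings, recognition_readings, max_length=4):
    """Return short everyday readings that are not operation instructions.

    everyday_readings: readings taken from the everyday-term database.
    recognition_readings: readings listed in the speech recognition dictionary.
    max_length: the "predetermined length" (assumed here to count characters).
    """
    commands = set(recognition_readings)
    selected = []
    seen = set()
    for reading in everyday_readings:
        # Keep only terms whose reading is no longer than the limit.
        if len(reading) > max_length:
            continue
        # Drop terms whose reading is identical to a listed instruction.
        if reading in commands:
            continue
        # Avoid storing the same reading twice.
        if reading not in seen:
            seen.add(reading)
            selected.append(reading)
    return selected


# Hypothetical sample data: a few dictionary readings and television commands.
everyday = ["あーす", "あい", "あいかぎ", "あいいれない", "あいかわらず", "きる", "おおきく"]
commands = ["つける", "きる", "おおきく", "ちいさく"]
print(build_misrecognition_dictionary(everyday, commands))
```

Lowering `max_length` shrinks the resulting dictionary, which is the trade-off between storage and coverage that the description discusses.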
[0013]
[Embodiments of the invention]
FIG. 1 shows an example of a speech recognition system according to an embodiment of the present invention. It shows a sound collection device 101, a speech recognition device 102, an output device 103, a dictionary creation device 104, and a network 105. The sound collection device 101, the speech recognition device 102, and the dictionary creation device 104 are connected so that they can communicate via the network 105.
[0014]
The sound collection device 101 collects sound information, such as surrounding speech, with a microphone or the like and transmits it as an audio signal to the speech recognition device 102 via the network.
[0015]
The speech recognition device 102 receives the audio signal transmitted by the sound collection device 101 and analyzes the sound it contains. When the analysis determines that the signal contains speech, the device recognizes which words were spoken and outputs a signal based on the recognition result.
[0016]
The output device 103 is a device that receives the output signal of the speech recognition device 102 and operates according to the speech recognition result. Examples include an indicating device such as a display or a buzzer, a home appliance such as a television, or a switch such as a relay.
[0017]
The dictionary creation device 104 creates the dictionaries that the speech recognition device 102 uses for speech recognition. The created dictionaries are transmitted to the speech recognition device 102 via the network 105.
[0018]
The network 105 provides communication among the sound collection device 101, the speech recognition device 102, and the dictionary creation device 104. Applicable networks include, for example, a wired LAN such as Ethernet (R), a wireless LAN as specified by IEEE 802.11b, Bluetooth (TM), which was developed for short-range communication, and IrDA (Infrared Data Association), an infrared communication standard. The form of communication is not limited to these; any network over which the devices can communicate with one another may be used.
[0019]
FIG. 2 shows an example of a block diagram of the speech recognition device 102 according to one embodiment of the present invention. FIG. 2 shows an audio processing unit 201, a recognition unit 202, an output unit 203, a communication processing unit 204, a dictionary updating unit 205, a recognition dictionary 206, and a misrecognition countermeasure dictionary 207.
[0020]
The audio processing unit 201 shapes the audio signal obtained from the sound collection device 101 into a signal suitable for speech recognition. The audio signal transmitted by the sound collection device 101 is first passed to the audio processing unit 201 by the communication processing unit 204. This signal carries distortion and noise picked up during transmission. The audio processing unit 201 corrects the distortion and removes noise that is unnecessary for speech recognition by means of a filter circuit. It then adjusts the output signal level and the like and outputs the result to the recognition unit 202.
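The source does not say which filter or level adjustment is used, so the following is only a rough software stand-in for the "filter, then adjust level" sequence described above. It assumes a centered moving-average filter for noise smoothing and simple peak normalization for level adjustment; both choices, and all function names, are hypothetical.

```python
def moving_average(samples, window=3):
    """Smooth the signal with a centered moving average (a crude low-pass filter)."""
    half = window // 2
    smoothed = []
    for i in range(len(samples)):
        # Use a shorter neighbourhood at the edges of the signal.
        neighbourhood = samples[max(0, i - half): i + half + 1]
        smoothed.append(sum(neighbourhood) / len(neighbourhood))
    return smoothed


def normalize_level(samples, target_peak=1.0):
    """Scale the signal so that its largest absolute value equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0:
        return list(samples)  # silent input: nothing to scale
    return [s * target_peak / peak for s in samples]


def condition(samples, window=3, target_peak=1.0):
    """Filter first, then adjust the level, mirroring the order in the text."""
    return normalize_level(moving_average(samples, window), target_peak)


print(condition([0, 0, 3, 0, 0], window=3, target_peak=0.5))
```

A real front end would more likely use a band-pass filter tuned to the speech band; the moving average is used here only because it keeps the sketch dependency-free.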
[0021]
The recognition unit 202 receives the audio signal output by the audio processing unit 201 and recognizes the speech it contains. Many signal analysis methods are known for determining which words or sentences an audio signal contains. The present embodiment may use, for example, an analysis method based on feature parameters extracted from the audio signal of an uttered word, as described in Patent Document 1. The recognition unit 202 is not limited to this method, however; any method capable of extracting the uttered words or sentences from the audio signal may be used.
[0022]
The extracted word or sentence is first checked against the entries stored in the misrecognition countermeasure dictionary 207. If no entry matches, it is then checked against the entries listed in the recognition dictionary 206. If the recognition dictionary 206 lists the same word or sentence, the extracted word or sentence is judged to be a correct instruction. Anything else is judged not to be a correct instruction.
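The order of the two checks matters, so it may help to see it written out. The sketch below assumes that both dictionaries can be treated as collections of readings and that matching is exact string equality; the `Verdict` names and the `judge` function are illustrative, not taken from the source.

```python
from enum import Enum


class Verdict(Enum):
    MISRECOGNITION = "misrecognition"          # listed in dictionary 207: discard
    INSTRUCTION = "instruction"                # listed in dictionary 206: act on it
    NOT_AN_INSTRUCTION = "not an instruction"  # listed in neither: ignore


def judge(extracted, misrecognition_dictionary, recognition_dictionary):
    """Classify one extracted word or sentence."""
    # The misrecognition countermeasure dictionary is consulted first.
    if extracted in misrecognition_dictionary:
        return Verdict.MISRECOGNITION
    # Only if nothing matched there is the recognition dictionary consulted.
    if extracted in recognition_dictionary:
        return Verdict.INSTRUCTION
    return Verdict.NOT_AN_INSTRUCTION


# Hypothetical dictionary contents.
misrecognition = {"あい", "あーす"}
recognition = {"おおきく", "ちいさく"}
print(judge("あい", misrecognition, recognition))
print(judge("おおきく", misrecognition, recognition))
print(judge("あいかわらず", misrecognition, recognition))
```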
[0023]
When the extracted word or sentence is judged to be a correct instruction, the recognition unit 202 outputs to the output unit 203 a signal corresponding to the processing that the word or sentence invokes.
[0024]
The output unit 203 receives the signal carrying the speech recognition result from the recognition unit 202 and acts on it. The output unit 203 controls the output device 103 connected to the speech recognition device 102, for example by switching the power of the output device 103 on and off. If the output device 103 is a television, the output unit 203 may, in addition to switching the power, perform operations such as changing the received channel or raising the volume.
[0025]
The communication processing unit 204 communicates with the sound collection device 101 and the dictionary creation device 104 via the network 105. It passes the audio signal transmitted from the sound collection device 101 to the audio processing unit 201. When it receives dictionary information created by the dictionary creation device 104, it passes the received dictionary information to the dictionary updating unit 205.
[0026]
The dictionary updating unit 205 has a function of updating the recognition dictionary 206 and the misrecognition countermeasure dictionary 207 based on dictionary information obtained from the dictionary creation device 104 via the network 105.
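As a rough illustration of this update step, the sketch below replaces the two dictionaries from a received message. The source does not define a message format, so the JSON layout, the key names, and the `DictionaryStore` class are all assumptions made for the example.

```python
import json
from dataclasses import dataclass, field


@dataclass
class DictionaryStore:
    # Reading -> output command, standing in for recognition dictionary 206.
    recognition: dict = field(default_factory=dict)
    # Readings to discard, standing in for misrecognition countermeasure dictionary 207.
    misrecognition: set = field(default_factory=set)

    def apply_update(self, message: str) -> None:
        """Replace whichever dictionaries the received message carries."""
        info = json.loads(message)
        if "recognition" in info:
            self.recognition = dict(info["recognition"])
        if "misrecognition" in info:
            self.misrecognition = set(info["misrecognition"])


store = DictionaryStore()
# Hypothetical dictionary information as it might arrive over the network.
store.apply_update(json.dumps({
    "recognition": {"おおきく": "VOLUME_UP"},
    "misrecognition": ["あい", "あーす"],
}))
print(store.recognition, sorted(store.misrecognition))
```

Replacing each dictionary wholesale, rather than merging, keeps the device in step with whatever the dictionary creation device last produced; a merging update would be an equally plausible design.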
[0027]
The recognition dictionary 206 is a dictionary held in a storage device such as a memory, storing the words and sentences that serve as operation instructions when the recognition unit 202 performs speech recognition. It is consulted when the recognition unit 202 has extracted a word or sentence from the audio signal; if the dictionary lists the same word or sentence, the recognition unit 202 judges that a correct spoken instruction has been given.
[0028]
For example, suppose the output device 103 is a television and the user says “louder” to raise its volume. When the recognition unit 202 receives the audio signal of the user's utterance “louder” and this word or sentence is in the recognition dictionary 206, it outputs a signal to the output device 103 via the output unit 203 so as to raise the television's volume. Because the operation to be performed on the output device 103 normally differs with the word or sentence uttered, the output signal (command) that the recognition unit 202 gives to the output unit 203 is listed in association with each listed word or sentence. Preferably, the words and sentences stored in the recognition dictionary 206, and their output signals, should be replaceable. With this configuration, the embodiment can be applied even to an output device 103 whose words, sentences, and output signals differ.
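One simple way to picture a recognition dictionary that associates each word with an output signal, and that can be swapped per output device, is a lookup table. The command names and the second table below are hypothetical; the source specifies only that a command is stored alongside each word or sentence.

```python
# Hypothetical recognition dictionary for a television: reading -> command.
TV_DICTIONARY = {
    "おおきく": "VOLUME_UP",    # "louder"
    "ちいさく": "VOLUME_DOWN",  # "quieter"
}

# Hypothetical replacement table for a different output device (a relay).
RELAY_DICTIONARY = {
    "つける": "RELAY_CLOSE",
    "きる": "RELAY_OPEN",
}


def command_for(extracted, recognition_dictionary):
    """Return the command associated with the extracted reading, or None."""
    return recognition_dictionary.get(extracted)


print(command_for("おおきく", TV_DICTIONARY))
print(command_for("おおきく", RELAY_DICTIONARY))
```

Because the table is passed in as an argument, supporting another output device only requires supplying a different dictionary.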
[0029]
At the time of speech recognition by the recognition unit 202, words or sentences that are not stored in the recognition dictionary 206 in advance are not recognized as correct instructions.
[0030]
The misrecognition countermeasure dictionary 207 lists those words and sentences, among the ones the recognition unit 202 may extract, that are clearly not operation instructions to the output device 103. The words and sentences stored here are the ones that the recognition unit 202 should eliminate first, after feature extraction, as not being operation instructions to the output device 103. How the words and sentences to be stored in the misrecognition countermeasure dictionary 207 are selected is described later under the dictionary information creation method. Preferably, the words and sentences stored in the misrecognition countermeasure dictionary 207 should be replaceable. With this configuration, the embodiment can accommodate cases where the words or sentences to be judged as misrecognitions need to be changed, whether per output device 103 or arbitrarily.
[0031]
FIG. 3 shows an example of a block configuration diagram of the dictionary creation device 104 according to an embodiment of the present invention. FIG. 3 shows a word dictionary 301, a misrecognition countermeasure dictionary creation unit 302, a recognition dictionary creation unit 303, a communication control unit 304, recognition dictionary information 305, and misrecognition countermeasure dictionary information 306.
[0032]
The word dictionary 301 is a database that lists words and sentences used in daily conversation. The words and sentences stored here can be, for example, items listed in a commercially available Japanese language dictionary. Commercially available Japanese language dictionaries contain many words and sentences frequently used in ordinary conversation. It is suitable for covering words and sentences uttered in daily conversation and the like. It is also possible to create and use a word dictionary 301 excluding items that are considered to be uttered less frequently as spoken words from items listed in a commercially available Japanese language dictionary. In this way, the capacity of the database constituting the word dictionary 301 can be reduced. The word dictionary 301 can be configured by any storage device such as a magnetic disk device, a memory, and a CD-ROM.
[0033]
The misrecognition countermeasure dictionary creation unit 302 has a function of selecting a word or a sentence from the words or sentences stored in the word dictionary 301 according to a predetermined rule and storing the selected word or sentence in the misrecognition countermeasure dictionary information 306. The rules for extracting words and sentences from the word dictionary 301 performed by the misrecognition countermeasure dictionary creation unit 302 will be described in the following dictionary information creation method.
[0034]
The recognition dictionary creation unit 303 has a function of storing words and sentences to be recognized by the voice recognition performed by the voice recognition device 102 in the recognition dictionary information 305. The word or sentence to be recognized is a word or sentence issued by the user for controlling the output device 103 by the speech recognition device 102. As a method of inputting a word or a sentence to the recognition dictionary information 305 by the recognition dictionary creating unit 303, a method of typing from a keyboard (not shown) provided in the dictionary creating device 104 can be considered. Once the output device 103 to be controlled is determined, words and sentences suitable for control by voice are determined. Taking the above-mentioned television as an example, it is sufficient for basic control to determine words and sentences such as “turn on”, “turn off”, “o channel”, “large” and “small”.
[0035]
Once the words or sentences to be recognized for the control target have been determined, together with the output signals to be output to the output device 103 when those words or sentences are given as voice signals, the user or the developer enters this information from the keyboard. The input information is processed by the recognition dictionary creation unit 303 and stored in the recognition dictionary information 305.
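As an illustration only, the recognition dictionary information 305 can be pictured as a table keyed by the reading of each operation instruction, with the output signal for the output device 103 as the value. The following Python sketch assumes such a layout; the helper `register_instruction` and the signal names such as `POWER_ON` are hypothetical and do not appear in the specification, and mapping "large" and "small" to volume signals is likewise an assumption.

```python
# Hypothetical in-memory stand-in for recognition dictionary information 305.
# Keys are readings of operation instructions; values are output-signal names.
recognition_dictionary_info = {}

def register_instruction(dictionary, reading, output_signal):
    """Store one operation instruction and the signal it should trigger."""
    reading = reading.strip()
    if not reading:
        raise ValueError("reading must not be empty")
    dictionary[reading] = output_signal

# Television example from the text; the signal names are invented for illustration.
register_instruction(recognition_dictionary_info, "turn on", "POWER_ON")
register_instruction(recognition_dictionary_info, "turn off", "POWER_OFF")
register_instruction(recognition_dictionary_info, "large", "VOLUME_UP")
register_instruction(recognition_dictionary_info, "small", "VOLUME_DOWN")
```

Keeping the output signal next to the reading means a later lookup by recognized reading yields the signal directly.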
[0036]
Alternatively, the recognition dictionary creation unit 303 may be configured to acquire, from the sound collection device 101 via the network 105, a voice signal of a word or sentence to be stored in the recognition dictionary information 305, perform speech recognition on that voice signal, and store the result in the recognition dictionary information 305.
[0037]
The communication processing unit 304 has a function of communicating with the sound collection device 101 and the voice recognition device 102 via the network 105. The information communicated includes the voice signal transmitted from the sound collection device 101 and the dictionary information by which the recognition dictionary information 305 and the misrecognition countermeasure dictionary information 306 are transmitted to the voice recognition device 102.
[0038]
The recognition dictionary information 305 is a database in which the recognition dictionary creation unit 303 lists the words and sentences that the speech recognition device 102 is to recognize as control instructions to the output device 103. The recognition dictionary information 305 can be implemented on any storage device, such as a magnetic disk, a memory, or a CD-ROM.
[0039]
The words and sentences listed in the recognition dictionary information 305 are replaced according to the output device 103 to be controlled. With this configuration, the present embodiment can be applied even when the words and sentences uttered to control the control target, and the corresponding output signals, differ.
[0040]
The communication processing unit 304 transmits the word and sentence information stored in the recognition dictionary information 305 via the network 105 as dictionary information for updating the recognition dictionary 206 included in the voice recognition device 102.
[0041]
The misrecognition countermeasure dictionary information 306 is a database in which the misrecognition countermeasure dictionary creation unit 302 lists the readings of the words and sentences that the speech recognition device 102 is to exclude from its recognition results. The misrecognition countermeasure dictionary information 306 can be implemented on any storage device, such as a magnetic disk, a memory, or a CD-ROM.
[0042]
The words and sentences listed in the misrecognition countermeasure dictionary information 306 can be rewritten as needed. With this configuration, the present embodiment can be applied even when the words or sentences that the user does not want recognized when controlling the control target change.
[0043]
Like the recognition dictionary information 305, the words and sentences stored in the misrecognition countermeasure dictionary information 306 are sent by the communication processing unit 304 via the network 105 as dictionary information for updating the misrecognition countermeasure dictionary 207 included in the speech recognition device 102.
[0044]
FIG. 4 is a diagram illustrating an example of a block configuration diagram of the sound collection device 101 according to an embodiment of the present invention. FIG. 4 shows a microphone 401, a voice processing unit 402, and a communication processing unit 403.
[0045]
The microphone 401 converts the surrounding voice into a voice signal and outputs the voice signal to the voice processing unit 402. The audio signal includes an audio signal of a word or a sentence issued by the user to control the output device 103.
[0046]
The voice processing unit 402 corrects distortion in the voice signal obtained from the microphone 401 and removes noise unnecessary for voice recognition by means of a filter circuit. It then adjusts the signal level of the noise-removed voice signal and outputs the adjusted signal to the communication processing unit 403.
[0047]
The communication processing unit 403 has a function of transmitting the voice signal output by the voice processing unit 402 to the voice recognition device 102 and the dictionary creation device 104 via the network 105.
[0048]
Next, FIG. 5 shows an example of an operation flow showing a dictionary information creation method of the dictionary creation device 104 according to an embodiment of the present invention.
[0049]
Prior to creating the misrecognition countermeasure dictionary information 306, first, the recognition dictionary information 305 is created. A user or a developer determines a word or a sentence to be used in the voice for controlling the output device 103, and determines what output signal should be output to the output device 103 based on the word or the sentence. When these are determined, the user or the developer inputs information of words, sentences, and output signals to the recognition dictionary information 305 via the recognition dictionary creating unit 303 (step S01).
[0050]
Next, misrecognized words, that is, words or sentences that the voice recognition device 102 should not recognize as operation instructions, are extracted from the word dictionary 301 (step S02).
[0051]
The words and sentences finally listed in the misrecognition countermeasure dictionary information 306 are extracted in consideration of the following points.
a) Words and sentences used as operation instructions are excluded.
b) Words and sentences that are not operation instructions but are likely to be recognized as operation instructions by the voice recognition device 102 are included.
c) Words and sentences that are short, and therefore carry little feature information when uttered, are included.
[0052]
Point a) means that words and sentences to be recognized are never included in the misrecognition countermeasure dictionary information 306 or the misrecognition countermeasure dictionary 207, regardless of the circumstances.
[0053]
Point b) means that words and sentences that are not operation instructions but are easily recognized as operation instructions by the voice recognition device 102 are extracted individually. By doing so, words and sentences that are likely to be erroneously recognized by the voice recognition device 102 can be actively excluded.
[0054]
Point c) means that words and sentences whose voice signal, when uttered, has few features and is therefore difficult to distinguish from other words or sentences during speech recognition are extracted as misrecognition countermeasure dictionary information.
[0055]
In step S02, words and sentences are extracted by focusing on point c). In the present embodiment, a word or sentence is judged to have few feature points in its voice signal when its reading is written in one to four characters. FIG. 6 shows an example of selecting extraction candidates from part of the words and sentences listed in the word dictionary 301. The table of FIG. 6 has a registered-word column giving the character arrangement of the reading of each word listed in the word dictionary 301, a column giving the character length of each registered word, and a misrecognition determination column giving the result of judging whether the word is a misrecognized word. Hereinafter, words and sentences judged to have few features in their voice signal are referred to as misrecognized words.
[0056]
Eight registered items are determined to be misrecognized words, including "Asu", "Aichi", "Ai", "Aikagi", "Aiken", "Aokai", and "Aogu". The misrecognition countermeasure dictionary creation unit 302 judges these eight words to be words that have few vocal features when uttered and that the speech recognition device 102 is therefore likely to misrecognize as other words or sentences, including operation instructions to the output device 103. On the other hand, seven words, "I can't accept it", "I don't like it", "Aienka", "Aikido", "Aikiyou", "Aisho" and "Aozare", are judged to be words that are hard to misrecognize and are not stored in the misrecognition countermeasure dictionary information 306.
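The length-based judgment of FIG. 6 can be sketched in Python as follows. This is a minimal sketch, assuming that readings are held as kana strings so that `len` gives the character length of the reading; the function name and the kana spellings chosen for the romanized entries are assumptions made for illustration, not part of the specification.

```python
# Criterion of the embodiment: a reading of one to four characters.
MAX_READING_LENGTH = 4

def is_misrecognition_candidate(reading, max_length=MAX_READING_LENGTH):
    """Return True when the reading is short enough to carry few features."""
    return 1 <= len(reading) <= max_length

# Assumed kana spellings of some romanized entries ("Asu", "Ai", "Aichi",
# "Aikagi", "Aiken", "Aogu", "Aikido"); FIG. 6 itself is not reproduced here.
fig6_like_readings = ["あす", "あい", "あいち", "あいかぎ", "あいけん", "あおぐ", "あいきどう"]

candidates = [r for r in fig6_like_readings if is_misrecognition_candidate(r)]
print(candidates)
```

Under these assumptions the five-character reading "あいきどう" is the only entry not selected, which matches the text's treatment of "Aikido" as hard to misrecognize.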
[0057]
When words and sentences that are likely to be misrecognized as other words or sentences are extracted simply on the basis of the number of characters, as described above, the misrecognition countermeasure dictionary 207 becomes easy to create. In addition, because the extraction is made from the word dictionary 301, which lists the words and sentences that a user giving operation instructions by voice uses in daily life, the resulting misrecognition countermeasure dictionary 207 covers words and sentences with a high probability of being uttered.
[0058]
In the present embodiment, a word or sentence is judged to have few feature points when its reading is one to four characters long. If, for example, the speech recognition device 102 is highly capable of extracting the features of words and sentences from the voice signal and has a high speech recognition rate, the length of the words and sentences extracted as the misrecognition countermeasure dictionary information 306 may be limited to one to three characters, or to one or two characters, so that fewer words and sentences are prepared as misrecognized words.
[0059]
Depending on the speech recognition rate that the user requires of the speech recognition device 102, the number of words and sentences listed in the misrecognition countermeasure dictionary information 306 can be reduced by shortening the character length that qualifies as a misrecognized word. Reducing the number of stored entries in turn reduces the storage capacity required for the misrecognition countermeasure dictionary information 306 and the misrecognition countermeasure dictionary 207.
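The effect of shortening the length limit on the number of stored entries can be illustrated with a small Python sketch. The sample readings and the function `count_entries` are hypothetical and serve only to show how the count shrinks as the limit goes from four characters to three or two.

```python
def count_entries(readings, max_length):
    """Count readings that qualify as misrecognized words for a given limit."""
    return sum(1 for reading in readings if 1 <= len(reading) <= max_length)

# Hypothetical sample of readings with lengths 2, 2, 3, 3, 4, 4 and 5.
sample_readings = ["あい", "あす", "あいち", "あおぐ", "あいかぎ", "あいけん", "あいきどう"]

entries_per_limit = {limit: count_entries(sample_readings, limit) for limit in (4, 3, 2)}
print(entries_per_limit)  # {4: 6, 3: 4, 2: 2}
```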
[0060]
Next, it is determined whether the misrecognition countermeasure dictionary creation unit 302 has extracted all misrecognized words from the word dictionary 301 (step S03). When all the misrecognized words have been extracted, the information of the words and sentences listed in the recognition dictionary information 305 and the misrecognition countermeasure dictionary information 306 is transmitted to the speech recognition device 102 (step S06).
[0061]
On the other hand, if misrecognized words still remain in the word dictionary 301, it is determined whether the extracted misrecognized word is included in the recognition dictionary information 305 (step S04). This processing implements point a) described for step S02, namely that operation instructions are not included in the misrecognition countermeasure dictionary. If the extracted misrecognized word is found in the recognition dictionary information 305, the misrecognition countermeasure dictionary creation unit 302 stops processing that word and extracts a new misrecognized word (step S02).
[0062]
In the determination in step S04, two words or sentences are judged to be the same if the character arrangement and character length of their readings are the same. Thus, the identity of words and sentences is determined simply by whether or not their readings match.
[0063]
If it is determined that the misrecognized word is not in the recognition dictionary information 305, the misrecognized word is stored in the misrecognition countermeasure dictionary information 306 together with its reading (step S05). After it is stored, a new misrecognized word is extracted from the word dictionary 301 (step S02).
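The loop of steps S02 to S05 can be sketched in Python as follows. This is a minimal sketch under several assumptions: readings are plain kana strings, string equality stands in for "same character arrangement and character length", the function and variable names are hypothetical, and the readings "つけて" and "けして" are invented operation instructions. Skipping duplicate entries is also an assumption not stated in the text.

```python
def create_misrecognition_dictionary(word_dictionary, recognition_readings, max_length=4):
    """Sketch of steps S02 to S05 of FIG. 5.

    word_dictionary: iterable of readings (stand-in for word dictionary 301)
    recognition_readings: readings of operation instructions (information 305)
    """
    protected = set(recognition_readings)
    misrecognition_info = []  # stand-in for information 306
    seen = set()
    for reading in word_dictionary:  # the loop ends once every word is examined (S03)
        if not (1 <= len(reading) <= max_length):
            continue  # S02: too long to count as a misrecognized word
        if reading in protected:
            continue  # S04: operation instructions are never stored
        if reading not in seen:  # skipping duplicates is an added assumption
            seen.add(reading)
            misrecognition_info.append(reading)  # S05
    return misrecognition_info

# Hypothetical data: "つけて" is short but is an operation instruction.
word_dictionary_301 = ["あす", "あい", "つけて", "あいち", "あいきどう", "あい"]
recognition_info_305 = {"つけて": "POWER_ON", "けして": "POWER_OFF"}

misrecognition_info_306 = create_misrecognition_dictionary(
    word_dictionary_301, recognition_info_305.keys()
)
print(misrecognition_info_306)
```

Transmission of the result to the speech recognition device 102 (step S06) is left out of the sketch.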
[0064]
Next, FIG. 7 shows an operation flow of the speech recognition apparatus 102 according to an embodiment of the present invention. It is assumed that, prior to speech recognition, the recognition dictionary 206 and the erroneous recognition countermeasure dictionary 207 have been normally updated based on the dictionary information transmitted from the dictionary creation device 104.
[0065]
The recognition unit 202 receives the audio signal transmitted by the sound collection device 101, and extracts a word or a sentence from the features of the audio included in the audio signal (Step S11).
[0066]
Next, the recognizing unit 202 determines whether or not the uttered word or sentence included in the audio signal has been normally extracted (step S12). If the result of the determination is that words or sentences cannot be extracted normally, speech recognition processing of the next sent speech signal is performed (step S11).
[0067]
When the words and sentences of the input voice have been successfully extracted, the recognition unit 202 determines whether or not the extracted words and sentences are listed in the misrecognition countermeasure dictionary 207 (step S13). In this determination, the identity of the words and sentences is determined based on whether the character arrangement and the character length of their readings are the same. If the word or sentence is listed, it is judged to be a misrecognized word, that is, it is judged highly likely to have been erroneously recognized, and the speech recognition processing of the next transmitted voice signal is performed (step S11).
[0068]
When the recognition unit 202 determines that the extracted word or sentence is not listed in the misrecognition countermeasure dictionary 207, it next determines whether the extracted word or sentence is listed in the recognition dictionary 206 (step S14). If the word or sentence is not listed in the recognition dictionary 206, it is not an operation instruction to the output device 103, so processing is stopped and the speech recognition processing of the next transmitted voice signal is performed (step S11).
[0069]
If any of the words or sentences listed in the recognition dictionary 206 includes the same word or sentence as the extracted word or sentence, output signal information associated with the extracted word or sentence is extracted from the recognition dictionary 206. Then, the extracted output signal is output to the output device 103 via the output unit 203 (Step S15).
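The decision sequence of steps S12 to S15 can be summarized in a short Python sketch. It assumes that feature extraction (step S11) has already produced a reading, or an empty string on failure; the function name, the dictionary contents, and the signal names are hypothetical.

```python
def handle_recognition_result(reading, misrecognition_dictionary, recognition_dictionary):
    """Return the output signal for one recognized reading, or None to ignore it."""
    if not reading:
        return None  # S12: nothing was extracted normally
    if reading in misrecognition_dictionary:
        return None  # S13: treated as a likely misrecognition
    # S14 and S15: only listed operation instructions yield an output signal
    return recognition_dictionary.get(reading)

# Hypothetical contents of dictionaries 207 and 206.
misrecognition_dictionary_207 = {"あす", "あい", "あいち"}
recognition_dictionary_206 = {"つけて": "POWER_ON", "けして": "POWER_OFF"}

for reading in ["つけて", "あい", "あいきどう", ""]:
    signal = handle_recognition_result(
        reading, misrecognition_dictionary_207, recognition_dictionary_206
    )
    print(repr(reading), signal)
```

Checking the misrecognition countermeasure dictionary first mirrors the order of FIG. 7: a short everyday word is discarded before the recognition dictionary is consulted at all.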
[0070]
With the above configuration, the misrecognition countermeasure dictionary used in the speech recognition system is easy to create. In addition, since the misrecognized words are extracted from a word dictionary of words and sentences commonly used by the user who gives operation instructions by voice, misrecognized words with a high probability of being uttered can be extracted. It is therefore possible to reduce the cost of creating the misrecognition countermeasure dictionary and to improve the speech recognition rate.
[0071]
(Modification 1 of Embodiment)
The functions of the sound collection device 101, the voice recognition device 102, and the dictionary creation device 104 shown in FIG. 1 can be realized by one housing. At this time, the misrecognition countermeasure dictionary 207 and the misrecognition countermeasure dictionary information 306, and the recognition dictionary 206 and the recognition dictionary information 305 may be realized by one storage device. The misrecognition countermeasure dictionary creation unit 302 directly updates the misrecognition countermeasure dictionary 207, and the recognition dictionary creation unit 303 similarly updates the recognition dictionary 206 directly.
[0072]
In addition, since updating via the network 105 is unnecessary in this case, the speech recognition system according to the present invention can be configured without the communication processing units 204, 304, and 403 and the dictionary updating unit 205.
[0073]
In addition, the effects of the present invention can be obtained even when the components shown in FIG. 1 are appropriately combined, such as including the function of the speech recognition device 102 in the output device 103.
[0074]
For example, if the sound collection devices 101 are installed at a plurality of places in the home and the voice signals they transmit are processed by one voice recognition device 102, an environment in which operation instructions can be given from anywhere in the home can be provided to users. Alternatively, if the functions of the sound collection device 101 and the voice recognition device 102 are housed in a single housing, a portable device having a voice recognition function, for example, a remote control device, can be provided.
[0075]
With this configuration, an appropriate system configuration can be obtained according to the usage scene and manufacturing cost of the speech recognition system.
[0076]
(Modification 2 of Embodiment)
This modification concerns the case where, after the dictionary updating unit 205 has updated the recognition dictionary 206 and the misrecognition countermeasure dictionary 207 with the recognition dictionary information 305 and the misrecognition countermeasure dictionary information 306 created by the dictionary creation device 104 and received via the communication processing units 304 and 204, information is further added to the recognition dictionary 206 and the misrecognition countermeasure dictionary 207 at the speech recognition device 102.
[0077]
This is effective, for example, when the user wants to change part of the words or sentences used for operation instructions, or to store in the misrecognition countermeasure dictionary 207 a misrecognized word that the misrecognition countermeasure dictionary creation unit 302 did not extract, but re-creating the dictionary information with the dictionary creation device 104 is too much trouble or is not possible because the dictionary creation device 104 is installed in a remote place.
[0078]
FIG. 8 shows an example of a block configuration diagram of a speech recognition device 102 according to the present modification. The difference from the block configuration diagram shown in FIG. 2 is that an input unit 208 is newly added.
[0079]
The input unit 208 is, for example, a keyboard, and information input by the user from the keyboard is output to the dictionary updating unit 205 in order to update information listed in the recognition dictionary 206 and the misrecognition countermeasure dictionary 207. The dictionary updating unit 205 updates the recognition dictionary 206 and the misrecognition countermeasure dictionary 207 based on the information obtained from the input unit 208.
[0080]
Alternatively, the dictionary updating unit 205 may obtain, from the sound collection device 101 via the network 105, voice signals of the words and sentences to be listed in the respective dictionaries, and use the words and sentences obtained by analyzing them as the update dictionary information.
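A local update of the two dictionaries at the speech recognition device 102 might look like the following Python sketch. The specification does not say how a conflict between the two dictionaries is resolved during such an update; the sketch assumes that operation instructions take precedence, in line with point a) of the dictionary information creation method. The function name, the data layout, and the signal names are hypothetical.

```python
def apply_local_update(recognition_dictionary, misrecognition_dictionary,
                       add_instructions=None, add_misrecognized=None):
    """Add entries entered locally (for example via input unit 208) to both dictionaries."""
    for reading, signal in (add_instructions or {}).items():
        recognition_dictionary[reading] = signal
        # Assumption: an operation instruction must never remain a misrecognized word.
        misrecognition_dictionary.discard(reading)
    for reading in (add_misrecognized or []):
        if reading not in recognition_dictionary:
            misrecognition_dictionary.add(reading)

# Hypothetical state of dictionaries 206 and 207 before the update.
recognition_dictionary_206 = {"つけて": "POWER_ON"}
misrecognition_dictionary_207 = {"あす"}

apply_local_update(
    recognition_dictionary_206,
    misrecognition_dictionary_207,
    add_instructions={"あす": "TIMER_TOMORROW"},
    add_misrecognized=["あい", "つけて"],
)
print(recognition_dictionary_206, misrecognition_dictionary_207)
```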
[0081]
Note that the present invention is not limited to the configurations described in the above embodiment and modifications; its effects can be obtained even if some or all of those configurations are replaced with configurations having similar functions. Further, the effects of the present invention can be similarly obtained by using a computer that executes program code implementing the operation flows of the speech recognition device 102 and the dictionary creation device 104 described in the embodiment.
[0082]
[Effect of the invention]
The present invention can provide a speech recognition dictionary creation method and device that realize a higher recognition rate by simplifying the selection of the words and sentences to be stored in the misrecognition countermeasure dictionary required for speech recognition and by selecting easily misrecognized words and sentences from commonly used words, as well as a speech recognition device and a speech recognition system equipped with them.
[Brief description of the drawings]
FIG. 1 is a diagram illustrating an example of a speech recognition system according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating an example of a block configuration diagram of a speech recognition apparatus 102 according to the embodiment of the present invention.
FIG. 3 is a diagram illustrating an example of a block configuration diagram of a dictionary creation device 104 according to the embodiment of the present invention.
FIG. 4 is a diagram illustrating an example of a block configuration diagram of a sound collection device 101 according to the embodiment of the present invention.
FIG. 5 is a diagram showing an example of an operation flow of the dictionary creation device 104 according to the embodiment of the present invention.
FIG. 6 is a diagram showing an example of a method for selecting misrecognized words in the embodiment of the present invention.
FIG. 7 is a diagram showing an example of an operation flow of the speech recognition device 102 according to the embodiment of the present invention.
FIG. 8 is a diagram illustrating an example of a block configuration diagram of a speech recognition device 102 according to a second modification of the embodiment of the present invention.
[Explanation of symbols]
101 Sound collection device
102 Speech recognition device
103 Output device
104 Dictionary creation device
105 Network
202 Recognition unit
205 Dictionary updating unit
206 Recognition dictionary
207 Misrecognition countermeasure dictionary
208 Input unit
301 Word dictionary
302 Misrecognition countermeasure dictionary creation unit
306 Misrecognition countermeasure dictionary information

Claims

A speech recognition dictionary creation method for creating a misrecognition countermeasure dictionary used to determine that a speech recognition result is a misrecognition, the method comprising:
extracting, from a database that lists everyday terms consisting of words and sentences, terms whose reading is equal to or less than a predetermined length;
using a speech recognition dictionary that lists predetermined terms for which, when they are obtained as a result of speech recognition of input speech, correct speech recognition is determined to have been performed; and
selecting, from among the extracted terms, terms whose reading differs in character arrangement or character length from the readings of the terms listed in the speech recognition dictionary, and creating the misrecognition countermeasure dictionary using the selected terms.

A speech recognition method using a speech recognition dictionary listing terms that are determined to be correct as the speech recognition result, and an erroneous recognition countermeasure dictionary used to determine that the speech recognition result is erroneous recognition,
From a database that lists everyday terms consisting of words and sentences, extract terms whose reading is equal to or less than a predetermined length,
Of the extracted terms, select terms whose reading differs in character arrangement or character length from the readings of the terms listed in the speech recognition dictionary, and create the misrecognition countermeasure dictionary using the selected terms,
When performing voice recognition of input voice,
A speech recognition method characterized in that if the speech recognition result indicates a result indicating the same term as that listed in the misrecognition countermeasure dictionary, the result is excluded from the recognition result.

In a speech recognition dictionary creation device for creating a misrecognition countermeasure dictionary used to determine that the speech recognition result is misrecognition,
Selecting means for selecting a term that is determined to be misrecognized and whose reading is equal to or less than a predetermined length and which is likely to be misrecognized in voice recognition, from a database listing daily terms composed of words and sentences,
A dictionary creating means for creating an erroneous recognition countermeasure dictionary using the selected term.

4. The apparatus for creating a speech recognition dictionary according to claim 3, wherein the predetermined length of the reading of the term extracted by the selection unit is four characters or less.

In a speech recognition apparatus having a speech recognition dictionary listing terms that are determined to be correct as a speech recognition result, and an erroneous recognition countermeasure dictionary used to determine that the speech recognition result is erroneous recognition,
A database that lists everyday terms consisting of words and sentences,
From the terms listed in this database, terms whose readings of terms are equal to or less than a predetermined length are extracted, and among the extracted terms, a term having a different character arrangement or character length of the reading of the terms listed in the speech recognition dictionary is selected. Means for selecting,
Dictionary creation means for creating the misrecognition countermeasure dictionary using the selected term,
A speech recognition apparatus characterized in that if the result of speech recognition of an input speech is a result indicating the same term as that listed in the misrecognition countermeasure dictionary, the result is excluded from the recognition result.

A speech recognition program that performs speech recognition using a speech recognition dictionary that lists terms that are determined to be correct as the speech recognition result and a misrecognition countermeasure dictionary used to determine that the speech recognition result is erroneous recognition,
Searching a database listing daily terms consisting of words and sentences, and extracting terms whose readings of terms are equal to or less than a predetermined length;
Searching the speech recognition dictionary to determine whether it contains a term whose reading has the same character arrangement and character length as the extracted term, and adding and storing the extracted term in the misrecognition countermeasure dictionary only when no such term exists; and
When performing speech recognition of input speech, if the term indicated by the speech recognition result is the same as a term stored in the misrecognition countermeasure dictionary, excluding the result from the recognition result.

A speech recognition system comprising: a speech recognition dictionary creation device that selectively extracts, from a database that lists everyday terms consisting of words and sentences, terms that should be judged as misrecognition, whose reading is equal to or less than a predetermined length and which are easily misrecognized in speech recognition, and creates a misrecognition countermeasure dictionary using the extracted terms; and
a speech recognition device that recognizes input speech using a speech recognition dictionary,
wherein the speech recognition device comprises means for searching the misrecognition countermeasure dictionary to determine whether the recognition result of the input speech corresponds to a term in the misrecognition countermeasure dictionary, and
wherein, when the speech recognition result corresponds to a term in the misrecognition countermeasure dictionary, the term is not output as a recognition result.

The speech recognition dictionary creation device and the speech recognition device include a communication processing unit for communicating with each other,
The speech recognition system according to claim 7, wherein the communication processing unit transfers the misrecognition countermeasure dictionary created by the speech recognition dictionary creation device to the speech recognition device.