JP3825526B2

JP3825526B2 - Voice recognition device

Info

Publication number: JP3825526B2
Application number: JP08170097A
Authority: JP
Inventors: 康之正井; 信一田中
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1997-03-31
Filing date: 1997-03-31
Publication date: 2006-09-27
Anticipated expiration: 2017-03-31
Also published as: JPH10274996A

Description

【０００１】
【発明の属する技術分野】
本発明は、入力音声を音響分析して求めた特徴パラメータ系列を予め作成しておいた各認識語彙を構成するキーワードの音声モデルと照合して、入力音声を認識する音声認識装置に係り、特に認識語彙の読みの登録間違いや、装置使用時の読みの記憶違いによる誤認識を削減するのに好適な音声認識装置に関する。
【０００２】
【従来の技術】
一般に、入力音声を認識する音声認識装置では、当該装置での認識対象となる語彙（認識語彙）の読みを入力することで、その読みから、対応する認識語彙を構成するキーワードの音声モデルを予め作成し、入力音声の認識のため記憶しておくようになっている。この種の音声認識装置での入力音声の認識は、次のように行われる。
【０００３】
まず入力音声を音響分析して特徴パラメータ系列を求める。次に、求めた入力音声の特徴パラメータ系列を予め作成しておいた各認識語彙を構成するキーワードの音声モデルと照合して、入力音声を認識する。
【０００４】
このような音声認識装置においては、従来は、認識語彙の読みを誤って登録した場合には、使用時に正しい読みを発声しても正しく認識できないという問題があった。また、認識語彙の登録時には正しい読みを登録しておいても、使用時に誤った読みを発声すると正しく認識されないという問題もあった。
【０００５】
【発明が解決しようとする課題】
上記したように従来の音声認識装置では、認識語彙の読みを誤って登録すると、使用時に正しい読みを発声しても正しく認識できず、逆に認識語彙の登録時に正しい読みを登録しておいても、使用時に誤った読みを発声すると正しく認識されないという問題があった。
【０００６】
本発明は上記事情を考慮してなされたものでその目的は、認識語彙の読みの登録間違いや、装置使用時の読みの記憶違い、誤った読みでの発声等に起因する認識性能の低下を防ぐことができる音声認識装置を提供することにある。
【００１２】
【課題を解決するための手段】
本発明の１つの観点によれば、認識語彙の別称の音声モデルを含む各認識語彙の音声モデルと照合して、入力音声を認識する音声認識装置が提供される。この音声認識装置は、登録する認識語彙の第１の読みに加えて、異なる第２の読みを別称として登録する別称登録手段と、別称として登録される第２の読みが登録済みの認識語彙の第１の読みと類似しているか否かを判定する登録語彙類似性判定手段と、この登録語彙類似性判定手段により類似していると判定された場合に、その旨の警告を出力する警告出力手段と、上記別称登録手段により別称として登録された第２の読みの音声モデルとの照合で入力音声が認識された場合、その第２の読みを別称とする語彙を認識する認識結果出力手段とを備えたことを特徴とする。ここで、上記認識結果出力手段が、別称として登録されている第２の読みを認識したときに、その第２の読みを別称とする語彙の第１の読みが当該認識結果出力手段によって利用者に提示される構成とすると良い。
【００１３】
このような構成においては、認識語彙入力手段から入力して登録される認識語彙の読み（第１の読み）とは異なる読み（第２の読み）を別称として別称登録手段により登録する際に、この別称として登録される第２の読みが他の語彙の第１の読みと類似しているかを登録語彙類似性判定手段にて判定し、類似しているならば、その旨を警告出力手段から利用者に知らせることにより、別称登録による認識性能の低下を未然に防ぐことができる。このとき、該当する別称登録を中止させるとよい。また、別称として登録されている第２の読みの音声モデルとの照合で入力音声が認識された場合、その第２の読みを別称とする語彙を認識することにより、例えば「神戸（かんべ）」の別称として「神戸（こうべ）」が登録されている状態で、話者が一般的な読みである「神戸（こうべ）」と発声しても、認識結果として正しい「神戸（かんべ）」を得ることができる。この際、「神戸（こうべ）」（第２の読み）を別称とする語彙「神戸」の読み（第１の読み）「神戸（かんべ）」を話者に提示することにより、当該話者が別称で覚えていた言葉の正しい読みを当該話者に覚えさせることができ、以後正しい読みで入力できるようになる。
【００２０】
【発明の実施の形態】
以下、本発明の実施の形態につき図面を参照して説明する。
［第１の実施形態］
図１は本発明の第１の実施形態に係る音声認識装置の概略構成を示すブロック図である。
【００２１】
図１の音声認識装置において、音声入力部１０１から入力された音声は、音響分析部１０２で特徴パラメータに変換される。音声認識に使用される代表的な特徴パラメータとしては、バンドパスフィルタやフーリエ変換によって求めることができるパワースペクトルや、ＬＰＣ（線形予測）分析によって求めたケプストラム係数などがよく用いられるが、ここではその特徴パラメータの種類は問わない。
【００２２】
音響分析部１０２は求めた入力音声の特徴パラメータ系列をモデル照合部１０３に出力する。モデル照合部１０３は、認識語彙入力部１０５から入力された認識語彙に従って予め作成して音声モデル作成・記憶部１０４に記憶しておいた認識対象とするキーワード（認識語彙を構成するキーワード）の各音声モデルと上記入力音声の特徴パラメータ系列との類似度あるいは距離を求める演算を行う。
【００２３】
モデル照合部１０３の照合方法としては、音声モデルも特徴パラメータ系列で表現しておき、ＤＰ（動的計画）法で音声モデルの特徴パラメータ系列と入力音声の特徴パラメータ系列の距離を求める手法や、ＨＭＭ（隠れマルコフモデル）を用いて音声モデルを表現しておき、入力音声の特徴パラメータ系列が入力されたときの各音声モデルの確率を計算する手法などが広く使用されているが、特に手法は問わない。
【００２４】
認識語彙入力部１０５は、認識語彙と、認識語彙を構成する各キーワードの音声モデルを作成するために必要な、認識語彙の各キーワードへの分割情報（キーワード分割情報）と、各キーワードの読み情報とを入力するためのものであり、キーボードやファイルなどで実現することができる。認識語彙入力部１０５から入力された認識語彙は認識語彙記憶部１０９に登録される。
【００２５】
キーワード変換部１０６は、認識語彙入力部１０５から入力されたキーワード分割情報から各キーワードを抽出し、キーワード間の音の類似性と品詞などの属性に基づいて選択される、例えば音の類似性があって且つ品詞が同じキーワードの変換テーブル（キーワード変換テーブル）１０６ａを作成し、記憶しておくためのものである。
【００２６】
キーワード拡張部１０７は、モデル照合部１０３で得られた（類似度あるいは距離付きの）キーワードを、キーワード変換部１０６によりキーワード変換テーブル１０６ａに従って音の類似性のある他のキーワードに変換させ、キーワードの拡張を行う。
【００２７】
認識結果出力部１０８は、モデル照合部１０３で求めた各音声モデルとキーワード拡張部１０７で拡張して得られたキーワードが組み合わされたキーワード列に対する類似度（あるいは距離）をある条件（例えば類似度の大きさ）のもとでソーティングして、認識語彙記憶部１０９に記憶されている認識語彙の中で、類似度が最大（あるいは距離が最小）となる認識対象のカテゴリを認識結果として出力する。なお、上記ソーティングの制約として、例えば人の氏名は、会社名より優先させるなどを適用してもよい。
【００２８】
以上に述べた図１の構成の音声認識装置の具体的動作を、当該音声認識装置で認識対象とする語彙、即ち認識語彙が、「佐藤商店」、「加籐食堂」、「田中書店」の３種類である場合を例にとり説明する。
【００２９】
この場合、認識語彙入力部１０５から上記３種類の認識語彙が入力されることになるが、本実施形態では、その認識語彙を構成する各キーワードの音声モデルが（音声モデル作成・記憶部１０４にて）作成可能なように、「佐藤‐商店」、「加籐‐食堂」、「田中‐書店」のように、認識語彙中にキーワード分割記号（キーワード分割情報）「‐」が挿入されて入力される。
【００３０】
音声モデル作成・記憶部１０４は、認識語彙入力部１０５からキーワード分割記号「‐」が挿入された認識語彙「佐藤‐商店」、「加籐‐食堂」、「田中‐書店」が入力されると、各認識語彙について、その語彙中に挿入されたキーワード分割記号「‐」に従って、その語彙を構成するキーワードに分割する。ここでは、上記３種類の認識語彙が、「佐藤」、「加籐」、「田中」、「商店」、「食堂」、「書店」の６つのキーワードに分割される。音声モデル作成・記憶部１０４は、この６つのキーワード「佐藤」、「加籐」、「田中」、「商店」、「食堂」、「書店」について、それぞれ音声モデルを作成し、記憶する。
【００３１】
これと同時に、キーワード変換部１０６は、認識語彙入力部１０５から入力されたキーワード分割記号付きの認識語彙「佐藤‐商店」、「加籐‐食堂」、「田中‐書店」から得られる上記６つのキーワード「佐藤」、「加籐」、「田中」、「商店」、「食堂」、「書店」について音声の類似性を調べて、類似性のあるキーワードを抽出し、キーワード変換テーブル１０６ａを作成する。ここでは、キーワード変換テーブル１０６ａの作成規則を、キーワードの読みが異なる音節数が所定数以下、例えば１音節以下のキーワード同士を音声の類似性ありとして、当該テーブル１０６ａに登録するものとする。この場合、「佐藤」と「加藤」、「商店」と「書店」が類似性ありと抽出され、図２に示すようなキーワード変換テーブル１０６ａが作成される。
【００３２】
すると、音声認識時に、例えば「佐藤商店」と入力された場合に、モデル照合部１０３での照合結果が「佐藤」と「書店」であったとすると、キーワード拡張部１０７では、「佐藤」と「書店」について、キーワード変換部１０６によりキーワード変換テーブル１０６ａに従う「佐藤→加籐」、「書店→商店」のキーワード変換を行わせ、モデル照合部１０３での照合結果として「佐藤」と「書店」の他に、「加籐」と「商店」もあるかのように、キーワードの拡張を行う。
【００３３】
キーワード拡張部１０７により拡張されたキーワードの組み合わせの中には、認識語彙記憶部１０９に記憶されている認識語彙と一致するものとして、「佐藤商店」がある。したがって、モデル照合部１０３での照合結果が「佐藤」と「書店」であったにも拘らず、認識結果出力部１０８では、「佐藤商店」を正しく認識して出力することができる。
【００３４】
これに対し、キーワード変換部１０６とキーワード拡張部１０７がなく、キーワードの拡張が行われない場合には、モデル照合部１０３での照合結果である「佐藤」と「書店」で構成される「佐藤書店」は認識語彙記憶部１０９には存在しないので、「佐藤商店」を正しく認識することはできない。
【００３５】
なお、キーワード変換により得られたキーワードの音声モデルとの照合では、類似度を一定値あるいは一定割合低くするとよい。
以上は、話者が「佐藤商店」と発声したのに対して、モデル照合部１０３で「佐藤」「書店」と誤った照合結果が得られた場合でも、音の類似性に着目したキーワードの拡張により「佐藤商店」を正しく認識できる例について述べた。本実施形態では、同様にして、話者が「佐藤商店」を「佐藤書店」と言い間違った場合にも、音の類似性に着目したキーワードの拡張により「佐藤商店」を正しく認識することができる。
【００３６】
このように本実施形態においては、キーワードを音としての類似性に着目して拡張することにより、キーワードの認識誤りや話者の言い間違いによる認識性能の低下を効果的に防ぐことができる。
［第２の実施形態］
図３は本発明の第２の実施形態に係る音声認識装置の概略構成を示すブロック図である。
【００３７】
図３の音声認識装置において、音声入力部２０１から入力された音声は、音響分析部２０２で特徴パラメータに変換される。音声認識に使用される代表的な特徴パラメータとしては、バンドパスフィルタやフーリエ変換によって求めることができるパワースペクトルや、ＬＰＣ（線形予測）分析によって求めたケプストラム係数などがよく用いられるが、ここではその特徴パラメータの種類は問わない。
【００３８】
音響分析部２０２は求めた入力音声の特徴パラメータ系列をモデル照合部２０３に出力する。モデル照合部２０３は、音声モデル記憶部２０４に記憶されている全ての音節の任意の長さの音節列の音声モデルと特徴パラメータ系列の類似度あるいは距離を求める演算を行う。
【００３９】
モデル照合部２０３の照合方法としては、音声モデルも特徴パラメータ系列で表現しておき、ＤＰ（動的計画）法で音声モデルの特徴パラメータ系列と入力音声の特徴パラメータ系列の距離を求める手法や、ＨＭＭ（隠れマルコフモデル）を用いて音声モデルを表現しておき、入力音声の特徴パラメータ系列が入力されたときの各音声モデルの確率を計算する手法などが広く使用されているが、特に手法は問わない。
【００４０】
認識結果出力部２０５は、モデル照合部２０３での照合結果をもとに、制約条件記憶部２０６に記憶されている制約条件に従って、例えば先頭の音節と最後の音節が一致する音節列について、類似度（あるいは距離）をある条件のもとでソーティングして、類似度が最大（あるいは距離が最小）となる音節列の先頭の音節を認識結果として出力する。
【００４１】
以上に述べた図３の構成の音声認識装置の具体的動作を、例えば、「あさひ（朝日）のあ」と発声した場合を例にとり説明する。
まず、話者が「あさひ（朝日）のあ」と発声した結果、モデル照合部２０３にて図４に示すような音節列と類似度、即ち類似度が８６の音節列「あ」「さ」「ひ」「の」「あ」と、類似度が９２の音節列「う」「さ」「ひ」「の」「あ」とが得られたとする。
【００４２】
この場合、入力音声の先頭の音節と最後の音節が一致するという制約を設けないで、認識結果出力部２０５から類似度が最大となる音節列の先頭の音節を認識結果として出力するならば、入力音声の先頭の音節とは異なる誤った音節「う」が出力されることになる。
【００４３】
これに対して本実施形態では、制約条件記憶部２０６に記憶されている制約条件により、先頭と最後の音節が一致するという制約を設けてあるため、認識結果出力部２０５での認識結果は音節「あ」となり、入力音声の先頭の音節を正しく認識することができる。しかも、先頭と最後の音節が一致するという制約のもとで、入力音声の最初の音節を認識することから、この例のように音節「あ」を入力するときに発声する音声は、「朝日のあ」だけではなく、「あひるのあ」、更には「あじあ（アジア）」など、単に先頭の音節と最後の音節が同じであればよい。
【００４４】
このように本実施形態においては、入力音声の先頭の音節と最後の音節が一致するという制約のもとで入力音声の先頭の音節を認識することにより、非常に精度の高い音節認識を実現できる。また、各音節を入力するときに発声する言葉を覚える必要がないので、誰でもすぐに使用することができる。
［第３の実施形態］
図５は本発明の第３の実施形態に係る音声認識装置の概略構成を示すブロック図である。
【００４５】
図５の音声認識装置において、音声入力部３０１から入力された音声は、音響分析部３０２で特徴パラメータに変換される。音声認識に使用される代表的な特徴パラメータとしては、バンドパスフィルタやフーリエ変換によって求めることができるパワースペクトルや、ＬＰＣ（線形予測）分析によって求めたケプストラム係数などがよく用いられるが、ここではその特徴パラメータの種類は問わない。
【００４６】
音響分析部３０２は求めた入力音声の特徴パラメータ系列をモデル照合部２０３に出力する。モデル照合部３０３は、認識語彙入力部３０５から入力された認識語彙に従って予め作成して音声モデル作成・記憶部３０４に記憶しておいた認識対象とするキーワードの各音声モデルと上記入力音声の特徴パラメータ系列との類似度あるいは距離を求める演算を行う。
【００４７】
モデル照合部３０３の照合方法としては、音声モデルも特徴パラメータ系列で表現しておき、ＤＰ（動的計画）法で音声モデルの特徴パラメータ系列と入力音声の特徴パラメータ系列の距離を求める手法や、ＨＭＭ（隠れマルコフモデル）を用いて音声モデルを表現しておき、入力音声の特徴パラメータ系列が入力されたときの各音声モデルの確率を計算する手法などが広く使用されているが、特に手法は問わない。
【００４８】
認識結果出力部３０８は、モデル照合部３０３で求めた各認識語彙に対する類似度が最大（あるいは距離が最小）となる語彙を認識結果として出力する。
認識語彙入力部３０５は、認識したい語彙とその読みを登録するためのものであり、キーボードやファイルなどで実現することができる。
【００４９】
一方、例えば登録したい地名として、認識語彙入力部３０５から「神戸（かんべ）」を登録する際に、「神戸」の読みとしては「こうべ」の方が一般的であり、「こうべ」と誤読される可能性が高いと判断した場合には、「神戸（かんべ）」の誤読されやすい読み、即ち別称として「こうべ」を別称登録部３０６から登録する。この別称登録部３０６を、例えばキーボードで構成して、利用者からの当該キーボードの操作により別称を登録（入力）するようにするしてもよいし、語彙からその読みを検索することができるテーブルを予め作成して別称登録部３０６に設けておき、複数の読みが存在する場合には、別称をそのテーブルから別称登録部３０６内部で自動生成（入力）する構成としてもよい。
【００５０】
登録語彙類似性判定部３０７は、認識語彙入力部３０５から登録された全ての語彙の読みと別称登録部３０６での別称登録により登録される読みとの類似性を判断する。もし、別称登録される読みとの類似性のある（読みが登録された）語彙が存在する場合には、登録語彙類似性判定部３０７は警告出力部３０９により利用者に警告したり、別称の登録の中止を行う。この登録語彙類似性判定部３０７での読みの類似性の判定には、例えば読みの音節の相違が１音節以下などの条件が適用可能である。
【００５１】
このように本実施形態においては、認識語彙入力部３０５から入力して登録される認識語彙の別称を別称登録部３０６により登録する際に、別称が他の語彙と類似していないかを登録語彙類似性判定部３０７にて判断し、警告出力部３０９から利用者に知らせることにより、別称登録による認識性能の低下を未然に防ぐことができる。例えば、上記した「神戸（かんべ）」ではなくて、一般的な神戸（こうべ）」が認識語彙として登録されているにも拘らず、「神戸」の別称として「こうべ」を登録した場合には、「神戸（かんべ）」と「神戸（こうべ）」の識別はできなくなるが、図５の音声認識装置では、このような問題を回避することができる。
【００５２】
なお、別称登録部３０６により登録される別称に類似の認識語彙がない場合、例えば認識語彙「神戸（かんべ）」の別称として「こうべ」を登録する場合には、登録語彙類似性判定部３０７にて類似語彙がないものと判断されて別称登録が許可され、音声モデル作成・記憶部３０４には、認識語彙「神戸（かんべ）」の音声モデルとは別に、認識語彙「神戸（かんべ）」の別称「こうべ」の音声モデルが記憶される。この場合、「神戸（かんべ）」を誤って「こうべ」と発声しても、モデル照合部３０３で（音声モデル作成・記憶部３０４内の）「神戸（かんべ）」の別称の「こうべ」（の音声モデル）と照合されることで、「神戸（かんべ）」が認識される。
［第４の実施形態］
図６は本発明の第４の実施形態に係る音声認識装置の概略構成を示すブロック図である。
【００５３】
図６の音声認識装置において、音声入力部４０１から入力された音声は、音響分析部４０２で特徴パラメータに変換される。音声認識に使用される代表的な特徴パラメータとしては、バンドパスフィルタやフーリエ変換によって求めることができるパワースペクトルや、ＬＰＣ（線形予測）分析によって求めたケプストラム係数などがよく用いられるが、ここではその特徴パラメータの種類は問わない。
【００５４】
音響分析部４０２は求めた入力音声の特徴パラメータ系列をモデル照合部４０３に出力する。モデル照合部４０３は、認識語彙入力部４０５から入力された認識語彙に従って予め作成して音声モデル作成・記憶部４０４に記憶しておいた認識対象とするキーワード（認識語彙を構成するキーワード）の各音声モデルと上記入力音声の特徴パラメータ系列との類似度あるいは距離を求める演算を行う。
【００５５】
モデル照合部４０３の照合方法としては、音声モデルも特徴パラメータ系列で表現しておき、ＤＰ（動的計画）法で音声モデルの特徴パラメータ系列と入力音声の特徴パラメータ系列の距離を求める手法や、ＨＭＭ（隠れマルコフモデル）を用いて音声モデルを表現しておき、入力音声の特徴パラメータ系列が入力されたときの各音声モデルの確率を計算する手法などが広く使用されているが、特に手法は問わない。
【００５６】
キーワード別認識結果出力部４０６は、モデル照合部４０３で求めた各キーワードに対する類似度（あるいは距離）に従い、認識語彙入力部４０５から入力されて認識語彙記憶部４０７に記憶されている語彙を意味的に同じキーワード別にソーテイングし、類似度が最大（あるいは距離が最小）となる複数の語彙を認識結果として出力する。
【００５７】
例えば、認識語彙記憶部４０７内に、認識語彙として「田中ホテル」、「佐藤ホテル」、「加籐ホテル」、「田中酒店」、「佐藤酒店」、「田中ガソリンスタンド」の６種類が登録されている場合に、音声認識するキーワードとして、「田中」「佐藤」「加籐」「ホテル」「酒店」、「ガソリンスタンド」の６つキーワードを考える。
【００５８】
ここで、もし「田中ホテル」と発声された場合に、モデル照合部４０３にて得られる認識結果と類似度が図７に示すようになったものとする。この場合、キーワード別認識結果出力部４０６が、図８（ｂ）に示すように、単純にキーワードが組み合わされた（認識語彙記憶部４０７に記憶されている語彙に一致する）キーワード列に対する類似度の和の大きい順に複数の候補を出力したのでは（従来の出力方式）、「ホテル」や「酒店」が混在しているため候補選択時にわかりにくいという問題がある。
【００５９】
これに対して本実施形態では、キーワード別認識結果出力部４０６は、例えば業種を表すキーワードの類似度が予め定められた閾値以上となるキーワード列を、当該業種を表すキーワード別に出力する。例えば、類似度が１００以上の業種を表すキーワード別（ここでは「ホテル」と「酒店」の各キーワード別）に表示すると、図８（ａ）のように表示することができ、視認性良く候補を表示することができる。
【００６０】
このように本実施形態においては、キーワード別に複数の認識結果を類似度の大きい順（あるいは距離の小さい順）に出力することにより、候補選択を効率よく行うことができる。
［第５の実施形態］
図９は本発明の第５の実施形態に係る音声認識装置の概略構成を示すブロック図である。
【００６１】
図９の音声認識装置において、音声入力部５０１から入力された音声は、音響分析部５０２で特徴パラメータに変換される。音声認識に使用される代表的な特徴パラメータとしては、バンドパスフィルタやフーリエ変換によって求めることができるパワースペクトルや、ＬＰＣ（線形予測）分析によって求めたケプストラム係数などがよく用いられるが、ここではその特徴パラメータの種類は問わない。
【００６２】
音響分析部５０２は求めた入力音声の特徴パラメータ系列をモデル照合部５０３に出力する。モデル照合部５０３は、認識語彙入力部５０５から入力された認識語彙に従って予め作成して音声モデル作成・記憶部５０４に記憶しておいた認識対象とするキーワードの各音声モデル（ここでは、認識語彙の別称の音声モデルを含む各認識語彙の音声モデル）と上記入力音声の特徴パラメータ系列の類似度あるいは距離を求める演算を行う。
【００６３】
モデル照合部５０３の照合方法としては、音声モデルも特徴パラメータ系列で表現しておき、ＤＰ（動的計画）法で音声モデルの特徴パラメータ系列と入力音声の特徴パラメータ系列の距離を求める手法や、ＨＭＭ（隠れマルコフモデル）を用いて音声モデルを表現しておき、入力音声の特徴パラメータ系列が入力されたときの各音声モデルの確率を計算する手法などが広く使用されているが、特に手法は問わない。
【００６４】
認識語彙入力部５０５は、認識したい語彙とその読みを登録するためのものであり、キーボードやファイルなどで実現することができる。
一方、例えば登録したい地名として、認識語彙入力部５０５から「神戸（かんべ）」を登録する際に、「神戸」の読みとしては「こうべ」の方が一般的であり、「こうべ」と誤読される可能性が高いと判断した場合には、「神戸（かんべ）」の誤読されやすい読み、即ち別称として「こうべ」を別称登録部５０６から登録する。この別称登録部５０６を、例えばキーボードで構成して、利用者からの当該キーボードの操作により別称を登録（入力）するようにするしてもよいし、語彙からその読みを検索することができるテーブルを予め作成して別称登録部５０６に設けておき、複数の読みが存在する場合には、別称をそのテーブルから別称登録部５０６内部で自動生成（入力）する構成としてもよい。
【００６５】
別称登録部５０６から認識語彙「神戸（かんべ）」の別称として「こうべ」を登録すると、音声モデル作成・記憶部５０４には、認識語彙「神戸（かんべ）」の音声モデルとは別に、認識語彙「神戸（かんべ）」の別称「こうべ」の音声モデルが記憶される。ここで、「こうべ」の音声モデルには、「神戸（かんべ）」の別称であることを示すフラグ情報が付される。
【００６６】
そこで、「神戸（かんべ）」を誤って「こうべ」と発声しても、モデル照合部５０３で（音声モデル作成・記憶部５０４内の）「神戸（かんべ）」の別称の「こうべ」の音声モデルと照合されることで、「神戸（かんべ）」が認識される。ここで、「こうべ」の音声モデルには、上記したように「神戸（かんべ）」の別称であることを示すフラグ情報が付加されており、モデル照合部５０３で「こうべ」の音声モデルとの照合が行われた場合、その照合結果には当該フラグ情報が付されて認識結果出力部５０７に渡される。これにより認識結果出力部５０７は、モデル照合部５０３で認識されたキーワードは正しい読みでなくて別称であることを識別し、認識結果「神戸」に正しい読み「かんべ」を付加して、表示または音声で出力する。
【００６７】
このように本実施形態においては、認識結果出力時に、正しい読みを出力することにより、話者が別称で覚えていた言葉の正しい読みを当該話者に覚えさせることができ、以後正しい読みで入力できるようになる。
［第６の実施形態］
図１０は本発明の第６の実施形態に係る音声認識装置の概略構成を示すブロック図である。
【００６８】
図１０の音声認識装置において、音声入力部６０１から入力された音声は、音響分析部６０２で特徴パラメータに変換される。音声認識に使用される代表的な特徴パラメータとしては、バンドパスフィルタやフーリエ変換によって求めることができるパワースペクトルや、ＬＰＣ（線形予測）分析によって求めたケプストラム係数などがよく用いられるが、ここではその特徴パラメータの種類は問わない。
【００６９】
音響分析部６０２は求めた入力音声の特徴パラメータ系列をモデル照合部６０３に出力する。モデル照合部６０３は、認識語彙入力部６０５から入力された認識語彙の読み（仮名、カタカナ、あるいはローマ字などの表記で入力される認識語彙の読み）に従って予め作成して音声モデル作成・記憶部６０４に記憶しておいた認識対象とするキーワード（認識語彙を構成するキーワード）の各音声モデルと上記入力音声の特徴パラメータ系列との類似度あるいは距離を求める演算を行う。
【００７０】
モデル照合部６０３の照合方法としては、音声モデルも特徴パラメータ系列で表現しておき、ＤＰ（動的計画）法で音声モデルの特徴パラメータ系列と入力音声の特徴パラメータ系列の距離を求める手法や、ＨＭＭ（隠れマルコフモデル）を用いて音声モデルを表現しておき、入力音声の特徴パラメータ系列が入力されたときの各音声モデルの確率を計算する手法などが広く使用されているが、特に手法は問わない。
【００７１】
音声出力部６０７は、認識語彙入力部６０５から認識語彙の読みが入力された際に、その読みを音声に変換して出力する。音声出力部６０７による音声出力は、例えば日本語の全ての音節について音声を記憶しておき、上記入力された読みに従って、記憶された音声を接続することにより実現することができる。
【００７２】
例えば、認識語彙「竹芝」の読みとして認識語彙入力部６０５から誤って「たけしぱ」と入力したとすると、その誤った読み「たけしぱ」の音声モデルが音声モデル作成・記憶部６０４で作成・記憶されるため、認識時に「たけしば」と発声しても正しく認識できなくなる。
【００７３】
これに対して本実施形態によれば、認識語彙入力部６０５から認識語彙「竹芝」の読みを登録するときに、誤って「たけしぱ」と入力すると、音声出力部６０７により「たけしぱ」と音声で出力してくれるので、話者（認識語彙登録者）は読みの入力間違いに容易に気づくことができ、読みの入力誤りによる認識性能の低下を未然に防ぐことができる。
【００７４】
以上に述べた図１、図３、図５、図６、図９、図１０の構成の音声認識装置の各部の機能は、コンピュータ、例えば内蔵型マイクロホンが組み込まれた、あるいはマイクロホン入力端子が設けられた音声入力機能を持つパーソナルコンピュータを、上記音声認識装置が持つ各処理部として機能させるためのプログラムを記録した、ＣＤ‐ＲＯＭ、フロッピーディスク、メモリカード等の記録媒体を用い、当該記録媒体をパーソナルコンピュータに装着して、当該記録媒体に記録されているプログラムをパーソナルコンピュータで読み取り実行させることにより実現される。また、上記プログラムは、記録媒体に限らず、例えば通信回線からダウンロードされるものであっても構わない。
【００７５】
以上詳述したように本発明によれば、認識語彙の読みの登録間違いや、装置使用時の読みの記憶違い、誤った読みでの発声等に起因する認識性能の低下を防ぐことができる。
【図面の簡単な説明】
【図１】本発明の第１の実施形態に係る音声認識装置の概略構成を示すブロック図。
【図２】図１中のキーワード変換テーブル１０６ａの内容例を示す図。
【図３】本発明の第２の実施形態に係る音声認識装置の概略構成を示すブロック図。
【図４】「あさひのあ」と発声された場合に図３中のモデル照合部２０３で得られる音節列と類似度の一例を示す図。
【図５】本発明の第３の実施形態に係る音声認識装置の概略構成を示すブロック図。
【図６】本発明の第４の実施形態に係る音声認識装置の概略構成を示すブロック図。
【図７】「田中ホテル」と発声された場合に図６中のモデル照合部４０３にて得られる各キーワードと類似度の一例を示す図。
【図８】図７の認識結果と類似度とに基づく認識結果表示例を従来方式の認識結果表示例と対比させて示す図。
【図９】本発明の第５の実施形態に係る音声認識装置の概略構成を示すブロック図。
【図１０】本発明の第６の実施形態に係る音声認識装置の概略構成を示すブロック図。
【符号の説明】
１０１，２０１，３０１，４０１，５０１，６０１…音声入力部
１０２，２０２，３０２，４０２，５０２，６０２…音響分析部
１０３，２０３，３０３，４０３，５０３，６０３…モデル照合部
１０４，３０４，４０４，５０４，６０４…音声モデル作成・記憶部
１０５，３０５，４０５，５０５，６０５…認識語彙入力部
１０６…キーワード変換部
１０７…キーワード拡張部
１０８，２０５，３０８，５０７，６０６…認識結果出力部
１０９，４０７…認識語彙記憶部
２０４…音声モデル記憶部
２０６…制約条件記憶部
３０６，５０６…別称登録部
３０７…登録語彙類似性判定部
３０９…警告出力部
４０６…キーワード別認識結果出力部
６０７…音声出力部
[0001]
[Technical Field of the Invention]
The present invention relates to a speech recognition apparatus that recognizes input speech by matching a feature parameter series, obtained by acoustic analysis of the input speech, against previously created speech models of the keywords constituting each recognition vocabulary item. In particular, it relates to a speech recognition apparatus suited to reducing misrecognition caused by errors in registering the readings of the recognition vocabulary or by misremembered readings when the apparatus is used.
[0002]
[Prior art]
In general, a speech recognition apparatus that recognizes input speech takes as input the reading of each vocabulary item to be recognized (the recognition vocabulary), creates in advance from that reading a speech model of each keyword constituting the corresponding recognition vocabulary item, and stores the models for use in recognizing input speech. Recognition of input speech by this type of speech recognition apparatus is performed as follows.
[0003]
First, the input speech is acoustically analyzed to obtain a feature parameter series. Next, the input speech is recognized by comparing the obtained feature parameter series of the input speech with a speech model of a keyword constituting each recognition vocabulary prepared in advance.
[0004]
Conventionally, such a speech recognition apparatus has had the problem that, if the reading of a recognition vocabulary item is registered incorrectly, the item cannot be recognized correctly even when the correct reading is uttered during use. There has also been the problem that, even if the correct reading is registered when the recognition vocabulary is registered, the item is not recognized correctly when an incorrect reading is uttered during use.
[0005]
[Problems to be solved by the invention]
As described above, in the conventional speech recognition apparatus, if the reading of a recognition vocabulary item is registered incorrectly, the item cannot be recognized correctly even when the correct reading is uttered during use; conversely, even if the correct reading is registered, the item is not recognized correctly when an incorrect reading is uttered during use.
[0006]
The present invention has been made in view of the above circumstances, and its object is to provide a speech recognition apparatus that can prevent the degradation of recognition performance caused by errors in registering the readings of the recognition vocabulary, misremembered readings when the apparatus is used, utterances with an incorrect reading, and the like.
[0012]
[Means for Solving the Problems]
According to one aspect of the present invention, there is provided a speech recognition apparatus that recognizes input speech by matching it against the speech model of each recognition vocabulary item, including speech models of aliases of the recognition vocabulary. This speech recognition apparatus comprises: alias registration means for registering, in addition to a first reading of a recognition vocabulary item to be registered, a different second reading as an alias; registered vocabulary similarity determination means for determining whether the second reading to be registered as an alias is similar to the first reading of an already registered recognition vocabulary item; warning output means for outputting a warning to that effect when the registered vocabulary similarity determination means determines that the readings are similar; and recognition result output means for recognizing, when the input speech is recognized by matching against the speech model of a second reading registered as an alias by the alias registration means, the vocabulary item that has that second reading as its alias. Here, it is preferable that, when the recognition result output means recognizes a second reading registered as an alias, it presents to the user the first reading of the vocabulary item that has that second reading as its alias.
[0013]
In such a configuration, when a reading (second reading) different from the reading (first reading) of a recognition vocabulary item input and registered through the recognition vocabulary input means is registered as an alias by the alias registration means, the registered vocabulary similarity determination means determines whether the second reading to be registered as the alias is similar to the first reading of another vocabulary item. If it is, the warning output means notifies the user, so that degradation of recognition performance due to alias registration can be prevented in advance. In this case, the alias registration in question should preferably be cancelled. Further, when the input speech is recognized by matching against the speech model of a second reading registered as an alias, the vocabulary item that has that second reading as its alias is recognized. For example, with “Kobe” registered as an alias of 神戸 read “Kanbe”, the correct “Kanbe” is obtained as the recognition result even if the speaker utters the more common reading “Kobe”. By then presenting to the speaker the reading (first reading) “Kanbe” of the vocabulary item 神戸 whose alias is “Kobe” (second reading), the speaker can learn the correct reading of a word that he or she had remembered by its alias, and can use the correct reading for input thereafter.
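The alias registration flow described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the class name `Vocabulary`, its methods, and the similarity rule (an edit distance counted per kana character, with a threshold of one) are assumptions introduced only for illustration, and the acoustic matching itself is left out.

```python
from dataclasses import dataclass, field


def syllable_difference(a: str, b: str) -> int:
    """Edit distance between two readings, counted per kana character (an assumption)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


@dataclass
class Vocabulary:
    # Hypothetical store: reading -> (word, first reading).
    # An alias (second reading) maps to the first reading of its word.
    entries: dict = field(default_factory=dict)

    def register(self, word: str, reading: str) -> None:
        """Register a word with its first reading."""
        self.entries[reading] = (word, reading)

    def register_alias(self, word: str, first_reading: str, alias: str,
                       max_diff: int = 1) -> list:
        """Return warnings; the alias is stored only when no warning is raised."""
        warnings = [
            f"alias '{alias}' is similar to registered reading '{r}' of '{w}'"
            for r, (w, first) in self.entries.items()
            if r == first and r != first_reading
            and syllable_difference(alias, r) <= max_diff
        ]
        if not warnings:
            self.entries[alias] = (word, first_reading)
        return warnings

    def resolve(self, recognized_reading: str) -> tuple:
        """Map a recognized reading (first reading or alias) to (word, first reading)."""
        return self.entries[recognized_reading]


vocab = Vocabulary()
vocab.register("神戸", "かんべ")
print(vocab.register_alias("神戸", "かんべ", "こうべ"))  # no warning: alias accepted
print(vocab.resolve("こうべ"))  # the word together with its first reading
```

Returning the first reading together with the word is what allows the output side to show the speaker the registered reading “Kanbe” after the alias “Kobe” was uttered.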
[0020]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the present invention will be described below with reference to the drawings.
[First Embodiment]
FIG. 1 is a block diagram showing a schematic configuration of a speech recognition apparatus according to the first embodiment of the present invention.
[0021]
In the speech recognition apparatus of FIG. 1, speech input from the speech input unit 101 is converted into feature parameters by the acoustic analysis unit 102. As typical feature parameters used for speech recognition, a power spectrum that can be obtained by a bandpass filter or Fourier transform, a cepstrum coefficient obtained by LPC (linear prediction) analysis, etc. are often used. The type of feature parameter does not matter.
[0022]
The acoustic analysis unit 102 outputs the obtained feature parameter series of the input speech to the model matching unit 103. The model matching unit 103 calculates the similarity or distance between the feature parameter series of the input speech and each speech model of the keywords to be recognized (the keywords constituting the recognition vocabulary), which are created in advance according to the recognition vocabulary input from the recognition vocabulary input unit 105 and stored in the speech model creation / storage unit 104.
[0023]
Widely used matching methods for the model matching unit 103 include representing each speech model as a feature parameter series and obtaining the distance between the feature parameter series of the speech model and that of the input speech by the DP (dynamic programming) method, and representing each speech model with an HMM (hidden Markov model) and calculating the probability of each speech model given the feature parameter series of the input speech. The particular method does not matter here.
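As a rough illustration of the DP approach mentioned above, the following Python sketch computes a dynamic time warping distance between a stored model series and an input series and picks the closest model. The function names, the Euclidean frame distance, and the toy one-dimensional features are assumptions made for this sketch; the text does not prescribe any particular DP formulation.

```python
import math


def dtw_distance(model, utterance):
    """Accumulated frame distance along the best DP alignment path."""
    n, m = len(model), len(utterance)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local distance between one model frame and one input frame.
            d = math.dist(model[i - 1], utterance[j - 1])
            # Allow a frame to be stretched, compressed, or matched one to one.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def best_match(models, utterance):
    """Return the name of the model with the smallest DP distance."""
    return min(models, key=lambda name: dtw_distance(models[name], utterance))


# Toy one-dimensional "feature parameter series" (hypothetical values).
models = {"a": [(0.0,), (1.0,), (2.0,)], "b": [(5.0,), (5.0,), (5.0,)]}
utterance = [(0.0,), (0.0,), (1.0,), (2.0,)]
print(best_match(models, utterance))
```

A smaller distance plays the role of a larger similarity, which is why the description consistently speaks of "maximum similarity (or minimum distance)".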
[0024]
The recognition vocabulary input unit 105 is used to input the recognition vocabulary together with the information needed to create a speech model of each keyword constituting it: the information for dividing each recognition vocabulary item into keywords (keyword division information) and the reading of each keyword. It can be realized with a keyboard, a file, or the like. The recognition vocabulary input from the recognition vocabulary input unit 105 is registered in the recognition vocabulary storage unit 109.
[0025]
The keyword conversion unit 106 extracts each keyword using the keyword division information input from the recognition vocabulary input unit 105, and creates and stores a conversion table (keyword conversion table) 106a of keywords selected on the basis of the sound similarity between keywords and attributes such as part of speech, for example keywords that sound similar and have the same part of speech.
[0026]
The keyword expansion unit 107 expands the keywords: it has the keyword conversion unit 106 convert each keyword (with its similarity or distance) obtained by the model matching unit 103 into other keywords that sound similar, according to the keyword conversion table 106a.
[0027]
The recognition result output unit 108 sorts, under a certain condition (for example, magnitude of similarity), the similarities (or distances) of the keyword strings formed by combining the keywords matched by the model matching unit 103 with the keywords obtained through expansion by the keyword expansion unit 107, and outputs as the recognition result the category to be recognized that has the maximum similarity (or minimum distance) among the recognition vocabulary stored in the recognition vocabulary storage unit 109. As a constraint on this sorting, a rule such as giving a person's name priority over a company name may also be applied.
[0028]
The specific operation of the speech recognition apparatus configured as in FIG. 1 is described below, taking as an example the case where the vocabulary to be recognized by the apparatus, that is, the recognition vocabulary, consists of three items: “Sato Shōten” (佐藤商店, Sato Store), “Kato Shokudō” (加藤食堂, Kato Restaurant), and “Tanaka Shoten” (田中書店, Tanaka Bookstore).
[0029]
In this case, the three recognition vocabulary items above are input from the recognition vocabulary input unit 105. In this embodiment, so that a speech model can be created (by the speech model creation / storage unit 104) for each keyword constituting a recognition vocabulary item, the items are input with a keyword division symbol (keyword division information) “-” inserted, as in “Sato-shōten”, “Kato-shokudō”, and “Tanaka-shoten”.
[0030]
When the recognition vocabulary items “Sato-shōten”, “Kato-shokudō”, and “Tanaka-shoten”, with the keyword division symbol “-” inserted, are input from the recognition vocabulary input unit 105, the speech model creation / storage unit 104 divides each item into its constituent keywords at the keyword division symbol “-”. Here, the three items are divided into six keywords: “Sato”, “Kato”, “Tanaka”, “shōten” (商店, store), “shokudō” (食堂, restaurant), and “shoten” (書店, bookstore). The speech model creation / storage unit 104 creates and stores a speech model for each of these six keywords.
[0031]
At the same time, the keyword conversion unit 106 examines the sound similarity among the six keywords “Sato”, “Kato”, “Tanaka”, “shōten” (store), “shokudō” (restaurant), and “shoten” (bookstore) obtained from the recognition vocabulary items with keyword division symbols input from the recognition vocabulary input unit 105, extracts the keywords that are similar, and creates the keyword conversion table 106a. Here, the rule for creating the keyword conversion table 106a is that keywords whose readings differ in no more than a predetermined number of syllables, for example one syllable, are regarded as similar in sound and registered in the table 106a. In this case, “Sato” and “Kato”, and “shōten” and “shoten”, are extracted as similar, and a keyword conversion table 106a as shown in FIG. 2 is created.
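The table-building rule can be sketched in Python as follows. This is only an illustration under stated assumptions: readings are given in hiragana, a "syllable" is approximated by one kana (with small kana such as ょ attached to the preceding one), the difference between readings is measured as an edit distance over those units, and the function names are hypothetical.

```python
SMALL_KANA = set("ゃゅょぁぃぅぇぉ")


def to_syllables(reading):
    """Split a hiragana reading into units, attaching small kana to the previous unit."""
    units = []
    for ch in reading:
        if ch in SMALL_KANA and units:
            units[-1] += ch
        else:
            units.append(ch)
    return units


def syllable_difference(a, b):
    """Edit distance between two lists of syllable units."""
    prev = list(range(len(b) + 1))
    for i, sa in enumerate(a, start=1):
        curr = [i]
        for j, sb in enumerate(b, start=1):
            cost = 0 if sa == sb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def split_keywords(entry):
    """Split a vocabulary entry at the keyword division symbol."""
    return entry.split("-")


def build_conversion_table(readings, max_diff=1):
    """Pair up keywords whose readings differ in at most max_diff syllables."""
    table = {keyword: [] for keyword in readings}
    keys = list(readings)
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            diff = syllable_difference(to_syllables(readings[a]),
                                       to_syllables(readings[b]))
            if diff <= max_diff:
                table[a].append(b)
                table[b].append(a)
    return {k: v for k, v in table.items() if v}


readings = {
    "佐藤": "さとう", "加藤": "かとう", "田中": "たなか",
    "商店": "しょうてん", "食堂": "しょくどう", "書店": "しょてん",
}
print(split_keywords("佐藤-商店"))
print(build_conversion_table(readings))
```

With these readings only the pairs 佐藤/加藤 and 商店/書店 fall within one syllable of each other, which matches the pairs named in the text. The part-of-speech condition mentioned for the keyword conversion unit 106 is not modeled here.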
[0032]
Then, at the time of speech recognition, suppose for example that “Sato Shōten” is input and the matching results of the model matching unit 103 are “Sato” and “shoten” (bookstore). The keyword expansion unit 107 has the keyword conversion unit 106 perform the keyword conversions “Sato → Kato” and “shoten → shōten” according to the keyword conversion table 106a, and expands the keywords as if the matching results of the model matching unit 103 included “Kato” and “shōten” (store) in addition to “Sato” and “shoten”.
[0033]
Among the combinations of keywords expanded by the keyword expansion unit 107, “Sato Shōten” matches a recognition vocabulary item stored in the recognition vocabulary storage unit 109. Therefore, even though the matching results of the model matching unit 103 were “Sato” and “shoten” (bookstore), the recognition result output unit 108 can correctly recognize and output “Sato Shōten” (Sato Store).
[0034]
In contrast, if the keyword conversion unit 106 and the keyword expansion unit 107 were absent and no keyword expansion were performed, “Sato Shoten” (Sato Bookstore), composed of the matching results “Sato” and “shoten” of the model matching unit 103, does not exist in the recognition vocabulary storage unit 109, so “Sato Shōten” (Sato Store) could not be recognized correctly.
[0035]
Note that, for a keyword obtained by keyword conversion, the similarity from matching against the speech model should preferably be lowered by a fixed value or a fixed ratio.
The above describes an example in which, even when the speaker utters “Sato Shōten” (Sato Store) but the model matching unit 103 produces the incorrect matching results “Sato” and “shoten” (bookstore), “Sato Shōten” is correctly recognized through keyword expansion based on sound similarity. Likewise, in this embodiment, even when the speaker misspeaks and says “Sato Shoten” (Sato Bookstore) instead of “Sato Shōten”, “Sato Shōten” can be correctly recognized through the same keyword expansion.
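The expansion and lookup steps of this example can be sketched in Python as below. The similarity scores, the fixed penalty of 5 applied to converted keywords, and the function name are hypothetical values chosen for illustration; the conversion table is written out by hand in the form suggested by FIG. 2.

```python
from itertools import product


def expand_and_recognize(matched, table, vocabulary, penalty=5):
    """matched: list of (keyword, similarity) in utterance order."""
    options = []
    for keyword, score in matched:
        # Keep the matched keyword and add its sound-alike keywords,
        # lowering their similarity by a fixed value.
        alternatives = [(keyword, score)]
        alternatives += [(alt, score - penalty) for alt in table.get(keyword, [])]
        options.append(alternatives)

    candidates = []
    for combo in product(*options):
        phrase = "".join(keyword for keyword, _ in combo)
        if phrase in vocabulary:  # only registered vocabulary can be output
            candidates.append((sum(score for _, score in combo), phrase))
    candidates.sort(reverse=True)
    return candidates[0][1] if candidates else None


vocabulary = {"佐藤商店", "加藤食堂", "田中書店"}
table = {"佐藤": ["加藤"], "加藤": ["佐藤"], "商店": ["書店"], "書店": ["商店"]}
matched = [("佐藤", 90), ("書店", 85)]  # matching wrongly chose 書店 (bookstore)

print(expand_and_recognize(matched, table, vocabulary))  # with keyword expansion
print(expand_and_recognize(matched, {}, vocabulary))     # without keyword expansion
```

With the table, the only expanded combination found in the vocabulary is 佐藤商店, so it is returned. With an empty table the combination 佐藤書店 is not registered and nothing is recognized, mirroring the contrast drawn in the text.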
[0036]
As described above, in this embodiment, expanding keywords on the basis of their similarity in sound effectively prevents degradation of recognition performance caused by keyword recognition errors or slips of the tongue by the speaker.
[Second Embodiment]
FIG. 3 is a block diagram showing a schematic configuration of a speech recognition apparatus according to the second embodiment of the present invention.
[0037]
In the speech recognition apparatus of FIG. 3, the speech input from the speech input unit 201 is converted into feature parameters by the acoustic analysis unit 202. As typical feature parameters used for speech recognition, a power spectrum that can be obtained by a bandpass filter or Fourier transform, a cepstrum coefficient obtained by LPC (linear prediction) analysis, etc. are often used. The type of feature parameter does not matter.
[0038]
The acoustic analysis unit 202 outputs the obtained feature parameter series of the input speech to the model matching unit 203. The model matching unit 203 calculates the similarity or distance between the feature parameter series and the speech models, stored in the speech model storage unit 204, of syllable strings of arbitrary length composed of any syllables.
[0039]
Widely used matching methods for the model matching unit 203 include representing each speech model as a feature parameter series and obtaining the distance between the feature parameter series of the speech model and that of the input speech by the DP (dynamic programming) method, and representing each speech model with an HMM (hidden Markov model) and calculating the probability of each speech model given the feature parameter series of the input speech. The particular method does not matter here.
[0040]
Based on the matching results of the model matching unit 203 and in accordance with the constraint condition stored in the constraint condition storage unit 206, the recognition result output unit 205 sorts, under a certain condition, the similarities (or distances) of, for example, the syllable strings whose first and last syllables match, and outputs as the recognition result the first syllable of the syllable string having the maximum similarity (or minimum distance).
[0041]
The specific operation of the speech recognition apparatus configured as in FIG. 3 is described below, taking as an example the case where “asahi no a” (“a” as in asahi, 朝日) is uttered.
First, suppose that, as a result of the speaker uttering “asahi no a”, the model matching unit 203 obtains the syllable strings and similarities shown in FIG. 4, namely the syllable string “a” “sa” “hi” “no” “a” with a similarity of 86 and the syllable string “u” “sa” “hi” “no” “a” with a similarity of 92.
[0042]
In this case, if the recognition result output unit 205 output the first syllable of the syllable string with the maximum similarity without imposing the constraint that the first and last syllables of the input speech match, the incorrect syllable “u”, which differs from the first syllable of the input speech, would be output.
[0043]
In this embodiment, by contrast, the constraint condition stored in the constraint condition storage unit 206 imposes the constraint that the first and last syllables match, so the recognition result of the recognition result output unit 205 is the syllable “a”, and the first syllable of the input speech is recognized correctly. Moreover, because the first syllable of the input speech is recognized under the constraint that the first and last syllables match, the utterance used to input the syllable “a” need not be “asahi no a”; it may be “ahiru no a” (“a” as in ahiru, duck) or even “ajia” (Asia), as long as its first and last syllables are the same.
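The effect of the first-and-last-syllable constraint can be reproduced with a short Python sketch. The candidate list uses the two syllable strings and similarities given for FIG. 4 (86 and 92); the function name and data layout are assumptions for illustration only.

```python
def recognize_first_syllable(candidates, require_match=True):
    """candidates: list of (syllable_list, similarity) from model matching."""
    if require_match:
        # Constraint: keep only strings whose first and last syllables agree.
        candidates = [(s, score) for s, score in candidates if s and s[0] == s[-1]]
    if not candidates:
        return None
    best, _ = max(candidates, key=lambda c: c[1])
    return best[0]


candidates = [
    (["あ", "さ", "ひ", "の", "あ"], 86),
    (["う", "さ", "ひ", "の", "あ"], 92),
]
print(recognize_first_syllable(candidates))                       # constraint applied
print(recognize_first_syllable(candidates, require_match=False))  # constraint removed
```

With the constraint, the higher-scoring string beginning with う is discarded because it does not end in う, and あ is returned; without it, the wrong syllable う wins on similarity alone.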
[0044]
As described above, in this embodiment, recognizing the first syllable of the input speech under the constraint that its first and last syllables match makes highly accurate syllable recognition possible. Moreover, since there is no need to memorize a particular word to utter for each syllable, anyone can use the apparatus right away.
[Third Embodiment]
FIG. 5 is a block diagram showing a schematic configuration of a speech recognition apparatus according to the third embodiment of the present invention.
[0045]
In the speech recognition apparatus of FIG. 5, the speech input from the speech input unit 301 is converted into feature parameters by the acoustic analysis unit 302. As typical feature parameters used for speech recognition, a power spectrum that can be obtained by a bandpass filter or Fourier transform, a cepstrum coefficient obtained by LPC (linear prediction) analysis, etc. are often used. The type of feature parameter does not matter.
[0046]
The acoustic analysis unit 302 outputs the obtained feature parameter series of the input speech to the model matching unit 303. The model matching unit 303 computes the similarity or distance between the feature parameter series of the input speech and each speech model of the keywords to be recognized. These speech models are created in advance according to the recognition vocabulary input from the recognition vocabulary input unit 305 and are stored in the speech model creation / storage unit 304.
[0047]
Two matching methods are widely used for the model matching unit 303. In one, the speech model is also expressed as a feature parameter sequence, and the distance between the feature parameter sequence of the speech model and that of the input speech is obtained by DP (dynamic programming). In the other, the speech model is expressed as an HMM (Hidden Markov Model), and the probability of each speech model given the feature parameter sequence of the input speech is calculated. Either method may be used.
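As a rough illustration of the DP-based matching mentioned above, the following sketch computes a dynamic-programming (time-warping) distance between two feature parameter sequences and picks the vocabulary with the minimum distance. The local path constraint, the Euclidean frame distance, and the toy two-dimensional feature vectors are all assumptions made for this example. The patent does not specify them.

```python
import math
from typing import Dict, List, Sequence

Frame = Sequence[float]  # one feature parameter vector


def frame_distance(a: Frame, b: Frame) -> float:
    """Euclidean distance between two feature vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dp_distance(model: List[Frame], speech: List[Frame]) -> float:
    """Accumulated distance along the best time alignment of two sequences."""
    n, m = len(model), len(speech)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(model[i - 1], speech[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognize(models: Dict[str, List[Frame]], speech: List[Frame]) -> str:
    """Return the vocabulary whose speech model has the minimum distance."""
    return min(models, key=lambda word: dp_distance(models[word], speech))


# Toy models and input (hypothetical values).
models = {
    "word_a": [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
    "word_b": [[5.0, 5.0], [4.0, 4.0], [3.0, 3.0]],
}
speech = [[0.0, 0.1], [0.9, 1.0], [1.0, 1.1], [2.1, 2.0]]
best = recognize(models, speech)
print(best)
```

The input has four frames while each model has three. The alignment absorbs this difference in length, which is the reason for using DP rather than a frame-by-frame comparison.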
[0048]
The recognition result output unit 308 outputs, as the recognition result, the vocabulary with the maximum similarity (or the minimum distance) among the similarities (or distances) obtained for each recognition vocabulary by the model matching unit 303.
The recognition vocabulary input unit 305 is for registering a vocabulary to be recognized and its reading, and can be realized by a keyboard or a file.
[0049]
On the other hand, suppose for example that the place name 神戸 is registered from the recognition vocabulary input unit 305 with the reading “Kanbe”. If “Kobe” is the more common reading of 神戸, so that “Kanbe” is judged likely to be misread as “Kobe”, then the easily misread reading “Kobe” is registered as an alias of “Kanbe” from the alias registration unit 306. The alias registration unit 306 may be configured with a keyboard, for example, so that the alias is registered (input) by the user's keyboard operation. Alternatively, a table from which readings can be retrieved by vocabulary may be created in advance and provided in the alias registration unit 306, so that when a vocabulary has multiple readings, the alias is automatically generated (input) from the table within the alias registration unit 306.
[0050]
The registered vocabulary similarity determination unit 307 determines the similarity between the readings of all vocabularies registered from the recognition vocabulary input unit 305 and the reading registered as an alias by the alias registration unit 306. If a vocabulary whose reading is similar to the alias reading exists (is already registered), the registered vocabulary similarity determination unit 307 causes the warning output unit 309 to warn the user and cancels the alias registration. As the criterion for similarity of readings in the registered vocabulary similarity determination unit 307, for example, the condition that the readings differ by one syllable or less can be applied.
[0051]
As described above, in the present embodiment, when an alias of a recognition vocabulary registered from the recognition vocabulary input unit 305 is registered by the alias registration unit 306, the registered vocabulary similarity determination unit 307 determines whether the alias is similar to another vocabulary, and the warning output unit 309 informs the user if it is. This prevents a reduction in recognition performance due to alias registration. For example, if “Kobe” were registered as an alias of “Kanbe” as described above even though “Kobe” is already registered as a recognition vocabulary separate from “Kanbe”, then “Kanbe” and “Kobe” could no longer be distinguished from each other. The speech recognition apparatus of FIG. 5 avoids this problem.
[0052]
Note that if no recognition vocabulary is similar to the alias registered by the alias registration unit 306, for example when “Kobe” is registered as an alias of the recognition vocabulary “Kanbe”, the registered vocabulary similarity determination unit 307 judges that no similar vocabulary is registered and permits the alias registration. The speech model creation / storage unit 304 then stores, separately from the speech model of the recognition vocabulary “Kanbe”, a speech model of “Kobe”, the alias of “Kanbe”. In this case, even if “Kanbe” is mistakenly uttered as “Kobe”, the model matching unit 303 recognizes “Kanbe” by matching the input against the speech model of its alias “Kobe” stored in the speech model creation / storage unit 304.
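The alias check of this embodiment might be sketched as follows. The sketch interprets “a difference of one syllable or less” as an edit distance counted in syllables, which is an assumption. The function names, the romanized syllable splits, and the registered vocabularies are likewise hypothetical.

```python
from typing import Dict, List, Optional

Reading = List[str]  # a reading split into syllables


def syllable_difference(a: Reading, b: Reading) -> int:
    """Edit distance between two readings, counted in syllables."""
    prev = list(range(len(b) + 1))
    for i, sa in enumerate(a, start=1):
        curr = [i]
        for j, sb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete a syllable
                            curr[j - 1] + 1,            # insert a syllable
                            prev[j - 1] + (sa != sb)))  # substitute
        prev = curr
    return prev[-1]


def find_similar(alias: Reading, owner: str,
                 registered: Dict[str, Reading],
                 max_diff: int = 1) -> Optional[str]:
    """Return another registered vocabulary whose first reading is within
    max_diff syllables of the alias, or None if registration is safe."""
    for word, reading in registered.items():
        if word != owner and syllable_difference(alias, reading) <= max_diff:
            return word
    return None


alias = ["ko", "u", "be"]  # "Kobe", proposed as an alias of "Kanbe"

# Case 1: no similar vocabulary is registered, so the alias is permitted.
registered = {"Kanbe": ["ka", "n", "be"], "Kyoto": ["kyo", "u", "to"]}
permitted = find_similar(alias, "Kanbe", registered)

# Case 2: "Kobe" is already a vocabulary of its own, so a warning is due.
registered_with_kobe = dict(registered, Kobe=["ko", "u", "be"])
conflict = find_similar(alias, "Kanbe", registered_with_kobe)
print(permitted, conflict)
```

In the second case, a caller would output the warning and cancel the alias registration, as the warning output unit 309 does in the text above.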
[Fourth Embodiment]
FIG. 6 is a block diagram showing a schematic configuration of a speech recognition apparatus according to the fourth embodiment of the present invention.
[0053]
In the speech recognition apparatus of FIG. 6, the speech input from the speech input unit 401 is converted into feature parameters by the acoustic analysis unit 402. As typical feature parameters used for speech recognition, a power spectrum that can be obtained by a bandpass filter or Fourier transform, a cepstrum coefficient obtained by LPC (linear prediction) analysis, etc. are often used. The type of feature parameter does not matter.
[0054]
The acoustic analysis unit 402 outputs the obtained feature parameter series of the input speech to the model matching unit 403. The model matching unit 403 computes the similarity or distance between the feature parameter series of the input speech and each speech model of the keywords to be recognized (the keywords constituting the recognition vocabulary). These speech models are created in advance according to the recognition vocabulary input from the recognition vocabulary input unit 405 and are stored in the speech model creation / storage unit 404.
[0055]
Two matching methods are widely used for the model matching unit 403. In one, the speech model is also expressed as a feature parameter sequence, and the distance between the feature parameter sequence of the speech model and that of the input speech is obtained by DP (dynamic programming). In the other, the speech model is expressed as an HMM (Hidden Markov Model), and the probability of each speech model given the feature parameter sequence of the input speech is calculated. Either method may be used.
[0056]
The keyword-specific recognition result output unit 406 takes the vocabularies input from the recognition vocabulary input unit 405 and stored in the recognition vocabulary storage unit 407, groups them by semantically identical keyword according to the similarity (or distance) to each keyword obtained by the model matching unit 403, and outputs a plurality of vocabularies with the maximum similarity (or the minimum distance) as recognition results.
[0057]
For example, suppose that six recognition vocabularies, “Tanaka Hotel”, “Sato Hotel”, “Katan Hotel”, “Tanaka Inn”, “Sato Inn”, and “Tanaka Gas Station”, are registered in the recognition vocabulary storage unit 407. In this case, six keywords are used for speech recognition: “Tanaka”, “Sato”, “Katan”, “Hotel”, “Inn”, and “Gas Station”.
[0058]
Here, suppose that “Tanaka Hotel” is uttered and that the recognition results and similarities obtained by the model matching unit 403 are as shown in FIG. 7. In the conventional output method shown in FIG. 8(b), keyword strings formed by simply combining keywords (strings that match vocabularies stored in the recognition vocabulary storage unit 407) are output as a plurality of candidates in descending order of the sum of their similarities. If the keyword-specific recognition result output unit 406 output candidates in this way, candidates for the two kinds of lodging would be intermixed, making the list hard to read when selecting a candidate.
[0059]
In this embodiment, by contrast, the keyword-specific recognition result output unit 406 outputs, for each keyword representing a business type, the keyword strings in which the similarity of that business-type keyword is equal to or greater than a predetermined threshold. For example, if candidates are displayed separately for each business-type keyword with a similarity of 100 or more (here, the two lodging keywords), the candidates can be displayed with high visibility, as shown in FIG. 8(a).
[0060]
Thus, in this embodiment, candidate selection can be performed efficiently by outputting a plurality of recognition results for each keyword in descending order of similarity (or in ascending order of distance).
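The keyword-specific output of this embodiment can be illustrated with the short sketch below. The similarity values are hypothetical stand-ins for FIG. 7, which is not reproduced here. The labels “Hotel” and “Inn” for the two lodging keywords, the name keywords, and the function name `group_by_business_type` are illustrative assumptions. Only the threshold of 100 is taken from the text.

```python
from typing import Dict, List, Tuple

# Hypothetical keyword similarities (not the values of FIG. 7).
similarity = {
    "Tanaka": 150, "Sato": 120, "Katan": 60,
    "Hotel": 140, "Inn": 110, "Gas Station": 40,
}

# Registered vocabularies as (name keyword, business-type keyword) pairs.
vocabulary = [
    ("Tanaka", "Hotel"), ("Sato", "Hotel"), ("Katan", "Hotel"),
    ("Tanaka", "Inn"), ("Sato", "Inn"), ("Tanaka", "Gas Station"),
]


def group_by_business_type(
    vocabulary: List[Tuple[str, str]],
    similarity: Dict[str, int],
    threshold: int = 100,
) -> Dict[str, List[Tuple[str, int]]]:
    """Group candidates by business-type keyword, keeping only business
    types whose similarity reaches the threshold, and sort each group in
    descending order of summed similarity."""
    groups: Dict[str, List[Tuple[str, int]]] = {}
    for name, business in vocabulary:
        if similarity[business] < threshold:
            continue
        total = similarity[name] + similarity[business]
        groups.setdefault(business, []).append((f"{name} {business}", total))
    ordered: Dict[str, List[Tuple[str, int]]] = {}
    for business in sorted(groups, key=lambda b: similarity[b], reverse=True):
        ordered[business] = sorted(groups[business], key=lambda c: c[1], reverse=True)
    return ordered


grouped = group_by_business_type(vocabulary, similarity)
for business, candidates in grouped.items():
    print(business, candidates)
```

With these values the gas station entry is dropped, because its business-type keyword falls below the threshold. The remaining candidates appear in two separate lists rather than one interleaved list.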
[Fifth Embodiment]
FIG. 9 is a block diagram showing a schematic configuration of a speech recognition apparatus according to the fifth embodiment of the present invention.
[0061]
In the speech recognition apparatus of FIG. 9, the speech input from the speech input unit 501 is converted into feature parameters by the acoustic analysis unit 502. As typical feature parameters used for speech recognition, a power spectrum that can be obtained by a bandpass filter or Fourier transform, a cepstrum coefficient obtained by LPC (linear prediction) analysis, etc. are often used. The type of feature parameter does not matter.
[0062]
The acoustic analysis unit 502 outputs the obtained feature parameter series of the input speech to the model matching unit 503. The model matching unit 503 computes the similarity or distance between the feature parameter series of the input speech and each speech model of the keywords to be recognized (here, the recognition vocabularies), that is, the speech model of each recognition vocabulary including the speech models of aliases. These speech models are created in advance according to the recognition vocabulary input from the recognition vocabulary input unit 505 and are stored in the speech model creation / storage unit 504.
[0063]
Two matching methods are widely used for the model matching unit 503. In one, the speech model is also expressed as a feature parameter sequence, and the distance between the feature parameter sequence of the speech model and that of the input speech is obtained by DP (dynamic programming). In the other, the speech model is expressed as an HMM (Hidden Markov Model), and the probability of each speech model given the feature parameter sequence of the input speech is calculated. Either method may be used.
[0064]
The recognition vocabulary input unit 505 is for registering a vocabulary to be recognized and its reading, and can be realized by a keyboard or a file.
On the other hand, suppose for example that the place name 神戸 is registered from the recognition vocabulary input unit 505 with the reading “Kanbe”. If “Kobe” is the more common reading of 神戸, so that “Kanbe” is judged likely to be misread as “Kobe”, then the easily misread reading “Kobe” is registered as an alias of “Kanbe” from the alias registration unit 506. The alias registration unit 506 may be configured with a keyboard, for example, so that the alias is registered (input) by the user's keyboard operation. Alternatively, a table from which readings can be retrieved by vocabulary may be created in advance and provided in the alias registration unit 506, so that when a vocabulary has multiple readings, the alias is automatically generated (input) from the table within the alias registration unit 506.
[0065]
When “Kobe” is registered from the alias registration unit 506 as an alias of the recognition vocabulary “Kanbe”, the speech model creation / storage unit 504 stores, separately from the speech model of the recognition vocabulary “Kanbe”, a speech model of “Kobe”, the alias of “Kanbe”. Flag information indicating that it is an alias of “Kanbe” is attached to the speech model of “Kobe”.
[0066]
Therefore, even if “Kanbe” is mistakenly uttered as “Kobe”, the model matching unit 503 recognizes “Kobe” by matching the input against the speech model of “Kobe”, the alias of “Kanbe”, stored in the speech model creation / storage unit 504. As described above, flag information indicating that it is an alias of “Kanbe” is attached to the speech model of “Kobe”. When the matching is performed, this flag information is attached to the matching result and passed to the recognition result output unit 507. The recognition result output unit 507 thereby identifies that the keyword recognized by the model matching unit 503 is not the correct reading but an alias, adds the correct reading “Kanbe” to the recognition result “Kobe”, and outputs the result by display or by voice.
[0067]
As described above, in this embodiment, outputting the correct reading together with the recognition result lets the speaker learn the correct reading of a word that the speaker had remembered by an alias, so that the speaker can thereafter input the word with the correct reading.
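One way to picture the flag information of this embodiment is the following sketch, in which each stored speech model entry optionally records the correct reading it is an alias of. The class `SpeechModelEntry`, the field `alias_of`, the output format, and the romanized keys are hypothetical names chosen for illustration. The acoustic model itself is omitted.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SpeechModelEntry:
    reading: str
    # Flag information: the correct (first) reading when this entry is an
    # alias, or None when the entry is itself a correct reading.
    alias_of: Optional[str] = None


def format_result(best_reading: str, models: Dict[str, SpeechModelEntry]) -> str:
    """Build the output for the best-matching entry, adding the correct
    reading when the matched entry is flagged as an alias."""
    entry = models[best_reading]
    if entry.alias_of is None:
        return entry.reading
    return f"{entry.reading} (correct reading: {entry.alias_of})"


models = {
    "kanbe": SpeechModelEntry("kanbe"),
    "koube": SpeechModelEntry("koube", alias_of="kanbe"),
}

# Suppose model matching selected the alias entry "koube".
message = format_result("koube", models)
print(message)
```

An utterance that matches the entry for the correct reading is output unchanged, so the extra hint appears only when the speaker used the alias.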
[Sixth Embodiment]
FIG. 10 is a block diagram showing a schematic configuration of a speech recognition apparatus according to the sixth embodiment of the present invention.
[0068]
In the speech recognition apparatus of FIG. 10, the speech input from the speech input unit 601 is converted into feature parameters by the acoustic analysis unit 602. As typical feature parameters used for speech recognition, a power spectrum that can be obtained by a bandpass filter or Fourier transform, a cepstrum coefficient obtained by LPC (linear prediction) analysis, etc. are often used. The type of feature parameter does not matter.
[0069]
The acoustic analysis unit 602 outputs the obtained feature parameter series of the input speech to the model matching unit 603. The model matching unit 603 computes the similarity or distance between the feature parameter series of the input speech and each speech model of the keywords to be recognized (the keywords constituting the recognition vocabulary). These speech models are created in advance according to the readings of the recognition vocabulary input from the recognition vocabulary input unit 605 (readings entered in hiragana, katakana, or romaji) and are stored in the speech model creation / storage unit 604.
[0070]
Two matching methods are widely used for the model matching unit 603. In one, the speech model is also expressed as a feature parameter sequence, and the distance between the feature parameter sequence of the speech model and that of the input speech is obtained by DP (dynamic programming). In the other, the speech model is expressed as an HMM (Hidden Markov Model), and the probability of each speech model given the feature parameter sequence of the input speech is calculated. Either method may be used.
[0071]
When the recognition vocabulary reading is input from the recognition vocabulary input unit 605, the voice output unit 607 converts the reading into speech and outputs it. The voice output by the voice output unit 607 can be realized, for example, by storing voices for all Japanese syllables and connecting the stored voices in accordance with the input reading.
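The concatenative voice output described above could look roughly like this. The sample values are toy integers rather than real audio, and the function name `synthesize` and the syllable table are hypothetical. A real implementation would hold recorded waveforms for all Japanese syllables.

```python
from typing import Dict, List

Samples = List[int]  # stand-in for recorded audio samples


def synthesize(reading: List[str], stored: Dict[str, Samples]) -> Samples:
    """Connect the stored per-syllable voices in the order of the reading."""
    output: Samples = []
    for syllable in reading:
        if syllable not in stored:
            raise KeyError(f"no stored voice for syllable {syllable!r}")
        output.extend(stored[syllable])
    return output


# Toy per-syllable recordings (hypothetical values).
stored = {"ta": [1, 2], "ke": [3, 4], "shi": [5], "ba": [6, 7], "pa": [8, 9]}

# The registrant mistyped the reading as "ta-ke-shi-pa".
wave = synthesize(["ta", "ke", "shi", "pa"], stored)
print(wave)
```

Because the output is built directly from the reading as typed, a mistyped syllable is heard immediately at registration time.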
[0072]
For example, if “Takeshipa” is mistakenly input from the recognition vocabulary input unit 605 as the reading of the recognition vocabulary “Takeshiba”, a speech model of the erroneous reading “Takeshipa” is created and stored in the speech model creation / storage unit 604. As a result, even if “Takeshiba” is uttered at recognition time, it cannot be recognized correctly.
[0073]
According to the present embodiment, by contrast, if “Takeshipa” is mistakenly input when registering the reading of the recognition vocabulary “Takeshiba” from the recognition vocabulary input unit 605, the voice output unit 607 outputs the reading “Takeshipa” as speech. The speaker (the person registering the recognition vocabulary) can therefore easily notice the input error in the reading, which prevents the deterioration in recognition performance that such an input error would cause.
[0074]
The functions of each part of the speech recognition apparatuses of FIGS. 1, 3, 5, 6, 9, and 10 described above can be realized on a computer having a voice input function, for example a personal computer with a built-in microphone or a microphone input terminal. A recording medium such as a CD-ROM, a floppy disk, or a memory card, on which a program for causing the computer to function as each processing unit of the speech recognition apparatus is recorded, is mounted on the personal computer, and the personal computer reads and executes the program recorded on the recording medium. The program need not be supplied on a recording medium and may, for example, be downloaded via a communication line.
[0075]
As detailed above, according to the present invention, it is possible to prevent recognition performance from deteriorating due to an error in registering the reading of a recognition vocabulary, a misremembered reading when the device is used, an utterance with an incorrect reading, or the like.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a schematic configuration of a speech recognition apparatus according to a first embodiment of the present invention.
FIG. 2 is a diagram showing an example of the contents of a keyword conversion table 106a in FIG. 1.
FIG. 3 is a block diagram showing a schematic configuration of a speech recognition apparatus according to a second embodiment of the present invention.
FIG. 4 is a diagram showing an example of a syllable string and similarity obtained by the model matching unit 203 in FIG. 3 when “Asahi no A” is uttered.
FIG. 5 is a block diagram showing a schematic configuration of a speech recognition apparatus according to a third embodiment of the present invention.
FIG. 6 is a block diagram showing a schematic configuration of a speech recognition apparatus according to a fourth embodiment of the present invention.
FIG. 7 is a diagram showing an example of each keyword and similarity obtained by the model matching unit 403 in FIG. 6 when “Tanaka Hotel” is uttered.
FIG. 8 is a diagram showing a recognition result display example based on the recognition result and similarity in FIG. 7 in comparison with a conventional recognition result display example.
FIG. 9 is a block diagram showing a schematic configuration of a speech recognition apparatus according to a fifth embodiment of the present invention.
FIG. 10 is a block diagram showing a schematic configuration of a speech recognition apparatus according to a sixth embodiment of the present invention.
[Explanation of symbols]
101, 201, 301, 401, 501, 601 ... Speech input unit
102, 202, 302, 402, 502, 602 ... Acoustic analysis unit
103, 203, 303, 403, 503, 603 ... Model matching unit
104, 304, 404, 504, 604 ... Speech model creation / storage unit
105, 305, 405, 505, 605 ... Recognition vocabulary input unit
106 ... Keyword conversion unit
107 ... Keyword expansion unit
108, 205, 308, 507, 606 ... Recognition result output unit
109, 407 ... Recognition vocabulary storage unit
204 ... Speech model storage unit
206 ... Constraint condition storage unit
306, 506 ... Alias registration unit
307 ... Registered vocabulary similarity determination unit
309 ... Warning output unit
406 ... Keyword-specific recognition result output unit
607 ... Voice output unit

Claims

A speech recognition apparatus that recognizes input speech by matching it against a speech model of each recognition vocabulary, including speech models of aliases of the recognition vocabulary, the apparatus comprising:
alias registration means for registering as an alias, in addition to a first reading that is the original reading of a recognition vocabulary to be registered, a second reading that differs from the original reading of the recognition vocabulary;
registered vocabulary similarity determination means for determining whether the second reading registered as an alias by the alias registration means is similar to the first reading of an already registered recognition vocabulary;
warning output means for outputting a warning to that effect when the registered vocabulary similarity determination means determines that the readings are similar; and
recognition result output means for recognizing, when the input speech is recognized by matching against the speech model of the second reading registered as an alias by the alias registration means, the vocabulary having the second reading as an alias.

The speech recognition apparatus according to claim 1, wherein the recognition result output means, upon recognizing a second reading registered as an alias, presents to the user the first reading of the vocabulary having the second reading as an alias.

A speech recognition method applied to a speech recognition apparatus that recognizes input speech by matching a feature parameter series, obtained by acoustic analysis of the input speech, against a speech model, created in advance, of each recognition vocabulary, including speech models of aliases of the recognition vocabulary, the method comprising:
registering as an alias, when the speech model of a recognition vocabulary is created in advance, a second reading that differs from the original reading of the recognition vocabulary, in addition to a first reading that is the original reading of the recognition vocabulary;
determining, when registering the second reading as the alias, whether the second reading is similar to the first reading of an already registered recognition vocabulary;
outputting a warning indicating an alias registration error when the readings are determined to be similar; and
recognizing, when the input speech is recognized by matching against the speech model of the second reading registered as an alias, the vocabulary having the second reading as an alias.