JP4054610B2 - Voice recognition apparatus, voice recognition method, voice recognition program, and program recording medium

Info

Publication number: JP4054610B2
Application number: JP2002167217A
Authority: JP
Inventors: 彰鶴田
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2002-06-07
Filing date: 2002-06-07
Publication date: 2008-02-27
Anticipated expiration: 2022-06-07
Also published as: JP2004012883A

Description

【０００１】
【発明の属する技術分野】
この発明は、サブワードを認識単位とした音声認識装置および音声認識方法、音声認識プログラム、並びに、音声認識プログラムを記録したプログラム記録媒体に関する。
【０００２】
【従来の技術】
従来、不特定話者を対象とした音素または音節等のサブワードを認識単位とした音声認識装置においては、文字列を用いてユーザが認識対象単語を単語辞書に登録する方式、あるいは、予め認識対象単語が文字列によって単語辞書に準備されている方式を、採用する場合が多く見られる。
【０００３】
ここで用いられる上記単語辞書は、各認識対象単語の文字列をサブワードのネットワークまたは木構造等で記述したサブワード表記辞書であり、人為的に文字列によって登録された認識対象単語しか上記単語辞書には存在しない。尚、上記サブワードを認識単位とした音声認識技術については、例えば、刊行物「音声認識の基礎(下)」古井貞煕監訳に詳しく説明されている。
【０００４】
しかしながら、人間が発声する音声は、時として登録されている文字列とは異なる場合が多い。例えば、「ei」,「ou」といった単語の長音化、つまり「えー」,「おー」への長音化は、よく見られる現象である。ところが、文字列によって認識対象単語を登録しておく音声認識装置においては、ユーザが長音化を意識して認識対象単語を登録しない限りこの長音化に対応した単語辞書は作成されることはないのである。
【０００５】
したがって、発声された音声に対応した認識対象単語が登録されていない場合が生じ、特に音の長音化に関しては個人差があるために、長音化の可能性がある認識対象単語において認識率の低下を招く原因となっている。
【０００６】
このような問題を解決するために、発声時に長音化する可能性のある音節を含む認識対象単語に対しては、文字列から自動的に長音化した認識対象単語を単語辞書に追加登録する音声認識辞書作成装置が、特開平１１‐３１１９９１号公報に開示されている。この音声認識辞書作成装置によれば、長音化して発声される可能性がある文字列から自動的に長音化された文字列の認識対象単語を作成して上記単語辞書に登録するので、長音化による認識誤りを防止することができるのである。具体的な例としては、長音化の可能性がある「ou」を含む「東京(とうきょう)」の場合、「とうきょう」だけではなく、「とーきょう」，「とうきょー」，「とーきょー」の３つの認識対象単語も自動的に単語辞書に登録するのである。
【０００７】
【発明が解決しようとする課題】
しかしながら、上記特開平１１‐３１１９９１号公報に開示された音声認識辞書作成装置を搭載した音声認識装置には、以下のような問題がある。すなわち、上記音声認識辞書作成装置においては、上述したように、１つの単語の中に長音化する可能性のある個所がｎ個ある場合には、２のｎ乗個の認識対象単語が登録されることになる。したがって、このような辞書作成処理によって長音化に対応した認識対象単語を自動的に単語辞書に追加登録すると、結果的に認識対象単語が多くなって辞書照合に時間が掛るという新たな問題が発生することになる。
【０００８】
そこで、この発明の目的は、認識対象単語および処理時間を増加させることなく長音化に対処できる音声認識装置および音声認識方法、音声認識プログラム、並びに、音声認識プログラムを記録したプログラム記録媒体を提供することにある。
【０００９】
【課題を解決するための手段】
上記目的を達成するため、この発明の音声認識装置は、サブワードの音響モデルであるサブワード音響モデルが格納された音響モデル格納手段と、サブワードの時系列であるサブワード系列で表された認識対象単語が格納された単語辞書を備え、上記サブワードには、長音化の可能性があるサブワードであって長音化されない場合の表記と長音化された場合の表記とを併記してなる特別な表記で表されるサブワードが含まれている。そして、照合手段によって、入力音声から得られた特徴パラメータの時系列と上記サブワード音響モデルとの照合が行われて、各サブワードの尤度が得られる。さらに、得られた各サブワードの尤度と上記認識対象単語のサブワード系列との照合が行われて、最も高い尤度を呈する認識対象単語が認識結果として生成される。
【００１０】
このように、長音化の可能性を表すと共に特別な表記で表されるサブワードを含むサブワードのサブワード音響モデルと、上記特別な表記で表されるサブワードを含むサブワード系列で表された認識対象単語を備えているため、上記サブワード音響モデルおよび認識対象単語は、長音化の可能性がある音節が長音化した場合と長音化しなかった場合との両方に対応可能となる。したがって、入力音声中に長音化する可能性がある単語が存在しており、その単語が長音化されている場合でも長音化されていない場合でも容易に対処することができ、長音化による認識誤りが防止される。
【００１１】
その際に、上記単語辞書に格納された認識対象単語中の長音化の可能性がある音節は、１つの特別な表記のサブワードで表記されるため、従来のごとく長音化された音節を表すサブワードと長音化されない音節を表すサブワードとの複数のサブワードで表記される場合に比して、上記認識対象単語を表記するサブワード系列の数が少なくなる。したがって、各サブワードの尤度と上記認識対象単語のサブワード系列との照合の際における処理時間が、上記従来の場合に比して短縮される。
【００１２】
また、１実施例の音声認識装置では、辞書作成手段によって、入力認識対象単語の文字列表記がサブワード表記に変換されて長音化する可能性のある音節が検出される。さらに、この検出された長音化の可能性がある音節が上記特別な表記で表されるサブワードに変換されて、入力認識対象単語を表すサブワード系列が得られる。そして、得られた上記サブワード系列で表された認識対象単語が上記単語辞書に格納される。
【００１３】
こうして、上記単語辞書に格納される１つの認識対象単語が１つのサブワード系列で表され、各サブワードの尤度と上記認識対象単語のサブワード系列との照合時の処理時間が、必要最小限に抑えられる。
【００１４】
また、１実施例の音声認識装置では、上記辞書作成手段を、入力認識対象単語の漢字仮名混じりの文字列表記を仮名文字表記に変換し、得られた仮名文字表記からサブワード表記に変換するようにしている。したがって、ユーザが認識対象単語を入力する際に読みで入力する必要が無く、大語彙の認識対象単語を登録する場合であっても容易に登録することが可能になる。
【００１５】
また、１実施例の音声認識装置では、上記辞書作成手段を、入力認識対象単語の仮名文字表記をサブワード表記に変換するようにしている。したがって、本辞書作成手段に漢字仮名変換手段を搭載する必要が無く、コンパクトな音声認識装置が実現される。
【００１６】
また、この発明の音声認識方法は、入力音声から得られた特徴パラメータの時系列と、長音化の可能性があるサブワードであって、長音化されない場合の表記と長音化された場合の表記とを併記してなる特別な表記で表されるサブワード、を含むサブワードのサブワード音響モデルとの照合が行われて、各サブワードの尤度が得られる。さらに、得られた各サブワードの尤度と、上記特別な表記で表されるサブワードを含むサブワードのサブワード系列で表された認識対象単語との照合が行われて、最も高い尤度を呈する認識対象単語が認識結果として生成される。
【００１７】
このように、上記照合時には、長音化の可能性がある音節が長音化した場合と長音化しなかった場合との両方に対応可能なサブワード音響モデルおよび認識対象単語が用いられる。したがって、入力音声中に長音化する可能性がある単語が存在しており、その単語が長音化されている場合でも長音化されていない場合でも容易に対処することができ、長音化による認識誤りが防止される。
【００１８】
その際に、上記単語辞書に格納された認識対象単語中の長音化の可能性がある音節は、１つの特別な表記のサブワードで表記される。したがって、各サブワードの尤度と上記認識対象単語のサブワード系列との照合の際における処理時間が、長音化の可能性を表すサブワードを用いない従来の場合に比して短縮される。
【００１９】
また、この発明の音声認識プログラムは、コンピュータを、この発明の音声認識装置における音響分析処理手段,音響モデル格納手段,単語辞書および照合手段として機能させることができる。したがって、入力音声中に存在する長音化の可能性がある単語が長音化されている場合でも長音化されていない場合でも容易に対処でき、長音化による認識誤りが防止される。さらに、各サブワードの尤度と認識対象単語のサブワード系列との照合の際における処理時間が従来の場合に比して短縮される。
【００２０】
また、この発明のプログラム記録媒体は、この発明の音声認識プログラムが記録されている。したがって、コンピュータで読み出して実行することによって、入力音声中に存在する長音化の可能性がある単語が長音化された場合における認識誤りが防止される。さらに、各サブワードの尤度と認識対象単語のサブワード系列との照合の際における処理時間が、従来の場合に比して短縮される。
【００２１】
【発明の実施の形態】
以下、この発明を図示の実施の形態により詳細に説明する。図１は、本実施の形態の音声認識装置におけるブロック図である。この音声認識装置は、音響分析部１,尤度演算部２,音響モデル格納部３,尤度格納部４,辞書照合部５,単語辞書６,辞書作成部７および仮名文字・音素対応テーブル８で構成される。
【００２２】
図１において、入力音声は、上記音響分析部１によって、特徴パラメータの系列に変換されて尤度演算部２に出力される。尤度演算部２では、音響モデル格納部３に格納された各音素の音響モデルと入力された特徴パラメータ系列との照合が行われ、算出された各音素の尤度は尤度格納部４に格納される。辞書照合部５では、Viterbi(ビタビ)アルゴリズムを用いて、単語辞書６における各単語の音素系列と尤度格納部４に格納された各音素の尤度との照合が行われて、上記各単語の尤度が算出される。そして、最も尤度の高い単語が認識結果として出力される。すなわち、本実施の形態においては、尤度演算部２と辞書照合部５とで上記照合部を構成しているのである。
【００２３】
上記音響モデルとしては、例えば、モノフォンモデルと呼ばれる前後の音素環境に依存しない隠れマルコフモデル(ＨＭＭ)を用いる。但し、本実施の形態においては、上記ＨＭＭの音素に、長音化する可能性を表す特別な音素として、音素「i/e」と音素「u/o」との２つの音素を追加している。
【００２４】
上記尤度格納部４には、上述したように、上記音響モデル格納部３に格納された各音素の音響モデルと音響分析部１から入力された特徴パラメータ系列との照合が尤度演算部２によって行われた際に、算出された各音素の尤度が順次格納される。
【００２５】
また、上記単語辞書６には、認識対象単語に基づいて辞書作成部７によって作成された各単語の音素系列が、音素のネットワークあるいは音素の木構造の形式で格納されている。この単語の音素系列は、上記各認識対象単語の読みを音素系列で表記したものである。その場合、上記音素系列の中に存在する長音化する可能性がある音素については、図２(a)の最右欄に示すように、長音化の可能性を表す特別な音素「i/e」,「u/o」を用いて表記するのである。
【００２６】
ここで、図２(a)における音素系列「t;o;u;k;j;o;u」を、上記特別な音素「i/e」,「u/o」を用いない従来の表記方法によって長音化に対応できる形式で音素表記すると、図２(b)に示すように、４つの音素表記になる。これが、上記特別な音素「i/e」,「u/o」を用いた本実施の形態の表記方法によって音素表記した場合は、図２(b)に示すように図２(a)と同じ１つの音素表記でよく、従来の表記方法による音素表記数を削減することができる。したがって、単語辞書６に格納された各単語の音素系列を用いた辞書照合部５による照合演算処理を削減して、照合時間の短縮を図ることができるのである。
【００２７】
以下、上記辞書作成部７によって行われる単語辞書６に登録する単語の音素系列の作成方法について、図３に示す辞書作成処理動作のフローチャートに従って詳細に説明する。辞書作成部７に認識対象単語が入力されると辞書作成処理動作がスタートする。
【００２８】
ステップＳ1で、先ず、漢字仮名混じりの文字で表記された認識対象単語が、漢字仮名変換技術を用いて仮名文字列に変換される。例えば、入力認識対象単語「東京」が仮名文字列「とうきょう」に変換される。ステップＳ2で、得られた仮名文字列が、図４に示すような仮名文字・音素対応テーブル８を参照して音素列に変換される。その結果、仮名文字列「とうきょう」が音素列「t;o;u;k;j;o;u」に変換される。
【００２９】
ステップＳ3で、上記ステップＳ2において得られた音素列から音素連鎖「e;i」,「e;e」,「o;u」,「o;o」が検出される。こうして、長音化の可能性のある音節が抽出されるのである。その結果、音素列「t;o;u;k;j;o;u」からは音素連鎖「o;u」が２つ検出され、長音化の可能性がある音節として音節「ｕ」が抽出される。
【００３０】
ステップＳ4で、上記ステップＳ3において抽出された長音化の可能性のある音節が、長音化の可能性を表す特別な音素の表記「i/e」,「u/o」に置き換えられる。つまり、上記音素連鎖「e;i」,「e;e」が「e;i/e」に置き換えられる一方、上記音素連鎖「o;u」,「o;o」が「o;u/o」に置き換えられるのである。その結果、音素列「t;o;u;k;j;o;u」の場合には、音素列「t;o;u/o;k;j;o;u/o」が得られる。ステップＳ5で、上記ステップＳ4において得られた音素列が、辞書照合部５における照合方式に対応したデータに変換される。そうした後、辞書作成処理動作を終了する。
【００３１】
図５は、上記単語辞書６に登録される各単語を音素のネットワークの形式で表現した例である。尚、図５(a)は、辞書作成部７によって、入力認識対象単語「北海道」,「青森」,…,「東京」,…,「大阪」に基づいて作成された音素のネットワークである。また、図５(b)は、従来の方法によって、入力認識対象単語「北海道」,…,「東京」に基づいて作成された音素のネットワークである。
【００３２】
図６は、上記単語辞書６に登録される各単語を音素の木構造の形式で表現した例である。尚、図６(a)は、辞書作成部７によって、入力認識対象単語「秋田」，「青森」,…,「北海道」,…,「大分」,「大阪」,…,「東京」,…に基づいて作成された音素の木構造である。また、図６(b)は、従来の方法によって、入力認識対象単語「北海道」,…,「東京」に基づいて作成された音素の木構造である。
【００３３】
したがって、上記辞書照合部５が、辞書作成部７によって作成された単語辞書６を用いて照合を行う場合に、辞書を構成する音素のネットワークあるいは木構造の探索経路は必要最小限となる。その結果、辞書照合に必要な処理時間が従来の単語辞書に比較して大幅に短くできるのである。
【００３４】
以上のごとく、本実施の形態においては、上記音響モデル格納部３には、従来の音素のＨＭＭに加えて、長音化の可能性を表す特別な２つの音素「i/e」,「u/o」のＨＭＭを追加している。また、上記単語辞書６には、辞書作成部７によって、各認識対象単語中に存在する長音化の可能性がある音節を上記特別な音素「i/e」,「u/o」で表記して作成された単語辞書を格納している。
【００３５】
このように、上記単語辞書６に登録された認識対象単語内の長音化の可能性がある音節は、長音化した場合と長音化しなかった場合の両方に対応可能な特別な音素で表記されている。したがって、入力音声中に長音化する可能性がある単語が存在しており、その単語が長音化されている場合でも長音化されていない場合でも、認識率の低下を招くことはないのである。
【００３６】
また、例えば、４７都道府県名を認識対象語彙とした場合であって、従来の認識方法で長音化に対処する場合には、同じ認識対象単語を複数の音素列で表記した場合の夫々の音素列を単語として、「北海道」は２つ、「東京」は図２(b)に示すように４つのごとく、１つの認識対象単語の中に長音化する可能性がある個所がｎ個ある場合には２のｎ乗個の単語を単語辞書に登録する必要がある。したがって、上記単語辞書には、都道府県名数「４７」よりもかなり多い単語が登録されることになり、辞書照合の処理時間が大幅に増えてしまう。これに対し、本実施の形態の場合には、１つの都道府県名を１つの単語で表現することができるため単語辞書６に登録される単語数は都道府県名数と同じ「４７」となる。したがって、辞書照合の処理時間は必要最小限の処理時間に対して殆ど増加することはないのである。
【００３７】
尚、上記実施の形態においては、上記辞書作成部７は、上記辞書作成処理動作を実行するに際して、上記ステップＳ1において、漢字仮名混じりの文字で表記された認識対象単語を漢字仮名変換技術を用いて仮名文字列に変換した後に、上記ステップＳ2において音素表記列を作成するようにしている。しかしながら、この発明は、これに限定されるものではない。例えば、辞書作成部７に対して、仮名文字で認識対象単語の読みが入力される場合には、上記ステップＳ1を省略して上記ステップＳ2から辞書作成処理動作をスタートしても差し支えない。
【００３８】
また、上記実施の形態における辞書作成部７による辞書作成処理においては、仮名文字列から直接音素表記列への変換を行っている。しかしながら、仮名文字列からローマ字列に一旦変換し、得られたローマ字列から音素表記列に変換しても差し支えない。
【００３９】
また、上記実施の形態における辞書作成部７による辞書作成処理は、辞書作成部７と同様の辞書作成処理動作をコンピュータに実行させることが可能な辞書作成プログラムを用いて、コンピュータによって別途行っても差し支えない。
【００４０】
また、上記実施の形態における音響モデル格納部３に格納された上記音響モデルは、モノフォンモデルと呼ばれる前後の音素環境に依存しないＨＭＭである。しかしながら、上記音響モデルはこれに限定されるものではなく、音響モデルのサブワードが音節であったり、前後の音素環境に依存するＨＭＭであっても差し支えない。
【００４１】
ところで、上記各実施の形態における音響分析部１,尤度演算部２,音響モデル格納部３,辞書照合部５および単語辞書６による上記音響分析処理手段,音響モデル格納手段,単語辞書および照合手段としての機能は、プログラム記録媒体に記録された音声認識プログラムによって実現される。上記各実施の形態における上記プログラム記録媒体は、ＲＯＭ(リード・オンリ・メモリ)でなるプログラムメディアである。または、外部補助記憶装置に装着されて読み出されるプログラムメディアであってもよい。尚、何れの場合においても、上記プログラムメディアから上記音声認識プログラムを読み出すプログラム読み出し手段は、上記プログラムメディアに直接アクセスして読み出す構成を有していてもよいし、ＲＡＭ(ランダム・アクセス・メモリ)に設けられたプログラム記憶エリア(図示せず)にダウンロードし、上記プログラム記憶エリアにアクセスして読み出す構成を有していてもよい。尚、上記プログラムメディアから上記ＲＡＭのプログラム記憶エリアにダウンロードするためのダウンロードプログラムは、予め本体装置に格納されているものとする。
【００４２】
ここで、上記プログラムメディアとは、本体側と分離可能に構成され、磁気テープやカセットテープ等のテープ系、フロッピーディスク,ハードディスク等の磁気ディスクやＣＤ(コンパクトディスク)‐ＲＯＭ,ＭＯ(光磁気)ディスク,ＭＤ(ミニディスク),ＤＶＤ(ディジタル多用途ディスク)等の光ディスクのディスク系、ＩＣ(集積回路)カードや光カード等のカード系、マスクＲＯＭ,ＥＰＲＯＭ（紫外線消去型ＲＯＭ),ＥＥＰＲＯＭ(電気的消去型ＲＯＭ),フラッシュＲＯＭ等の半導体メモリ系を含めた、固定的にプログラムを坦持する媒体である。
【００４３】
また、上記各実施の形態における音声認識装置は、モデムを備えてインターネットを含む通信ネットワークと接続可能に構成することもできる。その場合、上記プログラムメディアは、通信ネットワークからのダウンロード等によって流動的にプログラムを坦持する媒体であっても差し支えない。尚、その場合における上記通信ネットワークからダウンロードするためのダウンロードプログラムは、予め本体装置に格納されているものとする。あるいは、別の記録媒体からインストールされるものとする。
【００４４】
尚、上記記録媒体に記録されるものはプログラムのみに限定されるものではなく、データも記録することが可能である。
【００４５】
【発明の効果】
以上より明らかなように、この発明では、入力音声から得られた特徴パラメータの時系列と、長音化の可能性があるサブワードであって、長音化されない場合の表記と長音化された場合の表記とを併記してなる特別な表記で表されるサブワード、を含むサブワードのサブワード音響モデルとの照合を行って、各サブワードの尤度を得、得られた各サブワードの尤度と、上記特別な表記で表されるサブワードを含むサブワードのサブワード系列で表された認識対象単語との照合を行って、最も高い尤度を呈する認識対象単語を認識結果として生成するので、上記夫々の照合時において、入力音声中に長音化する可能性がある単語が存在しており、その単語が長音化されている場合でも長音化されていない場合でも容易に対処することができる。したがって、音声入力単語中の音節が長音化にされている場合の認識誤りを防止することができる。
【００４６】
その際に、上記単語辞書に格納された認識対象単語中の長音化の可能性がある音節は、１つの音節当り１つの特別な表記のサブワードで表記されている。したがって、１つの認識対象単語を唯１つのサブワード系列で表記することができ、各サブワードの尤度と上記認識対象単語のサブワード系列との照合の際に必要なサブワード系列の数および処理時間の増加を極力抑えることができる。その結果、１つの認識対象単語を複数のサブワード系列で表記して長音化に対処する従来の音声認識に比して、上記照合処理時間を短縮することができるのである。
【図面の簡単な説明】
【図１】この発明の音声認識装置におけるブロック図である。
【図２】図１における単語辞書に格納される単語の音素系列の説明図である。
【図３】図１における辞書作成部によって実行される辞書作成処理動作のフローチャートである。
【図４】図１における仮名文字・音素対応テーブルの概念の一例を示す図である。
【図５】図１における単語辞書に登録される各単語を音素のネットワークの形式で表現した例を示す図である。
【図６】図１における単語辞書に登録される各単語を音素の木構造の形式で表現した例を示す図である。
【符号の説明】
１…音響分析部、
２…尤度演算部、
３…音響モデル格納部、
４…尤度格納部、
５…辞書照合部、
６…単語辞書、
７…辞書作成部、
８…仮名文字・音素対応テーブル。

[0001]
[Technical Field of the Invention]
The present invention relates to a speech recognition apparatus and speech recognition method using a subword as a recognition unit, a speech recognition program, and a program recording medium on which a speech recognition program is recorded.
[0002]
[Prior art]
Conventionally, speaker-independent speech recognition apparatuses that use subwords such as phonemes or syllables as the recognition unit have often adopted either a scheme in which the user registers recognition target words in a word dictionary as character strings, or a scheme in which the recognition target words are prepared in the word dictionary in advance as character strings.
[0003]
The word dictionary used here is a subword notation dictionary in which the character string of each recognition target word is described as a network or tree structure of subwords; only recognition target words that have been manually registered as character strings exist in the word dictionary. Speech recognition using subwords as the recognition unit is described in detail in, for example, the publication "Basics of Speech Recognition (Vol. 2)", translation supervised by Sadaoki Furui.
[0004]
However, speech uttered by humans often differs from the registered character string. For example, the lengthening of "ei" and "ou" within words, that is, their pronunciation as the long vowels "e-" (えー) and "o-" (おー), is a common phenomenon. In a speech recognition apparatus in which recognition target words are registered as character strings, however, a word dictionary that accommodates this lengthening is not created unless the user registers the recognition target words with the lengthening in mind.
[0005]
Consequently, the recognition target word corresponding to the uttered speech may not be registered. In particular, because vowel lengthening varies between individuals, this lowers the recognition rate for recognition target words that are liable to be lengthened.
[0006]
To solve this problem, Japanese Patent Laid-Open No. 11-311991 discloses a speech recognition dictionary creation apparatus that, for recognition target words containing syllables liable to be lengthened when uttered, automatically generates the lengthened forms from the character string and additionally registers them in the word dictionary. Because this apparatus automatically creates recognition target words with lengthened character strings from any character string liable to be uttered with lengthening and registers them in the word dictionary, recognition errors due to lengthening can be prevented. As a specific example, for "東京" (とうきょう, "toukyou"), which contains the lengthenable "ou", not only "とうきょう" (toukyou) but also the three recognition target words "とーきょう" (to-kyou), "とうきょー" (toukyo-), and "とーきょー" (to-kyo-) are automatically registered in the word dictionary.
[0007]
[Problems to be solved by the invention]
However, a speech recognition apparatus equipped with the speech recognition dictionary creation apparatus disclosed in the above-mentioned Japanese Patent Laid-Open No. 11-311991 has the following problem. As described above, when a single word contains n places liable to be lengthened, that apparatus registers 2^n recognition target words. Therefore, if recognition target words that accommodate lengthening are automatically added to the word dictionary by such dictionary creation processing, the number of recognition target words grows, which gives rise to the new problem that dictionary collation takes longer.
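The 2^n growth described above can be made concrete with a short sketch. It is only an illustration under stated assumptions: the function name `conventional_variants` is hypothetical, the set of lengthenable phoneme chains is borrowed from step S3 of the embodiment, and a lengthened vowel is written by repeating the preceding vowel (for example "o;u" becomes "o;o"), which is a notation chosen here and not one given in the cited publication.

```python
from itertools import product

# Phoneme chains treated as lengthenable (taken from step S3 of the embodiment).
LENGTHENABLE = {("e", "i"), ("e", "e"), ("o", "u"), ("o", "o")}


def conventional_variants(phonemes):
    """Enumerate every phoneme string a conventional dictionary would register.

    Assumption: a lengthened chain is written by repeating the preceding
    vowel, e.g. "o;u" -> "o;o". This notation is illustrative only.
    """
    # Index of the second phoneme of each lengthenable chain.
    sites = [i for i in range(1, len(phonemes))
             if (phonemes[i - 1], phonemes[i]) in LENGTHENABLE]
    variants = []
    # Each site is independently lengthened or not: 2 ** len(sites) choices.
    for choice in product([False, True], repeat=len(sites)):
        seq = list(phonemes)
        for site, lengthen in zip(sites, choice):
            if lengthen:
                seq[site] = seq[site - 1]
        variants.append(";".join(seq))
    return variants


tokyo = "t;o;u;k;j;o;u".split(";")
for v in conventional_variants(tokyo):
    print(v)
```

For "toukyou" there are two lengthenable places, so the sketch lists four phoneme strings, which matches the four entries the conventional approach registers for "Tokyo".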
[0008]
Accordingly, an object of the present invention is to provide a speech recognition apparatus, a speech recognition method, a speech recognition program, and a program recording medium on which the speech recognition program is recorded, which can cope with vowel lengthening without increasing the number of recognition target words or the processing time.
[0009]
[Means for Solving the Problems]
To achieve the above object, the speech recognition apparatus of the present invention comprises acoustic model storage means storing subword acoustic models, which are acoustic models of subwords, and a word dictionary storing recognition target words each represented by a subword sequence, which is a time series of subwords. The subwords include a subword that is liable to be lengthened and is represented by a special notation in which the notation for the non-lengthened case and the notation for the lengthened case are written together. Collation means collates the time series of feature parameters obtained from the input speech with the subword acoustic models to obtain the likelihood of each subword. The likelihoods obtained are then collated with the subword sequences of the recognition target words, and the recognition target word with the highest likelihood is generated as the recognition result.
[0010]
Because the apparatus is thus provided with subword acoustic models for subwords that include the specially notated subwords representing possible lengthening, and with recognition target words represented by subword sequences that include those specially notated subwords, the subword acoustic models and the recognition target words can handle both the case where a lengthenable syllable is lengthened and the case where it is not. Therefore, when the input speech contains a word liable to be lengthened, the word can readily be handled whether or not it is actually lengthened, and recognition errors due to lengthening are prevented.
[0011]
In this case, each lengthenable syllable in a recognition target word stored in the word dictionary is written with a single specially notated subword. Compared with the conventional approach, in which it is written with multiple subwords (one representing the lengthened syllable and one representing the non-lengthened syllable), fewer subword sequences are needed to represent the recognition target word. The processing time for collating the subword likelihoods with the subword sequences of the recognition target words is therefore shorter than in the conventional case.
[0012]
In the speech recognition apparatus of one embodiment, dictionary creation means converts the character string notation of an input recognition target word into subword notation and detects syllables liable to be lengthened. Each detected syllable is converted into a subword represented by the special notation, yielding a subword sequence that represents the input recognition target word. The recognition target word represented by this subword sequence is then stored in the word dictionary.
[0013]
In this way, each recognition target word stored in the word dictionary is represented by a single subword sequence, and the processing time for collating the subword likelihoods with the subword sequences of the recognition target words is kept to the necessary minimum.
[0014]
In the speech recognition apparatus of one embodiment, the dictionary creation means converts the mixed kanji-kana character string notation of the input recognition target word into kana notation, and then converts the resulting kana notation into subword notation. The user therefore need not enter the reading when inputting a recognition target word, and registration is easy even for a large vocabulary.
[0015]
In the speech recognition apparatus of one embodiment, the dictionary creation means converts the kana notation of the input recognition target word into subword notation. The dictionary creation means therefore need not include kanji-kana conversion means, and a compact speech recognition apparatus is realized.
[0016]
In the speech recognition method of the present invention, the time series of feature parameters obtained from the input speech is collated with subword acoustic models of subwords, which include a subword that is liable to be lengthened and is represented by a special notation in which the notation for the non-lengthened case and the notation for the lengthened case are written together, and the likelihood of each subword is obtained. The likelihoods obtained are then collated with recognition target words represented by subword sequences that include the specially notated subword, and the recognition target word with the highest likelihood is generated as the recognition result.
[0017]
Thus, the collation uses subword acoustic models and recognition target words that can handle both the case where a lengthenable syllable is lengthened and the case where it is not. Therefore, when the input speech contains a word liable to be lengthened, the word can readily be handled whether or not it is actually lengthened, and recognition errors due to lengthening are prevented.
[0018]
In this case, each lengthenable syllable in a recognition target word stored in the word dictionary is written with a single specially notated subword. The processing time for collating the subword likelihoods with the subword sequences of the recognition target words is therefore shorter than in the conventional case, which does not use subwords representing possible lengthening.
[0019]
The speech recognition program of the present invention causes a computer to function as the acoustic analysis processing means, acoustic model storage means, word dictionary, and collation means of the speech recognition apparatus of the present invention. Therefore, a word in the input speech that is liable to be lengthened can readily be handled whether or not it is actually lengthened, and recognition errors due to lengthening are prevented. Furthermore, the processing time for collating the subword likelihoods with the subword sequences of the recognition target words is shorter than in the conventional case.
[0020]
The program recording medium of the present invention has the speech recognition program of the present invention recorded on it. Therefore, when the program is read out and executed by a computer, recognition errors are prevented in the case where a lengthenable word in the input speech is lengthened. Furthermore, the processing time for collating the subword likelihoods with the subword sequences of the recognition target words is shorter than in the conventional case.
[0021]
[Embodiments of the Invention]
Hereinafter, the present invention will be described in detail with reference to the illustrated embodiment. FIG. 1 is a block diagram of the speech recognition apparatus according to the present embodiment. This speech recognition apparatus comprises an acoustic analysis unit 1, a likelihood calculation unit 2, an acoustic model storage unit 3, a likelihood storage unit 4, a dictionary collation unit 5, a word dictionary 6, a dictionary creation unit 7, and a kana character/phoneme correspondence table 8.
[0022]
In FIG. 1, the input speech is converted into a feature parameter series by the acoustic analysis unit 1 and output to the likelihood calculation unit 2. The likelihood calculation unit 2 collates the acoustic model of each phoneme stored in the acoustic model storage unit 3 with the input feature parameter series, and the calculated likelihood of each phoneme is stored in the likelihood storage unit 4. The dictionary collation unit 5 uses the Viterbi algorithm to collate the phoneme sequence of each word in the word dictionary 6 with the phoneme likelihoods stored in the likelihood storage unit 4, and thereby calculates the likelihood of each word. The word with the highest likelihood is then output as the recognition result. That is, in the present embodiment, the likelihood calculation unit 2 and the dictionary collation unit 5 together constitute the collation means.
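The dictionary collation step can be illustrated with a minimal sketch. It assumes, purely for illustration, that each phoneme is modeled as a single left-to-right state with a self-loop and no transition penalties, and that the likelihood storage is a list of per-frame dictionaries mapping phoneme names to log-likelihoods. The names `viterbi_word_score` and `recognize` are hypothetical and do not come from the patent, which does not specify the HMM topology.

```python
import math


def viterbi_word_score(frame_loglik, phonemes):
    """Best-path log-likelihood of a phoneme sequence over all frames.

    frame_loglik: list of dicts, one per frame, phoneme -> log-likelihood
                  (plays the role of the likelihood storage unit).
    phonemes:     phoneme sequence of one dictionary word.
    Assumption: one state per phoneme, self-loop or advance by one state.
    """
    n = len(phonemes)
    score = [-math.inf] * n
    score[0] = frame_loglik[0][phonemes[0]]  # the path must start in state 0
    for frame in frame_loglik[1:]:
        new = [-math.inf] * n
        for s in range(n):
            # Either stay in the same phoneme or arrive from the previous one.
            best_prev = score[s] if s == 0 else max(score[s], score[s - 1])
            if best_prev > -math.inf:
                new[s] = best_prev + frame[phonemes[s]]
        score = new
    return score[-1]  # the path must end in the last phoneme


def recognize(frame_loglik, dictionary):
    """Return the dictionary word whose phoneme sequence scores highest."""
    return max(dictionary,
               key=lambda w: viterbi_word_score(frame_loglik, dictionary[w]))


frames = [{"a": -1, "b": -5}, {"a": -1, "b": -4}, {"a": -6, "b": -1}]
words = {"ab": ["a", "b"], "ba": ["b", "a"]}
print(recognize(frames, words))
```

In this sketch a special phoneme such as "u/o" would simply be one more key in each per-frame dictionary, so a word containing it is still scored in a single pass.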
[0023]
As the acoustic model, for example, a hidden Markov model (HMM) that does not depend on the preceding and following phoneme context, known as a monophone model, is used. In the present embodiment, however, two phonemes, "i/e" and "u/o", are added to the phonemes of the HMM as special phonemes representing possible lengthening.
[0024]
In the likelihood storage unit 4, as described above, matching between the acoustic model of each phoneme stored in the acoustic model storage unit 3 and the feature parameter series input from the acoustic analysis unit 1 is performed by the likelihood calculation unit 2. When this is done, the calculated likelihood of each phoneme is sequentially stored.
[0025]
The word dictionary 6 stores the phoneme sequence of each word, created by the dictionary creation unit 7 from the recognition target words, in the form of a phoneme network or a phoneme tree structure. The phoneme sequence of a word is the reading of the recognition target word written as a sequence of phonemes. Phonemes in the sequence that are liable to be lengthened are written, as shown in the rightmost column of FIG. 2(a), with the special phonemes "i/e" and "u/o" representing possible lengthening.
[0026]
Here, if the phoneme sequence "t;o;u;k;j;o;u" in FIG. 2(a) is written, in a form that can accommodate lengthening, by the conventional notation method that does not use the special phonemes "i/e" and "u/o", four phoneme notations result, as shown in FIG. 2(b). With the notation method of the present embodiment, which uses the special phonemes "i/e" and "u/o", a single phoneme notation identical to that of FIG. 2(a) suffices, as shown in FIG. 2(b), so the number of phoneme notations is reduced compared with the conventional notation method. Accordingly, the collation computation performed by the dictionary collation unit 5 using the phoneme sequences of the words stored in the word dictionary 6 is reduced, and the collation time can be shortened.
[0027]
Hereinafter, a method for creating a phoneme sequence of a word registered in the word dictionary 6 performed by the dictionary creation unit 7 will be described in detail according to a flowchart of the dictionary creation processing operation shown in FIG. When a recognition target word is input to the dictionary creation unit 7, the dictionary creation processing operation starts.
[0028]
In step S1, a recognition target word written in mixed kanji and kana is first converted into a kana character string using a kanji-kana conversion technique. For example, the input recognition target word "東京" (Tokyo) is converted into the kana character string "とうきょう". In step S2, the kana character string thus obtained is converted into a phoneme string by referring to the kana character/phoneme correspondence table 8, an example of which is shown in FIG. 4. As a result, the kana character string "とうきょう" is converted into the phoneme string "t;o;u;k;j;o;u".
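Step S2 can be sketched as a table lookup. The table below is a hypothetical fragment covering only the kana needed for the example; the actual contents of the kana character/phoneme correspondence table 8 in FIG. 4 are not reproduced in this text. The function name `kana_to_phonemes` and the longest-match handling of two-character kana such as "きょ" are likewise assumptions made for illustration.

```python
# Hypothetical fragment of a kana -> phoneme table (not the table of FIG. 4).
KANA_TO_PHONEMES = {
    "と": ["t", "o"],
    "う": ["u"],
    "きょ": ["k", "j", "o"],
    "き": ["k", "i"],
}


def kana_to_phonemes(kana):
    """Convert a kana string into a phoneme list by longest-match lookup."""
    phonemes = []
    i = 0
    while i < len(kana):
        # Try a two-character unit (e.g. "きょ") before a single kana.
        for width in (2, 1):
            chunk = kana[i:i + width]
            if chunk in KANA_TO_PHONEMES:
                phonemes.extend(KANA_TO_PHONEMES[chunk])
                i += len(chunk)
                break
        else:
            raise KeyError(f"no table entry for {kana[i]!r}")
    return phonemes


print(";".join(kana_to_phonemes("とうきょう")))
```

Trying the two-character unit first keeps "きょ" from being split into "き" followed by an unmatched small "ょ".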
[0029]
In step S3, the phoneme chains "e;i", "e;e", "o;u", and "o;o" are detected in the phoneme string obtained in step S2. In this way, syllables liable to be lengthened are extracted. As a result, two occurrences of the phoneme chain "o;u" are detected in the phoneme string "t;o;u;k;j;o;u", and the syllable "u" is extracted as a syllable liable to be lengthened.
[0030]
In step S4, each lengthenable syllable extracted in step S3 is replaced with the special phoneme notation "i/e" or "u/o" representing possible lengthening. That is, the phoneme chains "e;i" and "e;e" are replaced with "e;i/e", while the phoneme chains "o;u" and "o;o" are replaced with "o;u/o". As a result, the phoneme string "t;o;u;k;j;o;u" becomes the phoneme string "t;o;u/o;k;j;o;u/o". In step S5, the phoneme string obtained in step S4 is converted into data suited to the collation method of the dictionary collation unit 5. The dictionary creation processing operation then ends.
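Steps S3 and S4 can be sketched together as a single pass over the phoneme string. The replacement table follows the four chains named in the text; the function name `to_special_notation` is hypothetical, and the choice to compare each phoneme with the previously emitted one (so that chains do not overlap) is an assumption, since the text does not say how consecutive chains are treated.

```python
# Second phoneme of each lengthenable chain -> special phoneme (steps S3/S4).
SPECIAL = {
    ("e", "i"): "i/e",
    ("e", "e"): "i/e",
    ("o", "u"): "u/o",
    ("o", "o"): "u/o",
}


def to_special_notation(phonemes):
    """Replace the second phoneme of each lengthenable chain.

    Assumption: each phoneme is compared with the previously emitted
    phoneme, so an already replaced phoneme does not start a new chain.
    """
    out = []
    for p in phonemes:
        prev = out[-1] if out else None
        out.append(SPECIAL.get((prev, p), p))
    return out


tokyo = "t;o;u;k;j;o;u".split(";")
print(";".join(to_special_notation(tokyo)))
```

Whatever the number of lengthenable places, the function returns exactly one phoneme string per word, which is the property the embodiment relies on.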
[0031]
FIG. 5 shows an example in which the words registered in the word dictionary 6 are expressed in the form of a phoneme network. FIG. 5(a) shows the phoneme network created by the dictionary creation unit 7 from the input recognition target words "Hokkaido", "Aomori", ..., "Tokyo", ..., "Osaka". FIG. 5(b) shows the phoneme network created by the conventional method from the input recognition target words "Hokkaido", ..., "Tokyo".
[0032]
FIG. 6 shows an example in which the words registered in the word dictionary 6 are expressed in the form of a phoneme tree structure. FIG. 6(a) shows the phoneme tree structure created by the dictionary creation unit 7 from the input recognition target words "Akita", "Aomori", ..., "Hokkaido", ..., "Oita", "Osaka", ..., "Tokyo", .... FIG. 6(b) shows the phoneme tree structure created by the conventional method from the input recognition target words "Hokkaido", ..., "Tokyo".
[0033]
Therefore, when the dictionary collation unit 5 performs collation using the word dictionary 6 created by the dictionary creation unit 7, the search paths through the phoneme network or tree structure that constitutes the dictionary are kept to the necessary minimum. As a result, the processing time required for dictionary collation can be made significantly shorter than with a conventional word dictionary.
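The effect on the search space can be illustrated by building a phoneme tree for one word in both styles and counting its nodes. This is a rough sketch: the nested-dictionary tree and the names `build_tree` and `count_nodes` are hypothetical, and the four conventional phoneme strings use an illustrative notation in which a lengthened "o;u" is written "o;o" (FIG. 2(b) itself is not reproduced in this text).

```python
def build_tree(sequences):
    """Build a phoneme tree (trie) as nested dicts from phoneme sequences."""
    root = {}
    for seq in sequences:
        node = root
        for phoneme in seq:
            node = node.setdefault(phoneme, {})
    return root


def count_nodes(tree):
    """Count the phoneme nodes in the tree (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in tree.values())


# Conventional style: one sequence per lengthening pattern (illustrative notation).
conventional = [s.split(";") for s in (
    "t;o;u;k;j;o;u", "t;o;u;k;j;o;o", "t;o;o;k;j;o;u", "t;o;o;k;j;o;o")]
# Style of the embodiment: a single sequence using the special phoneme "u/o".
proposed = ["t;o;u/o;k;j;o;u/o".split(";")]

print(count_nodes(build_tree(conventional)), count_nodes(build_tree(proposed)))
```

Even with prefix sharing, the conventional tree for this one word has twice as many nodes as the tree built from the special notation, and every extra node is an extra path for the collation to score.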
[0034]
As described above, in the present embodiment, HMMs for the two special phonemes "i/e" and "u/o", which represent possible lengthening, are added to the acoustic model storage unit 3 alongside the HMMs of the conventional phonemes. In addition, the word dictionary 6 stores a word dictionary created by the dictionary creation unit 7 in which the lengthenable syllables in each recognition target word are written with the special phonemes "i/e" and "u/o".
[0035]
In this way, the lengthenable syllables in the recognition target words registered in the word dictionary 6 are written with special phonemes that can handle both the lengthened and the non-lengthened case. Therefore, when the input speech contains a word liable to be lengthened, the recognition rate does not drop whether or not the word is actually lengthened.
[0036]
Further, consider, for example, the case where the names of the 47 prefectures form the recognition vocabulary. To cope with lengthening by the conventional recognition method, each of the phoneme strings that express the same recognition target word must be registered as a separate word: two for "Hokkaido" and, as shown in FIG. 2(b), four for "Tokyo". In general, a recognition target word with n places liable to be lengthened requires 2^n words to be registered in the word dictionary. Consequently, considerably more words than the number of prefecture names, 47, are registered in the word dictionary, and the processing time for dictionary collation increases greatly. In the present embodiment, by contrast, each prefecture name can be expressed as a single word, so the number of words registered in the word dictionary 6 equals the number of prefecture names, 47. The processing time for dictionary collation therefore hardly increases beyond the necessary minimum.
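The comparison of dictionary sizes can be written compactly. In the expression below, N is the number of recognition target words and n_i is the number of lengthenable places in the i-th word; the symbols W_conv and W_prop are introduced here only for illustration and do not appear in the original text.

```latex
\[
W_{\mathrm{conv}} = \sum_{i=1}^{N} 2^{\,n_i},
\qquad
W_{\mathrm{prop}} = N,
\qquad
W_{\mathrm{conv}} \ge W_{\mathrm{prop}} .
\]
```

With the examples given in the text, "Hokkaido" has n = 1 and contributes 2^1 = 2 entries, and "Tokyo" has n = 2 and contributes 2^2 = 4 entries, whereas each contributes exactly one entry in the present embodiment. Equality holds only when no word has a lengthenable place.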
[0037]
In the above embodiment, when executing the dictionary creation processing operation, the dictionary creation unit 7 converts the recognition target word written in mixed kanji and kana into a kana character string in step S1 using a kanji-kana conversion technique, and then creates the phoneme notation string in step S2. However, the present invention is not limited to this. For example, when the reading of the recognition target word is input to the dictionary creation unit 7 in kana characters, step S1 may be omitted and the dictionary creation processing operation may start from step S2.
[0038]
In the dictionary creation processing by the dictionary creation unit 7 in the above embodiment, the kana character string is converted directly into the phoneme notation string. However, the kana character string may first be converted into a romaji character string, and the resulting romaji character string may then be converted into the phoneme notation string.
[0039]
In addition, the dictionary creation processing by the dictionary creation unit 7 in the above embodiment may instead be performed separately on a computer, using a dictionary creation program that causes the computer to execute the same dictionary creation processing operation as the dictionary creation unit 7.
[0040]
The acoustic model stored in the acoustic model storage unit 3 in the above embodiment is an HMM that does not depend on the preceding and following phoneme context, known as a monophone model. However, the acoustic model is not limited to this; the subword of the acoustic model may be a syllable, or the model may be an HMM that depends on the preceding and following phoneme context.
[0041]
The functions of the acoustic analysis unit 1, likelihood calculation unit 2, acoustic model storage unit 3, dictionary collation unit 5, and word dictionary 6 in the above embodiments as the acoustic analysis processing means, acoustic model storage means, word dictionary, and collation means are realized by a speech recognition program recorded on a program recording medium. The program recording medium in each of the above embodiments is a program medium consisting of a ROM (read-only memory). Alternatively, it may be a program medium that is loaded into an external auxiliary storage device and read out. In either case, the program reading means that reads the speech recognition program from the program medium may be configured to access the program medium directly and read the program, or to download the program to a program storage area (not shown) provided in a RAM (random access memory) and read it by accessing that program storage area. It is assumed that a download program for downloading from the program medium to the program storage area of the RAM is stored in the main unit in advance.
[0042]
Here, the program medium is configured to be separable from the main unit, and is a medium that carries a program in a fixed manner. It includes tape systems such as magnetic tape and cassette tape; disk systems comprising magnetic disks such as floppy disks and hard disks, and optical discs such as CD (compact disc)-ROM, MO (magneto-optical) discs, MD (mini disc), and DVD (digital versatile disc); card systems such as IC (integrated circuit) cards and optical cards; and semiconductor memory systems such as mask ROM, EPROM (ultraviolet-erasable ROM), EEPROM (electrically erasable ROM), and flash ROM.
[0043]
In addition, the speech recognition apparatus in the above embodiment can be configured to include a modem and to be connectable to a communication network including the Internet. In this case, the program medium may be a medium that carries the program dynamically, for example by downloading it from the communication network. It is then assumed that a download program for downloading from the communication network is either stored in the main unit in advance or installed from another recording medium.
[0044]
It should be noted that what is recorded on the recording medium is not limited to a program, and data can also be recorded.
[0045]
【Effects of the Invention】
As is clear from the above, in the present invention the time series of feature parameters obtained from the input speech is collated with the subword acoustic models of subwords, which include subwords that may be lengthened and are represented by a special notation in which the notation for the non-lengthened case and the notation for the lengthened case are written together, to obtain the likelihood of each subword. The obtained subword likelihoods are then collated with the recognition target words, each represented by a subword sequence that includes subwords in the special notation, and the recognition target word with the highest likelihood is generated as the recognition result. Consequently, when the input speech contains a subword that may be lengthened, both the case where it is lengthened and the case where it is not can be handled easily. Recognition errors caused by lengthening of a syllable in the spoken word can therefore be prevented.
[0046]
Moreover, each syllable that may be lengthened in the recognition target words stored in the word dictionary is written as a single special-notation subword. One recognition target word can therefore be represented by just one subword sequence, so that increases in the number of subword sequences, and in the processing time needed to collate the subword likelihoods with the subword sequences of the recognition target words, are kept to a minimum. As a result, the collation time is shorter than in conventional speech recognition, in which one recognition target word is represented by multiple subword sequences in order to handle lengthening.
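The difference in dictionary size can be made concrete with a small Python sketch. It enumerates the entries that the conventional approach would register for 「とうきょう」, which has two sites that may be lengthened, and compares that with a single sequence in which each such site is one merged entry. The function name expand_variants, the splitting of the word into the two units shown, and the "(plain/lengthened)" bracket notation are assumptions for illustration; the sketch does not reproduce the special notation actually used in the word dictionary 6.

```python
from itertools import product


def expand_variants(units, lengthenable):
    """Enumerate every plain/lengthened combination (conventional approach)."""
    options = [
        (u, lengthenable[u]) if u in lengthenable else (u,)
        for u in units
    ]
    return ["".join(choice) for choice in product(*options)]


# Hypothetical split of the word into units, each with a lengthened form.
units = ["とう", "きょう"]
lengthenable = {"とう": "とー", "きょう": "きょー"}

# Conventional approach: one dictionary entry per combination, 2**n in total.
variants = expand_variants(units, lengthenable)

# Merged approach: one sequence, each lengthenable site written once.
# The "(plain/lengthened)" form is a made-up stand-in for the special notation.
merged_sequence = [
    f"({u}/{lengthenable[u]})" if u in lengthenable else u
    for u in units
]

print(len(variants))      # 4 sequences for n = 2 sites
print(merged_sequence)    # a single sequence of 2 entries
```

With n lengthenable sites the enumerated list grows as 2 to the power n, while the merged representation always remains one sequence per word.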
[Brief description of the drawings]
FIG. 1 is a block diagram of a speech recognition apparatus according to the present invention.
FIG. 2 is an explanatory diagram of a phoneme sequence of words stored in the word dictionary in FIG. 1.
FIG. 3 is a flowchart of a dictionary creation processing operation executed by the dictionary creation unit in FIG. 1.
FIG. 4 is a diagram showing an example of the concept of the kana character / phoneme correspondence table in FIG. 1.
FIG. 5 is a diagram illustrating an example in which each word registered in the word dictionary in FIG. 1 is expressed in the form of a phoneme network.
FIG. 6 is a diagram illustrating an example in which each word registered in the word dictionary in FIG. 1 is expressed in a phoneme tree structure format.
[Explanation of symbols]
1 ... Acoustic analysis unit,
2 ... Likelihood calculation unit,
3 ... Acoustic model storage unit,
4 ... Likelihood storage unit,
5 ... Dictionary collation unit,
6 ... Word dictionary,
7 ... Dictionary creation unit,
8 ... Kana character / phoneme correspondence table.

Claims

A speech recognition apparatus that recognizes input speech using subwords as recognition units and using subword acoustic models, comprising:
acoustic analysis processing means for analyzing the input speech to obtain a time series of feature parameters;
acoustic model storage means in which subword acoustic models, which are acoustic models of the subwords, are stored;
a word dictionary in which recognition target words, each represented by a subword sequence that is a time series of the subwords, are stored; and
collation means for obtaining the likelihood of each subword by collating the time series of feature parameters with the subword acoustic models, and for collating the likelihood of each subword with the subword sequences of the recognition target words to generate the recognition target word with the highest likelihood as a recognition result,
wherein the subwords include a subword that may be lengthened and that is represented by a special notation in which the notation for the non-lengthened case and the notation for the lengthened case are written together.

The speech recognition apparatus according to claim 1,
further comprising dictionary creation means that receives a recognition target word written as a character string, converts the character string notation of the received recognition target word into subword notation, detects in the obtained subword notation any syllable that may be lengthened, converts each detected syllable into a subword represented by the special notation, and stores the recognition target word represented by the resulting subword sequence in the word dictionary.

The speech recognition apparatus according to claim 2,
wherein the dictionary creation means receives a recognition target word written as a mixed kanji-kana character string, converts the mixed kanji-kana notation of the received recognition target word into kana character notation, and converts the obtained kana character notation into subword notation.

The speech recognition apparatus according to claim 2,
wherein the dictionary creation means receives a recognition target word written as a kana character string and converts the kana character notation of the received recognition target word into subword notation.

A speech recognition method that recognizes input speech using subwords as recognition units and using subword acoustic models, comprising the steps of:
analyzing the input speech to obtain a time series of feature parameters;
obtaining the likelihood of each subword by collating the time series of feature parameters with subword acoustic models, which are acoustic models of subwords including a subword that may be lengthened and that is represented by a special notation in which the notation for the non-lengthened case and the notation for the lengthened case are written together; and
generating, as a recognition result, the recognition target word with the highest likelihood by collating the likelihood of each subword with recognition target words, each represented by a subword sequence that is a time series of the subwords including the subword represented by the special notation.

A speech recognition program that causes a computer to function as the acoustic analysis processing means, acoustic model storage means, word dictionary, and collation means according to claim 1.

A computer-readable program recording medium on which the speech recognition program according to claim 6 is recorded.