JP3575904B2

JP3575904B2 - Continuous speech recognition method and standard pattern training method

Info

Publication number: JP3575904B2
Application number: JP3245596A
Authority: JP
Inventors: 喜永加藤
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1995-04-26
Filing date: 1996-02-20
Publication date: 2004-10-13
Anticipated expiration: 2016-02-20
Also published as: JPH0916192A

Description

【０００１】
【発明の属する技術分野】
本発明は、連続音声認識方式及び標準パタン訓練方式、より詳細には、類の一部を代表するパタンを時間方向に連結して状態遷移モデルとなし、状態遷移モデルにおける各状態の照合継続時間を制御しながら入力音声パタンを照合し、当該状態遷移モデルと入力音声の特徴パタンとを比較することによって、認識結果を得る連続音声認識方式、及び、連続音声中の重要な単語を認識するために必要な標準パタンを訓練するのに好適な標準パタン訓練方式に関する。
【０００２】
【従来の技術】
最初に、本明細書中において使用する記号について、下記の通り定義する。
【０００３】
【外１】
【０００４】
最初に、従来の連続音声認識方式について説明する。今、入力音声パタンに対する標準パタン系列がＳ個あるとし、ｓ番目の系列を（ｓ）Ｗとする。（ｓ）Ｗは、以下の式に示すように、Ｌ個標準パタンを接続したものから成る。この標準パタンは、音声の類（例えば音素や単語）を特徴づけているパタンである。
【０００５】
【数１】
【０００６】
ただし、Ｌは可変である。ここで、ｑ（ｌ）は、系列中のｌ（１≦ｌ≦Ｌ）番目の標準パタンのインデックスであり、Ｖ個の語彙数を持つ。
同様にして、入力音声特徴量の列Ｘを以下のように表す。
Ｘ＝｛ｘ_１，…，ｘ_ｍ，…，ｘ_Ｍ｝ …（２）
ここで、連続音声認識の問題は、発声した音声Ｘと参照系列との距離Ｄ（Ｘ，（ｓ）Ｗ）を最小にする参照系列＊Ｗをみつけることに相当する。
【０００７】
【数２】
【０００８】
式（４）の右辺に関する最小化は、それぞれ、標準パタンの連結数，モデルの並び、整合関数に関して行われる。式（４）は、動的計画法によって求めることができる。ここで、θは照合経路を表す関数である。標準パタン系列（ｓ）ｗの作成には、中川，“確率モデルによる音声認識”電子情報通信学会（１９８８）などに詳述される隠れマルコフモデル（ＨＭＭ：ＨｉｄｄｅｎＭａｒｋｏｖＭｏｄｅｌ）や、神経回路網，音声パタンの相加平均などによってモデル化される。
【０００９】
標準パタンＷｉの組合せによって、参照系列を作成するが、その組合せに制約がないと、照合時の探索空間が広くなると同時に、認識性能が低下する。そこで、言語モデルを導入して、種々の言語制約を与える。例えば、構文制御による言語モデルは、０，１的に与えられ、文脈自由文法などで記述し、ＡＴＲ編，“自動翻訳電話”オーム社（１９９４）に詳述されるＬＲ（Ｌｅｆｔ−ｔｏ−ｒｉｇｈｔＲｉｇｈｔｍｏｓｔｄｅｒｉｖａｔｉｏｎ）パーサなどを用いて解析する。前出の文献による認識方式では、解析と同時に音素ＨＭＭから得られる尤度によって、パーサから得られた仮説を棄却するか存続するかを決定する。最終的に、最も大きい尤度をもつ仮説を認識結果とする。この場合、式（１）のＷｉは、ＬＲ構文解析により受理された、終端記号に対応する系列でなければならない。
【００１０】
次に、従来の標準パタン訓練方式について説明する。例えば、発話中から日付／一月一日／という単語を抽出したいと仮定する。発声者の発話方法はさまざまであり、（１）／一月一日／と連続的に発話する場合や、（２）／一月＿一日／（＿：若干の休止区間）、（３）／一月の一日／などと単語間に認識対象以外の語が挿入する場合が考えられる。このような発話に対して、照合に用いる標準パタンには、上記の３通りのパタンを全て作成することは、パタン記憶容量の増大を招くため、／一月／，／一日／といった、短い語を単位とする標準パタンを作成するのが普通である。このような標準パタンと入力音声とを、中川著，“確率モデルによる音声認識”（社）電子情報通信学会（１９８８），に掲載されているようなスポッティング手法を用いて照合し、キーワードを抽出する。
【００１１】
上記標準パタンを訓練するには、通常／一月／，／一日／などの孤立単語を数回発声し、その特徴パタンの相加平均を求めることで実現できる。ところが、このように離散的に発声された音声を用いた標準パタンは、上述の（１）〜（３）のような連続的な発話音声とは様式が異なっている。そのため、認識対象でない（２）の休止部分や（３）の／の／の部分が対象語のいずれかとなって抽出され湧き出しが起ったり、連続音声中での語を表すパタンや発話速度が孤立単語のものとは異なるために、対象語であるにも関わらず脱落してしまうことがある。
【００１２】
以上の現象は、発話様式に対する標準パタンを精密に設計していないために起こる。この問題に対処するために、特開平７−３６４７９号公報に掲載されているようにガーベジモデルによる方法がある。これは、登録語以外の語に相当するモデルを作成して、キーワード以外の発声部分を前記モデルで吸収するように標準パタンを訓練する。また、国際電気通信基礎技術研究所編，“自動翻訳電話”オーム社（１９９４），に掲載されているように、発話文として起こりうる全ての現象を文脈自由文法などで記述し、予測型一般化ＬＲ（Ｌｅｆｔ−ｔｏ−ｒｉｇｈｔＲｉｇｈｔｍｏｓｔｄｅｒｉｖａｔｉｏｎ）解析アルゴリズムを用いて、音素を単位とする隠れマルコフモデル（ＨＭＭ：ＨｉｄｄｅｎＭａｒｋｏｖＭｏｄｅｌ）を入力音声と照合させる方法がある。
【００１３】
【発明が解決しようとする課題】
上述の従来の連続音声認識方式における標準パタンの作成において、最近では、鷹見他，“逐次状態分割方法による隠れマルコフ網の自動生成”，電子情報通信学会論文誌，Ｖｏｌ．Ｊ７６−Ｄ−ＩＩ，Ｎｏ．１０，ｐｐ．２１５５−２１６４（１９９３−１０）に報告されているように、音素単位ではなく、当該音素の環境を考慮するようなモデルが提案されている。例えば、／ａｋａ／と発声された音声の／ｋ／を認識するのに、／ｋ／の前後に／ａ／があるという情報をもった／ａ−ｋ−ａ／というＨＭＭを用いて照合を行う。同様にして、／ｉｋｉ／と発声した場合の／ｋ／は、／ｉ−ｋ−ｉ／というＨＭＭを用いることになる。上述の発声はどちらも中心部の子音は／ｋ／であり、音素環境独立型の場合には、同一のモデル／ｋ／が照合に用いられるのであるが、環境依存型の場合には、それぞれ異なるＨＭＭを用いることになる。そのため、音素モデルの設計段階で、当該音素のモデル化だけでなく、音響空間上のある音素から音素への移動経路もモデル化することができ、高精度な認識性能を期待できる。
【００１４】
一方、このような音素環境依存型モデルを検証器として駆動するために、様々なＬＲ構文解析機が提案されている。永井他，“隠れマルコフ網と一般化ＬＲ構文解析を統合した連続音声認識”，電子情報通信学会論文誌，Ｖｏｌ．Ｊ７７−Ｄ−ＩＩ，Ｎｏ．１，ｐｐ．９−１９（１９９４−１）には、音素環境独立型のＬＲテーブルを用いて解析アルゴリズムを音素環境依存型に変更する例が報告されている。この例では、アルゴリズムの変更に伴い、音素環境独立でも駆動する専用の音素環境依存型解析機を開発しなければならない。
【００１５】
また、永井他，“文脈自由文法から音素コンテキスト依存文法への変換アルゴリズム”，日本音響学会講演論文集，３−１−６，ｐｐ．８１−８２（１９９２−３）には、音素環境独立のＬＲテーブルを、音素環境依存の構文解析が可能なＬＲテーブルに変換する方法や、音素環境独立の文脈自由文法を音素環境依存の文脈自由文法に変換する方法を紹介している。しかし、これらの方法は、汎用的なタスクを想定して、音素環境独立のＬＲテーブル、あるいは文脈自由文法を音素環境依存型に変換しようとしているため、ＬＲテーブルの状態数や文脈自由文法の規則数が爆発的に増加することが想定される。
【００１６】
本発明は、上述のごとき従来の連続音声認識方式の実情に鑑みてなされたもので、タスクに応じて音声類の環境を考慮した文脈自由文法を生成し、機構が単純で、記憶量の小さい構文解析部を用いて音声認識の照合範囲を狭くするとともに、環境を考慮した標準パタンを適応的に訓練することによって、高速で高精度な照合が可能な連続音声認識方式を提供することを目的としてなされたものである。
【００１７】
また、上述の従来の標準パタン訓練方式のうち、ガーベジモデルを用いる方法では、登録語以外のモデルを比較的粗いモデルとして設計するために、抽出すべき単語もガーベジモデルに引き寄せられ、吸収されてしまう可能性がある。そのためモデルパラメータを注意深く制御しなければならない。また、不必要な吸収を避けるためにガーベジモデルの数を増やすことも考えられるがモデルの記憶量が増大する。
【００１８】
一方、予測型一般化ＬＲアルゴリズムを用いる方法では、発話内容の一字一句を全て認識していくため、認識結果にキーワードが存在しているか否かを調べる後処理を必要とする。また、発話現象を扱うための文法規則数が増し、記述も複雑になるので、管理が容易でない。
【００１９】
それゆえに、本発明は、上述のごとき従来の標準パタン訓練方式の実情に鑑みてなされたもので、記憶量の小さい標準パタン群と、機構が単純で、記憶量の小さいＬＲ表を用いた構文解析部とにより、構文解析部から直接標準パタンを選択することによって、標準パタンの訓練効率と発話様式に対する認識精度を高め、短時間で高精度なキーワード認識を可能にする標準パタン訓練方式を提供することを目的としてなされたものである。
【００２０】
【課題を解決するための手段】
請求項１の発明は、入力音声の特徴量を抽出する手段と、類の一部を代表するパタンを時間方向に連結して状態遷移モデルとなし、音声の類をモデル化する手段と、音声記号列を文法により解析する構文解析部と、状態遷移モデルにおける各状態の照合継続時間を制御しながら入力音声パタンを照合する手段とを備え、当該状態遷移モデルと入力音声の特徴パタンとを比較することによって、認識結果を得る連続音声認識方式において、前記構文解析部で受理された音声記号列を用いて、類の前後環境を含めた終端記号列を生成し、文法を作成することにより、類の前後環境を含めた状態遷移モデルを未知入力音声と照合する。
【００２１】
請求項２の発明は、請求項１の発明において、前記構文解析部で受理された音声記号列に基づいて発声した音声を入力とし、その入力に対応する類の前後環境を含めた状態遷移モデルを連結して訓練する。
請求項３の発明は、請求項１の発明において、前記構文解析部で受理された音声記号列を含む音声を入力とし、類の前後環境を含めた状態遷移モデルと照合し、その認識結果をもっともらしい順に所定数表示し、正しい候補を選択することによって、正しい状態遷移モデルを連結して訓練する。
請求項４の発明は、請求項２又は３の発明において、類の前後環境を含めた状態遷移モデルの訓練に関し、過去に当該モデルに対して訓練が行われていた場合は、過去の状態遷移モデルと重ね合せて訓練する。
【００２２】
請求項５の発明は、請求項２又は３の発明において、類の前後環境を含めた状態遷移モデルの訓練に関し、過去に当該モデルに対して訓練が行われていた場合には、新たに当該モデルの類に対する前後環境を含めた状態遷移モデルを生成して訓練を行い、過去の対応状態遷移モデルは訓練しない。
請求項６の発明は、請求項４の発明において、類の前後環境を含めた状態遷移モデルの訓練に関し、請求項５によって記憶された状態遷移モデルの中から、入力音声と最も類似したモデルを更新する。
請求項７の発明は、請求項２乃至６のいずれかの発明において、類の前後環境を含めた状態遷移モデルの訓練に関し、初期モデルとして、環境独立の状態遷移モデルを連結する。
【００２５】
【発明の実施の形態】
最初に、連続音声認識方式について説明する。
図１は、本発明による連続音声認識方式の一実施例を説明するための概略ブロック図で、図中、１はＬＰＣ分析部、２は照合部、３は環境依存文法部、４は環境依存動作表部、５は構文解析部、６は環境依存型ＤＳＴモデル、７はパタン連結部、８は判定部、９はスイッチ、１０は環境独立文法部、１１は環境独立動作表部、１２は記号処理部で、図１に示した実施例によれば、構文解析部５に手を加えることなく、環境依存型のＤＳＴモデル６を利用でき、タスクに対して適応的でより確実な認識を行うことができる。環境独立文法部１０には、通常の音素を終端記号とする文法を、文脈自由文法などを用いて格納してある。文法の例を表１に示す。表１で、右辺の小文字は終端記号を表す。本実施例では、文法の終端記号及び標準パタンの類を音素として話を進めるが、単語，音節などのような類を採用してもかまわない。また、この文法から得たＬＲ解析表を環境独立動作表部１１に記憶しておく。表１の内容は、Ａ.Ｖ.Ａho他，“Compilers-Principles, Techniques, and Tools”，Addison-Wesley（1986）などに詳述されるＬＲ解析表と同じで、ＡＣＴＩＯＮ部とＧＯＴＯ部とから成り立っている。
【００２６】
【表１】
【００２７】
まず、スイッチ９をＡ側に入れ、音素環境依存型の文法を作成するため、構文解析部５を駆動して、受理可能な文を終端記号列を用いて出力する。これは、北他，“ＨＭＭ音韻認識と拡張ＬＲ構文解析法を用いた連続音声認識”，情報処理学会論文誌，Vol.31, 3, pp.472-480（1990）などに詳述されるように、動作表から次に解析する終端記号を予測しながら、構文解析部５を駆動することによって実現することができる。
【００２８】
得られた文から、記号処理部１２で認識タスクとして必要な文を選択する。選択には、必要とする文を記号列照合により、自動的に選択してもよいし、人間が出力結果を編集することによって選択してもよい。その後、選択した文を終端記号の並びに応じて、環境依存型の終端記号列に変換する。例えば、／ｋｏｒｅｏｋｕｒｅ／という文を得ている時には、対象とする記号の先行及び後続記号の一文字を考慮して、／−ｋｏｋｏｒｏｒｅｒｅｏｅｏｋｏｋｕｋｕｒｕｒｅｒｅ−／のように変換する。中心の記号が対象とする終端記号であり、左右にはその環境を意味する記号を付加する。上述の例で／ｋｏｒ／は、／ｏ／という終端記号に先行して／ｋ／という終端記号があり、／ｒ／という記号が後続することを示す。／−／は、記号の始まりもしくは終りを示す。本実施例では、先行および後続する記号数を一つにしているが、いくつに設定してもよい。次に、変換した終端記号を用いて、環境依存文法を作成し、同文法部に格納する。作成された文法を表２に示す。同文法から得たＬＲ解析表を環境依存動作表部４に記憶しておく。
【００２９】
【表２】
【００３０】
次に、スイッチ９をＢ側に入れ、連続音声の認識を行う。入力した音声をＬＰＣ分析し、１０次元のケプストラムパラメタを抽出する。ただし、分析条件として、標本化周波数８ｋＨｚ，ハミング窓による窓がけ（窓幅１６ｍｓ），ＬＰＣ分析次数１４とする。また、１フレームあたりのシフト幅は、５ｍｓｅｃ間隔としている。分析手法は、上記に限られたものではなく、新美，“音声認識”，共立出版（１９７９）などで詳述されているように、周波数分析など、どのような音響分析手法を用いてもよい。
【００３１】
構文解析部５では、ＬＲ解析表からどの音素を照合すればよいかを決定する。解析の状態が進むたびに、室井他，“継続時間制御状態遷移モデルを用いた単語音声認識”，J72-D-II, 11, pp.1769-1777（1989-11）に詳述されるような継続時間制御状態遷移（ＤＳＴ：Duration-based State Transition）モデルを連結する。本実施例では、音素の環境を考慮したＤＳＴモデルを用い、照合部において、ＤＳＴモデルと入力音声の特徴量との照合を行う。解析した文の句構造は、構文解析部５のチャートに記録しておく。最終的に全ての解析を終了した候補の中から最も小さい得点をもつ候補を式（５）に従って求め、認識結果として出力する。
【００３２】
【数３】
【００３３】
ここで、ｒは、動的計画法により求められた伸縮関数である。この関数により、照合するｍフレーム目の入力特徴量とｒ（ｍ）番目のＤＳＴモデルの状態とが対応づけられる。ｌ（エル）_ｒ（ｍ）は、入力音声パタンをＮ（ｓ）個の部分パタンに分割した時のｒ（ｍ）番目の部分パタンにおけるフレーム長を示す。右辺の第１項目が音響分析によって得られた特徴量に関する距離を表し、第２項目が部分パタンの継続時間長に関する距離を表す。ａは、正の数で、継続時間長に関する距離をどの程度全体の距離に反映させるかを決定する。本実施例では、ａ＝０．１程度に設定する。上述のＤＳＴモデルを用いることによって、音響空間上の特徴量だけでなく、音声パタンの時間的構造（特に部分パタンの時間長）を考慮した照合を行うことができる。
【００３４】
図２は、本発明の他の実施例を説明するための概略ブロック図で、図中、１３は発声リスト、１４はＤＳＴモデル訓練部で、その他、図１に示した実施例と同様の作用をする部分には、図１の場合と同一の参照番号が付してある。而して、図２に示した実施例は、図１に示した実施例によって得られた環境依存型の文法と動作表とを用いて、音素環境依存型ＤＳＴモデルを訓練できるようにしたもので、まず、スイッチ９をＡ側に入れ、音素環境依存型ＤＳＴモデル６の訓練を行う。発声リスト１３に対応した音声が入力され、ＬＰＣケプストラムパラメタが抽出される。次に、発声リスト１３に従って、環境依存型ＤＳＴモデル列とを動的計画法を用いて照合し、式（４）の基準に従って伸縮関数θに関して最小化を行う。求めた伸縮関数をｒとする。
ＤＳＴモデル訓練部１４において、モデルの平均値と継続時間長を次式に従い更新する。ここで、Ｎ_ｒ（ｍ）は、ＤＳＴモデルのｒ（ｍ）番目の状態に対応づけられた入力パタンの最終フレーム番号である。
【００３５】
【数４】
【００３６】
ただし、Ｎ_ｒ（０）＝０とする。
上述の訓練を行った後、スイッチ９をＢ側に入れ、連続音声の認識を行う。認識過程の構成は、図１の実施例と同じであるため省略する。
【００３７】
図３は、本発明の更に他の実施例を説明するための概略ブロック図で、図中、１５は結果表示部、１６は選択部で、その他、図１又は図２に示した実施例と同様の作用をする部分には、図１又は図２の場合と同一の参照番号が付してある。而して、図３に示した実施例は、認識するために発声された入力音声を用いて音素環境依存型のＤＳＴモデルを訓練できるようにしたものである。図３に示した実施例によれば、認識とＤＳＴモデルの訓練とを同時に行うことができる。まず、入力音声を図１の実施例と同じ過程により認識し、ディスプレイなどの表示装置を用いて、表示部１５で認識候補の得点の低い順に所定数表示する。表示部１５に正解が含まれている場合には、キーボードなどの選択部１６により、正解を選択できるようにする。この選択により、入力された音声パタンに対して訓練するべきＤＳＴモデル列を決定することができる。これらのＤＳＴモデル列に対し、式（４），（８），（９）を適用して、訓練部１４にて、ＤＳＴモデルの平均値と継続時間長の更新を行う。訓練の過程は、図２の実施例と同じであるため省略する。
【００３８】
本実施例では、表示部において、照合時の距離尺度に式（７）に示すユークリッド距離を用いているため、得点の低い順番に候補を表示している。もし、尤度などを基準として認識候補の得点をつけた場合には、得点の高い順に表示することになる。もちろん、本発明においては、どちらの基準を用いても構わない。
【００３９】
図２または図３のＤＳＴモデル訓練部１４において、同じ類に対し、過去に訓練されたモデルが存在している場合には、次の２通りの方法によって、ＤＳＴモデルを訓練する。一つは、次式（１０）に従って、過去に訓練されたモデルＷ_ｋ１と新しく訓練されたモデルＷ_ｋ２とを重ね合わせて、Ｗ_ｋ３を作成する方法である。
Ｗ_ｋ３＝ｂＷ_ｋ１＋（１−ｂ）Ｗ_ｋ２ …（１０）
ここで、ｂは過去のモデルと新モデルとの混合比率を示す０以上１以下の数である。特別な場合として、ｂ＝１の時には、モデルは訓練されないことを示し、ｂ＝０の時には、新モデルに置き換えることに相当する。
もう一つは、過去のモデルと新モデルとの両方を記憶しておく方法である。すなわち、訓練用の音声が入力されるたびに、新しいＤＳＴモデルを作成する。認識時には、最も入力音声パタンと近いＤＳＴモデル系列を認識結果として出力すればよい。
【００４０】
また、上述の２つの訓練法を組合わせた方法も可能である。上述の２つ目の方法は、同じ類に対して複数のモデルを持つことで、認識の精度を上げることができるが、照合時の組合せ回数が多くなるので、認識時間が長くなる。そこで、所定数だけ、モデルが作成された後は、重ね合わせの対象となるモデルを選択し、選択されたモデルと新しく訓練されたモデルとを式１０に従って重ね合わせる。列ｓが重ね合わせるＤＳＴモデルを含んだ列であるとした場合、重ね合わせの対象となるＤＳＴモデル列は、
【００４１】
【数５】
【００４２】
を満たす。この方法により、認識時間と認識精度との関係を自由に調整し、使用者の所望とする性能に設定することができる。
以上に述べてきた環境依存型ＤＳＴモデルを訓練するために、環境独立型ＤＳＴモデルを初期モデルとすることも可能である。例えば、先行および後続音素が／ａ／である／ａ−ｋ−ａ／というＤＳＴモデルを訓練することを考える。この場合の初期モデルとして、／ｋ／という音素環境独立型のＤＳＴモデルを用いて訓練を始める。音素環境独立型のＤＳＴモデルから質のよい初期値を与えることにより、高精度なモデルを設計することができる。
【００４３】
次に、標準パタン訓練方式について説明する。
図４は、標準パタン訓練方式の一実施例を説明するための概略ブロック図で、図中、２１は分節化部、２２は特徴パタン作成部、２３は照合部、２４は累積得点記憶部、２５は比較部、２６はＬＲ表部、２７は予測型チャート構文解析部で、まず、スイッチＷ₁をＡ側に入れ、標準パタンの訓練を行なう。図４では、入力音声に対する状態遷移モデルを作成するために、ＬＲ表部２６を用いた予測型チャート構文解析部２７を駆動する。ＬＲ表部２６には表３に示すような文法から得られる動作表を記憶しておく。表３の記号の中で、終端記号は、’＊’で始まり、それ以外の記号は非終端記号である。この記述は実施例を示すため簡単にしてあるが、文脈自由文法による記法であればさらに複雑な記述が可能である。
【００４４】
【表３】
【００４５】
ＬＲ表の内容は、Ａ．Ｖ．Ａｈｏ他，“Ｃｏｍｐｉｌｅｒｓ−Ｐｒｉｎｃｉｐｌｅｓ，Ｔｅｃｈｎｉｑｕｅｓ，ａｎｄＴｏｏｌｓ”，Ａｄｄｉｓｏｎ−Ｗｅｓｌｅｙ（１９８６）などに詳述されるＬＲ解析表と同じで、ＡＣＴＩＯＮ部とＧＯＴＯ部から成り立っている。この表の動作には、状態の遷移，文法の適用，受理，誤りの４種類がある。
【００４６】
表３のＬＲ表を用いて、予測型チャート構文解析部２７では、終端記号を先頭から１つずつ取り出し、表４から表６に示すアルゴリズムを適用し、その結果を表７に示すチャートとして記録する。チャートには最終的に受理動作を行なうまで、全ての句構造を記録していく。ただし、’＊＄’は最後を表す終端記号で予測した終端記号列の最後の位置に設定される。
【００４７】
【表４】
【００４８】
【表５】
【００４９】
【表６】
【００５０】
表７は、例として“１月１日１時”の解析結果を示しているが、その他にも文法に基づいて“１月１日２時”，“１月２日１時”などが順次生成される。標準パタンの訓練は、これらの記号系列に対応する状態遷移モデルを作成することにより実現できる。
【００５１】
【表７】
【００５２】
上述の予測型チャート構文解析部２７の動作により、終端記号を構成している文字系列のインデックス番号が順次に標準パタン記憶部２８へ送られる。標準パタンは文字単位で格納されているので、連結部２９にてインデックス番号を参照して終端記号単位に標準パタンを連結し、状態遷移モデル部３０にて状態遷移モデルを作成する。例えば、標準パタンが音素単位で格納されていれば、終端記号“１月”に対して／ｉ，ｃｈ，ｉ，ｇ，ａ，ｔ，ｕ／という標準パタンで構成する。なお、状態遷移モデルをＨＭＭのような確率モデルで表現しても、単語グラフや有限状態網のように厳格に表現してもどちらでも構わない。
【００５３】
一方、入力音声は分節化部２１により所定の時間だけ音声を入力し、新美，“音声認識”，共立出版（１９７９）などで詳述されているような分析手法によって特徴パタンに変換される。ここでは、１０次元のケプストラムパラメタを抽出し特徴パタンとする。ただし、分析条件として、標本化周波数：１６ｋＨｚ，高域強調：一次差分，２５６点ハミング窓，更新周期：１０ｍｓ，ＬＰＣ分析次数：２０とする。分析手法は上記に限られたものではなく、周波数分析などどのような音響分析手法を用いてもよい。入力する音声には、前記チャート構文解析部から生成された終端記号に対応するキーワードを含めておく。
【００５４】
次に、上述のようにして作成された状態遷移モデルと入力音声の特徴パタンとを、照合部２３にて照合する。構文解析部２７から生成された終端記号列のうち、ｓ番目の終端記号に対応する状態遷移モデルをｓＷ，（ｓ＝１，…，Ｓ）で表す。ｓＷをＬ個の標準パタンにより構成する。
【００５５】
【数６】
【００５６】
ここで、ｐ_ｑ（ｌ）は、系列中のｌ（１≦ｌ≦Ｌ）番目に対応する標準特徴パタンのインデックスであり、全体でＶ個の標準パタンを持つ。表７を例にすれば、生成文の終端記号数は、３であるので、Ｓ＝３である。また、各標準パタンは、実施例の場合、音素に対応するので、標準パタン数は総音素数と等しくなる。
同様にして、入力特徴パタンＸを以下のように表す。
Ｘ＝｛ｘ_１，…，ｘ_ｍ，…，ｘ_Ｍ｝ …（１３）
実施例において、Ｘは、入力音声中のＳ個のキーワードが含まれた特徴パタンである。照合部では、入力音声特徴パタンと状態遷移モデルとの照合得点Ｄを以下の式により求める。
【００５７】
【数７】
【００５８】
ここで、ｍ_ｓ１，ｍ_ｓ２は、ｓ番目のキーワードに対応する音声特徴パタンの抽出区間の端点で、それぞれ始点と終点を表す。整合関数ｒは、照合経路を表す関数であり、よく知られた動的計画法などによって求めることができる。整合関数により、ｍフレーム目の入力特徴量とキーワードを構成するｒ（ｍ）番目の標準パタンとが対応づけられる。標準パタンと音声特徴パタンとの得点Ｄ（ｘ_ｍ，ｐ_ｒ（ｍ））は、正値をもつしきい値から、よく知られたユークリッド距離を引くことで得られる。式（１５）により得られたｒより、標準パタンに対応する音声特徴パタンの部分パタンが求まるので、この部分パタンを用いて標準パタンを訓練する。この訓練は、標準パタンのもつ特徴量と部分パタンの特徴量との相加平均を求め、新たに標準パタンとして登録することでなされる。
【００５９】
訓練の方法は、上記に限ったものではなく、状態遷移モデルをＨＭＭで表現すれば、Ｄ（ｘ_ｍ，ｐ_ｒ（ｍ））を尤度として計算することで実現できる。また、この時のＨＭＭの訓練は、前述の中川，“確率モデルによる音声認識”などに詳述されるＢａｕｍ−Ｗｅｌｃｈの推定法により可能である。式（１５）は、最大化を基準としているが、これに限ったものではなく、単なるユークリッド距離による最小化基準により訓練を行っても本発明の本質は変わらない。
【００６０】
以上に説明したように、キーワード単位で入力音声の部分パタンを照合するため、キーワード間に休止や不要語が挿入されても、標準パタンの訓練が可能である。上述の処理を予測型チャート構文解析部２７の終端記号列が生成されなくなるまで繰り返すことにより、訓練が完了する。次に、スイッチＷ₁をＢに入れることにより、キーワード認識をすることができる。認識時には、予測型チャート構文解析部２７とＬＲ表部２６からキーワードを予測するように働く。解析が進むたびに予測キーワードの状態遷移モデルを作成するために標準パタンを連結する。照合部２３において、状態遷移モデルと入力音声の特徴量との照合を行う。予測したキーワード候補の得点は、累積得点記憶部２４に記憶しておき、最終的に全ての解析を終了した候補の中から最も高い得点をもつ候補を式（１５）に従って求め、認識結果として出力する。
【００６１】
図４に示した実施例において、式（１６）で、以下のような条件を導入することにより、照合時間を速くすることが可能である。
【００６２】
【数８】
【００６３】
この式（１６）は、入力特徴パタン中でｓ番目のキーワードを検出し、その区間内に収まるフレームから次のキーワード、すなわちｓ＋１番目のキーワードの状態遷移モデルに対して照合を開始することを示している。
【００６４】
複数のキーワードが入力された場合、分節化部で音声の存在する部分だけを切り出してくることにより、高速な照合が可能である。図８は、二つのキーワード／一月／と／一日／が含まれている音声波形を示している。図８からわかるように、／一月／と／一日／の間には、若干の休止が存在している。このような場合に既出の新美，“音声認識”，共立出版（１９７９）などで述べられている音声の切り出しアルゴリズムなどを用いて、分節化部２１において、図８のＡとＢとの区間を求める。その後、切り出したＡとＢとの区間だけを状態遷移モデルとの照合対象とすることで、照合区間を短くすることができる。
【００６５】
図５は、他の実施例を示す概略ブロック図で、図中、図４に示した実施例と同様の作用をする部分には、図４の場合と同一の参照番号が付してある。而して、図５に示す実施例は、ＬＲ表部（２６Ａ，２６Ｂ，２６Ｃ）と予測型チャート構文解析部（２７Ａ，２７Ｂ，２７Ｃ）との組を複数用意したものである。標準パタンの訓練時に初期段階から複数のキーワードが含まれた音声を用いて標準パタンを訓練すると、不安定なパタンとなることがある。そのような現象を避けるため、初期段階では、入力音声から単一キーワードだけを訓練するようにし、徐々に音声中に含まれるキーワードを増やすことにより、標準パタンが安定するだけでなく、入力音声の多様な発話様式も合わせて訓練することができる。実施例では、ＬＲ表部２６Ａと予測型チャート構文解析部２７Ａとを用いて、キーワードが一つ含まれる文を生成するようになっている。同様に残りの２組は、キーワードが２つ含まれる文と、３つ含まれる文とをそれぞれ生成する。訓練時、すなわちスイッチＷ₁をＡに入れた時には、まず、スイッチＷ₂をＣに入れて前記実施例と同様の手続きに従って、キーワードが一つ含まれた入力音声から標準パタンを訓練する。次に、スイッチＷ₂を順にＤ，Ｅと切替えていくことにより、音声中に含まれるキーワード数を増やして、標準パタンを訓練することができる。キーワード認識時には、スイッチＷ₁をＢに入れ、スイッチＷ2をＣ，Ｄ，Ｅに全て入れることで実現できる。予測可能な全てのキーワード候補を生成することができるので、それらの中から最も高い得点を持つ候補を認識結果として出力すればよい。
【００６６】
図６は、他の実施例を説明するための概略ブロック図で、図６に示す実施例は、図４に示した実施例に表示装置３２を加えたものである。訓練時にスイッチＷ_１をＡ側に入れ、スイッチＷ_３をＣに入れる。表示装置３２には、予測型チャート構文解析部２７から生成されたキーワードを含む文が生成され、表示装置３２に“１月１日”のように表示される。この表示を見ながら、発声者が音声を入力する。その後の処理を、図４の実施例で述べた方法を同様にして行うことにより、標準パタンの訓練が完了する。認識は、スイッチＷ_１をＢ側に入れ、スイッチＷ_３をＣに切ることによって実現することができる。
【００６７】
図７は、さらに他の実施例を説明するための概略ブロック図で、図７に示す実施例は、図６の実施例によみ変換部３３を加えたものである。よみ変換を行うために、ＬＲ表を作成する時の文法を表８のように変更する。表８は、キーワードにあたる日付に対応するよみを書き換え規則として追加している。訓練時の表示装置には、よみ変換部３３により終端記号を含む書き換え規則の右辺も表示する。この結果、“１月（いちがつ）１日（ついたち）”のように表示することができ、１日を“いちにち”と読むようなことがなくなるため、発声者に正確な発話を促すことができる。
【００６８】
【表８】
【００６９】
【発明の効果】
以上の説明から明らかなように、本発明によれば、タスクに応じて類の環境を考慮した文脈自由文法を適応的に生成することができる。また、機構が単純で、記憶量の小さい従来のＬＲ-Chart構文解析部に変更を加えることなく、環境依存型の音素モデルを組み合わせることが可能となる。さらに、類の環境を考慮した継続時間長制御型状態モデルを適応的に訓練することができる。その結果、高精度で高速な照合を行う連続音声認識を実現することができる。
請求項１に係わる発明は、入力音声の特徴量を抽出する手段と、類の一部を代表するパタンを時間方向に連結して状態遷移モデルとなし、音声の類をモデル化する手段と、音声記号列を文法により解析する構文解析部と、状態遷移モデルにおける各状態の照合継続時間を制御しながら入力音声パタンを照合する手段とを備え、当該状態遷移モデルと入力音声の特徴パタンとを比較することによって、認識結果を得る連続音声認識方式において、前記構文解析部で受理された音声記号列を用いて、類の前後環境を含めた終端記号列を生成し、文法を作成することにより、類の前後環境を含めた状態遷移モデルを未知入力音声と照合することができる。
請求項２に係わる発明は、請求項１において、前記構文解析部で受理された音声記号列に基づいて発声した音声を入力とし、その入力に対応する類の前後環境を含めた状態遷移モデルを連結して訓練することができる。
請求項３に係わる発明は、請求項１において、前記構文解析部で受理された音声記号列を含む音声を入力とし、類の前後環境を含めた状態遷移モデルと照合し、その認識結果をもっともらしい順に所定数表示し、正しい候補を選択することによって、正しい状態遷移モデルを連結して訓練することができる。
請求項４に係わる発明は、請求項２又は３において、類の前後環境を含めた状態遷移モデルの訓練に関し、過去に当該モデルに対して訓練が行われていた場合は、過去の状態遷移モデルと重ね合せることができる。
請求項５に係わる発明は、請求項２又は３において、類の前後環境を含めた状態遷移モデルの訓練に関し、過去に当該モデルの類に対して訓練が行われていた場合には、新たに当該モデルに対する前後環境を含めた状態遷移モデルを生成して訓練を行い、過去の対応状態遷移モデルは訓練しないようにして、認識時間と認識精度との関係を自由に調整し、使用者の所望とする性能に設定することができる。
請求項６に係わる発明は、請求項４において、類の前後環境を含めた状態遷移モデルの訓練に関し、請求項５によって記憶された状態遷移モデルの中から、入力音声と最も類似したモデルを更新することができる。
請求項７に係わる発明は、請求項２乃至６のいずれかにおいて、類の前後環境を含めた状態遷移モデルの訓練に関し、初期モデルとして、環境独立の状態遷移モデルを連結することができる。
【図面の簡単な説明】
【図１】本発明の一実施例による連続音声認識の実施例を示すブロック図である。
【図２】本発明の他の実施例を説明するための概略ブロック図である。
【図３】本発明のその他の実施例を説明するための概略ブロック図である。
【図４】標準パタン訓練の一実施例を示すブロック図である。
【図５】他の実施例を示す概略ブロック図である。
【図６】他の実施例を示す概略ブロック図である。
【図７】さらに他の実施例を示す概略ブロック図である。
【図８】二つのキーワード／一月／と／一日／が含まれている音声波形を示す図である。
【符号の説明】
１…ＬＰＣ分析部、２…照合部、３…環境依存文法部、４…環境依存動作表部、５…構文解析部、６…環境依存型ＤＳＴモデル、７…パタン連結部、８…判定部、９…スイッチ、１０…環境独立文法部、１１…環境独立動作表部、１２…記号処理部、１３…発声リスト、１４…ＤＳＴモデル訓練部、１５…結果表示部、１６…選択部、２１…分節化部、２２…特徴パタン作成部、２３…照合部、２４…累積得点記憶部、２５…比較部、２６…ＬＲ表部、２７…予測型チャート構文解析部、２８…標準パタン記憶部、２９…連結部、３０…状態遷移モデル部、３１…訓練部、３２…表示装置、３３…よみ変換部。
[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a continuous speech recognition method and a standard pattern training method. More specifically, it relates to a continuous speech recognition method in which patterns each representing a part of a class are concatenated in the time direction to form a state transition model, the input speech pattern is matched while the matching duration of each state in the state transition model is controlled, and a recognition result is obtained by comparing the state transition model with the feature pattern of the input speech. It also relates to a standard pattern training method suitable for training the standard patterns needed to recognize important words in continuous speech.
[0002]
[Prior art]
First, the symbols used in the present specification are defined as follows.
[0003]
[Outside 1]
[0004]
First, a conventional continuous speech recognition method is described. Suppose there are S standard pattern sequences for the input speech pattern, and let the s-th sequence be (s)W. As shown in the following equation, (s)W consists of L concatenated standard patterns. Each standard pattern characterizes a class of speech (for example, a phoneme or a word).
[0005]
(Equation 1)
[0006]
Note that L is variable. Here, q(l) is the index of the l-th (1 ≦ l ≦ L) standard pattern in the sequence, drawn from a vocabulary of V entries.
Similarly, the sequence X of input speech features is expressed as follows.
X = {x_1, ..., x_m, ..., x_M} ... (2)
The problem of continuous speech recognition then amounts to finding the reference sequence *W that minimizes the distance D(X, (s)W) between the uttered speech X and the reference sequence.
[0007]
(Equation 2)
[0008]
The minimizations on the right-hand side of Equation (4) are taken over the number of concatenated standard patterns, the ordering of the models, and the matching function, respectively. Equation (4) can be solved by dynamic programming. Here, θ is a function representing the matching path. The standard pattern sequence (s)W is modeled by, for example, a hidden Markov model (HMM), as detailed in Nakagawa, "Speech Recognition by Probabilistic Models", IEICE (1988), by a neural network, or by the arithmetic mean of speech patterns.
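The dynamic-programming match mentioned above can be illustrated with a minimal sketch. It assumes each standard pattern is a single mean vector, uses Euclidean distance, and requires every pattern to receive at least one frame; the function names `match_cost` and `recognize` are hypothetical and do not come from the specification, and Equation (4) itself is not reproduced here.

```python
import math

def euclid(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def match_cost(frames, patterns):
    """Minimum accumulated distance over monotonic alignments.

    Every frame is assigned to one pattern, patterns are visited in order,
    and each pattern receives at least one frame (an assumption of this sketch).
    """
    if not frames or len(frames) < len(patterns):
        return math.inf
    inf = math.inf
    # cost[l] = best cost so far with the current frame assigned to pattern l
    cost = [inf] * len(patterns)
    cost[0] = euclid(frames[0], patterns[0])
    for frame in frames[1:]:
        new = [inf] * len(patterns)
        for l, pattern in enumerate(patterns):
            # Either stay in pattern l or advance from pattern l - 1.
            best_prev = cost[l] if l == 0 else min(cost[l], cost[l - 1])
            if best_prev < inf:
                new[l] = best_prev + euclid(frame, pattern)
        cost = new
    return cost[-1]

def recognize(frames, candidates):
    """Return the name of the candidate sequence with the smallest cost."""
    return min(candidates, key=lambda name: match_cost(frames, candidates[name]))
```

For instance, with `frames = [[0.0], [0.1], [1.0], [1.1]]` and two candidate sequences built from the same two patterns in opposite order, `recognize` picks the order that follows the input. In the specification the set of candidate sequences is restricted by a language model rather than listed by hand.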
[0009]
A reference sequence is created by combining the standard patterns W_i. If the combinations are unconstrained, the search space during matching grows and recognition performance falls. A language model is therefore introduced to impose linguistic constraints. For example, a language model based on syntactic control is given in binary (0/1) form, is written as a context-free grammar or the like, and is analyzed with an LR (Left-to-right, Rightmost derivation) parser or the like, as detailed in ATR (ed.), "Automatic Translation Telephone", Ohmsha (1994). In the recognition method of that reference, whether a hypothesis obtained from the parser is rejected or retained is decided, concurrently with parsing, by the likelihood obtained from the phoneme HMMs. Finally, the hypothesis with the highest likelihood is taken as the recognition result. In this case, W_i in Equation (1) must be a sequence that is accepted by LR parsing and corresponds to terminal symbols.
[0010]
Next, a conventional standard pattern training method is described. Suppose, for example, that the date /一月一日/ (ichigatsu tsuitachi, "January 1st") is to be extracted from an utterance. Speakers phrase this in various ways: (1) /一月一日/ spoken continuously; (2) /一月_一日/ (_: a short pause); or (3) /一月の一日/, where a word that is not a recognition target is inserted between the words. Preparing standard patterns for all three forms would increase pattern storage, so standard patterns are normally created for short units such as /一月/ and /一日/. These standard patterns are matched against the input speech with a spotting method such as the one described in Nakagawa, "Speech Recognition by Probabilistic Models", IEICE (1988), and the keywords are extracted.
[0011]
Such standard patterns are usually trained by uttering isolated words such as /一月/ and /一日/ several times and taking the arithmetic mean of their feature patterns. However, standard patterns built from such discretely uttered speech differ in style from continuous utterances like (1) to (3) above. As a result, the pause in (2) or the /の/ in (3), neither of which is a recognition target, may be extracted as one of the target words, causing a false insertion. Conversely, a target word may be missed because the pattern and speaking rate of a word in continuous speech differ from those of the isolated word.
[0012]
These phenomena occur because the standard patterns are not designed precisely for the speaking style. One way to address this is the garbage model method described in JP-A-7-36479. In this method, a model corresponding to words other than the registered words is created, and the standard patterns are trained so that the parts of the utterance other than keywords are absorbed by that model. Another approach, described in ATR (Advanced Telecommunications Research Institute International) (ed.), "Automatic Translation Telephone", Ohmsha (1994), describes every phenomenon that can occur in an utterance with a context-free grammar or the like, and uses the predictive generalized LR (Left-to-right, Rightmost derivation) parsing algorithm to match phoneme-unit hidden Markov models (HMMs) against the input speech.
[0013]
[Problems to be solved by the invention]
Regarding the creation of standard patterns in the conventional continuous speech recognition method described above, models that consider not only the phoneme itself but also its environment have recently been proposed, as reported in Takami et al., "Automatic Generation of Hidden Markov Networks by the Successive State Splitting Method", IEICE Transactions, Vol. J76-D-II, No. 10, pp. 2155-2164 (1993-10). For example, to recognize the /k/ in speech uttered as /aka/, matching uses an HMM /a-k-a/ that carries the information that /a/ precedes and follows /k/. Likewise, for the /k/ in an utterance of /iki/, the HMM /i-k-i/ is used. In both utterances the central consonant is /k/. A phoneme-environment-independent system would use the same model /k/ for both, whereas an environment-dependent system uses a different HMM for each. The phoneme model design can therefore capture not only the phoneme itself but also the path of movement from one phoneme to the next in acoustic space, and high recognition accuracy can be expected.
[0014]
Meanwhile, various LR parsers have been proposed for driving such phoneme-environment-dependent models as verifiers. Nagai et al., "Continuous Speech Recognition Integrating Hidden Markov Networks and Generalized LR Parsing", IEICE Transactions, Vol. J77-D-II, No. 1, pp. 9-19 (1994-1), report an example in which a phoneme-environment-independent LR table is used and the parsing algorithm is modified to be phoneme-environment dependent. In this example, the change of algorithm requires developing a dedicated phoneme-environment-dependent parser that can also run in the environment-independent case.
[0015]
In addition, Nagai et al., "An Algorithm for Converting Context-Free Grammars into Phoneme-Context-Dependent Grammars", Proceedings of the Acoustical Society of Japan, 3-1-6, pp. 81-82 (1992-3), introduce a method for converting a phoneme-environment-independent LR table into an LR table that supports phoneme-environment-dependent parsing, and a method for converting a phoneme-environment-independent context-free grammar into a phoneme-environment-dependent one. However, because these methods assume general-purpose tasks when converting the LR table or the context-free grammar, the number of LR table states or grammar rules is expected to grow explosively.
[0016]
The present invention has been made in view of the above circumstances of conventional continuous speech recognition methods. Its object is to provide a continuous speech recognition method capable of fast, accurate matching. It does so by generating, according to the task, a context-free grammar that takes the environment of each speech class into account, by narrowing the matching range of speech recognition with a syntax analysis unit that is simple in mechanism and small in memory, and by adaptively training standard patterns that take the environment into account.
[0017]
Among the conventional standard pattern training methods described above, the garbage model method designs the models for non-registered words as relatively coarse models, so words that should be extracted may also be drawn to the garbage model and absorbed. The model parameters must therefore be controlled carefully. Increasing the number of garbage models to avoid unwanted absorption is conceivable, but it increases model storage.
[0018]
The method using the predictive generalized LR algorithm, on the other hand, recognizes the utterance word for word, so post-processing is needed to check whether a keyword is present in the recognition result. Moreover, the number of grammar rules needed to handle utterance phenomena grows and their description becomes complicated, so they are not easy to maintain.
[0019]
The present invention has therefore been made in view of the above circumstances of conventional standard pattern training methods. Its object is to provide a standard pattern training method that uses a group of standard patterns with small storage requirements together with a syntax analysis unit that is simple in mechanism and uses a small LR table. By selecting standard patterns directly from the syntax analysis unit, the method improves both the training efficiency of the standard patterns and the recognition accuracy across speaking styles, enabling accurate keyword recognition in a short time.
[0020]
[Means for Solving the Problems]
The invention of claim 1 is a continuous speech recognition method comprising: means for extracting features of input speech; means for modeling classes of speech by concatenating, in the time direction, patterns each representing a part of a class to form a state transition model; a syntax analysis unit that analyzes speech symbol strings according to a grammar; and means for matching the input speech pattern while controlling the matching duration of each state in the state transition model; the method obtaining a recognition result by comparing the state transition model with the feature pattern of the input speech. Using the speech symbol strings accepted by the syntax analysis unit, a terminal symbol string that includes the preceding and following environment of each class is generated and a grammar is created, whereby state transition models that include the preceding and following environment of each class are matched against unknown input speech.
[0021]
The invention of claim 2 is the invention of claim 1 in which speech uttered according to a speech symbol string accepted by the syntax analysis unit is taken as input, and the state transition models that include the preceding and following environment of the classes corresponding to that input are concatenated and trained.
The invention of claim 3 is the invention of claim 1 in which speech containing a speech symbol string accepted by the syntax analysis unit is taken as input and matched against the state transition models that include the preceding and following environment of each class, a predetermined number of recognition results are displayed in order of plausibility, and the correct candidate is selected, whereby the correct state transition models are concatenated and trained.
The invention of claim 4 is the invention of claim 2 or 3 in which, in training a state transition model that includes the preceding and following environment of a class, if that model has been trained in the past, training is performed by superimposing the new model on the past state transition model.
[0022]
The invention of claim 5 is the invention of claim 2 or 3 in which, in training a state transition model that includes the preceding and following environment of a class, if that model has been trained in the past, a new state transition model including the preceding and following environment for the class of that model is generated and trained, and the corresponding past state transition model is not trained.
The invention of claim 6 is the invention of claim 4 in which, in training a state transition model that includes the preceding and following environment of a class, the model most similar to the input speech is updated from among the state transition models stored according to claim 5.
The invention of claim 7 is the invention of any one of claims 2 to 6 in which, in training a state transition model that includes the preceding and following environment of a class, environment-independent state transition models are concatenated as the initial model.
[0025]
BEST MODE FOR CARRYING OUT THE INVENTION
First, the continuous speech recognition method will be described.
FIG. 1 is a schematic block diagram illustrating an embodiment of the continuous speech recognition method according to the present invention. In the figure, 1 is an LPC analysis unit, 2 is a matching unit, 3 is an environment-dependent grammar unit, 4 is an environment-dependent operation table unit, 5 is a syntax analysis unit, 6 is an environment-dependent DST model, 7 is a pattern concatenation unit, 8 is a decision unit, 9 is a switch, 10 is an environment-independent grammar unit, 11 is an environment-independent operation table unit, and 12 is a symbol processing unit. In the embodiment of FIG. 1, the environment-dependent DST model 6 can be used without modifying the syntax analysis unit 5, so recognition adapts to the task and becomes more reliable. The environment-independent grammar unit 10 stores a grammar whose terminal symbols are ordinary phonemes, written as a context-free grammar or the like. Table 1 shows an example of the grammar; lowercase letters on the right-hand sides are terminal symbols. This embodiment takes the phoneme as the class for both the grammar terminal symbols and the standard patterns, but classes such as words or syllables may be used instead. The LR parsing table obtained from this grammar is stored in the environment-independent operation table unit 11. The stored table has the same form as the LR parsing table detailed in A. V. Aho et al., "Compilers: Principles, Techniques, and Tools", Addison-Wesley (1986), and consists of an ACTION part and a GOTO part.
[0026]
[Table 1]
[0027]
First, switch 9 is set to side A. To create the phoneme-environment-dependent grammar, the syntax analysis unit 5 is driven to output the acceptable sentences as terminal symbol strings. As detailed in Kita et al., "Continuous Speech Recognition Using HMM Phoneme Recognition and Extended LR Parsing", Transactions of the Information Processing Society of Japan, Vol. 31, No. 3, pp. 472-480 (1990), this can be done by driving the syntax analysis unit 5 while predicting, from the operation table, the terminal symbol to be analyzed next.
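To make the idea of listing every acceptable sentence as a terminal symbol string concrete, here is a small sketch. It does not use an LR table: it swaps in plain leftmost expansion of a context-free grammar, and it assumes the grammar is finite (non-recursive). The toy grammar and the name `generate` are hypothetical; Table 1 is not reproduced in this text. Uppercase symbols are treated as non-terminals and lowercase letters as terminals.

```python
def generate(grammar, start):
    """Enumerate all terminal strings derivable from `start`.

    `grammar` maps each non-terminal to a list of right-hand sides, each a
    list of symbols. Assumes a finite, non-recursive grammar (sketch only).
    """
    results = []
    stack = [(start,)]
    while stack:
        form = stack.pop()
        # Find the leftmost non-terminal in the current sentential form.
        idx = next((i for i, s in enumerate(form) if s in grammar), None)
        if idx is None:
            results.append("".join(form))  # only terminals remain
            continue
        for rhs in grammar[form[idx]]:
            stack.append(form[:idx] + tuple(rhs) + form[idx + 1:])
    return sorted(results)

# Hypothetical toy grammar; lowercase letters are phoneme terminals.
TOY_GRAMMAR = {
    "S": [["N", "P", "V"]],
    "N": [list("kore"), list("sore")],
    "P": [["o"]],
    "V": [list("kure")],
}
```

Calling `generate(TOY_GRAMMAR, "S")` lists the two sentences this toy grammar accepts. A predictive LR parser reaches the same kind of list by following the ACTION and GOTO entries instead of expanding rules directly.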
[0028]
From the sentences obtained, the symbol processing unit 12 selects those required for the recognition task. The required sentences may be selected automatically by symbol string matching, or a person may select them by editing the output. Each selected sentence is then converted into an environment-dependent terminal symbol string according to the order of its terminal symbols. For example, when the sentence /koreokure/ has been obtained, it is converted to /-ko kor ore reo eok oku kur ure re-/ by taking into account one preceding and one following symbol for each target symbol. In each unit the center symbol is the target terminal symbol, and the symbols to its left and right denote its environment. In this example, /kor/ indicates that the terminal symbol /o/ is preceded by the terminal symbol /k/ and followed by the symbol /r/. /-/ marks the beginning or end of the symbol string. In this embodiment one preceding and one following symbol are used, but any number may be chosen. Next, an environment-dependent grammar is created from the converted terminal symbols and stored in the environment-dependent grammar unit. Table 2 shows the resulting grammar. The LR parsing table obtained from this grammar is stored in the environment-dependent operation table unit 4.
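The conversion to environment-dependent terminal symbols can be sketched in a few lines. The function name `to_context_dependent` and its `width` parameter are hypothetical; `width=1` corresponds to the one preceding and one following symbol used in this embodiment, and each input character is assumed to be one terminal symbol.

```python
def to_context_dependent(symbols, width=1, boundary="-"):
    """Attach `width` preceding and following symbols to each terminal.

    The boundary mark pads the start and end of the symbol string.
    """
    padded = [boundary] * width + list(symbols) + [boundary] * width
    return [
        "".join(padded[i - width:i + width + 1])
        for i in range(width, width + len(symbols))
    ]

units = to_context_dependent("koreokure")
# units == ['-ko', 'kor', 'ore', 'reo', 'eok', 'oku', 'kur', 'ure', 're-']
```

Each resulting unit then serves as one terminal symbol of the environment-dependent grammar.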
[0029]
[Table 2]
[0030]
Next, the switch 9 is turned to the B side to perform continuous speech recognition. The input speech is subjected to LPC analysis to extract 10-dimensional cepstrum parameters. The analysis conditions are a sampling frequency of 8 kHz, windowing with a Hamming window (window width 16 ms), an LPC analysis order of 14, and a shift width of 5 ms per frame. The analysis method is not limited to the above, and any acoustic analysis method such as frequency analysis, as described in detail in Niimi, "Speech Recognition", Kyoritsu Shuppan (1979), may be used.
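As a rough illustration of the framing implied by these analysis conditions, the following Python sketch cuts a signal into 16 ms Hamming-windowed frames with a 5 ms shift at 8 kHz (128 samples per frame, 40 samples per shift). The function name `frame_signal` and its parameters are assumptions; the LPC analysis and cepstrum computation themselves are not shown.

```python
import math


def frame_signal(samples, rate_hz=8000, window_ms=16, shift_ms=5):
    """Split `samples` into Hamming-windowed frames (hypothetical helper)."""
    win = rate_hz * window_ms // 1000   # samples per window
    hop = rate_hz * shift_ms // 1000    # samples per frame shift
    hamming = [
        0.54 - 0.46 * math.cos(2 * math.pi * n / (win - 1)) for n in range(win)
    ]
    frames = []
    for start in range(0, len(samples) - win + 1, hop):
        segment = samples[start:start + win]
        frames.append([s * w for s, w in zip(segment, hamming)])
    return frames


if __name__ == "__main__":
    one_second = [1.0] * 8000
    print(len(frame_signal(one_second)))
```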
[0031]
The syntax analysis unit 5 determines which phoneme to match from the LR analysis table. Each time the analysis proceeds, a duration-controlled state transition (DST) model, as detailed in Muroi et al., "Word Speech Recognition Using Duration Control State Transition Model", J72-D-II, 11, pp. 1769-1777 (1989-11), is connected. In the present embodiment, a DST model that takes the phoneme environment into account is used, and the collation unit compares the DST model with the feature quantities of the input speech. The syntax analysis unit 5 records the parsed phrase structures on the chart. Finally, the candidate having the lowest score among the candidates for which all analysis has been completed is obtained according to Equation (5) and output as the recognition result.
[0032]
(Equation 3)
[0033]
Here, r is the expansion/contraction function obtained by dynamic programming. This function associates the input feature quantity of the m-th frame with the r(m)-th state of the DST model. l_{r(m)} denotes the frame length of the r(m)-th partial pattern when the input speech pattern is divided into N(s) partial patterns. The first term on the right-hand side represents the distance related to the feature quantities obtained by acoustic analysis, and the second term represents the distance related to the duration of the partial patterns. a is a positive number that determines how strongly the duration distance is reflected in the overall distance. In this embodiment, a is set to about 0.1. By using the DST model described above, matching can take into account not only the feature quantities in the acoustic space but also the temporal structure of the speech pattern (in particular, the time length of each partial pattern).
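The following Python sketch illustrates how an acoustic distance and a duration distance might be combined with the weight a. The exact form of Equation (5) is not reproduced in this text, so the Euclidean frame distance, the absolute-difference duration distance, and all names (`dst_score`, `means`, `durations`) are assumptions for illustration only. It requires Python 3.8 or later for `math.dist`.

```python
import math


def dst_score(frames, alignment, means, durations, a=0.1):
    """Illustrative score: acoustic distance plus a weighted duration distance.

    frames:    list of feature vectors x_m
    alignment: alignment[m] is the state index r(m) assigned to frame m
    means:     one reference vector per state (assumed model parameter)
    durations: one reference duration in frames per state (assumed)
    """
    acoustic = sum(math.dist(x, means[k]) for x, k in zip(frames, alignment))
    lengths = [alignment.count(k) for k in range(len(means))]
    duration = sum(abs(l - d) for l, d in zip(lengths, durations))
    return acoustic + a * duration


if __name__ == "__main__":
    frames = [[0.0, 0.0], [0.0, 0.0], [3.0, 4.0]]
    print(dst_score(frames, [0, 0, 1], [[0.0, 0.0], [0.0, 0.0]], [1, 1]))
```

With a small a such as 0.1, the acoustic term dominates and the duration term acts as a mild penalty on alignments whose segment lengths deviate from the model.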
[0034]
FIG. 2 is a schematic block diagram for explaining another embodiment of the present invention, in which 13 is an utterance list and 14 is a DST model training unit; parts that perform the same functions as in the embodiment shown in FIG. 1 are given the same reference numerals as in FIG. 1. The embodiment shown in FIG. 2 enables training of a phoneme environment-dependent DST model using the environment-dependent grammar and the operation table obtained by the embodiment shown in FIG. 1. First, the switch 9 is turned to the A side, and the phoneme environment-dependent DST model 6 is trained. Speech corresponding to the utterance list 13 is input, and LPC cepstrum parameters are extracted. Next, in accordance with the utterance list 13, the input is collated with the environment-dependent DST model sequence by dynamic programming, and the expansion/contraction function θ that is optimal under the criterion of Expression (4) is obtained. The function obtained is denoted r.
The DST model training unit 14 updates the average value and the duration of the model according to the following equations, where N_{r(m)} is the last frame number of the input pattern associated with the r(m)-th state of the DST model.
[0035]
(Equation 4)
[0036]
Here, N_{r(0)} = 0.
After the above training, the switch 9 is turned to the B side, and continuous speech recognition is performed. The recognition process is the same as in the embodiment of FIG. 1.
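The update of the average value and the duration from an alignment can be sketched as follows. Since Equation 4 is not reproduced in this text, the sketch assumes the simplest reading: the new average is the mean of the frames aligned to each state, and the new duration is the difference between consecutive last-frame numbers, with the last-frame number before the first state taken as 0. The function name `reestimate` and the argument layout are hypothetical.

```python
def reestimate(frames, end_frames):
    """Re-estimate per-state means and durations from an alignment.

    frames:     list of feature vectors (frame 1 is frames[0])
    end_frames: end_frames[k] is the 1-based last frame aligned to state k;
                the value before the first state is taken as 0
    Assumes every state is aligned to at least one frame.
    """
    means, durations = [], []
    prev = 0
    for n_k in end_frames:
        segment = frames[prev:n_k]
        dim = len(segment[0])
        means.append([sum(x[d] for x in segment) / len(segment) for d in range(dim)])
        durations.append(n_k - prev)
        prev = n_k
    return means, durations


if __name__ == "__main__":
    print(reestimate([[1.0], [3.0], [10.0]], [2, 3]))
```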
[0037]
FIG. 3 is a schematic block diagram for explaining still another embodiment of the present invention, in which 15 is a result display unit and 16 is a selection unit; parts that perform the same functions as in the embodiment shown in FIG. 1 or FIG. 2 are given the same reference numerals as in FIG. 1 or FIG. 2. In the embodiment shown in FIG. 3, a phoneme environment-dependent DST model can be trained using input speech uttered for recognition, so that recognition and DST model training can be performed at the same time. First, the input speech is recognized by the same process as in the embodiment of FIG. 1. When the correct answer is among the candidates shown on the display unit 15, it can be selected with the selection unit 16, such as a keyboard. This selection determines the DST model sequence to be trained for the input speech pattern. Equations (4), (8), and (9) are applied to this DST model sequence, and the training unit 14 updates the average value and the duration of the DST models. The training process is the same as in the embodiment of FIG. 2.
[0038]
In this embodiment, since the Euclidean distance shown in Expression (7) is used as the distance measure at the time of collation, the display unit shows the candidates in ascending order of score. If the scores of the recognition candidates are given as likelihoods or the like, the candidates are displayed in descending order of score. Either criterion may be used in the present invention.
[0039]
In the DST model training unit 14 of FIG. 2 or FIG. 3, when a model trained in the past exists for the same class, the DST model is trained by one of the following two methods. The first is to superimpose the previously trained model W_k1 and the newly trained model W_k2 to create W_k3:
W_k3 = b W_k1 + (1 - b) W_k2 … (10)
Here, b is the mixture ratio between the past model and the new model (0 ≤ b ≤ 1). As special cases under Equation (10), b = 1 leaves the past model unchanged, which corresponds to no training, and b = 0 corresponds to replacing the past model with the new model.
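A minimal Python sketch of the superimposition in Equation (10), taken as written (the weight b on the past model W_k1 and 1 - b on the new model W_k2), is given below. Representing a model as a flat list of parameter values and the function name `superimpose` are assumptions for illustration.

```python
def superimpose(past, new, b):
    """Mix a past model and a newly trained model as in Equation (10).

    past, new: parameter vectors of equal length (assumed representation)
    b:         mixture ratio applied to the past model
    """
    return [b * p + (1 - b) * n for p, n in zip(past, new)]


if __name__ == "__main__":
    print(superimpose([0.0, 2.0], [4.0, 6.0], 0.5))
```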
Another method is to store both the old model and the new model. That is, each time a training voice is input, a new DST model is created. At the time of recognition, a DST model sequence closest to the input speech pattern may be output as a recognition result.
[0040]
Further, a method combining the above two training methods is also possible. In the second method described above, recognition accuracy can be improved by holding a plurality of models for the same class, but recognition takes longer because the number of combinations at the time of matching increases. Therefore, after a predetermined number of models have been created, a model to be superimposed is selected, and the selected model and the newly trained model are superimposed according to Equation (10). If s denotes a sequence that includes the DST model to be superimposed, the DST model sequence to be superimposed satisfies the following:
[0041]
(Equation 5)
[0042]
With this method, the relationship between recognition time and recognition accuracy can be adjusted freely, and the performance desired by the user can be set.
To train the environment-dependent DST model described above, an environment-independent DST model can be used as the initial model. For example, consider training the DST model /aka/, that is, /k/ whose preceding and succeeding phonemes are both /a/. In this case, training is started using the phoneme environment-independent DST model of /k/ as the initial model. Giving a good initial value from the phoneme environment-independent DST model makes it possible to design a highly accurate model.
[0043]
Next, the standard pattern training method will be described.
FIG. 4 is a schematic block diagram for explaining an embodiment of the standard pattern training method, in which 21 is a segmentation unit, 22 is a feature pattern creation unit, 23 is a collation unit, 24 is a cumulative score storage unit, 25 is a comparison unit, 26 is an LR table unit, and 27 is a predictive chart parsing unit. First, switch W₁ is set to the A side, and the standard patterns are trained. In FIG. 4, the predictive chart parsing unit 27, which uses the LR table unit 26, is driven in order to create a state transition model for the input speech. The LR table unit 26 stores an operation table obtained from a grammar such as that shown in Table 3. Among the symbols in Table 3, those starting with '*' are terminal symbols and the others are non-terminal symbols. The description has been simplified for the sake of the example, but more complex descriptions are possible as long as they are written as a context-free grammar.
[0044]
[Table 3]
[0045]
The contents of the LR table are the same as the LR analysis table described in detail in A. V. Aho et al., "Compilers: Principles, Techniques, and Tools", Addison-Wesley (1986), and consist of an ACTION section and a GOTO section. There are four types of operations in this table: state transition (shift), application of a grammar rule (reduce), acceptance, and error.
[0046]
Using the LR table obtained from Table 3, the predictive chart parsing unit 27 takes terminal symbols one by one from the beginning, applies the algorithms shown in Tables 4 to 6, and records the results as the chart shown in Table 7. All phrase structures are recorded on the chart until the accept operation is finally performed. Note that '*', the terminal symbol representing the end, is placed at the last position of the predicted terminal symbol sequence.
[0047]
[Table 4]
[0048]
[Table 5]
[0049]
[Table 6]
[0050]
Table 7 shows the analysis result for "1:00 on January 1" as an example. In addition, "2:00 on January 1", "1:00 on January 2", and so on are generated in sequence based on the grammar. Training of the standard patterns can be realized by creating state transition models corresponding to these symbol sequences.
[0051]
[Table 7]
[0052]
Through the operation of the predictive chart parsing unit 27 described above, the index numbers of the character sequences constituting the terminal symbols are sent in sequence to the standard pattern storage unit 28. Since the standard patterns are stored in units of characters, the connecting unit 29 refers to the index numbers and connects the standard patterns in units of terminal symbols, and the state transition model unit 30 creates a state transition model. For example, if the standard patterns are stored in phoneme units, the terminal symbol "January" is composed of the standard patterns /i, ch, i, g, a, t, u/. The state transition model may be expressed by a probabilistic model such as an HMM, or may be expressed deterministically by a word graph or a finite state network.
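The connection of stored standard patterns into a state sequence for one terminal symbol can be sketched as follows. The dictionary `pattern_store` (phoneme label to list of state vectors) and the function name `build_state_sequence` are hypothetical stand-ins for the standard pattern storage unit 28 and the connecting unit 29; the vectors are placeholder values, not trained patterns.

```python
def build_state_sequence(terminal_phonemes, pattern_store):
    """Concatenate the stored patterns of each phoneme in order.

    Returns a list of (phoneme, state_vector) pairs forming a simple
    left-to-right state sequence for the terminal symbol.
    """
    states = []
    for phoneme in terminal_phonemes:
        states.extend((phoneme, vec) for vec in pattern_store[phoneme])
    return states


if __name__ == "__main__":
    # Placeholder one-dimensional patterns for illustration only.
    store = {"i": [[0.0]], "ch": [[1.0], [2.0]]}
    print(build_state_sequence(["i", "ch", "i"], store))
```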
[0053]
Meanwhile, the input speech is taken in by the segmentation unit 21 over a predetermined time and converted into a feature pattern by an analysis method such as those described in detail in Niimi, "Speech Recognition", Kyoritsu Shuppan (1979). Here, 10-dimensional cepstrum parameters are extracted and used as the feature pattern. The analysis conditions are: sampling frequency 16 kHz, high-frequency emphasis by first-order difference, 256-point Hamming window, update cycle 10 ms, and LPC analysis order 20. The analysis method is not limited to the above, and any acoustic analysis method such as frequency analysis may be used. The input speech includes the keywords corresponding to the terminal symbols generated by the predictive chart parsing unit.
[0054]
Next, the state transition model created as described above and the feature pattern of the input speech are collated by the collation unit 23. The state transition model corresponding to the s-th terminal symbol in the terminal symbol string generated by the predictive chart parsing unit 27 is denoted sW (s = 1, ..., S). sW is composed of L standard patterns.
[0055]
(Equation 6)
[0056]
Here, p_{q(l)} is the index of the standard feature pattern corresponding to the l-th (1 ≦ l ≦ L) position in the sequence, and there are V standard patterns in total. Taking Table 7 as an example, the number of terminal symbols in the generated sentence is 3, so S = 3. Further, in this embodiment each standard pattern corresponds to a phoneme, so the number of standard patterns equals the total number of phonemes.
Similarly, the input feature pattern X is represented as follows.
X = {x_1, ..., x_m, ..., x_M} … (13)
In the embodiment, X is a feature pattern including S keywords in the input voice. The matching unit obtains a matching score D between the input speech feature pattern and the state transition model by the following equation.
[0057]
(Equation 7)
[0058]
Here, m_s1 and m_s2 are the endpoints of the section of the speech feature pattern extracted for the s-th keyword, and represent its start point and end point, respectively. The matching function r represents the matching path and can be obtained by well-known dynamic programming or the like. The matching function associates the input feature quantity of the m-th frame with the r(m)-th standard pattern constituting the keyword. The score D(x_m, p_{r(m)}) between the standard pattern and the speech feature pattern is obtained by subtracting the well-known Euclidean distance from a positive threshold value. Since the partial pattern of the speech feature pattern corresponding to each standard pattern is obtained from the r given by Equation (15), the standard pattern is trained using this partial pattern. The training is performed by calculating the arithmetic mean of the feature quantities of the standard pattern and the feature quantities of the partial pattern, and registering the result as the new standard pattern.
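The score and the arithmetic-mean update described above can be sketched in Python as follows. The threshold value, the function names `frame_score` and `update_pattern`, and the choice to average the stored pattern together with every aligned frame with equal weight are assumptions; the text does not specify the weighting. Python 3.8 or later is assumed for `math.dist`.

```python
import math


def frame_score(x, p, threshold):
    """Score that grows as the Euclidean distance between x and p shrinks."""
    return threshold - math.dist(x, p)


def update_pattern(pattern, aligned_frames):
    """Arithmetic mean of the stored pattern and the frames aligned to it."""
    vectors = [pattern] + list(aligned_frames)
    dim = len(pattern)
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


if __name__ == "__main__":
    print(frame_score([0.0, 0.0], [3.0, 4.0], threshold=10.0))
    print(update_pattern([1.0, 2.0], [[3.0, 4.0]]))
```

Because the score is a positive threshold minus a distance, summing it over a matched section rewards longer good matches, which is why the criterion is a maximization rather than a minimization.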
[0059]
The training method is not limited to the above. If the state transition model is represented by an HMM, D(x_m, p_{r(m)}) may be taken to be the likelihood. The HMM can then be trained by the Baum-Welch estimation method described in detail in Nakagawa, "Speech Recognition by Stochastic Model", and elsewhere. Equation (15) is based on maximization, but the invention is not limited to this; its essence does not change even if training is based on a minimization criterion using the plain Euclidean distance.
[0060]
As explained above, since the partial patterns of the input speech are collated in units of keywords, the standard patterns can be trained even if pauses or unnecessary words are inserted between keywords. Training is completed by repeating the above processing until the predictive chart parsing unit 27 generates no more terminal symbol sequences. Next, setting switch W₁ to B enables keyword recognition. At recognition, the predictive chart parsing unit 27 and the LR table unit 26 operate so as to predict keywords. Each time the analysis proceeds, standard patterns are connected to create a state transition model of the predicted keyword. The collation unit 23 collates the state transition model with the feature quantities of the input speech. The scores of the predicted keyword candidates are stored in the cumulative score storage unit 24, and finally the candidate with the highest score among those for which all analysis has been completed is obtained according to Equation (15) and output as the recognition result.
[0061]
In the embodiment shown in FIG. 4, the collation time can be shortened by introducing the following condition, Expression (16).
[0062]
(Equation 8)
[0063]
Expression (16) indicates that, once the s-th keyword has been detected in the input feature pattern, matching against the state transition model of the next, (s+1)-th keyword is started from a frame that falls within that section.
[0064]
When a plurality of keywords are input, collation can be sped up by having the segmentation unit cut out only the portion in which speech is present. FIG. 8 shows a speech waveform including two keywords, /month/ and /day/. As can be seen from FIG. 8, there is a slight pause between /month/ and /day/. In such a case, the segmentation unit 21 obtains the section between A and B in FIG. 8 using a speech extraction algorithm such as that described in Niimi, "Speech Recognition", Kyoritsu Shuppan (1979). Thereafter, only the extracted section between A and B is compared with the state transition model, so the matching section can be shortened.
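As a simple illustration of cutting out only the portion where speech is present, the following Python sketch returns the first and last frame whose energy exceeds a threshold. This is a plain energy-threshold rule chosen for illustration, not the speech extraction algorithm of the cited reference; the function name `speech_span` and the threshold are assumptions.

```python
def speech_span(frame_energies, threshold):
    """Return (first, last) indices of frames above `threshold`, or None.

    Short pauses between keywords remain inside the returned span,
    since only the outer endpoints are located.
    """
    active = [i for i, e in enumerate(frame_energies) if e > threshold]
    if not active:
        return None
    return active[0], active[-1]


if __name__ == "__main__":
    print(speech_span([0, 0, 5, 6, 0, 7, 0], threshold=1))
```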
[0065]
FIG. 5 is a schematic block diagram showing another embodiment of the present invention, in which parts having the same functions as in the embodiment shown in FIG. 4 are denoted by the same reference numerals as in FIG. 4. In the embodiment shown in FIG. 5, a plurality of pairs of an LR table unit (26A, 26B, 26C) and a predictive chart parsing unit (27A, 27B, 27C) are provided. If the standard patterns are trained from the initial stage using speech that includes a plurality of keywords, the patterns may become unstable. To avoid this, only a single keyword is trained from the input speech at the initial stage, and the number of keywords included in the speech is then increased gradually. This not only stabilizes the standard patterns but also allows the variety of speaking styles in the input speech to be trained together. In this embodiment, a sentence including one keyword is generated using the LR table unit 26A and the predictive chart parsing unit 27A. Similarly, the remaining two pairs generate a sentence containing two keywords and a sentence containing three keywords, respectively. During training, that is, when switch W₁ is set to A, switch W₂ is first set to C, and the standard patterns are trained from input speech including one keyword by the same procedure as in the above embodiment. Next, by switching W₂ to D and then to E, the number of keywords included in the speech is increased and the standard patterns are trained further. At the time of keyword recognition, switch W₁ is set to B and switch W₂ is connected to all of C, D, and E. Since all predictable keyword candidates can then be generated, the candidate with the highest score among them can be output as the recognition result.
[0066]
FIG. 6 is a schematic block diagram for explaining another embodiment. The embodiment shown in FIG. 6 is obtained by adding the display device 32 to the embodiment shown in FIG. 4. During training, switch W₁ is set to the A side and switch W₃ is set to C. The display device 32 shows a sentence including the keywords generated by the predictive chart parsing unit 27, for example "January 1". The speaker inputs speech while watching this display. The subsequent processing is performed in the same manner as described for the embodiment of FIG. 4, which completes the training of the standard patterns. At recognition, switch W₁ is set to the B side and switch W₃ to C.
[0067]
FIG. 7 is a schematic block diagram for explaining still another embodiment. The embodiment shown in FIG. 7 is obtained by adding the reading conversion unit 33 to the embodiment of FIG. 6. To perform the reading conversion, the grammar used to create the LR table is changed as shown in Table 8. In Table 8, the readings of the dates that serve as keywords are added as rewriting rules. During training, the reading conversion unit 33 causes the display device to also show the right-hand side of the rewriting rule that includes the terminal symbol. As a result, each date keyword can be displayed together with its reading in parentheses, which encourages the speaker to use the intended reading of the day rather than an alternative reading of the same characters.
[0068]
[Table 8]
[0069]
[Effect of the Invention]
As is apparent from the above description, according to the present invention, a context-free grammar that takes the preceding and succeeding environment of each class into account can be generated adaptively according to the task. Further, environment-dependent phoneme models can be connected without changing the conventional LR chart parsing unit, which has a simple mechanism and a small storage requirement. In addition, duration-controlled state transition models that take the environment of each class into account can be trained adaptively. As a result, continuous speech recognition with high-precision, high-speed matching can be realized.
The invention according to claim 1 comprises means for extracting feature quantities of input speech, means for modeling a class of speech by connecting, in the time direction, patterns each representing a part of the class to form a state transition model, a syntax analysis unit that analyzes a phonetic symbol string by a grammar, and means for collating an input speech pattern while controlling the collation duration of each state in the state transition model. In a continuous speech recognition method that obtains a recognition result by comparing the state transition model with the feature pattern of the input speech, the phonetic symbol string accepted by the syntax analysis unit is used to generate a terminal symbol sequence including the preceding and succeeding environment of the class, and a grammar is created from it, so that a state transition model including the preceding and succeeding environment of the class can be collated with unknown input speech.
The invention according to claim 2 is the invention according to claim 1 in which speech uttered on the basis of a phonetic symbol string accepted by the syntax analysis unit is used as input, so that state transition models including the preceding and succeeding environment of the classes corresponding to the input can be connected and trained.
The invention according to claim 3 is the invention according to claim 1 in which input speech containing a phonetic symbol string accepted by the syntax analysis unit is collated with state transition models including the preceding and succeeding environment of the classes, a predetermined number of recognition results are displayed in order of plausibility, and a correct candidate is selected, so that the correct state transition models can be connected and trained.
The invention according to claim 4 relates to the training, in claim 2 or 3, of the state transition model including the preceding and succeeding environment of the class; if the model has been trained in the past, it can be trained by superimposition with the past state transition model.
The invention according to claim 5 relates to the training, in claim 2 or 3, of the state transition model including the preceding and succeeding environment of the class; if the model has been trained in the past, a new state transition model including the preceding and succeeding environment of the class is generated and trained, and the corresponding past state transition model is not trained. The relationship between recognition time and recognition accuracy can thus be adjusted freely, and the performance desired by the user can be set.
The invention according to claim 6 relates to the training, in claim 4, of the state transition model including the preceding and succeeding environment of the class; the model most similar to the input speech can be updated from among the state transition models stored according to claim 5.
The invention according to claim 7 relates to the training, in any one of claims 2 to 6, of the state transition model including the preceding and succeeding environment of the class; an environment-independent state transition model can be connected as the initial model.
[Brief description of the drawings]
FIG. 1 is a block diagram showing an embodiment of continuous speech recognition according to one embodiment of the present invention.
FIG. 2 is a schematic block diagram for explaining another embodiment of the present invention.
FIG. 3 is a schematic block diagram for explaining another embodiment of the present invention.
FIG. 4 is a block diagram showing an embodiment of standard pattern training.
FIG. 5 is a schematic block diagram for explaining another embodiment.
FIG. 6 is a schematic block diagram for explaining another embodiment.
FIG. 7 is a schematic block diagram for explaining still another embodiment.
FIG. 8 is a diagram showing a speech waveform containing two keywords, /Jan/ and /Day/.
[Explanation of symbols]
1 ... LPC analysis unit, 2 ... collation unit, 3 ... environment-dependent grammar unit, 4 ... environment-dependent operation table unit, 5 ... syntax analysis unit, 6 ... environment-dependent DST model, 7 ... pattern connection unit, 8 ... judgment unit, 9 ... switch, 10 ... environment-independent grammar unit, 11 ... environment-independent operation table unit, 12 ... symbol processing unit, 13 ... utterance list, 14 ... DST model training unit, 15 ... result display unit, 16 ... selection unit, 21 ... segmentation unit, 22 ... feature pattern creation unit, 23 ... collation unit, 24 ... cumulative score storage unit, 25 ... comparison unit, 26 ... LR table unit, 27 ... predictive chart parsing unit, 28 ... standard pattern storage unit, 29 ... connecting unit, 30 ... state transition model unit, 31 ... training unit, 32 ... display device, 33 ... reading conversion unit.

Claims

1. A continuous speech recognition method comprising: means for extracting feature quantities of input speech; means for modeling a class of speech by connecting, in the time direction, patterns each representing a part of the class to form a state transition model; a syntax analysis unit that analyzes a phonetic symbol string by a grammar; and means for collating an input speech pattern while controlling the collation duration of each state in the state transition model, a recognition result being obtained by comparing the state transition model with the feature pattern of the input speech, characterized in that the phonetic symbol string accepted by the syntax analysis unit is used to generate a terminal symbol sequence including the preceding and succeeding environment of the class, a grammar is created from it, and a state transition model including the preceding and succeeding environment of the class is collated with unknown input speech.

2. The continuous speech recognition method according to claim 1, characterized in that speech uttered on the basis of a phonetic symbol string accepted by the syntax analysis unit is used as input, and state transition models including the preceding and succeeding environment of the classes corresponding to the input are connected and trained.

3. The continuous speech recognition method according to claim 1, characterized in that input speech containing a phonetic symbol string accepted by the syntax analysis unit is collated with state transition models including the preceding and succeeding environment of the classes, a predetermined number of recognition results are displayed in order of plausibility, a correct candidate is selected, and the correct state transition models are connected and trained.

4. The continuous speech recognition method according to claim 2 or 3, characterized in that, in training of the state transition model including the preceding and succeeding environment of the class, if the model has been trained in the past, it is trained by superimposition with the past state transition model.

5. The continuous speech recognition method according to claim 2 or 3, characterized in that, in training of the state transition model including the preceding and succeeding environment of the class, if the model has been trained in the past, a new state transition model including the preceding and succeeding environment of the class is generated and trained, and the corresponding past state transition model is not trained.

6. The continuous speech recognition method according to claim 4, characterized in that, in training of the state transition model including the preceding and succeeding environment of the class, the model most similar to the input speech is updated from among the state transition models stored according to claim 5.

7. The continuous speech recognition method according to claim 2, wherein an environment-independent state transition model is connected as an initial model for training of the state transition model including the preceding and succeeding environment of the class.