JP2004258531A

JP2004258531A - Voice recognition error correction method, system, and program

Info

Publication number: JP2004258531A
Application number: JP2003051645A
Authority: JP
Inventors: Takaaki Hasegawa (長谷川 隆明); Yoshihiko Hayashi (林 良彦)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2003-02-27
Filing date: 2003-02-27
Publication date: 2004-09-16
Anticipated expiration: 2023-02-27
Also published as: JP4171323B2

Abstract

PROBLEM TO BE SOLVED: To improve the accuracy of speech recognition by correcting, among the recognition errors caused by the out-of-vocabulary (OOV) problem, those that relate to proper nouns.
SOLUTION: The speech recognition section 200 outputs the speech recognition result of a voice document together with its reliability. The speech recognition error correction section 300 corrects errors according to correction conditions determined in advance. The proper noun section identification section 400 identifies the proper noun sections contained in an input word string. The speech recognition error correction candidate extraction section 500 extracts, from related information, proper noun sections that are candidates for speech recognition error correction. The related information retrieval section 600 searches an external database for related documents according to related information search conditions determined in advance.
COPYRIGHT: (C)2004,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、録画・録音された音声ドキュメントに対して音声認識を適用することにより文字化を行う音声認識装置に関する。
【０００２】
【従来の技術】
増大するマルチメディアコンテンツの高度な利用を目的として、音声認識などのメディア認識技術の研究開発が行われている。音声認識装置はコンテンツ中の発声部分を文字化する装置であり、文字化が行われた後はさまざまな処理が可能となることから、重要な要素として位置づけられている。現在の音声認識装置においては、その性能を引き出すために、認識対象に対する適応が不可欠である。これには、認識辞書への単語の登録や、発話されやすい単語の組み合わせを言語モデルとして組み込むことが含まれる。しかしながら、これらをむやみに増やすことは、処理速度の低下だけではなく、認識精度の低下を招く。よって、認識辞書へ登録すべき単語は、認識対象と同等の性質を持つと思われる文書集合などから慎重に選択する必要がある。
【０００３】
【非特許文献１】
“ＤｙｎａｍｉｃＰｒｏｇｒａｍｍｉｎｇＡｌｇｏｒｉｔｈｍ（ＤＰＡ）ｆｏｒｅｄｉｔ−Ｄｉｓｔａｎｃｅ”，
ｈｔｔｐ：／／ｗｗｗ．ｃｓｓｅ．ｍｏｎａｓｈ．ｅｄｕ．ａｕ／−ｌｌｏｙｄ／ｔｉｌｄｅＡｌｇＤＳ／Ｄｙｎａｍｉｃ／Ｅｄｉｔ／
【０００４】
【発明が解決しようとする課題】
上記に述べたように、認識辞書へ登録すべき単語は慎重に選択する必要があり、実際の認識対象に含まれうる単語を１００％カバーすることは不可能である。特に、新語や、人名、地名、製品名といった固有名詞については数多くの単語が出現する可能性があり、認識対象外となること（ＯｕｔｏｆＶｏｃａｂｕｌａｒｙ問題：ＯＯＶ問題）が起こる。現状の音声認識装置においては、認識辞書に登録されていない単語は絶対に認識されることはないため、認識精度の低下につながる。
【０００５】
本発明の目的は、このようなＯＯＶ問題に起因する認識誤りのうち、固有名詞に関連する認識誤りを訂正することにより、音声認識の精度向上を図った認識誤り訂正方法、装置、およびプログラムを提供することである。
【０００６】
【課題を解決するための手段】
上記の目的を達成するために、本発明の認識誤り訂正装置は、
音声認識結果を信頼度とともに出力する音声認識手段と、
前記音声認識結果に対し、そこに含まれる所定のクラスに属する単語が発声された区間の同定を行う手段と、
信頼度スコアと所定のクラスに属する単語に関する複合条件を記述した音声認識誤り訂正条件が格納されている音声認識誤り訂正条件テーブルと、
前記同定結果が付加された音声認識結果と、前記音声認識誤り訂正条件テーブルの音声認識誤り訂正条件を照合し、音声認識の誤りが含まれている可能性があり、かつそれが訂正されうる区間を誤り訂正対象の区間として抽出する手段と、
関連情報検索キー単語抽出条件が格納されている関連情報検索キー単語抽出条件テーブルと、
前記関連情報検索キー単語抽出条件テーブルに格納された関連情報検索キー単語抽出条件にしたがって、前記同定結果が付与された音声認識結果から、関連文書検索を行うための検索条件となる単語集合を抽出する手段と、
単語集合を検索条件として、前記音声認識誤り訂正候補抽出条件にしたがって、関連文書を検索し、該関連文書中に含まれる所定のクラスに属する単語の区間を音声認識誤り候補として抽出する手段と、
前記誤り訂正対象の区間と前記誤り訂正候補群のマッチングを行い、前記誤り訂正対象の区間の誤り訂正を行う手段を有する。
【０００７】
通常の音声認識装置において認識精度低下の原因のひとつであるＯＯＶ問題（ＯｕｔＯｆＶｏｃａｂｕｌａｒｙ問題）のうち、例えば固有名詞に関わる認識誤りを訂正することにより、認識精度を向上させることが可能となる。
【０００８】
【発明の実施の形態】
次に、本発明の実施の形態について図面を参照して説明する。
【０００９】
図１に示すように、本発明の一実施形態の認識誤り訂正装置は入力部１００と音声認識部２００と音声認識誤り訂正部３００と音声認識誤り訂正条件テーブル３１０と固有名詞区間同定部４００と音声認識誤り訂正候補抽出部５００と関連情報検索キー単語抽出条件テーブル５１０と関連情報検索部６００と音声認識誤り訂正候補抽出条件テーブル６１０と出力部７００から構成される。
【００１０】
入力部１００は音声ドキュメントを入力する。音声認識部２００は入力された音声ドキュメントを音声認識し、その結果を信頼度とともに出力する。音声認識誤り訂正部３００は音声認識結果を入力し、あらかじめ定められた音声認識誤り訂正条件にしたがって、音声認識誤りを訂正する。音声認識誤り訂正条件テーブル３１０は音声認識誤り訂正条件を予め格納している。固有名詞区間同定部４００は入力された単語列から、そこに含まれる固有名詞区間の同定を行う。音声認識誤り訂正候補抽出部５００は音声認識誤り訂正候補である固有名詞区間を関連情報より抽出する。関連情報検索キー単語抽出条件テーブル５１０は関連情報検索キー単語抽出条件を格納している。関連情報検索部６００はあらかじめ定められた関連情報検索条件にしたがって外部データベースにおける関連文書を検索する。音声認識誤り訂正候補抽出条件テーブル６１０は、あらかじめ定められた音声認識誤り訂正候補抽出条件を格納している。
【００１１】
なお、各処理部１００、２００、３００、４００、５００、６００、７００はＣＰＵ等の制御手段で実行される。各テーブル３１０、５１０、６１０、は記憶装置に記憶される。また、各処理部からの出力を一時的に格納する記憶装置（不図示）も設けられている。
【００１２】
以下、具体例を用いて、本実施形態の音声認識誤り訂正装置の動作を説明する。
【００１３】
図２は、入力部１００から入力され、音声認識部２００により文字化された音声認識結果の一部を示している。ここで実際の発声は、「ＩＴベンチャーの中谷製作所の田中祐市部長は、新プロジェクトのシリウス・ダッシュの概要を発表した。」であったとするが、音声認識の誤りのために、「ＩＴベンチャーのなかったり製作所の田中唯一部長は、新プロジェクトのシリウス・ダッシュの概要を発表した。」のように文字化されたものとする。
【００１４】
図２に例示する音声認識部２００の出力は、ＸＭＬ（ｅＸｔｅｎｓｉｂｌｅＭａｒｋｕｐＬａｎｇｕａｇｅ）言語によって構造化されている。すなわち、音声ドキュメントｄｏｃは、発声単位であるｐｈｒａｓｅの集合として表現される。各発話単位は、そこに含まれる単語ｗｏｒｄの集合として表現される。各発話単位、および、そこに含まれる各単語に対しては、その開始時刻と終了時刻がそれぞれｂｅｇｉｎ、ｅｎｄという属性を用いて記録される。さらに、各単語に対しては、音声認識により文字化された単語表記がＸＭＬ要素の内容部分に記録されるだけでなく、該単語の品詞情報、読み情報と音声認識の信頼度がそれぞれｐｏｓ、ｒｅａｄｉｎｇ、ｃｏｎｆという属性を用いて記録される。なお、図２に例示した音声認識結果は、本発明の説明に必要な概念を例示するためのものであり、ＸＭＬのタグ構造も含めて、このデータ形式に限る必要はない。また、音声認識部２００としては、このような情報を出力可能な任意の音声認識装置を適用することが可能である。
【００１５】
図３は、あらかじめ設定する音声認識誤り訂正条件を格納する音声認識誤り訂正条件テーブル３１０のエントリ例を示す。図３に示す例においては、音声認識の信頼度スコアと後述する固有名詞クラスに関する複合条件を記述している。条件の適用の仕方については後述する。なお、これらの条件は、音声認識部２００に適用する音声認識装置に応じて経験的に設定する。
【００１６】
音声認識誤り訂正部３００は、図２に示すような音声認識部２００からの出力を入力し、図３に示すような、あらかじめ定められた音声認識誤り訂正条件に基づいて、音声認識結果に含まれる音声認識誤りの訂正を行う。音声認識誤り訂正部３００は、まず、入力された音声認識結果を固有名詞区間同定部４００へと転送する。固有名詞区間同定部４００は、図２に示すような入力された音声認識結果に対し、固有名詞が発声されたと判断される区間を同定し、図４に例示するようなデータ形式を持つ処理結果を音声認識誤り訂正部３００へと返却する。
【００１７】
図４は、図２の音声認識結果に対する固有名詞区間同定部４００の処理結果を示す。図４のデータは、図２に例示する音声認識結果と同様のＸＭＬ形式であるが、固有名詞区間同定の結果がｗｏｒｄタグ中のｎｅ−ｃｌａｓｓという属性により付加されている。すなわち、ｎｅ−ｃｌａｓｓという属性の属性値がｎｉｌ以外のものは、固有名詞区間に含まれることを示しており、ｎｉｌ以外の属性値は、人名、地名といった固有名詞のクラスを示す。図４において、ｐｅｒｓｏｎという属性値は人名を、ｏｒｇａｎｉｚａｔｉｏｎという属性値は組織名を示すものとする。
【００１８】
なお、本発明においては、固有名詞区間同定部４００の具体的構成については規定しないが、図２に示すようなＸＭＬ形式による構造化されたデータ、文字列としてのテキストデータを処理可能な入力インタフェースを備えており、固有名詞区間同定の処理は、例えば、特許文献１に示される方法・装置により実現されることを想定する。また、図４に例示した固有名詞区間同定結果は、本発明の説明に必要な概念を例示するためのものであり、ＸＭＬのタグ構造も含めて、このデータ形式に限る必要はない。
【００１９】
図４に示すような固有名詞区間同定の結果が付加された音声認識結果は音声認識誤り訂正部３００へ返却される。
【００２０】
音声認識誤り訂正部３００は、固有名詞区間同定の結果が付加された入力された音声認識結果と、音声認識誤り訂正条件テーブル３１０に格納された音声認識誤り訂正条件を照合し、音声認識の誤りが含まれている可能性がある区間（低い音声認識信頼度を持つ単語を含む）、かつ、それが訂正されうる区間（何らかの固有名詞クラスを有する固有名詞区間であると同定されている）を抽出する。ここで、抽出される区間は、「固有名詞クラスが音声認識誤り訂正条件に指定された条件を満たす単語」からなる最長の部分単語列であって、「該部分単語列中に含まれる単語に対する認識信頼度の中で最小のものが音声認識誤り訂正条件に指定されている条件を満たす」ものとする。
【００２１】
図４の固有名詞区間同定の結果が付加された音声認識結果に対して、図３の音声認識誤り訂正条件を照合させると、音声認識誤りを訂正するべき区間として、次の二つを得る。ここで、／は単語境界を表し、カッコ内は該区間が持つ固有名詞クラスを示す。
・［訂正対象１］な／かったり／製作所（ｏｒｇａｎｉｚａｔｉｏｎ）
・［訂正対象２］田中／唯一／部長（ｐｅｒｓｏｎ）
音声認識誤り訂正部３００は、次に、図４の固有名詞区間同定の結果が付加された音声認識結果を音声認識誤り訂正候補抽出部５００へと送信する。
【００２２】
音声認識誤り訂正候補抽出部５００は、関連情報検索キー単語抽出条件テーブル５１０にあらかじめ格納された関連情報検索キー単語抽出条件にしたがって、図４に示すような固有名詞区間同定の結果が付加された音声認識結果から、関連情報検索部６００によって外部データベースから関連文書検索を行うための検索条件となる単語集合を抽出する。次に、これらの単語集合を検索条件として、あらかじめ音声認識誤り訂正候補抽出条件テーブル６１０に格納された音声認識誤り訂正候補抽出条件にしたがって、関連情報検索部６００により外部データベースから関連文書を検索し、音声認識誤り訂正候補の固有名詞区間を抽出する。ここで、検索結果の文書に含まれる固有名詞区間を同定するためには、固有名詞区間同定部４００を呼び出す。抽出された音声認識誤り訂正候補は、音声認識誤り訂正候補抽出部５００へと返却する。
【００２３】
図５は、関連文書検索キー単語抽出条件テーブル５１０におけるエントリ例を示す。図５に示す例は、品詞と認識信頼度に関する三通りの条件が設定されている。図５の例に示すように、音声認識の信頼度を考慮することにより、正しく認識されている可能性の高い単語を抽出する。また、名詞や動詞などの品詞を有する単語を抽出することにより、関連情報検索部６００によって、関連する文書を外部データベースから検索する際にキーワードとなりうる単語を抽出する。なお、これらの条件は、音声認識部２００に適用する音声認識装置に応じて経験的に設定する。
【００２４】
図５に示す関連情報検索キー単語抽出条件にしたがって、図４に示す固有名詞区間同定の結果が付加された音声認識結果から、関連情報検索部６００によって外部データベースから関連文書検索を行うための検索条件となる単語集合を抽出すると、以下のような単語集合が得られる。
・［検索条件単語集合］（ベンチャー、プロジェクト、シリウス、ダッシュ）
図６は、関連情報検索条件テーブル５１０におけるエントリ例を示す。図６に示すように関連情報検索条件は、３つのエントリからなる。第１のエントリは、関連情報検索部６００が検索対象とすべき外部データベースの識別子である。図６の例では、インターネット上に存在するニュース検索サイトｆｏｏ−ｎｅｗｓ．ｃｏｍが指定されている。第２のエントリは、音声認識誤り訂正候補を抽出する対象となる文書の最大数を指定する。通常のインターネットのサイト検索やデータベース検索においては、検索要求に対する適合度順に複数の文書が返却されるため、この上位から指定された数の文書を対象とする。図６の例では、上位の二件の文書のみを拡張単語の対象とすることが指定されている。第３のエントリは、実際に音声認識誤り訂正候補として抽出する固有名詞区間の最大数を指定する。図６の例では、最大５つの固有名詞区間を抽出することが指定されている。
【００２５】
上記に抽出した単語集合を検索条件とし、図６に示す関連情報検索条件によって、関連情報検索部６００による関連文書検索を行った結果、次に示すような内容を持つ関連文書１件が抽出されるものとする。
・［関連文書内容］
ベンチャー業界注目の新規プロジェクト「シリウス・ダッシュ」がいよいよスタートする。参加企業を代表する田中祐市部長（中谷製作所）、鈴木一朗取締役（株式会社ダッシュ）の両氏は、昨夜開いた記者会見の会場で、その計画の概要を公表した。
【００２６】
この文書内容は、関連情報検索部６００から音声認識誤り訂正候補抽出部５００に返却される。
【００２７】
音声認識誤り訂正候補抽出部５００は、上記のような文書内容を固有名詞区間同定部４００を起動することにより、文書中に含まれる固有名詞区間を得る。上記の例においては、以下の５つの固有名詞区間（／の後は読み、カッコ内は固有名詞クラス）が得られるものとする。
・［訂正候補ａ］シリウス／しりうす（ｏｒｇａｎｉｚａｔｉｏｎ）
・［訂正候補ｂ］田中祐市部長／たなかゆういちぶちょう（ｐｅｒｓｏｎ）
・［訂正候補ｃ］中谷製作所／なかたにせいさくしょ（ｏｒｇａｎｉｚａｔｉｏｎ）
・［訂正候補ｄ］鈴木一朗取締役／すずきいちろうとりしまりやく（ｐｅｒｓｏｎ）
・［訂正候補ｅ］株式会社シリウス／かぶしきがいしゃしりうす（ｏｒｇａｎｉｚａｔｉｏｎ）
上記のごとく得られた音声認識誤り訂正候補は、音声認識誤り訂正候補抽出部５００から音声認識誤り訂正部３００へと送信される。音声認識誤り訂正部３００は，［訂正対象１］、［訂正対象２］のような誤り訂正対象となる固有名詞区間と、［訂正候補ａ−ｅ］のような誤り訂正候補群とのマッチングを行い、誤り訂正を試みる。
【００２８】
各訂正対象に対する訂正候補群とのマッチング手順は、以下のように行う。
・［ステップ１］該訂正対象と同じ固有名詞クラスを持つ訂正候補を訂正候補群から選択する
・［ステップ２］該訂正対象と選択された訂正候補それぞれとのマッチ度を計算する
・［ステップ３］該訂正対象に対して最大のマッチ度を与える訂正候補を選択する
上記の手順において、ステップ１とステップ３は自明であるので、ステップ２について説明する。
【００２９】
訂正対象と訂正候補のマッチ度の計算としては、例えば、「読み」のひらがな文字列の類似度を用いることができる。本発明で対象とするのは音声認識の誤りであるので、訂正対象である音声認識の誤り箇所の読みは、本来発声されたであろう正解の読みと類似していることが想定されるため、この方法には妥当性がある。
【００３０】
文字列間の類似度の計算方法としては様々なものが提案されているが、代表的な手法として「編集距離」を用いる方法があり、動的計画法を用いた効率のよい処理アルゴリズム（非特許文献１）も確立しているので、例えばこの手法を用いればよい。また、この方法においては、文字列を「編集」する際のコストを定義することができるが、あらかじめ音声認識誤りの傾向が分かっていれば、これをコストに反映させておくことにより、適切に類似度を計算することができる。
【００３１】
上記の例においては、訂正対象１の「なかったり製作所」に対しては、固有名詞クラスがｏｒｇａｎｉｚａｔｉｏｎで一致していて、読みがこれと類似していると計算される「中谷製作所」が訂正候補として選択される。また、訂正対象２の「田中唯一部長」に対しては、同様にして「田中祐市部長」が訂正候補として選択される。
【００３２】
このようにして求められた訂正候補は、図４に示すような音声認識結果へと反映される。
【００３３】
図７は、図４に示す固有名詞区間同定結果を含む音声認識結果に対して、上記に示した誤り訂正候補により誤り訂正を行った後の音声認識結果の例を示す。なお、上記のごとく誤り訂正された部分については、必要に応じ、音声認識の信頼度を適当な定数（図７においては５００としている）と置き換えればよい。また、誤りの訂正によって、上記の例のごとく単語の数が変わる場合があり、ｂｅｇｉｎ、ｅｎｄの属性によって記録されている発声時間の情報を調整する必要がある。この段階において、正確な発声時間を補うことは不可能であるが、訂正の対象となった区間の始まりと終了の時間が初期の音声認識結果の時間情報と矛盾しないような適当な時間をとるようにすればよい。例えば、図７における「中谷」「製作所」の例では、「中谷」の開始時間を初期の音声認識結果である「な」の開始時間とし、終了時間を初期の音声認識結果である「かったり」の終了時間としている。
【００３４】
このような誤り訂正された音声認識結果は、音声認識誤り訂正部３００から出力部７００へと送信される。
【００３５】
図８は特願２００２−３５５２８４号に記載されている、固有名詞区間同定部４００の処理を示す流れ図ある。音声データが入力されると（ステップ８０１）、大語彙連続音声認識を行い予め指定した個数の形態素の並びの候補を出力する（ステップ４０２）。始端と終端を含めて隣接する形態素の時刻が連続でない、つまりある形態素の終了時刻とつきの形態素の開始時刻が一致しない場合は、連続でない時間帯、つまりある形態素の終了時刻を開始時刻とし、次の形態素の開始時刻を終了時刻とする時刻情報を付加した読点等の形態素情報を挿入する（ステップ８０３、８０４）。また、信頼度スコアや形態素情報がある条件を満たす場合、形態素を元雄形態素情報を保持して別の形態素に置換変形する（ステップ８０５、８０６）。例えば、また、信頼度スコアが予め設定されている閾値より小さい場合に、表記、読み、品詞の先頭にそれぞれ「ε；」を付与する。複数候補の形態素の並びから、各形態素が有する時刻情報に基づいて単語グラフを作成する（ステップ８０７）。単語グラフは、各ノードが時刻情報を持つ形態素であり、ノード間のリンクはある時刻において形態素が隣接する形態素と接続可能であることを示す。単語グラフの時刻を先頭から進めていき、単語グラフの各時刻で終わる形態素候補が存在する限り（ステップ８０８）、後続の１形態素について想定されるすべての固有表現クラスが付与された場合を仮定して（ステップ８０９）、すでに学習された言語モデル、例えば固有表現付き単語ｂｉｇｒａｍの出現頻度に基づいて各固有表現クラス付きの形態素が接続した場合の対数確率を計算する（ステップ８１０）。例えば、直前の固有表現クラスＮＣ_−１と直前の形態素ｗ_−１が与えられたときに現在の固有表現クラスＮＣが選択される確率Ｐ（ＮＣ｜ＮＣ_−１，ｗ_−１）と現在と直前の固有表現クラスが与えられたときに、現在の固有表現クラスの中で最初の単語ｗ_{ｆｉｒｓｔ}が生成される確率Ｐ（ｗ_{ｆｉｒｓｔ}｜ＮＣ_−１，ｗ_−１）と、直前の形態素と現在の固有表現クラスが与えられたときに２番目以降の形態素が生成される確率Ｐ（ｗ｜ｗ_−１，ＮＣ）を下記の計算式により固有表現付きの単語ｂｉｇｒａｍ頻度Ｃから計算する。文末まで以上のステップを繰り返す。
【００３６】
【数１】

このとき置換変形されている形態素は表記、読み、品詞とも「ε」を用いて対数確率を計算する。その時刻において、それまでの累積の対数確率が最大となる固有表現クラス付き形態素を選択し、経路を保持する（ステップ８１１）。ここで、「経路を保持する」のは、後の処理で文末から後ろ向きに局所的に最大の対数確率を持つ経路をたどれるようにしておくためである。単語グラフのノードの時刻を進めて（ステップ８１２）、同様の処理を行う。文末に達したら、今度は文末から最大の対数確率（最尤）を持った経路を選択することにより、選択された経路の各形態素について固有表現クラスを出力する（ステップ２１３）。置換変形されている形態素は、例えば表記、読み、品詞に含まれる「ε；」を削除するなどして元の形態素に復元して出力する。
【００３７】
なお、本発明は専用のハードウェアにより実現されるもの以外に、その機能を実現するためのプログラムを、コンピュータ読み取り可能な記録媒体に記録して、この記録媒体に記録されたプログラムをコンピュータシステムに読み込ませ、実行するものであってもよい。コンピュータ読み取り可能な記録媒体とは、フロッピーディスク、光磁気ディスク、ＣＤ−ＲＯＭ等の記録媒体、コンピュータシステムに内蔵されるハードディスク装置等の記憶装置を指す。さらに、コンピュータ読み取り可能な記録媒体は、インターネットを介してプログラムを送信する場合のように、短時間の間、動的にプログラムを保持するもの（伝送媒体もしくは伝送波）、その場合のサーバとなるコンピュータシステム内部の揮発性メモリのように、一定時間プログラムを保持しているものも含む。
【００３８】
【発明の効果】
以上説明したように、本発明によれば、通常の音声認識装置において認識精度低下の原因のひとつであるＯＯＶ問題（ＯｕｔＯｆＶｏｃａｂｕｌａｒｙ問題）のうち、例えば固有名詞に関わる認識誤りを訂正することにより、認識精度を向上させることが可能となる。また、人名、地名、製品名などの固有名詞を正しく認識することは、例えば、音声認識を適用した音声ドキュメント検索システムの検索精度を向上させることにつながる。
【図面の簡単な説明】
【図１】本発明の一実施形態の音声認識装置のブロック図である。
【図２】音声認識部２００により文字化された音声認識結果の一例を示す図である。
【図３】音声認識誤り訂正条件を格納する音声認識誤り訂正条件テーブル３１０のエントリ例を示す図である。
【図４】図２の音声認識結果に対する固有名詞区間同定部４００の処理結果を示す図である。
【図５】関連文書検索キー単語抽出条件テーブル５１０におけるエントリ例を示す図である。
【図６】音声認識誤り訂正候補抽出条件テーブル６１０におけるエントリ例を示す図である。
【図７】図４に示す固有名詞区間同定結果を含む音声認識結果に対して誤り訂正候補により誤り訂正を行った後の音声認識結果の例を示す図である。
【図８】固有名詞区間同定部４００の処理例のフローチャートである。
【符号の説明】
１００入力部
２００音声認識部
３００音声認識誤り訂正部
３１０音声認識誤り訂正条件テーブル
４００固有名詞区間同定部
５００音声認識誤り訂正候補抽出部
５１０関連情報検索キー単語抽出条件テーブル
６００関連情報検索部
６１０音声認識誤り訂正候補抽出条件テーブル
７００出力部
８０１〜８１２ステップ
[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a speech recognition device that transcribes recorded video and audio (voice documents) into text by applying speech recognition.
[0002]
[Prior art]
Media recognition technologies such as speech recognition are being researched and developed to enable advanced use of the growing volume of multimedia content. A speech recognition device transcribes the spoken portions of content into text, and because many kinds of processing become possible once the content has been transcribed, it is regarded as an important component. For current speech recognition devices, adaptation to the recognition target is indispensable for achieving good performance. This includes registering words in the recognition dictionary and incorporating word combinations that are likely to be uttered into the language model. However, adding these indiscriminately lowers not only processing speed but also recognition accuracy. The words to be registered in the recognition dictionary must therefore be selected carefully, for example from a document set considered to have the same characteristics as the recognition target.
[0003]
[Non-patent document 1]
“Dynamic Programming Algorithm (DPA) for Edit-Distance”,
http://www.csse.monash.edu.au/-lloyd/tildeAlgDS/Dynamic/Edit/
[0004]
[Problems to be solved by the invention]
As described above, the words to be registered in the recognition dictionary must be selected carefully, and it is impossible to cover 100% of the words that may occur in an actual recognition target. In particular, a very large number of new words and proper nouns such as personal names, place names, and product names can appear, and some of them inevitably fall outside the recognition vocabulary (the Out of Vocabulary problem, or OOV problem). In current speech recognition devices, a word that is not registered in the recognition dictionary can never be recognized, which lowers recognition accuracy.
[0005]
An object of the present invention is to provide a recognition error correction method, apparatus, and program that improve the accuracy of speech recognition by correcting, among the recognition errors caused by the OOV problem, those that relate to proper nouns.
[0006]
[Means for Solving the Problems]
In order to achieve the above object, a recognition error correction device of the present invention includes:
speech recognition means for outputting a speech recognition result together with its reliability;
means for identifying, in the speech recognition result, the sections in which a word belonging to a predetermined class was uttered;
a speech recognition error correction condition table that stores speech recognition error correction conditions, each describing a composite condition on the reliability score and on words belonging to a predetermined class;
means for collating the speech recognition result to which the identification result has been added with the speech recognition error correction conditions in the speech recognition error correction condition table, and extracting, as an error correction target section, a section that may contain a speech recognition error and that can be corrected;
a related information search key word extraction condition table that stores related information search key word extraction conditions;
means for extracting, from the speech recognition result to which the identification result has been added, a word set that serves as the search condition for a related document search, according to the related information search key word extraction conditions stored in the related information search key word extraction condition table;
means for searching for related documents using the word set as the search condition, according to the speech recognition error correction candidate extraction conditions, and extracting the sections of words belonging to a predetermined class contained in the related documents as speech recognition error correction candidates; and
means for matching the error correction target section against the group of error correction candidates and correcting the error in the error correction target section.
[0007]
Of the recognition errors caused by the OOV (Out of Vocabulary) problem, which is one of the causes of reduced recognition accuracy in ordinary speech recognition devices, those relating to proper nouns, for example, can be corrected, which makes it possible to improve recognition accuracy.
[0008]
BEST MODE FOR CARRYING OUT THE INVENTION
Next, embodiments of the present invention will be described with reference to the drawings.
[0009]
As shown in FIG. 1, the recognition error correction apparatus according to one embodiment of the present invention comprises an input unit 100, a speech recognition unit 200, a speech recognition error correction unit 300, a speech recognition error correction condition table 310, a proper noun section identification unit 400, a speech recognition error correction candidate extraction unit 500, a related information search key word extraction condition table 510, a related information search unit 600, a speech recognition error correction candidate extraction condition table 610, and an output unit 700.
[0010]
The input unit 100 inputs a voice document. The voice recognition unit 200 performs voice recognition on the input voice document, and outputs the result together with the reliability. The speech recognition error correction unit 300 receives the speech recognition result and corrects the speech recognition error according to a predetermined speech recognition error correction condition. The speech recognition error correction condition table 310 stores speech recognition error correction conditions in advance. The proper noun section identification unit 400 identifies a proper noun section included in the input word string. The speech recognition error correction candidate extraction unit 500 extracts a proper noun section that is a speech recognition error correction candidate from the related information. The related information search key word extraction condition table 510 stores related information search key word extraction conditions. The related information search unit 600 searches for a related document in an external database according to a predetermined related information search condition. The speech recognition error correction candidate extraction condition table 610 stores predetermined speech recognition error correction candidate extraction conditions.
[0011]
Each of the processing units 100, 200, 300, 400, 500, 600, and 700 is executed by a control means such as a CPU. The tables 310, 510, and 610 are stored in a storage device. A storage device (not shown) for temporarily storing the output of each processing unit is also provided.
[0012]
Hereinafter, the operation of the speech recognition error correction device of the present embodiment will be described using a specific example.
[0013]
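FIG. 2 shows part of the speech recognition result for a voice document that was input from the input unit 100 and transcribed by the speech recognition unit 200. Here, the actual utterance is assumed to be 「ＩＴベンチャーの中谷製作所の田中祐市部長は、新プロジェクトのシリウス・ダッシュの概要を発表した。」 ("General Manager Yuichi Tanaka of the IT venture Nakatani Seisakusho announced an outline of the new project Sirius Dash."), but because of speech recognition errors it is assumed to have been transcribed as 「ＩＴベンチャーのなかったり製作所の田中唯一部長は、新プロジェクトのシリウス・ダッシュの概要を発表した。」, in which 「中谷」 (Nakatani) has become 「なかったり」 and 「祐市」 (Yuichi) has become 「唯一」 ("only").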
[0014]
The output of the speech recognition unit 200 illustrated in FIG. 2 is structured in XML (eXtensible Markup Language). That is, the voice document doc is represented as a set of phrase elements, which are utterance units. Each utterance unit is represented as a set of the word elements it contains. For each utterance unit and each word contained in it, the start time and end time are recorded in the attributes begin and end, respectively. Furthermore, for each word, the transcribed word notation is recorded as the content of the XML element, and the part-of-speech information, the reading, and the speech recognition reliability of the word are recorded in the attributes pos, reading, and conf, respectively. Note that the speech recognition result illustrated in FIG. 2 only exemplifies the concepts needed to describe the present invention, and the data need not be limited to this format, including the XML tag structure. Any speech recognition device that can output this kind of information can be used as the speech recognition unit 200.
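As an illustration of the structure described above, the following minimal Python sketch parses a small recognition result. FIG. 2 itself is not reproduced in this text, so the sample XML, its time stamps, its part-of-speech labels, and its conf values are all hypothetical; only the element names (doc, phrase, word) and attribute names (begin, end, pos, reading, conf) come from the description. The function name load_words is likewise an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the shape described for FIG. 2 (all values are invented).
SAMPLE = """<doc>
  <phrase begin="0.00" end="1.55">
    <word begin="0.00" end="0.35" pos="noun" reading="たなか" conf="820">田中</word>
    <word begin="0.35" end="0.90" pos="noun" reading="ゆいいつ" conf="310">唯一</word>
    <word begin="0.90" end="1.40" pos="noun" reading="ぶちょう" conf="760">部長</word>
    <word begin="1.40" end="1.55" pos="particle" reading="は" conf="900">は</word>
  </phrase>
</doc>"""


def load_words(xml_text):
    """Flatten doc/phrase/word into a list of dicts, one per recognized word."""
    root = ET.fromstring(xml_text)
    words = []
    for phrase in root.iter("phrase"):
        for w in phrase.iter("word"):
            words.append({
                "surface": w.text,               # transcribed notation (element content)
                "begin": float(w.get("begin")),  # start time of the word
                "end": float(w.get("end")),      # end time of the word
                "pos": w.get("pos"),             # part of speech
                "reading": w.get("reading"),     # reading (hiragana)
                "conf": int(w.get("conf")),      # recognition reliability
            })
    return words


words = load_words(SAMPLE)
```

The later processing stages only need the per-word attributes, so a flat list of dictionaries is used here; an implementation that must write the corrected result back out would more likely keep the element tree.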
[0015]
FIG. 3 shows an example of an entry in the speech recognition error correction condition table 310 storing speech recognition error correction conditions set in advance. In the example shown in FIG. 3, a composite condition relating to a reliability score of speech recognition and a proper noun class described later is described. How to apply the condition will be described later. Note that these conditions are set empirically according to the speech recognition device applied to the speech recognition unit 200.
[0016]
The speech recognition error correction unit 300 receives the output of the speech recognition unit 200, as shown in FIG. 2, and corrects the speech recognition errors contained in the speech recognition result based on predetermined speech recognition error correction conditions, as shown in FIG. 3. The speech recognition error correction unit 300 first transfers the input speech recognition result to the proper noun section identification unit 400. The proper noun section identification unit 400 identifies, in the input speech recognition result as shown in FIG. 2, the sections in which a proper noun is judged to have been uttered, and returns a processing result in a data format such as that illustrated in FIG. 4 to the speech recognition error correction unit 300.
[0017]
FIG. 4 shows the processing result of the proper noun section identification unit 400 for the speech recognition result of FIG. 2. The data in FIG. 4 is in the same XML format as the speech recognition result illustrated in FIG. 2, but the result of proper noun section identification is added as an attribute called ne-class in each word tag. A word whose ne-class attribute value is other than nil is part of a proper noun section, and the attribute value other than nil indicates the proper noun class, such as a person name or a place name. In FIG. 4, the attribute value person indicates a person name, and the attribute value organization indicates an organization name.
[0018]
The present invention does not prescribe the specific configuration of the proper noun section identification unit 400, but the unit is assumed to have an input interface that can process both structured data in XML format, as shown in FIG. 2, and text data as character strings, and the proper noun section identification itself is assumed to be realized by, for example, the method and apparatus disclosed in Patent Document 1. The proper noun section identification result illustrated in FIG. 4 only exemplifies the concepts needed to describe the present invention, and the data need not be limited to this format, including the XML tag structure.
[0019]
The speech recognition result to which the result of the proper noun section identification as shown in FIG. 4 is added is returned to the speech recognition error correction unit 300.
[0020]
The speech recognition error correction unit 300 collates the input speech recognition result, to which the result of proper noun section identification has been added, with the speech recognition error correction conditions stored in the speech recognition error correction condition table 310, and extracts the sections that may contain a speech recognition error (that is, that include a word with low speech recognition reliability) and that can be corrected (that is, that have been identified as proper noun sections having some proper noun class). Here, an extracted section is the longest partial word string consisting of "words whose proper noun class satisfies the condition specified in the speech recognition error correction conditions" for which "the smallest recognition reliability among the words contained in the partial word string satisfies the condition specified in the speech recognition error correction conditions."
[0021]
When the speech recognition result of FIG. 4, to which the proper noun section identification result has been added, is collated with the speech recognition error correction conditions of FIG. 3, the following two sections are obtained as sections whose speech recognition errors should be corrected. Here, / represents a word boundary, and the parentheses give the proper noun class of the section.
・ [Correction target 1] な / かったり / 製作所 (organization)
・ [Correction target 2] 田中 / 唯一 / 部長 (person)
Next, the speech recognition error correction unit 300 transmits the speech recognition result of FIG. 4, to which the result of proper noun section identification has been added, to the speech recognition error correction candidate extraction unit 500.
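The extraction of correction target sections can be sketched as follows. This is a minimal illustration under stated assumptions rather than the method of FIG. 3, whose actual entries are not reproduced in this text: the condition is assumed to be "the class is person or organization, and the smallest conf in the section is at most 400", the conf values are invented, and the names extract_targets, classes, and max_min_conf are hypothetical.

```python
from itertools import groupby

# (surface, ne_class, conf); conf values are hypothetical.
WORDS = [
    ("ＩＴ", "nil", 900), ("ベンチャー", "nil", 850), ("の", "nil", 900),
    ("な", "organization", 200), ("かったり", "organization", 150),
    ("製作所", "organization", 800), ("の", "nil", 900),
    ("田中", "person", 820), ("唯一", "person", 310), ("部長", "person", 760),
    ("は", "nil", 900),
]


def extract_targets(words, classes, max_min_conf):
    """Return (ne_class, [surfaces]) for each maximal run of words that share an
    allowed proper noun class and whose smallest conf is <= max_min_conf."""
    targets = []
    # groupby yields maximal runs of consecutive words with the same ne_class.
    for ne_class, run in groupby(words, key=lambda w: w[1]):
        run = list(run)
        if ne_class in classes and min(conf for _, _, conf in run) <= max_min_conf:
            targets.append((ne_class, [surface for surface, _, _ in run]))
    return targets


targets = extract_targets(WORDS, {"person", "organization"}, 400)
```

Grouping by class alone would merge two adjacent proper nouns of the same class into one section; a fuller implementation would need the section boundaries produced by the identification unit.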
[0022]
The speech recognition error correction candidate extraction unit 500 extracts, from the speech recognition result to which the result of proper noun section identification has been added as shown in FIG. 4, a word set that serves as the search condition with which the related information search unit 600 searches an external database for related documents. It does so according to the related information search key word extraction conditions stored in advance in the related information search key word extraction condition table 510. Next, using this word set as the search condition, the related information search unit 600 searches the external database for related documents according to the speech recognition error correction candidate extraction conditions stored in advance in the speech recognition error correction candidate extraction condition table 610, and the proper noun sections that are speech recognition error correction candidates are extracted. To identify the proper noun sections contained in the retrieved documents, the proper noun section identification unit 400 is called. The extracted speech recognition error correction candidates are returned to the speech recognition error correction candidate extraction unit 500.
[0023]
FIG. 5 shows an example of an entry in the related document search key word extraction condition table 510. In the example shown in FIG. 5, three conditions relating to the part of speech and the recognition reliability are set. As shown in the example of FIG. 5, words that are likely to be correctly recognized are extracted by considering the reliability of speech recognition. Further, by extracting a word having a part of speech such as a noun or a verb, the related information search unit 600 extracts a word that can be a keyword when a related document is searched from an external database. Note that these conditions are set empirically according to the speech recognition device applied to the speech recognition unit 200.
[0024]
When the word set serving as the search condition for the related document search performed on the external database by the related information search unit 600 is extracted from the speech recognition result of FIG. 4, to which the result of proper noun section identification has been added, according to the related information search key word extraction conditions shown in FIG. 5, the following word set is obtained.
・ [Search condition word set] (ベンチャー, プロジェクト, シリウス, ダッシュ) (Venture, Project, Sirius, Dash)
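The key word extraction can be sketched as a filter on part of speech and recognition reliability. The entries of FIG. 5 are not reproduced in this text, so the three (part of speech, minimum conf) conditions below are hypothetical, as are the conf values and the function name extract_key_words; the word list is a hand-picked subset of the example utterance, not the full FIG. 4 data.

```python
# Hypothetical conditions in the spirit of FIG. 5: (part of speech, minimum conf).
CONDITIONS = [("noun", 700), ("verb", 700), ("unknown", 900)]

# (surface, pos, conf); a subset of the example utterance with invented values.
WORDS = [
    ("ＩＴ", "noun", 420), ("ベンチャー", "noun", 850), ("の", "particle", 950),
    ("唯一", "noun", 310), ("プロジェクト", "noun", 880),
    ("シリウス", "noun", 790), ("ダッシュ", "noun", 760), ("発表", "verb", 650),
]


def extract_key_words(words, conditions):
    """Keep words that satisfy at least one (pos, minimum conf) condition,
    preserving first-occurrence order and dropping duplicates."""
    keys = []
    for surface, pos, conf in words:
        ok = any(pos == c_pos and conf >= c_min for c_pos, c_min in conditions)
        if ok and surface not in keys:
            keys.append(surface)
    return keys


key_words = extract_key_words(WORDS, CONDITIONS)
```

Requiring a high reliability keeps misrecognized words such as 唯一 out of the query, which matters because a wrong key word would pull in unrelated documents.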
FIG. 6 shows an example of the entries in the speech recognition error correction candidate extraction condition table 610. As shown in FIG. 6, the related information search condition consists of three entries. The first entry is the identifier of the external database to be searched by the related information search unit 600. In the example of FIG. 6, the news search site foo-news.com on the Internet is specified. The second entry specifies the maximum number of documents from which speech recognition error correction candidates are extracted. An ordinary Internet site search or database search returns multiple documents in order of relevance to the search request, so the specified number of documents is taken from the top of this ranking. In the example of FIG. 6, only the top two documents are used. The third entry specifies the maximum number of proper noun sections that are actually extracted as speech recognition error correction candidates. In the example of FIG. 6, at most five proper noun sections are extracted.
[0025]
Assume that, when the related information search unit 600 performs a related document search using the word set extracted above as the search condition and according to the related information search condition shown in FIG. 6, one related document with the following content is retrieved.
・ [Related document content]
The new project "Sirius Dash," which is attracting attention in the venture industry, is about to start. Representing the participating companies, General Manager Yuichi Tanaka (Nakatani Seisakusho) and Director Ichiro Suzuki (Dash Inc.) announced an outline of the plan at a press conference held last night.
[0026]
This document content is returned from the related information search unit 600 to the speech recognition error correction candidate extraction unit 500.
[0027]
The speech recognition error correction candidate extraction unit 500 invokes the proper noun section identification unit 400 on the document content above to obtain the proper noun sections contained in the document. In the above example, it is assumed that the following five proper noun sections are obtained (the reading follows the /, and the proper noun class is given in parentheses).
・ [Correction candidate a] シリウス / しりうす (organization)
・ [Correction candidate b] 田中祐市部長 / たなかゆういちぶちょう (person)
・ [Correction candidate c] 中谷製作所 / なかたにせいさくしょ (organization)
・ [Correction candidate d] 鈴木一朗取締役 / すずきいちろうとりしまりやく (person)
・ [Correction candidate e] 株式会社シリウス / かぶしきがいしゃしりうす (organization)
The speech recognition error correction candidates obtained in this way are transmitted from the speech recognition error correction candidate extraction unit 500 to the speech recognition error correction unit 300. The speech recognition error correction unit 300 matches the proper noun sections that are error correction targets, such as [correction target 1] and [correction target 2], against the group of error correction candidates, such as [correction candidates a-e], and attempts error correction.
[0028]
The procedure for matching each correction target against the correction candidate group is as follows.
・ [Step 1] Select, from the correction candidate group, the correction candidates that have the same proper noun class as the correction target.
・ [Step 2] Calculate the degree of matching between the correction target and each of the selected correction candidates.
・ [Step 3] Select the correction candidate that gives the maximum degree of matching for the correction target.
Steps 1 and 3 are self-evident, so only Step 2 is described below.
[0029]
As the calculation of the degree of matching between the correction target and the correction candidate, for example, the similarity of a hiragana character string of “reading” can be used. Since the object of the present invention is a speech recognition error, it is assumed that the reading of the error part of the speech recognition to be corrected is similar to the reading of the correct answer that would have been originally uttered. This method is valid.
[0030]
Various methods for calculating the similarity between character strings have been proposed. A typical one uses the "edit distance," for which an efficient algorithm based on dynamic programming (Non-Patent Document 1) is well established, so this method may be used, for example. This method also allows the cost of each "edit" of a character string to be defined, so if the tendencies of the speech recognition errors are known in advance, reflecting them in the costs allows the similarity to be calculated appropriately.
[0031]
In the above example, for “Nataya Seisakusho” of correction target 1, “Nakaya Seisakusho”, whose proper noun class matches as “organization” and whose reading is calculated to be similar, is selected as the correction candidate. Similarly, for "the only director of Tanaka" of correction target 2, "director Yuichi Tanaka" is selected as the correction candidate.
[0032]
The correction candidates obtained in this way are reflected in the speech recognition result as shown in FIG. 7.
[0033]
FIG. 7 shows an example of the speech recognition result after error correction has been performed, using the above-described error correction candidates, on the speech recognition result including the proper noun section identification result shown in FIG. 4. For a portion that has been error-corrected as described above, the reliability of speech recognition may be replaced with an appropriate constant (500 in FIG. 7) as necessary. Because error correction may change the number of words, as in the above example, the utterance time information recorded in the begin and end attributes must also be adjusted. At this stage the correct utterance times cannot be recovered, but appropriate times should be assigned such that the start and end times of the corrected section do not contradict the time information of the initial speech recognition result. For example, for “Nakaya” and “Mfg.” in FIG. 7, the start time of “Nakaya” is set to the start time of “na” in the initial speech recognition result, and the end time is set to the end time of “Katakaru” in the initial speech recognition result.
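The time adjustment can be sketched in Python as below. This is only one possible policy under stated assumptions: the `Word` record, the function name, and the even division of the span among the replacement words are hypothetical; the text itself only requires that the outer start and end times stay consistent with the initial recognition result. The constant 500 follows FIG. 7.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    begin: float       # utterance start time
    end: float         # utterance end time
    reliability: int   # recognition reliability score


CORRECTED_RELIABILITY = 500  # constant assigned to corrected words (as in FIG. 7)


def apply_correction(words, start, stop, new_texts):
    """Replace words[start:stop] with new_texts, preserving the span's outer times."""
    span = words[start:stop]
    if not span or not new_texts:
        raise ValueError("both the target span and the replacement must be non-empty")
    span_begin, span_end = span[0].begin, span[-1].end
    # Assumption: interior boundaries are spread evenly over the span.
    step = (span_end - span_begin) / len(new_texts)
    replaced = [
        Word(t, span_begin + k * step, span_begin + (k + 1) * step,
             CORRECTED_RELIABILITY)
        for k, t in enumerate(new_texts)
    ]
    replaced[-1].end = span_end  # pin the last end time to the original value
    return words[:start] + replaced + words[stop:]
```

The words outside the corrected span are returned unchanged, so their begin and end attributes still line up with the boundaries of the replaced section.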
[0034]
The error-corrected speech recognition result is transmitted from the speech recognition error correction unit 300 to the output unit 700.
[0035]
FIG. 8 is a flowchart showing the processing of the proper noun section identification unit 400 described in Japanese Patent Application No. 2002-355284. When speech data is input (step 801), large-vocabulary continuous speech recognition is performed and a predetermined number of candidate morpheme sequences are output (step 802). If the times of adjacent morphemes, including those at the beginning and end, are not continuous, that is, if the end time of one morpheme does not match the start time of the next morpheme, morpheme information such as a comma is inserted for the non-continuous time zone, with time information whose start time is the end time of the former morpheme and whose end time is the start time of the latter morpheme (steps 803 and 804). If the reliability score and the morpheme information satisfy certain conditions, the morpheme is replaced with another morpheme while retaining the original morpheme information (steps 805 and 806). For example, when the reliability score is smaller than a preset threshold, “ε;” is added to the head of the notation, the reading, and the part of speech. A word graph is created from the candidate morpheme sequences based on the time information of each morpheme (step 807). In the word graph, each node is a morpheme having time information, and a link between nodes indicates that the morpheme can be connected to an adjacent morpheme at that time. The time of the word graph is advanced from the beginning, and as long as there is a morpheme candidate ending at the current time of the word graph (step 808), all the named entity classes assumed for the succeeding morpheme are assigned (step 809), and the log probability of connecting the morphemes with each named entity class is calculated based on the appearance frequencies of a previously trained language model, for example, word bigrams with named entities (step 810). For example, the following are calculated from the frequency C of word bigrams with named entities by the following formula (Equation 1): the probability P(NC | NC₋₁, w₋₁) that the current named entity class NC is selected, given the immediately preceding named entity class NC₋₁ and the immediately preceding morpheme w₋₁; the probability P(w_first | NC₋₁, w₋₁) that the first word w_first in the current named entity class is generated, given the current and immediately preceding context; and the probability P(w | w₋₁, NC) that the second and subsequent morphemes are generated, given the preceding morpheme and the current named entity class. These steps are repeated until the end of the sentence.
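The preprocessing of steps 803 to 806 can be illustrated with a short Python sketch. The `Morpheme` record, the function names, the choice of 「、」 as the inserted morpheme, and the use of `None` as the score of inserted morphemes are assumptions for the example. For brevity, the sketch only fills gaps between adjacent morphemes and does not handle gaps at the beginning or end of the utterance.

```python
from dataclasses import dataclass, replace
from typing import Optional

EPS = "ε;"  # marker prefixed to low-reliability morphemes


@dataclass(frozen=True)
class Morpheme:
    surface: str            # notation
    reading: str
    pos: str                # part of speech
    begin: int              # start time (ms)
    end: int                # end time (ms)
    score: Optional[float]  # reliability; None for inserted morphemes


def fill_gaps(morphemes, filler="、"):
    """Insert a filler morpheme wherever adjacent times are not continuous."""
    out = []
    for m in morphemes:
        if out and out[-1].end != m.begin:
            # The filler spans exactly the uncovered time zone.
            out.append(Morpheme(filler, filler, "punctuation",
                                out[-1].end, m.begin, None))
        out.append(m)
    return out


def mask_low_reliability(morphemes, threshold):
    """Prefix EPS to notation, reading and POS when the score is below threshold."""
    return [
        replace(m, surface=EPS + m.surface, reading=EPS + m.reading,
                pos=EPS + m.pos)
        if m.score is not None and m.score < threshold else m
        for m in morphemes
    ]


def unmask(m):
    """Restore the original morpheme by deleting the EPS prefix."""
    strip = lambda s: s[len(EPS):] if s.startswith(EPS) else s
    return replace(m, surface=strip(m.surface), reading=strip(m.reading),
                   pos=strip(m.pos))
```

Because the marker is only prefixed, the original notation, reading, and part of speech remain recoverable after the named entity classes have been assigned.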
[0036]
(Equation 1)

At this time, the log probability is calculated using “ε” for the notation, reading, and part of speech of any morpheme that has been replaced. The morpheme with the named entity class that gives the largest cumulative log probability is then selected, and its path is kept (step 811). Here, "keeping the path" means recording it so that, in later processing, the path with the locally maximum log probability can be traced backward from the end of the sentence. The time of the word graph node is advanced (step 812), and the same processing is repeated. When the end of the sentence is reached, the path with the highest log probability (maximum likelihood) is selected by tracing back from the end of the sentence, and a named entity class is output for each morpheme on the selected path (step 813). Each morpheme that was replaced is restored to the original morpheme, for example by deleting the “ε;” contained in its notation, reading, and part of speech, and is then output.
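The selection of the maximum log probability path with kept back-pointers can be sketched as a Viterbi-style search. This Python example is a simplification under stated assumptions: it works on a single linear morpheme sequence rather than a word graph, and `log_prob` is a hypothetical scoring function standing in for the bigram-based probabilities; neither the function names nor the signature come from the original text.

```python
def best_class_path(words, classes, log_prob):
    """Return the class sequence with the highest cumulative log probability.

    log_prob(prev_class, prev_word, cls, word) is a hypothetical scorer;
    prev_class and prev_word are None for the first word.
    """
    if not words:
        return []
    # best[i][c] = (cumulative score, back-pointer) for word i tagged as class c.
    best = [{c: (log_prob(None, None, c, words[0]), None) for c in classes}]
    for i in range(1, len(words)):
        column = {}
        for c in classes:
            # Keep only the best predecessor for each class ("keeping the path").
            column[c] = max(
                (best[i - 1][p][0] + log_prob(p, words[i - 1], c, words[i]), p)
                for p in classes
            )
        best.append(column)
    # Trace back from the best class at the end of the sentence.
    cls = max(classes, key=lambda c: best[-1][c][0])
    path = [cls]
    for i in range(len(words) - 1, 0, -1):
        cls = best[i][cls][1]
        path.append(cls)
    return path[::-1]
```

Storing one back-pointer per class and position is what allows the final path to be recovered by following pointers backward from the end of the sentence.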
[0037]
In addition to being realized by dedicated hardware, the present invention may be implemented by recording a program for realizing its functions on a computer-readable recording medium and causing a computer system to read and execute the program recorded on the recording medium. The computer-readable recording medium refers to a recording medium such as a floppy disk, a magneto-optical disk, or a CD-ROM, or a storage device such as a hard disk device built into a computer system. Further, the computer-readable recording medium includes one that dynamically holds the program for a short time (a transmission medium or transmission wave), as in the case where the program is transmitted via the Internet, and one that holds the program for a certain period of time, such as volatile memory inside a computer system serving as a server in that case.
[0038]
[Effect of the invention]
As described above, according to the present invention, recognition accuracy can be improved by correcting, among the recognition errors caused by the OOV (Out Of Vocabulary) problem, which is one cause of reduced recognition accuracy in ordinary speech recognition devices, for example the recognition errors related to proper nouns. In addition, correctly recognizing proper nouns such as person names, place names, and product names leads to, for example, improved search accuracy in a speech document search system to which speech recognition is applied.
[Brief description of the drawings]
FIG. 1 is a block diagram of a speech recognition device according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating an example of a speech recognition result converted into text by a speech recognition unit 200.
FIG. 3 is a diagram showing an example of entries of a speech recognition error correction condition table 310 storing speech recognition error correction conditions.
FIG. 4 is a diagram showing a processing result of a proper noun section identification unit 400 for the speech recognition result of FIG. 2;
FIG. 5 is a diagram showing an example of an entry in a related information search key word extraction condition table 510.
FIG. 6 is a diagram showing an example of an entry in a speech recognition error correction candidate extraction condition table 610.
FIG. 7 is a diagram illustrating an example of a speech recognition result after performing error correction on the speech recognition result including the proper noun section identification result illustrated in FIG. 4 using an error correction candidate.
FIG. 8 is a flowchart of a processing example of a proper noun section identification unit 400;
[Explanation of symbols]
100 Input unit
200 Speech recognition unit
300 Speech recognition error correction unit
310 Speech recognition error correction condition table
400 Proper noun section identification unit
500 Speech recognition error correction candidate extraction unit
510 Related information search key word extraction condition table
600 Related information search unit
610 Speech recognition error correction candidate extraction condition table
700 Output unit
801 to 813 Steps

Claims

A speech recognition step of outputting a speech recognition result together with the reliability,
Identifying a section in which a word belonging to a predetermined class included in the speech recognition result is uttered;
Adding the identification result to the speech recognition result, comparing it with a speech recognition error correction condition describing a compound condition relating to the reliability score and a word belonging to a predetermined class, and extracting, as an error correction target section, a section that may contain a speech recognition error and that can be corrected;
Extracting a word set serving as a search condition for performing a related document search from the speech recognition result to which the identification result has been given according to a related information search key word extraction condition;
Using the word set as a search condition, searching for a related document according to a speech recognition error correction candidate extraction condition, and extracting a section of a word belonging to a predetermined class included in the related document as a speech recognition error correction candidate;
Matching the error correction target section with the error correction candidate group, and performing error correction on the error correction target section.

Speech recognition means for outputting a speech recognition result together with the reliability,
Means for identifying a section in which a word belonging to a predetermined class included in the speech recognition result is uttered,
A speech recognition error correction condition table that stores a speech recognition error correction condition describing a reliability score and a compound condition relating to a word belonging to a predetermined class;
Means for comparing the speech recognition result to which the identification result is added with the speech recognition error correction condition in the speech recognition error correction condition table, and extracting, as an error correction target section, a section that may contain a speech recognition error and that can be corrected,
A related information search key word extraction condition table storing related information search key word extraction conditions,
Means for extracting, according to the related information search key word extraction condition stored in the related information search key word extraction condition table, a word set serving as a search condition for performing a related document search from the speech recognition result to which the identification result is added,
Means for using the word set as a search condition, searching for a related document according to the speech recognition error correction candidate extraction condition, and extracting a section of a word belonging to a predetermined class included in the related document as a speech recognition error correction candidate,
A recognition error correction device comprising: means for matching the error correction target section with the error correction candidate group and performing error correction on the error correction target section.

A recognition error correction program for causing a computer to execute the recognition error correction method according to claim 1.