JP4008344B2

JP4008344B2 - Class identification model generation method, apparatus, and program, class identification method, apparatus, and program

Info

Publication number: JP4008344B2
Application number: JP2002355284A
Authority: JP
Inventors: 隆明長谷川; 林　　良彦
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2002-12-06
Filing date: 2002-12-06
Publication date: 2007-11-14
Anticipated expiration: 2022-12-06
Also published as: JP2004184951A

Description

【０００１】
【発明の属する技術分野】
本発明は電子テキスト以外のメディアの内容から固有表現を抽出する方法および装置に関する。
【０００２】
【従来の技術】
従来、電子テキストから固有表現を抽出する方法として、様々な方法が提案されていた（例えば、特許文献１や非特許文献１を参照）。
【０００３】
一方、電子テキスト以外のメディア、とりわけ、音声データからの固有表現抽出方法では、音声認識エンジン（例えば、非特許文献３参照）が出力する結果を入力テキストと見なしたテキストからの固有表現抽出と同様に、既存のテキストから学習しておいた言語モデルを用いて固有表現を抽出していた（例えば、非特許文献２参照）。
【０００４】
また、音声認識エンジンの内部で辞書に登録されていない固有名詞を抽出する試みもなされている（例えば、特許文献２参照）。
【０００５】
【特許文献１】
特開２００１−３１８７９２（請求項１）
【特許文献２】
特開２００１−２３６０８９（第５項、および第３図）
【非特許文献１】
ＮＹＭＢＬＥ：A High-Performance Learning Name-finder, D. Bikel 他３名，In Proceedings of the Fifth Conference on Applied Natural Language Processing, Association for Computational Linguistics, pp.194-201, 1997.
【非特許文献２】
Named Entity Extraction from Speech, F. Kubala 他３名，Darpa98, http://www.nist.gov/speech/publications/darpa98/html/lm50/lm50.htm
【非特許文献３】
ＮＴＴ技術ジャーナル１９９９年１２月号、特集「ここまできた音声認識・音声合成」
【０００６】
【発明が解決しようとする課題】
しかしながら、上記の従来の固有表現抽出方法では、音声認識エンジンが用いる辞書の語彙数を抑えるために、低頻度の固有名詞は登録されず、未知語は根本的に認識できないという問題があった。さらに、音声認識結果は本質的に誤りを含むので、誤りを考慮していないテキストに基づく言語モデルを用いて、固有表現の発声区間を認識するのは極めて難しい。また、固有名詞の認識が仮にもできたとしても、それが人名なのか地名なのかという識別は、大規模な固有表現タグ付けテキストから生成された言語モデルを用いない限り困難であり、大規模な固有表現タグ付きテキストを準備するのは非常にコストがかかるという問題もある。
【０００７】
本発明の目的は、音声データと対応する固有表現タグ付きテキストを用いて誤りを含む言語モデルを生成するクラス同定モデル生成方法、装置、およびプログラムを提供することにある。
【０００８】
本発明の他の目的は、低頻度なため音声認識の語彙に含まれていない未知語の固有名詞を認識する場合や認識誤りを含む場合でも、固有表現の発声区間を適切に認識し、どの種類の固有表現かを同定することが可能なクラス同定方法、装置、プログラムを提供することにある。
【０００９】
【課題を解決するための手段】
本発明のクラス同定モデル生成装置は、
単語表記と読みの情報が格納されている認識語彙データベースと、
単語情報が格納されている形態素辞書と、
電子テキスト以外のメディアから内容を認識して電子テキストに変換する際に、認識語彙データベースに基づいて、最尤認識候補の形態素の並びを出力するメディア変換手段と、
電子テキスト以外のメディアに含まれる内容に対応する固有表現タグ付きテキストを解析して、形態素辞書に基づいて形態素に分割し、固有表現タグに対応する固有表現クラスを固有表現タグに含まれるすべての形態素に付与し、固有表現クラス付き形態素の並びを出力するタグ付きテキスト解析手段と、
最尤認識候補の形態素の並びと固有表現クラス付き形態素の並びの両者を両者の形態素が有する情報に基づいて類似度が最大となるように対応付けて、前者の形態素に後者の形態素の有する固有表現クラスを付与する形態素対応付け手段と、
固有表現クラス付き形態素の並びから、統計的に言語モデルを学習する言語モデル学習手段とを有する。
【００１０】
また、本発明のクラス同定装置は、
単語表記と読みの情報が格納されている認識語彙データベースと、
指定した個数だけの尤もらしい順に並んだ形態素の並びから、隣接する形態素の時刻が連続ではない場合、連続ではない時間帯の形態素を挿入する手段と、
形態素が持つ信頼度スコアと形態素情報の少なくとも一方に基づいて、形態素をその形態素情報を含めた別の形態素に置換変形した上で、メディア変換による複数の候補の形態素の並びから各形態素の開始位置および終了位置に基づいて、各形態素をノードとし、各形態素の開始位置と終了位置において接続可能な形態素間の関係をリンクとする単語グラフを作成する単語グラフ作成手段と、
単語グラフの各形態素について、すべての固有表現クラス付き形態素の候補を設定し、言語モデルに基づいた、単語グラフの始端から終端までの全体の確率が最大となるようにすべての固有表現クラス付き形態素を決定し、固有表現クラス付き形態素の並びを出力する固有表現付与手段とを有する。
【００１１】
入力された音声データは、メディア変換手段によって、例えば開始位置と終了位置および信頼度スコア付きの形態素の並びに変換される。入力された音声データに対応する固有表現タグ付きテキストは、タグ付きテキスト解析手段によって固有表現クラスを有する形態素の並びに変換される。形態素対応付け手段によって、両者の形態素の並びはそれぞれ比較され、各形態素ごとの類似度を計算し、例えば対応がずれる場合には減点し、全体の類似度が最も大きくなるように対応付けられた後、信頼度スコア付き形態素の並びに、対応する固有表現クラス付き形態素が有する固有表現クラスが付与されると同時に、例えば信頼度スコアや形態素情報が指定された条件を満たす場合には形態素情報が置換され出力される。言語モデル学習手段は、形態素対応付け手段により出力された固有表現クラス付き形態素の並びから統計的に言語モデルを学習する。単語グラフ作成手段は、新たに入力される音声データがメディア変換手段により指定された個数だけ尤もらしい順に出力される開始位置、終了位置および信頼度スコア付きの形態素の並びに対して、形態素をノードとし各形態素の開始位置および終了位置における可能な接続をリンクとする単語グラフを作成する。固有表現付与手段は、単語グラフ作成手段により作成された単語グラフに対して、言語モデル学習手段により学習された言語モデルに基づいて、単語グラフの先頭位置から終端位置までの確率が最大となるように、固有表現付き形態素を選択し、固有表現付き形態素の並びを出力する。
【００１２】
本発明は、音声データから固有表現を抽出する際に、対応するテキストの固有表現クラスを用いて、誤りを含む音声認識結果に固有表現クラスを付与して、固有表現クラスを有する形態素の並びから言語モデルを生成することにより、新たに入力される音声データからのロバストな固有表現抽出を実現する。
【００１３】
【発明の実施の形態】
次に、本発明の実施の形態について図面を参照して説明する。
【００１４】
図１は本発明の一実施形態の固有表現抽出装置の構成図である。
【００１５】
本実施形態の固有表現抽出装置は、メディア変換部１とタグ付きテキスト解析部２と形態素対応付け部３と言語モデル学習部４と認識語彙ＤＢ８と形態素辞書９と言語モデルデータベース１０を含む固有表現抽出モデル生成装置と、メディア変換部１と単語グラフ作成部５と固有表現付与部６と認識語彙ＤＢ８と言語モデルデータベース１０を含む固有表現抽出装置と、制御部７からなる。
【００１６】
認識語彙ＤＢ８は単語表記（表記、仮名、読み、品詞、標準形）と単語ＩＤと読みの情報が格納されている。同類の単語をまとめて扱うために、クラスＩＤとクラス内unigram確率が単語表記の後ろに格納される。音声認識の結果、読みの情報に基づいて、単語表記が出力される。
【００１７】
形態素辞書９は単語情報（表記、品詞、仮名、接続コストなど）が格納されている。
【００１８】
メディア変換部１は、音声データを入力し、認識語彙ＤＢ８に基づいて、テキストとして表現される形態素の並び（形態素列）を出力する。すなわち、予め学習してある音響モデルと言語モデルに基づいて、入力された音声データに対しておこなわれる確率計算により、認識語彙ＤＢ８に存在する単語表記を出力する。形態素には、表記、読み、品詞、開始時刻、終了時刻、音響スコア、言語スコア、信頼度スコアの各情報が付随する。ここで、「開始時刻」「終了時刻」は単語１語あたりの開始時刻と終了時刻で、認識処理対象としている音声データの起点を０としてカウントされる。「音響スコア」は既存の音声から学習された音響モデルから得られる確率に基づくスコア、「言語スコア」は既存のテキストから学習された言語モデルから得られる確率に基づくスコア、「信頼度スコア」は音響スコアと言語スコアを所定の計算式で計算したスコアである。
【００１９】
タグ付きテキスト解析部２は、入力された音声データと対応したテキストで、固有表現に該当する部分にいずれかひとつの種類の固有表現のタグが付けられている固有表現タグ付きテキストを解析し、形態素辞書９に基づいて分割する。テキストは形態素に分割され、固有表現のタグが付いていた単独の形態素あるいは複数の連続する形態素に該当の固有表現クラスを付ける。例えば、固有表現タグの種類は、人名、地名、組織名、人工物名、日付表現、時間表現、金額表現、割合表現としている。固有表現クラスは、固有表現タグの種類を踏襲するが、タグが付いていない部分にもその他として付与されるものとしている。
【００２０】
形態素対応付け部３は、メディア変換部１によって得られた音声側の形態素の並び（最尤の認識候補の形態素列）とタグ付きテキスト解析部２で得られたテキスト側の形態素の並び（固有表現クラス付き形態素列）を比較して、各形態素同士について形態素情報に基づいて類似度を計算し、末尾の形態素までの累積した類似度（類似度の総和）が最大となるように形態素を対応付ける。すなわち、２つの形態素について標記が一致するかどうか（一致すれば１、しなければ０）、仮名がどの程度かさなっているか（１文字単位でカウントし、短い方の仮名の長さで正規化）、品詞は一致するかどうか（自立語のみ対象にして、一致すれば１、しなければ０、どちらかが自立語以外は０）を調べ、これらの重み付き和を計算する。このとき、形態素の対応が１対１からずれる（１対ｎまたはｎ対１になる）場合には、指定された値を累積した類似度から減点する。対応付けの結果、テキスト側の形態素に対応するすべての音声側の形態素に、テキスト側の形態素の有する固有表現クラスを付与する。付与される固有表現クラスは、テキスト側の形態素の有する固有表現クラスと同一であってもよいし、これに関連する予め対応付けられた別の固有表現クラスであってもよい。認識誤りの形態素にも固有表現クラスを付与する。さらに、このとき同時に音声側の形態素が有する情報、例えば信頼度スコアや品詞情報に基づいて、誤りと思われる形態素をある特別な記号に置換しておくこともできる。信頼度スコアは仮名の長さで正規化してもよい。
【００２１】
言語モデル学習部４は、形態素対応付け部３で得られた固有表現クラスを有する形態素の並び（形態素列）や、あるいはさらにテキスト側の固有表現付き形態素の並びを加えたものから、固有表現クラス付き単語ｂｉｇｒａｍとその頻度からなる言語モデルを統計的に学習し、結果を言語モデルデータベース１０に格納する。ここで、形態素列は認識候補に固有表現クラスを付与したものだけでなく、固有表現クラス付き形態素列を加えてもよい。
【００２２】
単語グラフ作成部５は、新たに入力される音声データからメディア変換部１によって得られる指定された個数の形態素の並びから、各形態素の有する開始位置と終了位置に基づいて形態素をノードとし、各位置における形態素の接続をリンクとする単語グラフを作成する。このとき、信頼度スコアに対する閾値を予め設定しておき、音声認識で得られる形態素の信頼度スコアがこの閾値に達しないとき、あるいは、音声認識で得られる形態素の品詞が予め指定しておいた特定のものであれば、あるいはこれらを同時に満たす場合に、別の形態素に置換変形して単語グラフを作成してもよい。
【００２３】
固有表現付与部６は、単語グラフ作成部５から得られた単語グラフに対して、各形態素が信頼度スコアや形態素情報の条件の元で別の形態素に置換変形した場合を含めて、言語モデル学習部４により学習された言語モデルに基づいて、あらゆるすべての固有表現クラスを持つとしたときの単語ｂｉｇｒａｍの対数確率を単語グラフの先頭から末尾の全体に対して計算して、最も大きい対数確率となるような固有表現クラス付き形態素を各位置において選択することにより、各形態素に固有表現クラスを付与する。
【００２４】
制御部７は、学習時にはメディア変換部１とタグ付きテキスト解析部２と形態素対応付け部３と言語モデル学習部４を駆動し、実行時にはメディア変換部１と単語グラフ作成部５と固有表現付与部６を駆動する。
【００２５】
なお、メディア変換部１の出力結果やタグ付きテキスト解析部２の出力結果は記憶装置（不図示）に記憶される。
【００２６】
図２は本実施形態の、学習時における言語モデル作成までの処理を示す流れ図である。メディア変換部１は、例えば、音声データと発話内容が一致するテキストが文の単位で対応している場合には、文単位で音声データを入力し（ステップ１０１）、大語彙連続音声認識により最も尤度の高い候補１つを抽出する（ステップ１０２）。このとき、大語彙連続音声認識において予め設定している閾値よりも長いポーズ区間を検出した場合には、音声認識処理の単位区間を分割して形態素の並びを出力する。閾値よりも長いポーズがあるならば、ポーズは読点に置換してポーズ区間の前後の区間の形態素の並びを接続して一つの文とする（ステップ１０３，１０４）。一方、タグ付きテキスト解析部２は、音声データと対応する固有表現タグ付きテキストを入力し（ステップ１０５）、テキストを形態素に分割した上で固有表現タグに含まれる形態素には固有表現タグに対応する固有表現クラスを付与し、固有表現タグに含まれない形態素には「その他」などの特定の固有表現クラスを付与し、固有表現クラス付きの形態素の並びに変換する（ステップ１０６）。形態素対応付け部３は、文単位の範囲において、音声側と対応するテキスト側の形態素の並びに対して、それぞれの文頭から文末までの各形態素について対応付けることが可能なすべての経路のうち最適な経路を計算することにより対応付け（ステップ１０７）、テキスト側の形態素に対応付けられた音声側の各形態素にテキスト側の形態素の持つ固有表現クラスを付与する（ステップ１０８）。対応付けられた音声側の形態素とテキスト側の形態素の表記同士が一致しない場合には、付与する固有表現クラスとして予め対応付けられた別の固有表現クラスを付与してもよい。
【００２７】
対応付けの際には、例えば、Ｎ番目の音声側の形態素とＭ番目のテキスト側の形態素が対応する場合には、そこに至るまでの３つの経路、すなわちＮ−１番目の音声側の形態素とＭ−１番目のテキスト側の形態素が対応する場合と、Ｎ−１番目の音声側の形態素とＭ番目のテキスト側の形態素が対応する場合と、Ｎ番目の音声側の形態素とＭ−１番目のテキスト側の形態素が対応する場合がある。１番目の経路の場合には、Ｎ番目の音声側の形態素とＭ番目のテキスト側の形態素について表記の一致や読みの重なり度合いに基づいて類似度を計算し、それまでの累積された類似度に新たに計算された類似度を累積する。２番目の経路の場合は、Ｎ−１番目の音声側の形態素とＭ番目のテキスト側の形態素までの累積した類似度から予め指定された値を減点する。３番目の経路の場合は、Ｎ−１番目の音声側の形態素とＭ番目のテキスト側の形態素の経路までの累積した類似度から予め指定された値を減点する。３つの経路のうち最大の累積の類似度を持つ経路をそこまでの形態素の経路として保持し、以上を文末まで繰り返すことにより文頭から文末までの累積の類似度が最大となる経路を求める動的計画法の考え方に基づいて、最終的に両者の文末の形態素までの最適な経路を求める。
【００２８】
また、形態素の信頼度スコアや形態素情報がある条件を満たす場合には、形態素を別の形態素に置換することもできる（ステップ１０９，１１０）。例えば、信頼度スコアが予め設定されている閾値より小さい場合や、形態素に付与された固有表現クラスが特定のものである場合には、別の形態素として表記、読み、品詞すべてを例えば特定の記号「ε」に置換する。最後に、言語モデル学習部４は、音声認識結果に対して固有表現クラスが付与された形態素の並びやそれに加えて対応するテキストにおける固有表現クラス付き形態素の並びから固有表現クラス付き単語ｂｉｇｒａｍとその出現頻度からなる言語モデルを統計的に学習し、学習結果を言語モデルデータベース１０に格納する（ステップ１１１）。
【００２９】
図３は本実施形態のうち、実行時における固有表現付与の処理を示す流れ図である。メディア変換部１は音声データが入力されると（ステップ２０１）、大語彙連続音声認識を行い予め指定した個数の形態素の並びの候補を出力する（ステップ２０２）。始端と終端を含めて隣接する形態素の時刻が連続でない、つまりある形態素の終了時刻と次の形態素の開始時刻が一致しない場合は、連続でない時間帯、つまりある形態素の終了時刻を開始時刻とし、次の形態素の開始時刻を終了時刻とする時刻情報を付加した読点等の形態素情報を挿入する（ステップ２０３，２０４）。また、信頼度スコアや形態素情報がある条件を満たす場合、形態素を元の形態素情報を保持して別の形態素に置換変形する（ステップ２０５，２０６）。例えば、信頼度スコアが予め設定されている閾値より小さい場合に、表記、読み、品詞の先頭にそれぞれ「ε；」を付与する。単語グラフ作成部５は複数候補の形態素の並びから、各形態素が有する時刻情報に基づいて単語グラフを作成する（ステップ２０７）。単語グラフは、各ノードが時刻情報を持つ形態素であり、ノード間のリンクはある時刻において形態素が隣接する形態素と接続可能であることを示す。単語グラフの時刻を先頭から進めていき、単語グラフの各時刻で終わる形態素候補が存在する限り（ステップ２０８）、後続の１形態素について想定されるすべての固有表現クラスが付与された場合を仮定して（ステップ２０９）、すでに学習された言語モデル、例えば固有表現付き単語ｂｉｇｒａｍの出現頻度に基づいて各固有表現クラス付きの形態素が接続した場合の対数確率を計算する（ステップ２１０）。例えば、直前の固有表現クラスＮＣ_-1と直前の形態素ｗ_-1が与えられたときに現在の固有表現クラスＮＣが選択される確率Ｐ（ＮＣ｜ＮＣ_-1，ｗ_-1）と現在と直前の固有表現クラスが与えられたときに、現在の固有表現クラスの中で最初の単語ｗ_firstが生成される確率Ｐ（ｗ_first｜ＮＣ_-1，ｗ_-1）と、直前の形態素と現在の固有表現クラスが与えられたときに２番目以降の形態素が生成される確率Ｐ（ｗ｜ｗ_-1，ＮＣ）を、下記の計算式により固有表現付き単語ｂｉｇｒａｍ頻度Ｃから計算する。文末まで以上のステップを繰り返す。
【００３０】
【数１】

このとき置換変形されている形態素は表記、読み、品詞とも「ε」を用いて対数確率を計算する。その時刻において、それまでの累積の対数確率が最大となる固有表現クラス付き形態素を選択し、経路を保持する（ステップ２１１）。ここで、「経路を保持する」のは、後の処理で文末から後ろ向きに局所的に最大の対数確率を持つ経路をたどれるようにしておくためである。単語グラフのノードの時刻を進めて（ステップ２１２）、同様の処理を行う。文末に達したら、今度は文末から最大の対数確率（最尤）を持った経路を選択することにより、選択された経路の各形態素について固有表現クラスを出力する（ステップ２１３）。置換変形されている形態素は、例えば表記、読み、品詞に含まれる「ε；」を削除するなどして元の形態素に復元して出力する。
【００３１】
図４に音声認識結果から得られる１位候補のスコア付き形態素の並びの例を示す。一例として、形態素情報は表記と読みと品詞からなり、スラッシュで区切っている。その後にスラッシュに続けて信頼度スコアが格納されている。ここでは、発声時刻は省略している。
【００３２】
図５に固有表現タグ付きテキストから得られる固有表現クラス付き形態素の並びの例を示す。
【００３３】
図６に両者の形態素の並びを対応付けて、音声側の形態素にテキスト側の固有表現クラスを付与した形態素の並びの例を示す。この例では、形態素の表記と読み１文字ずつの情報を用いて、類似度を計算している。この例では、テキスト側の８番目の形態素「オレンジ」は音声側の７番目の形態素「俺」と８番目の形態素「んち」と対応付けられる。この場合、音声側とテキスト側の形態素同士の表記が一致しないので、「オレンジ」の有する固有表現クラス「ＬＯＣＡＴＩＯＮ」に予め対応している「^*ＬＯＣＡＴＩＯＮ」が「俺んち」に付与される。また、この例では、スコアが閾値０以下の形態素は表記、読み、品詞ともすべて「ε」という記号にして別の形態素に置換している。ここでは、信頼度スコアのところに固有表現クラスを代わりに格納している。言語モデルデータベース１０に、これらの形態素の並びから、例えば連続する２つの固有表現クラス付きの形態素の出現頻度を格納する。この例では、音声側の形態素の並びとテキスト側の形態素の並びは１つずつを対応させているが、対応させる形態素の並びの個数はこれに限るものではない。
【００３４】
図７に実行時の例を示す。「中谷主任研究員」という音声データを入力したときの信頼度スコアと発声時刻付きの形態素の並びである。簡単のため、形態素は表記のみとしている。括弧の中は信頼度スコアを表す。次に、これらの音声認識結果の発生時刻に基づいて、単語グラフを作成する。このとき、２位候補の「中」と「足り」の間が不連続なので、読点「、」の形態素情報を挿入してグラフを補完する。また、信頼度スコアが閾値０より低い形態素の表記に「ε；」を付加する。対数確率の計算時には、図８に示すように「ε」を用いるか、あるいは、「ε」を使った確率と元の形態素を使った確率を計算し、これらを比較して、最も大きいものを採用する。各時刻のノードにおいて、想定されるすべての固有表現クラス、例えばＰＥＲＳＯＮやＬＯＣＡＴＩＯＮやＯＲＧＡＮＩＺＡＴＩＯＮなどが付加された形態素が接続したとするときの対数確率を言語モデルに基づいて計算し、全体の対数確率の総和が最大となる固有表現クラスを各形態素において選択する。最終的に「ε；」が先頭にある形態素はこれを除いて、１５ｍｓから１３００ｍｓまでの「なかっ」「たり」の発声区間が人名として抽出される。なお、メディア変換部１では、手書き文字または映像中のテロップから文字認識を行い、認識された文字列に対して形態素解析を行い、形態素の並びを出力してもよい。
【００３５】
なお、本発明は専用のハードウェアにより実現されるもの以外に、その機能を実現するためのプログラムを、コンピュータ読み取り可能な記録媒体に記録して、この記録媒体に記録されたプログラムをコンピュータシステムに読み込ませ、実行するものであってもよい。コンピュータ読み取り可能な記録媒体とは、フロッピーディスク、光磁気ディスク、ＣＤ−ＲＯＭ等の記録媒体、コンピュータシステムに内蔵されるハードディスク装置等の記憶装置を指す。さらに、コンピュータ読み取り可能な記録媒体は、インターネットを介してプログラムを送信する場合のように、短時間の間、動的にプログラムを保持するもの（伝送媒体もしくは伝送波）、その場合のサーバとなるコンピュータシステム内部の揮発性メモリのように、一定時間プログラムを保持しているものも含む。
【００３６】
【発明の効果】
以上説明したように本発明は、音声データに対応する固有表現タグ付きテキストを用いて、認識誤りが含まれる音声認識結果の形態素に固有表現クラスを付与して言語モデルを学習することにより、固有表現が含まれる音声データが入力され、音声認識の語彙にないためなどの理由により正しく認識できない固有表現に対して、固有表現の発声区間を適切に同定し、固有表現の種類を識別することができるので、音声データに固有表現に関するメタデータを付けるという目的に貢献する。
【図面の簡単な説明】
【図１】本発明の一実施形態の固有表現抽出装置のブロック図である。
【図２】図１の固有表現抽出装置において、学習時における言語モデル作成までの処理の流れを示す図である。
【図３】図１の固有表現抽出装置において、実行時における固有表現付与の処理の流れを示す図である。
【図４】音声認識結果から得られる形態素の例を示す図である。
【図５】固有表現タグ付きテキストから得られる形態素の例を示す図である。
【図６】図４と図５の形態素の例から得られる、音声側の形態素にテキスト側の固有表現クラスを付与した例を示す図である。
【図７】音声データから固有表現を抽出するまでのステップを示す図である。
【図８】変形済みの形態素の確率の計算方法の説明図である。
【符号の説明】
１メディア変換部
２タグ付きテキスト解析部
３形態素対応付け部
４言語モデル学習部
５単語グラフ作成部
６固有表現付与部
７制御部
８認識語彙ＤＢ
９形態素辞書
１０言語モデルデータベース
１０１〜１１１，２０１〜２１３ステップ
[0001]
[Technical field to which the invention belongs]
The present invention relates to a method and apparatus for extracting named entities from the content of media other than electronic text.
[0002]
[Prior art]
Conventionally, various methods have been proposed for extracting named entities from electronic text (see, for example, Patent Document 1 and Non-Patent Document 1).
[0003]
On the other hand, methods for extracting named entities from media other than electronic text, in particular speech data, treat the output of a speech recognition engine (see, for example, Non-Patent Document 3) as input text and, as with extraction from text, extract named entities using a language model learned from existing text (see, for example, Non-Patent Document 2).
[0004]
Attempts have also been made to extract, inside the speech recognition engine, proper nouns that are not registered in its dictionary (see, for example, Patent Document 2).
[0005]
[Patent Document 1]
JP-A-2001-318792 (Claim 1)
[Patent Document 2]
Japanese Patent Laid-Open No. 2001-236089 (5th term and FIG. 3)
[Non-Patent Document 1]
NYMBLE: A High-Performance Learning Name-finder, D. Bikel and 3 others, In Proceedings of the Fifth Conference on Applied Natural Language Processing, Association for Computational Linguistics, pp.194-201, 1997.
[Non-Patent Document 2]
Named Entity Extraction from Speech, F. Kubala and 3 others, Darpa98, http://www.nist.gov/speech/publications/darpa98/html/lm50/lm50.htm
[Non-Patent Document 3]
NTT Technical Journal, December 1999 issue, special feature "How Far Speech Recognition and Speech Synthesis Have Come"
[0006]
[Problems to be solved by the invention]
However, the conventional named entity extraction methods described above have the following problems. To limit the vocabulary size of the dictionary used by the speech recognition engine, low-frequency proper nouns are not registered, so unknown words fundamentally cannot be recognized. Furthermore, because speech recognition results inherently contain errors, it is extremely difficult to recognize the utterance section of a named entity with a language model built from text that does not take those errors into account. Moreover, even if a proper noun is recognized, it is difficult to determine whether it is, for example, a person name or a place name unless a language model generated from a large named-entity-tagged corpus is used, and preparing such a large tagged corpus is very costly.
[0007]
An object of the present invention is to provide a class identification model generation method, apparatus, and program that generate a language model incorporating recognition errors by using named-entity-tagged text corresponding to speech data.
[0008]
Another object of the present invention is to provide a class identification method, apparatus, and program that can appropriately recognize the utterance section of a named entity and identify its type, even when the entity is an unknown proper noun that is absent from the speech recognition vocabulary because of its low frequency, or when the recognition result contains errors.
[0009]
[Means for Solving the Problems]
The class identification model generation device of the present invention includes:
A recognition vocabulary database that stores word notation and reading information;
A morpheme dictionary in which word information is stored;
Media conversion means for recognizing the content of media other than electronic text, converting it into electronic text, and outputting, based on the recognition vocabulary database, the morpheme sequence of the maximum likelihood recognition candidate;
Tagged text analysis means for analyzing named-entity-tagged text corresponding to the content of the media other than electronic text, dividing it into morphemes based on the morpheme dictionary, assigning the named entity class corresponding to each named entity tag to all morphemes enclosed by that tag, and outputting a sequence of morphemes with named entity classes;
Morpheme association means for aligning the morpheme sequence of the maximum likelihood recognition candidate with the sequence of morphemes with named entity classes so that the similarity, computed from the information held by the morphemes of both sequences, is maximized, and for assigning to each morpheme of the former the named entity class of the corresponding morpheme of the latter; and
Language model learning means for statistically learning a language model from the sequence of morphemes with named entity classes.
[0010]
The class identification device of the present invention includes:
A recognition vocabulary database that stores word notation and reading information;
Means for inserting, into each of a specified number of morpheme sequences arranged in descending order of likelihood, a morpheme covering the uncovered time span when the times of adjacent morphemes are not contiguous;
Word graph creation means for replacing a morpheme, based on at least one of its confidence score and its morpheme information, with another morpheme that retains that morpheme information, and then creating, from the multiple candidate morpheme sequences produced by media conversion, a word graph in which each morpheme is a node and each relation between morphemes connectable at their start and end positions is a link; and
Named entity assignment means for setting, for each morpheme in the word graph, all candidate morphemes with named entity classes, determining the morphemes with named entity classes so that the overall probability from the start to the end of the word graph under the language model is maximized, and outputting a sequence of morphemes with named entity classes.
[0011]
The input speech data is converted by the media conversion means into a sequence of morphemes annotated with, for example, start positions, end positions, and confidence scores. The named-entity-tagged text corresponding to the input speech data is converted by the tagged text analysis means into a sequence of morphemes with named entity classes. The morpheme association means compares the two sequences, calculates the similarity of each morpheme pair, deducts points when, for example, the correspondence deviates from one-to-one, and aligns the sequences so that the overall similarity is maximized. It then assigns to each confidence-scored morpheme the named entity class of the corresponding morpheme on the text side and, at the same time, replaces the morpheme information before output when, for example, the confidence score or the morpheme information satisfies a specified condition. The language model learning means statistically learns a language model from the sequence of morphemes with named entity classes output by the morpheme association means. For newly input speech data, the media conversion means outputs a specified number of morpheme sequences in descending order of likelihood, each with start positions, end positions, and confidence scores; from these, the word graph creation means creates a word graph in which morphemes are nodes and the possible connections at the start and end positions of each morpheme are links. Based on the language model learned by the language model learning means, the named entity assignment means selects morphemes with named entity classes from this word graph so that the probability from the start to the end of the graph is maximized, and outputs the resulting sequence.
[0012]
When extracting named entities from speech data, the present invention uses the named entity classes of the corresponding text to assign named entity classes to error-containing speech recognition results, and generates a language model from the resulting sequence of class-labeled morphemes. This achieves robust named entity extraction from newly input speech data.
[0013]
[Embodiments of the invention]
Next, embodiments of the present invention will be described with reference to the drawings.
[0014]
FIG. 1 is a configuration diagram of a named entity extraction apparatus according to an embodiment of the present invention.
[0015]
The named entity extraction apparatus of the present embodiment consists of three parts: a named entity extraction model generation device comprising a media conversion unit 1, a tagged text analysis unit 2, a morpheme association unit 3, a language model learning unit 4, a recognition vocabulary DB 8, a morpheme dictionary 9, and a language model database 10; a named entity extraction device comprising the media conversion unit 1, a word graph creation unit 5, a named entity assignment unit 6, the recognition vocabulary DB 8, and the language model database 10; and a control unit 7.
[0016]
The recognition vocabulary DB 8 stores word notation (notation, kana, reading, part of speech, standard form), word IDs, and reading information. So that similar words can be handled together, a class ID and an intra-class unigram probability are stored after the word notation. Speech recognition outputs word notation based on the reading information.
[0017]
The morpheme dictionary 9 stores word information (notation, part of speech, kana, connection cost, etc.).
[0018]
The media conversion unit 1 receives speech data and, based on the recognition vocabulary DB 8, outputs a sequence of morphemes (morpheme string) expressed as text. That is, it outputs word notations present in the recognition vocabulary DB 8 through a probability calculation performed on the input speech data using an acoustic model and a language model trained in advance. Each morpheme carries its notation, reading, part of speech, start time, end time, acoustic score, language score, and confidence score. Here, the "start time" and "end time" are those of each word, measured from the beginning of the speech data being recognized, which is taken as 0. The "acoustic score" is based on the probability obtained from an acoustic model trained on existing speech, the "language score" is based on the probability obtained from a language model trained on existing text, and the "confidence score" is computed from the acoustic score and the language score by a predetermined formula.
[0019]
The tagged text analysis unit 2 analyzes named-entity-tagged text, that is, text corresponding to the input speech data in which each span that constitutes a named entity is marked with a tag of exactly one named entity type, and divides it into morphemes based on the morpheme dictionary 9. The corresponding named entity class is then attached to the single morpheme or the run of consecutive morphemes that was enclosed by a named entity tag. The tag types are, for example, person name, place name, organization name, artifact name, date expression, time expression, monetary expression, and percentage expression. The named entity classes follow the tag types, except that untagged parts are also assigned a class, namely "other".
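The class assignment performed by the tagged text analysis unit 2 can be illustrated with a minimal Python sketch. The `<CLASS>...</CLASS>` tag syntax, the `OTHER` label, and the whitespace tokenizer are assumptions made for illustration only; the embodiment divides text with the morpheme dictionary 9 and does not prescribe a markup format.

```python
import re

# Hypothetical tag syntax <CLASS>...</CLASS>; the patent does not fix a markup format.
TAG_RE = re.compile(r"<(?P<cls>[A-Z]+)>(?P<body>.*?)</(?P=cls)>")


def label_morphemes(tagged_text, tokenize=str.split, other="OTHER"):
    """Return (morpheme, class) pairs; untagged morphemes get the `other` class.

    `tokenize` stands in for dictionary-based morphological analysis.
    """
    labeled = []
    pos = 0
    for m in TAG_RE.finditer(tagged_text):
        # Text before the tag carries no named entity.
        labeled += [(tok, other) for tok in tokenize(tagged_text[pos:m.start()])]
        # Every morpheme enclosed by the tag inherits the tag's class.
        labeled += [(tok, m.group("cls")) for tok in tokenize(m.group("body"))]
        pos = m.end()
    labeled += [(tok, other) for tok in tokenize(tagged_text[pos:])]
    return labeled


print(label_morphemes("<PERSON>Taro Yamada</PERSON> visited <LOCATION>Kyoto</LOCATION> today"))
```

For Japanese input, the `tokenize` argument would be replaced by a morphological analyzer, since the text has no whitespace between words.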
[0020]
The morpheme association unit 3 compares the speech-side morpheme sequence obtained by the media conversion unit 1 (the morpheme sequence of the maximum likelihood recognition candidate) with the text-side morpheme sequence obtained by the tagged text analysis unit 2 (the morpheme sequence with named entity classes), calculates the similarity of each pair of morphemes from their morpheme information, and aligns the morphemes so that the similarity accumulated up to the last morpheme (the sum of the similarities) is maximized. Specifically, for two morphemes it checks whether their notations match (1 if they match, 0 otherwise), how much their kana overlap (counted character by character and normalized by the length of the shorter kana string), and whether their parts of speech match (only independent words are considered: 1 if they match, 0 if they do not, and 0 if either morpheme is not an independent word), and computes a weighted sum of these values. When the correspondence deviates from one-to-one (becoming 1-to-n or n-to-1), a specified value is deducted from the accumulated similarity. As a result of the alignment, every speech-side morpheme corresponding to a text-side morpheme is assigned the named entity class of that text-side morpheme. The assigned class may be identical to the class of the text-side morpheme, or it may be a different, related class associated with it in advance. Misrecognized morphemes are assigned a named entity class as well. At the same time, morphemes that appear to be errors can be replaced with a special symbol based on information held by the speech-side morpheme, such as its confidence score or part of speech. The confidence score may be normalized by the length of the kana.
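The per-pair similarity described above might be computed as in the following sketch. The weights, the set of independent-word categories, and the use of a multiset intersection to count overlapping kana are assumptions; the text specifies only a weighted sum of the three terms.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Morpheme:
    surface: str  # notation
    kana: str     # reading in kana
    pos: str      # part of speech


# Hypothetical values: the patent leaves the weights and the category set open.
CONTENT_POS = {"noun", "verb", "adjective", "adverb"}
WEIGHTS = (1.0, 1.0, 0.5)  # notation match, kana overlap, part-of-speech match


def kana_overlap(a, b):
    """Shared kana characters, normalized by the length of the shorter reading."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    shared = sum((Counter(a) & Counter(b)).values())
    return shared / shorter


def similarity(x, y, weights=WEIGHTS):
    """Weighted sum of notation match, kana overlap, and part-of-speech match."""
    same_surface = 1.0 if x.surface == y.surface else 0.0
    # The part-of-speech term counts only when both morphemes are independent words.
    both_content = x.pos in CONTENT_POS and y.pos in CONTENT_POS
    same_pos = 1.0 if both_content and x.pos == y.pos else 0.0
    w_s, w_k, w_p = weights
    return w_s * same_surface + w_k * kana_overlap(x.kana, y.kana) + w_p * same_pos


ore = Morpheme("俺", "オレ", "noun")
orange = Morpheme("オレンジ", "オレンジ", "noun")
print(similarity(ore, orange))
```

With these assumed weights, a misrecognized morpheme that shares its reading with the reference word still scores well, which is what lets the alignment tolerate recognition errors.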
[0021]
The language model learning unit 4 statistically learns a language model consisting of class-labeled word bigrams and their frequencies from the sequence of morphemes with named entity classes obtained by the morpheme association unit 3, optionally augmented with the text-side sequence of morphemes with named entity classes, and stores the result in the language model database 10. In other words, the training sequences need not be limited to recognition candidates labeled with named entity classes; the class-labeled morpheme sequences from the text may be added.
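As a rough illustration of the statistics collected by the language model learning unit 4, the sketch below counts bigrams of (word, class) pairs. The sentence-boundary tokens and the in-memory `Counter` are assumptions; the embodiment stores the frequencies in the language model database 10.

```python
from collections import Counter

# Hypothetical sentence-boundary tokens, not named in the patent.
BOS = ("<s>", "BOS")
EOS = ("</s>", "EOS")


def count_bigrams(sentences):
    """Count bigrams over sentences given as lists of (surface, class) pairs."""
    counts = Counter()
    for sent in sentences:
        padded = [BOS, *sent, EOS]
        # Each adjacent pair of class-labeled morphemes is one bigram.
        counts.update(zip(padded, padded[1:]))
    return counts


corpus = [
    [("ε", "PERSON"), ("主任", "OTHER")],
    [("ε", "PERSON"), ("主任", "OTHER")],
]
counts = count_bigrams(corpus)
print(counts[(("ε", "PERSON"), ("主任", "OTHER"))])
```

Speech-side sequences and text-side sequences can simply be passed together in `sentences` when both are used for training.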
[0022]
From the specified number of morpheme sequences that the media conversion unit 1 produces for newly input speech data, the word graph creation unit 5 creates a word graph in which each morpheme is a node, placed according to its start and end positions, and the connections between morphemes at each position are links. A threshold for the confidence score may be set in advance, and a morpheme may be replaced with another morpheme before the word graph is built when its confidence score falls below the threshold, when its part of speech is one of a set specified in advance, or when both conditions hold.
[0023]
For the word graph obtained from the word graph creation unit 5, including the cases where morphemes have been replaced with other morphemes under the confidence score or morpheme information conditions, the named entity assignment unit 6 uses the language model learned by the language model learning unit 4 to calculate the log probability of the word bigrams over the whole graph, from its beginning to its end, under the assumption that each morpheme may take any of the named entity classes. It assigns a named entity class to each morpheme by selecting, at each position, the class-labeled morpheme that yields the largest log probability.
[0024]
During learning, the control unit 7 drives the media conversion unit 1, the tagged text analysis unit 2, the morpheme association unit 3, and the language model learning unit 4. During execution, it drives the media conversion unit 1, the word graph creation unit 5, and the named entity assignment unit 6.
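The construction of the word graph from several recognition candidates can be sketched as follows. The `Node` type, the millisecond time unit, and the rule that two morphemes are connectable when one ends exactly where the other starts are assumptions chosen to match the description of nodes and links.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    surface: str
    start: int  # start time, assumed to be in milliseconds
    end: int    # end time, assumed to be in milliseconds


def build_word_graph(nbest):
    """Build (nodes, links) from candidate sequences of (surface, start, end)."""
    # Identical morphemes occurring in several candidates collapse into one node.
    nodes = {Node(*m) for seq in nbest for m in seq}
    # A link joins two morphemes when the first ends where the second starts.
    links = {(a, b) for a in nodes for b in nodes if a.end == b.start}
    return nodes, links


nbest = [
    [("中谷", 0, 600), ("主任", 600, 1000)],
    [("中", 0, 300), ("谷", 300, 600), ("主任", 600, 1000)],
]
nodes, links = build_word_graph(nbest)
print(len(nodes), len(links))
```

Because links depend only on matching times, a path through the graph may combine morphemes from different candidates, which widens the search space beyond the original candidate list.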
[0025]
The output result of the media conversion unit 1 and the output result of the tagged text analysis unit 2 are stored in a storage device (not shown).
[0026]
FIG. 2 is a flowchart showing the processing up to language model creation during learning in the present embodiment. When, for example, the speech data and the text with matching utterance content correspond sentence by sentence, the media conversion unit 1 receives the speech data in sentence units (step 101) and extracts the single most likely candidate by large-vocabulary continuous speech recognition (step 102). If a pause section longer than a preset threshold is detected during recognition, the recognizer splits the processing unit at that point and outputs a morpheme sequence for each segment. When such a pause exists, it is replaced with a comma and the morpheme sequences before and after the pause are joined into one sentence (steps 103 and 104). Meanwhile, the tagged text analysis unit 2 receives the named-entity-tagged text corresponding to the speech data (step 105), divides it into morphemes, assigns the corresponding named entity class to morphemes enclosed by a named entity tag and a specific class such as "other" to morphemes outside any tag, and thereby converts the text into a sequence of morphemes with named entity classes (step 106). Within each sentence, the morpheme association unit 3 aligns the speech-side and text-side morpheme sequences by computing the optimal path among all possible paths that associate the morphemes from the beginning to the end of the two sentences (step 107), and assigns to each speech-side morpheme the named entity class of the text-side morpheme aligned with it (step 108). When the notations of an aligned speech-side morpheme and text-side morpheme do not match, a different named entity class associated in advance may be assigned instead.
[0027]
During alignment, when, for example, the Nth speech-side morpheme corresponds to the Mth text-side morpheme, there are three paths leading to that point: from the correspondence between the (N-1)th speech-side morpheme and the (M-1)th text-side morpheme, from that between the (N-1)th speech-side morpheme and the Mth text-side morpheme, and from that between the Nth speech-side morpheme and the (M-1)th text-side morpheme. For the first path, the similarity between the Nth speech-side morpheme and the Mth text-side morpheme is calculated from the match of their notations and the overlap of their readings, and is added to the similarity accumulated so far. For the second path, a predetermined value is deducted from the similarity accumulated up to the (N-1)th speech-side morpheme and the Mth text-side morpheme. For the third path, a predetermined value is deducted from the similarity accumulated up to the Nth speech-side morpheme and the (M-1)th text-side morpheme. Of the three paths, the one with the largest accumulated similarity is kept as the path to that point. Repeating this to the end of the sentence, in the manner of dynamic programming, yields the path that maximizes the accumulated similarity from the beginning to the end of the sentence, that is, the optimal path to the final morphemes of both sequences.
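The three-path recurrence can be written as a small dynamic programming routine. This is a sketch under stated assumptions: the penalty value, the `sim` callback, and the convention that both sequences are non-empty and start aligned at their first morphemes are illustrative choices, not details given in the text.

```python
def align(speech, text, sim, penalty=0.5):
    """Return index pairs (n, m) on the best path from (0, 0) to the last cell.

    `sim(a, b)` is any pairwise similarity; `penalty` is the value deducted
    when the correspondence deviates from one-to-one. Both are assumptions.
    """
    n_len, m_len = len(speech), len(text)
    score = [[float("-inf")] * m_len for _ in range(n_len)]
    back = [[None] * m_len for _ in range(n_len)]
    for n in range(n_len):
        for m in range(m_len):
            if n == 0 and m == 0:
                score[0][0] = sim(speech[0], text[0])
                continue
            options = []
            if n > 0 and m > 0:  # path 1: one-to-one step adds the new similarity
                options.append((score[n - 1][m - 1] + sim(speech[n], text[m]), (n - 1, m - 1)))
            if n > 0:            # path 2: text morpheme m also covers speech morpheme n
                options.append((score[n - 1][m] - penalty, (n - 1, m)))
            if m > 0:            # path 3: speech morpheme n also covers text morpheme m
                options.append((score[n][m - 1] - penalty, (n, m - 1)))
            score[n][m], back[n][m] = max(options)
    # Trace back from the final cell to recover the alignment.
    path, cell = [], (n_len - 1, m_len - 1)
    while cell is not None:
        path.append(cell)
        cell = back[cell[0]][cell[1]]
    return path[::-1]


# Toy similarities between kana readings (hypothetical values).
sims = {("オレ", "オレンジ"): 1.0, ("ンチ", "オレンジ"): 0.5, ("ニ", "ニ"): 1.0}
pairs = align(["オレ", "ンチ", "ニ"], ["オレンジ", "ニ"], lambda a, b: sims.get((a, b), 0.0))
print(pairs)
```

In this toy run the first two speech-side readings both map to the single text-side reading, so both would receive that text-side morpheme's named entity class.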
[0028]
In addition, when the confidence score or the morpheme information of a morpheme satisfies a certain condition, the morpheme can be replaced with another morpheme (steps 109 and 110). For example, when the confidence score is smaller than a preset threshold, or when the named entity class assigned to the morpheme is a particular one, the notation, reading, and part of speech are all replaced with a specific symbol such as "ε". Finally, the language model learning unit 4 statistically learns a language model consisting of class-labeled word bigrams and their frequencies of occurrence from the sequence of morphemes obtained by assigning named entity classes to the speech recognition result, optionally together with the sequence of morphemes with named entity classes from the corresponding text, and stores the learning result in the language model database 10 (step 111).
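The replacement in steps 109 and 110 might look like the following sketch. The `Morpheme` fields, the default threshold of 0, and the `masked_classes` parameter are assumptions introduced for illustration.

```python
from dataclasses import dataclass, replace

EPS = "ε"  # placeholder symbol used in the description


@dataclass(frozen=True)
class Morpheme:
    surface: str
    reading: str
    pos: str
    ne_class: str
    confidence: float


def mask_unreliable(morphemes, threshold=0.0, masked_classes=frozenset()):
    """Replace notation, reading, and part of speech with ε where a condition holds.

    The named entity class is kept, so the language model still learns
    which class the unreliable region belongs to.
    """
    out = []
    for m in morphemes:
        if m.confidence < threshold or m.ne_class in masked_classes:
            m = replace(m, surface=EPS, reading=EPS, pos=EPS)
        out.append(m)
    return out


seq = [
    Morpheme("俺", "オレ", "noun", "*LOCATION", -0.3),
    Morpheme("に", "ニ", "particle", "OTHER", 0.8),
]
print(mask_unreliable(seq))
```

Collapsing unreliable morphemes into one symbol lets the bigram statistics generalize over many different misrecognitions of unknown words.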
[0029]
FIG. 3 is a flowchart showing the process of assigning a unique expression at the time of execution in the present embodiment. When the voice data is input (step 201), the media conversion unit 1 performs large vocabulary continuous voice recognition and outputs a predetermined number of morpheme arrangement candidates (step 202). When the time of adjacent morphemes including the start and end is not continuous, that is, when the end time of one morpheme and the start time of the next morpheme do not match, the start time is the non-continuous time zone, that is, the end time of a certain morpheme, Morphological information such as a punctuation mark added with time information whose end time is the start time of the next morpheme is inserted (steps 203 and 204). Further, when the reliability score and morpheme information satisfy certain conditions, the morpheme is replaced with another morpheme while retaining the original morpheme information (steps 205 and 206). For example, when the reliability score is smaller than a preset threshold value, “ε;” is added to the beginning of each notation, reading, and part of speech. The word graph creation unit 5 creates a word graph from a plurality of candidate morphemes based on time information of each morpheme (step 207). The word graph indicates that each node is a morpheme having time information, and a link between nodes can be connected to an adjacent morpheme at a certain time. Assuming that the time of the word graph is advanced from the beginning, and as long as there is a morpheme candidate that ends at each time of the word graph (step 208), all assumed proper expression classes for one subsequent morpheme are assigned. (Step 209), the logarithmic probability when the morpheme with each unique expression class is connected is calculated based on the language model already learned, for example, the appearance frequency of the word bigram with the unique expression (Step 210). 
For example, the following are calculated from the frequency C of word bigrams with named entity classes using the formulas below: the probability P(NC | NC₋₁, w₋₁) that the current named entity class NC is selected given the previous named entity class NC₋₁ and the previous morpheme w₋₁; the probability P(w_first | NC, NC₋₁) that the first word w_first of the current named entity class is generated given the current and previous named entity classes; and the probability P(w | w₋₁, NC) that the second and subsequent morphemes are generated given the previous morpheme and the current named entity class. These steps are repeated until the end of the sentence.
[0030]
[Expression 1]
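The expression itself is not reproduced in this text. One plausible form, assuming plain relative-frequency estimates from the counts C and following the verbal definitions in the preceding paragraph (class transition; first word given the current and previous classes; subsequent word given the previous morpheme and the current class), would be the following. The exact form, including any smoothing, is an assumption.

```latex
\begin{align}
P(NC \mid NC_{-1}, w_{-1}) &= \frac{C(NC, NC_{-1}, w_{-1})}{C(NC_{-1}, w_{-1})} \\
P(w_{\mathrm{first}} \mid NC, NC_{-1}) &= \frac{C(w_{\mathrm{first}}, NC, NC_{-1})}{C(NC, NC_{-1})} \\
P(w \mid w_{-1}, NC) &= \frac{C(w, w_{-1}, NC)}{C(w_{-1}, NC)}
\end{align}
```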

At this time, for a morpheme that has been replaced, the logarithmic probability is calculated using “ε” as its notation, reading, and part of speech. The morpheme with a named entity class that maximizes the logarithmic probability accumulated so far is then selected, and the path is retained (step 211). Here, “retaining the path” means keeping the locally best path so that, in the backward pass, the path with the maximum logarithmic probability can be followed from the end of the sentence. The time of the word graph node is advanced (step 212), and the same processing is repeated. When the end of the sentence is reached, the path with the maximum logarithmic probability (maximum likelihood) is traced back from the end of the sentence, and a named entity class is output for each morpheme on the selected path (step 213). Replaced morphemes are restored to the original morphemes before output, for example by deleting the “ε;” contained in the notation, reading, and part of speech.
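The search in steps 208 to 213 resembles a Viterbi search over the word graph. The sketch below is a minimal illustration under stated assumptions: nodes are `(start, end, word)` tuples with positive duration, two nodes are linked when the end time of one equals the start time of the other, and `logp` is a hypothetical scoring function standing in for the learned language model. The class labels and the toy scores are invented for the example.

```python
import math

CLASSES = ("PERSON", "LOCATION", "OTHER")  # assumed label set

def viterbi(nodes, logp, classes=CLASSES):
    """nodes: list of (start, end, word); logp(prev_word, prev_cls, word, cls) -> float.
    Returns the best list of (word, cls) from the earliest start to the latest end."""
    order = sorted(range(len(nodes)), key=lambda i: nodes[i][0])
    t0 = min(n[0] for n in nodes)
    t_end = max(n[1] for n in nodes)
    best = {}  # (node index, class) -> accumulated log probability
    back = {}  # (node index, class) -> retained predecessor state, or None
    for i in order:
        start, _, word = nodes[i]
        for cls in classes:
            if start == t0:
                best[(i, cls)] = logp(None, None, word, cls)
                back[(i, cls)] = None
                continue
            # Every scored state whose node ends where this node starts is a predecessor.
            cands = [
                (best[(j, pc)] + logp(nodes[j][2], pc, word, cls), (j, pc))
                for j in range(len(nodes)) if nodes[j][1] == start
                for pc in classes if (j, pc) in best
            ]
            if cands:
                best[(i, cls)], back[(i, cls)] = max(cands, key=lambda c: c[0])
    # Backward pass: start from the best state that reaches the end of the sentence.
    finals = [s for s in best if nodes[s[0]][1] == t_end]
    state = max(finals, key=lambda s: best[s])
    path = []
    while state is not None:
        path.append((nodes[state[0]][2], state[1]))
        state = back[state]
    return path[::-1]

def toy_logp(prev_word, prev_cls, word, cls):
    # Hypothetical scores; a real model would use the bigram frequencies.
    table = {("Nakatani", "PERSON"): 0.8, ("says", "OTHER"): 0.8}
    return math.log(table.get((word, cls), 0.1))

nodes = [(0, 10, "Nakatani"), (0, 10, "Nakaya"), (10, 20, "says")]
print(viterbi(nodes, toy_logp))
```

The `back` dictionary corresponds to the retained path: only the locally best predecessor of each state is kept, and the output is read off by following those pointers from the end of the sentence.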
[0031]
FIG. 4 shows an example of the first-candidate sequence of morphemes with scores obtained from the speech recognition result. In this example, the morpheme information consists of the notation, reading, and part of speech, separated by slashes, and the reliability score follows after a further slash. The utterance times are omitted here.
[0032]
FIG. 5 shows an example of a sequence of morphemes with named entity classes obtained from the text with named entity tags.
[0033]
FIG. 6 shows an example of a morpheme sequence obtained by associating the two morpheme sequences with each other and assigning the text-side named entity classes to the speech-side morphemes. In this example, the similarity is calculated from the notation and the reading of each morpheme. The eighth morpheme “orange” on the text side is associated with the seventh morpheme “I” and the eighth morpheme “nuchi” on the speech side. Because the notations of the speech-side and text-side morphemes do not match, “^*LOCATION”, which corresponds to the named entity class “LOCATION” of “orange”, is assigned to “Orenchi”. Also in this example, each morpheme whose reliability score is at or below the threshold of 0 is replaced with another morpheme whose notation, reading, and part of speech are all the symbol “ε”. Here, the named entity class is stored in place of the reliability score. The language model database 10 stores, for example, the appearance frequencies of pairs of consecutive morphemes with named entity classes taken from these morpheme sequences. In this example, one speech-side morpheme sequence is associated with one text-side morpheme sequence, but the number of morpheme sequences to be associated is not limited to this.
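The association of the speech-side and text-side sequences can be sketched as a dynamic-programming alignment that accumulates similarity and deducts a penalty whenever the correspondence departs from one-to-one. Everything concrete below is an assumption for illustration: the similarity built on `difflib.SequenceMatcher`, the penalty of 0.5, the `"OTHER"` label for non-entity morphemes, and the sample data.

```python
from difflib import SequenceMatcher

def sim(s, t):
    """Similarity from notation and reading (an assumed measure)."""
    return (SequenceMatcher(None, s[0], t[0]).ratio()
            + SequenceMatcher(None, s[1], t[1]).ratio())

def align(speech, text, penalty=0.5):
    """speech: [(notation, reading)]; text: [(notation, reading, ne_class)], non-empty.
    Returns one class label per speech morpheme."""
    n, m = len(speech), len(text)
    neg = float("-inf")
    score = [[neg] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            cands = []
            if i and j:
                s = sim(speech[i - 1], text[j - 1])
                cands.append((score[i - 1][j - 1] + s, "pair"))
                # A further speech morpheme attached to the same text morpheme.
                cands.append((score[i - 1][j] + s - penalty, "extra_speech"))
            if j:
                # A text morpheme with no speech counterpart.
                cands.append((score[i][j - 1] - penalty, "skip_text"))
            if cands:
                score[i][j], move[i][j] = max(cands)
    labels = [None] * n
    i, j = n, m
    while i > 0 or j > 0:
        mv = move[i][j]
        if mv == "skip_text":
            j -= 1
            continue
        s_m, t_m = speech[i - 1], text[j - 1]
        cls = t_m[2]
        # Mark the class with "^*" when the notations differ (assumed rule).
        labels[i - 1] = cls if s_m[0] == t_m[0] or cls == "OTHER" else "^*" + cls
        i -= 1
        if mv == "pair":
            j -= 1
    return labels

text = [("Tokyo", "tokyo", "LOCATION"), ("ni", "ni", "OTHER")]
speech = [("To", "to"), ("kyo", "kyo"), ("ni", "ni")]
print(align(speech, text))
```

In the sample data, two misrecognized speech morphemes map onto one text morpheme, which mirrors the one-to-two association described for FIG. 6.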
[0034]
FIG. 7 shows an example at the time of execution. It lists the morphemes, with reliability scores and utterance times, obtained when speech data for “Senior Researcher Nakatani” is input. For simplicity, only the notation of each morpheme is shown, and the values in brackets are the reliability scores. Next, a word graph is created based on the utterance times in these speech recognition results. Because “middle” and “sufficient” in the second candidate are discontinuous in time, the graph is completed by inserting morpheme information for the punctuation mark “,”. In addition, “ε;” is prepended to the notation of each morpheme whose reliability score is lower than the threshold of 0. When the logarithmic probability is calculated, either “ε” is used as shown in FIG. 8, or the probability using “ε” and the probability using the original morpheme are both calculated and the larger is adopted. At each time node, the logarithmic probabilities are calculated from the language model on the assumption that morphemes bearing every conceivable named entity class, such as PERSON, LOCATION, and ORGANIZATION, are connected, and for each morpheme the named entity class that maximizes the sum is selected. Finally, with the leading “ε;” removed from the morphemes that carry it, the utterance section of “None” or “Dari” from 15 ms to 1300 ms is extracted as a person name. Note that the media conversion unit 1 may instead perform character recognition on handwritten characters or on telops in video, perform morphological analysis on the recognized character string, and output a morpheme sequence.
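The preprocessing in this example (filling time gaps, marking low-score morphemes with “ε;”, and removing the marker before output) can be sketched as follows. The `Timed` tuple, the function names, and the sample times are assumptions; the sketch also assumes each candidate sequence is sorted by time and free of overlaps.

```python
from typing import NamedTuple

class Timed(NamedTuple):
    notation: str
    start: int    # milliseconds
    end: int      # milliseconds
    score: float  # reliability score

MARK = "ε;"

def fill_gaps(seq, filler=","):
    """Insert a filler morpheme wherever adjacent morphemes leave a time gap."""
    out = []
    for m in seq:
        if out and out[-1].end < m.start:
            out.append(Timed(filler, out[-1].end, m.start, 0.0))
        out.append(m)
    return out

def mark_unreliable(seq, threshold=0.0):
    """Prefix low-score notations with the marker, keeping the original text."""
    return [m._replace(notation=MARK + m.notation) if m.score < threshold else m
            for m in seq]

def restore(notation):
    """Strip the marker to recover the original notation for output."""
    return notation[len(MARK):] if notation.startswith(MARK) else notation

seq = [Timed("a", 0, 100, 5.0), Timed("b", 150, 300, -2.0)]
marked = mark_unreliable(fill_gaps(seq))
print([m.notation for m in marked], [restore(m.notation) for m in marked])
```

Because the marker is only a prefix, the original notation survives the search and can be restored without a separate lookup table.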
[0035]
Besides being implemented by dedicated hardware, the present invention may be realized by recording a program for realizing its functions on a computer-readable recording medium and having a computer system read and execute the program recorded on that medium. A computer-readable recording medium here means a recording medium such as a floppy disk, a magneto-optical disk, or a CD-ROM, or a storage device such as a hard disk device built into a computer system. The term also covers media that hold a program dynamically for a short time (a transmission medium or transmission wave), as when a program is transmitted via the Internet, and media that hold a program for a certain time, such as the volatile memory inside a computer system serving as the server in that case.
[0036]
[Effects of the Invention]
As described above, the present invention uses text with named entity tags corresponding to speech data, assigns named entity classes to the morphemes of speech recognition results that contain recognition errors, and learns a language model. As a result, even when speech data is input that contains a named entity that cannot be recognized correctly because it is not in the speech recognition vocabulary, the utterance section of the named entity can be recognized appropriately and its type identified. The invention thus contributes to adding metadata about named entities to speech data.
[Brief description of the drawings]
FIG. 1 is a block diagram of a named entity extraction apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram showing the flow of processing up to creation of the language model at the time of learning in the named entity extraction apparatus of FIG. 1.
FIG. 3 is a diagram showing the flow of processing for assigning named entities at the time of execution in the named entity extraction apparatus of FIG. 1.
FIG. 4 is a diagram showing an example of morphemes obtained from a speech recognition result.
FIG. 5 is a diagram showing an example of morphemes obtained from text with named entity tags.
FIG. 6 is a diagram showing an example in which text-side named entity classes are assigned to the speech-side morphemes, obtained from the morpheme examples of FIGS. 4 and 5.
FIG. 7 is a diagram illustrating the steps up to extraction of a named entity from speech data.
FIG. 8 is an explanatory diagram of a method of calculating the probability of a replaced morpheme sequence.
[Explanation of symbols]
1 Media conversion unit
2 Tagged text analysis unit
3 Morpheme matching unit
4 Language model learning unit
5 Word graph creation unit
6 Named entity assignment unit
7 Control unit
8 Recognition vocabulary DB
9 Morpheme dictionary
10 Language model database
101-111, 201-213 Steps

Claims

1. A class identification model generation method comprising:
a media conversion step of, when recognizing content from media other than electronic text and converting it to electronic text, outputting a sequence of morphemes of the maximum likelihood recognition candidate based on a recognition vocabulary database that stores word notation and reading information;
a tagged text analysis step of analyzing text with named entity tags corresponding to the content contained in the media other than electronic text, dividing it into morphemes based on a morpheme dictionary storing word information, assigning the named entity class corresponding to each named entity tag to all morphemes contained in that tag, and outputting a sequence of morphemes with named entity classes;
a morpheme mapping step of associating the morpheme sequence of the maximum likelihood recognition candidate with the morpheme sequence with named entity classes so that the similarity based on the information of both morphemes is maximized, and assigning to each morpheme of the former the named entity class of the corresponding morpheme of the latter; and
a language model learning step of statistically learning a language model from the sequence of morphemes with named entity classes.

2. The class identification model generation method according to claim 1, wherein the media conversion step performs large vocabulary continuous speech recognition on speech data and outputs, as the maximum likelihood recognition candidate, a sequence of morphemes each having a reliability score, a start time, and an end time.

3. The class identification model generation method according to claim 1, wherein the media conversion step performs character recognition on handwritten characters or on telops in video, performs morphological analysis on the recognized character string, and outputs a morpheme sequence.

4. The class identification model generation method according to claim 1, wherein the morpheme mapping step, when comparing the sequence of morphemes containing errors after media conversion with the sequence of morphemes of the tagged text, calculates a similarity for each morpheme based on its morpheme information and accumulates the similarities, deducts from the accumulated similarity when the correspondence deviates from one-to-one, associates the morphemes so that the similarity accumulated up to the end of the morpheme sequence is maximized, and assigns the named entity class of a text morpheme, or a named entity class related to it, to all of the morphemes containing errors after media conversion that correspond to that morpheme.

5. The class identification model generation method according to claim 1, wherein the morpheme mapping step replaces a morpheme with another morpheme based on at least one of the reliability score and the morpheme information of the morpheme containing errors after media conversion, and outputs a sequence of morphemes with named entity classes.

6. A class identification model generation device comprising:
a recognition vocabulary database that stores word notation and reading information;
a morpheme dictionary in which word information is stored;
media conversion means for outputting a sequence of morphemes of the maximum likelihood recognition candidate based on the recognition vocabulary database when recognizing content from media other than electronic text and converting it to electronic text;
tagged text analysis means for analyzing text with named entity tags corresponding to the content contained in the media other than electronic text, dividing the text into morphemes based on the morpheme dictionary, assigning the named entity class corresponding to each named entity tag to the morphemes contained in that tag, and outputting a sequence of morphemes with named entity classes;
morpheme matching means for associating the morpheme sequence of the maximum likelihood recognition candidate with the morpheme sequence with named entity classes so that the similarity based on the information of both morphemes is maximized, and assigning to each morpheme of the former the named entity class of the corresponding morpheme of the latter; and
language model learning means for statistically learning a language model from the sequence of morphemes with named entity classes.

7. A class identification method comprising:
a media conversion step of, when recognizing content from media other than electronic text and converting it to electronic text, outputting a specified number of candidate morpheme sequences in order of likelihood based on a recognition vocabulary database that stores word notation and reading information;
a step of inserting a morpheme into the discontinuous time interval when the times of adjacent morphemes in the specified number of morpheme sequences arranged in order of likelihood are not continuous;
a word graph creation step of creating a word graph in which each morpheme is a node and links connect morphemes that are connectable at the start position and end position of each morpheme; and
a named entity assignment step of setting all candidate morphemes with named entity classes for each morpheme in the word graph, determining, based on the language model according to claim 1, the morphemes with named entity classes that maximize the overall probability from the start to the end of the word graph, and outputting the sequence of morphemes with named entity classes,
wherein the word graph creation step replaces a morpheme with another morpheme that includes the original morpheme information, based on at least one of the reliability score and the morpheme information of the morpheme, and then creates the word graph from the plural candidate morpheme sequences obtained by media conversion, based on the start position and end position of each morpheme.

8. The class identification method according to claim 7, wherein, when a morpheme has been replaced, the named entity assignment step calculates the probability at each node of the word graph based on the language model using the replaced morpheme information, and, once the morphemes with named entity classes have been determined, restores the retained original morpheme information and outputs it.

9. A class identification device comprising:
a recognition vocabulary database in which word notation and reading information is stored;
media conversion means for, when recognizing content from media other than electronic text and converting it to electronic text, outputting a specified number of candidate morpheme sequences in order of likelihood based on the recognition vocabulary database;
means for inserting a morpheme into the discontinuous time interval when the times of adjacent morphemes in the specified number of morpheme sequences arranged in order of likelihood are not continuous;
word graph creation means for creating a word graph in which each morpheme is a node and links connect morphemes that are connectable at the start position and end position of each morpheme; and
named entity assignment means for setting all candidate morphemes with named entity classes for each morpheme in the word graph, determining, based on the language model according to claim 6, the morphemes with named entity classes that maximize the overall probability from the start to the end of the word graph, and outputting the sequence of morphemes with named entity classes,
wherein the word graph creation means replaces a morpheme with another morpheme that includes the original morpheme information, based on at least one of the reliability score and the morpheme information of the morpheme, and then creates the word graph from the plural candidate morpheme sequences obtained by media conversion, based on the start position and end position of each morpheme.

10. A program for causing a computer to execute the method according to claim 1.

11. A program for causing a computer to execute the method according to claim 7 or 8.