JP2004046621A

JP2004046621A - Method and device for extracting multiple topics in text, program therefor, and recording medium recording this program

Info

Publication number: JP2004046621A
Application number: JP2002204434A
Authority: JP
Inventors: Shuko Ueda; 上田　修功; Kazumi Saito; 斉藤　和巳
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2002-07-12
Filing date: 2002-07-12
Publication date: 2004-02-12
Anticipated expiration: 2022-07-12
Also published as: JP3868344B2

Abstract

PROBLEM TO BE SOLVED: To smoothly extract multiple topics in a text.
SOLUTION: A text is input to a text preprocessing part 1, which calculates the frequency of each vocabulary word in the text and creates a word-frequency vector. Based on the frequency vector, the parameter of the probability model of multi-topic text is expressed as a linear sum of the parameters of the probability models of single-topic texts. Next, a model parameter estimating part 2 learns the parameters of the probability models using the word-frequency vector and the topic vector to which the text belongs. For a text whose topics are unknown, a text preprocessing part 4 calculates a word-frequency vector, and a multi-topic prediction part 5 uses the learned parameters of the probability models to extract from that vector the multiple topics to which the text belongs.
COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、インターネット、電子図書館、電子メール、電子カルテ等の電子的に蓄積された非定型テキストを予め定めたトピックに自動分類する方法および装置に関する。
【０００２】
【従来の技術】
近年、大量のテキストが電子的に蓄積されつつある。テキストは、通常、単一よりはむしろ複数のトピックスで構成されるので、テキストから多重トピックスを抽出する方法の開発は重要研究課題となっている。この抽出問題は文字認識のようなサンプルを排他的な単一クラスに分類するパターン認識問題とは異なる。
【０００３】
多重トピックス抽出問題は、多クラス、多重ラベルテキスト分類問題として多くの研究者に知られており、従来、全てのトピック毎にそのトピックに属するか否かの識別を逐次行うという２分類アプローチが採られていた。つまり、２分類アプローチでは多重トピックス抽出問題を各トピック毎に独立した２分類問題に分解して解いていた。この場合、あるトピックを担当する２分類器はテキストをそのトピックとそれ以外のトピックのいずれかに分類する。２分類器の著名な公知手法として、サポートベクトルマシン（Ｓｕｐｐｏｒｔ　Ｖｅｃｔｏｒ　Ｍａｃｈｉｎｅ：　ＳＶＭ，　Ｖ．　Ｎ．　Ｖａｐｎｉｋ，　“Ｓｔａｔｉｓｔｉｃａｌ　ｌｅａｒｎｉｎｇ　ｔｈｅｏｒｙ，”　Ｊｏｈｎ　Ｗｉｌｅｙ　＆　Ｓｏｎｓ，　Ｉｎｃ．，　Ｎｅｗ　Ｙｏｒｋ，　１９９８）あるいは、ナイーブベイズ法（Ｄ．　Ｌｅｗｉｓ　ａｎｄ　Ｍ．　Ｒｉｎｇｕｅｔｔｅ，　“Ａ　ｃｏｍｐａｒｉｓｏｎ　ｏｆ　ｔｗｏ　ｌｅａｒｎｉｎｇ　ａｌｇｏｒｉｔｈｍｓ　ｆｏｒ　ｔｅｘｔ　ｃａｔｅｇｏｒｉｚａｔｉｏｎ，”　Ｉｎ　Ｔｈｉｒｄ　Ａｎｎｕａｌ　Ｓｙｍｐｏｓｉｕｍ　ｏｎ　Ｄｏｃｕｍｅｎｔ　Ａｎａｌｙｓｉｓ　ａｎｄ　Ｉｎｆｏｒｍａｔｉｏｎ　Ｒｅｔｒｉｅｖａｌ　（ＳＤＡＩＲ’９４），　８１−９３，　１９９４）がある。
【０００４】
【発明が解決しようとする課題】
しかしながら、これら２分類アプローチは多重トピックスを同時に考慮していない。換言すれば、２分類アプローチは多重テキストの生成モデルを考慮していないため、性能限界があると考えられる。
【０００５】
また、ニューラルネットワークのような関数近似法や、特徴ベクトル間の類似性で分類するｋ近傍法は、原理的には、多重トピックス抽出を２分類アプローチのように単一トピックに分解すること無しに多重トピックス抽出が可能である。しかし、これらの方法も多重テキストの生成モデルを考慮していないため、２分類アプローチ同様、性能限界があると考えられる。
【０００６】
本発明の目的は、多重トピックスを一撃的に抽出する、多重トピックスの抽出方法、装置、プログラム、該プログラムを記録した記録媒体を提供することにある。
【０００７】
【課題を解決するための手段】
テキストの表現
本発明におけるテキストの表現法を説明する。まず、テキスト中から予め定めた語彙に含まれる単語を抽出し、それらの単語の使用頻度をベクトル表現する。すなわち、１つのテキスト１つの単語頻度はベクトル
【０００８】
【外１】

で表現される。
ここで、ｘ_ｉは語彙
【０００９】
【外２】

中の単語ｗ_ｉが前記テキスト中で出現した回数を表す。Ｖは語彙中の単語総数である。つまり、
【００１０】
【外３】

はＶ次元ユークリッド空間中の点として表現されることになる。さらに、
【００１１】
【外４】

は語彙中の全単語に渡る多項分布から生成されると仮定する。
【００１２】
【数１】

ここで、
【００１３】
【外５】

はモデルパラメータで、第ｉ番目の要素θ_ｉは単語ｗ_ｉが生起する確率を表す。明らかに、
【００１４】
【数２】

【００１５】
次に、テキストが帰属するトピックスベクトルを
【００１６】
【数３】

で定義する。ここで、
【００１７】
【外６】

の第ｌ要素ｙ_ｌは１または０の値をとり、テキストが第ｌトピックに属する場合に限りｙ_ｌ＝１とする。ここに、Ｌは全トピック数で、予め既知とする。また、テキストはＬトピックスの少なくとも１つには帰属するものと仮定する。すなわち、
【００１８】
【外７】

中の少なくとも一つの要素は１をとる。
多重トピックステキストの確率モデルのパラメータの表現
本発明の核となる多重トピックスの確率モデルの基本的な考え方を、２つのトピックス（Ｌ＝２）、かつ、語彙が３つの単語（ｗ_１，ｗ_２，ｗ_３）（Ｖ＝３）からなる簡単な例で以下に説明する。
【００１９】
今、単一トピックＣ_１およびＣ_２に属すテキスト中の単語が、各々、多項分布
【００２０】
【外８】

から生成され、かつ、各々の多項分布のパラメータはφ（Ｃ_１）＝（０．７，０．１，０．２）およびφ（Ｃ_２）＝（０．１，０．７，０．２）と仮定する。これは、トピックＣ_１に属するテキストでは、３種類の単語ｗ_１，ｗ_２，ｗ_３が各々０．７，０．１，０．２の確率で生起していることを意味する。トピックＣ_２も同様である。
【００２１】
図４（ａ）中の’０’，’＋’は各々φ（Ｃ_１），φ（Ｃ_２）から人工的に生成されたサンプル（単語頻度ベクトル）である。１つの’０’（’＋’）がトピックＣ_１（Ｃ_２）のテキストに対応する。テキスト中の単語総数、つまり、頻度ベクトルの要素の和は１００から８００の範囲で分布させている。パラメータベクトルφは図４（ｃ）の正三角形に示す２次元単体θ_１＋θ_２＋θ_３＝１上にある。
【００２２】
Ｃ_１，２をトピックＣ_１とＣ_２の両方に属する多重トピックスクラスを表すものとする。この時、Ｃ_１，２に属するテキスト中の単語はＣ_１とＣ_２に関連する単語の混合から成ると考えられる。例えば、“スポーツ”と“音楽”の両方に属するテキストには両方のトピックスに関連する単語が出現すると考えられる。ただし、“スポーツ”と“音楽”の両方に属するテキストでも、より“スポーツ”に関連するテキストである場合も考えられるので、２つのトピックス間の混合比、すなわち、２つのトピックス間の相対的な強さの割合は必ずしも等しいとは限らない。
【００２３】
上記の“単語の混合”なる考え方に従い、Ｃ_１，２に属す単語頻度サンプルを、図４（ｂ）中の’△’に示すように、Ｃ_１，Ｃ_２の各々に属する単語頻度ベクトルの混合として人工的に生成した。混合比は０．２から０．８の範囲でランダムに設定した。Ｃ_１，２のサンプルはＣ_１とＣ_２のサンプルの分布を内挿するような分布となっている。
【００２４】
ここで、注意すべきは、Ｃ_１，２に属するサンプルは２つの多項分布
【００２５】
【外９】

の混合分布からは生成できないことである。パラメータφ（Ｃ_ｋ）の最尤推定値は
【００２６】
【外１０】

に比例すること、および、Ｃ_１，２のサンプルの生成過程より、多重トピックスクラスＣ_１，２のモデルパラメータφ（Ｃ_１，２）はφ（Ｃ_１）とφ（Ｃ_２）の線形和として近似表現できることが分る。つまり、Ｃ_１，２に属するサンプルは
【００２７】
【数４】

なるパラメータを持つ多項分布の実現値と見ることができる。ただし、α（０＜α＜１）は混合比を表す。実際、人工的に生成されたサンプルに基づいて算出したＣ_１，２のパラメータの最尤推定値を図４（ｃ）に示す。
【００２８】
上記考え方を一般化すると、多重トピックスに属するテキスト中の単語の頻度分布は、単一トピックの多項分布のパラメータを基底パラメータとしそれらの線形和として表現されるパラメータをもつ多項分布となる。すなわち、トピックスベクトル
【００２９】
【外１１】

のテキストの単語頻度分布は、
【００３０】
【数５】

をパラメータとする多項分布に従う。ここで、
【００３１】
【外１２】

は単一トピックＣ_ｌの多項分布のパラメータを表す。
【００３２】
先に述べたように、多重トピックステキストはそのトピックスの中で特にあるトピックに関してより重点的に記述されていることがある。式（４）ではそうした重みづけは考慮されていない。そこでこの重みをパラメータとして考慮したより柔軟な線形和を次式で定義する。
【００３３】
【数６】

【００３４】
ここで、
【００３５】
【外１３】

とし、混合比α_ｌ，ｍ（＞０）は
【００３６】
【数７】

を満たす。α_ｌ，ｌ＝０．５より
【００３７】
【外１４】

となることに注意。また、
【００３８】
【外１５】

が成り立つ。式（４），（５）共、Ｖ個の要素の和は１となることに注意。
【００３９】
式（４）と式（５）との差は、式（４）では未知パラメータΘは、単一トピックの多項分布のパラメータ
【００４０】
【数８】

であるのに対し、式（５）では式（６）のパラメータに加え、α_ｌｍ（ｌ≠ｍ）（等価的に
【００４１】
【外１６】

も未知パラメータ扱いされることになる。
【００４２】
【数９】

【００４３】
いずれの線形和の場合も、トピックスベクトル
【００４４】
【外１７】

に属する多重トピックステキストの単語頻度ベクトル
【００４５】
【外１８】

の確率分布は
【００４６】
【数１０】

で表される。ここに、
【００４７】
【外１９】

の第ｉ要素を表す。
【００４８】
上記以外の線形和も考えられるが、本発明では、トピックスベクトル
【００４９】
【外２０】

に対応するモデルのパラメータ
【００５０】
【外２１】

がＬ個の単一トピックの多項分布のパラメータ
【００５１】
【外２２】

の線形和で表現されることを特徴とする。したがって、線形和の形態は式（４），（５）に限定されない。
確率モデルのパラメータの推定
次に、未知パラメータの推定法について説明する。
【００５２】
【数１１】

を与えられた学習データとする。
【００５３】
【外２３】

は第ｎテキストの単語頻度ベクトルと多重トピックスベクトルを表す。Ｎはテキスト総数。この時、未知パラメータΘは、学習データ
【００５４】
【外２４】

が与えられた下でのパラメータの事後分布の最大化により推定する。すなわち、
【００５５】
【数１２】

パラメータ
【００５６】
【外２５】

およびα_ｌ，ｍの事前分布は、各々次式に示すように多項分布の共役事前分布であるディレクレ分布とする。
【００５７】
【数１３】

ここで、ξおよびζはハイパーパラメータで、通常、ξ＝２およびζ＝２とする。
【００５８】
トピックスベクトル
【００５９】
【外２６】

は一様分布と仮定すると、式（１０）およびベイズの定理より
【００６０】
【外２７】

は次の目的関数
【００６１】
【数１４】

をΘに関して最大化することにより求まる。
トピックスベクトルの予測
次に、モデルパラメータの推定値を用いて新たなテキストのトピックスベクトルの値を予測する方法を以下に説明する。
【００６２】
【外２８】

を推定パラメータとすると、ここでの予測とは、新たなテキストの単語頻度ベクトル
【００６３】
【外２９】

からトピックスベクトル
【００６４】
【外３０】

の値を予測することである。そして、最適なトピックスベクトル値は
【００６５】
【外３１】

および
【００６６】
【外３２】

が与えられた下での
【００６７】
【外３３】

の事後分布を最大にする
【００６８】
【外３４】

として求められる。
【００６９】
ベイズの定理より
【００７０】
【数１５】

さらに、
【００７１】
【外３５】

の事前分布を一様分布と仮定すると、結局、最適トピックスベクトル
【００７２】
【外３６】

は
【００７３】
【外３７】

を最大化する
【００７４】
【外３８】

として求められる。
【００７５】
【数１６】

【００７６】
【発明の実施の形態】
次に、本発明の実施の形態について図面を参照して説明する。
【００７７】
図１は本発明の一実施形態の、テキストの多重トピックス抽出装置の構成図、図２はその動作を示すフローチャートである。
【００７８】
学習データである任意のテキストをテキスト前処理部１に入力し、語彙中の単語の頻度を算出し、単語頻度ベクトルを作成し（ステップ１１）、該頻度ベクトルに基づいて、多重トピックスを有するテキストの単語の頻度分布、すなわち多重トピックステキストの確率モデルのパラメータを、各単一トピックのテキストの確率モデルのパラメータの線形和で表現する（ステップ１２）。次に、モデルパラメータ推定部２において該単語頻度ベクトルとテキストの帰属トピックスベクトルを用いて確率モデルのパラメータを学習し、学習結果を推定モデルパラメータ保存部３に格納する（ステップ１３）。トピックスが未知のテキストに対して、テキスト前処理部４で単語頻度ベクトルを算出し（ステップ１４）、多重トピックス予測部５で、該単語頻度ベクトルから、推定モデルパラメータ保存部３に保存されている、学習済みの確率モデルのパラメータを用いて、該テキストの帰属する多重トピックスを抽出する（ステップ１５）。
【００７９】
以下に本実施形態の核となるモデルパラメータ推定部２と多重トピックス予測部５の処理を詳細に説明する。
【００８０】
モデルパラメータ推定部２
式（４）の線形和の場合、式（１３）の目的関数は次式のように具体化される。
【００８１】
【数１７】

ここで、
【００８２】
【外３９】

は対数尤度項で
【００８３】
【数１８】

で与えられる。したがって、最適なパラメータは式（１５）をΘに関して最大化することにより求まる。しかしながら、この最大化は解析的に求めることができず、以下に示すように逐次反復法により求める。
【００８４】
便宜上、
【００８５】
【数１９】

とおき、かつ、
【００８６】
【外４０】

を反復の第ｔステップでの推定値とし、さらに、
【００８７】
【数２０】

とおく。
【００８８】
【外４１】

に注意。この時、式（１７）は次式のように書き換えられる。
【００８９】
【数２１】

ただし、
【００９０】
【外４２】

は次式で定義される。
【００９１】
【数２２】

【００９２】
Ｊｅｎｓｅｎの不等式より、
【００９３】
【数２３】

が成立することに注意すると、もし
【００９４】
【数２４】

ならば、式（２０）より
【００９５】
【数２５】

が成り立つ。故に、
【００９６】
【数２６】

をΘに関して最大化することにより
【外４３】

を増大させることができる。
【００９７】
式（２３）の最大化はラグランジュ乗数法により解けて
【００９８】
【数２７】

として求まる。ここに
【００９９】
【外４４】

は式（１９）で与えられる。式（２４）をｌ＝１，…，Ｌ、ｉ＝１，…，Ｖに対して計算することにより式（４）の線形和のモデルに対する未知パラメータが求まる。
【０１００】
多重トピックス予測部５
式（４）に対して式（１５）に基づく多重トピックスの予測は、次式の
【０１０１】
【外４５】

に関する最大化問題となる。
【０１０２】
【数２８】

【０１０３】
上記最大化問題は、単純には、
【０１０４】
【外４６】

の可能な値の全てについて評価すれば求まるが、解候補数は２^Ｌ−１通り故、Ｌが大きくなるとそうした単純な全数探索方法では現実時間で解くことが困難となる。そこで、以下に示す近似アルゴリズムにより近似解を求める。
予測アルゴリズム
ステップ１．　初期化Ｓ：＝｛１，２，…，Ｌ｝，
【０１０５】
【外４７】

【０１０６】
ステップ２．　Ｓが空集合でない限り以下を実行
ステップ２−１．　Ｓ中の要素ｌの各々について、
【０１０７】
【外４８】

を算出し、これをυ（ｌ）とする。
【０１０８】
ステップ２−２．　υ（ｌ）を最大化するｌをｌ^＊とし、もしυ（ｌ^＊）＞υ_ｍａｘなら
【０１０９】
【外４９】

，υ_ｍａｘ：＝υ（ｌ^＊）とし、ステップ２へ。さもなくば、
【０１１０】
【外５０】

を最終的な解として終了する。
【０１１１】
ここで、表記“：＝”は右辺の値を左辺に代入することを意味する。また、
【０１１２】
【外５１】

はＬ次元零ベクトルを表す。さらに
【０１１３】
【外５２】

は
【０１１４】
【外５３】

の第ｌ番目を１とし、Ｓから｛ｌ｝を除いた全ての要素を零に設定した時の
【０１１５】
【外５４】

の値、すなわち、式（１４）に示した事後分布
【０１１６】
【外５５】

の値を表す。つまり、
【０１１７】
【外５６】

と初期化された
【０１１８】
【外５７】

に対し、Ｌ個の要素の１つだけ１として事後分布が最大となるｌを見つけｌ^＊とし、次に、ｌ^＊＝１と固定して、残りのＬ−１個の要素に対し、１つだけ１として事後分布が最大となるｌを見つけていくという処理を、事後分布が増大しなくなるまで繰り返す。
【０１１９】
上記アルゴリズムは大域的最適性は保証しないが、式（２５）の評価が高々Ｌ（Ｌ＋１）／２回で済み、全数探索の２^Ｌ−１回に比べ極めて効率的である。
【０１２０】
語彙数をＶ＝１００、トピックス数をＬ＝１０として人工的に作成した単語頻度ベクトルからなる人工テキストを用いた実験で本発明の有効性を示す。
【０１２１】
まず、ジップの法則を考慮しつつ、乱数を用いて各トピックの多項分布パラメータ（基底ベクトル）を設定した。そして、作成したパラメータに基づき１，０００テキストからなる学習データ、および１００，０００テキストからなるテストデータを生成した。ただし、各テキストが持つトピックス数をｍとすれば、その分布は１／２^ｍとなるようにした。すなわち、学習データではトピックス数が１のテキストが５００、トピックス数が２のテキストが２５０などとなり、多重度が増す程、テキスト数が指数的に減少するようにし、現実データの分布を反映させた。一方、各テキストの単語頻度ベクトルについては、既に説明したようにパラメータの線形和を用いて多重トピックスの多項分布を作り、この分布に基づいて単語の頻度ベクトルを生成した。
【０１２２】
図１に示した本発明の実施形態の構成図に従い、まず、学習データをテキスト前処理部１に入力して処理を施し、次いで、その結果をモデルパラメータ推定部２に入力して学習することにより推定モデルパラメータを求めた。そして、テストデータをテキスト前処理部４に入力して処理を施し、得られた単語頻度ベクトルとすでに求めた推定モデルパラメータを多重トピックス予測部５に入力して多重トピックスを予測することにより抽出結果を得た。テストデータの各々の正解トピック情報は既知故、予測結果と比較することにより多重トピックス抽出方法の評価が可能となる。
【０１２３】
図３に、これまで世界最高性能と報告されていたサポートベクトルマシンと本発明での性能を比較する。ただし、サポートベクトルマシンの適用では、各トピック毎の２分類問題を作り、学習データを用いて２分類器を構成し、その分類器群を用いて多重トピックス抽出結果を予測した。抽出性能の評価には、情報検索などで標準的に利用されるＦ値を採用した。なお、Ｆ値は、的中率と網羅率の調和平均として定義される。学習データ数が１，０００の場合、本発明を適用すれば、サポートベクトルマシンより約１５％も高い性能が得られることが分かる。また、学習データ数を減らした評価では、サポートベクトルマシンの性能がかなり劣化するのに対し、本発明の適用では極めて僅かな劣化であった。すなわち、本発明は学習データ数の変動に対しサポートベクトルマシンに比べはるかに頑健な手法であると言える。
【０１２４】
なお、以上説明した、テキストの多重トピックス抽出方法は専用のハードウェアにより実現されるもの以外に、その機能を実現するためのプログラムを、コンピュータ読み取り可能な記録媒体に記録して、この記録媒体に記録されたプログラムをコンピュータシステムに読み込ませ、実行するものであってもよい。コンピュータ読み取り可能な記録媒体とは、フロッピーディスク、光磁気ディスク、ＣＤ−ＲＯＭ等の記録媒体、コンピュータシステムに内蔵されるハードディスク装置等の記憶装置を指す。さらに、コンピュータ読み取り可能な記録媒体は、インターネットを介してプログラムを送信する場合のように、短時間の間、動的にプログラムを保持するもの（伝送媒体もしくは伝送波）、その場合のサーバとなるコンピュータシステム内部の揮発性メモリのように、一定時間プログラムを保持しているものも含む。
【０１２５】
【発明の効果】
以上説明したように、本発明は、多重トピックステキスト中の単語頻度分布を確率分布としてモデル化し、確率モデルを単一トピックモデルの線形和により生成することにより、テキストの多重トピックス抽出を従来よりも良好に行なうことができる。
【図面の簡単な説明】
【図１】本発明の一実施形態のテキストの多重トピックス抽出装置の構成図である。
【図２】図１の装置の動作を示すフローチャートである。
【図３】本発明の効果をサポートベクトルマシンと比較して示すグラフである。
【図４】本発明の基本的な考え方を説明するための図である。
【符号の説明】
１，４　　テキスト前処理部
２　　モデルパラメータ推定部
３　　推定モデルパラメータ保持部
５　　多重トピックス予測部
１１〜１５　　ステップ

[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a method and apparatus for automatically classifying unstructured text that is stored electronically, such as text on the Internet, in electronic libraries, in e-mail, and in electronic medical records, into predetermined topics.
[0002]
[Prior art]
In recent years, large amounts of text have been stored electronically. Since text is usually composed of multiple topics rather than a single topic, developing a method to extract multiple topics from text has become an important research topic. This extraction problem is different from the pattern recognition problem that classifies samples into an exclusive single class, such as character recognition.
[0003]
The multi-topic extraction problem is known to many researchers as a multi-class, multi-label text classification problem. Conventionally, a binary classification approach has been adopted, in which each topic is examined in turn to decide whether the text belongs to it. That is, the binary classification approach solves the multi-topic extraction problem by decomposing it into independent binary classification problems, one per topic. In this case, the binary classifier responsible for a topic classifies a text as belonging either to that topic or to the other topics. Well-known binary classifiers include the support vector machine (SVM; V. N. Vapnik, "Statistical learning theory," John Wiley & Sons, Inc., New York, 1998) and the naive Bayes method (D. Lewis and M. Ringuette, "A comparison of two learning algorithms for text categorization," in Third Annual Symposium on Document Analysis and Information Retrieval (SDAIR'94), 81-93, 1994).
[0004]
[Problems to be solved by the invention]
However, these binary classification approaches do not consider multiple topics simultaneously. In other words, because the binary classification approach does not consider a generative model of multi-topic text, its performance is considered to be limited.
[0005]
In principle, function approximation methods such as neural networks, and the k-nearest neighbor method, which classifies according to the similarity between feature vectors, can extract multiple topics without decomposing the problem into single topics as the binary classification approach does. However, these methods also do not consider a generative model of multi-topic text, and so, like the binary classification approach, their performance is considered to be limited.
[0006]
An object of the present invention is to provide a method, an apparatus, and a program for extracting multiple topics at a stroke, and a recording medium on which the program is recorded.
[0007]
[Means for Solving the Problems]
Representation of text
The method of representing text in the present invention is described first. Words contained in a predetermined vocabulary are extracted from the text, and their frequencies of use are expressed as a vector. That is, the word frequencies of one text are represented by the vector
[Outside 1]

Here, x_i is the number of times the word w_i in the vocabulary
[Outside 2]

appears in the text, and V is the total number of words in the vocabulary. That is,
[0010]
[Outside 3]

is represented as a point in V-dimensional Euclidean space. Furthermore,
[0011]
[Outside 4]

is assumed to be generated from a multinomial distribution over all the words in the vocabulary.
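The word frequency representation described above can be sketched in Python as follows. This is a minimal illustration only: the function name, the example vocabulary, and the tokenization by lowercase whitespace splitting are assumptions, since the text does not specify how words are extracted.

```python
from collections import Counter


def word_frequency_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in `text`.

    Tokenization by lowercase whitespace splitting is an assumption made
    for this sketch; the source does not specify a tokenizer.
    """
    counts = Counter(text.lower().split())
    # Words outside the vocabulary are ignored; absent words get count 0.
    return [counts[w] for w in vocabulary]


vocabulary = ["goal", "team", "guitar"]  # hypothetical vocabulary, V = 3
x = word_frequency_vector("Team scores goal after goal", vocabulary)
print(x)  # [2, 1, 0]
```

Each text thus becomes one point with V non-negative integer coordinates.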
[0012]
(Equation 1)

Here,
[0013]
[Outside 5]

is the model parameter vector, whose i-th element θ_i represents the probability that the word w_i occurs. Clearly,
[0014]
(Equation 2)

[0015]
Next, the topic vector to which the text belongs is defined by
[Equation 3]

Here, the l-th element y_l of
[0017]
[Outside 6]

takes the value 1 or 0, with y_l = 1 if and only if the text belongs to the l-th topic. L is the total number of topics and is known in advance. It is also assumed that the text belongs to at least one of the L topics. That is, at least one of the elements of
[0018]
[Outside 7]

takes the value 1.
Representation of the parameters of the probability model of multi-topic text
The basic idea of the multi-topic probability model at the core of the present invention is explained below using a simple example with two topics (L = 2) and a vocabulary of three words (w_1, w_2, w_3) (V = 3).
[0019]
Suppose that the words in texts belonging to the single topics C_1 and C_2 are each generated from the multinomial distributions
[Outside 8]

and that the parameters of these multinomial distributions are φ(C_1) = (0.7, 0.1, 0.2) and φ(C_2) = (0.1, 0.7, 0.2). This means that, in texts belonging to topic C_1, the three words w_1, w_2, and w_3 occur with probabilities 0.7, 0.1, and 0.2, respectively. The same applies to topic C_2.
[0021]
In FIG. 4(a), '0' and '+' are samples (word frequency vectors) artificially generated from φ(C_1) and φ(C_2), respectively. Each '0' ('+') corresponds to one text of topic C_1 (C_2). The total number of words in a text, that is, the sum of the elements of its frequency vector, is distributed in the range of 100 to 800. The parameter vector φ lies on the two-dimensional simplex θ_1 + θ_2 + θ_3 = 1, shown as the equilateral triangle in FIG. 4(c).
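The generation of such artificial single-topic samples can be sketched as follows. The parameter values and the 100 to 800 range come from the text; the function name, the fixed random seed, and the uniform choice of the text length are assumptions made for illustration.

```python
import random


def sample_word_counts(phi, n_words, rng):
    """Draw a word frequency vector with n_words words in total
    from a multinomial distribution with parameter phi."""
    counts = [0] * len(phi)
    for i in rng.choices(range(len(phi)), weights=phi, k=n_words):
        counts[i] += 1
    return counts


rng = random.Random(0)            # fixed seed, for reproducibility only
phi_c1 = (0.7, 0.1, 0.2)          # parameter of topic C_1 (from the text)
phi_c2 = (0.1, 0.7, 0.2)          # parameter of topic C_2 (from the text)
n_words = rng.randint(100, 800)   # text length; uniform choice is an assumption
x1 = sample_word_counts(phi_c1, n_words, rng)
x2 = sample_word_counts(phi_c2, n_words, rng)
```

Dividing each count by `n_words` gives a point near the corresponding parameter vector on the simplex.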
[0022]
Let C_{1,2} denote the multi-topic class of texts that belong to both topics C_1 and C_2. The words in a text belonging to C_{1,2} can then be regarded as a mixture of words related to C_1 and words related to C_2. For example, words related to both topics are expected to appear in a text that belongs to both "sports" and "music". However, even a text that belongs to both "sports" and "music" may be more strongly related to "sports", so the mixture ratio between the two topics, that is, their relative strength, is not necessarily equal.
[0023]
Following this idea of a "mixture of words", word frequency samples belonging to C_{1,2} were generated artificially as mixtures of word frequency vectors belonging to C_1 and C_2, as shown by '△' in FIG. 4(b). The mixture ratio was set at random in the range of 0.2 to 0.8. The samples of C_{1,2} are distributed so as to interpolate between the distributions of the samples of C_1 and C_2.
[0024]
Note that the samples belonging to C_{1,2} cannot be generated from a mixture distribution of the two multinomial distributions
[Outside 9]

The maximum likelihood estimate of the parameter φ(C_k) is proportional to
[Outside 10]

From this fact and from the process by which the samples of C_{1,2} are generated, it can be seen that the model parameter φ(C_{1,2}) of the multi-topic class C_{1,2} can be approximated by a linear sum of φ(C_1) and φ(C_2). That is, the samples belonging to C_{1,2} can be regarded as realizations of a multinomial distribution with the parameter
(Equation 4)

where α (0 < α < 1) represents the mixture ratio. FIG. 4(c) shows the maximum likelihood estimate of the parameter of C_{1,2} calculated from the artificially generated samples.
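As a worked example of this linear sum, the sketch below mixes the two single-topic parameters with a mixture ratio α. The equation image is missing from this text, so the convex form α·φ(C_1) + (1 − α)·φ(C_2) is an assumption; it is the form consistent with 0 < α < 1 and with the elements summing to 1. The function name `mix` is hypothetical.

```python
def mix(phi_a, phi_b, alpha):
    """Convex combination alpha * phi_a + (1 - alpha) * phi_b (assumed form)."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(phi_a, phi_b)]


phi_c1 = (0.7, 0.1, 0.2)
phi_c2 = (0.1, 0.7, 0.2)
phi_c12 = mix(phi_c1, phi_c2, 0.5)
print([round(p, 3) for p in phi_c12])  # [0.4, 0.4, 0.2]
```

With α = 0.5 the mixed parameter sits midway between the two single-topic parameters; other values of α move it along the line segment joining them.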
[0028]
Generalizing the above idea, the frequency distribution of the words in a text belonging to multiple topics is a multinomial distribution whose parameter is expressed as a linear sum of the parameters of the single-topic multinomial distributions, which serve as basis parameters. That is, the word frequency distribution of a text with the topic vector
[Outside 11]

follows a multinomial distribution whose parameter is
[0030]
(Equation 5)

Here,
[0031]
[Outside 12]

represents the parameter of the multinomial distribution of the single topic C_l.
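The generalization to an arbitrary topic vector can be sketched as follows. Because the equation image is missing, the weighting is an assumption: the topics with y_l = 1 are given equal weight, which matches the statement in the text that equation (4) does not consider per-topic weighting. The function name `mixed_parameter` is hypothetical.

```python
def mixed_parameter(y, phis):
    """Equal-weight linear sum of the single-topic parameters selected by y.

    y    : topic vector of 0/1 values, one entry per topic
    phis : list of single-topic multinomial parameters (basis parameters)
    The equal weighting 1 / (number of active topics) is an assumption.
    """
    active = [l for l, y_l in enumerate(y) if y_l == 1]
    if not active:
        raise ValueError("a text must belong to at least one topic")
    n_words = len(phis[0])
    return [sum(phis[l][i] for l in active) / len(active) for i in range(n_words)]


phis = [(0.7, 0.1, 0.2), (0.1, 0.7, 0.2)]
theta_single = mixed_parameter((1, 0), phis)  # reduces to the single-topic parameter
theta_both = mixed_parameter((1, 1), phis)    # midway between the two topics
```

Since each basis parameter sums to 1 and the weights sum to 1, the result is again a valid multinomial parameter.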
[0032]
As mentioned earlier, a multi-topic text may place more emphasis on one particular topic among its topics. Equation (4) does not take such weighting into account. A more flexible linear sum that includes these weights as parameters is therefore defined by the following equation.
[0033]
(Equation 6)

[0034]
Here,
[0035]
[Outside 13]

and the mixture ratios α_{l,m} (> 0) satisfy
(Equation 7)

Note that, since α_{l,l} = 0.5,
[Outside 14]

holds. In addition,
[0038]
[Outside 15]

holds. Note also that the V elements sum to 1 in both equations (4) and (5).
[0039]
The difference between equations (4) and (5) is that in equation (4) the unknown parameter Θ consists of the parameters of the single-topic multinomial distributions,
(Equation 8)

whereas in equation (5), in addition to the parameters of equation (6), α_{l,m} (l ≠ m), or equivalently
[Outside 16]

are also treated as unknown parameters.
[0042]
(Equation 9)

[0043]
For either linear sum, for a multi-topic text belonging to the topic vector
[Outside 17]

the word frequency vector
[Outside 18]

has the probability distribution
(Equation 10)

Here,
[0047]
[Outside 19]

denotes the i-th element.
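The multinomial probability of a word frequency vector can be evaluated on the log scale as sketched below. The sketch drops the multinomial coefficient, which does not depend on the parameter and therefore does not affect any comparison between parameters; the function name and the example counts are hypothetical.

```python
import math


def log_likelihood(x, theta):
    """Multinomial log-likelihood of counts x under parameter theta,
    omitting the multinomial coefficient (constant in theta)."""
    return sum(x_i * math.log(t_i) for x_i, t_i in zip(x, theta) if x_i > 0)


x = (7, 1, 2)                                     # hypothetical word counts
ll_topic1 = log_likelihood(x, (0.7, 0.1, 0.2))
ll_topic2 = log_likelihood(x, (0.1, 0.7, 0.2))
print(ll_topic1 > ll_topic2)  # True
```

The counts above are dominated by the first word, so the parameter that gives that word probability 0.7 explains them better.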
[0048]
Linear sums other than the above are also conceivable. The present invention is characterized in that, for the topic vector
[Outside 20]

the corresponding model parameter
[Outside 21]

is expressed as a linear sum of the parameters of the L single-topic multinomial distributions
[Outside 22]

The form of the linear sum is therefore not limited to equations (4) and (5).
Estimation of the parameters of the probability model
Next, the method of estimating the unknown parameters is described. Let
[0052]
[Equation 11]

be the given learning data, where
[0053]
[Outside 23]

represent the word frequency vector and the multi-topic vector of the n-th text, and N is the total number of texts. The unknown parameter Θ is then estimated by maximizing the posterior distribution of the parameters given the learning data
[Outside 24]

That is,
[0055]
(Equation 12)

The prior distributions of the parameters
[Outside 25]

and α_{l,m} are each taken to be the Dirichlet distribution, which is the conjugate prior of the multinomial distribution, as shown in the following equations.
[0057]
(Equation 13)

Here, ξ and ζ are hyperparameters, and usually ξ = 2 and ζ = 2.
[0058]
Assuming that the topic vector
[Outside 26]

is uniformly distributed, it follows from equation (10) and Bayes' theorem that
[Outside 27]

is obtained by maximizing the following objective function with respect to Θ:
[Equation 14]

Prediction of the topic vector
Next, a method of predicting the topic vector of a new text using the estimated model parameters is described. Let
[0062]
[Outside 28]

be the estimated parameters. Prediction here means predicting, from the word frequency vector
[Outside 29]

of a new text, the value of its topic vector
[Outside 30]

The optimal topic vector value is found as follows: given
[Outside 31]

and
[0066]
[Outside 32]

the posterior distribution of
[Outside 33]

is maximized, and the maximizing
[Outside 34]

is taken as the solution.
[0069]
From Bayes' theorem,
[Equation 15]

Furthermore, assuming that the prior distribution of
[0071]
[Outside 35]

is uniform, the optimal topic vector
[Outside 36]

is ultimately found by maximizing
[0073]
[Outside 37]

and is given by
[Outside 38]

[0075]
(Equation 16)

[0076]
BEST MODE FOR CARRYING OUT THE INVENTION
Next, embodiments of the present invention will be described with reference to the drawings.
[0077]
FIG. 1 is a block diagram of an apparatus for extracting multiple topics of text according to an embodiment of the present invention, and FIG. 2 is a flowchart showing the operation thereof.
[0078]
A text serving as learning data is input to the text preprocessing unit 1, which calculates the frequency of each word in the vocabulary and creates a word frequency vector (step 11). Based on this frequency vector, the frequency distribution of the words of a text having multiple topics, that is, the parameter of the probability model of multi-topic text, is expressed as a linear sum of the parameters of the probability models of the single-topic texts (step 12). Next, the model parameter estimation unit 2 learns the parameters of the probability model using the word frequency vector and the topic vector to which the text belongs, and stores the learning result in the estimated model parameter storage unit 3 (step 13). For a text whose topics are unknown, the text preprocessing unit 4 calculates a word frequency vector (step 14), and the multi-topic prediction unit 5 extracts the multiple topics to which the text belongs from this word frequency vector, using the learned parameters of the probability model stored in the estimated model parameter storage unit 3 (step 15).
[0079]
Hereinafter, the processing of the model parameter estimation unit 2 and the multi-topic prediction unit 5 which are the core of the present embodiment will be described in detail.
[0080]
Model parameter estimation unit 2
For the linear sum of equation (4), the objective function of equation (13) takes the following concrete form.
[0081]
[Equation 17]

Here,
[0082]
[Outside 39]

is the log-likelihood term, given by
(Equation 18)

The optimal parameters are therefore obtained by maximizing equation (15) with respect to Θ. However, this maximization cannot be carried out analytically, so the parameters are obtained by an iterative method, as described below.
[0084]
For convenience, let
[0085]
[Equation 19]

let
[0086]
[Outside 40]

be the estimate at the t-th step of the iteration, and further let
[0087]
(Equation 20)

Note that
[0088]
[Outside 41]

holds. Equation (17) can then be rewritten as follows.
[0089]
(Equation 21)

where
[0090]
[Outside 42]

is defined by the following equation.
[0091]
(Equation 22)

[0092]
From Jensen's inequality,
[0093]
(Equation 23)

holds. Noting this, if
[Equation 24]

then, from equation (20),
(Equation 25)

holds. Therefore, by maximizing
[0096]
(Equation 26)

with respect to Θ,
[Outside 43]

can be increased.
[0097]
The maximization of equation (23) can be solved by the method of Lagrange multipliers, giving
[Equation 27]

Here,
[0099]
[Outside 44]

is given by equation (19). Calculating equation (24) for l = 1, ..., L and i = 1, ..., V yields the unknown parameters of the model for the linear sum of equation (4).
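An iterative estimation of this kind can be sketched in Python as follows. The equation images are missing from this text, so the sketch is a reconstruction under stated assumptions rather than the patent's equations verbatim: it assumes the equal-weight linear sum (each topic of a text weighted by 1 / number of its topics), a Dirichlet prior with hyperparameter ξ on each basis parameter, and a closed-form update of the EM type derived from a Jensen lower bound. The names `map_objective` and `update` and the toy data are hypothetical.

```python
import math


def map_objective(X, H, theta, xi=2.0):
    """Log-likelihood plus log Dirichlet prior, up to additive constants."""
    n_topics, n_words = len(theta), len(theta[0])
    value = 0.0
    for x, h in zip(X, H):
        for i in range(n_words):
            if x[i]:
                value += x[i] * math.log(sum(h[l] * theta[l][i] for l in range(n_topics)))
    return value + (xi - 1.0) * sum(math.log(t) for row in theta for t in row)


def update(X, H, theta, xi=2.0):
    """One iteration: share each word count among the topics of its text in
    proportion to h_l * theta_l,i, then renormalize each topic's totals."""
    n_topics, n_words = len(theta), len(theta[0])
    totals = [[xi - 1.0] * n_words for _ in range(n_topics)]  # prior pseudo-counts
    for x, h in zip(X, H):
        for i in range(n_words):
            if not x[i]:
                continue
            denom = sum(h[l] * theta[l][i] for l in range(n_topics))
            for l in range(n_topics):
                if h[l] > 0.0:
                    totals[l][i] += x[i] * h[l] * theta[l][i] / denom
    return [[v / sum(row) for v in row] for row in totals]


# Toy learning data: word frequency vectors X and topic vectors Y (hypothetical).
X = [(7, 1, 2), (1, 7, 2), (4, 4, 2)]
Y = [(1, 0), (0, 1), (1, 1)]
H = [[y_l / sum(y) for y_l in y] for y in Y]  # equal weights over active topics
theta = [[1.0 / 3] * 3 for _ in range(2)]     # uniform initial estimate
history = [map_objective(X, H, theta)]
for _ in range(20):
    theta = update(X, H, theta)
    history.append(map_objective(X, H, theta))
```

Each call to `update` maximizes a lower bound that touches the objective at the current estimate, so the values in `history` should never decrease; this mirrors the monotonic-increase argument made in the text.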
[0100]
Multi-topic prediction unit 5
For equation (4), predicting the multiple topics on the basis of equation (15) amounts to maximizing the following expression with respect to
[Outside 45]

[0102]
[Equation 28]

[0103]
In the simplest approach, this maximization problem can be solved by evaluating all possible values of
[0104]
[Outside 46]

However, since there are 2^L − 1 candidate solutions, such an exhaustive search becomes impractical in realistic time as L grows. An approximate solution is therefore obtained with the approximation algorithm shown below.
Prediction algorithm
Step 1. Initialization: S := {1, 2, ..., L},
[0105]
[Outside 47]

[0106]
Step 2. While S is not the empty set, execute the following.
Step 2-1. For each element l in S, calculate
[0107]
[Outside 48]

and denote the result by υ(l).
[0108]
Step 2-2. Let l* be the l that maximizes υ(l). If υ(l*) > υ_max, set
[0109]
[Outside 49]

and υ_max := υ(l*), and return to step 2. Otherwise, terminate with
[0110]
[Outside 50]

as the final solution.
[0111]
Here, the notation ":=" means that the value on the right-hand side is assigned to the left-hand side, and
[0112]
[Outside 51]

denotes the L-dimensional zero vector. Furthermore,
[Outside 52]

is defined as follows: with the l-th element of
[0114]
[Outside 53]

set to 1 and all the elements in S other than l set to zero, it is the resulting value of
[Outside 54]

that is, the value of the posterior distribution shown in equation (14),
[Outside 55]

In other words, starting from the initialization
[0117]
[Outside 56]

of
[Outside 57]

the algorithm first sets exactly one of the L elements to 1 and finds the l for which the posterior distribution is largest, calling it l*. Next, with the l*-th element fixed at 1, it sets exactly one of the remaining L − 1 elements to 1 and again finds the l for which the posterior distribution is largest. This process is repeated until the posterior distribution no longer increases.
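The greedy search described above can be sketched in Python as follows. The control flow follows the steps in the text; the score being maximized is an assumption, since the equation images are missing: with a uniform prior on the topic vector, the sketch uses the multinomial log-likelihood under an equal-weight linear sum of the basis parameters. The names `score` and `greedy_predict` and the example parameters are hypothetical.

```python
import math


def score(x, y, phis):
    """Log-likelihood of counts x under the equal-weight mixture selected by y
    (assumed form of the quantity being maximized)."""
    active = [l for l, y_l in enumerate(y) if y_l]
    total = 0.0
    for i, x_i in enumerate(x):
        if x_i:
            total += x_i * math.log(sum(phis[l][i] for l in active) / len(active))
    return total


def greedy_predict(x, phis):
    """Add one topic at a time while the score keeps increasing."""
    n_topics = len(phis)
    remaining = set(range(n_topics))  # the set S of candidate topics
    y = [0] * n_topics                # topic vector, initialized to zero
    best = -math.inf                  # current maximum score
    evaluations = 0
    while remaining:
        scores = {}
        for l in remaining:
            trial = list(y)
            trial[l] = 1              # tentatively switch topic l on
            scores[l] = score(x, trial, phis)
            evaluations += 1
        l_star = max(scores, key=scores.get)
        if scores[l_star] <= best:
            break                     # no further improvement: stop
        y[l_star] = 1
        best = scores[l_star]
        remaining.remove(l_star)
    return y, evaluations


phis = [(0.7, 0.1, 0.2), (0.1, 0.7, 0.2), (0.1, 0.1, 0.8)]
y_pred, n_evals = greedy_predict((40, 40, 20), phis)
print(y_pred, n_evals)  # [1, 1, 0] 6
```

In this example the counts are split evenly between the first two words, so the search first picks one of the first two topics, then adds the other, and stops when adding the third topic lowers the score.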
[0119]
Although the above algorithm does not guarantee global optimality, equation (25) needs to be evaluated at most L(L + 1)/2 times, which is far more efficient than the 2^L − 1 evaluations of an exhaustive search.
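The two evaluation counts can be compared directly. The short sketch below computes both for L = 10 topics; the function names are hypothetical.

```python
def greedy_evaluations(n_topics):
    """Upper bound on evaluations for the greedy search: L + (L-1) + ... + 1."""
    return n_topics * (n_topics + 1) // 2


def exhaustive_evaluations(n_topics):
    """Number of non-zero 0/1 topic vectors of length L."""
    return 2 ** n_topics - 1


print(greedy_evaluations(10), exhaustive_evaluations(10))  # 55 1023
```

The gap widens quickly: the greedy bound grows quadratically in L, while the exhaustive count doubles with every added topic.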
[0120]
The effectiveness of the present invention is shown by an experiment using an artificial text composed of a word frequency vector artificially created with a vocabulary number of V = 100 and a topic number of L = 10.
[0121]
First, the multinomial distribution parameters (basis vectors) of each topic were set using random numbers, taking Zipf's law into account. Based on these parameters, learning data consisting of 1,000 texts and test data consisting of 100,000 texts were generated. The number of topics m in each text was distributed as 1/2^m. That is, in the learning data, 500 texts have one topic, 250 texts have two topics, and so on, so that the number of texts decreases exponentially as the number of topics per text increases, reflecting the distribution of real data. For the word frequency vector of each text, a multi-topic multinomial distribution was constructed from the linear sum of the parameters, as already described, and the word frequency vector was generated from this distribution.
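The distribution of the number of topics per text can be illustrated as follows. The 1/2^m weighting and the figures of 1,000 texts and 10 topics come from the text; how the m topics of a text are chosen is not stated, so drawing them uniformly without replacement is an assumption, as are the function names and the fixed seed.

```python
import random


def expected_counts(n_texts, n_topics):
    """Expected number of texts having m = 1..n_topics topics when P(m) = 1 / 2**m."""
    return [n_texts / 2 ** m for m in range(1, n_topics + 1)]


def sample_topic_vector(n_topics, rng):
    """Draw m with weight 1 / 2**m, then pick m distinct topics uniformly
    (the uniform choice of topics is an assumption)."""
    sizes = list(range(1, n_topics + 1))
    m = rng.choices(sizes, weights=[0.5 ** s for s in sizes], k=1)[0]
    chosen = set(rng.sample(range(n_topics), m))
    return [1 if l in chosen else 0 for l in range(n_topics)]


counts = expected_counts(1000, 10)
print(counts[:3])  # [500.0, 250.0, 125.0]
y = sample_topic_vector(10, random.Random(0))
```

The first two expected counts reproduce the 500 and 250 texts quoted for the learning data.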
[0122]
Following the configuration of the embodiment of the present invention shown in FIG. 1, the learning data were first input to the text preprocessing unit 1 for processing, and the result was then input to the model parameter estimation unit 2 for learning, yielding the estimated model parameters. Next, the test data were input to the text preprocessing unit 4 for processing, and the resulting word frequency vectors and the estimated model parameters were input to the multi-topic prediction unit 5, which predicted the multiple topics and produced the extraction results. Since the correct topic information of each test text is known, the multi-topic extraction method can be evaluated by comparing it with the prediction results.
[0123]
FIG. 3 compares the performance of the present invention with that of the support vector machine, which had been reported to give the best performance to date. In applying the support vector machine, a binary classification problem was created for each topic, a binary classifier was trained on the learning data for each, and the resulting group of classifiers was used to predict the multi-topic extraction result. Extraction performance was evaluated with the F value, which is used as a standard measure in information retrieval and related fields. The F value is defined as the harmonic mean of precision and recall. With 1,000 learning texts, the present invention achieves performance about 15% higher than the support vector machine. Moreover, when the number of learning texts was reduced, the performance of the support vector machine deteriorated considerably, whereas the performance of the present invention deteriorated only very slightly. In other words, the present invention is far more robust than the support vector machine to changes in the amount of learning data.
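The F value for a single text can be computed as sketched below. The harmonic-mean definition comes from the text; how the values are aggregated over the test texts is not stated there, so this sketch covers one text only, and the function name and example vectors are hypothetical.

```python
def f_value(true_y, pred_y):
    """Harmonic mean of precision and recall for one text's topic vector."""
    hits = sum(1 for t, p in zip(true_y, pred_y) if t and p)
    if hits == 0:
        return 0.0                        # no correctly predicted topic
    precision = hits / sum(pred_y)        # fraction of predicted topics that are correct
    recall = hits / sum(true_y)           # fraction of true topics that were predicted
    return 2 * precision * recall / (precision + recall)


print(f_value((1, 1, 0, 0), (1, 0, 1, 0)))  # 0.5
```

In the example one of the two predicted topics is correct and one of the two true topics is found, so precision and recall are both 0.5.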
[0124]
The method of extracting multiple topics from text described above may be realized by dedicated hardware, or alternatively by recording a program that implements its functions on a computer-readable recording medium and having a computer system read and execute the program recorded on that medium. A computer-readable recording medium refers to a recording medium such as a floppy disk, a magneto-optical disk, or a CD-ROM, or to a storage device such as a hard disk drive built into a computer system. The term also covers media that hold the program dynamically for a short time (a transmission medium or transmission wave), as when the program is transmitted via the Internet, and media that hold the program for a certain period of time, such as the volatile memory inside the computer system that acts as the server in that case.
[0125]
[Effect of the Invention]
As described above, the present invention models the word frequency distribution in multi-topic text as a probability distribution and constructs the probability model as a linear sum of single-topic models, so that multiple topics can be extracted from text better than with conventional methods.
[Brief description of the drawings]
FIG. 1 is a configuration diagram of an apparatus for extracting multiple topics of text according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the operation of the apparatus of FIG.
FIG. 3 is a graph showing the effect of the present invention in comparison with a support vector machine.
FIG. 4 is a diagram for explaining a basic concept of the present invention.
[Explanation of symbols]
1, 4 text preprocessing unit
2 model parameter estimation unit
3 estimated model parameter storage unit
5 multi-topic prediction unit
11 to 15 steps

Claims

A method for extracting, from an arbitrary text, one or more topics to which the text belongs, the method comprising:
Expressing the text in terms of the frequency of use of the words in a predetermined vocabulary;
Expressing, based on the word frequency information, the frequency distribution of the words of a text having multiple topics, that is, the parameter of the probability model of multi-topic text, as a linear sum of the parameters of the probability models of the single-topic texts;
Learning the parameters of the probability model from learning data consisting of pairs of the word frequency information and the topic information to which the text belongs, and storing the learned parameters of the probability model in a storage device;
Calculating word frequency information for a text whose topics are unknown; and
Extracting the multiple topics to which the text belongs from the word frequency information of the text whose topics are unknown, using the learned parameters of the probability model stored in the storage device.

An apparatus for extracting, from an arbitrary text, one or more topics to which the text belongs, the apparatus comprising:
Means for expressing the text in terms of the frequency of use of the words in a predetermined vocabulary;
Means for expressing, based on the word frequency information, the frequency distribution of the words of a text having multiple topics, that is, the parameter of the probability model of multi-topic text, as a linear sum of the parameters of the probability models of the single-topic texts;
Means for learning the parameters of the probability model from learning data consisting of pairs of the word frequency information and the topic information to which the text belongs, and storing the learned parameters of the probability model in a storage device;
Means for calculating word frequency information for a text whose topics are unknown; and
Means for extracting the multiple topics to which the text belongs from the word frequency information of the text whose topics are unknown, using the learned parameters of the probability model stored in the storage device.

A program for extracting, from an arbitrary text, one or more topics to which the text belongs, the program causing a computer to execute:
A procedure for expressing the text in terms of the frequency of use of the words in a predetermined vocabulary;
A procedure for expressing, based on the word frequency information, the frequency distribution of the words of a text having multiple topics, that is, the parameter of the probability model of multi-topic text, as a linear sum of the parameters of the probability models of the single-topic texts;
A procedure for learning the parameters of the probability model from learning data consisting of pairs of the word frequency information and the topic information to which the text belongs, and storing the learned parameters of the probability model in a storage device;
A procedure for calculating word frequency information for a text whose topics are unknown; and
A procedure for extracting the multiple topics to which the text belongs from the word frequency information of the text whose topics are unknown, using the learned parameters of the probability model stored in the storage device.

A recording medium on which the program for extracting multiple topics of text according to claim 3 is recorded.