JP4306223B2

JP4306223B2 - Evaluation system for document filtering system

Info

Publication number: JP4306223B2
Application number: JP2002310425A
Authority: JP
Inventors: 雅夫山本; 敬士矢島; 博之絹川
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2002-10-25
Filing date: 2002-10-25
Publication date: 2009-07-29
Anticipated expiration: 2022-10-25
Also published as: JP2004145679A

Description

【０００１】
【発明の属する技術分野】
本発明は論文フィルタリング評価システムに関する。
【０００２】
【従来の技術】
論文フィルタリングシステムでは、テストサンプルとプロファイルとの類似の度合いを数値化した「類似度」を計算し、類似度がある閾値より大きい場合は適合、小さい場合は非適合と判定することが多い。個人の興味をデータベース化したものがプロファイルであり、テストサンプルとプロファイルとの類似度は、テストサンプルに対する個人の興味の度合のシステム推定値であると考えられるからである。この種の技術を示す文献として例えば特許文献１がある。
【０００３】
X軸に類似度を、Y軸にテストサンプルの数をプロットした「類似度ヒストグラム」を作成したとき、適合テストサンプルが右側に、非適合テストサンプルが左側に分布するほど適合論文分別性能が高いと判断できる。このような類似度ヒストグラムを作成することにより、テストサンプルに対するシステムの適合論文分別性能を視覚的に評価することができる。
【０００４】
しかしより客観的かつ明確な評価を行うためには、システムの性能を定量的に比較する必要がある。定量的な評価方法に関して従来から既に多くの研究が行われているが、いずれも再現率精度グラフをベースとしたものである。再現率精度グラフ上の点は一般に離散値となることが多いため、補間曲線を作成する必要がある。再現率と精度は，類似度に対する閾値の設定により一方が良くなると他方が悪くなるというトレードオフの関係にあり、グラフが対角点 (1,1) に近づくほど総合的な適合論文分別性能が高いと言える。再現率精度グラフはシステムの全体性能をグラフとして定量的に表現するには良いが、再現率，精度という二つのパラメータがあるため、異なる方式の性能の優劣を数値で比較するには不便である。
【０００５】
この点を改善すべくvan Rijsbergenは再現率と精度の重みつき調和平均をとるF尺度を提案している。この場合でも評価値が再現率または精度の値に依存して変動するという問題は残る。この点を解決するためシステムの性能を単一の数値で表すことができる評価方法がいくつか提案されている。
【０００６】
単一数値で表すことができる代表的な３種類の評価手法を以下に示す。以後、以下の３評価指標を合わせて「従来３評価指標」と呼び、従来３評価指標によって得られる評価値を「従来３評価値」と呼ぶ。
（１）ブレイクイブンポイント
Lewisらは再現率精度グラフを描いたとき再現率と精度が等しくなる点の再現率を用いる「ブレイクイブンポイント (P_bep)」という方式を提案している(式1)。
【０００７】
【数１】

【０００８】
ここで、Rは再現率精度グラフの補間曲線上、再現率と精度が等しい点での再現率である。この方式は計測し易いという利点があるが、グラフの一部分だけで評価することになるため大局的な性能を評価値に反映させにくいという欠点がある。
（２）１１点平均精度
Salton らは再現率精度グラフの補間曲線上で再現率の軸を11等分し、各点での精度の平均を求める「11点平均精度 (P_11mean)」を提案している (式2)。
【０００９】
【数２】

【００１０】
ここで、f_p (R_i) は補間曲線上で再現率がR_iのときの精度、R_i (i=1,2,..,11)は再現率の軸を11等分した各点での再現率である。この方式ではグラフの大局的な情報を反映させやすいが11点について再現率から精度を読み取る作業が煩雑な点に難がある。
【００１１】
このほかにも再現率と精度を基にしたいくつかの評価指標が提案されているがいずれも評価値を算出するためにグラフの補間が必要であるという欠点がある。
（３）正規化再現率
グラフの補間が不要な評価手法として、検索論文をひとつずつ増やしたときの理想的な再現率の増加と実際の再現率の増加の差分の累積値を求める「正規化再現率 (P_norm)」がある(式3)。この方式は評価値が最後の適合論文の順位によって左右されやすいという欠点がある。
【００１２】
【数３】

【００１３】
ここでR_iは論文iの適合性順位、N_rel，N_irrelは各々適合，非適合論文の総数である。
【００１４】
【特許文献１】
特開２００１−２５６２５３号公報
【００１５】
【発明が解決しようとする課題】
前記（１）〜（３）の従来３評価指標を用いた場合、適合論文と非適合論文を分別する性能の差異の表現能力つまり分解能が低い点が問題である。なぜならば、いずれの場合も単に類似度の並び順で評価値がユニークに決まり、適合，非適合各クラスの分離の度合は評価値に反映されないからである。
【００１６】
この問題を類似度ヒストグラムは異なるが，従来３評価値は等しい例 (図２) を用いて以下に説明する。この類似度ヒストグラム図２(i)(ii) によればテストサンプル1よりもテストサンプル2のほうが適合，非適合クラスの類似度分布がより明確に分離していることが視認できる。同じテストサンプルに対し、類似度，精度，再現率のデータを類似度の降順に並べ替えた表を図２(i)' (ii)' に示す。各テストサンプルの類似度の値は異なるが、適合が2サンプル、次に非適合が11サンプルのように適合，非適合の並び順は全く等しい。例えば、順位2位までの累積を考えると、テストサンプル1，2共に適合であるため、再現率は2/2=1.0，精度は2/2=1.0となる。順位３位までの累積を考えると、テストサンプル1，2共に非適合であるため、再現率は2/2=1.0，精度は2/3=0.67となる。以下同様に、13位まで累積の再現率，精度を計算すると、テストサンプル1，2共に全く同一値になる。したがってこれらの値をプロットした再現率精度グラフは、図２ (i)'' (ii)"に示すように全く等しくなる。再現率精度グラフが同一ということから再現率精度グラフを基に算出される従来３評価値Pbep , P11mean , Pnormも同一値になる。実際この場合はすべて1に等しい。
図２の例は、再現率と精度を基にした従来の評価指標ではその評価値が単に類似度の並び順のみで決定されるため、適合論文の分別性能の差異を明確に評価値として表すことができない場合があることを示している。新聞記事等のように多くのサンプルが利用可能な場合は従来評価指標でも統計的に適切な評価が可能であるが、学術論文のように分野の細分化と要求される新規性のゆえに関心分野の論文数が本質的に少ない場合は、上記例で示したように有意差がある結果を得にくい。
【００１７】
本発明の目的は、単に類似度の並び順だけでなく、各クラスのサンプルの分離の度合を明確に表すことができる分解能が高い評価手法を提供することにある。
【００１８】
【課題を解決するための手段】
上記目的を達成するために本発明では論文フィルタリング評価システムに、クラス分離度を適用した評価結果を出力する手段を備えたものである。
【００１９】
クラス分離度は、パターン認識の研究分野においては、級間分散(σ² _B ) ÷ 全分散(σ² _T )として定義されるクラス分離度η((式4)〜(式6)) が２クラスデータの分離の度合を評価する目的で利用されていた。
【００２０】
【数４】

【００２１】
【数５】

【００２２】
【数６】

【００２３】
ここで，X_iは分離指標である。パターン認識の場合はサンプルの輝度値，濃淡値などである。またN₁，N₂ は各々，クラス1，2のサンプル数，M₁, M₂, M_m は各々，クラス1，2，全体，に属するサンプルの分離指標の平均値である。
【００２４】
クラス分離度 ηは，各クラスのデータの分離指標の分布が完全に重なっているとき η= 0，完全に分離しているときη= 1となり，その他の場合は 0 < η < 1となる。
【００２５】
本発明では、上述したクラス分離度を論文フィルタリング評価システムに適用し、類似度ヒストグラムとともに出力する。
【００２６】
図２の例の場合、クラス分離度ηを計算すると、テストサンプル1の場合 η= 0.52であるのに対し、テストサンプル2の場合は η= 0.81となり、類似度ヒストグラムの分離の度合の差異を明確に表せることになる。
【００２７】
論文フィルタリング評価システムに、クラス分離度の計算手段及びその結果の出力手段を備えることにより、学術論文のように十分な量のサンプルを集めにくい場合でも適合，非適合論文の類似度の分離の度合を数値として高い分解能で表現でき、より明確な適合論文の分別性能評価が可能になる。
【００２８】
【発明の実施の形態】
論文フィルタリングシステム１は、ＣＰＵ２、ＣＰＵ２がワーク用に使用するメモリ３、入力機器としてのキーボード４、出力機器としてのディスプレイ５と記憶装置７とがバス６で接続された構成となる。記憶装置７には、プロファイル作成プログラム11と分別性能評価プログラム12及びそれぞれのプログラムで用いるプロファイルやテストサンプル等のデータが格納される。各プログラムはＣＰＵに順次読み込まれて解釈実行されることにより、プロファイルの作成処理機能や分別性能評価処理機能を果たす。プロファイル作成プログラムの処理は図３に、分別性能評価処理は図４に詳細フローが示される。プロファイル作成処理では、プロファイル作成用サンプルを用いて利用者から得た興味有無判定情報を元にフィルタリング用プロファイルを作成する。分別性能評価処理では、テスト用サンプルを用いてプロファイルとの類似度を計算し、さらにこれを元にクラス分離度を計算する。計算したクラス分離度を表示することにより、システムの利用者が論文フィルタリングシステムの分別性能を評価を行なうことを容易にする。以下に論文フィルタリングシステム分別性能評価方式処理フローをまとめて示す。
【００２９】
図３はプロファイル作成処理を示す。
【００３０】
まず、プロファイル作成用サンプルの適合性判定結果入力待ち処理(31)では、利用者にプロファイル作成用サンプルを提示し、興味有(適合)と興味なし(非適合)に選別させ、その入力を受け付ける。
【００３１】
入力された結果が興味有(適合)の場合には、適合プロファイルベクトルの作成処理を行なう(32)。適合プロファイルベクトルの作成処理では、適合プロファイル作成用サンプルに含まれる単語群からストップワードを除去して「適合基底ターム」を選定する。「適合基底ターム」をベクトルの基底とし，各タームの各適合論文における論文内出現頻度すなわちTF (Term Frequency) 値の平均値を重みとする「適合プロファイルベクトル」P_relを構成する。
【００３２】
入力された結果が興味無(非適合)の場合には、非適合プロファイルベクトルの作成処理を行なう(33)。非適合プロファイルベクトルの作成処理では、非適合論文に含まれる単語群からストップワードを除去して「非適合基底ターム」を選定する。「非適合基底ターム」をベクトルの基底とし，各タームの各非適合論文におけるTF値の平均値を重みとする「非適合プロファイルベクトル」P_irrelを構成する。
【００３３】
図４は分別性能評価処理を示す。分別性能評価処理においてはまずテストサンプルの適合性の判定結果の入力を受け(41)、論文ベクトルの作成処理を行なう(42)。適合論文ベクトルの作成処理は、上記「適合基底ターム」を基底とし当該論文における各適合基底タームのTF値を求め、それを重みとして適合論文ベクトルD_relを構成する。非適合論文ベクトルの作成処理は、上記「非適合基底ターム」を基底とし当該論文における各非適合基底タームのTF値を求め、それを重みとして非適合論文ベクトルD_irrelを構成する。
【００３４】
論文フィルタリングシステム分別性能評価処理として適合プロファイルとの類似度計算を行なう(43)。ステップ42で作成した合論文ベクトルD_relと、ステップ32で作成した適合プロファイルP_relとの適合類似度(式7)を計算する。
【００３５】
【数７】

【００３６】
ここで・はベクトルの内積を，×はスカラー値の積を表す。
0 ≦ Sim(D,P) ≦ 1 である。
【００３７】
次に、非適合プロファイルとの類似度計算を行なう(44)。ステップ42で作成した非適合論文ベクトルD_irrelと、ステップ33で作成した非適合プロファイルP_irrelとの非適合類似度(式 7)を計算する。
【００３８】
次に、適合・非適合類似度の統合処理を行なう(45)。ここでは評価指標を一次元化するために，統合類似度(式8)を計算する。
【００３９】
【数８】

【００４０】
次にクラス分離度計算を行なう(46)。
【００４１】
ステップ45で求めた統合類似度を元に(式4)〜(式6)の式でクラス分離度を計算する。
【００４２】
クラス分離度ηを論文フィルタリング評価システムに適用した場合、式４〜６の各記号はそれぞれ以下の値とする。
【００４３】
Ｘi：分離指標であり、式８で求めた総合類似度
Ｎ1：適合論文のサンプル数
Ｎ2：非適合論文のサンプル数
Ｍ1：適合論文の分離指標の平均値
Ｍ2：非適合論文の分離指標の平均値
Ｍm：サンプル全体の分離指標の平均値
計算されたクラス分離度を表示する(47)。利用者は、計算したクラス分離度がある閾値より大きければその論文フィルタリングシステムの分別性能は良好、そうでなければ非良好と判断することにより、論文フィルタリングシステム分別性能の評価が行なえる。
【００４４】
上述したフィルタリング処理及び分別性能評価処理を、論文を梗概，緒言・結言，全文の３部位に分割し、各部位の情報を使って論文フィルタリングシステムを構成し、各々の場合の分別性能について従来評価値およびクラス分離度を計算した結果のステップ47における表示例を図５に示す。図５では、適合，非適合サンプル群の分離の度合は、部位間で視覚的に明確な差異があり，梗概プロファイル，緒言・結言プロファイルを用いた場合よりも，全文プロファイルを用いた場合のほうが明らかに大きいことがわかる。
【００４５】
５分割交差検定法による５回の実験の平均値による各部位ごとの新旧評価指標による評価結果を図６に示す。従来評価指標によると梗概部位の情報だけを用いて作成した梗概プロファイル，緒言・結言部位の情報だけを用いて作成した緒言・結言プロファイル，全文の情報を用いて作成した全文プロファイルの間の差異が明確には表れないが，クラス分離度によれば，明確に表れることがわかる。
【００４６】
図６において，評価値の部位間のばらつきの度合いを表す標準偏差を見ると、クラス分離度は0.90であり、0.20, 0.52, 0.66となる従来３評価指標に比して1.5〜4.5倍であることがわかる。この結果は、本発明によるクラス分離度の分解能が従来３評価指標に比べて1.5 〜 4.5倍であることを示しており、類似度散布図 (図５) の分布の分離の度合を適切に反映した数値になっていることがわかる。
【００４７】
図６の結果から，提案指標であるクラス分離度による評価値は、梗概プロファイルを用いた場合0.32，緒言・結言プロファイルを用いた場合0.67であるのに対し、全文プロファイルを用いた場合は2.01と他の部位の3〜6倍となり、最も高い評価値となることを定量的に確認した。従来３評価指標の場合も部位の相違による評価値の大きさの順序は、梗概＜緒言・結言＜全文の順でありその順序に相違はないが、クラス分離度の場合はその差異がより明確に表れている。全文が最も有効であるということが、クラス分離度の導入によって初めて明確に言えるようになった。
【００４８】
【発明の効果】
本発明によれば、論文フィルタリングの評価システムにクラス分離度の概念を適用することにより、学術論文のように十分な量のサンプルを集めにくい場合でも適合，非適合論文の類似度の分離の度合を数値として高い分解能で表現でき、より明確な適合論文の分別性能評価が可能になる。
【図面の簡単な説明】
【図１】本発明の一実施例である論文フィルタリング評価システムの全体構成図。
【図２】従来技術での分別性能の差異の困難性を説明する図。
【図３】プロファイル作成処理を示すフローチャート。
【図４】分別性能評価処理を示すフローチャート。
【図５】学術論文を対象として異なる部位(梗概，緒言・結言，本文)の情報を使った複数の論文フィルタリング方式に対して適合，非適合類似度を計算してプロットした散布図。
【図６】従来評価指標及び本発明によるクラス分離度により論文のプロファイル抽出部位別に分別性能を評価した結果を表した図。
【符号の説明】
１‥論文フィルタリング評価システム、２‥ＣＰＵ、３‥メモリ、４‥キーボード、５‥ディスプレイ、６‥バス、７‥記憶装置、１１プロファイル作成プログラム、１２分別性能評価プログラム。
[0001]
[Technical field of the invention]
The present invention relates to a paper filtering evaluation system.
[0002]
[Prior art]
A paper filtering system typically computes a “similarity” that quantifies how closely a test sample resembles a profile, and judges the sample relevant (conforming) when the similarity exceeds a certain threshold and non-relevant (non-conforming) when it falls below it. This is because a profile is a database of an individual's interests, so the similarity between a test sample and the profile can be regarded as the system's estimate of the individual's degree of interest in that test sample. Patent Document 1, for example, describes this type of technology.
[0003]
When a “similarity histogram” is created with similarity plotted on the X-axis and the number of test samples on the Y-axis, the relevant-paper classification performance can be judged to be higher the further the relevant test samples are distributed to the right and the non-relevant test samples to the left. Creating such a similarity histogram makes it possible to evaluate visually how well the system separates relevant papers among the test samples.
[0004]
However, a more objective and clear evaluation requires comparing system performance quantitatively. Many studies of quantitative evaluation methods have already been conducted, but all of them are based on the recall-precision graph. Because the points on a recall-precision graph are generally discrete, an interpolation curve must be created. Recall and precision are in a trade-off relationship: depending on the similarity threshold, improving one worsens the other. The closer the graph comes to the corner point (1,1), the higher the overall relevant-paper classification performance. The recall-precision graph is well suited to expressing the overall performance of a system quantitatively as a graph, but because it involves two parameters, recall and precision, it is inconvenient for comparing the performance of different methods numerically.
[0005]
To improve on this point, van Rijsbergen proposed the F-measure, which takes a weighted harmonic mean of recall and precision. Even in this case, the problem remains that the evaluation value varies depending on the recall or precision value. To solve this problem, several evaluation methods that can express system performance as a single numerical value have been proposed.
[0006]
Three representative evaluation methods that yield a single numerical value are described below. Hereinafter, these three evaluation indexes are collectively called the “conventional three evaluation indexes”, and the evaluation values obtained with them are called the “conventional three evaluation values”.
(1) Break-even point
Lewis et al. proposed a method called the “break-even point (P_bep)”, which uses the recall at the point where recall and precision are equal on the recall-precision graph (Equation 1).
[0007]
[Equation 1]

[0008]
Here, R is the recall at the point on the interpolation curve of the recall-precision graph where recall and precision are equal. This method has the advantage of being easy to measure, but because it evaluates only one part of the graph, it has the drawback that overall performance is hard to reflect in the evaluation value.
(2) 11-point average precision
Salton et al. proposed the “11-point average precision (P_11mean)”, which divides the recall axis into 11 equal parts on the interpolation curve of the recall-precision graph and averages the precision at each point (Equation 2).
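As a rough illustration of the break-even point, the sketch below reads it directly from a ranked list rather than from an interpolated curve. It relies on the fact that at rank N_rel the recall (hits / N_rel) and the precision (hits / rank) share the same denominator and therefore coincide. The function names and the 0/1 label encoding are assumptions made for this example and do not come from the patent.

```python
from typing import List, Tuple


def recall_precision_by_rank(labels: List[int]) -> List[Tuple[float, float]]:
    """Return (recall, precision) after each rank.

    labels: 1 for a relevant paper, 0 for a non-relevant one, already
    sorted in descending order of similarity (assumed input format).
    At least one label must be 1.
    """
    n_rel = sum(labels)
    hits = 0
    points = []
    for rank, label in enumerate(labels, start=1):
        hits += label
        points.append((hits / n_rel, hits / rank))
    return points


def break_even_point(labels: List[int]) -> float:
    """Recall at rank N_rel, where recall and precision are equal by construction."""
    n_rel = sum(labels)
    recall, precision = recall_precision_by_rank(labels)[n_rel - 1]
    assert recall == precision  # both equal hits / N_rel at this rank
    return recall


# Ranking with 2 relevant papers followed by 11 non-relevant ones
print(break_even_point([1, 1] + [0] * 11))  # 1.0
```

Reading the value at rank N_rel avoids interpolation, which is a simplification compared with the curve-based definition in the text.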
[0009]
[Equation 2]

[0010]
Here, f_p(R_i) is the precision on the interpolation curve when the recall is R_i, and R_i (i = 1, 2, ..., 11) is the recall at each of the points that divide the recall axis into 11 equal parts. This method readily reflects the global information of the graph, but reading the precision from the recall at 11 points is laborious.
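The 11-point average can be sketched as follows. This example assumes the recall levels are 0.0, 0.1, ..., 1.0 and that the interpolated precision at a level is the highest precision observed at any rank whose recall is at least that level. Both are common conventions in information retrieval, but the text does not state which interpolation the patent uses, so treat them as assumptions.

```python
from typing import List


def eleven_point_average_precision(labels: List[int]) -> float:
    """Average interpolated precision over 11 recall levels.

    labels: 1 = relevant, 0 = non-relevant, in descending similarity order
    (assumed input format; at least one label must be 1).
    """
    n_rel = sum(labels)
    hits = 0
    recalls, precisions = [], []
    for rank, label in enumerate(labels, start=1):
        hits += label
        recalls.append(hits / n_rel)
        precisions.append(hits / rank)

    total = 0.0
    for i in range(11):
        level = i / 10  # assumed recall levels 0.0, 0.1, ..., 1.0
        # Assumed interpolation rule: best precision at recall >= level
        candidates = [p for r, p in zip(recalls, precisions) if r >= level]
        total += max(candidates) if candidates else 0.0
    return total / 11


print(eleven_point_average_precision([1, 1] + [0] * 11))  # 1.0
```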
[0011]
Several other evaluation indexes based on recall and precision have also been proposed, but all of them share the drawback that graph interpolation is required to calculate the evaluation value.
(3) Normalized recall
As an evaluation method that does not require graph interpolation, there is the “normalized recall (P_norm)”, which obtains the cumulative difference between the ideal increase in recall and the actual increase in recall as retrieved papers are added one at a time (Equation 3). This method has the drawback that the evaluation value is easily influenced by the rank of the last relevant paper.
[0012]
[Equation 3]

[0013]
Here, R_i is the relevance rank of paper i, and N_rel and N_irrel are the total numbers of relevant and non-relevant papers, respectively.
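A minimal sketch of normalized recall, assuming the form commonly given in the information retrieval literature, P_norm = 1 - (sum of actual ranks of relevant papers - sum of ideal ranks 1..N_rel) / (N_rel × N_irrel). The symbols match the definitions above, but whether the patent's Equation 3 is written in exactly this form is an assumption.

```python
from typing import List


def normalized_recall(labels: List[int]) -> float:
    """Normalized recall for a ranked list.

    labels: 1 = relevant, 0 = non-relevant, in descending similarity order
    (assumed input format; both classes must be non-empty).
    """
    n_rel = sum(labels)
    n_irrel = len(labels) - n_rel
    # R_i: actual ranks at which the relevant papers appear
    actual_ranks = [rank for rank, label in enumerate(labels, start=1) if label == 1]
    # Ideal ranking places the relevant papers at ranks 1..N_rel
    ideal_rank_sum = sum(range(1, n_rel + 1))
    return 1.0 - (sum(actual_ranks) - ideal_rank_sum) / (n_rel * n_irrel)


print(normalized_recall([1, 1] + [0] * 11))  # 1.0 (ideal ranking)
print(normalized_recall([0, 0, 1, 1]))       # 0.0 (worst ranking)
```

No interpolation is needed because the value depends only on the ranks of the relevant papers.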
[0014]
[Patent Document 1]
Japanese Patent Laid-Open No. 2001-256253
[0015]
[Problems to be solved by the invention]
The problem with the conventional three evaluation indexes (1) to (3) is that their ability to express differences in the performance of separating relevant papers from non-relevant papers, that is, their resolution, is low. This is because in every case the evaluation value is determined uniquely by the rank order of the similarities alone, and the degree of separation between the relevant and non-relevant classes is not reflected in the evaluation value.
[0016]
This problem is explained below using an example (FIG. 2) in which the similarity histograms differ but the conventional three evaluation values are equal. The similarity histograms in FIG. 2 (i) and (ii) show visually that the similarity distributions of the relevant and non-relevant classes are more clearly separated for test sample 2 than for test sample 1. FIG. 2 (i)' and (ii)' show tables in which the similarity, precision, and recall data for the same test samples are sorted in descending order of similarity. Although the similarity values differ between the test samples, the order of relevant and non-relevant samples is exactly the same: 2 relevant samples followed by 11 non-relevant samples. For example, accumulating down to rank 2, both samples are relevant in test samples 1 and 2 alike, so the recall is 2/2 = 1.0 and the precision is 2/2 = 1.0. Accumulating down to rank 3, the third sample is non-relevant in test samples 1 and 2 alike, so the recall is 2/2 = 1.0 and the precision is 2/3 = 0.67. Calculating the cumulative recall and precision in the same way down to rank 13 gives exactly the same values for test samples 1 and 2. The recall-precision graphs plotting these values are therefore identical, as shown in FIG. 2 (i)'' and (ii)''. Because the recall-precision graphs are identical, the conventional three evaluation values P_bep, P_11mean, and P_norm calculated from them are also identical. In fact, in this case all of them are equal to 1.
The example of FIG. 2 shows that, with conventional evaluation indexes based on recall and precision, the evaluation value is determined solely by the rank order of the similarities, so differences in relevant-paper classification performance sometimes cannot be expressed clearly as an evaluation value. When many samples are available, as with newspaper articles, conventional evaluation indexes still allow a statistically sound evaluation. However, when the number of papers in the field of interest is inherently small, as with academic papers, because of the subdivision of fields and the novelty required, it is difficult to obtain results with a significant difference, as the above example shows.
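The rank-only dependence described for FIG. 2 can be reproduced with a short sketch. The similarity values below are hypothetical (the figure's actual values are not given in the text); only the ordering, 2 relevant samples followed by 11 non-relevant ones, is taken from the description.

```python
from typing import List, Tuple


def ranked_table(scored: List[Tuple[float, bool]]) -> List[Tuple[int, float, float]]:
    """Return (rank, recall, precision) rows, rounded to 2 decimals.

    scored: (similarity, is_relevant) pairs in any order (assumed format).
    """
    ordered = sorted(scored, key=lambda item: item[0], reverse=True)
    n_rel = sum(1 for _, relevant in ordered if relevant)
    hits = 0
    table = []
    for rank, (_, relevant) in enumerate(ordered, start=1):
        hits += int(relevant)
        table.append((rank, round(hits / n_rel, 2), round(hits / rank, 2)))
    return table


# Hypothetical similarities: classes close together in sample 1, far apart in sample 2
sample_1 = [(0.60, True), (0.55, True)] + [(0.50 - 0.02 * i, False) for i in range(11)]
sample_2 = [(0.95, True), (0.90, True)] + [(0.30 - 0.02 * i, False) for i in range(11)]

table_1 = ranked_table(sample_1)
table_2 = ranked_table(sample_2)

print(table_1[1])          # (2, 1.0, 1.0)
print(table_1[2])          # (3, 1.0, 0.67)
print(table_1 == table_2)  # True: identical despite different similarity values
```

Because the two tables are identical, any index computed only from them cannot distinguish the two samples, however different their similarity distributions are.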
[0017]
An object of the present invention is to provide an evaluation method with a high resolution capable of clearly expressing the degree of separation of samples of each class, not just the order of similarity.
[0018]
[Means for Solving the Problems]
In order to achieve the above object, according to the present invention, the paper filtering evaluation system is provided with means for outputting an evaluation result to which the class separation degree is applied.
[0019]
In the field of pattern recognition research, the class separation degree η (Equations 4 to 6), defined as the between-class variance (σ²_B) ÷ the total variance (σ²_T), has been used to evaluate the degree of separation of two-class data.
[0020]
[Equation 4]

[0021]
[Equation 5]

[0022]
[Equation 6]

[0023]
Here, X_i is the separation index. In pattern recognition, it is the luminance value, gray level, or similar value of a sample. N₁ and N₂ are the numbers of samples in classes 1 and 2, respectively, and M₁, M₂, and M_m are the mean values of the separation index over the samples belonging to class 1, class 2, and the whole, respectively.
[0024]
The class separation degree η is 0 when the distributions of the separation index for the two classes overlap completely, 1 when they are completely separated, and 0 < η < 1 otherwise.
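The definition above can be sketched in a few lines. This example assumes the usual two-class forms, σ²_B = [N₁(M₁ - M_m)² + N₂(M₂ - M_m)²] / (N₁ + N₂) and σ²_T = Σ(X_i - M_m)² / (N₁ + N₂), which are consistent with the symbols defined in the text. The exact typeset form of Equations 4 to 6 is not reproduced here, so the normalization is an assumption (it cancels in the ratio η).

```python
from typing import Sequence


def class_separation(class1: Sequence[float], class2: Sequence[float]) -> float:
    """eta = between-class variance / total variance for two classes.

    class1, class2: separation-index values X_i of each class (assumed input).
    """
    n1, n2 = len(class1), len(class2)
    all_values = list(class1) + list(class2)
    m1 = sum(class1) / n1           # M_1: mean of class 1
    m2 = sum(class2) / n2           # M_2: mean of class 2
    mm = sum(all_values) / (n1 + n2)  # M_m: mean of all samples

    between = (n1 * (m1 - mm) ** 2 + n2 * (m2 - mm) ** 2) / (n1 + n2)
    total = sum((x - mm) ** 2 for x in all_values) / (n1 + n2)
    if total == 0:
        raise ValueError("all values are identical; eta is undefined")
    return between / total


print(class_separation([1.0, 1.0], [0.0, 0.0]))  # 1.0: fully separated
print(class_separation([0.0, 1.0], [0.0, 1.0]))  # 0.0: identical distributions
```

Unlike the rank-based indexes, η changes when the gap between the two classes widens even if the ranking stays the same.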
[0025]
In the present invention, the class separation described above is applied to a paper filtering evaluation system, and is output together with a similarity histogram.
[0026]
For the example of FIG. 2, calculating the class separation degree gives η = 0.52 for test sample 1 and η = 0.81 for test sample 2, so the difference in the degree of separation seen in the similarity histograms is expressed clearly.
[0027]
By providing the paper filtering evaluation system with means for calculating the class separation degree and means for outputting the result, the degree of separation between the similarities of relevant and non-relevant papers can be expressed numerically with high resolution even when it is difficult to collect a sufficient number of samples, as with academic papers, and relevant-paper classification performance can be evaluated more clearly.
[0028]
DETAILED DESCRIPTION OF THE INVENTION
The paper filtering system 1 has a configuration in which a CPU 2, a memory 3 that the CPU 2 uses as a work area, a keyboard 4 as an input device, a display 5 as an output device, and a storage device 7 are connected by a bus 6. The storage device 7 stores a profile creation program 11, a classification performance evaluation program 12, and the data used by each program, such as profiles and test samples. Each program is read into the CPU in turn and interpreted and executed, thereby providing the profile creation function and the classification performance evaluation function. The detailed flow of the profile creation program is shown in FIG. 3, and that of the classification performance evaluation process in FIG. 4. The profile creation process creates a filtering profile from the interest judgments obtained from the user on the profile creation samples. The classification performance evaluation process calculates the similarity between each test sample and the profile, and then calculates the class separation degree from those similarities. Displaying the calculated class separation degree makes it easy for the user of the system to evaluate the classification performance of the paper filtering system. The processing flow of the classification performance evaluation method is summarized below.
[0029]
FIG. 3 shows the profile creation process.
[0030]
First, in the process of waiting for input of relevance judgments on the profile creation samples (31), the profile creation samples are presented to the user, who sorts them into interesting (relevant) and not interesting (non-relevant), and the input is accepted.
[0031]
If the input result is interesting (relevant), the relevant profile vector is created (32). In this process, stop words are removed from the words contained in the relevant profile creation samples, and the remaining words are selected as the “relevant base terms”. The “relevant profile vector” P_rel is then constructed with the relevant base terms as the basis of the vector and, as the weight of each term, the average over the relevant papers of the term's within-paper occurrence frequency, that is, its TF (Term Frequency) value.
[0032]
If the input result is not interesting (non-relevant), the non-relevant profile vector is created (33). In this process, stop words are removed from the words contained in the non-relevant papers, and the remaining words are selected as the “non-relevant base terms”. The “non-relevant profile vector” P_irrel is then constructed with the non-relevant base terms as the basis of the vector and, as the weight of each term, the average of the term's TF values over the non-relevant papers.
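Steps 32 and 33 apply the same procedure to different sets of papers, which the following sketch illustrates. It assumes the papers are already tokenized into word lists and that TF is the raw count of a term within a paper. The stop-word list, function name, and sample papers are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set

# Hypothetical stop-word list; the patent does not specify one
STOP_WORDS: Set[str] = {"the", "a", "of", "and", "is"}


def build_profile(papers: List[List[str]], stop_words: Set[str]) -> Dict[str, float]:
    """Build a profile vector as {base term: mean TF over the given papers}.

    papers: tokenized papers of one class (relevant or non-relevant).
    """
    # TF per paper, with stop words removed (TF assumed to be a raw count)
    counts = [Counter(w for w in paper if w not in stop_words) for paper in papers]
    # Base terms: every remaining word that occurs in at least one paper
    base_terms = sorted(set().union(*counts))
    # Weight: TF averaged over all papers (a missing term counts as 0)
    return {term: sum(c[term] for c in counts) / len(papers) for term in base_terms}


relevant_papers = [
    ["the", "filter", "filter", "profile"],
    ["profile", "of", "paper"],
]
p_rel = build_profile(relevant_papers, STOP_WORDS)
print(p_rel)  # {'filter': 1.0, 'paper': 0.5, 'profile': 1.0}
```

Calling the same function on the non-relevant papers would give the counterpart of P_irrel under the same assumptions.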
[0033]
FIG. 4 shows the classification performance evaluation process. First, the relevance judgment for each test sample is received as input (41), and the paper vectors are created (42). To create the relevant paper vector, the TF value of each relevant base term in the paper is obtained with the above “relevant base terms” as the basis, and the relevant paper vector D_rel is constructed with those TF values as weights. To create the non-relevant paper vector, the TF value of each non-relevant base term in the paper is obtained with the above “non-relevant base terms” as the basis, and the non-relevant paper vector D_irrel is constructed with those TF values as weights.
[0034]
As part of the classification performance evaluation process, the similarity to the relevant profile is calculated (43). The relevant similarity (Equation 7) between the relevant paper vector D_rel created in step 42 and the relevant profile P_rel created in step 32 is calculated.
[0035]
[Equation 7]

[0036]
Where • represents the inner product of vectors and × represents the product of scalar values.
0 ≤ Sim (D, P) ≤ 1.
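The description of Equation 7 (an inner product of vectors, a product of scalar values, and a range of 0 to 1) matches the cosine measure, Sim(D, P) = D·P / (|D| × |P|). The sketch below assumes that form; the equation itself is not reproduced in this text, so this is an interpretation rather than the patent's confirmed formula.

```python
import math
from typing import Sequence


def similarity(d: Sequence[float], p: Sequence[float]) -> float:
    """Cosine similarity between a paper vector D and a profile vector P.

    Both vectors are assumed to use the same base terms in the same order.
    With non-negative TF weights the result lies in [0, 1].
    """
    dot = sum(x * y for x, y in zip(d, p))             # inner product D . P
    norm = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in p))
    return dot / norm if norm else 0.0                  # zero vector: treated as 0


print(similarity([1, 0], [0, 1]))  # 0.0: no shared terms
print(similarity([1, 2], [2, 4]))  # approximately 1.0: same direction
```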
[0037]
Next, the similarity to the non-relevant profile is calculated (44). The non-relevant similarity (Equation 7) between the non-relevant paper vector D_irrel created in step 42 and the non-relevant profile P_irrel created in step 33 is calculated.
[0038]
Next, the relevant and non-relevant similarities are integrated (45). Here, the integrated similarity (Equation 8) is calculated in order to reduce the evaluation index to one dimension.
[0039]
[Equation 8]

[0040]
Next, class separation is calculated (46).
[0041]
Based on the integrated similarity obtained in step 45, the class separation degree is calculated using Equations 4 to 6.
[0042]
When the class separation degree η is applied to the paper filtering evaluation system, the symbols in Equations 4 to 6 take the following values.
[0043]
X_i: the separation index, which is the integrated similarity obtained from Equation 8
N₁: the number of relevant paper samples
N₂: the number of non-relevant paper samples
M₁: the mean separation index of the relevant papers
M₂: the mean separation index of the non-relevant papers
M_m: the mean separation index of all samples
The calculated class separation degree is then displayed (47). The user can evaluate the classification performance of the paper filtering system by judging it to be good if the calculated class separation degree is larger than a certain threshold and poor otherwise.
[0044]
FIG. 5 shows an example of the display in step 47 when each paper is divided into three parts (abstract, introduction/conclusion, and full text), a paper filtering system is configured using the information of each part through the filtering and classification performance evaluation processes described above, and the conventional evaluation values and the class separation degree are calculated for the classification performance in each case. FIG. 5 shows that the degree of separation between the relevant and non-relevant sample groups differs visibly between the parts, and is clearly larger with the full-text profile than with the abstract profile or the introduction/conclusion profile.
[0045]
FIG. 6 shows the evaluation results for each part under the old and new evaluation indexes, taken as the average of five experiments using 5-fold cross-validation. With the conventional evaluation indexes, the differences among the abstract profile (created using only the information in the abstract), the introduction/conclusion profile (created using only the information in the introduction and conclusion), and the full-text profile (created using the information in the full text) do not appear clearly, whereas with the class separation degree they do.
[0046]
In FIG. 6, the standard deviation, which represents the spread of the evaluation values across the parts, is 0.90 for the class separation degree, which is 1.5 to 4.5 times the values of 0.20, 0.52, and 0.66 for the conventional three evaluation indexes. This result indicates that the resolution of the class separation degree according to the present invention is 1.5 to 4.5 times that of the conventional three evaluation indexes, and that its value appropriately reflects the degree of separation of the distributions in the similarity scatter diagram (FIG. 5).
[0047]
From the results in FIG. 6, the evaluation value given by the class separation degree, the proposed index, is 0.32 with the abstract profile and 0.67 with the introduction/conclusion profile, whereas with the full-text profile it is 2.01, which is 3 to 6 times the value for the other parts and was quantitatively confirmed to be the highest evaluation value. With the conventional three evaluation indexes, the order of the evaluation values across the parts is likewise abstract < introduction/conclusion < full text, so the order is the same, but with the class separation degree the differences appear more clearly. Only with the introduction of the class separation degree can it be stated clearly that the full text is the most effective.
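The "3 to 6 times" figure follows from simple division of the values quoted from FIG. 6, as this short check shows. The dictionary keys are labels chosen for this example.

```python
# Evaluation values quoted in the text for each profile
values = {"abstract": 0.32, "introduction/conclusion": 0.67, "full text": 2.01}

# Ratio of the full-text value to each of the other parts, to 1 decimal place
ratios = {
    part: round(values["full text"] / value, 1)
    for part, value in values.items()
    if part != "full text"
}
print(ratios)  # {'abstract': 6.3, 'introduction/conclusion': 3.0}
```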
[0048]
[Effect of the invention]
According to the present invention, applying the concept of class separation degree to an evaluation system for paper filtering makes it possible to express the degree of separation between the similarities of relevant and non-relevant papers numerically with high resolution, even when it is difficult to collect a sufficient number of samples, as with academic papers, and thus to evaluate relevant-paper classification performance more clearly.
[Brief description of the drawings]
FIG. 1 is an overall configuration diagram of a paper filtering evaluation system according to an embodiment of the present invention.
FIG. 2 is a diagram explaining the difficulty of distinguishing differences in classification performance with the prior art.
FIG. 3 is a flowchart showing a profile creation process.
FIG. 4 is a flowchart showing classification performance evaluation processing.
FIG. 5 is a scatter diagram plotting the relevant and non-relevant similarities calculated for several paper filtering methods that use information from different parts of academic papers (abstract, introduction/conclusion, body text).
FIG. 6 is a diagram showing the result of evaluating the classification performance for each profile extraction part of a paper by the conventional evaluation index and the class separation according to the present invention.
[Explanation of symbols]
1. Paper filtering evaluation system, 2. CPU, 3. Memory, 4. Keyboard, 5. Display, 6. Bus, 7. Storage device, 11. Profile creation program, 12. Classification performance evaluation program.

Claims

An evaluation system for a document filtering system,
Profile creation means;
Classification performance evaluation means,
The profile creation means has conforming profile creation means and non-conforming profile creation means,
A sample conforming document for profile creation, a sample non-conforming document for profile creation, a conforming document for classification performance evaluation, and a non-conforming document for classification performance evaluation are input,
The conforming profile creation means is a means that removes stop words from the group of words included in a predetermined part of the sample conforming document for profile creation that is selected by the user, and constructs a conforming profile vector P_rel based on the in-document appearance frequency values of the conforming base terms thus selected,
The non-conforming profile creation means is a means that removes stop words from the group of words included in the predetermined part of the sample non-conforming document for profile creation that is selected by the user, and constructs a non-conforming profile vector P_irrel based on the in-document appearance frequency values of the non-conforming base terms thus selected,
The classification performance evaluation means comprises:
Conforming document vector constructing means;
Non-conforming document vector constructing means;
Conforming similarity calculation means;
Non-conforming similarity calculation means;
Integrated similarity calculation means;
Class separation degree calculation means; and
Class separation degree display means for each part,
The conforming document vector constructing means is a means that constructs a conforming document vector D_rel, based on the conforming base terms, for the predetermined part of the conforming document for classification performance evaluation,
The non-conforming document vector constructing means is a means that constructs a non-conforming document vector D_irrel, based on the non-conforming base terms, for the predetermined part of the non-conforming document for classification performance evaluation,
The conforming similarity calculation means is a means that calculates the conforming similarity S_rel with the conforming profile vector P_rel based on
(Equation 1)
(where · denotes the inner product of vectors and × denotes the product of scalar values),
The non-conforming similarity calculation means is a means that calculates the non-conforming similarity S_irrel with the non-conforming profile vector P_irrel based on (Equation 1),
The integrated similarity calculation means is a means that calculates the integrated similarity S of the conforming similarity and the non-conforming similarity based on the following equation,

The class separation degree calculating means includes:

(where
X_i: integrated similarity S
N_1: number of conforming document samples
N_2: number of non-conforming document samples
M_1: mean integrated similarity of the conforming documents
M_2: mean integrated similarity of the non-conforming documents
M_m: mean integrated similarity of all samples)
the class separation degree η of the predetermined part is calculated based on the integrated similarity and is displayed,
An evaluation system for a document filtering system.