JP2006235716A

JP2006235716A - Document filtering system

Info

Publication number: JP2006235716A
Application number: JP2005045593A
Authority: JP
Inventors: Masao Yamamoto; 雅夫山本; Hiroyuki Kinukawa; 博之絹川; Takashi Yajima; 敬士矢島
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2005-02-22
Filing date: 2005-02-22
Publication date: 2006-09-07

Abstract

PROBLEM TO BE SOLVED: To make document search more efficient by extracting interest information from relevant document samples selected by the users themselves and representing interest in documents as a differentiated structure based on that information.
SOLUTION: The system is provided with a differentiated structure generation unit 111 that generates a differentiated structure consisting of a plurality of clusters from a group of documents, a cluster representative point creation unit 112 that creates the centroid of each cluster from the documents in that cluster, a similarity calculation unit 121 that calculates the similarity between a new document and each cluster representative point, and a recommendation ranking index calculation unit 122 that converts the plurality of similarities into a single index.
COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a document filtering system, and more particularly to a document filtering system that supports the task of searching for documents related to a specific interest.

Conventional document filtering systems are designed on the assumption that "interest in a paper can be expressed with only a few keywords, and researchers can easily select those keywords themselves." In such a system, each researcher selects in advance several keywords presumed to occur frequently in relevant documents (papers) and registers them in the system as the researcher's own interests. However, because this approach merely registers a set of keywords, it can be described as "a representation in which the structure of interest remains undifferentiated."

Meanwhile, terms are extracted from newly published documents every day and accumulated. A term is a word that characterizes a document. The extracted terms are Boolean-matched against the registered keywords selected by the researcher, and the degree of match is taken as the system's estimate of the researcher's degree of interest. If this estimate is greater than or equal to a predetermined value, the document is judged relevant; if it is below that value, the document is judged non-relevant. Systems that present only the relevant documents to the user are known. Documents describing this type of technology include, for example, Patent Document 1 and Non-Patent Document 1.
Patent Document 1: JP 2003-157273 A
Non-Patent Document 1: S. Lawrence, K. D. Bollacker and C. Giles, "A system for automatic personalized tracking of scientific literature on the web", Proc. of the Fourth ACM Conference on Digital Libraries, pp. 105-113 (1999)
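The Boolean matching used by such conventional systems can be sketched in Python as follows. This is a minimal illustration only: the scoring rule (the fraction of registered keywords found among the document's terms), the function names `boolean_match_score` and `is_relevant`, and the default `threshold` are assumptions, since the text says only that a degree of match is compared with a predetermined value.

```python
def boolean_match_score(doc_terms, keywords):
    """Fraction of the registered keywords that occur among the document's terms.

    Only presence or absence is checked; term frequency is ignored.
    """
    keywords = set(keywords)
    if not keywords:
        return 0.0
    return len(keywords & set(doc_terms)) / len(keywords)


def is_relevant(doc_terms, keywords, threshold=0.5):
    """Judge a document relevant when the match score reaches the threshold."""
    return boolean_match_score(doc_terms, keywords) >= threshold
```

Because the score depends only on whether each keyword is present, two documents with very different term frequencies receive the same score, which is the weakness discussed in the paragraphs that follow.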

As described above, conventional document filtering systems select relevant documents by simple Boolean matching against an undifferentiated set of keywords, so desired relevant documents are often missed and non-relevant documents are often mixed in. These problems are thought to stem from insufficient information for expressing interest, the difficulty researchers have in expressing their own interests, and an incomplete structural representation of interest.

In general, it is difficult to represent the complex structure of a document with only a few keywords. Even if two documents contain the same keywords, their content may well differ if the frequency of occurrence, the co-occurring terms, and so on differ. This shows that Boolean matching, which evaluates the similarity between documents solely from the presence or absence of a few keywords, does not discriminate documents well enough.

Moreover, researchers are not always clearly aware of their own interests. This is thought to reflect the nature of research activity, which starts from a vague state and arrives at clear findings through trial and error. For researchers, explicitly expressing something as ambiguous and elusive as "interest" with a few keywords is inherently difficult.

Furthermore, the aspects of an academic document that interest a corporate researcher vary. For example, the researcher may be interested in the problem to be solved, in the method used to solve it, or in a related secondary technology. The corresponding keywords differ completely depending on the aspect of interest. As a result, keywords that express different aspects are unconsciously mixed among the selected keywords, the focus of the interest representation blurs, and precision drops.

FIG. 2 shows the distribution of relevant and non-relevant papers, and FIG. 3 illustrates the undifferentiated state of the interest structure. A group of academic documents consisting of relevant papers (white circles) and non-relevant papers (black circles) is considered to be distributed in the vector space spanned by term vectors as shown in FIG. 2. If relevance is then judged using an undifferentiated representation consisting of a few keywords, as shown in FIG. 3, relevant documents are likely to be missed and non-relevant documents are likely to be mixed in.

It is desirable for users to see the filtering results as a ranking. With an undifferentiated representation, however, relevant and non-relevant documents are mixed together, as shown in FIG. 3, in what might be called a "mixture of gems and stones", and no meaningful ranking can be obtained.

The present invention has been made in view of these problems. It extracts interest information from relevant document samples selected by the users themselves, such as researchers, and represents interest in documents as a differentiated structure based on this information, thereby making document search more efficient.

In order to solve the above problems, the present invention employs the following means.

The system comprises a differentiated structure generation unit that generates a differentiated structure consisting of a plurality of clusters from a group of documents, a cluster representative point creation unit that creates the centroid of each cluster from the documents in that cluster, a similarity calculation unit that calculates the similarity between a new document and each cluster representative point, and a recommendation ranking index calculation unit that converts the plurality of similarities into a single index.

With the above configuration, the present invention makes the search for documents of interest more efficient.

A few keywords are not enough to represent a researcher's interests structurally. A necessary and sufficient amount of information must be extracted from the relevant document samples. To this end, the inventors adopted the vector space method, which represents a document as a vector using the terms contained in relevant documents and their frequency information (Inoue and Hashimoto, "Document Filtering Method Using Non-Relevant Profiles", Information Processing Society of Japan, J42-D2, 3, pp. 507-517 (2001), among others).

Rather than representing different interests with an undifferentiated structure as in the prior art, this method represents different interests as a differentiated structure using separate profiles. This improves the precision of the search for relevant documents.

When academic documents are mapped onto the vector space spanned by term vectors, relevant documents that share many of the same terms are expected to lie close together, as shown in FIG. 2. Focusing on this point, clustering nearby documents yields as many clusters of relevant documents as the researcher has interests. Depending on the clustering algorithm, each cluster may contain a few non-relevant documents, but the great majority of its members are relevant documents. A cluster representative point is therefore computed as the vector average of only the relevant documents in the cluster. This can be regarded as a differentiated structural representation of the researcher's interests. The cluster representative points corresponding to the researcher's interests are called profiles.

FIG. 4 illustrates the differentiated representation of interest composed of a plurality of clusters. An appropriate index is defined to express the similarity between each profile and a document. As shown in FIG. 4, many relevant documents are distributed in the region of high similarity, whereas the region of low similarity is occupied mostly by non-relevant documents.

The region of intermediate similarity, on the other hand, may include some documents that offer hints for new ideas. The regions may overlap, but compared with the conventional undifferentiated approach, the differentiated representation of the present invention expresses each interest separately through its own profile, so relevant and non-relevant documents are separated more sharply. Exploiting the sharp separation achieved by the individual profiles makes it possible to display the filtering results as a ranking.

In general, users find it more convenient to see the relevant documents selected by the system ranked in order of recommendation confidence than to be merely presented with them. In the conventional approach, only the similarity to a single undifferentiated profile is computed, so that similarity can serve directly as the ranking index. With the differentiated profiles of the present invention, however, a new document has a separate similarity to each of several profiles, so the documents cannot be ranked as they stand. To display new documents as a ranking, the multiple similarities must be converted into a single index when relevant documents are selected. For this purpose, a reference threshold θi, corresponding to the sphere of influence of each profile, is calculated by formula (1) when the profiles are created.

Here, i is the cluster number, θi is the reference threshold of cluster i, j is the sample number within cluster i, Ni is the number of samples in cluster i, and Sim(pi, dj) is the similarity between the profile pi, which is the representative point of cluster i, and the j-th sample dj in cluster i. The reference threshold θi is the smallest of the similarities between the profile and the elements of its cluster. In the subsequent relevant document selection phase, a recommendation ranking index is calculated on the basis of this reference threshold, and new documents are presented as a ranking based on the calculated index.
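Formula (1) itself is not reproduced in this text. From the definitions given here (θi is the smallest similarity between the profile pi and the Ni samples of cluster i), it presumably takes the following form. This is a reconstruction from the prose, not the original typeset formula.

```latex
\theta_i = \min_{1 \le j \le N_i} \mathrm{Sim}(p_i, d_j) \tag{1}
```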

Next, the distribution of documents in the term vector space is considered.

FIG. 5 illustrates the distribution of documents in the term vector space for a single cluster. Because the reference threshold θi is determined from a limited set of training samples, it is expected to be somewhat larger than the optimal threshold. A document whose similarity exceeds the reference threshold can be regarded as relevant with confidence, but relevant documents with similarity below the reference threshold also exist, so some relevant documents would fail to be recommended. A corrected threshold δi(α), obtained by multiplying each reference threshold by a number α of 1 or less, is therefore defined by formula (2).

This corrected threshold relaxes the condition imposed by the reference threshold θi and widens the accepted region slightly. Presenting a document when its similarity is greater than or equal to the corrected threshold slightly increases the number of non-relevant documents mixed in, but reduces the number of relevant documents that are missed. When α is 1, the corrected threshold δi(α) equals the reference threshold θi and is the boundary of the narrowest region, which contains only relevant documents. When α is 0, the corrected threshold δi(α) is 0 and all documents are presented unconditionally. Normally, therefore, α is set to a value between 0 and 1.
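The effect of α can be illustrated with a minimal Python sketch. It assumes, from the description of multiplying the reference threshold by α, that formula (2) has the form δi(α) = α·θi. The function names and the rule that a document is presented when it passes the corrected threshold of at least one profile are assumptions made for illustration.

```python
def corrected_threshold(theta_i, alpha):
    """Corrected threshold: the reference threshold scaled by alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    return alpha * theta_i


def is_presented(similarities, thetas, alpha):
    """True if the document reaches the corrected threshold of at least one profile.

    similarities[i] is the document's similarity to profile i and
    thetas[i] is the reference threshold of that profile.
    """
    return any(
        sim >= corrected_threshold(theta, alpha)
        for sim, theta in zip(similarities, thetas)
    )
```

With `alpha=1.0` only documents at least as similar as the least similar training sample are presented; with `alpha=0.0` every document with non-negative similarity is presented.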

Next, relevant documents are presented as a ranking, taking advantage of the sharp document separation obtained with the differentiated representation. First, the recommendation ranking index fr(dk) is calculated by formula (3) from the corrected threshold of each profile and the similarity of the new document, and the relevant documents are presented to the user in descending order of this index fr(dk).

That is, for a given new document, as α is decreased step by step from 1 so that the corrected threshold δi(α) is successively widened starting from the reference threshold θi, at some point the similarity to some profile will exceed the corrected threshold δi(α). The value fr(dk) given by formula (3) is the value of α at that point. The larger the recommendation ranking index fr(dk), the more likely document dk is a relevant document; the smaller it is, the more likely dk is non-relevant. When the value is intermediate, the document may be non-relevant, but it may also be a document that serves as a source of new ideas.
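Formula (3) is not reproduced in this text. Read literally, the description above makes fr(dk) the largest α in [0, 1] at which the similarity to some profile reaches α·θi, which works out to min(1, max over i of Sim(pi, dk)/θi). The Python sketch below implements that reading; it is an interpretation of the prose, not the original formula, and the function names and the handling of a zero threshold are assumptions.

```python
def ranking_index(similarities, thetas):
    """Largest alpha in [0, 1] at which some profile's corrected threshold is reached.

    similarities[i] is Sim(p_i, d_k); thetas[i] is the reference threshold of profile i.
    """
    best = 0.0
    for sim, theta in zip(similarities, thetas):
        if theta <= 0.0:
            # Assumption: a zero threshold admits every document already at alpha = 1.
            return 1.0
        best = max(best, sim / theta)
    return min(1.0, best)


def rank_documents(doc_similarities, thetas):
    """Return (document id, index) pairs sorted by descending ranking index.

    doc_similarities maps a document id to its list of per-profile similarities.
    """
    scored = [
        (doc_id, ranking_index(sims, thetas))
        for doc_id, sims in doc_similarities.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Under this reading, a document inside the reference region of any profile scores 1, and the score falls toward 0 as the document moves away from every profile.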

In this example, rather than simply presenting the filtered results to the user as conventional systems do, newly published documents are presented to the user in descending order of the recommendation ranking index fr(dk). As a result, users can pick out the documents they need according to the phase of their research.

FIG. 1 illustrates the overall configuration of the document filtering system according to the present embodiment. In the figure, 2 is a CPU, 3 is a memory used by the CPU, 4 is a keyboard serving as an input device, and 5 is a display. 6 is a bus that connects the storage device 7 to the CPU and other components. The storage device includes a profile creation unit 11 and a relevant document ranking unit 12, each implemented as a program.

111 is a differentiated structure generation unit, which generates a differentiated structure consisting of a plurality of clusters, each grouping relevant papers that are highly similar to one another. 112 is a profile (cluster representative point) creation unit, which uses the documents in each cluster to create a profile located at the centroid of the cluster. A profile represents the relevant documents contained in its cluster and can be regarded as a virtual document located at the center of the cluster. 113 is a reference threshold calculation unit, which calculates the similarity between the representative point of each cluster and each element of that cluster and sets the smallest of these values as the reference threshold. 114 is a storage device that stores sample papers for profile creation, 115 is a storage device that stores the profiles, and 116 is a storage device that stores the reference thresholds.

121 is a similarity calculation unit that calculates the similarity between each profile and a new document that has been vectorized by term extraction. For example, it maps the new document into the vector space based on the terms extracted from it and then calculates the similarity between the mapped document and each cluster representative point (profile). 122 is a recommendation ranking index calculation unit. As the threshold is successively widened from the reference threshold to form the corrected threshold, this unit takes the ratio of the corrected threshold to the reference threshold at the point where the similarity to the cluster representative point exceeds the corrected threshold, and outputs that ratio as the recommendation ranking index. 123 is a ranking presentation unit, which presents the documents for which the index has been calculated to the user in descending order of the index (in descending order of similarity). 124 is a storage unit that stores the new documents to be ranked, and 125 is a storage unit that stores the recommendation ranking indices.

FIG. 6 illustrates the process of generating the differentiated structure of interest. This process is executed only once, at the start of filtering. In FIG. 6, each user first registers in the system a list of relevant documents already collected and held and a set of relevant documents in digital form. Non-relevant documents are registered at the same time so that the boundary between relevant and non-relevant can be determined (step 41). Stop words are removed from the words contained in the registered relevant and non-relevant documents, and "basis terms" are selected. The basis terms form the basis of the vector space, and "term vectors" Drel (relevant) and Dirrel (non-relevant) are constructed using as weights the frequency with which each term occurs in each document, that is, the TF (term frequency) value. The relevant and non-relevant documents are thereby positioned in the vector space spanned by the term vectors (step 42).
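Step 42 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the tokenizer (lowercase runs of ASCII letters), the tiny `STOP_WORDS` set, and the function names are hypothetical, and every remaining word is used as a basis term because the text does not say how basis terms are otherwise chosen.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = frozenset({"a", "an", "the", "of", "and", "in", "to", "is", "for"})


def extract_terms(text, stop_words=STOP_WORDS):
    """Split text into lowercase words and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stop_words]


def build_term_vectors(documents, stop_words=STOP_WORDS):
    """Return (basis, vectors): the sorted basis terms and one TF vector per document."""
    term_lists = [extract_terms(doc, stop_words) for doc in documents]
    basis = sorted({term for terms in term_lists for term in terms})
    vectors = []
    for terms in term_lists:
        counts = Counter(terms)
        # The weight of each basis term is its raw frequency in this document.
        vectors.append([counts[term] for term in basis])
    return basis, vectors
```

Relevant and non-relevant documents would be passed through the same function so that both sets share one basis.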

Next, the relevant and non-relevant documents positioned in the term vector space are clustered (step 43). Clustering requires a definition of the similarity between documents and between clusters. Measures of similarity between documents include the TF method, the simplest, which uses term frequency information; the log TF-IDF method, which gives greater weight to terms that occur frequently in particular documents; and the SMART method, which adds corrections based on empirical parameters. The inventors adopted the SMART method, for which the best results among these methods have been reported.
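The text does not give the SMART weighting itself, so the Python sketch below uses one common log TF-IDF variant, (1 + ln tf)·ln(N/df), with cosine similarity as a stand-in. Both the weighting formula and the function names are assumptions for illustration and should not be read as the weighting used in the patent.

```python
import math


def log_tfidf_weights(tf_vectors):
    """Re-weight raw TF vectors as (1 + ln tf) * ln(N / df); zero where tf is zero."""
    n_docs = len(tf_vectors)
    n_terms = len(tf_vectors[0]) if tf_vectors else 0
    # Document frequency: number of documents in which each term occurs.
    df = [sum(1 for vec in tf_vectors if vec[t] > 0) for t in range(n_terms)]
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    return [
        [(1.0 + math.log(tf)) * idf[t] if tf > 0 else 0.0 for t, tf in enumerate(vec)]
        for vec in tf_vectors
    ]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

A term that occurs in every document receives weight zero under this scheme, so only terms that distinguish documents contribute to the similarity.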

The similarity between clusters is defined in terms of the similarity between the documents belonging to each cluster. Depending on how this inter-cluster similarity is defined, clustering methods are divided into the single-link method and the complete-link method. To capture the structure more faithfully, the inventors adopted the complete-link method, which tends to produce small clusters. To prevent a proliferation of tiny clusters consisting of a single element, up to one non-relevant sample is in practice allowed in each cluster. A cluster may therefore contain at most one non-relevant sample, but that sample is excluded when the profile is actually computed.
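A minimal Python sketch of complete-link agglomerative clustering over a precomputed similarity matrix is given below. The stopping rule (a minimum similarity `min_sim`) and the function name are assumptions, since the text does not state when merging stops, and the rule allowing one non-relevant sample per cluster is not modeled.

```python
def complete_link_clusters(sim, min_sim):
    """Cluster documents 0..n-1 given a symmetric similarity matrix sim.

    Clusters are merged greedily, most similar pair first, until no pair of
    clusters has complete-link similarity of at least min_sim.
    """
    clusters = [[i] for i in range(len(sim))]

    def link(a, b):
        # Complete link: two clusters are only as similar as their least similar pair.
        return min(sim[i][j] for i in a for j in b)

    while len(clusters) > 1:
        candidates = [
            (link(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        ]
        best, x, y = max(candidates)
        if best < min_sim:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters
```

Replacing `min` with `max` inside `link` would give the single-link method, which tends to chain documents into larger, looser clusters.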

A cluster is a group of relevant documents that lie close together in the vector space, that is, that are highly similar to one another. A single cluster should therefore contain documents corresponding to the same interest. Next, "basis terms" are selected for the relevant documents in each cluster. With the basis terms as the basis of the vector space, a vector Prel(i) is constructed whose weights are the averages, over the documents, of each term's within-document frequency, the TF (term frequency) value (see formula (4)).

This vector Prel(i) is the profile generated from the i-th cluster. In other words, there are as many profiles as there are kinds of interest. A profile represents the relevant documents contained in its cluster and can be regarded as a virtual document located at the center of the cluster. Next, the reference threshold θi is obtained for each cluster using formula (1) (step 44). The similarity Sim is calculated using the SMART method. The smaller the reference threshold, the larger the extent of the cluster region. In other words, the smaller the reference threshold, the greater the weight of the corresponding interest.
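Profile creation and the reference threshold of step 44 can be sketched in Python as follows. Formula (4) is not reproduced in this text, so the profile is computed here as the component-wise mean of the relevant term vectors, as the prose describes. Cosine similarity stands in for the SMART similarity, and the function names are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Stand-in similarity; the patent uses the SMART method, not shown here."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def make_profile(relevant_vectors):
    """Profile of a cluster: the mean TF vector of its relevant documents only."""
    if not relevant_vectors:
        raise ValueError("a cluster needs at least one relevant document")
    n = len(relevant_vectors)
    return [sum(column) / n for column in zip(*relevant_vectors)]


def reference_threshold(profile, relevant_vectors, sim=cosine_similarity):
    """Smallest similarity between the profile and the cluster's own samples."""
    return min(sim(profile, doc) for doc in relevant_vectors)
```

Any non-relevant sample admitted during clustering would be filtered out before calling `make_profile`, in line with the exclusion described above.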

FIG. 7 illustrates the relevant document selection process. In this process, each time new documents are acquired, for example, the relevant documents among them are selected and presented as a ranking. In FIG. 7, documents posted in digital libraries and similar sources are first monitored periodically, and any new documents are acquired. Acquired documents are usually in PDF (registered trademark) format, so they are converted into plain text, which is easier to process. Terms are extracted from each new document and vectorized in the same way as for the relevant documents described above (step 51). Next, the similarity to each profile and the corrected thresholds are calculated for each user. The similarity is calculated with the same SMART method used for clustering. The corrected threshold δi(α) is calculated by multiplying the reference threshold by a value α of 1 or less according to formula (2) (step 52), and the recommendation ranking index fr(dk) is obtained from the calculated corrected threshold δi(α) using formula (3). The acquired new documents are then presented to the user as a list of document titles in descending order of the recommendation ranking index fr(dk) (step 53).

Next, the filtering performance of the document filtering system of this embodiment was confirmed experimentally. A set of experimental samples of academic documents was prepared specifically for the filtering experiments, and the evaluation was carried out following the processing flows described above (FIGS. 6 and 7). The results obtained are described below.

The subjects were three researchers: researcher A, who has relatively little experience; researcher B, a mid-career researcher active at the forefront of research; and researcher C, a highly experienced veteran (hereinafter abbreviated as Ra, Rb, and Rc, respectively). About 10,500 academic documents, a two-year series (April 2001 to March 2003) in the field of information science, were used as samples for the filtering experiments. To limit the range of documents and improve processing efficiency, the documents were screened in advance using keywords specified by the subjects themselves. The subsequent experiments used the screened samples (step 41).

Each subject classified the screened samples into two categories, "relevant" and "non-relevant". To make this judgment more efficient, the title, authors, publication date, abstract, contained keywords, full text, and other information were presented as criteria for judging relevance. As a result, a total of 68 relevant samples and 177 non-relevant samples were obtained from the three subjects. To apply multi-fold cross-validation, the samples were divided into six parts, and the relevant document selection experiments were performed using these partitions.

First, the screened samples were converted into term vectors (step 42), and the profile creation samples were clustered in each of the six experiments. The SMART method was used as the inter-cluster similarity calculation algorithm during clustering, and the complete-link method was used as the clustering method. Profiles were created for each experiment from the clustering results (step 43). In addition, the similarity between each profile and the samples in its cluster was calculated, the reference threshold was obtained using formula (1) (step 44), and the threshold was stored together with the profile.

Next, filtering experiments were performed using the filtering experiment samples (new documents 124). For each subject, up to six experiments were performed by multi-fold cross-validation. The corrected thresholds were calculated from the similarity between each test sample and each profile and from the previously calculated reference thresholds (step 52). The recommendation ranking index was then calculated from the corrected thresholds using formula (3) (step 53). A histogram was created by combining the recommendation ranking index values obtained across these experiments (see FIG. 8).

Furthermore, to evaluate quantitatively how well relevant and non-relevant samples are separated, a recall/precision graph was created (see FIG. 9). In this graph, the threshold frmin of the recommended ranking index is varied from 0 to 1; the horizontal axis shows frmin, and the vertical axis shows the recall and precision obtained when the samples dk satisfying fr(dk) > frmin are taken as the recommendation result. Recall and precision vary with the value of frmin, but what matters in an actual filtering system is how many non-relevant documents can be kept from being presented under the condition of 100% recall, that is, a recommendation miss rate of 0%.
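The sweep over frmin can be sketched as follows, using the standard definitions of recall and precision. The function name `recall_precision`, the index values, and the relevance labels are hypothetical illustration data and are not taken from the experiment; returning a precision of 0 when nothing is recommended is also an assumed convention.

```python
def recall_precision(scores, relevant, fr_min):
    """Recall and precision when samples with fr(dk) > fr_min are recommended.

    scores   -- recommended ranking index value fr(dk) of each sample
    relevant -- True if the subject judged the sample relevant
    """
    recommended = [is_rel for s, is_rel in zip(scores, relevant) if s > fr_min]
    hits = sum(recommended)            # relevant samples that were recommended
    total_relevant = sum(relevant)     # all relevant samples
    recall = hits / total_relevant if total_relevant else 0.0
    # Assumed convention: precision is 0 when nothing is recommended.
    precision = hits / len(recommended) if recommended else 0.0
    return recall, precision


# Hypothetical index values and relevance judgments, for illustration only.
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
relevant = [True, True, False, True, False]
for step in range(11):
    fr_min = step / 10
    recall, precision = recall_precision(scores, relevant, fr_min)
    print(f"frmin={fr_min:.1f} recall={recall:.2f} precision={precision:.2f}")
```

Reading the printed rows from the top, the largest frmin that still gives a recall of 1.00 corresponds to the operating point the text singles out as important.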

Based on FIG. 9, the recall, precision, and non-relevant document rejection rate at frmin = 0.55 were obtained for the three subjects. The results are shown in FIG. 10. In this table, the non-relevant document rejection rate is the ratio of the number of non-relevant samples mixed into the recommendations to the total number of non-relevant samples. It is an important indicator of how far the user can avoid reading unnecessary documents.

FIG. 10 shows that for Ra and Rb, 74% and 43.5% of the non-relevant documents, respectively, could be excluded while maintaining 100% recall. In other words, the contamination by non-relevant documents was reduced to between 1/2 and 1/3. This is considered to be an effect of introducing the differentiated, structured profile. The newly introduced recommended ranking index makes it possible to present results to the user in order of recommendation confidence, as shown in FIG. 8. The graph shows that the top-ranked documents contain many relevant documents. In addition, when the subjects were asked to check whether the mid-ranked documents included any that might provide hints for new ideas, each subject in fact found one or two such documents.

As described above, according to this embodiment, the documents on hand that interest a user such as a researcher are first clustered, so that the user's interests are represented as a differentiated structure made up of multiple profiles. Then, for each new document, a recommended ranking index is calculated from its similarity to each profile and from the differentiated-structure information, and the results are presented as a ranking. This allows the user to search efficiently for new documents of interest.

FIG. 1 is a diagram illustrating the overall configuration of the document filtering system according to this embodiment.
FIG. 2 is a diagram showing the distribution of relevant and non-relevant papers.
FIG. 3 is a diagram illustrating the undifferentiated state of the interest structure.
FIG. 4 is a diagram illustrating the differentiated-structure representation of interest composed of a plurality of clusters.
FIG. 5 is a diagram illustrating the distribution of documents in the term vector space when a single cluster is considered.
FIG. 6 is a diagram illustrating the process of generating the differentiated structure of interest.
FIG. 7 is a diagram illustrating the process of selecting relevant documents.
FIG. 8 is a diagram showing the evaluation experiment results for each subject (numbers of relevant and non-relevant documents).
FIG. 9 is a diagram showing the evaluation experiment results for each subject (recall and precision).
FIG. 10 is a diagram showing the evaluation experiment results for each subject (recommendation misses and non-relevant document rejection rate).

Explanation of symbols

2 CPU
3 Memory
4 Keyboard
5 Display
6 Bus
7 Storage device
11 Profile creation unit
111 Differentiated structure generation unit
112 Profile (cluster representative point) creation unit
113 Reference threshold calculation unit
114, 115, 116 Storage device
12 Relevant document ranking unit
121 Similarity calculation unit
122 Recommended ranking index calculation unit
123 Ranking presentation unit
124, 125 Storage device

Claims

A document filtering system comprising: a differentiated structure generation unit that generates, from a group of documents, a differentiated structure made up of a plurality of clusters; a cluster representative point creation unit that creates the centroid position of each cluster using the documents in that cluster; a similarity calculation unit that calculates the similarity between a new document and each cluster representative point; and a recommended ranking index calculation unit that converts the plurality of similarities into a single index.

The document filtering system according to claim 1, wherein
the differentiated structure generation unit generates the differentiated structure using a clustering method, and the recommended ranking index calculation unit that converts the plurality of similarities into a single index comprises: a reference threshold calculation unit that calculates a reference threshold proportional to the size of each cluster; a correction threshold calculation unit that calculates a correction threshold obtained by successively enlarging the threshold from the reference threshold; and a recommended ranking index calculation unit that calculates the ratio of the correction threshold to the reference threshold as the recommended ranking index.

A document filtering system comprising:
a differentiated structure generation unit that maps a plurality of preselected documents into a vector space using term vectors, each expressed by the terms extracted from those documents and the frequencies with which the terms appear, and clusters them so as to classify the plurality of documents into a plurality of clusters;
a cluster representative point creation unit that calculates the centroid of each classified cluster as the cluster representative point;
a profile creation unit including a reference threshold calculation unit that calculates the similarity between the representative point of each cluster and the elements in that cluster and sets the minimum of these similarities as the reference threshold;
means for extracting terms from a new document and mapping the new document into the vector space based on the extracted terms;
a similarity calculation unit that calculates the similarity between the mapped new document and the representative point of each cluster; and
a recommended ranking index calculation unit that converts the calculated similarities into a single index based on the reference threshold of each cluster.

The document filtering system according to claim 3,
further comprising a ranking presentation unit that ranks and presents new documents according to the single index.

The document filtering system according to claim 3, wherein
the plurality of preselected documents are documents that have been screened according to preset terms and classified into relevant documents and non-relevant documents.

The document filtering system according to claim 3, wherein
the recommended ranking index calculation unit calculates the single index as the ratio of the correction threshold to the reference threshold, the correction threshold being the threshold at which, as the threshold is successively enlarged from the reference threshold, the similarity to the representative point of the cluster exceeds the threshold.

The document filtering system according to claim 4, wherein
the ranking presentation unit ranks and presents new documents in descending order of index value.

The document filtering system according to claim 3, wherein
the similarity is calculated by the SMART method.