JP4936455B2

JP4936455B2 - Document classification apparatus, document classification method, program, and recording medium

Info

Publication number: JP4936455B2
Application number: JP2007075517A
Authority: JP
Inventors: 晴美川島; 吉秀佐藤
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2007-03-22
Filing date: 2007-03-22
Publication date: 2012-05-23
Anticipated expiration: 2027-03-22
Also published as: JP2008234482A

Description

The present invention relates to a document classification apparatus that groups, by theme, the topics people write about in document information published on a network, and makes them browsable cluster by cluster. In particular, it relates to a document classification apparatus that presents clusters grouped by theme at a desired granularity, for the documents that fall within a period the user specifies.

In recent years, with the development of computer networks such as the Internet, large volumes of electronic information continue to be published. As a result, obtaining information on a given topic requires the considerable effort of browsing published Web pages one by one across multiple sources.

Conventionally, in the fields of natural language processing and information retrieval, a technique is known in which an electronic document is represented as a vector of the words appearing in it, and documents whose word vectors are similar are grouped into a single cluster.

In this prior art, a document is represented as a vector based on word occurrence frequencies (hereinafter a "document vector"), and the cosine similarity between document vectors is used as the similarity between documents. That is, a document $d_n$ is represented by the document vector

$$\vec{d_n} = (x_{n1}, x_{n2}, \ldots, x_{nV})$$

where $V$ is the total number of words in the word set $W = \{w_1, w_2, \ldots, w_V\}$ and $x_{ni}$ is the weight of word $w_i$ in document $d_n$. The similarity between documents $d_j$ and $d_k$ is then expressed through the angle $\theta_{j,k}$ formed by their document vectors:

$$\cos\theta_{j,k} = \frac{\vec{d_j} \cdot \vec{d_k}}{|\vec{d_j}|\,|\vec{d_k}|}$$

The weight of word $w_i$ may be the frequency of the word within the document, tf (term frequency), used as is. Alternatively, it may be tf·idf (term frequency / inverse document frequency), obtained by multiplying tf by idf (the logarithm of the value obtained by dividing the number of word occurrences by the total number of documents).
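As an illustration of the document vectors and cosine similarity described above, the following Python sketch builds term-frequency vectors over a small vocabulary and compares two documents. The whitespace tokenizer, the sample vocabulary and texts, and the function names `tf_vector` and `cosine_similarity` are assumptions made for this example and do not come from the patent.

```python
import math
from collections import Counter


def tf_vector(text, vocabulary):
    """Return the term-frequency vector of `text` over a fixed vocabulary."""
    counts = Counter(text.lower().split())  # naive whitespace tokenizer (assumption)
    return [counts[w] for w in vocabulary]


def cosine_similarity(x, y):
    """Cosine of the angle between two weight vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


# Hypothetical vocabulary and documents, for illustration only.
vocabulary = ["cluster", "document", "time", "tree"]
d_j = tf_vector("document cluster document", vocabulary)  # [1, 2, 0, 0]
d_k = tf_vector("document tree time", vocabulary)         # [0, 1, 1, 1]
similarity = cosine_similarity(d_j, d_k)
```

Plain tf weights are used here; a tf·idf weighting would only change how each vector component is computed, not the similarity formula.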

Clustering is then performed based on this similarity. Various clustering methods exist. One known method successively merges similar documents into a single cluster and stops when a threshold indicates that the remaining candidates are not similar (see, for example, Patent Document 1). This method starts with each document as its own cluster and shortens the clustering time by building a table that stores the pairs of most similar clusters.

Suppose, for example, that one month of document information is to be split into one-week periods and the topics of each period examined. The document set is divided week by week, and clustering is run with each week's documents as input. The resulting clusters are independent for each period, so it is difficult to associate clusters across different periods. In other words, there is a problem that the way a given cluster changed over time cannot be presented.

Alternatively, clusters could be generated from the full month of document information, and only the documents of the specified period then presented from the generated clusters. This approach makes it possible to associate clusters across different periods. However, the similarity between clusters is computed using documents outside the period as well, which should not be used, so the result lacks accuracy.

Patent Document 1: Japanese Patent No. 3675682

In short, the conventional examples above have the problem that, given a period from a start time to an end time and a threshold on the distance between documents, clusters cannot be generated from the stored documents in a short time.

The present invention generates a tree structure in advance for all stored documents and generates clusters over the entire period according to a given threshold on the distance between documents. When a period from a start time to an end time is specified, it detects only the documents whose time information falls within that period and generates period-specified clusters composed of the detected documents.

A further object of the present invention is to provide a document classification apparatus, a document classification method, a program, and a recording medium that can provide appropriate clusters in a short time when the threshold or the period is changed.

The present invention is a document classification apparatus comprising: a document storage unit that stores time information, a title, and the body text of each document in association with a document ID that uniquely identifies the document; an input unit that inputs a distance threshold for generating clusters, a start time, and an end time; an inter-document distance calculation unit that, using the body text stored in the document storage unit, calculates, based on inter-document similarity, the distance between the current document and each document whose time information is older than that of the current document; a shortest-distance document storage unit that takes as the parent node the document that, among the documents with older time information, has the shortest distance to the current document, and stores the document ID of the parent node and that shortest distance in association with the document ID of the current document; an integration processing unit that compares the shortest distance stored in the shortest-distance document storage unit with the threshold received from the input unit and generates clusters by placing the shortest-distance document and the current document in the same cluster if the shortest distance is less than or equal to the threshold, and in different clusters if it is greater; an all-cluster storage unit that stores the clusters produced by the integration processing unit, associating a cluster ID that uniquely identifies each cluster with the set of documents belonging to it; a period-specified cluster generation unit that, among the documents constituting the clusters stored in the all-cluster storage unit, detects only the documents whose time information falls within the specified period, which is the period from the start time to the end time received from the input unit, and generates period-specified clusters from the detected documents; and a period-specified cluster storage unit that stores the period-specified clusters generated by the period-specified cluster generation unit.

The period-specified cluster generation unit acquires, from the all-cluster storage unit, the set of documents corresponding to an unprocessed cluster ID. If the set does not contain a plurality of documents that fall within the specified period and whose parent node's time information precedes the start time, or if it does but the distance between those documents is less than or equal to the threshold, the documents of the set that fall within the specified period form a single period-specified cluster. If there are a plurality of such documents and the distance between them is greater than the threshold, the documents of the set that fall within the specified period are divided into a plurality of period-specified clusters.

According to the present invention, a tree structure is generated in advance for all stored documents, clusters over the entire period are generated according to a given threshold on the distance between documents, and, when a period from a start time to an end time is specified, only the documents whose time information falls within the period are detected and period-specified clusters composed of those documents are generated. This has the effect that appropriate clusters can be provided in a short time when the threshold or the period is changed.

The best mode for carrying out the invention is the embodiment described below.

Embodiment 1 of the present invention obtains the distance between documents from the similarity between them, and treats documents separated by a shorter distance as more similar. Specifically, the distance between documents $d_j$ and $d_k$ is defined from the document similarity as $1 - \cos\theta_{j,k}$.

On the Internet, new documents are published one after another, and each is stored with time information, such as its update time, attached. Embodiment 1 considers this situation.

FIG. 1 is a flowchart showing the principle of Embodiment 1 of the present invention.

Embodiment 1 does not calculate the distances between all pairs of documents to be processed. For each document, it calculates distances only to the documents whose time is older than that of the current document. All documents are then arranged in time order, and a tree structure is formed by linking each document to the older document closest to it, with the oldest document as the root (S1).

FIG. 2 is a diagram explaining the principle of Embodiment 1.

In FIG. 2, each ● represents one document, and the documents are arranged from left to right in order of increasing recency. Connecting each document by a line to the document that is older than it and closest to it produces the tree structure shown in FIG. 2(1).
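The tree construction of step S1 can be sketched in Python as follows. Each document is linked to the closest document among those older than it, using the distance 1 − cos θ. The function names, the tuple layout `(doc_id, time, vector)`, the sample vectors, and the assumption that timestamps are distinct are all choices made for this illustration, not details taken from the patent.

```python
import math


def cosine_distance(x, y):
    """Distance 1 - cos(theta) between two weight vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - (dot / norm if norm else 0.0)


def build_parent_table(docs):
    """Map each doc_id to (parent_id, shortest_distance).

    docs is a list of (doc_id, time, vector) with distinct times (assumption).
    The oldest document is the root and maps to (None, None).
    """
    ordered = sorted(docs, key=lambda d: d[1])  # oldest first
    table = {}
    for i, (doc_id, _, vec) in enumerate(ordered):
        older = ordered[:i]  # only documents older than the current one
        if not older:
            table[doc_id] = (None, None)
            continue
        parent = min(older, key=lambda o: cosine_distance(vec, o[2]))
        table[doc_id] = (parent[0], cosine_distance(vec, parent[2]))
    return table


# Hypothetical documents: (ID, time, weight vector).
docs = [
    ("d1", 1, [1, 0, 0]),
    ("d2", 2, [1, 1, 0]),
    ("d3", 3, [0, 0, 1]),
    ("d4", 4, [1, 1, 1]),
]
table = build_parent_table(docs)
```

Because every document looks only at older documents, the links can never form a cycle, which is what makes the result a tree rooted at the oldest document.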

Next, an integration process groups into the same cluster those documents in the tree structure whose inter-document distance is less than or equal to a given threshold (S2).

In the example of FIG. 2(2), cluster 1 and cluster 2 are generated based on the distance threshold. Then, for the clusters obtained by applying the integration process to all documents, only the documents that fall within a specified period (hereinafter the "specified period") are presented (S3).

In the example of FIG. 2(3), when the period from time T1 to T2 is specified, only the ● documents are presented. The ○ documents fall outside the period and are not presented.

At this point, if a cluster contains documents older than the specified period, there may be several documents whose parent node is outside the period. Here, the "parent node" of a document is the document that, among the documents with older time information, has the shortest distance to it. In this case, if the distance between the documents whose parent nodes are outside the period is less than or equal to the threshold, they are presented as the same cluster. If it is greater than the threshold, they are presented as separate clusters (the cluster is split).

As shown in FIG. 2(3), cluster 1 contains two documents whose parent node is outside the period (the documents connected by dotted lines). The distance between these two documents is calculated, and whether to split the cluster is decided according to that distance.

FIG. 3 is a diagram showing the document classification apparatus DC1 according to Embodiment 1 of the present invention.

The document classification apparatus DC1 is connected to a document storage unit 10, an input unit 20, and a cluster display unit 30.

The document storage unit 10 stores, in a storage device, time information, a title, and body text in association with a document ID that uniquely identifies each document.

The input unit 20 outputs the threshold for generating clusters, the start time, and the end time.

The cluster display unit 30 displays the cluster information (document time information, title, and body text) output by the document classification apparatus DC1. Specifically, as shown in FIG. 10(2), when displaying a cluster summary, the cluster display unit 30 displays the time information and text excerpted from the body. The title may also be displayed.

The document classification apparatus DC1 comprises an inter-document distance calculation unit 11, a shortest-distance document storage unit 12, an integration processing unit 13, an all-cluster storage unit 14, a period-specified cluster generation unit 15, and a period-specified cluster storage unit 16.

The inter-document distance calculation unit 11 uses the body text of the documents stored in the document storage unit 10 to calculate, based on inter-document similarity, the distance between the current document and each document whose time information is older.

Among the distances between the current document and the documents with older time information, the document ID of the document at the shortest distance and that shortest distance are stored in the storage device. The computation time for calculating inter-document distances is on the order of the square of the number of documents.

FIG. 4 is a diagram showing an example of the data stored in the shortest-distance document storage unit 12.

As shown in FIG. 4(1), the shortest-distance document storage unit 12 stores, in the storage device, the document ID 41 of each document, its time information 42, the document ID 43 of the document at the shortest distance as calculated by the inter-document distance calculation unit 11, and that shortest distance 44, in association with one another.

FIG. 4(2) shows a tree structure 40 representing the information stored in the shortest-distance document storage unit 12.

The integration processing unit 13 compares the threshold received from the input unit 20 with the shortest distance 44 stored in the shortest-distance document storage unit 12. It first creates one cluster consisting of the newest document and compares that document's shortest distance 44 with the threshold. If the shortest distance is less than or equal to the threshold, the shortest-distance document is added to the same cluster as the current document. If the shortest distance is greater than the threshold, a new cluster consisting of the shortest-distance document is created. Next, the document listed as the shortest-distance document 43 is looked up among the document IDs 41, and that document's shortest distance 44 is compared with the threshold. Once the root is reached (the oldest document, d1 in the example of FIG. 4), the newest document among those not yet compared with the threshold is selected, and the comparison of the shortest distance 44 with the threshold starts again from it.

Once every document has been compared with the threshold, the generated clusters are stored in the all-cluster storage unit 14.

The all-cluster storage unit 14 stores the clusters that the integration processing unit 13 generated over all documents.

In Embodiment 1, the comparison of the shortest distance with the threshold needs to be performed only once per document, so the processing is fast.

Next, the operation is described for the case where the data shown in FIG. 4 is stored in the shortest-distance document storage unit 12 and 0.5 is input as the threshold.

First, the newest document d10 is taken as an element of cluster C1. The shortest distance associated with d10 is 0.6, and this is compared with the threshold 0.5. Because the shortest distance 0.6 of document d10 is greater than the threshold 0.5, a new cluster C2 is created, and d9, the shortest-distance document ID shown in FIG. 4(1), is stored as an element of C2.

Next, in the same way, the shortest distance 0.5 between d9 and its shortest-distance document d2 is compared with the threshold 0.5. Because the shortest distance 0.5 between d9 and d2 is less than or equal to the threshold 0.5, d2 is added to cluster C2, to which d9 belongs.
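The integration step can be sketched in Python as below. Rather than walking from the newest document toward the root as the text describes, this sketch visits documents oldest first and cuts every parent link longer than the threshold; it should yield the same groups, but that equivalence and the resulting cluster numbering are assumptions of this example (the text numbers the cluster of d10 as C1). Only the d10 and d9 rows (distances 0.6 and 0.5) follow the worked example; every other row of the table is hypothetical, since FIG. 4 is not reproduced here.

```python
def integrate(parent_table, threshold):
    """Group documents into clusters by cutting parent links longer than threshold.

    parent_table maps doc_id -> (time, parent_id, shortest_distance); the root
    has parent_id None. Returns {cluster_number: [doc_ids]}.
    """
    clusters = {}
    cluster_of = {}
    # Oldest first, so a parent is always assigned before its children.
    for doc in sorted(parent_table, key=lambda d: parent_table[d][0]):
        _, parent, dist = parent_table[doc]
        if parent is not None and dist <= threshold:
            number = cluster_of[parent]   # join the parent's cluster
        else:
            number = len(clusters) + 1    # start a new cluster
            clusters[number] = []
        cluster_of[doc] = number
        clusters[number].append(doc)
    return clusters


# Rows: doc_id -> (time, shortest-distance document, shortest distance).
# Only d10 (0.6) and d9 (0.5) come from the text; the rest are made up.
table = {
    "d1": (1, None, None),
    "d2": (2, "d1", 0.3),
    "d3": (3, "d1", 0.7),
    "d4": (4, "d3", 0.4),
    "d5": (5, "d2", 0.4),
    "d6": (6, "d5", 0.2),
    "d7": (7, "d6", 0.3),
    "d8": (8, "d1", 0.45),
    "d9": (9, "d2", 0.5),
    "d10": (10, "d9", 0.6),
}
clusters = integrate(table, threshold=0.5)
```

Each row is compared with the threshold exactly once, which mirrors the single-pass property the text relies on for speed.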

FIG. 5 is a diagram showing an example of the data stored in the all-cluster storage unit 14 in Embodiment 1.

FIG. 5 shows the result of processing all documents in this way. As shown in FIG. 5(1), the all-cluster storage unit 14 stores a cluster ID 51 that uniquely identifies each cluster in association with the document set 52 belonging to it. FIG. 5(2) shows the example of FIG. 5(1) as a tree structure, in which each document is connected by a solid line to its shortest-distance document.

Among the documents constituting the clusters stored in the all-cluster storage unit 14, the period-specified cluster generation unit 15 detects the documents whose time information falls within the period from the start time to the end time received from the input unit 20 (the specified period), and generates period-specified clusters, which are clusters composed only of the detected documents.

The period-specified cluster storage unit 16 stores the period-specified clusters generated by the period-specified cluster generation unit 15.

FIG. 6 is a flowchart showing the operation of the period-specified cluster generation unit 15 in Embodiment 1.

First, all cluster IDs are acquired from the all-cluster storage unit 14, and the clusters are processed one by one. If an unprocessed cluster remains (S11), the set of document IDs corresponding to the unprocessed cluster ID is acquired from the all-cluster storage unit 14 (S12).

Next, the time information 42 of each document stored in the shortest-distance document storage unit 12 is consulted to check whether any document falls within the specified period (S13). For a cluster with no document in the specified period, the remaining steps are skipped and processing returns to S11 for the next cluster.

If the set of document IDs for the unprocessed cluster ID contains a document within the specified period (S13), the unit checks whether the specified period splits the cluster (S14). If the cluster is not split (S14), all documents in the cluster fall within the period, and the cluster ID being processed and its associated document IDs are stored as they are in the period-specified cluster storage unit 16 (S15).

If the specified period splits the cluster (S14), the unit then selects, among the documents in the cluster, those whose parent node has time information earlier than the start time output by the input unit 20, and checks whether more than one such document exists (S16). Documents d6, d8, and d9 in cluster C2 of FIG. 8 are an example of a plurality of such documents.

Here, the "parent node" of a document is the document that, among the documents with older time information, has the shortest distance to it.

If only one document has a parent node whose time information is earlier than the start time (S16), only the documents within the specified period are selected from the document ID set associated with the cluster ID, and the selected documents are stored in the period-specified cluster storage unit 16 in association with the cluster ID (S17).

If more than one document has a parent node outside the specified period, the split determination process (S18) is executed, which compares distances with the threshold to decide whether to split the cluster.
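The per-cluster flow of steps S11 to S18 can be sketched in Python as follows. The sketch only decides what happens to each cluster: skipped, stored whole, stored after filtering, or handed to the split determination process. The function name `classify_clusters`, the integer timestamps, the parent links other than d10→d9 and d9→d2, and the membership of C3 are assumptions made for illustration.

```python
def classify_clusters(clusters, parent, times, start, end):
    """Return (stored, needs_split_check) for the specified period [start, end].

    stored maps cluster_id -> documents kept under that ID (S15, S17).
    needs_split_check maps cluster_id -> the in-period documents whose
    parent is older than `start`, for clusters that reach S18.
    """
    stored, needs_split_check = {}, {}
    for cid, docs in clusters.items():                     # S11-S12
        in_period = [d for d in docs if start <= times[d] <= end]
        if not in_period:                                  # S13: nothing in period
            continue
        if len(in_period) == len(docs):                    # S14 NO -> S15
            stored[cid] = list(docs)
            continue
        entries = [d for d in in_period                    # S16
                   if parent[d] is not None and times[parent[d]] < start]
        if len(entries) <= 1:                              # S17
            stored[cid] = in_period
        else:                                              # S18
            needs_split_check[cid] = entries
    return stored, needs_split_check


# Hypothetical data: document d<i> has time i; T1 = 6 and T2 = 10.
times = {f"d{i}": i for i in range(1, 11)}
parent = {"d1": None, "d2": "d1", "d3": "d1", "d4": "d3", "d5": "d2",
          "d6": "d5", "d7": "d6", "d8": "d1", "d9": "d2", "d10": "d9"}
clusters = {
    "C1": ["d10"],
    "C2": ["d1", "d2", "d5", "d6", "d7", "d8", "d9"],
    "C3": ["d3", "d4"],
}
stored, needs_split_check = classify_clusters(clusters, parent, times, 6, 10)
```

With this data, C1 is stored unchanged, C3 is skipped, and C2 reaches the split check with d6, d8, and d9 as the documents whose parents precede the start time.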

FIG. 7 is a flowchart showing the operation of the split determination process (S18) of the period-specified cluster generation unit 15 in Embodiment 1.

The inputs to the split determination process (S18) are the specified threshold, the specified period, the ID of a cluster containing a plurality of documents that are child nodes of parent nodes with time information earlier than the start time, the documents associated with that cluster ID, and the documents' time information. The process first determines the document ID that will inherit the input cluster ID unchanged.

For this, the document in the cluster that is most similar to a document with time information earlier than the start time is selected. Accordingly, this document ID (the provisional root) is selected based on the shortest distance 44 stored for each document ID in the shortest-distance document storage unit 12 (S20).

Next, among the documents in the subtree rooted at the provisional root, those within the specified period are stored in the period-specified cluster storage unit 16 in association with the cluster ID (S21). Then the set of documents other than the provisional root whose parent node's time information is earlier than the start time is selected (S22), and the following processing is performed on this document set. A "provisional root" is a document that is itself within the given period and whose parent node is outside that period.

First, if the document set contains an unprocessed document (YES in S23), one unprocessed document is selected and its distance to the provisional root is newly obtained (S24). Because the inter-document distance calculation unit 11 has already calculated the inter-document distances and the stored values are referenced, this is faster than calculating the distance each time.

If the obtained distance is less than or equal to the specified threshold (YES in S25), the documents within the specified period among those in the subtree rooted at the document being processed are stored in the period-specified cluster storage unit 16 in association with the cluster ID (S26).

If the obtained distance is greater than the threshold (NO in S25), a new cluster ID is created (S27), and the documents within the specified period among those in the subtree rooted at the document being processed are associated with the newly created cluster ID and stored in the period-specified cluster storage unit 16 (S28).

Here, the "newly created cluster ID" is a cluster ID other than those stored in the all-cluster storage unit 14. In addition, so that the newly created cluster ID can be traced back to the cluster it originally belonged to, the cluster ID being processed is stored in association with the newly created cluster ID (S29).

When processing of one document finishes, the flow returns to S23 and repeats until no unprocessed document remains (NO in S23).
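The split determination process (S20 to S29) can be sketched in Python as below. This is one possible reading of the steps rather than the patent's implementation. The function name, the `distance` callback, the `fresh_ids` iterator, and all distances in the sample data are hypothetical. As a further assumption, a document is treated as the root of a subtree whenever its parent is not an in-period member of the same cluster, and each subtree is restricted to the cluster's in-period documents.

```python
def split_by_period(cluster_id, docs, parent, times, start, end,
                    threshold, distance, fresh_ids):
    """Sketch of S20-S29 for one cluster.

    parent maps doc -> (parent_doc, shortest_distance). Returns
    (period_clusters, origin); origin maps each new cluster ID to the
    cluster ID it was split from (S29).
    """
    in_period = [d for d in docs if start <= times[d] <= end]
    kept = set(in_period)

    def entry_of(d):
        # Climb until the parent is no longer an in-period cluster member.
        while parent[d][0] in kept:
            d = parent[d][0]
        return d

    subtrees = {}
    for d in in_period:
        subtrees.setdefault(entry_of(d), []).append(d)
    if not subtrees:
        return {}, {}

    def parent_distance(e):
        dist = parent[e][1]
        return dist if dist is not None else float("inf")

    provisional = min(subtrees, key=parent_distance)             # S20
    period_clusters = {cluster_id: list(subtrees[provisional])}  # S21
    origin = {}
    for entry, members in subtrees.items():                      # S22-S23
        if entry == provisional:
            continue
        if distance(entry, provisional) <= threshold:            # S24-S26
            period_clusters[cluster_id].extend(members)
        else:                                                    # S27-S29
            new_id = next(fresh_ids)
            period_clusters[new_id] = list(members)
            origin[new_id] = cluster_id
    return period_clusters, origin


# Hypothetical data for a cluster C2; document d<i> has time i.
times = {f"d{i}": i for i in range(1, 10)}
parent = {"d1": (None, None), "d2": ("d1", 0.3), "d5": ("d2", 0.4),
          "d6": ("d5", 0.2), "d7": ("d6", 0.3), "d8": ("d1", 0.45),
          "d9": ("d2", 0.5)}
pair = {frozenset(("d6", "d8")): 0.4, frozenset(("d6", "d9")): 0.7}
result, origin = split_by_period(
    "C2", ["d1", "d2", "d5", "d6", "d7", "d8", "d9"], parent, times,
    start=6, end=10, threshold=0.5,
    distance=lambda a, b: pair[frozenset((a, b))],
    fresh_ids=iter(["C4", "C5"]),
)
```

With these made-up distances, d6 becomes the provisional root, d8 stays in C2 because its distance to d6 is within the threshold, and d9 is moved to a new cluster C4 that remains traceable to C2 through `origin`.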

次に、実際のデータを例にして、期間指定クラスタ生成部１５の処理について説明する。 Next, the process of the period designation cluster generation unit 15 will be described using actual data as an example.

FIG. 8 is a diagram explaining the processing in the period-designated cluster generation unit 15.

In FIG. 8, the solid lines connecting documents form the tree structure, and this tree structure is stored in the all-cluster storage unit 14. Each dotted line in FIG. 8 connects two documents whose distance is calculated in S24.

In the example shown in FIG. 8, as described for the integration processing unit 13, the threshold is 0.5, and clusters C1, C2, and C3 have been generated on the basis of this threshold.

When start time = T1 and end time = T2 are given as the designated period, three cluster IDs (cluster C1, cluster C2, and cluster C3) are first acquired from the all-cluster storage unit 14 (S10). The following processing is then executed for each cluster ID.

First, cluster C1 is selected as an unprocessed cluster ID (YES in S11), and document d10 is acquired as the set of documents corresponding to cluster C1 (S12). Cluster C1 consists only of document ID d10, and document d10 falls within the designated period (YES in S13). Since all of its document IDs fall within the designated period (NO in S14), document d10 is associated with cluster C1 and stored in the period-designated cluster storage unit 16.

Next, cluster C2 is selected (YES in S11), and the set of documents {d1, d2, d5, d6, d7, d8, d9} corresponding to cluster C2 is acquired (S12). Cluster C2 contains the set of documents {d6, d7, d8, d9} that fall within the designated period (YES in S13), and the period divides the cluster into the sets {d1, d2, d5} and {d6, d7, d8, d9} (YES in S14).

Therefore, the set {d6, d7, d8, d9} of documents within the designated period is checked for whether it contains multiple documents whose parent node has time information earlier than the start time T1 (S16). As shown in FIG. 8, parent node d5 of document d6, parent node d5 of document d8, and parent node d2 of document d9 are all earlier than the start time T1. Since the parent nodes of the three documents d6, d8, and d9 are outside the designated period (YES in S16), the split determination process (S18) is executed.
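The checks in S13 to S16 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function names, the numeric timestamps standing in for T1 and T2, and the inclusive period bounds are all hypothetical, and the parent of d7 is assumed to be d6 because d7 appears in the subtree of d6 in this example.

```python
from typing import Dict, List, Optional


def docs_in_period(docs: List[str], timestamp: Dict[str, float],
                   start: float, end: float) -> List[str]:
    """S13/S14: keep only the documents whose time falls in [start, end]."""
    return [d for d in docs if start <= timestamp[d] <= end]


def docs_with_old_parent(docs: List[str], parent: Dict[str, Optional[str]],
                         timestamp: Dict[str, float], start: float) -> List[str]:
    """S16: in-period documents whose parent node is older than the start time."""
    result = []
    for d in docs:
        p = parent.get(d)
        if p is not None and timestamp[p] < start:
            result.append(d)
    return result


# Cluster C2 from FIG. 8; the timestamps are hypothetical stand-ins.
timestamp = {"d1": 1, "d2": 2, "d5": 5, "d6": 6, "d7": 7, "d8": 8, "d9": 9}
parent = {"d6": "d5", "d7": "d6", "d8": "d5", "d9": "d2"}
c2 = ["d1", "d2", "d5", "d6", "d7", "d8", "d9"]
T1, T2 = 6, 10  # hypothetical start and end of the designated period

inside = docs_in_period(c2, timestamp, T1, T2)
candidates = docs_with_old_parent(inside, parent, timestamp, T1)
needs_split_check = len(candidates) > 1  # YES in S16 leads to S18
```

With these values, `inside` holds d6 to d9 and `candidates` holds d6, d8, and d9, which matches the three documents named in the text.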

In the split determination process, document d6 is first selected from the set {d6, d8, d9} of documents (child nodes) whose parent node has time information earlier than the designated start time, as the document with the shortest distance to its parent node (S20). This document is called the temporary root. That is, the "temporary root" is the document that, among the documents (child nodes) whose parent node has time information earlier than the designated start time, has the shortest distance to its parent node.

Next, from the set of documents {d6, d7} contained in the subtree rooted at the temporary root d6, documents d6 and d7, which fall within the designated period, are associated with cluster C2 and stored in the period-designated cluster storage unit 16 (S21).

Next, the set of documents {d8, d9}, that is, the documents other than the temporary root d6 whose parent node has time information earlier than the start time T1, is selected (S22), and the following processing is executed for each document ID in this set.

First, when the document set {d8, d9} contains an unprocessed document (YES in S23), an unprocessed document is selected and its distance from the temporary root d6 is calculated (S24). Document d8 is selected first. If the calculated distance is, for example, 0.25, it is less than or equal to the designated threshold of 0.5 (YES in S25), so the documents within the period are selected from the set {d8} of documents contained in the subtree rooted at document d8. In this example document d8 has no child node, so only document d8 is associated with cluster C2 and stored in the period-designated cluster storage unit 16 (S26). The "child node" here is a document whose time information is newer than that of the document itself and that is at the shortest distance from that document.

Next, the unprocessed document d9 is processed (YES in S23). Its distance from the temporary root d6 is calculated (S24). If the calculated distance is, for example, 0.6, it is greater than the designated threshold of 0.5 (NO in S25), so a new cluster ID, cluster C4, is created (S27).

Then, of the set of documents {d9, d10} contained in the subtree rooted at document d9, only {d9} is included in the designated period, so document d9 is associated with cluster C4 and stored in the period-designated cluster storage unit 16 (S28). At this point, the fact that cluster C4 was part of cluster C2 before the split may also be stored (S29). Since processing of the document set {d8, d9} is now complete (NO in S23), the split determination process ends and the process returns to S11 in FIG. 6.
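The walkthrough of S20 to S29 can be traced with the short Python sketch below. It is an illustration only: the function and parameter names are hypothetical, the parent distances 0.1, 0.3, and 0.4 are invented (the text only says that d6 is closest to its parent), and the subtree search is assumed to stay inside the in-period documents of the cluster being processed, which is why d10 is not collected under d9 here.

```python
from typing import Callable, Dict, List, Set, Tuple


def subtree(root: str, children: Dict[str, List[str]], members: Set[str]) -> List[str]:
    """Collect root and its descendants, staying inside the given members."""
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        if node in members:
            found.append(node)
            stack.extend(children.get(node, []))
    return found


def split_determination(
    cluster_id: str,
    candidates: Dict[str, float],    # doc -> distance to its out-of-period parent
    children: Dict[str, List[str]],  # stored tree: doc -> child docs
    in_period: Set[str],             # cluster documents inside the designated period
    distance: Callable[[str, str], float],
    threshold: float,
    new_cluster_id: Callable[[], str],
) -> Tuple[Dict[str, List[str]], Dict[str, str]]:
    result: Dict[str, List[str]] = {}
    origin: Dict[str, str] = {}
    # S20: the temporary root is the candidate closest to its parent.
    temp_root = min(candidates, key=candidates.get)
    # S21: its in-period subtree keeps the original cluster ID.
    result[cluster_id] = sorted(subtree(temp_root, children, in_period))
    # S22, S23: visit the remaining candidates one by one.
    for doc in candidates:
        if doc == temp_root:
            continue
        docs = subtree(doc, children, in_period)
        if distance(doc, temp_root) <= threshold:        # S24, S25
            result[cluster_id] = sorted(result[cluster_id] + docs)  # S26
        else:
            cid = new_cluster_id()                        # S27
            result[cid] = sorted(docs)                    # S28
            origin[cid] = cluster_id                      # S29
    return result, origin


# Values from the FIG. 8 walkthrough; parent distances are hypothetical.
stored = {frozenset(("d8", "d6")): 0.25, frozenset(("d9", "d6")): 0.6}
fresh_ids = iter(["C4"])
result, origin = split_determination(
    "C2",
    {"d6": 0.1, "d8": 0.3, "d9": 0.4},
    {"d6": ["d7"], "d9": ["d10"]},
    {"d6", "d7", "d8", "d9"},
    lambda a, b: stored[frozenset((a, b))],
    0.5,
    lambda: next(fresh_ids),
)
```

Under these assumptions `result` maps C2 to d6, d7, and d8 and C4 to d9, and `origin` records that C4 was split from C2.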

At S11, processing is complete for cluster IDs C1 and C2, so the unprocessed cluster C3 is selected. The set of documents {d3, d4} corresponding to cluster C3 is acquired and checked for documents within the designated period. Since all of them are earlier than the start time (NO in S13), the process returns to S11.

Since processing is now complete for all cluster IDs (cluster C1, cluster C2, and cluster C3), the processing of the period-designated cluster generation unit 15 ends.

FIG. 9 is a diagram showing an example of the data stored in the period-designated cluster storage unit 16 when the processing has finished.

For each cluster ID 61, a document set 62 consisting only of the documents within the designated period is stored in the period-designated cluster storage unit 16. In addition, when the new cluster ID, cluster C4, was created in the split determination process, cluster C2, its pre-split cluster ID, was stored in association with it.
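One possible in-memory shape for the stored data is sketched below in Python. The record type and its field names are hypothetical, since the text only names the cluster ID, the document set, and the pre-split cluster ID; the values are those reached at the end of the walkthrough for period T1 to T2.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PeriodCluster:
    cluster_id: str                    # cluster ID (61 in FIG. 9)
    documents: List[str]               # in-period document set (62 in FIG. 9)
    split_from: Optional[str] = None   # pre-split cluster ID, if any


store = [
    PeriodCluster("C1", ["d10"]),
    PeriodCluster("C2", ["d6", "d7", "d8"]),
    PeriodCluster("C4", ["d9"], split_from="C2"),
]

# Look up which clusters were produced by a split, and from which cluster.
derived = {c.cluster_id: c.split_from for c in store if c.split_from is not None}
```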

FIG. 10 is a diagram showing an example of a screen displayed by the cluster display unit 30 using the information stored in the period-designated cluster storage unit 16 and the document storage unit 10.

FIG. 10(1) shows the clustering results for two periods. The clustering result for period T0 to T1 is displayed in area 71, which contains cluster C2 and cluster C3. The clustering result for period T1 to T2 is displayed in area 72, which contains cluster C1, cluster C2, and cluster C4. Each cluster is drawn as a circle containing the IDs of the documents that belong to it.

FIG. 10(2) shows an overview 80 of the documents belonging to a cluster when the user selects that cluster on this screen.

FIG. 10(2) shows the overview displayed when cluster C2 in period T0 to T1 is selected. Each cluster is the result of clustering using only the documents within its period.

Because unified (identical) cluster IDs are assigned across the different periods, it can be seen that cluster C2 contains the same number of documents in both periods and that its topic is continuing. It can also be seen that cluster C4, whose topic has shifted slightly, was derived from cluster C2. Furthermore, cluster C3 is a topic that arose only briefly during period T0 to T1, and cluster C1 is a topic that newly emerged during period T1 to T2.
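The comparison described above can be expressed as a small Python sketch. The function name and the label strings are hypothetical, and the document sets for period T0 to T1 are read off the walkthrough on the assumption that d1 to d5 all fall in that period.

```python
from typing import Dict, List


def compare_periods(earlier: Dict[str, List[str]],
                    later: Dict[str, List[str]],
                    split_from: Dict[str, str]) -> Dict[str, str]:
    """Label each cluster ID by how it behaves between two periods."""
    report = {}
    for cid in sorted(set(earlier) | set(later)):
        if cid in earlier and cid in later:
            report[cid] = f"continuing ({len(earlier[cid])} -> {len(later[cid])} documents)"
        elif cid in earlier:
            report[cid] = "ended"
        elif cid in split_from:
            report[cid] = f"derived from {split_from[cid]}"
        else:
            report[cid] = "new"
    return report


period_t0_t1 = {"C2": ["d1", "d2", "d5"], "C3": ["d3", "d4"]}
period_t1_t2 = {"C1": ["d10"], "C2": ["d6", "d7", "d8"], "C4": ["d9"]}
report = compare_periods(period_t0_t1, period_t1_t2, {"C4": "C2"})
```

The labels only work because the same cluster ID means the same topic in every period, which is the point the paragraph makes.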

Before the user specifies a threshold or a period, the processing of the inter-document distance calculation unit 11 is executed and the shortest-distance document storage unit 12 is created. When the user then specifies a threshold and a period, the processing by the integration processing unit 13 and the processing by the period-designated cluster generation unit 15 are executed, so that clusters using only the documents within the designated period can be provided quickly and accurately.
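The precomputation stage can be sketched in Python as below. Several choices here are assumptions rather than statements from the text: distance is taken as 1 minus the cosine similarity of raw word counts, documents are tokenized on whitespace, ties for the nearest older document go to the oldest one, and all names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, FrozenSet, List, Tuple


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity of two word-count vectors (assumed distance)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def precompute(docs: List[Tuple[str, float, str]]):
    """docs holds (doc_id, time, text). Returns the distance table and,
    for each document, its nearest older document with the distance."""
    vectors = {doc_id: Counter(text.split()) for doc_id, _, text in docs}
    ordered = sorted(docs, key=lambda d: d[1])  # oldest first
    table: Dict[FrozenSet[str], float] = {}
    nearest: Dict[str, Tuple[str, float]] = {}
    for i, (doc_id, _, _) in enumerate(ordered):
        for older_id, _, _ in ordered[:i]:
            dist = cosine_distance(vectors[doc_id], vectors[older_id])
            table[frozenset((doc_id, older_id))] = dist  # reused later in S24
            if doc_id not in nearest or dist < nearest[doc_id][1]:
                nearest[doc_id] = (older_id, dist)
    return table, nearest


table, nearest = precompute([
    ("a", 1, "rain storm"),
    ("b", 2, "rain storm"),
    ("c", 3, "soccer match"),
])
```

Because `table` and `nearest` depend only on the documents, they can be built once and reused for any threshold or period the user later supplies.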

Also, when the user changes the period, the clusters for the new period can be generated simply by executing the processing of the period-designated cluster generation unit 15.

In the above embodiment, each document is stored together with time information such as its creation time or update time, and only the documents belonging to an arbitrary period designated by the user (the designated period) are classified.

That is, the above embodiment performs cluster creation and period-based cluster modification through the following processing.

(1) Among the documents having time information older than a given document, the document with the shortest distance to that document is defined as its "parent node", and a set of documents for which this distance is less than or equal to the designated threshold is treated as one cluster.

(2) The cluster creation process of (1) is executed only once, using all documents, and the correspondence between each document and its cluster is saved in a storage means.

(3) When a period is designated, only the documents within the designated period are classified into clusters on the basis of the correspondence saved in (2). Documents outside the designated period are not processed.

(4) When, among the documents within the designated period that belong to the same cluster, there are multiple documents whose parent node has time information outside the period, the distance between the document with the oldest time information among them and each of the other documents is newly calculated. If the distance is less than or equal to the threshold, the two documents and all of their descendant nodes are determined to belong to the same cluster. Otherwise, the two documents and their respective descendant nodes are determined to belong to different clusters.

According to the above embodiment, a tree structure is generated for all stored documents, and when a threshold on the inter-document distance is given, clusters can be generated by traversing all documents just once.
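A single-pass cluster assignment of this kind might look like the following Python sketch. The function name, the `C<n>` numbering scheme, and the sample distances are hypothetical; the only inputs are the documents in time order and, for each document, its parent node and the distance to it.

```python
from typing import Dict, List, Tuple


def build_clusters(order: List[str],
                   nearest: Dict[str, Tuple[str, float]],
                   threshold: float) -> Dict[str, str]:
    """Assign a cluster ID to every document in one pass.

    order lists the documents oldest first, so a parent node is always
    assigned before any of its children are visited.
    """
    cluster_of: Dict[str, str] = {}
    counter = 0
    for doc in order:
        link = nearest.get(doc)  # (parent node, distance), absent for the oldest document
        if link is not None and link[1] <= threshold:
            cluster_of[doc] = cluster_of[link[0]]  # join the parent's cluster
        else:
            counter += 1
            cluster_of[doc] = f"C{counter}"        # start a new cluster
    return cluster_of


cluster_of = build_clusters(
    ["a", "b", "c", "d"],
    {"b": ("a", 0.2), "c": ("a", 0.7), "d": ("c", 0.4)},
    threshold=0.5,
)
```

Changing the threshold only requires rerunning this pass over the stored parent distances, not recomputing any distance.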

Also, according to the above embodiment, even when an arbitrary period is given, accurate clusters that use only the documents within the period can be generated from the cluster information generated for all documents, simply by calculating the distances between documents in the same cluster whose parent node has time information outside the period.

Furthermore, since clusters are generated from all documents, clusters can be matched across consecutive periods, making it possible to show whether a cluster has grown, shrunk, or split between periods.

In the above embodiment the processing target is the body text of a document, but "title + body text", which includes the title, may be used as the processing target instead.

That is, the above embodiment is an example of a document classification apparatus comprising: a document storage unit that stores time information, a title, and the body text of a document in association with a document ID that uniquely identifies the document; an input unit that outputs a distance threshold for generating clusters, a start time, and an end time; an inter-document distance calculation unit that uses the body text of the documents stored in the document storage unit to calculate, on the basis of inter-document similarity, the distance between a given document and each document having time information older than that of the given document; a shortest-distance document storage unit that stores, from among those distances, the document ID of the document at the shortest distance together with that shortest distance; an integration processing unit that merges into one cluster the documents whose shortest distance stored in the shortest-distance document storage unit is shorter than the threshold received from the input unit; an all-cluster storage unit that stores the clusters merged by the integration processing unit; a period-designated cluster generation unit that detects, from among the documents constituting the clusters stored in the all-cluster storage unit, only the documents whose time information falls within the designated period running from the start time to the end time received from the input unit, and generates period-designated clusters from the detected documents; and a period-designated cluster storage unit that stores the period-designated clusters generated by the period-designated cluster generation unit.

In this case, a set of documents corresponding to an unprocessed cluster ID is acquired from the all-cluster storage unit. If, among the documents within the designated period, there are multiple documents whose parent node (the document that has the shortest distance to the given document among the documents with older time information) has time information earlier than the start time, a split determination process is executed that decides, on the basis of the threshold, whether to split the cluster.

Also, in the above case, the inputs are the threshold, the designated period, the documents associated with the cluster ID of a cluster that has multiple documents whose parent node is earlier than the start time, and the time information of those documents. From the set of documents whose parent node has time information earlier than the start time, the temporary root, which is the document with the shortest distance to its parent node, is selected. From the set of documents in the subtree rooted at the temporary root, the documents within the designated period are associated with the cluster ID and stored in the period-designated cluster storage unit. The set of documents other than the temporary root whose parent node is earlier than the start time is then selected. If this set contains an unprocessed document, one unprocessed document is selected and its distance from the temporary root is newly calculated. If the calculated distance is less than or equal to the threshold, the documents within the designated period in the subtree rooted at the document being processed are associated with the cluster ID and stored in the period-designated cluster storage unit. If the calculated distance is greater than the threshold, a new cluster ID is created, and the documents within the designated period in the subtree rooted at the document ID being processed are associated with the new cluster ID, which is a cluster ID not stored in the all-cluster storage unit, and stored in the period-designated cluster storage unit.

The above embodiment can also be understood as a method invention. That is, the above embodiment is an example of a document classification method comprising: a document storage stage of storing, in a storage device, time information, a title, and the body text of a document in association with a document ID that uniquely identifies the document; an input stage of inputting a distance threshold for generating clusters, a start time, and an end time; an inter-document distance calculation stage of using the body text of the documents stored in the document storage stage to calculate, on the basis of inter-document similarity, the distance between a given document and each document having time information older than that of the given document, and storing the distance in a storage device; a shortest-distance document storage stage of storing in a storage device, from among those distances, the document ID of the document at the shortest distance together with that shortest distance; an integration processing stage of merging into one cluster the documents whose shortest distance stored in the shortest-distance document storage stage is shorter than the threshold received in the input stage, and storing the result in a storage device; an all-cluster storage stage of storing in a storage device the clusters merged in the integration processing stage; a period-designated cluster generation stage of detecting, from among the documents constituting the clusters stored in the all-cluster storage stage, only the documents whose time information falls within the designated period running from the start time to the end time received in the input stage, generating period-designated clusters from the detected documents, and storing them in a storage device; and a period-designated cluster storage stage of storing in a storage device the period-designated clusters generated in the period-designated cluster generation stage.

The above embodiment can also be understood as a program invention. That is, the above embodiment is an example of a program that causes a computer to execute the method according to any one of claims 1 to 3.

Furthermore, the program may be recorded on a recording medium. That is, the above embodiment is an example of a computer-readable recording medium on which the program according to claim 7 is recorded. In this case, the recording medium may be a CD, DVD, HD, semiconductor memory, or the like.

FIG. 1 is a flowchart showing the principle of Embodiment 1 of the present invention.
FIG. 2 is a diagram explaining the principle of Embodiment 1.
FIG. 3 is a diagram showing the document classification apparatus DC1 of Embodiment 1.
FIG. 4 is a diagram showing an example of the data stored in the shortest-distance document storage unit 12.
FIG. 5 is a diagram showing an example of the data stored in the all-cluster storage unit 14 in Embodiment 1.
FIG. 6 is a flowchart showing the operation of the period-designated cluster generation unit 15 in Embodiment 1.
FIG. 7 is a flowchart showing the operation of the split determination process (S18) of the period-designated cluster generation unit 15 in Embodiment 1.
FIG. 8 is a diagram explaining the processing in the period-designated cluster generation unit 15.
FIG. 9 is a diagram showing an example of the data stored in the period-designated cluster storage unit 16 when the processing has finished.
FIG. 10 is a diagram showing an example of a screen displayed by the cluster display unit 30 using the information stored in the period-designated cluster storage unit 16.

Explanation of symbols

DC1: document classification apparatus
10: document storage unit
11: inter-document distance calculation unit
12: shortest-distance document storage unit
13: integration processing unit
14: all-cluster storage unit
15: period-designated cluster generation unit
16: period-designated cluster storage unit
20: input unit
30: cluster display unit
40: tree structure

Claims

A document storage unit that stores time information, a title, and the text of the document in association with a document ID that uniquely identifies the document;
An input unit for inputting a distance threshold for generating a cluster, a start time, and an end time;
An inter-document distance calculation unit that uses the body text of the documents stored in the document storage unit to calculate, based on inter-document similarity, the distance between a given document and each document having time information older than that of the given document;
A shortest-distance document storage unit that takes as the parent node the document that, among the documents having time information older than that of the given document, is at the shortest distance from the given document, and stores the document ID of the parent node and the shortest distance in association with the document ID of the given document;
An integration processing unit that generates clusters by comparing the shortest distance stored in the shortest-distance document storage unit with the threshold received from the input unit, placing the shortest-distance document and the given document in the same cluster if the shortest distance is less than or equal to the threshold, and placing them in separate clusters if the shortest distance is greater than the threshold;
An all-cluster storage unit that stores the clusters merged by the integration processing unit, associating a cluster ID that uniquely identifies each cluster with the set of documents belonging to that cluster;
A period-designated cluster generation unit that detects, from among the documents constituting the clusters stored in the all-cluster storage unit, only the documents having time information within a designated period running from the start time to the end time received from the input unit, and generates period-designated clusters from the detected documents; and
A period-designated cluster storage unit that stores the period-designated clusters generated by the period-designated cluster generation unit,
the document classification apparatus comprising the above units, wherein
the period-designated cluster generation unit
acquires, from the all-cluster storage unit, the set of documents corresponding to an unprocessed cluster ID,
treats the documents within the designated period in the set of documents corresponding to the cluster ID as one period-designated cluster when there are not multiple documents that are within the designated period and whose parent node has time information earlier than the start time, or when there are multiple such documents but the distance between them is less than or equal to the threshold, and
treats the documents within the designated period in the set of documents corresponding to the cluster ID as multiple period-designated clusters when there are multiple documents that are within the designated period and whose parent node has time information earlier than the start time and the distance between those documents is greater than the threshold.

The document classification apparatus according to claim 1, wherein
the period-designated cluster generation unit,
when there are multiple documents whose parent node has time information earlier than the start time, selects from the set of those documents a temporary root, which is the document with the shortest distance to its parent node,
treats the documents within the designated period in the set of documents of the subtree rooted at the temporary root as one period-designated cluster,
selects the set of documents, other than the temporary root, whose parent node is earlier than the start time and, if this set contains an unprocessed document, selects one unprocessed document and newly calculates its distance from the temporary root; if the calculated distance is less than or equal to the threshold, places the documents within the designated period in the set of documents of the subtree rooted at the document being processed in the same period-designated cluster as that of the temporary root, and
if the calculated distance is greater than the threshold, places the documents within the designated period in the set of documents of the subtree rooted at the document ID being processed in a period-designated cluster separate from that of the temporary root.

A document storage stage that stores time information, a title, and the text of the document in association with a document ID that uniquely identifies the document;
An input stage for inputting a distance threshold for generating a cluster, a start time, and an end time;
Inter-document distance calculation that uses the body text of the document stored in the document storage stage to calculate the distance between the document having time information older than the time information of the own document and the own document based on the similarity between documents. Stages;
Among documents having time information older than the time information of the self-document, a document whose distance from the self-document is the shortest distance is a parent node, and the document ID of the parent node and the shortest distance are the document ID of the self-document. The shortest distance document storage stage to be stored in association with
The shortest distance stored in the shortest distance document storage stage is compared with the threshold value input in the input stage, and if the shortest distance is equal to or less than the threshold value, the shortest document and the own document are regarded as the same cluster, and the shortest distance is stored. An integrated processing stage for generating a cluster by making the shortest document and the self-document separate from each other if the distance is larger than the threshold;
An all-cluster storage stage that stores the clusters integrated in the integration processing stage, associating a cluster ID that uniquely identifies each cluster with the set of documents belonging to that cluster;
A period-specified cluster generation stage that, from the documents making up the clusters stored in the all-cluster storage stage, detects only the documents having time information within the specified period, which runs from the start time to the end time input in the input stage, and generates period-specified clusters from the detected documents;
A period-specified cluster storage stage that stores the period-specified clusters generated in the period-specified cluster generation stage;
The method has the above stages, and
the period-specified cluster generation stage:
acquires, from the clusters stored in the all-cluster storage stage, the set of documents corresponding to an unprocessed cluster ID;
when there are not a plurality of documents that are included in the specified period and whose parent node has time information earlier than the start time, or when a plurality of such documents exist but the distance between them is equal to or less than the threshold, makes the documents included in the specified period, among the set of documents corresponding to the cluster ID, one period-specified cluster; and
when there are a plurality of documents that are included in the specified period and whose parent node has time information earlier than the start time, and the distance between those documents is greater than the threshold, makes the documents included in the specified period, among the set of documents corresponding to the cluster ID, a plurality of period-specified clusters.
A document classification method characterized by the above.
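The parent-linking and integration stages recited above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the function names, the whitespace tokenization, and the use of 1 minus cosine similarity of word-count vectors as the distance are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_distance(text_a, text_b):
    """Assumed distance: 1 - cosine similarity of word-count vectors."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def link_to_nearest_older(docs):
    """docs: {doc_id: (time, body)} -> {doc_id: (parent_id, shortest_distance)}.

    The parent of a document is the nearest document with older time
    information; the oldest document has no parent.
    """
    links = {}
    for doc_id, (time, body) in docs.items():
        older = [(cosine_distance(body, other_body), other_id)
                 for other_id, (other_time, other_body) in docs.items()
                 if other_time < time]
        if older:
            dist, parent = min(older)
            links[doc_id] = (parent, dist)
        else:
            links[doc_id] = (None, None)
    return links


def build_clusters(docs, links, threshold):
    """Join a document to its parent's cluster when the shortest distance
    is at or below the threshold; otherwise start a new cluster."""
    cluster_of, clusters = {}, {}
    # Oldest first, so a parent is always assigned before its children.
    for doc_id in sorted(docs, key=lambda d: docs[d][0]):
        parent, dist = links[doc_id]
        if parent is not None and dist <= threshold:
            cid = cluster_of[parent]
        else:
            cid = len(clusters)
            clusters[cid] = []
        cluster_of[doc_id] = cid
        clusters[cid].append(doc_id)
    return clusters
```

For example, with four hypothetical documents, `build_clusters(docs, link_to_nearest_older(docs), 0.5)` groups each document with its nearest older document only when the two are close enough under the assumed distance.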

In the document classification method according to claim 3,
the period-specified cluster generation stage:
when there are a plurality of documents whose parent node has time information earlier than the start time, selects as a temporary root the document, in the set of the plurality of documents, having the shortest distance from its parent node;
makes the documents included in the specified period, among the set of documents of the subtree rooted at the temporary root, one period-specified cluster;
takes the set of documents, other than the temporary root, whose parent node is earlier than the start time and, if an unprocessed document exists in that set, selects one unprocessed document and calculates a distance for it; if the calculated distance is equal to or less than the threshold, places the documents included in the specified period, among the set of documents of the subtree rooted at the document being processed, in the same period-specified cluster as that of the temporary root; and
if the calculated distance is larger than the threshold, makes the documents included in the specified period, among the set of documents of the subtree rooted at the document being processed, a period-specified cluster separate from the period-specified cluster of the temporary root.
A document classification method characterized by the above.

A program that causes a computer to execute the document classification method according to claim 3 or claim 4.

A computer-readable recording medium on which the program according to claim 5 is recorded.