JP2010182253A

JP2010182253A - Document classification apparatus and document classification method

Info

Publication number: JP2010182253A
Application number: JP2009027551A
Authority: JP
Inventors: Yusuke Sato (佐藤祐介); Makoto Iwayama (岩山真)
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2009-02-09
Filing date: 2009-02-09
Publication date: 2010-08-19
Anticipated expiration: 2029-02-09
Also published as: JP5240777B2

Abstract

PROBLEM TO BE SOLVED: To provide a document classification support apparatus which prevents useless assignment of correct answers and useless clustering operations.

SOLUTION: A document classification apparatus for classifying a plurality of documents on the basis of a given correct category includes a processor, a memory connected to the processor, an input section, a classification control section, a dispersion degree calculation section, and a degree of progress calculation section. The input section receives the correct category of a document input by a user. The classification control section classifies the plurality of documents into a plurality of groups on the basis of the correct category of the document. The dispersion degree calculation section calculates a dispersion degree, expressed as the number of documents whose group is changed by the classification. The degree of progress calculation section calculates a degree of progress of the classification. The document classification apparatus displays the result of the classification of the documents by the classification control section, the dispersion degree calculated by the dispersion degree calculation section, and the degree of progress calculated by the degree of progress calculation section.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a document classification apparatus, and more particularly to a document classification apparatus that uses constrained clustering.

Conventionally, when a large number of documents collected by a document retrieval system or the like are to be classified into several categories according to the content of each document, it is common to classify them automatically using a mechanical method such as clustering (grouping). In mechanical document classification, feature values based on the frequency of the words appearing in a document are used as the elements of that document's vector, and documents are classified based on, for example, the similarity between the vector of one document and the vectors of other documents.
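The vector representation described above can be illustrated with a minimal sketch. It assumes raw word counts as the feature values and cosine similarity as the similarity measure; the source names neither choice, and the whitespace tokenizer and sample sentences are hypothetical.

```python
import math
from collections import Counter


def term_frequency_vector(text: str) -> Counter:
    """Count how often each whitespace-separated word occurs (hypothetical tokenizer)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse word-frequency vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical sample documents.
doc_a = term_frequency_vector("clustering groups similar documents")
doc_b = term_frequency_vector("clustering groups documents by similarity")
doc_c = term_frequency_vector("the patent describes a memory device")
```

Documents that share many words, such as `doc_a` and `doc_b`, score higher than documents with no words in common, such as `doc_a` and `doc_c`.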

A representative clustering method is the K-Means method. Given a preset number k of clusters (groups), the K-Means method classifies a data set by the following operations (see, for example, Non-Patent Document 1). Here, clustering means grouping or classifying documents.

1. From the set of documents, k individuals (documents) are chosen at random and used as the centroids of the clusters.

2. The distance (or similarity) between each individual in the set and each centroid is calculated from the individuals' vectors, and each individual is assigned to the cluster of the nearest centroid.

3. For each cluster created in step 2, a new centroid is determined at random, and the distance (or similarity) between each individual and the new centroid is calculated.

4. If the displacement between the centroids used in step 2 and the new centroids determined in step 3 (for example, the sum of (old centroid − new centroid)²) is not at or below a certain threshold, the procedure returns to step 2.

5. The K-Means method ends.
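The five steps can be sketched roughly as follows. One deliberate difference should be noted: this sketch recomputes each new centroid as the mean of the cluster's members, as in the textbook K-Means method, whereas step 3 above describes the new centroid as being determined at random. The function names, the threshold value, and the sample points are assumptions made for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Component-wise mean of a non-empty list of tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, k, threshold=1e-9, seed=0):
    """Return (labels, centroids) for a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: pick k individuals at random as the initial centroids.
    centroids = rng.sample(points, k)
    while True:
        # Step 2: assign every individual to the nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 3 (textbook variant): new centroid = mean of the cluster members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            new_centroids.append(mean_point(members) if members else centroids[c])
        # Step 4: sum of (old centroid - new centroid)^2 decides whether to loop.
        movement = sum(
            squared_distance(old, new) for old, new in zip(centroids, new_centroids)
        )
        centroids = new_centroids
        if movement <= threshold:
            # Step 5: the method ends.
            return labels, centroids


# Hypothetical data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centroids = k_means(points, k=2)
```

For this toy data the two nearby pairs end up in separate clusters regardless of which points are drawn as initial centroids; on real document vectors the outcome depends on the random start, which is the weakness the text goes on to discuss.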

Clustering by the K-Means method can classify a large number of documents at once, but because the centroids are determined at random, the classification accuracy is often insufficient and the result often differs from the classification the user intended.

In contrast, there is a method that improves classification accuracy by incorporating the classification intended by the user into the clustering as a constraint. That is, the user assigns to some documents their correct answers, namely the classification destinations to which those documents should belong, and the documents with correct answers are used as training data to improve clustering accuracy. This method is called constrained clustering (or semi-supervised clustering) (see, for example, Non-Patent Document 2).
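One simple way to picture such a constraint is an assignment step in which documents with a user-given correct answer keep that cluster, while all other documents go to the nearest centroid. The sketch below shows only that step; the source does not spell out the mechanism at this point, so the function name, the `correct_labels` mapping, and the sample data are assumptions.

```python
def constrained_assign(points, centroids, correct_labels):
    """Assign each point to the nearest centroid unless the user fixed its cluster.

    correct_labels maps a point index to the cluster index given by the user.
    """
    labels = []
    for index, point in enumerate(points):
        if index in correct_labels:
            labels.append(correct_labels[index])  # the user's answer wins
            continue
        distances = [
            sum((a - b) ** 2 for a, b in zip(point, centroid))
            for centroid in centroids
        ]
        labels.append(distances.index(min(distances)))
    return labels


# Hypothetical one-dimensional example.
points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (9.0, 0.0)]
centroids = [(0.0, 0.0), (10.0, 0.0)]
unconstrained = constrained_assign(points, centroids, {})
constrained = constrained_assign(points, centroids, {2: 1})
```

Here the third point is nearer to the first centroid, but the user's correct answer places it in the second cluster, which in turn would pull that cluster's centroid toward it in the next iteration.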

Using constrained clustering improves clustering accuracy efficiently. One clustering method, that is, classification method, that uses constrained clustering is the learning-type classification method.

The learning-type classification method is based on the K-Means method described in Non-Patent Document 1. It repeatedly applies constrained clustering, including the assignment of correct answers by the user, to a set of documents, thereby bringing the clustering result closer to the classification the user intends.

Besides the K-Means method, learning-type classification methods include one based on fuzzy clustering (see, for example, Patent Document 1) and one based on support vector machines (see, for example, Patent Document 2).

Like clustering by the K-Means method, the learning-type classification method using fuzzy clustering determines a representative individual for each of a predetermined number of clusters and calculates a membership rate from the relationship between each individual in the whole set and the representative individuals. With fuzzy clustering, each individual has, for every cluster, a membership rate, that is, a numerical value indicating how strongly it belongs to that cluster. However, with fuzzy clustering it becomes ambiguous to the user which cluster each individual belongs to, and the result is often not the classification the user intended.
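To make the notion of a membership rate concrete, the sketch below uses the widely known fuzzy c-means membership formula as a stand-in. This is an assumption: the method of Patent Document 1 may compute the rate differently, and the function name and `fuzzifier` parameter are illustrative.

```python
def membership_rates(distances, fuzzifier=2.0):
    """Fuzzy c-means style memberships of one individual.

    distances holds the individual's distance to each cluster representative.
    """
    if any(d == 0.0 for d in distances):
        # The individual coincides with a representative: full membership there.
        zeros = [d == 0.0 for d in distances]
        return [1.0 / sum(zeros) if is_zero else 0.0 for is_zero in zeros]
    exponent = 2.0 / (fuzzifier - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]


# An individual at distance 1 from one representative and 3 from another.
rates = membership_rates([1.0, 3.0])
```

The rates sum to 1, and the individual belongs mostly, but not exclusively, to the nearer cluster, which is the ambiguity the paragraph above points out.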

The learning-type classification method using a support vector machine treats cases belonging to one of two classes as correct examples and determines, based on those correct examples, which class an unknown case belongs to. This method requires the work of creating correct examples, and that work is costly. In addition, because it is basically a method for distinguishing between two classes, it cannot discriminate among multiple classes.

Patent Document 1: Japanese Patent Laid-Open No. H9-305566

Patent Document 2: JP 2004-021590 A

Non-Patent Document 1: Trevor Hastie and two others, "The Elements of Statistical Learning: Data Mining, Inference, and Prediction", USA, Springer-Verlag, 2003

Non-Patent Document 2: Sugato Basu and two others, "Semi-supervised Clustering by Seeding", Proceedings of the 19th International Conference on Machine Learning, 2002, pp. 19-26

Non-Patent Document 3: Christopher D. Manning and Hinrich Schutze, "Foundations of Statistical Natural Language Processing", USA, MIT Press, 1999

With conventional learning-type classification methods, the user cannot predict the appropriate number of documents to which correct answers should be assigned in order to improve clustering accuracy. As a result, the user may needlessly continue assigning correct answers to documents.

For example, consider a learning-type classification method in which the user assigns correct answers to arbitrary documents and constrained clustering based on those documents is repeated (that is, documents are sorted into the correct classification destinations by repeating correct-answer assignment → constrained clustering → correct-answer assignment → constrained clustering → ...). In such a method, useless assignment of correct answers continues as follows.

In the early stage of repeated constrained clustering, the more correct answers the user assigns, the more the clustering accuracy improves, by more than the number of documents that received correct answers. For example, when correct answers are assigned to five documents in a set (number of documents with correct answers: 5), constrained clustering classifies (5 + A) documents correctly. This quantity (number of documents with correct answers + A) is the number of documents clustered into the correct classification destination by one run of constrained clustering, and A is called the learning effect of constrained clustering.

The learning effect A of constrained clustering is large in the early stage, when correct answers are first being assigned. However, as the assignment of correct answers and clustering are repeated and the number of documents with correct answers reaches a certain level, the accuracy of constrained clustering no longer rises by more than the number of correct answers given. In other words, once a certain number of documents have received correct answers, the learning effect A approaches 0.
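The arithmetic behind the learning effect can be written out as a small sketch. The history below is made-up illustrative data, not a measurement from the source; only the relation (correctly classified documents) = (documents with correct answers) + A comes from the text.

```python
def learning_effect(correctly_classified: int, answered: int) -> int:
    """A = documents classified correctly in one run minus documents given correct answers."""
    return correctly_classified - answered


# Hypothetical history for a 90-document set:
# (documents with correct answers, documents classified correctly).
history = [(5, 40), (10, 60), (20, 70), (40, 75), (70, 78), (90, 90)]
effects = [learning_effect(correct, answered) for answered, correct in history]
```

In this invented series the learning effect is large at first and falls to 0 once every document has a correct answer, which is the wasteful end state the text warns about.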

The user has no clue as to how many correct answers it takes before the accuracy of constrained clustering stops improving. Therefore, the user does not notice that the learning effect A of constrained clustering is 0 and keeps repeating useless correct-answer assignment and clustering.

If the user fails to notice that the learning effect A is 0, the user may even end up assigning correct answers to every document. Automatic classification methods such as clustering are meant to reduce the work of examining and sorting the contents of all documents, so this defeats their purpose.

An object of the present invention is to make the user aware of the number of documents with correct answers at which the accuracy of constrained clustering improves most.

A representative example of the present invention is as follows. A document classification apparatus that comprises a processor and a memory connected to the processor, and that classifies a plurality of documents based on given correct classification destinations, comprises an input unit, a classification control unit, a divergence calculation unit, and a progress calculation unit. The input unit receives the correct classification destination of a document input by a user. The classification control unit classifies the plurality of documents into a plurality of groups based on the correct classification destination of the document. The divergence calculation unit calculates a divergence indicated by the number of documents whose group was changed by the classification. The progress calculation unit calculates the progress of the classification. The document classification apparatus generates data for displaying the result of the classification by the classification control unit, the divergence calculated by the divergence calculation unit, and the progress calculated by the progress calculation unit.

According to an embodiment of the present invention, the user does not repeat useless correct-answer assignment and clustering when classifying documents.

FIG. 1 is a block diagram showing the document classification support apparatus according to the embodiment of the present invention.

FIG. 2 is a flowchart showing the document classification process in the information terminal according to the embodiment of the present invention.

FIG. 3 is an explanatory diagram showing the clustering result display screen of the document classification support apparatus according to the embodiment of the present invention.

FIG. 4 is an explanatory diagram showing a table included in the document data DB according to the embodiment of the present invention.

FIG. 5 is an explanatory diagram showing a table included in the document index DB according to the embodiment of the present invention.

FIG. 6 is a flowchart showing the clustering process according to the embodiment of the present invention.

FIG. 7 is an explanatory diagram showing the change in the cluster label of each document before and after clustering according to the embodiment of the present invention.

FIG. 8 is an explanatory diagram showing a screen that displays the classification work progress according to the embodiment of the present invention.

FIG. 9 is a diagram showing a screen that displays the history of the cluster divergence over repeated clustering according to the embodiment of the present invention.

FIG. 10 is an explanatory diagram showing the change in the cluster labels of the documents of each classification caused by clustering according to the embodiment of the present invention.

In the embodiment of the present invention, there is a high correlation between the learning effect of constrained clustering based on documents to which the user has assigned correct answers and the change in the cluster to which each document belongs during clustering, that is, the change in cluster labels. In this embodiment, the indicator that the learning effect of constrained clustering has become 0 is the total number of documents whose cluster label changed before the constrained clustering converged.
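One possible reading of this indicator is sketched below: the label changes between consecutive iterations of a single clustering run are counted and summed until the run converges. The exact procedure is the one the source describes with FIGS. 6 and 7; the function names and the label history here are assumptions for illustration.

```python
def count_label_changes(before, after):
    """Number of documents whose cluster label differs between two labelings."""
    return sum(1 for old, new in zip(before, after) if old != new)


def cluster_divergence(label_history):
    """Sum of label changes over consecutive iterations until convergence."""
    return sum(
        count_label_changes(previous, current)
        for previous, current in zip(label_history, label_history[1:])
    )


# Hypothetical labels of five documents across one clustering run.
label_history = [
    ["A", "A", "B", "B", "C"],  # labels before the run
    ["A", "B", "B", "C", "C"],  # after iteration 1: two documents moved
    ["A", "B", "B", "B", "C"],  # after iteration 2: one document moved
    ["A", "B", "B", "B", "C"],  # after iteration 3: nothing moved, converged
]
divergence = cluster_divergence(label_history)
```

A small total suggests that the newly added correct answers barely rearranged the clusters, which is the situation in which further labeling is unlikely to pay off.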

A cluster label is an identifier that identifies each classified document group. Each document is given a cluster label indicating the document group into which it was classified.

The constrained clustering method used in the embodiment of the present invention is the K-Means method.

FIG. 1 is a block diagram showing the document classification support apparatus according to the embodiment of the present invention.

The document classification support apparatus according to the embodiment of the present invention comprises an information terminal 10, two databases (a document data DB 110 and a document index DB 111), and a network 112. The information terminal 10 and the two databases are connected via the network 112, but the information terminal 10 may instead contain the two databases.

The information terminal 10 is a computer comprising a CPU 101, a memory 102, a keyboard and mouse 103, a display 104, and a data communication unit 109. The information terminal 10 also includes a program that provides the functions of a machine classification control unit 105, a cluster divergence calculation unit 106, a progress calculation unit 107, and a document display unit 108.

The CPU 101 performs document classification by executing the program that provides the functions of the machine classification control unit 105, the cluster divergence calculation unit 106, the progress calculation unit 107, and the document display unit 108. The memory 102 temporarily stores the program executed by the CPU 101 and the data needed to execute it.

The keyboard and mouse 103 are devices with which the user inputs information. The display 104 shows the clustering result, the cluster divergence described later, the classification work progress described later, and the like.

The machine classification control unit 105 clusters the documents input to the information terminal 10. The cluster divergence calculation unit 106 calculates the cluster divergence, described later, during the clustering process. The progress calculation unit 107 calculates the classification work progress, described later, during the clustering process. The document display unit 108 causes the display 104 to show the clustering result, the cluster divergence, the classification work progress, and the like.

The data communication unit 109 is the interface through which the information terminal 10 performs data communication via the network 112; for example, it is a LAN card capable of communicating using the TCP/IP protocol. The information terminal 10 communicates through the data communication unit 109 with the two databases connected to the network 112.

The document data DB 110 contains information about documents. In addition to searches on bibliographic information such as author or title, the document data DB 110 supports searches over the full text of documents.

The document index DB 111 contains the correspondence between documents, keywords, and occurrence frequencies. The document index DB 111 provides the list of keywords contained in a given document.
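As a rough illustration of such an index, the sketch below stores one entry per (document, keyword, frequency) triple and looks up the keyword list of one document. The class name, field names, and sample entries are hypothetical; the actual table layout is the one the source shows in FIG. 5.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """One row relating a document to a keyword and its occurrence frequency."""
    document_number: int
    keyword: str
    frequency: int


def keyword_list(entries, document_number):
    """Return {keyword: frequency} for one document."""
    return {
        entry.keyword: entry.frequency
        for entry in entries
        if entry.document_number == document_number
    }


# Hypothetical index contents.
index = [
    IndexEntry(1, "clustering", 4),
    IndexEntry(1, "document", 7),
    IndexEntry(2, "document", 2),
]
keywords_of_doc_1 = keyword_list(index, 1)
```

The returned mapping is exactly the kind of keyword and frequency data from which a document vector for clustering can be built.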

FIG. 2 is a flowchart showing the document classification process in the information terminal according to the embodiment of the present invention.

With reference to FIG. 2, an outline is given of the document classification process executed by the machine classification control unit 105, the cluster divergence calculation unit 106, the progress calculation unit 107, and the document display unit 108.

First, using the keyboard and mouse 103, the user inputs into the document classification support apparatus a set of documents collected in advance by a document retrieval system or the like. At this point the user may also input clustering conditions, such as the number of clusters, in addition to the set of documents. Besides the number of clusters, the clustering conditions include, for example, representative keywords that describe each cluster. The centroids may be determined based on the input keywords. The input set of documents is stored in the document data DB 110 and the document index DB 111.

The machine classification control unit 105 acquires from the document index DB 111 the keywords of each document in the set and their occurrence frequencies (S201).

Next, the machine classification control unit 105 performs clustering according to the keyword occurrence frequencies of each document acquired in S201. The cluster divergence calculation unit 106 then calculates the cluster divergence (S202). The method of calculating the cluster divergence is described later with reference to FIGS. 6 and 7.

Next, the progress calculation unit 107 calculates the classification work progress from the cluster divergence obtained in S202 and the number of correct answers assigned to documents. The method of calculating the classification work progress is described later with reference to FIG. 8. The document display unit 108 then displays on the display 104 the result of the clustering in S202, the cluster divergence calculated in S202, and the classification work progress (S203).

A specific example of the screen shown on the display 104 is described later with reference to FIG. 3. Immediately after the set of documents is first input, no correct answers have been assigned yet, so the classification work progress in S203 is calculated with the number of correct answers set to 0.

Based on the cluster divergence resulting from the clustering and the classification work progress, both shown on the display 104 in S203, the user determines whether the document classification work can be ended (S204). The user decides by checking whether the cluster divergence has become sufficiently low or whether the classification work progress has become sufficiently high.

If the clustering result judged in S204 is not what the user wants and the user inputs to the information terminal 10 that the classification work cannot be ended (No in S204), the user assigns correct answers to documents in each cluster (S205). After S205, the document classification process returns to S202.

There are two operations for assigning a correct answer to a document: the "fix" operation, which marks a correctly classified document as correctly classified, and the "move" operation, which specifies the correct cluster for an incorrectly classified document.

If the clustering result judged in S204 is what the user wants and the user inputs to the information terminal 10 that the classification work can be ended (Yes in S204), the document classification work using the document classification support apparatus of the present invention ends.
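The control flow of S201 through S205 can be summarized as a loop. In the sketch below every callable is a hypothetical stand-in for the corresponding unit (105 to 108) or for the user's input, and the toy functions exist only so the loop can run; none of them reflects the actual calculations of the embodiment.

```python
def run_classification(load_vectors, cluster, compute_progress, show, ask_user):
    """Drive the S201 to S205 loop; every callable is a hypothetical stand-in."""
    vectors = load_vectors()                                    # S201
    correct_answers = {}                                        # document index -> cluster
    divergence_history = []
    while True:
        labels, divergence = cluster(vectors, correct_answers)  # S202
        divergence_history.append(divergence)
        progress = compute_progress(divergence, len(correct_answers))
        show(labels, divergence_history, progress)              # S203
        finished, new_answers = ask_user(labels)                # S204
        if finished:
            return labels, divergence_history
        correct_answers.update(new_answers)                     # S205


def toy_cluster(vectors, correct_answers):
    """Toy clustering: answered documents keep their answer, the rest go to cluster 0."""
    labels = [correct_answers.get(i, 0) for i in range(len(vectors))]
    return labels, 0  # divergence is not modelled in this toy


# Scripted user: first round assigns document 1 to cluster 1, second round finishes.
scripted_replies = iter([(False, {1: 1}), (True, {})])
final_labels, history = run_classification(
    load_vectors=lambda: [(1.0,), (5.0,)],
    cluster=toy_cluster,
    compute_progress=lambda divergence, answered: answered,
    show=lambda labels, divergence_history, progress: None,
    ask_user=lambda labels: next(scripted_replies),
)
```

Keeping the user's decision (S204) as a separate callback mirrors the flowchart, in which the apparatus only presents the indicators and the user decides when to stop.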

The following describes the display screen for the clustering result in S203, the details of the document data DB 110 and the document index DB 111, and the details of the calculations in S202 and S203 of FIG. 2.

FIG. 3 is an explanatory diagram showing the clustering result display screen 301 of the document classification support apparatus according to the embodiment of the present invention.

The clustering result display screen 301 includes an area that displays the clustering result and an area that displays the classification work progress and the cluster divergence. The screen also includes a clustering button 302 for running clustering (S202 in FIG. 2), a fixed button 303 for marking a correctly classified document as correctly classified, and a move button 304 for specifying the correct cluster for an incorrectly classified document.

The area that displays the clustering result contains a clustering result table 305. The clustering result table 305 presents the document classification results along two axes, vertical and horizontal. Each axis is the result of classifying the set of documents from a different viewpoint. For example, the horizontal axis may be the result of classification by clustering, and the vertical axis may be the result of classification by the bibliographic information of the documents stored in the document data DB 110. Alternatively, both axes may be results of classification by clustering. From the clustering result table 305, the user can see which classification each document was assigned to by the clustering.

The vertical classification 306 and the horizontal classification 307 show the names of the classifications on the vertical axis and the horizontal axis, respectively. The document list 308 lists the names of the documents that belong to a given vertical classification 306 and horizontal classification 307.
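The two-axis table can be thought of as a mapping from a (vertical classification, horizontal classification) pair to a document list. The sketch below builds such a mapping; the function name and the sample classifications, with issue year on the vertical axis and cluster on the horizontal axis, are assumptions chosen for illustration.

```python
from collections import defaultdict


def build_result_table(documents):
    """Group document names by (vertical classification, horizontal classification).

    documents is an iterable of (name, vertical_class, horizontal_class) tuples.
    """
    table = defaultdict(list)
    for name, vertical, horizontal in documents:
        table[(vertical, horizontal)].append(name)
    return dict(table)


# Hypothetical documents: vertical axis = issue year, horizontal axis = cluster.
documents = [
    ("doc-1", "2008", "cluster 1"),
    ("doc-2", "2008", "cluster 2"),
    ("doc-3", "2009", "cluster 1"),
    ("doc-4", "2008", "cluster 1"),
]
table = build_result_table(documents)
```

Each key corresponds to one cell of the table, and its value plays the role of the document list 308 shown in that cell.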

When the classification of the document corresponding to a fixed check box 309 is correct, the user selects that document as one to be "fixed". When the user then operates the fixed button 303, a check is entered in the fixed check box 309 of the selected document.

When the classification of the document corresponding to a fixed check box 309 is incorrect, the user selects the document to be "moved" and performs the "move" operation by operating the move button 304. The user specifies the correct classification destination by operating the move button 304; the selected document is moved to the specified destination, and its fixed check box 309 is checked automatically.

The "move" operation may work as follows: after the user operates the move button 304, a screen for specifying the new classification destination of the selected document is displayed; when the user enters the classification destination on that screen, the selected document moves to the new destination and its fixed check box 309 is checked automatically. Alternatively, the user may drag and drop the selected document to move it visually to the new classification destination and then operate the move button 304, after which the fixed check box 309 of the selected document is checked.

The process of S205 shown in FIG. 2 is carried out when the user performs fix and move operations on the documents using the fixed button 303 and the move button 304.
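The effect of the two operations on a document's state can be sketched as follows. The `DocumentRow` class and the helper functions are hypothetical; the only behavior taken from the text is that "fix" checks the fixed check box and that "move" changes the classification destination and checks the box as well.

```python
from dataclasses import dataclass


@dataclass
class DocumentRow:
    """One document as shown in the result table (hypothetical structure)."""
    name: str
    cluster: str
    fixed: bool = False  # state of the fixed check box 309


def fix(row):
    """'Fix' operation: confirm that the current cluster is correct."""
    row.fixed = True


def move(row, destination):
    """'Move' operation: set the correct cluster and check the box automatically."""
    row.cluster = destination
    row.fixed = True


def correct_answers(rows):
    """Collect the user's correct answers to pass to the next clustering run."""
    return {row.name: row.cluster for row in rows if row.fixed}


rows = [
    DocumentRow("doc-1", "cluster 1"),
    DocumentRow("doc-2", "cluster 1"),
    DocumentRow("doc-3", "cluster 2"),
]
fix(rows[0])                # doc-1 is already in the right cluster
move(rows[1], "cluster 2")  # doc-2 was misclassified
answers = correct_answers(rows)
```

Only the checked documents are returned as correct answers, so untouched documents such as `doc-3` remain free to change cluster in the next run.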

The progress and cluster divergence display area includes a progress display unit 310 and a cluster divergence history display unit 311.

The progress display unit 310 displays the classification work progress calculated by the progress calculation unit 107. From this display, the user can see how far the clustering of the documents has progressed.

クラスタ発散度履歴表示部３１１は、クラスタ発散度計算部１０６によって算出された、表示した時点における各クラスタリング処理で得たクラスタ発散度、すなわち、クラスタラベルが変化した文書数の総和値を表示する。ユーザは、クラスタ発散度履歴表示部３１１の表示によって、各々のクラスタリングにおいて、クラスタラベルが変化した文書の数を認識することができる。 The cluster divergence history display unit 311 displays the cluster divergence calculated by the cluster divergence calculation unit 106 and obtained by each clustering process at the time of display, that is, the total value of the number of documents whose cluster labels have changed. The user can recognize the number of documents whose cluster labels have changed in each clustering by the display of the cluster divergence history display unit 311.

分類作業進捗度の算出方法は図８を用いて後述し、クラスタ発散度の算出方法は図６及び図７を用いて後述する。 A method of calculating the classification work progress will be described later with reference to FIG. 8, and a method of calculating the cluster divergence will be described later with reference to FIGS.

Next, the database data used in the processing shown in FIG. 2 is described.

FIG. 4 is an explanatory diagram showing a table included in the document data DB 110 according to the embodiment of the present invention.

The table that stores document data in the document data DB 110 includes a document number 401, an author 402, an issue year 403, a classification 404, and a full text 405.

The document number 401 is the identifier of a stored document. The author 402 is the name of the document's author. The issue year 403 is the year in which the document was issued. The classification 404 is the classification assigned to the document. The full text 405 is the full text of the document. The columns of the table shown in FIG. 4 may be added to or changed depending on the type of target document.

FIG. 5 is an explanatory diagram showing a table included in the document index DB 111 according to the embodiment of the present invention.

The document index DB 111 includes an index 501 for clustering documents by the keywords they contain.

The index 501 is used to calculate the similarity between a document and a cluster.

The index 501 includes a document number 502 and a list 503. The document number 502 is the identifier of a stored document and has the same value as the document number 401 in the document data DB 110.

The list 503 contains (keyword number, frequency) pairs. The keyword number is the identifier of a keyword contained in the corresponding document. The frequency is the number of times the keyword appears in the document (frequency of occurrence).
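The index entry described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `build_index_entry`, the dictionary keys, and the assumption that a document has already been converted into a sequence of keyword numbers are all hypothetical.

```python
from collections import Counter
from typing import Dict, List


def build_index_entry(doc_number: int, keyword_numbers: List[int]) -> Dict:
    """Build one index row: a document number plus (keyword number, frequency) pairs.

    Assumption: keyword extraction has already mapped the document text
    to a list of keyword numbers; the source does not describe that step.
    """
    counts = Counter(keyword_numbers)
    # Sort by keyword number so the list has a stable order.
    return {"doc_number": doc_number, "list": sorted(counts.items())}


# Hypothetical document 7 containing keyword 3 twice and keyword 1 once.
entry = build_index_entry(7, [3, 1, 3])
```

Here `entry["list"]` holds one pair per distinct keyword, which is the shape the list 503 is described as having.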

Similarity can be calculated from the keywords and their frequencies of occurrence (see, for example, "Information Retrieval Algorithms" by Kenji Kita et al., Kyoritsu Shuppan, 2002). For example, a high similarity is calculated between documents in which a given keyword appears with about the same frequency.
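One common way to turn keyword frequencies into a similarity is cosine similarity over sparse frequency vectors. The source does not name a specific measure, so the choice of cosine similarity here is an assumption, and `cosine_similarity` and the sample documents are hypothetical.

```python
import math
from typing import Dict


def cosine_similarity(a: Dict[int, int], b: Dict[int, int]) -> float:
    """Cosine similarity between two {keyword number: frequency} mappings.

    Assumption: cosine similarity is used only as an illustration; the
    source does not specify which similarity measure is applied.
    """
    dot = sum(freq * b.get(keyword, 0) for keyword, freq in a.items())
    norm_a = math.sqrt(sum(f * f for f in a.values()))
    norm_b = math.sqrt(sum(f * f for f in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


doc1 = {1: 3, 2: 1}  # keyword 1 appears 3 times, keyword 2 once
doc2 = {1: 3, 2: 1}  # same keyword frequencies as doc1
doc3 = {3: 5}        # shares no keywords with doc1
```

Documents with the same keyword frequencies score close to 1, and documents with no keywords in common score 0.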

The method of calculating the cluster divergence in S202 and the method of calculating the classification work progress in S203, both shown in FIG. 2, are described in detail below.

First, the clustering in S202 shown in FIG. 2 uses constrained clustering based on the K-Means method. The cluster divergence is calculated by the cluster divergence calculation unit 106 at the same time as this constrained clustering. The details of the processing in S202 are shown in FIG. 6.

FIG. 6 is a flowchart showing the clustering process according to the embodiment of the present invention.

After a set of documents has been input to the information terminal 10 (S201), or after correct answers have been given to documents using the fixed button 303 and the move button 304 shown in FIG. 3 (S205), S202 starts when the user operates the clustering button 302 on the clustering result display screen 301. In S202, the machine classification control unit 105 executes constrained clustering using the document set and the data in the document index DB 111.

Before the constrained clustering in S202 is executed, the cluster divergence calculation unit 106 sets the cluster divergence to 0 (S601). The cluster divergence calculation unit 106 then saves the cluster label of each document (S602). In this embodiment, the cluster labels correspond to the classification names of the vertical classification 306 and the horizontal classification 307 in the clustering result table 305.

After the process of S602, the machine classification control unit 105 executes clustering (S603). The clustering uses constrained clustering based on the K-Means method. The constrained clustering algorithm is as follows.

1. For a cluster that contains documents whose fixed check box 309 (shown in FIG. 3) is checked, the centroid is calculated using all the checked documents in that cluster. For a cluster that contains no checked documents, the centroid is calculated using all documents belonging to that cluster.

2. The similarity between each document and each centroid is calculated, and each document is assigned to the cluster whose centroid has the highest similarity, that is, the nearest centroid. However, documents whose fixed check box 309 is checked are not reassigned by similarity; their cluster labels are kept as they are.

3. A new centroid is calculated from the documents belonging to each cluster as classified in step 2.
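Steps 1 to 3 above can be sketched as a single pass in Python. This is an illustrative reading of the steps, not the patent's code: the dense vector representation, the use of cosine similarity, and all names (`constrained_iteration`, `fixed`, the sample data) are assumptions.

```python
import math
from typing import Dict, List, Set, Tuple

Vector = List[float]


def centroid(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity (assumed measure); 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_by_label(labels: Dict[int, int]) -> Dict[int, List[int]]:
    """Map each cluster label to the document numbers carrying it."""
    members: Dict[int, List[int]] = {}
    for doc, cluster in labels.items():
        members.setdefault(cluster, []).append(doc)
    return members


def constrained_iteration(
    vectors: Dict[int, Vector], labels: Dict[int, int], fixed: Set[int]
) -> Tuple[Dict[int, int], Dict[int, Vector]]:
    # Step 1: centroid from the fixed documents if a cluster has any,
    # otherwise from all documents in the cluster.
    seeds = {}
    for cluster, docs in group_by_label(labels).items():
        basis = [d for d in docs if d in fixed] or docs
        seeds[cluster] = centroid([vectors[d] for d in basis])
    # Step 2: assign each non-fixed document to the most similar centroid;
    # fixed documents keep their current label.
    new_labels = {}
    for doc, vec in vectors.items():
        if doc in fixed:
            new_labels[doc] = labels[doc]
        else:
            new_labels[doc] = max(seeds, key=lambda c: cosine(vec, seeds[c]))
    # Step 3: new centroid from the documents now in each cluster.
    new_centroids = {
        cluster: centroid([vectors[d] for d in docs])
        for cluster, docs in group_by_label(new_labels).items()
    }
    return new_labels, new_centroids


# Hypothetical data: documents 1 and 3 are fixed by the user.
vectors = {1: [1.0, 0.0], 2: [0.9, 0.1], 3: [0.0, 1.0], 4: [0.1, 0.9]}
labels = {1: 0, 2: 1, 3: 1, 4: 0}
new_labels, new_centroids = constrained_iteration(vectors, labels, fixed={1, 3})
```

In this example the fixed documents pull documents 2 and 4 into the clusters whose fixed members they resemble, while documents 1 and 3 keep their labels.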

When the constrained clustering of steps 1 to 3 above finishes, the cluster divergence calculation unit 106 compares the cluster label of each document saved in S602 with the new cluster label obtained in S603, counts the documents whose cluster labels changed, and adds that count to the cluster divergence. The cluster divergence calculation unit 106 also counts, for each cluster, the documents whose cluster labels changed, and adds each count to the cluster divergence of the corresponding cluster (S604).
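The label comparison of S604 can be sketched as below. The source does not say whether a changed document is counted against the cluster it left or the cluster it joined; this sketch assumes the cluster it joined, and the function name and sample labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Tuple


def count_label_changes(
    before: Dict[int, Hashable], after: Dict[int, Hashable]
) -> Tuple[int, Counter]:
    """Return (total changed documents, changed documents per cluster).

    Assumption: a changed document is attributed to its new cluster;
    the source leaves this attribution unspecified.
    """
    per_cluster: Counter = Counter()
    for doc, old_label in before.items():
        if after[doc] != old_label:
            per_cluster[after[doc]] += 1
    return sum(per_cluster.values()), per_cluster


# Hypothetical labels saved in S602 and obtained in S603.
before = {1: "A", 2: "A", 3: "B", 4: "C"}
after = {1: "A", 2: "B", 3: "C", 4: "C"}
changed_total, changed_per_cluster = count_label_changes(before, after)
```

The total is what gets added to the overall cluster divergence, and the per-cluster counts are what get added to each cluster's own divergence.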

FIG. 7 is an explanatory diagram showing how the cluster label of each document changes before and after clustering in this embodiment.

FIG. 7 shows the result of clustering 20 documents into 3 clusters.

The left table 701 lists the cluster label of each document before clustering (that is, the cluster labels saved in S602), and the right table 702 lists the cluster labels after clustering (that is, the cluster labels obtained in S604).

Comparing the left table 701 with the right table 702, the cluster labels of ten documents (document numbers 4, 5, 6, 9, 10, 11, 12, 13, 19, and 20) have changed, so the cluster divergence calculation unit 106 adds 10 to the cluster divergence. In addition, the cluster labels changed for two documents in cluster 1 (document numbers 19 and 20), three documents in cluster 3 (document numbers 4, 5, and 6), and five documents in cluster 3 (document numbers 9, 10, 11, 12, and 13), so the cluster divergence calculation unit 106 adds these counts to the cluster divergence of the respective clusters.

After S604, the machine classification control unit 105 determines whether the clustering has converged (S605). For the convergence determination in this embodiment, for example, the convergence determination method of the K-Means method disclosed in Non-Patent Document 1 is used. That is, the administrator sets in advance a threshold for the amount of change in the centroids obtained by each clustering pass, and convergence is determined from the amount of change in the centroids and that threshold.
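A threshold test of this kind might look like the following sketch. The source only says that the centroid change is compared with a preset threshold; measuring the change as the largest Euclidean shift of any centroid, and treating a shift equal to the threshold as converged, are assumptions, as are the names used.

```python
import math
from typing import Dict, List


def has_converged(
    old_centroids: Dict[int, List[float]],
    new_centroids: Dict[int, List[float]],
    threshold: float,
) -> bool:
    """True when no centroid moved farther than the threshold.

    Assumptions: Euclidean distance, maximum over clusters, and both
    mappings contain the same cluster keys.
    """
    largest_shift = max(
        math.dist(old_centroids[c], new_centroids[c]) for c in old_centroids
    )
    return largest_shift <= threshold


# Hypothetical centroids from two consecutive passes.
old = {0: [0.0, 0.0], 1: [1.0, 1.0]}
new = {0: [0.0, 0.05], 1: [1.0, 1.0]}
```

With these values the largest shift is 0.05, so the result depends on whether the administrator's threshold is above or below that.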

If it is determined in S605 that the clustering has not converged (No in S605), the processing is repeated from S602. If it is determined in S605 that the clustering has converged (Yes in S605), the clustering process shown in FIG. 6 ends.

The clustering in S202 is repeated until it is determined in S605 that it has converged (each repetition is called a loop). For example, if five loops occur and the cluster divergence in the respective loops is 10, 9, 6, 4, and 0, the final cluster divergence is (10 + 9 + 6 + 4 + 0) = 29. The cluster divergence for each cluster is calculated in the same way.
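The accumulation across loops can be illustrated as follows. The loop totals 10, 9, 6, 4, and 0 come from the example above; how each total splits across the three clusters is invented here purely for illustration, and `accumulate_divergence` is a hypothetical name.

```python
from collections import Counter
from typing import List, Tuple


def accumulate_divergence(per_loop: List[Counter]) -> Tuple[int, Counter]:
    """Sum per-cluster label-change counts over all loops of one S202 run."""
    per_cluster: Counter = Counter()
    for loop_counts in per_loop:
        per_cluster.update(loop_counts)  # adds counts cluster by cluster
    return sum(per_cluster.values()), per_cluster


# Per-cluster splits are hypothetical; the loop totals are 10, 9, 6, 4, 0.
loops = [
    Counter({1: 2, 2: 3, 3: 5}),
    Counter({1: 3, 2: 3, 3: 3}),
    Counter({1: 1, 2: 2, 3: 3}),
    Counter({2: 1, 3: 3}),
    Counter(),
]
final_divergence, divergence_per_cluster = accumulate_divergence(loops)
```

The overall value is the sum of the loop totals, and each cluster's value is the sum of its own counts over the same loops.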

After the clustering process in S202 ends, the document display unit 108 displays the cluster divergence and the classification work progress calculated by the method described below (S203).

Through the processing shown in FIGS. 2 and 6, the user can see the cluster divergence of each clustering run and the classification work progress.

The classification work progress is calculated by the progress calculation unit 107 using Equation 1.

The cluster divergence in Equation 1 is the cluster divergence obtained in S202. The number of loops in Equation 1 is the number of repetitions of S602 to S605 shown in FIG. 6. The number of fixed documents in Equation 1 is the number of documents, among those displayed in the clustering result table 305, whose fixed check box 309 is checked. The total number of documents in Equation 1 is the number of documents in the entire document set.

The α in Equation 1 is a weighting coefficient indicating which is emphasized: clustering or the user's operation of giving correct answers. It may be set in advance by the administrator of the document classification support device of this embodiment, or it may be specified by the user. The value of α is preferably in the range 0 ≤ α ≤ 1.0.

The classification work progress is determined from two indicators. One is the fraction of the document set to be classified that has been given correct answers (correct-answer progress). The other is how closely clustering has brought the classification to what the user intends (clustering stability). In Equation 1, the left-hand term of the numerator corresponds to "what fraction of the document set has been given correct answers," and the right-hand term corresponds to "how closely the classification matches what the user intends."

The value of α is determined by which of the two indicators, correct-answer progress or clustering stability, the administrator considers more important. To emphasize correct-answer progress, the administrator increases α; to emphasize clustering stability, the administrator decreases α.

When the classification work progress obtained from Equation 1 is large, the user judges that the classification work has advanced; when it is small, the user judges that it has not.

The classification work progress may be displayed as a single result computed by Equation 1, as in the progress display unit 310 shown in FIG. 3, or the two terms of the numerator of Equation 1 may be displayed separately. That is, the progress display unit 310 may show them separately using the screen shown in FIG. 8.

FIG. 8 is an explanatory diagram showing a screen that displays the classification work progress according to the embodiment of the present invention.

The screen shown in FIG. 8 displays the correct-answer progress and the clustering stability. With this screen, a larger value of correct-answer progress indicates that the classification work has advanced further, and a smaller value of clustering stability indicates that the classification work has advanced further.

FIG. 9 is an explanatory diagram showing a screen that displays the history of the cluster divergence over repeated clustering according to the embodiment of the present invention.

In this embodiment, the history of the cluster divergence is shown as a bar graph.

Each position on the horizontal axis of FIG. 9 corresponds to the cluster divergence obtained by one clustering process (one execution of S202). The labels 1st, 2nd, ..., nth indicate in which repetition of the cycle of giving correct answers to documents and clustering (S202 to S205) the result was obtained.

The vertical axis of FIG. 9 is the value of the cluster divergence obtained in the corresponding clustering. FIG. 9 shows a case in which the document set is classified into three clusters, and each bar shows, distinguished by color, the breakdown of the cluster divergence by classification (each classification corresponds to one cluster in the clustering).

From the display shown in FIG. 9, the user can confirm that the bars become lower as the clustering process is repeated (moving from left to right along the horizontal axis), that is, that the cluster divergence approaches 0, and can thereby recognize that the classification of the document set is approaching the classification the user intends.

Furthermore, from the cluster divergence of each classification, the user can see which classifications have not converged, that is, which have not been given enough correct answers, and can therefore decide which classifications to focus on when giving correct answers. A classification with insufficient correct answers tends to keep a high cluster divergence even when clustering is repeated.

FIG. 10 is an explanatory diagram showing the changes in the cluster labels of the documents in each classification caused by clustering according to the embodiment of the present invention.

FIG. 10 shows a form of the clustering result table 305 different from that shown in FIG. 3. From the clustering result table 305 shown in FIG. 10, the user can see which classifications had a change in the documents belonging to the vertical and horizontal classifications as a result of clustering.

Each entry of the clustering result table 305 shown in FIG. 10 is presented with a value 1001 and a background 1002.

The value 1001 indicates the number of documents belonging to the vertical and horizontal classifications in the clustering result table 305.
The background 1002 highlights classifications whose document count changed before and after clustering by changing their background color. A classification is highlighted when a document that did not belong to it before clustering has newly joined it. Accordingly, a classification whose document count merely decreased because documents moved to other classifications is not highlighted. A classification whose document count decreased compared with before clustering but which also gained new documents is highlighted.
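The highlighting rule can be sketched as a set computation over the labels before and after clustering. This is an illustration only; `highlighted_classifications` and the sample labels are hypothetical.

```python
from typing import Dict, Hashable, Set


def highlighted_classifications(
    before: Dict[int, Hashable], after: Dict[int, Hashable]
) -> Set[Hashable]:
    """Classifications that gained at least one document they did not hold before.

    A classification that only lost documents never appears in the result.
    """
    gained = set()
    for doc, new_label in after.items():
        if before.get(doc) != new_label:
            gained.add(new_label)
    return gained


# Hypothetical labels: document 2 moves from A to B.
labels_before = {1: "A", 2: "A", 3: "B"}
labels_after = {1: "A", 2: "B", 3: "B"}
to_highlight = highlighted_classifications(labels_before, labels_after)
```

In this example only B is highlighted: A lost a document but gained none, so it is left unhighlighted.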

A change in the classification to which a document belongs is determined by comparing the pre-clustering classification saved in S602 shown in FIG. 6 with the post-clustering classification in S604.

Although the clustering result table 305 shown in FIG. 10 displays document counts and highlights the background 1002, the list of documents in the clustering result table 305 shown in FIG. 3 may be displayed instead, with the background highlighted.

As described above, according to the embodiment of the present invention, displaying the classification work progress lets the user see how far the classification work has advanced, taking into account both the number of documents given correct answers and the clustering stability based on the cluster divergence. This avoids wastefully repeating correct-answer operations and clustering operations that have no effect on the clustering.

In addition, displaying the changes caused by clustering for each classification makes it easier to decide which classifications should be given correct answers, which enables effective clustering.

10 Information terminal
101 CPU
102 Memory
103 Keyboard / mouse
104 Display
105 Machine classification control unit
106 Cluster divergence calculation unit
107 Progress calculation unit
108 Document display unit
109 Data communication unit
110 Document data DB
111 Document index DB
112 Network

Claims

A document classification device that comprises a processor and a memory connected to the processor and that classifies a plurality of documents based on a given correct classification destination, the device comprising:
an input unit, a classification control unit, a divergence calculation unit, and a progress calculation unit, wherein
the input unit accepts the correct classification destination of a document input by a user;
the classification control unit classifies the plurality of documents into a plurality of groups based on the correct classification destination of the document;
the divergence calculation unit calculates a divergence indicated by the number of documents whose group has changed as a result of the classification;
the progress calculation unit calculates the progress of the classification; and
the device generates data for displaying the result of classifying the documents by the classification control unit, the divergence calculated by the divergence calculation unit, and the progress calculated by the progress calculation unit.

The document classification device according to claim 1, wherein
the divergence calculation unit calculates, for each group, the divergence for each classification, and
the device generates data for displaying the divergences calculated in a plurality of classifications on a single screen.

The document classification device, wherein the progress calculation unit calculates the progress based on the calculated divergence, the number of times classification has been performed, and the number of documents for which the correct classification destination has been input.

The document classification device according to claim 1, wherein data for displaying the classification result, the calculated divergence, and the calculated progress for each classification is generated.

The document classification device according to claim 1, wherein data is generated for displaying, in a distinguishable manner, groups in which the number of documents belonging to the group has changed as a result of the classification and groups in which it has not changed.

A document classification method in a document classification device that comprises a processor and a memory connected to the processor and that classifies a plurality of documents based on a given correct classification destination, the method comprising:
the processor accepting the correct classification destination of a document input by a user;
the processor classifying the plurality of documents into a plurality of groups based on the correct classification destination of the document;
storing the classification result in the memory;
the processor calculating a divergence indicated by the number of documents whose group has changed as a result of the classification;
the processor calculating the progress of the classification;
the processor storing, in the memory, the classification result, the calculated divergence, and the calculated progress; and
the processor generating data for displaying the classification result, the calculated divergence, and the calculated progress.

The document classification method according to claim 6, wherein
the processor calculates, for each group, the divergence for each classification, and
the processor generates data for displaying the divergences calculated in a plurality of classifications on a single screen.

The document classification method according to claim 6, wherein the processor calculates the progress based on the calculated divergence, the number of times classification has been performed, and the number of documents for which the correct classification destination has been input.

The document classification method, wherein the processor generates data for displaying the classification result, the calculated divergence, and the calculated progress for each classification.

The document classification method, wherein the processor generates data for displaying, in a distinguishable manner, groups in which the number of documents belonging to the group has changed as a result of the classification and groups in which it has not changed.