JP2011191834A

JP2011191834A - Method, device and program for classifying document

Info

Publication number: JP2011191834A
Application number: JP2010055271A
Authority: JP
Inventors: Hisao Mase; 久雄間瀬; Keiko Yagi; 敬宏八木
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2010-03-12
Filing date: 2010-03-12
Publication date: 2011-09-29
Anticipated expiration: 2030-03-12
Also published as: JP5439235B2

Abstract

PROBLEM TO BE SOLVED: To output the number of classifications desired by a user while improving recall and maintaining precision in the K-nearest-neighbor (KNN) method, a method commonly used for document classification.

SOLUTION: For each classification assigned to the K similar documents, the sum of the similarities of the similar documents assigned that classification is calculated, and the classifications with the highest sums are identified. The following processing is then repeated: the identified classifications are excluded from those assigned to the similar documents, any similar document left with zero classifications as a result is excluded, K similar documents are identified from the remaining similar documents, and classifications are identified again.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a document classification method, a document classification device, and a document classification program that take as input a document to which no classification has been assigned and assign appropriate classifications from a predefined classification scheme. In particular, it relates to a document classification method that can output an arbitrary number of classifications while keeping both recall and precision high, and to a document classification device and program that implement the method.

When searching for a desired document in a large collection of documents containing text, such as patents, papers, newspaper articles, and web pages, an effective approach is to assign classifications in advance according to each document's format, bibliographic data, and content, and then narrow the search range by specifying a classification in the search condition. At present, however, a great deal of labor goes into assigning classifications to large numbers of documents, keeping the classification scheme up to date as times change, and reassigning classifications to the large body of past documents whenever the scheme is updated.

Automatic document classification has therefore attracted attention as a way to reduce the burden of this work. A common algorithm, disclosed for example in JP 2007-323454 A, is the K-nearest-neighbor method (referred to below as the KNN method). A set of documents with classifications already assigned is prepared as a learning document set. For a document that has not yet been classified, similar learning documents are retrieved, the classifications assigned to the top-ranked learning documents are statistically analyzed to compute a classification score for each classification, and the classifications to assign are determined by the magnitude of these scores. In the KNN method, the score of each classification takes into account factors such as how many of the top K retrieved learning documents (the K of KNN) carry the classification and how highly those documents are ranked, and classifications are output in descending order of score.

JP 2007-323454 A

The KNN method is often used for document classification because it can share an index with similar document search, its algorithm is relatively simple, and its classification accuracy is relatively high. Its drawback is that, by the nature of the algorithm, the number of classifications it outputs is unstable.

For example, suppose classifications are determined from the top 10 similar document search results and each document carries exactly one classification. The number of classifications that can be output for one input document is then at least 1 and at most 10. This is acceptable if the correct classification is among the outputs. However, for a classification scheme with a very large number of fine-grained classifications, such as patent classification, the correct classification is often missed, and recall (the degree to which correct classifications are not missed) may fall as a result.

One way to increase the number of output classifications is to increase the number of similar documents used by the KNN method (from 10 to 100 in the example above). However, evaluation experiments have shown that this lowers precision (the degree to which the output is free of noise classifications).

Meanwhile, one use of automatic document classification is to have a person review the automatic classification results and add, delete, or correct classifications as needed, which is more efficient than classifying manually from scratch. If the reviewer judges an automatic result to be inappropriate, work efficiency suffers unless the candidate classifications needed to assign the appropriate one have been presented to the user without omission. It is therefore desirable that the automatic classification can output as many classifications as the user requests.

In view of the above, the problem is to make the KNN method, a simple yet powerful document classification method, capable of outputting a fixed number of classifications without lowering recall or precision.

In the present invention, for an input document to be classified, the top K similar documents retrieved by similar document search means are identified (K is a natural number specified in advance). For each classification assigned to these K similar documents, one of the following is calculated over the documents among the K that carry the classification: the number of documents, the sum of the similarities (similar document search scores), or the sum of the similarities each divided by the logarithm of the search rank. The top N classifications by calculated value (N is a natural number specified in advance) are stored in a classification assignment result table as classifications to be assigned to the input document. These N classifications are then excluded from the classifications assigned to the similar documents retrieved by the similar document search means, any similar document left with zero classifications as a result is excluded, and the top K similar documents among those remaining are identified. This processing is repeated until the number of classifications stored in the classification assignment result table reaches the number specified in advance by the user, which solves the above problem.

The present invention makes it possible to output an arbitrary number of classifications with the KNN method, a simple yet powerful document classification method, without lowering recall or precision. Because a fixed number of highly accurate classification results can be presented, the user can also review the validity of the automatic classification results more efficiently.

FIG. 1 is a diagram showing an example of a block diagram in an embodiment of the present invention.
FIG. 2 is a diagram showing an example of the hardware configuration in an embodiment of the present invention.
FIG. 3 is a diagram showing an example of an outline of the processing of the classification assigning unit 9 in an embodiment of the present invention.
FIG. 4 is a diagram showing an example of the structure of the similar document table 8 in an embodiment of the present invention.
FIG. 5 is a diagram showing an example of the structure of the document-classification correspondence table 10 in an embodiment of the present invention.
FIG. 6 is a diagram showing an example of the structure of the similar document table with classification that the classification assigning unit 9 generates and refers to in an embodiment of the present invention.
FIG. 7 is a diagram showing an example of the processing flow of the classification assigning unit 9 in an embodiment of the present invention.
FIG. 8 is a diagram showing an example of the processing flow of the classification specifying unit 11 in an embodiment of the present invention.
FIG. 9 is a diagram showing an example of the processing flow of the classification editing unit 12 in an embodiment of the present invention.
FIG. 10 is a diagram showing an example of the parameter setting screen in an embodiment of the present invention.

An embodiment of the present invention is described in detail below with reference to the drawings. The present invention is not limited to this embodiment.

This embodiment describes a document classification system that takes a document containing text as input and automatically identifies appropriate classifications from a predefined classification scheme. The embodiment targets documents written in Japanese, but it is also applicable to documents written in other languages such as English. The system is based on the KNN method: it searches a learning document set (a set of documents with classifications already assigned) for documents whose content is similar to the input document, and determines the classifications to assign by statistically analyzing the classifications assigned to the top-ranked search results. The KNN method requires a learning document set with classifications assigned in advance, and this embodiment assumes that such a set already exists.

FIG. 1 is a diagram showing an example of a block diagram of this embodiment. The individual data and processing units are described in detail in the explanations of FIG. 2 onward.

The document classification system takes the classification target document set 2 as input, identifies the classifications to assign, and outputs them to the classification assignment result table 13. Before this processing, for each document in the learning document set 1, the document analysis unit 3 performs morphological analysis of the text with reference to the word dictionary 4 and extracts weighted feature words that characterize the document's content, and the search index generation unit 5 stores them in the search index 6 with each document ID associated with its weighted feature words.

Automatic classification begins with the document analysis unit 3 processing each document in the classification target document set 2 in the same way: it performs morphological analysis of the text with reference to the word dictionary 4 and extracts weighted feature words.

Next, the similar document search unit 7 refers to the search index 6 to identify learning documents whose feature word sets are similar to the extracted set of weighted feature words, and stores them in the similar document table 8 as similar documents, in descending order of similarity, together with their similarities (similar document search scores).

Next, the classification assigning unit 9 takes the similar document table 8 as input and refers to the document-classification correspondence table 10, which describes the correspondence between the documents in the learning document set 1 and the classifications assigned to them, to identify the classifications to assign to each classification target document 2, and stores them in the classification assignment result table 13.

In this system, everything other than the classification assigning unit 9 has the same configuration as the conventional KNN method. Morphological analysis in the document analysis unit 3 is a known technique for which many tools are publicly available. The present invention does not depend on the morphological analysis method and any tool may be used, so it is not discussed further here. Methods for extracting weighted feature words are likewise known, such as the TF-IDF method, which considers word frequency and the number of documents in which a word appears, and weighting schemes that consider where a word appears in a document or its co-occurrence tendencies. The present invention does not depend on the method of extracting or weighting feature words and any method may be adopted, so it is not discussed further here. For the feature word matching, similarity calculation, and search index generation methods of the similar document search unit 7, many similar document search tools and commercial products exist. A widely known approach is to generate for each document a document vector whose elements are its weighted feature words and to judge two documents to be more similar the higher the cosine of the angle between their document vectors. The present invention only requires that similar documents can be retrieved with a similarity (similar document search score) and does not depend on the internals of the similar document search method, so it is not discussed further here.
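The cosine-based similarity mentioned above can be sketched as follows. This is only an illustration of that widely known approach, not the search method of the embodiment, which the text leaves open. The function names and the dictionary representation of a document vector (feature word mapped to weight) are assumptions made for this example.

```python
import math


def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine of the angle between two sparse document vectors.

    Each vector maps a feature word to its weight (hypothetical representation).
    """
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def rank_similar(query: dict[str, float],
                 learning_docs: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Return (document ID, similarity) pairs in descending order of similarity."""
    scored = [(doc_id, cosine_similarity(query, vec))
              for doc_id, vec in learning_docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored
```

A production system would use an inverted index rather than scoring every learning document, but the ranked list of (document ID, similarity) pairs is all that the later processing needs.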

The key part of the present invention is the classification assigning unit 9. It contains two processing units, the classification specifying unit 11 and the classification editing unit 12, and identifies classifications one after another by executing the two alternately and repeatedly.

The classification specifying unit 11 refers to the similar document table with classification, obtained from the similar document table 8 and the document-classification correspondence table 10, identifies the top N classifications to assign, and stores them in the classification assignment result table 13. When the number of classifications stored in the classification assignment result table 13 reaches the value specified in advance by the user, the processing of the classification assigning unit 9 ends. To identify the top N classifications, the unit considers the top K similar documents stored in the similar document table with classification and, for each classification, calculates one of the following over the similar documents that carry it: the number of documents, the sum of the similarities, or the sum of the similarities each divided by the logarithm of (search rank + 1). The N classifications with the highest calculated values are identified (N is a natural number specified in advance).

The classification editing unit 12 removes, from the classifications assigned to the similar documents stored in the similar document table with classification, those classifications that are stored in the classification assignment result table 13 at that point. If this leaves a similar document with zero classifications, the document is removed from the similar documents and the search ranks of the documents below it are moved up in order. As a result, the set of top K similar documents changes dynamically.

FIG. 2 is a diagram showing an example of the hardware configuration of this embodiment. The system broadly consists of a processing device 50 that executes computation, a keyboard 51 and a mouse 52 with which the user inputs operations or data, an output monitor 53 that presents computation results to the user, and a storage device 60 that stores the programs and data used in the processing of the processing device 50. When input or output data is exchanged with another computer, it is sent and received via a network 54.

The storage device 60 consists of a working area 61 that temporarily stores data being processed by the processing device 50, a learning document set storage area 62, a classification target document set storage area 63, a document analysis unit storage area 64, a word dictionary storage area 65, a search index generation unit storage area 66, a search index storage area 67, a similar document search unit storage area 68, a similar document table storage area 69, a classification assigning unit storage area 70, a classification specifying unit storage area 71, a classification editing unit storage area 72, a document-classification correspondence table storage area 73, and a classification assignment result table storage area 74. The processing device 50 performs its processing by repeatedly loading the necessary programs and data from the storage device 60 and storing the execution results back in the storage device 60.

FIG. 3 is a diagram showing an outline of the processing of the classification assigning unit 9. FIG. 3(1) shows the initial state of the similar document table with classification that the classification assigning unit 9 generates and refers to. It is obtained by joining the similar document table 8 and the document-classification correspondence table 10 with the similar document ID as the key. The data consists of the similar document search result rank 301, the similar document ID 302 of a document retrieved as similar to the input document, the similarity 303 that quantifies the degree of similarity to the input document, and the classification 304 assigned to the similar document. In FIG. 3 each document carries one classification, but documents carrying multiple classifications can be processed in the same way.
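The join described above can be sketched in a few lines. This is a minimal illustration under assumed data structures: the function name, the tuple and dictionary layouts, and the field names `rank`, `doc_id`, `similarity`, and `classification` are all hypothetical, chosen to mirror items 301 to 304.

```python
def join_with_classifications(similar_docs, doc_to_classes):
    """Join search results with the document-classification table on document ID.

    similar_docs:   list of (rank, doc_id, similarity) tuples, best match first.
    doc_to_classes: dict mapping doc_id to the list of classifications it carries.
    Returns one row per (document, classification) pair.
    """
    rows = []
    for rank, doc_id, similarity in similar_docs:
        # Inner join: a document with no entry in doc_to_classes yields no rows.
        for classification in doc_to_classes.get(doc_id, []):
            rows.append({
                "rank": rank,
                "doc_id": doc_id,
                "similarity": similarity,
                "classification": classification,
            })
    return rows
```

For instance, with the illustrative inputs `[(1, "00004", 0.9), (2, "01015", 0.8)]` and `{"00004": ["C01"], "01015": ["C02", "C05"]}` (the similarities and the classifications of 01015 are made up for this example), the result has three rows, two of which share rank 2.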

For the top K similar documents in this initial data (10 in FIG. 3), N classifications (1 in FIG. 3) are identified based on the KNN method. The following three methods of identifying classifications are conceivable, but other methods may also be used.

(1) Order the classifications by the number of similar documents within the top K that carry each classification, largest first, and identify the top N as the classifications to assign.

(2) For each classification, calculate the sum of the similarities of the similar documents within the top K that carry it, order the classifications by this sum in descending order, and identify the top N as the classifications to assign.

(3) For each classification, calculate the sum, over the similar documents within the top K that carry it, of the similarity divided by the logarithm of (search rank + 1), order the classifications by this sum in descending order, and identify the top N as the classifications to assign.
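The three scores can be written compactly as follows. The notation is introduced here for illustration and does not appear in the source: $D_K(c)$ is the set of similar documents within the top K that carry classification $c$, $s_d$ is the similarity of document $d$, and $r_d$ is its search rank. The base of the logarithm is not stated in the text.

```latex
\begin{align*}
\text{(1)}\quad \mathrm{score}_1(c) &= \lvert D_K(c) \rvert \\
\text{(2)}\quad \mathrm{score}_2(c) &= \sum_{d \in D_K(c)} s_d \\
\text{(3)}\quad \mathrm{score}_3(c) &= \sum_{d \in D_K(c)} \frac{s_d}{\log(r_d + 1)}
\end{align*}
```

Adding 1 to the rank keeps the denominator of score (3) positive for the top-ranked document, since $\log(1) = 0$ would otherwise cause a division by zero.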

In this embodiment K = 10 and N = 1, but these parameters can easily be made user-specifiable. The number of similar documents output as the similar document search result is set to 10,000, and this too can easily be made user-specifiable.

Suppose that in FIG. 3(1) the above method identifies classification C01 as the classification to assign. Every occurrence of C01 is then removed from the classifications 304 assigned to the similar documents. Any similar document left with no classification at all is itself removed, and the gaps left by removed documents are filled by moving the ranks of the following documents up in order. FIG. 3(2) shows the data after classification C01 is removed. Similar document 00004, ranked first in the initial state, has been removed, and similar document 01015, previously ranked second, has moved up to first. The total number of similar documents has also fallen from 10,000 to 9,000, which shows that removing classification C01 eliminated 1,000 similar documents.

For the top K similar documents in this state, N classifications are again determined by the KNN method described above. If classification C02 is determined as the classification to assign, C02 is removed in turn and the ranks are moved up.

These operations are repeated, and processing ends when the number of identified classifications reaches the number specified in advance by the user. The classifications to assign are ranked in the order in which they were determined by the above method (classification C01, classification C02, and so on).

In this way, by dynamically changing the range of similar documents over which the KNN method identifies classifications and identifying the classifications to assign one after another, recall can be improved without lowering precision, and the number of classifications desired by the user can be output.

FIG. 4 is a diagram showing the data structure of the similar document table 8 for the classification target document set 2. It consists of an input document ID 401 that uniquely identifies an input document, the search rank 402 in the similar document search for that document, a similar document ID 404 that uniquely identifies a similar document, and the similarity 403 of that similar document to the input document. In this embodiment, the similar document search outputs only documents from the learning document set 1, to which classifications have been assigned in advance. Alternatively, similar documents may be retrieved from a document set that also includes unclassified documents, and the similar document table 8 may be generated by filtering the results down to the learning documents that have classifications. In that case, statistics used by the similar document search, such as the number of documents in which a word appears, change, so different search results are obtained.

FIG. 5 is a diagram showing the data structure of the document-classification correspondence table 10. Each document ID 411 is stored paired with a classification 412 assigned to that document. A single document may carry multiple classifications, as with document IDs 12345 and 00021.

FIG. 6 is a diagram showing the structure of the similar document table with classification that the classification assigning unit 9 generates and refers to. It corresponds to the data described for FIG. 3 and is stored in the working area 61. Each similar document ID 424 output as a similar document search result is recorded together with its classification 425. When one similar document carries multiple classifications, as with the similar documents at search ranks 422 of 1 and 3, it is split into multiple records so that the classification 425 of each record holds a single classification.

FIG. 7 is a diagram showing the processing flow of the classification assigning unit 9. First, the similar document table 8 and the document-classification correspondence table 10 are joined with the similar document ID as the key to generate the similar document table with classification described for FIG. 6, which is stored in the working area 61 (step 1001). Next, the processing of the classification specifying unit 11 is executed (step 1002). It is then determined whether the number of classifications stored in the classification assignment result table 13 has reached the number specified in advance by the user (step 1003). If it has, the processing of the classification assigning unit 9 ends. If not, the processing of the classification editing unit 12 is executed (step 1004) and the flow returns to step 1002.
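The control flow of steps 1002 to 1004 can be sketched as a loop that alternates between two callables. This is a hedged sketch rather than the embodiment's implementation: the names `assign_classifications`, `specify`, `edit`, and `wanted` are hypothetical, the joined table of step 1001 is assumed to be passed in as `rows`, and the guard against an exhausted table is an addition that the source does not describe.

```python
from typing import Callable

Row = dict  # one record of the similar document table with classification


def assign_classifications(
    rows: list[Row],
    specify: Callable[[list[Row]], list[str]],
    edit: Callable[[list[Row], list[str]], list[Row]],
    wanted: int,
) -> list[str]:
    """Alternate the specifying and editing steps until enough classifications are found.

    rows:    joined table produced beforehand (step 1001).
    specify: returns the top N classifications for the current table (step 1002).
    edit:    returns the table with already assigned classifications removed (step 1004).
    wanted:  number of classifications requested by the user.
    """
    results: list[str] = []  # plays the role of the classification assignment result table
    while True:
        found = specify(rows)                                  # step 1002
        results.extend(c for c in found if c not in results)
        if len(results) >= wanted:                             # step 1003
            break
        if not found:
            # Assumption, not in the source: stop when no classification remains,
            # otherwise the loop would never terminate on a small table.
            break
        rows = edit(rows, results)                             # step 1004
    return results
```

The returned list is ordered by the iteration in which each classification was found, which matches the ranking rule stated for the output. When N is greater than 1 the list can exceed `wanted`; the text does not say how such an overshoot is handled, so this sketch leaves it untruncated.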

FIG. 8 is a diagram showing the processing flow of the classification specifying unit 11, which is part of the classification assigning unit 9. First, the top K similar document IDs 424 are identified in the similar document table with classification that was generated in step 1001 of the classification assigning unit 9 and stored in the working area 61 (step 2001). Next, a classification list is generated from the classifications 425 assigned to the identified top K similar document IDs 424 (step 2002). Next, for each classification in the list, the similar document IDs 424 within the top K that carry the classification are identified, and the sum of the similarities 423 of those similar document IDs 424 is calculated (step 2003). In this step, instead of the sum of the similarities 423, either the number of identified similar document IDs 424 or the sum of their similarities each divided by the logarithm of (search rank + 1) may be calculated. Next, the classifications are sorted in descending order of the sum of similarities calculated in step 2003 (step 2004). Finally, the top N classifications by sum of similarities are stored in the classification assignment result table 13 (step 2005).
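Steps 2001 to 2005 can be sketched as follows. This is an illustrative sketch under assumptions: rows are dictionaries with the hypothetical keys `rank`, `doc_id`, `similarity`, and `classification`, ranks are consecutive from 1 with all rows of one document sharing a rank, each (document, classification) pair appears at most once, and the natural logarithm is used because the text does not state a base. Tie-breaking is also unspecified in the text; here ties keep the order in which classifications are first seen.

```python
import math
from collections import defaultdict


def specify_classifications(rows, k, n, method="sum"):
    """Return the top n classifications among the top k similar documents.

    method: "count"    -> number of documents carrying the classification,
            "sum"      -> sum of their similarities,
            "log_rank" -> sum of similarity / log(rank + 1).
    """
    # Step 2001: rows of the top k documents (rows of one document share a rank).
    top_rows = [r for r in rows if r["rank"] <= k]

    # Steps 2002-2003: collect the classifications and accumulate a score for each.
    scores = defaultdict(float)
    for r in top_rows:
        if method == "count":
            scores[r["classification"]] += 1.0
        elif method == "sum":
            scores[r["classification"]] += r["similarity"]
        elif method == "log_rank":
            scores[r["classification"]] += r["similarity"] / math.log(r["rank"] + 1)
        else:
            raise ValueError(f"unknown method: {method}")

    # Step 2004: sort by score, highest first (sorted() is stable, so ties keep
    # first-seen order).
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)

    # Step 2005: the top n classifications are the ones to store in the result table.
    return [classification for classification, _ in ordered[:n]]
```

The rank-discounted variant favors classifications carried by highly ranked documents, while the plain sum favors classifications that many of the top K documents share, so the two can pick different winners on the same table.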

FIG. 9 shows the processing flow of the classification editing unit 12, which is part of the classification assigning unit 9. First, a counter C, which indexes the rows of the similar document table with classifications, is initialized to 1 (step 3001). Next, it is determined whether data exists in row C of the table (step 3002). If it does, it is determined whether any of the classifications stored in the classification assignment result table 13 at that point matches the "classification of similar document ID" 425 in row C (step 3003). If there is a match, the record in row C is deleted and rows C+1 onward are shifted up (step 3004). As a result of this step, any similar document whose number of assigned classifications has dropped to zero is automatically excluded. If there is no match in step 3003, 1 is added to the counter C (step 3005) and the flow returns to step 3002. If it is determined in step 3002 that no data exists, the search ranks 422 of the table are reassigned in order starting from 1 (step 3006). That is, where the deletions in step 3004 have left gaps in the search ranks, the ranks are renumbered so that they run consecutively from 1. Rows that share the same similar document ID 424 are given the same search rank 422.
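The editing step can be sketched as follows. The name `edit_rows` and the row layout are hypothetical, and a list comprehension stands in for the counter-driven loop of steps 3001 to 3005; the resulting table is intended to be the same, but this is a simplification rather than the flow of FIG. 9 itself.

```python
from typing import Dict, Iterable, List, Tuple

# (search rank, similarity, similar document ID, classification)
Row = Tuple[int, float, str, str]


def edit_rows(rows: List[Row], assigned: Iterable[str]) -> List[Row]:
    assigned_set = set(assigned)

    # Steps 3001 to 3005: delete every row whose classification has already
    # been stored as a result. A document that loses all of its rows
    # disappears from the table without any extra handling.
    kept = [row for row in rows if row[3] not in assigned_set]

    # Step 3006: renumber the search ranks from 1 without gaps. Rows that
    # share a document ID receive the same rank.
    new_rank: Dict[str, int] = {}
    renumbered: List[Row] = []
    for _, sim, doc_id, cls in kept:
        if doc_id not in new_rank:
            new_rank[doc_id] = len(new_rank) + 1
        renumbered.append((new_rank[doc_id], sim, doc_id, cls))
    return renumbered
```

In a made-up table where document d2 carries only classification B, removing B deletes d2 entirely, and the documents that followed it move up by one rank.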

FIG. 10 shows an example of a parameter setting screen for entering the various parameters used by this system. The parameters on this screen may be set by the end user or by someone other than the end user, such as a system administrator. They are set before the classification process is executed, but may be readjusted as appropriate after reviewing the classification results.

In this embodiment, the following parameters are set on this screen.

(1) Total number of similar documents output by the similar document search unit 7
This value corresponds to the 10000 documents in FIG. 3.

(2) Number of similar documents used to specify a classification
This value corresponds to K in FIG. 8 and to the 10 documents in FIG. 3.

(3) Number of classifications assigned in one pass of the classification specifying unit 11
This value corresponds to N in FIG. 8 and to the one classification in FIG. 3.

(4) Number of output classifications
The number of classifications output as the classification result.

(5) Score calculation method
The method used to calculate, from the similarities of the top K similar documents, the score that determines which classification to assign. In this embodiment, one of the following three methods is selected.

(a) The number of similar documents within the top K to which the classification is assigned
(b) The sum of the similarities of the similar documents within the top K to which the classification is assigned
(c) The sum of the values obtained by dividing the similarity of each similar document within the top K to which the classification is assigned by the logarithm of "search rank + 1"
The parameter values set on this screen are stored in the working area 61 and are referenced by each processing unit as needed when the classification process is executed.
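The five parameters above could be held in a single structure such as the following. This is a hedged sketch only: the class and field names are hypothetical, the defaults for items (1) to (3) are taken from the FIG. 3 values mentioned above, and the defaults for items (4) and (5) are placeholders, since the description gives no value for them.

```python
from dataclasses import dataclass
from enum import Enum


class ScoreMethod(Enum):
    COUNT = "a"               # (a) number of documents
    SIMILARITY_SUM = "b"      # (b) sum of similarities
    LOG_RANK_SUM = "c"        # (c) sum of similarity / log(search rank + 1)


@dataclass(frozen=True)
class ClassificationParams:
    total_similar_docs: int = 10000   # (1) value used in FIG. 3
    k: int = 10                       # (2) K in FIG. 8
    n_per_pass: int = 1               # (3) N in FIG. 8
    output_count: int = 3             # (4) placeholder, not from the source
    score_method: ScoreMethod = ScoreMethod.SIMILARITY_SUM  # (5) placeholder

    def __post_init__(self) -> None:
        # K and N are described as natural numbers; the same check is
        # applied to the other two counts as an assumption.
        for name in ("total_similar_docs", "k", "n_per_pass", "output_count"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be a natural number")
```

Freezing the dataclass mirrors the idea that the values are fixed in the working area before a classification run and only read by the processing units.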

50 ... processing device, 60 ... storage device, 51 ... keyboard, 52 ... mouse, 53 ... output monitor, 54 ... computer network.

Claims

A document classification method in a document classification apparatus having: a learning document set; a document-classification correspondence table in which each document constituting the learning document set is associated with the classifications previously assigned to that document; similar document search means for retrieving, from the learning document set, documents similar to a classification assignment target document to which no classification has yet been assigned; and classification assigning means for determining, with reference to the document-classification correspondence table, the classifications to be assigned to the classification assignment target document based on the classifications assigned to the retrieved similar documents, the method comprising:
a process of identifying the top K similar documents (K is a natural number designated in advance) retrieved by the similar document search means, calculating, for each of the classifications assigned to the K similar documents, one of "the number of documents", "the sum of the similarities (similar document search scores)", or "the sum of the values obtained by dividing the similarity by the logarithm of 'search rank + 1'" over those of the K similar documents to which that classification is assigned, and
storing the top N classifications (N is a natural number designated in advance) having the highest calculated values in a classification assignment result table as the classification assignment results for the classification assignment target document; and
a process of excluding the N classifications from the classifications assigned to the similar documents retrieved by the similar document search means, excluding any similar document whose number of assigned classifications has become zero as a result of the exclusion, and identifying the top K similar documents among the remaining similar documents,
wherein the processes are repeated until the number of classifications stored in the classification assignment result table reaches a number specified in advance by the user.

A document classification apparatus having: a learning document set; a document-classification correspondence table in which each document constituting the learning document set is associated with the classifications previously assigned to that document; similar document search means for retrieving, from the learning document set, documents similar to a classification assignment target document to which no classification has yet been assigned; and classification assigning means for determining, with reference to the document-classification correspondence table, the classifications to be assigned to the classification assignment target document based on the classifications assigned to the retrieved similar documents, the apparatus comprising:
means for alternately and repeatedly executing, until the number of classifications stored in a classification assignment result table reaches a number specified in advance by the user,
a process of calculating, for each of the classifications assigned to the top K similar documents (K is a natural number designated in advance) retrieved by the similar document search means, one of "the number of documents", "the sum of the similarities (similar document search scores)", or "the sum of the values obtained by dividing the similarity by the logarithm of 'search rank + 1'" over those of the K similar documents to which that classification is assigned, and storing the top N classifications (N is a natural number designated in advance) having the highest calculated values in the classification assignment result table as the classification assignment results for the classification assignment target document, and
a process of excluding the N classifications from the classifications assigned to the similar documents retrieved by the similar document search means, excluding any similar document whose number of assigned classifications has become zero as a result of the exclusion, and identifying the top K similar documents among the remaining similar documents.

A program for a computer having: a learning document set; a document-classification correspondence table in which each document constituting the learning document set is associated with the classifications previously assigned to that document; similar document search means for retrieving, from the learning document set, documents similar to a classification assignment target document to which no classification has yet been assigned; and classification assigning means for determining, with reference to the document-classification correspondence table, the classifications to be assigned to the classification assignment target document based on the classifications assigned to the retrieved similar documents,
the program causing the computer to repeatedly execute, until the number of classifications stored in a classification assignment result table reaches a number specified in advance by the user:
a process of identifying the top K similar documents (K is a natural number designated in advance) retrieved by the similar document search means, calculating, for each of the classifications assigned to the K similar documents, one of "the number of documents", "the sum of the similarities (similar document search scores)", or "the sum of the values obtained by dividing the similarity by the logarithm of 'search rank + 1'" over those of the K similar documents to which that classification is assigned, and storing the top N classifications (N is a natural number designated in advance) having the highest calculated values in the classification assignment result table as the classification assignment results for the classification assignment target document; and
a process of excluding the N classifications from the classifications assigned to the similar documents retrieved by the similar document search means, excluding any similar document whose number of assigned classifications has become zero as a result of the exclusion, and identifying the top K similar documents among the remaining similar documents.