JP2005158010A

JP2005158010A - Apparatus, method and program for classification evaluation

Info

Publication number: JP2005158010A
Application number: JP2004034729A
Authority: JP
Inventors: Takahiko Kawatani; 隆彦川谷
Original assignee: Hewlett Packard Development Co LP
Current assignee: Hewlett Packard Development Co LP
Priority date: 2003-10-31
Filing date: 2004-02-12
Publication date: 2005-06-16
Also published as: EP1528486A3; CN1612134A; US20050097436A1; KR20050041944A; EP1528486A2

Abstract

PROBLEM TO BE SOLVED: In a document classification system that classifies an input document into predetermined document classes by matching it against class models, the content of input documents changes over time, so a class model may become obsolete, and updating the class models then requires a large amount of work.

SOLUTION: While the document classification system is in operation, the similarity between the actual document set classified into each class and the training document set of that class is obtained for every class, and classes with low similarity are selected. Alternatively, the similarity between the training document set of each class and the actual document sets of all other classes is obtained, and class pairs with high similarity are selected. Classes that have become obsolete are detected in this way. In addition, the similarity between training document sets is obtained for every class pair, and class pairs with high similarity are selected, whereby class pairs with close topics are detected.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to techniques for classifying documents and other patterns. In particular, it aims to make the operation of such classification more efficient by enabling the validity of the class model at any given time to be evaluated accurately.

Document classification is a technique for assigning documents to predetermined groups, and its importance is growing as the circulation of information increases. A wide variety of methods have been researched and developed for document classification, including the vector space method, the k-nearest neighbor method (kNN method), the naive Bayes method, the decision tree method, the support vector machine method, and the boosting method. Recent trends in document classification are described in detail in "Text Classification: A Trade Fair of Learning Theories" by Masaaki Nagata and Hirotoshi Taira, published in the Journal of the Information Processing Society of Japan, Vol. 42, No. 1 (January 2001). Every classification method describes information about the document classes in some form and matches it against the input document. This description is hereinafter called the class model.

In the vector space method, for example, the class model is expressed by the average vector of the documents belonging to each class; in the k-nearest neighbor method, by the set of vectors of the documents belonging to each class; and in the boosting method, by a set of simple hypotheses. For accurate classification, the class model must describe each class accurately. A class model is usually built from a large number of documents prepared as training data for each class.
Journal of the Information Processing Society of Japan, Vol. 42, No. 1 (January 2001), "Text Classification: A Trade Fair of Learning Theories" (authors: Masaaki Nagata and Hirotoshi Taira)

Document classification is based on recognition technology, as are character recognition and speech recognition, but it differs from them in the following respects.
(1) In character recognition and speech recognition, the patterns belonging to a given class are not expected to change from moment to moment. The character patterns belonging to class "2" should be the same now as they were a year ago. With documents, however, the content often changes continually even within the same class. For example, the topics of documents belonging to a class called "international politics" can be expected to differ considerably before and after the "Iraq War". The class model of "international politics" therefore needs to be updated as time passes.

(2) With characters and speech, a person can tell immediately which class an input character or utterance belongs to, so collecting training data for building a class model is not a difficult problem. With documents, however, the class of an input document cannot be judged without reading it. Even skimming takes a person a considerable amount of time. Collecting a large amount of reliable training data is therefore a very heavy burden in the case of documents.

(3) For the same reason as in (2), it is not easy in document classification to assess performance, that is, how accurately a large number of unknown documents are being classified.
(4) With characters and speech, it is almost self-evident which classes exist among the inputs. For example, when digits are recognized in character recognition, the number of classes is 10. In document classification, however, the choice of classes is arbitrary, and which classes are prepared depends on the users' requirements, the intentions of the system designer, and so on.

Because of characteristic (1), frequent updating of the class model is essential in document classification if the documents arriving at any given time are to be classified correctly in actual operation. Updating the class model is far from easy, however, for the reason given in (2). The burden could be reduced by updating only the classes whose class models have become obsolete rather than all classes, but for the reason given in (3) detecting the obsolete classes is not easy either. The cost of actually operating document classification is therefore by no means low.

Furthermore, in document classification there is no problem as long as the topics of the manually defined classes are well separated, but class pairs with close topics may exist. Such class pairs cause misclassification between the two classes and degrade system performance. When designing a document classification system, it is therefore necessary to detect class pairs with close topics as early as possible and to redefine the classes. The problem class pairs could be detected by evaluating the system with test data after the document classification system has been built, but this takes labor and time. It is desirable that class pairs whose topics are problematically close can be detected as soon as the training data has been prepared, that is, as soon as the training data has been collected and each document has been labeled.

An object of the present invention is to reduce the burden of designing a document classification system and of updating class models by making it easy to detect class pairs with close topics and classes whose class models have become obsolete.

First, consider the obsolescence of a class model. When the class model of class A becomes obsolete, two effects are possible: an input document that belongs to class A can no longer be judged to belong to class A, or it is misclassified into a different class B. For class A, define "recall" as the ratio of the number of documents correctly judged to belong to class A to the number of documents that belong to class A, and "precision" as the proportion of the documents judged to belong to class A that actually belong to class A. The effect of an obsolete class model then appears as a drop in recall or precision. The problem is therefore how to detect classes whose recall or precision has dropped. The present invention takes the following approach. It assumes that, even for a class whose recall or precision has dropped, a fair number of documents are still correctly classified into that class.
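The two measures defined above can be illustrated with a minimal Python sketch. It assumes the usual reading in which only correctly assigned documents count in the numerator of recall. The function name `recall_and_precision` and the sample labels are hypothetical and do not come from the patent.

```python
def recall_and_precision(true_labels, assigned_labels, target):
    """Return (recall, precision) for class `target`.

    true_labels and assigned_labels are parallel lists of class names:
    the class each document really belongs to, and the class the
    classifier assigned to it.
    """
    belongs = sum(1 for t in true_labels if t == target)
    judged = sum(1 for a in assigned_labels if a == target)
    correct = sum(
        1 for t, a in zip(true_labels, assigned_labels)
        if t == target and a == target
    )
    recall = correct / belongs if belongs else 0.0
    precision = correct / judged if judged else 0.0
    return recall, precision


# Hypothetical sample: four documents of class A, two of class B.
truth = ["A", "A", "A", "A", "B", "B"]
assigned = ["A", "A", "B", "B", "B", "A"]
recall, precision = recall_and_precision(truth, assigned, "A")
print(recall, precision)
```

In this sample, two of the four class A documents are found and one class B document is wrongly assigned to class A, so both measures fall below 1.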

When the recall of class A drops, a mismatch is considered to have arisen between the topics of the input documents belonging to class A and the topics that the class model assumes for class A. The topics that the class model assumes for class A are determined by the class A training data used when the class model was built. The set of documents classified into class A during actual operation of the document classification system is called the "actual document set of class A". Whether such a mismatch has occurred can be judged from the closeness, that is, the "similarity", between the actual document set of class A and the training document set used to build the class model of class A. If this similarity is high, the actual document set of class A is close in content to the training document set used when the class model was built, and the class model can be judged not to be obsolete. Conversely, if the similarity is low, the topics of the input documents belonging to class A have shifted, and the class model can be judged to be obsolete. The class model of a class judged to be obsolete needs to be rebuilt.

When input documents belonging to class A are frequently misclassified into class B, the topics of the documents belonging to class A are considered to have shifted and to have come very close to the class model of class B. The closeness, that is, the similarity, between the actual document set of class A and the training document set used to build the class model of class B is then expected to be high. If this similarity is high, it is evidence that the topics of the documents belonging to class A are approaching class B. In that case the class models of both class A and class B can be judged to be obsolete, so both class models need to be rebuilt.

Next, class pairs with close topics are considered. For such a class pair, the similarity between the two document sets is also expected to be high. Therefore, if the similarity between all class pairs, that is, the similarity between the training document sets of the classes, is obtained and the class pairs whose similarity exceeds a certain value are selected, those class pairs can be regarded as class pairs with close topics. Such class pairs need to be re-examined, including whether the classes should be defined at all and how they are defined.

As described above, the present invention obtains the actual document set of each class in addition to the training document set of each class, and then calculates the similarity between the training document sets of every class pair, the similarity between the training document set and the actual document set of each class, and the similarity between the training document set and the actual document set of every class pair. Classes that need to be updated or reviewed can thus be detected, so the design of the document classification system can be changed and the class models updated very easily.

FIG. 1 shows an apparatus that implements the present invention. A housing 100 contains a storage device 110, a main memory 120, an output device 130, a processing unit (CPU) 140, an operation unit 150, and an input device 160. The processing unit (CPU) 140 reads the control program from the main memory 120 and, following instructions entered through the operation unit 150, processes the document data supplied from the input device 160 together with the training document and actual document information stored in the storage device 110. It detects topic-close class pairs, obsolete document classes, and the like, and outputs them to the output device 130.

FIG. 2 is a block diagram outlining the present invention. Reference numeral 210 denotes a document input block, 220 a document preprocessing block, 230 a document information processing block, 240 a training document information storage block, 250 an actual document information storage block, and 260 an ineligible document class output block. The set of documents to be processed is input to the document input block 210. The document preprocessing block 220 performs term detection, morphological analysis, document vector creation, and the like on the input documents. The value of each component of a document vector is derived from, for example, the frequency of the corresponding word in the document. The training document information storage block 240 stores the training document information prepared for each class. The actual document information storage block 250 stores the actual document information of each class, obtained from the classification results. The document information processing block 230 calculates the similarity between the training document sets of every class pair, the similarity between the training document set of each class and the actual document set of the same class, and the similarity between the training document set of each class and the actual document sets of all other classes, and thereby finds topic-close class pairs and obsolete classes. The ineligible document class output block 260 outputs the results obtained by the document information processing block 230 to an output device such as a display.

FIG. 3 is a flowchart of the first embodiment of the present invention, which detects topic-close class pairs in a given training document set. The method of the invention can be carried out by running a program embodying the invention on a general-purpose computer, and FIG. 3 is the flowchart of the computer while such a program is running.
Block 21 is training document set input, block 22 is class label assignment, block 23 is document preprocessing, block 24 is creation of the class-by-class training document database, block 25 is calculation of the similarity between the training document sets of a class pair, block 26 is comparison of the similarity with a threshold, and block 27 outputs class pairs whose similarity exceeds the threshold. Block 28 is the termination check. The first embodiment is described below, taking English documents as an example.

First, in training document set input 21, the document set used to build the document classification system is input. In class label assignment 22, each document is given the name of the class to which it belongs, according to the definitions established in advance for each class. A single document may be given two or more class names. In document preprocessing 23, each input document undergoes preprocessing such as term detection, morphological analysis, and document vector creation. In some cases the document is divided into document segments and document segment vectors are created, so that the document is represented as a set of document segment vectors. In term detection, words, mathematical expressions, symbol sequences, and the like are detected in each input document. Words, symbol sequences, and the like are collectively called "terms" here. In English, terms are easy to detect because the established orthography writes them separated from one another.

Next, in morphological analysis, each input document undergoes morphological analysis such as part-of-speech tagging of its terms. In document vector creation, the number of dimensions of the vectors to be created and the correspondence between each dimension and each term are first determined from the terms that appear across all the documents. There is no requirement that a vector component be assigned to every kind of term that appears; using the result of part-of-speech tagging, the vectors may be created from, for example, only the terms judged to be nouns or verbs. Then the frequency of each word appearing in a document, or a value obtained by processing that frequency, is assigned to the corresponding component of the document vector. When document segmentation is performed, each input document is divided into document segments. A document segment is an element that makes up a document, and its most basic unit is the sentence. In English, a sentence ends with a period followed by a space, so sentences are easy to extract. Other ways of dividing a document into segments include splitting a complex sentence into its main clause and subordinate clauses, grouping several sentences into one document segment so that the segments contain roughly the same number of terms, and dividing the document from its beginning into parts containing the same number of terms regardless of sentence boundaries.
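Two of the segmentation options described above can be sketched in a few lines of Python: splitting at a period followed by whitespace, and grouping sentences until a target number of terms is reached. This is only an illustration under simplifying assumptions (it does not handle abbreviations such as "e.g." and treats whitespace-separated tokens as terms). The function names and the sample text are hypothetical.

```python
import re


def split_into_sentences(text):
    """Split English text wherever a period is followed by whitespace."""
    parts = re.split(r"(?<=\.)\s+", text.strip())
    return [part for part in parts if part]


def group_by_term_count(sentences, target_terms):
    """Merge consecutive sentences until a segment holds at least target_terms terms."""
    segments, current, count = [], [], 0
    for sentence in sentences:
        current.append(sentence)
        count += len(sentence.split())
        if count >= target_terms:
            segments.append(" ".join(current))
            current, count = [], 0
    if current:  # keep any trailing sentences as a final, shorter segment
        segments.append(" ".join(current))
    return segments


text = "The model ages. Topics drift over time. Classes must be updated."
sentences = split_into_sentences(text)
segments = group_by_term_count(sentences, target_terms=6)
print(sentences)
print(segments)
```

With the sample text, the first call yields three one-sentence segments and the second merges the first two sentences into one segment.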

In document segment vector creation, as in document vector creation, the frequency of each word appearing in a document segment, or a value obtained by processing that frequency, is assigned to the corresponding component of the document segment vector. As an example, let M be the number of kinds of terms used for classification, and suppose that a document vector is expressed as an M-dimensional vector. Let d_r denote the document vector. If each component is set to "1" when the term is used and "0" when it is not, the vector can be written as d_r = (1, 0, 0, ..., 1)^T; if the frequency of occurrence of each term is used as the component value, it can be written as d_r = (2, 0, 1, ..., 4)^T. Here T denotes the transpose of a vector. In class-by-class training document database creation 24, the preprocessing results of the documents are sorted by class on the basis of the result of block 22 and stored in the database. In calculation 25 of the similarity between the training document sets of a class pair, the similarity is calculated for the designated class pair using the training document sets. The class pair is designated from a predetermined class pair in the first iteration and by an instruction from block 28 in the second and subsequent iterations.
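The binary and frequency-valued vectors described above can be sketched as follows. This is a minimal illustration that assumes documents are already tokenized into terms; the names `build_vocabulary` and `to_vector` and the sample terms are hypothetical.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Map each distinct term to a vector dimension (sorted for determinism)."""
    terms = sorted({term for doc in tokenized_docs for term in doc})
    return {term: index for index, term in enumerate(terms)}


def to_vector(tokens, vocabulary, binary=False):
    """Return an M-dimensional list of term frequencies, or of 1/0 flags if binary."""
    counts = Counter(token for token in tokens if token in vocabulary)
    vector = [0] * len(vocabulary)
    for term, frequency in counts.items():
        vector[vocabulary[term]] = 1 if binary else frequency
    return vector


docs = [["war", "peace", "war"], ["peace", "trade"]]
vocabulary = build_vocabulary(docs)  # peace -> 0, trade -> 1, war -> 2
frequency_vector = to_vector(docs[0], vocabulary)
binary_vector = to_vector(docs[0], vocabulary, binary=True)
print(frequency_vector, binary_vector)
```

The same function can be applied to the tokens of a single document segment to obtain a document segment vector.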

Various methods are known for obtaining the similarity between document sets. For example, let Ω_A and Ω_B denote the document sets of classes A and B. Let d_r denote the document vector of document r, and define the average document vectors d_A and d_B of classes A and B by the following equation.
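The equation referred to here is not reproduced in this text. Under the definitions given (d_r as the document vector and d_A, d_B as class averages), it would presumably read as follows; this is a reconstruction, not the original figure.

```latex
d_A = \frac{1}{|\Omega_A|} \sum_{r \in \Omega_A} d_r ,
\qquad
d_B = \frac{1}{|\Omega_B|} \sum_{r \in \Omega_B} d_r
```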

Here, |Ω_A| and |Ω_B| denote the numbers of documents in the document sets Ω_A and Ω_B. Let sim(Ω_A, Ω_B) denote the similarity between the training document sets of classes A and B. It can be obtained as the cosine similarity, as follows.
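The equation itself is missing from this text. Given that the description calls it a cosine similarity of the average document vectors and later refers to it as equation (1), a consistent reconstruction would be:

```latex
\mathrm{sim}(\Omega_A, \Omega_B)
  = \frac{d_A^{T} d_B}{\lVert d_A \rVert \, \lVert d_B \rVert}
\tag{1}
```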

Here, ||d_A|| denotes the norm of the vector d_A. The similarity defined by equation (1) does not reflect information on the co-occurrence of words. The following calculation yields a similarity that does reflect word co-occurrence within document segments. Class A contains a plurality of documents, and their set is denoted Ω_A. Suppose that the r-th document r in Ω_A consists of Y document segments, and let d_ry denote the y-th document segment vector. FIG. 4(a) shows that the document set Ω_A consists of the group of documents from document 1 to document R. FIG. 4(b) illustrates schematically that the r-th document r of Ω_A is in turn made up of Y document segments and that the document segment vector d_ry is generated from the y-th of them. The matrix defined for document r by the following equation is called the co-occurrence matrix.
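The defining equation is not reproduced in this text. A reconstruction consistent with the later remark that, for binary segment vectors, each matrix entry counts the segments in which two words co-occur would be the sum of outer products below. The symbol S^r and the equation number (2) are assumptions.

```latex
S^{r} = \sum_{y=1}^{Y} d_{ry} \, d_{ry}^{T}
\tag{2}
```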

Further, let Ω_B denote the document set of class B, and let S^A and S^B denote the sums of the co-occurrence matrices of the documents in classes A and B, respectively. These are obtained as follows.
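The equation is missing from this text. Since S^A and S^B are described as sums of the per-document co-occurrence matrices, it would presumably read as follows, writing S^r for the co-occurrence matrix of document r (the symbol S^r and the number (3) are assumptions).

```latex
S^{A} = \sum_{r \in \Omega_A} S^{r} ,
\qquad
S^{B} = \sum_{r \in \Omega_B} S^{r}
\tag{3}
```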

In this case, the similarity sim(Ω_A, Ω_B) between the training document sets of classes A and B can be defined as follows, using the components of the matrices S^A and S^B.
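The equation is not reproduced in this text. One form consistent with the surrounding description (it uses every component S^A_mn and S^B_mn, and reduces to something close to equation (1) when the off-diagonal components are dropped) is a normalized componentwise product. This is offered as a plausible reconstruction only.

```latex
\mathrm{sim}(\Omega_A, \Omega_B)
  = \frac{\sum_{m=1}^{M} \sum_{n=1}^{M} S^{A}_{mn} \, S^{B}_{mn}}
         {\sqrt{\sum_{m=1}^{M} \sum_{n=1}^{M} \left(S^{A}_{mn}\right)^{2}}
          \; \sqrt{\sum_{m=1}^{M} \sum_{n=1}^{M} \left(S^{B}_{mn}\right)^{2}}}
\tag{4}
```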

Here, S^A_mn is the component in row m and column n of S^A, and M is the dimension of the document segment vectors, that is, the number of kinds of words that appear. If each component of a document segment vector is expressed in binary form, that is, as 1 if the m-th word appears and 0 if it does not, then, as is clear from equations (2) and (3), S^A_mn and S^B_mn are the numbers of document segments in which words m and n co-occur in the training document sets of classes A and B. Equation (4) therefore incorporates word co-occurrence information, which makes it possible to obtain a more accurate similarity. If the off-diagonal components of the matrices S^A and S^B are not used in equation (4), the result is almost equivalent to the similarity defined by equation (1).
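The two similarity measures discussed above can be sketched in plain Python. The sketch assumes that equation (1) is the cosine of the average document vectors and that equation (4) is a normalized componentwise product of the summed co-occurrence matrices; the second assumption is an interpretation, since the equation is not reproduced in this text. All function names and sample data are hypothetical.

```python
import math


def mean_vector(vectors):
    """Average of equal-length vectors (the average document vector)."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def mean_cosine_similarity(docs_a, docs_b):
    """Cosine of the two average document vectors, as in equation (1)."""
    a, b = mean_vector(docs_a), mean_vector(docs_b)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cooccurrence_sum(segmented_docs):
    """Sum of d_ry d_ry^T over every segment of every document (M x M)."""
    dim = len(segmented_docs[0][0])
    total = [[0] * dim for _ in range(dim)]
    for segments in segmented_docs:
        for seg in segments:
            for m in range(dim):
                for n in range(dim):
                    total[m][n] += seg[m] * seg[n]
    return total


def cooccurrence_similarity(segmented_a, segmented_b):
    """Normalized componentwise product of S^A and S^B (assumed form of eq. (4))."""
    s_a = cooccurrence_sum(segmented_a)
    s_b = cooccurrence_sum(segmented_b)
    dot = sum(x * y for ra, rb in zip(s_a, s_b) for x, y in zip(ra, rb))
    norm = math.sqrt(
        sum(x * x for row in s_a for x in row)
        * sum(y * y for row in s_b for y in row)
    )
    return dot / norm if norm else 0.0


# Each document is a list of binary segment vectors over M = 3 terms.
class_a = [[[1, 1, 0], [0, 1, 0]]]
class_b = [[[1, 1, 0]], [[0, 1, 1]]]
class_c = [[[0, 0, 1]]]
print(cooccurrence_similarity(class_a, class_b))
print(cooccurrence_similarity(class_a, class_c))
```

Classes A and B share the co-occurring terms 0 and 1 and so score well above zero, while class C shares no term with class A and scores zero.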

Block 26 judges whether the similarity (the first similarity) exceeds a predetermined threshold (the first threshold). In block 27, when the similarity between the training document sets of the designated classes exceeds the predetermined threshold, the pair is detected as a class pair with close topics. Let α denote this threshold.
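The condition itself is given in the source by an equation that does not survive in this text. Since block 27 acts when the similarity exceeds the threshold, it presumably reads:

```latex
\mathrm{sim}(\Omega_A, \Omega_B) > \alpha
```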

When this condition is satisfied, that is, when the first similarity exceeds α, classes A and B are regarded as having close topics. The value of α is easy to determine experimentally by using training document sets whose topics are well understood. For a detected topic-close class pair, the class definitions are reviewed, the decision to define those classes at all is re-examined, and the validity of the labels given to the training documents is checked. Block 28 checks whether the processing of blocks 25, 26, and 27 has been performed for all class pairs. If no unprocessed class pair remains, the process ends; otherwise the next class pair is designated and processing returns to block 25.
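The loop over blocks 25 to 28 can be sketched as follows. This is a minimal illustration, not the patented implementation: it uses the cosine of average document vectors as the similarity, and the class names, vectors, and the value of `alpha` are hypothetical.

```python
import math
from itertools import combinations


def mean_cosine(docs_a, docs_b):
    """Cosine similarity between the average document vectors of two sets."""
    a = [sum(col) / len(docs_a) for col in zip(*docs_a)]
    b = [sum(col) / len(docs_b) for col in zip(*docs_b)]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else 0.0


def detect_close_pairs(training_sets, alpha, similarity=mean_cosine):
    """Return (class_a, class_b, score) for every pair whose score exceeds alpha."""
    close = []
    # Visiting every unordered pair plays the role of the loop closed by block 28.
    for name_a, name_b in combinations(sorted(training_sets), 2):
        score = similarity(training_sets[name_a], training_sets[name_b])  # block 25
        if score > alpha:  # block 26
            close.append((name_a, name_b, score))  # block 27
    return close


training_sets = {
    "politics": [[2, 1, 0], [1, 1, 0]],
    "diplomacy": [[1, 2, 0]],
    "sports": [[0, 0, 3]],
}
close_pairs = detect_close_pairs(training_sets, alpha=0.8)
print(close_pairs)
```

Only the "diplomacy" and "politics" pair is reported here, because the "sports" vectors share no terms with the other two classes.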

FIGS. 5(a) and 5(b) show the second and third embodiments of the present invention, which detect obsolete classes in a document classification system in actual operation. The method of the invention can be carried out by running a program embodying the invention on a general-purpose computer, and FIGS. 5(a) and 5(b) are the flowcharts of the computer while such a program is running. The second embodiment, shown in FIG. 5(a), is described first. Block 31 is document input, block 32 is document preprocessing, block 33 is document classification, block 34 is creation of the class-by-class actual document database, block 35 is calculation of the similarity between the training document set of each class and the actual document set of the same class, block 36 is comparison of the similarity with a threshold, block 37 is the action taken when the similarity between the training document set of a class and the actual document set of the same class is smaller than the threshold, and block 38 is the termination check.

The flowchart of FIG. 5(a) is now described in detail. First, in block 31, the documents that are actually to be classified are input to the document classification system in operation. In block 32, the same document preprocessing as in block 23 of FIG. 3 is performed, and in block 33, the input documents are classified. A wide variety of document classification methods have been developed, including the vector space method, the k-nearest neighbor method (kNN), the naive Bayes method, the decision tree method, the support vector machine method, and the boosting method, and any of them can be used in the present invention. In block 34, an actual document database is created for each class from the results of the document classification in block 33. The actual document sets classified into classes A and B are denoted Ω'_A and Ω'_B.

In block 35, the similarity between the training document set of the designated class and the actual document set of the same class is calculated. In the first iteration the class is one designated in advance; in the second and subsequent iterations it is designated by an instruction from block 38. The similarity sim(Ω_A, Ω'_A) (the second similarity) between the training document set Ω_A of class A and the actual document set Ω'_A of the same class can be obtained in the same way as in equations (1) and (4).

Next, in block 36 the similarity is compared with a threshold, and in block 37 a class model that has become obsolete is detected. With the threshold denoted β, when

sim(Ω_A, Ω'_A) < β

is satisfied, the topic of the actual documents that should belong to class A has shifted, and the class model of class A is judged to be obsolete. Block 38 checks whether blocks 35, 36, and 37 have been executed for all classes. If no unprocessed class remains, processing ends; otherwise the next class is designated and processing returns to block 35.
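The loop over blocks 35 to 38 can be sketched as follows. Equations (1) and (4) are not reproduced in this passage, so the sketch follows the wording of the claims: a co-occurrence matrix S = Σ_y d_y d_y^T per document, a sum matrix per document set, and a similarity based on the product sum of corresponding components. The normalization to the range 0 to 1 is an assumption, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]      # one document segment vector d_y (length M)
Document = List[Vector]       # a document, given as its Y segment vectors
DocumentSet = List[Document]


def sum_cooccurrence(doc_set: DocumentSet, m: int) -> List[List[float]]:
    """Sum, over all documents in the set, of S = sum_y d_y d_y^T (M x M)."""
    total = [[0.0] * m for _ in range(m)]
    for doc in doc_set:
        for d in doc:
            for i in range(m):
                if d[i] == 0:
                    continue  # this row of d_y d_y^T is all zeros
                for j in range(m):
                    total[i][j] += d[i] * d[j]
    return total


def similarity(set_a: DocumentSet, set_b: DocumentSet, m: int) -> float:
    """Product sum of corresponding components, with an assumed normalization."""
    sa, sb = sum_cooccurrence(set_a, m), sum_cooccurrence(set_b, m)
    dot = sum(sa[i][j] * sb[i][j] for i in range(m) for j in range(m))
    na = math.sqrt(sum(v * v for row in sa for v in row))
    nb = math.sqrt(sum(v * v for row in sb for v in row))
    return dot / (na * nb) if na and nb else 0.0


def detect_obsolete_classes(
    training: Dict[str, DocumentSet],
    actual: Dict[str, DocumentSet],
    m: int,
    beta: float,
) -> List[str]:
    """Blocks 35-38: flag every class with sim(training, actual) < beta."""
    obsolete = []
    for label, train_set in training.items():                  # block 38: next class
        # A class with no actual documents scores 0 in this sketch.
        sim = similarity(train_set, actual.get(label, []), m)  # block 35
        if sim < beta:                                         # blocks 36 and 37
            obsolete.append(label)
    return obsolete


# Toy data with M = 3 terms; each document has one or two segment vectors.
training = {"A": [[(1, 1, 0)]], "B": [[(0, 1, 1)]]}
actual = {"A": [[(1, 1, 0), (1, 0, 0)]], "B": [[(1, 0, 0)]]}
obsolete = detect_obsolete_classes(training, actual, m=3, beta=0.5)
```

In this toy data the actual documents of class B share no term co-occurrences with the class B training documents, so only class B is flagged. The value of `beta` here is arbitrary.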
Next, the third embodiment is described with reference to FIG. 5(b). Blocks 31 to 34 are the same as in FIG. 5(a), so their description is omitted. Block 39 calculates the similarity between the training document set of each class and the actual document sets of all other classes. Blocks 40 and 41 represent the action taken when the similarity between the training document set of a class and the actual document set of another class exceeds a threshold. Block 42 is the end check.

The flowchart of FIG. 5(b) is now described in detail. The description of blocks 31 to 34, which are the same as in FIG. 5(a), is omitted. In block 39, the similarity between the training document set of each class and the actual document sets of all other classes is calculated. Blocks 40 and 41 represent the action taken when the similarity between the training document set of the designated class and the actual document set of another designated class exceeds a threshold. The similarity sim(Ω_A, Ω'_B) (the third similarity) between the training document set Ω_A of class A and the actual document set Ω'_B of class B can be obtained in the same way as in equations (1) and (4). In the first iteration the class pair is one designated in advance; in the second and subsequent iterations it is designated by an instruction from block 42. In blocks 40 and 41, with the threshold denoted γ, when

sim(Ω_A, Ω'_B) > γ

is satisfied, the topic of the documents belonging to class B has approached class A, and the class models of both class A and class B are judged to be obsolete. Block 42 is the end check: it checks whether blocks 39, 40, and 41 have been executed for all class pairs. If no unprocessed class pair remains, processing ends; otherwise the next class pair is designated and processing returns to block 39.
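The pairwise scan of blocks 39 to 42 can be sketched as follows. The similarity is passed in as a function so that any implementation of equations (1) and (4) can be used; the Jaccard overlap in the toy usage is a hypothetical stand-in chosen only to keep the example short, and all names are illustrative.

```python
from itertools import permutations
from typing import Callable, Dict, List, Tuple, TypeVar

DocSet = TypeVar("DocSet")


def detect_approaching_pairs(
    training: Dict[str, DocSet],
    actual: Dict[str, DocSet],
    sim: Callable[[DocSet, DocSet], float],
    gamma: float,
) -> List[Tuple[str, str]]:
    """Blocks 39-42: return pairs (A, B) with sim(training[A], actual[B]) > gamma."""
    flagged = []
    for a, b in permutations(training, 2):       # block 42: every ordered pair, A != B
        if b not in actual:
            continue                             # nothing was classified into B
        if sim(training[a], actual[b]) > gamma:  # blocks 39, 40, and 41
            flagged.append((a, b))
    return flagged


# Hypothetical stand-in similarity: Jaccard overlap of term sets.
def jaccard(x: set, y: set) -> float:
    return len(x & y) / len(x | y) if x | y else 0.0


training = {"A": {"grain", "wheat", "corn"}, "B": {"oil", "crude", "barrel"}}
actual = {"A": {"grain", "wheat", "corn"}, "B": {"grain", "wheat", "oil"}}
pairs = detect_approaching_pairs(training, actual, jaccard, gamma=0.4)
```

The pair ("A", "B") is flagged because the documents actually classified into B overlap strongly with the training documents of A. The pairs are ordered, since sim(Ω_A, Ω'_B) and sim(Ω_B, Ω'_A) are computed from different sets.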
Note that β and γ used in the second and third embodiments need to be experimentally determined in advance using a training document set whose topic content is well known.

As described above, the present invention makes it easy to detect, as ineligible classes, class pairs whose topics are close to each other and class pairs that have become obsolete. Experimental results on Reuters-21578, a document corpus widely used in document classification research, are shown below. The kNN method is used as the document classification method. FIG. 4 shows the relationship between the closeness of the topics of each class pair and the error rate; each point corresponds to one class pair.

The horizontal axis shows the inter-class similarity of the training document sets as a percentage, and the vertical axis shows the inter-class error rate on the test document sets as a percentage. The training and test document sets are those specified in Reuters-21578, and the test document set is regarded as corresponding to the actual document set. The inter-class error rate of classes A and B is the number of class A documents misclassified into class B plus the number of class B documents misclassified into class A, divided by the total number of documents in classes A and B. FIG. 4 shows that class pairs with high inter-class similarity on the training documents, that is, class pairs with close topics, have a high error rate on the test document set. The performance of the document classification system can therefore be improved by detecting class pairs whose inter-class similarity exceeds the threshold and eliminating such pairs, by revising the class definitions, reconsidering whether those classes should be set up at all, and checking the validity of the training document labels.

FIG. 5 is an example of detecting obsolete classes. It plots the similarity between the training document set and the test document set of the same class (horizontal axis, as a percentage) against the recall on the test document set (vertical axis, as a percentage); each point corresponds to one class. As FIG. 5 shows, classes with low recall have low similarity between their training and test document sets. Classes that have become obsolete can therefore be found efficiently by selecting the classes for which this similarity is low. Because the class model needs to be updated only for those low-similarity classes, a substantial cost reduction can be expected compared with updating the class models of all classes.

Although the above embodiments have been described using documents as an example, the invention can also be applied to patterns that can be represented in the same way as the documents of the embodiments and that have similar properties. That is, the present invention applies equally if, in the embodiments, "document" is replaced with "pattern", "term" with "component", "training document" with "training pattern", "document segment" with "pattern segment", "document segment vector" with "pattern segment vector", and so on.

A diagram showing the configuration of an apparatus that implements the present invention.
A block diagram of the present invention.
A flowchart showing the procedure of the present invention for detecting topic-approaching class pairs in a given training document set.
A diagram showing the relationship among a document set, documents, and sentence vectors.
A flowchart showing the procedure of the second embodiment of the present invention for detecting classes whose class model has become obsolete.
A flowchart showing the procedure of the third embodiment of the present invention for detecting classes whose class model has become obsolete.
A graph showing the relationship between the inter-class similarity of the training document sets (horizontal axis) and the inter-class error rate on the test document sets (vertical axis).
A graph showing the relationship between the similarity between the training document set and the test document set of the same class (horizontal axis) and the recall on the test document set (vertical axis).

Explanation of symbols

100: Housing
110: Storage device
120: Main memory
130: Output device
140: Processing unit (CPU)
150: Operation unit
160: Input
210: Document input block
220: Document preprocessing block
230: Document information processing block
240: Training document information storage block
250: Actual document information storage block
260: Ineligible document class output block

Claims

A document classification evaluation apparatus comprising means for classifying an input document by collating the input document with a class model of each class, the class model being created from the training document information of that class, and the following means (a) and (b):
(A) means for obtaining a first similarity for all class pairs using the training document set of each class;
(B) means for detecting a class pair whose first similarity is greater than a first threshold.

The document classification evaluation apparatus according to claim 1, wherein the means for obtaining the similarity includes the following means (a) to (d):
(A) means for detecting and selecting a term used for detecting the class pair from each training document;
(B) means for decomposing each training document into document segments;
(C) means for generating a document segment vector having a value related to an occurrence frequency of a term appearing in the document segment for each training document as a value of a corresponding component;
(D) Means for obtaining similarity between training document sets for all class pairs based on the document segment vector of each training document.

A document classification evaluation apparatus comprising means for classifying an input document by collating the input document with a class model of each class, the class model being created from the training document information of that class, and the following means (a) to (d):
(A) means for creating a class model for each document class based on the training document set;
(B) means for collating the input document with the class model, classifying the input document to a document class to which the input document belongs, and creating a real document set;
(C) means for obtaining, for all document classes, a second similarity between the training document set and the actual document set of the same class;
(D) Means for detecting a class having the second similarity smaller than a second threshold.

A document classification evaluation apparatus comprising means for classifying an input document by collating the input document with a class model of each class, the class model being created from the training document information of that class, and the following means (a) to (d):
(A) means for creating a class model for each document class based on the training document set;
(B) means for collating the input document with the class model, classifying the input document to a document class to which the input document belongs, and creating a real document set;
(C) means for determining a third similarity between the training document set of each document class and the actual document set of all other document classes;
(D) Means for detecting a class pair whose third similarity is greater than a third threshold.

The apparatus according to claim 3 or 4, wherein the means for obtaining the similarity includes the following means (a) to (d):
(A) means for detecting and selecting a term used for detection of the class or class pair from each training document and each actual document;
(B) means for decomposing each training document and each actual document into document segments;
(C) means for generating a document segment vector having a value related to an appearance frequency of a term appearing in the document segment for each training document and each actual document as a value of a corresponding component;
(D) Means for obtaining the second similarity or the third similarity based on each training document and the document segment vector of the actual document.

The apparatus according to claim 1, further comprising means for obtaining the first, second, or third similarity as follows: where M is the number of distinct terms that appear, a document has Y document segments, and the y-th document segment vector is d_y = (d_y1, ..., d_yM)^T (T denotes the transpose of a vector), the co-occurrence matrix S of the document is given by

S = Σ_{y=1}^{Y} d_y d_y^T,

a sum matrix of the co-occurrence matrices of all documents is obtained for each document set, and the similarity is obtained based on the sum of products of corresponding components of the two sum matrices.

A document classification evaluation program that causes a computer to operate means for classifying an input document by collating the input document with a class model of each class, the class model being created from the training document information of that class, and further to operate the following means (a) and (b):
(A) means for obtaining a first similarity for all class pairs using the training document set of each class;
(B) means for detecting a class pair whose first similarity is greater than a first threshold.

The document classification evaluation program according to claim 7, wherein the means for obtaining the similarity includes the following means (a) to (d):
(A) means for detecting and selecting a term used for detecting the class pair from each training document;
(B) means for decomposing each training document into document segments;
(C) means for generating a document segment vector having a value related to an occurrence frequency of a term appearing in the document segment for each training document as a value of a corresponding component;
(D) Means for obtaining similarity between training document sets for all class pairs based on the document segment vector of each training document.

A document classification evaluation program that causes a computer to operate means for classifying an input document by collating the input document with a class model of each class, the class model being created from the training document information of that class, and further to operate the following means (a) to (d):
(A) means for creating a class model for each document class based on the training document set;
(B) means for collating the input document with the class model, classifying the input document to a document class to which the input document belongs, and creating a real document set;
(C) means for obtaining, for all document classes, a second similarity between the training document set and the actual document set of the same class;
(D) Means for detecting a class having the second similarity smaller than a second threshold.

A document classification evaluation program that causes a computer to operate means for classifying an input document by collating the input document with a class model of each class, the class model being created from the training document information of that class, and further to operate the following means (a) to (d):
(A) means for creating a class model for each document class based on the training document set;
(B) means for collating the input document with the class model, classifying the input document to a document class to which the input document belongs, and creating a real document set;
(C) means for determining a third similarity between the training document set of each document class and the actual document set of all other document classes;
(D) Means for detecting a class pair whose third similarity is greater than a third threshold.

The program according to claim 9 or 10, wherein the means for obtaining the similarity includes the following means (a) to (d):
(A) means for detecting and selecting a term used for detection of the class or class pair from each training document and each actual document;
(B) means for decomposing each training document and each actual document into document segments;
(C) means for generating a document segment vector having a value related to an appearance frequency of a term appearing in the document segment for each training document and each actual document as a value of a corresponding component;
(D) Means for obtaining the second similarity or the third similarity based on each training document and the document segment vector of the actual document.

The program according to any one of claims 7 to 11, which causes the computer to operate means for obtaining the first, second, or third similarity as follows: where M is the number of distinct terms that appear, a document has Y document segments, and the y-th document segment vector is d_y = (d_y1, ..., d_yM)^T (T denotes the transpose of a vector), the co-occurrence matrix S of the document is given by

S = Σ_{y=1}^{Y} d_y d_y^T,

a sum matrix of the co-occurrence matrices of all documents is obtained for each document set, and the similarity is obtained based on the sum of products of corresponding components of the two sum matrices.

A document classification evaluation method comprising a step of classifying an input document by collating the input document with a class model of each class, the class model being created from the training document information of that class, and the following steps (a) and (b):
(A) obtaining a first similarity for all class pairs using the training document set of each class;
(B) detecting a class pair whose first similarity is greater than a first threshold.

The document classification evaluation method according to claim 13, wherein the step of obtaining the similarity includes the following steps (a) to (d):
(A) detecting and selecting a term used for detecting the class pair from each training document;
(B) decomposing each training document into document segments;
(C) generating, for each training document, a document segment vector having a value related to an appearance frequency of a term appearing in the document segment as a value of a corresponding component;
(D) obtaining similarities between training document sets for all class pairs based on the document segment vectors of the respective training documents.

A document classification evaluation method comprising a step of classifying an input document by collating the input document with a class model of each class, the class model being created from the training document information of that class, and the following steps (a) to (d):
(A) creating a class model for each document class based on the training document set;
(B) collating the input document with the class model, classifying the input document into a document class to which the input document belongs, and creating a real document set;
(C) determining, for all document classes, a second similarity between the training document set and the actual document set of the same class;
(D) detecting a class having the second similarity smaller than a second threshold.

A document classification evaluation method comprising a step of classifying an input document by collating the input document with a class model of each class, the class model being created from the training document information of that class, and the following steps (a) to (d):
(A) creating a class model for each document class based on the training document set;
(B) collating the input document with the class model, classifying the input document into a document class to which the input document belongs, and creating a real document set;
(C) determining a third similarity between the training document set of each document class and the actual document set of all other document classes;
(D) detecting a class pair in which the third similarity is larger than a third threshold.

The method according to claim 15 or 16, wherein the step of determining the similarity includes the following steps (a) to (d):
(A) detecting and selecting a term used for detection of the class or class pair from each training document and each actual document;
(B) decomposing each training document and each actual document into document segments;
(C) generating a document segment vector having a value related to an occurrence frequency of a term appearing in the document segment for each training document and each actual document as a value of a corresponding component;
(D) A step of obtaining the second similarity or the third similarity based on each training document and the document segment vector of the actual document.

The method according to any one of claims 13 to 17, further comprising a step of obtaining a sum matrix of the co-occurrence matrices of all documents for each document set and obtaining the first, second, or third similarity based on the sum of products of corresponding components of the two sum matrices.

A pattern classification evaluation program that causes a computer to operate means for classifying an input pattern by collating the input pattern with a class model of each class, the class model being created from the training pattern information of that class, and further to operate the following means (a) and (b):
(A) means for obtaining a first similarity for all class pairs using the training pattern set of each class;
(B) means for detecting a class pair whose first similarity is greater than a first threshold.

The pattern classification evaluation program according to claim 19, wherein the means for obtaining the similarity includes the following means (a) to (d):
(A) means for detecting and selecting a component used for detection of the class pair from each training pattern;
(B) means for decomposing each training pattern into pattern segments;
(C) means for generating a pattern segment vector having a value related to an appearance frequency of a component appearing in the pattern segment for each training pattern as a value of a corresponding component;
(D) Means for obtaining similarity between training pattern sets for all class pairs based on the pattern segment vector of each training pattern.

A pattern classification evaluation program that causes a computer to operate means for classifying an input pattern by collating the input pattern with a class model of each class, the class model being created from the training pattern information of that class, and further to operate the following means (a) to (d):
(A) means for creating a class model of each pattern class based on the training pattern set;
(B) means for collating the input pattern with the class model, classifying the input pattern into a pattern class to which the input pattern belongs, and creating a real pattern set;
(C) means for obtaining, for all pattern classes, a second similarity between the training pattern set and the actual pattern set of the same class;
(D) Means for detecting a class having the second similarity smaller than a second threshold.

A pattern classification evaluation program that causes a computer to operate means for classifying an input pattern by collating the input pattern with a class model of each class, the class model being created from the training pattern information of that class, and further to operate the following means (a) to (d):
(A) means for creating a class model of each pattern class based on the training pattern set;
(B) means for collating the input pattern with the class model, classifying the input pattern into a pattern class to which the input pattern belongs, and creating a real pattern set;
(C) means for obtaining a third similarity between the training pattern set of each pattern class and the actual pattern sets of all other pattern classes;
(D) Means for detecting a class pair whose third similarity is greater than a third threshold.

The program according to claim 21 or 22, wherein the means for obtaining the similarity includes the following means (a) to (d):
(A) means for detecting and selecting a component used for detection of the class or class pair from each training pattern and each actual pattern;
(B) means for decomposing each training pattern and each actual pattern into pattern segments;
(C) means for generating a pattern segment vector having a value related to an appearance frequency of a component appearing in the pattern segment for each training pattern and each actual pattern as a value of a corresponding component;
(D) Means for obtaining the second similarity or the third similarity based on each training pattern and the pattern segment vector of the actual pattern.