JP4926198B2

JP4926198B2 - Method and system for generating a document classifier

Info

Publication number: JP4926198B2
Application number: JP2009097929A
Authority: JP
Inventors: ジェチャンリイ; ユウジャオ
Original assignee: NEC China Co Ltd
Current assignee: NEC China Co Ltd
Priority date: 2008-04-18
Filing date: 2009-04-14
Publication date: 2012-05-09
Anticipated expiration: 2029-04-14
Also published as: CN101561805A; CN101561805B; JP2009259250A

Description

The present invention relates to information retrieval (IR) and text data mining, and in particular to a method and system for generating a document classifier for automatic document classification that provides highly accurate document classification by combining the data distribution underlying an unclassified document set with the semantic information contained in the terms of the category names.

In recent years, the rapid growth of available electronic documents has required ordinary people to make sense of and take advantage of very large amounts of information. Helping users organize this information and find the parts of interest effectively and efficiently is an interesting task.

Information retrieval (IR) is the science of searching for information in document collections. It can be further divided into searching for a piece of information contained in a document, searching for documents themselves, searching for metadata that describes documents, and searching within databases, whether stand-alone relational databases or hypertext databases networked over the Internet or an intranet, for text, sound, images, or data. Text data mining generally refers to techniques for deriving high-quality information from plain text, and is divided into text classification, text clustering, concept/entity extraction, document summarization, and so on. Because most useful information is generally stored as text or documents, information retrieval and text data mining are considered to have high commercial value. Document classification is the task of labeling natural language texts with thematic categories from a predefined set, and it can be applied in many IR and text data mining applications, such as word sense disambiguation, document organization, text filtering, and web page retrieval.

The growing availability of electronic information determines the importance of information retrieval and text data mining. Automatic document classification is one of the fundamental technologies for both and, at the same time, plays a major role in the effective and efficient use of large amounts of electronic information.

Currently, machine learning (ML) based approaches are among the main approaches to automatic document classification. Their best performance depends heavily on large amounts of manually labeled training data. However, when there are hundreds or thousands of categories, data labeling is tedious and expensive, especially for complex document classification tasks.

Much research has been done on exploiting unlabeled data to improve the accuracy of trained models. However, existing methods cannot handle cases where no training data is available. Furthermore, because the learning process depends heavily on a small number of training samples, the classification results are easily biased by the training data. As a result, performance good enough for practical systems has not been achieved.

The work of the present invention is closely related to research on information retrieval and text data mining, in particular document classification, which has been widely studied. Basically, common approaches to automatic document classification fall into three types: supervised, semi-supervised, and unsupervised document classification. Their implementation generally involves two basic steps: a classifier learning step and a document classification step.

Supervised document classification approaches treat category names only as symbolic labels and assume no additional knowledge of their meaning. Exogenous knowledge can also be used to help build the classifier. In the learning phase, a general inductive procedure automatically builds a classifier for a category by observing the characteristics of a set of documents that have been manually classified in advance (for example, by a domain expert). In the document classification phase, the classifier then examines the characteristics of a new document in order to classify it under the corresponding category. A variety of methods for the inductive construction of document classifiers have been studied in prior work. The most common include probabilistic classifiers, decision trees, neural networks, support vector machines (SVMs), and regression methods. Because knowledge of the correct classification of documents is used to supervise classifier learning, a large number of manually labeled training samples for every category is required for accurate learning.

To reduce the human effort of labeling training data, semi-supervised document classification approaches, which work with a small amount of labeled data, are attracting more and more attention. They use both labeled and unlabeled training samples. The unlabeled data is used to improve the poor performance of supervised learning when training data is insufficient. Work on semi-supervised document classification to date can be broadly grouped into generative methods, discriminative methods, and self-learning methods.

Generative methods assume that document examples are generated from an identifiable mixture distribution (for example, a Gaussian mixture model). Given a large amount of unlabeled data, the unknown parameters of the mixture model can be identified. A representative method is the Expectation-Maximization (EM) algorithm. Along the same lines, document clustering is used to exploit unlabeled documents to improve text classification, where each cluster effectively serves as a "pseudo mixture model". The clustering is applied to both labeled and unlabeled data, and new features extracted from the clusters are introduced into the patterns in the labeled and unlabeled data.

Discriminative methods stem from the idea that unlabeled data belonging to different classes are separated by a large margin. Based on this assumption, the transductive support vector machine (TSVM) extends the standard support vector machine with unlabeled data and attempts to maximize the "unlabeled data margin" by minimizing the misclassification of particular documents. A logistic regression model, regarded as a general form of the SVM, has also been adopted for semi-supervised text classification. More recently, a series of new semi-supervised learning approaches has emerged from graph representations, in which labeled and unlabeled instances are represented as vertices and the edges encode the similarity between instances.

Self-learning methods assume that the classifier's own high-confidence predictions are correct. Two representative methods derive from this assumption: self-training and co-training. Self-training works as follows: 1) a small number of labeled documents is used to train a classifier; 2) the resulting classifier is used to classify the unlabeled documents; 3) in each iteration, a reliable set of newly labeled documents, selected for high confidence, is used to retrain the classifier. During this iteration, the classifier uses its own high-confidence predictions to teach itself. As a similar technique, Patent Document 1 (JP 2002-133389 A) provides an acceleration mechanism that uses the distribution of the test data to improve the accuracy of iterative learning with a small amount of training data. Co-training works as follows: 1) the feature set is first split into two sufficient and conditionally independent sets, which are used to train two classifiers respectively; 2) each classifier then classifies the unlabeled data and selects some reliable samples to extend the training data of the other classifier; 3) both classifiers are retrained with the additional training samples, and the process repeats.
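The three self-training steps listed above can be sketched as follows. This is only an illustration under stated assumptions, not an implementation from any cited work: documents are assumed to be already represented as numeric feature tuples, the classifier is a hypothetical nearest-centroid model, and the confidence measure (distance margin between the two nearest centroids) and the `threshold` and `max_iter` parameters are choices made for this example.

```python
from math import dist


def centroids(labeled):
    """Mean point per label, computed from (point, label) pairs."""
    sums = {}
    for point, label in labeled:
        total, count = sums.get(label, ((0.0,) * len(point), 0))
        sums[label] = (tuple(t + p for t, p in zip(total, point)), count + 1)
    return {label: tuple(t / n for t in total) for label, (total, n) in sums.items()}


def predict(cents, point):
    """Return (label, confidence). Confidence is the distance margin
    between the nearest and the second-nearest centroid (an assumption)."""
    ranked = sorted((dist(point, c), label) for label, c in cents.items())
    margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else float("inf")
    return ranked[0][1], margin


def self_train(labeled, unlabeled, threshold=1.0, max_iter=10):
    """Self-training loop; returns the final centroids and the points left unlabeled."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_iter):
        cents = centroids(labeled)                        # 1) train on the labeled data
        scored = [(p, *predict(cents, p)) for p in pool]  # 2) classify the unlabeled data
        confident = [(p, lab) for p, lab, conf in scored if conf >= threshold]
        if not confident:
            break                                         # nothing reliable left to add
        labeled += confident                              # 3) add confident predictions, then retrain
        pool = [p for p, _, conf in scored if conf < threshold]
    return centroids(labeled), pool
```

A point that stays equidistant from both centroids never reaches the threshold and remains in the unlabeled pool, which shows how the loop avoids committing to low-confidence predictions.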

Unlike supervised and semi-supervised learning methods, which exploit the knowledge contained in the document set, so-called unsupervised approaches mainly exploit the knowledge contained in the concepts of the category names for automatic document classification. Rather than producing training documents by hand, they mainly use an initial predefined keyword list, or keywords appearing in the category names, as seeds and adopt some bootstrapping mechanism. An alternative solution is to split documents into sentences and use the keyword list of each category to generate a set of training sentences. The classified sentences are then used for document classification.

Patent Document 1: JP 2002-133389 A

However, existing methods still have the following problems to be solved.

First, creating enough training data for supervised approaches is very costly. Supervised document classification approaches require a large amount of valid training data for each document set or problem domain. However, labeled data is often difficult, expensive, and time-consuming to obtain because it requires the effort of experienced annotators. This is a particular problem for complex tasks or domains with hundreds or thousands of categories.

Second, the document classification results of semi-supervised approaches tend to be biased by the small amount of training data. The idea of semi-supervised learning is to learn not only from labeled training data but also from the structural information in the available unlabeled data. This only partially addresses the problem of training data availability. Because labeled data is sparse, not only is accuracy insufficient, but robustness is also a major problem for applying these methods.

Third, the document classification results of unsupervised approaches tend to be biased by the predefined keyword lists. In so-called unsupervised approaches, category names or per-category keyword lists serve as seeds for the bootstrapping mechanism for automatic text classification. Because this approach depends heavily on the initial human-defined keyword lists and has no bias control mechanism, the accuracy and robustness of the classification results are generally insufficient. In addition, the initial seed words must be collected by hand, which is a tedious and costly task, all the more so for complex tasks.

Finally, the adaptability and scalability of supervised, semi-supervised, and unsupervised approaches are insufficient. Classifiers trained by any of the three approaches above depend on the domain or document set. That is, when the document set or domain changes, the classifier must be retrained. For supervised and semi-supervised approaches, this means additional human effort to label a certain number of documents as training data. For so-called unsupervised approaches, an initial keyword list relevant to the corresponding categories must be defined when the domain changes. In addition, further learning effort is needed for the changed domain or document set.

In view of the above problems, a new method and system for automatic document classification is needed that improves the accuracy and scalability of document classification, especially when no labeled data is available.

(Object of the invention)
The present invention has been proposed in view of the aforementioned problems of existing document classification approaches in this technical field.

In the present invention, a document classifier generation method is proposed for automatic document classification. Knowledge of the data distribution of the target document set and the semantic information carried by the category names are used to improve the accuracy and scalability of document classification, especially when no training data is available.

In general, the hybrid document classifier generation method mainly comprises three steps: (1) initial training data generation, (2) iterative classifier learning, and (3) final classifier construction.

First, in initial training data generation, the initial training data is generated based on semantic analysis of the category names using external knowledge sources. For example, in one embodiment, a profile-based method is designed for building the classifier, in which each category has a set of semantically related features that serves as its representative profile. Using the initial classifier, initial training data (labeled documents) comprising positive and negative samples is generated for classifier learning in the next iteration.
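As a rough illustration of the profile-based idea, the following sketch scores each document against a per-category term set and keeps only clearly matching and clearly non-matching documents as positive and negative samples. The scoring function, the thresholds `pos_min` and `neg_max`, and the whitespace tokenization are assumptions made for this example; the text above does not specify them, and in the described method the profiles would come from semantic analysis of category names against external knowledge sources rather than being written by hand.

```python
def profile_score(tokens, profile):
    """Fraction of a document's tokens that appear in the category profile."""
    return sum(t in profile for t in tokens) / len(tokens) if tokens else 0.0


def initial_training_data(docs, profiles, pos_min=0.3, neg_max=0.0):
    """Label documents per category as positive or negative samples.

    Documents whose score falls between the two thresholds are left
    unlabeled for that category (hypothetical thresholds)."""
    data = {cat: {"pos": [], "neg": []} for cat in profiles}
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for cat, profile in profiles.items():
            score = profile_score(tokens, profile)
            if score >= pos_min:
                data[cat]["pos"].append(doc_id)
            elif score <= neg_max:
                data[cat]["neg"].append(doc_id)
    return data
```

Leaving the middle band unlabeled keeps ambiguous documents out of the initial training data, so that later iterations can decide on them with a better classifier.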

Next, in the iterative classifier learning stage, each iteration uses the classification results of the classifier from the previous iteration to build that iteration's training data (results classified with high confidence are selected as labeled data). A new classifier is then learned from the updated training data (labeled data). Finally, the new classifier replaces the classifier from the previous iteration and is used to classify the remaining documents. The iteration ends when all documents have been classified, when the set of generated classifiers converges, or when another termination condition is met.

In the final classifier construction step, after iterative learning ends, the classifier that is most consistent with the clustering result is selected from all the resulting classifiers as the final classifier. Because the present invention assumes that no training data is available, it mainly uses maximum likelihood estimation as the solution for classifier selection.
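The selection step can be illustrated with a simple stand-in for the consistency measure. The sketch below does not implement maximum likelihood estimation; instead it assumes pairwise agreement (the Rand index) between each candidate classifier's predicted labels and the cluster assignments, and picks the candidate with the highest agreement. The function names and the dictionary layout of `candidates` are hypothetical.

```python
from itertools import combinations


def pair_agreement(labels, clusters):
    """Rand index: fraction of document pairs on which a labeling and a
    clustering agree (both put the pair together, or both keep it apart)."""
    pairs = list(combinations(range(len(labels)), 2))
    if not pairs:
        return 1.0  # fewer than two documents: nothing to disagree on
    agree = sum(
        (labels[i] == labels[j]) == (clusters[i] == clusters[j]) for i, j in pairs
    )
    return agree / len(pairs)


def select_final(candidates, clusters):
    """Return the name of the candidate whose predictions best match the clustering.

    `candidates` maps a classifier name to its predicted labels, listed in
    the same document order as `clusters`."""
    return max(candidates, key=lambda name: pair_agreement(candidates[name], clusters))
```

Because the measure compares groupings rather than label names, it works even though cluster identifiers and category names share no vocabulary.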

It should be noted that, during the machine learning process, training data selection (including initial training data generation and intermediate training data generation in iterative learning) is based on aligning the clustering and classification results, for which a Bayesian model may be adopted. The purpose is to mitigate bias that may arise from the category names, the external knowledge sources, or noisy data in the iterative classifier learning process.

The method of generating a classifier according to the present invention obtains a clustering result for a set of objects, generates a rough category classification result for the set of objects so as to obtain a rough classifier, and adjusts the rough classification result with the clustering result to generate a final classifier. In a preferred form, the rough classifier can be generated by learning a classifier from training data, which may be obtained externally as manually labeled training data or generated automatically from domain-related category names with reference to external knowledge sources. In one form, the rough classification result can be adjusted by aligning it with the previously obtained clustering result. This adjustment can be carried out iteratively. By repeatedly updating the training data, a set of intermediate classifiers can be learned, from which the classifier most consistent with the clustering result is selected as the final classifier.
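One simple way to picture the alignment of a rough classification result with a clustering result is a majority vote inside each cluster. The sketch below is an assumption made for illustration only: this section does not state the actual alignment rule, and the function name and data layout are hypothetical. Documents whose rough label disagrees with the majority label of their cluster are dropped, and the remaining documents could serve as updated training data for the next iteration.

```python
from collections import Counter, defaultdict


def align_with_clusters(rough_labels, clusters):
    """Keep only documents whose rough label matches the majority label
    of their cluster.

    `rough_labels` maps doc id -> category, `clusters` maps doc id -> cluster id;
    both are assumed to cover the same documents."""
    by_cluster = defaultdict(list)
    for doc_id, cluster in clusters.items():
        by_cluster[cluster].append(rough_labels[doc_id])
    majority = {
        cluster: Counter(labels).most_common(1)[0][0]
        for cluster, labels in by_cluster.items()
    }
    return {
        doc_id: label
        for doc_id, label in rough_labels.items()
        if majority[clusters[doc_id]] == label
    }
```

Filtering by cluster majority uses the data distribution to veto labels that look like outliers, which is the bias-control role the clustering result plays in the description above.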

The system for generating a classifier according to the present invention comprises acquisition means for obtaining a clustering result for a set of objects, rough categorization means for generating a rough category classification result for the set of objects so as to obtain a rough classifier, and adjustment/generation means for adjusting the rough classification result with the clustering result to generate a final classifier.

In the present invention, alignment analysis between the clustering and classification results is performed and integrated not only in initial training data generation but also in iterative classifier learning. This controls bias that may arise from the category names and the corresponding semantic analysis. It thereby ensures improved accuracy of the resulting training data as well as of the final classification results.

Furthermore, the method according to the present invention requires neither training data nor an initial predefined keyword list for document classification. Instead, semantic analysis of category names using existing external knowledge sources (including latent semantic analysis for extracting co-occurring keywords) is used to generate the initial training data. Because existing external knowledge sources can cover multiple domains, the method can easily be applied to many different kinds of domains and document sets, even when the domain or document set changes, while greatly reducing additional labeling effort.

Moreover, the mechanism provided for final classifier construction can reduce the risk that the classifier is excessively biased by noisy data in the iterative classifier learning process, particularly for discriminative classifiers (for example, support vector machines (SVMs) and logistic regression). This is an important contribution of the present invention to improving the accuracy of the final document classification results.

FIG. 1 is an overall block diagram of a document classification system 100 according to the present invention, showing the internal configuration of the classifier generation subsystem 10 in detail.
FIG. 2 is a flowchart showing an example of the operation of the document classification system 100 shown in FIG. 1.
FIG. 3 is a block diagram showing an example of the internal configuration of the adjustment/generation means 103 in the classifier generation subsystem 10 shown in FIG. 1.
FIG. 4 is a block diagram showing the internal configuration of an implementation example 400A of the rough categorization means 102 in the classifier generation subsystem 10 shown in FIG. 1, in which manually labeled training data obtained from outside is used directly for classifier learning.
FIG. 5 is a block diagram showing the internal configuration of an implementation example 400B of the rough categorization means 102 in the classifier generation subsystem 10 shown in FIG. 1, in which training data is generated automatically for classifier learning.
FIG. 6 is a block diagram showing the internal configuration of the training data generation unit 401B shown in FIG. 5 for the case where training data is generated automatically.
FIG. 7 is a block diagram showing an example of the internal configuration of the classification section 504 in the training data generation unit shown in FIG. 6.
FIG. 8 is a flowchart showing an example of the operation 700 of the training data generation unit shown in FIG. 6 when training data is generated automatically.
FIG. 9 is a block diagram showing an example of the internal configuration of the training data generation section 505 shown in FIG. 6, which generates training data based on intermediate classification results; the clustering result for the document set is obtained in order to adjust the intermediate classification results.
FIG. 10 is a flowchart showing the operation 900 of the adjustment/generation means 103 for iterative classifier learning in the classifier generation subsystem 10 shown in FIG. 1, according to the present invention.
FIG. 11 is a schematic block diagram of a computer system used to implement the present invention.

The foregoing and other features and advantages of the present invention will become more apparent from the following description taken in conjunction with the accompanying drawings. It goes without saying that the scope of the invention is not limited to the examples or specific embodiments described herein.

The foregoing and other features of the present invention will be more fully understood when the following description is read together with the accompanying drawings.

The classifier generation method and system according to the present invention can be applied to text filtering, document recommendation, search result clustering, web page retrieval, web mining systems, and the like.

FIG. 1 is a block diagram showing the overall document classification system 100 according to the present invention, with the internal configuration of the classifier generation subsystem 10 shown in detail. As shown in the figure, a document set received from the document base 105 is clustered in advance into a number of groups by the document clustering means 107, and the clustering result is stored in the clustering result base 104. The clustering result for the document set stored in the clustering result base 104 is later consumed by the classifier generation subsystem 10 according to the present invention or by other applications related to information retrieval. As for the document clustering method, many existing approaches well known to those skilled in the art can be used with the present invention. Because these are not the main feature of the present invention, they are not described in detail here. Any document clustering method readily available to those skilled in the art can be used to obtain the required document clustering result. As shown in FIG. 1, for example, the classifier generation subsystem 10 according to the present invention includes acquisition means 101, rough categorization means 102, and adjustment/generation means 103.

FIG. 2 is a flowchart showing an example of the operating procedure of the document classification system 100 shown in FIG. 1.

The procedure 200 shown in FIG. 2 starts at step 201, where the classifier generation subsystem 10 obtains the document set to be classified from the document base 105.
As shown in step 202, the obtained document set is provided to the rough categorization means 102, which performs rough categorization to generate a rough categorization result (that is, a rough classifier). For example, any of the existing supervised, semi-supervised, or unsupervised document classification methods described in the background art can be applied for rough categorization. In one embodiment, as described later, a method that learns a classifier from training data can be employed to generate the rough classifier. Depending on the application requirements, the training data for learning the classifier can be input from outside as manually labeled training data, or generated automatically by referring to semantic information about the category names from external knowledge sources. The training data generation procedure is described in detail later.

Concurrently, in step 203, the acquisition means 101 acquires the clustering result stored in advance for the same document set from the clustering result base 104. As those skilled in the art know, a clustering result reflects the data distribution underlying the document set, so it is used here to suppress bias that may arise in the rough classification result. Both the rough classification result for the document set from the rough categorization means 102 and the clustering result acquired by the acquisition means 101 are supplied to the adjustment/generation means 103. Next, in step 204, the adjustment/generation means 103 uses the clustering result from the acquisition means 101 to adjust the rough classification result (that is, the rough classifier) from the rough categorization means 102, thereby generating the final classifier 106. The principle and procedure for adjusting the rough classification result with the clustering result are described with reference to FIG. 3.
Furthermore, as described later, this idea of adjusting a rough classification result with a clustering result can be extended to an iterative scheme that generates a group of intermediate classifiers and selects the optimal one as the final classifier, which can further improve the accuracy of document classification. The specific iterative process of classifier learning is described later. Thereafter, in step 205, the document set acquired in step 201 is supplied to the generated final classifier 106, which in turn classifies each document into at least one appropriate category. The final classification result is stored in the document classification result base 108, and the procedure 200 ends.
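
The flow of steps 202 to 205 can be sketched as follows. This is only an illustrative outline under stated assumptions: the function name `run_procedure_200` and the callables `rough_classifier` and `adjust` are hypothetical placeholders for the rough categorization means 102 and the adjustment/generation means 103, and each document is assigned to a single best category although the description allows more than one.

```python
from typing import Callable, Dict, Sequence

Doc = str
Scores = Dict[str, float]  # category -> probability-like score


def run_procedure_200(
    documents: Sequence[Doc],
    rough_classifier: Callable[[Doc], Scores],
    cluster_of: Dict[Doc, int],
    adjust: Callable[[Dict[Doc, Scores], Dict[Doc, int]], Dict[Doc, Scores]],
) -> Dict[Doc, str]:
    """Hypothetical outline of steps 202-205 of procedure 200."""
    # Step 202: rough categorization of every document.
    rough = {d: rough_classifier(d) for d in documents}
    # Steps 203-204: adjust the rough result with the stored clustering result.
    final = adjust(rough, cluster_of)
    # Step 205: assign each document to its highest-scoring category.
    return {d: max(scores, key=scores.get) for d, scores in final.items()}
```

Passing the adjustment as a callable keeps the sketch independent of any particular alignment model.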

FIG. 3 is a block diagram showing an example of the internal configuration of the adjustment/generation means 103 of the classifier generation subsystem 10 shown in FIG. 1. In this example, the rough categorization means 102 is assumed to perform rough classification by a query-based method, and the rough classification result is expressed as a series of ranking scores. The adjustment/generation means 103 sets up an alignment model based on Bayesian inference to align the rough classification result with the clustering result, so that a more accurate classification result (that is, the final classifier 106) can be obtained. The method of adjusting the rough classification result with the clustering result is not limited to the alignment example based on the Bayesian inference model shown in FIG. 3; those skilled in the art will readily understand that other adjustment methods can likewise be applied to improve the accuracy of document classification.

In the example shown in FIG. 3, the adjustment/generation means 103 includes a prior probability calculation unit 301 and an alignment unit 302.

The prior probability calculation unit 301 first calculates the prior probabilities corresponding to the rough categorization result.
As noted above, the rough categorization result is assumed to be expressed as a series of ranking scores.
Let D denote the document set and C the category set. For a document d_i ∈ D and a category c_j ∈ C, the ranking score s(d_i, c_j) implicitly indicates how likely d_i is to belong to c_j. The scores are therefore normalized according to Equation 1.

[Equation 1]

As a result, P(c_j | d_i) = s'(d_i, c_j) can be assumed.
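
Equation 1 itself is not reproduced in this text. One plausible reading, offered only as an assumption, is that each document's non-negative ranking scores are divided by their sum so that they form a distribution over categories. The function name `normalize_scores` and the uniform fallback for all-zero scores are illustrative choices, not part of the description.

```python
from typing import Dict


def normalize_scores(scores: Dict[str, float]) -> Dict[str, float]:
    """Map one document's scores s(d_i, c_j) to s'(d_i, c_j) that sum to 1.

    Assumes non-negative scores and sum normalization (an assumed form of Equation 1).
    """
    total = sum(scores.values())
    if total <= 0:
        # No evidence for any category: fall back to a uniform distribution.
        return {c: 1.0 / len(scores) for c in scores}
    return {c: s / total for c, s in scores.items()}
```

For instance, scores of 3.0 and 1.0 for two categories become 0.75 and 0.25, which can then be read as P(c_j | d_i).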

Then, in the alignment unit 302, an alignment model is set up based on Bayesian inference. Let C' be the cluster set. If the clustering result indicates that document d_i has been clustered into cluster c'_k ∈ C', the alignment result is given by the following posterior probability.

[Equation]

Here, the prior probability P(c_j | d_i) is obtained from the rough classification result.
The likelihood can be calculated from basic statistics as follows.

[Equation]

The final alignment model can therefore be expressed as follows.

[Equation 5]
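
The formulas of the alignment model are not reproduced in this text, so the following is a sketch under explicit assumptions rather than the model as filed. It assumes the posterior P(c_j | d_i, c'_k) is proportional to P(c'_k | c_j) · P(c_j | d_i), and that the likelihood P(c'_k | c_j) is estimated as the share of category c_j's total probability mass that falls in cluster c'_k. The names `cluster_likelihood` and `align` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Tuple

Prior = Dict[Hashable, Dict[str, float]]  # document -> {category: P(c | d)}


def cluster_likelihood(prior: Prior, cluster_of: Dict[Hashable, int]) -> Dict[Tuple[int, str], float]:
    """Estimate P(c' | c) from soft counts (an assumed form of the likelihood)."""
    mass = defaultdict(float)   # (cluster, category) -> summed P(c | d)
    total = defaultdict(float)  # category -> summed P(c | d) over all documents
    for d, dist in prior.items():
        for c, p in dist.items():
            mass[(cluster_of[d], c)] += p
            total[c] += p
    return {(k, c): m / total[c] for (k, c), m in mass.items() if total[c] > 0}


def align(prior: Prior, cluster_of: Dict[Hashable, int]) -> Prior:
    """Return P(c | d, c') proportional to P(c' | c) * P(c | d) for every document."""
    like = cluster_likelihood(prior, cluster_of)
    posterior = {}
    for d, dist in prior.items():
        k = cluster_of[d]
        unnorm = {c: like.get((k, c), 0.0) * p for c, p in dist.items()}
        z = sum(unnorm.values())
        # If every product is zero, keep the prior unchanged.
        posterior[d] = {c: v / z for c, v in unnorm.items()} if z > 0 else dict(dist)
    return posterior
```

Under these assumptions, a document whose rough scores are ambiguous is pulled toward the category that dominates its cluster.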

With the alignment model of Equation 5, a final classifier adjusted by the clustering result can be obtained. Compared with the rough classifier of Equation 1, the final classifier provides improved accuracy in the final categorization result. In addition, bias arising from the category names and the corresponding semantic analysis can be effectively controlled by introducing this clustering-based adjustment of the categorization result.

The internal configuration of the rough categorization means 102 of the classifier generation subsystem 10 is described below in more detail with reference to FIGS. 4 and 5. As noted above, in one embodiment the rough classifier can be generated by a classifier learning method that uses training data. The training data employed in the present invention may be manually labeled training data input directly from outside, or may be generated automatically by the system. FIGS. 4 and 5 give two examples of generating a rough classifier by learning from training data: one in which manually labeled training data is used, and one in which the system generates the training data automatically. Of course, generation of the rough classifier is not limited to learning from training data; other classifier generation methods known to those skilled in the art can also be applied to the present invention.

Referring first to FIG. 4, in this example the rough categorization means 102 includes a training data generation unit 401A and a learning unit 402. The training data generation unit 401A obtains manually labeled training data from outside and supplies it directly to the learning unit 402, which then learns the rough classifier. The procedure for learning a classifier from training data is well known in the technical field to which the present invention belongs, so a detailed description is omitted here.

Referring to FIG. 5, in this example the rough categorization means 102 includes a training data generation unit 401B and a learning unit 402. The training data generation unit 401B differs from the training data generation unit 401A in that it generates the training data automatically by referring to semantic information about the category names from the external knowledge source 404. Thereafter, as in FIG. 4, the generated training data is supplied to the learning unit 402 for learning a classifier.

The principle and procedure by which the training data generation unit 401B shown in FIG. 5 automatically generates training data are described below in more detail with reference to FIGS. 6 to 9.

First, as shown in FIG. 6, the training data generation unit 401B includes, for example, a category name acquisition section 501, a word sense disambiguation section 502, a keyword generation section 503, a classification section 504, and a training data generation section 505. As also shown in FIG. 6, in addition to the document base 105, the training data generation unit 401B is connected to a category name base 403 and to an external knowledge source 404 concerning the category names, both of which are used to carry out the automatic generation of training data.

The automatic training data generation procedure 700 performed by the training data generation unit 401B shown in FIG. 6 is described with reference to the flowchart of FIG. 8.

The procedure 700 starts at step 701, in which the category name acquisition section 501 acquires the predetermined category names for the document set from the category name base 403. Because the words in a category name can have different meanings in different contexts, in step 702 the word sense disambiguation section 502 first disambiguates the acquired category names with the aid of the external knowledge source 404. Then, in step 703, the disambiguated category names are supplied to the keyword generation section 503, where appropriate keywords are generated based on the identified word senses. The appropriate keywords may include words that are likely to co-occur with the category name, which can be identified by latent semantic analysis. They may also include hyponyms, near-synonyms, or synonyms of the keywords appearing in the category name, which can be found through the external knowledge source 404.

To aid understanding, an example of word sense disambiguation and synonym selection is given here.
The word "spam" has two senses in WordNet: (sense 1) a canned meat made mainly from pork, and (sense 2) unwanted e-mail.
These senses must be distinguished in order to choose synonyms of "spam" for product profile classification. Accordingly, "spam + canned meat made mainly from pork" and "spam + unwanted e-mail" can be used as two queries submitted to the document set (that is, the product profile set).
Suppose the former query returns 20 hits and the latter returns 100 hits. Since 100 > 20, "spam" in the context of this classification task is judged to have sense 2, and the synonym of sense 2 (that is, "junk e-mail") is then selected.
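
The hit-count comparison in this example can be sketched as follows. The sketch assumes a deliberately simple retrieval rule (a document is a hit when it contains every query term as a case-insensitive substring) and represents each sense by a few hand-picked gloss terms; the names `count_hits` and `pick_sense` are hypothetical, and a real system would query an actual search index and read glosses from the external knowledge source.

```python
from typing import Dict, List, Sequence


def count_hits(documents: Sequence[str], query_terms: Sequence[str]) -> int:
    """Count documents containing every query term (case-insensitive substring match)."""
    terms = [t.lower() for t in query_terms]
    return sum(all(t in doc.lower() for t in terms) for doc in documents)


def pick_sense(documents: Sequence[str], word: str, sense_glosses: Dict[str, List[str]]) -> str:
    """Return the sense whose query 'word + gloss terms' hits the most documents."""
    hits = {
        sense: count_hits(documents, [word, *gloss])
        for sense, gloss in sense_glosses.items()
    }
    # The sense with the larger hit count wins, as in the 100 > 20 comparison above.
    return max(hits, key=hits.get)
```

The synonyms of the winning sense would then be added to the keyword list for that category.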

Returning to FIG. 8, in step 704 the generated keywords are supplied to the classification section 504, which classifies the document set to obtain an intermediate classification result (that is, an intermediate classifier). Next, in step 705, the intermediate classification result is supplied to the training data generation section 505 to generate the required training data. The procedure 700 then ends.

FIG. 7 shows an example of the internal configuration of the classification section 504 in the training data generation unit shown in FIG. 6. In this example, a profile-based filtering method is used to generate the intermediate classification result: the category-name-related keywords are used as a query to search the document set, and the documents in the hit list are labeled with the corresponding category.
As shown in FIG. 7, in this example the classification section 504 includes a search section 601 and a category labeling section 602. Referring again to FIG. 8, step 704 comprises several sub-steps. First, in sub-step 7041, the search section 601 receives the category-name-related keywords from the keyword generation section 503 and uses them as a representative profile to search the document set. Then, in sub-step 7042, the hit list obtained as the search result is sent to the category labeling section 602, which labels all or some (for example, the first 200) of the documents in the hit list with the corresponding category to accomplish document classification.

In general, only the documents at the top of the hit list are selected, so that the labeled documents can be trusted to be correct with high confidence.
For example, for the product category "anti_spam", "Spam + Junk email" is submitted to the document set as the relevant keywords for the search.
Here, "spam" is identified from the category name (that is, "anti_spam"), and "junk email" is the synonym selected from WordNet.
Assuming the hit list contains 1000 results, the top 200 items might be selected as representative product summaries of "anti_spam" products.
The top 200 product summaries are expected to retain all the features a person would need to judge whether a product has an anti_spam function, that is, whether it belongs to the "anti_spam" category.
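
The keyword search and top-of-list labeling of sub-steps 7041 and 7042 can be sketched as follows. The ranking rule (total number of keyword occurrences) and the names `rank_by_keywords` and `label_top_hits` are assumptions made for illustration; the description does not specify how the search section 601 scores its hits.

```python
from typing import Dict, List, Sequence


def rank_by_keywords(documents: Sequence[str], keywords: Sequence[str]) -> List[int]:
    """Return indices of matching documents, best match first (assumed scoring rule)."""
    scored = []
    for idx, doc in enumerate(documents):
        text = doc.lower()
        score = sum(text.count(k.lower()) for k in keywords)
        if score > 0:
            scored.append((score, idx))
    # Higher score first; ties keep the original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored]


def label_top_hits(
    documents: Sequence[str],
    category_keywords: Dict[str, Sequence[str]],
    top_k: int = 200,
) -> Dict[str, List[int]]:
    """Label only the top_k hits of each category's keyword query with that category."""
    return {
        category: rank_by_keywords(documents, keywords)[:top_k]
        for category, keywords in category_keywords.items()
    }
```

Keeping `top_k` small relative to the hit list trades coverage for label reliability, which matches the stated reason for selecting only the top of the list.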

As described above, once the intermediate classification result (that is, the intermediate classifier) has been obtained, it is supplied to the training data generation section 505 to generate training data. Various training data generation methods known to those skilled in the art can be applied to the present invention. To further improve the accuracy of document classification, however, the intermediate classification result can also be adjusted with the clustering result (for example, by using a Bayesian inference model) during training data generation. FIG. 9 shows an example of the internal configuration of the training data generation section 505, in which the clustering result for the document set is used to adjust the intermediate classification result.

As can be seen, the block diagram of FIG. 9 is somewhat similar to the internal configuration of the adjustment/generation means 103 shown in FIG. 3. That is, in this embodiment the training data generation section 505 adjusts the intermediate classification result in a manner similar to the adjustment/generation means 103 of FIG. 3; for details, refer to the description of FIG. 3. The adjusted (aligned) intermediate classification result is then supplied to the training data selection section 802, which selects the desired training data.

The configuration and operating principle of the classifier generation subsystem 10 according to the present invention have been described with reference to FIGS. 1 to 9. As noted above, to further improve the accuracy of document classification, the procedure of adjusting the rough classification result with the clustering result can be carried out iteratively. The detailed procedure is described below with reference to FIG. 10.

First, in step 901, the training data generated during the procedure for producing the rough classification result is acquired as the initial training data. In each iteration cycle, a classifier learning method (for example, a multinomial model based on naive Bayes (NB)) is used to learn a new intermediate classifier from the training data (step 902). Then, in step 903, the new classifier is used to classify the documents in the document base 105 to obtain a new intermediate classification result. In step 904, it is determined whether the iteration termination condition is satisfied. The termination condition can be set in advance by the user. For example, if the intermediate classifiers generated during the iterations gradually converge, the training data reaching a stable state can be chosen as the termination condition. Alternatively, the condition that every document in the document base 105 has been classified into a corresponding category can be used. If it is determined in step 904 that the termination condition is not satisfied ("NO" in step 904), the procedure proceeds to step 905, in which the intermediate classification result generated in the current iteration is used to generate new training data for the next iteration cycle. The method of generating new training data from the intermediate classification result is similar to that of FIG. 9: as described above, the intermediate classification result is aligned with the clustering result based on an alignment model (for example, the Bayesian alignment model).
The difference from the method of FIG. 9 lies in the calculation of the prior probability.
Particular methods may be adopted for the document classification results of different classifiers. For example, when an NB classifier is employed, the prior probability is the P(c_j | d_i) returned directly by the classifier for each pair of category c_j and document d_i.

Taking an NB classifier as an example, the iterative algorithm can be designed as follows.
(a) Input the initial training data T: C → Powerset(D) (that is, labeled document subsets);
(b) Learn an NB classifier from T and use it to obtain P(c | d) for each category-document pair (c, d) ∈ C × D;
(c) For each (c, d) ∈ C × D, if the clustering result places d in cluster c' (d ∈ c'), compute P(c | d, c') with the alignment model and set P'(c | d) = P(c | d, c');
(d) Generate newly labeled documents as training data T': C → Powerset(D), where for each c ∈ C, T'(c) contains the documents with the highest P'(c | d) within the document set D - domain(T) (the set difference of D and the domain set of T, that is, the documents not yet labeled in T);
(e) If T' = ∅, terminate the iteration; otherwise set T := T + T', jump to step (b), and start the next iteration.
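
Steps (a) to (e) can be sketched as follows. This is a minimal illustration under several assumptions that the description leaves open: documents are token lists, the NB classifier is a small multinomial model with add-one smoothing, each unlabeled document is offered only to its highest-scoring category, and `per_round` documents per category are added in each cycle. The names `train_nb`, `iterate`, `per_round`, and `max_rounds` are hypothetical, and `align` stands in for the alignment model of step (c).

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Set

Posterior = Dict[int, Dict[str, float]]  # document index -> {category: probability}


def train_nb(docs: List[List[str]], labeled: Dict[str, Set[int]]):
    """Multinomial naive Bayes with add-one smoothing; returns a function doc -> P(c | d)."""
    vocab = {w for d in docs for w in d}
    n_labeled = sum(len(ids) for ids in labeled.values())
    log_prior, log_word = {}, {}
    for c, ids in labeled.items():
        counts = Counter(w for i in ids for w in docs[i])
        denom = sum(counts.values()) + len(vocab)
        log_prior[c] = math.log((len(ids) + 1) / (n_labeled + len(labeled)))
        log_word[c] = {w: math.log((counts[w] + 1) / denom) for w in vocab}

    def predict(doc: List[str]) -> Dict[str, float]:
        logs = {c: log_prior[c] + sum(log_word[c][w] for w in doc) for c in labeled}
        top = max(logs.values())  # subtract the maximum for numerical stability
        exp = {c: math.exp(v - top) for c, v in logs.items()}
        z = sum(exp.values())
        return {c: v / z for c, v in exp.items()}

    return predict


def iterate(docs: List[List[str]], initial: Dict[str, Set[int]],
            align: Callable[[Posterior], Posterior],
            per_round: int = 1, max_rounds: int = 50):
    """Grow the labeled set T until no new document can be added (steps (a)-(e))."""
    labeled = {c: set(ids) for c, ids in initial.items()}              # (a)
    classifiers = []
    for _ in range(max_rounds):
        predict = train_nb(docs, labeled)                              # (b)
        classifiers.append(predict)
        adjusted = align({i: predict(d) for i, d in enumerate(docs)})  # (c)
        covered = set().union(*labeled.values())
        unlabeled = [i for i in range(len(docs)) if i not in covered]
        new = {}
        for c in labeled:                                              # (d)
            mine = [i for i in unlabeled if max(adjusted[i], key=adjusted[i].get) == c]
            mine.sort(key=lambda i: -adjusted[i][c])
            new[c] = set(mine[:per_round])
        if not any(new.values()):                                      # (e) T' is empty
            break
        for c in labeled:
            labeled[c] |= new[c]
    return labeled, classifiers
```

The list of classifiers returned by `iterate` corresponds to the group of intermediate classifiers from which a final classifier is later chosen.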

The iterative process of learning classifiers in steps 901 to 905 shown in FIG. 10 has been described in detail above, taking an NB classifier as an example.
During the iterative learning process, each iteration cycle produces a classifier that can be expressed by the posterior probability function P'(c | d) over category-document pairs. Of course, the classifier of the present invention is not limited to an NB classifier; other kinds of classifiers can clearly be applied as well.

Returning to FIG. 10, if it is determined in step 904 that the termination condition is satisfied ("YES" in step 904), the procedure proceeds to step 906, in which the group of classifiers generated during the iterations is saved. Then, in step 907, the optimal classifier is selected from this group as the final classifier. A typical way of selecting the optimal classifier is to choose the one that best fits the given document set. During the iterative learning process, the clustering result is considered to reduce the bias caused by insufficient training data; the clustering result can therefore also be used to evaluate the classifiers and select the most suitable one. As an example, a Bayesian model is used to select the optimal classifier.

For example, let the intermediate classifiers be F_k, k = 1, 2, ..., N, where N is the number of iterations.
The following Bayesian formulation is used.

[Equation]

Based on the maximum likelihood method, the specific F_k that maximizes P(C' | F_k) is sought.
Assuming the documents are independent of one another, this can be expressed as follows.

[Equation]

Here, c'(d) is the cluster to which document d belongs, and c(d) is the category to which classifier F_k assigns d.

As in the likelihood calculation of the alignment model described above, the likelihood function of F_k is as follows.

[Equation]

The final classifier is then derived as

[Equation]
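
The selection formulas are not reproduced in this text, so the following sketch rests on an assumed reading: under document independence, P(C' | F_k) is the product over documents of P(c'(d) | c(d)), with P(c' | c) estimated from counts of the category assignments that F_k itself produces. The names `cluster_log_likelihood` and `select_final` are hypothetical, and log-likelihoods are used to avoid underflow.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List


def cluster_log_likelihood(category_of: Dict[Hashable, str],
                           cluster_of: Dict[Hashable, int]) -> float:
    """log P(C' | F_k) = sum over d of log P(c'(d) | c(d)), with count-based P(c' | c)."""
    pair_counts = Counter((cluster_of[d], c) for d, c in category_of.items())
    cat_counts = Counter(category_of.values())
    # Each (cluster, category) pair contributes once per document it covers.
    return sum(n * math.log(n / cat_counts[c]) for (_, c), n in pair_counts.items())


def select_final(assignments: List[Dict[Hashable, str]],
                 cluster_of: Dict[Hashable, int]) -> int:
    """Return the index k of the classifier whose labels best explain the clusters."""
    scores = [cluster_log_likelihood(a, cluster_of) for a in assignments]
    return max(range(len(scores)), key=scores.__getitem__)
```

Here each entry of `assignments` is the category that one intermediate classifier F_k assigns to every document; a classifier whose categories line up with the clusters obtains a log-likelihood closer to zero.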

Once the final classifier has been selected, the procedure 900 ends.

FIG. 11 is a schematic block diagram of a computer system 1000 that can be used to implement the present invention. As illustrated, the computer system 1000 includes a CPU 1001, a user interface 1002, a peripheral device 1003, a memory 1005, an external storage device 1006, and an internal bus 1004 that interconnects these components. The memory 1005 further contains a domain and POS analysis module, an automatic document classification module, a document clustering module, an IR-related system, an operating system (OS), and the like. The present invention mainly concerns the automatic document classification module, which is, for example, the document classification system 100 shown in FIG. 1. The document clustering module performs clustering on the document set and stores the clustering result in an appropriate clustering result base (for example, the clustering result base 104). The external storage device 1006 stores the various databases related to the present invention, such as the clustering result base 104, the document base 105, the document classification result base 108, the category name base 403, and the external knowledge source 404.

The document classification method and system according to the present invention have been described above with reference to the accompanying drawings. Based on that description, the effects of the present invention are summarized below.

In the present invention, alignment analysis between the clustering and classification results is carried out and integrated not only in forming the initial training data but also in iterative classifier learning. This controls the bias that may arise from the category names and the corresponding semantic analysis, and thereby improves the accuracy of both the resulting training data and the final classification result.

Furthermore, the method according to the present invention requires neither training data for document classification nor an initial predefined keyword list. Instead, semantic analysis of the category names with the aid of existing external knowledge sources (including latent semantic analysis for extracting co-occurring keywords) is used to form the initial training data. Because existing external knowledge sources can cover multiple domains, the method can readily be applied to many different kinds of domains and document sets, with greatly reduced additional labeling effort, when the domain or the document set changes.

Moreover, the mechanism provided for forming the final classifier can reduce the risk that noisy data in iterative classifier learning biases the classifier excessively, particularly for discriminative classifiers (for example, SVM (support vector machine) and logistic regression). This too is an important contribution of the present invention to improving the accuracy of the final document classification result.

Specific embodiments of the present invention have been described above with reference to the accompanying drawings. However, the present invention is not limited to the specific configurations and processes shown in the drawings. In the above embodiments, several specific steps are shown and described as examples, but the method of the present invention is not limited to these steps. Those skilled in the art will understand that the steps can be changed, modified, or supplemented, and that the order of some steps can be changed, without departing from the spirit and essential features of the present invention.

The elements of the present invention can be implemented in hardware, software, firmware, or a combination thereof, and can be used in systems, subsystems, components, or subcomponents. When implemented in software, the elements of the invention are programs or code segments for performing the necessary tasks. The programs or code segments can be stored in a computer-readable medium or transmitted by a data signal embodied in a carrier wave over a transmission cable or communication link. Computer-readable media include any medium that can store or transfer information. Examples of computer-readable media include electronic circuits, semiconductor memory devices, ROM, flash memory, erasable ROM (EROM), flexible disks, CD-ROMs, optical disks, hard disks, optical fiber media, and radio frequency (RF) links. Code segments can also be downloaded via computer networks such as the Internet or an intranet.

Although the present invention has been described above with reference to specific embodiments, it is not limited to those embodiments or to the specific configurations shown in the drawings. For example, several of the components shown may be combined into a single component, a single component may be divided into several subcomponents, and other known components may be added. The operation process is likewise not limited to that shown in the examples. Those skilled in the art will understand that the present invention can be implemented in other specific forms without departing from its spirit and essential features. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive. The scope of the invention is indicated by the appended claims rather than by the foregoing description, and all changes that come within the meaning and range of equivalency of the claims are included in the scope of the present invention.

100: Document classification system
10: Classifier generation subsystem
101: Acquisition means
102: Rough categorization means
103: Adjustment/generation means
104: Clustering result base
105: Document base
107: Document clustering means
108: Document classification result base
106: Final classifier
301: Prior probability calculation unit
302: Matching unit
401A: Training data generation unit
402: Learning unit
401B: Training data generation unit
403: Category name base
404: External knowledge source
501: Category name acquisition unit
502: Word sense disambiguation unit
503: Keyword generation unit
504: Classification unit
505: Training data generation unit
601: Search unit
602: Category labeling unit
8011: Prior probability calculation unit
8012: Matching unit
802: Training data selection unit
803: Classifier

Claims

A classifier generation method performed by a computer constituting a classifier generation system, the method comprising:
a rough categorization step of generating a rough categorization result for an object set to obtain a rough classifier; and
an adjustment/generation step of adjusting the rough categorization result with a clustering result for the object set to generate a final classifier,
wherein, in the rough categorization step,
the rough categorization result is generated by any one of supervised document classification, semi-supervised document classification, and unsupervised document classification,
wherein the adjustment/generation step includes:
a prior probability calculation step of calculating a prior probability corresponding to the rough categorization result; and
a matching step of matching the rough categorization result against the clustering result based on Bayesian inference using the prior probability, and generating the final classifier,
wherein the final classifier is a Bayesian classifier in supervised document classification,
wherein the rough categorization step includes:
a training data generation step of acquiring training data; and
a learning step of learning the rough classifier from the training data,
wherein the training data is automatically generated by:
a category name acquisition step of acquiring a category name, which is a name of a classification indicating a property of the object set;
a keyword generation step of generating related keywords based on the category name;
a classification step of generating an intermediate categorization result from the keywords; and
a training data generation step of generating the training data from the intermediate categorization result,
wherein the keywords are used as a representative profile,
wherein the classification step includes:
a search step of searching the object set using the representative profile as query terms; and
a step of labeling objects in the hit list obtained as a search result with the corresponding category, and
wherein the training data generation step includes:
calculating a prior probability corresponding to the intermediate categorization result; and
matching the intermediate categorization result against the clustering result based on Bayesian inference using the prior probability, thereby generating a classifier for use in training data generation.
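
As an illustration only, the matching of a categorization result against a clustering result by Bayesian inference can be sketched as follows. The claim does not fix how the probabilities are estimated, so this sketch assumes simple count-based estimates: the prior P(c) is the share of objects the rough result assigns to category c, and the likelihood P(k | c) is the share of those objects that fall in cluster k. The function name `match_with_clusters` and the dictionary-based data layout are hypothetical and do not come from the patent.

```python
from collections import Counter

def match_with_clusters(rough, clusters):
    """Re-label objects by combining a rough result with a clustering result.

    rough:    {object_id: category} from the rough classifier
    clusters: {object_id: cluster_id} from the clustering step
    Assumption: probabilities are estimated from simple counts.
    """
    n = len(rough)
    cat_size = Counter(rough.values())
    # Prior P(c): fraction of objects the rough result assigns to category c.
    prior = {c: size / n for c, size in cat_size.items()}
    # Joint counts of (category, cluster) pairs, used for P(k | c).
    joint = Counter((rough[d], clusters[d]) for d in rough)

    adjusted = {}
    for d in rough:
        k = clusters[d]
        # Unnormalised posterior: P(c | k) is proportional to P(c) * P(k | c).
        posterior = {c: prior[c] * joint[(c, k)] / cat_size[c] for c in prior}
        # Sorting the categories makes tie-breaking deterministic.
        adjusted[d] = max(sorted(posterior), key=posterior.get)
    return adjusted
```

With these count-based estimates the posterior reduces to the share of each category inside a cluster, so every object receives the dominant category of its cluster. A real implementation would likely use smoother estimates.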

The classifier generation method according to claim 1, wherein the training data is manually labeled training data.

The classifier generation method according to claim 1, wherein the keyword generation step further includes:
performing word sense disambiguation on the acquired category name with reference to an external knowledge source; and
generating the related keywords based on the category name after word sense disambiguation.

The classifier generation method according to claim 1, wherein a predetermined number of top-ranked objects in the hit list are labeled with the corresponding category.
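
The search and labeling steps recited in the claims above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the keyword-count ranking, the rule for resolving an object that appears in several hit lists, and the names `search` and `label_top_hits` are all hypothetical.

```python
from collections import Counter

def search(documents, profile):
    """Return (doc_id, score) hits for a representative profile, best first.

    Assumption: the score is the number of profile keyword occurrences.
    """
    hits = []
    for doc_id, text in documents.items():
        tokens = Counter(text.lower().split())
        score = sum(tokens[keyword] for keyword in profile)
        if score > 0:
            hits.append((doc_id, score))
    # Highest score first; ties are broken by document id for determinism.
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))

def label_top_hits(documents, profiles, top_n):
    """Label the top_n hits of each category's query with that category."""
    best = {}  # doc_id -> (score, category)
    for category, profile in profiles.items():
        for doc_id, score in search(documents, profile)[:top_n]:
            # Assumption: a document hit by several categories keeps the
            # category with the higher score.
            if doc_id not in best or score > best[doc_id][0]:
                best[doc_id] = (score, category)
    return {doc_id: category for doc_id, (score, category) in best.items()}
```

Objects that match no profile stay unlabeled, so the output is a partial labeling that can serve as automatically generated training data.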

The classifier generation method according to claim 1, wherein, in the adjustment/generation step,
iterative classifier learning is performed using the training data generated in the course of generating the rough categorization result and the rough classifier as initial training data and an initial classifier, respectively,
wherein, in each cycle of the iterative classifier learning:
an intermediate classifier for the current cycle is learned from the training data generated in the previous cycle;
the object set is classified using the learned intermediate classifier for the current cycle to obtain an intermediate categorization result for the current cycle; and
the intermediate categorization result for the current cycle is adjusted with the clustering result to generate training data for the next cycle, and
wherein, when an iteration end condition is satisfied, the classifier that most closely matches the clustering result is selected as the final classifier from among the intermediate classifiers learned in the respective cycles.
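
The iterative learning cycle recited above can be sketched as follows. Everything concrete here is an assumption made for illustration: the toy word-count learner stands in for the Bayesian classifier, the adjustment is a cluster-wise majority vote, the end condition is "training data stopped changing or a cycle limit was reached", and all function names are hypothetical.

```python
from collections import Counter, defaultdict

def learn(training, docs):
    """Toy learner (assumption): per-category word counts from labeled docs."""
    model = defaultdict(Counter)
    for doc_id, category in training.items():
        model[category].update(docs[doc_id].split())
    return model

def classify(model, docs):
    """Assign each document the category whose word counts overlap it most."""
    return {
        doc_id: max(sorted(model),
                    key=lambda c: sum(model[c][w] for w in text.split()))
        for doc_id, text in docs.items()
    }

def adjust(result, clusters):
    """Assumed adjustment: give every document its cluster's majority category."""
    votes = defaultdict(Counter)
    for doc_id, category in result.items():
        votes[clusters[doc_id]][category] += 1
    return {d: votes[clusters[d]].most_common(1)[0][0] for d in result}

def agreement(result, clusters):
    """Fraction of documents whose label already equals the adjusted label."""
    adjusted = adjust(result, clusters)
    return sum(result[d] == adjusted[d] for d in result) / len(result)

def iterate(initial_training, docs, clusters, max_cycles=5):
    """Run the learn / classify / adjust cycle and keep the best classifier."""
    training, candidates = dict(initial_training), []
    for _ in range(max_cycles):
        model = learn(training, docs)        # learn from the previous cycle's data
        result = classify(model, docs)       # intermediate categorization result
        candidates.append((agreement(result, clusters), model))
        new_training = adjust(result, clusters)  # training data for the next cycle
        if new_training == training:         # assumed end condition
            break
        training = new_training
    # Select the intermediate classifier that best matches the clustering.
    return max(candidates, key=lambda candidate: candidate[0])[1]
```

Keeping every intermediate classifier and choosing among them at the end, rather than returning the last one, mirrors the selection step in the claim.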

A classifier generation system comprising:
rough categorization means for generating a rough categorization result for an object set to obtain a rough classifier; and
adjustment/generation means for adjusting the rough categorization result with a clustering result for the object set to generate a final classifier,
wherein the rough categorization means
generates the rough categorization result by any one of supervised document classification, semi-supervised document classification, and unsupervised document classification,
wherein the adjustment/generation means includes:
prior probability calculation means for calculating a prior probability corresponding to the rough categorization result; and
matching means for matching the rough categorization result against the clustering result based on Bayesian inference using the prior probability, and generating the final classifier,
wherein the final classifier is a Bayesian classifier in supervised document classification,
wherein the rough categorization means includes:
a training data generation unit for acquiring training data;
a learning unit that learns the rough classifier from the training data; and
a category name base for storing a domain related to a category name, which is a name of a classification indicating a property of the object set,
wherein the training data generation unit automatically generates the training data and includes:
a category name acquisition unit for acquiring the category name related to the object set;
a keyword generation unit that generates related keywords based on the category name;
a classification unit for generating an intermediate categorization result from the keywords; and
a training data generation unit for acquiring the training data from the intermediate categorization result,
wherein the keywords are used as a representative profile,
wherein the classification unit includes:
a search unit that searches the object set using the representative profile as query terms; and
a category labeling unit that labels objects in the hit list obtained as a search result with the corresponding category, and
wherein the training data generation unit includes:
a unit that calculates a prior probability corresponding to the intermediate categorization result; and
a unit that generates a classifier for use in training data generation by matching the intermediate categorization result against the clustering result based on Bayesian inference using the prior probability.

The classifier generation system according to claim 6, wherein the training data generation unit obtains manually labeled training data from an external source.

The classifier generation system according to claim 6, further comprising an external knowledge source for storing knowledge about the category name,
wherein the training data generation unit includes a word sense disambiguation unit that performs word sense disambiguation on the acquired category name with reference to the external knowledge source, and
wherein the training data generation unit generates the related keywords based on the category name after word sense disambiguation.

The classifier generation system according to claim 6, wherein the category labeling unit labels a predetermined number of top-ranked objects in the hit list with the corresponding category.

The classifier generation system according to claim 1, wherein the adjustment/generation means
performs iterative classifier learning using the training data generated in the course of generating the rough categorization result and the rough classifier as initial training data and an initial classifier, respectively,
wherein, in each cycle of the iterative classifier learning, the adjustment/generation means:
learns an intermediate classifier for the current cycle from the training data generated in the previous cycle;
classifies the object set using the learned intermediate classifier for the current cycle to obtain an intermediate categorization result for the current cycle; and
adjusts the intermediate categorization result for the current cycle with the clustering result to generate training data for the next cycle, and
wherein, when an iteration end condition is satisfied, the classifier that most closely matches the clustering result is selected as the final classifier from among the intermediate classifiers learned in the respective cycles.