JPH1139313A

JPH1139313A - Automatic document classification system, document classification oriented knowledge base creating method and record medium recording its program

Info

Publication number: JPH1139313A
Application number: JP9198113A
Authority: JP
Inventors: Takefumi Yamazaki; 毅文山崎
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1997-07-24
Filing date: 1997-07-24
Publication date: 1999-02-12

Abstract

PROBLEM TO BE SOLVED: To resolve the polysemy of semantic categories and further improve accuracy by selecting and using only the optimal semantic category of each word when learning data is generated. SOLUTION: A feature extraction mechanism 10 analyzes the classification-tagged document set in a file 20, looks up the semantic categories of words, and outputs the extracted feature vector, together with the classification category to which each text belongs, to a pre-disambiguation learning data file 22. A disambiguation mechanism 12 takes the file 22 and a relevance data file 23 as input and outputs the disambiguation result to a post-disambiguation learning data file 24. A knowledge base generation mechanism 13 for document classification takes the file 24 as input and outputs the weights between features and classification categories to a knowledge base 25 for document classification. A classification processing mechanism 14 takes a thesaurus dictionary 21 and the knowledge base 25 as input and outputs the classification tag corresponding to a new text entered by a user.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to the technical field of natural language processing, and more specifically to an automatic document classification system, a method of generating a knowledge base for document classification from a classification-tagged document set, and a recording medium on which a program for the method is recorded.

[0002]

[Prior Art] Document classification is the task of assigning a text, on the basis of its semantic content, to predetermined classification categories (for example, politics, sports, or economy). Now that large volumes of electronic text are in circulation, it has become an important issue for the efficient retrieval and use of text. An automatic document classification system automates this work. Such a system first builds a suitable classifier (a knowledge base for document classification) from texts that have already been categorized (referred to here as a "classification-tagged document set"), and then uses that classifier to categorize newly input text.

[0003] A conventional automatic document classification system extracts, from a classification-tagged document set, the words that serve as keys for classification, learns the degree of association (the strength of the link) between those words and the classification categories assigned in advance, and generates a classifier. Words that characterize a field appear frequently in texts of the corresponding classification category, so the degree of association between such a word and that category is large, while the degree of association between other words and the category is small. The classifier consists of words, classification categories, and links representing the degree of association between each word and each classification category.

[0004] FIG. 2 shows an example of a classifier. As shown in FIG. 2, the feature nodes are words that appear frequently in the texts, the classification nodes are the classification categories of the document set, and the numerical value on the link between a feature node and a classification node represents the degree of association. Various methods, such as chi-square calculation, can be used to learn the degree of association.

[0005] When a new text is input, the words appearing in the text are first extracted, and the degree of association between the text and each classification category is then calculated from the degrees of association between words and classification tags in the classifier. The categories whose value exceeds a predetermined threshold, or the category with the largest value, are judged to be the classification categories of the text.
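The scoring step of a conventional word-based classifier can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular system: the link table `LINKS`, its values, and the function names are hypothetical, and the text-level score is assumed to be the plain sum of the word-level degrees of association.

```python
from collections import defaultdict

# Hypothetical classifier links: (word, category) -> degree of association.
LINKS = {
    ("election", "politics"): 8.0,
    ("party", "politics"): 6.5,
    ("match", "sports"): 9.0,
    ("player", "sports"): 7.0,
    ("party", "sports"): 0.5,
}


def category_scores(words, links):
    """Sum, per category, the association of every linked word found in the text."""
    scores = defaultdict(float)
    for (word, category), relevance in links.items():
        if word in words:
            scores[category] += relevance
    return dict(scores)


def classify(words, links, threshold=None):
    """Return every category scoring above `threshold`, or the best one if it is None."""
    scores = category_scores(set(words), links)
    if not scores:
        return []  # no known word: the unknown-word problem of word-only features
    if threshold is not None:
        return sorted(c for c, s in scores.items() if s > threshold)
    return [max(scores, key=scores.get)]
```

A text made up only of words missing from the link table receives no category at all, which is the weakness of word-only features.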

[0006]

[Problems to be Solved by the Invention] With the prior art described above, a classifier can be generated automatically from a classification-tagged document set. However, the feature nodes are limited to words that appear in that document set, so when a newly input text to be classified contains unknown words that are not among the feature nodes, the classification accuracy for that text deteriorates. One way to solve this problem is to generate feature nodes not only from words but also from the semantic categories of words obtained from a thesaurus. A semantic category is a more general concept than a word, and several words usually share a common semantic category, so generating feature nodes from semantic categories can solve the unknown-word problem.

[0007] On the other hand, a word does not necessarily have only one semantic category in a thesaurus. For example, the word "route" (路線) is used in the sense of "policy" in the field of politics and in the sense of "path" in the field of transportation. Such a polysemous word has several semantic categories in the thesaurus. If, in the learning stage that generates the classifier, feature nodes are generated with a thesaurus and all the semantic categories obtained from a polysemous word are used without narrowing them down according to context, the degree of association between feature nodes and classification nodes is calculated incorrectly and classification accuracy may deteriorate. When semantic categories from a thesaurus are used as feature nodes, the semantic category must be identified accurately according to context; that is, disambiguation is required.

[0008] An object of the present invention is to provide an automatic document classification system that resolves the polysemy of semantic categories and achieves higher accuracy, a method of generating its knowledge base for document classification, and a recording medium on which a program for the method is recorded.

[0009]

[Means for Solving the Problems] The automatic document classification system of the present invention comprises: means for receiving a classification-tagged document set and a thesaurus dictionary as input and generating learning data composed of the words that characterize each document, their semantic categories, and the classification category of the document; means for calculating the degree of association between classification categories and features from the generated learning data; means for receiving the calculated degree of association and the learning data as input and generating new learning data by removing inappropriate semantic categories from the learning data; means for receiving the generated new learning data as input, calculating the degree of association between classification categories and features, and generating a knowledge base for document classification; and means for receiving an unclassified document as input and outputting the corresponding classification category on the basis of the knowledge base.

[0010] The method of generating a knowledge base for document classification according to the present invention receives a classification-tagged document set and a thesaurus dictionary as input and generates learning data composed of the words that characterize each document, their semantic categories, and the classification category of the document; calculates, for the learning data, the degree of association between classification categories and features; generates new learning data by removing inappropriate features from the learning data on the basis of the degree of association; and generates a knowledge base for document classification by calculating, from the new learning data, the weights between classification categories and features.

[0011] The computer-readable recording medium of the present invention is characterized by including: a process of reading a classification-tagged document set; a process of analyzing the classification-tagged document set, extracting the words that characterize each document, looking up the semantic categories of those words in a thesaurus dictionary, and generating learning data composed of a feature vector consisting of the words and semantic categories and the classification category of the document; a process of calculating, for the learning data, the degree of association between classification categories and features; a process of generating new learning data by removing inappropriate features from the learning data on the basis of the degree of association; and a process of generating a knowledge base for document classification by calculating, from the new learning data, the weights between classification categories and features.

[0012] Semantic categories can be disambiguated by using the classification category information of the text together with the degrees of association, obtained beforehand, between features and the relevant classification categories. This relies on the property that "a semantic category is also generated from other words related to the classification category to which the target text belongs." For example, in a certain thesaurus the polysemous word "route" (路線) has two semantic categories, "S traffic route" (Ｓ交通路) and "S situation" (Ｓ形勢). Here the letter S is prefixed to a semantic category to distinguish it from a word. "S situation" is also generated from the words "trend" and "state of emergency", which appear frequently when the classification category is "politics", and "S traffic route" is also generated from the words "Shinkansen" and "Tokaido", which appear frequently when the classification category is "traffic". Consequently, texts having the classification categories "politics" and "traffic" should reveal that "S traffic route" is strongly associated with "traffic" and that "S situation" is strongly associated with "politics". Using these degrees of association, when the word "route" appears in a target text, "S situation" can be selected as its semantic category if the classification category of the text is "politics", and "S traffic route" if it is "traffic".

[0013] Thus, by incorporating into a thesaurus-based automatic document classification system a processing program that selects the optimal semantic category, that is, one that resolves polysemy, noise in the learning data is reduced and a more accurate knowledge base for document classification can be generated.

[0014]

[Embodiments of the Invention] An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram of an automatic document classification system according to an embodiment of the present invention. The system consists of an apparatus main body 1 and a group of external storage devices. The apparatus main body 1 comprises a feature extraction mechanism 10, a relevance calculation mechanism 11, a disambiguation mechanism 12, a knowledge base generation mechanism 13 for document classification, and a classification processing mechanism 14 that outputs classification labels for unclassified documents. The external storage devices store a classification-tagged document file 20, a thesaurus dictionary 21, a pre-disambiguation learning data file 22, a relevance data file 23, a post-disambiguation learning data file 24, a knowledge base 25 for document classification, and so on. The apparatus main body 1 is a so-called computer consisting of a CPU, RAM, a built-in hard disk, and the like.

[0015] In the configuration of FIG. 1, the feature extraction mechanism 10 takes the classification-tagged document file 20 and the thesaurus dictionary 21 as input and outputs the features representing each document, together with its classification category information, to the pre-disambiguation learning data file 22. The relevance calculation mechanism 11 takes the pre-disambiguation learning data file 22 as input and outputs the degrees of association between features and classification categories to the relevance data file 23. The disambiguation mechanism 12 takes the pre-disambiguation learning data file 22 and the relevance data file 23 as input and outputs the disambiguation result to the post-disambiguation learning data file 24. The knowledge base generation mechanism 13 for document classification takes the post-disambiguation learning data file 24 as input and outputs the recalculated degrees of association between features and classification categories to the knowledge base 25 for document classification. The classification processing mechanism 14 takes the knowledge base 25 for document classification and a new text entered by the user as input and outputs the classification category corresponding to the input text. The configuration and operation of mechanisms 10, 11, 12, 13, and 14 are described in detail below.

[0016] <Feature extraction mechanism 10> The feature extraction mechanism 10 analyzes the classification-tagged document set supplied as input from the file 20, extracts the words whose part of speech is noun or proper noun, and then looks up the semantic categories of those words in the thesaurus dictionary 21. From each text it extracts a feature vector consisting of the words contained in the text and their semantic categories, and it outputs data consisting of that feature vector and the classification category to which the text belongs to the pre-disambiguation learning data file 22.

[0017] FIG. 3 shows the configuration of an embodiment of the feature extraction mechanism 10. The feature extraction mechanism 10 consists of a morphological analysis unit 101, a word extraction unit 102, and a semantic category assignment unit 103.

[0018] The morphological analysis unit 101 performs morphological analysis on the text input from the file 20, segmenting it into words and tagging each word with its part of speech. The word extraction unit 102 selects the words with the required parts of speech (nouns and proper nouns). The semantic category assignment unit 103 looks up each word selected by the word extraction unit 102 in the thesaurus dictionary 21 in use, assigns its semantic categories, and outputs to the file 22 data (the pre-disambiguation learning data) consisting of the feature vector of words and semantic categories and the classification category to which the text belongs.
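The work of units 102 and 103 can be sketched as follows. This is a minimal illustration under stated assumptions: morphological analysis (unit 101) is taken as already done, so the input is a list of (word, part-of-speech) pairs; the toy thesaurus, the part-of-speech labels, the record layout, and the word "political party" as the source of "S party" and "S organization" are all hypothetical.

```python
# Hypothetical toy thesaurus: word -> semantic categories.
# A polysemous word maps to more than one category.
THESAURUS = {
    "route": ["traffic route", "situation"],
    "political party": ["party", "organization"],  # hypothetical entry
    "Shinkansen": ["traffic route"],
}

KEPT_POS = {"noun", "proper_noun"}  # hypothetical part-of-speech labels


def extract_features(tagged_tokens, thesaurus):
    """Select nouns and proper nouns, then add their "S"-prefixed categories."""
    features = []
    for surface, pos in tagged_tokens:
        if pos not in KEPT_POS:
            continue
        if surface not in features:
            features.append(surface)
        for category in thesaurus.get(surface, []):
            tagged = "S " + category  # the "S" prefix marks a semantic category
            if tagged not in features:
                features.append(tagged)
    return features


def make_record(tagged_tokens, category, thesaurus):
    """Build one pre-disambiguation learning record: features plus class label."""
    return {
        "features": extract_features(tagged_tokens, thesaurus),
        "category": category,
    }
```

At this stage a polysemous word such as "route" still carries all of its semantic categories; narrowing them down is left to the disambiguation mechanism 12.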

[0019] FIG. 4 shows a concrete processing example for the feature extraction mechanism 10. The letter "S" is prefixed to each semantic category to distinguish it from a word. For example, if the word "route" has the two semantic categories "traffic route" and "situation" in the thesaurus dictionary 21, "S traffic route" and "S situation" are added. The same applies to "S party" and "S organization".

[0020] <Relevance calculation mechanism 11> The relevance calculation mechanism 11 takes the pre-disambiguation learning data file 22 as input and outputs the degrees of association between features and classification categories to the relevance data file 23.

[0021] FIG. 5 shows the configuration of an embodiment of the relevance calculation mechanism 11. The relevance calculation mechanism 11 consists of a case counter module 111, a chi-square calculation module 112, and a relevance storage module 113.

[0022] The case counter module 111 takes the pre-disambiguation learning data file 22 as input and, for each classification category and each feature, counts the following four values: among the texts in which the feature appears, the number of texts that have the classification category, "Nr+", and the number that do not, "Nn+"; and among the texts in which the feature does not appear, the number of texts that have the classification category, "Nr−", and the number that do not, "Nn−".
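The four counts can be gathered in a single pass over the learning data. The sketch below assumes a hypothetical record layout (a dict with a "features" collection and one "category" label per text); the function name is likewise illustrative.

```python
def count_cases(records, feature, category):
    """Return (Nr+, Nn+, Nr-, Nn-) for one feature/category pair.

    Nr+ / Nn+: texts containing the feature that have / lack the category.
    Nr- / Nn-: texts without the feature that have / lack the category.
    """
    nr_plus = nn_plus = nr_minus = nn_minus = 0
    for record in records:
        has_feature = feature in record["features"]
        has_category = record["category"] == category
        if has_feature and has_category:
            nr_plus += 1
        elif has_feature:
            nn_plus += 1
        elif has_category:
            nr_minus += 1
        else:
            nn_minus += 1
    return nr_plus, nn_plus, nr_minus, nn_minus
```

The four values always add up to the total number of texts, which is the N used by the chi-square calculation module 112.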

[0023] The chi-square calculation module 112 uses these four values to calculate the chi-square value (the degree of association) according to the following equation, where N denotes the total number of cases.

[0024]

[Equation 1]
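The equation itself is not reproduced in this text. As an assumption, not a transcription of the original, the standard Pearson chi-square statistic for a 2x2 contingency table, written with the four counts defined in [0022], would read:

```latex
\chi^2 = \frac{N \left( N_{r+} N_{n-} - N_{n+} N_{r-} \right)^2}
              {\left( N_{r+} + N_{n+} \right) \left( N_{r-} + N_{n-} \right)
               \left( N_{r+} + N_{r-} \right) \left( N_{n+} + N_{n-} \right)},
\qquad N = N_{r+} + N_{n+} + N_{r-} + N_{n-}
```

The value is 0 when the feature and the category are independent in the counts and grows as their co-occurrence departs from independence.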

[0025] The relevance storage module 113 stores the results computed by the chi-square calculation module 112 (the chi-square value for each classification category and each feature) in the relevance data file 23.
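Modules 112 and 113 together amount to turning per-pair counts into a lookup table of chi-square values. The sketch below assumes the standard 2x2 Pearson chi-square statistic, since the equation is not reproduced in this text; the in-memory dict standing in for the relevance data file 23 and the function names are hypothetical.

```python
def chi_square(nr_plus, nn_plus, nr_minus, nn_minus):
    """Pearson chi-square for a 2x2 table of feature presence vs. category."""
    n = nr_plus + nn_plus + nr_minus + nn_minus
    denominator = (
        (nr_plus + nn_plus) * (nr_minus + nn_minus)
        * (nr_plus + nr_minus) * (nn_plus + nn_minus)
    )
    if denominator == 0:
        return 0.0  # an empty row or column: no association can be measured
    return n * (nr_plus * nn_minus - nn_plus * nr_minus) ** 2 / denominator


def build_relevance_table(counts):
    """Map each (feature, category) pair to its chi-square value."""
    return {pair: chi_square(*four) for pair, four in counts.items()}
```

For instance, a feature that appears in all 10 texts of a category and in none of the 10 other texts receives the maximum value for 20 cases, while a feature spread evenly across categories receives 0.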

[0026] <Disambiguation mechanism 12> The disambiguation mechanism 12 takes the pre-disambiguation learning data file 22 and the relevance data file 23 as input and outputs the disambiguation result to the post-disambiguation learning data file 24.

[0027] FIG. 6 shows the configuration of an embodiment of the disambiguation mechanism 12. The disambiguation mechanism 12 consists of a polysemous word selection module 121, a relevance reference module 122, and an optimal semantic category selection module 123.

[0028] The polysemous word selection module 121 takes the pre-disambiguation learning data file 22 as input, finds polysemous words to which more than one semantic category has been assigned, and outputs their semantic categories together with the classification category to which the text belongs. The relevance reference module 122 takes the relevance data file 23 as input and outputs the degrees of association corresponding to the semantic categories and the classification category obtained by module 121. The optimal semantic category selection module 123 selects, on the basis of the degrees of association obtained by module 122, the semantic category with the largest degree of association as the optimal semantic category and outputs it to the post-disambiguation learning data file 24.
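The three modules can be sketched as one function over a single learning record. This is an illustration under stated assumptions: the record layout, the `senses_by_word` mapping (which semantic categories each word contributed), and the relevance values used in the usage note are hypothetical.

```python
def disambiguate(record, senses_by_word, relevance):
    """Keep, for each polysemous word, only its most relevant semantic category.

    record:         {"features": [...], "category": str}
    senses_by_word: word -> list of "S"-prefixed categories it produced
    relevance:      (semantic category, classification category) -> value
    """
    category = record["category"]
    features = list(record["features"])
    for word, senses in senses_by_word.items():
        present = [s for s in senses if s in features]
        if word not in features or len(present) < 2:
            continue  # module 121: only words with several categories matter
        # Modules 122 and 123: look up each sense and keep the strongest one.
        best = max(present, key=lambda s: relevance.get((s, category), 0.0))
        features = [f for f in features if f not in present or f == best]
    return {"features": features, "category": category}
```

With hypothetical values such as 12.0 for ("S situation", "politics") and 0.4 for ("S traffic route", "politics"), a "politics" record containing "route" keeps only "S situation". One simplification to note: a discarded category is removed from the record even if another word in the same text also produced it.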

[0029] For example, in the case of FIG. 4, of the two semantic categories "S traffic route" and "S situation" corresponding to the polysemous word "route", "S situation" is selected as the optimal semantic category.

[0030] <Knowledge base generation mechanism 13 for document classification> The knowledge base generation mechanism 13 for document classification takes the post-disambiguation learning data file 24 as input and outputs the weights between features and classification tags to the knowledge base 25 for document classification. As one example of an embodiment, document classification is assumed here to be based on a linear classification model, with an error-driven learning algorithm used to learn the weights of the linear model. A linear classification model consists of feature nodes and classification category nodes. Whether an input case belongs to a classification category is decided by whether the sum of the weights between the feature nodes present in the input case and that classification category exceeds a predetermined threshold. FIG. 7 shows an example of a linear classification model. The linear classification model and error-driven learning algorithms are described, for example, in N. Littlestone, "Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm", Machine Learning, No. 2, pp. 285-318, 1988.

[0031] FIG. 8 shows the configuration of an embodiment of the knowledge base generation mechanism 13 for document classification that uses an error-driven learning algorithm. The knowledge base generation mechanism 13 consists of a weight initial value setting module 131, a weight summation module 132, a correct answer determination module 133, and a weight update module 134.
Reference numeral 3 includes a weight initial value setting module 131, a weight summation module 132, a correct answer determination module 133, and a weight update module 134.

[0032] The weight initial value setting module 131 takes the post-disambiguation learning data file 24 as input, initializes the weights between the features appearing in the learning data and the classification categories to a certain value, and outputs them to the knowledge base 25 for document classification. The weight summation module 132 takes the post-disambiguation learning data file 24 and the knowledge base 25 for document classification as input and, for each input case, calculates the score for each classification category as the sum of the weights of the features appearing in the case. The correct answer determination module 133 determines whether the score calculated by the preceding module exceeds a predetermined threshold and assigns classification categories accordingly. The weight update module 134 updates the weights between features and classification categories only when the classification category assigned by the preceding module differs from the correct classification category, and outputs them to the knowledge base 25 for document classification.
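The loop formed by modules 131 to 134 can be sketched as follows. The text does not state the update rule, so the multiplicative promotion and demotion used here, in the style of the Winnow algorithm from the cited Littlestone paper, is an assumption; the parameters `threshold`, `alpha`, `initial_weight`, and `max_epochs`, along with the record layout, are hypothetical.

```python
def train_error_driven(records, categories, threshold=1.5, alpha=2.0,
                       initial_weight=1.0, max_epochs=10):
    """Learn (feature, category) weights, changing them only on mistakes."""
    # Module 131: every feature/category link starts from the same value.
    weights = {
        (feature, category): initial_weight
        for record in records
        for feature in record["features"]
        for category in categories
    }
    for _ in range(max_epochs):
        mistakes = 0
        for record in records:
            features = set(record["features"])
            for category in categories:
                # Module 132: the score is the sum of the active weights.
                score = sum(weights[(f, category)] for f in features)
                # Module 133: compare the score with the threshold.
                predicted = score > threshold
                actual = record["category"] == category
                if predicted == actual:
                    continue
                # Module 134: promote on a miss, demote on a false alarm.
                factor = alpha if actual else 1.0 / alpha
                for f in features:
                    weights[(f, category)] *= factor
                mistakes += 1
        if mistakes == 0:
            break  # every training case is now classified correctly
    return weights
```

Because weights change only when a case is misclassified, features that never contribute to an error keep their initial value.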

[0033] <Classification processing mechanism 14> The classification processing mechanism 14 takes the thesaurus dictionary 21, the knowledge base 25 for document classification, and a new text entered by the user as input and outputs the classification tag corresponding to the input text.

[0034] FIG. 9 shows the configuration of an embodiment of the classification processing mechanism 14. The classification processing mechanism 14 consists of a feature extraction module 141, a weight summation module 142, and a classification category generation module 143.

[0035] The feature extraction module 141 performs morphological analysis on the new unclassified text entered by the user, selects the words that are nouns or proper nouns, assigns their semantic categories by referring to the thesaurus dictionary 21, and generates a feature vector. Module 141 is basically the same as the feature extraction mechanism 10 described above. The weight summation module 142 takes the feature vector generated by module 141 and the knowledge base 25 for document classification as input and calculates the score for each classification category. The classification category generation module 143 outputs the classification categories whose score, as calculated by the preceding module, is equal to or greater than a predetermined threshold.
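Classification of a new text can be sketched as below. The point of the example is that a word absent from the knowledge base can still contribute through its semantic category. It assumes the words have already been selected by part of speech; the thesaurus entries, the weights, and the word "local line" are hypothetical.

```python
def classify_text(words, thesaurus, weights, categories, threshold):
    """Return the categories whose summed feature weights reach the threshold."""
    # Module 141: words plus their "S"-prefixed semantic categories.
    features = set()
    for word in words:
        features.add(word)
        features.update("S " + c for c in thesaurus.get(word, []))
    result = []
    for category in categories:
        # Module 142: features missing from the knowledge base contribute 0.
        score = sum(weights.get((f, category), 0.0) for f in features)
        # Module 143: keep the categories at or above the threshold.
        if score >= threshold:
            result.append(category)
    return result
```

If "local line" never occurred in the learning data but the thesaurus maps it to "traffic route", the weight learned for "S traffic route" alone can place the text in "traffic".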

[0036] The automatic document classification system according to an embodiment of the present invention has been described above. In FIG. 1, the processing programs of the feature extraction mechanism 10, the relevance calculation mechanism 11, the disambiguation mechanism 12, and the knowledge base generation mechanism 13 for document classification, all of which are involved in generating the knowledge base for document classification, may be combined into one. FIG. 10 shows the resulting processing flow; processes 1010 to 1013 correspond to mechanisms 10 to 13 of FIG. 1. By recording the program for processes 1010 to 1013 on a recording medium such as a CD-ROM in advance and loading it into a computer, the same processing as that of mechanisms 10 to 13 of FIG. 1 described above is realized.

[0037]

[Effects of the Invention] As described above, according to the present invention, when the learning data used to generate a classifier (a knowledge base for document classification) is created from a classification-tagged document set and a thesaurus dictionary, only the optimal semantic category among those a word has is selected and used. A more accurate classifier can therefore be generated than with the conventional approach, which performs no disambiguation.

[Brief description of the drawings]

【図１】本発明の一実施例の文書自動分類システムの全
体構成図である。FIG. 1 is an overall configuration diagram of an automatic document classification system according to an embodiment of the present invention.

【図２】分類器の一例を示す図である。FIG. 2 is a diagram illustrating an example of a classifier.

【図３】図１の特徴抽出機構の一実施例の構成図であ
る。FIG. 3 is a configuration diagram of one embodiment of a feature extraction mechanism of FIG. 1;

【図４】特徴抽出機構の具体的処理例を示す図である。FIG. 4 is a diagram showing a specific processing example of a feature extraction mechanism.

【図５】図１の関連度計算機構の一実施例の構成図であ
る。FIG. 5 is a configuration diagram of an embodiment of an association degree calculation mechanism of FIG. 1;

【図６】図１の多義性解消機構の一実施例の構成図であ
る。FIG. 6 is a configuration diagram of an embodiment of the disambiguation mechanism of FIG. 1;

【図７】線形分類モデルの一例を示す図である。FIG. 7 is a diagram illustrating an example of a linear classification model.

FIG. 8 is a configuration diagram of an embodiment of the knowledge base generation mechanism for document classification of FIG. 1.

FIG. 9 is a configuration diagram of an embodiment of the classification processing mechanism of FIG. 1.

FIG. 10 is a schematic processing flowchart of the method for generating a knowledge base for document classification according to the present invention.

[Explanation of symbols]

1 Automatic document classification apparatus main body
2 External storage device group
10 Feature extraction mechanism
11 Relevance calculation mechanism
12 Disambiguation mechanism
13 Knowledge base generation mechanism for document classification
14 Classification processing mechanism
20 Document file with classification tags
21 Thesaurus dictionary
22 Learning data file before disambiguation
23 Relevance data file
24 Learning data file after disambiguation
25 Knowledge base for document classification

Claims

[Claims]

1. An automatic document classification system using a thesaurus dictionary, comprising: means for receiving a document set with classification tags and a thesaurus dictionary as input, and generating learning data composed of words characterizing each document, their semantic categories, and the classification category of the document; means for receiving the generated learning data as input and calculating the degree of association between classification categories and features; means for receiving the calculated degree of association and the learning data as input, removing inappropriate features from the learning data, and generating new learning data; means for receiving the generated new learning data as input, calculating weights between classification categories and features, and generating a knowledge base for document classification; and means for receiving a document as input and outputting the corresponding classification category based on the knowledge base.

2. A method of generating a knowledge base for document classification in an automatic document classification system using a thesaurus dictionary, comprising the steps of: receiving a document set with classification tags and a thesaurus dictionary as input, and generating learning data composed of words characterizing each document, their semantic categories, and the classification category of the document; calculating the relevance between classification categories and features for the learning data; removing inappropriate features from the learning data based on the relevance to generate new learning data; and calculating weights between classification categories and features based on the new learning data to generate a knowledge base for document classification.

3. A computer-readable recording medium recording a program for generating a knowledge base for document classification in an automatic document classification system using a thesaurus dictionary, the program comprising: a process of reading a document set with classification tags; a process of analyzing the tagged document set, extracting words characterizing each document, looking up the semantic category of each word in a thesaurus dictionary, and generating learning data comprising a feature vector composed of the words and their semantic categories and the classification category of the document; a process of calculating the relevance between classification categories and features for the learning data; a process of removing inappropriate features from the learning data based on the relevance to obtain new learning data; and a process of calculating weights between classification categories and features based on the new learning data and generating a knowledge base for document classification.
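As an illustration only, the weight calculation and the category output recited in the claims can be sketched as a simple linear scoring scheme. The weighting formula (the share of a feature's occurrences that fall in a category), the function names `build_knowledge_base` and `classify`, the `sem:` prefix for semantic categories, and the toy data are assumptions made for this sketch; the claims do not fix any of them.

```python
from collections import defaultdict


def build_knowledge_base(learning_data):
    """Compute a weight for every (classification category, feature) pair.

    Assumption: the weight is the share of a feature's occurrences that
    fall in documents of the category. The actual weighting may differ.
    """
    pair_count = defaultdict(int)
    feature_count = defaultdict(int)
    for features, label in learning_data:
        for feature in features:
            pair_count[(label, feature)] += 1
            feature_count[feature] += 1
    return {key: n / feature_count[key[1]] for key, n in pair_count.items()}


def classify(features, knowledge_base):
    """Return the category whose summed feature weights are largest."""
    scores = defaultdict(float)
    for (label, feature), weight in knowledge_base.items():
        if feature in features:
            scores[label] += weight
    return max(scores, key=scores.get) if scores else None


# Hypothetical learning data after disambiguation: words plus one
# semantic category each, written here with a "sem:" prefix.
training = [
    (["bank", "loan", "sem:finance"], "economy"),
    (["bank", "river", "sem:terrain"], "nature"),
]
kb = build_knowledge_base(training)
predicted = classify(["loan", "sem:finance", "bank"], kb)
```

Here `predicted` is "economy", because "loan" and "sem:finance" occur only in the document tagged "economy", while "bank" contributes equally to both categories. A document with no known features yields `None`.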