JP2000222431A

JP2000222431A - Document classifying device

Info

Publication number: JP2000222431A
Application number: JP11026483A
Authority: JP
Inventors: Hiroyoshi Konaka; 裕喜小中
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1999-02-03
Filing date: 1999-02-03
Publication date: 2000-08-11

Abstract

PROBLEM TO BE SOLVED: To select the appropriate category for document information that is not yet classified and store the document information in that category, by selecting the category from which the characteristic pattern with the highest weight among the collected characteristic patterns was extracted. SOLUTION: A storing means 10 classifies document information, which has a document number and a keyword set whose keywords characterize the contents of the corresponding document, into the corresponding category and stores it. An extracting means 20 extracts, for each category, characteristic patterns made up of keywords that occur in a high proportion of the keyword sets of the stored document information, and assigns a weight to each pattern. A characteristic pattern collecting means 30 then collects the characteristic patterns that include part or all of the keyword set of unclassified document information. A selecting means 40 selects the category from which the highest-weighted of the collected characteristic patterns was extracted and stores the unclassified document information in that category.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

TECHNICAL FIELD OF THE INVENTION

The present invention relates to a document classification device. Given document information that has been classified and stored in advance, and document information that has not yet been classified (hereinafter "unclassified document information"), the device selects a category suited to the content of the unclassified document information and then stores it in that category.

[0002]

DESCRIPTION OF THE RELATED ART

Techniques that classify on the basis of statistical information about individual keywords include JP-A-63-214832, JP-A-1-188934, JP-A-5-54037, JP-A-5-342272, and JP-A-6-75995. In these publications, information such as the frequency with which each keyword occurs in the documents belonging to a category is taken as that keyword's degree of contribution to the category. The contributions of the individual words contained in the unclassified document information are summed for each category to give a degree of relevance to that category, and the document is classified into, and stored in, the category with the highest relevance. JP-A-8-221439 performs classification using a neural network.

[0003] Approaches based on classification decision trees include JP-A-5-233706, JP-A-5-234726, and JP-A-9-16570. In these publications, a tree that decides the classification from the presence or absence of keywords and other document information is constructed in advance and then used to determine the classification.

[0004]

PROBLEMS TO BE SOLVED BY THE INVENTION

When classification is based on statistical information about individual keywords, the contributions of the keywords to a category are summed. It is therefore difficult to identify which combination of keywords contributed to the classification, and hard for the user to judge whether the classification is sound. Consider also the following case. Among several categories, one category (say, category A) holds documents that contain both keyword a and keyword b, while the other categories (that is, categories other than A) hold documents that contain either keyword a or keyword b together with other keywords. An unclassified document containing both keyword a and keyword b is now to be classified.

[0005] Suppose the contribution of each keyword is computed with a weighting scheme in which the weight decreases as the total number of documents containing the keyword increases. The relevance to category A of the unclassified document information containing both keyword a and keyword b is then computed as small. The result is swayed by the contributions of those keywords in the unclassified document information that appear in few documents overall, and the document is classified into another category.

[0006] Methods that construct a classification decision tree, on the other hand, build a tree without redundancy. Suppose, for example, that documents containing any of several keywords (say, keyword a and keyword b) are to be classified into one of several categories (say, category B). If the documents previously classified into category B all happen to contain keyword b, the resulting decision tree will, for example, classify a document into category B whenever keyword b is present, and no knowledge is obtained for classifying a document that contains only keyword a.

[0007] The present invention was made to solve the above problems of the prior art. Its object is to obtain a document classification device that, given a set of document information classified into several categories and unclassified document information, can handle keywords that appear in combination, select the appropriate category, and store the unclassified document information in it.

[0008]

MEANS FOR SOLVING THE PROBLEMS

A document classification device according to the present invention comprises: storage means that classifies document information, which has a document number and a keyword set whose keywords characterize the content of the document corresponding to that document number, into the corresponding category and stores it; extraction means that extracts, for each category of the storage means, feature patterns, which are information characterizing the category and consist of keywords that occur in a high proportion of the keyword sets of the document information stored in the category, and that assigns a weight to each extracted feature pattern; feature pattern collection means that collects, for each category, the feature patterns that include part or all of the keyword set of document information not yet classified; and selection means that selects the category from which the highest-weighted of the feature patterns collected by the feature pattern collection means was extracted, and stores the not-yet-classified document information in the selected category.

[0009] In the document classification device according to the present invention, the weight assigned to an extracted feature pattern is the proportion, among all the document information stored in the category corresponding to the feature pattern, of document information whose keyword set includes the feature pattern.

[0010] In the document classification device according to the present invention, the weight assigned to an extracted feature pattern is the proportion, among the documents in all categories whose keyword sets include the feature pattern, of documents belonging to the category corresponding to the feature pattern.

[0011] In the document classification device according to the present invention, the weight assigned to a feature pattern is the conditional entropy of the category with respect to the feature pattern, taken over a predetermined set of document information stored in the storage means.

[0012] In the document classification device according to the present invention, the weight assigned to a feature pattern is the weight given to the extracted feature pattern multiplied by the number of keywords constituting the feature pattern.

[0013] A document classification device according to the present invention comprises: storage means that classifies document information, which has a document number and a keyword set whose keywords characterize the content of the document corresponding to that document number, into the corresponding category and stores it; extraction means that extracts, for each category of the storage means, feature patterns, which are information characterizing the category and consist of keywords that occur in a high proportion of the keyword sets of the document information stored in the category; feature pattern collection means that collects, for each category, the feature patterns that include part or all of the keyword set of document information not yet classified; and selection means that calculates, for each category, the probability that the not-yet-classified document information properly belongs to that category, selects the category with the highest probability, and stores the not-yet-classified document information in the selected category.

[0014] In the document classification device according to the present invention, the selection means selects the categories from which feature patterns with high weights among those collected by the feature pattern collection means were extracted, or the categories from which feature patterns giving a high probability that the not-yet-classified document information properly belongs to the category were extracted, and the device is configured to display information on the categories selected by the selection means.

[0015] The document classification device according to the present invention comprises document information search means that searches whether an input feature pattern is present in the document information stored in the storage means.

[0016]

DESCRIPTION OF THE PREFERRED EMBODIMENTS

Embodiment 1. An embodiment of the present invention is described below. FIG. 1 is a diagram illustrating the document classification device of Embodiment 1. In the figure, 10 is storage means, 20 is extraction means, 30 is feature pattern collection means, 40 is selection means, and 50 is document information input means.

[0017] The storage means 10 has a high-capacity information storage medium, for example a magnetic recording medium such as a hard disk or flexible disk, or a magneto-optical recording medium such as an MO disk. The storage means 10 holds several categories in advance. Each category stores document information comprising a document ID, which is a document number assigned in advance to the document, and a keyword set having at least one keyword that characterizes the content of the document.

[0018] Classification into categories may be performed with some clustering algorithm, such as the k-means method, or manually. A simple example of the storage format is one file per category that holds multiple pieces of document information as a set, in the form:

    Document ID 1: keyword 1, keyword 2, ...
    Document ID 2: keyword 1, keyword 2, ...
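As a rough illustration of the storage format described above, the following Python sketch parses one category file into a mapping from document ID to keyword set. The function name `parse_category_file`, the ASCII colon and comma separators, and the sample IDs and keywords are assumptions made for this example and are not part of the specification.

```python
def parse_category_file(text):
    """Parse lines of the form 'doc_id: kw1, kw2, ...' into {doc_id: set_of_keywords}."""
    documents = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc_id, _, keyword_part = line.partition(":")
        # Split the keyword list and drop empty entries and stray whitespace.
        keywords = {kw.strip() for kw in keyword_part.split(",") if kw.strip()}
        documents[doc_id.strip()] = keywords
    return documents


# Hypothetical contents of the file for one category.
sample = """
D001: computer, network, protocol
D002: computer, control
"""
category_c = parse_category_file(sample)
```

Storing each keyword set as a Python `set` makes the later "does this document contain the pattern" checks a simple subset test.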

[0019] The content of each document is itself stored so that it can be retrieved by the document's ID. For example, the content may be stored in the category of the storage means 10 that corresponds to the document, or in storage means (not shown) separate from the storage means 10.

[0020] The keyword set may be assigned manually in advance from the content of the document information, or assigned using the result of analyzing that content in advance by machine (not shown). Terms with the same meaning may be represented by a single unified keyword. For example, the spelling variant "コンピューター" and the synonym "計算機" can both be represented by the unified keyword "コンピュータ" ("computer"). Doing so makes it possible to assign unified keywords that take orthographic variation and synonyms into account.
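The keyword unification described above can be sketched as a simple lookup table. The table name `UNIFIED` and the function `normalize_keywords` are hypothetical, and a real system would need a much larger variant dictionary; the two entries are the ones given in the text.

```python
# Hypothetical variant table: each spelling variant or synonym maps to one unified keyword.
UNIFIED = {
    "コンピューター": "コンピュータ",  # spelling variant
    "計算機": "コンピュータ",          # synonym
}


def normalize_keywords(raw_keywords):
    """Return the keyword set with known variants replaced by their unified keyword."""
    return {UNIFIED.get(kw, kw) for kw in raw_keywords}


normalized = normalize_keywords(["コンピューター", "計算機", "ネットワーク"])
```

Because the result is a set, the two variants collapse into a single keyword.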

[0021] The feature pattern extraction means 20 extracts feature patterns for each category from the keyword sets of the document information stored in the storage means 10, and assigns a weight to each extracted feature pattern. A feature pattern is a conjunction of one or more keywords that characterizes the document information of a category. This processing can be completed before any unclassified document information is actually supplied for classification.

[0022] The unclassified document input means 50 sends the keyword set contained in the document information of a document not yet classified into the storage means 10 (the unclassified document information) to the feature pattern collection means 30. From the feature patterns extracted by the feature pattern extraction means 20, the feature pattern collection means 30 collects, for each category, those that include part or all of the keyword set of the unclassified document information.

[0023] For each category, the selection means 40 takes the largest weight among the feature patterns collected for that category by the feature pattern collection means as the relevance to the category. It then selects the category with the highest relevance and stores the unclassified document information in the selected category.

[0024] With this configuration, unclassified document information can be stored successively in categories that match the content of the documents.

[0025] The feature patterns that the feature pattern extraction means 20 extracts from each category are, for example, feature patterns with a high support rate for the category, feature patterns with a high confidence for the category, or feature patterns for which both are high.

[0026] The support rate and the confidence are defined as follows. For a feature pattern having at least one keyword, the support rate for a category is the proportion, among the document information belonging to the category, of document information whose keyword set contains all the keywords constituting the feature pattern. The confidence for a category is the proportion, among the document information in all categories whose keyword set contains all the keywords constituting the feature pattern, of document information belonging to that category. A feature pattern with a high support rate and a high confidence can therefore serve as information that characterizes the document information belonging to the category.

[0027] As an example, suppose that for category C, which stores 50 pieces of document information, the feature pattern extraction means 20 extracts a feature pattern having both "computer" and "network" as keywords, with a support rate of 40% and a confidence of 80%. This means that 40% of the document information stored in category C, that is, 20 pieces, has a keyword set containing the keywords "computer" and "network". It also means that, of the document information stored in all categories of the storage means 10 whose keyword set contains "computer" and "network", 80% is stored in category C.
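The numbers in this example can be checked with a minimal Python sketch. Since 20 matching documents make up 80% of all matching documents, there must be 20 / 0.8 = 25 matching documents in total, so 5 lie outside category C. The function names, the category name "other", and the filler keywords ("control", "security", "printer") are assumptions chosen only to reproduce those counts.

```python
def support_rate(pattern, category_docs):
    """Fraction of the category's documents whose keyword set contains every keyword of the pattern."""
    hits = sum(1 for keywords in category_docs if pattern <= keywords)
    return hits / len(category_docs)


def confidence(pattern, category, corpus):
    """Among documents in all categories that contain the pattern, the fraction stored in `category`."""
    in_category = sum(1 for keywords in corpus[category] if pattern <= keywords)
    overall = sum(1 for docs in corpus.values() for keywords in docs if pattern <= keywords)
    return in_category / overall if overall else 0.0


# Toy corpus: category C holds 50 documents, 20 of which contain both keywords;
# 5 further matching documents sit in another (hypothetical) category.
corpus = {
    "C": [{"computer", "network"}] * 20 + [{"computer", "control"}] * 30,
    "other": [{"computer", "network", "security"}] * 5 + [{"printer"}] * 10,
}
pattern = {"computer", "network"}
s = support_rate(pattern, corpus["C"])
c = confidence(pattern, "C", corpus)
print(f"support rate = {s:.0%}, confidence = {c:.0%}")
# prints: support rate = 40%, confidence = 80%
```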

[0028] Embodiment 1 is described taking as an example a feature pattern extraction means 20 that, for a given category, extracts as feature patterns those patterns whose support rate is at least a predetermined minimum support rate and whose confidence is at least a predetermined minimum confidence.

[0029] FIG. 2 is a diagram illustrating an example of a specific configuration of the storage means 10 and the feature pattern extraction means 20. Components given the same reference numerals as in FIG. 1 are the same or equivalent. In the figure, 11 is a recording unit that stores document IDs and the keyword sets corresponding to them, classified into categories. The recording unit 11 is, for example, a hard disk device, or a disk drive loaded with an MO disk or floppy disk. 12 is a document information storage unit. It reads out, category by category, the document IDs and corresponding keyword sets recorded in the recording unit 11 and stores them. It also outputs the document information (document IDs and corresponding keyword sets) stored in the category designated by the control unit 21, described later, to the candidate pattern set generation unit 24, described later.

[0030] When a category and a pattern are designated by the multi-support pattern set generation unit 22 or the high confidence pattern output unit 23, the document information storage unit 12 counts the pieces of document information in the designated category whose keyword sets contain the designated pattern, and outputs the count to the multi-support pattern set generation unit 22 or the high confidence pattern output unit 23. To check whether a designated pattern appears in a piece of document information, the document information storage unit 12 checks whether part or all of the keyword set constituting that document information matches the pattern.

[0031] To perform this check efficiently, a hash table or hash tree may be used to narrow down the patterns that could be subsets of each document's keyword set, the keywords in each pattern and each piece of document information may be kept sorted, or the keyword sets may be represented as bit patterns. The document information storage unit 12 may also be connected to a computer network and read document information from another recording unit over that network.
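Of the speed-ups mentioned above, the bit-pattern representation is the easiest to sketch. In the hedged Python example below, each distinct keyword is given one bit, and the containment check becomes a single bitwise AND. The function names and the sample keyword sets are assumptions for illustration only.

```python
def build_vocabulary(keyword_sets):
    """Assign each distinct keyword a bit position (sorted for a stable order)."""
    vocabulary = sorted(set().union(*keyword_sets))
    return {kw: bit for bit, kw in enumerate(vocabulary)}


def to_bits(keywords, bit_of):
    """Encode a keyword set as an integer with one bit set per keyword."""
    mask = 0
    for kw in keywords:
        mask |= 1 << bit_of[kw]
    return mask


def contains(doc_bits, pattern_bits):
    """True if every keyword of the pattern is present in the document."""
    return (doc_bits & pattern_bits) == pattern_bits


docs = [{"computer", "network", "program"}, {"computer", "control"}]
pattern = {"computer", "network"}
bit_of = build_vocabulary(docs + [pattern])
pattern_bits = to_bits(pattern, bit_of)
matches = [contains(to_bits(d, bit_of), pattern_bits) for d in docs]
```

Here only the first document contains both keywords of the pattern, so `matches` is `[True, False]`.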

[0032] 24 is a candidate pattern set generation unit. For each keyword of the documents of a given category supplied from the document information storage unit 12, it generates a candidate pattern consisting of that single keyword, and outputs these candidate patterns to the multi-support pattern set generation unit 22. In addition, when it is supplied with a multi-support pattern set from the multi-support pattern set generation unit 22, it proceeds as follows, where n is the number of keywords in each multi-support pattern of that set. From every pair of multi-support patterns that share (n-1) keywords it generates a pattern of (n+1) keywords. It keeps as candidate patterns those generated patterns for which every n-keyword subset is contained in the multi-support pattern set, and outputs the resulting new candidate pattern set to the multi-support pattern set generation unit 22.

[0033] 22 is a multi-support pattern set generation unit. For each candidate pattern supplied from the candidate pattern set generation unit 24, it uses the document information storage unit 12 to calculate the support rate, that is, the ratio of the number of pieces of document information in the category designated by the control unit 21 in which the candidate pattern appears to the number of all pieces of document information belonging to that category. It selects the patterns whose support rate is at or above a predetermined threshold (the minimum support rate) as new multi-support patterns, and outputs the set of those multi-support patterns to the high confidence pattern output unit 23 and the candidate pattern set generation unit 24.
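The interplay between the candidate pattern set generation unit 24 and the multi-support pattern set generation unit 22 resembles the level-wise search of the well-known Apriori frequent-itemset procedure. The following Python sketch shows one way it might look. All function names, the `max_keywords` cap, and the sample documents are assumptions for illustration and do not come from the specification.

```python
from itertools import combinations


def support_rate(pattern, category_docs):
    """Fraction of the category's documents that contain every keyword of the pattern."""
    hits = sum(1 for keywords in category_docs if pattern <= keywords)
    return hits / len(category_docs)


def next_candidates(frequent, n):
    """Join n-keyword patterns that share n-1 keywords, then keep only those
    (n+1)-keyword patterns whose every n-keyword subset is itself in `frequent`."""
    candidates = set()
    for a, b in combinations(frequent, 2):
        merged = a | b
        if len(merged) == n + 1 and all(
            frozenset(sub) in frequent for sub in combinations(merged, n)
        ):
            candidates.add(merged)
    return candidates


def multi_support_patterns(category_docs, min_support, max_keywords=3):
    """Level-wise search for all patterns whose support rate is at least min_support."""
    # Level 1: one candidate pattern per distinct keyword.
    candidates = {frozenset([kw]) for keywords in category_docs for kw in keywords}
    result, n = set(), 1
    while candidates and n <= max_keywords:
        frequent = {p for p in candidates if support_rate(p, category_docs) >= min_support}
        result |= frequent
        candidates = next_candidates(frequent, n)
        n += 1
    return result


docs = [
    {"computer", "network", "program"},
    {"computer", "network"},
    {"computer", "control"},
    {"network", "protocol"},
]
found = multi_support_patterns(docs, min_support=0.5)
```

With a minimum support rate of 50%, the sketch keeps "computer", "network", and the pair "computer" and "network"; keywords that appear in only one of the four documents are dropped at the first level and never take part in larger candidates.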

[0034] 23 is a high confidence pattern output unit. For each pattern supplied from the multi-support pattern set generation unit 22, it uses the document information storage unit 12 to calculate the confidence, that is, the ratio of the number of pieces of document information in the category designated by the control unit 21 in which the pattern appears to the number of pieces of document information in all categories in which the pattern appears. From the pattern set supplied by the multi-support pattern set generation unit 22, it selects and outputs the patterns whose confidence is at or above a predetermined minimum confidence. 21 is a control unit that controls the document information storage unit 12, the multi-support pattern set generation unit 22, the high confidence pattern output unit 23, and the candidate pattern set generation unit 24.

[0035] With the configuration shown in FIG. 2, feature patterns with a high support rate are obtained from the multi-support pattern set generation unit 22, and feature patterns with a high confidence are obtained from the high confidence pattern output unit 23.

[0036] A feature pattern with a high support rate and a high confidence can thus serve as information that characterizes the document information belonging to its category. Such patterns are useful for classifying the document ID and keyword set of unclassified document information into the appropriate category. They can also shorten keyword searches, if the categories related to the search keywords are searched first.

[0037] In general, given a threshold on the support rate (the minimum support rate) and a threshold on the confidence (the minimum confidence), several feature patterns are extracted for a given category. If the minimum support rate or minimum confidence is too high, however, it may be that no feature pattern at all can be extracted. Conversely, if either is too low, extraction takes a long time or tends to produce more feature patterns than necessary.

[0038] It is therefore desirable to set the minimum support rate and minimum confidence to high values initially and, if no feature patterns can be extracted for a category, to lower them gradually until a suitable number of feature patterns is extracted. Limiting the number of keywords in a feature pattern during extraction also suppresses the extraction of overly complex patterns.
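The gradual lowering of thresholds might be sketched as follows. This is a simplification under stated assumptions: the candidate patterns are taken to be already scored, both thresholds are lowered together as one value in whole percent, and the parameter names (`start`, `step`, `floor`, `wanted`, `max_keywords`) are hypothetical.

```python
def extract_with_relaxation(scored_patterns, start=90, step=10, floor=10,
                            wanted=1, max_keywords=3):
    """Lower a shared threshold (in percent) until at least `wanted` patterns pass.

    scored_patterns: list of (pattern, support_percent, confidence_percent).
    Returns (selected_patterns, threshold_used).
    """
    threshold = start
    while True:
        selected = [
            pattern
            for pattern, support, confidence in scored_patterns
            if support >= threshold
            and confidence >= threshold
            and len(pattern) <= max_keywords  # cap pattern complexity
        ]
        if len(selected) >= wanted or threshold <= floor:
            return selected, threshold
        threshold = max(floor, threshold - step)


# Hypothetical scores for two patterns of one category.
scored = [
    (frozenset({"computer"}), 60, 70),
    (frozenset({"computer", "network"}), 40, 80),
]
selected, used = extract_with_relaxation(scored)
```

Starting from 90%, nothing passes until the threshold reaches 60%, at which point the single-keyword pattern qualifies and the loop stops.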

[0039] The feature pattern extraction means 20 gives each extracted feature pattern a weight that allows the patterns to be compared with one another. As the weight w_ij of each feature pattern p_ij extracted for category C_i, the support rate, the confidence, or the conditional entropy of p_ij with respect to C_i may be used. Alternatively, the number of keywords constituting the feature pattern p_ij may itself be used as the weight w_ij.
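The weighting options listed above can be collected in one small helper. This Python sketch is illustrative only: the function name and the `scheme` and `scale_by_size` parameters are hypothetical, the `scale_by_size` option follows the variant in paragraph [0012] (weight multiplied by the number of keywords), and conditional entropy is omitted because its defining equations are not reproduced in this text.

```python
def pattern_weight(pattern, support, confidence, scheme="confidence", scale_by_size=False):
    """Return a comparable weight for a feature pattern under the chosen scheme."""
    if scheme == "support":
        weight = support
    elif scheme == "confidence":
        weight = confidence
    elif scheme == "size":
        weight = len(pattern)  # number of keywords in the pattern
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    if scale_by_size:
        weight *= len(pattern)  # favor longer, more specific patterns
    return weight


p = frozenset({"computer", "network"})
w_conf = pattern_weight(p, support=0.4, confidence=0.8)
w_scaled = pattern_weight(p, support=0.4, confidence=0.8, scale_by_size=True)
w_size = pattern_weight(p, support=0.4, confidence=0.8, scheme="size")
```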

【００４０】ここで、格納手段１０に格納された所定の
文書情報の集合を文書集合Ｄとする。文書集合Ｄは格納
手段１０に格納された一部または全ての文書情報であ
る。文書集合ＤにおけるカテゴリＣ_iのパターンｐに対
する条件付エントロピーＥｎｔ_i (Ｄ｜ｐ)とすると、Here, a set of predetermined document information stored in the storage means 10 is referred to as a document set D. The document set D is part or all of the document information stored in the storage unit 10. If conditional entropy Ent _i (D | p) for pattern p of category C _i in document set D is:

【００４１】[0041]

【数１】 (Equation 1)

【００４２】ここで、Ｄ_pは文書集合Ｄのうちパターン
ｐをキーワード集合の一部に含む文書の集合であり、Here, D _p is a set of documents in the document set D that includes the pattern p as a part of the keyword set.

【００４３】[0043]

【数２】 (Equation 2)

[0044] Here, Ent_i(D) is the entropy of cluster C_i in the document set D. Letting

【００４５】[0045]

【数３】 (Equation 3)

[0046] it follows from (3a) and (3b) that Ent_i(D) is expressed by the following equation.

【００４７】[0047]

【数４】 (Equation 4)

[0048] The feature pattern extracting means 20 assigns a weight to each extracted feature pattern according to one of the above indices. The processing described so far can be completed in advance, before any unclassified document information is actually supplied for classification.
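Equations 1 to 4 are not reproduced in this text, so their exact form cannot be confirmed here. As an illustration only, the sketch below computes a conventional conditional entropy of membership in a category given the presence or absence of a pattern. The two-way split into documents with and without the pattern, the base-2 logarithm, and all names are assumptions.

```python
import math


def binary_entropy(n_in, n_total):
    """Entropy of 'in the category' versus 'not in the category'."""
    if n_total == 0:
        return 0.0
    p = n_in / n_total
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)


def conditional_entropy(docs, category, pattern):
    """Assumed form: entropy of membership in `category`, averaged over the
    documents that contain `pattern` and the documents that do not."""
    with_p = [cat for cat, kws in docs if pattern <= kws]
    without_p = [cat for cat, kws in docs if not pattern <= kws]
    total = 0.0
    for part in (with_p, without_p):
        if part:
            n_in = sum(1 for cat in part if cat == category)
            total += len(part) / len(docs) * binary_entropy(n_in, len(part))
    return total


# Hypothetical data: the pattern {"a"} separates category C perfectly.
separating = [("C", {"a"}), ("C", {"a"}), ("D", {"b"}), ("D", {"b"})]
# Hypothetical data: the pattern {"a"} says nothing about category C.
uninformative = [("C", {"a"}), ("D", {"a"}), ("C", {"b"}), ("D", {"b"})]

print(conditional_entropy(separating, "C", {"a"}))
print(conditional_entropy(uninformative, "C", {"a"}))
```

Under this assumed form, a lower value indicates a more discriminating pattern, so some conversion would be needed before the value is compared as a weight; the text here does not state which conversion is used.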

[0049] The feature pattern collecting means 30 collects, for each category, the feature patterns that contain part or all of the keyword set of the unclassified document information. For example, suppose that three feature patterns have been extracted for category C: one consisting of "computer" and "network", one consisting of "computer" and "control", and one consisting of "computer" alone. Suppose also that the keyword set of the unclassified document information is "computer, processor, network, computation, program". In this case, the feature pattern collecting means 30 collects the pattern consisting of "computer" and "network" and the pattern consisting of "computer" as the feature patterns of category C related to the keyword set of the unclassified document information.
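The collection step in this example amounts to a subset test: a pattern is collected when every one of its keywords appears in the keyword set of the unclassified document. A minimal sketch of that reading follows; the function name and the dictionary layout are illustrative assumptions.

```python
def collect_patterns(patterns_by_category, doc_keywords):
    """For each category, keep the patterns fully contained in doc_keywords."""
    return {
        category: [p for p in patterns if p <= doc_keywords]
        for category, patterns in patterns_by_category.items()
    }


# The three category C patterns and the keyword set from the example above.
patterns_by_category = {
    "C": [
        frozenset({"computer", "network"}),
        frozenset({"computer", "control"}),
        frozenset({"computer"}),
    ],
}
doc_keywords = {"computer", "processor", "network", "computation", "program"}

collected = collect_patterns(patterns_by_category, doc_keywords)
for pattern in collected["C"]:
    print(sorted(pattern))
```

The pattern containing "control" is dropped because "control" is not among the document's keywords.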

[0050] For each category, the selecting means 40 takes the largest weight among the feature patterns collected by the feature pattern collecting means 30 as the degree of relevance to that category. It then selects the category with the highest degree of relevance and stores the unclassified document information in the selected category.

[0051] For example, suppose that the certainty factor is used as the weight and that the feature pattern collecting means 30 has collected the following: for category C, a feature pattern consisting of "computer" and "network" with a certainty factor of 80%; also for category C, a feature pattern consisting of "computer" with a certainty factor of 50%; and for category D, a feature pattern consisting of "processor" with a certainty factor of 60%. In this case, the selecting means 40 stores the unclassified document information in category C, which is one of the categories of the storage means 10.
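The selection rule can be sketched as two nested maxima: the best weight within each category gives that category's relevance, and the category with the best relevance is chosen. The function name and data layout below are illustrative assumptions; the weights are those of the example above.

```python
def select_category(collected):
    """collected maps category -> list of (pattern, weight) pairs.

    Returns the category whose best pattern weight is highest,
    or None when no pattern was collected for any category.
    """
    relevance = {
        category: max(weight for _, weight in items)
        for category, items in collected.items()
        if items
    }
    if not relevance:
        return None
    return max(relevance, key=relevance.get)


collected = {
    "C": [
        (frozenset({"computer", "network"}), 0.80),
        (frozenset({"computer"}), 0.50),
    ],
    "D": [(frozenset({"processor"}), 0.60)],
}

print(select_category(collected))
```

Category C is chosen because its best weight (0.80) exceeds category D's best weight (0.60).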

[0052] Thus, according to Embodiment 1, feature patterns each having one or more keywords that characterize the document information of a category are extracted for each category, and each extracted feature pattern is given a mutually comparable weight (for example, the support rate, the certainty factor, the conditional entropy, or one of these weights multiplied by the number of keywords). Among the weights of the feature patterns collected for each category, the largest is taken as the degree of relevance to that category, the category with the highest degree of relevance is selected, and the unclassified document is classified into it. In this way, the category that holds the feature pattern with the largest weight is selected, and the unclassified document information can be stored in that category.

[0053] Furthermore, combination patterns of keywords that characterize the document information of each category are extracted from the stored set of document information and used for classification. The device can therefore handle keywords that appear in combination. Because the basis for a classification is a combination pattern of keywords, it is also intuitively easy for the user to understand, which makes it easy to judge whether a classification is good.

[0054] When the certainty factor is used as the weight assigned to each feature pattern, the unclassified document information is stored in the category in which document information whose keyword sets contain that feature pattern is concentrated.

[0055] When the same feature pattern is collected for a plurality of categories, using the support rate as the weight of the feature pattern extracted in each category causes the unclassified document information to be stored in the category in which the feature pattern appears in the highest proportion of document information.

[0056] Using conditional entropy as the weight assigned to each feature pattern corresponds to taking both the support rate and the certainty factor into account. That is, the unclassified document information is stored in the category corresponding to a feature pattern that appears in a high proportion of the document information in that category and rarely appears in other categories. Consequently, the unclassified document information can be stored in a category that holds many of the keywords in its keyword set and in which a high proportion of the document information contains those keywords, so that it is stored in a more appropriate category.

[0057] Alternatively, any of the feature pattern weights described above may be multiplied by the number of keywords constituting the feature pattern, and the product used as a new weight. If the selecting means is configured to select the category that holds the feature pattern with the largest new weight, the unclassified document information is stored in a category corresponding to a feature pattern made up of many keywords.
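A short sketch of the keyword-count adjustment: multiplying a base weight by the number of keywords in the pattern lets a longer pattern outrank a shorter one that has a higher base weight. The function name and the numbers are hypothetical.

```python
def length_adjusted_weight(pattern, base_weight):
    """New weight = base weight x number of keywords in the pattern."""
    return base_weight * len(pattern)


short_pattern = frozenset({"computer"})
long_pattern = frozenset({"computer", "network"})

# Hypothetical base weights: the short pattern starts out ahead.
short_weight = length_adjusted_weight(short_pattern, 0.9)
long_weight = length_adjusted_weight(long_pattern, 0.5)

print(short_weight, long_weight)
```

After adjustment, the two-keyword pattern has the larger weight even though its base weight was smaller.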

[0058] Embodiment 2. In this embodiment, the feature pattern extracting means 20 extracts, for each category, the feature patterns that characterize the category from the set of document information stored in the storage means 10, but does not assign a weight to each feature pattern as in Embodiment 1. Instead, it attaches to each feature pattern the number of pieces of document information containing the pattern, counted separately for the corresponding category and for all other categories combined. The selecting means 40 calculates the probability that the unclassified document should appropriately belong to a given category (referred to as the estimated probability), treats this estimated probability as the degree of relevance to that category, and stores the unclassified document in the category with the highest degree of relevance.

[0059] The estimated probability for each category is calculated as follows. First, if no feature pattern corresponding to the keywords of the unclassified document information is collected for a category, the estimated probability is 0. If only one feature pattern is collected, its certainty factor is used directly as the estimated probability. If a plurality of feature patterns are collected for a category, various methods are conceivable for calculating the estimated probability for that category.

[0060] For example, a configuration in which the certainty factor of each feature pattern is used as the estimated probability is the same as one of the examples described in Embodiment 1.

[0061] Another conceivable method focuses on the fact that the plurality of feature patterns collected for a category appear simultaneously. This estimation method is described below.

[0062] First, as a simple case, suppose that, in category C_i of a set D of N pieces of document information stored in the storage means, a feature pattern p_a consisting of keyword a and a feature pattern p_b consisting of keyword b have been collected. Let D_pa and D_pb denote the sets of document information whose keyword sets contain the respective feature patterns, and let

【００６３】[0063]

【数５】 (Equation 5)

[0064] Then the estimated probability that an unclassified document containing both patterns p_a and p_b belongs to category C_i is expressed as follows.

【００６５】[0065]

【数６】 (Equation 6)

[0066] Here, N(X) is the number of documents belonging to the set X of document information. A quantity such as N(C_i ∩ D_pa ∩ D_pb) can be counted directly from the document information, but such document information may happen not to exist, and even if it exists, the number of documents may be too small to be statistically meaningful. A method of calculating it indirectly, from the number of documents in which each feature pattern appears in the corresponding category and in all other categories, is therefore considered here. As an assumption for this purpose, D_pa and D_pb are taken to be independent of each other within each of C_i and

【００６７】[0067]

【数７】 (Equation 7)

[0068] respectively. In this case,

【００６９】[0069]

【数８】 (Equation 8)

[0070] holds, and the conditional probabilities above can be calculated from the number of documents in which each feature pattern appears in the corresponding category and in all other categories, together with the total number of documents in each category, as

【００７１】[0071]

【数９】 (Equation 9)

[0072] Using this, P(C_i | D_pa ∩ D_pb) is given by

【００７３】[0073]

【数１０】 (Equation 10)

[0074] where

【００７５】[0075]

【数１１】 [Equation 11]

[0076] In this way, the estimated probability can be obtained from Equation (9).
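Equations 5 to 11 are not reproduced in this text. The sketch below shows the conventional calculation implied by the stated assumption, namely that the patterns occur independently within the category and within all other categories combined. It should be read as an illustration of that assumption, not as the exact formula of Equation (9). All names and counts are hypothetical.

```python
def estimated_probability(n_cat, n_other, counts_cat, counts_other):
    """Estimate P(category | all collected patterns present).

    n_cat        -- total documents in the category
    n_other      -- total documents in all other categories combined
    counts_cat   -- per pattern, documents in the category containing it
    counts_other -- per pattern, documents elsewhere containing it
    """
    if n_cat == 0:
        return 0.0
    # Expected number of documents in the category containing every pattern,
    # assuming the patterns occur independently within the category.
    joint_cat = float(n_cat)
    for count in counts_cat:
        joint_cat *= count / n_cat
    # Same expectation for all other categories combined.
    joint_other = float(n_other)
    if n_other > 0:
        for count in counts_other:
            joint_other *= count / n_other
    total = joint_cat + joint_other
    return joint_cat / total if total > 0 else 0.0


# Hypothetical counts: 100 documents in C_i and 400 elsewhere; p_a appears in
# 40 and 20 of them respectively, and p_b appears in 30 and 40.
both = estimated_probability(100, 400, [40, 30], [20, 40])
single = estimated_probability(100, 400, [40], [20])
print(both, single)
```

With a single pattern the estimate reduces to that pattern's certainty factor (40 of the 60 documents containing p_a are in C_i), which agrees with the single-pattern case described earlier.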

[0077] Next, consider the case where at least one of the feature patterns collected for a category has a plurality of keywords. If the feature patterns share no constituent keywords, the calculation proceeds as above. If they share keywords, it is necessary to consider how much each shared keyword contributes to the above probability.

[0078] For example, suppose that a feature pattern p_{c,d} consisting of keywords c and d and a feature pattern p_{c,e,f} consisting of keywords c, e, and f have been collected for category C_i. Applying Equation (9) directly would count the contribution of keyword c twice. To avoid this, the contribution of each keyword must be considered separately. As an assumption that simplifies the estimate, the keywords in a feature pattern are taken to be mutually independent and to contribute equally. Under this assumption, the contributions of keywords c and d in the feature pattern p_{c,d} to Equation (9) are

【００７９】[0079]

【数１２】 (Equation 12)

[0080] Similarly, for the feature pattern p_{c,e,f}, the contributions of keywords c, e, and f are calculated as

【００８１】[0081]

【数１３】 (Equation 13)

[0082] When a keyword is included in more than one of the collected feature patterns, its contribution is taken to be, for example, the maximum of the contributions calculated for the respective feature patterns. If the contribution R_i(k) is calculated in this way for every keyword k contained in the feature patterns p collected for category C_i, the estimated probability is calculated as

【００８３】[0083]

【数１４】 [Equation 14]

[0084] Thus, according to Embodiment 2, the estimated probability that the unclassified document should appropriately belong to each category is calculated for each category from the feature patterns collected for that category according to the unclassified document information. The category with the highest estimated probability is selected, and the unclassified document information is stored in it, so the unclassified document information can be stored in an appropriate category. In particular, calculating the estimated probability with the constituent keywords of the collected feature patterns taken into account requires more computation than Embodiment 1, but it makes it possible to classify documents that contain new keyword combinations that have not appeared frequently before.

[0085] Embodiment 3. FIG. 3 is a diagram for explaining the configuration of the document classification device of Embodiment 3. In the figure, elements with the same reference numerals as in FIGS. 1 and 2 are the same or equivalent. Reference numeral 60 denotes a classification decision interface, and 70 denotes document information search means. The classification decision interface 60 has a display screen, which shows the category selected by the selecting means 40 together with information about the selection, for example the feature patterns collected for each category, the weights of those feature patterns, or the degree of relevance between the unclassified document information and each category. Based on the displayed information, the user decides the category into which the document is finally classified. The user can also select a feature pattern as needed, send it to the document information search means 70, and refer to the result.

[0086] The document information search means 70 searches the set of document information stored in the storage means 10 for document information whose keyword set contains a given feature pattern, and displays the result on the classification decision interface 60. Although not essential, searching becomes more efficient if the search can be restricted to a specific category or if AND and OR searches over a plurality of feature patterns can be executed.
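A search of this kind can be sketched as a filter over the stored document information. The example below supports the optional category restriction and the AND/OR combination mentioned above. The function name, the `(document_number, category, keyword_set)` layout, and the sample data are illustrative assumptions.

```python
def search_documents(docs, patterns, mode="AND", category=None):
    """Return document numbers whose keyword sets contain the given patterns.

    mode="AND" requires every pattern to be contained; mode="OR" requires
    at least one. When category is given, only that category is searched.
    """
    if mode not in ("AND", "OR"):
        raise ValueError("mode must be 'AND' or 'OR'")
    combine = all if mode == "AND" else any
    return [
        number
        for number, cat, keywords in docs
        if (category is None or cat == category)
        and combine(pattern <= keywords for pattern in patterns)
    ]


# Hypothetical stored document information.
docs = [
    (1, "C", {"computer", "network"}),
    (2, "C", {"computer", "control"}),
    (3, "D", {"processor", "network"}),
]
patterns = [frozenset({"computer"}), frozenset({"network"})]

print(search_documents(docs, patterns, mode="AND"))
print(search_documents(docs, patterns, mode="OR"))
print(search_documents(docs, patterns, mode="OR", category="D"))
```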

[0087] Thus, according to Embodiment 3, the user can make the final classification decision while referring to information about the category selection. In particular, the feature patterns collected for each category are conjunctions of keywords and are intuitively easy for the user to understand. The user can therefore search the document information by feature pattern as needed, grasp the characteristics of each category, and then make the final classification decision.

[0088] Each of the embodiments described above is presented as one form of carrying out the present invention, and the present invention should not be limited to them. The present invention covers what is described in the claims and their equivalents.

【００８９】[0089]

[Effects of the Invention] The document classification device according to the present invention comprises: storage means for classifying document information into corresponding categories and storing it, the document information having a document number and a keyword set containing keywords that characterize the content of the document corresponding to the document number; extracting means for extracting, for each category of the storage means, feature patterns that are information characterizing the category and that contain keywords included at a high rate in the keyword sets of the document information stored in the category, and for assigning a weight to each extracted feature pattern; feature pattern collecting means for collecting, for each category, feature patterns that contain part or all of the keyword set of unclassified document information; and selecting means for selecting the category from which the feature pattern with the highest weight among the collected feature patterns was extracted and storing the unclassified document information in the selected category. Unclassified document information can therefore be stored in an appropriate category.

[0090] According to the document classification device of the present invention, the weight assigned to an extracted feature pattern is the proportion of document information whose keyword set contains the feature pattern among all the document information stored in the category corresponding to the feature pattern. Unclassified document information can therefore be stored in an appropriate category.

[0091] According to the document classification device of the present invention, the weight assigned to an extracted feature pattern is the proportion of documents belonging to the category corresponding to the feature pattern among the documents of all categories whose keyword sets contain the feature pattern. Unclassified document information can therefore be stored in an appropriate category.

[0092] According to the document classification device of the present invention, the conditional entropy of a category with respect to a feature pattern, in a predetermined set of document information stored in the storage means, is used as the weight assigned to the feature pattern. Unclassified document information can therefore be stored in a more appropriate category.

[0093] According to the document classification device of the present invention, the weight assigned to an extracted feature pattern multiplied by the number of keywords constituting the feature pattern is used as the weight of the feature pattern. Unclassified document information can therefore be stored in a more appropriate category.

[0094] The document classification device according to the present invention comprises: storage means for classifying document information into corresponding categories and storing it, the document information having a document number and a keyword set containing keywords that characterize the content of the document corresponding to the document number; extracting means for extracting, for each category of the storage means, feature patterns that are information characterizing the category and that contain keywords included at a high rate in the keyword sets of the document information stored in the category; feature pattern collecting means for collecting, for each category, feature patterns that contain part or all of the keyword set of unclassified document information; and selecting means for calculating, for each category, the probability that the unclassified document information should appropriately belong to the category, selecting the category with the highest probability, and storing the unclassified document information in the selected category. Unclassified document information can therefore be stored in a more appropriate category.

[0095] According to the document classification device of the present invention, the selecting means selects either a category from which a feature pattern with a high weight among the feature patterns collected by the feature pattern collecting means was extracted, or a category from which a feature pattern was extracted for which the probability that the unclassified document information should appropriately belong to the category is high, and information on the category selected by the selecting means is displayed. The user of the device can therefore judge the selected category as appropriate.

[0096] The document classification device according to the present invention includes document information search means for searching whether an input feature pattern exists in the document information stored in the storage means. The user of the device can therefore search the document information by feature pattern, grasp the characteristics of each category, and then make the final classification decision.

[Brief description of the drawings]

FIG. 1 is a diagram for explaining the document classification device of Embodiment 1.

FIG. 2 is a diagram for explaining the document classification device of Embodiment 1.

FIG. 3 is a diagram for explaining the document classification device of Embodiment 1.

[Explanation of symbols]

10: storage means
20: extracting means
30: feature pattern collecting means
40: selecting means
50: unclassified document information input means

Claims

[Claims]

1. A document classification device comprising: storage means for classifying document information into corresponding categories and storing it, the document information having a document number and a keyword set containing keywords that characterize the content of the document corresponding to the document number; extracting means for extracting, for each category of the storage means, feature patterns that are information characterizing the category and that contain keywords included at a high rate in the keyword sets of the document information stored in the category, and for assigning a weight to each extracted feature pattern; feature pattern collecting means for collecting, for each category, feature patterns that contain part or all of the keyword set of document information that has not yet been classified; and selecting means for selecting the category from which the feature pattern with the highest weight among the feature patterns collected by the feature pattern collecting means was extracted, and for storing the document information that has not yet been classified in the selected category.

2. The document classification device according to claim 1, wherein the weight assigned to an extracted feature pattern is the proportion of document information having a keyword set that contains the feature pattern among all the document information stored in the category corresponding to the feature pattern.

3. The document classification device according to claim 1, wherein the weight assigned to an extracted feature pattern is the proportion of documents belonging to the category corresponding to the feature pattern among the documents of all categories having a keyword set that contains the feature pattern.

4. The document classification device according to claim 1, wherein the conditional entropy of a category with respect to a feature pattern, in a predetermined set of document information stored in the storage means, is used as the weight assigned to the feature pattern.

5. The weight given to the feature pattern obtained by multiplying the weight given to the extracted feature pattern by the number of keywords constituting the feature pattern. Document classification device according to the paragraph.

6. A document classification device comprising: storage means for classifying document information into corresponding categories and storing it, the document information having a document number and a keyword set containing keywords that characterize the content of the document corresponding to the document number; extracting means for extracting, for each category of the storage means, feature patterns that are information characterizing the category and that contain keywords included at a high rate in the keyword sets of the document information stored in the category; feature pattern collecting means for collecting, for each category, feature patterns that contain part or all of the keyword set of document information that has not yet been classified; and selecting means for calculating, for each category, the probability that the document information that has not yet been classified should appropriately belong to the category, selecting the category with the highest probability, and storing the document information that has not yet been classified in the selected category.

7. The document classification device according to claim 1, wherein the selecting means selects either a category from which a feature pattern with a high weight among the feature patterns collected by the feature pattern collecting means was extracted, or a category from which a feature pattern was extracted for which the probability that the document information that has not yet been classified should appropriately belong to the category is high, and wherein information on the category selected by the selecting means is displayed.

8. The document classification device according to claim 7, further comprising a document information search unit for searching whether or not the input characteristic pattern exists in the document information stored in the storage unit.