JP2008203933A - Category creation method and apparatus and document classification method and apparatus

Info

Publication number: JP2008203933A
Application number: JP2007036118A
Authority: JP
Inventors: Tatsuma Bise; 竜馬備瀬; Hirokazu Kasahara; 博和笠原; Tomohiro Nihongi; 智洋二本木; Mitsuaki Morimoto; 光昭森本; Masaki Takada; 政樹高田; Osamu Nakagawa; 修中川
Original assignee: Dai Nippon Printing Co Ltd
Current assignee: Dai Nippon Printing Co Ltd
Priority date: 2007-02-16
Filing date: 2007-02-16
Publication date: 2008-09-04
Anticipated expiration: 2027-02-16
Also published as: JP5012078B2

Abstract

PROBLEM TO BE SOLVED: To provide a category creation method and apparatus capable of creating optimum categories using tags assigned by individual viewers, and a document classification method and apparatus for classifying documents using the created categories.
SOLUTION: A category classification means 20, which has a plurality of specific category classifiers each determining whether data belongs to a given category candidate, classifies tagged document data into category candidates, and an ambiguous category exclusion means 30 excludes the category candidates with high misclassification rates. For each remaining category candidate, a similar category exclusion means 40 uses an alternative category classifier, which determines to which of two categories in a high-error-rate relationship a document belongs, to classify the tagged document data having those category candidates as tags; if high misclassification rates result, the category candidate with the lower error rate is excluded from the category candidates.
COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a technique for classifying document data containing text into a plurality of categories.

In recent years, the number of simple websites called "blogs," published on the Internet for anyone to read, has increased rapidly. As blogs have spread, services that search for blogs related to a specified keyword and services that present keywords trending on blogs have been increasing. In such services, classifying blogs by category is expected to make it more convenient for users to browse information that interests them.

Accordingly, various methods for classifying web pages into categories have been proposed. Examples include methods that classify according to rules set for each category by dedicated staff, and methods based on machine learning such as probabilistic models and SVMs (see Non-Patent Document 1). Meanwhile, a technique called folksonomy, in which individual users attach tags to articles from their own viewpoints and the tags are used for classification, has recently attracted attention.

Non-Patent Document 1: Takamura and Matsumoto, "Document classification and machine function learning using SVM," IPSJ Transactions on Databases, Vol. 44, SIG3 (TOD17), pp. 1-9, 2003.

However, the method in which dedicated staff set up categories has the problem that finding appropriate rules takes time and money. The machine learning method has the problem that preparing the training data is laborious. Folksonomy, by contrast, requires little time or money, but tags are attached to only a fraction of the documents on the Internet, and because attached tags can be ambiguous and many tags with similar meanings coexist, search efficiency does not improve.

In view of the above, an object of the present invention is to provide a category creation method and apparatus capable of creating optimum categories using tags assigned by individual viewers, and a document classification method and apparatus for classifying documents using the created categories.

To solve the above problem, the category creation method of the present invention is a method in which a computer creates categories for classifying document data by using tagged document data to which search terms have been attached as tags, the method comprising: a step of setting category candidates from among the terms attached as tags; a step of determining, using specific category classifiers each of which judges whether data belongs to a given category candidate, whether the tagged document data belongs to that category; a step of calculating a specific error rate, which is the error rate for each pairing of a category candidate's specific category classifier with the tagged document data having a category candidate as a tag, based on the number of tagged documents judged to belong to a category candidate different from that of their tag and the number of tagged documents judged not to belong to the category candidate of their tag; and a step of excluding from the category candidates the category candidate of any specific category classifier for which the pairings whose calculated specific error rate is equal to or greater than a predetermined value number a predetermined number or more.

The category creation method of the present invention further comprises, after the above exclusion: a step of classifying, for each remaining category candidate, the tagged document data having the category candidates as tags by using an alternative category classifier that judges to which of the two categories in a pairing whose specific error rate is equal to or greater than the predetermined value a document belongs; a step of calculating an alternative error rate, which is the error rate for each pairing of a category candidate's classifier with the tagged document data having a category candidate as a tag, based on the number of tagged documents classified into a category candidate different from that of their tag and the number not classified into the category candidate of their tag; and a step of excluding one of the two category candidates from the category candidates when an alternative error rate equal to or greater than a predetermined value exists.

The document classification method of the present invention is a method for classifying document data into a plurality of categories, comprising: a step of classifying the document data into categories using, for each category, a specific category classifier that judges whether data belongs to that category; and a step of, when the classification with the specific category classifiers yields document data assigned to a plurality of categories, classifying that document data into one of the categories using an alternative category classifier that judges to which of two of those categories the document data belongs.
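The two-stage classification described above can be sketched as follows. This is only an illustration under stated assumptions: the classifiers are passed in as plain Python callables, the names `classify`, `specific`, and `alternative` are hypothetical, and resolving more than two matching categories by repeated pairwise comparison is an assumption, since the text only speaks of two categories among those matched.

```python
def classify(doc, specific, alternative):
    """Return one category for `doc`, or None if no classifier accepts it.

    specific:    {category: callable(doc) -> bool}
    alternative: {frozenset({a, b}): callable(doc) -> a or b}
    """
    # Stage 1: every specific category classifier votes yes/no.
    hits = [c for c, accepts in specific.items() if accepts(doc)]
    # Stage 2: while more than one category matched, let the alternative
    # classifier for the first two choose between them (assumed strategy).
    while len(hits) > 1:
        a, b = hits[0], hits[1]
        winner = alternative[frozenset((a, b))](doc)
        hits = [winner] + hits[2:]
    return hits[0] if hits else None


# Toy classifiers over sets of nouns (hypothetical, for illustration only).
specific = {
    "soccer": lambda d: "goal" in d,
    "baseball": lambda d: "team" in d,
}
alternative = {
    frozenset(("soccer", "baseball")):
        lambda d: "soccer" if "goal" in d else "baseball",
}

print(classify({"goal", "team"}, specific, alternative))
print(classify({"weather"}, specific, alternative))
```

A document matched by exactly one specific category classifier never reaches the second stage, so the alternative classifiers are consulted only for documents that would otherwise land in several categories.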

According to the category creation method and apparatus of the present invention, tagged document data having category candidates as tags is classified using specific category classifiers that each judge whether data belongs to a given category candidate; a specific error rate is calculated for each pairing of a category candidate's specific category classifier with the tagged document data having a category candidate as a tag, based on the number of documents classified into a category candidate different from that of their tag and the number not classified into it; and the category candidate of any specific category classifier having a predetermined number or more of pairings whose specific error rate is equal to or greater than a predetermined value is excluded. It is therefore possible to use the tags assigned by individual viewers to create optimum categories from which categories with ambiguous meanings have been excluded.

Furthermore, according to the category creation method and apparatus of the present invention, for each remaining category candidate, tagged document data having the category candidates as tags is classified using an alternative category classifier that judges to which of the two categories in a pairing whose specific error rate is equal to or greater than the predetermined value a document belongs, and candidates with a high resulting error rate are excluded. It is therefore possible to use the tags assigned by individual viewers to create optimum categories from which similar categories have been excluded.

According to the document classification method and apparatus of the present invention, document data is classified into categories using, for each category, a specific category classifier that judges whether data belongs to that category, and when this yields document data assigned to a plurality of categories, the document data is classified into one of them using an alternative category classifier that judges to which of two of those categories it belongs. Highly accurate classification is therefore possible.

(1. Creation of a specific category classifier)
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. First, the creation of a specific category classifier is described. A specific category classifier is used as part of the category creation apparatus and determines whether input document data is classified into a specific category. In substance it is a computer program, which functions as the specific category classifier when executed by a computer. The specific category classifier is created by having a computer execute a dedicated program.

First, a predetermined number of tagged document data items are prepared. Tagged document data is document data to which search tags are attached in addition to the body text. The tags have been added freely by viewers on the Internet. Consequently, some tags suit the content of the body text, while others are off the mark.

The prepared tagged document data is then read into a computer, and tags attached to at least a predetermined number of documents are selected as category candidates. Specifically, a threshold is set, and a tag becomes a category candidate when the number of tagged document data items carrying that tag is equal to or greater than the threshold. For example, if there are 120, 150, 130, and 90 tagged document data items with the tags [daily], [soccer], [baseball], and [go], respectively, and the threshold is set to 100, the three terms [daily], [soccer], and [baseball] are selected as category candidates.
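The threshold-based selection of category candidates can be illustrated with a short sketch. The function name `select_category_candidates` and the dictionary layout of a document are assumptions made for illustration; the document counts are the ones given in the example above.

```python
from collections import Counter


def select_category_candidates(tagged_docs, threshold):
    """Return the tags attached to at least `threshold` documents."""
    counts = Counter()
    for doc in tagged_docs:
        # Count a tag once per document, even if it is listed twice.
        counts.update(set(doc["tags"]))
    return {tag for tag, n in counts.items() if n >= threshold}


# Document counts from the example in the text: 120, 150, 130 and 90.
example_counts = {"daily": 120, "soccer": 150, "baseball": 130, "go": 90}
docs = [
    {"tags": [tag], "text": ""}
    for tag, n in example_counts.items()
    for _ in range(n)
]

candidates = select_category_candidates(docs, threshold=100)
print(sorted(candidates))  # ['baseball', 'daily', 'soccer']
```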

Next, for each term selected as a category candidate, tagged documents carrying that tag (conforming documents) and tagged documents not carrying that tag (nonconforming documents) are extracted. In this embodiment, 100 of each are extracted. Then, for every noun appearing in the body text of the conforming documents, the appearance probability P in the conforming documents is calculated. The appearance probability P is the number of conforming documents in which the noun appears divided by the total number of conforming documents (100 in this embodiment). For every noun appearing in the body text of the conforming documents, the appearance probability P in the nonconforming documents is also calculated.
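As a rough illustration of the appearance probability calculation, the sketch below assumes that noun extraction has already been done (for Japanese text this would normally require a morphological analyzer, which is outside the scope of the sketch) and that each document is represented as a set of nouns. The function name and the toy data are hypothetical.

```python
def appearance_probabilities(noun_sets, vocabulary):
    """P(noun) = (documents containing the noun) / (number of documents)."""
    total = len(noun_sets)
    return {
        noun: sum(noun in nouns for nouns in noun_sets) / total
        for noun in vocabulary
    }


# Toy data: four conforming and four nonconforming documents.
conforming = [{"goal", "match"}, {"goal", "team"}, {"match"}, {"goal"}]
nonconforming = [{"team", "pitcher"}, {"weather"}, {"goal"}, {"pitcher"}]

# Only nouns that appear in the conforming documents are considered.
vocabulary = set().union(*conforming)

p_conf = appearance_probabilities(conforming, vocabulary)
p_nonconf = appearance_probabilities(nonconforming, vocabulary)

print(p_conf["goal"], p_nonconf["goal"])  # 0.75 0.25
```

Nouns that occur only in nonconforming documents (here "pitcher" and "weather") are deliberately left out, matching the statement that both probabilities are computed for the nouns appearing in the conforming documents.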

Next, the appearance probability P calculated for each noun is substituted into the fitness calculation formula shown in [Equation 1] below, and a threshold for the calculated fitness sim is determined. The combination of the appearance probabilities, the fitness calculation formula, and the threshold (in substance, a program) constitutes the specific category classifier for each term.

In [Equation 1] above, P(ki|R) is the probability that a document in the set R of conforming documents contains the noun ki, P(¬ki|R) is the probability that a document in R does not contain the noun ki, and ¬R is the complement of R (that is, the set of nonconforming documents). gi(dj) is a function that returns 1 when the document dj to be classified contains the noun ki and 0 otherwise, and α is a small value added so that the denominator does not become 0.
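The image of [Equation 1] is not reproduced in this text. The symbols defined above match the classic binary independence model from probabilistic retrieval, so one plausible form is sketched below. This is an assumption, not the patent's verified equation; in particular, where α is added is a guess based only on the remark that it keeps the denominator from becoming 0.

```latex
\mathrm{sim}(d_j) \;=\; \sum_{i} g_i(d_j)
\left[
  \log \frac{P(k_i \mid R)}{P(\neg k_i \mid R) + \alpha}
  \;+\;
  \log \frac{P(\neg k_i \mid \neg R)}{P(k_i \mid \neg R) + \alpha}
\right],
\qquad
P(\neg k_i \mid \cdot) = 1 - P(k_i \mid \cdot)
```

Under this reading, a noun that is frequent in conforming documents and rare in nonconforming documents contributes a large positive term, and only nouns actually present in the document dj contribute at all.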

The specific category classifier created in this way extracts the nouns from the input data and calculates the fitness sim for that set of nouns using [Equation 1] above. If the value exceeds the threshold, the input data is judged to belong to the specific category; otherwise, it is judged not to belong to it.
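The thresholded decision can be sketched as follows. Because the equation image is not available in this text, the scoring function assumes a binary-independence-style log-odds sum; adding `alpha` to both numerators and denominators is a further assumption made here so that the logarithm stays finite. The names `sim` and `belongs` and the probability values are illustrative.

```python
import math


def sim(doc_nouns, p_conf, p_nonconf, alpha=1e-6):
    """Sum log-odds weights over the nouns present in the document."""
    score = 0.0
    for noun, pc in p_conf.items():
        if noun not in doc_nouns:  # g_i(d_j) = 0: noun absent, no contribution
            continue
        pn = p_nonconf.get(noun, 0.0)
        # Assumed form: alpha smooths every ratio to avoid log(0) or x/0.
        score += math.log((pc + alpha) / (1.0 - pc + alpha))
        score += math.log((1.0 - pn + alpha) / (pn + alpha))
    return score


def belongs(doc_nouns, p_conf, p_nonconf, threshold):
    """The document belongs to the category if sim exceeds the threshold."""
    return sim(doc_nouns, p_conf, p_nonconf) > threshold


# Hypothetical appearance probabilities for a [soccer] classifier.
p_conf = {"goal": 0.75, "match": 0.5}
p_nonconf = {"goal": 0.25, "match": 0.1}

print(belongs({"goal", "stadium"}, p_conf, p_nonconf, threshold=1.0))
print(belongs({"weather"}, p_conf, p_nonconf, threshold=1.0))
```

A document containing none of the known nouns scores exactly 0, so any positive threshold rejects it.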

(2. Creation of an alternative category classifier)
Next, the creation of an alternative category classifier is described. An alternative category classifier is used as part of the category creation apparatus and determines into which of two categories input document data is classified. It differs from a specific category classifier in the conforming and nonconforming documents used when calculating the appearance probability P. For example, to create an alternative category classifier that distinguishes soccer from baseball, document data with the [soccer] tag is used as the conforming documents and document data with the [baseball] tag as the nonconforming documents. (The roles may be reversed, with [baseball] as the conforming documents and [soccer] as the nonconforming documents.) Restricting the source data for the probability calculation to two categories in this way is expected to give better accuracy than the specific category classifier when discriminating between those two categories. Given input data, the alternative category classifier operates in the same way as a specific category classifier and determines whether the data belongs to the category in question. Like the specific category classifier, the alternative category classifier is in substance a computer program, which functions as the alternative category classifier when executed by a computer.

First, for each category for which a specific category classifier has been created as described above, the conforming documents carrying the tag of each category are classified by the specific category classifier of every category, and the classification error rate is calculated for each category's conforming documents under each specific category classifier.

The error rate is defined as follows. When the specific category classifier of one category is applied to the conforming documents of a different category, it is the rate at which they are classified into the classifier's category. When it is applied to the conforming documents of its own category, it is the rate at which they are not classified into the classifier's category.

If, for the specific category classifier of a given category, the number of categories whose error rate exceeds a predetermined value (for example, 25%) exceeds a predetermined proportion (for example, 30% of the total number of categories), the classifier's category is judged to be ambiguous and is excluded from the category candidates. The specific category classifier created for it is then no longer needed.

If, for the specific category classifier of a given category, there are categories whose error rate is equal to or greater than the predetermined value but their number is at or below the predetermined proportion, an alternative category classifier is created to classify documents into either the classifier's category or the category of the conforming documents.

Like the specific category classifier, the alternative category classifier consists of a combination (in substance, a program) of the appearance probabilities P, the fitness calculation formula, and a threshold, as in [Equation 1] above. In the alternative category classifier, however, the set R′ of the other category is used in [Equation 1] in place of ¬R, the complement of R.

(3. Category creation method and apparatus)
Next, the category creation apparatus according to the present invention is described. FIG. 1 is a functional block diagram of the category creation apparatus according to the present invention. In FIG. 1, the document data storage means 10 stores tagged document data. The category classification means 20 includes a plurality of specific category classifiers, each of which judges whether data belongs to a particular category, and has the function of determining to which categories tagged document data belongs. The ambiguous category exclusion means 30 has the function of excluding ambiguous categories according to the classification results of the category classification means 20. The similar category exclusion means 40 has the function of further excluding similar categories from the remaining categories.

Next, the category creation method according to the present invention is described together with the processing of the category creation apparatus shown in FIG. 1. First, the category classification means 20 sequentially reads tagged document data from the document data storage means 10. The category classification means 20 has each specific category classifier judge each tagged document data item and classifies each item into the category candidates. For example, when there are three category candidates, [daily], [soccer], and [baseball], a predetermined number of tagged document data items with the [daily] tag, a predetermined number with the [soccer] tag, and a predetermined number with the [baseball] tag are read into the category classification means 20.

Here, the processing of tagged document data with the [daily] tag is described. For example, when 200 tagged document data items with the [daily] tag are to be processed, each item is classified by the [daily] specific category classifier. The rate at which the items with the [daily] tag are not classified into the [daily] category is obtained as the error rate. For example, if 95 of the 200 items are judged to belong to the [daily] category and 105 are judged not to, the error rate is 105/200.

Similarly, the 200 tagged document data items with the [daily] tag are classified by the [soccer] specific category classifier. Here the tag of the document data differs from the category candidate of the specific category classifier, so, in contrast to the case where they coincide, the rate at which the items with the [daily] tag are classified into the [soccer] category is obtained as the error rate. For example, if 60 of the 200 items are judged to belong to the [soccer] category and the remaining 140 are judged not to, the error rate is 60/200. In the same way, the 200 items with the [daily] tag are classified by the [baseball] specific category classifier.

Likewise, tagged document data with the [soccer] tag and tagged document data with the [baseball] tag are classified by the [daily], [soccer], and [baseball] specific category classifiers. In this way, an error rate is calculated for every combination of tagged document data carrying a category candidate's tag and a category candidate's specific category classifier. FIG. 2 shows an example of the error rates calculated in this way. To distinguish an error rate calculated with the specific category classifiers from one calculated with the alternative category classifiers described later, the former is called the "specific error rate."
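The calculation of the specific error rate for every classifier and tag combination can be sketched as below. The function name and the representation of classifiers as callables are assumptions; the stand-in classifiers simply accept the first 95 and 60 of the 200 [daily] documents to reproduce the counts used in the text.

```python
def specific_error_rates(classifiers, docs_by_tag):
    """Return {(classifier_category, document_tag): error_rate}.

    Same category:      error = share of documents NOT accepted.
    Different category: error = share of documents accepted.
    """
    rates = {}
    for category, accepts in classifiers.items():
        for tag, docs in docs_by_tag.items():
            accepted = sum(1 for d in docs if accepts(d))
            wrong = len(docs) - accepted if category == tag else accepted
            rates[(category, tag)] = wrong / len(docs)
    return rates


# 200 documents tagged [daily], represented only by an index.
docs_by_tag = {"daily": list(range(200))}
classifiers = {
    "daily": lambda d: d < 95,   # accepts 95 of 200, rejects 105
    "soccer": lambda d: d < 60,  # accepts 60 of 200
}

rates = specific_error_rates(classifiers, docs_by_tag)
print(rates[("daily", "daily")])   # 0.525  (105/200)
print(rates[("soccer", "daily")])  # 0.3    (60/200)
```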

The above example assumes that the specific category classifiers were created in advance. Alternatively, the category classification means 20 may read a larger number of document data items, create the plurality of specific category classifiers itself, and use them to calculate the specific error rates, performing all of this as one continuous process.

Once the specific error rates have been calculated, the ambiguous category exclusion means 30 excludes ambiguous category candidates. An ambiguous category candidate is one whose term has an ambiguous meaning, so that the body text of its documents ranges widely and a category built on it would be unsuitable for search. With the specific category classifier of an ambiguous category candidate, the specific error rate tends to be high for tagged document data carrying the tag of any category candidate. The ambiguous category exclusion means 30 therefore looks at the specific category classifier of each category candidate and excludes that candidate when the combinations whose specific error rate is equal to or greater than a predetermined value make up a predetermined proportion or more. For example, when "0.4 (80/200)" is set as the predetermined value for the specific error rate and "0.7" as the predetermined proportion, all three combinations for the [daily] specific category classifier (a proportion of "1") have a specific error rate of 0.4 or more, so [daily] is treated as an ambiguous category candidate and excluded. The example of FIG. 2 assumes a small number of category candidates for convenience of explanation and therefore uses a predetermined value of "0.4" and a predetermined proportion of "0.7"; in practice, about one million tagged document data items are used and there are several hundred category candidates, so the predetermined value is set to "0.25" and the predetermined proportion to "0.3".
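The exclusion rule for ambiguous category candidates can be sketched as follows. FIG. 2 is not available in this text, so the error rates below are made-up illustrative values (only 105/200 and 60/200 for the [daily] documents come from the description); the function name is hypothetical, and the comparison uses "equal to or greater than" as in this passage.

```python
def ambiguous_categories(rates, categories, rate_threshold, share_threshold):
    """Return the categories whose classifier errs on too many tags."""
    ambiguous = set()
    for c in categories:
        # Count the tags for which classifier c has a high error rate.
        high = sum(1 for t in categories if rates[(c, t)] >= rate_threshold)
        if high / len(categories) >= share_threshold:
            ambiguous.add(c)
    return ambiguous


categories = ["daily", "soccer", "baseball"]
# Keys are (classifier category, document tag); values are illustrative.
rates = {
    ("daily", "daily"): 0.525, ("daily", "soccer"): 0.45, ("daily", "baseball"): 0.50,
    ("soccer", "daily"): 0.30, ("soccer", "soccer"): 0.10, ("soccer", "baseball"): 0.42,
    ("baseball", "daily"): 0.20, ("baseball", "soccer"): 0.41, ("baseball", "baseball"): 0.15,
}

removed = ambiguous_categories(rates, categories, rate_threshold=0.4, share_threshold=0.7)
print(removed)  # {'daily'}
```

With these values the [soccer] and [baseball] classifiers each have one high-error pairing out of three, below the 0.7 proportion, so they survive; that remaining high-error pairing is what would trigger the similar-category check.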

Next, the similar category exclusion means 40 excludes one of each pair of mutually similar category candidates. Specifically, when, among the category candidates remaining after the ambiguous ones have been excluded, there is a pairing whose specific error rate is equal to or greater than the predetermined value, the alternative category classifier for those two category candidates is used to classify the tagged document data carrying the tags of the two candidates. The rate at which the items carrying one tag are classified into the other category is obtained as the "alternative error rate." For example, suppose an alternative category classifier that chooses between [soccer] and [baseball] classifies 200 tagged document data items with the [soccer] tag and 200 with the [baseball] tag. If 23 of the 200 [soccer] items are judged to belong to the [baseball] category and 177 to the [soccer] category, the alternative error rate is 23/200. FIG. 3(a) shows an example of the alternative error rates calculated in this way.

When either alternative error rate is at or above a predetermined value, the similar category exclusion means 40 treats the two as mutually similar category candidates and excludes the one with the higher alternative error rate from the category candidates. For example, when “0.4 (80/200)” is set as the predetermined value for judging the alternative error rate, both alternative error rates in the example of FIG. 3(a) are below 0.4, so neither is excluded and both remain as category candidates. The category candidates that finally remain in this way are determined as the categories. The rule for excluding one of two similar category candidates need not be limited to excluding the one with the higher alternative error rate; conversely, the one with the lower alternative error rate may be excluded. Alternatively, the document data storage means 10 may be consulted and the category candidate corresponding to the tag with fewer recorded tagged document data items may be excluded.

FIG. 3(b) shows an example of alternative error rates for category candidates judged to be similar. In the example of FIG. 3(b), both alternative error rates are 0.4 or more, so “TV”, which has the lower alternative error rate, is excluded from the category candidates as a similar category candidate. In this case, “テレビ” (terebi, the Japanese word for television) is determined as the category.
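The exclusion rule for a pair of similar candidates can be sketched as follows. This is a minimal Python illustration, not the patented implementation: the function name, the `drop` parameter, and every rate other than 23/200 are hypothetical, since the values in FIG. 3 are not reproduced in the text. The sketch treats a pair as similar when either rate is at or above the threshold, and supports both the "higher" and the "lower" exclusion rule mentioned in the description.

```python
def candidate_to_exclude(rates, threshold=0.4, drop="higher"):
    """Pick which of two similar category candidates to exclude, if any.

    rates maps each of the two candidate names to its alternative error rate.
    Returns the name to exclude, or None when both rates are below threshold.
    """
    if len(rates) != 2:
        raise ValueError("expected exactly two category candidates")
    if drop not in ("higher", "lower"):
        raise ValueError("drop must be 'higher' or 'lower'")
    if all(rate < threshold for rate in rates.values()):
        return None  # not judged similar: keep both candidates
    pick = max if drop == "higher" else min
    return pick(rates, key=rates.get)


# 23/200 is the [soccer] rate from the text; the other rates are hypothetical.
print(candidate_to_exclude({"soccer": 23 / 200, "baseball": 0.10}))
print(candidate_to_exclude({"TV": 0.42, "テレビ": 0.48}, drop="lower"))
```

With these inputs the first call keeps both candidates and the second excludes "TV", matching the outcomes described for FIG. 3(a) and FIG. 3(b). How ties are broken is not specified in the text; here the first candidate in the mapping wins.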

In the above example, the alternative category classifiers were created in advance. Alternatively, without creating them in advance, the similar category exclusion means 40 may create one or more alternative category classifiers and then use them to exclude similar category candidates, all as one continuous process.

Also, in the above example, similar category candidates were excluded by the similar category exclusion means 40 after the ambiguous category exclusion means 30 had excluded category candidates. In the present invention, however, excluding similar category candidates is not essential; excluding the ambiguous category candidates alone is sufficient to create appropriate categories.

(4. Document classification method and apparatus)
Next, the document classification apparatus according to the present invention will be described. FIG. 4 is a functional block diagram of the document classification apparatus according to the present invention. In FIG. 4, the document data storage means 11 stores tagged document data and untagged document data. The category classification means 21 has a plurality of specific category classifiers, each of which determines whether or not data belongs to a particular category, and has the function of determining which category the tagged document data belongs to. The classification specifying means 50 has a plurality of alternative category classifiers, each of which determines which of two categories data belongs to. When the category classification means 21 has classified a document into a plurality of categories and an alternative category classifier exists for two of those categories, the classification specifying means 50 decides which of the two the document belongs to.

Next, the document classification method according to the present invention will be described together with the processing operation of the document classification apparatus shown in FIG. 4. First, the classification specifying means 50 sequentially reads document data from the document data storage means 11. Because the document classification apparatus of FIG. 4 aims to classify all document data into categories regardless of whether tags are attached, both tagged and untagged document data are read from the document data storage means 11. The classification specifying means 50 then has each specific category classifier judge each document data item, classifying each item into the categories. As a result, some document data items are classified into a plurality of categories, while others are not classified into any category.

For each document data item classified into a plurality of categories, if an alternative category classifier exists for two of those categories, the item is classified by that alternative category classifier. In this way, document data classified into a plurality of categories is narrowed down toward the most suitable category.
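The two-stage flow described above (specific category classifiers first, then pairwise narrowing) might be sketched in Python as follows. All names here are hypothetical, and the keyword-based toy classifiers merely stand in for the trained classifiers of the description; the sketch assumes that narrowing repeats until no remaining pair of categories has an alternative classifier.

```python
def classify_document(doc, specific, alternative):
    """Two-stage classification sketch.

    specific:    {category: function(doc) -> bool}
    alternative: {frozenset({cat_a, cat_b}): function(doc) -> cat_a or cat_b}
    Returns the set of categories left after pairwise narrowing.
    """
    # Stage 1: every specific category classifier judges the document.
    categories = {cat for cat, accepts in specific.items() if accepts(doc)}

    # Stage 2: while two surviving categories share an alternative
    # classifier, let it choose one and drop the other.
    narrowed = True
    while narrowed:
        narrowed = False
        for pair, choose in alternative.items():
            if pair <= categories:
                winner = choose(doc)
                if winner not in pair:
                    raise ValueError("alternative classifier returned an unknown category")
                categories -= pair - {winner}
                narrowed = True
                break
    return categories


# Toy keyword classifiers (assumptions for illustration only).
specific = {
    "soccer": lambda d: "goal" in d or "match" in d,
    "baseball": lambda d: "pitcher" in d or "match" in d,
}
alternative = {
    frozenset({"soccer", "baseball"}):
        lambda d: "soccer" if d.count("goal") >= d.count("pitcher") else "baseball",
}

print(classify_document("a match decided by a late goal", specific, alternative))
print(classify_document("quarterly earnings report", specific, alternative))
```

The first document is accepted by both toy classifiers and then narrowed to a single category; the second is accepted by none and is left unclassified, mirroring the two outcomes the description mentions.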

Document data whose category has been determined by the document classification apparatus is registered in association with that category and can be searched by category. This enables more efficient searching than with the tags that individual viewers originally assigned subjectively. The originally assigned tags can also be kept available for searching, in which case the user can choose between searching by category and searching by tag depending on the purpose.

(5. Actual application example)
The category creation method, category creation apparatus, document classification method, and document classification apparatus described above can each be realized by having a single computer execute a dedicated program. When blog site document data is used as the document data, they can also be realized on a computer network made up of a plurality of computers. In that case, a plurality of storage means in which the document data of the individual blogs is recorded serve as the document data storage means 10; document data is collected over the network from these storage means on the Internet and processed by the specific category classifier 20.

An experiment was conducted with the category creation method and category creation apparatus according to the present invention on 1.8 million document data items actually collected from the Internet, using tags that appear in 400 or more items as category candidates. FIG. 5 shows the tags selected as categories and the tags excluded. The results in FIG. 5 show that tags with ambiguous meaning, which a person would be unlikely to regard as categories, are also excluded by the present method and apparatus. Tags such as “TV” and “game”, for which other tags expressing a similar concept (“テレビ”, “ゲーム”) exist, are likewise excluded.

To verify the effect of classification specification using the alternative category classifiers, the apparatus was adjusted so that recall was about 60%, and precision was compared with and without similar category exclusion. The result is shown in FIG. 6. Here, recall is the number of documents for which the human classification and the apparatus classification agree, divided by the total number of documents classified by the human; precision is the number of documents for which the two agree, divided by the total number of documents classified by the apparatus. As FIG. 6 shows, precision improves by about 7% when classification specification using the alternative category classifiers is performed.
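The definitions of recall and precision given above translate directly into code. In this Python sketch, each classification is represented as a set of (document id, category) pairs; that representation, the function name, and the toy data are assumptions for illustration and are unrelated to the experiment in FIG. 6.

```python
def recall_and_precision(human, machine):
    """human and machine are sets of (document_id, category) pairs.

    recall    = agreeing pairs / pairs classified by the human
    precision = agreeing pairs / pairs classified by the apparatus
    """
    if not human or not machine:
        raise ValueError("both classifications must be non-empty")
    matched = len(human & machine)
    return matched / len(human), matched / len(machine)


# Hypothetical classifications of five documents.
human = {(1, "soccer"), (2, "baseball"), (3, "TV"), (4, "soccer"), (5, "game")}
machine = {(1, "soccer"), (2, "soccer"), (3, "TV"), (4, "soccer")}

recall, precision = recall_and_precision(human, machine)
print(recall, precision)  # 3 of 5 human pairs matched, 3 of 4 machine pairs matched
```

Here three pairs agree, so recall is 3/5 and precision is 3/4. Document 5 lowers recall because the apparatus left it unclassified, and document 2 lowers precision because the apparatus assigned a category the human did not.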

In the above embodiment, the specific category classifiers and alternative category classifiers are created with a probabilistic method. However, many methods have been proposed for creating binary classifiers such as these using pattern recognition techniques such as support vector machines (SVM) and neural networks. In the present invention, specific category classifiers and alternative category classifiers created by machine learning, with relevant and non-relevant documents as training data, may be used instead.

FIG. 1 is a functional block diagram of the category creation apparatus according to the present invention.
FIG. 2 is a diagram showing an example of the specific error rate for combinations of tagged document data and specific category classifiers.
FIG. 3 is a diagram showing an example of the alternative error rate calculated using an alternative category classifier.
FIG. 4 is a functional block diagram of the document classification apparatus according to the present invention.
FIG. 5 is a diagram showing experimental results obtained with the category creation method and category creation apparatus according to the present invention.
FIG. 6 is a diagram showing experimental results obtained with the document classification method and document classification apparatus according to the present invention.

Explanation of symbols

10, 11 ... Document data storage means
20, 21 ... Category classification means
30 ... Ambiguous category exclusion means
40 ... Similar category exclusion means
50 ... Classification specifying means

Claims

A method in which a computer creates categories for classifying document data using tagged document data to which search terms are assigned as tags,
Setting category candidates from the terms assigned as the tags;
Determining whether the tagged document data belongs to each category candidate, using specific category classifiers that each determine whether or not data belongs to a category candidate;
Calculating, from the results of the determination by the specific category classifiers, a specific error rate, which is an error rate for each relationship between the specific category classifier of each category candidate and the tagged document data having each category candidate as a tag, based on the number of tagged document data items determined to belong to a category candidate different from that of their assigned tag and the number not determined to belong to the category candidate of their assigned tag;
Excluding, from the category candidates, the category candidate of a specific category classifier having a relationship in which the calculated specific error rate is greater than or equal to a predetermined value;
A category creation method characterized by comprising:

After excluding from the category candidates the category candidate of a specific category classifier having a relationship in which the calculated specific error rate is greater than or equal to the predetermined value,
Classifying, for the remaining category candidates, the tagged document data having those category candidates as tags, using an alternative category classifier that determines which of two categories, between which the specific error rate is greater than or equal to a predetermined value, the data belongs to;
Calculating, from the results of the classification by the alternative category classifier, an alternative error rate, which is an error rate for each relationship between the specific category classifier of each category candidate and the tagged document data having each category candidate as a tag, based on the number classified into a category candidate different from that of the assigned tag and the number not classified into the category candidate of the assigned tag; and
If either of the calculated alternative error rates is greater than or equal to a predetermined value, excluding one of the two category candidates from the category candidates;
The category creation method according to claim 1, further comprising:

A method for classifying document data into a plurality of categories,
Classifying the document data into categories using specific category classifiers that each determine whether or not data belongs to a category;
When the classification using the specific category classifiers yields document data that belongs to a plurality of categories, classifying that document data into one of the categories using an alternative category classifier that determines which of two categories among the plurality of categories the document data belongs to;
A document classification method characterized by comprising:

A device for creating a category for classifying document data by using tagged document data to which search terms are assigned as tags,
Category classification means that has a plurality of specific category classifiers, each determining whether or not data belongs to a category candidate set in advance, and that classifies tagged document data having the category candidates as tags;
Ambiguous category exclusion means that calculates, from the results of the classification by the category classification means, a specific error rate, which is an error rate for each relationship between the specific category classifier of each category candidate and the tagged document data having each category candidate as a tag, based on the number classified into a category candidate different from that of the assigned tag and the number not classified into the category candidate of the assigned tag, and that excludes from the category candidates the category candidate of a specific category classifier for which the number of relationships whose calculated specific error rate exceeds a predetermined value exceeds a predetermined ratio;
A category creating apparatus characterized by comprising:

The category creation apparatus according to claim 4, further comprising similar category exclusion means that: has, for the category candidates remaining after the ambiguous category exclusion means has excluded category candidates, an alternative category classifier that determines which of two categories, between which the specific error rate is greater than or equal to a predetermined value, data belongs to; classifies the tagged document data having those category candidates as tags using the alternative category classifier; calculates, from the results of the classification by the alternative category classifier, an alternative error rate, which is an error rate for each relationship between the specific category classifier of each category candidate and the tagged document data having each category candidate as a tag, based on the number classified into a category candidate different from that of the assigned tag and the number not classified into the category candidate of the assigned tag; and, when a calculated alternative error rate is greater than or equal to a predetermined value, excludes one of the category candidates from the category candidates.

An apparatus for classifying document data into a plurality of categories,
Category classification means for classifying the document data into categories using specific category classifiers that each determine whether or not data belongs to a category; and
Classification specifying means for, when the classification by the category classification means yields document data that belongs to a plurality of categories, classifying that document data into one of the categories using an alternative category classifier that determines which of two categories among the plurality of categories the document data belongs to;
A document classification apparatus comprising: