JP2002007433A

JP2002007433A - Information sorter, information sorting method, computer readable recording medium recorded with information sorting program and information sorting program

Info

Publication number: JP2002007433A
Application number: JP2001111942A
Authority: JP
Inventors: Yoshinori Katayama; 佳則片山; Kanji Uchino; 寛治内野; Norihiko Sakamoto; 憲彦坂本; Tatsu Shibata; 竜柴田
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2000-04-17
Filing date: 2001-04-10
Publication date: 2002-01-11
Anticipated expiration: 2021-04-10
Also published as: JP4017354B2

Abstract

PROBLEM TO BE SOLVED: To enhance sorting precision regardless of the contents and quantity of the information to be sorted. SOLUTION: This information sorter is provided with a feature element extracting part 40 that extracts feature elements for every sorting category from each of the plural sample texts included in sorting sample data 30, in which a group 10 of sample texts is associated in advance with plural sorting categories; a sorting method deciding part 50 that decides, based on the sorting sample data 30, the sorting method with the highest sorting precision from among plural sorting methods; a sort learning information generating part 60 that generates sorting learning information 70 showing the features of every sorting category, based on the feature elements extracted by the feature element extracting part 40 and according to the sorting method decided by the sorting method deciding part 50; and an automatic sorting part 90 that sorts a group 80 of new texts to be sorted into the sorting categories according to the sorting method decided by the sorting method deciding part 50 and the sorting learning information 70.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to an information classification device, an information classification method, a computer-readable recording medium on which an information classification program is recorded, and an information classification program, all used for classifying large amounts of text information and the like. In particular, it relates to an information classification device, an information classification method, a computer-readable recording medium on which an information classification program is recorded, and an information classification program that can improve classification accuracy and efficiency by selecting, from a plurality of classification methods, the one with the highest classification accuracy.

[0002] Recently, the Internet has made it easy to obtain enormous amounts of text information. This has created a need for technology that grasps the content of such large volumes of text information and efficiently extracts the necessary text information from them. The reason is that, once the text information has been classified into predetermined classification categories, it is convenient both for later searches when the text information is used and for finding related text information.

[0003] Conventionally, a person in charge of classification, or the creator or user of the text information, judged the content of each new piece of text information and manually classified it into the most suitable classification category in a classification system consisting of a plurality of classification categories. In another classification method, a computer system analyzes the content of new text information and, based on the analysis result, automatically classifies the text information into the corresponding classification category. The former method is very costly, and the latter has problems with the number of classification categories and the classification accuracy needed to obtain practical results. Means and methods for effectively solving these problems have therefore long been desired.

[0004]

2. Description of the Related Art

Now that large amounts of digitized text information are in circulation, classification based on the semantic content of text information has become an important issue from the standpoint of efficient retrieval and use of that information. As a means of addressing this issue, information classification devices that automatically perform the classification of text information have been used in various fields.

[0005] Various methods that derive a classification method for text information from given classification examples and then classify new text information based on that method have been disclosed in Japanese Patent Application Laid-Open Nos. H11-328211, H11-296552, H11-167581, and H11-161671, among others. The conventional classification methods are listed in items (1) to (3) below.

(1) A statistical classification method based on a probabilistic model
(2) A classification method that performs automatic classification through learning
(3) A classification method that creates rules for classifying text information into each classification category and performs automatic classification using those rules

[0006] The classification method of item (1) can find general classification tendencies but cannot find fine-grained ones. The classification method of item (2) achieves high classification accuracy when the number of classification categories is below several tens, but its accuracy drops when the number grows to several tens or more. The classification method of item (3) incurs a great deal of cost for creating and maintaining the rules. Thus, the classification methods of items (1) to (3) each have advantages and disadvantages.

[0007] FIG. 18 is a block diagram showing the configuration of a conventional information classification device. In this figure, classification sample data 2 is correct-answer data for classification, consisting of a plurality of texts for which it has been determined in advance which text is classified into which classification category. The feature element extraction unit 1 extracts, from each text in the classification sample data 2, feature elements (words) that represent the features of each classification category.

[0008] In extracting feature elements, it is necessary to efficiently extract feature elements that improve the ability to discriminate between classification categories. The feature element extraction unit 1 therefore uses a feature element extraction method that improves this discrimination ability on the basis of the frequency of occurrence of the feature elements. A number of such feature element extraction methods have been proposed. As for the attributes of the feature elements, approaches such as specifying several parts of speech are also used.

[0009] The classification learning information generation unit 3 calculates the features of each classification category from the feature elements extracted by the feature element extraction unit 1 and generates classification learning information 4 as the result. A number of classification learning methods have been proposed for use in the classification learning information generation unit 3. The classification learning information 4 is information representing the correspondence between the state of the feature elements and the classification categories. The automatic classification unit 5 classifies a new text group 6, which consists of a plurality of texts to be classified, into the classification categories based on the classification learning information 4, using a single classification method that is fixed in advance, and outputs classification result data 7.
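The conventional flow of FIG. 18 (extraction, learning information, classification by one fixed method) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the choice of "most frequent words" as the extraction method, the relative-frequency weights used as learning information, and the sample texts are hypothetical and are not taken from the specification.

```python
from collections import Counter

def extract_feature_elements(samples, top_n=2):
    """Assumed fixed extraction method: the top_n most frequent words per category."""
    counts = {}
    for text, category in samples:
        counts.setdefault(category, Counter()).update(text.lower().split())
    return {cat: [w for w, _ in c.most_common(top_n)] for cat, c in counts.items()}

def generate_learning_info(samples, features):
    """Assumed learning information: relative frequency of each feature word per category."""
    info = {}
    for cat, words in features.items():
        c = Counter()
        for text, category in samples:
            if category == cat:
                c.update(w for w in text.lower().split() if w in words)
        total = sum(c.values())
        info[cat] = {w: c[w] / total for w in words}
    return info

def classify(text, info):
    """Assumed fixed classification method: highest weighted overlap with feature words."""
    tokens = Counter(text.lower().split())
    scores = {cat: sum(wt * tokens[w] for w, wt in weights.items())
              for cat, weights in info.items()}
    return max(scores, key=scores.get)

# Hypothetical classification sample data: texts with known categories.
SAMPLES = [
    ("stock prices rise on the market", "finance"),
    ("bank stock prices fall", "finance"),
    ("team wins soccer match", "sports"),
    ("soccer team scores goal", "sports"),
]

features = extract_feature_elements(SAMPLES)
learning_info = generate_learning_info(SAMPLES, features)
results = [classify(t, learning_info) for t in ["stock prices today", "soccer team news"]]
```

Because both the extraction method and the classification method are hard-coded here, the sketch has the limitation the specification attributes to the conventional device: nothing adapts to the content or amount of the texts being classified.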

[0010]

PROBLEMS TO BE SOLVED BY THE INVENTION

As noted above, a number of feature element extraction methods are available for the feature element extraction unit 1 of the conventional information classification device (see FIG. 18). However, because the classification accuracy of the classification result data 7 varies with the content and amount of the new text group 6 to be classified, it is difficult to single out one universal extraction method that maintains high classification accuracy for new text groups 6 of every content and amount.

[0011] Similarly, although a number of classification learning methods are available for the classification learning information generation unit 3, the classification accuracy of the classification result data 7 varies with the content and amount of the new text group 6, so it is difficult to single out one universal classification learning method that maintains high classification accuracy. For this reason, conventional information classification devices have had no choice but to use one fixed classification method (a feature element extraction method and a classification learning method) out of the many available.

[0012] Because the conventional information classification device classifies the new text group 6 with a single fixed classification method, the classification accuracy varies with the content and amount of the new text group 6, and the classification accuracy is consequently low.

[0013] The present invention has been made in view of the above, and its object is to provide an information classification device, an information classification method, a computer-readable recording medium on which an information classification program is recorded, and an information classification program that can improve classification accuracy regardless of the content and amount of the information to be classified.

[0014]

MEANS FOR SOLVING THE PROBLEMS

To achieve the above object, the invention according to claim 1 comprises: feature element extraction means for extracting feature elements for each classification category from each of a plurality of sample texts included in classification sample information in which the plurality of sample texts and a plurality of classification categories are associated in advance; classification method determination means for determining, based on the classification sample information, the classification method with the highest classification accuracy from among a plurality of classification methods; classification learning information generation means for generating classification learning information representing the features of each classification category, based on the feature elements extracted by the feature element extraction means and in accordance with the classification method determined by the classification method determination means; and classification means for classifying a new text group to be classified into the classification categories in accordance with the classification method determined by the classification method determination means and the classification learning information.

[0015] According to this invention, a plurality of classification methods are kept available, the classification method determination means determines, based on the classification sample information, the classification method with the highest classification accuracy among them, and the new text group is then classified into the classification categories according to that method. Classification accuracy can therefore be improved over the prior art regardless of the content and amount of the information to be classified.

[0016] The invention according to claim 2 is the information classification device according to claim 1, wherein the feature element extraction means extracts feature elements by each of a plurality of feature element extraction methods, selects from among those methods, based on the extraction results, a feature element extraction method with a high ability to discriminate between classification categories, and takes the feature elements corresponding to the selected method as the extraction result.

[0017] According to this invention, a plurality of feature element extraction methods are kept available in the feature element extraction means, feature elements are extracted by each of those methods, and the feature elements from a method with a high ability to discriminate between classification categories are taken as the extraction result. The classification accuracy of the classification result obtained with those feature elements can therefore be further improved.
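Selecting among several feature element extraction methods might be sketched as follows. The two extraction methods, the `discrimination` score (the share of feature words that belong to exactly one category), and the sample texts are all hypothetical stand-ins chosen for illustration; the specification does not define the extraction methods or the measure of discrimination ability at this point.

```python
from collections import Counter

def by_frequency(docs_by_cat, top_n):
    """Hypothetical method A: the most frequent words in each category."""
    return {cat: {w for w, _ in Counter(" ".join(docs).split()).most_common(top_n)}
            for cat, docs in docs_by_cat.items()}

def by_relative_frequency(docs_by_cat, top_n):
    """Hypothetical method B: words concentrated in one category and rare elsewhere."""
    counts = {cat: Counter(" ".join(docs).split()) for cat, docs in docs_by_cat.items()}
    total = Counter()
    for c in counts.values():
        total.update(c)
    result = {}
    for cat, c in counts.items():
        # Rank by share of occurrences in this category, then count, then word.
        ranked = sorted(c, key=lambda w: (-c[w] / total[w], -c[w], w))
        result[cat] = set(ranked[:top_n])
    return result

def discrimination(features):
    """Assumed score: share of feature words that belong to exactly one category."""
    owners = Counter(w for words in features.values() for w in words)
    return sum(1 for n in owners.values() if n == 1) / len(owners)

def select_extraction(docs_by_cat, methods, top_n=2):
    """Run every method and keep the one with the highest discrimination score."""
    best_name, best_features, best_score = None, None, -1.0
    for name, method in methods.items():
        features = method(docs_by_cat, top_n)
        score = discrimination(features)
        if score > best_score:
            best_name, best_features, best_score = name, features, score
    return best_name, best_features, best_score

# Hypothetical sample texts grouped by classification category.
DOCS = {
    "finance": ["the stock price the market", "the bank the stock"],
    "sports": ["the team the match", "the goal the team"],
}
METHODS = {"frequency": by_frequency, "relative_frequency": by_relative_frequency}

chosen, chosen_features, chosen_score = select_extraction(DOCS, METHODS)
```

In this toy data the plain frequency method picks the common word "the" for both categories, so the relative-frequency method scores higher and its feature elements become the extraction result.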

[0018] The invention according to claim 3 is the information classification device according to claim 1, further comprising editing means for editing the feature elements extracted by the feature element extraction means.

[0019] According to this invention, the editing means allows the extracted feature elements to be edited (deleted, added, and so on), so feature elements can be set flexibly for each classification category.

[0020] The invention according to claim 4 is the information classification device according to any one of claims 1 to 3, wherein the classification method determination means determines the classification method with the highest classification accuracy from among the plurality of classification methods by cross-validation.

[0021] According to this invention, a plurality of classification methods are kept available, the classification method determination means determines by cross-validation, based on the classification sample information, the classification method with the highest classification accuracy among them, and the new text group is then classified into the classification categories according to that method. Classification accuracy can therefore be improved over the prior art regardless of the content and amount of the information to be classified.
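Choosing a classification method by cross-validation might look like the following Python sketch. It assumes a simple k-fold split of the classification sample information and two hypothetical candidate methods (`train_majority` and `train_word_overlap`); the specification names cross-validation here but does not prescribe the fold layout or the candidate methods, so those details are illustrative only.

```python
from collections import Counter

def train_majority(train):
    """Hypothetical method 1: always predict the most common training category."""
    label = Counter(cat for _, cat in train).most_common(1)[0][0]
    return lambda text: label

def train_word_overlap(train):
    """Hypothetical method 2: predict the category that shares the most words."""
    vocab = {}
    for text, cat in train:
        vocab.setdefault(cat, Counter()).update(text.split())
    def predict(text):
        words = text.split()
        return max(vocab, key=lambda cat: sum(vocab[cat][w] for w in words))
    return predict

def cross_validate(samples, train_fn, k):
    """Mean accuracy over k folds; fold i holds out samples i, i+k, i+2k, ..."""
    accuracies = []
    for i in range(k):
        held_out = samples[i::k]
        train = [s for j, s in enumerate(samples) if j % k != i]
        predict = train_fn(train)
        correct = sum(predict(text) == cat for text, cat in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k

def choose_method(samples, methods, k=3):
    """Return the name of the method with the highest cross-validated accuracy."""
    scores = {name: cross_validate(samples, fn, k) for name, fn in methods.items()}
    return max(scores, key=scores.get), scores

# Hypothetical classification sample information (text, category).
SAMPLES = [
    ("stock market rises", "finance"),
    ("soccer team wins", "sports"),
    ("bank stock falls", "finance"),
    ("team scores goal", "sports"),
    ("market bank rates", "finance"),
    ("soccer goal match", "sports"),
]
METHODS = {"majority": train_majority, "word_overlap": train_word_overlap}

best, scores = choose_method(SAMPLES, METHODS, k=3)
```

The selected method would then be retrained on the full classification sample information before being applied to the new text group; the sketch stops at the selection step.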

[0022] The invention according to claim 5 is the information classification device according to any one of claims 1 to 4, further comprising designation means for designating the portions to be classified in the sample information and in the new text group.

[0023] According to this invention, the designation means designates the portions to be classified in the classification sample information and in the new text group. Portions that are unnecessary for classification can thus be excluded and classification performed only on the essential portions, which further improves classification accuracy.

[0024] The invention according to claim 6 is the information classification device according to any one of claims 1 to 5, further comprising clustering means for clustering a plurality of sample texts to generate the classification sample information in which the plurality of sample texts and the plurality of classification categories are associated.

[0025] According to this invention, the classification sample information is generated by the clustering means. Compared with manually generating classification categories from a plurality of sample texts, efficiency is greatly improved and the user's workload is reduced.

[0026] The invention according to claim 7 is the information classification device according to any one of claims 1 to 5, further comprising: clustering means for clustering the classification sample information; comparison means for comparing the clustering result of the clustering means with a desired clustering result; and changing means for changing the classification sample information as necessary based on the comparison result of the comparison means.

[0027] According to this invention, the clustering result of the clustering means is compared with a desired clustering result, and when the comparison shows, for example, a mismatch, the changing means can change the classification sample information. The new text group can then be classified based on more complete classification sample information, so classification accuracy can be made extremely high.
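The comparison and change steps can be sketched in Python under two stated assumptions: the clustering result is already available as one cluster id per sample, and the "desired" result is taken to be the category currently assigned to each sample. The function names and the `approve` callback, which stands in for the user's decision, are hypothetical.

```python
from collections import Counter, defaultdict

def find_mismatches(sample_labels, cluster_ids):
    """List (index, current label, cluster majority label) where the two disagree."""
    by_cluster = defaultdict(list)
    for label, cid in zip(sample_labels, cluster_ids):
        by_cluster[cid].append(label)
    majority = {cid: Counter(labels).most_common(1)[0][0]
                for cid, labels in by_cluster.items()}
    return [(i, label, majority[cid])
            for i, (label, cid) in enumerate(zip(sample_labels, cluster_ids))
            if label != majority[cid]]

def apply_changes(sample_labels, mismatches, approve):
    """Return a changed copy of the labels; approve(i, old, new) decides each change."""
    changed = list(sample_labels)
    for i, old, new in mismatches:
        if approve(i, old, new):
            changed[i] = new
    return changed

# Hypothetical sample categories and a hypothetical clustering result.
labels = ["finance", "finance", "sports", "sports", "finance"]
clusters = [0, 0, 1, 1, 1]

mismatches = find_mismatches(labels, clusters)
updated = apply_changes(labels, mismatches, approve=lambda i, old, new: True)
```

Here the fifth sample is labeled "finance" but falls in a cluster dominated by "sports", so it is reported as a mismatch and, once approved, relabeled in the changed copy.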

[0028] The invention according to claim 8 is the information classification device according to any one of claims 1 to 7, further comprising clustering result display means for clustering the new text group in the classification result of the classification means and displaying the clustering result.

[0029] According to this invention, the clustering result display means displays the clustering result, so the user can easily grasp the distribution of the classification results.

[0030] The invention according to claim 9 is the information classification device according to any one of claims 1 to 8, further comprising optimization means for optimizing the classification result of the classification means, wherein the classification learning information generation means regenerates the classification learning information based on the optimized classification result, and the classification means classifies the new text group to be classified into the classification categories in accordance with the classification method determined by the classification method determination means and the regenerated classification learning information.

[0031] According to this invention, the classification learning information is regenerated based on the classification result optimized by the optimization means, and the new text group is classified again according to the regenerated classification learning information, so classification accuracy can be further improved.

[0032] The invention according to claim 10 is the information classification device according to claim 9, further comprising difference recognition information display means for displaying the difference between the classification result before the optimization and the classification result after the optimization as visually recognizable difference recognition information.

[0033] According to this invention, the difference between the classification results before and after optimization is displayed as difference recognition information so that the user can recognize it at a glance. The user can therefore respond to the difference quickly, which in turn improves classification accuracy.

[0034] The invention according to claim 11 includes: a feature element extraction step of extracting feature elements for each classification category from each of a plurality of sample texts included in classification sample information in which the plurality of sample texts and a plurality of classification categories are associated in advance; a classification method determination step of determining, based on the classification sample information, the classification method with the highest classification accuracy from among a plurality of classification methods; a classification learning information generation step of generating classification learning information representing the features of each classification category, based on the feature elements extracted in the feature element extraction step and in accordance with the classification method determined in the classification method determination step; and a classification step of classifying a new text group to be classified into the classification categories in accordance with the classification method determined in the classification method determination step and the classification learning information.

[0035] According to this invention, a plurality of classification methods are kept available, the classification method with the highest classification accuracy among them is determined in the classification method determination step based on the classification sample information, and the new text group is then classified into the classification categories according to that method. Classification accuracy can therefore be improved over the prior art regardless of the content and amount of the information to be classified.

[0036] The invention according to claim 12 is a computer-readable recording medium on which is recorded an information classification program for causing a computer to execute: a feature element extraction step of extracting feature elements for each classification category from each of a plurality of sample texts included in classification sample information in which the plurality of sample texts and a plurality of classification categories are associated in advance; a classification method determination step of determining, based on the classification sample information, the classification method with the highest classification accuracy from among a plurality of classification methods; a classification learning information generation step of generating classification learning information representing the features of each classification category, based on the feature elements extracted in the feature element extraction step and in accordance with the classification method determined in the classification method determination step; and a classification step of classifying a new text group to be classified into the classification categories in accordance with the classification method determined in the classification method determination step and the classification learning information.

[0037] According to this invention, a plurality of classification methods are kept available, the classification method with the highest classification accuracy among them is determined in the classification method determination step based on the classification sample information, and the new text group is then classified into the classification categories according to that method. Classification accuracy can therefore be improved over the prior art regardless of the content and amount of the information to be classified.

[0038] The invention according to claim 13 is an information classification program for causing a computer to execute: a feature element extraction procedure of extracting feature elements for each classification category from each of a plurality of sample texts included in classification sample information in which the plurality of sample texts and a plurality of classification categories are associated in advance; a classification method determination procedure of determining, based on the classification sample information, the classification method with the highest classification accuracy from among a plurality of classification methods; a classification learning information generation procedure of generating classification learning information representing the features of each classification category, based on the feature elements extracted in the feature element extraction procedure and in accordance with the classification method determined in the classification method determination procedure; and a classification procedure of classifying a new text group to be classified into the classification categories in accordance with the classification method determined in the classification method determination procedure and the classification learning information.

[0039] According to this invention, a plurality of classification methods are kept available, the classification method with the highest classification accuracy among them is determined in the classification method determination procedure based on the classification sample information, and the new text group is then classified into the classification categories according to that method. Classification accuracy can therefore be improved over the prior art regardless of the content and amount of the information to be classified.

[0040]

DESCRIPTION OF THE PREFERRED EMBODIMENTS

An embodiment of the information classification device, the information classification method, the computer-readable recording medium on which an information classification program is recorded, and the information classification program according to the present invention is described in detail below with reference to the drawings.

【００４１】FIG. 1 is a block diagram showing the configuration of an embodiment of the present invention. In this figure, the sample text group 10 is a set of unclassified texts. The clustering unit 20 clusters the sample text group 10 and generates classification sample data 30. The classification sample data 30 is correct-answer data for the classification: a plurality of texts for which it has been decided in advance which text is classified into which classification category.

【００４２】Like the feature element extraction unit 1 (see FIG. 18), the feature element extraction unit 40 extracts, from each text of the classification sample data 30, the feature elements (words) that represent the features of each classification category. It differs from the feature element extraction unit 1 in that, whereas the feature element extraction unit 1 extracts feature elements according to a single feature element extraction method, the feature element extraction unit 40 extracts them according to each of a plurality of feature element extraction methods.

【００４３】Like the classification learning information generation unit 3 (see FIG. 18), the classification learning information generation unit 60 calculates the features of each classification category from the feature elements extracted by the feature element extraction unit 40 and generates classification learning information 70 as the result of this calculation. It differs from the classification learning information generation unit 3 in that, whereas the classification learning information generation unit 3 calculates the features according to a single classification learning method, the classification learning information generation unit 60 calculates them according to each of a plurality of classification learning methods.

【００４４】The classification method determination unit 50 determines the classification method having the highest classification accuracy from among a plurality of classification methods, for example by the well-known technique of cross-validation. The operation of the classification method determination unit 50 is described in detail later. As shown in FIG. 2, the new text group 80 consists of a plurality of new texts to be classified: new text TX₁ (text name text1) through new text TX₁₀ (text name text10), and so on. Returning to FIG. 1, the automatic classification unit 90 classifies the new text group 80 into the classification categories based on the classification method determined by the classification method determination unit 50 and on the classification learning information 70, and outputs the result as classification result data 100 (see FIG. 3).

【００４５】The clustering unit 110 clusters the classification result data 100 and obtains a clustering result C (see FIG. 4). The display unit 120 is a display that shows the clustering result C from the clustering unit 110 and various data from the other units. FIGS. 5 to 7 show display examples of the display unit 120. The input unit 130 is a mouse, a keyboard, or the like used for the editing operations described later and for window operations on the display unit 120.

【００４６】Next, the operation of the embodiment described above will be explained with reference to the flowcharts shown in FIGS. 8 to 10. When the sample text group 10 is input to the clustering unit 20 shown in FIG. 1, in step SA1 shown in FIG. 8 the clustering unit 20 clusters the texts of the sample text group 10. In step SA2, the clustering unit 20 turns each cluster into a classification category. In step SA3, the clustering unit 20 outputs to the feature element extraction unit 40 the classification sample data 30 (correct-answer data), that is, a plurality of texts for which it has been decided in advance which text is classified into which classification category.

【００４７】Then, in step SA4, the feature element extraction unit 40 executes a feature element extraction process that extracts from each text the feature elements (words) representing the features of each classification category in the classification sample data 30. Specifically, in step SB1 shown in FIG. 9, the feature element extraction unit 40 performs morphological analysis on the classification sample data 30 to extract candidate feature elements (words) that represent the features of the classification categories. In step SB2, the feature element extraction unit 40 unifies synonyms among the extracted candidate feature elements.

【００４８】In step SB3, the feature element extraction unit 40 counts, for each classification category, the occurrences of identical words among the extracted candidate feature elements. In step SB4, the feature element extraction unit 40 executes a ranking process that narrows down the candidate feature elements for each classification category. The ranking process may use any of the following methods: ranking the candidates in each classification category in descending order of appearance frequency; ranking them in each classification category in descending order of appearance probability; or ranking them in each classification category after applying to the frequency calculation a statistical method that lowers the rank of feature elements that also appear in other classification categories.

【００４９】In step SB5, the feature element extraction unit 40 takes, for each classification category, a predetermined number of the highest-ranked candidates and adopts them as the feature elements. In step SB6, the feature element extraction unit 40 outputs the extracted feature elements as feature element extraction result data. FIG. 11 shows a feature element appearance-frequency-order list R₁ (corresponding to the feature element extraction result data), ranked in order of appearance frequency, which is one of the three ranking methods described above.
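Steps SB3 to SB5 with the appearance-frequency ranking can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes the texts have already been tokenized and synonym-unified (steps SB1 and SB2), and the function name `top_features_by_frequency`, the parameter `k`, and the sample data are hypothetical.

```python
from collections import Counter

def top_features_by_frequency(tokens_by_category, k):
    """Return, per category, the k most frequent candidate words with their counts.

    tokens_by_category maps a category name to a list of token lists
    (one token list per sample text). Hypothetical helper for illustration.
    """
    ranked = {}
    for category, texts in tokens_by_category.items():
        counts = Counter()
        for tokens in texts:
            counts.update(tokens)          # SB3: count identical words per category
        # SB4: rank by descending count; ties are broken alphabetically
        ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
        ranked[category] = ordered[:k]     # SB5: keep a predetermined number from the top
    return ranked

# Invented sample data, already tokenized.
samples = {
    "Economic": [["market", "yen", "market"], ["market", "dollar", "yen"]],
    "Sport": [["game", "team"], ["game", "market"]],
}
ranking = top_features_by_frequency(samples, k=2)
```

With the invented data above, `ranking["Economic"]` holds the two most frequent words of that category together with their counts, which is the shape of one column of a list such as R₁.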

【００５０】The figure contains a field for each classification category ("Economic", "Foreign", ..., "Society" and "Sport") and, for each, a frequency field giving the appearance frequency of the feature elements ("market", "yen appreciation", and so on) in that classification category. The record corresponding to each classification category states the number of texts classified into that category. The texts referred to here are those that make up the sample text group 10 (see FIG. 1). For example, 27 texts are classified into the classification category "Economic" and 43 texts into the classification category "Foreign".

【００５１】The leftmost field in the figure is the rank in descending order of appearance frequency. For example, in the classification category "Economic", the feature element ranked first by appearance frequency within the 27 texts is "market" (frequency: 61.0), the second is "yen appreciation" (frequency: 40.0), and so on down to the thirtieth, "finance" (frequency: 12.0).

【００５２】FIG. 12 shows a feature element ranking list R₂ (corresponding to the feature element extraction result data) in which the feature elements are ranked for each classification category using, of the three ranking methods described above, the one that incorporates a statistical method, here the Kullback-Leibler method. The basic structure of the feature element ranking list R₂ shown in the figure is the same as that of the feature element appearance-frequency-order list R₁ (see FIG. 11).

【００５３】In the feature element ranking list R₂, however, a statistical method is applied that lowers the rank of feature elements that also appear in other classification categories, in order to improve the ability to discriminate the classification category from the others. For example, the feature element "dollar", which is ranked third in the classification category "Economic" in FIG. 11, is ranked thirty-first or lower (not shown) in the classification category "Economic" in FIG. 12.
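The text does not give the formula behind the Kullback-Leibler ranking, so the following is only a sketch of one plausible reading. It scores each word by its pointwise contribution to the Kullback-Leibler divergence between the word distribution of a category and the word distribution over all categories, p(w|c)·log(p(w|c)/p(w)). The function name `kl_style_scores` and the counts are invented for illustration.

```python
import math
from collections import Counter

def kl_style_scores(counts_by_category):
    """Score words by p(w|c) * log(p(w|c) / p(w)).

    Assumed scoring rule, not taken from the source: words spread evenly
    over all categories score near zero or below, so they drop in rank.
    """
    total = Counter()
    for counts in counts_by_category.values():
        total.update(counts)
    total_n = sum(total.values())

    scores = {}
    for category, counts in counts_by_category.items():
        n = sum(counts.values())
        scores[category] = {
            word: (c / n) * math.log((c / n) / (total[word] / total_n))
            for word, c in counts.items()
        }
    return scores

# Invented counts: "dollar" is the most frequent word in "Economic"
# but is just as frequent in "Foreign".
counts = {
    "Economic": Counter({"dollar": 6, "market": 5}),
    "Foreign": Counter({"dollar": 6, "treaty": 2}),
}
scores = kl_style_scores(counts)
```

In this toy data "dollar" outranks "market" by raw frequency in "Economic", yet it receives the lower score because it is equally common in "Foreign". This mirrors the demotion of "dollar" described above.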

【００５４】Returning to FIG. 8, in step SA5 the classification method determination unit 50 determines whether the classification method to be applied to the new text group 80 is to be decided automatically. If there is no instruction from the user, the classification method determination unit 50 sets the result of the determination in step SA5 to "Yes". If, on the other hand, the user has specified a classification method manually, the classification method determination unit 50 sets the result of the determination in step SA5 to "No" and, in step SA7, decides the classification method according to the user's instruction.

【００５５】In the present case, in step SA6, the classification method determination unit 50 executes a classification method determination process that decides the classification method automatically, for example by cross-validation. Specifically, in step SC1 shown in FIG. 10, the classification method determination unit 50 randomly divides the classification samples (texts) of each classification category in the classification sample data 30 into N parts. In step SC2, the classification method determination unit 50 applies each of a plurality of learning algorithms (classification methods) to (N − 1) of the parts and creates the feature elements and classification learning information corresponding to each learning algorithm.

【００５６】In step SC3, the classification method determination unit 50 uses the feature elements and classification learning information created in step SC2 to apply the learning algorithm concerned to the remaining part (1/N of the samples), thereby performing a classification test and calculating the classification accuracy. The classification accuracy is calculated individually for each of the learning algorithms. In step SC4, the classification method determination unit 50 determines whether the classification test has been executed N times; at this point the result of the determination is "No". Steps SC2 and SC3 are then repeated, changing the held-out part one at a time, so that the classification accuracy for each of the N parts is calculated for each of the learning algorithms.

【００５７】When the result of the determination in step SC4 becomes "Yes", in step SC5 the classification method determination unit 50 calculates, for each of the learning algorithms, the average of the classification accuracies over the N parts. In step SC6, the classification method determination unit 50 selects the highest of the average classification accuracies corresponding to the learning algorithms (classification methods) and then selects the learning algorithm (classification method) corresponding to it. The classification method determination unit 50 also notifies the classification learning information generation unit 60 and the automatic classification unit 90 of the learning algorithm (classification method) having the highest classification accuracy.
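The selection loop of steps SC1 to SC6 can be sketched as below. This is an illustrative sketch under stated assumptions rather than the device's actual procedure: each "learning algorithm" is modeled as a hypothetical `train` function that takes training samples and returns a predictor, and the two toy algorithms (`train_majority`, `train_overlap`) and the sample data are invented.

```python
import random
from collections import Counter, defaultdict

def split_per_category(samples, n_parts, seed=0):
    """SC1: randomly split (tokens, category) samples into n_parts, category by category."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for sample in samples:
        by_category[sample[1]].append(sample)
    parts = [[] for _ in range(n_parts)]
    for items in by_category.values():
        rng.shuffle(items)
        for i, sample in enumerate(items):
            parts[i % n_parts].append(sample)
    return parts

def select_method(samples, methods, n_parts):
    """SC2-SC6: return the method with the highest mean held-out accuracy."""
    parts = split_per_category(samples, n_parts)
    mean_accuracy = {}
    for name, train in methods.items():
        accuracies = []
        for held_out in range(n_parts):
            # SC2: learn from the other (N - 1) parts
            training = [s for i, part in enumerate(parts) if i != held_out for s in part]
            predict = train(training)
            # SC3: test on the remaining part
            test = parts[held_out]
            correct = sum(predict(tokens) == category for tokens, category in test)
            accuracies.append(correct / len(test))
        mean_accuracy[name] = sum(accuracies) / n_parts   # SC5: average over N tests
    best = max(mean_accuracy, key=mean_accuracy.get)      # SC6: highest average wins
    return best, mean_accuracy

def train_majority(training):
    """Toy algorithm: always predict the most common training category."""
    most_common = Counter(c for _, c in training).most_common(1)[0][0]
    return lambda tokens: most_common

def train_overlap(training):
    """Toy algorithm: predict the category whose vocabulary overlaps the text most."""
    vocabulary = defaultdict(set)
    for tokens, category in training:
        vocabulary[category].update(tokens)
    return lambda tokens: max(vocabulary, key=lambda c: len(vocabulary[c] & set(tokens)))

economic = [("market", "yen"), ("market", "dollar"), ("yen", "dollar")] * 2
sport = [("game", "team"), ("game", "score"), ("team", "score")] * 2
samples = [(t, "Economic") for t in economic] + [(t, "Sport") for t in sport]
methods = {"majority": train_majority, "overlap": train_overlap}
best, mean_accuracy = select_method(samples, methods, n_parts=3)
```

Splitting category by category, as step SC1 describes, keeps every category represented in each held-out part, so an accuracy figure is never computed on a part that lacks a category entirely.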

【００５８】Returning to FIG. 8, in step SA8 the classification learning information generation unit 60 generates the classification learning information 70 based on the learning algorithm (classification method) notified by the classification method determination unit 50 and on the feature element extraction result data from the feature element extraction unit 40. In step SA9, the classification learning information generation unit 60 registers the classification learning information 70 in a database (not shown). In step SA10, the automatic classification unit 90 determines whether the new text group 80 to be classified has been input; at this point the result of the determination is "No", and the same determination is repeated.

【００５９】When the new text group 80 is input to the automatic classification unit 90, the automatic classification unit 90 sets the result of the determination in step SA10 to "Yes". In step SA11, the automatic classification unit 90 determines whether automatic classification has finished for all of the new texts TX₁, TX₂, ..., TX₁₀, ... that make up the new text group 80 (see FIG. 2); at this point the result of the determination is "No". Then, in steps SA15 to SA21, the automatic classification unit 90 executes the automatic classification process based on the classification method determined by the classification method determination unit 50.

【００６０】The following describes, as one example of a classification method, the case in which the new text group 80 is classified based on the vector space method. In this case, the classification learning information 70 is assumed to contain 30 feature elements for each classification category, together with the vector of all feature elements and the vector of each classification category. In this state, in step SA15, the automatic classification unit 90 performs morphological analysis on the new text TX₁ (see FIG. 2) in the new text group 80 and extracts feature elements (words). In step SA16, the automatic classification unit 90 executes a synonym unification process that unifies synonyms among the extracted feature elements.

【００６１】In step SA17, the automatic classification unit 90 counts the extracted feature elements. In step SA18, the automatic classification unit 90 picks out, from among the feature elements contained in the new text TX₁, those that are identical to feature elements in the classification learning information 70. The automatic classification unit 90 then generates a document vector for the feature elements picked out, that is, a document vector for the new text TX₁.

【００６２】In step SA19, the similarity (cosine value) between the document vector of the new text TX₁ and the vector of each classification category in the classification learning information 70 is calculated. Letting A be the vector of a classification category and B the document vector of the new text TX₁, the similarity (cosine value) is given by the following expression. Similarity (cosine value) = (inner product of vector A and document vector B) / (magnitude of vector A × magnitude of document vector B)

【００６３】In other words, in step SA19 as many similarities (cosine values) are calculated for the new text TX₁ as there are classification categories. In step SA20, the automatic classification unit 90 normalizes the calculated similarities (cosine values) to values from 0 to 100. In step SA21, the automatic classification unit 90 selects, from among the similarities (cosine values), those at or above a threshold (for example, 70) and classifies the new text TX₁ into the classification categories corresponding to the selected similarities. If all of the similarities are below the threshold, the automatic classification unit 90 treats the new text TX₁ as a text that cannot be classified. The processing of steps SA15 to SA21 is then repeated, so that the new texts are classified into the classification categories one after another.
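Steps SA17 to SA21 can be sketched as follows for a single new text. The sketch makes some assumptions the text does not spell out: vectors are represented as word-to-count dictionaries, normalization to the 0 to 100 range is done by multiplying the cosine by 100 and rounding (the cosine of non-negative count vectors already lies between 0 and 1), and the function names and sample vectors are hypothetical.

```python
import math
from collections import Counter

def document_vector(tokens, vocabulary):
    """SA17-SA18: count only the words that also occur in the learning information."""
    return Counter(t for t in tokens if t in vocabulary)

def cosine(a, b):
    """SA19: inner product of a and b divided by the product of their magnitudes."""
    dot = sum(value * b.get(word, 0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(doc, category_vectors, threshold=70):
    """SA20-SA21: scale similarities to 0-100 and keep categories at or above the threshold."""
    scores = {c: round(100 * cosine(v, doc)) for c, v in category_vectors.items()}
    chosen = [c for c, s in scores.items() if s >= threshold]
    return chosen, scores   # an empty list means the text could not be classified

# Invented category vectors standing in for the classification learning information.
category_vectors = {
    "Economic": {"market": 3, "yen": 2},
    "Sport": {"game": 4, "team": 1},
}
vocabulary = {w for vector in category_vectors.values() for w in vector}

doc = document_vector(["market", "yen", "market", "today"], vocabulary)
chosen, scores = classify(doc, category_vectors)
unclassified, _ = classify(document_vector(["weather"], vocabulary), category_vectors)
```

Returning a list rather than a single category follows the wording of step SA21, which selects every similarity at or above the threshold rather than only the largest one.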

【００６４】When all of the new texts have been classified, the automatic classification unit 90 sets the result of the determination in step SA11 to "Yes". In step SA12, the automatic classification unit 90 outputs the classification result data 100 shown in FIG. 3. In this figure, the text names text1 to text20, ... correspond to the text names text1 to text10, ... shown in FIG. 2, "AUTOMOTIVE_INDUSTRY" and the like are classification categories, and the number to the right of each classification category is a score (for example, the similarity). Thus the new text TX₁ shown in FIG. 2 is classified into the classification category "AUTOMOTIVE_INDUSTRY" with a score (similarity) of 85.

【００６５】Returning to FIG. 8, in step SA13 the clustering unit 110 clusters the new text group 80 using the classification result data 100. FIG. 4 shows the clustering result C produced by the clustering unit 110. The figure illustrates the case in which a new text group 80 consisting of 1000 new texts has been classified and 26 of them have been classified into the classification category "Economic"; it shows the breakdown of those 26 new texts (number of texts and feature elements).

【００６６】In step SA14, the display unit 120 displays, for example, the clustering result C shown in FIG. 4. This allows the user to check what kind of content has been classified into the classification category (in this case, "Economic").

【００６７】In the embodiment, the feature element ranking list R₂ shown in FIG. 12 may be displayed on the display unit 120 and edited at the user's request, and classification may then be performed using the feature element ranking list R₃ shown in FIG. 13. In this case, the user uses the input unit 130 to delete the feature elements in the feature element ranking list R₂ that the user judges to be unnecessary. This produces the feature element ranking list R₃ (see FIG. 13), and the processing described above is executed based on the feature element ranking list R₃.

【００６８】In the embodiment, when the structures of the classification sample data 30 and the new text group 80 are defined in advance, the portions of the classification sample data 30 and the new text group 80 that are to be classified may be specified through the input unit 130.

【００６９】The description so far has covered an example in which the result of clustering by the clustering unit 20 shown in FIG. 1 (the classification sample data 30) is used as it is by the feature element extraction unit 40, but the clustering result may instead be verified first. This case is described below as Modification 1 of the embodiment, with reference to FIGS. 14 and 15.

【００７０】In step SD1 shown in FIG. 15, the clustering unit 20 clusters the sample text group 10 contained in the classification sample data 30 (correct-answer data) shown in FIG. 1. The assignment of classification categories in the classification sample data 30 is ignored for this clustering. FIG. 14 shows the result of clustering by the clustering unit 20 (clustering result distribution data CB). The figure shows the result of clustering 277 texts that had been assigned to seven classification categories ("Economic", "Foreign", ..., "Sport").

【００７１】The figure shows that the classification categories have been assigned cleanly for "Sports" in record A and "Politics" in records C and E. In contrast, the distinction between "Economic" and "Industry" in record D, and the distinctions among "Foreign", "Industry", "Politics", "Science" and "Society" in record F and later records, are ambiguous. In such a case, the processing of step SD4 described later is executed. In step SD2, a comparison unit (not shown) compares the clustering result (the assignment of classification categories) with the assignment of classification categories that the user originally intended.

【００７２】In step SD3, the comparison unit determines whether the two assignments compared in step SD2 are identical and, if the result of the determination is "No", causes the display unit 120 to display the comparison result. Then, in step SD4, the user uses the input unit 130 to review the clustering result (the assignment of classification categories) and edit the classification categories. If, on the other hand, the result of the determination in step SD3 is "Yes", the assignment of classification categories in the classification sample data 30 is the same as the one the user originally intended, so in step SD5 the classification categories and the classification samples (texts) are adopted as the learning information. In step SD6, the comparison unit (not shown) outputs the classification sample data 30 to the feature element extraction unit 40. The processing described above is then executed.

【００７３】The description so far has covered an example in which the classification result data 100 produced by the automatic classification unit 90 is output as it is. Alternatively, after the automatic classification unit 90 has performed the classification, the classification result data 100 may be checked to see whether it is as desired, and, if it is not, the result of the check may be fed back to the classification learning information 70 for relearning so as to improve the classification accuracy. This case is described below as Modification 2 of the embodiment, with reference to FIG. 16. In that figure, parts corresponding to those in FIG. 1 are given the same reference numerals. A relearning processing unit 140 is newly provided in this figure. The relearning processing unit 140 receives the feedback described above and creates classification learning information 70A.

【００７４】When a new text group 80 consisting of 20 new texts is input to the information classification device 200, the new text group 80 is classified automatically, in the same manner as described above, based on the classification learning information 70 and a predetermined classification method. The information classification device 200 then outputs the classification result data 100, which is displayed on the display unit 120. Suppose that, in the classification result data 100, the new texts (5) and (6) assigned to classification category B should have been assigned to classification category A, and the new text (9) assigned to classification category C should have been assigned to classification category D. The user then uses the input unit 130 to edit the data into the desired assignment.

【００７５】The relearning processing unit 140 then executes a relearning process based on the edited classification result data 100, operating in the same way as the classification learning information generation unit 60 (see FIG. 1), and rebuilds the classification learning information as classification learning information 70A. When a new text group 80 is input to the information classification device 200 in this state, it is classified automatically, in the same manner as described above, based on the rebuilt classification learning information 70A and the predetermined classification method. In this case, the classification accuracy of the classification result data 100 output from the information classification device 200 is extremely high owing to the effect of the relearning.

【００７６】In the embodiment, the screen G₁ shown in FIG. 5 may be displayed on the display unit 120 shown in FIG. 1 so as to present the various kinds of information generated in the classification process. The screen G₁ displays a folder H₀ corresponding to the classification category "User complaint classification", and folders H₁ to H₇ corresponding respectively to the classification categories that belong under it: "Initial failure", ..., "Inquiry" and "Unclassified documents".

【００７７】Screens G₂ to G₄ are also displayed on the screen G₁ by window control. As shown in FIG. 6, the screen G₂ displays the title K₁ and the text content K₂ of a sample document (corresponding to the classification sample data 30) for the classification category "Inquiry". The screen G₃ shown in FIG. 7 displays the keywords (feature elements) corresponding to the classification category "Inquiry". The screen G₄ shown in FIG. 5 displays a list screen J₁ of the new texts classified into the classification category "Inquiry" and a content display screen J₂ showing the content of a new text. The icons I₁ to I₄ on the new text list screen J₁ indicate how the score after relearning according to Modification 2 described above has changed relative to the score (similarity) before relearning.

[0078] Specifically, the icon I₁ means that the score (similarity) is higher than the previous time, and the icon I₂ means that the score (similarity) is lower than the previous time. The icon I₃ means that a new text that was previously classified into the classification category in question (here, "Inquiry") has not been classified into that category this time. The icon I₄ means that a new text that was previously not classified into the classification category in question (here, "Inquiry") has been classified into that category this time.
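The meaning of the four icons can be sketched as a small decision function. This is only an illustration: the function name `change_icon`, its parameters, and the string labels are hypothetical, and two points are assumptions not stated in the text, namely that a change of category membership (I₃, I₄) takes precedence over a change of score (I₁, I₂), and that an unchanged score yields no icon.

```python
from typing import Optional


def change_icon(prev_score: float, new_score: float,
                was_member: bool, is_member: bool) -> Optional[str]:
    """Return the icon label describing how a new text changed after re-learning.

    prev_score / new_score: similarity to the category before / after re-learning.
    was_member / is_member: whether the text was / is classified into the category.
    """
    # Assumption: membership changes are reported before score changes.
    if was_member and not is_member:
        return "I3"  # previously in the category, not in it this time
    if not was_member and is_member:
        return "I4"  # previously not in the category, in it this time
    if new_score > prev_score:
        return "I1"  # score (similarity) rose
    if new_score < prev_score:
        return "I2"  # score (similarity) fell
    return None      # assumption: no icon when nothing changed
```

For example, a text that stays in "Inquiry" while its similarity rises from 0.4 to 0.7 would receive the label for icon I₁ under these assumptions.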

[0079] As described above, according to the embodiment, a plurality of classification methods are made available, the classification method determining unit 50 determines, based on the classification sample data 30, the classification method with the highest classification accuracy from among the plurality of classification methods, and the new text group 80 is then classified for each classification category according to that method. Compared with the conventional art, classification accuracy can therefore be improved regardless of the content and amount of the information to be classified.

[0080] Further, according to the embodiment, a plurality of feature element extraction methods are made available in the feature element extraction unit 40, feature elements are extracted by each of these methods, and in particular the feature elements obtained by an extraction method with high discriminating ability between classification categories are adopted as the extraction result. The classification accuracy of the classification result based on these feature elements can therefore be further improved.

[0081] Further, according to the embodiment, the input unit 130 and the display unit 120 (editing means) are provided so that the extracted feature elements can be edited (deleted, added, and so on), which allows feature elements to be set flexibly for each classification category.

[0082] Further, according to the embodiment, the input unit 130 and the display unit 120 (designating means) are used to designate the portions to be classified in the classification sample data 30 and the new text group 80. Portions unnecessary for classification are thereby excluded and classification is performed only on the essentially necessary portions, so that classification accuracy can be further improved.

[0083] Further, according to the embodiment, the classification sample data 30 is generated by the clustering unit 20. Compared with generating classification categories manually from a plurality of sample texts, efficiency is markedly improved and the user's workload is reduced.

[0084] Further, according to the embodiment, the clustering result of the clustering unit 20 is compared with a desired clustering result, and when the comparison shows, for example, a mismatch, the classification sample data 30 can be changed through the input unit 130 (changing means). Since the new text group 80 can then be classified based on more complete classification sample data 30, classification accuracy can be made extremely high.

[0085] Further, according to the embodiment, the clustering result distribution data CB (see FIG. 14) is displayed on the display unit 120, so that the user can easily grasp the distribution of the classification results.

[0086] Further, according to the embodiment, as described in Modification 2, the classification learning information 70A is regenerated based on the optimized classification result and the new text group 80 is classified again according to this classification learning information 70A, so that classification accuracy can be further improved.

[0087] Further, according to the embodiment, the differences between the classification results before and after the optimization are displayed as the icons I₁ to I₄ (difference recognition information) so that the user can recognize the differences at a glance. The user can therefore respond quickly to the differences, and classification accuracy is improved as a result.

[0088] Although an embodiment of the present invention has been described in detail above with reference to the drawings, specific configurations are not limited to this embodiment, and design changes and the like that do not depart from the gist of the present invention are included in the present invention. For example, in the embodiment described above, an information classification program for realizing the functions of the information classification device may be recorded on the computer-readable recording medium 400 shown in FIG. 17, and information classification may be performed by causing the computer 300 shown in the same figure to read and execute the information classification program recorded on the recording medium 400.

[0089] The computer 300 shown in FIG. 17 comprises a CPU 301 that executes the information classification program, an input device 302 such as a keyboard and a mouse, a ROM (Read Only Memory) 303 that stores various data, a RAM (Random Access Memory) 304 that stores operation parameters and the like, a reading device 305 that reads the information classification program from the recording medium 400, an output device 306 such as a display and a printer, and a bus BU that connects the respective units.

[0090] The CPU 301 reads the information classification program recorded on the recording medium 400 via the reading device 305 and then executes it, thereby performing the information classification described above. The recording medium 400 includes not only portable recording media such as optical disks, floppy (registered trademark) disks, and hard disks, but also transmission media that hold data temporarily, such as a network.

[0091] In the embodiment, the case where the cross-validation method is adopted in the classification method determining unit 50 shown in FIG. 1 has been described as one example of how the classification method is determined. The determination is not limited to this method, however; the classification method may instead be determined using as a key a value such as recall (the proportion of correct answers contained in the results) or precision (how few errors the results contain). In short, any scheme is included in the present invention as long as it satisfies two requirements: a plurality of classification methods are available, and the one with the highest classification accuracy can be selected from among them.
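The selection of a classification method by cross-validation can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the classification method determining unit 50: the two candidate methods (`train_majority`, `train_keyword_overlap`), the fold assignment, the sample texts, and all function names are hypothetical, and plain accuracy is used as the selection key where recall or precision could be substituted.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Tuple[str, str]                      # (sample text, classification category)
Classifier = Callable[[str], str]
Trainer = Callable[[Sequence[Sample]], Classifier]


def train_majority(samples: Sequence[Sample]) -> Classifier:
    """Hypothetical method 1: always answer the most frequent category."""
    most_common = Counter(cat for _, cat in samples).most_common(1)[0][0]
    return lambda text: most_common


def train_keyword_overlap(samples: Sequence[Sample]) -> Classifier:
    """Hypothetical method 2: pick the category whose sample words overlap most."""
    words_by_cat: Dict[str, Counter] = {}
    for text, cat in samples:
        words_by_cat.setdefault(cat, Counter()).update(text.lower().split())

    def classify(text: str) -> str:
        tokens = text.lower().split()
        return max(sorted(words_by_cat),
                   key=lambda c: sum(words_by_cat[c][t] for t in tokens))
    return classify


def cross_validated_accuracy(train: Trainer, samples: List[Sample], k: int) -> float:
    """Hold out every k-th sample in turn, train on the rest, and score the held-out part."""
    correct = 0
    for fold in range(k):
        held_out = samples[fold::k]
        training = [s for i, s in enumerate(samples) if i % k != fold]
        classify = train(training)
        correct += sum(classify(text) == cat for text, cat in held_out)
    return correct / len(samples)


def choose_method(methods: Dict[str, Trainer], samples: List[Sample],
                  k: int = 3) -> Tuple[str, Dict[str, float]]:
    """Return the name of the method with the highest cross-validated accuracy."""
    scores = {name: cross_validated_accuracy(t, samples, k) for name, t in methods.items()}
    return max(scores, key=scores.get), scores


SAMPLES: List[Sample] = [
    ("screen broken on arrival", "initial failure"),
    ("how do I reset password", "inquiry"),
    ("device broken out of box", "initial failure"),
    ("how to change settings", "inquiry"),
    ("unit arrived broken", "initial failure"),
    ("how can I update", "inquiry"),
]
METHODS = {"majority": train_majority, "keyword_overlap": train_keyword_overlap}
best, scores = choose_method(METHODS, SAMPLES)
```

Because every candidate is scored on the same held-out samples, a further method can be added to the `METHODS` dictionary without changing the selection logic.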

[0092]

[Effects of the Invention] As described above, according to the invention of claim 1, a plurality of classification methods are made available, the classification method determining means determines, based on the classification sample information, the classification method with the highest classification accuracy from among the plurality of classification methods, and the new text group is then classified for each classification category according to that method. Compared with the conventional art, classification accuracy can therefore be improved regardless of the content and amount of the information to be classified.

[0093] According to the invention of claim 2, a plurality of feature element extraction methods are made available in the feature element extracting means, feature elements are extracted by each of these methods, and in particular the feature elements obtained by an extraction method with high discriminating ability between classification categories are adopted as the extraction result. The classification accuracy of the classification result based on these feature elements can therefore be further improved.

[0094] According to the invention of claim 3, editing means is provided so that the extracted feature elements can be edited (deleted, added, and so on), which allows feature elements to be set flexibly for each classification category.

[0095] According to the invention of claim 4, a plurality of classification methods are made available, the classification method determining means determines by the cross-validation method, based on the classification sample information, the classification method with the highest classification accuracy from among the plurality of classification methods, and the new text group is then classified for each classification category according to that method. Compared with the conventional art, classification accuracy can therefore be improved regardless of the content and amount of the information to be classified.

[0096] According to the invention of claim 5, the designating means designates the portions to be classified in the classification sample information and the new text group. Portions unnecessary for classification are thereby excluded and classification is performed only on the essentially necessary portions, so that classification accuracy can be further improved.

[0097] According to the invention of claim 6, the classification sample information is generated by the clustering means. Compared with generating classification categories manually from a plurality of sample texts, efficiency is markedly improved and the user's workload is reduced.

[0098] According to the invention of claim 7, the clustering result of the clustering means is compared with a desired clustering result, and when the comparison shows, for example, a mismatch, the classification sample information can be changed by the changing means. Since the new text group can then be classified based on more complete classification sample information, classification accuracy can be made extremely high.

[0099] According to the invention of claim 8, the clustering result is displayed by the clustering result display means, so that the user can easily grasp the distribution of the classification results.

[0100] According to the invention of claim 9, the classification learning information is regenerated based on the classification result optimized by the optimizing means and the new text group is classified again according to this classification learning information, so that classification accuracy can be further improved.

[0101] According to the invention of claim 10, the differences between the classification results before and after the optimization are displayed as difference recognition information so that the user can recognize the differences at a glance. The user can therefore respond quickly to the differences, and classification accuracy is improved as a result.

[0102] According to the inventions of claims 11, 12, and 13, a plurality of classification methods are made available, the classification method with the highest classification accuracy is determined from among the plurality of classification methods in the classification method determining step based on the classification sample information, and the new text group is then classified for each classification category according to that method. Compared with the conventional art, classification accuracy can therefore be improved regardless of the content and amount of the information to be classified.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of an embodiment according to the present invention.

FIG. 2 is a diagram showing an example of the new text group 80 shown in FIG. 1.

FIG. 3 is a diagram showing an example of the classification result data 100 shown in FIG. 1.

FIG. 4 is a diagram showing a clustering result C produced by the clustering unit 110 shown in FIG. 1.

FIG. 5 is a diagram showing a display example of the display unit 120 shown in FIG. 1.

FIG. 6 is a diagram showing a display example of the display unit 120 shown in FIG. 1.

FIG. 7 is a diagram showing a display example of the display unit 120 shown in FIG. 1.

FIG. 8 is a flowchart illustrating the operation of the embodiment.

FIG. 9 is a flowchart illustrating the feature element extraction process shown in FIG. 8.

FIG. 10 is a flowchart illustrating the classification method determination process shown in FIG. 8.

FIG. 11 is a diagram showing the list R₁ of feature elements in order of appearance frequency in the embodiment.

FIG. 12 is a diagram showing the feature element ranking list R₂ in the embodiment.

FIG. 13 is a diagram showing the feature element ranking list R₃ in the embodiment.

FIG. 14 is a diagram showing the clustering result distribution data CB in the embodiment.

FIG. 15 is a flowchart illustrating Modification 1 of the embodiment.

FIG. 16 is a diagram illustrating Modification 2 of the embodiment.

FIG. 17 is a block diagram showing Modification 3 of the embodiment.

FIG. 18 is a block diagram showing the configuration of a conventional information classification device.

[Explanation of symbols]

20 clustering unit
40 feature element extraction unit
50 classification method determining unit
60 classification learning information generating unit
90 automatic classification unit
110 clustering unit
120 display unit
130 input unit
300 computer
301 CPU
400 recording medium

Continued from the front page
(72) Inventor: Norihiko Sakamoto, c/o Fujitsu Infosoft Technology Co., Ltd., 18-1 Minamicho, Shizuoka City, Shizuoka Prefecture
(72) Inventor: Ryu Shibata, c/o Fujitsu Infosoft Technology Co., Ltd., 18-1 Minamicho, Shizuoka City, Shizuoka Prefecture
F-terms (reference): 5B075 ND03 NK06 NK32 NK46 NR12 PQ02 PQ46 QM05 UU06; 5B082 GA08

Claims

[Claims]

1. An information classification device comprising: feature element extracting means for extracting a feature element for each classification category from each of a plurality of sample texts included in classification sample information in which the plurality of sample texts and a plurality of classification categories are associated in advance; classification method determining means for determining, based on the classification sample information, the classification method with the highest classification accuracy from among a plurality of classification methods; classification learning information generating means for generating, according to the classification method determined by the classification method determining means, classification learning information representing a feature of each classification category based on the feature elements extracted by the feature element extracting means; and classification means for classifying a new text group to be classified for each classification category according to the classification method determined by the classification method determining means and the classification learning information.

2. The information classification device according to claim 1, wherein the feature element extracting means extracts feature elements by each of a plurality of feature element extraction methods, selects, based on the extraction results, a feature element extraction method having high discriminating ability between classification categories from among the plurality of feature element extraction methods, and sets the feature elements corresponding to the selection result as the extraction result.

3. The information classification device according to claim 1, further comprising editing means for editing the feature elements extracted by the feature element extracting means.

4. The information classification device according to claim 1, wherein the classification method determining means determines the classification method having the highest classification accuracy from among the plurality of classification methods by a cross-validation method.

5. The information classification device according to claim 1, further comprising designating means for designating a portion to be classified in the classification sample information and the new text group.

6. The information classification device according to any one of claims 1 to 5, further comprising clustering means for clustering a plurality of sample texts to generate the classification sample information in which the plurality of sample texts are associated with a plurality of classification categories.

7. The information classification device according to claim 1, further comprising: clustering means for clustering the classification sample information; comparing means for comparing a clustering result of the clustering means with a desired clustering result; and changing means for changing the classification sample information, as necessary, based on a comparison result of the comparing means.

8. The information classification device according to claim 1, further comprising clustering result display means for clustering the new text group in the classification result of the classification means and displaying the clustering result.

9. The information classification device according to any one of claims 1 to 8, further comprising optimizing means for optimizing a classification result of the classification means, wherein the classification learning information generating means regenerates the classification learning information based on the optimized classification result, and the classification means classifies the new text group to be classified for each classification category according to the classification method determined by the classification method determining means and the regenerated classification learning information.

10. The information classification device according to claim 9, further comprising difference recognition information display means for displaying a difference between the classification result before optimization and the classification result after optimization as visually recognizable difference recognition information.

11. An information classification method comprising: a feature element extracting step of extracting a feature element for each classification category from each of a plurality of sample texts included in classification sample information in which the plurality of sample texts and a plurality of classification categories are associated in advance; a classification method determining step of determining, based on the classification sample information, the classification method with the highest classification accuracy from among a plurality of classification methods; a classification learning information generating step of generating, according to the classification method determined in the classification method determining step, classification learning information representing a feature of each classification category based on the feature elements extracted in the feature element extracting step; and a classification step of classifying a new text group to be classified for each classification category according to the classification method determined in the classification method determining step and the classification learning information.

12. A computer-readable recording medium on which is recorded an information classification program for causing a computer to execute: a feature element extracting step of extracting a feature element for each classification category from each of a plurality of sample texts included in classification sample information in which the plurality of sample texts and a plurality of classification categories are associated in advance; a classification method determining step of determining, based on the classification sample information, the classification method with the highest classification accuracy from among a plurality of classification methods; a classification learning information generating step of generating, according to the classification method determined in the classification method determining step, classification learning information representing a feature of each classification category based on the feature elements extracted in the feature element extracting step; and a classification step of classifying a new text group to be classified for each classification category according to the classification method determined in the classification method determining step and the classification learning information.

13. An information classification program for causing a computer to execute: a feature element extracting procedure of extracting a feature element for each classification category from each of a plurality of sample texts included in classification sample information in which the plurality of sample texts and a plurality of classification categories are associated in advance; a classification method determining procedure of determining, based on the classification sample information, the classification method with the highest classification accuracy from among a plurality of classification methods; a classification learning information generating procedure of generating, according to the classification method determined in the classification method determining procedure, classification learning information representing a feature of each classification category based on the feature elements extracted in the feature element extracting procedure; and a classification procedure of classifying a new text group to be classified for each classification category according to the classification method determined in the classification method determining procedure and the classification learning information.