JP3439494B2

JP3439494B2 - Context-sensitive automatic classifier

Info

Publication number: JP3439494B2
Application number: JP32347692A
Authority: JP
Inventors: Tadashi Hoshiai (星合 忠)
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1992-12-02
Filing date: 1992-12-02
Publication date: 2003-08-25
Anticipated expiration: 2018-08-25
Also published as: JPH06176064A

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to an automatic text classification method that, when a text composed of character strings carrying some meaning is stored in a document database or the like, automatically classifies the type of the text from the statistical features of its content, thereby supporting the efficient creation of document databases.

[0002]

[Prior Art] In recent years, the need to make document processing work in offices more efficient has been growing steadily. In document processing work, when a new document is stored or registered in a document database or electronic cabinet, classification information is often attached to the document so that it can be retrieved easily later and used effectively.

[0003] At present, this document classification work is handled in one of two ways: manual classification, or automatic classification by computer, which is still technically immature. Of the two, manual classification yields the most accurate results at the current state of the art.

[0004] However, when building a large-scale text database, automatic classification of text by computer is desired for the following reasons:
a. the time efficiency of the classification work; and
b. the need for objective classification criteria (the need to eliminate individual differences and subjectivity when several people perform the classification).

[0005] The present invention relates to this technique of automatic text classification by computer, and classifies texts with similar content into the same text class, using the statistical features of the text as clues.

[0006] As a classification method that uses the statistical features of text as clues, Reference A below describes a method that applies supervised classification from the field of multivariate analysis: texts are mapped onto a classification space, the similarity of texts is judged from the distances between them, and each text is classified into the text class nearest to it in the classification space.
Reference A: Shizuo Mizutani, "Discrimination of ume and sakura by vocabulary" (用語による梅・桜の弁別), Keiryo Kokugogaku (計量国語学), Vol. 12, No. 1, pp. 1-13, 1979.

[0007] It is also known from Reference B below that the paragraph structure of a group of texts consisting of multiple paragraphs forms a state transition network. When the paragraph structure of a document is classified automatically, these state transition probabilities should be taken into account.
Reference B: Tadashi Hoshiai et al., "Support for drafting paragraph structures for business document preparation" (ビジネス文書作成のための段落構成立案の支援), Proceedings of the 43rd National Convention of the Information Processing Society of Japan, 2F-1, pp. 3-315 to 3-316, 1991.

[0008]

[Problems to be Solved by the Invention] However, the "document structure database construction processing method" previously filed by the present applicant (patent application filed September 20, 1991) did not take the above point into account, and therefore could not perform optimal classification.

[0009] In view of these problems of the prior art, the object of the present invention is to realize, in the automatic classification of text having a paragraph structure, appropriate classification suited to the state transition network that expresses the paragraph structure serving as context. This is achieved by taking into account information on the probability distribution of the text classes themselves, which constitute the teacher information (occurrence probabilities, transition probabilities, and the like), and reflecting that information in the weighting applied during classification.

[0010]

[Means for Solving the Problems] According to the present invention, the above object is achieved by the means recited in the claims.

[0011] That is, the invention of claim 1 is a context-dependent automatic classification device comprising: a text input unit for inputting text to be classified; a text class feature data unit, which is a database capable of storing and retrieving the comparison data used when classifying input text into text classes; and an automatic text classification unit for classifying the input text into one of the text classes according to its content, wherein the automatic text classification unit obtains the distance between the position of the input text in the classification space and the position of the centroid of each text class, weights that distance using the simple occurrence probability of the text class itself, and classifies the input text into one of the text classes based on the weighted distance.

[0012]

[0013]

[0014] The invention of claim 2 is a context-dependent automatic classification device comprising: a text input unit for inputting text to be classified; a text class feature data unit, which is a database capable of storing and retrieving the comparison data used when classifying input text into text classes; and an automatic text classification unit for classifying the input text into one of the text classes according to its content, wherein the automatic text classification unit obtains the distance between the position of the input text in the classification space and the position of the centroid of each text class, weights that distance in consideration of the state transition probabilities between text classes, and classifies the input text into one of the text classes based on the weighted distance.

[0015] The invention of claim 3 is the context-dependent automatic classification device in which the state transition probabilities between the text classes also take into account the probability of a transition from a text class to itself.

[0016]

[0017]

[0018]

[0019] The invention of claim 4 is the context-dependent automatic classification device having a classification result output unit for outputting the result of the classification of the input text by the automatic text classification unit.

[0020]

[Operation] According to the present invention, in the automatic classification of text having a paragraph structure, information on the probability distribution of the text classes (occurrence probabilities, transition probabilities, and the like) is taken into account and reflected in the weighting applied during classification. This makes it possible to realize classification suited to the state transition network that expresses the paragraph structure serving as context.

[0021]

[Embodiment] The flow of automatic classification according to the present invention is divided below into five steps, Step [1] to Step [5], and each step is described in order with reference to the block diagram of the automatic classification device shown in FIG. 1.

[0022] Step [1]. The user inputs the text data T that he or she wishes to classify from the input/output device 7. The hardware used as the input device may be chosen to suit the form of the input document. For example, to input a document file on a floppy disk, a floppy disk drive is selected as the input device; to input received e-mail, the modem and communication software are regarded as the input device.

[0023] Step [2]. The text input unit 4 sends the text data (paragraph texts) T in the document input to the system, in order, to the automatic text classification unit 1.

[0024] Step [3]. Next, the automatic text classification unit 1 analyzes the statistical features of the text data T, obtains the distance between the classification target and each classification destination, and then performs classification with weighting by context information. The text data T is thus classified into the text class whose statistical features are close and whose expected occurrence is high.

[0025] That is, the statistical features of the keywords in the input text data are extracted and, by a multivariate analysis technique, the distance in the classification space between the input text data T and each group of texts serving as teacher information (referred to in this specification as a "text class") is obtained.
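
As a rough illustration of the first half of this step only, the keyword statistics of a paragraph might be collected as a relative-frequency vector. The specification does not define the feature extraction or the mapping onto the classification space, so the function name `keyword_feature_vector`, the keyword list, and the use of plain substring counts are all assumptions made for this sketch.

```python
from typing import List, Sequence


def keyword_feature_vector(text: str, keywords: Sequence[str]) -> List[float]:
    """Return the relative frequency of each keyword in `text`.

    Hypothetical feature extractor: counts non-overlapping substring
    occurrences and divides by the total number of keyword hits.
    """
    counts = [text.count(keyword) for keyword in keywords]
    total = sum(counts)
    # A paragraph containing none of the keywords maps to the zero vector.
    return [count / total if total else 0.0 for count in counts]


# Hypothetical keyword vocabulary and paragraph.
KEYWORDS = ["thank", "visit", "order"]
paragraph = "thank you for your visit; thank you again"
print(keyword_feature_vector(paragraph, KEYWORDS))
```

A multivariate analysis technique such as those named in paragraph [0026] would then map vectors of this kind onto the classification space; that mapping is outside this sketch.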

[0026] Methods for classifying text by multivariate analysis include, for example, that of Reference A (cited above), which uses quantification theory. Various other supervised classification methods have also been proposed, such as one based on the Mahalanobis generalized distance in multiple discriminant analysis, which is likewise a multivariate analysis technique. In the present invention, any method may be used as long as it uses the concept of distance as its classification criterion.

[0027] The present invention merely makes use of such a classification method. The distance calculation technique itself is outside its scope, and its use of the concept of distance is unaffected by differences among such techniques. The definitions of the statistical methods and the explanation of the calculations are therefore kept to a minimum. The distance calculation is outlined below.

[0028] First, the automatic text classification unit 1 extracts the feature data of the text classes C_j (j = 1, 2, ..., M) that serve as the comparison targets for classification (that is, the teacher information), and stores it in the text class feature data unit 3.

[0029] Next, the automatic text classification unit 1 extracts the feature data of the input text T to be classified and compares it with the feature data of every text class C_j (j = 1, 2, ..., M) serving as teacher information.

[0030] Here, let X(T) be the position vector of the input text T in the classification space, and let X(T_j1), ..., X(T_jNj) be the position vectors of the texts T_j1, ..., T_jNj in text class C_j. The centroid of text class C_j is then given by Equation 1.

[0031]

[Equation 1]

[0032]
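
By way of illustration, the centroid of a text class can be computed as the arithmetic mean of the position vectors of its member texts. This is a minimal sketch under that reading; the function name `centroid` and the sample coordinates are hypothetical and do not come from the specification.

```python
from typing import List, Sequence

Vector = List[float]


def centroid(points: Sequence[Sequence[float]]) -> Vector:
    """Mean of the position vectors X(T_j1), ..., X(T_jNj) of one text class."""
    if not points:
        raise ValueError("a text class needs at least one training text")
    dim = len(points[0])
    # Average each coordinate over all member texts of the class.
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


# Hypothetical 2-D positions of three training paragraphs in one class.
greeting_class = [[1.0, 2.0], [3.0, 2.0], [2.0, 5.0]]
print(centroid(greeting_class))  # [2.0, 3.0]
```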

[0033] Accordingly, the distance between the input text T and text class C_j in the classification space can be given by Equation 2.

[0034]

[Equation 2]

[0035] What follows is the core of the present invention. The distances to all text classes C_j are calculated, the weighting described below is applied, and the best text class is taken as the classification destination.
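
The unweighted distance calculation can be sketched as follows. Because the specification allows any distance-based method (paragraph [0026]), the Euclidean distance used here is only one possible choice, and the function name `distances_to_classes`, the class names, and the coordinates are hypothetical.

```python
import math
from typing import Dict, Sequence


def distances_to_classes(
    x: Sequence[float], centroids: Dict[str, Sequence[float]]
) -> Dict[str, float]:
    """Distance from the input text position X(T) to each class centroid.

    Euclidean distance is an assumption; a Mahalanobis distance or any
    other distance measure could be substituted.
    """
    return {name: math.dist(x, c) for name, c in centroids.items()}


# Hypothetical centroids of two text classes and one input text position.
class_centroids = {"visit": [3.0, 4.0], "thanks": [6.0, 8.0]}
print(distances_to_classes([0.0, 0.0], class_centroids))
# {'visit': 5.0, 'thanks': 10.0}
```

These raw distances are the values that the weighting of Step [4] is applied to.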

[0036] Step [4]. Weighting by context information is computed for the distances obtained in Step [3]. The weighting schemes fall broadly into two types: weighting by simple occurrence probability and weighting by state transition probability.

[0037] The two schemes are described below. In the first scheme, as shown in FIG. 2, the simple occurrence probability of the text class itself is used to weight the distance. FIG. 2 shows several text classes in the classification space. A text class may be, for example, a "visit" text class, a "thank-you letter" text class, or a "sales pitch" text class.

[0038] When comparing the distances r_tj between the position of the input text in the classification space and the centroid of each text class C_j in the teacher information (j = 1, 2, ..., M, where M is the number of text classes), the distances are weighted in consideration of the simple occurrence probability of each text class itself. The classification is thereby made to suit the probability distribution of the teacher information, which serves as the context of classification.

[0039] The simple occurrence probability p(C_i) of each text class is obtained from the teacher information by Equation 3.

[0040]

[Equation 3]

[0041] Here, f(C_i) is the frequency of instance data in text class C_i (that is, the number of paragraph texts belonging to C_i).
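
The estimation of simple occurrence probabilities from teacher information can be sketched as below. Equation 3 is not reproduced in this text, so the sketch assumes the relative frequency p(C_i) = f(C_i) / (sum of f over all classes); the function name `occurrence_probabilities` and the sample labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def occurrence_probabilities(labels: Iterable[str]) -> Dict[str, float]:
    """Relative frequency of each text class among labelled paragraphs.

    Assumed reading of Equation 3: f(C_i) divided by the total number of
    paragraph texts in the teacher information.
    """
    counts = Counter(labels)  # f(C_i) for every class that occurs
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


# Hypothetical class labels of four training paragraphs.
teacher_labels = ["visit", "visit", "thanks", "sales"]
print(occurrence_probabilities(teacher_labels))
# {'visit': 0.5, 'thanks': 0.25, 'sales': 0.25}
```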

[0042] When text class C_i is assumed to follow the normal distribution N(0, 1), the occurrence-probability conversion radius r_i of that text class is determined by Equation 4.

[0043]

[Equation 4]

[0044]

[0045] The distance r*_ti normalized by this conversion radius is obtained as r*_ti = r_ti / r_i. In classification using the simple occurrence probability, the text class C_min that minimizes this normalized distance is sought.
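
The first weighting scheme can be sketched as follows. Equation 4 is not reproduced in this text, so the conversion radius here rests on an assumption: r_i is taken to be the half-width of the interval centered on the mean of N(0, 1) that contains probability p(C_i). The function names `conversion_radius` and `classify_by_occurrence` and all sample values are hypothetical.

```python
from statistics import NormalDist
from typing import Dict


def conversion_radius(p: float) -> float:
    """Radius r with P(|Z| <= r) = p for Z ~ N(0, 1).

    This is an assumed reading of the conversion radius, not a formula
    taken from the specification.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("probability must lie strictly between 0 and 1")
    return NormalDist().inv_cdf((1.0 + p) / 2.0)


def classify_by_occurrence(
    distances: Dict[str, float], probabilities: Dict[str, float]
) -> str:
    """Return the class minimizing the normalized distance r_ti / r_i."""
    normalized = {
        cls: r_ti / conversion_radius(probabilities[cls])
        for cls, r_ti in distances.items()
    }
    return min(normalized, key=normalized.get)


# Hypothetical values: "thanks" is slightly nearer, but "visit" is far
# more frequent in the teacher information.
raw_distances = {"visit": 2.0, "thanks": 1.8}
class_probabilities = {"visit": 0.6, "thanks": 0.1}
print(classify_by_occurrence(raw_distances, class_probabilities))  # visit
```

Under this assumption a frequent class has a larger radius, so its normalized distance shrinks and it is favored over a rare class at a similar raw distance.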

[0046] In the second scheme, as shown in FIG. 3, the state transition probabilities between text classes are used to weight the distance. In FIG. 3, the text classes that form the nodes of the state transition network may be, for example, a "greeting" text class, a "background" text class, an "explanation" text class, or an "invitation" text class.

[0047] Among the text data (paragraph texts) in the input document, let C_i be the text class into which the immediately preceding text was classified, and suppose that the next text data is now being processed. When the first paragraph of the document is being processed, C_i = C_0 = the start node.

[0048] When comparing the distances r_tj between the position of the input text in the classification space and the centroid of each text class C_j in the teacher information (j = 1, 2, ..., M, where M is the number of text classes), the distances are weighted in consideration of the state transition probabilities from text class C_i to each text class C_j (j = 1, 2, ..., M). The classification is thereby made to suit the probability distribution of the teacher information, which serves as the context of classification.

[0049] The state transition probability p(C_j | C_i) from text class C_i to C_j is obtained from the teacher information by Equation 5.

[0050]

[Equation 5]

[0051] Here, f(C_i) is the frequency of instance data in text class C_i (that is, the number of paragraph texts belonging to C_i).
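
The estimation of state transition probabilities from teacher documents can be sketched as below. Equation 5 is not reproduced in this text, so the sketch assumes that p(C_j | C_i) is the number of observed transitions from C_i to C_j divided by the number of transitions leaving C_i. The function name `transition_probabilities`, the label `"C0"` for the start node, and the sample documents are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

START = "C0"  # start node preceding the first paragraph of a document


def transition_probabilities(
    documents: Sequence[Sequence[str]],
) -> Dict[Tuple[str, str], float]:
    """Estimate p(C_j | C_i) from sequences of paragraph class labels.

    Assumed estimator: count of C_i -> C_j transitions divided by the
    count of all transitions out of C_i.
    """
    pair_counts: Counter = Counter()
    from_counts: Counter = Counter()
    for labels in documents:
        previous = START
        for current in labels:
            pair_counts[(previous, current)] += 1
            from_counts[previous] += 1
            previous = current
    return {(i, j): n / from_counts[i] for (i, j), n in pair_counts.items()}


# Hypothetical teacher documents, each a sequence of paragraph classes.
teacher_documents: List[List[str]] = [
    ["greeting", "background", "invitation"],
    ["greeting", "explanation", "invitation"],
    ["greeting", "greeting", "background"],
]
probs = transition_probabilities(teacher_documents)
print(probs[("greeting", "background")])  # 0.5
print(probs[("greeting", "greeting")])    # 0.25
```

The third sample document contains a transition from a class to itself, which this estimator counts like any other transition.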

[0052] When the state transition from text class C_i to C_j is assumed to follow the normal distribution N(0, 1), the state transition conversion radius r_i is determined by Equation 6.

[0053]

[Equation 6]

[0054]

[0055] The distance r*_tj normalized by the state transition to text class C_j is obtained as r*_tj = r_tj / r_j. In classification using the state transition probabilities, the text class C_min that minimizes this normalized distance is sought.
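
The second weighting scheme, applied paragraph by paragraph, can be sketched as follows. Equation 6 is not reproduced in this text, so the radius is again an assumption: the half-width of the central interval of N(0, 1) containing probability p(C_j | C_i). The clamp constant `EPS`, the function names, and all sample values are hypothetical.

```python
from statistics import NormalDist
from typing import Dict, List, Sequence, Tuple

START = "C0"  # start node used while the first paragraph is processed
EPS = 1e-6    # hypothetical clamp keeping probabilities inside (0, 1)


def transition_radius(p: float) -> float:
    """Assumed conversion radius: P(|Z| <= r) = p for Z ~ N(0, 1)."""
    p = min(max(p, EPS), 1.0 - EPS)
    return NormalDist().inv_cdf((1.0 + p) / 2.0)


def classify_document(
    paragraph_distances: Sequence[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
) -> List[str]:
    """Classify paragraphs in order, weighting by p(C_j | previous class)."""
    previous = START
    result: List[str] = []
    for distances in paragraph_distances:
        # Normalized distance r_tj / r_j for every candidate class C_j.
        normalized = {
            cls: r_tj / transition_radius(transitions.get((previous, cls), 0.0))
            for cls, r_tj in distances.items()
        }
        previous = min(normalized, key=normalized.get)
        result.append(previous)
    return result


# Hypothetical transition probabilities and per-paragraph raw distances.
trans = {
    (START, "greeting"): 0.9, (START, "background"): 0.1,
    ("greeting", "background"): 0.8, ("greeting", "invitation"): 0.2,
    ("background", "invitation"): 1.0,
}
dists = [
    {"greeting": 1.0, "background": 0.9, "invitation": 1.5},
    {"greeting": 1.2, "background": 1.0, "invitation": 0.9},
    {"greeting": 0.5, "background": 0.6, "invitation": 2.0},
]
print(classify_document(dists, trans))
# ['greeting', 'background', 'invitation']
```

In this example the third paragraph lies nearest to the "greeting" centroid, yet it is assigned to "invitation" because no transition from "background" to "greeting" occurs in the hypothetical teacher information.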

[0056] Step [5]. The classification result output unit 5 outputs the result of examining the classification result to the input/output device 7. That is, it outputs the name of the destination text class, C_jmin, as the classification result for the input text T.

[0057]

[Effects of the Invention] As described above, according to the present invention, in the automatic classification of text having a paragraph structure, information on the probability distribution of the text classes (occurrence probabilities, transition probabilities, and the like) is taken into account and reflected in the weighting applied during classification. This makes it possible to realize appropriate classification suited to the state transition network that expresses the paragraph structure serving as context, and contributes greatly to making document processing work more efficient.

[Brief description of drawings]

FIG. 1 is a diagram illustrating the configuration of the text classification device.

FIG. 2 is an explanatory diagram of weighting by simple occurrence probability.

FIG. 3 is an explanatory diagram of weighting by state transition probability.

[Explanation of symbols]

1 Automatic text classification unit
2 Text classification result examination unit
3 Text class feature data unit
4 Text input unit
5 Classification result output unit
6 New class proposal unit
7 Input/output device

Continuation of front page

(56) References
JP-A-2-285419 (JP, A)
佐藤円 et al., "Automatic packaging of net news articles," Transactions of the Information Processing Society of Japan, Information Processing Society of Japan, Japan, June 1997, Vol. 38, No. 6, pp. 1225-1234
Atsuo Kawai, "Automatic document classification method based on learning results of semantic attributes," Transactions of the Information Processing Society of Japan, Information Processing Society of Japan, Japan, September 15, 1991, Vol. 33, No. 9, pp. 1114-1122
Tadashi Hoshiai et al., "Support for drafting paragraph structures for business document preparation," Proceedings of the 43rd (second half of 1991) National Convention of the Information Processing Society of Japan (3), Information Processing Society of Japan, Japan, October 22, 1991, pp. 3-315 and 3-316

(58) Fields searched (Int. Cl.7, DB name)
G06F 17/30
JICST file (JOIS)

Claims

(57) [Claims]

1. A context-dependent automatic classification device comprising: a text input unit for inputting text to be classified; a text class feature data unit, which is a database capable of storing and retrieving comparison data used when classifying input text into text classes; and an automatic text classification unit for classifying the input text into one of the text classes according to its content, characterized in that the automatic text classification unit obtains the distance between the position of the input text in a classification space and the position of the centroid of each of the text classes, weights the distance using the simple occurrence probability of the text class itself, and classifies the input text into one of the text classes based on the weighted distance.

2. A context-dependent automatic classification device comprising: a text input unit for inputting text to be classified; a text class feature data unit, which is a database capable of storing and retrieving comparison data used when classifying input text into text classes; and an automatic text classification unit for classifying the input text into one of the text classes according to its content, characterized in that the automatic text classification unit obtains the distance between the position of the input text in a classification space and the position of the centroid of each of the text classes, weights the distance in consideration of the state transition probabilities between the text classes, and classifies the input text into one of the text classes based on the weighted distance.

3. The context-dependent automatic classification device according to claim 2, wherein the state transition probabilities between the text classes also take into account the probability of a transition from a text class to itself.

4. The context-dependent automatic classification device according to any one of claims 1 to 3, further comprising a classification result output unit for outputting the result of the classification of the input text by the automatic text classification unit.