JPH11167581A

JPH11167581A - Information sorting method, device and system

Info

Publication number: JPH11167581A
Application number: JP9334309A
Authority: JP
Inventors: Masami Hara; 正巳原; Tsuyoshi Kitani; 強木谷
Original assignee: NTT Data Corp
Current assignee: NTT Data Group Corp
Priority date: 1997-12-04
Filing date: 1997-12-04
Publication date: 1999-06-22
Anticipated expiration: 2017-12-04
Also published as: JP3488063B2

Abstract

PROBLEM TO BE SOLVED: To provide an information sorting device that can sort texts with high accuracy. SOLUTION: An information sorting device 1 includes a text input part 11, a word processing part 12, a vector processing part 13, a learning feature vector set file 14, a similarity processing part 15, a category decision part 16, and an external or internal document database 17. The part 12 calculates the per-category importance of every word extracted from the learning texts, based on both the number of texts in which the word appears and its category frequency. The part 15 calculates the similarity between each sorting object text and each category from the learning feature vectors in the learning feature vector set and the feature vector of the sorting object text, which are built from the word importances calculated by the part 12. The part 16 selects a prescribed number of categories, in descending order of similarity, as the categories of the sorting object text. The sorting object texts sorted into each category are then stored in the database 17.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to an information classification method for efficiently classifying groups of digitized texts in the fields of natural language processing and information retrieval.

[0002]

[Prior Art] A known method for classifying digitized information determines the category by comparing vector representations of category features with unclassified digitized information. An outline of this method follows. As a precondition, each item in the group of digitized information used for learning (hereinafter, learning texts), from which the category features are extracted, has been assigned one of N categories C1, C2, ..., CN in advance.

[0003] First, to create a feature vector pi representing the features of category Ci (1 ≤ i ≤ N), words are extracted from the learning texts to which category Ci is assigned. The importance of each word in the learning texts is then determined for each category. The TF-IDF method proposed in the field of information retrieval is a widely known way of determining importance (see G. Salton, "Introduction to Modern Information Retrieval", McGraw-Hill). In the TF-IDF method, the importance of a word is defined using its occurrence frequency tf and idf, the reciprocal of its document count df. Specifically, the importance W(tk, Ci) of word tk in category Ci is calculated by equation (1):

W(tk, Ci) = tf(tk, Ci) log(Li / df(tk, Ci) + 1)   ... (1)

Here, tf(tk, Ci) is the occurrence frequency of word tk in category Ci, and df(tk, Ci) is the document count of word tk in category Ci, that is, the number of texts in which it appears. Li is the total number of texts in category Ci.
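As a rough illustration of equation (1), the following Python sketch computes the TF-IDF importance of each word for a single category, with tf as the total occurrence count and df as the number of texts containing the word. The function name `tfidf_importance`, the toy texts, and the use of the natural logarithm are assumptions made for illustration; the source does not state the log base, and the expression is read here as log(Li/df + 1).

```python
import math
from collections import Counter

def tfidf_importance(category_docs):
    """Return {word: W} for one category, reading equation (1) as tf * log(Li / df + 1).

    category_docs: list of tokenized texts (lists of words) sharing one category.
    """
    n_texts = len(category_docs)      # Li: total number of texts in the category
    tf = Counter()                    # tf: total occurrences of each word
    df = Counter()                    # df: number of texts containing each word
    for doc in category_docs:
        tf.update(doc)
        df.update(set(doc))           # count each word once per text
    return {w: tf[w] * math.log(n_texts / df[w] + 1) for w in tf}

# Hypothetical toy category with three texts.
docs = [
    ["stock", "market", "rises"],
    ["stock", "price", "falls"],
    ["stock", "market", "news"],
]
weights = tfidf_importance(docs)
```

In this toy data the word "stock" appears in every text, so its logarithmic factor (log 2) is smaller than that of "rises" (log 4), which appears in only one text.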

[0004] Next, the importance in category Ci is calculated by equation (1) for every word t1, t2, ..., tM that appears in the set of learning texts, and the vector whose elements are these importances is taken as the feature vector pi of category Ci. A feature vector q is calculated in the same way for an unclassified text T; in this case the occurrence frequency tf is mainly used as the importance of each word. The category of the unclassified text T is determined using the similarity d(pi, q) between the feature vector pi of each category (1 ≤ i ≤ N) and the feature vector q of the unclassified text T. Typical ways of calculating this similarity include taking the inner product of the two vectors and using set-theoretic measures; these are described in detail in Tetsuro Ito, "Information Retrieval", Shokodo.

[0005] By calculating the similarity d(pi, q) for each category in this way, several categories that are close to the unclassified text T are selected, and the category into which the text is classified is determined.

[0006]

[Problems to Be Solved by the Invention] As described above, the TF-IDF method is used, for example, to create vectors for comparing a search term with the texts in a search database. Under this method, importance increases as the occurrence frequency tf increases and as idf increases, that is, as the document count df decreases.

[0007] In text classification, however, the texts from which a vector is created usually belong to the same category, so their characteristics differ from those in information retrieval, which does not take categories into account. Among texts belonging to the same category, an important word that characterizes the category (hereinafter, feature word) can therefore be expected to appear in many texts, that is, to have a large document count df. This means that the TF-IDF method, whose idf term is based on the reciprocal of df, may give feature words a low importance. As a result, determining word importance with the TF-IDF method makes it difficult to create feature vectors that clearly express the features of a category, and classification accuracy falls.

[0008] On the other hand, even if the document count df is used directly, the words with a large df also include general words that appear regardless of category (hereinafter, general words), so a word with a large df is not necessarily a feature word. For this reason, the document count df itself has rarely been used as the importance of a feature word.

[0009] An object of the present invention is therefore to provide a novel information classification method that enables highly accurate classification by taking into account the importance of the words that serve as feature words of the categories in the learning texts. Another object of the present invention is to provide an information classification device and an information classification system suitable for carrying out this method.

[0010]

[Means for Solving the Problems] To solve the above problems, the present invention provides an information classification method comprising: a step of extracting words from learning texts whose categories are known, calculating for each extracted word an importance based on its document count and on the number of categories in which it appears, and generating learning feature vectors that represent the features of each category with the calculated importances as elements; a step of calculating, for a classification target text whose category is unknown, an importance based on the occurrence frequency of each word in that text, and generating a classification target feature vector that represents the features of each text with the calculated importances as elements; and a step of determining the similarity between the classification target feature vector and the learning feature vector of each category. The categories corresponding to the learning feature vectors whose similarity to the classification target text falls within a predetermined range, or to at least a predetermined number of learning feature vectors taken from the top when ranked in descending order of similarity, are taken as candidate categories to be assigned to the classification target text.

[0011] In this information classification method, the step of generating the learning feature vectors is characterized, for example, by focusing on the occurrence tendency of the words in the learning texts to distinguish feature words, which serve as indicators of the features of a category, from general words, which do not depend on category, and by reducing the importance of the general words based on the number of categories in which each word appears, thereby generating learning feature vectors in which the importance of the feature words is reflected relatively strongly.

[0012] The information classification device of the present invention, which solves the other problem above, determines the category to be assigned to a classification target text whose category is unknown in accordance with the classification scheme of learning texts to which one or more categories have been assigned, and performs classification. It comprises the following elements.

(1) Word processing means for extracting words from each of the learning texts and the classification target text and calculating the importance of each extracted word. This word processing means has, for example, means for calculating a category frequency coefficient based on dividing the total number of categories in the learning texts by the number of categories in which a given word appears. It calculates the importance of each word in the learning texts by multiplying the document count of the word in a given category by the category frequency coefficient, and is configured to reduce the importance of words that have a relatively large document count and relatively little dependence on category. It is also configured to calculate the importance of each word in the learning texts by further multiplying that product by the occurrence frequency of the word. Alternatively, it has means for measuring the occurrence frequency of the words in the classification target text and is configured so that the lower the occurrence frequency of a word, the higher its importance in the classification target text.

(2) Vector processing means for generating, with the importance of each word as an element, learning feature vectors that express the features of the learning texts for each category and classification target feature vectors that express the features of each classification target text.

(3) Similarity processing means for determining the similarity of the classification target feature vector to the learning feature vector of each category based on the feature difference between each classification target feature vector and the learning feature vectors. This similarity processing means is configured, for example, to calculate the cosine of the two vectors from the inner product of each learning feature vector and the classification target feature vector, and to sort the calculated cosine values in a predetermined order to quantify the feature difference between the vectors.

(4) Category determining means for determining the category to be assigned to the classification target text based on the determination result of the similarity processing means.

[0013] Preferably, the device further comprises presentation means for displaying, in visible form, the categories corresponding to the one or more learning feature vectors whose similarity to the classification target text falls within a predetermined range. In this case, the category determining means is configured to determine the category specified in response to the presentation by the presentation means as the category to be assigned to the classification target text.

[0014] The information classification system of the present invention, which solves the other problem above, is characterized by comprising the information classification device of the present invention and text input means for taking the classification target texts distributed via a communication line into the information classification device. The text input means is preferably configured to input the classification target texts to the information classification device through an agent function.

[0015]

[Embodiments of the Invention] Embodiments of the present invention are described in detail below with reference to the drawings.

(First Embodiment) FIG. 1 is a functional block diagram of the information classification device according to this embodiment. The information classification device 1 of this embodiment comprises a document database 17 built in the internal or external storage device of a stand-alone computer, together with a text input unit 11, a word processing unit 12, a vector processing unit 13, a learning feature vector set file 14, a similarity processing unit 15, and a category determination unit 16, which are formed when the computer reads and executes a predetermined program.

[0016] The program is normally stored in the internal or external storage device of the computer and is read and executed as needed. It may instead be stored on a recording medium separable from the computer, for example a portable recording medium such as a CD-ROM or FD, or on a program server connected to the computer over a local area network, and be installed in the internal or external storage device at the time of use and executed as needed.

[0017] The document database 17 stores a plurality of digitized document data items (hereinafter, texts). It is configured to store the group of learning texts accumulated in advance (hereinafter, learning texts) and the classification results of one or more texts that are newly to be classified against those learning texts (hereinafter, classification target texts).

[0018] Each learning text is assumed to have been assigned in advance one or more of the N categories C1, C2, ..., CN. The learning texts with their assigned categories are input to the word processing unit 12.

[0019] The text input unit 11 accepts classification target texts through input means (not shown) and passes them to the word processing unit 12. The word processing unit 12 applies a predetermined morphological analysis to the input text to extract words and assigns an importance to each of the extracted words. The words with their assigned importances are input to the feature vector processing unit 13. How the importance is assigned is described later.

[0020] The vector processing unit 13 extracts a feature vector or a set of feature vectors for each category, using the importances assigned by the word processing unit 12 as elements. A set of feature vectors extracted from the learning texts (hereinafter, learning feature vector set) is input to and held in the learning feature vector set file 14, while a feature vector extracted from a classification target text is input to the similarity processing unit 15.

[0021] The similarity processing unit 15 calculates, for each category, the similarity of the classification target text to the learning texts, based on the feature vector of the classification target text and the feature vector set held in the learning feature vector set file 14. The calculated similarities are input to the category determination unit 16. The similarity calculation is described later.

[0022] The category determination unit 16 determines the category to be assigned to the classification target text based on the calculated similarity for each category. The category determination unit 16 is configured, for example, to present the categories to the user through a display device or the like (not shown), starting from the one with the highest similarity, and to determine the category the user specifies in response as the category to be assigned to the classification target text. In this way, even a user who has only a vague idea of the information needed can find it easily by searching from the highest similarity to the lowest. The category determination unit 16 is also configured to attach the determined category to the classification target text and send it to the document database 17. This allows the document database 17 to store classification target texts by category.

[0023] Next, an information classification method using the information classification device 1 of this embodiment is described, focusing on the assignment of importance in the learning texts and the classification target texts, the creation of feature vectors, and the determination of similarity. The word processing unit 12 first extracts the words tk (1 ≤ k ≤ M) that appear in the learning texts and calculates the document count df(tk, Ci) of word tk in the learning texts belonging to category Ci (1 ≤ i ≤ N). The document counts df(t1, Ci), df(t2, Ci), ..., df(tM, Ci) are calculated for all extracted words t1, t2, ..., tM.
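The per-category counting of df(tk, Ci) described above can be sketched as follows. This is only an illustration: it assumes the texts have already been split into words (the morphological analysis step is not shown), and the function name `document_counts` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict

def document_counts(labeled_docs):
    """Return {category: Counter({word: df})} from (tokens, categories) pairs.

    df(t, C) is the number of texts carrying category C in which word t appears.
    """
    df = defaultdict(Counter)
    for tokens, categories in labeled_docs:
        unique_words = set(tokens)        # count each word once per text
        for c in categories:              # a text may carry one or more categories
            df[c].update(unique_words)
    return df

# Hypothetical learning texts, already tokenized.
training = [
    (["stock", "market", "report"], ["economy"]),
    (["stock", "price", "report"], ["economy"]),
    (["match", "goal", "report"], ["sports"]),
]
df = document_counts(training)
```

Here `df["economy"]["stock"]` is 2, while "report" is counted in both categories, which is the kind of word the category frequency is later meant to identify.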

[0024] As noted above, the words with a large document count df are not necessarily only the important words of a category; feature words and general words are mixed together. Specifically, a feature word is expected to show a high document count only in particular categories, whereas a general word is expected to show a high document count across many categories. The word processing unit 12 therefore defines a category frequency cf to judge how general a word is. For example, if a given word tk appears in n of the N categories (n ≤ N), its category frequency cf(tk) is n. In other words, the number of categories in which a word appears is defined as the category frequency of that word. The larger the category frequency cf(tk), the more the word tk can be identified as a general word with little dependence on category.
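A minimal sketch of the category frequency cf(tk), assuming the per-category document counts df are already available as nested dictionaries. The function name `category_frequency` and the sample counts are hypothetical.

```python
def category_frequency(df_by_category):
    """cf(t): number of categories in which word t appears at least once."""
    cf = {}
    for counts in df_by_category.values():
        for word, n in counts.items():
            if n > 0:
                cf[word] = cf.get(word, 0) + 1
    return cf

# Hypothetical document counts for N = 2 categories.
df_by_category = {
    "economy": {"stock": 2, "market": 1, "report": 2},
    "sports": {"match": 1, "goal": 1, "report": 1},
}
cf = category_frequency(df_by_category)
```

In this data "report" has cf = 2 (it appears in every category), whereas "stock" has cf = 1.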

[0025] Next, the importance W(tk, Ci) of word tk in category Ci is defined by equations (2) and (3) below, using, for example, the document count df of the word and a value icf (category frequency coefficient) based on the reciprocal of the category frequency cf:

W(tk, Ci) = df(tk, Ci) × icf(tk)   ... (2)
icf(tk) = log(N / cf(tk))   ... (3)

Using equation (2), which is based on the document count df and the category frequency cf, reduces the importance of general words among the words with a high document count and gives a higher importance to words that are feature words. FIG. 2 is a conceptual diagram of the word importance calculation.
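Equations (2) and (3) can be tried out with a few lines of Python. The sketch below is illustrative only: the function names, the sample numbers, and the choice of the natural logarithm are assumptions, since the source does not specify the log base.

```python
import math

def icf(cf_t, n_categories):
    """Equation (3): icf(t) = log(N / cf(t)). Natural log is an assumption."""
    return math.log(n_categories / cf_t)

def importance(df_t_c, cf_t, n_categories):
    """Equation (2): W(t, C) = df(t, C) * icf(t)."""
    return df_t_c * icf(cf_t, n_categories)

N = 4  # hypothetical number of categories
# Two words with the same document count in a category:
w_feature = importance(df_t_c=30, cf_t=1, n_categories=N)  # confined to one category
w_general = importance(df_t_c=30, cf_t=4, n_categories=N)  # present in every category
```

Although both words have the same df, the word found in all four categories receives log(4/4) = 0 as its coefficient and therefore zero importance, while the word confined to one category keeps the full factor log 4.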

[0026] The importance of a word can also be defined in ways other than equation (2) by combining it with parameters used in conventional methods, for example by further multiplying by the occurrence frequency tf of the word.

[0027] FIG. 3 illustrates the procedure for extracting the feature vectors corresponding to the learning texts. Specifically, the feature vector pi of category Ci in the learning texts can be calculated by equation (4) below, with the word importances defined by equation (2) as its elements:

pi = (W(t1, Ci), W(t2, Ci), ..., W(tM, Ci))   ... (4)

[0028] Based on equation (4), the vector processing unit 13 calculates the feature vectors p1, p2, ..., pN for all categories C1, C2, ..., CN from the document count df and the category frequency cf (steps S101 to S102). The set of these per-category feature vectors, that is, the learning feature vector set, is held in the learning feature vector set file 14 (step S103).
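The construction of the learning feature vectors p1, ..., pN from df and cf (equation (4) with the weights of equation (2)) might look like the sketch below. The fixed vocabulary order, the in-memory dictionary standing in for the learning feature vector set file, the natural logarithm, and all names are assumptions for illustration.

```python
import math

def learning_feature_vectors(df_by_category, vocabulary):
    """Return {category: [W(t1, Ci), ..., W(tM, Ci)]} with W = df * log(N / cf)."""
    n = len(df_by_category)  # N: total number of categories
    # cf(t): number of categories whose document count for t is positive
    cf = {
        t: sum(1 for counts in df_by_category.values() if counts.get(t, 0) > 0)
        for t in vocabulary
    }
    vectors = {}
    for category, counts in df_by_category.items():
        vectors[category] = [
            # words seen in no category get 0.0 to avoid division by zero
            counts.get(t, 0) * math.log(n / cf[t]) if cf[t] else 0.0
            for t in vocabulary
        ]
    return vectors

vocabulary = ["stock", "market", "match", "report"]
df_by_category = {
    "economy": {"stock": 2, "market": 1, "report": 2},
    "sports": {"match": 1, "report": 1},
}
p = learning_feature_vectors(df_by_category, vocabulary)
```

In this example the element for "report" is zero in both vectors because the word appears in every category.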

[0029] Meanwhile, the feature vector q of an unclassified classification target text T, that is, one to which no category has been assigned, is calculated as q = (W'(t1), W'(t2), ..., W'(tM)). Here, W'(tk) is the importance of word tk in the classification target text T and is calculated, for example, from the occurrence frequency tf of the word in T.
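A minimal sketch of the feature vector q, assuming W'(tk) is simply the raw occurrence count tf of the word in the text. The source only says the weight is based on tf "and the like", so this choice, the function name, and the handling of words outside the learning vocabulary (they are ignored) are all assumptions.

```python
from collections import Counter

def target_feature_vector(tokens, vocabulary):
    """q = (W'(t1), ..., W'(tM)), taking W'(t) as the raw count of t in the text."""
    counts = Counter(tokens)
    return [counts[t] for t in vocabulary]  # words not in the vocabulary are dropped

vocabulary = ["stock", "market", "match", "report"]
q = target_feature_vector(["stock", "market", "stock", "today"], vocabulary)
```

The resulting vector uses the same word order as the learning feature vectors, so the two can be compared element by element.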

[0030] Using this feature vector q of the classification target text T, the similarity processing unit 15 calculates the similarity of T to each category of the learning texts. This similarity can be calculated, for example, by equation (5) below, which uses the inner product of vectors as in the known conventional method.

[0031]

d(pi, q) = (pi · q) / (|pi| |q|)   ... (5)

[0032] In equation (5), d(pi, q) is the cosine of the angle between the two feature vectors, and its value lies in the range −1 ≤ d(pi, q) ≤ 1. The larger the cosine d(pi, q), the closer the directions of the two feature vectors; in other words, the more likely it is that the classification target text T belongs to category Ci. This cosine d(pi, q) is the similarity. The category determination unit 16 determines the destination categories in a predetermined order, starting from the categories judged to have a high similarity to the classification target text T.
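The cosine similarity d(pi, q) described here can be sketched in plain Python as follows. Returning 0.0 when either vector has zero length is an assumption; the source does not say how that case is handled.

```python
import math

def cosine_similarity(p, q):
    """Cosine of the angle between p and q; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0 or norm_q == 0:
        return 0.0  # assumption: this case is not defined in the source
    return dot / (norm_p * norm_q)

d_same = cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])  # same direction
d_orth = cosine_similarity([1.0, 0.0], [0.0, 3.0])            # no shared words
```

Vectors pointing in the same direction give a value close to 1 regardless of their lengths, and vectors with no non-zero element in common give 0.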

[0033] FIG. 4 illustrates the procedure for classifying a classification target text. It is assumed here that the learning feature vector set for the learning texts has already been extracted and is held in the learning feature vector set file 14.

[0034] The classification target text is input to the word processing unit 12 via the text input unit 11, and its words are extracted. The occurrence frequency of each extracted word in the text and an importance based on that frequency are then calculated. The vector processing unit 13 extracts the feature vector q of the classification target text with the calculated word importances as elements (step S201). When there are several classification target texts, a feature vector q is extracted for each text. The similarity processing unit 15 calculates the similarity Di (= d(pi, q)) between the feature vector q of the classification target text and each feature vector pi in the learning feature vector set file 14 for every category (steps S202 to S203).
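Steps S201 to S203 can be sketched as a single function that builds q for one text and loops over the stored category vectors. The data layout (a dictionary of category vectors sharing one vocabulary order), the use of raw counts as the weights of q, and all names are assumptions for illustration.

```python
import math
from collections import Counter

def similarity_scores(tokens, vocabulary, learning_vectors):
    """Return {category: Di} for one text, with Di the cosine of pi and q."""
    counts = Counter(tokens)
    q = [counts[t] for t in vocabulary]                # S201: feature vector q
    norm_q = math.sqrt(sum(x * x for x in q))
    scores = {}
    for category, p in learning_vectors.items():       # S202 to S203: every category
        norm_p = math.sqrt(sum(x * x for x in p))
        dot = sum(a * b for a, b in zip(p, q))
        # zero-length vectors are given similarity 0.0 (an assumption)
        scores[category] = dot / (norm_p * norm_q) if norm_p and norm_q else 0.0
    return scores

vocabulary = ["stock", "market", "match"]
learning_vectors = {"economy": [2.0, 1.0, 0.0], "sports": [0.0, 0.0, 1.0]}
scores = similarity_scores(["stock", "market", "stock"], vocabulary, learning_vectors)
```

For this toy input the text points in the same direction as the "economy" vector and shares no words with "sports".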

[0035] After the similarities Di have been calculated, the category determination unit 16 sorts them in descending order of value (step S204), selects a predetermined number starting from the largest, and determines the corresponding categories as the candidate categories to which the classification target text belongs. Alternatively, the categories whose similarity falls within a predetermined range may be taken as the candidate categories to be assigned to the classification target text. The classification target text is thereby classified under those categories (step S205) and stored in the document database 17. Although the category determination in steps S204 to S205 focuses on the magnitude of the calculated similarity, it is not limited to this example; a threshold for category determination may be set as appropriate to narrow down the categories to be determined.
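The selection in steps S204 to S205 (sort in descending order, then keep a fixed number of categories or those above a threshold) can be sketched as below. The parameter names `top_k` and `min_similarity` and the sample similarities are hypothetical; the source leaves the predetermined number and range unspecified.

```python
def candidate_categories(similarities, top_k=None, min_similarity=None):
    """Rank categories by similarity (descending), then filter by threshold and/or count."""
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)  # S204
    if min_similarity is not None:
        ranked = [(c, s) for c, s in ranked if s >= min_similarity]
    if top_k is not None:
        ranked = ranked[:top_k]
    return [c for c, _ in ranked]

# Hypothetical similarities Di for three categories.
D = {"economy": 0.82, "sports": 0.10, "politics": 0.55}
top_two = candidate_categories(D, top_k=2)
above = candidate_categories(D, min_similarity=0.5)
```

Both calls return the categories in descending order of similarity, so the first element is always the closest category.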

[0036] As described above, the information classification device 1 of this embodiment uses the document count and the category frequency (or the category frequency coefficient) when determining the importance of a word in the learning texts, which makes it easy to select candidate feature words for a category.

[0037] In addition, because the proportion of words that appear in all categories is reflected in the importance, the importance of general words within the group of frequently occurring words is reduced, and feature words can be given a higher importance than general words. This greatly improves the quality of the learning feature vectors and the classification accuracy.
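
The weighting described above can be sketched as follows. Claim 5 states only that the category frequency coefficient is based on dividing the total number of categories by the number of categories in which a word appears. Taking the logarithm of that quotient is an assumption made here so that a word present in every category is weighted to zero. The function name `learning_importance` and the sample counts are likewise hypothetical.

```python
import math
from collections import defaultdict

def learning_importance(counts):
    """counts: {category: {word: number of occurrences}}.

    Returns {category: {word: importance}} with
    importance = occurrences * log(total categories / categories with word).
    The log form of the coefficient is an assumption, not the patent's text.
    """
    total = len(counts)
    category_freq = defaultdict(int)  # number of categories containing each word
    for words in counts.values():
        for word in words:
            category_freq[word] += 1
    return {
        cat: {w: n * math.log(total / category_freq[w]) for w, n in words.items()}
        for cat, words in counts.items()
    }

# Hypothetical occurrence counts per category.
counts = {
    "sports": {"the": 50, "goal": 8},
    "finance": {"the": 60, "stock": 9},
    "politics": {"the": 55, "vote": 7},
}
importance = learning_importance(counts)
```

In this example `the` appears in all three categories and receives importance 0 despite its high count, while `goal` keeps a positive importance in the sports category.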

[0038] (Second Embodiment) The present invention can also be implemented as a system that automatically classifies the large volume of digitized information circulating over a public network such as the Internet, for example an information classification system that deploys an information classification server functioning as the information classification device described above and clients functioning as information acquisition devices. The information classification server in this case is positioned, for example, as a search engine for multiple large-scale databases in an Internet environment.

[0039] Like the information classification device 1 of the first embodiment, this information classification server builds a database identical to the document database 17 in an internal or external storage device of a computer. It comprises a communication control unit that communicates with clients via the public network, together with the same functional blocks as the information classification device 1: the text input unit 11, word processing unit 12, feature vector processing unit 13, learning feature vector set file 14, similarity processing unit 15, and category determination unit 16 (reference numerals follow FIG. 1).

[0040] This information classification server differs from the information classification device 1 in that it has a known communication control unit that performs communication control. It is configured so that the groups of digitized information circulating via this communication control unit are input to the text input unit 11, and so that classification requests from clients are received. A classification request may contain, for example, information identifying the digitized information to be classified. Likewise, the classification result can be delivered by transmitting it to the client via the communication control unit, which yields the same effect as the information classification device 1. The classification result in this case may be, for example, the category to which the target text belongs.

[0041] Furthermore, by using an agent function in the Internet environment as the means of supplying text to the information classification server, a system can be built that automatically classifies and manages the large volume of digitized information in circulation. Thus, even when a user on the client side has only a vague idea of the text they need, the necessary information can be obtained easily by applying classification successively from the upper levels of the text classification to the lower levels and following that progression.

【００４２】[0042]

[Effects of the Invention] As is clear from the above description, the present invention allows the learning feature vectors to be expressed clearly, which enables highly accurate classification. In addition, by automatically performing the classification process of the present invention in accordance with the existing classification system of the learning texts, the information that users need can be easily retrieved and used. Furthermore, when the present invention is applied to an information retrieval system or the like, a system can be provided in which the efficiency and practicality of retrieval are markedly improved.

[Brief description of the drawings]

FIG. 1 is a functional block diagram of an information classification device according to an embodiment of the present invention.

FIG. 2 is a conceptual diagram illustrating the calculation of word importance.

FIG. 3 is a diagram of the processing procedure for creating the learning feature vector set.

FIG. 4 is a diagram of the processing procedure for classification.

[Explanation of symbols]

1 information classification device
11 text input unit
12 word processing unit
13 feature vector processing unit
14 learning feature vector set file
15 similarity processing unit
16 category determination unit
17 document database

Claims

[Claims]

1. An information classification method comprising: a process of extracting words from learning texts whose categories are known, calculating for each extracted word an importance based on its number of occurrences and the number of categories in which it appears, and generating, with the calculated importances as elements, a learning feature vector representing the features of each category; a process of calculating, for texts whose category is unknown, an importance based on the frequency of appearance of each word in the classification target text, and generating, with the calculated importances as elements, a classification target feature vector representing the features of each text; and a process of determining the similarity between the classification target feature vector and the learning feature vector for each category, wherein categories corresponding to learning feature vectors whose similarity to the classification target text falls within a predetermined range are set as candidate categories to be assigned to the classification target text.

2. The information classification method according to claim 1, wherein categories corresponding to a predetermined number or more of learning feature vectors, taken from the top when arranged in descending order of similarity to the classification target text, are set as candidate categories to be assigned to the classification target text.

3. The information classification method, wherein the process of generating the learning feature vector includes: determining, by focusing on the appearance tendency of words in the learning text, feature words that serve as indices representing the features of a category and general words that do not depend on category; and reducing the importance of the general words based on the number of categories in which each word appears, thereby generating a learning feature vector in which the importance of the feature words is reflected relatively highly.

4. An information classification device that performs classification by determining, in accordance with the classification system of learning texts to which one or more categories have been assigned, the category to be assigned to a classification target text whose category is unknown, the device comprising: word processing means for extracting words from each of the learning texts and the classification target text and calculating the importance of each extracted word; vector processing means for generating, with the importance of each word as elements, learning feature vectors that express the features of the learning texts for each category and a classification target feature vector that expresses the features of the classification target text for each text; similarity processing means for determining, for each category, the similarity of the classification target feature vector to the learning feature vector based on the feature difference between the two; and category determination means for determining the category to be assigned to the classification target text based on the determination result.

5. The information classification device according to claim 4, wherein the word processing means has means for calculating a category frequency coefficient based on dividing the total number of categories in the learning texts by the number of categories in which a specific word appears, and is configured to calculate the importance of each word in the learning texts by multiplying the number of occurrences of a word appearing in a specific category by the category frequency coefficient, thereby reducing the importance of words that have a relatively large number of occurrences and relatively little dependence on category.

6. The information classification device according to claim 4, configured to calculate the importance of each word in the learning texts by taking the value obtained by multiplying the number of occurrences of a word appearing in a specific category by the category frequency coefficient and multiplying that value by the appearance frequency of the word.

7. The information classification device according to claim 4, wherein the word processing means includes means for measuring the frequency of appearance of words in the classification target text, and is configured so that words with a lower frequency of appearance have a higher importance in the classification target text.

8. The information classification device according to claim 4, wherein the similarity processing means is configured to calculate the cosine between the two vectors based on the inner product of each learning feature vector and the classification target feature vector, and to quantify the feature difference between the two vectors by arranging the calculated cosine values in a predetermined order.

9. The information classification device according to claim 4, further comprising presentation means for presenting, in a visually recognizable manner, the categories corresponding to one or more learning feature vectors whose similarity to the classification target text is within a predetermined range, wherein the category determination means is configured to determine a category designated in response to the presentation by the presentation means as the category to be assigned to the classification target text.

10. An information classification system comprising: the information classification device according to claim 4; and text input means for taking classification target texts distributed via a communication line into the information classification device.

11. The information classification system according to claim 10, wherein the text input means is configured to input the classification target text to the information classification device through an agent function.