JP2003167891A

JP2003167891A - Word significance calculating method, device, program and recording medium

Info

Publication number: JP2003167891A
Application number: JP2001364323A
Authority: JP
Inventors: Masayuki Sugizaki; 正之杉崎; Toshiaki Makino; 俊朗牧野; Masaru Miyamoto; 勝宮本; Hiroto Inagaki; 博人稲垣
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2001-11-29
Filing date: 2001-11-29
Publication date: 2003-06-13

Abstract

PROBLEM TO BE SOLVED: To calculate the importance of search words.

SOLUTION: First, a category information input unit 101 inputs the categories that serve as prior information for an automatic classification system, together with the sample documents assigned to each category. Next, a search record extraction unit 102 extracts the "search word" and "selected category" information from the search input records logged by a search service. A search record totaling unit 103 totals the input frequency of each search word for each category. A category information learning unit 104 extracts the words occurring in the documents acquired by the category information input unit 101, totals them for each category, and calculates the importance of each word to each category, also using the data obtained by the search record totaling unit 103. A result output unit 105 outputs the result determined by the category information learning unit 104 to a display.

COPYRIGHT: (C)2003, JPO

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a method and an apparatus for calculating the importance of words, as needed for automatically classifying large volumes of document data.

[0002]

[Prior Art] In recent years, it has become possible to exchange large volumes of electronic documents over computer networks such as the Internet. As a result, services that let individuals search for the information they need have been implemented on these networks.

[0003] Text search services on the Internet generally fall into two types: robot-type and directory-type. A robot-type service automatically collects documents on the Internet in advance and, when a word is entered as the search input, outputs a list of the documents containing that word or the full text of the documents themselves. A directory-type service assigns documents in advance to a prepared multi-level tree structure called a "directory (category)", so that the user reaches the target document list or document by following the directory (FIG. 3 illustrates the directory type). Naturally, a search service restricted to the documents registered in the directory is also conceivable, as is a service that searches the directory itself.

[0004] When the directory itself is searched, the results can be displayed in one of two ways: by displaying the categories whose names match the character string of the entered word, or by assigning several words to each category in advance for search purposes and displaying the categories that contain the entered word.

[0005] As a special example of directory search, there is also a device, used in a search service that targets documents carrying directory information, that totals the directory information attached to the documents satisfying the input conditions and displays that directory information when presenting the search results (Japanese Patent Laid-Open No. 11-306187, "Method for presenting search results of documents with categories and apparatus therefor").

[0006] Meanwhile, methods for automatically classifying document information have long been studied. Representative approaches include one in which the classification categories are known in advance, as in a library, and each new item is assigned to the category judged most appropriate ("Automatic classification of texts using mutual relationships of classification systems", Yamamoto, Masuyama (Toyohashi University of Technology), Naito (NTT), study meeting of the Natural Language Society, 1995), and one in which the categories are unknown and similar documents are gathered from the document set to create and assign classification categories ("Automatic division by competitive learning neural network", Kikuchi, Matsuoka et al. (Utsunomiya University and others), IEICE Transactions, 1995).

[0007] The classification method that uses the dictionary generated by the dictionary learning method addressed by the present invention applies when the classification categories are known in advance. When the system is given the categories and the sample documents or words that belong in each, it calculates word importance from that information and generates, as the feature of each category, a vector of pairs consisting of a word and its importance to that category. For a document to be classified, it likewise calculates the importance of each word to the document and generates a vector ("Automatic Text Processing", Gerard Salton, Addison-Wesley, 1989).

[0008] The feature vector generated for each category i,

[0009]

[External character 1] (vector symbol; image not reproduced in this text)

[0010] is given by

[0011]

[Equation 1] (image not reproduced in this text)

[0012] where N is the total number of words that appear in the sample document group.

[0013] In the classification process, a distance between the feature vector of a category and the feature vector of a document is defined, and that value is used to assign each document to the category it resembles. If the distance is very large, that is, if the document is judged to resemble no category, it is not assigned to any category.
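The assignment rule in paragraph [0013] can be sketched as follows. This is a minimal illustration, not the method of the publication: the text does not say which distance is used, so cosine distance is assumed here, and the names `cosine_distance`, `classify`, and the cutoff `max_distance` are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two sparse vectors (dict: word -> weight)."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # an empty vector is treated as sharing nothing with the other
    return 1.0 - dot / (norm_u * norm_v)


def classify(doc_vector, category_vectors, max_distance=0.8):
    """Return the closest category, or None if every category is too far away.

    max_distance is an assumed cutoff; the publication gives no value.
    """
    best_category, best_distance = None, float("inf")
    for category, vector in category_vectors.items():
        distance = cosine_distance(doc_vector, vector)
        if distance < best_distance:
            best_category, best_distance = category, distance
    if best_distance > max_distance:
        return None  # the document resembles no category
    return best_category


categories = {
    "TV station": {"broadcast": 13.0, "channel": 4.0},
    "Anime voice actor": {"anime": 9.0, "voice": 6.0},
}
print(classify({"broadcast": 2.0}, categories))
print(classify({"recipe": 1.0}, categories))
```

A document that shares no words with any category has distance 1.0 to all of them and is left unassigned under the assumed cutoff.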

[0014]

[Problem to Be Solved by the Invention] In the method of generating feature vectors described above, the kinds of words learned and their importance are determined by the sample documents from which the words are extracted. That is, if many of the sample documents contain many inappropriate words, the learned words will also be inappropriate; yet finding sample documents that suit a category, or writing such documents by hand, is also difficult.

[0015] An object of the present invention is to provide a learning method and an apparatus for calculating the importance of words, as needed for automatically classifying large volumes of document data.

[0016]

[Means for Solving the Problem] The invention uses the search input records logged by a search service. The search service is assumed to be one such as that proposed in Japanese Patent Laid-Open No. 11-306187. Its distinguishing features are that it displays a list of categories according to the entered word, and that the word and the category name do not necessarily match. For example, for the search words "broadcast" or "broadcast station", if many of the documents containing those words carry the category name "TV station", the search result displays "TV station"; if many carry the category name "radio station", it displays "radio station".

[0017] On the search service side, the service logs, as a search record, the fact that a search service user (hereinafter, "user") has, for example, entered the word "broadcast" and selected the category "TV station". The other "search word(s)" and "selected category" information is recorded in the same way.

[0018] Using these search records, the number of times each search word was input is totaled for each category. In the example above, the search word "broadcast" was input once for the category "TV station". When several words are entered as one search input, each word is counted once (the number of times search word j was input for category i is denoted c_ij).
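The tally described in paragraph [0018] can be sketched as below. The record shape (a list of search words paired with the selected category) and the function name `count_search_inputs` are assumptions made for illustration; the publication does not define a data format.

```python
from collections import Counter


def count_search_inputs(records):
    """Tally c_ij: how often word j was entered when category i was selected.

    records: iterable of (search_words, selected_category) pairs (assumed shape).
    Returns a Counter keyed by (category, word).
    """
    counts = Counter()
    for search_words, category in records:
        for word in search_words:
            # each word of a multi-word query is counted once for the category
            counts[(category, word)] += 1
    return counts


records = [
    (["broadcast"], "TV station"),
    (["broadcast", "news"], "TV station"),
    (["anime"], "Anime voice actor"),
]
c = count_search_inputs(records)
print(c[("TV station", "broadcast")])
```

Keying the counter by `(category, word)` keeps the table sparse, which suits search logs where most word and category combinations never occur.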

[0019] Next, the calculation of the feature vector of the category information is described. The sample documents assigned to a category (denoted i) are analyzed, and the number of times each word (denoted j) appears in them is counted (denoted tf_ij). In general, the word importance w_ij can be expressed as

[0020]

[Equation 2]

$$w_{ij} = tf_{ij} \times \log_{10}\frac{M}{m_j}$$

(The equation image is not reproduced in this text. The form above is reconstructed from the worked example in paragraphs [0037] and [0038], writing M for the total number of categories and m_j for the number of categories in which word j appears.)

[0021] Here, using the input count c_ij totaled from the search records, a new word importance w'_ij is defined, for example, as

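[0022]

[Equation 3]

$$w'_{ij} = (tf_{ij} + c_{ij}) \times \log_{10}\frac{M}{m_j}$$

(The equation image is not reproduced in this text. The form above is reconstructed from the worked example in paragraph [0037], writing M for the total number of categories and m_j for the number of categories in which word j appears.)

[0023] With this definition, two effects can be expected: the importance of the word to the category increases because its occurrence count increases, and (once the feature vectors are subsequently normalized) the importance of the words in the categories that were not selected becomes relatively smaller.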
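The two importance values, in the form implied by the worked example in paragraph [0037], can be sketched as follows together with a normalization step. The base-10 logarithm is inferred from the numbers in that example. L2 normalization is an assumption (the text only says the vector is normalized), and the second word "channel" with weight 8.0 is invented purely for illustration.

```python
import math


def base_importance(tf, total_categories, categories_with_word):
    """w_ij = tf_ij * log10(total categories / categories containing the word)."""
    return tf * math.log10(total_categories / categories_with_word)


def boosted_importance(tf, c, total_categories, categories_with_word):
    """w'_ij = (tf_ij + c_ij) * log10(total categories / categories containing the word)."""
    return (tf + c) * math.log10(total_categories / categories_with_word)


def normalize(vector):
    """Scale a sparse vector (dict: word -> weight) to unit L2 length (assumed norm)."""
    norm = math.sqrt(sum(w * w for w in vector.values()))
    if norm == 0.0:
        return dict(vector)
    return {word: w / norm for word, w in vector.items()}


before = normalize({"broadcast": base_importance(10, 100, 5), "channel": 8.0})
after = normalize({"broadcast": boosted_importance(10, 5, 100, 5), "channel": 8.0})
print(before)
print(after)
```

In this toy vector the boosted word gains weight after normalization while the word that received no search inputs loses weight, which is one way to read the relative effect described in paragraph [0023].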

[0024] Another use of c_ij is, for example, to treat any word j with c_ij > th (where th is a threshold) as a required word during classification: if word j does not occur in a document, the document is not classified into category i.
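A minimal sketch of the required-word rule in paragraph [0024] follows. It assumes that when several words exceed the threshold, all of them must occur in the document; the text describes the rule for a single word j and does not settle this case. The function names and the `(category, word)` key layout are hypothetical.

```python
def required_words(input_counts, category, threshold):
    """Words whose input count c_ij for this category exceeds the threshold th."""
    return {
        word
        for (cat, word), count in input_counts.items()
        if cat == category and count > threshold
    }


def may_assign(doc_words, input_counts, category, threshold):
    """True if the document contains every required word of the category (assumed reading)."""
    return required_words(input_counts, category, threshold) <= set(doc_words)


counts = {("TV station", "broadcast"): 12, ("TV station", "news"): 3}
print(may_assign(["broadcast", "tonight"], counts, "TV station", 10))
print(may_assign(["news", "tonight"], counts, "TV station", 10))
```

Such a check would run before the distance comparison, so a document lacking a required word is never considered for that category.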

[0025] Note that the word importance is calculated before the actual document classification is performed.

[0026]

[Embodiments of the Invention] Next, an embodiment of the present invention is described with reference to the drawings.

[0027] Referring to FIG. 1, the word importance calculation device of one embodiment of the present invention comprises a category information input unit 101, a search record extraction unit 102, a search record totaling unit 103, a category information learning unit 104, and a result output unit 105, and is used with a search service as shown in FIG. 2.

[0028] First, the category information input unit 101 inputs the categories that serve as prior information for the automatic classification system, together with the sample documents assigned to them.

[0029] Next, the search record extraction unit 102 extracts the "search word" and "selected category" information from the search input records logged by the search service. The search record totaling unit 103 uses the information extracted by the search record extraction unit 102 to total, for each category, the number of times each search word was input when that category was selected. The category information learning unit 104 extracts the words present in the sample documents from the data acquired by the category information input unit 101, totals their occurrence counts for each category, and calculates the importance of each word for each category, also using the data (input counts) obtained by the search record totaling unit 103. The result output unit 105 outputs the importance obtained by the category information learning unit 104, together with the category and the word, to a display or the like.

[0030] The processing flow of this embodiment is now described concretely.

[0031] The search records are assumed to be logged by a search service operated separately from the present invention. They are assumed to contain at least the two items used by the present invention: the search word and the selected category.

[0032] The category information input unit 101 inputs the category information and the sample documents. For example, the categories may be "/Entertainment/TV/TV station" and "/Entertainment/TV/Anime/Anime voice actor", and the sample documents in each may be documents or newspaper articles about TV stations or anime voice actors.

[0033] The search record extraction unit 102 extracts from the search records the information needed by the present invention (the "search word" and the "selected category"). From the example in Table 1, the pairs ("anime", "Anime voice actor") and ("broadcast", "TV station") are extracted as "search word" and "selected category". A record that contains only one of the two items, search word or selected category, is not extracted.
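The extraction rule in paragraph [0033], including the skipping of incomplete records, can be sketched as follows. Table 1 is not reproduced in this text, so the record layout is unknown; the dictionary keys `search_words` and `selected_category` and the function name `extract_records` are hypothetical.

```python
def extract_records(raw_records):
    """Keep only records that carry both search words and a selected category.

    raw_records: list of dicts with the assumed keys "search_words" and
    "selected_category". Returns a list of (search_words, category) pairs.
    """
    extracted = []
    for record in raw_records:
        words = record.get("search_words")
        category = record.get("selected_category")
        if not words or not category:
            continue  # only one of the two items is present: do not extract
        extracted.append((list(words), category))
    return extracted


raw = [
    {"search_words": ["anime"], "selected_category": "Anime voice actor"},
    {"search_words": ["broadcast"], "selected_category": "TV station"},
    {"search_words": ["weather"]},        # no category selected
    {"selected_category": "TV station"},  # no search word logged
]
print(extract_records(raw))
```

Extra fields that a real log might carry, such as timestamps, would simply be ignored by this filter.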

[0034]

[Table 1] (table not reproduced in this text)

[0035] The search record totaling unit 103 uses the search records extracted by the search record extraction unit 102 to obtain the word input counts c_ij. From the example in Table 1, the count of the word "broadcast" for the category "TV station" is incremented by one, and the count of the word "anime" for the category "Anime voice actor" is incremented by one.

[0036] The category information learning unit 104 calculates the importance of each word for each category from the word input counts c_ij totaled by the search record totaling unit 103 and the data input through the category information input unit 101.

[0037] For example, suppose that for the word "broadcast" and the category "TV station" the number of occurrences in the sample documents tf_ij is 10, the input count c_ij is 5, the total number of categories input through the category information input unit 101 is 100, and the number of categories in which "broadcast" appears is 5. The word importance w'_ij is then w'_ij = (10 + 5) * log(100 / 5) = 19.52, where log is the base-10 logarithm.

[0038] This is larger than w_ij = 10 * log(100 / 5) = 13.01. The value of w'_ij is calculated in the same way for the other words and categories, and the result output unit 105 outputs the results as "category, word, importance".
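The learning and output steps of units 104 and 105 can be sketched end to end as below. This is an illustration under stated assumptions rather than the implementation of the publication: sample documents are assumed to be already tokenized into word lists, words seen only in the search log and never in a category's sample documents are ignored, and the name `learn_importance` and the toy data are hypothetical.

```python
import math
from collections import Counter


def learn_importance(sample_docs, input_counts):
    """Return (category, word, importance) triples.

    sample_docs: dict category -> list of tokenized documents (lists of words).
    input_counts: dict (category, word) -> c_ij from the search records.
    """
    # tf_ij: occurrences of each word in the sample documents of each category
    tf = {
        category: Counter(word for doc in docs for word in doc)
        for category, docs in sample_docs.items()
    }
    total_categories = len(sample_docs)
    # number of categories whose sample documents contain each word
    category_freq = Counter(word for counts in tf.values() for word in counts)

    results = []
    for category, counts in tf.items():
        for word, occurrences in counts.items():
            c = input_counts.get((category, word), 0)
            scale = math.log10(total_categories / category_freq[word])
            results.append((category, word, (occurrences + c) * scale))
    return results


sample_docs = {
    "TV station": [["broadcast", "news"], ["broadcast"]],
    "Anime voice actor": [["anime", "news"]],
}
input_counts = {("TV station", "broadcast"): 5}
for category, word, importance in learn_importance(sample_docs, input_counts):
    print(category, word, round(importance, 4))
```

A word that occurs in every category, such as "news" in the toy data, receives importance zero because the logarithm of 1 is zero, regardless of how often it was searched.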

[0039] In this embodiment, the data flow is circular: the automatic classification system classifies documents using the output of this embodiment, the search service uses the classification results, the present invention relearns using the information logged by the search service, and so on. The information input through the category information input unit 101 may therefore be stored in advance on a recording medium rather than input each time the present invention runs.

[0040] Besides being realized by dedicated hardware, this embodiment may be realized by recording a program that implements its functions on a computer-readable recording medium and having a computer system read and execute the program recorded on that medium. The computer-readable recording medium refers to a recording medium such as a floppy (registered trademark) disk, a magneto-optical disk, or a CD-ROM, or to a storage device such as a hard disk drive built into a computer system. The computer-readable recording medium further includes media that hold the program dynamically for a short time (a transmission medium or transmission wave), as when the program is transmitted over the Internet, and media that hold the program for a certain time, such as the volatile memory inside the computer system acting as the server in that case.

[0041]

[Effects of the Invention] As described above, the present invention records users' actions on search results and uses them to calculate word importance values that correct the learned content of the classification dictionary used by the automatic classification system, thereby improving the classification accuracy of the classification system.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing the schematic configuration of a word importance calculation device according to an embodiment of the present invention.

FIG. 2 is a block diagram showing the overall configuration of a system that includes the word importance calculation device according to an embodiment of the present invention.

FIG. 3 is a conceptual diagram of the directory information of a conventional directory-type service and the document information assigned to it.

[Explanation of Symbols]

101 Category information input unit
102 Search record extraction unit
103 Search record totaling unit
104 Category information learning unit
105 Result output unit

Continuation of front page

(72) Inventor: Masaru Miyamoto, c/o Nippon Telegraph and Telephone Corporation, 2-3-1 Otemachi, Chiyoda-ku, Tokyo

(72) Inventor: Hiroto Inagaki, c/o Nippon Telegraph and Telephone Corporation, 2-3-1 Otemachi, Chiyoda-ku, Tokyo

F-terms (reference): 5B075 ND03 NR12 PQ02 PQ38 PR04

Claims

1. A word importance calculation method for calculating the importance of words, as needed for automatically classifying a large amount of document information into categories, the method comprising: a first step of inputting categories that are classification destinations and sample documents assigned in advance to those categories; a second step of acquiring, from a search input record, information on the words that are search words and on the selected category; a third step of totaling, for each category, the number of times each word obtained in the second step was input; a fourth step of counting the number of times each word appears in the sample documents of a classification-destination category, and calculating the importance of the word for each category from the total number of classification-destination categories, the number of categories in which the word appears, the number of appearances of the word in the category, and the number of times the word was input; and a fifth step of outputting the category, the word, and the importance of the word for the category.

2. A word importance calculation device for calculating the importance of words, as needed for automatically classifying a large amount of document information into categories, the device comprising: a category information input unit that inputs categories that are classification destinations and sample documents assigned in advance to those categories; a search record extraction unit that acquires, from a search input record, information on the words that are search words and on the selected category; a search record totaling unit that totals, for each category, the number of times each word obtained by the search record extraction unit was input; a category information learning unit that counts the number of times each word appears in the sample documents of a classification-destination category, and calculates the importance of the word for each category from the total number of classification-destination categories, the number of categories in which the word appears, the number of appearances of the word in the category, and the number of times the word was input; and a result output unit that outputs the importance obtained by the category information learning unit together with the category and the word.

3. A word importance calculation program for causing a computer to execute the word importance calculation method according to claim 1.

4. A recording medium recording a word importance calculation program for causing a computer to execute the word importance calculation method according to claim 1.