JPH11328211A

JPH11328211A - Mass information automatic classification method and apparatus, and recording medium recording mass information automatic classification program

Info

Publication number: JPH11328211A
Application number: JP10137141A
Authority: JP
Inventors: Masayuki Sugizaki; 正之杉崎; Masakatsu Okubo; 雅且大久保; Takashi Inoue; 孝史井上; Kazuo Tanaka; 一男田中
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1998-05-19
Filing date: 1998-05-19
Publication date: 1999-11-30
Anticipated expiration: 2018-05-19
Also published as: JP3571214B2

Abstract

(57) [Abstract]
[Problem] To provide a method and apparatus for automatically classifying large amounts of information, and a recording medium on which a program for such classification is recorded, that shorten computation time by dividing the real categories to be classified into several groups, called intermediate categories, in advance and computing distances to the real categories only when the condition on the intermediate category is satisfied.
[Solution] Categories for classifying documents and their features are input; feature vectors are calculated from the input categories and features and stored; using the input categories, their features, and an input classification criterion, an intermediate category calculation unit 107 newly creates categories and assigns similar input categories to the same intermediate category; and the input documents are classified using the input categories and the categories obtained by the intermediate category calculation unit.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a method and apparatus for automatically classifying a large amount of information, such as document data, into a large number of categories at high speed, and to a recording medium on which a program for such classification is recorded.

[0002]

[Prior Art] In recent years, it has become possible to exchange large volumes of electronic documents over computer networks such as the Internet, and services that let individuals search for the information they need have been implemented on such networks. As a result, however, the amount of information a user acquires becomes so large that it is difficult to extract the characteristics of each individual item. A technique for classifying and organizing the acquired information is therefore required.

[0003] Methods for automatically classifying document information have long been studied. Representative approaches include a method in which the divisions used for classification (called categories) are known in advance, as in a library, and each new item of information is classified into the category deemed appropriate ("Automatic Classification of Text Using the Relationships Between Classification Systems", Yamamoto, Masuyama (Toyohashi University of Technology), Naito (NTT), 1995), and a method in which the categories are unknown and similar documents are gathered from the document set to create and assign classification categories ("Automatic Segmentation by Competitive Learning Neural Network", Kikuchi, Matsuoka et al. (Utsunomiya University and others), 1995). These techniques are used to classify and organize large numbers of documents.

[0004] The classification method addressed by the present invention is one in which the categories used for classification are known in advance. When the system is given the categories and the sample documents or words that belong in each, it calculates the importance of each word from that information and generates, as the feature of each category, a vector of pairs consisting of a word and its importance to the category. For each document to be classified, it likewise calculates the importance of each word to the document and generates a vector ("Automatic Text Processing", Gerard Salton, 1989).

[0005]

The result is expressed as

(Equation 1)

where N denotes the number of dimensions. A feature vector is likewise generated for each document to be classified.

[0006] For classification, a distance between the feature vector of a category and the feature vector of a document is defined, and each document is assigned to a similar category according to that value. If the distance is very large, that is, if the document is judged not to resemble any category, it is not assigned to any category.
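The scheme of paragraphs [0004] to [0006] can be illustrated with a minimal Python sketch. The patent does not fix the importance weighting or the distance function at this point, so the sketch assumes relative term frequency as the importance and cosine distance as the distance; the function names and the threshold `max_distance=0.8` are hypothetical.

```python
import math
from collections import Counter

def feature_vector(texts):
    """Return a word -> weight mapping; the weight is relative frequency (an assumption)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(word, 0.0) for word, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

def classify(document, category_vectors, max_distance=0.8):
    """Return the nearest category name, or None if every category is too far away."""
    doc_vec = feature_vector([document])
    best_name, best_dist = None, max_distance
    for name, cat_vec in category_vectors.items():
        dist = cosine_distance(doc_vec, cat_vec)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name

# Hypothetical sample data: two categories, each described by sample documents.
categories = {
    "traffic accident": feature_vector(
        ["car crash on the highway", "truck collision injures driver"]),
    "earthquake": feature_vector(
        ["strong earthquake shakes city", "tremor and aftershock reported"]),
}
print(classify("highway crash involving a truck", categories))
print(classify("stock prices rise sharply", categories))
```

The second call returns `None` because the document shares no word with either category, which corresponds to the "not assigned to any category" case. Note that the loop visits every category for every document, which is the cost the invention sets out to reduce.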

[0007]

[Problem to Be Solved by the Invention] The conventional method described above has the problem that the computation time grows in proportion to the total number of categories to be classified and the number of words that appear.

[0008] The present invention has been made in view of the above. Its object is to provide a method and apparatus for automatically classifying large amounts of information, and a recording medium on which a program for such classification is recorded, that shorten computation time by dividing the real categories to be classified into several groups, called intermediate categories, in advance and computing distances to the real categories only when the condition on the intermediate category is satisfied.

[0009]

[Means for Solving the Problem] To achieve the above object, the invention of claim 1 is a method for automatically classifying a large amount of information into a large number of categories, comprising: a first step of inputting documents written in natural language; a second step of storing the documents input in the first step; a third step of inputting categories for classifying the documents and the features of those categories; a fourth step of storing the categories and features input in the third step and calculating and storing feature vectors from them; a fifth step of inputting a classification criterion, which is an initial value for the classification process; a sixth step of storing the classification criterion input in the fifth step; a seventh step of newly creating categories using the categories and features input in the third step and the classification criterion input in the fifth step, and assigning similar categories among those input in the third step to the same intermediate category; and an eighth step of classifying the documents input in the first step using the categories input in the third step and the categories obtained in the seventh step.

[0010] In the invention of claim 1, categories for classifying documents and their features are input; feature vectors are calculated from the input categories and features and stored; new categories are created using the input categories, their features, and the input classification criterion; similar input categories are assigned to the same intermediate category; and the input documents are classified using the created categories and the input categories. Documents can therefore be classified at high speed.

[0011] The invention of claim 2 is the invention of claim 1, wherein, in the seventh step, a plurality of inter-category distance functions are combined on the basis of the classification criterion input in the fifth step to create a distance function suited to that criterion, and the created distance function is used to calculate the distances between the categories input in the third step and thereby generate the intermediate categories.

[0012] In the invention of claim 2, a plurality of inter-category distance functions are combined on the basis of the input classification criterion to create a distance function suited to that criterion, and the created distance function is used to calculate the distances between the input categories and generate the intermediate categories.

[0013] Further, the invention of claim 3 is the invention of claim 1 or 2, wherein, in the eighth step, classification into the intermediate categories generated in the seventh step is performed, and whether classification into the categories input in the third step is necessary is determined on the basis of that classification result and the classification criterion input in the fifth step.

[0014] In the invention of claim 3, whether classification into the input categories is necessary is determined on the basis of the result of classification into the generated intermediate categories and the input classification criterion.

[0015] The invention of claim 4 is an apparatus for automatically classifying a large amount of information into a large number of categories, comprising: a document input unit for inputting documents written in natural language; a document storage unit for storing the documents input through the document input unit; a category information input unit for inputting categories for classifying the documents and the features of those categories; a category information storage unit for storing the categories and features input through the category information input unit and for calculating and storing feature vectors from them; an initial value input unit for inputting a classification criterion, which is an initial value for the classification process; an initial value storage unit for storing the classification criterion input through the initial value input unit; an intermediate category calculation unit for newly creating categories using the categories and features input through the category information input unit and the classification criterion input through the initial value input unit, and assigning similar categories among those input through the category information input unit to the same intermediate category; and a real category calculation unit for classifying the documents input through the document input unit using the categories input through the category information input unit and the categories obtained by the intermediate category calculation unit.

[0016] In the invention of claim 4, categories for classifying documents and their features are input; feature vectors are calculated from the input categories and features and stored; the intermediate category calculation unit newly creates categories using the input categories, their features, and the input classification criterion, and assigns similar input categories to the same intermediate category; and the input documents are classified using the input categories and the categories obtained by the intermediate category calculation unit. Documents can therefore be classified at high speed.

[0017] The invention of claim 5 is the invention of claim 4, wherein the intermediate category calculation unit has means for combining a plurality of inter-category distance functions on the basis of the classification criterion input through the initial value input unit to create a distance function suited to that criterion, and for using the created distance function to calculate the distances between the categories input through the category information input unit and thereby generate the intermediate categories.

[0018] In the invention of claim 5, a plurality of inter-category distance functions are combined on the basis of the input classification criterion to create a distance function suited to that criterion, and the created distance function is used to calculate the distances between the input categories and generate the intermediate categories.

[0019] Further, the invention of claim 6 is the invention of claim 4 or 5, wherein the real category calculation unit has means for performing classification into the intermediate categories generated by the intermediate category calculation unit and for determining, on the basis of that classification result and the classification criterion input through the initial value input unit, whether classification into the categories input through the category information input unit is necessary.

[0020] In the invention of claim 6, whether classification into the input categories is necessary is determined on the basis of the result of classification into the intermediate categories generated by the intermediate category calculation unit and the input classification criterion.

[0021] The invention of claim 7 is a recording medium on which is recorded a program for automatically classifying a large amount of information into a large number of categories, the program comprising: a first step of inputting documents written in natural language; a second step of storing the documents input in the first step; a third step of inputting categories for classifying the documents and the features of those categories; a fourth step of storing the categories and features input in the third step and calculating and storing feature vectors from them; a fifth step of inputting a classification criterion, which is an initial value for the classification process; a sixth step of storing the classification criterion input in the fifth step; a seventh step of newly creating categories using the categories and features input in the third step and the classification criterion input in the fifth step, and assigning similar categories among those input in the third step to the same intermediate category; and an eighth step of classifying the documents input in the first step using the categories input in the third step and the categories obtained in the seventh step.

[0022] In the invention of claim 7, the recording medium stores a program that inputs categories for classifying documents and their features, calculates and stores feature vectors from the input categories and features, newly creates categories using the input categories, their features, and the input classification criterion, assigns similar input categories to the same intermediate category, and classifies the input documents using the created categories and the input categories. The program can therefore be distributed more readily by means of the recording medium.

[0023] The invention of claim 8 is the invention of claim 7, wherein the program recorded on the recording medium, in the seventh step, combines a plurality of inter-category distance functions on the basis of the classification criterion input in the fifth step to create a distance function suited to that criterion, and uses the created distance function to calculate the distances between the categories input in the third step and thereby generate the intermediate categories.

[0024] In the invention of claim 8, the recording medium stores a program that combines a plurality of inter-category distance functions on the basis of the input classification criterion to create a distance function suited to that criterion, and uses the created distance function to calculate the distances between the input categories and generate the intermediate categories. The program can therefore be distributed more readily by means of the recording medium.

[0025] Further, the invention of claim 9 is the invention of claim 7 or 8, wherein the program recorded on the recording medium, in the eighth step, performs classification into the intermediate categories generated in the seventh step and determines, on the basis of that classification result and the classification criterion input in the fifth step, whether classification into the categories input in the third step is necessary.

[0026] In the invention of claim 9, the recording medium stores a program that determines whether classification into the input categories is necessary on the basis of the result of classification into the generated intermediate categories and the input classification criterion. The program can therefore be distributed more readily by means of the recording medium.

[0027]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of a mass information automatic classification apparatus according to one embodiment of the present invention. The apparatus shown in FIG. 1 comprises: a document input unit 101 for inputting documents written in natural language; a document storage unit 102 for storing the documents input through the document input unit 101; a category information input unit 103 for inputting categories for classifying the documents and the features of those categories; a category information storage unit 104 for storing the categories and features input through the category information input unit 103 and for calculating and storing feature vectors from them; an initial value input unit 105 for inputting a classification criterion, which is an initial value for the classification process; an initial value storage unit 106 for storing the classification criterion input through the initial value input unit 105; an intermediate category calculation unit 107 for newly creating categories using the categories and features input through the category information input unit 103 and the classification criterion input through the initial value input unit 105, and assigning similar categories among those input through the category information input unit 103 to the same intermediate category; a real category calculation unit 108 for classifying the documents input through the document input unit 101 using the categories input through the category information input unit 103 and the categories obtained by the intermediate category calculation unit 107; and a display unit 109 for displaying the results of classification by the real category calculation unit 108.

[0028] Before describing the operation of the apparatus configured in this way, the basic principle of the mass information automatic classification apparatus of this embodiment is described. The apparatus divides the real categories, which are the categories into which documents are to be classified, into several groups called intermediate categories in advance, and shortens computation time by calculating distances to the real categories only when the condition on the intermediate category is satisfied.
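The saving can be made concrete with a rough count of distance computations per document. The symbols below are illustrative assumptions, not taken from the patent: K real categories are divided evenly into M intermediate categories, and a document satisfies the condition for m of them.

```latex
\[
C_{\text{flat}} = K, \qquad
C_{\text{two-stage}} = M + m \cdot \frac{K}{M}
\]
\[
K = 1000,\; M = 50,\; m = 2 \;\Rightarrow\;
C_{\text{two-stage}} = 50 + 2 \cdot 20 = 90 \ll 1000 = C_{\text{flat}}
\]
```

Under these assumptions the benefit depends on m staying small. If a document passes the condition for most intermediate categories, the two-stage count approaches M + K and exceeds the flat count.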

[0029] More specifically, a distance between categories is introduced using the feature vectors of the real categories. The distance between each pair of real categories is calculated, mutually similar real categories are extracted, and two or more similar categories are assigned to a newly created intermediate category.
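One possible way to realize this grouping is sketched below in Python. The patent does not specify the grouping algorithm, so the sketch assumes single-link grouping: any two real categories whose cosine similarity reaches a threshold end up in the same intermediate category. The names `build_intermediate_categories` and `min_similarity`, the sample weights, and the handling of a category with no similar partner (it forms a group of its own) are all assumptions made for illustration.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def build_intermediate_categories(real_vectors, min_similarity=0.3):
    """Group real categories whose pairwise similarity reaches the threshold."""
    parent = {name: name for name in real_vectors}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    # Merge the groups of every sufficiently similar pair.
    for a, b in combinations(real_vectors, 2):
        if cosine(real_vectors[a], real_vectors[b]) >= min_similarity:
            parent[find(a)] = find(b)

    groups = {}
    for name in real_vectors:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())

# Hypothetical feature vectors for three real categories.
real_vectors = {
    "traffic accident": {"car": 0.5, "crash": 0.4, "road": 0.1},
    "electric vehicle": {"car": 0.6, "battery": 0.3, "road": 0.1},
    "earthquake": {"tremor": 0.7, "magnitude": 0.3},
}
print(build_intermediate_categories(real_vectors))
```

The pairwise comparison is quadratic in the number of real categories, but it runs once, before any document is classified.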

[0030] As the distance function between categories, the inner product of the feature vectors, the word co-occurrence relation between categories, or a combination of these is used. For example, the word co-occurrence relation I_ij between category i and category j is defined as

(Equation 2)

The co-occurrence relation I_ij is a function expressing how many words category i and category j have in common. Functions such as the degree of word overlap R_ij between category i and category j,

(Equation 3)

are also conceivable. The similarity between categories is defined by, for example, a combination of the value of the inner product of the feature vectors and the value of the co-occurrence relation between the categories.
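Because Equations 2 and 3 are not reproduced in this text, the following Python sketch uses stand-in definitions that match only the verbal descriptions: the co-occurrence I_ij is taken as the number of words present in both categories, and the overlap R_ij as the Jaccard ratio of shared words to all distinct words. Both forms, the mixing weight `alpha`, and the function names are assumptions and may differ from the patent's actual equations.

```python
def inner_product(u, v):
    """Inner product of two sparse feature vectors stored as dicts."""
    return sum(w * v.get(word, 0.0) for word, w in u.items())

def cooccurrence(u, v):
    """Assumed form of I_ij: number of words present in both categories."""
    return len(u.keys() & v.keys())

def overlap(u, v):
    """Assumed form of R_ij: shared words divided by all distinct words (Jaccard)."""
    union = u.keys() | v.keys()
    return len(u.keys() & v.keys()) / len(union) if union else 0.0

def combined_similarity(u, v, alpha=0.5):
    """Weighted mix of the inner product and the overlap; alpha is hypothetical."""
    return alpha * inner_product(u, v) + (1.0 - alpha) * overlap(u, v)

cat_i = {"car": 0.5, "crash": 0.5}
cat_j = {"car": 0.5, "battery": 0.5}
print(cooccurrence(cat_i, cat_j))
print(overlap(cat_i, cat_j))
print(combined_similarity(cat_i, cat_j))
```

Choosing `alpha` according to the input classification criterion would be one way to obtain a "distance function suited to that criterion" in the sense of claim 2, though the patent does not say how the combination is weighted.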

[0031] Next, the actual classification process is described. To classify a document, the distance between the document and each intermediate category is calculated first. Only if the document is judged similar on the basis of this distance are the distances to the real categories belonging to that intermediate category calculated and a decision made on whether to classify the document into them.

[0032] The generation of the intermediate categories and the decision criterion used during classification depend on the classification criterion for the real categories. For example, when documents are to be classified so that each one always falls into some real category, the real categories are grouped so that every real category belongs to some intermediate category. When documents that fit nowhere, that is, documents with no appropriate real category, are to be left unclassified, the decision at the intermediate-category level can yield the verdict "not classified into any category" without a single distance to a real category being calculated. Furthermore, if co-occurrence relations are used to generate the intermediate categories and the decision criterion is whether the words present in the document to be classified exist in the intermediate category, classification can easily be sped up without lowering its accuracy.
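The word-presence criterion can be sketched as follows. The sketch assumes that each intermediate category carries a vocabulary equal to the union of the words of its member real categories. Under that assumption, a document sharing no word with the vocabulary has a zero inner product with every member, so skipping the group cannot change an inner-product-based result. The data layout and function names are hypothetical.

```python
def shares_a_word(document_words, intermediate_vocab):
    """True if at least one document word appears in the intermediate category."""
    return not document_words.isdisjoint(intermediate_vocab)

def candidate_real_categories(document, intermediate_categories):
    """Return the real categories whose intermediate category passes the word test."""
    words = set(document.lower().split())
    candidates = []
    for group in intermediate_categories:
        if shares_a_word(words, group["vocabulary"]):
            candidates.extend(group["members"])
    return candidates

# Hypothetical intermediate categories with their member lists and vocabularies.
intermediate_categories = [
    {"members": ["traffic accident", "electric vehicle"],
     "vocabulary": {"car", "crash", "battery", "road"}},
    {"members": ["earthquake"],
     "vocabulary": {"tremor", "magnitude"}},
]
print(candidate_real_categories("new battery for electric car", intermediate_categories))
print(candidate_real_categories("election results announced", intermediate_categories))
```

The second document yields an empty candidate list, so it can be declared "not classified into any category" without any real-category distance being computed.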

[0033] The intermediate categories are not limited to a single layer. By introducing a distance between intermediate categories, creating intermediate categories of intermediate categories, and deciding in multiple stages whether to classify, classification can be made even faster.
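A multi-layer arrangement can be pictured as a tree whose leaves are real categories and whose inner nodes are intermediate categories. The Python sketch below descends such a tree and prunes any subtree whose vocabulary shares no word with the document. The tree layout, the word-presence test at every level, and the name `descend` are assumptions for illustration, since the patent leaves the per-level criterion open.

```python
def descend(document_words, node):
    """Collect the real categories reachable without hitting a pruned node."""
    if document_words.isdisjoint(node["vocabulary"]):
        return []                      # prune this node and everything below it
    if "children" not in node:
        return [node["name"]]          # a leaf is a real category
    found = []
    for child in node["children"]:
        found.extend(descend(document_words, child))
    return found

# Hypothetical two-layer hierarchy over three real categories.
root = {
    "vocabulary": {"car", "crash", "battery", "tremor"},
    "children": [
        {"vocabulary": {"car", "crash", "battery"},
         "children": [
             {"name": "traffic accident", "vocabulary": {"car", "crash"}},
             {"name": "electric vehicle", "vocabulary": {"car", "battery"}},
         ]},
        {"vocabulary": {"tremor"},
         "children": [
             {"name": "earthquake", "vocabulary": {"tremor"}},
         ]},
    ],
}
print(descend({"battery", "charger"}, root))
```

The surviving leaves would then be passed to the ordinary distance calculation against real categories.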

[0034] Next, the mass information automatic classification apparatus shown in FIG. 1 is described. In FIG. 1, the documents to be classified are input from the document input unit 101. These may be any documents that have been input to a computer, for example newspaper articles, HTML files and netnews on the Internet, teletext, FM multiplex broadcasts, and television broadcast scripts. The document storage unit 102 stores the information input through the document input unit 101 so that documents can be retrieved for each information source individually, for several information sources, or for all information sources together.

[0035] In the category information input unit 103, the system user inputs the categories to be used for classification together with sample document information and keywords belonging to each. For example, to collect information on the sport of soccer, the user inputs a "soccer" category together with soccer-related documents and keywords collected from newspaper articles or the Internet.

[0036] The category information storage unit 104 stores the information input through the category information input unit 103, calculates feature vectors from that information, and stores the results.

[0037] In the initial value input unit 105, the initial values used in the system's calculations are input: for example, a criterion for classification error, or classification conditions such as "classify every document into exactly one similar real category" or "classify one document into multiple real categories".

[0038] The intermediate category calculation unit 107 creates intermediate categories from the information input through the category information input unit 103 and the initial value input unit 105. It generates feature vectors from the real-category information input through the category information input unit 103, uses the information input through the initial value input unit 105 to determine the distance function between real categories, and generates the intermediate categories. At the same time, it determines the decision criterion for whether a document is to be classified against the real categories belonging to each intermediate category.

[0039] The real category calculation unit 108 assigns the documents input through the document input unit 101 to real categories. Using the intermediate categories and decision criteria produced by the intermediate category calculation unit 107, it first calculates the distance between each intermediate category and the document. From that result and the decision criterion that the intermediate category calculation unit 107 determined for the intermediate category, it decides whether to calculate the distance between the document and the real categories belonging to that intermediate category. If the calculation is needed, it calculates the distance between the document and each of those real categories and assigns the document to a real category according to the classification criterion input through the initial value input unit 105.

[0040] Next, the operation of this embodiment is described in detail using a concrete example, with reference to FIGS. 2 and 3. The document set input to the system consists of a newspaper company's articles from 1998, and it is input through the document input unit 101.

[0041] Next, the categories to be collected, together with sample documents and words that belong in them, are input through the category information input unit 103. For example, the user inputs "a classification category that collects articles about 'traffic accidents'", "a classification category that collects articles about 'electric vehicles'", and "a classification category that collects articles about 'earthquakes'". As sample documents for each category, the user specifies that several past articles about traffic accidents are assigned to the "traffic accidents" category, several articles about electric vehicles to the "electric vehicles" category, and several articles about earthquakes to the "earthquakes" category.

[0042] The category information storage unit 104 creates and stores a feature vector for each category from the information input through the category information input unit 103. For example, for the classification category that collects articles about "traffic accidents", a feature vector such as

[Equation 4] ((automobile, 0.23), (speed, 0.13), (night, 0.08), (fatalities, 0.05), ...) ... (5)

is generated from the input sample documents and stored. Likewise, for the classification category that collects articles about "electric vehicles", a feature vector such as

[Equation 5] ((automobile, 0.20), (electricity, 0.12), (fuel consumption, 0.10), (speed, 0.04), ...) ... (6)

is generated from the input sample documents and stored. For the classification category that collects articles about "earthquakes", a feature vector such as

[Equation 6] ((earthquake, 0.15), (seismic intensity, 0.10), (fatalities, 0.07), (tsunami, 0.04), ...) ... (7)

is generated from the input sample documents and stored.

[0043] Next, the initial values for classification are input through the initial value input unit 105. Here, assume that two conditions are given: "assign each document to exactly one real category" and "permit documents that are not assigned to any real category".
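The two conditions can be read as switches on the final assignment step. The sketch below is one possible interpretation: `single` limits the result to the best-scoring real category, and `allow_unassigned` lets a document stay unassigned when no score reaches a cutoff. The `threshold` value, the use of similarity scores (higher is better), and the function name `assign` are hypothetical; the text does not state how a document is judged to belong to no category.

```python
def assign(scores, single=True, allow_unassigned=True, threshold=0.05):
    """Choose real categories for one document from {category: score}.

    Assumptions: scores are similarities (higher is closer) and a category
    qualifies when its score reaches `threshold`.
    """
    if not scores:
        return []
    ranked = sorted(scores, key=scores.get, reverse=True)
    passing = [c for c in ranked if scores[c] >= threshold]
    if not passing and not allow_unassigned:
        passing = ranked[:1]  # forced assignment to the closest category
    return passing[:1] if single else passing


scores = {"traffic accident": 0.30, "electric vehicle": 0.10}
print(assign(scores))                                   # conditions of this example
print(assign(scores, single=False))                     # multiple categories allowed
print(assign({"earthquake": 0.01}))                     # may remain unassigned
print(assign({"earthquake": 0.01}, allow_unassigned=False))
```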

[0044] With the inputs above in place, the intermediate category calculation unit 107 runs first. For example, when the degree of word overlap R_ij is applied to the classification categories that collect articles about "traffic accidents", "electric vehicles", and "earthquakes", the "traffic accidents" category and the "electric vehicles" category are found to be similar and are assigned to the same intermediate category. For ease of explanation, this generated intermediate category is called the "automobile" intermediate category. If the feature vector of the "automobile" intermediate category is built from "the words common to the feature vectors of the real categories in the intermediate category", the result is, for example, the feature vector

[Equation 7] ((automobile, 0.215), (speed, 0.065), ...) ... (8)

The weight of each word is the average of its weights in the original feature vectors. The decision criterion is "whether at least one word in the document also appears in the feature vector of the intermediate category".
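The construction of the intermediate feature vector and its decision criterion can be sketched as below, using the weights from equations (5) and (6) truncated to the listed terms. The function names `intermediate_vector` and `passes_gate` are hypothetical. Note that averaging 0.13 and 0.04 gives 0.085 for "speed", whereas equation (8) prints 0.065; the sketch reports the computed average and leaves the printed value as it stands.

```python
def intermediate_vector(members):
    """Keep only words shared by every member; weight = mean of member weights."""
    shared = set.intersection(*(set(v) for v in members))
    return {w: sum(v[w] for v in members) / len(members) for w in shared}


def passes_gate(doc_vector, inter_vector):
    """Decision criterion: at least one document word is in the intermediate vector."""
    return any(word in inter_vector for word in doc_vector)


traffic = {"automobile": 0.23, "speed": 0.13, "night": 0.08, "fatalities": 0.05}
electric = {"automobile": 0.20, "electricity": 0.12, "fuel consumption": 0.10, "speed": 0.04}

automobile_inter = intermediate_vector([traffic, electric])
print(automobile_inter)

article = {"collision": 0.25, "automobile": 0.19, "guardrail": 0.12, "speed": 0.10}
print(passes_gate(article, automobile_inter))
```

Because the criterion only tests membership, the averaged weights do not affect whether a document passes; they would matter only if the gate were replaced by a weighted distance.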

[0045] Next, the real category calculation unit 108 classifies the documents into real categories. Suppose that the 1998 newspaper articles include, as shown in FIG. 2, an article stating "An automobile hit a guardrail because of excessive speed", and that the feature vector of this article is

[Equation 8] ((collision, 0.25), (automobile, 0.19), (guardrail, 0.12), (speed, 0.10), ...) ... (9)

The words in this feature vector are compared with those in the feature vector of the "automobile" intermediate category. Here, the words "automobile" and "speed" appear in both the article's feature vector and that of the intermediate category, so the decision criterion of the intermediate category is satisfied. The article is therefore classified against the real categories that collect articles about "traffic accidents" and about "electric vehicles", and it is finally assigned to the classification category that collects articles about "traffic accidents".

[0046] Also suppose, as shown in FIG. 3, that there is an article stating "High School A wins the championship in a high school soccer match", and that the feature vector of this article is

[Equation 9] ((high school student, 0.18), (soccer, 0.15), (victory, 0.10), (High School A, 0.09), ...) ... (10)

In this case, comparison with the feature vector of the "automobile" intermediate category finds no common word, so the decision criterion is not satisfied. The article is therefore not classified against the real categories that collect articles about "traffic accidents" and about "electric vehicles".
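The two articles of FIGS. 2 and 3 can be traced through a small two-stage classifier. This is a sketch under stated assumptions: the dot product over shared words stands in for the distance between a document and a real category, which this passage does not define, and every vector is truncated to the terms listed in equations (5), (6), (8), (9), and (10). The names `dot` and `classify` are hypothetical. The second return value counts how many real-category comparisons were actually carried out.

```python
def dot(a, b):
    """Similarity over shared words (assumed stand-in for the distance function)."""
    return sum(w * b[t] for t, w in a.items() if t in b)


def classify(doc, intermediates, real):
    """Return (best real category or None, number of real-category comparisons)."""
    scores = {}
    for inter in intermediates:
        if not any(t in inter["vector"] for t in doc):
            continue  # criterion not met: skip every real category in this group
        for name in inter["members"]:
            scores[name] = dot(doc, real[name])
    best = max(scores, key=scores.get) if scores else None
    return best, len(scores)


real = {
    "traffic accident": {"automobile": 0.23, "speed": 0.13, "night": 0.08, "fatalities": 0.05},
    "electric vehicle": {"automobile": 0.20, "electricity": 0.12, "fuel consumption": 0.10, "speed": 0.04},
}
intermediates = [{
    "vector": {"automobile": 0.215, "speed": 0.065},  # as printed in equation (8)
    "members": ["traffic accident", "electric vehicle"],
}]

guardrail_article = {"collision": 0.25, "automobile": 0.19, "guardrail": 0.12, "speed": 0.10}
soccer_article = {"high school student": 0.18, "soccer": 0.15, "victory": 0.10, "High School A": 0.09}

print(classify(guardrail_article, intermediates, real))
print(classify(soccer_article, intermediates, real))
```

Under these assumptions the guardrail article scores about 0.057 against "traffic accident" and 0.042 against "electric vehicle", so it is assigned to "traffic accident" after two comparisons, while the soccer article is rejected at the intermediate stage with no real-category comparison at all. That skipped work is where the time saving comes from.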

[0047] The display unit 109 displays the classification results: the documents that were classified into real categories and those that were not, information on the generated intermediate categories, the feature vectors of the real categories, the time required for classification, and so on.

[0048]

[Effects of the Invention] As described above, according to the present invention, categories for classifying documents and their features are input; feature vectors are calculated from the input categories and features and stored; new categories are created using the input categories, their features, and the input classification criterion; similar input categories are assigned to the same intermediate category; and the input documents are classified using both the created categories and the input categories. As a result, a large document set can be classified into a large number of classification categories at high speed.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of a mass information automatic classification apparatus according to an embodiment of the present invention.

FIG. 2 is a diagram for explaining a concrete example of the mass information automatic classification apparatus shown in FIG. 1.

FIG. 3 is a diagram for explaining a concrete example of the mass information automatic classification apparatus shown in FIG. 1.

[Explanation of symbols]

101 Document input unit
102 Document storage unit
103 Category information input unit
104 Category information storage unit
105 Initial value input unit
106 Initial value storage unit
107 Intermediate category calculation unit
108 Real category calculation unit
109 Display unit

Continued from the front page: (72) Inventor: Kazuo Tanaka, c/o Nippon Telegraph and Telephone Corporation, 3-19-2 Nishi-Shinjuku, Shinjuku-ku, Tokyo

Claims

1. A mass information automatic classification method for classifying a large amount of information into a large number of categories, comprising: a first step of inputting a document written in natural language; a second step of storing the document input in the first step; a third step of inputting categories for classifying the document and their features; a fourth step of storing the categories and features input in the third step, and of calculating and storing feature vectors from those categories and features; a fifth step of inputting a classification criterion that serves as an initial value for the classification process; a sixth step of storing the classification criterion input in the fifth step; a seventh step of newly creating categories using the categories and features input in the third step and the classification criterion input in the fifth step, and of assigning similar categories among those input in the third step to the same intermediate category; and an eighth step of classifying the document input in the first step using the categories input in the third step and the categories obtained in the seventh step.

2. The mass information automatic classification method according to claim 1, wherein, in the seventh step, a distance function corresponding to the classification criterion is created by combining a plurality of inter-category distance functions based on the classification criterion input in the fifth step, and the created distance function is used to calculate the distances between the categories input in the third step, thereby generating the intermediate categories.

3. The mass information automatic classification method according to claim 1, wherein, in the eighth step, the document is classified into the intermediate categories generated in the seventh step, and whether classification into the categories input in the third step is necessary is determined based on that classification result and the classification criterion input in the fifth step.

4. A mass information automatic classification apparatus for classifying a large amount of information into a large number of categories, comprising: a document input unit for inputting a document written in natural language; a document storage unit for storing the document input through the document input unit; a category information input unit for inputting categories for classifying the document and their features; a category information storage unit for storing the categories and features input through the category information input unit, and for calculating and storing feature vectors from those categories and features; an initial value input unit for inputting a classification criterion that serves as an initial value for the classification process; an initial value storage unit for storing the classification criterion input through the initial value input unit; an intermediate category calculation unit for newly creating categories using the categories and features input through the category information input unit and the classification criterion input through the initial value input unit, and for assigning similar categories among those input through the category information input unit to the same intermediate category; and a real category calculation unit for classifying the document input through the document input unit using the categories input through the category information input unit and the categories obtained by the intermediate category calculation unit.

5. The mass information automatic classification apparatus according to claim 4, wherein the intermediate category calculation unit comprises means for creating a distance function corresponding to the classification criterion by combining a plurality of inter-category distance functions based on the classification criterion input through the initial value input unit, and for generating the intermediate categories by using the created distance function to calculate the distances between the categories input through the category information input unit.

6. The mass information automatic classification apparatus according to claim 4, wherein the real category calculation unit comprises means for classifying the document into the intermediate categories generated by the intermediate category calculation unit, and for determining, based on that classification result and the classification criterion input through the initial value input unit, whether classification into the categories input through the category information input unit is necessary.

7. A recording medium on which is recorded a mass information automatic classification program for classifying a large amount of information into a large number of categories, the program comprising: a first step of inputting a document written in natural language; a second step of storing the document input in the first step; a third step of inputting categories for classifying the document and their features; a fourth step of storing the categories and features input in the third step, and of calculating and storing feature vectors from those categories and features; a fifth step of inputting a classification criterion that serves as an initial value for the classification process; a sixth step of storing the classification criterion input in the fifth step; a seventh step of newly creating categories using the categories and features input in the third step and the classification criterion input in the fifth step, and of assigning similar categories among those input in the third step to the same intermediate category; and an eighth step of classifying the document input in the first step using the categories input in the third step and the categories obtained in the seventh step.

8. The recording medium on which is recorded a mass information automatic classification program according to claim 7, wherein, in the seventh step, a distance function corresponding to the classification criterion is created by combining a plurality of inter-category distance functions based on the classification criterion input in the fifth step, and the created distance function is used to calculate the distances between the categories input in the third step, thereby generating the intermediate categories.

9. The recording medium on which is recorded a mass information automatic classification program according to claim 7 or 8, wherein, in the eighth step, the document is classified into the intermediate categories generated in the seventh step, and whether classification into the categories input in the third step is necessary is determined based on that classification result and the classification criterion input in the fifth step.