JP2000011004A

JP2000011004A - Automatic information classifying method, its device and recording medium recording its method

Info

Publication number: JP2000011004A
Application number: JP10180182A
Authority: JP
Inventors: Masayuki Sugizaki; 正之杉崎; Masakatsu Okubo; 雅且大久保; Kazuhiro Hayakawa; 和宏早川; Kazuo Tanaka; 一男田中
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1998-06-26
Filing date: 1998-06-26
Publication date: 2000-01-14
Anticipated expiration: 2018-06-26
Also published as: JP3566856B2

Abstract

PROBLEM TO BE SOLVED: To improve classification accuracy by automatically creating a new category (temporary category) for erroneously classified documents, assigning those documents to it, and classifying with both each input category and its corresponding temporary category. SOLUTION: For documents that were not correctly classified, for example documents erroneously classified into the category concerning "life insurance", a temporary category (the temporary category for "life insurance") is generated and all such documents are assigned to it. Suppose many of those documents concern "non-life insurance". When a newspaper article such as "The stock price of a non-life insurance company has risen, and at other insurance companies..." is then classified, the similarity computed from the document's feature vector is larger for the temporary category for "life insurance" than for the category concerning "life insurance". As a result, the document, which would otherwise have been erroneously classified into the category concerning "life insurance", is assigned to the temporary category for "life insurance" and not to the category concerning "life insurance".

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to an automatic information classification method for automatically classifying a large amount of document data, an apparatus therefor, and a recording medium on which the method is recorded.

[0002]

[Description of the Related Art] In recent years, it has become possible to exchange large volumes of electronic documents over computer networks such as the Internet, and services that let individuals search for the information they need have been implemented on such networks. As a result, however, the amount of information a user acquires becomes large, and it becomes difficult to extract the characteristics of each individual item. A technique for classifying and organizing the acquired information is therefore needed.

[0003] Methods of automatically classifying document information have been studied for some time. In one representative approach, the divisions used for classification (called categories) are known in advance, as in a library, and each new item is classified into the category considered appropriate for it ("Automatic Text Classification Using the Relationships Between Classification Systems," Yamamoto, Masuyama (Toyohashi University of Technology), Naito (NTT), Natural Language Processing Society workshop, March 1995). In another, the categories are unknown, and similar documents are gathered from the document set to create classification categories to which they are then assigned ("Automatic Partitioning by Competitive Learning Neural Networks," Kikuchi, Matsuoka et al. (Utsunomiya University et al.), IEICE Transactions, October 1995). These techniques are used to classify and organize large numbers of documents.

[0004] The classification method addressed by the present invention is the former, in which the categories are known in advance. When the categories and the sample documents or words that belong to each are given to the system, the system calculates the importance of each word from that information and generates, as the feature of each category, a vector of pairs consisting of a word and its importance for that category. Likewise, for each document to be classified, the importance of each word for the document is calculated and a vector is generated ("Automatic Text Processing," Gerard Salton, 1989, Addison-Wesley Pub. Co.).

[0005] The feature vector $\vec{W}_i$ generated for each category $i$ is

$\vec{W}_i = (w_{i1}, \ldots, w_{ij}, \ldots, w_{iN})$ (1)

$w_{ij}$: importance of word $j$ for category $i$ (2)

where $N$ is the number of dimensions. The value of each element of $\vec{W}_i$ is calculated by comparing, among other things, the occurrence frequencies of the words in each category, so that a word present in many categories receives a small value and a word that appears predominantly in one particular category receives a large value.
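The weighting rule in paragraph [0005] is stated only qualitatively. The following is a minimal sketch of one weighting that has the described behavior, assuming pre-tokenized sample documents and a TF-IDF-like formula applied at the category level. The function name `category_feature_vectors` and the exact formula are illustrative assumptions, not the formula used in the patent.

```python
import math
from collections import Counter


def category_feature_vectors(samples):
    """Build one word-weight vector per category.

    samples: dict mapping category name -> list of tokenized documents
             (each document is a list of words). Tokenization is assumed
             to have been done elsewhere.
    """
    # Word counts within each category.
    tf = {cat: Counter(w for doc in docs for w in doc)
          for cat, docs in samples.items()}
    n_categories = len(tf)
    # Number of categories in which each word occurs at least once.
    cat_freq = Counter(w for counts in tf.values() for w in counts)

    vectors = {}
    for cat, counts in tf.items():
        total = sum(counts.values())
        # Assumed weighting: relative frequency in the category, scaled down
        # for words that occur in many categories.
        vectors[cat] = {
            w: (c / total) * math.log((1 + n_categories) / cat_freq[w])
            for w, c in counts.items()
        }
    return vectors
```

With this choice, a word that occurs in every category still gets a small positive weight, while a word confined to one category gets the largest scaling factor.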

[0006] For classification, a distance between a category's feature vector and a document's feature vector is defined, and its value is used to assign each document to the categories it resembles. If the distance is very large, that is, if the document is judged not to resemble any category, the document is not assigned to any category.
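Paragraph [0006] leaves the distance function open. As a rough sketch, assuming cosine distance over sparse word-weight dictionaries and an arbitrary cutoff (`max_distance` is a hypothetical parameter, not a value from the patent), the threshold-based assignment could look like this:

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors given as {word: weight}."""
    dot = sum(w * v.get(word, 0.0) for word, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def assign(doc_vec, category_vecs, max_distance=0.8):
    """Return every category within max_distance of the document.

    An empty list means the document resembles no category and is
    therefore left unassigned.
    """
    return [name for name, vec in category_vecs.items()
            if cosine_distance(doc_vec, vec) <= max_distance]
```

Returning a list rather than a single name allows a document to fall into several categories, or into none.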

[0007]

[Problems to Be Solved by the Invention] With conventional methods, a feature vector containing the appropriate words cannot be generated in cases such as when homonyms are involved, and classifying with such a feature vector sometimes places documents in a category for which they are not appropriate.

[0008] The object of the present invention is to improve classification accuracy by using the results of classification: correctly classified documents are assigned to the categories into which they were correctly classified, a new category (temporary category) is automatically created for erroneously classified documents and they are assigned to it, and classification is then performed using each input category together with its corresponding temporary category.

[0009]

[Means for Solving the Problems] Several documents are classified against the categories given in advance. As a result, some documents may not be classified into an appropriate category.

[0010] In the present invention, a new temporary category is automatically created, and the documents that were not correctly classified are assigned to it. The temporary category is generated for the category in which the misclassification occurred. During classification, the similarity between the document and the category and the similarity between the document and its corresponding temporary category are both calculated, and the document is assigned to whichever one of the two it resembles. A single temporary category may also be generated for a plurality of categories.

[0011] Misclassification occurs when documents use similar words but differ in content. With the above procedure, even for documents that share a group of similar words, a separate category is generated and a feature vector is generated for it, so that words other than the shared group can be extracted as that category's features. This improves classification accuracy.

[0012]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings. In all the drawings describing the embodiments, identical elements are given identical reference numerals, and repeated description of them is omitted.

[0013] FIG. 1 is a block diagram showing the schematic configuration of an automatic information classification apparatus according to one embodiment of the present invention. Reference numeral 101 denotes a document input unit, 102 a document storage unit, 103 a category information input unit, 104 a category information storage unit, 105 a classification result input unit, 106 a classification result storage unit, 107 a temporary category generation unit, and 108 a classification calculation unit.

[0014] (Description of each unit) In the automatic information classification apparatus of this embodiment, the documents to be processed are input through the document input unit 101. Any document that has been entered into a computer qualifies, for example newspaper articles, HTML files on the Internet, net news, teletext broadcasts, FM multiplex broadcasts, and television broadcast scripts. The document storage unit 102 stores the information input through the document input unit 101 so that documents can be retrieved for each information source individually, for several sources together, or for all sources as a whole.

[0015] Through the category information input unit 103, the system user inputs the categories to be used for classification, together with sample documents and keywords that belong to each. For example, to collect information on high school baseball within sports, the user inputs a category for "high school baseball" along with documents and keywords on high school baseball gathered from newspaper articles, the Internet, and similar sources.

[0016] The category information storage unit 104 stores the information input through the category information input unit 103, calculates feature vectors from that information, and stores the results.

[0017] Through the classification result input unit 105, the results of some classifications already performed are input: the correctly classified documents with the names of the categories to which they were assigned, and the incorrectly classified documents with the names of the categories to which they were assigned. For example, suppose there are results of classification into the "high school baseball" category and a document about "professional baseball" was mistakenly classified into it. The user then inputs the "professional baseball" document as a document that was not correctly classified, together with the fact that the category in which it was wrongly placed is the "high school baseball" category.

[0018] The temporary category generation unit 107 generates a new temporary category and assigns to it the documents that were not correctly classified. The generated temporary category and the document information are then stored in the category information storage unit 104, which generates and stores a feature vector for the temporary category. At this time, the name of the category in which the misclassification occurred, which is paired with the temporary category, is also stored. When correctly classified documents have been input through the classification result input unit 105, they are combined with the category information input through the category information input unit 103 and stored in the category information storage unit 104.
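The bookkeeping described in paragraph [0018] can be sketched as follows. This is a minimal illustration under assumed data shapes: `samples` maps a category name to its sample documents, and each feedback item is a hypothetical `(document, category_name, was_correct)` tuple. None of these names appear in the patent.

```python
from collections import defaultdict


def apply_feedback(samples, feedback):
    """Split classification feedback into category and temporary-category samples.

    samples:  dict mapping category name -> list of sample documents.
    feedback: iterable of (document, category_name, was_correct) tuples, where
              category_name is the category the document had been assigned to.
    Returns (updated_samples, temp_samples). temp_samples is keyed by the name
    of the category each temporary category is paired with.
    """
    updated = {cat: list(docs) for cat, docs in samples.items()}
    temp_samples = defaultdict(list)
    for doc, category, was_correct in feedback:
        if was_correct:
            # Correctly classified documents reinforce the original category.
            updated.setdefault(category, []).append(doc)
        else:
            # Misclassified documents go to the temporary category paired
            # with the category they were wrongly assigned to.
            temp_samples[category].append(doc)
    return updated, dict(temp_samples)
```

Feature vectors for both dictionaries would then be computed in the same way as for ordinary categories.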

[0019] The classification calculation unit 108 assigns the documents input through the document input unit 101 to categories and temporary categories. For a category that has a temporary category, a document is never classified into both; it is classified into either the category or its corresponding temporary category. For categories other than temporary categories, the classification criterion may allow a document to be assigned to more than one category.

[0020] (Description of processing using a specific example) The processing flow of the present invention is now described concretely by way of an example. FIG. 2 shows an example of classification when no temporary category has been generated, and FIG. 3 shows an example of classification when a temporary category has been generated.

[0021] The documents input to the system are a newspaper company's articles from the past ten years. This document set is input through the document input unit 101. Next, the categories to be collected and the sample documents or words belonging to each are input through the category information input unit 103. For example, a category concerning "life insurance" and a category concerning "typhoons" are input. As sample documents, several past articles on life insurance are assigned to the category concerning "life insurance", and past typhoon information and articles on typhoon damage are assigned to the category concerning "typhoons".

[0022] The category information storage unit 104 creates and stores a feature vector for each category from the information input through the category information input unit 103. For example, for the category concerning "life insurance", a feature vector such as the following is generated from the input sample documents and stored:

((life insurance, 0.23), (insurance, 0.13), (loss, 0.10), (liberalization, 0.08), (Company A, 0.05), ...) (3)

For the category concerning "typhoons", a feature vector such as the following is generated from the input sample documents and stored:

((typhoon, 0.25), (damage, 0.21), (landfall, 0.18), (strength, 0.14), (roof, 0.10), ...) (4)

[0023] Next, past results of classification into the categories input through the category information input unit 103 are input through the classification result input unit 105. For example, for the category concerning "life insurance", words such as "life insurance", "life", and "insurance" are acquired in the feature vector. Because of the word "insurance" in this feature vector, a document about "non-life insurance", for example, may be classified into the category concerning "life insurance", as shown in FIG. 2. When examples of documents that could not be classified correctly are identified, either from past results or by classifying some samples, such documents are input as incorrectly classified examples for the category concerning "life insurance".

[0024] Based on the input from the classification result input unit 105, the temporary category generation unit 107 generates a new temporary category and assigns to it the documents that were not correctly classified. For example, for the documents that were erroneously classified into the category concerning "life insurance", a temporary category is generated as shown in FIG. 3 (referred to here as the temporary category for "life insurance"), and all of those documents are assigned to it.

[0025] The temporary category generated by the temporary category generation unit 107 and the document information are stored in the category information storage unit 104, and a feature vector is generated. For example, for the temporary category for "life insurance", if many of the incorrectly classified documents concern "non-life insurance", the feature vector becomes something like:

((non-life insurance, 0.23), (insurance, 0.22), (guarantee, 0.11), (life, 0.09), (disaster, 0.09), ...) (5)

[0026] Thereafter, the classification calculation unit 108 performs classification using the categories and feature vectors stored in the category information storage unit 104. When classifying into the category concerning "life insurance", the similarity between the document and the temporary category for "life insurance" is also calculated, and the document is classified into one of the two. Suppose, for example, that there is a newspaper article reading "The stock price of a non-life insurance company has risen, and at other insurance companies..." and that its feature vector is:

((non-life insurance, 0.18), (insurance, 0.16), (stock price, 0.12), (performance, 0.11), ...) (6)

If the inner product of the vectors is used as the similarity function between a category and a document, the temporary category for "life insurance" yields the larger similarity value in this case. As a result, as shown in FIG. 3, the document, which the conventional method without a temporary category (FIG. 2) would erroneously classify into the category concerning "life insurance", is now assigned to the temporary category for "life insurance" and is no longer assigned to the category concerning "life insurance".
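The comparison in paragraph [0026] can be checked numerically from the listed components of vectors (3), (5), and (6). The sketch below uses only the components that are printed in the text; the elided entries ("...") are unknown and are treated as absent, which is an assumption. Word labels are English renderings of the original terms.

```python
def inner_product(u, v):
    """Inner product of two sparse vectors given as {word: weight}."""
    return sum(w * v.get(word, 0.0) for word, w in u.items())


# Vector (3): category concerning "life insurance" (listed components only).
category = {"life insurance": 0.23, "insurance": 0.13, "loss": 0.10,
            "liberalization": 0.08, "Company A": 0.05}
# Vector (5): temporary category for "life insurance" (listed components only).
temporary = {"non-life insurance": 0.23, "insurance": 0.22, "guarantee": 0.11,
             "life": 0.09, "disaster": 0.09}
# Vector (6): the newspaper article (listed components only).
document = {"non-life insurance": 0.18, "insurance": 0.16,
            "stock price": 0.12, "performance": 0.11}

sim_category = inner_product(document, category)    # 0.16 * 0.13
sim_temporary = inner_product(document, temporary)  # 0.18 * 0.23 + 0.16 * 0.22
```

On these components the category scores about 0.0208 and the temporary category about 0.0766, consistent with the statement that the temporary category has the larger similarity.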

[0027] Extracting information about a temporary category reveals which categories compete with the category to be classified, and such a competing category can also be entered manually through the category information input unit 103 as a category to be collected. In addition, by manually assigning a category name on the basis of the temporary category's information and the classification results, the temporary category can be treated in the same way as a category input through the category information input unit 103 rather than as a temporary category. In this way, new document information and a new category can be acquired.

[0028] FIG. 4 shows a processing flowchart of one embodiment of the present invention.

Step (S1): Create a feature vector for the document to be classified.
Step (S2): Calculate the similarity with a given category A.

[0029]
Step (S3): Check whether the document is similar to category A. If it is, proceed to step (S5); if not, proceed to step (S4).
Step (S4): Move on to the calculation for the next category and return to step (S2).

[0030]
Step (S5): Check whether a temporary category exists for category A. If one exists, proceed to step (S7); if not, proceed to step (S6).
Step (S6): Assign the document to category A, move on to the calculation for the next category, and return to step (S2).

[0031]
Step (S7): Check whether the document is similar to the temporary category. If it is, proceed to step (S9); if not, proceed to step (S8).
Step (S8): Assign the document to category A, move on to the calculation for the next category, and return to step (S2).

[0032]
Step (S9): Assign the document to the temporary category (do not assign it to category A).
Step (S10): Check whether another category remains. If so, return to step (S2); if not, END.
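Steps S2 through S10 can be sketched as a single loop over the categories. This is an illustration under stated assumptions rather than the patented procedure itself: the document's feature vector (S1) is taken as given, similarity is the inner product, "similar to category A" (S3) is modeled as meeting a hypothetical `threshold`, and "similar to the temporary category" (S7) is read, following paragraph [0026], as the temporary category scoring higher than category A. The `"temp:"` name prefix is likewise invented for the example.

```python
def inner_product(u, v):
    """Similarity as the inner product of two sparse {word: weight} vectors."""
    return sum(w * v.get(word, 0.0) for word, w in u.items())


def classify(doc_vec, categories, temp_categories, threshold=0.01):
    """Assign a document following the flow of steps S2 to S10.

    categories:      dict mapping category name -> feature vector.
    temp_categories: dict mapping category name -> feature vector of the
                     temporary category paired with that category.
    Returns the list of assigned names.
    """
    assigned = []
    for name, cat_vec in categories.items():      # S4 / S10: next category
        cat_sim = inner_product(doc_vec, cat_vec)  # S2
        if cat_sim < threshold:                    # S3: not similar
            continue
        temp_vec = temp_categories.get(name)       # S5: temporary category?
        if temp_vec is None:
            assigned.append(name)                  # S6
        elif inner_product(doc_vec, temp_vec) > cat_sim:  # S7 (assumed test)
            assigned.append("temp:" + name)        # S9: not assigned to A
        else:
            assigned.append(name)                  # S8
    return assigned
```

Because the loop continues after each assignment, a document can still be placed in several ordinary categories, while for any one category it lands in either the category or its temporary category, never both.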

[0033] The automatic information classification method and apparatus have been described above. The processing for this classification can be recorded on a recording medium in the form of a program, and the present invention therefore also encompasses such a recording medium.

[0034]

[Effect of the Invention] According to the present invention, when a large set of documents is classified into a large number of categories, classification accuracy can be improved by supplying the results of classification processing as samples.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the schematic configuration of an automatic information classification apparatus according to one embodiment of the present invention.

FIG. 2 shows an example of classification by the automatic information classification apparatus when no temporary category has been generated.

FIG. 3 shows an example of classification by the automatic information classification apparatus when a temporary category has been generated.

FIG. 4 shows a processing flowchart of one embodiment of the present invention.

[Explanation of symbols]

101: document input unit
102: document storage unit
103: category information input unit
104: category information storage unit
105: classification result input unit
106: classification result storage unit
107: temporary category generation unit
108: classification calculation unit

Continuation of front page
(72) Inventor: Kazuhiro Hayakawa, c/o Nippon Telegraph and Telephone Corporation, 3-19-2 Nishi-Shinjuku, Shinjuku-ku, Tokyo
(72) Inventor: Kazuo Tanaka, c/o Nippon Telegraph and Telephone Corporation, 3-19-2 Nishi-Shinjuku, Shinjuku-ku, Tokyo
F-term (reference): 5B009 SA12 VA02 VA09 5B075 ND03 NK12 NK35 NR03 NR12 PR06 QM08 QS01 UU06

Claims

[Claims]

1. An automatic information classification method for automatically classifying a large amount of information into a large number of categories, comprising: a first step of inputting documents written in a natural language; a second step of storing the input data obtained in the first step; a third step of inputting categories for classification and their features; a fourth step of storing the categories and features obtained in the third step; a fifth step of inputting correctly classified documents and documents that were not correctly classified; a sixth step of storing the documents obtained in the fifth step; a seventh step of, using the documents obtained in the fifth step, adding the correctly classified documents as information of the categories input in the third step and, for the documents that were not correctly classified, automatically generating a new temporary category and assigning those documents to it; and an eighth step of classifying the documents obtained in the first step using the category information obtained in the third step and the temporary category information obtained in the seventh step.

2. The automatic information classification method according to claim 1, wherein in the seventh step, using the documents obtained in the fifth step, the documents that were not correctly classified are assigned to a temporary category, and in the eighth step, each document obtained in the first step is classified into either a category obtained in the third step or the temporary category corresponding to that category.

3. An automatic information classification apparatus for automatically classifying a large amount of information into a large number of categories, comprising: a document input unit that inputs documents written in a natural language; a document storage unit that stores the input data obtained by the document input unit; a category information input unit that inputs categories for classification and their features; a category information storage unit that stores the categories and features obtained by the category information input unit; a classification result input unit that inputs correctly classified documents and documents that were not correctly classified; a classification result storage unit that stores the documents obtained by the classification result input unit; a temporary category generation unit that, using the documents obtained by the classification result input unit, adds the correctly classified documents as information of the categories input through the category information input unit and, for the documents that were not correctly classified, automatically generates a new temporary category and assigns those documents to it; and a classification calculation unit that classifies the documents obtained by the document input unit using the category information obtained by the category information input unit and the temporary category information obtained by the temporary category generation unit.

4. The automatic information classification apparatus according to claim 3, wherein the temporary category generation unit, using the documents obtained by the classification result input unit, assigns the documents that were not correctly classified to a temporary category, and the classification calculation unit classifies each document obtained by the document input unit into either a category obtained by the category information input unit or the temporary category corresponding to that category.

5. A recording medium on which an automatic information classification method for automatically classifying a large amount of information into a large number of categories is recorded, the recording medium having recorded thereon a processing program describing: a first step of inputting documents written in a natural language; a second step of storing the input data obtained in the first step; a third step of inputting categories for classification and their features; a fourth step of storing the categories and features obtained in the third step; a fifth step of inputting correctly classified documents and documents that were not correctly classified; a sixth step of storing the documents obtained in the fifth step; a seventh step of, using the documents obtained in the fifth step, adding the correctly classified documents as information of the categories input in the third step and, for the documents that were not correctly classified, automatically generating a new temporary category and assigning those documents to it; and an eighth step of classifying the documents obtained in the first step using the category information obtained in the third step and the temporary category information obtained in the seventh step.

6. The recording medium on which the automatic information classification method is recorded according to claim 5, having recorded thereon a processing program in which, in the seventh step, using the documents obtained in the fifth step, the documents that were not correctly classified are assigned to a temporary category, and in the eighth step, each document obtained in the first step is classified into either a category obtained in the third step or the temporary category corresponding to that category.