JP2002099555A

JP2002099555A - Apparatus and method for document classification

Info

Publication number: JP2002099555A
Application number: JP2000288127A
Authority: JP
Inventors: Shigemi Nakazato; 茂美中里; Tsutomu Kobayashi; 勉小林; Takeshi Matsukuma; 剛松隈; Yukio Nakamoto; 幸夫中本; Takuya Nishina; 卓哉仁科; Hiroshi Yamazaki; 弘山崎
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2000-09-22
Filing date: 2000-09-22
Publication date: 2002-04-05

Abstract

PROBLEM TO BE SOLVED: In the prior art, the database is updated whenever it contains documents that strongly contribute to classification errors. As the number of documents stored in the database used for document classification grows, the time required for each update increases, and frequent adjustment of the database becomes difficult. SOLUTION: The similarity between the document data and each of a plurality of registered document data stored in a main database 4a is calculated (step 307). After a classification is specified (step 310), if the classification with a high similarity appears among the error classifications of a misidentification database 4b (step 312), the similarity between the document data and the registered document data assigned that error classification in the misidentification database is calculated (step 315). If any registered document data has a high similarity, the correct classification previously assigned to it is added to the specified classifications (step 317). With this arrangement, classification errors can be reduced.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

1. Field of the Invention

The present invention relates to a document classification apparatus and a document classification method for determining which classification new document data corresponds to, using a database of a plurality of documents that have been digitized, classified, and registered.

[0002]

2. Description of the Related Art

In recent years, large volumes of digitized document data have come into circulation, and techniques for automatically determining which classification a given document belongs to have been put to practical use. In a common approach, a plurality of documents representative of various classifications are registered in a database, the similarity between an input document and each registered document is calculated using the vector space method or the like, and the classification to which the input document is likely to belong is specified by referring to the classifications of the similar documents. This approach exploits the fact that the frequently used terms generally differ from one classification to another.

[0003] However, in a classification covering a specific technology, the documents often describe application examples of that technology, and may therefore contain many words that are not directly related to the words normally found in documents of that classification. For example, a document in a classification covering image sensors (CCDs) may describe video cameras, digital still cameras, scanners, and the like as application examples. When such documents are registered in the database, words with no direct bearing on the classification influence the similarity calculation, and the similarity to a document of a different classification can become higher than the similarity to a document of the correct classification.

[0004] One way to address this problem is to update the database itself, as described in the specification and drawings of Japanese Patent Application No. Hei 11-335442, previously filed by the present inventors: documents that strongly contribute to incorrect field identification are preferentially deleted from the database, which raises the rate of correct field identification.

[0005]

Problems to Be Solved by the Invention

When the database is updated as in the prior art, the time required for the update grows as the number of documents stored in the database used for document classification becomes larger, and it becomes difficult to adjust the database frequently. In addition, deleting documents of a specific field from the database upsets the balance of the database as a whole and affects the accuracy of classifying documents in classifications other than the one from which documents were deleted. Adjusting the database, for example by tuning word weights, while taking these effects into account is difficult, so there was room for improvement.

[0006] An object of the present invention is to provide a document classification apparatus and a document classification method capable of reducing classification errors without updating the entire database, even when documents that strongly contribute to classification errors are present.

[0007]

Means for Solving the Problems

To achieve the above object, the invention according to claim 1 is a document classification apparatus that specifies the classification of document data based on its similarity to a plurality of registered document data to which classifications have been assigned in advance, the apparatus comprising: a first database that stores the plurality of registered document data to which classifications have been assigned in advance; a second database that stores registered document data which, when the classification of document data is specified, cause a classification error in which an error classification different from the correct classification is specified, together with the correct classification and the error classification; and classification specifying means for calculating the similarity between the document data and the registered document data in the first database and in the second database, and specifying a classification based on that similarity. With this configuration, the apparatus considers not only the classification result obtained from the first database but also the registered document data in the second database, for which classification is prone to error. When the second database contains a similar document, the corresponding classification is output as well, so classification errors can be reduced. Moreover, even when a database update is needed to reduce classification errors, only the second database has to be updated rather than the entire database, which is efficient.

[0008] As set forth in claim 3, the document classification apparatus of the present invention is a document classification apparatus that specifies the classification of document data based on its similarity to a plurality of registered document data to which classifications have been assigned in advance, the apparatus comprising: a first database that stores the plurality of registered document data to which classifications have been assigned in advance; a second database that stores registered document data which, when the classification of document data is specified, cause a classification error in which an error classification different from the correct classification is specified, together with the correct classification and the error classification; input means for inputting document data; similarity calculation means for calculating the similarity between the input document data and the registered document data in the first database; means for summing the similarities calculated by the similarity calculation means for each classification assigned to the registered document data, and calculating the ratio of each classification's sum to the total of all similarities; classification specifying means for specifying a classification based on the magnitude of the ratios calculated by this means; comparison means for comparing the ratio of the classification specified by the classification specifying means with a specific value; addition means for, when the comparison by the comparison means shows that the ratio is smaller than the specific value, calculating the similarity between the document data and the registered document data included in the error classification in the second database and, if any registered document data has a similarity equal to or greater than a predetermined value, adding the correct classification assigned to that registered document data to the classifications specified by the classification specifying means; and output means for outputting the classifications specified as the classification of the document data, including any classification added by the addition means. This configuration reduces classification errors and makes database updates for reducing classification errors efficient. In addition, the second database is not consulted every time; it is used only when the total similarity value is smaller than the specific value, in other words when the reliability of the specified classification is low, so processing efficiency can be improved.
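The two-stage flow of claim 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `classify`, the dictionary keys, and the idea of passing the similarity measure in as a callable are all hypothetical, and the default thresholds simply reuse the example values 0.3 and 0.4 that the embodiment gives for the two determination conditions.

```python
def classify(doc, main_occupancy, misid_entries, similarity,
             occupancy_threshold=0.3, similarity_threshold=0.4):
    """Return the specified classifications for one document.

    main_occupancy: dict mapping classification -> similarity occupancy ratio
                    computed against the first (main) database.
    misid_entries:  list of dicts with hypothetical keys 'doc',
                    'error_class' and 'correct_class' (second database).
    similarity:     callable (doc_a, doc_b) -> float, supplied by the caller.
    """
    # Stage 1: pick the classification with the largest ratio.
    best_class = max(main_occupancy, key=main_occupancy.get)
    result = [best_class]

    # Stage 2: consult the second database only when the ratio is low,
    # i.e. when the first-stage result is considered unreliable.
    if main_occupancy[best_class] < occupancy_threshold:
        for entry in misid_entries:
            if entry["error_class"] != best_class:
                continue  # only entries misclassified as best_class matter
            if (similarity(doc, entry["doc"]) >= similarity_threshold
                    and entry["correct_class"] not in result):
                result.append(entry["correct_class"])
    return result
```

Because the second database is searched only for entries whose error classification matches the first-stage result, the extra cost stays proportional to the number of known misclassifications for that one classification.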

[0009] As set forth in claim 4, the document classification apparatus of the present invention comprises, in addition to the configuration of claim 3: correct classification input means for inputting the correct classification that should be assigned to the document data; and document data addition means for comparing the correct classification input through the correct classification input means with the classification specified by the classification specifying means and, when the specified classification is wrong, adding the document data to the second database. With this configuration, the second database can be updated automatically, reducing the labor of database maintenance.

[0010] Further, as set forth in claim 4, the document classification apparatus of the present invention comprises, in addition to the configuration of claim 3: correct classification input means for inputting the correct classification that should be assigned to the document data; and deletion means for comparing the correct classification input through the correct classification input means with the classification added by the addition means and, when the added classification is wrong, deleting from the second database the registered document data on which the addition was based. With this configuration, the second database can be updated automatically, reducing the labor of database maintenance.
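The two automatic maintenance rules for the second database (add a document whose classification was specified wrongly; delete an entry that caused a wrong classification to be added) can be sketched together. The function name, the dictionary layout, and the `added` mapping from an added classification to the ID of the entry that triggered it are assumptions made for this illustration only.

```python
def update_misid_db(misid_db, doc_id, doc_text, specified, added, correct_class):
    """Apply both maintenance rules to the second database in place.

    misid_db:  dict mapping document ID -> {'text', 'error_class', 'correct_class'}
    specified: classification specified from the first database
    added:     dict mapping each added classification -> ID of the second-database
               entry that was the basis for adding it
    """
    # Rule 1: the specified classification was wrong, so register the document
    # together with its error classification and its correct classification.
    if specified != correct_class:
        misid_db[doc_id] = {
            "text": doc_text,
            "error_class": specified,
            "correct_class": correct_class,
        }

    # Rule 2: an added classification was wrong, so remove the entry
    # that served as the basis for adding it.
    for added_class, basis_id in added.items():
        if added_class != correct_class:
            misid_db.pop(basis_id, None)
```

Both rules touch only the second database, which is the point of the design: the first database never needs to be rebuilt.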

[0011]

Embodiments of the Invention

An embodiment of the present invention will be described below with reference to the drawings.

[0012] FIG. 1 is a diagram showing the hardware configuration of a document classification apparatus according to an embodiment of the present invention. The apparatus is built as one function on a computer having a general-purpose architecture.

[0013] As shown in FIG. 1, the document classification apparatus comprises a control device 1 composed of a CPU, ROM, RAM, and the like; an input device 2 such as a keyboard, pointing device, scanner, or microphone; a display device 3 that displays classification results and the like; and an external storage device 4 such as a hard disk drive, MO, or DVD-RAM. The external storage device 4 stores the document data used for classification, similarity information, and various setting values. In this embodiment, it holds a main database (hereinafter, "main DB") 4a that stores a plurality of registered document data to which classifications have been assigned in advance, and a misidentification database (hereinafter, "misidentification DB") 4b that stores registered document data which, when the classification of document data is specified, cause a classification error in which an error classification different from the correct classification is specified, together with the correct classification and the error classification. The misidentification DB 4b is derived from the main DB 4a and is created as follows. When a classification is specified by calculating the similarity between the registered document data in the main DB 4a and the document data to be classified, and the result turns out to be wrong, that document data is registered in the misidentification DB 4b under the classification that was wrongly specified. Repeating this process accumulates, for each classification wrongly specified with the main DB 4a, the document data that was misclassified; the result is the misidentification DB 4b. The misidentification DB 4b is therefore necessarily smaller than the main DB 4a, and usually considerably smaller. Nevertheless, it can reduce classification errors. It is thus easier to update than the main DB 4a, and because the misidentification DB is updated automatically as described later, the resources required for database adjustment can be reduced.

[0014] FIG. 2 shows the configuration of the control device 1 of the apparatus. In FIG. 2, the control device 1 is represented functionally: the part implemented by the CPU and ROM is shown as a program unit 200, and the RAM is shown as a buffer unit 250.

[0015] The program unit 200 has 14 functions: an initialization unit 201; a classified document input unit 202; a registered document reading unit 203; a similarity calculation unit 204; a classification-similarity list creation unit 205; a classification specifying unit 206; a similarity occupancy ratio insufficiency determination condition setting unit 207; a similarity occupancy ratio insufficiency determination unit 208; a similarity determination condition setting unit 209; a similarity determination unit 210; a classification specification result output unit 211; a correct classification information input unit 212; a classification specification result correctness determination unit 213; and a misidentification DB update unit 214.

[0016] The buffer unit 250 has 8 areas: a classified document storage buffer unit 251; a registered document storage buffer unit 252; a similarity calculation result storage buffer unit 253; a classification-similarity list storage buffer unit 254; a classification specification result storage buffer unit 255; a similarity occupancy ratio insufficiency determination condition storage buffer unit 256; a similarity determination condition storage buffer unit 257; and a correct classification information storage buffer unit 258.

[0017] The initialization unit 201 clears the data stored in each buffer unit within the buffer unit 250.

[0018] The classified document input unit 202 stores the data of the document to be classified (hereinafter, "classified document data"), which the user enters using the input device 2, in the classified document storage buffer unit 251. At this time a classified document ID is issued, and this ID is also stored in the classified document storage buffer unit 251.

[0019] The registered document reading unit 203 reads the data of the documents stored in the external storage device 4 (hereinafter, "registered document data") and stores it in the registered document storage buffer unit 252.

[0020] The similarity calculation unit 204 splits the classified document data stored in the classified document storage buffer unit 251 and the registered document data stored in the registered document storage buffer unit 252 into words, and calculates their degree of similarity by the vector space method, using the number of occurrences of each word as a vector component. It then stores the classified document ID, the registered document ID, the similarity, and the classification information of the registered document data as one set in the similarity calculation result storage buffer unit 253. The similarity may instead be calculated from the number of words the documents have in common, or any other value that indicates the similarity between documents may be used.
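A vector space comparison based on word occurrence counts can be sketched as follows. The text does not say which vector measure is used, so cosine similarity is an assumption here, as are the function names. The regular-expression tokenizer is only a stand-in: Japanese text would need a morphological analyzer to split it into words.

```python
import math
import re
from collections import Counter


def term_counts(text: str) -> Counter:
    """Count word occurrences (stand-in tokenizer: lowercase alphanumeric runs)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A call such as `cosine_similarity(term_counts(target_text), term_counts(registered_text))` would then yield one similarity value per registered document.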

[0021] The classification-similarity list creation unit 205 uses the similarity information for the registered document data stored in the similarity calculation result storage buffer unit 253 to create a classification-similarity list in which the similarities are summed for each classification. It then stores, in the classification-similarity list storage buffer unit 254, the ratio of each classification's similarity sum to the sum of all similarities as the similarity occupancy ratio.
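The per-classification summation and the similarity occupancy ratio can be sketched as below. The tuple layout `(registered_doc_id, classification, similarity)` and the function name are assumptions for this illustration.

```python
from collections import defaultdict


def occupancy_by_class(results):
    """Map each classification to its share of the total similarity.

    results: iterable of (registered_doc_id, classification, similarity).
    """
    sums = defaultdict(float)
    for _doc_id, classification, similarity in results:
        sums[classification] += similarity

    total = sum(sums.values())
    if total == 0:
        return {}  # no similarity at all, so no ratio can be formed
    return {c: s / total for c, s in sums.items()}
```

By construction the ratios sum to 1, so a low maximum ratio means the similarity is spread thinly over many classifications.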

[0022] The classification specifying unit 206 takes the information stored in the classification-similarity list storage buffer unit 254 and outputs the classifications with a large similarity occupancy ratio to the classification specification result storage buffer unit 255. Instead of specifying the classification from the classification-similarity list, the classification of the document with the highest similarity may be used as it is.

[0023] The similarity occupancy ratio insufficiency determination condition setting unit 207 stores the similarity occupancy ratio insufficiency determination condition that the user enters from the input device 2 in the similarity occupancy ratio insufficiency determination condition storage buffer unit 256.

[0024] The similarity occupancy ratio insufficiency determination unit 208 determines whether to perform classification specification using the misidentification DB 4b, based on the classification specification result stored in the classification specification result storage buffer unit 255 and the similarity occupancy ratio insufficiency determination condition stored in the similarity occupancy ratio insufficiency determination condition storage buffer unit 256.

[0025] The similarity determination condition setting unit 209 stores the similarity determination condition that the user enters from the input device 2 in the similarity determination condition storage buffer unit 257.

[0026] The similarity determination unit 210 determines whether the similarity is high, based on the similarity calculation result stored in the similarity calculation result storage buffer unit 253 and the similarity determination condition stored in the similarity determination condition storage buffer unit 257.

[0027] The classification specification result output unit 211 outputs the classification specification result stored in the classification specification result storage buffer unit 255 to the display device 3.

[0028] The correct classification information input unit 212 stores the correct classification information that the user enters from the input device 2 in the correct classification information storage buffer unit 258.

[0029] The classification specification result correctness determination unit 213 determines whether the classification specification result is correct, based on the classification specification result stored in the classification specification result storage buffer unit 255 and the correct classification information stored in the correct classification information storage buffer unit 258.

[0030] The misidentification DB update unit 214 adds the classified document stored in the classified document storage buffer unit 251 to the misidentification DB, and deletes documents registered in the misidentification DB.

[0031] Next, the operation of the document classification apparatus of this embodiment will be described. The operations described here are executed by the CPU of the control device 1 using the program in the ROM and the storage areas in the RAM.

[0032] This embodiment consists broadly of a first step and a second step. In the first step, the input classified document data is compared with the registered document data, and the classification to which the classified document data is likely to belong is determined. In the second step, the correct classification information for the document data classified in the first step is input, and if the classification result of the first step was wrong, classified document data is added to the misidentification DB 4b or document data is deleted from the misidentification DB 4b.

[0033] First, the first step will be described.

[0034] To begin, the user uses the input device 2 to store in the external storage device 4 the document data that will be referred to when classifying documents (step 301).

[0035] Next, the initialization unit 201 clears all buffers to initialize them (step 302). In this state, when the user enters the similarity occupancy ratio insufficiency determination condition from the input device 2, the similarity occupancy ratio insufficiency determination condition setting unit 207 stores it in the similarity occupancy ratio insufficiency determination condition storage buffer unit 256 (step 303). FIG. 4 shows an example of the stored data when 0.3 is set as the similarity occupancy ratio insufficiency determination condition.

[0036] Next, when the user enters the similarity determination condition from the input device 2, the similarity determination condition setting unit 209 stores the condition value in the similarity determination condition storage buffer unit 257 (step 304). FIG. 5 shows an example of the stored data when 0.4 is set as the similarity determination condition.

[0037] After entering the similarity occupancy ratio insufficiency determination condition and the similarity determination condition, the user enters the classified document data using the input device 2. When the classified document data is entered, the classified document input unit 202 stores it in the classified document storage buffer unit 251 as the search key (step 305). FIG. 6 shows an example of the stored classified document data.

[0038] Once the classified document data has been stored, the registered document reading unit 203 reads a plurality of document data from the external storage device 4 and stores them in the registered document storage buffer unit 252 as registered document data (step 306). As shown in FIG. 7, each registered document data item carries, in addition to its body text, a registered document ID that identifies the document and classification information that indicates its classification.

[0039] Following step 306, the similarity calculation unit 204 calculates the similarity between the registered document data and the classified document data (step 307). Specifically, it compares the body text of the classified document data stored in the classified document storage buffer unit 251 with the body text of the registered document data stored in the registered document storage buffer unit 252, and uses the vector space method to calculate the similarity, a numerical value indicating how similar the two are. The calculated similarity is stored in the similarity calculation result storage buffer unit 253 together with the registered document ID and the classification information of that registered document data. At this point, it is also acceptable to store only a predetermined number of results in descending order of similarity, or only those whose similarity is at or above a value specified in advance. In the storage example of FIG. 8, the first entry holds document ID = 1023, classification = TV, and similarity = 0.378.
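The two optional ways of limiting what is stored (a fixed number of the most similar results, or only results at or above a preset similarity) can be sketched in one helper. The function name, parameter names, and tuple layout `(registered_doc_id, classification, similarity)` are assumptions for illustration; only the first sample row below comes from the FIG. 8 example, and the other rows are made up.

```python
import heapq


def select_results(results, top_n=None, min_similarity=None):
    """Filter similarity results before they are stored.

    results: list of (registered_doc_id, classification, similarity).
    top_n: keep only this many results, highest similarity first.
    min_similarity: keep only results at or above this similarity.
    """
    kept = [r for r in results
            if min_similarity is None or r[2] >= min_similarity]
    if top_n is not None:
        kept = heapq.nlargest(top_n, kept, key=lambda r: r[2])
    return kept


sample = [
    (1023, "TV", 0.378),     # first entry of the FIG. 8 example
    (2001, "video", 0.120),  # hypothetical row
    (3002, "PC", 0.500),     # hypothetical row
]
top_two = select_results(sample, top_n=2)
above_threshold = select_results(sample, min_similarity=0.3)
```

Either filter shrinks the list that the later per-classification summation has to process, at the cost of ignoring weakly similar documents.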

[0040] Once the similarity has been stored, it is determined whether any registered document data remains for which the similarity has not yet been calculated (step 308). If so, the process returns to step 306 and repeats steps 306 and 307 for the remaining registered document data. If it is determined that no other registered document data remains, the process proceeds to step 309.

[0041] In step 309, the classification-similarity list creation unit 205 sums the similarity calculation results stored in the similarity calculation result storage buffer unit 253 for each classification to which the registered documents belong, and outputs the sums to the classification-similarity list storage buffer unit 254. It then calculates the ratio of each classification's sum to the total of all output similarities and stores this ratio as the similarity occupancy. The example classification-similarity list in FIG. 9 shows that the similarities of the documents belonging to classification = personal computer sum to 0.729 and those of the documents belonging to classification = video sum to 1.593. If the total of all these similarities, 0.514 + 1.593 + 0.729 + 1.782 + ..., is 8.5, the similarity occupancy is 0.514 / 8.5 = 0.060 for classification = television and 1.593 / 8.5 = 0.187 for classification = video.
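
The per-classification sums and similarity occupancy of step 309 can be sketched as below. The function name and the tuple layout are assumptions; the figures for television, video, and personal computer are those quoted from FIG. 9, while the "(remaining classifications)" row is a placeholder that simply brings the total to the stated 8.5 and does not correspond to any entry named in the text.

```python
from collections import defaultdict


def similarity_occupancy(results):
    """Return {classification: (similarity_sum, occupancy)}.

    results: iterable of (doc_id, classification, similarity) tuples.
    """
    sums = defaultdict(float)
    for _doc_id, classification, similarity in results:
        sums[classification] += similarity
    total = sum(sums.values())
    return {
        name: (s, s / total if total else 0.0)
        for name, s in sums.items()
    }


# One row per classification, using the sums quoted from FIG. 9.
fig9_rows = [
    (None, "television", 0.514),
    (None, "video", 1.593),
    (None, "personal computer", 0.729),
    (None, "(remaining classifications)", 5.664),  # placeholder, see lead-in
]

for name, (total, share) in similarity_occupancy(fig9_rows).items():
    print(f"{name}: sum={total:.3f} occupancy={share:.3f}")
```

Because the occupancy is a share of the total, the values over all classifications add up to 1, which is what makes a fixed threshold such as 0.3 meaningful regardless of how many registered documents were scored.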

[0042] The classification specifying unit 206 stores the entries of the classification-similarity list obtained in this way in the classification specification result storage buffer unit 255, in descending order of similarity sum (step 310). At this time, the fact that the database used for the classification is the main DB 4a is stored in association with the similarity occupancy of each specified classification. The storage example in FIG. 10 shows that the first candidate is classification = video camera with similarity occupancy = 0.210, the second candidate is classification = video with similarity occupancy = 0.187, and the third candidate is classification = personal computer with similarity occupancy = 0.086. The classification specification result storage buffer unit 255 also has a storage area for a similar registered document ID, but nothing is stored in this area at this point.

[0043] Of the data stored in the classification specification result storage buffer unit 255 in this way, the similarity occupancy of the first candidate is compared by the similarity occupancy shortage determination unit 208 with the similarity occupancy shortage determination condition (0.3 in the example of FIG. 4) stored in the similarity occupancy shortage determination condition storage buffer unit 256 (step 311). If the similarity occupancy of the first candidate is larger, the process proceeds to step 318; if it is smaller, the process proceeds to step 312. In this embodiment, the similarity occupancy of the first candidate is used for this determination so that the condition captures the case in which the classification specification result contains no classification of high certainty.

[0044] In step 312, it is determined whether the classification of the first candidate stored in step 310 is registered in the erroneous identification DB 4b. If it is registered, the process proceeds to step 313; if not, the process proceeds to step 318. FIG. 11 shows an example of the document IDs of the document data registered in the erroneous identification DB 4b. In this example, document data with document IDs 3, 19, ... are registered under classification = video camera. Here, the first candidate of the classification specification result of step 310 is video camera, so the determination in step 312 is Yes and the process proceeds to step 313.
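
Steps 310 to 312 can be sketched as a ranking followed by a two-part gate. The dictionary layout, the function names, and the choice of keeping three candidates (FIG. 10 shows three, but the text does not fix the number) are assumptions. The text leaves the case of exact equality with the threshold open; this sketch consults the erroneous identification DB only when the occupancy is strictly smaller, matching the wording "smaller than the specific value" in the claims.

```python
def rank_candidates(occupancy, top_n=3):
    """Step 310: order classifications by similarity sum, largest first.

    occupancy: {classification: (similarity_sum, occupancy)}.
    """
    ranked = sorted(occupancy.items(), key=lambda kv: kv[1][0], reverse=True)
    return [
        {
            "classification": name,
            "occupancy": share,
            "source": "main DB",     # database used for this result
            "similar_doc_id": None,  # left empty at this point
        }
        for name, (_total, share) in ranked[:top_n]
    ]


def should_consult_error_db(candidates, error_db, occupancy_threshold=0.3):
    """Steps 311 and 312 combined.

    error_db: mapping keyed by error classification (hypothetical layout).
    """
    if not candidates:
        return False
    first = candidates[0]
    # Step 311: a sufficiently dominant first candidate goes straight to output.
    if first["occupancy"] >= occupancy_threshold:
        return False
    # Step 312: consult only if the first candidate is a known error classification.
    return first["classification"] in error_db
```

With the FIG. 10 values, the first candidate (video camera, occupancy 0.210) falls below the 0.3 condition of FIG. 4 and is registered in the erroneous identification DB per FIG. 11, so the gate opens and processing continues at step 313.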

[0045] In step 313, the registered document reading unit 203 reads from the external storage device 4 those registered document data in the erroneous identification DB 4b that are registered under the classification of the first candidate specified in step 310, and stores them in the registered document storage buffer unit 252.

[0046] Next, as in step 307, the similarity calculation unit 204 calculates the similarity between the classified document data stored in the classified document storage buffer unit 251 and the registered document data stored in the registered document storage buffer unit 252, and stores it in the similarity calculation result storage buffer unit 253 (step 314). Steps 313 and 314 are repeated for all registered document data stored in the registered document storage buffer unit 252; when the similarity has been calculated and stored for all registered document data in the buffer unit 252, the process proceeds to step 316 (step 315). FIG. 12 shows an example of the similarity calculation results for the documents registered under classification = video camera in the erroneous identification DB 4b: the document data with registered document ID = 3 has classification = camera and a similarity of 0.226, and the document data with registered document ID = 19 has classification = digital camera and a similarity of 0.423.

[0047] In step 316, the similarity determination condition stored in the similarity determination condition storage buffer unit 257 (0.4 in the example of FIG. 5) is compared with the similarities stored in the similarity calculation result storage buffer unit 253. If the similarity determination condition is larger, the process proceeds to step 318; if it is smaller, the process proceeds to step 317. In step 317, the classification of the corresponding document data, the similar registered document ID, and an indication that this is a classification specification result from the erroneous identification DB 4b are added to the classification specification result storage buffer unit 255. In the example of FIG. 13, the similarity of the document data with registered document ID = 19 exceeds the similarity determination condition of 0.4, so it is added to the classification specification result storage buffer. Because the erroneous identification DB 4b also records that the correct classification of the registered document data with registered document ID = 19 is digital camera, classification = digital camera is stored in the classification specification result storage buffer unit 255 as the classification result from the erroneous identification DB 4b. If the erroneous identification DB 4b yields no registered document data to be added, the contents remain as in FIG. 10.
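
Steps 316 and 317 can be sketched as a filter over the similarities obtained from the erroneous identification DB. The function name and record layout are assumptions, as is leaving the occupancy field empty for added entries (the text does not say what FIG. 13 stores there). The comparison uses "equal to or more than", following the claims; the description itself only covers the strictly larger and strictly smaller cases.

```python
def append_error_db_results(candidates, error_db_scores, similarity_threshold=0.4):
    """Add the correct classification of sufficiently similar error-DB documents.

    candidates: list of candidate records produced from the main DB.
    error_db_scores: iterable of (doc_id, correct_classification, similarity)
        for the documents registered under the first candidate's classification.
    Returns a new list; the input list is not modified.
    """
    extended = list(candidates)
    for doc_id, correct_classification, similarity in error_db_scores:
        # Step 316: keep only documents that meet the similarity condition.
        if similarity >= similarity_threshold:
            # Step 317: record the correct classification and where it came from.
            extended.append(
                {
                    "classification": correct_classification,
                    "occupancy": None,  # assumption: not defined for these entries
                    "source": "erroneous identification DB",
                    "similar_doc_id": doc_id,
                }
            )
    return extended


# FIG. 10 candidates and FIG. 12 similarities as quoted in the text.
fig10 = [
    {"classification": "video camera", "occupancy": 0.210, "source": "main DB", "similar_doc_id": None},
    {"classification": "video", "occupancy": 0.187, "source": "main DB", "similar_doc_id": None},
    {"classification": "personal computer", "occupancy": 0.086, "source": "main DB", "similar_doc_id": None},
]
fig12 = [(3, "camera", 0.226), (19, "digital camera", 0.423)]
result = append_error_db_results(fig10, fig12)
```

Only registered document ID = 19 clears the 0.4 condition, so `result` holds the three main DB candidates plus one entry for digital camera, and with no qualifying document the list would be returned unchanged.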

[0048] The classification specification result obtained in this way for the classified document data and stored in the classification specification result storage buffer unit 255 is output to the display device 3 by the classification specification result output unit 211 so that the user can recognize it (step 318). FIG. 14 shows an output example for this case.

[0049] After the classification specification result for the target classified document data has been output, it is determined whether any classified document data whose classification has not yet been specified remains in the classified document storage buffer unit 251 (step 319). If so, the process returns to step 305, and steps 305 to 318 are repeated. If no classified document data remains, the classification process ends. This completes the first step.

[0050] Next, the second step will be described. FIG. 15 is a flowchart showing the procedure of the second step.

[0051] First, the initialization unit 201 clears all buffer units other than the classification specification result storage buffer unit 255 (step 351).

[0052] Next, in this state, when the classified document ID of classified document data classified in the first step and the correct classification of that document data are input from the input device 2, the correct classification information input unit 212 stores this input information in the correct classification information storage buffer unit 258 (step 352). FIG. 16 shows a storage example for the case in which correct classification = personal computer has been input for the document data with classified document ID = 1.

[0053] Next, the classification specification result correctness determination unit 213 compares the classification result obtained using the main DB 4a and stored in the classification specification result storage buffer unit 255 with the correct classification information stored in the correct classification information storage buffer unit 258. First, in step 353, it is determined whether the classification specification result includes a classification specification result from the erroneous identification DB 4b. If it does, the process proceeds to step 354; if not, the process proceeds to step 356. When the classification specification result is as shown in FIG. 13 and the correct classification information is as shown in FIG. 16, the classification specification result includes a result from the erroneous identification DB 4b, so the process proceeds to step 354.

[0054] In step 354, the classification specification result correctness determination unit 213 determines whether the classification specification result from the erroneous identification DB 4b matches the correct classification information. If they match, the process proceeds to step 358. If they differ, the process proceeds to step 355, where the registered document data that was found to be similar is deleted from the erroneous identification DB 4b. In this embodiment, the classification result obtained using the main DB 4a and stored in the classification specification result storage buffer unit 255 is as shown in FIG. 13, and the correct classification information stored in the correct classification information storage buffer unit 258 is as shown in FIG. 16. The classification specification result from the erroneous identification DB 4b is digital camera, whereas the correct classification is personal computer. Because the two differ, in step 355 the document data with similar registered document ID = 19, on which the result from the erroneous identification DB 4b was based, is deleted from the erroneous identification DB.

[0055] After the registered document data has been deleted from the erroneous identification DB 4b in step 355, in step 356 the classification specification result correctness determination unit 213 determines whether the classification specification result obtained from the main DB 4a matches the correct classification information. If they match, the process proceeds to step 358. If they differ, the process proceeds to step 357, where the classified document data is added to the erroneous identification DB 4b. For example, when the classification specification result is as shown in FIG. 17 and the correct classification information is as shown in FIG. 16, the classification result from the main DB 4a does not contain the correct classification, personal computer, so the process proceeds to step 357 and the erroneous identification DB update unit 214 additionally registers the classified document data in the erroneous identification DB 4b.
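
The update logic of steps 353 to 357 can be sketched as below. The data layout and names are hypothetical. Two points are assumptions not settled by the text: a newly added document is registered under the first main DB candidate as its error classification, and when several results came from the erroneous identification DB, one matching result is enough to skip deletion.

```python
def update_error_db(error_db, candidates, correct, doc_id, text):
    """Apply one piece of correct classification information.

    error_db: {error_classification: {doc_id: {"correct": ..., "text": ...}}}
    candidates: records with "classification", "source", "similar_doc_id".
    Returns a list of strings describing the changes made.
    """
    actions = []
    main = [c for c in candidates if c["source"] == "main DB"]
    extra = [c for c in candidates if c["source"] == "erroneous identification DB"]

    if extra:  # step 353: a result from the erroneous identification DB exists
        # Step 354: if that result was right, nothing needs to change.
        if any(c["classification"] == correct for c in extra):
            return actions
        # Step 355: delete the registered documents that produced the wrong result.
        for c in extra:
            for entries in error_db.values():
                if entries.pop(c["similar_doc_id"], None) is not None:
                    actions.append(f"deleted {c['similar_doc_id']}")

    # Step 356: was the correct classification among the main DB candidates?
    if any(c["classification"] == correct for c in main):
        return actions

    # Step 357: register the misclassified document (assumed error
    # classification: the first main DB candidate).
    if main:
        error_class = main[0]["classification"]
        error_db.setdefault(error_class, {})[doc_id] = {"correct": correct, "text": text}
        actions.append(f"added {doc_id}")
    return actions
```

For the FIG. 13 and FIG. 16 case, this deletes registered document ID = 19 and adds nothing, because personal computer is already among the main DB candidates. For a result without personal computer, as described for FIG. 17, the classified document is added instead.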

[0056] In step 358, it is determined whether any correct classification information remains to be input. If so, the process returns to step 352, and steps 352 to 357 are repeated. If no correct classification information remains to be input, the process ends. The present invention is not limited to the above embodiment and may be modified without departing from its gist. For example, instead of being input each time, the similarity occupancy shortage determination condition and the similarity determination condition may be stored in advance as default values.

【００５７】[0057]

[Effects of the Invention] As described in detail above, according to the present invention, even when there are documents that strongly contribute to classification specification errors, such errors can be reduced without updating the entire database.

[Brief description of the drawings]

FIG. 1 is a diagram showing the hardware configuration of a document classification device according to an embodiment of the present invention.

FIG. 2 is a functional block diagram of the control device in the document classification device of FIG. 1.

FIG. 3 is a diagram showing the procedure for specifying the classification of document data using the main DB and the erroneous identification DB.

FIG. 4 is a diagram showing a storage example of the similarity occupancy shortage determination condition.

FIG. 5 is a diagram showing a storage example of the similarity determination condition.

FIG. 6 is a diagram showing a storage example of classified document data.

FIG. 7 is a diagram showing a storage example of registered document data.

FIG. 8 is a diagram showing a storage example of the similarity calculation result.

FIG. 9 is a diagram showing an example of the classification-similarity list.

FIG. 10 is a diagram showing a storage example of the classification specification result obtained with the main DB.

FIG. 11 is a diagram showing a storage example of document IDs in the erroneous identification DB, by classification.

FIG. 12 is a diagram showing an example of the similarity calculation result obtained with the erroneous identification DB.

FIG. 13 is a diagram showing an example of adding a classification specification result obtained with the erroneous identification DB.

FIG. 14 is a diagram showing an example of classification result output.

FIG. 15 is a diagram showing the procedure for adding classified documents to, and deleting them from, the erroneous identification DB.

FIG. 16 is a diagram showing an example of correct classification information.

FIG. 17 is a diagram showing another example of a classification specification result.

[Explanation of symbols]

1 ... control device
2 ... input device
3 ... display device
4 ... external storage device
200 ... program unit
206 ... classification specifying unit
207 ... similarity occupancy shortage determination condition setting unit
208 ... similarity occupancy shortage determination unit
209 ... similarity determination condition setting unit
210 ... similarity determination unit
211 ... classification specification result output unit
212 ... correct classification information input unit
213 ... classification specification result correctness determination unit
214 ... erroneous identification DB update unit
250 ... buffer unit
254 ... classification-similarity list storage buffer unit
255 ... classification specification result storage buffer unit
256 ... similarity occupancy shortage determination condition storage buffer unit
257 ... similarity determination condition storage buffer unit
258 ... correct classification information storage buffer unit

Continuation of front page

(72) Inventor: Takeshi Matsukuma, c/o Toshiba Digital Media Engineering Corporation, 3-3-1 Shinmachi, Ome-shi, Tokyo
(72) Inventor: Yukio Nakamoto, c/o Toshiba Digital Media Engineering Corporation, 3-3-1 Shinmachi, Ome-shi, Tokyo
(72) Inventor: Takuya Nishina, c/o Toshiba Digital Media Engineering Corporation, 3-3-1 Shinmachi, Ome-shi, Tokyo
(72) Inventor: Hiroshi Yamazaki, c/o Toshiba Digital Media Engineering Corporation, 3-3-1 Shinmachi, Ome-shi, Tokyo

F-term (reference): 5B075 NR03 NR12 PR06 QM08 UU06

Claims

[Claims]

1. A document classification apparatus for specifying the classification of document data based on the similarity between the document data and a plurality of registered document data to which classifications have been assigned in advance, the apparatus comprising: a first database that stores the plurality of registered document data; a second database that stores, together with the correct classification and the error classification, registered document data that causes a classification specification error in which an error classification different from the correct classification is specified when the classification of document data is specified; and classification specifying means for determining the similarity between the document data and the registered document data in the first database and the registered document data in the second database, and specifying a classification based on the similarity.

2. A document classification apparatus for specifying the classification of document data based on the similarity between the document data and a plurality of registered document data to which classifications have been assigned in advance, the apparatus comprising: a first database that stores the plurality of registered document data; a second database that stores, together with the correct classification and the error classification, registered document data that causes a classification specification error in which an error classification different from the correct classification is specified when the classification of document data is specified; input means for inputting document data; classification specifying means for specifying a classification based on the similarity between the input document data and the registered document data in the first database; adding means for, when the classification specified by the classification specifying means is included among the error classifications in the second database and the registered document data corresponding to that error classification includes data whose similarity to the document data is equal to or greater than a predetermined value, adding the correct classification assigned to that registered document data to the classification specified by the classification specifying means; and output means for outputting the classifications specified as the classification of the document data, including the classification added by the adding means.

3. A document classification apparatus for specifying the classification of document data based on the similarity between the document data and a plurality of registered document data to which classifications have been assigned in advance, the apparatus comprising: a first database that stores the plurality of registered document data; a second database that stores, together with the correct classification and the error classification, registered document data that causes a classification specification error in which an error classification different from the correct classification is specified when the classification of document data is specified; input means for inputting document data; similarity calculating means for calculating the similarity between the input document data and the registered document data in the first database; ratio calculating means for summing the similarities calculated by the similarity calculating means for each classification assigned to the registered document data, and calculating the ratio of the total for each classification to the total of all similarities; classification specifying means for specifying a classification based on the magnitude of the calculated ratio; comparing means for comparing the ratio of the classification specified by the classification specifying means with a specific value; adding means for, when the comparison by the comparing means shows that the ratio is smaller than the specific value, calculating the similarity between the document data and the registered document data included in the error classification in the second database and, if there is registered document data whose similarity is equal to or greater than a predetermined value, adding the correct classification assigned to that registered document data to the classification specified by the classification specifying means; and output means for outputting the classifications specified as the classification of the document data, including the classification added by the adding means.

4. The document classification apparatus according to claim 3, further comprising: correct classification input means for inputting the correct classification to be assigned to the document data; and document data adding means for comparing the correct classification input by the correct classification input means with the classification specified by the classification specifying means and, when the classification specification result is incorrect, adding the document data to the second database.

5. The document classification apparatus according to claim 3, further comprising: correct classification input means for inputting the correct classification to be assigned to the document data; and deleting means for comparing the correct classification input by the correct classification input means with the classification added by the adding means and, when the added classification is incorrect, deleting from the second database the registered document data on which the addition of that classification was based.

6. A document classification method comprising: calculating the similarity between document data and a first plurality of registered document data to which classifications have been assigned in advance; specifying a first classification based on the calculated similarity; calculating the similarity between the document data and a second plurality of registered document data recognized in advance as causing a classification specification error in which an error classification different from the correct classification is specified; specifying a second classification from that calculated similarity and the correct classification; and specifying the classification of the document data by adding the second classification to the first classification.

7. A document classification method comprising: calculating the similarity between document data and a first plurality of registered document data to which classifications have been assigned in advance; specifying a classification based on the calculated similarity; when the specified classification is an error classification of a second plurality of registered document data recognized in advance as causing a classification specification error in which an error classification different from the correct classification is specified, calculating the similarity between that registered document data and the document data; and, when the calculated similarities include registered document data at or above a predetermined value, specifying the classification of the document data by adding the correct classification of that registered document data to the specified classification.

8. A document classification method comprising: calculating the similarity between document data and a first plurality of registered document data to which classifications have been assigned in advance; summing the calculated similarities for each classification assigned to the registered document data; calculating the ratio of the total for each classification to the total of all similarities; specifying a classification based on the magnitude of the calculated ratio; comparing the ratio of the specified classification with a specific value; when the comparison shows that the ratio is smaller than the specific value, calculating the similarity between the document data and a second plurality of registered document data recognized in advance as causing a classification specification error in which an error classification different from the correct classification is specified; and, when the calculated similarities include registered document data at or above a predetermined value, specifying the classification of the document data by adding the correct classification of that registered document data to the specified classification.

9. The document classification method according to claim 8, further comprising: inputting the correct classification to be assigned to the document data; comparing the input correct classification with the specified classification; and, when the specified classification is incorrect, adding the document data to the second plurality of registered document data.

10. The document classification method according to claim 8, further comprising: inputting the correct classification to be assigned to the document data; comparing the input correct classification with the added classification; and, when the added classification is incorrect, deleting from the second plurality of registered document data the registered document data on which the addition of that classification was based.