JPH1145247A

JPH1145247A - Document classification device, storage medium for storing document classification program and document classification method

Info

Publication number: JPH1145247A
Application number: JP9217131A
Authority: JP
Inventors: Naoyuki Nomura; 直之野村
Original assignee: JustSystems Corp
Current assignee: JustSystems Corp
Priority date: 1997-07-27
Filing date: 1997-07-27
Publication date: 1999-02-16
Anticipated expiration: 2017-07-27
Also published as: JP4067603B2

Abstract

PROBLEM TO BE SOLVED: To classify a target document more precisely by using the results of both manual classification and automatic classification. SOLUTION: An evaluation function, such as a weighting derived from the rate of correct answers in past classifications, is stored in a database for each person in charge of classification, and a typical document characterizing each classification is prepared in advance. A manual classification score is then computed for each classification from the classification result given by the person in charge (manual) for the target document and the evaluation function. In addition, the degree of similarity between the target document and each typical document is calculated, and an automatic classification score is computed for each classification from this degree of similarity. The two scores are summed for each classification, and the classification with the highest total is taken as the final classification result. Fusing manual and automatic classification in this way yields a more precise classification result.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a document classification device, a storage medium storing a document classification program, and a document classification method, and more particularly to improving the accuracy with which an acquired target document is classified.

[0002]

[Description of the Related Art] When documents are filed, distributed electronically, or stored on a storage medium, the target documents are sometimes classified into predetermined categories. Conventionally, such classification is done either manually, by a person in charge of classification who reads the target document, or automatically, by a computer system that analyzes the document content.

[0003]

[Problems to Be Solved by the Invention] However, conventional manual classification by a person does not always assign the correct classification. A computer system, on the other hand, can classify a large number of documents at high speed, but its classification is not always accurate either. Furthermore, because conventional manual classification and automatic classification take entirely different forms, seamless usability that combines the two has not been achieved.

[0004] The present invention has been made to solve these problems of the prior art. A first object of the invention is to provide a document classification device that can classify a target document more accurately by using the results of both manual classification and automatic classification. A second object is to provide a storage medium on which is recorded a document classification program that can classify a target document more accurately by using the results of both manual classification and automatic classification. A third object is to provide a document classification method that can classify a target document more accurately by using the results of both manual classification and automatic classification.

[0005]

[Means for Solving the Problems] In the invention according to claim 1, as shown in FIG. 8, the above object is achieved by providing a document classification device with: manual classification result acquisition means for acquiring a manual classification result in which a target document has been classified by a person within a predetermined set of classifications; target document acquisition means for acquiring the target document; automatic classification result acquisition means for automatically classifying the target document acquired by the target document acquisition means within the set of classifications to obtain an automatic classification result; and classification determination means for making the final determination of the classification of the target document from the manual classification result and the automatic classification result.

In the invention according to claim 2, as shown in FIG. 9, in the document classification device according to claim 1, the automatic classification result acquisition means has: document vector acquisition means for acquiring a target document vector characterizing the target document; typical document vector acquisition means for acquiring the typical document vector of a typical document characterizing each classification; and similarity calculation means for calculating the similarity between the target document vector and each typical document vector to obtain a similarity for each classification. The similarity for each classification obtained by the similarity calculation means is used as the classification result.

In the invention according to claim 3, as shown in FIG. 10, the second object is achieved by storing on a storage medium a computer-readable document classification program that causes a computer to implement: a manual classification result acquisition function for acquiring a manual classification result in which a target document has been classified by a person within a predetermined set of classifications; a target document acquisition function for acquiring the target document; an automatic classification result acquisition function for automatically classifying the target document acquired by the target document acquisition function within the set of classifications to obtain an automatic classification result; and a classification determination function for making the final determination of the classification of the target document from the manual classification result and the automatic classification result.

In the invention according to claim 4, as shown in FIG. 11, the automatic classification result acquisition function has: a document vector acquisition function for acquiring a target document vector characterizing the target document; a typical document vector acquisition function for acquiring the typical document vector of a typical document characterizing each classification; and a similarity calculation function for calculating the similarity between the target document vector and each typical document vector to obtain a similarity for each classification. The similarity for each classification obtained by the similarity calculation function is used as the classification result.

In the invention according to claim 5, as shown in FIG. 12, the third object is achieved by automatically classifying a target document within a predetermined set of classifications, and making the final determination of the classification of the target document from this automatic classification result and a manual classification result in which the target document has been classified by a person within the same set of classifications.

[0006]

[Embodiments of the Invention] Preferred embodiments of the document classification device, the storage medium storing the document classification program, and the document classification method of the present invention are described below with reference to FIGS. 1 to 7.

(1) Overview of the embodiment

In the document classification processing of this embodiment, an evaluation function, such as a weighting obtained from the rate of correct answers in past classifications, is stored in a database for each person in charge of classification, and a typical document characterizing each classification is prepared in advance. The manual classification is then scored for each classification from the result of the target document's classification by the person in charge (manual) and the evaluation function. The similarity between the target document and each typical document is also calculated and used to score the automatic classification for each classification. The two scores are summed for each classification, and the classification with the highest total is taken as the final classification result. Fusing manual classification and automatic classification in this way yields a more accurate classification result.
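The score fusion described in the overview can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function name `fuse_scores`, the category labels, the sample score values, and the alphabetical tie-break are all assumptions made for the example.

```python
def fuse_scores(manual, automatic):
    """Sum manual and automatic scores per classification and pick the highest.

    Both arguments map a classification label to a score. Ties are broken
    alphabetically, which is an assumption of this sketch only.
    """
    categories = set(manual) | set(automatic)
    totals = {c: manual.get(c, 0.0) + automatic.get(c, 0.0) for c in categories}
    # sorted() fixes the iteration order so that max() is deterministic on ties.
    best = max(sorted(totals), key=lambda c: totals[c])
    return best, totals


# Hypothetical scores: a weighted manual vote for "A" and normalized similarities.
manual_scores = {"A": 0.5, "B": 0.0, "C": 0.0}
automatic_scores = {"A": 0.25, "B": 0.5, "C": 0.25}
best, totals = fuse_scores(manual_scores, automatic_scores)
print(best, totals["A"], totals["B"], totals["C"])
```

In this example the automatic classifier alone would prefer "B", but the combined total favors "A", which is the kind of correction the fusion is meant to allow.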

[0007] (2) Details of the embodiment

The document classification device of this embodiment can be configured not only as a computer system such as a personal computer or word processor, but also as a LAN (local area network) server, a host for computer (PC) communication, a computer system connected to the Internet, or the like. It is also possible to distribute the functions among the devices on a network so that the network as a whole constitutes the document classification device.

[0008] FIG. 1 is a block diagram showing the configuration of the document classification device. As shown in FIG. 1, the document classification device includes a control unit 11 for controlling the entire system. Connected to the control unit 11 via a bus line 21 such as a data bus are a keyboard 12 and a mouse 13 as input devices, a display device 14, a printing device 15, a storage device 16, a storage medium drive device 17, a communication control device 18, an input/output I/F 19, and a character recognition device 20. The control unit 11 includes a CPU 111, a ROM 112, and a RAM 113. The ROM 112 is a read-only memory in which the various programs and data that the CPU 111 uses for control and computation are stored in advance.

[0009] The RAM 113 is a random access memory used by the CPU 111 as working memory. The following areas are secured in the RAM 113 for the document classification processing of this embodiment: a classification processing table storage area 1131, which stores the classification processing table used to convert the results of automatic classification and manual classification into scores and to perform processing such as normalization; a target document storage area 1132, which stores the target document to be classified; a target document vector storage area 1133, which stores the target document vector that characterizes the target document, with values such as the importance of extracted keywords as its elements; a typical document vector storage area 1134, which stores the typical document vectors that characterize the typical documents; a similarity storage area 1135, which stores the similarity between the target document and each typical document; and various other areas.

[0010] The keyboard 12 forms part of the target document acquisition means when a target document is created on the device itself, and part of the manual classification result acquisition means when a person in charge of classification enters a classification result. It has various keys, including kana keys and a numeric keypad for entering kana characters, function keys for executing various functions, and cursor keys. The mouse 13 is a pointing device: an input device that designates a function by left-clicking the corresponding key, icon, or the like displayed on the display device 14. The display device 14 is, for example, a CRT or a liquid crystal display. It displays the results of input from the keyboard 12 and the mouse 13 as well as the final classification result. The printing device 15 prints documents displayed on the display device 14, documents stored in the document database 164 of the storage device 16, and the like. Various printing devices can be used, such as laser printers, dot printers, inkjet printers, page printers, thermal printers, and thermal transfer printers.

[0011] The storage device 16 consists of a readable and writable storage medium and a drive device for reading and writing various information, such as programs and data, on that storage medium. A hard disk is mainly used as the storage medium of the storage device 16, but any readable and writable medium among the storage media used with the storage medium drive device 17, described later, may be used instead. The storage device 16 has a kana-kanji conversion dictionary 161, a program storage unit 162, a data storage unit 163, a document database 164, an evaluation function database 165, a document vector database 166, and other storage units not shown (for example, a storage unit for backing up the programs and data stored in the storage device 16). The program storage unit 162 stores the programs of this embodiment, such as the document classification processing program and the document vector creation processing program, as well as other programs such as a kana-kanji conversion program that uses the kana-kanji conversion dictionary 161 to convert an input kana character string into text containing kanji. The data storage unit 163 stores the various data the system requires, such as data about users.

[0012] The document database 164 stores the typical documents that characterize each classification, as well as ordinary documents other than typical documents. The format of the documents stored in the document database 164 is not particularly limited; documents in various formats can be stored, such as text documents, HTML (Hyper Text Markup Language) documents, and JIS documents. The classifications characterized by the typical documents can be chosen to suit the purpose of use: in-house classifications such as technical trend reports, assertion reports, and new projects; general classifications such as politics, economy, and health; classifications for the general books and scientific and technical literature used in libraries and the like; and various others.

[0013] FIG. 2 conceptually shows the contents of the evaluation function database 165. As shown in FIG. 2, the evaluation function stores, for each person in charge of classification (Hanako, Taro, Shiro, ...), a "weight" for each classification (A, B, C, ...) as an evaluation value. The weight is determined from, for example, that person's rate of correct answers (or error rate) for each classification. Each time a person in charge decides the classification of a target document, the decision is compared with the final classification result and the weight is updated. In the example of FIG. 2, Hanako has a low rate of correct answers for classification A and a high rate for classification C.
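One way to picture the evaluation function database is as a table of per-person, per-classification counters from which a weight is derived. The sketch below is hypothetical: the text does not give the update rule, so the class `EvaluationDB`, its method names, the use of "times correct divided by times chosen" as the weight, and the default weight of 0.5 are all assumptions for illustration.

```python
from collections import defaultdict


class EvaluationDB:
    """Hypothetical per-person weight table based on past accuracy."""

    def __init__(self):
        # counts[person][classification] = [times_correct, times_chosen]
        self.counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))

    def record(self, person, chosen, final):
        """Compare a person's choice with the final result and update counts."""
        entry = self.counts[person][chosen]
        entry[1] += 1
        if chosen == final:
            entry[0] += 1

    def weight(self, person, classification, default=0.5):
        """Return the accuracy-based weight, or a default with no history."""
        correct, total = self.counts[person][classification]
        return correct / total if total else default


db = EvaluationDB()
db.record("Hanako", chosen="A", final="B")  # Hanako chose A, final result was B
db.record("Hanako", chosen="C", final="C")  # Hanako chose C, final result was C
print(db.weight("Hanako", "A"), db.weight("Hanako", "C"), db.weight("Taro", "A"))
```

A real system might smooth these ratios or use error rates instead, as the description allows; the point here is only the shape of the lookup and the update after each final decision.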

[0014] FIG. 3 conceptually shows the contents of the document vector database 166. As shown in FIG. 3, the importance f(x) obtained for each keyword x automatically extracted from a document Ajk is stored as the element value f(x) of the document vector. A document vector is stored for each document jk (j = 1, 2, ...; k = 1, 2, ...) and is associated with the corresponding document stored in the document database 164. The dimension of each document vector is the number of keywords x (important words and phrases) adopted. When the similarity between two documents is computed from their document vectors, however, the dimension of both vectors is the size of the union of the two documents' keywords. In that case, for a keyword contained in only one document vector, the corresponding element of the other document vector is defined as 0.

[0015] For example, in FIG. 3, the keywords of document B are "important, important word, importance, ..." and the keywords of document C are "important, ..., politics, ...". The document vectors of the two documents are as follows.

Document vector of document B = (1, 18, 19, ...)
Document vector of document C = (18, ..., 21, ...)

To calculate the similarity between document B and document C, the keywords of both documents are taken to be "important, important word, importance, ..., politics, ...", and the two document vectors are defined as follows.

Document vector of document B = (1, 18, 19, ..., 0, ...)
Document vector of document C = (18, 0, 0, ..., 21, ...)
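The zero-padding over the union of keywords can be sketched as below. The helper name `align` and the dictionary representation are assumptions for illustration; the values are the ones listed for documents B and C, with the elided ("...") entries simply left out.

```python
def align(vec_a, vec_b):
    """Expand two keyword->importance maps onto the union of their keywords.

    A keyword missing from one document gets the value 0 in that document's
    vector, as the description defines.
    """
    keys = list(vec_a) + [k for k in vec_b if k not in vec_a]
    a = [vec_a.get(k, 0) for k in keys]
    b = [vec_b.get(k, 0) for k in keys]
    return keys, a, b


doc_b = {"important": 1, "important word": 18, "importance": 19}
doc_c = {"important": 18, "politics": 21}
keys, b_vec, c_vec = align(doc_b, doc_c)
print(keys)
print(b_vec)
print(c_vec)
```

After alignment both vectors have the same dimension, so they can be passed directly to a similarity calculation.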

[0016] The storage medium drive device 17 (FIG. 1) is a drive device with which the CPU 111 reads computer programs, data including documents, and the like from an external storage medium. The computer programs and other content stored on the storage medium include the various processing programs executed by the document classification device of this embodiment, such as the document classification processing, together with the dictionaries and data they use. Here, a storage medium means any medium on which computer programs, data, and the like are stored. Specifically, it includes magnetic storage media such as floppy disks, hard disks, and magnetic tape; semiconductor storage media such as memory chips and IC cards; storage media from which information is read optically, such as CD-ROM, MO, and PD (phase-change rewritable optical disk); storage media using paper (or a medium with a function equivalent to paper), such as paper cards and paper tape; and storage media on which computer programs and the like are stored by various other methods. The storage media mainly used with the document classification device of this embodiment are CD-ROMs, floppy disks, and the like. Besides reading computer programs from these storage media, the storage medium drive device 17 can write data stored in the RAM 113 or the storage device 16 to a writable storage medium such as a floppy disk.

[0017] In the document classification device of this embodiment, the CPU 111 of the control unit 11 reads a computer program from an external storage medium set in the storage medium drive device 17 and stores (installs) it in the appropriate part of the storage device 16. To execute processing such as the document classification processing of this embodiment, the CPU 111 reads the relevant program from the storage device 16 into the RAM 113 and executes it. It is also possible, however, to read a program directly into the RAM 113 from an external storage medium via the storage medium drive device 17, rather than from the storage device 16, and execute it. Depending on the document classification device, the document classification processing program and other programs of this embodiment may be stored in the ROM 112 in advance and executed by the CPU 111. Furthermore, the programs and data of this embodiment, such as the document classification processing program, may be downloaded from another storage medium via the communication control device 18 and executed.

[0018] The communication control device 18 is a control device for connecting the document classification device over a network to other electronic devices such as personal computers and word processors. Through the communication control device 18, the device can access, as search targets, documents held by these electronic devices in the same language as the target document, documents entered in other languages, and databases of documents in the same language or other languages. The documents concerned include documents in various formats, such as text and HTML, as well as various data such as bitmap data. The input/output I/F 19 is an interface for connecting various devices, such as speakers that output voice and music. The character recognition device 20 recognizes characters written on paper or the like and outputs them in various formats such as text or HTML; it consists of an image scanner, a character recognition program, and the like.

[0019] In this embodiment, various documents can be acquired as the target document that serves as the basis for the search (document acquisition means): documents created by input operations on the keyboard 12 (stored in a predetermined storage area of the RAM 113), documents created externally, stored on a storage medium, and read in through the storage medium drive device 17, documents stored in advance in the document database, documents downloaded through the communication control device 18, and documents recognized by the character recognition device 20.

[0020] The operation of the document classification processing performed by the document classification device of this embodiment, configured as described above, is explained with reference to FIG. 4. FIG. 4 is a flowchart showing the main operation of the document classification processing. The CPU 111 first acquires the target document T to be classified and stores it in the target document storage area 1132 of the RAM 113 (step 11).

[0021] The CPU 111 then acquires the identity of the person in charge of classification and the manual classification result produced by that person, and stores them in the classification processing table in the classification processing table storage area 1131 of the RAM 113 (step 12). FIG. 6 conceptually shows the contents of the classification processing table, for which a work area is secured in the classification processing table storage area 1131 of the RAM 113. If the person in charge, Hanako, reads the target document and decides on classification A, then, as shown in FIG. 6, Hanako's classification result in the Hanako a column 61 is 1 point for classification A and 0 points for the other classifications.
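The manual column of the classification processing table amounts to a one-hot row over the classifications. A minimal sketch, in which the function name `manual_row` and the labels "A", "B", "C" are illustrative assumptions:

```python
def manual_row(classifications, chosen):
    """Return 1 point for the chosen classification and 0 for the others."""
    return {c: (1 if c == chosen else 0) for c in classifications}


hanako_a = manual_row(["A", "B", "C"], chosen="A")
print(hanako_a)
```

This raw row is what the evaluation function later reweights using the person's stored weights.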

[0022] Next, the CPU 111 checks whether a document vector Bt for the acquired target document T has already been created and stored in the document vector database 166 (step 14). If it has (step 14; Y), the CPU 111 reads the document vector Bt and stores it in the target document vector storage area 1133 of the RAM 113 (step 15). If the document vector Bt of the target document is not stored in the document vector database 166 (step 14; N), the CPU 111 creates the document vector Bt for the target document (step 16).
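Steps 14 to 16 follow a familiar look-up-then-compute pattern. The sketch below is hypothetical: `get_document_vector`, the dictionary standing in for the document vector database 166, and the stand-in `fake_make_vector` are assumptions for illustration. The description does not say whether a newly created vector is written back to the database, so this sketch does not do so.

```python
def get_document_vector(doc_id, text, vector_db, make_vector):
    """Return a stored vector if one exists, otherwise create one."""
    cached = vector_db.get(doc_id)
    if cached is not None:        # step 14: Y
        return cached             # step 15: use the stored vector
    return make_vector(text)      # step 14: N -> step 16: create the vector


def fake_make_vector(text):
    # Stand-in for the real vector creation routine; it only counts words.
    return {"words": len(text.split())}


vector_db = {"doc-1": {"politics": 21}}
hit = get_document_vector("doc-1", "ignored", vector_db, fake_make_vector)
miss = get_document_vector("doc-2", "two words", vector_db, fake_make_vector)
print(hit, miss)
```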

[0023] FIG. 5 is a flowchart showing the operation of the document vector creation processing. The CPU 111 performs morphological analysis to extract independent words from the target document T (step 131), extracts candidate words (phrases), including noun phrases and compound noun phrases, from the target document T, and stores them in a predetermined work area of the RAM 113 (step 132). It then determines the importance f(x) of each candidate word (phrase) from its frequency of appearance in the target document T and an evaluation function (step 133). Examples of the evaluation function include a weighting for predetermined important words, where such words have been designated in advance, and a weighting according to the type of candidate word (phrase), such as word, noun phrase, or compound noun phrase. The CPU 111 then determines the keywords a, b, ... of the target document T from the importance values f(x) (step 134). Finally, with the importance f(x) of each keyword as an element, it stores the document vector B = (f(a), f(b), ...) in the target document vector storage area 1133 of the RAM 113 (step 135) and returns to the document classification processing routine of FIG. 4.
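Steps 133 to 135 can be sketched as follows, assuming the candidate terms have already been extracted (the morphological analysis of steps 131 and 132 is outside the scope of this example). The function name, the "frequency times type weight" formula, the `top_k` cutoff, and the sample weights are all assumptions; the description only says that importance is derived from frequency and an evaluation function.

```python
from collections import Counter


def build_document_vector(candidates, type_weights, top_k=3):
    """Build a keyword->importance map from (term, kind) candidate pairs."""
    freq = Counter(term for term, _ in candidates)
    kind_of = dict(candidates)  # kind of each term ("word", "compound", ...)
    # Step 133: importance = frequency x weight for the term's kind (assumed).
    importance = {
        term: count * type_weights.get(kind_of[term], 1.0)
        for term, count in freq.items()
    }
    # Step 134: keep the top_k terms, breaking ties alphabetically.
    ranked = sorted(importance.items(), key=lambda kv: (-kv[1], kv[0]))
    # Step 135: the retained importances form the document vector.
    return dict(ranked[:top_k])


candidates = [
    ("document", "word"),
    ("document", "word"),
    ("document classification", "compound"),
    ("score", "word"),
]
vector = build_document_vector(
    candidates, {"word": 1.0, "compound": 2.0}, top_k=2
)
print(vector)
```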

[0024] Next, the CPU 111 calculates the similarity S between the target document T and each of the typical documents of classifications A, B, C, ... (step 17). Specifically, as shown in FIG. 7, the CPU 111 compares the document vector Bt of the target document with the document vector Bjk of each typical document, and calculates the similarity S between the two documents as the cosine of the angle between the two vectors. In general, when the angle between the document vector Bx of a document Ax and the document vector By of a document Ay is θ, the inner product of the two document vectors is Bx · By, and their magnitudes are |Bx| and |By|, the similarity S of the two document vectors is given by Equation 1 below.

[0025]

[Equation 1] Similarity S = cos(θ) = (Bx · By) / (|Bx| × |By|)

[0026] The similarity S takes a value in the range -1 ≤ S ≤ 1. The closer S is to 1, the closer the two document vectors are to parallel, and the more similar the two documents Ax and Ay can be considered to be.
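
Equation 1 can be sketched in Python as follows. The sketch assumes that each document vector is held as a dictionary mapping a keyword to its importance, with missing keywords treated as 0; that representation, the function name, and the handling of an all-zero vector are assumptions made for illustration.

```python
import math


def cosine_similarity(bx, by):
    """Return S = (Bx . By) / (|Bx| * |By|) for two sparse document vectors."""
    dot = sum(value * by.get(term, 0.0) for term, value in bx.items())
    norm_x = math.sqrt(sum(value * value for value in bx.values()))
    norm_y = math.sqrt(sum(value * value for value in by.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        # Assumption: an empty vector is treated as having no similarity.
        return 0.0
    return dot / (norm_x * norm_y)
```

Because importance values derived from frequencies are non-negative, the similarity obtained in practice would fall between 0 and 1, even though the cosine itself ranges from -1 to 1.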

[0027] Next, the CPU 111 normalizes the similarities S calculated for the typical documents of the classifications so that they sum to 1, and stores the normalized similarities as the automatic classification scores in the automatic b column 62 (FIG. 6) of the classification processing table area 1131 (step 18).
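
The normalization of step 18 amounts to dividing each similarity by the total. A minimal Python sketch follows; the function name and the treatment of a zero total are assumptions, since the description does not cover that case.

```python
def normalize_scores(similarities):
    """Scale per-classification similarities so that they sum to 1."""
    total = sum(similarities.values())
    if total == 0:
        # Assumption: with no similarity to any classification, all scores are 0.
        return {label: 0.0 for label in similarities}
    return {label: value / total for label, value in similarities.items()}
```

For example, similarities of 0.8, 0.4, and 0.4 for classifications A, B, and C would become scores of 0.5, 0.25, and 0.25.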

[0028] The CPU 111 then applies the evaluation functions to the manual and automatic classification scores (step 19). Specifically, from among the evaluation functions of the classifier Hanako, the CPU 111 reads from the evaluation function database 165 the evaluation function that applies when the target document has been classified into classification A (weight w = 0.5), multiplies the score of each classification in the Hanako a column 61 of the classification processing table of FIG. 6 by that weight, and stores the results in the Hanako c column 63. It also multiplies the score of each classification in the automatic b column 62 by (1 − w) = 0.5 and stores the results in the automatic d column 64.

[0029] The CPU 111 then obtains, for each classification, the sum (c + d) of the manual classification score after evaluation function processing (Hanako c column 63) and the automatic classification score after evaluation function processing (automatic d column 64), and finally determines the classification with the largest sum as the classification of the target document T (step 20; classification determining means). The CPU 111 processes the target document according to the classification purpose on the basis of the finally determined classification (step 21) and ends the processing. As an example of such processing, if the classification purpose is distribution, the target document is distributed to the users belonging to that classification.
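
Steps 19 and 20 can be sketched in Python as follows. The function name is hypothetical, and the manual scores used in the usage note (1 for the classification chosen by the classifier, 0 otherwise) are an assumption made for illustration; the actual values are those of column a in FIG. 6.

```python
def decide_classification(manual_scores, auto_scores, w):
    """Combine manual and automatic scores and pick the best classification.

    manual_scores: column a (manual score per classification)
    auto_scores:   column b (normalized similarity per classification)
    w:             weight read from the classifier's evaluation function
    """
    weighted_manual = {c: w * s for c, s in manual_scores.items()}        # column c
    weighted_auto = {c: (1.0 - w) * s for c, s in auto_scores.items()}    # column d
    labels = set(manual_scores) | set(auto_scores)
    totals = {c: weighted_manual.get(c, 0.0) + weighted_auto.get(c, 0.0)
              for c in labels}                                            # c + d
    # Iterating over sorted labels makes ties resolve deterministically (assumption).
    best = max(sorted(totals), key=totals.get)
    return best, totals
```

With manual scores of 1, 0, 0 and automatic scores of 0.2, 0.7, 0.1 for classifications A, B, C, a weight of w = 0.5 gives totals of 0.6, 0.35, 0.05 and selects A, whereas a lower weight such as w = 0.2 lets the automatic result dominate and selects B.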

[0030] As described above, according to the present embodiment, each classification is scored for manual classification from the results of manual classification by each classifier, and each classification is scored for automatic classification from the similarity between the document vector of the typical document characterizing that classification and the document vector of the target document. Manual classification and automatic classification can thereby be combined, and a more accurate classification result can be obtained.

[0031] The configuration of the present embodiment and the processing of searching for documents in other languages have been described above; however, the present invention is not limited to these embodiments, and various modifications are possible within the scope of the invention described in the claims. For example, the typical documents need not be selected in advance; ordinary documents stored in the document database 164 may be used as typical documents. Documents automatically extracted by clustering from the documents stored in the document database 164 may also be used as typical documents.

[0032] The described embodiment assumes that the typical documents and their typical document vectors are stored in the document database 164 and the document vector database 166, respectively, but both need not be present. That is, if a typical document vector exists for a typical document (that is, if it is stored in the document vector database 166), the similarity S with the target document T can be calculated, so the typical document itself is not necessarily required. Conversely, if a typical document characterizing each classification exists (that is, if it is stored in the document database 164), the typical document vector can be created by the document vector creation process shown in FIG. 5, so the typical document vector itself is not necessarily required.

[0033] The described embodiment placed no particular limit on the number of typical documents per classification. A classification need not have exactly one typical document; a plurality of typical documents may be prepared for one classification. In this case, the total or the average of the similarities is used as the similarity of the target document to each classification (either may be used, because normalization is performed). Using a plurality of typical documents per classification in this way characterizes each classification more accurately and improves the accuracy of the automatic classification.
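
A minimal Python sketch of this variation is shown below. It assumes sparse dictionary vectors and hypothetical function names, and lets the caller choose between the average and the total of the per-document similarities.

```python
import math


def cosine(bx, by):
    """Cosine similarity of two sparse vectors (0.0 if either is empty)."""
    dot = sum(v * by.get(k, 0.0) for k, v in bx.items())
    nx = math.sqrt(sum(v * v for v in bx.values()))
    ny = math.sqrt(sum(v * v for v in by.values()))
    return dot / (nx * ny) if nx and ny else 0.0


def class_similarities(target, typical_vectors, aggregate="mean"):
    """Aggregate similarities over several typical documents per classification.

    typical_vectors: {classification: [typical document vector, ...]}
    aggregate: "mean" for the average, anything else for the total.
    """
    result = {}
    for label, vectors in typical_vectors.items():
        sims = [cosine(target, vec) for vec in vectors]
        if not sims:
            result[label] = 0.0          # assumption: no typical document, no score
            continue
        total = sum(sims)
        result[label] = total / len(sims) if aggregate == "mean" else total
    return result
```

Note that the total and the average lead to identical normalized scores when every classification has the same number of typical documents; when the counts differ, the total favors classifications with more typical documents, so the average may be the safer choice in that case.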

[0034] When the final classification result differs from the classification result given by the classifier, learning may be performed by changing the weighting of the evaluation function. It is also possible to configure a document classification device by using [...]. A weighting (an evaluation function for automatic classification) may also be defined for the automatic classification result (for example, the normalized similarity values obtained in step 18), in the same way as for manual classification. This weighting may likewise be changed by learning.
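
The description leaves the learning rule open. One possible reading, sketched below in Python, recomputes the weight for a classifier and classification as the fraction of that classifier's past manual results that matched the final classification, in line with the abstract's statement that the evaluation function is obtained from past correct-answer rates. The function name, the history structure, and the update rule itself are assumptions, not the method defined by the patent.

```python
def update_weight(history, classifier, manual_label, final_label):
    """Update and return the weight w for (classifier, manual_label).

    history maps (classifier, classification) to (correct, total) counts.
    The weight is assumed to be the running correct-answer rate.
    """
    key = (classifier, manual_label)
    correct, total = history.get(key, (0, 0))
    total += 1
    if manual_label == final_label:
        correct += 1                     # manual result agreed with the final result
    history[key] = (correct, total)
    return correct / total
```

Under this rule, a classifier whose choice of classification A is overturned by the final result sees the weight for A fall, so the automatic score carries more influence the next time that classifier chooses A.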

[0035] Although the described embodiment did not specifically mention the language of the target document, the present invention is not limited to Japanese and can be applied to target documents in any language. In that case, only parts that do not affect the configuration of the present invention need be changed, such as using a morphological analysis algorithm for the language of the target document. The typical documents must, however, be in the same language as the target document.

[0036] The devices, units, operations, processes, and the like described in the above embodiment may each be expressed as means ("... means") that generalize them, and the embodiment may be configured from such means. For example, the statement "the CPU 111 compares, as shown in FIG. 7, the document vector Bt of the target document with the document vector Bjk of the typical document and calculates the similarity S between the two documents as the cosine of the angle between the two vectors" may be embodied as a "similarity calculating means". The embodiment may likewise be configured with generalized "... (operation) means" for the various other operations.

[0037]

[Effects of the Invention] According to the present invention, manual classification and automatic classification are both performed within the same set of a plurality of classifications, and the final classification of the target document is determined using both results. A more accurate classification of the target document can therefore be obtained by using the results of both manual and automatic classification.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of a document classification device according to one embodiment of the present invention.

FIG. 2 is an explanatory diagram conceptually showing the contents of the evaluation function database in the embodiment.

FIG. 3 is an explanatory diagram conceptually showing the contents of the document vector database in the embodiment.

FIG. 4 is a flowchart showing the main operation of the document classification process in the embodiment.

FIG. 5 is a flowchart showing the operation of the document vector creation process within the document classification process of the embodiment.

FIG. 6 is an explanatory diagram showing the processing in the classification processing table up to the final determination of the classification in the embodiment.

FIG. 7 is an explanatory diagram showing, by means of document vectors, the similarity relationship between the target document and the typical documents in the embodiment.

FIG. 8 is a claim correspondence diagram for the invention described in claim 1.

FIG. 9 is a claim correspondence diagram for the invention described in claim 2.

FIG. 10 is a claim correspondence diagram for the invention described in claim 3.

FIG. 11 is a claim correspondence diagram for the invention described in claim 4.

FIG. 12 is a claim correspondence diagram for the invention described in claim 5.

[Explanation of symbols]

11 Control unit
112 ROM
113 RAM
1131 Classification processing table
1132 Target document storage area
1133 Target document vector storage area
1134 Typical document vector storage area
1135 Similarity storage area
12 Keyboard
13 Mouse
14 Display device
15 Printing device
16 Storage device
161 Kana-kanji conversion dictionary
162 Program storage unit
163 Data storage unit
164 Document database
165 Evaluation function database
166 Document vector database
17 Storage medium drive device
18 Communication control device
19 Input/output I/F
20 Character recognition device

Claims

[Claims]

1. A document classification device comprising: manual classification result obtaining means for obtaining a manual classification result obtained by manually classifying a target document within a predetermined set of a plurality of classifications; target document obtaining means for obtaining the target document; automatic classification result obtaining means for automatically classifying the target document obtained by the target document obtaining means within the range of the set of the plurality of classifications and obtaining an automatic classification result; and classification determining means for finally determining the classification of the target document from the manual classification result and the automatic classification result.

2. The document classification device according to claim 1, wherein the automatic classification result obtaining means includes: document vector obtaining means for obtaining a target document vector characterizing the target document; typical document vector obtaining means for obtaining a typical document vector of a typical document characterizing each of the classifications; and similarity calculating means for calculating a similarity between the target document vector and each of the typical document vectors to obtain a similarity for each classification, and wherein the similarity for each classification obtained by the similarity calculating means is used as the classification result.

3. A storage medium storing a computer-readable document classification program for causing a computer to realize: a manual classification result acquisition function for acquiring a manual classification result obtained by manually classifying a target document within a predetermined set of a plurality of classifications; a target document acquisition function for acquiring the target document; an automatic classification result acquisition function for automatically classifying the target document acquired by the target document acquisition function within the range of the set of the plurality of classifications and obtaining an automatic classification result; and a classification determination function for finally determining the classification of the target document from the manual classification result and the automatic classification result.

4. The storage medium storing the document classification program according to claim 3, wherein the automatic classification result acquisition function includes: a document vector acquisition function for acquiring a target document vector characterizing the target document; a typical document vector acquisition function for acquiring a typical document vector of a typical document characterizing each of the classifications; and a similarity calculation function for calculating a similarity between the target document vector and each of the typical document vectors to obtain a similarity for each classification, and wherein the similarity for each classification obtained by the similarity calculation function is used as the classification result.

5. A document classification method, wherein a target document is automatically classified within a predetermined set of a plurality of classifications, the target document is manually classified within the range of the set of the plurality of classifications, and a classification for the target document is finally determined from the result of the automatic classification and the result of the manual classification.