JP2006099478A

JP2006099478A - Document classification device and document classification method

Info

Publication number: JP2006099478A
Application number: JP2004285367A
Authority: JP
Inventors: Tsutomu Kobayashi; 勉小林; Yoshihisa Otake; 能久大嶽; Takeshi Matsukuma; 剛松隈; Hiroshi Yamazaki; 弘山崎
Original assignee: Toshiba Corp; Toshiba Solutions Corp
Current assignee: Toshiba Corp; Toshiba Digital Solutions Corp
Priority date: 2004-09-29
Filing date: 2004-09-29
Publication date: 2006-04-13

Abstract

PROBLEM TO BE SOLVED: To provide a document classification device capable of interactively adjusting a classification result in a document classification system using batch processing.

SOLUTION: This device comprises a comparison object document information storage means storing comparison object document information obtained by relating information of a comparison object document with the field of the comparison object document; a word weight information storage part storing words and the word weights thereof; a batch processing control part comparing the comparison object document information with a classification key document to extract a commonly used word which is commonly used in the classification key document and the comparison object document, and generating common word information obtained by relating the commonly used word with the use frequency of the commonly used word and the word weight of the commonly used word read from the word weight information storage part; and an interactive processing control part determining the similarities of a plurality of comparison object documents with the classification key document from the common word information, specifying a field based on a comparison object document determined to have high similarity, and adjusting the specified field based on an instruction from an input device.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a document classification device and a document classification method for classifying documents by field.

Conventionally, there are document classification systems that identify the field to which an input document belongs based on information on a plurality of documents stored in advance in a database. Such a system first extracts, from a plurality of comparison target documents whose fields have been specified in advance and which are stored in the database, the comparison target documents similar to the input document to be classified. It then specifies the field to which the input document belongs based on the fields associated in advance with the extracted comparison target documents (see, for example, Patent Document 1).

Further, to make classification more efficient, document classification systems generally classify a large number of input documents at once. Such systems also often calculate the similarity between each input document and a large number of comparison target documents. Furthermore, in a document classification system that requires high accuracy, the results of classification by the computer are checked manually.
Patent Document 1: JP 2001-155025 A

A conventional document classification system that classifies a large number of input documents in a batch, as described above, can process many documents efficiently at one time, but the classification result is fixed at the point when the batch classification is performed. As a result, interactivity may be sacrificed.

Specifically, when the operator checks the result of the classification process and finds that it is not the desired result, there has been no way to perform the interactive operation of adjusting the classification parameters and running the classification again.

In addition, because interactive operation is difficult to realize in a document classification system that uses batch processing, there has also been no system that presents the user with the information on which a classification is based, such as the characteristic words that strongly influenced the classification result.

The present invention has been made to solve the above problems, and its object is to provide a document classification device and a document classification method that allow classification results to be adjusted interactively in a document classification scheme that uses batch processing.

A document classification device according to a feature of the present invention classifies a classification key document, that is, a document whose field is to be determined. The device includes: a comparison target document information storage unit that stores comparison target document information, in which information on each comparison target document to be compared with the classification key document is associated with the field of that comparison target document; a word weight information storage unit that stores words together with word weights, each weight being an index of how strongly the word characterizes the field of a document containing it; a batch processing control unit that compares the classification key document with the comparison target document information, extracts the commonly used words, which are the words used in both the classification key document and a comparison target document, and generates common word information associating at least each commonly used word, its number of uses, and its word weight read from the word weight information storage unit; and a dialogue processing control unit that obtains, from the common word information, the similarity between each of a plurality of comparison target documents and the classification key document, specifies a field based on the comparison target documents with high similarity, and adjusts the specified field based on an instruction from an input device.

According to the present invention configured as above, a document classification device that interactively adjusts classification results can be provided.

According to the present invention, classification results can be adjusted interactively in a document classification scheme that uses batch processing.

[First embodiment]
The similar document classification device 1 according to the first embodiment of the present invention is described below with reference to the drawings.

[Similar document classification device]
FIG. 1 is a block diagram of the similar document classification device 1 according to the first embodiment of the present invention.

The similar document classification device 1 shown in FIG. 1 includes a comparison target document information storage unit 11, a word weight information storage unit 12, a batch processing control unit 13, and a dialogue processing control unit 14.

The comparison target document information storage unit 11 stores comparison target document information, in which information on each comparison target document, to be compared with a classification key document whose field is to be classified, is associated with the field of that comparison target document.

The word weight information storage unit 12 stores words together with word weights, each weight being an index of how strongly the word characterizes the field of a document containing it.

The batch processing control unit 13 compares the classification key document with the comparison target document information and extracts the commonly used words, which are the words used in both the classification key document and a comparison target document. It then generates common word information that associates at least each commonly used word, its number of uses, and its word weight read from the word weight information storage unit 12.

The dialogue processing control unit 14 obtains, from the common word information, the similarity between the classification key document and each of a plurality of comparison target documents, and obtains a classification result in which a field is specified based on the fields of the comparison target documents with high similarity. It further includes adjustment means for adjusting the classification result.

FIG. 2 shows an example of the comparison target document information stored in the comparison target document information storage unit 11. A comparison target document is a document to be compared with a classification key document whose field is to be classified. The comparison target document information is generated from the comparison target documents.

Specifically, the comparison target document information shown in FIG. 2 includes information such as the “title”, “field”, “used words”, and “number of uses” for each of a plurality of comparison target documents. In the example of FIG. 2, comparison target document 1 has the title “Reduction of database update processing time” and the field “database update”. The words used in comparison target document 1 and their numbers of uses are: “large scale”, 2; “database”, 5; “update processing”, 8; “time”, 3; and “shortening”, 2.
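As a rough illustration of the record described for FIG. 2, the following Python sketch models one entry of the comparison target document information. The class name `ComparisonTargetDocument` and its attribute names are hypothetical and do not appear in the specification; only the title, field, and word counts are taken from the FIG. 2 example.

```python
from dataclasses import dataclass, field


@dataclass
class ComparisonTargetDocument:
    """One entry of comparison target document information (hypothetical layout)."""
    title: str
    field_name: str
    # Maps each used word to the number of times it is used in the document.
    word_counts: dict = field(default_factory=dict)


# Values follow the FIG. 2 example for comparison target document 1.
doc_1 = ComparisonTargetDocument(
    title="Reduction of database update processing time",
    field_name="database update",
    word_counts={
        "large scale": 2,
        "database": 5,
        "update processing": 8,
        "time": 3,
        "shortening": 2,
    },
)
```

Keeping the word counts as a plain mapping makes the later lookup of words shared with a classification key document a simple key intersection.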

FIG. 3 shows an example of the word weight information stored in the word weight information storage unit 12. In the word weight information shown in FIG. 3, for example, the word weight of “automatic classification” is “8.5” and the word weight of “database” is “4.3”.

For this word weight, for example, the reciprocal of the frequency with which the word is used across all comparison target documents in the comparison target document information storage unit 11 is used. This is because a frequently used word is considered a general word that does not characterize a document; conversely, a word that is used rarely is considered more characteristic. In this embodiment, word weight information such as that shown in FIG. 3 is assumed to have been created and stored in advance as the word weight information used for classification.
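The reciprocal-of-frequency idea can be sketched as follows. This is a minimal sketch under stated assumptions: the text does not say how the “frequency of use” is normalized or scaled (the FIG. 3 values such as 8.5 and 4.3 are clearly not raw reciprocals), so the function below simply uses one divided by the total use count across all documents. The function name `inverse_frequency_weights` and the sample counts are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(documents):
    """Map each word to 1 / (total uses across all documents).

    `documents` is a list of {word: use_count} dicts (assumed layout).
    """
    totals = Counter()
    for counts in documents:
        totals.update(counts)  # adds the per-document counts word by word
    return {word: 1.0 / total for word, total in totals.items()}


# Hypothetical use counts for two comparison target documents.
docs = [
    {"database": 5, "time": 3},
    {"database": 3, "automatic classification": 1},
]
weights = inverse_frequency_weights(docs)
```

With these sample counts the rarely used word “automatic classification” receives a larger weight than the frequently used word “database”, which matches the reasoning in the text.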

As shown in FIG. 4, in the similar document classification device 1 according to the best embodiment of the present invention, a central processing control device 101, a ROM (Read Only Memory) 102, a RAM (Random Access Memory) 103, and an input/output interface 109 are connected via a bus 110. An input device 104, a display device 105, a communication control device 106, a storage device 107, and a removable disk 108 are connected to the input/output interface 109.

The central processing control device 101 reads from the ROM 102 and executes a boot program for starting the similar document classification device 1 in response to an input signal from the input device 104, and then reads the operating system stored in the storage device 107. The central processing control device 101 is also a processing device that controls the various devices based on input signals from the input device 104, the communication control device 106, and the like, loads programs and data stored in the storage device 107 and elsewhere into the RAM 103, and carries out the series of processes described later, such as calculating or processing data, based on the commands of the program read from the RAM 103.

The input device 104 consists of input devices such as a keyboard and a mouse with which the operator enters various operations. It generates an input signal based on the operator's operation and sends it to the central processing control device 101 via the input/output interface 109 and the bus 110. The display device 105 is a CRT (Cathode Ray Tube) display, a liquid crystal display, or the like. It receives from the central processing control device 101, via the bus 110 and the input/output interface 109, the output signal to be displayed, and displays, for example, the processing results of the central processing control device 101. The communication control device 106 is a device such as a LAN card or a modem that connects the similar document classification device 1 to a communication network such as the Internet or a LAN. Data exchanged with the communication network through the communication control device 106 is passed to and from the central processing control device 101 as input or output signals via the input/output interface 109 and the bus 110.

The storage device 107 is a semiconductor storage device, a magnetic disk device, or the like, and stores the programs executed by the central processing control device 101 and their data. The removable disk 108 is an optical disk or a flexible disk, and the signals read and written by its disk drive are passed to and from the central processing control device 101 via the input/output interface 109 and the bus 110. The storage device 107 of the similar document classification device 1 according to the embodiment of the present invention stores a similar document classification program and holds the comparison target document information storage unit 11 and the word weight information storage unit 12. The batch processing control unit 13 and the dialogue processing control unit 14 are implemented when this similar document classification program is read into and executed by the central processing control device 101 of the similar document classification device 1.

The similar document classification device 1 according to the best embodiment of the present invention may be realized by a single computer or by a plurality of computers that can communicate with one another. For example, the configuration for batch processing and the configuration for dialogue processing may reside on the same computer system, or on separate computer systems connected via a network or the like. Likewise, the batch processing control unit 13 and the dialogue processing control unit 14 may each be realized by a single computer or by a plurality of computers.

As shown in FIG. 5, the batch processing control unit 13 of the similar document classification device 1 according to the embodiment of the present invention includes a control unit 200 and a memory unit 250.

The control unit 200 includes an initialization unit 201, an input unit 202, a word weight reading unit 203, a classification key document information generation unit 204, a comparison target document information reading unit 205, a common word information generation unit 206, and a data output unit 207.

The memory unit 250 includes a word weight information buffer unit 251, a classification key document information buffer unit 252, a comparison target document information buffer unit 253, and a common word information buffer unit 254.

The word weight information buffer unit 251 stores word weight information in which each used word, that is, each word used in the comparison target documents, is associated with its weight.

The classification key document information buffer unit 252 stores the classification key document information generated from a classification key document to be classified.

The comparison target document information buffer unit 253 stores the comparison target document information to be compared with the classification key document information.

The common word information buffer unit 254 stores common word information that associates each commonly used word, which is a word used in both the classification key document and a comparison target document, with the number of times it is used in the documents and with its word weight.

The initialization unit 201 initializes the buffer units 251 to 254 of the memory unit 250.

The input unit 202 receives classification key documents and operation instructions from the input device.

The word weight reading unit 203 reads the word weight information from the word weight information storage unit 12 into the word weight information buffer unit 251.

The classification key document information generation unit 204 breaks an input classification key document down into words. It then generates classification key document information containing each word and the number of times that word is used, and stores it in the classification key document information buffer unit 252.
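The word-splitting and counting step can be illustrated with a short Python sketch. The specification does not name a segmentation method; Japanese text would normally require a morphological analyzer, so the regular-expression tokenizer below is only a stand-in that works for space-delimited text. The function name `build_key_document_info` and the sample sentence are hypothetical.

```python
import re
from collections import Counter


def build_key_document_info(text):
    """Split text into words and count how often each word is used.

    The regex tokenizer is an assumption made for illustration only.
    """
    words = re.findall(r"\w+", text.lower())
    return dict(Counter(words))


# Hypothetical classification key document text.
info = build_key_document_info(
    "Database update time: the database update is slow."
)
```

The resulting mapping from each used word to its number of uses has the same shape as the classification key document information described above.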

The comparison target document information reading unit 205 reads the comparison target document information from the comparison target document information storage unit 11 into the comparison target document information buffer unit 253.

The common word information generation unit 206 reads the classification key document information stored in the classification key document information buffer unit 252 and the comparison target document information stored in the comparison target document information buffer unit 253. It then extracts the commonly used words that appear in both the classification key document and the comparison target document, generates common word information that combines each commonly used word with its number of uses in the documents and its word weight, and stores the result in the common word information buffer unit 254.

The data output unit 207 outputs the common word information stored in the common word information buffer unit 254 to the dialogue processing control unit 14.

As shown in FIG. 6, the dialogue processing control unit 14 of the similar document classification device 1 according to the embodiment of the present invention includes a control unit 300 and a memory unit 350.

The control unit 300 includes an initialization unit 301, an input unit 302, a data input unit 303, a comparison target document similarity calculation unit 304, a field-specific similarity accumulation unit 305, a field specifying unit 306, and a word weight adjustment unit 307.

The memory unit 350 includes a common word information buffer unit 351, a comparison target document similarity buffer unit 352, and a field-specific similarity accumulated value buffer unit 353.

The common word information buffer unit 351 stores the common word information.

The comparison target document similarity buffer unit 352 stores, as the comparison target document similarity, the similarity obtained for the classification key document against each comparison target document using the commonly used words.

The field-specific similarity accumulated value buffer unit 353 stores the field-specific similarity accumulated values, which are the comparison target document similarities summed for each field to which the comparison target documents belong.

The initialization unit 301 initializes the buffer units 351 to 353 of the memory unit 350.

The input unit 302 receives operation instructions from the input device.

The data input unit 303 receives the common word information output from the data output unit 207 of the batch processing control unit 13 and stores it in the common word information buffer unit 351.

The comparison target document similarity calculation unit 304 reads the common word information stored in the common word information buffer unit 351 and calculates the comparison target document similarity, which is the similarity for each comparison target document in the common word information. It then stores the calculated comparison target document similarity in the comparison target document similarity buffer unit 352.

In this embodiment, the comparison target document similarity is described using an example in which the sum of the numbers of occurrences of the words used in both the classification key document and the comparison target document, multiplied by the word weights, is taken as the similarity. The calculation is not limited to this method, and other calculation methods may be used.
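One possible reading of this similarity can be sketched in Python as follows. The wording leaves open whether the weight is applied per word or to the overall sum; this sketch assumes per word, so the similarity is the sum over the commonly used words of (combined number of uses × word weight). The function name `document_similarity` is hypothetical, and of the sample weights only the 4.3 for “database” comes from the FIG. 3 example; the others are made up.

```python
def document_similarity(common_info):
    """Sum of (combined use count x word weight) over the shared words.

    `common_info` maps each commonly used word to a
    (combined_use_count, word_weight) pair (assumed layout).
    """
    return sum(count * weight for count, weight in common_info.values())


# Use counts follow the FIG. 11 example; weights other than 4.3 are invented.
common_info = {
    "large scale": (5, 2.0),
    "database": (11, 4.3),
    "time": (5, 1.5),
}
similarity = document_similarity(common_info)
```

Because the function works only on the common word information, it can be rerun cheaply after a word weight is changed, without repeating the comparison against the stored documents.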

The field-specific similarity accumulation unit 305 reads the comparison target document similarities stored in the comparison target document similarity buffer unit 352 and accumulates them for each field. The accumulated value is the field-specific similarity accumulated value. The field-specific similarity accumulation unit 305 then stores the accumulated value obtained for each field in the field-specific similarity accumulated value buffer unit 353.
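The per-field accumulation can be sketched as below. The function names, the document identifiers, and the sample similarity values are all hypothetical. Choosing the field with the largest accumulated value, as `top_field` does, is an assumption made for illustration; the specification states only that the field is specified based on comparison target documents with high similarity.

```python
from collections import defaultdict


def accumulate_by_field(doc_similarities, doc_fields):
    """Sum the comparison target document similarities for each field."""
    totals = defaultdict(float)
    for doc_id, similarity in doc_similarities.items():
        totals[doc_fields[doc_id]] += similarity
    return dict(totals)


def top_field(field_totals):
    """Return the field with the largest accumulated similarity (assumed rule)."""
    return max(field_totals, key=field_totals.get)


# Hypothetical fields and similarities for three comparison target documents.
doc_fields = {1: "database update", 2: "database update", 3: "document classification"}
doc_similarities = {1: 64.8, 2: 12.0, 3: 70.5}

field_totals = accumulate_by_field(doc_similarities, doc_fields)
best_field = top_field(field_totals)
```

In this sample, document 3 has the highest individual similarity, but the two documents in “database update” together outweigh it, which shows why the totals are kept per field rather than per document.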

The field specifying unit 306 associates the accumulated similarity values stored in the field-specific similarity accumulated value buffer unit 353 with the underlying similarity calculation results stored in the comparison target document similarity buffer unit 352, and outputs them to an output device, such as a display device, connected to the similar document classification device 1.

The word weight adjustment unit 307 stores in the comparison target document similarity buffer unit 352 the values that the user enters, from a connected input device such as a keyboard, to adjust the word weights.

Next, the processing of the similar document classification device 1 according to the embodiment of the present invention is described with reference to FIGS. 7 and 8. FIG. 7 is a flowchart of the processing in the batch processing control unit 13, and FIG. 8 is a flowchart of the processing in the dialogue processing control unit 14.

In this embodiment, the classification process is divided into two processes: batch processing, which is the first process, and dialogue processing, which is the second.

[Batch processing]
Batch processing compares each classification key document, whose field is to be classified, with the plural pieces of comparison target document information stored in the comparison target document information storage unit 11.

First, as shown in the flowchart of FIG. 7, the initialization unit 201 initializes the buffer units 251 to 254 of the memory unit 250 (S001). The word weight reading unit 203 then reads the word weight information from the word weight information storage unit 12 into the word weight information buffer unit 251 (S002).

Next, when a classification key document is input, the classification key document information generation unit 204 breaks it down into words. It then generates classification key document information containing each word and the number of times each word is used, and stores the generated information in the classification key document information buffer unit 252 (S003).

FIG. 9 shows an example of an input classification key document. It shows classification key document 1, with the latter half omitted. A plurality of classification key documents are input in this way.

FIG. 10 shows an example of the classification key document information generated from classification key document 1. For example, the classification key document information associates each “used word” in the classification key document with its “number of uses”, that is, the number of times the word is used in that classification key document.

The process of step S003 is performed for every target classification key document (S004). For example, when 2000 documents are input as classification key documents, step S003 is repeated 2000 times.

Next, the comparison target document information reading unit 205 reads the comparison target document information from the comparison target document information storage unit 11 into the comparison target document information buffer unit 253 (S005).

続いて、共通単語情報生成部２０６は、分類キー文書と比較対象文書で共通して使用されている単語を抽出するとともに、抽出された共通使用単語について、分類キー文書および比較対象文書で使用されている回数の合計値とを合わせて共通単語情報を生成し、検索対象文書別に共通単語情報バッファ部２５４に記憶する（Ｓ００６）。 Subsequently, the common word information generation unit 206 extracts words commonly used in the classification key document and the comparison target document, and uses the extracted common use words in the classification key document and the comparison target document. The common word information is generated together with the total number of times of the search, and stored in the common word information buffer unit 254 for each search target document (S006).

FIG. 11 shows an example of the common word information. For each combination of a classification key document and a comparison target document, the common word information stores, in association with one another, the "field name" of the comparison target document, each "used word", the "use count" of that word in the classification key document and the comparison target document, and the "word weight" of that word. In the example of FIG. 11, when classification key document 1 is compared with comparison target document 1, whose title is "Reduction of database update processing time", the used word "large-scale" has a use count of 5, "database" has a use count of 11, and "time" has a use count of 5.
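The generation of common word information described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the device's actual implementation: the function name, the dictionary layout, the `default_weight` fallback, and the split of each total use count between the two documents are all hypothetical. Only the totals (5, 11, 5) and the weights (2.1, 4.3, 1.7) come from the worked example for comparison target document 1.

```python
def build_common_word_info(key_counts, target_counts, word_weights, default_weight=1.0):
    """Return {word: (total use count, word weight)} for words used in both documents."""
    common_words = key_counts.keys() & target_counts.keys()
    return {
        word: (
            key_counts[word] + target_counts[word],      # total over both documents
            word_weights.get(word, default_weight),      # fallback weight is an assumption
        )
        for word in sorted(common_words)
    }


# Hypothetical per-document counts, chosen so the totals match the example (5, 11, 5).
key_counts = {"large-scale": 2, "database": 4, "time": 1, "interactive": 3}
target_counts = {"large-scale": 3, "database": 7, "time": 4, "update": 2}
word_weights = {"large-scale": 2.1, "database": 4.3, "time": 1.7}

common_word_info = build_common_word_info(key_counts, target_counts, word_weights)
for word, (use_count, weight) in common_word_info.items():
    print(word, use_count, weight)
```

Words that appear in only one of the two documents ("interactive" and "update" here) are dropped, since only commonly used words contribute to the later similarity calculation.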

The processing of the common word information generation unit 206 in step S006 outputs the common word information, which is the intermediate information that the dialogue processing control unit 14 requires when it performs the dialogue processing, the second process.

In the present embodiment, the common word information is obtained using a value that combines the use count and the word weight on the basis of the common words. However, the similarity can also be calculated using a vector space method instead.
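The text does not say which vector space formulation is meant. One common choice is cosine similarity between word-count vectors, sketched below as an assumption rather than as the method of this embodiment; the function name and sample counts are hypothetical.

```python
import math


def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two word-count vectors given as dicts."""
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical word counts for a classification key document and a comparison target document.
key_counts = {"large-scale": 2, "database": 4, "time": 1}
target_counts = {"large-scale": 3, "database": 7, "update": 2}
print(round(cosine_similarity(key_counts, target_counts), 3))
```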

The data output unit 207 transmits the common word information extracted in S006 to the dialogue processing control unit 14 (S007).
In the present embodiment, it is assumed that the common word information for all of the comparison target document information is transmitted. The batch processing control unit 13 may send the data to the dialogue processing control unit 14 over a network, or via offline media such as magnetic tape or DVD-R.

Steps S001 to S007 above constitute the common word information generation process using batch processing, which is the first process.

As described above, the batch processing carries out everything up to the generation of the common word information collectively for a plurality of classification key documents, and can therefore process them efficiently.

In this case, when the comparison target documents are enormous in number and the database is large, the processing also yields many results. Some results share few words with the comparison target document, and others share many words that are all general words with no distinguishing features.

Therefore, for the dialogue processing, the second process that follows the batch processing, results in which the number of common words falls below a set threshold, or which contain no word whose weight exceeds a set threshold, are filtered by conditions such as whether the value obtained by calculating the similarity meets a threshold. Selecting and transmitting only the results with high similarity after this filtering reduces the transmission load.
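The filtering conditions are only outlined above, so the following is one possible reading rather than the defined behavior: a result is kept only if it has enough common words, contains at least one sufficiently heavy word, and reaches a similarity threshold. The function name, the three threshold parameters, and the way the conditions are combined are all assumptions.

```python
def passes_filter(common_word_info, min_common_words, min_top_weight, min_similarity):
    """Decide whether one key-document/target-document result is worth transmitting.

    common_word_info maps each common word to a (use count, word weight) pair.
    """
    # Too few common words: drop the result.
    if len(common_word_info) < min_common_words:
        return False
    # No word heavier than the weight threshold: only general words are shared.
    if all(weight <= min_top_weight for _count, weight in common_word_info.values()):
        return False
    # Similarity as the sum of (use count x word weight) over the common words.
    similarity = sum(count * weight for count, weight in common_word_info.values())
    return similarity >= min_similarity


result = {"large-scale": (5, 2.1), "database": (11, 4.3), "time": (5, 1.7)}
print(passes_filter(result, min_common_words=2, min_top_weight=3.0, min_similarity=50.0))
```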

[Dialogue processing]
Next, the dialogue processing performed by the dialogue processing control unit 14, which is the second process, will be described.

First, the initialization unit 301 initializes the buffer units 351 to 353 of the memory unit 350 (S101).

Next, the data input unit 303 receives the common word information transmitted from the batch processing control unit 13 and stores it in the common word information buffer unit 351 (S102). Here, the common word information stored in the common word information buffer unit 351 is assumed to be identical to the common word information of FIG. 11 generated by the batch processing.

Subsequently, the comparison target document similarity calculation unit 304 reads the common word information stored in the common word information buffer unit 351, calculates the similarity, and stores the calculated similarity in the comparison target document similarity buffer unit 352 as the comparison target document similarity (S103). To calculate the comparison target document similarity, the product of the use count and the word weight is first calculated for each commonly used word. The sum of these products over the commonly used words is then taken, for each classification key document, as the similarity of each comparison target document.

For example, for "comparison target document 1" shown in FIG. 11, the comparison target document similarity is 5 × 2.1 + 11 × 4.3 + 5 × 1.7 = 66.3.
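The calculation above is a sum of products, which the following short sketch reproduces for comparison target document 1. The function name and the dictionary layout, which maps each word to a (use count, word weight) pair, are illustrative assumptions; the numbers are those of the example.

```python
def document_similarity(common_word_info):
    """Sum of (use count x word weight) over all commonly used words."""
    return sum(count * weight for count, weight in common_word_info.values())


# Use counts and word weights from the example for comparison target document 1.
doc1 = {"large-scale": (5, 2.1), "database": (11, 4.3), "time": (5, 1.7)}
print(f"{document_similarity(doc1):.1f}")  # 66.3
```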

FIG. 12 shows an example of the comparison target document similarities stored in the comparison target document similarity buffer unit 352.

When the processes of steps S102 and S103 have been performed for all comparison target documents (S104), the field-specific similarity accumulation unit 305 accumulates, field by field, the comparison target document similarities calculated for each comparison target document in step S103, and stores the totals in the field-specific similarity integrated value buffer unit 353 as field-specific similarity integrated values (S105).
When the similarity calculation result is in the state shown in FIG. 12, the similarity 66.3 of "Reduction of database update processing time" and the similarity 43.5 of "XML document database" are first added to the field "database update". The similarities of the subsequent documents classified under "database update" are then added to give the field-specific similarity integrated value.
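The per-field accumulation can be sketched as a simple grouped sum. In this minimal illustration the function name and row layout are assumptions, and the third row (field, title, and value) is hypothetical; only the two "database update" rows use values from the example.

```python
from collections import defaultdict


def accumulate_by_field(document_similarities):
    """Sum comparison target document similarities per field.

    document_similarities is an iterable of (field name, title, similarity) rows.
    """
    totals = defaultdict(float)
    for field, _title, similarity in document_similarities:
        totals[field] += similarity
    return dict(totals)


rows = [
    ("database update", "Reduction of database update processing time", 66.3),
    ("database update", "XML document database", 43.5),
    ("document search", "(hypothetical title)", 20.0),  # made-up row for illustration
]
field_totals = accumulate_by_field(rows)
print(f"{field_totals['database update']:.1f}")  # 109.8
```

The value printed for "database update" covers only the two documents named in the example; in the described process, further documents in the same field would be added to it.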

FIG. 13 shows an example of the field-specific similarity integrated values obtained in step S105 after all comparison target documents have been processed.

Next, for the contents of the comparison target document similarity buffer unit 352, the field specifying unit 306 displays the field identification results on the display device, which is a connected output device, in descending order of field-specific similarity integrated value. Each field identification result contains the "field name" and the "title" and "similarity" of the comparison target documents belonging to that field. The field specifying unit 306 also displays the word weights used in calculating the similarity to each comparison target document (S106). At this time, the word weight of each word is displayed in an editable state.

FIG. 14 shows an example of the field identification result displayed in step S106. FIG. 15 shows an example of the word weight adjustment screen displayed in step S106. In the example of FIG. 15, the word weights are sorted and displayed in descending order of weight.
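Both displays rely on a descending sort, of fields by integrated similarity and of words by weight. A minimal sketch follows; the helper name is an assumption, and the field totals are hypothetical values rather than the contents of FIG. 13 or FIG. 14.

```python
def sort_descending(mapping):
    """Return (name, value) pairs ordered from the largest value to the smallest."""
    return sorted(mapping.items(), key=lambda item: item[1], reverse=True)


# Hypothetical field-specific similarity integrated values.
field_totals = {"document search": 95.2, "database update": 109.8, "document classification": 60.1}
# Word weights taken from the similarity example for comparison target document 1.
word_weights = {"large-scale": 2.1, "database": 4.3, "time": 1.7}

ranked_fields = sort_descending(field_totals)
ranked_weights = sort_descending(word_weights)
for field, total in ranked_fields:
    print(field, total)
```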

Because steps S102 to S106 described above refer to only a limited amount of data, they can be processed in far less time than the extraction of common words in the batch processing.

Subsequently, the word weight adjustment unit 307 accepts adjusted values for the word weights displayed in step S106 (S107).
Specifically, the user refers to the contents of the displayed classification key document and the classification result. If the user judges the classification result to be incorrect, the user can refer to the words that influenced the classification and their word weights, adjust the word weights, and obtain the classification result again.

More specifically, among the displayed word weights, a weight can be adjusted when a word carries a high weight even though it does not characterize the field of the classification key document, or conversely carries a low weight even though it does characterize the field.

For example, suppose the user judges that "database update" and "document search" are not appropriate fields for the classification key document shown in FIG. 9, and that the cause is the high word weights of "interactive" and "database". In this case, the word weights of "interactive" and "database" can be reset to a lower value, for example 1.0.

Subsequently, when the word weight adjustment unit 307 determines that a word weight has been changed (S108), it rewrites the common word information buffer unit 351 with the newly input word weights (S109). The field specifying process is then re-executed from step S103.

FIG. 16 shows an example of the similarities of the comparison target documents when the word weights of "interactive" and "database" are adjusted to 1.0 as in the example above. As a result of this adjustment, the scores of the "database update" and "document search" fields, which were not very appropriate fields for the classification key document, fall, and the "document classification" field, which is an appropriate classification destination, rises to the top.
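The effect of such an adjustment on a single comparison target document can be sketched as follows. The function names and data layout are assumptions. The figures printed are plain arithmetic on the use counts and weights given for comparison target document 1, not values taken from FIG. 16.

```python
def apply_weight_adjustments(common_word_info, adjustments):
    """Return a copy of the common word information with adjusted word weights."""
    return {
        word: (count, adjustments.get(word, weight))
        for word, (count, weight) in common_word_info.items()
    }


def document_similarity(common_word_info):
    """Sum of (use count x word weight) over all commonly used words."""
    return sum(count * weight for count, weight in common_word_info.values())


doc1 = {"large-scale": (5, 2.1), "database": (11, 4.3), "time": (5, 1.7)}
adjusted = apply_weight_adjustments(doc1, {"interactive": 1.0, "database": 1.0})

print(f"{document_similarity(doc1):.1f}")      # 66.3 before adjustment
print(f"{document_similarity(adjusted):.1f}")  # 30.0 after adjustment
```

Only the stored weights change, so the similarity can be recomputed from the buffered common word information without repeating the batch-side word extraction.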

In the present embodiment, common words and their weight information are used as the intermediate data for similarity calculation. Other embodiments are also conceivable, for example one in which the batch processing side calculates a separate score by each of a plurality of methods, and the terminal side specifies the field by adjusting, for instance, the ratio in which those scores are blended.
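How the scores would be blended is not specified. A weighted average is one simple possibility, sketched here purely as an assumption; the function name, method names, scores, and ratios are hypothetical, and scores from different methods would likely need to be put on a comparable scale first.

```python
def blend_scores(scores, ratios):
    """Weighted average of per-method scores using user-adjustable ratios."""
    total_ratio = sum(ratios[name] for name in scores)
    if total_ratio == 0:
        raise ValueError("at least one blend ratio must be non-zero")
    return sum(scores[name] * ratios[name] for name in scores) / total_ratio


# Hypothetical scores for one comparison target document, both on a 0-to-1 scale.
scores = {"weighted_count": 0.8, "vector_space": 0.4}
ratios = {"weighted_count": 3, "vector_space": 1}
print(round(blend_scores(scores, ratios), 3))
```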

The processing of steps S102 to S108 described above is repeated for all target classification key documents (S109). For example, when 2000 documents are input as classification key documents, the processing is repeated 2000 times.

According to the invention of the first embodiment described above, a new field can be specified by checking the basis of the similarity calculation result and recalculating the similarity after, for example, adjusting word weights to reduce the influence of words that are unsuitable for classification. This enables interactive processing that conventional batch processing could not provide: identifying the cause of an error in the classification result, removing that cause, re-executing the classification, and confirming the result. It is also conceivable to accumulate such word weight adjustments and use them to improve classification accuracy.

[Second embodiment]
The similar document classification device 1a according to the second embodiment of the present invention will be described below with reference to the drawings. In the following description, points that are the same as in the similar document classification device 1 according to the first embodiment are omitted, and only the differences are described. Accordingly, because the processing in the batch processing control unit 13 is the same, its description is omitted and only the processing in the dialogue processing control unit 14a is described.

FIG. 17 shows the dialogue processing control unit 14a of the similar document classification device 1a according to the second embodiment of the present invention.

The dialogue processing control unit 14a shown in FIG. 17 differs from the dialogue processing control unit 14 shown in FIG. 6 in that it has a field adjustment unit 308 instead of the word weight adjustment unit 307.

The field adjustment unit 308 rewrites the common word information buffer unit 351 in accordance with instructions entered by the user from a connected input device such as a keyboard.

FIG. 18 is a flowchart illustrating the processing in the dialogue processing control unit 14a of the similar document classification device 1a according to the second embodiment of the present invention. In FIG. 18, processes identical to those in the flowchart of FIG. 8 described above are given the same numbers, and their description is omitted.

When the field identification result and the word weights have been displayed in step S106, the field adjustment unit 308 accepts an adjustment of the fields (S207). When it is confirmed that a field has been adjusted (S208), the field adjustment unit 308 performs the field adjustment (S209).

In the processing shown in FIG. 18, rewriting of the field-specific similarity integrated value buffer unit 353 is described as an example of field adjustment. Specifically, when the operator judges that a field shown in the displayed field identification result is clearly not appropriate, the field-specific similarity integrated value buffer unit 353 may be rewritten so that it holds the field-specific similarities with that field deleted from the field identification result.
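A field adjustment of this kind amounts to dropping the rejected fields from the field-specific totals before the ranking is shown again. The sketch below is a minimal illustration; the function name and the sample totals are hypothetical.

```python
def remove_fields(field_totals, rejected_fields):
    """Return the field-specific totals without the fields the operator rejected."""
    return {
        field: total
        for field, total in field_totals.items()
        if field not in rejected_fields
    }


# Hypothetical field-specific similarity integrated values.
field_totals = {"database update": 109.8, "document search": 95.2, "document classification": 60.1}
adjusted_totals = remove_fields(field_totals, {"database update"})

# The top-ranked remaining field becomes the new classification candidate.
best_field = max(adjusted_totals, key=adjusted_totals.get)
print(best_field)
```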

According to the invention of the second embodiment described above, a new field can be specified by checking the basis of the similarity calculation result and recalculating the similarity after, for example, adjusting the fields of the comparison target documents used for comparison to reduce the influence of fields that are unsuitable for classification.

The present invention naturally includes various embodiments and the like that are not described here. Accordingly, the technical scope of the present invention is defined only by the matters described above and by the invention-specifying matters of the claims that are self-evident from them.

FIG. 1 is a block diagram illustrating the similar document classification device according to the first embodiment of the present invention.
FIG. 2 is an example of the comparison target document information stored in the comparison target document information storage unit according to the first embodiment of the present invention.
FIG. 3 is an example of the word weight information stored in the word weight information storage unit according to the first embodiment of the present invention.
FIG. 4 is a diagram illustrating the similar document classification device according to the first embodiment of the present invention.
FIG. 5 is a diagram illustrating the batch processing control unit in the similar document classification device according to the first embodiment of the present invention.
FIG. 6 is a diagram illustrating the dialogue processing control unit in the similar document classification device according to the first embodiment of the present invention.
FIG. 7 is a flowchart illustrating the processing in the batch processing control unit shown in FIG. 5.
FIG. 8 is a flowchart illustrating the processing in the dialogue processing control unit shown in FIG. 6.
FIG. 9 is an example of a classification key document input to the similar document classification device according to the first embodiment of the present invention.
FIG. 10 is an example of the classification key document information generated in the similar document classification device according to the first embodiment of the present invention.
FIG. 11 is an example of the common word information generated in the similar document classification device according to the first embodiment of the present invention.
FIG. 12 is an example of the comparison target document similarities stored in the similar document classification device according to the first embodiment of the present invention.
FIG. 13 is an example of the field-specific similarity integrated values stored in the similar document classification device according to the first embodiment of the present invention.
FIG. 14 is an example of the field identification result obtained in the similar document classification device according to the first embodiment of the present invention.
FIG. 15 is an example of the word weight adjustment screen displayed in the similar document classification device according to the first embodiment of the present invention.
FIG. 16 is an example of the similarities of the comparison target documents displayed in the similar document classification device according to the first embodiment of the present invention.
FIG. 17 is a diagram illustrating the dialogue processing control unit in the similar document classification device according to the second embodiment of the present invention.
FIG. 18 is a flowchart illustrating the processing in the dialogue processing control unit shown in FIG. 17.

Explanation of symbols

1, 1a ... Similar document classification device
11 ... Comparison target document information database
12 ... Word weight information database
13 ... Batch processing control unit
14, 14a ... Dialogue processing control unit
101 ... Central processing control device
102 ... ROM
103 ... RAM
104 ... Input device
105 ... Display device
106 ... Communication control device
107 ... Storage device
108 ... Removable disk
109 ... Input/output interface
110 ... Bus
200 ... Control unit
201 ... Initialization unit
202 ... Input unit
203 ... Word weight reading unit
204 ... Classification key document information generation unit
205 ... Comparison target document information reading unit
206 ... Common word information generation unit
207 ... Data output unit
250 ... Memory unit
251 ... Information buffer unit
252 ... Classification key document information buffer unit
253 ... Comparison target document information buffer unit
254 ... Common word information buffer unit
300 ... Control unit
301 ... Initialization unit
302 ... Input unit
303 ... Data input unit
304 ... Comparison target document similarity calculation unit
305 ... Field-specific similarity accumulation unit
306 ... Field specifying unit
307 ... Word weight adjustment unit
308 ... Field adjustment unit
350 ... Memory unit
351 ... Common word information buffer unit
352 ... Comparison target document similarity buffer unit
353 ... Field-specific similarity integrated value buffer unit

Claims

1. A document classification device for classifying a classification key document, which is a document whose field is to be classified, the device comprising:
a comparison target document information storage unit that stores comparison target document information in which information of a comparison target document to be compared with the classification key document is associated with the field of the comparison target document;
a word weight information storage unit that stores words and word weights, each word weight serving as an index indicating a feature of the field of a document containing the word;
a batch processing control unit that compares the classification key document with the comparison target document information to extract commonly used words that are used in both the classification key document and the comparison target document, and generates common word information in which at least each commonly used word, the use count of the commonly used word, and the word weight of the commonly used word read from the word weight information storage unit are associated; and
a dialogue processing control unit that obtains similarities between a plurality of comparison target documents and the classification key document from the common word information, specifies a field based on the comparison target documents for which a high similarity was obtained, and adjusts the specified field based on an instruction from an input device.

2. A document classification device for classifying a classification key document, which is a document whose field is to be classified, the device comprising:
a comparison target document information storage unit that stores comparison target document information in which information of a comparison target document to be compared with the classification key document is associated with the field of the comparison target document;
a word weight information storage unit that stores words and word weights, each word weight serving as an index indicating a feature of the field of a document containing the word;
a common word information generation unit that compares the classification key document with the comparison target document information to extract commonly used words that are used in both the classification key document and the comparison target document, and generates common word information in which at least each commonly used word, the use count of the commonly used word, and the word weight of the commonly used word read from the word weight information storage unit are associated;
a similarity calculation unit that calculates a similarity based on the use count of each commonly used word included in the common word information and the word weight of the commonly used word;
a field specifying unit that determines a comparison target document for which a high similarity to the classification key document was obtained, and obtains a classification result that specifies the field to which that comparison target document belongs as the field to which the classification key document belongs; and
an adjustment unit that adjusts the field specified by the field specifying unit.

3. The document classification device according to claim 2, wherein the adjustment unit varies the word weight to generate the common word information.

4. The document classification device according to claim 2, wherein the adjustment unit deletes a specific field from the obtained classification result.

5. The document classification device according to claim 2, wherein the similarity calculation unit calculates the sum of the products of the use count and the word weight of each common word, and uses the calculated sum as the similarity.

6. A document classification method for classifying a classification key document, which is a document whose field is to be classified, the method comprising:
comparing the classification key document with comparison target document information, which is stored in a comparison target document information storage unit and in which information of a comparison target document to be compared with the classification key document is associated with the field of the comparison target document, to extract commonly used words that are used in both the classification key document and the comparison target document, and generating common word information in which at least each commonly used word, the use count of the commonly used word, and the word weight of the commonly used word are associated, the word weight being read from a word weight information storage unit that stores words and word weights each serving as an index indicating a feature of the field of a document containing the word; and
obtaining similarities between a plurality of comparison target documents and the classification key document from the common word information, specifying a field based on the comparison target documents for which a high similarity was obtained, and adjusting the specified field based on an instruction from an input device.

7. A document classification method for classifying a classification key document, which is a document whose field is to be classified, the method comprising:
comparing the classification key document with comparison target document information, which is stored in a comparison target document information storage unit and in which information of a comparison target document to be compared with the classification key document is associated with the field of the comparison target document, to extract commonly used words that are used in both the classification key document and the comparison target document, and generating common word information in which at least each commonly used word, the use count of the commonly used word, and the word weight of the commonly used word are associated, the word weight being read from a word weight information storage unit that stores words and word weights each serving as an index indicating a feature of the field of a document containing the word;
calculating a similarity based on the use count of each commonly used word included in the common word information and the word weight of the commonly used word;
determining a comparison target document for which a high similarity to the classification key document was obtained, and specifying the field to which that comparison target document belongs; and
adjusting the specified field.

8. The document classification method according to claim 7, wherein the common word information is generated by varying the word weight when adjusting the specified field.

9. The document classification method according to claim 7, wherein, when the specified field is adjusted, a specific field is deleted from the obtained classification result.

10. The document classification method according to claim 7, wherein, when the similarity is calculated, the sum of the products of the use count and the word weight of each common word is calculated, and the calculated sum is used as the similarity.