JPH06348755A

JPH06348755A - Method and system for classifying documents

Info

Publication number: JPH06348755A
Application number: JP5135588A
Authority: JP
Inventors: Tadahiro Kiyama; 忠博木山; Hiroshi Tsuji; 洋辻
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1993-06-07
Filing date: 1993-06-07
Publication date: 1994-12-22

Abstract

PURPOSE: To eliminate the extensive manual labor of classifying document data by building a classification dictionary from already classified documents and using that dictionary to classify unclassified documents. CONSTITUTION: A document data word division part 1 refers to classified document data (a), divides the document data into words, and registers the result in a classified word division table (b). A classification dictionary creation part 2 refers to the classified word division table (b), detects the appearance frequency of each word, and registers the result in a classification dictionary (c). After the dictionary (c) has been generated, the word division part 1 refers to classification target document data (d), divides it into words, and registers the result in a classification target word division table (e). A document classification part 3 refers to the division table (e) and the classification dictionary (c), classifies the document data (d), and registers the outcome in a document classification result (f). Because the system provides both a classification dictionary creation function and a document classification function, document data can be classified automatically.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a method for automatically classifying document data, and in particular to a document classification method and system suited to automatically classifying large volumes of document data.

[0002]

[Prior Art] A conventional document classification method is described, for example, in Japanese Unexamined Patent Publication No. H2-158871. In that method, a conceptual feature value is computed for each document from the frequency values of the keywords the document contains, and documents are classified according to this value.

[0003]

[Problems to Be Solved by the Invention] The conventional document classification method computes the distance between documents from the difference in their conceptual feature values and classifies documents by that distance. To compute the conceptual feature values, however, a thesaurus for document classification and keyword classification items must be registered manually in advance. In other words, the information used for classification has to be defined by hand before any document can be classified.

[0004] An object of the present invention is to solve the above problem by providing a method and system that use document data already classified into a plurality of classifications, together with the appearance frequency of each word within each classification, to automatically extract keywords for each classification and create a classification dictionary, and that then use this dictionary to classify unclassified documents.

[0005]

[Means for Solving the Problems] To achieve the above object, the document classification method of the present invention comprises: a method of extracting words that serve as classification-specific keywords from classified document data, in which each classification consists of one or more pieces of document data, and creating a classification dictionary; a method of classifying unclassified document data using the classification dictionary; and a method of detecting new keywords from the results produced by the method of classifying unclassified document data and registering them in the classification dictionary. The document classification system of the present invention is a natural language processing system equipped with a program that executes the above document classification method.

[0006]

[Operation] The document classification method and system of the present invention first acquire classified document data. If the user has specified the scope for creating the classification dictionary in terms of the items that make up the document data, only the document data of the specified items is processed; otherwise the entire document data is processed, and the classification dictionary is created. Next, for each classification obtained from the classified document data, words that occur in only one classification are detected, and each such word is registered in the classification dictionary as a keyword of that classification. When the document data consists of a plurality of items, the correspondence between keywords and items is also registered in the classification dictionary.

[0007] After the classification dictionary has been created, unclassified document data is acquired. If the user has specified the scope of classification in terms of the items that make up the document data, only the document data of the specified items is processed; otherwise the entire document data is processed, and the document is classified. The words of the unclassified document data are detected and compared against the keywords in the classification dictionary, and the number of matches is obtained for each classification.

[0008] Next, the classification with the largest number of keyword matches is taken as the classification result for the document data being classified. In addition, priorities are assigned to the classifications in descending order of match count, and these are also included in the classification result. Further, the user is asked whether the result produced by the method of classifying unclassified document data is correct; if the user indicates that it is, any word in the classified document data that does not exist in the classification dictionary is registered in the dictionary as a keyword representing that classification. Alternatively, when keywords are registered in the classification dictionary, only the keywords designated by the user are registered.

[0009] Further, when the user is asked whether the result produced by the method of classifying unclassified document data is correct and indicates that it is, the matched keywords are presented to the user; if any of them is an incorrect keyword, it is deleted from the classification dictionary at the user's instruction. In this way, unclassified document data can be classified automatically, and at the same time the classification dictionary can grow by itself.

[0010]

[Embodiments] Embodiments of the present invention are described in detail below with reference to the drawings. FIG. 1 is a functional block diagram showing one embodiment of the document classification method of the present invention. The document data word division unit 1 refers to the classified document data a, divides the document data into words, and registers the result in the classified word division table b. The classification dictionary creation unit 2 refers to the classified word division table b, detects the appearance frequency of each word, and registers it in the classification dictionary c. After the classification dictionary has been generated, the document data word division unit 1 refers to the classification target document data d, divides the document data into words, and registers the result in the classification target word division table e. The document classification unit 3 refers to the classification target word division table e and the classification dictionary c, classifies the classification target document data d, and registers the result in the document classification result f.

[0011] Providing the classification dictionary creation function and the document classification function in this way realizes automatic classification of document data.

[0012] As is clear from the figure, the document data word division unit 1, the classification dictionary creation unit 2, and the document classification unit 3 denote processes, while the classified document data a, the classified word division table b, the classification dictionary c, the classification target document data d, the classification target document word division table e, and the document classification result f are files (also called tables). Thus, according to this embodiment, each functional block is implemented as program logic. Each functional block can therefore be implemented as an LSI, which allows processing to be sped up when the method is embodied as a document classification device.

[0013] FIG. 2 is a block diagram showing the overall hardware configuration of the document classification device of FIG. 1. The input/output device 4 is used to input data and to display various kinds of information. The processor 5 executes the processing of FIG. 1 in accordance with programs. The storage device 6 stores the various document data a of FIG. 1, the various programs, and so on. More specifically, the storage device 6 contains the following storage sections: a working area ae, which is the memory the processor 5 uses while executing each process; a document data word division unit storage area 10; a classification dictionary creation unit storage area 20; a document classification unit storage area 30; a document data storage area ad; a word division table storage area be; a classification dictionary storage area c; and a document classification result storage area f.

[0014] Each program stored in the storage device 6 is executed by the processor 5. The input/output device 4 is used as needed during execution.

[0015] FIG. 3 is a PAD (Problem Analysis Diagram) showing the processing procedure of the document data word division unit of FIG. 1. It shows the processing that acquires document data from the classified document data a, divides it into words, and stores the result in the classified word division table b, or alternatively acquires document data from the classification target document data d, divides it into words, and stores the result in the classification target word division table e.

[0016] This processing is described below with reference to the PAD. Referring to the classified document data a, the following processing is performed from the first document to the last (step 11). First, the data for one document is acquired from the classified document data a (step 12). The document data is then divided into words, and the headword string and part of speech of each word are stored in the headword string b1 and part of speech b2 fields of the word division table b (step 13).

[0017] Next, it is determined whether the document data is divided into items (step 14). If it is, the item to which each word belongs is stored in the item b3 field of the classified word division table b (step 15). The field (classification) name and document number of the document data are then stored in the classification name b4 and document No. b5 fields of the classified word division table b (step 16). Finally, processing moves on to the next document (step 17).

[0018] When the classification target document data d is the input to steps 11 to 17, the word division results are stored in the classification target word division table e. Through steps 11 to 17, the classified document data a shown in FIG. 4 is acquired, divided into words, and stored in the classified word division table b shown in FIG. 5. Likewise, through steps 11 to 17, the classification target document data d shown in FIG. 9 is acquired, divided into words, and stored in the classification target word division table e shown in FIG. 10.
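The loop of steps 11 to 17 can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the patent's implementation: the `Row` record, the dictionary-shaped input documents, and the `tokenize` stand-in (whitespace splitting, with every word tagged as a noun) are hypothetical. A real system would use a morphological analyzer to divide Japanese text into words and assign parts of speech.

```python
from typing import NamedTuple, Optional

class Row(NamedTuple):
    """One record of a word division table (headword, part of speech, item, ...)."""
    headword: str
    pos: str
    item: Optional[str]      # None when the document is not divided into items
    category: Optional[str]  # None for classification target documents
    doc_no: str

def tokenize(text):
    """Hypothetical stand-in for word division: split on whitespace, tag as noun."""
    return [(word, "noun") for word in text.split()]

def build_word_table(documents):
    """Steps 11-17: divide every document into words and emit one row per word."""
    table = []
    for doc in documents:                                  # steps 11, 12, 17
        body = doc["body"]
        # Step 14: a dict body means the document is divided into items.
        sections = body.items() if isinstance(body, dict) else [(None, body)]
        for item, text in sections:
            for headword, pos in tokenize(text):           # step 13
                table.append(Row(headword, pos, item,      # step 15
                                 doc.get("category"), doc["doc_no"]))  # step 16
    return table

# Illustrative input only; the field names are assumptions.
docs = [{"doc_no": "1", "category": "communication",
         "body": {"title": "LAN protocol", "summary": "network protocol"}}]
rows = build_word_table(docs)
```

The same function serves both inputs described above: classified documents carry a `category`, while classification target documents simply omit it.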

[0019] FIG. 4 shows an example of the classified document data a, that is, classified document data with classification names and document numbers attached. The classified document data is made up of items such as "Title", "Summary", "Purpose", "Main contents", and "Future tasks".

[0020] FIG. 5 shows an example of the word division table, which consists of the fields headword string b1, part of speech b2, item b3, classification name b4, and document No. b5. The headword string b1 is the character string of a divided word, the part of speech b2 is the word's part of speech, the item b3 is the name of the item in which the word occurs, the classification name b4 is the classification name of the document data concerned, and the document No. b5 is the identification number of the document data.

[0021] FIG. 6 is a PAD showing the processing of the classification dictionary creation unit 2 of FIG. 1, namely the processing that creates the dictionary used to classify unclassified documents. This processing is described below with reference to the PAD.

[0022] Referring to the classified word division table b, the following processing is performed from the first record of the table to the last (step 201). First, the information for one record is acquired from the classified word division table b (step 202). Other records whose headword string b1, part of speech b2, and classification name b4 are equal to those of this record are then retrieved from the classified word division table b, held, and stored in the work table (step 203).

[0023] Next, it is determined whether the document data concerned has constituent items (step 204). If it does, the number of occurrences of each word is obtained per item and stored in the work table (step 205). The number of occurrences of each word in the document data as a whole is then obtained and stored in the work table (step 206).

[0024] Next, processing moves on to the next record (step 207). After the loop of step 201 has finished, the following processing is performed from the first record to the last record of the work table built in steps 201 to 206 (step 208).

[0025] First, the information for one record is acquired from the work table and held (step 209). Next, it is determined whether the work table contains another record with the same headword string and part of speech (step 210). If no such record exists, the held record is stored in the classification dictionary c (step 211).

[0026] Next, processing moves on to the next record (step 212).

[0027] Steps 201 to 207 generate the work table shown in FIG. 7 from the classified word division table b shown in FIG. 5, and steps 208 to 212 generate the classification dictionary c shown in FIG. 8 from the work table shown in FIG. 7.
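The two passes (steps 201 to 207, then steps 208 to 212) might be sketched as below. The data shapes are assumptions made for illustration: input rows are `(headword, pos, item, category)` tuples, the work table is a plain dictionary keyed by `(headword, pos, category)`, and the resulting classification dictionary is keyed by `(headword, pos)`. The patent describes record-by-record table scans; this sketch reaches the same result with hash-based grouping.

```python
from collections import Counter, defaultdict

def build_work_table(rows):
    """Steps 201-207: count each word per category, by item and in total."""
    work = {}
    for headword, pos, item, category in rows:
        entry = work.setdefault((headword, pos, category),
                                {"by_item": Counter(), "total": 0})
        if item is not None:            # steps 204-205: per-item counts
            entry["by_item"][item] += 1
        entry["total"] += 1             # step 206: count over the whole data
    return work

def build_dictionary(work):
    """Steps 208-212: keep only words that appear under exactly one category."""
    categories = defaultdict(set)
    for headword, pos, category in work:
        categories[(headword, pos)].add(category)
    dictionary = {}
    for (headword, pos, category), entry in work.items():
        if len(categories[(headword, pos)]) == 1:   # step 210: no other record
            dictionary[(headword, pos)] = {          # step 211
                "category": category,
                "by_item": dict(entry["by_item"]),
                "total": entry["total"],
            }
    return dictionary

# Illustrative rows only.
rows = [
    ("protocol", "noun", "title", "communication"),
    ("protocol", "noun", "summary", "communication"),
    ("system", "noun", "title", "communication"),
    ("system", "noun", "title", "language processing"),
    ("dictionary", "noun", "summary", "language processing"),
]
dictionary = build_dictionary(build_work_table(rows))
```

In this sample, "system" occurs under two categories and is therefore dropped, while "protocol" and "dictionary" each become a keyword of their single category.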

[0028] FIG. 7 shows an example of the work table of FIG. 6, which consists of the fields headword string W1, part of speech W2, title W3 through future tasks W7, total W8, and classification name W9. The headword string W1 is the character string of a divided word; the part of speech W2 is the word's part of speech; title W3 through future tasks W7 hold the number of times the word occurs in each item of the document data; the total W8 is the number of times the word occurs in the document data; and the classification name W9 is the classification name of the document data concerned.

[0029] FIG. 8 shows an example of the classification dictionary of FIG. 1, which consists of the fields headword string C1, part of speech C2, title C3 through future tasks C7, total C8, and classification name C9. The headword string C1 is the character string of a divided word; the part of speech C2 is the word's part of speech; title C3 through future tasks C7 hold the number of times the word occurs in each item of the document data; the total C8 is the number of times the word occurs in the document data; and the classification name C9 is the classification name of the document data concerned. Unlike the work table shown in FIG. 7, this classification dictionary contains no duplicate headword strings: each headword string corresponds to exactly one classification name.
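The property that separates the classification dictionary from the work table, namely that no headword is duplicated, can be checked with a few lines of Python. The record shape `(headword, pos, category)` and the function name are assumptions for illustration, and the sketch treats a headword together with its part of speech as the key, following the comparison made in step 210.

```python
from collections import Counter

def duplicate_headwords(records):
    """Return the (headword, pos) keys that occur in more than one record."""
    counts = Counter((headword, pos) for headword, pos, _category in records)
    return {key for key, n in counts.items() if n > 1}

# Illustrative records: a work table may repeat a headword, a dictionary may not.
work_table = [("system", "noun", "communication"),
              ("system", "noun", "language processing"),
              ("protocol", "noun", "communication")]
dictionary = [("protocol", "noun", "communication")]
```

An empty result means the records satisfy the one-headword-one-classification property.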

[0030] FIG. 9 shows an example of the classification target document data d, with document numbers attached. The classification target document data is made up of items such as "Title", "Summary", "Purpose", "Main contents", and "Future tasks".

[0031] FIG. 10 shows an example of the classification target document word division table e, which consists of the fields headword string e1, part of speech e2, item e3, classification name e4, and document No. e5. The headword string e1 is the character string of a divided word, the part of speech e2 is the word's part of speech, the item e3 is the name of the item in which the word occurs, the classification name e4 is the classification name of the document data concerned, and the document No. e5 is the identification number of the document data.

[0032] FIG. 11 is a PAD showing the processing of the document classification unit 3 of FIG. 1. This processing classifies unclassified documents using the classification dictionary. It assumes that the classification target document data d of FIG. 1 has already been divided into words by the document data word division unit 1 and that the results have been stored in the classification target document word division table e.

[0033] This processing is described below with reference to the PAD. Referring to the classification target document word division table e, the following processing is performed from the first record of the table to the last (step 301). First, it is determined whether the user has specified the items to be used for classifying the document (step 302). If so, the information for one record is acquired, restricted to records that correspond to the specified items (step 303); if not, the information for one record is acquired without restriction (step 304).

[0034] Next, the acquired record is stored in the string variable MIDASHI1 (step 305). Then, referring to the classification dictionary c, the following processing is performed from the first record of the dictionary to the last (step 306). First, it is determined whether the user has specified the items to be used for classifying the document (step 307). If so, the information for one record is acquired from the classification dictionary c, restricted to records that correspond to the specified items (step 308); if not, the information for one record is acquired without restriction (step 309).

[0035] Next, the acquired record is stored in the string variable MIDASHI2 (step 310). The headword string and part of speech are then extracted from the information stored in MIDASHI1 and MIDASHI2, and it is determined whether both the headword strings and the parts of speech match (step 311). If they match, the match count is incremented for the corresponding classification name and held (step 312), and the matched headword string and part of speech are held (step 313).

[0036] Next, the inner loop moves on to the next record of the classification dictionary (step 314). After the loop of step 306 has finished, the loop of step 301 moves on to the next record of the classification target document word division table (step 315). After the loop of step 301 has finished, the classification name with the largest match count held above is stored in the document classification result f as the classification result for the document (step 316); the classification names with one or more matches are sorted in descending order of match count and stored in the document classification result f as classification candidates (step 317); and the matched headword strings, parts of speech, and match counts are stored in the document classification result f (step 318).

[0037] Through steps 301 to 318, documents are classified on the basis of the classification target document word division table e shown in FIG. 10 and the classification dictionary c, yielding the document classification result f shown in FIG. 12. Presenting this document classification result f to the user makes it possible to check the classification result.
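Steps 301 to 318 can be condensed into the following Python sketch. It is an illustration under assumptions rather than the patented procedure itself: the document is a list of `(headword, pos, item)` tuples, the dictionary maps `(headword, pos)` to a classification name, and the nested scan over dictionary records (steps 306 to 314) is swapped for a hash lookup, which yields the same match counts when every headword appears once in the dictionary. The item restriction is applied only to the document side here, and ties are left in first-encountered order because the text does not say how they are broken.

```python
from collections import Counter

def classify(doc_rows, dictionary, items=None):
    """Count keyword matches per classification and rank the classifications.

    doc_rows:   (headword, pos, item) tuples for one document
    dictionary: {(headword, pos): classification name}
    items:      optional set of item names to restrict matching to
    """
    match_counts = Counter()     # matches per classification (step 312)
    matched_words = Counter()    # matches per keyword (step 313)
    for headword, pos, item in doc_rows:
        if items is not None and item not in items:
            continue             # steps 302-303: skip records outside chosen items
        category = dictionary.get((headword, pos))   # stands in for steps 306-311
        if category is not None:
            match_counts[category] += 1
            matched_words[(headword, pos)] += 1
    ranking = [name for name, _ in match_counts.most_common()]   # step 317
    return {
        "result": ranking[0] if ranking else None,   # step 316
        "ranking": ranking,
        "matched": dict(matched_words),              # step 318
    }

# Illustrative data only.
dictionary = {("protocol", "noun"): "communication",
              ("LAN", "noun"): "communication",
              ("analog", "noun"): "electric circuit"}
doc = [("LAN", "noun", "title"), ("protocol", "noun", "summary"),
       ("analog", "noun", "summary"), ("cable", "noun", "summary")]
out = classify(doc, dictionary)
```

Passing `items={"title"}` restricts matching to words found in the title, mirroring the user-specified item scope.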

[0038] FIG. 12 shows an example of the document classification result f of FIG. 1, containing the classification results for three documents. The first is the document with document No. "101" shown in FIG. 9, which has been classified as "language processing". This indicates that the classification name "language processing" had the highest number of matches with words in the classification dictionary. The matched words are "dictionary", "natural language", "LR parsing", and so on, and the number in parentheses is the match count for each word. The result also indicates that the classification candidate with the next-highest match count after "language processing" is "communication".

[0039] The second is the document with document No. "102" shown in FIG. 9, which has been classified as "electric circuit". This indicates that the classification name "electric circuit" had the highest number of matches with words in the classification dictionary. The matched words are "AD conversion", "analog", "digital", and so on, and the number in parentheses is the match count for each word. The result also indicates that there was no classification candidate ranked after "electric circuit".

[0040] The third is the document with document No. "103" shown in FIG. 9, which has been classified as "communication". This indicates that the classification name "communication" had the highest number of matches with words in the classification dictionary. The matched words are "network", "protocol", "LAN", and so on, and the number in parentheses is the match count for each word. The result also indicates that the classification candidate with the next-highest match count after "communication" is "electric circuit".

[0041] As described above, this embodiment makes it possible to automatically classify document data that was previously classified by hand, which eliminates the enormous amount of work spent on manual classification. In addition, news services (bulletin boards) on personal computers, UNIX systems, and the like have a very large number of newsgroups to which an article might be posted depending on its content, and it can be difficult to decide which newsgroup an article belongs in. According to this embodiment, a classification dictionary can be created from documents that are already classified by newsgroup, enabling automatic classification; the newsgroup to which a document should be posted can thus be determined automatically without the user specifying it. The document can also be classified and posted automatically.

[0042] Next, an extension of the document classification method of the present invention is described. FIG. 13 is a functional block diagram of a configuration in which the results classified by the document classification method of the present invention are confirmed with the user and, based on that confirmation, new keywords are registered in the classification dictionary. The document classification confirmation unit 4 refers to the document classification result f, presents the classification result and the matched keywords of each classified document to the user, and prompts the user to confirm them. If any matched keyword is unsuitable, it is deleted from the classification dictionary as designated by the user. The unit also determines whether the classified document contains new keywords and, if so, registers them in the classification dictionary, so that the dictionary grows by itself.
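The confirmation step can be pictured as a single dictionary update, sketched below in Python. The function name, its parameters, and the `{(headword, pos): classification}` dictionary shape are assumptions for illustration; the interactive screens are reduced to arguments that carry the user's answers.

```python
def apply_feedback(dictionary, category, doc_words, confirmed,
                   rejected=(), approved=None):
    """Update `dictionary` in place after the user has reviewed one result.

    dictionary: {(headword, pos): classification name}
    category:   classification assigned to the document
    doc_words:  (headword, pos) pairs found in the classified document
    confirmed:  True if the user judged the classification correct
    rejected:   matched keywords the user marked as unsuitable
    approved:   if given, only these new words are registered;
                None registers every new word automatically
    """
    if not confirmed:
        return dictionary               # a wrong result teaches nothing here
    rejected = set(rejected)
    for key in rejected:
        if dictionary.get(key) == category:
            del dictionary[key]         # delete the unsuitable keyword
    for key in doc_words:
        if key in dictionary or key in rejected:
            continue                    # already known, or just rejected
        if approved is None or key in approved:
            dictionary[key] = category  # register as a new keyword
    return dictionary

# Illustrative data only.
d = {("LAN", "noun"): "communication", ("system", "noun"): "communication"}
apply_feedback(d, "communication",
               [("LAN", "noun"), ("system", "noun"), ("router", "noun")],
               confirmed=True, rejected=[("system", "noun")])
```

After this call the rejected keyword "system" is gone and the previously unseen word "router" has been registered under "communication", which is the self-growth behavior described above.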

【００４３】 [Screen 1] in FIG. 14 is an example of confirming the classification result of a classification target document. The items presented are the document number, title, classification result, and classification candidates. Alongside the classification result, the screen shows the keywords that matched keywords in the classification dictionary and the number of times each one matched. From this information, the user judges whether the classification result is correct and enters that judgment.

【００４４】 [Screen 2] in FIG. 14 is an example in which, after the user has judged on [Screen 1] that the classification result is correct, the keywords that matched the keywords of that classification in the classification dictionary, and thus formed the basis of the classification, are presented to the user, and the user is prompted to indicate any that are unsuitable as keywords. An indicated keyword is deleted from the classification dictionary, which improves the accuracy of the dictionary. In this example, the user responds that there is no such keyword.

【００４５】 [Screen 3] in FIG. 15 is an example in which, when the classification result was correct and there are keywords that appear only in the classification target document and not in the classification dictionary, the user is prompted to choose whether to register these keywords automatically as new keywords for the classification, or to review them first and select which ones to register. In this example, the user responds that newly appearing keywords should be registered automatically without review.

【００４６】 As described above, according to this extended example, when an incorrect keyword has been registered for some classification in the classification dictionary, the user can check the keywords that matched the classification target document and delete the incorrect one, so the accuracy of the classification dictionary is easily improved. In addition, when a new keyword appears in a classification target document, it is either registered in the classification dictionary automatically or presented to the user so that only the necessary keywords are registered. The user can thus register keywords in the classification dictionary, or let the dictionary self-propagate, without spending time on manual dictionary registration.
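
The confirmation and dictionary-learning steps of this extended example can be sketched as a single update function, called once the user has confirmed that a classification result is correct. This is only an illustrative sketch: the name `apply_feedback`, its parameters, and the representation of the dictionary as `{category: set of keywords}` are assumptions, and every word of the document that is absent from the dictionary is treated as a new keyword candidate, whereas the patent does not spell out how candidates are filtered.

```python
def apply_feedback(dictionary, category, doc_words,
                   rejected=(), auto_register=True, approved=()):
    """Update dictionary[category] in place after a confirmed classification.

    rejected:      matched keywords the user marked as unsuitable (Screen 2).
    auto_register: True registers all new keywords without review (Screen 3).
    approved:      keywords the user selected when auto_register is False.
    Returns the set of newly registered keywords.
    """
    # Words already known under any category before this update.
    known_before = set().union(*dictionary.values())

    keywords = dictionary.setdefault(category, set())
    # Delete the keywords the user judged unsuitable.
    keywords.difference_update(rejected)

    # New keywords appear in the document but not in the dictionary.
    # Rejected keywords were known before, so they are never re-added.
    candidates = set(doc_words) - known_before
    added = candidates if auto_register else candidates & set(approved)
    keywords.update(added)
    return added
```

With `auto_register=True` the dictionary grows without user review, which matches the response shown on [Screen 3]; passing `auto_register=False` with an `approved` list models the alternative in which the user selects the keywords to register.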

【００４７】 Next, another extended example of the document classification method of the present invention will be described. FIG. 16 is a functional block diagram of an arrangement in which, when a classification dictionary is created or classification target documents are classified, and the document data is divided into fixed items, the user can specify items within the document so that only the specified items are processed, rather than the entire document data being used for dictionary creation or classification. The document classification target item specification unit 6 recognizes the items that the user has specified for processing. When a classification dictionary is created, it refers to the classified document data a and acquires the document data of the corresponding items; when document data is classified, it refers to the classification target document data d and acquires the document data of the corresponding items. Only the acquired document data is processed in the subsequent steps.

【００４８】 Furthermore, by allowing logical relationships between arbitrary items to be specified and recognized, rather than only individual items, it becomes easy to process, for example, words that appear in the item "main contents" or the item "future tasks", or words that appear in both the item "main contents" and the item "future tasks".
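
The item restriction and the logical relationship between items can be illustrated as follows. This is a sketch under assumptions: a document is modeled as a mapping from item name to text, words are split on whitespace, and the function name `words_for_items` and its `mode` parameter are hypothetical. The "or" mode collects words that appear in any specified item, and the "and" mode keeps only words that appear in every specified item.

```python
def words_for_items(document, items, mode="or"):
    """Return the set of words to process, restricted to the given items.

    document: dict mapping item name -> text (assumed structure).
    items:    item names specified by the user.
    mode:     "or"  -> words appearing in any specified item,
              "and" -> words appearing in all specified items.
    """
    word_sets = [set(document.get(item, "").lower().split()) for item in items]
    if not word_sets:
        return set()
    if mode == "or":
        return set().union(*word_sets)
    if mode == "and":
        return set.intersection(*word_sets)
    raise ValueError("mode must be 'or' or 'and'")


report = {
    "title": "Parser report",
    "main contents": "fast parser for logs",
    "future tasks": "parser for binary logs",
}
either = words_for_items(report, ["main contents", "future tasks"], mode="or")
both = words_for_items(report, ["main contents", "future tasks"], mode="and")
```

Only the words returned by such a step would then be passed on to dictionary creation or classification, so items that were not specified (here, "title") do not influence the result.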

【００４９】 As described above, according to this extended example, because processing is limited to certain items rather than covering the entire document data, improved classification dictionary accuracy and classification accuracy can be expected, depending on the nature of the document data. At the same time, a practical processing speed can be maintained even when there is an enormous amount of document data or the individual documents are large, so processing speed is improved.

【００５０】[0050]

[Effects of the Invention] According to the present invention, document data that was conventionally classified by hand can be classified automatically, which eliminates the enormous amount of work spent on manual classification. In addition, news services (bulletin boards) on personal computers or Unix have a very large number of newsgroups to which a document may be posted depending on its content, and it is sometimes impossible to decide which newsgroup a document should be posted to. According to this embodiment, a classification dictionary can be created from documents already classified by newsgroup and used for automatic classification, so the newsgroup to post to can be found automatically even when no newsgroup is specified for the document. The document can also be classified and posted automatically. Further, keywords can be registered in the classification dictionary easily, and the classification dictionary can also self-propagate.

[Brief description of drawings]

[FIG. 1] A functional block diagram showing one embodiment of the document classification method according to the present invention.

[FIG. 2] A block diagram showing the overall hardware configuration for the document classification method of FIG. 1.

[FIG. 3] A PAD diagram of the document data word division program in FIG. 1.

[FIG. 4] An example of the classified document data in FIG. 1.

[FIG. 5] An example of the classified word division table in FIG. 1.

[FIG. 6] A PAD diagram of the classification dictionary creation program in FIG. 1.

[FIG. 7] An example of the work table used to create the classification dictionary in FIG. 1.

[FIG. 8] An example of the classification dictionary in FIG. 1.

[FIG. 9] An example of the classification target document data in FIG. 1.

[FIG. 10] An example of the classification target document word division table in FIG. 1.

[FIG. 11] A PAD diagram of the document classification program in FIG. 1.

[FIG. 12] An example of a document classification result in FIG. 1.

[FIG. 13] A functional block diagram showing the method of confirming document classification results and the method of classification dictionary learning.

[FIG. 14] An example of dialogue with the user during classification result confirmation and classification dictionary learning in FIG. 13.

[FIG. 15] An example of dialogue with the user during classification result confirmation and classification dictionary learning in FIG. 13.

[FIG. 16] A functional block diagram showing the method of specifying the items within a document that are to be processed in the document classification method.

[Explanation of symbols]

1 ... document data word division unit, 2 ... classification dictionary creation unit, 3 ... document classification unit, a ... classified document data, b ... classified word division table, c ... classification dictionary, d ... classification target document data, e ... classification target document word division table.

Claims

[Claims]

1. A document classification system, being a processing system for classifying document data, comprising: means for creating a classification dictionary by extracting words that serve as keywords for each classification from classified document data in which each classification contains one or more items of document data; and means for classifying unclassified document data using the classification dictionary.

2. The document classification system according to claim 1, wherein the means for creating the classification dictionary detects words in the document data by using the classified document data, detects a word that appears in only one classification, and registers the word in the classification dictionary as a keyword representing that classification.

3. The document classification system according to claim 1, wherein the means for creating the classification dictionary detects the items constituting the document data by using the classified document data, detects words for each item, detects a word that appears in only one classification, and registers the word in the classification dictionary as a keyword that represents that classification and appears in that item.

4. The document classification system according to claim 1, wherein, when the classified document data is composed of items and the user specifies by item the range from which the classification dictionary is to be created, the means for creating the classification dictionary creates the classification dictionary only from the document data corresponding to the specified items in the classified document data.

5. The document classification system according to claim 1, wherein the means for classifying the unclassified document data detects words in the unclassified document data, detects the number of matches with the keywords registered in the classification dictionary, and takes, among the matched classifications, the classification with the largest number of matches as the classification result of the unclassified document data.

6. The document classification system according to claim 1, wherein the means for classifying the unclassified document data detects words in the unclassified document data, detects the number of matches with the keywords registered in the classification dictionary, and takes the matched classifications as classification results of the unclassified document data, with higher priority given in descending order of the number of matches.

7. The document classification system according to claim 1, wherein, when the unclassified document data is composed of items and the user specifies by item the range to be used for classifying the unclassified document data, the means for classifying the unclassified document data processes only the document data corresponding to the specified items in the unclassified document data.

8. A document classification system, being a processing system for classifying document data, comprising: means for creating a classification dictionary by extracting words that serve as keywords for each classification from classified document data in which each classification contains one or more items of document data; means for classifying unclassified document data using the classification dictionary; and means for newly detecting a keyword from the result of classification by the means for classifying the unclassified document data and registering it in the classification dictionary.

9. The document classification system according to claim 8, wherein, when a word in the document data classified by the means for classifying the unclassified document data does not exist in the classification dictionary, the means for newly detecting a keyword and registering it in the classification dictionary registers the word in the classification dictionary as a keyword representing that classification.

10. The document classification system according to claim 8, wherein the means for newly detecting a keyword and registering it in the classification dictionary determines whether the classification result produced by the means for classifying the unclassified document data is correct and, when the result is correct and a word in the classified document data is not a word in the classification dictionary, registers the word in the classification dictionary as a keyword representing that classification.

11. The document classification system according to claim 8, wherein the means for newly detecting a keyword and registering it in the classification dictionary determines whether the classification result produced by the means for classifying the unclassified document data is correct and, when the result is correct and a word in the classified document data is not a word in the classification dictionary, presents the word to the user as a keyword representing that classification and registers in the classification dictionary only the keywords designated by the user.

12. The document classification system according to claim 8, wherein the means for newly detecting a keyword and registering it in the classification dictionary determines whether the classification result produced by the means for classifying the unclassified document data is correct, presents the matched keywords to the user when the result is indicated to be correct, and, when an incorrect keyword exists among them, deletes that keyword from the classification dictionary according to the user's instruction.