JPH0298778A

JPH0298778A - Automatic document classification device

Info

Publication number: JPH0298778A
Application number: JP63250832A
Authority: JP
Inventors: Atsuo Kawai; 河合　敦夫; Masaaki Nagata; 昌明永田; Haruo Kimoto; 木本　晴夫
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1988-10-06
Filing date: 1988-10-06
Publication date: 1990-04-11

Abstract

PURPOSE: To make it possible to decide a classification destination based on detailed rules and legends, by using production rules that describe the detailed rules and legends applied at classification time in terms of relationships between classification fields and co-occurrence relationships between the words that appear. CONSTITUTION: A classification candidate selection means 200 extracts keywords from an input text, collates them with the keywords stored in a keyword storage means 100, which appear in documents disproportionately for each predetermined classification, and selects at least one classification candidate. A production rule storage means 300 stores production rules that define, based on the detailed rules and legends of the classification, a set containing a plurality of classifications and the one classification corresponding to that set. A classification candidate decision means 400 collates the at least one classification candidate selected by the classification candidate selection means 200 with the production rules stored in the production rule storage means 300 and decides on one classification candidate. The classification destination can thereby be decided on the basis of the detailed rules and legends.

Description

[Detailed description of the invention]

[Field of industrial application]

The present invention relates to a device that automatically classifies the documents to be stored in a database, for the purpose of building a document database.

[Conventional technology]

When building a database that contains a large number of documents, such as newspaper articles, patent specifications, and technical papers, a classification code has to be assigned to each document as it is entered into the database. Conventionally, this has been done with a method that focuses on words that tend to appear disproportionately in a particular classification field.

In this method, the words in documents that have already been classified are processed statistically to determine the words that appear disproportionately in each field (hereinafter called field identification words). The classification destination of an unclassified document is then determined using the field identification words it contains as clues. For example, suppose the classification fields sports, world, and so on have been set. If the word "Olympics" appears disproportionately in the field "sports" and the word "diplomat" appears disproportionately in the field "world", then "Olympics" and "diplomat" are taken as field identification words.

With this method, however, when several field identification words appear in a document and they belong to different fields, the final classification destination is determined solely from statistical information such as the frequency of the field identification words. For example, in an article stating that "professional baseball players form a labor union", field identification words for the sports field and for the labor field both appear, and the classification destination ends up being determined only by the frequencies of the field identification words of the two fields.

[Problem to be solved by the invention]

In actual classification work, however, when candidate classification destinations compete, the final destination is decided by consulting the detailed rules and legends of the classification, not simply by word frequency. These detailed rules and legends are written in terms of relationships between classification fields, co-occurrence relationships between the words that appear, and the like, rather than in terms of statistical data such as the frequency of words in a document.

In other words, conventional statistical methods alone cannot express the detailed rules and legends applied at classification time. They therefore have the drawback that, when candidate classification destinations compete, the destination cannot be decided on the basis of those detailed rules and legends.

Accordingly, the object of the present invention is to solve the above problem of the prior art and to make it possible to decide the classification destination based on detailed rules and legends, by using production rules that describe the detailed rules and legends applied at classification time in terms of relationships between classification fields and co-occurrence relationships between the words that appear.

[Means to solve the problem]

FIG. 1 is a functional block diagram of the present invention. In FIG. 1, the keyword storage means 100 describes the keywords that appear in documents disproportionately for each predetermined classification.

The classification candidate selection means 200 extracts keywords from the input text, collates them with the keywords for each classification, and selects at least one classification candidate.

The production rule storage means 300 stores production rules that define in advance, based on the detailed rules and legends of the classification, a set containing a plurality of classifications and the one classification corresponding to that set.

The classification candidate determining means 400 decides on one classification candidate by collating the at least one classification candidate selected by the classification candidate selection means with the production rules stored in the production rule storage means.

[Operation]

The classification candidate selection means 200 extracts keywords from the input text and collates them with the keywords stored in the keyword storage means 100, which appear in documents disproportionately for each predetermined classification. If, for example, the keywords in the input text are biased toward classifications A and B, classifications A and B are selected as classification candidates. The classification candidate determining means 400 then collates classifications A and B with the production rules stored in the production rule storage means 300. The production rules derive from the detailed rules and legends; one such rule might state, for example, that the set consisting of classifications A and B is ultimately resolved to classification B. The classification candidate determining means 400 looks for a production rule whose set consists of candidate classifications A and B and, by that rule, finally decides on classification B as the single classification candidate.
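The lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names RULES and decide are hypothetical, and representing a production rule as a mapping from a set of competing classifications to a single classification is one possible reading of the text, not a structure the specification prescribes.

```python
# Hypothetical production rules: a set of competing classifications
# maps to the single classification that should be chosen.
RULES = {
    frozenset({"A", "B"}): "B",
}


def decide(candidates):
    """Return the classification named by a matching rule, or None if no rule matches."""
    return RULES.get(frozenset(candidates))


print(decide(["A", "B"]))  # B
```

Using a frozenset as the key makes the lookup independent of the order in which the candidates were selected.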

In this way, classification candidates are decided by taking into account not only the frequency of occurrence of keywords but also the detailed rules and legends applied at classification time, so the problem of the conventional method is resolved.

[Embodiment]

An embodiment of the present invention is described in detail below with reference to the accompanying drawings.

FIG. 2 shows an embodiment in which the present invention is implemented in hardware. Reference numeral 1 denotes a sample document file, which holds the sample data used to extract the statistical characteristics of the vocabulary of documents in each field. Reference numeral 2 denotes a field feature extraction device, which extracts the characteristics of the documents in each field. These per-field document characteristics (the bias, by field, in the frequency of occurrence of keywords) are recorded in the field identification word score table (memory) 3. Reference numeral 4 denotes a classification code production rule dictionary, which describes the detailed rules and legends used for classification in the form of production rules. Reference numeral 5 denotes an input document file that stores unclassified documents. Reference numeral 6 denotes a classification destination identification device, which automatically assigns a classification code to each unclassified document and outputs the result to an output document file 7.

The relationship to the functional block diagram of FIG. 1 is as follows. The keyword storage means 100 corresponds to the sample document file 1, the field feature extraction device 2, and the field identification word score table 3. The classification candidate selection means 200 corresponds to the input document file 5 and the classification destination identification device 6. The production rule storage means 300 corresponds to the classification code production rule dictionary 4. The classification candidate determining means 400 corresponds to the classification destination identification device 6. In other words, the classification destination identification device 6 implements the functions of both the classification candidate selection means 200 and the classification candidate determining means 400.

The output document file 7 stores, in the form of a code, the final classification candidate (classification destination candidate) decided by the classification candidate determining means 400.

FIG. 3 is a detailed block diagram of the field feature extraction device 2. As shown, the field feature extraction device 2 comprises an input device 10, a keyword extraction unit 11, a field frequency calculation unit 12, a score table calculation unit 13, a keyword table 14, and a keyword field frequency table 15. The keyword extraction unit 11 extracts keywords from each document with a classification code (sample data) read from the sample document file 1 via the input device 10, and stores them in the keyword table 14. In this operation, the keyword extraction unit 11 uses the thesaurus 9 to unify variant spellings, paraphrase substitute expressions, generalize subordinate concepts to superordinate concepts, and so on, thereby also unifying terms that express the same concept. Next, the field frequency calculation unit 12 adds the occurrence counts of the keywords in the keyword table 14 to the keyword field frequency table 15, field by field, according to the classification code (the code indicating the classification) of the document. These operations are repeated for every sample document.
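The accumulation performed by the keyword extraction unit 11 and the field frequency calculation unit 12 can be sketched as follows. This is a minimal illustration, not the embodiment itself: the function name build_frequency_table, the THESAURUS mapping, and the sample data are all hypothetical, and keyword extraction is assumed to have been done already, so each sample is simply a classification code paired with a list of keywords.

```python
from collections import defaultdict

# Hypothetical thesaurus: maps variant terms to one canonical keyword.
THESAURUS = {"trade union": "labor union"}


def build_frequency_table(samples, thesaurus=THESAURUS):
    """Count keyword occurrences per classification code.

    samples: iterable of (classification_code, keywords) pairs, one per
    sample document that already carries a classification code.
    """
    table = defaultdict(lambda: defaultdict(int))
    for code, keywords in samples:
        for keyword in keywords:
            canonical = thesaurus.get(keyword, keyword)  # unify equivalent terms
            table[canonical][code] += 1
    return {kw: dict(by_field) for kw, by_field in table.items()}


# Hypothetical sample documents.
samples = [
    ("labor", ["labor union", "trade union"]),
    ("politics", ["labor union", "Diet"]),
]
table = build_frequency_table(samples)
print(table["labor union"])  # {'labor': 2, 'politics': 1}
```

The nested mapping plays the role of the keyword field frequency table 15: one row per keyword, one column per classification code.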

The score table calculation unit 13 creates the field identification word score table 3 by deleting unnecessary keywords and converting frequencies to scores. Specifically, it deletes from the keyword field frequency table 15 (1) keywords whose total number of occurrences over all fields is small, and (2) keywords that appear uniformly across the fields with little bias toward any field. It then converts the frequency of each remaining keyword in each field into a score.
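The two deletion criteria can be sketched as a simple filter. The specification does not state numeric thresholds or how bias is measured, so everything concrete here is an assumption made for illustration: the function name select_field_words, the cut-off min_total, and the use of the largest per-field share (min_peak_share) as the measure of bias. Apart from the "labor union" counts, which are the ones quoted for FIG. 6, the frequencies are invented.

```python
def select_field_words(freq, fields, min_total=5, min_peak_share=0.5):
    """Keep only keywords that are frequent enough and biased toward some field.

    freq: {keyword: {field: count}}; fields: list of all classification fields.
    min_total and min_peak_share are assumed thresholds, not values from the source.
    """
    kept = {}
    for word, counts in freq.items():
        total = sum(counts.get(f, 0) for f in fields)
        if total < min_total:
            continue  # criterion (1): too few occurrences over all fields
        peak = max(counts.get(f, 0) for f in fields)
        if peak / total < min_peak_share:
            continue  # criterion (2): spread evenly, little bias toward any field
        kept[word] = counts
    return kept


fields = ["politics", "labor", "sports"]
freq = {
    "labor union": {"politics": 5, "labor": 56},
    "offside": {"sports": 1},                               # hypothetical count
    "Tokyo": {"politics": 10, "labor": 10, "sports": 10},   # hypothetical counts
}
print(list(select_field_words(freq, fields)))  # ['labor union']
```

The conversion from frequency to score is deliberately left out of this sketch.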

FIG. 4 is a detailed block diagram of the classification destination identification device 6.

The classification destination identification device 6 comprises an input device 16, a keyword extraction unit 17, a field score calculation unit 18, a production rule application unit 19, an output device 20, a document memory 21, a keyword table 22, a classification code candidate table 23, and a classification code table 24.

Unclassified input documents stored in the input document file 5 are read by the input device 16 and passed to the keyword extraction unit 17. The keyword extraction unit 17 extracts keywords from the input document and stores them in the keyword table 22. Next, the field score calculation unit 18 takes the keywords in the keyword table 22 that are listed in the field identification word score table 3 and adds up their scores for each classification code. Based on the resulting score of each classification code, the classification code with the highest score, together with any classification code whose score lies within a predetermined threshold of that highest score, is taken as a candidate classification code. These candidate classification codes are written into the classification code candidate table 23.
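The scoring and candidate selection performed by the field score calculation unit 18 can be sketched as below. The function name select_candidates, the score values, and the threshold are hypothetical; only the procedure (sum the scores per classification code, then keep the top code and every code within a threshold of it) follows the description.

```python
from collections import Counter


def select_candidates(keywords, score_table, threshold):
    """Sum keyword scores per classification code and pick the candidates.

    keywords: keywords extracted from the input document (repeats allowed).
    score_table: {keyword: {classification_code: score}}.
    Returns (candidate codes ordered by descending score, all totals).
    """
    totals = Counter()
    for keyword in keywords:
        for code, score in score_table.get(keyword, {}).items():
            totals[code] += score
    if not totals:
        return [], {}
    best = max(totals.values())
    ranked = sorted(totals.items(), key=lambda item: -item[1])
    candidates = [code for code, score in ranked if best - score <= threshold]
    return candidates, dict(totals)


# Hypothetical scores; the real values would come from the score table 3.
score_table = {
    "professional baseball": {"sports": 8},
    "Yakult": {"sports": 6},
    "night game": {"sports": 5},
    "labor union": {"labor": 9, "politics": 1},
}
keywords = ["professional baseball", "Yakult", "night game", "labor union", "labor union"]
candidates, totals = select_candidates(keywords, score_table, threshold=5)
print(candidates)  # ['sports', 'labor']
```

Keywords absent from the score table contribute nothing, which matches the description that only keywords listed in the field identification word score table 3 are scored.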

When the classification code candidate table 23 contains more than one classification code, the production rule application unit 19 uses the classification code production rule dictionary 4 to narrow the candidates down to a single classification code. A classification code production rule consists of three parts: a condition part, a classification destination, and a rule score (see FIG. 5). The condition part consists of a sequence of one or more keywords or classification codes.

Each keyword or classification code is denoted by Ci (1 ≤ i ≤ n). The score α represents the certainty of the rule. The condition part is evaluated by referring to the keyword table 22 and the classification code candidate table 23 and checking whether each of the terms C1 to Cn is present in the tables.

When all of the terms C1 to Cn are present in the tables, the classification code corresponding to that condition part is generated and given the rule's score. The final classification code is determined by (1) to (3) below, depending on the number of production rules that were applied.

(1) When no production rule was applied, the classification code that had the highest score in the field score calculation unit is taken as the final classification code.

(2) When exactly one production rule was applied, the classification code written in that production rule is taken as the final classification destination.

(3) When two or more production rules were applied, the score written in each production rule is added to the classification code that the rule generated, and the classification code with the highest total score is taken as the final classification destination.
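Cases (1) to (3) can be sketched as follows. This is an illustration under assumptions, not the embodiment's implementation: the Rule class and the function names are hypothetical, the rule conditions follow the examples given for FIG. 5, and the rule scores are invented because the text does not give them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    conditions: tuple  # items are ("code", name) or ("kw", word)
    destination: str
    score: float       # certainty of the rule (hypothetical values below)


def applies(rule, keywords, candidate_codes):
    """A rule applies only when every term of its condition part is present."""
    for kind, value in rule.conditions:
        pool = candidate_codes if kind == "code" else keywords
        if value not in pool:
            return False
    return True


def decide(rules, keywords, candidate_codes, top_code):
    fired = [r for r in rules if applies(r, keywords, candidate_codes)]
    if not fired:
        return top_code                # case (1): keep the top-scoring code
    if len(fired) == 1:
        return fired[0].destination    # case (2): the single rule decides
    totals = {}
    for r in fired:                    # case (3): sum rule scores per destination
        totals[r.destination] = totals.get(r.destination, 0) + r.score
    return max(totals, key=totals.get)


RULES = [
    Rule((("code", "labor"), ("code", "transport/communications")), "labor", 0.8),
    Rule((("code", "labor"), ("code", "world")), "world", 0.9),
    Rule((("code", "labor"), ("code", "sports")), "labor", 0.7),
    Rule((("code", "science and technology"), ("kw", "Diet")), "politics", 0.9),
]

result = decide(RULES, {"professional baseball", "labor union"}, {"sports", "labor"}, "sports")
print(result)  # labor
```

With the candidates sports and labor, only the third rule applies, so its destination is returned even though sports had the higher keyword score.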

The final classification code determined in this way is stored in the classification code table 24. Finally, the output device 20 outputs the final classification code, together with the document, to the output document file 7 on an external storage device.

FIG. 5 shows an example of classification code production rules. In the figure, [ ] denotes a classification code, 〈 〉 denotes a keyword, and # followed by a number is the rule number of a classification code production rule. Rule #1 expresses the legend that articles on labor issues or the employment situation in the transport and communications field are classified as articles in the "labor" field. Rule #2 expresses the legend that articles on labor issues or the employment situation outside Japan are classified as articles in the "world" field. Rule #4 expresses the legend that, when a science and technology matter is deliberated in the Diet, the article is classified as an article in the "politics" field.

FIG. 6 shows an example of the keyword field frequency table. In this example, ten fields are set as classification codes: politics, economy, labor, ..., sports, and world. The table shows how many times each keyword appeared in the documents of each field. The keyword "labor union", for instance, appears 5 times in articles in the politics field and 56 times in articles in the labor field.

FIG. 7 is an example of the field identification word score table 3, created from the table of FIG. 6. In the figure, an asterisk (*) marks words that are deleted. "Zendentsu" and "offside" occur only 3 times and 1 time in total, respectively; even if they were registered as field identification words, they would be unlikely to appear in other documents, so they are deleted from the table of FIG. 6. "Tokyo" and "newspaper article", on the other hand, appear evenly in the documents of every classification field and therefore give little help in identifying the field. They are unsuitable as field identification words and are also deleted from the table of FIG. 6. Next, for each field identification word j selected in this way, its frequency x_jk in field k is converted into a score X_jk, in this operation example by (Equation 1). The table shown in FIG. 7 is obtained in this way.

X_jk = (x_jk − M_jk) […] M_jk ... (Equation 1)

M_jk: the theoretical frequency of word j in field k, obtained by (Equation 2).

(■: total number of fields)

FIG. 8 shows an operation example of the classification destination identification device of FIG. 4. From the input document 25 read in by the input device, the keyword extraction unit 17 extracts the keyword table 26. Next, the scores of those keywords in the keyword table that are listed in the field identification word score table (FIG. 7), namely "professional baseball", "Yakult", "night game", and "labor union", are added up for each classification field. If a word appears more than once, its score is added once per occurrence. The result is shown in FIG. 8(C) as the score 27 of each classification code. Here, the classification codes with the first and second highest scores, sports and labor, are written to the classification code candidate table 23 as candidates. The decision between them is therefore made by the production rule application unit 19, which applies the production rule 28 shown in FIG. 8(D) and sets the final classification code 29 to labor.

An embodiment of the present invention has been described above. The embodiment uses classification code production rule #4, which consists of a classification code and a keyword. In principle, however, it is also possible to use only classification code production rules that consist solely of classification codes, such as #1 to #3.

[Effect of the invention]

As described above, according to the present invention, the use of classification code production rules makes it possible to express relationships between classification fields and co-occurrence relationships between words, which could not be expressed by conventional techniques. Input text can therefore be classified with the detailed rules and legends used at classification time taken into account.

[Brief explanation of drawings]

FIG. 1 is a functional block diagram of the present invention; FIG. 2 is a block diagram of an embodiment of the present invention; FIG. 3 is a block diagram of the field feature extraction device shown in FIG. 2; FIG. 4 is a block diagram of the classification destination identification device shown in FIG. 2; FIG. 5 shows an example of classification code production rules; FIG. 6 shows the keyword field frequency table; FIG. 7 shows the field identification word score table; and FIG. 8 shows an operation example of the classification destination identification device.

1: sample document file; 2: field feature extraction device; 3: field identification word score table; 4: classification code production rule dictionary; 5: input document file; 6: classification destination identification device; 7: output document file; 8: Japanese dictionary; 9: thesaurus; 10: input device; 11: keyword extraction unit; 12: field frequency calculation unit; 13: score table calculation unit; 14: keyword table; 15: keyword field frequency table; 16: input device; 17: keyword extraction unit; 18: field score calculation unit; 19: production rule application unit; 20: output device; 21: document memory; 22: keyword table; 23: classification code candidate table; 24: classification code table; 25: input document; 26: keyword table; 27: score of each classification code; 28: applied production rule; 29: final classification code; 100: keyword storage means; 200: classification candidate selection means; 300: production rule storage means; 400: classification candidate determining means.

Patent applicant: Nippon Telegraph and Telephone Corporation

FIG. 5 (example of classification code production rules; condition part C1, C2, ..., Cn -> classification destination):
#1  [Labor], [Transport/Communications]  ->  [Labor]
#2  [Labor], [World]  ->  [World]
#3  [Labor], [Sports]  ->  [Labor]
#4  [Science and technology], 〈Diet〉  ->  [Politics]

Claims

[Scope of Claim] An automatic document classification device that determines the classification of an input text, comprising: keyword description means that describes keywords that appear disproportionately in documents for each predetermined classification; classification candidate selection means that extracts keywords from the input text, collates them with the keywords for each classification, and selects at least one classification candidate; production rule storage means that stores production rules which define in advance, based on the detailed rules and legends of the classification, a set containing a plurality of classifications and one classification corresponding to that set; and classification candidate determining means that determines one classification candidate by collating the at least one classification candidate selected by the classification candidate selection means with the production rules stored in the production rule storage means.