JPH02105973A

JPH02105973A - Automatic classifying device for document

Info

Publication number: JPH02105973A
Application number: JP63258748A
Authority: JP
Inventors: Atsuo Kawai; 河合　敦夫; Masaaki Nagata; 昌明永田; Haruo Kimoto; 木本　晴夫
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1988-10-14
Filing date: 1988-10-14
Publication date: 1990-04-18

Abstract

PURPOSE: To identify a word as belonging to the same set as a field identification word (a word expressing the features of a field), even when a word that has the same concept but differs as a character string appears in an unclassified document, by using semantic categories as a clue for classification. CONSTITUTION: Attention is paid to the semantic categories of words, and the semantic categories that appear disproportionately in each field are used as a new clue for classifying documents. Specifically, the features of each field (the field-by-field bias in the occurrence frequency of keywords and of semantic categories) are recorded in a field identification word score table 3 and a field identification semantic category score table 4. Consequently, when a word (an orthographic variant or a synonym) that has the same concept as a field identification word but differs from it as a character string appears in an unclassified document, the semantic category is used to identify the word as belonging to the same set, thereby providing a clue for classification.

Description

Detailed Description of the Invention

(1) Technical field of the invention

The present invention relates to an automatic document classification device that, for the purpose of building a document database, automatically classifies the documents to be stored in the database.

(2) Prior art

When building a database that contains a large volume of documents such as newspaper articles, patent specifications, and technical papers, a classification code must be assigned to each document as it is entered into the database. Conventionally, this has been done with a method that focuses on words tending to appear disproportionately in a particular classification field.

In this method, the words in already classified documents are processed statistically to determine the words that appear disproportionately in each field (hereinafter called field identification words). The field identification words found in an unclassified document are then used as clues to determine where the document should be classified.

As an example of field identification words, suppose that classification fields such as sports, international, and so on have been set. If the word "Olympics" appears disproportionately in the field "sports" and the word "diplomat" appears disproportionately in the field "international", then "Olympics" and "diplomat" are taken as field identification words.

However, when the written form of a word (its character string) is itself used as the clue for field identification, word sets that could be treated as a single set from the standpoint of field identification become separate field identification words. These are:

(i) orthographic variants (e.g., コンピューター and コンピュータ, or 組合せ and 組み合わせ);
(ii) synonyms (e.g., "computer" and "electronic calculator", or 首相 and 総理大臣, both meaning "prime minister");
(iii) sets of words that express the same concept in a broad sense (e.g., golf, kendo, fencing, and so on all express the concept of sports).

For this reason, even when a word appearing in an unclassified document expresses the same concept as one of the registered field identification words, it is treated as an entirely different character string if its notation differs from that of the field identification word or if it is a synonym. The unclassified document is then regarded as containing no field identification word, and classification becomes impossible. This was the drawback of the conventional method.

(3) Object of the invention

The object of the present invention is to provide an automatic document classification device that overcomes the above drawback of conventional automatic document classification devices by using the semantic categories of words.

(4) Constitution of the invention

(4-1) Features of the invention and differences from the prior art

In the prior art, words that appear disproportionately in each field are registered in the classification device, and documents are classified using these words as clues. The principal feature of the present invention is that, in addition to the prior art, it focuses on the semantic categories of words and classifies documents using, as a new clue, the semantic categories that appear disproportionately in each field.

Consequently, when an unclassified document contains a word that expresses the same concept as a field identification word characterizing a field but differs from it as a character string (an orthographic variant or a synonym), the prior art could obtain no clue for classification. The present invention differs from the prior art in that, by using semantic categories, such a word can be identified as belonging to the same set, so that a clue for classification is obtained.
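The difference can be illustrated with a minimal Python sketch. The word list, the category dictionary, and the function names are hypothetical stand-ins introduced only for illustration, not data from the specification. Exact string matching misses a word that is not itself registered, whereas a lookup through its semantic category still yields a clue.

```python
# Hypothetical field identification words (exact strings) and their field.
FIELD_WORDS = {"golf": "sports", "prime minister": "politics"}

# Hypothetical stand-in for a dictionary giving each word a semantic category.
WORD_TO_CATEGORY = {
    "golf": "sport",
    "kendo": "sport",
    "fencing": "sport",
    "prime minister": "head of government",
    "premier": "head of government",
}

# Hypothetical field identification semantic categories and their field.
FIELD_CATEGORIES = {"sport": "sports", "head of government": "politics"}


def clue_by_word(word):
    """Prior-art style lookup: only an exact string match gives a clue."""
    return FIELD_WORDS.get(word)


def clue_by_category(word):
    """Category lookup: map the word to its category, then to a field."""
    category = WORD_TO_CATEGORY.get(word)
    return FIELD_CATEGORIES.get(category)


for w in ["golf", "kendo", "premier"]:
    print(w, clue_by_word(w), clue_by_category(w))
```

In this toy data, "kendo" and "premier" are not registered as field identification words, so only the category lookup returns a field for them.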

(4-2) Embodiment

FIG. 1 shows an example of the basic configuration of the present invention implemented in hardware. Reference numeral 1 denotes a file of documents with classification codes, which is the sample data used to extract the document features of each field. 2 is a field feature extraction device, which extracts the document features of each field. The features of the documents in each field (the field-by-field bias in the occurrence frequency of keywords and of semantic categories) are recorded in a field identification word score table 3 and a field identification semantic category score table 4, respectively. 5 is a file of documents without classification codes. 6 is a classification destination identification device, which automatically assigns a classification code to each unclassified document and outputs the result to a classification code file 7.

Next, the field feature extraction device is described with reference to FIG. 2. First, keywords 8 are extracted or generated from a document with a classification code read in from the input device 10, via the automatic keyword extraction/generation unit 11. Next, the keyword frequency calculation unit 12 adds the occurrence frequency of each keyword 8 to the field-by-field keyword frequency table 9 according to the classification code. The semantic category search unit 13 searches the semantic category description part of the Japanese dictionary 15 and assigns a semantic category to each keyword 8. The semantic category frequency calculation unit 14 likewise adds the occurrence frequency of the keywords' semantic categories 16 to the field-by-field semantic category frequency table 17 for each classification code. These operations are repeated for every document with a classification code.
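The accumulation step can be sketched in Python as follows. This is only an illustration under stated assumptions: keyword extraction is taken as already done, a plain dict stands in for the dictionary lookup, and the function name and sample data are hypothetical.

```python
from collections import Counter, defaultdict


def build_frequency_tables(labeled_docs, word_to_category):
    """Accumulate per-field frequencies of keywords and semantic categories.

    labeled_docs: iterable of (field, keywords) pairs, where keywords is the
    list of keywords already extracted from one classified document.
    word_to_category: dict standing in for the dictionary lookup (assumption).
    """
    keyword_freq = defaultdict(Counter)   # keyword  -> Counter{field: count}
    category_freq = defaultdict(Counter)  # category -> Counter{field: count}
    for field, keywords in labeled_docs:
        for kw in keywords:
            keyword_freq[kw][field] += 1
            category = word_to_category.get(kw)
            if category is not None:
                category_freq[category][field] += 1
    return keyword_freq, category_freq


# Hypothetical sample data, not taken from the specification.
docs = [
    ("sports", ["golf", "kendo", "golf"]),
    ("economy", ["yen", "golf"]),
]
categories = {"golf": "sport", "kendo": "sport"}
kw_table, cat_table = build_frequency_tables(docs, categories)
print(dict(kw_table["golf"]))
print(dict(cat_table["sport"]))
```

Keeping one table keyed by keyword and another keyed by category mirrors the two frequency tables (9 and 17) described above.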

The word score table calculation unit 18 creates the field identification word score table 3 by deleting unnecessary keywords and converting frequencies into scores. Specifically, the following are deleted from the field-by-field keyword frequency table 9:

(a) keywords whose overall number of occurrences is low;
(b) keywords that appear uniformly across the fields, with little bias in the fields in which they appear.

Next, the frequency of each remaining keyword in each field is converted into a score. The semantic category score table calculation unit 19 likewise creates the field identification semantic category score table 4 from the field-by-field semantic category frequency table 17, by deleting unnecessary semantic categories and converting frequencies into scores.
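The deletion of unnecessary entries might look like the following Python sketch. The text gives no thresholds and no specific measure of bias, so `min_total`, `min_top_share`, and the "share of the most frequent field" criterion are assumptions chosen only for illustration, as are the sample counts.

```python
def filter_frequency_table(freq_table, min_total=5, min_top_share=0.5):
    """Drop rare entries and entries spread evenly over the fields.

    freq_table: dict mapping keyword (or category) -> dict {field: count}.
    min_total and min_top_share are illustrative thresholds (assumptions).
    """
    kept = {}
    for keyword, by_field in freq_table.items():
        total = sum(by_field.values())
        if total == 0 or total < min_total:
            continue  # too few occurrences overall
        top_share = max(by_field.values()) / total
        if top_share < min_top_share:
            continue  # no field dominates, so the entry shows little bias
        kept[keyword] = dict(by_field)
    return kept


# Hypothetical sample counts.
table = {
    "yen rate": {"politics": 5, "economy": 56},
    "offside": {"sports": 1},
    "Tokyo": {"politics": 20, "economy": 21, "sports": 19},
}
print(sorted(filter_frequency_table(table)))
```

The same function can be applied to the semantic category frequency table, since both tables share the same shape in this sketch.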

Next, the classification destination identification device is described with reference to FIG. 3.

From a document without a classification code read in from the input device 21, keywords 20 are extracted or generated by the automatic keyword extraction/generation unit 22. Next, the keyword score addition unit 23 adds up, for each classification field, the scores of those keywords that are listed in the field identification word score table 3. The semantic category search unit 26 searches the semantic category description part of the Japanese dictionary 28 to find the semantic category 29 of each keyword. Next, the semantic category score addition unit 27 adds up, for each classification field, the scores of those semantic categories 29 that are listed in the field identification semantic category score table 4. The field determination unit 24 simply sums, for each field, the scores from the keyword score addition unit and the semantic category score addition unit, and selects the field with the highest score as the classification destination of the document. The output device 25 then writes the classification destination to the classification code file 7.
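The scoring and field determination can be sketched in Python as below. The score tables, the category dictionary, and the function name are hypothetical, and counting a repeated keyword once per occurrence is how this sketch handles duplicates. It is an illustration of the simple per-field summation described above, not the device's actual implementation.

```python
from collections import defaultdict


def classify(keywords, word_scores, category_scores, word_to_category):
    """Return (best_field, totals) for one unclassified document.

    word_scores / category_scores: dict mapping entry -> {field: score}.
    word_to_category: dict standing in for the dictionary lookup (assumption).
    """
    totals = defaultdict(float)
    for kw in keywords:
        # Keyword score addition: only keywords listed in the table count.
        for field, score in word_scores.get(kw, {}).items():
            totals[field] += score
        # Category score addition: only categories listed in the table count.
        category = word_to_category.get(kw)
        for field, score in category_scores.get(category, {}).items():
            totals[field] += score
    if not totals:
        return None, {}
    best = max(totals, key=totals.get)
    return best, dict(totals)


# Hypothetical tables and input.
word_scores = {"yen rate": {"economy": 3.0, "politics": -0.5}}
category_scores = {"sport": {"sports": 4.0, "economy": -1.0}}
word_to_category = {"kendo": "sport", "fencing": "sport"}

best, totals = classify(
    ["kendo", "fencing", "match"], word_scores, category_scores, word_to_category
)
print(best, totals)
```

Here none of the input keywords is a registered field identification word, so the decision rests entirely on the semantic category scores.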

FIG. 4 illustrates an example of field-by-field keyword frequencies. In this example, ten classification fields are set: politics, economy, science, ..., sports, and international. The figure shows how many times each keyword appeared in the documents of each field. The keyword "yen exchange rate" appears 5 times in articles in the politics field and 56 times in articles in the economy field. FIG. 5 illustrates an example of field identification word scores and was created from FIG. 4. "Zendentsu" and "offside" have low overall frequencies, 3 and 1 occurrences respectively, so even if they were registered as field identification words they would be unlikely to appear in other documents; they are therefore deleted from FIG. 4. "Tokyo" and "newspaper article" appear evenly in the documents of every classification field, so they provide little clue for identifying the field; they are inappropriate as field identification words and are likewise deleted from FIG. 4.

Next, the frequency X_{jk} of each field identification word j selected in this way, in field k, is converted into a score Y_{jk}; in this operation example, (Formula 1) is used. This yields the scores shown in FIG. 5.

Y_{jk} = (X_{jk} - M_{jk}) / M_{jk}    ... (Formula 1)

M_{jk}: the theoretical frequency of word j in field k, obtained from (Formula 2).
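The conversion can be sketched in Python, taking (Formula 1) as Y = (X - M) / M. The text here does not reproduce (Formula 2), so the sketch assumes the usual expected frequency under independence, M_{jk} = (total of word j) x (total of field k) / (grand total). That choice, the function name, and the second row of the sample table are assumptions made only for illustration.

```python
def scores_from_frequencies(freq_table):
    """Convert frequencies X[j][k] into scores Y[j][k] = (X - M) / M.

    M[j][k] is computed as row_total[j] * col_total[k] / grand_total.
    This is an assumption: the source text does not reproduce (Formula 2).
    """
    fields = sorted({f for row in freq_table.values() for f in row})
    row_total = {j: sum(row.values()) for j, row in freq_table.items()}
    col_total = {
        k: sum(row.get(k, 0) for row in freq_table.values()) for k in fields
    }
    grand_total = sum(row_total.values())
    if grand_total == 0:
        return {}
    scores = {}
    for j, row in freq_table.items():
        scores[j] = {}
        for k in fields:
            m = row_total[j] * col_total[k] / grand_total
            if m > 0:  # skip cells whose theoretical frequency is zero
                scores[j][k] = (row.get(k, 0) - m) / m
    return scores


# Sample table: the first row uses the counts quoted for "yen exchange rate";
# the second row is hypothetical.
freq = {
    "yen rate": {"politics": 5, "economy": 56},
    "election": {"politics": 40, "economy": 10},
}
scores = scores_from_frequencies(freq)
print(scores["yen rate"])
```

With this definition a word scores above zero in fields where it occurs more often than expected and below zero where it occurs less often, which matches the intent of rewarding field-specific bias.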

FIG. 6 is an explanatory diagram showing an example of the relationship between semantic categories and keywords. For example, "House of Councillors", "Ministry of Posts and Telecommunications", and "city hall" are grouped by meaning under "administrative organ". FIG. 7 illustrates an example of field-by-field semantic category frequencies. FIG. 8 illustrates an example of field identification semantic category scores, created from the frequencies shown in FIG. 7 by deleting unnecessary semantic categories and converting frequencies into scores. FIGS. 7 and 8 are obtained in the same way as FIGS. 4 and 5.

FIG. 9 illustrates an operation example of the classification destination identification device of FIG. 3. From the input document 30 read in from the input device, keywords 31 are extracted or generated by the automatic keyword extraction/generation unit. In the figure, semantic categories are shown enclosed in brackets [ ]. Next, the semantic category search unit looks up the semantic category 32 of each keyword.

Then, (a) among the keywords, the scores of those listed in the field identification word scores (FIG. 5) are added up for each classification field. Here, however, none of the keywords is listed in the field identification word scores, so the score for every field is 0. (b) The only semantic category listed in the field identification semantic category scores (FIG. 8) is [sports], so the scores of [sports] are added up for each classification field. When the same semantic category or word appears more than once, its score is added once per occurrence. In the field-by-field score table 33, the scores from (a) and (b) are summed for each field; in the example of FIG. 9, the field with the highest score, "sports", is selected as the classification destination of the document.

(5) Effects of the invention

As described above, according to the present invention, the semantic categories of words are used as clues for classification. Therefore, even when an unclassified document contains a word that has the same concept as a word characterizing a field (a field identification word) but differs from it as a character string, the word can be identified as belonging to the same set, and a clue for classification is obtained.

[Brief description of the drawings]

FIG. 1 shows an example of the basic configuration of the present invention, FIG. 2 shows the configuration of the field feature extraction device, FIG. 3 shows the configuration of the classification destination identification device, FIGS. 4 to 8 are explanatory diagrams for explaining the operation, and FIG. 9 shows an operation example.

1 ... Document file with classification codes
2 ... Field feature extraction device
3 ... Field identification word score table
4 ... Field identification semantic category score table
5 ... Document file without classification codes
6 ... Classification destination identification device
7 ... Classification code file
8 ... Keyword
9 ... Field-by-field keyword frequency table
10 ... Input device
11 ... Automatic keyword extraction/generation unit
12 ... Keyword frequency calculation unit
13 ... Semantic category search unit
14 ... Semantic category frequency calculation unit
15 ... Japanese dictionary
16 ... Semantic category of keyword
17 ... Field-by-field semantic category frequency table
18 ... Word score table calculation unit
19 ... Semantic category score table calculation unit
20 ... Keyword
21 ... Input device
22 ... Automatic keyword extraction/generation unit
23 ... Keyword score addition unit
24 ... Field determination unit
25 ... Output device
26 ... Semantic category search unit
27 ... Semantic category score addition unit
28 ... Japanese dictionary
29 ... Semantic category of keyword
30 ... Input document
31 ... Keyword
32 ... Semantic category
33 ... Field-by-field score table

Claims

[Scope of Claims] In a natural language processing system that handles a Japanese document database input from a document input device, an automatic document classification device characterized by comprising: a field feature extraction device that examines the keywords, and the semantic categories of keywords, that appear disproportionately in the documents of each classification field; a field identification word score table that records the keywords that appear disproportionately in the documents of each classification field; a field identification semantic category score table that records the semantic categories of words that appear disproportionately in the documents of each classification field; and a classification destination identification device that determines the classification destination of a document to be classified from the keywords appearing in that document and their semantic categories, using the field identification word score table and the field identification semantic category score table.