JPH06282587A

JPH06282587A - Automatic classifying method and device for document and dictionary preparing method and device for classification

Info

Publication number: JPH06282587A
Application number: JP5087774A
Authority: JP
Inventors: Masanori Nakamura; 正規中村; Keizo Uchiyama; 恵三内山
Original assignee: Tokyo Electric Power Co Inc
Current assignee: Tokyo Electric Power Company Holdings Inc
Priority date: 1993-03-24
Filing date: 1993-03-24
Publication date: 1994-10-07

Abstract

PURPOSE: To achieve accurate automatic classification that takes the semantic content of a text into account, by determining the type of each keyword while extracting it and treating keywords as identical only when their types also match. CONSTITUTION: A document input through an input device 1 is sent to a keyword extraction part 2, which extracts nouns together with their type (subject, object, or other); keywords that are identical down to the type are treated as the same keyword. The extracted keywords are stored in a keyword file 3, then read out and passed to a character-string-matching keyword extraction part 4. There, keywords that contain the characters of a base keyword in the same order of appearance are recognized as identical and stored in a character-string-matching keyword file 5. In addition, count parts 6 and 7 count the number of documents in which each keyword and each keyword pair appear; the counts are sent to a feature value calculation part 8, which computes the feature values used to build the dictionary.

Description

Detailed Description of the Invention

[0001]

Field of Industrial Application: The present invention relates to a method and apparatus for automatically classifying documents, and to a method and apparatus for creating a dictionary for classification.

[0002]

Background of the Invention: When various documents are filed to construct a database, a classification must be decided for, and keywords assigned to, each document stored in the database. This is important work, because its accuracy determines how convenient the database is to use afterward. However, if a human performs this classification, the person must read and understand the entire text before deciding on a classification. This not only makes the work extremely laborious but also causes results to vary from operator to operator, so accuracy (reliability) suffers.

[0003] Various systems have therefore been developed to classify such documents automatically. One of them, such as the intelligent search method disclosed in Japanese Patent Laid-Open No. 3-2009563, creates a semantic dictionary representing the semantic relationships of a given field and then searches unknown data based on that dictionary.

[0004] However, this method requires the semantic dictionary to be built in advance through manual analysis. A different field requires the analysis to be repeated for that field, so the processing is laborious.

[0005] Another approach, typified by the "Fuzzy Document Retrieval System" (Proceedings of the 39th National Convention of the Information Processing Society of Japan, pp. 1067-1068), searches using a dictionary built by statistically processing the notational characteristics of documents in a specific field. This approach starts from the observation that keyword pairs contained together in one document have some conceptual connection when the document is regarded as an established concept, and it quantifies the closeness between documents (turns it into a dictionary) using the co-occurrence frequency of keyword pairs. The higher the closeness, the more confidently the documents can be regarded as belonging together. A dictionary is therefore created in advance from documents whose fields are known, the closeness between a new document and the documents of known field is computed, and the field of the document with the highest closeness is assigned to the new document.

[0006] Compared with the former method, this method is highly useful in that the dictionary can be built without human intervention, so dictionaries can also be created automatically for different fields. However, because it does not go as far as understanding the meaning of the text, the generated closeness (a fuzzy value) has low accuracy (a wide ambiguous range). As a result, when closely similar documents are to be sorted into their respective appropriate fields, they may fall within the ambiguous range and fail to be classified correctly.

[0007] Furthermore, the above approach, in which "the more common keyword pairs two documents share, the higher their similarity (the higher the probability that they belong to a common field)", presupposes exact matching when deciding whether keywords are the same. As it stands, keywords that should be treated as having the same meaning, such as "○×電力" ("○× Electric Power") and its abbreviation "○電", or "領収書" ("receipt") and "電気料金領収書" ("electricity bill receipt") in an electricity-related field, are treated as different keywords merely because they are written differently. This lowers accuracy further.

[0008] The present invention has been made in view of the above background. Its object is to provide a method and apparatus for automatically classifying documents, and a method and apparatus for creating a dictionary for classification, that perform more accurate automatic classification by taking into account the semantic content of the text and differences in notation, and that can automatically form the dictionary used for that classification.

[0009]

Means for Solving the Problems: To achieve the above object, in the method of creating a dictionary for automatic document classification according to the present invention, keywords are extracted from the words and phrases making up each of a plurality of documents of known field, while being classified by type, such as subject or object. The number of documents in which each extracted keyword appears is obtained, as is the number of documents in which each keyword pair (any two keywords appearing together) appears. The distance between the two keywords forming a keyword pair is then calculated from the keyword document counts and the keyword pair document count. Next, the degree of dependence of each keyword pair on the field is calculated from that distance, and at least the relationship between the keyword pairs and their degrees of dependence in that field is stored in the dictionary.

[0010] Preferably, when the number of documents in which a keyword pair appears is obtained, groups of keywords expressing the same meaning are obtained in advance, and any keyword belonging to the same group is also added to that document count.

[0011] A suitable apparatus for producing the above dictionary comprises: an input device for inputting documents of known field; means for extracting keywords from the words and phrases making up a document supplied via the input device, while classifying them by type, such as subject or object; first counting means for receiving the output of the keyword extracting means and counting the number of documents of the same field that contain the same keyword; second counting means for receiving the output of the keyword extracting means and counting the number of documents of the same field that contain the same keyword pair; and means for receiving the outputs of both counting means, obtaining the distance between the two keywords forming the keyword pair, calculating the degree of dependence of the keyword pair on the field, and storing the obtained degree of dependence of the keyword pair in that field in the dictionary. More preferably, the apparatus further comprises means for receiving the output of the keyword extracting means and obtaining groups of keywords expressing the same meaning, and the second counting means receives the outputs of the group-obtaining means and the keyword extracting means and, if there is a keyword belonging to the same group, adds it to the number of documents in which the keyword pair appears.

[0012] In the automatic document classification method according to the present invention, keywords are extracted from the words and phrases making up an input document of unknown field, while being classified by type, such as subject or object. For each extracted keyword, the relationship indicating whether it matches a keyword appearing in a given field stored in the dictionary produced by the above dictionary creation method is obtained. Next, this relationship is multiplied by the degree of dependence of each keyword pair of that field stored in the dictionary, giving the degree of dependence in that field of each keyword extracted from the input document. The ambiguity with respect to each field is obtained from these degrees of dependence, and the field with the minimum ambiguity is determined to be the field to which the input document belongs.

[0013] An apparatus for carrying out this method comprises: an input device for inputting a document of unknown field; means for extracting keywords from the words and phrases making up the document supplied via the input device, while classifying them by type, such as subject or object; a dictionary in which data has been stored by the dictionary producing apparatus described above; means, connected to the keyword extracting means and the dictionary, for obtaining the relationship indicating whether each extracted keyword matches a keyword appearing in a given field stored in the dictionary; dependence calculating means for receiving the relationship output from that means and the degree of dependence of each keyword pair of that field stored in the dictionary, and multiplying the two to calculate the degree of dependence on that field of each keyword extracted from the input document; ambiguity calculating means for receiving the output of the dependence calculating means and obtaining the ambiguity of the document with respect to each field; and determining means for receiving the output of the ambiguity calculating means, comparing the ambiguities of the fields, and detecting the field with the minimum ambiguity.

[0014]

Operation: The closeness between documents is quantified (turned into a dictionary) using the co-occurrence frequency of keyword pairs in the documents. Exploiting the fact that the higher the closeness, the more confidently documents can be regarded as belonging together, a dictionary is created for each field from documents of known field, and the field of the dictionary closest to a new document is determined to be that document's field. When keywords are extracted, their type (subject, object, other, and so on) is determined at the same time, and two keywords are treated as identical only when their types also match, so the dictionary is created with the semantic content of the text taken into account. A highly accurate dictionary close to the semantic content of the documents is thus created, and documents of unknown field that are subsequently processed with that dictionary are classified accurately.

[0015] In addition, if groups of keywords expressing the same meaning are obtained when the appearances of keyword pairs are counted, and keywords belonging to the same group are added to the document count, a correct dictionary can be created without being affected by differences in notation, and more accurate automatic classification is performed.

[0016]

Embodiments: Preferred embodiments of the automatic document classification method and apparatus and of the classification dictionary creation method and apparatus according to the present invention are described in detail below with reference to the accompanying drawings. FIG. 1 shows an embodiment of the dictionary creating apparatus according to the present invention. As shown in the figure, documents of known field are input via an input device 1 such as a keyboard or OCR. The input document is sent to the keyword extraction unit 2 in the next stage, which extracts the nouns present in the document.

[0017] In the present invention, keyword extraction does not simply pick out nouns. It looks at the particle or case particle immediately following each noun and extracts the noun's type (subject, object, other) along with the noun. This shows how the extracted noun (keyword) is used in the document, which helps capture the semantic content.

[0018] The keyword extraction unit 2 in this example has the functions carried out by the flow shown in FIG. 2. First, word segmentation (wakachigaki), which divides the input document into individual words, is performed (S101). Next, phrase segmentation is performed on the segmented result. This process finds, among the words produced by word segmentation, the words that mark phrase boundaries according to predetermined rules, and re-cuts the text into phrases (bunsetsu) (S102).

[0019] Once the phrases have been cut out in this way, important phrases are extracted from them according to the following extraction rules (S103).
(1) Phrases ending in "が" (ga), "は" (wa), "を" (wo), "に" (ni), "で" (de), "や" (ya), or "も" (mo) are extracted. This extracts phrases that contain a subject or object.
(2) Phrases whose final word is not "mixed with hiragana" are taken. This rule relies on the fact that phrases such as verb phrases end in hiragana, and it excludes verbs and other words that are unlikely to become keywords.

[0020] Once the important phrases have been extracted by the above processing, word segmentation is applied to them again to re-divide them into words. Keyword (noun) candidates are then extracted by removing the words that are unlikely to become keywords: verbs, particles, auxiliary verbs, and adjectives such as "すばらしい" ("wonderful") (S104).

[0021] Next, the word following each keyword candidate extracted by the processing up to S104 is examined in order to classify the keyword by type (subject, object, other) (S105). If the word following the keyword is "が" (ga) or "は" (wa), the keyword is a "subject". If it is "を" (wo), "と" (to), "で" (de), "や" (ya), or "も" (mo), the keyword is an "object". All keywords that match neither case are "other".
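
The type rule of step S105 can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the particle sets come from the paragraph above, while the function names and the input shape (a list of keyword candidates paired with the word that follows each one) are assumptions made for the example.

```python
# Particle sets taken from the S105 rule described above.
SUBJECT_PARTICLES = {"が", "は"}
OBJECT_PARTICLES = {"を", "と", "で", "や", "も"}


def classify_keyword(next_word: str) -> str:
    """Return the keyword type implied by the word that follows the keyword."""
    if next_word in SUBJECT_PARTICLES:
        return "subject"
    if next_word in OBJECT_PARTICLES:
        return "object"
    return "other"


def typed_keywords(candidates):
    """candidates: iterable of (keyword, following_word) pairs (hypothetical shape).

    Returns a set of (keyword, type) tuples, so a typed keyword that occurs
    several times in one document is kept only once.
    """
    return {(kw, classify_keyword(nxt)) for kw, nxt in candidates}


example = typed_keywords([("○電", "が"), ("○電", "を"), ("○電", "の"), ("○電", "は")])
```

Here the same surface string "○電" yields three distinct typed keywords (subject, object, other), which is the separation the text describes.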

[0022] Because a keyword (with type) extracted in this way may appear several times in the same document, duplicate keywords are deleted so that only one instance remains per document. This completes the extraction of keywords (with types).

[0023] FIG. 3 shows an example of the extraction result (the sample documents are 45 documents in the same field, namely opinions about TV commercials). As the figure shows, keywords that the conventional method would have treated as the same are clearly separated. For example, "○電 (other)" in document 10 and "○電 (object)" in document 22 were conventionally treated as the same but are treated as different in the present invention. Likewise, where the same "○電" is used in different ways within one document, as "subject", "object", and "other" in document 2, the conventional method treated them together as one word, whereas the present invention treats them as three separate words. In other words, only keywords that are identical down to the type are handled as the same keyword, for example "○電 (other)" in documents 2, 3, and 10, or "○電 (object)" in documents 22 and 41.

[0024] In FIG. 3 the keywords are written as concrete character strings. For high-speed processing, however, an ID number may be assigned to each keyword and the subsequent processing, such as deciding whether keywords match, may be performed on the ID numbers, in which case an ID number management file is also provided separately (the same applies hereinafter).

[0025] The keywords (with types) extracted in this way are stored, document by document, in the keyword file 3 in the next stage. Further, in the present invention, in order to find keywords that differ in notation but should be treated as having the same meaning, such as "○電" and "○×電力", the keywords stored in the keyword file 3 are read out and passed to the character-string-matching keyword extraction unit 4, which serves as the means for obtaining keyword groups. This unit groups the keywords (with types) that express the same meaning. That is, it creates a table of each base keyword (comparison keyword) and the keywords that belong to the same group as it (element keywords).

[0026] Specifically, this example looks at the order in which characters appear in each keyword: a keyword that contains the characters of the base keyword in the same order of appearance is regarded as the same keyword (an element keyword). As long as the order of appearance is the same, other characters may lie between those characters.

[0027] That is, as shown in FIG. 4(A), if "○電" is the base (comparison keyword), every keyword in which the characters "○" and "電" appear in that order belongs to the same group (element keywords) and is counted: not only the exact match "○電" but also "○×電力". However, as shown in FIG. 4(B), if the base is "○×電力", its abbreviation "○電" is not counted because it does not contain "×" and "力". With this scheme, as shown in FIG. 4(C), "POINTER", which is unrelated in meaning to "PR", is also counted. Even if such keywords end up in the same group, this is not much of a problem, because they are unlikely to appear as keyword pairs and have little influence on later processing such as the feature value calculation. On the other hand, as shown in FIG. 4(D), if the base is "CI", then "CI" and "CIマーク" ("CI mark") are counted as the same group for the reason given above, but "IC" is not counted because its characters appear in a different (reversed) order.
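
The matching rule illustrated by FIG. 4 amounts to an ordered-subsequence test: every character of the base keyword must occur in the candidate, in the same order, with gaps allowed. The following Python sketch shows that test; the function name is an assumption for illustration, and the type (subject, object, other) check that the method also requires is left out here.

```python
def is_element_keyword(base: str, candidate: str) -> bool:
    """True if the characters of `base` appear in `candidate` in the same order.

    Other characters may sit between them. `ch in it` advances the iterator
    past the first occurrence of `ch`, so later characters of `base` can only
    match later positions of `candidate`.
    """
    it = iter(candidate)
    return all(ch in it for ch in base)


# The four cases described for FIG. 4 (A) to (D):
case_a = is_element_keyword("○電", "○×電力")     # abbreviation as base
case_b = is_element_keyword("○×電力", "○電")     # full name as base
case_c = is_element_keyword("PR", "POINTER")      # unrelated but matching
case_d = (is_element_keyword("CI", "CIマーク"), is_element_keyword("CI", "IC"))
```

The test is asymmetric, which is why the base "○電" picks up "○×電力" while the base "○×電力" does not pick up "○電".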

[0028] In practice, the groups are obtained by comparing the character sequences of the read-out keywords with one another. When the character-string-matching keywords are extracted, all the documents that make up one field are handled together as a whole. The character-string-matching keywords extracted in this way (a table of comparison keyword : element keywords) are stored in the character-string-matching keyword file 5.

[0029] Note that if "CI (object)" in document 10 of FIG. 3 is the base, "CIマーク (subject)" in document 1 would match according to FIG. 4(D). However, because their types, "object" and "subject", differ in the first place, they are treated as different keywords and the latter is not stored in the character-string-matching keyword file 5. If the full set of documents is assumed to be the eight documents shown in FIG. 3, the character-string-matching keyword file holds the data in the state shown in FIG. 5.
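
Combining the character-order test with the type check gives the comparison-keyword-to-element-keywords table. The sketch below is a hedged illustration of that table: the function names, the in-memory dictionary standing in for file 5, and the sample keywords are all assumptions, not the contents of FIG. 3 or FIG. 5.

```python
def is_subsequence(base: str, candidate: str) -> bool:
    """True if the characters of `base` occur in `candidate` in the same order."""
    it = iter(candidate)
    return all(ch in it for ch in base)


def build_group_table(typed_keywords):
    """Map each (text, type) base keyword to its set of element keywords.

    typed_keywords: all (text, type) keywords of one field, pooled across
    that field's documents. An element keyword must have the same type as
    the base and contain the base's characters in order.
    """
    unique = set(typed_keywords)
    return {
        (b_text, b_kind): {
            (c_text, c_kind)
            for (c_text, c_kind) in unique
            if c_kind == b_kind and is_subsequence(b_text, c_text)
        }
        for (b_text, b_kind) in unique
    }


# Hypothetical pooled keywords for one field.
table = build_group_table([
    ("CI", "object"), ("CIマーク", "subject"), ("CIマーク", "object"),
    ("○電", "other"), ("○×電力", "other"),
])
```

With this data, "CIマーク (subject)" is excluded from the group of "CI (object)" even though the characters match, because the types differ.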

[0030] The output of the keyword file 3 is connected to the keyword count unit 6, which serves as the first counting means, and to the keyword pair count unit 7, which serves as the second counting means. The output of the character-string-matching keyword file 5 is also connected to the keyword pair count unit 7. The keyword count unit 6 obtains the number of documents in which each keyword appears (is present).
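
The count produced by the keyword count unit 6 is a document frequency: each document contributes at most one to a keyword's count. A minimal sketch, assuming the keyword file is represented as a mapping from a document ID to that document's typed keywords (the function name and data are hypothetical):

```python
from collections import Counter


def count_keyword_documents(documents):
    """Return N(x): the number of documents containing each typed keyword.

    documents: mapping of document ID -> iterable of (text, type) keywords.
    Converting to a set ensures a keyword repeated within one document is
    counted once for that document.
    """
    counts = Counter()
    for keywords in documents.values():
        counts.update(set(keywords))
    return counts


# Hypothetical keyword file contents.
keyword_counts = count_keyword_documents({
    1: [("○電", "other"), ("○電", "other"), ("CI", "object")],
    2: [("○電", "other")],
})
```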

[0031] The keyword pair count unit 7 obtains the number of documents in which the same keyword pair appears together. Here, a keyword pair is a combination of two different keywords present in one document; the first keyword of the combination is the reference keyword and the second is the comparison keyword. When counting, the reference keyword must match exactly, whereas for the comparison keyword the words identified as the same by the character-string matching described above (element keywords) are also counted.

[0032] That is, a combination table of the keywords present in the same document is created, and the data stored in files 3 and 5 are accessed based on that table. For each reference keyword Xi, whenever there is a document containing a combination in which an element keyword matches the comparison keyword Xj, the count for that keyword combination (keyword pair) is incremented. This gives the number of documents in which each keyword pair appears.

[0033] As an example, assume that the full set of documents is the eight shown in FIG. 3. For the keyword pair consisting of "○電 (object)" (reference keyword) and "water heater (other)" (comparison keyword), exact matching gives a document count of 1 (document 22 only), but in this example, which also takes element keywords into account, the count is 2 (documents 22 and 42). Similarly, the document count for the keyword pair "commercial (object), ○電 (other)" is 2 (documents 3 and 10). The "commercial" in document 4 is not counted because its type is "subject".
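
The pair-counting rule (exact match on the reference keyword, element-keyword match on the comparison keyword) can be sketched as follows. This is an illustration under stated assumptions rather than the patent's procedure: the function names are hypothetical, the sample documents are invented to mirror the example above (in particular, the keyword "電気温水器" in document 42 is a guess, since FIG. 3 is not reproduced here), and excluding the reference keyword from the element set is a design choice made so that a pair always involves two different keywords.

```python
from collections import Counter
from itertools import permutations


def is_subsequence(base: str, candidate: str) -> bool:
    """True if the characters of `base` occur in `candidate` in the same order."""
    it = iter(candidate)
    return all(ch in it for ch in base)


def count_pair_documents(documents):
    """Return N(xi, xj) for every ordered pair of typed keywords.

    documents: mapping of document ID -> set of (text, type) keywords.
    A document counts for (xi, xj) if it contains xi exactly and contains
    some element keyword of xj (same type, characters of xj in order).
    """
    vocab = set().union(*documents.values())
    counts = Counter()
    for xi, xj in permutations(vocab, 2):
        elements = {k for k in vocab
                    if k[1] == xj[1] and is_subsequence(xj[0], k[0])}
        elements.discard(xi)  # assumption: the pair must use two different keywords
        n = sum(1 for kws in documents.values() if xi in kws and elements & kws)
        if n:
            counts[(xi, xj)] = n
    return counts


# Invented data shaped like the example in the text.
docs = {
    22: {("○電", "object"), ("温水器", "other")},
    42: {("○電", "object"), ("電気温水器", "other")},
    3: {("コマーシャル", "object"), ("○電", "other")},
    10: {("コマーシャル", "object"), ("○電", "other")},
    4: {("コマーシャル", "subject"), ("○電", "other")},
}
pair_counts = count_pair_documents(docs)
```

With this data the pair ("○電", object) and ("温水器", other) is found in two documents, because "電気温水器" contains the characters of "温水器" in order, while document 4 adds nothing to the pair that has "コマーシャル (object)" as its reference keyword.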

[0034] The document counts for each keyword and each keyword pair obtained by the count units 6 and 7 are sent to the feature value calculation unit 8, which obtains the feature values used to build the dictionary. First, the distance L(xi, xj) between the reference keyword (xi) and the comparison keyword (xj) that form a keyword pair is obtained; then the keyword fuzzy set f(x) is obtained using that distance L(xi, xj). Specifically, they are calculated with the following formulas.

[0035] First, the distance L between keywords quantifies closeness using the co-occurrence frequency of the keyword pair, and is obtained with formula (1) below.

[0036]

$$L(x_i, x_j) = \frac{N(x_i) + N(x_j) - N(x_i, x_j)}{N(x_i, x_j)} \qquad (1)$$

N(xi): number of documents containing the reference keyword xi (output from count unit 6)
N(xj): number of documents containing the comparison keyword xj (output from count unit 6)
N(xi, xj): number of documents containing both xi and xj (output from count unit 7)

Next, the value calculated above is substituted into formula (2) below to obtain the fuzzy set f(x), that is, the degree of dependence of the comparison keyword xj on the reference keyword xi.

[0037]

$$f(x) = \exp\left(-a\,\left|L(x_i, x_j)^2\right|\right) \qquad (2)$$

In formula (2), "a" indicates the strength of analogy and takes an arbitrary value such that the degree of dependence of the keyword pair with the longest distance L(xi, xj) is 0.05 or less. In this example, the distance L(xi, xj) of every keyword pair is obtained with formula (1) for all the documents of the same field; the maximum of those distances is substituted into formula (2) together with 0.05 for the degree of dependence f(x), and solving for "a" determines the value of "a" for that field. The resulting degree of dependence takes a value between 0 and 1, and the closer it is to 1, the more prominently the pair represents the conceptual characteristics of the field.
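
Solving formula (2) for "a" with f(x) = 0.05 at the maximum distance gives a = -ln(0.05) / L_max². The Python sketch below applies that calibration and then evaluates formula (2); the function names and the sample distances are assumptions for illustration.

```python
import math


def calibrate_a(max_distance: float, floor: float = 0.05) -> float:
    """Solve exp(-a * L_max**2) = floor for a, as described for formula (2)."""
    return -math.log(floor) / max_distance ** 2


def dependence(a: float, distance: float) -> float:
    """Degree of dependence f(x) = exp(-a * |L**2|) of formula (2)."""
    return math.exp(-a * abs(distance ** 2))


# Hypothetical field whose most distant keyword pair has L = 4.0.
a = calibrate_a(4.0)
weakest = dependence(a, 4.0)    # equals the 0.05 floor by construction
stronger = dependence(a, 1.0)   # closer pairs score nearer to 1
```

Calibrating per field in this way pins the weakest pair of each field at 0.05, so degrees of dependence from different fields share the same 0 to 1 scale.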

【００３８】[0038]

The output of the feature amount calculating unit 8 is connected to the classification dictionary 9, and the feature amounts obtained as described above are stored in the classification dictionary 9. For automatic classification, it is sufficient that at least the dependency is stored in association with each keyword pair. However, as described later, once the field of a new document has been determined from the data stored in the classification dictionary 9, the contents of the classification dictionary 9 are updated using the data of that new document. So that the dictionary can be updated more accurately from a larger and more recent body of documents, this example also stores, in addition to the dependency, the keyword document counts N(xi) and N(xj), the number of documents N(xi, xj) in which the keyword pair appears, and the distance L(xi, xj). An example of the stored data structure is shown in FIG. 6. The classification dictionary 9 and the keyword file 3 together constitute the dictionary used for automatic classification.
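The stored items listed above can be pictured as one record per keyword pair, grouped by field. The sketch below is only an assumed layout: the actual arrangement of FIG. 6 is not reproduced here, and the names `PairEntry` and `ClassificationDictionary`, the field label, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class PairEntry:
    """One dictionary record for a (reference, comparison) keyword pair."""
    n_reference: int   # N(xi): documents containing the reference keyword
    n_comparison: int  # N(xj): documents containing the comparison keyword
    n_pair: int        # N(xi, xj): documents containing both keywords
    distance: float    # L(xi, xj) from equation (1)
    dependency: float  # f(x) from equation (2)

# field name -> {(reference keyword, comparison keyword) -> record}
ClassificationDictionary = Dict[str, Dict[Tuple[str, str], PairEntry]]

dictionary: ClassificationDictionary = {"field-1": {}}
# Hypothetical values for illustration only.
dictionary["field-1"][("cm", "music")] = PairEntry(10, 8, 6, 2.0, 0.72)
```

Keeping the raw counts next to the derived values is what allows the distance and dependency to be recomputed when new documents are added later.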

【００３９】[0039]

Next, an embodiment of the method of the present invention (the dictionary creating method) is described using the embodiment above. As shown in FIG. 7, a plurality of sample documents whose fields are known are used (several sample documents are prepared for each field), and each sample document is input, together with its field, to the dictionary creating device via the input device 1 (S201). Keywords (with their types) are then extracted from each document, and element keywords that express the same meaning as a given keyword are obtained using the character string matching keyword extracting unit 4 (S202, S203).

【００４０】[0040]

The counting units 6 and 7 then obtain, for each field, the number of documents containing each keyword and the number of documents containing each keyword pair, that is, documents in which the reference keyword and the comparison keyword appear together (S204). In this example, the comparison keyword is taken to include its element keywords when keyword pairs are obtained. Alternatively, the element keywords may be disregarded so that the comparison keyword, like the reference keyword, requires an exact match.
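The counting step (S204) can be sketched as follows for a single field. This is the exact-match variant mentioned above, so element keyword groups are not modelled. Representing a keyword as a `(word, type)` tuple, the function name, and the sample documents are assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations

def count_documents(documents):
    """Count, for one field, the documents containing each keyword and each ordered pair."""
    keyword_docs = Counter()  # N(x): documents containing keyword x
    pair_docs = Counter()     # N(xi, xj): documents containing both xi and xj
    for keywords in documents:
        unique = set(keywords)  # a keyword counts once per document
        keyword_docs.update(unique)
        # Ordered pairs, so each keyword can serve as the reference keyword.
        pair_docs.update(permutations(sorted(unique), 2))
    return keyword_docs, pair_docs

# Hypothetical sample documents; each keyword is a (word, type) tuple.
documents = [
    [("cm", "subject"), ("music", "object")],
    [("cm", "subject")],
    [("cm", "subject"), ("music", "object"), ("cm", "subject")],
]
keyword_docs, pair_docs = count_documents(documents)
```

Because the type is part of the key, the same word used as a subject and as an object is counted as two different keywords.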

【００４１】[0041]

The document counts obtained for the keywords and keyword pairs are then given to the feature amount calculating unit 8, which calculates the distance between each pair of keywords (between the reference keyword and the comparison keyword) and, from that distance, the dependency between the keywords. The calculated results are stored in the classification dictionary 9, and processing ends (S205 to S207).

【００４２】[0042]

As a concrete example, 45 documents in a single field (opinions about TV commercials, "TVCM") were input to the dictionary creating device as sample documents. Part of the resulting dictionary is shown in FIG. 6. Part of the dictionary obtained from the same sample documents by the conventional method (keywords are not classified by type, and the number of documents for a keyword pair is counted by exact match of the comparison keyword) is shown in FIG. 8. As FIGS. 6 and 8 show, the two methods produce dictionaries with different contents. The accuracy of a dictionary cannot be judged simply by inspecting it. It can be judged, however, by inputting the 45 sample documents used to create the dictionary into the automatic classification device described next and obtaining the degree to which those documents belong to the field (the ambiguity). This confirmed that the dictionary created according to the present invention gives a higher degree of belonging (a lower ambiguity).

【００４３】[0043]

FIG. 9 shows an embodiment of the automatic classification device according to the present invention. As shown in the figure, the device has an input device 1′ and a keyword extracting unit 2′ similar to those of the dictionary creating device shown in FIG. 1, together with the keyword file 3 and the classification dictionary 9 created by the dictionary creating device. The outputs of the keyword extracting unit 2′ and the keyword file 3 are input to the fuzzy relation creating unit 10. The fuzzy relation creating unit 10 compares the reference keywords of the fuzzy set of each field (obtained from the keyword file 3) with the keywords extracted from the new document of unknown field (supplied by the keyword extracting unit 2′). Where they match, the fuzzy relation is set to "1"; where they do not, it is set to "0". The fuzzy relation R obtained in this way is stored in the fuzzy relation file 11.
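The match test that produces the fuzzy relation R can be sketched as a 0/1 matrix. The orientation chosen here (rows for document keywords, columns for dictionary reference keywords) is an assumption, since the layout of FIG. 12 is not reproduced; the function name and sample keywords are hypothetical.

```python
def fuzzy_relation(document_keywords, reference_keywords):
    """Return R where R[i][j] is 1 if document keyword i equals reference keyword j, else 0."""
    return [
        [1 if doc_kw == ref_kw else 0 for ref_kw in reference_keywords]
        for doc_kw in document_keywords
    ]

# Hypothetical keywords; each is a (word, type) tuple, so the type must match too.
reference = [("cm", "subject"), ("music", "object"), ("actor", "subject")]
document = [("music", "object"), ("price", "other")]
relation = fuzzy_relation(document, reference)
```

In this sample only the first document keyword matches a reference keyword, so the first row contains a single 1 and the second row is all zeros.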

【００４４】[0044]

For convenience of explanation, suppose that the data stored in the classification dictionary 9 for a certain field is as shown in FIG. 10 (eight keywords) and that the five keywords shown in FIG. 11 are extracted from a new document. The fuzzy relation file 11 then stores the relation in a form such as that shown in FIG. 12.

【００４５】[0045]

The fuzzy relation file 11 and the classification dictionary 9 are connected to the field-specific dependency calculating unit 12. For each known field, this unit composes the fuzzy set F stored in the classification dictionary 9 with the fuzzy relation R created from the input document (field unknown), using equation (3) below, to create a new fuzzy set (dependency). This gives, for each field, the dependency of each keyword of the document.

【００４６】[0046]

[Equation 3: see original document]

The dependencies of the keywords on the field shown in FIG. 10 are computed according to the formulas shown in FIGS. 13 and 14, giving the values shown in FIG. 14. Here, "*" is the logical product (it takes the minimum), and "U" means taking the maximum of the values it joins. Because this computation is performed for every field, there are ultimately as many tables of per-keyword dependencies, like the one shown in FIG. 14, as there are fields.
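Equation (3) itself is not reproduced in this text, but the operators described above (minimum for "*", maximum for "U") are those of a max-min composition of fuzzy relations. The sketch below shows that generic composition only; the matrix shapes, the function name, and the sample values are assumptions, and the arrangement actually used in FIGS. 13 and 14 may differ.

```python
def max_min_compose(r, f):
    """Max-min composition: result[i][k] = max over j of min(r[i][j], f[j][k])."""
    columns = list(zip(*f))  # columns of f
    return [
        [max(min(a, b) for a, b in zip(row, column)) for column in columns]
        for row in r
    ]

# Hypothetical inputs: r is a 0/1 relation, f holds dependencies in [0, 1].
r = [[1, 0],
     [0, 0]]
f = [[0.8, 0.3],
     [0.5, 0.9]]
composed = max_min_compose(r, f)
```

A row of the relation that contains no match yields zeros throughout, so keywords of the new document that are absent from the field contribute a dependency of 0.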

【００４７】[0047]

The field-specific dependencies obtained in this way are sent to the field-specific ambiguity calculating unit 13 in the next stage, where the fuzzy set obtained above is substituted into equation (4) below to obtain the ambiguity d for each field. The smaller the ambiguity d, the less ambiguous the result, which means that the document is a better match for that field.

【００４８】[0048]

$$d(\mathrm{category}) = \frac{\Delta_1 + \Delta_2 + \cdots + \Delta_n}{n} \qquad (4)$$

$$\Delta_i = -u_a(a_i)\log_2 u_a(a_i) - \bigl(1 - u_a(a_i)\bigr)\log_2\bigl(1 - u_a(a_i)\bigr)$$

Here, n is the number of keywords for the field and ua(ai) is the dependency of the keyword pair. An example of the calculation (the ambiguity of the new document of FIG. 11 with respect to the field of FIG. 10) is shown in FIG. 15.
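Equation (4) averages a binary-entropy term over the keywords of a field. A minimal sketch follows; treating a dependency of exactly 0 or 1 as contributing zero (the usual 0·log 0 = 0 convention) and rejecting an empty list are assumptions, and the function names are hypothetical.

```python
import math

def entropy_term(u):
    """One term of equation (4) for a dependency u in [0, 1]."""
    if u <= 0.0 or u >= 1.0:
        # Assumption: 0 * log2(0) is taken as 0, so crisp values add no ambiguity.
        return 0.0
    return -u * math.log2(u) - (1.0 - u) * math.log2(1.0 - u)

def ambiguity(dependencies):
    """Equation (4): mean of the entropy terms over the n keywords of a field."""
    if not dependencies:
        raise ValueError("at least one dependency is required")
    return sum(entropy_term(u) for u in dependencies) / len(dependencies)
```

A dependency of 0.5 is maximally ambiguous and contributes 1, while dependencies near 0 or 1 contribute almost nothing, which is why a low value of d indicates a clear match or a clear mismatch for each keyword.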

【００４９】[0049]

The output of the field-specific ambiguity calculating unit 13 obtained in this way is sent to the determination/processing unit 14, which determines the field with the smallest ambiguity to be the field of the input document and outputs the result to the output device 15 (a CRT, printer, or the like). Because the present invention is normally used when building a database, in order to fix the field of each document to be stored, the extracted keywords and the determined field may be stored in a given database 16. The full text of the document is stored as it is in the database 16 via the input device 1′.
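The decision step reduces to picking the field whose ambiguity is smallest. In the sketch below, the function name, field labels, and ambiguity values are hypothetical, and ties are resolved by insertion order, which the text does not specify.

```python
def decide_field(ambiguity_by_field):
    """Return the field with the smallest ambiguity."""
    if not ambiguity_by_field:
        raise ValueError("no fields to compare")
    # Assumption: on a tie, the first field in insertion order wins.
    return min(ambiguity_by_field, key=ambiguity_by_field.get)

# Hypothetical ambiguities of one document with respect to two fields.
ambiguities = {"field-1": 0.08, "field-2": 0.54}
chosen = decide_field(ambiguities)
```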

【００５０】[0050]

Further, in this example, the newly processed input document is sent, together with its determined field, to the dictionary creating device shown in FIG. 1, so that the dictionary can be updated on the basis of the new document. In practice, the counting of documents for keywords and keyword pairs by the counting units 6 and 7 shown in FIG. 1 is performed by incrementing the relevant values (those for the keywords and keyword pairs present in the new document) among the counts stored in the classification dictionary 9.
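The incremental update described above can be sketched as adding one to the stored counts for every keyword and keyword pair found in the newly classified document. The counter layout, the function name, and the sample values are assumptions; how and when distances and dependencies are recomputed from the new counts is not detailed here.

```python
from collections import Counter
from itertools import permutations

def update_counts(keyword_docs, pair_docs, new_document_keywords):
    """Increment stored document counts for one newly classified document."""
    unique = set(new_document_keywords)  # each keyword counts once per document
    for keyword in unique:
        keyword_docs[keyword] += 1
    for pair in permutations(sorted(unique), 2):
        pair_docs[pair] += 1

# Hypothetical stored counts for one field.
keyword_docs = Counter({"cm": 3, "music": 2})
pair_docs = Counter({("cm", "music"): 2, ("music", "cm"): 2})
update_counts(keyword_docs, pair_docs, ["cm", "music", "cm"])
```

A keyword that has not been seen before simply starts from zero in the counter, which matches the need to register new keywords when the dictionary is updated.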

【００５１】[0051]

When the classification dictionary is updated in this way, some of the keywords extracted by the keyword extracting unit 2′ may not be among the keywords already extracted from known documents and stored in the keyword file 3. In that case, it is preferable to create those keywords and the data derived from them (element keywords and so on) for use in subsequent automatic classification and to store them in the appropriate files 3, 5, and so on.

【００５２】[0052]

Next, an example of the automatic classification method according to the present invention using the device described above is explained. In this example, the classification dictionary 9 stores fuzzy sets for two fields: "opinions about TVCM" (first field), described above, and "opinions about CM using paper as the medium" (second field). The dictionary for the first field was created from 45 sample documents, and the dictionary for the second field from 69 sample documents.

【００５３】[0053]

In this state, an unknown document is input to the automatic classification device via the input device 1′, and keywords (with their types) are first extracted by the keyword extracting unit 2′. Where a keyword is duplicated, only one instance is kept. The result is shown in FIG. 16.

【００５４】[0054]

Next, after the fuzzy relation has been obtained by the fuzzy relation creating unit 10, the keywords (xj) of the fuzzy set that match keywords present in the input new document, and the corresponding fuzzy set F, are extracted for each field. This yields the fuzzy set for the first field (FIG. 17) and the fuzzy set for the second field (FIG. 18). Based on these, the field-specific dependency calculating unit 12 obtains the fuzzy set (dependency).

【００５５】[0055]

The field-specific ambiguity calculating unit 13 then obtains the ambiguity. The results are shown in FIG. 19 (keywords whose dependency is "0" are omitted). The determination/processing unit 14 compares the ambiguities of the fields and selects the field with the smallest value. In this example, the first field is selected.

【００５６】[0056]

When another document (having the keywords shown in FIG. 20) was input and processed in the same way, the ambiguities shown in FIG. 21 were obtained, and this document can be judged to belong to the second field. The processing flow described above (an embodiment of the automatic classification method according to the present invention) is shown in FIG. 22.

【００５７】[0057]

Further, the first document (FIG. 16) and the second document (FIG. 20) described above were automatically classified by the conventional method, and their ambiguities with respect to the first field were obtained. The results are shown in FIG. 23. As the figure shows, for these sample documents the conventional method also classified correctly (first document to the first field, second document to the second field). Looking at the ambiguity, however, the first document's ambiguity is "0.106…", which is larger than the "0.0841…" obtained with the present invention, whereas the second document's ambiguity is "0.40496…", which is smaller than the "0.5392…" obtained with the present invention. With the present invention, therefore, the ambiguity is smaller for the correct field and larger for the incorrect field, which shows that documents can be separated and classified accurately even between similar fields.

【００５８】[0058]

In addition, when all the keywords making up the 45 sample documents of the first field used to create the dictionary were input to the classification device and processed, the resulting ambiguity for the first field was 0.08077581. With the conventional method, it was 0.08242087. The present invention thus gives the lower ambiguity, which shows that the dictionary has been created accurately.

【００５９】[0059]

[Effects of the Invention]

As described above, in the method and apparatus for automatically classifying documents and the method and apparatus for creating a classification dictionary according to the present invention, the type of each keyword (subject, object, other, and so on) is determined when the keyword is extracted, and keywords are treated as identical only when their types also match. The dictionary is therefore created with the semantic content of the text taken into account, so that a highly accurate dictionary close to the content of the documents can be created and documents of unknown field can be classified accurately.

【００６０】[0060]

In addition, when the occurrences of keyword pairs are counted, keywords that match by character string are treated as the same keyword. Even where there are differences in notation, a correct dictionary can therefore be created without being affected by them, and automatic classification becomes possible.

[Brief description of drawings]

FIG. 1 is a block diagram showing an embodiment of the dictionary creating device according to the present invention.

FIG. 2 is a flowchart showing the function of the keyword extracting unit.

FIG. 3 is a diagram showing an example of keywords extracted by the keyword extracting unit.

FIG. 4 is a diagram illustrating the operation of the character string matching keyword extracting unit.

FIG. 5 is a diagram showing an example of a group of keywords, extracted by the character string matching keyword extracting unit, that form a single group.

FIG. 6 is a diagram showing an example of the data structure of the classification dictionary.

FIG. 7 is a flowchart showing an embodiment of the dictionary creating method according to the present invention.

FIG. 8 is a diagram showing part of a classification dictionary created by the conventional method.

FIG. 9 is a block diagram showing an embodiment of the automatic classification device according to the present invention.

FIG. 10 is a diagram showing an example of a dictionary stored in the classification dictionary.

FIG. 11 is a diagram showing an example of keywords extracted by the keyword extracting unit.

FIG. 12 is a diagram showing the relation created by the fuzzy relation creating unit.

FIG. 13 is a diagram illustrating the operation of the field-specific dependency calculating unit.

FIG. 14 is a diagram illustrating the operation of the field-specific dependency calculating unit.

FIG. 15 is a diagram illustrating the operation of the field-specific ambiguity calculating unit.

FIG. 16 is a diagram showing a stage in the automatic classification method carried out using the automatic classification device.

FIG. 17 is a diagram showing a stage in the automatic classification method carried out using the automatic classification device.

FIG. 18 is a diagram showing a stage in the automatic classification method carried out using the automatic classification device.

FIG. 19 is a diagram showing a stage in the automatic classification method carried out using the automatic classification device.

FIG. 20 is a diagram showing a stage in the automatic classification method carried out using the automatic classification device.

FIG. 21 is a diagram showing a stage in the automatic classification method carried out using the automatic classification device.

FIG. 22 is a flowchart showing an example of the automatic classification method according to the present invention, carried out using the automatic classification device.

FIG. 23 is a diagram showing the results of classification by the conventional method.

[Explanation of symbols]

1, 1′ Input device
2, 2′ Keyword extracting device
3 Keyword file
4 Character string matching keyword extracting unit
5 Character string matching keyword file
6 Keyword counting unit
7 Keyword pair counting unit
8 Feature amount calculating unit
9 Classification dictionary
10 Fuzzy relation creating unit
11 Fuzzy relation file
12 Field-specific dependency calculating unit
13 Field-specific ambiguity calculating unit
14 Determination/processing unit
15 Output device
16 Database

Claims

1. A method for creating a dictionary for automatic classification of documents, in which keywords are extracted from the words and phrases making up a plurality of documents whose field is known, while each keyword is classified by type such as subject or object; the number of documents in which each extracted keyword appears and the number of documents in which each keyword pair, consisting of any two keywords, appears together are obtained; the distance between the two keywords forming the keyword pair is calculated from the number of documents in which the keywords appear and the number of documents in which the keyword pair appears; the dependency of each keyword pair on the field is calculated from the distance between the keywords; and at least the association between each keyword pair in the field and its dependency is stored in the dictionary.

2. The method for creating a dictionary for automatic classification of documents according to claim 1, wherein, in obtaining the number of documents in which a keyword pair appears, groups of keywords having the same meaning are obtained in advance, and keywords belonging to the same group are also added to the number of documents in which the pair appears.

3. A dictionary creating apparatus for automatic classification of documents, comprising: an input device for inputting documents whose field is known; means for extracting keywords from the words and phrases making up a document supplied via the input device, while classifying each keyword by type such as subject or object; first counting means for receiving the output of the keyword extracting means and counting the number of documents in the same field that contain the same keyword; second counting means for receiving the output of the keyword extracting means and counting the number of documents in the same field that contain a keyword pair; means for receiving the outputs of both counting means, calculating the distance between the two keywords forming the keyword pair, and calculating the dependency of the keyword pair on the field; and means for storing the dependency of the keyword pair in the field in a dictionary.

4. The dictionary creating apparatus for automatic classification of documents according to claim 3, further comprising means for receiving the output of the keyword extracting means and obtaining groups of keywords having the same meaning, wherein the second counting means receives the outputs of the group obtaining means and the keyword extracting means and also adds keywords belonging to the same group to the number of documents in which the keyword pair appears.

5. A method for automatically classifying documents, in which keywords are extracted from the words and phrases making up a document whose field is unknown, while each keyword is classified by type such as subject or object; a relationship is obtained indicating whether each extracted keyword appears in a given field stored in a dictionary created by the method according to claim 1; the dependency on that field of each keyword extracted from the input document is calculated from the obtained relationship and the dependency of each keyword pair of the field stored in the dictionary; the ambiguity for each field is obtained from the dependencies; and the field with the smallest ambiguity is determined to be the field to which the input document belongs.

6. An apparatus for automatically classifying documents, comprising: an input device for inputting a document whose field is unknown; means for extracting keywords from the words and phrases making up a document supplied via the input device, while classifying each keyword by type such as subject or object; a dictionary in which data has been stored by the apparatus according to claim 3 or 4; means, connected to the keyword extracting means and the dictionary, for obtaining a relationship indicating whether each extracted keyword matches a keyword appearing in a given field stored in the dictionary; dependency calculating means for receiving the relationship output from the relationship obtaining means and the dependency of each keyword pair of the field stored in the dictionary, and calculating, by multiplying the two together, the dependency on that field of each keyword extracted from the input document; ambiguity calculating means for receiving the output of the dependency calculating means and obtaining the ambiguity of the document for each field; and determining means for receiving the output of the ambiguity calculating means, comparing the ambiguities of the fields, and detecting the field with the smallest ambiguity.