JP2003016105A

JP2003016105A - Device for calculating degree value of association

Info

Publication number: JP2003016105A
Application number: JP2001197910A
Authority: JP
Inventors: Takeshi Nagamine; 猛志永峯; Shoichi Tateno; 昌一舘野
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2001-06-29
Filing date: 2001-06-29
Publication date: 2003-01-17

Abstract

PROBLEM TO BE SOLVED: To provide a degree-of-association value calculating device that, when calculating a value representing the degree of association between one or more keywords and one or more documents, improves the accuracy of the calculated value by raising the degree of association of keywords that correspond to proper names. SOLUTION: A proper name keyword specifying means 2 specifies keywords corresponding to proper names on the basis of a set proper name keyword specification condition, and degree of association calculating means 3 and 7 raise the per-keyword degree of association between each keyword specified by the proper name keyword specifying means 2 and a document group, and calculate the degree-of-association value between a keyword group and the document group.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a relevance value (degree-of-association value) calculation device and the like that calculates a value representing the relevance between one or more keywords and one or more documents, and in particular to a technique for calculating the relevance value while raising the relevance of keywords that correspond to proper names.

[0002]

[Prior Art] Document retrieval devices that search for documents related to a document held by the user (a seed document) have been studied. As one example, the "related document retrieval device and recording medium storing a related document retrieval program" described in Japanese Patent Laid-Open No. 10-260972 searches the target documents using related words, each carrying a relevance, extracted from the seed document, and then calculates and outputs the relevance of each retrieved document based on the relevance of the related words it contains. Here, the relevance of a related word is determined by the frequency with which the word (related word) appears in the seed document and in the documents being searched.

[0003] However, when relevance is determined solely from word frequency, a related word that appears infrequently but is likely to be highly important to the user ends up with a low relevance, so an appropriate relevance is not always calculated. In particular, proper names such as personal names, place names, and company names usually characterize a document and are considered to carry a large amount of information, yet a high relevance is not necessarily obtained for them when relevance is calculated from frequency alone.

[0004] Document group classification devices have also been studied; these automatically calculate the distance between the documents in a document group and classify the group based on the result. If a short distance between documents is taken to mean high relevance and a long distance to mean low relevance, the distance between documents can be regarded as corresponding to the relevance between them.

[0005] In such a document group classification device, the relevance between two documents is obtained, for example, as follows. The words contained in each document are extracted by matching against a dictionary or by the well-known n-gram method; the occurrences of each extracted word are counted by a scheme chosen by the implementer; for each word, the counted values for the two documents are multiplied together; and the products are summed over all words. The resulting sum is used as the relevance between the two documents.
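The calculation in [0005] amounts to an inner product of two word-count vectors. The following is a minimal sketch under stated assumptions: whitespace splitting stands in for dictionary matching or n-gram extraction, raw occurrence counts stand in for the implementer-defined counting scheme, and the function names `count_words` and `document_relevance` are hypothetical.

```python
from collections import Counter


def count_words(text):
    """Count word occurrences in a text.

    Assumption: whitespace splitting replaces the dictionary matching
    or n-gram extraction mentioned in the text.
    """
    return Counter(text.lower().split())


def document_relevance(doc_a, doc_b):
    """Sum, over all words, of (count in doc_a) * (count in doc_b)."""
    counts_a = count_words(doc_a)
    counts_b = count_words(doc_b)
    # A Counter returns 0 for a missing word, so words that appear in
    # only one document contribute nothing to the sum.
    return sum(n * counts_b[word] for word, n in counts_a.items())
```

Because every extracted word enters the sum, common and undistinctive words contribute just as distinctive ones do.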

[0006] With this method, however, words that are not distinctive are counted when calculating relevance just as distinctive words are. Because the range of words considered becomes wide, the accuracy of document classification deteriorates and the time needed to calculate relevance increases.

[0007]

[Problems to Be Solved by the Invention] As described above, when a conventional document retrieval device or the like calculates a value representing the relevance between one or more keywords and one or more documents, it treats every keyword equally, for example by obtaining the frequency of every word in the same way, so a highly accurate relevance value cannot be obtained.

[0008] The present invention has been made in view of these circumstances. Its object is to provide a relevance value calculation device and the like that, when calculating a value representing the relevance between one or more keywords and one or more documents, raises the relevance of keywords corresponding to proper names and thereby, among other things, improves the accuracy of the calculated relevance value.

[0009]

[Means for Solving the Problems] To achieve this object, the relevance value calculation device according to the present invention calculates, as the value representing the relevance between a keyword group consisting of one or more keywords and a document group consisting of one or more documents, the sum over all keywords of the per-keyword relevance value between each keyword in the keyword group and the document group. In doing so, it calculates the relevance value as follows. A proper name keyword specifying means specifies keywords that correspond to proper names based on a set proper name keyword specification condition, and a relevance value calculating means raises the per-keyword relevance between each keyword so specified and the document group, and calculates the relevance value between the keyword group and the document group.

[0010] Raising the per-keyword relevance between the document group and each keyword identified as a proper name means that keywords corresponding to proper names, which characterize a document and are considered to carry a large amount of information, contribute per-keyword relevance values indicating high relevance when the relevance value between the keyword group and the document group is calculated. The accuracy of the calculated relevance value can thereby be improved.

[0011] The number of keywords in the keyword group may be any number: one, or several (two or more). Likewise, the number of documents in the document group may be one or several (two or more). In this specification, expressions that use the word "group", such as keyword group and document group, therefore also cover the case in which the group contains only one element (keyword or document).

[0012] Words and phrases of various parts of speech may be used as keywords; in the present invention, the relevance value is calculated giving particular weight to keywords that correspond to proper names. Various kinds of documents may be used. A document normally contains multiple words and phrases of various parts of speech. A document to which keywords have been attached, for example as an index, can also be used.

[0013] The value representing the relevance between a keyword group and a document group is, for example, a value considered to express how closely the two are related. Normally a larger relevance value would indicate higher relevance, but the opposite convention, in which a smaller value indicates higher relevance, may also be used.

[0014] The per-keyword relevance value between each keyword in the keyword group and the document group is a value that can be calculated for each keyword individually. For example, it may be a value of "1" or "0" indicating whether the keyword appears in the document group, the number of times the keyword appears in the document group, or the frequency with which it appears.
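The three per-keyword values named in [0014] can be sketched as follows. This is an illustration only: documents are assumed to be already tokenized into lists of words, "frequency" is assumed to mean occurrences divided by the total number of tokens in the group, and all function names are hypothetical.

```python
def presence(keyword, documents):
    """Return 1 if the keyword appears anywhere in the document group, else 0."""
    return 1 if any(keyword in doc for doc in documents) else 0


def occurrence_count(keyword, documents):
    """Number of times the keyword appears in the document group."""
    return sum(doc.count(keyword) for doc in documents)


def occurrence_frequency(keyword, documents):
    """Occurrences divided by the total token count of the group.

    Assumption: the text does not define "frequency" precisely; this
    normalization is one plausible reading.
    """
    total = sum(len(doc) for doc in documents)
    return occurrence_count(keyword, documents) / total if total else 0.0
```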

[0015] When the keyword group is extracted from a certain document group, each keyword in the keyword group may be assigned a value reflecting how it appears in that document group. When such values are assigned, taking them into account when calculating the relevance value between each keyword and another document group yields the per-keyword relevance value between the first document group and the other document group.

[0016] Calculating the sum of the per-keyword relevance values over all keywords means, when the keyword group contains multiple keywords, summing the per-keyword relevance values of all of them. When the keyword group contains only one keyword, the per-keyword relevance value calculated for that keyword is itself the sum over all keywords.

[0017] Various words and phrases may be used as proper names, for example personal names, place names, company names, dates, times, dates and times, and product names. In a preferred embodiment, the proper names referred to in the present invention may be the set of words generally classified as proper nouns, used as is, or a subset of that set.

[0018] Various conditions may be used as the set proper name keyword specification condition, and keywords corresponding to proper names may be specified from that condition in various ways. The condition may, for example, be set in memory in advance; the content set in memory may be rewritable by the user; or the condition may be set by the user at the time the relevance value is calculated.

[0019] Specifically, the following methods can be used, among others. Words to be detected as proper names are set in memory as electronic data as the proper name keyword specification condition, and keywords that match those words are specified as corresponding to proper names. Alternatively, a dictionary holding information about proper names is set in memory as electronic data, and keywords that match proper name entries in the dictionary are specified as corresponding to proper names. Alternatively, a word such as "Co., Ltd." (株式会社) is set in memory as electronic data, and keywords that match words having that word as a suffix or prefix are specified as corresponding to proper names. Alternatively, words such as place names are set in memory as electronic data, and keywords that match words containing such a place name at their beginning or end are specified as corresponding to proper names.

[0020] The words contained in a document can be analyzed using, for example, well-known morphological analysis. Documents may be associated with the words they contain in various ways; as one example, for each pair of a word and its attribute information (such as part of speech), the identifiers of the one or more documents containing that word can be stored.
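The association described in [0020] resembles an inverted index keyed on a word and its attribute. A minimal sketch follows, assuming each document has already been analyzed into (word, part of speech) pairs; the function name `build_index`, the input format, and the tag strings are hypothetical.

```python
from collections import defaultdict


def build_index(documents):
    """Map each (word, part_of_speech) pair to the IDs of documents containing it.

    documents: doc_id -> list of (word, part_of_speech) pairs, assumed to be
    the output of some morphological analyzer.
    """
    index = defaultdict(set)
    for doc_id, tagged_words in documents.items():
        for word, part_of_speech in tagged_words:
            index[(word, part_of_speech)].add(doc_id)
    return index
```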

[0021] In the relevance value calculation device according to the present invention, as one example, the relevance value calculating means raises the per-keyword relevance between a keyword specified by the proper name keyword specifying means and the document group by correcting the per-keyword relevance value for that keyword to a value that indicates higher relevance than its calculated value.

[0022] With this configuration, the relevance value can be calculated taking into account keywords other than those specified by the proper name keyword specifying means as well, while the per-keyword relevance values of the specified keywords are corrected to values indicating higher relevance than their calculated values, which improves the accuracy of the calculated relevance value.

[0023] Here, correcting the per-keyword relevance value between a specified keyword and the document group to a value indicating higher relevance than its calculated value means, for example, when a larger per-keyword relevance value indicates higher relevance, changing the uncorrected per-keyword relevance value of a keyword identified as a proper name to a larger value. Various degrees of correction may be used; for example, the multiplier applied to the per-keyword relevance value may be set in advance, or it may be set by the user.

[0024] As another example, in the relevance value calculation device according to the present invention, the relevance value calculating means raises the per-keyword relevance between a keyword specified by the proper name keyword specifying means and the document group by making the per-keyword relevance value for that keyword nonzero while setting the per-keyword relevance values of all other keywords to zero.

[0025] With this configuration, when the relevance value between the keyword group and the document group is calculated, only the per-keyword relevance values of the keywords specified by the proper name keyword specifying means are nonzero, and only those nonzero values need to be computed. This can, for example, improve the accuracy of the relevance value while reducing the amount of computation and the time needed to calculate it.

[0026] In the relevance value calculation device according to the present invention, a category information storage means stores information on one or more categories, each consisting of multiple proper names of the same kind; a category designation receiving means receives the designation of a category; and the proper name keyword specifying means, based on the category information stored in the category information storage means, specifies keywords corresponding to proper names that belong to the category received by the category designation receiving means.

[0027] Raising the per-keyword relevance between the document group and the keywords identified as proper names in the designated category makes it possible to calculate the relevance value with increased relevance for the proper names in a category designated, for example, in response to a user request. The accuracy of a relevance value that reflects the user's request can thereby be improved.

[0028] Some keywords belong to more than one category: "Mitsui" (三井) belongs to both the personal name category and the company name category, and "Hitachi" (日立) belongs to both the place name category and the company name category. For such keywords, the relevance of every occurrence of the same keyword across those categories could end up being raised. When a category is designated as described above, however, even for the same keyword (word), the relevance of occurrences belonging to categories other than the designated one can be prevented from being raised.

[0029] In other words, even the same word used in multiple categories, such as "Mitsui" as a personal name and as a company name, can be identified as a different word in each category. Two example embodiments, (1) and (2), that support this kind of category identification are as follows. In example (1), when a word identical to a proper name (for example, a proper noun) that has already been classified appears again, its category is taken to be the same as before. As a concrete example, consider the sentence "... according to what Kunitoshi Mitsui announced the other day, ..., Mitsui put forward its own plan, ...". From the second occurrence of "Mitsui" (in "Mitsui put forward ...") alone, it is unclear whether it is a personal name or a company name. If, however, the first occurrence (in "Kunitoshi Mitsui ...") has been analyzed as belonging to the personal name category and this result has been stored in memory, then when the category of the second "Mitsui" is uncertain, the earlier result can be consulted and the word assigned to the personal name category. In example (2), the category is determined from the pattern in which the word appears. As a concrete example, in "Mitsui Bank" (三井銀行), "Mitsui" is followed by "Bank", so it is judged to be a company name.
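The two disambiguation heuristics in [0029] can be sketched together. This is an illustration under several assumptions: the text is already tokenized in Japanese word order (family name first), the first occurrence is resolved by a following-word rule (the source only says it "has been analyzed" as a personal name), and the names `resolve_categories`, `candidates`, and `following_word_rules` are hypothetical.

```python
def resolve_categories(tokens, candidates, following_word_rules):
    """Assign a category to each ambiguous proper name in a token sequence.

    tokens:               words in reading order.
    candidates:           word -> set of categories the word may belong to.
    following_word_rules: next word -> category it implies for the word
                          before it (example (2), appearance pattern).
    Returns one category (or None) per token.
    """
    remembered = {}  # word -> category decided at an earlier occurrence
    result = []
    for i, word in enumerate(tokens):
        options = candidates.get(word)
        if not options:
            result.append(None)  # not a known proper name
            continue
        next_word = tokens[i + 1] if i + 1 < len(tokens) else None
        implied = following_word_rules.get(next_word)
        if len(options) == 1:
            category = next(iter(options))  # unambiguous
        elif implied in options:
            category = implied  # example (2): decided by the following word
        else:
            category = remembered.get(word)  # example (1): reuse earlier result
        if category is not None:
            remembered[word] = category
        result.append(category)
    return result
```

In this sketch the pattern rule takes precedence over the remembered category, so "Mitsui Bank" would still be read as a company even after "Mitsui" was seen as a person; the source does not state which heuristic should win.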

[0030] Which category a given word belongs to can be determined, for example, by attaching to each word in advance information on the category it belongs to, or by analyzing the meaning of each word and inferring its category.

[0031] Various categories may be used as categories consisting of multiple proper names of the same kind, for example personal names, personal names of a particular type, company names, company names of a particular type, product names, product names of a particular type, domestic place names, place names in the same region, place names in the same country, flower names, and animal names. Personal names of a particular type include, for example, names of historical figures and names of researchers; company names of a particular type include, for example, names of food manufacturers and names of automobile manufacturers; and the same applies to the others.

[0032] Specifically, categories such as the following can be used: a category consisting of "A", "B", "C", ... (where A, B, and C are personal names), the names of different people, which are alike in being personal names; a category consisting of "Company A", "Company B", "Company C", ... (where these are company names), the names of different companies, which are alike in being copier manufacturers; and a category consisting of "Machine A", "Machine B", "Machine C", ... (where these are product names), the names of different copier products, which are alike in being copier product names.

[0033] The number of categories may be one or several (two or more). The category information is, for example, information that identifies each of the proper names in each category, or information that identifies proper names in the same category as belonging to that category.

[0034] A memory that stores information as electronic data can be used as the category information storage means. When category designations are received from a user, a keyboard, mouse, or similar device operated by the user to enter the designation can be used as the category designation receiving means.

[0035] When, for example, the user enters a keyword for the relevance value calculation, the category designation receiving means may treat the category containing the word corresponding to that keyword as having been designated and accept that designation.

[0036] In the relevance value calculation device according to the present invention, as one configuration example, the category designation receiving means displays to the user information requesting the designation of a category and receives the designation through the user's input. The information requesting the designation is, for example, information that prompts the user to designate a category, and it is displayed, for example, on a display screen.

[0037] In the relevance value calculation device according to the present invention, a category information storage means stores information on one or more categories, each consisting of multiple proper names of the same kind; a correction degree designation receiving means receives, for each category, the designation of the degree to which per-keyword relevance values are to be corrected; and the relevance value calculating means, based on the category information stored in the category information storage means, corrects the per-keyword relevance value between the document group and each keyword specified by the proper name keyword specifying means using the correction degree received by the correction degree designation receiving means for the category containing that keyword.

[0038] Using the correction degree designated for each category, the per-keyword relevance value between the document group and each keyword identified as a proper name in a category is corrected to a value indicating higher relevance than its calculated value. The relevance value can thus be calculated with the relevance of the proper names in each category raised by a degree that reflects, for example, the user's request, which improves the accuracy of a relevance value reflecting that request.

[0039] The degree of correction of the per-keyword relevance value for each category can be, for example, a multiplier: for a category A, the uncorrected per-keyword relevance value is multiplied by a, and for another category B, it is multiplied by b.
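The per-category correction in [0037] to [0039] can be sketched as a lookup of one multiplier per category. The function name, the data layout, and the default multiplier of 1.0 for keywords outside every category are assumptions made for the example.

```python
def category_weighted_relevance(per_keyword, keyword_category, category_multiplier):
    """Sum per-keyword relevance values, each scaled by its category's multiplier.

    per_keyword:         keyword -> uncorrected per-keyword relevance value.
    keyword_category:    keyword -> category, for keywords identified as
                         proper names.
    category_multiplier: category -> multiplier (the "a" and "b" of the text).
    """
    total = 0.0
    for keyword, value in per_keyword.items():
        category = keyword_category.get(keyword)
        # Keywords with no category, or a category with no designated
        # multiplier, are left uncorrected (multiplier 1.0).
        total += value * category_multiplier.get(category, 1.0)
    return total
```

For example, with multipliers of 3.0 for company names and 1.5 for place names, a company-name keyword with value 2, a place-name keyword with value 1, and an ordinary keyword with value 3 give 6.0 + 1.5 + 3.0 = 10.5.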

【００４０】また、補正度合い指定受付手段としては、
例えばユーザから補正度合いの指定を受け付けるような
場合には、ユーザにより操作されて補正度合いの指定を
行うための入力をユーザから受け付けるキーボードやマ
ウスなどを用いることができる。Further, when the designation of the correction degree is accepted from a user, for example, a keyboard, a mouse, or a similar device that the user operates to enter the designation can be used as the correction degree designation receiving means.

【００４１】また、本発明に係る関連度値算出装置で
は、一構成例として、補正度合い指定受付手段は、カテ
ゴリ毎の補正度合いの指定を要求する情報をユーザに対
して表示出力し、当該指定をユーザからの入力により受
け付ける。ここで、補正度合いの指定を要求する情報と
しては、例えばユーザに対して補正度合いを指定するこ
とを促すような情報が用いられる。また、表示出力とし
ては、例えばディスプレイ画面などに表示出力する態様
が用いられる。Further, in one configuration example of the relevance value calculating device according to the present invention, the correction degree designation receiving means displays to the user information requesting designation of the correction degree for each category, and accepts the designation through input from the user. Here, the information requesting designation of the correction degree is, for example, information that prompts the user to designate the correction degree. The display output is, for example, presented on a display screen or the like.

【００４２】また、本発明に係る関連度値算出装置で
は、固有名キーワード情報表示出力手段が、固有名キー
ワード特定手段により特定されたキーワードに関する情
報をユーザに対して表示出力する。従って、キーワード
毎の関連度を高めるものとしてキーワード毎関連度値が
補正されたキーワードに関する情報がユーザに対して表
示出力されるため、ユーザはいずれのキーワードの関連
度が高められたかなどを把握することができる。ここ
で、キーワードに関する情報としては、例えば当該キー
ワードの語句や、補正の度合いなどを用いることができ
る。Further, in the relevance value calculating device according to the present invention, the proper name keyword information display output means displays to the user information on the keywords specified by the proper name keyword specifying means. Because information on the keywords whose per-keyword relevance values were corrected to raise their relevance is thus displayed, the user can grasp, for example, which keywords had their relevance raised. Here, the information on a keyword can be, for example, the keyword's wording or the degree of correction.

【００４３】また、以上に示したような本発明に係る関
連度値算出装置は、種々な装置に適用することが可能で
あり、例えばキーワード群などに関連する文書を検索す
る関連文書検索装置や、文書を関連するキーワード群に
カテゴライズする文書カテゴライズ装置や、２つの文書
の間の関連度値に基づいて複数の文書をクラスタリング
する文書クラスタリング装置などに適用することができ
る。Further, the relevance value calculating device according to the present invention described above can be applied to various devices: for example, a related document search device that searches for documents related to a keyword group or the like, a document categorizing device that categorizes documents into related keyword groups, and a document clustering device that clusters a plurality of documents based on the relevance value between two documents.

【００４４】例えば、本発明に係る関連文書検索装置で
は、１又は複数の文書（種文書）から構成される文書群
（種文書群）から１又は複数のキーワードから構成され
るキーワード群を抽出し、抽出したキーワード群に関連
する文書を複数の検索対象となる文書（検索対象文書）
から構成される文書群（検索対象文書群）から検索し、
抽出したキーワード群に含まれる各キーワード毎に種文
書群に含まれる文書の総数に対する当該種文書群中でキ
ーワードが出現する文書の数の割合値（第１の割合値）
を検索対象文書群に含まれる文書の総数に対する当該検
索対象文書群中でキーワードが出現する文書の数の割合
値（第２の割合値）で除算した値を各キーワード毎の出
現割合値として算出し、検索した各文書に関して抽出し
たキーワード群に含まれるキーワードの中で文書に出現
するキーワードの出現割合値を総和した値を当該キーワ
ード群と当該文書との関連度を表す値として算出し、当
該関連度値が大きい順に検索した各文書に関する情報を
出力するに際して、次のようにして、キーワード群と文
書との関連度値を算出する。すなわち、固有名キーワー
ド特定手段が設定された固有名キーワード特定条件に基
づいて固有名に相当するキーワードを特定し、関連度値
算出手段が固有名キーワード特定手段により特定された
キーワードの出現割合値を所定数倍してキーワード群と
文書との関連度値を算出する。For example, the related document search apparatus according to the present invention extracts a keyword group composed of one or more keywords from a document group (seed document group) composed of one or more documents (seed documents), and searches a document group (search target document group), composed of a plurality of documents to be searched (search target documents), for documents related to the extracted keyword group. For each keyword in the extracted keyword group, it calculates an appearance ratio value by dividing a first ratio value, which is the number of documents in the seed document group in which the keyword appears relative to the total number of documents in the seed document group, by a second ratio value, which is the number of documents in the search target document group in which the keyword appears relative to the total number of documents in the search target document group. For each retrieved document, it calculates the sum of the appearance ratio values of those keywords in the extracted keyword group that appear in the document as the value representing the relevance between the keyword group and the document, and it outputs information on the retrieved documents in descending order of relevance value. In doing so, the relevance value between the keyword group and a document is calculated as follows: the proper name keyword specifying means specifies keywords corresponding to proper names based on a set proper name keyword specification condition, and the relevance value calculating means multiplies the appearance ratio value of each keyword so specified by a predetermined factor when calculating the relevance value between the keyword group and the document.

【００４５】ここで、種文書群からキーワード群を抽出
する仕方としては、種々な仕方が用いられてもよい。ま
た、抽出したキーワード群に関連する文書を検索対象文
書群から検索する仕方としては、種々な仕方が用いられ
てもよい。具体的には、キーワードを用いて当該キーワ
ードに関連する文書を検索する仕方や、１又は複数の文
書からキーワードを抽出して当該キーワードを用いて関
連する文書を検索する仕方や、キーワードを用いて文書
を検索して当該検索した文書に含まれる他のキーワード
を用いて当該他のキーワードに関連する文書を検索する
仕方などを用いることができる。Here, various methods may be used to extract the keyword group from the seed document group, and various methods may likewise be used to search the search target document group for documents related to the extracted keyword group. Specifically, the following can be used: searching for documents related to a keyword using that keyword; extracting keywords from one or more documents and using them to search for related documents; or searching for documents with a keyword and then using other keywords contained in the retrieved documents to search for documents related to those other keywords.

【００４６】また、具体例として、種文書群に含まれる
文書の総数がＮであり、検索対象文書群に含まれる文書
の総数がＭであり、抽出したキーワード群に含まれる或
るキーワードについて、種文書群中で当該キーワードが
出現する文書の数がｎであり、検索対象文書群中で当該
キーワードが出現する文書の数がｍである場合には、前
記第１の割合値は（ｎ／Ｎ）となり、前記第２の割合値
は（ｍ／Ｍ）となり、当該キーワードの出現割合値は前
記第１の割合値を前記第２の割合値で除算した値となっ
て（ｎ／Ｎ）／（ｍ／Ｍ）となる。As a specific example, suppose that the seed document group contains N documents in total, the search target document group contains M documents in total, and a certain keyword in the extracted keyword group appears in n documents of the seed document group and in m documents of the search target document group. Then the first ratio value is (n / N), the second ratio value is (m / M), and the appearance ratio value of the keyword, being the first ratio value divided by the second, is (n / N) / (m / M).
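The appearance ratio value (n / N) / (m / M) can be computed directly, as in the short sketch below. The function name `appearance_ratio` and the sample counts are assumptions for illustration; the guard against m = 0 is also an added assumption, since the text does not say how a keyword absent from the search target document group is handled.

```python
def appearance_ratio(n: int, N: int, m: int, M: int) -> float:
    """Return (n / N) / (m / M) for one keyword.

    n: seed documents containing the keyword,   N: total seed documents
    m: target documents containing the keyword, M: total search target documents
    """
    if N <= 0 or M <= 0 or m <= 0:
        # Assumption: the ratio is treated as undefined in these cases.
        raise ValueError("N, M and m must be positive")
    first_ratio = n / N     # first ratio value
    second_ratio = m / M    # second ratio value
    return first_ratio / second_ratio


# The keyword appears in 2 of 4 seed documents and 4 of 64 target documents.
ratio = appearance_ratio(n=2, N=4, m=4, M=64)
print(ratio)  # 8.0
```

A keyword that is common in the seed documents but rare in the search target documents receives a large value, which is why a larger value is read as higher relevance.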

【００４７】また、具体例として、抽出したキーワード
群に含まれるキーワードの中で検索した或る文書に出現
するキーワードがｚ（ｚは１以上の整数）個あり、当該
文書に対するこれらｚ個のキーワードのそれぞれの出現
割合値がＰｉ（ｉ＝１〜ｚ）である場合には、当該キー
ワード群と当該文書との関連度値はΣＰｉで表される。
なお、Σはｉ＝１からｉ＝ｚまでの総和を表す。As a specific example, when z keywords (z is an integer of 1 or more) among the keywords in the extracted keyword group appear in a certain retrieved document, and the appearance ratio values of these z keywords are Pi (i = 1 to z), the relevance value between the keyword group and the document is expressed as ΣPi, where Σ denotes the sum from i = 1 to i = z.

【００４８】また、上記した（ｎ／Ｎ）／（ｍ／Ｍ）と
いう出現割合値を用いた場合には、当該出現割合値が大
きい方が関連度が高いと考えることができる。また、検
索した各文書に関する情報としては、種々な情報が用い
られてもよく、例えば各文書のタイトルの情報などを用
いることができる。また、情報を出力する仕方として
は、種々な仕方が用いられてもよく、例えば情報をディ
スプレイ画面などに表示出力する仕方や、情報をプリン
タにより印刷出力する仕方などを用いることができる。When the appearance ratio value of (n / N) / (m / M) is used, it can be considered that the larger the appearance ratio value is, the higher the degree of association is. Various information may be used as the information regarding each searched document, and for example, information on the title of each document may be used. Various methods may be used to output the information. For example, a method of displaying and outputting the information on a display screen or a method of printing and outputting the information by a printer can be used.

【００４９】また、固有名キーワード特定手段により特
定されたキーワードの出現割合値を所定数倍するのに用
いられる当該所定数倍としては、例えば予めメモリなど
に設定されてもよく、或いは、ユーザにより指定されて
もよい。The predetermined factor by which the appearance ratio value of a keyword specified by the proper name keyword specifying means is multiplied may, for example, be set in advance in a memory or the like, or may be designated by the user.
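The relevance value of a retrieved document, with proper-name keywords multiplied by a predetermined factor, might be sketched as follows. The function name `relevance`, the `multiplier` parameter, and the sample keywords and ratio values are hypothetical; the default of 2.0 merely stands in for a preset value and is not taken from the text.

```python
def relevance(ratios, document_words, proper_names, multiplier=2.0):
    """Sum the appearance ratio values of the keywords that occur in a document.

    ratios:         keyword -> appearance ratio value
    document_words: set of words that occur in the retrieved document
    proper_names:   keywords identified as proper names
    multiplier:     predetermined factor (preset or supplied by the user)
    """
    total = 0.0
    for keyword, ratio in ratios.items():
        if keyword not in document_words:
            continue                    # only keywords that appear in the document count
        if keyword in proper_names:
            ratio *= multiplier         # boost proper-name keywords
        total += ratio
    return total


ratios = {"Nagamine": 8.0, "search": 2.0, "index": 4.0}
document_words = {"Nagamine", "search", "engine"}
proper_names = {"Nagamine"}

boosted = relevance(ratios, document_words, proper_names, multiplier=3.0)
plain = relevance(ratios, document_words, proper_names, multiplier=1.0)
print(boosted, plain)  # 26.0 10.0
```

Passing the factor as a parameter reflects that it may be preset or chosen by the user; with a factor of 1 the function reduces to the uncorrected sum ΣPi.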

【００５０】また、本発明に係る文書カテゴライズ装置
では、１又は複数のキーワードから構成される複数のキ
ーワード群と１又は複数の文書から構成される文書群に
関して、各キーワード群毎にキーワード群に含まれる各
キーワードと文書群とのキーワード毎の関連度を表す値
を全てのキーワードについて総和した値を当該キーワー
ド群と当該文書群との関連度値として算出し、算出され
る関連度値が最高の関連度を表す値となるキーワード群
に当該文書群をカテゴライズするに際して、次のように
して、キーワード群と文書群との関連度値を算出する。
すなわち、固有名キーワード特定手段が設定された固有
名キーワード特定条件に基づいて固有名に相当するキー
ワードを特定し、関連度値算出手段が固有名キーワード
特定手段により特定されたキーワードと文書群とのキー
ワード毎の関連度を高めてキーワード群と文書群との関
連度値を算出する。Further, in the document categorizing apparatus according to the present invention, given a plurality of keyword groups each composed of one or more keywords and a document group composed of one or more documents, the apparatus calculates, for each keyword group, the sum over all keywords in the group of the value representing the per-keyword relevance between the keyword and the document group as the relevance value between that keyword group and the document group, and categorizes the document group into the keyword group whose calculated relevance value indicates the highest relevance. In doing so, the relevance value between a keyword group and the document group is calculated as follows: the proper name keyword specifying means specifies keywords corresponding to proper names based on a set proper name keyword specification condition, and the relevance value calculating means raises the per-keyword relevance between each keyword so specified and the document group when calculating the relevance value between the keyword group and the document group.

【００５１】ここで、一般的なカテゴライズの一例とし
て、或る文書を或るカテゴリに振り分ける処理の手順例
（１）〜（４）を示す。すなわち、手順（１）では、前
準備として、Ｎ（Ｎは複数）個のカテゴリＣ１〜ＣＮを
用意する。また、各カテゴリＣ１〜ＣＮを代表する文書
Ｓ１〜ＳＮを各カテゴリＣ１〜ＣＮについて１件ずつ用
意する。また、用意された各文書Ｓ１〜ＳＮのそれぞれ
を単語単位に分割して、各文書Ｓｉ毎に文書を構成する
単語の集合Ｗ（Ｓｉ）を生成する。ここで、ｉ＝１〜Ｎ
を示す。Here, as an example of general categorization, procedure steps (1) to (4) for assigning a certain document to a certain category are shown. In step (1), as preparation, N categories C1 to CN (N is two or more) are prepared, together with documents S1 to SN representing the categories, one document per category. Each of the prepared documents S1 to SN is then divided into words to generate, for each document Si, the set W(Si) of words constituting that document, where i = 1 to N.

【００５２】次に、手順（２）では、カテゴリに振り分
けたい文書（つまり、カテゴリ対象となる文書）Ｄを単
語単位に分割して、当該文書Ｄを構成する単語の集合Ｗ
（Ｄ）を生成する。次に、手順（３）では、各カテゴリ
Ｃ１〜ＣＮを代表する文書Ｓ１〜ＳＮのそれぞれの単語
集合Ｗ（Ｓ１）〜Ｗ（ＳＮ）について、前記単語集合Ｗ
（Ｄ）と重複して現れる単語の数つまり各単語集合Ｗ
（Ｓ１）〜Ｗ（ＳＮ）に現れて且つ前記単語集合Ｗ
（Ｄ）にも現れる単語の数Ｋ（Ｄ、Ｓ１）〜Ｋ（Ｄ、Ｓ
Ｎ）を求める。Next, in step (2), the document D to be assigned to a category (that is, the document to be categorized) is divided into words to generate the set W(D) of words constituting the document D. Next, in step (3), for each of the word sets W(S1) to W(SN) of the documents S1 to SN representing the categories C1 to CN, the number of words shared with the word set W(D) is obtained; that is, the numbers K(D, S1) to K(D, SN) of words that appear both in the respective word set W(S1) to W(SN) and in the word set W(D).

【００５３】そして、手順（４）では、各文書Ｓ１〜Ｓ
Ｎについて求められた単語数Ｋ（Ｄ、Ｓ１）〜Ｋ（Ｄ、
ＳＮ）の中で最も数が大きい単語数に対応したカテゴリ
に文書Ｄを分類（カテゴライズ）する。つまり、最大の
単語数がＫ（Ｄ、Ｓｉ）である場合には、文書Ｄをカテ
ゴリＣｉにカテゴライズする。Then, in step (4), the document D is classified (categorized) into the category corresponding to the largest of the word counts K(D, S1) to K(D, SN) obtained for the documents S1 to SN. That is, when the largest word count is K(D, Si), the document D is categorized into the category Ci.
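Steps (1) to (4) of the general categorization can be sketched as below. This is a simplified illustration under stated assumptions: words are split on whitespace (the text does not specify a tokenizer), ties are resolved in favor of the first category, and the function name `categorize` and the sample texts are invented for the example.

```python
def categorize(document, representatives):
    """Assign a document to the category whose representative shares the most words.

    document:        text of the document D to be categorized
    representatives: category name -> text of its representative document Si
    """
    # Step (2): word set W(D). Whitespace splitting is an assumption.
    doc_words = set(document.split())
    # Steps (1) and (3): word sets W(Si) and overlap counts K(D, Si).
    overlaps = {
        category: len(doc_words & set(text.split()))
        for category, text in representatives.items()
    }
    # Step (4): pick the category with the largest K(D, Si).
    best = max(overlaps, key=overlaps.get)
    return best, overlaps


representatives = {
    "sports": "team match goal player",
    "finance": "bank market stock price",
}
best, overlaps = categorize("the player scored a goal in the match", representatives)
print(best, overlaps)  # sports {'sports': 3, 'finance': 0}
```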

【００５４】また、固有名に着目したカテゴライズ手法
の具体例として、上記のようなカテゴライズに本発明を
適用した場合の実施態様例（１）〜（４）を示す。実施
態様例（１）では、上記手順（３）において、単語集合
Ｗ（Ｄ）と各カテゴリＣ１〜ＣＮを代表する文書Ｓ１〜
ＳＮを構成する単語集合Ｗ（Ｓ１）〜Ｗ（ＳＮ）とに重
複して現れる単語の数Ｋ（Ｄ、Ｓ１）〜Ｋ（Ｄ、ＳＮ）
を調べるに際して、重複した単語が固有名（例えば固有
名詞）に相当する場合には、当該固有名の数だけ前記単
語数Ｋ（Ｄ、Ｓｉ）を増加させる。具体例として、文書
Ｓｉについて１０個の単語が文書Ｄと重複していて、こ
れら１０個の単語の中の３個の単語が固有名に相当する
場合には、前記単語数Ｋ（Ｄ、Ｓｉ）＝１０＋３＝１３
とする。Further, as specific examples of a categorizing method that focuses on proper names, embodiment examples (1) to (4), in which the present invention is applied to the categorization described above, are shown. In embodiment example (1), when the numbers K(D, S1) to K(D, SN) of words shared between the word set W(D) and the word sets W(S1) to W(SN) of the documents S1 to SN representing the categories C1 to CN are examined in step (3) above, and a shared word corresponds to a proper name (for example, a proper noun), the word count K(D, Si) is increased by the number of such proper names. As a specific example, if 10 words of the document Si are shared with the document D and 3 of those 10 words correspond to proper names, then the word count is K(D, Si) = 10 + 3 = 13.
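The 10 + 3 = 13 example can be reproduced with a few lines of set arithmetic. The function name `boosted_overlap` and the placeholder words are assumptions for illustration; how proper names are actually identified is left to the proper name keyword specification condition and is represented here by a ready-made set.

```python
def boosted_overlap(doc_words, rep_words, proper_names):
    """K(D, Si) with every shared proper name counted one extra time."""
    shared = doc_words & rep_words
    return len(shared) + len(shared & proper_names)


# Ten shared words, three of which are proper names (illustrative data).
common = {"w1", "w2", "w3", "w4", "w5", "w6", "w7", "Tokyo", "Osaka", "Kyoto"}
doc_words = common | {"only_in_d"}
rep_words = common | {"only_in_s"}
proper_names = {"Tokyo", "Osaka", "Kyoto", "Nagoya"}

k = boosted_overlap(doc_words, rep_words, proper_names)
print(k)  # 13
```

Only proper names that are actually shared contribute the extra count, so "Nagoya", which appears in neither document, has no effect.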

【００５５】実施態様例（２）では、上記手順（１）に
おいて、或るカテゴリＣｉを表す文書Ｓｉは複数の文書
から構成されてもよいとする。実施態様例（３）では、
上記手順（１）において、或るカテゴリＣｉを表す文書
Ｓｉの代わりに、そのカテゴリＣｉを代表する単語の集
合を用いてもよいとする。In embodiment example (2), the document Si representing a certain category Ci in step (1) above may be composed of a plurality of documents. In embodiment example (3), a set of words representing the category Ci may be used in step (1) above in place of the document Si representing that category.

【００５６】実施態様例（４）では、上記手順（３）の
後に上記手順（４）において、重複した単語については
その出現頻度などを考慮した重み付けを行ってもよいと
する。具体例として、単語ｗｉが振り分けたい文書Ｄに
含まれており、且つ、当該単語ｗｉがＮ件の文書Ｓ１〜
ＳＮの中のｙ件の文書に含まれている場合には、当該単
語ｗｉの特徴度Ｆ（ｗｉ）＝Ｎ／ｙとし、また、単語ｗ
ｉが固有名に相当する場合には例えば特徴度Ｆ（ｗｉ）
＝１０×（Ｎ／ｙ）のように算出される特徴度Ｆ（ｗ
ｉ）を所定数倍（ここでは、１０倍）する。なお、ｙ＝
０の場合には、特徴度Ｆ（ｗｉ）＝０とする。そして、
文書Ｄに含まれる全ての単語ｗｉについて上記した特徴
度Ｆ（ｗｉ）を求め、文書Ｄと文書Ｓｉとの類似度Ｒ
（Ｄ、Ｓｉ）＝ΣＦ（ｗｋ）を算出する。ここで、Σは
ｋ＝１からｋ＝Ｋまでの総和を示し、ｗ１〜ｗＫは文書
Ｄと文書Ｓｉとで重複して出現する単語を示す。このよ
うにして、全ての文書Ｓｉ（ｉ＝１〜Ｎ）についての類
似度Ｒ（Ｄ、Ｓｉ）を求めて、最も大きい値の類似度が
得られた文書のカテゴリに文書Ｄを分類（カテゴライ
ズ）する。つまり、類似度Ｒ（Ｄ、Ｓｋ）の値が最も大
きい場合には、カテゴリＣｋに文書Ｄをカテゴライズす
る。In embodiment example (4), after step (3) above, shared words may be weighted in step (4) in consideration of, for example, their frequency of appearance. As a specific example, when a word wi is contained in the document D to be assigned and is also contained in y of the N documents S1 to SN, the feature degree of the word is F(wi) = N / y; when the word wi corresponds to a proper name, the feature degree calculated in this way is multiplied by a predetermined factor (here, 10), giving for example F(wi) = 10 × (N / y). When y = 0, F(wi) = 0. The feature degree F(wi) is obtained for every word wi contained in the document D, and the similarity between the document D and the document Si is calculated as R(D, Si) = ΣF(wk), where Σ denotes the sum from k = 1 to k = K and w1 to wK are the words that appear in both the document D and the document Si. The similarity R(D, Si) is obtained in this way for all documents Si (i = 1 to N), and the document D is classified (categorized) into the category of the document that yields the largest similarity. That is, when R(D, Sk) is the largest, the document D is categorized into the category Ck.
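The weighting of embodiment example (4) can be sketched as follows, assuming the representative documents are already available as word sets. The function names `feature_degree` and `similarity`, the `boost` parameter, and the sample word sets are hypothetical; the factor 10 is the one used in the text's example.

```python
def feature_degree(word, rep_word_sets, proper_names, boost=10):
    """F(w) = N / y, multiplied by `boost` for proper names; 0 when y == 0."""
    N = len(rep_word_sets)
    y = sum(1 for words in rep_word_sets if word in words)  # representatives containing w
    if y == 0:
        return 0.0
    degree = N / y
    return boost * degree if word in proper_names else degree


def similarity(doc_words, rep_words, rep_word_sets, proper_names, boost=10):
    """R(D, Si): sum of F(w) over the words shared by D and Si."""
    return sum(
        feature_degree(word, rep_word_sets, proper_names, boost)
        for word in doc_words & rep_words
    )


rep_word_sets = [
    {"Tokyo", "train", "station"},   # W(S1)
    {"train", "ticket", "price"},    # W(S2)
    {"stock", "price", "market"},    # W(S3)
]
doc_words = {"Tokyo", "train", "price"}   # W(D)
proper_names = {"Tokyo"}

scores = [similarity(doc_words, rep, rep_word_sets, proper_names) for rep in rep_word_sets]
best_index = scores.index(max(scores))
print(scores, best_index)  # [31.5, 3.0, 1.5] 0
```

Here "Tokyo" occurs in one of three representatives, so F = 10 × (3 / 1) = 30, while "train" and "price" each occur in two and get F = 1.5; the proper name dominates the choice of category C1.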

【００５７】また、本発明に係る文書クラスタリング装
置では、複数の文書から構成される文書群に含まれる２
つの文書に関して、１又は複数のキーワードから構成さ
れるキーワード群に含まれる各キーワードについてのこ
れら２つの文書のキーワード毎の関連度を表す値を全て
のキーワードについて総和した値をこれら２つの文書の
関連度値として算出し、当該関連度値に基づいて当該文
書群に含まれる文書をクラスタリングするに際して、次
のようにして、２つの文書の関連度値を算出する。すな
わち、固有名キーワード特定手段が設定された固有名キ
ーワード特定条件に基づいて固有名に相当するキーワー
ドを特定し、関連度値算出手段が固有名キーワード特定
手段により特定されたキーワードについての２つの文書
のキーワード毎の関連度を高めてこれら２つの文書の関
連度値を算出する。Further, in the document clustering apparatus according to the present invention, for two documents in a document group composed of a plurality of documents, the apparatus calculates, as the relevance value of the two documents, the sum over all keywords in a keyword group composed of one or more keywords of the value representing the per-keyword relevance of the two documents with respect to each keyword, and clusters the documents in the document group based on that relevance value. In doing so, the relevance value of two documents is calculated as follows: the proper name keyword specifying means specifies keywords corresponding to proper names based on a set proper name keyword specification condition, and the relevance value calculating means raises the per-keyword relevance of the two documents with respect to each keyword so specified when calculating the relevance value of the two documents.

【００５８】なお、文書群に含まれる複数の文書をクラ
スタリングする仕方としては、一例として、これら複数
の文書の中の２つの文書の関連度値を算出することを全
ての文書の組み合わせについて行って、最高の関連度を
表す値が算出された２つの文書をその重心位置で１つに
まとめるといったことを繰り返して実行するような仕方
を用いることができる。As one example of a method for clustering a plurality of documents in a document group, the relevance value of two documents is calculated for every combination of documents, the two documents for which the value indicating the highest relevance was calculated are merged into one at their center of gravity, and this is repeated.

【００５９】ここで、一般的なクラスタリングの一例と
して、或る文書を或るカテゴリに振り分ける処理の手順
例（１）〜（４）を示す。すなわち、手順（１）では、
Ｎ個の文書Ｄ１〜ＤＮから成る文書集合（文書群）に含
まれるそれぞれの文書Ｄ１〜ＤＮに関して、各文書Ｄ１
〜ＤＮに含まれる単語を要素としたベクトルを生成す
る。このとき、各要素の値は、或る単語を含む場合には
１とする一方、含まない場合には０とする。Here, as an example of general clustering, procedure steps (1) to (4) of processing for assigning a certain document to a certain category are shown. In step (1), for each of the documents D1 to DN in a document set (document group) consisting of N documents D1 to DN, a vector whose elements correspond to the words contained in the documents D1 to DN is generated. The value of each element is 1 when the document contains the corresponding word and 0 when it does not.

【００６０】具体例として、文書Ｄ１のベクトル＝
｛１、０、１、０、０、０、１、…、０｝、文書Ｄ２の
ベクトル＝｛０、０、１、０、０、１、０、…、１｝、
文書Ｄ３のベクトル＝｛１、１、０、０、１、１、０、
…、０｝などが得られる。ここで、それぞれのベクトル
はＭ（Ｍは１又は複数）個の要素の値（１又は０）から
構成されており、各要素の値はＭ個の単語のそれぞれが
各文書に出現するか否かを表している。As a specific example, the following vectors are obtained: vector of document D1 = {1, 0, 1, 0, 0, 0, 1, ..., 0}, vector of document D2 = {0, 0, 1, 0, 0, 1, 0, ..., 1}, vector of document D3 = {1, 1, 0, 0, 1, 1, 0, ..., 0}, and so on. Each vector consists of the values (1 or 0) of M elements (M is one or more), and each element value indicates whether the corresponding one of the M words appears in the document.

【００６１】次に、手順（２）では、１からＮまでの値
をとるｉ及びｊについて文書Ｄｉのベクトルと文書Ｄｊ
のベクトルとの内積（Ｄｉ、Ｄｊ）を算出し、算出した
内積（Ｄｉ、Ｄｊ）の値が最大となるｉ及びｊの組を求
める。但し、内積（Ｄｉ、Ｄｊ）は、互いに異なる値と
なるｉ及びｊの組について算出する。なお、内積（Ｄ
ｉ、Ｄｊ）としては、例えば一般に知られているよう
に、２つのベクトルの各要素値の積和を用いる。Next, in step (2), the inner product (Di, Dj) of the vector of document Di and the vector of document Dj is calculated for i and j ranging from 1 to N, and the pair of i and j that maximizes the inner product (Di, Dj) is found. The inner product (Di, Dj) is calculated only for pairs in which i and j differ. As the inner product (Di, Dj), the sum of the products of corresponding element values of the two vectors is used, as is generally known.

【００６２】次に、手順（３）では、上記した内積（Ｄ
ｉ、Ｄｊ）を２つの文書Ｄｉ、Ｄｊの間の類似度とみな
して、全ての組合せの文書間の類似度の中で最も大きい
値の類似度（つまり、内積）が算出された２つの文書の
間の重心を求める。つまり、文書Ｄｐと文書Ｄｑとの類
似度（Ｄｐ、Ｄｑ）が最大であった場合には、当該文書
Ｄｐと当該文書Ｄｑとの間の重心をとる。そして、その
重心を表すベクトルを文書Ｄｐと文書Ｄｑとをクラスタ
リングした文書Ｄｐｑのベクトルとする。Next, in step (3), the inner product (Di, Dj) is regarded as the similarity between the two documents Di and Dj, and the center of gravity is obtained between the two documents for which the largest similarity (that is, inner product) among all document combinations was calculated. That is, when the similarity (Dp, Dq) between the document Dp and the document Dq is the largest, the center of gravity between the document Dp and the document Dq is taken. The vector representing that center of gravity is used as the vector of the document Dpq obtained by clustering the documents Dp and Dq.

【００６３】次に、手順（４）では、クラスタリングさ
れた２つの文書（例えば、文書Ｄｐと文書Ｄｑ）を文書
全体（文書集合）から取り除き、その代わりに、クラス
タリングされた結果である文書Ｄｐｑを文書集合に追加
する。すると、第１回目のクラスタリング手順では文書
集合に含まれる文書の総数が１だけ減少して（Ｎ−１）
となり、以上と同様にしてクラスタリング手順を繰り返
して実行することにより、文書集合に含まれる文書の総
数を１ずつ減少させていく。そして、第ｋ回目のクラス
タリング処理が終了して文書集合に含まれる文書の総数
（Ｎ−ｋ）が例えば予め指定されたクラスタ数以下とな
った場合には処理を終了し、当該クラスタ数以下となっ
ていない場合には上記手順（２）、（３）、（４）を繰
り返して実行する。Next, in step (4), the two clustered documents (for example, the documents Dp and Dq) are removed from the whole (the document set), and the document Dpq resulting from the clustering is added to the document set in their place. The first pass of the clustering procedure thus reduces the total number of documents in the document set by 1, to (N - 1), and repeating the procedure in the same way reduces the total by 1 each time. When the k-th clustering pass has finished and the total number of documents in the document set, (N - k), is, for example, no more than a cluster count designated in advance, the processing ends; otherwise steps (2), (3), and (4) above are repeated.
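Steps (2) to (4) of the general clustering procedure can be sketched as a simple merge loop over presence vectors. Several choices here are assumptions rather than details from the text: the center of gravity is taken as the element-wise midpoint of the two vectors, ties are resolved in favor of the first pair found, and the names `inner_product` and `cluster` and the sample vectors are invented for the example.

```python
from itertools import combinations


def inner_product(u, v):
    """Sum of products of corresponding element values."""
    return sum(a * b for a, b in zip(u, v))


def cluster(vectors, target_count):
    """Merge the most similar pair repeatedly until `target_count` clusters remain.

    Returns a list of (member document indices, cluster vector) pairs.
    """
    clusters = [([i], list(v)) for i, v in enumerate(vectors)]
    while len(clusters) > max(target_count, 1):
        # Step (2): pair (p, q), p != q, with the largest inner product.
        p, q = max(
            combinations(range(len(clusters)), 2),
            key=lambda pq: inner_product(clusters[pq[0]][1], clusters[pq[1]][1]),
        )
        # Step (3): center of gravity, assumed here to be the element-wise midpoint.
        members = clusters[p][0] + clusters[q][0]
        centroid = [(a + b) / 2 for a, b in zip(clusters[p][1], clusters[q][1])]
        # Step (4): replace Dp and Dq with the merged document Dpq.
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append((members, centroid))
    return clusters


vectors = [
    [1, 1, 0, 0],  # D1
    [1, 1, 1, 0],  # D2
    [0, 0, 1, 1],  # D3
    [0, 0, 0, 1],  # D4
]
result = cluster(vectors, target_count=2)
groups = sorted(sorted(members) for members, _ in result)
print(groups)  # [[0, 1], [2, 3]]
```

Each pass removes two entries and adds one, so the document set shrinks by one per pass, as described in step (4).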

【００６４】また、固有名に着目したクラスタリング手
法の具体例として、上記のようなクラスタリングに本発
明を適用した場合の実施態様例（１）〜（３）を示す。
実施態様例（１）では、上記手順（２）において、１か
らＮまでの値となるｉ及びｊについて内積が最大値とな
るｉ及びｊの組を求めるに際して、文書Ｄｉのベクトル
と文書Ｄｊのベクトルの内積として（Ｄｉ、Ｄｊ）’＝
（Ｄｉ、Ｄｊ）＋ｋという値を算出する。ここで、ｋ
は、これら２つのベクトルを構成する要素の値の中で要
素が出現すること（１という要素値）が一致した固有名
の個数であり、つまり、文書Ｄｉと文書Ｄｊとの両方に
出現する共通の固有名の個数である。Further, as specific examples of a clustering method that focuses on proper names, embodiment examples (1) to (3), in which the present invention is applied to the clustering described above, are shown. In embodiment example (1), when the pair of i and j that maximizes the inner product is sought in step (2) above for i and j ranging from 1 to N, the value (Di, Dj)' = (Di, Dj) + k is calculated as the inner product of the vector of document Di and the vector of document Dj. Here, k is the number of proper names for which both vectors have an element value of 1 (indicating that the word appears), that is, the number of proper names common to the documents Di and Dj.
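The corrected inner product (Di, Dj)' = (Di, Dj) + k can be sketched as below for 0/1 presence vectors. The function name `boosted_inner_product` and the representation of proper names as a set of vector positions are assumptions made for the example.

```python
def boosted_inner_product(u, v, proper_name_positions):
    """(Di, Dj)' = (Di, Dj) + k for 0/1 presence vectors.

    proper_name_positions: indices of the vector elements that stand for
    proper names (a hypothetical representation).
    """
    base = sum(a * b for a, b in zip(u, v))
    # k: proper names whose element value is 1 in both vectors.
    k = sum(1 for i in proper_name_positions if u[i] == 1 and v[i] == 1)
    return base + k


# Element 0 stands for a proper name; elements 1 to 3 are ordinary words.
d_i = [1, 0, 1, 1]
d_j = [1, 1, 1, 0]
value = boosted_inner_product(d_i, d_j, proper_name_positions={0})
print(value)  # 3
```

The plain inner product is 2 (elements 0 and 2 match); the shared proper name at element 0 adds k = 1.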

【００６５】実施態様例（２）では、上記手順（１）な
どにおいて、各単語が各文書に含まれているか否かを１
又は０の要素の値で表したベクトルを構成する代わり
に、各単語が各文書中に出現する回数や頻度などを要素
の値として表したベクトルを構成する。実施態様例
（３）では、実施態様例（２）を適用した場合におい
て、更に、ベクトル中において固有名に対応した要素の
値については所定数倍するなどして実施態様例（２）で
算出される値を増加させる。以上のような実施態様例で
は、固有名に重みを置いたクラスタリングが可能とな
る。In embodiment example (2), in step (1) above and elsewhere, instead of constructing vectors whose element values (1 or 0) indicate whether each word is contained in each document, vectors are constructed whose element values represent, for example, the number of times or the frequency with which each word appears in each document. In embodiment example (3), when embodiment example (2) is applied, the element values corresponding to proper names in the vectors are further increased from the values calculated in embodiment example (2), for example by multiplying them by a predetermined factor. The embodiment examples above enable clustering that places weight on proper names.
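Embodiment examples (2) and (3) can be sketched together: element values are occurrence counts, and the counts of proper names are multiplied by a factor. The function name `frequency_vector`, whitespace tokenization, and the factor 3 are assumptions for illustration; the text leaves the predetermined factor open.

```python
from collections import Counter


def frequency_vector(text, vocabulary, proper_names, factor=3):
    """Build a count vector over `vocabulary`, scaling proper-name elements.

    Whitespace tokenization and the default factor are assumptions.
    """
    counts = Counter(text.split())
    return [
        counts[word] * (factor if word in proper_names else 1)
        for word in vocabulary
    ]


vocabulary = ["Tokyo", "train", "station"]
vector = frequency_vector(
    "Tokyo train Tokyo station train train",
    vocabulary,
    proper_names={"Tokyo"},
)
print(vector)  # [6, 3, 1]
```

With a factor of 1 the result is the plain count vector of embodiment example (2); here "Tokyo" occurs twice and is scaled to 6.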

【００６６】また、本発明は、以上に示したような関連
度値を算出する方法などを提供する。なお、このような
本発明に係る方法は、例えばＣＰＵやメモリ等を備えた
コンピュータなどにおいて実行される。例えば、本発明
に係る関連度値算出方法では、１又は複数のキーワード
から構成されるキーワード群と１又は複数の文書から構
成される文書群に関する関連度を表す値として、当該キ
ーワード群に含まれる各キーワードと当該文書群に関す
るキーワード毎の関連度値を全てのキーワードについて
総和した値を算出するに際して、例えばメモリに設定さ
れた固有名キーワード特定条件に基づいて固有名に相当
するキーワードを特定し、特定されたキーワードと文書
群に関するキーワード毎の関連度を高めてキーワード群
と文書群に関する関連度値を算出する。The present invention also provides, among other things, a method for calculating the relevance value described above. Such a method according to the present invention is executed, for example, on a computer equipped with a CPU, a memory, and the like. For example, in the relevance value calculating method according to the present invention, when the sum over all keywords of the per-keyword relevance values between each keyword in a keyword group composed of one or more keywords and a document group composed of one or more documents is calculated as the value representing the relevance between the keyword group and the document group, keywords corresponding to proper names are specified based on a proper name keyword specification condition set, for example, in a memory, and the per-keyword relevance between each specified keyword and the document group is raised when calculating the relevance value between the keyword group and the document group.

【００６７】また、本発明に係る関連文書検索方法で
は、１又は複数の文書から構成される種文書群から１又
は複数のキーワードから構成されるキーワード群を抽出
し、抽出したキーワード群に関連する文書を複数の検索
対象となる文書から構成される検索対象文書群から検索
し、抽出したキーワード群に含まれる各キーワード毎に
種文書群に含まれる文書の総数に対する当該種文書群中
でキーワードが出現する文書の数の割合値を検索対象文
書群に含まれる文書の総数に対する当該検索対象文書群
中でキーワードが出現する文書の数の割合値で除算した
値を各キーワード毎の出現割合値として算出し、検索し
た各文書に関して抽出したキーワード群に含まれるキー
ワードの中で文書に出現するキーワードの出現割合値を
総和した値を当該キーワード群と当該文書との関連度を
表す値として算出し、当該関連度値が大きい順に検索し
た各文書に関する情報を例えば情報出力装置により出力
するに際して、例えばメモリに設定された固有名キーワ
ード特定条件に基づいて固有名に相当するキーワードを
特定し、特定されたキーワードの出現割合値を所定数倍
してキーワード群と文書との関連度値を算出する。Further, in the related document search method according to the present invention, a keyword group composed of one or more keywords is extracted from a seed document group composed of one or more documents; documents related to the extracted keyword group are retrieved from a search target document group composed of a plurality of documents to be searched; for each keyword in the extracted keyword group, an appearance ratio value is calculated by dividing the ratio of the number of documents in the seed document group in which the keyword appears to the total number of documents in the seed document group by the ratio of the number of documents in the search target document group in which the keyword appears to the total number of documents in the search target document group; for each retrieved document, the sum of the appearance ratio values of those keywords in the extracted keyword group that appear in the document is calculated as the value representing the relevance between the keyword group and the document; and information on the retrieved documents is output in descending order of relevance value by, for example, an information output device. In doing so, keywords corresponding to proper names are specified based on a proper name keyword specification condition set, for example, in a memory, and the appearance ratio value of each specified keyword is multiplied by a predetermined factor when calculating the relevance value between the keyword group and the document.

【００６８】また、本発明に係る文書カテゴライズ方法
では、１又は複数のキーワードから構成される複数のキ
ーワード群と１又は複数の文書から構成される文書群に
関して、各キーワード群毎にキーワード群に含まれる各
キーワードと文書群とのキーワード毎の関連度を表す値
を全てのキーワードについて総和した値を当該キーワー
ド群と当該文書群との関連度値として算出し、算出され
る関連度値が最高の関連度を表す値となるキーワード群
に当該文書群をカテゴライズするに際して、例えばメモ
リに設定された固有名キーワード特定条件に基づいて固
有名に相当するキーワードを特定し、特定されたキーワ
ードと文書群とのキーワード毎の関連度を高めてキーワ
ード群と文書群との関連度値を算出する。Further, in the document categorizing method according to the present invention, given a plurality of keyword groups each composed of one or more keywords and a document group composed of one or more documents, the sum over all keywords in each keyword group of the value representing the per-keyword relevance between the keyword and the document group is calculated as the relevance value between that keyword group and the document group, and the document group is categorized into the keyword group whose calculated relevance value indicates the highest relevance. In doing so, keywords corresponding to proper names are specified based on a proper name keyword specification condition set, for example, in a memory, and the per-keyword relevance between each specified keyword and the document group is raised when calculating the relevance value between the keyword group and the document group.

【００６９】また、本発明に係る文書クラスタリング方
法では、複数の文書から構成される文書群に含まれる２
つの文書に関して、１又は複数のキーワードから構成さ
れるキーワード群に含まれる各キーワードについてのこ
れら２つの文書のキーワード毎の関連度を表す値を全て
のキーワードについて総和した値をこれら２つの文書の
関連度値として算出し、当該関連度値に基づいて当該文
書群に含まれる文書をクラスタリングするに際して、例
えばメモリに設定された固有名キーワード特定条件に基
づいて固有名に相当するキーワードを特定し、特定され
たキーワードについての２つの文書のキーワード毎の関
連度を高めてこれら２つの文書の関連度値を算出する。Further, in the document clustering method according to the present invention, for two documents in a document group composed of a plurality of documents, the sum over all keywords in a keyword group composed of one or more keywords of the value representing the per-keyword relevance of the two documents with respect to each keyword is calculated as the relevance value of the two documents, and the documents in the document group are clustered based on that relevance value. In doing so, keywords corresponding to proper names are specified based on a proper name keyword specification condition set, for example, in a memory, and the per-keyword relevance of the two documents with respect to each specified keyword is raised when calculating the relevance value of the two documents.

【００７０】また、本発明では、以上に示したような関
連度値を算出する処理を実行させるプログラムなどを提
供する。例えば、本発明に係るプログラムは、１又は複
数のキーワードから構成されるキーワード群と１又は複
数の文書から構成される文書群に関する関連度を表す値
として、当該キーワード群に含まれる各キーワードと当
該文書群に関するキーワード毎の関連度値を全てのキー
ワードについて総和した値を算出する処理をコンピュー
タに実行させるに際して、例えばメモリに設定された固
有名キーワード特定条件に基づいて固有名に相当するキ
ーワードを特定する処理と、特定されたキーワードと文
書群に関するキーワード毎の関連度を高めてキーワード
群と文書群に関する関連度値を算出する処理とを当該コ
ンピュータに実行させる。The present invention also provides, among other things, a program that causes the processing for calculating the relevance value described above to be executed. For example, a program according to the present invention, in causing a computer to execute processing that calculates, as the value representing the relevance between a keyword group composed of one or more keywords and a document group composed of one or more documents, the sum over all keywords of the per-keyword relevance values between each keyword in the keyword group and the document group, causes the computer to execute processing that specifies keywords corresponding to proper names based on a proper name keyword specification condition set, for example, in a memory, and processing that raises the per-keyword relevance between each specified keyword and the document group when calculating the relevance value between the keyword group and the document group.

【００７１】また、本発明に係るプログラムは、１又は
複数の文書から構成される種文書群から１又は複数のキ
ーワードから構成されるキーワード群を抽出する処理
と、抽出したキーワード群に関連する文書を複数の検索
対象となる文書から構成される検索対象文書群から検索
する処理と、抽出したキーワード群に含まれる各キーワ
ード毎に種文書群に含まれる文書の総数に対する当該種
文書群中でキーワードが出現する文書の数の割合値を検
索対象文書群に含まれる文書の総数に対する当該検索対
象文書群中でキーワードが出現する文書の数の割合値で
除算した値を各キーワード毎の出現割合値として算出す
る処理と、検索した各文書に関して抽出したキーワード
群に含まれるキーワードの中で文書に出現するキーワー
ドの出現割合値を総和した値を当該キーワード群と当該
文書との関連度を表す値として算出する処理と、当該関
連度値が大きい順に検索した各文書に関する情報を例え
ば情報出力機能により出力する処理とをコンピュータに
実行させるに際して、例えばメモリに設定された固有名
キーワード特定条件に基づいて固有名に相当するキーワ
ードを特定する処理と、特定されたキーワードの出現割
合値を所定数倍してキーワード群と文書との関連度値を
算出する処理とを当該コンピュータに実行させる。Further, a program according to the present invention causes a computer to execute: a process of extracting a keyword group composed of one or more keywords from a seed document group composed of one or more documents; a process of retrieving documents related to the extracted keyword group from a search target document group composed of a plurality of documents to be searched; a process of calculating, for each keyword in the extracted keyword group, an appearance ratio value obtained by dividing the ratio of the number of documents in the seed document group in which the keyword appears to the total number of documents in the seed document group by the ratio of the number of documents in the search target document group in which the keyword appears to the total number of documents in the search target document group; a process of calculating, for each retrieved document, the sum of the appearance ratio values of those keywords in the extracted keyword group that appear in the document, as the value representing the degree of association between the keyword group and the document; and a process of outputting information on the retrieved documents in descending order of relevance value by, for example, an information output function. In doing so, the program causes the computer to execute a process of specifying keywords that correspond to proper names based on a proper name keyword specifying condition set, for example, in memory, and a process of calculating the relevance value between the keyword group and each document after multiplying the appearance ratio value of each specified keyword by a predetermined factor.

【００７２】また、本発明に係るプログラムは、１又は
複数のキーワードから構成される複数のキーワード群と
１又は複数の文書から構成される文書群に関して、各キ
ーワード群毎にキーワード群に含まれる各キーワードと
文書群とのキーワード毎の関連度を表す値を全てのキー
ワードについて総和した値を当該キーワード群と当該文
書群との関連度値として算出する処理と、算出される関
連度値が最高の関連度を表す値となるキーワード群に当
該文書群をカテゴライズする処理とをコンピュータに実
行させるに際して、例えばメモリに設定された固有名キ
ーワード特定条件に基づいて固有名に相当するキーワー
ドを特定する処理と、特定されたキーワードと文書群と
のキーワード毎の関連度を高めてキーワード群と文書群
との関連度値を算出する処理とを当該コンピュータに実
行させる。Further, a program according to the present invention causes a computer to execute, for a plurality of keyword groups each composed of one or more keywords and a document group composed of one or more documents: a process of calculating, for each keyword group, the sum over all keywords of the per-keyword relevance values between each keyword in the keyword group and the document group, as the relevance value between that keyword group and the document group; and a process of categorizing the document group into the keyword group whose calculated relevance value represents the highest degree of association. In doing so, the program causes the computer to execute a process of specifying keywords that correspond to proper names based on a proper name keyword specifying condition set, for example, in memory, and a process of calculating the relevance value between the keyword group and the document group after raising the per-keyword relevance between each specified keyword and the document group.

【００７３】また、本発明に係るプログラムは、複数の
文書から構成される文書群に含まれる２つの文書に関し
て、１又は複数のキーワードから構成されるキーワード
群に含まれる各キーワードについてのこれら２つの文書
のキーワード毎の関連度を表す値を全てのキーワードに
ついて総和した値をこれら２つの文書の関連度値として
算出する処理と、当該関連度値に基づいて当該文書群に
含まれる文書をクラスタリングする処理とをコンピュー
タに実行させるに際して、例えばメモリに設定された固
有名キーワード特定条件に基づいて固有名に相当するキ
ーワードを特定する処理と、特定されたキーワードにつ
いての２つの文書のキーワード毎の関連度を高めてこれ
ら２つの文書の関連度値を算出する処理とを当該コンピ
ュータに実行させる。Further, a program according to the present invention causes a computer to execute, for two documents included in a document group composed of a plurality of documents: a process of calculating, as the relevance value of the two documents, the sum over all keywords of the per-keyword relevance values of the two documents for each keyword in a keyword group composed of one or more keywords; and a process of clustering the documents in the document group based on that relevance value. In doing so, the program causes the computer to execute a process of specifying keywords that correspond to proper names based on a proper name keyword specifying condition set, for example, in memory, and a process of calculating the relevance value of the two documents after raising the per-keyword relevance of the two documents for each specified keyword.

【００７４】[0074]

【発明の実施の形態】本発明に係る実施例を図面を参照
して説明する。まず、本発明の第１実施例に係る関連文
書検索装置を説明する。図１には、本発明を適用した関
連文書検索装置の構成例を示してあり、この関連文書検
索装置には、検索要求を受け付ける検索要求受付部１
と、固有名を特定するための情報や固有名が属するカテ
ゴリの情報などをメモリに保持する固有名保持部２と、
各単語の関連度を表す値を計算する関連度計算部３と、
検索対象となる文書における各単語の出現頻度をメモリ
に保持する検索対象文書単語出現頻度保持部４と、検索
対象となる文書をメモリに保持する検索対象データベー
ス（ＤＢ）５と、検索対象データベース５に保持される
文書を検索する検索部６と、文書の関連度を表す値を計
算する文書関連度計算部７と、検索結果を提示する検索
結果提示部８とが備えられている。DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS Embodiments of the present invention will be described with reference to the drawings. First, a related document search device according to a first embodiment of the present invention will be described. FIG. 1 shows a configuration example of a related document search device to which the present invention is applied. The related document search device includes a search request receiving unit 1 that accepts a search request; a proper name holding unit 2 that holds in memory information for identifying proper names, information on the categories to which proper names belong, and the like; a degree-of-association calculation unit 3 that calculates a value representing the degree of association of each word; a search target document word appearance frequency holding unit 4 that holds in memory the appearance frequency of each word in the documents to be searched; a search target database (DB) 5 that holds in memory the documents to be searched; a search unit 6 that searches the documents held in the search target database 5; a document relevance calculation unit 7 that calculates a value representing the degree of association of each document; and a search result presentation unit 8 that presents the search results.

【００７５】本例の関連文書検索装置により行われる動
作の一例を示す。まず、ユーザは、検索のための検索要
求として、種文書集合（種文書群）の文書内容の情報及
び固有名のカテゴリを指定する情報を検索要求受付部１
に入力し、当該検索要求受付部１はこれらの情報を受け
付ける。ここで、本例では、種文書集合は、或る歌手グ
ループのコンサート情報が記載された４件分の文書から
構成されているとする。また、本例では、ユーザは、固
有名のカテゴリとして「人名」のカテゴリを指定したと
する。An example of the operation performed by the related document search device of this example follows. First, as a search request, the user enters into the search request receiving unit 1 the document content of a seed document set (seed document group) and information designating a proper name category, and the search request receiving unit 1 accepts this information. In this example, the seed document set is assumed to consist of four documents describing concert information for a certain singer group, and the user is assumed to have designated "personal name" as the proper name category.

【００７６】次に、関連度計算部３は、種文書集合から
キーワードとなる単語（キーワード群）及び各単語の出
現頻度を抽出し、種文書集合中における各単語の出現頻
度と、検索対象文書単語出現頻度保持部４に保持された
検索対象文書全体に対する各単語の出現頻度とに基づい
て、各単語の関連度値を算出する。Next, the degree-of-association calculation unit 3 extracts words (keyword groups) that are keywords from the seed document set and the appearance frequency of each word, and the appearance frequency of each word in the seed document set and the search target document. The relevance value of each word is calculated based on the appearance frequency of each word with respect to the entire search target document held in the word appearance frequency holding unit 4.

【００７７】具体的に、本例では、種文書集合から、
「ＡＢＣレコード出版」、「武道館」、「新曲」、「歌
手グループＤＥＦ」、「４月２７日」、「コンサート」
という単語が抽出されたとする。なお、「歌手グループ
ＤＥＦ」は、或る歌手グループの名前であり、「人名」
という固有名のカテゴリに分類されているとする。Specifically, in this example, assume that the words "ABC record publication", "Budokan", "new song", "singer group DEF", "April 27th", and "concert" are extracted from the seed document set. "Singer group DEF" is the name of a certain singer group and is assumed to be classified in the proper name category "personal name".

【００７８】また、上記したそれぞれの単語は、種文書
集合に含まれるＮ（＝４）件の種文書の中で、次のよう
な数（ｎ）の文書に出現しているとする。「ＡＢＣレコ
ード出版」は１件の種文書に出現している（ｎ＝１）。
「武道館」は２件の種文書に出現している（ｎ＝２）。
「新曲」は３件の種文書に出現している（ｎ＝３）。
「歌手グループＤＥＦ」は４件の種文書に出現している
（ｎ＝４）。「４月２７日」は１件の種文書に出現して
いる（ｎ＝１）。「コンサート」は３件の種文書に出現
している（ｎ＝３）。It is also assumed that each of the above-mentioned words appears in the following number (n) of documents among the N (= 4) seed documents included in the seed document set. "ABC record publication" appears in one seed document (n = 1).
"Budokan" appears in two seed documents (n = 2).
The "new song" appears in three seed documents (n = 3).
“Singer group DEF” appears in four seed documents (n = 4). "April 27th" appears in one seed document (n = 1). "Concert" appears in three seed documents (n = 3).

【００７９】また、検索対象文書の総数Ｍが１０００件
（Ｍ＝１０００）であるとし、上記した各単語が検索対
象文書集合の中で、次のような数（ｍ）の文書に出現す
るとする。なお、検索対象文書単語出現頻度保持部４に
は、（ｍ／Ｍ）の値が保持されている。「ＡＢＣレコー
ド出版」は１０件の検索対象文書に出現している（ｍ＝
１０）。「武道館」は５件の検索対象文書に出現してい
る（ｍ＝５）。「新曲」は１００件の検索対象文書に出
現している（ｍ＝１００）。「歌手グループＤＥＦ」は
２００件の検索対象文書に出現している（ｍ＝２０
０）。「４月２７日」は３件の検索対象文書に出現して
いる（ｍ＝３）。「コンサート」は３００件の検索対象
文書に出現している（ｍ＝３００）。It is also assumed that the total number M of search target documents is 1000 (M = 1000) and that each of the above words appears in the following number (m) of documents in the search target document set. The search target document word appearance frequency holding unit 4 holds the value (m / M). "ABC record publication" appears in 10 search target documents (m = 10). "Budokan" appears in 5 search target documents (m = 5). "New song" appears in 100 search target documents (m = 100). "Singer group DEF" appears in 200 search target documents (m = 200). "April 27th" appears in 3 search target documents (m = 3). "Concert" appears in 300 search target documents (m = 300).

【００８０】そして、各単語の関連度値を（ｎ／Ｎ）／
（ｍ／Ｍ）として計算すると、上記した各単語の関連度
値は、次のようになる。「ＡＢＣレコード出版」の単語
毎の関連度値は２５である。「武道館」の単語毎の関連
度値は１００である。「新曲」の単語毎の関連度値は
７．５である。「歌手グループＤＥＦ」の単語毎の関連
度値は５である。「４月２７日」の単語毎の関連度値は
（約）８３である。「コンサート」の単語毎の関連度値
は２．５である。When the relevance value of each word is calculated as (n / N) / (m / M), the values for the above words are as follows. The per-word relevance value of "ABC record publication" is 25. The per-word relevance value of "Budokan" is 100. The per-word relevance value of "new song" is 7.5. The per-word relevance value of "singer group DEF" is 5. The per-word relevance value of "April 27th" is approximately 83. The per-word relevance value of "concert" is 2.5.
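The arithmetic of this worked example can be checked with a short sketch. The function name `relevance` and the English word labels are illustrative assumptions; only the formula (n / N) / (m / M) and the counts come from the text.

```python
# Worked example from the text: N seed documents, M search target documents.
N = 4
M = 1000

# word -> (n, m): number of seed documents / search target documents
# in which the word appears. English labels stand in for the original words.
counts = {
    "ABC record publication": (1, 10),
    "Budokan": (2, 5),
    "new song": (3, 100),
    "singer group DEF": (4, 200),
    "April 27th": (1, 3),
    "concert": (3, 300),
}


def relevance(n, m, n_total=N, m_total=M):
    """Per-word relevance value (n / N) / (m / M)."""
    return (n / n_total) / (m / m_total)


relevance_values = {word: relevance(n, m) for word, (n, m) in counts.items()}

for word, value in relevance_values.items():
    print(f"{word}: {value:.1f}")
```

A word that is common in the seed documents but rare in the search target documents, such as "Budokan", receives a high value, while a word that is common everywhere, such as "concert", receives a low one.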

【００８１】次に、固有名保持部２は、当該固有名保持
部２に保持された情報に基づいて、上記した各単語がそ
れぞれ指定された「人名」というカテゴリに属するか否
かを判定する。なお、本例では、カテゴリが指定された
場合を示すが、例えばカテゴリの指定が無く検索要求受
付部１によりカテゴリの指定が受け付けられなかった場
合には、固有名に相当する全ての単語について重み付け
をする。本例では、上記した単語の中で「歌手グループ
ＤＥＦ」という語句が「人名」というカテゴリに属する
固有名に相当するとして判定される。Next, the proper name holding unit 2 determines, based on the information it holds, whether each of the above words belongs to the designated category "personal name". This example shows a case in which a category is designated. If no category is designated, so that the search request receiving unit 1 accepts no category designation, all words corresponding to proper names are weighted. In this example, the word "singer group DEF" is determined to be a proper name belonging to the category "personal name".

【００８２】次に、関連度計算部３は、ユーザが指定し
たカテゴリと一致した固有名である「歌手グループＤＥ
Ｆ」という単語について関連度値を調整する。本例で
は、「人名」というカテゴリに属する固有名であると判
定された「歌手グループＤＥＦ」という単語の関連度値
を１００倍する。すると、上記した各単語の関連度値は
次のような値に調整される。「ＡＢＣレコード出版」の
単語毎の関連度値は２５である。「武道館」の単語毎の
関連度値は１００である。「新曲」の単語毎の関連度値
は７．５である。「歌手グループＤＥＦ」の単語毎の関
連度値は５００である。「４月２７日」の単語毎の関連
度値は（約）８３である。「コンサート」の単語毎の関
連度値は２．５である。Next, the degree-of-association calculation unit 3 adjusts the relevance value of the word "singer group DEF", the proper name that matches the category designated by the user. In this example, the relevance value of the word "singer group DEF", which was determined to be a proper name belonging to the category "personal name", is multiplied by 100. The relevance values of the above words are then adjusted as follows. The per-word relevance value of "ABC record publication" is 25. The per-word relevance value of "Budokan" is 100. The per-word relevance value of "new song" is 7.5. The per-word relevance value of "singer group DEF" is 500. The per-word relevance value of "April 27th" is approximately 83. The per-word relevance value of "concert" is 2.5.
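The adjustment step can be sketched as follows. The function name `boost_proper_names` is a hypothetical helper introduced for illustration; the factor of 100 and the values are taken from the example in the text.

```python
def boost_proper_names(values, proper_names, factor):
    """Multiply the relevance value of each matching proper name by `factor`.

    Words that are not in `proper_names` keep their original value.
    """
    return {
        word: value * factor if word in proper_names else value
        for word, value in values.items()
    }


# Per-word relevance values before adjustment (from the worked example).
values = {
    "ABC record publication": 25.0,
    "Budokan": 100.0,
    "new song": 7.5,
    "singer group DEF": 5.0,
    "April 27th": 83.0,
    "concert": 2.5,
}

# "singer group DEF" is the only word judged to be a "personal name".
adjusted = boost_proper_names(values, {"singer group DEF"}, factor=100)
print(adjusted["singer group DEF"])
```

Before the adjustment "singer group DEF" ranks below most other words because it is common in the search target documents. After the adjustment it dominates.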

【００８３】次に、検索部６は、上記した各単語と当該
各単語について得られた関連度値を関連語情報として関
連度計算部３から受け取り、受け取った単語をキーワー
ドとして用いて例えば一般に知られるＯＲ検索を検索対
象データベース５に対して行い、これにより、例えば少
なくとも２つのキーワードを含む文書を検索対象データ
ベース５に保持された複数の文書（検索対象文書群）の
中から検索する。Next, the search unit 6 receives the above words and the relevance value obtained for each word from the degree-of-association calculation unit 3 as related word information. Using the received words as keywords, it performs, for example, a commonly known OR search on the search target database 5 and thereby retrieves, from the plurality of documents held in the search target database 5 (the search target document group), documents that contain, for example, at least two of the keywords.
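A minimal sketch of this retrieval condition follows, assuming each document is already available as a set of words. The function name `search_min_keywords` and the sample data are illustrative assumptions, not the device's actual search implementation.

```python
def search_min_keywords(documents, keywords, min_hits=2):
    """Return {doc_id: matched keywords} for documents containing
    at least `min_hits` of the given keywords."""
    keywords = set(keywords)
    hits = {}
    for doc_id, words in documents.items():
        matched = keywords & set(words)
        if len(matched) >= min_hits:
            hits[doc_id] = matched
    return hits


keywords = [
    "ABC record publication", "Budokan", "new song",
    "singer group DEF", "April 27th", "concert",
]

# Hypothetical search target documents, given as sets of words.
documents = {
    1: {"Budokan", "new song", "ticket"},
    2: {"singer group DEF", "concert"},
    3: {"ABC record publication", "new song"},
    4: {"ABC record publication", "new song", "April 27th"},
    5: {"concert", "orchestra"},  # only one keyword, so it is not retrieved
}

hits = search_min_keywords(documents, keywords)
print(sorted(hits))
```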

【００８４】本例では、次の４件の文書１〜４が検索さ
れて、それぞれ次のような単語が各文書に含まれるとす
る。文書１には、「武道館」、「新曲」という単語が含
まれる。文書２には、「歌手グループＤＥＦ」、「コン
サート」という単語が含まれる。文書３には、「ＡＢＣ
レコード出版」、「新曲」という単語が含まれる。文書
４には、「ＡＢＣレコード出版」、「新曲」、「４月２
７日」という単語が含まれる。In this example, it is assumed that the following four documents 1 to 4 are retrieved and that each contains the following words. Document 1 contains the words "Budokan" and "new song". Document 2 contains the words "singer group DEF" and "concert". Document 3 contains the words "ABC record publication" and "new song". Document 4 contains the words "ABC record publication", "new song", and "April 27th".

【００８５】次に、文書関連度計算部７は、検索された
各文書１〜４に含まれる関連語（単語）の関連度値に基
づいて、各文書１〜４の文書関連度値（キーワード群と
各文書１〜４との関連度値）を計算する。本例では、検
索された各文書１〜４に含まれる各関連語の関連度値を
総和した値を各文書１〜４の文書関連度値とする。する
と、本例では、検索された各文書１〜４の文書関連度値
は、次のようになる。文書１の文書関連度値は、１００
＋７．５＝１０７．５となる。文書２の文書関連度値
は、５００＋２．５＝５０２．５となる。文書３の文書
関連度値は、２５＋７．５＝３２．５となる。文書４の
文書関連度値は、２５＋７．５＋８３＝１１５．５とな
る。Next, the document relevance calculation unit 7 calculates the document relevance value of each of the documents 1 to 4 (the relevance value between the keyword group and each of the documents 1 to 4) based on the relevance values of the related words (words) contained in each retrieved document. In this example, the document relevance value of each document is the sum of the relevance values of the related words contained in that document. The document relevance values of the retrieved documents 1 to 4 are then as follows. The document relevance value of document 1 is 100 + 7.5 = 107.5. The document relevance value of document 2 is 500 + 2.5 = 502.5. The document relevance value of document 3 is 25 + 7.5 = 32.5. The document relevance value of document 4 is 25 + 7.5 + 83 = 115.5.
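The summation can be reproduced with the sketch below. The function name `document_relevance` is an illustrative assumption; the adjusted per-word values and the words in each document come from the worked example, with "April 27th" rounded to 83 as in the text.

```python
# Adjusted per-word relevance values from the worked example.
word_relevance = {
    "ABC record publication": 25.0,
    "Budokan": 100.0,
    "new song": 7.5,
    "singer group DEF": 500.0,
    "April 27th": 83.0,
    "concert": 2.5,
}

# Keywords contained in each retrieved document.
doc_words = {
    1: ["Budokan", "new song"],
    2: ["singer group DEF", "concert"],
    3: ["ABC record publication", "new song"],
    4: ["ABC record publication", "new song", "April 27th"],
}


def document_relevance(words, relevance):
    """Sum the relevance values of the keywords that appear in a document."""
    return sum(relevance[word] for word in words if word in relevance)


scores = {doc: document_relevance(words, word_relevance)
          for doc, words in doc_words.items()}
print(scores)
```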

【００８６】次に、検索結果提示部８は、検索された文
書１〜４に関する情報と各文書１〜４について得られた
文書関連度値を文書関連度計算部７から受け取り、受け
取った文書１〜４に関する情報を文書関連度値が高い順
にユーザに対して提示する。本例では、各文書１〜４の
タイトルの情報を文書関連度値が高い方から順に並べて
ディスプレイ画面に表示出力する。なお、本例では、次
のような順序で４件の文書１〜４のタイトル情報が並べ
られる。（文書関連度値が１番目に高い文書）文書２（文書関連度値が２番目に高い文書）文書４（文書関連度値が３番目に高い文書）文書１（文書関連度値が４番目に高い文書）文書３Next, the search result presentation unit 8 receives from the document relevance calculation unit 7 the information on the retrieved documents 1 to 4 and the document relevance value obtained for each of them, and presents the information on the documents 1 to 4 to the user in descending order of document relevance value. In this example, the title information of the documents 1 to 4 is arranged in descending order of document relevance value and displayed on a display screen. The title information of the four documents is arranged in the following order: document 2 (highest document relevance value), document 4 (second highest), document 1 (third highest), document 3 (fourth highest).
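The presentation order follows directly from sorting by document relevance value. A minimal sketch, using the scores from the example (the variable names are illustrative):

```python
# Document relevance values from the worked example.
scores = {1: 107.5, 2: 502.5, 3: 32.5, 4: 115.5}

# Sort document numbers by relevance value, highest first.
ranking = [doc for doc, _ in sorted(scores.items(),
                                    key=lambda item: item[1],
                                    reverse=True)]

for rank, doc in enumerate(ranking, start=1):
    print(f"{rank}. document {doc} ({scores[doc]})")
```

Without the factor of 100 applied to "singer group DEF", document 2 would score only 5 + 2.5 = 7.5 and would be ranked last, which illustrates the effect of the proper name weighting.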

【００８７】以上のように、本例の関連文書検索装置で
は、各単語毎の関連度値を計算する際に、指定された
「人名」というカテゴリに属する固有名に相当する単語
の関連度値を大きくするような重み付けを行う補正を実
行することにより、例えば当該カテゴリに属する「歌手
グループＤＥＦ」という単語が出現する文書２を最も文
書関連度値が高い文書としてユーザに通知することがで
きる。As described above, when calculating the relevance value of each word, the related document search device of this example applies a weighting correction that increases the relevance value of words corresponding to proper names in the designated category "personal name". As a result, document 2, in which the word "singer group DEF" belonging to that category appears, can be reported to the user as the document with the highest document relevance value.

【００８８】なお、本例では、ユーザが種文書集合を入
力する構成としたが、例えばユーザがキーワードを入力
し、関連文書検索装置が当該キーワードに関連する文書
などを種文書集合として検索するような構成を用いるこ
ともできる。In this example, the user inputs the seed document set. However, a configuration may also be used in which, for example, the user inputs a keyword and the related document search device retrieves documents related to that keyword as the seed document set.

【００８９】また、本例では、「人名」というカテゴリ
を指定する場合を示したが、例えばユーザが入力する種
文書が同一であっても、「会社名」や、「場所の名前」
や、「日付」や、「時間」や、「日時」や、「製品名」
などの他のカテゴリを指定することにより、指定したカ
テゴリに含まれる固有名を重要視した文書関連度値を算
出することができる。また、本例では、１つのカテゴリ
を指定する場合を示したが、上述のようにカテゴリを指
定せずに全ての固有名に相当する単語についての関連度
値を大きくする態様や、或いは、複数のカテゴリを指定
する態様を用いることもできる。また、例えば各カテゴ
リ毎に補正前の単語毎の関連度値をどれくらい大きく補
正するかといった補正の度合いをユーザにより指定する
ような態様を用いることもできる。This example shows a case in which the category "personal name" is designated. However, even when the seed documents input by the user are the same, designating another category such as "company name", "place name", "date", "time", "date and time", or "product name" makes it possible to calculate a document relevance value that emphasizes the proper names in the designated category. This example also shows a case in which a single category is designated. As described above, it is also possible to increase the relevance values of words corresponding to all proper names without designating a category, or to designate a plurality of categories. It is further possible to let the user specify, for each category, the degree of correction, that is, how much the uncorrected per-word relevance value is to be increased.

【００９０】図２には、ユーザによりユーザプロファイ
ルを入力してカテゴリを指定するための画面情報の一例
を示してあり、当該画面情報は例えば検索要求受付部１
によりユーザに対して表示出力される。この画面情報で
は、「企業」、「人名」、「官庁」、「時間」、「場
所」というカテゴリの中でいずれのカテゴリにユーザの
興味があるかつまりいずれのカテゴリに属する固有名に
相当する単語の関連度値を補正するかをユーザに対して
尋ねており、同図の例では、「企業」及び「人名」のカ
テゴリがユーザにより指定されている。また、この画面
情報では、ユーザにより指定したカテゴリに属する単語
の関連度値をどれくらい大きくするかといった倍率をユ
ーザに対して尋ねており、同図の例では、「企業」につ
いては２倍が設定され、「人名」については３倍が設定
されている。FIG. 2 shows an example of screen information with which the user enters a user profile and designates categories. The screen information is displayed to the user by, for example, the search request receiving unit 1. The screen asks the user which of the categories "company", "personal name", "government office", "time", and "place" the user is interested in, that is, for which categories the relevance values of words corresponding to proper names should be corrected. In the example of FIG. 2, the categories "company" and "personal name" are designated by the user. The screen also asks the user for the factor by which the relevance values of words belonging to each designated category are to be increased. In the example of FIG. 2, a factor of 2 is set for "company" and a factor of 3 is set for "personal name".
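A per-category factor table like the one in FIG. 2 could be applied as in the sketch below. The function name `apply_category_factors` and the assignment of "ABC record publication" to the "company" category are assumptions made for illustration; the factors 2 and 3 are those given for FIG. 2.

```python
def apply_category_factors(values, word_category, factors):
    """Scale each word's relevance value by the factor of its category.

    Words without a category, or in a category the user did not select,
    are left unchanged (factor 1).
    """
    adjusted = {}
    for word, value in values.items():
        category = word_category.get(word)  # None if not a proper name
        adjusted[word] = value * factors.get(category, 1)
    return adjusted


# Factors chosen by the user on the profile screen (FIG. 2 example).
factors = {"company": 2, "personal name": 3}

# Assumed category assignments held by the proper name holding unit.
word_category = {
    "ABC record publication": "company",
    "singer group DEF": "personal name",
}

values = {
    "ABC record publication": 25.0,
    "singer group DEF": 5.0,
    "concert": 2.5,
}

adjusted = apply_category_factors(values, word_category, factors)
print(adjusted)
```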

【００９１】ここで、本例では、検索要求受付部１が例
えば上記図２に示したような画面情報をユーザに対して
表示出力してユーザからの入力によりカテゴリの指定を
受け付ける機能によりカテゴリ指定受付手段が構成され
ており、固有名保持部２が同類の複数の固有名から構成
されるカテゴリに関する情報を記憶する機能によりカテ
ゴリ情報記憶手段が構成されており、固有名保持部２が
設定された固有名キーワード特定条件に基づいて指定さ
れたカテゴリに含まれる固有名に相当するキーワードを
特定する機能により固有名キーワード特定手段が構成さ
れており、関連度計算部３が単語毎の関連度値（キーワ
ード毎の関連度値）をその算出値と比較して大きい値
（関連度が高いことを表す値）へ補正して文書関連度計
算部７が文書関連度値を算出する機能により関連度値算
出手段が構成されている。In this example, the category designation accepting means is constituted by the function of the search request receiving unit 1 that displays screen information such as that shown in FIG. 2 to the user and accepts a category designation entered by the user. The category information storage means is constituted by the function of the proper name holding unit 2 that stores information on categories, each composed of a plurality of proper names of the same kind. The proper name keyword specifying means is constituted by the function of the proper name holding unit 2 that specifies, based on the set proper name keyword specifying condition, keywords corresponding to proper names included in the designated category. The relevance value calculating means is constituted by the function in which the degree-of-association calculation unit 3 corrects the per-word relevance value (per-keyword relevance value) to a value larger than the calculated value (a value representing a higher degree of association) and the document relevance calculation unit 7 calculates the document relevance value.

【００９２】また、本例では、検索要求受付部１が例え
ば上記図２に示したような画面情報をユーザに対して表
示出力してユーザからの入力によりカテゴリ毎の補正倍
率（補正度合い）の指定を受け付ける機能により補正度
合い指定受付手段が構成されており、検索結果提示部８
が例えばキーワード毎関連度値の補正が行われたキーワ
ードの語句などの情報をユーザに対して表示出力する機
能により補正キーワード情報表示出力手段が構成されて
いる。Further, in this example, the correction degree designation accepting means is constituted by the function of the search request receiving unit 1 that displays screen information such as that shown in FIG. 2 to the user and accepts a correction factor (degree of correction) for each category entered by the user. The corrected keyword information display output means is constituted by the function of the search result presentation unit 8 that displays to the user information such as the words of the keywords whose per-keyword relevance values were corrected.

【００９３】また、本例では、指定されたカテゴリに含
まれる固有名に相当するキーワード以外のキーワード
（単語）と各文書１〜４とのキーワード毎関連度値を非
ゼロとして文書関連度値を算出する場合を示したが、例
えばこのようなキーワード毎関連度値をゼロとして文書
関連度値を算出する構成とすることにより、文書関連度
値を算出するのに要する演算量や時間を低減させること
もできる。Further, this example shows a case in which the document relevance value is calculated with non-zero per-keyword relevance values between each of the documents 1 to 4 and the keywords (words) other than those corresponding to proper names in the designated category. However, by setting such per-keyword relevance values to zero when calculating the document relevance value, the amount of computation and the time required to calculate the document relevance value can also be reduced.

【００９４】次に、本発明の第２実施例に係る文書群分
類装置（文書クラスタリング装置）によるクラスタリン
グを説明する。なお、本例では、文書とは、例えば自然
言語で記述された１つ以上の文の集まりであって当該１
つ以上の文の集まりが分類される対象であるようなもの
を言う。具体的には、例えば政治や経済やスポーツなど
に分類される新聞記事などのように分類可能な特定の１
文を包含して有するようなものを文書とみなす。Next, clustering by a document group classification device (document clustering device) according to a second embodiment of the present invention will be described. In this example, a document is a collection of one or more sentences written in, for example, a natural language, where that collection is itself the object of classification. Specifically, something that contains a specific classifiable sentence, such as a newspaper article classified under politics, economy, sports, or the like, is regarded as a document.

【００９５】図３には、本例の文書群分類装置の構成例
を示してあり、この文書群分類装置には、文書群を入力
する文書群入力部１１と、入力されたそれぞれの文書の
内容から成る文書群データを記憶する文書群記憶部１２
と、当該文書群データを形態素解析などにより解析して
キーワードを抽出などする文書群解析部１３と、当該解
析結果に基づいて文書群を分類する文書群分類部１４
と、当該分類結果を記憶する分類結果記憶部１５とが備
えられている。FIG. 3 shows a configuration example of the document group classification device of this example. The document group classification device includes a document group input unit 11 that inputs a document group; a document group storage unit 12 that stores document group data consisting of the content of each input document; a document group analysis unit 13 that analyzes the document group data by morphological analysis or the like and extracts keywords; a document group classification unit 14 that classifies the document group based on the analysis result; and a classification result storage unit 15 that stores the classification result.

【００９６】ここで、文書群記憶部１２や分類結果記憶
部１５は、例えば情報を記憶するハードディスクや半導
体メモリから構成されている。また、文書群解析部１３
や文書群分類部１４は、例えばプログラムを記憶するメ
モリや当該プログラムの記述内容に従って動作するＣＰ
Ｕ（Central Processing Unit）を有している。なお、
上記したメモリやＣＰＵは複数の処理部で共用すること
も可能である。Here, the document group storage unit 12 and the classification result storage unit 15 are each composed of, for example, a hard disk or a semiconductor memory that stores information. The document group analysis unit 13 and the document group classification unit 14 each have, for example, a memory that stores a program and a CPU (Central Processing Unit) that operates according to the program. The memory and the CPU may be shared by a plurality of processing units.

【００９７】本例の文書群分類装置により行われる動作
の一例を示す。まず、文書群入力部１１は、例えばユー
ザによりキーボードなどが操作されることにより或いは
装置内のハードディスクや外部のネットワークから、分
類対象となる文書群の情報を入力する。そして、文書群
入力部１１は、入力したそれぞれの文書の内容から成る
文書群データを、個々の文書が識別可能な形式で文書群
記憶部１２に記憶する。具体的には、例えば各文書に文
書番号などを付けて文書群記憶部１２に記憶し、当該文
書番号により各文書を管理する。An example of the operation performed by the document group classification device of this example will be shown. First, the document group input unit 11 inputs information on a document group to be classified by a user operating a keyboard or the like, or from a hard disk in the device or an external network. Then, the document group input unit 11 stores the document group data including the contents of the respective input documents in the document group storage unit 12 in a format in which each document can be identified. Specifically, for example, each document is given a document number or the like and stored in the document group storage unit 12, and each document is managed by the document number.

【００９８】次に、文書群解析部１３は、文書群記憶部
１２から文書データを読み出し、それぞれの文書に対応
したそれぞれの文書データに対して自然言語解析を行
い、これにより、それぞれの文書データから単語やその
出現位置やその品詞や単語間の関係などを示す文法情報
を抽出する。そして、文書群解析部１３は、例えば固有
名詞に相当する品詞情報を有する単語のみを抽出し、抽
出したそれぞれの単語毎に、その単語が含まれる文書内
での出現頻度を求める。なお、出現頻度の値としては、
例えば単語が文書内に出現したか否かを示す“１”値或
いは“０”値の情報や、例えば単語が文書内に出現した
回数や頻度の情報などを用いることができる。Next, the document group analysis unit 13 reads the document data from the document group storage unit 12 and performs natural language analysis on the document data corresponding to each document, thereby extracting from each piece of document data grammatical information indicating the words, their positions of appearance, their parts of speech, the relationships between words, and the like. The document group analysis unit 13 then extracts, for example, only the words whose part-of-speech information corresponds to a proper noun and, for each extracted word, obtains its appearance frequency in the documents that contain it. As the appearance frequency value, it is possible to use, for example, a value of "1" or "0" indicating whether the word appears in the document, or the number of times or the frequency with which the word appears in the document.
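The two kinds of appearance frequency value mentioned here (a 1/0 presence flag or a count) can be sketched as follows. The function name `frequency_vector` is hypothetical, and the sketch assumes the natural language analysis has already produced a list of proper noun tokens for the document.

```python
from collections import Counter


def frequency_vector(tokens, vocabulary, binary=True):
    """Build the appearance frequency values a(i) for one document.

    tokens     : proper noun tokens extracted from the document
    vocabulary : ordered list of the t words used for comparison
    binary     : True  -> 1 if the word appears, else 0
                 False -> number of times the word appears
    """
    counts = Counter(tokens)
    if binary:
        return [1 if counts[word] > 0 else 0 for word in vocabulary]
    return [counts[word] for word in vocabulary]


vocabulary = ["singer group DEF", "Budokan", "ABC record publication"]
tokens = ["singer group DEF", "Budokan", "singer group DEF"]

print(frequency_vector(tokens, vocabulary, binary=True))
print(frequency_vector(tokens, vocabulary, binary=False))
```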

【００９９】このようにして固有名詞に相当する各単語
の出現頻度情報を全ての文書について求めると、次に、
文書群分類部１４は、例えば任意の１文書を選定して、
選定した文書と当該文書以外の全ての文書との距離を求
める。なお、本例では、以下に示すように、２文書間の
距離としてはその値が大きいほど距離が近いことを表す
ようなものを用いており、これは例えば２文書間の関連
度値に相当するとみなすことができる。Once the appearance frequency information of each word corresponding to a proper noun has been obtained for all documents in this way, the document group classification unit 14 selects, for example, one arbitrary document and obtains the distance between the selected document and every other document. In this example, as shown below, the distance between two documents is defined so that a larger value represents a shorter distance. It can therefore be regarded as corresponding to, for example, the relevance value between the two documents.

【０１００】具体例として、２文書間の距離を求めるた
めの単語がｔ（ｔは１以上の数）個あって、文書Ｄｒに
おけるｉ番目の単語の出現頻度の値がａｒ（ｉ）であ
り、文書Ｄｓにおけるｉ番目の単語の出現頻度の値がａ
ｓ（ｉ）であるとすると、文書Ｄｒと文書Ｄｓとの２文
書間の距離Ｓｉｍ（Ｄｒ、Ｄｓ）は例えば式１で示され
る。As a specific example, suppose there are t words (t is a number of 1 or more) for obtaining the distance between two documents, the appearance frequency value of the i-th word in document Dr is ar(i), and the appearance frequency value of the i-th word in document Ds is as(i). The distance Sim(Dr, Ds) between the two documents Dr and Ds is then given, for example, by Equation 1.

【０１０１】[0101]

【数１】 [Equation 1]

【０１０２】ここで、図４を参照して、上記のような２
文書間の距離に基づいて文書群をクラスタリングする一
例を示す。なお、ここでは、説明の便宜上から、文書群
には４つの文書Ａ、Ｂ、Ｃ、Ｄが含まれるとする。例え
ば、全ての文書Ａ〜Ｄから２つの文書を選択する全ての
組み合わせについて２文書間の距離を求めた結果が、同
図（ａ）の行列のように表されたとする。そして、この
行列を参照して、距離が近い（本例では、値が大きい）
２文書のところから順に階層木を作成していくことによ
り、クラスタリングを行うことができる。Here, with reference to FIG. 4, an example of clustering a document group based on the distance between two documents described above is shown. For convenience of description, the document group is assumed to contain four documents A, B, C, and D. Suppose, for example, that the distances obtained for every combination of two documents selected from the documents A to D are represented as the matrix in FIG. 4(a). Referring to this matrix, clustering can be performed by building a hierarchical tree starting from the pair of documents with the shortest distance (in this example, the largest value).

【０１０３】まず、同図（ａ）の例では、文書Ａと文書
Ｄとの距離が最も近いことから、図示のように、文書Ａ
と文書Ｄとを関連付ける。すると、文書Ａと文書Ｄとを
まとめることにより、上記した行列が同図（ｂ）に示す
ように変更され、同図（ｂ）の例では、文書Ａ、Ｄのま
とまりと文書Ｃとの距離が最も近いことから、図示のよ
うに、文書Ａ、Ｄのまとまりと文書Ｃとを関連付ける。
すると、上記した行列は同図（ｃ）に示すように変更さ
れ、図示のように、残った文書Ｂと文書Ａ、Ｃ、Ｄのま
とまりとを関連付ける。そして、同図（ｃ）に示すよう
な階層木をクラスタリング結果として得ることができ
る。First, in the example of FIG. 4(a), the distance between document A and document D is the shortest, so document A and document D are associated with each other as illustrated. Combining document A and document D changes the matrix to that shown in FIG. 4(b). In the example of FIG. 4(b), the distance between the group of documents A and D and document C is the shortest, so the group of documents A and D is associated with document C as illustrated. The matrix is then changed to that shown in FIG. 4(c), and the remaining document B is associated with the group of documents A, C, and D as illustrated. The hierarchical tree shown in FIG. 4(c) is thus obtained as the clustering result.
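The merging procedure can be sketched as a simple agglomerative loop. The text does not say how the matrix entries are recomputed after two documents are combined, so this sketch assumes single linkage (the largest pairwise value between two groups); the similarity values are invented for illustration and are not those of FIG. 4.

```python
from itertools import combinations


def agglomerate(names, sim):
    """Repeatedly merge the two closest groups (largest similarity).

    sim maps frozenset({a, b}) -> similarity between documents a and b.
    Returns the list of merged groups in the order they were formed.
    """
    clusters = [frozenset([name]) for name in names]

    def cluster_sim(c1, c2):
        # Assumption: single linkage, i.e. the best pairwise similarity.
        return max(sim[frozenset((a, b))] for a in c1 for b in c2)

    merges = []
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2),
                     key=lambda pair: cluster_sim(*pair))
        clusters = [c for c in clusters if c not in (c1, c2)]
        clusters.append(c1 | c2)
        merges.append(c1 | c2)
    return merges


# Illustrative similarity values (larger means closer).
sim = {
    frozenset(("A", "B")): 0.2,
    frozenset(("A", "C")): 0.6,
    frozenset(("A", "D")): 0.9,
    frozenset(("B", "C")): 0.3,
    frozenset(("B", "D")): 0.1,
    frozenset(("C", "D")): 0.5,
}

merges = agglomerate(["A", "B", "C", "D"], sim)
for step, group in enumerate(merges, start=1):
    print(step, sorted(group))
```

With these values the merge order matches the narrative: A and D first, then C, then B.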

【０１０４】以上のように、本例の文書群分類装置で
は、例えば自然言語解析により文書に含まれるキーワー
ドの要素を抽出するとともに当該要素に付随する品詞や
カテゴリの情報などを抽出し、要素に付随する品詞情報
に基づいて固有名詞に相当する要素のみを計数するよう
にした状況で、出現頻度を算出する仕方を規定する計数
規則に基づいて要素毎の出現回数などを計数し、当該計
数結果に基づいて２文書間の距離を求めることにより文
書群をクラスタリングすることにより、例えば大量の文
書群をより少ない計算量で、且つ、単語自身の有する情
報量をその品詞に基づいて考慮した方法で分類（クラス
タリング）することができる。As described above, the document group classification device of this example extracts the keyword elements contained in each document by, for example, natural language analysis, together with the part-of-speech and category information attached to each element. With only the elements corresponding to proper nouns being counted, based on the part-of-speech information attached to each element, the device counts the number of appearances of each element according to a counting rule that defines how the appearance frequency is calculated, and clusters the document group by obtaining the distance between two documents from the counting result. In this way, for example, a large document group can be classified (clustered) with less computation and in a manner that takes into account, based on part of speech, the amount of information carried by each word.

【０１０５】なお、本例では、固有名詞に相当する単語
のみに基づいて２文書間の距離を求めたが、例えば各単
語毎に重み付けを行うことで固有名詞に相当する単語の
重要度を大きくするような構成も可能であり、このよう
な構成では、例えば要素に付随する品詞情報に基づいて
固有名詞に相当する要素には他の品詞に相当する要素と
比較して大きい重みを与えて、２文書間の距離を算出す
るようなことが行われる。In this example, the distance between two documents is obtained based only on words corresponding to proper nouns. However, a configuration is also possible in which each word is weighted so that words corresponding to proper nouns have greater importance. In such a configuration, for example, elements corresponding to proper nouns are given a larger weight than elements corresponding to other parts of speech, based on the part-of-speech information attached to each element, and the distance between the two documents is then calculated.

【０１０６】具体的に、上記式１に対応するものとし
て、例えばｉ番目の単語に対する重み付けの係数をＫｉ
とすると、式２に示すような演算式により２文書間の距
離を求めることが可能である。Specifically, as a counterpart to Expression 1 above, if the weighting coefficient for the i-th word is denoted Ki, for example, the distance between two documents can be obtained by the arithmetic expression shown in Expression 2.

【０１０７】[0107]

【数２】 [Equation 2]
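The body of Expression 2 does not appear in this text. Going only by paragraph [0110], which gives the per-keyword term of Expression 2 as {Ki·ar(i)·as(i)} over t elements, and by the claims, which sum per-keyword values over all keywords, one consistent reading is the weighted sum below. This is an assumed reconstruction: the symbol d is introduced here for illustration, and the original expression may include normalization or other terms not recoverable from the surrounding text.

```latex
d(D_r, D_s) = \sum_{i=1}^{t} K_i \, a_r(i) \, a_s(i)
```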

【０１０８】なお、例えば、計数する対象となる要素を
固有名詞に相当するものだけとするか、或いは、固有名
詞以外の品詞に相当する要素についても計数する対象と
して固有名詞に対する重みを他の品詞と比べて大きくす
るか、といったことをユーザにより任意に選択可能な構
成とすることもできる。Note that a configuration is also possible in which the user can freely select, for example, whether to count only elements corresponding to proper nouns, or to also count elements corresponding to other parts of speech while giving proper nouns a larger weight than the other parts of speech.
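The two selectable modes can be sketched with a single weighted sum of per-keyword terms Ki·ar(i)·as(i). In this minimal sketch, the function name, the mode names, the default weight of 2.0, and the sample vectors are all hypothetical; the plain sum without normalization is also an assumption.

```python
def pair_score(a_r, a_s, is_proper_noun, mode="proper_only", proper_weight=2.0):
    """Sum K_i * a_r(i) * a_s(i) over all elements.

    mode="proper_only": K_i is 1 for proper nouns and 0 otherwise,
                        so only proper nouns are counted.
    mode="weighted":    K_i is `proper_weight` for proper nouns and 1 otherwise.
    """
    if mode == "proper_only":
        weights = [1.0 if p else 0.0 for p in is_proper_noun]
    elif mode == "weighted":
        weights = [proper_weight if p else 1.0 for p in is_proper_noun]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return sum(k * x * y for k, x, y in zip(weights, a_r, a_s))

# Hypothetical element counts for two documents over four elements.
a_r = [2, 1, 0, 3]
a_s = [1, 1, 4, 2]
is_proper_noun = [True, False, False, True]

only_proper = pair_score(a_r, a_s, is_proper_noun, mode="proper_only")
weighted = pair_score(a_r, a_s, is_proper_noun, mode="weighted")
```

Treating "count only proper nouns" as the special case where the other weights are zero lets one code path serve both user choices.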

【０１０９】また、キーワード（要素）を抽出する仕方
としては、種々な仕方が用いられてもよく、例えば予め
辞書に登録された固有名詞に相当する語句を抽出する仕
方や、例えば「株式会社」などのような語句が含まれて
いるといった固有名詞に特有な出現パターンに基づいて
固有名詞とみなされる語句を抽出する仕方などを用いる
ことができる。また、このような辞書をユーザにより編
集可能にして、当該辞書に対してユーザが任意に新たな
単語を追加することなどが可能な構成とすることもでき
る。Various methods may be used to extract keywords (elements). For example, phrases corresponding to proper nouns registered in a dictionary in advance may be extracted, or phrases regarded as proper nouns may be extracted based on appearance patterns characteristic of proper nouns, such as containing a phrase like "株式会社" (corporation). Such a dictionary may also be made editable by the user, so that the user can freely add new words to it.
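Both extraction approaches can be sketched together, assuming the text has already been split into candidate phrases. The function name, the dictionary contents, and the company names below are hypothetical examples.

```python
def extract_proper_names(phrases, dictionary, markers=("株式会社",)):
    """Return phrases that are in the dictionary or contain a marker phrase.

    Order of first appearance is kept and duplicates are dropped.
    """
    found = []
    for phrase in phrases:
        in_dictionary = phrase in dictionary
        matches_pattern = any(marker in phrase for marker in markers)
        if (in_dictionary or matches_pattern) and phrase not in found:
            found.append(phrase)
    return found

# Hypothetical dictionary of registered proper nouns.
user_dictionary = {"サンプル商事"}
user_dictionary.add("東京")  # a word added later by the user

phrases = ["サンプル商事", "文書", "山田株式会社", "東京", "検索", "東京"]
names = extract_proper_names(phrases, user_dictionary)
```

Because the dictionary is an ordinary mutable set here, a user-added word takes effect on the next extraction without any other change.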

【０１１０】ここで、本例では、上記式１や上記式２に
より２文書間の距離を算出するために用いられるｔ個の
要素（キーワード）の群がキーワード群に相当し、これ
ら２文書から文書群が構成され、当該キーワード群と当
該文書群に関する関連度値をこれら２文書間の距離とし
て算出している。更に詳しくは、２つの文書Ｄｒ、Ｄｓ
から文書群が構成され、前記キーワード群に含まれるキ
ーワードと当該文書群に関するキーワード毎関連度値が
上記式１中の｛ａｒ（ｉ）・ａｓ（ｉ）｝や上記式２中
の｛Ｋｉ・ａｒ（ｉ）・ａｓ（ｉ）｝で表される。ま
た、他のとらえ方として、例えば文書Ｄｓを文書群であ
るとみなして、ａｒ（ｉ）についてはｉ番目のキーワー
ドに付加された文書Ｄｒに依存する重み付け情報である
とみなすこともでき、この場合にも、各キーワードにつ
いての上記式１中の｛ａｒ（ｉ）・ａｓ（ｉ）｝や上記
式２中の｛Ｋｉ・ａｒ（ｉ）・ａｓ（ｉ）｝がキーワー
ド毎の関連度値に相当する。Here, in this example, the group of t elements (keywords) used to calculate the distance between two documents by Expression 1 or Expression 2 above corresponds to the keyword group, the two documents constitute the document group, and the degree-of-association value between the keyword group and the document group is calculated as the distance between the two documents. More specifically, the document group consists of the two documents Dr and Ds, and the per-keyword degree-of-association value between a keyword in the keyword group and the document group is expressed by {ar(i)·as(i)} in Expression 1 or by {Ki·ar(i)·as(i)} in Expression 2. Alternatively, document Ds alone may be regarded as the document group, with ar(i) regarded as weighting information that is attached to the i-th keyword and depends on document Dr. In this case too, {ar(i)·as(i)} in Expression 1 or {Ki·ar(i)·as(i)} in Expression 2 for each keyword corresponds to the per-keyword degree-of-association value.

【０１１１】また、本例では、文書群解析部１３が設定
された固有名詞キーワード特定条件に基づいて固有名詞
に相当するキーワード（要素）を特定する機能により固
有名キーワード特定手段が構成されており、文書群解析
部１３や文書群分類部１４が固有名詞に相当すると特定
されたキーワードのみについての２つの文書のキーワー
ド毎関連度値を非ゼロの値とすることや或いはこのよう
なキーワードについての２つの文書のキーワード毎関連
度値（距離）を重み付けしてその算出値と比較して大き
い値（関連度が高いことを表す値）へ補正することを行
って、２つの文書の関連度値を算出する機能により関連
度値算出手段が構成されている。Also, in this example, the proper name keyword specifying means is constituted by the function of the document group analysis unit 13 that specifies keywords (elements) corresponding to proper nouns based on the set proper noun keyword specifying condition. The degree-of-association value calculating means is constituted by the function of the document group analysis unit 13 and the document group classification unit 14 that calculates the degree-of-association value of two documents in one of two ways: by making the per-keyword degree-of-association value of the two documents non-zero only for keywords specified as corresponding to proper nouns, or by weighting the per-keyword degree-of-association value (distance) of the two documents for such keywords so that it is corrected to a value larger than its calculated value (a value indicating a higher degree of association).

【０１１２】次に、本発明の第３実施例に係る文書群分
類装置（文書カテゴライズ装置）により行われるカテゴ
ライズの一例を示す。なお、本例の文書分類装置は、例
えば上記図３に示したものと同様な構成のものを用いる
ことができ、本例では、詳しい説明は省略する。Next, an example of categorization performed by a document group classification device (document categorizing device) according to a third embodiment of the present invention will be described. The document classification device of this example may, for instance, have the same configuration as that shown in FIG. 3 above, so a detailed description is omitted here.

【０１１３】図５を参照して、本例の文書分類装置によ
り行われるカテゴライズの一例を示す。同図に示される
ように、例えば種々な内容に関して予め複数のキーワー
ド群が用意されていて、各キーワード群がそれぞれ異な
る分類１、２、３、…に対応付けられているとする。ま
た、カテゴライズ対象となる文書１、２、３…があると
する。An example of categorization performed by the document classification device of this example will be described with reference to FIG. 5. As shown in the figure, suppose that a plurality of keyword groups covering various subjects have been prepared in advance, with each keyword group associated with a different classification 1, 2, 3, and so on. Suppose also that there are documents 1, 2, 3, and so on to be categorized.

【０１１４】この場合、例えば文書１を例とすると、ま
ず、固有名に相当するキーワードに重み付けをする方式
を用いて、文書１と各分類１、２、３、…に対応したそ
れぞれのキーワード群との関連度値を算出し、算出され
た関連度値が最大となる分類を検出して、検出した分類
に文書１をカテゴライズする。In this case, taking document 1 as an example, the degree-of-association value between document 1 and the keyword group of each classification 1, 2, 3, and so on is first calculated using the method of weighting keywords corresponding to proper names. The classification with the largest calculated degree-of-association value is then detected, and document 1 is categorized into the detected classification.

【０１１５】具体的に、同図の例では、文書１と分類１
のキーワード群との関連度値が文書１と他の分類２、
３、…のキーワード群との関連度値と比較して最大であ
った場合を示してあり、この場合、文書１を分類１にカ
テゴライズする。同図の例では、同様にして、文書２が
分類１にカテゴライズされており、文書３が分類２にカ
テゴライズされている。Specifically, the example in the figure shows the case where the degree-of-association value between document 1 and the keyword group of classification 1 is larger than the degree-of-association values between document 1 and the keyword groups of the other classifications 2, 3, and so on; in this case, document 1 is categorized into classification 1. In the same example, document 2 is likewise categorized into classification 1, and document 3 is categorized into classification 2.
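The categorization procedure above can be sketched as an argmax over classifications. In this minimal sketch, the per-keyword value is taken to be the keyword's occurrence count in the document, multiplied by a weight for proper names; that choice, the weight of 2.0, the function names, and all sample data are assumptions made for illustration.

```python
def association_value(doc_counts, keyword_group, proper_names, proper_weight=2.0):
    """Sum per-keyword values, weighting keywords that are proper names."""
    total = 0.0
    for keyword in keyword_group:
        weight = proper_weight if keyword in proper_names else 1.0
        total += weight * doc_counts.get(keyword, 0)
    return total

def categorize(doc_counts, classifications, proper_names, proper_weight=2.0):
    """Return the classification with the largest value, plus all scores."""
    scores = {
        name: association_value(doc_counts, group, proper_names, proper_weight)
        for name, group in classifications.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical keyword groups, proper names, and word counts for document 1.
classifications = {
    "classification 1": ["Tokyo", "train", "station"],
    "classification 2": ["Kyoto", "temple", "garden"],
}
proper_names = {"Tokyo", "Kyoto"}
document_1 = {"Tokyo": 2, "train": 1, "temple": 3}

best, scores = categorize(document_1, classifications, proper_names)
```

In this sample, both classifications would score 3.0 without weighting; doubling the proper name "Tokyo" is what separates them.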

【０１１６】以上のように、文書群を複数の分類にカテ
ゴライズする場合においても、例えば上記第１実施例や
上記第２実施例で述べたのと同様に、固有名に相当する
キーワードに重み付けすることにより、関連度値の精度
を向上させることや、関連度値の算出に要する演算量や
時間を低減させることなどができる。As described above, also when a document group is categorized into a plurality of classifications, weighting the keywords corresponding to proper names makes it possible, as described for the first and second embodiments above, to improve the accuracy of the degree-of-association value and to reduce the amount of computation and the time required to calculate it.

【０１１７】ここで、本発明に係る関連度値算出装置な
どの構成としては、必ずしも以上に示したものに限られ
ず、種々な構成が用いられてもよい。また、本発明の適
用分野としては、必ずしも以上に示したものに限られ
ず、本発明は、種々な分野に適用することが可能なもの
である。Here, the configuration of the degree-of-association value calculating device and the like according to the present invention is not necessarily limited to those described above, and various configurations may be used. Likewise, the fields of application of the present invention are not necessarily limited to those described above; the present invention can be applied to various fields.

【０１１８】また、本発明に係る関連度値算出装置など
において行われる各種の処理としては、例えばプロセッ
サやメモリ等を備えたハードウエア資源においてプロセ
ッサがＲＯＭ（Read Only Memory）に格納された制御プ
ログラムを実行することにより制御される構成が用いら
れてもよく、また、例えば当該処理を実行するための各
機能手段が独立したハードウエア回路として構成されて
もよい。また、本発明は上記の制御プログラムを格納し
たフロッピー（登録商標）ディスクやＣＤ（Compact Di
sc）−ＲＯＭ等のコンピュータにより読み取り可能な記
録媒体や当該プログラム（自体）として把握することも
でき、当該制御プログラムを記録媒体からコンピュータ
に入力してプロセッサに実行させることにより、本発明
に係る処理を遂行させることができる。Further, the various kinds of processing performed in the degree-of-association value calculating device and the like according to the present invention may be controlled, for example, by a configuration in which, in hardware resources including a processor and memory, the processor executes a control program stored in a ROM (Read Only Memory). Alternatively, each functional means for executing the processing may be configured as an independent hardware circuit. The present invention can also be understood as a computer-readable recording medium, such as a floppy (registered trademark) disk or a CD (Compact Disc)-ROM, that stores the control program, or as the program itself. The processing according to the present invention can be carried out by inputting the control program from the recording medium into a computer and causing the processor to execute it.

【０１１９】[0119]

【発明の効果】以上説明したように、本発明に係る関連
度値算出装置などによると、１又は複数のキーワードか
ら構成されるキーワード群と１又は複数の文書から構成
される文書群に関する関連度を表す値として、当該キー
ワード群に含まれる各キーワードと当該文書群に関する
キーワード毎の関連度値（キーワード毎関連度値）を全
てのキーワードについて総和した値を算出するに際し
て、設定された固有名キーワード特定条件に基づいて固
有名に相当するキーワードを特定し、特定されたキーワ
ードと文書群に関するキーワード毎の関連度を高めてキ
ーワード群と文書群に関する関連度値を算出するように
したため、算出される関連度値の精度を高くすることな
どができる。As described above, the degree-of-association value calculating device and the like according to the present invention calculate, as a value representing the degree of association between a keyword group composed of one or more keywords and a document group composed of one or more documents, the sum over all keywords of the per-keyword degree-of-association value between each keyword in the keyword group and the document group. In doing so, a keyword corresponding to a proper name is specified based on a set proper name keyword specifying condition, and the degree-of-association value between the keyword group and the document group is calculated with the per-keyword degree of association between the specified keyword and the document group increased. This makes it possible, for example, to improve the accuracy of the calculated degree-of-association value.

【０１２０】また、本発明に係る関連度値算出装置など
によると、同類の複数の固有名から構成される１又は複
数のカテゴリに関する情報を記憶し、カテゴリの指定を
受け付けて、受け付けられたカテゴリに含まれる固有名
に相当するキーワードを特定するようにしたため、例え
ばユーザの要求などを反映させて算出される関連度値の
精度を高くすることなどができる。Further, the degree-of-association value calculating device and the like according to the present invention store information on one or more categories, each composed of a plurality of proper names of the same kind, accept a designation of a category, and specify the keywords corresponding to the proper names included in the accepted category. This makes it possible, for example, to improve the accuracy of the calculated degree-of-association value in a way that reflects the user's request.

[Brief description of drawings]

【図１】本発明の第１実施例に係る関連文書検索装置
の構成例を示す図である。FIG. 1 is a diagram showing a configuration example of a related document search device according to a first exemplary embodiment of the present invention.

【図２】ユーザプロファイルを入力してカテゴリ及び
倍率を指定するための画面表示の一例を示す図である。FIG. 2 is a diagram showing an example of a screen display for inputting a user profile and designating a category and a magnification.

【図３】本発明の第２実施例に係る文書群分類装置の
構成例を示す図である。FIG. 3 is a diagram showing a configuration example of a document group classification device according to a second exemplary embodiment of the present invention.

【図４】クラスタリングの一例を説明するための図で
ある。FIG. 4 is a diagram for explaining an example of clustering.

【図５】カテゴライズの一例を説明するための図であ
る。FIG. 5 is a diagram for explaining an example of categorization.

[Explanation of symbols]

１・・検索要求受付部、２・・固有名保持部、３・
・関連度計算部、４・・検索対象文書単語出現頻度保持
部、５・・検索対象データベース、６・・検索部、
７・・文書関連度計算部、８・・検索結果提示部、１
１・・文書群入力部、１２・・文書群記憶部、１３
・・文書群解析部、１４・・文書群分類部、１５・・
分類結果記憶部、1 ... search request reception unit, 2 ... proper name holding unit, 3 ... degree-of-association calculation unit, 4 ... search target document word appearance frequency holding unit, 5 ... search target database, 6 ... search unit, 7 ... document degree-of-association calculation unit, 8 ... search result presentation unit, 11 ... document group input unit, 12 ... document group storage unit, 13 ... document group analysis unit, 14 ... document group classification unit, 15 ... classification result storage unit

フロントページの続き Ｆターム(参考） 5B009 SA12 5B075 ND03 NK02 NR12 PR06 5B082 EA08
Continuation of front page: F-terms (reference) 5B009 SA12 5B075 ND03 NK02 NR12 PR06 5B082 EA08

Claims

[Claims]

1. A degree-of-association value calculating device that calculates, as a value representing the degree of association between a keyword group composed of one or more keywords and a document group composed of one or more documents, a value obtained by summing, over all keywords, the per-keyword degree-of-association value between each keyword included in the keyword group and the document group, the device comprising: proper name keyword specifying means for specifying a keyword corresponding to a proper name based on a set proper name keyword specifying condition; and degree-of-association value calculating means for calculating the degree-of-association value between the keyword group and the document group with the per-keyword degree of association between the keyword specified by the proper name keyword specifying means and the document group increased.

2. The degree-of-association value calculating device according to claim 1, wherein the degree-of-association value calculating means increases the per-keyword degree of association between the keyword specified by the proper name keyword specifying means and the document group by correcting the per-keyword degree-of-association value for that keyword to a value indicating a higher degree of association than its calculated value.

3. The degree-of-association value calculating device according to claim 1, wherein the degree-of-association value calculating means increases the per-keyword degree of association between the keyword specified by the proper name keyword specifying means and the document group by setting the per-keyword degree-of-association value for that keyword to a non-zero value and setting the per-keyword degree-of-association values for the other keywords to zero.

4. The degree-of-association value calculating device according to claim 1, comprising: category information storage means for storing information on one or more categories, each composed of a plurality of proper names of the same kind; and category designation receiving means for receiving a designation of a category, wherein the proper name keyword specifying means specifies, based on the category information stored in the category information storage means, the keywords corresponding to the proper names included in the category received by the category designation receiving means.

5. The degree-of-association value calculating device according to claim 4, wherein the category designation receiving means displays and outputs, to the user, information requesting designation of a category, and receives the designation through input from the user.

6. The degree-of-association value calculating device according to claim 2, comprising: category information storage means for storing information on one or more categories, each composed of a plurality of proper names of the same kind; and correction degree designation receiving means for receiving, for each category, a designation of the degree to which the per-keyword degree-of-association value is corrected, wherein the degree-of-association value calculating means corrects, based on the category information stored in the category information storage means, the per-keyword degree-of-association value between the keyword specified by the proper name keyword specifying means and the document group, using the degree of correction received by the correction degree designation receiving means for the category that includes the keyword.

7. The degree-of-association value calculating device according to claim 6, wherein the correction degree designation receiving means outputs, to the user, information requesting designation of the degree of correction for each category, and receives the designation through input from the user.

8. The degree-of-association value calculating device according to claim 1, comprising information display output means for displaying and outputting, to the user, information on the keyword specified by the proper name keyword specifying means.

9. A related document search device that: extracts a keyword group composed of one or more keywords from a seed document group composed of one or more documents; searches a search target document group, composed of a plurality of documents to be searched, for documents related to the extracted keyword group; calculates, for each keyword included in the extracted keyword group, as a per-keyword appearance ratio value, the value obtained by dividing the ratio of the number of documents in the seed document group in which the keyword appears to the total number of documents in the seed document group by the ratio of the number of documents in the search target document group in which the keyword appears to the total number of documents in the search target document group; calculates, for each retrieved document, as a degree-of-association value representing the degree of association between the keyword group and the document, the sum of the appearance ratio values of those keywords in the extracted keyword group that appear in the document; and outputs information on the retrieved documents in descending order of degree-of-association value, the device comprising: proper name keyword specifying means for specifying a keyword corresponding to a proper name based on a set proper name keyword specifying condition; and degree-of-association value calculating means for calculating the degree-of-association value between the keyword group and the document with the appearance ratio value of the keyword specified by the proper name keyword specifying means multiplied by a predetermined factor.

10. A document categorizing device that, for a plurality of keyword groups each composed of one or more keywords and a document group composed of one or more documents, calculates, for each keyword group, as the degree-of-association value between the keyword group and the document group, a value obtained by summing, over all keywords, the value representing the per-keyword degree of association between each keyword included in the keyword group and the document group, and categorizes the document group into the keyword group whose calculated degree-of-association value indicates the highest degree of association, the device comprising: proper name keyword specifying means for specifying a keyword corresponding to a proper name based on a set proper name keyword specifying condition; and degree-of-association value calculating means for calculating the degree-of-association value between the keyword group and the document group with the per-keyword degree of association between the keyword specified by the proper name keyword specifying means and the document group increased.

11. A document clustering device that, for two documents included in a document group composed of a plurality of documents, calculates, as the degree-of-association value of the two documents, a value obtained by summing, over all keywords included in a keyword group composed of one or more keywords, the value representing the per-keyword degree of association of the two documents, and clusters the documents included in the document group based on the degree-of-association value, the device comprising: proper name keyword specifying means for specifying a keyword corresponding to a proper name based on a set proper name keyword specifying condition; and degree-of-association value calculating means for calculating the degree-of-association value of the two documents with the per-keyword degree of association of the two documents for the keyword specified by the proper name keyword specifying means increased.

12. A degree-of-association value calculating method for calculating, as a value representing the degree of association between a keyword group composed of one or more keywords and a document group composed of one or more documents, a value obtained by summing, over all keywords, the per-keyword degree-of-association value between each keyword included in the keyword group and the document group, the method comprising: specifying a keyword corresponding to a proper name based on a set proper name keyword specifying condition; and calculating the degree-of-association value between the keyword group and the document group with the per-keyword degree of association between the specified keyword and the document group increased.

13. A related document search method that: extracts a keyword group composed of one or more keywords from a seed document group composed of one or more documents; searches a search target document group, composed of a plurality of documents to be searched, for documents related to the extracted keyword group; calculates, for each keyword included in the extracted keyword group, as a per-keyword appearance ratio value, the value obtained by dividing the ratio of the number of documents in the seed document group in which the keyword appears to the total number of documents in the seed document group by the ratio of the number of documents in the search target document group in which the keyword appears to the total number of documents in the search target document group; calculates, for each retrieved document, as a degree-of-association value representing the degree of association between the keyword group and the document, the sum of the appearance ratio values of those keywords in the extracted keyword group that appear in the document; and outputs information on the retrieved documents in descending order of degree-of-association value, the method comprising: specifying a keyword corresponding to a proper name based on a set proper name keyword specifying condition; and calculating the degree-of-association value between the keyword group and the document with the appearance ratio value of the specified keyword multiplied by a predetermined factor.

14. A document categorizing method that, for a plurality of keyword groups each composed of one or more keywords and a document group composed of one or more documents, calculates, for each keyword group, as the degree-of-association value between the keyword group and the document group, a value obtained by summing, over all keywords, the value representing the per-keyword degree of association between each keyword included in the keyword group and the document group, and categorizes the document group into the keyword group whose calculated degree-of-association value indicates the highest degree of association, the method comprising: specifying a keyword corresponding to a proper name based on a set proper name keyword specifying condition; and calculating the degree-of-association value between the keyword group and the document group with the per-keyword degree of association between the specified keyword and the document group increased.

15. A document clustering method that, for two documents included in a document group composed of a plurality of documents, calculates, as the degree-of-association value of the two documents, a value obtained by summing, over all keywords included in a keyword group composed of one or more keywords, the value representing the per-keyword degree of association of the two documents, and clusters the documents included in the document group based on the degree-of-association value, the method comprising: specifying a keyword corresponding to a proper name based on a set proper name keyword specifying condition; and calculating the degree-of-association value of the two documents with the per-keyword degree of association of the two documents for the specified keyword increased.

16. A program that causes a computer to execute a process of calculating, as a value representing the degree of association between a keyword group composed of one or more keywords and a document group composed of one or more documents, a value obtained by summing, over all keywords, the per-keyword degree-of-association value between each keyword included in the keyword group and the document group, the program causing the computer to execute: a process of specifying a keyword corresponding to a proper name based on a set proper name keyword specifying condition; and a process of calculating the degree-of-association value between the keyword group and the document group with the per-keyword degree of association between the specified keyword and the document group increased.

17. A program that causes a computer to execute: a process of extracting a keyword group composed of one or more keywords from a seed document group composed of one or more documents; a process of searching a search target document group, composed of a plurality of documents to be searched, for documents related to the extracted keyword group; a process of calculating, for each keyword included in the extracted keyword group, as a per-keyword appearance ratio value, the value obtained by dividing the ratio of the number of documents in the seed document group in which the keyword appears to the total number of documents in the seed document group by the ratio of the number of documents in the search target document group in which the keyword appears to the total number of documents in the search target document group; a process of calculating, for each retrieved document, as a degree-of-association value representing the degree of association between the keyword group and the document, the sum of the appearance ratio values of those keywords in the extracted keyword group that appear in the document; and a process of outputting information on the retrieved documents in descending order of degree-of-association value, the program causing the computer to execute: a process of specifying a keyword corresponding to a proper name based on a set proper name keyword specifying condition; and a process of calculating the degree-of-association value between the keyword group and the document with the appearance ratio value of the specified keyword multiplied by a predetermined factor.

18. A program that causes a computer to execute, for a plurality of keyword groups each composed of one or more keywords and a document group composed of one or more documents: a process of calculating, for each keyword group, as the degree-of-association value between the keyword group and the document group, a value obtained by summing, over all keywords, the value representing the per-keyword degree of association between each keyword included in the keyword group and the document group; and a process of categorizing the document group into the keyword group whose calculated degree-of-association value indicates the highest degree of association, the program causing the computer to execute: a process of specifying a keyword corresponding to a proper name based on a set proper name keyword specifying condition; and a process of calculating the degree-of-association value between the keyword group and the document group with the per-keyword degree of association between the specified keyword and the document group increased.

19. A program that causes a computer to execute, for two documents included in a document group composed of a plurality of documents: a process of calculating, as the degree-of-association value of the two documents, a value obtained by summing, over all keywords included in a keyword group composed of one or more keywords, the value representing the per-keyword degree of association of the two documents; and a process of clustering the documents included in the document group based on the degree-of-association value, the program causing the computer to execute: a process of specifying a keyword corresponding to a proper name based on a set proper name keyword specifying condition; and a process of calculating the degree-of-association value of the two documents with the per-keyword degree of association of the two documents for the specified keyword increased.
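The appearance ratio value and ranking recited in claims 9, 13, and 17 can be illustrated with a minimal sketch. Documents are modeled as sets of words; the function names, the factor of 2.0, and the sample data are hypothetical. Returning 0.0 for a keyword that appears in no search target document, and listing only documents that contain at least one keyword, are handling choices assumed here rather than taken from the claims.

```python
def appearance_ratio(keyword, seed_docs, target_docs):
    """(share of seed docs containing keyword) / (share of target docs containing it)."""
    seed_ratio = sum(keyword in doc for doc in seed_docs) / len(seed_docs)
    target_ratio = sum(keyword in doc for doc in target_docs) / len(target_docs)
    if target_ratio == 0:
        return 0.0  # assumed handling: keyword matches no target document
    return seed_ratio / target_ratio

def rank_documents(keywords, seed_docs, target_docs, proper_names, factor=2.0):
    """Score each target document and return (index, score) in descending order."""
    ratios = {}
    for keyword in keywords:
        ratio = appearance_ratio(keyword, seed_docs, target_docs)
        # Keywords specified as proper names are multiplied by a fixed factor.
        ratios[keyword] = ratio * factor if keyword in proper_names else ratio
    scored = []
    for index, doc in enumerate(target_docs):
        matched = [kw for kw in keywords if kw in doc]
        if matched:
            scored.append((index, sum(ratios[kw] for kw in matched)))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical seed documents, keywords, and search target documents.
seed_docs = [{"Tokyo", "train"}, {"Tokyo", "station"}]
keywords = ["Tokyo", "train"]
target_docs = [{"Tokyo", "map"}, {"train", "bus"}, {"bus", "map"}, {"train", "Tokyo"}]

ranking = rank_documents(keywords, seed_docs, target_docs, proper_names={"Tokyo"})
```

A keyword that is common in the seed documents but rare in the search target collection gets a high ratio, so the factor applied to proper names raises scores that are already discriminative.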