JP2000123041A

JP2000123041A - Similarity judging method, document retrieving device, document classifying device, storage medium stored with document retrieval program and storage medium stored with document classification program

Info

Publication number: JP2000123041A
Application number: JP10297321A
Authority: JP
Inventors: Junji Tomita; 準二富田; Hiroshi Takeno; 浩竹野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1998-10-19
Filing date: 1998-10-19
Publication date: 2000-04-28
Anticipated expiration: 2018-10-19
Also published as: JP3577972B2

Abstract

PROBLEM TO BE SOLVED: To provide a similarity judging method, a document retrieving device, and a document classifying device that can precisely judge the similarity between document elements having plural subjects or subtitles, can judge similarity using the features of the document elements, and can judge similarity precisely despite the incompleteness of morphological analysis, and to provide a storage medium storing a document retrieval program and a storage medium storing a document classification program. SOLUTION: From a document element consisting of a word string, a Boolean-operator combination of word strings, a sentence, a document, or a document set, the words used in the document element are extracted (S1); an importance is assigned to each extracted word (S2); and a degree of relevance is assigned to each pair of extracted words (S3). The subject of each document element is then expressed by a graph in which the importance of each word is the weight of a node and the degree of relevance between words is the weight of a link (S4), and the similarity between document elements is judged from the degree of matching between the graphs expressing their subjects (S5).

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a similarity determination method, a document search device, a document classification device, a storage medium storing a document search program, and a storage medium storing a document classification program. In particular, it relates to such a method, devices, and storage media for appropriately determining the similarity between document elements, each of which is a word string, a Boolean-operator combination of word strings, a document, or a document set.

[0002]

[Prior Art] A conventional document search device determines its search results by judging the similarity between a search key and each search target document using the method described below. The search key may be a word string or a Boolean-operator combination of word strings, but it may also be a document or a document set.

[0003] First, the words used in each search target document are extracted using a technique called morphological analysis, and each word is assigned an importance, based on occurrence-frequency information and the like, that expresses how strongly the word is related to the subject (content). Importance is likewise assigned to the words in the search key entered by the user. The device then checks with what importance each word is contained in each document and calculates the similarity by appropriately processing the Boolean operators in the search key. Finally, the search target documents are sorted in descending order of the similarity obtained in this way, and the top n documents are taken as the search result. Here, n is a positive constant.

[0004] For example, suppose the user wants to retrieve documents about "searching for information or documents (documents in particular)". The user specifies ((information^0.5 or document^0.8) and search^1.0) as the search key. The document search device first obtains the importance of each word within the search key: (information, 0.5), (document, 0.8), and (search, 1.0). In this example the importances are written explicitly in the search key, but they may also be determined automatically from word occurrence-frequency information and the like.

[0005] Next, the importance of each of the search key words "information", "document", and "search" within each search target document is obtained from occurrence frequency and the like. Suppose these are the following values (as used in the calculation in [0006]):

Document a: information 0.4, document 0.6, search 0.9
Document b: information 0.4, document 0.1, search 0.0
Document c: information 0.3, document 0.8, search 1.0

The product of each word's importance in the search key and its importance in the search target document is then calculated. Where "or" is used in the search key, the values on its two sides are added; where "and" is used, the smaller of the values on its two sides is taken.

[0006] With this method, the similarity between the search key and each document is obtained as follows:

Similarity of document a = min((0.5*0.4 + 0.8*0.6), 1.0*0.9) = min(0.68, 0.9) = 0.68
Similarity of document b = min((0.5*0.4 + 0.8*0.1), 1.0*0.0) = min(0.28, 0.0) = 0.0
Similarity of document c = min((0.5*0.3 + 0.8*0.8), 1.0*1.0) = min(0.79, 1.0) = 0.79

Here, min(x, y) returns the smaller of x and y.

[0007] The similarity between the search key and each search target document is calculated in this way, the search target documents are sorted in descending order of this value, and the top n (= 2) documents are taken as the search result. The search result in this case is therefore document c (similarity 0.79) followed by document a (similarity 0.68).
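
The conventional calculation in [0004] to [0007] can be sketched as follows. This is a minimal illustration rather than part of the patent text: the nested-tuple encoding of the search key and the names `score` and `rank` are assumptions, and the per-document importances are the ones that appear in the worked calculation of [0006].

```python
# Hypothetical encoding of ((information^0.5 or document^0.8) and search^1.0).
QUERY = ("and",
         ("or", ("information", 0.5), ("document", 0.8)),
         ("search", 1.0))

# Importance of each key word inside each document (from the worked example).
DOCS = {
    "a": {"information": 0.4, "document": 0.6, "search": 0.9},
    "b": {"information": 0.4, "document": 0.1, "search": 0.0},
    "c": {"information": 0.3, "document": 0.8, "search": 1.0},
}

def score(node, doc):
    """Evaluate a query node against one document: 'or' adds, 'and' takes the minimum."""
    if node[0] == "or":
        return score(node[1], doc) + score(node[2], doc)
    if node[0] == "and":
        return min(score(node[1], doc), score(node[2], doc))
    word, key_importance = node  # leaf: (word, importance in the search key)
    return key_importance * doc.get(word, 0.0)

def rank(query, docs, n):
    """Return the top-n (doc_id, similarity) pairs in descending order of similarity."""
    scored = [(doc_id, round(score(query, doc), 2)) for doc_id, doc in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

print(rank(QUERY, DOCS, 2))  # [('c', 0.79), ('a', 0.68)]
```

Rounding to two decimals is only there to hide floating-point noise when the result is printed.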

[0008] A conventional document classification device classifies documents by judging the similarity between the two documents of every pair in the set of documents to be classified. First, words are extracted from each document in the set, and appropriate importances are assigned to those words. Next, based on these importances, the similarity between the two documents of every pair in the set is judged by a method similar to the one described for the document search device. The documents are then classified by successively merging documents with high mutual similarity. This technique is called clustering.
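
The merging procedure in [0008] can be sketched as a greedy agglomerative loop. The text only says that highly similar documents are merged successively, so the single-link rule, the stopping threshold, and the name `cluster` below are assumptions made for illustration.

```python
from itertools import combinations

def cluster(similarity, doc_ids, threshold):
    """Greedy agglomerative clustering: repeatedly merge the two most similar clusters.

    `similarity` maps frozenset({id1, id2}) to a pairwise document similarity.
    Cluster-to-cluster similarity is the maximum pairwise value (single link, assumed).
    """
    clusters = [frozenset([d]) for d in doc_ids]

    def link(c1, c2):
        return max(similarity[frozenset([x, y])] for x in c1 for y in c2)

    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda pair: link(*pair))
        if link(c1, c2) < threshold:
            break  # no remaining pair is similar enough to merge
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return clusters

sims = {
    frozenset(["a", "b"]): 0.9,
    frozenset(["a", "c"]): 0.2,
    frozenset(["b", "c"]): 0.3,
}
print(sorted(sorted(c) for c in cluster(sims, ["a", "b", "c"], threshold=0.5)))
# [['a', 'b'], ['c']]
```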

[0009]

[Problems to Be Solved by the Invention] To build a more accurate document search device and document classification device, the similarity between document elements must be judged appropriately. Here, a document element is a word string, a Boolean-operator combination of words, a document, or a document set. However, the conventional similarity determination method has the following problems.

[0010] 1. The similarity between document elements having multiple subjects or subtitles cannot be judged precisely: When a document element is a word string, an abstract, or the like, it can be regarded as having a single subject, but the full text of a document generally has multiple subjects and subtitles. The similarity is therefore not calculated appropriately when the full text of such a document is the target.

[0011] For example, suppose that in a document search task a user who wants documents about "robots that perform information retrieval" specifies the search key "information retrieval and robot". With this search key, however, a high similarity is also given to a document that has the two separate subjects "information retrieval system" and "industrial robot". Thus, when a document element has multiple subjects or subtitles, the similarity cannot be judged precisely.
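
The difficulty described in [0011] can be reproduced with the conventional "and" rule. In the sketch below the importances and the function name `conventional_and` are invented for illustration; only the rule that "and" takes the smaller value comes from the text.

```python
def conventional_and(key_weights, doc_weights):
    """Conventional score for an all-'and' key: the minimum of the per-word products."""
    return min(weight * doc_weights.get(word, 0.0) for word, weight in key_weights.items())

key = {"information retrieval": 1.0, "robot": 1.0}

# A document about robots that perform information retrieval (one subject).
one_subject = {"information retrieval": 0.8, "robot": 0.8}
# A document about information retrieval systems and, separately, industrial robots.
two_subjects = {"information retrieval": 0.8, "robot": 0.8}

# Word importances alone cannot tell the two documents apart.
print(conventional_and(key, one_subject) == conventional_and(key, two_subjects))  # True
```

Because the score depends only on per-word importances, nothing in it records whether the two words are actually related within the document.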

[0012] 2. Similarity cannot be judged using the features of document elements: The words used in a sentence have features such as dependency relations between them, and documents have features such as paragraphs. However, the conventional similarity determination method merely extracts words, assigns importances to them, and judges similarity from those importances. It therefore cannot exploit these features and cannot judge similarity precisely.

[0013] 3. Incompleteness of morphological analysis: The morphological analysis used to extract words from a document must recognize which substrings are words, and for that purpose the words must be registered in a dictionary in advance. However, when information is updated rapidly, it is impossible to register every word in the dictionary in advance, so analysis failures during word extraction are unavoidable for such information. For example, if only the words "inter" (インター) and "net" (ネット) are registered in the dictionary, the word "internet" (インターネット) is not extracted as such; it is extracted as the two words "inter" and "net". Because word extraction fails in this way, the similarity cannot be judged precisely.
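
The extraction failure in [0013] can be illustrated with a toy dictionary-based segmenter. Greedy longest-match segmentation is used here only as a simple stand-in for morphological analysis, which in practice is considerably more elaborate; the function name `segment` is an assumption. The strings are the katakana words for "inter", "net", and "internet" from the original example.

```python
def segment(text, dictionary):
    """Greedy longest-match segmentation; unknown characters are emitted one at a time."""
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in dictionary:
                words.append(text[i:j])
                i = j
                break
        else:
            words.append(text[i])  # no dictionary entry starts here
            i += 1
    return words

# Dictionary without the compound: the word is split in two.
print(segment("インターネット", {"インター", "ネット"}))  # ['インター', 'ネット']
# Dictionary with the compound: the word is extracted whole.
print(segment("インターネット", {"インター", "ネット", "インターネット"}))  # ['インターネット']
```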

[0014] The present invention has been made in view of the above points. Its object is to provide a similarity determination method, a document search device, a document classification device, a storage medium storing a document search program, and a storage medium storing a document classification program that can precisely judge the similarity between document elements having multiple subjects or subtitles, can judge similarity using the features of the document elements, and can judge the similarity between document elements precisely by overcoming the incompleteness of morphological analysis.

[0015]

[Means for Solving the Problems] FIG. 1 is a diagram for explaining the principle of the present invention. The present invention (claim 1) is a similarity determination method for appropriately determining the similarity between document elements, in which: from a document element consisting of a word string, a Boolean-operator combination of word strings, a sentence, a document, or a document set, the words used in the document element are extracted (step 1); an importance is assigned to each extracted word (step 2); a degree of relevance is assigned to each pair of extracted words (step 3); the subject of each document element is expressed by a graph in which the importance of each word is the weight of a node and the degree of relevance between words is the weight of a link (step 4); and the similarity between document elements is determined based on the degree of matching between the graphs expressing their subjects (step 5).
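
Steps 1 to 4 can be sketched as follows. This passage does not say how importance or relevance is computed, so relative word frequency and sentence-level co-occurrence are used purely as assumed stand-ins, and the input is taken to be already split into words; the name `subject_graph` is likewise hypothetical.

```python
from collections import Counter
from itertools import combinations

def subject_graph(sentences):
    """Build (nodes, links) from tokenized sentences.

    nodes: word -> importance (relative frequency, an assumed stand-in)
    links: frozenset({w1, w2}) -> relevance (share of sentences containing both, assumed)
    """
    counts = Counter(w for s in sentences for w in s)
    total = sum(counts.values())
    nodes = {w: c / total for w, c in counts.items()}
    pair_counts = Counter(
        frozenset(p) for s in sentences for p in combinations(sorted(set(s)), 2)
    )
    links = {pair: c / len(sentences) for pair, c in pair_counts.items()}
    return nodes, links

nodes, links = subject_graph([
    ["information", "retrieval", "robot"],
    ["information", "retrieval"],
])
print(nodes["information"])                             # 0.4
print(links[frozenset(["information", "retrieval"])])   # 1.0
print(links[frozenset(["retrieval", "robot"])])         # 0.5
```

Storing links under `frozenset` keys makes them undirected, which fits a relevance between two words that has no direction.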

[0016] In the present invention (claim 2), when the degree of matching between the graphs expressing the subjects is calculated, it is calculated so that: the more equivalent nodes (nodes containing the same word) the two graphs share, the larger the degree of matching; when a node in one graph carries a large weight, the larger the weight on the equivalent node in the other graph, the larger the degree of matching; the more equivalent links (links whose end nodes contain the same words) the two graphs share, the larger the degree of matching; and when a link in one graph carries a large weight, the larger the weight on the equivalent link in the other graph, the larger the degree of matching.
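
One function with the four monotonic properties listed in [0016] is the sum of weight products over shared nodes and shared links. It is offered only as an example that satisfies those properties, not as the formula used in the patent; the graph representation and the name `match_degree` are assumptions.

```python
def match_degree(graph_a, graph_b):
    """Sum of weight products over shared nodes plus shared links.

    Each graph is (nodes, links): nodes maps word -> weight, and links maps
    frozenset({word1, word2}) -> weight. Weights are assumed non-negative.
    """
    nodes_a, links_a = graph_a
    nodes_b, links_b = graph_b
    node_part = sum(nodes_a[w] * nodes_b[w] for w in nodes_a.keys() & nodes_b.keys())
    link_part = sum(links_a[k] * links_b[k] for k in links_a.keys() & links_b.keys())
    return node_part + link_part

ir_robot = frozenset(["information retrieval", "robot"])
key = ({"information retrieval": 1.0, "robot": 1.0}, {ir_robot: 1.0})
one_subject = ({"information retrieval": 0.8, "robot": 0.8}, {ir_robot: 0.9})
two_subjects = ({"information retrieval": 0.8, "robot": 0.8}, {})  # words never related

print(match_degree(key, one_subject) > match_degree(key, two_subjects))  # True
```

Unlike a score built from word importances alone, the link term separates a document in which the two words are related from one in which they merely both occur.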

[0017] In the present invention (claim 3), when the degree of matching between the graphs expressing the subjects is calculated: each graph is divided into subgraphs according to how strongly the sets of words used in the graph are related to one another; in each subgraph, if any two nodes of the subgraph have no link between them, a link with a small weight is generated in that subgraph; the subgraphs are then rejoined, the links generated in the subgraphs are added as they are, and the graph is restored to its undivided form; and the degree of matching between the graphs expressing the subjects is then calculated.
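
The divide, fill, and rejoin procedure of [0017] might look like the sketch below. The passage does not specify how the division is made, so grouping nodes that are connected by links at or above a threshold is an assumption, as are the small weight of 0.1 and the names `split` and `densify`.

```python
from itertools import combinations

def split(nodes, links, threshold):
    """Partition nodes into groups connected by links of weight >= threshold (assumed rule)."""
    groups = {w: {w} for w in nodes}
    for pair, weight in links.items():
        if weight >= threshold:
            a, b = tuple(pair)
            merged = groups[a] | groups[b]
            for w in merged:
                groups[w] = merged
    return {frozenset(g) for g in groups.values()}

def densify(nodes, links, threshold, small=0.1):
    """Add a small-weight link for every unlinked node pair inside each subgraph,
    then return the original links plus the generated ones (the rejoined graph)."""
    rejoined = dict(links)
    for group in split(nodes, links, threshold):
        for pair in combinations(sorted(group), 2):
            rejoined.setdefault(frozenset(pair), small)
    return rejoined

nodes = {"A": 1.0, "B": 1.0, "C": 1.0, "D": 1.0}
links = {frozenset("AB"): 0.9, frozenset("BC"): 0.8, frozenset("CD"): 0.1}

print(sorted(sorted(g) for g in split(nodes, links, 0.5)))  # [['A', 'B', 'C'], ['D']]
print(densify(nodes, links, 0.5)[frozenset("AC")])          # 0.1
```

In this example A and C end up in the same subgraph without a direct link, so a small-weight link between them is generated, while the weak C-D link is kept unchanged when the graph is rejoined.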

[0018] In the present invention (claim 4), when the degree of matching between the graphs expressing the subjects is calculated: each graph is divided into subgraphs according to how strongly the sets of words used in the graph are related to one another; in each subgraph, if any two nodes of the subgraph have no link between them, a link with a small weight is generated in that subgraph; and the degree of matching is calculated for each subgraph.

[0019] In the present invention (claim 5), when the degree of matching between the graphs expressing the subjects is calculated: each graph is divided into subgraphs according to how strongly the sets of words used in the graph are related to one another; in each subgraph, if any two nodes of the subgraph have no link between them, a link with a small weight is generated in that subgraph; the degree of matching is calculated for each subgraph; and the degrees of matching calculated for the individual subgraphs are summed to obtain the degree of matching between the graphs expressing the subjects.

[0020] FIG. 2 is a diagram showing the principle configuration of the document search device of the present invention. The present invention (claim 6) is a document search device for retrieving documents in response to a search request from a user, comprising: search interface means 410, which analyzes the search request from the user and extracts a search key; search key subject graph creation means 420, which generates from the search key a graph expressing the subject of the search key; word information management means 430, which obtains the set of document IDs of the documents in which a specified word appears; search target document subject graph creation means 440, which, when a document ID is specified, obtains the corresponding document from search target document storage means 441 and creates a graph expressing the subject of that document; similarity determination means 450, which takes as input the graph expressing the subject of the search key and the graph expressing the subject of a search target document and judges how similar they are; and search control means 460, which controls search interface means 410, search key subject graph creation means 420, word information management means 430, search target document subject graph creation means 440, and similarity determination means 450.
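
The role of word information management means 430, returning the IDs of the documents in which a given word appears, resembles an inverted index. The sketch below is an assumption about one way to realize it; the function names are hypothetical, and taking the union over the key words is also a choice made here rather than something stated in the text.

```python
from collections import defaultdict

def build_index(documents):
    """Map each word to the set of IDs of the documents in which it appears."""
    index = defaultdict(set)
    for doc_id, words in documents.items():
        for word in words:
            index[word].add(doc_id)
    return index

def candidates(index, key_words):
    """Collect the IDs of documents containing at least one search-key word."""
    found = set()
    for word in key_words:
        found |= index.get(word, set())
    return found

documents = {
    1: ["information", "retrieval"],
    2: ["robot"],
    3: ["music"],
}
index = build_index(documents)
print(sorted(candidates(index, ["robot", "retrieval"])))  # [1, 2]
```

Only the candidate documents would then need a subject graph built and compared with the graph of the search key.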

[0021] In the present invention (claim 7), the search key subject graph creation means 420 includes: word extraction means for extracting, from a document element consisting of a word string, a Boolean-operator combination of word strings, a sentence, a document, or a document set, the words used in the document element; importance assignment means for assigning an importance to each extracted word; and relevance assignment means for assigning a degree of relevance to each pair of extracted words. The search target document subject graph creation means 440 includes subject expression means for expressing the subject of each document element by a graph in which the importance of each word is the weight of a node and the degree of relevance between words is the weight of a link. The similarity determination means 450 includes means for determining the similarity between document elements based on the degree of matching between the graphs expressing their subjects.

[0022] In the present invention (claim 8), the similarity determination means 450 includes first calculation means for calculating the degree of matching between the graphs expressing the subjects so that: the more equivalent nodes (nodes containing the same word) the two graphs share, the larger the degree of matching; when a node in one graph carries a large weight, the larger the weight on the equivalent node in the other graph, the larger the degree of matching; the more equivalent links (links whose end nodes contain the same words) the two graphs share, the larger the degree of matching; and when a link in one graph carries a large weight, the larger the weight on the equivalent link in the other graph, the larger the degree of matching.

[0023] In the present invention (claim 9), the similarity determination means 450 includes second calculation means that: divides each graph into subgraphs according to how strongly the sets of words used in the graph are related to one another; generates, in each subgraph, a link with a small weight if any two nodes of the subgraph have no link between them; rejoins the subgraphs, adds the links generated in the subgraphs as they are, and restores the graph to its undivided form; and then calculates the degree of matching between the graphs expressing the subjects.

[0024] In the present invention (claim 10), the similarity determination means 450 includes third calculation means that: divides each graph into subgraphs according to how strongly the sets of words used in the graph are related to one another; generates, in each subgraph, a link with a small weight if any two nodes of the subgraph have no link between them; and calculates the degree of matching for each subgraph.

[0025] In the present invention (claim 11), the similarity determination means 450 includes fourth calculation means that: divides each graph into subgraphs according to how strongly the sets of words used in the graph are related to one another; generates, in each subgraph, a link with a small weight if any two nodes of the subgraph have no link between them; calculates the degree of matching for each subgraph; and sums the degrees of matching calculated for the individual subgraphs to obtain the degree of matching between the graphs expressing the subjects.

[0026] FIG. 3 is a diagram showing the principle configuration of the document classification device of the present invention. The present invention (claim 12) comprises: subject graph creation means 610, which obtains the document corresponding to a document ID from document storage means 611, in which documents are stored, and creates a graph expressing the subject of that document; graph similarity determination means 620, which, when graphs expressing the subjects of two documents are input, determines their degree of matching; classification means 630, which classifies the documents based on a matrix representing the similarities between documents; and classification control means 640, which controls the classification work as a whole.
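
The matrix of inter-document similarities that classification means 630 works from can be sketched as below. The dict-of-dicts layout, the function names, and the toy similarity are assumptions; for brevity each subject graph is reduced to its node weights, with link weights omitted.

```python
def similarity_matrix(graphs, similarity):
    """Return a symmetric dict-of-dicts matrix of pairwise similarities.

    `graphs` maps document ID -> subject graph; `similarity` is any function of two graphs.
    Diagonal entries are left at 0.0 because a document is never merged with itself.
    """
    ids = sorted(graphs)
    matrix = {i: {j: 0.0 for j in ids} for i in ids}
    for pos, i in enumerate(ids):
        for j in ids[pos + 1:]:
            matrix[i][j] = matrix[j][i] = similarity(graphs[i], graphs[j])
    return matrix

def shared_node_weight(graph_a, graph_b):
    """Toy similarity (an assumption): sum of weight products over shared words."""
    return sum(graph_a[w] * graph_b[w] for w in graph_a.keys() & graph_b.keys())

graphs = {
    "d1": {"graph": 1.0, "search": 0.5},
    "d2": {"graph": 0.5},
    "d3": {"music": 1.0},
}
matrix = similarity_matrix(graphs, shared_node_weight)
print(matrix["d1"]["d2"], matrix["d1"]["d3"])
```

Each pair is evaluated once and mirrored, so the number of graph comparisons is n(n-1)/2 for n documents.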

[0027] In the present invention (claim 13), in order to measure the degree of matching between the graphs expressing the subjects, the subject graph creation means 610 includes: word extraction means for extracting, from a document element consisting of a word string, a Boolean-operator combination of word strings, a sentence, a document, or a document set, the words used in the document element; importance assignment means for assigning an importance to each extracted word; relevance assignment means for assigning a degree of relevance to each pair of extracted words; and subject expression means for expressing the subject of each document element by a graph in which the importance of each word is the weight of a node and the degree of relevance between words is the weight of a link. The graph similarity determination means 620 includes means for determining the similarity between document elements based on the degree of matching between the graphs expressing their subjects.

[0028] In the present invention (claim 14), the graph similarity determination means 620 includes first calculation means for calculating the degree of matching between the graphs expressing the subjects so that: the more equivalent nodes (nodes containing the same word) the two graphs share, the larger the degree of matching; when a node in one graph carries a large weight, the larger the weight on the equivalent node in the other graph, the larger the degree of matching; the more equivalent links (links whose end nodes contain the same words) the two graphs share, the larger the degree of matching; and when a link in one graph carries a large weight, the larger the weight on the equivalent link in the other graph, the larger the degree of matching.

[0029] In the present invention (claim 15), the graph similarity determination means 620 includes second calculation means that: divides each graph into subgraphs according to how strongly the sets of words used in the graph are related to one another; generates, in each subgraph, a link with a small weight if any two nodes of the subgraph have no link between them; rejoins the subgraphs, adds the links generated in the subgraphs as they are, and restores the graph to its undivided form; and then calculates the degree of matching between the graphs expressing the subjects.

[0030] In the present invention (claim 16), the graph similarity determination means 620 includes third calculation means that: divides each graph into subgraphs according to how strongly the sets of words used in the graph are related to one another; generates, in each subgraph, a link with a small weight if any two nodes of the subgraph have no link between them; and calculates the degree of matching for each subgraph.

[0031] In the present invention (claim 17), the graph similarity determination means 620 includes fourth calculation means that: divides each graph into subgraphs according to how strongly the sets of words used in the graph are related to one another; generates, in each subgraph, a link with a small weight if any two nodes of the subgraph have no link between them; calculates the degree of matching for each subgraph; and sums the degrees of matching calculated for the individual subgraphs to obtain the degree of matching between the graphs expressing the subjects.

[0032] The present invention (claim 18) is a storage medium storing a document search program for retrieving documents in response to a search request from a user, the program comprising: a search interface process, which analyzes the search request from the user and extracts a search key; a search key subject graph creation process, which generates from the search key a graph expressing the subject of the search key; a word information management process, which obtains the set of document IDs of the documents in which a specified word appears; a search target document subject graph creation process, which, when a document ID is specified, obtains the corresponding document from search target document storage means, in which the search target documents are stored, and creates a graph expressing the subject of that document; a similarity determination process, which takes as input the graph expressing the subject of the search key and the graph expressing the subject of a search target document and judges how similar they are; and a search control process, which controls the search interface process, the search key subject graph creation process, the word information management process, the search target document subject graph creation process, and the similarity determination process.

【００３３】本発明（請求項１９）は、検索キー主題グ
ラフ作成プロセスにおいて、単語列または、単語列のブ
ール演算子結合または、文または、文書または、文書集
合で構成される文書要素から、該文書要素内で使用され
ている単語を抽出する単語抽出プロセスと、抽出された
それぞれの単語に重要度を付与する重要度付与プロセス
と、抽出されたそれぞれの任意の２単語間に関連度を付
与する関連度付与プロセスとを含み、検索対象文書主題
グラフ作成プロセスは、単語の重要度をノードの重みと
し、該単語間の関連度をリンクの重みとしたグラフによ
って、それぞれの文書要素の主題を表現する主題表現プ
ロセスを含み、類似度判定プロセスは、主題を表現する
グラフ間の一致の度合に基づき、文書要素間の類似度を
判定するプロセスを含む。According to the present invention (claim 19), the search key subject graph creation process includes: a word extraction process that extracts, from a document element composed of a word string, a Boolean operator combination of word strings, a sentence, a document, or a document set, the words used in that document element; an importance assignment process that assigns an importance to each extracted word; and a relevance assignment process that assigns a relevance to each pair of extracted words. The search target document subject graph creation process includes a subject expression process that expresses the subject of each document element by a graph in which the importance of a word is the weight of a node and the relevance between words is the weight of a link. The similarity determination process includes a process that determines the similarity between document elements based on the degree of matching between the graphs expressing the subjects.

【００３４】本発明（請求項２０）は、類似度判定プロ
セスにおいて、両方のグラフの同様のノード（同じ単語
を含んでいるノード）の個数が多ければ多い程、該グラ
フ間の一致の度合を大きな値とし、片方のグラフ内にあ
るノードに大きな重みが付いていた場合は、もう片方の
グラフ内の同様のノードに大きな重みが付いていればい
る程、両方のグラフ間の一致の度合を大きな値とし、両
方のグラフの同様のリンク（リンクの両端のノードに含
まれる単語が同じであるリンク）の本数が多ければ多い
程、該両方のグラフ間の一致の度合を大きな値とし、片
方のグラフ内にあるリンクに大きな重みが付いていた場
合は、もう片方のグラフ内の同様のリンクに大きな重み
が付いていればいる程、両方のグラフ間の一致の度合を
大きな値にするように、主題を表現するグラフ間の一致
の度合を計算する第１の計算プロセスを含む。According to the present invention (claim 20), the similarity determination process includes a first calculation process that calculates the degree of matching between the graphs expressing the subjects such that: the more similar nodes (nodes containing the same word) the two graphs share, the larger the degree of matching between the graphs; when a node in one graph has a large weight, the larger the weight of the similar node in the other graph, the larger the degree of matching; the more similar links (links whose end nodes contain the same words) the two graphs share, the larger the degree of matching; and when a link in one graph has a large weight, the larger the weight of the similar link in the other graph, the larger the degree of matching.

【００３５】本発明（請求項２１）は、類似度判定プロ
セスにおいて、それぞれのグラフを、グラフ内で使用さ
れている単語集合がどの程度の強さで関連し合っている
のかに基づいて、部分グラフに分割し、それぞれの部分
グラフに、該部分グラフ内の任意のノード間にリンクが
ない場合には、該部分グラフに小さい重みのリンクを生
成し、部分グラフを再結合し、部分グラフに生成したリ
ンクをそのまま追加して、分割前のグラフに戻して主題
を表現するグラフ間の一致の度合を計算する第２の計算
プロセスを含む。According to the present invention (claim 21), the similarity determination process includes a second calculation process that divides each graph into subgraphs based on how strongly the words used in the graph are related to each other; generates a link with a small weight in a subgraph wherever no link exists between a pair of nodes in that subgraph; recombines the subgraphs into the graph before division while adding the generated links as they are; and calculates the degree of matching between the graphs expressing the subjects.

【００３６】本発明（請求項２２）は、類似度判定プロ
セスにおいて、それぞれのグラフを、グラフ内で使用さ
れている単語集合がどの程度の強さで関連し合っている
かに基づいて、部分グラフに分割し、ぞれぞれの部分グ
ラフに、該部分グラフ間の任意のノード間にリンクがな
い場合には、該部分グラフに小さい重みのリンクを生成
し、それぞれの部分グラフ毎に一致の度合を計算する第
３の計算プロセスを含む。According to the present invention (claim 22), the similarity determination process includes a third calculation process that divides each graph into subgraphs based on how strongly the words used in the graph are related to each other; generates a link with a small weight in a subgraph wherever no link exists between a pair of nodes in that subgraph; and calculates the degree of matching for each subgraph.

【００３７】本発明（請求項２３）は、文書要素間類似
度判定プロセスにおいて、それぞれのグラフを、グラフ
内で使用されている単語集合がどの程度の強さで関連し
合っているのかに基づいて、部分グラフに分割し、それ
ぞれの部分グラフに、該部分グラフ内の任意のノード間
にリンクがない場合には、該部分グラフに小さい重みの
リンクを生成し、それぞれの部分グラフ毎に一致の度合
を計算し、部分グラフ毎に計算された一致の度合の総和
を計算することにより、主題を表現するグラフ間の一致
の度合を計算する第４の計算プロセスを含む。According to the present invention (claim 23), the document element similarity determination process includes a fourth calculation process that divides each graph into subgraphs based on how strongly the words used in the graph are related to each other; generates a link with a small weight in a subgraph wherever no link exists between a pair of nodes in that subgraph; calculates the degree of matching for each subgraph; and calculates the degree of matching between the graphs expressing the subjects as the sum of the degrees of matching calculated for the subgraphs.

【００３８】本発明（請求項２４）は、文書を分類する
ための文書分類装置に搭載される文書分類プログラムを
格納した記憶媒体であって、文書が格納されている文書
記憶手段から、文書ＩＤに対応した文書を取得し、該文
書の主題を表現するグラフを作成する主題グラフ作成プ
ロセスと、２つの文書の主題を表現するグラフが入力さ
れると、これらの一致の度合を判定するグラフ類似度判
定プロセスと、文書間の類似度を表す行列に基づいて、
該文書を分類する分類プロセスと、分類プロセスの分類
作業全体を制御を行う分類制御プロセスとを有する。The present invention (claim 24) is a storage medium storing a document classification program to be installed in a document classification device for classifying documents, the program comprising: a subject graph creation process that acquires the document corresponding to a document ID from document storage means in which documents are stored, and creates a graph expressing the subject of that document; a graph similarity determination process that, when graphs expressing the subjects of two documents are input, determines the degree of matching between them; a classification process that classifies the documents based on a matrix representing the similarities between documents; and a classification control process that controls the entire classification work of the classification process.

【００３９】本発明（請求項２５）は、主題グラフ作成
プロセスにおいて、主題を表現するグラフ間の一致の度
合を測定するために、単語列または、単語列のブール演
算子結合または、文または、文書または、文書集合で構
成される文書要素から、該文書要素内で使用されている
単語を抽出する単語抽出プロセスと、抽出されたそれぞ
れの単語に重要度を付与する重要度付与プロセスと、抽
出されたそれぞれの任意の２単語間に関連度を付与する
関連度付与プロセスと、単語の重要度をノードの重みと
し、該単語間の関連度をリンクの重みとしたグラフによ
って、それぞれの文書要素の主題を表現する主題表現プ
ロセスと、主題を表現するグラフ間の一致の度合に基
づき、文書要素間の類似度を判定する文書要素間類似度
判定プロセスを含む。According to the present invention (claim 25), in order to measure the degree of matching between the graphs expressing the subjects, the subject graph creation process includes: a word extraction process that extracts, from a document element composed of a word string, a Boolean operator combination of word strings, a sentence, a document, or a document set, the words used in that document element; an importance assignment process that assigns an importance to each extracted word; a relevance assignment process that assigns a relevance to each pair of extracted words; a subject expression process that expresses the subject of each document element by a graph in which the importance of a word is the weight of a node and the relevance between words is the weight of a link; and a document element similarity determination process that determines the similarity between document elements based on the degree of matching between the graphs expressing the subjects.

【００４０】本発明（請求項２６）は、グラフ類似度判
定プロセスにおいて、両方のグラフの同様のノード（同
じ単語を含んでいるノード）の個数が多ければ多い程、
該グラフ間の一致の度合を大きな値とし、片方のグラフ
内にあるノードに大きな重みが付いていた場合は、もう
片方のグラフ内の同様のノードに大きな重みが付いてい
ればいる程、両方のグラフ間の一致の度合を大きな値と
し、両方のグラフの同様のリンク（リンクの両端のノー
ドに含まれる単語が同じであるリンク）の本数が多けれ
ば多い程、該両方のグラフ間の一致の度合を大きな値と
し、片方のグラフ内にあるリンクに大きな重みが付いて
いた場合は、もう片方のグラフ内の同様のリンクに大き
な重みが付いていればいる程、両方のグラフ間の一致の
度合を大きな値にするように、主題を表現するグラフ間
の一致の度合を計算する第１の計算プロセスを含む。According to the present invention (claim 26), the graph similarity determination process includes a first calculation process that calculates the degree of matching between the graphs expressing the subjects such that: the more similar nodes (nodes containing the same word) the two graphs share, the larger the degree of matching between the graphs; when a node in one graph has a large weight, the larger the weight of the similar node in the other graph, the larger the degree of matching; the more similar links (links whose end nodes contain the same words) the two graphs share, the larger the degree of matching; and when a link in one graph has a large weight, the larger the weight of the similar link in the other graph, the larger the degree of matching.

【００４１】本発明（請求項２７）は、グラフ類似度判
定プロセスにおいて、それぞれのグラフを、グラフ内で
使用されている単語集合がどの程度の強さで関連し合っ
ているのかに基づいて、部分グラフに分割し、それぞれ
の部分グラフに、該部分グラフ内の任意のノード間にリ
ンクがない場合には、該部分グラフに小さい重みのリン
クを生成し、部分グラフを再結合し、部分グラフに生成
したリンクをそのまま追加して、分割前のグラフに戻し
て主題を表現するグラフ間の一致の度合を計算する第２
の計算プロセスを含む。According to the present invention (claim 27), the graph similarity determination process includes a second calculation process that divides each graph into subgraphs based on how strongly the words used in the graph are related to each other; generates a link with a small weight in a subgraph wherever no link exists between a pair of nodes in that subgraph; recombines the subgraphs into the graph before division while adding the generated links as they are; and calculates the degree of matching between the graphs expressing the subjects.

【００４２】本発明（請求項２８）は、グラフ類似度判
定プロセスにおいて、それぞれのグラフを、グラフ内で
使用されている単語集合がどの程度の強さで関連し合っ
ているかに基づいて、部分グラフに分割し、ぞれぞれの
部分グラフに、該部分グラフ間の任意のノード間にリン
クがない場合には、該部分グラフに小さい重みのリンク
を生成し、それぞれの部分グラフ毎に一致の度合を計算
する第３の計算プロセスを含む。According to the present invention (claim 28), the graph similarity determination process includes a third calculation process that divides each graph into subgraphs based on how strongly the words used in the graph are related to each other; generates a link with a small weight in a subgraph wherever no link exists between a pair of nodes in that subgraph; and calculates the degree of matching for each subgraph.

【００４３】本発明（請求項２９）は、グラフ類似度判
定プロセスにおいて、それぞれのグラフを、グラフ内で
使用されている単語集合がどの程度の強さで関連し合っ
ているのかに基づいて、部分グラフに分割し、それぞれ
の部分グラフに、該部分グラフ内の任意のノード間にリ
ンクがない場合には、該部分グラフに小さい重みのリン
クを生成し、それぞれの部分グラフ毎に一致の度合を計
算し、部分グラフ毎に計算された一致の度合の総和を計
算することにより、主題を表現するグラフ間の一致の度
合を計算する第４の計算プロセスを含む。According to the present invention (claim 29), the graph similarity determination process includes a fourth calculation process that divides each graph into subgraphs based on how strongly the words used in the graph are related to each other; generates a link with a small weight in a subgraph wherever no link exists between a pair of nodes in that subgraph; calculates the degree of matching for each subgraph; and calculates the degree of matching between the graphs expressing the subjects as the sum of the degrees of matching calculated for the subgraphs.

【００４４】上記により、本発明によれば、複数の主題
や副題を持つ文書要素間の類似度を精度良く判定できな
い問題を、単語間の関連度を用いることによって、複数
の主題や副題を持つ文書要素間の類似度を精度良く判定
することが可能となる。例えば、前述の発明が解決しよ
うとする課題における例において、ユーザが「情報検索
を行うロボット」に関する文書を検索したい場合に、本
発明では、「情報検索」と「ロボット」が強く関連して
いる文書の方がそうでない文書を比べて、高い類似度と
なる。前述した、「情報検索システム」と「産業用ロボ
ット」という２つの主題を持つ文書内では、「情報検
索」と「ロボット」が強く関連していないので、このよ
うな文書は、高い類似度とならない。このように、本発
明では、類似度を精度よく判定することが可能となる。As described above, the present invention solves the problem that the similarity between document elements having a plurality of subjects or subtitles cannot be determined accurately: by using the relevance between words, such similarity can be determined with high accuracy. For example, in the example given in the problem to be solved by the invention, when the user wants to search for documents about "a robot that performs information retrieval", the present invention gives a document in which "information retrieval" and "robot" are strongly related a higher similarity than a document in which they are not. In the aforementioned document having the two subjects "information retrieval system" and "industrial robot", "information retrieval" and "robot" are not strongly related, so such a document does not receive a high similarity. In this way, the present invention makes it possible to determine similarity accurately.

【００４５】また、本発明によれば、文書要素の持つ特
徴を利用した類似度の判定ができないという問題を、文
内で強い係り受けの関係にある単語間や、同一のパラグ
ラフに含まれる単語間に高い関連度を与えることができ
るため、これらの特徴を利用した類似度の判定により解
決できる。このように、本発明を利用すれば、類似度を
精度良く判定できる。Further, according to the present invention, the problem that similarity cannot be determined using the features of document elements is solved, because a high relevance can be given to words that are in a strong dependency relation within a sentence and to words contained in the same paragraph, and the similarity is determined using these features. Thus, by using the present invention, similarity can be determined accurately.

【００４６】更に、形態素解析を利用しているため単語
の抽出の失敗により類似度の判定の精度を低下させると
いう問題に対して、本発明では、前述の「インターネッ
ト」という単語の抽出の失敗の例を用いて説明すると、
形態素解析を利用しているため、前述した例と同様に、
「インターネット」という単語は抽出されず、この単語
は、「インター」と「ネット」という２つの単語として
抽出されてしまう。しかしながら、ある文書要素に「イ
ンターネット」という文字列がある場合、その文書要素
内では、抽出された単語「インター」と「ネット」の間
には強い関連がある。従って、「インター」と「ネッ
ト」が別々に出現する文書要素に比べて、「インターネ
ット」という文字列が出現する文書要素の方が高い類似
度となる。従って、たとえ、形態素解析に失敗したとし
ても、本発明により類似度判定の精度の低下を阻止する
ことが可能となる。Furthermore, regarding the problem that, because morphological analysis is used, a failure in word extraction lowers the accuracy of the similarity determination, the present invention behaves as follows in the aforementioned example of failing to extract the word "Internet". Since morphological analysis is used, the word "Internet" is not extracted, just as in the earlier example; it is extracted as the two words "inter" and "net". However, when a document element contains the character string "Internet", there is a strong relation between the extracted words "inter" and "net" within that document element. Consequently, a document element in which the character string "Internet" appears receives a higher similarity than a document element in which "inter" and "net" appear separately. Therefore, even if the morphological analysis fails, the present invention can prevent a drop in the accuracy of the similarity determination.

【００４７】上記により、本発明では、文書要素間の類
似度を精度良く判定できるので、精度の良い類似度判定
方法及び文書検索装置及び文書分類装置及び文書検索プ
ログラムを格納した記憶媒体及び文書分類プログラムを
格納した記憶媒体を提供することができる。As described above, the present invention can determine the similarity between document elements accurately, and can therefore provide an accurate similarity determination method, document search device, and document classification device, as well as a storage medium storing a document search program and a storage medium storing a document classification program.

【００４８】[0048]

【発明の実施の形態】図４は、本発明の類似度判定方法
を説明するためのフローチャートである。ステップ１０）文書要素内で使用されている単語を抽
出する。ステップ２０）文書要素内で使用されているそれぞれ
の単語の重要度を計算する。FIG. 4 is a flow chart for explaining a similarity determination method according to the present invention. Step 10) Extract words used in the document element. Step 20) Calculate the importance of each word used in the document element.

【００４９】ステップ３０）文書要素内で使用されて
いる任意の２単語間の関連度を計算する。ステップ４０）単語の重要度をノードの重みとし、単
語間の関連度をリンクの重みとした、グラフによってそ
れぞれの文書要素の主題を表現する。以下、このグラフ
を主題グラフと呼ぶ。Step 30) The degree of relevance between any two words used in the document element is calculated. Step 40) The subject of each document element is expressed by a graph in which the importance of a word is set as the weight of a node and the degree of association between words is set as the weight of a link. Hereinafter, this graph is called a theme graph.

【００５０】ステップ５０）このようにして生成した
主題グラフ同士の一致度に基づき、文書要素間の類似度
を判定する。以下、上記の各ステップの動作を詳細に説明する。ステップ１０）単語の抽出：単語の抽出は、文書要素
を形態素解析することによっておこなう。形態素解析手
法には、既存技術を用いるものとする。Step 50) The similarity between the document elements is determined based on the degree of coincidence between the subject graphs generated in this way. Hereinafter, the operation of each of the above steps will be described in detail. Step 10) Word extraction: Word extraction is performed by morphological analysis of document elements. An existing technology is used for the morphological analysis method.

【００５１】ステップ２０）重要度の計算：それぞれ
の文書要素内で使用されている単語の重要度を、次のよ
うにして計算する。本発明では、文書要素として、単語
列、単語のブール演算子結合、文、文書、文書集合を想
定しているので、それぞれについての単語の重要度の計
算法を以下に示す。Step 20) Calculation of importance: The importance of words used in each document element is calculated as follows. In the present invention, as a document element, a word string, a Boolean operator combination of words, a sentence, a document, and a document set are assumed, and a method of calculating the importance of a word for each is described below.

【００５２】・単語列：全ての単語の重要度を同じ値
にするか、または、ユーザにそれぞれの単語の重要度を
明示的に指定させることによって、重要度を決定する。・単語のブール演算子結合：単語列の場合と同様の方
法で重要度を決定する。・文：全ての単語を同じ重要度とするか、または、単
語の品詞（固有名詞には、副詞よりも高い重要度を付与
するなど）に応じて重要度を決定する。Word sequence: The importance is determined by setting the importance of all words to the same value or by letting the user explicitly specify the importance of each word. Boolean operator combination of words: The importance is determined in the same manner as in the case of word strings. Sentence: All words have the same importance, or the importance is determined according to the part of speech of a word (e.g., a proper noun is given a higher importance than an adverb).

【００５３】・文書：文の場合と同様の方法で重要度
を決定するか、または、単語の出現位置情報（タイトル
内で出現する単語には高い重要度を付与）、出現頻度情
報（高い出現頻度の単語には高い重要度を付与）、文書
要素集合全体の中での出現文書要素数（特定の文書要素
にしか出現しない単語には高い重要度を付与）などに基
づき計算する。Document: The importance is determined in the same manner as for a sentence, or is calculated based on, for example, word position information (a word appearing in the title is given a high importance), appearance frequency information (a word with a high appearance frequency is given a high importance), and the number of document elements in the whole document element set in which the word appears (a word that appears only in particular document elements is given a high importance).

【００５４】・文書集合：文書集合全体を一つの大き
な文書（全ての文書を結合した文書）と考えて、文書の
場合と同様の方法で重要度を計算する。ステップ３０）関連度の計算：それぞれの文書要素内
で使用されている単語間の関連度を、次のようにして計
算する。本発明では、文書要素として、単語列、単語の
ブール演算子結合、文、文書、文書集合を想定している
ので、それぞれについての単語間の関連度の計算法を次
に示す。Document set: The entire document set is regarded as one large document (a document obtained by concatenating all the documents), and the importance is calculated in the same manner as for a document. Step 30) Calculation of relevance: The relevance between the words used in each document element is calculated as follows. Since the present invention assumes word strings, Boolean operator combinations of words, sentences, documents, and document sets as document elements, the method of calculating the relevance between words for each of them is given below.

【００５５】・単語列：文書要素に含まれる全ての２
単語間の関連度を等しい値とするか、または、ユーザに
明示的に関連度を指定させることによって、関連度を決
定する。・単語のブール演算子結合：単語列での方法に加え
て、ブール演算子の種類に応じて関連度を決定する。例
えば、ａｎｄで結合されている単語同士は、ｏｒで結合
されているものに比べて関連度を大きな値とする。Word string: The relevance is determined either by giving every pair of words contained in the document element the same relevance or by having the user specify the relevance explicitly. Boolean operator combination of words: In addition to the method for word strings, the relevance is determined according to the type of Boolean operator. For example, words joined by "and" are given a larger relevance than words joined by "or".

【００５６】・文：単語列での方法を用いるか、また
は、次に示す、係り受け関係の情報を用いて計算する。
まず、文の係り受け関係の解析を行う。係り受け関係の
解析の手法は、既存技術を用いるものとする。直接の係
り受けの関係にあるもの同士は強い関連があるとし、間
接的な係り受け関係にあるものは、弱い関連があるもの
とする。例えば、「情報の検索に単語の重要度を利用す
る」という文があった場合、「情報」と「検索」の関連
度は、「情報」と「単語」の関連度に比べて大きな値と
する。なぜなら、「情報」と「検索」は、直接の係り受
け関係にあるのに対して、「情報」と「単語」は直接の
係り受け関係にないからである。Sentence: The relevance is calculated either by the method for word strings or by using dependency information as follows. First, the dependency relations of the sentence are analyzed; an existing technique is used for this analysis. Words in a direct dependency relation are regarded as strongly related, and words in an indirect dependency relation as weakly related. For example, given the sentence "Use the importance of words for the search of information", the relevance between "information" and "search" is made larger than the relevance between "information" and "word", because "information" and "search" are in a direct dependency relation whereas "information" and "word" are not.

【００５７】・文書：文での方法を用いるか、また
は、以下の２つのどちらかの方法によって、単語間の関
連度を計算する。 −共出現情報の利用：ある２単語が同一の文内（また
は、指定文字数の範囲内）で共出現した場合、これらの
共出現回数を数える。共出現の回数が多ければ多い程、
それら２単語間の関連度を大きな値とする。Document: The degree of relevance between words is calculated by using a sentence method or by one of the following two methods. -Use of co-occurrence information: If two words co-occur in the same sentence (or within the specified number of characters), count the number of co-occurrences. The more co-occurrences, the more
The degree of association between these two words is set to a large value.

【００５８】−構造情報を利用：文書の構造（章、パラ
グラフなど）を解析する。あるパラグラフ内に現れる単
語はそのパラグラフの見出し語と関連があり、また、パ
ラグラフ内の単語同士は関連があると考えられるので、
パラグラフ内での頻度情報に基づき、単語間の関連度を
決定する。例えば、あるパラグラフだけに高い頻度で出
現している単語はその節の見出し語と強い関連があり、
また、それらの単語同士は強い関連があるとする。Using structure information: Analyze the structure (chapter, paragraph, etc.) of the document. Since words appearing in a paragraph are related to the headword of the paragraph, and words in the paragraph are considered related,
The degree of relevance between words is determined based on the frequency information in the paragraph. For example, words that occur frequently only in a paragraph are strongly related to the headword of that section,
It is also assumed that those words have a strong relationship.

【００５９】・文書集合：文書集合を一つの大きな文
書（全ての文書を結合した文書）と考えて、文書の場合
と同様の方法で関連度を計算する。ステップ４０）主題グラフの作成：ステップ２０で求
めた単語の重要度をノード重みとし、ステップ３０で求
めた単語間の関連度をリンクの重みとしたグラフを作成
する。このグラフ（主題グラフ）によって文書要素の主
題を表現する。Document set: A document set is considered as one large document (a document in which all documents are combined), and the degree of relevance is calculated in the same manner as in the case of a document. Step 40) Creation of a subject graph: A graph is created in which the importance of the word obtained in step 20 is used as a node weight, and the degree of association between the words obtained in step 30 is used as a link weight. This graph (subject graph) expresses the subject of the document element.
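Steps 20 to 40 can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the implementation of the invention: the document is assumed to be already split into words per sentence (the output of step 10), importance is taken to be relative appearance frequency, and relevance is taken to be the sentence co-occurrence count. The function name `build_subject_graph` and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_subject_graph(sentences):
    """Build a subject graph from a document given as tokenized sentences.

    Returns (node_weights, link_weights):
      node_weights: word -> importance (assumption: relative frequency)
      link_weights: frozenset({w1, w2}) -> relevance
                    (assumption: number of sentences containing both words)
    """
    # Step 20: importance of each word from its appearance frequency.
    word_counts = Counter(word for sentence in sentences for word in sentence)
    total = sum(word_counts.values())
    node_weights = {word: count / total for word, count in word_counts.items()}

    # Step 30: relevance of each word pair from sentence co-occurrence.
    link_weights = Counter()
    for sentence in sentences:
        # Count each unordered pair once per sentence in which both words occur.
        for w1, w2 in combinations(sorted(set(sentence)), 2):
            link_weights[frozenset((w1, w2))] += 1

    # Step 40: the pair (node weights, link weights) is the subject graph.
    return node_weights, dict(link_weights)


# Hypothetical input: one list of extracted words per sentence.
doc = [
    ["information", "retrieval", "robot"],
    ["robot", "information", "index"],
]
nodes, links = build_subject_graph(doc)
```

Representing a link as a `frozenset` of its two words makes the graph undirected, so the pair ("information", "robot") and the pair ("robot", "information") share one weight.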

【００６０】ステップ５０）ステップ１４０で作成し
た文書要素の主題グラフ間の一致度を測定することによ
って、文書要素間の類似度を判定する。類似度判定処理
の構成を図５に示す。図５は、本発明の類似度判定処理
の動作を説明するための図である。１．グラフ間一致度測定処理（ステップ１１１、１２
３、１３３、１４３）：文書要素の主題グラフｑとｕの
一致度を、以下の式によって計算する。グラフｑとグラ
フｕに使用されている単語の重要度をそれぞれ以下のベ
クトルで表す。Step 50) The similarity between document elements is determined by measuring the degree of matching between the subject graphs of the document elements created in step 40. FIG. 5 shows the configuration of the similarity determination processing; it is a diagram for explaining the operation of the similarity determination processing of the present invention. 1. Inter-graph matching degree measurement processing (steps 111, 123, 133, 143): The degree of matching between the subject graphs q and u of the document elements is calculated by the following equations. The importances of the words used in graph q and graph u are represented by the following vectors, respectively.

【００６１】ｖ_q ＝（ｖ_q1，ｖ_q2，…，ｖ_qn）（１）ｖ_u ＝（ｖ_u1，ｖ_u2，…，ｖ_un）（２）ここで、ｖ_qiとｖ_uiはそれぞれ、文書要素ｑ内での単語
ｉの重要度、文書ｕ内での単語ｉの重要度を表す。これ
らのベクトルの内積ｆ_v $v_q = (v_{q1}, v_{q2}, \ldots, v_{qn})$ (1), $v_u = (v_{u1}, v_{u2}, \ldots, v_{un})$ (2). Here, $v_{qi}$ and $v_{ui}$ represent the importance of word $i$ in document element $q$ and the importance of word $i$ in document element $u$, respectively. The inner product $f_v$ of these vectors

【００６２】[0062]

【数１】 (Equation 1) $f_v = \sum_{i=1}^{n} v_{qi} v_{ui}$

【００６３】を計算する。グラフｑとグラフｕに使用さ
れている単語間の関連度をそれぞれ以下のように表す。ｒ_q＝（ｖ_q11，ｖ_q12，…，ｖ_q21，ｖ_q22，…，ｖ_qnn）（３）ｒ_u＝（ｖ_u11，ｖ_u12，…，ｖ_u21，ｖ_u22，…，ｖ_unn）（４）ここで、ｖ_qijとｖ_uijは、それぞれ、文書要素ｑ内で
の単語ｉと単語ｊの関連度、文書要素ｕ内での単語ｉと
単語ｊの関連度を表す。is calculated. The relevance between the words used in graph q and graph u is represented as follows: $r_q = (v_{q11}, v_{q12}, \ldots, v_{q21}, v_{q22}, \ldots, v_{qnn})$ (3), $r_u = (v_{u11}, v_{u12}, \ldots, v_{u21}, v_{u22}, \ldots, v_{unn})$ (4). Here, $v_{qij}$ and $v_{uij}$ represent the relevance between word $i$ and word $j$ in document element $q$ and the relevance between word $i$ and word $j$ in document element $u$, respectively.

【００６４】これらのベクトルの内積The inner product of these vectors

【００６５】[0065]

【数２】 (Equation 2) $f_r = \sum_{i=1}^{n} \sum_{j=1}^{n} v_{qij} v_{uij}$

【００６６】を計算する。ｆ_vとｆ_rからグラフ間の一
致度を以下のように求める。一致度＝ｆ_v ^p＊ｆ_r ^q ここで、ｐ及びｑは正の定数である。２．グラフ分割処理（ステップ１２１、１３１、１４
１）：以下の処理によって、グラフを分割し、それぞれ
の部分グラフ内に小さい重みのリンクを生成する。is calculated. From $f_v$ and $f_r$, the degree of matching between the graphs is obtained as follows: degree of matching $= f_v^{p} \cdot f_r^{q}$, where $p$ and $q$ are positive constants. 2. Graph division processing (steps 121, 131, 141): The graph is divided by the following processing, and links with a small weight are generated within each subgraph.

【００６７】（ａ）グラフＧ_Aをノード間の結合力の
強さに応じて、ｐ個の部分グラフにＧ_Ai（ｉ＝０，１，
…，ｐ）に分割する。ここで、結合力の強さとは、例え
ば、「それぞれの部分グラフ内の任意のノード間には、
必ず、距離１のリンクが存在するか、または、距離ｎ
（ｎ≧２）以下のリンクがｍ（ｍ≧１）本以上存在す
る。」などである。ここで、ノードａとノードｂ間の距
離とは、ａからｂへ到達するのに通過するリンクの本数
である。(a) The graph G_A is divided into p subgraphs G_Ai (i = 0, 1, ..., p) according to the strength of the coupling between nodes. Here, the strength of the coupling is a condition such as "between any two nodes in each subgraph there is always either a link of distance 1, or at least m (m ≥ 1) links of distance n (n ≥ 2) or less." The distance between node a and node b is the number of links traversed to reach b from a.

【００６８】（ｂ）分割された部分グラフ内の任意の
ノード間にリンクが存在しない場合は、これらのノード
間に弱い重みのリンクを生成する。ｎ＝２，ｍ＝２の場
合のグラフ分割処理を図６に示す。この例では、分割前
のグラフＧ_A（２１０）は、３個の部分グラフ（Ｇ
_A1（２２１），Ｇ_A2（２２２），Ｇ_A3（２３３））に分
割されている。このように分割されるのは、・Ｇ_A1（２１１）について、ノードＡＢ，ＡＣ，ＢＤ，
ＣＤ間に距離１のリンクが存在し、ＡＤ，ＢＣ間のそれ
ぞれには、距離２のリンクが２本存在する。・Ｇ_A2（２２２）について、ノードＢＤ，ＢＥ，ＤＥ間
に、距離１のリンクが存在する。・Ｇ_A3（２２３）について、ノードＤＦ間の距離１のリ
ンクが存在する。(b) When no link exists between a pair of nodes in a divided subgraph, a link with a weak weight is generated between those nodes. FIG. 6 shows the graph division processing for n = 2 and m = 2. In this example, the graph G_A (210) before division is divided into three subgraphs (G_A1 (221), G_A2 (222), and G_A3 (223)). The graph is divided in this way for the following reasons. In G_A1 (221), links of distance 1 exist between nodes A and B, A and C, B and D, and C and D, and two links of distance 2 exist between A and D and between B and C. In G_A2 (222), links of distance 1 exist between nodes B and D, B and E, and D and E. In G_A3 (223), a link of distance 1 exists between nodes D and F.

【００６９】このため、「それぞれの部分グラフ内の任
意の２ノード間には、必ず距離１のリンクが存在する
か、または、距離２以下のリンクが２本以上存在す
る。」という条件を満たすからである。また、Ｇ_A1（２
２１）における破線は、グラフ分割処理の（ｂ）で追加
された弱い重みのリンクである。この処理から明らかな
ように、分割された部分グラフ内の単語同士は、強い結
合力で結ばれている。従って、これらの部分グラフは、
意味的に関連の強い単語の集合で構成されていることに
なるので、これらの部分グラフからなるサブ文書はそれ
ぞれもとの文書の副題を表すことになる。This is because the condition "between any two nodes in each subgraph there is always either a link of distance 1, or at least two links of distance 2 or less" is satisfied. The broken lines in G_A1 (221) are the weak-weight links added in (b) of the graph division processing. As is clear from this processing, the words in each divided subgraph are tied together by strong coupling. These subgraphs therefore consist of sets of semantically closely related words, so the sub-documents formed by these subgraphs each represent a subtitle of the original document.
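The graph division processing can be sketched in Python as follows. The description gives the coupling condition only as an example and does not state how the subgraphs are found, so this sketch makes an assumption: a subgraph is taken to be a maximal set of nodes in which every pair satisfies the condition (a maximal clique of the "tied" relation, enumerated with the Bron-Kerbosch algorithm). The names `count_paths`, `maximal_cliques`, and `divide_graph` and the weak weight 0.1 are hypothetical, and the link list is the one given in the description of FIG. 6.

```python
from itertools import combinations


def count_paths(adj, a, b, max_len):
    """Count simple paths from a to b that use at most max_len links."""
    def walk(node, visited, remaining):
        if node == b:
            return 1
        if remaining == 0:
            return 0
        return sum(walk(nxt, visited | {nxt}, remaining - 1)
                   for nxt in adj[node] if nxt not in visited)
    return walk(a, {a}, max_len)


def maximal_cliques(nodes, tied):
    """Enumerate maximal sets of mutually tied nodes (Bron-Kerbosch, no pivoting)."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(current)
            return
        for node in sorted(candidates):
            expand(current | {node}, candidates & tied[node], excluded & tied[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(frozenset(), frozenset(nodes), frozenset())
    return cliques


def divide_graph(links, n=2, m=2, weak_weight=0.1):
    """links: dict frozenset({a, b}) -> weight. Returns a list of (nodes, links)."""
    adj = {}
    for pair in links:
        a, b = tuple(pair)
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    # Two nodes are "tied" if directly linked or joined by >= m paths of length <= n.
    tied = {node: set() for node in adj}
    for a, b in combinations(sorted(adj), 2):
        if b in adj[a] or count_paths(adj, a, b, n) >= m:
            tied[a].add(b)
            tied[b].add(a)
    subgraphs = []
    for clique in maximal_cliques(adj, tied):
        sub_links = {}
        for a, b in combinations(sorted(clique), 2):
            pair = frozenset((a, b))
            # Step (b): add a weak link where the original graph has none.
            sub_links[pair] = links.get(pair, weak_weight)
        subgraphs.append((set(clique), sub_links))
    return subgraphs


# Links of G_A as listed in the description of FIG. 6 (all weights set to 1.0).
links = {frozenset(p): 1.0 for p in ["AB", "AC", "BD", "CD", "BE", "DE", "DF"]}
parts = divide_graph(links)
node_sets = sorted(sorted(nodes) for nodes, _ in parts)
# node_sets == [['A', 'B', 'C', 'D'], ['B', 'D', 'E'], ['D', 'F']]
```

Under this interpretation the sketch reproduces the three subgraphs of the example, with weak links generated between A and D and between B and C. Path counting by exhaustive search is exponential in n, which is acceptable only for small n such as the n = 2 of the example.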

【００７０】また、このように部分グラフ内にリンクを
生成することによって、それぞれの副題に含まれる単語
同士には、ある程度の関連があるということをグラフ上
で表現している。３．グラフ再結合処理（ステップ１２２）：グラフ分割
処理が作成した部分グラフＧ_Ai（ｉ＝０，１，…，ｐ）
を分割前のグラフＧ’_Aに再結合する。このとき、Ｇ’
_Aは、グラフ分割処理（ステプ１２１）で生成されたリ
ンクを追加したものである。Generating links within the subgraphs in this way expresses on the graph that the words included in each subtitle are related to some degree. 3. Graph recombination processing (step 122): The subgraphs G_Ai (i = 0, 1, ..., p) created by the graph division processing are recombined into the graph before division, yielding G'_A. Here, G'_A is the original graph with the links generated in the graph division processing (step 121) added.

【００７１】図７に、グラフ再結合処理の例を示す。こ
の例では、図６のＧ_A（２１０）から作成された、３個
の部分グラフ（Ｇ_A1（３１１），Ｇ_A2（３１２），Ｇ_A3
（３１３））を、元のグラフへ再結合することによっ
て、Ｇ’_A（３２０）を作成している。グラフ分割処
理、グラフ再結合処理が作成したＧ’_Aには、ＡＤ間や
ＢＣ間にＧ_Aには存在しなかった弱い重みのリンクが追
加されている。FIG. 7 shows an example of the graph recombination processing. In this example, the three subgraphs (G_A1 (311), G_A2 (312), G_A3 (313)) created from G_A (210) of FIG. 6 are recombined into the original graph to create G'_A (320). In G'_A, created by the graph division processing and the graph recombination processing, weak-weight links that did not exist in G_A have been added between A and D and between B and C.
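The recombination of FIG. 7 can be sketched in Python as a merge of link tables. This is a minimal illustration: the function name `recombine`, the dictionary representation of links, and the weights 1.0 and 0.1 are assumptions, while the link lists follow the description of FIG. 6 and FIG. 7.

```python
def recombine(original_links, subgraphs):
    """Merge subgraph links back into the graph before division.

    original_links: dict frozenset({a, b}) -> weight of the graph before division
    subgraphs: iterable of dicts in the same format, including generated weak links
    """
    merged = dict(original_links)
    for sub_links in subgraphs:
        for pair, weight in sub_links.items():
            # Original links keep their weight; only generated links are added.
            merged.setdefault(pair, weight)
    return merged


g_a = {frozenset(p): 1.0 for p in ["AB", "AC", "BD", "CD", "BE", "DE", "DF"]}
g_a1 = {frozenset(p): 1.0 for p in ["AB", "AC", "BD", "CD"]}
g_a1.update({frozenset("AD"): 0.1, frozenset("BC"): 0.1})  # weak links from division
g_a2 = {frozenset(p): 1.0 for p in ["BD", "BE", "DE"]}
g_a3 = {frozenset("DF"): 1.0}
g_a_prime = recombine(g_a, [g_a1, g_a2, g_a3])
```

The result contains the seven original links plus the two weak links between A and D and between B and C.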

【００７２】４．部分グラフ一致度測定処理（ステッ
プ１３２、１４２）：グラフ分割処理が作成したそれぞ
れの部分グラフ毎に、グラフ間一致度測定処理を用い
て、一致度を測定する。５．一致度合計処理（ステップ１４４）：部分グラフご
とに求めた一致度を合計した値を、分割前のグラフ全体
の一致度とする。4. Subgraph coincidence measurement processing (steps 132 and 142): The degree of coincidence is measured using the intergraph coincidence measurement processing for each of the partial graphs created by the graph division processing. 5. Matching degree total processing (step 144): The value obtained by summing up the matching degrees obtained for each subgraph is set as the matching degree of the entire graph before division.
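The matching degree measurement (processing 1) and the totaling (processing 5) can be sketched in Python as follows. This is a minimal illustration under assumptions: a graph is represented as a pair of dictionaries (node weights, link weights), a word or link absent from a graph is treated as having weight 0, and the total is taken over all pairs of subgraphs, which the description of processing 5 does not spell out. The function names, the sample weights, and the choice p = q = 1 are hypothetical.

```python
def inner_product(weights_q, weights_u):
    """Sum of products over keys present in both mappings (absent keys count as 0)."""
    return sum(w * weights_u[key] for key, w in weights_q.items() if key in weights_u)


def matching_degree(graph_q, graph_u, p=1.0, q=1.0):
    """graph = (node_weights, link_weights); returns f_v**p * f_r**q."""
    f_v = inner_product(graph_q[0], graph_u[0])  # inner product of importances
    f_r = inner_product(graph_q[1], graph_u[1])  # inner product of relevances
    return (f_v ** p) * (f_r ** q)


def total_matching_degree(subgraphs_q, subgraphs_u, p=1.0, q=1.0):
    """Sum the matching degrees over all pairs of subgraphs (an assumption)."""
    return sum(matching_degree(sq, su, p, q)
               for sq in subgraphs_q for su in subgraphs_u)


link = frozenset(("information retrieval", "robot"))
query = ({"information retrieval": 1.0, "robot": 1.0}, {link: 1.0})
# Same words and importances, but different relevance between the two words.
doc_related = ({"information retrieval": 0.5, "robot": 0.5}, {link: 0.9})
doc_unrelated = ({"information retrieval": 0.5, "robot": 0.5}, {link: 0.1})

score_related = matching_degree(query, doc_related)
score_unrelated = matching_degree(query, doc_unrelated)
```

With these sample weights the document in which the two words are strongly related receives the larger matching degree, although both documents contain the same words with the same importance. Note that the product form yields 0 whenever the two graphs share no link.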

【００７３】次に、図５に示した類似度の計算方法（４
種類）のそれぞれについて説明する。１．グラフ分割を用いない方法（ステップ１１０）：（ａ）文書要素の主題グラフＧ_qとＧ_uを、グラフ間
一致度測定処理（ステップ１１１）に渡す。Next, each of the four similarity calculation methods shown in FIG. 5 will be described. 1. Method not using graph division (step 110): (a) The subject graphs G_q and G_u of the document elements are passed to the inter-graph matching degree measurement processing (step 111).

【００７４】（ｂ）グラフ間一致度測定処理（ステッ
プ１１１）では、これら２のグラフＧ_q，Ｇ_u間の一致
度を測定し、出力する。この方法で求めた主題グラフ間
の一致度を文書要素間の類似度とする。この方法は、グ
ラフ分割を用いないので処理が高速である。２．グラフ分割、再結合を用いる方法（ステップ１２
０）：（ａ）文書要素の主題グラフＧ_qとＧ_uを、グラフ分
割処理（ステップ１２１）に渡す。(b) The inter-graph matching degree measurement processing (step 111) measures and outputs the degree of matching between these two graphs G_q and G_u. The degree of matching between the subject graphs obtained by this method is taken as the similarity between the document elements. Because this method does not use graph division, processing is fast. 2. Method using graph division and recombination (step 120): (a) The subject graphs G_q and G_u of the document elements are passed to the graph division processing (step 121).

【００７５】（ｂ）グラフ分割処理（ステップ１２
１）は、Ｇ_q、Ｇ_uをそれぞれ複数の部分グラフＧ
_qi（ｉ＝０，１，…ｐ）、Ｇ_uj（ｊ＝０，１，…，ｒ）
に分割し、Ｇ_qi，Ｇ_uj内の任意のノード間にリンクが存
在しない場合は、これらのノード間に小さい重みのリン
クを生成する。（ｃ）グラフ再結合処理（ステップ１２２）は、グラ
フ分割処理（ステップ１２１）で作成した部分グラフＧ
_qi，Ｇ_ujを、もう一度グラフ分割する前の状態に再結合
することによって、Ｇ’_q，Ｇ’_uを作成する。前述し
たように、再結合の際には、それぞれの部分グラフに生
成したリンクを、Ｇ’_q，Ｇ’_uに追加する。(b) The graph division processing (step 121) divides G_q and G_u into a plurality of subgraphs G_qi (i = 0, 1, ..., p) and G_uj (j = 0, 1, ..., r), respectively, and when no link exists between a pair of nodes in G_qi or G_uj, generates a link with a small weight between those nodes. (c) The graph recombination processing (step 122) creates G'_q and G'_u by recombining the subgraphs G_qi and G_uj created in the graph division processing (step 121) into the state before division. As described above, at recombination the links generated in the respective subgraphs are added to G'_q and G'_u.

【００７６】（ｄ）グラフ間一致度測定処理（ステッ
プ１２３）は、Ｇ’_qとＧ’_u間の一致度を測定し出力
する。このようにして求めた主題グラフ間の一致度を文
書要素間の類似度とする。この方法では、間接的な単語
間の関連（同一の副題に含まれる単語には、直接のある
程度の関連がある）を用いて類似度の判定を行うことが
できるので、より正確な類似度を計算できる。(d) The inter-graph matching degree measurement processing (step 123) measures and outputs the degree of matching between G'_q and G'_u. The degree of matching between the subject graphs obtained in this way is taken as the similarity between the document elements. Because this method can determine similarity using indirect relations between words (words included in the same subtitle are treated as related to some degree), a more accurate similarity can be calculated.

【００７７】３．部分グラフ毎の一致度の測定法（ス
テップ１３０）：（ａ）文書要素の主題グラフＧ_qとＧ_uのそれぞれ
を、グラフ分割処理（ステップ１３１）に渡す。（ｂ）グラフ分割処理（ステップ１３１）は、
Ｇ’_q，Ｇ’_uを複数の部分グラフＧ_qi（ｉ＝０，１，
…，ｐ）、Ｇ_uj（ｊ＝０，１，…，ｒ）に分割し、
Ｇ_qi，Ｇ_uj内の任意のノード間にリンクが存在しない場
合は、これらのノード間に小さい重みのリンクを追加す
る。3. Method of measuring the degree of coincidence for each subgraph (step 130): (a) Pass the subject graphs G_q and G_u of the document elements to the graph division process (step 131). (b) The graph division process (step 131) divides G_q and G_u into subgraphs G_qi (i = 0, 1, ..., p) and G_uj (j = 0, 1, ..., r). If no link exists between some pair of nodes within G_qi or G_uj, a link with a small weight is added between those nodes.

【００７８】（ｃ）部分グラフ一致度測定処理（ステ
ップ１３２）は、それぞれの部分グラフＧ_qiとＧ_ujの全
ての組み合わせについて一致度を測定する。この際、グ
ラフ間一致度測定処理（ステップ１３３）を利用する。
部分グラフ毎に求めた一致度を出力する。このようにし
て求めた、部分グラフごとの一致度を、それぞれ分割さ
れたサブ文書（副題）毎の類似度とする。この方法で
は、文書全体を対象とするのではなく、文書内のサブ文
書（副題）毎の類似度の計算を行うことができる。(c) The subgraph coincidence measurement process (step 132) measures the degree of coincidence for every combination of the subgraphs G_qi and G_uj, using the inter-graph coincidence measurement process (step 133), and outputs the degree of coincidence obtained for each subgraph. The degree of coincidence obtained for each subgraph is taken as the similarity of the corresponding divided sub-document (subtitle). This method can therefore calculate similarity for each sub-document (subtitle) within a document rather than for the document as a whole.
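The all-pairs measurement of step 132 can be sketched as below. The `coincidence` function is an assumption: it multiplies the inner product of node weights by the inner product of link weights, which is one simple way to score two graphs; the subgraph contents are made-up illustration data.

```python
def dot(a, b):
    """Inner product of two sparse weight maps over their shared keys."""
    return sum(w * b[k] for k, w in a.items() if k in b)

def coincidence(g1, g2):
    # Assumption: coincidence = node inner product * link inner product.
    return dot(g1["nodes"], g2["nodes"]) * dot(g1["links"], g2["links"])

def pairwise_coincidence(subgraphs_q, subgraphs_u):
    """Score every combination (G_qi, G_uj) of subgraphs."""
    return {(i, j): coincidence(gq, gu)
            for i, gq in enumerate(subgraphs_q)
            for j, gu in enumerate(subgraphs_u)}

def link(a, b):
    return frozenset((a, b))

# Hypothetical subgraphs: one for the query, two for the document.
q_parts = [{"nodes": {"information": 1.0, "search": 1.0},
            "links": {link("information", "search"): 1.0}}]
u_parts = [{"nodes": {"information": 1.0, "subject": 1.0},
            "links": {link("information", "subject"): 1.0}},
           {"nodes": {"information": 1.0, "search": 1.0},
            "links": {link("information", "search"): 1.0}}]
scores = pairwise_coincidence(q_parts, u_parts)
```

Each key `(i, j)` identifies a query subgraph and a document subgraph, so the highest-scoring `j` points at the best-matching sub-document.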

【００７９】４．部分グラフ毎の一致度の合計を用い
る方法（ステップ１４０）：（ａ）文書要素の主題グラフＧ_q、Ｇ_uのそれぞれ
を、グラフ分割処理（ステップ１４１）に渡す。（ｂ）グラフ分割処理（ステップ１４１）は、Ｇ_q，
Ｇ_uを複数の部分グラフＧ_qi（ｉ＝０，１，…，ｐ），
Ｇ_uj（ｊ＝０，１，…，ｒ）に分割し、Ｇ_qi，Ｇ_uj内の
任意のノード間にリンクが存在しない場合は、これらの
ノード間に小さい重みのリンクを追加する。4. Method using the sum of the degrees of coincidence of the subgraphs (step 140): (a) Pass the subject graphs G_q and G_u of the document elements to the graph division process (step 141). (b) The graph division process (step 141) divides G_q and G_u into subgraphs G_qi (i = 0, 1, ..., p) and G_uj (j = 0, 1, ..., r). If no link exists between some pair of nodes within G_qi or G_uj, a link with a small weight is added between those nodes.

【００８０】（ｃ）部分グラフ一致度測定処理（ステ
ップ１４２）は、それぞれの部分グラフＧ_qiとＧ_ujのす
べての組み合わせについて一致度を測定する。この際、
グラフ間一致度測定処理（ステップ１４３）を利用す
る。（ｄ）一致度合計処理（ステップ１４４）は、部分グ
ラフ一致度測定処理（ステップ１４２）で得られた全て
の部分グラフ毎の一致度の合計を計算し、出力する。(c) The subgraph coincidence measurement process (step 142) measures the degree of coincidence for every combination of the subgraphs G_qi and G_uj, using the inter-graph coincidence measurement process (step 143). (d) The coincidence summation process (step 144) calculates and outputs the sum of all the per-subgraph degrees of coincidence obtained in the subgraph coincidence measurement process (step 142).
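Written as a formula, the summation of step 144 would read as follows. The symbol M for the inter-graph degree of coincidence of step 143 and the name sim for the resulting similarity are notation assumed here for illustration; the index ranges follow the subgraph numbering G_qi (i = 0, ..., p) and G_uj (j = 0, ..., r).

```latex
\[
\mathrm{sim}(Q, U) \;=\; \sum_{i=0}^{p} \sum_{j=0}^{r} M\!\left(G_{qi},\, G_{uj}\right)
\]
```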

【００８１】このようにして求めた、合計された一致度
を、文書要素間の類似度とする。この方法では、文書内
の副題毎の類似度の総和によって、文書要素間の類似度
を判定することができるので、より正確な類似度の計算
を行うことができる。以上をまとめると、処理を高速に
行うことができるのが、「グラフ分割を用いない方法
（ステップ１１０）」である。The sum of the coincidences determined in this way is regarded as the similarity between the document elements. According to this method, the similarity between document elements can be determined based on the sum of the similarities for each subtitle in the document, so that more accurate calculation of the similarity can be performed. To summarize the above, the method that can perform the processing at high speed is the “method without using graph division (step 110)”.

【００８２】「グラフ分割、再結合を用いる方法（ステ
ップ１２０）」では、グラフ分割処理という複雑な処理
を行う代わりに、間接的な単語間の関連も利用した、類
似度の判定を行うことができる。また、「部分グラフ毎
の一致度の測定法（ステップ１３０）」では、文書要素
をそれぞれ１つの文書として取り扱うのではなく、副題
ごとに分割された独立したそれぞれのサブ文書毎の類似
度の判定を行うことができる。The "method using graph division and recombination (step 120)" requires the complex graph division process, but in exchange it can determine similarity using indirect associations between words as well. The "method of measuring the degree of coincidence for each subgraph (step 130)" does not treat each document element as a single document; instead, it can determine the similarity of each independent sub-document obtained by dividing the element by subtitle.

【００８３】「部分グラフ毎の一致度の合計を用いる方
法（ステップ１４０）」では、非常に処理が複雑である
が、文書内の副題毎の類似度の総和として、文書全体の
類似度を求めることができるので、より正確な類似度を
判定できる。In the "method using the sum of the degrees of coincidence for each subgraph (step 140)", the processing is extremely complicated, but the similarity of the entire document is obtained as the sum of the degrees of similarity for each subtitle in the document. Therefore, more accurate similarity can be determined.

【００８４】[0084]

【実施例】以下、図面と共に本発明の実施例を説明す
る。最初に、前述の方法を用いた文書検索装置について
説明する。図８は、本発明の一実施例の文書検索装置の
構成を示す。同図に示す文書検索装置は、検索インタフ
ェース部４１０、検索キー主題グラフ作成部４２０、単
語情報管理部４３０、検索対象主題グラフ作成部４４
０、類似度判定部４５０、検索制御部４６０、検索対象
文書データベース４４１、及びインデックスファイル４
３１から構成される。Embodiments of the present invention will be described below with reference to the drawings. First, a document search device using the above-described method will be described. FIG. 8 shows the configuration of a document search device according to one embodiment of the present invention. The document search device shown in the figure comprises a search interface unit 410, a search key subject graph creation unit 420, a word information management unit 430, a search target document subject graph creation unit 440, a similarity determination unit 450, a search control unit 460, a search target document database 441, and an index file 431.

【００８５】検索インタフェース部４１０は、ユーザか
らの検索要求を解析し、検索キーを取り出し、検索制御
部４６０に渡す。また、検索結果を検索制御部４６０か
ら受け取り、ユーザに返す。検索キー主題グラフ作成部
４２０は、検索キーから主題グラフを作成する。単語情
報管理部４３０は、インデックスファイル４３１を参照
することによって、指定された単語が出現する文書ＩＤ
の集合を取得する。ここで、インデックスファイル４３
１は、単語をキー、その単語が出現する文書ＩＤの集合
を値とするハッシュテーブルである。The search interface unit 410 analyzes a search request from the user, extracts the search key, and passes it to the search control unit 460. It also receives the search result from the search control unit 460 and returns it to the user. The search key subject graph creation unit 420 creates a subject graph from the search key. The word information management unit 430 refers to the index file 431 to obtain the set of document IDs of the documents in which a specified word appears. Here, the index file 431 is a hash table whose keys are words and whose values are the sets of document IDs of the documents in which each word appears.
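A hash table of this shape is what is usually called an inverted index. The sketch below assumes documents are already split into words; `build_index`, `lookup`, and the sample document IDs are hypothetical names used only for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, words in docs.items():
        for word in words:
            index[word].add(doc_id)
    return dict(index)

def lookup(index, word):
    """Return the set of document IDs in which `word` appears (empty if none)."""
    return index.get(word, set())

# Hypothetical, already-tokenized documents.
docs = {
    "d1": ["information", "subject"],
    "d2": ["document", "search"],
    "d3": ["information", "search"],
}
index = build_index(docs)
```

A lookup costs one hash access per word, which is why the search control unit can narrow the candidate documents before any subject graph is built.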

【００８６】検索対象文書主題グラフ作成部４４０は、
文書ＩＤが指定されるとその文書ＩＤに対応した文書を
検索対象文書データベース４４１から取得し、その文書
の主題グラフを作成する。類似度判定部４５０は、検索
キーの主題グラフと検索対象文書の主題グラフを入力と
し、それらの類似度を判定する。ここで、類似度の判定
法は、前述の図５のフローチャートの方法を用いる。When a document ID is specified, the search target document subject graph creation unit 440 acquires the document corresponding to that document ID from the search target document database 441 and creates a subject graph of the document. The similarity determination unit 450 receives the subject graph of the search key and the subject graph of a search target document as input and determines their similarity. The similarity is determined by the method of the flowchart of FIG. 5 described above.

【００８７】検索制御部４６０は、以下の処理を行う。（ａ）検索インタフェース部４１０から検索キーを取
得する。（ｂ）検索キー主題グラフ作成部４２０から、この検
索キーから作成された主題グラフを取得する。（ｃ）単語情報管理部４３０から、この主題グラフ内
の単語のどれか一つでも出現する文書ＩＤの集合を取得
する。The search control unit 460 performs the following processing. (A) Obtain a search key from the search interface unit 410. (B) From the search key subject graph creation unit 420, a subject graph created from this search key is obtained. (C) From the word information management unit 430, a set of document IDs in which any one of the words in the subject graph appears is acquired.

【００８８】（ｄ）これらの文書ＩＤの集合のそれぞ
れの要素に対して、以下の処理を実行する。文書ＩＤに対応した検索対象文書の主題グラフを検
索対象文書主題グラフ作成部４４０から取得する。この検索対象文書の主題グラフと検索キーの主題グ
ラフの類似度を類似度判定４５０から取得する。(d) The following processing is executed for each element of this set of document IDs. First, the subject graph of the search target document corresponding to the document ID is acquired from the search target document subject graph creation unit 440. Then, the similarity between that subject graph and the subject graph of the search key is acquired from the similarity determination unit 450.

【００８９】（ｅ）文書ＩＤの集合を類似度の降順に
ソートし、上位ｎ件の文書ＩＤに対応した文書を検索結
果とする。以下、次の例を用いて処理の流れを説明す
る。図９は、本発明の一実施例の主題グラフの作成の例
を示す。検索キーＱ：（情報ｏｒ文書）ａｎｄ検索検索対象文書Ｕ：以下の５文からなる文書「情報の主題について。(e) The set of document IDs is sorted in descending order of similarity, and the documents corresponding to the top n document IDs are taken as the search result. The flow of processing is described below using the following example. FIG. 9 shows an example of creating subject graphs according to one embodiment of the present invention.
Search key Q: (information or document) and search
Search target document U: a document consisting of the following five sentences: “About the subject of information.

【００９０】文書の主題について。検索の効率を上げ
る。情報を検索する。文書を検索する。」ステップ５０１）ユーザが検索要求を入力する。About the subject of the document. Improve search efficiency. Search for information. Search for documents.”
Step 501) The user inputs a search request.

【００９１】ステップ５０２）検索インタフェース部
４１０は、ユーザが入力した検索要求から検索キーを抽
出し、検索制御部４６０に渡す。ステップ５０３）検索制御部４６０は、検索キーを検
索キー主題グラフ作成部４２０に渡す。ステップ５０４）検索キー主題グラフ作成部４２０
は、検索キーから主題グラフを作成し、検索制御部４６
０に渡す。今回の例では、検索キーＱから図９の検索キ
ーの主題グラフ５１０を作成した。但し、すべての単語
の重要度を１．０とし、単語間の関連度は、“ｏｒ”の
場合０．５、“ａｎｄ”の場合１．０とした。Step 502) The search interface unit 410 extracts the search key from the search request input by the user and passes it to the search control unit 460. Step 503) The search control unit 460 passes the search key to the search key subject graph creation unit 420. Step 504) The search key subject graph creation unit 420 creates a subject graph from the search key and passes it to the search control unit 460. In this example, the subject graph 510 of the search key in FIG. 9 was created from search key Q. Here, the importance of every word was set to 1.0, and the relevance between words was set to 0.5 for "or" and 1.0 for "and".
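One way to turn a Boolean search key into such a graph is sketched below. The nested-tuple query format and the rule "link every word of one operand to every word of the other operand with the operator's weight" are assumptions made for illustration; only the weights 1.0 (importance), 0.5 ("or"), and 1.0 ("and") come from the example above.

```python
from itertools import combinations, product

OP_WEIGHT = {"or": 0.5, "and": 1.0}  # relevance per operator, from the example

def _collect(expr, nodes, links):
    """Walk the expression, filling `nodes` and `links`; return its words."""
    if isinstance(expr, str):
        nodes[expr] = 1.0  # every query word gets importance 1.0
        return [expr]
    op, *operands = expr
    groups = [_collect(e, nodes, links) for e in operands]
    # Assumed rule: link words across operands with the operator's weight.
    for left, right in combinations(groups, 2):
        for a, b in product(left, right):
            if a != b:
                links[frozenset((a, b))] = OP_WEIGHT[op]
    return [w for g in groups for w in g]

def query_graph(expr):
    nodes, links = {}, {}
    _collect(expr, nodes, links)
    return nodes, links

# Search key Q: (information or document) and search
nodes_q, links_q = query_graph(("and", ("or", "information", "document"), "search"))
```

Under these assumptions the sketch yields three nodes of weight 1.0, an "information"/"document" link of 0.5, and links of 1.0 from each of those words to "search".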

【００９２】ステップ５０５）検索制御部４６０は、
検索キーの主題グラフに使用されているそれぞれの単語
を単語情報管理部４３０に渡す。ステップ５０６）単語情報管理部４３０は、その単語
が一度でも出現する文書ＩＤの集合をインデックスファ
イル４３１から取得し、検索制御部４６０に渡す。Step 505) The search control unit 460 passes each word used in the subject graph of the search key to the word information management unit 430. Step 506) The word information management unit 430 acquires from the index file 431 the set of document IDs of the documents in which the word appears at least once, and passes it to the search control unit 460.

【００９３】ステップ５０７）検索制御部４６０は、
単語情報管理部４３０から取得した文書ＩＤの集合のそ
れぞれの要素を検索対象文書主題グラフ作成部４４０に
渡す。ステップ５０８）検索対象文書主題グラフ作成部４４
０は、文書ＩＤに対応した文書を検索対象文書データベ
ース４４１から取得し、その文書の主題グラフを作成
し、これを検索制御部４６０に渡す。今回の例では、文
書Ｕから図９の検索対象文書の主題グラフ５２０を作成
した。但し、単語の重要度は出現回数に比例した値と
し、単語間の関連度は、文内の共出現回数に比例した値
とし、不要語は取り除いた。Step 507) The search control unit 460 passes each element of the set of document IDs acquired from the word information management unit 430 to the search target document subject graph creation unit 440. Step 508) The search target document subject graph creation unit 440 acquires the document corresponding to the document ID from the search target document database 441, creates a subject graph of the document, and passes it to the search control unit 460. In this example, the subject graph 520 of the search target document in FIG. 9 was created from document U. Here, the importance of each word was set to a value proportional to its number of occurrences, the relevance between two words was set to a value proportional to the number of times they co-occur within a sentence, and stop words were removed.
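The construction of the document-side graph can be sketched as follows. The sketch assumes the sentences of document U have already been tokenized and stripped of stop words (the English glosses stand in for the Japanese words), and it normalizes by the maximum count, which is one possible reading of "proportional". Under that reading "information" gets 2/3, which the worked example appears to show as 0.6.

```python
from collections import Counter
from itertools import combinations

def document_graph(sentences):
    """Build (nodes, links): node weight ~ word frequency,
    link weight ~ number of sentences in which two words co-occur."""
    freq = Counter(w for s in sentences for w in s)
    cooc = Counter(frozenset(pair)
                   for s in sentences
                   for pair in combinations(sorted(set(s)), 2))
    max_f = max(freq.values())
    max_c = max(cooc.values(), default=1)
    nodes = {w: c / max_f for w, c in freq.items()}
    links = {pair: c / max_c for pair, c in cooc.items()}
    return nodes, links

# Document U, five sentences, assumed tokenization with stop words removed.
doc_u = [
    ["information", "subject"],
    ["document", "subject"],
    ["search", "efficiency"],
    ["information", "search"],
    ["document", "search"],
]
nodes_u, links_u = document_graph(doc_u)
```

Every word pair in this small document co-occurs exactly once, so all five links end up with weight 1.0.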

【００９４】ステップ５０９）検索制御部４６０は、
検索対象文書主題グラフ作成部４４０から取得したそれ
ぞれの主題グラフと検索キーの主題グラフを類似度判定
部４５０に渡す。ステップ５１０）類似度判定部４５０は、検索キーの
主題グラフとそれぞれの検索対象文書の主題グラフとの
類似度の判定を行う。Step 509) The search control unit 460 passes each subject graph acquired from the search target document subject graph creation unit 440, together with the subject graph of the search key, to the similarity determination unit 450. Step 510) The similarity determination unit 450 determines the similarity between the subject graph of the search key and the subject graph of each search target document.

【００９５】今回は、「グラフ分割を用いない方法」を
使用した場合の例として、類似度の判定法を説明する。
まず、図９の検索キーの主題グラフ５１０及び検索対象
文書の主題グラフ５２０から、以下の単語の重要度を表
すベクトルを生成する。但し、ベクトルｖ_Q，ｖ_Uの要
素は、それぞれ、検索キーＱ、検索対象文書Ｕ内での
（情報，文書，検索，主題，効率）の重要度を示し、グ
ラフに存在しない単語の重要度は、０．０とした。Here, the similarity determination is explained for the case where the "method without graph division" is used. First, the following vectors representing word importance are generated from the subject graph 510 of the search key and the subject graph 520 of the search target document in FIG. 9. The elements of the vectors v_Q and v_U give the importance of (information, document, search, subject, efficiency) in the search key Q and the search target document U, respectively; the importance of a word that does not appear in a graph is set to 0.0.

【００９６】ｖ_Q＝（1.0, 1.0, 1.0, 0.0, 0.0) （５）ｖ_u＝（0.6, 0.6, 1.0, 0.6, 0.3) （６）これらのベクトルの内積ｆ_vは、ｆ_v＝２．２となる。
v_Q = (1.0, 1.0, 1.0, 0.0, 0.0)   (5)
v_U = (0.6, 0.6, 1.0, 0.6, 0.3)   (6)
The inner product f_v of these vectors is f_v = 2.2.

【００９７】次に、同様に関連度を表すベクトルを生成
する。但し、ベクトルｒ_Q，ｒ_Uの要素は、それぞれ、
検索キーＱ、検索対象文書Ｕ内での（情報と主題、情報
と文書、情報と検索、情報と効率、主題と文書、主題と
検索、主題と効率、文書と検索、文書と効率、検索と効
率）の関連度を表し、グラフに存在しないリンクの重み
は０とした。Next, vectors representing relevance are generated in the same way. The elements of the vectors r_Q and r_U give the relevance of (information and subject, information and document, information and search, information and efficiency, subject and document, subject and search, subject and efficiency, document and search, document and efficiency, search and efficiency) in the search key Q and the search target document U, respectively; the weight of a link that does not exist in a graph is set to 0.

【００９８】ｒ_Q＝（0.0, 0.5, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0) （７）ｒ_U＝（1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0) （８）これらのベクトルの内積ｆ_vは、ｆ_v＝２．０となる。そこで、ｐ＝１，ｑ＝１とした場合、検索キー
の主題グラフ５１０と検索対象文書の主題グラフ５２０
の一致度は、一致度＝ｆ_U＊ｆ_r＝２．２＊２．０＝４．４となり、今回の例での検索キーＱと検索対象文書Ｕの類
似度は、４．４と計算される。
r_Q = (0.0, 0.5, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0)   (7)
r_U = (1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0)   (8)
The inner product f_r of these vectors is f_r = 2.0. Therefore, with p = 1 and q = 1, the degree of coincidence between the subject graph 510 of the search key and the subject graph 520 of the search target document is
degree of coincidence = f_v * f_r = 2.2 * 2.0 = 4.4,
so the similarity between the search key Q and the search target document U in this example is calculated as 4.4.
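The arithmetic of this example can be checked with a few lines of code. The vectors are those of equations (5) to (8); treating p and q as exponents on the two inner products is an assumption made here, since with p = q = 1 it reduces to the plain product used in the example.

```python
def inner(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def coincidence(v1, v2, r1, r2, p=1, q=1):
    # Assumption: p and q act as exponents; p = q = 1 gives f_v * f_r.
    return inner(v1, v2) ** p * inner(r1, r2) ** q

# Importance vectors over (information, document, search, subject, efficiency).
v_q = [1.0, 1.0, 1.0, 0.0, 0.0]
v_u = [0.6, 0.6, 1.0, 0.6, 0.3]
# Relevance vectors over the ten word pairs, in the order listed in the text.
r_q = [0.0, 0.5, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
r_u = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0]

f_v = inner(v_q, v_u)   # node (importance) term
f_r = inner(r_q, r_u)   # link (relevance) term
score = coincidence(v_q, v_u, r_q, r_u)
```

The "information and document" link of the search key contributes nothing because document U has no such link; only the two links to "search" are shared.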

【００９９】ステップ５１１）検索制御部４６０は、
類似度判定部４５０から取得したそれぞれの文書の類似
度に基づき、文書集合を降順に並べ替え、上位ｎ件を検
索結果とし、これを検索インタフェース部４１０に渡
す。ここで、類似度判定部４５０で、部分グラフごとの
一致度の測定法を用いた場合は、上位ｎ件のサブ文書が
検索結果となる。Step 511) The search control unit 460
The document set is rearranged in descending order based on the similarity of each document acquired from the similarity determination unit 450, and the top n items are set as search results, which are passed to the search interface unit 410. Here, when the similarity determination unit 450 uses the method of measuring the degree of coincidence for each subgraph, the top n sub-documents are the search results.
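The candidate selection and ranking performed by the search control unit can be sketched as below. The function `search`, the sample index, and the documents "V" and "W" with their scores are hypothetical; only the score 4.4 for document U comes from the worked example.

```python
def search(query_words, index, score, n):
    """Collect documents containing any query word, then return the
    top-n (document ID, similarity) pairs in descending similarity."""
    candidate_ids = set()
    for word in query_words:
        candidate_ids |= index.get(word, set())
    ranked = sorted(candidate_ids, key=lambda d: (-score(d), d))
    return [(d, score(d)) for d in ranked[:n]]

# Hypothetical index and precomputed similarities for illustration.
index = {"information": {"U", "V"}, "search": {"U", "W"}, "document": {"U"}}
similarity = {"U": 4.4, "V": 0.6, "W": 1.0}
results = search(["information", "document", "search"], index, similarity.get, n=2)
```

Ties are broken by document ID here so that the ordering is deterministic; the text does not specify a tie-breaking rule.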

【０１００】ステップ５１２）検索インタフェース部
４１０は、検索結果をユーザに返す。今回の例では、
（文書Ｕ，４．４）が検索結果である。次に、本発明を用いた文書分類装置について説明する。
図１０は、本発明の一実施例の文書分類装置の構成を示
す。同図に示す文書分類装置は、主題グラフ作成部６１
０、類似度判定部６２０、分類部６３０、分類制御部６
４０、文書データベース６１１から構成される。Step 512) The search interface unit 410 returns the search result to the user. In this example, (document U, 4.4) is the search result. Next, a document classification device using the present invention will be described. FIG. 10 shows the configuration of a document classification device according to one embodiment of the present invention. The document classification device shown in the figure comprises a subject graph creation unit 610, a similarity determination unit 620, a classification unit 630, a classification control unit 640, and a document database 611.

【０１０１】同図において、主題グラフ作成部６１０
は、前述の文書検索装置の検索対象文書主題グラフ作成
部４４０と全く同じものである。主題グラフ作成部６１
０は、文書データベース６１１から文書ＩＤに対応した
文書を取得し、その文書の主題グラフを作成する。類似
度判定部６２０は、２つの文書の主題グラフが入力され
ると、これらの類似度を判定する。ここで、類似度の判
定には、前述の文書検索装置の類似度判定部４５０と同
様の類似度判定方法を用いるものとする。In the figure, the subject graph creation unit 610 is identical to the search target document subject graph creation unit 440 of the document search device described above. The subject graph creation unit 610 acquires the document corresponding to a document ID from the document database 611 and creates a subject graph of the document. When the subject graphs of two documents are input, the similarity determination unit 620 determines their similarity. The similarity is determined by the same method as in the similarity determination unit 450 of the document search device described above.

【０１０２】分類部６３０は、文書間類似度行列を基に
文書を分類する。ここで、文書間類似度行列とは、以下
の形式である。文書１文書２ … 文書ｎ文書１ｓ₁₁ ｓ₁₂ … ｓ_1n 文書２ｓ₂₁ ｓ₂₂ … ｓ_2n … … … … … 文書ｎｓ_n1 ｓ_n2 … ｓ_nn 但し、ｓ_ijは、文書ｉと文書ｊの類似度を表し、ｓ_ij＝
ｓ_jiであり、ｓ_iiは無限大である。The classification unit 630 classifies documents based on an inter-document similarity matrix. The inter-document similarity matrix has the following form:

             Document 1   Document 2   ...   Document n
Document 1   s_11         s_12         ...   s_1n
Document 2   s_21         s_22         ...   s_2n
...          ...          ...          ...   ...
Document n   s_n1         s_n2         ...   s_nn

Here, s_ij denotes the similarity between document i and document j, s_ij = s_ji, and s_ii is infinite.
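Building such a matrix from pairwise similarities can be sketched as follows. The `shared_words` function is only a stand-in for the real subject-graph similarity, and the three word sets are made-up data; the symmetric fill and the infinite diagonal follow the definition above.

```python
import math
from itertools import combinations

def similarity_matrix(items, similarity):
    """Symmetric matrix with s[i][j] = s[j][i] and an infinite diagonal."""
    n = len(items)
    s = [[math.inf] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        s[i][j] = s[j][i] = similarity(items[i], items[j])
    return s

def shared_words(a, b):
    # Stand-in similarity for illustration: number of shared words.
    return float(len(a & b))

docs = [{"information", "search"}, {"information", "subject"}, {"efficiency"}]
matrix = similarity_matrix(docs, shared_words)
```

Because the matrix is symmetric, only n(n-1)/2 similarity evaluations are needed for n documents.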

【０１０３】文書間類似度行列が与えられた時の分類の
方法は、例えば、類似度最大の文書同士を順次結合して
いくクラスタリングなどである。具体的な分類の方法
は、既存記述による。分類制御部６４０は、分類作業全
体の制御を行う。上記の構成の一連の動作を以下に説明
する。One way to classify documents given an inter-document similarity matrix is, for example, clustering in which the documents with the highest similarity are merged one after another. The specific classification method follows existing techniques. The classification control unit 640 controls the classification work as a whole. The sequence of operations of the above configuration is described below.
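A minimal sketch of clustering by repeated merging of the most similar pair is given below. The text leaves the concrete method to existing techniques, so the choices here are assumptions: cluster-to-cluster similarity is the maximum similarity between members (single linkage), and merging stops when the requested number of clusters is reached. The matrix values are made up.

```python
import math

def cluster(sim, k):
    """Merge the two most similar clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(sim))]

    def linkage(a, b):
        # Assumed single linkage: best similarity between any two members.
        return max(sim[i][j] for i in a for j in b)

    while len(clusters) > k:
        pairs = [(x, y) for x in range(len(clusters))
                 for y in range(x + 1, len(clusters))]
        x, y = max(pairs, key=lambda xy: linkage(clusters[xy[0]], clusters[xy[1]]))
        merged = clusters.pop(y)      # y > x, so index x stays valid
        clusters[x].extend(merged)
    return clusters

inf = math.inf
# Hypothetical similarity matrix for four documents.
sim = [
    [inf, 5.0, 1.0, 1.0],
    [5.0, inf, 1.0, 1.0],
    [1.0, 1.0, inf, 4.0],
    [1.0, 1.0, 4.0, inf],
]
groups = cluster(sim, 2)
```

The infinite diagonal is never consulted, because similarities are only compared between members of different clusters.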

【０１０４】ステップ６０１）ユーザは、文書データ
ベース６１１内の文書を何個の文書集合に分類するのか
（分類数）を指定する。ステップ６０２）分類制御部６４０は、文書データベ
ース６１１に含まれるすべての文書の文書ＩＤを主題グ
ラフ作成部６１０に渡す。ステップ６０３）主題グラフ作成部６１０は、それぞ
れの文書ＩＤに対応した文書を文書データベース６１１
から取得し、主題グラフを作成し、これを分類制御部６
４０に渡す。Step 601) The user specifies the number of document sets (the number of classes) into which the documents in the document database 611 are to be classified. Step 602) The classification control unit 640 passes the document IDs of all documents contained in the document database 611 to the subject graph creation unit 610. Step 603) The subject graph creation unit 610 acquires the document corresponding to each document ID from the document database 611, creates its subject graph, and passes it to the classification control unit 640.

【０１０５】ステップ６０４）分類制御部６４０は、
主題グラフ作成部６１０から取得した主題グラフのすべ
ての２つの組み合わせを類似度判定部６２０に渡す。ステップ６０５）類似度判定部６２０は、それぞれの
主題グラフ間の類似度を判定し、分類制御部６４０に渡
す。ステップ６０６）分類制御部６４０は、類似度判定部
６２０から取得したすべての２つの組み合わせの文書間
の類似度から、文書間類似度行列を作成し、ユーザが入
力した分類数と共に分類部６３０に渡す。Step 604) The classification control unit 640 passes every pair of subject graphs acquired from the subject graph creation unit 610 to the similarity determination unit 620. Step 605) The similarity determination unit 620 determines the similarity between the subject graphs of each pair and passes it to the classification control unit 640. Step 606) The classification control unit 640 creates the inter-document similarity matrix from the similarities of all document pairs acquired from the similarity determination unit 620, and passes it to the classification unit 630 together with the number of classes input by the user.

【０１０６】ステップ６０７）分類部６３０は、文書
間類似度行列と分類数を基に、文書集合の分類を行い、
分類結果を分類制御部６４０に渡す。ステップ６０８）分類制御部６４０は、分類結果をユ
ーザに返す。また、上記の実施例における文書検索装置
と文書分類装置の構成要素をプログラムとして構築し、
文書検索装置及び文書分類装置として利用されるコンピ
ュータに接続されるディスク装置や、フロッピーディス
ク、ＣＤ−ＲＯＭ等の可搬記憶媒体に格納しておき、本
発明を実施する際にインストールすることにより容易に
本発明を実現できる。Step 607) The classification unit 630 classifies the document set based on the inter-document similarity matrix and the number of classes, and passes the classification result to the classification control unit 640. Step 608) The classification control unit 640 returns the classification result to the user. The components of the document search device and the document classification device in the above embodiments can also be built as programs and stored on a disk device connected to the computer used as the document search device or document classification device, or on a portable storage medium such as a floppy disk or CD-ROM; the present invention can then be realized easily by installing the programs when it is to be carried out.

【０１０７】なお、本発明は、上記の実施例に限定され
ることなく、特許請求の範囲内で種々変更・応用が可能
である。The present invention is not limited to the above-described embodiments, but can be variously modified and applied within the scope of the claims.

【０１０８】[0108]

【発明の効果】上述のように、本発明によれば、文書要
素間の類似度を単語の重要度だけであく、単語間の関連
度を基に判定することができるので、より精度の高い類
似度の判定を行うことができる。また、検索キーと検索
対象文書の類似度を、検索キー及び検索対象文書内での
単語の重要度だけでなく、検索キー及び検索対象文書内
での単語間の関連度も用いて計算することができるの
で、検索キーが文や文書になっても、また、検索対象が
文書全文となった場合でも、より精度の高い情報検索を
表現できる。As described above, according to the present invention, the similarity between document elements can be determined based not only on the importance of words but also on the relevance between words, so similarity can be determined with higher accuracy. In addition, the similarity between a search key and a search target document can be calculated using not only the importance of the words within the search key and the search target document but also the relevance between the words within them. Therefore, even when the search key is a sentence or a document, and even when the search target is the full text of documents, more accurate information retrieval can be realized.

【０１０９】また、同様に文書間の類似度を文書内の単
語の重要度だけでなく、文書内の単語間の関連度も用い
て計算することができるので、より精度の高い文書分類
を実現できる。Similarly, the similarity between documents can be calculated using not only the importance of the words in the document but also the relevance between the words in the document, so that a more accurate document classification can be realized. it can.

[Brief description of the drawings]

【図１】本発明の原理を説明するための図である。FIG. 1 is a diagram for explaining the principle of the present invention.

【図２】本発明の文書検索装置の原理構成図である。FIG. 2 is a diagram illustrating the principle configuration of a document search device according to the present invention.

【図３】本発明の文書分類装置の原理構成図である。FIG. 3 is a diagram illustrating the principle configuration of a document classification device according to the present invention.

【図４】本発明の類似度判定方法を説明するためのフロ
ーチャートである。FIG. 4 is a flowchart illustrating a similarity determination method according to the present invention.

【図５】本発明の類似度判定処理の動作を説明するため
の図である。FIG. 5 is a diagram illustrating an operation of a similarity determination process according to the present invention.

【図６】本発明のグラフ分類処理を説明するための図で
ある。FIG. 6 is a diagram illustrating a graph classification process according to the present invention.

【図７】本発明のグラフ再結合処理を説明するための図
である。FIG. 7 is a diagram for explaining a graph recombining process according to the present invention.

【図８】本発明の一実施例の文書検索装置の構成図であ
る。FIG. 8 is a configuration diagram of a document search device according to an embodiment of the present invention.

【図９】本発明の一実施例の主題グラフの作成の例を示
す図である。FIG. 9 is a diagram showing an example of creating a subject graph according to one embodiment of the present invention.

【図１０】本発明の一実施例の文書分類装置の構成図で
ある。FIG. 10 is a configuration diagram of a document classification device according to an embodiment of the present invention.

[Explanation of symbols]

２１０分割前のグラフ２１１，２２２，２２３部分グラフ３１１，３１２，３１３部分グラフ３２０再結合したグラフ４１０検索インタフェース手段、検索インタフェース
部４２０検索キー主題グラフ作成手段、検索キー主題グ
ラフ作成部４３０単語情報管理手段、単語情報管理部４３１インデックスファイル４４０検索対象文書主題グラフ作成手段、検索対象文
書主題グラフ作成部４４１検索対象文書記憶手段、検索対象文書データベ
ース４５０類似度判定手段、類似度判定部４６０検索制御手段、検索制御部５１０検索キーＱの主題グラフ５２０検索対象文書Ｕの主題グラフ６１０主題グラフ作成手段、主題グラフ作成部６１１文書記憶手段、文書データベース６２０グラフ類似度判定手段、類似度判定部６３０分類手段、分類部６４０分類制御手段、分類部制御部
210 Graph before division
211, 222, 223 Subgraphs
311, 312, 313 Subgraphs
320 Recombined graph
410 Search interface means, search interface unit
420 Search key subject graph creating means, search key subject graph creation unit
430 Word information management means, word information management unit
431 Index file
440 Search target document subject graph creating means, search target document subject graph creation unit
441 Search target document storage means, search target document database
450 Similarity determination means, similarity determination unit
460 Search control means, search control unit
510 Subject graph of search key Q
520 Subject graph of search target document U
610 Subject graph creating means, subject graph creation unit
611 Document storage means, document database
620 Graph similarity determination means, similarity determination unit
630 Classification means, classification unit
640 Classification control means, classification control unit

Claims

[Claims]

1. A similarity determination method for appropriately determining the similarity between document elements, comprising: extracting, from a document element consisting of a word string, a Boolean-operator combination of word strings, a sentence, a document, or a document set, the words used in the document element; assigning an importance to each of the extracted words; assigning a relevance to any two of the extracted words; expressing the subject of each document element as a graph in which the importance of each word is the weight of a node and the relevance between words is the weight of a link; and determining the similarity between the document elements based on the degree of coincidence between the graphs expressing the subjects.

2. The similarity determination method according to claim 1, wherein the degree of coincidence between the graphs expressing the subject is calculated such that: the more similar nodes (nodes containing the same word) the two graphs have, the larger the degree of coincidence between the two graphs; when a node in one graph has a large weight, the larger the weight of the similar node in the other graph, the larger the degree of coincidence between the two graphs; the more similar links (links whose nodes at both ends contain the same words) the two graphs have, the larger the degree of coincidence between the two graphs; and when a link in one graph has a large weight, the larger the weight of the similar link in the other graph, the larger the degree of coincidence between the two graphs.

3. The similarity determination method according to claim 1, wherein, when calculating the degree of coincidence between the graphs expressing the subject: each graph is divided into subgraphs based on how strongly the sets of words used in the graph are related to one another; in each subgraph, if no link exists between some pair of nodes in the subgraph, a link with a small weight is generated in the subgraph; the subgraphs are recombined into the graph as it was before division, with the links generated in the subgraphs added as they are; and the degree of coincidence between the graphs expressing the subject is then calculated.

4. The similarity determination method according to claim 1, wherein, when calculating the degree of coincidence between the graphs expressing the subject: each graph is divided into subgraphs based on how strongly the sets of words used in the graph are related to one another; in each subgraph, if no link exists between some pair of nodes in the subgraph, a link with a small weight is generated in the subgraph; and the degree of coincidence is calculated for each subgraph.

5. When calculating the degree of coincidence between graphs for expressing the subject, each graph is associated with a word set used in the graph with a certain degree of strength. If there is no link between any nodes in the subgraph in each of the subgraphs, a link having a small weight is generated in the subgraph, and each subgraph is divided into subgraphs. The similarity according to claim 1, wherein a degree of coincidence is calculated for each graph, and a degree of coincidence between graphs expressing the subject is calculated by calculating a sum of degrees of coincidence calculated for each of the subgraphs. Judgment method.

6. A document search device for searching a document based on a search request from a user, comprising: a search interface unit for analyzing a search request from the user and extracting a search key; Search key subject graph creating means for generating a graph expressing the subject of the search key; word information management means for acquiring a set of document IDs of documents in which the specified word appears; and when the document ID is specified, A search target document subject graph creating unit for obtaining a document corresponding to the document ID from the search target document storage unit and creating a graph expressing the subject of the document; a graph expressing the subject of the search key and a search target document A similarity determination unit for inputting a graph expressing the subject of the search, and determining how similar they are; the search interface unit; A document search device, comprising: a title graph creation unit, the word information management unit, the search target document subject graph creation unit, and a search control unit that controls the similarity determination unit.

7. The search key subject graph creating means is used in the document element from a word string, a Boolean operator combination of the word strings, or a document element composed of a sentence, a document, or a document set. Word extracting means for extracting a word which is present, importance assigning means for assigning importance to each of the extracted words, and relevance assigning means for assigning a relevance between any two extracted words. The subject expression representing the subject of each document element by a graph in which the importance of the word is used as the weight of the node and the relevance between the words is used as the weight of the link. 7. The document search apparatus according to claim 6, further comprising: means for determining a similarity between the document elements based on a degree of coincidence between graphs expressing the subject.

8. The document search device according to claim 6, wherein the similarity determination means includes first calculating means for calculating the degree of coincidence between the graphs expressing the subject such that: the more similar nodes (nodes containing the same word) the two graphs have, the larger the degree of coincidence between the two graphs; when a node in one graph has a large weight, the larger the weight of the similar node in the other graph, the larger the degree of coincidence between the two graphs; the more similar links (links whose nodes at both ends contain the same words) the two graphs have, the larger the degree of coincidence between the two graphs; and when a link in one graph has a large weight, the larger the weight of the similar link in the other graph, the larger the degree of coincidence between the two graphs.

9. The document search device according to claim 6, wherein the similarity determination means includes second calculating means that: divides each graph into subgraphs based on how strongly the sets of words used in the graph are related to one another; in each subgraph, if no link exists between some pair of nodes in the subgraph, generates a link with a small weight in the subgraph; recombines the subgraphs into the graph as it was before division, with the links generated in the subgraphs added as they are; and calculates the degree of coincidence between the graphs expressing the subject.

10. The document search device according to claim 6, wherein the similarity determination means includes third calculating means that: divides each graph into subgraphs based on how strongly the sets of words used in the graph are related to one another; in each subgraph, if no link exists between some pair of nodes in the subgraph, generates a link with a small weight in the subgraph; and calculates the degree of coincidence for each subgraph.

11. The document search apparatus according to claim 6, wherein the similarity determination means includes fourth calculating means that: divides each graph into subgraphs according to how strongly the sets of words used in the graph are related to one another; when no link exists between a pair of nodes within a subgraph, generates a link having a small weight in that subgraph; calculates the degree of coincidence for each subgraph; and calculates the degree of coincidence between the graphs expressing the subjects as the sum of the degrees of coincidence calculated for the subgraphs.

12. A document classification device for classifying documents, comprising: subject graph creating means for acquiring a document corresponding to a document ID from document storage means in which documents are stored, and creating a graph expressing the subject of the document; graph similarity determination means for determining, when graphs expressing the subjects of two documents are input, the degree of coincidence between them; classification means for classifying the documents based on a matrix representing the similarity between the documents; and classification control means for controlling the overall classification work.
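The claim leaves the classification procedure open beyond its use of a document-to-document similarity matrix. As one possible reading, the sketch below groups documents by single-link clustering with a fixed threshold. The clustering method, the threshold, the function name `classify`, and the example matrix are assumptions made for illustration.

```python
def classify(doc_ids, similarity, threshold):
    """Group documents using a symmetric similarity matrix.

    ASSUMPTION: single-link grouping. A document joins (and merges) every
    existing cluster that contains a document whose similarity to it is at
    least `threshold`.
    similarity[i][j] is the degree of coincidence between the subject
    graphs of doc_ids[i] and doc_ids[j].
    """
    clusters = []  # each cluster is a set of row indices
    for i in range(len(doc_ids)):
        merged = {i}
        remaining = []
        for cluster in clusters:
            if any(similarity[i][j] >= threshold for j in cluster):
                merged |= cluster
            else:
                remaining.append(cluster)
        clusters = remaining + [merged]
    return [sorted(doc_ids[j] for j in cluster) for cluster in clusters]


doc_ids = ["d1", "d2", "d3", "d4"]
similarity = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.7],
    [0.0, 0.1, 0.7, 1.0],
]
classes = classify(doc_ids, similarity, threshold=0.5)
print(classes)
```

In the device described by the claim, each matrix entry would come from the graph similarity determination means, and the classification control means would drive the graph creation, matrix construction, and grouping steps in order.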

13. The document classification device according to claim 12, wherein the subject graph creating means comprises, for measuring the degree of coincidence between graphs expressing the subjects: word extracting means for extracting, from a document element composed of a word string, a Boolean operator combination of word strings, a sentence, a document, or a document set, the words used in the document element; importance assigning means for assigning an importance to each of the extracted words; relevance assigning means for assigning a relevance between any two of the extracted words; and subject expressing means for expressing the subject of each document element by a graph in which the importance of each word is used as the weight of a node and the relevance between words is used as the weight of a link, and wherein the graph similarity determination means includes means for determining the similarity between the document elements based on the degree of coincidence between the graphs expressing the subjects.

14. The document classification device according to claim 12, wherein the graph similarity determination means includes first calculating means for calculating the degree of coincidence between the graphs expressing the subjects such that: the greater the number of similar nodes (nodes containing the same word) in both graphs, the greater the degree of coincidence between the graphs; when a node in one graph has a large weight, the greater the weight of the similar node in the other graph, the greater the degree of coincidence between the graphs; the greater the number of similar links (links whose nodes at both ends contain the same words) in both graphs, the greater the degree of coincidence between the graphs; and when a link in one graph has a large weight, the greater the weight of the similar link in the other graph, the greater the degree of coincidence between the graphs.

15. The document classification device according to claim 12, wherein the graph similarity determination means includes second calculating means that: divides each graph into subgraphs according to how strongly the sets of words used in the graph are related to one another; when no link exists between a pair of nodes within a subgraph, generates a link having a small weight in that subgraph; recombines the subgraphs, with the links generated in the subgraphs added as they are, to return to the graph before the division; and calculates the degree of coincidence between the graphs expressing the subjects.

16. The document classification device according to claim 12, wherein the graph similarity determination means includes third calculating means that: divides each graph into subgraphs according to how strongly the sets of words used in the graph are related to one another; when no link exists between a pair of nodes within a subgraph, generates a link having a small weight in that subgraph; and calculates the degree of coincidence for each subgraph.

17. The document classification device according to claim 12, wherein the graph similarity determination means includes fourth calculating means that: divides each graph into subgraphs according to how strongly the sets of words used in the graph are related to one another; when no link exists between a pair of nodes within a subgraph, generates a link having a small weight in that subgraph; calculates the degree of coincidence for each subgraph; and calculates the degree of coincidence between the graphs expressing the subjects as the sum of the degrees of coincidence calculated for the subgraphs.

18. A storage medium storing a document search program for searching for documents based on a search request from a user, the program comprising: a search interface process for analyzing the search request from the user and extracting a search key; a search key subject graph creation process for creating, from the search key, a graph expressing the subject of the search key; a word information management process for acquiring the set of document IDs of the documents in which a specified word appears; a search target document subject graph creation process for, when a document ID is specified, acquiring the document corresponding to the document ID from search target document storage means storing the search target documents, and creating a graph expressing the subject of the document; a similarity determination process for receiving the graph expressing the subject of the search key and the graph expressing the subject of a search target document, and determining how similar they are; and a search control process for controlling the search interface process, the search key subject graph creation process, the word information management process, the search target document subject graph creation process, and the similarity determination process.
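The interplay of the word information management process and the search control process can be sketched as follows. For brevity the sketch treats the search key and each document as plain word lists and takes the similarity determination step as a caller-supplied function. The names `build_word_index`, `search`, and `shared_words` are hypothetical, and `shared_words` is only a placeholder for a graph-based similarity.

```python
from collections import defaultdict


def build_word_index(documents):
    """Word information management: word -> set of IDs of the documents
    in which the word appears."""
    index = defaultdict(set)
    for doc_id, words in documents.items():
        for word in words:
            index[word].add(doc_id)
    return index


def search(search_key, documents, similarity):
    """Search control: collect candidate documents via the word index,
    score each candidate against the search key, and rank the results."""
    index = build_word_index(documents)

    # Candidate documents are those containing at least one search key word.
    candidates = set()
    for word in search_key:
        candidates |= index.get(word, set())

    scored = [
        (similarity(search_key, documents[doc_id]), doc_id)
        for doc_id in candidates
    ]
    # Highest similarity first; ties are broken by document ID.
    ranked = sorted(scored, key=lambda item: (-item[0], item[1]))
    return [doc_id for _, doc_id in ranked]


def shared_words(key_words, doc_words):
    """PLACEHOLDER similarity: number of words common to both inputs."""
    return len(set(key_words) & set(doc_words))


documents = {
    "d1": ["graph", "search", "index"],
    "d2": ["graph", "music"],
    "d3": ["cooking", "recipe"],
}
results = search(["graph", "search"], documents, shared_words)
print(results)
```

Restricting the comparison to documents returned by the word index means that subject graphs need to be built only for candidates that share at least one word with the search key.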

19. The storage medium storing the document search program according to claim 18, wherein the search key subject graph creation process comprises: a word extraction process for extracting, from a document element composed of a word string, a Boolean operator combination of word strings, a sentence, a document, or a document set, the words used in the document element; an importance assignment process for assigning an importance to each of the extracted words; a relevance assignment process for assigning a relevance between any two of the extracted words; and a subject expression process for expressing the subject of each document element by a graph in which the importance of each word is used as the weight of a node and the relevance between words is used as the weight of a link, and wherein the similarity determination process determines the similarity between the document elements based on the degree of coincidence between the graphs expressing the subjects.

20. The storage medium storing the document search program according to claim 18, wherein the similarity determination process includes a first calculation process for calculating the degree of coincidence between the graphs expressing the subjects such that: the greater the number of similar nodes (nodes containing the same word) in both graphs, the greater the degree of coincidence between the graphs; when a node in one graph has a large weight, the greater the weight of the similar node in the other graph, the greater the degree of coincidence between the graphs; the greater the number of similar links (links whose nodes at both ends contain the same words) in both graphs, the greater the degree of coincidence between the graphs; and when a link in one graph has a large weight, the greater the weight of the similar link in the other graph, the greater the degree of coincidence between the graphs.

21. The storage medium storing the document search program according to claim 18, wherein the similarity determination process includes a second calculation process that: divides each graph into subgraphs according to how strongly the sets of words used in the graph are related to one another; when no link exists between a pair of nodes within a subgraph, generates a link having a small weight in that subgraph; recombines the subgraphs, with the links generated in the subgraphs added as they are, to return to the graph before the division; and calculates the degree of coincidence between the graphs expressing the subjects.

22. The storage medium storing the document search program according to claim 18, wherein the similarity determination process includes a third calculation process that: divides each graph into subgraphs according to how strongly the sets of words used in the graph are related to one another; when no link exists between a pair of nodes within a subgraph, generates a link having a small weight in that subgraph; and calculates the degree of coincidence for each subgraph.

23. The storage medium storing the document search program according to claim 18, wherein the similarity determination process includes a fourth calculation process that: divides each graph into subgraphs according to how strongly the sets of words used in the graph are related to one another; when no link exists between a pair of nodes within a subgraph, generates a link having a small weight in that subgraph; calculates the degree of coincidence for each subgraph; and calculates the degree of coincidence between the graphs expressing the subjects as the sum of the degrees of coincidence calculated for the subgraphs.

24. A storage medium storing a document classification program to be installed in a document classification device for classifying documents, the program comprising: a subject graph creation process for acquiring a document corresponding to a document ID from document storage means in which documents are stored, and creating a graph expressing the subject of the document; a graph similarity determination process for determining, when graphs expressing the subjects of two documents are input, the degree of coincidence between them; a classification process for classifying the documents based on a matrix representing the similarity between the documents; and a classification control process for controlling the overall classification work.

25. The storage medium storing the document classification program according to claim 24, wherein the subject graph creation process comprises, for measuring the degree of coincidence between graphs expressing the subjects: a word extraction process for extracting, from a document element composed of a word string, a Boolean operator combination of word strings, a sentence, a document, or a document set, the words used in the document element; an importance assignment process for assigning an importance to each of the extracted words; a relevance assignment process for assigning a relevance between any two of the extracted words; and a subject expression process for expressing the subject of each document element by a graph in which the importance of each word is used as the weight of a node and the relevance between words is used as the weight of a link, and wherein the graph similarity determination process includes a document element similarity determination process for determining the similarity between the document elements based on the degree of coincidence between the graphs expressing the subjects.

26. The storage medium storing the document classification program according to claim 24, wherein the graph similarity determination process includes a first calculation process for calculating the degree of coincidence between the graphs expressing the subjects such that: the greater the number of similar nodes (nodes containing the same word) in both graphs, the greater the degree of coincidence between the graphs; when a node in one graph has a large weight, the greater the weight of the similar node in the other graph, the greater the degree of coincidence between the graphs; the greater the number of similar links (links whose nodes at both ends contain the same words) in both graphs, the greater the degree of coincidence between the graphs; and when a link in one graph has a large weight, the greater the weight of the similar link in the other graph, the greater the degree of coincidence between the graphs.

27. The storage medium storing the document classification program according to claim 24, wherein the graph similarity determination process includes a second calculation process that: divides each graph into subgraphs according to how strongly the sets of words used in the graph are related to one another; when no link exists between a pair of nodes within a subgraph, generates a link having a small weight in that subgraph; recombines the subgraphs, with the links generated in the subgraphs added as they are, to return to the graph before the division; and calculates the degree of coincidence between the graphs expressing the subjects.

28. The storage medium storing the document classification program according to claim 24, wherein the graph similarity determination process includes a third calculation process that: divides each graph into subgraphs according to how strongly the sets of words used in the graph are related to one another; when no link exists between a pair of nodes within a subgraph, generates a link having a small weight in that subgraph; and calculates the degree of coincidence for each subgraph.

29. The storage medium storing the document classification program according to claim 24, wherein the graph similarity determination process includes a fourth calculation process that: divides each graph into subgraphs according to how strongly the sets of words used in the graph are related to one another; when no link exists between a pair of nodes within a subgraph, generates a link having a small weight in that subgraph; calculates the degree of coincidence for each subgraph; and calculates the degree of coincidence between the graphs expressing the subjects as the sum of the degrees of coincidence calculated for the subgraphs.