JP4146393B2

JP4146393B2 - Label display type document search apparatus, label display type document search method, computer program for executing label display type document search method, and computer readable recording medium storing the computer program

Info

Publication number: JP4146393B2
Application number: JP2004156296A
Authority: JP
Inventors: 浩之戸田; 良治片岡
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2004-05-26
Filing date: 2004-05-26
Publication date: 2008-09-10
Anticipated expiration: 2024-05-26
Also published as: JP2005339139A

Description

The present invention relates to a label display type document search apparatus that displays labels for documents before displaying the documents themselves.

Conventionally, label display type document search apparatuses, which first display document labels and then display the documents, have been used as search apparatuses for retrieving documents over a computer network. With such an apparatus the user need not evaluate the full content of every document and can therefore reach the desired information efficiently.

The following methods have been considered for achieving this.

・ Classification system method
Classification categories are created manually in advance, and documents considered appropriate for each category are supplied as training data. The features of each category are determined from the features of its training data. When search results are given, the similarity between each search result and each category is calculated, and each result is associated with the category it most resembles, so that the search results can be classified and presented to the user.

・ Document similarity method
A feature representation of each document is obtained in advance, for example as a vector. When search results are given, documents that are similar according to these feature representations are grouped into the same class, and the results are presented in a number of classes considered useful to the searcher.

・ Characteristic term method
Words, compound words, keywords, and the like that are characteristic of the search results are extracted from the documents and data in the search results and presented as narrowing candidates, which gives a presentation comparable to classifying the search results.
JP 2003-203136 A

However, the prior art described above has the following problems.

In the classification system method, a plurality of labels are created manually in advance, and documents that match the meaning of a label are associated with that label. The method presupposes that label creation and the preparation of the ground-truth data that gives each label its meaning are done by hand. Defining the initial labels and maintaining them as documents are updated therefore imposes a large cost on the administrator of the information retrieval system.

In the document similarity method, processing-time constraints force a trade-off between the quality and the speed of document classification. In addition, the techniques used, such as the K-Means method, require the number of classes to be fixed in advance. If this number does not match the actual number of topics, ill-defined classes are produced, labeling each class with its content becomes difficult, and the resulting labels may be so unclear that the content of a class cannot be grasped at a glance.

To address these problems, the characteristic term method has recently come into use in settings such as web document search. By using, for example, words that co-occur with the query in the documents, it lets the user extend the search expression efficiently and thereby narrow down the search results easily.

However, this method uses only some features of a document, such as keywords, so a single document is often classified from several viewpoints. When search results are classified with this method, the classification is therefore non-exclusive, and one document is associated with multiple classes. This has the advantage that document content can be judged from diverse viewpoints, but the number of classes may become excessive, different labels may point to the same group of documents, and the documents sometimes cannot be narrowed down effectively.

The present invention has been made in view of the above, and its object is to provide a label display type document search apparatus that can narrow down search results efficiently.

To solve the above problems, a first aspect of the present invention provides a label display type document search apparatus comprising:
document storage means storing a plurality of documents, each document including a body consisting of character strings, further including a title of the document and document identification information indicating the document, and containing in its body attribute values that are predetermined character strings;
statistical processing means that generates statistical information recording, for each attribute value contained in at least one of the plurality of documents stored in the document storage means, the number of occurrences of that attribute value in those documents, and stores the statistical information in storage means provided in advance;
document search means that retrieves a plurality of documents from the document storage means;
search result statistical information generating means that generates and stores search result statistical information recording, for each attribute value contained in at least one of the retrieved documents, the number of occurrences of that attribute value in the retrieved documents;
fitness calculating means that calculates, for each attribute value contained in at least one of the retrieved documents, the fitness of that attribute value, using a calculation formula based on the attribute value's occurrence counts in the statistical information and in the search result statistical information, the fitness indicating the degree to which the attribute value is suitable as a label, a label being a character string that represents a plurality of documents forming part of the retrieved documents;
label information generating means that selects some of the attribute values contained in at least one of the retrieved documents by taking attribute values in descending order of fitness for as long as the fitness satisfies a preset condition, treats each selected attribute value as a label, and generates label information including those labels;
cluster information generating means that generates, for each label included in the label information, cluster information that includes the label and includes the document identification information and title of every document that contains the character string of the label and is also one of the retrieved documents;
second cluster information generating means that, for each pair of labels included in the label information, takes as one set the set of documents indicated by the document identification information in the cluster information including one label and as the other set the set of documents indicated by the document identification information in the cluster information including the other label, calculates a first ratio, which is the ratio of the one set to the union of the two sets, and a second ratio, which is the ratio of the other set to that union, determines that the two labels are in an equivalence relation if the first ratio and the second ratio both exceed a preset threshold, determines, if only one of the ratios exceeds the preset threshold, that there is an inclusion relation in which the label corresponding to that ratio is included in the label corresponding to the other, and, on determining an equivalence relation or an inclusion relation, adds to the label information a label that contains the two labels and shows the equivalence relation or inclusion relation between them, and generates cluster information that includes that label and includes the pairs of document identification information and title contained in at least one of the two pieces of cluster information; and
document display control means that causes each label included in the label information to be displayed and, when one label has been selected, the pairs of document identification information and title in the cluster information including that label have been displayed, and one of those pairs has been selected, reads the document containing that pair from the document storage means and causes it to be displayed.
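The pairwise ratio test performed by the second cluster information generating means can be illustrated with a minimal Python sketch. The function name `label_relation`, the set-of-document-IDs representation, and the returned strings are assumptions made for illustration. The sketch only reports which ratio exceeded the threshold; which label is then treated as included in the other follows the wording of the text above.

```python
def label_relation(docs_a, docs_b, threshold):
    """Classify the relation between two labels from their cluster document sets.

    docs_a, docs_b: sets of document IDs from the two clusters (assumed representation).
    threshold: the preset threshold applied to both ratios.
    """
    union = docs_a | docs_b
    if not union:
        return "none"
    first_ratio = len(docs_a) / len(union)   # ratio of the one set to the union
    second_ratio = len(docs_b) / len(union)  # ratio of the other set to the union
    first_over = first_ratio > threshold
    second_over = second_ratio > threshold
    if first_over and second_over:
        return "equivalence"
    if first_over:
        return "inclusion:first"   # only the first ratio exceeds the threshold
    if second_over:
        return "inclusion:second"  # only the second ratio exceeds the threshold
    return "none"
```

With a threshold of 0.7, clusters of three and four documents that share three documents give ratios of 0.75 and 1.0, so the pair is reported as an equivalence.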

A second aspect of the present invention provides a label display type document search method performed by a label display type document search apparatus, wherein the label display type document search apparatus comprises document storage means, statistical processing means, document search means, search result statistical information generating means, fitness calculating means, label information generating means, cluster information generating means, second cluster information generating means, and document display control means, and the document storage means stores a plurality of documents, each document including a body consisting of character strings, further including a title of the document and document identification information indicating the document, and containing in its body attribute values that are predetermined character strings. The method comprises:
a statistical processing step in which the statistical processing means generates statistical information recording, for each attribute value contained in at least one of the plurality of documents stored in the document storage means, the number of occurrences of that attribute value in those documents, and stores the statistical information in storage means provided in advance;
a document search step in which the document search means retrieves a plurality of documents from the document storage means;
a search result statistical information generating step in which the search result statistical information generating means generates and stores search result statistical information recording, for each attribute value contained in at least one of the retrieved documents, the number of occurrences of that attribute value in the retrieved documents;
a fitness calculating step in which the fitness calculating means calculates, for each attribute value contained in at least one of the retrieved documents, the fitness of that attribute value, using a calculation formula based on the attribute value's occurrence counts in the statistical information and in the search result statistical information, the fitness indicating the degree to which the attribute value is suitable as a label, a label being a character string that represents a plurality of documents forming part of the retrieved documents;
a label information generating step in which the label information generating means selects some of the attribute values contained in at least one of the retrieved documents by taking attribute values in descending order of fitness for as long as the fitness satisfies a preset condition, treats each selected attribute value as a label, and generates label information including those labels;
a cluster information generating step in which the cluster information generating means generates, for each label included in the label information, cluster information that includes the label and includes the document identification information and title of every document that contains the character string of the label and is also one of the retrieved documents;
a second cluster information generating step in which the second cluster information generating means, for each pair of labels included in the label information, takes as one set the set of documents indicated by the document identification information in the cluster information including one label and as the other set the set of documents indicated by the document identification information in the cluster information including the other label, calculates a first ratio, which is the ratio of the one set to the union of the two sets, and a second ratio, which is the ratio of the other set to that union, determines that the two labels are in an equivalence relation if the first ratio and the second ratio both exceed a preset threshold, determines, if only one of the ratios exceeds the preset threshold, that there is an inclusion relation in which the label corresponding to that ratio is included in the label corresponding to the other, and, on determining an equivalence relation or an inclusion relation, adds to the label information a label that contains the two labels and shows the equivalence relation or inclusion relation between them, and generates cluster information that includes that label and includes the pairs of document identification information and title contained in at least one of the two pieces of cluster information; and
a document display control step in which the document display control means causes each label included in the label information to be displayed and, when one label has been selected, the pairs of document identification information and title in the cluster information including that label have been displayed, and one of those pairs has been selected, reads the document containing that pair from the document storage means and causes it to be displayed.
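The label information generating step, which takes attribute values in descending order of fitness for as long as a preset condition holds, might be sketched as follows. The fitness formula is not given in this passage, so the sketch takes already computed fitness values as input. The function name `select_labels` and the concrete form of the preset condition (a minimum fitness together with a maximum number of labels) are assumptions for illustration only.

```python
def select_labels(fitness_by_value, min_fitness, max_labels):
    """Pick attribute values as labels in descending order of fitness.

    fitness_by_value: dict mapping attribute value -> fitness (computed elsewhere).
    min_fitness, max_labels: assumed form of the "preset condition".
    """
    ranked = sorted(fitness_by_value.items(), key=lambda item: item[1], reverse=True)
    labels = []
    for value, fitness in ranked:
        # Stop as soon as the assumed condition no longer holds.
        if fitness < min_fitness or len(labels) >= max_labels:
            break
        labels.append(value)
    return labels
```

Because the candidates are sorted first, the loop can stop at the first value that fails the condition; every later value has an equal or lower fitness.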

According to the present invention, documents are retrieved from document storage means that stores documents; attribute values contained in the retrieved documents are selected as labels for the documents; label information indicating the selected labels is generated and stored; labels in the label information that are in an equivalence relation are integrated into a label that is added to the label information, and/or labels that are in an inclusion relation are integrated into a label that is added to the label information; the label information including the integrated labels is read out and the labels are displayed from it; and, when an integrated label is designated, documents that contain at least one of the labels before integration and are also among the retrieved documents are read from the document storage means and displayed. Labels therefore need not be prepared in advance, and the integrated labels make selection easier for the user. As a result, search results can be narrowed down efficiently.

Embodiments of the present invention are described below with reference to the drawings.

[First embodiment]
FIG. 1 is a block diagram showing the apparatus configuration of the first embodiment. The description below uses news articles as the example documents.

The search device 1 is a server computer that searches for documents, and corresponds to the label display type document search apparatus that executes the label display type document search method of the present invention. The search device 1 can communicate with a browser 2 on a client computer connected via a network (not shown).

The browser 2 includes a keyword input unit 21, to which a keyword is input via an input device such as a keyboard or mouse, and a document display control unit 22, which causes documents retrieved with the keyword to be displayed on a display device (not shown) such as a CRT (Cathode Ray Tube) or LCD (Liquid Crystal Display).

Before displaying the retrieved documents, the browser 2 displays the attribute values contained in the documents as labels, grouped by the attribute name of each attribute value. When one of the labels is designated, for example by clicking, the browser displays the titles and other details of the documents containing that label (attribute value); when one of the titles is designated, it displays the document with that title.

The search device 1 includes a communication unit 101, which receives the keyword from the browser 2 and transmits the retrieved documents to the browser 2, and a request processing unit 102, which controls operations such as the document search using the keyword supplied from the communication unit 101.

The search device 1 also includes a setting file 103 in which information used in document search is written. The setting file 103 holds the attribute names used in the search, such as "genre" and "organization"; the number of labels to be selected; the parameters α, β, and γ used to calculate the fitness of an attribute name for display (referred to as the attribute name fitness); and various thresholds.

The search device 1 also includes a document generation unit 104, a document generation unit 105, and a document database 106 (hereinafter, database is abbreviated as DB). Both generation units take as input an untagged document, that is, a document that contains attribute values classified under one of the attribute names written in the setting file 103 but in which those attribute values have not been given an attribute name (tag). The document generation unit 104 generates a document from tags attached manually, for example by a news article administrator. The document generation unit 105 generates a tagged document (also simply called a document) by tagging the attribute values automatically. The document DB 106 stores the documents generated by the document generation units 104 and 105.

The search device 1 also includes a normalization unit 107 that normalizes the attribute values contained in the documents stored in the document DB 106.

The search device 1 also includes an index generation unit 108 and a document search unit 109. The index generation unit 108 generates an index 1081 that associates each word (which may be an attribute value) contained in the documents stored in the document DB 106 with document identification information (hereinafter, identification information is referred to as an ID) indicating the documents that contain the word. The document search unit 109 searches the document DB 106 for documents on the basis of the keyword and the index 1081.

The search device 1 also includes a first statistical information DB 110, which stores the first statistical information generated for each attribute name in the setting file 103; a second statistical information DB 111, which stores the second statistical information generated for each piece of first statistical information; and a statistical processing unit 112, which generates the first and second statistical information.

The search device 1 also includes a label candidate selection unit 113, which selects a plurality of attribute values as label candidates for each attribute name in the setting file 103; a label fitness calculation unit 114, which calculates the fitness of each label candidate as a label for documents (referred to as the label fitness); and a label determination unit 115, which determines the labels on the basis of the calculated label fitness.

The search device 1 also includes a cluster information generation unit 116, which generates cluster information for each determined label, and a label integration processing unit 117, which performs processing for integrating two or more of the determined labels. In this embodiment, a cluster means the one or more retrieved documents that contain a given label.
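As a rough illustration of the cluster definition above, the following Python sketch builds, for each label, the list of document ID and title pairs of the retrieved documents that contain the label. The function name `build_cluster_info`, the dictionary keys `id`, `title`, and `body`, and the use of a plain substring test are assumptions, not details taken from the embodiment.

```python
def build_cluster_info(labels, retrieved_docs):
    """Map each label to the (document ID, title) pairs of retrieved documents containing it.

    retrieved_docs: list of dicts with hypothetical keys "id", "title", and "body".
    """
    clusters = {}
    for label in labels:
        clusters[label] = [
            (doc["id"], doc["title"])
            for doc in retrieved_docs
            if label in doc["body"]  # assumed: the label appears as a substring of the body
        ]
    return clusters
```

A document can appear under several labels, which is why the later integration step compares the document sets of different clusters.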

[Pre-search processing]
Next, the processing that the search device 1 performs before a search is described.

FIG. 2 is a flowchart showing the processing that the search device 1 performs before a search.

The document generation unit 104 receives an untagged document such as the one shown in FIG. 3, together with designations such as the following: that "International Atomic Energy Agency decides ***" is the title, that the attribute value "International Atomic Energy Agency" is classified under the attribute name "organization", and that the attribute value "economy" is classified under the attribute name "genre". As shown in FIG. 4, it then generates a document by attaching the content of these designations and a document ID, for example "001", to the untagged document, and stores the document in the document DB 106 (S101).

The document generation unit 105, on the other hand, receives an untagged document together with designations such as the title, generates a document by automatically tagging the attribute values, assigns a document ID, and stores the document in the document DB 106 (S101).

Through this processing, a large number of documents are stored in the document DB 106.

Next, the normalization unit 107 normalizes the attribute values contained in the documents stored in the document DB 106 (S103). Normalization means, for example, converting the abbreviated attribute value "IAEA" into the attribute value "International Atomic Energy Agency" (国際原子力機関), written out in full in Japanese.

In other words, the normalization unit 107 detects attribute values in the documents that are synonyms, having the same meaning but different surface forms, and rewrites them to a single form.

There are several ways to detect synonyms; the method using co-occurrence patterns shown in FIG. 5 can be adopted.

Through this processing, the attribute values of the documents in the document DB 106 are normalized.
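The rewriting half of normalization can be sketched in Python as a table lookup. Synonym detection itself (the co-occurrence pattern method of FIG. 5) is not sketched here; the code assumes a table of detected synonyms is already available. The function name `normalize_attribute_values` and the table `canonical_forms` are hypothetical.

```python
def normalize_attribute_values(values, canonical_forms):
    """Rewrite each attribute value to its canonical form when one is known.

    canonical_forms: dict mapping a variant spelling to the single form to use
    (assumed to come from a separate synonym-detection step).
    """
    return [canonical_forms.get(value, value) for value in values]


# Example table built from the abbreviation mentioned in the text.
canonical_forms = {"IAEA": "International Atomic Energy Agency"}
normalized = normalize_attribute_values(["IAEA", "economy"], canonical_forms)
```

Values with no entry in the table pass through unchanged, so the step is safe to run over every document in the DB.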

Next, the index generation unit 108 generates the index 1081, which associates each word contained in the documents stored in the document DB 106 with the document IDs of the documents containing that word (S105).

As shown in FIG. 6, the index 1081 associates, for example, the word "nuclear power" with the document IDs of the documents containing it, such as "001".
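An index of this kind is commonly realized as an inverted index. The following Python sketch shows one possible shape, assuming documents are given as a mapping from document ID to body text and that words are obtained by whitespace splitting. The embodiment's own word segmentation is not described in this passage, and the names `build_index` and `search` are hypothetical.

```python
from collections import defaultdict


def build_index(documents, tokenize=str.split):
    """Build a word -> set of document IDs mapping.

    documents: dict mapping document ID -> body text (assumed representation).
    tokenize: word splitter; whitespace splitting is an assumption.
    """
    index = defaultdict(set)
    for doc_id, body in documents.items():
        for word in tokenize(body):
            index[word].add(doc_id)
    return index


def search(index, keyword):
    """Return the sorted IDs of the documents that contain the keyword."""
    return sorted(index.get(keyword, set()))
```

Looking a keyword up in the index avoids scanning every document body at query time, which is the usual reason for building the index before any search is made.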

Next, on the basis of the document DB 106, the statistical processing unit 112 generates first statistical information for each attribute name in the setting file 103 and stores it in the first statistical information DB 110 (S107).

As shown in FIGS. 7(a) and 7(b), each piece of first statistical information is assigned one attribute name.

Each piece of first statistical information consists of one or more entries, each associating a document ID with the attribute values that are contained in the document with that ID and classified under the attribute name.

FIG. 7(a) shows, for example, that the document with document ID "001" contains the attribute value "economy", among others, classified under the attribute name "genre". FIG. 7(b) shows that the document with document ID "001" contains the attribute value "International Atomic Energy Agency", among others, classified under the attribute name "organization".
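The structure of the first statistical information can be sketched in Python as one table per attribute name, each mapping a document ID to the attribute values found in that document. The input representation (a mapping from document ID to tagged attribute values) and the function name `build_first_statistics` are assumptions for illustration.

```python
def build_first_statistics(documents, attribute_names):
    """Build, for each attribute name, a table of document ID -> attribute values.

    documents: dict mapping document ID -> {attribute name: [attribute values]}
    (an assumed in-memory form of the tagged documents).
    """
    stats = {name: {} for name in attribute_names}
    for doc_id, attributes in documents.items():
        for name in attribute_names:
            values = attributes.get(name, [])
            if values:  # only record documents that contain a value for this name
                stats[name][doc_id] = list(values)
    return stats


documents = {
    "001": {"genre": ["economy"], "organization": ["International Atomic Energy Agency"]},
    "002": {"genre": ["economy"]},
}
first_stats = build_first_statistics(documents, ["genre", "organization"])
```

Here `first_stats["genre"]` plays the role of the table in FIG. 7(a) and `first_stats["organization"]` the role of the table in FIG. 7(b).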

Next, the statistical processing unit 112 generates second statistical information for each piece of first statistical information and stores it in the second statistical information DB 111 (S109).

As shown in FIGS. 8(a) and 8(b), each piece of second statistical information is assigned the attribute name of the corresponding piece of first statistical information.

Each piece of second statistical information contains one or more entries, each associating an attribute value classified under the attribute name with the number of times that attribute value appears in the first statistical information DB 110.

For example, FIG. 8(a) shows that the attribute value "economy" classified under the attribute name "genre" appears 100 times. FIG. 8(b) shows that the attribute value "International Atomic Energy Agency" classified under the attribute name "organization" appears 70 times.

The second statistical information may be generated by detecting each pairing of an attribute value with a document ID in the first statistical information and incrementing the number of appearances at each detection.
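The counting described here can be sketched as follows. The representation of first statistical information as a list of (document ID, attribute value) pairs and the sample values are assumptions made for illustration only.

```python
from collections import Counter

# First statistical information for the attribute name "genre":
# (document ID, attribute value) pairs. The values are illustrative.
first_stats_genre = [
    ("001", "economy"),
    ("001", "international"),
    ("002", "economy"),
    ("003", "sports"),
]

def build_second_stats(first_stats):
    """Count up one appearance for each detected (document ID, value) pair."""
    counts = Counter()
    for _doc_id, value in first_stats:
        counts[value] += 1
    return dict(counts)

second_stats_genre = build_second_stats(first_stats_genre)
print(second_stats_genre)  # {'economy': 2, 'international': 1, 'sports': 1}
```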

Alternatively, the second statistical information may consist of the attribute values themselves or the document IDs of the documents in which each attribute value appears. It may also consist of the co-occurrence frequencies between attribute values. In that case, attribute values appearing in the same document can be defined as co-occurring. When documents are generated automatically from untagged documents, attribute values contained in the same sentence or paragraph of the untagged document can be defined as co-occurring. Furthermore, instead of expressing the co-occurrence relationship as a binary value (0 or 1), a graded degree of co-occurrence may be used that gives larger values to attribute values appearing closer together in the document.

Once the processing up to S109 is complete, document search becomes possible. Whenever a document in the document DB 106 is updated, added, or deleted, the attribute values are normalized and the index 1081, the first statistics DB 110, the second statistics DB 111, and so on are updated.

[Search process]
Next, the search process performed by the search device 1 will be described.

For example, when the user enters the keyword "nuclear power", the keyword input unit 21 transmits this keyword to the communication unit 101 of the search device 1. Multiple keywords can be used to perform an AND search, an OR search, or a search with more complex conditions, but a single keyword is assumed here for convenience of explanation.

FIG. 9 is a flowchart of the processing performed by the search device 1 after it receives the keyword.

First, the communication unit 101 passes the received keyword "nuclear power" to the request processing unit 102, which passes it to the document search unit 109. The document search unit 109 looks up the document IDs associated with the keyword "nuclear power" in the index 1081 and returns them to the request processing unit 102 (S201: document search).

The request processing unit 102 passes those document IDs to the label candidate selection unit 113 (S203).

Based on the first statistical information DB 110 and the retrieved document IDs, the label candidate selection unit 113 generates first search result statistical information for each attribute name in the setting file 103 and temporarily stores it (S205).

As shown in FIG. 10, each piece of first search result statistical information is assigned one attribute name.

Each piece of first search result statistical information associates each attribute value contained in one piece of first statistical information with the document IDs that belong to documents containing that attribute value and that are also among the retrieved document IDs.

Next, based on the first search result statistical information, the label candidate selection unit 113 generates second search result statistical information for each attribute name and temporarily stores it (S207).

As shown in FIG. 11, each piece of second search result statistical information is assigned one attribute name.

Each piece of second search result statistical information associates each attribute value in one piece of first search result statistical information with the number of document IDs associated with that attribute value, recorded as the number of appearances.

Next, the label candidate selection unit 113 generates third statistical information for each piece of second statistical information, based on that second statistical information and the second search result statistical information assigned the same attribute name (S209).

As shown in FIG. 12, each piece of third statistical information consists of the one or more rows of one piece of second statistical information whose attribute values also appear in the corresponding second search result statistical information.
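Steps S205 to S209 can be sketched for a single attribute name as below. The data layout, function names, and sample values are hypothetical, and the sketch assumes that attribute values with no retrieved document are simply left out of the search result statistics.

```python
def first_search_result_stats(first_stats, hit_ids):
    """S205: attribute value -> IDs of retrieved documents that contain it."""
    hits = set(hit_ids)
    result = {}
    for doc_id, value in first_stats:
        if doc_id in hits:
            result.setdefault(value, []).append(doc_id)
    return result

def second_search_result_stats(first_result):
    """S207: attribute value -> number of retrieved documents that contain it."""
    return {value: len(ids) for value, ids in first_result.items()}

def third_stats(second_stats, second_result):
    """S209: rows of the collection-wide counts whose value also occurs in the hits."""
    return {v: c for v, c in second_stats.items() if v in second_result}

# Illustrative data for the attribute name "genre".
first_stats = [("001", "economy"), ("002", "economy"),
               ("003", "sports"), ("004", "economy")]
second_stats = {"economy": 3, "sports": 1}   # counts over the whole collection
hit_ids = ["001", "002"]                     # documents found in S201

first_result = first_search_result_stats(first_stats, hit_ids)
second_result = second_search_result_stats(first_result)
third = third_stats(second_stats, second_result)
print(first_result)   # {'economy': ['001', '002']}
print(second_result)  # {'economy': 2}
print(third)          # {'economy': 3}
```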

Next, based on the second search result statistical information, the third statistical information, and the retrieved document IDs, the label suitability calculation unit 114 generates label suitability information for each piece of second search result statistical information and temporarily stores it (S211).

As shown in FIG. 13, each piece of label suitability information is assigned one attribute name.

Each piece of label suitability information associates each attribute value contained in one piece of second search result statistical information with a label suitability.

The label suitability is calculated, for example, as follows.

Let h be the number of appearances corresponding to one attribute value in the second search result statistical information, d the number of appearances corresponding to that attribute value in the third statistical information, and |H| the number of retrieved document IDs. The label suitability is then calculated by equation (1).

In equation (1), h/d indicates how completely the retrieved documents cover the attribute value, and |H|/h indicates the rarity of the attribute value among the retrieved documents.

Alternatively, in the first term of equation (1), h may be replaced by h/|H| and d by d/|D| (where |D| is the number of documents containing the attribute value).
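Equation (1) itself is not reproduced in this text, so the following is only a sketch under an assumed form. It assumes the label suitability is the product of the coverage h/d and the logarithm of the rarity |H|/h; the logarithm and the function name `label_suitability` are assumptions, not taken from the source.

```python
import math

def label_suitability(h, d, total_hits):
    """Assumed stand-in for equation (1): coverage times log of rarity.

    h          -- appearances of the value among the retrieved documents
    d          -- appearances of the value in the whole collection
    total_hits -- number of retrieved document IDs, |H|
    """
    coverage = h / d                    # h/d: coverage within the hits
    rarity = math.log(total_hits / h)   # |H|/h: rarity within the hits (log is assumed)
    return coverage * rarity

score = label_suitability(h=2, d=4, total_hits=8)
print(round(score, 4))  # 0.6931
```

Under this assumed form, a value that occurs in every retrieved document scores zero, which matches the intuition that such a label does not help narrow the results.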

Next, the label determination unit 115 reduces the number of pairs of attribute value and label suitability in the label suitability information and temporarily stores the result as label information (S213). Label information is generated and stored for each piece of label suitability information. Because the attribute values in the label information serve as document labels, they are referred to below as labels.

As shown in FIG. 14, the label information associates each label with its label suitability. Because labels are selected from the label suitability information in descending order of label suitability, the label information contains fewer pairs of label and label suitability than the label suitability information contains pairs of attribute value and label suitability.

FIG. 15 is a flowchart of the label selection performed by the label determination unit 115.

The label determination unit 115 selects the number of labels written in the setting file 103, in descending order of label suitability (S301). It then determines whether to additionally select the runner-up label, that is, the label with the next highest label suitability (S303).

Specifically, let C(n) be the lowest label suitability among the selected labels, C(n+1) the label suitability one rank above it, and C(n-1) the label suitability of the runner-up. When equation (2) holds, the runner-up label is additionally selected (S305), and the process returns to S303.

Here, e is a threshold written in the setting file 103 or the like.

In other words, the determination evaluates the slope of the suitability values and treats the point where the slope exceeds a certain threshold as the boundary.

This prevents a label from being left out of the selection even though its label suitability is close to that of a selected label. Put differently, a label is excluded only when there is a clear gap in label suitability.

In S301, labels may instead be selected by comparing their label suitability with a threshold written in the setting file 103 or the like.
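The selection loop of S301 to S305 can be sketched as below. Equation (2) is not reproduced in this text, so the condition used here is an assumption: the runner-up is added when the drop from C(n) to C(n-1) is at most e times the drop from C(n+1) to C(n). The function name and sample scores are hypothetical.

```python
def select_labels(scored, k, e):
    """Pick the top-k labels, then extend while the score curve stays flat.

    scored -- list of (label, label suitability) pairs
    k      -- number of labels written in the setting file
    e      -- slope threshold
    """
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    selected, rest = ranked[:k], ranked[k:]          # S301
    while rest and len(selected) >= 2:               # S303
        c_above = selected[-2][1]   # C(n+1): one rank above the lowest selected
        c_low = selected[-1][1]     # C(n): lowest selected
        c_next = rest[0][1]         # C(n-1): runner-up
        # Assumed stand-in for equation (2).
        if (c_low - c_next) <= e * (c_above - c_low):
            selected.append(rest.pop(0))             # S305
        else:
            break
    return [label for label, _score in selected]

scored = [("economy", 0.90), ("politics", 0.80),
          ("sports", 0.78), ("weather", 0.30)]
print(select_labels(scored, k=2, e=0.5))  # ['economy', 'politics', 'sports']
```

Here "sports" is added because its score is close to that of "politics", while "weather" is excluded because of the large gap that follows.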

Next, the label determination unit 115 generates attribute name suitability information based on the label information and temporarily stores it (S215).

As shown in FIG. 16, the attribute name suitability information indicates the attribute name suitability for each attribute name.

For example, the attribute name suitability for the attribute name "genre" is calculated as follows.

First, the number dl of documents containing any of the labels in the "genre" label information is obtained from the first search result statistical information for "genre". A document containing multiple labels is counted as one.

Then the coverage S1 is obtained by equation (3).

Here, dr is the number of retrieved document IDs.

The larger S1 is, the more completely the search results are covered by the labels.
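The coverage can be sketched as follows, assuming equation (3), which is not reproduced in this text, is the ratio dl/dr. The function name and sample data are hypothetical.

```python
def coverage_s1(first_result, labels, hit_ids):
    """Assumed form of equation (3): S1 = dl / dr."""
    covered = set()
    for label in labels:
        covered.update(first_result.get(label, []))
    dl = len(covered)        # a document with several labels is counted once
    dr = len(set(hit_ids))   # number of retrieved document IDs
    return dl / dr

# First search result statistical information for "genre" (illustrative).
first_result = {
    "economy": ["001", "002"],
    "sports": ["002", "003"],
    "weather": ["004"],
}
s1 = coverage_s1(first_result, ["economy", "sports"],
                 ["001", "002", "003", "004"])
print(s1)  # 0.75
```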

Next, S2, which expresses how little the labels overlap, that is, the clarity of the classification, is obtained by equation (4).

Here, dr is the number of retrieved document IDs, and dl_i is the number of documents containing the i-th label l_i in the "genre" label information, obtained from the second search result statistical information for "genre".

The larger S2 is, the more clearly the search results are classified by the labels.

Next, the uniformity S3 of the classification is obtained by equation (5). Here, S3 is obtained by calculating the average entropy of the clusters described later.

Here, dr is the number of retrieved document IDs, and dl_i is the number of documents containing the i-th label l_i in the "genre" label information. dl_i can be obtained from the second search result statistical information.

The larger S3 is, the more uniformly the search results are classified by the labels.
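Equation (5) is not reproduced in this text. As a rough illustration of how an entropy can express uniformity, the sketch below computes the Shannon entropy of the label sizes, normalizing dl_i by their sum; both the normalization and the function name are assumptions, and the source's equation may differ (for example, it may normalize by dr).

```python
import math

def uniformity_entropy(label_doc_counts):
    """Shannon entropy (in bits) of the distribution of documents over labels."""
    total = sum(label_doc_counts)
    entropy = 0.0
    for count in label_doc_counts:
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy

print(uniformity_entropy([5, 5]))   # 1.0
print(uniformity_entropy([9, 1]) < uniformity_entropy([5, 5]))  # True
```

Evenly sized labels give the highest value, which is consistent with the statement that a larger S3 means a more uniform classification.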

Next, the attribute name suitability S is obtained by equation (6).

Here, α, β, and γ are parameters written in the setting file 103.

Next, the request processing unit 102 reads the second search result statistical information, the label information, and the attribute name suitability information, and passes the label information to the cluster information generation unit 116.

The cluster information generation unit 116 generates cluster information for each label contained in the label information and temporarily stores it (S217).

As shown in FIG. 17, the cluster information associates each label contained in the label information with the document IDs that belong to documents containing that label and that are also among the retrieved document IDs, together with the titles of those documents.

Next, the label integration processing unit 117 performs processing to further integrate the determined labels (S218).

FIG. 18 is a flowchart showing the details of S218. The attribute name "genre" is used for the description here, but other attribute names are processed in the same way.

First, the label integration processing unit 117 reads the stored label information, second search result statistical information, and cluster information (S311). Pairs of labels from this label information are processed one after another, so the unit next selects an unprocessed pair of labels (S313). It then determines whether the two labels in the selected pair are in an equivalence relationship or an inclusion relationship (S315).

In S315, the ratios R_{A/B} and R_{B/A} are first obtained by equations (7) and (8).

As shown in FIG. 19, A is the set of documents containing one label, B is the set of documents containing the other label, and A∩B is the set of documents containing both labels. These sets are determined from the cluster information.

If both R_{A/B} and R_{B/A} exceed a preset threshold (set to, for example, 0.8 or 0.9), the two labels are determined to be in an equivalence relationship.

If the labels are not determined to be in an equivalence relationship and only one of R_{A/B} and R_{B/A} exceeds the preset threshold (again set to, for example, 0.8 or 0.9), they are determined to be in an inclusion relationship in which the label whose ratio exceeds the threshold is included in the label whose ratio does not.
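The determination in S315 can be sketched as below. Equations (7) and (8) are not reproduced in this text, so the sketch assumes R_{A/B} = |A∩B|/|A| and R_{B/A} = |A∩B|/|B|, which is consistent with the rule that the label whose ratio exceeds the threshold is the included one. The function name and return strings are hypothetical.

```python
def classify_pair(docs_a, docs_b, threshold=0.8):
    """Classify two labels by the overlap of their document sets."""
    a, b = set(docs_a), set(docs_b)
    both = len(a & b)
    r_ab = both / len(a)   # assumed R_{A/B}: share of A that is also in B
    r_ba = both / len(b)   # assumed R_{B/A}: share of B that is also in A
    if r_ab > threshold and r_ba > threshold:
        return "equivalent"
    if r_ab > threshold:
        return "A in B"    # A exceeds, so A is included in B
    if r_ba > threshold:
        return "B in A"
    return "unrelated"

ten = [f"{i:03d}" for i in range(1, 11)]             # "001" .. "010"
print(classify_pair(ten, ten[:9]))                   # equivalent
print(classify_pair(["001", "002"], ten[:4]))        # A in B
print(classify_pair(["001"], ["002"]))               # unrelated
```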

Next, in the label information, two labels determined to be in an equivalence relationship are integrated into a single label joined by "/", for example "one label/other label", and the integrated label is added to the label information. Likewise, the attribute values in the second search result statistical information that are identical to those two labels are integrated into a single attribute value, which is added to the second search result statistical information (S317).

Next, in the label information, two labels determined to be in an inclusion relationship are integrated into a single label that expresses the relationship with parentheses, for example "including label (included label)", and the integrated label is added to the label information. Likewise, the attribute values in the second search result statistical information that are identical to those two labels are integrated into a single attribute value, which is added to the second search result statistical information (S319).

Next, the two pieces of cluster information corresponding to the two labels (the labels before integration) determined to be in an equivalence or inclusion relationship are selected and integrated into a single piece of cluster information (S321). The integrated cluster information contains the integrated label together with the document IDs and titles contained in at least one of the two original pieces of cluster information.
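Steps S317 to S321 can be sketched as follows. The representation of a cluster as a mapping from document ID to title, the relation strings, and the function names are assumptions made for illustration.

```python
def merged_label(label_a, label_b, relation):
    """Build the integrated label text for S317 (equivalence) or S319 (inclusion)."""
    if relation == "equivalent":
        return f"{label_a}/{label_b}"
    if relation == "A in B":
        return f"{label_b}({label_a})"   # including label (included label)
    if relation == "B in A":
        return f"{label_a}({label_b})"
    raise ValueError(f"labels are not integrated for relation {relation!r}")

def merge_clusters(cluster_a, cluster_b):
    """S321: keep every document that appears in at least one of the two clusters."""
    merged = dict(cluster_a)
    merged.update(cluster_b)
    return merged

print(merged_label("international", "international relations", "equivalent"))
# international/international relations
print(merged_label("baseball", "sports", "A in B"))
# sports(baseball)

cluster = merge_clusters({"001": "Title 1", "002": "Title 2"},
                         {"002": "Title 2", "003": "Title 3"})
print(sorted(cluster))  # ['001', '002', '003']
```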

The process then returns to step S313, where an unprocessed pair of labels is selected and processed in the same way. The processing in FIG. 18 ends when all pairs of labels have been processed. The label information and second search result statistical information, now containing the integrated labels and attribute values, are stored again, and the integrated cluster information is stored together with the original cluster information.

Returning to FIG. 9, the request processing unit 102 reads all of the stored second search result statistical information, label information, attribute name suitability information, and cluster information and passes them to the communication unit 101, which transmits them to the browser 2 (S219).

FIG. 20 is a flowchart of the processing performed by the browser 2 after it receives this information.

As shown in FIG. 21, the document display control unit 22 of the browser 2 displays the document IDs and titles contained in all of the cluster information (S401) and then displays the labels contained in the label information (S403). Because the number of displayed labels has been reduced according to suitability, and labels in an equivalence or inclusion relationship have been integrated, the user can easily designate a label. Specifically, FIG. 21 shows the label "international/international relations", newly generated from labels in an equivalence relationship, and the label "sports (baseball, ...)", newly generated from labels in an inclusion relationship.

To make this more convenient for the user, the labels are, for example, displayed grouped by attribute name. Labels of attribute names with a high attribute name suitability in the attribute name suitability information are displayed more prominently. Among the labels contained in one piece of label information, those with a higher label suitability are displayed more prominently. Each label is displayed together with the number of document IDs associated with it in the second search result statistical information.

When the user designates a label (S405), the document display control unit 22 erases the displayed document IDs and titles and, as shown in FIG. 22, displays the document IDs and titles contained in the cluster information that contains that label (S407). The designated label may of course be an integrated label.

When the user designates a document ID (S409), the document display control unit 22 transmits that document ID to the communication unit 101 of the search device 1 (S411). In practice, a document ID is designated by clicking on the position of the document ID and title.

As shown in FIG. 22, in this embodiment fewer document IDs and titles are displayed after a label is designated than before, so the user can easily designate a document.

The communication unit 101 of the search device 1 passes the received document ID to the request processing unit 102. The request processing unit 102 passes the document ID to the document search unit 109. The document search unit 109 reads the document with that document ID and returns it to the request processing unit 102.

The request processing unit 102 passes the returned document to the communication unit 101, which transmits it to the browser 2.

The document display control unit 22 of the browser 2 displays the received document.

When a label is designated in S405, instead of having the document display control unit 22 display the document IDs and titles contained in the cluster information that contains that label (S407), the label may be transmitted to the communication unit 101 as an additional keyword. The search device 1 then performs the processing from S201 of FIG. 9 onward as an AND search using this additional keyword and the previously transmitted keyword.

As described above, in the search device 1, the document search unit 109, which constitutes the document search means, searches for documents in the document DB 106, which constitutes the document storage means storing the documents. The label determination unit 115, which constitutes the label information generation means, selects attribute values contained in the retrieved documents as document labels and generates and stores label information indicating the selected labels. The label integration processing unit 117, which constitutes the label information processing means, generates a label integrating labels that are in an equivalence relationship in the label information and adds it to the label information, and/or generates a label integrating labels that are in an inclusion relationship and adds it to the label information. The request processing unit 102, which constitutes the document display control means, reads the label information containing the integrated labels and causes the browser 2 to display the labels based on that label information. When an integrated label is designated, it reads from the document storage means, and displays, the documents that contain at least one of the labels before integration and that are also among the retrieved documents. Labels therefore need not be prepared in advance, and the integrated labels make selection easier for the user. As a result, search results can be narrowed down efficiently.

In addition, the cluster information generation unit 116, which constitutes the cluster information generation means, generates cluster information indicating the documents that contain one of the selected labels and are also among the retrieved documents. The label integration processing unit 117, which constitutes the cluster information processing means, generates integrated cluster information from the cluster information corresponding to the labels before integration. When an integrated label is designated, the document display control means (102) displays the existence of the documents indicated by the integrated cluster information, and when a document is designated, it reads that document from the document storage means and displays it. Search results can therefore be narrowed down efficiently.

Another feature of the search device 1 is that the label suitability calculation unit 114, which constitutes the label information generation means, calculates the suitability of each attribute value contained in the retrieved documents as a label, and the label determination unit 115, which constitutes the label information generation means, selects as labels, in descending order of suitability, a number of attribute values smaller than the total number of those attribute values.

Another feature of the search device 1 is that the label suitability calculation unit 114, which constitutes the label information generation means, calculates the suitability using the number of documents that contain the attribute value in question and are among the retrieved documents, and the number of documents that contain that attribute value and are stored in the document storage means.

A further feature of the search device 1 is that it includes the statistical processing unit 112, which is statistical information generation means for generating statistical information on the attribute values contained in the documents stored in the document storage means, and the first statistics DB 110 and the second statistics DB 111, which are statistical information storage means for storing the generated statistical information, and that the label suitability calculation unit 114, which constitutes the label information generation means, calculates the label suitability using that statistical information.

In the above embodiment, equivalence and inclusion relationships are determined for two labels, but they may also be determined for three or more labels.

In the above embodiment, both equivalence and inclusion relationships are processed, but only one of them may of course be processed as needed.

A computer program that causes the search device 1 to execute the label display type document search method described in the above embodiment can be stored on a computer-readable recording medium such as a semiconductor memory, a magnetic disk, an optical disk, a magneto-optical disk, or a magnetic tape, or transmitted over a communication network such as the Internet, and thereby distributed widely.

FIG. 1 is a block diagram showing the device configuration of the first embodiment.
FIG. 2 is a flowchart showing the processing that the search device 1 performs before a search.
FIG. 3 is a diagram showing an example of an untagged document.
FIG. 4 is a diagram showing an example of a tagged document.
FIG. 5 is a diagram showing a method of detecting synonyms using co-occurrence patterns.
FIG. 6 is a diagram showing an example of the index.
FIG. 7 is a diagram showing an example of the first statistical information.
FIG. 8 is a diagram showing an example of the second statistical information.
FIG. 9 is a flowchart of the processing performed by the search device 1 after it receives a keyword.
FIG. 10 is a diagram showing an example of the first search result statistical information.
FIG. 11 is a diagram showing an example of the second search result statistical information.
FIG. 12 is a diagram showing an example of the third statistical information.
FIG. 13 is a diagram showing an example of the label suitability information.
FIG. 14 is a diagram showing an example of the label information.
FIG. 15 is a flowchart of the label selection performed by the label determination unit 115.
FIG. 16 is a diagram showing an example of the attribute name suitability information.
FIG. 17 is a diagram showing an example of the cluster information.
FIG. 18 is a detailed flowchart of the processing for further integrating the determined labels.
FIG. 19 is a diagram showing the relationship between the set of documents containing one label and the set of documents containing the other label.
FIG. 20 is a flowchart of the processing performed by the browser 2.
FIG. 21 is a diagram showing a display example produced by the document display control unit 22 before a label is designated.
FIG. 22 is a diagram showing a display example produced by the document display control unit 22 after a label is designated.

Explanation of symbols

1, 10 ... Search device
2 ... Browser
21 ... Keyword input unit
22 ... Document display control unit
101 ... Communication unit
102 ... Request processing unit
103 ... Setting file
104 ... Document generation unit
105 ... Document generation unit
106 ... Document DB
107 ... Normalization unit
108 ... Index generation unit
109 ... Document search unit
110 ... First statistics DB
111 ... Second statistics DB
112 ... Statistical processing unit
113 ... Label candidate selection unit
114 ... Label fitness calculation unit
115 ... Label determination unit
116 ... Cluster information generation unit
117 ... Label integration processing unit

Claims

A label display type document search apparatus comprising:
document storage means for storing a plurality of documents, each document including a body text composed of character strings, a title of the document, and document identification information indicating the document, the body text including attribute values that are predetermined character strings;
statistical processing means for generating, for each attribute value included in at least one of the plurality of documents stored in the document storage means, statistical information recording the number of appearances of the attribute value in the plurality of documents, and for storing the statistical information in storage means provided in advance;
document search means for retrieving a plurality of documents from the document storage means;
search result statistical information generating means for generating and storing, for each attribute value included in at least one of the plurality of retrieved documents, search result statistical information recording the number of appearances of the attribute value in the plurality of retrieved documents;
fitness calculation means for calculating, for each attribute value included in at least one of the plurality of retrieved documents, a fitness indicating the degree to which the attribute value is suitable as a label, a label being a character string that represents a plurality of documents forming part of the plurality of retrieved documents, by means of a calculation formula that uses the numbers of appearances of the attribute value in the statistical information and in the search result statistical information;
label information generating means for selecting, from among the plurality of attribute values included in at least one of the plurality of retrieved documents and in descending order of fitness, the attribute value corresponding to each fitness for as long as the fitness satisfies a preset condition, thereby selecting a plurality of attribute values forming part of the plurality of attribute values, and for generating label information that includes the selected attribute values as respective labels;
cluster information generating means for generating, for each label included in the label information, cluster information that includes the label and that includes, for as many documents as apply, the document identification information and title of each document that contains the character string that is the label and that is one of the plurality of retrieved documents;
second cluster information generating means for calculating, for each pair of labels included in the label information, a first ratio that is the ratio of one set to the union of the one set and the other set, and a second ratio that is the ratio of the other set to the union, the one set being the set of documents indicated by the document identification information in the cluster information including one of the labels and the other set being the set of documents indicated by the document identification information in the cluster information including the other label; for determining that the two labels are in an equivalence relation if both the first ratio and the second ratio exceed a preset threshold and, if only one of them exceeds the preset threshold, that the two labels are in an inclusion relation in which the label corresponding to the one is included in the label corresponding to the other; and, when an equivalence relation or an inclusion relation is determined, for including in the label information a label that contains the two labels in a form that shows the equivalence relation or the inclusion relation between them, and for generating cluster information that includes that label and the sets of document identification information and title included in at least one of the two pieces of cluster information; and
document display control means for displaying each label included in the label information, for displaying, when one of the labels is selected, each set of document identification information and title included in the cluster information including that label, and, when one of the sets of document identification information and title is selected, for reading out from the document storage means and displaying the document that includes that set of document identification information and title.
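
The ratio test performed by the second cluster information generating means can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `label_relation` and `merge_clusters`, the sample clusters, and the threshold value 0.7 are all assumptions made for illustration. The sketch reports only whether a pair of labels is judged to be in an equivalence relation, an inclusion relation, or neither; it does not decide which label includes which, and it does not build the combined label string.

```python
from typing import List, Tuple

# One cluster entry: (document identification information, title).
Pair = Tuple[str, str]


def label_relation(pairs_a: List[Pair], pairs_b: List[Pair], threshold: float) -> str:
    """Classify two labels by comparing each document set with their union."""
    ids_a = {doc_id for doc_id, _ in pairs_a}
    ids_b = {doc_id for doc_id, _ in pairs_b}
    union = ids_a | ids_b
    if not union:
        return "none"
    first_ratio = len(ids_a) / len(union)   # one set relative to the union
    second_ratio = len(ids_b) / len(union)  # other set relative to the union
    exceeded = int(first_ratio > threshold) + int(second_ratio > threshold)
    # Both ratios over the threshold: equivalence; exactly one: inclusion.
    return {2: "equivalence", 1: "inclusion", 0: "none"}[exceeded]


def merge_clusters(pairs_a: List[Pair], pairs_b: List[Pair]) -> List[Pair]:
    """Collect every (id, title) pair found in at least one of two clusters."""
    return sorted(set(pairs_a) | set(pairs_b))


# Hypothetical cluster information for four labels.
cluster_tokyo = [("d1", "Tokyo ramen guide"), ("d2", "Tokyo museums"),
                 ("d3", "Tokyo parks"), ("d4", "Tokyo hotels")]
cluster_capital = cluster_tokyo[:3]
cluster_ramen = [("d1", "Tokyo ramen guide")]
cluster_kyoto = [("d5", "Kyoto temples"), ("d6", "Kyoto gardens")]

THRESHOLD = 0.7  # assumed value; the claim only requires a preset threshold
relation_equiv = label_relation(cluster_tokyo, cluster_capital, THRESHOLD)
relation_incl = label_relation(cluster_tokyo, cluster_ramen, THRESHOLD)
relation_none = label_relation(cluster_ramen, cluster_kyoto, THRESHOLD)
merged = merge_clusters(cluster_tokyo, cluster_kyoto)
```

Comparing each set with the union rather than with the other set means a single threshold serves both tests: two sets that nearly coincide both come close to the union, while a small set contained in a large one leaves only the large set close to it.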

A label display type document search method performed by a label display type document search apparatus,
the label display type document search apparatus comprising document storage means, statistical processing means, document search means, search result statistical information generating means, fitness calculation means, label information generating means, cluster information generating means, second cluster information generating means, and document display control means,
the document storage means storing
a plurality of documents, each document including a body text composed of character strings, a title of the document, and document identification information indicating the document, the body text including attribute values that are predetermined character strings,
the label display type document search method comprising:
a statistical processing step in which the statistical processing means generates, for each attribute value included in at least one of the plurality of documents stored in the document storage means, statistical information recording the number of appearances of the attribute value in the plurality of documents, and stores the statistical information in storage means provided in advance;
a document search step in which the document search means retrieves a plurality of documents from the document storage means;
a search result statistical information generation step in which the search result statistical information generating means generates and stores, for each attribute value included in at least one of the plurality of retrieved documents, search result statistical information recording the number of appearances of the attribute value in the plurality of retrieved documents;
a fitness calculation step in which the fitness calculation means calculates, for each attribute value included in at least one of the plurality of retrieved documents, a fitness indicating the degree to which the attribute value is suitable as a label, a label being a character string that represents a plurality of documents forming part of the plurality of retrieved documents, by means of a calculation formula that uses the numbers of appearances of the attribute value in the statistical information and in the search result statistical information;
a label information generation step in which the label information generating means selects, from among the plurality of attribute values included in at least one of the plurality of retrieved documents and in descending order of fitness, the attribute value corresponding to each fitness for as long as the fitness satisfies a preset condition, thereby selecting a plurality of attribute values forming part of the plurality of attribute values, and generates label information that includes the selected attribute values as respective labels;
a cluster information generation step in which the cluster information generating means generates, for each label included in the label information, cluster information that includes the label and that includes, for as many documents as apply, the document identification information and title of each document that contains the character string that is the label and that is one of the plurality of retrieved documents;
a second cluster information generation step in which the second cluster information generating means calculates, for each pair of labels included in the label information, a first ratio that is the ratio of one set to the union of the one set and the other set, and a second ratio that is the ratio of the other set to the union, the one set being the set of documents indicated by the document identification information in the cluster information including one of the labels and the other set being the set of documents indicated by the document identification information in the cluster information including the other label; determines that the two labels are in an equivalence relation if both the first ratio and the second ratio exceed a preset threshold and, if only one of them exceeds the preset threshold, that the two labels are in an inclusion relation in which the label corresponding to the one is included in the label corresponding to the other; and, when an equivalence relation or an inclusion relation is determined, includes in the label information a label that contains the two labels in a form that shows the equivalence relation or the inclusion relation between them, and generates cluster information that includes that label and the sets of document identification information and title included in at least one of the two pieces of cluster information; and
a document display control step in which the document display control means displays each label included in the label information, displays, when one of the labels is selected, each set of document identification information and title included in the cluster information including that label, and, when one of the sets of document identification information and title is selected, reads out from the document storage means and displays the document that includes that set of document identification information and title.
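
The earlier steps of the method (statistical processing, search result statistics, fitness calculation, label selection, and cluster information generation) can be sketched in Python as follows. This is a sketch under stated assumptions, not the claimed implementation: the claim does not give the fitness formula or the preset condition, so the fitness is passed in as a caller-supplied function and the condition is assumed to be a minimum fitness. The `Document` class, all function names, the sample data, and the choice to count one appearance per document containing the value are likewise illustrative assumptions.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import takewhile
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Document:
    doc_id: str  # document identification information
    title: str
    text: str    # body text containing attribute values


def count_appearances(docs: List[Document], attribute_values: List[str]) -> Counter:
    """Count, per attribute value, the documents whose text contains it (assumed counting rule)."""
    counts: Counter = Counter()
    for doc in docs:
        for value in attribute_values:
            if value in doc.text:
                counts[value] += 1
    return counts


def select_labels(overall: Counter, in_results: Counter,
                  fitness_fn: Callable[[int, int], float],
                  min_fitness: float) -> List[str]:
    """Take attribute values in descending fitness while the assumed condition holds."""
    scored = sorted(((fitness_fn(overall[v], n), v) for v, n in in_results.items()),
                    key=lambda pair: pair[0], reverse=True)
    return [v for _, v in takewhile(lambda pair: pair[0] >= min_fitness, scored)]


def build_cluster_info(labels: List[str],
                       results: List[Document]) -> Dict[str, List[Tuple[str, str]]]:
    """Map each label to the (id, title) pairs of retrieved documents containing it."""
    return {label: [(d.doc_id, d.title) for d in results if label in d.text]
            for label in labels}


# Hypothetical stored documents and attribute values.
docs = [
    Document("d1", "Tokyo ramen guide", "ramen shops in Tokyo"),
    Document("d2", "Osaka ramen guide", "ramen shops in Osaka"),
    Document("d3", "Tokyo museums", "museums in Tokyo"),
    Document("d4", "Kyoto temples", "temples in Kyoto"),
]
values = ["Tokyo", "Osaka", "Kyoto", "ramen"]

overall = count_appearances(docs, values)        # statistical information
results = docs[:3]                               # stand-in for the retrieved documents
in_results = count_appearances(results, values)  # search result statistical information


def example_fitness(overall_n: int, result_n: int) -> float:
    """Hypothetical formula: favors values frequent and concentrated in the results."""
    return result_n * (result_n / overall_n)


labels = select_labels(overall, in_results, example_fitness, min_fitness=1.5)
clusters = build_cluster_info(labels, results)
```

Passing the fitness formula in as a parameter keeps the sketch independent of whichever formula an embodiment actually uses; only the ordering and cutoff logic are fixed here.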

A computer program for causing a computer to execute the label display type document search method according to claim 2.

A computer-readable recording medium on which the computer program according to claim 3 is stored.