JP4146361B2

JP4146361B2 - Label display type document search apparatus, label display type document search method, computer program for executing label display type document search method, and computer readable recording medium storing the computer program

Info

Publication number: JP4146361B2
Application number: JP2004013398A
Authority: JP
Inventors: 浩之戸田; 良治片岡
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2004-01-21
Filing date: 2004-01-21
Publication date: 2008-09-10
Anticipated expiration: 2024-01-21
Also published as: JP2005208838A

Description

The present invention relates to a label display type document search apparatus that displays labels for documents before displaying the documents themselves.

Among search systems used on computer networks, the following are known as systems that help users narrow down search results efficiently.

・Ranked search system
In a keyword-input search system typified by Google (registered trademark), content containing the entered keywords is ranked by its similarity to those keywords (Non-Patent Document 1) or by PageRank, which indicates the importance of the content (described in Non-Patent Document 2), so that the user can reach the desired content more efficiently.

・Relevance Feedback system
This system presents search results to the user, receives the user's evaluation of those results, modifies the search condition expression based on that evaluation, and returns new search results. The user can thereby obtain search results closer to what the user intends (described in Non-Patent Document 3).

・Clustering system
Based on the assumption that "relevant documents are similar to one another", this method generates clusters from the similarity between documents and presents the search results to the user in classified form. The user can reach the desired information efficiently without evaluating every item in the search results (described in Non-Patent Document 4).

・Query expansion system
This method presents keywords related to the query entered by the user, so that the user can interactively correct and modify the query and obtain the desired content efficiently. Some approaches acquire related terms from a text corpus in advance; others use data obtained by analyzing the search results returned for the entered search request (described in Non-Patent Document 5).

Non-Patent Document 1: tf-idf; Salton, G. et al., "Introduction to Modern Information Retrieval", McGraw-Hill Book Company, 1983.
Non-Patent Document 2: Brin, S. and Page, L., "The Anatomy of a Large-Scale Hypertextual Web Search Engine", Proceedings of 7th WWW Conference, 1998.
Non-Patent Document 3: Ricardo Baeza-Yates and Berthier Ribeiro-Neto, "Modern Information Retrieval", 1999.
Non-Patent Document 4: Anton Leuski, "Evaluating Document Clustering for Interactive Information Retrieval", Proceedings of the 2001 ACM CIKM International Conference on Information and Knowledge Management, 2001.
Non-Patent Document 5: H. Sakai, K. Ohtake and S. Masuyama, "A retrieval support system by suggesting terms to a user", in Proceedings 2001 International Conference on Chinese Language Computing, 2001.

However, each of the search systems described above has the following disadvantages.

A ranked search system presents search results as a prioritized list. If the search request does not narrow the results sufficiently, the user must either look for the desired content in a huge list of results or prepare new narrowing conditions and search again. The former is obviously costly, and the latter is also known to be difficult in general.

In a Relevance Feedback system, the user judges the top few to few dozen search results as relevant or not relevant, which improves the search request and yields results closer to what the user wants. The approach is intuitive, but in practice the user has to judge the relevance of many documents. It is useful as a recall-oriented approach that spends more effort on a single search in order to find all relevant documents reliably. For an approach in which the user only wants to find at least one item of interest, as when choosing a television channel, the cost imposed on the user is too high.

A clustering system can help the user reach the desired search results by classifying them. In general, however, constraints on processing time force a trade-off against clustering quality, so methods that fix the number of clusters in advance, such as the K-Means method, are used. When the chosen number does not match the actual number of topics, ill-defined clusters are generated, it becomes difficult to assign labels that indicate the contents of each cluster, and the generated labels may be too unclear for the user to grasp the cluster contents at a glance.

There are also categorizing systems, which avoid the labeling problem by placing each document into a pre-labeled basket that matches it. These systems assume that categories are created manually, so defining the categories and maintaining them as content is updated imposes a large cost on the administrator of the information retrieval system.

A query expansion system lets the user expand the search expression efficiently, for example by using words that co-occur with the query in documents, which makes it easy to narrow down the search results. However, unless the attributes of the query candidates are taken into account, the candidate words end up at uneven levels of granularity, and it becomes difficult to select information from the search results as a whole.

In short, conventional search systems force the user to hunt for content in a huge list, and the techniques intended to solve this either impose a large cost on the user or the system administrator or provide information that is itself insufficient.

The present invention has been made in view of the above problems, and its object is to provide a document search apparatus that makes it unnecessary for the administrator to change settings when documents are updated and for the user to perform cumbersome operations.

In order to solve the above problems, a first aspect of the present invention provides, as the solution, a label display type document search apparatus comprising:
document storage means storing a plurality of documents, each of which includes a body consisting of a character string, further includes a title of the document and document identification information indicating the document, and has attribute values, which are predetermined character strings, contained in the body;
document vector generation means for generating, for each document stored in the document storage means, a document vector that contains, for each attribute value, the number of times that attribute value is contained in the document;
statistical processing means for generating statistical information that records, for each attribute value contained in at least one of the documents stored in the document storage means, the number of occurrences of that attribute value in those documents, and storing the statistical information in storage means provided in advance;
document search means for retrieving a plurality of documents from the document storage means;
search result statistical information generation means for generating and storing search result statistical information that records, for each attribute value contained in at least one of the retrieved documents, the number of occurrences of that attribute value in the retrieved documents;
fitness calculation means for calculating, for each attribute value contained in at least one of the retrieved documents, a fitness indicating how suitable the attribute value is as a label, that is, a character string representing a subset of the retrieved documents, using a calculation formula based on the occurrence counts of the attribute value in the statistical information and in the search result statistical information;
label information generation means for selecting attribute values, from among the attribute values contained in at least one of the retrieved documents, in descending order of fitness for as long as the fitness satisfies a preset condition, thereby selecting a subset of those attribute values, treating each selected attribute value as a label, and generating label information containing the labels;
cluster information generation means for generating, for each label contained in the label information, cluster information that contains the label together with the document identification information and title of every document that contains the character string of the label and is also one of the retrieved documents;
cluster information changing means for generating, for each piece of cluster information, a cluster vector that is the vector sum of the document vectors of the documents whose document identification information is contained in the cluster information, calculating the cosine measure between the cluster vector and the document vector of a document that is stored in the document storage means and whose document identification information is not contained in the cluster information, and, if the cosine measure exceeds a preset threshold, adding the document identification information and title of that document to the cluster information; and
document display control means for displaying each label contained in the label information, displaying, when one of the labels is selected, each pair of document identification information and title contained in the cluster information that contains the label, and, when one such pair is selected, reading the document containing that pair from the document storage means and displaying it.
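The cluster information changing means described above can be illustrated with a minimal Python sketch. It assumes that each document vector is a mapping from attribute value to count; the function names `cosine` and `expand_cluster`, the sample documents, and the threshold of 0.9 are all hypothetical and do not come from the patent.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine measure between two sparse count vectors (dict-like)."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def expand_cluster(cluster_ids, doc_vectors, threshold):
    """Add to the cluster every stored document outside it whose document
    vector has a cosine measure above `threshold` with the cluster vector."""
    # Cluster vector = vector sum of the member documents' vectors.
    cluster_vec = Counter()
    for doc_id in cluster_ids:
        cluster_vec.update(doc_vectors[doc_id])
    expanded = set(cluster_ids)
    for doc_id, vec in doc_vectors.items():
        if doc_id not in cluster_ids and cosine(cluster_vec, vec) > threshold:
            expanded.add(doc_id)
    return expanded


# Hypothetical document vectors keyed by document ID.
doc_vectors = {
    "001": Counter({"IAEA": 2, "economy": 1}),
    "002": Counter({"IAEA": 1}),
    "003": Counter({"IAEA": 3, "economy": 1}),
    "004": Counter({"sports": 2}),
}
result = expand_cluster({"001", "002"}, doc_vectors, threshold=0.9)
```

Here document "003" is pulled into the cluster because its vector points in the same direction as the cluster vector, while "004" shares no attribute values with the cluster and stays out.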

A second aspect of the present invention provides, as the solution, a label display type document search method performed by a label display type document search apparatus, wherein the label display type document search apparatus comprises document storage means, document vector generation means, statistical processing means, document search means, search result statistical information generation means, fitness calculation means, label information generation means, cluster information generation means, cluster information changing means, and document display control means, and the document storage means stores a plurality of documents, each of which includes a body consisting of a character string, further includes a title of the document and document identification information indicating the document, and has attribute values, which are predetermined character strings, contained in the body, the method comprising:
a document vector generation step in which the document vector generation means generates, for each document stored in the document storage means, a document vector that contains, for each attribute value, the number of times that attribute value is contained in the document;
a statistical processing step in which the statistical processing means generates statistical information that records, for each attribute value contained in at least one of the documents stored in the document storage means, the number of occurrences of that attribute value in those documents, and stores the statistical information in storage means provided in advance;
a document search step in which the document search means retrieves a plurality of documents from the document storage means;
a search result statistical information generation step in which the search result statistical information generation means generates and stores search result statistical information that records, for each attribute value contained in at least one of the retrieved documents, the number of occurrences of that attribute value in the retrieved documents;
a fitness calculation step in which the fitness calculation means calculates, for each attribute value contained in at least one of the retrieved documents, a fitness indicating how suitable the attribute value is as a label, that is, a character string representing a subset of the retrieved documents, using a calculation formula based on the occurrence counts of the attribute value in the statistical information and in the search result statistical information;
a label information generation step in which the label information generation means selects attribute values, from among the attribute values contained in at least one of the retrieved documents, in descending order of fitness for as long as the fitness satisfies a preset condition, thereby selecting a subset of those attribute values, treats each selected attribute value as a label, and generates label information containing the labels;
a cluster information generation step in which the cluster information generation means generates, for each label contained in the label information, cluster information that contains the label together with the document identification information and title of every document that contains the character string of the label and is also one of the retrieved documents;
a cluster information changing step in which the cluster information changing means generates, for each piece of cluster information, a cluster vector that is the vector sum of the document vectors of the documents whose document identification information is contained in the cluster information, calculates the cosine measure between the cluster vector and the document vector of a document that is stored in the document storage means and whose document identification information is not contained in the cluster information, and, if the cosine measure exceeds a preset threshold, adds the document identification information and title of that document to the cluster information; and
a document display control step in which the document display control means displays each label contained in the label information, displays, when one of the labels is selected, each pair of document identification information and title contained in the cluster information that contains the label, and, when one such pair is selected, reads the document containing that pair from the document storage means and displays it.

According to the present invention, documents are retrieved from document storage means that stores documents; a fitness is calculated for using each attribute value contained in the retrieved documents as a document label; attribute values, fewer in number than those attribute values, are selected as labels in descending order of fitness; label information indicating the selected labels is stored; the stored label information is read out and the labels are displayed from it; and, when a label is designated, the documents that contain the label and are also among the retrieved documents are read from the document storage means and displayed. Labels therefore need not be prepared in advance, and the number of labels can be kept small. As a result, the administrator need not change settings when documents are updated, and the user need not perform cumbersome operations.

In addition, documents are retrieved from the document storage means; a fitness is calculated for using each attribute value contained in the retrieved documents as a document label; attribute values, fewer in number than those attribute values, are selected as labels in descending order of fitness; label information indicating the selected labels is stored; cluster information is generated that indicates the documents that contain one of the selected labels and are also among the retrieved documents; the similarity between the documents indicated by the cluster information and a document not indicated by it is calculated; if the similarity is high, the cluster information is changed so that it also indicates the latter document; the stored label information is read out and the labels are displayed from it; when a label is designated, the existence of the documents indicated by the changed cluster information is displayed; and, when a document is designated, it is read from the document storage means and displayed. Consequently, in addition to removing the need for setting changes by the administrator and cumbersome operations by the user, the number of documents reachable from a small number of displayed labels can be increased, which raises the likelihood that the desired document can be displayed.

Embodiments of the present invention are described below with reference to the drawings.

[First embodiment]
FIG. 1 is a block diagram showing the apparatus configuration of the first embodiment. The following description uses news articles as the example documents.

The search device 1 is a server computer that searches for documents, and corresponds to the label display type document search apparatus that executes the label display type document search method of the present invention. The search device 1 can communicate with a browser 2 on a client computer connected via a network (not shown).

The browser 2 includes a keyword input unit 21, into which a keyword is entered via an input device such as a keyboard or mouse, and a document display control unit 22, which displays the documents retrieved by the keyword on a display device (not shown) such as a CRT (Cathode Ray Tube) or LCD (Liquid Crystal Display).

Before displaying a retrieved document, the browser 2 displays the attribute values contained in the documents as labels, grouped by the attribute name of each attribute value. When one of the labels is designated, for example by clicking, the browser displays the titles and other details of the documents containing that label (attribute value); when one of the titles is designated, it displays the document with that title.

The search device 1 includes a communication unit 101, which receives a keyword from the browser 2 and sends the retrieved documents to the browser 2, and a request processing unit 102, which controls operations such as the document search using the keyword supplied by the communication unit 101.

The search device 1 also includes a setting file 103 in which information used for document search is written. The setting file 103 holds the attribute names used in the search, such as "genre" and "organization", and the number of labels to be selected. It also holds the parameters α, β, and γ used when calculating the fitness of an attribute name for display (referred to as the attribute name fitness), as well as various threshold values.

The search device 1 also includes the following. A document generation unit 104 receives an untagged document, that is, a document that contains attribute values belonging to one of the attribute names written in the setting file 103 but whose attribute values have not been given attribute names (tags), and generates a document when a news article administrator or the like attaches the tags manually. A document generation unit 105 receives an untagged document and generates a tagged document (also referred to simply as a document) by attaching tags to its attribute values automatically. A document database 106 (database is hereinafter abbreviated as DB) stores the documents generated by the document generation unit 104 and the document generation unit 105.

The search device 1 also includes a normalization unit 107 that normalizes the attribute values contained in the documents stored in the document DB 106.

The search device 1 also includes an index generation unit 108, which generates an index 107 that associates each word (which may be an attribute value) contained in the documents stored in the document DB 106 with the document identification information (identification information is hereinafter referred to as ID) of the documents containing that word, and a document search unit 109, which retrieves documents from the document DB 106 based on the keyword and the index 107.

The search device 1 also includes a first statistical information DB 110, which stores the first statistical information generated for each attribute name in the setting file 103; a second statistical information DB 111, which stores the second statistical information generated for each piece of first statistical information; and a statistical processing unit 112, which generates the first and second statistical information.

The search device 1 also includes a label candidate selection unit 113, which selects a plurality of attribute values as label candidates for each attribute name in the setting file 103; a label fitness calculation unit 114, which calculates the fitness of each label candidate as a document label (referred to as the label fitness); and a label determination unit 115, which determines the labels based on the calculated label fitness.
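The flow from label candidates through label fitness to label determination can be sketched as follows. The scoring function `label_fitness` is a placeholder assumption, not the patent's calculation formula, which is not given in this passage; it simply favors attribute values that occur often in the search results and are concentrated there. The names `choose_labels`, `max_labels`, and `min_fitness`, and the sample counts, are likewise hypothetical.

```python
def label_fitness(result_count, corpus_count):
    """Placeholder score (an assumption, NOT the patent's formula):
    count in the search results, weighted by the share of the value's
    corpus-wide occurrences that fall inside the search results."""
    return result_count * (result_count / corpus_count)


def choose_labels(result_stats, corpus_stats, max_labels, min_fitness):
    """Pick labels in descending order of fitness while the preset
    conditions (label count limit, minimum fitness) still hold."""
    scored = sorted(
        ((label_fitness(n, corpus_stats[value]), value)
         for value, n in result_stats.items()),
        key=lambda pair: (-pair[0], pair[1]),
    )
    labels = []
    for score, value in scored:
        if len(labels) >= max_labels or score < min_fitness:
            break
        labels.append(value)
    return labels


# Hypothetical occurrence counts: whole corpus vs. search results.
corpus_stats = {"IAEA": 10, "UN": 200, "WHO": 40}
result_stats = {"IAEA": 8, "UN": 20, "WHO": 2}
labels = choose_labels(result_stats, corpus_stats, max_labels=2, min_fitness=1.0)
```

Whatever formula is actually used, the selection loop keeps the number of displayed labels smaller than the number of candidate attribute values, which is the property the description relies on.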

The search device 1 also includes a cluster information generation unit 116 that generates cluster information for each determined label. In this embodiment, a cluster means one or more retrieved documents that contain a given label.
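As a rough illustration of this definition of a cluster, the sketch below builds one cluster per label from the retrieved documents. The record layout (`id`, `title`, `body`) and the function name `build_clusters` are assumptions made for the example, and simple substring matching stands in for whatever matching the apparatus actually performs.

```python
def build_clusters(labels, search_results):
    """Map each label to the (document ID, title) pairs of the retrieved
    documents whose body contains the label string."""
    clusters = {}
    for label in labels:
        clusters[label] = [
            (doc["id"], doc["title"])
            for doc in search_results
            if label in doc["body"]
        ]
    return clusters


# Hypothetical retrieved documents.
search_results = [
    {"id": "001", "title": "IAEA decides", "body": "The IAEA met on economy issues."},
    {"id": "002", "title": "Market report", "body": "The economy grew."},
]
clusters = build_clusters(["IAEA", "economy"], search_results)
```

A document may appear in more than one cluster, as document "001" does here, because membership depends only on which labels it contains.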

[Pre-search processing]
Next, the processing that the search device 1 performs before a search is described.

FIG. 2 is a flowchart showing the processing that the search device 1 performs before a search.

The document generation unit 104 receives, for example, an untagged document such as that shown in FIG. 3. When it is further specified, for example, that "International Atomic Energy Agency decides ***" is the title, that the attribute value "International Atomic Energy Agency" is classified under the attribute name "organization", and that the attribute value "economy" is classified under the attribute name "genre", the unit generates a document by adding these specifications and a document ID such as "001" to the untagged document, as shown in FIG. 4, and stores it in the document DB 106 (S101).

The document generation unit 105, on the other hand, receives an untagged document and, given a title designation and the like, generates a document by attaching tags to its attribute values automatically, assigns a document ID, and stores the document in the document DB 106 (S101). The automatic tagging process is described in detail later.

このような処理により、文書ＤＢ１０６には多数の文書が格納される。 By such processing, a large number of documents are stored in the document DB 106.

次に、正規化部１０７は、文書ＤＢ１０６に格納された文書に含まれる属性値を正規化する（Ｓ１０３）。正規化とは、例えば、略記号で表記された属性値「IAEA」を略さない日本語で表記された属性値「国際原子力機関」に変換することをいう。 Next, the normalization unit 107 normalizes attribute values included in the document stored in the document DB 106 (S103). Normalization means, for example, conversion of an attribute value “IAEA” expressed by an abbreviation into an attribute value “International Atomic Energy Agency” expressed in Japanese that is not abbreviated.

つまり、正規化部１０７は、文書中で同じ意味を持ちながら表現の異なる同義語となっている属性値を検出し、これらを同じ表現にする。 In other words, the normalization unit 107 detects attribute values having synonyms with different expressions while having the same meaning in the document, and makes them the same expression.

同義語の検出にはいくつかの方法があるが、図５に示す共起パタンを用いる方法を採用することができる。 Although there are several methods for detecting synonyms, a method using the co-occurrence pattern shown in FIG. 5 can be adopted.

このような処理により、文書ＤＢ１０６における文書の属性値が正規化される。 By such processing, the attribute value of the document in the document DB 106 is normalized.
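
The normalization step can be sketched as follows. This is a minimal sketch that assumes the synonym groups have already been detected (for example by the co-occurrence pattern method of FIG. 5); the table `SYNONYMS` and the function `normalize` are hypothetical names introduced only for illustration.

```python
# Hypothetical synonym table: variant expression -> canonical attribute value.
# In the embodiment this mapping would come from synonym detection, not a literal.
SYNONYMS = {"IAEA": "International Atomic Energy Agency"}


def normalize(attribute_values):
    """Replace each attribute value with its canonical expression when one is known."""
    return [SYNONYMS.get(value, value) for value in attribute_values]


normalized = normalize(["IAEA", "economy"])
```

Values without a known synonym pass through unchanged, so the step can be rerun safely whenever the document DB 106 is updated.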

次に、インデクス生成部１０８は、文書ＤＢ１０６に格納された文書に含まれたワードと該ワードを含む文書の文書ＩＤとを対応づけたインデクス１０７を生成する（Ｓ１０５）。 Next, the index generation unit 108 generates an index 107 in which a word included in a document stored in the document DB 106 is associated with a document ID of a document including the word (S105).

図６に示すように、インデクス１０７では、例えば、ワード「原子力」に対し、このワードを含む文書の文書ＩＤ「００１」などが対応づけられる。 As shown in FIG. 6, in the index 107, for example, the document ID “001” of the document including the word is associated with the word “nuclear power”.
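
The index of FIG. 6 behaves like an inverted index. The following is a minimal sketch under the assumption that each document has already been split into words; `build_index` and the sample data are hypothetical.

```python
from collections import defaultdict


def build_index(documents):
    """Map each word to the sorted IDs of the documents that contain it.

    documents: dict of document ID -> list of words (assumed pre-tokenized).
    """
    index = defaultdict(set)
    for doc_id, words in documents.items():
        for word in words:
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


docs = {"001": ["nuclear power", "economy"], "002": ["nuclear power"]}
index = build_index(docs)
```

Looking up a keyword such as "nuclear power" in `index` then yields the document IDs returned by the document search in S201.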

次に、統計処理部１１２は、文書ＤＢ１０６を基に、設定ファイル１０３の属性名ごとに第１統計情報を生成して第１統計情報ＤＢ１１０に格納する（Ｓ１０７）。 Next, the statistical processing unit 112 generates first statistical information for each attribute name of the setting file 103 based on the document DB 106 and stores the first statistical information in the first statistical information DB 110 (S107).

図７（ａ）や（ｂ）に示すように、１つの第１統計情報には１つの属性名が割り当てられている。 As shown in FIGS. 7A and 7B, one attribute name is assigned to one first statistical information.

また、１つの第１統計情報は、文書ＩＤとこのＩＤの文書に含まれ且つ属性名に分類される属性値とを対応づけたものを１以上備える情報である。 One piece of first statistical information is information including one or more pieces of information in which a document ID is associated with an attribute value included in a document with this ID and classified as an attribute name.

FIG. 7A shows, for example, that the document with the document ID "001" contains the attribute value "economy", among others, classified under the attribute name "genre". FIG. 7B shows that the document with the document ID "001" contains the attribute value "International Atomic Energy Agency", among others, classified under the attribute name "organization".

次に、統計処理部１１２は、第１統計情報ごとに第２統計情報を生成して第２統計情報ＤＢ１１１に格納する（Ｓ１０９）。 Next, the statistical processing unit 112 generates second statistical information for each first statistical information and stores it in the second statistical information DB 111 (S109).

As shown in FIGS. 8A and 8B, each piece of second statistical information is assigned the attribute name of one piece of first statistical information.

また、１つの第２統計情報は、属性名に分類される属性値と該属性値の第１統計情報ＤＢ１１０内における出現回数とを対応づけたものを１以上備える情報である。 One piece of second statistical information is information including one or more items in which attribute values classified into attribute names are associated with the number of appearances of the attribute value in the first statistical information DB 110.

図８（ａ）は、例えば、属性名「ジャンル」に分類される属性値「経済」の出現回数が１００回であることを示している。また、図８（ｂ）は、属性名「組織」に分類される属性値「国際原子力機関」の出現回数が７０回であることを示している。 FIG. 8A shows that the number of appearances of the attribute value “economy” classified into the attribute name “genre” is 100, for example. FIG. 8B shows that the number of appearances of the attribute value “International Atomic Energy Agency” classified into the attribute name “organization” is 70 times.

なお、第２統計情報は、第１統計情報において属性値と文書ＩＤの対応を検出し、検出ごとに出現回数をカウントアップすることで生成してもよい。 The second statistical information may be generated by detecting the correspondence between the attribute value and the document ID in the first statistical information and counting up the number of appearances for each detection.
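
The two kinds of statistical information can be sketched as plain dictionaries. This is a sketch only: the data layout (document ID mapped to attribute values per attribute name) and the function names are assumptions, and each document and attribute value pair is assumed to be counted once.

```python
from collections import Counter


def first_statistics(documents, attribute_name):
    """Document ID -> attribute values classified under attribute_name.

    documents: dict of document ID -> {attribute name: [attribute values]}.
    """
    return {doc_id: list(attrs[attribute_name])
            for doc_id, attrs in documents.items()
            if attrs.get(attribute_name)}


def second_statistics(first_stats):
    """Attribute value -> number of (document ID, attribute value) pairs."""
    counts = Counter()
    for values in first_stats.values():
        counts.update(set(values))  # one count per document and value pair
    return dict(counts)


docs = {
    "001": {"genre": ["economy"], "organization": ["International Atomic Energy Agency"]},
    "002": {"genre": ["economy"]},
}
genre_first = first_statistics(docs, "genre")
genre_second = second_statistics(genre_first)
```

Because both structures depend only on the document DB 106, they can be built before any search and reused at search time.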

The second statistical information may instead consist of the attribute values themselves and the document IDs of the documents in which they appear. It may also consist of the co-occurrence frequencies between attribute values. In that case, attribute values that appear in the same document can be defined as co-occurring. When documents are generated automatically from untagged documents, attribute values contained in the same sentence or paragraph of the untagged document can be defined as co-occurring. Furthermore, instead of expressing the co-occurrence relationship as a binary value of 0 or 1, a degree of co-occurrence based on proximity within the document may be used, so that attribute values appearing closer together are given larger values.

このようにして、Ｓ１０９までの処理が終わると文書検索が可能となるが、文書ＤＢ１０６の文書が更新、追加または削除されたときは、属性値の正規化や、インデクス１０７、第１統計ＤＢ１１０、第２統計ＤＢ１１１などの更新が行われる。 In this way, the document search becomes possible after the processing up to S109 is completed. However, when the document in the document DB 106 is updated, added, or deleted, the attribute value normalization, the index 107, the first statistical DB 110, The second statistics DB 111 and the like are updated.

[Search process]
Next, the search process performed by the search device 1 is described.

キーワード入力部２１は、例えばキーワード「原子力」がユーザにより入力されると、このキーワード「原子力」を検索装置１の通信部１０１に送信する。 For example, when the keyword “nuclear power” is input by the user, the keyword input unit 21 transmits the keyword “nuclear power” to the communication unit 101 of the search device 1.

図９は、キーワードを送信された検索装置１が行う処理のフローチャートである。 FIG. 9 is a flowchart of processing performed by the search device 1 that has received the keyword.

先ず、通信部１０１は、送信されたキーワード「原子力」を要求処理部１０２に与え、要求処理部１０２は、そのキーワードを文書検索部１０９に与える。文書検索部１０９は、そのキーワード「原子力」に対しインデクス１０７で対応づけられた文書ＩＤを検索し、それらを要求処理部１０２に返却する（Ｓ２０１：文書検索）。 First, the communication unit 101 gives the transmitted keyword “nuclear power” to the request processing unit 102, and the request processing unit 102 gives the keyword to the document search unit 109. The document search unit 109 searches for the document ID associated with the keyword “nuclear power” in the index 107 and returns them to the request processing unit 102 (S201: document search).

要求処理部１０２は、その文書ＩＤをラベル候補選択部１１３に与える（Ｓ２０３）。 The request processing unit 102 gives the document ID to the label candidate selection unit 113 (S203).

ラベル候補選択部１１３は、第１統計情報ＤＢ１１０と、検索された文書ＩＤを基に、設定ファイル１０３の属性名ごとに第１検索結果統計情報を生成して一時的に記憶する（Ｓ２０５）。 The label candidate selection unit 113 generates first search result statistical information for each attribute name of the setting file 103 based on the first statistical information DB 110 and the searched document ID, and temporarily stores the first search result statistical information (S205).

図１０に示すように、１つの第１検索結果統計情報には１つの属性名が割り当てられている。 As shown in FIG. 10, one attribute name is assigned to one first search result statistical information.

One piece of first search result statistical information associates each attribute value in one piece of first statistical information with the document IDs of the documents that contain that attribute value and are also among the retrieved document IDs.

次に、ラベル候補選択部１１３は、第１検索結果統計情報を基に、属性名ごとに第２検索結果統計情報を生成して一時的に記憶する（Ｓ２０７）。 Next, the label candidate selection unit 113 generates and temporarily stores second search result statistical information for each attribute name based on the first search result statistical information (S207).

図１１に示すように、１つの第２検索結果統計情報には１つの属性名が割り当てられている。 As shown in FIG. 11, one attribute name is assigned to one second search result statistical information.

また、１つの第２検索結果統計情報は、１つの第１検索結果統計情報の各属性値に対し該属性値に対応づけられた文書ＩＤの数を出現回数として対応づけたものである。 One second search result statistical information is obtained by associating each attribute value of one first search result statistical information with the number of document IDs associated with the attribute value as the number of appearances.

Next, the label candidate selection unit 113 generates third statistical information for each piece of second statistical information, based on that second statistical information and the second search result statistical information assigned the same attribute name (S209).

As shown in FIG. 12, one piece of third statistical information consists of those rows of one piece of second statistical information whose attribute values also appear in the corresponding second search result statistical information.
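
Steps S205 to S209 amount to filtering the precomputed statistics by the retrieved document IDs. The sketch below assumes the dictionary layouts shown in its docstrings; all function names are hypothetical, and attribute values that occur in no retrieved document are simply omitted.

```python
def first_search_result_statistics(first_stats, hit_ids):
    """Attribute value -> set of retrieved document IDs containing it (S205).

    first_stats: dict of document ID -> list of attribute values.
    """
    hits = set(hit_ids)
    result = {}
    for doc_id, values in first_stats.items():
        if doc_id in hits:
            for value in values:
                result.setdefault(value, set()).add(doc_id)
    return result


def second_search_result_statistics(first_result_stats):
    """Attribute value -> number of retrieved documents containing it (S207)."""
    return {value: len(ids) for value, ids in first_result_stats.items()}


def third_statistics(second_stats, second_result_stats):
    """Rows of second_stats whose attribute value occurs in the search result (S209)."""
    return {value: count for value, count in second_stats.items()
            if value in second_result_stats}


first_stats = {"001": ["economy"], "002": ["economy", "politics"], "003": ["sports"]}
second_stats = {"economy": 100, "politics": 40, "sports": 30}
frs = first_search_result_statistics(first_stats, ["001", "002"])
srs = second_search_result_statistics(frs)
third = third_statistics(second_stats, srs)
```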

つぎに、ラベル適合度算出部１１４は、第２検索結果統計情報と第３統計情報と検索された文書ＩＤを基に、ラベル適合度情報を第２検索結果統計情報ごと生成し一時的に記憶する（Ｓ２１１）。 Next, the label suitability calculation unit 114 generates and temporarily stores label suitability information for each second search result statistical information based on the second search result statistical information, the third statistical information, and the retrieved document ID. (S211).

図１３に示すように、１つのラベル適合度情報には１つの属性名が割り当てられている。 As shown in FIG. 13, one attribute name is assigned to one label suitability information.

また、１つのラベル適合度情報は、１つの第２検索結果統計情報に含まれた各属性値に対しラベル適合度を対応づけたものである。 One label suitability information is obtained by associating a label suitability with each attribute value included in one second search result statistical information.

ラベル適合度は、例えば以下のように算出する。 The label suitability is calculated as follows, for example.

Let h be the number of appearances corresponding to one attribute value in the second search result statistical information, d the number of appearances corresponding to that attribute value in the third statistical information, and |H| the number of retrieved document IDs. The label suitability is then calculated by Equation (1).

なお、式（１）のｈ／ｄは、検索された文書における属性値の網羅性を、｜Ｈ｜／ｈは検索された文書における属性値の希少性を示している。 In the equation (1), h / d represents the completeness of the attribute value in the retrieved document, and | H | / h represents the rarity of the attribute value in the retrieved document.

In Equation (1), h in the first term may be replaced with h/|H|, and d in the first term may be replaced with d/|D| (where |D| is the number of documents including the attribute value).
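
The two factors described for Equation (1) can be illustrated as follows. Equation (1) itself is not reproduced in this text, so the way the factors are combined here is an assumption made purely for illustration (a TF-IDF-like product with a logarithm on the rarity factor); only the ratios h/d and |H|/h come from the description. The function name is hypothetical.

```python
import math


def label_suitability(h, d, num_hits):
    """Illustrative label suitability for one attribute value.

    h: appearances of the value among the retrieved documents
    d: appearances of the value in the whole collection
    num_hits: number of retrieved document IDs, |H|
    """
    completeness = h / d       # h/d, as described for Equation (1)
    rarity = num_hits / h      # |H|/h, as described for Equation (1)
    # ASSUMPTION: the combination below is not taken from the source.
    return completeness * math.log(rarity)


score = label_suitability(h=10, d=100, num_hits=100)
```

Under this assumed combination, an attribute value that occurs in every retrieved document scores zero, because it cannot narrow the result set.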

Next, the label determination unit 115 reduces the number of attribute value and label suitability pairs in the label suitability information and temporarily stores the result as label information (S213). Label information is generated and stored for each piece of label suitability information. Because the attribute values in the label information become the labels of documents, they are referred to as labels.

As shown in FIG. 14, the label information associates each label with its label suitability. Because labels are selected from the label suitability information in descending order of label suitability, the label information contains fewer label and label suitability pairs than the label suitability information contains attribute value and label suitability pairs.

図１５は、ラベル決定部１１５が行うラベル選択のフローチャートである。 FIG. 15 is a flowchart of label selection performed by the label determination unit 115.

The label determination unit 115 selects, in descending order of label suitability, the number of labels written in the setting file 103 (S301). It then determines whether to additionally select the runner-up label, that is, the label with the next highest label suitability (S303).

Specifically, let C(n) be the lowest label suitability among the selected labels, C(n+1) the label suitability one rank above it, and C(n-1) the label suitability of the runner-up. When Equation (2) holds, the runner-up label is additionally selected (S305), and the process returns to S303.

ただし、ｅは、設定ファイル１０３などに書き込まれたしきい値である。 However, e is a threshold value written in the setting file 103 or the like.

In other words, the determination evaluates the slope of the values and treats the point where the slope exceeds a certain threshold as the boundary.

This prevents a label from being left out of the selection even though its label suitability is close to that of a selected label. That is, a label is excluded only when there is a clear difference in label suitability.

なお、Ｓ３０１では、設定ファイル１０３などに書き込まれたラベル適合度のしきい値との比較によりラベルを選択してもよい。 In S301, a label may be selected by comparison with a threshold value of the label suitability written in the setting file 103 or the like.
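
The two variants of S301 can be sketched as follows. The additional selection test of S303 depends on Equation (2), which is not reproduced in this text, so it is deliberately left out; the function names and the use of "greater than or equal to" for the threshold comparison are assumptions.

```python
def select_top_n(suitability, n):
    """S301: keep the n labels with the highest label suitability."""
    ranked = sorted(suitability.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:n])


def select_by_threshold(suitability, threshold):
    """S301 variant: keep labels whose suitability reaches the threshold."""
    return {label: score for label, score in suitability.items()
            if score >= threshold}


scores = {"economy": 0.9, "politics": 0.5, "sports": 0.1}
top_two = select_top_n(scores, 2)
above_half = select_by_threshold(scores, 0.5)
```

In both variants `n` or `threshold` would be read from the setting file 103.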

次に、ラベル決定部１１５は、ラベル情報を基に属性名適合度情報を生成し一時的に記憶する（Ｓ２１５）。 Next, the label determination unit 115 generates attribute name fitness information based on the label information and temporarily stores it (S215).

図１６に示すように、属性名適合度情報は、属性名ごとに属性名適合度を示したものである。 As shown in FIG. 16, the attribute name suitability information indicates the attribute name suitability for each attribute name.

例えば、属性名「ジャンル」の場合の属性名適合度は、以下のように算出する。 For example, the attribute name suitability for the attribute name “genre” is calculated as follows.

First, the number dl of documents that contain any of the labels in the label information for "genre" is obtained from the first search result statistical information for "genre". A document that contains more than one of the labels is counted once.

Then, the comprehensiveness S1 is obtained by Equation (3).

ここで、ｄｒは、検索された文書ＩＤの数である。 Here, dr is the number of retrieved document IDs.

このＳ１が大きいほど、検索結果がラベルにより網羅されている程度が大きいことになる。 The greater the S1, the greater the extent to which the search result is covered by the label.
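
Counting dl is a set union over the labels, which is what ensures that a document containing several labels is counted once. The sketch below assumes that Equation (3), which is not reproduced in this text, is the ratio dl/dr; that form, the function name, and the data layout are assumptions.

```python
def comprehensiveness_s1(first_result_stats, labels, num_hits):
    """Share of retrieved documents that contain at least one selected label.

    first_result_stats: attribute value -> set of retrieved document IDs.
    num_hits: number of retrieved document IDs, dr.
    """
    covered = set()
    for label in labels:
        covered |= set(first_result_stats.get(label, ()))
    dl = len(covered)  # a document with several labels is counted once
    # ASSUMPTION: Equation (3) is taken to be dl / dr.
    return dl / num_hits


frs = {"economy": {"001", "002"}, "politics": {"002", "003"}, "sports": {"004"}}
s1 = comprehensiveness_s1(frs, ["economy", "politics"], 4)
```

Here document "002" carries both labels but contributes only once, so three of the four retrieved documents are covered.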

Next, S2, which expresses how little the labels overlap (the clarity of the classification), is obtained by Equation (4).

Here, dr is the number of retrieved document IDs, and dl_i is the number of documents that contain the i-th label l_i in the label information for "genre", obtained from the second search result statistical information for "genre".

このＳ２が大きいほど、検索結果がラベルにより明確に分類されている程度が大きいことになる。 The larger S2 is, the greater the degree to which the search result is clearly classified by the label.

次に、式（５）により、分類の均一さＳ３を求める。ここでは、後述するクラスタの平均エントロピーを算出することでＳ３を求める。

Next, the uniformity S3 of the classification is obtained from Equation (5). Here, S3 is obtained by calculating an average entropy of a cluster to be described later.

Here, dr is the number of retrieved document IDs, and dl_i is the number of documents that contain the i-th label l_i in the label information for "genre". dl_i can be obtained from the second search result statistical information.

このＳ３が大きいほど、検索結果がラベルにより均一に分類されている程度が大きいことになる。 The larger this S3, the greater the degree to which the search results are uniformly classified by the label.

次に、式（６）により、属性名適合度Ｓを求める。

Next, the attribute name suitability S is obtained by Expression (6).

ここで、α、β、γは設定ファイル１０３に書き込まれたパラメータである。 Here, α, β, and γ are parameters written in the setting file 103.

次に、要求処理部１０２は、第２検索結果統計情報、ラベル情報及び属性名適合度情報を読み出し、ラベル情報をクラスタ情報生成部１１６に与える。 Next, the request processing unit 102 reads the second search result statistical information, the label information, and the attribute name fitness information, and provides the label information to the cluster information generation unit 116.

クラスタ情報生成部１１６は、ラベル情報に含まれたラベルごとにクラスタ情報を生成し一時的に記憶する（Ｓ２１７）。 The cluster information generation unit 116 generates cluster information for each label included in the label information and temporarily stores it (S217).

図１７に示すように、クラスタ情報は、ラベル情報に含まれる各ラベルと、該ラベルを含む文書の文書ＩＤで且つ検索された文書ＩＤにも含まれる文書ＩＤと、当該文書の題名とを対応づけたものである。 As shown in FIG. 17, the cluster information corresponds to each label included in the label information, the document ID of the document including the label and also included in the retrieved document ID, and the title of the document. It is attached.
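
Cluster information of the kind shown in FIG. 17 can be sketched as a mapping from each label to the retrieved documents that contain it. The function name and the data layout are hypothetical.

```python
def build_cluster_info(labels, first_result_stats, titles):
    """Label -> list of (document ID, title) for retrieved documents with that label.

    first_result_stats: attribute value -> set of retrieved document IDs.
    titles: document ID -> title.
    """
    return {
        label: [(doc_id, titles[doc_id])
                for doc_id in sorted(first_result_stats.get(label, ()))]
        for label in labels
    }


frs = {"economy": {"002", "001"}, "politics": {"002"}}
titles = {"001": "Title A", "002": "Title B"}
clusters = build_cluster_info(["economy", "politics"], frs, titles)
```

When the user designates a label in S405, the browser only needs to look up that label in this mapping to obtain the document IDs and titles to display.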

次に、要求処理部１０２は、第２検索結果統計情報、ラベル情報、属性名適合度情報及びクラスタ情報をそれぞれ全て読み出して通信部１０１に与え、通信部１０１は、これらをブラウザ２に送信する（Ｓ２１９）。 Next, the request processing unit 102 reads out all the second search result statistical information, label information, attribute name fitness information, and cluster information and gives them to the communication unit 101, and the communication unit 101 transmits them to the browser 2. (S219).

図１８は、こらら情報を送信されたブラウザ２が行う処理のフローチャートである。 FIG. 18 is a flowchart of processing performed by the browser 2 to which these pieces of information are transmitted.

ブラウザ２の文書表示制御部２２は、図１９に示すように、全てのクラスタ情報に含まれる文書ＩＤと題名を表示させ（Ｓ４０１）、さらにラベル情報に含まれたラベルを表示させる（Ｓ４０３）。このとき、表示されるラベル数は適合度により少なくされているのでユーザによるラベルの指示を容易に行うことができる。 As shown in FIG. 19, the document display control unit 22 of the browser 2 displays document IDs and titles included in all cluster information (S401), and further displays labels included in the label information (S403). At this time, since the number of displayed labels is reduced depending on the degree of fitness, the user can easily instruct the label.

そして、ユーザにとって一層便宜となるように、例えば、ラベルは属性名ごとにまとめて表示させる。また、属性名適合度情報における属性名適合度の高い属性名のラベルをより見やすいように表示させる。また、１つのラベル情報に含まれたラベルについては対応づけられたラベル適合度の高いものをより見やすいように表示させる。また、ラベルには、第２検索結果統計情報において対応づけられた文書ＩＤの数を対応づけて表示させる。 Then, for the convenience of the user, for example, labels are displayed together for each attribute name. Further, the label of the attribute name having a high attribute name matching degree in the attribute name matching degree information is displayed so as to be easier to see. In addition, for labels included in one label information, an associated label having a high degree of label matching is displayed so as to be easier to see. In addition, the number of document IDs associated with the second search result statistical information is displayed in association with the label.

そして、文書表示制御部２２は、ユーザにより１つのラベルが指示される（Ｓ４０５）と、表示済みの文書ＩＤと題名を消去し、図２０に示すように、そのラベルを含むクラスタ情報に含まれた文書ＩＤと題名を表示させる（Ｓ４０７）。 Then, when one label is instructed by the user (S405), the document display control unit 22 deletes the displayed document ID and title, and is included in the cluster information including the label as shown in FIG. The document ID and title are displayed (S407).

そして、文書表示制御部２２は、ユーザにより文書ＩＤが指示される（Ｓ４０９）と、その文書ＩＤを検索装置１の通信部１０１に送信する（Ｓ４１１）。なお、実際には、文書ＩＤと題名の位置をクリックすると文書ＩＤが指示できるようになっている。 Then, when the document ID is instructed by the user (S409), the document display control unit 22 transmits the document ID to the communication unit 101 of the search device 1 (S411). In practice, the document ID can be designated by clicking the position of the document ID and the title.

図２０に示すように、本実施の形態では、ラベル指示後においては指示前よりも、文書ＩＤと題名の数が減っているので、ユーザは容易に指示することができる。 As shown in FIG. 20, in the present embodiment, after the label instruction, the number of document IDs and titles is smaller than before the instruction, so the user can easily instruct.

検索装置１の通信部１０１は、送信された文書ＩＤを要求処理部１０２に与える。要求処理部１０２は、与えられた文書ＩＤを文書検索部１０９に与える。文書検索部１０９は、与えられた文書ＩＤの文書を読み出して要求処理部１０２に返却する。 The communication unit 101 of the search device 1 gives the transmitted document ID to the request processing unit 102. The request processing unit 102 gives the given document ID to the document search unit 109. The document search unit 109 reads the document with the given document ID and returns it to the request processing unit 102.

要求処理部１０２は、返却された文書を通信部１０１に与え、通信部１０１はそれをブラウザ２に送信する。 The request processing unit 102 gives the returned document to the communication unit 101, and the communication unit 101 transmits it to the browser 2.

ブラウザ２の文書表示制御部２２は、送信された文書を表示させる。 The document display control unit 22 of the browser 2 displays the transmitted document.

As described above, in the search device 1 of the first embodiment, the document search unit 109 (document search means) retrieves documents from the document DB 106 (document storage means). The label suitability calculation unit 114, which forms part of the label selection means, calculates how suitable each attribute value contained in the retrieved documents is as a document label. The label determination unit 115, which also forms part of the label selection means, selects as labels, in descending order of suitability, fewer attribute values than the number of those attribute values, and label information indicating the selected labels is stored. The request processing unit 102, which forms part of the document display control means, reads the label information and sends it to the browser 2 so that the labels are displayed. When the user designates a label, the document search unit 109, which forms part of the document display control means, reads from the document DB 106 the documents that contain the label and are also among the retrieved documents, and the request processing unit 102 sends them to the browser 2 for display. Labels therefore need not be prepared in advance, and the number of labels can be kept small. As a result, neither setting changes by an administrator in response to document updates nor cumbersome operations by the user are required.

また、ラベルの選択及び表示が属性名ごとに行われるように制御するので、ラベルの数を少なくして表示させることを属性名ごとに行うことができる。 Further, since control is performed so that selection and display of labels is performed for each attribute name, it is possible to perform display for each attribute name by reducing the number of labels.

また、統計情報を記憶しておくことにより、検索時に統計情報を生成する必要がなくなるので、検索時に迅速な処理が行える。 In addition, by storing the statistical information, it is not necessary to generate the statistical information at the time of searching, so that quick processing can be performed at the time of searching.

また、正規化部１０７が属性値を正規化するので、正規化されたラベルを表示させることができる。 Moreover, since the normalization part 107 normalizes an attribute value, the normalized label can be displayed.

[Second Embodiment]
Next, a second embodiment to which the present invention is applied is described. Components and processes identical to those of the first embodiment are given the same reference numerals and step numbers.

図２１は、第２の実施の形態の装置構成を示すブロック図である。 FIG. 21 is a block diagram illustrating an apparatus configuration according to the second embodiment.

In addition to the components of the search device 1, the search device 10 includes a document vector generation unit 117 that generates, for each document stored in the document DB 106, a document vector holding the number of occurrences of each attribute value in that document (hereinafter referred to as vector elements), and a document vector DB 118 in which the generated document vectors are stored. A vector element may instead be binary information indicating the presence or absence of the attribute value.

The search device 10 also includes a cluster vector generation unit 119 that generates a cluster vector consisting of the vector elements for each attribute value contained in one piece of cluster information, a similarity calculation unit 120 that calculates the similarity in content between a cluster and a document not included in that cluster, and a cluster expansion unit 121 that adds document IDs to the cluster information according to the calculated similarity.

[Pre-search process]
FIG. 22 is a flowchart illustrating the processing that the search device 10 performs before a search.

Ｓ１０９の処理が終了すると、文書ベクトル生成部１１７は、文書ベクトルを文書ＤＢ１０６に格納された文書ごとに生成し、これを文書ベクトルＤＢ１１８に格納する（Ｓ１１１）。 When the processing of S109 ends, the document vector generation unit 117 generates a document vector for each document stored in the document DB 106, and stores this in the document vector DB 118 (S111).

図２３に示すように、１つの文書ベクトルは、１つの文書に含まれる属性値の数（ベクトル要素という）を属性値ごとに含むものである。 As shown in FIG. 23, one document vector includes the number of attribute values (referred to as vector elements) included in one document for each attribute value.
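
A document vector of this kind can be sketched as a sparse mapping from attribute value to count. The function name and the `binary` flag are hypothetical; the flag corresponds to the option of using presence or absence instead of counts.

```python
from collections import Counter


def document_vector(attribute_values, binary=False):
    """Attribute value -> number of occurrences in one document (or 1 if binary)."""
    counts = Counter(attribute_values)
    if binary:
        return {value: 1 for value in counts}
    return dict(counts)


vec = document_vector(["IAEA", "IAEA", "economy"])
vec_binary = document_vector(["IAEA", "IAEA", "economy"], binary=True)
```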

[Search process]
FIG. 24 is a flowchart of the processing performed by the search device 10 after it receives a keyword.

Ｓ２１７の処理が終了すると、要求処理部１０２の要求により、クラスタベクトル生成部１１９がクラスタ情報ごとにクラスタベクトルを生成する（Ｓ２１８１）。 When the processing of S217 is completed, the cluster vector generation unit 119 generates a cluster vector for each cluster information in response to a request from the request processing unit 102 (S2181).

図２５に示すように、１つのクラスタベクトルは、１つのクラスタに含まれる属性値ごとのベクトル要素からなるものである。 As shown in FIG. 25, one cluster vector is composed of vector elements for each attribute value included in one cluster.

クラスタベクトルは、式（７）により求めることができる。

The cluster vector can be obtained by Expression (7).

つまり、クラスタに含まれる文書のベクトル和を求めることによりクラスタベクトルが生成される。 That is, a cluster vector is generated by obtaining a vector sum of documents included in the cluster.

Then, the following steps S2183 and S2185 are performed for every combination of a cluster and a document not included in that cluster.

まず、類似度算出部１２０が、クラスタとクラスタに含まれない文書の内容についての類似度を算出する（Ｓ２１８３）。類似度は、例えば、Ｓ２１８１で求めたクラスタベクトルとクラスタに含まれない文書の文書ベクトルとの余弦尺度とすることができる。 First, the similarity calculation unit 120 calculates the similarity between the clusters and the contents of the document not included in the clusters (S2183). The similarity can be, for example, a cosine measure between the cluster vector obtained in S2181 and the document vector of a document not included in the cluster.

そして、クラスタ拡張部１２１が、算出された類似度によりクラスタ情報に文書ＩＤ及び題名を加える（Ｓ２１８５）。例えば、Ｓ２１８３で算出された類似度が、設定ファイル１０３に書き込まれたしきい値を越えていれば、該当の文書ＩＤと題名をクラスタ情報に加えるようにする。 Then, the cluster expansion unit 121 adds a document ID and a title to the cluster information based on the calculated similarity (S2185). For example, if the similarity calculated in S2183 exceeds the threshold value written in the setting file 103, the corresponding document ID and title are added to the cluster information.
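
Steps S2181 to S2185 can be sketched with sparse vectors as follows. The vector sum and the cosine measure follow the description; the function names, the `candidate_ids` parameter (which documents are considered for addition), and the strict "greater than" comparison against the threshold are assumptions.

```python
import math
from collections import Counter


def cluster_vector(doc_vectors, member_ids):
    """S2181: vector sum of the document vectors of the cluster members."""
    total = Counter()
    for doc_id in member_ids:
        total.update(doc_vectors[doc_id])  # adds counts element-wise
    return dict(total)


def cosine(u, v):
    """Cosine measure between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def expand_cluster(doc_vectors, member_ids, candidate_ids, threshold):
    """S2183 and S2185: add non-member documents whose similarity exceeds the threshold."""
    centroid = cluster_vector(doc_vectors, member_ids)
    added = [doc_id for doc_id in candidate_ids
             if doc_id not in member_ids
             and cosine(centroid, doc_vectors[doc_id]) > threshold]
    return list(member_ids) + added


vectors = {
    "001": {"IAEA": 2, "economy": 1},
    "002": {"IAEA": 1},
    "003": {"sports": 3},
    "004": {"IAEA": 1, "economy": 1},
}
expanded = expand_cluster(vectors, ["001", "002"], ["003", "004"], 0.5)
```

The cluster vector is computed once from the original members, so the order in which candidates are examined does not affect the result.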

なお、これ以降は、第１の実施の形態におけるＳ２１９以降の処理が行われる。 After this, the processes after S219 in the first embodiment are performed.

As described above, in the search device 10 of the second embodiment, the cluster information generation unit 116 (cluster information generation means) generates cluster information indicating the documents that contain a label and are also among the retrieved documents. The similarity calculation unit 120, which forms part of the cluster information changing means, calculates the similarity between the documents indicated by the cluster information and a document not indicated by it. When this similarity is high, the cluster expansion unit 121, which also forms part of the cluster information changing means, changes the cluster information so that it indicates the latter document as well. Therefore, in addition to making setting changes by an administrator in response to document updates and cumbersome operations by the user unnecessary, the number of documents displayed when one of the few displayed labels is designated can be increased, which raises the likelihood that the desired document can be displayed.

次に、文書生成部１０５を説明する。ここでは、ニュース記事に限らない一般的な文書を例にして説明する。 Next, the document generation unit 105 will be described. Here, a general document that is not limited to a news article will be described as an example.

FIG. 26 is a block diagram showing a configuration example of a specific expression extraction rule generation system applied to the document generation unit 105 and of a specific expression extraction device provided with that system. FIG. 27 is a block diagram showing a hardware configuration example of the specific expression extraction rule generation system and the specific expression extraction device of FIG. 26. The document generation unit 105 consists of the blocks of FIG. 26 other than the new document A11, which corresponds to the untagged document, and generates a tagged document using the specific expressions contained in the list of specific expressions A13 as attribute values. This removes the need to designate attribute values for an untagged document.

In FIG. 27, 1021 is a display device such as a CRT or LCD; 1022 is an input device such as a keyboard or mouse; 1023 is an external storage device such as an HDD (Hard Disk Drive); 1024 is an information processing device that includes a CPU (Central Processing Unit) 1024a, a main memory 1024b, and the like, and performs stored-program computer processing; 1025 is an optical disc such as a CD-ROM (Compact Disc-Read Only Memory) or DVD (Digital Video Disc/Digital Versatile Disc) on which programs and data are recorded; 1026 is a drive device for reading the programs and data recorded on the optical disc 1025; and 1027 is a communication device such as a LAN (Local Area Network) card or modem.

光ディスク１０２５に格納されたプログラムおよびデータを情報処理装置１０２４により駆動装置１０２６を介して外部記憶装置１０２３内にインストールした後、外部記憶装置１０２３から主メモリ１０２４ｂに読み込みＣＰＵ１０２４ａで処理することにより、情報処理装置１０２４内に図２６に示す固有表現抽出規則生成システムおよびそれを具備した固有表現抽出装置が構成される。 The program and data stored in the optical disk 1025 are installed in the external storage device 1023 by the information processing device 1024 via the drive device 1026, and then read from the external storage device 1023 into the main memory 1024b and processed by the CPU 1024a. In the apparatus 1024, a specific expression extraction rule generation system shown in FIG. 26 and a specific expression extraction apparatus including the system are configured.

図２６の固有表現抽出装置においては、訓練用文書Ａ１と、正解リストＡ２、固有表現抽出規則群Ａ５、改良後固有表現抽出規則群Ａ５ａ、訓練用記録Ａ７、タグ無し文書に相当する新規文書Ａ１１、および、抽出された固有表現のリストＡ１３のそれぞれは、図２７における外部記憶装置１０２３もしくは主メモリ１０２４ｂ等に格納され、また、形態素解析・品詞文字種付与部Ａ３と、規則生成部Ａ４、訓練用規則適用部Ａ６、規則評価部Ａ８、規則削除部Ａ９、規則精錬部Ａ１０、実施用規則適用部Ａ１２のそれぞれは、図２７におけるＣＤ−ＲＯＭ１０２５に格納されたプログラムに基づき情報処理装置１０２４内に構成される。 In the specific expression extraction device of FIG. 26, the training document A1, the correct answer list A2, the specific expression extraction rule group A5, the improved specific expression extraction rule group A5a, the training record A7, the new document A11 corresponding to an untagged document, and the list A13 of extracted specific expressions are each stored in the external storage device 1023, the main memory 1024b, or the like in FIG. 27. The morphological analysis / part-of-speech character type assigning unit A3, the rule generation unit A4, the training rule application unit A6, the rule evaluation unit A8, the rule deletion unit A9, the rule refining unit A10, and the implementation rule application unit A12 are each configured within the information processing device 1024 based on the program stored on the CD-ROM 1025 in FIG. 27.

そして、形態素解析・品詞文字種付与部Ａ３と、規則生成部Ａ４、訓練用規則適用部Ａ６、規則評価部Ａ８、規則削除部Ａ９、規則精錬部Ａ１０のそれぞれが固有表現抽出規則生成システムを構成している。 The morphological analysis / part-of-speech character type assigning unit A3, the rule generation unit A4, the training rule application unit A6, the rule evaluation unit A8, the rule deletion unit A9, and the rule refining unit A10 together constitute the specific expression extraction rule generation system.

形態素解析・品詞文字種付与部Ａ３は、訓練用文書Ａ１を単語分割して、各単語にその品詞名や構成文字種の情報を付加する。 The morphological analysis / part-of-speech character type assigning unit A3 divides the training document A1 into words, and adds information on the part-of-speech name and constituent character type to each word.

規則生成部Ａ４は、形態素解析・品詞文字種付与部Ａ３の処理で得られる単語列を正解リストＡ２で与えられる抽出すべき固有表現のデータと突き合わせて、各固有表現を構成する単語列を取り出し、これを一般化して規則を生成する。 The rule generation unit A4 matches the word string obtained by the processing of the morphological analysis / part-of-speech character type adding unit A3 with the data of the specific expression to be extracted given in the correct answer list A2, and extracts the word string constituting each specific expression, A rule is generated by generalizing this.

その結果が固有表現抽出規則群Ａ５として図２７における外部記憶装置１０２３に記録される。 The result is recorded in the external storage device 1023 in FIG. 27 as a specific expression extraction rule group A5.

訓練用規則適用部Ａ６は、規則生成部Ａ４の生成結果で得られる固有表現抽出規則群Ａ５を訓練用文書Ａ１に適用する。その結果は訓練用記録Ａ７として図２７における外部記憶装置１０２３に記録される。 The training rule application unit A6 applies the specific expression extraction rule group A5 obtained from the generation result of the rule generation unit A4 to the training document A1. The result is recorded as training record A7 in the external storage device 1023 in FIG.

規則評価部Ａ８は、訓練用記録Ａ７に基づいて各規則を評価する。規則削除部Ａ９は、規則評価部Ａ８の評価結果に基づいて、成績の悪い規則を削除する。 The rule evaluation unit A8 evaluates each rule based on the training record A7. The rule deletion unit A9 deletes a rule with a poor result based on the evaluation result of the rule evaluation unit A8.

規則精錬部Ａ１０は、成績が良くなるように規則を精錬する。 The rule refining unit A10 refines the rules so that the grades are improved.

実施用規則適用部Ａ１２は、このようにして改良された固有表現抽出規則群Ａ５（改良後固有表現抽出規則群Ａ５ａ）を、実際の新規文書Ａ１１に適用して固有表現リストＡ１３を得る。 The implementation rule application unit A12 applies the specific expression extraction rule group A5 improved in this way (the improved specific expression extraction rule group A5a) to the actual new document A11 to obtain the specific expression list A13.

訓練用規則適用部Ａ６と実施用規則適用部Ａ１２はいずれも、規則群を文書に適用して固有表現を抽出するものであり、その処理内容はほぼ同じであるため、単一の装置で両者を兼ねることも可能である。ただし、実施用規則適用部Ａ１２は、訓練用記録Ａ７を残す必要がないが、最終的な候補の選択を行なう必要がある点が異なる。 Each of the training rule application unit A6 and the implementation rule application unit A12 applies a group of rules to a document to extract a specific expression, and the processing content is almost the same. It is also possible to serve as. However, the implementation rule application unit A12 does not need to leave the training record A7, but differs in that it needs to select a final candidate.

まず、実施用規則適用部Ａ１２の動作、すなわち、本例の固有表現抽出規則生成システムで生成・改良された固有表現抽出規則群Ａ５、改良後固有表現抽出規則群Ａ５ａを用いた固有表現抽出装置としての動作を説明する。 First, the operation of the implementation rule application unit A12, that is, the specific expression extraction apparatus using the specific expression extraction rule group A5 generated and improved by the specific expression extraction rule generation system of this example, and the improved specific expression extraction rule group A5a The operation will be described.

実施用規則適用部Ａ１２は、固有表現を抽出したい新規文書Ａ１１に対して、改良後固有表現抽出規則群Ａ５ａを適用して、文書中に含まれる固有表現を抽出して固有表現リストＡ１３を出力する。 The implementation rule application unit A12 applies the improved specific expression extraction rule group A5a to the new document A11 from which a specific expression is to be extracted, extracts the specific expressions included in the document, and outputs the specific expression list A13. To do.

例えば、「田中太郎賞選考委員会では、・・・」という新規文書Ａ１１があるとすると、この文書中の固有表現として、「田中」、「太郎」、「田中太郎」という人名の候補と、「田中太郎賞」という人工物名の候補、さらに、「田中太郎賞選考委員会」という組織名の候補が考えられるが、一般には、その内で一番長い「田中太郎賞選考委員会」だけが固有表現として抽出され出力されることが望まれる場合が多く、この場合、これと重なっている「田中」や「太郎」などの他の候補（固有表現）は出力されるべきでない。 For example, if there is a new document A11 “In the Taro Tanaka Prize Selection Committee,” the candidate names of “Tanaka”, “Taro”, and “Taro Tanaka” are listed as specific expressions in this document. A candidate for the artifact name “Taro Tanaka Award” and a candidate for the organization name “Taro Tanaka Award Selection Committee” are considered, but in general, only the longest “Taro Tanaka Award Selection Committee” Is often extracted and output as a specific expression. In this case, other candidates (specific expressions) such as “Tanaka” and “Taro” that overlap this should not be output.

このような候補間の関係は、重なりに起因する競合関係と、各候補の優先順位による抑制関係に還元することができる。つまり、「田中太郎賞選考委員会」は「田中」などの他の候補と重なっているがために競合し、長い「田中太郎賞選考委員会」の優先順位が高く、短い他の候補を抑制していると考えることができる。 Such relationships between candidates can be reduced to a competition relationship arising from overlap and a suppression relationship based on each candidate's priority order. That is, “Taro Tanaka Award Selection Committee” competes with other candidates such as “Tanaka” because it overlaps them, and the longer “Taro Tanaka Award Selection Committee” can be regarded as having the higher priority and suppressing the other, shorter candidates.

本例においては、実施用規則適用部Ａ１２は、この考え方に基づき、まず、全ての規則を文書に適用することで、全ての固有表現の候補の集合(「田中」や「太郎」、「田中太郎」、「田中太郎賞」、「田中太郎賞選考委員会」などを含む)を求める。次に、これらの候補の中で同じ固有表現（上の各候補においては「田中」）が最初に現れるものの内で一番長いもの(上の各候補においては「田中太郎賞選考委員会」)を出力する。 In this example, based on this idea, the implementation rule application unit A12 first applies all the rules to the document to obtain the set of all specific expression candidates (including “Tanaka”, “Taro”, “Taro Tanaka”, “Taro Tanaka Award”, and “Taro Tanaka Award Selection Committee”). Next, among the candidates in which the same specific expression (“Tanaka” in the candidates above) appears first, it outputs the longest one (“Taro Tanaka Award Selection Committee” in the candidates above).

このようにして一つの候補が出力されると、この候補と競合している他の全ての候補（「田中」、「田中太郎」、「田中太郎賞」）を候補の集合から削除する。候補の集合が空になるまで、この作業を繰り返すことにより、固有表現のリストＡ１３が得られる。 When one candidate is output in this way, all other candidates competing with this candidate (“Tanaka”, “Tanaka Taro”, “Tanaka Taro Award”) are deleted from the candidate set. By repeating this operation until the candidate set becomes empty, a list A13 of specific expressions is obtained.
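
The selection loop described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the `Candidate` type, the `overlaps` and `select` helpers, the English type labels, and the character offsets are all assumptions introduced here, and only start position and length are considered at this stage.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    start: int  # index of the first character (inclusive)
    end: int    # index of the last character (inclusive)
    kind: str   # hypothetical type label, e.g. "person"
    text: str


def overlaps(a: Candidate, b: Candidate) -> bool:
    """Two candidates compete when their character spans intersect."""
    return a.start <= b.end and b.start <= a.end


def select(candidates: list[Candidate]) -> list[Candidate]:
    """Output the earliest-starting, longest candidate, drop its
    competitors, and repeat until the candidate set is empty."""
    pool = list(candidates)
    output = []
    while pool:
        best = min(pool, key=lambda c: (c.start, -c.end))
        output.append(best)
        # best overlaps itself, so it is removed from the pool as well
        pool = [c for c in pool if not overlaps(c, best)]
    return output


# Candidates for the text "田中太郎賞選考委員会では" (offsets assumed)
candidates = [
    Candidate(0, 1, "person", "田中"),
    Candidate(2, 3, "person", "太郎"),
    Candidate(0, 3, "person", "田中太郎"),
    Candidate(0, 4, "artifact", "田中太郎賞"),
    Candidate(0, 9, "organization", "田中太郎賞選考委員会"),
]
result = select(candidates)
```

With these inputs only the organization name survives, because every other candidate overlaps it.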

ただし、このように長さだけに着目して、各々競合する各候補からの選択の判断を行うだけでは、同じ長さの複数の候補がある場合に判断に困る。 However, if there is a plurality of candidates having the same length, it is difficult to make a determination only by determining the selection from each competing candidate by paying attention only to the length.

例えば「ホワイトハウス」は、地名と考えられる場合と組織名と考えられる場合があるので、同じ「ホワイトハウス」という文字列を地名とする候補と、組織名とする候補とが考えられる。 For example, “White House” may be considered as a place name or an organization name, and therefore, a candidate having the same character string “White House” as a place name and a candidate as an organization name can be considered.

そこで、この２つの候補の間に、抽出するための優先順位を設ける。 Therefore, a priority order for extraction is set between the two candidates.

例えば、その前後の単語を考慮して、「ホワイトハウスの近くの公園で・・・」であれば地名の可能性が高く、「ホワイトハウスによれば・・・」であれば、組織名の可能性が高い。また、例えば、その出現頻度を考慮して、訓練用文書Ａ１に「ホワイトハウス」が地名として出現しているのが１回で、組織名として出現しているのが２０回とすれば、組織名と判断した方が正解する可能性が高い。 For example, taking the surrounding words into account, “in a park near the White House ...” most likely refers to a place name, while “according to the White House ...” most likely refers to an organization name. Alternatively, taking frequency of occurrence into account, if “White House” appears once as a place name and 20 times as an organization name in the training document A1, judging it to be an organization name is more likely to be correct.

本例では、改良後固有表現抽出規則群Ａ５ａにおける各規則には、これらの条件を加味した優先度が付与されている。 In this example, each rule in the improved specific expression extraction rule group A5a is given a priority in consideration of these conditions.

実施用規則適用部Ａ１２は、このような優先度と、前述の固有表現の長さとを組み合わせて、各候補の優先順位を計算する。この優先順位の設定としてはさまざまな変種が考えられるが、上記のように、開始位置が一番早いものの中で、さらに終了位置が一番遅いものの内、優先度が一番高いものを選ぶのが明快である。つまり、候補の優先関係については、以下のような定義を基本とする。 The implementation rule application unit A12 computes the priority order of each candidate by combining this priority with the length of the specific expression described above. Many variants of this ordering are conceivable, but the clearest is, as described above, to take the candidates with the earliest start position, then among them those with the latest end position, and finally among those the one with the highest priority. That is, the precedence relation between candidates is based on the following definition.

■候補Ａの開始位置が候補Ｂの開始位置より早い(数字として小さい)ならば、候補Ａの方が優先される。 ■ If the start position of candidate A is earlier (numerically smaller) than the start position of candidate B, candidate A takes precedence.

■候補Ａの開始位置と候補Ｂの開始位置が同じであれば、終了位置が遅い(数字として大きい)候補が優先される。 ■ If candidates A and B have the same start position, the candidate with the later (numerically larger) end position takes precedence.

■両候補の開始位置と終了位置が全く同じであれば、予め規則で与えられた優先度ｕの大きい候補が優先される。 ■ If both candidates have exactly the same start and end positions, the candidate with the larger priority u, given in advance by its rule, takes precedence.
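
The three precedence conditions above amount to a lexicographic ordering, which can be sketched as a sort key. The `RankedCandidate` type, the `precedence_key` and `pick_best` names, and the offsets are hypothetical; the priority values 1 and 20 reuse the “White House” counts mentioned earlier purely for illustration.

```python
from typing import NamedTuple


class RankedCandidate(NamedTuple):
    start: int     # first character index (inclusive)
    end: int       # last character index (inclusive)
    kind: str      # hypothetical type label
    priority: int  # the rule-supplied priority u


def precedence_key(c: RankedCandidate) -> tuple:
    """Smaller key means higher precedence: earlier start,
    then later end, then larger priority u."""
    return (c.start, -c.end, -c.priority)


def pick_best(candidates: list[RankedCandidate]) -> RankedCandidate:
    return min(candidates, key=precedence_key)


# "ホワイトハウス" read as a place name (u = 1) or an organization (u = 20)
same_span = [
    RankedCandidate(0, 6, "place", 1),
    RankedCandidate(0, 6, "organization", 20),
]
best = pick_best(same_span)
```

Encoding the ordering as a tuple keeps the three conditions in one place, so a variant ordering only requires changing the key.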

本例の固有表現抽出規則生成システムでは、このような実施用規則適用部Ａ１２による処理を容易とする固有表現抽出規則群Ａ５および改良後固有表現抽出規則群Ａ５ａを生成する。以下、このような優先関係を加味した規則の生成処理に係わる固有表現抽出規則生成システムを構成する各部の動作について説明する。 In the specific expression extraction rule generation system of this example, a specific expression extraction rule group A5 and an improved specific expression extraction rule group A5a that facilitate processing by the implementation rule application unit A12 are generated. In the following, the operation of each part constituting the specific expression extraction rule generation system related to the rule generation processing taking such priority relationships into account will be described.

まず、形態素解析・品詞文字種付与部Ａ３において、文書を単語列に分割する。典型的には形態素解析機能を有し、訓練用文書Ａ１や新規文書Ａ１１などの与えられた文書を単語分割して、各単語に品詞名とその単語を構成する文字の種類（構成文字種情報）を付与したデータ構造を作り、そのリストを作成する。 First, the morphological analysis / part of speech character type assigning unit A3 divides the document into word strings. Typically, it has a morphological analysis function, and a given document such as a training document A1 or a new document A11 is divided into words, and the part of speech name and the type of characters constituting the word (constituent character type information) Create a data structure with a list and create a list.

例えば、「東京製鉄の中野社長は・・・」という文があると、形態素解析により「東京」は固有名詞、「製鉄」は普通名詞、「の」は助詞、「中野」は固有名詞、「社長」は普通名詞、「は」は助詞、という結果が得られる。 For example, if there is a sentence “President Nakano is ...”, “Tokyo” is a proper noun, “Iron” is a common noun, “no” is a particle, “Nakano” is a proper noun, The result is that “President” is a common noun, and “Ha” is a particle.

また、「東京」は複数の漢字で構成されており、「の」はひらがなである。従って、形態素解析・品詞文字種付与部Ａ３は、この文に対して、例えば以下のようなデータ構造からなるリストを出力する。［(東京，複数漢字，固有名詞)、(製鉄，複数漢字，普通名詞)、(の，ひらがな，助詞)、・・・］ “Tokyo” (東京) is composed of multiple kanji characters, and “no” (の) is hiragana. Therefore, for this sentence, the morphological analysis / part-of-speech character type assigning unit A3 outputs a list of data structures such as the following. [(東京, multiple kanji, proper noun), (製鉄, multiple kanji, common noun), (の, hiragana, particle), ...]

一方、正解リストＡ２は、訓練用文書Ａ１の中のどの位置にどのような種類の固有表現が含まれているかを列挙したものであり、「東京製鉄の中野社長は・・・」という訓練用文書Ａ１に対応して予め用意される正解リストＡ２は、例えば、次のようなデータからなる。 On the other hand, the correct answer list A2 enumerates which types of specific expressions are contained at which positions in the training document A1. The correct answer list A2 prepared in advance for the training document A1 “東京製鉄の中野社長は・・・” consists of, for example, the following data.

このリストにおいて、最初の行は、この文書の「０文字目から３文字目の位置」が「東京製鉄」という「組織名」をその種類とする固有表現であり、次の行は「５文字目から６文字目の位置」が「中野」という「人名」をその種類とする固有表現であることを示している。このように、本例の正解リストＡ２においては、各固有表現の開始位置と終了位置を示す数字の対で、当該固有表現の位置を略称する。 In this list, the first line is a unique expression whose type is “organization name” whose “position from the 0th character to the 3rd character” of this document is “Tokyo Steel,” and the next line is “5 characters. This indicates that the “position of the sixth character from the eye” is a unique expression of “person name” of “Nakano” as its type. Thus, in the correct answer list A2 of this example, the position of the specific expression is abbreviated as a pair of numbers indicating the start position and end position of each specific expression.
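
To make the two inputs concrete, the token list and the correct answer list for this sentence can be written down as plain Python data. The tuple layouts, the English labels, the character types of the last three tokens, and the helpers `token_spans` and `tokens_for` are assumptions made for this sketch; the positions (0 to 3, 5 to 6) and types come from the description above.

```python
text = "東京製鉄の中野社長は"

# (word, character type, part of speech), in the manner of unit A3's output
tokens = [
    ("東京", "multi_kanji", "proper_noun"),
    ("製鉄", "multi_kanji", "common_noun"),
    ("の", "hiragana", "particle"),
    ("中野", "multi_kanji", "proper_noun"),
    ("社長", "multi_kanji", "common_noun"),
    ("は", "hiragana", "particle"),
]

# (start, end, type): inclusive character positions, as in answer list A2
answer_list = [
    (0, 3, "organization"),
    (5, 6, "person"),
]


def token_spans(tokens):
    """Inclusive (start, end) character span of each token."""
    spans, pos = [], 0
    for word, _ctype, _tag in tokens:
        spans.append((pos, pos + len(word) - 1))
        pos += len(word)
    return spans


def tokens_for(entry, tokens):
    """Tokens whose span overlaps the given answer-list entry."""
    start, end, _kind = entry
    return [tok for tok, (s, e) in zip(tokens, token_spans(tokens))
            if s <= end and start <= e]
```

Matching an answer entry against the tokens in this way yields the word string from which the rule generation unit A4 starts.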

規則生成部Ａ４は、このような正解リストＡ２と、形態素解析・品詞文字種付与部Ａ３の出力する単語列とを突き合わせて、固有表現を変数化等して、例えば、次のような固有表現の抽出規則を生成する。anytag(３) <-- <＠(組織名，２１)，word(_，複数漢字，固有名詞)，word(製鉄，複数漢字，普通名詞)，>＠(組織名)．この規則（ルール）は、番号「２１」が付与された規則であり、任意の（変数化された）漢字の固有名詞があり（「word(_，複数漢字，固有名詞)」）、その次の単語が「製鉄」という複数漢字の普通名詞であれば（「word(製鉄，複数漢字，普通名詞)」）、その２単語が、「組織名」の固有表現の候補として考えられるという意味の規則である。 The rule generation unit A4 matches this correct answer list A2 against the word string output by the morphological analysis / part-of-speech character type assigning unit A3, converts parts of the specific expression into variables, and generates, for example, the following extraction rule. anytag(3) <-- <@(organization name, 21), word(_, multiple kanji, proper noun), word(製鉄, multiple kanji, common noun), >@(organization name). This rule is numbered “21”. It means that if there is an arbitrary (variabilized) proper noun written in multiple kanji (“word(_, multiple kanji, proper noun)”) and the next word is the multiple-kanji common noun 製鉄 (“steel”) (“word(製鉄, multiple kanji, common noun)”), then those two words are considered a candidate for a specific expression of type “organization name”.
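
A rule of this shape can be represented as a token pattern in which `None` stands for the variable `_`. The sketch below is an assumed encoding, not the notation's actual interpreter: `Rule`, `matches_at`, and `find_candidates` are hypothetical names, and the priority value 3 is read off `anytag(3)` by analogy with the general form `anytag(u)` given later in the text.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    number: int     # rule number k
    kind: str       # type t of the specific expression
    priority: int   # priority u
    pattern: tuple  # ((word or None, char_type, pos), ...); None = "_"


RULE_21 = Rule(
    number=21,
    kind="organization",
    priority=3,
    pattern=(
        (None, "multi_kanji", "proper_noun"),
        ("製鉄", "multi_kanji", "common_noun"),
    ),
)


def matches_at(rule: Rule, tokens: list, i: int) -> bool:
    """True if the rule's pattern matches the tokens starting at index i."""
    window = tokens[i:i + len(rule.pattern)]
    if len(window) < len(rule.pattern):
        return False
    return all(
        (pw is None or pw == w) and pc == c and pp == p
        for (pw, pc, pp), (w, c, p) in zip(rule.pattern, window)
    )


def find_candidates(rule: Rule, tokens: list) -> list:
    """(first token index, last token index, type, rule number) per match."""
    return [
        (i, i + len(rule.pattern) - 1, rule.kind, rule.number)
        for i in range(len(tokens))
        if matches_at(rule, tokens, i)
    ]


tokens = [
    ("東京", "multi_kanji", "proper_noun"),
    ("製鉄", "multi_kanji", "common_noun"),
    ("の", "hiragana", "particle"),
    ("中野", "multi_kanji", "proper_noun"),
]
found = find_candidates(RULE_21, tokens)
```

Because the first word is a variable, the same rule also matches any other multiple-kanji proper noun followed by 製鉄.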

このような規則（ルール）の生成は、より一般的には以下のように表せる。まず、固有表現は、Ｎ＋１単語［(ｗ0，ｃ0，ｐ0)，・・・，(ｗi，ｃi，ｐi)，・・・，(ｗN，ｃN，ｐN)］でできているとする。ここでｗiは単語（「製鉄」、「中野」など）、ｃiは構成文字種（「複数漢字」や「数字」など）、ｐiは品詞名（「固有名詞」、「普通名詞」など）である。 Generation of such a rule (rule) can be expressed more generally as follows. First, it is assumed that the unique expression is made up of N + 1 words [(w0, c0, p0),..., (Wi, ci, pi), ..., (wN, cN, pN)]. Here, w i is a word (such as “steel” or “Nakano”), c i is a constituent character type (such as “multiple kanji” or “number”), and p i is a part of speech name (such as “proper noun” or “common noun”). .

実際には、前後の幾つかの単語も、固有表現かどうかを判断するのに重要な手がかりとなるので、含めて考えるのが一般的であるが、ここでは単純化して、固有表現に含まれる単語だけを考える。 Actually, some words before and after are also important clues for judging whether they are proper expressions, so it is common to consider them, but here they are simplified and included in the proper expressions Think only of words.

次に、このような単語列から、最小汎化などの既存の一般化技術を用いることによって、規則（ルール）を生成する。しかし、本例では、次のようにして簡単に生成する。 Next, a rule is generated from such a word string by using an existing generalization technique such as minimum generalization. However, in this example, it is easily generated as follows.

すなわち、訓練用文書Ａ１に含まれる固有表現を構成する具体的な単語列［(ｗ0，ｃ0，ｐ0)，・・・，(ｗi，ｃi，ｐi)，・・・，(ｗN，ｃN，ｐN)］に、以下に述べる経験則を適用して、変数を含むリスト［(ｗ0'，ｃ0'，ｐ0')，・・・，(ｗi'，ｃi'，ｐi')，・・・，(ｗN'，ｃN'，ｐN')］を得て、次のような規則を作る。anytag(ｕ) <-- <＠(ｔ＋ｄｆ，ｋ)，word(ｗ0'，ｃ0'，ｐ0')，・・・，(ｗi'，ｃi'，ｐi')，・・・，word(ｗN'，ｃN'，ｐN')，>＠(ｔ−ｄｔ)． That is, the heuristics described below are applied to the concrete word string [(w0, c0, p0), ..., (wi, ci, pi), ..., (wN, cN, pN)] constituting a specific expression contained in the training document A1, yielding a list containing variables [(w0', c0', p0'), ..., (wi', ci', pi'), ..., (wN', cN', pN')], from which the following rule is created. anytag(u) <-- <@(t+df, k), word(w0', c0', p0'), ..., (wi', ci', pi'), ..., word(wN', cN', pN'), >@(t−dt).

ここで「ｔ」は、固有表現の種類（例えば「組織名」）を表す。 Here, “t” represents the type of the specific expression (for example, “organization name”).

「＋ｄｆ」は、この固有表現の開始位置を何文字右にずらすかを表し、最初の単語の文字数未満の非負整数である。また、「−ｄｔ」は固有表現の終了位置を何文字左にずらすかを表し、最後の単語の文字数未満の非負整数である。 “+ Df” represents the number of characters to which the start position of this specific expression is shifted to the right, and is a non-negative integer less than the number of characters of the first word. Further, “−dt” represents how many characters the end position of the specific expression is shifted to the left, and is a non-negative integer less than the number of characters of the last word.

例えば、「厚木市内で・・・」という訓練用文書Ａ１があり、正解リストＡ２によればこの内の「厚木市」が地名であるにもかかわらず、形態素解析・品詞文字種付与部Ａ３の形態素解析で、「厚木」、「市内」、「で」というように単語分割された場合、固有表現を構成する単語列は、［(厚木，複数漢字，固有名詞)，（市内、複数漢字、普通名詞)］となり、最後の１文字（「内」）が余分である。そこで終了位置を一文字左にずらすために、「ｄｔ＝１」とする。尚、開始位置はずらさないので、「ｄｆ=０」である。また、上述の規則（ルール）における「ｋ」は、この規則につけられた番号であり、「ｕ」はこの規則の優先度である。各変数を含むデータ(ｗi'，ｃi'，ｐi')は、訓練用文書Ａ１に含まれる具体的な固有表現に対応するデータ(ｗi，ｃi，ｐi)に対して、以下の経験則を、上から順に調べ、最初に当てはまったものを適用することによって得る。 For example, there is a training document A1 “in Atsugi City ...”, and according to the correct answer list A2, “Atsugi City” is a place name, but the morphological analysis / part-of-speech character type assigning part A3 In the morphological analysis, when words are divided into “Atsugi”, “city”, and “de”, the word string constituting the proper expression is [(Atsugi, multiple kanji, proper noun), (city, multiple Kanji, common noun)], and the last character (“inside”) is extra. Therefore, “dt = 1” is set in order to shift the end position to the left by one character. Since the start position is not shifted, “df = 0”. In the above rule (rule), “k” is a number given to this rule, and “u” is the priority of this rule. The data (wi ', ci', pi ') including each variable has the following empirical rules for the data (wi, ci, pi) corresponding to the specific specific expression included in the training document A1, It is obtained by examining in order from the top and applying the first fit.

■「ｉ」が「０」か「Ｎ」で、固有表現の境界を含む場合（ｄｆ＞０またはｄｔ＞０）は、これらを変数化しない。規則（ルール）の「ｄｆ」と「ｄｔ」は、元になった固有表現に対する値をそのまま利用する。 ■ When “i” is “0” or “N” and the word contains a boundary of the specific expression (df > 0 or dt > 0), the word is not converted into a variable. The rule's “df” and “dt” take the values of the original specific expression as they are.

■数字の場合は「ｗi」を変数化する。 ■ In the case of numbers, “wi” is converted into a variable.

■固有名詞の場合は「ｗi」を変数化する。 ■ In the case of proper nouns, variable “wi”.

■リストの最後の単語か、記号・単漢字・接尾語・接頭語・助詞などの機能語であれば、変数化しない。 ■ If it is the last word in the list or a functional word such as a symbol, single kanji, suffix, prefix, or particle, it is not converted into a variable.

■それ以外であれば「ｗi」を変数化する。 ■ Otherwise, “wi” is made variable.

各固有表現に対して以上の処理を適用することにより、固有表現抽出規則群Ａ５を自動的に生成することができる。 By applying the above processing to each specific expression, the specific expression extraction rule group A5 can be automatically generated.
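
The ordered heuristics above can be sketched as a single function that decides, token by token, whether the word is kept or replaced by a variable (`None`). The label strings and the split between character-type labels and part-of-speech labels are assumptions of this sketch; the order of the checks follows the list above.

```python
# Hypothetical labels for the function-word classes named in the text
FUNCTION_TAGS = {"symbol", "suffix", "prefix", "particle"}


def generalize(tokens, df=0, dt=0):
    """Apply the heuristics in order; the first one that fits wins.
    tokens: [(word, char_type, pos), ...] for one specific expression."""
    last = len(tokens) - 1
    pattern = []
    for i, (word, ctype, tag) in enumerate(tokens):
        if (i == 0 and df > 0) or (i == last and dt > 0):
            keep = True    # word straddles an expression boundary
        elif ctype == "digits":
            keep = False   # numbers are variabilized
        elif tag == "proper_noun":
            keep = False   # proper nouns are variabilized
        elif i == last or ctype == "single_kanji" or tag in FUNCTION_TAGS:
            keep = True    # last word or function word stays literal
        else:
            keep = False   # everything else is variabilized
        pattern.append((word if keep else None, ctype, tag))
    return pattern


steel = generalize([
    ("東京", "multi_kanji", "proper_noun"),
    ("製鉄", "multi_kanji", "common_noun"),
])
atsugi = generalize([
    ("厚木", "multi_kanji", "proper_noun"),
    ("市内", "multi_kanji", "common_noun"),
], dt=1)
```

For the 東京製鉄 example this reproduces the pattern of rule 21: the proper noun becomes a variable and the final word 製鉄 stays literal.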

また、各規則の優先度（ｕ）としては、例えば、その規則の元になった固有表現が正解リスト中に現れる「のべ回数」を採用する。これにより、正解回数の少ない規則(前述の例では、地名としての「ホワイトハウス」)が正解回数の多い規則(組織名としての「ホワイトハウス」)を正当な理由もなく抑制してしまうことが避けられる。 Further, as the priority (u) of each rule, for example, the “total number of times” in which the unique expression that is the basis of the rule appears in the correct answer list is adopted. As a result, a rule with a small number of correct answers (in the above example, “White House” as a place name) may suppress a rule with a large number of correct answers (“White House” as an organization name) without a valid reason. can avoid.
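
As a small illustration of this choice of priority, the total count can be computed with a counter over the correct answer list. The `(surface, type)` entry format and the `rule_priorities` name are assumptions; the counts 1 and 20 are the “White House” figures used earlier in the text.

```python
from collections import Counter


def rule_priorities(answer_entries):
    """Map each (surface string, type) pair to the total number of
    times it appears in the correct answer list; used as priority u."""
    return Counter(answer_entries)


entries = [("ホワイトハウス", "place")] + [("ホワイトハウス", "organization")] * 20
u = rule_priorities(entries)
```

Here the organization reading receives u = 20 and the place reading u = 1, so the rarer reading cannot suppress the more frequent one on an identical span.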

こうして規則生成部Ａ４により得られた各規則（固有表現抽出規則群Ａ５）を、訓練用規則適用部Ａ６において、訓練用文書Ａ１の単語列に適用することにより訓練用記録Ａ７を得る。すなわち、訓練用規則適用部Ａ６では、訓練用文書Ａ１の先頭から末尾まで、各規則がマッチする位置を順に調べていき、マッチしたら、それを候補として訓練用記録Ａ７に追加することを繰り返す。 Each rule (specific expression extraction rule group A5) thus obtained by the rule generation unit A4 is applied to the word string of the training document A1 in the training rule application unit A6, thereby obtaining the training record A7. That is, in the training rule application unit A6, the positions where each rule matches are sequentially examined from the beginning to the end of the training document A1, and if they match, it is repeatedly added as a candidate to the training record A7.

訓練用記録Ａ７には、具体的には、後で各候補間の競合関係や抑制関係の比較をして、最終的な出力ができるように、ルール番号（ｋ）や、マッチした位置、固有表現の種類（ｔ）などのデータを記録しておく。 Specifically, in the training record A7, the rule number (k), the matched position, and the uniqueness are set so that the competition output and the suppression relationship between the candidates can be compared later and final output can be performed. Data such as expression type (t) is recorded.

このような訓練用規則適用部Ａ６による処理を、固有表現抽出規則群Ａ５の全ての規則に対して行ない、訓練用記録Ａ７を作り出す。 Such processing by the training rule application unit A6 is performed on all the rules of the specific expression extraction rule group A5 to create a training record A7.

尚、ボトムアップ型の構文解析を用いれば、複数の規則の適用結果を効率良く一度に得ることも可能である。 If bottom-up syntax analysis is used, it is possible to efficiently obtain application results of a plurality of rules at once.

規則評価部Ａ８は、このようにして作成された訓練用記録Ａ７を読み出して、各規則の成績を採点する。採点の仕方としては様々な基準を用いることができるが、不正解になった回数や割合による評価を用いれば簡単である。しかし、各規則の不正解回数は、厳密には、どのような規則と組み合わせて用いるかに依存するため、どの規則を採用するか未定のこの時点では、正確な数字を得られない。そこで、各規則（Ｒ）の記録を以下のように分類して考える。 The rule evaluation unit A8 reads the training record A7 created in this way, and scores the results of each rule. Various criteria can be used for scoring, but it is easy to use evaluation based on the number and ratio of incorrect answers. However, strictly speaking, the number of incorrect answers for each rule depends on which rule is used in combination with each other. Therefore, at this point in time, which rule is to be adopted is not yet determined, an accurate number cannot be obtained. Therefore, the records of each rule (R) are considered classified as follows.

（○）規則Ｒの元になった固有表現とマッチして得られた候補、つまり、他の候補に抑制されなければ必然的に正解になるもの（正解候補固有表現）。 (◯) Candidate obtained by matching with the specific expression that is the basis of the rule R, that is, one that inevitably becomes a correct answer if not suppressed by another candidate (correct candidate specific expression).

（△）競合する別の固有表現が正解リストＡ２に登録されており、それに抑制されるもの。つまり、その固有表現が正解になれば出力が抑制されるので、精度の高い規則群においては、成績を下げない可能性の高いもの（中間候補固有表現）。 (Δ) Another competing specific expression is registered in the correct answer list A2 and is suppressed thereto. In other words, since the output is suppressed when the proper expression is correct, the rule group with high accuracy has a high possibility of not lowering the grade (intermediate candidate specific expression).

（×）それ以外のもの、つまり、抑制する正解固有表現がないため、精度の高い規則群においては、間違った候補を出力して成績を下げる可能性が高いもの（不正解候補固有表現）。 (X) Other than that, that is, since there is no correct answer specific expression to be suppressed, a rule group with high accuracy is likely to output a wrong candidate and lower the grade (incorrect candidate candidate specific expression).

規則評価部Ａ８は、各規則に対して「○」、「△」、「×」の回数を数え、この「×」の回数を不正解の回数、「○」の回数を正解の回数の代用として採用する。尚、単純に「△」を全て不正解と考えると、「田中」のように短い固有表現を抽出する規則が不利になるので避けた方が良い。そのため、規則評価部Ａ８では、以下のような方法で不正解回数を数える。 The rule evaluation unit A8 counts the number of “○”, “△”, and “×” for each rule, and uses the number of “×” as a substitute for the number of incorrect answers and the number of “○” as a substitute for the number of correct answers. Simply treating every “△” as incorrect would disadvantage rules that extract short specific expressions such as “Tanaka”, so this is best avoided. The rule evaluation unit A8 therefore counts incorrect answers as follows.

すなわち、規則評価部Ａ８は、訓練用記録Ａ７を前から順に読み、規則Ｒが訓練用文書Ａ１の位置Ｌで適用されており、規則Ｒが付与する固有表現のタイプ(地名や人名などの区別)がＴであり、そのタイプＴと位置Ｌの対が正解リストＡ２に正解として含まれておらず、さらに、位置Ｌに重なる位置に正解の固有表現が存在しないか、存在しても、その正解に対応する候補より規則Ｒによる候補の方が優先順位において優位であれば、規則Ｒの不正解回数を１増やす。これを訓練用記録Ａ７の終わりに達するまで繰り返す。 That is, the rule evaluation unit A8 reads the training record A7 in order from the beginning. Suppose rule R was applied at position L of the training document A1 and the type of specific expression assigned by rule R (place name, person name, and so on) is T. If the pair of type T and position L is not included in the correct answer list A2 as a correct answer and, in addition, either no correct specific expression exists at a position overlapping L or, even if one exists, the candidate produced by rule R outranks the candidate corresponding to that correct answer in priority order, then the incorrect-answer count of rule R is incremented by 1. This is repeated until the end of the training record A7 is reached.
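
The counting procedure can be sketched as a classifier over training-record entries. The record layout `(rule number, start, end, type)`, the label names, and the helper names are assumptions. Precedence is compared on position only (earlier start, then later end), so an identical span with a different type is counted as incorrect here; that simplification is a choice of this sketch, since the priority u of the correct candidate is not modeled.

```python
from collections import Counter


def overlaps(a, b):
    """Inclusive character spans (start, end) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


def outranks(a, b):
    """Span a precedes span b: earlier start, then later end."""
    return (a[0], -a[1]) < (b[0], -b[1])


def classify(record, answers):
    """Return "correct" (○), "suppressed" (△), or "incorrect" (×)."""
    _rule_no, start, end, kind = record
    if (start, end, kind) in answers:
        return "correct"
    span = (start, end)
    for s, e, _kind in answers:
        if overlaps(span, (s, e)) and outranks((s, e), span):
            return "suppressed"  # a correct answer would suppress it
    return "incorrect"


def tally(records, answers):
    """Per-rule counts of the three classes."""
    scores = {}
    for record in records:
        scores.setdefault(record[0], Counter())[classify(record, answers)] += 1
    return scores


answers = {(0, 9, "organization")}   # 田中太郎賞選考委員会
records = [
    (1, 0, 1, "person"),             # 田中, overlapped by the correct answer
    (2, 0, 9, "organization"),       # the correct answer itself
    (1, 12, 13, "person"),           # no correct answer nearby
    (3, 0, 11, "organization"),      # longer than the correct answer
]
scores = tally(records, answers)
```

Rule 1 thus receives one △ and one ×, so its short match inside the organization name does not count against it.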

規則評価部Ａ８が、各規則の「○」、「△」、「×」の個数を数えると、この結果を参照して、規則削除部Ａ９と規則精錬部Ａ１０が固有表現抽出規則群Ａ５に修正を加える。 Once the rule evaluation unit A8 has counted the “○”, “△”, and “×” for each rule, the rule deletion unit A9 and the rule refining unit A10 refer to this result and modify the specific expression extraction rule group A5.

規則削除部Ａ９は、固有表現抽出規則群Ａ５の規則の内、例えば、「×」の個数が「○」の個数を超える規則を削除する。規則精錬部Ａ１０は、固有表現抽出規則群Ａ５の規則の内、例えば、成績が「×」の個数が「○」の個数の半分以上ある規則に、前後の単語などに関する制約情報を加えて、当該規則の成績がより良くなるようにする。 The rule deletion unit A9 deletes, from the rules of the specific expression extraction rule group A5, for example, those whose number of “×” exceeds their number of “○”. The rule refining unit A10 takes, for example, the rules whose number of “×” is at least half their number of “○” and adds constraint information about the preceding and following words so that those rules score better.
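
The two example thresholds can be sketched as a small triage function. The function name, the return labels, and the decision to test deletion before refinement are assumptions of this sketch; the thresholds themselves are the examples given above.

```python
def triage(n_correct: int, n_incorrect: int) -> str:
    """Decide what happens to a rule given its ○ and × counts."""
    if n_incorrect > n_correct:
        return "delete"   # more × than ○
    if 2 * n_incorrect >= n_correct:
        return "refine"   # × is at least half of ○
    return "keep"


decisions = {
    "mostly wrong": triage(n_correct=3, n_incorrect=5),
    "borderline": triage(n_correct=10, n_incorrect=5),
    "good": triage(n_correct=10, n_incorrect=2),
}
```

Integer arithmetic (`2 * n_incorrect >= n_correct`) avoids rounding questions when the ○ count is odd.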

例えば、固有表現の前後２単語ずつを含めて考えると、上記規則で抽出され、「○」や「×」に評価されて分類された各固有表現のそれぞれにおいて、［(ｗ-2，ｃ-2，ｐ-2)，(ｗ-1，ｃ-1，ｐ-1)，(ｗ0，ｃ0，ｐ0)，・・・，(ｗN+1，ｃN+1，ｐN+1)，(ｗN+2，ｃN+2，ｐN+2)，］という単語リストが各々に考えられる。そこで、各固有表現毎に(ｗ-2，ｃ-2，ｐ-2，ｗ-1，ｃ-1，ｐ-1，ｗN+1，ｃN+1，ｐN+1，ｗN+2，ｃN+2，ｐN+2)という特徴のリストを考え、「○」に分類された固有表現の場合を正例、「×」に分類された固有表現の場合を負例と考えれば、これは典型的な帰納学習の課題であり、既存の機械学習の手法がそのまま利用できる。 For example, when including two words before and after the specific expression, in each of the specific expressions extracted by the above rules and classified as “◯” or “×”, [(w−2, c− 2, p-2), (w-1, c-1, p-1), (w0, c0, p0), ..., (wN + 1, cN + 1, pN + 1), (wN + 2, cN + 2, pN + 2),] are considered for each. Therefore, for each unique expression, (w-2, c-2, p-2, w-1, c-1, p-1, wN + 1, cN + 1, pN + 1, wN + 2, cN + 2, pN + 2), this is typical if the proper expression classified as “○” is a positive example and the proper expression classified as “×” is a negative example. This is a problem of inductive learning, and the existing machine learning method can be used as it is.

例えば、決定木による学習を用いることにより、前後の幾つかの単語の内、どの単語のどの性質の値を残し、他を変数化すべきかが決定できる。具体例として、「×」に分類された固有表現が「１０」個抽出され、その内、「８」個の固有表現において、その前の単語(ｗ-1)として「ｗX」が特定されれば、以下のようにして元の規則に制約条件｛ｗ-1'≠ ｗX｝を加え、前の単語(ｗ-1)として「ｗX」を有する固有表現が抽出されないように制約する。anytag(ｕ) <-- word(ｗ-1'，ｃ-1'，ｐ-1')，<＠(ｔ＋ｄｆ，ｋ)，word(ｗ0'，ｃ0'，ｐ0')，・・・，(ｗi'，ｃi'，ｐi')，・・・，word(ｗN'，ｃN'，ｐN')，>＠(ｔ−ｄｔ)，｛ｗ-1'≠ ｗX｝．こうして得られた規則は、元の規則より制約が強いので、元の規則がマッチした部分と同じところにしかマッチしない。従って、訓練用文書Ａ１全体に適用しなくても、訓練用記録Ａ７に残っている元の規則のマッチした部分にのみ適用すれば、新しい規則の成績はわかる。 For example, decision tree learning can determine, among the several preceding and following words, which property values of which words should be kept and which should be converted into variables. As a concrete example, suppose “10” specific expressions classified as “×” are extracted and that in “8” of them the preceding word (w-1) is identified as “wX”. The constraint {w-1' ≠ wX} is then added to the original rule as follows, so that specific expressions preceded by “wX” are not extracted. anytag(u) <-- word(w-1', c-1', p-1'), <@(t+df, k), word(w0', c0', p0'), ..., (wi', ci', pi'), ..., word(wN', cN', pN'), >@(t−dt), {w-1' ≠ wX}. The rule obtained in this way is more constrained than the original rule, so it can match only where the original rule matched. The score of the new rule can therefore be determined without applying it to the entire training document A1, by applying it only to the matches of the original rule that remain in the training record A7.
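
The worked example (8 of 10 “×” matches preceded by wX) can be illustrated with a simple frequency heuristic. Note that this is a stand-in, not the decision tree learning named in the text: the function names, the `min_share` threshold of 0.8, and the requirement that no “○” match shares the preceding word are all assumptions of this sketch.

```python
from collections import Counter


def propose_prev_word_constraint(matches, min_share=0.8):
    """matches: [(preceding word, label)], label "correct" or "incorrect".
    Return a word wX to ban as the preceding word, or None."""
    bad = Counter(w for w, label in matches if label == "incorrect")
    if not bad:
        return None
    word, count = bad.most_common(1)[0]
    shared_by_good = any(
        w == word and label == "correct" for w, label in matches
    )
    if count / sum(bad.values()) >= min_share and not shared_by_good:
        return word
    return None


def apply_constraint(matches, banned):
    """Re-score using only the original rule's recorded matches."""
    return [m for m in matches if m[0] != banned]


matches = (
    [("wX", "incorrect")] * 8
    + [("wY", "incorrect"), ("wZ", "incorrect")]
    + [("wY", "correct")] * 12
)
banned = propose_prev_word_constraint(matches)
remaining = apply_constraint(matches, banned)
```

Re-scoring only the recorded matches mirrors the observation above that a more constrained rule can match only where the original rule matched.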

このように本例では、規則の改良が、他の規則とほぼ独立に行なえる。以上によって、元の規則（固有表現抽出規則群Ａ５）から、より成績の良い規則（改良後固有表現抽出規則群Ａ５ａ）を生成する。 Thus, in this example, each rule can be improved almost independently of the other rules. In this way, better-performing rules (the improved specific expression extraction rule group A5a) are generated from the original rules (the specific expression extraction rule group A5).

図２８は、固有表現抽出規則生成方法の処理手順例を示すフローチャートである。 FIG. 28 is a flowchart illustrating an example of a processing procedure of the specific expression extraction rule generation method.

本例は、図２６における固有表現抽出規則生成システムにおける形態素解析・品詞文字種付与部Ａ３、規則生成部Ａ４、訓練用規則適用部Ａ６、規則評価部Ａ８の各処理動作を示すものであり、まず、形態素解析・品詞文字種付与部Ａ３において、訓練用文書Ａ１を形態素解析して単語に分割し（Ｓ１３０１）、各単語に品詞と文字種などの情報を付加する（Ｓ１３０２）。 This example shows processing operations of the morphological analysis / part-of-speech character type assigning unit A3, the rule generating unit A4, the training rule applying unit A6, and the rule evaluating unit A8 in the named entity extraction rule generating system in FIG. In the morphological analysis / part-of-speech character type assigning unit A3, the training document A1 is morphologically analyzed and divided into words (S1301), and information such as part-of-speech and character type is added to each word (S1302).

次に、規則生成部Ａ４において、正解リストＡ２の固有表現と、その近傍にある単語からなる単語列を抜き出して（Ｓ１３０３）、正解単語列に経験則等を適用して、抽出規則を生成し（Ｓ１３０４）、固有表現抽出規則群Ａ５として記録する。そして、訓練用規則適用部Ａ６において、このようにして生成した抽出規則を、訓練用文書Ａ１に適用して、その結果得られた固有表現を候補として記録する（Ｓ１３０５）。さらに、規則評価部Ａ８において、各抽出規則で抽出された固有表現の正解度（○、△、×）を求めて分類し、それに基づき、各抽出規則の適正度を採点する（Ｓ１３０６）。 Next, the rule generation unit A4 extracts a specific string of the correct answer list A2 and a word string composed of words in the vicinity thereof (S1303), and applies an empirical rule or the like to the correct word string to generate an extraction rule. (S1304), recorded as a specific expression extraction rule group A5. Then, in the training rule application unit A6, the extraction rule generated in this way is applied to the training document A1, and the resulting unique expression is recorded as a candidate (S1305). Further, the rule evaluation unit A8 obtains and classifies the correctness (◯, Δ, ×) of the unique expression extracted by each extraction rule, and scores the appropriateness of each extraction rule based on the correctness (S1306).

その採点の結果、修正不可能な成績の悪い（適正度の低い）規則群は、規則削除部Ａ９において削除し（Ｓ１３０７）、また、修正により適正度が高まる規則群には、規則精錬部Ａ１０において当該修正を加えて、新規則とし（Ｓ１３０８）、改良後固有表現抽出規則群Ａ５ａに記録する。Ｓ１３０５からの処理を繰り返すことにより、より成績の良い規則群の生成が可能となる。 As a result of the scoring, a rule group having a bad result (low degree of appropriateness) that cannot be corrected is deleted in the rule deletion unit A9 (S1307), and a rule group whose degree of appropriateness is increased by correction is rule refining part A10. In step S1308, the modification is made to make a new rule (S1308), which is recorded in the improved specific expression extraction rule group A5a. By repeating the processing from S1305, it is possible to generate a rule group with better results.

FIG. 29 is a flowchart showing an example of the processing operation of the named entity extraction apparatus in FIG. 26. This example shows how the named entity extraction apparatus shown in FIG. 26 processes a new document A11. First, the morphological analysis / part-of-speech and character-type assigning unit A3 morphologically analyzes the new document A11 to divide it into words (S1401) and attaches information such as part of speech and character type to each word list (S1402).

Next, the implementation rule application unit A12 applies each extraction rule of the improved named entity extraction rule group A5a to each word list and lists the resulting named entities as candidates (S1403). The following priority control is then performed on all candidates (S1404): the candidate with the highest priority among the remaining candidates is output (S1405), and the candidates that conflict with the output candidate are deleted (S1406).
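The priority control of S1404 to S1406 can be sketched in Python as follows, using the ordering this description gives for overlapping candidates (earlier start position first, then later end position, then higher rule priority). The `Candidate` fields, the character-offset representation, and the treatment of any overlap as a conflict are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    start: int          # offset of the first character in the document
    end: int            # offset one past the last character
    kind: str           # named entity type, e.g. "PERSON"
    rule_priority: int  # priority of the rule that produced it (larger wins)

def overlaps(a, b):
    """Two candidates conflict when their character spans intersect."""
    return a.start < b.end and b.start < a.end

def resolve(candidates):
    """Repeatedly output the top candidate and delete its competitors."""
    remaining = sorted(
        candidates,
        key=lambda c: (c.start, -c.end, -c.rule_priority),
    )
    output = []
    while remaining:
        best = remaining.pop(0)                                     # cf. S1405
        output.append(best)
        remaining = [c for c in remaining if not overlaps(best, c)]  # cf. S1406
    return output

candidates = [
    Candidate(0, 3, "PERSON", 1),
    Candidate(0, 5, "ORG", 1),
    Candidate(2, 6, "LOC", 5),
    Candidate(7, 9, "DATE", 1),
    Candidate(0, 5, "PERSON", 2),
]
selected = resolve(candidates)
```

Here the two spans covering offsets 0 to 5 beat the shorter span starting at 0, the PERSON reading wins over ORG on rule priority, and the non-overlapping DATE candidate survives.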

As described above with reference to FIGS. 26 to 29, in the named entity extraction rule generation system and method of this example, a training document A1 prepared in advance is first morphologically analyzed and divided into words, and information such as the part-of-speech name and constituent character type is attached to each word. The word strings that make up named entities are then extracted from the words thus obtained and, with reference to the correct answer list A2 prepared in advance for the training document A1, a large number of named entity extraction rules are generated by generalization means such as heuristics or minimal generalization.

Next, each of these extraction rules is applied independently to the training document A1, and a record is kept of the positions in the training document A1 at which the rule matched. The entries in this record are the candidate named entities that the system outputs for the training document A1. When multiple rules are combined, the sequence of candidates to be finally output is selected from all the candidates in the records for those rules according to fixed, clear criteria that take conflicts and priorities into account.

As a result, any rule with a very high frequency or proportion of incorrect answers on the training document A1 is deleted. However, the positions in the training document at which the rule is correct and the positions at which it is incorrect are known. By comparing the word strings before and after the correct positions with those before and after the incorrect positions and adding constraints, it can be judged whether a rule that performs better on the training document can be created; if so, the rule with the added constraints is added.

Thus, according to this example, when the system is given a training document containing named entities together with a correct answer list that enumerates what type of named entity appears at which position in the document, it generates named entity extraction rules from these correct answers, so a person no longer needs to write out extraction rules with great effort. Furthermore, once the evaluation of each individual rule on the training document A1 prepared in advance has been obtained, the evaluation value for any combination of rules can be calculated easily from the evaluation values of the individual rules.

This shortens the processing time required for trial and error when searching for a combination of rules that gives good results. In addition, because each named entity extraction rule can be improved almost independently of the other rules, accuracy is easy to improve.

The named entity extraction apparatus of this example applies the rules, which were generated from the training document and the correct answer list and then improved, to a new document A11 and automatically extracts named entities from it. If the extracted named entities partially overlap, the one that starts earlier in the document is extracted preferentially; if they start at the same position, the one that ends later is extracted preferentially; and if named entities have the same expression but different types, the one extracted by the rule that was assigned the higher priority in advance is extracted preferentially. Extraction can therefore be limited to appropriate named entities only.

The invention is not limited to the example described with reference to FIGS. 26 to 29, and various modifications are possible. For example, in this example a constraint added to a rule is based on the words before and after the candidate named entity in the training document, but constraints may also be placed on the character type (kanji, katakana, ...) or part of speech (noun, verb, ...) of those words. This example uses the optical disk 1025 as the recording medium, but an FD may be used instead. The program may also be installed by downloading it over a network through the communication device 1027.

A computer program that causes the search apparatus 1 or 10 to execute the label display type document search method described in the first and second embodiments can be widely distributed by storing it on a computer-readable recording medium such as a semiconductor memory, magnetic disk, optical disk, magneto-optical disk, or magnetic tape, or by transmitting it over a communication network such as the Internet.

FIG. 1 is a block diagram showing the apparatus configuration of the first embodiment.
FIG. 2 is a flowchart showing the processing that the search apparatus 1 performs before a search.
FIG. 3 is a diagram showing an example of an untagged document.
FIG. 4 is a diagram showing an example of a tagged document.
FIG. 5 is a diagram showing a method of detecting synonyms using co-occurrence patterns.
FIG. 6 is a diagram showing an example of the index.
FIG. 7 is a diagram showing an example of the first statistical information.
FIG. 8 is a diagram showing an example of the second statistical information.
FIG. 9 is a flowchart of the processing performed by the search apparatus 1 when a keyword is sent to it.
FIG. 10 is a diagram showing an example of the first search result statistical information.
FIG. 11 is a diagram showing an example of the second search result statistical information.
FIG. 12 is a diagram showing an example of the third statistical information.
FIG. 13 is a diagram showing an example of the label fitness information.
FIG. 14 is a diagram showing an example of the label information.
FIG. 15 is a flowchart of the label selection performed by the label determination unit 115.
FIG. 16 is a diagram showing an example of the attribute name fitness information.
FIG. 17 is a diagram showing an example of the cluster information.
FIG. 18 is a flowchart of the processing performed by the browser 2.
FIG. 19 is a diagram showing a display example produced by the document display control unit 22 before a label is designated.
FIG. 20 is a diagram showing a display example produced by the document display control unit 22 after a label is designated.
FIG. 21 is a block diagram showing the apparatus configuration of the second embodiment.
FIG. 22 is a flowchart showing the processing that the search apparatus 10 performs before a search.
FIG. 23 is a diagram showing an example of a document vector.
FIG. 24 is a flowchart of the processing performed by the search apparatus 10 when a keyword is sent to it.
FIG. 25 is a diagram showing an example of a cluster vector.
FIG. 26 is a block diagram showing a configuration example of the named entity extraction rule generation system applied to the document generation unit 105 and of a named entity extraction apparatus provided with it.
FIG. 27 is a block diagram showing a hardware configuration example of the named entity extraction rule generation system and the named entity extraction apparatus in FIG. 26.
FIG. 28 is a flowchart showing an example of the processing procedure of the named entity extraction rule generation method.
FIG. 29 is a flowchart showing an example of the processing operation of the named entity extraction apparatus in FIG. 26.

Explanation of symbols

1, 10 ... Search apparatus
2 ... Browser
21 ... Keyword input unit
22 ... Document display control unit
101 ... Communication unit
102 ... Request processing unit
103 ... Setting file
104 ... Document generation unit
105 ... Document generation unit
106 ... Document DB
107 ... Index
107 ... Normalization unit
108 ... Index generation unit
109 ... Document search unit
110 ... First statistics DB
111 ... Second statistics DB
112 ... Statistical processing unit
113 ... Label candidate selection unit
114 ... Label fitness calculation unit
115 ... Label determination unit
116 ... Cluster information generation unit
117 ... Document vector generation unit
118 ... Document vector DB
119 ... Cluster vector generation unit
120 ... Similarity calculation unit
121 ... Cluster expansion unit

Claims

1. A label display type document search apparatus comprising:
document storage means storing a plurality of documents, each document including a body composed of character strings, a title of the document, and document identification information indicating the document, with attribute values, which are predetermined character strings, contained in the body;
document vector generation means for generating, for each document stored in the document storage means, a document vector that contains, for each attribute value, the number of occurrences of that attribute value in the document;
statistical processing means for generating, for each attribute value contained in at least one of the plurality of documents stored in the document storage means, statistical information recording the number of occurrences of the attribute value in the plurality of documents, and storing the statistical information in storage means provided in advance;
document search means for retrieving a plurality of documents from the document storage means;
search result statistical information generation means for generating and storing, for each attribute value contained in at least one of the retrieved documents, search result statistical information recording the number of occurrences of the attribute value in the retrieved documents;
fitness calculation means for calculating, for each attribute value contained in at least one of the retrieved documents, a fitness indicating the degree to which the attribute value is suitable as a label, a label being a character string that represents a plurality of documents forming part of the retrieved documents, by a calculation formula that uses the numbers of occurrences of the attribute value in the statistical information and in the search result statistical information;
label information generation means for selecting attribute values in descending order of fitness, for as long as the fitness satisfies a preset condition, from among the attribute values contained in at least one of the retrieved documents, thereby selecting a subset of those attribute values, and for generating label information containing each of the selected attribute values as a label;
cluster information generation means for generating, for each label contained in the label information, cluster information that contains the label together with the document identification information and title of every retrieved document that contains the character string serving as the label;
cluster information changing means for generating, for each piece of cluster information, a cluster vector that is the vector sum of the document vectors corresponding to the documents whose document identification information is contained in the cluster information, calculating a cosine measure between the cluster vector and the document vector corresponding to a document that is stored in the document storage means and whose document identification information is not contained in the cluster information, and, when the cosine measure exceeds a preset threshold, including the document identification information and title of that document in the cluster information; and
document display control means for displaying each label contained in the label information, displaying, when one of the labels is selected, each pair of document identification information and title contained in the cluster information containing that label, and, when one of the pairs is selected, reading the document having that document identification information and title from the document storage means and displaying it.

2. The label display type document search apparatus according to claim 1, wherein each attribute value contained in a document stored in the document storage means is given a tag indicating that it is an attribute value, and
the document display control means displays the labels grouped by the tag attached to each label.

3. The label display type document search apparatus according to claim 1, further comprising attribute value normalization means for detecting a plurality of attribute values that are contained in the documents stored in the document storage means and that have the same meaning but different notations, and for making the notation of those attribute values the same.

4. The label display type document search apparatus according to claim 1, wherein each attribute value contained in a document stored in the document storage means is given a tag indicating that it is an attribute value,
the apparatus further comprising document generation means for treating, as an attribute value, any character string in an untagged document that is equal to a named entity in a list generated in advance, attaching a tag to that attribute value, and storing the document in the document storage means.

5. A label display type document search method performed by a label display type document search apparatus, wherein
the label display type document search apparatus comprises document storage means, document vector generation means, statistical processing means, document search means, search result statistical information generation means, fitness calculation means, label information generation means, cluster information generation means, cluster information changing means, and document display control means,
the document storage means stores a plurality of documents, each document including a body composed of character strings, a title of the document, and document identification information indicating the document, with attribute values, which are predetermined character strings, contained in the body, and
the label display type document search method comprises:
a document vector generation step in which the document vector generation means generates, for each document stored in the document storage means, a document vector that contains, for each attribute value, the number of occurrences of that attribute value in the document;
a statistical processing step in which the statistical processing means generates, for each attribute value contained in at least one of the plurality of documents stored in the document storage means, statistical information recording the number of occurrences of the attribute value in the plurality of documents, and stores the statistical information in storage means provided in advance;
a document search step in which the document search means retrieves a plurality of documents from the document storage means;
a search result statistical information generation step in which the search result statistical information generation means generates and stores, for each attribute value contained in at least one of the retrieved documents, search result statistical information recording the number of occurrences of the attribute value in the retrieved documents;
a fitness calculation step in which the fitness calculation means calculates, for each attribute value contained in at least one of the retrieved documents, a fitness indicating the degree to which the attribute value is suitable as a label, a label being a character string that represents a plurality of documents forming part of the retrieved documents, by a calculation formula that uses the numbers of occurrences of the attribute value in the statistical information and in the search result statistical information;
a label information generation step in which the label information generation means selects attribute values in descending order of fitness, for as long as the fitness satisfies a preset condition, from among the attribute values contained in at least one of the retrieved documents, thereby selecting a subset of those attribute values, and generates label information containing each of the selected attribute values as a label;
a cluster information generation step in which the cluster information generation means generates, for each label contained in the label information, cluster information that contains the label together with the document identification information and title of every retrieved document that contains the character string serving as the label;
a cluster information changing step in which the cluster information changing means generates, for each piece of cluster information, a cluster vector that is the vector sum of the document vectors corresponding to the documents whose document identification information is contained in the cluster information, calculates a cosine measure between the cluster vector and the document vector corresponding to a document that is stored in the document storage means and whose document identification information is not contained in the cluster information, and, when the cosine measure exceeds a preset threshold, includes the document identification information and title of that document in the cluster information; and
a document display control step in which the document display control means displays each label contained in the label information, displays, when one of the labels is selected, each pair of document identification information and title contained in the cluster information containing that label, and, when one of the pairs is selected, reads the document having that document identification information and title from the document storage means and displays it.

6. The label display type document search method according to claim 5, wherein each attribute value contained in a document stored in the document storage means is given a tag indicating that it is an attribute value, and
the document display control means displays the labels grouped by the tag attached to each label.

7. The label display type document search method according to claim 5 or 6, wherein the label display type document search apparatus further comprises attribute value normalization means,
the method further comprising an attribute value normalization step in which the attribute value normalization means detects a plurality of attribute values that are contained in the documents stored in the document storage means and that have the same meaning but different notations, and makes the notation of those attribute values the same.

8. The label display type document search method according to claim 5, wherein the label display type document search apparatus further comprises document generation means, and
each attribute value contained in a document stored in the document storage means is given a tag indicating that it is an attribute value,
the method further comprising a document generation step in which the document generation means treats, as an attribute value, any character string in an untagged document that is equal to a named entity in a list generated in advance, attaches a tag to that attribute value, and stores the document in the document storage means.

9. A computer program for causing a computer to execute the label display type document search method according to claim 5.

10. A computer-readable recording medium storing the computer program according to claim 9.
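As an informal illustration of the cluster information changing step recited in claims 1 and 5, the following Python sketch sums document vectors into a cluster vector and adds any other stored document whose cosine measure against that vector exceeds a threshold. The function names, the use of `collections.Counter` as the document vector, the sample data, and the 0.9 threshold are assumptions for illustration. Skipping documents already in the cluster is one reading of the claim wording, not a confirmed detail.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine measure between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def expand_cluster(cluster_ids, doc_vectors, threshold):
    """Return the cluster's document IDs plus those of sufficiently similar documents."""
    # Cluster vector: vector sum of the document vectors of the cluster's members.
    cluster_vec = Counter()
    for doc_id in cluster_ids:
        cluster_vec.update(doc_vectors[doc_id])

    members = set(cluster_ids)
    added = [
        doc_id
        for doc_id, vec in doc_vectors.items()
        # Assumption: only documents not already in the cluster are tested.
        if doc_id not in members and cosine(vec, cluster_vec) > threshold
    ]
    return list(cluster_ids) + added

# Hypothetical document vectors: attribute value -> number of occurrences.
doc_vectors = {
    "d1": Counter({"Tokyo": 2, "NTT": 1}),
    "d2": Counter({"Tokyo": 1}),
    "d3": Counter({"Tokyo": 3, "NTT": 1}),
    "d4": Counter({"Osaka": 4}),
}
expanded = expand_cluster(["d1", "d2"], doc_vectors, threshold=0.9)
```

In this sample, `d3` has the same direction as the cluster vector and is added, while `d4` shares no attribute values with the cluster and is left out.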