JP2007011891A

JP2007011891A - Information retrieval method and device, program, and storage medium storing program

Info

Publication number: JP2007011891A
Application number: JP2005194297A
Authority: JP
Inventors: Hiroyuki Toda; 浩之戸田; Ryoji Kataoka; 良治片岡
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2005-07-01
Filing date: 2005-07-01
Publication date: 2007-01-18

Abstract

PROBLEM TO BE SOLVED: To provide an overview of search results and allow refined searches to be performed efficiently.
SOLUTION: A content DB is searched based on search conditions to acquire search results. A label DB is searched based on the document IDs contained in the search results to acquire proper nouns, and each proper noun together with its related document IDs is acquired as a label. Among the acquired labels, those indicating a specific entity are identified. The label DB is searched based on each such label to acquire label explanatory words. Labels that are strongly related to each other are identified as label pairs. The label DB is searched based on the labels in each label pair to acquire label pair explanatory words. A label map indicating the relationships between labels is generated from the labels, label pairs, label explanatory words, and label pair explanatory words, and is output.
COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to an information search method and apparatus, a program, and a storage medium storing the program, and more particularly to an information search method and apparatus, a program, and a storage medium storing the program for providing a search service for documents written in natural language on a computer network such as the Internet.

In information search systems on computer networks, systems have been devised that, as a means of narrowing down search results efficiently, classify the search results, assign a label representing each class, and present the labels together with the search results.

This allows the user to reach the desired information efficiently without evaluating every content item in the search results.

The following methods have been proposed to realize this.

・ Classification-scheme method:
Classification categories are created manually in advance, and documents considered appropriate for each category are supplied as training data. The features of each category are determined from the features of its training data. When search results are given, the similarity between each search result and each category is calculated, and each result is associated with the category it most resembles, so that the search results can be classified and presented to the user (see, for example, Non-Patent Document 1).

・ Document-similarity method:
A feature representation of each document, such as a vector representation, is acquired in advance. When search results are given, documents that are similar according to these feature representations are placed in the same class, and the results are presented divided into a number of classes considered useful to the searcher (see, for example, Non-Patent Document 2).

・ Characteristic-term method:
Words, compound words, keywords, and the like that are characteristic of the search results are acquired from the documents and data in the search results and presented as narrowing candidates, so that the results are presented in much the same way as if they had been classified (see, for example, Non-Patent Document 3).
Non-Patent Document 1: Tsuruta, M., et al., “A Web Search Result Classification System based on the degree of Suitability for Specialists,” Working Notes of NTCIR-4 (2004).
Non-Patent Document 2: Cutting, D., et al., “Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections,” Proceedings of SIGIR 1992 (1992).
Non-Patent Document 3: Toda, Hiroyuki, et al., “A Clustering Method for News Articles Retrieving System,” Proceedings of WWW2005 (2005).

However, the conventional techniques described above have the following problems.

In the classification-scheme method, a number of labels are created manually in advance, and documents matching the meaning of each label are associated with it. Generating the labels and preparing the ground-truth data that gives them meaning are both assumed to be done by hand, so defining the initial labels and maintaining them as the content is updated impose a large cost on the administrator of the information search system.

In the document-similarity method, the trade-off between classification quality and speed must be considered because of processing-time constraints. In addition, the techniques used, such as the K-Means method, require the number of classes to be fixed in advance. If the fixed value does not match the actual number of topics, ill-defined classes are produced, labels representing the content of each class become difficult to assign, and the generated labels may be too unclear for the content of a class to be grasped at a glance.

To address these problems, there has recently been a move toward using the characteristic-term method in settings such as Web document search. By using, for example, words that co-occur with the query in documents, such a system lets the user extend the search expression efficiently and thus narrow down the search results easily.

However, this method uses only some of the features of a document, such as keywords, so a single document is often classified from several viewpoints. When search results are classified with this method, the classification is therefore non-exclusive, and one document is associated with multiple classes. This has the advantage that document content can be judged from various viewpoints. On the other hand, the user is given no information about what each label refers to or how the labels relate to one another, and so cannot clearly tell what topics the search results contain. As a result, an effective search may not be possible.

The present invention has been made in view of the above points. Its object is to provide an information search method and apparatus, a program, and a storage medium storing the program that, in an information search system on a computer network, present attribute information explaining each label assigned to the search results and the relationships between labels, thereby making it possible to overview the search results and perform narrowing searches efficiently.

FIG. 1 is a diagram for explaining the principle of the present invention.

The present invention (claim 1) is an information search method in which a plurality of documents written in natural language are stored in a content DB, and search results matching search conditions given by a user are presented together with characteristic keywords or sentences in those search results as labels, wherein, in an information search apparatus having
a label DB storing document IDs, information on the proper nouns and other words appearing in each document, information on the proper noun pairs appearing in each document, information on the proper nouns and explanatory words appearing in each document, and information on the proper noun pairs and explanatory words appearing in each document, together with search means, label generation means, label explanatory word acquisition means, label pair identification means, label pair explanatory word acquisition means, label map generation means, and output means,
the following steps are performed:
a search step (step 1) in which the search means searches the content DB based on the search conditions and acquires search results;
a label generation step (step 2) in which the label generation means searches the label DB based on the document IDs included in the search results acquired in the search step to acquire proper nouns, and acquires each proper noun together with the document IDs related to it as a label;
a label explanatory word acquisition step (step 3) in which the label explanatory word acquisition means identifies, among the labels acquired in the label generation step, the labels indicating a specific entity such as a person or an object, searches the label DB based on those labels, and acquires keywords or sentences explaining each label as label explanatory words;
a label pair identification step (step 4) in which the label pair identification means identifies, among the labels acquired in the label generation step, the labels indicating a specific entity such as a person or an object, searches the label DB based on those labels, and, based on the search results obtained in the search step (step 1), identifies labels that are strongly related to each other as label pairs;
a label pair explanatory word acquisition step (step 5) in which the label pair explanatory word acquisition means searches the label DB based on the labels of each label pair and acquires label pair explanatory words, including keywords or sentences that explain the relationship within each label pair;
a label map generation step (step 6) in which the label map generation means generates a label map indicating the relationships between labels using the labels, label pairs, label explanatory words, and label pair explanatory words; and
an output step (step 7) in which the output means outputs the label map to a device on the user side.

The present invention (claim 2) further performs, in the label generation step (step 2), a step of obtaining, for each proper noun acquired by searching the label DB, the ratio between its frequency in the entire document set of the content DB and its frequency in the search results, and selecting a predetermined number of proper nouns based on that ratio.

The present invention (claim 3) further performs, in the label explanatory word acquisition step (step 3), a step of obtaining, for each proper noun of the acquired labels, the co-occurrence frequency of the explanatory word candidates that co-occur with that proper noun in the same sentence, and acquiring the words whose frequency exceeds a predetermined value as explanatory words.

The present invention (claim 4) further performs, in the label pair identification step (step 4), a step of obtaining, for the labels, the ratio between the appearance frequency in the entire document set in the content DB and the appearance frequency in the search results, and acquiring two labels for which this ratio exceeds a predetermined value as a label pair.

The present invention (claim 5) further performs, in the label pair explanatory word acquisition step (step 5), a step of acquiring, as label pair explanatory words for a label pair, the words whose co-occurrence frequency with the label pair in the same sentence is high compared with their appearance frequency in the document set of the content DB, and also high compared with their co-occurrence frequency with particular proper nouns.

The present invention (claim 6) further performs, in the label map generation step (step 6), a step of joining the label pairs that share a node and generating a label map in which the proper nouns appearing in a particular topic are connected.
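Joining label pairs that share a node can be pictured as building an undirected graph and collecting its connected groups. The following is a minimal sketch under that assumption; the function name `build_label_map`, the tuple representation of a pair, and the sample labels are hypothetical and are not taken from the patent.

```python
from collections import defaultdict


def build_label_map(label_pairs):
    """Join label pairs that share a node into connected groups of labels."""
    # Assumption: each label is a node and each label pair is an undirected edge.
    adjacency = defaultdict(set)
    for a, b in label_pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    maps = []
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first traversal collects every label reachable from `start`.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        maps.append(component)
    return maps


# Hypothetical label pairs: the first two share the node "Osaka".
pairs = [("Tokyo", "Osaka"), ("Osaka", "Kyoto"), ("Toyota", "Honda")]
label_maps = build_label_map(pairs)
```

In this sketch the first two pairs end up in one group because they share a label, while the third pair forms a separate group. A fuller label map would also carry the explanatory words on the nodes and edges.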

FIG. 2 is a diagram showing the principle configuration of the present invention.

The present invention (claim 7) is an information search apparatus in which a plurality of documents written in natural language are stored in a content DB 10, and search results matching search conditions given by a user are presented together with characteristic keywords or sentences in those search results as labels, the apparatus comprising:
search means 170 for searching the content DB 10 based on the search conditions and acquiring search results;
a label DB 20 storing document IDs, information on the proper nouns and other words appearing in each document, information on the proper noun pairs appearing in each document, information on the proper nouns and explanatory words appearing in each document, and information on the proper noun pairs and explanatory words appearing in each document;
label generation means 110 for searching the label DB 20 based on the document IDs included in the search results acquired from the search means 170 to acquire proper nouns, and acquiring each proper noun together with the document IDs related to it as a label;
label explanatory word acquisition means 120 for identifying, among the labels acquired by the label generation means 110, the labels indicating a specific entity such as a person or an object, searching the label DB 20 based on those labels, and acquiring keywords or sentences explaining each label as label explanatory words;
label pair identification means 130 for identifying, among the labels acquired by the label generation means 110, the labels indicating a specific entity such as a person or an object, searching the label DB 20 based on those labels, and, based on the search results obtained by the search means 170, identifying labels that are strongly related to each other as label pairs;
label pair explanatory word acquisition means 140 for searching the label DB 20 based on the labels of each label pair and acquiring label pair explanatory words, including keywords or sentences that explain the relationship within each label pair;
label map generation means 150 for generating a label map indicating the relationships between labels using the labels, label pairs, label explanatory words, and label pair explanatory words; and
output means 180 for outputting the label map to a device on the user side.

In the present invention (claim 8), the label generation means 110 includes means for obtaining, for each proper noun acquired by searching the label DB 20, the ratio between its frequency in the entire document set of the content DB 10 and its frequency in the search results, and selecting a predetermined number of proper nouns based on that ratio.

In the present invention (claim 9), the label explanatory word acquisition means 120 includes means for obtaining, for each proper noun of the acquired labels, the co-occurrence frequency of the explanatory word candidates that co-occur with that proper noun in the same sentence, and acquiring the words whose frequency exceeds a predetermined value as explanatory words.

In the present invention (claim 10), the label pair identification means 130 includes means for obtaining, for the labels, the ratio between the appearance frequency in the entire document set in the content DB 10 and the appearance frequency in the search results, and acquiring two labels for which this ratio exceeds a predetermined value as a label pair.

In the present invention (claim 11), the label pair explanatory word acquisition means 140 includes means for acquiring, as label pair explanatory words for a label pair, the words whose co-occurrence frequency with the label pair in the same sentence is high compared with their appearance frequency in the document set of the content DB 10, and also high compared with their co-occurrence frequency with particular proper nouns.

In the present invention (claim 12), the label map generation means 150 includes means for joining the label pairs that share a node and generating a label map in which the proper nouns appearing in a particular topic are connected.

The present invention (claim 13) is an information search program that causes a computer to function as an information search apparatus having the search means, label generation means, label explanatory word acquisition means, label pair identification means, label pair explanatory word acquisition means, label map generation means, and output means described in any one of claims 7 to 12.

The present invention (claim 14) is a storage medium storing a program that causes a computer to function as an information search apparatus having the search means, label generation means, label explanatory word acquisition means, label pair identification means, label pair explanatory word acquisition means, label map generation means, and output means described in any one of claims 7 to 12.

According to the present invention, in an information search system that stores a plurality of pieces of document information and returns the set of contents matching search conditions given by a user as search results, when the search results are classified using characteristic labels acquired from them, the attribute information of the labels and the relationships between labels are presented based on the appearance information of the labels. This lets the user get an overview of the information obtained and easily pick out the information being sought.

Embodiments of the present invention will now be described with reference to the drawings.

This embodiment describes a technique in which, when search results are classified using labels acquired from them, documents are classified based on the types and frequencies of the labels related to them, and label explanations and the relationships between labels are presented as information describing the documents in each class.

FIG. 3 shows the configuration of a system in one embodiment of the present invention.

The system shown in the figure has a client-server configuration in which a server 100 and a client (browser) 200 are connected via a network (not shown).

The server 100 comprises a label generation unit 110, a label explanatory word acquisition unit 120, a label pair identification unit 130, a label pair explanatory word acquisition unit 140, a label map generation unit 150, a label management unit 160, a search engine 170, a text analysis unit 190, a content DB 10, a label DB 20, and a related document acquisition unit 30. Of these, the search engine 170 and the text analysis unit 190 are functions used for preprocessing.

The browser of the client 200 comprises a search request input screen 210, a label display unit 220, and a search result display unit 230.

The content DB 10 of the server 100 is a storage device that manages the content to be searched, such as newspaper articles and Web pages.

The text analysis unit 190 analyzes the data registered in the content DB 10, analyzes the terms contained in the text, extracts the information listed below, and stores it in the label DB 20. Terms are cut out of the text using a named entity extraction technique (for example, Hideki Isozaki and Hideto Kazawa, “Speeding up named entity extraction based on SVMs,” IPSJ SIG Technical Report, 149th Meeting of the Special Interest Group on Natural Language Processing (May 23, 2002), Vol. 2002, No. 44 (NL-149), pp. 1-6) or a noun phrase extraction technique (for example, Kageura, K. and Umino, B., “Models of Automatic Term Recognition,” Terminology, Vol. 3, No. 2 (1996)).

・ Among the terms appearing in each document, information on the proper nouns indicating specific entities and on the other words;
・ Information on the proper noun pairs appearing in each document;
・ Information on the proper noun pairs and explanatory words appearing in each document.
The label DB 20 is a storage device that accumulates and manages the above information acquired from the text analysis unit 190. FIG. 4 shows an example of the label DB in one embodiment of the present invention. The label DB 20 is made up of several blocks. FIG. 4(a) consists of label candidates and the IDs of the contents in which they appear. FIG. 4(b) consists of label candidates and their appearance frequencies. FIG. 4(c) consists of content IDs and the label candidates appearing in them. FIG. 4(d) consists of label candidates, explanatory word candidates, and appearance frequencies. FIG. 4(e) consists of the combinations of labels and explanatory words appearing in each content ID.
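The blocks of FIG. 4 can be pictured as a handful of in-memory maps. The sketch below is only an illustration of that layout: the class name `LabelDB`, its field and method names, the sample values, and the choice to count appearance frequency once per content are all assumptions, not details given in the patent.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class LabelDB:
    # (a) label candidate -> IDs of the contents in which it appears
    label_to_contents: dict = field(default_factory=lambda: defaultdict(set))
    # (b) label candidate -> appearance frequency
    label_frequency: dict = field(default_factory=lambda: defaultdict(int))
    # (c) content ID -> label candidates appearing in that content
    content_to_labels: dict = field(default_factory=lambda: defaultdict(set))
    # (d) (label candidate, explanatory word candidate) -> appearance frequency
    explanation_frequency: dict = field(default_factory=lambda: defaultdict(int))
    # (e) content ID -> (label, explanatory word) combinations appearing in it
    content_to_explanations: dict = field(default_factory=lambda: defaultdict(set))

    def add_label(self, content_id, label):
        # Assumption: frequency counts contents, so a repeat in the same
        # content does not increase it.
        if content_id not in self.label_to_contents[label]:
            self.label_to_contents[label].add(content_id)
            self.label_frequency[label] += 1
        self.content_to_labels[content_id].add(label)

    def add_explanation(self, content_id, label, word):
        # Records one sentence-level co-occurrence of a label and a word.
        self.explanation_frequency[(label, word)] += 1
        self.content_to_explanations[content_id].add((label, word))


# Hypothetical sample data.
db = LabelDB()
db.add_label("doc1", "NTT")
db.add_label("doc2", "NTT")
db.add_explanation("doc1", "NTT", "telecom")
```

Keeping both the label-to-content map (a) and the content-to-label map (c) lets later stages start either from a label or from the content IDs in a search result without scanning the whole store.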

The label generation unit 110 acquires the “search result list” from the label management unit 160 and, using the IDs of the contents it contains, accesses the label DB 20 (FIG. 4(c)) to acquire the proper nouns contained in the “search result list”. It calculates a “label-likeness” score for each proper noun, selects a predetermined number of proper nouns based on this score, and transfers them, together with the IDs of the documents related to each proper noun, to the label management unit 160 as the “label list”. The information in the label list consists of the label (ID) and its label-likeness (score).

An example of how the label generation unit 110 calculates “label-likeness” is described here.

Among the values that could serve as labels for classifying the search results, those strongly related to the search results are the most important. As an index of “label-likeness”, a criterion can therefore be used that is based on the ratio between the frequency in the entire document set and the frequency restricted to the search results.

In this criterion, D is the set of contents in the content DB 10, H is the set of all search results, h_j is the number of contents in the search results that contain the named entity in question, and d_j is the number of contents in the content DB 10 that contain that named entity.
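The exact expression for “label-likeness” is not reproduced in this text. One ratio that is consistent with the symbols defined above divides the share of search results containing the named entity, h_j / |H|, by the share of all contents containing it, d_j / |D|. The sketch below assumes that form; the function names, parameter names, and sample counts are hypothetical.

```python
def label_likeness(h_j, d_j, num_results, num_contents):
    """Assumed form: (h_j / |H|) / (d_j / |D|)."""
    if d_j == 0 or num_results == 0:
        return 0.0
    return (h_j / num_results) / (d_j / num_contents)


def select_labels(result_counts, collection_counts, num_results, num_contents, top_n):
    """Score every candidate and keep the predetermined number of best ones."""
    scored = {
        label: label_likeness(h, collection_counts[label], num_results, num_contents)
        for label, h in result_counts.items()
    }
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Hypothetical counts: 10 search results out of a collection of 1000 contents.
result_counts = {"Toyota": 8, "Japan": 9}        # h_j per candidate
collection_counts = {"Toyota": 20, "Japan": 900}  # d_j per candidate
top = select_labels(result_counts, collection_counts, 10, 1000, top_n=1)
```

With these sample counts, "Japan" appears in most search results but is equally common across the whole collection, so its ratio stays near 1, whereas "Toyota" is concentrated in the search results and ranks first.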

The label explanatory word acquisition unit 120 acquires the “label list” and the “search results” from the label management unit 160. For each named entity in the “label list”, it takes the explanatory word candidates that co-occur with that named entity in the same sentence (all words, including proper nouns, are candidates), calculates an “explanatory word likeness” score for each using the information in the label DB 20 (FIGS. 4(d) and 4(e)), and treats the words whose score exceeds a predetermined value as explanatory words. It then transfers the “explanatory word list” for each named entity to the label management unit 160. The explanatory word list consists of the label, the explanatory word, and the score.

An example of how the label explanatory word acquisition unit 120 calculates “explanatory word likeness” is described here.

“Explanatory word likeness” rests on two assumptions: a good explanatory word co-occurs with the target named entity in the same sentence more often than its appearance frequency in the document set would suggest, and it co-occurs with the target named entity in the same sentence more often than it does with particular named entities other than the current target. It is expressed by a formula combining these two conditions.

The latter term may alternatively be expressed using entropy.

In these formulas, D is the set of contents in the content DB 10, and H_x is the set of documents containing the term x, where x is either a named entity N_i or an explanatory word E_j. C(x) denotes the set of documents in which the word x appears within a sentence. K denotes the set of named entities that co-occur within a sentence with a given named entity N_i. C(x, y) denotes the set of documents in which the words x and y co-occur within a sentence.
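The formulas for “explanatory word likeness” are not reproduced in this text, so the following is only a sketch of how the two stated conditions could be combined using the sets defined above. The product of a "lift" factor and a "specificity" factor, the function name, the parameter names, and the sample sets are all assumptions and should not be read as the patent's actual expression.

```python
def explanation_likeness(c_pair, c_entity, h_word, num_contents, c_other_pairs):
    """Assumed score for an explanatory word E_j with respect to a named entity N_i.

    c_pair:        C(N_i, E_j), documents where N_i and E_j co-occur in a sentence
    c_entity:      C(N_i), documents where N_i appears in a sentence
    h_word:        H_{E_j}, documents containing E_j
    num_contents:  |D|
    c_other_pairs: C(N_k, E_j) for the other named entities N_k in K
    """
    if not c_pair or not c_entity or not h_word:
        return 0.0
    # Condition 1 (assumed form): E_j is more frequent next to N_i than in
    # the document set as a whole.
    lift = (len(c_pair) / len(c_entity)) / (len(h_word) / num_contents)
    # Condition 2 (assumed form): E_j co-occurs with N_i more than with the
    # other named entities.
    total = len(c_pair) + sum(len(c) for c in c_other_pairs)
    specificity = len(c_pair) / total
    return lift * specificity


# Hypothetical document-ID sets in a collection of 100 contents.
specific = explanation_likeness({1, 2, 3}, {1, 2, 3, 4}, {1, 2, 3, 9}, 100, [{9}])
generic = explanation_likeness({1}, {1, 2, 3, 4}, set(range(50)), 100, [set(range(1, 30))])
```

In this sketch a word that appears almost only alongside the target entity scores far higher than a word that is common throughout the collection and co-occurs with many other entities.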

The label pair identification unit 130 acquires the “label list” and “search results” from the label management unit 160, calculates a “label pair likeness” score using the information in the label DB 20 (FIGS. 4(a) and 4(c)), treats the pairs whose score exceeds a predetermined value as label pairs, and transfers them to the label management unit 160 as the “label pair list”. The label pair list consists of label A, label B, and the score.

An example of how the label pair identification unit 130 calculates “label pair likeness” is described here.

ラベルペアらしさとしては、検索結果との関連性が強いものがその検索条件で絞られた分野で重要だと考えられる。そこで、「ラベルペアらしさ」の指標として、文書集合全体の中での頻度と、検索結果に閉じた状態での頻度の比を用いた以下のような基準が考えられる。 For label-pair likelihood, pairs that are strongly related to the search results are considered important in the field narrowed down by the search condition. Therefore, as an index of “label-pair likelihood”, a criterion such as the following can be considered, using the ratio between the frequency in the entire document set and the frequency within the search results alone.

ここで、ＤはコンテンツＤＢ１０のコンテンツの集合、Ｈは検索結果全体の集合、ｈ_ｘ，ｙは検索結果中でラベルペアｘ，ｙを含むコンテンツ数、ｄ_ｘ，ｙはコンテンツＤＢ１０中でラベルペアｘ，ｙを含むコンテンツ数を表す。Ｎ_ｘは固有表現を示す。

Here, D is the set of contents in the content DB 10, H is the set of all search results, h_{x,y} is the number of contents in the search results that contain the label pair x, y, and d_{x,y} is the number of contents in the content DB 10 that contain the label pair x, y. N_x denotes a named entity.
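As an illustration of the ratio described above, the sketch below scores a pair by (h_{x,y} / |H|) / (d_{x,y} / |D|). This exact form is an assumption, since the original formula is not present in this text. The names `label_pair_list`, `entity_docs`, and `threshold` are hypothetical, and a content is treated as containing a pair when it contains both labels.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def label_pair_list(
    labels: List[str],
    entity_docs: Dict[str, Set[int]],  # label -> IDs of all contents in D containing it
    result_ids: Set[int],              # H: IDs of the contents in the search result
    total_docs: int,                   # |D|
    threshold: float,
) -> List[Tuple[str, str, float]]:
    """Return (label A, label B, score) for every pair whose score exceeds the threshold."""
    pairs = []
    for a, b in combinations(sorted(labels), 2):
        both = entity_docs[a] & entity_docs[b]
        d_xy = len(both)                # contents in D containing the pair
        h_xy = len(both & result_ids)   # contents in H containing the pair
        if d_xy == 0 or not result_ids:
            continue
        # (h_xy / |H|) / (d_xy / |D|), rearranged to a single division.
        score = (h_xy * total_docs) / (len(result_ids) * d_xy)
        if score > threshold:
            pairs.append((a, b, score))
    return sorted(pairs, key=lambda p: -p[2])


entity_docs = {"A": {1, 2, 3}, "B": {2, 3, 9}, "C": {9}}
pairs = label_pair_list(["A", "B", "C"], entity_docs, {1, 2, 3}, 10, 1.0)
```

A score above 1 means the pair is relatively more frequent inside the search results than in the collection as a whole.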

ラベルペア説明語取得部１４０は、ラベル管理部１６０から「ラベルペア一覧」と「検索結果」を取得し、「ラベルペア一覧」の個々のペアを対象とし、対象とするペアと同一の文内で共起する説明語候補（固有名詞も含む全ての語の候補とする）について、ラベルＤＢ２０の情報（図４（ａ）（ｂ）（ｄ））を利用し、「ラベルペア説明語らしさ」のスコアを算出し、予め決められた値を越える語を説明語とし、それぞれのラベルペアに対して「ラベルペア説明語一覧」の情報をラベル管理部１６０に転送する。 The label pair explanatory word acquisition unit 140 acquires the “label pair list” and the “search result” from the label management unit 160, and targets each pair of the “label pair list” as a target and co-occurs in the same sentence as the target pair. Using the information in the label DB 20 (FIGS. 4 (a), (b), and (d)) for the explanatory word candidates to be performed (candidates for all words including proper nouns), the score of “label pair explanatory word likelihood” is calculated. Then, the words exceeding the predetermined value are used as explanation words, and the information of “label pair explanation word list” is transferred to the label management unit 160 for each label pair.

ここで、ラベルペア説明語取得部１４０で算出されるラベルペア説明語らしさのスコアの算出方法の例を説明する。 Here, an example of a method of calculating the score of the label pair explanatory word likelihood calculated by the label pair explanatory word acquisition unit 140 will be described.

「ラベルペア説明語らしさ」は、文書集合中での出現頻度に比べて、対象とするラベルペアと同一文内での共起頻度が高く、かつ、特定の固有表現との共起頻度と比べて対象とするラベルペアと同一文内での共起頻度が高い、と仮定すると以下の式で表現される。 “Label-pair explanatory-word likelihood” assumes that a word's co-occurrence frequency in the same sentence as the target label pair is high relative to its frequency of occurrence in the document set, and is also high relative to its co-occurrence frequency with a specific named entity. Under this assumption it is expressed by the following formula.

In addition, the latter term may be expressed using entropy as follows.

ラベルマップ生成部１５０は、取得した「ラベル一覧」と「ラベルペア一覧」と「検索結果」を元に、同一ワードを持つペアを接続し、ラベル管理部１６０を検索した、１つの検索結果を元に構成されるラベルペアのうち、同一ノードを持つペアを接続し、特定の話題に出現する固有名詞を繋げてクライアント１００に提示する。但し、必ずしもノードが同一であったといって、ペアを接続してよいとは限らない。そこで、ラベルペア間の接続可能性が予め決定した値だった場合のみそのラベルペアを接続することとする。

Based on the acquired “label list”, “label pair list”, and “search result”, the label map generation unit 150 connects pairs that have the same word. That is, among the label pairs built from a single search result obtained through the label management unit 160, it connects the pairs that share the same node, thereby linking the proper nouns that appear in a particular topic, and presents the result to the client 100. However, the fact that two pairs share a node does not by itself mean that they may be connected. Therefore, label pairs are connected only when the connection possibility between them meets a predetermined value.

ここで、ラベルマップ生成部１５０で算出される「ラベルペア間の接続可能性」の算出方法の例を説明する。 Here, an example of a calculation method of “connectability between label pairs” calculated by the label map generation unit 150 will be described.

「ラベルペア間の接続可能性」は、片方のノードが同一である２つのラベルペアの接続可能性を数値化するものである。この計算は、２つのノードの同一ではないノードの共起に基づき測定する。 The “connection possibility between label pairs” quantifies how connectable two label pairs are when they share one node. It is measured from the co-occurrence of the two nodes that the pairs do not share.
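The following Python sketch shows one way such a measure could be computed. The description only says that the measure is based on the co-occurrence of the two non-shared nodes; the overlap coefficient used here, the `>=` threshold comparison, and the names `connectability` and `build_label_map` are assumptions for illustration.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]


def connectability(pair1: Pair, pair2: Pair, entity_docs: Dict[str, Set[int]]) -> float:
    """Overlap of the documents containing the two nodes that the pairs do not share."""
    shared = set(pair1) & set(pair2)
    if len(shared) != 1:
        return 0.0  # only pairs sharing exactly one node are candidates
    (a,) = set(pair1) - shared
    (b,) = set(pair2) - shared
    docs_a = entity_docs.get(a, set())
    docs_b = entity_docs.get(b, set())
    smaller = min(len(docs_a), len(docs_b))
    return len(docs_a & docs_b) / smaller if smaller else 0.0


def build_label_map(
    pairs: List[Pair], entity_docs: Dict[str, Set[int]], threshold: float
) -> List[Tuple[Pair, Pair]]:
    """Connect two label pairs when their connectability reaches the threshold."""
    return [
        (p1, p2)
        for p1, p2 in combinations(pairs, 2)
        if connectability(p1, p2, entity_docs) >= threshold
    ]


entity_docs = {"A": {1, 2}, "B": {1, 2, 3}, "C": {2, 3}, "D": {7}}
edges = build_label_map([("A", "B"), ("B", "C"), ("B", "D")], entity_docs, 0.5)
```

In this example the pairs (A, B) and (B, C) are connected because A and C co-occur in a document, while (B, D) stays separate even though it shares node B.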

ラベル管理部１６０は、セッション管理部１８０から受信した「検索結果一覧」を取得し、「検索結果一覧」をラベル生成部１１０へ転送し、ラベル生成部１１０から「ラベル一覧」を取得し、当該「ラベル一覧」をラベル説明語取得部１２０に転送し、ラベル説明語取得部１２０から「説明語一覧」を取得し、「ラベル一覧」をラベルペア特定部１３０へ転送し、ラベルペア特定部１３０から「ラベルペア一覧」を取得し、「ラベルペア一覧」をラベルペア説明語取得部１４０へ転送し、ラベル説明語取得部１４０から「ラベルペア説明語一覧」を取得し、「ラベルペア一覧」と「ラベルペア」をラベルマップ生成部１５０へ転送し、ラベルマップ生成部１５０から「ラベルペア説明語一覧」（ラベルマップ）を取得し、これを検索結果に対する「ラベル情報」として、セッション管理部１８０へ転送する。なお、ラベルペア説明語一覧は、ラベルＡ、ラベルＢ，説明語、スコアからなる。

The label management unit 160 acquires the “search result list” received from the session management unit 180 and transfers it to the label generation unit 110, from which it acquires the “label list”. It transfers the “label list” to the label explanatory word acquisition unit 120 and acquires the “explanatory word list” from that unit. It then transfers the “label list” to the label pair identification unit 130 and acquires the “label pair list”, and transfers the “label pair list” to the label pair explanatory word acquisition unit 140 and acquires the “label pair explanatory word list”. Finally, it transfers the “label pair list” and the “label pair” to the label map generation unit 150, acquires the “label pair explanatory word list” (label map) from the label map generation unit 150, and transfers this to the session management unit 180 as the “label information” for the search result. The label pair explanatory word list consists of label A, label B, an explanatory word, and a score.

検索エンジン１７０は、事前準備として、コンテンツＤＢ１０内の文書やデータを解析し、検索条件が入力された場合に高速に検索結果を返却できる索引情報を作成する。また、セッション管理部１８０から「検索条件」を取得した場合にはその索引情報を元に「検索条件」に該当するコンテンツの集合である「検索結果一覧」を生成し、セッション管理部１８０へ送信する。 As advance preparation, the search engine 170 analyzes the documents and data in the content DB 10 and creates index information that allows search results to be returned quickly when a search condition is input. When it acquires a “search condition” from the session management unit 180, it uses the index information to generate a “search result list”, which is the set of contents matching the “search condition”, and transmits it to the session management unit 180.
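The description does not specify the form of the index information. A common choice is an inverted index, sketched below in Python under that assumption. Whitespace tokenization and AND semantics for multi-term conditions are also assumptions (Japanese text would need morphological analysis instead), and `build_index` and `search` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_index(documents: Dict[int, str]) -> Dict[str, Set[int]]:
    """Advance preparation: map each term to the IDs of the documents containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def search(index: Dict[str, Set[int]], terms: List[str]) -> List[int]:
    """Return the IDs of documents that contain every term of the search condition."""
    if not terms:
        return []
    postings = [index.get(term.lower(), set()) for term in terms]
    return sorted(set.intersection(*postings))


documents = {1: "tokyo tower opens", 2: "tokyo station", 3: "osaka castle"}
index = build_index(documents)
hits = search(index, ["tokyo", "station"])
```

Because the index is built once in advance, answering a search condition only requires set intersections rather than a scan of every document.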

セッション管理部１８０は、ブラウザ２００上に表示される検索要求入力画面２１０を通じて入力されたユーザの「検索条件」を取得後、検索エンジン１７０及びラベル管理部１６０にアクセスし、「検索結果一覧」及び「ラベル情報」を取得し、ブラウザ２００に送信する。 After acquiring the user's “search condition” entered through the search request input screen 210 displayed on the browser 200, the session management unit 180 accesses the search engine 170 and the label management unit 160, acquires the “search result list” and the “label information”, and transmits them to the browser 200.

クライアント側のブラウザ２００の検索要求入力画面２１０は、ユーザからの「検索条件」を受け付け、ブラウザ２００を通じてセッション管理部１８０に当該「検索条件」を送信する。 The search request input screen 210 of the browser 200 on the client side receives the “search condition” from the user and transmits the “search condition” to the session management unit 180 through the browser 200.

クライアント側のブラウザ２００のラベル表示部２２０は、セッション管理部１８０から受信した「ラベル情報」に基づいて、検索結果を分類するラベルの情報をユーザに提示する。また、この分類情報のラベル及びラベル群がユーザによって選択された場合には、元の「検索条件」と共に、新たな「検索条件」を生成し、セッション管理部１８０に送信する。 Based on the “label information” received from the session management unit 180, the label display unit 220 of the browser 200 on the client side presents the user with label information for classifying the search results. When the label and label group of the classification information are selected by the user, a new “search condition” is generated together with the original “search condition” and transmitted to the session management unit 180.

クライアント側のブラウザ２００の検索結果表示部２３０は、セッション管理部１８０から受信した「検索結果一覧」の情報に基づき、検索結果をユーザに提示する。 The search result display unit 230 of the browser 200 on the client side presents the search result to the user based on the “search result list” information received from the session management unit 180.

図５は、ブラウザ２００の検索要求入力画面２１０の例を示し、図６は、ブラウザ２００のラベル表示部２２０及び検索結果表示部２３０に表示される画面の例を示している。 FIG. 5 shows an example of the search request input screen 210 of the browser 200, and FIG. 6 shows an example of a screen displayed on the label display unit 220 and the search result display unit 230 of the browser 200.

次に、上記の構成における動作を説明する。 Next, the operation in the above configuration will be described.

本発明では、情報検索システムでユーザに対してサービスを行うために、前処理を行う段階と、実際にユーザに対してサービスを行う段階との２つの段階がある。 In the present invention, in order to provide a service to the user in the information search system, there are two stages: a stage for performing preprocessing and a stage for actually providing service to the user.

図７は、本発明の一実施の形態における事前準備時の処理のフローチャートである。 FIG. 7 is a flowchart of the process at the time of advance preparation according to the embodiment of the present invention.

ステップ１０１）検索エンジン１７０は、コンテンツＤＢ１０内の文書を解析し、検索条件がセッション管理部１８０から入力された場合に、検索結果を高速に返却するための索引情報を生成し、メモリ等の記憶手段に格納する。 Step 101) The search engine 170 analyzes the document in the content DB 10, generates index information for returning the search result at a high speed when a search condition is input from the session management unit 180, and stores it in a memory or the like. Store in the means.

ステップ１０２）テキスト解析部１９０は、コンテンツＤＢ１０内の文書やデータを解析し、各コンテンツからラベル及び説明語の統計情報を取得し、ラベルＤＢ２０に格納する。 Step 102) The text analysis unit 190 analyzes the document and data in the content DB 10, acquires the statistical information of the label and the explanatory word from each content, and stores it in the label DB 20.
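As a rough illustration of step 102, the Python sketch below collects the kinds of statistics the label DB is described as holding: which documents contain each proper noun, which contain a pair of proper nouns in the same sentence, and which contain a proper noun together with another word in the same sentence. Splitting sentences on periods, tokenizing on whitespace, and recognizing proper nouns by lookup in a given set are simplifying assumptions; the function name `analyze` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Set, Tuple


def analyze(documents: Dict[int, str], known_entities: Set[str]):
    """Collect per-document statistics for the label DB (simplified)."""
    entity_docs: Dict[str, Set[int]] = defaultdict(set)
    pair_docs: Dict[Tuple[str, str], Set[int]] = defaultdict(set)
    entity_word_docs: Dict[Tuple[str, str], Set[int]] = defaultdict(set)
    for doc_id, text in documents.items():
        for sentence in text.split("."):
            tokens = sentence.split()
            entities = sorted({t for t in tokens if t in known_entities})
            words = {t.lower() for t in tokens if t not in known_entities}
            for entity in entities:
                entity_docs[entity].add(doc_id)
                for word in words:
                    # proper noun and candidate explanatory word in the same sentence
                    entity_word_docs[(entity, word)].add(doc_id)
            for a, b in combinations(entities, 2):
                # two proper nouns in the same sentence
                pair_docs[(a, b)].add(doc_id)
    return dict(entity_docs), dict(pair_docs), dict(entity_word_docs)


documents = {1: "Alice met Bob. Carol slept.", 2: "Alice sang."}
entity_docs, pair_docs, entity_word_docs = analyze(documents, {"Alice", "Bob", "Carol"})
```

Here Alice and Bob form a pair because they share a sentence, whereas Alice and Carol do not, even though both appear in document 1.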

次に、検索サービスを実行する場合の動作を説明する。 Next, the operation when the search service is executed will be described.

図８、図９は、本発明の一実施の形態における検索サービス提供時の処理のフローチャートである。 8 and 9 are flowcharts of processing when providing a search service according to an embodiment of the present invention.

ステップ２０１）クライアント側のブラウザ２００の検索要求入力画面２１０は、ユーザからの「検索条件」を取得し、サーバ１００のセッション管理部１８０に送信する。 Step 201) The search request input screen 210 of the browser 200 on the client side acquires the “search condition” from the user and transmits it to the session management unit 180 of the server 100.

ステップ２０２）セッション管理部１８０は、受信した「検索条件」を検索エンジン１７０に転送し、検索エンジン１７０で検索された「検索結果」を取得する。 Step 202) The session management unit 180 transfers the received “search condition” to the search engine 170, and acquires the “search result” searched by the search engine 170.

ステップ２０３）セッション管理部１８０は、「検索結果」をラベル管理部１６０に転送する。 Step 203) The session management unit 180 transfers the “search result” to the label management unit 160.

ステップ２０４）ラベル管理部１６０は、「検索結果」をラベル生成部１１０に転送する。 Step 204) The label management unit 160 transfers the “search result” to the label generation unit 110.

ステップ２０５）ラベル生成部１１０は、「検索結果」を元に、ラベルＤＢ２０にアクセスし、当該検索条件に一致する検索結果集合を分類するための指標となる固有表現やその他の単語や複合語をラベルとして取得し、ラベル一覧を生成し、ラベル管理部１６０に転送する。 Step 205) Based on the “search result”, the label generation unit 110 accesses the label DB 20 and selects a specific expression or other word or compound word that serves as an index for classifying the search result set that matches the search condition. Obtained as a label, generates a list of labels, and transfers it to the label management unit 160.

ステップ２０６）ラベル管理部１６０は、ラベル生成部１１０より「ラベル一覧」を取得し、ラベル説明語取得部１２０に「ラベル一覧」と「検索結果」を転送する。 Step 206) The label management unit 160 acquires the “label list” from the label generation unit 110, and transfers the “label list” and the “search result” to the label explanatory word acquisition unit 120.

ステップ２０７）ラベル説明語取得部１２０は、「ラベル一覧」と「検索結果」のラベルに基づいてラベルＤＢ２０（図４（ｄ）（ｅ））にアクセスし、ラベルのうち、人や人物などの具体的な固有物を示すラベルを特定し、そのラベルを説明するキーワードやセンテンス等を取得して、各ラベルの「ラベル説明語一覧」を生成し、ラベル管理部１６０に転送する。 Step 207) The label explanatory word acquisition unit 120 accesses the label DB 20 (FIGS. 4D and 4E) based on the labels of the “label list” and the “search result”. A label indicating a specific unique object is specified, a keyword or sentence describing the label is acquired, a “label explanation word list” of each label is generated, and transferred to the label management unit 160.

ステップ２０８）ラベル管理部１６０は、ラベル説明語取得部１２０より「ラベル説明語一覧」を取得する。 Step 208) The label management unit 160 acquires the “label explanation word list” from the label explanation word acquisition unit 120.

ステップ２０９）ラベル管理部１６０は、ラベルペア特定部１３０に「ラベル一覧」と「検索結果」を転送する。 Step 209) The label management unit 160 transfers the “label list” and the “search result” to the label pair identification unit 130.

ステップ２１０）ラベルペア特定部１３０は、「ラベル一覧」のラベルに基づいてラベルＤＢ２０（図４（ａ）（ｃ））にアクセスし、人や物などの具体的な固有物を示すラベルを特定し、それぞれのラベル間の関連性を判定し、各ラベルの「ラベルペア一覧」を生成する。 Step 210) The label pair identification unit 130 accesses the label DB 20 (FIGS. 4A and 4C) based on the label of the “label list”, and identifies a label indicating a specific unique object such as a person or an object. The relationship between the labels is determined, and a “label pair list” for each label is generated.

ステップ２１１）ラベル管理部１６０は、ラベルペア特定部１３０より「ラベルペア一覧」を取得する。 Step 211) The label management unit 160 acquires the “label pair list” from the label pair identification unit 130.

ステップ２１２）ラベル管理部１６０は、ラベルペア説明語取得部１４０に「ラベルペア一覧」と「検索結果」を転送する。 Step 212) The label management unit 160 transfers the “label pair list” and the “search result” to the label pair explanatory word acquisition unit 140.

ステップ２１３）ラベルペア説明語取得部１４０は、ラベルＤＢ２０（図４（ａ）（ｂ）（ｄ）（ｅ））にアクセスし、「ラベルペア一覧」のそれぞれのラベルペアの関係を説明するキーワードやセンテンス等を取得し、各ラベルペアの「ラベルペア説明語一覧」を生成する。 Step 213) The label pair explanatory word acquisition unit 140 accesses the label DB 20 (FIGS. 4A, 4B, 4D, and 4E), keywords, sentences, and the like that explain the relationship between the respective label pairs in the “label pair list”. And generate a “label pair explanatory word list” for each label pair.

ステップ２１４）ラベル管理部１６０は、ラベルペア説明語取得部１４０より「ラベルペア説明語一覧」を取得する。 Step 214) The label management unit 160 acquires the “label pair explanatory word list” from the label pair explanatory word acquisition unit 140.

ステップ２１５）ラベル管理部１６０は、ラベルマップ生成部１５０に「ラベル一覧」及び「ラベルペア一覧」を転送する。 Step 215) The label management unit 160 transfers the “label list” and the “label pair list” to the label map generation unit 150.

ステップ２１６）ラベルマップ生成部１５０は、「ラベル一覧」「ラベルペア一覧」「検索結果」を元にラベルＤＢ２０（図４（ａ）（ｂ）（ｃ））の情報にアクセスし、この情報に基づいて、ラベル間の関係を示す「ラベルマップ」を生成し、ラベル管理部１６０に転送する。 Step 216) The label map generation unit 150 accesses the information in the label DB 20 (FIGS. 4A, 4B, and 4C) based on the “label list”, “label pair list”, and “search result”, and based on this information Then, a “label map” indicating the relationship between the labels is generated and transferred to the label management unit 160.

ステップ２１７）ラベル管理部１６０は、ラベルマップ生成部１５０よりラベルマップを取得する。 Step 217) The label management unit 160 acquires a label map from the label map generation unit 150.

ステップ２１８）ラベル管理部１６０は、「ラベル一覧」「ラベル説明語一覧」「ラベルペア一覧」「ラベルペア説明語一覧」「ラベルマップ」を「ラベル情報」としてまとめ、セッション管理部１８０へ転送する。 Step 218) The label management unit 160 collects “label list”, “label explanation word list”, “label pair list”, “label pair explanation word list”, and “label map” as “label information” and transfers them to the session management unit 180.

ステップ２１９）セッション管理部１８０は、「検索結果」及び「ラベル情報」をブラウザ２００に送信する。 Step 219) The session management unit 180 transmits “search result” and “label information” to the browser 200.

ステップ２２０）ブラウザ２００のラベル表示部２２０は、「ラベル情報」を取得し、本データを元に各ラベル間の関連に基づいて表示する。 Step 220) The label display unit 220 of the browser 200 acquires “label information” and displays it based on the relationship between the labels based on this data.

ステップ２２１）検索結果表示部２３０は、「検索結果」を取得して表示する。 Step 221) The search result display unit 230 acquires and displays the “search result”.

ステップ２２２）ユーザがラベル表示部２２０に表示されたラベルを選択した場合には、ステップ２２３に移行し、そうでない場合には、処理を終了する。 Step 222) If the user selects a label displayed on the label display unit 220, the process proceeds to step 223, and if not, the process ends.

ステップ２２３）検索結果表示部２３０は、ユーザから「ラベルの選択」を取得すると、元々の「検索条件」と組み合わせた「新たな検索条件」を生成し、セッション管理部１８０に送信し、ステップ２０２の処理に移行する。 Step 223) Upon obtaining “label selection” from the user, the search result display unit 230 generates a “new search condition” combined with the original “search condition” and transmits it to the session management unit 180. Move on to processing.
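The request flow of steps 202 to 219 and the refinement of step 223 can be sketched in Python as follows. The description says only that the selected label is combined with the original search condition; treating the condition as a list of terms that must all match is an assumption, and `refine_condition`, `handle_request`, and the injected callables are hypothetical stand-ins for the units described above.

```python
from typing import Callable, Dict, List, Tuple


def refine_condition(original_terms: List[str], selected_labels: List[str]) -> List[str]:
    """Step 223: combine the original condition with the labels the user selected."""
    refined = list(original_terms)
    for label in selected_labels:
        if label not in refined:
            refined.append(label)
    return refined


def handle_request(
    condition: List[str],
    search: Callable[[List[str]], List[int]],
    make_label_info: Callable[[List[int]], Dict],
) -> Tuple[List[int], Dict]:
    """Session management: run the search, then obtain the label information."""
    results = search(condition)             # step 202
    label_info = make_label_info(results)   # steps 203 to 218
    return results, label_info              # step 219: both go to the browser


new_condition = refine_condition(["soccer"], ["Alice", "soccer"])
results, info = handle_request(
    new_condition, lambda c: [1, 2], lambda r: {"labels": len(r)}
)
```

Passing the refined condition back into `handle_request` corresponds to returning to step 202 with the new search condition.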

なお、上記の図３に示すサーバ１００の各構成要素の図８、図９に示す動作をプログラムとして構築し、サーバとして利用されるコンピュータに実行させる、または、ネットワークを介して流通させることが可能である。 The operations shown in FIGS. 8 and 9 of the components of the server 100 shown in FIG. 3 can be constructed as a program and executed by a computer used as a server, or distributed via a network. It is.

また、構築されたプログラムをハードディスク装置や、フレキシブルディスク、ＣＤ−ＲＯＭ等の可搬記憶媒体に格納し、コンピュータにインストールする、または、配布することが可能である。 Further, the constructed program can be stored on a hard disk device or on a portable storage medium such as a flexible disk or a CD-ROM, and then installed on a computer or distributed.

なお、本発明は、上記の実施の形態に限定されることなく、特許請求の範囲内において種々変更・応用が可能である。 The present invention is not limited to the above-described embodiment, and various modifications and applications can be made within the scope of the claims.

本発明は、インターネット等のネットワークにおいて、自然言語で記述された文書を検索するためのシステムに適用可能である。 The present invention can be applied to a system for searching a document described in a natural language in a network such as the Internet.

本発明の原理を説明するための図である。 It is a figure for demonstrating the principle of this invention.
本発明の原理構成図である。 It is a principle block diagram of this invention.
本発明の一実施の形態におけるシステム構成図である。 It is a system configuration diagram in one embodiment of the present invention.
本発明の一実施の形態におけるラベルＤＢの例である。 It is an example of the label DB in one embodiment of the present invention.
本発明の一実施の形態におけるブラウザ画面（検索要求入力画面）の例である。 It is an example of the browser screen (search request input screen) in one embodiment of the present invention.
本発明の一実施の形態におけるブラウザ画面（ラベル及び検索結果表示画面）の例である。 It is an example of the browser screen (label and search result display screen) in one embodiment of the present invention.
本発明の一実施の形態における事前準備時の処理のフローチャートである。 It is a flowchart of the process at the time of advance preparation in one embodiment of the present invention.
本発明の一実施の形態における検索サービス提供時の処理のフローチャート（その１）である。 It is a flowchart (part 1) of the process at the time of search service provision in one embodiment of the present invention.
本発明の一実施の形態における検索サービス提供時の処理のフローチャート（その２）である。 It is a flowchart (part 2) of the process at the time of search service provision in one embodiment of the present invention.

Explanation of symbols

１０コンテンツＤＢ
２０ラベルＤＢ
３０関連文書取得部
１００サーバ
１１０ラベル生成手段、ラベル生成部
１２０ラベル説明語取得手段、ラベル説明語取得部
１３０ラベルペア特定手段、ラベルペア特定部
１４０ラベルペア説明語取得手段、ラベルペア説明語取得部
１５０ラベルマップ生成手段、ラベルマップ生成部
１６０ラベル管理部
１７０検索手段、検索エンジン
１８０出力手段、セッション管理部
１９０テキスト解析部
２００ブラウザ
２１０検索要求入力画面
２２０ラベル表示部
２３０検索結果表示部

10 Content DB
20 Label DB
30 Related document acquisition unit
100 Server
110 Label generation means, label generation unit
120 Label explanatory word acquisition means, label explanatory word acquisition unit
130 Label pair specifying means, label pair identification unit
140 Label pair explanatory word acquisition means, label pair explanatory word acquisition unit
150 Label map generation means, label map generation unit
160 Label management unit
170 Search means, search engine
180 Output means, session management unit
190 Text analysis unit
200 Browser
210 Search request input screen
220 Label display unit
230 Search result display unit

Claims

An information search method in which a plurality of documents written in a natural language are accumulated in a content DB, and a search result corresponding to a search condition given by a user is presented together with characteristic keywords or sentences in the search result as labels,
the method being performed in an information search apparatus having a label DB that stores document IDs, information on the proper nouns and other words appearing in each document, information on the proper noun pairs appearing in each document, information on the proper nouns and explanatory words appearing in each document, and information on the proper noun pairs and explanatory words appearing in each document, and further having a search means, a label generation means, a label explanatory word acquisition means, a label pair specifying means, a label pair explanatory word acquisition means, a label map generation means, and an output means, the method comprising:
a search step in which the search means searches the content DB based on the search condition and obtains a search result;
a label generation step in which the label generation means searches the label DB based on the document ID included in the search result acquired in the search step to acquire a proper noun, and acquires the proper noun and the document ID related to the proper noun as a label;
a label explanatory word acquisition step in which the label explanatory word acquisition means specifies, among the labels acquired in the label generation step, a label indicating a specific unique object such as a person or an object, searches the label DB based on the label, and acquires a keyword or sentence describing the label as a label explanatory word;
a label pair identification step in which the label pair specifying means specifies, among the labels acquired in the label generation step, a label indicating a specific unique object such as a person or an object, searches the label DB based on the label, and identifies as a label pair those labels that are strongly related to each other, based on the search result obtained in the search step;
a label pair explanatory word acquisition step in which the label pair explanatory word acquisition means searches the label DB based on the labels of the label pair and acquires a label pair explanatory word consisting of a keyword or sentence explaining the relationship between the labels of each pair;
a label map generating step in which the label map generation means generates a label map indicating the relationship between labels, using the label, the label pair, the label explanatory word, and the label pair explanatory word; and
an output step in which the output means outputs the label map to the device on the user side.

The information search method according to claim 1, wherein, in the label generation step, for the proper nouns acquired by searching the label DB, a ratio between the frequency in the entire document set of the content DB and the frequency in the search result is obtained, and a predetermined number of proper nouns are selected based on the ratio.

The information search method according to claim 1, wherein, in the label explanatory word acquisition step, for each proper noun of the acquired labels, a co-occurrence frequency is obtained for explanatory word candidates that co-occur in the same sentence as the target proper noun, and a word exceeding a predetermined value is acquired as an explanatory word.

The information search method according to claim 1, wherein, in the label pair identification step, for the labels, a ratio between the appearance frequency in the entire document set of the content DB and the appearance frequency in the search result is obtained, and two labels exceeding a predetermined value are acquired as a label pair.

The information search method according to claim 1, wherein, in the label pair explanatory word acquisition step, a word is acquired as the label pair explanatory word when its co-occurrence frequency in the same sentence as the label pair is high compared with its appearance frequency in the document set of the content DB, and is also high compared with its co-occurrence frequency with a specific proper noun.

The information search method according to claim 1, wherein, in the label map generating step, the label map is generated by joining those label pairs that share the same node, thereby connecting proper nouns that appear in a specific topic.

An information search apparatus in which a plurality of documents written in a natural language are accumulated in a content DB, and a search result corresponding to a search condition given by a user is presented together with characteristic keywords or sentences in the search result as labels, the apparatus comprising:
search means for searching the content DB based on the search condition and obtaining a search result;
a label DB that stores document IDs, information on the proper nouns and other words appearing in each document, information on the proper noun pairs appearing in each document, information on the proper nouns and explanatory words appearing in each document, and information on the proper noun pairs and explanatory words appearing in each document;
label generation means for searching the label DB based on the document ID included in the search result acquired from the search means to acquire a proper noun, and acquiring the proper noun and the document ID related to the proper noun as a label;
label explanatory word acquisition means for specifying, among the labels acquired by the label generation means, a label indicating a specific unique object such as a person or an object, searching the label DB based on the label, and acquiring a keyword or sentence describing the label as a label explanatory word;
label pair specifying means for specifying, among the labels acquired by the label generation means, a label indicating a specific unique object such as a person or an object, searching the label DB based on the label, and identifying as a label pair those labels that are strongly related to each other, based on the search result obtained by the search means;
label pair explanatory word acquisition means for searching the label DB based on the labels of the label pair and acquiring a label pair explanatory word consisting of a keyword or sentence explaining the relationship between the labels of each pair;
label map generation means for generating a label map indicating the relationship between labels, using the label, the label pair, the label explanatory word, and the label pair explanatory word; and
output means for outputting the label map to the device on the user side.

The information search apparatus according to claim 7, wherein the label generation means includes means for obtaining, for the proper nouns acquired by searching the label DB, a ratio between the frequency in the entire document set of the content DB and the frequency in the search result, and for selecting a predetermined number of proper nouns based on the ratio.

The information search apparatus according to claim 7, wherein the label explanatory word acquisition means includes means for obtaining, for each proper noun of the acquired labels, a co-occurrence frequency of explanatory word candidates that co-occur in the same sentence as the target proper noun, and for acquiring a word exceeding a predetermined value as an explanatory word.

The information search apparatus according to claim 7, wherein the label pair specifying means includes means for obtaining, for the labels, a ratio between the appearance frequency in the entire document set of the content DB and the appearance frequency in the search result, and for acquiring two labels exceeding a predetermined value as a label pair.

The information search apparatus according to claim 7, wherein the label pair explanatory word acquisition means includes means for acquiring, as the label pair explanatory word, a word whose co-occurrence frequency in the same sentence as the label pair is high compared with its appearance frequency in the document set of the content DB, and is also high compared with its co-occurrence frequency with a specific proper noun.

The information search apparatus according to claim 7, wherein the label map generation means includes means for generating the label map by joining those label pairs that share the same node, thereby connecting proper nouns that appear in a specific topic.

An information search program for causing a computer to function as the information search apparatus according to any one of claims 7 to 12, the apparatus having the search means, the label generation means, the label explanatory word acquisition means, the label pair specifying means, the label pair explanatory word acquisition means, the label map generation means, and the output means.

A storage medium storing an information search program, wherein the stored program causes a computer to function as the information search apparatus according to any one of claims 7 to 12, the apparatus having the search means, the label generation means, the label explanatory word acquisition means, the label pair specifying means, the label pair explanatory word acquisition means, the label map generation means, and the output means.