JP2008027207A - Retrieval system and retrieval method

Info

Publication number: JP2008027207A
Application number: JP2006199312A
Authority: JP
Inventors: Michiko Yasukawa; Hidetoshi Yokoo; Tomofumi Uchiyama
Original assignee: Gunma University NUC
Current assignee: Gunma University NUC
Priority date: 2006-07-21
Filing date: 2006-07-21
Publication date: 2008-02-07
Anticipated expiration: 2026-07-21
Also published as: JP4547500B2

Abstract

PROBLEM TO BE SOLVED: To present search results to a user as clusters that are easy to understand.

SOLUTION: The retrieval system acquires a plurality of related words for a search word from a search query log (102), performs a meta-search for the search word across a plurality of search engines (104), extracts text data from the retrieved Web pages (108), obtains a plurality of words by morphologically analyzing the extracted text data (112), builds a word frequency matrix over the plurality of Web pages (116), and calculates the similarity between related words, considering only the related words (120). The system then clusters the related words based on the calculated similarity to generate a prescribed number of related word clusters (122), weights the related word clusters by the search counts of the related words and sorts them (124), and displays the list of related word clusters as the search result (128).

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a search device and a search method, and more particularly to a search device and a search method for searching document data using a search engine.

Conventionally, Web search engines have been used to perform a wide variety of searches. When searching for information about trending things and phenomena, people, companies, products, services, TV programs, and the like, it can be difficult to narrow down the search results with appropriate related words because the user does not know much about the search target.

Further, even when the user knows the search target to some extent, the user may want an overview of only the group of pages of interest rather than browsing all of the huge number of results retrieved for the search word.

In general, a document set to be searched often contains similar documents. A cluster-type search is therefore known in which the document set is grouped (clustered) in advance according to similarity and, at search time, the degree of matching between these groups (clusters) and the search question (search query) is calculated (Non-Patent Document 1). Because the group of Web pages retrieved for a given search word contains many similar Web pages, appropriate clustering makes it easy to narrow down the search results and to get an overview of them.
Non-Patent Document 1: Takenobu Tokunaga, “Information Retrieval and Language Processing”, University of Tokyo Press (1999)

However, with the technique described in Non-Patent Document 1, when the group of Web pages in the search result is clustered by Web page, the group contains a large amount of miscellaneous information that does not match the user's search needs. As a result, clusters whose meaning is unclear to the user and clusters that do not help narrow down the search target are generated, so the clustered search results are difficult for the user to understand and inconvenient to use.

The present invention has been made to solve the above problem, and an object of the present invention is to provide a search device and a search method that can display search results as clusters that are easy for the user to understand.

In order to achieve the above object, a search device according to the present invention includes: document data acquisition means for acquiring, from a document database storing a plurality of document data, a plurality of document data matching a search word; frequency calculation means for calculating, for each of the document data, the appearance frequency of each of a plurality of related words related to the search word, based on words obtained by morphologically analyzing each of the document data acquired by the document data acquisition means; similarity calculation means for calculating the similarity between the related words based on the appearance frequencies calculated by the frequency calculation means; clustering means for clustering the plurality of related words by combining related words starting from the combinations with the highest similarity calculated by the similarity calculation means, thereby generating a predetermined number of related word clusters; and display means for displaying the related word clusters generated by the clustering means as a search result for the document data matching the search word.

A search method according to the present invention includes: acquiring, from a document database storing a plurality of document data, a plurality of document data matching a search word; calculating, for each of the document data, the appearance frequency of each of a plurality of related words related to the search word, based on words obtained by morphologically analyzing each of the acquired document data; calculating the similarity between the related words based on the calculated appearance frequencies; clustering the plurality of related words by combining related words starting from the combinations with the highest calculated similarity, thereby generating a predetermined number of related word clusters; and displaying the generated related word clusters as a search result for the document data matching the search word.

According to the present invention, a plurality of document data matching a search word is acquired from a document database storing a plurality of document data, and each of the acquired document data is morphologically analyzed to obtain the words of the document data. Then, based on the obtained words, the appearance frequency of each of a plurality of related words related to the search word is calculated for each of the document data, and the similarity between the related words is calculated based on these appearance frequencies.

Then, the plurality of related words are clustered by combining related words starting from the combinations with the highest calculated similarity to generate a predetermined number of related word clusters, and the generated related word clusters are displayed as a search result for the document data matching the search word.

Accordingly, the result of clustering the related words, based on the appearance frequency of each related word in each document data matching the search word, is displayed as the search result. Because the related word clusters are generated with words unrelated to the user's search word excluded, the search results can be displayed as clusters that are easy for the user to understand.

Here, a related word of a search word is a word that a user entered into a search engine together with the search word.

The search device according to the present invention may further include association means for associating, for each related word cluster generated by the clustering means and based on the appearance frequencies of the related words, those document data among the plurality of document data acquired by the document data acquisition means that are characterized by the related words of the related word cluster with that cluster. The display means can then display, as the search result, the related word clusters together with document data information indicating the document data associated with each related word cluster. Because the document data matching the search word is displayed in association with the related word clusters, the search result display becomes more convenient for the user.

The search device according to the present invention may further include related word acquisition means for acquiring, as related words, a plurality of words that were used as search words together with the search word of the document data acquisition means, based on a database storing a plurality of search queries each consisting of at least one search word. The frequency calculation means can then calculate, for each of the document data, the appearance frequencies of the plurality of related words acquired by the related word acquisition means. In this way, a plurality of related words for the search word can be acquired from a database storing a search query log.

The search device according to the present invention may alternatively include related word acquisition means for acquiring, as related words, a plurality of words that were used as search words together with a synonym of the search word of the document data acquisition means, based on a database storing a plurality of search queries each consisting of at least one search word. The frequency calculation means can then calculate, for each of the document data, the appearance frequencies of the plurality of related words acquired by the related word acquisition means. In this way, a plurality of related words for synonyms of the search word can be acquired from a database storing a search query log.

The display means according to the present invention can display the related word clusters as the search result in order, starting from the cluster containing the related words most frequently searched together with the search word. Displaying first the clusters whose related words are most strongly related to the search word matches the user's search needs and makes the search result display more convenient.

As described above, according to the search device and search method of the present invention, the result of clustering the related words, based on the appearance frequency of each related word in each document data matching the search word, is displayed as the search result. Because the related word clusters are generated with words unrelated to the user's search word excluded, the search results can be displayed as clusters that are easy for the user to understand.

Hereinafter, embodiments will be described in detail with reference to the drawings. The embodiments describe a case in which the present invention is applied to a search device that searches a plurality of search engines at once (meta search).

As shown in FIG. 1, the search system 10 according to the first embodiment includes a search query log database 12 that stores a search query log composed of a plurality of search queries, and a computer 18. The computer 18 is connected to a plurality of Web search engines 14, which search for Web pages in response to a search query consisting of at least one search word, and to a cache data database 16, which temporarily stores cache data of the Web pages retrieved by the Web search engines 14. The computer 18 stores a meta search engine program that implements a meta search engine for searching the plurality of Web search engines 14 at once.

The search query log database 12 stores a plurality of search queries, each consisting of one or more search words. The plurality of Web search engines 14 are, for example, major search engines on the Internet (http://www.yahoo.co.jp/, http://search.msn.co.jp/).

The computer 18 is also connected to a morphological analyzer 20 for morphologically analyzing text data and to a matrix calculation library 22 for performing matrix calculations.

The meta search engine program is a program for executing the meta search processing routine described later, and includes: a related word data acquisition module that acquires related words for the input search word based on the search query log data obtained from the search query log database 12; a search data acquisition module that searches, with the plurality of Web search engines 14, for Web pages matching the search word of the search query and temporarily stores the cache data of the retrieved Web pages in the cache data database 16; a matrix creation module that obtains words by having the morphological analyzer 20 analyze the cache data in the cache data database 16 and creates a word frequency matrix indicating the appearance frequency of nouns and unknown words for each retrieved Web page; a cluster generation module that performs matrix calculations on the word frequency matrix with the matrix calculation library 22 and clusters the related words to generate related word clusters; and a cluster ordering module that orders the generated related word clusters.

Note that a related word of a search word is a word that a user entered together with the search word in a search query to a search engine.

Next, the problems with conventional clustering of Web search results will be described. Existing methods for clustering documents, Web pages, and Web search results allow an effective cluster search when the whole group of documents or Web pages to be clustered covers the users' search needs in a separable form, or when only the documents of interest to the user can be clearly separated from the rest of the document group.

In general, however, the search results returned by a Web search engine are not an ideal group of Web pages for an effective cluster search, and often contain a large amount of miscellaneous information that is meaningless to the user.

For example, consider a case where a word frequency matrix of Web pages as shown in FIG. 2 is created from the search results returned by a Web search engine for the search word “English conversation”. Calculating similarity in the Web page direction of this matrix makes it possible to cluster Web pages, and calculating similarity in the word direction makes it possible to cluster words. If all the matrix elements in FIG. 2 are used for clustering, the clustering result deteriorates because words such as “case”, “month”, and “day”, which have little to do with the search word “English conversation”, take on large feature values: they appear with high frequency, or conversely appear rarely and are therefore treated as highly distinctive, or co-occur with other words. For example, the similarity calculation may split the Web pages into {English conversation learning, English diary, GEOS} and {English conversation BBS, AEON}, or split the words into {case, day} and {month, free, school, teaching material}. In either case, the clustering result is difficult for the user to understand.

Thus, when the entire word frequency matrix built from the Web search results is used, miscellaneous information that does not match the user's search needs affects the result, so neither Web page clustering nor word clustering yields an effective cluster search.

The meta search processing routine, which is executed by the computer 18 to display the clustering result of the related words as the search result for a search word, will now be described with reference to FIG. 3.

First, in step 100, it is determined whether the user has input a search word. When the user inputs a search word by operating a keyboard or mouse (not shown), the routine proceeds to step 102, where related word data indicating a plurality of related words for the search word is acquired from the search query log database 12.
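The related word acquisition of step 102 can be sketched as follows. This is only an illustration under stated assumptions: the log format (a list of query strings with search counts), the whitespace tokenization, the romanized tokens, and the function name `related_words` are all hypothetical and are not specified by the embodiment.

```python
from collections import Counter


def related_words(query_log, search_word, limit=100):
    """Return (word, total search count) pairs for words that were entered
    together with search_word, most frequently searched first."""
    counts = Counter()
    for query, n_searches in query_log:
        terms = query.split()  # assumption: query terms are whitespace-separated
        if search_word in terms:
            for term in terms:
                if term != search_word:
                    counts[term] += n_searches
    return counts.most_common(limit)


# Hypothetical query log: (query string, number of searches).
log = [
    ("eikaiwa school", 500),
    ("eikaiwa free", 300),
    ("eikaiwa free material", 100),
    ("kaigo hoken", 50),  # does not contain the search word, so it is ignored
]

related = related_words(log, "eikaiwa")
print(related)
```

Summing the counts over every query that contains a related word mirrors the way the total search count of a related word is described later in the text.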

Related words are used in the related word search functions of Yahoo! (R) and Yahoo! JAPAN (R), which recommend words that are useful for searching, and in keyword analysis for search advertising. In search advertising, information on the related words that users entered into a search engine together with a search word is provided for keyword analysis. For example, a keyword analysis tool provides related word data such as that shown in FIG. 4, which represents the search needs of users who search for “English conversation”. Using this data, the search results obtained for the search word “English conversation” can be narrowed down with information that matches the users' search needs.

In the first embodiment, the related word data for the search word is acquired with the Overture (R) Keyword Advice Tool (http://inventory.jp.overture.com/), which provides up to 100 related words for a search word together with predicted monthly search counts.

In step 104, a meta search for the search word input in step 100 is performed using the plurality of Web search engines 14, and the search result URL, title, summary/snippet, and cache URL are acquired from each of the Web search engines 14 as Web search result data.
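As a rough sketch of step 104, the results of several engines can be merged into one list. The engine functions below are hypothetical stand-ins (no real search API is called), the field names are illustrative, and dropping duplicate URLs is an assumption that the text does not state.

```python
def meta_search(search_word, engines):
    """Query every engine and merge the hits, keeping the first hit per URL."""
    merged = {}
    for engine in engines:
        for hit in engine(search_word):
            # Assumption: a page returned by several engines is kept only once.
            merged.setdefault(hit["url"], hit)
    return list(merged.values())


# Hypothetical stand-ins for two search engine clients.
def engine_a(word):
    return [
        {"url": "http://example.com/1", "title": "A1",
         "snippet": "...", "cache_url": "http://cache.example/a1"},
        {"url": "http://example.com/2", "title": "A2",
         "snippet": "...", "cache_url": "http://cache.example/a2"},
    ]


def engine_b(word):
    return [
        {"url": "http://example.com/2", "title": "B2",
         "snippet": "...", "cache_url": "http://cache.example/b2"},
        {"url": "http://example.com/3", "title": "B3",
         "snippet": "...", "cache_url": "http://cache.example/b3"},
    ]


results = meta_search("English conversation", [engine_a, engine_b])
print([hit["url"] for hit in results])
```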

Major Web search engines such as Yahoo! (R), Yahoo! JAPAN (R), Google (R), and MSN Search (R) prohibit access from unlicensed meta search engines and prohibit sending automated queries to the search sites provided for general users. Instead, they provide search APIs and SDKs for accessing search engine resources programmatically. Examples include Google Web APIs (http://www.google.com/apis), Yahoo! Search Web Services SDK (http://developer.yahoo.net/search/), MSN Search Web Service SDK (http://msdn.microsoft.com/msn/msnsearch/), and Yahoo! JAPAN Web Service SDK (http://developer.yahoo.co.jp/). In the first embodiment, the meta search is performed using Google Web APIs and the Yahoo! JAPAN Web Service SDK, which return up to 1,000 Japanese search results.

In the next step 106, cache data is acquired based on the cache URL and stored in the cache data database 16. In step 108, text data in the form of EUC-JP text is extracted from the HTML source file of the cache data.

In step 110, the input search word and the related words acquired in step 102 are registered in the user dictionary of the morphological analyzer 20. In step 112, the morphological analyzer 20 morphologically analyzes the text data extracted in step 108 to obtain a plurality of words as the analysis result. In step 114, noise is removed from the analysis result, and only the nouns and unknown words around the search word are extracted. ChaSen (http://chasen.naist.jp/hiki/Chasen/) is used for the morphological analysis, and the search word and related words are registered in the ChaSen user dictionary so that a single word is not split into multiple words.

In step 116, a word frequency matrix over the plurality of Web pages, as shown in FIG. 2, is created from the extracted nouns and unknown words. In step 118, the column element IDs of the words in the word frequency matrix that match related words are extracted. In step 120, the extracted column element IDs are specified, and the matrix calculation library 22 calculates the similarity between related words, considering only the related words.

As described above, if the similarity calculation in the word direction is performed on the entire word frequency matrix of the Web pages created from the search results, the miscellaneous information in the search results adversely affects the clustering. In contrast, if the search results shown in FIG. 2 are narrowed down with the related words included in FIG. 4 and the similarity is calculated from the appearance frequencies of only the related words “school”, “free”, and “teaching material”, then, as shown in FIG. 5, the words “case”, “month”, and “day”, which are not important to a user interested in “English conversation”, can be excluded from the similarity calculation.
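The column restriction of steps 118 and 120 can be illustrated with a small sketch. The frequency values below are invented for illustration and are not the values of FIG. 2, and cosine similarity is an assumed measure: the text does not state which similarity the matrix calculation library computes.

```python
import math


def cosine(u, v):
    """Cosine similarity of two frequency vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def related_word_similarities(matrix, vocabulary, related):
    """Pairwise similarity using only the columns that match related words."""
    columns = {
        word: [row[vocabulary.index(word)] for row in matrix]
        for word in related
        if word in vocabulary
    }
    words = list(columns)
    return {
        (a, b): cosine(columns[a], columns[b])
        for i, a in enumerate(words)
        for b in words[i + 1:]
    }


# Hypothetical word frequency matrix: one row per Web page, one column per word.
vocabulary = ["case", "month", "day", "school", "free", "teaching material"]
matrix = [
    [3, 1, 0, 0, 2, 3],  # English conversation learning
    [0, 4, 5, 0, 1, 1],  # English diary
    [2, 0, 1, 0, 3, 1],  # English conversation BBS
    [1, 2, 0, 4, 0, 0],  # GEOS
    [0, 1, 2, 5, 0, 0],  # AEON
]

sims = related_word_similarities(
    matrix, vocabulary, ["school", "free", "teaching material"]
)
for pair, value in sims.items():
    print(pair, round(value, 3))
```

Because “case”, “month”, and “day” are not in the related word list, their columns never enter the calculation, which is the point of the restriction.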

In step 122, the related words are clustered based on the similarity between related words calculated in step 120. Related words are combined starting from the combinations with the highest similarity, and this is repeated until a predetermined number of related word clusters remains. For example, as shown in FIG. 5, clustering limited to the related words “school”, “free”, and “teaching material” generates related word clusters such as {free, teaching material} and {school}. The pages {English conversation learning, English diary, English conversation BBS} are associated with the related word cluster {free, teaching material} as the pages characterized by it, and the pages {GEOS, AEON} are associated with the related word cluster {school} as the pages characterized by it.
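The merging procedure of step 122 can be sketched as a simple agglomerative loop. Group-average linkage is used here as an assumption, since the text does not fix a particular linkage, and the similarity values are hypothetical.

```python
def cluster_related_words(words, similarity, n_clusters):
    """Merge the most similar clusters until n_clusters remain.

    similarity maps (word_a, word_b) pairs to a similarity value;
    missing pairs are treated as 0.0.
    """
    clusters = [[w] for w in words]

    def sim(a, b):
        return similarity.get((a, b), similarity.get((b, a), 0.0))

    def linkage(c1, c2):
        # Group-average linkage: mean similarity over all cross-cluster pairs.
        return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > max(n_clusters, 1):
        pairs = [
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        i, j = max(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical similarities between three related words.
similarity = {
    ("school", "free"): 0.0,
    ("school", "teaching material"): 0.0,
    ("free", "teaching material"): 0.8,
}
clusters = cluster_related_words(
    ["school", "free", "teaching material"], similarity, n_clusters=2
)
print(clusters)
```

With these values the loop performs a single merge and yields the two clusters of the FIG. 5 example.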

To cluster the related words, the first embodiment uses the generic engine for transposable association GETA (http://geta.ex.nii.ac.jp/) as the associative calculation library. GETA allows typical clustering distance calculation algorithms to be specified, such as the single link method, the complete link method, the group average method, Ward's method, and hierarchical Bayesian clustering (HBC).

In addition, by restricting the clustering to related words with high search counts (predicted monthly search counts), related word clusters that match the search needs of many users can be generated.

In the next step 124, the related word clusters generated in step 122 are weighted based on the search counts of the related words, and the related word clusters are ordered and sorted by weight. The weight of the related word cluster C_i is calculated by the following formula:

\[ \mathrm{weight}(C_i) = \sum_{t=1}^{T} f_t \]

Here, f_t is the total search count of the related word w_t included in the related word cluster C_i, and T is the number of related words included in the related word cluster C_i.

For example, when the related word “children” is used in multiple searches such as “English conversation children” and “children English conversation class”, the total search count of the related word “children” is the sum of the search counts of “English conversation children” and “children English conversation class”. In the example of FIG. 4, the search count for “school” is 22,796, and the search counts for “free” and “teaching material” are 6,647 and 2,285, respectively. Accordingly, the weights of the related word clusters {free, teaching material} and {school} in FIG. 5 are calculated as 8,932 and 22,796, respectively.
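The weighting and sorting of step 124 can be checked with a short sketch that uses the search counts quoted from FIG. 4. The function and variable names are illustrative only.

```python
def cluster_weight(cluster, search_counts):
    """Weight of a cluster: sum of the search counts of its related words."""
    return sum(search_counts[word] for word in cluster)


# Search counts quoted in the text for the FIG. 4 example.
search_counts = {"school": 22796, "free": 6647, "teaching material": 2285}
clusters = [["free", "teaching material"], ["school"]]

weights = [cluster_weight(c, search_counts) for c in clusters]
ranked = sorted(
    clusters, key=lambda c: cluster_weight(c, search_counts), reverse=True
)
print(weights)
print(ranked)
```

Sorting by descending weight places {school} ahead of {free, teaching material}.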

In step 126, the related word clusters are associated with the Web pages indicated by the Web search result data. In step 128, the sorted list of related word clusters is displayed as the search result, as shown in FIG. 6, and the meta search processing routine ends. When the related word clusters shown in FIG. 5 are generated, they are displayed in the search result in the order {school}, {free, teaching material}. Thus, in the first embodiment, the related word data for the search word is used to cluster only the related words that are frequently used in searches, the generated related word clusters are weighted by the search counts of their related words, and the clusters are sorted before the search result is displayed.
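One possible way to realize the association of step 126 is sketched below. The rule used here (assign each page to the cluster whose related words occur most often in it) is an assumption, since the text only says that pages characterized by a cluster's related words are associated with it, and the page frequencies are hypothetical.

```python
def assign_pages(page_freqs, clusters):
    """Map each cluster to the pages in which its related words dominate."""
    assignment = {tuple(cluster): [] for cluster in clusters}

    def score(freqs, cluster):
        return sum(freqs.get(word, 0) for word in cluster)

    for page, freqs in page_freqs.items():
        best = max(clusters, key=lambda c: score(freqs, c))
        if score(freqs, best) > 0:  # skip pages containing no related word
            assignment[tuple(best)].append(page)
    return assignment


# Hypothetical related word frequencies per Web page.
page_freqs = {
    "English conversation learning": {"free": 2, "teaching material": 3},
    "English diary": {"free": 1, "teaching material": 1},
    "English conversation BBS": {"free": 3, "teaching material": 1},
    "GEOS": {"school": 4},
    "AEON": {"school": 5},
}
clusters = [["school"], ["free", "teaching material"]]

assignment = assign_pages(page_freqs, clusters)
for cluster, pages in assignment.items():
    print(cluster, pages)
```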

In the search result display, as shown in FIG. 6, a detailed view of the related word clusters appears below the list of related word clusters. It shows the title, summary, and URL of each Web page associated with a related word cluster as document data information.

Next, an experiment comparing the clustering of the first embodiment with conventional clustering is described. The search terms were the six example search terms for cluster search shown on the top page of Clusty the Clustering Engine (http://clusty.jp/): English conversation, nursing care, mobile phone, stomach cancer, fraudulent sales practices, and entrance exam. The results of related word clustering and of Web page clustering were compared.

The generated related word clusters change with conditions such as the number of related words used for clustering, the number of clusters to generate, and the distance calculation algorithm used for clustering. Related word clustering and Web page clustering were each performed under different conditions, and the clustering results were compared.

Related words were clustered under the conditions shown in FIG. 7, and Web pages were clustered under the conditions shown in FIG. 8. FIG. 9 shows the Web page clustering result for the search term "English conversation". In the related word clustering result for the same search term, shown in FIG. 10, the cluster "free, learning material, improvement method" suggests studying English conversation with free learning materials; "one-to-one, individual, private, lesson, instructor" suggests wanting to take private English conversation lessons; and "business, radio, daily life, travel" suggests wanting to learn English conversation by listening to radio programs.

FIG. 11 shows the Web page clustering result for the search term "entrance exam", and FIG. 12 shows the related word clustering result for the same search term.

In the above comparison, the results of related word clustering and of Web page clustering had almost nothing in common. Web page clustering tended to generate clusters whose meaning was unclear and unrelated to the user's search intent.

In contrast, in related word clustering, related words likely to be familiar to users appeared in the clustering results, and related word clusters tended to be generated for each user group and search purpose.

Next, an evaluation experiment is described that compares the results of related word clustering and of Web page clustering from the standpoint of users who actually perform searches. The evaluators were 10 graduate and undergraduate students (male, early twenties), and the search terms were the 20 terms shown in FIG. 13. Clustering was performed under the conditions shown in FIGS. 7 and 8.

The related word clustering result and the Web page clustering result were displayed side by side, and each evaluator was asked which clustering result was easier to read. The 10 evaluators compared the clustering results for 20 search terms, giving 200 answers in total. Of these, 161 answers judged the related word clustering result easier to read, and 39 judged the Web page clustering result easier to read.

The answers by search term and by evaluator are shown in FIG. 14 and FIG. 15, respectively, and the average viewing time per clustering result for each evaluator is shown in FIG. 16. Although the evaluation varies with the search term and with the evaluator, the results suggest that related word clustering presents results that are easier for users to understand and read than Web page clustering.

In the related word clustering of the first embodiment, the following kinds of words each tended to be grouped into a single cluster: synonyms (for example, "exam" and "mock exam"), co-occurring words (for example, "air ticket", "vacant seat", and "reservation"), sets (for example, "Lexus", "Harrier", "Isis", and "Wish"), spelling variants (for example, 「プレーヤー」 and 「プレイヤー」, both meaning "player"), and the parts of compound words (for example, "model" and "change"). Given this tendency, related word clustering can be regarded as equivalent to building a thesaurus of related words using the set of search result pages as a corpus.

As described above, the search system according to the first embodiment clusters the related words based on the appearance frequency, in each Web page matching the search term, of each of a plurality of related words related to the search term, and displays the clustering result as the search result. Because the displayed related word clusters are generated with words unrelated to the search term entered by the user excluded, the search result can be presented in clusters that are easy for the user to understand.

In addition, because the retrieved Web pages are displayed in association with the related word clusters, user convenience in the search result display is improved.

In addition, a plurality of related words related to the search term can be acquired automatically from the database storing the search query log.

In addition, related word clusters are displayed in order starting from the cluster containing the related words most often searched together with the search term, so clusters containing related words strongly related to the search term appear first. This matches the user's search needs and improves user convenience in the search result display.

In addition, clustering related words using the related words of search terms that users frequently use yields a clustering result display that is easy for users to understand and read.

In addition, searching a plurality of search engines at once yields a large number of good-quality search results.

In addition, clustering the large number of results obtained and displaying them in clusters gives a search result display that is easy for the user to survey.

In addition, weighting the related word clusters by the search counts of their related words allows frequently referenced related word clusters to be displayed at the top of the search result.

In the above embodiment, the case where the computer acquires Web search result data using a plurality of existing search engines was described as an example. Alternatively, the computer may itself provide the search engine function and acquire Web pages matching the search term from a database storing a plurality of Web pages. In that case, related word acquisition and related word clustering become functions of the search engine.

In addition, the case where programs such as the meta search processing routine are executed on a computer was described as an example, but the invention is not limited to this. The search system may include a portable information terminal, and the portable information terminal may execute a program including the meta search processing routine.

Next, a second embodiment is described. Components identical to those of the first embodiment are given the same reference numerals, and their description is omitted.

The second embodiment differs from the first embodiment in that the search term used to acquire related words and the search term used to acquire Web search data can be corrected and clustering can be performed again.

As shown in FIG. 17, the search system 210 according to the second embodiment includes a computer 218 connected to the search query log database 12, the Web search engines 14, the cache data database 16, and a search term correction thesaurus database 212. The thesaurus database 212 stores synonyms for each of a plurality of words so that search terms can be corrected to synonyms. Here, synonyms include, in addition to synonyms in the ordinary sense, words obtained by splitting a word and spelling variants of a word.

A morphological analyzer 20 and a matrix calculation library 22 are also connected to the computer 218.

Next, the meta search processing routine of the second embodiment is described with reference to FIG. 18. Processing identical to that of the first embodiment is given the same reference numerals, and its detailed description is omitted.

First, in step 100, it is determined whether the user has entered a search term. When a search term is entered, in step 102, related word data indicating a plurality of related words related to the search term is acquired from the search query log database 12. In step 104, a meta search is performed for the entered search term, and Web search result data is acquired from each of the Web search engines 14. In the next step 106, cache data is acquired and stored in the cache data database 16.

Then, in step 108, text data is extracted from the cache data. In step 110, the entered search term and the related words are registered in the user dictionary of the morphological analyzer 20. In step 112, the extracted text data is morphologically analyzed to obtain a plurality of words as the analysis result. In step 114, noise is removed from the analysis result, and only the nouns and unknown words near the search term are extracted.

Then, in step 116, a word frequency matrix over the plurality of Web pages is created from the extracted nouns and unknown words. In step 118, the column element IDs of the words in the word frequency matrix that match related words are extracted. In step 120, the extracted column element IDs are specified, and the similarity between related words is calculated with attention restricted to the related words.
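Steps 116 to 120 can be sketched as follows. This is an illustration only: it assumes the pages are already tokenized into word lists, and it uses cosine similarity between column vectors, which is one common choice; the text does not state which similarity measure is used. All function names and the sample pages are hypothetical.

```python
import math

def build_frequency_matrix(pages, vocabulary):
    """Step 116: rows are Web pages, columns are words, cells are term counts."""
    return [[page.count(word) for word in vocabulary] for page in pages]

def related_column_ids(vocabulary, related_words):
    """Step 118: column IDs of vocabulary words that match a related word."""
    related = set(related_words)
    return [j for j, word in enumerate(vocabulary) if word in related]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def related_word_similarities(matrix, column_ids):
    """Step 120: pairwise similarity, restricted to the related-word columns."""
    columns = {j: [row[j] for row in matrix] for j in column_ids}
    return {
        (a, b): cosine(columns[a], columns[b])
        for i, a in enumerate(column_ids)
        for b in column_ids[i + 1:]
    }

# Hypothetical tokenized pages (nouns and unknown words only).
pages = [
    ["free", "learning material", "free", "lesson"],
    ["school", "lesson"],
    ["free", "learning material"],
]
vocabulary = ["free", "learning material", "lesson", "school"]
related_words = ["free", "learning material", "school"]

matrix = build_frequency_matrix(pages, vocabulary)
column_ids = related_column_ids(vocabulary, related_words)
similarities = related_word_similarities(matrix, column_ids)
```

Restricting the calculation to the related-word columns means words such as "lesson", which are not in the related word data, never influence the similarity between related words.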

Then, in step 122, the related words are clustered to generate a predetermined number of related word clusters. In the next step 124, the generated related word clusters are weighted and then sorted in the order determined by their weights.
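The clustering in step 122 can be sketched as a bottom-up merge that repeatedly combines the most similar pair until the predetermined number of clusters remains. The sketch assumes single linkage (the similarity of two clusters is that of their most similar members); the actual linkage and distance settings are given in FIG. 7 and are not reproduced here. The function name and the similarity values are hypothetical.

```python
def agglomerate(words, similarity, num_clusters):
    """Merge the most similar clusters until num_clusters remain."""
    clusters = [[word] for word in words]

    def link(a, b):
        # Single linkage: similarity of the closest pair of members.
        return max(similarity.get(frozenset((x, y)), 0.0) for x in a for y in b)

    while len(clusters) > max(num_clusters, 1):
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = max(pairs, key=lambda p: link(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical pairwise similarities between related words.
similarity = {
    frozenset(("free", "learning material")): 0.95,
    frozenset(("school", "lesson")): 0.60,
    frozenset(("free", "school")): 0.10,
    frozenset(("free", "lesson")): 0.10,
    frozenset(("learning material", "school")): 0.10,
    frozenset(("learning material", "lesson")): 0.10,
}
words = ["free", "learning material", "school", "lesson"]
result = agglomerate(words, similarity, num_clusters=2)
print(result)
```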

Then, in step 126, the related word clusters are associated with the Web pages indicated by the Web search data, and in step 128, the sorted list of related word clusters is displayed as the search result.
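One possible way to carry out the association in step 126 is sketched below. It assumes that each Web page is attached to the cluster whose related words occur most often in that page, and that pages containing no related word are left unassigned. This rule is an assumption made for illustration; the text only states that the association is made. The function name and the example.com URLs are hypothetical.

```python
def associate_pages(page_counts, clusters):
    """Map each cluster index to the URLs of pages dominated by its related words."""
    if not clusters:
        return {}
    assignment = {index: [] for index in range(len(clusters))}
    for url, counts in page_counts.items():
        # Score of a cluster for this page: total frequency of its related words.
        scores = [sum(counts.get(word, 0) for word in cluster) for cluster in clusters]
        best = max(range(len(clusters)), key=scores.__getitem__)
        if scores[best] > 0:
            assignment[best].append(url)
    return assignment

# Hypothetical per-page related word frequencies.
page_counts = {
    "http://example.com/a": {"free": 2, "learning material": 1},
    "http://example.com/b": {"school": 3, "free": 1},
    "http://example.com/c": {"lesson": 4},
}
clusters = [["school"], ["free", "learning material"]]
assignment = associate_pages(page_counts, clusters)
print(assignment)
```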

In the next step 230, it is determined whether the related word clusters displayed as the search result are to be corrected. If the user does not instruct correction of the related word clusters, the meta search processing routine ends. If the user instructs correction of the related word clusters by operating the keyboard or mouse, it is determined in step 232 whether the related word data is to be corrected. If the user does not instruct correction of the related word data, the process proceeds to step 238. If the user instructs correction of the related word data by operating the keyboard or mouse, the process proceeds to step 234.

In step 234, a corrected search term is created for acquiring related word data. For example, the corrected search term is created from user input, or a synonym of the search term is acquired automatically from the search term correction thesaurus database 212 and used as the corrected search term. In the next step 236, related words searched together with the corrected search term are extracted from the search query log database 12 to acquire related word data, and the process proceeds to step 238.
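Steps 234 and 236 can be sketched as a thesaurus lookup followed by a scan of the query log. This is a minimal sketch: the thesaurus entry, the log layout (a tuple of query words plus a count), the counts, and both function names are hypothetical stand-ins for the search term correction thesaurus database 212 and the search query log database 12.

```python
from collections import Counter

def corrected_terms(search_term, thesaurus):
    """Step 234: look up synonyms to use as corrected search terms."""
    return list(thesaurus.get(search_term, []))

def related_words_for(terms, query_log):
    """Step 236: count the words searched together with any corrected term."""
    targets = set(terms)
    counts = Counter()
    for words, n in query_log:
        if targets.intersection(words):
            for word in words:
                if word not in targets:
                    counts[word] += n
    return counts

# Hypothetical thesaurus entry and query log.
thesaurus = {"exam": ["mock exam"]}
query_log = [
    (("exam", "fee"), 20),
    (("mock exam", "schedule"), 30),
    (("mock exam", "results", "schedule"), 5),
    (("school",), 99),
]

terms = corrected_terms("exam", thesaurus)
related = related_words_for(terms, query_log)
print(related.most_common())
```

The resulting counts can serve both as the new related word data and as the search counts used later for weighting the clusters.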

In step 238, it is determined whether the Web search result data is to be corrected. If the user does not instruct correction of the Web search result data, the process returns to step 106, and the related word clusters are generated again based on the newly acquired related word data. If the user instructs correction of the Web search result data by operating the keyboard or mouse, the process proceeds to step 240.

In step 240, a corrected search term is created for acquiring Web search result data. For example, the corrected search term is created from user input, or a synonym of the search term is acquired automatically from the search term correction thesaurus database 212 and used as the corrected search term. In the next step 242, a meta search is performed for the corrected search term, and Web search result data is acquired from each of the Web search engines 14. The process then returns to step 106, and the related word clusters are generated again based on the newly acquired related word data and Web search result data.
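The branching in steps 232 to 242 can be summarized in a short sketch. It assumes the user's two choices arrive as boolean flags and that the data sources are passed in as callables; `correction_pass` and all of the stub functions are hypothetical and stand in for the thesaurus database 212, the search query log database 12, and the Web search engines 14.

```python
def correction_pass(search_term, related_words, web_results, *,
                    fix_related, fix_web,
                    make_corrected_term, fetch_related_words, run_meta_search):
    """One pass through steps 232-242; the caller then resumes at step 106."""
    if fix_related:  # steps 232-236
        term = make_corrected_term(search_term, "related")
        related_words = fetch_related_words(term)
    if fix_web:      # steps 238-242
        term = make_corrected_term(search_term, "web")
        web_results = run_meta_search(term)
    return related_words, web_results

# Hypothetical stubs for the external components.
def make_term(term, purpose):
    return {"exam": "mock exam"}.get(term, term)

def fetch(term):
    return [term + " schedule"]

def search(term):
    return ["http://example.com/" + term.replace(" ", "-")]

new_related, new_results = correction_pass(
    "exam", ["fee"], ["http://example.com/exam"],
    fix_related=True, fix_web=False,
    make_corrected_term=make_term,
    fetch_related_words=fetch,
    run_meta_search=search,
)
print(new_related, new_results)
```

Because the corrected term is created separately for each purpose, the related word data and the Web search result data can be refreshed independently, as the two decision steps allow.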

As described above, with the search system according to the second embodiment, after the related word clusters are displayed as the search result, the search term can be corrected and the related word clusters regenerated from newly acquired related word data and Web search result data. The search result can therefore be presented in clusters that are even easier for the user to understand.

FIG. 1 is a block diagram showing the search system according to the first embodiment.
FIG. 2 is a conceptual diagram showing a word frequency matrix for a plurality of Web pages.
FIG. 3 is a flowchart showing the meta search processing routine of the computer according to the first embodiment.
FIG. 4 is a table showing the number of searches for each combination of the search term and a related word.
FIG. 5 is a conceptual diagram showing the word frequency matrix limited to the column elements of the related words.
FIG. 6 is a conceptual diagram of the search result display according to the first embodiment.
FIG. 7 is a table showing the conditions for related word clustering.
FIG. 8 is a table showing the conditions for Web page clustering.
FIG. 9 is a diagram showing the Web page clustering result for the search term "English conversation".
FIG. 10 is a diagram showing the related word clustering result for the search term "English conversation".
FIG. 11 is a diagram showing the Web page clustering result for the search term "entrance exam".
FIG. 12 is a diagram showing the related word clustering result for the search term "entrance exam".
FIG. 13 is a diagram showing the search terms used for the user evaluation.
FIG. 14 is a graph showing the readability of the clustering results for each of the search terms.
FIG. 15 is a graph showing the readability of the clustering results for each of the evaluators.
FIG. 16 is a graph showing the average viewing time of the clustering results for each of the evaluators.
FIG. 17 is a block diagram showing the search system according to the second embodiment.
FIG. 18 is a flowchart showing the meta search processing routine of the computer according to the second embodiment.

Explanation of symbols

10, 210 Search system
12 Search query log database
14 Search engine
16 Cache data database
18, 218 Computer
20 Morphological analyzer
22 Matrix calculation library
212 Search term correction thesaurus database

Claims

A search device comprising:
document data acquisition means for acquiring, from a document database storing a plurality of document data, a plurality of document data matching a search term;
frequency calculation means for calculating, for each of the document data, the appearance frequency of each of a plurality of related words related to the search term, based on words obtained by morphological analysis of each of the plurality of document data acquired by the document data acquisition means;
similarity calculation means for calculating the similarity between the related words based on the appearance frequency of each of the plurality of related words calculated by the frequency calculation means;
clustering means for clustering the plurality of related words by combining related words in order from the combination with the highest similarity calculated by the similarity calculation means, thereby generating a predetermined number of related word clusters; and
display means for displaying the related word clusters generated by the clustering means as a search result of the document data matching the search term.

The search device according to claim 1, further comprising association means for associating, with each related word cluster generated by the clustering means, those document data among the plurality of document data acquired by the document data acquisition means that are characterized by the related words of that related word cluster, based on the appearance frequencies of the related words,
wherein the display means displays, as the search result, the related word clusters and document data information indicating the document data associated with each related word cluster.

The search device according to claim 1, further comprising related word acquisition means for acquiring, as the related words, a plurality of words used as search terms together with the search term of the document data acquisition means, based on a database storing a plurality of search queries each consisting of at least one search term,
wherein the frequency calculation means calculates, for each of the document data, the appearance frequency of each of the plurality of related words acquired by the related word acquisition means.

The search device according to claim 1, further comprising related word acquisition means for acquiring, as the related words, a plurality of words used as search terms together with a synonym of the search term of the document data acquisition means, based on a database storing a plurality of search queries each consisting of at least one search term,
wherein the frequency calculation means calculates, for each of the document data, the appearance frequency of each of the plurality of related words acquired by the related word acquisition means.

The search device according to any one of claims 1 to 4, wherein the display means displays the search result in order starting from the related word cluster that includes the related words most frequently searched together with the search term.

A search method comprising:
acquiring, from a document database storing a plurality of document data, a plurality of document data matching a search term;
calculating, for each of the document data, the appearance frequency of each of a plurality of related words related to the search term, based on words obtained by morphological analysis of each of the plurality of acquired document data;
calculating the similarity between the related words based on the calculated appearance frequency of each of the related words;
clustering the plurality of related words by combining related words in order from the combination with the highest calculated similarity, thereby generating a predetermined number of related word clusters; and displaying the generated related word clusters as a search result of the document data matching the search term.