JP4774081B2

JP4774081B2 - Document search system, document search method, and program

Info

Publication number: JP4774081B2
Application number: JP2008153372A
Authority: JP
Inventors: 秀人湯澤; 竜己小林
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2008-06-11
Filing date: 2008-06-11
Publication date: 2011-09-14
Anticipated expiration: 2028-06-11
Also published as: JP2009301221A

Description

The present invention relates to a document search system, a document search method, and a program.

Keyword-based document search is widely used, as typified by search engines that look for documents published on the Internet. As the volume of searchable documents grows, however, search results need to be narrowed down efficiently. For example, Patent Literature 1 proposes a technique that preferentially displays the search results that other users judged useful when they previously searched with the same search conditions as those entered by the current user.
Patent Literature 1: JP 2003-108587 A

However, when a keyword entered by a user hits an enormous number of documents, those documents can be classified into various categories. Even if other users searched with the same keyword, the category of the documents they judged useful does not necessarily match the category intended by the user who entered the keyword. Moreover, when a user tries to enter additional keywords to narrow down an enormous set of search results, the user often cannot come up with a new keyword.

The present invention has been made in view of this background, and its object is to provide a document search system, a document search method, and a program that can narrow down results efficiently in document search processing.

The main aspect of the present invention for solving the above problem is a system for searching for documents, comprising: a keyword input unit that accepts input of a keyword; a document search unit that searches for documents corresponding to the keyword; a group classification unit that classifies the documents in the search result into a plurality of groups based on the words contained in the documents; a maximum group name determination unit that determines a maximum group name, representing the maximum group (the group to which the largest number of documents belong), based on the words contained in the documents belonging to the maximum group; a narrowing selection unit that accepts input of selection information indicating whether the maximum group name matches the user's intention; and a search result display unit that displays a list of the documents belonging to the maximum group when the selection information indicates a match with the intention, and displays a list of the documents in the search result that do not belong to the maximum group when the selection information indicates no match.

According to the document search system of the present invention, the search result can be narrowed down step by step, according to the user's selection, to either the documents that belong to the maximum group or the documents that do not. The user can therefore narrow down the search results easily, without adding new keywords, simply by choosing between two alternatives in response to questions from the system.

When the selection information indicates a match with the intention, the documents belonging to the maximum group may be taken as the new search result; when it indicates no match, the documents in the search result that do not belong to the maximum group may be taken as the new search result. For that new search result, the group classification unit may classify the documents into groups, the maximum group name determination unit may determine the maximum group name, the selection screen information transmission unit may transmit the selection screen information, the selection information reception unit may receive the selection information, and the search result transmission unit may transmit information for displaying the list.
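The repeated narrowing described above can be sketched roughly as follows. This is a minimal illustration, not the claimed implementation: the function names `narrow`, `classify`, and `ask`, and the stopping conditions `max_rounds` and `min_results`, are assumptions introduced here. For brevity, `classify` is assumed to return groups already keyed by their group name.

```python
from typing import Callable, Dict, List


def narrow(results: List[str],
           classify: Callable[[List[str]], Dict[str, List[str]]],
           ask: Callable[[str], bool],
           max_rounds: int = 10,
           min_results: int = 3) -> List[str]:
    """Narrow `results` by asking yes/no questions about the largest group."""
    for _ in range(max_rounds):
        if len(results) <= min_results:
            break  # few enough results; stop asking (assumed stopping rule)
        groups = classify(results)  # group name -> documents in that group
        if len(groups) < 2:
            break  # nothing left to distinguish
        # The maximum group is the one with the most documents.
        name, members = max(groups.items(), key=lambda kv: len(kv[1]))
        if ask(name):
            # "Yes": keep only the documents in the maximum group.
            results = members
        else:
            # "No": keep only the documents outside the maximum group.
            member_set = set(members)
            results = [d for d in results if d not in member_set]
    return results
```

Each round either keeps the largest group or discards it, so the result set shrinks on every answer.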

The document search system of the present invention may also include a feature word extraction unit that extracts feature words for each document in the search result by the LSI technique and treats the extracted feature words as the words contained in the document.

The group classification unit may classify the documents into a plurality of groups by clustering the words contained in the documents.

The group classification unit may weight the words contained in the documents by TF-IDF and classify the weighted words by clustering.

The group classification unit may perform multiple clusterings using a plurality of different methods. This makes it possible to narrow down results from multiple viewpoints.

The maximum group name determination unit may calculate the frequency of each word contained in the documents belonging to the maximum group and determine the word with the highest frequency as the maximum group name.
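As a rough illustration of this frequency-based naming, the sketch below counts words across the documents of the maximum group and returns the most frequent one. The function name and the representation of each document as a list of already-tokenized words are assumptions made for the example.

```python
from collections import Counter
from typing import Iterable, List


def most_frequent_word(docs_words: Iterable[List[str]]) -> str:
    """Return the word that occurs most often across all given documents."""
    counts: Counter = Counter()
    for words in docs_words:
        counts.update(words)
    if not counts:
        raise ValueError("no words to choose a group name from")
    word, _ = counts.most_common(1)[0]
    return word
```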

The document search system of the present invention may include a category database that stores, in association with a group name indicating a group, words related to that group. In that case, the maximum group name determination unit may extract a plurality of words contained in the documents belonging to the maximum group, count, for each group name stored in the category database, how many of its related words appear among the words extracted from the documents, and determine the group name with the largest count as the maximum group name.
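A minimal sketch of this counting scheme follows. The category database is assumed here to be a plain mapping from group name to a set of related words; the sample entries are hypothetical and only serve to show the counting.

```python
from typing import Dict, Set


def name_by_related_words(category_db: Dict[str, Set[str]],
                          extracted: Set[str]) -> str:
    """Pick the group name whose related words overlap most with `extracted`."""
    if not category_db:
        raise ValueError("category database is empty")
    # For each group name, count the related words that were extracted
    # from the documents of the maximum group, and take the largest count.
    return max(category_db, key=lambda name: len(category_db[name] & extracted))


# Hypothetical database contents, for illustration only.
sample_db = {
    "automobile": {"engine", "tire", "sedan"},
    "animal": {"cat", "predator", "jungle"},
}
```

With `sample_db`, the extracted words `{"engine", "sedan", "jungle"}` share two words with "automobile" and one with "animal", so "automobile" is chosen.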

The document search system of the present invention may include a category database that stores words as nodes of a tree structure. In that case, the maximum group name determination unit may extract a plurality of words contained in the documents belonging to the maximum group, obtain from the category database the deepest node among the nodes that have all of the extracted words as descendants, and determine the obtained node as the maximum group name.

The maximum group name determination unit may weight the words extracted from the documents by TF-IDF and determine, as the maximum group name, a node that has as descendants all of the extracted words whose weight is equal to or greater than a predetermined value.

The group classification unit may classify only a predetermined number of the documents in the search result into the plurality of groups.

The document search system of the present invention may include an index storage unit that stores, in association with the keyword, the documents and the feature words of the documents. In that case, the document search unit may acquire the documents and feature words corresponding to the keyword from the index storage unit, and the group classification unit may classify the documents into a plurality of groups based on the feature words.

The document search system of the present invention may include a category database that stores, for each document, one or more categories to which the document belongs, and the group classification unit may classify the documents in the search result into a plurality of groups based on their corresponding categories.

The document search system of the present invention may include: a category database that stores, for each document, the category to which the document belongs; a category display unit that displays a list of the categories to which the documents in the maximum group belong when the selection information indicates a match with the intention; a category input unit that accepts input of a category; and a narrowed-down display unit that displays a list of the documents in the maximum group that belong to the input category.

Other problems disclosed in the present application and their solutions will become clear from the description of the embodiments and the drawings.

According to the present invention, results can be narrowed down efficiently in document search processing.

A document search system according to an embodiment of the present invention is described below. The document search system of this embodiment searches for documents according to a keyword and narrows down a large number of search results by asking the user a series of either-or questions. In this embodiment, the search is performed not on the documents themselves but on index information, which contains the words and summaries of documents collected by a crawler.

== System configuration ==
FIG. 1 shows the overall configuration of the document search system of this embodiment. As shown in the figure, the document search system includes a user terminal 10 and a search server 20, which are connected via a communication network 30. The system may include a plurality of user terminals 10 and search servers 20.

The communication network 30 is, for example, the Internet or a LAN (Local Area Network), and is built from optical fiber, Ethernet (registered trademark), a telephone network, a wireless communication network, or the like. In this embodiment, the user terminal 10 and the search server 20 are assumed to communicate using HTTP (HyperText Transfer Protocol).

The user terminal 10 is a computer used by a user, for example a personal computer, a mobile phone, or a PDA (Personal Digital Assistant). In this embodiment, a browser for viewing Web pages runs on the user terminal 10, and the user operates that browser to send a keyword to the search server 20 as an HTTP request. In the following description, an HTTP request containing a keyword for searching documents is called a query.

The search server 20 is a computer that searches for documents, for example a personal computer or a workstation. In this embodiment, the search server 20 is assumed to be an Internet search engine. The search server 20 accesses other servers (not shown) that provide Web pages (documents), generates a summary and an index for each document in advance, and presents to the user the URL (Uniform Resource Locator) and summary of each document corresponding to the keyword.

== Hardware ==
FIG. 2 shows the hardware configuration of the search server 20. As shown in the figure, the search server 20 includes a CPU 201, a memory 202, a storage device 203, a communication interface 204, an input device 205, and an output device 206. The storage device 203 stores various data and programs and is, for example, a hard disk, a flash memory, or a CD-ROM drive. The CPU 201 implements various functions by reading programs stored in the storage device 203 into the memory 202 and executing them. The communication interface 204 is an interface for connecting to the communication network 30, for example an adapter for connecting to Ethernet (registered trademark), a modem for connecting to a telephone network, or a radio for connecting to a wireless communication network. The input device 205 accepts data input and is, for example, a keyboard, a mouse, a touch panel, or a microphone. The output device 206 outputs data and is, for example, a display, a printer, or a speaker.

The hardware configuration of the search server 20 is assumed to be that of a general personal computer or workstation. The hardware configuration of the user terminal 10 is the same as that of the search server 20.

== Software ==
FIG. 3 shows the software configuration of the search server 20. As shown in the figure, the search server 20 includes the following functional units: a query receiving unit 211, a search execution unit 212, a search result generation unit 213, a clustering analysis unit 214, a maximum cluster determination unit 215, a category determination unit 216, a proposal information generation unit 217, a search result transmission unit 218, a selection information reception unit 219, and a crawler processing unit 220. It also includes two storage units: an index database 251 and a category database 252.

The functional units 211 to 220 listed above are implemented by the CPU 201 of the search server 20 reading programs stored in the storage device 203 into the memory 202 and executing them. The index database 251 and the category database 252 are implemented as storage areas provided by the memory 202 and the storage device 203.

The index database 251 stores information (hereinafter, index information) that includes the URL of each document to be searched and a summary of that document. FIG. 4 shows an example of the structure of the index information stored in the index database 251. As shown in the figure, the index information includes an ID identifying the index information (hereinafter, index ID), the URL of the document, the title of the document, a summary of the document, and words that characterize the document (hereinafter, feature words). In this embodiment, the search server 20 performs document search by searching for index information whose summary contains the keyword entered by the user.

The crawler processing unit 220 successively acquires documents published by the various computers connected to the communication network 30 and extracts feature words from each acquired document. For example, among the words extracted from a document by ordinary morphological analysis, a predetermined number of words taken in descending order of frequency can be used as the feature words. The crawler processing unit 220 creates index information that includes the URL of the acquired document, the title contained in the document, a summary of the document, and the feature words. The title is the one set as an attribute of the document: for a document written in HTML it can be the content of the TITLE tag, and for plain text data it can be the first line of text. The summary can be, for example, the text surrounding the feature words, extracted from the document. The crawler processing unit 220 registers the created index information in the index database 251. The crawler processing unit 220 corresponds to what is commonly called a crawler, spider, robot, or indexer, and the database creation processing of a general crawler can be used for creating the index information. The crawler processing unit 220 is assumed to generate index information periodically and keep registering it in the index database 251.

LSI (Latent Semantic Indexing) can also be used to extract the feature words of a document. In this case, the crawler processing unit 220 generates a matrix A whose rows are the terms contained in the documents, whose columns are the documents, and whose elements are the frequencies of the terms in the documents, and decomposes A into three matrices U, S, and V by singular value decomposition (SVD), so that A = U × S × V. S is a diagonal matrix. The crawler processing unit 220 keeps only a predetermined number of the largest elements (feature components) of S, reducing the dimensionality to produce a matrix S', and computes A' = U × S' × V. In A', noise is suppressed and characteristic terms are emphasized. For the document in each column of A', the crawler processing unit 220 can determine as feature words the words of the rows whose elements exceed a predetermined threshold (feature word extraction unit).
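The rank reduction described above can be written out as follows. This is a sketch in the text's own notation, where the right-hand factor is written V (many references write it as a transpose, V^T). The symbols r (rank of A), k (number of retained components), and θ (the threshold) are introduced here for illustration.

```latex
\begin{aligned}
A  &= U \, S \, V, \qquad
S = \operatorname{diag}(\sigma_1, \dots, \sigma_r), \quad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r \ge 0, \\
S' &= \operatorname{diag}(\sigma_1, \dots, \sigma_k, 0, \dots, 0), \qquad k < r, \\
A' &= U \, S' \, V, \\
F_j &= \{\, \text{term } i \;:\; A'_{ij} > \theta \,\}
\quad \text{(feature words of document } j\text{)}.
\end{aligned}
```

Here row i of A' corresponds to a term and column j to a document, matching the layout of A.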

The category database 252 manages, for each category, the words that belong to that category. FIG. 5 shows an example of the structure of the category database 252. As shown in the figure, the category database 252 of this embodiment manages words in a tree structure: each word serves as the name of a category, and the words contained in a category are arranged as descendants of that category.

In this embodiment, to simplify the description, each word is assumed to be registered in the tree structure only once, but the same word may appear in the tree more than once. For example, when the same word belongs to different categories, it is included among the descendants of each of those categories.

The structure of the category database 252 is not limited to a tree such as that of FIG. 5; it is sufficient that categories and words are associated with each other. For example, categories and the words belonging to them may be associated and managed in tabular form. In this embodiment, the words are assumed to have been registered in the category database 252 in advance by an administrator so that words at higher levels of the hierarchy represent broader concepts.

The query receiving unit 211 receives the query transmitted from the user terminal 10.
The search execution unit 212 searches for documents corresponding to the keyword contained in the query. Specifically, the search execution unit 212 acquires from the index database 251 the index information that contains the keyword as a feature word (hereinafter, the keyword search result).
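To make the data flow concrete, the sketch below models one index information record and the lookup by feature word. The field names follow the items listed for the index information (index ID, URL, title, summary, feature words), but the class name, function name, and in-memory list standing in for the index database are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IndexInfo:
    """One index information record (fields as listed for FIG. 4)."""
    index_id: int
    url: str
    title: str
    summary: str
    feature_words: List[str] = field(default_factory=list)


def search(index_db: List[IndexInfo], keyword: str) -> List[IndexInfo]:
    """Return the records that contain `keyword` as a feature word."""
    return [info for info in index_db if keyword in info.feature_words]
```

A real index would use an inverted index from feature word to index IDs rather than a linear scan; the scan is kept here only for brevity.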

The search result generation unit 213 generates screen data (hereinafter, the search result screen) for displaying a list of the summaries of the index information contained in the keyword search result. In this embodiment, the search result screen is written in HTML (HyperText Markup Language), and the search result generation unit 213 writes a list in which each entry consists of the title of the index information, linked to its URL, and the summary. The screen generation processing that general search engines use to display search results can be adopted for generating the search result screen.
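A search result screen of this kind might be generated as in the following sketch. The function name, the tuple layout `(url, title, summary)`, and the choice of a `<ul>` list are assumptions; the only point illustrated is that each title is linked to its URL and followed by the summary, with values escaped before being placed in HTML.

```python
from html import escape
from typing import Iterable, Tuple


def render_results(results: Iterable[Tuple[str, str, str]]) -> str:
    """Render (url, title, summary) tuples as an HTML list of linked titles."""
    items = []
    for url, title, summary in results:
        items.append(
            f'<li><a href="{escape(url, quote=True)}">{escape(title)}</a>'
            f"<p>{escape(summary)}</p></li>"
        )
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```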

The clustering analysis unit 214 performs clustering analysis on the index information contained in the keyword search result and classifies the index information into clusters. For example, the clustering analysis unit 214 calculates, for each feature word, an index value obtained by TF-IDF (Term Frequency - Inverse Document Frequency) (hereinafter, the TFIDF value). For each document, it selects a predetermined number (for example, one to five) of feature words in descending order of TFIDF value and builds a vector from the TFIDF values of the selected feature words. For every pair of index information items in the keyword search result, it calculates the inner product of their vectors as a similarity, and classifies items whose vectors are close to each other into the same cluster. The clustering analysis unit 214 can use any general clustering analysis technique to classify the index information into clusters. For each cluster, the clustering analysis unit 214 stores in the memory 202 the index IDs of the index information contained in the cluster. For example, it can store a list of index IDs in the table 253 shown in FIG. 6, in association with identification information of the cluster (hereinafter, cluster ID).
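One possible reading of this procedure is sketched below. Several details are assumptions not fixed by the text: the particular TF-IDF formula (relative term frequency times the log of inverse document frequency), normalizing each vector to unit length so that the inner product behaves like cosine similarity, and grouping by single linkage with a similarity threshold. All function and parameter names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

Vector = Dict[str, float]


def tfidf_vectors(docs: Dict[int, List[str]], top_n: int = 5) -> Dict[int, Vector]:
    """Build a unit-length vector of the top_n TF-IDF words for each document."""
    n_docs = len(docs)
    df: Counter = Counter()
    for words in docs.values():
        df.update(set(words))  # document frequency of each word
    vectors: Dict[int, Vector] = {}
    for doc_id, words in docs.items():
        tf = Counter(words)
        weights = {w: (c / len(words)) * math.log(n_docs / df[w])
                   for w, c in tf.items()}
        top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
        norm = math.sqrt(sum(v * v for _, v in top)) or 1.0
        vectors[doc_id] = {w: v / norm for w, v in top}
    return vectors


def similarity(a: Vector, b: Vector) -> float:
    """Inner product of two sparse vectors."""
    return sum(v * b.get(w, 0.0) for w, v in a.items())


def cluster(vectors: Dict[int, Vector], threshold: float = 0.1) -> List[List[int]]:
    """Single-linkage grouping: join documents whose similarity >= threshold."""
    parent = {d: d for d in vectors}

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    ids = list(vectors)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if similarity(vectors[a], vectors[b]) >= threshold:
                parent[find(a)] = find(b)
    groups: Dict[int, List[int]] = {}
    for d in ids:
        groups.setdefault(find(d), []).append(d)
    return list(groups.values())
```

Comparing every pair is quadratic in the number of results, which is one reason a system might restrict clustering to a predetermined number of documents.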

The maximum cluster determination unit 215 determines, among the clusters produced by the clustering analysis unit 214, the one into which the largest number of index information items were classified (hereinafter, the maximum cluster).
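Selecting the maximum cluster from a table of cluster IDs and index ID lists can be sketched as follows. The dictionary stands in for a table like the one that maps cluster IDs to index ID lists; the function name and the tie-breaking behavior (first cluster encountered wins) are assumptions.

```python
from typing import Dict, List, Tuple


def largest_cluster(table: Dict[str, List[int]]) -> Tuple[str, List[int]]:
    """Return (cluster ID, index IDs) of the cluster with the most members."""
    if not table:
        raise ValueError("no clusters")
    cluster_id = max(table, key=lambda cid: len(table[cid]))
    return cluster_id, table[cluster_id]
```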

The category determination unit 216 determines the name of the maximum cluster (hereinafter, the maximum category name) based on the words contained in the summaries of the index information classified into the maximum cluster. In this embodiment, the category determination unit 216 determines as the maximum category name the deepest node of the tree structure stored in the category database 252 among the nodes that have as descendants all of the words contained in the summaries of the index information classified into the maximum cluster. The details of the maximum category name determination processing by the category determination unit 216 are described later.
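Finding the deepest node that has all given words as descendants amounts to finding their lowest common ancestor in the tree. The sketch below assumes the tree is stored as a child-to-parent mapping with each word appearing once, treats "descendant" as strict (a word is not its own descendant), and skips words that are not in the tree. These choices, the function names, and the sample tree are illustrative assumptions, not the processing of FIG. 11.

```python
from typing import Dict, Iterable, List, Optional


def ancestors(parent: Dict[str, Optional[str]], word: str) -> List[str]:
    """Strict ancestors of `word`, nearest first (tree assumed acyclic)."""
    chain = []
    node = parent.get(word)
    while node is not None:
        chain.append(node)
        node = parent.get(node)
    return chain


def deepest_common_category(parent: Dict[str, Optional[str]],
                            words: Iterable[str]) -> Optional[str]:
    """Deepest node that has every known word as a descendant, or None."""
    known = [w for w in words if w in parent]  # ignore words not in the tree
    if not known:
        return None
    common = set(ancestors(parent, known[0]))
    for w in known[1:]:
        common &= set(ancestors(parent, w))
    # Ancestors are listed nearest first, so the first common one is deepest.
    for node in ancestors(parent, known[0]):
        if node in common:
            return node
    return None


# Hypothetical tree, for illustration only (child -> parent).
sample_tree = {
    "thing": None, "vehicle": "thing", "animal": "thing",
    "car": "vehicle", "truck": "vehicle", "sedan": "car",
    "engine": "car", "cat": "animal",
}
```

With `sample_tree`, the words "sedan" and "engine" yield "car", while "sedan" and "truck" yield the broader "vehicle".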

The proposal information generation unit 217 generates screen data (hereinafter, the proposal screen) for asking whether the maximum category name matches the user's intention. In this embodiment, the proposal information generation unit 217 generates as the proposal screen an HTML description of a screen that displays the message "Is it '[maximum category name]' that you want to know about?" together with a "Yes" button and a "No" button.
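A proposal screen of this form could be produced as in the sketch below. The form action `/narrow`, the parameter name `choice`, and the function name are hypothetical; the text specifies only the message and the two buttons.

```python
from html import escape


def render_proposal(category_name: str, action: str = "/narrow") -> str:
    """Render the yes/no question about the maximum category name as HTML."""
    name = escape(category_name)
    target = escape(action, quote=True)
    return (
        f'<form method="get" action="{target}">\n'
        f"  <p>Is it '{name}' that you want to know about?</p>\n"
        '  <button type="submit" name="choice" value="yes">Yes</button>\n'
        '  <button type="submit" name="choice" value="no">No</button>\n'
        "</form>"
    )
```

The value submitted with the pressed button plays the role of the selection information that the server receives.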

The search result transmission unit 218 transmits the search result screen and the proposal screen to the user terminal 10.
The selection information reception unit 219 receives from the user terminal 10 information (hereinafter, selection information) indicating which of "Yes" and "No" was selected on the proposal screen displayed on the user terminal 10.

== Processing ==
The flow of processing in the document search system of this embodiment is described below with reference to FIGS. 7 to 12. FIG. 7 illustrates the overall flow of the document search processing in the document search system, and FIG. 8 shows examples of the screens displayed on the user terminal 10 during the document search processing.

First, the user terminal 10 displays the screen 40 shown in FIG. 8. The screen 40 has a keyword input field 411.
When the search button 412 is pressed, the user terminal 10 transmits a query containing the keyword entered in the input field 411 to the search server 20 (FIG. 7, S401).

The query receiving unit 211 of the search server 20 receives the query from the user terminal 10, and the search execution unit 212 performs the search result screen creation processing (S402). FIG. 9 shows the flow of the search result screen creation processing. The search execution unit 212 extracts the keyword from the query (S501), reads from the index database 251 the index information that contains the extracted keyword as a feature word, and takes it as the keyword search result (S502). The search result generation unit 213 generates the search result screen by writing, in HTML, a list of the titles of the index information in the keyword search result, each linked to its URL, together with the summaries (S503).

After generating the search result screen, the search server 20 performs the proposal screen creation process (S403). The flow of the proposal screen creation process is shown in FIG. 10.

The clustering analysis unit 214 clusters the index information included in the keyword search result (S521). As described above, the clustering can use a general cluster analysis process: a TFIDF value is computed for each feature word included in the index information, a vector is formed from a predetermined number of the highest TFIDF values, the inner product of the vectors is calculated for each pair of index information in the keyword search result, and items that are close to each other are classified into the same cluster. The clustering analysis unit 214 creates the table 253 shown in FIG. 6 in an empty state and registers in it, for each cluster produced by the clustering process, the list of index IDs of the index information belonging to that cluster (S522). The maximum cluster determination unit 215 determines, as the maximum cluster, the cluster with the largest number of index information among the clusters classified by the clustering analysis unit 214 (S523). The category determination unit 216 determines the maximum category name by the maximum category name determination process of FIG. 11 described later (S524), and the proposal information generation unit 217 generates proposal information describing the message “Is ‘[maximum category name]’ what you want to know?” together with HTML tags for displaying a “Yes” button and a “No” button (S525).
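Steps S521 to S523 might look like the following Python sketch. The description only calls for "a general cluster analysis process", so the specific choices here are assumptions: term frequency times log inverse document frequency as the TFIDF weight, length-normalized vectors, a similarity threshold of 0.3, and single-link grouping via union-find. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def tfidf_vectors(docs: dict[int, list[str]], top_k: int = 5) -> dict[int, dict[str, float]]:
    """Build a unit-length vector from the top_k TFIDF weights of each document."""
    n = len(docs)
    df = Counter(w for words in docs.values() for w in set(words))
    vectors = {}
    for doc_id, words in docs.items():
        tf = Counter(words)
        total = max(len(words), 1)
        weights = {w: (c / total) * math.log(n / df[w]) for w, c in tf.items()}
        top = dict(sorted(weights.items(), key=lambda kv: -kv[1])[:top_k])
        norm = math.sqrt(sum(v * v for v in top.values())) or 1.0
        vectors[doc_id] = {w: v / norm for w, v in top.items()}
    return vectors

def inner_product(a: dict[str, float], b: dict[str, float]) -> float:
    return sum(v * b.get(w, 0.0) for w, v in a.items())

def cluster(vectors: dict[int, dict[str, float]], threshold: float = 0.3) -> dict[int, list[int]]:
    """Group documents whose pairwise inner product reaches the threshold.

    Returns a mapping cluster ID -> index IDs, playing the role of table 253.
    """
    parent = {d: d for d in vectors}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    ids = list(vectors)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if inner_product(vectors[a], vectors[b]) >= threshold:
                parent[find(a)] = find(b)
    groups = defaultdict(list)
    for d in ids:
        groups[find(d)].append(d)
    return {cid: members for cid, members in enumerate(groups.values(), start=1)}

def largest_cluster(table: dict[int, list[int]]) -> int:
    """S523: return the cluster ID with the most index IDs."""
    return max(table, key=lambda cid: len(table[cid]))

docs = {
    1: ["cat", "dog", "pet"],
    2: ["cat", "pet", "kitten"],
    3: ["dog", "pet", "puppy"],
    4: ["stock", "market"],
    5: ["market", "bank"],
}
table = cluster(tfidf_vectors(docs))
max_cluster_id = largest_cluster(table)
```

Any other clustering method that yields a cluster-to-index-ID table would serve equally well for the later steps.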

The maximum category name determination process performed by the category determination unit 216 is shown in FIG. 11.
The category determination unit 216 first sets the variable n to 0 (S541). The category determination unit 216 then acquires the index information belonging to the maximum cluster. For example, it acquires the index IDs corresponding to the cluster ID of the maximum cluster from the table 253 and reads the index information corresponding to the acquired index IDs from the index database 251. From the feature words included in the read index information, the category determination unit 216 extracts those registered in the category database 252 and sets the list of extracted feature words as the variable CL (S542). The category determination unit 216 sets the counter for each feature word to 0 (S543).

The category determination unit 216 increments n (S544) and sets the feature words included in CL in the variable PL (S545). The category determination unit 216 acquires the n-th level ancestor nodes of the feature words from the category database 252 (S546), sets the acquired nodes as CL (S547), and increments the counter for each node in CL (S548). If the number of nodes included in CL is not 1 (S549: NO), the category determination unit 216 repeats the processing from step S544.

If the number of nodes included in CL is 1 (S549: YES) and the node in CL is the root of the tree structure (S550: YES), the category determination unit 216 determines the node with the largest counter among the nodes included in PL as the maximum category name (S551). If the node in CL is not the root (S550: NO), it determines the node in CL as the maximum category name (S552).

In this way, the category determination unit 216 determines, as the maximum category name, the deepest node among the nodes that include as descendants all the feature words contained in the index information belonging to the maximum cluster. If no such node exists other than the root, it determines as the maximum category name the node immediately below the root whose descendant nodes appear most often in the summaries of the index information belonging to the maximum cluster. Determining the maximum category name in this way makes it possible to express the maximum cluster with a superordinate word that encompasses the words included in the maximum cluster. The user can therefore easily judge what kind of documents have been classified into the maximum cluster.
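The outcome described above can be illustrated with a short Python sketch. It computes the result directly (deepest common ancestor, with a fallback to the most-covered node below the root) rather than reproducing the step-by-step loop of FIG. 11. The `PARENT` tree standing in for the category database 252 is a hypothetical example, and the fallback counts feature words rather than occurrences in summaries, which is a simplification.

```python
from collections import Counter

# Hypothetical category tree (child -> parent); the root has parent None.
PARENT = {
    "root": None,
    "animal": "root", "vehicle": "root",
    "animal character": "animal", "pet": "animal",
    "cartoon cat": "animal character", "mascot bear": "animal character",
    "dog": "pet", "car": "vehicle",
}

def path_to_root(word: str, parent=PARENT) -> list[str]:
    """Return [word, parent, ..., root]."""
    path, node = [], word
    while node is not None:
        path.append(node)
        node = parent[node]
    return path

def max_category_name(feature_words, parent=PARENT):
    # Keep only words registered in the tree (cf. S542); the root itself is skipped.
    words = [w for w in feature_words if parent.get(w) is not None]
    if not words:
        return None
    paths = [path_to_root(w, parent) for w in words]
    # Nodes that have every feature word as a descendant.
    common = set(paths[0][1:])
    for p in paths[1:]:
        common &= set(p[1:])
    # The first common node met while walking upward is the deepest one.
    deepest = next(node for node in paths[0][1:] if node in common)
    if parent[deepest] is not None:
        return deepest
    # Only the root is shared: pick the child of the root covering the most words.
    counts = Counter(p[-2] for p in paths)
    return counts.most_common(1)[0][0]
```

For example, `max_category_name(["cartoon cat", "mascot bear"])` yields `"animal character"`, whereas adding `"car"` leaves only the root in common and the fallback selects `"animal"`.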

Once the maximum category name has been determined and the proposal information screen has been created as described above, the search result transmission unit 218 of the search server 20 transmits screen data including the proposal information screen and the search result screen to the user terminal 10 (S404). Based on this screen data, the user terminal 10 displays a screen such as the screen 42 in FIG. 8. The screen 42 shows a display field 421 containing a message asking whether the search target the user had in mind falls under the maximum category name (in the example of FIG. 8, “Is ‘animal character’ what you want to know?”), together with a button 422 for selecting “Yes” and a button 423 for selecting “No”. The display field 424 of the screen 42 shows the list of keyword search results (the search result screen). By pressing the button 422 or the button 423 on the screen 42, the user selects whether or not the maximum category name matches the intention of the search (S405).
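As a rough illustration of the proposal information shown in the display field 421 with the buttons 422 and 423, the following Python sketch builds a small HTML form. The form action `/select`, the field name `choice`, and the function name are hypothetical; the description only states that the message and the two buttons are described with HTML tags.

```python
from html import escape

def render_proposal(max_category_name: str) -> str:
    """Build the question and the Yes/No buttons as an HTML fragment."""
    name = escape(max_category_name)
    return (
        '<form method="post" action="/select">'   # hypothetical endpoint
        f"<p>Is '{name}' what you want to know?</p>"
        '<button name="choice" value="yes">Yes</button>'
        '<button name="choice" value="no">No</button>'
        "</form>"
    )

proposal_html = render_proposal("animal character")
```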

When the user presses the button 422 or the button 423, the user terminal 10 transmits selection information indicating “Yes” or “No”, according to the pressed button, to the search server 20 (S406).

When the selection information receiving unit 219 of the search server 20 receives the selection information transmitted from the user terminal 10, the search server 20 performs the search result update process (S407). FIG. 12 is a diagram showing the flow of the search result update process.
If the selection information is “Yes” (S561: YES), the selection information receiving unit 219 reads from the table 253 the index IDs corresponding to the cluster ID of the maximum cluster and takes the index information corresponding to the read index IDs as the keyword search result (S562).
If the selection information is “No” (S561: NO), the selection information receiving unit 219 reads from the table 253 the index IDs corresponding to the cluster IDs other than that of the maximum cluster and takes the index information corresponding to the read index IDs as the keyword search result (S563).
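The branch at S561 can be sketched in Python as below. The dictionary `cluster_table` stands in for the table 253 (cluster ID mapped to index IDs), and the string values `"yes"` and `"no"` for the selection information are assumptions made for illustration.

```python
def update_search_result(selection: str,
                         cluster_table: dict[int, list[int]],
                         max_cluster_id: int) -> list[int]:
    """Return the index IDs that form the new keyword search result."""
    if selection == "yes":
        # S562: keep only the maximum cluster.
        return list(cluster_table[max_cluster_id])
    if selection == "no":
        # S563: keep everything except the maximum cluster.
        return [index_id
                for cluster_id, index_ids in cluster_table.items()
                if cluster_id != max_cluster_id
                for index_id in index_ids]
    raise ValueError(f"unexpected selection information: {selection!r}")

cluster_table = {1: [10, 11, 12], 2: [20], 3: [30, 31]}
```

With this table and cluster 1 as the maximum cluster, "yes" yields the three documents of cluster 1 and "no" yields the remaining three.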

The search result generation unit 213 generates the search result screen by describing in HTML a list of the titles, each linked to the URL of the corresponding index information in the keyword search result, together with the summaries (S564).
Next, the search server 20 performs the proposal information screen creation process shown in FIG. 10 and creates the proposal information screen (S565).
Once the search result has been updated, the search server 20 repeats the processing from step S404 in FIG. 7 in order to transmit screen data including the proposal screen and the search result screen to the user terminal 10.

After the search result update process has been performed, a screen such as the screen 43 in FIG. 8 is displayed on the user terminal 10. The screen 43 has the same configuration as the screen 42: the display field 421 shows the maximum category name of the maximum cluster obtained by clustering the narrowed-down search results into a plurality of clusters, and the display field 424 shows the list of the narrowed-down search results.

As described above, according to the document search system of this embodiment, the search server 20 can propose a way of narrowing down the results to the user. The user can therefore narrow down the search results easily, simply by selecting “Yes” or “No” in response to each proposal from the search server 20, without having to drive the search actively by adding keywords or changing search conditions. In general, even when a first keyword comes to mind, it is often difficult to think of a second one; with the document search system of this embodiment, the user no longer needs to think of new keywords, which is convenient.

Further, in the document search system of this embodiment, a superordinate concept of the words included in the summaries of the index information classified into the maximum cluster can be presented to the user as the maximum category name. The user can therefore easily grasp what kind of documents have been classified into the maximum cluster.

Further, according to the document search system of this embodiment, clustering is performed again, according to the user's selection, on either the index information classified into the maximum cluster or the index information classified into the other clusters. When a keyword is added or a new search condition is specified, the new search result may contain documents that were not in the previous result. In contrast, narrowing down by clustering the existing search result, as in this embodiment, always yields a subset of that result, so the search results are reliably narrowed down.
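The repeated narrowing can be sketched as a simple loop in Python. Here `cluster_fn` is any function that splits a result list into clusters, `answers` is the sequence of Yes (True) and No (False) selections, and `by_first_letter` is a toy stand-in for real clustering; all of these names are assumptions for illustration. Because each round only keeps documents from the current result, the result can never grow.

```python
def narrow(results, cluster_fn, answers):
    """Apply one Yes/No selection per round, re-clustering after each."""
    for answer in answers:
        clusters = cluster_fn(results)
        if len(clusters) < 2:
            break  # nothing left to choose between
        biggest = max(clusters, key=len)
        if answer:
            results = list(biggest)
        else:
            results = [d for c in clusters if c is not biggest for d in c]
    return results

def by_first_letter(docs):
    """Toy clustering stand-in: group strings by their first character."""
    groups = {}
    for d in docs:
        groups.setdefault(d[0], []).append(d)
    return list(groups.values())

docs = ["aa", "ab", "ac", "ba", "bb", "ca"]
```

Answering "No" and then "Yes" on this toy data first drops the "a" group and then keeps the "b" group.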

In the document search system of this embodiment, the index information is assumed to contain feature words in advance. However, this is not a limitation; for example, words contained in the title or the summary may be extracted and used for the clustering process.

In the document search system of this embodiment, index information is searched, but documents may be searched directly instead. In that case, a document database that manages documents can be adopted in place of the index database 251, and the words contained in the documents can be searched.

Alternatively, the index database 251 may manage only URLs and feature words. In that case, index information matching the keyword included in the query is searched for, the document indicated by the URL is acquired each time, and a summary is created by extracting words contained in the acquired document.

In this embodiment, clustering is performed on all the index information that matches the keyword included in the query. However, clustering may instead be performed on only a predetermined number of items of index information, such as 100 or 1000.

In this embodiment, clustering is performed using feature words. Alternatively, each document may be categorized in advance so that it belongs to one or more of a plurality of categories, and clustering may be performed using those categories. In this case, the index information contains category names in addition to feature words, the category database 252 manages the category names, and clustering uses the categories contained in the index information that matches the keyword included in the query. Clustering with a finite number of categories reduces the number of dimensions. For example, if there are 1000 categories, the clustering calculation needs only 1000 dimensions.
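The dimensionality point can be made concrete with a small Python sketch: each document becomes a vector with one component per category, so the vector length equals the number of categories regardless of vocabulary size. The category list and the binary 0/1 weighting are assumptions for illustration.

```python
# Hypothetical fixed category list managed by the category database 252.
CATEGORIES = ["animals", "sports", "finance"]

def category_vector(doc_categories, categories=CATEGORIES) -> list[float]:
    """One component per category: 1.0 if the document belongs to it."""
    members = set(doc_categories)
    return [1.0 if c in members else 0.0 for c in categories]

def inner_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

v1 = category_vector(["sports", "animals"])
v2 = category_vector(["animals"])
```

These vectors can then be compared by inner product in the same way as the feature-word vectors.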

When documents are categorized in advance, the search results may be narrowed down further by category. In this case, for example, the search server 20 includes: a category selection screen transmission unit that counts the documents in the maximum cluster for each category and transmits to the user terminal 10 screen data for selecting from a predetermined number of categories, taken in descending order of the number of documents belonging to them; a category input unit that receives the category selected at the user terminal 10; and a narrowing result generation unit that generates screen data (hereinafter referred to as a narrowing result screen) for displaying only a list of those documents in the maximum cluster that belong to the selected category. The search result transmission unit 218 transmits the narrowing result screen to the user terminal 10. In this way, the documents included in the maximum cluster can easily be narrowed down by document category.
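The counting and filtering described here can be sketched in Python as follows. The mapping `cluster_docs` from document ID to its set of categories and both function names are hypothetical.

```python
from collections import Counter

def top_categories(cluster_docs: dict[int, set[str]], limit: int) -> list[str]:
    """Count documents per category and return the `limit` most common ones."""
    counts = Counter(c for cats in cluster_docs.values() for c in cats)
    return [category for category, _ in counts.most_common(limit)]

def filter_by_category(cluster_docs: dict[int, set[str]], category: str) -> list[int]:
    """Keep only the documents of the maximum cluster in the chosen category."""
    return [doc_id for doc_id, cats in cluster_docs.items() if category in cats]

cluster_docs = {1: {"a", "b"}, 2: {"a"}, 3: {"a", "c"}, 4: {"b"}}
```

The first function would feed the category selection screen; the second would produce the list shown on the narrowing result screen.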

In this embodiment, clustering is performed only once. However, this is not a limitation; clustering may be performed multiple times using different methods, for example with a plurality of distance functions, and the cluster with the largest number of documents among all the clusters produced by all the methods may be taken as the maximum cluster. This makes it possible to narrow down the search results from various viewpoints.
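A minimal Python sketch of this variation follows: several clustering functions are run over the same documents and the single largest cluster among all of their outputs is chosen. The `group_by` helper and the two toy grouping keys are assumptions that stand in for clustering with different distance functions.

```python
from collections import defaultdict

def group_by(key):
    """Build a toy clustering function that groups documents by key(doc)."""
    def clusterer(docs):
        groups = defaultdict(list)
        for d in docs:
            groups[key(d)].append(d)
        return list(groups.values())
    return clusterer

def largest_cluster_across(docs, clusterers):
    """Run every clustering method and return the biggest cluster overall."""
    candidates = [c for fn in clusterers for c in fn(docs)]
    return max(candidates, key=len)

docs = ["kiwi", "pear", "plum", "lime", "fig", "apple"]
methods = [group_by(lambda d: d[0]), group_by(len)]
maximum = largest_cluster_across(docs, methods)
```

On this toy data, grouping by first letter gives at most two documents per cluster, while grouping by length yields a cluster of four, so the latter supplies the maximum cluster.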

In this embodiment, the category determination unit 216 takes as the maximum category name the deepest node among the nodes whose descendants include all the words contained in the summaries of the index information classified into the maximum cluster. However, this is not a limitation. For example, the category determination unit 216 may calculate the frequency of occurrence of the words contained in those summaries and determine the most frequent word as the maximum category name. The category determination unit 216 may also calculate the TFIDF value of each word contained in those summaries and determine the word with the highest TFIDF value as the maximum category name, or generate a maximum category name containing a predetermined number of words in descending order of TFIDF value.
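Both alternatives can be sketched in Python. The description does not say which corpus the TFIDF values are computed against, so this sketch assumes term frequency is counted within the maximum cluster and document frequency over all summaries in the search result; that choice, and the function names, are assumptions.

```python
import math
from collections import Counter

def most_frequent_word(cluster_summaries: list[list[str]]) -> str:
    """Label the cluster with its most frequent word."""
    counts = Counter(w for summary in cluster_summaries for w in summary)
    return counts.most_common(1)[0][0]

def tfidf_label(cluster_summaries: list[list[str]],
                all_summaries: list[list[str]], n: int = 1) -> str:
    """Label the cluster with its n highest-TFIDF words, joined by spaces."""
    tf = Counter(w for summary in cluster_summaries for w in summary)
    df = Counter(w for summary in all_summaries for w in set(summary))
    total = len(all_summaries)
    scores = {w: c * math.log(total / max(df[w], 1)) for w, c in tf.items()}
    ranked = sorted(scores, key=lambda w: -scores[w])
    return " ".join(ranked[:n])

cluster_summaries = [["cat", "mascot"], ["cat", "mascot", "toy"], ["cat", "news"]]
all_summaries = cluster_summaries + [["stock", "market"], ["market", "news"]]
```

On this toy data the most frequent word is "cat", while TFIDF favors "mascot" and "toy" because "cat" occurs in more summaries overall relative to its count.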

In this embodiment, the category database 252 manages words in a tree structure, but it may instead store, in association with each category name, the words belonging to that category. In this case, for each item of index information, the clustering analysis unit 214 counts, for each category name registered in the category database 252, the number of that category's words contained in the summary, takes the category name with the highest count as the category name of the index information, and classifies index information with the same category name into the same cluster. The category determination unit 216 can then determine that category name as the maximum category name.
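This flat variant of the category database can be sketched in Python as follows. The `CATEGORY_WORDS` mapping and the function names are hypothetical, and returning `None` for a summary that matches no category is an assumption, since the description does not cover that case.

```python
from collections import defaultdict

# Hypothetical flat category database: category name -> words in the category.
CATEGORY_WORDS = {
    "animal character": {"cat", "bear", "mascot"},
    "finance": {"stock", "bank", "market"},
}

def assign_category(summary_words, category_words=CATEGORY_WORDS):
    """Pick the category with the most matching words in the summary."""
    counts = {name: sum(1 for w in summary_words if w in words)
              for name, words in category_words.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None

def cluster_by_category(docs: dict[int, list[str]]) -> dict:
    """Documents sharing an assigned category name form one cluster."""
    clusters = defaultdict(list)
    for doc_id, words in docs.items():
        clusters[assign_category(words)].append(doc_id)
    return dict(clusters)

docs = {1: ["cat", "mascot", "bank"], 2: ["stock", "market"], 3: ["bear", "cat"]}
clusters = cluster_by_category(docs)
max_name = max(clusters, key=lambda name: len(clusters[name]))
```

Here the key of the largest cluster doubles as the maximum category name, which matches the last sentence of the paragraph above.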

Although this embodiment has been described above, the embodiment is intended to facilitate understanding of the present invention and is not to be construed as limiting it. The present invention can be changed and improved without departing from its gist, and the present invention includes its equivalents.

For example, the document search system of this embodiment can be configured with the search server 20 alone. In this case, the search server 20 accepts keyword input from the input device 205, such as a keyboard or a mouse, performs the search, and outputs the search results to the output device 206, such as a display. This configuration is useful, for example, when a large number of documents are stored on a personal computer.

FIG. 1 is a diagram showing the overall configuration of the document search system of this embodiment.
FIG. 2 is a diagram showing the hardware configuration of the search server 20.
FIG. 3 is a diagram showing the software configuration of the search server 20.
FIG. 4 is a diagram showing a configuration example of the index information stored in the index database 251.
FIG. 5 is a diagram showing a configuration example of the category database 252.
FIG. 6 is a diagram showing a configuration example of the table 253 that manages index IDs for each cluster.
FIG. 7 is a diagram explaining the overall flow of the document search process in the document search system.
FIG. 8 is a diagram showing examples of screens displayed on the user terminal 10 in the document search process.
FIG. 9 is a diagram showing the flow of the search result screen creation process.
FIG. 10 is a diagram showing the flow of the proposal screen creation process.
FIG. 11 is a diagram showing the flow of the maximum category name determination process.
FIG. 12 is a diagram showing the flow of the search result update process.

Explanation of symbols

10 User terminal
20 Search server
30 Communication network
211 Query receiving unit
212 Search execution unit
213 Search result generation unit
214 Clustering analysis unit
215 Maximum cluster determination unit
216 Category determination unit
217 Proposal information generation unit
218 Search result transmission unit
219 Selection information receiving unit
220 Crawler processing unit
251 Index database
252 Category database
253 Table

Claims

1. A system for retrieving documents, comprising:
a keyword input unit that accepts input of a keyword;
a document search unit that searches for documents corresponding to the keyword;
a group classification unit that classifies the search result documents into a plurality of groups based on words included in the documents;
a maximum group name determination unit that determines, based on words included in the documents belonging to a maximum group, which is the group having the largest number of documents, a maximum group name representing the maximum group;
a refinement selection unit that receives input of selection information indicating whether or not the maximum group name matches a user's intention; and
a search result display unit that displays a list of the documents belonging to the maximum group when the selection information indicates a match with the intention, and displays a list of the documents included in the search result that do not belong to the maximum group when the selection information indicates no match with the intention.

2. The document search system according to claim 1,
wherein, when the selection information indicates a match with the intention, the documents belonging to the maximum group are taken as the search result, and when the selection information indicates no match with the intention, the documents included in the search result that do not belong to the maximum group are taken as the search result, and
wherein the group classification unit classifies the documents included in that search result into groups, the maximum group name determination unit determines the maximum group name, the selection screen information is transmitted, the selection information reception unit receives the selection information, and the search result transmission unit transmits information for displaying the list.

3. The document search system according to claim 1, further comprising
a feature word extraction unit that extracts feature words for a document from the search result documents by an LSI technique and determines the extracted feature words as the words included in the document.

4. The document search system according to claim 1,
wherein the group classification unit classifies the documents into a plurality of groups by clustering the words included in the documents.

5. The document search system according to claim 4,
wherein the group classification unit weights the words included in the documents by TF-IDF and classifies the weighted words by clustering.

6. The document search system according to claim 4,
wherein the group classification unit performs clustering a plurality of times by a plurality of different methods.

7. The document search system according to claim 1,
wherein the maximum group name determination unit calculates the frequency of each word included in the documents belonging to the maximum group and determines the word with the highest frequency as the maximum group name.

8. The document search system according to claim 1, further comprising
a category database that stores, in association with a group name indicating a group, words related to the group,
wherein the maximum group name determination unit extracts a plurality of words included in the documents belonging to the maximum group, counts, for each group name stored in the category database, the number of related words that are among the words extracted from the documents, and determines the group name with the largest count as the maximum group name.

9. The document search system according to claim 1, further comprising
a category database that stores words as nodes in a tree structure,
wherein the maximum group name determination unit extracts a plurality of words included in the documents belonging to the maximum group, acquires from the category database the node with the deepest hierarchy among the nodes that include all the extracted words as descendants, and determines the acquired node as the maximum group name.

10. The document search system according to claim 9,
wherein the maximum group name determination unit weights the words extracted from the documents by TF-IDF and determines, as the maximum group name, one of the nodes that include as descendants all the extracted words whose weight is equal to or greater than a predetermined value.

11. The document search system according to claim 1,
wherein the group classification unit classifies only a predetermined number of documents among the search result documents into a plurality of groups.

12. The document search system according to claim 1, further comprising
an index storage unit that stores the documents and feature words of the documents in association with the keyword,
wherein the document search unit acquires the documents and the feature words corresponding to the keyword from the index storage unit, and
the group classification unit classifies the documents into a plurality of groups based on the feature words.

13. The document search system according to claim 1, further comprising
a category database that stores, for each document, one or more categories to which the document belongs,
wherein the group classification unit classifies the search result documents into a plurality of groups based on the corresponding categories.

14. The document search system according to claim 1, further comprising:
a category database that stores, for each document, a category to which the document belongs;
a category display unit that displays a list of the categories to which the documents included in the maximum group belong when the selection information indicates a match with the intention;
a category input unit that receives input of one of the categories; and
a refinement display unit that displays a list of the documents belonging to the input category among the documents included in the maximum group.

15. A method for searching documents, in which a computer performs:
receiving input of a keyword;
searching for documents corresponding to the keyword;
classifying the search result documents into a plurality of groups based on words included in the documents;
determining, based on words included in the documents belonging to a maximum group, which is the group having the largest number of documents, a maximum group name representing the maximum group;
receiving input of selection information indicating whether or not the maximum group name matches a user's intention; and
displaying a list of the documents belonging to the maximum group when the selection information indicates a match with the intention, and displaying a list of the documents included in the search result that do not belong to the maximum group when the selection information indicates no match with the intention.

16. A program for searching documents, the program causing a computer to execute:
receiving input of a keyword;
searching for documents corresponding to the keyword;
classifying the search result documents into a plurality of groups based on words included in the documents;
determining, based on words included in the documents belonging to a maximum group, which is the group having the largest number of documents, a maximum group name representing the maximum group;
receiving input of selection information indicating whether or not the maximum group name matches a user's intention; and
displaying a list of the documents belonging to the maximum group when the selection information indicates a match with the intention, and displaying a list of the documents included in the search result that do not belong to the maximum group when the selection information indicates no match with the intention.