JP2011170583A

JP2011170583A - Information search apparatus, information search method and information search program

Info

Publication number: JP2011170583A
Application number: JP2010033276A
Authority: JP
Inventors: Shunsuke Konagai; 俊介小長井; Yamato Takahashi; 大和高橋; Ryoji Kataoka; 良治片岡
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2010-02-18
Filing date: 2010-02-18
Publication date: 2011-09-01

Abstract

PROBLEM TO BE SOLVED: To provide diverse electronic document information across various fields by raising electronic documents that would otherwise rank low when the output order is based on importance.

SOLUTION: In an information search apparatus 1, a document clustering part 2 classifies a group of collected electronic documents into a plurality of clusters. A rarity calculation part 3 calculates the rarity of each collected electronic document from the number of documents constituting the cluster to which the document belongs. A keyword matching calculation part 8 calculates the matching degree between an input search keyword and the documents containing the search keyword that are extracted from the collected electronic document group. A general ranking calculation part 9 decides the output order of the individual documents containing the search keyword according to the calculated matching degree and rarity and the importance of the individual documents.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to information search technology, such as search engines on the Internet.

In recent years, with the spread of the Internet, systems and services that accurately retrieve the information a user needs from the enormous body of documents on the Internet have become increasingly important.

In general, an information search system determines the output order of search results from two quantities. The first is the degree of matching between the search keyword and a document, derived from how many times the keyword entered by the user appears in the document and in the anchor text of links pointing to it from other documents. The second is the importance of the document, which indicates how often it is referenced by other documents. Techniques for calculating the matching degree and the importance of a document are disclosed in Non-Patent Documents 1 to 3.

Non-Patent Document 1: S. Robertson, H. Zaragoza, M. Taylor, “Simple BM25 extension to multiple weighted fields”, Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, November 8-13, 2004, pp. 42-49.
Non-Patent Document 2: Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd, “The PageRank Citation Ranking: Bringing Order to the Web”, 7th International World Wide Web Conference (WWW98), January 29, 1998, pp. 1-17.
Non-Patent Document 3: Jon M. Kleinberg, “Authoritative sources in a hyperlinked environment”, Journal of the ACM (JACM), Vol. 46, No. 5, September 1999, pp. 604-632.

However, conventional information search systems of this kind have the following problem. The degree of matching between a search keyword and a document is generally calculated with methods based on word statistics, such as tf (term frequency), idf (inverse document frequency), and BM25. These methods rest on the assumption that words appearing in a document more frequently than the average over a given document collection characterize that document, and they give high output ranks to documents whose characteristics closely match the search keyword entered by the user.

With these methods, good search results are obtained when the search keyword is a relatively rare word. When the search keyword is a very common word, however, too many documents end up with about the same matching degree.

To address this problem, conventional information search services calculate the importance of each document in order to rank documents whose matching degree with the search keyword is about the same, and determine the output order by combining the matching degree with the document importance.

Methods such as PageRank (Non-Patent Document 2) and HITS (Non-Patent Document 3) are commonly used to determine document importance. They use the link information of web pages and rest on the assumption that a document linked from many other documents is likely to be important.

By combining the static importance of a document with the matching degree between the search keyword and the document, documents that both match the search keyword and are important can be presented to the searching user.

However, ordering documents that match the search keyword to about the same degree by their static importance has a drawback. Suppose, for example, that a search is performed with the keywords “digital camera shooting method”. Documents containing the words “digital camera” and “shooting method” are ordered by their matching degree with the search keyword, based on the statistics of those words, and by document importance, which is driven mainly by inbound links. Now suppose there is a document A describing general shooting methods for digital cameras and a document B describing how to take astronomical photographs with a digital camera. Document A, which appeals to many general readers, can be expected to attract more external links than document B, which appeals to a specific readership interested in astrophotography.

Moreover, the Internet contains a large number of good documents summarizing shooting methods for digital cameras, and likewise a large number of documents that link to them.

As a result, under conventional ranking, the top of the search results is monopolized by multiple documents describing general shooting methods for digital cameras, and the search system cannot provide its users with diverse information.

One conventional approach therefore extracts the documents containing the search keyword from the search target, classifies them into a plurality of groups, and presents to the user, from each group, either a fixed number of documents or a number proportional to the number of documents in the group.

However, because this approach runs the classification over the documents that match the search keyword, which are generally numerous, the response time to a search request becomes long and the load on the system increases.

Another conventional approach classifies the documents to be searched in advance and records the group to which each document belongs. For each pre-classified group represented among the documents matching the search keyword, it presents either a fixed number of documents or a number proportional to the number of documents in the group. Because the classification is completed in advance, the delay in responding to a search request is greatly reduced compared with the former approach.

However, because the documents are classified in advance, this approach cannot ensure diversity of search results unless the documents are grouped more finely than in the former approach. If, on the other hand, the groups are too fine, the documents matching a search keyword span too many groups, and selecting the documents to present to the user requires additional processing to decide how many documents to present from each group. The response to a search request is therefore delayed.

The present invention has been made in view of the above circumstances. Its object is to provide diverse information without delaying the search response, by reflecting the rarity of a document's subject in the ranking of search results.

To this end, the present invention determines the output rank of each electronic document containing the input search keyword based on three quantities: the rarity of the document, which is derived from the number of documents constituting the cluster to which it belongs; the degree of matching between the keyword and the document; and the importance of the document.

In one aspect, the information search apparatus of the present invention is an apparatus that searches electronic documents and comprises: classification means for classifying a group of collected electronic documents into a plurality of clusters; rarity calculation means for calculating the rarity of each collected electronic document based on the number of documents constituting the cluster to which it belongs; matching degree calculation means for calculating the degree of matching between an input search keyword and the documents containing the search keyword that are extracted from the collected electronic document group; and rank determination means for determining the output rank of each of the documents containing the search keyword based on the calculated matching degree and rarity and on the importance of the document.

In one aspect, the information search method of the present invention is a method for searching electronic documents and comprises: a step in which classification means classifies a group of collected electronic documents into a plurality of clusters; a step in which rarity calculation means calculates the rarity of each collected electronic document based on the number of documents constituting the cluster to which it belongs; a step in which matching degree calculation means calculates the degree of matching between an input search keyword and the documents containing the search keyword that are extracted from the collected electronic document group; and a step in which rank determination means determines the output rank of each of the documents containing the search keyword based on the calculated matching degree and rarity and on the importance of the document.

The present invention may also take the form of an information search program that causes a computer to function as each of the means constituting the information search apparatus.

According to the invention described above, the rarity of electronic documents is reflected in the search results, so that electronic documents ranked low when ordered by importance can be raised to higher ranks. Diverse electronic document information can therefore be provided without bias toward a particular field.

FIG. 1 is a block diagram of an information search apparatus according to an embodiment of the invention.
FIG. 2 is a hardware configuration diagram for implementing the information search apparatus according to the embodiment of the invention.
FIG. 3 is a chart illustrating the processing procedure according to the embodiment of the invention.

Embodiments of the present invention are described below with reference to the drawings.

[Overview]
The information search apparatus 1 according to the present embodiment, shown in FIG. 1, determines the output rank of each electronic document containing the input search keyword based on the rarity of the document, which is derived from the number of documents constituting the cluster to which it belongs, the degree of matching between the keyword and the document, and the importance of the document. The information search apparatus 1 searches web documents (electronic documents), such as the web pages collected via the Internet illustrated in FIG. 1.

[Device configuration]
The information search apparatus 1 shown in FIG. 1 is based on an index search method. Its functional units 2 to 9 are implemented through the cooperation of the hardware resources shown in FIG. 2 (CPU, memory, hard disk drive, and so on) and software resources (OS, applications, and so on). Besides being built from the hardware resources of the single computer shown in FIG. 2, the information search apparatus 1 may also be implemented with a combination of the hardware resources of multiple computers.

The functional units 2 to 9 constituting the information search apparatus 1 are described below.

The document clustering unit 2 classifies the web documents collected by the data collection function (crawler) of the information search apparatus 1, such as web documents 11 to 13, into a plurality of clusters (groups). Any well-known classification method may be used. Examples include classification using word vectors based on the presence or count of the distinct words in each document, and classification using document vectors obtained by combining the concept vectors of the words in each document.
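
The word-vector idea mentioned above can be illustrated with a minimal sketch. It assumes whitespace tokenization, cosine similarity, and a simple single-pass grouping with a hypothetical `threshold`; none of these choices come from the patent, which leaves the classification method open, and the function names and sample texts are illustrative only.

```python
import math
from collections import Counter

def word_vector(text):
    # Bag-of-words vector: distinct word -> count (assumed tokenization: whitespace).
    return Counter(text.lower().split())

def cosine(u, v):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster(docs, threshold=0.5):
    # Single-pass grouping (an assumed stand-in for any well-known method):
    # a document joins the first cluster whose seed vector is similar enough,
    # otherwise it starts a new cluster.
    seeds, labels = [], {}
    for doc_id, text in docs.items():
        vec = word_vector(text)
        for idx, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                labels[doc_id] = idx
                break
        else:
            seeds.append(vec)
            labels[doc_id] = len(seeds) - 1
    return labels

# Hypothetical sample documents.
docs = {
    "d11": "digital camera shooting method basics",
    "d14": "digital camera shooting method tips",
    "d12": "astronomical photograph telescope exposure stars",
}
labels = cluster(docs)
```

In this toy data the two general shooting-method documents share a cluster, while the astrophotography document forms its own.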

The rarity calculation unit 3 calculates the rarity of the documents contained in the clusters produced by the document clustering unit 2. Specifically, it calculates the rarity of a document by an operation whose parameters are the cluster size (the number of documents in the cluster), or the cluster size together with the importance of the documents in the cluster. The calculated rarity of each web document is stored in the document rarity table 5. As shown in FIG. 1, each rarity value in the document rarity table 5 is stored in the column corresponding to the identifier of the web document. The document rarity table 5 is recorded in the storage unit (hard disk drive or the like) shown in FIG. 2.

Examples of how the rarity calculation unit 3 calculates rarity are given below.

(Method 1)
Suppose that, as a result of clustering (classification), web document Di belongs to cluster Ci and that the number of documents constituting Ci is Cin. The rarity RS(Di) of document Di can then be calculated by equation (1) below as a value inversely proportional to the number of documents in cluster Ci.

RS(Di) = a / Cin … (1)
In equation (1), a is a coefficient that adjusts how strongly the search result is raised. Since cluster Ci contains at least document Di, Cin is 1 or greater. As long as it is a function of Cin, the formula for Method 1 is not limited to equation (1); a nonlinear curve may also be used.
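
Equation (1) can be sketched in a few lines, assuming the cluster assignment is available as a mapping from document identifier to cluster label. The function name `rarity_method1`, the identifiers, and the sample data are hypothetical.

```python
from collections import Counter

def rarity_method1(cluster_of, a=1.0):
    # RS(Di) = a / Cin, where Cin is the size of the cluster containing Di.
    sizes = Counter(cluster_of.values())
    return {doc: a / sizes[c] for doc, c in cluster_of.items()}

# Hypothetical assignment: four general documents, one astrophotography document.
cluster_of = {
    "d11": "general", "d14": "general", "d15": "general", "d16": "general",
    "d12": "astro",
}
rs = rarity_method1(cluster_of, a=1.0)
```

With this data the lone astrophotography document receives rarity 1.0, while each of the four general documents receives 0.25.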

With Method 1, documents in clusters consisting of few documents are given a high rarity, which raises the search result rank of documents on unusual subjects.

(Method 2)
Method 2 refers to the document importance table 4 and calculates rarity only for a specific number of the most important documents in each cluster, using equation (1) with the number of documents in the cluster to which the document belongs as the parameter. This makes it possible to raise the output rank of only the documents that represent each clustered subject, without shifting the output ranks of all documents in the search target, so more diverse search results can be expected.
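
A minimal sketch of Method 2 follows. It assumes that documents outside the top `top_k` of their cluster receive a rarity of 0; the patent does not state what value they get, so that default, like the function and variable names, is an assumption made for illustration.

```python
from collections import defaultdict

def rarity_method2(cluster_of, importance, top_k=1, a=1.0):
    # Group documents by cluster.
    members = defaultdict(list)
    for doc, c in cluster_of.items():
        members[c].append(doc)
    # Assumption: documents that do not represent their cluster get rarity 0.
    rarity = {doc: 0.0 for doc in cluster_of}
    for docs in members.values():
        # The top_k most important documents represent the cluster.
        top = sorted(docs, key=lambda d: importance[d], reverse=True)[:top_k]
        for d in top:
            rarity[d] = a / len(docs)  # equation (1) with Cin = len(docs)
    return rarity

# Hypothetical data.
cluster_of = {"a1": "A", "a2": "A", "b1": "B"}
importance = {"a1": 10.0, "a2": 5.0, "b1": 3.0}
rs2 = rarity_method2(cluster_of, importance, top_k=1)
```

Only `a1` and `b1`, the most important document of each cluster, receive a non-zero rarity here.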

(Method 3)
Method 3 uses, as parameters for calculating the rarity of a document, the number of documents in the cluster together with the importance of all the documents in that cluster. Alternatively, it uses the number of documents in the cluster together with the importance of the documents that represent the cluster, for example a specific number of the most important documents in the cluster. With Method 3, the search result ranking can reflect the importance of each cluster among clusters that contain about the same number of documents.

Let CiPR denote the importance of the documents, among those constituting the cluster Ci that contains web document Di, that are used in the rarity calculation. The rarity RS(Di) of web document Di may then be the linear combination given by equation (2) below or the product-based value given by equation (3) below.

RS(Di) = a / Cin + b × CiPR … (2)
RS(Di) = (a × CiPR) / Cin … (3)
Examples of CiPR include the average or sum of the importance of all documents in cluster Ci, and the average or sum of the importance of the documents that represent cluster Ci. As in Method 2, the documents representing cluster Ci may be a specific number of the most important documents in the cluster. As with Method 1, as long as it is a function of Cin and CiPR, the formula for Method 3 is not limited to equations (2) and (3); a nonlinear surface function may also be used.
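
Equations (2) and (3) can be sketched as follows, with CiPR taken as the average or sum of the importance of all documents or of the `top_k` most important ones. The function names, parameter names, and sample values are hypothetical.

```python
def cluster_importance(importances, top_k=None, use_mean=True):
    # CiPR: average (or sum) of importance over all documents in the cluster,
    # or over the top_k most important documents when top_k is given.
    vals = sorted(importances, reverse=True)
    if top_k is not None:
        vals = vals[:top_k]
    total = sum(vals)
    return total / len(vals) if use_mean else total

def rarity_linear(cin, cipr, a=1.0, b=1.0):
    # Equation (2): RS(Di) = a / Cin + b * CiPR
    return a / cin + b * cipr

def rarity_product(cin, cipr, a=1.0):
    # Equation (3): RS(Di) = (a * CiPR) / Cin
    return (a * cipr) / cin

# Two hypothetical clusters of equal size with different top-document importance.
astro_cipr = cluster_importance([35.0, 20.0, 5.0], top_k=2)
micro_cipr = cluster_importance([15.0, 10.0, 5.0], top_k=2)
astro_rs = rarity_product(3, astro_cipr)
micro_rs = rarity_product(3, micro_cipr)
```

Because both clusters contain three documents, the difference in rarity comes entirely from CiPR, which is the situation Method 3 is meant to distinguish.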

Method 3 makes it possible to reflect the importance of a document's subject in the search result ranking (the output order of documents), in addition to the importance of the document itself obtained with conventional methods such as PageRank and HITS. Consider the case where web documents 12 and 13 shown in FIG. 1 are output as search results. The document cluster on the subject “astrophotography with a digital camera”, which contains web document 12, and the document cluster on the subject “photomicrography with a digital camera”, which contains web document 13, contain the same number of documents. Suppose that, when the importance of the top-ranked documents of the two clusters is compared, the top documents of the “astrophotography with a digital camera” cluster have the higher importance. In this case, the rarity of web document 12 reflects the importance of the top documents of its cluster, and the rarity of web document 13 reflects the importance of the top documents of its cluster. For web documents 12 and 13, whose matching degrees with the search keyword are about the same, the output order can thus be reversed from the order based on the importance of the two documents alone.

The document importance table 4 stores the importance of the collected web documents (web documents 11, 12, …, n) for use in the calculations of the rarity calculation unit 3 and the comprehensive ranking calculation unit 9. The importance of each document is determined in advance by a well-known technique such as PageRank or HITS. As shown in FIG. 1, each importance value in the document importance table 4 is stored in the column corresponding to the identifier of the web document. The document importance table 4 is recorded in the storage unit (hard disk drive or the like) shown in FIG. 2.

The index unit 6 splits each collected web document into terms so that the documents can be searched easily. The terms may take forms used for full-text search, such as words, n-grams, and suffix arrays.

The document index 7 stores the terms obtained by the index unit 6 in index form. It may also contain information found in ordinary full-text search indexes, such as tf, idf, HTML markup information for words, and word position information.
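
As a minimal sketch of the index unit 6 and the document index 7, the code below builds a word-based inverted index with positions and shows a character n-gram splitter. Whitespace tokenization, the dictionary layout, and all names are assumptions; the patent does not prescribe an index format.

```python
from collections import defaultdict

def char_ngrams(text, n=2):
    # Character n-grams, one of the term forms mentioned for full-text search.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_index(docs):
    # Inverted index: term -> {doc_id: [word positions]}.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

# Hypothetical sample documents.
docs = {"d11": "digital camera shooting", "d12": "astro camera"}
index = build_index(docs)
bigrams = char_ngrams("camera", 2)
```

Looking up `index["camera"]` returns the documents containing the word together with its positions, from which tf and document frequency can be derived.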

The keyword matching degree calculation unit 8 calculates the degree of matching between the search keyword supplied from the information search terminal 10 of a user of the information search apparatus 1 and the terms stored in the document index 7. Specifically, it uses the search keyword to look up the document index 7, lists the documents that contain the keyword, and calculates the degree of matching between each of these documents and the keyword. Well-known methods such as tf·idf, BM25, and BM25F may be used to calculate the matching degree.
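
The matching degree calculation can be sketched with one simple tf·idf variant, in which a document's score is the sum over query terms of tf × log(N / df). This is only one of many tf·idf formulations and is not BM25; the function name, the tokenization, and the sample documents are assumptions.

```python
import math
from collections import Counter

def tfidf_scores(query, docs):
    # Term counts per document (assumed tokenization: whitespace).
    tokenized = {d: Counter(t.lower().split()) for d, t in docs.items()}
    n_docs = len(docs)
    scores = {}
    for term in query.lower().split():
        df = sum(1 for tf in tokenized.values() if term in tf)
        if df == 0:
            continue  # the term appears in no document
        idf = math.log(n_docs / df)
        for d, tf in tokenized.items():
            if term in tf:
                scores[d] = scores.get(d, 0.0) + tf[term] * idf
    return scores  # only documents containing at least one query term are listed

# Hypothetical sample documents.
docs = {
    "d11": "digital camera shooting method",
    "d12": "digital camera astronomical shooting method",
    "d20": "cooking recipe",
}
scores = tfidf_scores("camera astronomical", docs)
```

The rarer term “astronomical” contributes a larger idf, so `d12` scores higher than `d11`, and `d20` is not listed at all.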

The comprehensive ranking calculation unit 9 determines the search result for the search keyword, that is, the output order of the web documents containing the keyword, based on the matching degree calculated by the keyword matching degree calculation unit 8 and on the information in the document importance table 4 and the document rarity table 5.

[Description of processing procedure]
The procedure of the information search process performed by the information search apparatus 1 is described with reference to FIG. 3.

S1: The document clustering unit 2 classifies the document group consisting of the collected web documents 11, 12, …, n into a plurality of clusters.

In the example of FIG. 1, web document 11 is classified into the “cluster on shooting methods with a digital camera in general”, web document 12 into the “cluster on astrophotography methods with a digital camera in general”, and web document 13 into the “cluster on photomicrography methods with a digital camera in general”.

S2: The rarity calculation unit 3 calculates the rarity of the documents contained in each cluster produced by the document clustering unit 2 and stores it in the document rarity table 5.

The following example applies the calculation of Method 1 to the rarity calculation in S2. Web document 11 appeals to a broad general audience and has many links from other documents, so its importance (100) is higher than that of web document 12 (35). However, because many similar documents exist, the number of documents in the cluster to which web document 11 belongs is large. The rarity calculated for web document 11 is therefore smaller than that of web document 12 (rarity of web document 11 (1) < rarity of web document 12 (50)).

Web document 12, on the other hand, does not appeal to a broad general audience and has fewer links from other documents than web document 11, so its importance (35) is lower than that of web document 11 (100). However, the cluster to which web document 12 belongs contains few similar documents and the page importance there is high, so the rarity calculated for web document 12 is larger than that of web document 11 (rarity of web document 11 (1) < rarity of web document 12 (50)).

The rarity values of web documents 11 to n calculated in this way are stored in the columns of the document rarity table 5 that correspond to the respective documents.

S3: The index unit 6 splits web documents 11, …, n into terms for full-text search and stores them in the document index 7.

In the example of FIG. 1, a word-based index is created. For instance, the document index 7 records that words such as “digital camera” and “shooting method” appear in the document group that includes web documents 11 and 12.

Ｓ４：キーワード一致度計算部８は、情報検索端末１０から検索キーワードの入力を受けると文書インデックス７を参照して当該検索キーワードを含む文書をリストアップし、このリストアップした文書と検索キーワードとの一致度を算出する。 S4: Upon receiving a search keyword from the information search terminal 10, the keyword matching degree calculation unit 8 refers to the document index 7 to list the documents that contain the search keyword, and calculates the degree of coincidence between each listed document and the search keyword.
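Step S4 can be sketched as a lookup in the index followed by a per-document score. The measure used below (occurrence count divided by document length) is only an assumption for illustration; the document refers to Non-patent documents 1 to 3 for actual techniques. The index layout (term to per-document counts), the name `match_degrees`, and the sample data are hypothetical.

```python
def match_degrees(keyword, index, doc_length):
    """Return {doc_id: degree of coincidence} for documents containing keyword.

    index maps a term to {doc_id: occurrence count}; doc_length maps a
    doc_id to its total number of terms. The term-frequency ratio is an
    assumed stand-in for the degree of coincidence.
    """
    postings = index.get(keyword, {})  # documents that contain the keyword
    return {doc: count / doc_length[doc] for doc, count in postings.items()}


# Hypothetical index and document lengths.
index = {"camera": {"doc11": 2, "doc12": 1}, "diary": {"doc13": 1}}
doc_length = {"doc11": 4, "doc12": 3, "doc13": 2}
degrees = match_degrees("camera", index, doc_length)
```

Only `doc11` and `doc12` are listed for the keyword `camera`; `doc13` does not contain it and is left out.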

Ｓ５：総合ランキング計算部９はＳ４で算出された一致度と文書重要度テーブル４及び文書希少度テーブル５に格納された前記リストアップされた文書の希少度及び重要度とに基づきＷＥＢ文書の出力順を決定する。 S5: The comprehensive ranking calculation unit 9 determines the output order of the WEB documents based on the degree of coincidence calculated in S4 and on the importance and rarity of the listed documents stored in the document importance table 4 and the document rarity table 5, respectively.

具体的には前記一致度と文書重要度テーブル４に格納された重要度と文書希少度テーブル５に格納された希少度とをパラメータとする線形的または非線形的な関数値に基づき前記検索キーワードを含んだ文書の出力順位を決定する。出力順位の態様としては、例えば、前記検索キーワードを含んだ文書群の各文書が前記関数値の降順に出力表示させるような形態が挙げられる。このような形態の文書群の出力順位は検索結果として情報検索端末１０にて出力表示される。 Specifically, the output rank of the documents containing the search keyword is determined from the value of a linear or nonlinear function whose parameters are the degree of coincidence, the importance stored in the document importance table 4, and the rarity stored in the document rarity table 5. One possible form of output is to display the documents of the document group containing the search keyword in descending order of the function value. The document group ranked in this way is output and displayed on the information search terminal 10 as the search result.
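As one illustration of the linear case, the function value can be taken as a weighted sum of the three parameters, with documents output in descending order of that value. This is a sketch, not the function the document prescribes: the weights, the name `rank_documents`, and the degree-of-coincidence values are hypothetical, while the importance (100, 35) and rarity (1, 50) values are those of the example for the WEB documents 11 and 12.

```python
def rank_documents(match, importance, rarity, weights=(1.0, 1.0, 1.0)):
    """Return document ids sorted by descending weighted-sum score."""
    w_match, w_importance, w_rarity = weights  # hypothetical weights
    scores = {
        doc: w_match * match[doc]
        + w_importance * importance[doc]
        + w_rarity * rarity[doc]
        for doc in match  # only documents that contain the search keyword
    }
    return sorted(scores, key=scores.get, reverse=True)


match = {"doc11": 1.0, "doc12": 1.0}       # hypothetical degrees of coincidence
importance = {"doc11": 100, "doc12": 35}   # values from the example
rarity = {"doc11": 1, "doc12": 50}         # values from the example

order_equal = rank_documents(match, importance, rarity)
order_rarity_heavy = rank_documents(match, importance, rarity, weights=(1.0, 1.0, 2.0))
```

With equal weights the scores are 102 and 86, so `doc11` stays first; doubling the rarity weight gives 103 and 136, which lifts `doc12` above `doc11`. How strongly rarity should count is a design choice left open by the text.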

尚、Ｓ５では、文書希少度テーブル５を用いているが、当該テーブルを用いることなく、各文書の重要度に対して希少度計算部３が算出した各文書の希少度が加算または積算されたものを格納した文書重要度テーブル４を検索結果の出力順を決定に供してもよい。 Although S5 uses the document rarity table 5, the output order of the search results may instead be determined without that table, using a document importance table 4 that stores, for each document, its importance with the rarity calculated by the rarity calculation unit 3 already added to it or multiplied into it.
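The variant without a separate rarity table can be sketched as a one-time merge of rarity into the importance values. The function name `fold_rarity` and its `mode` parameter are hypothetical; the importance and rarity values are those of the example for the WEB documents 11 and 12.

```python
def fold_rarity(importance, rarity, mode="add"):
    """Return an importance table with rarity merged in ahead of query time."""
    if mode == "add":
        return {doc: importance[doc] + rarity.get(doc, 0) for doc in importance}
    if mode == "multiply":
        return {doc: importance[doc] * rarity.get(doc, 1) for doc in importance}
    raise ValueError("mode must be 'add' or 'multiply'")


importance = {"doc11": 100, "doc12": 35}
rarity = {"doc11": 1, "doc12": 50}
added = fold_rarity(importance, rarity, mode="add")
multiplied = fold_rarity(importance, rarity, mode="multiply")
```

Merging in advance means the ranking step reads a single table per document, at the cost of no longer being able to weight importance and rarity separately at query time.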

［本実施形態の効果］ [Effect of this embodiment]
以上の情報検索装置１によれば、収集した文書群の中で検索キーワードと一致した文書について、分類される文書主題に希少性のある文書の検索順位を引き上げることができる。また、収集された文書はその希少度がユーザの情報検索要求に先立って算出されているので、従来手法の検索キーワードに合致する文書群をクラスタリングして検索結果の多様性を確保する場合に起こる応答時間の遅延が起こらない。さらに、収集された各文書はその主題の希少性に基づくスコアが加算されて並び替えがなされるので、従来からの課題であった収集した文書群の分類によって得られた各クラスタから検索結果として文書を選び出す処理時間も大幅に削減できる。 According to the information search apparatus 1 described above, among the collected documents that match the search keyword, the search rank of documents whose subject is rare can be raised. In addition, because the rarity of each collected document is calculated before the user issues a search request, the response delay that arises in the conventional approach of clustering the documents matching the search keyword to ensure diverse results does not occur. Furthermore, because the collected documents are reordered with a score based on the rarity of their subject added, the processing time for selecting result documents from each cluster obtained by classifying the collected documents, which has been a problem in the past, can also be greatly reduced.

特に、Ｓ２の希少度算出のステップにおいて、（方法２）が適用されることで、検索対象の全文書の出力順位を変動させることなく、文書主題の異なる各クラスタに含まれる文書の中で最も重要度の高い限られた文書のみについてその文書主題の希少性に基づき出力順位を引き上げることができる。 In particular, when (Method 2) is applied in the rarity calculation step S2, the output rank is raised, based on the rarity of the document subject, only for the limited number of documents with the highest importance within each cluster of a different subject, without shifting the output ranks of all the documents to be searched.
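The effect attributed to (Method 2) can be sketched by assigning a rarity only to the top-k documents by importance in each cluster and zero to the rest. The inverse-cluster-size value, the parameter `k`, the name `rarity_top_k`, and the sample data are assumptions for illustration; the exact formula of (Method 2) is not restated in this passage.

```python
from collections import defaultdict


def rarity_top_k(cluster_of, importance, k=1):
    """Give rarity only to the k most important documents of each cluster."""
    members = defaultdict(list)
    for doc, cid in cluster_of.items():
        members[cid].append(doc)

    rarity = {doc: 0.0 for doc in cluster_of}  # default: no rarity bonus
    for docs in members.values():
        top = sorted(docs, key=lambda d: importance[d], reverse=True)[:k]
        for doc in top:
            rarity[doc] = 1.0 / len(docs)  # assumed inverse-size rarity
    return rarity


# Hypothetical clusters and importance values.
cluster_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B"}
importance = {"a1": 90, "a2": 40, "a3": 10, "b1": 35}
rarity = rarity_top_k(cluster_of, importance, k=1)
```

Only `a1` and `b1` receive a non-zero rarity here, so the ranks of the remaining documents are left to the degree of coincidence and importance alone.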

または、Ｓ２のステップにおいて、（方法３）が適用されることで、同程度数の文書をふくむクラスタの中で重要度の高い文書を含むクラスタを重視した文書の出力順位の引き上げが実現する。 Alternatively, when (Method 3) is applied in step S2, the output rank can be raised in a way that favors, among clusters containing a similar number of documents, those clusters that contain documents of high importance.
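One way to obtain the behavior attributed to (Method 3) is to let the importance of a cluster's documents enter the rarity as a parameter. The formula below (mean importance of the cluster divided by its size) is an assumption chosen only to show the effect, since the actual formula of (Method 3) is not restated in this passage; the name `rarity_with_importance` and the data are hypothetical.

```python
from collections import defaultdict


def rarity_with_importance(cluster_of, importance):
    """Rarity = mean importance of the cluster / cluster size (assumed formula)."""
    members = defaultdict(list)
    for doc, cid in cluster_of.items():
        members[cid].append(doc)

    rarity = {}
    for docs in members.values():
        mean_importance = sum(importance[d] for d in docs) / len(docs)
        for doc in docs:
            rarity[doc] = mean_importance / len(docs)
    return rarity


# Two hypothetical clusters of equal size but different importance.
cluster_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
importance = {"a1": 80, "a2": 60, "b1": 20, "b2": 10}
rarity = rarity_with_importance(cluster_of, importance)
```

Both clusters contain two documents, yet the documents of cluster A receive the larger rarity (35.0 versus 7.5) because cluster A holds the more important documents.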

以上説明した本実施形態に係る情報検索装置１はインターネットを介して収集できるＷＥＢ文書の検索技術に係るものであるが、本発明はＷＥＢ文書の検索のみならず予め電子化された社内文書等の検索技術にも適用できる。 The information search apparatus 1 according to the present embodiment described above relates to searching WEB documents that can be collected via the Internet. However, the present invention is applicable not only to WEB document search but also to searching other documents that have already been digitized, such as in-house documents.

［本発明のプログラムとしての態様］ [Aspect as Program of the Present Invention]
本発明は、専用のハードウェアにより実現されるもの以外に、上述の情報検索装置１を構成する機能部２〜９としてコンピュータを機能させる情報検索プログラムの態様とすることもできる。また、このプログラムを格納したコンピュータ読み取り可能な記録媒体も本発明の一態様となる。前記記録媒体としては、フレキシブルディスク、光磁気ディスク、ＣＤ−ＲＯＭ、ＤＶＤ−ＲＯＭ、その他の既知の記録媒体、コンピュータシステムに内蔵されるハードディスクドライブ装置等の記憶装置が例示される。さらに、前記記録媒体としては、インターネットを介してプログラムを送信する場合のように、短時間の間、動的にプログラムを保持するもの（伝送媒体もしくは伝送波）、その場合のサーバとなるコンピュータシステム内部の揮発性メモリのように一定時間プログラムを保持しているものも含まれる。 In addition to being realized by dedicated hardware, the present invention may take the form of an information search program that causes a computer to function as the functional units 2 to 9 constituting the information search apparatus 1 described above. A computer-readable recording medium storing this program is also one aspect of the present invention. Examples of the recording medium include flexible disks, magneto-optical disks, CD-ROMs, DVD-ROMs, other known recording media, and storage devices such as hard disk drives built into computer systems. The recording medium also includes media that hold the program dynamically for a short time (a transmission medium or transmission wave), as when the program is transmitted via the Internet, and media that hold the program for a certain period, such as volatile memory inside the computer system that serves as the server in that case.

１…情報検索装置
２…文書クラスタリング部（分類手段）
３…希少度計算部（希少度計算手段）
８…キーワード一致度計算部（一致度計算手段）
９…総合ランキング計算部（順位決定手段）
DESCRIPTION OF SYMBOLS
1 ... Information search apparatus
2 ... Document clustering part (classification means)
3 ... Rarity calculation part (rarity calculation means)
8 ... Keyword matching degree calculation part (coincidence degree calculation means)
9 ... Comprehensive ranking calculation part (rank determining means)

Claims

An information search device for searching electronic documents, comprising:
classifying means for classifying a collected electronic document group into a plurality of clusters;
rarity calculation means for calculating the rarity of each collected electronic document based on the number of documents constituting the cluster to which that electronic document belongs;
coincidence degree calculation means for calculating a degree of coincidence between an input search keyword and a document group that includes the search keyword and is extracted from the collected electronic document group; and
rank determining means for determining the output rank of each individual document of the document group including the search keyword, based on the calculated degree of coincidence, the rarity, and the importance of that document.

The information search device according to claim 1, wherein the rarity calculation means calculates the rarity only for a specific number of documents having the highest importance in the document group constituting the cluster to which the collected electronic document belongs.

The information search device according to claim 1, wherein the rarity calculation means uses the importance of all documents constituting the cluster as an operation parameter for calculating the rarity.

The information search device according to claim 1, wherein the rarity calculation means uses the importance of a specific number of documents having the highest importance in the document group included in the cluster as an operation parameter for calculating the rarity.

An information search method for searching electronic documents, comprising:
a step in which classifying means classifies a collected electronic document group into a plurality of clusters;
a step in which rarity calculation means calculates the rarity of each collected electronic document based on the number of documents constituting the cluster to which that electronic document belongs;
a step in which coincidence degree calculation means calculates a degree of coincidence between an input search keyword and a document group that includes the search keyword and is extracted from the collected electronic document group; and
a step in which rank determining means determines the output rank of each individual document of the document group including the search keyword, based on the calculated degree of coincidence, the rarity, and the importance of that document.

The information search method according to claim 5, wherein, in the step of calculating the rarity, the rarity is calculated only for a specific number of documents having the highest importance in the document group constituting the cluster to which the collected electronic document belongs.

The information search method according to claim 5, wherein, in the step of calculating the rarity, the importance of all documents constituting the cluster is used as an operation parameter for calculating the rarity.

The information search method according to claim 5, wherein, in the step of calculating the rarity, the importance of a specific number of documents having the highest importance in the document group included in the cluster is used as an operation parameter for calculating the rarity.

An information search program for causing a computer to function as each means constituting the information search device according to any one of claims 1 to 4.