JP5261326B2

JP5261326B2 - Information search device and information search program

Info

Publication number: JP5261326B2
Application number: JP2009197659A
Authority: JP
Inventors: 俊介小長井; 孝史井上; 大和高橋; 良治片岡
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2009-08-28
Filing date: 2009-08-28
Publication date: 2013-08-14
Anticipated expiration: 2029-08-28
Also published as: JP2011048718A

Description

The present invention relates to information search technology, such as Web search engines (so-called search engines) on the Internet.

In recent years, with the spread of the Internet, systems and services that accurately retrieve the information a user needs from the enormous collection of Web documents (electronic documents) on the Internet have become increasingly important.

In a typical search service, the output order of search results is determined from two factors. The first is the match degree between the search keyword and a document, which is based on how many times the user-entered search keyword appears in the target document and in the anchor text of links that point to that document from other documents. The second is the importance of the document, such as how often it is referenced from other documents.

In outline, the document match degree is obtained with methods based on word statistics, such as "BM25" of Non-Patent Document 1 and "tf·idf". These methods rest on the assumption that words appearing in a document more frequently than the average over a particular document collection characterize that document. Documents whose characteristics match the user's search keyword to a high degree are given a high output rank.
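As a rough illustration of the word-statistics approach described above, the following minimal sketch scores documents with a plain tf·idf sum. The toy token lists, the function name `tf_idf_score`, and the particular idf formula (logarithm of collection size over document frequency) are assumptions made for illustration; the patent does not prescribe a formula, and BM25 differs in its details.

```python
import math
from collections import Counter

def tf_idf_score(query_terms, doc_tokens, all_docs):
    """Sum of tf * idf over the query terms for one document (illustrative only)."""
    tf = Counter(doc_tokens)
    n_docs = len(all_docs)
    score = 0.0
    for term in query_terms:
        # Document frequency: number of documents that contain the term.
        df = sum(1 for d in all_docs if term in d)
        if df == 0:
            continue  # a term absent from the collection contributes nothing
        idf = math.log(n_docs / df)  # rarer terms receive a larger idf
        score += tf[term] * idf
    return score

# Hypothetical, already-tokenized toy collection.
docs = [
    ["dog", "onion", "poison", "food", "caution"],
    ["fast", "food", "dog", "food", "health", "caution"],
    ["cat", "food"],
]
scores = [tf_idf_score(["dog", "onion"], d, docs) for d in docs]
```

In this toy collection the rare term "onion" separates the first document from the others, whereas a term such as "food", which occurs in every document, would receive an idf of zero under this formula and could not separate them at all.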

With this approach, good search results are obtained when the search keyword is a relatively rare word, but when the search keyword is a very common word, too many documents end up with about the same match degree. For this reason, typical information search services calculate the importance of each document in order to rank documents whose match degree with the search keyword is about the same, and they determine the output order of the search results from both the keyword match degree and the document importance.

Methods such as PageRank (see Non-Patent Document 2) and HITS (see Non-Patent Document 3) are commonly used to obtain document importance. These methods use the link information of Web pages and rest on the assumption that a document linked from many other documents is likely to be important. Using this static document importance together with the keyword match degree has largely satisfied the needs of search users.

Non-Patent Document 1: Stephen Robertson, Hugo Zaragoza, Michael Taylor, "Simple BM25 Extension to Multiple Weighted Fields", Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, 2004
Non-Patent Document 2: Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd, "The PageRank Citation Ranking: Bringing Order to the Web", 7th International World Wide Web Conference (WWW98), January 29, 1998
Non-Patent Document 3: Jon M. Kleinberg, "Authoritative Sources in a Hyperlinked Environment", Journal of the ACM (JACM), vol. 46, no. 5, September 1999, pp. 604-632

However, when Web documents that match the search keyword to about the same degree are ordered by static importance, the importance of individual words and phrases within a document may not be properly reflected in the search results.

For example, suppose that a Web document giving good information about "how to keep a dog" contains the important statement "Onions are poisonous to dogs, so be careful not to put them in their food." Suppose also that another Web document, written about human health, contains the statement "Fast food is the same as dog food; be careful not to eat too much, for the sake of your health!" Assume further that this other Web document is linked from outside more often than the first one and therefore has a higher static importance, such as PageRank.

In this situation, suppose a user who wants to learn about precautions concerning dog food searches with the keywords "dog food caution" (犬 エサ 注意). If the two Web documents have the same document length and the same keyword frequencies apart from the passages above, the latter page, which has the higher static importance, is output at the top of the results.

The present invention has been made to solve the problem described above, and its object is to provide an information search technique that properly reflects the importance of words and phrases within a document in the search results.

To this end, the present invention assigns a weight to each word or phrase part of an electronic document based on citation information, and reflects that weight in the match degree with the search keyword. The weighting makes use of the number of times a part is cited, the relationship between the citing document and the cited document, the importance of the citing document, and the like.

Specifically, the present invention is a device that, when searching a group of electronic documents using a search keyword specified from a user terminal, calculates the match degree with the search keyword by referring to a document index in which each electronic document is divided into arbitrary units. The device comprises: citation information extraction means for extracting, in advance, citation information between the electronic documents to be searched; and index means for assigning a weight to each division unit of the document index of each electronic document based on the citation information extracted by the citation information extraction means, wherein, in assigning the weight, the number of times each division unit within the cited portion of the link-destination electronic document indicated in the citation information is cited is reflected in the weight of that link-destination electronic document. The device is characterized in that the match degree with the search keyword is calculated taking into account the per-division-unit weights assigned by the index means.

The device may further comprise inter-document relationship determination means for determining the depth of the relationship between electronic documents based on the citation information extracted by the citation information extraction means, and the index means may reflect the depth determined by the inter-document relationship determination means in the weight of the electronic document from which the quotation is taken (the cited document).

The present invention may also take the form of a program that causes a computer to function as each of the means constituting the device, or of a recording medium on which the program is recorded.

According to the present invention, the parts of an electronic document that are quoted by other electronic documents are regarded as important, so that the importance of the words and phrases within the document is properly reflected in the search results.

FIG. 1 is a configuration diagram of an information search device according to a first embodiment of the present invention.
FIG. 2 is a configuration diagram of an information search device according to a second embodiment of the present invention.

An information search device according to embodiments of the present invention is described below. This information search device searches document information in which electronic documents are connected by link information. When a link-destination document is searched, the device uses as reference information both the information of the cited portion of that link-destination document, as contained in the documents that link to it, and the information of the link-destination document itself. In calculating the match degree between the link-destination document and the search keyword, the device varies the weighting ratio between the match degree of the cited portion and the match degree of the parts of the link-destination document other than the cited portion.

When the link-destination document is cited one or more times by one or more other documents, the weight of each part of the link-destination document may be varied according to the number of times that part is cited, and the weight of a cited portion may be varied according to the importance of the link-source page. Alternatively, the weight of the cited portion may be varied according to the relationship between the documents that form the citation relationship.

<< First Embodiment >>
Details of the information search device are described with reference to FIG. 1. Here, the information search device 1 is connected to a user-owned information search terminal 2 via the Internet so that data can be sent and received.

The information search terminal 2 is a personal computer (PC). It sends a search request entered by the user (including a search keyword) to the information search device 1 and receives the search results for that request from the information search device 1. The information search terminal 2 is not limited to a personal computer and may be, for example, a mobile terminal such as a mobile phone or a PDA.

The information search device 1 performs pre-processing, in which a document index for full-text search is created and stored in advance, and search processing, in which the document index is searched in response to a search request from the information search terminal 2 and search results are produced. That is, the information search device 1 constitutes a Web search engine system that searches a group of electronic documents typified by Web pages (hereinafter, Web documents), and it has the hardware resources of an ordinary computer, such as a CPU, memory (RAM), a hard disk drive, and a communication interface.

Through the cooperation of these hardware resources with software resources (for example, an OS and applications), the information search device 1 implements the following:
・a citation information extraction function unit 3 that extracts citation information from the group of Web documents to be searched;
・an index function unit 4 that creates the document index from the Web documents to be searched and the citation information extracted by the citation information extraction function unit 3;
・a document index DB 5 that stores the document index created by the index function unit 4;
・a document importance table 6 that records the importance of each Web document to be searched;
・a keyword match degree calculation unit 7 that, based on a search request sent from the information search terminal 2, refers to the document index DB 5 and calculates the match degree between the search keyword and each Web document to be searched;
・a comprehensive ranking calculation unit 8 that combines the match degree calculated by the calculation unit 7 with the information recorded in the document importance table 6 and determines the output order of the search results returned to the information search terminal 2.

Here, the pre-processing is performed by the function units 3 and 4, and the search processing is performed by the calculation units 7 and 8. Data is exchanged with the information search terminal 2 through the communication interface, and the document index DB 5 and the document importance table 6 are built on the hard disk drive.

The importance of each Web document recorded in the document importance table 6 can be calculated with methods such as those of Non-Patent Documents 2 and 3. The processing performed by the units 3, 4, 7, and 8 is described below using an example in which Web documents 11 and 12 are the search targets.

(1) Citation information extraction function unit 3
The citation information extraction function unit 3 extracts citation information from the Web documents 11 and 12 to be searched. As an example, assume that Web document 11 quotes the phrase "Onions are poisonous to dogs, so be careful not to put them in their food." from Web document 12.

As one example of how citation information can be extracted, suppose that Web document 11 contains a link to Web document 12 written in HTML. The text near that link can be compared with Web document 12, and any matching part judged to be a citation. To allow for abridgement and for typographical errors or omissions in the quoted text, a part that matches above a certain ratio may be judged to be a citation. Based on the HTML structure, parts marked up with tags such as <blockquote>, or parts enclosed in symbols commonly used to indicate quotation, such as 『』, 「」, and “”, may also be judged to be citations. When two documents link to each other, the same markup-based or symbol-based judgment may be used to decide which document is quoting which.

Here, the extracted citation information takes the form: URL of the citing document, URL of the cited document, citation start position, citation end position. In this example, "URL of Web document 11, URL of Web document 12, 1st character, 30th character" is extracted as the citation information. The citation information may take other forms, and after extraction it is transferred to the index function unit 4.
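The markup-based extraction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the device's actual procedure: it only handles <blockquote> bodies that occur verbatim in the cited text (the tolerant "matches above a certain ratio" variant is omitted), the function name `extract_citations` is hypothetical, and the example.com URLs and sample texts are placeholders.

```python
import re
from typing import List, Tuple

# (citing URL, cited URL, start position, end position); positions are
# 1-based and inclusive, following the "1st character, 30th character" form.
Citation = Tuple[str, str, int, int]

def extract_citations(citing_url: str, citing_html: str,
                      cited_url: str, cited_text: str) -> List[Citation]:
    """Treat each <blockquote> body found verbatim in the cited text as a citation."""
    citations = []
    pattern = r"<blockquote>(.*?)</blockquote>"
    for m in re.finditer(pattern, citing_html, re.DOTALL | re.IGNORECASE):
        quoted = m.group(1).strip()
        if not quoted:
            continue
        pos = cited_text.find(quoted)  # exact match only in this sketch
        if pos >= 0:
            citations.append((citing_url, cited_url, pos + 1, pos + len(quoted)))
    return citations

# Hypothetical sample data.
cited = "Onions are poisonous to dogs. Keep them out of dog food."
citing = ('<p>See <a href="http://example.com/doc12">this page</a>:</p>'
          '<blockquote>Onions are poisonous to dogs.</blockquote>')
result = extract_citations("http://example.com/doc11", citing,
                           "http://example.com/doc12", cited)
```

Each returned tuple has the same four fields as the citation information form given above, so it could be handed to an indexing step unchanged.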

(2) Index function unit 4
The index function unit 4 divides each Web document into units for full-text search, such as words, n-grams, or suffix arrays, creates a document index, and stores it in the document index DB 5. The document index may be in any format, not only those named above.
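To make the idea of dividing documents into full-text search units concrete, the following minimal sketch builds a character n-gram index that records the positions of each unit. The function names `char_ngrams` and `build_index`, the choice of bigrams, and the toy documents are assumptions for illustration only; word-based or suffix-array indexes would be structured differently.

```python
def char_ngrams(text, n=2):
    """Return (position, n-gram) pairs; positions are 0-based character offsets."""
    return [(i, text[i:i + n]) for i in range(len(text) - n + 1)]

def build_index(docs, n=2):
    """Map each n-gram to {document id: [positions where it occurs]}."""
    index = {}
    for doc_id, text in docs.items():
        for pos, gram in char_ngrams(text, n):
            index.setdefault(gram, {}).setdefault(doc_id, []).append(pos)
    return index

# Hypothetical toy documents keyed by document number.
index = build_index({11: "dog food", 12: "hot dog"})
```

Keeping positions per unit is what later allows a citation's start and end positions to be matched against the units that fall inside the quoted span.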

Here, as an example of a word-based document index, a document index is created for the word "dog", which appears in a document group that includes Web documents 11 and 12. This document index may also include information found in an ordinary full-text search index, such as "tf" and "idf" values, HTML markup information for words, and word position information.

When creating the document index, the index function unit 4 also reflects the citation information transferred from the citation information extraction function unit 3. Taking the document index DB 5 in FIG. 1 as an example, the information (11:1) (12:9) is stored for the word "dog" in the document index.

That is, the citation information is added into the word weight and stored in the document index, and each pair in parentheses denotes (number of the Web document in which the word appears : weight of the word). In this stored example, the word "dog" appears once in both Web document 11 and Web document 12, but the occurrence in Web document 12 also lies in a part that is quoted by another Web document (here, Web document 11), so it is given a larger weight than the occurrence in Web document 11. The word weight stored in the document index is not a simple two-valued weight that indicates whether or not the word lies in a cited portion. Instead, it reflects, linearly or nonlinearly, the number of other Web documents that quote the part containing the word.

When reflecting citation information between Web documents in the document index, the index function unit 4 reads the importance of the citing Web document from the document importance table 6 and uses that value as well in building the document index. Here the weight is calculated as: word weight = number of occurrences "n" of the word in the Web document + (document importance "g" of the other, citing Web document × number of occurrences "k" of the word in the quoted part). The product of the document importance "g" and the occurrence count "k" is calculated for each Web document that quotes a part containing the word, and each product is added to the occurrence count "n".

Consider the stored example above, in which the information (11:1) (12:9) is stored for the word "dog". The word "dog" appears once in Web document 11 and is not quoted by any other Web document, so its word weight is calculated as "1", and the pair (11:1) is stored in the document index as a (Web document : weight) pair. In Web document 12, on the other hand, the word "dog" appears once, and it also appears once in the part quoted by Web document 11, whose document importance is "8". The weight of the word "dog" is therefore calculated as "1 + (8.0 × 1) = 9", and the pair (12:9) is stored in the document index.
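The weight calculation above can be checked with a short sketch. The function name `word_weight` and the dictionary layout of `index_entry` are assumptions for illustration; only the formula n + Σ(g × k) and the values for Web documents 11 and 12 come from the description.

```python
from typing import Iterable, Tuple

def word_weight(n: int, citing_docs: Iterable[Tuple[float, int]]) -> float:
    """Return n + sum(g * k).

    n: occurrences of the word in the document itself.
    citing_docs: one (g, k) pair per citing document, where g is that
    document's importance and k is the number of occurrences of the word
    in the part it quotes.
    """
    return n + sum(g * k for g, k in citing_docs)

weight_doc11 = word_weight(1, [])          # "dog" appears once, never quoted
weight_doc12 = word_weight(1, [(8.0, 1)])  # quoted once by document 11 (importance 8)

# Index entry for the word "dog" as (document number, weight) pairs.
index_entry = {"dog": [(11, weight_doc11), (12, weight_doc12)]}
```

With these inputs the two weights come out as 1 and 9, matching the (11:1) (12:9) entry in the description. A second citing document would simply add another (g, k) pair to the list.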

However, the document index stored in the document index DB 5 is not limited to the data format of FIG. 1. For example, the citation information may be stored in a form such as (number of the Web document in which the word appears : word weight : cited weight of the word). In addition, because a weight is stored for each division unit used for full-text search, that is, for each document index entry, the unit may also be an n-gram or a suffix-array character string (phrase).

(3) Calculation units 7 and 8
After receiving a search request specifying a search keyword from the information search terminal 2, the keyword match degree calculation unit 7 refers to the document index DB 5 using the search keyword and lists the Web documents that contain the search keyword. The match degree between each listed Web document and the search keyword is calculated with a method such as BM25 or BM25F of Non-Patent Document 1, or tf·idf. In this calculation, the weight stored in the document index is added as a variable to obtain the keyword match degree.

As for the calculation method, the weight assigned to each word may be multiplied into a formula such as tf·idf. In the case of BM25F, as shown in FIG. 1, the multiplier applied to the tf value of words contained in a cited portion may instead be varied according to the weight at the time the document index in the document index DB 5 is created. The match degree with the search keyword calculated here is transferred to the comprehensive ranking calculation unit 8.
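One way to read "multiplying the weight into a formula such as tf·idf" is sketched below. This is an assumption-laden illustration: the postings layout, the collection size `N_DOCS`, the idf formula, and the choice to multiply weight, tf, and idf together are all hypothetical, since the description leaves the exact formula open.

```python
import math

# Hypothetical postings: term -> {document number: (tf, citation-based weight)}.
index = {
    "dog": {11: (1, 1.0), 12: (1, 9.0)},
    "food": {11: (2, 2.0), 12: (1, 9.0)},
}
N_DOCS = 10  # assumed size of the whole collection

def weighted_match(query_terms, doc_id):
    """Sum over query terms of weight * tf * idf for one document."""
    score = 0.0
    for term in query_terms:
        postings = index.get(term, {})
        if doc_id not in postings:
            continue  # the document does not contain this term
        tf, weight = postings[doc_id]
        idf = math.log(N_DOCS / len(postings))
        score += weight * tf * idf
    return score

match_11 = weighted_match(["dog", "food"], 11)
match_12 = weighted_match(["dog", "food"], 12)
```

Under these assumed values, document 12 scores higher than document 11 even though document 11 contains the query terms more often, because its occurrences lie in a quoted part and carry the larger weight.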

The comprehensive ranking calculation unit 8 refers to the document importance table 6 and obtains the importance of each of the listed Web documents. It combines the importance of each Web document with the keyword match degree transferred from the keyword match degree calculation unit 7 and determines the output order of the search results. The search results are sent to the information search terminal 2 in the output order determined here.
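The description does not state how the importance and the match degree are combined. Purely as an illustration, the sketch below assumes a linear blend controlled by a hypothetical parameter `alpha`; the function name `rank_results` and all numeric values are likewise assumptions.

```python
def rank_results(match_scores, importance, alpha=0.5):
    """Order document numbers by alpha * match + (1 - alpha) * importance, highest first."""
    combined = {
        doc_id: alpha * match + (1 - alpha) * importance.get(doc_id, 0.0)
        for doc_id, match in match_scores.items()
    }
    return sorted(combined, key=combined.get, reverse=True)

importance = {11: 8.0, 12: 3.0}  # hypothetical values from the importance table

# Match degrees that already include the citation-based weights (hypothetical).
order_weighted = rank_results({11: 8.0, 12: 29.0}, importance)

# Equal match degrees: the static importance alone decides the order.
order_tied = rank_results({11: 5.0, 12: 5.0}, importance)
```

The second call shows the situation the invention targets: when match degrees tie, the document with the higher static importance wins, whereas a citation-weighted match degree can change the outcome.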

Because the keyword match degree is calculated from a document index that takes into account the weights of words and phrases quoted by other Web documents, the quoted words and phrases are treated as important parts of the Web document. As a result, the importance of words and phrases within a Web document is properly reflected in the search results.

In particular, because the importance of the citing Web document is used in calculating the weight when the document index is created, the match degree between a Web document and the search keyword is calculated with greater emphasis on parts quoted by Web documents of high importance, as measured by methods such as PageRank or HITS, than on parts quoted by Web documents of little importance. Furthermore, because the frequency with which a part is quoted by other Web documents is used in calculating the weight, words and phrases quoted by many documents can be reflected in the search results as more important parts.

<< Second Embodiment >>
As shown in FIG. 2, the information search device 1 may additionally include an inter-document relationship determination function unit 9 that determines the relationship (link) between the Web documents to be searched, based on the citation information extracted by the citation information extraction function unit 3.

One method of determining the relationship (link) between documents makes use of the structure of the WWW (World Wide Web): the URLs of the Web documents are compared, and the relationship is regarded as progressively deeper when the documents belong to the same domain, to the same host, and to the same directory, in that order. As a concrete example, assume that the following four items of citation information are transferred from the citation information extraction function unit 3.

・First citation information
(http://hoge.hoge.com/hoge/hoge.html, http://fuga.fuga.com/fuga/fuga.html, citation start position, citation end position)
・Second citation information
(http://hoge.hoge.com/hoge/hoge.html, http://fuga.hoge.com/fuga/fuga.html, citation start position, citation end position)
・Third citation information
(http://hoge.hoge.com/hoge/hoge.html, http://hoge.hoge.com/fuga/fuga.html, citation start position, citation end position)
・Fourth citation information
(http://hoge.hoge.com/hoge/hoge.html, http://hoge.hoge.com/hoge/fuga.html, citation start position, citation end position)
Here, the relationship (depth) between the citing Web document and the cited Web document can be reflected as a weight. For example, the first citation is between entirely different domains, so its weight is set to "1". The second is a citation from a different host in the same domain, so its weight is set to "0.5". The third is a citation between different directories on the same host, so its weight is set to "0.1". The fourth is a citation within the same directory on the same host, so its weight can be set to "0.05".
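The URL comparison above can be sketched with the standard library. The four weight values are the ones given in the example; the function name `relation_weight` and, in particular, the rule that the "domain" is the last two labels of the host name are simplifying assumptions that would not hold for every real-world domain.

```python
import posixpath
from urllib.parse import urlsplit

def relation_weight(citing_url: str, cited_url: str) -> float:
    """Return a weight that shrinks as the two URLs become more closely related."""
    a, b = urlsplit(citing_url), urlsplit(cited_url)
    host_a, host_b = a.hostname or "", b.hostname or ""
    # Assumption: the domain is the last two labels of the host name.
    domain_a = ".".join(host_a.split(".")[-2:])
    domain_b = ".".join(host_b.split(".")[-2:])
    if domain_a != domain_b:
        return 1.0   # entirely different domains
    if host_a != host_b:
        return 0.5   # same domain, different host
    if posixpath.dirname(a.path) != posixpath.dirname(b.path):
        return 0.1   # same host, different directory
    return 0.05      # same host, same directory

source = "http://hoge.hoge.com/hoge/hoge.html"
weights = [
    relation_weight(source, "http://fuga.fuga.com/fuga/fuga.html"),
    relation_weight(source, "http://fuga.hoge.com/fuga/fuga.html"),
    relation_weight(source, "http://hoge.hoge.com/fuga/fuga.html"),
    relation_weight(source, "http://hoge.hoge.com/hoge/fuga.html"),
]
```

Applied to the four example citations in order, this yields the weights 1, 0.5, 0.1, and 0.05.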

Instead of uniformly fixing the weight of a citation between different domains at "1", the weight of a citation within the same domain at "0.5", and so on, the citation weight may be calculated according to, for example, the amount or proportion of links and citations between particular domains or hosts. The weight value determined here for the relationship (link) between Web documents is transferred to the index function unit 4.

When reflecting citation information between Web documents in the document index, the index function unit 4 also uses the weight value transferred from the inter-document relationship determination function unit 9 to build the document index. As an example of how the weight value is used, recall that in the first embodiment the word "dog" appears once in Web document 12 and once in the part quoted by Web document 11, whose document importance is "8", so that the weight of the word "dog" is calculated as "1 + (8.0 × 1) = 9" and (12:9) is stored in the document index. In the second embodiment, the weight value w of the relationship (link) between Web documents 11 and 12 is applied as well, and "1 + (w × 8.0 × 1)" is stored in the document index as the weight of the word "dog" (this equals 9 when w is 1). The way the weight value w is used is not limited to this, and other methods may be used.
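The effect of the relationship weight w on the word weight can be sketched as follows. The function name `word_weight_with_relation` is hypothetical; the formula 1 + (w × g × k) and the candidate values of w are taken from the description, with g = 8.0 and k = 1 as in the "dog" example.

```python
def word_weight_with_relation(n, citations):
    """Return n + sum(w * g * k) over the citing documents.

    n: occurrences of the word in the document itself.
    citations: one (w, g, k) triple per citing document, where w is the
    relationship weight, g the citing document's importance, and k the
    number of occurrences of the word in the quoted part.
    """
    return n + sum(w * g * k for w, g, k in citations)

# Weight of "dog" in Web document 12 for each example relationship weight.
weights_by_w = {
    w: word_weight_with_relation(1, [(w, 8.0, 1)])
    for w in (1.0, 0.5, 0.1, 0.05)
}
```

The closer the relationship between the two documents, the smaller w becomes and the less the quotation raises the word weight above its plain occurrence count.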

Because the weight value w of the relationship (link) between Web documents is reflected in the document index in this way, a cited portion is not emphasized when the citing Web document and the cited Web document reside on the same Web server or belong to the same domain. Instead, cited portions that are quoted from another domain are emphasized when the match degree between a Web document and the search keyword is calculated. In this sense, the true importance of the words and phrases within a Web document is reflected in the search results.

≪Others, Programs, etc.≫
The present invention is not limited to the embodiments described above. For example, the word weight may be based only on the number of other Web documents that cite the portion containing the word, without reflecting the importance of the Web documents or the relationship (link) between them. In this case, the word weight is calculated as "number of occurrences of the word in the Web document × number of other Web documents citing the portion containing the word". Similarly, the word weight may be based on the number of other Web documents citing the portion containing the word together with the relationship (link) between the Web documents.
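The two variants can be sketched as follows. The first function follows the formula stated above; the second is only one possible reading of the link-based variant, for which the text gives no formula, and it assumes that each citing document contributes its link weight w instead of a count of 1. Both function names are hypothetical.

```python
def count_only_weight(occurrences, citing_docs):
    """Occurrences in the document x number of other citing documents."""
    return occurrences * citing_docs


def link_weighted_count(occurrences, link_weights):
    """Assumed variant: each citing document counts as its link weight w."""
    return occurrences * sum(link_weights)


print(count_only_weight(2, 3))                # 6
print(link_weighted_count(2, [1.0, 0.5]))     # 3.0
```

With every link weight equal to 1.0, the second function reduces to the first.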

The present invention can also be configured as a program that causes a computer to function as some or all of the units 3 to 9 constituting the information search device. With this program, the computer can be made to execute all or part of the processing of the units 3 to 9.

This program can be provided over a network, for example via a Web site or e-mail. The program can also be recorded on a recording medium such as a CD-ROM, DVD-ROM, CD-R, CD-RW, DVD-R, DVD-RW, MO, HDD, or Blu-ray Disc (registered trademark) for storage and distribution. The recording medium is read using a recording medium drive, and the program code itself realizes the processing of the embodiments described above, so the recording medium also constitutes the present invention.

DESCRIPTION OF SYMBOLS
1 … Information search device
2 … Information search terminal (user terminal)
3 … Citation information extraction function unit (citation information extraction means)
4 … Index function unit (index means)
5 … Document index DB
6 … Document importance table
7 … Keyword matching degree calculation unit
8 … Total ranking calculation unit
9 … Inter-document relationship determination function unit (inter-document relationship determination means)
11, 12 … Web documents

Claims

An information search device that, when searching a group of electronic documents using a search keyword specified from a user terminal, calculates a degree of matching with the search keyword by referring to a document index obtained by dividing each electronic document into arbitrary units, the device comprising:
citation information extraction means for extracting, in advance, citation information between the electronic documents to be searched; and
index means for assigning a weight to each division unit of the document index of each electronic document based on the citation information extracted by the citation information extraction means, and, in assigning the weight, reflecting in the weight of the linked electronic document the number of citations of each division unit in the cited portion of the linked electronic document indicated in the citation information,
wherein the degree of matching with the search keyword is calculated taking into account the weight assigned to each division unit by the index means.

The information search device according to claim 1, wherein the index means refers to a document importance table in which an importance is recorded for each electronic document, acquires the importance of the citation-destination electronic document indicated in the citation information, and reflects that importance in the weight of the citation-source electronic document.

The information search device according to claim 1 or 2, further comprising inter-document relationship determination means for determining the depth of the relationship between electronic documents based on the citation information extracted by the citation information extraction means,
wherein the index means reflects the depth determined by the inter-document relationship determination means in the weight of the citation-source electronic document.

An information search program that causes a computer to function as each of the means constituting the information search device according to any one of claims 1 to 3.