JP5389538B2

JP5389538B2 - Search result ranking method and apparatus, program, and computer-readable recording medium

Info

Publication number: JP5389538B2
Application number: JP2009136235A
Authority: JP
Inventors: 浩之戸田; 良彦数原; 幸生植松; 由美子松浦; 良治片岡
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2009-06-05
Filing date: 2009-06-05
Publication date: 2014-01-15
Anticipated expiration: 2029-06-05
Also published as: JP2010282480A

Description

The present invention relates to a search result ranking method, apparatus, program, and computer-readable recording medium. More particularly, it relates to a search result ranking method, apparatus, program, and computer-readable recording medium for ranking search results in a search over a set of texts that exist inside a computer or are accessible via a computer network, where the search query consists of multiple keywords and the texts are structured, as in HTML or XML.

Conventionally, as a search result ranking method for the case where multiple keywords are specified as a search query, there is a method that preferentially presents texts in which those keywords appear close to one another (hereinafter referred to as the "first method") (see, for example, Non-Patent Document 1).

The first method is based on the idea that a text in which the keywords of a search query appear close together is likely to be more relevant to that query. An example is given below.

When searching with the query "Yokohama ramen", the searcher is presumably looking for texts that mention ramen in Yokohama. If keyword proximity is not taken into account, however, a document that mentions "Yokohama" and "ramen" separately but not ramen in Yokohama, such as a text that covers both ramen in Tokyo and shumai in Yokohama, may also be presented near the top of the search results.

In such a case, taking the proximity of the keywords within the text into account makes it more likely that results matching the searcher's intent can be presented.
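As a rough illustration of the first method, the sketch below measures the smallest token distance between two keywords in flat text. Whitespace tokenization, lowercasing, and the names `positions` and `min_pair_distance` are assumptions made for this example; they are not taken from Non-Patent Document 1.

```python
from itertools import product
from typing import List, Optional


def positions(tokens: List[str], keyword: str) -> List[int]:
    """Return every token index at which the keyword occurs."""
    return [i for i, tok in enumerate(tokens) if tok == keyword]


def min_pair_distance(text: str, kw_a: str, kw_b: str) -> Optional[int]:
    """Smallest token distance between any occurrences of the two keywords.

    Returns None when either keyword is missing from the text.
    """
    tokens = text.lower().split()
    pos_a = positions(tokens, kw_a.lower())
    pos_b = positions(tokens, kw_b.lower())
    if not pos_a or not pos_b:
        return None
    return min(abs(a - b) for a, b in product(pos_a, pos_b))


# A text about ramen in Yokohama versus one about Tokyo ramen and Yokohama shumai.
near = "yokohama ramen shops are popular"
far = "tokyo ramen is famous and yokohama shumai is famous"
print(min_pair_distance(near, "Yokohama", "Ramen"))
print(min_pair_distance(far, "Yokohama", "Ramen"))
```

A ranking function could then favor texts with a smaller distance, which is the behavior the first method aims for.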

In the field of XML-IR, a method has also been proposed (hereinafter referred to as the "second method") that, when calculating the distance between two keywords in tagged text such as HTML and a tag lies between them, evaluates the keyword distance from the simple distance plus a virtual distance assigned to the tag, and ranks the search results based on that distance (see, for example, Non-Patent Document 2). For a decorative tag such as the HTML <font> tag, the virtual distance is set to 0 or a sufficiently small value; for a tag such as <h1> or <div>, which defines the structure of the text and makes region boundaries explicit, the distance is set to a larger value.

This makes it possible to obtain search results that are highly relevant to the keywords in the search query while taking the structure of the document into account.
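The idea behind the second method can be sketched as follows. The tag costs in `TAG_COST`, the default cost, the regular-expression tokenizer, and the choice to charge every opening or closing tag are all assumptions made for illustration; Non-Patent Document 2 may define these differently.

```python
import re
from typing import Dict, List

# Hypothetical virtual distances: decorative tags cost nothing,
# structural tags push keywords apart.
TAG_COST: Dict[str, int] = {"font": 0, "b": 0, "h1": 10, "div": 10}
DEFAULT_TAG_COST = 5


def tokenize(markup: str) -> List[str]:
    """Split markup into attribute-free tags (e.g. '<h1>', '</h1>') and words."""
    return re.findall(r"</?\w+>|[^\s<>]+", markup)


def tag_aware_distance(markup: str, kw_a: str, kw_b: str) -> int:
    """Distance between the first occurrences of two keywords.

    Each word step counts 1; each tag crossed adds its virtual distance.
    Raises ValueError if either keyword is absent.
    """
    tokens = tokenize(markup)
    i, j = sorted((tokens.index(kw_a), tokens.index(kw_b)))
    distance = 0
    for tok in tokens[i + 1 : j + 1]:
        if tok.startswith("<"):
            name = tok.strip("</>")
            distance += TAG_COST.get(name, DEFAULT_TAG_COST)
        else:
            distance += 1
    return distance


print(tag_aware_distance("<font>Yokohama</font> ramen", "Yokohama", "ramen"))
print(tag_aware_distance("<h1>Yokohama</h1> ramen", "Yokohama", "ramen"))
```

With these assumed costs, a decorative tag leaves the keywords adjacent, while a heading boundary moves them well apart.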

Non-Patent Document 1: Tao, T. and Zhai, C.: An exploration of proximity measures in information retrieval, in SIGIR '07, pp. 295-302 (2007).
Non-Patent Document 2: Broschart, A. and Schenkel, R.: Proximity-aware scoring for XML retrieval, in SIGIR '08, pp. 845-846 (2008).

The first method has the problem that proximity cannot be evaluated accurately when the text is structured.

For example, consider the following two structured texts, which differ only in the order of part of their content.

・Text 1
<doc>
<title>Ramen</title>
<body>
○ Kanagawa
Iekei ramen is famous in Yokohama. It traces its origins to the ○○-ya shop near Yokohama Station and is found mainly around Yokohama City…
○ Tokyo
Characterized by a soy-sauce-based soup made with Japanese-style dashi and medium-thin curly noodles…
</body>
</doc>
・Text 2
<doc>
<title>Ramen</title>
<body>
○ Tokyo
Characterized by a soy-sauce-based soup made with Japanese-style dashi and medium-thin curly noodles…
○ Kanagawa
Iekei ramen is famous in Yokohama. It traces its origins to the ○○-ya shop near Yokohama Station and is found mainly around Yokohama City…
</body>
</doc>

Because the first method does not consider the structure of the text (the tag information), tags are treated as whitespace or as nonexistent. When a search is made with the query "Yokohama ramen", the distance between the keywords ("Yokohama" and "ramen") in Text 1 is judged to be smaller than that in Text 2, and Text 1 is presented preferentially.

Considering the content of the texts, however, there should be no large difference in priority between them.

The second method, on the other hand, can evaluate proximity with the structure of the text in mind, but it suffers from the same problem.

For the two texts above, whatever distances are assigned to the individual tags, the keyword distance in Text 1 is smaller than that in Text 2. Assigning a large distance to the </title> and <body> tags can bring the keyword distances in the two texts arbitrarily close to each other, but doing so overestimates the distance between the keywords (between "ramen" and "Yokohama" in this example). This runs counter to the aim of correctly evaluating the relationship between keywords that appear in related structures, and may lower search accuracy.

The second method fails to work well despite being structure-aware, presumably because it is still fundamentally based on distance within the text. In structured text, content can be related even when the simple distance is large, as with the title and body of Texts 1 and 2 above, and in such cases the second method may be unable to rank the search results appropriately.

As shown above, the conventional methods cannot evaluate the semantic proximity between keywords in structured text in a way that matches the meaning of the text.

The present invention has been made in view of the above points. Its object is to provide a search result ranking method, apparatus, program, and computer-readable recording medium that, in a search over structured text, can regard the keywords specified in a search query as close to each other when they are considered semantically close based on the structure of the text, even if they are far apart in the text, and can present such documents near the top of the search results.

FIG. 1 is a diagram for explaining the principle of the present invention.

The present invention (claim 1) is a search result ranking method in an apparatus that searches a set of texts existing inside a computer or accessible via a computer network for texts satisfying a search query, the search query consisting of a group of keywords that specify the content of the texts. The method performs:
a search step (step 1) of searching, based on the input search query, a text database (DB) that stores the set of texts to be searched, namely those texts existing inside a computer or accessible via a computer network that have a structure such as a title, body, chapters, and sections, and obtaining a search result set consisting of search result text IDs;
a keyword appearance structure specifying step (step 2) of referring to the text DB based on the search result text IDs and the search query, and specifying in which structure within the text each keyword in the search query appears;
a proximity evaluation step (step 3) of, based on the keyword appearance structures obtained in the keyword appearance structure specifying step, using as the distance information for a keyword pair a value that depends on the difference between the positions of the two keywords when the pair appears in the same structure, or, when the pair appears in different structures, referring to predetermined inter-structure distances and using the distance between the structures in which the pair appears, and evaluating the proximity of the keyword pair from the weight of each keyword and the distance information for the pair;
a content condition relevance evaluation step (step 4) of referring to the text DB and evaluating the relevance of a text to the search query based on the number of occurrences of the keywords in the search query across the entire set of texts and the number of occurrences within the text; and
a search result ranking step (step 5) of ranking the texts based on the proximity evaluation result for the keyword pairs from the proximity evaluation step and the text relevance evaluation result from the content condition relevance evaluation step.

FIG. 2 is a diagram showing the basic configuration of the present invention.

The present invention (claim 2) is a search result ranking apparatus that searches a set of texts existing inside a computer or accessible via a computer network for texts satisfying a search query, the search query consisting of a group of keywords that specify the content of the texts. The apparatus has:
a text database (DB) 107 that stores the set of texts to be searched, namely those texts existing inside a computer or accessible via a computer network that have a structure such as a title, body, chapters, and sections;
a structure relation DB 108 that stores distance information between keywords for each structural relationship within a text in which a keyword pair appears;
a search unit 101 that searches the text DB 107 based on the input search query and obtains a search result set consisting of search result text IDs;
a keyword appearance structure specifying unit 105 that refers to the text DB 107 based on the search result text IDs and the search query, and specifies in which structure within the text each keyword in the search query appears;
a proximity evaluation unit 104 that, based on the keyword appearance structures obtained by the keyword appearance structure specifying unit 105, uses as the distance information for a keyword pair a value that depends on the difference between the positions of the two keywords when the pair appears in the same structure, or, when the pair appears in different structures, refers to predetermined inter-structure distances and uses the distance between the structures in which the pair appears, and evaluates the proximity of the keyword pair from the weight of each keyword and the distance information for the pair;
a content condition relevance evaluation unit 103 that refers to the text DB 107 and evaluates the relevance of a text to the search query based on the number of occurrences of the keywords in the search query across the entire set of texts and the number of occurrences within the text; and
a search result ranking unit 102 that ranks the texts based on the proximity evaluation result for the keyword pairs from the proximity evaluation unit 104 and the text relevance evaluation result from the content condition relevance evaluation unit 103.

The present invention (claim 3) is a search result ranking program for causing a computer to function as each of the means constituting the search result ranking apparatus according to claim 2.
The present invention (claim 4) is a computer-readable recording medium storing the search result ranking program according to claim 3.

As described above, according to the present invention, when search results over a set of structured texts are ranked using the proximity of the keywords in the input search query, the distance between keywords is evaluated based on the meaning of the text structure, and that distance is used to calculate the proximity and, from it, the document score. As a result, in a search over structured text, keywords specified in the search query can be regarded as close when they are considered semantically close based on the structure of the text even if they are simply far apart in the text, and conversely can be regarded as far apart when they are considered structurally distant even if they are simply close. This improves search accuracy.

FIG. 1 is a diagram for explaining the principle of the present invention.
FIG. 2 is a diagram showing the basic configuration of the present invention.
FIG. 3 is a configuration diagram of the search result ranking apparatus in one embodiment of the present invention.
FIG. 4 is an example of keyword appearance patterns in one embodiment of the present invention.
FIG. 5 is an example of the contents of the text DB in one embodiment of the present invention.
FIG. 6 is an example of the contents of the structure relation DB in one embodiment of the present invention.
FIG. 7 is a flowchart of a series of operations in one embodiment of the present invention.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 3 shows the configuration of the search result ranking apparatus in one embodiment of the present invention.

The search result ranking apparatus shown in the figure comprises a search application 1, a search management unit 2, a search unit 101, a search result ranking unit 102, a content condition relevance evaluation unit 103, a proximity evaluation unit 104, a keyword appearance structure specifying unit 105, a search result generation unit 106, a text DB 107, and a structure relation DB 108.

The search application 1 is the interface with the user. It accepts a search query from the user, accesses the search management unit 2, and presents the obtained search results to the user.

The search management unit 2 accepts the search query from the search application 1, generates search results using the search unit 101, the search result ranking unit 102, and the search result generation unit 106, and returns them to the search application 1.

The search unit 101 accesses the text DB 107 based on the search query received via the search management unit 2, identifies the set of texts that contain the keywords in the search query (the search result text ID set), and returns it to the search management unit 2.

The search result ranking unit 102 assigns a score to each text in the search result set received via the search management unit 2 and returns the ranking result to the search management unit 2. To calculate the score of a text, it sends the search query and a text ID identified by the search unit 101 to the content condition relevance evaluation unit 103 and receives a relevance score that evaluates the relevance between the search query and the text. It likewise sends the search query and the text ID to the proximity evaluation unit 104 and receives a proximity score that evaluates how closely related the keywords of the search query are within that text. It then combines the two scores to obtain the score of the text. An example score calculation formula is shown below.

Score(d,q) = Content_Score(d,q) + Proximity_Score(d,q) (1)
Here, Score(d,q) is the score of text d for query q, Content_Score(d,q) is the relevance score, and Proximity_Score(d,q) is the proximity score.
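Equation (1) can be sketched in a few lines. The function names and the dictionary layout (text ID mapped to a pair of relevance and proximity scores) are assumptions made for this example.

```python
from typing import Dict, List, Tuple


def combined_score(content_score: float, proximity_score: float) -> float:
    """Equation (1): the text score is the sum of the two partial scores."""
    return content_score + proximity_score


def rank(scores: Dict[str, Tuple[float, float]]) -> List[Tuple[str, float]]:
    """Sort text IDs by combined score, highest first.

    `scores` maps a text ID to (relevance score, proximity score).
    """
    totals = {tid: combined_score(c, p) for tid, (c, p) in scores.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# d2 has the lower relevance score but overtakes d1 through proximity.
print(rank({"d1": (1.0, 0.5), "d2": (0.75, 1.0)}))
```

A plain sum gives both scores equal influence; a weighted sum would be a natural variation, although the text does not describe one.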

The content condition relevance evaluation unit 103 receives the search query and a text ID from the search result ranking unit 102, evaluates the relevance of that text to the search query (the relevance score), and returns it to the search result ranking unit 102.

As the relevance score, an index such as BM25 (Okapi at TREC-4, SE Robertson, S Walker, S Jones, MM Hancock, Proceedings of the Fourth Text Retrieval Conference, 1996) or TF-IDF (Term Weighting Approaches in Automatic Text Retrieval, G Salton, C Buckley, 1987, http://dspace.library.cornell.edu/bitstream/1813/6721/2/87-881.ps) may be used.

To obtain information on a text, the unit accesses the text DB 107 using the text ID and obtains the number of occurrences of the keywords in the search query across the entire set of texts and the number of occurrences within the text in question.
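A relevance score of the TF-IDF kind can be sketched from exactly these two counts. The formula below (term frequency times the logarithm of the inverse document frequency) is one common variant and is an assumption here; the text only names BM25 and TF-IDF as candidate indices. The function and parameter names are also illustrative.

```python
import math
from typing import Dict, List


def tf_idf_score(
    query: List[str],
    doc_tokens: List[str],
    doc_freq: Dict[str, int],
    num_docs: int,
) -> float:
    """Sum of tf * log(N / df) over the query keywords.

    doc_tokens: tokens of the text being scored (gives the in-text count).
    doc_freq:   number of texts in the collection containing each keyword.
    num_docs:   total number of texts in the collection.
    """
    score = 0.0
    for kw in query:
        tf = doc_tokens.count(kw)
        df = doc_freq.get(kw, 0)
        if tf == 0 or df == 0:
            continue  # keyword absent from this text or from the collection
        score += tf * math.log(num_docs / df)
    return score


score = tf_idf_score(
    ["yokohama", "ramen"],
    ["yokohama", "ramen", "ramen"],
    {"yokohama": 1, "ramen": 2},
    num_docs=4,
)
print(score)
```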

The proximity evaluation unit 104 receives a text ID and the search query from the search result ranking unit 102, evaluates the proximity within that text of the keywords in the search query (the proximity score), and returns it to the search result ranking unit 102.

The proximity score of a text is calculated from the proximity, within the text, of the keyword pairs in the search query.

An example of how to calculate the proximity score of a text is given below.

Consider the case where the texts to be searched have a structure consisting of a title and a body, and the search query consists of "keyword A" and "keyword B". In this case, "keyword A" and "keyword B" can appear in the text in the four patterns shown in FIG. 4. The structure in which each keyword appears is obtained by sending the text ID and the search query to the keyword appearance structure specifying unit 105.

Of these, when both keywords appear in the same structure (TT, BB), the distance between keywords A and B is the simple distance, given by the following equation.

dist(A,B) = |pos(A) - pos(B)| (2)
Here, pos(x) is a function giving the position of keyword x in the text. pos(x) is obtained by accessing the text DB 107, retrieving the title and body, and scanning that text.

When the two keywords appear in different structures (TB, BT), on the other hand, dist(A,B) cannot be calculated directly as above if the text structure is to be taken into account; the meaning of the structures must be considered. In this case, a distance based on a predetermined relationship between the structures is used as the distance between the keywords.

dist(A,B) = L(structure(A), structure(B)) (3)
Here, structure(x) denotes the structure in the text in which keyword x appears, and L(y,z) is a function giving the semantic distance when the two keywords appear in structures y and z, respectively. The values returned by this function are determined in advance and stored in the structure relation DB 108.
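Equations (2) and (3) together amount to a two-branch distance function. In the sketch below, the table `STRUCTURE_DISTANCE` stands in for the structure relation DB 108; its values, like the function and parameter names, are hypothetical and chosen only for illustration.

```python
from typing import Dict, Tuple

# Hypothetical contents of the structure relation DB: L(y, z) for each
# ordered pair of distinct structures. The values are illustrative only.
STRUCTURE_DISTANCE: Dict[Tuple[str, str], int] = {
    ("title", "body"): 3,
    ("body", "title"): 3,
}


def keyword_distance(struct_a: str, pos_a: int, struct_b: str, pos_b: int) -> int:
    """Distance between two keyword occurrences.

    Same structure: Equation (2), the absolute difference of positions.
    Different structures: Equation (3), a predetermined lookup value.
    """
    if struct_a == struct_b:
        return abs(pos_a - pos_b)
    return STRUCTURE_DISTANCE[(struct_a, struct_b)]


# Both keywords in the body: the simple distance applies.
print(keyword_distance("body", 4, "body", 10))
# Title versus body: the result no longer depends on the body position.
print(keyword_distance("title", 0, "body", 50))
print(keyword_distance("title", 0, "body", 500))
```

Because the cross-structure branch ignores positions, Text 1 and Text 2 from the earlier example would receive the same title-to-body distance regardless of where "Yokohama" falls in the body.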

The formula for calculating the proximity between a keyword pair from this distance and the weight of each keyword is shown below.

Here, Weight(x) denotes the weight of keyword x. IDF (inverse document frequency) or a similar measure is used as this weight.
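The proximity formula itself is not reproduced in this text, so the sketch below uses a purely assumed form: the product of the two keyword weights divided by the distance, which makes proximity fall as distance grows. The logarithmic IDF variant and both function names are likewise assumptions, not the formula of the invention.

```python
import math


def idf(num_docs: int, doc_freq: int) -> float:
    """A common IDF variant: log of total texts over texts containing the keyword."""
    return math.log(num_docs / doc_freq)


def pair_proximity(weight_a: float, weight_b: float, dist: float) -> float:
    """Assumed proximity: weight product over distance (distance floored at 1)."""
    return weight_a * weight_b / max(dist, 1.0)


w_yokohama = idf(num_docs=8, doc_freq=2)
w_ramen = idf(num_docs=8, doc_freq=4)
# The same keyword pair scores higher when the keywords are closer.
print(pair_proximity(w_yokohama, w_ramen, dist=1))
print(pair_proximity(w_yokohama, w_ramen, dist=5))
```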

An example formula for calculating the proximity score of text d for search query q from the above score is shown below.

Here, J_{d,q} is the set of keywords of search query q that are contained in text d. Equation (5) sums the proximity scores of all search keyword pairs appearing in the text to give the proximity score of the text. Alternatively, only the maximum score or the average score may be used.

When the search query is "keyword A keyword B" and keywords A and B each appear only once in the text, the proximity score of the text, ProxWeight(d,q), equals the proximity score of keywords A and B, Prox(A,B).
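The three aggregation options mentioned above (sum, maximum, average) can be sketched as follows. The function name `text_proximity`, the `how` parameter, and the choice to return 0.0 when no pair is present are assumptions made for this example.

```python
from statistics import mean
from typing import Callable, Dict, Iterable, List

# Aggregation options named in the text: sum, maximum, or average.
AGGREGATORS: Dict[str, Callable[[List[float]], float]] = {
    "sum": sum,
    "max": max,
    "mean": mean,
}


def text_proximity(pair_scores: Iterable[float], how: str = "sum") -> float:
    """Combine the proximity scores of all keyword pairs into one text score."""
    scores = list(pair_scores)
    if not scores:
        return 0.0  # no keyword pair occurs in the text
    return float(AGGREGATORS[how](scores))


pairs = [1.0, 2.0, 3.0]
print(text_proximity(pairs, "sum"))
print(text_proximity(pairs, "max"))
print(text_proximity(pairs, "mean"))
# With a single pair, every option reduces to that pair's own score.
print(text_proximity([0.5]))
```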

The keyword appearance structure specifying unit 105 receives a text ID and the search query from the proximity evaluation unit 104, specifies the structure in the text in which each keyword of the search query appears (whether it is in the title or in the body), and returns those structures. To locate the positions at which the keywords appear in the text, this function accesses the text DB 107, retrieves the body and the title, and scans them.
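The scan performed by this unit can be sketched as follows. The record layout (a dictionary with `title` and `body` fields), whitespace tokenization, and the function name are assumptions for this example; Japanese text would in practice need a proper word segmenter.

```python
from typing import Dict, List, Tuple


def find_occurrences(
    record: Dict[str, str], keywords: List[str]
) -> Dict[str, List[Tuple[str, int]]]:
    """Map each keyword to the (structure, token position) pairs where it occurs."""
    result: Dict[str, List[Tuple[str, int]]] = {kw.lower(): [] for kw in keywords}
    for structure, content in record.items():
        for pos, token in enumerate(content.lower().split()):
            if token in result:
                result[token].append((structure, pos))
    return result


record = {"title": "Ramen", "body": "Tokyo style and Yokohama style"}
print(find_occurrences(record, ["Yokohama", "Ramen"]))
```

Positions are counted separately within each structure, which is what the same-structure distance in Equation (2) needs.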

The search result generation unit 106 receives, via the search management unit 2, the prioritized set of search result text IDs to be returned to the user, and generates the titles and snippets to be presented to the user from the information in the text DB 107. From these it generates the search results to be presented to the user and returns them to the search management unit 2.

The text DB 107 is a database that stores and manages information such as the title and body of each text to be searched. Its search function can report the number of documents in which a specified keyword appears and the number of occurrences of a keyword within a specified text. An example of the contents of the text DB 107 is shown in FIG. 5.

The structure relation DB 108 is a database that stores the value of L(y,z) in Equation (3) for each relationship between the structures in which a keyword pair appears within a text. An example of its contents is shown in FIG. 6. FIG. 6(A) is an example setting for the case where the texts to be searched have only a title and a body; FIG. 6(B) is an example setting for the case where the texts have a title and a body and also have sections.

Next, the series of operations in the above configuration is described.

Step 101) A search query is received from the user via the search application 1 and sent to the search management unit 2.

Step 102) The search management unit 2 sends the search query to the search unit 101.

Step 103) The search unit 101 accesses the text DB 107, obtains the search result set (search result text ID set) that satisfies the conditions of the search query, and returns it to the search management unit 2.

Step 104) The search management unit 2 sends the search query and the search result text ID set to the search result ranking unit 102.

Step 105) The search result ranking unit 102 passes the search query and a search result text ID to the content condition relevance evaluation unit 103 and the proximity evaluation unit 104.

Step 106) The proximity evaluation unit 104 sends the text ID and the search query received from the search result ranking unit 102 to the keyword appearance structure specifying unit 105.

Step 107) The keyword appearance structure specifying unit 105 accesses the text DB 107 based on the received text ID and search query, retrieves the body and title, uses them to specify the structure in which each keyword of the search query appears in the text, and sends those structures to the proximity evaluation unit 104.

Step 108) The proximity evaluation unit 104 accesses the text DB 107 and the structure relation DB 108 based on the keyword appearance structures received from the keyword appearance structure specifying unit 105 and obtains the distance information between the keywords. From this it calculates the proximity scores of the keywords and then the proximity score of the text, associates the proximity score with the text ID, and returns it to the search result ranking unit 102.

Step 109) The content condition relevance evaluation unit 103 accesses the text DB 107 based on the search result text ID and search query information received from the search result ranking unit 102 in Step 105. It obtains the number of occurrences of each keyword of the search query across the texts as a whole and within the text in question, calculates the relevance score between the text and the search query, associates the relevance score with the text ID, and returns it to the search result ranking unit 102.

Step 110) The search result ranking unit 102 calculates the score of the text from the relevance score and the proximity score.

Step 111) If the scores of all documents have been calculated, the process proceeds to Step 112; otherwise, it returns to Step 105.

Step 112) The search result text ID set is prioritized based on the scores of all texts, and the prioritized search result text ID set is returned to the search management unit 2.

Step 113) The search management unit 2 identifies the texts to be returned to the user based on the prioritized search result text ID set and sends that set of text IDs to the search result generation unit 106.

Step 114) Based on the set of text IDs received from the search management unit 2, the search result generation unit 106 obtains the body and title of each text from the text DB 107, generates the titles and snippets of the search results to be returned to the user, and returns them to the search management unit 2.

Step 115) The search management unit 2 returns the received search results to the user via the search application 1.
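
As a rough illustration of Steps 105 through 112, the following Python sketch scores each search result and orders the result set. Only the overall flow comes from the description above. The function names, the in-memory dictionaries standing in for the text DB 107 and the structure relation DB 108, the distance values, the 1/(1+d) proximity formula, the tf/cf relevance ratio, and the linear mixing weight `alpha` are all assumptions made for illustration, and structure weights are omitted.

```python
import math
from itertools import combinations

# Hypothetical stand-in for the structure relation DB 108: a predetermined
# distance for each pair of structures (values are illustrative only).
STRUCTURE_DISTANCE = {frozenset({"title", "body"}): 5.0}
DEFAULT_STRUCTURE_DISTANCE = 20.0  # assumed fallback for unlisted pairs

# Hypothetical stand-in for the text DB 107: text ID -> structure -> tokens.
TEXTS = {
    "d1": {"title": ["yokohama", "ramen", "guide"], "body": ["best", "shops"]},
    "d2": {"title": ["food", "notes"],
           "body": ["tokyo", "ramen", "and", "yokohama", "shumai"]},
    "d3": {"title": ["yokohama", "travel"], "body": ["ramen", "history"]},
}


def find_occurrences(text, keyword):
    """Step 107: list (structure, token position) for each keyword occurrence."""
    return [(structure, pos)
            for structure, tokens in text.items()
            for pos, token in enumerate(tokens)
            if token == keyword]


def pair_distance(occ_a, occ_b):
    """Step 108: smallest distance over all occurrence pairs of two keywords."""
    best = math.inf
    for struct_a, pos_a in occ_a:
        for struct_b, pos_b in occ_b:
            if struct_a == struct_b:
                # Same structure: distance depends on the position difference.
                d = abs(pos_a - pos_b)
            else:
                # Different structures: look up the predetermined distance.
                d = STRUCTURE_DISTANCE.get(frozenset({struct_a, struct_b}),
                                           DEFAULT_STRUCTURE_DISTANCE)
            best = min(best, d)
    return best


def proximity_score(text, keywords):
    """Step 108: average of 1/(1+d) over keyword pairs (assumed formula)."""
    pairs = list(combinations(keywords, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        d = pair_distance(find_occurrences(text, a), find_occurrences(text, b))
        total += 1.0 / (1.0 + d)  # a missing keyword gives d = inf, adding 0.0
    return total / len(pairs)


def relevance_score(text, keywords, collection):
    """Step 109: occurrences in this text relative to the whole collection."""
    score = 0.0
    for keyword in keywords:
        tf = len(find_occurrences(text, keyword))
        cf = sum(len(find_occurrences(t, keyword)) for t in collection.values())
        if cf:
            score += tf / cf
    return score


def rank_results(collection, result_ids, keywords, alpha=0.5):
    """Steps 110-112: mix both scores per text, then sort by descending score."""
    scores = {}
    for text_id in result_ids:
        text = collection[text_id]
        scores[text_id] = (alpha * relevance_score(text, keywords, collection)
                           + (1.0 - alpha) * proximity_score(text, keywords))
    return sorted(result_ids, key=lambda text_id: scores[text_id], reverse=True)


ranking = rank_results(TEXTS, ["d3", "d2", "d1"], ["yokohama", "ramen"])
```

With this sample data all three texts receive the same relevance score, so the order is decided by proximity alone: the text with both keywords adjacent in the title ranks first, and the text whose keywords are split between title and body ranks last.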

The operations of the components of the search result ranking apparatus shown in FIG. 3 can also be implemented as a program, which can be installed and executed on a computer used as the search result ranking apparatus, or distributed via a network.

The program can also be stored on a hard disk or on a portable storage medium such as a flexible disk or CD-ROM, and then installed on a computer or distributed.

The present invention is not limited to the embodiments and examples described above, and various modifications and applications are possible within the scope of the claims.

DESCRIPTION OF SYMBOLS
1 Search application
2 Search management unit
101 Search means, search unit
102 Search result ranking means, search result ranking unit
103 Content condition relevance evaluation means, content condition relevance evaluation unit
104 Proximity evaluation means, proximity evaluation unit
105 Keyword appearance structure specifying means, keyword appearance structure specifying unit
106 Search result generation unit
107 Text DB
108 Structure relation DB

Claims

A search result ranking method for an apparatus that, given a search query consisting of a group of keywords specifying text content, searches a set of texts that exist inside a computer or are accessible via a computer network for texts satisfying the query, the method comprising:
a search step of searching, based on the input search query, a text database (DB) that stores a set of texts having a structure such as title, body, chapter, and section among the texts to be searched that exist inside the computer or are accessible via the computer network, and obtaining a search result set consisting of search result text IDs;
a keyword appearance structure specifying step of referring to the text DB based on the search result text ID and the search query and specifying in which structure of the text each keyword in the search query appears;
a proximity evaluation step of, based on the keyword appearance structures obtained in the keyword appearance structure specifying step, obtaining the distance information between a keyword pair as a value that depends on the difference between the positions of the two keywords when the keyword pair appears in the same structure, and, when the keyword pair appears in different structures, referring to predetermined distances between structures and using the distance between the structures in which the keyword pair appears as the distance information between the keyword pair, and evaluating the proximity of the keyword pair based on a weight and the distance information between the keyword pair;
a content condition relevance evaluation step of referring to the text DB and evaluating the relevance of the text to the search query based on the number of appearances of the keywords in the texts to be searched as a whole and the number of appearances in the text; and
a search result ranking step of ranking the texts based on the proximity evaluation result for the keyword pair from the proximity evaluation step and the text relevance evaluation result from the content condition relevance evaluation step.

A search result ranking apparatus that, given a search query consisting of a group of keywords specifying text content, searches a set of texts that exist inside a computer or are accessible via a computer network for texts satisfying the query, the apparatus comprising:
a text database (DB) that stores a set of texts having a structure such as title, body, chapter, and section among the texts to be searched that exist inside the computer or are accessible via the computer network;
a structure relation DB that stores distance information between keywords for each relation between the structures in which a keyword pair appears in a text;
search means for searching the text DB based on the input search query and obtaining a search result set consisting of search result text IDs;
keyword appearance structure specifying means for referring to the text DB based on the search result text ID and the search query and specifying in which structure of the text each keyword in the search query appears;
proximity evaluation means for, based on the keyword appearance structures obtained by the keyword appearance structure specifying means, obtaining the distance information between a keyword pair as a value that depends on the difference between the positions of the two keywords when the keyword pair appears in the same structure, and, when the keyword pair appears in different structures, referring to predetermined distances between structures and using the distance between the structures in which the keyword pair appears as the distance information between the keyword pair, and evaluating the proximity of the keyword pair based on a weight and the distance information between the keyword pair;
content condition relevance evaluation means for referring to the text DB and evaluating the relevance of the text to the search query based on the number of appearances of the keywords in the texts to be searched as a whole and the number of appearances in the text; and
search result ranking means for ranking the texts based on the proximity evaluation result for the keyword pair from the proximity evaluation means and the text relevance evaluation result from the content condition relevance evaluation means.

A search result ranking program for causing a computer to function as each of the means constituting the search result ranking apparatus according to claim 2.

A computer-readable recording medium storing the search result ranking program according to claim 3.