JP5432936B2

JP5432936B2 - Document search apparatus having ranking model selection function, document search method having ranking model selection function, and document search program having ranking model selection function

Info

Publication number: JP5432936B2
Application number: JP2011032317A
Authority: JP
Inventors: 良彦数原; 潤鈴木; 宜仁安田; 義昌小池; 良治片岡
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-02-17
Filing date: 2011-02-17
Publication date: 2014-03-05
Anticipated expiration: 2031-02-17
Also published as: JP2012173794A

Description

The present invention relates to an apparatus and a method for presenting document search results.

In search systems such as web search systems, the search score used for the final ranking is calculated from many factors (called score factors), such as scores based on query term frequency, for example TF-IDF (Term Frequency-Inverse Document Frequency), and scores based on link analysis, for example PageRank (see Non-Patent Document 1). Presenting ranked search results by sorting them in descending order of the calculated search score is a widely used method.

Because the optimal ranking model differs from query to query, there is a method that selects and applies the optimal ranking model for each query. In Non-Patent Document 2, a ranking model is generated for each query, and for an input query the ranking model with the shortest Euclidean distance in the feature representation of the query is selected. Each ranking model can be generated using, for example, the technique of Non-Patent Document 3.

The transformation matrix used in the document search apparatus of the present invention is described in Non-Patent Document 4 below.

Non-Patent Document 1: Hiroshi Takeno, Takashi Inoue, "Distributed high-speed information collection / full-text search system InfoBee/Evangelist", NTT R&D, Vol. 52, No. 2, 2003, pp. 78-84.
Non-Patent Document 2: Xiubo Geng, Tie-Yan Liu, Tao Qin, Andrew Arnold, Hang Li and Heung-Yeung Shum, "Query Dependent Ranking Using K-Nearest Neighbor", In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '08), 2008, pp. 115-122.
Non-Patent Document 3: Thorsten Joachims, "Optimizing Search Engines using Clickthrough Data", In Proceedings of the eighth ACM international conference on Knowledge Discovery and Data mining (KDD '02), 2002, pp. 133-142.
Non-Patent Document 4: Eric P. Xing, Andrew Y. Ng, Michael I. Jordan and Stuart Russell, "Distance Metric Learning, with Application to Clustering with Side-Information", Proceedings of the 16th annual conference on Neural Information Processing Systems (NIPS '02), 2002, pp. 505-512.

The query similarity used in the prior art to select a ranking model for each query is a distance in a feature space set in advance. It rests on the assumption that, for an input query, the ranking model generated using the nearest-neighbor query in this feature space is optimal.

In practice, however, the nearest model in the feature representation of the query does not necessarily coincide with the model that is actually optimal. The similarity therefore cannot be calculated appropriately, and the optimal model that achieves high-precision ranking cannot be selected.

The present invention solves the above problem, and its object is to provide a document search apparatus, method, and program having a ranking model selection function capable of selecting the optimal ranking model for an input query.

To solve the above problem, a document search apparatus having a ranking model selection function according to the present invention comprises:
a query expression database storing, for each of N queries, query expression data expressed by M-dimensional features;
a training data database storing training data that has the relevance of document search results for the N queries and feature expressions represented by M-dimensional vectors;
ranking function generation means that takes the training data as input, generates ranking models each holding weights for the feature expression of a query, and constructs a ranking model database;
training-time optimal model selection means that takes the training data and the ranking models as input, determines, among all models in the ranking model database, the training-time optimal ranking model that gives the maximum search evaluation index value and thus shows the highest accuracy on the training data, and constructs a training-time optimal model database holding pairs of the query of the training-time optimal ranking model and the query of the training data;
distance learning means that takes the data of the query expression database and the training-time optimal model database as input, learns and generates a transformation matrix that minimizes the distance between each query in the query expression database and the query of the corresponding training-time optimal ranking model in the training-time optimal model database, and constructs a transformation matrix database;
optimal model database creation means that takes the data of the query expression database, the ranking model database, and the transformation matrix database as input, calculates, for each query in the query expression database, the similarity between queries using the transformation matrix, selects the query with the maximum similarity, acquires the ranking model of the selected query from the ranking model database, and constructs an optimal model database with the acquired ranking model as the optimal model for the query;
a document index database storing a document index created from documents collected in advance from Web pages;
query processing means that acquires a search result set for an input search query from the document index database and calculates a score factor value matrix from the search result set and a plurality of score factors;
search score calculation means that takes as input the score factor value matrix calculated by the query processing means and the data of the ranking model database and the optimal model database, acquires the optimal model corresponding to the input search query from the optimal model database, and calculates a search score vector by multiplying the score factor value matrix by the weights held as the ranking model in the ranking model database for the query of the acquired optimal model; and
search result presentation means that presents the search results for the input query in descending order of the search scores calculated by the search score calculation means.

According to the present invention, the query feature space is transformed so that the query that maximizes the search evaluation index is brought close to the neighborhood in the feature space. This improves the similarity calculation for an input query, which in turn improves the performance of ranking model selection and the accuracy of the search ranking.

FIG. 1 is a configuration diagram of the entire document search apparatus according to an embodiment of the present invention.
FIG. 2 is a configuration diagram of the apparatus that creates the optimal model DB of FIG. 1.
FIG. 3 is a configuration diagram of the apparatus that generates the transformation matrix DB of FIG. 2.
FIG. 4 is a flowchart showing the processing flow of the training-time optimal model selection unit 120 of FIG. 3.
FIG. 5 is a flowchart showing the processing flow of the optimal model DB creation unit 140 of FIG. 2.
FIG. 6 is a flowchart showing the processing flow of the document search apparatus of FIG. 1.

Hereinafter, embodiments of the present invention will be described with reference to the drawings, but the present invention is not limited to the following embodiments. First, an overview of the overall configuration of an embodiment of the present invention will be described with reference to FIGS. 1 to 3.

As shown in FIG. 1, the document search apparatus 100 of the present embodiment includes a document index DB (database) 107 storing document index data created from documents collected in advance from Web pages, a ranking model DB 103 storing ranking model data, an optimal model DB 106 storing optimal model data, a query processing unit 150 serving as the query processing means, a search score calculation unit 160 serving as the search score calculation means, and a search result presentation unit 170 serving as the search result presentation means.

The optimal model DB 106 in FIG. 1 is constructed by the processing of the optimal model DB creation unit 140, which serves as the optimal model database creation means, based on the data stored in the query expression DB 101, the ranking model DB 103, and the transformation matrix DB 105 shown in FIG. 2.

The ranking model DB 103 in FIG. 1 is constructed by the processing of the ranking function generation unit 110, which serves as the ranking function generation means, based on the data stored in the training data DB 102 shown in FIG. 3.

The transformation matrix DB 105 in FIG. 2 is constructed by the processing of the ranking function generation unit 110, the training-time optimal model selection unit 120 serving as the training-time optimal model selection means, and the distance learning unit 130 serving as the distance learning means, based on the data stored in the query expression DB 101, the training data DB 102, the ranking model DB 103, and the training-time optimal model DB 104 shown in FIG. 3.

The ranking model DB 103, the training-time optimal model DB 104, the ranking function generation unit 110, the training-time optimal model selection unit 120, and the distance learning unit 130 of FIG. 3 together constitute the transformation matrix generation device 115.

The document search apparatus 100 shown in FIGS. 1 to 3 is configured by, for example, a computer, and includes ordinary computer hardware resources such as a ROM, RAM, CPU, input device, output device, display device, communication interface, hard disk, and a recording medium with its drive device.

Through the cooperation of these hardware resources with software resources (OS, applications, and the like), the document search apparatus 100 implements, as shown in FIGS. 1 to 3, the query expression DB 101, training data DB 102, ranking model DB 103, training-time optimal model DB 104, transformation matrix DB 105, optimal model DB 106, document index DB 107, ranking function generation unit 110, training-time optimal model selection unit 120, distance learning unit 130, optimal model DB creation unit 140, query processing unit 150, search score calculation unit 160, and search result presentation unit 170.

The query expression DB 101, training data DB 102, ranking model DB 103, training-time optimal model DB 104, transformation matrix DB 105, optimal model DB 106, and document index DB 107 are assumed to be constructed in storage means such as a hard disk or RAM.

Next, the details of the apparatus configured as described above will be described.

First, in FIG. 3, the transformation matrix generation device 115 receives the query expression DB 101 and the training data DB 102 as inputs and generates the transformation matrix DB 105. An example of the data structure of the training data DB 102 is shown in Table 1.

This example shows training data in which a relevance value is assigned to each of the search results (documents) for N queries. Each row holds the relevance and the feature expression of one document for the query concerned. In this example each document is expressed by M features and can therefore be represented as an M-dimensional vector.

<Ranking function generation unit 110>
The ranking function generation unit 110 receives the training data DB 102 as input and generates the ranking model DB 103. The technique of Non-Patent Document 3, for example, can be used for the ranking function generation unit 110. An example of the data structure of the ranking model DB 103 is shown in Table 2.

In Table 2, the i-th row represents the ranking model obtained by inputting the training data of query q_i into the ranking function generation unit 110. Table 2 is an example of ranking models for each of the N queries. A ranking model can be expressed as weights for the feature expression of an input document. That is, in the example of Table 2, w_i represents the weight for x_i in Table 1. When a document is represented by an M-dimensional feature expression x, the ranking model can likewise be represented by an M-dimensional weight vector w.

<Training-time optimal model selection unit 120>
The training-time optimal model selection unit 120 receives the training data DB 102 and the ranking model DB 103 as inputs, performs the processing shown in steps S121 to S124 of FIG. 4, and outputs the training-time optimal model DB 104.

Table 3 shows an example of the data structure used internally by the training-time optimal model selection unit 120, and Table 4 shows an example of the data structure of the training-time optimal model DB 104 that the unit outputs.

First, in step S121 of FIG. 4, an unprocessed query q is acquired from the training data DB 102.

Next, in step S122, among all the ranking models contained in the ranking model DB 103, the query p whose ranking model gives the maximum evaluation value (search evaluation index value) is determined.

Next, in step S123, the queries p (for example, the query ID of the optimal model in Table 4) and q (for example, the query ID in Table 4) are output to the training-time optimal model DB 104.

Then, in step S124, the processing of steps S121 to S123 is repeated until all the queries in the training data DB 102 have been processed.

For example, for each query, the training data of the top-ranked query with the largest search evaluation index, excluding the query itself, can be used. MAP (Mean Average Precision), NDCG (Normalized Discounted Cumulative Gain), or the like can be used as the search evaluation index. In the following description, MAP is used as the search evaluation index.

The above processing produces, for each query, the ranking model that shows the highest accuracy on the training data (training data DB 102). The example of Table 3 shows that, for query q_1, the ranking model generated using query q_2 obtained a MAP value of 0.7 and the ranking model generated using query q_N obtained a MAP value of 0.4.
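The selection loop of steps S121 to S124 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: relevance labels are assumed to be binary, each ranking model is assumed to be a linear weight vector applied as a dot product, and the evaluation of one model on one query is the average precision of that query (the per-query term of MAP). All function and variable names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# One training row: (relevance label, M-dimensional feature vector).
Row = Tuple[int, Sequence[float]]


def average_precision(ranked_labels: Sequence[int]) -> float:
    """Average precision of a ranked list of binary relevance labels."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label > 0:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def evaluate(weights: Sequence[float], rows: Sequence[Row]) -> float:
    """Rank one query's documents by the linear score w . x and return the AP."""
    ranked = sorted(
        rows,
        key=lambda row: sum(w * x for w, x in zip(weights, row[1])),
        reverse=True,
    )
    return average_precision([label for label, _ in ranked])


def select_training_optimal_models(
    training_data: Dict[str, List[Row]],
    ranking_models: Dict[str, Sequence[float]],
) -> Dict[str, str]:
    """Map each query q to the query p whose model scores best on q's data."""
    best: Dict[str, str] = {}
    for q, rows in training_data.items():  # S121: take the next query q
        # Assumption: the model trained on q itself is excluded.
        candidates = [p for p in ranking_models if p != q]
        # S122: query p whose ranking model gives the maximum evaluation value
        best[q] = max(candidates, key=lambda p: evaluate(ranking_models[p], rows))
    return best  # S123/S124: (q, p) pairs for all queries
```

The returned mapping plays the role of the (query ID, optimal-model query ID) pairs described for Table 4.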

<Distance learning unit 130>
The distance learning unit 130 receives the query expression DB 101 and the training-time optimal model DB 104 as inputs and outputs the transformation matrix DB 105. An example of the data structure of the query expression DB 101 is shown in Table 5, and an example of the data structure of the transformation matrix DB 105 is shown in Table 6.

In the example of Table 5, the query expression DB 101 stores the feature expressions of N queries, and the value of f_i for a query is the value of the i-th feature of that query. In this example, each query is expressed by M-dimensional features.

Let x_i be the query expression vector of query q_i and x_j the query expression vector of query q_j. The distance between the two query expression vectors is calculated with a formula parameterized by the transformation matrix A. The transformation matrix A is an M-dimensional square matrix that maps the M-dimensional feature space to an M-dimensional feature space; when A = I, the distance is the Euclidean distance.
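As a minimal sketch of such a distance, assume the Mahalanobis-style form sqrt((x_i - x_j)^T A (x_i - x_j)) commonly used in distance metric learning. This form is an assumption, chosen because it reduces to the Euclidean distance when A = I as stated above; the function names are illustrative.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def transformed_distance(
    x_i: Sequence[float], x_j: Sequence[float], A: Matrix
) -> float:
    """Distance sqrt((x_i - x_j)^T A (x_i - x_j)) for an M x M matrix A."""
    diff = [a - b for a, b in zip(x_i, x_j)]
    # a_diff = A (x_i - x_j)
    a_diff = [sum(a_rc * d for a_rc, d in zip(row, diff)) for row in A]
    quad = sum(d * ad for d, ad in zip(diff, a_diff))
    # Guard: clamp negative values (rounding, or an A that is not
    # positive semidefinite) so that sqrt is always defined.
    return math.sqrt(max(quad, 0.0))


def identity(m: int) -> List[List[float]]:
    """M x M identity matrix, i.e. the plain Euclidean case A = I."""
    return [[1.0 if r == c else 0.0 for c in range(m)] for r in range(m)]
```

With `identity(2)`, the vectors `[0, 0]` and `[3, 4]` are at distance 5, the ordinary Euclidean value; a learned A stretches or shrinks individual feature directions.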

In the example of Table 3, the ranking model generated using q_2 showed the highest MAP value for q_1, so the transformation matrix A is learned so as to reduce the distance between the query expression vector x_1 of q_1 and the query expression vector x_2 of q_2. In this way, for every query q_i (i = 1, ..., N), the ranking model showing the best result is selected, and the transformation matrix is generated so that the distance between the query of the selected model and the query q_i is minimized. The technique of Non-Patent Document 4, for example, can be used to generate the transformation matrix A.

Next, the details of the optimal model DB creation apparatus of FIG. 2 will be described.

<Optimal model DB creation unit 140>
The optimal model DB creation unit 140 takes as inputs the transformation matrix DB 105 generated by the transformation matrix generation device 115 of FIG. 3, the query expression DB 101, and the ranking model DB 103, performs the processing shown in steps S141 to S146 of FIG. 5, and outputs the optimal model DB 106.

An example of the data structure of the optimal model DB 106 is shown in Table 7.

Table 7 shows an example in which optimal models are output for queries up to q_{N+L}, that is, for more queries than the N queries of the training-time optimal model DB 104. Constructing N+L optimal models in this way makes it possible to handle the case where the number of input queries in the document search processing described later exceeds N.

First, in step S141 of FIG. 5, an unprocessed query q is acquired from the query expression DB 101.

Next, in step S142, the similarity d to each query contained in the query expression DB 101 is calculated using the transformation matrix DB 105.

Next, in step S143, the query p with the maximum similarity d is selected.

Next, in step S144, the ranking model w corresponding to the selected query p is acquired from the ranking model DB 103.

Next, in step S145, the acquired w is output to the optimal model DB 106 as the optimal model for the query q.

Then, in step S146, the processing of steps S141 to S146 is repeated until all the queries in the query expression DB 101 have been processed.
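The loop of FIG. 5 can be sketched as below. Several points are assumptions that the text does not fix: the similarity is taken to be the negated distance under the transformation matrix A (using the Mahalanobis-style form), candidates are limited to queries that have a ranking model, and a query is not matched with itself. The sketch stores the ID of the selected query p for each q; the weights w can then be looked up in the ranking model DB. All names are hypothetical.

```python
import math
from typing import Dict, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def similarity(x_q: Vector, x_p: Vector, A: Matrix) -> float:
    """Assumed similarity: negated distance under the transformation matrix A."""
    diff = [a - b for a, b in zip(x_q, x_p)]
    a_diff = [sum(a_rc * d for a_rc, d in zip(row, diff)) for row in A]
    quad = sum(d * ad for d, ad in zip(diff, a_diff))
    return -math.sqrt(max(quad, 0.0))


def build_optimal_model_db(
    query_reprs: Dict[str, Vector],
    ranking_models: Dict[str, Vector],
    A: Matrix,
) -> Dict[str, str]:
    """Map every query q to the most similar query p that has a ranking model."""
    optimal: Dict[str, str] = {}
    for q, x_q in query_reprs.items():  # S141: next unprocessed query
        candidates = [p for p in ranking_models if p != q and p in query_reprs]
        # S142-S143: compute similarities and select the query with the maximum
        p_best = max(candidates, key=lambda p: similarity(x_q, query_reprs[p], A))
        # S144-S145: record the selected model (by query ID) as optimal for q
        optimal[q] = p_best
    return optimal  # S146: all queries processed
```

Because `query_reprs` may contain more queries than `ranking_models`, the result can cover queries beyond the N trained ones, in the spirit of the q_{N+L} example of Table 7.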

Next, the details of the document search apparatus 100 of FIG. 1 will be described with reference to the flowchart of FIG. 6.

<Query processing unit 150>
The query processing unit 150 receives a search query as input, acquires from the document index DB 107 a search result set (documents) containing the search query, and calculates a score factor value matrix from the search result set and a plurality of score factors (step S150).

Specifically, when a set of N search results is acquired from the document index DB 107 and M score factors are used, the score factor value matrix D is expressed as

$$D = \begin{pmatrix} d_{11} & \cdots & d_{1M} \\ \vdots & \ddots & \vdots \\ d_{N1} & \cdots & d_{NM} \end{pmatrix}$$

Here, the i-th row of D holds the score factor values of the i-th search result. For example, d_{23} is the third score factor value for the second document.

<Search score calculation unit 160>
The search score calculation unit 160 receives as inputs the score factor value matrix D output by the query processing unit 150, the data of the ranking model DB 103, the data of the optimal model DB 106, and the input search query q_input.

The search score calculation unit 160 acquires from the optimal model DB 106 the query ID q_best of the optimal model corresponding to the input search query q_input, acquires the weights w (score factor weights) of the query of that optimal model from the ranking model DB 103, and calculates the search score vector from the score factor weights and the score factor value matrix D (step S160).

The search score vector s used for the search ranking is obtained as the product of the score factor value matrix D and the score factor weights w^(q_best).

That is, the search score s_i for the i-th document is calculated by

$$s_i = \sum_{j=1}^{M} d_{ij} \, w_j^{(q_{best})}$$

<Search result presentation unit 170>
The search result presentation unit 170 receives the search score vector s and presents the search results for the query (displays them or outputs them as data) in descending order of the search score s_i (step S170).
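The scoring and presentation steps S160 and S170 amount to a matrix-vector product followed by a descending sort. The sketch below assumes that the optimal model DB maps a query ID to the query ID of its optimal model and that the ranking model DB maps a query ID to its weight vector; these dictionary layouts and all names are illustrative assumptions.

```python
from typing import Dict, List, Sequence, Tuple


def search_scores(
    D: Sequence[Sequence[float]], w: Sequence[float]
) -> List[float]:
    """Search score vector s = D w, i.e. s_i = sum_j d_ij * w_j."""
    return [sum(d_ij * w_j for d_ij, w_j in zip(row, w)) for row in D]


def rank_results(
    doc_ids: Sequence[str],
    D: Sequence[Sequence[float]],
    optimal_model_db: Dict[str, str],
    ranking_model_db: Dict[str, Sequence[float]],
    q_input: str,
) -> List[Tuple[str, float]]:
    """Return (document ID, score) pairs in descending order of search score."""
    q_best = optimal_model_db[q_input]  # query ID of the optimal model
    w = ranking_model_db[q_best]        # score factor weights of that model
    s = search_scores(D, w)             # S160: search score vector
    order = sorted(range(len(s)), key=lambda i: s[i], reverse=True)
    return [(doc_ids[i], s[i]) for i in order]  # S170: descending order
```

For instance, with D = [[1, 0], [0, 1], [1, 1]] and w = [0.5, 2], the scores are [0.5, 2.0, 2.5], so the third document is presented first.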

It goes without saying that some or all of the functions of the means in the document search apparatus having the ranking model selection function of the present embodiment can be implemented as a computer program and executed on a computer to realize the present invention, and that the procedures of the document search method having the ranking model selection function of the present embodiment can likewise be implemented as a computer program and executed by a computer. The program for realizing these functions on a computer can be recorded on a computer-readable recording medium, such as an FD (Floppy (registered trademark) Disk), MO (Magneto-Optical disk), ROM (Read Only Memory), memory card, CD (Compact Disk)-ROM, DVD (Digital Versatile Disk)-ROM, CD-R, CD-RW, HDD, or removable disk, and can be stored or distributed in that form. The program can also be provided over a network, for example via the Internet or electronic mail.

DESCRIPTION OF SYMBOLS
100 ... Document search apparatus
101 ... Query expression DB
102 ... Training data DB
103 ... Ranking model DB
104 ... Training-time optimal model DB
105 ... Transformation matrix DB
106 ... Optimal model DB
107 ... Document index DB
110 ... Ranking function generation unit
120 ... Training-time optimal model selection unit
130 ... Distance learning unit
140 ... Optimal model DB creation unit
150 ... Query processing unit
160 ... Search score calculation unit
170 ... Search result presentation unit

Claims

A query expression database storing query expression data expressed by M-dimensional features for each of the N queries;
A training data database in which training data having a matching degree of a document search result with respect to N queries and a feature expression represented by an M-dimensional vector are stored;
Ranking function generating means for generating a ranking model database by generating a ranking model having the training data as an input and holding a weight for the feature expression of each query;
The training data and the ranking model are input, the maximum search evaluation index value is given to all models in the ranking model database, and the training-time optimal ranking model indicating the highest accuracy in the training data is generated. A training optimal model selection means for constructing a training optimal model database having data of a pair of an optimal ranking model query and the training data query;
Each data of the query expression database and the optimal model database for training is input, and the distance between the query in the query expression database and the query of the optimal ranking model for training in the optimal model database for training corresponding to the query is minimum. Distance learning means that learns and generates a transformation matrix and constructs a transformation matrix database;
Using each data of the query expression database, ranking model database, and transformation matrix database as input, for each query in the query expression database, the similarity between the queries is calculated using the transformation matrix, and the maximum similarity is obtained. An optimal model database creating means for acquiring a ranking model of the selected query from the ranking model database and constructing an optimal model database using the acquired ranking model as an optimal model for the query;
A document index database in which a document index created based on a document collected in advance from a Web page is stored;
Query processing means for acquiring a search result set for the input search query from the document index database, and calculating a score factor value matrix from the search result set and a plurality of score factors;
Search score calculating means that takes as input the score factor value matrix calculated by the query processing means and each data of the ranking model database and the optimal model database, acquires an optimal model corresponding to the input search query from the optimal model database, and calculates a search score vector by multiplying a weight as a ranking model in the ranking model database corresponding to the query of the acquired optimal model by the score factor value matrix;
Search result presenting means for presenting search results for the input query in descending order of the search score calculated by the search score calculating means;
A document retrieval apparatus having a ranking model selection function.

The ranking function generation means receives training data in a training data database in which training data having a fitness of a document search result for N queries and a feature expression represented by an M-dimensional vector are stored. A ranking function generation step of generating a ranking model holding weights for the feature expression of each query and constructing a ranking model database;
A training optimal model selection step in which the training optimal model selection means takes the training data and ranking model as input, gives the maximum search evaluation index value for all models in the ranking model database, generates a training-time optimal ranking model indicating the highest accuracy in the training data, and constructs a training optimal model database having data of a pair of the training optimal ranking model query and the training data query;
The distance learning means inputs the query expression database storing query expression data expressed by M-dimensional features for each of the N queries and each data of the optimal model database at the time of training, and queries in the query expression database Learning and generating a transformation matrix that minimizes the distance from the query of the optimal ranking model for training in the optimal model database for training corresponding to the query, and a distance learning step of constructing the transformation matrix database;
Optimal model database creation means inputs each data of the query expression database, ranking model database, and transformation matrix database, and calculates the similarity between each query using the transformation matrix for each query in the query expression database. An optimal model that selects a query having the maximum similarity, acquires a ranking model of the selected query from the ranking model database, and constructs an optimal model database using the acquired ranking model as an optimal model for the query A database creation step;
A query processing step in which the query processing means acquires a search result set for the input search query from a document index database in which a document index created based on a document previously collected from a Web page is stored, and calculates a score factor value matrix from the search result set and a plurality of score factors;
A search score calculation step in which the search score calculation means takes as input the score factor value matrix calculated by the query processing means and each data of the ranking model database and the optimal model database, acquires the optimal model corresponding to the input search query from the optimal model database, and calculates a search score vector by multiplying the weight as a ranking model in the ranking model database corresponding to the query of the acquired optimal model by the score factor value matrix;
A search result presenting step for presenting a search result for the input query in descending order of the search score calculated by the search score calculating unit;
A document retrieval method having a ranking model selection function characterized by comprising:

A document search program having a ranking model selection function for causing a computer to function as each means according to claim 1.