JP2018088182A

JP2018088182A - Model generation device, click-log correct-answer likelihood calculation device, document retrieval device, method, and program

Info

Publication number: JP2018088182A
Application number: JP2016231743A
Authority: JP
Inventors: Katsuto Bessho; Atsushi Otsuka; Kyosuke Nishida; Hisako Asano; Yoshihiro Matsuo
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2016-11-29
Filing date: 2016-11-29
Publication date: 2018-06-07
Anticipated expiration: 2036-11-29
Also published as: JP6521931B2

Abstract

PROBLEM TO BE SOLVED: To improve retrieval accuracy. SOLUTION: For the text of each query and each document in a click log, concept vector generation means 54 synthesizes the concept vectors, held in a word concept base 52, of the words in the text, thereby generating a concept vector for the text. For an arbitrary pair in the click log, feature vector generation means 56 treats as an in-neighborhood pair any pair in the click log whose query concept vector lies in a neighborhood of the query concept vector of the pair in question and whose document concept vector lies in a neighborhood of the document concept vector of that pair, and extracts features including the number of in-neighborhood pairs or the number of distinct users associated with the in-neighborhood pairs, thereby generating a feature vector for the pair. For an arbitrary pair in the click log, correct-answer likelihood estimation means 62 estimates the correct-answer likelihood of the pair from the feature vector of the pair and a classification model. SELECTED DRAWING: Figure 13

Description

The present invention relates to a model generation device, a click-log correct-answer likelihood calculation device, a document retrieval device, a method, and a program.

Concept search retrieves, from a set of search-target documents, the documents that conceptually match a query entered by a user.

In Non-Patent Document 1 below, a word concept base, which is a set of pairs of a word and a concept vector representing the concept of that word, is generated from a corpus. For each document, a concept vector of the document is generated by synthesizing the concept vectors in the word concept base of the words in the document. For a query, a concept vector of the query is generated by synthesizing the concept vectors in the word concept base of the words in the query, and for each document the similarity between the concept vector of the query and the concept vector of the document is calculated. As the search result, the documents are displayed ranked in descending order of similarity. Alternatively, the documents whose similarity is at or above a certain threshold are displayed.
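The conventional flow described above can be sketched as follows. This is a minimal illustration, not the method of Non-Patent Document 1 itself: the toy word concept base, the use of summation followed by normalization as the "synthesis", the use of cosine similarity, and the names `text_vector` and `search` are all assumptions made for the example.

```python
import math

# Hypothetical toy word concept base: word -> concept vector.
WORD_CONCEPT_BASE = {
    "printer": [0.9, 0.1, 0.0],
    "paper": [0.7, 0.3, 0.0],
    "jam": [0.6, 0.4, 0.0],
    "password": [0.0, 0.2, 0.9],
    "reset": [0.1, 0.3, 0.8],
}


def normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def text_vector(words, base):
    """Assumed synthesis: sum the concept vectors of known words, then normalize."""
    dim = len(next(iter(base.values())))
    total = [0.0] * dim
    for word in words:
        if word in base:  # words missing from the concept base are skipped
            total = [t + x for t, x in zip(total, base[word])]
    return normalize(total)


def search(query_words, documents, base):
    """Return (similarity, doc_id) tuples in descending order of similarity."""
    q = text_vector(query_words, base)
    scored = []
    for doc_id, words in documents.items():
        d = text_vector(words, base)
        # Both vectors are unit length, so the dot product is the cosine.
        scored.append((sum(a * b for a, b in zip(q, d)), doc_id))
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    docs = {"D1": ["printer", "paper", "jam"], "D2": ["password", "reset"]}
    for score, doc_id in search(["paper", "jam"], docs, WORD_CONCEPT_BASE):
        print(doc_id, round(score, 3))
```

A threshold-based variant would simply keep the entries whose similarity is at or above the chosen threshold instead of returning the full ranking.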

Non-Patent Document 1: 別所克人, 内山俊郎, 内山匡, 片岡良治, 奥雅博, "単語・意味属性間共起に基づくコーパス概念ベースの生成方式" (A method for generating a corpus concept base based on co-occurrence between words and semantic attributes), 情報処理学会論文誌 (IPSJ Journal), Vol. 49, No. 12, pp. 3997-4006, Dec. 2008.

Even when a document is a correct search result for a query, the conventional method above yields a low similarity if the semantic content of the document diverges from the semantic content of the query.

One way to solve this is to extend the document so that it includes the queries that correspond to it. However, creating the corresponding queries by hand is very costly.

In a search system, a user generally clicks, among the search result documents for an entered query, a document that the user considers relevant to the query. A document clicked by the user is therefore likely to be a valid search result for the query. When a click log is available, that is, a set of pairs of a query entered by a user and a document the user clicked, extending each clicked document with its corresponding queries eliminates the manual creation cost described above.

However, a clicked document is often an incorrect search result for the query. In that case the document is extended with an unrelated query, the concept vector of the extended document is no longer appropriate, and retrieval accuracy suffers as a result.

An object of the present invention is to solve this problem and provide a model generation device, a click-log correct-answer likelihood calculation device, a document retrieval device, a method, and a program that improve retrieval accuracy.

To solve the above problem, a model generation device according to a first invention receives as input a click log that is a set of pairs of a query and a document, each pair being given a label indicating whether or not it is a correct answer, and includes: a word concept base that is a set of pairs of a word and a concept vector representing the concept of the word; concept vector generation means for generating, for the text of each query and each document in the click log, a concept vector of the text by synthesizing the concept vectors in the word concept base of the words in the text; feature vector generation means for generating, for an arbitrary pair in the click log, a feature vector of the pair by extracting features including the number of in-neighborhood pairs or the number of distinct users associated with the in-neighborhood pairs, where an in-neighborhood pair is a pair in the click log whose query concept vector lies within a neighborhood of the concept vector of the query of the pair in question and whose document concept vector lies within a neighborhood of the concept vector of the document of that pair; and classification model generation means for generating, from the set of pairs of a feature vector and a label for the pairs in the click log, a classification model for calculating the correct-answer likelihood of an arbitrary feature vector.

A click-log correct-answer likelihood calculation device according to a second invention targets a click log that is, in a document retrieval system, a set of pairs of a query entered by a user and a document clicked by the user among the search result documents for that query, and calculates, for a pair in the click log, a correct-answer likelihood, which is the degree to which the document of the pair is a correct search result for the query of the pair. The device receives the click log as input and includes: a word concept base that is a set of pairs of a word and a concept vector representing the concept of the word; a classification model for calculating the correct-answer likelihood of an arbitrary feature vector; concept vector generation means for generating, for the text of each query and each document in the click log, a concept vector of the text by synthesizing the concept vectors in the word concept base of the words in the text; feature vector generation means for generating, for an arbitrary pair in the click log, a feature vector of the pair by extracting features including the number of in-neighborhood pairs or the number of distinct users associated with the in-neighborhood pairs, where an in-neighborhood pair is a pair in the click log whose query concept vector lies within a neighborhood of the concept vector of the query of the pair in question and whose document concept vector lies within a neighborhood of the concept vector of the document of that pair; and correct-answer likelihood estimation means for estimating, for an arbitrary pair in the click log, the correct-answer likelihood of the pair from the feature vector of the pair and the classification model.

A click-log correct-answer likelihood calculation device according to a third invention may further include document concept base generation means that, for each search-target document, sets the weight of the text of the document to 1, sets the weight of the text of each query corresponding to the document in the click log to the correct-answer likelihood of the pair of the document and the query, excludes the texts of queries whose weight is at or below a certain threshold, multiplies the concept vector in the word concept base of each word contained in each text by the weight of the text to which the word belongs, adds the resulting concept vectors over every word of every text, normalizes the sum to generate a concept vector, and thereby generates a document concept base that is a set of pairs of a document and the concept vector of the document.

A document retrieval device according to a fourth invention receives a query as input and includes: a word concept base that is a set of pairs of a word and a concept vector representing the concept of the word; a document concept base, generated by the click-log correct-answer likelihood calculation device, that is a set of pairs of a document and a concept vector representing the concept of the document; concept vector generation means for generating a concept vector of the query by synthesizing the concept vectors in the word concept base of the words in the query; and similarity calculation means for calculating, for each document in the document concept base, the similarity between the concept vector of the query and the concept vector of the document.

A model generation method according to a fifth invention is a model generation method in a model generation device that includes a word concept base, which is a set of pairs of a word and a concept vector representing the concept of the word, concept vector generation means, feature vector generation means, and classification model generation means. The method receives as input a click log that is a set of pairs of a query and a document, each pair being given a label indicating whether or not it is a correct answer, and includes: a step in which the concept vector generation means generates, for the text of each query and each document in the click log, a concept vector of the text by synthesizing the concept vectors in the word concept base of the words in the text; a step in which the feature vector generation means generates, for an arbitrary pair in the click log, a feature vector of the pair by extracting features including the number of in-neighborhood pairs or the number of distinct users associated with the in-neighborhood pairs, where an in-neighborhood pair is a pair in the click log whose query concept vector lies within a neighborhood of the concept vector of the query of the pair in question and whose document concept vector lies within a neighborhood of the concept vector of the document of that pair; and a step in which the classification model generation means generates, from the set of pairs of a feature vector and a label for the pairs in the click log, a classification model for calculating the correct-answer likelihood of an arbitrary feature vector.

A click-log correct-answer likelihood calculation method according to a sixth invention is a method in a click-log correct-answer likelihood calculation device that includes a word concept base, which is a set of pairs of a word and a concept vector representing the concept of the word, a classification model for calculating the correct-answer likelihood of an arbitrary feature vector, concept vector generation means, feature vector generation means, and correct-answer likelihood estimation means. The device targets a click log that is, in a document retrieval system, a set of pairs of a query entered by a user and a document clicked by the user among the search result documents for that query, and calculates, for a pair in the click log, a correct-answer likelihood, which is the degree to which the document of the pair is a correct search result for the query of the pair. The method receives the click log as input and includes: a step in which the concept vector generation means generates, for the text of each query and each document in the click log, a concept vector of the text by synthesizing the concept vectors in the word concept base of the words in the text; a step in which the feature vector generation means generates, for an arbitrary pair in the click log, a feature vector of the pair by extracting features including the number of in-neighborhood pairs or the number of distinct users associated with the in-neighborhood pairs, where an in-neighborhood pair is a pair in the click log whose query concept vector lies within a neighborhood of the concept vector of the query of the pair in question and whose document concept vector lies within a neighborhood of the concept vector of the document of that pair; and a step in which the correct-answer likelihood estimation means estimates, for an arbitrary pair in the click log, the correct-answer likelihood of the pair from the feature vector of the pair and the classification model.

A click-log correct-answer likelihood calculation method according to a seventh invention may further include a step in which document concept base generation means, for each search-target document, sets the weight of the text of the document to 1, sets the weight of the text of each query corresponding to the document in the click log to the correct-answer likelihood of the pair of the document and the query, excludes the texts of queries whose weight is at or below a certain threshold, multiplies the concept vector in the word concept base of each word contained in each text by the weight of the text to which the word belongs, adds the resulting concept vectors over every word of every text, normalizes the sum to generate a concept vector, and thereby generates a document concept base that is a set of pairs of a document and the concept vector of the document.

A program according to an eighth invention is a program for causing a computer to function as each means of the model generation device, the click-log correct-answer likelihood calculation device, or the document retrieval device described above.

The model generation device, click-log correct-answer likelihood calculation device, document retrieval device, method, and program of the present invention have the effect of improving retrieval accuracy.

FIG. 1 is a diagram showing an example screen of the search system.
FIG. 2 is a diagram showing an example of the expanded display of the detailed information of a document in the search results.
FIG. 3 is an explanatory diagram of the positional relationship, in the concept vector space, of the target query and the target document that make up a target pair.
FIG. 4 is an explanatory diagram of the positional relationship, in the concept vector space, of the target query and the target document that make up a target pair.
FIG. 5 is a block diagram showing the functional configuration of the model generation device according to the embodiment of the present invention.
FIG. 6 is a diagram showing an example of a group of search-target documents.
FIG. 7 is a diagram showing an example of a click log.
FIG. 8 is a diagram showing an example of a click log to which labels indicating whether or not each pair is a correct answer have been given.
FIG. 9 is a diagram showing an example of a word concept base.
FIG. 10 is a diagram showing an example of a set of pairs of a query ID and the concept vector of that query ID.
FIG. 11 is a diagram showing an example of a set of pairs of a document ID and the concept vector of that document ID.
FIG. 12 is a diagram showing an example of a set of pairs of a pair in the click log and the feature vector corresponding to that pair.
FIG. 13 is a block diagram showing the functional configuration of the click-log correct-answer likelihood calculation device according to the embodiment of the present invention.
FIG. 14 is a diagram showing an example of pairs in the click log and the correct-answer likelihoods corresponding to those pairs.
FIG. 15 is an explanatory diagram of the processing that reflects queries in the concept vector of a search-target document.
FIG. 16 is a block diagram showing the functional configuration of the document retrieval device according to the embodiment of the present invention.
FIG. 17 is a flowchart showing the model generation processing routine in the model generation device according to the embodiment of the present invention.
FIG. 18 is a flowchart showing the click-log correct-answer likelihood calculation processing routine in the click-log correct-answer likelihood calculation device according to the embodiment of the present invention.
FIG. 19 is a flowchart showing the document retrieval processing routine in the document retrieval device according to the embodiment of the present invention.

Embodiments of the present invention are described below with reference to the drawings.

<Outline of the Embodiment of the Present Invention>
The click log targeted by the embodiment of the present invention is accumulated in a search system. FIG. 1 shows an example screen of the search system: when a query is entered in the search box, the group of search result documents is presented. In general, the user clicks a document that the user considers relevant to the query. When document D5 is clicked in FIG. 1, the system then responds, for example as in FIG. 2, by expanding and displaying the detailed information of document D5. Some systems present buttons for selecting whether the content of the document was useful or not useful. The user may press neither button or may press one of them. In FIG. 2 the detailed information of D5 is expanded on the same screen, but in some systems a separate window appears and the detailed information of D5 is displayed in it.

Log information of this kind is accumulated: the entered query, its group of search result documents, which document in that group the user clicked, and whether the user pressed the "useful" button or the "not useful" button. In the embodiment of the present invention, the set of pairs of a query entered by a user and a document clicked by the user among the search result documents for that query is called a click log.

For a pair in the click log, the degree to which the document of the pair is a correct search result for the query of the pair is called the correct-answer likelihood. The idea underlying the calculation of the correct-answer likelihood in the embodiment of the present invention is described below.

In the embodiment of the present invention, both the search-target documents and the entered queries are all converted into concept vectors. As shown in FIG. 3, the target query and the target document that make up a target pair are represented as points in the concept vector space. A neighborhood, drawn as a dotted outline, is taken around each of the target query and the target document. A pair in the click log whose query lies within the neighborhood of the target query and whose document lies within the neighborhood of the target document is called a pair within the neighborhood of the target pair.

The embodiment of the present invention adopts the following Hypothesis 1.

(Hypothesis 1)
- If the target pair is correct, the number of other pairs within its neighborhood is large.
- If the target pair is incorrect, the number of other pairs within its neighborhood is small.

This hypothesis rests on the idea that the more other pairs support the target pair, the higher the correct-answer likelihood of the target pair. In other words, if there are many cases in which a document close in meaning to the target document was clicked for an entered query close in meaning to the target query, the correct-answer likelihood of the target pair is high.

When statistics are taken on actual click-log data, the number of other pairs within the neighborhood tends to be larger for correct target pairs than for incorrect target pairs, so Hypothesis 1 tends to hold. It can therefore be said that the correct-answer likelihood of a target pair is high when the number of other pairs within its neighborhood is large and low when that number is small.

The target pair in FIG. 3 has one other pair within its neighborhood and the target pair in FIG. 4 has four, so the correct-answer likelihood of the target pair in FIG. 4 can be said to be higher than that of the target pair in FIG. 3.
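Counting the other pairs within the neighborhood of a target pair can be sketched as follows. The text here does not fix how the neighborhood is defined, so this example assumes that a vector is "within the neighborhood" when its cosine similarity to the target is at or above a threshold; the threshold value, the toy two-dimensional vectors, and the name `count_neighbor_pairs` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def count_neighbor_pairs(target_index, pairs, threshold=0.9):
    """Count the other pairs whose query AND document are both near the target's.

    pairs is a list of (query_vector, document_vector) tuples.
    The neighborhood test (cosine >= threshold) is an assumption of this sketch.
    """
    target_query, target_doc = pairs[target_index]
    count = 0
    for i, (query, doc) in enumerate(pairs):
        if i == target_index:
            continue  # the target pair itself is not counted
        if cosine(target_query, query) >= threshold and cosine(target_doc, doc) >= threshold:
            count += 1
    return count


# Toy click log: index 0 is the target pair.
PAIRS = [
    ([1.0, 0.0], [0.0, 1.0]),      # target pair
    ([0.99, 0.05], [0.05, 0.99]),  # query near, document near
    ([0.98, 0.10], [1.0, 0.0]),    # query near, document far
    ([0.0, 1.0], [0.0, 1.0]),      # query far, document near
]

if __name__ == "__main__":
    print(count_neighbor_pairs(0, PAIRS))
```

Only the second entry satisfies both conditions, which mirrors the situation of FIG. 3, where the target pair has a single other pair within its neighborhood.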

Each target pair is also associated with the user who entered the query of the pair and clicked the document of the pair.

The embodiment of the present invention adopts the following Hypothesis 2.

(Hypothesis 2)
- If the target pair is correct, the number of distinct users associated with the other pairs within its neighborhood is large.
- If the target pair is incorrect, the number of distinct users associated with the other pairs within its neighborhood is small.

This hypothesis rests on the idea that the more users support the target pair, the higher the correct-answer likelihood of the target pair. Suppose there are N other pairs within the neighborhood. If the number of distinct users associated with them is small (for example, one user entered a query and clicked N times), a few users may have been clicking arbitrarily, and the target pair is less reliable. Conversely, if the number of distinct users is large (for example, N users each entered a query and clicked once), the target pair is more reliable.

When statistics are taken on actual click-log data, the number of distinct users associated with the other pairs within the neighborhood tends to be larger for correct target pairs than for incorrect target pairs, so Hypothesis 2 tends to hold. It can therefore be said that the correct-answer likelihood of a target pair is high when the number of distinct users associated with the other pairs within its neighborhood is large and low when that number is small.
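The difference between the pair count of Hypothesis 1 and the distinct-user count of Hypothesis 2 can be illustrated with a small sketch. It assumes that the other pairs within the neighborhood of the target pair have already been identified and that each carries a user ID; the function name `neighbor_features` and the user IDs are hypothetical.

```python
def neighbor_features(neighbor_user_ids):
    """Compute two features from the other pairs within the neighborhood.

    neighbor_user_ids holds one user ID per in-neighborhood pair
    (the neighborhood test itself is assumed to have been done already).
    """
    return {
        "pair_count": len(neighbor_user_ids),           # Hypothesis 1
        "distinct_users": len(set(neighbor_user_ids)),  # Hypothesis 2
    }


if __name__ == "__main__":
    # One user produced all four neighboring pairs: weak support.
    print(neighbor_features(["u1", "u1", "u1", "u1"]))
    # Four different users each produced one neighboring pair: strong support.
    print(neighbor_features(["u1", "u2", "u3", "u4"]))
```

Both cases have the same pair count of four, but the distinct-user count separates the case driven by a single user from the case supported by four users.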

When "useful" and "not useful" buttons are provided, the fact that a user pressed the "useful" button or the "not useful" button for a target pair means that the user explicitly supported or rejected the target pair. The embodiment of the present invention regards the correct-answer likelihood of a target pair as high (or low) when the target pair itself is marked "useful" (or "not useful"), or when the number of other pairs within its neighborhood that are marked "useful" (or "not useful") is large.

For the various determinants of the correct-answer likelihood described above, such as the number of other pairs within the neighborhood of the target pair, the embodiment of the present invention obtains by machine learning the weight coefficients of the determinants used to calculate the correct-answer likelihood. The model generation device according to the embodiment of the first invention, described below, specifies the learning processing of this machine learning. The model generation device receives as input a click log in which each pair is given a label indicating whether or not it is a correct answer, generates a feature vector for an arbitrary pair in the click log by extracting the features corresponding to the various determinants, and generates a classification model from the set of pairs of a feature vector and a label.

The click-log correct-answer likelihood calculation device according to the embodiment of the second invention specifies the estimation processing of the machine learning. The device receives as input a click log that is separate from the one used in the learning processing and whose pairs are not given labels indicating whether or not they are correct answers, generates a feature vector for an arbitrary pair in the click log by extracting the same features as those extracted in the learning processing, and estimates the correct-answer likelihood of the pair from the feature vector and the classification model.
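The learning and estimation processing described above can be sketched as follows. The text at this point does not name the type of classification model, so this example assumes a logistic regression trained by plain gradient descent, with the predicted probability used as the correct-answer likelihood. The two features (neighboring pair count and distinct-user count), the toy labeled data, and all names are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(features, labels, lr=0.1, epochs=2000):
    """Learning processing: fit weights and bias by batch gradient descent."""
    dim = len(features[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def likelihood(model, x):
    """Estimation processing: correct-answer likelihood of one feature vector."""
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Toy labeled click log: [neighboring pair count, distinct-user count] -> label.
LABELED_FEATURES = [[4, 4], [5, 3], [3, 3], [1, 1], [0, 0], [2, 1]]
LABELS = [1, 1, 1, 0, 0, 0]
MODEL = train(LABELED_FEATURES, LABELS)

if __name__ == "__main__":
    # Unlabeled pairs from a separate click log.
    for x in ([5, 4], [1, 0]):
        print(x, round(likelihood(MODEL, x), 3))
```

The learned weights play the role of the weight coefficients of the determinants; any classifier that outputs a score between 0 and 1 could be substituted under the same interface.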

The click-log correct-answer likelihood calculation device according to the embodiment of the third invention specifies processing that generates the concept vector of each search-target document while taking the estimated correct-answer likelihood into account. The device extends a document by adding to it each query that corresponds to the document in the click log used in the estimation processing. However, the semantic content of each query is reflected only in proportion to the correct-answer likelihood of the pair of the document and the query. To this end, when the concept vector of the document is generated, the concept vector of each word in the query is multiplied by the correct-answer likelihood and then added in. The queries of pairs with a low correct-answer likelihood are unrelated to the document and are therefore excluded from this processing. Because the semantic content of each added query is reflected in the concept vector of the document only to the extent of its correct-answer likelihood, the influence of the queries of incorrect pairs is removed and the concept vector of the document becomes more appropriate.
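The weighted extension described above can be sketched as follows. The weighting scheme follows the text (document text weighted 1, each query text weighted by its correct-answer likelihood, queries at or below a threshold excluded, then summation and normalization); the toy word concept base, the threshold value of 0.5, and the name `document_concept_vector` are assumptions made for illustration.

```python
import math

# Hypothetical toy word concept base: word -> concept vector.
BASE = {
    "printer": [1.0, 0.0, 0.0],
    "jam": [1.0, 0.0, 0.0],
    "stuck": [0.0, 1.0, 0.0],
    "weather": [0.0, 0.0, 1.0],
}


def document_concept_vector(doc_words, clicked_queries, base, threshold=0.5):
    """Build a document concept vector extended by its clicked queries.

    clicked_queries is a list of (query_words, correct_answer_likelihood).
    Queries whose likelihood is at or below the threshold are excluded.
    """
    texts = [(doc_words, 1.0)]  # the document's own text has weight 1
    texts += [(words, p) for words, p in clicked_queries if p > threshold]

    dim = len(next(iter(base.values())))
    total = [0.0] * dim
    for words, weight in texts:
        for word in words:
            if word in base:  # words missing from the concept base are skipped
                total = [t + weight * x for t, x in zip(total, base[word])]

    norm = math.sqrt(sum(x * x for x in total))
    return [x / norm for x in total] if norm > 0 else total


if __name__ == "__main__":
    vec = document_concept_vector(
        ["printer", "jam"],
        [(["stuck"], 0.8), (["weather"], 0.1)],  # second query is excluded
        BASE,
    )
    print([round(x, 3) for x in vec])
```

In this toy case the query with likelihood 0.8 contributes to the second dimension, while the query with likelihood 0.1 is dropped and leaves the third dimension at zero.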

The document retrieval device according to the embodiment of the fourth invention specifies a concept search over the set of concept vectors of the search target documents generated by the click log correct-answer likelihood calculation device according to the embodiment of the third invention. When a new query is input whose semantic content is close to that of a query added when the concept vector of a certain document was generated, that document is a correct search result for the new query. Since the semantic content of the expanded document includes the semantic content of the new query, the similarity between the concept vector of the new query and the concept vector of the document is higher after the expansion than before it. Retrieval accuracy is thereby higher than with the conventional method.

Each pair in the click log that is input to the learning process carries a label indicating whether it is correct, and assigning these labels incurs a manual cost. However, once the labeling has been done for one set of click log data and a classification model has been generated through the learning process, the classification model can be used to perform the estimation process on the click log data of an entirely different set of search target documents and of the document retrieval system that uses them, and accurate correct-answer likelihoods can be estimated. The cost of manual labeling is therefore incurred only once, at the outset.

As described above, the embodiment of the present invention estimates, for each pair in the click log, the correct-answer likelihood, that is, the degree to which the document of the pair is a correct search result for the query of the pair, and generates the concept vectors of the search target documents taking this likelihood into account. The influence of queries from incorrect pairs is thereby removed, with the effect that retrieval accuracy improves.

<Configuration of the model generation device>
The configuration of the model generation device according to the embodiment of the present invention will be described. FIG. 5 shows a configuration example of the model generation device according to the embodiment of the first invention. As shown in FIG. 5, the model generation device 100 according to the embodiment of the present invention can be configured as a computer including a CPU, a RAM, and a ROM that stores various data and a program for executing the processing routines described later. Functionally, the model generation device 100 includes input means 10 and computing means 20, as shown in FIG. 5.

FIG. 6 shows an example of a search target document group. Each document consists of a document ID and a document text, and is uniquely identified by its document ID.

FIG. 7 is a table showing an example of a click log accumulated in a document retrieval system whose search target is the document group of FIG. 6. Each record consists of a query in the click log and the list of clicked documents corresponding to that query.

A query consists of a query ID and a query text. The query ID consists of a user ID, which identifies the user who entered the query, and an input ID, which indicates the ordinal position of the query among that user's inputs. A query is uniquely identified by its query ID, and different query IDs may have identical text. For some query IDs the clicked-document list is empty.

A pair in the click log can also be regarded as a pair of a query ID and a clicked document ID corresponding to that query ID.
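
The record structure described above can be sketched as follows. This is only an illustration under stated assumptions: the names QueryId, ClickRecord, and click_pairs, as well as the sample IDs and texts, are hypothetical and do not appear in the original. The "helpful" and "not helpful" columns are omitted here for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class QueryId:
    """Query ID = (user ID, input ID); the field names are assumptions."""
    user_id: str
    input_id: int


@dataclass
class ClickRecord:
    """One click-log record: a query and the documents clicked for it."""
    query_id: QueryId
    query_text: str
    clicked_doc_ids: List[str] = field(default_factory=list)  # may be empty


def click_pairs(log: List[ClickRecord]) -> List[Tuple[QueryId, str]]:
    """Flatten the log into (query ID, clicked document ID) pairs."""
    return [(rec.query_id, doc_id) for rec in log for doc_id in rec.clicked_doc_ids]


# Hypothetical sample: two query IDs share the same text, one has no clicks.
log = [
    ClickRecord(QueryId("u1", 1), "reset password", ["d1", "d3"]),
    ClickRecord(QueryId("u2", 1), "reset password", ["d1"]),
    ClickRecord(QueryId("u2", 2), "change email", []),
]
pairs = click_pairs(log)
```

A query ID with an empty clicked-document list contributes no pairs, which matches the definition of a pair as a query ID together with a clicked document ID.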

Clicked documents comprise documents for which the "helpful" button was pressed (the "helpful" column in the table), documents for which the "not helpful" button was pressed (the "not helpful" column in the table), and documents for which neither button was pressed (the "−" column in the table). If the retrieval system has no "helpful" and "not helpful" buttons, the table has no "helpful" or "not helpful" column.

The input means 10 receives as input a labeled click log, in which each pair carries a label indicating whether it is correct. FIG. 8 shows an example of data in which such a label has been assigned to each pair in the click log. Correct pairs are labeled 1 and incorrect pairs are labeled 0.

The computing means 20 includes a word concept base 22, concept vector generation means 24, feature vector generation means 26, classification model generation means 28, and a classification model storage unit 30.

The word concept base 22 is a set of pairs, each consisting of a word and a concept vector representing the concept of that word. FIG. 9 shows an example of the word concept base 22. The word concept base 22 is generated, for example, by the method of Non-Patent Document 1. Only content words such as nouns, verbs, and adjectives may be registered in the word concept base 22. Words are registered in the word concept base 22 in their terminal (dictionary) form, and lookups in the word concept base 22 use the terminal form of the word. The concept vector of each word is a d-dimensional vector normalized to length 1, and the concept vectors of conceptually close words are located close to one another.

For the text of each query and each document in the click log, the concept vector generation means 24 generates a concept vector of the text by combining the concept vectors, in the word concept base 22, of the words in the text. This is described specifically below.

For each query ID in the click log, the concept vector generation means 24 adds together the concept vectors, in the word concept base 22, of the words in the text of the query ID and normalizes the sum to length 1, thereby generating the concept vector of the query ID. FIG. 10 shows an example of a set of pairs of a query ID and its concept vector.

The concept vector generation means 24 also uses the data shown in FIG. 6, or data equivalent to FIG. 6 created only from the distinct document IDs in the click log. For each document ID in the data, it adds together the concept vectors, in the word concept base 22, of the words in the text of the document ID and normalizes the sum to length 1, thereby generating the concept vector of the document ID. FIG. 11 shows an example of a set of pairs of a document ID and its concept vector.
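
The add-and-normalize step used for both query texts and document texts can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: it assumes the text has already been split into words in terminal form, it skips words that are absent from the word concept base (the original does not say how such words are handled), and the two-dimensional vectors and the function names are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to length 1; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def text_concept_vector(words, word_concept_base, dim):
    """Add the concept vectors of the words in a text and normalize the sum."""
    total = [0.0] * dim
    for word in words:
        vec = word_concept_base.get(word)
        if vec is None:
            continue  # assumption: words missing from the base are ignored
        total = [t + x for t, x in zip(total, vec)]
    return normalize(total)


# Hypothetical word concept base with d = 2.
word_concept_base = {"password": [1.0, 0.0], "reset": [0.0, 1.0]}
query_vector = text_concept_vector(["password", "reset", "please"], word_concept_base, 2)
```

The same function would be applied to each document text, so queries and documents end up as unit vectors in the same d-dimensional space.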

For any pair in the click log, the feature vector generation means 26 generates the feature vector of the pair by extracting features that include the number of in-neighborhood pairs or the number of distinct users associated with the in-neighborhood pairs. Here, an in-neighborhood pair is a pair in the click log whose query concept vector, as generated by the concept vector generation means 24, lies in the neighborhood of the concept vector of the query of the given pair, and whose document concept vector lies in the neighborhood of the concept vector of the document of the given pair. This is described specifically below.

The feature vector generation means 26 takes as a neighborhood the interior of a circle of predetermined radius r centered on the concept vector of the target query ID that constitutes the target pair. Likewise, it takes as a neighborhood the interior of a circle of predetermined radius r' centered on the concept vector of the target document ID that constitutes the target pair. A pair in the click log whose query ID has a concept vector within the neighborhood of the target query ID and whose document ID has a concept vector within the neighborhood of the target document ID is called a pair in the neighborhood of the target pair.
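
The neighborhood test above can be sketched as follows. The sketch makes several assumptions that the original does not state: distance is Euclidean, "interior" is read as a strict inequality, and the target pair itself is left out because the features listed afterwards count the other pairs. Query IDs are written as (user ID, input ID) tuples, and all names and sample vectors are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two vectors of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def neighbor_pairs(target, pairs, query_vecs, doc_vecs, r, r_doc):
    """Return the other pairs whose query lies within r of the target query
    and whose document lies within r_doc of the target document."""
    target_query, target_doc = target
    found = []
    for query_id, doc_id in pairs:
        if (query_id, doc_id) == target:
            continue  # the target pair itself is not counted
        near_query = euclidean(query_vecs[query_id], query_vecs[target_query]) < r
        near_doc = euclidean(doc_vecs[doc_id], doc_vecs[target_doc]) < r_doc
        if near_query and near_doc:
            found.append((query_id, doc_id))
    return found


# Hypothetical concept vectors (d = 2) and click-log pairs.
query_vecs = {("u1", 1): [1.0, 0.0], ("u2", 1): [0.8, 0.6], ("u3", 1): [0.0, 1.0]}
doc_vecs = {"d1": [1.0, 0.0], "d2": [0.6, 0.8], "d3": [0.0, 1.0]}
pairs = [
    (("u1", 1), "d1"), (("u2", 1), "d1"), (("u2", 1), "d2"),
    (("u3", 1), "d1"), (("u1", 1), "d3"),
]
neighbors = neighbor_pairs((("u1", 1), "d1"), pairs, query_vecs, doc_vecs, r=0.7, r_doc=1.0)
```

This linear scan is enough to illustrate the definition; for a large click log a spatial index would likely be needed, which is outside what the original describes.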

Examples of the features generated by the feature vector generation means 26 include the following.

(1) The number of other pairs in the neighborhood
(2) The number of distinct user IDs constituting the query IDs of the other pairs in the neighborhood

The following features, which express the counts (1) and (2) as ratios, can also be used.

(3) The ratio of the count in (1) to [the number of distinct query IDs in the click log] × [the number of search target document IDs]
(4) The ratio of the count in (2) to [the number of distinct user IDs constituting the query IDs in the click log]

As the number of search target documents and the number of input queries grow, the counts (1) and (2) in the neighborhood generally grow in proportion, so the features (3) and (4) are feature quantities that do not depend on the amount of data.
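
Features (1) to (4) can be computed from the list of other pairs in the neighborhood as sketched below. The function name, argument names, and sample data are assumptions; query IDs are written as (user ID, input ID) tuples, and the click log and the document group are assumed to be non-empty so that the denominators are positive.

```python
def count_features(neighbors, all_pairs, num_search_docs):
    """Return features (1)-(4) for one target pair.

    neighbors: other pairs in the neighborhood of the target pair
    all_pairs: every (query ID, document ID) pair in the click log
    num_search_docs: number of search target document IDs
    """
    num_pairs = len(neighbors)                                    # feature (1)
    num_users = len({query_id[0] for query_id, _ in neighbors})   # feature (2)
    num_query_ids = len({query_id for query_id, _ in all_pairs})
    num_all_users = len({query_id[0] for query_id, _ in all_pairs})
    pair_ratio = num_pairs / (num_query_ids * num_search_docs)    # feature (3)
    user_ratio = num_users / num_all_users                        # feature (4)
    return [num_pairs, num_users, pair_ratio, user_ratio]


# Hypothetical click log with 4 distinct query IDs from 3 users, 5 search documents.
all_pairs = [(("u1", 1), "d1"), (("u1", 2), "d2"), (("u2", 1), "d1"), (("u3", 1), "d3")]
neighbors = [(("u1", 2), "d2"), (("u2", 1), "d1")]
features = count_features(neighbors, all_pairs, num_search_docs=5)
```

In this toy case the two ratios are 2 / (4 × 5) and 2 / 3; unlike the raw counts, they stay comparable when the same classification model is applied to a click log of a different size.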

When the retrieval system has a "helpful" button and a "not helpful" button, the following are further examples of features.

(5) Whether the target pair is marked "helpful"
(6) Whether the target pair is marked "not helpful"
(7) The number of other pairs in the neighborhood that are marked "helpful"
(8) The number of other pairs in the neighborhood that are marked "not helpful"
(9) The number of distinct user IDs constituting the query IDs of the other pairs in the neighborhood that are marked "helpful"
(10) The number of distinct user IDs constituting the query IDs of the other pairs in the neighborhood that are marked "not helpful"
(11) The ratio of the count in (7) to [the number of distinct query IDs in the click log] × [the number of search target document IDs]
(12) The ratio of the count in (8) to [the number of distinct query IDs in the click log] × [the number of search target document IDs]
(13) The ratio of the count in (9) to [the number of distinct user IDs constituting the query IDs in the click log]
(14) The ratio of the count in (10) to [the number of distinct user IDs constituting the query IDs in the click log]
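
Features (5) to (14) follow the same pattern as the plain counts and ratios, restricted to pairs with a given button press. The sketch below assumes that feedback is held in a dictionary mapping a pair to the string "helpful" or "not_helpful", with pairs for which neither button was pressed simply absent; this representation, the function name, and the sample data are hypothetical, and the three totals passed in are assumed to be positive.

```python
def feedback_features(target, neighbors, feedback, num_query_ids, num_doc_ids, num_user_ids):
    """Return features (5)-(14) for one target pair."""
    helpful = [p for p in neighbors if feedback.get(p) == "helpful"]
    unhelpful = [p for p in neighbors if feedback.get(p) == "not_helpful"]
    helpful_users = {query_id[0] for query_id, _ in helpful}
    unhelpful_users = {query_id[0] for query_id, _ in unhelpful}
    pair_total = num_query_ids * num_doc_ids
    return [
        int(feedback.get(target) == "helpful"),       # (5)
        int(feedback.get(target) == "not_helpful"),   # (6)
        len(helpful),                                 # (7)
        len(unhelpful),                               # (8)
        len(helpful_users),                           # (9)
        len(unhelpful_users),                         # (10)
        len(helpful) / pair_total,                    # (11)
        len(unhelpful) / pair_total,                  # (12)
        len(helpful_users) / num_user_ids,            # (13)
        len(unhelpful_users) / num_user_ids,          # (14)
    ]


# Hypothetical target pair, neighbors, and button presses.
target = (("u1", 1), "d1")
neighbors = [(("u2", 1), "d1"), (("u2", 2), "d2"), (("u3", 1), "d1")]
feedback = {
    target: "helpful",
    neighbors[0]: "helpful",
    neighbors[1]: "helpful",
    neighbors[2]: "not_helpful",
}
features = feedback_features(target, neighbors, feedback, 4, 5, 3)
```

Here user u2 pressed "helpful" on two neighboring pairs but is counted once in feature (9), which is the distinction between counting pairs and counting distinct users.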

In addition, the other pairs in the neighborhood can be divided into the following groups, and features can be defined for each group.

・Pairs whose query ID is the same as the target query ID and whose document ID differs from the target document ID
・Pairs whose query ID differs from the target query ID and whose document ID is the same as the target document ID
・Pairs whose query ID differs from the target query ID and whose document ID differs from the target document ID
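
The three groups above can be separated with a simple comparison of IDs, as in the following sketch. The group labels, the function name, and the sample pairs are hypothetical; the input is assumed to be the list of other pairs in the neighborhood, so the target pair itself does not occur in it.

```python
def split_neighbors(target, neighbors):
    """Divide the other pairs in the neighborhood into the three groups."""
    target_query, target_doc = target
    groups = {
        "same_query_other_doc": [],
        "other_query_same_doc": [],
        "other_query_other_doc": [],
    }
    for query_id, doc_id in neighbors:
        if query_id == target_query and doc_id != target_doc:
            groups["same_query_other_doc"].append((query_id, doc_id))
        elif query_id != target_query and doc_id == target_doc:
            groups["other_query_same_doc"].append((query_id, doc_id))
        elif query_id != target_query and doc_id != target_doc:
            groups["other_query_other_doc"].append((query_id, doc_id))
    return groups


# Hypothetical target pair and neighbors.
target = (("u1", 1), "d1")
neighbors = [(("u1", 1), "d2"), (("u2", 1), "d1"), (("u3", 1), "d2"), (("u2", 2), "d1")]
groups = split_neighbors(target, neighbors)
```

Each group could then be passed separately to a counting routine, giving one set of counts and ratios per group.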

The feature vector generation means 26 obtains the value of each of the above features for the target pair and generates a feature vector. FIG. 12 shows an example of a set of combinations, each consisting of a pair in the click log and its corresponding feature vector.

The classification model generation means 28 generates a classification model for calculating the correct-answer likelihood of an arbitrary feature vector from the set of combinations of feature vector and label for the pairs in the click log, where the feature vectors are those generated by the feature vector generation means 26. This is described specifically below.

For each pair in the click log there is a corresponding feature vector in FIG. 12 and a corresponding label in FIG. 8, which together form a combination of feature vector and label. The classification model generation means 28 feeds the set of these combinations to a machine learner such as a support vector machine and generates a classification model for solving the binary classification of whether the label is 1 or 0. The classification model may output a value between 0 and 1 as the correct-answer likelihood.
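
The learning step can be illustrated with a small self-contained classifier. Note the substitution: the original names a support vector machine as an example learner, whereas this sketch uses logistic regression trained by plain gradient descent, chosen only because it fits in a few lines without external libraries and directly outputs a value between 0 and 1. The function names, hyperparameters, and the two-dimensional feature vectors are all hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_classifier(feature_vectors, labels, learning_rate=0.5, epochs=2000):
    """Fit logistic-regression weights by batch gradient descent."""
    dim = len(feature_vectors[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(feature_vectors)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(feature_vectors, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            grad_w = [g + error * v for g, v in zip(grad_w, x)]
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def correct_answer_likelihood(model, feature_vector):
    """Likelihood, between 0 and 1, that the label of the pair is 1."""
    weights, bias = model
    return sigmoid(sum(w * v for w, v in zip(weights, feature_vector)) + bias)


# Hypothetical labeled data: label 1 = correct pair, label 0 = incorrect pair.
X = [[0.0, 0.0], [0.1, 0.0], [0.9, 1.0], [1.0, 0.8]]
y = [0, 0, 1, 1]
model = train_classifier(X, y)
```

In practice the raw counts and the ratios have very different scales, so some feature scaling would probably be applied before training; the original does not discuss this.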

The classification model storage unit 30 stores the classification model generated by the classification model generation means 28.

<Configuration of the click log correct-answer likelihood calculation device>
Next, the configuration of the click log correct-answer likelihood calculation device according to the embodiment of the present invention will be described. FIG. 13 shows a configuration example of the click log correct-answer likelihood calculation device according to the embodiment of the second or third invention. As shown in FIG. 13, the click log correct-answer likelihood calculation device 200 according to the embodiment of the present invention can be configured as a computer including a CPU, a RAM, and a ROM that stores various data and a program for executing the processing routines described later. The click log correct-answer likelihood calculation device 200 operates on a click log, that is, a set of pairs each consisting of a query entered by a user and a document the user clicked among the search result documents for that query, collected in a document retrieval system whose search target document group is not necessarily the same as that of the model generation device 100. For each pair in the click log, it calculates the correct-answer likelihood, which is the degree to which the document of the pair is a correct search result for the query of the pair. Functionally, the click log correct-answer likelihood calculation device 200 includes input means 40 and computing means 50, as shown in FIG. 13.

The input means 40 receives as input a click log that is separate from the one used in the processing of the model generation device 100 and whose pairs carry no label indicating whether they are correct. A pair in this click log is likewise a pair of a query ID and a clicked document ID corresponding to that query ID. The data format of the click log is that of FIG. 7, and that of FIG. 8 with the label column removed.

The computing means 50 includes a word concept base 52, concept vector generation means 54, feature vector generation means 56, a classification model storage unit 60, correct-answer likelihood estimation means 62, a search target document group storage unit 64, document concept base generation means 66, and a document concept base 68.

The word concept base 52 is a set of pairs, each consisting of a word and a concept vector representing the concept of that word. Its data format is the same as that of FIG. 9. The contents of the word concept base 52 may differ from those of the word concept base 22 of the model generation device 100.

For the text of each query and each document in the click log received by the input means 40, the concept vector generation means 54 generates a concept vector of the text by combining the concept vectors, in the word concept base 52, of the words in the text. The processing of the concept vector generation means 54 is the same as that of the concept vector generation means 24 of the model generation device 100. If the search target document group and the word concept base 52 are identical to those of the model generation device 100, and the concept vector generation means 24 of the model generation device 100 has already generated concept vectors for all search target document IDs, the concept vectors of the document IDs need not be generated again, and those generated by the concept vector generation means 24 may be used as they are.

For any pair in the click log received by the input means 40, the feature vector generation means 56 generates the feature vector of the pair by extracting features that include the number of in-neighborhood pairs or the number of distinct users associated with the in-neighborhood pairs. Here, an in-neighborhood pair is a pair in the click log whose query concept vector lies in the neighborhood of the concept vector of the query of the given pair, and whose document concept vector lies in the neighborhood of the concept vector of the document of the given pair. The processing of the feature vector generation means 56 is the same as that of the feature vector generation means 26 of the model generation device 100. For each pair in the click log, it extracts the values of the same features as those extracted by the feature vector generation means 26 of the model generation device 100 and generates a feature vector.

The classification model stored in the classification model storage unit 60 is identical to the classification model generated by the processing of the model generation device 100.

For any pair in the click log, the correct-answer likelihood estimation means 62 estimates the correct-answer likelihood of the pair from the feature vector of the pair generated by the feature vector generation means 56 and the classification model stored in the classification model storage unit 60. Specifically, for each pair in the click log, the likelihood that the classification label is 1 is calculated from the corresponding feature vector and the classification model, and this likelihood is taken as the estimated correct-answer likelihood of the pair. The classification model may output a value between 0 and 1 as the correct-answer likelihood. FIG. 14 shows an example of pairs in the click log and their corresponding correct-answer likelihoods.

The search target document group storage unit 64 stores the search target document group. Its data format is that of FIG. 6. The contents of the search target document group may differ from those of the search target document group of the model generation device 100.

For each search target document in the search target document group storage unit 64, the document concept base generation means 66 sets the weight of the document's own text to 1, sets the weight of the text of each query corresponding to the document in the click log to the correct-answer likelihood of the pair of the document and that query, and excludes the text of any query whose weight is at or below a certain threshold. It then multiplies the concept vector, in the word concept base 52, of each word contained in each text by the weight of the text to which the word belongs, adds these weighted concept vectors over every word of every text, and normalizes the sum to generate the concept vector of the document. In this way it generates the document concept base 68, which is a set of pairs of a document and its concept vector. This is described specifically below.

As shown in FIG. 15, for a search target document ID, the weight of the text of the document ID is set to 1, and the weights of the texts of the query IDs corresponding to the document ID in the click log (query 1, query 2, query 3) are set to the correct-answer likelihoods of the pairs of the document ID and each query ID (0.9, 0.6, 0.3). A weight threshold α is determined in advance. With α = 0.4, the weight of query 3 is 0.3, which is at or below α, so query 3 is excluded from the subsequent processing. The concept vector, in the word concept base 52, of each word contained in each text is multiplied by the weight of the text to which the word belongs, and these weighted concept vectors are added over every word of every text and normalized to generate the concept vector. In this example, the concept vector of the document ID is as follows.

(1 × concept vector of word 1 + 1 × concept vector of word 2 + 0.9 × concept vector of word 3 + 0.9 × concept vector of word 4 + 0.6 × concept vector of word 5 + 0.6 × concept vector of word 6)

The concept vector of the document ID is then normalized to length 1.
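
The weighted sum and normalization in this example can be reproduced with the sketch below. It reads the weights in the formula as words 1 and 2 belonging to the document text, words 3 and 4 to query 1, and words 5 and 6 to query 2; the word in query 3 ("word7") is invented for illustration. The one-hot concept vectors are a hypothetical stand-in chosen so that each weight is visible in the result, and the function name is an assumption.

```python
import math


def document_concept_vector(doc_words, weighted_queries, word_concept_base, dim, alpha):
    """Build a document concept vector from its own text (weight 1) and from
    query texts weighted by correct-answer likelihood; queries whose weight
    is at or below alpha are excluded."""
    texts = [(doc_words, 1.0)]
    texts += [(words, weight) for words, weight in weighted_queries if weight > alpha]
    total = [0.0] * dim
    for words, weight in texts:
        for word in words:
            vec = word_concept_base.get(word)
            if vec is None:
                continue  # assumption: words missing from the base are ignored
            total = [t + weight * x for t, x in zip(total, vec)]
    norm = math.sqrt(sum(t * t for t in total))
    return [t / norm for t in total] if norm > 0 else total


# Hypothetical one-hot word concept base for word1 ... word7 (d = 7).
base = {f"word{i + 1}": [1.0 if j == i else 0.0 for j in range(7)] for i in range(7)}
queries = [
    (["word3", "word4"], 0.9),  # query 1
    (["word5", "word6"], 0.6),  # query 2
    (["word7"], 0.3),           # query 3, excluded because 0.3 <= alpha
]
doc_vector = document_concept_vector(["word1", "word2"], queries, base, 7, alpha=0.4)
```

Before normalization the sum has components 1, 1, 0.9, 0.9, 0.6, 0.6, and 0, so the excluded query 3 contributes nothing to the document's concept vector.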

The document concept base generation means 66 then generates the document concept base 68, which is the set of pairs of each search target document ID and the concept vector of that document ID.

The document concept base 68 stores the set of pairs of each search target document ID and its concept vector generated by the document concept base generation means 66. The data format of the document concept base 68 is that of FIG. 11.

<Configuration of the document retrieval device>
Next, the configuration of the document retrieval device according to the embodiment of the present invention will be described. FIG. 16 shows a configuration example of the document retrieval device according to the embodiment of the fourth invention. As shown in FIG. 16, the document retrieval device 300 according to the embodiment of the present invention can be configured as a computer including a CPU, a RAM, and a ROM that stores various data and a program for executing the processing routines described later. Functionally, the document retrieval device 300 includes input means 70, computing means 80, and output means 90, as shown in FIG. 16.

The input means 70 receives a new query as input.

The computing means 80 includes a word concept base 82, concept vector generation means 84, a document concept base 86, and similarity calculation means 88.

The word concept base 82 is a set of pairs, each consisting of a word and a concept vector representing the concept of that word. The word concept base 82 is identical to the word concept base 52 used in the processing of the click log correct-answer likelihood calculation device 200.

The concept vector generation means 84 generates the concept vector of the query received by the input means 70 by combining the concept vectors, in the word concept base 82, of the words in the query. Specifically, it adds together the concept vectors, in the word concept base 82, of the words in the query and normalizes the sum to length 1. The processing of the concept vector generation means 84 is the same as that of the concept vector generation means 24 of the model generation device 100 and the concept vector generation means 54 of the click log correct-answer likelihood calculation device 200, except that its input is a query.

The document concept base 86 is a set of pairs, each consisting of a document and a concept vector representing the concept of that document. The document concept base 86 is identical to the document concept base 68 generated by the processing of the click log correct-answer likelihood calculation device 200.

For each document ID in the document concept base 86, the similarity calculation means 88 calculates the similarity between the concept vector of the query generated by the concept vector generation means 84 and the concept vector of the document ID. The inner product, for example, may be used as the similarity. As the search result, documents are displayed ranked in descending order of similarity, or alternatively documents whose similarity is at or above a certain threshold are displayed.
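
The similarity calculation and the two display options can be sketched as follows, using the inner product as the similarity. The function name, the optional threshold argument, and the sample vectors are assumptions made for illustration.

```python
def rank_documents(query_vector, document_concept_base, threshold=None):
    """Score every document by inner product with the query vector and
    return (document ID, similarity) in descending order of similarity.
    If a threshold is given, only documents at or above it are kept."""
    scored = [
        (doc_id, sum(q * d for q, d in zip(query_vector, doc_vector)))
        for doc_id, doc_vector in document_concept_base.items()
    ]
    if threshold is not None:
        scored = [(doc_id, score) for doc_id, score in scored if score >= threshold]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical document concept base (d = 2) and query concept vector.
document_concept_base = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [0.6, 0.8]}
ranking = rank_documents([1.0, 0.0], document_concept_base)
filtered = rank_documents([1.0, 0.0], document_concept_base, threshold=0.5)
```

Because both the query vector and the document vectors are normalized to length 1, the inner product here equals the cosine similarity.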

FIG. 17 shows an example of the processing flow executed by the model generation device 100. When a click log in which each pair carries a label indicating whether it is correct is input to the model generation device 100, the model generation processing routine shown in FIG. 17 is executed.

First, in step S100, the input means 10 acquires the click log in which each pair carries a label indicating whether it is correct.

Then, in step S102, for the text of each query and each document in the click log acquired in step S100, the concept vector generation means 24 generates a concept vector of the text by combining the concept vectors, in the word concept base 22, of the words in the text.

In step S104, for any pair in the click log acquired in step S100, the feature vector generation means 26 generates the feature vector of the pair by extracting features that include the number of in-neighborhood pairs or the number of distinct users associated with the in-neighborhood pairs. Here, an in-neighborhood pair is a pair in the click log whose query concept vector lies in the neighborhood of the concept vector, generated in step S102, of the query of the given pair, and whose document concept vector lies in the neighborhood of the concept vector of the document of the given pair.

In step S106, the classification model generation means 28 generates a classification model for calculating the correct-answer likelihood of an arbitrary feature vector from the set of combinations of feature vector and label for the pairs in the click log, using the feature vectors generated in step S104.

In step S108, the classification model generation means 28 stores the classification model generated in step S106 in the classification model storage unit 30 and ends the model generation processing routine.

FIG. 18 shows an example of the processing flow executed by the click log correct-answer likelihood calculation device 200. When a click log is input to the click log correct-answer likelihood calculation device 200, the correct-answer likelihood calculation processing routine shown in FIG. 18 is executed.

First, in step S200, the input means 40 acquires, as input, a click log whose constituent pairs are not labeled as correct or incorrect.

Next, in step S202, the concept vector generation means 54 generates, for the text of each query and each document in the click log acquired in step S200, a concept vector of the text by synthesizing the concept vectors, in the word concept base, of the words in the text.

Next, in step S204, the feature vector generation means 56 generates a feature vector for any pair in the click log acquired in step S200. Here, an in-neighborhood pair is a pair in the click log whose query concept vector lies within a neighborhood of the concept vector, generated in step S202, of the target pair's query and whose document concept vector lies within a neighborhood of the concept vector of the target pair's document. The feature vector of the target pair is generated by extracting features that include the number of in-neighborhood pairs or the number of distinct users associated with the in-neighborhood pairs.

Then, in step S206, the correct-answer likelihood estimation means 62 estimates, for any pair in the click log acquired in step S200, the correct-answer likelihood of the pair from the feature vector of the pair generated in step S204 and the classification model stored in the classification model storage unit 60.

Then, in step S208, the document concept base generation means 66 processes each search-target document in the search-target document group storage unit 64 as follows. The weight of the document's own text is set to 1, and the weight of the text of each query corresponding to the document in the click log is set to the correct-answer likelihood of the document-query pair estimated in step S206. Query texts whose weight is at or below a given threshold are excluded. For each remaining text, the concept vector in the word concept base 52 of each word in the text is multiplied by the weight of the text to which the word belongs; these weighted vectors are summed over all words of all texts, and the sum is normalized to give the concept vector of the document. The document concept base generation means 66 then generates the document concept base 68, which is the set of pairs of a document and the concept vector of the document, and ends the correct-answer likelihood calculation processing routine.
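The weighted sum of step S208 can be sketched as follows. The sketch assumes that normalization means scaling to unit Euclidean length and that words missing from the word concept base are skipped; the function name, the threshold value 0.5, and the toy data are hypothetical.

```python
import math

def document_concept_vector(doc_words, weighted_queries, word_concept_base, threshold=0.5):
    """Build a document concept vector from the document text and its clicked queries.

    weighted_queries is a list of (query_words, correct_answer_likelihood) tuples.
    """
    # The document's own text has weight 1; queries at or below the threshold are dropped.
    texts = [(doc_words, 1.0)]
    texts += [(words, weight) for words, weight in weighted_queries if weight > threshold]
    dim = len(next(iter(word_concept_base.values())))
    total = [0.0] * dim
    for words, weight in texts:
        for word in words:
            vec = word_concept_base.get(word)
            if vec is not None:
                total = [t + weight * v for t, v in zip(total, vec)]
    norm = math.sqrt(sum(t * t for t in total))
    return [t / norm for t in total] if norm > 0 else total

# Hypothetical toy data: one document with two clicked queries.
word_concept_base = {"hotel": [1.0, 0.0], "cheap": [0.0, 1.0], "spam": [-1.0, -1.0]}
vector = document_concept_vector(
    ["hotel"],
    [(["cheap"], 0.75), (["spam"], 0.2)],
    word_concept_base,
)
```

Here the query with likelihood 0.2 is discarded, so the unnormalized sum is [1.0, 0.75] and the resulting unit vector is [0.8, 0.6]. Queries judged likely to be correct pull the document vector toward their own concepts, while unreliable clicks contribute nothing.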

FIG. 19 shows an example of the processing flow executed by the document search device 300. When a new query is input to the document search device 300, the document search processing routine shown in FIG. 19 is executed.

In step S300, the input means 70 acquires a new query as input.

In step S302, the concept vector generation means 84 generates a concept vector of the query acquired in step S300 by synthesizing the concept vectors, in the word concept base 82, of the words in the query.

In step S304, the similarity calculation means 88 calculates, for each document ID in the document concept base 86, the similarity between the concept vector of the query generated in step S302 and the concept vector associated with the document ID.

In step S306, the output means 90 displays, as the search result, the documents ranked in descending order of the similarity calculated in step S304, and ends the document search processing routine.
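Steps S304 and S306 amount to scoring every document against the query and sorting by score. The sketch below assumes cosine similarity as the similarity measure, which this passage does not specify; the function names and the toy document concept base are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def rank_documents(query_vector, document_concept_base):
    """Return (document ID, similarity) tuples in descending order of similarity."""
    scored = [
        (doc_id, cosine(query_vector, doc_vector))
        for doc_id, doc_vector in document_concept_base.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical document concept base (document ID -> concept vector).
document_concept_base = {"d1": [1.0, 0.0], "d2": [0.6, 0.8], "d3": [0.0, 1.0]}
ranking = rank_documents([0.8, 0.6], document_concept_base)
```

For the query vector [0.8, 0.6], document d2 scores highest, followed by d1 and then d3.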

The processing described above can be implemented as a program, which can be installed from a communication line or a recording medium and executed by means such as a CPU.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible within the scope of the claims.

The present invention is applicable to concept search technology, which retrieves, from a set of search-target documents, documents that conceptually match a query input by a user.

10, 40, 70 Input means
20, 50, 80 Calculation means
22, 52, 82 Word concept base
24, 54, 84 Concept vector generation means
26, 56 Feature vector generation means
28 Classification model generation means
30, 60 Classification model storage unit
62 Correct-answer likelihood estimation means
64 Search-target document group storage unit
66 Document concept base generation means
68, 86 Document concept base
88 Similarity calculation means
90 Output means
100 Model generation device
200 Click log correct-answer likelihood calculation device
300 Document search device

Claims

A model generation device that takes as input a click log that is a set of pairs of a query and a document, each constituent pair being labeled as correct or incorrect, the device comprising:
a word concept base, which is a set of pairs of a word and a concept vector representing the concept of the word;
concept vector generation means for generating, for the text of each query and each document in the click log, a concept vector of the text by synthesizing the concept vectors, in the word concept base, of the words in the text;
feature vector generation means for generating, for any pair in the click log, a feature vector of the pair by extracting features including the number of in-neighborhood pairs or the number of distinct users associated with the in-neighborhood pairs, where an in-neighborhood pair is a pair in the click log whose query concept vector lies within a neighborhood of the concept vector of the pair's query and whose document concept vector lies within a neighborhood of the concept vector of the pair's document; and
classification model generation means for generating, from the set of pairs of a feature vector and a label for the pairs in the click log, a classification model for calculating the correct-answer likelihood of an arbitrary feature vector.

A click log correct-answer likelihood calculation device that, for a click log that is a set of pairs of a query input by a user in a document search system and a document clicked by the user among the search-result documents for the query, calculates a correct-answer likelihood, which is the degree to which the document of a pair is correct as a search result for the query of the pair, the device taking the click log as input and comprising:
a word concept base, which is a set of pairs of a word and a concept vector representing the concept of the word;
a classification model for calculating the correct-answer likelihood of an arbitrary feature vector;
concept vector generation means for generating, for the text of each query and each document in the click log, a concept vector of the text by synthesizing the concept vectors, in the word concept base, of the words in the text;
feature vector generation means for generating, for any pair in the click log, a feature vector of the pair by extracting features including the number of in-neighborhood pairs or the number of distinct users associated with the in-neighborhood pairs, where an in-neighborhood pair is a pair in the click log whose query concept vector lies within a neighborhood of the concept vector of the pair's query and whose document concept vector lies within a neighborhood of the concept vector of the pair's document; and
correct-answer likelihood estimation means for estimating, for any pair in the click log, the correct-answer likelihood of the pair from the feature vector of the pair and the classification model.

The click log correct-answer likelihood calculation device according to claim 2, further comprising document concept base generation means that, for each search-target document, sets the weight of the document's text to 1, sets the weight of the text of each query corresponding to the document in the click log to the correct-answer likelihood of the document-query pair, excludes query texts whose weight is at or below a given threshold, generates a normalized concept vector by summing, over all words of all remaining texts, the concept vector in the word concept base of each word multiplied by the weight of the text to which the word belongs, and generates a document concept base that is a set of pairs of a document and the concept vector of the document.

A document search device that takes a query as input, the device comprising:
a word concept base, which is a set of pairs of a word and a concept vector representing the concept of the word;
a document concept base, which is a set of pairs of a document and a concept vector representing the concept of the document, generated by the click log correct-answer likelihood calculation device according to claim 3;
concept vector generation means for generating a concept vector of the query by synthesizing the concept vectors, in the word concept base, of the words in the query; and
similarity calculation means for calculating, for each document in the document concept base, the similarity between the concept vector of the query and the concept vector of the document.

A model generation method in a model generation device that includes a word concept base, which is a set of pairs of a word and a concept vector representing the concept of the word, concept vector generation means, feature vector generation means, and classification model generation means, and that takes as input a click log that is a set of pairs of a query and a document, each constituent pair being labeled as correct or incorrect, the method comprising:
a step in which the concept vector generation means generates, for the text of each query and each document in the click log, a concept vector of the text by synthesizing the concept vectors, in the word concept base, of the words in the text;
a step in which the feature vector generation means generates, for any pair in the click log, a feature vector of the pair by extracting features including the number of in-neighborhood pairs or the number of distinct users associated with the in-neighborhood pairs, where an in-neighborhood pair is a pair in the click log whose query concept vector lies within a neighborhood of the concept vector of the pair's query and whose document concept vector lies within a neighborhood of the concept vector of the pair's document; and
a step in which the classification model generation means generates, from the set of pairs of a feature vector and a label for the pairs in the click log, a classification model for calculating the correct-answer likelihood of an arbitrary feature vector.

A click log correct-answer likelihood calculation method in a click log correct-answer likelihood calculation device that includes a word concept base, which is a set of pairs of a word and a concept vector representing the concept of the word, a classification model for calculating the correct-answer likelihood of an arbitrary feature vector, concept vector generation means, feature vector generation means, and correct-answer likelihood estimation means, and that, for a click log that is a set of pairs of a query input by a user in a document search system and a document clicked by the user among the search-result documents for the query, calculates a correct-answer likelihood, which is the degree to which the document of a pair is correct as a search result for the query of the pair, the method taking the click log as input and comprising:
a step in which the concept vector generation means generates, for the text of each query and each document in the click log, a concept vector of the text by synthesizing the concept vectors, in the word concept base, of the words in the text;
a step in which the feature vector generation means generates, for any pair in the click log, a feature vector of the pair by extracting features including the number of in-neighborhood pairs or the number of distinct users associated with the in-neighborhood pairs, where an in-neighborhood pair is a pair in the click log whose query concept vector lies within a neighborhood of the concept vector of the pair's query and whose document concept vector lies within a neighborhood of the concept vector of the pair's document; and
a step in which the correct-answer likelihood estimation means estimates, for any pair in the click log, the correct-answer likelihood of the pair from the feature vector of the pair and the classification model.

The click log correct-answer likelihood calculation method according to claim 6, further comprising a step in which document concept base generation means, for each search-target document, sets the weight of the document's text to 1, sets the weight of the text of each query corresponding to the document in the click log to the correct-answer likelihood of the document-query pair, excludes query texts whose weight is at or below a given threshold, generates a normalized concept vector by summing, over all words of all remaining texts, the concept vector in the word concept base of each word multiplied by the weight of the text to which the word belongs, and generates a document concept base that is a set of pairs of a document and the concept vector of the document.

A program for causing a computer to function as each means of the model generation device according to claim 1, the click log correct-answer likelihood calculation device according to claim 2 or claim 3, or the document search device according to claim 4.