JP5700566B2

JP5700566B2 - Scoring model generation device, learning data generation device, search system, scoring model generation method, learning data generation method, search method and program thereof

Info

Publication number: JP5700566B2
Application number: JP2012023886A
Authority: JP
Inventors: 大庭 隆伸; 堀 貴明; 中村 篤; 伊藤 彰則
Original assignee: Tohoku University NUC; Nippon Telegraph and Telephone Corp
Current assignee: Tohoku University NUC; Nippon Telegraph and Telephone Corp
Priority date: 2012-02-07
Filing date: 2012-02-07
Publication date: 2015-04-15
Anticipated expiration: 2032-02-07
Also published as: JP2013161330A

Description

The present invention relates to a learning data generation device and method that generate the learning data used when a scoring model for document search or spoken document search is trained by statistical model learning, and to a search device and method that perform document search with a scoring model trained on the generated learning data.

Document search is the problem of selecting, from a finite set of documents given in advance, the documents closely related to a query that has been entered (hereinafter also referred to as the "input query"). Documents are usually ordered by how closely they relate to the input query. This requires a score representing the degree of relevance between the input query and each document, and the documents are then sorted by that score. The score calculation method is therefore the technical core. A query is a word string specified by the user performing the search (in other words, the word string to be searched for); it may be a sentence, a passage, a phrase, a word, a symbol, or any combination of these. A document is usually a web page, text file, or the like that contains one or more sentences or passages, and is the target of document search.

Spoken document search is document search in which the search targets are spoken documents. A spoken document is an audio file or the like containing recorded speech. Spoken document search is realized by applying speech recognition to the spoken documents to convert them into text and then applying document search techniques. The query may also be given as speech, in which case speech recognition is generally applied to it in the same way. The score representing the degree of relevance between the query and each spoken document is, however, calculated with recognition errors and unknown words (words not registered in the speech recognition system) taken into account. Hereinafter, documents and spoken documents are collectively referred to simply as documents.

Conventionally, document search has calculated the score representing the degree of relevance between the input query and each document by heuristic methods, that is, methods that solve a problem through trial and error, experiment, and examination. However, as seen in many problems in natural language processing, a further improvement in accuracy can be expected by calculating the score with a model generated by statistical model learning. The model used to calculate the score representing the degree of relevance between the input query and each document is called a scoring model. Non-Patent Document 1 is known as prior art that generates a scoring model by statistical model learning. The scoring model is described below.

Let f_{q,d} be the feature vector extracted from a pair of a query q and a document d. The feature vector is extracted according to rules defined in advance. For example, its elements (features) can include the sum, over the words w_i common to the query q and the document d, of the logarithm of the count c(w_i, d) of w_i in the document d, or the sum of the logarithm of the inverse document frequency idf(w_i) of those words. With Φ denoting the parameter vector of the scoring model, the score that the scoring model gives to the pair of the query q and the document d is written S_Φ(f_{q,d}). This score S_Φ(f_{q,d}) represents the degree of relevance between the query q and the document d.
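
As an illustration of the two example features named above, the following is a minimal sketch rather than the extraction rules of any actual system. The function name `extract_features`, the toy data, and the definition idf(w) = (number of documents) / (number of documents containing w) are assumptions made for this example only.

```python
import math
from collections import Counter

def extract_features(query, doc, doc_freq, num_docs):
    """Return [sum of log c(w_i, d), sum of log idf(w_i)] over words shared by q and d.

    idf(w) = num_docs / doc_freq[w] is an assumed definition, not taken from the source.
    """
    counts = Counter(doc)
    common = set(query) & set(doc)
    log_count_sum = sum(math.log(counts[w]) for w in common)
    log_idf_sum = sum(math.log(num_docs / doc_freq[w]) for w in common)
    return [log_count_sum, log_idf_sum]

# Hypothetical toy data: a tokenized query and document, plus collection statistics.
query = ["speech", "search"]
doc = ["speech", "speech", "recognition"]
doc_freq = {"speech": 1, "recognition": 2}
features = extract_features(query, doc, doc_freq, num_docs=2)
```

With only two such features, the feature vector f_{q,d} is two-dimensional, which is the kind of low-dimensional representation the description goes on to contrast with high-dimensional models.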

One widely used scoring model is the linear model. A linear model calculates the score by, for example, the following equation.
S_Φ(f_{q,d}) = Φ · f_{q,d}   (1)
Here · is the inner product operator.
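
Equation (1) can be sketched in a few lines. The function name `linear_score` and the numeric values are hypothetical and serve only to show the inner product.

```python
def linear_score(phi, f_qd):
    """Equation (1): inner product of the parameter vector and the feature vector."""
    if len(phi) != len(f_qd):
        raise ValueError("phi and f_qd must have the same dimension")
    return sum(p * f for p, f in zip(phi, f_qd))

phi = [0.5, -1.0, 2.0]   # hypothetical learned parameters
f_qd = [2.0, 1.0, 0.5]   # hypothetical features of one (q, d) pair
score = linear_score(phi, f_qd)
```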

The parameter vector Φ is obtained in advance by a statistical model learning method: learning data is prepared, and the value of Φ is obtained with an existing learning method. The learning data is generally a set of pairs of a query and a reference label (a label indicating a document closely related to the query; there may be more than one such document). The learning data is prepared by manually judging which documents are closely related to each query.

Non-Patent Document 2 presents a document search method that uses a language model. Because the parameters of the language model are learned from statistical data, the model can be said to be learned statistically; however, because no reference labels are used, whether a document and a query are related is not learned directly. This specification therefore treats it as a heuristic method.

Non-Patent Document 1: Ramesh Nallapati, "Discriminative Models for Information Retrieval", Proceedings of ACM SIGIR, 2004, pp. 64-71
Non-Patent Document 2: Jay M. Ponte and W. Bruce Croft, "A Language Modeling Approach to Information Retrieval", Proceedings of ACM SIGIR, 1998, pp. 275-281

In general, obtaining a proper parameter estimate requires more learning data (pairs of a query and a reference label) as the number of parameters grows, that is, as the dimension of the parameter vector Φ becomes higher. As described above, however, the learning data must be prepared manually and is difficult to prepare in large quantities. The parameter vector Φ in the prior art is therefore low-dimensional.

For example, if the parameter vector Φ in the linear model of equation (1) is 8-dimensional, the feature vector f_{q,d} is also 8-dimensional. This means that at most eight kinds of features can be used from the query q and the document d, so features important for document search are likely to be missed. Whereas the language processing field generally uses high-dimensional models with tens of thousands to tens of millions of dimensions, document search uses scoring models based on extremely low-dimensional parameter vectors. This is because it is difficult to prepare a large amount of learning data manually.

Generating a scoring model by statistical model learning has two advantages: (1) the relationship between queries and reference labels is learned explicitly, and (2) using a simple model such as a linear model makes it easy to introduce a wide variety of features. If scoring models based on high-dimensional parameter vectors become usable, a wide variety of features can be exploited and the relationship between queries and reference labels can be learned more precisely. Even if such a scoring model were less accurate than existing methods, it would at least capture features different from theirs. An improvement in accuracy can therefore be expected by calculating the score representing the degree of relevance between a query and a document as a combination of the score obtained by an existing method and the score obtained by the scoring model based on a high-dimensional parameter vector.

An object of the present invention is to provide a scoring model generation device that learns a scoring model by statistical model learning, and a learning data generation device that generates the learning data used for that learning.

To solve the above problem, according to a first aspect of the present invention, a scoring model generation device learns a scoring model for document search by statistical model learning. With M denoting the number of documents, N the number of queries generated from one document, m = 1, 2, …, M, and n = 1, 2, …, N, the scoring model generation device receives M×N pieces of learning data s_mn, extracts a feature vector f_{q,d} from the query q included in the learning data s_mn and each document d, and learns a parameter vector Φ such that the inner product of Φ and f_{q,d} takes a positive value when the document d is a document related to the query q and a negative value when it is not.
To solve the above problem, according to another aspect of the present invention, a learning data generation device is given a plurality of documents and generates the learning data used when a scoring model for document search is learned by statistical model learning. The learning data generation device includes word string generation means and learning data generation means. For each given document, the word string generation means generates one or more word strings containing words included in that document. The learning data generation means takes each generated word string as a query and the label indicating the document used to generate that word string as a reference, and takes each pair of query and reference as learning data.

According to the present invention, the learning data used when a scoring model is learned by statistical model learning can be generated automatically, without manual work. Because a high-dimensional parameter vector can be estimated properly from a large amount of learning data, and information that was difficult to handle with heuristic methods becomes usable, a search device using that parameter vector can search with higher accuracy.

FIG. 1 is a configuration diagram of the search system 1 according to the first embodiment.
FIG. 2 is a diagram showing the processing flow of the search system 1 according to the first embodiment.
FIG. 3 is a functional block diagram of the learning data generation device 11 according to the first embodiment.
FIG. 4 is a diagram showing the processing flow of the learning data generation device 11 according to the first embodiment.
FIG. 5 is a diagram showing the result of Simulation 1 for the search device 13 of the first embodiment.
FIG. 6 is a diagram showing the result of Simulation 2 for the search device 13 of the first embodiment.

Embodiments of the present invention are described below. In the drawings used in the following description, components having the same function and steps performing the same processing are given the same reference numerals, and duplicate description is omitted.

<First embodiment>
FIG. 1 shows a configuration example of the search system 1, and FIG. 2 shows its processing flow. The search system 1 includes a learning data generation device 11, a scoring model generation device 12, and a search device 13.

The learning data generation device 11 takes M documents d_1, d_2, …, d_M as input, generates M×N pieces of learning data s_mn (s11), and outputs them to the scoring model generation device 12. Here m = 1, 2, …, M and n = 1, 2, …, N, where N is the number of word strings (queries) generated from one document. Details are given later.

The scoring model generation device 12 uses the M×N pieces of learning data s_mn to learn and generate the parameter vector Φ used in the scoring model (s12), and outputs it to the search device 13.

The search device 13 sets the parameter vector Φ in the scoring model. On receiving a search query q_u from the terminal 2 operated by a user, the search device 13 searches for documents corresponding to q_u (s13). The search is performed, for example, as follows. Using the scoring model, the search device 13 calculates the score of each document for the search query q_u, generates document information D_u in which document excerpts, titles, URLs, and the like are listed in descending order of score, and transmits D_u to the terminal 2. By accessing a URL or the like included in D_u, the user can view documents closely related to the search query q_u.
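
The ranking step of s13 might look like the following sketch. The function name `rank_documents`, the `top_k` parameter, and the example scores and URLs are hypothetical; the scores are assumed to have been computed by the scoring model beforehand.

```python
def rank_documents(scores, top_k=None):
    """Return document identifiers in descending order of score."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_k]]

# Hypothetical scores of three documents for one search query q_u.
scores = {"http://example.com/a": 0.2,
          "http://example.com/b": 0.9,
          "http://example.com/c": 0.5}
document_info = rank_documents(scores)
```

Here `document_info` plays the role of D_u in its simplest form, a list of URLs; a real system would also attach titles and excerpts.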

<Learning data generation device 11>
FIG. 3 is a functional block diagram of the learning data generation device 11, and FIG. 4 shows its processing flow. The learning data generation device 11 includes a storage unit 111, individual language model generation means 112, comprehensive language model generation means 113, word string generation means 114, and learning data generation means 115.

(Storage unit 111)
The storage unit 111 stores the M given documents d_1, d_2, …, d_M, as well as intermediate data, various parameters, and the like. Each means reads and writes the required data and parameters from and to the storage unit 111 in the course of its processing. The means need not always go through the storage unit 111, however; data may instead be passed directly between them. The storage unit 111 corresponds to an auxiliary storage device, RAM (Random Access Memory), registers, other buffer or cache memory, or a storage area combining these.

(Comprehensive language model generation means 113)
The comprehensive language model generation means 113 retrieves the M documents d_1, d_2, …, d_M from the storage unit 111, generates a probabilistic language model B for all of the documents (s1101), and stores it in the storage unit 111. Possible probabilistic language models include, for example, an n-gram language model, a back-off n-gram language model, a hidden Markov model, and a maximum entropy model.

(Individual language model generation means 112)
The individual language model generation means 112 retrieves a document d_m from the storage unit 111, generates a probabilistic language model L_m for that document (s1103), and stores it in the storage unit 111. It generates probabilistic language models L_1, L_2, …, L_M for the M documents d_1, d_2, …, d_M, respectively (s1102, s1109, s1110).
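
To make the two kinds of language model concrete, the sketch below estimates the simplest possible case, a maximum-likelihood unigram model, once per document (L_m) and once over all documents (B). The unigram choice, the `EOS` symbol, the function name `unigram_model`, and the toy documents are all assumptions for illustration; the description itself lists richer options such as back-off n-gram models.

```python
from collections import Counter

EOS = "</s>"  # hypothetical end-of-sentence symbol

def unigram_model(sentences):
    """Maximum-likelihood unigram probabilities, counting one EOS per sentence."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence)
        counts[EOS] += 1
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Hypothetical toy collection: each document is a list of tokenized sentences.
docs = [
    [["speech", "recognition"], ["speech", "search"]],
    [["document", "search"]],
]
individual = [unigram_model(doc) for doc in docs]                    # L_1, ..., L_M
comprehensive = unigram_model([s for doc in docs for s in doc])      # B
```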

(Word string generation means 114)
The word string generation means 114 retrieves the two probabilistic language models B and L_m from the storage unit 111, generates a word string q_mn based on them (s1105), and stores it in the storage unit 111. In this embodiment, it generates a sentence q_mn consisting of a word string.

For example, the probability P(W) of a word W is obtained by linearly combining the two probabilistic language models B and L_m as in the following equation, and a sentence q_mn is generated at random according to that probability distribution.
P(W) = λ P_{L_m}(W) + (1 − λ) P_B(W)   (2)
Here P_{L_m}(W) is the probability of the word W given by the probabilistic language model L_m, P_B(W) is the probability of the word W given by the probabilistic language model B, and λ is a real-valued weight coefficient satisfying 0 < λ ≤ 1. The presence of an end-of-sentence symbol in each document is assumed when the language models are trained, and during random generation the output of the end-of-sentence symbol marks the end of one word string (query). N sentences q_mn are generated for each document d_m (s1104, s1107, s1108).
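
Equation (2) and the random generation step can be sketched as follows, under the simplifying assumption that both language models are unigram distributions held in dictionaries. The names `interpolate` and `sample_query`, the `EOS` symbol, and the `max_len` safeguard are hypothetical and not part of the description.

```python
import random

EOS = "</s>"  # hypothetical end-of-sentence symbol

def interpolate(p_lm, p_b, lam):
    """Equation (2): P(W) = lam * P_Lm(W) + (1 - lam) * P_B(W)."""
    if not 0.0 < lam <= 1.0:
        raise ValueError("lam must satisfy 0 < lam <= 1")
    vocab = set(p_lm) | set(p_b)
    return {w: lam * p_lm.get(w, 0.0) + (1.0 - lam) * p_b.get(w, 0.0) for w in vocab}

def sample_query(p, rng, max_len=20):
    """Draw words from p until EOS is drawn (max_len is an assumed safeguard)."""
    words = sorted(p)
    weights = [p[w] for w in words]
    query = []
    while len(query) < max_len:
        word = rng.choices(words, weights=weights, k=1)[0]
        if word == EOS:
            break
        query.append(word)
    return query

# Hypothetical unigram models for one document (L_m) and for the collection (B).
p_lm = {"speech": 0.5, EOS: 0.5}
p_b = {"speech": 0.25, "search": 0.25, EOS: 0.5}
p = interpolate(p_lm, p_b, lam=0.5)
query = sample_query(p, random.Random(0))
```

Because EOS can be drawn first, a sampled query may be empty; a practical generator would probably discard such draws. Mixing in B with λ < 1 is what allows words absent from the document to appear in its queries.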

(Learning data generation means 115)
The learning data generation means 115 retrieves from the storage unit 111 the label m indicating the document d_m and a word string q_mn that the word string generation means 114 generated from that document, takes the label m as the reference and the word string q_mn as the query, forms the pair into learning data s_mn = (m, q_mn) (s1106), and stores it in the storage unit 111. This processing is performed for every sentence q_mn (s1104, s1107, s1108).
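
The pairing performed in s1106 amounts to attaching the document label to each generated query. A minimal sketch, with the hypothetical function name `build_learning_data` and toy queries:

```python
def build_learning_data(queries_per_doc):
    """queries_per_doc[m - 1] holds the N word strings generated from document d_m.

    Returns the list of s_mn = (m, q_mn) with labels m starting at 1.
    """
    data = []
    for m, queries in enumerate(queries_per_doc, start=1):
        for q_mn in queries:
            data.append((m, q_mn))
    return data

# Hypothetical generated queries for M = 2 documents, N = 2 queries each.
queries_per_doc = [
    [["speech", "search"], ["speech"]],
    [["document"], ["document", "search"]],
]
learning_data = build_learning_data(queries_per_doc)
```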

In this way the learning data generation device 11 generates M×N pieces of learning data s_mn and transmits them to the scoring model generation device 12.

<Effect>
With this configuration, the learning data used when a scoring model is learned by statistical model learning can be generated automatically, without manual work.

The following outlines a method of generating a scoring model from the learning data generated by the learning data generation device 11, and a document search method that uses the generated scoring model.

<Search device 13>
The search device 13 performs document search using a scoring model learned from the M×N pieces of learning data s_mn obtained by the learning data generation device 11. Because automatically generated learning data differs considerably from true learning data, this embodiment safeguards search accuracy by introducing the scoring model learned from s_mn as a rescoring step on top of a conventional search method. The method of learning the scoring model is described later.

Let D(f_q, f_d) denote the relevance between a document d and a query q given by a baseline search system. In this embodiment, the relevance D(f_q, f_d) is corrected (rescored) with the parameter vector Φ of the linear model and the feature vector f_{q,d} by the following expression.
D(f_q, f_d) + α Φ · f_{q,d}   (3)
Here α is a constant that adjusts the scales of the two terms.
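
Expression (3) in code form, as a minimal sketch; the function name `rescore` and the numbers are hypothetical.

```python
def rescore(baseline, phi, f_qd, alpha):
    """Expression (3): baseline relevance plus alpha times the linear model score."""
    return baseline + alpha * sum(p * f for p, f in zip(phi, f_qd))

# Hypothetical values: baseline relevance 0.5, two-dimensional model, alpha = 0.5.
final_score = rescore(0.5, [1.0, -1.0], [2.0, 1.0], alpha=0.5)
```

Setting α = 0 recovers the baseline ranking unchanged, which is why α can be tuned on a small development set without risking the baseline's behaviour.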

For example, a method based on the distance between feature vectors (see Reference 1) can be used for the relevance D(f_q, f_d) of the baseline search system.
(Reference 1) Yu Uno, Hitoshi Ito, Akinori Ito, Shozo Makino, "Index improvement using WWW for speech document retrieval", Proceedings of the 4th Speech Document Processing Workshop, 2010

Let f_q denote the feature vector extracted from the query q and f_d the feature vector extracted from the document d. These feature vectors are extracted according to rules defined in advance. For example, a vector of tf-idf (term frequency and inverse document frequency) values of words or the like can be used as the feature vector. The score representing the degree of relevance between the query q and the document d is calculated from the distance D(f_q, f_d) between the two vectors, for which the cosine distance can be used, for example. That is, with tf-idf(x) denoting the unigram tf-idf vector of a word string x and cosine(y, z) the cosine distance between y and z, the relevance D(f_q, f_d) is calculated by the following equation.
D(f_q, f_d) = cosine(tf-idf(d), tf-idf(q))
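
A baseline of this kind can be sketched with sparse dictionaries. The sketch computes the cosine of the angle between the two tf-idf vectors, so that larger values mean closer relevance; reading the text's "cosine distance" this way is an assumption, as are the function names, the raw-count tf weighting, and the toy idf table.

```python
import math
from collections import Counter

def tfidf(words, idf):
    """Sparse unigram tf-idf vector: raw term count times idf (assumed weighting)."""
    counts = Counter(words)
    return {w: c * idf.get(w, 0.0) for w, c in counts.items()}

def cosine(y, z):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zero."""
    dot = sum(v * z.get(w, 0.0) for w, v in y.items())
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    norm_z = math.sqrt(sum(v * v for v in z.values()))
    if norm_y == 0.0 or norm_z == 0.0:
        return 0.0
    return dot / (norm_y * norm_z)

idf = {"speech": 1.0, "search": 2.0, "music": 1.5}   # hypothetical idf values
f_d = tfidf(["speech", "search", "search"], idf)
f_q = tfidf(["search"], idf)
relevance = cosine(f_d, f_q)
```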

When vectors of unigram frequencies are used, the parameter vector Φ and the feature vector f_{q,d} each have as many dimensions as there are distinct words. Each element of f_{q,d} is then the frequency of the corresponding word in the document d, except that elements corresponding to words that do not appear in the query q are set to 0.
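
The unigram feature vector described above can be sketched as follows. The function name `feature_vector` and the three-word vocabulary are hypothetical; in the experiment reported later the vocabulary has about 27,000 entries.

```python
from collections import Counter

def feature_vector(query, doc, vocab):
    """f_{q,d}: count of each vocabulary word in d, zeroed for words absent from q."""
    counts = Counter(doc)
    query_words = set(query)
    return [counts[w] if w in query_words else 0 for w in vocab]

vocab = ["speech", "search", "music"]          # hypothetical fixed word order
f_qd = feature_vector(["speech", "music"], ["speech", "speech", "search"], vocab)
```

In this example "search" occurs in the document but not in the query, so its element is 0, and "music" occurs in the query but not in the document, so its count is 0 as well.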

In the prior art it is difficult to prepare much learning data, so a high-dimensional parameter vector Φ cannot be estimated properly. The parameter vector Φ, and hence the feature vector, therefore had to be low-dimensional. In this embodiment a large amount of learning data can be prepared easily, so a high-dimensional parameter vector Φ can be estimated properly, and a high-dimensional feature vector with as many dimensions as there are distinct words can be used. This means that many features of the query q and the document d can be used, and that features important for document search can be exploited without being missed.

The search device 13 calculates the score of each document by expression (3) and ranks the documents as top candidates in descending order of that value.

<Scoring model generation device 12>
The scoring model generation device 12 receives the M×N pieces of learning data s_mn, learns and generates the parameter vector Φ used in the scoring model, and transmits it to the search device 13. For example, an existing method based on the maximum entropy model can be used to learn Φ (Non-Patent Document 1). With r(d, q) indicating whether the document d is a document related to the query q (1 or −1), this means learning so that the sign of Φ · f_{q,d} matches the sign of r(d, q). The information r(d, q) is generated from the M×N pieces of learning data s_mn = (m, q_mn).

Specifically, values of the two parameter vectors Φ_{+1} and Φ_{−1} that minimize the following expression are first obtained.

Φ is then obtained as Φ_{+1} − Φ_{−1}. Here ||x||_2^2 is the (squared) L2 norm and c is a constant, with c = 1 in this embodiment. The L-BFGS algorithm can be used to solve the minimization problem.
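
The idea of matching the sign of Φ · f_{q,d} to r(d, q) can be illustrated with a small sketch. It is not the expression minimized in this embodiment: it assumes an L2-regularized logistic loss, sum of log(1 + exp(−r Φ · f)) plus ||Φ||_2^2 / (2c), trains a single vector Φ directly instead of the pair Φ_{+1}, Φ_{−1}, and uses plain gradient descent in place of L-BFGS. The function name `train_phi`, the learning rate, the epoch count, and the toy examples are all hypothetical.

```python
import math

def train_phi(examples, dim, c=1.0, lr=0.1, epochs=200):
    """Fit phi so that sign(phi . f) tends to agree with r.

    examples: list of (f_qd, r) pairs with r in {+1, -1}.
    Assumed objective: sum log(1 + exp(-r * phi . f)) + ||phi||^2 / (2 * c).
    """
    phi = [0.0] * dim
    for _ in range(epochs):
        grad = [p / c for p in phi]                 # gradient of the regularizer
        for f, r in examples:
            margin = r * sum(p * x for p, x in zip(phi, f))
            g = -r / (1.0 + math.exp(margin))       # derivative of the log loss
            for i, x in enumerate(f):
                grad[i] += g * x
        # Step size is lr scaled by the number of examples.
        phi = [p - lr * gi / len(examples) for p, gi in zip(phi, grad)]
    return phi

# Hypothetical examples: r = +1 when d is the document the query was generated from.
examples = [([1.0, 0.0], +1), ([0.0, 1.0], -1)]
phi = train_phi(examples, dim=2)
```

Each label r(d, q_mn) would be derived from s_mn = (m, q_mn) by setting r = +1 for d = d_m and r = −1 for every other document.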

<Simulation 1>
An evaluation experiment was conducted on the search device 13 of the first embodiment using the spoken document retrieval test collection of the Corpus of Spontaneous Japanese (CSJ) (see Reference 2).
(Reference 2) Tomoyosi Akiba, Kiyoaki Aikawa, Yoshiaki Itoh, Tatsuya Kawahara, Hiroaki Nanjo, Hiromitsu Nishizaki, Norihito Yasuda, Yoichi Yamashita, Katunobu Itou, "Construction of a Test Collection for Spoken Document Retrieval from Lecture Audio Data", IPSJ Journal, Vol. 50, No. 2, pp. 501-513, 2009

This test collection contains 2702 spoken documents with their speech recognition results, and 39 queries with their reference labels. The 39 queries were divided into 9 and 30, used as the development set and the evaluation set, respectively. The development set was used only to determine α in expression (3).

In the learning data generation device 11, back-off trigram language models were used, and the language model B created from all documents was linearly combined with the language model L_m generated from each document. λ in equation (2) was set to each of 0.9, 0.8, 0.7, 0.6, and 0.5, and the word string generation means 114 generated 50 queries for each value of λ for each document. That is, N = 50 × 5 (the number of λ values) = 250 queries were generated from one document, so the total number of queries generated for learning was 250 × 2702 (the number of documents) = 675500. With this method, the probability of generating a query composed only of words that do not appear in the document d is not zero. Such queries are expected to be only a tiny fraction of the whole, however, and to have almost no effect on parameter estimation. All queries were therefore used for learning without checking for such queries. A configuration that excludes from the learning data any query composed only of words that do not appear in the document d is also possible.
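
The counts stated above can be checked with a few lines of arithmetic; the variable names are chosen for this example only.

```python
lambdas = [0.9, 0.8, 0.7, 0.6, 0.5]   # values of lambda used in equation (2)
queries_per_lambda = 50               # queries generated per document and per lambda
num_documents = 2702                  # spoken documents in the test collection

queries_per_document = queries_per_lambda * len(lambdas)   # N
total_queries = queries_per_document * num_documents       # M x N
print(queries_per_document, total_queries)
```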

The scoring model generation device 12 estimated the parameter vector Φ using the 675,500 pieces of learning data. Training used an existing method based on the maximum entropy model (see Non-Patent Literature 1).
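As a rough illustration of learning Φ so that the inner product of Φ and f_{q,d} is positive for related pairs and negative otherwise, the sketch below fits a two-class maximum entropy model, which is equivalent to logistic regression, with an L2 penalty by plain gradient descent. It is not the method of Non-Patent Literature 1. The objective, the way negative examples are obtained, the step size, and all names (`train_phi`, `c`, `lr`, `epochs`) are assumptions made for the example.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(phi, f):
    return sum(p * x for p, x in zip(phi, f))


def train_phi(examples, dim, c=0.01, lr=0.1, epochs=200):
    """Minimize mean log(1 + exp(-r * phi.f)) + (c / 2) * ||phi||^2.

    examples: list of (feature vector, r) with r = +1 (related) or -1 (not related).
    """
    phi = [0.0] * dim
    n = len(examples)
    for _ in range(epochs):
        grad = [c * p for p in phi]  # gradient of the L2 penalty
        for f, r in examples:
            # gradient of log(1 + exp(-r * phi.f)) is -r * sigmoid(-r * phi.f) * f
            coeff = -r * sigmoid(-r * dot(phi, f))
            for i, x in enumerate(f):
                grad[i] += coeff * x / n
        phi = [p - lr * g for p, g in zip(phi, grad)]
    return phi
```

In a two-class maximum entropy model with per-class parameters Φ_{+1} and Φ_{−1}, the class posterior depends only on the difference Φ_{+1} − Φ_{−1}, which is why a single vector Φ suffices in this sketch.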

The search device 13 used the parameter vector Φ to compute, by Equation (3), the score of each document for the queries of the development set, and ranked the documents in descending order of score. The total number of distinct words is about 27,000, and the dimensions of Φ and f_{q,d} match this number.

The evaluation measures are MAP (mean average precision), R-precision, and nDCG (normalized discounted cumulative gain) at rank 5. For all of them, larger values indicate better performance. The values on the evaluation set are shown in FIG. 5. On every measure, the search device 13 scores higher than the baseline search system.
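For reference, the three measures can be computed as in the sketch below. It assumes binary relevance labels and a log2 rank discount for nDCG; the definitions used with the test collection may differ in detail (for example, graded relevance). The function names are illustrative.

```python
import math


def average_precision(ranked, relevant):
    """Mean of precision values at the ranks where relevant documents appear."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked list, set of relevant documents), one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc in ranked[:r] if doc in relevant) / r


def ndcg_at_k(ranked, relevant, k):
    """DCG at rank k divided by the DCG of an ideal ranking (binary gains)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal > 0 else 0.0
```

MAP averages `average_precision` over all queries of the evaluation set, and `ndcg_at_k(..., k=5)` corresponds to nDCG at rank 5.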

<Simulation 2>
Simulation 1 used word frequencies as features, but using n-gram frequencies yields a higher-dimensional model. Features for parts of speech and for subwords such as characters and phonemes can also be used, which can be expected to give robust search results even when unknown words appear. Furthermore, using speech recognition confidence as a feature can be expected to make search robust to recognition errors. Because the learning data generation device 11 can generate a large amount of learning data, the parameter vector can still be learned appropriately even when features are added and the dimensionality of the feature vector grows.

In Simulation 2, phonemes are added as features. Of the 39 queries, four contain words unknown to the speech recognizer (one in the development set and three in the evaluation set), and these were also evaluated separately. In Simulation 2, the scoring model generation device 12 learns and generates the parameter vector Φ by Equation (4).

In Simulation 2, MAP and nDCG at the top 10 are used as evaluation measures. FIG. 6 shows the results. In the figure, "eval" gives the accuracy on the 30-query evaluation set and "oov" gives the accuracy on the four queries containing unknown words. For eval, rescoring with the model generated by the scoring model generation device of the first embodiment (+words) greatly improves performance over the baseline alone (Baseline). This result matches Simulation 1. Independently of whether unknown words or recognition errors are present, retrieval ability itself is considered to have improved. The phoneme features lowered accuracy on eval. On oov, however, they greatly improve retrieval accuracy over the baseline. This shows that, in the first embodiment, simply adding phoneme features improves robustness to unknown words.

<First modification>
The learning data generation device 11 of the first embodiment may be configured without the comprehensive language model generation unit 113. In that case, s1101 in FIG. 4 is not performed. The word string generation unit 114 obtains the probability P_Lm(W) of a word W given by the probabilistic language model L_m for the document d_m, and randomly generates a sentence q_mn according to that probability distribution. However, when a sentence q_mn is generated randomly according to the distribution P_Lm(W), q_mn generally consists only of vocabulary appearing in the document d_m. Because it is rare for every word in a query to appear in the reference document, using a probabilistic language model for all documents, as in the first embodiment, is expected to give higher accuracy.

<Second modification>
The learning data generation device 11 of the first modification may further be configured without the individual language model generation unit 112. In that case, s1103 in FIG. 4 is not performed. The word string generation unit 114 randomly extracts words, phrases, and sentences from the document d_m and uses the sentence (text) formed by concatenating them as a query. Note that extracting only words at random is equivalent to the case where the individual language model generation unit 112 generates a word unigram language model.

The unit to extract (word, phrase, or sentence) and the number of units can also be determined randomly. In practice, it is rare for every word in a query to appear in the reference document. As a countermeasure, some words or phrases can be replaced with words or phrases extracted from other documents, or such words or phrases can be inserted. The number and positions of replacements and insertions can also be determined randomly. Here, "random" means following a distribution given by some probability model, including the uniform distribution. For example, the number of words in a word string could be determined according to a Poisson distribution.
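A minimal sketch of this second modification follows, restricted to word-level extraction and replacement. The Poisson mean, the replacement probability, the minimum query length of one word, and the names `poisson`, `make_query`, `mean_len`, and `replace_prob` are assumptions for illustration. The Poisson sampler uses Knuth's multiplication method because Python's standard `random` module has no Poisson generator.

```python
import math
import random


def poisson(rng, mean):
    """Sample from a Poisson distribution (Knuth's method, fine for small means)."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def make_query(doc, other_docs, rng, mean_len=3.0, replace_prob=0.2):
    """Draw words uniformly from doc, then swap some for words from other documents."""
    length = max(1, poisson(rng, mean_len))  # avoid empty queries (assumption)
    words = [rng.choice(doc) for _ in range(length)]
    pool = [w for other in other_docs for w in other]
    if pool:
        words = [
            rng.choice(pool) if rng.random() < replace_prob else w
            for w in words
        ]
    return words
```

Drawing words uniformly from the token list of a document is the same as sampling from its word unigram model, which is the equivalence noted in the second modification.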

Given the relationship among the first embodiment, the first modification, and the second modification, it is clearly also possible to create a word string (query) by combining any of them. For example, some of the words extracted by the first embodiment or the first modification may be replaced, or words may be inserted, by the method of the second modification.

<Other modifications>
The comprehensive language model generation unit 113 need not generate a probabilistic language model for all documents. It suffices to use a (background) language model that is different from the probabilistic language model L_m for the document d_m, in other words, a language model that includes vocabulary not included in the language model L_m.

The search device 13 does not necessarily need to introduce the scoring model of this embodiment as rescoring on top of a conventional search method. That is, the search device 13 may obtain the score by Equation (1) using only the parameter vector Φ and the feature vector f_{q,d}, without using the relevance D(f_q, f_d) of the baseline search system.
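Scoring with the inner product alone, without a baseline relevance term, might look like the sketch below. The feature definition in `word_features` (query word count multiplied by document word count) is a hypothetical stand-in, since the exact word-frequency feature is defined elsewhere in the description; `inner_product` and `rank_documents` are likewise illustrative names.

```python
from collections import Counter


def word_features(query_tokens, doc_tokens):
    """Hypothetical sparse feature vector: one entry per word shared by query and document."""
    q, d = Counter(query_tokens), Counter(doc_tokens)
    return {w: q[w] * d[w] for w in q if w in d}


def inner_product(phi, features):
    """Inner product of the parameter vector and a sparse feature vector."""
    return sum(phi.get(w, 0.0) * v for w, v in features.items())


def rank_documents(phi, query_tokens, documents):
    """Return (document id, score) pairs sorted by descending score."""
    scored = [
        (doc_id, inner_product(phi, word_features(query_tokens, tokens)))
        for doc_id, tokens in documents.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored
```

Storing Φ and the features as dictionaries keyed by word keeps the computation sparse, which matters when the vector has roughly 27,000 dimensions but a query touches only a few of them.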

In the first embodiment, N queries are generated from each document d_m, but the number of queries generated may differ from document to document. For example, N may be changed according to the length of the document d_m.

The present invention is not limited to the embodiments and modifications described above. For example, the various processes described above need not be executed in time series in the order described; they may be executed in parallel or individually, depending on the processing capability of the device that executes them or as needed. Other changes may be made as appropriate without departing from the spirit of the present invention.

<Program and recording medium>
The learning data generation device, scoring model generation device, and search device described above can also be implemented by a computer. In that case, a program for causing the computer to function as the target device (a device having the functional configuration shown in the figures for the embodiments), or a program for causing the computer to execute each step of its processing procedure (as shown in each embodiment), may be downloaded into the computer from a recording medium such as a CD-ROM, a magnetic disk, or a semiconductor storage device, or via a communication line, and then executed.

Claims

  A scoring model generation device for learning a scoring model in document search based on statistical model learning, wherein the device
  receives M × N pieces of learning data s_mn, where M is the number of documents and N is the number of queries generated from one document,
  extracts a feature vector f_{q,d} from the query q and the document d included in the learning data s_mn, and
  learns a parameter vector Φ so that the inner product of the parameter vector Φ and the feature vector f_{q,d} takes a positive value when the document d is a related document of the query q and a negative value when the document d is not a related document of the query q.

The scoring model generation device according to claim 1, wherein,
with ||x||_2^2 denoting the squared L2 norm, c a constant, and r(d, q) information indicating whether the document d is a related document of the query q, the device obtains parameter vectors Φ_{+1} and Φ_{−1} that minimize the following expression,

and obtains the parameter vector Φ by the following equation:
Φ = Φ_{+1} − Φ_{−1}

A learning data generation device that is given a plurality of documents and generates M × N pieces of learning data s_mn for use in the scoring model generation device according to claim 1 or 2, comprising:
word string generation means for generating, for each given document, one or more word strings including words included in the document; and
learning data generation means for treating each generated word string as a query and a label indicating the document used to generate that word string as a reference, and using each pair of the query and the reference as the learning data.

The learning data generation device according to claim 3,
further comprising individual language model generation means for generating a probabilistic language model for each of the documents using each of the given documents, wherein
the word string generation means generates the word string based on the probabilistic language model.

The learning data generation device according to claim 4,
further comprising comprehensive language model generation means for generating a probabilistic language model for all the documents using all the given documents, wherein
the word string generation means generates the word string based on the two probabilistic language models.

A search system including the scoring model generation device according to claim 1 and the learning data generation device according to any one of claims 3 to 5,
and further including a search device that performs a document search using the scoring model learned by the scoring model generation device using the learning data generated by the learning data generation device.

  A scoring model generation method for learning, using a scoring model generation device, a scoring model in document search based on statistical model learning, the method comprising:
  receiving M × N pieces of learning data s_mn, where M is the number of documents and N is the number of queries generated from one document;
  extracting a feature vector f_{q,d} from the query q and the document d included in the learning data s_mn; and
  learning a parameter vector Φ so that the inner product of the parameter vector Φ and the feature vector f_{q,d} takes a positive value when the document d is a related document of the query q and a negative value when the document d is not a related document of the query q.

The scoring model generation method according to claim 7, wherein,
with ||x||_2^2 denoting the squared L2 norm, c a constant, and r(d, q) information indicating whether the document d is a related document of the query q, parameter vectors Φ_{+1} and Φ_{−1} that minimize the following expression are obtained,

and the parameter vector Φ is obtained by the following equation:
Φ = Φ_{+1} − Φ_{−1}

A learning data generation method for generating, using a learning data generation device given a plurality of documents, M × N pieces of learning data s_mn to be used in the scoring model generation method according to claim 7 or 8, the method comprising:
a word string generation step of generating, for each given document, one or more word strings including words included in the document; and
a learning data generation step of treating each generated word string as a query and a label indicating the document used to generate that word string as a reference, and using each pair of the query and the reference as the learning data.

The learning data generation method according to claim 9,
further comprising an individual language model generation step of generating a probabilistic language model for each document using each given document, wherein
in the word string generation step, the word string is generated based on the probabilistic language model.

The learning data generation method according to claim 10,
further comprising a comprehensive language model generation step of generating a probabilistic language model for all the documents using all the given documents, wherein
in the word string generation step, the word string is generated based on the two probabilistic language models.

A search method including the scoring model generation method according to claim 7 or 8 and the learning data generation method according to any one of claims 9 to 11,
and further including a search step of performing a document search using the scoring model learned by the scoring model generation method using the learning data generated by the learning data generation method.

A program for causing a computer to function as the scoring model generation device according to claim 1 or 2.

A program for causing a computer to function as the learning data generation device according to any one of claims 3 to 5.

A program for causing a computer to function as a search device included in the search system according to claim 6.