JP6765992B2

JP6765992B2 - Hyperspherical space language model generation device, query likelihood calculation device, and methods and programs therefor

Info

Publication number: JP6765992B2
Application number: JP2017038959A
Authority: JP
Inventors: 亮増村; 太一浅見
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2017-03-02
Filing date: 2017-03-02
Publication date: 2020-10-07
Anticipated expiration: 2037-03-02
Also published as: JP2018147062A

Description

The present invention relates to a technique used for performing information retrieval on a database of documents.

Information retrieval is realized by a framework in which, when a user enters a search query against a set of documents to be searched, each document is given a score indicating its relevance to the query and the documents are sorted by score. The documents can be of various kinds; Web pages are typical, but information retrieval over in-house documents and the like is also included.

One information retrieval method uses probabilistic language models. In this method, each document in the document set to be searched is represented in advance as a probabilistic language model. For an input query, the probability of generating the query text is then computed for each document, and information retrieval is realized by sorting the documents by this score. Specifically, the likelihood that the language model θ_D for a document D generates the query Q, log P(Q|θ_D) (called the query likelihood), is defined, and the query likelihood is computed for every document in the document set.

As the probabilistic language model in information retrieval, modeling based on the multinomial distribution is generally used (see, for example, Non-Patent Document 1). Here, the query Q is written Q = q_1, …, q_L, where each q_i is an element obtained by splitting the query into units called tokens, for example words or characters. That is, L is the number of tokens in the query; if the token unit is a word, L is the number of words in the query. In modeling based on the multinomial distribution, the query likelihood log P(Q|θ_D) of the query Q under the language model θ_D for the document D is modeled as follows.

$$\log P(Q \mid \theta_D) = \sum_{i=1}^{L} \log p(q_i \mid \theta_D)$$

Here, p(q|θ_D) is the probability that the token q is generated from the probabilistic language model for the document D. Various methods are conceivable for modeling p(q|θ_D) from the document D, that is, for constructing the information retrieval model; the typical method based on maximum likelihood estimation uses the following formula.

$$p(q \mid \theta_D) = \frac{c(q \mid D)}{|D|}$$

Here, c(q|D) is the number of times the token q occurs in the document D, and |D| is the total number of tokens in the document D. For example, if D is "東京 の 人気 の 名所 は ディズニーランド や スカイツリー です" ("Tokyo's popular attractions are Disneyland and Skytree"; spaces mark token boundaries), the maximum likelihood model gives p(東京|θ_D) = 1/10 and p(の|θ_D) = 1/5, since 東京 ("Tokyo") occurs once and the particle の occurs twice among the 10 tokens.
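The maximum likelihood estimate can be checked with a few lines of Python. This is a minimal sketch: it assumes the example sentence is split into the ten tokens listed in `doc` (a tokenization inferred from the stated probabilities 1/10 and 1/5), and `mle_unigram` is an illustrative name that does not appear in the patent.

```python
from collections import Counter
from fractions import Fraction


def mle_unigram(tokens):
    """Return p(q | theta_D) = c(q | D) / |D| for every token q in D."""
    counts = Counter(tokens)
    total = len(tokens)
    return {q: Fraction(c, total) for q, c in counts.items()}


# Assumed tokenization of the example document (10 tokens).
doc = "東京 の 人気 の 名所 は ディズニーランド や スカイツリー です".split()
model = mle_unigram(doc)

print(model["東京"])  # 1/10
print(model["の"])    # 1/5
```

Any token that does not occur in `doc` is simply absent from `model`, which corresponds to a probability of 0 under this estimate.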

In addition, since the probability of a token not contained in the document D would be 0, a process called smoothing is applied to assign such tokens a small probability. More generally, in modeling based on the multinomial distribution, the information retrieval model is constructed by modeling p(q|θ_D) in advance for various tokens for each document D. For details of smoothing and of other model construction methods in modeling based on the multinomial distribution, see, for example, Non-Patent Document 2.
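The effect of smoothing can be illustrated as follows. The patent does not say which smoothing method is used; the sketch below assumes Jelinek-Mercer interpolation with a collection-wide model, which is one common choice. The function names, the mixing weight `alpha`, and the English toy tokens are all hypothetical.

```python
import math
from collections import Counter


def smoothed_prob(q, doc_tokens, collection_tokens, alpha=0.1):
    """Interpolate the document estimate with a collection estimate.

    Assumed smoothing: (1 - alpha) * c(q|D)/|D| + alpha * c(q|C)/|C|.
    """
    p_doc = Counter(doc_tokens)[q] / len(doc_tokens)
    p_col = Counter(collection_tokens)[q] / len(collection_tokens)
    return (1.0 - alpha) * p_doc + alpha * p_col


def query_log_likelihood(query_tokens, doc_tokens, collection_tokens, alpha=0.1):
    """Sum of log token probabilities over the query tokens.

    Raises ValueError if a query token is absent from the whole collection,
    because its smoothed probability is still 0 in that case.
    """
    return sum(
        math.log(smoothed_prob(q, doc_tokens, collection_tokens, alpha))
        for q in query_tokens
    )


# Toy data: "kanto" never occurs in doc but does occur in the collection.
doc = ["tokyo", "popular", "attractions", "disneyland", "skytree"]
other = ["kanto", "famous", "sightseeing", "spots", "guide"]
collection = doc + other

print(smoothed_prob("kanto", doc, collection) > 0.0)  # True
```

Smoothing removes the zero probability, but "kanto" still scores far below "tokyo" for this document even though the two tokens are related, which is the limitation the following paragraphs address.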

Non-Patent Document 1: J. M. Ponte and W. B. Croft, “A language modeling approach to information retrieval”, In Proc. SIGIR 1998, pp. 275-281, 1998.
Non-Patent Document 2: Koji Eguchi, “Probabilistic Language Model for Information Retrieval”, IPSJ SIG Technical Report, Vol. 2010-SLP-82, No. 9, 2010.

In the information retrieval model based on the multinomial distribution, the token space is represented as a discrete space, and a probability value p(q|θ_D) is assigned to each token q. In this case, a generation probability cannot be obtained for a token to which no probability value has been assigned in advance. For example, if the document D is "Tokyo's popular attractions are Disneyland and Skytree" and the search query Q is "Tell me about famous sightseeing spots in Kanto", the document can be considered relevant to the search query, yet the query likelihood under the multinomial model is small. Because tokens that do not appear in the document tend to receive small generation probabilities, the generation probabilities of the tokens "Kanto" and "sightseeing spots" are low.

In fact, the token "Kanto" in the search query and the token "Tokyo", the token "popular" and the token "famous", and the token "attractions" and the token "sightseeing spots" are pairs of similar tokens, but the information retrieval model based on the multinomial distribution cannot take this similarity into account. That is, in a discrete space, information retrieval that considers the similarity between tokens cannot be performed.

An object of the present invention is to provide a hyperspherical space language model generation device, a query likelihood calculation device, and methods and programs therefor that enable higher-performance information retrieval.

A hyperspherical space language model generation device according to one aspect of the present invention includes a hyperspherical space language model conversion unit that assigns a continuous value vector of magnitude 1 to each token contained in an input document using a predetermined token continuous value vector conversion model, and generates, using the assigned continuous value vectors of magnitude 1, a hyperspherical space language model that represents the document as a probabilistic language model on a hyperspherical space.

A hyperspherical space language model generation device according to one aspect of the present invention includes a hyperspherical space language model conversion unit that assigns a continuous value vector of magnitude 1 to each token contained in each document of a document set using a predetermined token continuous value vector conversion model, and generates, using the assigned continuous value vectors of magnitude 1, hyperspherical space language models that represent each document of the document set as a probabilistic language model on a hyperspherical space.

A query likelihood calculation device according to one aspect of the present invention includes a query likelihood calculation unit that assigns a continuous value vector of magnitude 1 to each token contained in an input search query using a predetermined token continuous value vector conversion model, calculates the sum of the likelihoods of the continuous value vectors assigned to the tokens contained in the search query under the hyperspherical space language model generated by the hyperspherical space language model generation device of claim 1, which represents a document as a probabilistic language model on a hyperspherical space, and takes this sum as the query likelihood of the search query under the hyperspherical space language model of the document.

A query likelihood calculation device according to one aspect of the present invention includes a query likelihood calculation unit that assigns a continuous value vector of magnitude 1 to each token contained in an input search query using a predetermined token continuous value vector conversion model, calculates the sum of the likelihoods of the continuous value vectors assigned to the tokens contained in the search query under each of the hyperspherical space language models generated by the hyperspherical space language model generation device of claim 1, which represent each document of a document set as a probabilistic language model on a hyperspherical space, and takes this sum as the query likelihood of the search query under the hyperspherical space language model of each document.

It is possible to generate a hyperspherical space language model, or to calculate a query likelihood, that enables higher-performance information retrieval.

A block diagram for explaining an example of an information retrieval device.
A flow chart for explaining an example of an information retrieval method.

[Technical background]
The key to solving the above problem is to introduce a framework that represents an arbitrary token as a continuous value vector of magnitude 1 and to handle tokens in a vector space on a hypersphere. By introducing a hyperspherical space that is a metric space in which similar tokens are close to each other (separated by a small angle), the similarity between tokens can be evaluated directly. To handle this idea within the framework of information retrieval based on probabilistic language models, the information retrieval model of a document is constructed using, for example, a mixture of von Mises-Fisher distributions. This makes it possible to calculate the query likelihood while capturing the similarity between tokens, which the information retrieval model based on the multinomial distribution cannot capture, as the distance between vectors on the hyperspherical space (on a hypersphere, the angle).

That is, even when, for example, the document D is "Tokyo's popular attractions are Disneyland and Skytree" and the search query Q is "Tell me about famous sightseeing spots in Kanto", the query likelihood can be calculated while capturing the fact that the tokens "Kanto" and "Tokyo", the tokens "popular" and "famous", and the tokens "attractions" and "sightseeing spots" are pairs of similar tokens.

Then, by representing each document as a probabilistic language model on the hyperspherical space and, as in the conventional approach, outputting the documents with high query likelihood for the input search query, high-performance information retrieval can be realized.

[Premise]
Documents and search queries are assumed to be segmented into tokens. The token unit may be, for example, a word or a character. For word units, English text can be split on spaces, and Japanese text can be split by any segmentation method such as morphological analysis.
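The two token units mentioned above can be sketched as follows. The function names are illustrative; Japanese word segmentation would require a morphological analyzer, which is outside this sketch.

```python
def tokenize_words(text):
    """Word-level tokens for space-delimited text such as English."""
    return text.split()


def tokenize_chars(text):
    """Character-level tokens, ignoring whitespace."""
    return [ch for ch in text if not ch.isspace()]


print(tokenize_words("famous sightseeing spots in Kanto"))
# ['famous', 'sightseeing', 'spots', 'in', 'Kanto']
print(tokenize_chars("関東 の"))
# ['関', '東', 'の']
```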

A framework that represents an arbitrary token as a continuous value vector of magnitude 1 is introduced. This is called a "token continuous value vector conversion model". If the token continuous value vector conversion model is rebuilt, the rebuilt model must be used in the embodiments described later.

The token continuous value vector conversion model can be built in various ways; typical choices are the Skip-gram model and the continuous Bag-of-Words model (see, for example, Reference 1). In this case, tokens can be converted into continuous value vectors by training a Skip-gram model or a continuous Bag-of-Words model in advance on a large amount of text data. Note that at this point the magnitude of a vector is not necessarily 1.

[Reference 1] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, “Efficient estimation of word representations in vector space”, In Proc. ICLR, 2013.

The number of dimensions of the continuous value vectors is fixed when the token continuous value vector conversion model is built. This number is specified manually in advance and must be 2 or more; for example, 200 or 400 dimensions may be used.

Next, to represent an arbitrary vector as a continuous value vector of magnitude 1, the following normalization is performed. The vector v obtained by normalizing a vector \bar{v} to magnitude 1 can be calculated according to the following equation.

$$v = \frac{\bar{v}}{\lVert \bar{v} \rVert}$$

As a result, the token continuous value vector conversion model outputs a continuous value vector of magnitude 1 when a token is input.
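The normalization step amounts to dividing a vector by its Euclidean norm. The following is a minimal sketch in plain Python; the function name `normalize` and the example vector are illustrative only.

```python
import math


def normalize(vec):
    """Scale vec to magnitude 1 by dividing by its Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


unit = normalize([3.0, 4.0])
print(unit)  # [0.6, 0.8]
```

After this step every token vector lies on the unit hypersphere, so the inner product of two token vectors equals the cosine of the angle between them.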

In the embodiment described below, it is assumed that the predetermined token continuous value vector conversion model has been determined in advance.

[Embodiment]
Hereinafter, an embodiment of the present invention will be described with reference to the drawings.

The information retrieval device includes a hyperspherical space language model generation unit 1 and an information retrieval unit 2. The hyperspherical space language model generation unit 1 includes a hyperspherical space language model conversion unit 11. The information retrieval unit 2 includes a query likelihood calculation unit 21. The information retrieval method is realized by the units of the information retrieval device performing the processing of steps S1 to S22 described below.

<Hyperspherical space language model generation unit 1>
A document set and a predetermined token continuous value vector conversion model are input to the hyperspherical space language model conversion unit 11 of the hyperspherical space language model generation unit 1.

The hyperspherical space language model conversion unit 11 generates a hyperspherical space language model of each document in the document set based on the predetermined token continuous value vector conversion model (step S1). The hyperspherical space language model conversion unit 11 outputs the generated hyperspherical space language model of each document in the document set to the information retrieval unit 2.

The hyperspherical space language model conversion unit 11 represents each document in the document set as a hyperspherical space language model. For example, if there are 100 documents, each of the 100 documents is modeled as a hyperspherical space language model. Hereinafter, each document in the document set is denoted as document D.

First, the hyperspherical space language model conversion unit 11 converts each token contained in the document D into a continuous value vector of magnitude 1 using the predetermined token continuous value vector conversion model. The hyperspherical space language model conversion unit 11 then trains the hyperspherical space language model for the document D using this sequence of continuous value vectors.

Here, the continuous value vector of magnitude 1 for a token q is written q'. That is, D = q_1, …, q_L is represented as q_1', …, q_L'. When the hyperspherical space language model p(q'|θ_D) is expressed as a mixture of von Mises-Fisher distributions, the probability value can be obtained by the following equation.

$$p(q' \mid \theta_D) = \sum_{m=1}^{M} \lambda_m \, C_p(\kappa_m) \exp\left(\kappa_m \mu_m^{\top} q'\right)$$

Here, M, λ_m, κ_m, and μ_m are the model parameters of the mixture of von Mises-Fisher distributions, and C_p is a predetermined function. Various methods are conceivable for estimating these model parameters from the document D, but M is basically determined manually. M must be at least 2 and at most L. For example, one can set M = L and obtain the other parameters from q_1', …, q_L' by maximum likelihood estimation. For maximum likelihood estimation of the model parameters of a mixture of von Mises-Fisher distributions, see, for example, Reference 2.
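The mixture density described above can be evaluated numerically. The following is a minimal sketch, assuming the standard mixture form p(q'|θ_D) = Σ_m λ_m C_p(κ_m) exp(κ_m μ_m·q') with the usual von Mises-Fisher normalizer C_p(κ) = κ^(p/2-1) / ((2π)^(p/2) I_(p/2-1)(κ)) as given in Reference 2; the patent text itself only calls C_p a predetermined function. The function names and the power-series evaluation of the modified Bessel function are illustrative choices, and the series is only suitable for moderate κ. Parameter estimation is not shown.

```python
import math


def log_bessel_i(v, x, terms=200):
    """log of the modified Bessel function I_v(x) for x > 0, via its power series.

    I_v(x) = sum_k (x/2)^(2k+v) / (k! * Gamma(k+v+1)); summed in the log domain.
    The fixed number of terms is an assumption that holds for moderate x.
    """
    logs = [
        (2 * k + v) * math.log(x / 2.0) - math.lgamma(k + 1) - math.lgamma(k + v + 1)
        for k in range(terms)
    ]
    peak = max(logs)
    return peak + math.log(sum(math.exp(t - peak) for t in logs))


def log_vmf_normalizer(p, kappa):
    """log C_p(kappa) for a p-dimensional von Mises-Fisher distribution (p >= 2)."""
    v = p / 2.0 - 1.0
    return v * math.log(kappa) - (p / 2.0) * math.log(2.0 * math.pi) - log_bessel_i(v, kappa)


def vmf_mixture_density(q, weights, kappas, means):
    """p(q' | theta_D) for a unit vector q under a mixture of vMF components."""
    p = len(q)
    total = 0.0
    for lam, kappa, mu in zip(weights, kappas, means):
        dot = sum(a * b for a, b in zip(mu, q))
        total += lam * math.exp(log_vmf_normalizer(p, kappa) + kappa * dot)
    return total


# Hypothetical 3-dimensional model with two components.
weights = [0.5, 0.5]
kappas = [5.0, 5.0]
means = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

near = vmf_mixture_density([1.0, 0.0, 0.0], weights, kappas, means)
far = vmf_mixture_density([-1.0, 0.0, 0.0], weights, kappas, means)
print(near > far)  # True
```

A vector close to one of the mean directions μ_m receives a higher density than a vector pointing away from all of them, which is how token similarity enters the model.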

[Reference 2] A. Banerjee, I. Dhillon, J. Ghosh, and S. Sra, “Clustering on the Unit Hypersphere Using Von Mises-Fisher Distributions,” J. Machine Learning Research, vol. 6, pp. 1345-1382, 2005.

In this way, the hyperspherical space language model conversion unit 11 assigns a continuous value vector of magnitude 1 to each token contained in each document of the document set using the predetermined token continuous value vector conversion model, and generates the hyperspherical space language model of each document of the document set using the assigned continuous value vectors of magnitude 1.

In other words, the hyperspherical space language model conversion unit 11 assigns a continuous value vector of magnitude 1 to each token contained in the input document D using the predetermined token continuous value vector conversion model, and generates the hyperspherical space language model of the document D using the assigned continuous value vectors of magnitude 1.

<Information retrieval unit 2>
The hyperspherical space language model of each document in the document set, the predetermined token continuous value vector conversion model, and a search query are input to the information retrieval unit 2.

The information retrieval unit 2 calculates the query likelihood of the search query with respect to the hyperspherical space language model of each document in the document set (step S21). For example, if there are 100 documents, the information retrieval unit 2 calculates the query likelihood of the search query for each of the 100 documents. Hereinafter, each document in the document set is denoted as document D.

First, the query likelihood calculation unit 21 of the information retrieval unit 2 represents each token of the search query as a continuous value vector of magnitude 1 using the predetermined token continuous value vector conversion model. In other words, the information retrieval unit 2 assigns a continuous value vector of magnitude 1 to each token contained in the input search query using the predetermined token continuous value vector conversion model.

Then, the query likelihood calculation unit 21 calculates the query likelihood of the search query using the hyperspherical space language model of the document D. Here, the continuous value vector of magnitude 1 for a token q is written q'. That is, the token sequence Q = q_1, …, q_L of the search query is represented as q_1', …, q_L'. The query likelihood of the search query Q for the document D can then be calculated by the following equation.

$$\log P(Q \mid \theta_D) = \sum_{i=1}^{L} \log p(q_i' \mid \theta_D)$$

Here, p(q_i'|θ_D) is the value of the hyperspherical space language model p(q'|θ_D) generated by the hyperspherical space language model conversion unit 11, evaluated at q_i'.
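The query likelihood calculation can be sketched end to end for a toy case. To stay short, the sketch below assumes 3-dimensional vectors, for which the von Mises-Fisher normalizer has the closed form κ / (4π sinh κ), and it sums log densities over the query tokens. The vectors, the mixture parameters, and all function names are hypothetical and are not taken from the patent.

```python
import math


def unit(vec):
    """Normalize vec to magnitude 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def vmf3_density(x, mu, kappa):
    """von Mises-Fisher density on the unit sphere in 3 dimensions only."""
    c3 = kappa / (4.0 * math.pi * math.sinh(kappa))
    return c3 * math.exp(kappa * sum(a * b for a, b in zip(mu, x)))


def mixture_density(x, components):
    """components: list of (lambda_m, mu_m, kappa_m); the weights sum to 1."""
    return sum(lam * vmf3_density(x, mu, kappa) for lam, mu, kappa in components)


def query_log_likelihood(query_vectors, components):
    """Sum of log p(q_i' | theta_D) over the query token vectors."""
    return sum(math.log(mixture_density(q, components)) for q in query_vectors)


# Hypothetical query token vectors and two hypothetical document models.
query = [unit([0.9, 0.1, 0.0]), unit([0.1, 0.9, 0.0])]
doc_near = [(0.5, [1.0, 0.0, 0.0], 10.0), (0.5, [0.0, 1.0, 0.0], 10.0)]
doc_far = [(0.5, [-1.0, 0.0, 0.0], 10.0), (0.5, [0.0, 0.0, 1.0], 10.0)]

score_near = query_log_likelihood(query, doc_near)
score_far = query_log_likelihood(query, doc_far)
print(score_near > score_far)  # True
```

The document whose component means point in nearly the same directions as the query token vectors receives the higher score, even though no query vector coincides exactly with a mean.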

In this way, the query likelihood calculation unit 21 calculates, for the hyperspherical space language model of each document in the document set, the sum of the likelihoods of the continuous value vectors q' assigned to the tokens q contained in the search query, and takes this sum as the query likelihood of the search query under the hyperspherical space language model of that document.

In other words, the query likelihood calculation unit 21 of the information retrieval unit 2 calculates, for the hyperspherical space language model of the document D, the sum of the likelihoods of the continuous value vectors assigned to the tokens contained in the search query, and takes this sum as the query likelihood of the search query under the hyperspherical space language model of the document D.

After the query likelihood of the search query has been obtained for each document, the information retrieval unit 2 presents information in any suitable form based on the query likelihoods, for example by sorting the documents according to a predetermined criterion and presenting them to the user (step S22). For example, with N a predetermined positive integer, information on the N documents with the highest query likelihood may be presented to the user.
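Presenting the top N documents reduces to sorting by query likelihood. The following is a minimal sketch; the document identifiers and scores are made-up placeholders, and `top_n_documents` is an illustrative name.

```python
def top_n_documents(scores, n):
    """Return the n (document, query likelihood) pairs with the highest likelihood."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Placeholder log query likelihoods for three documents.
scores = {"doc1": -12.4, "doc2": -3.1, "doc3": -7.8}
print(top_n_documents(scores, 2))  # [('doc2', -3.1), ('doc3', -7.8)]
```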

[Embodiment]
The information retrieval model manufacturing device and the information retrieval device include a hyperspherical space language model conversion unit for a document set and an information retrieval unit based on hyperspherical space language models. The hyperspherical space language model conversion unit represents each document of the document set as a hyperspherical space language model. The information retrieval unit calculates the query likelihood of the search query with respect to the hyperspherical space language model of each document in the document set, and outputs the documents sorted in descending order of query likelihood.

In this way, by capturing on a hyperspherical space the relationship between a document and a search query that cannot be captured in a discrete space, the query likelihood can be calculated with token similarity taken into account. High-performance information retrieval can therefore be realized.

[Modification]
The predetermined token continuous value vector conversion model may include two or more predetermined token continuous value vector conversion models. The two or more models may be, for example, two types of model, a Skip-gram model and a continuous Bag-of-Words model. They may also be, for example, a plurality of Skip-gram models each trained on a different language resource.

In these cases, the vectors obtained from the individual models are concatenated into a single vector; for example, concatenating a 200-dimensional vector and a 300-dimensional vector gives a vector of 200 + 300 = 500 dimensions. By normalizing this concatenated vector to magnitude 1, it can be used in the same framework as the basic embodiment.
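The combination of several conversion models can be sketched as concatenation followed by normalization. The vector contents below are placeholders standing in for the outputs of two hypothetical models, and the function name is illustrative.

```python
import math


def concat_and_normalize(*vectors):
    """Concatenate the given vectors and scale the result to magnitude 1."""
    combined = [x for vec in vectors for x in vec]
    norm = math.sqrt(sum(x * x for x in combined))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in combined]


# Placeholder outputs of two models for the same token.
v_model_a = [0.1] * 200
v_model_b = [0.2] * 300

combined = concat_and_normalize(v_model_a, v_model_b)
print(len(combined))  # 500
```

Normalizing after concatenation, rather than before, is what the text describes; the relative scale of the two input vectors therefore affects how much each model contributes to the final direction.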

In addition, it goes without saying that changes can be made as appropriate without departing from the gist of the present invention.

[Program and recording medium]
When the processing in the hyperspherical space language model generation device, the query likelihood calculation device, or the information retrieval device is realized by a computer, the processing content of the functions that the device should have is described by a program. By executing this program on a computer, the processing of the hyperspherical space language model generation device, the query likelihood calculation device, or the information retrieval device is realized on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

Each processing means may be configured by executing a predetermined program on a computer, or at least part of the processing content may be realized in hardware.

Claims

A hyperspherical space language model generation device comprising:
a hyperspherical space language model conversion unit that assigns a continuous value vector of magnitude 1 to each token contained in an input document using a predetermined token continuous value vector conversion model, and generates, using the assigned continuous value vectors of magnitude 1, a hyperspherical space language model that represents the document as a probabilistic language model on a hyperspherical space.

The hyperspherical space language model generation device according to claim 1,
wherein the predetermined token continuous value vector conversion model includes two or more predetermined token continuous value vector conversion models.

A query likelihood calculation device comprising:
a query likelihood calculation unit that assigns a continuous value vector of magnitude 1 to each token contained in an input search query using a predetermined token continuous value vector conversion model, calculates the sum of the likelihoods of the continuous value vectors assigned to the tokens contained in the search query in the hyperspherical space language model generated by the hyperspherical space language model generation device of claim 1, which represents the document as a probabilistic language model on a hyperspherical space, and takes the sum as the query likelihood of the search query in the hyperspherical space language model that represents the document as a probabilistic language model on the hyperspherical space.

A hyperspherical space language model generation device comprising:
a hyperspherical space language model conversion unit that assigns a continuous value vector of magnitude 1 to each token contained in each document of a document set using a predetermined token continuous value vector conversion model, and generates, using the assigned continuous value vectors of magnitude 1, hyperspherical space language models that represent each document of the document set as a probabilistic language model on a hyperspherical space.

A query likelihood calculation device comprising:
a query likelihood calculation unit that assigns a continuous value vector of magnitude 1 to each token contained in an input search query using a predetermined token continuous value vector conversion model, calculates the sum of the likelihoods of the continuous value vectors assigned to the tokens contained in the search query in each of the hyperspherical space language models generated by the hyperspherical space language model generation device of claim 4, which represent each document of the document set as a probabilistic language model on a hyperspherical space, and takes the sum as the query likelihood of the search query in the hyperspherical space language model that represents each document as a probabilistic language model on the hyperspherical space.

A hyperspherical space language model generation method comprising:
a hyperspherical space language model conversion step in which a hyperspherical space language model conversion unit assigns a continuous value vector of magnitude 1 to each token contained in an input document using a predetermined token continuous value vector conversion model, and generates, using the assigned continuous value vectors of magnitude 1, a hyperspherical space language model that represents the document as a probabilistic language model on a hyperspherical space.

A query likelihood calculation method comprising:
a query likelihood calculation step in which a query likelihood calculation unit assigns a continuous value vector of magnitude 1 to each token contained in an input search query using a predetermined token continuous value vector conversion model, calculates the sum of the likelihoods of the continuous value vectors assigned to the tokens contained in the search query in the hyperspherical space language model generated by the hyperspherical space language model generation device according to claim 1, which represents the document as a probabilistic language model on a hyperspherical space, and takes the sum as the query likelihood of the search query in the hyperspherical space language model that represents the document as a probabilistic language model on the hyperspherical space.

A program for causing a computer to function as each unit of the device according to any one of claims 1 to 5.