JP2011118689A

JP2011118689A - Retrieval method and system

Info

Publication number: JP2011118689A
Application number: JP2009275762A
Authority: JP
Inventors: Mitsuru Ishizuka; 満石塚; Bollegala Danushka; ダヌシカボッレーガラ; Tuan Duc Nguen; トアンドゥクグェン
Original assignee: University of Tokyo NUC
Current assignee: University of Tokyo NUC
Priority date: 2009-12-03
Filing date: 2009-12-03
Publication date: 2011-06-16

Abstract

PROBLEM TO BE SOLVED: To represent the relations between entity pairs found in large text collections such as the WWW, and to realize relational search using those relations.

SOLUTION: Given an entity pair (A, B) and an entity C as input, the method retrieves an entity X such that the relation between C and X in the pair (C, X) is the same as or similar to the relation between A and B. The relation between A and B is defined by the context surrounding A and B in text that contains both entities. By searching for entities C and X whose surrounding context is the same as or similar to the context around A and B, the method obtains an entity X whose relation to C is the same as or similar to the relation between A and B.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a search method and system, and more particularly to a search technique that uses the relation between the two entities forming a pair. In this specification, an "entity" is one or more words (that is, a word or phrase), typically a single noun or a noun phrase consisting of several words.

In recent years, the amount of data on the WWW has grown enormously, and a great many relations between entities lie latent in it. A conventional keyword-based search engine accepts keywords and finds text containing them, but it cannot search for relations between entities. It therefore cannot make active use of the large amount of relational information latent in the WWW.

The TextRunner system described in Non-Patent Document 1 extracts the relation between two given entities. Given an entity pair (A, B) as input, TextRunner can retrieve predicates that hold between A and B. For example, given the pair (Bush, United States), it outputs the "is" predicate is(George W. Bush, President of United States). That is, it does not directly output the relation between Bush and United States ("is president of"); it outputs only the generic predicate ("is") that holds between entities containing Bush and United States (George W. Bush, President of United States). TextRunner therefore cannot find the relation between the two entities themselves, and cannot be used for relational search.

The TextRunner system is described in Patent Document 1 as a method for learning extraction patterns by unsupervised learning. In this method, a parser is first run over a small corpus (a set of texts) to parse the sentences in each document. Nouns are taken from the parse results, and the relation between each pair of nouns is obtained from the syntax tree. Heuristics are used to decide whether each relation should be extracted, producing samples for learning. For example, from the syntax tree of the sentence "Oppenheimer taught at Berkeley and Caltech", the patterns (Oppenheimer, taught at, Berkeley) and (Oppenheimer, taught at, Caltech) are judged to be positive samples and (Berkeley, and, Caltech) a negative sample. A Naïve Bayes classifier is trained on these samples; given a relation, it decides whether the relation is good (correct) or bad (incorrect). The features used for classification include whether the two nouns of the relation are proper nouns, their distance in the sentence, and the parts of speech to their left and right. Because the classifier is learned from a small corpus, using a parser does not take much time. Once trained, the classifier can judge whether relations extracted from a very large corpus are good (correct).

This method can extract good (correct) relations and may be a good basis for the present invention. However, Patent Document 1 provides only relation extraction and search: a way to extract relations between words, and a way to search for a relation or an unknown word when a word or relation is given. It does not handle the similarity between the relations of word pairs, so it cannot realize the latent relational search engine of the present invention. Because its search serves a different purpose, its indexing method and search algorithm also differ entirely from those of the present invention. Furthermore, using a parser to create training data makes the method language dependent, and patterns are unlikely to be extracted well for languages with few language resources (languages with no parser).

Several methods have been proposed for measuring the similarity between two word pairs. For example, in the LRA (latent relational analysis) method proposed in Non-Patent Document 2, when two entity pairs (A, B) and (C, D) are given, a frequency vector of the lexical patterns that characterize the relation between A and B is generated automatically using a keyword-based search engine. The pair (C, D) is processed in the same way, and the relational similarity between (A, B) and (C, D) is measured as the cosine similarity of their feature vectors. Non-Patent Document 3 further proposes a method for building the feature vector of a word pair quickly from the WWW, and a method for reducing the dimensionality of the feature vectors to improve the accuracy of the relational similarity.

The methods disclosed in Non-Patent Documents 2 and 3 form the basis of the present invention. However, a relational similarity measurement system needs all four entities of the two entity pairs. If one entity is unknown, the similarity cannot be measured, and no feature vector can be output for the pair containing the unknown entity. The systems proposed in those documents alone therefore cannot realize the search engine proposed in the present invention.

Non-Patent Document 4 proposes a method for measuring the similarity between word pairs using WordNet. WordNet is a concept dictionary of the kind shown in FIG. 13. Using this dictionary, the similarity between two words is defined as follows.

[Equation omitted in source]

Here, τ(w_i, w_j) is the similarity between the words w_i and w_j, δ(w) is the depth of the word w in the WordNet hierarchy, and P_ij is the common hypernym of w_i and w_j. For example, in FIG. 13 the depth of the concept "abstract entity" is 1, and the depths of Indo-European, Greek, and English are 5, 6, and 8, respectively. The similarity of (English, Greek) is therefore

[Equation omitted in source]

Using the similarity between words, the similarity between word pairs is then defined as follows.

[Equation omitted in source]

α and β are weights, set to α = 2 and β = 1 so that more similar word pairs are given priority. With this formula, for example, π((English, A), (Greek, Zeus)) = 0.52 and π((English, A), (Greek, Alpha)) = 0.82. That is, (English, A) is more similar to (Greek, Alpha) than to (Greek, Zeus).

Relational search could also be implemented with the method of Non-Patent Document 4. Given the query (English, A) and (Greek, ?), the path from English to A is found, and then the path (Greek, ?) that shares the longest common path with it is searched for. As FIG. 13 shows, for the query (English, A), (Greek, ?) this method returns answers such as (Greek, Alpha), (Greek, Beta), and so on, rather than (Greek, Zeus). By itself, however, it cannot single out the correct answer "Alpha".

A further proposal close to relational search is the "Analogical Thesaurus" of Non-Patent Document 5. The analogical thesaurus is a hierarchical concept dictionary for finding word pairs that stand in analogous relations. In WordNet, as shown in FIG. 11, English letters and Greek letters are both classified under the concept {LETTER, ALPHABETIC_CHARACTER} with no distinction between them, so the query {(English, A), (Greek, ?)} cannot be answered.

In contrast, the thesaurus proposed in Non-Patent Document 5 introduces concepts that subdivide WordNet more finely. As shown in FIG. 12, the new concepts {GREEK_LETTER}, {ENGLISH_LETTER}, and {1st LETTER} are introduced to subdivide the original WordNet concept {LETTER, ALPHABETIC_CHARACTER}. When the query {(English, A), (Greek, ?)} is input, the system consults this dictionary and locates the node for the letter {A}. The ancestors of that node include {ENGLISH_LETTER} and {1st LETTER}. Because the query asks for the correspondence between English and Greek, {GREEK_LETTER} is the candidate counterpart of {ENGLISH_LETTER}, since the two share the ancestor node {LETTER, ALPHABETIC_CHARACTER}. The answer candidates are therefore all child nodes of {GREEK_LETTER}. Finally, to pick the correct answer from the candidate set: among the child nodes of {GREEK_LETTER}, the node {ALPHA} shares the immediate ancestor {1st LETTER} with the query node {A}, so {ALPHA} is output. The output for the query {(English, A), (Greek, ?)} is thus "ALPHA" (the letter α). As the derivation shows, the reason is that A is the first letter of the English alphabet and α is the first letter of the Greek alphabet. The thesaurus proposed in Non-Patent Document 5 can therefore handle relational search directly.

With this method, however, the capability of the search system is very limited. To search accurately, new concepts (concepts for fine subdivision) must be introduced into WordNet. Methods exist for deriving such concepts automatically, but their accuracy is poor, which is a major cause of reduced recall in the final results. Worse still, primitive concepts (that is, concepts with no child nodes) that do not exist in WordNet cannot be searched for at all. For example, new concepts (names of people, organizations, and so on) appear frequently on the WWW, and the analogical thesaurus cannot handle them.

Non-Patent Document 1: M. Banko and O. Etzioni. The Tradeoffs Between Open and Traditional Relation Extraction. Proceedings of ACL, pp. 28-36 (2008).
Non-Patent Document 2: Peter D. Turney. Measuring Semantic Similarity by Latent Relational Analysis. Proceedings of International Joint Conf. on Artificial Intelligence (IJCAI), pp. 1136-1141 (2005).
Non-Patent Document 3: Danushka Bollegala, Yutaka Matsuo and Mitsuru Ishizuka. Measuring the Similarity between Implicit Semantic Relations from the Web. Proceedings of the 18th International World Wide Web Conference (WWW 2009), pp. 651-660 (2009).
Non-Patent Document 4: Tony Veale. WordNet Sits the S.A.T. - A Knowledge-Based Approach to Lexical Analogy. Proceedings of the 16th European Conf. on Artificial Intelligence (ECAI), pp. 606-612 (2004).
Non-Patent Document 5: Tony Veale. The Analogical Thesaurus. In Proceedings of the Fifteenth Conference on Innovative Applications of Artificial Intelligence (IAAI), pp. 137-141 (2003).

Patent Document 1: US2008/0243479A1

As described above, existing systems such as TextRunner cannot adequately represent the relations between entity pairs even when a text corpus is available. The analogical thesaurus supports only limited relational search and cannot exploit the vast data of the WWW. Several methods for measuring relational similarity have been proposed, but as they stand they cannot directly realize relational search.

An object of the present invention is to represent the relations between entity pairs found in large text collections such as the WWW, and to realize relational search using those relations.

The first technical means adopted by the present invention is a relational search method that, when an entity pair (A, B) and an entity C are input, uses a search database and/or an existing search engine to retrieve an entity X such that the relation between the entities C and X of the pair (C, X) is the same as or similar to the relation between the entities A and B, the method comprising:
receiving the entity pair (A, B) and the entity C as a query;
defining the relation between entity A and entity B according to the context surrounding A and B in text that contains both entities; and
searching for entities C and X whose surrounding context is the same as or similar to the context surrounding A and B, thereby obtaining an entity X whose relation to entity C is the same as or similar to the relation between entity A and entity B.

An "entity" is one or more words, that is, a word or phrase.
The present invention obtains pairs (C, X) that have a high relational similarity to (A, B), ranked by that relational similarity.
In the present invention the position of entity X is not restricted; for example, an entity pair (X, D) can be handled in the same way as an entity pair (C, X). That is, when an entity pair (A, B) and an entity D are input as a query, replacing the entity pair (C, X) with (X, D) and the entity C with D yields a method that retrieves an entity X such that the relation between X and D in the pair (X, D) is the same as or similar to the relation between A and B.

In one aspect, the context is one of, or any combination of, lexical patterns, bag-of-words, part-of-speech patterns, and dependency patterns.

In one aspect, the contexts are clustered, and each context cluster contains contexts that are substantially identical or similar.

In one aspect, the relation between the two entities of an entity pair is represented by a feature vector, which is obtained from the contexts or context clusters surrounding the two entities of the pair.

In one aspect, the relation between entity A and entity B is represented by a first feature vector, which is obtained from the contexts or context clusters surrounding A and B.
The step of obtaining entity X obtains an entity X that, together with entity C, has a second feature vector identical or similar to the first feature vector; the second feature vector is obtained from the contexts or context clusters surrounding C and X in text that contains both entities.

In one aspect, the step of obtaining entity X comprises:
calculating the similarity between the pair (A, B) and each pair (C, X) from the distance between their feature vectors, sorting the pairs (C, X) in descending order of this inter-pair similarity (relational similarity), and ranking the candidates for X in the sorted order; and
outputting some or all of the ranked candidates X as the search result.
Because the relational similarity is typically calculated from the distance between feature vectors, the feature vectors must be created first. In one aspect, a feature vector can be defined by weights based on pattern frequencies.
In a typical aspect, the relational similarity is calculated from the cosine similarity between the two feature vectors, but other distance measures may be used.
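
The ranking step described above can be illustrated with a minimal Python sketch. It assumes each feature vector is a sparse mapping from pattern string to frequency and uses cosine similarity, the typical measure named in the text. The function names `cosine` and `rank_candidates` and the example frequencies are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {pattern: weight} dicts."""
    dot = sum(w * v[p] for p, w in u.items() if p in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(first_vector, candidate_vectors, top_k=None):
    """Sort candidates X by the similarity of (C, X) to (A, B), highest first."""
    scored = [(x, cosine(first_vector, vec)) for x, vec in candidate_vectors.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored if top_k is None else scored[:top_k]


# Invented frequencies for the query (A, B) = (Japan, Tokyo), C = France.
ab_vector = {"Y is the capital of X": 5, "X , Y": 2}
cx_vectors = {
    "Lyon": {"Y is a city in X": 3, "X , Y": 2},
    "Paris": {"Y is the capital of X": 4, "X , Y": 1},
}
ranking = rank_candidates(ab_vector, cx_vectors)
```

In this toy data "Paris" ranks above "Lyon" because its vector shares the dominant pattern with the query pair. Passing `top_k` corresponds to outputting only part of the ranked list.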

In one aspect, the context is a lexical pattern and the elements of the feature vector are lexical pattern frequencies. That is, the relation between the entities of a pair is represented by a feature vector whose feature values are weights calculated from lexical pattern frequencies.
More specifically, each element of a feature vector consists of a feature (dimension) and its value (feature value). When the context is a lexical pattern, the feature (dimension) is a lexical pattern and the value is the frequency of that pattern.
When lexical patterns are clustered, each dimension of the feature vector is a lexical pattern cluster, and the value of each dimension is the sum of the frequencies of the lexical patterns in that cluster.
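
As a rough illustration of the clustered representation, the following Python sketch folds a per-pattern frequency vector into a per-cluster vector by summing the frequencies of the lexical (vocabulary) patterns in each cluster. The function name, the cluster IDs, and the treatment of unclustered patterns as singleton dimensions are assumptions made for this example; the text does not specify them.

```python
from collections import defaultdict


def cluster_feature_vector(pattern_freq, pattern_to_cluster):
    """Map {pattern: frequency} to {cluster ID: summed frequency}."""
    vector = defaultdict(int)
    for pattern, freq in pattern_freq.items():
        cluster_id = pattern_to_cluster.get(pattern)
        if cluster_id is None:
            # Assumption: a pattern outside every cluster keeps its own dimension.
            cluster_id = ("singleton", pattern)
        vector[cluster_id] += freq
    return dict(vector)


# Invented example: two paraphrase patterns share cluster 0.
freqs = {"X acquired Y": 3, "X bought Y": 2, "X , Y": 1}
clusters = {"X acquired Y": 0, "X bought Y": 0}
clustered = cluster_feature_vector(freqs, clusters)
```

Here the two paraphrases contribute a single dimension with value 5, so pairs described with different but equivalent wording can still obtain similar vectors.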

In one aspect, the step of obtaining entity X comprises:
obtaining a plurality of pairs (C, X) whose second feature vector contains, as an element, a lexical pattern of the first feature vector of the pair (A, B) whose frequency exceeds a preset threshold;
calculating the similarity between the pair (A, B) and each pair (C, X) from their feature vectors, sorting the pairs (C, X) in descending order of this inter-pair similarity, and ranking the candidates for X in the sorted order; and
outputting some or all of the ranked candidates X as the search result.

In one aspect, the search database stores the correspondence between a large number of entity pairs and the surrounding contexts that express the relation between the entities of each pair.
In one aspect, the context is a lexical pattern, and the search database stores, for each of a large number of entity pairs, the corresponding lexical patterns and their frequencies.

In one aspect, the search database comprises:
a first index that defines the ID of each entity of the entity pairs;
a second index that defines the entity pair ID corresponding to a pair of two entity IDs;
a third index that defines the lexical pattern ID corresponding to each lexical pattern; and
a fourth index that defines the correspondence between entity pair IDs and lexical pattern IDs with their frequencies;
and the method comprises:
obtaining the entity pair ID corresponding to the input entity pair (A, B), and using it to obtain the lexical pattern IDs (for patterns whose frequency exceeds a preset threshold) and frequencies corresponding to (A, B), thereby forming the first feature vector;
obtaining candidate entity pair IDs that contain a lexical pattern ID of the first feature vector (for patterns whose frequency exceeds the preset threshold) and that contain the ID of entity C; and
obtaining the second feature vector corresponding to each candidate entity pair ID, and obtaining the entity pairs (C, X) whose second feature vector is similar to the first feature vector.
In one aspect, the threshold for the lexical pattern frequency is a frequency of 1. In practice, low-frequency patterns often contain noise such as spelling errors, so it is desirable to choose and set a suitable threshold in advance.
In one aspect, the fourth index comprises:
an index for looking up lexical pattern IDs and frequencies from an entity pair ID; and
an index for looking up entity pair IDs and frequencies from a lexical pattern ID.
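
One way to picture the four indexes is as in-memory dictionaries, as in the Python sketch below. It is only an illustration under stated assumptions: the class `RelationIndex`, its method names, and the integer ID scheme are hypothetical, the fourth index is kept in both lookup directions, and "exceeds the threshold" is read as a strict comparison. A real database would store these tables on disk.

```python
from collections import defaultdict


class RelationIndex:
    """In-memory stand-in for the four indexes (all names are hypothetical)."""

    def __init__(self):
        self.entity_ids = {}    # first index: entity string -> entity ID
        self.pair_ids = {}      # second index: (entity ID, entity ID) -> pair ID
        self.pattern_ids = {}   # third index: pattern string -> pattern ID
        # Fourth index, kept in both lookup directions.
        self.pair_to_patterns = defaultdict(dict)   # pair ID -> {pattern ID: freq}
        self.pattern_to_pairs = defaultdict(dict)   # pattern ID -> {pair ID: freq}

    @staticmethod
    def _get_or_assign(table, key):
        # Assign the next integer ID the first time a key is seen.
        return table.setdefault(key, len(table))

    def add(self, first, second, pattern, freq):
        first_id = self._get_or_assign(self.entity_ids, first)
        second_id = self._get_or_assign(self.entity_ids, second)
        pair_id = self._get_or_assign(self.pair_ids, (first_id, second_id))
        pattern_id = self._get_or_assign(self.pattern_ids, pattern)
        self.pair_to_patterns[pair_id][pattern_id] = freq
        self.pattern_to_pairs[pattern_id][pair_id] = freq

    def candidates(self, a, b, c, threshold=1):
        """Return {X: second feature vector} for pairs (C, X) sharing a pattern with (A, B)."""
        try:
            ab_id = self.pair_ids[(self.entity_ids[a], self.entity_ids[b])]
            c_id = self.entity_ids[c]
        except KeyError:
            return {}   # an entity, or the pair (A, B), is not in the database
        first_vector = {p: f for p, f in self.pair_to_patterns[ab_id].items()
                        if f > threshold}
        entity_of = {i: e for e, i in self.entity_ids.items()}
        members_of = {i: ids for ids, i in self.pair_ids.items()}
        found = {}
        for pattern_id in first_vector:
            for pair_id, freq in self.pattern_to_pairs[pattern_id].items():
                first_id, second_id = members_of[pair_id]
                if first_id == c_id and pair_id != ab_id and freq > threshold:
                    found[entity_of[second_id]] = dict(self.pair_to_patterns[pair_id])
        return found


# Invented data for the query (Japan, Tokyo), C = France.
index = RelationIndex()
index.add("Japan", "Tokyo", "Y is the capital of X", 5)
index.add("France", "Paris", "Y is the capital of X", 4)
index.add("France", "Lyon", "Y is a city in X", 3)
index.add("Germany", "Berlin", "Y is the capital of X", 6)
result = index.candidates("Japan", "Tokyo", "France")
```

The inverted direction of the fourth index is what makes candidate retrieval cheap: only pairs that share at least one sufficiently frequent pattern with (A, B) are visited, and their second feature vectors can then be compared with the first.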

In one aspect, the lexical patterns in the search database form lexical pattern clusters, each of which contains similar or substantially identical lexical patterns.
In one aspect, the step of obtaining the entity pairs (C, X) includes searching not only for pairs (C, X) whose feature vector has as an element the same lexical pattern (with frequency above a preset threshold) as a lexical pattern of the feature vector of (A, B), but also for pairs (C, X) whose feature vector has as an element a lexical pattern (with frequency above the preset threshold) belonging to the same lexical pattern cluster as a lexical pattern of the feature vector of (A, B).

In one aspect, entity clusters consisting of similar entities are formed in the search database.
In one aspect, the score of the entity cluster to which an entity X belongs is used as the score for ranking the candidate X.
In one aspect, the output step outputs the entities contained in the entity cluster. The score of an entity cluster is the cluster average score: the sum of the scores of the entities in the cluster, normalized by dividing by the cluster size.
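
The cluster average score can be written out as a short Python sketch. The function names, the cluster IDs, and the example scores are hypothetical; treating a cluster member that has no score as contributing 0 is an assumption of this example, not something the text states.

```python
def cluster_average_scores(entity_scores, clusters):
    """Sum of member scores divided by cluster size, per cluster."""
    averages = {}
    for cluster_id, members in clusters.items():
        if not members:
            continue
        # Assumption: a member without a score contributes 0 to the sum.
        total = sum(entity_scores.get(entity, 0.0) for entity in members)
        averages[cluster_id] = total / len(members)
    return averages


def ranked_clusters(entity_scores, clusters):
    """Clusters sorted by average score, each returned with its member entities."""
    averages = cluster_average_scores(entity_scores, clusters)
    order = sorted(averages, key=averages.get, reverse=True)
    return [(cid, averages[cid], clusters[cid]) for cid in order]


# Invented scores for candidate entities X.
scores = {"Paris": 0.9, "Paris, France": 0.7, "Lyon": 0.2}
entity_clusters = {"c1": ["Paris", "Paris, France"], "c2": ["Lyon"]}
output = ranked_clusters(scores, entity_clusters)
```

Dividing by the cluster size keeps a large cluster of weak candidates from outranking a small cluster of strong ones, and returning the members with each cluster matches the aspect in which all entities of the cluster are output.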

In one aspect, the feature vector representing the relation between the entities of an entity pair is obtained by:
splitting sentences taken from a text corpus into words;
creating candidate entity pairs from the resulting words; and
obtaining the feature vector from the context surrounding each candidate entity pair.

In one aspect, each element of the feature vector is the frequency of a lexical pattern, and extracting the lexical-pattern frequencies comprises:
a step of replacing, in the original sentence containing the candidate entity pair, the first entity of the pair with X and the second entity with Y, generating a subsequence S that contains X and Y, and generating the n-grams of S (for n from 1 to K); and
a step of treating each n-gram obtained as one dimension of the feature vector and counting its frequency over all sentences that contain the entity pair;
the n-gram frequencies being stored, as lexical-pattern frequencies, in association with the entity pair.
Here, the subsequence S is the word sequence consisting of the m_1 words immediately before X, X itself, the words between X and Y (excluding X and Y), Y itself, and the m_2 words immediately after Y. m_1, m_2, and K are integer parameters.
Let D be the length of the word sequence between X and Y (excluding X and Y). In one aspect, when D is greater than K-2, the n-gram [X, the sequence between X and Y, Y] is additionally generated.
In one aspect, the context around a candidate entity pair is acquired only when the distance D between the two entities of the pair is at most a threshold D_th. The distance D is measured in words.
In one aspect, in the step of creating candidate entity pairs, entity pairs that contain a proper noun are preferentially extracted as candidate entity pairs.
In one aspect, in the step of creating candidate entity pairs, frequently occurring entity pairs are preferentially extracted as candidate entity pairs.
In one aspect, the method includes a step of forming clusters of lexical patterns from lexical patterns that are similar or substantially identical.
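The pattern-extraction steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes each entity is a single token, that sentences are already tokenized, and that the positions of the two entities are known. The function names (`subsequence`, `ngrams`, `pattern_counts`) and the default parameter values are hypothetical.

```python
from collections import Counter

def subsequence(tokens, i_x, i_y, m1=1, m2=1):
    """Build S: m1 words before X, X, the words in between, Y, m2 words after Y.

    Returns S together with the positions of X and Y inside S.
    Assumes i_x < i_y and single-token entities (an illustrative simplification).
    """
    start = max(0, i_x - m1)
    end = min(len(tokens), i_y + m2 + 1)
    s = list(tokens[start:end])
    x_pos, y_pos = i_x - start, i_y - start
    s[x_pos], s[y_pos] = "X", "Y"
    return s, x_pos, y_pos

def ngrams(seq, k_max):
    """All n-grams of seq for n = 1..k_max, as space-joined strings."""
    return [" ".join(seq[i:i + n])
            for n in range(1, k_max + 1)
            for i in range(len(seq) - n + 1)]

def pattern_counts(sentences, k_max=3, m1=1, m2=1):
    """Count lexical patterns over all sentences containing one entity pair.

    sentences: iterable of (tokens, index_of_first_entity, index_of_second_entity).
    """
    counts = Counter()
    for tokens, i_x, i_y in sentences:
        s, x_pos, y_pos = subsequence(tokens, i_x, i_y, m1, m2)
        grams = ngrams(s, k_max)
        d = i_y - i_x - 1  # number of words between X and Y
        if d > k_max - 2:
            # The span from X to Y is longer than K words, so add it explicitly.
            grams.append(" ".join(s[x_pos:y_pos + 1]))
        counts.update(grams)
    return counts
```

For the sentence "Tokyo is the capital of Japan ." with K = 3, the gap D = 4 exceeds K-2, so the full pattern "X is the capital of Y" is counted in addition to the short n-grams.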

In one aspect, the search method of the present invention comprises:
a step of using an existing search engine to acquire the context around entities A and B in text containing entity A and entity B; and
a step of using the existing search engine to acquire entities C and X whose context is the same as or similar to the context around entities A and B.
The processing inside the search system differs depending on whether the entities given as the query exist in the search database. If one or more of entities A, B, and C do not exist in the search database, an existing keyword-based Web search engine is used to dynamically extract lexical patterns representing the relationship between entities A and B, and the entities X that are associated with entity C through those lexical patterns are extracted and ranked.

In one aspect, the context around entities A and B is acquired from the snippets that the search engine returns for a query consisting of entity A and entity B with the distance between them restricted.
Specifically, the search here uses a query (e.g., "A * * * B") formed by inserting a predetermined number of wildcards "*" between entities A and B, where each "*" matches zero or one word. An ordinary keyword search engine returns snippets containing entities A and B when they are entered as a query, so the wildcard search can be regarded as that plain query with an added restriction on the distance between A and B. Searching with such a query therefore yields snippets that also include the context before and after entities A and B, and lexical patterns that include this outer context can be generated from those snippets. The number of "*" is set in advance, as appropriate, by a person skilled in the art.
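The wildcard query described above can be illustrated with a short sketch. The helper names are hypothetical, and the query syntax of any real search engine is not assumed beyond the quoted "A * * * B" form given in the text; `within_distance` simply re-checks, on an already tokenized snippet, the distance restriction that the wildcards express.

```python
def wildcard_query(a, b, n_wildcards=3):
    """Build a phrase query such as "A * * * B" (each * matches zero or one word)."""
    return '"' + " ".join([a] + ["*"] * n_wildcards + [b]) + '"'

def within_distance(snippet_tokens, a, b, n_wildcards=3):
    """True if a occurs before b with at most n_wildcards words in between."""
    for i, tok in enumerate(snippet_tokens):
        if tok == a:
            # b may sit at offsets 1 .. n_wildcards + 1 after a.
            if b in snippet_tokens[i + 1:i + n_wildcards + 2]:
                return True
    return False
```

A snippet that passes this check can then be handed to the lexical-pattern extraction step, including the words just outside A and B.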

In one aspect, the context around entities A and B is the set of lexical patterns extracted from the above snippets (snippets obtained with a query such as "A * * * B").
Candidates for entity X are acquired from the snippets returned for a query obtained by substituting entity C into an extracted lexical pattern.
When the initial search query is (A, B), (C, ?), the query here is "C pattern *", which preserves the order of entity C, the lexical pattern, and "?". When the initial search query is (A, B), (?, D), the query here is "* pattern D", which preserves the order of "?", the lexical pattern, and entity D.
When the query "C pattern *" is entered, many existing search engines do not return only "C pattern W1" (where W1 is a single word, or a single character in Japanese); the snippet of the matched text also contains surrounding context. In practice, therefore, "... C pattern W1 W2 W3 ..." is returned, with "*" corresponding to W1. The candidates for the unknown entity D can thus be taken as W1, W1 W2, W1 W2 W3, and so on.
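A minimal sketch of collecting answer candidates from a snippet returned for "C pattern *" follows. It assumes the snippet is already tokenized and that the pattern is a list of tokens; the function name `candidates_after` and the cap `max_len` on candidate length are illustrative assumptions, not values given in the text.

```python
def candidates_after(snippet_tokens, c, pattern_tokens, max_len=3):
    """Return W1, "W1 W2", "W1 W2 W3", ... for each match of "C pattern" in the snippet."""
    prefix = [c] + list(pattern_tokens)
    out = []
    for i in range(len(snippet_tokens) - len(prefix) + 1):
        if snippet_tokens[i:i + len(prefix)] == prefix:
            tail = snippet_tokens[i + len(prefix):i + len(prefix) + max_len]
            # Every non-empty prefix of the tail is a candidate for the unknown entity.
            out.extend(" ".join(tail[:n]) for n in range(1, len(tail) + 1))
    return out
```

Over-long candidates such as "Steve Ballmer said" are expected at this stage; the later ranking step is what separates plausible entities from noise.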

The second technical means adopted by the present invention is a search system which, when an entity pair (A, B) and an entity C are input, searches for an entity X such that the relationship between entities C and X of the entity pair (C, X) is the same as or similar to the relationship between entities A and B, wherein
a search database stores the correspondence between a large number of entity pairs and the surrounding contexts that express the relationship between the entities of each pair,
the system comprising:
means for receiving the entity pair (A, B) and the entity C as a query;
means for acquiring, using the search database, a first surrounding context expressing the relationship between entity A and entity B; and
means for acquiring entity X by searching, using the search database, for entities C and X having a second surrounding context that is the same as or similar to the first surrounding context.

In one aspect, the context is a lexical pattern, and the search database stores, for each of a large number of entity pairs, the correspondence between the pair and its lexical patterns and their frequencies.
In one aspect, the search system comprises:
means for acquiring a plurality of entity pairs (C, X) whose second feature vectors contain, as elements, lexical patterns of the first feature vector of the acquired entity pair (A, B) (patterns whose frequency exceeds a preset threshold);
means for calculating the inter-pair similarity between the entity pair (A, B) and each entity pair (C, X) from their feature vectors;
means for sorting the entity pairs (C, X) in descending order of the calculated inter-pair similarity and ranking the candidates for X in that order; and
means for outputting some or all of the ranked candidates for X as the search result.
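The filtering and ranking means listed above might look like the following sketch. This passage does not name a similarity measure, so cosine similarity is used here purely as an assumption; the sparse-dictionary vector representation, the function names, and the threshold default are likewise hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {pattern: frequency}."""
    dot = sum(f * v.get(p, 0) for p, f in u.items())
    norm_u = math.sqrt(sum(f * f for f in u.values()))
    norm_v = math.sqrt(sum(f * f for f in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_candidates(stem_vec, candidates, min_freq=1):
    """Rank candidate answers X by the similarity of (C, X) to the stem pair (A, B).

    stem_vec:   feature vector of (A, B).
    candidates: {X: feature vector of (C, X)}.
    Only candidates sharing a stem pattern with frequency > min_freq are kept.
    """
    strong = [p for p, f in stem_vec.items() if f > min_freq]
    scored = [(cosine(stem_vec, vec), x)
              for x, vec in candidates.items()
              if any(p in vec for p in strong)]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [x for _, x in scored]
```

Candidates that share no sufficiently frequent pattern with the stem pair are dropped before any similarity is computed, which keeps the scoring step small.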

In one aspect, the search database comprises:
a first index defining the ID of each entity of the entity pairs;
a second index defining the entity pair ID corresponding to each pair of two entity IDs;
a third index defining the lexical pattern ID corresponding to each lexical pattern; and
a fourth index defining the correspondence between entity pair IDs and lexical pattern IDs with their frequencies;
and the system:
acquires the entity pair ID corresponding to the input entity pair (A, B) and, using that ID, acquires the lexical pattern IDs (of patterns whose frequency exceeds a preset threshold) and their frequencies for (A, B) to form a first feature vector;
acquires candidate pair IDs that include a lexical pattern ID of the first feature vector (of a pattern whose frequency exceeds the preset threshold) and that include the ID of entity C; and
acquires the second feature vector corresponding to each candidate pair ID, and acquires the entity pairs (C, X) whose second feature vector is similar to the first feature vector.
In one aspect, the fourth index comprises:
an index for looking up lexical pattern IDs and their frequencies from an entity pair ID; and
an index for looking up entity pair IDs and the pattern frequencies from a lexical pattern ID.
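The four indexes and the lookup sequence above can be sketched with in-memory dictionaries. This is only an illustration of the data layout: the class name `RelationIndex`, its method names, and the use of Python dictionaries instead of an on-disk index are all assumptions.

```python
class RelationIndex:
    """In-memory sketch of the four indexes described in the text."""

    def __init__(self):
        self.entity_id = {}         # first index: entity -> entity ID
        self.pair_id = {}           # second index: (entity ID, entity ID) -> pair ID
        self.pattern_id = {}        # third index: lexical pattern -> pattern ID
        self.pair_to_patterns = {}  # fourth index (forward): pair ID -> {pattern ID: freq}
        self.pattern_to_pairs = {}  # fourth index (inverted): pattern ID -> {pair ID: freq}

    @staticmethod
    def _intern(table, key):
        # Assign the next free integer ID the first time a key is seen.
        return table.setdefault(key, len(table))

    def add(self, e1, e2, pattern, freq=1):
        ids = (self._intern(self.entity_id, e1), self._intern(self.entity_id, e2))
        pid = self._intern(self.pair_id, ids)
        tid = self._intern(self.pattern_id, pattern)
        fwd = self.pair_to_patterns.setdefault(pid, {})
        fwd[tid] = fwd.get(tid, 0) + freq
        self.pattern_to_pairs.setdefault(tid, {})[pid] = fwd[tid]

    def feature_vector(self, e1, e2, min_freq=0):
        """First feature vector: pattern IDs of (e1, e2) with frequency > min_freq."""
        pid = self.pair_id.get((self.entity_id.get(e1), self.entity_id.get(e2)))
        return {t: f for t, f in self.pair_to_patterns.get(pid, {}).items() if f > min_freq}

    def candidate_pairs(self, stem_vec, c):
        """Pair IDs that contain entity c first and share a pattern with stem_vec."""
        cid = self.entity_id.get(c)
        with_c = {pid for (a, _), pid in self.pair_id.items() if a == cid}
        found = set()
        for tid in stem_vec:
            found.update(p for p in self.pattern_to_pairs.get(tid, {}) if p in with_c)
        return found
```

Keeping both directions of the fourth index is what makes the candidate step cheap: the inverted map goes straight from a pattern of (A, B) to every pair that shares it.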

In one aspect, the search database stores clusters of lexical patterns, and each lexical pattern cluster contains lexical patterns that are similar or substantially identical.
In one aspect, the search database stores entity clusters, and each entity cluster contains entities that are similar or substantially identical.

The present invention expresses useful relationships between entities, drawn from a huge text corpus, as feature vectors, and on that basis makes relational search practical. The method also scales to huge data sources: relational search can be performed over an information space as large as the WWW, that is, at Web scale.

The indexing technique of the present invention, which builds the search database, is important for reducing search time. Because the search database is built in advance by indexing, the system can execute relational searches quickly and with little waiting time.

The relational search engine of the present invention, which is based on latent relationships, can solve a variety of problems. For example, a question answering system returns answers to input questions, but it has difficulty answering when the required knowledge is absent from its knowledge base. With the search system of the present invention, a wide range of questions can be answered even from little knowledge. Suppose the system already knows that "Tokyo is the capital of Japan" (for example, the knowledge base contains the first-order predicate formula capital(Japan, Tokyo)). The answer to the question "What is the capital of France?" is then the result of the query {(Japan, Tokyo), (France, ?)}. Even if the system does not hold the knowledge capital(France, Paris) in advance, it can produce the answer using only the WWW or a text corpus.

The search engine can also easily be used to search for similar words, and can therefore serve as a system that supports thesaurus construction. For example, if {(A, B), (C, ?)} is entered as the query and A and B are similar words, the entity D obtained as the search result for "?" is likely to be a similar word of C. In the same way, not only similar words but also hypernyms and hyponyms, antonyms, and words in a part-whole relationship (meronyms) can be searched for.

A further application that can be put to use immediately is product manufacturer search. For example, a user of an e-commerce website knows that "Shiseido is well known for cosmetics" and wants to find out which manufacturers are good for game consoles. Entering the query (cosmetics, Shiseido) and (game console, ?) into the system yields a list of well-known game console manufacturers.

Because relational search is performed over a huge information space such as the WWW, the latest data (knowledge) can be processed, and new concepts and information can be handled. Multiple candidates can also be presented to the user for a single query.
The search engine developed in the present invention therefore has a wide range of applications.

A diagram showing the concept of the search method and system of the present invention.
A diagram showing the search screen of the search method and system of the present invention.
A display screen of the simple results (results without evidence sentences) for the query (steve jobs, apple), (?, microsoft).
The display screen of FIG. 2A with only some of the evidence sentences shown; some of the lexical patterns are shown under Debug Info.
A display screen of the simple results (results without evidence sentences) for the query (steve jobs, apple), (steveballmer, ?).
A diagram showing the overall configuration of the search method and system of the present invention.
A diagram showing the relationship between entity pairs and feature vectors.
A diagram showing part-of-speech tagging.
A diagram showing (a) the correspondence between entities and entity IDs and (b) the correspondence between entity pairs and entity pair IDs.
A table of the correspondence between lexical patterns and lexical pattern IDs.
A table that stores, for each entity pair, the lexical patterns that occur with that pair and their frequencies.
The table of FIG. 8 expressed with IDs.
A table mapping each lexical pattern ID to the list of entity pair IDs whose feature vectors contain that lexical pattern as a component, together with the frequencies.
A diagram showing WordNet (a concept dictionary) (Non-Patent Document 5).
A diagram showing WordNet (a concept dictionary) after the introduction of concepts that subdivide it (Non-Patent Document 5).
A diagram showing paths in WordNet (a concept dictionary).

[A] Overview of the Search Method and System
As shown in FIG. 1, the search system according to the present invention receives two entity pairs {(A, B), (C, ?)} as a query and searches for an entity that appropriately fills "?". The output for "?" is an entity X such that the pair (C, X) has a relationship similar to the relationship between entities A and B of the pair (A, B). Typically, the entities X are obtained as a ranked list. For example, given the three entities (words) {(Japan, Tokyo), (France, ?)} as input, the search system outputs "Paris" for "?" as the top result. This indicates that the relationship between "Japan" and "Tokyo" closely resembles the relationship between "France" and "Paris". For simplicity, this specification describes the input format as {(A, B), (C, ?)}, but in the present invention the "?" symbol can be placed at any position in either entity pair. The query may, for example, take an input format such as {(?, B), (C, D)}.

To answer such a query, it is necessary to extract the relationship between entity A and entity B and to search the database for other entity pairs (C, X) that have that relationship. However, the WWW or text corpus that the search engine targets is nothing more than a large body of text data, and no relationships between entities have been prepared in advance. The relationship between entity A and entity B in the entity pair (A, B) is therefore not stated explicitly beforehand; it is latent. In the present invention, such a latent relationship is expressed as a feature vector, and relational search is realized by measuring the relational similarity between two entity pairs, that is, their latent relational similarity.

When a latent relationship exists between entities A and B of a given entity pair (A, B) (referred to as the "stem word pair"), the present invention searches the WWW or a text corpus for an entity X such that, for a given third entity C, a similar latent relationship holds between entities C and X of the entity pair (C, X) or (X, C). Because searching for (X, C) does not differ in essence from searching for (C, X), the search for (C, X) is assumed below.

First, features that express the relationship latent between entities A and B of the given stem word pair (A, B) are found in the text corpus. The relationship between entities A and B can be extracted on the basis of the context around A and B in text containing both entities. There are several ways to capture the surrounding context; examples are the lexical patterns, bag-of-words, part-of-speech patterns, and dependency patterns around entities A and B, used singly or in any combination. In one aspect, these surrounding contexts (lexical patterns, bag-of-words, part-of-speech patterns, dependency patterns, and the like) are acquired in advance for a large number of entity pairs and stored in a database that associates each entity pair with its surrounding contexts. Alternatively, the surrounding context can be acquired on the fly at query time.

A lexical pattern is a sequence of words that preserves their order in the sentence. For example, an n-gram of the words appearing in a sentence is a lexical pattern. Bag-of-words is an approach that treats the words as a set, disregarding word order.

Part-of-speech patterns are explained briefly. For example, suppose the sentence "Obama is the president of the U.S." is tagged as
Obama/PERSON is/VBZ the/DT president/NN of/IN the/DT U.S./LOCATION
(where VBZ: verb, third person singular; DT: determiner; NN: noun, singular or mass; IN: preposition or subordinating conjunction). Then, as features of the entity pair (Obama, U.S.) representing the relationship between Obama and the U.S.,
one lexical pattern is "is the president of the", and
one part-of-speech pattern is "VBZ DT NN IN DT".
If the part-of-speech patterns are the same, the relationships between the entities of the pairs may be similar. For example, the sentence above and the following sentence have the same part-of-speech pattern:
Obama is the leader of the U.S.
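The two patterns in the Obama example can be read off a tagged sentence with a few lines of code. This sketch assumes the sentence has already been tagged by some external tagger and is supplied as (word, tag) tuples; the function name `patterns_between` is hypothetical.

```python
def patterns_between(tagged, x, y):
    """Return (lexical pattern, part-of-speech pattern) for the words between x and y.

    tagged: list of (word, tag) tuples for one sentence, with x occurring before y.
    """
    words = [w for w, _ in tagged]
    i, j = words.index(x), words.index(y)
    between = tagged[i + 1:j]
    lexical = " ".join(w for w, _ in between)
    pos = " ".join(t for _, t in between)
    return lexical, pos

sentence = [("Obama", "PERSON"), ("is", "VBZ"), ("the", "DT"), ("president", "NN"),
            ("of", "IN"), ("the", "DT"), ("U.S.", "LOCATION")]
lexical, pos = patterns_between(sentence, "Obama", "U.S.")
```

Replacing "president" with "leader" changes the lexical pattern but leaves the part-of-speech pattern unchanged, which is the point the text makes about the two sentences.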

A dependency pattern is the pattern that results from dependency parsing of a sentence. Because the result of dependency parsing expresses the role of each entity (word) in the sentence and the relations among them, entity pairs that share a dependency pattern may also be related. For example, the sentences
Obama is the president of the U.S.
Junichiro Koizumi was the prime minister of Japan
do not have exactly the same part-of-speech pattern, but they share the dependency pattern
PERSON<--subject--VERB(be)--object-->NOUN--of-->LOCATION
so some relationship can be expected between (Obama, U.S.) and (Junichiro Koizumi, Japan). The expression above represents dependencies as arrows, and each arrow's label is the name of the relation. That is, PERSON (Obama, Junichiro Koizumi) depends on the verb (be), and the noun (president, prime minister) also depends on the verb. In addition, LOCATION (U.S., Japan) depends on the noun in the middle (president, prime minister).

In the present invention, the lexical patterns around an entity pair are typically used as the context around the pair, although part-of-speech patterns or dependency patterns may also be used. However, given the processing time and accuracy problems of applying advanced language processing (dependency parsing) to the vast and noisy data of the Web, using lexical patterns alone is advantageous. For example, when a search is performed in the dynamic mode of the present invention, the snippets output by a keyword-based search engine are often not complete sentences, so morphological and syntactic analysis may fail.

In one aspect, the lexical patterns in sentences where entities A and B co-occur are used as the features expressing the relationship between A and B. For example, when the corpus contains the phrase "アメリカの大統領オバマ" (literally "America's president Obama"), the lexical pattern "の大統領" ("'s president") is extracted and treated as one of the patterns characterizing the latent relationship between "America" and "Obama" in the entity pair (America, Obama). Because several patterns can characterize the relationship between the entities of a pair, these characteristic patterns are expressed as a vector of frequencies. In this method, the latent relationship between the entities of an entity pair is thus expressed, from the Web or a text corpus, as a feature vector (in this embodiment, a vector of lexical-pattern frequencies).

Given a stem word pair, the system finds entity pairs that have a feature vector closely resembling the feature vector expressing the relationship between the entities of the stem word pair and that contain the third entity C as one element. The other element of each entity pair found (the entity that is not C) is a candidate answer. The similarity between each candidate entity pair and the stem word pair is measured, the candidates are sorted by similarity, and the results are output from the sorted list.

In one aspect of the present invention, the dimensions of the feature vector (for example, the lexical patterns) are further reduced by a clustering technique, which addresses noise and the problem of infrequently occurring patterns (the data sparseness problem).

When the entity to be searched for is, for example, a proper noun, it may be written in several ways (for example, "United States", "U.S.", and "U.S.A." are all proper nouns referring to the United States, but their written forms differ). In one aspect of the present invention, a clustering technique is also applied to entities with multiple surface forms so that different surface forms are treated as one entity wherever possible, with the aim of raising the precision and recall of the search.

The search system according to the present invention (a specific configuration example is shown in FIG. 3) consists of one or more computers, each of which comprises, as hardware, processing means (a CPU or the like), storage means (hard disk, RAM, ROM, and the like), input means, and output or display means, and, as software, a control program that operates the computer. The user terminal likewise consists of one or more computers, each comprising processing means, storage means, input means, output or display means, a control program that operates the computer, and the like. The search system and the user terminal each have transmitting and receiving means for exchanging information, and they are communicably connected to each other through a computer network such as the Internet. The search system is also connected to an existing keyword-based search engine through a computer network such as the Internet. The screen of the user terminal displays, for example, a search screen like that of FIG. 2; entities are entered through the input means of the user terminal and sent to the search system as a search query. The search system computes the search results from the search query and makes them viewable from the user terminal.

[B] Details of the Search Method and System
The following are described: lexical pattern generation, lexical pattern clustering, calculation of the similarity between entity pairs, indexing of entity pairs, search by relational similarity, and dynamic processing at query time using a remote corpus.

[B-1] Lexical Pattern Generation
A feature vector for expressing the relationship between the entities of an entity pair is defined, and the feature vectors are actually built from the text corpus. Each element of the feature vector is the frequency of a lexical pattern that expresses the relationship between the entities. For example, when sentences containing "アメリカの大統領オバマ" ("America's president Obama") occur 10 times and sentences containing "アメリカの議員オバマ" ("America's congressman Obama") occur 8 times, the feature vector of the entity pair (America, Obama) is expressed as the column vector V1 in FIG. 4. The frequency of any lexical pattern that does not occur with the pair is 0.
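The construction of aligned feature vectors, with 0 for absent patterns, can be sketched as below. The pattern strings are English glosses chosen for illustration, the third pattern and the second entity pair's count are invented example data, and the function name `to_dense` is hypothetical.

```python
def to_dense(counts_by_pair):
    """Align sparse pattern counts into dense vectors over a shared pattern vocabulary.

    counts_by_pair: {entity pair: {lexical pattern: frequency}}.
    Returns (sorted pattern list, {entity pair: list of frequencies}).
    """
    vocab = sorted({p for counts in counts_by_pair.values() for p in counts})
    vectors = {pair: [counts.get(p, 0) for p in vocab]
               for pair, counts in counts_by_pair.items()}
    return vocab, vectors

counts = {
    ("America", "Obama"): {"X president Y": 10, "X congressman Y": 8},
    ("Japan", "Tokyo"): {"X capital Y": 3},  # illustrative count, not from the text
}
vocab, vectors = to_dense(counts)
```

In practice the vectors are very sparse, so a dictionary keyed by pattern is usually the more economical representation; the dense form is shown only to mirror the column vector in the figure.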

There are several ways to build the feature vectors. First, when a text corpus is available in advance, it can be analyzed so that the feature vectors are built beforehand (before any query). Second, even without a text corpus, an ordinary keyword-based search engine can be used: a query such as "アメリカ***オバマ" (where the "*" operator is a wildcard that matches one or zero words) is sent to the keyword-based search engine, lexical patterns are extracted from the resulting snippets, and the feature vector is built from them. In other words, once the query is known, the relationships between the entities involved in the query may be extracted online (in real time) using an existing keyword-based search engine.

When generating feature vectors, the lexical patterns between entities are especially important. For accurate relational search, it is important to extract more precisely the lexical patterns that express a relationship from the contexts containing that relationship between entities. The pre-query processing of lexical patterns is described below.

The corpus processed before queries is a local corpus stored on the local hard disk of the search system (for example, a collection of texts or WWW pages on the local disk). Such a text corpus contains a large number of entities and latent relationships between them.

The texts are first split into sentences. Each sentence is then split into words, and each word is assigned a part of speech. This process is known as part-of-speech (POS) tagging. For example, as shown in FIG. 5, the sentence 「東京は日本の首都である」 ("Tokyo is the capital of Japan") yields, after POS tagging, the sequence <東京 (Tokyo)/noun, は/particle, 日本 (Japan)/noun, の/particle, 首都 (capital)/noun, で/auxiliary verb, ある/auxiliary verb>. From the resulting sequence of words and parts of speech, the nouns, verbs, adjectives, and adjectival nouns (形容動詞) are taken as candidate elements of entity pairs. In the sentence above, "Tokyo", "Japan", and "capital" become the candidate entities.

Candidate entity pairs are formed from the candidate entities. A candidate entity pair consists of two candidate entities in the order in which they appear in the sentence. For the sentence above, the three pairs (Tokyo, Japan), (Japan, capital), and (Tokyo, capital) are the candidate entity pairs. Each time a candidate entity pair appears in a sentence, the pair is recorded and its frequency is incremented. The location of the sentence in which the pair appears (the document ID and the position of the sentence within the document) is also recorded.
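The pair-generation step can be sketched as follows. This is an illustration under assumptions: the tag names, the romanized tokens, and the helpers `candidate_pairs` and `index_sentence` are hypothetical, and the POS tagger itself is taken as given.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Assumed tag names for the parts of speech kept as candidate entities.
CONTENT_TAGS = {"noun", "verb", "adjective", "adjectival_noun"}

def candidate_pairs(tagged_sentence):
    """Ordered pairs of candidate entities, preserving sentence order."""
    entities = [word for word, tag in tagged_sentence if tag in CONTENT_TAGS]
    return list(combinations(entities, 2))

pair_freq = Counter()                # pair -> number of occurrences
pair_locations = defaultdict(list)   # pair -> [(document ID, sentence position)]

def index_sentence(tagged_sentence, doc_id, sentence_pos):
    """Record every candidate pair of one sentence and where it occurred."""
    for pair in candidate_pairs(tagged_sentence):
        pair_freq[pair] += 1
        pair_locations[pair].append((doc_id, sentence_pos))

# "Tokyo is the capital of Japan", tagged as in FIG. 5 (tokens romanized).
tagged = [("Tokyo", "noun"), ("wa", "particle"), ("Japan", "noun"),
          ("no", "particle"), ("capital", "noun"),
          ("de", "auxiliary"), ("aru", "auxiliary")]
pairs = candidate_pairs(tagged)
index_sentence(tagged, doc_id=1, sentence_pos=0)
```

`itertools.combinations` emits pairs in input order, which matches the requirement that a pair keep the order of its entities in the sentence.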

To extract proper nouns preferentially as candidate entities, a named entity recognizer (NER) is used to extract the proper nouns in each sentence. When an entity pair contains a proper noun, that fact is recorded.

So that the multiple surface forms of a proper noun (United States, U.S., USA, ...) can be handled as a single entity, the surface forms are clustered; after clustering, United States, U.S., and USA belong to one cluster. To do this, the entities with which each surface form forms pairs are examined, together with the lexical patterns of the entity pairs that contain the surface form. Surface forms whose partner entities and lexical patterns are similar are judged to belong to one cluster. For example, the surface forms United States, U.S., and USA frequently form the entity pairs (United States, Barack Obama), (U.S., Barack Obama), and (USA, Barack Obama) with "Barack Obama". Moreover, "President" and "leader" appear in common among the lexical patterns of those entity pairs. "United States", "U.S.", and "USA" are therefore likely to be a single entity. At search time, if a candidate entity is found as "U.S.", also including "United States" and "USA" as candidates raises recall. Similarly, suppose the following counts are obtained.
For Steve Ballmer:
(Steve Ballmer, Microsoft): 50 occurrences
(Steve Ballmer, Bill Gates): 10
(Steve Ballmer, Microsoft Corp): 8
...
For Ballmer:
(Ballmer, Microsoft): 20
(Ballmer, Bill Gates): 15
(Ballmer, Gates): 10
...
Steve Ballmer and Ballmer are then judged to be similar and are placed in one cluster. For some types of entity, a synonym dictionary may also be stored in a database and used for entity clustering. In addition, when search results are presented to the user, displaying the cluster of an entity (proper noun), that is, the different surface forms belonging to the cluster, makes the results easier for the user to understand.
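The text does not fix a particular measure for "similar partner entities", so the following is only a sketch: it assumes cosine similarity over partner-entity counts and an arbitrary threshold of 0.7, using the Steve Ballmer / Ballmer counts listed above.

```python
import math

def cosine(p, q):
    """Cosine similarity between two sparse count dictionaries."""
    dot = sum(v * q.get(k, 0) for k, v in p.items())
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

# Partner-entity counts from the example in the text.
steve_ballmer = {"Microsoft": 50, "Bill Gates": 10, "Microsoft Corp": 8}
ballmer = {"Microsoft": 20, "Bill Gates": 15, "Gates": 10}

similarity = cosine(steve_ballmer, ballmer)
ASSUMED_THRESHOLD = 0.7  # illustrative value, not from the source
same_cluster = similarity > ASSUMED_THRESHOLD
```

A fuller version would apply the same comparison to the lexical patterns of the pairs, as the text suggests, and could consult a synonym dictionary first.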

To count the frequencies of the candidate entity pairs over the whole corpus, the above processing is applied to the entire corpus. Once the candidate entity pairs have been extracted, the pairs and their frequencies in the corpus are known, as is which pairs contain proper nouns. When new text is later added to the corpus, the frequencies can be updated incrementally by applying the same processing to that text.

Lexical patterns are extracted using the candidate entity pairs thus obtained. Patterns are extracted preferentially for entity pairs that contain proper nouns and for high-frequency entity pairs. Before lexical pattern extraction, various filtering techniques (for example, using the occurrence frequency of an entity pair or the distance between its entities) are applied to detect noisy or weakly related entity pairs, which are then removed. For example, an infrequent entity pair (for example, one that occurs four times or fewer) is treated as noise, and no lexical patterns are extracted for it. If the frequency of such a pair rises above this threshold as new text is added to the corpus, the pair is no longer treated as noise and its lexical patterns are extracted as usual.

To determine the lexical patterns, the distance D in the text between the two entities of a candidate entity pair is measured first. D is simply the number of words between the two entities. For an entity pair whose distance D exceeds a threshold D_th, the entities are considered weakly related and no lexical patterns are extracted.

For an entity pair whose distance D is at most the threshold D_th, the first entity of the pair is replaced with X and the second with Y in the original sentence, the following subsequence (word sequence) S containing X and Y is taken, and all n-grams of S (for n from 1 to K) are generated. Each n-gram obtained is one dimension of the feature vector, and its frequency is counted over the sentences that contain the entity pair. The feature vector thus has the n-grams as its dimensions and their frequencies as its values.
Subsequence S = PreX X InXY Y PostY
Here, PreX is the sequence of m₁ words immediately before X in the sentence, PostY is the sequence of m₂ words immediately after Y, and InXY is the sequence of words between X and Y. In one aspect, the parameters m₁ and m₂ are both 5. A pair is extracted as a candidate if the length of InXY is 10 or less (that is, no extraction is performed when the length D of InXY exceeds D_th). The n of the n-grams ranges from 1 to K; in one aspect, K = 7. In one aspect, when D is greater than K − 2 (D > K − 2, that is, D + 2 > K), the sequence (X, InXY, Y) is also generated as an n-gram.

This is described more concretely below.
The n-grams (n = 1, 2, 3, ..., K) are taken over the m words before and after the entities and all the words between the entities (provided that the distance D between the entities is within D_th).
For example, take the sentence:
They discussed with Barack Obama, president of the United States in December to find a solution to the problem.
With m₁ = m₂ = 2 and D_th = 10, the subsequence S to be extracted is:
S = discussed with Barack Obama, president of the United States in December
With K = 7 (D = 4, so D < K − 2), the extracted n-grams include the following (the comma counts as one token):
n = 2: discussed with; with X; X,; , president; president of; of the; the Y; Y in; in December
n = 4: discussed with X,; with X, president; X, president of; ...; the Y in December
n = 6: discussed with X, president of; with X, president of the; X, president of the Y; ...
n = 7: discussed with X, president of the; with X, president of the Y; X, president of the Y in; , president of the Y in December
As another example, take the sentence:
"They discussed with Barack Obama, the first African American president of the United States in December to find a solution to the problem"
With m₁ = m₂ = 2 and D_th = 10, the subsequence to be extracted is:
S = "discussed with X, the first African American president of the Y in December"
In this case D = 8, the length of ", the first African American president of the". Assuming K = 7, D = 8 > K − 2.
With K = 7 and D = 8 > K − 2, the extracted n-grams are all the n-grams for n = 1, 2, ..., 7 (generated as in the previous example) plus the following pattern:
X, the first African American president of the Y (D = 8, so the pattern has D + 2 = 10 tokens including X and Y)
That is, when D is greater than K − 2, the subsequence (X, InXY, Y) is extracted as an n-gram in addition to all the n-grams for n = 1 to K (this subsequence is an n-gram whose n is greater than K).
All the generated n-grams are used as features representing the relationship.
In one aspect, lexical pattern extraction is performed after sentence segmentation; that is, a subsequence is extracted from a single sentence and the n-grams are generated from that subsequence. In one aspect, even if one or both entities of the pair occur more than once in the sentence, only one occurrence of each is replaced (by X or Y), although all occurrences may instead be replaced at once.
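The extraction rule above can be sketched as follows, assuming the sentence is already tokenized with the comma as its own token and with each entity already replaced by a single token X or Y. The function name `extract_patterns` is hypothetical; the defaults m₁ = m₂ = 2, D_th = 10, K = 7 follow the worked examples rather than the m₁ = m₂ = 5 aspect.

```python
def extract_patterns(tokens, x_idx, y_idx, m1=2, m2=2, d_th=10, k=7):
    """Return all n-grams (n = 1..k) of S = PreX X InXY Y PostY as tuples.

    If D > k - 2, the sequence (X, InXY, Y) is appended as one extra pattern.
    Returns an empty list when D exceeds d_th.
    """
    d = y_idx - x_idx - 1                      # number of tokens between X and Y
    if d > d_th:
        return []
    start = max(0, x_idx - m1)                 # PreX: up to m1 tokens before X
    end = min(len(tokens), y_idx + m2 + 1)     # PostY: up to m2 tokens after Y
    s = tokens[start:end]
    grams = [tuple(s[i:i + n])
             for n in range(1, k + 1)
             for i in range(len(s) - n + 1)]
    if d > k - 2:
        grams.append(tuple(tokens[x_idx:y_idx + 1]))
    return grams

sentence = ("They discussed with X , president of the Y in December "
            "to find a solution to the problem .")
tokens = sentence.split()
grams = extract_patterns(tokens, tokens.index("X"), tokens.index("Y"))
```

Counting the returned tuples with `collections.Counter` over all sentences that contain the pair would give the n-gram frequencies used as feature values.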

The parameters D_th, m₁, m₂, and K are thresholds that a person skilled in the art can set as appropriate; the aspect above uses D_th = 10, m₁ = 5, m₂ = 5, and K = 7, but the values are not limited to these. For example, with D_th = 10, lexical patterns and feature vectors are not extracted for an entity pair whose entities are more than 10 words apart. When the feature vectors are computed, D_th may also be a dynamic threshold (for example, variable from 1 to 10). In addition, the minimum value of D_th at which a given lexical pattern can appear may be recorded in the search database. For example, for the sentence "Obama is the president of the USA", the minimum D_th for "X is the president of the Y" to appear is 5. If D_th = 4, the search algorithm is executed on the assumption that this lexical pattern does not appear.

[B-2] Lexical pattern clustering
Many different lexical patterns can express the relationship between the entities of an entity pair, so exactly matching lexical patterns may be rare. For example, the relationship "president of" can also be expressed as "leader of the country" or "leader of the government". To address the low frequency of exact pattern matches, this embodiment uses a pattern clustering method similar to the one disclosed in Non-Patent Document 3. Clustering the lexical patterns mitigates the low-frequency problem and raises search recall even when lexical patterns do not match exactly.

Lexical pattern clustering works as follows. First, each lexical pattern is represented by a vector of the frequencies of the entity pairs with which it co-occurs. For example, suppose the lexical pattern "X, the president of Y" appears 10 times with (Barack Obama, USA) and 20 times with (Vladimir Putin, Russia). A vector such as [10, 20] is then created. All lexical patterns are represented in the same way, by the frequency vector of the entity pairs with which they co-occur.

In the vector **p** that represents a lexical pattern p, the i-th element is given by the following expression:

$$\mathbf{p}_i = h(S_i, T_i, p)$$

Here, h is a function giving the frequency with which the entity pair (S, T) and the lexical pattern p occur together, and (S_i, T_i) denotes the i-th entity pair.

Next, the similarity between lexical patterns is computed as the cosine similarity between their vectors, and similar lexical patterns are grouped into clusters. Algorithm 1 on page 655 of Non-Patent Document 3 is used as the clustering algorithm. The clustering method disclosed in Non-Patent Document 3 is incorporated herein by reference.

In brief, the set of lexical patterns is first sorted in descending order of total occurrence frequency (the sum of a pattern's frequencies over all entity pairs), which gives a frequency ranking of the lexical patterns. Patterns are then taken one at a time from the top of the ranking; for each, the similarity to every existing lexical pattern cluster is computed and the nearest cluster is found. If the similarity to the nearest cluster is greater than a predetermined threshold, the pattern is added to that cluster. Otherwise, a new cluster consisting of that single pattern is created. This is repeated for all lexical patterns.
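The greedy procedure summarized above might be sketched as follows. This is not a reproduction of Algorithm 1 of Non-Patent Document 3: representing a cluster by the sum of its member vectors, the threshold value 0.5, and all pattern strings and counts are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster_patterns(pattern_vectors, threshold):
    """Greedy sequential clustering of patterns, most frequent pattern first."""
    order = sorted(pattern_vectors,
                   key=lambda p: sum(pattern_vectors[p]), reverse=True)
    clusters = []  # each cluster: {"members": [...], "centroid": [...]}
    for pattern in order:
        vec = pattern_vectors[pattern]
        best, best_sim = None, -1.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim > threshold:
            best["members"].append(pattern)
            # Assumed cluster representation: sum of member vectors.
            best["centroid"] = [a + b for a, b in zip(best["centroid"], vec)]
        else:
            clusters.append({"members": [pattern], "centroid": list(vec)})
    return clusters

# Columns: three hypothetical entity pairs.
pattern_vectors = {
    "X, the president of Y": [10, 20, 0],
    "Y's leader X": [5, 9, 0],
    "X is the capital of Y": [0, 0, 30],
}
clusters = cluster_patterns(pattern_vectors, threshold=0.5)
```

Because each pattern is visited once and compared only with existing clusters, the cost grows with the number of patterns times the number of clusters, not with all pattern-to-pattern comparisons.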

[B-3] Similarity between entity pairs
Once the feature vectors are available, searching for the answer to a query requires a similarity between two entity pairs (called the "relational similarity") that is defined in terms of the feature vectors representing the relationships between entities. The relational similarity can be computed from the distance between feature vectors. A function expressing the relational similarity between two entity pairs is defined as

$$\mathrm{RelSim}((A, B), (C, D))$$

In one typical aspect, the cosine similarity of the two feature vectors can be used as the relational similarity. The method proposed in Non-Patent Document 3 can also be used. Known distance measures between feature vectors, such as the Euclidean distance or the Mahalanobis distance, can be used as well.

For example, as shown in FIG. 4, the cosine similarity between the feature vector V1 of the entity (word) pair (America, Obama) and the feature vector V2 of the entity (word) pair (France, Sarkozy) is high, so the value of RelSim((America, Obama), (France, Sarkozy)) is also high.

With this similarity measure, the relational similarity of the entity pairs (A, B) and (C, D) is high when many of their lexical patterns match exactly. However, many different lexical patterns can express a relationship between entities, and exactly matching patterns may be rare. It is therefore desirable to compute the similarity in a way that takes the lexical pattern clustering described above into account.

The computation of the relational similarity is now described more concretely. Let the two entity pairs whose relational similarity is computed be (A, B) and (C, D). First, the lexical patterns that relate entity A and entity B are extracted. The number of times a lexical pattern p is extracted for the entity pair (A, B) is denoted h(A, B, p).

Next, the feature vector F(A, B) of the entity pair (A, B) is constructed as follows. The i-th element of F(A, B) is the sum, over all patterns p in the i-th lexical pattern cluster C_i, of the co-occurrence frequency of p with the entity pair (A, B). Written as an expression, the i-th element of F(A, B) is

$$F(A, B)_i = \sum_{p \in C_i} h(A, B, p)$$

The meaning of this expression is as follows. Note first that the purpose of a lexical pattern cluster is to group the different expressions used to convey the same relationship. For example, the relationship of being the president of a country can be expressed by various lexical patterns such as "X the president of Y", "Y president X", and "Y's head of state X" (here X is a person's name and Y is a country name). By treating these different surface expressions of the same meaning as if they were identical, the similarity computed between entity pairs is not affected when the pairs use superficially different lexical patterns. With the expression above, all the different lexical patterns that carry the same meaning collapse into a single dimension of the vector. This has the advantage of avoiding the zero-frequency (data sparseness) problem. If there are n lexical pattern clusters, the feature vector is n-dimensional; that is, each lexical pattern cluster contributes one feature.

One further refinement of the expression is needed. Some lexical pattern clusters contain an enormous number of patterns, while others contain few, because the variety of lexical patterns used to express a relationship depends on the relationship. For example, the relationship "largest" (for example, ostrich and bird) can be expressed in more ways than can be listed: "X is a large Y", "X the largest Y", "X is a big Y", "enormous Y X", and so on. By contrast, a fairly specific relationship such as "president of a country" has far fewer phrasings. The expression above does not take into account how many lexical patterns each cluster contains, that is, the size of the lexical pattern cluster. The right-hand side is therefore normalized by dividing it by the total frequency of all patterns in the cluster:

$$F(A, B)_i = \frac{\sum_{p \in C_i} h(A, B, p)}{\sum_{(S, T) \in V} \sum_{p \in C_i} h(S, T, p)}$$

Here, V is the set of all entity pairs.
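As an illustration of the normalized cluster features, the sketch below computes each element as the pair's total frequency within a cluster divided by the cluster's total frequency over all entity pairs in V. The dictionary layout of `h`, the helper name, and the sample counts are assumptions.

```python
def cluster_feature_vector(pair, clusters, h, all_pairs):
    """One normalized feature per lexical pattern cluster.

    h maps (entity_pair, pattern) -> co-occurrence frequency.
    """
    vector = []
    for cluster in clusters:
        numerator = sum(h.get((pair, p), 0) for p in cluster)
        denominator = sum(h.get((q, p), 0) for q in all_pairs for p in cluster)
        vector.append(numerator / denominator if denominator else 0.0)
    return vector

# Hypothetical counts and clusters.
obama, putin, tokyo = ("Obama", "USA"), ("Putin", "Russia"), ("Tokyo", "Japan")
h = {
    (obama, "X, the president of Y"): 10,
    (putin, "X, the president of Y"): 20,
    (obama, "Y's leader X"): 5,
    (tokyo, "X is the capital of Y"): 30,
}
clusters = [["X, the president of Y", "Y's leader X"], ["X is the capital of Y"]]
all_pairs = [obama, putin, tokyo]

f_obama = cluster_feature_vector(obama, clusters, h, all_pairs)
```

In this toy data the first cluster has total frequency 35, of which 15 belongs to (Obama, USA), so the first feature is 15/35; the second feature is 0 because the pair never occurs with a pattern of the second cluster.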

[B-4] Entity pair indexing
Measuring the similarity between pairs is not sufficient by itself to run a search, because the fourth entity is not known. One entity of the second pair is unknown and has to be searched for. In this embodiment, therefore, as many entity pairs as possible are stored in a database, and their feature vectors are computed in advance and stored in the database as well. This is called entity pair indexing. Indexing produces a mapping from each entity pair to the feature vector representing the relationship between its entities, and a mapping from each lexical pattern to the list of entity pairs that contain it. More concretely, these are an "entity pair to feature vector" table and a "lexical pattern to entity pairs containing the pattern" table. Search using the search database obtained by entity pair indexing is described later with reference to FIG. 3.
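The two tables produced by indexing can be sketched with in-memory dictionaries. This assumes the extraction step yields a stream of (entity pair, lexical pattern) observations; the function name and the sample data are hypothetical, and a real system would persist the tables in a database.

```python
from collections import Counter, defaultdict

def build_indexes(observations):
    """Build the pair -> feature vector and pattern -> pairs tables."""
    pair_to_vector = defaultdict(Counter)   # entity pair -> {pattern: frequency}
    pattern_to_pairs = defaultdict(set)     # pattern -> entity pairs containing it
    for pair, pattern in observations:
        pair_to_vector[pair][pattern] += 1
        pattern_to_pairs[pattern].add(pair)
    return pair_to_vector, pattern_to_pairs

observations = [
    (("USA", "Obama"), "X's president Y"),
    (("USA", "Obama"), "X's president Y"),
    (("France", "Sarkozy"), "X's president Y"),
    (("Japan", "Tokyo"), "X's capital Y"),
]
pair_to_vector, pattern_to_pairs = build_indexes(observations)
```

The second table is an inverted index: given a pattern from the query pair, it returns the candidate pairs directly, without scanning every stored feature vector.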

[B-5] Query processing
When a query {(A, B), (C, ?)} is input to the relational search engine, the feature vector of (A, B) is looked up and all lexical patterns contained in it (those with frequency greater than 1) are obtained.

Next, the lexical pattern clusters (from the clustering described above) to which those patterns belong are found. Other word pairs (C, D) that contain lexical patterns belonging to those clusters are then found, and the candidates for D are ranked using RelSim((A, B), (C, D)) as the score. The ranked list of D is the search result.
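The query steps above can be sketched end to end. Assumptions are stated plainly: feature vectors are stored as pattern-frequency dictionaries, RelSim is taken as the cosine over unnormalized cluster-level counts, and the pair (France, Dupont) and all counts are invented for illustration.

```python
import math

def to_cluster_vector(vec, pattern_to_cluster):
    """Aggregate pattern frequencies into one count per pattern cluster."""
    out = {}
    for pattern, freq in vec.items():
        cid = pattern_to_cluster.get(pattern)
        if cid is not None:
            out[cid] = out.get(cid, 0) + freq
    return out

def cosine(f1, f2):
    dot = sum(v * f2.get(k, 0) for k, v in f1.items())
    norm = (math.sqrt(sum(v * v for v in f1.values()))
            * math.sqrt(sum(v * v for v in f2.values())))
    return dot / norm if norm else 0.0

def relational_search(a, b, c, pair_vectors, pattern_to_cluster, min_freq=1):
    """Answer {(A, B), (C, ?)}: return candidate D values, best first."""
    query = pair_vectors.get((a, b), {})
    patterns = [p for p, f in query.items() if f > min_freq]
    cluster_ids = {pattern_to_cluster[p] for p in patterns if p in pattern_to_cluster}
    q_vec = to_cluster_vector(query, pattern_to_cluster)
    scored = []
    for (first, second), vec in pair_vectors.items():
        if first != c or (first, second) == (a, b):
            continue
        # Keep pairs having at least one pattern in the query's clusters.
        if any(pattern_to_cluster.get(p) in cluster_ids for p in vec):
            scored.append((cosine(q_vec, to_cluster_vector(vec, pattern_to_cluster)), second))
    return [d for _, d in sorted(scored, reverse=True)]

pattern_to_cluster = {"X's president Y": 0, "X's leader Y": 0,
                      "X's congressman Y": 1, "X's capital Y": 2}
pair_vectors = {
    ("USA", "Obama"): {"X's president Y": 10, "X's congressman Y": 8},
    ("France", "Sarkozy"): {"X's president Y": 12, "X's leader Y": 3},
    ("France", "Dupont"): {"X's congressman Y": 9},   # hypothetical entity
    ("France", "Paris"): {"X's capital Y": 30},
}
results = relational_search("USA", "Obama", "France", pair_vectors, pattern_to_cluster)
```

For brevity the sketch scans all stored pairs; with the inverted "lexical pattern to entity pairs" table, the candidate pairs would be looked up from the patterns instead.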

For entity pairs that do not appear in the search database obtained by indexing, text across the whole Web is searched using a keyword-based search engine, as described next. Because this operation takes some additional time, it is desirable to ask for the user's permission before running it.

[B-6] Dynamic processing at query time using a remote corpus
Even a very large local corpus cannot be expected to cover every query. A remote corpus is therefore used so that answers can also be given for entities that are not in the local corpus.

The remote corpus is the set of snippets received when, once the query is known, the query "A*****B" is sent to a keyword-based search engine (Google, Yahoo, etc.). Here, the "*" operator is a wildcard that matches zero or one word and is supported by many existing search engines (for example, Google and the Yahoo BOSS API). Because the text of these snippets is not on the local hard disk beforehand, it is called a remote corpus. With the query above, snippets in which A and B co-occur within at most five words can be downloaded. In an ordinary keyword search engine, entering A and B as the query returns snippets containing A and B; the wildcard search used here adds a constraint on the distance between A and B to that plain query. A search with the query "A*****B" therefore returns snippets in which A and B co-occur within at most five words and which also include the context before and after A and B, and lexical patterns that include this surrounding context can be generated from the snippets. Using snippets reduces the time that would otherwise be needed to download the full search results from the Web.

In one aspect, the dynamic processing at query time using the remote corpus is performed when at least one of the three entities in the input query is not in the local corpus. When all three entities in the query are present in the local corpus, the remote corpus is not used.

Next, all occurrences of A and B in the downloaded contexts are replaced with the variables X and Y, respectively. The contexts are converted to lowercase and lemmatized (reduced to base forms). All n-grams of length 5 or less that contain both X and Y are then selected as lexical patterns. When an extracted lexical pattern is already contained in the search database, patterns that have a high frequency there may be selected preferentially.

In each lexical pattern extracted in this way, C is substituted for the variable X and "*" for Y, and a Web search is performed again with the resulting pattern. The part of each snippet that corresponds to "*" is extracted as a candidate for D. Not only a single word but also bigrams, trigrams, and so on can be extracted as candidates from the part corresponding to "*", which makes it possible to handle an entity D that consists of more than one word. As already noted, the query "C pattern *" does not return only "C pattern W1"; it returns the context "... C pattern W1 W2 W3 ...", so W1, W2, W3, and so on can be used as candidates.
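Candidate extraction from snippets can be sketched as below. Several simplifications are assumed: snippets are whitespace-tokenized with punctuation as separate tokens, only the part of the pattern up to Y is matched, lemmatization is omitted, and the snippets themselves are invented. No search-engine API is called here.

```python
from collections import Counter

def candidates_from_snippets(pattern_tokens, c_tokens, snippets, max_len=3):
    """Count the 1- to max_len-word sequences found where Y (the '*') would be."""
    y_pos = pattern_tokens.index("Y")
    prefix = []
    for tok in pattern_tokens[:y_pos]:
        prefix.extend(c_tokens if tok == "X" else [tok])   # substitute C for X
    counts = Counter()
    for snippet in snippets:
        toks = snippet.lower().split()
        for i in range(len(toks) - len(prefix)):
            if toks[i:i + len(prefix)] == prefix:
                tail = toks[i + len(prefix): i + len(prefix) + max_len]
                for n in range(1, len(tail) + 1):   # unigram, bigram, trigram
                    counts[" ".join(tail[:n])] += 1
    return counts

pattern = ["X", ",", "president", "of", "Y"]
snippets = [  # hypothetical snippets
    "Nicolas Sarkozy , president of France , said",
    "when Nicolas Sarkozy , president of France visited",
]
counts = candidates_from_snippets(pattern, ["nicolas", "sarkozy"], snippets)
```

Keeping the longer sequences as separate candidates lets a later ranking step decide between, say, a one-word and a two-word entity.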

At this stage, many different entities can appear as candidates for entity D, so the candidates have to be ranked. The following are examples of methods for ranking a very large number of candidates.
a. Rank by occurrence frequency, highest first.
b. Rank by the number of different lexical patterns through which the candidate was selected.
c. Rank by whether C is obtained when the candidate D is used in a search with the query (A, B), (?, D) (reverse search method).
d. Learn an arbitrary combination of the above methods using machine learning.
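As one illustration of combining methods a and b, candidates can be sorted by the number of distinct patterns that produced them, with occurrence frequency as a tie-breaker. This fixed ordering is an assumption made for the sketch; the text leaves the combination open and suggests it could be learned.

```python
from collections import Counter, defaultdict

def rank_candidates(hits):
    """hits: iterable of (candidate, pattern), one entry per snippet match."""
    frequency = Counter()
    patterns = defaultdict(set)
    for candidate, pattern in hits:
        frequency[candidate] += 1
        patterns[candidate].add(pattern)
    # Method b first (pattern diversity), then method a (raw frequency).
    return sorted(frequency,
                  key=lambda d: (len(patterns[d]), frequency[d]),
                  reverse=True)

hits = [  # hypothetical matches
    ("france", "X , president of Y"),
    ("france", "X , leader of Y"),
    ("paris", "X , president of Y"),
    ("paris", "X , president of Y"),
    ("paris", "X , president of Y"),
]
ranking = rank_candidates(hits)
```

Here "france" ranks first because two different patterns support it, even though "paris" was matched more often by a single pattern.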

[C] Relational search using the search database
An embodiment of a search system that uses the search database is described in detail with reference to FIGS. 3 to 10. To simplify the description, entity pairs and entity IDs are written as word pairs and word-pair IDs.

FIG. 3 shows the configuration of the search system. The data source is a local corpus 1 and/or a remote corpus 2. The local corpus is a corpus stored in the storage unit of the search system and includes, for example, a Wikipedia data dump and crawled WWW pages. The remote corpus, by contrast, is a corpus that does not reside in the storage unit of this search engine and includes, for example, the query results of an ordinary keyword-based search engine. With a remote corpus, the system can handle vast data sources such as the WWW without crawling them itself. The remote corpus may be stored temporarily in the storage unit. In this specification, the local corpus and the remote corpus are referred to collectively as the text corpus.

検索クエリを処理する前に、検索するためのデータベースを、テキストコーパスを用いて作成する。図３に示す検索システムのインデクシングエンジン(Indexing Engine)３、パターンクラスタリングエンジン（Pattern Clustering Engine）４が検索データベースを作るモジュールである。インデクシングエンジン３は、トークンナイザ（Tokenizer）３０、エンティティペアセレクタ（Word-Pair Selector）３１、語彙パターン生成器（Lexical Pattern Generator）３２を備えている。インデクシングエンジン３によって、エンティティペアインデックス（Word-Pair Index）５０、語彙パターンインデックス（Pattern Index）５１、特徴ベクトルインデックス（Feature Vector Index）５２、インバーテッドエンティティペアインデックス(Inverted Word-Pair Index)５３の４つのデータベースが作成される。 Before a search query is processed, the database to be searched is created from a text corpus. The indexing engine 3 and the pattern clustering engine 4 of the search system shown in FIG. 3 are the modules that build the search database. The indexing engine 3 includes a tokenizer 30, an entity pair selector (Word-Pair Selector) 31, and a lexical pattern generator 32. The indexing engine 3 creates four databases: an entity pair index (Word-Pair Index) 50, a vocabulary pattern index (Pattern Index) 51, a feature vector index (Feature Vector Index) 52, and an inverted entity pair index (Inverted Word-Pair Index) 53.

インデクシングエンジン３は、テキストコーパス１、２からテキストを受け取り、そのテキストをトークンナイザ３０で各単語に分ける。例えば、図５に示すように、トークンナイザ３０が、入力文(東京は日本の首都である)に対して、その文に含まれる単語の品詞付け列 (東京/名詞、は/助詞、日本/名詞、の/助詞、首都/名詞、で/助動詞、ある/助動詞)を出力する。 The indexing engine 3 receives text from the text corpora 1 and 2 and splits it into words with the tokenizer 30. For example, as shown in FIG. 5, for the input sentence 東京は日本の首都である ("Tokyo is the capital of Japan"), the tokenizer 30 outputs the part-of-speech-tagged sequence of the words in the sentence (東京/noun, は/particle, 日本/noun, の/particle, 首都/noun, で/auxiliary verb, ある/auxiliary verb).

次に、エンティティペアセレクタ（Word-pair Selector）３１がトークンナイザ３０により生成された単語列を受け取り、エンティティペアを作る。エンティティペアの候補は１つの文の中のすべての名詞、動詞、形容詞、形容動詞を品詞として持つエンティティのペアであるが、最終的には頻度の高いエンティティペアだけがエンティティペアセレクタにより選択され、次のステップに進む。エンティティペアの頻度を保存するテーブルは後述する特徴ベクトルインデックス５２である。 Next, the entity pair selector (Word-pair Selector) 31 receives the word sequence generated by the tokenizer 30 and forms entity pairs. The entity pair candidates are all pairs of entities in a sentence whose part of speech is noun, verb, adjective, or adjectival noun, but ultimately only high-frequency entity pairs are selected by the entity pair selector and passed to the next step. The table that stores entity pair frequencies is the feature vector index 52 described later.

各エンティティに対して唯一的にある整数を対応させる。この整数をエンティティIDという。エンティティからエンティティIDを高速に見つけ出すために、図６(a)のように、エンティティとエンティティIDのハッシュ表を作り、データベースの１つのテーブルに保存する。例えば、東京、フランス、日本、パリ、首都に対して、それぞれ、エンティティＩＤ１、２、３、４、５が割り当てられる。これ以降の処理はエンティティ自身を使わなく、エンティティIDに対する処理だけで済むので、高速で検索したり、保存すべきデータ量を圧縮できたりするという利点がある。また、検索データベースには、各エンティティの頻度と品詞の情報もエンティティＩＤに関連付けて保存される。 Associate a unique integer for each entity. This integer is called an entity ID. In order to find the entity ID from the entity at high speed, a hash table of the entity and the entity ID is created as shown in FIG. 6A and stored in one table of the database. For example, entity IDs 1, 2, 3, 4, and 5 are assigned to Tokyo, France, Japan, Paris, and the capital, respectively. Subsequent processing does not use the entity itself, and only processing for the entity ID is required, so that there is an advantage that the search can be performed at a high speed and the amount of data to be stored can be compressed. The search database also stores the frequency and part of speech information of each entity in association with the entity ID.

そして、エンティティペアの各候補について、そのペアの２つのエンティティIDをキーとして、図６(b)のハッシュ表を作る。図６(b)はエンティティペアの２つのエンティティIDから、エンティティペアIDにマッピングするテーブルである。図６(b)がエンティティペアインデックス５０の内容である。具体的には、エンティティＩＤ１、エンティティＩＤ３からエンティティペアが形成され、そのエンティティペアにエンティティペアＩＤ１が割り当てられる。同様に、エンティティＩＤ２、エンティティＩＤ４からエンティティペアが形成され、そのエンティティペアにエンティティペアＩＤ２が割り当てられる。 Then, for each entity pair candidate, the hash table of FIG. 6B is created using the two entity IDs of the pair as keys. FIG. 6B is a table that maps two entity IDs of an entity pair to an entity pair ID. FIG. 6B shows the contents of the entity pair index 50. Specifically, an entity pair is formed from entity ID1 and entity ID3, and entity pair ID1 is assigned to the entity pair. Similarly, an entity pair is formed from entity ID2 and entity ID4, and entity pair ID2 is assigned to the entity pair.
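
The two hash tables of FIG. 6 can be sketched with ordinary dictionaries. This is only an illustration under assumptions: the class name `EntityPairIndex` and its methods are hypothetical, and IDs are simply assigned in order of first appearance, which the specification does not prescribe.

```python
class EntityPairIndex:
    def __init__(self):
        self.entity_ids = {}  # entity text -> entity ID (FIG. 6(a))
        self.pair_ids = {}    # (entity ID, entity ID) -> entity pair ID (FIG. 6(b))

    def entity_id(self, entity):
        # Assign the next integer, starting from 1, the first time an entity is seen.
        return self.entity_ids.setdefault(entity, len(self.entity_ids) + 1)

    def pair_id(self, first, second):
        key = (self.entity_id(first), self.entity_id(second))
        return self.pair_ids.setdefault(key, len(self.pair_ids) + 1)

index = EntityPairIndex()
for name in ["Tokyo", "France", "Japan", "Paris", "capital"]:
    index.entity_id(name)          # IDs 1 to 5, as in the example above

tokyo_japan = index.pair_id("Tokyo", "Japan")    # entity IDs 1 and 3
paris_france = index.pair_id("Paris", "France")  # entity IDs 4 and 2
```

After this setup, all later processing can work on the integer IDs instead of the entity strings.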

インデクシングエンジン３の最後のステップでは、語彙パターン生成器３２により、選択されたエンティティペアの語彙パターン(lexical patterns)を生成する。語彙パターンの生成アルゴリズムには、既述の方法を用いることができる。語彙パターン生成器は、図７で示すような、語彙パターンを語彙パターンＩＤ（lexical pattern ID）に対応させるテーブル（Pattern vs. Pattern ID）５１を作る。図７に示すように、語彙パターン「ＸはＹの首都である」には語彙パターンＩＤ１が割り当てられ、語彙パターン「ＸはＹの最大都市である」には語彙パターンＩＤ２が割り当てられ、語彙パターン「Ｙの首都がＸである」には語彙パターン３が割り当てられている。 In the final step of the indexing engine 3, the lexical pattern generator 32 generates lexical patterns for the selected entity pair. As the lexical pattern generation algorithm, the above-described method can be used. The vocabulary pattern generator creates a table (Pattern vs. Pattern ID) 51 as shown in FIG. 7 that associates lexical patterns with lexical pattern IDs. As shown in FIG. 7, the vocabulary pattern “X is the capital of Y” is assigned a vocabulary pattern ID1, and the vocabulary pattern “X is the largest city of Y” is assigned a vocabulary pattern ID2. Vocabulary pattern 3 is assigned to “the capital of Y is X”.

さらに、語彙パターンから語彙パターンＩＤの情報だけではなく、語彙パターンの頻度の総数も保存しておく（図８参照）。語彙パターン生成器３２はエンティティペアの語彙パターンの頻度（そのエンティティペアに対する語彙パターンの頻度）も数え、特徴ベクトルインデックスのデータベース５２に保存する。特徴ベクトルインデックス５２は、「エンティティペア」から「エンティティペアの特徴ベクトル」へのマッピングである。概念的には、特徴ベクトルインデックスは図８で示すテーブルである。すなわち、エンティティペアから、そのエンティティペアを含むような語彙パターン（lexical pattern）とその頻度の情報を保存しているテーブルである。図８に示すように、エンティティペア（東京、日本）は、語彙パターン「ＸはＹの首都である」・頻度「１０」、語彙パターン「ＸはＹの最大都市である」・頻度「４」、語彙パターン「Ｙの首都がＸである」・頻度「６」、頻度総数「２０」に対応付けられている。エンティティペア（パリ、フランス）は、語彙パターン「ＸはＹの首都である」・頻度「１２」、語彙パターン「Ｙの首都がＸである」・頻度「３」、頻度総数「１５」に対応付けられている。ここでは、簡単のため、図８には、語彙パターンの全てのn-gramを記載していない。 Furthermore, not only the vocabulary pattern ID information but also the total number of vocabulary pattern frequencies is stored from the vocabulary pattern (see FIG. 8). The vocabulary pattern generator 32 also counts the frequency of the vocabulary pattern of the entity pair (the frequency of the vocabulary pattern for the entity pair) and stores it in the feature vector index database 52. The feature vector index 52 is a mapping from “entity pair” to “feature vector of an entity pair”. Conceptually, the feature vector index is a table shown in FIG. In other words, the table stores information on lexical patterns including the entity pairs and their frequencies from the entity pairs. As shown in FIG. 8, the entity pair (Tokyo, Japan) has the vocabulary pattern “X is the capital of Y” / frequency “10”, and the vocabulary pattern “X is the largest city of Y” / frequency “4”. , The vocabulary pattern “the capital of Y is X”, the frequency “6”, and the total frequency “20”. The entity pair (Paris, France) corresponds to the vocabulary pattern “X is the capital of Y” / frequency “12”, the vocabulary pattern “the capital of Y is X” / frequency “3”, and the total frequency “15”. It is attached. Here, for simplicity, FIG. 8 does not show all n-grams of vocabulary patterns.

語彙パターンに加えて、エンティティペアが含まれる品詞パターンを抽出し、エンティティペアと品詞パターンを対応付けてデータベースに保存してもよい。品詞パターンの抽出は当業者において既知である。１つの態様では、エンティティの品詞 (つまり、このエンティティは一般名詞か、人名か、組織名、地名か)の情報はエンティティクラスタリングに用いることができる。これは、２つのエンティティが異なる品詞(たとえば、人名と地名)であれば、同じクラスタに入る可能性が同じ品詞の場合よりも低いので、類似度を測定するときに、異なる品詞であれば、類似度を普通の類似度の半分にします(異なる品詞ペアにペナルティをつける)。例として、Barack Obama (人名), Obama (人名) の類似度が 0.9 であれば、そのまま類似度 0.9 として数えて、同じクラスタに入れる。しかし、Barack Obama (人名) と Whitehouse (地名) の類似度が 0.9 であっても、品詞が異なるので、類似度を半分にし、0.45 に減じる。そこで、Barack Obama と Whitehouse が同じクラスタに入れなくなる。 In addition to vocabulary patterns, part-of-speech patterns containing an entity pair may be extracted, and the entity pair and its part-of-speech patterns may be stored in the database in association with each other. Extraction of part-of-speech patterns is known to those skilled in the art. In one aspect, information on the part of speech of an entity (that is, whether the entity is a common noun, a person name, an organization name, or a place name) can be used for entity clustering. Two entities with different parts of speech (for example, a person name and a place name) are less likely to belong to the same cluster than two entities with the same part of speech, so when the similarity is measured between entities with different parts of speech, it is halved (a penalty on pairs with different parts of speech). For example, if the similarity between Barack Obama (person name) and Obama (person name) is 0.9, it is counted as 0.9 as is, and the two are put in the same cluster. However, even if the similarity between Barack Obama (person name) and Whitehouse (place name) is 0.9, the parts of speech differ, so the similarity is halved to 0.45. As a result, Barack Obama and Whitehouse do not end up in the same cluster.
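
The part-of-speech penalty described above amounts to a one-line adjustment. A minimal sketch, assuming the part of speech of each entity is available as a plain label; the function name `penalized_similarity` and the label strings are hypothetical.

```python
def penalized_similarity(similarity, pos_a, pos_b, penalty=0.5):
    """Halve the similarity when the two entities have different parts of speech."""
    return similarity if pos_a == pos_b else similarity * penalty

same_pos = penalized_similarity(0.9, "person", "person")   # Barack Obama vs. Obama
diff_pos = penalized_similarity(0.9, "person", "place")    # Barack Obama vs. Whitehouse
```

With these inputs `same_pos` stays at 0.9 while `diff_pos` drops to 0.45, matching the example in the text.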

特徴ベクトルインデックスを簡単に検索したり、サイズを圧縮したりするために、エンティティペアのテキストをキーとして直接保存するのではなく、エンティティペアのＩＤ、語彙パターンＩＤを使う。そこで、実際に保存するのは、図９のテーブルである。図９では、エンティティペアＩＤ、語彙パターンＩＤを用いて図８を表現したものである。図６、図７、図９のテーブルを用いることで、図８に示すテーブルを再現することができる。また、stem語ペアから候補のエンティティペアを限定するために、候補エンティティペアはstem語ペアと1つ以上の語彙パターン(あるいは語彙パターンクラスタ)を共有するエンティティペアに限定する。 In order to easily search the feature vector index or reduce the size, the entity pair ID and vocabulary pattern ID are used instead of storing the text of the entity pair directly as a key. Therefore, what is actually saved is the table of FIG. In FIG. 9, FIG. 8 is expressed using the entity pair ID and the vocabulary pattern ID. By using the tables shown in FIGS. 6, 7, and 9, the table shown in FIG. 8 can be reproduced. Further, in order to limit candidate entity pairs from stem word pairs, candidate entity pairs are limited to entity pairs that share one or more vocabulary patterns (or vocabulary pattern clusters) with stem word pairs.

その候補エンティティペアの集合を高速に探すために、インバーテッドエンティティペアインデックス（Inverted Word-Pair Index）５３を用意する。インバーテッドエンティティペアインデックスとは、図１０に示すように、語彙パターンIDからその語彙パターンを特徴ベクトルの１つの成分として含むエンティティペアIDと頻度のリストへのマッピングである。図１０に示すように、語彙パターンＩＤ１は、エンティティペアＩＤ１・頻度「１０」、エンティティペアＩＤ２・頻度「１２」、頻度総数「２２」に対応している。語彙パターンＩＤ２は、エンティティペアＩＤ１・頻度「４」、頻度総数「４」に対応している。語彙パターンＩＤ３は、エンティティペアＩＤ１・頻度「６」、エンティティペアＩＤ３・頻度「３」、頻度総数「９」に対応している。このテーブルは特徴ベクトルインデックスのテーブル(図９)との反対方向のindexなので、inverted indexと呼ぶ。 In order to search the set of candidate entity pairs at high speed, an inverted entity pair index (Inverted Word-Pair Index) 53 is prepared. As shown in FIG. 10, the inverted entity pair index is a mapping from a vocabulary pattern ID to a list of entity pair IDs and frequencies including the vocabulary pattern as one component of a feature vector. As shown in FIG. 10, the vocabulary pattern ID1 corresponds to the entity pair ID1 · frequency “10”, the entity pair ID2 · frequency “12”, and the total number of frequencies “22”. The vocabulary pattern ID2 corresponds to the entity pair ID1, the frequency “4”, and the total frequency “4”. The vocabulary pattern ID3 corresponds to the entity pair ID1 / frequency “6”, the entity pair ID3 / frequency “3”, and the total frequency “9”. Since this table is an index in the opposite direction to the feature vector index table (FIG. 9), it is called an inverted index.
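
The relationship between the feature vector index and the inverted entity pair index can be sketched as a simple inversion of a nested dictionary. This is an illustration only: the function name `invert` is hypothetical, and the sample rows follow the FIG. 8 description, with entity pair ID 1 for (Tokyo, Japan) and entity pair ID 2 for (Paris, France).

```python
from collections import defaultdict

# Feature vector index: entity pair ID -> {vocabulary pattern ID: frequency}
feature_index = {
    1: {1: 10, 2: 4, 3: 6},   # (Tokyo, Japan), total 20
    2: {1: 12, 3: 3},         # (Paris, France), total 15
}

def invert(feature_index):
    """Build vocabulary pattern ID -> {entity pair ID: frequency}."""
    inverted = defaultdict(dict)
    for pair_id, vector in feature_index.items():
        for pattern_id, freq in vector.items():
            inverted[pattern_id][pair_id] = freq
    return dict(inverted)

inverted_index = invert(feature_index)
# Total frequency per vocabulary pattern, as stored alongside each posting list.
totals = {pid: sum(postings.values()) for pid, postings in inverted_index.items()}
```

The per-pattern totals computed this way are 22, 4, and 9 for pattern IDs 1, 2, and 3.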

既述のように、低頻度や完全マッチングの問題を解決するために、語彙パターンのクラスタリングとエンティティクラスタリング(named entity disambiguationの１つの方法)手法を使う。本システムで用いる手法は非特許文献３で提案されたsequential clustering algorithmである。このアルゴリズムによりクラスタリングを実現するモジュールはパターンクラスタリングエンジン４である。パターンクラスタリングエンジン４は特徴ベクトルインデックス５２とインバーテッドエンティティペアインデックス５３を用い、似ている語彙パターンを１つの語彙パターンクラスタにまとめる。似ている語彙パターンとは、その語彙パターンが出現する単語ペアの分布が似ているものである(Distributional hypothesisに基づく)。 As described above, lexical pattern clustering and entity clustering (one method of named entity disambiguation) are used to solve the problem of low frequency and perfect matching. The method used in this system is the sequential clustering algorithm proposed in Non-Patent Document 3. A module for realizing clustering by this algorithm is the pattern clustering engine 4. The pattern clustering engine 4 uses the feature vector index 52 and the inverted entity pair index 53 to combine similar vocabulary patterns into one vocabulary pattern cluster. A similar vocabulary pattern is one in which the distribution of word pairs in which the vocabulary pattern appears is similar (based on Distributional hypothesis).

特徴ベクトルインデックス５２とインバーテッドエンティティペアインデックス５３を使い、パターンの単語ペア頻度ベクトル（例えば、図１０において、全部で３つの単語ペア(単語ペアＩＤ1, 2, 3)しかないと仮定すると、パターンＩＤ1の単語ペア頻度ベクトルが(10, 12, 0)で、パターンＩＤ 2のベクトルが(4, 0, 0)で、パターンＩＤ 3 のベクトルが (6, 0, 3) である）のcosine類似度を計算し、非特許文献3で述べたsequential clustering アルゴリズムを施し、語彙パターンのクラスタリングを行う。検索データベースには、類似する語彙パターンがまとめられた語彙パターンのクラスタのテーブルが用意され、各語彙パターンはいずれかの語彙パターンクラスタに振り分けられている。語彙パターンクラスタには語彙パターンクラスタＩＤが割り当てられる、語彙パターンＩＤと語彙パターンクラスタＩＤが関連付けられる。 Using the feature vector index 52 and the inverted entity pair index 53, the cosine similarity between the word-pair frequency vectors of the patterns is computed (for example, assuming that FIG. 10 contains only three word pairs in total (word pair IDs 1, 2, 3), the word-pair frequency vector of pattern ID 1 is (10, 12, 0), that of pattern ID 2 is (4, 0, 0), and that of pattern ID 3 is (6, 0, 3)), and the sequential clustering algorithm described in Non-Patent Document 3 is applied to cluster the vocabulary patterns. The search database holds a table of vocabulary pattern clusters, each grouping similar vocabulary patterns, and every vocabulary pattern is assigned to one of the clusters. Each vocabulary pattern cluster is assigned a vocabulary pattern cluster ID, and vocabulary pattern IDs are associated with vocabulary pattern cluster IDs.
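
The cosine similarity for the three example vectors, followed by a one-pass clustering, can be sketched as below. The clustering function is a simplified greedy variant written for illustration; it is not claimed to be the algorithm of Non-Patent Document 3, and the names `cosine`, `sequential_cluster`, and the threshold value 0.6 are assumptions.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def sequential_cluster(vectors, threshold):
    """Greedy one-pass clustering: a pattern joins the most similar cluster
    if the cosine with that cluster's summed vector reaches the threshold;
    otherwise it opens a new cluster."""
    clusters = []  # each cluster: {"members": [pattern IDs], "centroid": [sums]}
    # Visit patterns in descending order of total frequency.
    for pid, vec in sorted(vectors.items(), key=lambda kv: -sum(kv[1])):
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["members"].append(pid)
            best["centroid"] = [a + b for a, b in zip(best["centroid"], vec)]
        else:
            clusters.append({"members": [pid], "centroid": list(vec)})
    return [c["members"] for c in clusters]

# Word-pair frequency vectors of pattern IDs 1, 2, 3 from the example above.
vectors = {1: (10, 12, 0), 2: (4, 0, 0), 3: (6, 0, 3)}
sim_1_2 = cosine(vectors[1], vectors[2])
sim_2_3 = cosine(vectors[2], vectors[3])
clusters = sequential_cluster(vectors, threshold=0.6)
```

With the assumed threshold, pattern IDs 2 and 3 (cosine about 0.89) end up in one cluster, while pattern ID 1 forms its own cluster.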

固有名詞(named entityなど)については、既述のように、相手のエンティティとの共起頻度と共通パターンを使い、クラスタリングする。語彙パターンのクラスタリングと同様に、エンティティ同士の類似度をその対応するベクトル間のcosine(コサイン)類似度で計算し「似たエンティティ同士をクラスタにする」という作業を行う。例えば、United States，U.S.，USAは１つのクラスタに入るようにする。各エンティティがどのクラスタに入るかを類似するエンティティがまとめられたエンティティクラスタのテーブルに記録しておく。エンティティクラスタにはエンティティクラスタＩＤが割り当てられ、エンティティＩＤとエンティティクラスタＩＤが関連付けられる。固有名詞以外のエンティティについても、類義語をまとめたエンティティクラスタを用意してもよい。これで、検索するための準備が完成する。 Proper nouns (named entities and the like) are clustered, as described above, using the co-occurrence frequency with the partner entity and the shared patterns. As with vocabulary pattern clustering, the similarity between entities is computed as the cosine similarity between their corresponding vectors, and similar entities are grouped into clusters. For example, United States, U.S., and USA are placed in one cluster. The cluster to which each entity belongs is recorded in an entity cluster table that groups similar entities. Each entity cluster is assigned an entity cluster ID, and entity IDs are associated with entity cluster IDs. For entities other than proper nouns, entity clusters that group synonyms may also be prepared. This completes the preparation for searching.

最後のモジュールはクエリ処理エンジン（Query Processing Engine）６である。クエリ７として{(Ａ，Ｂ)，(Ｃ，？)}が入力されたときに、クエリ処理エンジン６はまず、エンティティペアインデックス５０から入力stem語ペア(Ａ，Ｂ)のＩＤを探す(stem_idとする)。 The last module is a query processing engine 6. When {(A, B), (C,?)} Is input as the query 7, the query processing engine 6 first searches the entity pair index 50 for the ID of the input stem word pair (A, B) (stem_id). And).

そのstem_idを使い、特徴ベクトルインデックス５２(図８, 図９参照)やパターンクラスタリングエンジン４の結果によって、stem語ペア(Ａ，Ｂ)の特徴ベクトルf(a，b)を見つけ出す。そして、f(a，b)にある頻度1より大きい語彙パターンＩＤを探し、インバーテッドエンティティペアインデックス５３(図１０)を参照して、候補エンティティペアＩＤの集合を作る。候補エンティティペアとなるためには、そのペアの一方のエンティティがクエリのエンティティCである必要がある。 Using this stem_id, the feature vector f(a, b) of the stem word pair (A, B) is looked up from the feature vector index 52 (see FIGS. 8 and 9) and the results of the pattern clustering engine 4. Then the vocabulary pattern IDs whose frequency in f(a, b) is greater than 1 are collected, and the inverted entity pair index 53 (FIG. 10) is consulted to build the set of candidate entity pair IDs. To be a candidate entity pair, one of the entities in the pair must be the entity C of the query.
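
The candidate selection step can be sketched as a lookup over the inverted index. This is an illustration under assumptions: the function name `candidate_pairs`, the `pair_entities` table, and the sample data (based on the FIG. 8 description) are hypothetical.

```python
def candidate_pairs(stem_vector, inverted_index, pair_entities, c, min_freq=1):
    """Collect entity pair IDs that share a vocabulary pattern with the stem
    pair and contain the query entity C."""
    candidates = set()
    for pattern_id, freq in stem_vector.items():
        if freq <= min_freq:              # keep only patterns with frequency > 1
            continue
        for pair_id in inverted_index.get(pattern_id, {}):
            if c in pair_entities[pair_id]:
                candidates.add(pair_id)
    return candidates

stem_vector = {1: 10, 2: 4, 3: 6}                      # f(a, b) for (Tokyo, Japan)
inverted_index = {1: {1: 10, 2: 12}, 2: {1: 4}, 3: {1: 6, 2: 3}}
pair_entities = {1: ("Tokyo", "Japan"), 2: ("Paris", "France")}

# Query {(Tokyo, Japan), (?, France)}: only pair ID 2 contains "France".
found = candidate_pairs(stem_vector, inverted_index, pair_entities, "France")
```

The resulting set is then scored against the stem pair and sorted, as described next.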

最終に、stem語ペア(Ａ，Ｂ)の特徴ベクトルと候補エンティティペア(Ｃ，Ｄ)の特徴ベクトルの類似度を前記のRelSim((Ａ，Ｂ)，(Ｃ，Ｄ))関数を使って計算し、候補エンティティペアの集合を類似度の高い順にソートする。 Finally, the similarity between the feature vector of the stem word pair (A, B) and the feature vector of the candidate entity pair (C, D) is calculated using the RelSim ((A, B), (C, D)) function. Calculate and sort the set of candidate entity pairs in descending order of similarity.

１つの態様では、ランキングの指標となるスコア（類似度）を計算する際に、１つのエンティティクラスタに入っている複数のエンティティ(つまり、１つのエンティティを指しているが表現形式が異なる)のスコア（類似度）の和をエンティティクラスタのサイズで割ったクラスタ平均スコアを取り、そのエンティティ（エンティティ）のスコア（類似度）として用いてランキングを行う。すなわち、個々のエンティティのスコア（類似度）で個々のエンティティをランキングするのではなく、エンティティのクラスタとしてのスコア（エンティティクラスタのメンバーのスコアの総和をクラスタのサイズで割ったクラスタ平均スコア）を用いてランキングを行う。 In one aspect, when the score (similarity) used as the ranking index is calculated, the sum of the scores (similarities) of the multiple entities in one entity cluster (that is, entities that refer to the same thing but differ in surface form) is divided by the size of the entity cluster to give a cluster average score, which is used as the score (similarity) of those entities for ranking. In other words, instead of ranking individual entities by their individual scores (similarities), ranking uses the score of the entity cluster (the cluster average score, obtained by dividing the sum of the scores of the members of the entity cluster by the size of the cluster).

以下にクラスタのスコア（類似度）について具体的に説明する。
入力クエリ: (A, B), (C, ?)とした場合、
(A, B) を含む語彙パターンはS₁＝{p₁,p₂,..,p_n}
(C, X) を含む語彙パターンはS₂＝{q₁,q₂,..,q_m}
q₁∈S₁ならば、Xのスコアにh(q₁,(C, X)) * h(q₁, (A, B)) を加える。
ここで、h(p, wp)は、「語彙パターンpのword pair wp における出現頻度」、である。
q₁∉S₁ の場合であっても、q₁とp_iが同じパターンクラスタであれば、同じく次のようにスコアを計算する。

h(q₁,(C,X))*h(p_i,(A,B))
クラスタのスコアは、すべてのXのスコアの和となる。
例えば、
(Obama, US):(is president of [4], is leader of [3], is a senate of [2],)
(Koizumi, Japan):(is prime minister of [8], is leader of [5])
Query:(Obama,US),(Koizumi,X)
“is president of”と“is prime minister of” が同じクラスタに入る場合、X のスコアは:(3*5+4*8)=15+32=47、となる。
次に、クラスタ平均スコアについて説明する。
クラスタの総和スコア = Σ（entity のスコア)、
クラスタ平均スコア = クラスタ総和スコア / size、となる。
エンティティクラスタ(United States, U.S., US) を形成した場合に、
例えば、United States のスコア:15、 U.Sのスコア10、USのスコア8、の場合、
クラスタ総和スコア: 15 + 10 + 8 = 33、
クラスタ平均スコア: 33 / 3 = 11、となる。 The cluster score (similarity) will be specifically described below.
Input query: (A, B), (C, ?)
The set of vocabulary patterns containing (A, B) is S₁ = {p₁, p₂, ..., p_n}.
The set of vocabulary patterns containing (C, X) is S₂ = {q₁, q₂, ..., q_m}.
If q₁ ∈ S₁, then h(q₁, (C, X)) * h(q₁, (A, B)) is added to the score of X.
Here, h(p, wp) is the frequency with which vocabulary pattern p occurs with word pair wp.
Even if q₁ ∉ S₁, when q₁ and p_i belong to the same pattern cluster, the score is computed in the same way, as follows.

h(q₁, (C, X)) * h(p_i, (A, B))
The cluster score is the sum of all X scores.
For example,
(Obama, US): (is president of [4], is leader of [3], is a senate of [2])
(Koizumi, Japan): (is prime minister of [8], is leader of [5])
Query: (Obama, US), (Koizumi, X)
If “is president of” and “is prime minister of” are in the same cluster, the score of X is: (3 * 5 + 4 * 8) = 15 + 32 = 47.
Next, the cluster average score will be described.
Cluster total score = Σ (entity score),
Cluster average score = Cluster total score / size.
When an entity cluster (United States, U.S., US) is formed,
and, for example, the score of United States is 15, the score of U.S. is 10, and the score of US is 8, then
Cluster total score: 15 + 10 + 8 = 33,
Cluster average score: 33/3 = 11.
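
The two worked examples above can be reproduced with a short sketch. The function names `relation_score` and `cluster_average` and the `cluster_of` mapping are hypothetical; the cluster rule is applied only when q is not itself in S₁, following the wording of the text.

```python
def relation_score(stem_vec, cand_vec, cluster_of):
    """Score of X for query (A, B), (C, X).
    stem_vec: pattern -> h(p, (A, B)); cand_vec: pattern -> h(q, (C, X));
    cluster_of: pattern -> pattern cluster ID (patterns may be missing)."""
    score = 0
    for q, hq in cand_vec.items():
        if q in stem_vec:                       # q is in S1: exact pattern match
            score += hq * stem_vec[q]
        elif q in cluster_of:                   # q not in S1: match via pattern cluster
            for p, hp in stem_vec.items():
                if cluster_of.get(p) == cluster_of[q]:
                    score += hq * hp
    return score

def cluster_average(entity_scores):
    """Cluster total score divided by the cluster size."""
    return sum(entity_scores.values()) / len(entity_scores)

obama_us = {"is president of": 4, "is leader of": 3, "is a senate of": 2}
koizumi_japan = {"is prime minister of": 8, "is leader of": 5}
cluster_of = {"is president of": 0, "is prime minister of": 0}

x_score = relation_score(obama_us, koizumi_japan, cluster_of)      # 3*5 + 4*8
avg = cluster_average({"United States": 15, "U.S.": 10, "US": 8})  # 33 / 3
```

Here `x_score` is 47 and `avg` is 11.0, as in the text.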

エンティティのクラスタのスコアを用いることは、検索結果のランキングに重要な影響を与える。例えば、 (Apple, Steve Jobs), (Microsoft, ?)のクエリの場合、D=Steve Ballmer だけの結果のスコアがD=Windows(両方ともX introduces Y, X announces Y,...という語彙パターンを備える場合が多い)よりも低くなり得る。D=Steve Ballmer, D=Ballmer, D=Steve Anthony Ballmer を１つのクラスタにまとめることで、そのエンティティクラスタのスコア（各エンティティのスコアの総和結果）は、D=Windows だけのエンティティクラスタのスコアよりも高くなると考えられる。このスコア逆転によって、望ましい結果が上位にランクされるという利点がある。 Using the score of an entity cluster has an important effect on the ranking of search results. For example, for the query (Apple, Steve Jobs), (Microsoft, ?), the score of the result D=Steve Ballmer alone can be lower than that of D=Windows (both often occur with vocabulary patterns such as X introduces Y, X announces Y, ...). By merging D=Steve Ballmer, D=Ballmer, and D=Steve Anthony Ballmer into one cluster, the score of that entity cluster (the sum of the scores of its entities) is expected to exceed the score of the entity cluster containing only D=Windows. This score reversal has the advantage that the desired result is ranked higher.

ソートされた候補エンティティペアの集合から検索結果Ｄ(またはＤに属するエンティティクラスタ)のランキングされたリスト８を出力する。また、エンティティペア(Ａ，Ｂ)、エンティティペア(Ｃ，Ｄ)が共通に持つ語彙パターンを含む文や文書を、結果を説明する理由として出力してもよい。更に、(Ａ，Ｂ)，(Ｃ，Ｄ)の共有の代表的な関係も出力できる。 From the sorted set of candidate entity pairs, a ranked list 8 of search results D (or of the entity clusters to which D belongs) is output. In addition, sentences or documents containing vocabulary patterns that the entity pair (A, B) and the entity pair (C, D) have in common may be output as evidence explaining the result. Furthermore, a representative relation shared by (A, B) and (C, D) can also be output.

あるエンティティが検索結果として得られた時に、そのエンティティをクラスタリングすることにより、「エンティティクラスタ」を検索結果として出力することができる。すなわち、検索結果をエンティティクラスタを用いて出力することで、Xの候補をユーザに提示するときに似たようなもの（例：ObamaとBarak Obama）をまとめて表示することができる。 When an entity is obtained as a search result, clustering the entities can output an “entity cluster” as the search result. In other words, by outputting search results using entity clusters, similar items (eg, Obama and Barak Obama) can be displayed together when X candidates are presented to the user.

［Ｄ］動的処理を用いた検索
以上、検索データベースを用いた検索について説明した。エンティティＡ，Ｂ，Ｃの１つ以上が検索データベースに存在しない場合には、既存の検索エンジンによりリモートコーパスを用いる動的処理モードを実行する。以下に、具体的に説明する。 [D] Search using dynamic processing. The search using the search database has been described above. When one or more of the entities A, B, and C do not exist in the search database, a dynamic processing mode that uses a remote corpus through an existing search engine is executed. This is described specifically below.

エンティティペア（Ａ,Ｂ）のエンティティＡ，Ｂが検索データベースにない場合には、エンティティＡ，Ｂの間の関係を表す語彙パターンを既存の検索エンジンを用いて抽出する。まず“Ａ****Ｂ”というクエリを既存検索エンジンで検索する。これはエンティティＡとエンティティＢを最大4単語以内（＊は１個ないしゼロ個の単語とマッチされるワイルドカード）で出現するsnippetを検索結果として受信する。後で語彙パターンを生成するために、単語の連続(n-gram)を抽出するが、そのためにはエンティティＡとエンティティＢが近くに現れる必要がある。いわばこのクエリはエンティティＡとエンティティＢの「文脈」を近似しようとするものである。ワイルドカード＊の数については当業者において適宜設定し得るが、例えば、1個から最大7個まで複数のクエリを用いることができる。例えば「Ａ*Ｂ」、「Ａ**Ｂ」、「Ａ***Ｂ」、…「Ａ*******Ｂ」まで用いる。＊が多いクエリで受信する結果と＊が少ないクエリで受信する結果が似ている場合もあるので、1つの態様では、同一URLからくるsnippetを一回だけ集めるようにする。多くの既存の検索エンジン（例えば、Google，YahooBOSS API）では、“Ａ****Ｂ”というクエリにより、ＡとＢの前後の文脈も含むスニペットが得られる。 When the entities A and B of the entity pair (A, B) are not in the search database, vocabulary patterns representing the relationship between the entities A and B are extracted using an existing search engine. First, the query “A****B” is submitted to an existing search engine. This returns, as search results, snippets in which entity A and entity B appear within at most four words of each other (* is a wildcard that matches zero or one word). N-grams (word sequences) are extracted later to generate vocabulary patterns, and for that purpose entity A and entity B need to appear close together. In effect, this query tries to approximate the “context” of entity A and entity B. The number of wildcards * can be set as appropriate by those skilled in the art; for example, multiple queries with from one up to seven wildcards can be used, namely “A*B”, “A**B”, “A***B”, ... “A*******B”. Because the results for a query with many * and for a query with few * may be similar, in one aspect snippets from the same URL are collected only once. With many existing search engines (for example, Google, YahooBOSS API), the query “A****B” yields snippets that also include the context before and after A and B.
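
Generating the family of wildcard queries and collecting snippets once per URL can be sketched as follows. No real search API is called here; the function names, the space-separated wildcard syntax, and the (url, snippet) result format are assumptions made for illustration.

```python
def wildcard_queries(a, b, max_wildcards=7):
    """Build "A * B", "A * * B", ... with 1 to max_wildcards wildcards."""
    queries = []
    for n in range(1, max_wildcards + 1):
        stars = " ".join(["*"] * n)
        queries.append(f"{a} {stars} {b}")
    return queries

def dedupe_by_url(results):
    """results: (url, snippet) pairs; keep the first snippet seen for each URL."""
    seen, unique = set(), []
    for url, snippet in results:
        if url not in seen:
            seen.add(url)
            unique.append(snippet)
    return unique

queries = wildcard_queries("Barack Obama", "USA")

# Invented results standing in for what several queries might return.
results = [
    ("http://example.com/1", "snippet one"),
    ("http://example.com/2", "snippet two"),
    ("http://example.com/1", "snippet one again"),
]
snippets = dedupe_by_url(results)
```

The exact wildcard syntax depends on the search engine in use, so the query strings would need adapting to the engine's own operators.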

ワイルドカード検索を用いて取得したスニペットから、エンティティペア（Ａ,Ｂ）の文脈となるものが得られる。次に、この文脈からエンティティＡとエンティティＢの間の関係を表現する語彙パターンを抽出する必要がある。この語彙パターンを抽出するアルゴリズムは既述の語彙パターン取得手法と同じある。つまり、エンティティＡ、エンティティＢを含むn-gramを抽出し、抽出したn-gramの中のＡをＸに、ＢをＹに置き換える。以下例を示して説明する。 From the snippet acquired using the wild card search, what is the context of the entity pair (A, B) is obtained. Next, it is necessary to extract a vocabulary pattern that expresses the relationship between entity A and entity B from this context. The algorithm for extracting the vocabulary pattern is the same as the lexical pattern acquisition method described above. That is, n-grams including entity A and entity B are extracted, and A in the extracted n-grams is replaced with X and B is replaced with Y. An example will be described below.
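
The extraction step just described can be sketched as follows. This is a minimal illustration: the function name `extract_patterns`, the regular-expression tokenizer, and the sample snippet are hypothetical, and lemmatization is omitted for brevity.

```python
import re

def extract_patterns(snippet, a, b, n):
    """Return all n-grams of the snippet that contain both X (for A) and Y (for B)."""
    text = snippet.lower()
    # Replace the entities with the variables X and Y (whole-word matches only).
    text = re.sub(r"\b" + re.escape(a.lower()) + r"\b", "X", text)
    text = re.sub(r"\b" + re.escape(b.lower()) + r"\b", "Y", text)
    # Split into word tokens and single punctuation tokens.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    grams = (tokens[i:i + n] for i in range(len(tokens) - n + 1))
    return [" ".join(g) for g in grams if "X" in g and "Y" in g]

# Invented snippet, used only to exercise the function.
snippet = "Today Barack Obama, president of the USA, spoke."
six_grams = extract_patterns(snippet, "Barack Obama", "USA", 6)
seven_grams = extract_patterns(snippet, "Barack Obama", "USA", 7)
```

For this snippet, n = 6 yields the single pattern "X , president of the Y", and n = 7 yields two patterns, one extending to the left of X and one extending to the right of Y.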

エンティティＡ=Barack Obama、エンティティＢ=USA、とした場合、“Ａ****Ｂ”でGoogle検索すると次の結果が得られる。

When entity A = Barack Obama and entity B = USA, a Google search with “A **** B” will give the following results.

上記のsnippetをlowercaseにし、全ての活用形を基底形に変換する(lemmatization)。Lemmatizationによって名詞の複数形がその単数形になり、動詞の活用形（過去形や進行形）が基底形になり、活用によるばらつきが吸収される。lowercaseにすることで大文字、小文字によるばらつきが吸収される。 The above snippet is lowercased and all inflected forms are converted to their base forms (lemmatization). Lemmatization turns plural nouns into their singular forms and inflected verb forms (past tense, progressive) into base forms, absorbing variation due to inflection. Lowercasing absorbs variation due to capitalization.

次に、Ｘ=Barack Obama, Ｙ= USAとすると次のようになる。

次のこの中でXもYも両方を含むn-gramを生成する。例えば、n=6で

が生成される。 Next, if X = Barack Obama, Y = USA, the result is as follows.

From this text, n-grams that contain both X and Y are generated. For example, with n = 6,

is generated.

上記の語彙パターンはXから始まり、Yで終わっているが、必ずしもXとYの間をつなぐ語彙パターンのみを抽出するものではない。語彙パターン抽出の条件は長さがnでXもYも含んでいることであり、その2つの条件を満たすものを全て語彙パターンとして抽出する。例えば、n=7にすると上記のsnippetから次の2のパターンが抽出される。

The above vocabulary pattern starts with X and ends with Y, but it does not necessarily extract only the vocabulary pattern that connects between X and Y. The condition for extracting the vocabulary pattern is that the length is n and includes both X and Y, and all those satisfying the two conditions are extracted as vocabulary patterns. For example, when n = 7, the following two patterns are extracted from the above snippet.

得られた語彙パターンが検索データベースの中にも存在すれば高頻度のものを優先することができるが、検索データベースに存在しない場合は、クエリ（Ａ,Ｂ）対して例えば２回以上抽出されたパターンの全てにXにCを代入し検索する。 If an obtained vocabulary pattern also exists in the search database, high-frequency patterns can be given priority; if it does not exist in the search database, C is substituted for X in every pattern that was extracted, for example, two or more times for the query (A, B), and a search is performed with each resulting pattern.

例えば、上記のX, president of the Yという語彙パターンとC=Vladimir Putinの場合には、次のクエリとなる。
Vladimir Putin, president of the *
これでGoogleを検索すると次の結果が得られる。

For example, in the case of the above vocabulary pattern of X, president of the Y and C = Vladimir Putin, the following query is obtained.
Vladimir Putin, president of the *
Searching Google with this query gives the following results.

このsnippetのなかでFEDERATION OF RUSSIAの部分が正解である。正解となる部分を抽出するには、例えば、語彙パターンの後に出てくる部分から3-gramまで取り、回答候補を出現頻度でランキングする。 In this snippet, the FEDERATION OF RUSSIA part is the correct answer. To extract the correct part, for example, n-grams up to 3-grams are taken from the text that follows the vocabulary pattern, and the answer candidates are ranked by frequency of occurrence.
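
The answer extraction step can be sketched as follows. This is an illustration under assumptions: the function name `answer_candidates` and the sample snippets are invented, and only the words immediately following the instantiated pattern are considered.

```python
import re
from collections import Counter

def answer_candidates(snippets, query_prefix, max_n=3):
    """Count the 1-grams to max_n-grams that directly follow the query prefix."""
    counts = Counter()
    prefix = query_prefix.lower()
    for snippet in snippets:
        text = snippet.lower()
        start = text.find(prefix)
        if start < 0:
            continue
        tail = re.findall(r"\w+", text[start + len(prefix):])
        for n in range(1, min(max_n, len(tail)) + 1):
            counts[" ".join(tail[:n])] += 1
    return counts.most_common()   # candidates in descending order of frequency

# Invented snippets standing in for search results.
snippets = [
    "Vladimir Putin, president of the Federation of Russia, said ...",
    "... Vladimir Putin, president of the Federation of Russia since 2000 ...",
    "Vladimir Putin, president of the country until 2008 ...",
]
ranked = answer_candidates(snippets, "Vladimir Putin, president of the")
counts = dict(ranked)
```

Ranking by raw frequency alone ties a 3-gram with its own 1-gram and 2-gram prefixes, so a practical system would likely combine it with the other ranking methods listed earlier.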

このように、エンティティペア（Ａ,Ｂ）がデータベースにない場合には、エンティティペア（Ａ,Ｂ）の関係を表す語彙パターンはWeb検索結果 snippetから抽出する必要がある。エンティティCが検索データベースにあれば抽出した語彙パターンでデータベースを検索して候補(C,X)を選択することが可能である。もちろん、動的モードは、エンティティ（Ａ,Ｂ）及びエンティティCが検索データベースに入っている場合であっても、データベースを無視して実行することができ、この場合でもあたかもエンティティCもデータベースになかったように全て動的モードで処理することが可能である。ただし、動的モードは実際にその場で動的にWeb検索エンジンにアクセスする必要があるので、検索データベース（エンティティＡ,Ｂ,Ｃが全て検索データベースにある）を用いた検索に比べて、幾分時間を要する点に留意する。1つの態様では、検索データベースを用いた検索が１秒以内で検索結果を出力するのに対して、動的モードでは30秒以内で検索結果を出力する。 Thus, when the entity pair (A, B) is not in the database, the vocabulary patterns representing the relationship of the entity pair (A, B) must be extracted from Web search result snippets. If entity C is in the search database, the database can be searched with the extracted vocabulary patterns to select candidates (C, X). Of course, the dynamic mode can also be executed while ignoring the database even when the entities (A, B) and the entity C are in the search database; in that case everything is processed in the dynamic mode as if entity C were not in the database either. Note, however, that because the dynamic mode has to access a Web search engine on the spot, it takes somewhat longer than a search using the search database (where entities A, B, and C are all in the search database). In one aspect, a search using the search database outputs results within 1 second, whereas the dynamic mode outputs results within 30 seconds.

図２Ａ〜２Ｃに検索結果の表示画面の例を示す。図２Ａは、クエリ(steve jobs, apple), (?, microsoft) の簡易結果 (証拠文章を含まない結果)の表示画面（クライアント端末のディスプレイに表示される）である。図２Ａでは、検索結果のランキングが表示されており、具体的には、ballmer, steve ballmer(スコア２９５)、bill gates（スコア５２）、danny thrope(スコア２７)の順に表示されている。図２Ａに表示される検索結果から、ballmerと steve ballmerはクラスタリングされていることがわかる。図２Ａの表示画面において、クライアント端末から入力手段によってShow evidenceを選択すると、証拠文が表示画面に表示される。図２Ｂは、図２の表示画面において、証拠文の一部（bill gates）のみを示したものである。また、 Debug Info に語彙パターンの一部を表示している。図２Ｃは、クエリ (steve jobs, apple), (steve ballmer, ?) 簡易結果 (証拠文章を含まない結果)の表示画面である。 FIGS. 2A to 2C show examples of search result display screens. FIG. 2A is the display screen (shown on the display of the client terminal) of the simple results (results without evidence sentences) for the query (steve jobs, apple), (?, microsoft). FIG. 2A shows the ranking of the search results, specifically in the order ballmer, steve ballmer (score 295), bill gates (score 52), and danny thrope (score 27). The search results in FIG. 2A show that ballmer and steve ballmer have been clustered. When Show evidence is selected on the display screen of FIG. 2A through the input means of the client terminal, evidence sentences are displayed on the screen. FIG. 2B shows only part of the evidence sentences (bill gates) on the display screen of FIG. 2. Part of the vocabulary patterns is displayed under Debug Info. FIG. 2C is the display screen of the simple results (results without evidence sentences) for the query (steve jobs, apple), (steve ballmer, ?).

Claims

A search method in which, when an entity pair (A, B) and an entity C are input, an entity X is searched for, using a search database and/or an existing search engine, such that the relationship between the entities C and X of the entity pair (C, X) is the same as or similar to the relationship between the entities A and B, the method comprising:
receiving the entity pair (A, B) and the entity C as a query;
defining the relationship between entity A and entity B depending on the context around entities A and B in text containing entity A and entity B; and
acquiring the entity X whose relationship with the entity C is the same as or similar to the relationship between entity A and entity B, by searching for entities C and X having a context that is the same as or similar to the context around entities A and B.

The search method according to claim 1, wherein the context is one or any combination of a vocabulary pattern, a bag-of-words, a part-of-speech pattern, and a dependency pattern.

The search method according to claim 1, wherein the contexts are clustered, and each context cluster includes substantially the same or similar context.

The search method according to claim 3, wherein the relationship between the two entities of an entity pair is represented by a feature vector, and the feature vector is obtained from the context or context cluster around the two entities of the entity pair.

The search method according to claim 4, wherein the relationship between entity A and entity B is represented by a first feature vector, which is obtained from the context or context cluster around entities A and B; and
the step of acquiring the entity X acquires the entity X that has, with the entity C, a second feature vector that is the same as or similar to the first feature vector, the second feature vector being obtained from the context or context cluster around entities C and X in text containing entity C and entity X.

The search method according to claim 5, wherein the step of acquiring the entity X further comprises:
calculating the similarity between the entity pair (A, B) and each entity pair (C, X) based on the distance between their feature vectors, sorting the plurality of entity pairs (C, X) in descending order of the calculated inter-pair similarity, and ranking the candidates for X in the sorted order; and
outputting some or all of the ranked candidates for X as a search result.
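As a non-limiting illustration, the ranking step of the claim above could be sketched as follows. The claim only requires a similarity derived from the distance between feature vectors; the use of cosine similarity, the sparse dictionary representation, and the names `cosine_similarity` and `rank_candidates` are assumptions made for this sketch.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts.

    Assumption: cosine similarity stands in for the unspecified
    distance-based similarity between feature vectors.
    """
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(query_vec, candidates):
    """Rank candidate entities X by similarity of (C, X) to (A, B).

    query_vec  -- feature vector of the entity pair (A, B)
    candidates -- dict mapping each candidate X to the feature vector
                  of the entity pair (C, X)
    Returns a list of (X, similarity) in descending order of similarity.
    """
    scored = [(x, cosine_similarity(query_vec, vec))
              for x, vec in candidates.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored
```

Outputting only the first few entries of the returned list corresponds to outputting a part of the ranked candidates.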

The search method according to claim 4, wherein the context is a vocabulary pattern, and the feature vector has a frequency of the vocabulary pattern as an element.

The search method according to claim 7, wherein the step of acquiring the entity X comprises:
acquiring a plurality of entity pairs (C, X) each having a second feature vector whose elements include a vocabulary pattern (one whose frequency is greater than a preset threshold) of the first feature vector of the entity pair (A, B);
calculating the similarity between the entity pair (A, B) and each entity pair (C, X) using both feature vectors, sorting the plurality of entity pairs (C, X) in descending order of the calculated inter-pair similarity, and ranking the candidates for X in the sorted order; and
outputting some or all of the ranked candidates for X as a search result.

The search method according to claim 1, wherein the search database stores correspondences between a large number of entity pairs and surrounding contexts expressing relationships between entities of each entity pair.

The search method according to claim 9, wherein the context is a vocabulary pattern, and the search database stores, for a large number of entity pairs, the correspondence between each entity pair, its vocabulary patterns, and the frequencies of those vocabulary patterns.

The search method according to claim 10, wherein the search database comprises:
a first index defining the ID of each entity in the entity pairs;
a second index defining the entity pair ID corresponding to each pair of two entity IDs;
a third index defining the vocabulary pattern ID corresponding to each vocabulary pattern; and
a fourth index defining the correspondence between each entity pair ID and the vocabulary pattern IDs and vocabulary pattern frequencies;
and wherein the method comprises:
acquiring the entity pair ID corresponding to the input entity pair (A, B) and, using the acquired entity pair ID, acquiring the vocabulary pattern IDs and vocabulary pattern frequencies corresponding to the entity pair (A, B) to form a first feature vector;
acquiring candidate entity pair IDs that include a vocabulary pattern ID of the first feature vector (one whose vocabulary pattern frequency is greater than a preset threshold) and the ID of the entity C; and
acquiring a second feature vector corresponding to each candidate entity pair ID, and acquiring the entity pairs (C, X) whose second feature vector is similar to the first feature vector.

The search method according to claim 11, wherein the fourth index comprises:
an index for looking up vocabulary pattern IDs and vocabulary pattern frequencies from an entity pair ID; and
an index for looking up entity pair IDs and vocabulary pattern frequencies from a vocabulary pattern ID.
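As a non-limiting illustration, the four indices and the two lookup directions of the fourth index could be held in memory as sketched below. The class name `RelationIndex`, its methods, and the choice to apply the frequency threshold to the pattern frequencies of the pair (A, B) are assumptions of this sketch; the claims do not fix a storage layout or state which frequency the threshold applies to.

```python
from collections import defaultdict


class RelationIndex:
    """Hypothetical in-memory version of the four indices."""

    def __init__(self):
        self.entity_ids = {}    # first index: entity -> entity ID
        self.pair_ids = {}      # second index: (entity ID, entity ID) -> pair ID
        self.pattern_ids = {}   # third index: vocabulary pattern -> pattern ID
        # Fourth index, stored in both lookup directions.
        self.pair_to_patterns = defaultdict(dict)  # pair ID -> {pattern ID: freq}
        self.pattern_to_pairs = defaultdict(dict)  # pattern ID -> {pair ID: freq}

    @staticmethod
    def _get_id(table, key):
        # Assign the next integer ID the first time a key is seen.
        if key not in table:
            table[key] = len(table)
        return table[key]

    def add(self, first, second, pattern, freq):
        pair = (self._get_id(self.entity_ids, first),
                self._get_id(self.entity_ids, second))
        pair_id = self._get_id(self.pair_ids, pair)
        pattern_id = self._get_id(self.pattern_ids, pattern)
        self.pair_to_patterns[pair_id][pattern_id] = freq
        self.pattern_to_pairs[pattern_id][pair_id] = freq

    def feature_vector(self, first, second):
        """First feature vector: {pattern ID: frequency} for a stored pair."""
        pair = (self.entity_ids[first], self.entity_ids[second])
        return dict(self.pair_to_patterns[self.pair_ids[pair]])

    def candidates(self, a, b, c, threshold=0):
        """Return {X: second feature vector} for stored pairs (C, X) that
        share a pattern of (A, B) whose frequency exceeds the threshold."""
        query_vec = self.feature_vector(a, b)
        c_id = self.entity_ids[c]
        id_to_pair = {pid: pair for pair, pid in self.pair_ids.items()}
        id_to_entity = {eid: name for name, eid in self.entity_ids.items()}
        found = {}
        for pattern_id, freq in query_vec.items():
            if freq <= threshold:
                continue
            for pair_id in self.pattern_to_pairs[pattern_id]:
                first_id, second_id = id_to_pair[pair_id]
                if first_id == c_id:
                    found[id_to_entity[second_id]] = dict(
                        self.pair_to_patterns[pair_id])
        return found
```

The forward direction builds the feature vectors, while the inverted direction keeps candidate retrieval proportional to the number of pairs that share a pattern rather than to the size of the whole database.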

The search method according to claim 10, wherein, in the search database, vocabulary patterns form vocabulary pattern clusters, and each vocabulary pattern cluster includes similar or substantially identical vocabulary patterns.

The search method according to claim 13, wherein the dimension of the feature vector when calculating the similarity between pairs is a vocabulary pattern cluster, and the value of each dimension is the sum of the frequencies of vocabulary patterns included in the vocabulary pattern cluster.
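As a non-limiting illustration, folding per-pattern frequencies into per-cluster dimensions could look like the sketch below. The function name `cluster_feature_vector`, the mapping `pattern_to_cluster`, and the decision to drop patterns that belong to no cluster are assumptions of this sketch.

```python
from collections import defaultdict


def cluster_feature_vector(pattern_freqs, pattern_to_cluster):
    """Build a feature vector whose dimensions are pattern clusters.

    pattern_freqs      -- {vocabulary pattern: frequency} for one entity pair
    pattern_to_cluster -- {vocabulary pattern: cluster ID}
    The value of each dimension is the sum of the frequencies of the
    patterns that belong to that cluster.
    """
    vector = defaultdict(int)
    for pattern, freq in pattern_freqs.items():
        cluster = pattern_to_cluster.get(pattern)
        if cluster is None:
            # Assumption: patterns outside every cluster are ignored.
            continue
        vector[cluster] += freq
    return dict(vector)
```

With this representation, two entity pairs described by different but similar patterns still share a non-zero dimension.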

The search method according to claim 9, wherein an entity cluster composed of similar entities is formed in the search database.

The search method according to claim 15, wherein a score of an entity cluster to which the entity X belongs is used as a score serving as an index for ranking the candidate of the entity X.

The search method according to claim 15, wherein in the output step, a plurality of entities included in the entity cluster are output.
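As a non-limiting illustration, ranking by entity cluster and outputting every member of a cluster could be sketched as follows. The claims do not state how a cluster score is computed; summing the scores of the member candidates is an assumption of this sketch, as are the name `rank_entity_clusters` and the example scores in the usage note.

```python
def rank_entity_clusters(candidate_scores, entity_to_cluster):
    """Rank entity clusters and list all members of each cluster.

    candidate_scores  -- {candidate entity X: similarity score}
    entity_to_cluster -- {entity: cluster ID}; an entity without a
                         cluster is treated as a cluster of its own
    Returns a list of (sorted member entities, cluster score) in
    descending order of cluster score.
    """
    members = {}
    scores = {}
    for entity, score in candidate_scores.items():
        cluster = entity_to_cluster.get(entity, entity)
        members.setdefault(cluster, []).append(entity)
        # Assumption: the cluster score is the sum of member scores.
        scores[cluster] = scores.get(cluster, 0.0) + score
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [(sorted(members[c]), scores[c]) for c in ranked]
```

For example, with hypothetical scores `{"ballmer": 1.0, "steve ballmer": 2.0, "bill gates": 1.5}` and both ballmer variants assigned to one cluster, the two variants are output together ahead of bill gates.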

The search method according to claim 4, wherein the feature vector representing the relationship between the entities of an entity pair is obtained by:
dividing sentences acquired from a text corpus into words;
creating candidate entity pairs from the segmented words; and
acquiring the feature vector from the context around each candidate entity pair.

The search method according to claim 18, wherein the elements of the feature vector are vocabulary pattern frequencies, and extracting the vocabulary pattern frequencies comprises:
in an original sentence containing a candidate entity pair, replacing the first entity of the entity pair with X and the second entity with Y, generating a partial sequence S containing X and Y, and generating n-grams (n from 1 to K) from the partial sequence S; and
using each obtained n-gram as one dimension of the feature vector, and counting its frequency over all sentences containing the entity pair;
wherein the n-gram frequencies are stored in association with the entity pair as vocabulary pattern frequencies.
Here, the partial sequence S consists of the m₁ words immediately before X, X, the word sequence between X and Y, Y, and the m₂ words immediately after Y; m₁, m₂, and K are integer parameters.
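As a non-limiting illustration, the pattern extraction described above could be sketched as follows. The sketch assumes that sentences are already tokenized, that each entity is a single token, that only the first occurrence of each entity in a sentence is used, and that every n-gram of S is kept whether or not it contains X or Y. The function names are hypothetical.

```python
from collections import Counter


def lexical_patterns(tokens, first, second, m1=1, m2=1, K=3):
    """Return all n-grams (n = 1..K) of the partial sequence S.

    S spans from m1 words before the earlier entity to m2 words after
    the later entity, with the first entity replaced by "X" and the
    second entity replaced by "Y".
    """
    i, j = tokens.index(first), tokens.index(second)
    replaced = list(tokens)
    replaced[i], replaced[j] = "X", "Y"
    start = max(0, min(i, j) - m1)
    end = min(len(tokens), max(i, j) + m2 + 1)
    s = replaced[start:end]
    patterns = []
    for n in range(1, K + 1):
        for k in range(len(s) - n + 1):
            patterns.append(" ".join(s[k:k + n]))
    return patterns


def pattern_frequencies(sentences, first, second, m1=1, m2=1, K=3):
    """Count each n-gram over all sentences containing both entities."""
    counts = Counter()
    for tokens in sentences:
        if first in tokens and second in tokens:
            counts.update(lexical_patterns(tokens, first, second, m1, m2, K))
    return counts
```

The resulting counter maps each n-gram to its frequency and can be stored against the entity pair as its vocabulary pattern frequencies.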

The search method according to claim 18 or 19, wherein the context around a candidate entity pair is acquired only when the distance D between the two entities of the candidate entity pair is equal to or less than a threshold value Dth.
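As a non-limiting illustration, creating candidate entity pairs subject to the distance threshold could be sketched as follows. Measuring the distance D as the difference between token positions, and the caller-supplied predicate `is_entity`, are assumptions of this sketch.

```python
from itertools import combinations


def candidate_pairs(tokens, is_entity, d_th=5):
    """Return entity pairs whose token distance is at most d_th.

    tokens    -- the words of one segmented sentence
    is_entity -- hypothetical predicate deciding whether a token is an entity
    d_th      -- the distance threshold (Dth in the claim)
    """
    positions = [k for k, tok in enumerate(tokens) if is_entity(tok)]
    pairs = []
    # combinations() yields index pairs (i, j) with i < j.
    for i, j in combinations(positions, 2):
        if j - i <= d_th:
            pairs.append((tokens[i], tokens[j]))
    return pairs
```

Pairs that are far apart in a sentence are less likely to stand in a direct relationship, so they are skipped before any context is extracted.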

The search method according to claim 18, wherein in the step of creating the candidate entity pair, an entity pair including a proper noun is preferentially extracted as a candidate entity pair.

The search method according to any one of claims 18 to 21, wherein, in the step of creating the candidate entity pair, a high-frequency entity pair is preferentially extracted as a candidate entity pair.

The search method according to any one of claims 18 to 22, further comprising the step of forming a cluster of vocabulary patterns from similar or substantially identical vocabulary patterns.

The search method according to claim 1, further comprising:
using an existing search engine to obtain the context around the entities A and B in text containing the entities A and B; and
using the existing search engine to obtain entities C and X having a context that is the same as or similar to the context around the entities A and B.

The context around the entities A and B is obtained from snippets returned as search results by the search engine for a query consisting of the entity A and the entity B, with the distance between the entity A and the entity B limited. Search method described in.

The search method according to claim 25, wherein the context around the entities A and B is acquired as a vocabulary pattern, and
candidates for the entity X are acquired from snippets returned as search results by the search engine for the query “C pattern *”, which is formed by substituting C at the position of A in the acquired vocabulary pattern.
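As a non-limiting illustration, forming the wildcard query and reading candidates out of snippets could be sketched as follows. The sketch assumes that the pattern marks the positions of A and B with the placeholders "X" and "Y", that a candidate is a single word, and that snippets have already been fetched; it does not call any search engine, and the function names are hypothetical.

```python
import re
from collections import Counter


def build_wildcard_query(pattern, c):
    """Substitute C at the position of A and "*" at the position of B."""
    words = []
    for word in pattern.split():
        if word == "X":
            words.append(c)
        elif word == "Y":
            words.append("*")
        else:
            words.append(word)
    return '"' + " ".join(words) + '"'


def extract_candidates(pattern, c, snippets):
    """Count the words that fill the wildcard position in the snippets."""
    parts = []
    for word in pattern.split():
        if word == "X":
            parts.append(re.escape(c))
        elif word == "Y":
            parts.append(r"(\w+)")  # assumption: single-word candidates
        else:
            parts.append(re.escape(word))
    regex = re.compile(r"\s+".join(parts), re.IGNORECASE)
    counts = Counter()
    for snippet in snippets:
        for match in regex.findall(snippet):
            counts[match.lower()] += 1
    return counts
```

The counts returned by `extract_candidates` could serve as a simple score for ordering the candidates when no search database is available.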

A search system that, when an entity pair (A, B) and an entity C are input, searches for an entity X such that the relationship between the entities C and X of the entity pair (C, X) is the same as or similar to the relationship between the entities A and B, using a search database, wherein
the search database stores correspondences between a large number of entity pairs and the surrounding contexts that express the relationship between the entities of each entity pair, the system comprising:
means for receiving the entity pair (A, B) and the entity C as a query;
means for obtaining, using the search database, a first surrounding context expressing the relationship between the entity A and the entity B; and
means for obtaining the entity X by searching, using the search database, for entities C and X having a second surrounding context that is identical or similar to the first surrounding context.

The search system according to claim 27, wherein the context is a vocabulary pattern, and the search database stores, for a large number of entity pairs, the correspondence between each entity pair, its vocabulary patterns, and the frequencies of those vocabulary patterns.

The search system according to claim 28, comprising:
means for acquiring a plurality of entity pairs (C, X) each having a second feature vector whose elements include a vocabulary pattern (one whose frequency is greater than a preset threshold) of the first feature vector of the entity pair (A, B);
means for calculating the similarity between the entity pair (A, B) and each entity pair (C, X) using both feature vectors;
means for sorting the plurality of entity pairs (C, X) in descending order of the calculated inter-pair similarity and ranking the candidates for X in the sorted order; and
means for outputting some or all of the ranked candidates for X as a search result.

The search system according to claim 29, wherein the search database comprises:
a first index defining the ID of each entity in the entity pairs;
a second index defining the entity pair ID corresponding to each pair of two entity IDs;
a third index defining the vocabulary pattern ID corresponding to each vocabulary pattern; and
a fourth index defining the correspondence between each entity pair ID and the vocabulary pattern IDs and vocabulary pattern frequencies;
and wherein the system:
acquires the entity pair ID corresponding to the input entity pair (A, B) and, using the acquired entity pair ID, acquires the vocabulary pattern IDs and vocabulary pattern frequencies corresponding to the entity pair (A, B) to form the first feature vector;
acquires candidate entity pair IDs that include a vocabulary pattern ID of the first feature vector (one whose vocabulary pattern frequency is greater than a preset threshold) and the ID of the entity C; and
acquires a second feature vector corresponding to each candidate entity pair ID, and acquires the entity pairs (C, X) whose second feature vector is similar to the first feature vector.

The search system according to claim 30, wherein the fourth index comprises:
an index for looking up vocabulary pattern IDs and vocabulary pattern frequencies from an entity pair ID; and
an index for looking up entity pair IDs and vocabulary pattern frequencies from a vocabulary pattern ID.

The search system according to any one of claims 27 to 31, wherein the search database stores vocabulary pattern clusters, and each vocabulary pattern cluster includes similar or substantially identical vocabulary patterns.

The search system according to claim 27, wherein entity clusters are stored in the search database, and each entity cluster includes similar or substantially identical entities.

The search system according to any one of claims 27 to 33, which, by replacing the entity pair (C, X) with an entity pair (X, D) and the entity C with an entity D, functions as a system that, when the entity pair (A, B) and the entity D are input, searches for an entity X such that the relationship between the entities X and D of the entity pair (X, D) is the same as or similar to the relationship between the entities A and B.