JP6440542B2 - Knowledge engine for managing large amounts of complex structured data

Info

Publication number: JP6440542B2
Application number: JP2015054835A
Authority: JP
Inventors: インホンフェン; スバシッチペロ
Original assignee: NTT Docomo Inc
Current assignee: NTT Docomo Inc
Priority date: 2014-03-18
Filing date: 2015-03-18
Publication date: 2018-12-19
Anticipated expiration: 2035-03-18
Also published as: JP2015179516A

Description

The present invention relates to the storage and retrieval of information, and more particularly to the storage and retrieval of structured data.

［関連出願］
[Related Applications]
This application claims the benefit of US Provisional Application No. 61/955,077, filed March 18, 2014, the contents of which are hereby incorporated by reference in their entirety.

The storage of structured data in relational databases that can be queried using Structured Query Language (SQL) has developed gradually since the 1970s. For example, a large retailer may use a relational database to store customer profiles and other histories. Such databases can become quite large, but they function efficiently only when the scope of the data being searched is relatively narrow, as with customer profiles.

In contrast to conventional relational databases, "knowledge engines" have been developed that can query a corresponding knowledge database spanning a vast range of subject matter. However, because the range of entities and attributes to be queried is so broad, the complexity of such knowledge engines becomes unmanageable. For example, one user may ask "In what year was William Jefferson Clinton born?" while another user may want to ask the same knowledge engine "What was the population of Barrow, Alaska in 2012?" Accordingly, there is a need in the art for a knowledge engine, and a corresponding knowledge database, that can efficiently handle the complexity and scope of the structured data to be queried.

In addition, knowledge databases are becoming large and difficult to handle because of the sheer amount of structured data they are required to store. There is therefore a need in the art for a knowledge database whose organization allows efficient storage while supporting fast and accurate retrieval by the corresponding knowledge engine.

The following exemplary embodiments are directed to the use of Freebase as the source of the structured data used to build the disclosed knowledge database. However, it will be appreciated that in alternative embodiments a wide variety of other information sources, such as Wikipedia and online or electronic dictionaries and encyclopedias, can be used instead of Freebase. For example, the knowledge database can also be driven by a system that generates resources as disclosed in International Application No. PCT/US14/67479, the contents of which are hereby incorporated by reference in their entirety. The knowledge database can therefore carry various links to other resources, such as hyperlinks to a plurality of other items in the Wikipedia database.

In Freebase, data is organized in Resource Description Framework (RDF) triples. The first part of an RDF triple is the entity, which is the subject being described or depicted. The second part of the triple is the attribute or predicate, which is the kind of relationship that holds for the entity being described. Finally, the third part of the triple is the value or object, which is the thing referenced by the triple. For example, consider the following sentence: "Joe is a friend of Tom." In this example, Tom is the subject or entity, the friend relationship is the attribute or predicate, and Joe is the value. Such a triple can readily be represented as a connected graph in which Joe and Tom are nodes connected by an arc from Joe to Tom representing the attribute "is a friend of". Many similar triples, such as Tom's age, occupation, and interests, can extend from the entity "Tom".
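As an illustration only, the triple structure described above might be held in memory as follows. This is a minimal sketch: the `Triple` type, the `group_by_entity` helper, and the sample age and occupation values are hypothetical and do not appear in the specification.

```python
from collections import defaultdict, namedtuple

# One RDF-style triple: entity (subject), attribute (predicate), value (object).
Triple = namedtuple("Triple", ["entity", "attribute", "value"])

# Hypothetical sample data; the roles follow the "Joe is a friend of Tom" example,
# where Tom is the entity, the friend relationship is the attribute, and Joe is the value.
triples = [
    Triple("Tom", "friend", "Joe"),
    Triple("Tom", "age", "34"),
    Triple("Tom", "occupation", "engineer"),
]

def group_by_entity(rows):
    """Collect every (attribute, value) pair that extends from each entity."""
    graph = defaultdict(list)
    for t in rows:
        graph[t.entity].append((t.attribute, t.value))
    return dict(graph)

graph = group_by_entity(triples)
```

Grouping by entity mirrors the graph view in the text, where several arcs extend from the single node "Tom".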

An exemplary method may include forming, for a set of encoded resource description format entities, attributes, values, and encoded categories, a category index of the encoded categories, wherein each encoded category in the category index includes a list of the corresponding encoded entities. The method may also include determining, for each encoded attribute, the encoded entities that have an encoded value for that encoded attribute. The method may also include forming, in response to the determination, an attribute index of the encoded attributes and their corresponding encoded entities and encoded values. The method may also include storing the category index and the attribute index to form a knowledge database.

The method may also include decomposing a structured query into a plurality of simple queries. For each of the plurality of simple queries, the method may also include accessing the category index or the attribute index of the knowledge database to obtain a list of encoded entities. The method may also include determining the intersection of the lists of encoded entities to identify the encoded entities that are responsive to the structured query. The method may also include converting the encoded entities that are responsive to the structured query back into the original entities.

FIG. 1 is a flowchart for loading an exemplary knowledge database.
FIG. 2 is a flowchart of an exemplary knowledge engine querying the knowledge database.
FIG. 3 is a diagram of an exemplary search architecture that includes the knowledge engine of FIGS. 1 and 2.
FIG. 4 is a diagram of an exemplary system that includes the knowledge engine of FIGS. 1 to 3.
FIG. 5A is a diagram of an exemplary server device that includes the knowledge engine of FIGS. 1 to 4.
FIG. 5B is a diagram of an exemplary server device that includes the knowledge engine of FIGS. 1 to 4.

Referring now to the drawings, raw triples are received from an appropriate database such as Freebase, as shown in step 101 of FIG. 1. Step 102 includes removing redundant entities, attributes, and RDF triples from the raw triples received in step 101 to obtain non-redundant (also referred to herein as "clean") RDF triples 103. The entities of the RDF triples 103 are ranked in step 104. For example, entity ranks can be generated by analyzing Wikipedia access counts, the frequency with which an entity's name appears, the popularity of the entity's name, and the like.

In step 105, the clean RDF triples 103 are encoded as unique integer identifiers (IDs). Each entity, attribute, and value is encoded in this way so that a triple can be represented by a triple of integers, yielding the encoded data of step 106. To distinguish the original version from the encoded version, the integer ID representing an encoded entity is also referred to herein as an "entity ID". Similarly, the integer ID representing an encoded attribute is also referred to herein as an "attribute ID", and the integer ID representing an encoded value is also referred to herein as a "value ID". Integer IDs can also be assigned to the entity ranks obtained in step 104. Entities can also be organized by category so that each category can likewise be encoded as a category ID. A hash table H (not shown) stores all of the mappings from entities, attributes, values, entity ranks, and categories to their respective integer representations. Similarly, an array list L stores the mappings from the integer IDs back to the corresponding unencoded, that is, original, entities, attributes, values, entity ranks, and categories. The hash table H and the array list L are therefore inverses of each other. Next, in step 107, the encoded RDF triples can be loaded into memory 108. Step 107 can also include indexing all entity IDs according to their category IDs and attribute IDs. A category represents a type of entity; for example, the type may be "movie star", in which case the entities of that category are the individual movie stars. In step 109, additional indexing can be performed to index the entity IDs and value IDs for each attribute ID. The result of this indexing is the category ID index and attribute ID index that are stored in the knowledge database or memory in step 110.
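The relationship between the hash table H and the array list L can be sketched as below. This is an illustrative assumption rather than the disclosed implementation: the `Codec` class and `encode_triples` function are hypothetical names, and a single integer ID space shared by entities, attributes, and values is assumed for brevity.

```python
class Codec:
    """Bidirectional mapping between original items and integer IDs."""

    def __init__(self):
        self.H = {}  # hash table H: original item -> integer ID
        self.L = []  # array list L: integer ID -> original item

    def encode(self, item):
        # Assign the next unused integer the first time an item is seen.
        if item not in self.H:
            self.H[item] = len(self.L)
            self.L.append(item)
        return self.H[item]

    def decode(self, item_id):
        # L is the inverse of H, so decoding is a plain list lookup.
        return self.L[item_id]


def encode_triples(triples, codec):
    """Turn (entity, attribute, value) triples into triples of integers."""
    return [tuple(codec.encode(part) for part in t) for t in triples]
```

Because every ID is an index into L, decoding takes constant time, while H gives constant expected-time lookup in the other direction.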

An exemplary category index lists the entity IDs for each category. Similarly, the attribute index lists, for each attribute, the entity IDs together with their value IDs. The category ID index and the attribute ID index are sorted by entity ID within each category ID and each attribute ID. The resulting encoded and indexed RDF triples in the knowledge database, formed as described with respect to FIG. 1, can advantageously be searched by the knowledge engine 200 as shown in FIG. 2. In step 201, a structured query 201 is received. Such a structured query includes specified entities, attributes with comparison values, and categories, as well as output attributes and a sort order. In step 202, the structured query 201 is advantageously parsed into simple queries 203. Each simple query 203 can correspond to one category and one attribute. Further, each simple query 203 can have only one entity, or only one attribute with a comparison, or only one category. For example, suppose a user wants to identify all movie stars in the database who were born after 1982. Such a query can be decomposed into several simple queries, such as all entities in the category "movie star" and all entities whose birth attribute is later than 1982. It should be noted that when each simple query 203 corresponds to one category and one attribute, the number of categories and attributes can far exceed the limits of a typical relational database. Specifically, the data set in this embodiment may exceed several hundred gigabytes, and a large number of computer clusters may be required to parse the simple queries 203.
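The category index and attribute index described above could be built roughly as follows. This is a sketch under assumptions: the function name `build_indexes` and the input shapes (a list of integer triples and a list of entity/category pairs) are hypothetical.

```python
from collections import defaultdict

def build_indexes(encoded_triples, entity_categories):
    """
    encoded_triples:   iterable of (entity_id, attribute_id, value_id)
    entity_categories: iterable of (entity_id, category_id)
    Returns (category_index, attribute_index), each sorted by entity ID.
    """
    category_index = defaultdict(list)   # category_id -> [entity_id, ...]
    for entity_id, category_id in entity_categories:
        category_index[category_id].append(entity_id)

    attribute_index = defaultdict(list)  # attribute_id -> [(entity_id, value_id), ...]
    for entity_id, attribute_id, value_id in encoded_triples:
        attribute_index[attribute_id].append((entity_id, value_id))

    # Sort every posting list by entity ID, as the text specifies.
    for ids in category_index.values():
        ids.sort()
    for pairs in attribute_index.values():
        pairs.sort()
    return dict(category_index), dict(attribute_index)
```

Keeping each list sorted by entity ID is what later allows lists from different simple queries to be intersected in a single linear pass.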

The simple queries 203 are created using the encoded integer values in the hash table. That is, rather than searching for the category "movie star", the actual search is based on the integer-encoded representation (category ID) of the category "movie star". Similarly, a search on the "release date" attribute is performed using the attribute ID corresponding to "release date". Step 204 includes listing the output attributes and sort attributes of the query. This list may be empty in some embodiments.

The comparison step 205 is performed for each simple query 203. For example, a simple query 203 can involve listing all entity IDs corresponding to a particular category ID. Alternatively, a simple query 203 may involve listing all entity IDs and their corresponding value IDs for a particular attribute ID. The array list L can then be used to decode the value IDs into the original values. Each resulting value can then be compared with a query parameter to determine whether the corresponding entity is responsive to the simple query. For example, depending on its nature, such a simple query can include a comparison value that the value must be greater than, equal to, or less than. For the comparisons or matches in step 205, a corresponding entity list 206 is generated.

In step 207, the intersection of the various entity lists 206 is determined to obtain the entity list 208. For example, in the exemplary search above that identifies all movie stars born after 1982, one entity list 206 can contain all movie stars found in memory 108. Another list can contain all entities in the database that were born after 1982. The intersection of these lists is the requested answer.
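Because the index lists are sorted by entity ID, the intersection of step 207 can be computed with a linear merge. The following is one possible sketch, not the disclosed implementation; the function names are hypothetical and each input list is assumed to be sorted in ascending order without duplicates.

```python
def intersect_sorted(a, b):
    """Intersect two ascending lists of entity IDs with a two-pointer merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def intersect_all(lists):
    """Intersect any number of sorted entity lists (empty input gives [])."""
    if not lists:
        return []
    result = lists[0]
    for other in lists[1:]:
        result = intersect_sorted(result, other)
    return result
```

Starting from the shortest list would reduce work further; that refinement is omitted here to keep the sketch short.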

For example, in step 207, the intersection of entity A (denoted E(A)) and entity B (denoted E(B)) can be determined, as given by E(A) ∩ E(B). If the intersection is empty, there may be no relationship between entities A and B. If the intersection exists, the union of E(A) and E(B), denoted E(A) ∪ E(B), and the expression −log((E(A) ∩ E(B)) / (E(A) ∪ E(B))), in which the ratio is read as the size of the intersection divided by the size of the union, can be computed to obtain a similarity score between entities A and B.

Such a similarity score varies inversely with the similarity between the entities. In particular, the most closely related entities correspond to the case in which the intersection of the referenced entities is identical to their union, so that the score is zero (the logarithm of 1 is zero regardless of the base). As the intersection becomes smaller relative to the union, the logarithm of the resulting ratio becomes more negative and the negative of the logarithm becomes more positive. In this way, an ordered set of scores for related entities can be generated for each entity. In some embodiments, a threshold is applied to the ordered scores to determine the subset of entities most relevant to a given entity, and possibly the list 208 of entities responsive to the structured query 201. Regardless of whether a threshold is applied, the similarity calculation makes it easier to obtain the list 208 of entities responsive to the query 201 and can also facilitate the sorting and ranking 209 of related entities.
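The score discussed above can be illustrated with a short sketch. It assumes that E(A) and E(B) are the sets of entity IDs referenced by A and B and that the ratio is taken between set sizes; the function name and the choice of the natural logarithm are assumptions, and `None` is used here to signal an empty intersection.

```python
import math

def similarity_score(e_a, e_b):
    """
    Return -log(|E(A) ∩ E(B)| / |E(A) ∪ E(B)|) for two sets of entity IDs.
    A score of zero means the sets coincide; larger scores mean less similar.
    Returns None when the intersection is empty (no apparent relationship).
    """
    intersection = len(e_a & e_b)
    if intersection == 0:
        return None
    union = len(e_a | e_b)
    return -math.log(intersection / union)
```

For instance, two sets sharing two of four distinct IDs give a ratio of 0.5 and therefore a score of log 2, while identical sets give a score of zero.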

Additional algorithms can be implemented in place of, or in combination with, the logarithmic method described above. For example, the Jaccard method or the PMI method can be used to compute a numerical similarity between entities. In addition, the fact that a given entity is a member of a category can be used to select other members of that category as related entities.

In step 209, based on the output format determined in step 204, the entities in the final list 208 can be sorted according to their specified attributes. Alternatively, step 209 may include ranking the entities by entity rank when the output format is empty. The sorting or ranking of step 209 yields, in step 210, an output list of entities in the specified format, which can be displayed to the user as search results in step 211.
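The choice in step 209 between sorting by a specified attribute and falling back to entity rank might look like the following. This is a hedged sketch: `order_results` and its parameters are hypothetical, the attribute values are assumed to be already decoded, and a lower rank number is assumed to mean a more prominent entity.

```python
def order_results(entity_ids, sort_values=None, descending=False, ranks=None):
    """
    entity_ids:  entity IDs in the final list
    sort_values: optional dict entity_id -> decoded value of the sort attribute
    ranks:       optional dict entity_id -> entity rank (lower = better, assumed)
    """
    if sort_values is not None:
        # An output format names a sort attribute: order by its value.
        return sorted(entity_ids, key=lambda e: sort_values[e], reverse=descending)
    if ranks is not None:
        # Empty output format: fall back to the precomputed entity rank.
        return sorted(entity_ids, key=lambda e: ranks[e])
    return list(entity_ids)
```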

Several exemplary queries are now considered to explain the search process in more detail. A structured query 201 has two parts, an input format and an output format. For example, suppose a user wants to know the titles of all movies released in the week from September 5, 2013 to September 12, 2013. The resulting structured query 201 can have the input format: category = movie, release date >= 2013-09-05, release date < 2013-09-12. The output format can be: entity name, release date: sorted in descending order.
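The decomposition of this example query into simple queries can be sketched as follows. The dictionary layout of the structured query and the `SimpleQuery` tuple are hypothetical representations chosen for illustration; the specification does not prescribe a concrete syntax.

```python
from collections import namedtuple

# One simple query: either a single category, or a single attribute with a comparison.
SimpleQuery = namedtuple("SimpleQuery", ["kind", "name", "op", "value"])

def decompose(structured_query):
    """Split the input part of a structured query into simple queries."""
    simple = [SimpleQuery("category", c, None, None)
              for c in structured_query["categories"]]
    simple += [SimpleQuery("attribute", attr, op, value)
               for attr, op, value in structured_query["conditions"]]
    return simple

# The movie example from the text: input format plus output format.
query = {
    "categories": ["movie"],
    "conditions": [("release date", ">=", "2013-09-05"),
                   ("release date", "<", "2013-09-12")],
    "output": ["entity name", "release date"],
    "sort": ("release date", "descending"),
}
simple_queries = decompose(query)
```

The output and sort fields are carried separately because they affect only how the final entity list is presented, not which entities are selected.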

The resulting structured query 201 can then be decomposed into the following simple queries 203. The first simple query lists all entities with category = movie. As described above, the knowledge database stores the category index and the attribute index in compressed form, such that the RDF triple values are replaced with, for example, triples of integers. It will be appreciated that other types of data compression may be used. For this simple query, the hash table H is first accessed to find the integer representation of the category "movie". The search of the knowledge database then reduces to matching the integer in the hash table that corresponds to the category "movie" against the encoded category index. For example, suppose the category "movie" is represented by the integer ID "j". Searching the knowledge database then becomes the relatively fast and easy task of retrieving the category index encoded as integer ID "j". This j-th index is thus the movie index and yields the list of corresponding entity IDs.

The second simple query resulting from the decomposition of the exemplary structured query lists all entities that have the release date attribute with an attribute value of 2013-09-05 or later (note that such a simple query has an attribute, a comparison, and a value). Accordingly, the integer representation of the "release date" attribute is obtained from the hash table H. For example, suppose the attribute ID of "release date" is represented by the integer y. The knowledge engine then retrieves the attribute index corresponding to the integer y from the knowledge database. Such a search merely matches integers and is therefore very fast compared with prior art methods. The retrieved attribute index lists all entities that have a value for the "release date" attribute, but it is encoded in the same way as the other items stored in the knowledge database. In other words, the retrieved attribute index is a list of pairs, each pair being an entity ID and its corresponding value ID. The knowledge engine can then create a new entity ID list E. For each value ID "n" in the retrieved attribute index, the knowledge engine can obtain the original value (the release date in this example) from the array list L and check whether the release date is the same as or later than the query date "2013-09-05". If the result of the comparison is true (the release date is the same as or later than the query date "2013-09-05"), the knowledge engine can add the corresponding entity ID to the entity ID list E. This entity ID list E is the list responsive to the second simple query.
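The evaluation of an attribute-based simple query can be sketched as below. The function and the operator table are hypothetical; the array list L is modeled as a Python list indexed by value ID, and dates are assumed to be stored as ISO 8601 strings (YYYY-MM-DD), which compare correctly as plain strings.

```python
import operator

# Comparison operators a simple query may carry (assumed spelling).
COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
    "<": operator.lt,
    "<=": operator.le,
}

def run_attribute_query(attribute_index, L, attribute_id, op, query_value):
    """
    attribute_index: dict attribute_id -> [(entity_id, value_id), ...]
    L:               array list mapping value_id -> original value
    Returns the entity ID list E responsive to the simple query.
    """
    compare = COMPARATORS[op]
    E = []
    for entity_id, value_id in attribute_index.get(attribute_id, []):
        original_value = L[value_id]  # decode the value ID
        if compare(original_value, query_value):
            E.append(entity_id)
    return E
```

Since the attribute index is sorted by entity ID, the resulting list E is sorted as well and can be passed directly to a merge-based intersection.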

The third simple query is similar to the second simple query and is therefore directed to the list of all entities that have the release date attribute with an attribute value earlier than the date 2013-09-12. This third simple query is thus processed in the same manner as described for the second simple query. The intersection of the three entity lists obtained from these three simple queries can then be determined as described with respect to step 207 to obtain the output entity list 208. In this example, there is no need for the sorting or ranking described with respect to step 209 of FIG. 2. The array list L can then be used to decode the results to obtain the original entities and their values. Table 1 below shows the search results listing the movies returned by this example structured query.

The actual list of search results generally depends on the output format 204 specified in the query, for example whether values are to be listed in ascending or descending order.

The knowledge engine 200 can consist of a server, a plurality of servers, or another type of suitable computer. To perform the steps shown in FIG. 2, the server implementing the knowledge engine 200 can be coded in Java (registered trademark) or another suitable programming language. The triples stored in Freebase are represented using forms known in the RDF field.

The knowledge engine 200 can be incorporated into a system architecture as shown in FIG. 3. A semantic engine 305 receives a natural language query from a user and provides the corresponding structured query 201 to the knowledge engine 200. The knowledge engine then interacts with the database 108 described with respect to FIG. 2 to obtain the requested search results, which can then be displayed on a user device 310 such as a smartphone, smartwatch, tablet, or other suitable device.

The knowledge engine 200 can also be incorporated into a system as shown in FIG. 4. The system 400 includes a plurality of computing devices, such as a server device 402 and client devices 404 and 406, each of which can be configured to send and receive data/data packets 416 and 418 over a communication network 408. For example, the communication interface 410 can send and receive raw RDF data and structured queries as described above in connection with FIGS. 1 to 3. The memory 412 can include one or more volatile, non-volatile, removable, and/or non-removable storage components, such as magnetic, optical, or flash storage, further configured to store, for example, one or more of the entity lists described above. The communication interface 410, knowledge engine 200, semantic engine 305, knowledge base 108, and memory 412 can thus be communicatively coupled via a system bus, network, or other connection mechanism 414.

The client devices 404 and 406 can take various forms, including, among other computing devices capable of sending structured queries and receiving search results, personal computers (PCs), smartphones, wearable computers, laptop/tablet computers, smartwatches with suitable computer hardware resources, head-mounted displays, and other types of wearable devices. The client devices 404 and 406 can include various components, including, for example, input/output (I/O) interfaces 430 and 440, communication interfaces 432 and 442, processors 434 and 444, and data storage 436 and 446, respectively, all of which can be communicatively coupled to one another via a system bus, network, or other connection mechanism 438 and 448, respectively.

The I/O interfaces 430 and 440 can be configured to facilitate interaction between the client devices 404 and 406 and their respective users. For example, the I/O interfaces 430 and 440 can be configured to access queries received from a user and to provide search results to the user. Accordingly, the I/O interfaces 430 and 440 can include input hardware, such as a microphone for receiving voice commands, a touch screen, a touch-sensitive panel, a computer mouse, a keyboard, and/or other input hardware.

As described above, the knowledge engine 200 can consist of a server, a plurality of servers, or another type of suitable computer. Because of its scalability, the knowledge engine 200 can be integrated with a server device as shown in FIGS. 5A and 5B. FIG. 5A shows an exemplary server device 500 configured to support a set of trays, according to one embodiment. As shown, the server device 500 can include a housing 502 that supports trays 504 and 506 and possibly a plurality of other trays. The housing 502 can include slots 508 and 510 configured to hold the trays 504 and 506, respectively. The housing 502 can be connected to a power supply 512 via connections 514 and 516 to supply power to the slots 508 and 510, respectively. The housing 502 can also be connected to a communication network 518 via connections 520 and 522 to provide network connectivity to the slots 508 and 510, respectively. Accordingly, the trays 504 and 506 can be inserted into the slots 508 and 510, respectively, to receive power from the power supply 512 and to connect to the communication network 518.

FIG. 5B illustrates the tray 504 configured to support one or more components. The tray 504 can include the knowledge engine 200, the semantic engine 305, the knowledge base 108, the communication interface 410, and the memory 412. The tray 504 can include a connector 526 that can be coupled to the connection 514 or 516 to supply power to the tray 504. The tray 504 can also include a connector 528 that can be coupled to the connection 520 or 522 to provide the tray 504 with network connectivity. As such, the knowledge engine 200, the semantic engine 305, and the knowledge base 108 can be configured to perform the operations described herein and illustrated in the accompanying drawings.

DESCRIPTION OF SYMBOLS: 108 ... memory, 200 ... knowledge engine, 201 ... structured query, 203 ... simple query, 206, 208 ... entity list, 305 ... semantic engine, 310 ... user device, 402 ... server device, 404, 406 ... client device, 408 ... communication network, 416, 418 ... data/data packet, 410 ... communication interface, 412 ... memory, 414 ... connection mechanism, 430, 440 ... I/O interface, 432, 442 ... communication interface, 434, 444 ... processor, 436, 446 ... data storage mechanism, 438, 448 ... connection mechanism, 500 ... server device, 502 ... housing, 504, 506 ... tray, 508, 510 ... slot, 514, 516 ... connection, 512 ... power supply, 518 ... communication network, 526 ... connector, 528 ... connector.

Claims

A system comprising:
a knowledge database configured to store an attribute index for encoded attributes, together with the encoded entities and encoded values corresponding to the encoded attributes, the knowledge database being further configured to store a category index of encoded categories, wherein each encoded category in the category index includes a list of corresponding encoded entities; and
a knowledge engine configured to retrieve an encoded entity list responsive to a simple query using the category index and the attribute index stored in the knowledge database.

The system of claim 1, wherein each of the encoded attributes, the encoded entities, the encoded values, and the encoded categories comprises a corresponding integer.

The system of claim 1, further comprising:
a semantic engine configured to receive a natural language query and to provide a corresponding structured query to the knowledge engine.

The system of claim 1, wherein the knowledge engine interacts with the knowledge database to cause a user device to display the encoded entity list that is responsive to the simple query.

The system of claim 1, wherein the knowledge engine is configured to obtain a list of the encoded entities that are responsive to the simple query for a common set of the encoded entities.

The system of claim 1, wherein the knowledge engine is configured to determine a common set of one entity list in memory and another entity list in the knowledge database.

The system of claim 1, wherein the knowledge engine is configured to determine that a similarity score between lists of encoded entities is greater than a threshold and to obtain the list of encoded entities that are responsive to a structured query.

The system of claim 7, wherein the knowledge engine is configured to determine that the similarity score between the lists of encoded entities is greater than the threshold and to obtain the list of encoded entities that are responsive to the structured query, and wherein the similarity score is determined based on -log((E(A) ∩ E(B)) / (E(A) ∪ E(B))), where E(A) is one entity and E(B) is another entity.

A non-transitory computer-readable recording medium storing program instructions,
wherein the program instructions cause a knowledge engine to execute:
for a set of encoded resource description format entities, attributes, values, and encoded categories, creating a category index for the encoded categories, wherein each encoded category in the category index includes a list of corresponding encoded entities;
determining, for each encoded attribute, the encoded entities having an encoded value for the encoded attribute;
forming, in response to a result of the determining, an attribute index for the encoded attributes together with the encoded entities and the encoded values corresponding to the encoded attributes; and
storing the category index and the attribute index to form a knowledge database.
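
The index-building steps recited in the preceding claim can be pictured with a minimal Python sketch. It is illustrative only and is not taken from the specification: the function name `build_indexes`, the dictionary layout, the separate list of category memberships, and all integer codes are assumptions made for the example.

```python
from collections import defaultdict


def build_indexes(triples, memberships):
    """Build a category index and an attribute index (hypothetical layout).

    triples:     iterable of (entity, attribute, value) integer codes
    memberships: iterable of (category, entity) integer codes
    """
    # Category index: encoded category -> list of encoded entities.
    category_index = defaultdict(list)
    for category, entity in memberships:
        category_index[category].append(entity)

    # Attribute index: encoded attribute -> list of (entity, value) pairs,
    # i.e. the entities that have a value for that attribute.
    attribute_index = defaultdict(list)
    for entity, attribute, value in triples:
        attribute_index[attribute].append((entity, value))

    # Sort the lists so the stored form is deterministic.
    for entities in category_index.values():
        entities.sort()
    for pairs in attribute_index.values():
        pairs.sort()

    # Together, the two indexes stand in for the stored knowledge database.
    return dict(category_index), dict(attribute_index)


# Made-up codes: entities 1-3, category 100, attributes 10 and 11.
triples = [(2, 10, 1961), (1, 10, 1946), (1, 11, 7)]
memberships = [(100, 3), (100, 1), (100, 2)]
category_index, attribute_index = build_indexes(triples, memberships)
```

Keeping every key and list element as a plain integer mirrors the integer encoding recited in the system claims; how the indexes are actually persisted is left open here.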

The non-transitory computer-readable recording medium according to claim 9,
wherein the program instructions further cause the knowledge engine to execute:
decomposing a structured query into a plurality of simple queries;
for each of the plurality of simple queries, accessing the category index or the attribute index of the knowledge database to determine a list of encoded entities;
determining a common set of the plurality of lists of encoded entities to determine the encoded entities that are responsive to the structured query; and
converting the encoded entities responsive to the structured query into original entities.
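
The query-answering steps recited in the preceding claim might be sketched as follows. This is a hedged illustration, not the claimed implementation: it assumes the structured query has already been decomposed into a list of simple-query tuples, reads the "common set" as a set intersection, and uses hypothetical names (`answer`, `decode`) and made-up integer codes.

```python
def answer(simple_queries, category_index, attribute_index, decode):
    """Resolve a decomposed structured query to original entity names.

    simple_queries: list of ("category", c) or ("attribute", a, v) tuples
    decode:         mapping from encoded entity to original entity
    """
    entity_lists = []
    for query in simple_queries:
        if query[0] == "category":
            _, category = query
            # Category lookup: all entities listed under the category.
            entity_lists.append(set(category_index.get(category, [])))
        elif query[0] == "attribute":
            _, attribute, value = query
            # Attribute lookup: entities whose value matches the query.
            pairs = attribute_index.get(attribute, [])
            entity_lists.append({e for e, v in pairs if v == value})
        else:
            raise ValueError(f"unknown simple query: {query!r}")

    if not entity_lists:
        return []

    # Common set of all per-query entity lists.
    common = set.intersection(*entity_lists)

    # Convert the encoded entities back to the original entities.
    return [decode[entity] for entity in sorted(common)]


# Made-up data: category 100 holds entities 1-3; attribute 10 has values.
category_index = {100: [1, 2, 3]}
attribute_index = {10: [(1, 1946), (2, 1961), (3, 1946)]}
decode = {1: "entity_a", 2: "entity_b", 3: "entity_c"}

result = answer(
    [("category", 100), ("attribute", 10, 1946)],
    category_index,
    attribute_index,
    decode,
)
```

Because every lookup returns integers, the intersection works on small integer sets and decoding happens only once, on the final result.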

The non-transitory computer-readable recording medium according to claim 10, wherein obtaining the common set of the lists of encoded entities comprises obtaining a common set of one entity list in a memory and another entity list in the knowledge database.

The non-transitory computer-readable recording medium according to claim 9,
wherein the program instructions further cause the knowledge engine to execute:
determining that a similarity score between lists of encoded entities is greater than a threshold in order to determine the encoded entities responsive to a structured query.

The non-transitory computer-readable recording medium according to claim 12, wherein determining that the similarity score between the lists of encoded entities is greater than the threshold to determine the encoded entities responsive to the structured query comprises determining the similarity score based on
-log((E(A) ∩ E(B)) / (E(A) ∪ E(B))),
where E(A) is one entity and E(B) is another entity.
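
One way to read the similarity expression above is as the negative logarithm of the ratio between the sizes of the intersection and the union of two entity lists. The sketch below follows that reading. The use of set cardinalities, the natural logarithm, the function name `similarity_score`, and the handling of empty or disjoint lists are all assumptions that the claims do not state.

```python
import math


def similarity_score(list_a, list_b):
    """Return -log(|A ∩ B| / |A ∪ B|) for two lists of encoded entities."""
    a, b = set(list_a), set(list_b)
    union = a | b
    intersection = a & b
    if not union:
        # Assumption: the score is undefined when both lists are empty.
        raise ValueError("both entity lists are empty")
    if not intersection:
        # Assumption: -log(0) is treated as positive infinity.
        return math.inf
    return -math.log(len(intersection) / len(union))


identical = similarity_score([1, 2, 3], [1, 2, 3])
partial = similarity_score([1, 2], [2, 3])
disjoint = similarity_score([1], [2])
```

Under this reading, identical lists give a score of zero and the score grows as the overlap shrinks, which is worth keeping in mind when choosing the threshold against which the score is compared.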