CN114661956A

CN114661956A - A Pregel-based Temporal T-SPARQL Query and Reasoning Method

Info

Publication number: CN114661956A
Application number: CN202011532002.7A
Authority: CN
Inventors: 贺振宇; 马宗民
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2020-12-22
Filing date: 2020-12-22
Publication date: 2022-06-24

Abstract

The invention discloses a Pregel-based temporal T-SPARQL query and reasoning method that processes and optimizes queries over temporal RDF graph data through graph-parallel computation. The problem can be formally defined as follows: given a temporal RDF data graph G and a temporal query graph Q, find all matches of Q in G. The T-SPARQL query is converted into a function executed on vertices; vertices communicate along edges by message passing; intermediate result sets are stored in vertex attributes; and in each subsequent iteration the received messages are aggregated and matching continues until the iteration ends. The proposed T-SPGX algorithm generates a query plan for a given T-SPARQL query. The plan covers two aspects, predicate label matching and temporal filtering, and applies different temporal filtering methods to a single temporal triple pattern and to join queries over multiple temporal triple patterns. Query efficiency is also improved through query result optimization: the semantic hierarchy of temporal RDFS is used to reason over the temporal RDF data, implicit results are inferred from the existing explicit query results, and the query result set is expanded. The method provides a general and scalable solution for querying massive temporal RDF data.

Description

A Pregel-based Temporal T-SPARQL Query and Reasoning Method

Technical Field

The invention proposes a general and scalable solution for querying massive temporal RDF data. It provides a Pregel-based temporal RDF query method that processes and optimizes queries over temporal RDF graph data through graph-parallel computation. The problem can be formally defined as follows: given a temporal RDF data graph G and a temporal query graph Q, find all matches of Q in G. The T-SPARQL query is converted into a function executed on vertices; vertices communicate along edges by message passing; intermediate result sets are stored in vertex attributes; and in each subsequent iteration the received messages are aggregated and matching continues until the iteration ends. The invention belongs to the field of distributed knowledge semantic querying.

Background

Real-world information naturally has temporal attributes. To better represent and manage temporal information, many researchers have proposed using RDF for temporal data representation and management. With the explosive growth of temporal data and the development of the Semantic Web and knowledge engineering, how to query and manage temporal data has become an important research topic. However, most previous work stores temporal RDF triples in relational databases and rewrites temporal queries into T-SQL or SPARQL queries for evaluation. This leads to a large number of self-join operations and to redundant data in storage, which limits performance on large knowledge bases and complex queries. On the other hand, several graph-centric parallel platforms have been proposed and developed that efficiently support iterative graph computation. Together, these factors motivate storing and querying temporal RDF graph data on a distributed platform.

Every object changes over time, and temporal attributes are important for describing the dynamic changes a resource undergoes as it evolves. The representation and querying of temporal information has long been a focus of research, and the emergence of various temporal databases and the development of temporal query languages have effectively advanced the management of temporal data. To facilitate the transmission and sharing of temporal data over networks, researchers have added temporal extensions to various data models. As RDF has become widely accepted and used as a model for semantic representation and metadata processing, temporal RDF modeling has gradually attracted attention, and the corresponding temporal extension methods have been widely studied.

An RDF data set can be described either as RDF triples or as an RDF graph. Temporal information can be represented as time points, time intervals, or time sets, and the time dimension can be transaction time or valid time. In addition, different researchers attach temporal information in different ways when defining temporal RDF models on top of the classical RDF model. Consequently, neither the form of a temporal RDF model nor the meaning of the temporal information it carries is unique. Current research on temporal RDF covers three main aspects: temporal extensions of the RDF model, temporal RDF query languages, and temporal RDF indexing schemes.

Because RDF data is graph-structured, answering a SPARQL query is essentially a subgraph matching problem. Each triple corresponds to a directed edge in a directed graph, running from the subject to the object and labeled with the predicate, and these edges link the RDF entities together. A SPARQL query is a combination of triple patterns that are related to one another, so evaluating SPARQL queries in relational databases and data-parallel systems involves a large number of self-join operations and redundant data in storage, which results in low query efficiency. On the other hand, with the rise of large real-world graph data sets, several graph-centric parallel platforms have been proposed and developed to efficiently support iterative graph computation. A popular approach to designing and implementing parallel graph algorithms is vertex-centric computation, in which vertices communicate along the edges between them. Vertex-centric programming is not the only approach, but it is popular and has been adopted by many research and open-source projects. The approach taken here converts SPARQL, the standard query language of the Semantic Web, into functions executed on vertices and evaluates SPARQL queries over the graph representation of the RDF data.

Therefore, the invention implements temporal RDF queries by graph iteration. The T-SPARQL query is converted into a function that runs on vertices and matches the predicate labels and temporal constraints of the triple patterns. The query results are extended step by step through the intermediate result sets stored in the vertex attributes, which ultimately realizes temporal RDF querying on a distributed graph framework.

Summary of the Invention

Purpose of the invention: Real-world information naturally has temporal attributes. To better represent and manage temporal information, many researchers have proposed using RDF for temporal data representation and management. With the explosive growth of temporal data and the development of the Semantic Web and knowledge engineering, how to query and manage temporal data has become an important research topic. However, most previous work stores temporal RDF triples in relational databases and rewrites temporal queries into T-SQL or SPARQL queries for evaluation. This leads to a large number of self-join operations and to redundant data in storage, which limits performance on large knowledge bases and complex queries. On the other hand, several graph-centric parallel platforms have been proposed and developed that efficiently support iterative graph computation. Together, these factors motivate storing and querying temporal RDF graph data on a distributed platform. Against this background, the invention provides a general and scalable solution for querying massive temporal RDF data.

Technical solution: To achieve the above purpose, the Pregel-based temporal T-SPARQL query and reasoning method of the invention comprises the following steps:

(1) Constraint definitions for the binary relations between time intervals are proposed, which enable join computation over different types of time interval relations.

(2) Subgraph matching queries are performed on temporal RDF graph data based on the Pregel interface. The T-SPARQL query is converted into a function executed on vertices; vertices communicate along edges by message passing; intermediate result sets are stored in vertex attributes; and in each subsequent iteration the received messages are aggregated and matching continues until the iteration ends. The proposed T-SPGX algorithm generates a query plan for a given T-SPARQL query. The plan covers predicate label matching and temporal filtering, and applies different temporal filtering methods to a single temporal triple pattern and to join queries over multiple temporal triple patterns.

(3) Query efficiency is improved through both query order optimization and query result optimization. A temporal histogram is used to evaluate the order of the query statements and obtain an optimized query order. The semantic hierarchy of temporal RDFS is used to reason over the temporal RDF data, inferring implicit results from the existing explicit query results and expanding the query result set. The proposed query method has been implemented, and the experimental results show that the algorithm achieves good query efficiency.

Beneficial effects: Temporal RDF graph data is queried and optimized through graph-parallel computation. Subgraph matching queries are performed on temporal RDF graph data based on the Pregel interface, and the T-SPARQL query is converted into a function executed on vertices, which provides a general and scalable solution for querying massive temporal RDF data. With the help of the RDFS semantic hierarchy, query efficiency is improved through query result optimization, implicit results are inferred, and the query result set is expanded.

Description of Drawings

Figure 1 is the overall flow chart of the method of the invention.

Figure 2 is the table of binary relations between time intervals.

Figure 3 is a structural diagram of the BSP model.

Figure 4 is a schematic diagram of vertex computation and message passing.

Figure 5 shows the temporal inference rules.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

The overall flow of the invention is shown in Figure 1. The sub-modules it comprises are shown in Figures 2, 3, 4 and 5 and are described in detail below with reference to each figure. The specific implementation steps are as follows.

1 Temporal Interval Binary Relation Calculation

Figure 2 shows the temporal relations defined by Allen and additionally introduces interval union. These relations are used for the comparisons in temporal join computation.
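Figure 2 is not reproduced in this text, so the following is only a minimal sketch of how the thirteen Allen relations between two intervals could be computed. The function name `allen_relation`, the tuple representation `(start, end)`, and the assumption that `start < end` with integer timestamps are illustrative choices, not taken from the patent.

```python
from typing import Tuple

# Assumption: an interval is a (start, end) pair with start < end.
Interval = Tuple[int, int]


def allen_relation(a: Interval, b: Interval) -> str:
    """Return the name of the Allen relation that holds from interval a to interval b."""
    (a_s, a_e), (b_s, b_e) = a, b
    if a_e < b_s:
        return "before"
    if b_e < a_s:
        return "after"
    if a_e == b_s:
        return "meets"
    if b_e == a_s:
        return "met-by"
    if a_s == b_s and a_e == b_e:
        return "equals"
    if a_s == b_s:
        return "starts" if a_e < b_e else "started-by"
    if a_e == b_e:
        return "finishes" if a_s > b_s else "finished-by"
    if b_s < a_s and a_e < b_e:
        return "during"
    if a_s < b_s and b_e < a_e:
        return "contains"
    # Only the two partial-overlap cases remain at this point.
    return "overlaps" if a_s < b_s else "overlapped-by"
```

A temporal join can then test a constraint such as `allen_relation(i1, i2) == "overlaps"` between the intervals of two matched edges.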

2 BSP Model

The Pregel model features parallel computation, batched messages, and a synchronization mechanism, which allow it to process a graph in parallel in a vertex-centric manner. The Pregel computing framework is based on the bulk synchronous parallel (BSP) model, which decomposes a computation into a sequence of supersteps. In each superstep, the vertex program performs its local transformation and exchanges information with neighboring vertices. The BSP model requires the system to provide multiple parallel computing units and distinguishes a Master role from a Worker role: the Master coordinates global information, and the Workers perform the computation. The processing flow of the BSP model is shown in Figure 3.

3 User-Defined Functions of the Pregel Interface

The Pregel interface in GraphX is a Pregel variant refined with reference to the GAS (Gather-Apply-Scatter) model. It decomposes the original Pregel model, which has only a vertex compute function, at a finer granularity so that more operations can run in parallel. It consists mainly of three functions: vprog, mergeMsg and sendMsg. vprog updates the state of a vertex from the message it received, mergeMsg merges two messages, and sendMsg sends a vertex's message to its neighbors.

These three functions form the core of the interface, and users supply the function bodies.

1) vprog function:

In plain terms: the vertex program (vprog) runs on every vertex during initialization in the first iteration. In subsequent iterations it runs only on vertices that received a message, and this step repeats until the iteration termination condition is met.

The user supplies the function body. The function's inputs are the vertex attribute value and the message received by the vertex, and the user-defined body combines them to update the vertex's previous attribute value. The message received by the vertex is the result of applying sendMsg followed by mergeMsg.

2) sendMsg function:

In plain terms: the user supplies the function body, and the function is applied to each edge of the graph. Its input is the edge triplet (the source vertex and its attribute, the destination vertex and its attribute, and the attribute of the edge between them). The user-defined body uses this triplet to send messages to vertices; the message type can be defined by the user.

3) mergeMsg function:

In plain terms: the user supplies the function body, and the function is applied at each vertex of the graph. sendMsg delivers messages to each vertex, and mergeMsg merges two messages addressed to the same vertex. If the message type is A, the function takes two messages of type A and the user-defined body combines them into a single message of type A.
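The division of labor among the three functions can be illustrated with a small single-machine simulation. This is a sketch and not GraphX itself: the `pregel` helper, its argument order, and the shortest-path example are assumptions made for illustration, whereas the real interface is a Scala API that runs distributed across workers.

```python
def pregel(vertices, edges, initial_msg, vprog, send_msg, merge_msg, max_iterations=20):
    """Toy Pregel loop. vertices: {id: attr}; edges: [(src, dst, edge_attr)]."""
    # First iteration: every vertex runs vprog on the initial message.
    attrs = {vid: vprog(vid, attr, initial_msg) for vid, attr in vertices.items()}
    active = set(attrs)
    for _ in range(max_iterations):
        inbox = {}
        for src, dst, edge_attr in edges:
            # Skip edges whose endpoints received no message in the last superstep.
            if src not in active and dst not in active:
                continue
            for target, msg in send_msg(src, attrs[src], dst, attrs[dst], edge_attr):
                # merge_msg combines two messages addressed to the same vertex.
                inbox[target] = merge_msg(inbox[target], msg) if target in inbox else msg
        if not inbox:
            break  # no messages in flight, so the computation has converged
        # Barrier: all messages are collected before any vertex is updated.
        for vid, msg in inbox.items():
            attrs[vid] = vprog(vid, attrs[vid], msg)
        active = set(inbox)
    return attrs


# Illustrative use: single-source shortest paths from vertex "a".
INF = float("inf")
vertices = {"a": 0.0, "b": INF, "c": INF}
edges = [("a", "b", 1.0), ("b", "c", 2.0), ("a", "c", 5.0)]
result = pregel(
    vertices, edges, INF,
    vprog=lambda vid, dist, msg: min(dist, msg),
    send_msg=lambda s, sd, d, dd, w: [(d, sd + w)] if sd + w < dd else [],
    merge_msg=min,
)
```

In the query method, the vertex attribute and the message would be intermediate result tables rather than distances, but the control flow is the same.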

4 Temporal T-SPARQL Query Implementation

1) Iterative process

In each iteration, every data vertex explores one subquery, and matching vertices send messages carrying their values to the corresponding neighbors. In subsequent iterations, only the vertices that received a message continue exploring. Rather than solving each subquery independently, each subquery is explored starting from the results of the previous one. The messages exchanged between vertices carry the intermediate results, so the messages of the last iteration form the final answer to the query.

2) Program executed on the vertex:

The vertex receives the messages sent by vertices in the previous superstep, aggregates them, updates its own attribute value, generates a new message, and sends that message, together with the newly matched results, to the next vertex.

3) Matching of query statements:

When the query has a single triple pattern, the predicate label and the temporal information are matched together. Vertices on the data graph that satisfy both the predicate label and the temporal filter are bound, the matched intermediate result table is sent as a message from the source vertex to the destination vertex, and the destination vertex sets its attribute value to the result table.

When the query has two triple patterns, an order must be fixed for them. Consecutive triple patterns must remain connected, that is, two consecutive triple patterns must share a variable.
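The single-pattern case and the connectivity requirement can be sketched as follows. The sketch evaluates a pattern over a flat list of temporal triples instead of over vertex messages. The `TemporalTriple` type, the `?`-prefix convention for variables, the sample data, and the choice of "valid interval contained in the query window" as the temporal filter are all assumptions made for illustration.

```python
from typing import NamedTuple


class TemporalTriple(NamedTuple):
    s: str
    p: str
    o: str
    start: int
    end: int


def match_pattern(graph, pattern, window=None):
    """Match one (subject, predicate, object) pattern; terms starting with '?' are variables."""
    s_term, p_term, o_term = pattern
    rows = []
    for t in graph:
        if t.p != p_term:
            continue  # predicate label does not match
        if window is not None and not (window[0] <= t.start and t.end <= window[1]):
            continue  # temporal filter: valid interval must lie inside the window
        binding, ok = {}, True
        for term, value in ((s_term, t.s), (o_term, t.o)):
            if term.startswith("?"):
                if binding.setdefault(term, value) != value:
                    ok = False  # same variable bound to two different values
            elif term != value:
                ok = False  # constant term does not match
        if ok:
            rows.append((binding, (t.start, t.end)))
    return rows


def share_variable(p1, p2):
    """Connectivity check: two consecutive patterns must have a common variable."""
    vars1 = {t for t in p1 if t.startswith("?")}
    vars2 = {t for t in p2 if t.startswith("?")}
    return bool(vars1 & vars2)


graph = [
    TemporalTriple("alice", "worksFor", "acme", 2010, 2015),
    TemporalTriple("bob", "worksFor", "acme", 2012, 2020),
    TemporalTriple("alice", "livesIn", "paris", 2008, 2018),
]
rows = match_pattern(graph, ("?x", "worksFor", "?y"), window=(2009, 2016))
connected = share_variable(("?x", "worksFor", "?y"), ("?x", "livesIn", "?z"))
```

Here `rows` plays the role of the intermediate result table that the source vertex would send to the destination vertex.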

4) Temporal binary relation processing:

Once the query order is fixed, the binary relations between time intervals must be matched and filtered whenever two triple patterns are joined. When the message for the previous triple pattern arrives at the destination vertex, that vertex matches the next query statement and checks whether the start and end times of the two edges satisfy the required binary relation between time intervals. If they do, the destination vertex passes its own intermediate result set together with the newly matched intermediate result to the next vertex. After receiving the two matching tables, that vertex aggregates the messages and stores them as its own attribute value, which is the final query result.
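The join step can be sketched as below. For clarity the sketch collapses the vertex-to-vertex message passing into a plain nested-loop join over two intermediate result tables. The function names, the row layout `(bindings, interval)`, and the two example interval predicates are hypothetical.

```python
def intersects(a, b):
    """True if the two intervals share at least one instant (assumed semantics)."""
    return a[0] < b[1] and b[0] < a[1]


def before(a, b):
    """True if interval a ends strictly before interval b starts."""
    return a[1] < b[0]


def join_with_time(left, right, relation):
    """Join two result tables on their shared variables, keeping only row pairs
    whose intervals satisfy the given binary interval relation."""
    out = []
    for l_bind, l_iv in left:
        for r_bind, r_iv in right:
            shared = l_bind.keys() & r_bind.keys()
            if any(l_bind[v] != r_bind[v] for v in shared):
                continue  # bindings disagree on a shared variable
            if not relation(l_iv, r_iv):
                continue  # temporal constraint between the two edges fails
            out.append(({**l_bind, **r_bind}, l_iv, r_iv))
    return out


# Hypothetical intermediate results of two triple patterns.
left = [
    ({"?x": "alice", "?y": "acme"}, (2010, 2015)),
    ({"?x": "bob", "?y": "acme"}, (2012, 2020)),
]
right = [
    ({"?x": "alice", "?z": "paris"}, (2008, 2018)),
    ({"?x": "bob", "?z": "rome"}, (2021, 2023)),
]
overlapping = join_with_time(left, right, intersects)
sequential = join_with_time(left, right, before)
```

Any interval predicate with the same two-argument shape, including one built from the Allen relations, could be passed as `relation`.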

5) Message aggregation:

When a vertex receives messages from different source vertices, the messages are aggregated at that vertex, as illustrated by the vertex computation and message passing diagram in Figure 4.

6) Candidate ranges for variables

If a variable in a triple pattern has already been mapped to certain vertices, later occurrences of that variable are restricted to the previously bound values. If a variable in a triple pattern has not yet been mapped to any vertex, every vertex of the data graph is a candidate.
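The rule can be expressed as a small helper. The name `candidates`, the `?`-prefix convention for variables, and the dictionary of earlier bindings are illustrative assumptions.

```python
def candidates(term, bindings, all_vertices):
    """Return the set of data-graph vertices a pattern term may be matched against."""
    if not term.startswith("?"):
        return {term}                # a constant can only match itself
    if term in bindings:
        return set(bindings[term])   # restricted to previously bound values
    return set(all_vertices)         # unbound variable: every vertex is a candidate


all_vertices = {"alice", "bob", "acme", "paris"}
bindings = {"?x": {"alice", "bob"}}  # ?x was bound by an earlier triple pattern

bound_range = candidates("?x", bindings, all_vertices)
free_range = candidates("?z", bindings, all_vertices)
```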

5 Query Order Optimization

An analytical cost model is proposed. It combines statistics about the temporal property graph with estimates of the time spent in the different stages of a distributed execution plan to estimate the execution time of the alternative plans for a given query. Based on the number of active and matched vertices and edges, the cost model estimates the running time of each superstep of a plan. Graph statistics: statistics about the temporal property graph are maintained to help estimate the number of vertices and edges that match a particular query predicate.
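The patent does not spell out the cost model or the layout of the histogram, so the following is only one possible reading: a per-predicate histogram over edge start times gives a rough cardinality estimate for each triple pattern, and patterns are ordered greedily by that estimate while staying connected. The bucket width, the function names, and the sample data are assumptions.

```python
from collections import Counter, defaultdict


def build_histogram(triples, bucket_width):
    """Count, per predicate, how many edges start in each time bucket."""
    hist = defaultdict(Counter)
    for _s, p, _o, start, _end in triples:
        hist[p][start // bucket_width] += 1
    return hist


def estimate(hist, predicate, window, bucket_width):
    """Rough count of edges with this predicate that start inside the window's buckets."""
    lo, hi = window[0] // bucket_width, window[1] // bucket_width
    return sum(hist[predicate][b] for b in range(lo, hi + 1))


def variables(pattern):
    return {t for t in pattern if t.startswith("?")}


def order_patterns(patterns, hist, window, bucket_width):
    """Greedy plan: cheapest pattern first, then the cheapest pattern that shares a variable."""
    def cost(pat):
        return estimate(hist, pat[1], window, bucket_width)

    remaining = list(patterns)
    first = min(remaining, key=cost)
    plan, bound = [first], variables(first)
    remaining.remove(first)
    while remaining:
        # Prefer patterns connected to what is already bound; fall back to any pattern.
        connected = [p for p in remaining if variables(p) & bound] or remaining
        nxt = min(connected, key=cost)
        plan.append(nxt)
        bound |= variables(nxt)
        remaining.remove(nxt)
    return plan


triples = [
    ("alice", "worksFor", "acme", 2010, 2015),
    ("bob", "worksFor", "acme", 2012, 2020),
    ("carol", "worksFor", "initech", 2011, 2013),
    ("alice", "livesIn", "paris", 2008, 2018),
    ("acme", "locatedIn", "paris", 2000, 2030),
]
hist = build_histogram(triples, bucket_width=10)
patterns = [("?x", "worksFor", "?y"), ("?x", "livesIn", "?z"), ("?y", "locatedIn", "?z")]
plan = order_patterns(patterns, hist, window=(2000, 2019), bucket_width=10)
```

A fuller model along the lines of the text would also weigh the per-superstep cost by the number of active vertices and edges, which this sketch leaves out.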

6 Temporal Relational Reasoning

The rules use the following shorthand: sp denotes rdfs:subPropertyOf, type denotes rdf:type, property denotes rdf:Property, sc denotes rdfs:subClassOf, class denotes rdfs:Class, dom denotes rdfs:domain, and range denotes rdfs:range. The semantics of the temporal knowledge graph is obtained by extending the model-theoretic semantics of RDF graphs. The semantic hierarchy of temporal RDFS is used to reason over the temporal RDF data, inferring implicit results from the existing explicit query results and expanding the query result set.
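The rules of Figure 5 are not reproduced in this text. As a hedged illustration, the sketch below applies only the subclass and subproperty rules and assumes that an inferred triple is valid on the intersection of the intervals of its premises. The quadruple layout `(s, p, o, (start, end))`, the function names, and the sample data are illustrative.

```python
def intersect(a, b):
    """Intersection of two closed intervals, or None if they are disjoint."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None


def infer(triples):
    """Apply the sc/sp rules until no new temporal triple can be derived."""
    facts = set(triples)
    changed = True
    while changed:
        new = set()
        for s1, p1, o1, i1 in facts:
            for s2, p2, o2, i2 in facts:
                common = intersect(i1, i2)
                if common is None:
                    continue
                if p1 == "sc" and p2 == "sc" and o1 == s2:
                    new.add((s1, "sc", o2, common))    # subclass transitivity
                if p1 == "sc" and p2 == "type" and o2 == s1:
                    new.add((s2, "type", o1, common))  # instance of a superclass
                if p1 == "sp" and p2 == "sp" and o1 == s2:
                    new.add((s1, "sp", o2, common))    # subproperty transitivity
                if p1 == "sp" and p2 == s1:
                    new.add((s2, o1, o2, common))      # edge also holds for superproperty
        changed = not new <= facts
        facts |= new
    return facts


explicit = [
    ("Professor", "sc", "Employee", (2000, 2030)),
    ("alice", "type", "Professor", (2010, 2015)),
    ("advises", "sp", "knows", (2000, 2030)),
    ("alice", "advises", "bob", (2012, 2014)),
]
closure = infer(explicit)
```

Running a query against `closure` instead of `explicit` returns the implicit matches as well, which is how the result set is expanded.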

Claims

1. A Pregel-based temporal T-SPARQL query and reasoning method, characterized by comprising the following steps:

(1) Query statement matching: performing subgraph matching queries on temporal RDF graph data based on the Pregel interface, and converting the T-SPARQL query into a function executed on vertices.

(2) Query order optimization: evaluating the order of the query statements using a temporal histogram to obtain an optimized query order.

(3) Vertex computation and message aggregation: communicating along edges by message passing, storing the intermediate result set in the vertex attributes, aggregating the received messages in the next iteration, and continuing matching until the iteration ends.

(4) Query result reasoning: improving query efficiency through query result optimization, and reasoning over the temporal RDF data using the semantic hierarchy of temporal RDFS.

2. The Pregel-based temporal T-SPARQL query and reasoning method of claim 1, wherein step (1) is query statement matching, and the implementation method comprises:

(2-1) matching predicate labels:

when there is only one triple pattern, matching the predicate label together with the temporal information, binding the vertices on the data graph that satisfy the predicate label and the temporal filter, sending the matched intermediate result table from the source vertex to the destination vertex by message passing, and updating the attribute value of the destination vertex to the result table; when there are two triple patterns, fixing the order of the triple patterns to be queried, wherein consecutive triple patterns must remain connected, that is, two consecutive triple patterns must share a variable.

(2-2) candidate ranges for variables:

if a variable in a triple pattern has already been mapped to certain vertices, later occurrences of the variable are restricted to the previously bound values; if a variable in a triple pattern has not yet been mapped to any vertex, every vertex of the data graph is a candidate.

(2-3) temporal binary relation processing:

after the query order is fixed, because of the binary relations between time intervals, the temporal relation is matched and filtered when two triple patterns are joined; that is, when the message of the previous triple pattern is sent to the destination vertex, the destination vertex matches the next query statement and checks whether the start and end times of the two edges satisfy the binary relation between time intervals; if so, the destination vertex passes its own intermediate result set together with the newly matched intermediate result to the next vertex, which, after receiving the two matching tables, aggregates the messages and stores them as its own vertex attribute value, namely the final query result.

3. The Pregel-based temporal T-SPARQL query and reasoning method of claim 1, wherein step (2) is query order optimization, and the implementation method comprises:

an analytical cost model is provided that combines statistics about the temporal property graph with estimates of the time spent in the different stages of a distributed execution plan to estimate the execution time of the alternative plans for a given query. Based on the number of active and matched vertices and edges, the cost model estimates the running time of each superstep of a plan. Statistics about the temporal property graph are maintained to help estimate the number of vertices and edges that match a particular query predicate.

4. The Pregel-based temporal T-SPARQL query and reasoning method of claim 1, wherein step (3) is vertex computation and message aggregation, and the implementation method comprises:

(4-1) vertex computation:

the proposed T-SPGX algorithm generates a query plan for a given T-SPARQL query; the query plan covers predicate label matching and temporal filtering, and applies different temporal filtering methods to a single temporal triple pattern and to join queries over multiple temporal triple patterns. The vertex receives the messages sent by vertices in the previous superstep, aggregates them, updates its own attribute value, generates a new message, and sends that message, together with the newly matched results, to the next vertex.

(4-2) message aggregation:

when a vertex receives messages from different source vertices, the messages are aggregated at that vertex, as illustrated by the vertex computation and message passing diagram of Figure 4.

5. The Pregel-based temporal T-SPARQL query and reasoning method of claim 1, wherein step (4) is query result reasoning, and the implementation method comprises:

the semantics of the temporal knowledge graph is obtained by extending the model-theoretic semantics of RDF graphs. The temporal RDF data is reasoned over using the subclasses and subproperties in the semantic hierarchy of temporal RDFS, implicit results are inferred from the existing explicit query results, and the query result set is expanded.