CN114661956A - Temporal T-SPARQL query and inference method based on Pregel - Google Patents


Info

Publication number
CN114661956A
CN114661956A CN202011532002.7A CN202011532002A CN114661956A CN 114661956 A CN114661956 A CN 114661956A CN 202011532002 A CN202011532002 A CN 202011532002A CN 114661956 A CN114661956 A CN 114661956A
Authority
CN
China
Prior art keywords
query
temporal
vertex
graph
message
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011532002.7A
Other languages
Chinese (zh)
Inventor
贺振宇
马宗民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics
Priority to CN202011532002.7A
Publication of CN114661956A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Pregel-based temporal T-SPARQL query and inference method that performs query processing and optimization on temporal RDF graph data using graph-parallel computation. The problem studied can be formally defined as follows: given a temporal RDF data graph G and a temporal query graph Q, find all matches of Q in G. The T-SPARQL query is converted into a function executed on vertices, communication takes place along edges by message passing, and the intermediate result set is stored in the vertex attribute. In the next iteration each vertex aggregates the messages it has received and continues matching until the iteration ends. The proposed T-SPGX algorithm generates a query plan for a given T-SPARQL query. The plan covers two aspects, predicate label matching and time information filtering, and different time filtering methods are used for a single temporal triple pattern and for join queries over multiple temporal triple patterns. Query efficiency is improved through query result optimization: the semantic hierarchy of temporal RDFS is used to reason over the temporal RDF data, implicit results are inferred from the existing explicit query results, and the query result set is expanded. The method provides a general and extensible solution for querying massive temporal RDF data.

Description

Temporal T-SPARQL query and inference method based on Pregel
Technical Field
The invention belongs to the field of distributed knowledge and semantic querying and provides a general and extensible solution for querying massive temporal RDF data. A Pregel-based temporal RDF query method is provided that performs query processing and optimization on temporal RDF graph data using graph-parallel computation. The problem studied can be formally defined as follows: given a temporal RDF data graph G and a temporal query graph Q, find all matches of Q in G. The T-SPARQL query is converted into a function executed on vertices, communication takes place along edges by message passing, and the intermediate result set is stored in the vertex attribute. In the next iteration each vertex aggregates the messages it has received and continues matching until the iteration ends.
Background
Real-world information naturally has temporal attributes. To better represent and manage time information, many researchers have proposed using RDF to represent and manage temporal data. With the explosive growth of temporal data and the development of the Semantic Web and knowledge engineering, how to query and manage temporal data has become an important research topic. However, most previous work stores temporal RDF triples in a relational database and rewrites temporal queries into T-SQL or SPARQL queries for evaluation. This approach suffers from a large number of self-join operations and from redundant data in storage, which limits performance on large-scale knowledge bases and complex queries. On the other hand, several graph-centric parallel platforms have been proposed and developed that efficiently support iterative graph computation. Together, these factors motivate storing and querying temporal RDF graph data on a distributed platform.
Every object changes continuously over time, and the temporal attribute is an important attribute for describing dynamic change as a resource evolves. The representation and querying of temporal information have long been a focus of research, and the emergence of various temporal databases and the development of temporal query languages have effectively advanced the management of temporal data. To facilitate the transmission and sharing of temporal data over networks, scholars have added temporal extensions to various data models. As RDF has become widely accepted and used as a model for semantic representation and metadata processing, temporal RDF modeling has gradually attracted attention, and the corresponding temporal extension methods have been widely studied.
An RDF model can describe a data set either as RDF triples or as an RDF graph. Temporal information can likewise be represented in three forms (time point, time interval and time set), and the time dimension distinguishes transaction time from valid time. Different scholars therefore add temporal information in different ways when defining a temporal RDF model on top of the classic RDF model. As a result, neither the form of a temporal RDF model nor the meaning of the temporal information it carries is unique. Current research on temporal RDF covers three main aspects: temporal extensions of the RDF model, temporal RDF query languages, and temporal RDF indexing schemes.
Because RDF data is graph-structured, the SPARQL query problem is essentially a subgraph matching problem. Each triple corresponds to a directed edge in the graph that leads from the subject to the object and is labeled with the predicate, and these directed edges tie the RDF entities together. A SPARQL query is a combination of a group of triple patterns that are related to one another, so evaluating SPARQL queries in conventional relational databases and data-parallel systems suffers from low query efficiency caused by a large number of self-join operations and from redundant data in storage. On the other hand, with the rise of large real-world graph data sets, several graph-centric parallel platforms have been proposed and developed to efficiently support iterative graph computation. One popular approach to designing and implementing parallel graph algorithms is vertex-centric computation, in which vertices communicate along the edges between them. Although vertex-centric programming is certainly not the only approach, the model is popular and has been adopted by many research and open-source projects. Converting SPARQL, the standard query language of the Semantic Web, into a function that executes on vertices allows a SPARQL query to be evaluated directly on the graph representation of the RDF data.
Therefore, the invention implements temporal RDF querying by graph iteration: the T-SPARQL query is converted into a function running on vertices, the predicate labels of the triples and the temporal constraint relations are matched, the query result is expanded step by step through the intermediate result set stored in the vertex attribute, and querying of temporal RDF under a distributed framework is finally achieved.
Disclosure of Invention
Purpose of the invention: real-world information naturally has temporal attributes. To better represent and manage time information, many researchers have proposed using RDF to represent and manage temporal data. With the explosive growth of temporal data and the development of the Semantic Web and knowledge engineering, how to query and manage temporal data has become an important research topic. However, most previous work stores temporal RDF triples in a relational database and rewrites temporal queries into T-SQL or SPARQL queries for evaluation. This approach suffers from a large number of self-join operations and from redundant data in storage, which limits performance on large-scale knowledge bases and complex queries. On the other hand, several graph-centric parallel platforms have been proposed and developed that efficiently support iterative graph computation. Together, these factors motivate storing and querying temporal RDF graph data on a distributed platform. Against this background, the invention provides a general and extensible solution for querying massive temporal RDF data.
Technical scheme: to achieve the above purpose, the Pregel-based temporal T-SPARQL query and inference method comprises the following steps:
(1) A constraint definition of the binary relations between time intervals is provided, and join computation over the different types of time interval relations is implemented.
(2) A subgraph matching query is performed on the temporal RDF graph data through the Pregel interface. The T-SPARQL query is converted into a function executed on vertices, communication takes place along edges by message passing, and the intermediate result set is stored in the vertex attribute. In the next iteration each vertex aggregates the messages it has received and continues matching until the iteration ends. The proposed T-SPGX algorithm generates a query plan for a given T-SPARQL query. The plan covers two aspects, predicate label matching and time information filtering, and different time filtering methods are used for a single temporal triple pattern and for join queries over multiple temporal triple patterns.
(3) Query efficiency is improved through both query order optimization and query result optimization. The order of the query statements is evaluated using a temporal histogram to obtain an optimized query order. The semantic hierarchy of temporal RDFS is used to reason over the temporal RDF data, implicit results are inferred from the existing explicit query results, and the query result set is expanded. The proposed query method has been implemented, and experimental results show that the algorithm has good query efficiency.
Beneficial effects: query processing and optimization are performed on temporal RDF graph data using graph-parallel computation. The method provides a general and extensible solution for querying massive temporal RDF data, improves query efficiency through query result optimization by means of the RDFS semantic hierarchy, infers implicit results, and expands the query result set.
Drawings
FIG. 1 is a general flow diagram of the process of the present invention.
FIG. 2 is a table of the binary relations between time intervals.
FIG. 3 is a structure diagram of the BSP model.
FIG. 4 is a diagram of vertex computation and message passing.
FIG. 5 shows the temporal inference rules.
Detailed Description
The invention will be further explained with reference to the drawings.
The general flow of the present invention is shown in FIG. 1. The sub-module processes included in the method are described in detail below with reference to FIG. 2, FIG. 3, FIG. 4 and FIG. 5. The specific implementation steps are as follows.
1 Time interval binary relation calculation
FIG. 2 shows the temporal relations defined by Allen and introduces interval separation, which can be used for comparisons in temporal join computation.
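The comparison of two time intervals can be sketched in a few lines of Python. This is an illustrative sketch only: the table of FIG. 2 is not reproduced in this text, so the relation names below are the thirteen standard Allen relations, and the function name `allen_relation` and the integer `(start, end)` representation with `start < end` are assumptions.

```python
from typing import Tuple

Interval = Tuple[int, int]  # (start, end), assumed to satisfy start < end


def allen_relation(a: Interval, b: Interval) -> str:
    """Return the name of the Allen relation that holds from interval a to interval b."""
    (s1, e1), (s2, e2) = a, b
    if e1 < s2:
        return "before"
    if e2 < s1:
        return "after"
    if e1 == s2:
        return "meets"
    if e2 == s1:
        return "met_by"
    if s1 == s2 and e1 == e2:
        return "equals"
    if s1 == s2:
        return "starts" if e1 < e2 else "started_by"
    if e1 == e2:
        return "finishes" if s1 > s2 else "finished_by"
    if s2 < s1 and e1 < e2:
        return "during"
    if s1 < s2 and e2 < e1:
        return "contains"
    # Only partial overlap remains at this point.
    return "overlaps" if s1 < s2 else "overlapped_by"
```

A temporal join can then keep a pair of edges only when `allen_relation` returns the relation requested by the query, for example `allen_relation((2010, 2015), (2012, 2018)) == "overlaps"`.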
2 BSP model
The Pregel model features parallel computation, batched message processing and a synchronization mechanism, so it can process a graph in parallel in a vertex-centric manner. The Pregel computation framework is based on the bulk synchronous parallel (BSP) model and decomposes a computation into a series of iterations called supersteps. In each superstep, a vertex program performs its local transformation and exchanges information with neighboring vertices. The BSP model requires the system to provide multiple parallel computing units and distinguishes two roles: the Master is responsible for coordinating global information, and the Workers are responsible for computation. The processing procedure of the BSP model is shown in FIG. 3.
3 Pregel interface user-defined function
The Pregel interface in GraphX is a variant of Pregel improved with reference to the GAS (Gather-Apply-Scatter) model. It refines the original Pregel model, which has only a vertex computation function, at a finer granularity so that more operations can run in parallel. The interface consists mainly of three functions: vprog, mergeMsg and sendMsg. The vprog function updates the state inside a vertex from a message, the mergeMsg function merges two messages, and the sendMsg function sends a vertex's message to its neighboring vertices.
These three functions are the core of the method, and the user can define and implement their bodies.
1) vprog function:
In plain terms: the vertex program (vprog) runs on all vertices once at initialization. After that, it runs only on vertices that have received a message, and this step is repeated until the termination condition is reached.
The user can define the function body. In the first iteration the function is applied to every vertex of the graph, and in subsequent iterations it is applied only to the vertices that received a message. The parameters of the function are the vertex attribute value and the message received by the vertex, and the user-defined function body uses the message to update the original attribute value of the vertex. The message received by a vertex is the result of applying sendMsg followed by mergeMsg.
2) sendMsg function:
In plain terms: the user can define the function body. The function is applied to each edge of the graph, and its input parameter is the edge triplet (the source vertex and its attribute, the target vertex and its attribute, and the attribute of the edge between them). From this triplet the user-defined function body produces a message, whose type the user can choose, and passes it to a vertex.
3) mergeMsg function:
In plain terms: the user can define the function body. The function is applied at each vertex of the graph. Messages reach a vertex through the sendMsg function, and mergeMsg merges two messages delivered to the same vertex. If the message type is A, the inputs of the function are two messages of type A, and the user-defined function body combines them into a single message of type A.
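How the three user-defined functions interact can be illustrated with a small single-machine loop. The sketch below is written in plain Python and only mimics the roles of vprog, sendMsg and mergeMsg; it is not the GraphX API. The function `pregel`, its parameter names, and the shortest-path example are assumptions chosen for illustration.

```python
import math
from typing import Any, Callable, Dict, Iterable, List, Tuple

Edge = Tuple[int, int, Any]  # (source id, target id, edge attribute)


def pregel(
    vertices: Dict[int, Any],
    edges: List[Edge],
    initial_msg: Any,
    vprog: Callable[[int, Any, Any], Any],
    send_msg: Callable[[int, Any, int, Any, Any], Iterable[Tuple[int, Any]]],
    merge_msg: Callable[[Any, Any], Any],
    max_iterations: int = 20,
) -> Dict[int, Any]:
    # First iteration: vprog runs on every vertex with the initial message.
    attrs = {vid: vprog(vid, attr, initial_msg) for vid, attr in vertices.items()}
    active = set(attrs)
    for _ in range(max_iterations):
        inbox: Dict[int, Any] = {}
        for src, dst, edge_attr in edges:
            # Assumption: an edge is considered when either endpoint was active.
            if src not in active and dst not in active:
                continue
            for target, msg in send_msg(src, attrs[src], dst, attrs[dst], edge_attr):
                # mergeMsg combines messages addressed to the same vertex.
                inbox[target] = merge_msg(inbox[target], msg) if target in inbox else msg
        if not inbox:
            break  # no messages were sent, so the iteration ends
        # Later iterations: vprog runs only on vertices that received a message.
        for vid, msg in inbox.items():
            attrs[vid] = vprog(vid, attrs[vid], msg)
        active = set(inbox)
    return attrs


def send_shorter_distance(src, src_dist, dst, dst_dist, weight):
    # Send a message only when it would improve the target's distance.
    if src_dist + weight < dst_dist:
        yield dst, src_dist + weight


# Hypothetical example: shortest distances from vertex 1.
distances = pregel(
    vertices={1: 0.0, 2: math.inf, 3: math.inf, 4: math.inf},
    edges=[(1, 2, 1.0), (2, 3, 2.0), (1, 3, 5.0), (3, 4, 1.0)],
    initial_msg=math.inf,
    vprog=lambda vid, dist, msg: min(dist, msg),
    send_msg=send_shorter_distance,
    merge_msg=min,
)
```

In the query method described here, the vertex attribute and the message would be intermediate result tables rather than distances, but the control flow (apply vprog, send along edges, merge per vertex, repeat until no messages remain) is the same.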
4 Temporal T-SPARQL query implementation
1) Iterative process
In each iteration, every data vertex explores one sub-query, and the matching vertices send messages carrying their values to the corresponding neighbors. In subsequent iterations, only the vertices that received a message continue to explore. Rather than solving each sub-query independently, each sub-query is explored starting from the results of the previous one. The messages exchanged between vertices carry intermediate results, so the messages of the last iteration are the final answer to the query.
2) Procedure performed on vertices:
Each vertex receives the messages sent in the previous superstep, aggregates them, updates its attribute value, generates a new message that carries the newly matched results, and sends it to the next vertex.
3) Matching of query statement:
When the query has only one triple pattern, the predicate label and the time information are matched, the vertices on the data graph that satisfy the predicate label and the time filter are bound, the matched intermediate result table is sent from the source vertex to the target vertex by message passing, and the attribute value of the target vertex is updated to this result table.
When the query has two triple patterns, an order for the query triple patterns must be determined. Consecutive triple patterns must remain connected, that is, each triple pattern must share a common variable with the one before it.
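The connectivity requirement on the query order can be checked mechanically. The following is a minimal sketch, assuming that a triple pattern is a `(subject, predicate, object)` tuple of strings and that variables are written with a leading `?`; the names `pattern_variables` and `is_connected_order` are hypothetical.

```python
from typing import List, Set, Tuple

TriplePattern = Tuple[str, str, str]  # (subject, predicate, object)


def pattern_variables(pattern: TriplePattern) -> Set[str]:
    """Collect the variables (terms starting with '?') of a triple pattern."""
    return {term for term in pattern if term.startswith("?")}


def is_connected_order(patterns: List[TriplePattern]) -> bool:
    """True when every pattern shares at least one variable with its predecessor."""
    return all(
        pattern_variables(prev) & pattern_variables(curr)
        for prev, curr in zip(patterns, patterns[1:])
    )
```

For example, the order `[("?x", "worksAt", "?y"), ("?y", "locatedIn", "?z")]` is connected through `?y`, whereas replacing `?y` in the second pattern by an unrelated variable breaks the chain.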
4) Temporal binary relation processing:
After the query order has been determined, the binary relation between time intervals must be matched and filtered whenever two triple patterns are joined. When the message of the previous triple pattern arrives at a target vertex, the target vertex matches the next query statement and checks whether the start and end times of the two edges satisfy the required time interval relation. If they do, the target vertex sends its own intermediate result set together with the newly matched intermediate result to the next vertex. After receiving the two matching tables, that vertex aggregates the messages and stores the aggregate as its own vertex attribute value, which is the final query result.
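The join of two temporal triple patterns described above can be sketched as two supersteps over a small edge list. This is a simplified single-machine illustration, not the T-SPGX implementation: the edge layout `(source, predicate, target, start, end)`, the variable names `x`, `y`, `z`, the helper `intersects`, and the sample data are all assumptions, and the final results are collected in a list instead of being stored as a vertex attribute.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

TemporalEdge = Tuple[str, str, str, int, int]  # (source, predicate, target, start, end)
Interval = Tuple[int, int]


def intersects(a: Interval, b: Interval) -> bool:
    """True when the closed intervals a and b share at least one time point."""
    return a[0] <= b[1] and b[0] <= a[1]


def two_pattern_join(
    edges: List[TemporalEdge],
    pred1: str,
    pred2: str,
    relation: Callable[[Interval, Interval], bool] = intersects,
) -> List[dict]:
    """Evaluate (?x pred1 ?y)[t1] . (?y pred2 ?z)[t2] subject to relation(t1, t2)."""
    # Superstep 1: each edge matching pred1 sends a partial binding to its target vertex.
    inbox: Dict[str, List[dict]] = defaultdict(list)
    for src, pred, dst, start, end in edges:
        if pred == pred1:
            inbox[dst].append({"x": src, "y": dst, "t1": (start, end)})
    # Superstep 2: vertices holding partial bindings match pred2 on their outgoing
    # edges and keep only the pairs whose intervals satisfy the temporal relation.
    results = []
    for src, pred, dst, start, end in edges:
        if pred != pred2:
            continue
        for partial in inbox.get(src, []):
            if relation(partial["t1"], (start, end)):
                results.append({**partial, "z": dst, "t2": (start, end)})
    return results


sample_edges = [
    ("alice", "worksAt", "acme", 2010, 2015),
    ("bob", "worksAt", "acme", 2018, 2020),
    ("acme", "locatedIn", "berlin", 2012, 2016),
    ("acme", "locatedIn", "paris", 2017, 2019),
]
matches = two_pattern_join(sample_edges, "worksAt", "locatedIn")
```

With the sample data, only the pairs whose intervals intersect survive the time filter: alice is joined with berlin and bob with paris. Any other binary interval relation can be supplied through the `relation` parameter.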
5) Message aggregation:
When a vertex receives messages from different source vertices, the messages are aggregated at that vertex, as illustrated by the vertex computation and message passing diagram of FIG. 4.
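One plausible form of this aggregation is to merge the incoming intermediate result tables pairwise. FIG. 4 is not reproduced in this text, so the sketch below is an assumption: messages are modeled as lists of binding rows (dictionaries), and `merge_tables` is a hypothetical mergeMsg-style function that concatenates two tables while dropping duplicate rows.

```python
from functools import reduce
from typing import Dict, List

BindingTable = List[Dict[str, str]]  # one dict of variable bindings per row


def merge_tables(left: BindingTable, right: BindingTable) -> BindingTable:
    """Merge two messages of the same type into one, keeping each row once."""
    merged = list(left)
    for row in right:
        if row not in merged:
            merged.append(row)
    return merged


# Messages arriving at one vertex from three different source vertices.
incoming = [
    [{"?x": "alice", "?y": "acme"}],
    [{"?x": "bob", "?y": "acme"}],
    [{"?x": "alice", "?y": "acme"}],
]
aggregated = reduce(merge_tables, incoming)
```

Because the function takes two messages of one type and returns a message of that same type, it can be applied repeatedly in any grouping, which is what a pairwise merge step requires.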
6) Candidate ranges for variables
If a variable in a triple pattern has already been mapped to a vertex, any later occurrence of that variable is restricted to the value it is bound to. If a variable in a triple pattern has not yet been mapped to any vertex, every vertex on the data graph is a candidate vertex.
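The candidate rule for variables can be written as a small lookup. This is a sketch under assumptions: variables start with `?`, a partial match is a dictionary from variable to vertex, and the handling of constant terms (which the text above does not discuss) is added only to make the function total.

```python
from typing import Dict, Iterable, Set


def candidate_vertices(term: str, binding: Dict[str, str], all_vertices: Iterable[str]) -> Set[str]:
    """Return the data vertices that a query term may still be matched to."""
    vertices = set(all_vertices)
    if not term.startswith("?"):
        # Assumption: a constant term can only match the vertex with the same name.
        return {term} & vertices
    if term in binding:
        # A variable that is already mapped is restricted to its bound value.
        return {binding[term]}
    # An unmapped variable may match any vertex of the data graph.
    return vertices
```

For example, with the partial match `{"?y": "acme"}`, the variable `?y` has the single candidate `acme`, while an unseen variable `?z` can still range over all vertices.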
5 Query order optimization
An analytical cost model is presented that combines statistics on the temporal attribute graph with estimates of the time spent in the different phases of a distributed execution plan in order to estimate the execution times of different plans for a given query. Based on the numbers of active and matched vertices and edges, the cost model estimates the runtime of each superstep of a plan. Graph statistics: statistics about the temporal attribute graph are maintained to help estimate the number of vertices and edges that match a particular query predicate.
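A simple way to picture the statistics involved is a per-predicate histogram over time buckets, from which the number of edges matching a predicate within a time window can be bounded. The sketch below is an illustration only and does not reproduce the cost model of the invention: the bucket width, the functions `build_histogram`, `estimate_matches` and `order_by_estimate`, and the sample data are assumptions, and the estimate is an upper bound on matching edges rather than a runtime prediction.

```python
from collections import Counter
from typing import List, Tuple

TemporalEdge = Tuple[str, str, str, int, int]  # (source, predicate, target, start, end)
Pattern = Tuple[str, Tuple[int, int]]          # (predicate, (window start, window end))


def build_histogram(edges: List[TemporalEdge], bucket_width: int) -> Counter:
    """Count, per predicate and time bucket, the edges whose interval touches the bucket."""
    hist: Counter = Counter()
    for _src, pred, _dst, start, end in edges:
        for bucket in range(start // bucket_width, end // bucket_width + 1):
            hist[(pred, bucket)] += 1
    return hist


def estimate_matches(hist: Counter, pattern: Pattern, bucket_width: int) -> int:
    """Upper bound on the edges with this predicate whose interval meets the window."""
    pred, (lo, hi) = pattern
    return sum(hist[(pred, b)] for b in range(lo // bucket_width, hi // bucket_width + 1))


def order_by_estimate(hist: Counter, patterns: List[Pattern], bucket_width: int) -> List[Pattern]:
    """Evaluate the most selective pattern (smallest estimate) first."""
    return sorted(patterns, key=lambda p: estimate_matches(hist, p, bucket_width))


stats_edges = [
    ("alice", "worksAt", "acme", 2010, 2015),
    ("bob", "worksAt", "acme", 2018, 2020),
    ("carol", "worksAt", "initech", 2011, 2012),
    ("acme", "locatedIn", "berlin", 2012, 2016),
]
histogram = build_histogram(stats_edges, bucket_width=5)
plan = order_by_estimate(
    histogram,
    [("worksAt", (2010, 2014)), ("locatedIn", (2010, 2014))],
    bucket_width=5,
)
```

Here the `locatedIn` pattern is placed first because the histogram bounds it at one matching edge against two for `worksAt`. A full planner would additionally have to respect the connectivity requirement between consecutive triple patterns.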
6 Temporal relationship reasoning
In the rules, the following shorthand is used: sp denotes rdfs:subPropertyOf, type denotes rdf:type, prop denotes rdf:Property, sc denotes rdfs:subClassOf, class denotes rdfs:Class, dom denotes rdfs:domain, and range denotes rdfs:range. The semantics of the temporal knowledge graph is given by extending the model-theoretic semantics of RDF graphs. The semantic hierarchy of temporal RDFS is used to reason over the temporal RDF data, implicit results are inferred from the existing explicit query results, and the query result set is expanded.
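The effect of such reasoning can be sketched with a small forward-chaining loop. The rules of FIG. 5 are not reproduced in this text, so the sketch below rests on assumptions: it applies only subclass (sc) and subproperty (sp) rules, each fact carries a closed validity interval, and a derived fact is taken to hold on the intersection of the intervals of its premises. The function names and sample facts are hypothetical.

```python
from typing import Optional, Set, Tuple

Interval = Tuple[int, int]
Fact = Tuple[str, str, str, Interval]  # (subject, predicate, object, validity interval)


def intersect(a: Interval, b: Interval) -> Optional[Interval]:
    """Intersection of two closed intervals, or None when they are disjoint."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None


def infer(facts: Set[Fact]) -> Set[Fact]:
    """Close a set of temporal facts under the sc and sp rules sketched above."""
    closure = set(facts)
    while True:
        new = set()
        for s1, p1, o1, i1 in closure:
            for s2, p2, o2, i2 in closure:
                interval = intersect(i1, i2)
                if interval is None:
                    continue
                # Transitivity: (a sc b), (b sc c) gives (a sc c); likewise for sp.
                if p1 == p2 and p1 in ("sc", "sp") and o1 == s2:
                    new.add((s1, p1, o2, interval))
                # Typing: (x type a), (a sc b) gives (x type b).
                if p1 == "type" and p2 == "sc" and o1 == s2:
                    new.add((s1, "type", o2, interval))
                # Subproperty: (x p y), (p sp q) gives (x q y).
                if p2 == "sp" and p1 == s2:
                    new.add((s1, o2, o1, interval))
        if new <= closure:
            return closure
        closure |= new


explicit = {
    ("alice", "type", "Professor", (2010, 2020)),
    ("Professor", "sc", "Employee", (2000, 2015)),
    ("Employee", "sc", "Person", (1990, 2030)),
    ("alice", "teaches", "db101", (2012, 2014)),
    ("teaches", "sp", "worksOn", (2000, 2013)),
}
expanded = infer(explicit)
```

In this example the explicit facts yield, among others, that alice is an Employee during (2010, 2015) and that alice worksOn db101 during (2012, 2013), which is how implicit results enlarge the query result set.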

Claims (5)

1. A Pregel-based temporal T-SPARQL query and inference method, characterized by comprising the following steps:
(1) Query statement matching. A subgraph matching query is performed on the temporal RDF graph data through the Pregel interface, and the T-SPARQL query is converted into a function executed on vertices.
(2) Query order optimization. The order of the query statements is evaluated using a temporal histogram to obtain an optimized query order.
(3) Vertex computation and message aggregation. Communication takes place along edges by message passing, the intermediate result set is stored in the vertex attribute, and in the next iteration each vertex aggregates the messages it has received and continues matching until the iteration ends.
(4) Query result inference. Query efficiency is improved through query result optimization, and the semantic hierarchy of temporal RDFS is used to reason over the temporal RDF data.
2. The Pregel-based temporal T-SPARQL query and inference method of claim 1, wherein step (1), query statement matching, is implemented as follows:
(2-1) Matching predicate labels:
When the query has only one triple pattern, the predicate label and the time information are matched, the vertices on the data graph that satisfy the predicate label and the time filter are bound, the matched intermediate result table is sent from the source vertex to the target vertex by message passing, and the attribute value of the target vertex is updated to this result table. When the query has two triple patterns, an order for the query triple patterns must be determined. Consecutive triple patterns must remain connected, that is, each triple pattern must share a common variable with the one before it.
(2-2) Candidate ranges for variables:
If a variable in a triple pattern has already been mapped to a vertex, any later occurrence of that variable is restricted to the value it is bound to. If a variable in a triple pattern has not yet been mapped to any vertex, every vertex on the data graph is a candidate vertex.
(2-3) Temporal binary relation processing:
After the query order has been determined, the binary relation between time intervals must be matched and filtered whenever two triple patterns are joined. When the message of the previous triple pattern arrives at a target vertex, the target vertex matches the next query statement and checks whether the start and end times of the two edges satisfy the required time interval relation. If they do, the target vertex sends its own intermediate result set together with the newly matched intermediate result to the next vertex. After receiving the two matching tables, that vertex aggregates the messages and stores the aggregate as its own vertex attribute value, which is the final query result.
3. The Pregel-based temporal T-SPARQL query and inference method of claim 1, wherein step (2), query order optimization, is implemented as follows:
An analytical cost model is presented that combines statistics on the temporal attribute graph with estimates of the time spent in the different phases of a distributed execution plan in order to estimate the execution times of different plans for a given query. Based on the numbers of active and matched vertices and edges, the cost model estimates the runtime of each superstep of a plan. Graph statistics: statistics about the temporal attribute graph are maintained to help estimate the number of vertices and edges that match a particular query predicate.
4. The Pregel-based temporal T-SPARQL query and inference method of claim 1, wherein step (3), vertex computation and message aggregation, is implemented as follows:
(4-1) Vertex computation:
The proposed T-SPGX algorithm generates a query plan for a given T-SPARQL query. The plan covers two aspects, predicate label matching and time information filtering, and different time filtering methods are used for a single temporal triple pattern and for join queries over multiple temporal triple patterns. Each vertex receives the messages sent in the previous superstep, aggregates them, updates its attribute value, generates a new message that carries the newly matched results, and sends it to the next vertex.
(4-2) Message aggregation:
When a vertex receives messages from different source vertices, the messages are aggregated at that vertex, as illustrated by the vertex computation and message passing diagram of FIG. 4.
5. The Pregel-based temporal T-SPARQL query and inference method of claim 1, wherein step (4), query result inference, is implemented as follows:
The semantics of the temporal knowledge graph is given by extending the model-theoretic semantics of RDF graphs. The temporal RDF data is reasoned over using the subclasses and subproperties in the semantic hierarchy of temporal RDFS, implicit results are inferred from the existing explicit query results, and the query result set is expanded.
CN202011532002.7A 2020-12-22 2020-12-22 Temporal T-SPARQL query and inference method based on Pregel Pending CN114661956A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011532002.7A CN114661956A (en) 2020-12-22 2020-12-22 Temporal T-SPARQL query and inference method based on Pregel

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011532002.7A CN114661956A (en) 2020-12-22 2020-12-22 Temporal T-SPARQL query and inference method based on Pregel

Publications (1)

Publication Number Publication Date
CN114661956A true CN114661956A (en) 2022-06-24

Family

ID=82025055

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011532002.7A Pending CN114661956A (en) 2020-12-22 2020-12-22 Temporal T-SPARQL query and inference method based on Pregel

Country Status (1)

Country Link
CN (1) CN114661956A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115422233A (en) * 2022-11-03 2022-12-02 中国地质大学(武汉) Complex space RDF query parallel processing method and device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115422233A (en) * 2022-11-03 2022-12-02 中国地质大学(武汉) Complex space RDF query parallel processing method and device
CN115422233B (en) * 2022-11-03 2023-02-24 中国地质大学(武汉) Complex space RDF query parallel processing method and device

Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20220624