CN118113883A - RDF knowledge graph data partitioning and distributed query method - Google Patents


Info

Publication number
CN118113883A
Authority
CN
China
Prior art keywords
query
graph
sub
pattern
partition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410278786.7A
Other languages
Chinese (zh)
Inventor
彭鹏
胡喆媛
邹磊
郭嘉丰
程学旗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Hunan University
Institute of Computing Technology of CAS
Original Assignee
Peking University
Hunan University
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University, Hunan University, Institute of Computing Technology of CAS
Priority to CN202410278786.7A
Publication of CN118113883A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an RDF knowledge graph data partitioning and distributed query method. It proposes minimum pattern cut, a novel workload-aware partitioning technique for RDF graphs, and then decomposes a query into a set of independently executable sub-queries based on internal patterns. For a given workload, this minimizes the number of patterns that cross partitions and the number of queries that require inter-partition joins, which significantly improves query execution performance.

Description

RDF knowledge graph data partitioning and distributed query method
Technical Field
The invention relates to the field of knowledge graphs, and in particular to an RDF knowledge graph data partitioning and distributed query method.
Background
The Resource Description Framework (RDF) is a generic model that describes relationships between entities as a set of triples, each denoted (subject, predicate, object). An RDF dataset can be represented as a directed edge-labeled graph, in which subjects and objects are vertices and each triple is an edge whose label is the attribute (predicate) name. SPARQL is the query language defined for retrieving and manipulating RDF graphs. The basic building block of SPARQL is the triple pattern, a triple that may contain variables. A basic graph pattern (BGP) query consists of a finite number of triple patterns and can itself be represented as a query graph with variables. Answering a BGP query is therefore equivalent to finding the subgraph matches (under homomorphism) of the query graph in the RDF graph.
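To make the subgraph-matching view concrete, the following minimal sketch evaluates a BGP over a small in-memory triple list by backtracking. The sample data, the convention that terms starting with `?` are variables, and the function name `match_bgp` are illustrative assumptions, not part of the invention.

```python
def match_bgp(patterns, triples, binding=None):
    """Yield every variable binding under which all triple patterns match.

    Terms starting with '?' are treated as variables (an assumption of this
    sketch). Matching is homomorphic: two different variables may bind to
    the same RDF term.
    """
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        new = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check consistency with an earlier binding.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match_bgp(rest, triples, new)


# Hypothetical data graph and query.
triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("bob", "worksAt", "PKU"),
]
query = [("?x", "knows", "?y"), ("?y", "worksAt", "?z")]
results = list(match_bgp(query, triples))
```

A real system would use indexes rather than scanning every triple for every pattern; the sketch only illustrates what a match is.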
As RDF datasets grow, the performance limits of managing and querying RDF data on a single machine become apparent, which has stimulated interest in distributed solutions. The present invention focuses on optimizing specialized distributed RDF systems, which are built specifically for SPARQL query evaluation by integrating multiple centralized RDF systems at different sites with a custom physical layout. Such systems are widely used because of their high efficiency. In this setting, an RDF graph G is partitioned into a set of subgraphs, called partitions, which are distributed across a set of computing nodes, called sites. If a query matches data spanning multiple partitions, inter-partition joins are needed to compute the results.
Many studies have addressed how to partition RDF graphs. They fall into two categories: workload-agnostic and workload-aware.
Workload-agnostic approaches ignore the query workload and consider only the RDF data graph. Ignoring the workload may miss opportunities to optimize for query patterns that occur frequently in it. Indeed, analysis of LSQ, a well-known collection of real SPARQL workloads, shows that a small number of frequent query patterns cover most SPARQL queries in the workload.
Workload-aware methods typically mine a set of frequent query patterns and partition the RDF graph so that the graph portions matching frequent patterns are replicated across multiple partitions. A key assumption of these approaches is that inter-partition joins can be avoided when a query is isomorphic to one of the frequent query patterns. Replication adds space overhead and consistency-maintenance overhead, because the replicas must be maintained as the RDF graph is updated.
The method proposed in the present invention is workload-aware, but it does not replicate data. Instead, it follows the vertex-disjoint approach common in workload-agnostic partitioning, in which each vertex is assigned to exactly one partition. In vertex-disjoint partitioning, certain edges are "cut" between partitions, resulting in inter-partition joins. Approaches such as 1-hop replication can reduce these joins by replicating the endpoints of each cut edge in the two partitions of its endpoints, and the space overhead of 1-hop replication is much smaller than that of replicating frequent query pattern matches. An edge (triple) between two vertices in the same partition is an internal edge, while an edge connecting two vertices in different partitions is a crossing edge.
The traditional objective of vertex-disjoint partitioning is to minimize the number of cut edges while balancing partition sizes, which is referred to as minimum edge cut. The expectation is that fewer crossing edges produce fewer crossing matches. This expectation is reasonable, but it does not in itself eliminate or reduce inter-partition joins. An alternative is to minimize the number of distinct attributes that are cut, even at the cost of more cut edges. The resulting technique, called minimum attribute cut, is workload-agnostic and has the significant advantage that a larger set of SPARQL queries can be executed independently.
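The difference between the two objectives can be illustrated with a small sketch that counts, for a given vertex-to-partition assignment, the crossing edges and the distinct attributes among them. The graph, the two assignments, and the name `cut_metrics` are hypothetical and chosen only for illustration.

```python
def cut_metrics(triples, part_of):
    """Return (number of crossing edges, set of attributes on crossing edges)."""
    crossing = [(s, p, o) for s, p, o in triples if part_of[s] != part_of[o]]
    return len(crossing), {p for _, p, _ in crossing}


# Hypothetical RDF graph with two attributes.
triples = [
    ("a", "knows", "b"),
    ("a", "cites", "c"),
    ("b", "cites", "c"),
    ("b", "cites", "d"),
]

# Assignment X cuts fewer edges but two distinct attributes.
part_x = {"a": 0, "b": 1, "c": 1, "d": 1}
# Assignment Y cuts more edges but only the attribute "cites".
part_y = {"a": 0, "b": 0, "c": 1, "d": 1}

edges_x, attrs_x = cut_metrics(triples, part_x)
edges_y, attrs_y = cut_metrics(triples, part_y)
```

Under assignment Y, any query that uses only the attribute `knows` touches no crossing edge, which is the intuition behind preferring fewer cut attributes over fewer cut edges.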
Under both minimum edge cut and minimum attribute cut, it cannot be determined before execution starts that a query Q has no crossing matches. Consequently, even if Q turns out to have no crossing matches during execution, the inter-partition join cannot be skipped, so executing Q still involves inter-partition joins in a minimum edge cut or minimum attribute cut partitioning.
If the workload is taken into account, the class of independently executable queries can be extended and more inter-partition joins can be avoided. Otherwise, if frequent queries in the workload follow a non-star pattern that contains crossing attributes, the overall performance of the workload is poor.
Disclosure of Invention
The invention focuses on a partitioning technique that considers both workload characteristics and the RDF graph structure. It proposes minimum pattern cut as a novel workload-aware RDF graph partitioning technique and then decomposes a query into a set of independently executable sub-queries based on internal patterns, so that, for a given workload, the number of patterns crossing partitions is minimized and the number of queries requiring inter-partition joins is minimized, which significantly improves query execution performance.
Specifically, based on workload characteristics and the RDF graph structure, the invention proposes minimum pattern cut as a novel workload-aware RDF graph partitioning technique whose result is a set of partitions. A query is then decomposed into a set of independently executable sub-queries based on internal patterns, and the sub-queries are executed in parallel on the partitions to obtain matching results. For a given workload, the number of patterns crossing partitions and the number of queries requiring inter-partition joins are thereby minimized, improving query execution performance.
According to a first aspect of the present invention, there is provided an RDF knowledge graph data partitioning and distributed query method, comprising:
Step 1: Given an RDF knowledge graph data graph G, a workload W, and its corresponding pattern set M, partition G into n subgraph partitions P = {G_1, G_2, …, G_i, …, G_n} by minimum pattern cut so that preset conditions are satisfied.
The workload W comprises a plurality of queries on graph G. For any query Q, a corresponding pattern m is generated, and the patterns of all queries in W form the pattern set M.
The preset conditions are: the number of patterns in the internal pattern set M_in, i.e., |M_in|, is maximized, and |V_i| ≤ (1+ε) × |V| / n holds for each subgraph partition G_i, where |V_i| is the number of vertices of G_i, |V| is the number of vertices of the original graph G, and ε is the user-defined maximum imbalance ratio of the partitions.
The minimum pattern cut partitioning in Step 1 comprises Steps 1.1 to 1.5.
Step 1.1: Select the internal pattern set M_in with a greedy algorithm based on weakly connected components. M_in is initially empty and the candidate pattern set M' is initially the pattern set M. M' is traversed, and Step 1.1.1 is executed once for each candidate pattern m_op read from M'.
Step 1.1.1: For each m_op read, compute WCC(G[M_in ∪ {m_op}]), where WCC(·) denotes the set of weakly connected components of a set of graphs and G[M_in ∪ {m_op}] is the set of induced subgraphs of G based on M_in ∪ {m_op}. Then compute the candidate cost Cost(M_in ∪ {m_op}) from the component sizes |c|, where c is the vertex set of any weakly connected component in WCC(G[M_in ∪ {m_op}]) and |c| is the number of vertices in c.
Step 1.2: From the candidate costs obtained in each traversal of M', take the minimum candidate cost that satisfies Cost(M_in ∪ {m_op}) ≤ (1+ε) × |V| / n, add the corresponding m_op to M_in, and delete it from M'. Repeat the traversal of M' until M' is empty or no candidate pattern satisfies the cost bound, yielding the optimal internal pattern set M_in.
Step 1.3: Based on M_in, generate the set G[M_in] of induced subgraphs of G, and coarsen each induced subgraph in G[M_in] into a super-vertex. Then repeatedly coarsen any two super-vertices s1 and s2 that satisfy the super-vertex coarsening condition into one super-vertex until no further coarsening is possible, converting graph G into a coarsened graph G_c.
The super-vertex coarsening condition is that the two super-vertices have the largest similarity and satisfy |c(s1)| + |c(s2)| ≤ (1+ε) × |V| / n, where |c(s1)| and |c(s2)| are the numbers of vertices of the weakly connected components in WCC(G[M_in]) corresponding to s1 and s2. The similarity of two super-vertices is computed from IM(s1) and IM(s2), the internal pattern sets corresponding to s1 and s2.
Step 1.4: Partition the coarsened graph G_c into n partitions with a vertex-disjoint partitioning algorithm, and uncoarsen each partition of the coarsened graph to obtain the n subgraph partitions P of graph G. Uncoarsening is the process of restoring each super-vertex to the induced subgraph it represented before coarsening.
Each internal pattern m_in in M_in corresponds to one or more subgraph partitions, and m_in is converted into a sequence code by DFS coding.
Step 1.5: The querying user builds a hash table at the user side to store the sequence codes of the patterns in M_in, the number of partitions n, and the IP addresses of the nodes hosting the subgraph partitions. The key of the hash table is the code of m_in, and the value is the IP address(es) of the node(s) hosting the one or more subgraph partitions corresponding to m_in.
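Steps 1.1 and 1.2 can be sketched as a greedy loop. The cost formula is not reproduced in this text, so the sketch ASSUMES that the candidate cost is the size of the largest weakly connected component, which is consistent with the bound (1+ε) × |V| / n but is not confirmed by the source. The function names and the representation of each pattern's induced subgraph as a list of (u, v) edges are also illustrative assumptions.

```python
def wcc_sizes(edges):
    """Sizes of the weakly connected components spanned by `edges` (union-find)."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)
    sizes = {}
    for v in parent:
        root = find(v)
        sizes[root] = sizes.get(root, 0) + 1
    return list(sizes.values())


def select_internal_patterns(induced, num_vertices, n, eps):
    """Greedy selection of internal patterns.

    induced: dict mapping a pattern name to the edges of its induced subgraph.
    ASSUMPTION: cost = size of the largest weakly connected component.
    """
    limit = (1 + eps) * num_vertices / n
    chosen, candidates = [], set(induced)
    while candidates:
        best = None
        for m in sorted(candidates):  # sorted only to make ties deterministic
            edges = [e for p in chosen + [m] for e in induced[p]]
            cost = max(wcc_sizes(edges), default=0)
            if cost <= limit and (best is None or cost < best[0]):
                best = (cost, m)
        if best is None:  # no remaining candidate satisfies the bound
            break
        chosen.append(best[1])
        candidates.remove(best[1])
    return chosen


# Hypothetical induced subgraphs over a graph with 8 vertices, 2 partitions.
induced = {
    "m1": [(1, 2), (3, 4)],
    "m2": [(2, 3), (4, 5)],  # would glue m1's components into one of size 5
    "m3": [(7, 8)],
}
selected = select_internal_patterns(induced, num_vertices=8, n=2, eps=0.0)
```

In this example the bound is 4 vertices, so `m2` is rejected because adding it would create a component of 5 vertices.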
Step 2: For a query Q issued at the user's local side, perform distributed query processing to obtain the matching result set MS(Q, G) of Q on graph G. Distributed query processing comprises: determining whether Q is an independently executable query; if it is, executing Step 2.1; otherwise, executing Step 2.2.
Independently executable queries comprise first-type, second-type, and third-type independently executable queries, and a query belonging to none of the three types is a non-independently executable query.
Determining whether query Q is an independently executable query comprises: generating the pattern m corresponding to Q. Ignoring vertex constants, if m is one m_in or a combination of several m_in in M_in, so that Q is isomorphic to m, then Q is a first-type independently executable query. If Q has a subgraph Q' that is a first-type independently executable query and, for every edge e in E(Q) \ E(Q'), both endpoints of e belong to V(Q'), then Q is a second-type independently executable query. If Q has a subgraph Q' that is a first-type independently executable query, every edge e in E(Q) \ E(Q') has at least one endpoint in V(Q'), and at least one edge e in E(Q) \ E(Q') has one endpoint in V(Q') and the other endpoint outside V(Q'), then Q is a third-type independently executable query. Here E(Q) and E(Q') are the edge sets of Q and Q', V(Q') is the vertex set of Q', and E(Q) \ E(Q') denotes the edges of E(Q) that do not belong to E(Q').
Step 2.1: Locate query Q to its corresponding partitions: look up the locally stored hash table using the DFS code of the pattern m corresponding to Q, and take the intersection of the partitions corresponding to pattern m as the partitions corresponding to Q. Execute the matching process on the partitions corresponding to Q, and take the union of the matching results over all these partitions as the matching result set MS(Q, G) of Q.
Step 2.2: Decompose query Q into a plurality of sub-queries q based on the internal pattern set M_in. For each sub-query q, look up the locally stored hash table using the DFS code of its pattern m_q, and take the intersection of the partitions corresponding to m_q as the partitions corresponding to q. Execute the matching process of q on those partitions, take the union of the matching results of the sub-query over all its partitions as its matching result, and then join the matching results of the sub-queries to obtain the matching result set MS(Q, G) of Q.
Step 2.2 comprises: to decompose query Q into sub-queries q, first remove from Q the crossing edges that belong to no internal pattern, and store the removed crossing edges in a set. After all edges in this set have been removed from Q, the remaining weakly connected components, which contain only internal-pattern edges, are taken as sub-queries {q1, q2, …, qx}, and the partitions of each sub-query are determined from its corresponding internal pattern. For each edge vi-vj in the set of removed crossing edges, where vi and vj are its endpoints, let q(vi) and q(vj) be the sub-queries containing vi and vj, respectively, and let |q(vi)| and |q(vj)| be their numbers of vertices. If |q(vi)| = |q(vj)|, the edge vi-vj is added to q(vi), making q(vi) a second-type independently executable query; otherwise, the edge is added to whichever of q(vi) and q(vj) has more vertices. After all edges in the set have been traversed, the sub-queries among {q1, q2, …, qx} that contain more than one vertex are taken as the result of decomposing Q.
In step 2.1 and step 2.2, the matching process on the different partitions is performed in parallel.
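The three query types defined above differ only in how the edges outside the first-type subgraph Q' attach to it. The following sketch classifies a query under the assumption that Q' has already been identified by some pattern matcher, which is not shown; the function name `classify_ieq` and the sample queries are hypothetical.

```python
def classify_ieq(query_edges, core_edges):
    """Return 1, 2, or 3 for the type of independently executable query,
    or None if the query is not independently executable with respect to
    the given first-type subgraph Q' (core_edges).

    ASSUMPTION: core_edges was already verified to be isomorphic to an
    internal pattern or a combination of internal patterns.
    """
    query_edges, core_edges = set(query_edges), set(core_edges)
    if not core_edges:
        return None
    core_vertices = {v for s, _, o in core_edges for v in (s, o)}
    extra = query_edges - core_edges            # E(Q) \ E(Q')
    if not extra:
        return 1
    # Number of endpoints (0, 1, or 2) of each extra edge that lie in V(Q').
    touching = [(s in core_vertices) + (o in core_vertices) for s, _, o in extra]
    if all(t == 2 for t in touching):
        return 2
    if all(t >= 1 for t in touching):           # at least one edge has t == 1
        return 3
    return None


core = [("?x", "knows", "?y"), ("?y", "knows", "?z")]
q_first = list(core)
q_second = core + [("?z", "likes", "?x")]       # both endpoints inside Q'
q_third = core + [("?z", "worksAt", "?w")]      # one endpoint outside Q'
q_other = core + [("?u", "cites", "?w")]        # edge detached from Q'
```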
Further, in the method provided by the invention, the pattern m corresponding to Q is generated by deleting the degree-1 vertices of Q and replacing all constants at subject and object positions in Q with variables, the constants including literals and URIs.
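Pattern generation as just described can be sketched in a few lines. The sketch assumes that degree-1 vertices are removed in a single pass (the text does not say whether removal is repeated), that any term not starting with `?` is a constant, and that fresh variable names of the form `?c0`, `?c1`, … do not clash with variables already in the query; the function name `generate_pattern` is hypothetical.

```python
from collections import Counter


def generate_pattern(query_edges):
    """Drop edges incident to degree-1 vertices, then generalize constants."""
    degree = Counter(v for s, _, o in query_edges for v in (s, o))
    kept = [(s, p, o) for s, p, o in query_edges
            if degree[s] > 1 and degree[o] > 1]
    names = {}

    def as_variable(term):
        if term.startswith("?"):
            return term
        # The same constant always maps to the same fresh variable.
        return names.setdefault(term, f"?c{len(names)}")

    return [(as_variable(s), p, as_variable(o)) for s, p, o in kept]


# Hypothetical query: "?z" and the literal "Alice" have degree 1.
query = [
    ("?x", "knows", "?y"),
    ("?y", "knows", "<bob>"),
    ("<bob>", "worksAt", "?z"),
    ("?x", "name", '"Alice"'),
]
pattern = generate_pattern(query)
```

Predicates are left untouched because only subject and object constants are replaced.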
Further, for an internal pattern m_in in the internal pattern set M_in, every match μ_in in its matching result set MS(m_in, G) on graph G is a completely internal match, i.e., all edges in μ_in are internal edges of a single partition G_i.
Further, for a pattern m in M, the induced subgraph of G based on m, denoted G[m], is the subgraph consisting of all edges in ∪_{μ ∈ MS(m, G)} E(μ), where E(μ) is the set of edges of G matched by μ. The induced subgraph set of G based on M, denoted G[M], is the set of all G[m] for m in M.
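The two definitions above, internal patterns and pattern-induced subgraphs, can be checked mechanically once the matches of a pattern are known. In the sketch below each match is given as a list of data edges (s, p, o), and `part_of` maps each vertex to its partition; both representations and the function names are assumptions made for illustration.

```python
def induced_subgraph(matches):
    """G[m]: the union of the data edges used by any match of one pattern."""
    return {edge for match in matches for edge in match}


def is_internal_pattern(matches, part_of):
    """True if every match lies entirely inside a single partition."""
    for match in matches:
        parts = {part_of[v] for s, _, o in match for v in (s, o)}
        if len(parts) > 1:
            return False
    return True


# Hypothetical matches of one pattern (a 2-hop "knows" path).
matches_m1 = [
    [("a", "knows", "b"), ("b", "knows", "c")],
    [("d", "knows", "e"), ("e", "knows", "f")],
]
good = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
bad = dict(good, c=1)  # vertex c moved: the first match now crosses partitions
```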
According to a second aspect of the present invention, there is provided a computer device comprising: a memory for storing instructions; and a processor for invoking the instructions stored in the memory to perform the method of the first aspect.
According to a third aspect of the present invention there is provided a computer readable storage medium storing instructions which, when executed by a processor, perform the method of the first aspect.
Compared with the prior art, the technical scheme of the invention has at least the following beneficial effects:
We propose a minimum pattern cut partitioning strategy that takes workload characteristics into account and maximizes the number of queries in the workload that can be executed independently. Patterns are generated from the (sub-)queries of the workload by deleting degree-1 vertices and replacing all subject and object constants (strings and URIs) with variables. The partitioning goal is to minimize the number of queries that require inter-partition joins, which translates into minimizing the number of crossing patterns in the workload (i.e., maximizing the number of internal patterns). Inter-partition joins involve costly communication and computation, and minimum pattern cut significantly reduces this overhead.
Compared with existing vertex-disjoint partitioning methods, minimum pattern cut avoids more inter-partition joins and thus significantly improves distributed SPARQL query processing. Minimum pattern cut has the same design objective as minimum attribute cut: maximizing the number of independently executable queries, i.e., queries that can be evaluated without inter-partition joins. However, minimum pattern cut is workload-aware and maximizes the number of independently executable queries in a given workload, whereas minimum attribute cut is workload-agnostic and considers only the graph structure. By taking the workload into account, minimum pattern cut achieves better query execution performance. For example, when frequent queries in the workload follow a non-star pattern, they involve inter-partition joins under minimum attribute cut but can be executed independently within partitions under minimum pattern cut, avoiding inter-partition joins.
Minimum pattern cut also addresses the data localization problem. Only a small amount of data, namely the patterns in DFS code format and their corresponding partitions, is stored locally, so that pattern-related queries can be located directly to the relevant partitions and executed there independently and in parallel. Locating queries in this way filters out irrelevant partitions, which reduces the number of remote requests and improves overall throughput and query performance. Minimum pattern cut coarsens some non-maximal weakly connected components in the induced subgraphs of all internal patterns, which helps ensure that all matches of some internal patterns fall into a few partitions. Minimum pattern cut further extends the set of queries that can be executed independently, and the query processing method based on it decomposes a query into sub-queries that can each be located to the relevant sites.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is an example of a pattern shown according to an exemplary embodiment.
FIG. 2 is an example of a pattern-induced subgraph, shown according to an exemplary embodiment.
FIG. 3 is an example RDF graph that is illustrated according to an exemplary embodiment.
FIG. 4 is an example of a coarsened graph and its partitions, shown according to an exemplary embodiment.
FIG. 5 is an example of a further-coarsened graph and its partitions, shown according to an exemplary embodiment.
FIG. 6 is an example of the result of uncoarsening the further-coarsened graph, i.e., the RDF graph partitioning result, shown according to an exemplary embodiment.
FIG. 7 is an example workload shown according to an exemplary embodiment.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
The present invention proposes a new workload-aware RDF graph partitioning scheme, called minimum pattern cut (MMC), that aims to maximize the number of independently executable queries (IEQs) in a given workload. We first define patterns in the workload and translate the problem of maximizing the number of independently executable queries into the problem of maximizing the number of internal patterns. Then, to find the optimal minimum pattern cut partitioning, we propose two coarsening-based greedy algorithms that iteratively select internal patterns and coarsen the induced subgraphs of their matches, placing more matches of the same internal pattern into the same partition for data localization. Finally, an input query is decomposed, based on the patterns, into sub-queries belonging to three extended types of independently executable queries. The decomposed sub-queries are located to their relevant partitions and executed in parallel, and their matches are joined to form the final result.
The invention provides an RDF knowledge graph data partitioning and distributed query method, which comprises the following steps:
Step 1: Given an RDF knowledge graph data graph G, a workload W, and its corresponding pattern set M, partition G into n subgraph partitions P = {G_1, G_2, …, G_i, …, G_n} by minimum pattern cut so that preset conditions are satisfied.
The workload W comprises a plurality of queries on graph G. For any query Q, a corresponding pattern m is generated, and the patterns of all queries in W form the pattern set M.
The preset conditions are: the number of patterns in the internal pattern set M_in, i.e., |M_in|, is maximized, and |V_i| ≤ (1+ε) × |V| / n holds for each subgraph partition G_i, where |V_i| is the number of vertices of G_i, |V| is the number of vertices of the original graph G, and ε is the user-defined maximum imbalance ratio of the partitions.
The minimum pattern cut partitioning in Step 1 comprises Steps 1.1 to 1.5.
Step 1.1: Select the internal pattern set M_in with a greedy algorithm based on weakly connected components. M_in is initially empty and the candidate pattern set M' is initially the pattern set M. M' is traversed, and Step 1.1.1 is executed once for each candidate pattern m_op read from M'.
Step 1.1.1: For each m_op read, compute WCC(G[M_in ∪ {m_op}]), where WCC(·) denotes the set of weakly connected components of a set of graphs and G[M_in ∪ {m_op}] is the set of induced subgraphs of G based on M_in ∪ {m_op}. Then compute the candidate cost Cost(M_in ∪ {m_op}) from the component sizes |c|, where c is the vertex set of any weakly connected component in WCC(G[M_in ∪ {m_op}]) and |c| is the number of vertices in c.
Step 1.2: From the candidate costs obtained in each traversal of M', take the minimum candidate cost that satisfies Cost(M_in ∪ {m_op}) ≤ (1+ε) × |V| / n, add the corresponding m_op to M_in, and delete it from M'. Repeat the traversal of M' until M' is empty or no candidate pattern satisfies the cost bound, yielding the optimal internal pattern set M_in.
Step 1.3: Based on M_in, generate the set G[M_in] of induced subgraphs of G, and coarsen each induced subgraph in G[M_in] into a super-vertex. Then repeatedly coarsen any two super-vertices s1 and s2 that satisfy the super-vertex coarsening condition into one super-vertex until no further coarsening is possible, converting graph G into a coarsened graph G_c.
The super-vertex coarsening condition is that the two super-vertices have the largest similarity and satisfy |c(s1)| + |c(s2)| ≤ (1+ε) × |V| / n, where |c(s1)| and |c(s2)| are the numbers of vertices of the weakly connected components in WCC(G[M_in]) corresponding to s1 and s2. The similarity of two super-vertices is computed from IM(s1) and IM(s2), the internal pattern sets corresponding to s1 and s2.
Step 1.4: Partition the coarsened graph G_c into n partitions with a vertex-disjoint partitioning algorithm, and uncoarsen each partition of the coarsened graph to obtain the n subgraph partitions P of graph G. Uncoarsening is the process of restoring each super-vertex to the induced subgraph it represented before coarsening.
Each internal pattern m_in in M_in corresponds to one or more subgraph partitions, and m_in is converted into a sequence code by DFS coding.
Step 1.5: The querying user builds a hash table at the user side to store the sequence codes of the patterns in M_in, the number of partitions n, and the IP addresses of the nodes hosting the subgraph partitions. The key of the hash table is the code of m_in, and the value is the IP address(es) of the node(s) hosting the one or more subgraph partitions corresponding to m_in.
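The client-side hash table of Step 1.5, together with the partition-intersection lookup used later in Steps 2.1 and 2.2, can be sketched as an ordinary dictionary. DFS coding itself is not implemented here: the keys below are placeholder strings standing in for the DFS codes, and the IP addresses, the table contents, and the function name `locate` are hypothetical.

```python
# Key: (placeholder for) the DFS code of an internal pattern m_in.
# Value: addresses of the nodes hosting the partitions that contain its matches.
pattern_table = {
    "code(m1)": {"10.0.0.1"},
    "code(m2)": {"10.0.0.1", "10.0.0.2"},
    "code(m3)": {"10.0.0.3"},
}


def locate(pattern_codes, table):
    """Return the sites to contact for a (sub-)query made of the given patterns.

    The result is the intersection of the site sets of all patterns, as in
    Step 2.1. None is returned if some code is not an internal pattern, in
    which case the query would have to be decomposed first (Step 2.2).
    """
    if not pattern_codes or any(code not in table for code in pattern_codes):
        return None
    return set.intersection(*(table[code] for code in pattern_codes))


sites_single = locate(["code(m2)"], pattern_table)
sites_combined = locate(["code(m1)", "code(m2)"], pattern_table)
sites_unknown = locate(["code(mx)"], pattern_table)
```

Because only pattern codes and node addresses are stored, the table stays small relative to the data graph, which matches the data localization argument made in the description.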
Step 2: For a query Q issued at the user's local side, perform distributed query processing to obtain the matching result set MS(Q, G) of Q on graph G. Distributed query processing comprises: determining whether Q is an independently executable query; if it is, executing Step 2.1; otherwise, executing Step 2.2.
Independently executable queries comprise first-type, second-type, and third-type independently executable queries, and a query belonging to none of the three types is a non-independently executable query.
Determining whether query Q is an independently executable query comprises: generating the pattern m corresponding to Q. Ignoring vertex constants, if m is one m_in or a combination of several m_in in M_in, so that Q is isomorphic to m, then Q is a first-type independently executable query. If Q has a subgraph Q' that is a first-type independently executable query and, for every edge e in E(Q) \ E(Q'), both endpoints of e belong to V(Q'), then Q is a second-type independently executable query. If Q has a subgraph Q' that is a first-type independently executable query, every edge e in E(Q) \ E(Q') has at least one endpoint in V(Q'), and at least one edge e in E(Q) \ E(Q') has one endpoint in V(Q') and the other endpoint outside V(Q'), then Q is a third-type independently executable query. Here E(Q) and E(Q') are the edge sets of Q and Q', V(Q') is the vertex set of Q', and E(Q) \ E(Q') denotes the edges of E(Q) that do not belong to E(Q').
Step 2.1: Locate query Q to its corresponding partitions: look up the locally stored hash table using the DFS code of the pattern m corresponding to Q, and take the intersection of the partitions corresponding to pattern m as the partitions corresponding to Q. Execute the matching process on the partitions corresponding to Q, and take the union of the matching results over all these partitions as the matching result set MS(Q, G) of Q.
Step 2.2: Decompose query Q into a plurality of sub-queries q based on the internal pattern set M_in. For each sub-query q, look up the locally stored hash table using the DFS code of its pattern m_q, and take the intersection of the partitions corresponding to m_q as the partitions corresponding to q. Execute the matching process of q on those partitions, take the union of the matching results of the sub-query over all its partitions as its matching result, and then join the matching results of the sub-queries to obtain the matching result set MS(Q, G) of Q.
Step 2.2 comprises: to decompose query Q into sub-queries q, first remove from Q the crossing edges that belong to no internal pattern, and store the removed crossing edges in a set. After all edges in this set have been removed from Q, the remaining weakly connected components, which contain only internal-pattern edges, are taken as sub-queries {q1, q2, …, qx}, and the partitions of each sub-query are determined from its corresponding internal pattern. For each edge vi-vj in the set of removed crossing edges, where vi and vj are its endpoints, let q(vi) and q(vj) be the sub-queries containing vi and vj, respectively, and let |q(vi)| and |q(vj)| be their numbers of vertices. If |q(vi)| = |q(vj)|, the edge vi-vj is added to q(vi), making q(vi) a second-type independently executable query; otherwise, the edge is added to whichever of q(vi) and q(vj) has more vertices. After all edges in the set have been traversed, the sub-queries among {q1, q2, …, qx} that contain more than one vertex are taken as the result of decomposing Q.
In step 2.1 and step 2.2, the matching process on the different partitions is performed in parallel.
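The decomposition in step 2.2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the function name `decompose_query`, the representation of a query as a list of `(vi, vj)` vertex pairs without edge labels, and the precomputed `internal_edges` argument are all hypothetical. The sketch also assumes that a sub-query's vertex count includes endpoints brought in by crossing edges already added to it, which the text does not specify.

```python
def decompose_query(edges, internal_edges):
    """Split a query into sub-queries, following the step 2.2 description.

    edges          -- list of (vi, vj) query edges (labels omitted)
    internal_edges -- edges assumed to belong to some internal pattern
    """
    internal = set(internal_edges)
    crossing = [e for e in edges if e not in internal]

    # Union-find over query vertices; only internal edges merge vertices,
    # so each root identifies one weakly connected component.
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for vi, vj in edges:
        find(vi)
        find(vj)
    for vi, vj in edges:
        if (vi, vj) in internal:
            parent[find(vi)] = find(vj)

    # One sub-query per component (singletons included for now).
    subqueries = {}
    for v in list(parent):
        entry = subqueries.setdefault(find(v), {"vertices": set(), "edges": []})
        entry["vertices"].add(v)
    for vi, vj in edges:
        if (vi, vj) in internal:
            subqueries[find(vi)]["edges"].append((vi, vj))

    # Reattach each crossing edge: to q(vi) on a tie, else to the larger one.
    for vi, vj in crossing:
        qi, qj = subqueries[find(vi)], subqueries[find(vj)]
        target = qi if len(qi["vertices"]) >= len(qj["vertices"]) else qj
        target["edges"].append((vi, vj))
        target["vertices"].update((vi, vj))

    # Keep only sub-queries with more than one vertex.
    return [q for q in subqueries.values() if len(q["vertices"]) > 1]
```

The matching results of the returned sub-queries would still have to be joined on their shared vertices, as step 2.2 describes; that join is outside this sketch.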
To generate the pattern m corresponding to a query Q, delete the degree-one vertices of Q and replace all constants in the subject and object positions of Q with variables; the constants include literals and URIs. A pattern defined in this way generalizes a specific query in the workload: it is an abstract query template, and the query is an instance of the pattern. Distributed query execution efficiency is improved by keeping patterns from being split across different partitions as far as possible, that is, by maximizing the internal pattern set, which reduces the inter-partition joins of distributed queries for the given workload.
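As an illustration of this generalization, the following Python sketch derives a pattern from a query given as `(subject, predicate, object)` triples. The function name `to_pattern`, the convention that variables start with `?`, and the fresh variable names `?c0`, `?c1`, … are assumptions made for the example. The sketch also assumes that degree-one vertices are removed in a single pass, together with their incident edges.

```python
from collections import Counter


def to_pattern(triples):
    """Generalize a query into a pattern (hypothetical helper).

    triples -- list of (subject, predicate, object) strings; terms that
               start with '?' are variables, everything else is a constant.
    """
    # Vertex degree counts subject and object occurrences.
    degree = Counter()
    for s, _, o in triples:
        degree[s] += 1
        degree[o] += 1

    # Drop degree-one vertices, which removes their incident edges.
    kept = [(s, p, o) for s, p, o in triples if degree[s] > 1 and degree[o] > 1]

    # Replace each distinct constant with its own fresh variable.
    fresh = {}

    def generalize(term):
        if term.startswith("?"):
            return term
        return fresh.setdefault(term, f"?c{len(fresh)}")

    return [(generalize(s), p, generalize(o)) for s, p, o in kept]


query = [
    ("?x", "knows", "?y"),
    ("?y", "worksAt", "<http://example.org/Acme>"),
    ("?x", "worksAt", "<http://example.org/Acme>"),
    ("?x", "name", '"Alice"'),
]
pattern = to_pattern(query)
```

Here the literal `"Alice"` has degree one and is removed with its edge, while the URI is shared by two triples and is replaced by a single variable, so two queries that differ only in that URI yield the same pattern.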
For an internal pattern $m_{in}$ in the internal pattern set $M_{in}$, every match $\mu_{in}$ in its matching result set MS($m_{in}$, G) on graph G is a completely internal match, i.e., all edges of $\mu_{in}$ are internal edges of some partition $G_i$.
For a pattern m in M, the induced subgraph of graph G based on pattern m, denoted G[m], is the subgraph composed of all edges in $\bigcup_{\mu \in MS(m, G)} E(\mu)$, where $E(\mu)$ is the set of edges of G matched by the edges of $\mu$. The induced subgraph set of graph G based on M, denoted G[M], is the set of all G[m] for m in M.
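The definition of G[m] as a union of matched edges can be written down directly. The sketch below is only an illustration: it assumes the matches MS(m, G) have already been computed by some subgraph-matching routine and are supplied as lists of data-graph edges, and the names `induced_subgraph` and `induced_subgraph_set` are hypothetical.

```python
def induced_subgraph(matches):
    """Return the edge set of G[m]: the union of E(mu) over all matches mu.

    matches -- iterable of matches, each given as an iterable of
               data-graph edges (s, p, o); assumed precomputed.
    """
    edges = set()
    for match_edges in matches:
        edges.update(match_edges)
    return edges


def induced_subgraph_set(matches_by_pattern):
    """Return G[M] as a mapping from each pattern id to the edges of G[m]."""
    return {m: induced_subgraph(ms) for m, ms in matches_by_pattern.items()}


# Two matches of a hypothetical pattern "m1" that share one edge.
ms_m1 = [
    [("a", "knows", "b"), ("b", "worksAt", "c")],
    [("d", "knows", "b"), ("b", "worksAt", "c")],
]
g_m1 = induced_subgraph(ms_m1)
```

Because the result is a set, an edge that appears in several matches is stored once, which matches the union in the definition.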
A weakly connected component (WCC) is a maximal subgraph of a directed graph in which any two vertices are connected by a path when edge directions are ignored. Using weakly connected components simplifies the representation of the graph and reduces computational complexity.
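Weakly connected components can be computed with a standard union-find structure that treats every directed edge as undirected. The following Python sketch shows one common way to do this; the function name and the `(u, v)` edge representation are assumptions for the example, and the source does not state which algorithm is actually used.

```python
def weakly_connected_components(edges):
    """Return the vertex sets of the weakly connected components.

    edges -- iterable of directed edges (u, v); direction is ignored.
    """
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    # Merge the two endpoints of every edge, regardless of direction.
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv

    # Group vertices by the root of their component.
    groups = {}
    for v in list(parent):
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())


components = weakly_connected_components([("a", "b"), ("c", "b"), ("d", "e")])
```

In this example `a` and `c` both point to `b`, so neither can reach the other along directed edges, yet all three fall into one weakly connected component; `d` and `e` form the second.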
In one embodiment, as shown in FIGS. 1-7, for an RDF graph such as that of FIG. 3 and the pattern of FIG. 1 obtained from a workload such as that of FIG. 7, the matches forming the induced subgraph of the pattern (FIG. 2) are coarsened, and the coarsened graph is partitioned as in FIG. 4. Two of the super-vertices are both coarsened from matches of the pattern m1 of FIG. 1, so they can be further coarsened into a single super-vertex, Supervertex4, and the further coarsened graph can be partitioned as shown in FIG. 5. Inverse coarsening is then applied to the partitions of the further coarsened graph to obtain the partitions of the original graph, as shown in FIG. 6. Since all matches of m1 have been coarsened into Supervertex4, and Supervertex4 belongs to G1, the queries in the workload related to m1 can be executed independently in G1.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the invention is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (6)

1. An RDF knowledge graph data partitioning and distributed query method, characterized by comprising the following steps:
Step 1: for a given RDF knowledge graph data graph G, a workload W and its corresponding pattern set M, partition the graph G into n sub-graph partitions $P = \{G_1, G_2, \ldots, G_i, \ldots, G_n\}$ using a minimum pattern cut, such that preset conditions are satisfied;
The workload W comprises a plurality of queries on the graph G; for any query Q, a pattern m corresponding to Q is generated, and the patterns corresponding to all queries in the workload W form the pattern set M;
The preset conditions include: the number of patterns in the internal pattern set $M_{in}$, i.e., $|M_{in}|$, is maximized, and $|V_i| \le (1+\varepsilon) \times |V|/n$ is satisfied for each sub-graph partition $G_i$, where $|V_i|$ is the number of vertices of sub-graph partition $G_i$, $|V|$ is the number of vertices of the original graph G, and $\varepsilon$ is the user-defined maximum imbalance ratio of the partitions;
The minimum pattern cut of step 1 comprises steps 1.1 to 1.5;
Step 1.1: select the internal pattern set $M_{in}$ based on weakly connected components using a greedy algorithm; the internal pattern set $M_{in}$ is initially empty and the candidate pattern set M' is initially the pattern set M; the candidate pattern set M' is traversed, and step 1.1.1 is executed once for each candidate pattern $m_{op}$ read from M';
Step 1.1.1: for each $m_{op}$ read, compute $WCC(G[M_{in} \cup \{m_{op}\}])$, where WCC(·) denotes the set of weakly connected components of a set of graphs and $G[M_{in} \cup \{m_{op}\}]$ is the induced subgraph set of graph G based on $M_{in} \cup \{m_{op}\}$; then compute the candidate cost $Cost[M_{in} \cup \{m_{op}\}]$, where c is the vertex set of any weakly connected component in $WCC(G[M_{in} \cup \{m_{op}\}])$ and |c| is the number of vertices in c;
Step 1.2: from the candidate costs obtained in each traversal of the candidate pattern set M', extract the minimum candidate cost that satisfies $Cost[M_{in} \cup \{m_{op}\}] \le (1+\varepsilon) \times |V|/n$, add the corresponding $m_{op}$ to $M_{in}$, and delete it from M'; repeat the traversal of M' until M' is empty or no candidate pattern yields a cost satisfying this constraint, thereby obtaining the optimal internal pattern set $M_{in}$;
Step 1.3: based on the internal pattern set $M_{in}$, generate the induced subgraph set $G[M_{in}]$ of graph G, coarsen each induced subgraph in $G[M_{in}]$ into a super-vertex, and further coarsen any two super-vertices s1 and s2 that satisfy the super-vertex coarsening condition into one super-vertex, until no further coarsening is possible, thereby converting the graph G into a coarsened graph $G_c$;
The super-vertex coarsening condition is that the two super-vertices have the greatest similarity and satisfy $|c(s1)| + |c(s2)| \le (1+\varepsilon) \times |V|/n$, where |c(s1)| and |c(s2)| are the numbers of vertices of the weakly connected components in $WCC(G[M_{in}])$ corresponding to s1 and s2; the similarity of two super-vertices is computed from IM(s1) and IM(s2), the internal pattern sets corresponding to s1 and s2;
Step 1.4: partition the coarsened graph $G_c$ into n coarsened sub-graph partitions using a vertex-disjoint partitioning algorithm, and apply inverse coarsening to each coarsened sub-graph partition to obtain the n sub-graph partitions P of the graph G; inverse coarsening is the process of restoring each super-vertex to the induced subgraph it was coarsened from;
Each internal pattern $m_{in}$ in $M_{in}$ corresponds to one or more sub-graph partitions, and $m_{in}$ is converted into a sequence code by DFS coding;
Step 1.5: the querying user builds a hash table at the user's terminal to store the sequence code of each $m_{in}$, the number of partitions n, and the IP addresses of the nodes hosting the sub-graph partitions; each key of the hash table is the code of an $m_{in}$, and the corresponding value is the IP addresses of the nodes hosting the one or more sub-graph partitions corresponding to that $m_{in}$;
Step 2: for a query Q initiated at the user's local side, perform distributed query processing to obtain the matching result set MS(Q, G) of the query Q on graph G, wherein the distributed query processing includes: determining whether the query Q is an independently executable query; if so, executing step 2.1; if Q is a non-independently executable query, executing step 2.2;
The independently executable queries comprise first-type, second-type, and third-type independently executable queries; a query belonging to none of the three types is a non-independently executable query;
Determining whether the query Q is an independently executable query includes: generating the pattern m corresponding to Q; ignoring vertex constants, if m is one $m_{in}$ or a combination of several $m_{in}$ in $M_{in}$ and Q is isomorphic to m, then Q is a first-type independently executable query; if Q has a subgraph Q' that is a first-type independently executable query and, for every edge e in E(Q)\E(Q'), both endpoints of e belong to V(Q'), then Q is a second-type independently executable query; if Q has a subgraph Q' that is a first-type independently executable query, every edge e in E(Q)\E(Q') has at least one endpoint in V(Q'), and at least one edge e in E(Q)\E(Q') has one endpoint in V(Q') and the other endpoint outside V(Q'), then Q is a third-type independently executable query; here E(Q) and E(Q') are the edge sets of Q and Q', V(Q') is the vertex set of Q', and E(Q)\E(Q') denotes the set of edges in E(Q) that do not belong to E(Q');
Step 2.1: locate the query Q in its corresponding partitions: look up the locally stored hash table using the DFS code of the pattern m corresponding to Q, and take the intersection of the partitions corresponding to pattern m as the partitions corresponding to Q; execute the matching process on each partition corresponding to Q, and take the union of the matching results over all of these partitions as the matching result set MS(Q, G) of the query Q;
Step 2.2: decompose the query Q into a plurality of sub-queries q based on the internal pattern set $M_{in}$; then, for each sub-query q, look up the locally stored hash table using the DFS code of its corresponding pattern $m_q$, and take the intersection of the partitions corresponding to $m_q$ as the partitions corresponding to q; execute the matching process of q on each partition corresponding to q, take the union of the matching results over all of these partitions as the matching result of the sub-query, and then perform a join operation on the matching results of the sub-queries to obtain the matching result set MS(Q, G) of the query Q;
Step 2.2 comprises: to decompose the query Q into multiple sub-queries q, first remove from Q the crossing edges that do not belong to any internal pattern, and store the removed crossing edges in a crossing-edge set; after all edges of the crossing-edge set have been removed from Q, the resulting weakly connected components, which contain only internal-pattern edges, are taken as sub-queries {q1, q2, …, qx}, and the partition of each sub-query is determined from its corresponding internal pattern; for each edge vi-vj in the crossing-edge set, where vi and vj are its endpoints, q(vi) and q(vj) are the sub-queries containing vi and vj respectively, and |q(vi)| and |q(vj)| are the numbers of vertices of q(vi) and q(vj): if |q(vi)| = |q(vj)|, the edge vi-vj is added to q(vi), making q(vi) a second-type independently executable query; otherwise, the edge vi-vj is added to whichever of q(vi) and q(vj) has more vertices; after all edges in the crossing-edge set have been traversed, the sub-queries in {q1, q2, …, qx} that contain more than one vertex are taken as the result of decomposing the query Q;
In step 2.1 and step 2.2, the matching process on the different partitions is performed in parallel.
2. The method according to claim 1, characterized in that: to generate the pattern m corresponding to Q, m is generated by deleting the degree-one vertices of Q and replacing all constants in the subject and object positions of Q with variables, the constants including literals and URIs.
3. The method according to claim 1, wherein for an internal pattern $m_{in}$ in the internal pattern set $M_{in}$, every match $\mu_{in}$ in its matching result set MS($m_{in}$, G) on the graph G is a completely internal match, i.e., all edges of $\mu_{in}$ are internal edges of some partition $G_i$.
4. The method according to claim 1, characterized in that for a pattern m in M, the induced subgraph of graph G based on pattern m, denoted G[m], is the subgraph composed of all edges in $\bigcup_{\mu \in MS(m, G)} E(\mu)$, where $E(\mu)$ is the set of edges of G matched by the edges of $\mu$; the induced subgraph set of graph G based on M, denoted G[M], is the set of all G[m] for m in M.
5. A computer device, comprising:
A memory for storing instructions;
a processor for invoking instructions stored in the memory to perform the method of any of claims 1-4.
6. A computer readable storage medium storing instructions which, when executed by a processor, perform the method of any of claims 1-4.
CN202410278786.7A 2024-03-12 2024-03-12 RDF knowledge graph data partitioning and distributed query method Pending CN118113883A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410278786.7A CN118113883A (en) 2024-03-12 2024-03-12 RDF knowledge graph data partitioning and distributed query method

Publications (1)

Publication Number Publication Date
CN118113883A true CN118113883A (en) 2024-05-31

Family

ID=91211871

Country Status (1)

Country Link
CN (1) CN118113883A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination