CN114116785A - Distributed SPARQL query optimization method based on minimum attribute cut

Info

Publication number: CN114116785A
Application number: CN202111451035.3A
Authority: CN
Inventors: 彭鹏; 田桢; 秦拯
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2021-12-01
Filing date: 2021-12-01
Publication date: 2022-03-01
Anticipated expiration: 2041-12-01
Also published as: CN114116785B

Abstract

The invention discloses a distributed SPARQL query optimization method based on minimum attribute cut, which belongs to the field of distributed systems and comprises the following steps: (1) reading an original RDF data graph and storing its edge attributes in a set L; (2) calculating the weakly connected components and the corresponding cost for each edge attribute; (3) selecting as many internal attributes as possible to obtain a coarsened graph of the data graph; (4) performing vertex partitioning on the coarsened graph and uncoarsening the result to obtain the final partitions; (5) decomposing the SPARQL query into a set of independently executable sub-queries; and (6) executing the decomposed sub-queries in parallel in each partition to obtain the matching results. The invention expands the types of queries that can be executed independently in a distributed RDF system, reduces inter-partition joins, shortens data communication time, and improves query efficiency.

Description

Distributed SPARQL query optimization method based on minimum attribute cut

Technical Field

The present invention relates to the field of distributed systems, and more particularly to data partitioning and query processing for distributed RDF systems.

Background

RDF (Resource Description Framework) is a data model standardized by the W3C. It represents the attributes and relationships of web resources as triples of the form <subject, predicate, object>, and is currently applied in fields such as knowledge graphs and social network analysis. The RDF data model is flexible: it can be represented as a table in a relational database or as a graph. When RDF is represented as a graph, each triple corresponds to a directed edge from the subject to the object; the subject and the object are the two vertices of the edge, and the predicate is the label on the edge. Together with RDF, the W3C proposed the standard query language SPARQL (SPARQL Protocol and RDF Query Language). Like RDF, a SPARQL query can be represented as a graph. The edges of the query graph are called triple patterns, and the subject, predicate, and object of a triple pattern may each be a variable or a constant. Because both SPARQL and RDF can be represented as graphs, answering a SPARQL query can be transformed into a subgraph matching problem.
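The graph view of RDF and the notion of a triple pattern can be illustrated with a minimal sketch. The triples, the helper names build_graph and match_pattern, and the convention that variables start with "?" are illustrative assumptions, not part of the described method.

```python
from collections import defaultdict

# Hypothetical toy triples (subject, predicate, object); identifiers are illustrative only.
triples = [
    ("film1", "starring", "actor1"),
    ("film1", "producer", "person1"),
    ("actor1", "birthPlace", "city1"),
]

def build_graph(triples):
    """Adjacency list: subject -> list of (predicate, object) outgoing edges."""
    out_edges = defaultdict(list)
    for s, p, o in triples:
        out_edges[s].append((p, o))
    return out_edges

def match_pattern(triples, pattern):
    """Match one triple pattern; terms starting with '?' are variables."""
    results = []
    for triple in triples:
        binding = {}
        ok = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable must keep the same value if it occurs twice.
                if binding.get(term, value) != value:
                    ok = False
                    break
                binding[term] = value
            elif term != value:
                ok = False
                break
        if ok:
            results.append(binding)
    return results

graph = build_graph(triples)
bindings = match_pattern(triples, ("?film", "starring", "?actor"))
```

A full SPARQL query consists of several such patterns sharing variables, which is why evaluating it amounts to finding subgraphs of the data graph that match the query graph.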

With the rapid development of the internet, RDF data sets keep growing, and a traditional single-machine system cannot process massive RDF data effectively, which has led to distributed RDF systems. In a distributed system, data partitioning is one of the most basic steps. Specifically, the RDF data graph G is divided into a set of subgraphs {F₁, F₂, …, F_k}. Each subgraph is called a partition, and the partitions are distributed over different machines. The partitioning methods currently used in distributed RDF systems partition the data by vertex, that is, they assign each vertex to one of the partitions; hash partitioning is a common example. With this type of method, some edges may be "cut" between partitions, meaning that the two vertices of the edge are assigned to different partitions. To preserve the integrity of the graph, such cut edges are stored in both partitions, which is called one-hop replication. An edge is called an internal edge if its two vertices are in the same partition; otherwise it is called a crossing edge.
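Vertex partitioning with one-hop replication can be sketched as follows. The function name partition_by_vertex, the use of zlib.crc32 as the hash, and the optional assign argument are assumptions made for illustration.

```python
import zlib

def partition_by_vertex(triples, k, assign=None):
    """Assign vertices to k partitions and replicate crossing edges (one-hop replication).

    assign is a hypothetical hook mapping a vertex to a partition index;
    by default a deterministic CRC32 hash is used.
    """
    if assign is None:
        assign = lambda v: zlib.crc32(v.encode("utf-8")) % k
    fragments = [[] for _ in range(k)]
    crossing_edges = []
    for s, p, o in triples:
        ps, po = assign(s), assign(o)
        fragments[ps].append((s, p, o))
        if ps != po:
            # The edge is cut: keep a copy in the object's partition as well.
            crossing_edges.append((s, p, o))
            fragments[po].append((s, p, o))
    return fragments, crossing_edges

# Explicit assignment so the outcome does not depend on hash values.
example_assign = {"a": 0, "b": 0, "c": 1}.get
fragments, crossing_edges = partition_by_vertex(
    [("a", "p", "b"), ("b", "q", "c")], 2, example_assign
)
```

In this example the edge (b, q, c) is a crossing edge and is stored in both partitions, while (a, p, b) is an internal edge of partition 0.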

Like edges, the matches of a query fall into two types: internal matches, which are contained entirely within one partition, and crossing matches, which span multiple partitions. When a query has only internal matches, it only needs to be executed independently in each partition. For a query with crossing matches, most existing methods decompose the query into a set of star queries, execute the star queries independently in each partition, and finally perform inter-partition joins to obtain the final result. However, inter-partition joins involve data communication and extra computation, and have a large impact on query performance. Moreover, under conventional vertex partitioning, the only queries that can be executed independently are star queries, which is a severe limitation; processing a general query usually requires distributed joins, so query efficiency is low.

Disclosure of Invention

Existing distributed RDF systems judge whether a query can be executed independently solely from the structure of the query graph, and consider a query independently executable only when its query graph is a star. By taking the attributes of the edges in the graph data into account, the present invention extends the types of queries that can be executed independently beyond star queries. The first objective of the present invention is to provide a graph data partitioning method based on minimum attribute cut, which reduces the number of crossing attributes, thereby avoiding join operations between partitions and reducing data communication time. The second objective of the present invention is to provide a query decomposition method that decomposes an original query that cannot be executed independently into a set of sub-queries that can, thereby making full use of the minimum attribute cut partitioning and improving query efficiency.

The invention provides a distributed SPARQL query optimization method based on minimum attribute cut, which comprises the following steps:

Step S1: reading an original RDF data graph G, and storing its edge attributes in a set L;

Step S2: calculating the weakly connected components and the corresponding cost for each edge attribute;

Step S3: selecting as many internal attributes as possible to obtain a coarsened graph of the data graph;

When static graph data is processed, the number of edge attributes is fixed, and each attribute is either an internal attribute or a crossing attribute. Therefore, a heuristic greedy algorithm is used to select as many internal attributes as possible, which minimizes the number of crossing attributes and thus achieves the minimum attribute cut. After the internal attributes are selected, each weakly connected component formed by the internal attributes is treated as a super vertex, yielding a coarsened graph of the data graph.

Step S4: performing vertex partitioning on the coarsened graph, and uncoarsening the result to obtain the final partitions;

When the coarsened graph is partitioned by vertex, any vertex partitioning algorithm, such as hash partitioning or METIS, may be used, provided that the number of vertices in each partition does not exceed (1+ε) × |V|/k, so as to achieve load balancing between partitions. Here ε is the user-defined maximum imbalance ratio, and k is the number of partitions.

Step S5: decomposing the SPARQL query into a set of independently executable sub-queries;

The original SPARQL query is decomposed according to the crossing attributes obtained in step S3. The sub-queries obtained by decomposition can be executed independently within a partition, and the shape of a sub-query is not limited to a star.

Step S6: executing the decomposed sub-queries in parallel in each partition to obtain the matching results.

By adopting the invention, the following technical effects can be achieved:

The invention provides a distributed SPARQL query optimization method based on minimum attribute cut. It partitions the data graph so that the number of crossing attributes is minimized, and then decomposes queries that cannot be executed independently into a set of sub-queries that can. Unlike in traditional methods, the independently executable sub-queries are not limited to star queries, which further reduces the number of invalid intermediate matching results and improves the filtering effect.

Drawings

FIG. 1 is a schematic flow diagram of the process of the present invention;

FIG. 2 is a schematic diagram illustrating a process of coarsening a data graph according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating a query decomposition process according to an embodiment of the present invention.

Detailed Description

The following further description of embodiments of the present invention is provided in conjunction with the accompanying drawings so that those skilled in the art can more easily understand the present invention. It should be noted that the embodiment described below is only one embodiment of the present invention, and not all embodiments. Other embodiments, which can be derived by those skilled in the art from the embodiments of the present invention without making any creative effort, are within the protection scope of the present invention.

For convenience of description and understanding, the symbols and concepts related to the embodiments of the invention are explained as follows:

G: the RDF data graph.

L: the set of edge attributes in the RDF data graph.

q(v): the sub-query (query graph) to which the query vertex v belongs.

G[L']: the subgraph induced by the attribute set L', that is, the subgraph formed by the edges whose attributes are in L'.

DS(L'): the disjoint-set (union-find) structure corresponding to the attribute set L'.

Internal edge: an edge is called an internal edge if both of its vertices are in the same partition.

Crossing edge: an edge is called a crossing edge if its two vertices are in two different partitions.

Internal attribute: an attribute is called an internal attribute if none of its edges is a crossing edge.

Crossing attribute: an attribute is called a crossing attribute if at least one of its edges is a crossing edge.

Independently executable query: a SPARQL query Q is independently executable over a partitioning F = {F₁, F₂, …, F_k} of the RDF graph G if matching Q requires no inter-partition joins.

Minimum attribute cut partitioning: given an RDF data graph G and a positive integer k, a minimum attribute cut partitioning of G is a partitioning F = {F₁, F₂, …, F_k} that satisfies: (1) the number of crossing attributes |L_cross| is minimal; (2) the number of vertices in each partition does not exceed (1+ε) × |V|/k, where ε is the user-defined maximum imbalance ratio and k is the number of partitions.
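The definitions of internal and crossing attributes and the balance condition can be checked mechanically once a vertex-to-partition assignment is known. The following is a minimal sketch; the names classify_attributes, is_balanced, and part_of are hypothetical.

```python
def classify_attributes(triples, part_of):
    """Split edge attributes into (internal, crossing) for a given vertex assignment.

    part_of maps every vertex to its partition index.
    """
    all_attrs = {p for _, p, _ in triples}
    # An attribute is crossing as soon as one of its edges joins two partitions.
    crossing = {p for s, p, o in triples if part_of[s] != part_of[o]}
    return all_attrs - crossing, crossing

def is_balanced(part_of, k, eps):
    """Check that no partition holds more than (1 + eps) * |V| / k vertices."""
    limit = (1 + eps) * len(part_of) / k
    sizes = {}
    for part in part_of.values():
        sizes[part] = sizes.get(part, 0) + 1
    return all(size <= limit for size in sizes.values())

example_triples = [("a", "p", "b"), ("c", "p", "d"), ("b", "q", "c")]
example_part_of = {"a": 0, "b": 0, "c": 1, "d": 1}
internal_attrs, crossing_attrs = classify_attributes(example_triples, example_part_of)
```

Here p is an internal attribute and q is a crossing attribute, so any query that uses only p can be answered inside the partitions regardless of its shape.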

The invention provides a distributed SPARQL query optimization method based on minimum attribute cut, the flow of which is shown in FIG. 1 and comprises the following steps:

S1: reading the original RDF data graph G, and storing its edge attributes in a set L;

S2: traversing the set L, and calculating, for each edge attribute p, the weakly connected components WCC(G[{p}]) and the corresponding cost Cost({p});

When calculating the weakly connected components, different methods may be used. In this embodiment of the invention, the computation is optimized using a disjoint-set (union-find) data structure. Step S2 specifically includes:

S2.1: traversing each attribute p in the set L, and executing steps S2.2-S2.4 for the attribute p;

S2.2: initializing a disjoint set DS({p}) for attribute p. In the disjoint set, each vertex u corresponds to a tree node and carries three values: u({p}).parent, u({p}).rank, and u({p}).size. Here u({p}).parent points toward the root of the tree containing u in DS({p}), and its initial value is u itself; u({p}).rank is the height value used when merging trees, and its initial value is 0; u({p}).size is the number of vertices in the tree rooted at u, and its initial value is 1;

S2.3: for each edge from u to u' in the RDF graph, if its attribute is p, the trees containing u and u' in the disjoint set DS({p}) are merged, that is, the weakly connected components containing u and u' are merged. During merging, the root of the tree with the smaller rank is made to point to the root of the tree with the larger rank. After all edges with attribute p have been processed, two vertices that are in the same connected component of the induced subgraph G[{p}] are also in the same tree in DS({p});

S2.4: calculating the cost of attribute p as an internal attribute;

Because the method requires that the number of vertices in each partition not exceed (1+ε) × |V|/k in order to ensure load balancing between partitions, the cost in this embodiment is defined based on the size of the weakly connected components. Specifically, for a set of attributes L' ⊆ L, the cost of L' as a set of internal attributes is defined as follows:

Cost(L') = max_{c ∈ WCC(G[L'])} |c|

where c is a weakly connected component in WCC(G[L']) and |c| is the number of vertices in c. Based on this cost function, the cost of the weakly connected components corresponding to each attribute can be calculated.
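Steps S2.2-S2.4 can be sketched with a small disjoint-set class. This is an illustration under stated assumptions rather than the embodiment's implementation: the class and function names are hypothetical, the cost is assumed to be the size of the largest weakly connected component (which is consistent with comparing it against (1+ε) × |V|/k), and every vertex of G is kept as a singleton even if it has no edge with the chosen attributes.

```python
class DisjointSet:
    """Union-find with union by rank; also tracks the size of every tree."""

    def __init__(self):
        self.parent, self.rank, self.size = {}, {}, {}

    def add(self, u):
        # S2.2: parent is the vertex itself, rank 0, size 1.
        if u not in self.parent:
            self.parent[u] = u
            self.rank[u] = 0
            self.size[u] = 1

    def find(self, u):
        while self.parent[u] != u:
            u = self.parent[u]
        return u

    def union(self, u, v):
        # S2.3: the root with the smaller rank points to the root with the larger rank.
        ru, rv = self.find(u), self.find(v)
        if ru == rv:
            return
        if self.rank[ru] < self.rank[rv]:
            ru, rv = rv, ru
        self.parent[rv] = ru
        self.size[ru] += self.size[rv]
        if self.rank[ru] == self.rank[rv]:
            self.rank[ru] += 1

    def max_component_size(self):
        roots = (u for u in self.parent if self.parent[u] == u)
        return max((self.size[r] for r in roots), default=0)


def build_ds(triples, attrs):
    """Build DS(attrs): merge the endpoints of every edge whose attribute is in attrs."""
    ds = DisjointSet()
    for s, p, o in triples:
        ds.add(s)
        ds.add(o)
        if p in attrs:
            ds.union(s, o)
    return ds


def cost(triples, attrs):
    """S2.4 (assumed definition): size of the largest weakly connected component."""
    return build_ds(triples, attrs).max_component_size()


example = [("a", "p", "b"), ("b", "p", "c"), ("d", "q", "e")]
cost_p = cost(example, {"p"})  # component {a, b, c}
cost_q = cost(example, {"q"})  # component {d, e}
```

Edge direction is ignored when merging, which is what makes the components weakly connected.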

S3: selecting as many internal attributes L_in as possible from the attribute set L so as to minimize the number of crossing attributes. Each weakly connected component of the data graph induced by the internal attributes L_in is treated as a super vertex, yielding a coarsened graph of the data graph. In the coarsened graph, super vertices may be connected by crossing edges;

Given a data graph G, assign a unique attribute to each of its edges. Computing a minimum attribute cut of the resulting graph is then equivalent to computing a minimum edge cut of G. Because the minimum edge-cut problem is NP-complete, the minimum attribute-cut problem is also NP-complete. For this reason, this embodiment uses a heuristic greedy algorithm to select the internal attributes, which specifically includes the following steps:

S3.1: initializing the internal attribute set L_in to empty;

S3.2: checking whether the attribute set L is empty; if it is empty, ending the iteration; otherwise, executing steps S3.3-S3.8 and continuing with the next iteration;

S3.3: setting the minimum cost mincost to infinity and the optimal attribute p_opt to null;

S3.4: traversing the attribute set L, and executing steps S3.5 and S3.6 for each attribute p;

S3.5: calculating WCC(G[L_in ∪ {p}]);

In this embodiment, the disjoint-set data structure is used to compute the weakly connected components efficiently. Initially, DS(L_in ∪ {p}) is set to DS(L_in). For each vertex u in DS({p}), the root vertex uRoot of its tree in DS({p}) is obtained recursively. Then the roots of u and of uRoot in DS(L_in ∪ {p}) are looked up, and if they differ, the corresponding trees are merged.

S3.6: if Cost(L_in ∪ {p}) is less than (1+ε) × |V|/k and also less than mincost, assigning Cost(L_in ∪ {p}) to mincost and p to p_opt, and then returning to step S3.4; otherwise, leaving mincost and p_opt unchanged and returning directly to step S3.4;

S3.7: if the optimal attribute p_opt is still null after steps S3.4-S3.6, ending step S3 and proceeding to step S4; otherwise, going to step S3.8;

S3.8: deleting the attribute p_opt from the attribute set L, adding p_opt to the internal attribute set L_in, and then returning to step S3.2 to continue selecting internal attributes;

Taking FIG. 2 as an example, the original data graph has 12 vertices and 6 edge attributes. After the processing of step S3, the selected internal attribute set is L_in = { starring, residual, producer, spout, found date }. The edges with internal attributes are the thickened edges in FIG. 2, and they form two weakly connected components. In the coarsened graph, each of the two weakly connected components forms a super vertex, and the super vertices are connected by an edge with the crossing attribute birthPlace.
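The greedy loop of steps S3.1-S3.8 can be sketched as follows. This is a simplified illustration, not the embodiment itself: it recomputes the components from scratch for every candidate instead of reusing DS(L_in), it assumes the cost is the size of the largest weakly connected component, and the function names and toy data are hypothetical.

```python
def largest_wcc(triples, attrs):
    """Size of the largest weakly connected component of G restricted to attrs."""
    parent, size = {}, {}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for s, p, o in triples:
        for v in (s, o):
            if v not in parent:
                parent[v], size[v] = v, 1
        if p in attrs:
            rs, ro = find(s), find(o)
            if rs != ro:
                if size[rs] < size[ro]:
                    rs, ro = ro, rs
                parent[ro] = rs
                size[rs] += size[ro]
    return max((size[v] for v in parent if parent[v] == v), default=0)


def select_internal_attributes(triples, k, eps):
    """Greedy selection; returns (internal attributes, crossing attributes)."""
    vertices = {v for s, _, o in triples for v in (s, o)}
    limit = (1 + eps) * len(vertices) / k
    remaining = {p for _, p, _ in triples}       # attribute set L
    internal = set()                             # S3.1
    while remaining:                             # S3.2
        min_cost, best = float("inf"), None      # S3.3
        for p in sorted(remaining):              # S3.4 (sorted only for determinism)
            c = largest_wcc(triples, internal | {p})   # S3.5
            if c < limit and c < min_cost:       # S3.6
                min_cost, best = c, p
        if best is None:                         # S3.7
            break
        remaining.remove(best)                   # S3.8
        internal.add(best)
    return internal, remaining


toy_graph = [
    ("a", "p", "b"), ("b", "p", "c"),
    ("d", "q", "e"), ("e", "q", "f"),
    ("c", "r", "d"), ("a", "r", "d"),
]
internal, crossing = select_internal_attributes(toy_graph, k=2, eps=0.1)
```

With six vertices, k = 2 and ε = 0.1 the size limit is 3.3, so p and q are selected as internal attributes while r, which would glue the two components together, remains a crossing attribute.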

S4: partitioning the super vertices in the coarsened graph with a vertex partitioning algorithm, ensuring during partitioning that the number of vertices in each partition does not exceed (1+ε) × |V|/k, where ε is the user-defined maximum imbalance ratio and k is the number of partitions;

Because the coarsened graph has far fewer vertices than the original data graph, its vertices can be partitioned with any vertex partitioning algorithm, such as hash partitioning or METIS, without concern for long running times. In this embodiment, step S4 specifically includes:

S4.1: taking the number of original vertices inside each super vertex of the coarsened graph as the weight of that super vertex, and applying weighted hash partitioning to the coarsened graph so that the number of vertices in each final data partition does not exceed (1+ε) × |V|/k;

S4.2: uncoarsening the set of super vertices assigned to the same partition in step S4.1 into a final partition, that is, assigning the original data vertices contained in those super vertices to one partition of the original data graph;

Taking FIG. 2 as an example, if the number of partitions is 2, each of the two super vertices in FIG. 2 forms one partition; that is, the original data graph is divided into two partitions by the dashed line in FIG. 2, yielding the final minimum attribute cut partitioning.
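Coarsening, weighted partitioning, and uncoarsening can be sketched as below. Note the substitution: instead of the weighted hash partitioning named in step S4.1, this sketch uses a simple greedy heuristic that places the heaviest super vertex into the currently lightest partition. The function names and the toy data are assumptions.

```python
def coarsen(triples, internal):
    """Map every vertex to a super vertex id: its component under the internal attributes."""
    parent = {}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for s, p, o in triples:
        parent.setdefault(s, s)
        parent.setdefault(o, o)
        if p in internal:
            parent[find(s)] = find(o)
    return {v: find(v) for v in parent}


def partition_super_vertices(super_of, k, eps):
    """Assign super vertices to partitions, then uncoarsen to a vertex -> partition map."""
    members = {}
    for v, sv in super_of.items():
        members.setdefault(sv, []).append(v)
    limit = (1 + eps) * len(super_of) / k
    loads = [0] * k
    part_of = {}
    # Stand-in heuristic (not weighted hashing): heaviest super vertex first,
    # always into the partition that currently holds the fewest vertices.
    for sv, vs in sorted(members.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        target = min(range(k), key=lambda i: loads[i])
        if loads[target] + len(vs) > limit:
            raise ValueError("balance constraint cannot be met by this heuristic")
        loads[target] += len(vs)
        for v in vs:  # uncoarsening: members follow their super vertex
            part_of[v] = target
    return part_of


toy_graph = [
    ("a", "p", "b"), ("b", "p", "c"),
    ("d", "q", "e"), ("e", "q", "f"),
    ("c", "r", "d"),
]
super_of = coarsen(toy_graph, {"p", "q"})
part_of = partition_super_vertices(super_of, k=2, eps=0.1)
```

Because a super vertex is never split, every edge with an internal attribute ends up inside a single partition, which is the property the query decomposition relies on.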

S5: decomposing the SPARQL query to be processed into a set of sub-queries that can be executed independently;

In real SPARQL workloads, a query often cannot be executed independently. To make full use of the minimum attribute cut partitioning and to reduce inter-partition joins, the original query needs to be decomposed into a set of sub-queries that can be executed independently. In this embodiment, step S5 specifically includes:

S5.1: initializing an empty set, which is used to store the decomposed sub-queries;

S5.2: deleting the edges of the SPARQL query whose attribute is a variable or a crossing attribute, to obtain a set of weakly connected components WCCs = {q′₁, q′₂, …, q′_x};

S5.3: traversing each edge from v₁ to v₂ in the SPARQL query whose attribute is a variable or a crossing attribute, and executing steps S5.4-S5.5 for that edge;

S5.4: if v₁ and v₂ belong to the same sub-query, adding the edge to that sub-query, and then returning to step S5.3 to continue with a new iteration; otherwise, going to step S5.5;

S5.5: if |q(v₁)| is less than or equal to |q(v₂)|, adding the edge to q(v₂); otherwise, adding it to q(v₁). That is, the edge is added to whichever of the sub-queries of v₁ and v₂ has more vertices. Then returning to step S5.3 to continue with a new iteration;

S5.6: traversing the sub-queries q′_i in WCCs, and adding q′_i to the set initialized in step S5.1 if the number of vertices in q′_i is greater than 1. Sub-queries with a single vertex are not considered because such a query contains only one query vertex, its matches are numerous and uninformative, and that query vertex is already contained in other sub-queries;

Taking FIG. 3 as an example, three sub-queries q′₁, q′₂, and q′₃ are obtained after step S5.2. Because q′₁ has more vertices than q′₂, the crossing-attribute edge connecting them is added to q′₁. Because q′₂ and q′₃ have the same number of vertices, the edge connecting them may be added to either of the two. Assuming that this edge is added to q′₂, the final decomposed sub-queries are q₁ and q₂ in FIG. 3.
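The decomposition in steps S5.1-S5.6 can be sketched as follows. The representation of a query as a list of (v1, attribute, v2) triple patterns, the "?" prefix for variables, and the function name decompose are illustrative assumptions. The sketch also assumes that adding an edge to a sub-query adds both of its endpoints to that sub-query, and that |q(v)| is the sub-query's current vertex count.

```python
def decompose(query_edges, crossing):
    """Split a query into sub-queries that avoid inter-partition joins."""

    def is_cut(attr):
        # Edges with a variable or crossing attribute are removed in S5.2.
        return attr.startswith("?") or attr in crossing

    vertices = {v for s, _, o in query_edges for v in (s, o)}
    comp = {v: v for v in vertices}

    def find(u):
        while comp[u] != u:
            comp[u] = comp[comp[u]]  # path halving
            u = comp[u]
        return u

    # S5.2: weakly connected components of the query without the cut edges.
    for s, a, o in query_edges:
        if not is_cut(a):
            comp[find(s)] = find(o)
    subqueries = {}
    for v in sorted(vertices):
        entry = subqueries.setdefault(find(v), {"vertices": set(), "edges": []})
        entry["vertices"].add(v)
    for s, a, o in query_edges:
        if not is_cut(a):
            subqueries[find(s)]["edges"].append((s, a, o))

    # S5.3-S5.5: put each cut edge into the larger of the two sub-queries it touches.
    for s, a, o in query_edges:
        if not is_cut(a):
            continue
        q1, q2 = subqueries[find(s)], subqueries[find(o)]
        if q1 is q2 or len(q1["vertices"]) > len(q2["vertices"]):
            target = q1
        else:
            target = q2
        target["edges"].append((s, a, o))
        target["vertices"].update((s, o))

    # S5.6: drop sub-queries that consist of a single vertex.
    return [q for q in subqueries.values() if len(q["vertices"]) > 1]


query = [
    ("?f", "starring", "?a"),
    ("?f", "producer", "?p"),
    ("?a", "birthPlace", "?c"),   # crossing attribute
    ("?c", "?x", "?d"),           # variable attribute
]
subqueries = decompose(query, crossing={"birthPlace"})
```

In this example the birthPlace edge joins the three-vertex component, the variable-attribute edge forms a second sub-query with ?c and ?d, and the isolated vertex ?c is dropped as a sub-query of its own, so two sub-queries remain and neither is restricted to a star shape.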

S6: executing the decomposed sub-queries in parallel in each partition to obtain the matching results. In this embodiment, step S6 specifically includes:

S6.1: the master node of the distributed RDF system broadcasts the decomposed sub-queries to all slave nodes; after receiving the sub-queries, the slave nodes execute subgraph matching in parallel inside their partitions to obtain intermediate matching results;

S6.2: the intermediate matching results of the nodes are joined across partitions to obtain the final matching results, which are collected at the master node.
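Steps S6.1 and S6.2 can be imitated on a single machine with a thread pool standing in for the slave nodes. This is only a sketch under stated assumptions: the backtracking matcher, the nested-loop join, and the names match, join, and run are hypothetical, and a real system would run the matching on separate machines.

```python
from concurrent.futures import ThreadPoolExecutor


def _bind(binding, term, value):
    """Bind a variable (prefix '?') to value, or compare a constant with value."""
    if term.startswith("?"):
        return binding.setdefault(term, value) == value
    return term == value


def match(triples, patterns, binding=None):
    """Backtracking matcher: all variable bindings of patterns within one partition."""
    binding = binding or {}
    if not patterns:
        return [binding]
    results = []
    for triple in triples:
        new = dict(binding)
        if all(_bind(new, term, value) for term, value in zip(patterns[0], triple)):
            results.extend(match(triples, patterns[1:], new))
    return results


def join(left, right):
    """Join two lists of bindings on their shared variables."""
    return [{**a, **b} for a in left for b in right
            if all(a[v] == b[v] for v in a.keys() & b.keys())]


def run(partitions, subqueries):
    """S6.1: match every sub-query in every partition; S6.2: join the results."""
    per_subquery = []
    with ThreadPoolExecutor() as pool:
        for patterns in subqueries:
            rows_per_partition = pool.map(match, partitions, [patterns] * len(partitions))
            merged = []
            for rows in rows_per_partition:
                for row in rows:
                    if row not in merged:  # replicated edges can yield duplicates
                        merged.append(row)
            per_subquery.append(merged)
    result = per_subquery[0]
    for rows in per_subquery[1:]:
        result = join(result, rows)
    return result


partitions = [
    [("f1", "starring", "a1"), ("a1", "birthPlace", "c1")],
    [("a1", "birthPlace", "c1"), ("c1", "country", "x1")],
]
q1 = [("?f", "starring", "?a"), ("?a", "birthPlace", "?c")]
q2 = [("?c", "country", "?n")]
final = run(partitions, [q1, q2])
```

Each sub-query is answered entirely inside the partitions, and only the comparatively small intermediate results are joined afterwards on the shared variable ?c.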

In summary, by taking the edge attributes of the RDF data graph into account, the invention provides a distributed SPARQL query optimization method based on minimum attribute cut, which expands the types of queries that can be executed independently, reduces inter-partition joins, shortens data communication time, and improves query efficiency.

The above embodiments are only preferred embodiments of the present invention, and are not intended to limit the present invention, and any modifications, equivalents, improvements, etc. made thereto should be included within the scope of the present invention.

Claims

1. A distributed SPARQL query optimization method based on minimum attribute cut, characterized by comprising the following steps:

(1) reading an original RDF data graph, and storing its edge attributes in a set L;

(2) calculating the weakly connected components and the corresponding cost for each edge attribute;

(3) selecting as many internal attributes as possible to obtain a coarsened graph of the data graph;

(4) performing vertex partitioning on the coarsened graph, and uncoarsening the result to obtain the final partitions;

(5) decomposing the SPARQL query into a set of independently executable sub-queries; and

(6) executing the decomposed sub-queries in parallel in each partition to obtain the matching results.

2. The distributed SPARQL query optimization method based on minimum attribute cut as claimed in claim 1, wherein in step (2), when the weakly connected components are calculated, the size of the weakly connected components is used as the cost of an attribute, so that attributes can be compared when the internal attributes are selected.

3. The distributed SPARQL query optimization method based on minimum attribute cut as claimed in claim 1, wherein in step (3), when static graph data is processed, the number of edge attributes is fixed and each attribute is either an internal attribute or a crossing attribute; a heuristic greedy algorithm is used to select as many internal attributes as possible, which minimizes the number of crossing attributes and thus achieves the minimum attribute cut; and after the internal attributes are selected, each weakly connected component formed by the internal attributes is treated as a super vertex to obtain a coarsened graph of the data graph.

4. The distributed SPARQL query optimization method based on minimum attribute cut as claimed in claim 1, wherein in step (4), when the coarsened graph is partitioned by vertex, any vertex partitioning algorithm, such as hash partitioning or METIS, can be used, provided that the number of vertices in each partition does not exceed (1+ε) × |V|/k, so as to achieve load balancing between the partitions; wherein ε is the user-defined maximum imbalance ratio, and k is the number of partitions.

5. The distributed SPARQL query optimization method based on minimum attribute cut as claimed in claim 1, wherein in step (5), the original SPARQL query is decomposed according to the crossing attributes obtained in step (3), the decomposed sub-queries can be executed independently within the partitions, and the shape of a sub-query is not limited to a star.