CN107291807B

CN107291807B - SPARQL query optimization method based on graph traversal

Info

Publication number: CN107291807B
Application number: CN201710343003.9A
Authority: CN
Inventors: 李亮; 沈志宏; 周园春; 黎建辉; 朱小杰; 刘东江; 李跃鹏
Original assignee: Computer Network Information Center of CAS
Current assignee: Computer Network Information Center of CAS
Priority date: 2017-05-16
Filing date: 2017-05-16
Publication date: 2020-10-16
Anticipated expiration: 2037-05-16
Also published as: CN107291807A

Abstract

The invention discloses a SPARQL query optimization method based on graph traversal. The method comprises the following steps: 1) representing the triples in the RDF data as a property graph, and then storing the RDF data using a Bigtable model to obtain Bigtable data corresponding to the RDF data; 2) converting the SPARQL query into a traversal of the RDF property graph; 3) traversing all nodes that satisfy the conditions in the Bigtable data according to the traversal sequence obtained in step 2), thereby completing the SPARQL query. On the one hand, the method removes the dependence of traditional SPARQL query processing on hash-based data structures, reduces the generation of intermediate data, and avoids join computation over large-scale RDF data; on the other hand, it makes effective use of Bigtable-based big data processing technology to store and manage massive RDF linked knowledge-network data, and accelerates the query and analysis of RDF linked data.

Description

SPARQL query optimization method based on graph traversal

Technical Field

The invention relates to a method for executing SPARQL queries based on graph traversal, and in particular to a storage and query method and system for linked big data.

Background

Graph data mining and analysis is an emerging field of big data. By establishing association relations among linked World Wide Web resources, microbial strain resources, scientific research resources and the like, it supports information mining and scientific discovery based on data association. The Resource Description Framework (RDF) is a language for expressing information about World Wide Web resources; it can express information about anything that can be identified on the Web, such as page titles, authors and modification times, as well as associations between different data. The RDF specification provides a basic vocabulary for describing resources and defines the rules that domain applications, such as WDCM (World Data Centre for Microorganisms), must follow when describing their resource vocabularies. SPARQL (SPARQL Protocol and RDF Query Language) is a query language and data access protocol developed for RDF. It is defined over the RDF data model recommended by the W3C and can query any information resource that can be represented in RDF. SPARQL formally became a W3C Recommendation on January 15, 2008.

Because RDF data is structured (for example, serialized as XML), retrieval and query tools can interpret the precise meaning of the metadata, so search becomes more intelligent and accurate and the return of irrelevant results is largely avoided. An RDF file contains multiple resource descriptions. Each resource description consists of multiple statements, and each statement is a triple consisting of a resource, an attribute type and an attribute value, indicating that the resource has that attribute. The resource corresponds to the subject in natural language, the attribute type to the predicate, and the attribute value to the object; multiple RDF resource files together form a complete resource description and association graph. As linked networks grow larger and the data types they express and process become more varied, real-time RDF data storage and SPARQL querying are challenged. It is therefore important to adopt a new, scalable big data architecture to improve the efficiency of RDF data storage and management and to increase SPARQL query speed and analysis capability. Because of their strong large-scale data processing capability, graph data storage and management frameworks built on big data technologies such as Bigtable are a new direction for knowledge graph networks.

At present, RDF data is mainly stored and managed as RDF triples in relational database tables or key-value data warehouses. Subgraph matching and SPARQL queries over the subjects, predicates and objects of the triples are implemented by self-joins, and fast query and retrieval of local data is supported through hashes or indexes; a typical implementation is the Virtuoso RDF graph database. Distributed versions mainly adopt a federated mode that fuses RDF data query and distributed computation into a unified framework: the SPARQL query is parsed and distributed to each node, each node runs the subgraph matching computation, and the matching results of all nodes are then aggregated. This framework is simple and easy to implement, supports distributed query and fast return of RDF data, and simplifies design and development for large-scale knowledge association networks.

However, in the distributed query mode based on federation and subgraph matching, every query must be decomposed into subgraph matching tasks that are distributed to multiple nodes and executed there before the results are returned, which easily causes a large amount of inter-node communication and intermediate data. When the data size is very large, the system faces the following problems:

1) High-overhead self-join operations. In a distributed system, joining data tables causes a large amount of data communication between nodes. When the data volume and the number of machine nodes are large, the self-join cost is high, query latency increases noticeably, and the system is hard to scale horizontally.

2) A large amount of intermediate data. After a SPARQL query is decomposed, it is distributed to multiple nodes that run it separately; each node acts as a query engine and produces a large amount of intermediate data, which increases the system's memory consumption and reduces the number of SPARQL queries it can process in parallel.

3) High requirements on data partitioning. A federated query distributes the SPARQL query to each node for separate completion, so it demands high-quality data partitions. If a large number of associations cross partition boundaries, SPARQL subgraph matching cannot run in parallel on multiple data nodes, and the operating efficiency of the system drops.
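The self-join cost described in items 1) and 2) can be illustrated with a minimal single-machine sketch. It is not taken from any particular RDF store: the triple table, the literal "taxonomy" and the helper names match_pattern and self_join are hypothetical. Each triple pattern is first matched against the whole table, and the per-pattern results are then joined on the shared subject variable, so every pattern contributes an intermediate result set.

```python
from collections import defaultdict

# Hypothetical toy triple table: (subject, predicate, object).
TRIPLES = [
    ("tax1", "type", "taxonomy"),
    ("tax1", "x-taxon", "gene1"),
    ("tax1", "x-taxon", "gene2"),
    ("tax2", "type", "taxonomy"),
    ("tax2", "x-taxon", "gene3"),
    ("gene1", "type", "gene"),
]

def match_pattern(triples, pred, obj=None):
    """Scan the table for one pattern: ?s <pred> ?o, or ?s <pred> obj."""
    return [(s, o) for s, p, o in triples if p == pred and (obj is None or o == obj)]

def self_join(left, right):
    """Hash join of two pattern results on the shared subject variable."""
    index = defaultdict(list)
    for s, o in left:
        index[s].append(o)
    return [(s, lo, ro) for s, ro in right for lo in index.get(s, [])]

# Pattern 1: ?tax <type> "taxonomy"    Pattern 2: ?tax <x-taxon> ?gene
left = match_pattern(TRIPLES, "type", "taxonomy")   # intermediate result 1
right = match_pattern(TRIPLES, "x-taxon")           # intermediate result 2
joined = self_join(left, right)
intermediate_rows = len(left) + len(right)
```

In a distributed deployment the two intermediate result sets would generally live on different nodes, so the join step also implies network transfer in addition to the memory held by left and right.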

Because of these problems, federated SPARQL querying struggles to cope with the rapid growth of large-scale RDF linked data and to meet the real-time query requirements of knowledge-network applications; query time increases as the data size grows. At the same time, Bigtable-based data processing technology is difficult to apply to massive RDF knowledge-network linked data because it lacks large-scale table join operations (Join).

Disclosure of Invention

To address the problems of SPARQL querying over big RDF data, the invention aims to provide a SPARQL query optimization method based on graph traversal.

The technical scheme of the invention is as follows:

A SPARQL query optimization method based on graph traversal comprises the following steps:

1) representing the triples in the RDF data as a property graph, and then storing the RDF data using a Bigtable model to obtain Bigtable data corresponding to the RDF data;

2) converting the SPARQL query into a traversal of the RDF property graph;

3) traversing all nodes that satisfy the conditions in the Bigtable data according to the traversal sequence obtained in step 2), thereby completing the SPARQL query.

The method for storing RDF data using the Bigtable model comprises the following steps:

21) for each RDF triple (sub, pre, obj) in the RDF data, storing the subject sub as a node v, as one row in the Bigtable model;

22) determining the type of the object obj: a) if the object obj is an rdf:literal, i.e. the predicate pre is an attribute of the subject sub, taking the object obj as an attribute value of the node v and the predicate pre as the attribute name, and storing them as one storage unit (Cell) in the row of the node v; b) if the object obj is an rdf:resource, i.e. the predicate pre is an edge associating the subject sub with another node, taking the object obj as an independent node w and storing it as one row in the Bigtable model; then taking the predicate pre as an outgoing edge of the node v pointing to the node w and storing it as a Cell in the row of the node v, and taking the predicate pre as an incoming edge of the node w coming from the node v and storing it as a Cell in the row of the node w.
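Steps 21) and 22) can be sketched as follows, with a nested Python dictionary standing in for a Bigtable table (row key mapped to column/value cells). The column prefixes "prop:", "out:" and "in:", the function name store_triples and the boolean flag marking literal objects are assumptions made for illustration; the patent does not specify a cell naming scheme.

```python
from collections import defaultdict

def store_triples(triples):
    """Build a Bigtable-like table: row key -> {column name -> value}.

    Each input item is (sub, pre, obj, obj_is_literal). The column prefixes
    "prop:", "out:" and "in:" are an assumed encoding, not from the patent.
    """
    table = defaultdict(dict)
    for sub, pre, obj, obj_is_literal in triples:
        row_v = table[sub]                    # step 21: subject is node v, one row
        if obj_is_literal:
            row_v[f"prop:{pre}"] = obj        # step 22a: attribute cell in v's row
        else:
            row_w = table[obj]                # step 22b: object is node w, own row
            row_v[f"out:{pre}:{obj}"] = obj   # outgoing edge cell in v's row
            row_w[f"in:{pre}:{sub}"] = sub    # incoming edge cell in w's row
    return dict(table)

# Hypothetical sample data: one literal-valued triple and one resource-valued triple.
table = store_triples([
    ("tax1", "type", "taxonomy", True),
    ("tax1", "x-taxon", "gene1", False),
])
```

Storing each edge twice, once in the source row and once in the target row, lets a traversal follow the edge in either direction with a single row read. This sketch keeps only one value per attribute name; a multi-valued attribute would need a richer cell encoding.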

Further, in step 2), the SPARQL query is converted into a Gremlin graph traversal, thereby converting the SPARQL query into a traversal of the RDF property graph.

Further, the method for converting the SPARQL query into a Gremlin graph traversal is as follows: for each triple (sub, pre, obj) in the where clause of the SPARQL query, if the predicate pre represents an attribute in the property graph, the triple is converted into a filter has(pre, obj) on the node represented by the subject sub in the property graph, where obj is the filter condition and pre is the attribute name; if the predicate pre represents an edge in the property graph, the triple is converted into a traversal sub.out(pre) -> obj from the node represented by the subject sub to the node represented by the object obj, where out denotes an outgoing edge and pre is the label of that edge; all triples of the where clause are processed in this way to obtain the traversal sequence.
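The conversion rule can be sketched as a small function that maps each where-clause triple to either a has step or an out step. The tuple representation of steps, the function name to_traversal and the edge_predicates argument are hypothetical; in particular, how the set of edge predicates is obtained is simply assumed as an input here.

```python
def to_traversal(where_triples, edge_predicates):
    """Convert where-clause triples into a list of traversal steps.

    where_triples: list of (sub, pre, obj) strings; variables start with "?".
    edge_predicates: predicates that are edges in the property graph
    (assumed to be known in advance for this sketch).
    """
    steps = []
    for sub, pre, obj in where_triples:
        if pre in edge_predicates:
            steps.append(("out", sub, pre, obj))   # sub.out(pre) -> obj
        else:
            steps.append(("has", sub, pre, obj))   # has(pre, obj) on node sub
    return steps

# Hypothetical where clause: ?tax <type> "taxonomy" . ?tax <x-taxon> ?gene
steps = to_traversal(
    [("?tax", "type", "taxonomy"), ("?tax", "x-taxon", "?gene")],
    edge_predicates={"x-taxon"},
)
```

The resulting list is the traversal sequence: a mix of filters and associated edges in where-clause order.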

Further, the method for traversing all nodes that satisfy the conditions in the Bigtable data according to the traversal sequence obtained in step 2) is as follows: the traversal sequence consists of filters and associated edges; traversal paths are first formed from the associated edges, and invalid traversal paths are then eliminated by the filters to obtain the valid traversal paths; all nodes satisfying the conditions in the Bigtable data are traversed along the valid traversal paths, and the attribute values of the nodes and edges on the valid traversal paths are then organized.
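A minimal sketch of this execution order, edges first and filters second, is given below. It runs against an in-memory dictionary that stands in for Bigtable rows; the table contents, the "prop:" and "out:" column naming and all function names are assumptions for illustration. The sketch also assumes that every filtered variable appears on at least one edge.

```python
# Hypothetical Bigtable-like rows: row key -> {column name -> value}.
# The "prop:" / "out:" column naming is an assumed encoding for illustration.
TABLE = {
    "tax1": {"prop:type": "taxonomy", "out:x-taxon:gene1": "gene1"},
    "tax2": {"prop:type": "strain", "out:x-taxon:gene2": "gene2"},
    "gene1": {},
    "gene2": {},
}

def is_var(term):
    return term.startswith("?")

def out_neighbors(table, node, label):
    """Read one row and return the targets of its outgoing edges with this label."""
    prefix = f"out:{label}:"
    return [w for col, w in table.get(node, {}).items() if col.startswith(prefix)]

def bind(path, term, value):
    """Extend a path with term=value; return None if it contradicts the path."""
    if not is_var(term):
        return path if term == value else None
    if path.get(term, value) != value:
        return None
    return {**path, term: value}

def execute(table, steps):
    """Run ("out" | "has", sub, pre, obj) steps and return the valid paths."""
    edges = [s for s in steps if s[0] == "out"]
    filters = [s for s in steps if s[0] == "has"]
    paths = [{}]
    # 1) Form candidate traversal paths from the associated edges.
    for _, sub, pre, obj in edges:
        extended = []
        for path in paths:
            if not is_var(sub):
                starts = [sub]
            elif sub in path:
                starts = [path[sub]]
            else:
                starts = list(table)
            for v in starts:
                for w in out_neighbors(table, v, pre):
                    new_path = bind(path, sub, v)
                    if new_path is not None:
                        new_path = bind(new_path, obj, w)
                    if new_path is not None:
                        extended.append(new_path)
        paths = extended

    # 2) Eliminate invalid paths with the filters.
    def passes(path, sub, pre, obj):
        node = path.get(sub) if is_var(sub) else sub
        return table.get(node, {}).get(f"prop:{pre}") == obj

    return [p for p in paths if all(passes(p, s, pre, o) for _, s, pre, o in filters)]

result = execute(TABLE, [
    ("has", "?tax", "type", "taxonomy"),
    ("out", "?tax", "x-taxon", "?gene"),
])
```

Each step reads rows directly by key, so the only state carried between steps is the list of partial paths; no per-pattern result tables are materialized and joined.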

The method comprises three parts: Bigtable data storage, conversion of SPARQL into a Gremlin traversal, and execution of the Gremlin graph traversal.

Their functions are described below:

1) Storing RDF data based on the Bigtable model

RDF is a Web-oriented graph data format that represents graph data with <subject, predicate, object> triples. The subject represents a graph node. The object falls into two main cases, rdf:literal and rdf:resource. In the former case the object is an attribute value of the subject node and the predicate is the attribute name; in the latter case the object is another node associated with the subject node, the predicate identifies the edge from the subject to that node, and the label of the edge is the predicate. The property graph corresponding to an RDF graph is shown in FIG. 1(a), and its Bigtable representation is shown in FIG. 1(b).

The invention uses Bigtable to store RDF graphs. Each RDF triple, consisting of a subject, a predicate and an object, is structurally analyzed according to its characteristics and the RDF model (i.e., the RDF ontology) as follows:

For an RDF triple "sub pre obj", the subject sub is stored as a node v, as one row in the Bigtable model.

If the object obj is an rdf:literal, obj is taken as an attribute value of the node v and the predicate pre as the attribute name, stored as one Cell (storage unit) in the row of v. Otherwise, obj is taken as an independent node w and stored as a row in the Bigtable. The predicate pre is taken as an outgoing edge of the subject node v pointing to the node w and is stored as a Cell in the row of v; the predicate pre is also taken as an incoming edge of the object node w coming from the node v (i.e. the source of the incoming edge is the node v) and is stored as a Cell in the row of w.

Literal is one type of object; object types are determined by parsing with Apache Jena or other RDF tools. Bigtable is a storage structure for massive data and can efficiently support the storage and management of massive graph data.

2) Converting SPARQL queries into Gremlin graph traversals

As shown in FIG. 2, a SPARQL query expresses subgraph matching with the sequence of RDF triples in its where clause. The invention implements SPARQL subgraph matching by graph traversal, converting the triple sequence in the where clause into filters and traversals over the nodes and edges of the graph. Each triple "sub pre obj" is handled as follows. A triple whose predicate pre represents an attribute is converted into a filter has(pre, obj) on the node represented by sub in the property graph, where obj is the filter condition and pre is the attribute name. A triple whose predicate pre represents an edge in the property graph is converted into a traversal from the node represented by sub to the node represented by obj; the Gremlin code is sub.out(pre) -> obj, where out denotes an outgoing edge and pre is the label of that edge.

All triples of the where clause are processed in this way to obtain a traversal sequence consisting of filters and associated edges.

3) Graph traversal execution

For the traversal sequence obtained in step 2), all nodes satisfying the conditions are traversed in the Bigtable data produced in step 1). The associated edges form the traversal paths, and the filters eliminate invalid traversal paths. After the traversal finishes, the attribute values of the nodes and edges on the valid traversal paths are organized and returned as the user requires. Because Bigtable is accessed directly, the query process generates essentially no intermediate data. Here, RDF data refers to the graph data representation defined by the W3C, which uses triples to represent data and the associations between data; subject, predicate and object are the standard data structure of RDF data. Gremlin refers to a standard language for traversing graph data.

The beneficial effects of the invention are as follows:

To address the poor scalability and low query efficiency of existing large-scale RDF graph data storage, a Bigtable-based RDF graph data storage and query method is provided. Converting RDF graph data into the data format of the Bigtable model supports horizontal scaling for large-scale graph data. Converting SPARQL queries into graph-traversal-based Bigtable data access avoids the high cost and poor scalability caused by RDF data joins and reduces the intermediate data generated during querying. Moreover, because the SPARQL query process becomes a series of accesses to Bigtable, caching can be used to reduce the number of accesses.
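The patent does not say how the cache is implemented. As one possible illustration, the sketch below places an in-process LRU cache in front of row reads, so that repeated visits to the same node during a traversal reach the store only once. The row store, the counter and the function names are hypothetical stand-ins for a real Bigtable client.

```python
from functools import lru_cache

# Hypothetical stand-in for a Bigtable row store.
_ROWS = {
    "tax1": {"prop:type": "taxonomy"},
    "gene1": {"in:x-taxon:tax1": "tax1"},
}
backend_reads = 0

def _read_row_from_store(row_key):
    """Simulate one round trip to the store and count it."""
    global backend_reads
    backend_reads += 1
    return _ROWS.get(row_key, {})

@lru_cache(maxsize=1024)
def read_row(row_key):
    """Return a row, served from the in-process LRU cache when possible."""
    return _read_row_from_store(row_key)

# Four row reads during a traversal, but only two distinct rows.
for key in ["tax1", "gene1", "tax1", "tax1"]:
    read_row(key)
```

A real deployment would also need an invalidation policy when rows are updated, which this sketch leaves out.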

The method solves the problems of distributed storage and low-latency query for massive RDF data. On the one hand, it removes the dependence of traditional SPARQL query processing on hash-based data structures, reduces the generation of intermediate data, and avoids join computation over large-scale RDF data; on the other hand, it makes effective use of Bigtable-based big data processing technology to store and manage massive RDF linked knowledge-network data, and accelerates the query and analysis of RDF linked data using caching and indexing techniques.

Drawings

FIG. 1 shows the storage of RDF data based on the Bigtable model;

(a) the property graph of the RDF data, (b) the RDF data graph stored in the Bigtable model;

FIG. 2 is a flow diagram of graph-traversal-based SPARQL query execution;

FIG. 3 is a graph comparing the small data set and the large data set.

Detailed Description

In order to make the aforementioned and other features and advantages of the invention more comprehensible, embodiments accompanied with figures are described in detail below.

As shown in FIG. 2, the graph-traversal-based SPARQL execution engine consists of Bigtable data storage, SPARQL-to-Gremlin conversion and graph traversal execution. Based on whether the object of an RDF triple is an rdf:literal or an rdf:resource, the association relations and literal values of the RDF triples are represented as a property graph, the RDF data is stored and managed with the Bigtable data model, and SPARQL query and analysis of the RDF linked data is implemented by graph traversal.

At present, data warehouses for RDF data store and manage RDF knowledge-network data with the triple as the basic unit and rely on table self-joins for subgraph matching, and thereby implement SPARQL query and analysis of RDF data.

The invention stores and manages RDF data in Bigtable, converts SPARQL query and analysis into traversal of the RDF property graph, and completes SPARQL queries through accesses to Bigtable. This achieves distributed scaling for massive RDF data and avoids the adverse effect of join operations on SPARQL queries. The Bigtable storage of RDF data shown in FIG. 1 and the SPARQL-to-Gremlin conversion, both designed to support fast retrieval and analysis of RDF data, are described below:

Bigtable data storage: as shown in FIG. 1, RDF triple data is represented as the property graph of FIG. 1(a). For the triple "<tax1> <type> <tax>", the object type is rdf:literal, so the object is converted into an attribute value of the node represented by the subject tax1, with the predicate as the attribute name. For the triple "<tax1> <x-taxon> <gene1>", the object is an rdf:resource, so the triple is converted into an edge pointing from the node represented by the subject tax1 to the node represented by the object gene1, with the predicate as the label of the edge. In this way every RDF triple is mapped onto the property graph, and the Bigtable data structure is used to store the attributes and edges.

SPARQL to Gremlin: the triples of the where clause in the SPARQL query are converted into a traversal of the property graph. SPARQL is a query language for RDF data, and Gremlin is a traversal language for property graphs.

For the where-clause triple "?tax <type> <tax index>", type represents an attribute in the property graph, so the triple is converted into the Gremlin filter step has(type, tax index) on the attribute type of the node. Executing this step filters the Bigtable data by tax index, which narrows the traversal scope and improves traversal efficiency.

For the triple "?tax <x-taxon> ?", x-taxon is an edge between nodes in the property graph, so the triple is converted into the Gremlin traversal step out(x-taxon) over the edges of the node represented by tax; the step is executed by accessing the Bigtable data.
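Putting the two example steps together, the sketch below renders a step list as Gremlin text in the g.V().has(...).out(...) form. The function render_gremlin and its tuple format are hypothetical and are not the project team's converter; the attribute value 'tax index' is copied from the example above as it appears in the text. Quote characters inside names or values are not escaped in this sketch.

```python
def render_gremlin(steps):
    """Render ("has", key, value) and ("out", label) steps as Gremlin text."""
    parts = ["g.V()"]
    for step in steps:
        if step[0] == "has":
            parts.append(f"has('{step[1]}', '{step[2]}')")
        elif step[0] == "out":
            parts.append(f"out('{step[1]}')")
        else:
            raise ValueError(f"unsupported step: {step[0]}")
    return ".".join(parts)

# Filter on the attribute first, then follow the x-taxon edge.
query = render_gremlin([("has", "type", "tax index"), ("out", "x-taxon")])
```

Placing the has step before the out step mirrors the description above: the filter narrows the set of starting nodes before any edge is followed.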

To test the invention, a small data set on the scale of 300 million and a large data set on the scale of 3 billion from WDCM, a microorganism linked data set, were selected together with 16 standard SPARQL queries. A specific implementation of the graph-traversal-based, big-data-oriented SPARQL execution engine is given. Each query was repeated 10 times, and the maximum and minimum values were discarded.
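The measurement protocol can be sketched as follows. The text states only that each query is repeated 10 times and the maximum and minimum are removed; averaging the remaining values is an assumption of this sketch, and the function names are hypothetical.

```python
import time

def trim_and_average(timings):
    """Drop the largest and smallest timing, then average the rest (assumed)."""
    if len(timings) < 3:
        raise ValueError("need at least 3 timings to drop the max and min")
    kept = sorted(timings)[1:-1]
    return sum(kept) / len(kept)

def trimmed_mean_runtime(run_query, repeats=10):
    """Time run_query `repeats` times and summarize with trim_and_average."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return trim_and_average(timings)
```

For example, trimmed_mean_runtime(lambda: run_sparql(query)) would time a hypothetical run_sparql callable that submits one query to the engine.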

The test environment is an HBase cluster of 4 nodes supporting the BigTable model; the HBase version is 0.98.23.hadoop1. Each node has 32 GB of memory, a 12-core CPU and a 28 TB disk, and the nodes are interconnected through a gigabit switch. The Gremlin query and analysis engine is Titan 1.0.0, and the SparQLToGremlin conversion was developed by the project team.

The query run times obtained with the system of the invention are as follows:

As shown in FIG. 3, for the small data set on the scale of 300 million, the time required per query is around 1 s.

For the large data set of 3 billion triples, the graph traversal load is effectively spread over multiple nodes, so the time of each of the 16 query statements is less than 1 s, which is better than the query time on the small data set.

The experimental results show that, for continuously growing RDF data, the method effectively exploits the advantages of Bigtable distributed data storage and of graph traversal queries and keeps query time approximately constant (about 1 s or less in these tests), and thus addresses the problem that SPARQL query time increases markedly when RDF data grows on a large scale.

The above embodiments only illustrate the technical solution of the invention and do not limit it. A person skilled in the art may modify the technical solution or substitute equivalents without departing from the spirit and scope of the invention, and the scope of protection of the invention is determined by the claims.

Claims

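1. A SPARQL query optimization method based on graph traversal, comprising the following steps:

1) representing the triples in the RDF data as a property graph, and then storing the RDF data using a Bigtable model to obtain Bigtable data corresponding to the RDF data;

2) converting the SPARQL query into a traversal of the RDF property graph;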

3) traversing all nodes that satisfy the conditions in the Bigtable data according to the traversal sequence obtained in step 2), thereby completing the SPARQL query;

wherein the method for storing RDF data using the Bigtable model comprises: 21) for each RDF triple (sub, pre, obj) in the RDF data, storing the subject sub as a node v, as one row in the Bigtable model; 22) determining the type of the object obj: a) if the object obj is an rdf:literal, i.e. the predicate pre is an attribute of the subject sub, taking the object obj as an attribute value of the node v and the predicate pre as the attribute name, and storing them as one storage unit (Cell) in the row of the node v; b) if the object obj is an rdf:resource, i.e. the predicate pre is an edge associating the subject sub with another node, taking the object obj as an independent node w and storing it as one row in the Bigtable model; then taking the predicate pre as an outgoing edge of the node v pointing to the node w and storing it as a Cell in the row of the node v, and taking the predicate pre as an incoming edge of the node w coming from the node v and storing it as a Cell in the row of the node w.

2. The method of claim 1, wherein in step 2), the SPARQL query is converted into a Gremlin graph traversal, thereby converting the SPARQL query into a traversal of the RDF property graph.

3. The method of claim 2, wherein the method for converting the SPARQL query into a Gremlin graph traversal is: for each triple (sub, pre, obj) in the where clause of the SPARQL query, if the predicate pre represents an attribute in the property graph, converting the triple into a filter has(pre, obj) on the node represented by the subject sub in the property graph, where obj is the filter condition and pre is the attribute name; if the predicate pre represents an edge in the property graph, converting the triple into a traversal sub.out(pre) -> obj from the node represented by the subject sub to the node represented by the object obj, where out denotes an outgoing edge and pre is the label of that edge; and processing all triples of the where clause to obtain the traversal sequence.

4. The method of claim 1, wherein the method for traversing all nodes that satisfy the conditions in the Bigtable data according to the traversal sequence obtained in step 2) is: the traversal sequence consists of filters and associated edges; traversal paths are first formed from the associated edges, and invalid traversal paths are then eliminated by the filters to obtain the valid traversal paths; all nodes satisfying the conditions in the Bigtable data are traversed along the valid traversal paths, and the attribute values of the nodes and edges on the valid traversal paths are then organized.