CN107291807A - A SPARQL query optimization method based on graph traversal

Info

Publication number: CN107291807A
Application number: CN201710343003.9A
Authority: CN
Inventors: 李亮; 沈志宏; 周园春; 黎建辉; 朱小杰; 刘东江; 李跃鹏
Original assignee: Computer Network Information Center of CAS
Current assignee: Computer Network Information Center of CAS
Priority date: 2017-05-16
Filing date: 2017-05-16
Publication date: 2017-10-24
Anticipated expiration: 2037-05-16
Also published as: CN107291807B

Abstract

The invention discloses a SPARQL query optimization method based on graph traversal. The method is: 1) normalize the triples in the RDF data as a property graph, then store the RDF data using the Bigtable model to obtain the Bigtable data corresponding to the RDF data; 2) convert the SPARQL query into a traversal of the RDF property graph; 3) following the traversal sequence obtained in step 2), traverse all nodes in the Bigtable data that satisfy the conditions, completing the SPARQL query. On the one hand, the invention removes the dependence of traditional SPARQL queries on data structures such as hashes, reduces the generation of intermediate data, and avoids join computation over large-scale RDF data; on the other hand, it can make effective use of Bigtable-based big data processing technology to store and manage massive RDF linked knowledge network data, accelerating the query and analysis of RDF linked data.

Description

A SPARQL query optimization method based on graph traversal

Technical field

The present invention relates to a SPARQL query execution method based on graph traversal, and in particular to a method and system for the storage and querying of linked big data.

Background art

Graph data mining and analysis is a frontier field of big data. By establishing associations among World Wide Web resources, microbial strain resources, scientific research resources, and the like, it supports information mining and scientific discovery based on data association. The Resource Description Framework (RDF) is a language for expressing information about resources on the World Wide Web. It can express information about anything that can be identified on the Internet, such as the title, author, and modification time of a web page, and the associations between different data. The RDF specification provides a basic vocabulary for describing resources and defines the rules that each application domain, such as the microbial resource description of WDCM (MIRCEN World Data Centre for Microorganisms), must follow when defining its resource description vocabulary. SPARQL (SPARQL Protocol and RDF Query Language) is a query language and data access protocol developed for RDF. It is defined on the RDF data model recommended by the W3C and can be used to query any information resource that can be represented in RDF. SPARQL officially became a W3C Recommendation on January 15, 2008.

Because RDF uses structured XML data, retrieval and querying can understand the precise meaning of the metadata, which makes search more intelligent and accurate and effectively avoids the situation where retrieval often returns irrelevant data. An RDF file contains a number of resource descriptions. Each resource description consists of a number of statements, and each statement is a triple composed of a resource, a property type, and a property value, expressing that the resource has a property. The resource corresponds to the subject in natural language, the property type to the predicate, and the property value to the object; multiple RDF resource files together form a complete resource description and association graph. As the data scale of linked networks keeps growing and the data types they express and process keep multiplying, both the storage of RDF data and the real-time performance of SPARQL queries are challenged. It is therefore very important to use a scalable new big data architecture to improve the storage and management efficiency of RDF data and to improve SPARQL query speed and analysis capability. Graph data storage and management frameworks based on big graph data processing technology, such as Bigtable, have become a new development direction for knowledge graph networks because of their outstanding large-scale data processing capability.

At present, RDF data is mainly stored and managed as RDF triples in relational database tables or key-value (KV) data warehouses. Subgraph matching and SPARQL queries over the subject, predicate, and object of RDF triples are implemented through self-joins, and hashes or indexes support fast query and retrieval of local data; a typical implementation is the Virtuoso RDF graph database. Distributed versions mainly adopt a federated approach, which merges RDF data querying and a distributed computing framework into one unified framework. Specifically, the SPARQL query is parsed and distributed to each node, each node runs the subgraph matching computation, and the matching results of the nodes are then collected. This framework is simple and easy to implement, supports distributed querying of RDF data with fast return, and simplifies the design and development of fairly large-scale knowledge association networks.

However, in the distributed query approach based on federation and subgraph matching, every SPARQL query must be decomposed into subgraph matches and distributed to multiple nodes, which run the subgraph matching and return the results. This easily causes a large amount of inter-node communication and intermediate data. When the data scale is very large, the system faces the following problems:

1) High self-join overhead. In a distributed system, join operations on data tables cause a large amount of data communication between system nodes. When the data volume is large and there are many machine nodes, the self-join overhead is large and query latency increases noticeably, which hinders horizontal scaling of the system.

2) A large amount of intermediate data. After decomposition, the SPARQL query is distributed to multiple nodes that run separately. Each node is equivalent to a query engine, and its operation produces a large amount of intermediate data, which increases the memory consumption of the system and reduces the number of SPARQL queries the system can process in parallel.

3) High requirements on data sharding. A federated query distributes the SPARQL query to the nodes, each of which completes its part separately, so the quality requirement on data sharding is high. If there are many associations between different partitions, SPARQL subgraph matching cannot be executed in parallel on multiple data nodes, which reduces the operating efficiency of the system.

Because of these problems, SPARQL querying based on the federated approach has difficulty coping with the large-scale growth of RDF linked data and meeting the real-time query demands of knowledge network applications; query time increases as the data scale grows. Meanwhile, Bigtable-based data processing technology lacks join operations (Join) on large tables, which makes it difficult to apply to the processing of massive RDF knowledge network linked data.

Summary of the invention

In view of the problems of SPARQL querying over big RDF data, the object of the invention is to provide a SPARQL query optimization method based on graph traversal.

The technical scheme of the invention is:

A SPARQL query optimization method based on graph traversal, the steps of which are:

1) normalize the triples in the RDF data as a property graph, then store the RDF data using the Bigtable model to obtain the Bigtable data corresponding to the RDF data;

2) convert the SPARQL query into a traversal of the RDF property graph;

3) following the traversal sequence obtained in step 2), traverse all nodes in the Bigtable data that satisfy the conditions, completing the SPARQL query.

The method of storing the RDF data using the Bigtable model is:

21) for each RDF triple (sub, pre, obj) in the RDF data, take the subject sub as a node v and store it as a row in the Bigtable model;

22) determine the type of the object obj: a) if the object obj is an rdf:Literal, i.e. the predicate pre is a property of the subject sub, take the object obj as a property value of node v and the predicate pre as the corresponding property name of node v, and store them as a storage cell (Cell) in the row of node v; b) if the object obj is an rdf:Resource, i.e. the predicate pre is an edge associating the subject sub with another node, store the object obj as an independent node w in its own row of the Bigtable model; then take the predicate pre as an outgoing edge of node v pointing to node w and store it as a Cell in the row of node v, and take the predicate pre as an incoming edge of node w coming from node v and store it as a Cell in the row of node w.
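
The storage rule in steps 21) and 22) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: an in-memory dictionary stands in for Bigtable, and the function name `store_triples`, the explicit `obj_is_literal` flag, and the column prefixes `prop:`, `out:`, and `in:` are all hypothetical choices made for the example.

```python
from collections import defaultdict


def store_triples(triples):
    """Map RDF triples onto a Bigtable-like table: row key -> {column: value}.

    Each triple is (sub, pre, obj, obj_is_literal). The flag and the column
    naming scheme are assumptions of this sketch, not part of the source.
    """
    table = defaultdict(dict)
    for sub, pre, obj, obj_is_literal in triples:
        row_v = table[sub]  # step 21: the subject is node v, stored as one row
        if obj_is_literal:
            # step 22a: predicate is the property name, object the property value
            # (a repeated property name would overwrite the earlier value here)
            row_v[f"prop:{pre}"] = obj
        else:
            # step 22b: the object is an independent node w with its own row
            row_w = table[obj]
            row_v[f"out:{pre}:{obj}"] = obj  # outgoing edge of v, pointing to w
            row_w[f"in:{pre}:{sub}"] = sub   # incoming edge of w, coming from v
    return dict(table)


rows = store_triples([
    ("tax1", "type", "taxNode", True),
    ("tax1", "x-taxon", "gene1", False),
])
```

Because each edge is written into both the row of v and the row of w, a traversal in this sketch can follow the edge in either direction by reading a single row.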

Further, in step 2), the SPARQL query is converted into a Gremlin graph traversal, thereby converting the SPARQL query into a traversal of the RDF property graph.

Further, the method of converting the SPARQL query into a Gremlin graph traversal is: for each triple (sub, pre, obj) in the where clause of the SPARQL query, if the predicate pre of the triple represents a property in the property graph, the triple is converted into the filter has(pre, obj) on the node represented by the subject sub, where obj is the filter condition and pre is the property name represented by the predicate pre; if the predicate pre of the triple represents an edge in the property graph, the triple is converted into the traversal sub.out(pre)->obj from the node represented by the subject sub to the node represented by the object obj, where out denotes an outgoing edge and pre is the label of the edge represented by the predicate pre. Processing all triples of the where clause yields the traversal sequence.
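
A minimal Python sketch of this conversion rule follows. It only produces textual step descriptions in the notation used above, not executable Gremlin. The function name `convert_where`, the `edge_predicates` set used to decide whether a predicate is an edge or a property, and the `?` prefix on variable names are assumptions made for illustration.

```python
def convert_where(patterns, edge_predicates):
    """Turn WHERE-clause triple patterns into a list of Gremlin-style steps.

    patterns: iterable of (sub, pre, obj) strings.
    edge_predicates: predicates that label edges in the property graph
    (hypothetical input; any other predicate is treated as a property name).
    """
    steps = []
    for sub, pre, obj in patterns:
        if pre in edge_predicates:
            # predicate represents an edge: traverse from sub's node to obj's node
            steps.append(f"{sub}.out('{pre}') -> {obj}")
        else:
            # predicate represents a property: filter sub's node by its value
            steps.append(f"{sub}.has('{pre}', '{obj}')")
    return steps


sequence = convert_where(
    [("?tax", "type", "taxNode"), ("?tax", "x-taxon", "?gene")],
    edge_predicates={"x-taxon"},
)
```

The resulting list mixes filters and edge steps in the order of the where clause, which corresponds to the traversal sequence described above.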

Further, the method of traversing all nodes in the Bigtable data that satisfy the conditions, according to the traversal sequence obtained in step 2), is: the traversal sequence is a sequence composed of filters and associated edges; first, traversal paths are formed from the associated edges, and the filters are then used to eliminate invalid traversal paths, giving the valid traversal paths; all nodes in the Bigtable data that satisfy the conditions are then traversed along the valid traversal paths, and the property values of the nodes and edges on the valid traversal paths are organized.
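
The execution order described here (edges form candidate paths, filters prune them, then property values are collected) can be sketched in Python as below. The sketch assumes a single edge step and equality filters on the source node only; the row layout with `props` and `out` keys, the function name `run_traversal`, and the sample data are hypothetical and stand in for real Bigtable rows.

```python
def run_traversal(rows, filters, edge_label):
    """Evaluate one edge step plus property filters over Bigtable-like rows."""
    # 1) Associated edges compose the candidate traversal paths.
    candidates = [
        (node, target)
        for node, row in rows.items()
        for target in row["out"].get(edge_label, [])
    ]
    # 2) Filters eliminate invalid paths (here: filters on the source node).
    valid = [
        (src, dst)
        for src, dst in candidates
        if all(rows[src]["props"].get(name) == value for name, value in filters)
    ]
    # 3) Organize the property values of the nodes on the valid paths.
    return [
        {"source": src, "target": dst,
         "source_props": rows[src]["props"], "target_props": rows[dst]["props"]}
        for src, dst in valid
    ]


rows = {
    "tax1": {"props": {"type": "taxNode"}, "out": {"x-taxon": ["gene1"]}},
    "tax2": {"props": {"type": "other"}, "out": {"x-taxon": ["gene2"]}},
    "gene1": {"props": {}, "out": {}},
    "gene2": {"props": {}, "out": {}},
}
paths = run_traversal(rows, [("type", "taxNode")], "x-taxon")
```

Every step reads rows directly by key, so no join between tables is needed in this sketch; a real engine would likely apply filters as early as possible to narrow the set of rows it reads.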

The invention comprises Bigtable data storage, SPARQL-to-Gremlin traversal conversion, and Gremlin graph traversal execution.

These functions are described as follows:

1) Storing RDF data based on the Bigtable model

RDF is a graph data format for the Internet that represents graph data with <subject, predicate, object> triples. The subject (Subject) represents a graph node; the predicate is the name of a property of the subject node sub, and the corresponding object obj is the property value. The object (Object) mainly covers two cases, rdf:Literal and rdf:Resource. The former is a property value of the subject node, with the predicate as the property name; the latter represents another node with which the subject node is associated, where the predicate identifies the edge from the subject to that node and the edge label is the predicate. The property graph corresponding to an RDF graph is shown in Fig. 1(a), and its Bigtable representation is shown in Fig. 1(b).

The invention stores the RDF graph using Bigtable. An RDF triple, consisting of subject (Subject), predicate (Predicate), and object (Object), is parsed according to its characteristics and the RDF model (i.e. the RDF ontology) as follows:

For an RDF triple "sub pre obj.", the subject sub is taken as node v and stored as a row in the Bigtable model.

If the object obj is an rdf:Literal, the object obj is taken as a property value of node v with the predicate pre as the property name, and stored as a Cell (storage cell) in the row of v. Otherwise, obj is taken as an independent node w and stored as a row in Bigtable. The predicate pre is taken as an outgoing edge of the subject node v, pointing to node w, and stored as a Cell in the row of node v; the predicate pre is also taken as an incoming edge of the object node w, coming from node v (i.e. the starting point of the incoming edge is node v), and stored as a Cell in the row of node w.

Here, rdf:Literal is one type of object, identified by parsing with Apache Jena or another RDF tool. Bigtable is a storage structure for massive data that can efficiently support the storage and management of massive graph data.

2) Converting SPARQL queries into Gremlin graph traversals

As shown in Fig. 2, a SPARQL query expresses subgraph matching with the sequence of RDF triples in its where clause. The invention implements SPARQL subgraph matching by graph traversal, converting the triple sequence in the where clause of the SPARQL query into filters and traversals over the nodes and edges of the graph. Each "sub pre obj." triple in the where clause is converted as follows. A triple whose predicate pre represents a property in the property graph is converted into the filter has(pre, obj) on the node represented by sub, where obj is the filter condition and pre is the property name represented by the predicate pre. A triple whose predicate pre represents an edge in the property graph is converted into a traversal from the node represented by sub to the node represented by obj; its Gremlin code is sub.out(pre)->obj, where out denotes an outgoing edge and pre is the label of the edge represented by the predicate pre.

Processing all triples of the where clause yields a traversal sequence composed of filters and associated edges.

3) Executing the graph traversal

Following the traversal sequence obtained in step 2), all nodes satisfying the conditions are traversed in the Bigtable data obtained in step 1) by storing the RDF data based on the Bigtable model. The associated edges form the traversal paths, and the filters are used to eliminate invalid traversal paths. After the traversal completes, the property values of the nodes and edges on the valid traversal paths are organized and returned as the user requested. Because Bigtable is accessed directly, the query process generates essentially no intermediate data. RDF data here refers to the graph data representation defined by the W3C, which uses triples to represent data and the associations between data; subject, predicate, and object refer to the standard data structure of RDF data. Gremlin refers to a standard language for traversing graph data.

The beneficial effects of the invention are:

To address the poor scalability and low query efficiency of current large-scale RDF graph data storage, the invention proposes a Bigtable-based method for storing and querying RDF graph data. Converting RDF graph data into the data format of the Bigtable model supports horizontal scaling of large-scale graph data. Converting SPARQL queries into graph-traversal-based accesses to Bigtable data avoids the high overhead and poor scalability caused by joins over RDF data and reduces the intermediate data generated during querying. Moreover, because the SPARQL query process is converted into accesses to Bigtable, caching can be used to reduce the number of accesses.

The invention solves the problem of distributed storage and low-latency querying of massive RDF data. On the one hand, it removes the dependence of traditional SPARQL queries on data structures such as hashes, reduces the generation of intermediate data, and avoids join computation over large-scale RDF data; on the other hand, it can make effective use of Bigtable-based big data processing technology to store and manage massive RDF linked knowledge network data, and uses caching and indexing techniques to accelerate the query and analysis of RDF linked data.

Brief description of the drawings

Fig. 1 shows the storage of RDF data based on the Bigtable model;

(a) property graph of the RDF data, (b) RDF data represented in the Bigtable model;

Fig. 2 is a flow chart of SPARQL query execution based on graph traversal;

Fig. 3 compares the small data set with the large data set.

Detailed description

To make the above features and advantages of the invention easier to understand, embodiments are described in detail below with reference to the accompanying drawings.

As shown in Fig. 2, a SPARQL execution engine based on graph traversal consists of Bigtable data storage, SPARQL-to-Gremlin conversion, and graph traversal execution. According to the rdf:Literal and rdf:Resource characteristics of the objects of RDF triples, the invention uses a property graph to normalize the associations and literal values of RDF triples, uses the Bigtable data model to store and manage the RDF data, and uses graph traversal to implement SPARQL querying and analysis of RDF linked data.

Current data warehouses for RDF data store and manage RDF knowledge network data with the triple as the basic unit and rely on table self-joins to implement subgraph matching, thereby implementing SPARQL querying and analysis of RDF data.

The invention instead uses Bigtable to store and manage RDF data, converts SPARQL querying and analysis into traversal of the RDF property graph, and completes SPARQL queries through accesses to Bigtable. In this way, the invention achieves distributed scaling for massive RDF data and avoids the adverse effect of join operations on SPARQL queries. The following describes how the Bigtable storage of RDF data and the SPARQL-to-Gremlin conversion, designed as shown in Fig. 1, support fast retrieval and analysis of RDF data:

Bigtable data storage: as shown in Fig. 1, RDF triple data is represented as the property graph shown in Fig. 1(a). For example, in "<tax1><type><taxNode>" the object is of type rdf:Literal, so the triple is converted into a property value of the node represented by the subject tax1, with the predicate as the property name. In "<tax1><x-taxon><gene1>" the object is an rdf:Resource, so the triple is converted into an edge from the node represented by the subject tax1 to the node gene1 represented by the object, with the predicate as the edge label. All RDF triples are mapped to the property graph in this way, and the properties and edges are stored in the Bigtable data structure.

SPARQL to Gremlin: the triples of the where clause of a SPARQL query are converted into a property graph traversal, where SPARQL is the query language for RDF data and Gremlin is a traversal language for property graphs.

For the where clause triple "tax <type> <taxNode>", because type represents a property in the property graph, the pattern is converted into the Gremlin filter step has(type, taxNode) on the node property type. Its execution filters the Bigtable data by taxNode, which narrows the traversal scope and improves traversal efficiency.

For "tax <x-taxon> gene.", because x-taxon is an edge between nodes in the property graph, the pattern is converted into the Gremlin traversal step out(x-taxon) over the outgoing edges of the node represented by tax, and the traversal is carried out through accesses to the Bigtable data.

The invention was tested on the WDCM microbial linked data set, using a small data set at the 300 million scale, a large data set at the 3 billion scale, and 16 standard SPARQL queries. This provides a concrete implementation of the graph-traversal-based SPARQL execution engine for big data proposed in the invention. Each query was repeated 10 times, and the maximum and minimum values were removed.
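
The measurement procedure (repeat each query, drop the largest and smallest timings) can be sketched in Python as below. The source does not say how the remaining runs are aggregated, so averaging them is an assumption of this sketch; the function name `trimmed_mean` and the sample timings are hypothetical, not measured results.

```python
def trimmed_mean(times):
    """Average the run times after removing one maximum and one minimum."""
    if len(times) < 3:
        raise ValueError("need at least three runs to drop max and min")
    kept = sorted(times)[1:-1]  # discard the smallest and the largest value
    return sum(kept) / len(kept)


# Hypothetical timings in seconds, used only to exercise the function.
sample = [1.0, 2.0, 3.0, 4.0, 100.0]
result = trimmed_mean(sample)
```

Dropping the extremes limits the influence of outliers such as a cold-cache first run on the reported time.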

The test environment is a 4-node HBase cluster supporting Bigtable; the HBase version is 0.98.23.hadoop1. Each node has 32 GB of memory, a 12-core CPU, and 28 TB of disk, and the nodes are interconnected through a 10-gigabit switch. The Gremlin query and analysis engine is Titan 1.0.0, and the SparQLToGremlin conversion was developed by the project team.

The query run times obtained with the system of the invention are as follows:

As shown in Fig. 3, for the small data set at the 300 million scale, each query takes about 1 s.

For the large data set of 3 billion triples, because the graph traversal load is effectively spread across multiple nodes, all 16 queries took less than 1 s, which is actually better than the query times on the small data set.

The test results show that, for ever-growing RDF data, the invention can make effective use of the advantages of Bigtable distributed data storage and of graph traversal querying to keep query time constant, which solves well the problem that SPARQL query time increases markedly as RDF data grows on a large scale.

The above embodiments merely illustrate the technical scheme of the invention and do not limit it. A person of ordinary skill in the art may modify the technical scheme of the invention or replace it with equivalents without departing from the spirit and scope of the invention; the scope of protection of the invention shall be as defined in the claims.

Claims

1. A SPARQL query optimization method based on graph traversal, the steps of which are:

1) normalize the triples in the RDF data as a property graph, then store the RDF data using the Bigtable model to obtain the Bigtable data corresponding to the RDF data;

2) convert the SPARQL query into a traversal of the RDF property graph;

2. The method of claim 1, characterized in that the method of storing the RDF data using the Bigtable model is:

21) for each RDF triple (sub, pre, obj) in the RDF data, take the subject sub as a node v and store it as a row in the Bigtable model;

22) determine the type of the object obj: a) if the object obj is an rdf:Literal, i.e. the predicate pre is a property of the subject sub, take the object obj as a property value of node v and the predicate pre as the corresponding property name of node v, and store them as a storage cell (Cell) in the row of node v; b) if the object obj is an rdf:Resource, i.e. the predicate pre is an edge associating the subject sub with another node, store the object obj as an independent node w in its own row of the Bigtable model; then take the predicate pre as an outgoing edge of node v pointing to node w and store it as a Cell in the row of node v, and take the predicate pre as an incoming edge of node w coming from node v and store it as a Cell in the row of node w.

3. The method of claim 1 or 2, characterized in that, in step 2), the SPARQL query is converted into a Gremlin graph traversal, thereby converting the SPARQL query into a traversal of the RDF property graph.

4. The method of claim 3, characterized in that the method of converting the SPARQL query into a Gremlin graph traversal is: for each triple (sub, pre, obj) in the where clause of the SPARQL query, if the predicate pre of the triple represents a property in the property graph, the triple is converted into the filter has(pre, obj) on the node represented by the subject sub, where obj is the filter condition and pre is the property name represented by the predicate pre; if the predicate pre of the triple represents an edge in the property graph, the triple is converted into the traversal sub.out(pre)->obj from the node represented by the subject sub to the node represented by the object obj, where out denotes an outgoing edge and pre is the label of the edge represented by the predicate pre; processing all triples of the where clause yields the traversal sequence.

5. The method of claim 1, characterized in that the method of traversing all nodes in the Bigtable data that satisfy the conditions, according to the traversal sequence obtained in step 2), is: the traversal sequence is a sequence composed of filters and associated edges; first, traversal paths are formed from the associated edges, and the filters are then used to eliminate invalid traversal paths, giving the valid traversal paths; all nodes in the Bigtable data that satisfy the conditions are then traversed along the valid traversal paths, and the property values of the nodes and edges on the valid traversal paths are organized.