CN105550332B

CN105550332B - Provenance graph query method based on a two-layer index structure

Info

Publication number: CN105550332B
Application number: CN201510969332.5A
Authority: CN
Inventors: 许国艳; 罗章璇; 宋健; 平萍
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2015-12-21
Filing date: 2015-12-21
Publication date: 2019-03-29
Anticipated expiration: 2035-12-21
Also published as: CN105550332A

Abstract

The invention discloses a provenance graph query method based on a two-layer index structure, comprising the following steps. First, a two-layer index structure is proposed for provenance graph queries. Second, a global index based on a dictionary table is designed; the table records the matching relationship between provenance data and data items, together with the provenance graph ID. Then, a local index based on bitmaps is proposed: following the RDF query model for provenance graphs, it provides indexes that satisfy triple pattern queries and three kinds of join queries, and corresponding query algorithms are designed on top of these indexes. Finally, experiments demonstrate the feasibility and effectiveness of the provenance graph query method based on the two-layer index structure.

Description

Provenance graph query method based on a two-layer index structure

Technical field

The invention relates to the management of provenance data in the field of big data management, and focuses on the design and implementation of a query scheme for data provenance graphs. Based on the characteristics of data provenance graphs, the invention provides a provenance graph query method based on a two-layer index structure. The method is designed at two levels, global and local. At the global level, a dictionary table matches data with its provenance data, and the provenance graph ID is used to quickly locate the cloud server node on which the provenance is stored; a global index algorithm based on the dictionary table is proposed. At the local level, a bitmap-based local index structure is proposed, comprising 6 selection indexes and 3 join indexes, together with the corresponding query algorithms.

Background art

Data provenance is information about the entire history of data processing, including the source of the data and all subsequent processes applied to it. With the continuing development of big data, querying provenance information efficiently in a cloud computing environment has become especially important and is a problem that urgently needs to be solved.

To address provenance queries in a cloud computing environment, the invention introduces a two-layer index structure, analyzes it from the two aspects of global index and local index, designs a provenance graph query method, and verifies the feasibility and effectiveness of the method.

Summary of the invention

Object of the invention: in view of the problems in the prior art, the invention provides a provenance graph query method based on a two-layer index structure.

Technical solution: a provenance graph query method based on a two-layer index structure. First, a two-layer index structure is proposed for provenance graph queries. Second, a global index based on a dictionary table is designed. The table records the matching relationship between provenance data and data items together with the provenance graph ID, so that provenance can be associated with its data and the cloud server node storing the provenance can be located quickly, reducing the response time of user queries. Then, a local index based on bitmaps is proposed: following the RDF query model for provenance graphs, it provides indexes that satisfy the eight kinds of triple pattern queries and three kinds of join queries, and corresponding query algorithms are designed on the basis of the indexes.

Two-layer index structure for provenance graph queries

In earlier approaches to storing provenance data in a distributed environment, a provenance query relies solely on the master node to dispatch search tasks. This usually requires traversing the entire cluster and consumes a great deal of time and resources. Existing distributed provenance storage systems mostly support fast lookup by primary key only; they lack an efficient index structure and cannot serve multi-dimensional queries or join queries. An efficient index structure can substantially improve query efficiency and shorten the response time of user queries.

To improve query efficiency, a two-layer index structure is proposed that takes the characteristics of provenance graphs into account. It comprises a global index based on a dictionary table and a local index based on bitmaps. The global index finds the server node on which a provenance graph is stored; the local index then refines the query on that server node to retrieve the required provenance data. The global index is distributed to every node in the cloud environment, so when a user request arrives, the node holding the queried provenance graph can be found by consulting only the global index structure on the local server. A local index covers only the provenance data stored on its own server, and the local indexes of different nodes do not depend on one another.

Global index based on the dictionary table and global query algorithm

The dictionary table structure is given first; the query flow based on the global index is built on top of it.

1. Dictionary table structure

Based on the characteristics of data provenance, the dictionary table HCPTable is designed from two aspects. First, it stores the provenance graph name and the corresponding data items. A data item is the data that the provenance describes; all data in one workflow run correspond to a single provenance graph, which gives a coarse-grained description of the relationship between provenance and data. Second, it stores the provenance graph name and the corresponding ID. Each execution of a workflow generates one data provenance graph, and the provenance ID is generated during storage by a Hash(key) mapping. In the global index, the provenance graph ID is the input to the consistent-hashing index algorithm, so the server node storing a provenance graph can be computed quickly from its ID.

2. Query flow based on the global index

Using the provenance graph ID looked up in HCPTable, the tree is traversed from its root node down to a leaf node, and the server storing the provenance graph is obtained from that leaf node. The global index query flow is as follows:

(1) Look up the dictionary table to obtain the provenance graph ID.

(2) Find the child node that satisfies the query requirement.

(3) Compute and output the child node number.

Local index based on bitmaps and local query algorithm

To improve the efficiency of querying provenance graph data, and in view of the diversity of user query statements, the selection indexes Is, Ip and Io are supplemented to cover their shortcomings on single triple pattern queries: indexes Isp and Ips are designed for triples with known subject and predicate, Ipo and Iop for triples with known predicate and object, and Iso and Ios for triples with known subject and object. Together these form a complete local bitmap index structure, comprising the selection indexes Is, Ip, Io, Isp, Ipo, Iso and the join indexes Is', Io', Iso'.

The local index supports refined queries over the provenance graph data on a single cloud storage server node. A provenance graph query consists of two parts: single triple pattern queries and join queries.

(1) Single triple pattern query

The selection indexes Is, Ip, Io, Isp, Ipo and Iso answer single triple pattern queries by subject, predicate, object, subject-predicate, predicate-object and subject-object, respectively.

(2) Join query

The join indexes Is', Io' and Iso' handle join queries with a shared subject variable, a shared object variable, and a variable shared between subject and object, respectively.

Brief description of the drawings

Fig. 1 shows the two-layer index structure;

Fig. 2 is a flowchart of the provenance query based on the global index;

Fig. 3 shows the consistency binary tree distribution model;

Fig. 4 shows the RDF triple join types;

Fig. 5 shows the analysis of index space occupation;

Fig. 6 shows the analysis of query performance.

Detailed description of the embodiments

The invention is further explained below with reference to specific embodiments. It should be understood that these embodiments only illustrate the invention and do not limit its scope; after reading the invention, modifications of equivalent form made by those skilled in the art fall within the scope defined by the claims appended to this application.

In the provenance graph query method based on a two-layer index structure, a two-layer index structure is first proposed for provenance graph queries. Second, a global index based on a dictionary table is designed. The table records the matching relationship between provenance data and data items together with the provenance graph ID, so that provenance can be associated with its data and the cloud server node storing the provenance can be located quickly, reducing the response time of user queries. Then, a local index based on bitmaps is proposed: following the RDF query model for provenance graphs, it provides indexes that satisfy the eight kinds of triple pattern queries and three kinds of join queries, and corresponding query algorithms are designed on the basis of the indexes.

Two-layer index structure for provenance graph queries

The index structure comprises a global index based on a dictionary table and a local index based on bitmaps. The global index finds the server node on which a provenance graph is stored; the local index then refines the query on that server node to retrieve the required provenance data. The global index is distributed to every node in the cloud environment, so when a user request arrives, the node holding the queried provenance graph can be found by consulting only the global index structure on the local server. A local index covers only the provenance data stored on its own server, and the local indexes of different nodes do not depend on one another. The two-layer index structure is shown in Fig. 1.

Global index based on the dictionary table and global query algorithm

1. Dictionary table structure

Based on the characteristics of data provenance, the dictionary table HCPTable is designed from two aspects. First, it stores the provenance graph name and the corresponding data items. A data item is the data that the provenance describes; all data in one workflow run correspond to a single provenance graph, which gives a coarse-grained description of the relationship between provenance and data. Second, it stores the provenance graph name and the corresponding ID. Each execution of a workflow generates one data provenance graph, and the provenance ID is generated during storage by a Hash(key) mapping. In the global index, the provenance graph ID is the input to the consistent-hashing index algorithm, so the server node storing a provenance graph can be computed quickly from its ID. An example of the storage structure of the dictionary table HCPTable is shown in Table 1.

Table 1 Storage structure of the dictionary table HCPTable
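The two roles of HCPTable described above can be illustrated with a minimal Python sketch. It is not the patent's storage layout: the class and method names, the use of SHA-1 as the Hash(key) function, and the sample file names are all assumptions made for illustration.

```python
import hashlib


def provenance_graph_id(graph_name: str) -> int:
    """Derive a stable integer ID from a graph name (SHA-1 is an assumed hash)."""
    digest = hashlib.sha1(graph_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HCPTable:
    """Toy dictionary table: graph name -> (graph ID, data items)."""

    def __init__(self):
        self._by_name = {}   # graph name -> (graph ID, list of data items)
        self._by_item = {}   # data item  -> graph name

    def register(self, graph_name, data_items):
        """Record one workflow run: its provenance graph and the data it describes."""
        gid = provenance_graph_id(graph_name)
        self._by_name[graph_name] = (gid, list(data_items))
        for item in data_items:
            self._by_item[item] = graph_name
        return gid

    def graph_id_for_item(self, item):
        """Global-index step (1): data item -> provenance graph ID."""
        return self._by_name[self._by_item[item]][0]


table = HCPTable()
gid = table.register("workflow_run_1", ["a.csv", "b.csv"])
print(table.graph_id_for_item("a.csv") == gid)
```

The ID returned by `graph_id_for_item` is what the global index would then feed into the node lookup.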

2. Query flow for the provenance graph storage node based on the global index

The query flow for the provenance graph storage node based on the global index is shown in Fig. 2.

(1) Look up the dictionary table to obtain the provenance graph ID.

(2) Starting from the root node of the tree, look up the server node in the tree that stores the provenance according to the provenance graph ID. Formula 1 selects the child node; in formula 1, ID is the provenance graph ID and root.Number is the number of the root node.

Nodenum = ID % root.Number (1)

Select the child node according to the result, and check its node.Isleaf attribute to determine whether it is a leaf node. If it is a leaf node, go to step (4); otherwise go to step (3).

(3) Take this node as the new root node and continue from step (2).

(4) Compute and output the number of this node.

First, each execution of this flow selects one leaf node of the tree, proceeding from the root node down to a leaf node. The query method is similar to binary search, so its time complexity is O(log n). Second, the invention uses consistency binary tree distributed storage; storing data in this binary tree structure is also considerably more efficient than using other multiway trees.
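The root-to-leaf descent in steps (1) to (4) can be sketched as follows. This is a hedged illustration, not the patent's Match_Node listing. It reads root.Number as the number of children of the current node, which is an assumption. As a further assumption not stated in the source, it passes the quotient of the ID down to the next level so that different levels can choose different branches. `Node`, `build_full_tree` and `match_node` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    number: str                                  # path code such as "11"; "" for the root
    children: List["Node"] = field(default_factory=list)

    @property
    def is_leaf(self) -> bool:
        return not self.children


def build_full_tree(depth: int, number: str = "") -> Node:
    """Build a complete binary tree whose nodes are labelled by their path."""
    if depth == 0:
        return Node(number)
    return Node(number, [build_full_tree(depth - 1, number + b) for b in "01"])


def match_node(root: Node, graph_id: int) -> str:
    """Descend from the root to a leaf and return the leaf's number."""
    node, remaining = root, graph_id
    while not node.is_leaf:
        fan_out = len(node.children)             # assumed meaning of root.Number
        # Formula (1): the remainder picks the child; the quotient is kept
        # for the next level (assumption, see lead-in).
        remaining, branch = divmod(remaining, fan_out)
        node = node.children[branch]
    return node.number


tree = build_full_tree(2)
print(match_node(tree, 3))   # "11": branch 1, then branch 1
```

The loop runs once per tree level, which matches the O(log n) cost stated above for a balanced tree with n leaves.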

3. Consistency binary tree distributed storage

The idea of consistency binary tree distributed storage is to organize the servers into layered groups and, combined with a consistent hashing algorithm, to spread the data evenly across the cloud servers. Each leaf node of the consistency binary tree represents a server node in the cloud that stores provenance graph data.

The consistency binary tree distribution model is based on a binary tree structure. The nodes at each level are divided into several mutually disjoint finite sets, each of which is itself a tree, so that all storage nodes are assigned to different groups at different levels. Each leaf node stores the number of the corresponding server.

Definition 1 (consistency binary distribution tree): a consistency binary tree is a binary tree formed from a finite set T of n nodes, T = {V, E}, where V is the set of nodes and E is the set of edges.

Each leaf in T represents the position of a cloud computing server. Each node is identified by a unique digit sequence whose digits, read from left to right, give the numbers of the branches on the path leading to that node; subtrees are numbered from left to right as 0, 1, 00, .... For example, node D in Fig. 3 has the number 11, which uniquely determines its position in the consistency distribution tree: a query for node D follows the two branches 1 and 1. Once the numbers of all leaf nodes in the tree are fixed, the logical structure of the tree is also determined, and the mapping between the two is one-to-one. In use, the consistency binary tree can be abstracted as a two-dimensional array, so the tree structure can be maintained with a two-dimensional array.
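One possible reading of the two-dimensional array representation is a level-by-level table in which row d holds the nodes at depth d and the column is the path code read as a binary number. The sketch below follows that reading; the layout and the helper names `locate` and `build_levels` are assumptions, since the source does not specify how the array is organized.

```python
def locate(path_code: str):
    """Map a path code such as "11" to (level, column) in a level-by-level array."""
    level = len(path_code)
    column = int(path_code, 2) if path_code else 0
    return level, column


def build_levels(depth: int):
    """levels[d][c] holds the path code of the c-th node on level d."""
    return [
        [format(c, f"0{d}b") if d else "" for c in range(2 ** d)]
        for d in range(depth + 1)
    ]


levels = build_levels(2)
level, column = locate("11")
print(level, column, levels[level][column])
```

Because a path code and an array cell determine each other, the array carries the same information as the tree, which is the one-to-one property described above.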

Algorithm implementation:

The purpose of the global query is to locate the server node on which a provenance graph is stored. The consistency distributed storage designed in the invention stores different provenance graphs evenly across different leaf nodes of the tree. To query a provenance graph, the server node in the tree that stores the provenance is found from the provenance graph ID.

The provenance node query algorithm Match_Node is as follows:

Local index based on bitmaps and local query algorithm

In the invention, provenance graph data is stored sequentially in units of triples, numbered from 1 to n. $pos(t_i)$ denotes the storage position $i$ of triple $t_i$ in the graph, and $pos^{-1}(i)$ returns the triple $t_i$ stored at position $i$, where $t_i \in G$, $G$ is the set of triples of one RDF graph, and $D$ is the set of RDF graphs: $G = \{t_1, t_2, \ldots, t_n\}$, $G_i \in D$, $D = \{G_1, G_2, \ldots, G_n\}$.

$S$, $P$ and $O$ denote the sets of subjects, predicates and objects in the RDF data set, as shown in formula 2:

$$
\begin{aligned}
S &= S_1 \cup S_2 \cup \cdots \cup S_n, & S_i &= \{\, s \mid (s, p, o) \in G_i \,\},\\
P &= P_1 \cup P_2 \cup \cdots \cup P_n, & P_i &= \{\, p \mid (s, p, o) \in G_i \,\},\\
O &= O_1 \cup O_2 \cup \cdots \cup O_n, & O_i &= \{\, o \mid (s, p, o) \in G_i \,\},
\end{aligned}
\qquad G_i \in D,\ D = \{G_1, G_2, \ldots, G_n\} \tag{2}
$$

1. Queries on a single triple pattern

Because the subject, predicate and object of a single triple pattern may each be a variable, queries over a single triple require a multi-dimensional index. The main idea of the multi-dimensional index is to use the non-variable terms of the triple as the query entry point. The expressions of a single triple pattern clause in a SPARQL query and their meanings are shown in Table 2.

Table 2 Triple pattern expressions and their meanings

No.	Clause expression	Meaning
1	(s, p, o)	Returns the triple if it exists, otherwise returns a null value
2	(?s, p, o)	Given the predicate and object, returns the set of subjects of matching triples
3	(s, ?p, o)	Given the subject and object, returns the set of predicates of matching triples
4	(s, p, ?o)	Given the subject and predicate, returns the set of objects of matching triples
5	(?s, ?p, o)	Given the object, returns the set of subjects and predicates of matching triples
6	(s, ?p, ?o)	Given the subject, returns the set of predicates and objects of matching triples
7	(?s, p, ?o)	Given the predicate, returns the set of subjects and objects of matching triples
8	(?s, ?p, ?o)	Returns all triples

Definition 2 (bitmap index Is): the index Is is the set $\{(s_1, v_1), (s_2, v_2), \ldots, (s_n, v_n)\}$ over all triple subjects in the RDF graph $G$, where $s_i \in S$ and $v_i$ is a bit vector over $G$ whose position $k$ is 1 if and only if the graph contains a triple $t_k = pos^{-1}(k)$, $t_k \in G$, with $t_k.s = s_i$.

The purpose of the Is index is to let a query with a known subject quickly find the corresponding triples in the RDF graph. The size of Is is fixed and matches the number of RDF triples contained in the provenance graph it belongs to.

The indexes Ip and Io are built in the same way, so that triples can be found quickly using the predicate or the object as the key. If the subject, predicate and object in a query statement are all known, the three indexes Is, Ip and Io can be combined: Is ∧ Ip ∧ Io.
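A minimal sketch of Is, Ip and Io, using Python integers as bit vectors (an implementation choice made here, not one the source prescribes). Bit k-1 of a vector corresponds to the triple at position k. The sample triples and the helper names `build_index` and `positions` are hypothetical.

```python
from collections import defaultdict


def build_index(triples, key):
    """Bitmap index: key(triple) -> int whose bit k-1 is set iff the triple at position k matches."""
    index = defaultdict(int)
    for k, triple in enumerate(triples, start=1):
        index[key(triple)] |= 1 << (k - 1)
    return dict(index)


def positions(bits):
    """Decode a bit vector into 1-based triple positions."""
    return [k + 1 for k in range(bits.bit_length()) if (bits >> k) & 1]


# Hypothetical provenance triples, stored at positions 1..4.
G = [
    ("run1", "used", "a.csv"),
    ("run1", "generated", "b.csv"),
    ("run2", "used", "b.csv"),
    ("run1", "used", "b.csv"),
]

Is = build_index(G, lambda t: t[0])
Ip = build_index(G, lambda t: t[1])
Io = build_index(G, lambda t: t[2])

# Fully bound pattern (s, p, o): intersect the three bit vectors.
hit = Is["run1"] & Ip["used"] & Io["b.csv"]
print(positions(hit))   # [4]
```

Each key owns one bit vector of length |G|, so a fully bound lookup reduces to two bitwise AND operations.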

Definition 3 (bitmap index Isp): the index Isp is the set $\{(s_1 p_1, v_1), (s_2 p_2, v_2), \ldots, (s_n p_n, v_n)\}$ over all subject-predicate pairs of triples in the RDF graph $G$, where $s_i \in S$, $p_i \in P$, and $v_i$ is a bit vector over $G$ whose position $k$ is 1 if and only if the graph contains a triple $t_k = pos^{-1}(k)$, $t_k \in G$, with $t_k.sp = s_i p_i$.

The indexes Ips, Ipo, Iop, Iso and Ios are built in the same way, so that triples can be found quickly using predicate-subject, predicate-object, object-predicate, subject-object or object-subject as the key.
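The way a single triple pattern can be routed to one of the selection indexes is sketched below. This is an illustration under assumptions, not the patent's ASI_TP listing: variables are written as `None`, and the names `SLOTS`, `build_selection_indexes` and `select` are hypothetical.

```python
from collections import defaultdict

# Which triple slots (0 = subject, 1 = predicate, 2 = object) each index keys on.
SLOTS = {"s": (0,), "p": (1,), "o": (2,), "sp": (0, 1), "po": (1, 2), "so": (0, 2)}


def build_index(triples, slots):
    """Bitmap index keyed by the values in `slots`; bit k is set for triples[k]."""
    index = defaultdict(int)
    for k, triple in enumerate(triples):
        index[tuple(triple[i] for i in slots)] |= 1 << k
    return dict(index)


def build_selection_indexes(triples):
    return {name: build_index(triples, slots) for name, slots in SLOTS.items()}


def select(triples, indexes, pattern):
    """pattern is (s, p, o) with None for a variable; returns the matching triples."""
    bound = "".join(n for n, v in zip("spo", pattern) if v is not None)
    if bound == "":                      # (?s, ?p, ?o): every triple matches
        return list(triples)
    if bound == "spo":                   # fully bound: Is AND Ip AND Io
        bits = (indexes["s"].get((pattern[0],), 0)
                & indexes["p"].get((pattern[1],), 0)
                & indexes["o"].get((pattern[2],), 0))
    else:                                # one or two bound terms: a single lookup
        key = tuple(v for v in pattern if v is not None)
        bits = indexes[bound].get(key, 0)
    return [t for k, t in enumerate(triples) if (bits >> k) & 1]


G = [
    ("run1", "used", "a.csv"),
    ("run1", "generated", "b.csv"),
    ("run2", "used", "b.csv"),
    ("run1", "used", "b.csv"),
]
indexes = build_selection_indexes(G)
print(select(G, indexes, ("run1", "used", None)))
```

With two bound terms the composite index answers the pattern in one dictionary lookup instead of intersecting two single-term vectors.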

2. Queries containing joins

The association between triples is determined by whether they share an unbound variable with the same name. According to the positions in which the shared variable occurs, the associations fall into three kinds: Subject-Subject join, Object-Object join and Object-Subject join. The RDF triple join types are shown in Fig. 4.

Definition (bitmap index Is'): the index Is' is the set $\{(1, v_1), (2, v_2), \ldots, (n, v_n)\}$ describing all triples in the RDF graph $G$ that share the same subject, where $n = |G|$ and $1, 2, \ldots, n$ are the consecutive position identifiers in the graph. $v_i$ is a bit vector over $G$ whose position $k$ is 1 if and only if the graph contains triples $t_k = pos^{-1}(k)$ and $t_i = pos^{-1}(i)$, $t_k, t_i \in G$, with $t_k.s = t_i.s$.

The index Io' is built in the same way as Is'. No corresponding index Ipp is built for the predicate, because the number of distinct predicates in a graph is relatively small and finding triples with the same predicate carries no meaning for subjects and objects.

Definition 4 (bitmap index Iso'): the index Iso' is the set $\{(1, v_1), (2, v_2), \ldots, (n, v_n)\}$ describing all triples in the RDF graph $G$ in which a subject and an object coincide, where $n = |G|$ and $1, 2, \ldots, n$ are the consecutive position identifiers in the graph. $v_i$ is a bit vector over $G$ whose position $k$ is 1 if and only if the graph contains triples $t_k = pos^{-1}(k)$ and $t_i = pos^{-1}(i)$, $t_k, t_i \in G$, with $t_k.o = t_i.s$. The index Ios' is the transpose of Iso': $I_{os'} = I_{so'}^{T}$.
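The join indexes can be sketched as n-by-n bit matrices, one row per triple. The code below is a minimal illustration that uses 0-based positions and Python integers as rows; the helper names and the sample triples are hypothetical, and the quadratic construction is chosen for clarity rather than efficiency.

```python
def build_join_index(triples, slot_i, slot_k):
    """Row i has bit k set iff triples[k][slot_k] == triples[i][slot_i]."""
    rows = []
    for ti in triples:
        bits = 0
        for k, tk in enumerate(triples):
            if tk[slot_k] == ti[slot_i]:
                bits |= 1 << k
        rows.append(bits)
    return rows


def transpose(rows):
    """Transpose a square bit matrix stored as a list of ints."""
    n = len(rows)
    return [sum(((rows[k] >> i) & 1) << k for k in range(n)) for i in range(n)]


# Slots: 0 = subject, 2 = object. Sample triples are hypothetical.
G = [
    ("run1", "used", "a.csv"),
    ("a.csv", "derivedFrom", "raw.csv"),
    ("run1", "generated", "b.csv"),
]

Is_join = build_join_index(G, 0, 0)    # Is':  t_k.s == t_i.s
Iso_join = build_join_index(G, 0, 2)   # Iso': t_k.o == t_i.s
Ios_join = transpose(Iso_join)         # Ios' = Iso'^T

print(Is_join, Iso_join, Ios_join)
```

In this sample, row 1 of Iso' marks triple 0 because the object of triple 0 (`a.csv`) is the subject of triple 1; the transposed matrix records the same link from the other side.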

The indexes Isp, Ips, Ipo, Iop, Iso and Ios are selection indexes that handle query requests with known subject-predicate, predicate-object or subject-object. Within each pair (Isp and Ips, Ipo and Iop, Iso and Ios), the triples described occupy the same storage positions in the actual graph, so only Isp, Ipo and Iso are needed.

In summary, the invention uses the indexes Is, Ip, Io, Isp, Ipo and Iso to query triples with known subject, predicate, object, subject-predicate, predicate-object or subject-object. The indexes Is', Io', Iso' and Ios' handle join query requests with a shared subject variable, a shared object variable, and a variable shared between subject and object. The bitmap index storage structure $T_D$ is shown in Table 3.

Table 3 Bitmap index storage structure $T_D$

3. Algorithm implementation

The query algorithm ASI_TP for a single triple pattern finds the unknown terms from the known terms of the triple, as follows:

The join query algorithm AJI_TP quickly matches two triples whose subjects, objects, subject and predicate, or predicate and object are respectively identical, and also handles the case where subject and predicate differ, as follows:

The algorithm Match_BGP for BGP queries is as follows:

ASI_TP and AJI_TP are the procedures called when a BGP is queried. Match_BGP first preprocesses, that is, reorders, all triple patterns in the BGP. The specific steps are as follows:

1. Sort the triple patterns so that those with higher selectivity come first. From high to low selectivity, the order is:

(1) Subject is not a variable

(2) Subject, Predicate and Object are all non-variables and the predicate is not rdf:type

(3) Subject is a variable, Predicate and Object are non-variables and the predicate is rdf:type

(4) Subject and Predicate are variables, Object is not a variable

(5) Subject and Object are variables, Predicate is not a variable and is not rdf:type

(6) Subject and Object are variables, Predicate is rdf:type

(7) Subject, Object and Predicate are all variables.

2. Call ASI_TP on the first triple pattern of the sorted RDF triple pattern set bgp. The returned variables are stored in $v_{seva}$, and the result set S is obtained from $v_{seva}$. If S is empty, return the empty set directly.

3. Take the next triple pattern. Before querying it, check whether it shares a variable with the preceding triple patterns. If it does, call AJI_TP, record the result set $S_{tpi}$ of the current triple pattern, and merge the current result set with the result set obtained from $v_{seva}$.

4. Repeat step 3 until all triple patterns have been queried.

5. Replace the bit vectors in the result set S with the specific RDF triples they identify, and return the query result set.
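Steps 2 to 5 can be illustrated with a small Python sketch of evaluating a BGP pattern by pattern while joining on shared variables. Several simplifications are assumed and should not be read as the patent's Match_BGP: patterns are ordered by "fewest variables first" as a stand-in for the seven selectivity classes listed above, triples are matched by plain scans rather than through ASI_TP, AJI_TP and the bitmap indexes, and all function names are hypothetical.

```python
def is_var(term):
    """Variables are written SPARQL-style, e.g. "?x"."""
    return isinstance(term, str) and term.startswith("?")


def match_pattern(triples, pattern, binding):
    """Yield every extension of `binding` that makes `pattern` match some triple."""
    for triple in triples:
        new = dict(binding)
        for term, value in zip(pattern, triple):
            if is_var(term):
                # A variable already bound to a different value rules this triple out.
                if new.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield new


def match_bgp(triples, bgp):
    # Stand-in ordering: fewest variables first (not the seven classes in the text).
    ordered = sorted(bgp, key=lambda p: sum(is_var(t) for t in p))
    results = [{}]
    for pattern in ordered:
        results = [b for r in results for b in match_pattern(triples, pattern, r)]
        if not results:          # empty intermediate result: stop early
            return []
    return results


G = [
    ("run1", "used", "a.csv"),
    ("run1", "generated", "b.csv"),
    ("run2", "used", "b.csv"),
    ("run2", "generated", "c.csv"),
]
# Which run used a file that run1 generated?
print(match_bgp(G, [("?r", "used", "?x"), ("run1", "generated", "?x")]))
```

Evaluating the more selective pattern first binds `?x` early, so the second pattern only has to be checked against that binding; this is the effect the reordering step aims for.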

Experimental verification

1. Space occupation analysis

The local index technique described here adds three new indexes to speed up queries, so its storage footprint grows by the space taken by those three indexes.

Among the 400 RDF triples generated by one workflow run, triples with the same subject, predicate or object occur many times, so the indexes Is, Ip and Io do not need a separate index entry for every triple. Triples that share an element are marked by bit positions in the bitmap vector. For example, for a repeated subject it suffices to set to 1 the corresponding positions in the bitmap vector created when that subject was first indexed. Each position represents the logical position of the triple in the database.

In the provenance data recorded by a workflow, triples with the same subject-predicate, predicate-object and subject-object pairs are likewise highly repetitive. For repeated entries, only the bits at the different positions of the vector created the first time need to be set, and only one bitmap index needs to be stored. Therefore, although three index entries are added to the original 6 indexes, so that the number of indexes increases by 50%, the index storage space increases by only about 25%, as shown in Fig. 5.

2. Query performance analysis

The invention uses the University of Texas provenance benchmark data set and runs the 11 UTPB query statements to test the query performance of the designed index structure.

In the experiment, the 11 statements were run as query tests against the five data sets D1, D2, D3, D4 and D5 in a Hadoop cluster environment. Each query was run 5 times on each of the five data sets and the average time was taken. The query performance analysis is shown in Fig. 6.

The analysis of the execution results demonstrates the feasibility of the invention. It also shows that, when handling the storage of massive provenance data, the proposed two-layer index structure retains comparatively good storage and query performance as the data volume grows, and responds to user query requests in a timely manner. Performance remains good in the face of complex query requests and massive data.

Claims

1. A provenance graph query method based on a two-layer index structure, characterized by comprising the following steps: first, a two-layer index structure is proposed for provenance graph queries; second, a global index based on a dictionary table is designed, the table recording the matching relationship between provenance data and data items together with the provenance graph ID; then, a local index based on bitmaps is proposed which, following the RDF query model for provenance graphs, provides indexes that satisfy triple pattern queries and three kinds of join queries, with corresponding query algorithms designed on the basis of the indexes;

The two-layer index structure for provenance graph queries comprises the global index based on the dictionary table and the local index based on bitmaps; the global index finds the server node on which a provenance graph is stored, and the local index refines the query on the server node found by the global index to retrieve the required provenance data; the global index is distributed to every node in the cloud environment, so that when a user request arrives, the node holding the queried provenance graph can be obtained by consulting only the global index structure on the local server; a local index covers only the provenance data stored on its own server, and the local indexes of different nodes do not depend on one another;

The global index based on the dictionary table and the global query algorithm are as follows:

The dictionary table structure is given first, and the query flow based on the global index is built on top of it;

1) Dictionary table structure

Based on the characteristics of data provenance, the dictionary table HCPTable is designed from two aspects; first, it stores the provenance graph name and the corresponding data items; a data item is the data that the provenance describes, and all data in one workflow run correspond to a single provenance graph, which gives a coarse-grained description of the relationship between provenance and data; second, it stores the provenance graph name and the corresponding ID; each execution of a workflow generates one data provenance graph, and the provenance ID is generated during storage by a Hash(key) mapping; in the global index, the provenance graph ID is the input to the consistent-hashing index algorithm, so that the server node storing a provenance graph can be computed quickly from its ID;

2) Query flow based on the global index

Using the provenance graph ID looked up in HCPTable, the tree is traversed from the root node down to a leaf node, and the server storing the provenance graph is obtained from the leaf node; the global index query flow is as follows:

(1) Look up the dictionary table to obtain the provenance graph ID

(2) Find the child node that satisfies the requirement

(3) Compute and output the number of this node.

2. The provenance graph query method based on a two-layer index structure according to claim 1, characterized in that the local index based on bitmaps and the local query algorithm are as follows:

A provenance graph query consists of two parts: single triple pattern queries and join queries;

(1) Single triple pattern query

The selection indexes Is, Ip, Io, Isp, Ipo and Iso answer single triple pattern queries by subject, predicate, object, subject-predicate, predicate-object and subject-object, respectively;

(2) Join query

The join indexes Is', Io' and Iso' handle join queries with a shared subject variable, a shared object variable, and a variable shared between subject and object, respectively.