CN109325029A

CN109325029A - RDF data storage and query method based on sparse matrices

Info

Publication number: CN109325029A
Application number: CN201811004427.3A
Authority: CN
Inventors: 张小旺; 张明月; 冯志勇
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2018-08-30
Filing date: 2018-08-30
Publication date: 2019-02-12

Abstract

The present invention relates to data storage and query processing in database engines. It aims to exploit the characteristics of real-world human social-activity relationship data by storing locally dense entity relationships compactly, thereby improving query efficiency and saving a large amount of storage space. The invention provides an RDF data storage and query method based on sparse matrices, with the following steps: Step 1: hash-encode the string values of the original RDF data into integers; Step 2: construct an RDF cube (RDF Cube) from the encoded RDF data; Step 3: store the RDF Cube as a series of sparse matrices by building a predicate index; Step 4: parse and optimize the SPARQL query statement to obtain an optimal query plan; Step 5: execute the sparse-matrix-based Join query and output the result. The invention is mainly applicable to data storage and query processing.

Description

RDF data storage and query method based on sparse matrices

Technical field

The present invention relates to data storage and query processing in database engines. Specifically, it provides a sparse-matrix-based storage scheme for storing large volumes of RDF data, and a Join operation based on sparse matrix multiplication for executing queries over RDF data.

Background art

The Resource Description Framework (RDF) is a widely used data model that represents information on the Web as triples (subject, predicate, object). An RDF dataset can also be described as a directed labeled graph: each triple is an edge, the subject and object are its two vertices, and the predicate is the label of that edge. SPARQL (SPARQL Protocol and RDF Query Language) is the standard query language for RDF graph data recommended by the World Wide Web Consortium (W3C). Large-scale real-world RDF datasets, such as DBpedia (a well-known Semantic Web application) and YAGO (a linked database), share an important property: sparsity. Sparsity means that the neighbors of each vertex in the RDF graph make up only a very small fraction of all vertices in the graph. In fact, sparsity is ubiquitous in RDF data. Take social-activity relationships in real life: people are connected by many kinds of social relationships, and these relationships involve many complex types of entity data, such as associated places, relationships between people, and the points in time involved. Any two entities can be linked through some relationship. Associations in the data also reflect how human social activity changes; for example, changes in a region's migrant population, traffic route usage, and holiday travel can all be analyzed through social relationships.

However, human social-activity relationships as a whole are locally dense and globally sparse. For example, entities within region A (say, Shaanxi Province) are relatively closely related, whereas two different regions A and B (say, Shaanxi Province and Tianjin) have only a few connections between them. Taken as a whole, therefore, real-world entities are closely connected only locally. As shown in Fig. 1, in the DBpedia data (42,966,066 nodes in total), more than 99.41% of nodes have a degree (out-degree and in-degree) of at most 43; in the YAGO data (38,734,252 nodes in total), more than 95.17% of nodes have a degree of at most 39. SPARQL is built on basic graph patterns (Basic Graph Pattern, BGP) and SPARQL algebra operations. The Join operation is the core of SPARQL query evaluation, because a BGP is exactly a join of triple patterns. In an RDF graph, the connections between two vertices can be computed through the product of adjacency matrices.
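
As a minimal sketch of the last point, the following Python stores an adjacency matrix as nested dictionaries that hold only non-zero entries and multiplies it by itself to count two-hop connections. The function name `sparse_matmul`, the dictionary layout, and the toy graph are illustrative assumptions and do not come from the patent text.

```python
def sparse_matmul(a, b):
    """Multiply two sparse matrices stored as {row: {col: value}}.

    Only non-zero entries are stored, so the loops touch only real edges.
    """
    out = {}
    for i, row in a.items():
        for k, v in row.items():
            # Row k of b holds the edges leaving vertex k.
            for j, w in b.get(k, {}).items():
                out.setdefault(i, {})
                out[i][j] = out[i].get(j, 0) + v * w
    return out


# Hypothetical toy graph: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3
adj = {0: {1: 1, 2: 1}, 1: {3: 1}, 2: {3: 1}}

# Entry (i, j) of adj * adj counts the two-hop paths from i to j.
two_hop = sparse_matmul(adj, adj)
print(two_hop)
```

Because absent entries are never visited, the cost of the product depends on the number of stored edges rather than on the square of the number of vertices, which is the property the background discussion relies on.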

By analyzing RDF data on real-world human social-activity relationships, we devised a sparse-matrix-based storage scheme that stores the data efficiently, and a sparse-matrix-based join algorithm that queries the data efficiently. That is, the sparse-matrix-based join algorithm is used to query RDF data that exhibits sparsity. Finding a solution that fully considers and exploits data sparsity to efficiently store and query RDF data on human social-activity relationships is therefore a worthwhile problem.

Prior art most similar to the present invention:

Existing methods for evaluating SPARQL queries over RDF data fall into two classes: relation-based systems and graph-based systems. Relation-based systems store and index RDF data using relational techniques. Systems such as Jena, Oracle, Sesame, 3store, and SOR maintain all triples in one large three-column table whose columns correspond to subject, predicate, and object. A SPARQL query is then translated into an SQL query and answered through multiple self-joins of that table. More recently, many graph-based methods that store RDF triples in a graph model have been proposed, such as gStore, DipLODocus[RDF], TurboHOM++, and AMBER. These graph-based methods usually treat SPARQL query processing as subgraph matching, which helps preserve the semantic information of the query. gStore maps all predicates and predicate values to bit strings and organizes them into a VS*-tree. Because each level of the VS*-tree is a summary of the entire RDF graph, gStore can process SPARQL queries efficiently. DipLODocus[RDF] starts from a hybrid storage scheme that considers both the graph structure of the RDF data and the requirements of data analysis, in order to find molecule clusters and to speed up queries with the help of clustered related data. TurboHOM++ builds on TurboISO by converting the RDF graph into a general data graph, and AMBER represents both the RDF data and the SPARQL query as multigraphs.

Shortcomings of the prior art:

Existing query methods for RDF data do improve query evaluation performance to some extent. However, their gains on large-scale real-world RDF data remain limited, because their storage schemes do not address the sparsity of real data and therefore waste a large amount of space. In relation-based systems, the large number of self-joins leads to long response times, which is a potential system bottleneck. Because they lack parallelism, all of the above systems are also computationally expensive in preprocessing. In addition, query methods that rely on time-consuming traversal increase run-time overhead. What is needed, therefore, is a sparse-matrix-based RDF query processing method that can both store large volumes of RDF data efficiently and process queries quickly.

Summary of the invention

To overcome the shortcomings of the prior art, the present invention addresses the sparsity of real-world social-activity relationship data and proposes a sparse-matrix-based data storage model and a sparse-matrix-based join query algorithm. The storage scheme exploits the characteristics of real human social-activity relationship data by storing locally dense entity relationships together. This stores the original data compactly and also narrows the range of data a query must examine, which improves query efficiency. The storage scheme saves a large amount of storage space because the data are represented as sparse matrices that maintain only the relationships actually present in the RDF graph. The query algorithm performs Join operations through matrix multiplication over the sparse storage, which can markedly improve query efficiency. To this end, the technical solution adopted by the present invention is an RDF data storage and query method based on sparse matrices, with the following steps:

Step 1: hash-encode the string values of the original RDF data into integers;

Step 2: construct an RDF cube (RDF Cube) from the encoded RDF data;

Step 3: build a predicate index and store the RDF Cube as a series of sparse matrices;

Step 4: parse and optimize the SPARQL query statement to obtain an optimal query plan;

Step 5: execute the sparse-matrix-based Join query and output the result.

Specifically, step 4 is as follows: a given SPARQL query statement over RDF data is parsed into a query graph; the edges of the query graph are then sorted according to their statistics, and the query execution plan is reconstructed while ensuring that each edge is connected to the edges already selected.

More specifically, in step 4 the input is a SPARQL query statement in the form of a list L=(e₁, e₂, ..., e_n) containing all edges of the query. First, a vertex set N is initialized to record the vertices of the selected edges, and L is sorted according to the statistics of its edges. An edge is then chosen from the sorted L and added to Q, its vertices are added to N, and L is updated. While L is not empty, the loop repeatedly selects from L an edge that has a small statistic and is connected to the previously selected edges (that is, a vertex of the edge is in N), adds it to Q, adds its vertices to N, and updates L. The resulting query plan Q is the optimal query plan.

Specifically, step 5 is as follows: given the optimized query plan obtained in step 4, the sparse-matrix-based Join algorithm is executed over the sparse matrices in the plan, in order, to obtain the final query result; each pair of sparse matrices is joined by the SMJoin algorithm.

More specifically, in step 5, Algorithm 1 runs the Join procedure over each sparse matrix in the given optimized plan and outputs the final query result; Algorithm 2 is the core SMJoin algorithm executed on two given sparse matrices;

For Algorithm 1, the input is a given optimized query plan Q. The first sparse matrix is taken from Q and denoted R_t, and Q is updated. The next sparse matrix is then taken from Q and denoted SpTable; SMJoin is performed on the two sparse matrices R_t and SpTable, and the result is written back to R_t. This is repeated, taking the next sparse matrix from Q and performing SMJoin, until Q is empty. The final query result is R_t;

For Algorithm 2, SMJoin performs the join operation on the two sparse matrices A and B supplied by Algorithm 1. It first traverses each non-zero row of matrix A; then, for each non-zero entry in each such row, it looks for the matching non-zero entries in matrix B and, if the join condition is satisfied, writes the matched result to the result matrix R_t. The final output of the join operation is the result matrix R_t.

The features and beneficial effects of the present invention are:

The present invention targets large-scale RDF graph data on real human social-activity relationships. Exploiting the sparsity of the data, it proposes a sparse-matrix-based data storage model and a sparse-matrix-based join query algorithm. This addresses the problem that existing approaches ignore the sparsity of real data and therefore waste a large amount of storage space, and it achieves efficient storage and efficient querying of large-scale RDF data. While guaranteeing the correctness of the algorithm, the optimization strategy reduces query response time. As the experimental results in Fig. 6 show, for various types of social-activity relationship queries, our method queries RDF graph data noticeably more efficiently than the existing query systems gStore and RDF-3X.

Brief description of the drawings:

Fig. 1 shows the sparsity analysis of real-world social relationship RDF data.

(a) DBpedia dataset; (b) YAGO dataset.

Fig. 2 shows the overall framework of the sparse-matrix-based RDF query processing of the present invention, comprising the storage module, the query optimization module, and the query execution module.

Fig. 3 shows the sparse-matrix-based storage scheme of the present invention.

Fig. 4 shows an example of the query optimization module of the present invention.

Fig. 5 shows the query experiments on the synthetic dataset WatDiv at different scales.

(a) WatDiv F: snowflake queries; (b) WatDiv L: linear queries;

(c) WatDiv S: star queries; (d) WatDiv C: complex queries.

Fig. 6 shows the query experiments on the real-world relationship datasets DBpedia and YAGO.

(a) YAGO dataset; (b) DBpedia dataset.

Specific embodiment

The present invention proposes a sparse-matrix-based data storage model and a sparse-matrix-based join query method, designed for the locally dense and globally sparse nature of real-world social-activity relationship data.

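Step 1: hash-encode the string values of the original RDF data into integers;

Step 2: construct an RDF Cube (RDF cube) from the encoded RDF data;

Step 3: build a predicate index and store the RDF Cube as a series of sparse matrices;

Step 4: parse and optimize the SPARQL query statement to obtain an optimal query plan;

Step 5: execute the sparse-matrix-based Join query and output the result;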

The overall framework is shown in Fig. 2 and has three modules:

Storage module: comprises steps 1, 2, and 3. The storage module stores RDF data in sparse-matrix form. As shown in Fig. 3, the RDF graph data are first hash-encoded, an RDF cube is then constructed, and finally the data are stored as sparse matrices by building a predicate index;

Query optimization module: comprises step 4. To answer a SPARQL query efficiently, the query must be optimized according to the storage scheme and the query algorithm to obtain an optimal query plan. Our optimization goal is to reduce the size of all intermediate results produced during query processing. As shown in Fig. 4, the edges of the query graph are sorted by size and the query plan is reconstructed so that intermediate results are as small as possible, as described in Algorithm 3;

Query execution module: comprises step 5, the Join algorithm designed on sparse matrix multiplication. As described in Algorithms 1 and 2, an optimized query plan is taken as input and the final query result is obtained through the SMJoin algorithm;

The present invention is described in further detail below in conjunction with the accompanying drawings.

Referring to Fig. 1, the present invention analyzes the sparsity of real RDF data. In the DBpedia data (42,966,066 nodes in total), more than 99.41% of nodes have a degree (out-degree and in-degree) of at most 43; in the YAGO data (38,734,252 nodes in total), more than 95.17% of nodes have a degree of at most 39. Clearly, in real-world RDF data most vertices are connected to only a tiny fraction of the other vertices, so the data as a whole are locally dense and globally sparse.
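
A statistic of this kind could be computed roughly as follows. This is only a sketch: it assumes that a node's degree is its out-degree plus its in-degree, and the function name `degree_coverage` and the toy edge list are hypothetical rather than taken from the patent.

```python
from collections import Counter


def degree_coverage(edges, max_degree):
    """Return the fraction of nodes whose degree is at most max_degree.

    edges is an iterable of (subject, object) pairs; degree is assumed
    to be out-degree plus in-degree.
    """
    degree = Counter()
    for s, o in edges:
        degree[s] += 1  # out-degree contribution
        degree[o] += 1  # in-degree contribution
    within = sum(1 for d in degree.values() if d <= max_degree)
    return within / len(degree)


# Hypothetical toy graph with four nodes
toy_edges = [(0, 1), (0, 2), (0, 3), (1, 2)]
coverage = degree_coverage(toy_edges, max_degree=2)
print(coverage)
```

A value close to 1 for a small `max_degree` indicates that almost every vertex has only a few neighbors, which is the sparsity the storage scheme is designed to exploit.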

Referring to Fig. 2, the overall framework of the sparse-matrix-based RDF query processing of the present invention comprises the storage module, the query optimization module, and the query execution module. The storage module takes full account of the sparsity of RDF data and provides a sparse-matrix-based data storage model. The query optimization module applies an optimization strategy that reduces query response time while guaranteeing correct results. The query execution module implements the query method of the present invention for RDF data, a Join algorithm based on sparse matrix multiplication. Algorithm 1 runs the Join procedure over each sparse matrix in the given optimized plan and outputs the final query result. Algorithm 2 is the core SMJoin algorithm executed on two given sparse matrices.

Algorithm 1: query execution

Input: the optimized query plan Q

Output: the query result matrix R_t

Algorithm 1 describes a SPARQL query in which SMJoin is applied across multiple sparse matrices. The input is a given optimized query plan Q. The first sparse matrix is taken from Q (denoted R_t) and Q is updated. The next sparse matrix is then taken from Q (denoted SpTable); SMJoin is performed on the two sparse matrices R_t and SpTable, and the result is written back to R_t. This is repeated, taking the next sparse matrix from Q and performing SMJoin, until Q is empty. The final query result is R_t.
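
The loop described for Algorithm 1 can be sketched in Python as below. The representation is an assumption made for illustration: each plan entry is a hypothetical `(subject_var, matrix, object_var)` tuple, a matrix is a `{row: {col, ...}}` dictionary of non-zero entries, and R_t is held as a list of variable bindings. The helper `join_step` is a simple stand-in for SMJoin, not the patented procedure.

```python
def join_step(results, s_var, matrix, o_var):
    """Extend each binding in results with matching non-zero entries of matrix."""
    out = []
    for binding in results:
        if s_var in binding:
            # Subject already bound: only that row can match.
            rows = [binding[s_var]] if binding[s_var] in matrix else []
        else:
            rows = list(matrix)
        for r in rows:
            for c in matrix[r]:
                # If the object is already bound it must agree with c.
                if binding.get(o_var, c) == c:
                    out.append({**binding, s_var: r, o_var: c})
    return out


def execute_plan(plan):
    """Run the joins in plan order; assumes s_var != o_var in every entry."""
    queue = list(plan)
    s_var, matrix, o_var = queue.pop(0)  # first matrix initialises R_t
    results = [{s_var: r, o_var: c} for r, cols in matrix.items() for c in cols]
    while queue:                         # repeat until Q is empty
        s_var, matrix, o_var = queue.pop(0)
        results = join_step(results, s_var, matrix, o_var)
    return results


# Hypothetical predicate matrices over integer-encoded vertices
knows = {1: {2, 3}, 2: {3}}
lives_in = {3: {7}}
answers = execute_plan([("?x", knows, "?y"), ("?y", lives_in, "?z")])
rows = sorted((b["?x"], b["?y"], b["?z"]) for b in answers)
print(rows)
```

Keeping R_t as the only intermediate state means that each join sees the already reduced result set, so a plan that starts with the most selective matrix keeps every later step small.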

Algorithm 2: SMJoin on two sparse matrices, SMJoin(A, B)

Input: two given sparse matrices A and B

Output: the result matrix R_t

For Algorithm 2, SMJoin performs the join operation on the two sparse matrices A and B supplied by Algorithm 1. It first traverses each non-zero row of matrix A; then, for each non-zero entry in each such row, it looks for the matching non-zero entries in matrix B and, if the join condition is satisfied, writes the matched result to the result matrix R_t. The final output of the join operation is the result matrix R_t.
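
A minimal Python sketch of this traversal follows. The text does not spell out the join condition, so the sketch assumes the matrix-multiplication style condition that a column index of A equals a row index of B; the name `sm_join`, the `{row: {col, ...}}` layout, and the tuple-shaped result are likewise illustrative assumptions.

```python
def sm_join(a, b):
    """Join two sparse matrices stored as {row: {col, ...}}.

    Assumed join condition: a non-zero entry (row, col) of A matches every
    non-zero entry in row `col` of B.
    """
    result = []
    for row, cols in a.items():           # each non-zero row of A
        for col in cols:                  # each non-zero entry in that row
            for b_col in b.get(col, ()):  # matching non-zero entries of B
                result.append((row, col, b_col))
    return result


# Hypothetical matrices: A holds "knows" edges, B holds "livesIn" edges
knows = {1: {2, 3}, 2: {3}}
lives_in = {3: {7}}
paths = sorted(sm_join(knows, lives_in))
print(paths)
```

Rows and entries that are zero are never stored, so they are never visited; the work done is proportional to the number of matching non-zero pairs rather than to the full matrix dimensions.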

Referring to Fig. 3, the sparse-matrix-based storage scheme of the present invention comprises steps 1, 2, and 3. The storage scheme stores RDF data in sparse-matrix form. First, the RDF graph data are hash-encoded, converting the string values of the original RDF data into integers. Next, an RDF cube is constructed from the encoded RDF data. Finally, the data are stored as sparse matrices by building a predicate index: each predicate corresponds to one sparse matrix table, and each table stores only the vertices associated with that predicate.
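
The storage steps might look like the following in Python. This is a sketch under stated assumptions: a dictionary that hands out consecutive integers stands in for the hash encoding, the RDF cube is not materialised but is represented implicitly by slicing the triples per predicate, and the names `build_store`, `ids`, and `store` are hypothetical.

```python
from collections import defaultdict


def build_store(triples):
    """Encode terms as integers and index the triples by predicate.

    Returns (ids, store), where ids maps each subject or object string to an
    integer and store maps predicate -> {row: {col, ...}}.
    """
    ids = {}

    def encode(term):
        # Assumption: consecutive integers replace the hash codes of step 1.
        return ids.setdefault(term, len(ids))

    store = defaultdict(lambda: defaultdict(set))
    for s, p, o in triples:
        s_id = encode(s)
        o_id = encode(o)
        # One sparse matrix per predicate; only existing edges are stored.
        store[p][s_id].add(o_id)
    return ids, store


# Hypothetical triples
triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("carol", "livesIn", "tianjin"),
]
ids, store = build_store(triples)
print(dict(ids))
print({p: dict(m) for p, m in store.items()})
```

Indexing by predicate means that a triple pattern with a known predicate reads exactly one matrix, and that matrix contains only the vertices that actually take part in that relationship.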

Referring to Fig. 4, which gives an example of the query optimization module of the present invention, the optimization goal is to reduce the size of the intermediate results produced during the whole query process. A given SPARQL query statement over RDF data is parsed into a query graph; the edges of the query graph are then sorted according to their statistics, and the query execution plan is reconstructed while ensuring that each edge is connected to the edges already selected. The query plan generation algorithm is given as Algorithm 3.

Algorithm 3: query plan generation

Input: the list of all edges of the SPARQL statement, L=(e₁, e₂, ..., e_n)

Output: the optimized query plan Q

Algorithm 3 describes how a SPARQL query statement is optimized to generate the optimal query plan. The input is a SPARQL query statement in the form of a list L=(e₁, e₂, ..., e_n) containing all edges of the query. First, a vertex set (denoted N) is initialized to record the vertices of the selected edges, and L is sorted according to the statistics of its edges. An edge is then chosen from the sorted L and added to Q, its vertices are added to N, and L is updated. While L is not empty, the loop repeatedly selects from L an edge that has a small statistic and is connected to the previously selected edges (a vertex of the edge is in N), adds it to Q, adds its vertices to N, and updates L. This guarantees both an ordered selection and that every edge is connected to the edges before it. The resulting query plan Q is the optimal query plan.
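
The greedy selection in Algorithm 3 can be sketched as follows. The edge format `(subject_var, predicate, object_var)`, the per-predicate `stats` dictionary, and the fallback for a disconnected query graph are assumptions added for illustration; the text itself only specifies sorting by statistics and requiring a shared vertex.

```python
def generate_plan(edges, stats):
    """Order query edges by ascending statistic while keeping them connected.

    edges: list of (subject_var, predicate, object_var)
    stats: {predicate: estimated number of matching triples} (hypothetical)
    """
    remaining = sorted(edges, key=lambda e: stats[e[1]])  # sort L
    first = remaining.pop(0)
    plan = [first]                                        # Q
    seen = {first[0], first[2]}                           # N
    while remaining:
        # Smallest remaining edge that shares a vertex with N. If the query
        # graph is disconnected, fall back to the smallest edge (assumption).
        idx = next(
            (i for i, e in enumerate(remaining) if e[0] in seen or e[2] in seen),
            0,
        )
        edge = remaining.pop(idx)
        plan.append(edge)
        seen.update((edge[0], edge[2]))
    return plan


# Hypothetical query: ?x knows ?y . ?y livesIn ?z . ?z locatedIn ?w
edges = [
    ("?x", "knows", "?y"),
    ("?y", "livesIn", "?z"),
    ("?z", "locatedIn", "?w"),
]
stats = {"knows": 5, "livesIn": 500, "locatedIn": 50}
plan = generate_plan(edges, stats)
order = [e[1] for e in plan]
print(order)
```

In this example `locatedIn` has a smaller statistic than `livesIn`, but it shares no vertex with the first edge, so `livesIn` is scheduled before it. This avoids a join between unconnected edges, which would otherwise produce a large intermediate result.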

The datasets used in the experiments of the present invention are detailed in the following table:

#Dataset	#Triples	#(S∩O)	#P
WatDiv100M	108 997 714	10 250 947	86
WatDiv200M	219 783 842	20 296 483	86
WatDiv300M	329 827 477	30 221 812	86
WatDiv400M	439 433 765	40 040 420	86
WatDiv500M	549 246 141	49 771 433	86
Yago	200 737 655	38 734 252	46
DBpedia	120 978 080	42 966 066	4 282

The present invention was evaluated in query processing experiments on the synthetic social-relationship dataset WatDiv at different scales and on two real RDF datasets, DBpedia and YAGO. The WatDiv datasets contain roughly 100 million, 200 million, 300 million, 400 million, and 500 million triples. The DBpedia data contain 120,978,080 triples, 4,282 predicates, and 42,966,066 subjects and objects; the YAGO data contain 200,737,655 triples, 46 predicates, and 38,734,252 subjects and objects.

The experimental environment for Fig. 6 is as follows: the machine used in the experiments has 72 GB of system memory, and its CPU is an Intel(R) Xeon(R) E5-2603 v4 @ 1.70GHz. The query systems used for comparison are the currently popular RDF-3X and gStore.

Referring to Fig. 5, the present invention is compared with other systems in query experiments on the synthetic dataset WatDiv at different scales. The WatDiv queries are grouped into four query types (snowflake: F, linear: L, star: S, and complex: C) and compared by type. gStore could only handle data scales up to about 300 million triples. The experimental results show that, on synthetic datasets of every scale, the query response time of the present invention is significantly lower than that of RDF-3X and gStore for every query type.

Referring to Fig. 6, the present invention is compared with other systems on query processing over real RDF data. As the figure shows, the query response time of the present invention is significantly lower than the query processing times of RDF-3X and gStore.

On both synthetic and real data, the present invention significantly improves query efficiency for queries over various social relationships. It takes full advantage of the locally dense relationships in real-world social-activity data by storing the data as sparse matrices, which markedly improves storage efficiency, narrows the range of relationships a query must examine, and shortens query response time.

Claims

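1. An RDF data storage and query method based on sparse matrices, characterized in that the steps are as follows:

Step 1: hash-encode the string values of the original RDF data into integers;

Step 2: construct an RDF cube (RDF Cube) from the encoded RDF data;

Step 3: build a predicate index and store the RDF Cube as a series of sparse matrices;

Step 4: parse and optimize the SPARQL query statement to obtain an optimal query plan;

Step 5: execute the sparse-matrix-based Join query and output the result.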

2. The RDF data storage and query method based on sparse matrices as claimed in claim 1, characterized in that step 4 is specifically: a given SPARQL query statement over RDF data is parsed into a query graph; the edges of the query graph are then sorted according to their statistics, and the query execution plan is reconstructed while ensuring that each edge is connected to the edges already selected.

3. The RDF data storage and query method based on sparse matrices as claimed in claim 2, characterized in that step 4 is more specifically: the input is a SPARQL query statement in the form of a list L=(e₁, e₂, ..., e_n) containing all edges of the query; first, a vertex set N is initialized to record the vertices of the selected edges, and L is sorted according to the statistics of its edges; an edge is then chosen from the sorted L and added to Q, its vertices are added to N, and L is updated; while L is not empty, the loop repeatedly selects from L an edge that has a small statistic and is connected to the previously selected edges, that is, a vertex of the edge is in N, adds it to Q, adds its vertices to N, and updates L; the resulting query plan Q is the optimal query plan.

4. The RDF data storage and query method based on sparse matrices as claimed in claim 1, characterized in that step 5 is specifically: given the optimized query plan obtained in step 4, the sparse-matrix-based Join algorithm is executed over the sparse matrices in the plan, in order, to obtain the final query result, wherein each pair of sparse matrices is joined by the SMJoin algorithm.

5. The RDF data storage and query method based on sparse matrices as claimed in claim 4, characterized in that step 5 is more specifically: Algorithm 1 runs the Join procedure over each sparse matrix in the given optimized plan and outputs the final query result; Algorithm 2 is the core SMJoin algorithm executed on two given sparse matrices;

For Algorithm 1, the input is a given optimized query plan Q; the first sparse matrix is taken from Q and denoted R_t, and Q is updated; the next sparse matrix is then taken from Q and denoted SpTable; SMJoin is performed on the two sparse matrices R_t and SpTable, and the result is written back to R_t; this is repeated, taking the next sparse matrix from Q and performing SMJoin, until Q is empty, and the final query result is R_t;

For Algorithm 2, SMJoin performs the join operation on the two sparse matrices A and B supplied by Algorithm 1; it first traverses each non-zero row of matrix A; then, for each non-zero entry in each such row, it looks for the matching non-zero entries in matrix B and, if the join condition is satisfied, writes the matched result to the result matrix R_t; the final output of the join operation is the result matrix R_t.