CN110032676A

CN110032676A - A SPARQL query optimization method and system based on predicate association

Info

Publication number: CN110032676A
Application number: CN201910196896.8A
Authority: CN
Inventors: 杨柳; 熊丹婷; 胡志刚; 龙军
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2019-03-15
Filing date: 2019-03-15
Publication date: 2019-07-19
Anticipated expiration: 2039-03-15
Also published as: CN110032676B

Abstract

The present invention relates to the technical field of association-oriented storage and querying of big data, and discloses a SPARQL query optimization method and system based on predicate association that realize distributed SPARQL queries more quickly and effectively. The method comprises: obtaining the RDF triples in the SPARQL of historical queries and naming the RDF triples by predicate to obtain an original RDF data set; partitioning the RDF data set to obtain VP tables, counting from the VP tables the numbers of subjects and objects connected by each predicate in the RDF data, defining four connectivity characteristics of a predicate, and ranking the predicates by priority according to the strength of their connectivity characteristics; constructing the association between predicates, converting the historical SPARQL query graphs into tree-shaped predicate graphs according to the association, optimizing the tree-shaped predicate graphs, generating correlation tables from the optimized tree-shaped predicate graphs, and converting the SPARQL into query instructions; and querying the tables to be looked up using the query instructions.

Description

A SPARQL query optimization method and system based on predicate association

Technical field

The present invention relates to the technical field of association-oriented storage and querying of big data, and in particular to a SPARQL query optimization method and system based on predicate association.

Background art

The Resource Description Framework (RDF) is a W3C standard for describing web resources. It identifies resources with global identifiers (Internationalized Resource Identifier, IRI) and describes the metadata of a piece of data with triples composed of a subject s, a predicate p and an object o. More and more fields, such as the life sciences, social networks and search engines, describe their data as RDF data sets, and such data sets can contain billions of triples. Large and ever-growing RDF data sets place higher demands on data querying and information retrieval. Against this background, the W3C proposed the SPARQL query language, which is based on the basic graph pattern (Basic Graph Pattern, BGP), for querying and retrieving RDF data.

At present, existing SPARQL query strategies rely mainly on the subjects and objects in RDF triples and SPARQL query clauses, and optimize queries through associations between subjects and objects or between subject variables and object variables. However, such subject-object associations are "instance-level" association queries, whose association characteristics are incidental and specific. Take the query "which related skills have the people interested in 'interface design' learned": an "instance-level" association query focuses on the associations among "?people", "?skill", "?subSkill" and "UIDesign", yet in the next query the "interest" may be "HTML" rather than "UIDesign". That is, in actual queries the subject and object instances involved change frequently, whereas the semantically related predicates "interest" and "learn" recur often. In practical applications this corresponds to the semantic association query "a person interested in something will learn a certain skill". Such predicate-based association is a "pattern-level" association query, i.e. the semantic association characteristics of predicates are general and pattern-like. On the other hand, when SPARQL query clauses are joined on the original RDF data set through subjects and objects, a large number of intermediate results are generated, which is one of the key issues affecting SPARQL query efficiency.

Therefore, a query method with good query performance and query efficiency is needed.

Summary of the invention

The object of the present invention is to provide a SPARQL query optimization method and system based on predicate association, so as to realize distributed SPARQL queries more quickly and effectively.

To achieve the above object, the present invention provides a SPARQL query optimization method based on predicate association, comprising the following steps:

S1: obtaining the RDF triples in the SPARQL of historical queries, naming the RDF triples by predicate, and storing them in predicate form to obtain an original RDF data set;

S2: vertically partitioning the original RDF data set to obtain the VP tables of the RDF data, counting from the VP tables the numbers of subjects and objects connected by each predicate in the RDF data, defining four connectivity characteristics of a predicate from the subject and object counts, and ranking the predicates by priority according to the strength of their connectivity characteristics;

S3: constructing the association between predicates according to the connectivity characteristics of the predicates in S2, converting the historical SPARQL query graphs into tree-shaped predicate graphs according to the association, optimizing the tree-shaped predicate graphs, generating correlation tables from the optimized tree-shaped predicate graphs, and converting the SPARQL into SQL query instructions;

S4: querying the tables to be looked up using the SQL query instructions.

Preferably, in S2, at least four characteristics of a predicate are defined, namely the subject score, the object score, the predicate score and the predicate size.

Preferably, the subject score counts the number of distinct subjects in a single VP table, and is calculated as:

$$ss(p) = \mathrm{count}(\mathrm{DISTINCT}\ s), \quad s \in \{\, s \mid s \text{ in } VP(p) \,\};$$

where p is a predicate, s is a subject, and VP(p) is the VP table of predicate p after vertical partitioning;

The object score counts the number of distinct objects in a single VP table, and is calculated as:

$$os(p) = \mathrm{count}(\mathrm{DISTINCT}\ o), \quad o \in \{\, o \mid o \text{ in } VP(p) \,\};$$

where o is an object;

The predicate score is determined by taking the maximum of the subject score and the object score of the predicate, and is calculated as:

$$ps(p) = \max\{\, ss(p),\ os(p) \,\};$$

The predicate size is the number of tuples in the VP table, and is calculated as:

$$\mathit{pSize}(p) = \mathrm{count}(VP(p)).$$

Preferably, in S2, the criteria for ranking predicates by priority are:

1) a predicate with stronger connectivity has higher priority;

2) a predicate with a higher predicate score has higher priority;

3) when predicate scores are equal, the subject score or object score is additionally compared, and the higher the score, the higher the priority;

4) when both the subject scores and the object scores of predicates are equal, the predicate sizes are compared, and the predicate with the larger predicate size has higher priority.

Preferably, in S3, converting a SPARQL query graph into a tree-shaped predicate graph specifically comprises the following steps:

S31: converting the predicates in the BGP into nodes of the PCGP, and converting the variable connections between subject variables and object variables in the BGP into edges, wherein the order in which PCGP nodes and edges are constructed is determined by the predicate weights in the BGP, producing a conversion graph;

S32: optimizing the tree-shaped predicate graph, whereby the conversion graph is optimized into a conversion tree.

Preferably, the tree-shaped predicate graph is optimized by pruning and cycle removal.

Preferably, when optimizing the tree-shaped predicate graph, the BGP vertex score, BGP predicate selectivity, BGP core vertex and BGP core edge are defined.

As a general inventive concept, the present invention also provides a SPARQL query optimization system, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above method when executing the computer program.

The present invention has the following beneficial effects:

The present invention provides a SPARQL query optimization method and system based on predicate association. First, the RDF triples in the SPARQL of historical queries are obtained, named by predicate, and stored in predicate form to obtain an original RDF data set. The original RDF data set is vertically partitioned to obtain the VP tables; the numbers of subjects and objects connected by each predicate in the RDF data are counted from the VP tables; four connectivity characteristics of a predicate are defined from these counts; and the predicates are ranked by priority according to the strength of their connectivity characteristics. The association between predicates is then constructed from the connectivity characteristics, the SPARQL query graphs are converted into tree-shaped predicate graphs according to the association, the tree-shaped predicate graphs are optimized, correlation tables are generated from the optimized tree-shaped predicate graphs, and the optimized tree-shaped predicate graphs are converted into SQL query instructions. Finally, the tables to be looked up are queried using the SQL query instructions. The method realizes distributed SPARQL queries more quickly and effectively.

The present invention is described in further detail below with reference to the accompanying drawings.

Brief description of the drawings

The accompanying drawings, which form a part of this application, are provided for further understanding of the present invention. The illustrative embodiments of the present invention and their description serve to explain the present invention and do not unduly limit it. In the drawings:

Fig. 1 is a flow chart of the SPARQL query optimization method based on predicate association according to a preferred embodiment of the present invention;

Fig. 2 is a schematic diagram of the conversion from RDF triple storage to vertically partitioned storage according to a preferred embodiment of the present invention;

Fig. 3 is a SPARQL query statement according to a preferred embodiment of the present invention;

Fig. 4 is the equivalent BGP graph of the query statement according to a preferred embodiment of the present invention;

Fig. 5 is a schematic diagram of the four kinds of predicate association between pairs of query clauses according to a preferred embodiment of the present invention;

Fig. 6 is the constructed predicate association pattern graph according to a preferred embodiment of the present invention;

Fig. 7 is a diagram of vertex scores and predicate selectivity according to a preferred embodiment of the present invention;

Fig. 8 is the optimized predicate association pattern graph according to a preferred embodiment of the present invention;

Fig. 9 is the updated historical predicate pattern tree according to a preferred embodiment of the present invention;

Fig. 10 shows the storage and query performance when only basic graph pattern join results are used, according to a preferred embodiment of the present invention;

Fig. 11 is a schematic diagram of the time required by different kinds of query methods under different BCF values according to a preferred embodiment of the present invention;

Fig. 12 is a schematic comparison of the query times of different systems on large data sets according to a preferred embodiment of the present invention.

Detailed description of the embodiments

The embodiments of the present invention are described in detail below with reference to the accompanying drawings, but the present invention can be implemented in many different ways as defined and covered by the claims.

Embodiment 1

Referring to Fig. 1, this embodiment provides a SPARQL query optimization method based on predicate association, comprising the following steps:

S1: obtaining the RDF triples in the SPARQL of historical queries, naming the RDF triples by predicate, and storing them in predicate form to obtain an original RDF data set;

S2: vertically partitioning the original RDF data set to obtain the VP tables of the RDF data, counting from the VP tables the numbers of subjects and objects connected by each predicate in the RDF data, defining four connectivity characteristics of a predicate from the subject and object counts, and ranking the predicates by priority according to the strength of their connectivity characteristics;

S3: constructing the association between predicates according to the connectivity characteristics of the predicates in S2, converting the historical SPARQL query graphs into tree-shaped predicate graphs according to the association, optimizing the tree-shaped predicate graphs, generating correlation tables from the optimized tree-shaped predicate graphs, and converting the SPARQL into SQL query instructions;

S4: querying the tables to be looked up using the SQL query instructions.

The above SPARQL query optimization method ranks predicates by priority by defining their connectivity characteristics and constructs the association between predicates from those characteristics; with this association, distributed SPARQL queries can be realized quickly and effectively. In this embodiment, the characteristic analysis of predicates focuses on frequent predicates, i.e. predicates that occur with high frequency.

It should be noted that, in this embodiment, the correlation table in S3 refers to the query result of the high-frequency predicate patterns filtered out of the optimized tree-shaped predicate graphs. This is equivalent to converting a high-frequency predicate index into a table. Based on the high-frequency predicate patterns in the historical SPARQL, this step heuristically precomputes part of the query and reduces the data volume. When a new query arrives that contains a high-frequency predicate pattern that has already appeared, only the data in the correlation table needs to be read during query processing, and the reduced input greatly improves the efficiency of query processing.
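The vertical partitioning named in S2 can be illustrated with a minimal sketch. It assumes triples are available as plain (subject, predicate, object) tuples in memory; the function name `vertical_partition` and the sample data are hypothetical and are not taken from Fig. 2.

```python
from collections import defaultdict

def vertical_partition(triples):
    """Group (subject, predicate, object) triples into one two-column table per predicate."""
    vp = defaultdict(list)
    for s, p, o in triples:
        vp[p].append((s, o))  # VP(p) keeps only the subject and object columns
    return dict(vp)

# Hypothetical sample data for illustration only.
triples = [
    ("alice", "interest", "UIDesign"),
    ("bob", "interest", "HTML"),
    ("alice", "learn", "CSS"),
    ("CSS", "subOf", "UIDesign"),
]
vp_tables = vertical_partition(triples)
```

Each key of `vp_tables` plays the role of one VP table; in a distributed setting the same grouping would be written out as one stored table per predicate.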

In this embodiment, at least four characteristics of a predicate are defined, namely the subject score, the object score, the predicate score and the predicate size.

The subject score counts the number of distinct subjects in a single VP table, and is calculated as:

$$ss(p) = \mathrm{count}(\mathrm{DISTINCT}\ s), \quad s \in \{\, s \mid s \text{ in } VP(p) \,\};$$

where p is a predicate, s is a subject, and VP(p) is the VP table of predicate p after vertical partitioning. In this embodiment, a higher subject score shows that the predicate is connected to more subjects, reflecting stronger connectivity of the predicate in the entire RDF graph.

The object score counts the number of distinct objects in a single VP table, and is calculated as:

$$os(p) = \mathrm{count}(\mathrm{DISTINCT}\ o), \quad o \in \{\, o \mid o \text{ in } VP(p) \,\};$$

where o is an object. In this embodiment, a higher object score shows that the predicate is connected to more objects, reflecting stronger connectivity of the predicate in the entire RDF graph.

The predicate score is determined by taking the maximum of the subject score and the object score of the predicate, and is calculated as:

$$ps(p) = \max\{\, ss(p),\ os(p) \,\};$$

Preferably, the larger of the subject score and the object score of the predicate is chosen as the predicate score.

The predicate size is the number of tuples in the VP table, and is calculated as:

$$\mathit{pSize}(p) = \mathrm{count}(VP(p)).$$

The predicate size measures the number of tuples in the VP table and reflects the size of the VP table. A higher predicate score shows that the predicate is connected to more subjects or objects, reflecting stronger connectivity of the predicate in the entire RDF graph.
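As an illustration of the four characteristics, the following sketch computes them for one VP table held as a list of (subject, object) rows. The helper name `predicate_stats` and the sample table are assumptions made for this example.

```python
def predicate_stats(vp_table):
    """Return ss, os, ps and pSize for one VP table given as a list of (s, o) rows."""
    ss = len({s for s, _ in vp_table})   # number of distinct subjects
    os_ = len({o for _, o in vp_table})  # number of distinct objects
    return {"ss": ss, "os": os_, "ps": max(ss, os_), "pSize": len(vp_table)}

# Hypothetical VP table for a predicate such as "follow".
vp_follow = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]
stats = predicate_stats(vp_follow)
```

Here the table has three distinct subjects and two distinct objects, so the predicate score is 3 and the predicate size is 4.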

Further, the predicates are ranked by priority according to the following criteria:

1) a predicate with stronger connectivity has higher priority;

2) a predicate with a higher predicate score has higher priority;

3) when predicate scores are equal, the subject score or object score is additionally compared, and the higher the score, the higher the priority;

4) when both the subject scores and the object scores of predicates are equal, the predicate sizes are compared, and the predicate with the larger predicate size has higher priority.

In this embodiment, according to the above ranking criteria, the predicate priorities for the VP tables corresponding to Fig. 2 are:

PRI(subOf) > PRI(workFor) > PRI(follow) > PRI(learn) > PRI(interest) > PRI(location).
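One possible reading of the ranking criteria is a lexicographic sort, sketched below. It assumes that the tie-breaker in criterion 3 is the smaller of the subject and object scores (the one not already used as the predicate score); that interpretation, the function name `rank_predicates` and the sample statistics are assumptions, and the data is not that of Fig. 2.

```python
def rank_predicates(stats):
    """Sort predicate names from highest to lowest priority."""
    def key(name):
        st = stats[name]
        other = min(st["ss"], st["os"])  # assumed tie-breaker: the score not used as ps
        return (st["ps"], other, st["pSize"])
    return sorted(stats, key=key, reverse=True)

# Hypothetical per-predicate statistics.
stats = {
    "p1": {"ss": 3, "os": 2, "ps": 3, "pSize": 4},
    "p2": {"ss": 3, "os": 3, "ps": 3, "pSize": 5},
    "p3": {"ss": 5, "os": 1, "ps": 5, "pSize": 6},
    "p4": {"ss": 3, "os": 3, "ps": 3, "pSize": 9},
}
order = rank_predicates(stats)
```

With these values `p3` ranks first on predicate score, `p4` beats `p2` on predicate size, and `p1` comes last because its second score is lower.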

It should be noted that the basic graph pattern (BGP) of a SPARQL query is composed of multiple triple patterns (Triple Pattern), which combine into a complex basic graph pattern. For example, the query statement in Fig. 3 contains the single BGP shown in Fig. 4, namely: query whether the people who like "interface design" have learned the specific related skills of the corresponding field. As can be seen from the SPARQL query clauses and the BGP, the predicates in SPARQL query clauses are basically known. Using the four predicate characteristics from the VP tables, the SPARQL query graph pattern is converted into a tree-shaped predicate graph pattern to optimize the order of the SPARQL query clauses, so that the relevant predicates in the queried RDF data set can be located quickly.

The specific method for converting a SPARQL query graph pattern into a tree-shaped predicate graph pattern is as follows.

The definition of predicate correlation (Predicate Correlation, PC) is introduced first: it denotes the association between the two corresponding predicates when two triple patterns that share a variable in a SPARQL query statement are semi-joined. For example, when the clauses in Fig. 3 are semi-joined, the two predicates "interest" and "learn" become associated. It should be noted that in actual queries a user who is interested in a skill often also asks which related techniques should be learned; repeatedly asking this question in practice greatly increases the number of associated searches over the predicates interest, learn and subOf. Therefore, based on the semantic association between predicates, all predicate correlation structures of the historical SPARQL query statements are analyzed to mine the frequently associated predicate patterns in the historical queries.

For a SPARQL query graph pattern, query clauses connected through a subject variable or an object variable give rise to four forms of predicate association: subject-subject association (SS), subject-object association (SO), object-subject association (OS) and object-object association (OO). Fig. 5 depicts the four kinds of predicate association between pairs of query clauses: Fig. 5(a) is subject-subject (SS), Fig. 5(b) is object-subject (OS), Fig. 5(c) is subject-object (SO), and Fig. 5(d) is object-object (OO). For example, the two query clauses "?v2 p1 ?v1" and "?v2 p2 ?v3" in Fig. 5(a) are connected through "?v2"; therefore the two predicates p1 and p2 are in a subject-subject association.

For two subqueries q_i(s_i, p_i, o_i) and q_j(s_j, p_j, o_j) in SPARQL, the predicate correlation PC is as follows:

As can be seen from the predicate associations shown in Fig. 5, the position of the join variable likewise determines the four association forms of predicates, namely SS (subject-subject), SO (subject-object), OS (object-subject) or OO (object-object). In this embodiment, the SPARQL query structure (BGP) connected through join variables is converted into an equivalent query structure based on predicate association, which reflects the semantic correlation of the query clauses through predicate association. By analogy with the BGP structure of a SPARQL query statement, the query structure of predicate associations is called the predicate correlation graph pattern (PCGP).
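How the position of the join variable determines the association form can be sketched as follows. The sketch assumes triple patterns are (s, p, o) tuples whose variables start with "?", and it assumes that the first letter of each label refers to the first pattern and the second letter to the second pattern; the function name `predicate_correlations` is hypothetical.

```python
def predicate_correlations(q_i, q_j):
    """Return the association forms between two triple patterns (s, p, o).

    Only variables (terms starting with '?') are treated as join terms.
    """
    s_i, _, o_i = q_i
    s_j, _, o_j = q_j
    pairs = {"SS": (s_i, s_j), "SO": (s_i, o_j), "OS": (o_i, s_j), "OO": (o_i, o_j)}
    return [kind for kind, (a, b) in pairs.items() if a == b and a.startswith("?")]

# The Fig. 5(a) example from the text: both clauses share ?v2 in the subject position.
print(predicate_correlations(("?v2", "p1", "?v1"), ("?v2", "p2", "?v3")))
```

For the two clauses quoted from Fig. 5(a), the shared variable sits in the subject position of both patterns, so the function reports a subject-subject association.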

Based on predicate association, the conversion from BGP to PCGP can be divided into two steps: (1) conversion into a graph: the predicates in the BGP are converted into nodes of the PCGP, and the variable connections between subject variables and object variables in the BGP are converted into edges; the order in which PCGP nodes and edges are constructed is determined by the predicate weights in the BGP, producing a conversion graph; (2) conversion into an optimized tree: to simplify the query pattern, the conversion graph is optimized into a conversion tree by pruning and cycle removal. When optimizing the tree-shaped predicate graph, the BGP vertex score, BGP predicate selectivity, BGP core vertex and BGP core edge are defined and are used to determine the predicate weights in the BGP.

Specifically, the BGP vertex score (vs) is defined as follows: for any vertex v in the BGP with i outgoing edges and j incoming edges, compute the sum of the subject scores, in their VP tables, of all i predicates connected to the vertex in the subject position, and the sum of the object scores, in their VP tables, of all j predicates connected to the vertex in the object position; the BGP vertex score is the sum of the two. A higher vertex score shows stronger connectivity of the vertex in the entire RDF graph; if strongly connected nodes are queried first, the query range can be located quickly. From this definition, the calculation formula is:

$$vs(v) = \sum_{m=1}^{i} ss(p_m) + \sum_{n=1}^{j} os(p_n),$$

where $p_m$ ranges over the predicates of the outgoing edges of v and $p_n$ over the predicates of the incoming edges of v.

The BGP edge score (es) is defined as follows: for any edge e in the BGP connecting two vertices v1 and v2, the predicate selectivity of edge e is es(e) = vs(v1) + vs(v2). A higher edge score shows stronger connectivity, in the entire RDF graph, of the vertices joined by this edge. Such an edge often connects two star-shaped subgraphs and occupies a core position in the whole query process; if edges with high scores are queried first, the strongly connected nodes are also queried first, so the query range can be located quickly.
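The vertex and edge scores can be sketched as below, assuming the BGP is a list of (s, p, o) patterns and that per-predicate subject and object scores are already known. All numbers and the three-clause BGP are hypothetical and only loosely modeled on the interest/learn/subOf example.

```python
from collections import defaultdict

def vertex_scores(bgp, ss, os_):
    """bgp: list of (s, p, o) patterns; ss/os_: per-predicate subject/object scores."""
    vs = defaultdict(int)
    for s, p, o in bgp:
        vs[s] += ss[p]   # p leaves s, so s sits in the subject position
        vs[o] += os_[p]  # p enters o, so o sits in the object position
    return dict(vs)

def edge_scores(bgp, vs):
    """es(e) = vs(v1) + vs(v2) for the edge e = (v1, p, v2)."""
    return {(s, p, o): vs[s] + vs[o] for s, p, o in bgp}

# Hypothetical scores and BGP.
ss = {"interest": 4, "learn": 3, "subOf": 2}
os_ = {"interest": 2, "learn": 5, "subOf": 1}
bgp = [("?people", "interest", "UIDesign"),
       ("?people", "learn", "?skill"),
       ("?skill", "subOf", "UIDesign")]
vs = vertex_scores(bgp, ss, os_)
es = edge_scores(bgp, vs)
```

In this example the "learn" edge joins the two highest-scoring vertices and therefore receives the highest edge score.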

In this embodiment, the BGP is traversed in descending order of vertex score and edge score, i.e. strongly connected vertices and edges are traversed first and weakly connected ones afterwards. The PCGP is constructed during this ordered traversal of the BGP, so the PCGP is a query structure that reflects the connectivity of the original RDF graph and makes it easy to determine the clause search order quickly in later SPARQL queries.

Further, the specific steps for constructing the predicate association graph are as follows.

First, the vertices in the BGP are sorted by vertex score to find the core vertex with the highest vertex score; then the core edge with the highest score among the edges connected to the core vertex is found and used as the root node, i.e. the start node, of the PCGP. Using a breadth-first method, the BGP edges are traversed in turn by edge score; each BGP edge is converted into the corresponding PCGP node, and each vertex joining these edges is converted into the corresponding PCGP edge. Since a PCGP edge embodies a predicate association of the BGP, the predicate association is saved as an attribute of the PCGP edge, and a PCGP tree is finally generated. For example, in Fig. 6(a) the attribute of the edge connecting the two predicate vertices "workFor" and "location" is SS. It should be noted that Fig. 6(a) is the linear (Linear, L) PCGP graph, Fig. 6(b) is the star (Star, S) PCGP graph, and Fig. 6(c) is the snowflake (Snowflake, F) PCGP graph; Fig. 7(a) is the linear (Linear, L) BGP graph, Fig. 7(b) is the star (Star, S) BGP graph, and Fig. 7(c) is the snowflake (Snowflake, F) BGP graph.
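The breadth-first construction can be sketched roughly as follows. This is a simplified illustration, not the patented procedure itself: it assumes precomputed vertex and edge scores, uses predicate names as PCGP node identifiers (so repeated predicates are not handled), treats any shared term as a connection, and keeps only the first association form found between two BGP edges. The function name `build_pcgp` and all sample values are hypothetical.

```python
from collections import deque

def build_pcgp(bgp, vs, es):
    """Breadth-first conversion of a BGP into a PCGP tree.

    bgp: list of (s, p, o) patterns; vs: vertex scores; es: edge scores.
    Returns (root_predicate, tree_edges) with tree_edges as (parent, child, kind).
    """
    def kinds(a, b):
        # Association form derived from the positions of the shared vertex.
        pos = {"SS": (a[0], b[0]), "SO": (a[0], b[2]),
               "OS": (a[2], b[0]), "OO": (a[2], b[2])}
        return [k for k, (x, y) in pos.items() if x == y]

    core_vertex = max(vs, key=vs.get)
    # Core edge: highest-scoring edge incident to the core vertex.
    root = max((e for e in bgp if core_vertex in (e[0], e[2])), key=lambda e: es[e])
    seen, queue, tree = {root}, deque([root]), []
    while queue:
        parent = queue.popleft()
        # Visit unvisited neighbouring edges in descending edge score.
        for child in sorted(bgp, key=lambda e: es[e], reverse=True):
            if child in seen:
                continue
            k = kinds(parent, child)
            if k:
                seen.add(child)
                tree.append((parent[1], child[1], k[0]))
                queue.append(child)
    return root[1], tree

# Hypothetical BGP with hypothetical scores.
bgp = [("?people", "interest", "UIDesign"),
       ("?people", "learn", "?skill"),
       ("?skill", "subOf", "UIDesign")]
vs = {"?people": 7, "UIDesign": 3, "?skill": 7}
es = {bgp[0]: 10, bgp[1]: 14, bgp[2]: 10}
root, tree = build_pcgp(bgp, vs, es)
```

Because each BGP edge is added to the tree at most once, the result contains no cycles, which loosely corresponds to the cycle removal mentioned in the text.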

The same predicate may occur several times in a SPARQL query statement; for example, the predicate "location" occurs twice in Fig. 7(a). When the BGP is converted to the PCGP, several nodes may then represent the same predicate. Therefore, when merging predicate associations, branches with the same structure are merged to reduce the depth of the PCGP and thereby improve query speed. There are two merging cases. In the first case, if the predicate of the current node in the PCGP is the same as that of its parent node, the two nodes are merged, the attribute of the edge that connected them is assigned to the parent node, and all child nodes of the current node are attached to the parent node; in Fig. 8(a), the two "location" nodes are merged into one. In the second case, if the current predicate node in the PCGP has already appeared in an upper layer, all child nodes of that predicate node are attached as child nodes to the predicate node that already appeared above, reducing the depth of the PCGP. In particular, if the predicate of the current node's parent appears in the parent or child nodes of an already visited node, the current node is deleted directly and the attributes of the current node and its edge are updated into the related edge of the visited node; for example, in Fig. 8(c) the lower predicate "workFor" is merged into the upper one. Fig. 8(a) is linear (Linear, L), Fig. 8(b) is star (Star, S), and Fig. 8(c) is snowflake (Snowflake, F).

The three basic query patterns of the BGP shown in Fig. 7 (chain, star and snowflake) are converted into the PCGP graphs shown in Fig. 6.

In the above steps, the BGP of each historical SPARQL query is converted into a PCGP, and the PCGP is constructed according to the connectivity structure of the RDF graph, which ensures that the target position can be matched quickly during querying and improves query efficiency. It should be noted that hotspot queries occur in historical queries; if hotspot query nodes are matched first and non-hotspot query nodes afterwards, the query range can be reduced to some extent and the query sped up. Therefore, if the number of times an association has been queried is saved in the edge attribute of the PCGP, the frequently queried predicate associations, and hence the hotspot query nodes, can be recorded.

Further, in this embodiment, query speed can be improved further by constructing the association between frequent predicates. Specifically, frequent predicate association refers to the high-frequency predicate associations present in historical SPARQL queries. There are two classes of frequent predicate association in actual queries. The first is pairwise predicate association: however complex a SPARQL query is, it can ultimately be decomposed into join operations between subqueries, and a semi-join operation at query time associates two predicates. A pairwise predicate association index table can therefore be established to store the frequently queried pairwise predicate associations. The second is multi-predicate combination association: a multi-predicate combination association index table can be established to store the frequently queried predicate combination associations. At query time, according to the predicate associations in the PCGP, frequent predicate associations are first matched against these two classes of association index tables; on a match, the query result is obtained immediately, which greatly reduces the join cost and the input/output cost of intermediate results during querying.

In this embodiment, on the basis of the VP tables of the vertically partitioned RDF storage in Fig. 2, indexes are built for the high-frequency pairwise predicate associations and the high-frequency multi-predicate associations in SPARQL queries, so that query results for related predicates can be found quickly in the original RDF graph data. BCF (Bi-Correlation Factor) and MCF (Multi-Correlation Factor) are defined to measure how frequently a predicate association occurs. Preferably, BCF takes values in [0,1] and denotes the proportion, by number of associations, of pairwise predicate associations in the PCGP that are selected as frequent. If the association-count ranking of a pairwise predicate association falls within the BCF range, it is regarded as a frequent pairwise predicate association (top_BCF). For the PCGP shown in Fig. 9, for example, selecting BCF = 0.2 means that the selected frequent pairwise predicate associations are those whose frequency of occurrence in the historical queries ranks in the top 20% of all pairwise predicate associations; the pairwise predicate associations that qualify are (workFor, follow, SS) and (workFor, location, SS). MCF takes values in [0,1] and indicates that every pairwise predicate association in a multi-predicate combination association belongs to top_BCF, where the BCF value is the selected MCF value, and that the multi-predicate combination association is connected in the PCGP and is of maximal extent. For example, if MCF = 0.2 is selected, only the multi-predicate combination association {(workFor, follow, SS), (workFor, location, SS)} qualifies: both of its pairwise predicate associations belong to the top 20%, they are connected in the PCGP, and no further pairwise predicate association in the PCGP belonging to the top 20% can extend this multi-predicate combination association.
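Selecting top_BCF can be sketched as a frequency ranking over the pairwise associations seen in historical queries. The sketch assumes associations are recorded as (predicate, predicate, form) tuples and that the cut-off count is rounded up; the rounding rule, the function name `top_bcf` and the sample history are assumptions.

```python
import math
from collections import Counter

def top_bcf(history, bcf):
    """Return the pairwise associations whose frequency ranks in the top `bcf` share.

    history: iterable of (predicate_a, predicate_b, kind) seen in past queries.
    """
    counts = Counter(history)
    keep = math.ceil(len(counts) * bcf)  # assumed rounding rule: round up
    return [assoc for assoc, _ in counts.most_common(keep)]

# Hypothetical query history with five distinct pairwise associations.
history = (
    [("workFor", "follow", "SS")] * 5
    + [("workFor", "location", "SS")] * 4
    + [("interest", "learn", "SS")] * 2
    + [("learn", "subOf", "OS")]
    + [("follow", "location", "OO")]
)
frequent = top_bcf(history, 0.4)
```

With five distinct associations and a factor of 0.4, the two most frequent associations are kept as top_BCF.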

The high-frequency pairwise predicate associations and multi-predicate associations are found using BCF and MCF. Specifically, two classes of index are built on top of the VP tables of Fig. 2: the high-frequency pairwise association predicate index tables and the high-frequency multi-association predicate index table. There are four high-frequency pairwise predicate index tables, VP_FSS, VP_FSO, VP_FOS and VP_FOO, which record the four kinds of semi-join results of the high-frequency pairwise predicate associations; the high-frequency multi-association predicate index table VP_FMC records the results of the high-frequency multi-predicate associations. These two classes of index greatly reduce the query range.
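A semi-join result of the kind stored in the pairwise index tables can be sketched as follows. The source does not specify the physical layout of VP_FSS and the other tables, so this sketch simply assumes that an index entry holds the rows of the left VP table that have a join partner in the right VP table; the function name `semi_join` and the sample rows are hypothetical.

```python
def semi_join(vp_left, vp_right, kind):
    """Rows of vp_left that have a join partner in vp_right for the given form.

    kind is one of "SS", "SO", "OS", "OO"; the first letter selects the column
    of vp_left, the second the column of vp_right (S = subject, O = object).
    """
    left_col = 0 if kind[0] == "S" else 1
    right_col = 0 if kind[1] == "S" else 1
    partners = {row[right_col] for row in vp_right}
    return [row for row in vp_left if row[left_col] in partners]

# Hypothetical VP tables for workFor and follow.
vp_workfor = [("alice", "acme"), ("bob", "initech"), ("carol", "acme")]
vp_follow = [("alice", "bob"), ("dave", "alice")]
vp_fss = semi_join(vp_workfor, vp_follow, "SS")
```

Only the workFor row for "alice" survives, because "alice" is the only workFor subject that also appears as a subject in the follow table; a later query joining the two predicates on their subjects would read this reduced table instead of the full VP table.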

The conversion and query process can be divided into the following three steps: (1) adjusting the join order of SPARQL clauses based on predicate association: all predicate associations in the SPARQL are extracted, and the high-frequency pairwise association predicate index tables and the high-frequency multi-association predicate index table are queried; if an index exists for a predicate association, the join order of the query clauses of the associated predicates is adjusted according to the predicate association and the result of the index lookup is recorded; if no index exists, the join order of the corresponding query clauses is left unchanged; (2) conversion to SQL query statements: the SPARQL query statement with adjusted clause order is converted into equivalent SQL statements, which are run against the predicate association index tables and the VP tables of the original RDF graph data respectively to obtain query results; (3) join query: the SQL subqueries are joined, and the SQL clauses are executed in order of the size of the corresponding query tables, querying smaller tables first to narrow the query range and reduce the intermediate result set; the final result is obtained with a Spark SQL query. Querying in this way is efficient and fast.
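Steps (2) and (3) can be illustrated with a small translator from triple patterns to a SQL join over VP tables. This is a sketch under stated assumptions, not the converter used in the embodiment: table names of the form `vp_<predicate>`, column names `s` and `o`, and the `table_sizes` input are all hypothetical, index tables are ignored, and constants are inlined without escaping.

```python
def bgp_to_sql(bgp, table_sizes):
    """Translate triple patterns into one SQL join over VP tables, smallest table first."""
    ordered = sorted(bgp, key=lambda t: table_sizes[t[1]])
    froms, wheres, bound = [], [], {}
    for i, (s, p, o) in enumerate(ordered):
        alias = f"t{i}"
        froms.append(f"vp_{p} {alias}")
        for col, term in (("s", s), ("o", o)):
            ref = f"{alias}.{col}"
            if not term.startswith("?"):
                wheres.append(f"{ref} = '{term}'")       # constant becomes a filter
            elif term in bound:
                wheres.append(f"{ref} = {bound[term]}")  # shared variable becomes a join
            else:
                bound[term] = ref                        # first occurrence is projected
    select = ", ".join(f"{ref} AS {var[1:]}" for var, ref in bound.items())
    sql = f"SELECT {select} FROM {', '.join(froms)}"
    return sql + (f" WHERE {' AND '.join(wheres)}" if wheres else "")

# Hypothetical two-clause query and table sizes.
bgp = [("?people", "learn", "?skill"), ("?people", "interest", "UIDesign")]
sql = bgp_to_sql(bgp, {"interest": 100, "learn": 500})
print(sql)
```

The smaller `interest` table is placed first, the constant becomes a filter on its object column, and the shared variable `?people` becomes the join condition between the two tables.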

Experimental verification

In the present embodiment, the experiments were run on a cluster of 7 computers, with 1 computer serving as the master (Master) node and the other 6 as worker (Worker) nodes. Each of the 7 computers has an AMD Ryzen 7 1800X 8-core 3.8 GHz processor and 32 GB of memory; the operating system is CentOS 7 and the computing environment is Cloudera CDH 5.7.6, using the Spark bundled with CDH (Spark 1.6.0). The tables used in the experiments are stored as files in Parquet format, and the broadcast join (BroadcastJoin) function of Spark SQL is disabled to avoid automatic use of the Hive component. The PCGP information is computed using the built-in operators of the Spark GraphX module.

The experimental data sets and the SPARQL queries were generated with the WatDiv Data Generator and Query Builder, respectively. WatDiv is an RDF data management benchmark developed by the Data Systems Group at the University of Waterloo; it defines four classes of SPARQL query: linear (Linear, L), star (Star, S), snowflake (Snowflake, F) and complex (Complex, C). It should be noted that a complex query is formed by combining one or more of the linear, star and snowflake types, and is not shown in the drawings. WatDiv data with scale factors of 1000 and 10000 were used in the experiments. Table 1 lists the information of the WatDiv data sets used, including the number of RDF triples, the number of distinct predicates and the corresponding HDFS storage size. The table shows that, although the two data sets differ in triple count by a factor of 10, both have 86 distinct predicates.

Table 1: WatDiv data set information used in the experiments

During the query process, 10000 query statements were generated at random with the WatDiv Query Builder, covering the four classes of linear (L), star (S), snowflake (F) and complex (C) queries, in order to test the influence of the number of queries on storage and querying.

BCF and MCF denote the pairwise predicate association factor and the multi-predicate combination association factor, respectively. For example, BCF=0.2 means that only the pairwise predicate associations whose frequency of occurrence ranks in the top 20% of the historical SPARQL query statements are stored, that is, the corresponding VP_FSS, VP_FSO, VP_FOS and VP_FOO indexes are built for them; MCF=0.2 means that only the multi-predicate associations whose frequency of occurrence ranks in the top 20% of the historical SPARQL query statements are stored, that is, the corresponding VP_FMC index is built for them.
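The top-fraction selection expressed by BCF and MCF can be sketched in a few lines of Python. The sketch assumes that associations are ranked by how often they occur in the historical queries and that the number kept is rounded up; the rounding rule, the tie-breaking by key, the function name and the sample counts are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Hashable, List


def select_frequent(counts: Counter, factor: float) -> List[Hashable]:
    """Return the associations whose frequency ranks in the top `factor`
    fraction (for example factor=0.2 keeps the top 20%)."""
    if not 0.0 <= factor <= 1.0:
        raise ValueError("factor must lie in [0, 1]")
    # Rounding before ceil guards against floating-point noise.
    keep = math.ceil(round(factor * len(counts), 9))
    # Sort by descending frequency; ties are broken by key for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [assoc for assoc, _ in ranked[:keep]]


# Hypothetical frequencies of (predicate1, predicate2, join kind) in history.
history = Counter({
    ("follows", "likes", "OS"): 50,
    ("likes", "hasGenre", "OS"): 30,
    ("follows", "follows", "OS"): 10,
    ("likes", "likes", "SS"): 5,
    ("follows", "hasGenre", "OO"): 1,
})
print(select_frequent(history, 0.2))
print(select_frequent(history, 0.4))
```

With factor 0.0 nothing is indexed and with factor 1.0 every association observed in the history is indexed, matching the two extremes described for the threshold.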

Table 2 below lists the storage information for six BCF threshold values in [0,1], including the number of predicate pairs with a pairwise predicate association (Number of Predicate Correlation, Npc), and the number of records (Number of Records, Nrec) and storage space (HDFS Size, Size) of the VP tables and the VP_FSS, VP_FSO, VP_FOS and VP_FOO index tables. When BCF=0.0, no predicate association is selected, so Npc is 0 and Nrec and Size are at their minimum, namely the number of records and the storage size of the original VP tables; when BCF=1.0, all predicate associations in the PCGP are selected, and Npc, Nrec and Size are at their maximum.

The WatDiv (SF1000) data set of Table 1 has 86 distinct predicates, so there are 86*86*4=29584 possible predicate pairs for pairwise predicate association. Table 2 shows, however, that at BCF=1.0 the number of associated predicate pairs is at most (Max(Npc)) 4419; that is, only 4419 predicate pairs actually have a semantic association, about 15% of all possible predicate pairs. Analysis suggests two reasons for this. First, the amount of RDF data touched by actual SPARQL queries is far smaller than the whole original data set. Second, it follows from the way the WatDiv Query Builder generates SPARQL query statements: to guarantee that queries are valid, all query statements are generated at random from the query pool by random walk, which indirectly raises the probability that data with a higher degree of predicate association is selected for querying.
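The figures quoted above can be checked with a short calculation. The variable names are illustrative; the values 86, 4 and 4419 are taken from the text.

```python
predicates = 86        # distinct predicates in WatDiv (SF1000)
join_kinds = 4         # SS, SO, OS and OO associations
possible_pairs = predicates * predicates * join_kinds
observed_pairs = 4419  # Max(Npc) at BCF=1.0

share = observed_pairs / possible_pairs
print(possible_pairs)   # 29584
print(f"{share:.1%}")   # 14.9%
```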

When BCF is in [0.2,0.4], the predicate pairs for which the PCGP requires an index account for no more than 30% of Max(Npc); when BCF is in [0.6,0.8], this proportion exceeds 60%, and the number of records and the storage space of the index tables that must be saved increase sharply.

Table 2: Storage under different BCF values (WatDiv SF1000)

Table 3 below lists the storage information for six MCF threshold values in [0,0.6], including the number of predicate pairs with a multi-predicate association (Npc), and the number of records (Nrec) and storage space (Size) of the VP_FMC index table. Because few new multi-predicate association pairs are added once MCF > 0.6, only the cases with MCF ≤ 0.6 are counted in the present embodiment. When MCF is 0.3 or 0.4, Npc is 1056 and 1223 respectively, which is relatively many predicate pairs; when MCF is 0.5 or 0.6, Npc is 598 and 279 respectively, which is relatively fewer. The number of records Nrec and the storage space Size of the actual VP_FMC index table nevertheless increase. The reason is that, at larger MCF values, although there are fewer multi-predicate association pairs, they involve a large amount of pairwise predicate association information, so the amount of record information to be retained in the VP_FMC index table also increases, and the values of Nrec and Size rise accordingly.

Table 3: Storage under different MCF values (WatDiv SF1000)

Figure 10 and Figure 11 below show how the index storage space and the SPARQL query time vary for the four SPARQL query types (L, S, F, C) as BCF and MCF take different values. In the figures, the polylines indicate query time and the columns indicate storage space; the result for each of the four query classes is the average of 20 queries.

Specifically, Figure 10 shows the storage and the query performance when only the join results of basic graph patterns are used. Basic graph patterns have an obvious optimization effect on all query templates.

It can be seen from Figure 11 that, when BCF=0.2, the query time of all four query classes is greatly shortened compared with querying the VP tables only, without any pairwise predicate association index (BCF=0); for the complex query C, the query time is shortened by 50%. When BCF=0.4, the query time of the complex query C is shortened by 60%. Once BCF > 0.4, however, the query time of the four query classes changes little. This shows that, although some predicates in the historical SPARQL queries are associated, they are not frequently associated, so these predicate associations need not be retained when the index is built. It can also be seen that the storage space of the index grows linearly as the BCF value increases. Based on the experimental results, BCF=0.4 is chosen as the optimal threshold in the present embodiment.

It can be seen from Figure 11 that, when MCF=0, no multi-predicate combination association meets the requirement, so no additional VP_FMC index table needs to be built and results are queried directly from the original VP tables; the index table Size is therefore 0. When MCF=0.2, the query time is shortest for all four query classes; the complex query shows the largest reduction and the linear query the smallest. The predicate combinations at this point mostly involve 3 or 4 predicates, and the VP_FMC index table built in this case greatly improves query efficiency. Once MCF > 0.3, the query time of the four query classes no longer decreases as MCF increases, and when MCF reaches 0.6 the query time rises almost to the level of MCF=0. This shows that, in SPARQL queries, multi-predicate combination associations involving more than 4 predicates mostly do not belong to frequent multi-predicate query combinations; that is, the semantic association among these predicates is not high, so the matching efficiency of queries in the index table is low and the final query time increases instead. Based on the experimental results, the present embodiment chooses MCF=0.2 as the optimal threshold.

With BCF=0.4 and MCF=0.2, the query times before and after building the frequent predicate association index tables are compared, that is, the time the system takes to query based on the VP tables alone versus the time it takes to query based on the VP tables together with the VP_FSS, VP_FSO, VP_FOS, VP_FOO and VP_FMC index tables. Table 4 shows that the query time of all four query classes decreases markedly after the indexes are built; for the three classes L, S and F, the query time is reduced by 50% or more.

Table 4: Comparison of query time using VP tables only and using VP tables with index tables (ms)

Further, 20 query statements of the four classes were generated with the WatDiv Query Builder, 5 for each class. For the data sets of the two scales WatDiv (SF1000) and WatDiv (SF10000), the method of the present invention (denoted FrePre2 in Figure 11) is compared, in terms of both additional storage space and query time, with S2RDF (SF=0.25), S2RDF (SF=1), H2RDF, Sempala, PigSPARQL and SHARD. Table 5 lists the additional storage space of these methods. The S2RDF (SF=0.25) method has the smallest storage size, 1.8G and 18.4G for the WatDiv (SF1000) and WatDiv (SF10000) data sets respectively; the SHARD method has the largest storage size, almost 5 times that of S2RDF (SF=0.25). The storage size of the method proposed by the present invention is 3.4G and 34.9G respectively, about 2 times that of S2RDF (SF=0.25); a storage size of this order is fairly common in a distributed environment.

Table 5: Additional HDFS storage size of different systems on different data sets

S2RDF is taken as the main object of comparison. From Table 7, the percentage by which this system reduces query time relative to S2RDF on the large data set can be calculated: query time is reduced by at least 30% on average, and by more than 80% in the best case. There are two main reasons why the results are generally better than those of S2RDF. First, S2RDF uses only the relative data reduction as the criterion for saving ExtVP tables and does not consider O-O predicate connectivity. This system uses query frequency as the criterion for saving ExtVP tables and, at the cost of some additional storage space, obtains less data input; the reduction in data input directly improves query efficiency. Optimizing queries by query frequency is an approach also used by search engines such as Google. Second, this system stores the query results of frequent predicate graph patterns, which reduces the input size and the number of join operations. There are also cases where the optimization effect is poor, such as L3, S3, S5 and C4, where the query speed differs little from that of S2RDF or is even slightly slower; the reason is that these queries contain unfamiliar predicate associations, and the index lookup itself adds to the query time. Overall, the query performance of the present embodiment is clearly better than that of the other systems.

Figure 12 compares the query times of the different systems on the large data set, and Table 6 and Table 7 list, for the two data set scales respectively, the query times of the different methods for the four kinds of query. It can be seen that the method proposed by the present invention has the shortest query time, followed by S2RDF (SF=0.25): the query time of the proposed method lies in [32ms, 2304ms], while that of S2RDF (SF=0.25) lies in [187ms, 3018ms]. The query time of the proposed method is nearly three orders of magnitude shorter than that of the PigSPARQL and SHARD methods, and nearly two orders of magnitude shorter than that of the Sempala method. The proposed method and S2RDF (SF=0.25) are implemented with the Spark SQL module, whereas the PigSPARQL and SHARD methods execute queries by MapReduce batch processing; although these greatly simplify the data analysis process, they are not suited to interactive queries, which makes their query cost high. Although the join cost of the Sempala method is low, all VP tables have to be traversed during query matching, which makes its query cost especially high. The H2RDF+ method is a centralized query method; its query efficiency is low on the large data set WatDiv (SF10000), but relatively good for frequently accessed subqueries.

Table 6: Experimental results for different query templates on WatDiv (SF1000) (unit: ms)

Table 7: Experimental results for different query templates on WatDiv (SF10000) (unit: ms)

In conclusion a kind of SPARQL enquiring and optimizing method proposed by the present invention, based on Hadoop and real using Spark Existing distributed SPARQL query processor.The predicate of extraction is concentrated using the predicate connectivity of query statement and from RDF data Feature stores the frequent predicate pattern result in historical query to reduce input load when inquiry.It, will in query process SPARQL is converted into the inquiry table of SQL with corresponding three types, by being inquired after inquiry table size sequence, to reduce inquiry It inputs size and reduces and execute the time.Since method of the invention is based on historical query, there is adaptivity.And it is above-mentioned Experiment show is under certain threshold value, and system queries performances is better than the existing query processor for comparison.

Embodiment 2

Corresponding to the above method embodiment, the present embodiment provides a SPARQL query optimization system, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the above method when executing the computer program.

The foregoing is only a preferred embodiment of the present invention and is not intended to limit the invention; for those skilled in the art, the invention may be modified and varied in various ways. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims

1. A SPARQL query optimization method based on predicate association, characterized by comprising the following steps:

S1: obtaining the RDF triples in the SPARQL of historical queries, naming the RDF triples by predicate, and storing them in predicate form to obtain an original RDF data set;

S2: vertically partitioning the original RDF data set to obtain the VP tables of the RDF, counting, according to the VP tables of the RDF, the numbers of subjects and objects connected by each predicate in the RDF data, defining four connectivity features of predicates according to the numbers of subjects and objects, and ranking the predicates by priority according to the strength of their connectivity features;

S4: querying the table to be queried using the SQL query instruction.

2. The SPARQL query optimization method based on predicate association according to claim 1, characterized in that, in S2, the connectivity features of predicates are defined as at least four, namely subject score, object score, predicate score and predicate size.

3. The SPARQL query optimization method based on predicate association according to claim 2, characterized in that the subject score is used to count the number of distinct subjects in a single VP table, and is calculated as:

ss(p) = count(DISTINCT s), s ∈ {s | s in VP(p)};

os(p) = count(DISTINCT o), o ∈ {o | o in VP(p)};

where o is the object;

ps(p) = max{ss(p), os(p)};

pSize(p) = count(VP(p)).

4. The SPARQL query optimization method based on predicate association according to claim 3, characterized in that, in S2, the criteria for ranking the predicates by priority are:

1) a predicate with stronger connectivity has higher priority;

2) a predicate with a higher predicate score has higher priority;

3) when the predicate scores are equal, the subject score or object score is additionally compared, and the higher the score, the higher the priority;

4) when the subject scores and object scores of the predicates are equal, the predicate sizes are compared, and the predicate with the larger predicate size has higher priority.

5. The SPARQL query optimization method based on predicate association according to claim 1, characterized in that, in S3, converting the SPARQL query graph into a tree-shaped predicate graph specifically comprises the following steps:

S31: converting the predicates in the BGP into nodes of the PCGP, converting the variable connections between subjects and objects in the BGP into edges, and determining the order in which the PCGP nodes and edges are constructed from the BGP by predicate weight, to generate a conversion graph;

6. The SPARQL query optimization method based on predicate association according to claim 1 or 5, characterized in that optimizing the tree-shaped predicate graph is implemented by a pruning and cycle-removal method.

7. The SPARQL query optimization method based on predicate association according to claim 6, characterized in that, when optimizing the tree-shaped predicate graph, a BGP vertex score, a BGP predicate selectivity, a BGP core vertex and a BGP core edge are defined.

8. A SPARQL query optimization system based on predicate association, including a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.