CN110032676A - SPARQL query optimization method and system based on predicate association - Google Patents
- Publication number
- CN110032676A (application CN201910196896.8A)
- Authority
- CN
- China
- Prior art keywords
- predicate
- sparql
- score
- query
- rdf
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to the field of association-oriented big-data storage and query technology, and discloses a SPARQL query optimization method and system based on predicate association that achieve distributed SPARQL queries more quickly and effectively. The method includes: obtaining the RDF triples in the SPARQL of historical queries, naming the RDF triples by predicate, and obtaining the original RDF data set; partitioning the RDF data set to obtain VP tables, counting from the VP tables the numbers of subjects and objects connected by each predicate in the RDF data, defining four connectivity characteristics of a predicate, and ranking the predicates by priority according to the strength of their connectivity; constructing the associations between predicates, converting the historical SPARQL query graphs into tree-shaped predicate graphs according to those associations, optimizing the tree-shaped predicate graphs, generating correlation tables from the optimized tree-shaped predicate graphs, and converting SPARQL into query instructions; and querying the table to be queried using the query instructions.
Description
Technical field
The present invention relates to the field of association-oriented big-data storage and query technology, and in particular to a SPARQL query optimization method and system based on predicate association.
Background art
The Resource Description Framework (RDF) is the W3C standard for describing web resources. It identifies resources with global identifiers (Internationalized Resource Identifiers, IRIs) and describes the metadata of a data item with triples composed of a subject s, a predicate p, and an object o. More and more fields describe their data as RDF data sets, for example bioscience, social networks, and search engines, and such data sets contain billions of triples. These huge and ever-growing RDF data sets place higher demands on data querying and information retrieval. Against this background, the W3C proposed the SPARQL query language, which is based on the basic graph pattern (BGP), for querying and retrieving RDF data.
At present, existing SPARQL query strategies rely mainly on the subjects and objects in RDF triples and SPARQL query clauses, and optimize queries using associations between subjects, objects, subject variables, or object variables. However, such subject/object-based association is an "instance-level" associated query, and its association characteristics are accidental and specific. For example, for the query "which related skills have people interested in 'interface design' learned", an instance-level associated query focuses on the associations among "?people", "?skill", "?subSkill", and "UIDesign", whereas in the next query the "interest" may be "HTML" rather than "UIDesign". That is, in actual querying, the subject and object instances involved in a query change often, while predicate associations with a semantic relationship, such as "interest" and "learn", recur. In practical applications this reflects the semantically related query "a person interested in something will learn a certain skill". Such predicate-based association is a "pattern-level" associated query; that is, the semantic association characteristics of predicates are general and pattern-like. On the other hand, when SPARQL query clauses are joined on the original RDF data set through subjects and objects, a large number of intermediate result sets are generated, which is one of the key problems affecting SPARQL query efficiency.
Therefore, there is a need for a query method with good query performance and query efficiency.
Summary of the invention
The object of the present invention is to provide a SPARQL query optimization method and system based on predicate association, so as to achieve distributed SPARQL queries more quickly and effectively.
To achieve the above object, the present invention provides a SPARQL query optimization method based on predicate association, including the following steps:
S1: obtain the RDF triples in the SPARQL of historical queries, name the RDF triples by predicate, and store them in predicate form to obtain the original RDF data set;
S2: vertically partition the original RDF data set to obtain the VP (vertical partitioning) tables of the RDF data, count from the VP tables the numbers of subjects and objects connected by each predicate in the RDF data, define four connectivity characteristics of a predicate according to the subject and object counts, and rank the predicates by priority according to the strength of their connectivity;
S3: according to the connectivity characteristics of the predicates in S2, construct the associations between predicates, convert the historical SPARQL query graphs into tree-shaped predicate graphs according to those associations, optimize the tree-shaped predicate graphs, generate correlation tables from the optimized tree-shaped predicate graphs, and convert SPARQL into SQL query instructions;
S4: query the table to be queried using the SQL query instructions.
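The vertical partitioning in S2 can be illustrated with a minimal Python sketch. It assumes that a VP table is simply a two-column (subject, object) table per predicate; the function name `vertical_partition` and the sample triples are hypothetical and are not taken from the patent's figures.

```python
from collections import defaultdict


def vertical_partition(triples):
    """Group (subject, predicate, object) triples into one (s, o) table per predicate."""
    vp = defaultdict(list)
    for s, p, o in triples:
        vp[p].append((s, o))
    return dict(vp)


# Hypothetical toy data, not the contents of Fig. 2.
triples = [
    ("alice", "interest", "UIDesign"),
    ("bob", "interest", "UIDesign"),
    ("alice", "learn", "HTML"),
    ("HTML", "subOf", "UIDesign"),
]
vp_tables = vertical_partition(triples)
print(sorted(vp_tables))       # predicate names, one VP table each
print(vp_tables["interest"])   # rows of the VP table for "interest"
```

In a distributed setting each such table would be a separate stored table rather than an in-memory list; the dictionary here only shows the shape of the result.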
Preferably, in S2, at least four characteristics of a predicate are defined: the subject score, the object score, the predicate score, and the predicate size.
Preferably, the subject score counts the number of distinct subjects in a single VP table, and is calculated as:
$ss(p) = \mathrm{count}(\mathrm{DISTINCT}\ s),\quad s \in \{\, s \mid s \text{ in } VP(p) \,\}$;
where p is a predicate, s is a subject, and VP(p) is the VP table of predicate p after vertical partitioning;
The object score counts the number of distinct objects in a single VP table, and is calculated as:
$os(p) = \mathrm{count}(\mathrm{DISTINCT}\ o),\quad o \in \{\, o \mid o \text{ in } VP(p) \,\}$;
where o is an object;
The predicate score is the maximum of the predicate's subject score and object score, and is calculated as:
$ps(p) = \max\{\, ss(p),\ os(p) \,\}$;
The predicate size is the number of tuples in the VP table, and is calculated as:
$pSize(p) = \mathrm{count}(VP(p))$.
Preferably, in S2, the criteria for ranking predicates by priority are:
1) a predicate with stronger connectivity has higher priority;
2) a predicate with a higher predicate score has higher priority;
3) when predicate scores are equal, the other score (the subject score or the object score) is compared, and the higher that score, the higher the priority;
4) when both the subject score and the object score of the predicates are equal, the predicate sizes are compared, and the predicate with the larger size has higher priority.
Preferably, in S3, converting a SPARQL query graph into a tree-shaped predicate graph specifically includes the following steps:
S31: convert the predicates in the BGP into nodes of the PCGP (predicate correlation graph pattern), and convert the variable connections through subject variables and object variables in the BGP into edges; the order in which the PCGP nodes and edges are constructed is determined by the predicate weights in the BGP, producing a conversion graph;
S32: optimize the tree-shaped predicate graph, that is, optimize the conversion graph into a conversion tree.
Preferably, the tree-shaped predicate graph is optimized by pruning and cycle removal.
Preferably, when the tree-shaped predicate graph is optimized, the BGP vertex score, the BGP predicate selectivity, the BGP core vertex, and the BGP core edge are defined.
As a general inventive concept, the present invention also provides a SPARQL query optimization system, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above method when executing the computer program.
The present invention has the following beneficial effects:
The present invention provides a SPARQL query optimization method and system based on predicate association. The method first obtains the RDF triples in the SPARQL of historical queries, names the RDF triples by predicate, and stores them in predicate form to obtain the original RDF data set. It then vertically partitions the original RDF data set to obtain the VP tables of the RDF data, counts from the VP tables the numbers of subjects and objects connected by each predicate in the RDF data, defines four connectivity characteristics of a predicate according to the subject and object counts, and ranks the predicates by priority according to the strength of their connectivity. According to the connectivity characteristics of the predicates, it constructs the associations between predicates, converts the SPARQL query graphs into tree-shaped predicate graphs according to those associations, optimizes the tree-shaped predicate graphs, generates correlation tables from the optimized tree-shaped predicate graphs, and converts the optimized tree-shaped predicate graphs into SQL query instructions. Finally, it queries the table to be queried using the SQL query instructions. The method achieves distributed SPARQL queries more quickly and effectively.
The present invention is described in further detail below with reference to the accompanying drawings.
Brief description of the drawings
The accompanying drawings, which form a part of this application, are provided for further understanding of the present invention. The illustrative embodiments of the invention and their description are used to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flow chart of the SPARQL query optimization method based on predicate association according to a preferred embodiment of the present invention;
Fig. 2 is a schematic diagram of the conversion from RDF triple storage to vertically partitioned storage according to a preferred embodiment of the present invention;
Fig. 3 is a SPARQL query statement according to a preferred embodiment of the present invention;
Fig. 4 is the equivalent BGP graph of the query statement according to a preferred embodiment of the present invention;
Fig. 5 is a schematic diagram of the four predicate associations between pairs of query clauses according to a preferred embodiment of the present invention;
Fig. 6 is the constructed predicate association pattern graph according to a preferred embodiment of the present invention;
Fig. 7 is a diagram of vertex scores and predicate selectivity according to a preferred embodiment of the present invention;
Fig. 8 is the optimized predicate association pattern graph according to a preferred embodiment of the present invention;
Fig. 9 is the updated historical predicate pattern tree according to a preferred embodiment of the present invention;
Fig. 10 shows the storage and query performance when only basic graph pattern join results are used, according to a preferred embodiment of the present invention;
Fig. 11 is a schematic diagram of the query time required by different kinds of query methods under different BCF values according to a preferred embodiment of the present invention;
Fig. 12 is a schematic comparison of the query times of different systems on large data sets according to a preferred embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below with reference to the accompanying drawings, but the invention can be implemented in many different ways as defined and covered by the claims.
Embodiment 1
Referring to Fig. 1, this embodiment provides a SPARQL query optimization method based on predicate association, including the following steps:
S1: obtain the RDF triples in the SPARQL of historical queries, name the RDF triples by predicate, and store them in predicate form to obtain the original RDF data set;
S2: vertically partition the original RDF data set to obtain the VP tables of the RDF data, count from the VP tables the numbers of subjects and objects connected by each predicate in the RDF data, define four connectivity characteristics of a predicate according to the subject and object counts, and rank the predicates by priority according to the strength of their connectivity;
S3: according to the connectivity characteristics of the predicates in S2, construct the associations between predicates, convert the historical SPARQL query graphs into tree-shaped predicate graphs according to those associations, optimize the tree-shaped predicate graphs, generate correlation tables from the optimized tree-shaped predicate graphs, and convert SPARQL into SQL query instructions;
S4: query the table to be queried using the SQL query instructions.
The above SPARQL query optimization method ranks predicates by priority by defining their connectivity characteristics, and constructs the associations between predicates according to those characteristics; through these associations, distributed SPARQL queries can be achieved quickly and effectively. In this embodiment, the characteristic analysis of predicates mainly considers frequent predicates, that is, predicates that occur with high frequency.
It should be noted that, in this embodiment, the correlation table in S3 refers to the query results of the high-frequency predicate patterns selected from the optimized tree-shaped predicate graph. This is equivalent to converting the high-frequency predicate index objects into a table. Based on the high-frequency predicate patterns in the historical SPARQL, this step heuristically performs part of the query computation in advance and so reduces the data volume. When a new query arrives that contains a high-frequency predicate pattern that has already occurred, only the data in the correlation table needs to be read during query processing; the reduced input greatly improves query-processing efficiency.
In this embodiment, at least four characteristics of a predicate are defined: the subject score, the object score, the predicate score, and the predicate size.
The subject score counts the number of distinct subjects in a single VP table, and is calculated as:
$ss(p) = \mathrm{count}(\mathrm{DISTINCT}\ s),\quad s \in \{\, s \mid s \text{ in } VP(p) \,\}$;
where p is a predicate, s is a subject, and VP(p) is the VP table of predicate p after vertical partitioning. In this embodiment, a higher subject score shows that the predicate is connected to more subjects, which reflects stronger connectivity of the predicate in the whole RDF graph.
The object score counts the number of distinct objects in a single VP table, and is calculated as:
$os(p) = \mathrm{count}(\mathrm{DISTINCT}\ o),\quad o \in \{\, o \mid o \text{ in } VP(p) \,\}$;
where o is an object. In this embodiment, a higher object score shows that the predicate is connected to more objects, which reflects stronger connectivity of the predicate in the whole RDF graph.
The predicate score is the maximum of the predicate's subject score and object score, and is calculated as:
$ps(p) = \max\{\, ss(p),\ os(p) \,\}$;
Preferably, the larger of the predicate's subject score and object score is chosen as the predicate score.
The predicate size is the number of tuples in the VP table, and is calculated as:
$pSize(p) = \mathrm{count}(VP(p))$.
The predicate size measures the number of tuples in the VP table and reflects the size of the VP table. A higher predicate score shows that the predicate is connected to more subjects or objects, which reflects stronger connectivity of the predicate in the whole RDF graph.
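The four statistics defined above can be computed directly from a VP table. The following Python sketch assumes a VP table is a list of (subject, object) pairs; the function name `predicate_stats` and the sample rows are hypothetical.

```python
def predicate_stats(vp_table):
    """Return subject score, object score, predicate score and predicate size of one VP table."""
    subjects = {s for s, _ in vp_table}   # distinct subjects  -> ss(p)
    objects = {o for _, o in vp_table}    # distinct objects   -> os(p)
    ss, os_ = len(subjects), len(objects)
    return {
        "ss": ss,
        "os": os_,
        "ps": max(ss, os_),               # ps(p) = max{ss(p), os(p)}
        "size": len(vp_table),            # pSize(p) = number of tuples
    }


# Hypothetical VP table for the predicate "interest".
vp_interest = [("alice", "UIDesign"), ("bob", "UIDesign"), ("carol", "HTML")]
stats = predicate_stats(vp_interest)
print(stats)
```

On a SQL engine the same values would come from `COUNT(DISTINCT s)`, `COUNT(DISTINCT o)`, and `COUNT(*)` over the VP table.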
Further, the predicates are ranked by priority according to the following criteria:
1) a predicate with stronger connectivity has higher priority;
2) a predicate with a higher predicate score has higher priority;
3) when predicate scores are equal, the other score (the subject score or the object score) is compared, and the higher that score, the higher the priority;
4) when both the subject score and the object score of the predicates are equal, the predicate sizes are compared, and the predicate with the larger size has higher priority.
In this embodiment, according to the above criteria, the predicate priorities in the VP tables corresponding to Fig. 2 are:
PRI(subOf) > PRI(workFor) > PRI(follow) > PRI(learn) > PRI(interest) > PRI(location).
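One possible reading of criteria 2) to 4) is a lexicographic sort key. The Python sketch below assumes that "the other score" in criterion 3) is the smaller of the subject and object scores, since the predicate score is the larger one; this interpretation, the function names, and the statistics are hypothetical and are not the values behind Fig. 2.

```python
def priority_key(stats):
    """Sort key: predicate score, then the other (smaller) score, then predicate size."""
    ss, os_, size = stats["ss"], stats["os"], stats["size"]
    return (max(ss, os_), min(ss, os_), size)


def rank_predicates(stats_by_predicate):
    """Return predicate names from highest to lowest priority."""
    return sorted(
        stats_by_predicate,
        key=lambda p: priority_key(stats_by_predicate[p]),
        reverse=True,
    )


# Hypothetical statistics for four predicates.
stats_by_predicate = {
    "follow": {"ss": 4, "os": 3, "size": 6},
    "learn": {"ss": 4, "os": 2, "size": 5},
    "location": {"ss": 2, "os": 2, "size": 2},
    "workFor": {"ss": 4, "os": 3, "size": 7},
}
ranking = rank_predicates(stats_by_predicate)
print(ranking)
```

Here "workFor" and "follow" tie on both scores, so the larger predicate size decides, as criterion 4) describes.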
It should be noted that the basic graph pattern (BGP) of a SPARQL query consists of multiple triple patterns, which combine into a complex basic graph pattern. For example, the query statement in Fig. 3 contains the single BGP shown in Fig. 4, namely: query whether people who like "interface design" have learned specific related skills in the corresponding field. As can be seen from the SPARQL query clauses and the BGP, the predicates in SPARQL query clauses are largely known. Using the four predicate characteristics in the VP tables, the SPARQL query graph pattern is converted into a tree-shaped predicate graph pattern to optimize the order of the SPARQL query clauses, so that the relevant predicates in the queried RDF data set can be located quickly.
The specific method for converting the SPARQL query graph pattern into the tree-shaped predicate graph pattern is as follows.
First, the definition of predicate correlation (Predicate Correlation, PC) is introduced: it denotes the association characteristic of the two corresponding predicates when two triple patterns that share a variable in a SPARQL query statement are semi-joined. For example, when the clauses in Fig. 3 are semi-joined, the two predicates "interest" and "learn" are associated. It should be noted that in actual querying, a user interested in one skill often also asks which related technologies should be learned; in practice, repeated queries of this kind greatly increase the number of associated searches over the predicates interest, learn, and subOf. Therefore, based on the semantic association characteristics between predicates, all predicate association structures of the historical SPARQL query statements are analyzed to mine the frequently associated predicate patterns in historical queries.
In a SPARQL query graph pattern, for query clauses connected through a subject variable or an object variable, there are four forms of predicate association in total: subject-subject association (SS), subject-object association (SO), object-subject association (OS), and object-object association (OO). Fig. 5 depicts the four predicate associations between pairs of query clauses: Fig. 5(a) is subject-subject (SS), Fig. 5(b) is object-subject (OS), Fig. 5(c) is subject-object (SO), and Fig. 5(d) is object-object (OO). For example, the two query clauses "?v2 p1 ?v1" and "?v2 p2 ?v3" in Fig. 5(a) are connected through "?v2"; therefore, the two predicates p1 and p2 have a subject-subject association.
For two subqueries $q_i(s_i, p_i, o_i)$ and $q_j(s_j, p_j, o_j)$ in SPARQL, the predicate association PC is as follows:
As the predicate associations shown in Fig. 5 make clear, the position of the connecting variable likewise determines the four association forms of predicates, namely SS (subject-subject), SO (subject-object), OS (object-subject), or OO (object-object). In this embodiment, a SPARQL query structure (BGP) connected through connecting variables is equivalently converted into a corresponding query structure based on predicate association. This query structure reflects the semantic correlation of query clauses through predicate association; by analogy with the BGP structure of SPARQL query statements, the query structure of predicate association is called the predicate correlation graph pattern (PCGP).
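The way the position of the shared variable selects one of the four association forms can be sketched in Python. The naming convention used here is an assumption: the first letter refers to the position of the variable in the first triple pattern and the second letter to its position in the second pattern. The function name `predicate_association` is hypothetical.

```python
def predicate_association(tp_i, tp_j):
    """Return the association forms between two triple patterns (s, p, o).

    Only shared variables (terms starting with '?') are considered.
    Assumed convention: first letter = position in tp_i, second = position in tp_j.
    """
    s_i, _, o_i = tp_i
    s_j, _, o_j = tp_j
    checks = [("SS", s_i, s_j), ("SO", s_i, o_j), ("OS", o_i, s_j), ("OO", o_i, o_j)]
    return [kind for kind, a, b in checks if a == b and a.startswith("?")]


# The two clauses of Fig. 5(a) share ?v2 in subject position.
fig5a = predicate_association(("?v2", "p1", "?v1"), ("?v2", "p2", "?v3"))
# Hypothetical clauses: the object of "learn" is the subject of "subOf".
chain = predicate_association(("?p", "learn", "?skill"), ("?skill", "subOf", "?x"))
print(fig5a, chain)
```

A list is returned because two patterns can share more than one variable; patterns with no shared variable yield an empty list.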
Based on predicate association, the conversion from BGP to PCGP is divided into two steps: (1) conversion into a graph: the predicates in the BGP are converted into nodes of the PCGP, and the variable connections through subject variables and object variables in the BGP are converted into edges; the order in which the PCGP nodes and edges are constructed is determined by the predicate weights in the BGP, producing a conversion graph; (2) conversion into an optimal tree: to simplify the query pattern, the conversion graph is optimized into a conversion tree by pruning and cycle removal. When the tree-shaped predicate graph is optimized, the BGP vertex score, the BGP predicate selectivity, the BGP core vertex, and the BGP core edge are defined and used to determine the predicate weights in the BGP.
Specifically, the BGP vertex score (vs) is defined as follows: for any vertex v in the BGP with i outgoing edges and j incoming edges, compute the sum of the subject scores, in the VP tables, of the i predicates connected to the vertex in subject position, $\sum_{k=1}^{i} ss(p_k^{out})$, and the sum of the object scores, in the VP tables, of the j predicates connected to the vertex in object position, $\sum_{l=1}^{j} os(p_l^{in})$. The BGP vertex score is the sum of the two. A higher vertex score shows stronger connectivity of the vertex in the whole RDF graph; if strongly connected nodes are queried first, the query range can be located quickly. The formula is:
$vs(v) = \sum_{k=1}^{i} ss(p_k^{out}) + \sum_{l=1}^{j} os(p_l^{in})$
where $p_k^{out}$ denotes the predicate of the k-th outgoing edge of v and $p_l^{in}$ the predicate of the l-th incoming edge of v.
The BGP edge score (es), that is, the predicate selectivity, is defined as follows: for any edge e in the BGP connecting two vertices v1 and v2, the predicate selectivity of edge e is es(e) = vs(v1) + vs(v2). A higher edge score shows that the vertices connected by this edge have stronger connectivity in the whole RDF graph; such an edge often connects two star-shaped subgraphs and occupies a core position in the whole query process. If high-scoring edges are queried first, strongly connected nodes are also queried first, so the query range is located quickly.
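The vertex score and edge score can be sketched in Python as follows. The sketch assumes a BGP given as a list of (subject, predicate, object) triple patterns and per-predicate subject and object scores computed beforehand; the function names, the BGP, and the scores are all hypothetical.

```python
def vertex_scores(bgp, stats):
    """vs(v): subject scores of outgoing predicates plus object scores of incoming ones."""
    vs = {}
    for s, p, o in bgp:
        vs[s] = vs.get(s, 0) + stats[p]["ss"]   # v is the subject of p (outgoing edge)
        vs[o] = vs.get(o, 0) + stats[p]["os"]   # v is the object of p (incoming edge)
    return vs


def edge_scores(bgp, vs):
    """es(e) = vs(v1) + vs(v2) for the two end vertices of each BGP edge."""
    return {(s, p, o): vs[s] + vs[o] for s, p, o in bgp}


# Hypothetical BGP and predicate statistics.
bgp = [
    ("?p", "interest", "UIDesign"),
    ("?p", "learn", "?skill"),
    ("?skill", "subOf", "UIDesign"),
]
stats = {
    "interest": {"ss": 3, "os": 2},
    "learn": {"ss": 4, "os": 5},
    "subOf": {"ss": 6, "os": 2},
}
vs = vertex_scores(bgp, stats)
es = edge_scores(bgp, vs)
print(vs)
print(es)
```

With these numbers the vertex "?skill" scores highest and the "learn" edge has the highest edge score, so they would play the roles of core vertex and core edge.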
In this embodiment, the BGP is traversed in descending order of vertex score and edge score, that is, strongly connected vertices and edges are traversed first and weakly connected ones afterwards. The PCGP is constructed during this ordered traversal of the BGP, so the PCGP is a query structure that reflects the connectivity of the original RDF graph and makes it easy to determine the clause query order quickly in later SPARQL queries.
Further, the predicate association graph is constructed as follows.
First, the vertices in the BGP are sorted by vertex score to find the core vertex with the highest vertex score; then, among the edges connected to the core vertex, the core edge with the highest edge score is found and used as the root node (start node) of the PCGP. Using breadth-first traversal in order of BGP edge score, each BGP edge is converted into the corresponding PCGP node and the vertex connecting the edges is converted into the corresponding PCGP edge. Because a PCGP edge embodies a predicate association of the BGP, the predicate association is saved as an attribute of the PCGP edge, finally producing a PCGP tree. For example, in Fig. 6(a), the attribute of the edge connecting the two predicate vertices "workFor" and "location" is SS. It should be noted that Fig. 6(a) is the linear (Linear, L) PCGP graph, Fig. 6(b) is the star (Star, S) PCGP graph, and Fig. 6(c) is the snowflake (Snowflake, F) PCGP graph. Fig. 7(a) is the linear (Linear, L) BGP graph, Fig. 7(b) is the star (Star, S) BGP graph, and Fig. 7(c) is the snowflake (Snowflake, F) BGP graph.
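The construction just described can be sketched as a breadth-first traversal in Python. This is only an illustration under stated assumptions: vertex scores `vs` and edge scores `es` are taken as precomputed, two BGP edges are treated as associated when they share any end term, and cycles are avoided simply by never revisiting an edge. The function names and sample values are hypothetical, and the node-merging optimization is not covered.

```python
from collections import deque


def association(parent, child):
    """Label how two triple patterns share a term (first letter refers to the parent)."""
    (s1, _, o1), (s2, _, o2) = parent, child
    if s1 == s2:
        return "SS"
    if s1 == o2:
        return "SO"
    if o1 == s2:
        return "OS"
    if o1 == o2:
        return "OO"
    return None


def build_pcgp_tree(bgp, vs, es):
    """Return (root predicate, [(parent predicate, child predicate, association), ...])."""
    core_vertex = max(vs, key=vs.get)                        # highest vertex score
    incident = [e for e in bgp if core_vertex in (e[0], e[2])]
    root = max(incident, key=es.get)                         # core edge becomes the root
    tree, visited, queue = [], {root}, deque([root])
    while queue:
        parent = queue.popleft()
        candidates = [e for e in bgp if e not in visited and association(parent, e)]
        for child in sorted(candidates, key=es.get, reverse=True):  # high edge score first
            visited.add(child)
            tree.append((parent[1], child[1], association(parent, child)))
            queue.append(child)
    return root[1], tree


# Hypothetical BGP with precomputed (assumed) scores.
bgp = [
    ("?p", "interest", "UIDesign"),
    ("?p", "learn", "?skill"),
    ("?skill", "subOf", "UIDesign"),
]
vs = {"?p": 7, "UIDesign": 4, "?skill": 11}
es = {bgp[0]: 11, bgp[1]: 18, bgp[2]: 15}
root, tree = build_pcgp_tree(bgp, vs, es)
print(root, tree)
```

Because each BGP edge is attached to the tree only once, the association between "interest" and "subOf" through "UIDesign" is dropped here, which is one simple way to obtain a tree from a cyclic conversion graph.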
A SPARQL query statement may contain the same predicate several times; for example, the predicate "location" in Fig. 7(a) occurs twice. When the BGP is converted into the PCGP, multiple nodes may then represent the same predicate. Therefore, when predicate associations are merged, branches with the same structure are merged to reduce the depth of the PCGP and so increase query speed. There are two merging cases. In the first case, if the predicate of the current node in the PCGP is the same as that of its parent node, the two nodes are merged, the attribute of the edge that connected them is assigned to the parent node, and all child nodes of the current node are attached to the parent node; in Fig. 8(a), the two "location" nodes are merged into one. In the second case, if the predicate of the current node has already appeared at an upper level of the PCGP, all child nodes of the current node are attached as children to the predicate node that has already appeared above, reducing the depth of the PCGP. In particular, if the predicate of the current node's parent appears in the parent or child nodes of an already visited node, the current node is deleted directly, and the attributes of the current node and its edge are updated into the related edge of the visited node; for example, the lower "workFor" predicate in Fig. 8(c) is merged into the upper one. Fig. 8(a) is linear (Linear, L), Fig. 8(b) is star (Star, S), and Fig. 8(c) is snowflake (Snowflake, F).
The three basic query patterns of the BGP shown in Fig. 7 (chain, star, and snowflake) are converted into the PCGP graphs shown in Fig. 6.
In the above steps, the BGP of each historical SPARQL query is converted into a PCGP, and the PCGP is constructed according to the connectivity structure of the RDF graph, which ensures that the target position can be matched quickly during querying and improves query efficiency. It should be noted that historical queries exhibit hotspot queries; if hotspot query nodes are matched first and non-hotspot query nodes afterwards, the query range can be narrowed to some extent and query speed improved. Therefore, if the number of times an association has been queried is saved in the edge attributes of the PCGP, the frequently queried predicate associations, and hence the hotspot query nodes, can be recorded.
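Recording how often each predicate association is queried can be sketched with a counter in Python. The sketch assumes each historical PCGP is available as a list of (predicate, predicate, association) edges; the function name and the history are hypothetical.

```python
from collections import Counter


def count_associations(pcgp_history):
    """Count how many historical PCGPs contain each (p1, p2, association) edge."""
    counts = Counter()
    for pcgp_edges in pcgp_history:
        for edge in set(pcgp_edges):   # count each edge once per query
            counts[edge] += 1
    return counts


# Hypothetical history of three queries.
history = [
    [("workFor", "follow", "SS"), ("workFor", "location", "SS")],
    [("workFor", "follow", "SS"), ("follow", "learn", "OS")],
    [("workFor", "follow", "SS"), ("workFor", "location", "SS")],
]
counts = count_associations(history)
print(counts.most_common(2))
```

The counts play the role of the "number of times queried" edge attribute; the most common entries correspond to hotspot associations.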
Further, in this embodiment, query speed can be increased further by constructing associations between frequent predicates. Specifically, in this embodiment, frequent predicate association refers to the high-frequency predicate associations present in historical SPARQL queries. There are two classes of frequent predicate association in actual queries. The first is pairwise predicate association: however complex a SPARQL query is, it can ultimately be decomposed into join operations between multiple subqueries, and a semi-join during querying associates two predicates; a pairwise predicate association index table can therefore be established to store the frequently queried pairwise predicate associations. The second is multi-predicate combination association: a multi-predicate combination association index table can be established to store the frequently queried predicate combination associations. During querying, according to the predicate associations in the PCGP, frequent predicate associations are first matched in these two kinds of association index tables; on a match, the query result is obtained immediately, which greatly reduces the join cost during querying and the input/output cost of intermediate results.
In this embodiment, on the basis of the VP tables of the vertically partitioned RDF storage in Fig. 2, indexes are built for the high-frequency pairwise predicate associations and the high-frequency multi-predicate associations in SPARQL queries, so that the query results related to a predicate association can be found quickly in the original RDF graph data. BCF (Bi-Correlation Factor) and MCF (Multi-Correlation Factor) are defined to measure how frequently predicate associations occur. Preferably, BCF takes values in [0, 1] and represents the proportion, by association count, of frequent pairwise predicate associations selected from the PCGP. A pairwise predicate association whose association-count ratio falls within the BCF range is regarded as a frequent pairwise predicate association (top_BCF). For example, for the PCGP shown in Fig. 9 with BCF = 0.2, the selected frequent pairwise predicate associations are those whose frequency of occurrence in historical queries is within the top 20% of all pairwise predicate associations; the qualifying pairwise predicate associations are (workFor, follow, SS) and (workFor, location, SS). MCF takes values in [0, 1]. In a multi-predicate combination association, every constituent pairwise predicate association belongs to top_BCF, with the BCF value equal to the chosen MCF value, and the multi-predicate combination is connected in the PCGP and of maximal extent. For example, with MCF = 0.2, only the multi-predicate combination association {(workFor, follow, SS), (workFor, location, SS)} qualifies: both of its pairwise predicate associations belong to the top 20%, they are connected in the PCGP, and no further pairwise predicate association in the PCGP that belongs to the top 20% can extend the combination.
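Selecting the frequent pairwise associations for a given BCF can be sketched in Python. The sketch reads BCF as "keep the top fraction of associations ranked by historical frequency" and rounds the cut-off up; both the reading and the rounding are assumptions, and the function name and counts are hypothetical.

```python
import math


def top_bcf(counts, bcf):
    """Return the pairwise associations in the top `bcf` fraction by frequency."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    # Small epsilon guards against floating-point noise before rounding up.
    keep = math.ceil(bcf * len(ranked) - 1e-9)
    return [assoc for assoc, _ in ranked[:keep]]


# Hypothetical frequencies of five pairwise associations.
counts = {
    ("workFor", "follow", "SS"): 9,
    ("workFor", "location", "SS"): 7,
    ("follow", "learn", "OS"): 3,
    ("learn", "subOf", "OS"): 2,
    ("interest", "learn", "SS"): 1,
}
frequent = top_bcf(counts, bcf=0.4)
print(frequent)
```

A multi-predicate combination for a given MCF would then be a maximal connected group of such frequent pairs in the PCGP; that grouping step is not shown here.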
BCF and MCF are used to find the high-frequency pairwise predicate associations and multi-predicate associations. Specifically, two kinds of index are built on top of the VP tables of Fig. 2: high-frequency pairwise predicate association index tables and a high-frequency multi-predicate association index table. There are four high-frequency pairwise index tables, VP_FSS, VP_FSO, VP_FOS, and VP_FOO, which record the four kinds of semi-join results of high-frequency pairwise predicate associations. The high-frequency multi-predicate association index table VP_FMC records the results of high-frequency multi-predicate associations. Using these two kinds of index can greatly narrow the query range.
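A semi-join result of the kind stored in VP_FSS can be sketched in Python. The exact layout of the index tables is not given in the text, so the sketch only shows the underlying operation: keep the rows of one VP table whose subject also occurs as a subject in the other. The function name and sample rows are hypothetical.

```python
def semi_join_ss(vp_left, vp_right):
    """Rows of vp_left whose subject also appears as a subject in vp_right."""
    right_subjects = {s for s, _ in vp_right}
    return [(s, o) for s, o in vp_left if s in right_subjects]


# Hypothetical VP tables for the frequent association (workFor, follow, SS).
vp_workFor = [("alice", "acme"), ("bob", "acme"), ("carol", "initech")]
vp_follow = [("alice", "bob"), ("carol", "alice")]
fss_workFor_follow = semi_join_ss(vp_workFor, vp_follow)
print(fss_workFor_follow)
```

The reduced table is never larger than the original VP table, which is why reading it instead of the full table narrows the query range. The SO, OS, and OO variants differ only in which columns are compared.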
The conversion and query process can be divided into the following three steps:
(1) Adjust the join order of the SPARQL clauses based on predicate associations: extract all predicate associations in the SPARQL query, then look them up in the high-frequency pairwise-association predicate index tables and the high-frequency multi-association predicate index table. If an index exists for a predicate association, adjust the join order of the query clauses of the associated predicates according to that association and record the result of the index lookup; if no index exists, keep the join order of the corresponding query clauses unchanged.
(2) Convert to SQL query statements: convert the SPARQL query, with its clause order adjusted, into equivalent SQL statements, which are run against the predicate association index tables and the VP tables of the original RDF graph data respectively to obtain query results.
(3) Join the queries: join the SQL subqueries, and order the execution of the SQL clauses by the size of the corresponding query tables, querying small tables first to narrow the query scope and shrink the intermediate result sets; the final result is obtained with a Spark SQL query.
Querying in this way is efficient and fast.
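Steps (2) and (3) can be sketched as a translation from triple patterns to one SQL statement, with the smaller tables listed first. This is a simplified illustration under several assumptions: each predicate has one table with columns `s` and `o`, the hypothetical function `bgp_to_sql` ignores the index-table lookup of step (1), constants are interpolated without escaping, and SQLite stands in for Spark SQL so that the example runs on its own.

```python
import sqlite3

def is_var(term):
    return term.startswith("?")

def bgp_to_sql(patterns, table_sizes):
    """Translate triple patterns (s, p, o) into one SQL query over per-predicate tables."""
    # Smaller tables first, mirroring the size-based ordering of the subqueries.
    ordered = sorted(patterns, key=lambda tp: table_sizes[tp[1]])
    from_items, conditions, first_seen = [], [], {}
    for i, (s, p, o) in enumerate(ordered):
        alias = f"t{i}"
        from_items.append(f'"{p}" AS {alias}')
        for term, col in ((s, "s"), (o, "o")):
            ref = f"{alias}.{col}"
            if not is_var(term):
                conditions.append(f"{ref} = '{term}'")  # constant: filter (no escaping in this sketch)
            elif term in first_seen:
                conditions.append(f"{ref} = {first_seen[term]}")  # shared variable: join condition
            else:
                first_seen[term] = ref
    select = ", ".join(f"{ref} AS {var[1:]}" for var, ref in first_seen.items())
    sql = f"SELECT {select} FROM {', '.join(from_items)}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql

# Toy VP tables in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "workFor" (s TEXT, o TEXT)')
conn.execute('CREATE TABLE "follow" (s TEXT, o TEXT)')
conn.executemany('INSERT INTO "workFor" VALUES (?, ?)',
                 [("alice", "acme"), ("bob", "initech"), ("carol", "acme")])
conn.executemany('INSERT INTO "follow" VALUES (?, ?)',
                 [("alice", "bob"), ("carol", "alice")])

sizes = {"workFor": 3, "follow": 2}
sql = bgp_to_sql([("?x", "workFor", "?c"), ("?x", "follow", "?y")], sizes)
rows = sorted(conn.execute(sql).fetchall())
```

Because `follow` is the smaller table it is listed first, and the shared variable `?x` becomes the join condition between the two tables. In the embodiment the table chosen for each pattern would be an index table whenever one exists for the association, which is what shrinks the join input.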
Experimental verification
In this embodiment, the experiments are run on a cluster of 7 computers, 1 of which serves as the master node and the other 6 as worker nodes. Each of the 7 computers has an AMD Ryzen 7 1800X (8 cores, 3.8 GHz) and 32 GB of memory; the operating system is CentOS 7 and the computing environment is Cloudera CDH 5.7.6, using the Spark bundled with CDH (Spark 1.6.0). The tables used in the experiments are stored as files in Parquet format, and the broadcast join (BroadcastJoin) function of Spark SQL is disabled to avoid automatic use of the Hive component. The PCGP information is computed with the built-in operators of the Spark GraphX module.
The experimental data sets and SPARQL queries are generated with the WatDiv data generator and query generator. WatDiv is an RDF data management benchmark developed by the Data Systems Group at the University of Waterloo. It defines four classes of SPARQL query: linear (Linear, L), star (Star, S), snowflake (Snowflake, F) and complex (Complex, C). Note that a complex query is formed by combining one or more of the linear, star and snowflake types, and is not shown in the drawings. The experiments use WatDiv data with scale factors of 1000 and 10000. Table 1 lists information on the WatDiv data sets used, including the number of RDF triples, the number of distinct predicates and the corresponding HDFS storage size. The table shows that although the two data sets differ tenfold in the number of triples, both have 86 distinct predicates.
Table 1: Information on the WatDiv data sets used in the experiments
During querying, 10000 query statements covering the four query classes, linear (L), star (S), snowflake (F) and complex (C), are generated at random with the WatDiv query generator, in order to test the effect of the number of queries on storage and querying.
BCF and MCF denote the pairwise predicate association factor and the multi-predicate combination association factor, respectively. For example, BCF=0.2 means that only the pairwise predicate associations whose frequency of occurrence in the historical SPARQL query statements ranks in the top 20% are stored, that is, the corresponding VP_FSS, VP_FSO, VP_FOS and VP_FOO indexes are built for them; MCF=0.2 means that only the multi-predicate associations whose frequency of occurrence in the historical SPARQL query statements ranks in the top 20% are stored, that is, the corresponding VP_FMC index is built for them.
Table 2 below gives the storage statistics for six different values of the BCF threshold in [0,1], including the number of predicate pairs in pairwise predicate associations (Number of Predicate Correlation, Npc), and the number of records (Number of Records, Nrec) and storage space (HDFS Size, Size) of the VP tables together with the VP_FSS, VP_FSO, VP_FOS and VP_FOO index tables. When BCF=0.0, no predicate association is selected, so Npc is 0 and Nrec and Size are at their minimum, namely the record count and storage size of the original VP tables. When BCF=1.0, all predicate associations in the PCGP are selected, and Npc, Nrec and Size are at their maximum.
At the WatDiv (SF1000) data scale in Table 1 there are 86 distinct predicates, so there are 86*86*4=29584 possible predicate pairs for pairwise predicate associations. Table 2 shows, however, that at BCF=1.0 the number of predicate pairs that actually have a pairwise association reaches its maximum, Max(Npc), at 4419; that is, only 4419 predicate pairs are actually semantically associated, about 15% of all predicate pairs. Analysis suggests two reasons. First, the RDF data actually touched by the SPARQL queries is far smaller than the whole original data set. Second, it is a consequence of how the WatDiv query generator produces SPARQL query statements: to guarantee that queries are valid, all query statements are generated at random from the query pool by random walk, which indirectly raises the probability that data with a high degree of predicate association is selected for querying.
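The arithmetic behind these figures can be checked directly. The short sketch below only restates numbers given in the text (86 distinct predicates, four join types, Max(Npc) = 4419); the variable names are illustrative.

```python
num_predicates = 86
join_kinds = 4  # SS, SO, OS, OO

# Every ordered pair of predicates, for each of the four join types.
possible_pairs = num_predicates * num_predicates * join_kinds

observed_pairs = 4419  # Max(Npc) reported for BCF = 1.0
share = observed_pairs / possible_pairs

print(possible_pairs)   # 29584
print(f"{share:.1%}")   # 14.9%
```

The share of roughly 14.9% is what the text rounds to 15%.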
When BCF is in [0.2,0.4], the predicate pairs for which pairwise association indexes must be built account for no more than 30% of Max(Npc); when BCF is in [0.6,0.8], this share exceeds 60%, and the number of records and the storage space of the index tables that must be saved increase sharply accordingly.
Table 2: Storage for different BCF values (WatDiv SF1000)
Table 3 below gives the storage statistics for six different values of the MCF threshold in [0,0.6], including the number of predicate pairs in multi-predicate associations (Npc) and the number of records (Nrec) and storage space (Size) of the VP_FMC index table. Because few new multi-predicate association predicate pairs are added once MCF>0.6, only the cases MCF≤0.6 are counted in this embodiment. When MCF is 0.3 or 0.4, Npc is 1056 and 1223 respectively, a relatively large number of predicate pairs; when MCF is 0.5 or 0.6, Npc is 598 and 279 respectively, so the number of predicate pairs falls, yet the number of records Nrec and the storage space Size of the actual VP_FMC index table rise instead. The reason is that when MCF is larger, although there are fewer multi-predicate association predicate pairs, each involves more pairwise predicate association information, so more record information has to be kept in the VP_FMC index table, and Nrec and Size increase.
Table 3: Storage for different MCF values (WatDiv SF1000)
Figures 10 and 11 below show how the index storage space and the SPARQL query time vary for the four SPARQL query types (L, S, F, C) as BCF and MCF take different values. In the figures the line indicates query time and the bars indicate storage space; the result for each of the four query classes is the average over 20 queries.
Specifically, Figure 10 shows the storage and query performance when only the join results of basic graph patterns are used. The basic graph patterns have a clear optimization effect on all query templates.
It can be seen from figure 11 that not establishing predicate association index two-by-two relative to only inquiry VP table as BCF=0.2
(BCF=0), the query time of four classes inquiry is greatly shortened, 50% is shortened for the query time of complex query C;When
When BCF=0.4,60% is shortened for the query time of complex query C.But behind BCF > 0.4, for the inquiry of four classes inquiry
Time does not change much, although this illustrates the correlation of the partial predicate in history SPARQL, is not to be frequently associated, is building
Lithol can not retain these predicate associated informations when drawing.With the increase of BCF value, it is known that seeing the index of foundation
Memory space is with linearly increasing.BCF=0.4 is chosen according to experimental result, in the present embodiment as optimal threshold value.
It can be seen from Figure 11 that at MCF=0 no multi-predicate combination association qualifies, so no additional VP_FMC index table needs to be built and results are queried directly from the original VP tables; the index table Size is therefore 0. At MCF=0.2 the query time is shortest for all four query classes; complex queries show the largest reduction and linear queries the smallest. At this point most predicate combinations involve 3 or 4 predicates, and the VP_FMC index table built in this case greatly improves query efficiency. Once MCF>0.3, the query time of the four classes no longer falls as MCF increases, and by MCF=0.6 it has risen almost to the level at MCF=0. This indicates that in SPARQL queries, multi-predicate combination associations involving more than 4 predicates mostly do not belong to frequent multi-predicate query combinations; that is, the semantic association among these predicates is low, so index lookups match poorly and the final query time increases instead. Based on the experimental results, this embodiment chooses MCF=0.2 as the optimal threshold.
With BCF=0.4 and MCF=0.2, the query times before and after building the frequent predicate association index tables are compared, that is, the time the system takes to query using the VP tables alone versus the time it takes using the VP tables together with the VP_FSS, VP_FSO, VP_FOS, VP_FOO and VP_FMC index tables. Table 4 shows that the query time of all four query classes falls markedly once the indexes are built; for the L, S and F classes it falls by more than 50%.
Table 4: Query time using only the VP tables versus using the VP tables and the index tables (ms)
Further, 20 query statements of the four classes, 5 per class, are generated with the WatDiv query generator. For the two data set scales WatDiv (SF1000) and WatDiv (SF10000), the method of the present invention (denoted FrePre2 in Figure 11) is compared with S2RDF (SF=0.25), S2RDF (SF=1), H2RDF, Sempala, PigSPARQL and SHARD in terms of additional storage space and query time. Table 5 lists the additional storage space of these methods. S2RDF (SF=0.25) has the smallest storage size, 1.8G and 18.4G for the WatDiv (SF1000) and WatDiv (SF10000) data sets respectively; SHARD has the largest, almost 5 times that of S2RDF (SF=0.25). The storage size of the proposed method is 3.4G and 34.9G respectively, about 2 times that of S2RDF (SF=0.25), which is a fairly ordinary storage size in a distributed environment.
Table 5: Additional HDFS storage size of the different systems on the different data sets
S2RDF is taken as the main point of comparison. From Table 7 one can calculate the percentage by which this system reduces query time relative to S2RDF on the large data set: the reduction is at least 30% on average and exceeds 80% in the best case. There are two main reasons why it generally outperforms S2RDF. First, S2RDF uses only the relative data reduction as the criterion for saving ExtVP tables and does not consider O-O predicate connectivity, whereas this system uses query frequency as the criterion for saving ExtVP tables, obtaining a smaller data input at the cost of some additional storage; the smaller data input directly improves query efficiency. Optimizing queries by query frequency is also used by search engines such as Google. Second, this system stores the query results of frequent predicate graph patterns, which reduces the input size and the number of join operations. There are, however, cases where the optimization works poorly, such as L3, S3, S5 and C4, where performance differs little from that of S2RDF or is even slightly worse. The reason is that these queries contain unfamiliar predicate associations, and the index lookup itself adds to the query time. Overall, the query performance of this embodiment is clearly better than that of the other systems.
Figure 12 compares the query times of the different systems on the large data set. Tables 6 and 7 list the query times of the different methods for the four query classes at the two data set scales. The method proposed by the present invention has the shortest query time, followed by S2RDF (SF=0.25): the query time of the proposed method lies in [32ms, 2304ms], and that of S2RDF (SF=0.25) in [187ms, 3018ms]. The query time of the proposed method is nearly three orders of magnitude shorter than that of PigSPARQL and SHARD, and nearly two orders of magnitude shorter than that of Sempala. The proposed method and S2RDF (SF=0.25) are implemented with the Spark SQL module, whereas PigSPARQL and SHARD execute queries as MapReduce batch jobs; although this greatly simplifies the data analysis process, it is not suited to interactive queries, so their query cost is high. Although Sempala has a low join cost, it has to traverse all VP tables during query matching, which makes its query cost especially high. H2RDF+ is a centralized query method; its query efficiency is low on the large data set WatDiv (SF10000), but relatively good on frequently accessed subqueries.
Table 6: Experimental results for the different query templates on WatDiv (SF1000) (unit: ms)
Table 7: Experimental results for the different query templates on WatDiv (SF10000) (unit: ms)
In conclusion a kind of SPARQL enquiring and optimizing method proposed by the present invention, based on Hadoop and real using Spark
Existing distributed SPARQL query processor.The predicate of extraction is concentrated using the predicate connectivity of query statement and from RDF data
Feature stores the frequent predicate pattern result in historical query to reduce input load when inquiry.It, will in query process
SPARQL is converted into the inquiry table of SQL with corresponding three types, by being inquired after inquiry table size sequence, to reduce inquiry
It inputs size and reduces and execute the time.Since method of the invention is based on historical query, there is adaptivity.And it is above-mentioned
Experiment show is under certain threshold value, and system queries performances is better than the existing query processor for comparison.
Embodiment 2
Corresponding to the method embodiment above, this embodiment provides a SPARQL query optimization system comprising a memory, a processor, and a computer program stored on the memory and executable on the processor; the processor implements the steps of the above method when executing the computer program.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit it; for those skilled in the art, the invention may be modified and varied in various ways. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (8)
1. A SPARQL query optimization method based on predicate association, characterized by comprising the following steps:
S1: obtain the RDF triples in the SPARQL of historical queries, name the RDF triples by predicate, and store them by predicate to obtain an original RDF data set;
S2: vertically partition the original RDF data set to obtain the VP tables of the RDF data; from the VP tables, count the numbers of subjects and objects connected by each predicate in the RDF data; define four connectivity features of a predicate from the subject and object counts; and rank the predicates by priority according to the strength of their connectivity features;
S3: according to the connectivity features of the predicates in S2, construct the associations between predicates, convert the historical SPARQL query graph into a tree-shaped predicate graph according to the associations, optimize the tree-shaped predicate graph, generate the association tables from the optimized tree-shaped predicate graph, and convert the SPARQL into SQL query instructions;
S4: query the tables to be queried using the SQL query instructions.
2. The SPARQL query optimization method based on predicate association according to claim 1, characterized in that in S2 at least four connectivity features of a predicate are defined, namely the subject score, the object score, the predicate score and the predicate size.
3. The SPARQL query optimization method based on predicate association according to claim 2, characterized in that the subject score counts the number of distinct subjects in a single VP table and is calculated as:
ss(p) = count(DISTINCT s), s ∈ {s | s in VP(p)};
where p is a predicate, s is a subject, and VP(p) is the VP table of predicate p after vertical partitioning;
the object score counts the number of distinct objects in a single VP table and is calculated as:
os(p) = count(DISTINCT o), o ∈ {o | o in VP(p)};
where o is an object;
the predicate score is the maximum of the subject score and the object score of the predicate and is calculated as:
ps(p) = max{ss(p), os(p)};
the predicate size counts the number of tuples in the VP table and is calculated as:
PSize(p) = count(VP(p)).
4. The SPARQL query optimization method based on predicate association according to claim 3, characterized in that in S2 the predicates are ranked by priority according to the following criteria:
1) a predicate with stronger connectivity has higher priority;
2) a predicate with a higher predicate score has higher priority;
3) when predicate scores are equal, the subject score or object score is also taken into account, and the higher the score, the higher the priority;
4) when both the subject score and the object score of the predicates are equal, the predicate sizes are compared, and the predicate with the larger predicate size has higher priority.
5. The SPARQL query optimization method based on predicate association according to claim 1, characterized in that in S3, converting the SPARQL query graph into the tree-shaped predicate graph specifically comprises the following steps:
S31: convert the predicates in the BGP into nodes of the PCGP and convert the variable connections between subjects and objects in the BGP into edges, determine by predicate weight the order in which the PCGP nodes and edges are constructed from the BGP, and generate a conversion graph;
S32: optimize the tree-shaped predicate graph, whereby the conversion graph is optimized into a conversion tree.
6. The SPARQL query optimization method based on predicate association according to claim 1 or 5, characterized in that the tree-shaped predicate graph is optimized by a pruning-based cycle removal method.
7. The SPARQL query optimization method based on predicate association according to claim 6, characterized in that when optimizing the tree-shaped predicate graph, a BGP vertex score, BGP predicate selectivity, BGP core vertices and BGP core edges are defined.
8. A SPARQL query optimization system based on predicate association, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
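The connectivity features of claim 3 and the ranking of claim 4 can be illustrated with a short sketch. It assumes each VP table is a list of (subject, object) rows; the function names are hypothetical, and the sort key is one possible reading of the ranking criteria: criterion 1 is treated as already captured by the predicate score, and the tie-break of criterion 3 uses the sum of the subject and object scores, which orders predicates with equal predicate scores the same way as comparing the smaller of the two scores.

```python
def connectivity_features(vp_table):
    """Compute subject score, object score, predicate score and predicate size of one VP table."""
    subjects = {s for s, _ in vp_table}
    objects = {o for _, o in vp_table}
    ss, os_ = len(subjects), len(objects)
    return {"ss": ss, "os": os_, "ps": max(ss, os_), "size": len(vp_table)}

def rank_predicates(vp):
    """Order predicates from highest to lowest priority (assumed reading of the criteria)."""
    feats = {p: connectivity_features(rows) for p, rows in vp.items()}

    def key(p):
        f = feats[p]
        # Higher predicate score first, then higher ss + os, then larger table;
        # the predicate name makes the order deterministic on full ties.
        return (-f["ps"], -(f["ss"] + f["os"]), -f["size"], p)

    return sorted(vp, key=key)

vp_tables = {
    "workFor": [("alice", "acme"), ("bob", "initech"), ("carol", "acme")],
    "follow": [("alice", "bob"), ("carol", "alice")],
    "location": [("acme", "paris")],
}
ranking = rank_predicates(vp_tables)
```

In this toy data `workFor` has three distinct subjects and two distinct objects, so its predicate score is 3 and it ranks ahead of `follow` (score 2) and `location` (score 1).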
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910196896.8A CN110032676B (en) | 2019-03-15 | 2019-03-15 | SPARQL query optimization method and system based on predicate association |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110032676A true CN110032676A (en) | 2019-07-19 |
CN110032676B CN110032676B (en) | 2022-08-05 |
Family
ID=67236069
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910196896.8A (Active, CN110032676B (en)) | 2019-03-15 | 2019-03-15 | SPARQL query optimization method and system based on predicate association |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110032676B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102693310A (en) * | 2012-05-28 | 2012-09-26 | 无锡成电科大科技发展有限公司 | Resource description framework querying method and system based on relational database |
CN103116625A (en) * | 2013-01-31 | 2013-05-22 | 重庆大学 | Distributed query processing method for massive RDF data based on Hadoop |
Non-Patent Citations (1)
Title |
---|
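Gu Jinguang et al.: "SPES: a SPARQL query optimization scheme based on predicate selectivity estimation", Journal of Chinese Computer Systems (《小型微型计算机系统》) * |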
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111241127A (en) * | 2020-01-16 | 2020-06-05 | 华南师范大学 | Predicate combination-based SPARQL query optimization method, system, storage medium and equipment |
CN111241127B (en) * | 2020-01-16 | 2023-01-31 | 华南师范大学 | Predicate combination-based SPARQL query optimization method, system, storage medium and equipment |
CN113626491A (en) * | 2020-05-09 | 2021-11-09 | 杭州海康威视数字技术股份有限公司 | Data query method and device and distributed data query system |
CN113626491B (en) * | 2020-05-09 | 2023-08-04 | 杭州海康威视数字技术股份有限公司 | Data query method, device and distributed data query system |
CN116719846A (en) * | 2023-08-07 | 2023-09-08 | 北京滴普科技有限公司 | Distributed computing engine data query optimization method, device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110032676B (en) | 2022-08-05 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||