CN108959318A

CN108959318A - Distributed keyword query method based on RDF graph

Info

Publication number: CN108959318A
Application number: CN201710376203.4A
Authority: CN
Inventors: 郑志蕴; 丁阳; 李钝; 张行进; 王振飞
Original assignee: Zhengzhou University
Current assignee: Zhengzhou University
Priority date: 2017-05-25
Filing date: 2017-05-25
Publication date: 2018-12-07

Abstract

The present invention provides a distributed keyword query method based on RDF graphs and belongs to the field of information retrieval. The method first converts the RDF data graph into an RDF sentence graph. It then partitions the RDF sentence graph using a conditional depth-first search together with simulated annealing, following the two basic principles of minimizing the edge cut set and balancing the data among the resulting subgraphs. Finally, the partitioned RDF sentence graph is refined back into an RDF data graph, the vertex cut set of the RDF graph is obtained, and a reverse search algorithm running on the Hadoop distributed computing framework performs fast keyword queries. While guaranteeing the atomicity and semantic integrity of the RDF data, the invention overcomes the limits of traditional algorithms on partitioning large-scale datasets and greatly improves the efficiency of keyword queries.

Description

Distributed keyword query method based on RDF graph

Technical field

The present invention relates to a distributed keyword query method based on RDF graphs and belongs to the field of information retrieval.

Background technique

Graphs are a ubiquitous data structure and are widely used in many fields. Keyword query over RDF graph structures is a current research hotspot: it lets users obtain query results efficiently without using a complex structured query language. Most current query algorithms are implemented in a centralized environment, i.e. keyword queries can only be processed on a single machine. As RDF graphs keep growing, keyword queries on a single machine become very time-consuming, so processing and storing graphs in a distributed environment has important theoretical value and practical significance.

A commonly used keyword query technique represents RDF data as a labeled directed graph in which the vertices correspond to the subjects and objects of triples and the predicates are edges. A graph preserves the relationships among RDF data without losing semantic information, so query processing over RDF data is usually transformed into a graph matching problem, namely locating on the RDF data graph a Steiner tree that contains the keywords. Because graph data is inherently connected and graph computation is strongly coupled, efficient parallel graph processing requires reducing the coupling between the subgraphs handled in a distributed way as much as possible, and effective graph partitioning is an important means of decoupling. Current graph partitioning algorithms follow two main principles: first, increase the connectivity inside each subgraph and reduce the connectivity between subgraphs; second, keep the subgraph sizes balanced so that the data scale of each subgraph is as even as possible and no large skew arises. Kim et al. proposed the SBV-Cut method, which determines balance vertices by random walk and partitions the graph at those balance vertices so that each subgraph contains approximately the same number of vertices, and which keeps the number of cut points minimal through two attributes, expansion and modularity.
However, although the subgraphs produced by this method contain the same (or a similar) number of vertices, the number of edges associated with each vertex differs, which leaves the data unbalanced between subgraphs. In addition, as graphs keep growing, traditional algorithms (such as KL, DFEP and VSEP) are limited in the graph scale they can handle and cannot meet the demands of explosive data growth.

Summary of the invention

To address the above deficiencies, the present invention proposes a distributed keyword query method based on RDF graphs. While keeping the data of the partitioned subgraphs balanced, the method partitions large graphs efficiently and supports fast keyword queries, meeting users' query needs. The technical solution is implemented in the following steps:

(1) Convert the RDF graph into an RDF sentence graph:

In an RDF dataset, each RDF triple is a basic statement of the RDF data and expresses one complete piece of semantics about a resource on the Web, so the atomicity of every RDF triple must be guaranteed when the dataset is partitioned. In addition, a blank node merely indicates that something exists and has no global identifier, so it can only be used locally; the RDF triples that share a blank node express the common context of that blank node, and separating them would destroy that context. An RDF sentence s is therefore composed of RDF triples and satisfies the following conditions:

Condition 1: any two RDF triples in s are connected, where two RDF triples are connected when they share the same blank node;

Condition 2: no RDF triple in s is connected to an RDF triple outside s;
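The two conditions above can be illustrated with a small sketch that groups triples into RDF sentences with a union-find structure. It is only an illustration under stated assumptions: blank nodes are assumed to be written with the `_:` prefix, connectivity is assumed to be transitive (triples linked through a chain of shared blank nodes fall into one sentence), and the names `is_blank` and `rdf_sentences` and the sample triples are hypothetical, not taken from the patent.

```python
from collections import defaultdict

def is_blank(term):
    # Assumption: blank nodes carry the "_:" prefix (N-Triples style).
    return term.startswith("_:")

def rdf_sentences(triples):
    """Group (s, p, o) triples so that triples sharing a blank node stay together."""
    parent = list(range(len(triples)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # blank node -> index of the first triple that uses it
    for i, (s, _, o) in enumerate(triples):
        for term in (s, o):
            if is_blank(term):
                if term in first_seen:
                    parent[find(i)] = find(first_seen[term])
                else:
                    first_seen[term] = i

    groups = defaultdict(list)
    for i, t in enumerate(triples):
        groups[find(i)].append(t)
    return list(groups.values())

# Hypothetical sample data: the first two triples share the blank node _:b1.
triples = [
    ("ex:paper1", "ex:author", "_:b1"),
    ("_:b1", "ex:name", "Alice"),
    ("ex:paper1", "ex:title", "Graphs"),
]
sentences = rdf_sentences(triples)
```

In this sample the two triples that mention `_:b1` form one sentence and the remaining triple forms a sentence of its own, so no triple is ever split and no blank-node context is separated.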

The present invention uses (s, p, o) to describe an RDF triple, abbreviated as t, and uses s(t), p(t) and o(t) to denote the subject, predicate and object of the triple, respectively. The RDF directed graph and the RDF sentence graph are defined as follows:

RDF directed graph: let G = (V, E, L) denote a labeled RDF directed graph over the set T of RDF triples, where the vertex set V = { v | v ∈ s(T) ∪ o(T) } consists of the subjects and objects of the RDF triples, and the set of directed edges E = { <s(t), o(t)> | t ∈ T } consists of the predicates that relate subject and object; each edge is directed from the subject vertex to the object vertex. L is the set of labels, L = L_v ∪ L_p, where L_v denotes the vertex labels and L_p denotes the predicate labels.

RDF sentence graph: let G_s = (S, E, l, w) denote an RDF sentence graph. It is a vertex-weighted graph in which each node corresponds to one RDF sentence; S is the set of vertices and E is the set of edges. If s_i, s_j ∈ S with s_i ≠ s_j, and there exist t ∈ s_i and t′ ∈ s_j such that t.s = t′.s or t.o = t′.s, then s_i and s_j are associated, i.e. (s_i, s_j) ∈ E. l is a label function: for each s_i ∈ S, l(s_i) is the local set of labels of the subjects, predicates or objects contained in the sentence. w is the vertex weight: for each s_i ∈ S, w(s_i) equals the number of RDF triples contained in the sentence.
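The definition of the RDF sentence graph can be sketched as follows. This is an illustrative sketch, not the patented implementation: sentences are assumed to be given as lists of (s, p, o) tuples, vertices are identified by list index, the association test is applied in both directions because the graph is treated as undirected, and the names `linked` and `sentence_graph` and the sample data are hypothetical.

```python
from itertools import combinations

def linked(a, b):
    # True if some t in a and t' in b satisfy t.s = t'.s or t.o = t'.s.
    subjects_b = {s for s, _, _ in b}
    return any(s in subjects_b or o in subjects_b for s, _, o in a)

def sentence_graph(sentences):
    """Return (weights, edges) of the sentence graph.

    weights[i] is w(s_i), the number of triples in sentence i;
    edges holds index pairs (i, j) with i < j for associated sentences.
    """
    weights = {i: len(sent) for i, sent in enumerate(sentences)}
    edges = set()
    for i, j in combinations(range(len(sentences)), 2):
        # Assumption: the graph is undirected, so either direction creates the edge.
        if linked(sentences[i], sentences[j]) or linked(sentences[j], sentences[i]):
            edges.add((i, j))
    return weights, edges

# Hypothetical sample: sentence 0's object is sentence 1's subject.
sample = [
    [("paper-45", "isPartof", "iswc")],
    [("iswc", "hasPart", "paper-13")],
    [("x", "y", "z")],
]
weights, edges = sentence_graph(sample)
```

The pairwise comparison is quadratic in the number of sentences, which is acceptable for a sketch; a real dataset would more plausibly index sentences by subject first.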

(2) Edge partitioning algorithm based on the RDF sentence graph (REC):

The two principles of graph partitioning are applied as follows. First, to reduce network traffic, the number of cut edges must be kept as small as possible. The invention therefore performs a depth-first traversal guided by the minimum-vertex-degree principle; because this approach easily falls into a local optimum, simulated annealing is used to avoid that situation. Second, to balance the data among subgraphs, the edges of the RDF graph are distributed evenly over the subgraphs, i.e. the number of edges in each RDF subgraph is e = (1/k) Σ_{s_i ∈ S} w(s_i). Suppose the graph G_s is partitioned into k subgraphs and a function p indicates the subgraph of each vertex, with the subgraphs labeled {1, 2, ..., k}; then the label j = p(s) means that sentence s lies in subgraph S_j, where the S_j satisfy the following two conditions: ⋃_{j=1}^{k} S_j = S, and S_i ∩ S_j = Φ for all i ≠ j. The specific steps of the edge partitioning algorithm on the RDF sentence graph are as follows:

Step A: input the number of edges per RDF subgraph

Based on the RDF sentence graph G_s, input the number e of edges per RDF subgraph;

Step B: set the "visited" flags

Find the vertices of minimum degree in the RDF sentence graph, put them into a set D, and set a "visited" flag for every vertex;

Step C: traverse the vertices in set D to partition the RDF sentence graph

C1: given an access order for the vertices in set D, if the sum of the weights of all unvisited vertices in the current RDF sentence graph satisfies |G_s| > e, select the next unvisited vertex from D in order;

C2: add the vertex to set S and mark it as visited;

C3: if the total vertex weight of the current subgraph satisfies |S| < e and the set N(S, G_s) of vertices adjacent to S is not empty (N(S, G_s) ≠ null), select a vertex v from N(S, G_s) at random;

C4: if vertex v is unvisited and, after the vertices of S are removed, its number of adjacent vertices |N(v, G_s) \ S| is the smallest, i.e. the degree of the vertex is minimal, jump to step C2;

C5: if the vertex weight of the current set S satisfies |S| < e and N(S, G_s) = null, go back to the most recently visited vertex and jump to step C4;

C6: if the vertex weight of the current set S satisfies |S| > e, jump to step C1;

Step D: seek the optimal solution by simulated annealing

D1: fix an access order for the vertices in set D;

D2: compute the number of cut edges produced by this access order;

D3: randomly swap the access order of two vertices in D; if the number of cut edges is now smaller than in step D2, the new result is better than the old one, so the old access order is replaced with the new one;

D4: finally, call the simulated annealing function;
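Steps D1 to D4 can be sketched as a generic simulated annealing loop over access orders. The text does not specify the annealing schedule, so everything below is an assumption for illustration: the Metropolis acceptance rule, the geometric cooling, the parameter names `temp`, `cooling`, `steps` and `seed`, and the stand-in cost function `toy_cost`, which merely pretends that orders starting at S3 or S4 cut one edge and all others cut two. In practice the cost would be the number of cut edges produced by the edge partitioning algorithm for that order.

```python
import math
import random

def anneal_order(order, cost, temp=1.0, cooling=0.9, steps=200, seed=0):
    """Search for an access order with a low cost by swapping two positions."""
    rng = random.Random(seed)
    current = list(order)
    current_cost = cost(current)
    best, best_cost = list(current), current_cost
    for _ in range(steps):
        # D3: swap the access order of two randomly chosen vertices.
        i, j = rng.sample(range(len(current)), 2)
        candidate = list(current)
        candidate[i], candidate[j] = candidate[j], candidate[i]
        delta = cost(candidate) - current_cost
        # Accept improvements always, and worse orders with a probability
        # that shrinks as the temperature drops (Metropolis rule, assumed).
        if delta < 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = candidate, current_cost + delta
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
        temp = max(temp * cooling, 1e-9)
    return best, best_cost

# Stand-in cost function, hypothetical and for illustration only.
def toy_cost(order):
    return 1 if order[0] in ("S3", "S4") else 2

initial = ["S5", "S6", "S3", "S4"]
best, best_cost = anneal_order(initial, toy_cost)
```

Tracking the best order separately from the current one means the returned result is never worse than the initial solution of step D1, even when the loop temporarily accepts a worse order.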

(3) Implement keyword query using the Hadoop distributed computing framework

Step I: determine the vertex cut set of the partitioned RDF graph

The vertex cut set of the partitioned RDF graph is obtained by computing the intersection of the RDF subgraphs;
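As a minimal sketch of this step, the vertex cut set can be computed as the union of the pairwise intersections of the subgraphs' vertex sets. The function name `vertex_cut_set`, the representation of a subgraph as a list of (subject, predicate, object) edges, and the sample edges (loosely modeled on the embodiment, with the title literal shortened to a placeholder) are assumptions made for illustration.

```python
def vertex_cut_set(subgraphs):
    """Vertices shared by at least two subgraphs.

    subgraphs: list of edge lists [(subject, predicate, object), ...].
    """
    vertex_sets = [
        {v for s, _, o in edges for v in (s, o)} for edges in subgraphs
    ]
    cut = set()
    for i in range(len(vertex_sets)):
        for j in range(i + 1, len(vertex_sets)):
            cut |= vertex_sets[i] & vertex_sets[j]
    return cut

# Hypothetical sample modeled on the embodiment's two subgraphs.
left = [("paper-45", "isPartof", "iswc")]
right = [("iswc", "hasPart", "paper-13"), ("paper-13", "title", "title-literal")]
cut = vertex_cut_set([left, right])
```

Because every triple stays whole inside one subgraph, only vertices (never edges) are duplicated across subgraphs, which is why the intersection of vertex sets is enough to recover the cut.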

Step II: determine the directed tree

Definition of the query result tree (RT): let M(K) = { m_1, m_2, ..., m_s } be the sets of vertices matched by the keywords K; a query result is defined as a directed tree that satisfies the following conditions:

1) the root of the tree is R;

2) for each set m_i there is some vertex v_i in m_i such that a directed path exists from R to v_i.

During a keyword query, a query result tree that contains only part of the keywords is called a candidate result tree, abbreviated CRT.
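The two conditions on a result tree suggest a simple way to find admissible roots: walk the edges backwards from each keyword's matching vertices and keep the vertices reached from every keyword. The sketch below assumes that this is the general idea behind the reverse search, which the text does not spell out; the names `backward_reachable` and `candidate_roots`, the convention that a vertex reaches itself by a path of length zero, and the sample graph are all assumptions.

```python
from collections import deque

def backward_reachable(edges, targets):
    """Vertices that have a directed path to at least one vertex in targets."""
    preds = {}
    for u, v in edges:
        preds.setdefault(v, set()).add(u)
    seen = set(targets)
    queue = deque(targets)
    while queue:
        v = queue.popleft()
        for u in preds.get(v, ()):
            if u not in seen:
                seen.add(u)
                queue.append(u)
    return seen

def candidate_roots(edges, matches):
    """Vertices R with a directed path to some vertex of every match set m_i."""
    reach = [backward_reachable(edges, m) for m in matches]
    return set.intersection(*reach)

# Hypothetical directed graph and one match set per keyword.
edges = [("paper-45", "iswc"), ("iswc", "paper-13"), ("paper-13", "title")]
matches = [{"paper-45"}, {"paper-13"}, {"title"}]
roots = candidate_roots(edges, matches)
```

A vertex that is backward-reachable from only some of the match sets would be the root of a candidate result tree rather than of a full result tree.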

Step III: implement keyword query with the reverse search algorithm

The process of implementing keyword query with MapReduce consists of the following four steps:

MR1: in the map stage, candidate result trees are first found with the reverse search algorithm and represented as 4-tuples (CRT_i, K, B, S_i), where CRT_i is a tree rooted at R, K is the set of keywords contained in CRT_i, B is the set of cut vertices in the subgraph, and S_i is the subgraph containing CRT_i; each 4-tuple is then packed into a key-value pair <B, CRT_i>;

MR2: in the combiner stage, for two candidate result trees CRT_i and CRT_j in different subgraphs, if the cut vertices associated with the two CRTs satisfy V_i ∩ V_j ≠ Φ, CRT_i and CRT_j are put into the same combiner;

MR3: in the reduce stage, the CRTs in the same combiner are merged; if the merged result contains all query keywords, it is output as a query result;

MR4: in the ranking stage, because the RDF dataset is very large, one query may have many matching results, yet users are usually interested in only a small fraction of them, so a scoring function is needed to score the query results. This patent uses the commonly used approach of scoring by the compactness of the result and returns the top-k query results.

Brief description of the drawings

Fig. 1: Basic framework of the distributed keyword query method based on RDF graphs

Fig. 2: Example RDF graph

Fig. 3: Example of converting an RDF graph into an RDF sentence graph

Fig. 4: Flow chart of the edge partitioning algorithm based on the RDF sentence graph

Fig. 5: Flow chart of simulated annealing

Fig. 6: Flow chart of keyword query implemented with MapReduce

Fig. 7: Partitioned RDF graph

Fig. 8: Comparison of the response times of different partitioning algorithms

Fig. 9: Keyword query examples

Fig. 10: Comparison of response times before and after parallelization

Specific embodiment

The technical solution of the present invention is further described below with reference to the embodiment and the accompanying drawings.

Embodiment: this patent uses the real dataset swetodblp (http://lsdis.cs.uga.edu/Projects/SemDis/Swetodblp), whose subject matter is information about published computer science articles. The dataset contains 681636 triples in total and occupies 53.6 MB of storage; its numbers of edges and vertices are 1026375 and 373219, respectively.

The flow of the distributed keyword query method based on RDF graphs is shown in Fig. 1; as the figure shows, it mainly comprises the following 3 stages:

First stage: the RDF graph in Fig. 2 is converted into an RDF sentence graph, as shown in Fig. 3. As the figures show, the unweighted directed RDF graph is converted into a vertex-weighted undirected RDF sentence graph, and the number at each vertex of the RDF sentence graph is the number of RDF triples contained in that sentence.

Second stage: the edge partitioning algorithm based on the RDF sentence graph, whose flow chart is shown in Fig. 4. The algorithm essentially partitions the graph using the minimum-vertex-degree principle together with depth-first traversal. An example run of the algorithm on the RDF sentence graph of Fig. 3 is given below.

Initial conditions: D = {S3, S4, S5, S6}, since these 4 vertices all have degree 1 in the graph; set visited[Si] = false (i = 3, 4, 5, 6). Assume the graph is partitioned into 2 subgraphs, i.e. k = 2; then the number of edges per RDF subgraph is e = (2+1+1+1+1+3)/2 = 4.5, so each subgraph holds roughly 5 edges.

C1: assume the access order of the vertices in D is {S5, S6, S3, S4}. The total weight of the unvisited vertices in the graph is 9 > 5, so S5 is selected as the starting vertex of the traversal.

C2: add S5 to set S, so S = {S5} and visited[S5] = true.

C3: |S| = 2 < 5 and N(S, G_s) = {S1}.

C4: add S1 to set S, so S = {S1, S5} and visited[S1] = true. Since |S| = 3 < 5, continue with steps C2 and C3. In N(S, G_s) = {S1, S5, S6, S2} only S6 and S2 are unvisited, with |N(S6, G_s) \ S| = 0 and |N(S2, G_s) \ S| = |{S1, S3, S4} \ {S1, S5}| = |{S3, S4}| = 2. S6 has fewer associated vertices than S2, so S6 is added to set S, giving S = {S1, S5, S6} and visited[S6] = true.

C5: now N(S, G_s) \ S = {S1, S5, S6} \ {S1, S5, S6} = NULL and |S| = 4 < 5, so the algorithm backs up to the most recently visited vertex S1. The only remaining vertex adjacent to S1 is S2, so S2 is added to S, giving S = {S1, S5, S6, S2}, visited[S2] = true and |S| = 5. The edges that connect the vertices in S to the rest of the graph are then removed, i.e. the two edges S2-S3 and S2-S4. The unvisited vertices S3 and S4 have a total weight of 3+1 = 4 < 5, so the remaining graph does not need to be partitioned further. Two subgraphs are finally obtained, with vertex sets {S1, S5, S2, S6} and {S3, S4}, and the number of cut edges is 2.
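The walk-through above can be reproduced with a simplified sketch of the edge partitioning step. It departs from the text in ways that should be read as assumptions: the subgraph grows from the whole frontier of the current set instead of backtracking explicitly, ties are broken by vertex name rather than at random, and the threshold is e = 4.5 rather than the rounded value 5. The adjacency lists and vertex weights are inferred from the neighbour sets and running totals quoted in the example (Fig. 3 itself is not reproduced here), and the names `partition` and `cut_edges` are hypothetical.

```python
def partition(adj, weight, order, k):
    """Greedy growth of subgraphs, preferring vertices with few outside neighbours."""
    e = sum(weight.values()) / k          # target weight per subgraph
    visited, parts = set(), []
    for start in order:
        remaining = sum(w for v, w in weight.items() if v not in visited)
        if start in visited or remaining <= e:
            continue
        part = {start}
        visited.add(start)
        while sum(weight[v] for v in part) < e:
            frontier = {u for v in part for u in adj[v]} - visited
            if not frontier:
                break
            # Pick the neighbour with the smallest |N(v, G_s) \ S|.
            nxt = min(sorted(frontier), key=lambda u: len(adj[u] - part))
            part.add(nxt)
            visited.add(nxt)
        parts.append(part)
    rest = set(adj) - visited
    if rest:
        parts.append(rest)                # leftover vertices form the last subgraph
    return parts

def cut_edges(adj, parts):
    owner = {v: i for i, p in enumerate(parts) for v in p}
    return {frozenset((u, v)) for u in adj for v in adj[u] if owner[u] != owner[v]}

# Sentence graph inferred from the worked example (assumed, see lead-in).
adj = {"S1": {"S2", "S5", "S6"}, "S2": {"S1", "S3", "S4"},
       "S3": {"S2"}, "S4": {"S2"}, "S5": {"S1"}, "S6": {"S1"}}
weight = {"S1": 1, "S2": 1, "S3": 3, "S4": 1, "S5": 2, "S6": 1}

parts_a = partition(adj, weight, ["S5", "S6", "S3", "S4"], k=2)
parts_b = partition(adj, weight, ["S3", "S4", "S5", "S6"], k=2)
```

On this inferred graph the order {S5, S6, S3, S4} yields the subgraphs {S1, S2, S5, S6} and {S3, S4} with two cut edges, while the order {S3, S4, S5, S6} yields {S2, S3, S4} and {S1, S5, S6} with a single cut edge, which is why the choice of access order is worth optimizing.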

Because the algorithm easily ends in a local optimum, simulated annealing is used to avoid this, as shown in Fig. 5. An example run is given below.

D1: the given access order of set D is {S5, S6, S3, S4}; this is the initial solution.

D2: the objective is to minimize the number of cut edges. By the edge partitioning algorithm described above, the access order {S5, S6, S3, S4} yields 2 cut edges.

D3: a new solution is accepted if the number of cut edges decreases after the access order is adjusted. For example, with the adjusted access order {S3, S4, S5, S6}, the edge partitioning algorithm described above cuts only the edge S1-S2, so the number of cut edges is 1 and the two subgraphs are {S2, S3, S4} and {S1, S5, S6}. This partition is better than that of step D2, so the old access order is replaced with the new one.

D4: the process is optimized repeatedly with simulated annealing until the optimal solution is obtained. Simulated annealing is widely used, so it is not described in detail here.

Third stage: the process of implementing keyword query with the Hadoop distributed computing framework is shown in Fig. 6. Taking the RDF graph of Fig. 2 as an example, distributed keyword query with the reverse search algorithm proceeds as follows.

After the RDF sentence graph has been partitioned, it must be refined back into an RDF graph. The edge partitioning algorithm of the second stage yields the two subgraph vertex sets {S2, S3, S4} and {S1, S5, S6}; the refined RDF subgraphs are shown in Fig. 7. The intersection of the two RDF subgraphs is the vertex {iswc}. Assume the query keywords are K = {paper-45, paper-13, OWL}.

MR1: the partitioned subgraphs are placed on 2 different nodes, and in the map stage the reverse search algorithm finds the candidate result trees (CRTs). The two CRTs are CRT1 = <{paper-45 -> isPartof -> iswc}, {paper-45}, {iswc}, {(L)}> and CRT2 = <{iswc -> hasPart -> paper-13 -> title -> Can OWL and Logic Programming Live Together Happily}, {paper-13, OWL}, {iswc}, {(R)}>, and the query results are stored as <key, value> pairs;

MR2: in the combiner stage, CRTs that have the same key but come from different subgraphs are put into the same combiner. CRT1 and CRT2 share the cut vertex key = {iswc} and lie in the left and right subgraphs of Fig. 7, respectively, so the two CRTs are put into the same combiner;

MR3: in the reduce stage, the CRTs in the same combiner are joined and merged. The merged result is RT = {paper-45 -> isPartof -> iswc -> hasPart -> paper-13 -> title -> Can OWL and Logic Programming Live Together Happily}; since the joined RT contains all query keywords, it is output as a query result;

MR4: in the ranking stage, the query results are sorted and the top-k results are returned to the user.
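The four stages can be imitated in plain Python, without Hadoop, to make the data flow concrete. This is a sketch under several assumptions rather than the patent's implementation: each CRT is simplified to a linear path stored in a dictionary, grouping by key stands in for the combiner, compactness is approximated by path length, and the function names `map_stage`, `reduce_stage` and `top_k` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def map_stage(crts):
    # MR1: emit one (cut vertex, CRT) pair per cut vertex of the CRT.
    for crt in crts:
        for b in crt["cut"]:
            yield b, crt

def reduce_stage(pairs, keywords):
    # MR2: group CRTs that share a cut vertex (stands in for the combiner).
    groups = defaultdict(list)
    for key, crt in pairs:
        groups[key].append(crt)
    results = []
    # MR3: join CRTs from different subgraphs at the shared cut vertex.
    for key, group in groups.items():
        for a, b in combinations(group, 2):
            if a["subgraph"] == b["subgraph"]:
                continue
            if a["path"][0] == key and b["path"][-1] == key:
                a, b = b, a
            if a["path"][-1] != key or b["path"][0] != key:
                continue
            merged = a["keywords"] | b["keywords"]
            if keywords <= merged:
                results.append({"path": a["path"] + b["path"][1:], "keywords": merged})
    return results

def top_k(results, k):
    # MR4: shorter joined paths are treated as more compact (assumed score).
    return sorted(results, key=lambda r: len(r["path"]))[:k]

title = "Can OWL and Logic Programming Live Together Happily"
crt1 = {"path": ["paper-45", "isPartof", "iswc"], "keywords": {"paper-45"},
        "cut": {"iswc"}, "subgraph": "L"}
crt2 = {"path": ["iswc", "hasPart", "paper-13", "title", title],
        "keywords": {"paper-13", "OWL"}, "cut": {"iswc"}, "subgraph": "R"}

query = {"paper-45", "paper-13", "OWL"}
answers = top_k(reduce_stage(map_stage([crt1, crt2]), query), k=1)
```

With the two CRTs of the embodiment, the join at `iswc` produces a single path from paper-45 to the title literal that covers all three keywords, mirroring the result tree RT described in MR3.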

To show the advantage of the partitioning algorithm REC of this patent, it is compared with the VSEP and DVCP algorithms, as shown in Fig. 8. The figure shows that REC has the shortest partitioning response time and DVCP the longest. The advantage of REC is that it compresses the RDF graph, which on the one hand guarantees the atomicity and semantic integrity of the RDF data and on the other hand reduces the number of vertices in the graph, shrinking the space the algorithm must traverse and thus improving partitioning efficiency. VSEP reduces the number of cut vertices by exchanging the subgraphs to which vertices belong; each iteration must pick two arbitrary vertices of the graph to exchange, so the time complexity is O(n²) (n being the number of vertices in the graph). As the RDF data grows, the number of vertices of the corresponding RDF graph becomes very large, which severely reduces partitioning efficiency. DVCP mainly exchanges the subgraphs to which edges belong so that the vertices associated with these edges span the fewest partitions, thereby reducing the vertex cut. Since a graph in practice has far more edges than vertices, partitioning by exchanging edges is less efficient than partitioning by exchanging vertices.

To show the advantage of the query algorithm in a distributed environment, 10 groups of query examples are given, as shown in Fig. 9. The query algorithm was run on a single machine and on a cluster; the average response time of the 10 query groups for different numbers of cluster nodes is shown in Fig. 10. The figure shows that parallel queries are far more efficient than single-machine queries, and that as the number of nodes grows the parallel query response time keeps decreasing, but by ever smaller amounts; when the number of nodes goes from 40 to 50, the query response time hardly changes. The experiment therefore also shows that the number of nodes for parallel queries must be chosen appropriately for the parallel effect to be at its best.

The above are only some embodiments of the present invention; it should be noted that those skilled in the art may practice it within the scope of the technical solution of the present invention.

Claims

1. A distributed keyword query method based on RDF graphs, comprising the following steps:

Step (1): convert the RDF graph into an RDF sentence graph;

Step (2): apply the edge partitioning algorithm (REC) based on the RDF sentence graph;

Step (3): implement keyword query using the Hadoop distributed computing framework.

2. The distributed keyword query method based on RDF graphs according to claim 1, characterized in that in step (1) an RDF sentence s is composed of RDF triples and satisfies the following conditions:

Condition 1: any two RDF triples in s are connected, where two RDF triples are connected when they share the same blank node;

Condition 2: no RDF triple in s is connected to an RDF triple outside s;

An RDF triple is described by (s, p, o), abbreviated as t, with s(t), p(t) and o(t) denoting the subject, predicate and object of the triple, respectively, and the RDF directed graph and the RDF sentence graph are defined as follows:

RDF directed graph: let G = (V, E, L) denote a labeled RDF directed graph over the set T of RDF triples, where the vertex set V = { v | v ∈ s(T) ∪ o(T) } consists of the subjects and objects of the RDF triples, and the set of directed edges E = { <s(t), o(t)> | t ∈ T } consists of the predicates that relate subject and object; each edge is directed from the subject vertex to the object vertex; L is the set of labels, L = L_v ∪ L_p, where L_v denotes the vertex labels and L_p denotes the predicate labels;

RDF sentence graph: let G_s = (S, E, l, w) denote an RDF sentence graph, which is a vertex-weighted graph in which each node corresponds to one RDF sentence, S is the set of vertices and E is the set of edges; if s_i, s_j ∈ S with s_i ≠ s_j, and there exist t ∈ s_i and t′ ∈ s_j such that t.s = t′.s or t.o = t′.s, then s_i and s_j are associated, i.e. (s_i, s_j) ∈ E; l is a label function: for each s_i ∈ S, l(s_i) is the local set of labels of the subjects, predicates or objects contained in the sentence; w is the vertex weight: for each s_i ∈ S, w(s_i) equals the number of RDF triples contained in the sentence.

3. The distributed keyword query method based on RDF graphs according to claim 1, characterized in that step (2) partitions the graph according to the two basic principles of graph partitioning. First, to reduce network traffic, the number of cut edges is kept as small as possible; a depth-first traversal guided by the minimum-vertex-degree principle is therefore performed, and the partitioning result is optimized with a simulated annealing algorithm. Second, to balance the data among subgraphs, the edges of the RDF graph are distributed evenly over the subgraphs, i.e. the number of edges in each RDF subgraph is e = (1/k) Σ_{s_i ∈ S} w(s_i). Suppose the graph G_s is partitioned into k subgraphs and a function p indicates the subgraph of each vertex, with the subgraphs labeled {1, 2, ..., k}; then the label j = p(s) means that sentence s lies in subgraph S_j, where the S_j satisfy the following two conditions: ⋃_{j=1}^{k} S_j = S, and S_i ∩ S_j = Φ for all i ≠ j. The specific steps of the edge partitioning algorithm on the RDF sentence graph are as follows:

Step A: input the number of edges per RDF subgraph

Based on the RDF sentence graph G_s, input the number e of edges per RDF subgraph;

Step B: set the "visited" flags

Find the vertices of minimum degree in the RDF sentence graph, put them into a set D, and set a "visited" flag for every vertex;

Step C: traverse the vertices in set D to partition the RDF sentence graph

C1: given an access order for the vertices in set D, if the sum of the weights of all unvisited vertices in the current RDF sentence graph satisfies |G_s| > e, select the next unvisited vertex from D in order;

C2: add the vertex to set S and mark it as visited;

C3: if the total vertex weight of the current subgraph satisfies |S| < e and the set N(S, G_s) of vertices adjacent to S is not empty (N(S, G_s) ≠ null), select a vertex v from N(S, G_s) at random;

C4: if vertex v is unvisited and, after the vertices of S are removed, its number of adjacent vertices |N(v, G_s) \ S| is the smallest, i.e. the degree of the vertex is minimal, jump to step C2;

C5: if the vertex weight of the current set S satisfies |S| < e and N(S, G_s) = null, go back to the most recently visited vertex and jump to step C4;

C6: if the vertex weight of the current set S satisfies |S| > e, jump to step C1;

Step D: seek the optimal solution by simulated annealing

D1: fix an access order for the vertices in set D;

D2: compute the number of cut edges produced by this access order;

D3: randomly swap the access order of two vertices in D; if the number of cut edges is now smaller than in step D2, the new result is better than the old one, so the old access order is replaced with the new one;

D4: finally, call the simulated annealing function.

4. The distributed keyword query method based on RDF graphs according to claim 1, characterized in that step (3) is implemented as follows:

Step I: determine the vertex cut set of the partitioned RDF graph

The vertex cut set of the partitioned RDF graph is obtained by computing the intersection of the RDF subgraphs;

Step II: determine the directed tree

Definition of the query result tree (RT): let M(K) = { m_1, m_2, ..., m_s } be the sets of vertices matched by the keywords K; a query result is defined as a directed tree that satisfies the following conditions:

1) the root of the tree is R;

2) for each set m_i there is some vertex v_i in m_i such that a directed path exists from R to v_i;

During a keyword query, a query result tree that contains only part of the keywords is called a candidate result tree, abbreviated CRT;

Step III: implement keyword query with the reverse search algorithm

The process of implementing keyword query with MapReduce consists of the following four steps:

MR1: in the map stage, candidate result trees are first found with the reverse search algorithm and represented as 4-tuples (CRT_i, K, B, S_i), where CRT_i is a tree rooted at R, K is the set of keywords contained in CRT_i, B is the set of cut vertices in the subgraph, and S_i is the subgraph containing CRT_i; each 4-tuple is then packed into a key-value pair <B, CRT_i>;

MR2: in the combiner stage, for two candidate result trees CRT_i and CRT_j in different subgraphs, if the cut vertices associated with the two CRTs satisfy V_i ∩ V_j ≠ Φ, CRT_i and CRT_j are put into the same combiner;

MR3: in the reduce stage, the CRTs in the same combiner are merged; if the merged result contains all query keywords, it is output as a query result;

MR4: in the ranking stage, because the RDF dataset is very large, one query may have many matching results, yet users are usually interested in only a small fraction of them; the query results are therefore scored by the compactness of the result and the top-k query results are returned.