CN106980901B

CN106980901B - Streaming RDF data parallel reasoning algorithm

Info

Publication number: CN106980901B
Application number: CN201710246309.2A
Authority: CN
Inventors: 汪璟玢; 叶怡新
Original assignee: Fuzhou University
Current assignee: Fuzhou University
Priority date: 2017-04-15
Filing date: 2017-04-15
Publication date: 2019-09-13
Anticipated expiration: 2037-04-15
Also published as: CN106980901A

Abstract

The present invention provides a parallel reasoning algorithm for streaming RDF data. A pseudo-bidirectional network of the rules is constructed, and an intermediate node is created for any rule node whose classes share a link variable. At regular intervals, the new batch of data in the Streaming data stream and the data generated by the previous round of reasoning are taken as input data; the input data are classified into existing nodes, or new nodes are created for them, and stored in the corresponding Redis cluster. For each input triple, the pseudo-bidirectional network is used to judge whether all antecedents monitored by the corresponding intermediate node or rule node are satisfied; if so, the rule is applied and inferred data are generated. Duplicate inferred data are deleted in real time, and all data generated by the current round are saved to the Redis cluster as input for the next round, which efficiently realizes parallel streaming reasoning over RDF data under the OWL Horst rules.

Description

Streaming RDF data parallel reasoning algorithm

Technical field

The invention belongs to the field of Semantic Web technology and specifically relates to a parallel reasoning algorithm for streaming RDF data.

Background technique

In recent years, researchers have gradually recognized the importance of parallel reasoning algorithms for real-time streaming data, but few algorithms have been proposed in this field and further study is needed. Reasoning is also studied extensively in intelligent technologies, for example knowledge discovery and case-based reasoning. Solving large-scale RDF streaming data problems through distributed parallel computing has become the consensus of academia and industry.

Streaming parallel reasoning over RDFS/OWL is currently a relatively new field. Barbieri D F et al. proposed an incremental reasoning algorithm based on streams and rich background knowledge. The algorithm attaches an expiration time to each RDF triple; when new streaming data arrive, it performs reasoning on the new data and deletes the expired facts and invalid triples. The IDRM algorithm performs RDFS reasoning over incremental data efficiently and scalably, but because it models the RDFS rules specifically, it is inefficient for OWL Horst rules. Chevalier J et al. proposed an efficient incremental reasoner (Slider), which exploits the intrinsic characteristics of semantic data streams to provide a scalable batch-processing reasoner for streaming data. However, because Slider is designed only for RDFS rules, it is not suitable for the more complex OWL Horst rules.

The challenges in reasoning over large-scale RDF data today are as follows: distributed data on the network make it difficult to obtain the appropriate triples; the growing volume of data requires computing capability that scales to large datasets; and existing reasoning methods are designed for static ontologies, whereas real-world data change constantly. Existing distributed reasoning methods focus mainly on static data, and parallel reasoning over streaming RDF data is currently a relatively new field.

Technical problems to be solved:

1. How to combine the RDF data ontology with the OWL Horst rules to construct a pseudo-bidirectional network of the rules, containing the class nodes corresponding to schema triples and the rule nodes, so that reasoning over all OWL Horst rules can be completed efficiently on large-scale streaming data.

2. How to derive a corresponding parallel reasoning scheme from the proposed streaming scheme, so as to meet the demands of distributed parallel reasoning over large-scale streaming data.

Summary of the invention

To solve the above problems, the present invention provides a parallel reasoning algorithm for streaming RDF data. Targeting the OWL Horst rules and drawing on the strengths of the HAL algorithm, it proposes the PRAS algorithm (Parallel Reasoning Algorithm for Streaming RDF Data). The algorithm can efficiently construct and maintain the pseudo-bidirectional network over large-scale streaming data and perform reasoning correctly and completely.

To achieve the above object, the invention adopts the following technical solution: a parallel reasoning algorithm for streaming RDF data, characterized by comprising the following steps:

S1: Load the rule nodes and the schema triples P_j_RDD and O_k_RDD and save them to the Redis cluster, build the intermediate nodes (midnode) for the link variables in the rules, and go to S2;

S2: At regular intervals, read the new batch of data new_data from the data stream and the data itr_data generated by the previous round of reasoning. If a triple is a schema triple (S_i, P_i, O_i), go to S3; if it is an instance triple (s_i, p_i, o_i), go to S5; if both new_data and itr_data are empty, the algorithm terminates;

S3: If the corresponding class node P_j_RDD or O_k_RDD exists, classify the triple into that class node; if it does not exist, create the corresponding class node and save it to the Redis cluster. If the predicate of the triple belongs to Symmetric Property, go to S4; otherwise go to S6. Symmetric Property is the set that identifies the predicates in schema triples that have a symmetric relation. The set of symmetric-property triples, SymTriples, is defined as follows: ；where P_j_RDD is the set of schema triples;

S4: Classify the input data and perform reasoning on them;

S5: Store and deduplicate the triples generated by reasoning.

Compared with the prior art, the invention has the following advantages:

1. The OWL Horst rules and the RDF ontology file are combined to construct the pseudo-bidirectional network structure, which improves the efficiency of streaming reasoning.

2. A storage strategy designed around the Redis cluster deduplicates triples and stores the iteration data, reducing the storage space and reasoning time spent on duplicate triples and thereby improving reasoning efficiency.

Brief description of the drawings

Fig. 1 is a schematic diagram of the overall framework of the invention.

Fig. 2 is a diagram of the construction of the pseudo-bidirectional network.

Fig. 3 shows the loading of the rules and ontology data and the construction of the pseudo-bidirectional network.

Fig. 4 is a diagram of the relationships among the OWL Horst rules.

Detailed description of the embodiments

The invention is further explained below with reference to the drawings and specific embodiments.

The streaming parallel reasoning proposed by the invention consists of three main parts: constructing the pseudo-bidirectional network, classifying the streaming data, and reasoning with the OWL Horst rules. First, based on the characteristics of Spark Streaming and Redis, and combining the HAL algorithm with the OWL Horst rules and the RDF data ontology, the pseudo-bidirectional network of the rules is constructed. It contains the class nodes corresponding to schema triples and the rule nodes, and an intermediate node is created for any rule node whose classes share a link variable. Then, at regular intervals, the new batch of data in the Streaming data stream and the data generated by the previous round of reasoning are taken as input data; the input data are classified into existing nodes, or new nodes are created for them, and stored in the corresponding Redis cluster. Next, for each input triple, the pseudo-bidirectional network is used to judge whether all antecedents monitored by the corresponding intermediate node or rule node are satisfied; if so, the rule is applied and inferred data are generated. Finally, duplicate inferred data are deleted in real time and all data generated by the current round are saved to the Redis cluster as input for the next round, which efficiently realizes parallel streaming reasoning over RDF data under the OWL Horst rules.

See Fig. 1 for the overall framework.

A parallel reasoning algorithm for streaming RDF data comprises the following steps:

S1: Load the rule nodes and the schema triples P_j_RDD and O_k_RDD and save them to the Redis cluster, build the intermediate nodes (midnode) for the link variables in the rules, and go to S2;

S2: At regular intervals, read the new batch of data new_data from the data stream and the data itr_data generated by the previous round of reasoning. If a triple is a schema triple (S_i, P_i, O_i), go to S3; if it is an instance triple (s_i, p_i, o_i), go to S5; if both new_data and itr_data are empty, the algorithm terminates;

S3: If the corresponding class node P_j_RDD or O_k_RDD exists, classify the triple into that class node; if it does not exist, create the corresponding class node and save it to the Redis cluster. If the predicate of the triple belongs to Symmetric Property, go to S4; otherwise go to S6. Symmetric Property is the set that identifies the predicates in schema triples that have a symmetric relation. The set of symmetric-property triples, SymTriples, is defined as follows:

；

where P_j_RDD is the set of schema triples; for example, under the OWL Horst rules, SymTriples = {sameAs, inverseOf, equivalentClass, equivalentProperty};

S4: Classify the input data and perform reasoning on them;

S5: Store and deduplicate the triples generated by reasoning.
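The control flow of steps S1 to S5 can be read as a fixed-point loop over batches. The following is a minimal Python sketch under stated assumptions, not the patented implementation: a plain `set` stands in for the Redis cluster, classification and storage are collapsed into one set union, and `run_stream`, `infer` and `swap_same_as` are hypothetical names introduced only for illustration.

```python
from typing import Callable, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]


def run_stream(batches: Iterable[List[Triple]],
               infer: Callable[[Triple, Set[Triple]], Set[Triple]]) -> Set[Triple]:
    """Process batches in order, feeding each round's inferred triples into the next."""
    store: Set[Triple] = set()      # stand-in for the triples kept in the Redis cluster
    itr_data: Set[Triple] = set()   # triples produced by the previous round of reasoning
    pending = iter(batches)
    while True:
        new_data = next(pending, [])
        inputs = set(new_data) | itr_data
        if not inputs:              # S2: new_data and itr_data are both empty, so stop
            return store
        store |= inputs             # S3/S4 classification and storage, collapsed (assumption)
        produced: Set[Triple] = set()
        for triple in inputs:
            produced |= infer(triple, store)
        itr_data = produced - store  # S5: the set removes duplicates; keep unseen triples only


def swap_same_as(triple: Triple, store: Set[Triple]) -> Set[Triple]:
    """Toy rule used only for this sketch: sameAs is symmetric."""
    s, p, o = triple
    return {(o, p, s)} if p == "sameAs" else set()


result = run_stream([[("a", "sameAs", "b")]], swap_same_as)
print(sorted(result))
```

The loop stops on the first round in which neither new data nor previously inferred data are available, which mirrors the termination condition in S2.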

S4 comprises the following steps:

S41: If the input triple is a schema triple (S_i, P_i, O_i), store it once with P_i+"_"+S_i as the key and O_i as the value and once with P_i+"_"+O_i as the key and S_i as the value, thereby building the bidirectional relation between S_i and O_i in the triple; save both entries to the Redis cluster and go to S43;

S42: If the input triple is an instance triple (s_i, p_i, o_i), build the three key-value pairs <s_i, (p_i, o_i)>, <p_i, (s_i, o_i)> and <o_i, (s_i, p_i)> from it, store them in the Redis cluster, and go to S43;

S43: Check the pseudo-bidirectional network corresponding to new_data or itr_data, and judge whether new_data or itr_data contains a Rule_m_link_RDD monitored by a rule node or an intermediate node. If it is a Rule_m_link_RDD monitored by an intermediate node, go to S44; if it is a Rule_m_link_RDD monitored by a rule node, go to S45; otherwise go to S2. In the pseudo-bidirectional network, a rule node Rule_i_node is created for each rule Rule_i, a class node Class_i_node is built for each class involved in the rule, and an intermediate node mid_i_node is created if the antecedents of the rule contain a link variable. A link variable of rule Rule_i is a schema triple term in Rule_i that connects two antecedents. The link variable information of each rule is stored in Rule_m_link_RDD in the form <key, value>, where key stores all schema triple terms of the rule that are used to connect antecedents and value stores the schema triple terms of the conclusion of the rule. See Fig. 2 for the construction of the pseudo-bidirectional network.

S45: Judge whether the monitored Rule_m_link_RDD is fully satisfied; if so, go to S46, otherwise go to S2;

S46: Judge whether all antecedents corresponding to the rule node are satisfied; if so, apply the rule, generate the inferred triples, and go to S5; otherwise go to S2.
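The key layout described in S41 and S42 can be illustrated with a small in-memory model. The sketch below is an assumption-laden illustration rather than the patent's code: `SetStore` is a hypothetical stand-in that mimics the Redis `SADD` and `SMEMBERS` commands with a dictionary of Python sets, and the `SYMMETRIC` set copies the SymTriples example given above.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Set, Tuple

Triple = Tuple[str, str, str]
SYMMETRIC = {"sameAs", "inverseOf", "equivalentClass", "equivalentProperty"}


class SetStore:
    """In-memory stand-in for Redis sets (mimics SADD and SMEMBERS)."""

    def __init__(self) -> None:
        self._sets: DefaultDict[str, Set[Hashable]] = defaultdict(set)

    def sadd(self, key: str, member: Hashable) -> None:
        self._sets[key].add(member)

    def smembers(self, key: str) -> Set[Hashable]:
        return set(self._sets.get(key, set()))


def store_schema(store: SetStore, triple: Triple) -> None:
    """S41: index a symmetric-property schema triple in both directions."""
    s, p, o = triple
    if p in SYMMETRIC:
        store.sadd(f"{p}_{s}", o)
        store.sadd(f"{p}_{o}", s)


def store_instance(store: SetStore, triple: Triple) -> None:
    """S42: store three key-value pairs so the triple can be found by s, p or o."""
    s, p, o = triple
    store.sadd(s, (p, o))
    store.sadd(p, (s, o))
    store.sadd(o, (s, p))


store = SetStore()
store_schema(store, ("memberOf", "inverseOf", "member"))
store_instance(store, ("GraduateStudent0", "memberOf", "University0_Department0"))
print(store.smembers("inverseOf_memberOf"))
```

Indexing each instance triple under all three of its terms trades storage for lookup speed: any antecedent that fixes one term can be answered with a single set read.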

S5 comprises the following specific steps: the triples generated by reasoning are stored in the set named itr_data in the Redis cluster, duplicate triples are removed, and the itr_data set is then used as part of the input data for the next round of reasoning; if no stop command has been issued, go to S2.

The principle of the PRAS algorithm of the invention is as follows. Based on the characteristics of Spark RDDs and the Redis cluster, and combining the HAL algorithm with the OWL Horst rules and the RDF ontology data, a pseudo-bidirectional network is constructed for the rules. First, for each schema triple (S_i, P_j, O_k), the corresponding class node O_k_RDD or P_j_RDD is built and saved to the Redis cluster; if P_j is a symmetric property, the bidirectional relation between S_i and O_k in the triple is built and saved to the Redis cluster. To judge quickly whether all antecedents of a rule are satisfied, a rule node is created for every rule. If a rule contains a link variable link_var, an intermediate node midnode is created, the test condition information is stored in the intermediate node, and two-way communication is set up between the intermediate node and the rule node; if there is no link variable, the class node is connected directly to the rule node and the test condition is stored in the class node. Taking rule 8a in Fig. 2 as an example, the schematic diagram is shown in Fig. 3. By building heuristic information between nodes together with the symmetric-property relations, and by exploiting the efficient access of the Redis cluster, the required triples are read from the Redis cluster by query, which reduces the reading and transmission of irrelevant triples and improves overall reasoning efficiency.
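One way to picture the rule node and intermediate node described above is as small objects that track which antecedents have been seen. The following Python sketch is a simplification under stated assumptions: antecedents are identified only by their predicate name, the two-way communication is reduced to the intermediate node calling its rule node, and the class names `RuleNode` and `MidNode` and their fields are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RuleNode:
    """A rule node that records which antecedent predicates have been observed."""
    name: str
    antecedents: List[str]
    satisfied: Dict[str, bool] = field(init=False, default_factory=dict)

    def __post_init__(self) -> None:
        self.satisfied = {a: False for a in self.antecedents}

    def notify(self, predicate: str) -> bool:
        """Mark one antecedent as satisfied; return True once all of them are."""
        if predicate in self.satisfied:
            self.satisfied[predicate] = True
        return all(self.satisfied.values())


@dataclass
class MidNode:
    """An intermediate node that watches the antecedents joined by one link variable."""
    link_var: str
    watched: List[str]
    rule: RuleNode

    def notify(self, predicate: str) -> bool:
        # Forward only the predicates this node watches to its rule node.
        if predicate not in self.watched:
            return False
        return self.rule.notify(predicate)


rule15 = RuleNode("rule15", ["someValuesFrom", "onProperty"])
mid_v = MidNode("v", ["someValuesFrom", "onProperty"], rule15)
after_first = mid_v.notify("someValuesFrom")   # only one antecedent seen so far
ignored = mid_v.notify("subClassOf")           # not watched by this node
after_second = mid_v.notify("onProperty")      # both watched antecedents seen
print(after_first, ignored, after_second)
```

In this reading, a triple that completes the watched set is the trigger for reading the remaining antecedent data from Redis and firing the rule.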

The Map stage mainly performs data classification and reasoning. If the batch stream data new_data obtained at regular intervals from the Streaming data stream, or the data itr_data generated by the previous round of reasoning, are ontology data, they are classified into the corresponding class nodes and the values of those nodes in the Redis cluster are updated; if the property is a symmetric property, the bidirectional relation between S and O in the triple is built with "symm_"+S and "symm_"+O as keys and stored in the Redis cluster. If new_data or itr_data are instance data, then for each instance triple (s_i, p_i, o_i) the three key-value pairs <s_i, (p_i, o_i)>, <p_i, (s_i, o_i)> and <o_i, (s_i, p_i)> are built and stored in the Redis cluster. The pseudo-bidirectional network corresponding to new_data or itr_data is then checked to judge whether the link variables monitored by the corresponding intermediate node, or all antecedents of the corresponding rule node (which may involve several intermediate nodes), are satisfied. If so, the rule is applied and the inferred triples are output to the Reduce stage; if only some are satisfied, the state of the corresponding conditions is updated. The specific steps of the data classification and reasoning algorithm are as follows:

Map phase algorithm

Input: streaming triple data and the triples generated by the previous round of reasoning

Output: <"new", inferred triple>

Step 1: For an input triple (S_i, P_j, O_k) ∈ SchemaTriple, classify it into the corresponding class node and update the Redis cluster. If P_j is a symmetric property, store the triple with P_j+"_"+S_i as the key and O_k as the value and with P_j+"_"+O_k as the key and S_i as the value, building the bidirectional relation between S_i and O_k in the triple, and save it to the Redis cluster. Go to Step 3.

Step 2: For an input triple (s_i, p_j, o_k) ∈ InstanceTriple, build the three key-value pairs <s_i, (p_j, o_k)>, <p_j, (s_i, o_k)> and <o_k, (s_i, p_j)> and store them in the Redis cluster. Go to Step 3.

Step 3: Check the pseudo-bidirectional network corresponding to (s_i, p_j, o_k), read the required data from the Redis cluster, and judge whether the link variables monitored by the intermediate node corresponding to (s_i, p_j, o_k), or all antecedents of the corresponding rule node (which may involve several intermediate nodes), are satisfied. If all are satisfied, go to Step 4. If only some are satisfied, update the monitoring information of the intermediate node or class node according to (S_i, P_j, O_k).

Step 4: Obtain the inferred triple from the conclusion of the current rule and output <"new", inferred triple>.

Taking rules 8a and 8b (inverseOf) in Fig. 4 as an example, the pseudo-code is as follows:

Input: (S₁, P₁, O₁)
Output: <"new", inferred triple>
Begin
  If (S₁, P₁, O₁) ∈ SchemaTriple {   // the triple is a schema triple: classify and store it
    If P₁ equals "type"
      sadd O₁ S₁
    else {
      sadd P₁ (S₁, O₁)
      If P₁ ∈ SymmetricProperty {   /* the predicate is a symmetric property: store the symmetric relation between S₁ and O₁ */
        sadd P₁+"_"+S₁ O₁
        sadd P₁+"_"+O₁ S₁
      }
    }
  }
  Else {   /* instance triple: build and store three key-value pairs */
    sadd S₁ (P₁, O₁)
    sadd P₁ (S₁, O₁)
    sadd O₁ (S₁, P₁)
  }
  /* read the sets inverseOf_S₁ and inverseOf_O₁ from the Redis cluster into inverseOf */
  inverseOf = smembers("inverseOf_"+S₁) ∪ smembers("inverseOf_"+O₁)
  If (inverseOf != null) {
    yield ("new", (O₁, P₁, S₁))
    For (inverse in inverseOf.value) {
      yield ("new", (O₁, inverse, S₁))
    }
  }
End

Suppose the current input batch of stream data contains the schema triple T (memberOf, owl:inverseOf, member) and the instance triple t (GraduateStudent0, memberOf, University0_Department0). First, for the schema triple T, check whether inverseOf_RDD exists; if it does not, create inverseOf_RDD and save (memberOf, member) into it; if it does, save the pair into inverseOf_RDD directly. Then, because inverseOf is a symmetric property, build the bidirectional relation between memberOf and member, with inverseOf_memberOf as the key and member as the value and with inverseOf_member as the key and memberOf as the value, and store it in the Redis cluster. For the instance triple t, build <GraduateStudent0, (memberOf, University0_Department0)>, <memberOf, (GraduateStudent0, University0_Department0)> and <University0_Department0, (GraduateStudent0, memberOf)> and save them to the Redis cluster. Finally, read the sets inverseOf_memberOf and inverseOf_member from the Redis cluster into inverseOf, traverse inverseOf, and output the inverse triple (University0_Department0, member, GraduateStudent0), in which subject and object are swapped as in the pseudo-code's yield of (O₁, inverse, S₁).
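The inverseOf example can be traced with a short Python sketch. This is an illustration under stated assumptions, not the patent's code: a dictionary of sets stands in for the Redis cluster, the lookup is keyed on the predicate of the instance triple (as in the worked example, which reads inverseOf_memberOf), and subject and object are swapped in the result, following the pseudo-code's `yield ("new", (O₁, inverse, S₁))`. The names `add_inverse_of` and `infer_inverse` are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, Iterator, Set, Tuple

Triple = Tuple[str, str, str]

# Stand-in for the symmetric-property sets kept in the Redis cluster (assumption).
inverse_index: DefaultDict[str, Set[str]] = defaultdict(set)


def add_inverse_of(p: str, q: str) -> None:
    """Store the schema triple (p, inverseOf, q) in both directions."""
    inverse_index["inverseOf_" + p].add(q)
    inverse_index["inverseOf_" + q].add(p)


def infer_inverse(triple: Triple) -> Iterator[Triple]:
    """Rules 8a/8b: from (s, p, o) and p inverseOf q, derive (o, q, s)."""
    s, p, o = triple
    for inverse in sorted(inverse_index.get("inverseOf_" + p, ())):
        yield (o, inverse, s)


add_inverse_of("memberOf", "member")
inferred = list(infer_inverse(
    ("GraduateStudent0", "memberOf", "University0_Department0")))
print(inferred)
```

Because the relation is stored under both inverseOf_memberOf and inverseOf_member, the same function covers rule 8a and rule 8b without a second index scan.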

For rules that involve symmetric properties, such as rules 8a and 8b, building and storing the bidirectional relation of the symmetric property makes it possible to find the relevant triples in the Redis cluster quickly, which improves reasoning efficiency.

Taking rule 15 (someValuesFrom) in Fig. 4 as an example, the pseudo-code is as follows:

Input: (S₁, P₁, O₁)
Output: <"new", inferred triple>
Begin
  If (S₁, P₁, O₁) ∈ SchemaTriple {   // the triple is a schema triple: classify and store it
    If P₁ equals "type"
      sadd O₁ S₁
    else {
      sadd P₁ (S₁, O₁)
      If P₁ ∈ SymmetricProperty {   // the predicate is a symmetric property: store the symmetric relation between S₁ and O₁
        sadd P₁+"_"+S₁ O₁
        sadd P₁+"_"+O₁ S₁
      }
    }
  }
  Else {   // instance triple: build and store three key-value pairs
    sadd S₁ (P₁, O₁)
    sadd P₁ (S₁, O₁)
    sadd O₁ (S₁, P₁)
  }
  someValuesFrom_set = smembers("someValuesFrom")   /* read the someValuesFrom set from the Redis cluster */
  onProperty_set = smembers("onProperty")
  For (svf in someValuesFrom_set) {
    For (op in onProperty_set) {
      If (svf.v equals op.v) {
        temp_w = smembers(svf.w)
        x_type_w = temp_w.filter(x => x.p == "type")   /* keep the triples in temp_w whose p is type */
        u_p_x = smembers(op.p)
        result = u_p_x.filter(t => t.x == x_type_w.x)
        yield ("new", (t.u, type, svf.v))   /* join the antecedents of the rule and generate the inference result */
      }
    }
  }
End

Suppose the current input batch of stream data contains the schema triples T1 (Chair, owl:someValuesFrom, Department) and T2 (Chair, owl:onProperty, headOf) and the instance triples t1 (FullProfessor7, headOf, University0_Department0) and t2 (University0_Department0, rdf:type, Department). First, for the schema triples T1 and T2, check whether someValuesFrom_RDD and onProperty_RDD exist; if they do not, create them and save (Chair, Department) into someValuesFrom_RDD and (Chair, headOf) into onProperty_RDD; if they do, save the pairs into someValuesFrom_RDD or onProperty_RDD directly. For the instance triple t1, build <FullProfessor7, (headOf, University0_Department0)>, <headOf, (FullProfessor7, University0_Department0)> and <University0_Department0, (FullProfessor7, headOf)> and save them to the Redis cluster; t2 is handled in the same way. Then read the someValuesFrom and onProperty sets from the Redis cluster into someValuesFrom_set and onProperty_set and traverse them. Here the Chair in someValuesFrom_set is identical to the Chair in onProperty_set, so the two sets keyed by Department and by headOf are fetched from the Redis cluster. Finally, FullProfessor7 is joined with Chair and (FullProfessor7, rdf:type, Chair) is output.
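The join performed for rule 15 can be followed in a short Python sketch. It is a minimal illustration under stated assumptions: a dictionary of sets replaces the Redis cluster, namespace prefixes such as owl: and rdf: are dropped from predicate names, and the helper names `add_schema`, `add_instance` and `infer_some_values_from` are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set, Tuple

Triple = Tuple[str, str, str]

# Stand-in for the Redis sets; each key maps to a set of pairs (assumption).
sets: DefaultDict[str, Set[Tuple[str, str]]] = defaultdict(set)


def add_schema(s: str, p: str, o: str) -> None:
    """Schema triple: classify under its predicate, storing the pair (s, o)."""
    sets[p].add((s, o))


def add_instance(s: str, p: str, o: str) -> None:
    """Instance triple: store three key-value pairs, one per term."""
    sets[s].add((p, o))
    sets[p].add((s, o))
    sets[o].add((s, p))


def infer_some_values_from() -> List[Triple]:
    """Rule 15: (v someValuesFrom w), (v onProperty p), (u p x), (x type w) => (u type v)."""
    out: Set[Triple] = set()
    for v, w in sets.get("someValuesFrom", set()):
        for v2, p in sets.get("onProperty", set()):
            if v != v2:
                continue
            # Entries stored under key w are (subject, predicate) pairs; keep the typed ones.
            xs = {x for x, pred in sets.get(w, set()) if pred == "type"}
            # Entries stored under key p are (subject, object) pairs.
            for u, x in sets.get(p, set()):
                if x in xs:
                    out.add((u, "type", v))
    return sorted(out)


add_schema("Chair", "someValuesFrom", "Department")
add_schema("Chair", "onProperty", "headOf")
add_instance("FullProfessor7", "headOf", "University0_Department0")
add_instance("University0_Department0", "type", "Department")
print(infer_some_values_from())
```

Only four set reads are needed per matching restriction (someValuesFrom, onProperty, the class key and the property key), which is the saving the storage strategy aims for.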

For rules with multiple link variables, such as rule 15, schema triples can be retrieved quickly from the Redis cluster by the key of the class node; for the associated instance triples, the storage strategy for instance triples in Redis allows the relevant instance triples to be found by the value of the link variable, which improves reasoning efficiency.

The Reduce stage mainly saves the data generated by reasoning. The triples generated by reasoning are stored in the set named "itr_data" in the Redis cluster, duplicate triples are removed, and the "itr_data" set is then used as part of the input data for the next round of reasoning. The specific steps of the data deduplication and storage algorithm are as follows:

Reduce algorithm

Input: <"new", Iterator<String> values>

Output: null

Step 1: Store the input SchemaTriple and InstanceTriple data in the Redis cluster under the set name itr_data, to be read by the next round of reasoning.

To describe more precisely how the Reduce stage deduplicates and stores the input data, the pseudo-code is as follows:

Input: <"new", Iterator<String> values>
Output: null
Begin:
  del itr_data
  For each itr in values
    sadd itr_data itr.value   /* traverse values and add each value to the itr_data set in the Redis cluster */
End

As the pseudo-code above shows, in the Reduce stage the input triples are deduplicated and stored through a Redis set, which prepares the data for the next round of reasoning.
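The Reduce step relies on set semantics for deduplication. A minimal Python sketch, assuming a dictionary of sets in place of the Redis cluster and a hypothetical function name `reduce_store`, might look as follows.

```python
from typing import Dict, Iterable, Set

# Stand-in for the Redis cluster: set name -> members (assumption).
redis_sets: Dict[str, Set[str]] = {}


def reduce_store(values: Iterable[str]) -> int:
    """Replace itr_data with the inferred triples of this round; return its size."""
    redis_sets.pop("itr_data", None)                   # del itr_data
    target = redis_sets.setdefault("itr_data", set())
    for value in values:
        target.add(value)                              # sadd itr_data value (duplicates collapse)
    return len(target)


count = reduce_store([
    "FullProfessor7 type Chair",
    "FullProfessor7 type Chair",
    "University0_Department0 member GraduateStudent0",
])
print(count, sorted(redis_sets["itr_data"]))
```

Clearing the set first means each round hands only its own results to the next round, while adding through the set drops repeated triples at no extra cost.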

Complexity analysis is an important indicator of the efficiency of an algorithm. The complexity analysis of the PRAS algorithm of the invention differs from that of a centralized algorithm: it can be decomposed into the Map and Reduce stages. Suppose the experimental dataset contains N triples, the time to read data from Redis is t, the number of parallel Map tasks in the MapReduce process is k, the number of instance triples received in the Reduce stage is m, and the number of parallel Reduce tasks is x. In the Map stage, the PRAS algorithm scans each input triple once against the class nodes or intermediate nodes to determine whether the triple can take part in the reasoning of some rules; if it can, the inference result is obtained by reading the antecedent data from Redis. The time complexity of the Map stage is therefore O(t*N/k). In the Reduce stage each input triple is classified and stored, so the time complexity of the Reduce stage is O(m/x).

The above are preferred embodiments of the invention. Any change made according to the technical solution of the invention whose resulting function does not go beyond the scope of that technical solution falls within the scope of protection of the invention.

Claims

1. A parallel reasoning algorithm for streaming RDF data, characterized by comprising the following steps:

S1: Load the rule nodes and the schema triples P_j_RDD and O_k_RDD and save them to the Redis cluster, build the intermediate nodes (midnode) for the link variables in the rules, and go to S2;

S2: At regular intervals, read the new batch of data new_data from the data stream and the data itr_data generated by the previous round of reasoning; if a triple is a schema triple (S_i, P_i, O_i), go to S3; if it is an instance triple (s_i, p_i, o_i), go to S5; if both new_data and itr_data are empty, the algorithm terminates;

S3: If the corresponding class node P_j_RDD or O_k_RDD exists, classify the triple into that class node; if it does not exist, create the corresponding class node and save it to the Redis cluster; if the predicate of the triple belongs to Symmetric Property, go to S4; otherwise go to S6; Symmetric Property is the set that identifies the predicates in schema triples that have a symmetric relation; the set of symmetric-property triples, SymTriples, is defined as follows:

；

where P_j_RDD is the set of schema triples;

S4: Classify the input data and perform reasoning on them;

S5: Store and deduplicate the triples generated by reasoning;

S4 comprises the following steps: S41: If the input triple is a schema triple (S_i, P_i, O_i), store it once with P_i+"_"+S_i as the key and O_i as the value and once with P_i+"_"+O_i as the key and S_i as the value, thereby building the bidirectional relation between S_i and O_i in the triple; save both entries to the Redis cluster and go to S43;

S42: If the input triple is an instance triple (s_i, p_i, o_i), build the three key-value pairs <s_i, (p_i, o_i)>, <p_i, (s_i, o_i)> and <o_i, (s_i, p_i)> from it, store them in the Redis cluster, and go to S43;

S43: Check the pseudo-bidirectional network corresponding to new_data or itr_data, and judge whether new_data or itr_data contains a Rule_m_link_RDD monitored by a rule node or an intermediate node; if it is a Rule_m_link_RDD monitored by an intermediate node, go to S44; if it is a Rule_m_link_RDD monitored by a rule node, go to S45; otherwise go to S2; in the pseudo-bidirectional network, a rule node Rule_i_node is created for each rule Rule_i, a class node Class_i_node is built for each class involved in the rule, and an intermediate node mid_i_node is created if the antecedents of the rule contain a link variable; a link variable of rule Rule_i is a schema triple term in Rule_i that connects two antecedents; the link variable information of each rule is stored in Rule_m_link_RDD in the form <key, value>, where key stores all schema triple terms of the rule that are used to connect antecedents and value stores the schema triple terms of the conclusion of the rule;

S46: Judge whether all antecedents corresponding to the rule node are satisfied; if so, apply the rule, generate the inferred triples, and go to S5; otherwise go to S2.

2. The parallel reasoning algorithm for streaming RDF data according to claim 1, characterized in that S5 comprises the following specific steps: the triples generated by reasoning are stored in the set named itr_data in the Redis cluster, duplicate triples are removed, and the itr_data set is then used as part of the input data for the next round of reasoning; if no stop command has been issued, go to S2.