CN108763451A

CN108763451A - Streaming RDF data parallel reasoning algorithm based on Spark Streaming

Info

Publication number: CN108763451A
Application number: CN201810521793.XA
Authority: CN
Inventors: 汪璟玢; 陈晓曦
Original assignee: Fuzhou University
Current assignee: Fuzhou University
Priority date: 2018-05-28
Filing date: 2018-05-28
Publication date: 2018-11-06
Anticipated expiration: 2038-05-28
Also published as: CN108763451B

Abstract

The present invention relates to a streaming RDF data parallel reasoning algorithm based on Spark Streaming. First, a rule link variable relation table is built from the OWL Horst inference rules. In the iterative parallel reasoning stage, the batch of new data in the streaming data flow and the data generated by the previous round of reasoning are acquired at regular intervals as input; the schema data and instance data in this input are classified and stored in the corresponding Redis clusters. Then, according to the rule link variable relation table, the rules that can be activated in the current round are determined, and inferred data are generated by combining them with the corresponding instance data. Finally, the duplicate data generated in the current round are deleted, the remaining data are stored, and the current iteration ends. The invention reduces the number of MapReduce tasks and uses Spark to perform iterative reasoning over streaming data; the rule link variable relation table stores the data and the new data generated during reasoning, which ensures the completeness of the algorithm; and the storage scheme for instance triples exploits the characteristics of Redis to trade space for time, enabling fast reads of instance data.

Description

Streaming RDF data parallel reasoning algorithm based on Spark Streaming

Technical field

The invention belongs to the technical field of reasoning over massive streaming RDF data, and in particular relates to a streaming RDF data parallel reasoning algorithm based on Spark Streaming.

Background art

Most existing OWL rule-based reasoning methods process static datasets of fixed size in a centralized manner. Because of the limitations of centralized processing, these algorithms are inefficient when handling massive real-time data. In response to this growing demand, many researchers have proposed their own RDF stream reasoning frameworks. Barbieri D F et al. [1] proposed an incremental reasoning algorithm over streams and rich background knowledge; the algorithm attaches an expiration time to each RDF triple and, when new streaming data arrive, performs reasoning on the new data and deletes expired facts and invalid triples. The IDRM algorithm [2] performs RDFS reasoning over incremental data efficiently and scalably. Chevalier J et al. [3] proposed an efficient incremental reasoner (Slider), which exploits intrinsic characteristics of semantic data streams to provide a scalable batch-processing reasoner for streaming data. Ye Yixin et al. proposed PRAS [4], a streaming RDF data parallel reasoning algorithm on the Spark platform that incorporates a pseudo-bidirectional network.

The main goal of RDF stream reasoning is to store the received streaming data and to reason over it. The IDRM algorithm is modeled specifically on RDFS rules, so it is inefficient for OWL Horst rule reasoning. Slider is designed only for RDFS rules, so it is not suitable for the more complex OWL Horst rules. The PRAS algorithm reasons over streaming data by means of a pseudo-bidirectional network, but because the communication overhead of that network is high, it is inefficient when processing large volumes of streaming data.

The streaming RDF data reasoning algorithm proposed by the present invention, which is built on the Spark platform, addresses exactly these two problems: the storage of streaming data and reasoning over it. To ensure the completeness of the reasoning results, the design of the storage scheme for streaming data is the focus of this work. The invention stores large-scale RDF data in a Redis database cluster and, using the distributed stream computing framework Spark Streaming together with the MapReduce computation modules of the platform, designs and implements a distributed parallel reasoning scheme for streaming RDF data. This solves the problems that, for large volumes of streaming data, reasoning cannot be performed quickly and the reasoning results are incomplete. The approach is a useful reference for reasoning over massive data.

References:

[1] Barbieri D F, Braga D, Ceri S, et al. Incremental reasoning on streams and rich background knowledge[C]//Extended Semantic Web Conference. Springer Berlin Heidelberg, 2010: 1-15.

[2] Liu B, Wu L, Li J, et al. Exploiting Incremental Reasoning in Healthcare Based on Hadoop and Amazon Cloud[C]//Semantic Cities Workshop at AAAI Conference on Artificial Intelligence (AAAI'14). 2014.

[3] Chevalier J, Subercaze J, Gravier C, et al. Slider: an Efficient Incremental Reasoner[C]//Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015: 1081-1086.

[4] Ye Yixin, Wang Jingbin. Distributed parallel reasoning algorithm based on Spark[J]. Computer Systems & Applications, 2017, 26(05): 97-104.

Summary of the invention

The object of the present invention is to provide a streaming RDF data parallel reasoning algorithm based on Spark Streaming. The algorithm reduces the number of MapReduce tasks and uses Spark to perform iterative reasoning over streaming data; a rule link variable relation table is designed to store the data and the new data generated during reasoning, which ensures the completeness of the algorithm; and a storage scheme for instance triples is designed that exploits the characteristics of Redis to trade space for time, enabling fast reads of instance data.

To achieve the above object, the technical solution of the present invention is a streaming RDF data parallel reasoning algorithm based on Spark Streaming, comprising the following steps:

Step S1: build the rule link variable relation table from the OWL Horst inference rules; in the iterative parallel reasoning stage, acquire at regular intervals the batch of new data in the streaming data flow and the data generated by the previous round of reasoning as input data, classify the schema data and instance data in the input, and store them in the corresponding Redis clusters.

Step S2: according to the rule link variable relation table, determine the rules that can be activated in the current round of reasoning, and generate inferred data by combining them with the corresponding instance data.

Step S3: delete the duplicate data generated by the current round of reasoning and store the result, whereupon the current iteration of reasoning ends.

In an embodiment of the present invention, in step S1 the schema data are the schema triple data and the instance data are the instance triple data.

In an embodiment of the present invention, the instance triple data are stored in the Redis cluster as follows: following the characteristics of Redis clusters, the <key, value> form is used, with the subject S, predicate P and object O of each triple each serving as a key, so that the data are stored in three tables of the form <S,(P,O)>, <P,(S,O)> and <O,(S,P)>.

In an embodiment of the present invention, the schema triple data are stored in the Redis cluster as follows: for each OWL Horst inference rule, a corresponding table Rule_m_Table is generated and stored in Redis, with the rule as the table name. The rules fall into two classes according to their number of link variables: rules without link variables and rules with link variables.

Rules without link variables: the storage scheme in Redis uses P as the key and <S,O> as the value.

Rules with link variables:

(1) For a rule with a single link variable, the storage scheme in Redis uses P as the key, and only a single value is stored under the key.

(2) For a complex rule with multiple link variables, the storage scheme in Redis uses P as the key, and S and O are stored in the value as a map of the form <S,<O,0>>, <O,<S,1>>, where 0 indicates that the map key is the subject and 1 indicates that the map key is the object.

In an embodiment of the present invention, step S2 is implemented as follows:

Step S21: traverse the rule link variable relation table and determine which rules can be activated.

Step S22: for each rule that can be activated, if a conclusion can be inferred directly without instance triple data, skip to step S23. If instance triple data are required, use the link variable of the instance triples that the rule needs as the key and look up the corresponding instance triple data in the previously stored instance tables; if they are found, proceed to step S23, otherwise repeat the check of step S22. When all data have been processed, the algorithm ends.

Step S23: execute the current rule to obtain the inferred conclusion, output each generated triple <Si,Pj,Ok> to the set <Si,(Pj,Ok)>, and return to step S22.

In an embodiment of the present invention, step S3 is implemented as follows:

Step S31: receive the set of new triples generated by the reasoning of step S2; if the received data are empty, the algorithm ends.

Step S32: traverse the received set of new triples and remove the duplicates.

Step S33: store the deduplicated set of triples in the Redis cluster under the set name itr_data, to be read by the next round of reasoning.

Compared with the prior art, the present invention has the following advantages:

1. It reduces the number of MapReduce tasks and uses Spark to perform iterative reasoning over streaming data.

2. It designs a rule link variable relation table to store the data and the new data generated during reasoning, which ensures the completeness of the algorithm.

3. It designs a storage scheme for instance triples that exploits the characteristics of Redis to trade space for time, enabling fast reads of instance data.

Description of the drawings

Fig. 1 is the overall framework diagram of the algorithm of the present invention.

Fig. 2 shows the OWL Horst rules used by the present invention.

Detailed description of the embodiments

The technical solution of the present invention is described in detail below with reference to the accompanying drawings.

The present invention provides a streaming RDF data parallel reasoning algorithm based on Spark Streaming, comprising steps S1 to S3 as set out above.

In step S1, the schema data are the schema triple data and the instance data are the instance triple data.

The instance triple data are stored in the Redis cluster as follows: following the characteristics of Redis clusters, the <key, value> form is used, with the subject S, predicate P and object O of each triple each serving as a key, so that the data are stored in three tables of the form <S,(P,O)>, <P,(S,O)> and <O,(S,P)>. The schema triple data are stored in the Redis cluster as follows: for each OWL Horst inference rule, a corresponding table Rule_m_Table is generated and stored in Redis, with the rule as the table name. The rules fall into two classes according to their number of link variables, namely rules without link variables and rules with link variables, and are stored as described above.

Steps S2 and S3 are carried out through sub-steps S21 to S23 and S31 to S33, respectively, as described above.

The specific implementation of the present invention is as follows.

The streaming RDF data parallel reasoning algorithm based on Spark Streaming of the present invention (the PSRH algorithm) has two main stages: building the rule link variable relation table, and iterative parallel reasoning. The iterative parallel reasoning stage in turn has two parts: classification of the streaming data, and OWL Horst rule reasoning. The algorithm first builds the rule link variable relation table from the OWL Horst inference rules. In the iterative parallel reasoning stage, it acquires at regular intervals the batch of new data in the streaming data flow and the data generated by the previous round of reasoning as input data, classifies the schema data and instance data in the input, and stores them in the corresponding Redis clusters. Then, according to the rule link variable relation table, it determines the rules that can be activated in the current round and generates inferred data by combining them with the corresponding instance data. Finally, it deletes the duplicate data generated in the current round and stores the result, and the current iteration ends. The overall framework of the PSRH algorithm is shown in Fig. 1; the detailed procedure is as follows.

1. RDF stream data storage

The PSRH algorithm builds its reasoning model from the OWL Horst rules (shown in Fig. 2) and the RDF ontology data, taking the characteristics of Redis clusters into account. Triple data are then acquired in real time through the Spark Streaming framework, and the schema triple data are separated from the instance triple data.

1.1 Instance triple storage design

The instance data are very large, and during reasoning the link variable used by a particular rule may be any of the subject, predicate, or object of an instance triple, which would reduce the efficiency of looking up instance triples. Following the characteristics of Redis clusters, the algorithm therefore uses the <key, value> form, with the subject, predicate, and object of each triple each serving as a key, so that the data are stored in three tables of the form <S,(P,O)>, <P,(S,O)> and <O,(S,P)>. In this way, whichever of the subject, predicate, or object is used as the search keyword, the lookup in the instance tables goes through the corresponding key, and with the characteristics of Redis clusters the lookup time for instance triples is reduced to O(1), trading space for time.
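The three-table layout can be sketched as follows. This is a minimal illustration only: plain in-memory dictionaries stand in for the Redis cluster, and the class name `InstanceStore` and the `ex:` identifiers are assumptions introduced for the example, not part of the original design.

```python
from collections import defaultdict


class InstanceStore:
    """Three key-value tables indexed by subject, predicate, and object.

    In-memory dicts are used here as a stand-in for Redis (an assumption).
    """

    def __init__(self):
        self.s_table = defaultdict(set)  # S -> {(P, O)}
        self.p_table = defaultdict(set)  # P -> {(S, O)}
        self.o_table = defaultdict(set)  # O -> {(S, P)}

    def add(self, s, p, o):
        # Each triple is written three times: space is traded for lookup time.
        self.s_table[s].add((p, o))
        self.p_table[p].add((s, o))
        self.o_table[o].add((s, p))


store = InstanceStore()
store.add("ex:alice", "ex:knows", "ex:bob")
store.add("ex:alice", "rdf:type", "ex:Person")
```

Whichever position a rule's link variable occupies, a single dictionary access (for example `store.o_table["ex:bob"]`) returns the matching triples without scanning the data.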

1.2 Schema triple storage design

Schema triples are stored in the rule link variable relation table designed for this purpose.

The antecedents of an OWL rule must be joined through variables before new triple data can be generated. Because the algorithm processes streams, the schema triple data cannot all be loaded at once as they can when processing static data. The algorithm therefore builds link variable tables from the schema triple data to record the rules whose antecedents have been partially satisfied during reasoning. When the next batch of new data arrives at a node, the reasoning left unfinished in the previous round can be continued from the contents of the tables.

For each OWL rule a corresponding table (Rule_m_Table) is generated and stored in Redis, with the rule as the table name. The rules fall into two classes according to their number of link variables: rules without link variables and rules with link variables.

A different table structure is used for each class:

(1) Rules without link variables

For rules without link variables, the P in the antecedent <S,P,O> mostly expresses a transitive, equivalence, or inverse relationship, so the storage scheme in Redis uses P as the key and <S,O> as the value. Taking OWL Horst rule 12a (v owl:equivalentClass w => v rdfs:subClassOf w) as an example, the storage is shown in Table 1-1:

Table 1-1. Storage table structure for OWL Horst rule 12a

Table name	Row key (key)	Column value (value)
Rule_12a_Table	owl:equivalentClass	<v,w>

The table structures for the other rules without link variables are given in Table 1-2.

Table 1-2. Storage table structure for OWL Horst rules 12b, 13a and 13b

Table name	Row key (key)	Column value (value)
Rule_12b_Table	owl:equivalentClass	<v,w>
Rule_13a_Table	owl:equivalentProperty	<v,w>
Rule_13b_Table	owl:equivalentProperty	<v,w>
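A rule without link variables can fire from its rule table alone, with no instance lookup. The sketch below illustrates this for rules 12a and 12b, using an in-memory dictionary in place of Redis. The function names are hypothetical, and the conclusion of rule 12b (`w rdfs:subClassOf v`) follows the standard OWL Horst reading rather than anything stated in this text.

```python
# Rule tables for rules without link variables: predicate -> set of <S,O> pairs.
rule_tables = {
    "Rule_12a_Table": {"owl:equivalentClass": set()},
    "Rule_12b_Table": {"owl:equivalentClass": set()},
}


def store_schema_triple(s, p, o):
    """Record the triple in every rule table whose row key matches its predicate."""
    for table in rule_tables.values():
        if p in table:
            table[p].add((s, o))


def infer_12a():
    # v owl:equivalentClass w => v rdfs:subClassOf w
    pairs = rule_tables["Rule_12a_Table"]["owl:equivalentClass"]
    return {(v, "rdfs:subClassOf", w) for v, w in pairs}


def infer_12b():
    # Assumed standard reading: v owl:equivalentClass w => w rdfs:subClassOf v
    pairs = rule_tables["Rule_12b_Table"]["owl:equivalentClass"]
    return {(w, "rdfs:subClassOf", v) for v, w in pairs}


store_schema_triple("ex:Human", "owl:equivalentClass", "ex:Person")
```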

(2) Rules with link variables

For a rule with a single link variable, the storage scheme in Redis uses P as the key, and only a single value is stored under the key. Taking OWL Horst rule 3 (p rdf:type owl:SymmetricProperty, v p u => u p v) as an example, where the link variable is p, the storage is shown in Table 1-3:

Table 1-3. Storage table structure for OWL Horst rule 3

Table name	Row key (key)	Column value (value)
Rule_3_Table	owl:SymmetricProperty	p

The example above indicates that the current streaming data contain a triple whose link variable is p and whose type is SymmetricProperty.
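For a single-link-variable rule such as rule 3, the stored value p is used directly as the key into the predicate-indexed instance table. The following sketch assumes in-memory dictionaries in place of Redis; the function name `infer_rule_3` and the sample data are illustrative only.

```python
def infer_rule_3(rule_3_table, p_table):
    """p rdf:type owl:SymmetricProperty, v p u => u p v."""
    inferred = set()
    for p in rule_3_table.get("owl:SymmetricProperty", ()):
        # The link variable p is the key into the predicate-indexed table.
        for v, u in p_table.get(p, ()):
            inferred.add((u, p, v))
    return inferred


rule_3_table = {"owl:SymmetricProperty": {"ex:marriedTo"}}
p_table = {
    "ex:marriedTo": {("ex:alice", "ex:bob")},
    "ex:knows": {("ex:alice", "ex:carol")},
}
result = infer_rule_3(rule_3_table, p_table)
```

Only triples whose predicate was declared symmetric are touched; `ex:knows` is never read.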

For complex rules with multiple link variables, the nature of streaming data means that a single <key,value> pair cannot simply be used to store all the schema triples that a rule involves. So that no schema triple arriving in any data stream is missed, and to simplify subsequent joins, the P of the rule's antecedent schema triple <S,P,O> is used as the key, and S and O are stored in the value as a map of the form <S,<O,0>> and <O,<S,1>>, where 0 indicates that the map key is the subject and 1 indicates that the map key is the object. Storing the data this way ensures that the link variable can be found by key in O(1) time regardless of whether it is the subject or the object.
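The map layout can be sketched as a two-level dictionary. This is an illustration under stated assumptions: in-memory dicts replace Redis, the function name `store_schema_pattern` is hypothetical, and a set of entries is kept per map key so that several schema triples sharing a subject or object are all retained (the text does not say how such collisions are handled).

```python
from collections import defaultdict


def store_schema_pattern(rule_table, s, p, o):
    """Store <S,P,O> under key P as the map entries <S,<O,0>> and <O,<S,1>>."""
    entry = rule_table.setdefault(p, defaultdict(set))
    entry[s].add((o, 0))  # flag 0: the map key is the subject
    entry[o].add((s, 1))  # flag 1: the map key is the object


table = {}
store_schema_pattern(table, "ex:R", "owl:onProperty", "ex:hasChild")
```

A join can then start from either end: `table["owl:onProperty"]["ex:hasChild"]` finds the restriction from the property, and `table["owl:onProperty"]["ex:R"]` finds the property from the restriction.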

Meanwhile in table with<LinkVar,<a,b,…>>The current rule of form storage in matched completion pattern The link variable that former piece is included.Thus it is the corresponding values of LinkVar that key can be preferentially searched in reasoning process, if root Corresponding example triple is found from example table according to the value of LinkVar, then can go out result with immediate reasoning.With the rule of OWL 16(v owl:allValuesFrom u,v owl:onProperty p,w rdf:Type v, wp x=>x rdf:type u)

For, wherein link variable is v, p and w, storage example such as table 1-4：

Table 1-4. Storage table structure for OWL Horst rule 16

If any of the four row keys in Rule_16_Table has an empty value, the schema antecedents required by the rule are incomplete and the rule cannot fire. The lookup of instance data can then be skipped, which saves reasoning time and improves reasoning efficiency.
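The emptiness check amounts to a short-circuit test over the rule table's row keys. In the sketch below the table is an in-memory dictionary, and the four row-key names are hypothetical placeholders, since the contents of Table 1-4 are not reproduced in this text.

```python
def can_activate(rule_table, required_keys):
    """A rule can fire only if every required row key holds a non-empty value."""
    return all(rule_table.get(key) for key in required_keys)


# Hypothetical row keys for Rule_16_Table (assumed, not taken from Table 1-4).
REQUIRED_16 = ("owl:allValuesFrom", "owl:onProperty", "rdf:type", "LinkVar")

incomplete = {
    "owl:allValuesFrom": {"ex:R": {("ex:Person", 0)}},
    "owl:onProperty": {},  # no onProperty triple has arrived yet
    "rdf:type": {"ex:alice": {("ex:R", 0)}},
    "LinkVar": ["ex:R"],
}
complete = dict(incomplete, **{"owl:onProperty": {"ex:R": {("ex:hasChild", 0)}}})
```

Because `all` stops at the first empty value, an incomplete rule is rejected before any instance table is consulted.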

The table structures for the other rules with link variables are given in Tables 1-5 to 1-14.

Table 1-5. Storage table structure for OWL Horst rule 1

Table name	Row key (key)	Column value (value)
Rule_1_Table	owl:FunctionalProperty	p

Table 1-6. Storage table structure for OWL Horst rule 2

Table name	Row key (key)	Column value (value)
Rule_2_Table	owl:InverseFunctionalProperty	p

Table 1-7. Storage table structure for OWL Horst rule 4

Table name	Row key (key)	Column value (value)
Rule_4_Table	owl:TransitiveProperty	p

Table 1-8. Storage table structure for OWL Horst rule 8a

Table name	Row key (key)	Column value (value)
Rule_8a_Table	p	q

Table 1-9. Storage table structure for OWL Horst rule 8b

Table name	Row key (key)	Column value (value)
Rule_8b_Table	p	q

Table 1-10. Storage table structure for OWL Horst rule 12c

Table name (Rule_m_Table)	Row key (key)	Column value (value)
Rule_12c_Table	v	w

Table 1-11. Storage table structure for OWL Horst rule 13c

Table name	Row key (key)	Column value (value)
Rule_13c_Table	v	w

Table 1-12. Storage table structure for OWL Horst rule 14a

Table 1-13. Storage table structure for OWL Horst rule 14b

Table 1-14. Storage table structure for OWL Horst rule 15

1.3 Storage design for owl:sameAs-related triples

For triples whose predicate is owl:sameAs, the subject (or object) associated with owl:sameAs may be either schema data or instance data. This section therefore designs separate rule link variable relation tables for the rules that involve owl:sameAs; examples are shown in Tables 1-15 to 1-19:

Table 1-15. Storage table structure for OWL Horst rule 6

Table name	Row key (key)	Column value (value)
Rule_6_Table	owl:sameAs	<v,w>

Table 1-16. Storage table structure for OWL Horst rule 7

Table name	Row key (key)	Column value (value)
Rule_7_Table	v	w

Table 1-17. Storage table structure for OWL Horst rule 9

Table 1-18. Storage table structure for OWL Horst rule 10

Table 1-19. Storage table structure for OWL Horst rule 11

1.4 Implementation of stream data storage

This stage classifies and stores the data in parallel. At regular intervals it acquires the batch of stream data new_data from the streaming data flow and the data itr_data generated by the previous round of reasoning. It then checks each triple in new_data or itr_data: an instance triple is stored directly according to the design in 1.1; a schema triple is matched to its corresponding inference rules and stored according to the rule link variable relation table design in 1.2, so that all schema triples are stored.
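The classification step can be illustrated with a small sketch. The text does not state how a triple is recognized as a schema triple, so the test used here (membership of the predicate, or of the object of an rdf:type statement, in a fixed vocabulary) is an assumption, as are the names `is_schema_triple` and `classify`. The special handling of owl:sameAs described in 1.3 is omitted.

```python
# Assumed vocabulary of schema-level terms (illustrative, not exhaustive).
SCHEMA_PREDICATES = {
    "owl:equivalentClass", "owl:equivalentProperty", "owl:inverseOf",
    "owl:allValuesFrom", "owl:someValuesFrom", "owl:hasValue",
    "owl:onProperty", "rdfs:subClassOf", "rdfs:subPropertyOf",
}
SCHEMA_TYPES = {
    "owl:FunctionalProperty", "owl:InverseFunctionalProperty",
    "owl:SymmetricProperty", "owl:TransitiveProperty",
}


def is_schema_triple(s, p, o):
    return p in SCHEMA_PREDICATES or (p == "rdf:type" and o in SCHEMA_TYPES)


def classify(batch):
    """Split one batch (new_data plus itr_data) into schema and instance triples."""
    schema, instance = [], []
    for triple in batch:
        (schema if is_schema_triple(*triple) else instance).append(triple)
    return schema, instance


new_data = [("ex:knows", "rdf:type", "owl:SymmetricProperty"),
            ("ex:alice", "ex:knows", "ex:bob")]
itr_data = [("ex:Human", "owl:equivalentClass", "ex:Person")]
schema, instance = classify(new_data + itr_data)
```

In a Spark Streaming job the same predicate would typically be applied per batch as a filter over the incoming triples; that wiring is left out to keep the sketch self-contained.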

Algorithm 1. Parallel data storage algorithm ParallelStoreForHorst

Input: streaming triple data (new_data); new triple data generated by the previous round of reasoning (itr_data)

Output: none

Taking rule 6 in 1.3 (v owl:sameAs w => w owl:sameAs v) and rule 16 in 1.2 (v owl:allValuesFrom u, v owl:onProperty p, w rdf:type v, w p x => x rdf:type u) as examples, the pseudocode for this stage is described as follows:

Because LinkVar is generated according to the definition of each specific rule, the algorithms for the rules that require LinkVar are given below:

The acquisition algorithm of LinkVar in rule 11

The acquisition algorithm of LinkVar in rule 14a

The acquisition algorithm of LinkVar in rule 15

The acquisition algorithm of LinkVar in rule 16

2. Parallel reasoning stage

2.1 Map stage: data reasoning

The Map stage mainly performs data reasoning. The steps are as follows:

Step 1. Traverse the rule link variable relation table and determine which rules can be activated.

Step 2. For each rule that can be activated, if the rule's antecedents allow a conclusion to be inferred directly without instance triples, skip to Step 3. If instance triples are required, use the link variable of the instance triples that the rule needs as the key and look up the corresponding instance triples in the previously stored instance tables; if they are found, proceed to Step 3, otherwise repeat the check of Step 2. When all data have been processed, the algorithm ends.

Step 3. Execute the current rule to obtain the inferred conclusion, output each generated triple <Si,Pj,Ok> to the set <Si,(Pj,Ok)>, and return to Step 2.
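The three steps can be read as a loop over rule tables. The following is a minimal single-process sketch, not the pseudocode of the original algorithm: each rule is represented as a hypothetical `(table_name, can_activate, apply)` tuple, dictionaries stand in for Redis, and only rule 6 is wired in as an example.

```python
def map_stage(rules, rule_tables, instances):
    """Run every activatable rule and collect output as Si -> {(Pj, Ok)}."""
    output = {}
    for table_name, can_activate, apply_rule in rules:
        table = rule_tables.get(table_name, {})
        if not can_activate(table):                    # Step 1: skip inactive rules
            continue
        for s, p, o in apply_rule(table, instances):   # Steps 2 and 3
            output.setdefault(s, set()).add((p, o))
    return output


def rule_6_apply(table, instances):
    # v owl:sameAs w => w owl:sameAs v (needs no instance lookup)
    return {(w, "owl:sameAs", v) for v, w in table["owl:sameAs"]}


RULES = [("Rule_6_Table", lambda t: bool(t.get("owl:sameAs")), rule_6_apply)]
tables = {"Rule_6_Table": {"owl:sameAs": {("ex:a", "ex:b")}}}
out = map_stage(RULES, tables, instances={})
```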

Algorithm 2. Data reasoning algorithm ParallelReasoningForHorst

Input: rule link variable relation tables (Rule_i_Table); instance triple tables (S_Table, P_Table, O_Table)

Output: new triples generated by reasoning

The overall pseudocode of the algorithm is described as follows:

Taking rule 6 in 1.3 (v owl:sameAs w => w owl:sameAs v) as an example, the pseudocode is described as follows:

The above is the reasoning pseudocode for a rule without link variables. Next, a rule with link variables is described. Taking rule 16 in 1.2 (v owl:allValuesFrom u, v owl:onProperty p, w rdf:type v, w p x => x rdf:type u) as an example, the pseudocode is described as follows (in the pseudocode, s16 is defined as an object of Set_16):

For rules with multiple link variables such as rule 16, storing the schema triples in <key,value> form in the Redis cluster allows link variables to be matched quickly. For the associated instance triples, the instance triple storage strategy in Redis is used to find the relevant instance triples from the value of the link variable. Because a key lookup in Redis takes O(1) time, reasoning efficiency is greatly improved.
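To make the join order for rule 16 concrete, the sketch below chains the lookups v → (u, p) → w → x through keyed tables. It is an illustration under simplifying assumptions: dictionaries stand in for Redis, the schema antecedents are flattened to one u and one p per restriction v, and the function and parameter names are hypothetical.

```python
def infer_rule_16(all_values_from, on_property, s_table, o_table):
    """v owl:allValuesFrom u, v owl:onProperty p, w rdf:type v, w p x => x rdf:type u."""
    inferred = set()
    for v, u in all_values_from.items():
        p = on_property.get(v)
        if p is None:
            continue  # schema antecedents incomplete: skip instance lookups
        # Object-indexed table: which w have rdf:type v?
        for w, pred in o_table.get(v, ()):
            if pred != "rdf:type":
                continue
            # Subject-indexed table: which x satisfy w p x?
            for p2, x in s_table.get(w, ()):
                if p2 == p:
                    inferred.add((x, "rdf:type", u))
    return inferred


all_values_from = {"ex:R": "ex:Person"}
on_property = {"ex:R": "ex:hasChild"}
s_table = {"ex:alice": {("rdf:type", "ex:R"), ("ex:hasChild", "ex:bob")}}
o_table = {"ex:R": {("ex:alice", "rdf:type")}}
result = infer_rule_16(all_values_from, on_property, s_table, o_table)
```

Every step is a keyed access into one of the tables, so no table is scanned in full.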

2.2 Reduce stage: deduplication and storage

The Reduce stage mainly saves the data generated by reasoning. The triples generated by reasoning are stored in a set named "itr_data" in the Redis cluster, duplicate triples are removed, and the "itr_data" set then serves as part of the input data for the next round of reasoning. The specific steps of the deduplication and storage algorithm are as follows:

Step 1. Receive the set of new triples generated by reasoning in the Map stage (including SchemaTriple and InstanceTriple); if the received data are empty, the algorithm ends.

Step 2. Traverse the received set of new triples and remove the duplicates.

Step 3. Store the deduplicated set of triples in the Redis cluster under the set name itr_data, to be read by the next round of reasoning.

Algorithm 3. Reduce algorithm (DuplicateRemovalForHorst)

Input: set <Si,(Pj,Ok)>

Output: itr_data
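A minimal sketch of the three Reduce steps follows. A dictionary stands in for the Redis cluster and the function name `reduce_stage` is hypothetical; only the set name itr_data comes from the text.

```python
def reduce_stage(inferred_pairs, redis_like):
    """Deduplicate <Si,(Pj,Ok)> pairs and store them under the name itr_data.

    Returns the number of distinct triples stored (0 means nothing to do).
    """
    # Steps 1 and 2: a set comprehension removes duplicates while flattening.
    unique = {(s, p, o) for s, (p, o) in inferred_pairs}
    if not unique:
        return 0  # empty input: the algorithm ends
    # Step 3: make the result available to the next round of reasoning.
    redis_like["itr_data"] = unique
    return len(unique)


redis_like = {}
pairs = [("ex:b", ("owl:sameAs", "ex:a")),
         ("ex:b", ("owl:sameAs", "ex:a")),  # duplicate from another rule
         ("ex:bob", ("rdf:type", "ex:Person"))]
count = reduce_stage(pairs, redis_like)
```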

The algorithm of the present invention reduces the number of MapReduce tasks and uses Spark to perform iterative reasoning over streaming data; the rule link variable relation table is designed to store the data and the new data generated during reasoning, which ensures the completeness of the algorithm; and the storage scheme designed for instance triples exploits the characteristics of Redis to trade space for time, enabling fast reads of instance data.

The above are preferred embodiments of the present invention. Any change made according to the technical solution of the present invention whose resulting function does not go beyond the scope of that technical solution falls within the scope of protection of the present invention.

Claims

1. A streaming RDF data parallel reasoning algorithm based on Spark Streaming, characterized by comprising the following steps:

Step S1: build the rule link variable relation table from the OWL Horst inference rules; in the iterative parallel reasoning stage, acquire at regular intervals the batch of new data in the streaming data flow and the data generated by the previous round of reasoning as input data, classify the schema data and instance data in the input, and store them in the corresponding Redis clusters;

Step S2: according to the rule link variable relation table, determine the rules that can be activated in the current round of reasoning, and generate inferred data by combining them with the corresponding instance data;

Step S3: delete the duplicate data generated by the current round of reasoning and store the result, whereupon the current iteration of reasoning ends.

2. The streaming RDF data parallel reasoning algorithm based on Spark Streaming according to claim 1, characterized in that, in step S1, the schema data are the schema triple data and the instance data are the instance triple data.

3. The streaming RDF data parallel reasoning algorithm based on Spark Streaming according to claim 2, characterized in that the instance triple data are stored in the Redis cluster as follows: following the characteristics of Redis clusters, the <key, value> form is used, with the subject S, predicate P and object O of each triple each serving as a key, so that the data are stored in three tables of the form <S,(P,O)>, <P,(S,O)> and <O,(S,P)>.

4. The streaming RDF data parallel reasoning algorithm based on Spark Streaming according to claim 3, characterized in that the schema triple data are stored in the Redis cluster as follows: for each OWL Horst inference rule, a corresponding table Rule_m_Table is generated and stored in Redis, with the rule as the table name; the rules fall into two classes according to their number of link variables: rules without link variables and rules with link variables;

Rules without link variables: the storage scheme in Redis uses P as the key and <S,O> as the value;

Rules with link variables:

(1) for a rule with a single link variable, the storage scheme in Redis uses P as the key, and only a single value is stored under the key;

(2) for a complex rule with multiple link variables, the storage scheme in Redis uses P as the key, and S and O are stored in the value as a map of the form <S,<O,0>>, <O,<S,1>>, where 0 indicates that the map key is the subject and 1 indicates that the map key is the object.

5. The streaming RDF data parallel reasoning algorithm based on Spark Streaming according to claim 4, characterized in that step S2 is implemented as follows:

Step S21: traverse the rule link variable relation table and determine which rules can be activated;

Step S22: for each rule that can be activated, if a conclusion can be inferred directly without instance triple data, skip to step S23; if instance triple data are required, use the link variable of the instance triples that the rule needs as the key and look up the corresponding instance triple data in the previously stored instance tables; if they are found, proceed to step S23, otherwise repeat the check of step S22; when all data have been processed, the algorithm ends;

Step S23: execute the current rule to obtain the inferred conclusion, output each generated triple <Si,Pj,Ok> to the set <Si,(Pj,Ok)>, and return to step S22.

6. The streaming RDF data parallel reasoning algorithm based on Spark Streaming according to claim 5, characterized in that step S3 is implemented as follows:

Step S31: receive the set of new triples generated by the reasoning of step S2; if the received data are empty, the algorithm ends;

Step S32: traverse the received set of new triples and remove the duplicates;

Step S33: store the deduplicated set of triples in the Redis cluster under the set name itr_data, to be read by the next round of reasoning.