CN108763451B

CN108763451B - Streaming RDF data parallel reasoning algorithm based on Spark Streaming

Info

Publication number: CN108763451B
Application number: CN201810521793.XA
Authority: CN
Inventors: 汪璟玢; 陈晓曦
Original assignee: Fuzhou University
Current assignee: Fuzhou University
Priority date: 2018-05-28
Filing date: 2018-05-28
Publication date: 2022-03-11
Anticipated expiration: 2038-05-28
Also published as: CN108763451A

Abstract

The invention relates to a streaming RDF data parallel reasoning algorithm based on Spark Streaming. First, a rule connection variable relation table is constructed from the OWL Horst inference rules. In the iterative parallel inference stage, the batch of new data in the stream and the data generated by the previous round of inference are acquired at fixed intervals as input; the input is classified into schema data and instance data, which are stored in the corresponding Redis clusters. Next, the rules that the current round of inference can activate are determined from the rule connection variable relation table, and inference data is generated by combining them with the corresponding instance data. Finally, the data generated by the current round is deduplicated and stored, ending the current iteration. The invention reduces the number of MapReduce tasks and uses Spark to reason iteratively over streaming data; the rule connection variable relation table is designed to store the data and the new data generated during reasoning, which ensures the completeness of the algorithm; and the storage scheme designed for instance triples exploits the characteristics of Redis to trade space for time, so that instance data can be read quickly.

Description

Streaming RDF data parallel reasoning algorithm based on Spark Streaming

Technical Field

The invention belongs to the technical field of mass Streaming RDF data inference, and particularly relates to a Streaming RDF data parallel inference algorithm based on Spark Streaming.

Background

Most existing inference methods based on OWL rules are centralized and process static data sets of fixed size; because of the limits of centralized processing, they are inefficient on massive real-time data. In response to this growing demand, a few researchers have proposed their own RDF stream reasoning architectures. Barbieri D F et al. [1] propose an incremental inference algorithm over streams and rich background knowledge that attaches an expiration time to each RDF triple, infers new data when it arrives, and expires explicit facts and deletes the triples that are no longer valid. The IDRM algorithm [2] enables efficient and scalable RDFS reasoning over incremental data. Chevalier J et al. [3] propose an efficient incremental reasoner (Slider) that reasons over semantic data streams by exploiting their inherent features, implementing a scalable batch reasoner for streaming data. Yeyi Xinxin et al. propose PRAS [4], a Spark-based parallel inference algorithm for streaming RDF data that uses a pseudo-bidirectional network.

The main challenge of RDF stream reasoning is how to store the received streaming data and reason over it. The IDRM algorithm is modeled specifically for the RDFS rules, so its inference efficiency on the OWL Horst rules is low. Slider is designed only for RDFS rules, so it is not applicable to the more complex OWL Horst rule reasoning. The PRAS algorithm reasons over streaming data by means of a pseudo-bidirectional network, but communication in that network is expensive, so it is inefficient when processing large volumes of streaming data.

The invention provides a streaming RDF data reasoning algorithm built on the Spark platform that addresses two problems: storing streaming data and reasoning over it. To guarantee the completeness of inference results, the design of the storage scheme for streaming data is the key point of this work. The method uses a Redis database cluster to store large-scale RDF data and the distributed stream computing framework Spark Streaming, and designs and implements a distributed parallel reasoning scheme for streaming RDF data using the MapReduce computing model within the platform, addressing the problems that large volumes of streaming data cannot be reasoned over quickly and that inference results are incomplete. The method is a useful reference for reasoning over massive data.

Reference documents:

[1]Barbieri D F,Braga D,Ceri S,et al.Incremental reasoning on streams and rich background knowledge[C]//Extended Semantic Web Conference.Springer Berlin Heidelberg,2010:1-15.

[2]Liu B,Wu L,Li J,et al.Exploiting Incremental Reasoning in Healthcare Based on Hadoop and Amazon Cloud[C]//Semantic Cities Workshop at AAAI Conference on Artificial Intelligence (AAAI’14).2014.

[3]Chevalier J,Subercaze J,Gravier C,et al.Slider:an Efficient Incremental Reasoner[C]//Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data.ACM,2015:1081-1086.

[4]joy,WangJING.Spark-based distributed parallel inference algorithm[J].Computer System Application,2017,26(05):97-104.

Disclosure of Invention

The invention aims to provide a streaming RDF data parallel reasoning algorithm based on Spark Streaming that reduces the number of MapReduce tasks and uses Spark to reason iteratively over streaming data; the rule connection variable relation table is designed to store the data and the new data generated during reasoning, which ensures the completeness of the algorithm; and the storage scheme designed for instance triples exploits the characteristics of Redis to trade space for time, so that instance data can be read quickly.

In order to achieve the purpose, the technical scheme of the invention is as follows: a Streaming RDF data parallel reasoning algorithm based on Spark Streaming comprises the following steps:

step S1, constructing a rule connection variable relation table from the OWL Horst inference rules; in the iterative parallel inference stage, acquiring at fixed intervals the batch of new data in the stream and the data generated by the previous round of inference as input, classifying the input into schema data and instance data, and storing them in the corresponding Redis clusters;

step S2, determining from the rule connection variable relation table the rules that the current round of inference can activate, and generating inference data by combining them with the corresponding instance data;

step S3, deduplicating the data generated by the current round of inference, storing it, and ending the current iteration.

In step S1, the schema data is schema triple data, and the instance data is instance triple data.

In an embodiment of the present invention, the instance triple data is stored in the Redis cluster as follows: following the characteristics of the Redis cluster, a <key, value> form is adopted, and the subject S, predicate P, and object O of a triple are each used as a key; that is, the triple is stored in three tables in the forms <S, (P, O)>, <P, (S, O)>, and <O, (S, P)>.

In an embodiment of the present invention, the schema triple data is stored in the Redis cluster as follows: a table Rule_m_Table is generated for each OWL Horst inference rule and stored in Redis, with the rule as the table name. The rules are divided into two classes according to the number of connection variables in each rule: rules without connection variables and rules with connection variables.

Rules without connection variables: stored in Redis with P as the key and <S, O> as the value.

Rules with connection variables:

(1) for a rule with a single connection variable, P is stored in Redis as the key, and the key has only one value;

(2) for a complex rule with multiple connection variables, P is stored in Redis as the key, and S and O are stored in the value as a Map of the form <S, <O, 0>>, <O, <S, 1>>, where 0 indicates that the map key is the subject and 1 that it is the object.

In an embodiment of the present invention, the specific implementation process of step S2 is as follows:

step S21, traversing the rule connection variable relation table and determining the rules that can be activated;

step S22, for each rule that can be activated: if the conclusion can be inferred directly without instance triple data, jumping to step S23; if instance triple data is needed, using the connection variable required by the rule as the key to look up the corresponding instance triple data in the previously stored instance tables, and going to step S23 if it is found, otherwise repeating the judgment of step S22 for the next rule; when all data has been processed, ending the algorithm;

step S23, executing the current rule to obtain the inferred conclusion, outputting each inferred triple <Si, Pj, Ok> to the set <Si, (Pj, Ok)>, and jumping to step S22.

In an embodiment of the present invention, the specific implementation process of step S3 is as follows:

step S31, receiving the new triple set generated by the inference in step S2, and ending the algorithm if the received data is empty;

step S32, traversing the received new triple set and removing the duplicate triples in it;

step S33, storing the deduplicated triple set in the Redis cluster under the set name itr_data, to be read by the next round of inference.

Compared with the prior art, the invention has the following beneficial effects:

1. the number of MapReduce tasks is reduced, and Spark is used to reason iteratively over streaming data;

2. the rule connection variable relation table is designed to store the data and the new data generated during reasoning, which ensures the completeness of the algorithm;

3. the storage scheme designed for instance triples exploits the characteristics of Redis to trade space for time, so that instance data can be read quickly.

Drawings

FIG. 1 is a general framework diagram of the algorithm of the present invention.

FIG. 2 shows the OWL Horst rule of the present invention.

Detailed Description

The technical scheme of the invention is specifically explained below with reference to the accompanying drawings.

The invention provides a streaming RDF data parallel reasoning algorithm based on Spark Streaming, comprising steps S1 to S3 as set out above.

Instance triple data is stored in the Redis cluster as follows: following the characteristics of the Redis cluster, a <key, value> form is adopted, and the subject S, predicate P, and object O of a triple are each used as a key; that is, the triple is stored in three tables in the forms <S, (P, O)>, <P, (S, O)>, and <O, (S, P)>. Schema triple data is stored in the Redis cluster as follows: a table Rule_m_Table is generated for each OWL Horst inference rule and stored in Redis, with the rule as the table name. The rules are divided into two classes according to the number of connection variables in each rule: rules without connection variables and rules with connection variables. The storage layout for each class and the implementation of steps S2 and S3 are as set out above.

The following is a specific implementation of the present invention.

The invention relates to a streaming RDF data parallel reasoning algorithm based on Spark Streaming (the PSRH algorithm), which is divided into two stages: constructing the rule connection variable relation table, and iterative parallel reasoning. Iterative parallel reasoning in turn comprises two parts: streaming data classification and OWL Horst rule reasoning. First, a rule connection variable relation table is constructed from the OWL Horst inference rules. In the iterative parallel inference stage, the batch of new data in the stream and the data generated by the previous round of inference are acquired at fixed intervals as input; the input is classified into schema data and instance data, which are stored in the corresponding Redis clusters. Next, the rules that the current round of inference can activate are determined from the rule connection variable relation table, and inference data is generated by combining them with the corresponding instance data. Finally, the data generated by the current round is deduplicated and stored, ending the current iteration. The overall framework of the PSRH algorithm is shown in FIG. 1, and the specific algorithm process is as follows.

1. RDF streaming data storage

The PSRH algorithm constructs an inference model from the OWL Horst rules (shown in FIG. 2) and the RDF ontology data, following the characteristics of the Redis cluster. It then acquires triple data in real time through the Spark Streaming framework and separates schema triple data from instance triple data.

1.1 Instance triple storage design

The instance data is very large, and the connection variable used by a given rule during inference may be any of the subject, predicate, or object of an instance triple, which makes instance triples slow to find. Following the characteristics of the Redis cluster, the algorithm adopts a <key, value> form, uses the subject, predicate, and object of each triple as a key in turn, and stores the triple in three tables in the forms <S, (P, O)>, <P, (S, O)>, and <O, (S, P)>. Whichever of the subject, predicate, or object is used for the lookup, the instance triple can then be found by the corresponding key in O(1) time thanks to the characteristics of the Redis cluster, trading space for time.
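The three-table layout can be illustrated with a minimal sketch. It assumes plain Python dictionaries as an in-memory stand-in for the Redis cluster; the function name `store_instance_triple` and the sample triples are hypothetical, while the table names follow the S_Table, P_Table, and O_Table inputs of Algorithm 2.

```python
from collections import defaultdict

# In-memory stand-ins for the three Redis tables (assumption: a real
# deployment would keep each table in the Redis cluster, not in a dict).
S_TABLE = defaultdict(set)  # subject   -> {(predicate, object)}
P_TABLE = defaultdict(set)  # predicate -> {(subject, object)}
O_TABLE = defaultdict(set)  # object    -> {(subject, predicate)}


def store_instance_triple(s, p, o):
    """Index one instance triple under each of its three components."""
    S_TABLE[s].add((p, o))
    P_TABLE[p].add((s, o))
    O_TABLE[o].add((s, p))


store_instance_triple("ex:alice", "ex:knows", "ex:bob")
store_instance_triple("ex:alice", "rdf:type", "ex:Person")

# Each lookup is a single hash access, whichever component is the key.
by_subject = S_TABLE["ex:alice"]
by_object = O_TABLE["ex:bob"]
```

Every triple is written three times, which is the space cost paid for constant-time access by any component.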

1.2 Schema triple storage design

Schema triples are stored in a rule connection variable relation table.

The antecedents of an OWL rule must be joined through shared variables to generate new triple data. Because the algorithm processes a stream, the schema triple data cannot be loaded once as static data. The algorithm therefore builds a connection variable table from the schema triple data to record the rules whose antecedents are only partly satisfied during inference, so that when new data next enters a node, the unfinished inference can continue from the contents of the table.

A table (Rule_m_Table) is generated for each of the OWL rules and stored in Redis, with the rule as the table name. The rules are divided into two classes according to the number of connection variables in each rule: rules without connection variables and rules with connection variables.

We construct different tables according to different categories respectively:

(1) rules that do not require a connection variable

In the antecedent <S, P, O> of a rule without connection variables, P is in most cases a transitive, equivalence, or inverse property, so the triple is stored in Redis with P as the key and <S, O> as the value. Taking OWL Horst rule 12a (v owl:equivalentClass w => v rdfs:subClassOf w) as an example, the table is shown in Table 1-1:

Table 1-1 Storage table structure of OWL Horst rule 12a

Table name	Key row (key)	Column value (value)
Rule_12a_Table	owl:equivalentClass	<v,w>

The table structures of the other rules without connection variables are given in Table 1-2.

Table 1-2 Storage table structure of OWL Horst rules 12b, 13a, 13b

Table name	Key row (key)	Column value (value)
Rule_12b_Table	owl:equivalentClass	<v,w>
Rule_13a_Table	owl:equivalentProperty	<v,w>
Rule_13b_Table	owl:equivalentProperty	<v,w>
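As an illustration of rules without connection variables, the sketch below stores a schema triple under its predicate and derives the conclusions of rules 12a and 12b directly from the table, with no instance lookup. It assumes dictionaries in place of Redis; the function names and the `ex:` sample data are hypothetical, and the conclusion of rule 12b (w rdfs:subClassOf v) is taken from the standard OWL Horst rule set rather than from the text above.

```python
# In-memory stand-ins for the Redis rule tables (assumption).
rule_tables = {
    "Rule_12a_Table": {"owl:equivalentClass": set()},
    "Rule_12b_Table": {"owl:equivalentClass": set()},
}


def store_schema_triple(s, p, o):
    """Store <s, p, o> as P -> {<S, O>} in every rule table keyed by p."""
    for table in rule_tables.values():
        if p in table:
            table[p].add((s, o))


def infer_12a_12b():
    """12a: v equivalentClass w => v subClassOf w; 12b: => w subClassOf v."""
    derived = set()
    for v, w in rule_tables["Rule_12a_Table"]["owl:equivalentClass"]:
        derived.add((v, "rdfs:subClassOf", w))
    for v, w in rule_tables["Rule_12b_Table"]["owl:equivalentClass"]:
        derived.add((w, "rdfs:subClassOf", v))
    return derived


store_schema_triple("ex:Human", "owl:equivalentClass", "ex:Person")
result = infer_12a_12b()
```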

(2) Rules with connected variables

For a rule with a single connection variable, P is stored in Redis as the key, and the key has only one value. Taking OWL Horst rule 3 (p rdf:type owl:SymmetricProperty, v p u => u p v) as an example, where the connection variable is p, the table is shown in Table 1-3:

Table 1-3 Storage table structure of OWL Horst rule 3

Table name	Key row (key)	Column value (value)
Rule_3_Table	owl:SymmetricProperty	p

The example above indicates that the stream contains a triple whose connection variable is p and whose type is owl:SymmetricProperty.
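A minimal sketch of how rule 3 could use this table: the stored value p serves as the key into the predicate-keyed instance table, and every matching pair is reversed. Dictionaries stand in for Redis (an assumption), and the function names and sample data are hypothetical.

```python
from collections import defaultdict

# Stand-ins for Redis (assumption): the rule table and the
# predicate-keyed instance table <P, (S, O)>.
rule_3_table = {"owl:SymmetricProperty": set()}
p_table = defaultdict(set)  # predicate -> {(subject, object)}


def store_schema(s, p, o):
    """Record p when the schema triple is <p rdf:type owl:SymmetricProperty>."""
    if p == "rdf:type" and o in rule_3_table:
        rule_3_table[o].add(s)


def infer_rule_3():
    """Rule 3: p is symmetric, v p u  =>  u p v."""
    derived = set()
    for p in rule_3_table["owl:SymmetricProperty"]:
        for v, u in p_table[p]:  # lookup by the connection variable p
            derived.add((u, p, v))
    return derived


store_schema("ex:marriedTo", "rdf:type", "owl:SymmetricProperty")
p_table["ex:marriedTo"].add(("ex:ann", "ex:bo"))
result = infer_rule_3()
```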

For a complex rule with multiple connection variables, the nature of streaming data means that all the schema triples related to the rule cannot simply be stored under a single <key, value> pair. So that no schema triple from any batch of the stream is lost, and to make the later joins easier, the predicate P of each antecedent schema triple <S, P, O> is stored as the key, and S and O are stored in the value as a Map of the form <S, <O, 0>>, <O, <S, 1>>, where 0 indicates that the map key is the subject and 1 that it is the object. This layout ensures that a connection variable can be found by key in O(1) time whether it appears as the subject or as the object.
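The Map-valued layout can be sketched as nested dictionaries. This is an illustration under stated assumptions: Python dicts replace Redis, each map key holds a set of (partner, flag) pairs so that several schema triples with the same predicate can coexist, and the names `store_pattern` and `lookup` are hypothetical.

```python
def store_pattern(rule_table, s, p, o):
    """Store schema triple <s, p, o> under key p, indexed by both s and o."""
    entry = rule_table.setdefault(p, {})
    entry.setdefault(s, set()).add((o, 0))  # flag 0: the map key is the subject
    entry.setdefault(o, set()).add((s, 1))  # flag 1: the map key is the object


def lookup(rule_table, p, link_value):
    """Return the (partner, flag) pairs recorded for a connection-variable value."""
    return rule_table.get(p, {}).get(link_value, set())


rule_16_table = {}
store_pattern(rule_16_table, "ex:V", "owl:allValuesFrom", "ex:U")
store_pattern(rule_16_table, "ex:V", "owl:onProperty", "ex:p")

as_subject = lookup(rule_16_table, "owl:onProperty", "ex:V")
as_object = lookup(rule_16_table, "owl:allValuesFrom", "ex:U")
```

The flag tells the caller on which side of the triple the looked-up value sits, so the same table answers both directions of the join.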

Meanwhile, the connection variables contained in the schema antecedents that have already been fully matched in the current rule are stored in the table in the form <LinkVar, <a, b, …>>. During reasoning, the value of LinkVar can therefore be looked up first, and if the corresponding instance triples are found in the instance tables using that value, the conclusion can be derived directly. Taking OWL Horst rule 16 (v owl:allValuesFrom u, v owl:onProperty p, w rdf:type v, w p x => x rdf:type u) as an example, where the connection variables are v, p, and w, the storage table is shown in Table 1-4:

Table 1-4 Storage table structure of OWL Horst rule 16

If the value of any of the four row keys in Rule_16_Table is null, the schema antecedents required by the rule are incomplete and the rule cannot fire. Skipping such a rule saves the time that would otherwise be spent looking up instance data and improves inference efficiency.
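The null check can be sketched as a single predicate over a rule table. The row keys used below are placeholders chosen for illustration only (the actual rows of Table 1-4 are not assumed here), and `can_activate` is a hypothetical name.

```python
def can_activate(rule_table):
    """A rule may fire only if its table exists and no row key has an empty value."""
    return bool(rule_table) and all(rule_table.values())


# Placeholder row keys (assumption), with one antecedent still missing.
rule_16_table = {
    "owl:allValuesFrom": {"ex:V": "ex:U"},
    "owl:onProperty": {},  # no matching schema triple has arrived yet
    "LinkVar": set(),
}
before = can_activate(rule_16_table)

# A later batch of the stream completes the schema antecedents.
rule_16_table["owl:onProperty"]["ex:V"] = "ex:p"
rule_16_table["LinkVar"].add("ex:V")
after = can_activate(rule_16_table)
```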

The table structures of the other rules with connection variables are given in Tables 1-5 to 1-14.

Table 1-5 Storage table structure of OWL Horst rule 1

Table name	Key row (key)	Column value (value)
Rule_1_Table	owl:FunctionalProperty	p

Table 1-6 Storage table structure of OWL Horst rule 2

Table name	Key row (key)	Column value (value)
Rule_2_Table	owl:InverseFunctionalProperty	p

Table 1-7 Storage table structure of OWL Horst rule 4

Table name	Key row (key)	Column value (value)
Rule_4_Table	owl:TransitiveProperty	p

Table 1-8 Storage table structure of OWL Horst rule 8a

Table name	Key row (key)	Column value (value)
Rule_8a_Table	p	q

Table 1-9 Storage table structure of OWL Horst rule 8b

Table name	Key row (key)	Column value (value)
Rule_8b_Table	p	q

Table 1-10 Storage table structure of OWL Horst rule 12c

Table name	Key row (key)	Column value (value)
Rule_12c_Table	v	w

Table 1-11 Storage table structure of OWL Horst rule 13c

Table name	Key row (key)	Column value (value)
Rule_13c_Table	v	w

Table 1-12 Storage table structure of OWL Horst rule 14a

Table 1-13 Storage table structure of OWL Horst rule 14b

Table 1-14 Storage table structure of OWL Horst rule 15

1.3 Storage design for owl:sameAs-related triples

For triples whose predicate is owl:sameAs, the subject or object may be either schema data or instance data, so separate rule connection variable relation tables are designed for the rules that contain owl:sameAs, as shown in Tables 1-15 to 1-19:

Table 1-15 Storage table structure of OWL Horst rule 6

Table name	Key row (key)	Column value (value)
Rule_6_Table	owl:sameAs	<v,w>

Table 1-16 Storage table structure of OWL Horst rule 7

Table name	Key row (key)	Column value (value)
Rule_7_Table	v	w

Table 1-17 Storage table structure of OWL Horst rule 9

Table 1-18 Storage table structure of OWL Horst rule 10

Table 1-19 Storage table structure of OWL Horst rule 11

1.4 Streaming data storage implementation

In this stage, data is classified and stored in parallel. The batch of stream data new_data and the data itr_data generated by the previous round of reasoning are acquired at fixed intervals. Each triple in new_data or itr_data is then checked: an instance triple is stored directly according to the design in 1.1; a schema triple is matched to its corresponding inference rules and stored according to the rule connection variable relation table design in 1.2.
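The classification step might look like the sketch below. The criterion for telling schema triples from instance triples is an assumption made for illustration (a triple counts as schema data if its predicate belongs to a fixed RDFS/OWL vocabulary, or if it types a property with an OWL property class); owl:sameAs, which the design in 1.3 handles separately, is left out. The names `is_schema_triple` and `classify` are hypothetical.

```python
# Assumed schema vocabulary; the actual criterion may differ.
SCHEMA_PREDICATES = {
    "rdfs:subClassOf", "rdfs:subPropertyOf", "rdfs:domain", "rdfs:range",
    "owl:equivalentClass", "owl:equivalentProperty", "owl:inverseOf",
    "owl:onProperty", "owl:allValuesFrom", "owl:someValuesFrom", "owl:hasValue",
}
SCHEMA_TYPE_OBJECTS = {
    "owl:FunctionalProperty", "owl:InverseFunctionalProperty",
    "owl:SymmetricProperty", "owl:TransitiveProperty",
}


def is_schema_triple(s, p, o):
    """Decide whether <s, p, o> is schema data under the assumed vocabulary."""
    return p in SCHEMA_PREDICATES or (p == "rdf:type" and o in SCHEMA_TYPE_OBJECTS)


def classify(batch):
    """Split one batch into (schema_triples, instance_triples)."""
    schema, instance = [], []
    for triple in batch:
        (schema if is_schema_triple(*triple) else instance).append(triple)
    return schema, instance


new_data = [
    ("ex:marriedTo", "rdf:type", "owl:SymmetricProperty"),
    ("ex:ann", "ex:marriedTo", "ex:bo"),
]
itr_data = [("ex:Human", "rdfs:subClassOf", "ex:Person")]

# Both the new batch and the previous round's output form the input.
schema_triples, instance_triples = classify(new_data + itr_data)
```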

Algorithm 1. Parallel data storage algorithm (parallelstoreForHorst)

Input: streaming triple data (new_data); new triple data generated by the previous round of reasoning (itr_data)

Output: none

Taking rule 6 in 1.3 (v owl:sameAs w => w owl:sameAs v) and rule 16 in 1.2 (v owl:allValuesFrom u, v owl:onProperty p, w rdf:type v, w p x => x rdf:type u) as examples, the pseudo code of this stage is described as follows:

Since LinkVar is generated according to each specific rule, the LinkVar acquisition algorithms for the rules that need one are given below:

LinkVar acquisition algorithm in rule 11

LinkVar acquisition algorithm in rule 14a

LinkVar acquisition algorithm in rule 15

LinkVar acquisition algorithm in rule 16

2. Parallelized reasoning phase

2.1 Map stage: data reasoning

The Map stage performs the data reasoning. The specific steps are as follows:

Step 1: Traverse the rule connection variable relation table and determine which rules can be activated.

Step 2: For each rule that can be activated, if its antecedents allow the conclusion to be inferred directly without instance triples, jump to Step 3. If instance triples are needed, use the connection variable required by the rule as the key to look up the corresponding instance triples in the previously stored instance tables; if they are found, go to Step 3, otherwise repeat the judgment of Step 2 for the next rule. When all data has been processed, the algorithm ends.

Step 3: Execute the current rule to obtain the inferred conclusion, output each inferred triple <Si, Pj, Ok> to the set <Si, (Pj, Ok)>, and jump to Step 2.
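The three steps can be sketched as a small driver loop. This is only an illustration under stated assumptions: dictionaries stand in for the Redis tables, the activation test is the null check on row-key values described in 1.2, and the rule functions `fire_12a` and `fire_3`, the `FIRE` registry, and the sample data are all hypothetical.

```python
from collections import defaultdict

# Stand-ins for Redis (assumption). Table contents are illustrative.
rule_tables = {
    "Rule_12a_Table": {"owl:equivalentClass": {("ex:Human", "ex:Person")}},
    "Rule_3_Table": {"owl:SymmetricProperty": {"ex:marriedTo"}},
    "Rule_4_Table": {"owl:TransitiveProperty": set()},  # nothing matched yet
}
p_table = {"ex:marriedTo": {("ex:ann", "ex:bo")}}  # predicate -> {(s, o)}


def fire_12a(table, _instances):
    # Schema-only rule: no instance lookup is needed.
    return {(v, "rdfs:subClassOf", w) for v, w in table["owl:equivalentClass"]}


def fire_3(table, instances):
    # The connection variable p is the key into the instance table.
    out = set()
    for p in table["owl:SymmetricProperty"]:
        for v, u in instances.get(p, ()):
            out.add((u, p, v))
    return out


FIRE = {"Rule_12a_Table": fire_12a, "Rule_3_Table": fire_3}


def map_stage(tables, instances):
    output = defaultdict(set)  # Si -> {(Pj, Ok)}
    for name, table in tables.items():
        # Step 1: a rule is activatable only if every row key has a value.
        if name not in FIRE or not all(table.values()):
            continue
        # Steps 2 and 3: fire the rule; a rule that finds no instances
        # simply contributes nothing and the loop moves on.
        for s, p, o in FIRE[name](table, instances):
            output[s].add((p, o))
    return dict(output)


derived = map_stage(rule_tables, p_table)
```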

Algorithm 2. Data reasoning algorithm (ParallelReasoningForHorst)

Input: rule connection variable relation tables (Rule_m_Table); instance triple tables (S_Table, P_Table, O_Table)

Output: new triples generated by inference

The overall pseudo code of the algorithm is described as follows:

Taking rule 6 in 1.3 (v owl:sameAs w => w owl:sameAs v) as an example, the pseudo code is described as follows:

The above is the inference pseudo code for a rule without connection variables. For rules with connection variables, taking rule 16 in 1.2 (v owl:allValuesFrom u, v owl:onProperty p, w rdf:type v, w p x => x rdf:type u) as an example, the pseudo code is described as follows (in the pseudo code, s16 is defined as an object of Set_16):
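As an illustration of the rule 16 join (a sketch under stated assumptions, not the pseudo code referred to above), the code below matches the two schema antecedents on v and then follows the instance tables by key. Dictionaries stand in for Redis, the schema side is simplified to two v-keyed maps, and all names and sample data are hypothetical.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"

# Schema side of rule 16 (simplified stand-in for Rule_16_Table, an assumption).
all_values_from = {"ex:V": "ex:U"}  # v -> u, from <v owl:allValuesFrom u>
on_property = {"ex:V": "ex:p"}      # v -> p, from <v owl:onProperty p>

# Instance side: subject-keyed and object-keyed tables.
s_table = defaultdict(set)  # subject -> {(predicate, object)}
o_table = defaultdict(set)  # object  -> {(subject, predicate)}


def store_instance(s, p, o):
    s_table[s].add((p, o))
    o_table[o].add((s, p))


def infer_rule_16():
    """v allValuesFrom u, v onProperty p, w type v, w p x  =>  x type u."""
    derived = set()
    # Values of v for which both schema antecedents have been matched.
    for v in all_values_from.keys() & on_property.keys():
        u, p = all_values_from[v], on_property[v]
        for w, pred in o_table.get(v, ()):        # find w with <w rdf:type v>
            if pred != RDF_TYPE:
                continue
            for pred2, x in s_table.get(w, ()):   # find x with <w p x>
                if pred2 == p:
                    derived.add((x, RDF_TYPE, u))
    return derived


store_instance("ex:w1", RDF_TYPE, "ex:V")
store_instance("ex:w1", "ex:p", "ex:x1")
store_instance("ex:w2", "ex:p", "ex:x2")  # w2 is not of type V, so no match
result = infer_rule_16()
```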

For rules with multiple connection variables such as rule 16, storing the schema triples in <key, value> form in the Redis cluster allows connection variables to be matched quickly. The associated instance triples are then found through the values of the connection variables, using the instance triple storage strategy in Redis. Since a Redis lookup by key takes O(1) time, reasoning efficiency is greatly improved.

2.2 Reduce stage: deduplication and storage

The Reduce stage saves the data generated by reasoning. The inferred triples are deduplicated and saved in a set named "itr_data" in the Redis cluster, and the "itr_data" set then becomes part of the input to the next round of inference. The specific steps of the data deduplication and storage algorithm provided by the invention are as follows:

Step 1: Receive the new triple set (comprising schema triples and instance triples) generated by the Map stage; if the received data is empty, end the algorithm.

Step 2: Traverse the received new triple set and remove the duplicate triples in it.

Step 3: Save the deduplicated triple set in the Redis cluster under the set name itr_data, to be read by the next round of inference.

Algorithm 3. Reduce algorithm (DuplicateRemovalForHorst)

Input: set <Si, (Pj, Ok)>

Output: itr_data
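The Reduce steps can be sketched in a few lines. A dictionary stands in for the Redis cluster (an assumption), and `reduce_stage` is a hypothetical name; only the set name itr_data comes from the text.

```python
def reduce_stage(map_output, store):
    """Deduplicate the Map output and save it under the set name itr_data.

    map_output: iterable of (Si, (Pj, Ok)) pairs, possibly with repeats.
    store: dict standing in for the Redis cluster (assumption).
    Returns False when there is nothing to store, which ends the iteration.
    """
    # Step 2: building a set removes repeated triples.
    triples = {(s, p, o) for s, (p, o) in map_output}
    # Step 1: empty input ends the algorithm.
    if not triples:
        return False
    # Step 3: save for the next round of inference.
    store["itr_data"] = triples
    return True


redis_stub = {}
has_new_data = reduce_stage(
    [
        ("ex:bo", ("ex:marriedTo", "ex:ann")),
        ("ex:bo", ("ex:marriedTo", "ex:ann")),  # duplicate from another rule
        ("ex:Human", ("rdfs:subClassOf", "ex:Person")),
    ],
    redis_stub,
)
finished = not reduce_stage([], {})
```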

The algorithm reduces the number of MapReduce tasks and uses Spark to reason iteratively over streaming data; the rule connection variable relation table is designed to store the data and the new data generated during reasoning, which ensures the completeness of the algorithm; and the storage scheme designed for instance triples exploits the characteristics of Redis to trade space for time, so that instance data can be read quickly.

The above are preferred embodiments of the present invention. Any change made according to the technical scheme of the present invention whose functional effects do not go beyond the scope of that technical scheme falls within the protection scope of the present invention.

Claims

1. A Streaming RDF data parallel reasoning algorithm based on Spark Streaming is characterized by comprising the following steps:

step S3, deduplicating the data generated by the current round of inference, storing it, and ending the current iteration;

in step S1, the schema data is schema triple data, and the instance data is instance triple data;

the instance triple data is stored in the Redis cluster as follows: following the characteristics of the Redis cluster, a <key, value> form is adopted, and the subject S, predicate P, and object O of a triple are each used as a key, that is, the triple is stored in three tables in the forms <S, (P, O)>, <P, (S, O)>, and <O, (S, P)>;

the schema triple data is stored in the Redis cluster as follows: a table Rule_m_Table is generated for each OWL Horst inference rule and stored in Redis, with the rule as the table name; the rules are divided into two classes according to the number of connection variables in each rule: rules without connection variables and rules with connection variables;

rule with connected variables:

2. The Streaming RDF data parallel reasoning algorithm based on Spark Streaming according to claim 1, wherein the step S2 is implemented as follows:

3. The Streaming RDF data parallel reasoning algorithm based on Spark Streaming according to claim 2, wherein the step S3 is implemented as follows: