CN108763451B - Streaming RDF data parallel reasoning algorithm based on Spark Streaming - Google Patents

Info

Publication number
CN108763451B
Authority
CN
China
Prior art keywords
data
rule
inference
triple
streaming
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810521793.XA
Other languages
Chinese (zh)
Other versions
CN108763451A (en)
Inventor
汪璟玢
陈晓曦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University
Priority to CN201810521793.XA
Publication of CN108763451A
Application granted
Publication of CN108763451B
Legal status: Active
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00: Computing arrangements using knowledge-based models
    • G06N5/04: Inference or reasoning models
    • G06N5/046: Forward inferencing; Production systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a streaming RDF data parallel reasoning algorithm based on Spark Streaming. First, a rule connection-variable relation table is constructed from the OWL Horst inference rules. In the iterative parallel reasoning stage, the new batch of data in the stream and the data generated by the previous round of reasoning are acquired at fixed intervals as input; the input is classified into schema data and instance data, and each is stored in the corresponding Redis cluster. Next, the rules that can be activated in the current round are determined from the rule connection-variable relation table, and inferred data is generated by combining them with the corresponding instance data. Finally, the data generated by the current round is deduplicated and stored, and the iteration ends. The invention reduces the number of MapReduce tasks and uses Spark to reason iteratively over streaming data; the rule connection-variable relation table is designed to store data together with the new data generated during reasoning, which guarantees the completeness of the algorithm; and a storage scheme for instance triples is designed that exploits the characteristics of Redis to trade space for time, so that instance data can be read quickly.

Description

Streaming RDF data parallel reasoning algorithm based on Spark Streaming
Technical Field
The invention belongs to the technical field of reasoning over massive streaming RDF data, and in particular relates to a streaming RDF data parallel reasoning algorithm based on Spark Streaming.
Background
Most existing OWL rule-based reasoning methods are centralized and process static data sets of fixed size; limited by the centralized processing mechanism, they are inefficient on massive real-time data. In response to this growing demand, a few researchers have proposed their own RDF stream reasoning architectures. Barbieri D F et al. [1] proposed an incremental reasoning algorithm over streams and rich background knowledge that attaches expiration-time information to each RDF triple, performs reasoning when new data arrives, and expires explicit facts and deletes triples that are no longer valid. The IDRM algorithm [2] achieves efficient and scalable RDFS reasoning over incremental data. Chevalier J et al. [3] proposed an efficient incremental reasoner (Slider) that exploits the inherent characteristics of semantic data streams, yielding a scalable batch reasoner for streaming data. Ye Yixin et al. [4] proposed PRAS, a Spark-based parallel reasoning algorithm for streaming RDF data that incorporates a pseudo-bidirectional network.
The main concern of RDF stream reasoning is how to store and reason over incoming streaming data. The IDRM algorithm is modeled specifically on the RDFS rules, so its reasoning efficiency on the OWL Horst rules is low. Slider is designed only for RDFS rules, so it does not apply to the more complex OWL Horst rule reasoning. The PRAS algorithm reasons over streaming data through a pseudo-bidirectional network, but the communication overhead of that network is high, so it is inefficient on large volumes of streaming data.
The invention provides a streaming RDF data reasoning algorithm built on the Spark platform that addresses both problems: the storage of streaming data and the reasoning over it. To guarantee the completeness of the reasoning results, the design of the storage scheme for streaming data is the key point of the invention. The method stores large-scale RDF data in a Redis database cluster, combines it with the distributed stream computing framework Spark Streaming, and uses the MapReduce computation model of the platform to design and implement a distributed parallel reasoning scheme for streaming RDF data. This solves the problems that large volumes of streaming data cannot be reasoned over quickly and that the reasoning results are incomplete. The method is a useful reference for reasoning over massive data.
References:
[1] Barbieri D F, Braga D, Ceri S, et al. Incremental reasoning on streams and rich background knowledge[C]//Extended Semantic Web Conference. Springer Berlin Heidelberg, 2010: 1-15.
[2] Liu B, Wu L, Li J, et al. Exploiting Incremental Reasoning in Healthcare Based on Hadoop and Amazon Cloud[C]//Semantic Cities Workshop at AAAI Conference on Artificial Intelligence (AAAI'14). 2014.
[3] Chevalier J, Subercaze J, Gravier C, et al. Slider: an Efficient Incremental Reasoner[C]//Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015: 1081-1086.
[4] Ye Y, Wang J. Spark-based distributed parallel inference algorithm[J]. Computer Systems & Applications, 2017, 26(05): 97-104.
Disclosure of Invention
The invention aims to provide a streaming RDF data parallel reasoning algorithm based on Spark Streaming that reduces the number of MapReduce tasks and uses Spark to reason iteratively over streaming data; that designs a rule connection-variable relation table to store data together with the new data generated during reasoning, guaranteeing the completeness of the algorithm; and that designs a storage scheme for instance triples exploiting the characteristics of Redis to trade space for time, so that instance data can be read quickly.
To achieve this purpose, the technical scheme of the invention is a streaming RDF data parallel reasoning algorithm based on Spark Streaming, comprising the following steps:
Step S1: construct the rule connection-variable relation table corresponding to the OWL Horst inference rules; in the iterative parallel reasoning stage, acquire at fixed intervals the new batch of data in the stream together with the data generated by the previous round of reasoning as input, classify the input into schema data and instance data, and store each in the corresponding Redis cluster;
Step S2: determine from the rule connection-variable relation table which rules can be activated in the current round, and generate inferred data by combining them with the corresponding instance data;
Step S3: deduplicate and store the data generated by the current round of reasoning, and end the current iteration.
In step S1, the schema data is schema triple data and the instance data is instance triple data.
In an embodiment of the invention, instance triple data is stored in the Redis cluster as follows: in keeping with the key-value nature of the Redis cluster, the <key, value> form is used, with the subject S, predicate P, and object O of each triple each serving as a key; that is, the triple is stored as <S, (P, O)>, <P, (S, O)>, and <O, (S, P)> in three separate tables.
In an embodiment of the invention, schema triple data is stored in the Redis cluster as follows: a table Rulem_Table is generated for each OWL Horst inference rule and stored in Redis, with the rule serving as the table name. The rules are divided into two classes according to their number of connection variables: rules without connection variables and rules with connection variables;
Rules without connection variables: stored in Redis with P as the key and <S, O> as the value;
Rules with connection variables:
(1) for a rule with a single connection variable, P is the key in Redis and the value is a single item;
(2) for a complex rule with multiple connection variables, P is the key in Redis, and S and O are stored in the value as a Map of the form <S, <O, 0>>, <O, <S, 1>>, where 0 indicates that the map key is the subject and 1 that it is the object.
In an embodiment of the invention, step S2 is implemented as follows:
Step S21: traverse the rule connection-variable relation table and determine which rules can be activated;
Step S22: for each rule that can be activated, if its conclusion can be inferred directly without instance triple data, go to step S23; if instance triple data is required, use the connection variable of the instance triples that the rule needs as the key and look up the matching instance triples in the previously stored instance tables; if matching instance triples are found, go to step S23, otherwise repeat the check of step S22; once all data has been processed, the algorithm ends;
Step S23: execute the current rule to obtain the inferred conclusion, output each inferred triple <Si, Pj, Ok> to the set <Si, (Pj, Ok)>, and return to step S22.
In an embodiment of the invention, step S3 is implemented as follows:
Step S31: receive the new triple set generated by the reasoning of step S2; if the received data is empty, the algorithm ends;
Step S32: traverse the received triple set and remove duplicate triples from it;
Step S33: store the deduplicated triple set in the Redis cluster under the set name itr_data, to be read in the next round of reasoning.
Compared with the prior art, the invention has the following beneficial effects:
1. the number of MapReduce tasks is reduced, and Spark is used to reason iteratively over streaming data;
2. a rule connection-variable relation table is designed to store data together with the new data generated during reasoning, which guarantees the completeness of the algorithm;
3. a storage scheme for instance triples is designed that exploits the characteristics of Redis to trade space for time, so that instance data can be read quickly.
Drawings
FIG. 1 is a general framework diagram of the algorithm of the present invention.
FIG. 2 shows the OWL Horst rules used in the present invention.
Detailed Description
The technical scheme of the invention is explained in detail below with reference to the accompanying drawings.
The invention provides a streaming RDF data parallel reasoning algorithm based on Spark Streaming, comprising the following steps:
Step S1: construct the rule connection-variable relation table corresponding to the OWL Horst inference rules; in the iterative parallel reasoning stage, acquire at fixed intervals the new batch of data in the stream together with the data generated by the previous round of reasoning as input, classify the input into schema data and instance data, and store each in the corresponding Redis cluster;
Step S2: determine from the rule connection-variable relation table which rules can be activated in the current round, and generate inferred data by combining them with the corresponding instance data;
Step S3: deduplicate and store the data generated by the current round of reasoning, and end the current iteration.
In step S1, the schema data is schema triple data and the instance data is instance triple data.
Instance triple data is stored in the Redis cluster as follows: in keeping with the key-value nature of the Redis cluster, the <key, value> form is used, with the subject S, predicate P, and object O of each triple each serving as a key; that is, the triple is stored as <S, (P, O)>, <P, (S, O)>, and <O, (S, P)> in three separate tables.
Schema triple data is stored in the Redis cluster as follows: a table Rulem_Table is generated for each OWL Horst inference rule and stored in Redis, with the rule serving as the table name. The rules are divided into two classes according to their number of connection variables: rules without connection variables and rules with connection variables;
Rules without connection variables: stored in Redis with P as the key and <S, O> as the value;
Rules with connection variables:
(1) for a rule with a single connection variable, P is the key in Redis and the value is a single item;
(2) for a complex rule with multiple connection variables, P is the key in Redis, and S and O are stored in the value as a Map of the form <S, <O, 0>>, <O, <S, 1>>, where 0 indicates that the map key is the subject and 1 that it is the object.
Step S2 is implemented as follows:
Step S21: traverse the rule connection-variable relation table and determine which rules can be activated;
Step S22: for each rule that can be activated, if its conclusion can be inferred directly without instance triple data, go to step S23; if instance triple data is required, use the connection variable of the instance triples that the rule needs as the key and look up the matching instance triples in the previously stored instance tables; if matching instance triples are found, go to step S23, otherwise repeat the check of step S22; once all data has been processed, the algorithm ends;
Step S23: execute the current rule to obtain the inferred conclusion, output each inferred triple <Si, Pj, Ok> to the set <Si, (Pj, Ok)>, and return to step S22.
Step S3 is implemented as follows:
Step S31: receive the new triple set generated by the reasoning of step S2; if the received data is empty, the algorithm ends;
Step S32: traverse the received triple set and remove duplicate triples from it;
Step S33: store the deduplicated triple set in the Redis cluster under the set name itr_data, to be read in the next round of reasoning.
A specific implementation of the invention follows.
The streaming RDF data parallel reasoning algorithm based on Spark Streaming (the PSRH algorithm) has two main stages: construction of the rule connection-variable relation table, and iterative parallel reasoning. Iterative parallel reasoning in turn consists of streaming data classification and OWL Horst rule reasoning. First, a rule connection-variable relation table is constructed from the OWL Horst inference rules. In the iterative parallel reasoning stage, the new batch of data in the stream and the data generated by the previous round of reasoning are acquired at fixed intervals as input; the input is classified into schema data and instance data, and each is stored in the corresponding Redis cluster. Next, the rules that can be activated in the current round are determined from the rule connection-variable relation table, and inferred data is generated by combining them with the corresponding instance data. Finally, the data generated by the current round is deduplicated and stored, and the iteration ends. The overall framework of the PSRH algorithm is shown in FIG. 1, and the algorithm proceeds as follows.
1. RDF streaming data storage
The PSRH algorithm builds its reasoning model from the OWL Horst rules (shown in FIG. 2) and the RDF ontology data, in keeping with the characteristics of the Redis cluster. It then acquires triple data in real time through the Spark Streaming framework and separates schema triples from instance triples.
1.1 Instance triple storage design
Because the instance data is very large, and the connection variable that a given rule uses during reasoning may be any of the subject, predicate, or object of an instance triple, looking up instance triples can be slow. In keeping with the key-value nature of the Redis cluster, the algorithm therefore uses the <key, value> form, taking the subject, predicate, and object of each triple in turn as the key and storing the triple as <S, (P, O)>, <P, (S, O)>, and <O, (S, P)> in three tables. Whichever of the subject, predicate, or object is used to search the instance tables, the lookup by the corresponding key takes O(1) time in the Redis cluster, which trades space for time.
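The three-table layout can be sketched as follows. This is a minimal illustration, not the patented implementation: plain Python dictionaries stand in for the Redis cluster, and the class name InstanceStore and the example IRIs are assumptions introduced here.

```python
from collections import defaultdict


class InstanceStore:
    """In-memory stand-in (assumption) for the three instance tables in Redis."""

    def __init__(self):
        self.s_table = defaultdict(set)  # S -> {(P, O)}
        self.p_table = defaultdict(set)  # P -> {(S, O)}
        self.o_table = defaultdict(set)  # O -> {(S, P)}

    def add(self, s, p, o):
        # Each triple is written three times: space is spent to save lookup time.
        self.s_table[s].add((p, o))
        self.p_table[p].add((s, o))
        self.o_table[o].add((s, p))


store = InstanceStore()
store.add("ex:alice", "ex:knows", "ex:bob")
store.add("ex:alice", "rdf:type", "ex:Person")

# Any position of the triple can serve as the lookup key.
print(store.s_table["ex:alice"])
print(store.p_table["ex:knows"])
print(store.o_table["ex:bob"])
```

Each lookup is a single hash access regardless of which position the connection variable occupies, at the cost of storing every triple three times.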
1.2 Schema triple storage design
For schema triples, a rule connection-variable relation table is designed for storage.
The antecedents of an OWL rule must be joined through shared variables to produce new triples. Because the algorithm processes a stream, the schema triples cannot be loaded all at once as static data. The algorithm therefore builds connection-variable tables from the schema triples to record rules whose antecedents are only partly satisfied, so that when new data next arrives at a node, reasoning left incomplete in the previous round can continue from the table contents.
A corresponding table (Rulem_Table) is generated for each OWL Horst rule and stored in Redis, with the rule serving as the table name. The rules fall into two classes according to their number of connection variables: rules without connection variables and rules with connection variables.
A different table layout is constructed for each class:
(1) Rules without connection variables
In the antecedent <S, P, O> of a rule without connection variables, P mostly denotes a transitive, equivalence, or inverse relation, so the triple is stored in Redis with P as the key and <S, O> as the value. Take OWL Horst rule 12a (v owl:equivalentClass w => v rdfs:subClassOf w) as an example, as shown in Table 1-1:
Table 1-1. Storage table structure of OWL Horst rule 12a
Table name       Key row (key)            Column value (value)
Rule12a_Table    owl:equivalentClass      <v,w>
The table structures of the other rules without connection variables are given in Table 1-2.
Table 1-2. Storage table structure of OWL Horst rules 12b, 13a, and 13b
Table name       Key row (key)            Column value (value)
Rule12b_Table    owl:equivalentClass      <v,w>
Rule13a_Table    owl:equivalentProperty   <v,w>
Rule13b_Table    owl:equivalentProperty   <v,w>
(2) Rules with connection variables
For a rule with a single connection variable, P is the key in Redis and the value is a single item. Take OWL Horst rule 3 (p rdf:type owl:SymmetricProperty, v p u => u p v) as an example, where the connection variable is p, as shown in Table 1-3:
Table 1-3. Storage table structure of OWL Horst rule 3
Table name     Key row (key)            Column value (value)
Rule3_Table    owl:SymmetricProperty    p
The example above indicates that the streaming data contains a triple in which the connection variable p has type owl:SymmetricProperty.
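As a rough sketch of how a single-connection-variable table might be used, the function below applies rule 3 by reading p from Rule3_Table and looking up the matching instance triples by predicate. Dictionaries stand in for Redis, and the function name apply_rule3 and the example data are assumptions made for illustration.

```python
def apply_rule3(rule3_table, p_table):
    """Rule 3: p rdf:type owl:SymmetricProperty, v p u => u p v.

    rule3_table: {"owl:SymmetricProperty": {p, ...}}  (schema side)
    p_table:     {P: {(S, O), ...}}                   (instance table keyed by predicate)
    """
    derived = set()
    for p in rule3_table.get("owl:SymmetricProperty", ()):
        # The connection variable p is the key into the predicate-indexed table.
        for v, u in p_table.get(p, ()):
            derived.add((u, p, v))
    return derived


rule3_table = {"owl:SymmetricProperty": {"ex:marriedTo"}}
p_table = {
    "ex:marriedTo": {("ex:ann", "ex:bo")},
    "ex:knows": {("ex:ann", "ex:cy")},  # not symmetric, so never touched
}
print(apply_rule3(rule3_table, p_table))
```

Only predicates recorded in the rule table are looked up, so instance triples with unrelated predicates are not scanned.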
For a complex rule with multiple connection variables, the nature of streaming data means that all schema triples related to the rule cannot simply be stored in a single <key, value> pair. So that no schema triple from any batch of the stream is missed, and to make later joins convenient, the P of each antecedent schema triple <S, P, O> is stored as the key, and S and O are stored in the value as a Map of the form <S, <O, 0>>, <O, <S, 1>>, where 0 indicates that the map key is the subject and 1 that it is the object. This layout ensures that a connection variable can be found by key in O(1) time whether it appears as the subject or as the object.
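The Map-type value can be pictured with the following sketch. It is an illustration under stated assumptions: nested dictionaries stand in for a Redis table, each map entry holds a set so that several schema triples can share a map key (the source describes a single pair per key), and the name store_schema_triple is hypothetical.

```python
def store_schema_triple(rule_table, s, p, o):
    """Index schema triple <s, p, o> under key p, reachable from either end."""
    entry = rule_table.setdefault(p, {})
    entry.setdefault(s, set()).add((o, 0))  # 0: the map key is the subject
    entry.setdefault(o, set()).add((s, 1))  # 1: the map key is the object
    return rule_table


rule16_table = {}
store_schema_triple(rule16_table, "ex:DogOnly", "owl:allValuesFrom", "ex:Dog")
store_schema_triple(rule16_table, "ex:DogOnly", "owl:onProperty", "ex:hasPet")

# The same triple is found whether the connection variable is its subject or object.
print(rule16_table["owl:allValuesFrom"]["ex:DogOnly"])
print(rule16_table["owl:allValuesFrom"]["ex:Dog"])
```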
Meanwhile, the connection variables contained in the schema antecedents of the current rule that have already been fully matched are stored in the table in the form <LinkVar, <a, b, …>>. During reasoning, the value for the LinkVar key can therefore be looked up first; if matching instance triples are found in the instance tables from the LinkVar values, the result can be inferred directly. Take OWL Horst rule 16 (v owl:allValuesFrom u, v owl:onProperty p, w rdf:type v, w p x => x rdf:type u) as an example, where the connection variables are v, p, and w; its storage table is shown in Table 1-4:
Table 1-4. Storage table structure of OWL Horst rule 16
[Table 1-4 is provided as an image in the original document.]
If the value of any of the four row keys in Rule16_Table is empty, the schema antecedents that the rule requires are incomplete and the rule cannot fire. Skipping the lookup of instance data in that case saves time and improves reasoning efficiency.
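The activation check described above might look like the sketch below. Because Table 1-4 is not reproduced in the text, the four row-key names used here are purely hypothetical placeholders, as is the function name can_fire; only the "any empty value blocks the rule" logic comes from the description.

```python
def can_fire(rule_table, required_keys):
    """Return True only if every required row key holds a non-empty value."""
    return all(rule_table.get(key) for key in required_keys)


# Hypothetical row keys; the real ones are defined by Table 1-4.
RULE16_KEYS = ("owl:allValuesFrom", "owl:onProperty", "LinkVar_v", "LinkVar_p")

incomplete = {
    "owl:allValuesFrom": {"ex:DogOnly": "ex:Dog"},
    "owl:onProperty": {},  # antecedent not seen yet in the stream
    "LinkVar_v": set(),
    "LinkVar_p": set(),
}
complete = {
    "owl:allValuesFrom": {"ex:DogOnly": "ex:Dog"},
    "owl:onProperty": {"ex:DogOnly": "ex:hasPet"},
    "LinkVar_v": {"ex:DogOnly"},
    "LinkVar_p": {"ex:hasPet"},
}
print(can_fire(incomplete, RULE16_KEYS), can_fire(complete, RULE16_KEYS))
```

The check touches only the rule table, so an inactive rule costs no instance-table lookups.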
The table structures of the other rules with connection variables are given in Tables 1-5 to 1-14.
Table 1-5. Storage table structure of OWL Horst rule 1
Table name       Key row (key)                    Column value (value)
Rule1_Table      owl:FunctionalProperty           p
Table 1-6. Storage table structure of OWL Horst rule 2
Table name       Key row (key)                    Column value (value)
Rule2_Table      owl:InverseFunctionalProperty    p
Table 1-7. Storage table structure of OWL Horst rule 4
Table name       Key row (key)                    Column value (value)
Rule4_Table      owl:TransitiveProperty           p
Table 1-8. Storage table structure of OWL Horst rule 8a
Table name       Key row (key)                    Column value (value)
Rule8a_Table     p                                q
Table 1-9. Storage table structure of OWL Horst rule 8b
Table name       Key row (key)                    Column value (value)
Rule8b_Table     p                                q
Table 1-10. Storage table structure of OWL Horst rule 12c
Table name       Key row (key)                    Column value (value)
Rule12c_Table    v                                w
Table 1-11. Storage table structure of OWL Horst rule 13c
Table name       Key row (key)                    Column value (value)
Rule13c_Table    v                                w
Table 1-12. Storage table structure of OWL Horst rule 14a
[Table 1-12 is provided as an image in the original document.]
Table 1-13. Storage table structure of OWL Horst rule 14b
[Table 1-13 is provided as an image in the original document.]
Table 1-14. Storage table structure of OWL Horst rule 15
[Table 1-14 is provided as an image in the original document.]
1.3 owl:sameAs-related triple storage design
For triples whose predicate is owl:sameAs, the subject (or object) may be either schema data or instance data, so this section designs separate rule connection-variable relation tables for the rules that involve owl:sameAs, as shown in Tables 1-15 to 1-19:
Table 1-15. Storage table structure of OWL Horst rule 6
Table name     Key row (key)    Column value (value)
Rule6_Table    owl:sameAs       <v,w>
Table 1-16. Storage table structure of OWL Horst rule 7
Table name     Key row (key)    Column value (value)
Rule7_Table    v                w
Table 1-17. Storage table structure of OWL Horst rule 9
[Table 1-17 is provided as an image in the original document.]
Table 1-18. Storage table structure of OWL Horst rule 10
[Table 1-18 is provided as an image in the original document.]
Table 1-19. Storage table structure of OWL Horst rule 11
[Table 1-19 is provided as an image in the original document.]
1.4 Streaming data storage implementation
This stage performs the classified storage of data in parallel. The batch of stream data new_data and the data itr_data generated by the previous round of reasoning are acquired at fixed intervals. Each triple in new_data or itr_data is then examined: an instance triple is stored directly according to the design in Section 1.1; a schema triple is matched against the inference rules it belongs to and stored according to the rule connection-variable relation table design in Section 1.2.
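The classification step can be sketched as below. The source does not spell out how a schema triple is recognised, so the test used here (membership of the predicate, or of the rdf:type object, in a small set of OWL/RDFS vocabulary terms) is an assumption, as are the names SCHEMA_PREDICATES, SCHEMA_TYPES, and classify. The special handling of owl:sameAs from Section 1.3 is left out of this sketch.

```python
# Assumed vocabulary that marks a triple as schema data (not taken from the source).
SCHEMA_PREDICATES = {
    "rdfs:subClassOf", "rdfs:subPropertyOf", "owl:equivalentClass",
    "owl:equivalentProperty", "owl:inverseOf", "owl:onProperty",
    "owl:allValuesFrom", "owl:someValuesFrom", "owl:hasValue",
}
SCHEMA_TYPES = {
    "owl:FunctionalProperty", "owl:InverseFunctionalProperty",
    "owl:SymmetricProperty", "owl:TransitiveProperty",
}


def is_schema_triple(s, p, o):
    return p in SCHEMA_PREDICATES or (p == "rdf:type" and o in SCHEMA_TYPES)


def classify(batch):
    """Split one micro-batch into (schema triples, instance triples)."""
    schema, instance = [], []
    for triple in batch:
        (schema if is_schema_triple(*triple) else instance).append(triple)
    return schema, instance


new_data = [
    ("ex:marriedTo", "rdf:type", "owl:SymmetricProperty"),
    ("ex:ann", "ex:marriedTo", "ex:bo"),
]
itr_data = [("ex:Dog", "rdfs:subClassOf", "ex:Animal")]

# The input of each round is the new batch plus the previous round's output.
schema, instance = classify(new_data + itr_data)
print(schema)
print(instance)
```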
Algorithm 1. Parallel data storage algorithm parallelstoreForHorst
Input: streaming triple data (new_data); new triple data generated by the previous round of reasoning (itr_data)
Output: none
Taking rule 6 in Section 1.3 (v owl:sameAs w => w owl:sameAs v) and rule 16 in Section 1.2 (v owl:allValuesFrom u, v owl:onProperty p, w rdf:type v, w p x => x rdf:type u) as examples, the pseudo-code of this stage is as follows:
[The pseudo-code is provided as an image in the original document.]
Because the LinkVar is generated according to each specific rule, the LinkVar acquisition algorithms for the rules that need one are given below:
LinkVar acquisition algorithm for rule 11
[The pseudo-code is provided as an image in the original document.]
LinkVar acquisition algorithm for rule 14a
[The pseudo-code is provided as an image in the original document.]
LinkVar acquisition algorithm for rule 15
[The pseudo-code is provided as an image in the original document.]
LinkVar acquisition algorithm for rule 16
[The pseudo-code is provided as an image in the original document.]
2. Parallel reasoning stage
2.1 Map stage: data reasoning
The Map stage performs the data reasoning. The steps are as follows:
Step 1: traverse the rule connection-variable relation table and determine which rules can be activated.
Step 2: for each rule that can be activated, if its antecedents allow a conclusion to be inferred directly without instance triples, go to Step 3; if instance triples are required, use the connection variable of the instance triples that the rule needs as the key and look up the matching instance triples in the previously stored instance tables; if matching instance triples are found, go to Step 3, otherwise repeat the check of Step 2. Once all data has been processed, the algorithm ends.
Step 3: execute the current rule to obtain the inferred conclusion, output each inferred triple <Si, Pj, Ok> to the set <Si, (Pj, Ok)>, and return to Step 2.
Algorithm 2. Data reasoning algorithm ParallelReasoningForHorst
Input: rule connection-variable relation tables (Rulem_Table); instance triple tables (S_Table, P_Table, O_Table)
Output: new triples generated by reasoning
The overall pseudo-code of the algorithm is as follows:
[The pseudo-code is provided as an image in the original document.]
Taking rule 6 in Section 1.3 (v owl:sameAs w => w owl:sameAs v) as an example, the pseudo-code is as follows:
[The pseudo-code is provided as an image in the original document.]
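Since the original pseudo-code is not available in the text, the following is only a sketch of what reasoning for rule 6 might look like, based on the Rule6_Table layout in Table 1-15 (key owl:sameAs, value <v,w>). A dictionary stands in for Redis and the name apply_rule6 is an assumption. The output uses the <Si, (Pj, Ok)> form described for the Map stage.

```python
def apply_rule6(rule6_table):
    """Rule 6: v owl:sameAs w => w owl:sameAs v.

    rule6_table: {"owl:sameAs": {(v, w), ...}}
    Returns Map-stage output pairs of the form (S, (P, O)).
    """
    pairs = rule6_table.get("owl:sameAs", ())
    # No instance-table lookup is needed: the conclusion follows from the pair itself.
    return {(w, ("owl:sameAs", v)) for v, w in pairs}


rule6_table = {"owl:sameAs": {("ex:NYC", "ex:NewYorkCity")}}
print(apply_rule6(rule6_table))
```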
The above is the reasoning pseudo-code for a rule without connection variables. Rules with connection variables are described next. Taking rule 16 in Section 1.2 (v owl:allValuesFrom u, v owl:onProperty p, w rdf:type v, w p x => x rdf:type u) as an example, the pseudo-code is as follows (in the pseudo-code, s16 is defined as an object of Set_16):
[The pseudo-code is provided as an image in the original document.]
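As the original pseudo-code is an image, the sketch below shows one plausible way to evaluate rule 16 with key-based lookups only. It is an illustration under assumptions: the two schema antecedents are held in simplified dictionaries keyed by v rather than in the exact Rule16_Table layout, dictionaries stand in for Redis, and the function name apply_rule16 and the example data are hypothetical.

```python
def apply_rule16(all_values_from, on_property, s_table, o_table):
    """Rule 16: v owl:allValuesFrom u, v owl:onProperty p,
                w rdf:type v, w p x => x rdf:type u.

    all_values_from: {v: {u, ...}}      on_property: {v: {p, ...}}
    s_table: {S: {(P, O), ...}}         o_table: {O: {(S, P), ...}}
    """
    derived = set()
    # Only restrictions v with both schema antecedents present can fire.
    for v in all_values_from.keys() & on_property.keys():
        # w rdf:type v: look up v as an object in the O-indexed instance table.
        for w, pred in o_table.get(v, ()):
            if pred != "rdf:type":
                continue
            # w p x: look up w as a subject in the S-indexed instance table.
            for p in on_property[v]:
                for pred2, x in s_table.get(w, ()):
                    if pred2 == p:
                        derived.update((x, "rdf:type", u) for u in all_values_from[v])
    return derived


all_values_from = {"ex:DogOnly": {"ex:Dog"}}
on_property = {"ex:DogOnly": {"ex:hasPet"}}
s_table = {"ex:alice": {("rdf:type", "ex:DogOnly"), ("ex:hasPet", "ex:rex")}}
o_table = {
    "ex:DogOnly": {("ex:alice", "rdf:type")},
    "ex:rex": {("ex:alice", "ex:hasPet")},
}
print(apply_rule16(all_values_from, on_property, s_table, o_table))
```

Every join step is a dictionary access keyed by a connection variable (v, then w), which mirrors the O(1) lookups that the three instance tables are meant to provide.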
For multi-connection-variable rules such as rule 16, storing the schema triples in <key, value> form in the Redis cluster allows connection variables to be matched quickly. For the associated instance triples, the instance storage strategy in Redis is used to find the relevant triples from the connection-variable values. Because a Redis lookup by key takes O(1) time, reasoning efficiency is greatly improved.
2.2 Reduce stage: deduplication and storage
The Reduce stage saves the data generated by reasoning. The inferred triples are deduplicated and saved in a set named "itr_data" in the Redis cluster, and the "itr_data" set then becomes part of the input to the next round of reasoning. The steps of the deduplication and storage algorithm are as follows:
Step 1: receive the new triple set (containing both schema triples and instance triples) generated by the Map stage; if the received data is empty, the algorithm ends;
Step 2: traverse the received triple set and remove duplicate triples from it;
Step 3: save the deduplicated triple set in the Redis cluster under the set name itr_data, to be read in the next round of reasoning.
Algorithm 3. Reduce algorithm (DuplicateRemovalForHorst)
Input: the set <Si, (Pj, Ok)>
Output: itr_data
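The three Reduce steps can be sketched as follows, with a Python set standing in for the Redis set named itr_data. The function name reduce_dedup is an assumption. The optional known argument, which also drops triples that were already stored in earlier rounds, is not stated in the source; it is added here as an assumption because it lets a symmetric rule such as rule 6 eventually produce an empty result and so end the iteration.

```python
def reduce_dedup(map_output, known=frozenset()):
    """Deduplicate Map-stage output pairs (S, (P, O)) into the itr_data set.

    known: triples already stored in earlier rounds (assumption, see lead-in).
    """
    itr_data = set()
    for s, (p, o) in map_output:
        triple = (s, p, o)
        if triple not in known:
            itr_data.add(triple)  # a set silently drops repeats within the batch
    return itr_data


map_output = [
    ("ex:b", ("owl:sameAs", "ex:a")),
    ("ex:b", ("owl:sameAs", "ex:a")),      # duplicate from another rule firing
    ("ex:rex", ("rdf:type", "ex:Dog")),
]
known = {("ex:rex", "rdf:type", "ex:Dog")}

itr_data = reduce_dedup(map_output, known)
print(itr_data)

# An empty result corresponds to the "received data is empty" stop condition.
print(reduce_dedup([], known))
```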
The algorithm reduces the number of MapReduce tasks and uses Spark to reason iteratively over streaming data; the rule connection-variable relation table is designed to store data together with the new data generated during reasoning, which guarantees the completeness of the algorithm; and a storage scheme for instance triples is designed that exploits the characteristics of Redis to trade space for time, so that instance data can be read quickly.
The above are preferred embodiments of the invention. All changes made according to the technical scheme of the invention whose functional effects do not exceed the scope of that scheme fall within the protection scope of the invention.

Claims (3)

1. A Streaming RDF data parallel reasoning algorithm based on Spark Streaming, characterized by comprising the following steps:
step S1, constructing a corresponding rule connection variable relation table in combination with the OWL Horst inference rules; in the iterative parallel inference stage, periodically acquiring the batch of new data in the Streaming data stream together with the data generated by the previous inference as input data, classifying the input into schema data and instance data, and storing them in the corresponding Redis clusters;
step S2, determining, according to the rule connection variable relation table, which rules can be activated, and generating inference data in combination with the corresponding instance data;
step S3, removing the duplicate data generated by the current inference, storing the result, and ending the current iteration of inference;
in step S1, the schema data is schema triple data, and the instance data is instance triple data;
the instance triple data is stored in the Redis cluster as follows: according to the characteristics of the Redis cluster, a <key, value> form is adopted, with the subject S, the predicate P and the object O of a triple each used as a key, that is, the data is stored in three tables in the forms <S, (P, O)>, <P, (S, O)> and <O, (S, P)>;
the schema triple data is stored in the Redis cluster as follows: each rule of the OWL Horst inference rules corresponds to a table Rulem_Table, which is stored in Redis with the rule as the table name; the rules are divided into 2 types according to the number of connection variables of each rule: rules without connection variables and rules with connection variables;
rules without connection variables: stored in Redis with P as the key and <S, O> as the value;
rules with connection variables:
(1) for a rule with a single connection variable, the storage in Redis takes P as the key, and only one key takes a value;
(2) for a complex rule with a plurality of connection variables, the storage in Redis takes P as the key, and S and O are stored in the value as a Map of the form <S, <O, 0>>, <O, <S, 1>>, where 0 indicates that the key is the subject and 1 indicates that the key is the object.
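The storage layout described in claim 1 can be illustrated with a small sketch. It uses plain Python dictionaries as a hypothetical in-memory stand-in for the Redis tables; the class name `InstanceStore`, the helper `multi_variable_rule_entry` and the sample triples are assumptions introduced for illustration only.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Tuple

Triple = Tuple[str, str, str]
Index = DefaultDict[str, List[Tuple[str, str]]]


class InstanceStore:
    """Three key-value tables, one per triple position (stand-in for Redis)."""

    def __init__(self) -> None:
        self.by_subject: Index = defaultdict(list)    # <S, (P, O)>
        self.by_predicate: Index = defaultdict(list)  # <P, (S, O)>
        self.by_object: Index = defaultdict(list)     # <O, (S, P)>

    def add(self, triple: Triple) -> None:
        s, p, o = triple
        # Each triple is written three times: more space, faster lookups.
        self.by_subject[s].append((p, o))
        self.by_predicate[p].append((s, o))
        self.by_object[o].append((s, p))


def multi_variable_rule_entry(schema_triple: Triple) -> Dict[str, Dict[str, Dict[str, int]]]:
    """Value stored under key P for a rule with several connection variables.

    Shape: {P: {S: {O: 0}, O: {S: 1}}}; 0 marks an inner key that is the
    subject, 1 marks one that is the object. Assumes S and O differ.
    """
    s, p, o = schema_triple
    return {p: {s: {o: 0}, o: {s: 1}}}


if __name__ == "__main__":
    store = InstanceStore()
    store.add(("alice", "rdf:type", "Student"))
    print(store.by_object["Student"])
    print(multi_variable_rule_entry(("Student", "rdfs:subClassOf", "Person")))
```

Indexing every instance triple by subject, predicate and object means that whichever position a rule's connection variable occupies, the matching triples are found with a single key lookup.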
2. The Streaming RDF data parallel reasoning algorithm based on Spark Streaming according to claim 1, wherein step S2 is implemented as follows:
step S21, traversing the rule connection variable relation table and determining the rules that can be activated;
step S22, for a rule that can be activated, if the conclusion can be inferred directly without instance triple data, jumping to step S23; if instance triple data needs to be combined, using the connection variable of the instance triple data required by the rule as a key to look up the corresponding instance triple data in the previously stored instance tables; if the corresponding instance triple data is found, proceeding to step S23, otherwise repeating the determination of step S22; if all the data has been processed, ending the algorithm;
and step S23, executing the current rule inference to obtain an inference conclusion, outputting the triple <Si, Pj, Ok> generated by the inference to the set <Si, (Pj, Ok)>, and jumping to step S22.
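Steps S21 to S23 can be sketched for a single rule. The rule chosen here, (C rdfs:subClassOf D) and (x rdf:type C) imply (x rdf:type D), is only an illustrative example of a rule that needs instance data; the function name `infer_subclass_membership` and the dictionary standing in for the <O, (S, P)> instance table are assumptions, not names from the patent.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]
Pair = Tuple[str, Tuple[str, str]]  # output form <Si, (Pj, Ok)>


def infer_subclass_membership(
    schema_triples: List[Triple],
    instances_by_object: Dict[str, List[Tuple[str, str]]],
) -> List[Pair]:
    """Apply one illustrative rule: (C subClassOf D), (x type C) => (x type D)."""
    output: List[Pair] = []
    for c, p, d in schema_triples:
        # S21: only subClassOf schema triples can activate this rule.
        if p != "rdfs:subClassOf":
            continue
        # S22: the connection variable C is the key into the <O, (S, P)> table;
        # if no instance triple is found, move on to the next candidate.
        for x, inst_p in instances_by_object.get(c, []):
            if inst_p != "rdf:type":
                continue
            # S23: emit the inferred triple in the <Si, (Pj, Ok)> form.
            output.append((x, ("rdf:type", d)))
    return output


if __name__ == "__main__":
    schema = [("Student", "rdfs:subClassOf", "Person")]
    instances = {"Student": [("alice", "rdf:type")]}
    print(infer_subclass_membership(schema, instances))
```

In this sketch the lookup key is the subject of the schema triple, which is why the object-keyed instance table is consulted; other rules would use the subject-keyed or predicate-keyed table instead.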
3. The Streaming RDF data parallel reasoning algorithm based on Spark Streaming according to claim 2, wherein step S3 is implemented as follows:
step S31, receiving the new triple set generated by the inference in step S2, and ending the algorithm if the received data is empty;
step S32, traversing the received new triple set and removing duplicate triples from it;
and step S33, storing the deduplicated triple set in the Redis cluster under the set name itr_data, to be read in the next inference.
CN201810521793.XA 2018-05-28 2018-05-28 Streaming RDF data parallel reasoning algorithm based on Spark Streaming Active CN108763451B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810521793.XA CN108763451B (en) 2018-05-28 2018-05-28 Streaming RDF data parallel reasoning algorithm based on Spark Streaming

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810521793.XA CN108763451B (en) 2018-05-28 2018-05-28 Streaming RDF data parallel reasoning algorithm based on Spark Streaming

Publications (2)

Publication Number Publication Date
CN108763451A CN108763451A (en) 2018-11-06
CN108763451B true CN108763451B (en) 2022-03-11

Family

ID=64006259

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810521793.XA Active CN108763451B (en) 2018-05-28 2018-05-28 Streaming RDF data parallel reasoning algorithm based on Spark Streaming

Country Status (1)

Country Link
CN (1) CN108763451B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104778277A (en) * 2015-04-30 2015-07-15 福州大学 RDF data distributed storage and querying method based on Redis
CN105912721A (en) * 2016-05-05 2016-08-31 福州大学 RDF data distributed semantic parallel reasoning method
CN106874425A (en) * 2017-01-23 2017-06-20 福州大学 Real time critical word approximate search algorithm based on Storm
CN106980901A (en) * 2017-04-15 2017-07-25 福州大学 Streaming RDF data parallel reasoning algorithm

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6594691B1 (en) * 1999-10-28 2003-07-15 Surfnet Media Group, Inc. Method and system for adding function to a web page

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
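Cichlid: Efficient Large Scale RDFS/OWL Reasoning with Spark;Rong Gu et al.;《2015 IEEE 29th International Parallel and Distributed Processing Symposium》;20151231;full text *
Parallel reasoning algorithm for OWL semantic rules based on Spark;赵慧含 et al.;《计算机应用研究》;20180430;Vol. 35 (No. 4);full text *
Distributed parallel reasoning algorithm based on Spark;叶怡新 et al.;《计算机系统应用》;20171231;Vol. 26 (No. 5);pp. 97-104, Sections 1-2 of the main text *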

Also Published As

Publication number Publication date
CN108763451A (en) 2018-11-06

Similar Documents

Publication Publication Date Title
Zerhari et al. Big data clustering: Algorithms and challenges
Myung et al. SPARQL basic graph pattern processing with iterative MapReduce
Sun et al. Scalable RDF store based on HBase and MapReduce
Gao et al. Relational approach for shortest path discovery over large graphs
CN106021457B (en) RDF distributed semantic searching method based on keyword
CN103116625A (en) Massive RDF data distributed query processing method based on Hadoop
CN106874425B (en) Storm-based real-time keyword approximate search algorithm
CN109241355A (en) Accessibility querying method, system and the readable storage medium storing program for executing of directed acyclic graph
Zeng et al. Redesign of the gStore system
CN105912721B (en) RDF data distributed semantic parallel inference method
CN104504018A (en) Top-down real-time big data query optimization method based on bushy tree
CN105550332A (en) Dual-layer index structure based origin graph query method
CN104933143A (en) Method and device for acquiring recommended object
Sowkuntla et al. MapReduce based parallel attribute reduction in Incomplete Decision Systems
Sridhar et al. RAPID: Enabling scalable ad-hoc analytics on the semantic web
CN108763451B (en) Streaming RDF data parallel reasoning algorithm based on Spark Streaming
CN112148830A (en) Semantic data storage and retrieval method and device based on maximum area grid
Shibla et al. Improving efficiency of DBSCAN by parallelizing kd-tree using spark
CN116383247A (en) Large-scale graph data efficient query method
CN106980901B (en) Streaming RDF data parallel reasoning algorithm
Bai et al. An integration approach of multi-source heterogeneous fuzzy spatiotemporal data based on RDF
Shou-Qiang et al. Research and design of hybrid collaborative filtering algorithm scalability reform based on genetic algorithm optimization
Ravindra et al. To nest or not to nest, when and how much: Representing intermediate results of graph pattern queries in mapreduce based processing
Huang et al. l-skydiv query: Effectively improve the usefulness of skylines
Hashem et al. A review of modeling toolbox for BigData

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant