CN106980901A - Streaming RDF data parallel reasoning algorithm

Info

Publication number: CN106980901A
Application number: CN201710246309.2A
Authority: CN
Inventors: 汪璟玢; 叶怡新
Original assignee: Fuzhou University
Current assignee: Fuzhou University
Priority date: 2017-04-15
Filing date: 2017-04-15
Publication date: 2017-07-25
Anticipated expiration: 2037-04-15
Also published as: CN106980901B

Abstract

The invention provides a parallel reasoning algorithm for streaming RDF data. A pseudo-bidirectional network of rules is constructed, and an intermediate node is established wherever a rule node contains a connection variable of a class. At regular intervals, the batch of new data in the Streaming data stream and the data generated by the previous round of reasoning are acquired as input data; the input data are classified into existing nodes, or new corresponding nodes are created, and stored in the corresponding Redis cluster. For each input triple, the pseudo-bidirectional network is used to judge whether all antecedents monitored by the corresponding intermediate node or rule node are satisfied; if so, the rule is applied to generate inference data. By deleting duplicate inference data in real time and saving all data generated by the current round into the Redis cluster as the input for the next round, parallel streaming inference over RDF data under the OWL Horst rules is realized completely and efficiently.

Description

Streaming RDF Data Parallel Reasoning Algorithm

Technical field

The invention belongs to the technical field of the Semantic Web, and specifically relates to a parallel reasoning algorithm for streaming RDF data.

Background

In recent years, researchers have come to recognize the importance of parallel reasoning algorithms for real-time streaming data, but few algorithms have been proposed in this field and further research is needed. Reasoning is also widely studied in intelligent technology, for example in knowledge discovery and case-based reasoning. Solving large-scale RDF streaming data problems through distributed parallel computing has become a consensus in academia and industry.

Streaming parallel reasoning over RDFS/OWL is a relatively new field. Barbieri D. F. et al. proposed an incremental reasoning algorithm based on streaming data and rich background knowledge. The algorithm attaches an expiration time to each RDF triple; when new streaming data arrives, it reasons over the new data, expires explicit facts, and deletes invalid triples. The IDRM algorithm performs RDFS reasoning over incremental data efficiently and scalably, but because it models the RDFS rules specially, it is inefficient for OWL Horst rule reasoning. Chevalier J. et al. proposed an efficient incremental reasoner (Slider), which reasons by exploiting the intrinsic features of semantic data streams and thereby provides a scalable batch reasoner for streaming data. However, since Slider is designed only for RDFS rules, it is not suitable for the more complex OWL Horst rules.

Reasoning over large-scale RDF files currently faces several challenges: it is difficult to obtain the appropriate triples from data distributed across the network; the growing volume of data requires scalable computing power for large data sets; and existing reasoning methods are designed for static ontologies, whereas real-world data usually changes. Existing distributed reasoning methods focus mainly on static data, and parallel reasoning over streaming RDF data is a relatively new field.

Technical problems to be solved:

1. How to combine the RDF data ontology with the OWL Horst rules to construct a pseudo-bidirectional network of rules, containing the class nodes and rule nodes that correspond to pattern triples, so that reasoning over all OWL Horst rules can be completed efficiently on large-scale streaming data.

2. A parallel reasoning scheme that matches the proposed streaming scheme, so as to meet the need for distributed parallel reasoning over large-scale streaming data.

Summary of the invention

To solve the above problems, the invention provides a parallel reasoning algorithm for streaming RDF data. Targeting the OWL Horst rules and drawing on the advantages of the HAL algorithm, it proposes the PRAS algorithm (Parallel Reasoning Algorithm for Streaming RDF Data). On large-scale streaming data, the algorithm efficiently constructs and maintains the pseudo-bidirectional network and performs reasoning correctly and completely.

To achieve the above object, the invention adopts the following technical solution: a parallel reasoning algorithm for streaming RDF data, characterized in that it comprises the following steps. S1: load the rule nodes and the pattern triples P_j_RDD and O_k_RDD and save them to the Redis cluster; build the intermediate node midnode for the connection variables in the rules; skip to S2. S2: at regular intervals, read the batch of new data new_data from the data stream and the data itr_data produced by the previous round of reasoning; if an input is a pattern triple (S_i, P_i, O_i), skip to S3; if it is an instance triple (s_i, p_i, o_i), skip to S5; if both new_data and itr_data are empty, the algorithm ends. S3: if the corresponding class node P_j_RDD or O_k_RDD exists, classify the triple into it; otherwise create the corresponding class node and save it to the Redis cluster; if the predicate belongs to Symmetric Property, skip to S4; otherwise skip to S6. Symmetric Property is the set that identifies predicates of pattern triples having a symmetric relationship. The set of symmetric-property triples, SymTriples, is defined as follows: [formula not reproduced in the source]; where P_j_RDD is the set of pattern triples. S4: classify the input data and perform reasoning. S5: store and deduplicate the triples produced by reasoning.

Compared with the prior art, the invention has the following advantages:

1. It combines the OWL Horst rules with the RDF ontology file to construct a pseudo-bidirectional network structure, which improves the efficiency of streaming reasoning.

2. It uses a storage strategy designed around the Redis cluster to deduplicate triples and to store the iteration data, which reduces the storage space and reasoning time spent on duplicate triples and thereby improves reasoning efficiency.

Description of drawings

Figure 1 is a schematic diagram of the overall framework of the invention.

Figure 2 shows the construction of the pseudo-bidirectional network.

Figure 3 shows loading the rules and ontology data and constructing the pseudo-bidirectional network.

Figure 4 is a relationship diagram of the OWL Horst rules.

Detailed description

The invention is further explained below with reference to the accompanying drawings and specific embodiments.

The streaming parallel reasoning proposed by the invention has three main parts: constructing the pseudo-bidirectional network, classifying the streaming data, and reasoning over the OWL Horst rules. Based on the characteristics of Spark Streaming and Redis, and combining the HAL algorithm, the OWL Horst rules and the RDF data ontology, a pseudo-bidirectional network of rules is constructed. It contains the class nodes and rule nodes corresponding to pattern triples; if a rule node contains a connection variable of a class, an intermediate node is established. Next, at regular intervals, the batch of new data in the Streaming data stream and the data generated by the previous round of reasoning are acquired as input data; the input data are classified into existing nodes, or new corresponding nodes are created, and stored in the corresponding Redis cluster. Then, for each input triple, the pseudo-bidirectional network is used to judge whether all antecedents monitored by the corresponding intermediate node or rule node are satisfied; if so, the rule is applied and inference data are generated. Finally, duplicate inference data are deleted in real time and all data generated by the current round are saved into the Redis cluster as the input for the next round, so that parallel streaming inference over RDF data under the OWL Horst rules is realized completely and efficiently.

See Figure 1 for the overall framework.

A parallel reasoning algorithm for streaming RDF data, comprising the following steps:

S1: Load the rule nodes and the pattern triples P_j_RDD and O_k_RDD and save them to the Redis cluster; build the intermediate node midnode for the connection variables in the rules; skip to S2.

S2: At regular intervals, read the batch of new data new_data from the data stream and the data itr_data produced by the previous round of reasoning. If an input is a pattern triple (S_i, P_i, O_i), skip to S3; if it is an instance triple (s_i, p_i, o_i), skip to S5; if both new_data and itr_data are empty, the algorithm ends.

S3: If the corresponding class node P_j_RDD or O_k_RDD exists, classify the triple into it; otherwise create the corresponding class node and save it to the Redis cluster. If the predicate belongs to Symmetric Property, skip to S4; otherwise skip to S6. Symmetric Property is the set that identifies predicates of pattern triples having a symmetric relationship. The set of symmetric-property triples, SymTriples, is defined as follows:

[formula not reproduced in the source]

where P_j_RDD is the set of pattern triples; for example, under the OWL Horst rules SymTriples = {sameAs, inverseOf, equivalentClass, equivalentProperty}.

S4: Classify the input data and perform reasoning.

S5: Store and deduplicate the triples produced by reasoning.

S4 comprises the following steps:

S41: If the input triple is a pattern triple (S_i, P_i, O_i), build the bidirectional relation between S_i and O_i by storing it once with P_i+"_"+S_i as key and O_i as value, and once with P_i+"_"+O_i as key and S_i as value; save both to the Redis cluster and skip to S43.

S42: If the input triple is an instance triple (s_i, p_i, o_i), build the three key-value pairs <s_i, (p_i, o_i)>, <p_i, (s_i, o_i)> and <o_i, (s_i, p_i)>, save them to the Redis cluster, and skip to S43.

S43: Check the pseudo-bidirectional network corresponding to new_data or itr_data, and determine whether new_data or itr_data contains a Rule_m_link_RDD monitored by a rule node or an intermediate node. If it is monitored by an intermediate node, skip to S44; if it is monitored by a rule node, skip to S45; otherwise skip to S2. The pseudo-bidirectional network is built as follows: for each rule Rule_i a rule node Rule_i_node is created; for each class involved in the rule a class node Class_i_node is created; and if the antecedents of the rule contain a connection variable, an intermediate node mid_i_node is created. The connection variable of Rule_i is the pattern-triple item that joins two antecedents of Rule_i. The connection-variable information of each rule is stored as a <key, value> pair in Rule_m_link_RDD, where key holds all pattern-triple items used to join the antecedents of the rule and value holds the pattern-triple items of the rule's conclusion. The construction flow of the pseudo-bidirectional network is shown in Figure 2.

S45: Determine whether all monitored Rule_m_link_RDD entries are satisfied; if so, skip to S46, otherwise skip to S2.

S46: Determine whether all antecedents of the rule node are satisfied; if so, apply the rule to produce the inferred triples and skip to S5; otherwise skip to S2.
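The storage layout of S41 and S42 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the `TripleStore` class is a hypothetical in-memory stand-in for the Redis cluster (one set per key, mirroring the `sadd` and `smembers` commands used in the pseudocode), and the function names are assumptions introduced here.

```python
from collections import defaultdict


class TripleStore:
    """Hypothetical in-memory stand-in for the Redis cluster: one set per key."""

    def __init__(self):
        self._sets = defaultdict(set)

    def sadd(self, key, member):
        # Same effect as Redis SADD: adding an existing member is a no-op.
        self._sets[key].add(member)

    def smembers(self, key):
        # Same effect as Redis SMEMBERS: a missing key yields an empty set.
        return set(self._sets.get(key, set()))


def index_pattern_triple(store, s, p, o):
    """S41: store the bidirectional relation between S and O under predicate P."""
    store.sadd(f"{p}_{s}", o)
    store.sadd(f"{p}_{o}", s)


def index_instance_triple(store, s, p, o):
    """S42: store the three key-value pairs keyed by subject, predicate and object."""
    store.sadd(s, (p, o))
    store.sadd(p, (s, o))
    store.sadd(o, (s, p))


store = TripleStore()
index_pattern_triple(store, "memberOf", "inverseOf", "member")
index_instance_triple(store, "GraduateStudent0", "memberOf", "University0_Department0")
print(store.smembers("inverseOf_member"))
print(store.smembers("memberOf"))
```

Indexing each instance triple under all three of its components is what later lets a rule fetch only the triples that share a connection variable, instead of scanning the whole data set.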

S5 comprises the following specific steps: the triples produced by reasoning are saved in a set named itr_data in the Redis cluster, duplicate triples are removed, and the itr_data set then becomes part of the input data for the next round of reasoning. If no stop command has been issued, skip to S2.
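The overall S2 to S5 loop can be pictured with the following Python sketch. It is an illustration under stated assumptions rather than the patented implementation: the stream is modelled as a finite list of batches, a plain Python set stands in for the Redis cluster, and `run_streaming_reasoner` and `symmetric_same_as` are hypothetical names. The loop ends when both the new data and the previously inferred data are empty, as S2 requires.

```python
def run_streaming_reasoner(batches, infer):
    """Iterate until there is neither new stream data nor previously inferred data."""
    known = set()      # every triple stored so far (stand-in for the Redis cluster)
    itr_data = set()   # triples inferred in the previous round
    pending = list(batches)
    while pending or itr_data:
        new_data = set(pending.pop(0)) if pending else set()
        round_input = (new_data | itr_data) - known   # skip triples already stored
        known |= round_input
        # S5: keep only inferred triples that are not duplicates of stored ones.
        itr_data = infer(round_input, known) - known
    return known


def symmetric_same_as(round_input, known):
    """Toy rule used only for this demo: sameAs is symmetric."""
    return {(o, p, s) for (s, p, o) in round_input if p == "owl:sameAs"}


closure = run_streaming_reasoner([[("a", "owl:sameAs", "b")]], symmetric_same_as)
print(sorted(closure))
```

Because inferred triples that are already stored are dropped before the next round, the loop cannot keep re-deriving the same facts, which is the practical purpose of the deduplication in S5.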

Based on the characteristics of Spark RDD and the Redis cluster, and combining the principle of the HAL algorithm with the OWL Horst rules and the RDF ontology data, the PRAS algorithm of the invention constructs a pseudo-bidirectional network of rules. First, for each pattern triple (S_i, P_j, O_k), the corresponding class node O_k_RDD or P_j_RDD is built and saved to the Redis cluster; if P is a symmetric property, a bidirectional relation between S and O of the triple is built and saved to the Redis cluster. To judge quickly whether all antecedents of a rule are satisfied, a rule node is created for every rule. If the rule contains a connection variable link_var, an intermediate node midnode is created, the test-condition information is stored in the intermediate node, and bidirectional communication is set up between the intermediate node and the rule node. If there is no connection variable, the class node is connected directly to the rule node and the test conditions are stored in the class node. Taking rule 8a in Figure 2 as an example, the schematic diagram is shown in Figure 3. Through the heuristic information exchanged between nodes and the construction of symmetric properties, combined with the efficient access offered by the Redis cluster, the required triples are read from the Redis cluster by query. This reduces the reading and transmission of irrelevant triples and thereby improves overall reasoning efficiency.

The Map stage performs data classification and reasoning. If the batch stream data new_data acquired at regular intervals from the Streaming data stream, or the data itr_data generated by the previous round of reasoning, is ontology data, it is classified into the corresponding class node and the value of that node in the Redis cluster is updated. If its property is a symmetric property, the bidirectional relation between S and O of the triple is built with "symm_"+S and "symm_"+O as keys and saved to the Redis cluster. If new_data or itr_data is instance data, then for each instance triple (s_i, p_i, o_i) the three key-value pairs <s_i, (p_i, o_i)>, <p_i, (s_i, o_i)> and <o_i, (s_i, p_i)> are built and saved to the Redis cluster. The pseudo-bidirectional network corresponding to new_data or itr_data is then checked to judge whether the connection variables monitored by the corresponding intermediate node, or all antecedents of the corresponding rule node (which may involve several intermediate nodes), are satisfied. If they are all satisfied, the rule is applied, the inferred triples are produced, and the result is output to the Reduce stage; if they are only partly satisfied, the state of the corresponding conditions is updated. The data classification and reasoning algorithm proposed herein proceeds as follows:

Map-stage algorithm

Input: streaming triple data and the triples produced by the previous round of reasoning

Output: <"new", >

Step 1: For the input triple data, every (S_i, P_j, O_k) ∈ SchemaTriple is classified into the corresponding class node and the Redis cluster is updated. If P_j is a symmetric property, the bidirectional relation between S and O of the triple is built with P_j+"_"+S_i as key and O_k as value, and with P_j+"_"+O_k as key and S_i as value, and saved to the Redis cluster. Skip to Step 3.

Step 2: For the input triple data, for every (s_i, p_j, o_k) ∈ InstanceTriple, the three key-value pairs <s_i, (p_j, o_k)>, <p_j, (s_i, o_k)> and <o_k, (s_i, p_j)> are built and saved to the Redis cluster. Skip to Step 3.

Step 3: Check the pseudo-bidirectional network corresponding to (s_i, p_j, o_k), read the required data from the Redis cluster, and judge whether the connection variables monitored by the corresponding intermediate node, or all antecedents of the corresponding rule node (which may involve several intermediate nodes), are satisfied. If they are all satisfied, skip to Step 4. If some are not satisfied, update the monitoring information of the intermediate node or class node according to (S_i, P_j, O_k).

Step 4: According to the conclusion of the current rule, obtain the inferred triples and output <"new", >.

Taking rules 8a and 8b (inverseOf) in Figure 4 as an example, the pseudocode is as follows:

Input: (S₁, P₁, O₁)
Output: <"new", >
Begin
  If (S₁, P₁, O₁) ∈ SchemaTriple {   // pattern triple: classify and save
    If P₁ equals "type"
      sadd O₁ S₁
    else {
      sadd P₁ (S₁, O₁)
      If P₁ ∈ SymmetricProperty {   /* symmetric predicate: build and save the symmetric relation between S₁ and O₁ */
        sadd P₁+"_"+S₁ O₁
        sadd P₁+"_"+O₁ S₁
      }
    }
  } else {   /* instance triple: build and save three key-value pairs */
    sadd S₁ (P₁, O₁)
    sadd P₁ (S₁, O₁)
    sadd O₁ (S₁, P₁)
  }
  /* read the sets inverseOf_S₁ and inverseOf_O₁ from the Redis cluster into inverseOf */
  inverseOf ← smembers("inverseOf_"+S₁) ∪ smembers("inverseOf_"+O₁)
  If (inverseOf != null) {
    yield ("new", (O₁, P₁, S₁))
    For (inverse in inverseOf.value) {
      yield ("new", (O₁, inverse, S₁))
    }
  }
End

Assume that the current input batch contains the pattern triple T(memberOf, owl:inverseOf, member) and the instance triple t(GraduateStudent0, memberOf, University0_Department0). First, for the pattern triple T, check whether inverseOf_RDD exists; if it does not, create inverseOf_RDD and save (memberOf, member) into it; if it does, save the pair into it directly. Next, since inverseOf is a symmetric property, build the bidirectional relation between memberOf and member, with inverseOf_memberOf as key and member as value, and with inverseOf_member as key and memberOf as value, and save it to the Redis cluster. For the instance triple t, build <GraduateStudent0, (memberOf, University0_Department0)>, <memberOf, (GraduateStudent0, University0_Department0)> and <University0_Department0, (GraduateStudent0, memberOf)> and save them to the Redis cluster. Finally, read the sets inverseOf_memberOf and inverseOf_member from the Redis cluster into inverseOf, traverse inverseOf, and, following the subject-object swap in the pseudocode's yield ("new", (O₁, inverse, S₁)), output (University0_Department0, member, GraduateStudent0).
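The inverseOf example can be traced with a small Python sketch. It is an illustration only: a dictionary of sets stands in for the Redis cluster, the function name `infer_inverse` is hypothetical, and the rule is read in its usual OWL Horst form (from p inverseOf q and v p w, infer w q v), so the inferred triple swaps subject and object as in the pseudocode's `yield ("new", (O₁, inverse, S₁))`.

```python
from collections import defaultdict


def infer_inverse(pattern_triples, instance_triples):
    """Apply the inverseOf rule: (p inverseOf q) and (v p w) give (w q v)."""
    # Bidirectional index, keyed like the Redis sets "inverseOf_<property>".
    inverse_index = defaultdict(set)
    for s, p, o in pattern_triples:
        if p == "owl:inverseOf":
            inverse_index[f"inverseOf_{s}"].add(o)
            inverse_index[f"inverseOf_{o}"].add(s)

    inferred = set()
    for s, p, o in instance_triples:
        # Look up the inverse properties of the instance triple's predicate.
        for q in inverse_index.get(f"inverseOf_{p}", ()):
            inferred.add((o, q, s))
    return inferred


T = [("memberOf", "owl:inverseOf", "member")]
t = [("GraduateStudent0", "memberOf", "University0_Department0")]
print(infer_inverse(T, t))
```

Because the relation is stored in both directions, the same lookup also covers a triple that uses `member` as its predicate, which is what the pairing of rules 8a and 8b relies on.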

For rules that involve symmetric properties, such as rules 8a and 8b, the bidirectional relation built and stored for the symmetric property makes it possible to find the relevant triples in the Redis cluster quickly, which improves reasoning efficiency.

Taking rule 15 (someValuesFrom) in Figure 4 as an example, the pseudocode is as follows:

Input: (S₁, P₁, O₁)
Output: <"new", >
Begin
  If (S₁, P₁, O₁) ∈ SchemaTriple {   // pattern triple: classify and save
    If P₁ equals "type"
      sadd O₁ S₁
    else {
      sadd P₁ (S₁, O₁)
      If P₁ ∈ SymmetricProperty {   // symmetric predicate: build and save the symmetric relation between S₁ and O₁
        sadd P₁+"_"+S₁ O₁
        sadd P₁+"_"+O₁ S₁
      }
    }
  } else {   // instance triple: build and save three key-value pairs
    sadd S₁ (P₁, O₁)
    sadd P₁ (S₁, O₁)
    sadd O₁ (S₁, P₁)
  }
  someValuesFrom_set ← smembers("someValuesFrom")   /* read the someValuesFrom set from the Redis cluster */
  onProperty_set ← smembers("onProperty")
  For (svf in someValuesFrom_set) {
    For (op in onProperty_set) {
      If (svf.v equals op.v) {
        temp_w ← smembers(svf.w)
        x_type_w = temp_w.filter(x => x.p == "type")   /* keep the triples in temp_w whose p is type */
        u_p_x ← smembers(op.p)
        /* join the antecedents of the rule to produce the inference result */
        result = u_p_x.filter(t => t.x == x_type_w.x)
        For (t in result)
          yield ("new", (t.u, type, svf.v))
      }
    }
  }
End

Assume that the current input batch contains the pattern triples T1(Chair, owl:someValuesFrom, Department) and T2(Chair, owl:onProperty, headOf), and the instance triples t1(FullProfessor7, headOf, University0_Department0) and t2(University0_Department0, rdf:type, Department). First, for the pattern triples T1 and T2, check whether someValuesFrom_RDD and onProperty_RDD exist. If they do not, create them and save (Chair, Department) into someValuesFrom_RDD and (Chair, headOf) into onProperty_RDD; if they do, save the pairs into them directly. For the instance triple t1, build <FullProfessor7, (headOf, University0_Department0)>, <headOf, (FullProfessor7, University0_Department0)> and <University0_Department0, (FullProfessor7, headOf)> and save them to the Redis cluster; t2 is handled in the same way. Next, read the someValuesFrom and onProperty sets from the Redis cluster into someValuesFrom_set and onProperty_set and traverse both. Here the Chair in someValuesFrom_set matches the Chair in onProperty_set, so the two sets keyed by Department and by headOf are fetched from the Redis cluster. Finally, FullProfessor7 is joined with Chair and (FullProfessor7, rdf:type, Chair) is output.
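The join performed for rule 15 can be sketched in Python as follows. This is an illustrative reading of the example above, not the patent's code: dictionaries of sets stand in for the Redis sets, the function name `infer_some_values_from` is hypothetical, and the rule is taken in its usual OWL Horst form (from v someValuesFrom w, v onProperty p, u p x and x type w, infer u type v).

```python
from collections import defaultdict


def infer_some_values_from(pattern_triples, instance_triples):
    """Rule 15: (v someValuesFrom w), (v onProperty p), (u p x), (x type w) give (u type v)."""
    some_values_from = set()   # pairs (v, w)
    on_property = set()        # pairs (v, p)
    for s, p, o in pattern_triples:
        if p == "owl:someValuesFrom":
            some_values_from.add((s, o))
        elif p == "owl:onProperty":
            on_property.add((s, o))

    by_predicate = defaultdict(set)   # p -> {(u, x)}, like the set stored under key p
    members_of = defaultdict(set)     # w -> {x}, instances x with (x type w)
    for s, p, o in instance_triples:
        by_predicate[p].add((s, o))
        if p == "rdf:type":
            members_of[o].add(s)

    inferred = set()
    for v, w in some_values_from:
        for v2, prop in on_property:
            if v != v2:
                continue   # the two pattern triples must share the restriction v
            for u, x in by_predicate.get(prop, ()):
                if x in members_of.get(w, ()):   # join on the connection variable x
                    inferred.add((u, "rdf:type", v))
    return inferred


patterns = [("Chair", "owl:someValuesFrom", "Department"),
            ("Chair", "owl:onProperty", "headOf")]
instances = [("FullProfessor7", "headOf", "University0_Department0"),
             ("University0_Department0", "rdf:type", "Department")]
print(infer_some_values_from(patterns, instances))
```

The two dictionary lookups play the role of the `smembers` calls keyed by `headOf` and `Department`: only triples that can match the connection variables are ever touched.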

For rules with several connection variables, such as rule 15, the pattern triples can be fetched quickly from the Redis cluster through the key of the class node. The associated instance triples are found through the values of the connection variables, using the storage strategy for instance triples in Redis, which improves reasoning efficiency.

The Reduce stage saves the data produced by reasoning. The inferred triples are saved in a set named "itr_data" in the Redis cluster, duplicate triples are removed, and the "itr_data" set then becomes part of the input data for the next round of reasoning. The data deduplication and storage algorithm proposed herein proceeds as follows:

Reduce algorithm

Input: <"new", Iterator<String> values>

Output: null

Step 1: Save the input SchemaTriple and InstanceTriple data in the Redis cluster under the set name itr_data, to be read in the next round of reasoning.

To make the deduplication and storage of the input data in the Reduce stage explicit, the pseudocode is as follows:

Input: <"new", Iterator<String> values>
Output: null
Begin
  del itr_data
  For (itr in values)
    sadd itr_data itr.value   /* add each value in values to the itr_data set of the Redis cluster */
End

As the pseudocode shows, the Reduce stage deduplicates and stores the input triples by means of a Redis set, preparing the data for the next round of reasoning.
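The Reduce step can be mimicked in a few lines of Python. This is a minimal sketch under the assumption that a dictionary of sets stands in for the Redis cluster; `reduce_store` is a hypothetical name. Deleting the key first mirrors `del itr_data`, and the set semantics of `sadd` are what remove the duplicates.

```python
def reduce_store(store, values):
    """Replace the itr_data set with the deduplicated triples of this round."""
    store.pop("itr_data", None)    # del itr_data: drop the previous round's results
    itr_data = set()
    for value in values:
        itr_data.add(value)        # sadd itr_data value: duplicates collapse in the set
    store["itr_data"] = itr_data
    return itr_data


store = {"itr_data": {"(old, triple, here)"}}
values = iter(["(a, type, B)", "(a, type, B)", "(c, type, D)"])
print(sorted(reduce_store(store, values)))
```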

Complexity analysis is an important measure of an algorithm's efficiency, and the PRAS algorithm of the invention is analyzed differently from a centralized algorithm: the analysis is split into the Map stage and the Reduce stage. Let the experimental data set contain N triples, let t be the time to read data from Redis, let k be the number of parallel Map tasks in the MapReduce process, let m be the number of instance triples received in the Reduce stage, and let x be the number of parallel Reduce tasks. In the Map stage, the PRAS algorithm scans each input triple once against the class nodes or intermediate nodes to decide whether the triple can take part in the reasoning of some rule; if it can, the antecedent data are read from Redis and the inference result is derived. The time complexity of the Map stage is therefore O(t*N/k). In the Reduce stage, each input triple is classified, so the time complexity of the Reduce stage is O(m/x).
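As a worked reading of these two bounds, and assuming (the source does not state this) that the Map and Reduce stages of one batch run one after the other, the cost of a batch can be written as the sum of the two terms:

```latex
T_{\text{batch}} \;=\; O\!\left(\frac{t \cdot N}{k}\right) \;+\; O\!\left(\frac{m}{x}\right)
```

With purely hypothetical values N = 10^6 and k = 10, each Map task handles about 10^5 triples, so doubling k to 20 halves the Map term, while the Reduce term changes only with m and x.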

The above are preferred embodiments of the present invention. Any change made according to the technical solution of the present invention falls within the protection scope of the present invention, provided that the resulting function and effect do not exceed the scope of the technical solution.

Claims

1. A streaming RDF data parallel reasoning algorithm, characterized by comprising the following steps:

S1: Load the rule nodes and the pattern triples P_j_RDD and O_k_RDD and save them in the Redis cluster; build the intermediate nodes midnode for the connection variables in the rules; go to S2;

S2: Periodically read a batch of new data new_data from the data stream together with the data itr_data produced by the previous reasoning; if an item is a pattern triple (S_i, P_i, O_i), go to S3; if it is an instance triple (s_i, p_i, o_i), go to S5; if both new_data and itr_data are empty, the algorithm terminates;

S3: If the corresponding class node P_j_RDD or O_k_RDD exists, assign the triple to that class node; if it does not exist, create the corresponding class node and save it in the Redis cluster; if the predicate belongs to Symmetric Property, go to S4; otherwise go to S6. Symmetric Property is a set used to mark the predicates in triples that have a symmetric relation. The symmetric-property triple set SymTriples is defined as follows:

；

where P_j_RDD is the set of pattern triples;

S4: Classify the input data and perform reasoning on it;

S5: Store and deduplicate the triples produced by reasoning.

2. The streaming RDF data parallel reasoning algorithm according to claim 1, characterized in that S4 comprises the following steps: S41: If the input triple is a pattern triple (S_i, P_i, O_i), store it in two forms, one with P_i+"_"+S_i as the key and O_i as the value, and one with P_i+"_"+O_i as the key and S_i as the value, thereby building the bidirectional relation between S_i and O_i in the triple; save both in the Redis cluster; go to S43;

S42: If the input triple is an instance triple (s_i, p_i, o_i), build three key-value pairs from it, <s_i, (p_i, o_i)>, <p_i, (s_i, o_i)> and <o_i, (s_i, p_i)>, and store them in the Redis cluster; go to S43;

S43: Check the pseudo-bidirectional network corresponding to new_data or itr_data, and judge whether new_data or itr_data contains a Rule_m_link_RDD monitored by a rule node or an intermediate node; if it is a Rule_m_link_RDD monitored by an intermediate node, go to S44; if it is a Rule_m_link_RDD monitored by a rule node, go to S45; otherwise go to S2. The pseudo-bidirectional network means that, for a rule Rule_i, a rule node Rule_i_node is established, a class node Class_i_node is built for each class involved in the rule, and an intermediate node mid_i_node is established if the antecedents of the rule contain a connection variable. The connection variable of rule Rule_i is the pattern triple used in Rule_i to connect two antecedents; the connection variable information of each rule is stored in Rule_m_link_RDD in <key, value> form, where key stores all the pattern triples that connect antecedents of the rule and value stores the pattern triple of the rule's conclusion;

S45: Judge whether the monitored Rule_m_link_RDD are all satisfied; if so, go to S46; otherwise go to S2;

S46: Judge whether all antecedents corresponding to the rule node are satisfied; if so, execute the reasoning of the rule to produce triples and go to S5; otherwise go to S2.

3. The streaming RDF data parallel reasoning algorithm according to claim 1, characterized in that S5 comprises the following specific steps: the triples produced by reasoning are stored in the set named itr_data in the Redis cluster, and duplicate triples are removed; the itr_data set then serves as part of the input data for the next reasoning; if there is no stop command, go to S2.