CN104618153B

CN104618153B - Dynamic fault-tolerant method and system based on P2P in the processing of distributed parallel figure

Info

Publication number: CN104618153B
Application number: CN201510026680.9A
Authority: CN
Inventors: 曹东刚; 詹杭龙; 李宝; 梅宏
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2015-01-20
Filing date: 2015-01-20
Publication date: 2018-08-03
Anticipated expiration: 2035-01-20
Also published as: CN104618153A

Abstract

The invention relates to a P2P-based dynamic fault tolerance method and system in distributed parallel graph processing. The method includes: defining the data unit of the distributed graph processing problem to ensure the integrity of the recovered data during dynamic fault tolerance; forming a ring structure of the processor nodes, dividing the input graph data into several partitions and assigning each partition to In each processor node, each processor node generates a copy of its own data unit and backs it up in the adjacent processor node; after each processor node executes its own data unit, it updates it incrementally and places it in the adjacent processor node. The copy in the processor node; when a processor node fails or goes offline due to a network error, its adjacent nodes are assigned to use the data copy to replace the original data unit, complete the corresponding operation, and restore the normal execution of the graph processing. The invention can restore the graph processing operation to the normal state from abnormalities such as node failure and network error, so as to ensure the correct execution of the operation.

Description

P2P-based dynamic fault tolerance method and system in distributed parallel graph processing

技术领域technical field

本发明涉及计算机网络技术领域，具体涉及一种在开放、动态的网络环境下分布式并行图处理的动态容错的方法及系统。The invention relates to the technical field of computer networks, in particular to a dynamic fault-tolerant method and system for distributed parallel graph processing in an open and dynamic network environment.

背景技术Background technique

近年来，随着社交网络、协同网络等技术的普及和发展，互联网中的数据规模越来越大，这给分析这些数据带来了新的挑战。在社交网络、协同网络等场景中，数据之间可能存在关联，相关研究经常用图结构来对这些数据进行描述。图中的顶点记录数据自身属性，图中的边对应数据间的关联。这样，对网络数据的分析就转化成了图处理。然而，上述图结构常常具有百万、千万个顶点和数亿条边，这对普通计算机的内存造成很大压力。更严重的是，由于图处理过程将产生正比于图规模的数量的中间结果，这使得单机由于内存的限制难以对图进行正常计算。In recent years, with the popularization and development of technologies such as social networks and collaborative networks, the scale of data in the Internet has become larger and larger, which has brought new challenges to the analysis of these data. In scenarios such as social networks and collaborative networks, there may be associations between data, and related research often uses graph structures to describe these data. The vertices in the graph record the attributes of the data itself, and the edges in the graph correspond to the associations between the data. In this way, the analysis of network data is transformed into graph processing. However, the above-mentioned graph structures often have millions, tens of millions of vertices and hundreds of millions of edges, which put a lot of pressure on the memory of ordinary computers. What's more, since the graph processing process will produce intermediate results proportional to the scale of the graph, this makes it difficult for a single machine to perform normal calculations on the graph due to memory constraints.

因此，利用分布式集群对图进行并行处理便成为了当前网络图数据分析的主要方式。以Google提出的Pregel框架为典型代表，大部分图处理系统将图数据划分成多个子图(或称图分区)分派到若干机器中，对图数据进行并行计算。并行计算主要采用BSP(BulkSynchronous Processing)模型，即对图数据进行迭代运算。在每个迭代步中，图中每个顶点获取邻接顶点的消息，各自更新自身状态，并将新状态传播给邻接顶点。如图1所示，在迭代步之间，所有顶点的运算需要进行同步，当所有顶点完成当前迭代步的运算时，方可触发进入下一迭代步。当所有顶点均求得结果时，图处理计算完成。Therefore, using distributed clusters to process graphs in parallel has become the main way of current network graph data analysis. Taking the Pregel framework proposed by Google as a typical representative, most graph processing systems divide graph data into multiple subgraphs (or graph partitions) and assign them to several machines to perform parallel calculations on graph data. Parallel computing mainly uses the BSP (Bulk Synchronous Processing) model, which performs iterative operations on graph data. In each iterative step, each vertex in the graph obtains information from adjacent vertices, updates its own state, and propagates the new state to adjacent vertices. As shown in Figure 1, between iteration steps, the operations of all vertices need to be synchronized. When all vertices complete the operations of the current iteration step, the next iteration step can be triggered. When all vertices have obtained results, the graph processing computation is complete.

在开放的网络环境下，分布式集群具有动态变化的特点，集群中的某些节点可能失效、连接集群的网络可能出错，这些异常情况将影响图处理作业的正常执行。因此，分布式并行图处理需要考虑在运算过程中的容错性。现有的图处理系统一般通过checkpoint机制实现容错。这种容错要求图处理在发现出错时，停止当前运算，重新从磁盘中加载最近一次的checkpoint数据，从该checkpoint所记录的迭代点重新进行运算。这种容错机制是静态的，且恢复代价也比较大，并不适用于开放的网络环境。In an open network environment, distributed clusters are characterized by dynamic changes. Some nodes in the cluster may fail, and the network connecting the cluster may fail. These abnormal conditions will affect the normal execution of graph processing jobs. Therefore, distributed parallel graph processing needs to consider the fault tolerance in the operation process. Existing graph processing systems generally implement fault tolerance through the checkpoint mechanism. This kind of fault tolerance requires graph processing to stop the current operation when an error is found, reload the latest checkpoint data from the disk, and re-operate from the iteration point recorded by the checkpoint. This fault-tolerant mechanism is static, and the recovery cost is relatively high, so it is not suitable for an open network environment.

发明内容Contents of the invention

本发明的目的是提供一种开放、动态网络环境下分布式并行图处理中基于P2P(Peer to Peer，对等计算/对等网络)的动态容错的方法，使得图处理运算能够从节点失效、网络出错等异常中恢复正常状态，保证运算的正确执行。The purpose of the present invention is to provide a dynamic fault-tolerant method based on P2P (Peer to Peer, peer-to-peer computing/peer-to-peer network) in distributed parallel graph processing under an open and dynamic network environment, so that graph processing operations can be performed from node failure, It restores the normal state in the event of network errors and other abnormalities to ensure the correct execution of operations.

为实现上述目的，本发明采用如下技术方案：To achieve the above object, the present invention adopts the following technical solutions:

一种分布式并行图处理中基于P2P的动态容错方法，其步骤包括：A P2P-based dynamic fault-tolerant method in distributed parallel graph processing, the steps of which include:

1)定义分布式图处理问题的数据单元，以确保动态容错时所恢复数据的完整性；1) Define the data unit of the distributed graph processing problem to ensure the integrity of the recovered data during dynamic fault tolerance;

2)将处理器节点组成环状结构，将输入的图数据划分成若干分区，并将各分区分派到各处理器节点中，每个处理器节点把各自的数据单元生成副本备份在邻接处理器节点中；2) Form the processor nodes into a ring structure, divide the input graph data into several partitions, and assign each partition to each processor node, and each processor node generates a copy of its own data unit and backs it up in the adjacent processor node;

3)在运算的每一个迭代步中，每个处理器节点执行完自身的数据单元后，以增量的方式更新其放置于相邻处理器节点中的副本；3) In each iterative step of the operation, after each processor node executes its own data unit, it updates its copy placed in the adjacent processor node in an incremental manner;

4)当某个处理器节点失效或因网络出错而离线时，指派其相邻节点利用数据副本替代原先的数据单元，完成相应运算，从而恢复图处理的正常执行。4) When a processor node fails or is offline due to a network error, its adjacent nodes are assigned to replace the original data unit with a data copy to complete the corresponding operation, thereby restoring the normal execution of graph processing.

进一步地，步骤(1)定义的分布式图处理问题的数据单元为二元组(P_j,InMsg(P_j))，其中，P_j为由图结构划分成的某一图分区，即某一子图；InMsg(P_j)为子图P_j包含的所有顶点在某一迭代步所收到的消息集合，在运算初始时，InMsg(P_j)为空集；1≤j≤m，m为对图进行分区后分区的数量。Furthermore, the data unit of the distributed graph processing problem defined in step (1) is a two-tuple (P _j ,InMsg(P _j )), where P _j is a certain graph partition divided by the graph structure, that is, a certain A subgraph; InMsg(P _j ) is the set of messages received by all vertices contained in the subgraph P _j in a certain iteration step. At the beginning of the operation, InMsg(P _j ) is an empty set; 1≤j≤m, m is the number of partitions after partitioning the graph.

进一步地，步骤(2)将输入的图数据划分成若干分区后，以二元组(P_j,空的消息集合)的结构分派到各个处理器节点中。Further, step (2) divides the input graph data into several partitions, and distributes them to each processor node in the structure of a tuple (P _j , an empty message set).

进一步地，步骤(2)中，对于自己被分派的数据单元，处理器节点基于BSP模型对数据单元中的每一个顶点进行运算；对于相邻节点的数据单元副本，处理器节点只是记录，并等待每一次迭代步后，相邻节点发送数据来更新这些副本。Further, in step (2), for the data unit assigned to itself, the processor node operates on each vertex in the data unit based on the BSP model; for the data unit copy of the adjacent node, the processor node just records, and After each iteration, neighbors send data to update these replicas.

进一步地，步骤(3)在每个迭代步完成后进行副本更新；或者根据系统错误发生的频率，综合考虑副本更新所消耗的时间代价，每隔一定迭代步进行一次副本更新。Further, in step (3), the copy is updated after each iteration step is completed; or, according to the frequency of system errors, taking into account the time cost of the copy update, the copy is updated every certain iteration step.

进一步地，步骤(3)在图处理的每一个迭代步中，每个顶点首先处理在上一次迭代步所收到的消息集，并根据这些消息更新自己的状态，再向邻接顶点传播自己的新状态；同一顶点在不同迭代步之间的消息不需要累积，当某一迭代步已计算出顶点的更新值后，在该迭代步之前所收到的消息都不再需要。Further, step (3) in each iteration step of graph processing, each vertex first processes the message set received in the previous iteration step, and updates its own state according to these messages, and then propagates its own state to adjacent vertices. New state; the messages of the same vertex between different iteration steps do not need to be accumulated, and when the update value of the vertex has been calculated in a certain iteration step, the messages received before the iteration step are no longer needed.

一种采用上述方法的分布式并行图处理系统，包括控制器和处理器节点，所述控制器负责将输入的图数据进行分区，将各分区分派到各处理器节点中，并监控各处理器节点的运行情况；各处理器节点组成环状结构，每个处理器节点把各自的数据单元生成副本备份在邻接处理器节点中，每个处理器节点执行完自身的数据单元后，以增量的方式更新其放置于相邻处理器节点中的副本；当某个处理器节点失效或因网络出错而离线时，控制器利用其相邻节点的数据副本替代原先的数据单元，恢复图处理的正常执行。A distributed parallel graph processing system adopting the above method, including a controller and processor nodes, the controller is responsible for partitioning input graph data, assigning each partition to each processor node, and monitoring each processor The running status of the node; each processor node forms a ring structure, and each processor node generates a copy of its own data unit for backup in the adjacent processor node, and after each processor node executes its own data unit, it increments Update the copy placed in the adjacent processor node; when a processor node fails or goes offline due to network errors, the controller uses the data copy of its adjacent node to replace the original data unit, and restores the graph processing. Execute normally.

与现有技术相比，本发明的有益效果如下：Compared with the prior art, the beneficial effects of the present invention are as follows:

(1)传统的checkpoint机制一般需要读写磁盘，而本发明将容错时所需的副本缓存于内存中。与传统checkpoint机制相比，读写内存能更快速地实现副本记录和错误恢复；(1) The traditional checkpoint mechanism generally needs to read and write the disk, but the present invention caches the copy required for fault tolerance in the memory. Compared with the traditional checkpoint mechanism, reading and writing memory can realize copy recording and error recovery faster;

(2)本发明基于P2P的副本分派避免了副本记录和错误恢复过程中的单一热点现象，使得分布式处理器之间的网络通信量较为平均，减少了每一个迭代步中的总体通信时间；(2) The P2P-based copy assignment of the present invention avoids the single hotspot phenomenon in the copy recording and error recovery process, makes the network traffic between the distributed processors more average, and reduces the overall communication time in each iteration step;

(3)本发明支持动态容错，即图处理在运算过程中发生异常时，可以从离异常点最近的数据单元副本恢复运算，而不必重新计算那些已经正确完成的数据，也不必从头重新执行整个图处理应用。(3) The present invention supports dynamic fault tolerance, that is, when an exception occurs during the operation of graph processing, the operation can be resumed from the copy of the data unit closest to the abnormal point, without having to recalculate the data that has been completed correctly, and without having to re-execute the entire graph from the beginning. image processing applications.

附图说明Description of drawings

图1是基于BSP模型的分布式并行图处理工作流程图。Figure 1 is a workflow flowchart of distributed parallel graph processing based on the BSP model.

图2是基于P2P的数据单元副本生成示意图。Fig. 2 is a schematic diagram of P2P-based data unit copy generation.

图3是两个相邻处理器节点在相邻迭代步中进行副本更新的示意图。Fig. 3 is a schematic diagram of two adjacent processor nodes performing replica update in adjacent iteration steps.

图4是错误恢复与P2P副本的重建示意图。Figure 4 is a schematic diagram of error recovery and reconstruction of P2P replicas.

图5是具体实例中UniAS架构示意图。Fig. 5 is a schematic diagram of UniAS architecture in a specific example.

具体实施方式Detailed ways

为使本发明的上述目的、特征和优点能够更加明显易懂，下面通过具体实施例和附图，对本发明做进一步说明。In order to make the above objects, features and advantages of the present invention more obvious and understandable, the present invention will be further described below through specific embodiments and accompanying drawings.

本发明的分布式并行图处理中基于P2P的动态容错的方法，包括如下步骤：The P2P-based dynamic fault-tolerant method in the distributed parallel graph processing of the present invention comprises the following steps:

(1)为了确保在动态容错过程中能够完整地恢复数据，本发明对分布式图处理的数据单元进行了定义。(1) In order to ensure that data can be recovered completely during the dynamic fault-tolerant process, the present invention defines the data unit for distributed graph processing.

由于在图处理中，每个顶点运算所需的参数不仅包含顶点自身的原状态，还需要其邻接顶点所传播的消息。因此，在定义图处理问题的数据单元时，不仅要考虑对前一迭代的顶点中数据的依赖，还需要考虑邻接顶点的消息依赖。为了更直观地描述数据单元，本发明定义图结构如下：G＝(V,E)，其中V表示图中顶点的集合{v₁,v₂,v₃…v_n}，E表示图中的有向边组成的集合。初始的图结构划分成若干图分区组成的集合，表示为P＝{P₁,P₂,P₃…P_m}，这些分区在运行过程中将分派到各个处理器节点中。对于任意顶点v_i，其在某一迭代步所收到的消息集合为InMsg(v_i)。进一步，对于某一子图P_j，其包含的所有顶点在某一迭代步所收到的消息集合为InMsg(P_j)＝{InMsg(v_i)|all v_i in P_j}。于是，本发明定义分布式图处理的数据单元为二元组(P_j,InMsg(P_j))，其中1≤j≤m，在运算初始时，InMsg(P_j)为空集。在下文的容错技术中，要求将二元组作为一个整体看待，同时进行复制或恢复。In graph processing, the parameters required for each vertex operation include not only the original state of the vertex itself, but also the messages propagated by its adjacent vertices. Therefore, when defining the data unit of the graph processing problem, not only the dependency on the data in the vertices of the previous iteration should be considered, but also the message dependencies of adjacent vertices need to be considered. In order to describe the data unit more intuitively, the present invention defines the graph structure as follows: G=(V, E), where V represents the set of vertices {v ₁ , v ₂ , v ₃ ...v _n } in the graph, and E represents the vertices in the graph A collection of directed edges. The initial graph structure is divided into a set of several graph partitions, expressed as P={P ₁ , P ₂ , P ₃ . . . P _m }, and these partitions will be assigned to each processor node during operation. For any vertex v _i , the set of messages it receives in a certain iteration step is InMsg(v _i ). Further, for a certain subgraph P _j , the set of messages received by all vertices contained in it at a certain iteration step is InMsg(P _j )={InMsg(v _i )|all v _i in P _j }. Therefore, the present invention defines the data unit for distributed graph processing as a tuple (P _j , InMsg(P _j )), where 1≤j≤m, and InMsg(P _j ) is an empty set at the beginning of the operation. In the following fault-tolerant techniques, it is required to treat the binary as a whole and perform replication or recovery at the same time.

(2)将处理器节点组成环状结构，每个节点把各自的数据单元生成副本备份在邻接处理器节点中。(2) The processor nodes are formed into a ring structure, and each node generates a copy of its own data unit for backup in adjacent processor nodes.

为了实现高效的容错，本发明并不将运算过程的副本保存于磁盘中，而是记录于内存中。然而，由于分布式图处理中的图规模较大，如果将每次迭代运算后的图生成副本保存在某一处理器节点中，则该节点可能成为数据传输热点，并进一步影响整体运算的效率。因此，本发明把副本的保存分散到各个处理器节点，以均衡副本产生过程中的数据传输量。本发明将参与运算过程的所有处理器节点编号，组成环状结构。根据(1)中对数据单元的定义，将输入的图数据划分成若干分区，以二元组(P_j,空的消息集合)的结构分派到各个处理器节点中。完成分派后，各个数据单元将其副本传输给编号相邻的两个处理器节点中保存。In order to achieve high-efficiency fault tolerance, the present invention does not save the copy of the operation process in the disk, but records it in the memory. However, due to the large scale of the graph in distributed graph processing, if the generated copy of the graph after each iteration operation is stored in a certain processor node, this node may become a data transmission hotspot and further affect the efficiency of the overall operation . Therefore, the present invention distributes the storage of copies to each processor node, so as to balance the amount of data transmission in the process of generating copies. The present invention numbers all processor nodes participating in the operation process to form a ring structure. According to the definition of data unit in (1), the input graph data is divided into several partitions, and assigned to each processor node in the structure of a tuple (P _j , an empty message set). After the allocation is completed, each data unit transmits its copy to two processor nodes with adjacent numbers for storage.

因此，每个处理器节点既保存自己被分派的数据单元(以下称为运算单元)，也缓存着相邻节点中的数据单元副本(以下称为镜像单元)，如图2所示，其中A～H表示运算单元，A’～H’表示镜像单元。对于自己被分派的数据单元，处理器将基于BSP模型对数据单元中的每一个顶点进行运算；对于相邻节点的数据单元副本，处理器只是记录，并等待每一次迭代步后，相邻节点发送数据来更新这些副本。Therefore, each processor node not only saves its assigned data unit (hereinafter referred to as the operation unit), but also caches a copy of the data unit in the adjacent node (hereinafter referred to as the mirror unit), as shown in Figure 2, where A ~H represents the operation unit, and A'~H' represents the mirror image unit. For the data unit assigned to itself, the processor will operate on each vertex in the data unit based on the BSP model; for the copy of the data unit of the adjacent node, the processor just records and waits for each iteration step, the adjacent node Send data to update these replicas.

(3)在运算中，每个处理器节点执行完自身的数据单元后，以增量的方式更新其放置于相邻处理器节点中的副本。(3) In operation, after each processor node executes its own data unit, it updates its copies placed in adjacent processor nodes in an incremental manner.

在分布式图处理系统中，图的每个顶点是随着迭代运算不断更新的，这使得步骤(2)中所设置的副本需要不断更新。而由于每个顶点的运算需要其相邻顶点在上一迭代步所传播的消息，这使得在数据恢复过程中必须考虑多个数据单元副本所处的迭代步之间的关系问题。In a distributed graph processing system, each vertex of the graph is continuously updated with iterative operations, which makes the copy set in step (2) need to be continuously updated. And because the operation of each vertex requires the information propagated by its adjacent vertices in the previous iteration step, this makes it necessary to consider the relationship between the iteration steps where multiple data unit copies are located in the data recovery process.

为了简化这一问题，保证副本的更新和副本所处迭代步的一致。本发明基于BSP计算模型，在两个迭代步之间增添了副本更新阶段，在该阶段，要求图处理中的所有数据单元都要根据当前值更新其副本。副本更新阶段可以发生在每个迭代步完成后，也可以根据系统错误发生的频率，综合考虑副本更新所消耗的时间代价，每隔一定迭代步进行一次副本更新。In order to simplify this problem, the update of the copy is guaranteed to be consistent with the iteration step of the copy. The present invention is based on the BSP calculation model, adding a copy updating stage between two iteration steps, and at this stage, all data units in the graph processing are required to update their copies according to the current value. The replica update phase can occur after each iteration step is completed, or it can be performed every certain iteration step according to the frequency of system errors and the time cost of replica update.

为了降低每次副本传输所消耗的时间与网络资源代价，并减少副本缓存所占用的内存空间，本发明提出了增量式的副本更新。每次对副本中图分区状态的更新只传输发生变化的部分，相邻处理器节点将变化的部分累加到数据单元副本中，实现副本中图分区状态的更新。此外，由于在图处理的每一个迭代步中，每个顶点将执行如下操作，首先处理在上一次迭代步所收到的消息集，并根据这些消息更新自己的状态，再向邻接顶点传播自己的新状态。因此，同一顶点在不同迭代步之间的消息是不需要累积的，当某一迭代步已计算出顶点的更新值后，在该迭代步之前所收到的消息都不再需要。以某个图分区P_j为例，在初始时将(P_j ⁰,null)生成副本缓存在相邻处理器节点中。在第1个迭代步运算完成时，该分区的状态更新为P_j ¹，令分区增量△P¹为P_j ¹与P_j ⁰的差值。则P_j所在的处理器单元将(△P¹,InMsg(P_j)[1])发送给相邻的处理器节点。相邻处理器节点将△P¹累积到副本单元(P_j ⁰,null)中生成P_j ¹，并替换消息集合为InMsg(P_j)[1]。以此类推，在第k个迭代步运算完成时，令分区增量△P^k为P_j ^k与P_j ^k-1的差值。在副本更新阶段，将(△P^k,InMsg(P_j)[k])发送给相邻处理器节点。相邻处理器节点将△P^k累积到副本单元(P_j ^k-1,InMsg(P_j)[k-1])中生成P_j ^k，并删除InMsg(P_j)[k-1]，替换消息集合为InMsg(P_j)[k]。图3展示了两个相邻处理器节点在相邻迭代步中进行副本更新的过程。In order to reduce the time consumed by each copy transmission and the cost of network resources, and reduce the memory space occupied by the copy cache, the present invention proposes an incremental copy update. Each update of the state of the graph partition in the copy only transmits the changed part, and the adjacent processor nodes accumulate the changed part into the copy of the data unit to realize the update of the state of the graph partition in the copy. In addition, in each iterative step of graph processing, each vertex will perform the following operations, first process the message set received in the last iterative step, and update its own state according to these messages, and then propagate its own state to adjacent vertices new state of . Therefore, the messages of the same vertex between different iteration steps do not need to be accumulated. When the update value of a vertex has been calculated in a certain iteration step, the messages received before the iteration step are no longer needed. Taking a certain graph partition P _j as an example, a copy of (P _j ⁰ , null) is initially cached in an adjacent processor node. When the first iteration step operation is completed, the state of the partition is updated to P _j ¹ , and the partition increment ΔP ¹ is the difference between P _j ¹ and P _j ⁰ . Then the processor unit where P _j is located sends (ΔP ¹ , InMsg(P _j )[1]) to the adjacent processor node. Adjacent processor nodes accumulate △P ¹ into the replica unit (P _j ⁰ , null) to generate P _j ¹ , and replace the message set as InMsg(P _j )[1]. By analogy, when the operation of the kth iteration step is completed, let the partition increment ΔP ^k be the difference between P _j ^k and P _j ^k-1 . In the replica update phase, (△P ^k , InMsg(P _j )[k]) is sent to adjacent processor nodes. Adjacent processor nodes accumulate △P ^k into the replica unit (P _j ^k-1 ,InMsg(P _j )[k-1]) to generate P _j ^k , and delete InMsg(P _j )[k-1], The set of replacement messages is InMsg(P _j )[k]. Figure 3 shows the process of two adjacent processor nodes performing replica updates in adjacent iterations.

(4)当某个处理器节点失效或因网络出错而离线时，指派其相邻节点利用数据副本替代原先的数据单元，完成相应运算，从而恢复图处理的正常执行。(4) When a processor node fails or is offline due to a network error, its adjacent nodes are assigned to replace the original data unit with a copy of the data to complete the corresponding operation, thereby restoring the normal execution of graph processing.

当图处理运行到某一迭代步k时，如果某个处理器节点失效或因网络出错而离线，在该离线节点中的数据单元无法继续运算，但它们的副本(P_j,InMsg(P_j))缓存于相邻节点中。基于步骤(3)所提到的副本数据增量保存的方式，本发明可以从相邻节点的副本中恢复出这些丢失的数据单元，如图4所示。When the graph processing runs to a certain iteration step k, if a processor node fails or goes offline due to a network error, the data units in the offline node cannot continue to operate, but their copies (P _j ,InMsg(P _j )) are cached in adjacent nodes. Based on the method of incremental storage of replica data mentioned in step (3), the present invention can restore these lost data units from the replicas of adjacent nodes, as shown in FIG. 4 .

为此，本发明设计了如下动态容错算法：For this reason, the present invention has designed following dynamic fault tolerance algorithm:

Recovery_scheduling(离线节点ID，离线节点中的数据单元集合Set){Recovery_scheduling (offline node ID, data unit collection Set in offline nodes) {

获取离线节点的两个相邻节点L,R；Obtain the two adjacent nodes L, R of the offline node;

if(L在线&&R在线){if(L online && R online){

根据节点L,R的负载情况，将Set划分成两个子集SetL,SetR；According to the load of nodes L and R, Set is divided into two subsets SetL and SetR;

节点L将自己缓存中与SetL相对应的镜像单元设置成运算单元，并完成这些SetL中的所有顶点在当前迭代步的运算；Node L sets the mirror unit corresponding to SetL in its own cache as an operation unit, and completes the operation of all vertices in these SetL in the current iteration step;

节点R将自己缓存中与SetR相对应的镜像单元设置成运算单元，并完成这些SetR中的所有顶点在当前迭代步的运算；Node R sets the mirror unit corresponding to SetR in its own cache as an operation unit, and completes the operation of all vertices in these SetR in the current iteration step;

设置节点L,R为相邻节点；Set nodes L and R as adjacent nodes;

对比节点L,R中的运算单元，若在对方节点中不存在镜像单元的，生成该数据单元的副本发送到对方节点中；Comparing the computing units in nodes L and R, if there is no mirror unit in the other node, generate a copy of the data unit and send it to the other node;

恢复正常执行；resume normal execution;

}else if(L在线||R在线){}else if(L online || R online){

N＝在线的节点；N = online nodes;

节点N将自己缓存中与Set相对应的镜像单元设置成运算单元，并完成这些Set中的所有顶点在当前迭代步的运算；The node N sets the mirror unit corresponding to the Set in its own cache as the operation unit, and completes the operation of all the vertices in the Set in the current iteration step;

寻找N的相邻节点；Find the adjacent nodes of N;

对比节点N与其相邻节点中的运算单元，若在对方节点中不存在镜像单元的，生成该数据单元的副本发送到对方节点中；Comparing the computing units in the node N and its adjacent nodes, if there is no mirror unit in the other node, generate a copy of the data unit and send it to the other node;

恢复正常执行；resume normal execution;

}}

由于上述方案是对原有BSP模型的扩展，所以并不影响图处理系统在正常环境下的执行。而利用本发明所提供的方案，图处理系统的容错并不需要停止当前运算，全部回滚到最近的checkpoint点，从磁盘中读取副本数据。在处理器节点离线时，图处理系统只需要对出错的局部节点进行调整恢复，而其他正常节点所完成的运算结果可以继续用到下一个迭代步中。Since the above solution is an extension of the original BSP model, it does not affect the execution of the graph processing system in a normal environment. However, using the solution provided by the present invention, the fault tolerance of the graph processing system does not need to stop the current operation, and all roll back to the nearest checkpoint point to read the copy data from the disk. When the processor node is offline, the graph processing system only needs to adjust and restore the error local node, and the calculation results completed by other normal nodes can continue to be used in the next iteration step.

至此，本发明完成了分布式并行图处理中的动态容错。So far, the present invention completes dynamic fault tolerance in distributed parallel graph processing.

以下给出一个在分布式并行运算平台UniAS上使用本发明构建支持动态容错的分布式并行图处理的实施案例。An implementation example of using the present invention to construct distributed parallel graph processing supporting dynamic fault tolerance on a distributed parallel computing platform UniAS is given below.

UniAS是由北京大学信息科学技术学院软件研究所自主研发的分布式并行运算平台，目前已支持包括批处理、图处理和流处理等多种模式的大数据处理应用。下面围绕UniAS中的图处理框架介绍本发明的实施过程。UniAS is a distributed parallel computing platform independently developed by the Institute of Software, School of Information Science and Technology, Peking University. It currently supports big data processing applications in various modes, including batch processing, graph processing, and stream processing. The following describes the implementation process of the present invention around the graph processing framework in UniAS.

如图5所示，UniAS中的图处理框架基于Master-Slave结构实现。控制器负责将图进行分区，并将各分区分派到处理器节点中，并监控各个节点的运行情况。每个处理器节点均有一个数据单元队列模块，负责运行过程中的副本维护和动态容错。根据本发明的技术要点，下文通过若干步骤实现动态容错的分布式图处理：As shown in Figure 5, the graph processing framework in UniAS is implemented based on the Master-Slave structure. The controller is responsible for partitioning the graph, assigning each partition to processor nodes, and monitoring the operation of each node. Each processor node has a data unit queue module, which is responsible for copy maintenance and dynamic fault tolerance during operation. According to the technical points of the present invention, the dynamic fault-tolerant distributed graph processing is realized through several steps below:

1.启动处理器节点，向控制器注册。控制器将上线的所有处理器节点进行编号，排列成环状结构；1. Start the processor node and register with the controller. The controller numbers all the online processor nodes and arranges them into a ring structure;

2.初始化阶段，控制器将输入的图结构进行分区，并将各分区分派到处理器节点中。分派完成后，触发各节点的副本生成操作，各节点将自身的数据单元制作副本发送到相邻节点中；2. In the initialization phase, the controller partitions the input graph structure and assigns each partition to the processor node. After the allocation is completed, the copy generation operation of each node is triggered, and each node makes a copy of its own data unit and sends it to the adjacent node;

3.启动图处理运算。根据本发明技术点3(即上述步骤(3))的要求，在每个迭代步完成时，求出每个顶点上一迭代步与当前迭代步结果的差值。若顶点状态有改变，则将差值加入分区增量中；若顶点在本次迭代中没有发生修改，则不加入分区增量。将每个分区的分区增量与消息集合发送给相邻节点，用于相邻节点的副本更新；3. Start the graph processing operation. According to the requirements of the technical point 3 of the present invention (i.e. the above-mentioned step (3)), when each iterative step is completed, the difference between the previous iterative step and the current iterative step result on each vertex is calculated. If the state of the vertex changes, the difference is added to the partition increment; if the vertex is not modified in this iteration, the difference is not added to the partition increment. Send the partition increment and message set of each partition to adjacent nodes for replica update of adjacent nodes;

4.每个处理器节点定期向控制器发送心跳消息。当控制器未能及时收到某个处理器节点的心跳消息时，控制器确认该处理器节点离线，进入错误恢复阶段；4. Each processor node periodically sends a heartbeat message to the controller. When the controller fails to receive the heartbeat message of a processor node in time, the controller confirms that the processor node is offline and enters the error recovery stage;

5.控制器调用本发明技术点4(即上述步骤(4))所设计的算法，利用相邻处理器的副本恢复丢失的数据，并重新设置处理器节点间的邻接状态，从而恢复程序的正常运行；5. The controller invokes the algorithm designed by the technical point 4 of the present invention (i.e. the above-mentioned step (4)), utilizes the copies of adjacent processors to restore lost data, and resets the adjacency state between processor nodes, thereby restoring the program normal operation;

6.当在某一个迭代步，所有顶点都处理完成，所有数据单元都不需要再运算更新时，整个图处理作业成功完成。6. When in a certain iteration step, all vertices are processed and all data units do not need to be updated, the entire graph processing job is successfully completed.

至此，利用本发明在UniAS平台上构建了分布式图处理框架的动态容错机制。So far, the present invention has constructed a dynamic fault-tolerant mechanism of a distributed graph processing framework on the UniAS platform.

以上实施例仅用以说明本发明的技术方案而非对其进行限制，本领域的普通技术人员可以对本发明的技术方案进行修改或者等同替换，而不脱离本发明的精神和范围，本发明的保护范围应以权利要求书所述为准。The above embodiments are only used to illustrate the technical solution of the present invention and not to limit it. Those of ordinary skill in the art can modify or equivalently replace the technical solution of the present invention without departing from the spirit and scope of the present invention. The scope of protection should be determined by the claims.

Claims

1. A dynamic fault-tolerant method based on P2P in distributed parallel graph processing, its steps comprising:

1) Define the data unit of the distributed graph processing problem to ensure the integrity of the recovered data during dynamic fault tolerance; the data unit is a two-tuple (P _j ,InMsg(P _j )), where P _j is the The structure is divided into a certain graph partition, that is, a certain subgraph; InMsg(P _j ) is the set of messages received by all vertices contained in the subgraph P _j in a certain iteration step. At the beginning of the operation, InMsg(P _j ) is an empty set; 1≤j≤m, m is the number of partitions after partitioning the graph;

2) Number the processor nodes and form a ring structure, divide the input graph data into several partitions according to the definition of data units, and assign each partition to each processor node, and each processor node divides its data unit Generate a copy backup in two processor nodes with adjacent numbers; the copy is not saved in the disk, but recorded in the memory;

3) In each iterative step of the operation, after each processor node executes its own data unit, it updates its copy placed in the adjacent processor node in an incremental manner, and each time the state of the graph partition in the copy is The update of the graph only transmits the changed part, and the adjacent processor nodes accumulate the changed part into the copy of the data unit to update the state of the graph partition in the copy; in each iterative step of graph processing, each vertex is first processed in The set of messages received in the last iteration step, and update its state according to these messages, and then propagate its new state to adjacent vertices; the messages of the same vertex between different iteration steps do not need to be accumulated. After calculating the updated value of the vertex, the messages received before this iteration step are no longer needed;

4) When a processor node fails or is offline due to a network error, its adjacent nodes are assigned to replace the original data unit with a data copy to complete the corresponding operation, thereby restoring the normal execution of graph processing.

2. The method according to claim 1, characterized in that: step 2) divides the input graph data into several partitions, and assigns them to each processor node with the structure of a binary group (P _j , an empty message set) middle.

3. The method according to claim 1, characterized in that: in step 2), for the data unit assigned to itself, the processor node operates on each vertex in the data unit based on the BSP model; For data unit copies, processor nodes just record and wait for each iteration step, when neighboring nodes send data to update these copies.

4. The method according to claim 1, characterized in that: step 3) performs copy update after each iterative step is completed; or according to the frequency of system errors, considering the time cost of copy update comprehensively, every certain The iteration step performs a replica update.

5. A distributed parallel graph processing system adopting the method according to claim 1, characterized in that it comprises a controller and a processor node, the controller is responsible for partitioning the input graph data, and assigning each partition to each In the processor node, and monitor the operation of each processor node; each processor node forms a ring structure, each processor node generates a copy of its own data unit and backs it up in the adjacent processor node, and each processor node executes After completing its own data unit, it incrementally updates its copies placed in adjacent processor nodes; when a processor node fails or goes offline due to a network error, the controller uses the data copies of its adjacent nodes Replaces the original data unit, resuming normal execution of graph processing.

6. The system according to claim 5, wherein each processor node sends a heartbeat message to the controller periodically, and when the controller fails to receive a heartbeat message from a certain processor node in time, the controller confirms the heartbeat message of the processor node. The processor node goes offline and enters the error recovery phase.