CN103885856B

CN103885856B - Graph computation fault-tolerance method and system based on a message regeneration mechanism

Info

Publication number: CN103885856B
Application number: CN201410085478.9A
Authority: CN
Inventors: 薛继龙; 曲直; 杨智; 代亚非
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2014-03-10
Filing date: 2014-03-10
Publication date: 2017-01-25
Anticipated expiration: 2034-03-10
Also published as: CN103885856A

Abstract

The invention relates to a fault-tolerance method and system for graph computation based on a message regeneration mechanism. During graph computation, the method saves, for the graph structure data, the change information of the graph structure between two adjacent snapshots and, for the message data, a vertex value set. When a failure occurs, the graph computing system is restored to the superstep corresponding to a previous valid snapshot using the saved graph structure change information and the saved vertex value set, and computation of the next superstep then starts. The system comprises a control node, multiple compute nodes and a distributed file system. To better handle failures of a graph computing system, the method and system make snapshot data lightweight, greatly shorten the generation and recovery times of snapshots, and improve the fault tolerance of the graph computing system.

Description

A fault-tolerance method and system for graph computation based on a message regeneration mechanism

Technical field

The invention belongs to the field of cloud computing, and in particular relates to an efficient fault-tolerance method and system for graph computing systems based on a message regeneration mechanism.

Background

Enterprises of all kinds increasingly apply graph computing systems to everyday data analysis and computation. The open-source community has produced many open-source graph computation frameworks, such as Apache Giraph and PowerGraph. Inspired by Pregel, these frameworks adopt a vertex-centric semantic model and the BSP (bulk synchronous parallel) computation model: they read graph data from a distributed storage system, compute iteratively in memory, and finally write the results to a database or file system.

A graph computation task is divided into a number of supersteps. In each superstep, every vertex processes the messages that its neighboring vertices sent in the previous superstep and, following the logic of the algorithm, sends messages to surrounding vertices for use in the next superstep. This message-based communication between vertices lets the framework partition the vertices across multiple machines easily, achieving distributed and parallel computation.
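The superstep model described above can be illustrated with a minimal single-process sketch. It is not taken from any of the frameworks named here: the class `BspSketch`, the method `run`, and the in-memory map layout are hypothetical names chosen for illustration, and the toy algorithm (propagating the largest vertex value) is an assumption picked only because it is short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy BSP loop: every vertex ends up with the largest value that can reach it. */
class BspSketch {
    /**
     * @param adjacency vertex id -> out-neighbor ids (hypothetical in-memory layout)
     * @param values    vertex id -> value; must contain every vertex, updated in place
     * @return the number of supersteps executed
     */
    static int run(Map<Integer, List<Integer>> adjacency, Map<Integer, Integer> values) {
        // Superstep 0: each vertex sends its initial value along its out-edges.
        Map<Integer, List<Integer>> inbox = new HashMap<>();
        for (Map.Entry<Integer, List<Integer>> e : adjacency.entrySet()) {
            send(inbox, e.getValue(), values.get(e.getKey()));
        }
        int supersteps = 1;
        while (!inbox.isEmpty()) {
            // Messages produced in this superstep are only visible in the next one.
            Map<Integer, List<Integer>> outbox = new HashMap<>();
            for (Map.Entry<Integer, List<Integer>> e : inbox.entrySet()) {
                int vertex = e.getKey();
                int best = values.get(vertex);
                for (int message : e.getValue()) {
                    best = Math.max(best, message);
                }
                if (best > values.get(vertex)) {
                    values.put(vertex, best);
                    List<Integer> targets =
                            adjacency.getOrDefault(vertex, Collections.<Integer>emptyList());
                    send(outbox, targets, best);
                }
            }
            inbox = outbox; // synchronization barrier between supersteps
            supersteps++;
        }
        return supersteps;
    }

    static void send(Map<Integer, List<Integer>> box, List<Integer> targets, int payload) {
        for (int target : targets) {
            box.computeIfAbsent(target, k -> new ArrayList<>()).add(payload);
        }
    }
}
```

Note that the `inbox` map is exactly the cached message data that a traditional checkpoint would have to persist after every superstep.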

For a distributed computing framework, fault tolerance is essential. The traditional MapReduce framework usually recomputes failed subtasks, which works because the subtasks of each computation task have the following properties:

1. Subtasks are stateless, and all data needed for the computation can be obtained from the file system;

2. Subtasks have no dependencies on, or communication with, one another.

In a graph computing system, neither property holds:

1. Subtasks are stateful: the graph structure data and the vertex states are kept in memory and cannot be obtained from the file system;

2. Subtasks exchange large numbers of messages, and the subtasks on each machine cache the message data that other machines sent in the previous superstep.

As a result, graph computing systems cannot use MapReduce's subtask recomputation. Existing open-source implementations usually rely on a checkpoint (snapshot) mechanism: after each superstep ends, the graph structure data, the vertex states, and the messages received in this round for use in the next superstep are saved together to the distributed file system as one snapshot. When a failure occurs, all nodes discard their current computation state and intermediate data and reload the most recent snapshot from the file system. Although this method is simple to implement, its network and disk overheads are large and seriously slow down the computation, so the feature is often turned off in practice.

Summary of the invention

To better handle failures of graph computing systems, the present invention proposes an efficient fault-tolerance method for graph computing systems based on a message regeneration mechanism, and a graph computing system using the method. It makes snapshot data lightweight, greatly shortens the generation and recovery times of snapshots, and improves the fault tolerance of the graph computing system.

To achieve the above object, the invention adopts the following technical solution:

A fault-tolerance method for graph computation based on a message regeneration mechanism, comprising the following steps:

1) during graph computation, saving, for the graph structure data, the change information of the graph structure between two adjacent snapshots, and saving, for the message data, a vertex value set;

2) when a failure occurs, using the saved graph structure change information and the vertex value set to restore the graph computing system to the superstep corresponding to a previous valid snapshot, and then starting the computation of the next superstep.

Further, the previous valid snapshot in step 2) is any valid snapshot taken before the failure, preferably the last valid snapshot before the failure.

Further, the snapshot generation process of step 1) at the end of superstep k is as follows:

a) store the graph change information $\delta_{k-1,k}$ relative to the previous snapshot;

b) store the values of all vertices, forming the vertex value set $v_k$.

Further, let $\delta_{i,j}$ be the incremental change of the j-th superstep relative to the i-th superstep, and let $g_k$ be the graph structure data snapshot that must be read to restore to the k-th superstep; then the process of step 2) for restoring the state at superstep k when a failure occurs is as follows:

a) read $g\,\delta_{0,1}\,\delta_{1,2}\cdots\delta_{k-2,k-1}$ in order, obtaining $g_{k-1}$;

b) read the vertex value set $v_k$ and recover the message set $m_k$ from $v_k$;

c) read $\delta_{k-1,k}$ and form $g_{k-1}\,\delta_{k-1,k} = g_k$.

A graph computing system using the above method comprises a control node, multiple compute nodes and a distributed file system. The control node assigns computation tasks to the compute nodes, synchronizes the compute nodes, detects compute node failures, and controls the progress of the computation. Each compute node performs the computation for specific tasks and keeps in memory a subset of the graph's vertices together with their connections to other vertices. The distributed file system stores the static information of the graph and the snapshot data produced at run time, including the change information of the graph structure between two adjacent snapshots and the vertex value set for the message data. When the control node detects that a compute node has failed, it directs the compute nodes to restore a previous valid state.

The fault-tolerance method of the invention combines incremental snapshots of graph structure data with a message data regeneration mechanism to realize a lightweight snapshot and a matching recovery method. It greatly reduces the amount of data needed to generate and restore snapshots, thereby lowering network bandwidth and disk demands and improving failure recovery efficiency. The method is precise and fine-grained, has low implementation complexity, is easy to maintain, and has high practical value.

Brief description of the drawings

Fig. 1 shows an implementation of the PageRank algorithm and its regenerate function.

Fig. 2 shows an implementation of the SSSP algorithm and its regenerate function.

Fig. 3 is a schematic diagram of the structure of the graph computing system.

Fig. 4 compares the computation and snapshot times of the Giraph system and the system of the invention.

Fig. 5 compares the checkpoint sizes of the k-core task on the Giraph system and the system of the invention.

Fig. 6 compares the PageRank recovery times of the Giraph system and the system of the invention.

Detailed description of the embodiments

The invention is further described below through specific embodiments and the accompanying drawings.

Traditional snapshot data has two parts: a) a snapshot of the graph structure data; b) a snapshot of the cached message data to be used by the next superstep. The invention reduces the volume of each part with a different technique: incremental snapshots of the graph structure data and a message data regeneration mechanism. Together these make snapshot data lightweight, so that snapshots are generated and restored quickly and the generation and recovery times are greatly shortened.

1. Incremental graph structure data snapshot

For graph structure data, instead of snapshotting the whole graph structure as the traditional method does, the invention uses incremental snapshots and saves the changes to the graph structure between two snapshots in an append-only log. Formally, let $\delta_{i,j}$ be the incremental change of the j-th superstep relative to the i-th superstep; the data saved at the end of the k-th superstep is then $\delta_{k-1,k}$. If the system must be restored to the x-th superstep, the graph structure data snapshot that must be read is:

$g_x = g\,\delta_{0,1}\,\delta_{1,2}\cdots\delta_{x-1,x}$, where $g_x$ denotes the graph structure data at the end of the x-th superstep and $g$ denotes the original graph structure data.
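The replay of logged increments onto the original graph can be sketched as follows. The record format is an assumption: the text only says that changes are logged by appending, so `EdgeChange`, the edge-only change model, and the string edge keys are hypothetical simplifications (a real log might also record vertex insertions and deletions).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Rebuilds g_x by replaying the logged increments on top of the original graph g. */
class DeltaLogSketch {
    /** One log record: an edge added or removed (hypothetical record format). */
    static final class EdgeChange {
        final boolean added;
        final int src;
        final int dst;

        EdgeChange(boolean added, int src, int dst) {
            this.added = added;
            this.src = src;
            this.dst = dst;
        }
    }

    static String key(int src, int dst) {
        return src + "->" + dst;
    }

    /**
     * @param base   edge set of the original graph g
     * @param deltas the increments delta_{0,1}, delta_{1,2}, ..., delta_{x-1,x}, in order
     * @return the edge set of g_x; the base set is left untouched
     */
    static Set<String> restore(Set<String> base, List<List<EdgeChange>> deltas) {
        Set<String> graph = new HashSet<>(base);
        for (List<EdgeChange> delta : deltas) {
            for (EdgeChange change : delta) {
                if (change.added) {
                    graph.add(key(change.src, change.dst));
                } else {
                    graph.remove(key(change.src, change.dst));
                }
            }
        }
        return graph;
    }

    /** Appends one record to the log of the current superstep. */
    static void log(List<EdgeChange> currentDelta, boolean added, int src, int dst) {
        currentDelta.add(new EdgeChange(added, src, dst));
    }

    static List<EdgeChange> newDelta() {
        return new ArrayList<>();
    }
}
```

Because each increment only holds the edges that changed during one superstep, the cost of writing a snapshot scales with the amount of mutation rather than with the size of the graph; the price is that recovery has to replay every increment in order.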

2. Message snapshot based on the message data regeneration mechanism

Let $m_k$ be the set of messages produced in superstep k and $v_k$ the set of all vertex values. For message data, instead of saving all of $m_k$ as the traditional method does, the invention saves only the vertex value set $v_k$; during recovery, the messages that each vertex sent in that superstep are regenerated directly from its vertex value.

If the average degree of a vertex is $\beta$, the snapshot of $v_k$ is $1/\beta$ the size of the snapshot of $m_k$. Because computation models differ, the invention extends the semantics of the computation model and exposes a recovery interface to the upper-layer application, letting the programmer decide how to recover the messages sent in a superstep from the vertex values. A typical programming interface is:

void regenerate(T value), where T value is the value of the current vertex at the end of superstep k.

During recovery, the system calls the regenerate function on every vertex to recover the messages sent from that vertex. In testing, the vast majority of algorithms that run on graph computing platforms fit this interface. As examples, Fig. 1 and Fig. 2 show implementations of two typical algorithms, PageRank and SSSP (single-source shortest path), together with their corresponding regenerate functions.
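Since the figures are not reproduced here, the following is a minimal sketch of what a regenerate function for a PageRank-style vertex might look like. Only the signature `void regenerate(T value)` comes from the text; the `Vertex` base class, the `Message` type, the `outbox` field, and the rule that a vertex sends its rank divided by its out-degree to each neighbor are assumptions based on the usual Pregel formulation of PageRank, not on Fig. 1.

```java
import java.util.ArrayList;
import java.util.List;

class RegenerateSketch {
    /** A message addressed to a target vertex (hypothetical type). */
    static final class Message {
        final int target;
        final double payload;

        Message(int target, double payload) {
            this.target = target;
            this.payload = payload;
        }
    }

    /** Hypothetical vertex base class; only regenerate(T) mirrors the interface in the text. */
    abstract static class Vertex<T> {
        final int id;
        final int[] outNeighbors;
        final List<Message> outbox = new ArrayList<>();

        Vertex(int id, int[] outNeighbors) {
            this.id = id;
            this.outNeighbors = outNeighbors;
        }

        /** Re-sends the messages this vertex sent in superstep k, given its value at the end of k. */
        abstract void regenerate(T value);
    }

    /** PageRank-style vertex: the messages of a superstep depend only on the rank and the out-edges. */
    static final class PageRankVertex extends Vertex<Double> {
        PageRankVertex(int id, int[] outNeighbors) {
            super(id, outNeighbors);
        }

        @Override
        void regenerate(Double value) {
            if (outNeighbors.length == 0) {
                return; // a vertex without out-edges sent nothing
            }
            double share = value / outNeighbors.length;
            for (int neighbor : outNeighbors) {
                outbox.add(new Message(neighbor, share));
            }
        }
    }
}
```

The interface fits whenever the outgoing messages of a superstep are a function of the vertex value at the end of that superstep and the local edges, which is why saving the vertex values alone is enough to rebuild the message set.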

3. Snapshot process

At the end of superstep k, the snapshot is generated as follows:

a) store the graph change information $\delta_{k-1,k}$ relative to the previous snapshot;

b) store the values of all vertices, forming $v_k$.

4. Recovery process

When a failure occurs, the state at superstep k is restored as follows:

a) read $g\,\delta_{0,1}\,\delta_{1,2}\cdots\delta_{k-2,k-1}$ in order, obtaining $g_{k-1}$;

b) read $v_k$ and, using $v_k$, call the regenerate function on every vertex to recover $m_k$;

c) read $\delta_{k-1,k}$ and form $g_{k-1}\,\delta_{k-1,k} = g_k$.

After these steps, the state of the graph computing system has been restored to superstep k; recovery is complete, and the computation of the next superstep can begin.
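The snapshot and recovery processes can be put together in one small round-trip sketch. Everything concrete in it is an assumption made for illustration: the `Store` class stands in for the distributed file system, increments are modeled as edge additions and removals only, and the regenerate step uses a PageRank-style rule (value divided by out-degree). The order of operations follows steps a) to c) above, so messages are regenerated over $g_{k-1}$ before the last increment is applied.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class RecoverySketch {
    /** One log record: add (true) or remove (false) the edge src -> dst. */
    static final class Change {
        final boolean add;
        final int src;
        final int dst;

        Change(boolean add, int src, int dst) {
            this.add = add;
            this.src = src;
            this.dst = dst;
        }
    }

    /** In-memory stand-in for the distributed file system (assumption). */
    static final class Store {
        final Map<Integer, Set<Integer>> base = new HashMap<>();            // g
        final Map<Integer, List<Change>> deltas = new HashMap<>();          // k -> delta_{k-1,k}
        final Map<Integer, Map<Integer, Double>> values = new HashMap<>();  // k -> v_k
    }

    /** Snapshot at the end of superstep k: only the increment and the vertex values are written. */
    static void snapshot(Store s, int k, List<Change> delta, Map<Integer, Double> vk) {
        s.deltas.put(k, new ArrayList<>(delta));
        s.values.put(k, new HashMap<>(vk));
    }

    static void apply(Map<Integer, Set<Integer>> g, List<Change> delta) {
        for (Change c : delta) {
            Set<Integer> out = g.computeIfAbsent(c.src, x -> new TreeSet<>());
            if (c.add) {
                out.add(c.dst);
            } else {
                out.remove(c.dst);
            }
        }
    }

    /** Fills g with g_k and returns the regenerated m_k as target vertex -> payloads. */
    static Map<Integer, List<Double>> recover(Store s, int k, Map<Integer, Set<Integer>> g) {
        // a) replay g, delta_{0,1}, ..., delta_{k-2,k-1} to obtain g_{k-1}
        g.clear();
        for (Map.Entry<Integer, Set<Integer>> e : s.base.entrySet()) {
            g.put(e.getKey(), new TreeSet<>(e.getValue()));
        }
        for (int i = 1; i < k; i++) {
            apply(g, s.deltas.getOrDefault(i, Collections.<Change>emptyList()));
        }
        // b) read v_k and regenerate m_k (PageRank-style rule, an assumption of this sketch)
        Map<Integer, List<Double>> mk = new HashMap<>();
        for (Map.Entry<Integer, Double> e : s.values.get(k).entrySet()) {
            Set<Integer> out = g.getOrDefault(e.getKey(), new TreeSet<Integer>());
            for (int dst : out) {
                mk.computeIfAbsent(dst, x -> new ArrayList<>()).add(e.getValue() / out.size());
            }
        }
        // c) apply delta_{k-1,k} to obtain g_k
        apply(g, s.deltas.getOrDefault(k, Collections.<Change>emptyList()));
        return mk;
    }
}
```

After `recover` returns, the caller holds $g_k$, $v_k$ and $m_k$, which is the same state a traditional full checkpoint of superstep k would have restored.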

The experiments carried out for the method of the invention are described in detail below. They fully implement the graph computing system and the snapshot and recovery methods described above, use real graph data and common graph algorithms as the workload, simulate system failures, and compare against the traditional snapshot recovery method in order to measure the performance and effectiveness of the method.

1. Implementation steps

1) First implement the required computing system framework, whose structure is shown in Fig. 3:

The system comprises one control node and multiple compute nodes. The control node assigns computation tasks to the compute nodes, synchronizes the compute nodes, detects compute node failures, and controls the progress of the computation. Each compute node performs the computation for specific tasks and keeps in memory a subset of the graph's vertices together with their connections to other vertices. Nodes exchange messages using the Apache MINA communication framework, similar to that of Giraph. The distributed file system stores the static information of the graph and the snapshot data produced at run time.

2) Start the system and assign it a computation task. The compute nodes read the graph structure data from the distributed file system and begin computing under the coordination of the control node. After each superstep ends, the compute nodes persist the snapshot data to the distributed file system.

3) When the control node detects that a compute node has failed, it directs the compute nodes to proceed as follows:

a) terminate the computation of the current superstep, clean up resources, and enter recovery mode;

b) find the last valid snapshot k and read $g_{k-1}$ from the file system;

c) read $v_k$, call the regenerate function on every vertex of the graph to recover $m_k$, and exchange messages between the compute nodes;

d) read $\delta_{k-1,k}$ and form $g_{k-1}\,\delta_{k-1,k} = g_k$;

e) exit recovery mode and start the computation of superstep k+1.
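The text does not say how the control node decides which snapshot is the last valid one. One plausible rule, shown here purely as an assumption, is that a snapshot counts as valid only if every compute node finished persisting it, so the control node takes the minimum over the last snapshot index reported by each node. The class and method names are hypothetical.

```java
import java.util.Collection;

class ControllerSketch {
    /**
     * Picks the newest snapshot that every compute node has fully persisted.
     * The validity rule is an assumption of this sketch, not a statement of the method.
     *
     * @param lastPersistedPerNode for each compute node, the index of its newest complete snapshot
     * @return the last valid snapshot k, or -1 if no node has reported anything
     */
    static int lastValidSnapshot(Collection<Integer> lastPersistedPerNode) {
        if (lastPersistedPerNode.isEmpty()) {
            return -1;
        }
        int k = Integer.MAX_VALUE;
        for (int persisted : lastPersistedPerNode) {
            k = Math.min(k, persisted);
        }
        return k;
    }

    /** After recovering to snapshot k, the computation resumes with superstep k+1 (step e). */
    static int resumeSuperstep(int k) {
        return k + 1;
    }
}
```

For example, if three nodes report snapshots 9, 8 and 9, recovery under this rule targets snapshot 8 and the computation resumes at superstep 9.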

2. Results

The algorithm was tested on 16 servers, each configured with 12 AMD Opteron 4180 cores and 48 GB of memory, connected by a gigabit network. The graph data used was a partial relationship dataset from a social network in 2009, with 25 million vertices, about 1.5 billion edges, and an average degree of about 60. For compatibility, the underlying distributed file system was HDFS 1.1.0, and the then-popular Apache Giraph 1.0.0 was used as the performance baseline.
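As a rough worked example, these figures can be plugged into the $1/\beta$ size estimate from the message snapshot section. Two simplifying assumptions that are not stated in the text are made here: every vertex sends exactly one message along each of its edges in a superstep, and a message occupies the same number of bytes as a vertex value.

```latex
\[
\frac{|v_k|}{|m_k|}
\approx \frac{\text{number of vertices}}{\text{number of edges}}
= \frac{2.5 \times 10^{7}}{1.5 \times 10^{9}}
= \frac{1}{60}
= \frac{1}{\beta},
\qquad \beta \approx 60 .
\]
```

Under those assumptions, the message part of each snapshot shrinks by a factor of about 60 on this dataset; algorithms that send fewer messages per superstep would see a smaller reduction.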

The experiments used five typical graph computation algorithms as workloads: RandomWalk, SSSP, WCC, k-core and PageRank, with a corresponding regenerate function implemented for each algorithm as the method requires. All systems and algorithms were implemented in Java 1.7.

Fig. 4 shows the computation and checkpoint times of the Giraph system and the experimental system of the invention on several typical tasks. Because the datasets and algorithm implementations are essentially the same, the computation times of the two systems are nearly identical; the lightweight snapshot, however, greatly shortens the snapshot generation time of the experimental system and speeds up the overall run.

Fig. 5 shows how the cumulative size of the snapshot files grows as the k-core task runs on the Giraph system and on the experimental system of the invention. The Giraph system takes a snapshot every 5 supersteps, whereas the experimental system of the invention takes one at every superstep. The lightweight snapshot files are far smaller than the traditional snapshots.

For recovery time, PageRank was used as a typical workload, and a failure was injected manually at the 9th superstep. As Fig. 6 shows, thanks to the message data regeneration mechanism, the experimental system of the invention needs to read far less data after a failure and recovers much faster than the Giraph system.

The above embodiments only illustrate the technical solution of the invention and do not limit it. A person of ordinary skill in the art may modify the technical solution or substitute equivalents without departing from the spirit and scope of the invention; the scope of protection of the invention is defined by the claims.

Claims

1. A fault-tolerance method for graph computation based on a message regeneration mechanism, comprising the steps of:

1) during graph computation, saving, for the graph structure data, the change information of the graph structure between two adjacent snapshots, and saving, for the message data, a vertex value set;

2) when a failure occurs, using the saved graph structure change information and the vertex value set to restore the graph computing system to the superstep corresponding to a previous valid snapshot, and then starting the computation of the next superstep; let $\delta_{i,j}$ be the incremental change of the j-th superstep relative to the i-th superstep, $g$ the original graph structure data, and $g_k$ the graph structure data at the end of the k-th superstep; then the process of restoring the state at the k-th superstep when a failure occurs is as follows:

a) read $g\,\delta_{0,1}\,\delta_{1,2}\cdots\delta_{k-2,k-1}$ in order, obtaining $g_{k-1}$;

b) read the vertex value set $v_k$ and recover the message set $m_k$ from $v_k$;

c) read $\delta_{k-1,k}$ and form $g_{k-1}\,\delta_{k-1,k} = g_k$.

2. the method for claim 1 is it is characterised in that step 1) snapshot generating process at the end of k-th superledge As follows:

B) store the value on all summits, form summit value set v_k.

3. the method for claim 1 is it is characterised in that step b) is according to v_kRecover massage set m_kThe journey being adopted Sequence interface is: void regenerate (t value), wherein t value are current vertex at the end of superstep k Value.

4. A graph computing system using the method of claim 1, characterized by comprising a control node, multiple compute nodes and a distributed file system, wherein the control node assigns computation tasks to the compute nodes, synchronizes the compute nodes, detects compute node failures, and controls the progress of the computation; each compute node performs the computation for specific tasks and keeps in memory a subset of the graph's vertices together with their connections to other vertices; the distributed file system stores the static information of the graph and the snapshot data produced at run time, including the change information of the graph structure between two adjacent snapshots and the vertex value set for the message data; when the control node detects that a compute node has failed, it directs the compute nodes to restore a previous valid state, that is, to proceed as follows:

a) terminate the computation of the current superstep, clean up resources, and enter recovery mode;

b) find a valid snapshot k taken before the failure and read $g_{k-1}$ from the file system;

c) read $v_k$, recover $m_k$ from $v_k$, and exchange messages between the compute nodes;

d) read $\delta_{k-1,k}$ and form $g_{k-1}\,\delta_{k-1,k} = g_k$;

e) exit recovery mode and start the computation of the (k+1)-th superstep.

5. The system of claim 4, characterized in that the valid snapshot k before the failure in step b) is the last valid snapshot before the failure.

6. The system of claim 4, characterized in that the nodes communicate by messages using the Apache MINA communication framework.