WO2019232956A1

WO2019232956A1 - Parallelization of graph computations

Info

Publication number: WO2019232956A1
Application number: PCT/CN2018/104689
Authority: WO
Inventors: Wenfei Fan; Wenyuan Yu; Jingbo XU
Original assignee: Zhejiang Tmall Technology Co., Ltd.
Priority date: 2018-06-08
Filing date: 2018-09-07
Publication date: 2019-12-12
Also published as: CN112074829A

Abstract

A method for asynchronously parallelizing graph computations, the method comprising: distributing a plurality of fragments across a number of workers so that each worker has at least one local fragment, the plurality of fragments being obtained by partitioning a graph and each fragment being a subgraph of the graph; computing, by each worker, a partial result over each of its at least one local fragment using a predefined sequential batch algorithm; iteratively computing, by each worker, an updated partial result over each of its at least one local fragment based on one or more update messages using a predefined sequential incremental algorithm until a termination condition is satisfied. The one or more update messages are received from one or more other workers, respectively, and are stored in a respective buffer. Each worker is allowed to decide when to perform a next round of computation based on its delay stretch, and each worker is put on hold for a time period indicated by the delay stretch before performing the next round of computation, the delay stretch being dynamically adjustable based on each worker's relative computing progress to other workers. Some embodiment may have effects of reducing stragglers and stale computations.

Description

PARALLELIZATION OF GRAPH COMPUTATIONS

TECHNICAL FIELD

The following disclosure relates to parallelization of graph computations.

BACKGROUND ART

Several parallel models are used for graphs. Bulk Synchronous Parallel (BSP) model has been adopted by graph systems. Under BSP, iterative computation is separated into supersteps, and messages from one superstep are only accessible in the next one. This leads to stragglers, i.e., some workers take substantially longer than the others. As workers converge asymmetrically, the speed of each superstep is limited to that of the slowest worker. To reduce stragglers, Asynchronous Parallel (AP) model has been employed. Under AP, a worker has immediate access to messages. Fast workers can move ahead, without waiting for stragglers. However, AP may incur excessive stale computations, i.e., processes triggered by messages that soon become stale due to more up-to-date messages. To rectify the problems, revisions of BSP and AP have been studied, notably Stale Synchronous Parallel (SSP) model. SSP relaxes BSP by allowing fastest workers to outpace the slowest ones by a fixed number of steps (bounded staleness) . It reduces stragglers, but incurs redundant stale computations.

SUMMARY OF THE INVENTION

In one aspect, a method for asynchronously parallelizing graph computations is provided. The method comprises: distributing a plurality of fragments across a number of workers so that each worker has at least one local fragment, the plurality of fragments being obtained by partitioning a graph and each fragment being a subgraph of the graph; computing, by each worker, a partial result over each of its at least one local fragment using a predefined sequential batch algorithm; and iteratively computing, by each worker, an updated partial result over each of its at least one local fragment based on one or more update messages using a predefined sequential incremental algorithm until a termination condition is satisfied. The one or more update messages are received from one or more other workers, respectively, and are stored in a respective buffer.

Each worker is allowed to decide when to perform a next round of computation based on its delay stretch, and each worker is put on hold for a time period indicated by the delay stretch before performing the next round of computation. The delay stretch can be dynamically adjusted based on each worker's relative computing progress to other workers.

One or more of the following features may also be included.

The delay stretch of each worker is adjusted by one or more parameters from the following group: the number of update messages stored in the respective buffer, the number of the one or more other workers from which the one or more update messages are received, the smallest and largest rounds being executed at all workers, running time prediction, query logs and other statistics collected from all workers.

When a worker is suspended during the delay stretch, its resources are allocated to one or more of the other workers.

Each worker sends a flag inactive to a master when it has no update messages stored in the respective buffer after its current round of computation. Upon receiving inactive from all workers, the master broadcasts a termination message to all workers. In response to the termination message, each worker responds with "acknowledgement" when it is inactive, or responds with "wait" when it is active or in the queue for a next round of computation. Upon receiving "acknowledgement" from all workers, the master pulls the updated partial results from all workers and applies a predefined assemble function to the updated partial results.

The predefined sequential incremental algorithm is monotonic.

The update message is based on a respective partial result and is defined by predefined update parameters.

In another aspect, provided is a system configured to perform the method for asynchronously parallelizing graph computations.

Certain implementations may provide one or more of the following advantages. Both stragglers and stale computations can be reduced by dynamically adjusting relative progress of workers. Correct convergence may also be guaranteed under a monotone condition. Other aspects, features, and advantages will be apparent from the following detailed description, the drawings and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments will be described with reference to the following drawing figures, in which:

FIG. 1 (a) depicts runs for computing a connected components (CC) example as shown in FIG. 1 (b) under different models.

FIG. 1 (b) depicts a CC example.

FIG. 2 shows PEval for CC under AAP.

FIG. 3 shows IncEval for CC under AAP.

FIG. 4 shows workflow of AAP.

FIG. 5 shows the architecture of GRAPE+.

FIG. 6 shows results of performance evaluation.

DETAILED DESCRIPTION OF EMBODIMENTS

The scheme described in this application for asynchronously parallelizing graph computations is referred to as Adaptive Asynchronous Parallel (AAP) model. AAP is a parallel model that inherits the benefits of BSP and AP, and reduces both stragglers and stale computations, without explicitly switching between the two. Better still, the AAP model can ensure consistency, and guarantee correct convergence under a general condition.

Neither AP nor BSP consistently outperforms the other for different algorithms, input graphs and cluster scales. For many graph algorithms, different stages in a single execution demand different models for optimal performance. Switching between AP and BSP, however, requires to predict switching points and incurs switching costs.

Without global synchronization barriers, AAP is essentially asynchronous. As opposed to BSP and AP, each worker under AAP maintains parameters to measure (a) its progress relative to other workers, and (b) changes accumulated by messages (staleness) . Each worker has immediate access to incoming messages, and decides whether to start the next round of computation based on its own parameters. In contrast to SSP, each worker dynamically adjusts its parameters based on its relative progress and message staleness, instead of using a fixed bound. The workers can be distributed processors, or processors in a single machine, or threads on a processor.

FIG. 1 (a) compares runs for computing connected components shown in Fig. 1 (b) under different parallel models.

Consider a computation task being conducted at three workers, where workers P ₁ and P ₂ take 3 time units to do one round of computation, P ₃ takes 6 units, and it takes 1 unit to pass messages. This is carried out under different models.

(1) BSP. As depicted in Fig. 1 (a) (1) , worker P ₃ takes twice as long as P ₁ and P ₂, and is a straggler. Due to its global synchronization, each superstep takes 6 time units, the speed of the slowest P ₃.

(2) AP. AP allows a worker to start the next round as soon as its message buffer is not empty. However, it comes with redundant stale computation. As shown in Fig. 1 (a) (2) , at clock time 7, the second round of P ₃ can only use the messages from the first round of P ₁ and P ₂. This round of P ₃ becomes stale at time 8, when the latest updates from P ₁ and P ₂ arrive. As will be seen later, a large part of the computations of faster P ₁ and P ₂ is also redundant.

(3) SSP. Consider bounded staleness of 1, i.e., the fastest worker can outpace the slowest one by at most 1 round. As shown in Fig. 1 (a) (3) , P ₁ and P ₂ are not blocked by the straggler in the first 3 rounds. However, like AP, the second round of P ₃ is stale. Moreover, P ₁ and P ₂ cannot start their

rounds

4 and 5 until P ₃ finishes its

rounds

2 and 3, respectively, due to the bounded staleness condition. As a result, P1, P2 and P3 behave like in BSP model after clock time 14.

(4) AAP. AAP allows a worker to accumulate changes and decides when to start the next round based on the progress of others. As shown in Fig. 1 (a) (4) , after P ₃ finishes one round of computation at clock time 6, it may start the next round at time 8, at which point the latest changes from P ₁ and P ₂ are available. As opposed to AP, AAP reduces redundant stale computation. This also helps us mitigate the straggler problem, since P ₃ can converge in less rounds by utilizing the latest updates from fast workers.

AAP reduces stragglers by not blocking fast workers. This is particularly helpful when the computation is CPU-intensive and skewed, when an evenly partitioned graph becomes skewed due to updates, or when we cannot afford evenly partitioning a large graph due to the partition cost. Moreover, AAP activates a worker only after it receives sufficient up-to-date messages and thus reduces redundant stale computations. This allows us to reallocate resources to useful computations via workload adjustments.

In addition, AAP differs from previous models in the following.

(1) Model switch. BSP, AP and SSP are special cases of AAP with fixed parameters. Hence AAP can naturally switch among these models at different stages of the same execution, without asking for explicit switching points or incurring the switching costs. As will be seen later, AAP is more flexible: some worker groups may follow BSP, while at the same time, the others run AP or SSP. (2) Programming paradigm. AAP can work with the programming model of GRAPE (Graphics Programming Environment) . It allows users to extend existing sequential (single-machine) graph algorithms with message declarations, and parallelizes the algorithms across a cluster of machines. It employs aggregate functions to resolve conflicts raised by updates from different workers, without worrying about race conditions or requiring extra efforts to enforce consistency by using, e.g., locks.

(3) Convergence guarantees. AAP is modeled as a simultaneous fixpoint computation. Based on this one of the first conditions is developed under which AAP parallelization of sequential algorithms guarantees (a) convergence at correct answers, and (b) the Church-Rosser property, i.e., all asynchronous runs converge at the same result, as long as the sequential algorithms are correct.

(4) Expressive power. Despite its simplicity, AAP can optimally simulate MapReduce, PRAM (Parallel Random Access Machine) , BSP, AP and SSP. That is, algorithms developed for these models can be migrated to AAP without increasing the complexity.

(5) Performance. AAP outperforms BSP, AP and SSP for a variety of graph computations. As an example, for PageRank and SSSP (single-source shortest path) on Friendster with 192 workers, Table 1 shows the performance of (a) Giraph (an open-source version of Pregel) and GraphLab under BSP, (b) GraphLab and Maiter under AP, (c) GiraphUC under BAP, (d) PowerSwitch under Hsync, and (e) GRAPE+, an extension of GRAPE by supporting AAP. GRAPE+ does better than these systems.

Table 1: Page Rank and SSSP on parallel systems

Parallel Random Access Machine (PRAM) supports parallel RAM access with shared memory, not for the shared-nothing architecture that is used nowadays. MapReduce is adopted by, e.g., GraphX. However, it is not very efficient for iterative graph computations due to its blocking and I/O costs. BSP with vertex-centric programming works better for graphs as shown in some cases. However, it suffers from stragglers. As remarked earlier, AP reduces stragglers, but it comes with redundant stale computation. It also bears with race conditions and their locking/unblocking costs, and complicates the convergence analysis and programming.

SSP promotes bounded staleness for machine learning. Maiter reduces stragglers by accumulating updates, and supports prioritized asynchronous execution. BAP model (barrierless asynchronous parallel) reduces global barriers and local messages by using light-weighted local barriers. Hsync proposes to switch between AP and BSP.

Several graph systems under these models are in place, e.g., Pregel, GPS, Giraph++, GRAPE under BSP; GraphLab, Maiter, GRACE under (revised) AP; parameter servers under SSP; GiraphUC under BAP; and PowerSwitch under Hsync. Most of these are vertex-centric. While Giraph++ and Blogel process blocks, they inherit vertex-centric programming by treating blocks as vertices. GRAPE parallelizes sequential graph algorithms as a whole.

AAP differs from the prior models in the following.

(1) AAP reduces (a) stragglers of BSP via asynchronous message passing, and (b) redundant stale computations of AP by imposing a bound (delay stretch) , for workers to wait and accumulate updates.

(2) (a) AAP reduces redundant stale computations by enforcing a “lower bound” on accumulated messages, which also serves as an “upper bound” to support bounded staleness if needed. Performance can be improved when stragglers are forced to wait, rather than to catch up as suggested by SSP. (b) AAP dynamically adjusts the bound, instead of using a predefined constant. (c) Bounded staleness is not needed by SSSP, CC, and PageRank.

(3) Similar to Maiter, AAP aggregates changes accumulated. As opposed to Maiter, it reduces redundant computations by (a) imposing a delay stretch on workers, to adjust their relative progress, (b) dynamically adjusting the bound to optimize performance, and (c) combining incremental evaluation with accumulative computation. AAP operates on graph fragments, while Maiter is vertex-centric.

(4) Both BAP and AAP reduce unnecessary messages. However, AAP achieves this by operating on fragments (blocks) , and moreover, optimizes performance by adjusting relative progress of workers.

(5) As opposed to Hsync, AAP does not demand complete switch from one mode to another. Instead, each worker may decide its own “mode” based on its relative progress. Fast workers may follow BSP within a group, while meanwhile, the other workers may adopt AP. Moreover, the parameters are adjusted dynamically, and hence AAP does not have to predict switching points and pay the price of switching cost.

AAP can adopt the programming model of GRAPE. AAP is able to parallelize sequential graph algorithms just like GRAPE. That is, the asynchronous model does not make programming harder than GRAPE.

AAP supports data-partitioned parallelism. It is to work on graphs partitioned into smaller fragments.

Consider graphs G= (V, E, L) , directed or undirected, where (1) V is a finite set of nodes; (2)

is a set of edges; and (3) each node v in V (resp. edge e∈E) is labeled with L (v) (resp. L (e) ) indicating its content, as found in property graphs.

Given a natural number m, a strategy P partitions G into fragments F = (F ₁, …, F _m) such that each F _i= (V _i, E _i, L _i) is a subgraph of G, V=∪ _{i∈ [1, m]} V _i and E=∪ _{i∈ [1, m]} E _i. Here F _i is called a subgraph of G if

and for each node v∈V _i (resp. edge e∈E _i) , L _i (v) =L (v) (resp. L _i (e) =L (e) ) . Note that F _i is a graph itself but is not necessarily an induced subgraph of G.

AAP allows users to pick an edge-cut or vertex-cut strategy P to partition a graph G. When P is edge-cut, a cut edge from F _ito F _jhas a copy in both F _iand F _j. Denote by

(a) F _i. I (resp. F _i. O′) the set of nodes v∈V _i such that there exists an edge (v′, v) (resp. (v, v′) ) with a node v′in F _j (i≠j) ; and

(b) F _i. O (resp. F _i. I′) the set of nodes v′in some F _j (i≠j) such that there exists an edge (v, v′) (resp. (v′, v) ) with v∈V _i.

The nodes in F _i. I∪F _i. O′are referred to as the border nodes of F _i w.r.t. P. For vertex-cut, border nodes are those that have copies in different fragments. In general, a node v is a border node if v has an adjacent edge across two fragments, or a copy in another fragment.

Using familiar terms, we refer to a graph computation problem as a class Q of graph queries, and instances of the problem as queries of Q. To answer queries Q∈Q under AAP, one only needs to specify three functions.

(1) PEval: a sequential algorithm for Q that given a query Q∈Q and a graph G, computes the answer Q (G) .

(2) IncEval: a sequential incremental algorithm for Q that given Q, G, Q (G) and updates ΔG to G, computes updates ΔO to the old output Q (G) such that

where

denotes G updated by ΔG.

(3) Assemble: a function that collects partial answers computed locally at each worker by PEval and IncEval, and assembles the partial results into complete answer Q (G) .

Taken together, the three functions are referred to as a PIE program for Q (PEval, IncEval and Assemble) . PEval and IncEval can be existing sequential (incremental) algorithms for Q, which are to operate on a fragment F _i of G partitioned via a strategy P.

In addition, PEval declares the following.

(a) Update parameters. PEval declares status variables

for a set C _i in a fragment F _i, to store contents of F _i or partial results of a computation. Here C _i is a set of nodes and edges within d-hops of the nodes in F _i. I∪F _i. O′for an integer d. When d = 0, C _i is F _i. I∪F _i. O ^′. We denote by

the set of update parameters of F _i, which consists of status variables associated with the nodes and edges in C _i. The variables in

are the candidates to be updated by incremental steps IncEval.

(b) Aggregate functions. PEval also specifies an aggregate function f _aggr, e.g., min and max, to resolve conflicts when multiple workers attempt to assign different values to the same update parameter. These are specified in PEval and are shared by IncEval.

First Example: Graph Connectivity

Consider graph connectivity (CC) . Given an undirected graph G= (V, E, L) , a subgraph G _s of G is a connected component of G if (a) it is connected, i.e., for any two nodes v and v′in G _s, there exists a path between v and v′, and (b) it is maximum, i.e., adding any node of G to G _s makes the induced subgraph disconnected.

For each G, CC has a single query Q, to compute all connected components of G, denoted by Q (G) . CC is in O (|G|) time.

AAP parallelizes CC with the same PEval and IncEval of GRAPE. More specifically, a PIE program ρ is given as follows.

(1) As shown in Fig. 2, at each fragment F _i, PEval uses a sequential CC algorithm (Depth-First Search, DFS) to compute the local connected components and create their ids, except that it declares the following: (a) for each node v∈V _i, an integer variable v. cid, initially v. id; (b) F _i. O as the candidate set C _i, and

as the update parameters; and (c) min as aggregate function f _aggr: if there are multiple values for the same v. cid, the smallest value is taken by the linear order on integers.

For each local connected component C, (a) PEval creates a “root” node v _c carrying the minimum node id in C as v _c. cid, and (b) links all the nodes in C to v _c, and sets their cid as v _c. cid. These can be done in one pass of the edges in fragment F _i via DFS.

(2) Given a set M _i of changed cids of border nodes, IncEval incrementally updates local components in F _i, by “merging” components when possible. As shown in Fig. 3, by using min as f _aggr, it (a) updates the cid of each border node to the minimum one; and (b) propagates the change to its root v _c and all linked to v _c.

(3) Assemble first updates the cid of each node to the cid of its linked root. It then merges all the nodes having the same cids in a single bucket, and returns all buckets as connected components.

We remark the following about the programming paradigm.

(1) There have been methods for incrementalizing graph algorithms, to get incremental algorithms from their batch counterparts. Moreover, it is not hard to develop IncEval by revising a batch algorithm in response to changes to update parameters, as shown by the cases of CC (see Third Example 3 below) and PageRank (see below) .

(2) Edge-cut is adopted in the sequel unless stated otherwise; but AAP works with other partition strategies. Indeed, the correctness of asynchronous runs under AAP remains intact under the conditions given there, regardless of partitioning strategy used. Nonetheless, different strategies may yield partitions with various degrees of skewness and stragglers, which have an impact on the performance of AAP.

(3) The programming model aims to facilitate users to develop parallel programs, especially for those who are more familiar with conventional sequential programming. This said, programming with GRAPE still requires domain knowledge of algorithm design, to declare update parameters and design an aggregate function.

We next present the AAP model.

Setting. Adopting, for example, the programming model of GRAPE, to answer a class Q of queries on a graph G, AAP takes as input a PIE program ρ (i.e., PEval, IncEval, Assemble) for Q, and a partition strategy P. It partitions G into fragments (F ₁, …, F _m) using P, such that each fragment F _i resides at a virtual worker P _i for i∈ [1, m] . It works with a master P ₀ and n shared-nothing physical workers (P ₁, …, P _n) , where n<m, i.e., multiple virtual workers are mapped to the same physical worker and share memory. Graph G is partitioned once for all queries Q ∈ Q posed on G.

PEval and IncEval can be (existing) sequential batch and incremental algorithms for Q, respectively, except that PEval additionally declares update parameters

and defines an aggregate function f _aggr. At each worker P _i, (a) PEval computes Q (F _i) over local fragment F _i, and (b) IncEval takes F _i and updates M _i to

as input, and computes updates ΔO _i to Q (F _i) such that

Each invocation of PEvalor IncEvalas is referred to as one round of computation at worker P _i.

Message passing. After each round of computation at worker P _i, P _i collects update parameters of

with changed values in a set

It groups

into M _(i, j) for j∈ [1, m] and j≠i, where M _(i, j) includes

for v∈C _j, i.e., v also resides in fragment F _j. That is, M _(i, j) includes changes of

to the update parameters

of F _j. It sends M _(i, j) as a message to worker P _j. Messages M _(i, j) may also be referred to as designated messages.

More specifically, each worker P _i maintains the following:

(1) an index I _i that given a border node v, retrieves the set of j∈ [1, m] such that v∈F _j. I′∪ F _j. O and i≠j, i.e., where v resides; it is deduced from the strategy P ; and

(2) a buffer

to keep track of messages from other workers.

As opposed to GRAPE, AAP is asynchronous in nature. (1) AAP adopts (a) point-to-point communication: a worker P _i can send a message M _(i, j) directly to worker P _j, and (b) push-based message passing: P _i sends M _(i, j) to worker P _j as soon as M _(i, j) is available, regardless of the progress at other workers. A worker P _j can receive messages M _(i, j) at any time, and saves it in its buffer

without being blocked by supersteps. (2) Under AAP, master P ₀ is only responsible for making decision for termination and assembling partial answers by Assemble. (3) Workers exchange their status to adjust relative progress.

Parameters. To reduce stragglers and redundant stale computations, each (virtual) worker P _i maintains a delay stretch DS _i such that P _i is put on hold for DS _i time to accumulate updates. Stretch DS _i is dynamically adjusted by a function δ based on the following.

(1) Staleness η _i, measured by the number of messages in buffer

received by P _i from distinct workers. Intuitively, the larger η _i is, the more messages are accumulated in

and hence, the earlier P _i should start the next round of computation.

(2) Bounds r _min and r _max, the smallest and largest rounds being executed at all workers, respectively. Each P _i keeps track of its current round r _i. These are to control the relative speed of workers.

For example, to simulate SSP [14] , when r _i=r _max and r _i-r _min>c, we can set DS _i = +∞, to prevent P _i from moving too far ahead.

Adjustment function δ for DS _i shortly will be discussed later.

Parallel model. Given a query Q∈Q and a partitioned graph G, AAP posts the same query Q to all the workers. It computes Q (G) in three phases as shown in Fig. 4, described as follows.

(1) Partial evaluation. Upon receiving Q, PEval computes partial results Q (F _i) at each worker P _i in parallel. After this, PEval generates a message M _(i, j) and sends it to worker P _j for j∈ [1, m] , j≠ i.

More specifically, M _(i, j) consists of triples (x, val, r) , where

is associated with a node v that is in C _i∩C _j, and C _j is deduced from the index I _i; val is the value of x, and r indicates the round when val is computed. Worker P _i receives messages from other workers at any time and stores the messages in its buffer

(2) Incremental evaluation. In this phase, IncEval iterates until the termination condition is satisfied. To reduce redundant computation, AAP adjusts (a) relative progress of workers and (b) work assignments. More specifically, IncEval works as follows.

(1) IncEval is triggered at worker P _i to start the next round if (a)

is nonempty, and (b) P _i has been suspended for DS _i time. Intuitively, IncEval is invoked only if changes are inflicted to

i.e.,

and only if P _i has accumulated enough messages.

(2) When IncEval is triggered at P _i, it does the following:

· compute

i.e., IncEval applies the aggregate function to

to deduce changes to its local update parameters; and it clears buffer

· incrementally compute

with IncEval, by treating M _i as updates to its local fragment F _i (i.e.,

) ; and

· derive messages M _(i, j) that consists of updated values of

for border nodes that are in both C _i and C _j, for all j∈ [1, m] , j≠i; it sends M _(i, j) to worker P _j.

In the entire process, P _i keeps receiving messages from other workers and saves them in its buffer

No synchronization is imposed.

When IncEval completes its current round at P _i or when P _i receives a new message, DS _i is adjusted. The next round of IncEval is triggered if the conditions (a) and (b) in (1) above are satisfied; otherwise P _i is suspended for DS _i time, and its resources are allocated to other (virtual) workers P _j to do useful computation, preferably to P _j that is assigned to the same physical worker as P _i to minimize the overhead for data transfer. When the suspension of P _i exceeds DS _i, P _i is activated again to start the next round of IncEval.

(3) Termination. When IncEval is done with its current round of computation, if

sends a flag inactive to master P ₀ and becomes inactive. Upon receiving inactive from all workers, P ₀ broadcasts a message terminate to all workers. Each P _i may respond with either ack if it is inactive, or wait if it is active or is in the queue for execution. If one of the workers replies wait, the iterative incremental step proceeds (phase (2) above) .

Upon receiving ack from all workers, P ₀ pulls partial results from all workers, and applies Assemble to the partial results. The outcome is referred to as the result of the parallelization of ρ under P, denoted by ρ (Q, G) . AAP returns ρ (Q, G) and terminates.

Second Example

Recall the PIE program ρ for CC from First Example. Under AAP, it works in three phases as follows.

(1) PEval computes connected components and their cids at each fragment F _i by using DFS. At the end of the process, the cids of border nodes are grouped as messages and sent to neighboring workers. More specifically, for j∈ [1, m] , {v. cid∨v∈F _i. O∩F _j. I} is sent to worker P _j as message M _(i, j) and is stored in buffer

(2) IncEval first computes updates M _i by applying min to changed cids in

when it is triggered at worker P _i as described above. It then incrementally updates local components in F _i starting from M _i. At the end of the process, the changed cid’s are sent to neighboring workers as messages, just like PEval does. The process iterates until no more changes can be made.

(3) Assemble is invoked at master at this point. It computes and returns connected components as described in First Example.

The example shows that AAP works well with the programming model of GRAPE, i.e., AAP does not make programming harder.

AAP is able to dynamically adjust delay sketch DS _i at each worker P _i; for example, function δ may define

where the parameters of function δ are described as follows.

(1) Predicate S (r _i, r _min, r _max) is to decide whether P _i should be suspended immediately. For example, under SSP, it is defined as false if r _i=r _max and r _max-r _min∨c. When bounded staleness is not needed, S (r _i, r _min, r _max) is constantly true.

(2) Variable L _i “predicts” how many messages should be accumulated, to strike a balance between stale-computation reduction and useful outcome expected from the next round of IncEval at P _i. AAP adjusts L _i as follows. Users may opt to initialize L _i with a uniform bound L _⊥, to start stale-computation reduction early. AAP adjusts L _i at each round at P _i, based on (a) predicted running time t _i of the next round, and (b) the predicted arrival rate s _i of messages. When s _i is above the average rate, L _i is changed to max (η _i, L _⊥) +Δt _i*s _i, where Δt _i is a fraction of t _i, and L _⊥ is adjusted with the number of “fast” workers. t _i and s _i can be approximated by aggregating statistics of consecutive rounds of IncEval. One can get more precise estimate by using a random forest model, with query logs as training samples.

(3) Variable

estimates how longer P _i should wait to accumulate L _i many messages. It can be approximated as

using the number of messages that remain to be received, and message arrival rate s _i. Finally,

is the idle time of worker P _i after the last round of IncEval.

is used to prevent P _i from indefinite waiting.

BSP, AP and SSP are special cases of AAP. Indeed, these can be carried out by AAP by specifying function δ as follows.

о BSP: function δ sets DS _i=+∞ if r _i>r _min, i.e., P _i is suspended; otherwise, DS _i=0 i.e., P _i proceeds at once; thus all workers are synchronized as no one can outpace the others.

о AP: function δ always sets DS _i=0, i.e., worker P _i triggers the next round of computation as soon as its buffer is nonempty.

о SSP: function δ sets DS _i=+∞ if r _i>r _min+c for a fixed bound c like in SSP, and sets DS _i=0 otherwise. That is, the fastest worker may move at most c rounds ahead.

Moreover, AAP can simulate Hsync by using function δ to implement the same switching rules of Hsync.

Third Example:

Referring to FIG. 1 (a) and (b) , recall the PIE program ρ for CC from First Example and illustrated in Second Example. Consider a graph G that is partitioned into fragments F ₁, F ₂ and F ₃ and distributed across workers P ₁, P ₂ and P ₃, respectively. As depicted in Fig. 1 (b) , (a) each circle represents a connected component, annotated with its cid, and (b) a dotted line indicates between fragments. One can see that graph G has a single connected component with the minimal vertex id 0. Suppose that workers P ₁, P ₂ and P ₃ take 3, 3 and 6 time units, respectively.

One can verify the following by referencing Figure 1 (a) .

(a) Under BSP, Figure 1 (a) (1) depicts part of a run of ρ, which takes 5 rounds for the minimal cid 0 to reach component 7.

(b) Under AP, a run is shown in Fig. 1 (a) (2) . Note that before getting cid 0, workers P ₁ and P ₂ invoke 3 rounds of IncEval and exchange cid 1 among components 1-4, while under BSP, one round of IncEval suffices to pass cid 0 from P ₃ to these components. Hence a large part of the computations of faster P ₁ and P ₂ is stale and redundant.

(c) Under SSP with bounded staleness 1, is given in Fig. 1 (a) (3) . It is almost the same as Fig. 1 (a) (2) , except that P ₁ and P ₂ cannot start round 4 before P ₃ finishes round 2. More specifically, when minimal cids in

components

5 and 6 are set to 0 and 4, respectively, P ₁ and P ₂ have to wait for P ₃ to set the cid of component 7 to 5. These again lead to unnecessary stale computations.

(d) Under AAP, P ₃ can suspend IncEval until it receives enough changes as shown in Fig. 1 (a) (4) . For instance, function δ starts with L _⊥=0. It sets DS _i = 0 if η _i∨≥1 for i∈ [1, 2] since no messages are predicted to arrive within the next time unit. In contrast, it sets DS ₃ = 1 if η ₃∨≤4 since in addition to the 2 messages accumulated, 2 more messages are expected to arrive in 1 time unit; hence δ decides to increase DS ₃. These delay stretches are estimated based on the running time (3, 3 and 6 for P ₁, P ₂ and P ₃, respectively) and message arrival rates. With these delay stretches, P ₁ and P ₂ may proceed as soon as they receive new messages, but P ₃ starts a new round only after accumulating 4 messages. Now P ₃ only takes 2 rounds of IncEval to update all the cids in F ₃ to 0. Compared with Figures 1 (a) (1) – (3) , the straggler reaches fixpoint in less rounds.

It is found that AAP reduces the costs of iterative graph computations mainly from three directions.

(1) AAP reduces redundant stale computations and stragglers by adjusting relative progress of workers. In particular, (a) some computations are substantially improved when stragglers are forced to accumulate messages; this actually enables the stragglers to converge in less rounds, as shown by Third Example for CC. (b) When the time taken by different rounds at a worker does not vary much (e.g., PageRank) , fast workers are “automatically” grouped together after a few rounds and run essentially BSP within the group, while the group and slow workers run under AP. This shows that AAP is more flexible than Hsync.

(2) Like GRAPE, AAP employs incremental IncEval to minimize unnecessary recomputations. The speedup is particularly evident when IncEval is bounded, localizable or relatively bounded. For instance, IncEval is bounded if given F _i, Q, Q (F _i) and M _i, it computes ΔO _i such that

in cost that can be expressed as a function in M _i∨+ΔO _i∨, the size of changes in the input and output; intuitively, it reduces the cost of computation on (possibly big) F _i to a function of small M _i∨+ΔO _i∨. As an example, IncEval for CC (Fig. 3) is a bounded incremental algorithm.

(3) Observe that algorithms PEval and IncEval are executed on fragments, which are graphs themselves. Hence AAP inherits all optimization strategies developed for the sequential algorithms.

Convergence and correctness

Asynchronous executions complicate the convergence analysis. Nonetheless, a condition is developed under which AAP guarantees to converge at correct answers. In addition, AAP is generic, as parallel models MapReduce, PRAM, BSP, AP and SSP can be optimally simulated by AAP.

Given a PIE program ρ (i.e., PEval, IncEval, Assemble) for a class Q of graph queries and a partition strategy P, we want to know whether the AAP parallelization of ρ converges at correct results. That is, whether for all queries Q∈Q and all graphs G, ρ terminates under AAP over G partitioned via P, and its result ρ (Q, G) =Q (G) .

We formalize termination and correctness as follows.

Fixpoint. Similar to GRAPE, AAP parallelizes a PIE program ρ based on a simultaneous fixpoint operator φ (R ₁, …, R _m) that starts with partial evaluation of PEval and employs incremental function IncEval as the intermediate consequence operator:

where i∈ [1, m] ,

denotes partial results in round r at worker P _i, fragment

is fragment F _i at the end of round r carrying update parameters

and M _i denotes changes to

computed by

as described above.

The computation reaches a fixpoint if for all i∈ [1, m] ,

i.e., no more changes to partial results

at any worker. At this point, Assemble is applied to

for i∈ [1, m] , and computes ρ (Q, G) . If so, we say that ρ converges at ρ (Q, G) .

In contrast to synchronous execution, a PIE program ρ may have different asynchronous runs, when IncEval is triggered in different orders at multiple workers depending on, e.g., partition of G, clusters and network latency. These runs may end up with different results [37] . A run of ρ can be represented as traces of PEval and IncEval at all workers (see, e.g., Fig. 1 (a) ) .

ρ terminates under AAP with P if for all queries Q∈Q and graphs G, all runs of ρ converge at a fixpoint. . ρ has the Church-Rosser property under AAP if all its asynchronous runs converge at the same result. AAP correctly parallelizes ρ if ρ has the Church-Rosser property, i.e., it always converges at the same ρ (Q, G) , and ρ (Q, G) =Q (G) .

Termination and correctness. We now identify a monotone condition under which a PIE program is guaranteed to converge at correct answers under AAP. We start with some notation.

(1) We assume a partial order

on partial results

To simplify the discussion, assume that

carries its update parameters

Define the following properties of IncEval.

· IncEval is contracting if for all queries Q∈Q and fragmented graphs G via P,

for all i∈ [1, m] in the same run.

· IncEval is monotonic if for all queries Q∈Q and graphs G, for all i∈ [1, m] , if

then

where

and

denote partial results in (possibly different) runs.

For instance, consider the PIE program ρ for CC (First Example) . The order

is defined on sets of connected components (CCs) in each fragment, such that

if for each CC C ₂ in S ₂, there exists a CC C ₁ in S ₁ with

and cid ₁≤cid ₂, where cid _i is the id of C _i for i∈ [1, 2] . Then one can verify that the IncEval of ρ is both contracting and monotonic, since f _aggr is defined as min.

(2) We want to identify a condition under which AAP correctly parallelizes a PIE program ρ as long as its sequential algorithms PEval, IncEval and Assemble are correct, regardless of the order in which PEval and IncEval are triggered. We use the following.

PEval is correct if for all queries Q∈Q and graphs G, PEval (Q, G) returns Q (G) ; (b) IncEval is correct if IncEval (Q, Q (G) , G, M) returns

where M denotes messages (updates) ; and (c) Assemble is correct if when ρ converges at round r ₀ under BSP,

ρ is correct for Q if PEval, IncEval and Assemble are correct for Q.

A monotone condition. Three conditions can be identified for ρ.

(T1) The values of updated parameters are from a finite domain.

(T2) IncEval is contracting.

(T3) IncEval is monotonic.

While conditions T1 and T2 are essentially the same as the ones for GRAPE, condition T3 does not find a counterpart therein.

The termination condition of GRAPE remains intact under AAP.

Theorem 1: Under AAP, a PIE program ρ guarantees to terminate with any partition strategy P if ρ satisfies conditions T1 and T2.

These conditions are general. Indeed, given a graph G, the values of update parameters are often computed from the active domain of G and are finite. By the use of aggregate function f _aggr, IncEval is often contracting, as illustrated by the PIE program for CC above.

Proof: By T1 and T2, each update parameter can be changed finitely many times. This warrants the termination of ρ since ρ terminates when no more changes can be incurred to its update parameters.

However, the condition of GRAPE does not suffice to ensure the Church-Rosser property of asynchronous runs. For the correctness of a PIE program under AAP, we need condition T3 additionally.

Theorem 2: Under conditions T1, T2 and T3, AAP correctly parallelizes a PIE program ρ for a query class Q if ρ is correct for Q, with any partition strategy P.

Proof: We show the following under the conditions. (1) Both the synchronous run of ρ under BSP and asynchronous runs of ρ under AAP reach a fixpoint. (2) No partial results of ρ under BSP are “larger” than any fixpoint of asynchronous runs. (3) No partial results of asynchronous runs are “larger” than the fixpoint under BSP. From (2) and (3) it follows that ρ has the Church-Rosser property. Hence AAP correctly parallelizes ρ as long as ρ is correct for Q.

Recall that AP, BSP and SSP are special cases of AAP. From the proof of Theorem 2 we can conclude that as long as a PIE program ρ is correct for Q, ρ can be correctly parallelized

· under conditions T1 and T2 by BSP;

· under conditions T1, T2 and T3 by AP; and

· under conditions T1, T2 and T3 by SSP.

T1, T2 and T3 provide the first condition for asynchronous runs to converge and ensure the Church-Rosser property. To see this, convergence conditions for GRAPE, Maiter, BAP and SSP are examined.

(1) As remarked earlier, the condition of GRAPE does not ensure the Church-Rosser property, which is not an issue for BSP.

(2) Maiter focuses on vertex-centric programming and identifies four conditions for convergence, on an update function f that changes the state of a vertex based on its neighbors. The conditions require that f is distributive, associative, commutative and moreover, satisfies an equation on initial values.

As opposed to Zhang, Y., Gao, Q., Gao, L. and Wang, C. (2014. Maiter: An asynchronous graph processing framework for delta-based accumulative iterative computation. TPDS. 25, 8 (2014) , 2091–2100) , we deal with block-centric programming of which the vertex-centric model is a special case, when a fragment is limited to a single node. Moreover, the last condition of Zhang is quite restrictive. Further, the proof of Zhang does not suffice for the Church-Rosser property. A counterexample could be conditional convergent series, for which asynchronous runs may diverge.

(3) It is shown that BAP can simulate BSP under certain conditions on message buffers. It does not consider the Church-Rosser property, and we make no assumption about message buffers.

(4) Conditions have been studied to assure the convergence of stochastic gradient descent (SGD) with high probability. In contrast, our conditions are deterministic: under T1, T2 and T3, all AAP runs guarantee to converge at correct answers. Moreover, we consider AAP computations not limited to machine learning.

Simulation of Other Parallel Models

Algorithms developed for MapReduce, PRAM, BSP, AP and SSP can be migrated to AAP without extra complexity. That is, AAP is as expressive as the other parallel models.

Note that while this paper focuses on graph computations, AAP is not limited to graphs as a parallel computation model. It is as generic as BSP and AP, and does not have to take graphs as input.

A parallel model M ₁ optimally simulates model M ₂ if there exists a compilation algorithm that transforms any program with cost C on M ₂ to a program with cost O (C) on M ₁. The cost includes computational and communication cost. That is, the complexity bound remains the same.

As remarked above, BSP, AP and SSP are special cases of AAP. From this one can easily verify the following.

Proposition 3: AAP can optimally simulate BSP, AP and SSP.

By Proposition 3, algorithms developed for, e.g., Pregel, GraphLab and GRAPE can be migrated to AAP. As an example, a Pregel algorithm A (with a function compute () for vertices) can be simulated by a PIE algorithm ρ as follows. (a) PEval runs compute () over vertices with a loop, and uses status variable to exchange local messages instead of SendMessageTo () of Pregel. (b) The update parameters are status variables of border nodes, and function f _aggr groups messages just like Pregel, following BSP. (c) IncEval also runs compute () over vertices in a fragment, except that it starts from active vertices (border nodes with changed values) .

AAP can optimally simulate MapReduce and PRAM. GRAPE can optimally simulate MapReduce and PRAM, by adopting a form of key-value messages.

Theorem 4: MapReduce and PRAM can be optimally simulated by (a) AAP and (b) GRAPE with designated messages only.

Proof: Since PRAM can be simulated by MapReduce, and AAP can simulate GRAPE, it suffices to show that GRAPE can optimally simulate MapReduce with the above-explained message scheme.

A MapReduce algorithm A can be specified as a sequence (B ₁, …, B _k) of subroutines, where B _r (r∈ [1, k] ) consists of a mapper μ _r and a reducer ρ _r. To simulate A by GRAPE, we give a PIE program ρ in which (1) PEval is the mapper μ ₁ of B ₁, and (2) IncEval simulates reducer ρ _i and mapper μ _i+1 (i∈ [1, k-1] ) , and reducer ρ _k in the final round. We define IncEval that treats the subroutines B ₁, …, B _k of A as program branches. Assume that A uses n processors. We add a clique G _W of n nodes as input, one for each worker, such that any two workers can exchange data stored in the status variables of their border nodes in G _W. We show that ρ incurs no more cost than A in each step, using n processors.

Programming with AAP

It has been shown how AAP parallelizes CC (First to third Examples) . We next examine two PIE algorithms, including SSSP and CF. We also give a PIE program for PageRank. We parallelize these algorithms in below under AAP. These show that AAP does not make programming harder.

Graph Traversal

We start with the single source shortest path problem (SSSP) . Consider a directed graph G= (V, E, L) in which for each edge e, L (e) is a positive number. The length of a path (v ₀, …, v _k) in G is the sum of L (v _i-1, v _i) for i∈ [1, k] . For a pair (s, v) of nodes, denote by dist (s, v) the shortest distance from s to v. SSSP is stated as follows.

· Input: A directed graph G as above, and a node s in G.

· Output: Distance dist (s, v) for all nodes v in G.

AAP parallelizes SSSP in the same way as GRAPE.

(1) PIE. AAP takes Dijkstra’s algorithm for SSSP as PEval and the sequential incremental algorithm as IncEval. It declares a status variable x _v for every node v, denoting dist (s, v) , initially ∞ (except dist (s, s) =0) . The candidate set C _i at each F _i is F _i. O. The status variables in the candidates set are updated by PEval and IncEval as in [8] , and aggregated by using min as f _aggr. When no changes can be incurred to these status variables, Assemble is invoked to take the union of all partial results.

(2) Correctness is assured by the correctness of the sequential algorithms for SSSP and Theorem 2. To see this, define order

on sets S ₁ and S ₂ of nodes in the same fragment F _i such that

if for each node v∈F _i, v ₁. dist≤v ₂. dist, where v ₁ and v ₂ denote the copies of v in S ₁ and S ₂, respectively. Then by the use of min as aggregate f _aggr, IncEval is both contracting and monotonic.

Collaborative Filtering

We next consider collaborative filtering (CF) . It takes as input a bipartite graph G that includes two types of nodes, namely, users U and products P, and a set of weighted edges

More specifically, (1) each user u∈U (resp. product p∈P) carries an (unknown) latent factor vector u. f (resp. p. f) . (2) Each edge e= (u, p) in E carries a weight r (e) , estimated as u. f ^T*p. f (possibly

i.e., “unknown” ) that encodes a rating from user u to product p. The training set E _T refers to edge set

i.e., all the known ratings. The CF problem is stated as follows.

· Input: A directed bipartite graph G, and a training set E _T.

· Output: The missing factor vectors u. f and p. f that minimizes a loss function ∈ (f, E _T) , estimated as

AAP parallelizes stochastic gradient descent (SGD) , a popular algorithm for CF. We give the following PIE program.

(1) PIE. PEval declares a status variable v. x= (v. f, v. δ, t) for each node v, where v. f is the factor vector of v (initially

) , v. δ records accumulative updates to v. f, and t bookkeeps the timestamp at which v. f is lastly updated. Assuming w. l. o. g. P∨＜＜∨U∨, it takes F _i. O∪F _i. I, i.e., the shared product nodes related to F _i, as C _i. PEval is essentially “mini-batched” SGD. It computes the descent gradients for each edge (u, p) in F _i and accumulates them in u. δ and p. δ, receptively. The accumulated gradients are then used to update the factor vectors of all local nodes. At the end, PEval sends the updated values of

to neighboring workers.

IncEval first aggregates the factor vector of each node p in F _i. O by taking max on the timestamp for tuples (p. f, p. δ, t) in

For each node in F _i. I, it aggregates its factor vector by applying a weighted sum of gradients computed at other workers. It then runs a round of SGD; it sends the updated status variables as in PEval as long as the bounded staleness condition is not violated.

Assemble simply takes the union of the factor vectors of all nodes from all the workers, and returns the collection.

(2) Correctness has been verified under the bounded staleness condition. Along the same lines, we show that the PIE program converges and correctly infers missing CF factors.

PageRank

Finally, we study PageRank for ranking Web pages. Consider a directed graph G= (V, E) representing Web pages and links. For each page v∈V, its ranking score is denoted by P _v. A PageRank algorithm iteratively updates P _v as follows:

P _v=d*Σ _{{u| (u, v) ∈E}} P _u/N _u+ (1-d) ,

where d is damping factor and N _u is the out-degree of u. The process iterates until the sum of changes of two consecutive iterations is below a threshold. The PageRank problem is stated as follows.

· Input: A directed graph G and a threshold ∈.

· Output: The PageRank scores of nodes in G.

AAP parallelizes PageRank along the same lines as Tian, Y., Balmin, A., Corsten, S. A. and Shirish Tatikonda, J. M. 2013. From “think like a vertex” to “think like a graph” . PVLDB. 7, 7 (2013) , 193–204.

(1) PIE. PEval declares a status variable x _v for each node v∈F _i to keep track of updates to v from other nodes in F _i, at each fragment F _i. It takes F _i. O as its candidate set C _i. Starting from an initial score 0 and an update x _v (initially 1-d) for each v, PEval (a) increases the score P _v by x _v, and (b) updates the variable x _u for each u linked from v by an incremental change d*x _v/N _v. At the end of its process, it sends the values of

to its neighboring workers.

Upon receiving messages, IncEval iteratively updates scores. It (a) first aggregates changes to each border node from other workers by using sum as f _aggr; (b) it then propagates the changes to update other nodes in the local fragment by conducting the same computation as in PEval; and (c) it derives the changes to the values of

and sends them to its neighboring workers.

Assemble collects the scores of all the nodes in G when the sum of changes of two consecutive iterations at each worker is below ∈.

(2) Correctness. We show that the PIE program under AAP terminates and has the Church-Rosser property, along the same lines as the proof of Theorem 2. The proof makes use of the following property, as also observed by [36] : for each node v in graph G, P _v can be expressed as ∑ _p∈Pp (v) + (1-d) , where P is the set of all paths to v in G, p is a path (v _n, v _n-1, …v ₁, v) ,

and N _j is the out-degree node v _j for j∈ [1, n] .

Bounded staleness forbids fastest workers to outpace the slowest ones by more than c steps. It is mainly to ensure the correctness and convergence of CF. By Theorem 2, CC and SSSP are not constrained by bounded staleness; conditions T1, T2 and T3 suffice to guarantee their convergence and correctness. Hence fast workers can move ahead any number of rounds without affecting their correctness and convergence. One can show that PageRank does not need bounded staleness either, since for each path p∈P, p (v) can be added to P _v at most once (see above) .

Implementation of GRAPE+

The architecture of GRAPE+ is shown in Fig. 5, to extend GRAPE by supporting AAP. Its top layer provides interfaces for developers to register their PIE programs, and for end users to run registered PIE programs. The core of GRAPE+ is its engine, to generate parallel evaluation plans. It schedules workload for working threads to carry out the evaluation plans. Underlying the engine are several components, including (1) an MPI controller to handle message passing, (2) a load balancer to evenly distribute workload, (3) an index manager to maintain indices, and (4) a partition manager for graph partitioning. GRAPE+ employs distributed file systems, e.g., NFS, AWS S3 and HDFS, to store graph data.

GRAPE+ extends GRAPE by supporting the following.

Adaptive asynchronization manager. As opposed to GRAPE, GRAPE+ dynamically adjusts relative progress of workers. This is carried out by a scheduler in the engine. Based on statistics collected (see below) , the scheduler adjusts parameters and decides which threads to suspend or run, to allocate resources to useful computations. In particular, the engine allocates communication channels between workers, buffers messages generated, packages the messages into segments, and sends a segment each time. It further reduces costs by overlapping data transfer and computation.

Statistics collector. During a run of a PIE program, the collector gathers information for each worker, e.g., the amount of messages exchanged, the evaluation time in each round, historical data for a query workload, and the impact of the last parameter adjustment.

Fault tolerance. Asynchronous runs of GRAPE+ make it harder to identify a consistent state to rollback in case of failures. Hence as opposed to GRAPE, GRAPE+ adapts Chandy-Lamport snapshots for checkpoints. The master broadcasts a checkpoint request with a token. Upon receiving the request, each worker ignores the request if it has already held the token. Otherwise, it snapshots its current state before sending any messages. The token is attached to its following messages. Messages that arrive late without the token are added to the last snapshot. This gets us a consistent checkpointed state, including all messages passed asynchronously.

When we deployed GRAPE+ in a POC scenario that provides continuous online payment services, we found that on average, it took about 40 seconds to get a snapshot of the entire state, and 20 seconds to recover from failure of one worker. In contrast, it took 40 minutes to start the system and load the graph.

Consistency. Each worker P _i uses a buffer

to store incoming messages, which is incrementally expanded when new messages arrive. GRAPE+ allows users to provide an aggregate function f _aggr to resolve conflicts when a status variable receives multiple values from different workers. The only race condition is that when old messages are removed from

by IncEval, the deletion is atomic. Thus consistency control of GRAPE+ is not much harder than that of GRAPE.

Experimental Study

Using real-life and synthetic graphs, we conducted four sets of experiments to evaluate the (1) efficiency, (2) communication cost, and (3) scale-up of GRAPE+, and (4) the effectiveness of AAP and the impact of graph partitioning strategies on its performance. We also report a case study in Appendix B to illustrate how dynamic adjustment of AAP works. We compared the performance of GRAPE+ with (a) Giraph and synchronized GraphLab _sync under BSP, (b) asynchronized GraphLab _async, GiraphUC and Maiter [36] under AP, (c) Petuum under SSP, (d) PowerSwitch under Hsync, and (e) GRAPE+ simulations of BSP, AP and SSP, denoted by GRAPE+BSP, GRAPE+AP, GRAPE+SSP, respectively.

It has been found that GraphLab _async, GraphLab _sync, PowerSwitch and GRAPE+ outperform the other systems. Indeed, Table 1 shows the performance of SSSP and PageRank of the systems with 192 workers; results on the other algorithms are consistent. Hence we only report the performance of these four systems in details. In all the experiments we also evaluated GRAPE+ _BSP, GRAPE+ _AP and GRAPE+ _SSP. Note that GRAPE is essentially GRAPE+ _BSP.

Experimental setting. We used real-life and synthetic graphs.

Graphs. We used five real-life graphs of different types, such that each algorithm was evaluated with two real-life graphs. These include (1) Friendster, a social network with 65 million users and 1.8 billion links; we randomly assigned weights to test SSSP; (2) traffic, an (undirected) US road network with 23 million nodes (locations) and 58 million edges; (3) UKWeb, a Web graph with 133 million nodes and 5 billion edges. We also used two recommendation networks (bipartite graphs) to evaluate CF, namely, (4) movieLens, with 20 million movie ratings (as weighted edges) between 138000 users and 27000 movies; and (5) Netflix, with 100 million ratings between 17770 movies and 480000 customers.

To test the scalability of GRAPE+, we developed a generator to produce synthetic graphs G = (V, E, L) controlled by the numbers of nodes V∨ (up to 300 million) and edges E∨ (up to 10 billion) .

Queries. For SSSP, we sampled 10 source nodes for each graph G used such that each node has paths to or from at least 90%of the nodes in G, and constructed an SSSP query for each of them.

Graph computations. We evaluated SSSP, CC, PageRank and CF over GRAPE+ by using their PIE programs. We used “default” code provided by the competitor systems when it is available. Otherwise we made our best efforts to develop “optimal” algorithms for them, e.g., CF for PowerSwitch.

We used XtraPuLP as the default graph partition strategy. To evaluate the impact of stragglers, we randomly reshuffled a small portion of partitions to make the partitioned graphs skewed.

We deployed the systems on an HPC cluster. For each experiment, we used up to 20 servers, each with 16 threads of 2.40GHz, and 128GB memory. On each thread, a GRAPE+ worker is deployed. We ran each experiment 5 times. The average is reported here.

Experimental results. We next report our findings.

Exp-1: Efficiency. We first evaluated the efficiency of GRAPE+ by varying the number n of workers used, from 64 to 192. We evaluated (a) SSSP and CC with real-life graphs traffic and Friendster; (b) PageRank with Friendster and UKWeb, and (c) CF with movieLens and Netflix, based on applications of these algorithms in transportation networks, social networks, Web rating and recommendation.

(1) SSSP. Figures 6 (a) and 6 (b) report the performance of SSSP.

(a) GRAPE+ consistently outperforms these systems in all cases. Over traffic (resp. Friendster) and with 192 workers, it is on average 1673 (resp. 3.0) times, 1085 (resp. 15) times and 1270 (resp. 2.56) times faster than synchronized GraphLab _sync, asynchronized GraphLab _asyncand hybrid PowerSwitch, respectively.

The performance gain of GRAPE+ comes from the following: (i) efficient resource utilization by dynamically adjusting relative progress of workers under AAP; (ii) reduction of redundant computation and communication by the use of incremental IncEval; and (iii) optimization inherited from strategies for sequential algorithms. Note that under BSP, AP and SSP, GRAPE+BSP, GRAPE+AP and GRAPE+SSP can still benefit from (ii) and (iii) .

As an example, GraphLabsync took 34 (resp. 10749) rounds over Friendster (resp. traffic) , while by using IncEval, GRAPE+BSP and GRAPE+SSP took 21 and 30 (resp. 31 and 42) rounds, respectively, and hence reduced synchronization barriers and communication costs. In addition, GRAPE+ inherits the optimization techniques from sequential (Dijkstra) algorithm by employing priority queues to prioritize vertex processing; in contrast, this optimization strategy is beyond the capacity of the vertex-centric systems.

(b) GRAPE+ is on average 2.42, 1.71, and 1.47 (resp. 2.45, 1.76, and 1.40) times faster than GRAPE+ _BSP, GRAPE+ _AP and GRAPE+ _SSP over traffic (resp. Friendster) , up to 2.69, 1.97 and 1.86 times, respectively. Since GRAPE+, GRAPE+ _BSP, GRAPE+ _AP and GRAPE+ _SSP are the same system under different modes, the gap reflects the effectiveness of different models. We find that the idle waiting time of AAP is 32.3%and 55.6%of that of BSP and SSP, respectively. Moreover, when measuring stale computation in terms of the total extra computation and communication time over BSP, the stale computation of AAP accounts for 37.2%and 47.1%of that of AP and SSP, respectively. These verify the effectiveness of AAP by dynamically adjusting relative progress of different workers.

(c) GRAPE+ takes less time when n increases. It is on average 2.49 and 2.25 times faster on traffic and Friendster, respectively, when n varies from 64 to 192. That is, AAP makes effective use of parallelism by reducing stragglers and redundant stale computations.

(2) CC. As reported in Figures 6 (c) and 6 (d) over traffic and Friendster, respectively, (a) GRAPE+ outperforms GraphLab _sync, GraphLab _asyncand PowerSwitch. When n=192, GRAPE+ is on average 313, 93 and 51 times faster than the three systems, respectively. (b) GRAPE+ is faster than its variants under BSP, AP and SSP, on average by 20.87, 1.34 and 3.36 (resp. 3.21, 1.11 and 1.61) times faster over traffic (resp. Friendster) , respectively, up to 27.4, 1.39 and 5.04 times. (c) GRAPE+ scales well with the number of workers used: it is on average 2.68 times faster when n varies from 64 to 192.

(3) PageRank. As shown in Figures 6 (e) -6 (f) over Friendster and UKWeb, respectively, when n=192, (a) GRAPE+ is on average 5, 9 and 5 times faster than GraphLab _sync, GraphLab _asyncand PowerSwitch, respectively. (b) GRAPE+ outperforms GRAPE+ _BSP, GRAPE+ _AP and GRAPE+ _SSP by 1.80, 1.90 and 1.25 times, respectively, up to 2.50, 2.16 and 1.57 times. This is because GRAPE+ reduces stale computations, especially those of stragglers. On average stragglers took 50, 27 and 28 rounds under BSP, AP and SSP, respectively, as opposed to 24 rounds under AAP. (d) GRAPE+ is on average 2.16 times faster when n varies from 64 to 192.

(4) CF. We used movieLens and Netflix with training set E _T∨90∨E∨, as shown in Figures 6 (g) -6 (h) , respectively. On average (a) GRAPE+ is 11.9, 9.5, 10.0 times faster than GraphLab _sync, GraphLab _asyncand PowerSwitch, respectively. (b) GRAPE+ beats GRAPE+ _BSP, GRAPE+ _AP and GRAPE+ _SSP by 1.38, 1.80 and 1.26 times, up to 1.67, 3.16 and 1.38 times, respectively. (c) GRAPE+ is on average 2.3 times faster when n varies from 64 to 192.

Single-thread. Among the graphs traffic, movieLens and Netflix can fit in a single machine. On a single machine, it takes 6.7s, 4.3s and 2354.5s for SSSP and CC over traffic, and CF over Netflix, respectively. Using 64-192 workers, GRAPE+ is on average from 1.63 to 5.2, 1.64 to 14.3, and 4.4 to 12.9 times faster than single-thread, depending on how heavy stragglers are. Observe the following. (a) GRAPE+ incurs extra overhead of parallel computation not experienced by a single machine, just like other parallel systems. (b) Large graphs such as UKWeb are beyond the capacity of a single machine, and parallel computation is a must for such graphs.

Exp-2: Communication. We tracked the total bytes sent by each machine during a run, by monitoring the system file /proc/net/dev. The communication costs of PageRank and SSSP over Friendster are reported in Table 1, when 192 workers were used. The results on other algorithms are consistent and hence not shown. These results tell us the following.

(1) On average GRAPE+ ships 22.4%, 8.0%and 68.3%of data shipped by GraphLab _sync, GraphLab _asyncand PowerSwitch, respectively. This is because GRAPE+ (a) reduces redundant stale computations and hence unnecessary data traffic, and (b) ships only changed values of update parameters by incremental IncEval.

(2) The communication cost of GRAPE+ is 1.22X, 40%and 1.02X compared to that of GRAPE+ _BSP, GRAPE+ _AP and GRAPE+ _SSP , respectively. Since AAP allows workers with small workload to run faster and have more iterations, the amount of messages may increase. Moreover, workers under AAP additionally exchange their states and statistics to adjust relative speed. Despite these, its communication cost is not much worse than that of BSP and SSP.

Exp-3: Scale-up of GRAPE+. The speed-up of a system may degrade when using more workers. Thus we evaluated the scale-up of GRAPE+, which measures the ability to keep similar performance when both the size of graph G = (|V|, |E|) and the number n of workers increase proportionally. We varied n from 96 to 320, and for each n, deployed GRAPE+ over a synthetic graph of size varied from (60M, 2B) to (300M, 10B) , proportional to n.

As reported in Figures 6 (i) and 6 (j) for SSSP and PageRank, respectively, GRAPE+ preserves a reasonable scale-up. That is, the overhead of AAP does not weaken the benefit of parallel computation. Despite the overhead for adjusting relative progress, GRAPE+ retains scale-up comparable to that of BSP, AP and SSP.

The results on other algorithms are consistent (not shown) .

Exp-4: Effectiveness of AAP. To further evaluate the effectiveness of AAP, we tested (a) the impact of graph partitioning on AAP, and (b) the performance of AAP over larger graphs with more workers. We evaluated GRAPE+, GRAPE+ _BSP, GRAPE+ _AP and GRAPE+ _SSP. Remark that these are the same system under different modes, and hence the results are not affected by implementation.

Impact of graph partitioning. Definer=||F _max||/||F _median||, where ||F _max|| and || F _median|| denote the size of the largest fragment and the median size, respectively, indicating the skewness of a partition.

As shown in Fig. 6 (k) for SSSP over Friendster, in which thex axis is r, (a) different partitions have an impact on the performance of GRAPE+, just like on other parallel graph systems. (b) The more skewed the partition is, the more effective AAP is. Indeed, AAP is more effective with larger r. When r=9, AAP outperforms BSP, AP, SSP by 9.5, 2.3, and 4.9 times, respectively. For a well-balanced partition (r=1) , BSP works well since the chances of having stragglers are small. In this case AAP works as well as BSP.

AAP in a large-scale setting. We tested synthetic graphs with 300 million vertices and 10 billion edges, generated by GTgraph following the power law and the small world property. We used a cluster of up to 320 workers. As shown in Fig. 6 (l) for PageRank, AAP is on average 4.3, 14.7 and 4.7 times faster than BSP, AP and SSP, respectively, up to 5.0, 16.8 and 5.9 times with 320 workers. Compared to the results in Exp-1, these show that AAP is far more effective on larger graphs with more workers, a setting closer to real-life applications, in which stragglers and stale computations are often heavy. These further verify the effectiveness of AAP.

The results on other algorithms are consistent (not shown) .

It has been found that: (1) GRAPE+ consistently outperforms the state-of-the-art systems. Over real-life graphs and with 192 workers, GRAPE+ is on average (a) 2080, 838, 550, 728, 1850 and 636 times faster than Giraph, GraphLab _sync, GraphLab _async, GiraphUC, Maiter and PowerSwitch for SSSP, (b) 835, 314, 93 and 368 times faster than Giraph, GraphLab _sync, GraphLab _asyncand GiraphUC for CC, (c) 339, 4.8, 8.6, 346, 9.7 and 4.6 times faster than Giraph, GraphLab _sync, GraphLab _async, GiraphUC, Maiter and PowerSwitch for PageRank, and (d) 11.9, 9.5 and 30.9 times faster than GraphLab _sync, GraphLab _asyncand Petuum for CF, respectively. Among these PowerSwitch has the closest performance to GRAPE+. (2) It incurs as small as 0.0001, 0.027, 0.13 and 57.7 of the communication cost of these systems for these problems, respectively. (3) AAP effectively reduces stragglers and redundant stale computations. It is on average 4.8, 1.7 and 1.8 times faster than BSP, AP and SSP for these problems over real-life graphs, respectively. On large-scale synthetic graphs, AAP is on average 4.3, 14.7 and 4.7 times faster than BSP, AP and SSP, respectively, up to 5.0, 16.8 and 5.9 times with 320 workers. (4) The heavier stragglers and stale computations are, or the larger the graphs are and the more workers are used, the more effective AAP is. (5) GRAPE+ scales well with the number n of workers used. It is on average 2.37, 2.68, 2.17 and 2.3 times faster when n varies from 64 to 192 for SSSP, CC, PageRank and CF, respectively. Moreover, it has good scale-up.

It has also been shown that as an asynchronous model, AAP does not make programming harder, and it retains the ease of consistency control and convergence guarantees. We have also developed the first condition to warrant the Church-Rosser property of asynchronous runs, and a simulation result to justify the power and flexibility of AAP. The experimental results have verified that AAP is promising for large-scale graph computations.

Claims

A method for asynchronously parallelizing graph computations, the method comprising:

distributing a plurality of fragments across a number of workers so that each worker has at least one local fragment, the plurality of fragments being obtained by partitioning a graph and each fragment being a subgraph of the graph;

computing, by each worker, a partial result over each of its at least one local fragment using a predefined sequential batch algorithm;

iteratively computing, by each worker, an updated partial result over each of its at least one local fragment based on one or more update messages using a predefined sequential incremental algorithm until a termination condition is satisfied, wherein the one or more update messages are received from one or more other workers, respectively, and are stored in a respective buffer;

wherein each worker is allowed to decide when to perform a next round of computation based on its delay stretch, and wherein said each worker is put on hold for a time period indicated by the delay stretch before performing the next round of computation, the delay stretch being dynamically adjustable based on said each worker's relative computing progress to other workers.
The method of claim 1, wherein the delay stretch of each worker is adjusted by one or more parameters from the following group: the number of update messages stored in the respective buffer, the number of the one or more other workers from which the one or more update messages are received, the smallest and largest rounds being executed at all workers, running time prediction, query logs and other statistics collected from all workers.
The method of claim 1 or 2, wherein each worker keeps receiving update messages, when available, from other workers without synchronization being imposed.
The method of any one of previous claims 1 to 3, wherein, when a worker is suspended during the delay stretch, its resources are allocated to one or more of the other workers.
The method of any one of previous claims 1 to 4, wherein each worker sends a flag inactive to a master when said each worker has no update messages stored in the respective buffer after its current round of computation.
The method of claim 5, wherein, upon receiving inactive from all workers, the master broadcasts a termination message to all workers.
The method of claim 6, wherein, in response to the termination message, each worker responds with "acknowledgement"when it is inactive, or responds with "wait"when it is active or in the queue for a next round of computation.
The method of claim 7, wherein, upon receiving "acknowledgement"from all workers, the master pulls the updated partial results from all workers and applies a predefined assemble function to the updated partial results.
The method of any one of previous claims 1 to 8, wherein the predefined sequential incremental algorithm is monotonic.
The method of any one of previous claims 1 to 9, wherein the update message is based on a respective partial result and is defined by predefined update parameters.
A system for asynchronously parallelizing graph computations, configured to perform the method of any one of claims 1 to 10.