WO2019232956A1 - Parallelization of graph computations - Google Patents

Parallelization of graph computations

Info

Publication number
WO2019232956A1
Authority
WO
WIPO (PCT)
Prior art keywords
workers
worker
aap
grape
graph
Prior art date
Application number
PCT/CN2018/104689
Other languages
French (fr)
Inventor
Wenfei Fan
Wenyuan Yu
Jingbo XU
Original Assignee
Zhejiang Tmall Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Zhejiang Tmall Technology Co., Ltd.
Priority to CN201880092086.1A (CN112074829A)
Publication of WO2019232956A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9024Graphs; Linked lists
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/30Creation or generation of source code
    • G06F8/31Programming languages or programming paradigms
    • G06F8/314Parallel programming languages

Definitions

  • the following disclosure relates to parallelization of graph computations.
  • BSP Bulk Synchronous Parallel
  • AP Asynchronous Parallel
  • SSP Stale Synchronous Parallel
  • a method for asynchronously parallelizing graph computations comprises: distributing a plurality of fragments across a number of workers so that each worker has at least one local fragment, the plurality of fragments being obtained by partitioning a graph and each fragment being a subgraph of the graph; computing, by each worker, a partial result over each of its at least one local fragment using a predefined sequential batch algorithm; and iteratively computing, by each worker, an updated partial result over each of its at least one local fragment based on one or more update messages using a predefined sequential incremental algorithm until a termination condition is satisfied.
  • the one or more update messages are received from one or more other workers, respectively, and are stored in a respective buffer.
  • Each worker is allowed to decide when to perform a next round of computation based on its delay stretch, and each worker is put on hold for a time period indicated by the delay stretch before performing the next round of computation.
  • the delay stretch can be dynamically adjusted based on each worker's relative computing progress to other workers.
  • the delay stretch of each worker is adjusted by one or more parameters from the following group: the number of update messages stored in the respective buffer, the number of the one or more other workers from which the one or more update messages are received, the smallest and largest rounds being executed at all workers, running time prediction, query logs and other statistics collected from all workers.
  • Each worker sends a flag inactive to a master when it has no update messages stored in the respective buffer after its current round of computation.
  • the master broadcasts a termination message to all workers.
  • each worker responds with "acknowledgement” when it is inactive, or responds with "wait” when it is active or in the queue for a next round of computation.
  • the master pulls the updated partial results from all workers and applies a predefined assemble function to the updated partial results.
  • the predefined sequential incremental algorithm is monotonic.
  • the update message is based on a respective partial result and is defined by predefined update parameters.
  • a system configured to perform the method for asynchronously parallelizing graph computations.
  • FIG. 1 (a) depicts runs for computing a connected components (CC) example as shown in FIG. 1 (b) under different models.
  • FIG. 1 (b) depicts a CC example.
  • FIG. 2 shows PEval for CC under AAP.
  • FIG. 3 shows IncEval for CC under AAP.
  • FIG. 4 shows workflow of AAP.
  • FIG. 5 shows the architecture of GRAPE+.
  • FIG. 6 shows results of performance evaluation.
  • AAP Adaptive Asynchronous Parallel
  • Neither AP nor BSP consistently outperforms the other for different algorithms, input graphs and cluster scales. For many graph algorithms, different stages in a single execution demand different models for optimal performance. Switching between AP and BSP, however, requires predicting switching points and incurs switching costs.
  • AAP is essentially asynchronous. As opposed to BSP and AP, each worker under AAP maintains parameters to measure (a) its progress relative to other workers, and (b) changes accumulated by messages (staleness) . Each worker has immediate access to incoming messages, and decides whether to start the next round of computation based on its own parameters. In contrast to SSP, each worker dynamically adjusts its parameters based on its relative progress and message staleness, instead of using a fixed bound.
  • the workers can be distributed processors, or processors in a single machine, or threads on a processor.
  • FIG. 1 (a) compares runs for computing connected components shown in Fig. 1 (b) under different parallel models.
  • (2) AP allows a worker to start the next round as soon as its message buffer is not empty. However, it comes with redundant stale computation. As shown in Fig. 1 (a) (2) , at clock time 7, the second round of P 3 can only use the messages from the first round of P 1 and P 2 . This round of P 3 becomes stale at time 8, when the latest updates from P 1 and P 2 arrive. As will be seen later, a large part of the computations of faster P 1 and P 2 is also redundant.
  • AAP allows a worker to accumulate changes and decides when to start the next round based on the progress of others. As shown in Fig. 1 (a) (4) , after P 3 finishes one round of computation at clock time 6, it may start the next round at time 8, at which point the latest changes from P 1 and P 2 are available. As opposed to AP, AAP reduces redundant stale computation. This also helps us mitigate the straggler problem, since P 3 can converge in less rounds by utilizing the latest updates from fast workers.
  • AAP reduces stragglers by not blocking fast workers. This is particularly helpful when the computation is CPU-intensive and skewed, when an evenly partitioned graph becomes skewed due to updates, or when we cannot afford evenly partitioning a large graph due to the partition cost. Moreover, AAP activates a worker only after it receives sufficient up-to-date messages and thus reduces redundant stale computations. This allows us to reallocate resources to useful computations via workload adjustments.
  • AAP differs from previous models in the following.
  • AAP can naturally switch among these models at different stages of the same execution, without asking for explicit switching points or incurring the switching costs.
  • AAP is more flexible: some worker groups may follow BSP, while at the same time, the others run AP or SSP.
  • GRAPE Graphics Programming Environment
  • AAP can work with the programming model of GRAPE (Graphics Programming Environment) . It allows users to extend existing sequential (single-machine) graph algorithms with message declarations, and parallelizes the algorithms across a cluster of machines. It employs aggregate functions to resolve conflicts raised by updates from different workers, without worrying about race conditions or requiring extra efforts to enforce consistency by using, e.g., locks.
  • AAP is modeled as a simultaneous fixpoint computation. Based on this one of the first conditions is developed under which AAP parallelization of sequential algorithms guarantees (a) convergence at correct answers, and (b) the Church-Rosser property, i.e., all asynchronous runs converge at the same result, as long as the sequential algorithms are correct.
  • AAP can optimally simulate MapReduce, PRAM (Parallel Random Access Machine) , BSP, AP and SSP. That is, algorithms developed for these models can be migrated to AAP without increasing the complexity.
  • Performance AAP outperforms BSP, AP and SSP for a variety of graph computations.
  • Table 1 shows the performance of (a) Giraph (an open-source version of Pregel) and GraphLab under BSP, (b) GraphLab and Maiter under AP, (c) GiraphUC under BAP, (d) PowerSwitch under Hsync, and (e) GRAPE+, an extension of GRAPE by supporting AAP. GRAPE+ does better than these systems.
  • Table 1 Page Rank and SSSP on parallel systems
  • PRAM Parallel Random Access Machine
  • MapReduce is adopted by, e.g., GraphX.
  • BSP with vertex-centric programming works better for graphs as shown in some cases.
  • AP reduces stragglers, but it comes with redundant stale computation. It also bears with race conditions and their locking/unblocking costs, and complicates the convergence analysis and programming.
  • SSP promotes bounded staleness for machine learning.
  • Maiter reduces stragglers by accumulating updates, and supports prioritized asynchronous execution.
  • BAP (barrierless asynchronous parallel) model reduces global barriers and local messages by using light-weighted local barriers.
  • Hsync proposes to switch between AP and BSP.
  • AAP differs from the prior models in the following.
  • AAP reduces (a) stragglers of BSP via asynchronous message passing, and (b) redundant stale computations of AP by imposing a bound (delay stretch) , for workers to wait and accumulate updates.
  • AAP reduces redundant stale computations by enforcing a “lower bound” on accumulated messages, which also serves as an “upper bound” to support bounded staleness if needed. Performance can be improved when stragglers are forced to wait, rather than to catch up as suggested by SSP.
  • AAP dynamically adjusts the bound, instead of using a predefined constant.
  • Bounded staleness is not needed by SSSP, CC, and PageRank.
  • AAP aggregates changes accumulated. As opposed to Maiter, it reduces redundant computations by (a) imposing a delay stretch on workers, to adjust their relative progress, (b) dynamically adjusting the bound to optimize performance, and (c) combining incremental evaluation with accumulative computation.
  • AAP operates on graph fragments, while Maiter is vertex-centric.
  • AAP does not demand complete switch from one mode to another. Instead, each worker may decide its own “mode” based on its relative progress. Fast workers may follow BSP within a group, while meanwhile, the other workers may adopt AP. Moreover, the parameters are adjusted dynamically, and hence AAP does not have to predict switching points and pay the price of switching cost.
  • AAP can adopt the programming model of GRAPE.
  • AAP is able to parallelize sequential graph algorithms just like GRAPE. That is, the asynchronous model does not make programming harder than GRAPE.
  • AAP supports data-partitioned parallelism. It is to work on graphs partitioned into smaller fragments.
  • V is a finite set of nodes; (2) E ⊆ V × V is a set of edges; and (3) each node v in V (resp. edge e ∈ E) is labeled with L (v) (resp. L (e) ) indicating its content, as found in property graphs.
  • F i is a graph itself but is not necessarily an induced subgraph of G.
  • AAP allows users to pick an edge-cut or vertex-cut strategy P to partition a graph G.
  • P is edge-cut
  • a cut edge from F i to F j has a copy in both F i and F j .
  • border nodes are those that have copies in different fragments.
  • a node v is a border node if v has an adjacent edge across two fragments, or a copy in another fragment.
  • PEval: a sequential algorithm for Q that given a query Q ∈ Q and a graph G, computes the answer Q (G) .
  • IncEval: a sequential incremental algorithm for Q that given Q, G, Q (G) and updates ΔG to G, computes updates ΔO to the old output Q (G) such that Q (G ⊕ ΔG) = Q (G) ⊕ ΔO, where G ⊕ ΔG denotes G updated by ΔG.
  • Assemble a function that collects partial answers computed locally at each worker by PEval and IncEval, and assembles the partial results into complete answer Q (G) .
  • PEval, IncEval and Assemble the three functions are referred to as a PIE program for Q (PEval, IncEval and Assemble) .
  • PEval and IncEval can be existing sequential (incremental) algorithms for Q, which are to operate on a fragment F i of G partitioned via a strategy P.
  • PEval declares the following.
  • PEval also specifies an aggregate function f aggr , e.g., min and max, to resolve conflicts when multiple workers attempt to assign different values to the same update parameter. These are specified in PEval and are shared by IncEval.
  • a subgraph G s of G is a connected component of G if (a) it is connected, i.e., for any two nodes v and v′in G s , there exists a path between v and v′, and (b) it is maximum, i.e., adding any node of G to G s makes the induced subgraph disconnected.
  • For each G, CC has a single query Q, to compute all connected components of G, denoted by Q (G) . CC is in O (|G|) time.
  • AAP parallelizes CC with the same PEval and IncEval of GRAPE. More specifically, a PIE program ρ is given as follows.
  • PEval uses a sequential CC algorithm (Depth-First Search, DFS) to compute the local connected components and create their ids, except that it declares the following: (a) for each node v ∈ V_i, an integer variable v.cid, initially v.id; (b) F_i.O as the candidate set C_i, and {v.cid | v ∈ C_i} as the update parameters; and (c) min as aggregate function f_aggr: if there are multiple values for the same v.cid, the smallest value is taken by the linear order on integers.
  • DFS Depth-First Search
  • PEval For each local connected component C, (a) PEval creates a “root” node v c carrying the minimum node id in C as v c . cid, and (b) links all the nodes in C to v c , and sets their cid as v c . cid. These can be done in one pass of the edges in fragment F i via DFS.
  • IncEval Given a set M i of changed cids of border nodes, IncEval incrementally updates local components in F i , by “merging” components when possible. As shown in Fig. 3, by using min as f aggr , it (a) updates the cid of each border node to the minimum one; and (b) propagates the change to its root v c and all linked to v c .
  • Assemble first updates the cid of each node to the cid of its linked root. It then merges all the nodes having the same cids in a single bucket, and returns all buckets as connected components.
  • the programming model aims to facilitate users to develop parallel programs, especially for those who are more familiar with conventional sequential programming. This said, programming with GRAPE still requires domain knowledge of algorithm design, to declare update parameters and design an aggregate function.
  • AAP takes as input a PIE program ρ (i.e., PEval, IncEval, Assemble) for Q, and a partition strategy P. It partitions G into fragments (F_1, ..., F_m) using P, such that each fragment F_i resides at a virtual worker P_i for i ∈ [1, m]. It works with a master P_0 and n shared-nothing physical workers (P_1, ..., P_n), where n ≤ m, i.e., multiple virtual workers are mapped to the same physical worker and share memory.
  • Graph G is partitioned once for all queries Q ∈ Q posed on G.
  • PEval and IncEval can be (existing) sequential batch and incremental algorithms for Q, respectively, except that PEval additionally declares update parameters and defines an aggregate function f aggr .
  • PEval computes Q (F i ) over local fragment F i
  • IncEval takes F_i and updates M_i to its update parameters as input, and computes updates ΔO_i to Q (F_i) such that Q (F_i ⊕ M_i) = Q (F_i) ⊕ ΔO_i. Each invocation of PEval or IncEval is referred to as one round of computation at worker P_i.
  • P_i collects its update parameters with changed values in a set, which it groups into M (i, j) for j ∈ [1, m] and j ≠ i, where M (i, j) includes the changed values for v ∈ C_j, i.e., v also resides in fragment F_j. That is, M (i, j) includes changes of F_i to the update parameters of F_j. It sends M (i, j) as a message to worker P_j. Messages M (i, j) may also be referred to as designated messages.
  • each worker P i maintains the following:
  • an index I_i that, given a border node v, retrieves the set of j ∈ [1, m] such that v ∈ F_j.I′ ∪ F_j.O and i ≠ j, i.e., where v resides; it is deduced from the strategy P;
  • AAP is asynchronous in nature.
  • AAP adopts (a) point-to-point communication: a worker P i can send a message M (i, j) directly to worker P j , and (b) push-based message passing: P i sends M (i, j) to worker P j as soon as M (i, j) is available, regardless of the progress at other workers.
  • a worker P j can receive messages M (i, j) at any time, and saves it in its buffer without being blocked by supersteps.
  • master P 0 is only responsible for making decision for termination and assembling partial answers by Assemble.
  • Workers exchange their status to adjust relative progress.
  • each (virtual) worker P i maintains a delay stretch DS i such that P i is put on hold for DS i time to accumulate updates.
  • Stretch DS_i is dynamically adjusted by a function δ based on the following.
  • Staleness η_i, measured by the number of messages in the buffer received by P_i from distinct workers. Intuitively, the larger η_i is, the more messages have accumulated in the buffer and hence the earlier P_i should start the next round of computation.
  • The adjustment function δ for DS_i will be discussed shortly.
  • M (i, j) consists of triples (x, val, r), where x is associated with a node v that is in C_i ∩ C_j, and C_j is deduced from the index I_i; val is the value of x, and r indicates the round when val is computed.
  • Worker P i receives messages from other workers at any time and stores the messages in its buffer
  • IncEval Incremental evaluation. In this phase, IncEval iterates until the termination condition is satisfied. To reduce redundant computation, AAP adjusts (a) relative progress of workers and (b) work assignments. More specifically, IncEval works as follows.
  • IncEval is triggered at worker P_i to start the next round if (a) its buffer is nonempty, and (b) P_i has been suspended for DS_i time. Intuitively, IncEval is invoked only if changes are inflicted to F_i, i.e., its buffer is nonempty, and only if P_i has accumulated enough messages.
  • It derives messages M (i, j) that consist of updated values of the update parameters for border nodes that are in both C_i and C_j, for all j ∈ [1, m], j ≠ i; it sends M (i, j) to worker P_j.
  • When IncEval completes its current round at P_i, or when P_i receives a new message, DS_i is adjusted.
  • the next round of IncEval is triggered if the conditions (a) and (b) in (1) above are satisfied; otherwise P i is suspended for DS i time, and its resources are allocated to other (virtual) workers P j to do useful computation, preferably to P j that is assigned to the same physical worker as P i to minimize the overhead for data transfer.
  • P i is activated again to start the next round of IncEval.
  • When IncEval is done with its current round of computation, P_i sends a flag inactive to master P_0 and becomes inactive. Upon receiving inactive from all workers, P_0 broadcasts a message terminate to all workers. Each P_i may respond with either ack if it is inactive, or wait if it is active or is in the queue for execution. If one of the workers replies wait, the iterative incremental step proceeds (phase (2) above) .
  • Upon receiving ack from all workers, P_0 pulls partial results from all workers, and applies Assemble to the partial results. The outcome is referred to as the result of the parallelization of ρ under P, denoted by ρ (Q, G) .
  • AAP returns ρ (Q, G) and terminates.
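
For illustration, the termination protocol above can be sketched from the master's side as follows. This is a minimal sketch; the worker handles and the helper callables passed in as parameters are assumptions made for the sketch, not the actual GRAPE+ interfaces.

    # A simplified, sequential sketch of the termination protocol described above,
    # seen from the master P_0. All helpers are supplied as parameters.
    def master_terminate(workers, recv_flag, broadcast, receive, pull_result, assemble):
        """Return Assemble(partial results) once every worker acknowledges termination."""
        while True:
            for w in workers:                             # wait until every worker reports `inactive`
                while recv_flag(w) != "inactive":
                    pass
            broadcast(workers, "terminate")
            replies = [receive(w) for w in workers]       # each reply is "ack" or "wait"
            if all(r == "ack" for r in replies):
                return assemble([pull_result(w) for w in workers])
            # some worker answered "wait": the iterative incremental phase resumes
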
  • PEval computes connected components and their cids at each fragment F i by using DFS.
  • The cids of border nodes are grouped as messages and sent to neighboring workers. More specifically, for j ∈ [1, m], {v.cid | v ∈ F_i.O ∩ F_j.I} is sent to worker P_j as message M (i, j) and is stored in the buffer of P_j.
  • IncEval first computes updates M i by applying min to changed cids in when it is triggered at worker P i as described above. It then incrementally updates local components in F i starting from M i . At the end of the process, the changed cid’s are sent to neighboring workers as messages, just like PEval does. The process iterates until no more changes can be made.
  • AAP works well with the programming model of GRAPE, i.e., AAP does not make programming harder.
  • AAP is able to dynamically adjust the delay stretch DS_i at each worker P_i; for example, function δ may define DS_i in terms of a variable L_i, as follows.
  • Variable L i “predicts” how many messages should be accumulated, to strike a balance between stale-computation reduction and useful outcome expected from the next round of IncEval at P i .
  • AAP adjusts L_i as follows. Users may opt to initialize L_i with a uniform bound L_0, to start stale-computation reduction early.
  • AAP adjusts L i at each round at P i , based on (a) predicted running time t i of the next round, and (b) the predicted arrival rate s i of messages.
  • When s_i is above the average rate, L_i is changed to max (η_i, L_0) + Δt_i · s_i, where Δt_i is a fraction of t_i, and L_0 is adjusted with the number of "fast" workers.
  • t_i and s_i can be approximated by aggregating statistics of consecutive rounds of IncEval. One can get a more precise estimate by using a random forest model, with query logs as training samples.
  • BSP, AP and SSP are special cases of AAP. Indeed, these can be carried out by AAP by specifying function δ as follows.
  • AAP can simulate Hsync by using function δ to implement the same switching rules of Hsync.
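
For illustration, the per-worker scheduling decision and the adaptive bound may be sketched as below. The concrete fraction of the predicted running time and the fixed settings listed in the closing comments are assumptions of this sketch, not the exact function δ of the disclosure.

    # An illustrative sketch of the scheduling decision built on the delay stretch
    # DS_i and the accumulation bound L_i described above.
    def ready_to_run(buffer_size, suspended_time, delay_stretch):
        """IncEval is triggered only if (a) the buffer is nonempty and
        (b) the worker has been suspended for its delay stretch DS_i."""
        return buffer_size > 0 and suspended_time >= delay_stretch

    def adjust_bound(staleness, uniform_bound, predicted_time, predicted_rate,
                     fraction=0.5):
        """Adapt L_i, the number of messages a worker should accumulate before its
        next round: when messages arrive faster than average, also wait for the
        messages expected within a fraction of the next round's predicted running
        time. The value 0.5 for the fraction is an assumption of this sketch."""
        return max(staleness, uniform_bound) + fraction * predicted_time * predicted_rate

    # Fixed parameter choices recover the other models as special cases of AAP:
    #   AP : DS_i = 0, so a worker starts as soon as its buffer is nonempty.
    #   BSP: DS_i spans until the current-round messages of all other workers arrive.
    #   SSP: DS_i = 0, but a worker is held back once it runs c rounds ahead of the slowest.
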
  • FIG. 1 (a) and (b) recall the PIE program ρ for CC from the First Example, as illustrated in the Second Example.
  • a graph G that is partitioned into fragments F 1 , F 2 and F 3 and distributed across workers P 1 , P 2 and P 3 , respectively.
  • each circle represents a connected component, annotated with its cid, and
  • a dotted line indicates an edge between fragments.
  • graph G has a single connected component with the minimal vertex id 0.
  • workers P 1 , P 2 and P 3 take 3, 3 and 6 time units, respectively.
  • Figure 1 (a) (1) depicts part of a run of ⁇ , which takes 5 rounds for the minimal cid 0 to reach component 7.
  • P 3 can suspend IncEval until it receives enough changes as shown in Fig. 1 (a) (4) .
  • DS_3 is set to 1 when fewer than 4 messages have accumulated: in addition to the 2 messages already accumulated, 2 more messages are expected to arrive in 1 time unit; hence δ decides to increase DS_3.
  • These delay stretches are estimated based on the running time (3, 3 and 6 for P 1 , P 2 and P 3 , respectively) and message arrival rates.
  • AAP reduces the costs of iterative graph computations mainly from three directions.
  • AAP reduces redundant stale computations and stragglers by adjusting relative progress of workers.
  • Some computations are substantially improved when stragglers are forced to accumulate messages; this actually enables the stragglers to converge in fewer rounds, as shown by the Third Example for CC.
  • (b) When the time taken by different rounds at a worker does not vary much (e.g., PageRank) , fast workers are “automatically” grouped together after a few rounds and run essentially BSP within the group, while the group and slow workers run under AP. This shows that AAP is more flexible than Hsync.
  • Like GRAPE, AAP employs incremental IncEval to minimize unnecessary recomputations. The speedup is particularly evident when IncEval is bounded, localizable or relatively bounded. For instance, IncEval is bounded if, given F_i, Q, Q (F_i) and M_i, it computes ΔO_i such that Q (F_i ⊕ M_i) = Q (F_i) ⊕ ΔO_i in cost that can be expressed as a function of |M_i| + |ΔO_i|, the size of the changes in the input and output; intuitively, it reduces the cost of computation on (possibly big) F_i to a function of the small |M_i| + |ΔO_i|. As an example, IncEval for CC (Fig. 3) is a bounded incremental algorithm.
  • AAP is generic, as parallel models MapReduce, PRAM, BSP, AP and SSP can be optimally simulated by AAP.
  • AAP parallelizes a PIE program ρ based on a simultaneous fixpoint operator φ (R_1, ..., R_m) that starts with partial evaluation of PEval and employs the incremental function IncEval as the intermediate consequence operator: R_i^{r+1} = IncEval (Q, R_i^r, F_i, M_i).
  • Here R_i^r is fragment F_i at the end of round r, carrying its update parameters, and M_i denotes the changes to the update parameters of F_i computed as described above.
  • The computation reaches a fixpoint if R_i^{r+1} = R_i^r for all i ∈ [1, m], i.e., there are no more changes to the partial results at any worker.
  • Assemble is then applied to R_i^r for i ∈ [1, m], and computes ρ (Q, G) . If so, we say that ρ converges at ρ (Q, G) .
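
For illustration, the simultaneous fixpoint may be sketched as a simple synchronous loop (AAP itself lets each worker iterate asynchronously). The PIE program object, its method names, and the route helper that groups designated messages per destination fragment and applies f_aggr are assumptions made for this sketch.

    # A sketch of the fixpoint computation described above, shown synchronously
    # for readability. `program` is assumed to expose peval, inc_eval, assemble
    # and f_aggr; `route` is an assumed helper returning one inbox per fragment.
    def fixpoint(program, route, query, fragments):
        # round 0: partial evaluation by PEval on every fragment
        results = [program.peval(query, f) for f in fragments]
        partials = [r[0] for r in results]
        outboxes = [r[1] for r in results]
        while any(outboxes):                          # iterate IncEval until no changes remain
            inboxes = route(outboxes, program.f_aggr)
            outboxes = [{} for _ in fragments]
            for i, f in enumerate(fragments):
                if inboxes[i]:                        # changes were inflicted on F_i
                    partials[i], outboxes[i] = program.inc_eval(
                        query, f, partials[i], inboxes[i])
        return program.assemble(partials)             # Q(G) at the fixpoint
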
  • A PIE program ρ may have different asynchronous runs, when IncEval is triggered in different orders at multiple workers depending on, e.g., partition of G, clusters and network latency. These runs may end up with different results [37] .
  • A run of ρ can be represented as traces of PEval and IncEval at all workers (see, e.g., Fig. 1 (a) ) .
  • ρ terminates under AAP with P if for all queries Q ∈ Q and graphs G, all runs of ρ converge at a fixpoint.
  • ρ has the Church-Rosser property under AAP if all its asynchronous runs converge at the same result.
  • Termination and correctness We now identify a monotone condition under which a PIE program is guaranteed to converge at correct answers under AAP. We start with some notation.
  • ⁇ IncEval is contracting if for all queries Q ⁇ Q and fragmented graphs G via P, for all i ⁇ [1, m] in the same run.
  • PEval is correct if for all queries Q ∈ Q and graphs G, PEval (Q, G) returns Q (G) ;
  • IncEval is correct if IncEval (Q, Q (G) , G, M) returns Q (G ⊕ M) , where M denotes messages (updates) ; and
  • Assemble is correct if it returns Q (G) when ρ converges at round r_0 under BSP. ρ is correct for Q if PEval, IncEval and Assemble are correct for Q.
  • A monotone condition.
  • Three conditions can be identified for ρ.
  • Conditions T1 and T2 are essentially the same as the ones for GRAPE; condition T3 does not have a counterpart therein.
  • Theorem 1: Under AAP, a PIE program ρ is guaranteed to terminate with any partition strategy P if ρ satisfies conditions T1 and T2.
  • Theorem 2: Under conditions T1, T2 and T3, AAP correctly parallelizes a PIE program ρ for a query class Q if ρ is correct for Q, with any partition strategy P.
  • T1, T2 and T3 provide the first condition for asynchronous runs to converge and ensure the Church-Rosser property. To see this, convergence conditions for GRAPE, Maiter, BAP and SSP are examined.
  • Maiter focuses on vertex-centric programming and identifies four conditions for convergence, on an update function f that changes the state of a vertex based on its neighbors.
  • the conditions require that f is distributive, associative, commutative and moreover, satisfies an equation on initial values.
  • Algorithms developed for MapReduce, PRAM, BSP, AP and SSP can be migrated to AAP without extra complexity. That is, AAP is as expressive as the other parallel models.
  • AAP is not limited to graphs as a parallel computation model. It is as generic as BSP and AP, and does not have to take graphs as input.
  • a parallel model M 1 optimally simulates model M 2 if there exists a compilation algorithm that transforms any program with cost C on M 2 to a program with cost O (C) on M 1 .
  • the cost includes computational and communication cost. That is, the complexity bound remains the same.
  • BSP Bulk Synchronous Parallel
  • Recall that BSP, AP and SSP are special cases of AAP. From this, one can easily verify the following.
  • Proposition 3 AAP can optimally simulate BSP, AP and SSP.
  • A Pregel algorithm A (with a function compute () for vertices) can be simulated by a PIE algorithm ρ as follows.
  • PEval runs compute () over vertices with a loop, and uses status variable to exchange local messages instead of SendMessageTo () of Pregel.
  • the update parameters are status variables of border nodes, and function f aggr groups messages just like Pregel, following BSP.
  • IncEval also runs compute () over vertices in a fragment, except that it starts from active vertices (border nodes with changed values) .
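
For illustration, driving a Pregel-style compute () inside PEval and IncEval may be sketched as follows; the compute () signature and the message layout are assumptions of this sketch. PEval would invoke it with all local vertices active, while IncEval would start from the border nodes whose values changed, as described above.

    # An illustrative sketch of running a vertex-centric compute() over one
    # fragment. compute(v, state, msgs) may update state[v] in place and yields
    # (target, value) messages. Local messages re-activate local vertices; messages
    # to border copies become the fragment's update parameters.
    def run_vertices(compute, state, active, inbox, is_local):
        border_out = {}
        while active:
            local_inbox, next_active = {}, set()
            for v in active:
                for target, value in compute(v, state, inbox.get(v, [])):
                    if is_local(target):
                        local_inbox.setdefault(target, []).append(value)
                        next_active.add(target)
                    else:
                        border_out.setdefault(target, []).append(value)
            inbox, active = local_inbox, next_active      # local messages drive the loop
        return state, border_out
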
  • AAP can optimally simulate MapReduce and PRAM.
  • GRAPE can optimally simulate MapReduce and PRAM, by adopting a form of key-value messages.
  • MapReduce and PRAM can be optimally simulated by (a) AAP and (b) GRAPE with designated messages only.
  • A MapReduce algorithm A can be specified as a sequence (B_1, ..., B_k) of subroutines, where B_r (r ∈ [1, k]) consists of a mapper and a reducer.
  • SSSP single source shortest path problem
  • Input: A directed graph G as above, and a node s in G.
  • AAP parallelizes SSSP in the same way as GRAPE.
  • the candidate set C i at each F i is F i .
  • the status variables in the candidates set are updated by PEval and IncEval as in [8] , and aggregated by using min as f aggr . When no changes can be incurred to these status variables, Assemble is invoked to take the union of all partial results.
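
For illustration, PEval for SSSP on one fragment may be sketched as a textbook Dijkstra-style relaxation, with min serving as f_aggr when a border node receives several candidate distances. The fragment layout is an assumption of this sketch; the same routine can be re-run by IncEval from the aggregated border distances.

    import heapq

    # A sketch of PEval for SSSP on one fragment: each node keeps a dist status
    # variable, computed locally with Dijkstra; changed border distances become
    # the update parameters shipped to neighboring workers.
    def sssp_peval(fragment, source=None, dist=None):
        """fragment: {'adj': node -> [(neighbor, weight)], 'border': set of nodes}.
        dist: starting distances (aggregated border values when called by IncEval).
        Returns the distance map and the changed border values to be shipped."""
        dist = dict(dist or {})
        if source is not None:
            dist[source] = 0
        heap = [(d, v) for v, d in dist.items()]
        heapq.heapify(heap)
        changed = {}
        while heap:
            d, v = heapq.heappop(heap)
            if d > dist.get(v, float("inf")):
                continue                                  # stale heap entry
            for w, weight in fragment["adj"].get(v, ()):
                nd = d + weight
                if nd < dist.get(w, float("inf")):
                    dist[w] = nd
                    heapq.heappush(heap, (nd, w))
                    if w in fragment["border"]:
                        changed[w] = nd                   # update parameter for neighbors
        return dist, changed
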
  • Each user u ∈ U (resp. product p ∈ P) carries an (unknown) latent factor vector u.f (resp. p.f) .
  • The training set E_T refers to the set of edges whose ratings are known, i.e., all the known ratings.
  • the CF problem is stated as follows.
  • Input: A directed bipartite graph G, and a training set E_T.
  • AAP parallelizes stochastic gradient descent (SGD) , a popular algorithm for CF.
  • SGD stochastic gradient descent
  • v. f is the factor vector of v (initially )
  • v.Δ records accumulative updates to v.f
  • t bookkeeps the timestamp at which v. f is lastly updated.
  • W.l.o.g. assuming |P| ≤ |U|, it takes F_i.O ∪ F_i.I, i.e., the shared product nodes related to F_i, as C_i.
  • PEval is essentially “mini-batched” SGD.
  • PEval sends the updated values of its update parameters to neighboring workers.
  • IncEval first aggregates the factor vector of each node p in F_i.O by taking max on the timestamp for tuples (p.f, p.Δ, t) in its buffer. For each node in F_i.I, it aggregates its factor vector by applying a weighted sum of gradients computed at other workers. It then runs a round of SGD; it sends the updated status variables as in PEval as long as the bounded staleness condition is not violated.
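
For illustration, one local pass of the "mini-batched" SGD underlying PEval may be sketched as follows; the learning rate, regularisation constant and factor representation are assumptions of this sketch.

    # A sketch of one local SGD pass over the training edges of a fragment, in the
    # spirit of PEval for CF described above. Learning rate and regularisation are
    # assumed values; factor vectors are plain Python lists.
    def sgd_pass(train_edges, factors, lr=0.01, reg=0.05):
        """train_edges: iterable of (user, product, rating); factors: node -> list
        of floats. Updates factors in place and returns the nodes whose vectors
        changed, i.e. candidates whose status variables are sent to other workers."""
        changed = set()
        for u, p, rating in train_edges:
            fu, fp = factors[u], factors[p]
            err = rating - sum(a * b for a, b in zip(fu, fp))   # prediction error
            for k in range(len(fu)):
                gu = err * fp[k] - reg * fu[k]                  # gradient w.r.t. u.f[k]
                gp = err * fu[k] - reg * fp[k]                  # gradient w.r.t. p.f[k]
                fu[k] += lr * gu
                fp[k] += lr * gp
            changed.update((u, p))
        return changed
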
  • Input: A directed graph G and a threshold ε.
  • AAP parallelizes PageRank along the same lines as Tian, Y., Balmin, A., Corsten, S. A. and Shirish Tatikonda, J. M. 2013. From “think like a vertex” to “think like a graph” . PVLDB. 7, 7 (2013) , 193–204.
  • PEval declares a status variable x_v for each node v ∈ F_i to keep track of updates to v from other nodes in F_i, at each fragment F_i. It takes F_i.O as its candidate set C_i.
  • PEval (a) increases the score P_v by x_v, and (b) updates the variable x_u for each u linked from v by an incremental change d·x_v/N_v, where N_v is the out-degree of v. At the end of its process, it sends the values of the status variables of its border nodes to its neighboring workers.
  • Upon receiving messages, IncEval iteratively updates scores. It (a) first aggregates changes to each border node from other workers by using sum as f_aggr; (b) it then propagates the changes to update other nodes in the local fragment by conducting the same computation as in PEval; and (c) it derives the changes to the values of its update parameters and sends them to its neighboring workers.
  • Assemble collects the scores of all the nodes in G when the sum of changes of two consecutive iterations at each worker is below ε.
  • P_v can be expressed as Σ_{p ∈ P} p (v) + (1−d), where P is the set of all paths to v in G, p is a path (v_n, v_{n−1}, ..., v_1, v), and N_j is the out-degree of node v_j for j ∈ [1, n].
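
For illustration, the per-fragment change propagation of PEval/IncEval for PageRank may be sketched as follows; the damping factor value, the cut-off eps and the data layout are assumptions of this sketch.

    # A sketch of the per-fragment PageRank step described above: each node v keeps
    # a change variable x_v; its score P_v absorbs x_v, and d * x_v / N_v is pushed
    # to the nodes v links to. Changes destined for border nodes are returned so
    # they can be sent to neighbouring workers and aggregated there with sum.
    def pagerank_propagate(fragment, scores, x, d=0.85, eps=1e-9):
        """fragment: {'adj': node -> [successors], 'border': set of border nodes}."""
        border_changes = {}
        pending = [v for v, val in x.items() if abs(val) > eps]
        while pending:
            v = pending.pop()
            delta, x[v] = x.get(v, 0.0), 0.0
            if abs(delta) <= eps:
                continue
            scores[v] = scores.get(v, 0.0) + delta          # (a) absorb the change into P_v
            successors = fragment["adj"].get(v, ())
            if not successors:
                continue
            share = d * delta / len(successors)             # (b) incremental change d*x_v/N_v
            for u in successors:
                if u in fragment["border"]:
                    border_changes[u] = border_changes.get(u, 0.0) + share
                else:
                    if abs(x.get(u, 0.0)) <= eps:
                        pending.append(u)                   # newly affected local node
                    x[u] = x.get(u, 0.0) + share
        return border_changes
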
  • Bounded staleness forbids fastest workers to outpace the slowest ones by more than c steps. It is mainly to ensure the correctness and convergence of CF.
  • CC and SSSP are not constrained by bounded staleness; conditions T1, T2 and T3 suffice to guarantee their convergence and correctness.
  • fast workers can move ahead any number of rounds without affecting their correctness and convergence.
  • PageRank does not need bounded staleness either, since for each path p ⁇ P, p (v) can be added to P v at most once (see above) .
  • GRAPE+ The architecture of GRAPE+ is shown in Fig. 5, to extend GRAPE by supporting AAP. Its top layer provides interfaces for developers to register their PIE programs, and for end users to run registered PIE programs.
  • the core of GRAPE+ is its engine, to generate parallel evaluation plans. It schedules workload for working threads to carry out the evaluation plans. Underlying the engine are several components, including (1) an MPI controller to handle message passing, (2) a load balancer to evenly distribute workload, (3) an index manager to maintain indices, and (4) a partition manager for graph partitioning.
  • GRAPE+ employs distributed file systems, e.g., NFS, AWS S3 and HDFS, to store graph data.
  • GRAPE+ extends GRAPE by supporting the following.
  • Adaptive asynchronization manager As opposed to GRAPE, GRAPE+ dynamically adjusts relative progress of workers. This is carried out by a scheduler in the engine. Based on statistics collected (see below) , the scheduler adjusts parameters and decides which threads to suspend or run, to allocate resources to useful computations. In particular, the engine allocates communication channels between workers, buffers messages generated, packages the messages into segments, and sends a segment each time. It further reduces costs by overlapping data transfer and computation.
  • the collector gathers information for each worker, e.g., the amount of messages exchanged, the evaluation time in each round, historical data for a query workload, and the impact of the last parameter adjustment.
  • GRAPE+ adapts Chandy-Lamport snapshots for checkpoints.
  • the master broadcasts a checkpoint request with a token.
  • each worker ignores the request if it has already held the token. Otherwise, it snapshots its current state before sending any messages.
  • the token is attached to its following messages. Messages that arrive late without the token are added to the last snapshot. This gets us a consistent checkpointed state, including all messages passed asynchronously.
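
For illustration, the worker-side rule of the adapted snapshot protocol may be sketched as follows; the message representation and the snapshot_state callback are assumptions of this sketch.

    # A sketch of the worker-side rule of the adapted Chandy-Lamport snapshot
    # described above: snapshot before sending once the token is first seen, attach
    # the token to following messages, and add late token-less messages to the last
    # snapshot. `snapshot_state` is an assumed callback returning a dict.
    class Checkpointer:
        def __init__(self, snapshot_state):
            self.snapshot_state = snapshot_state
            self.has_token = False
            self.snapshot = None

        def on_checkpoint_request(self):
            if self.has_token:                     # already holds the token: ignore request
                return
            self.has_token = True
            self.snapshot = self.snapshot_state()  # snapshot current state before sending
            self.snapshot.setdefault("late_messages", [])

        def on_send(self, message):
            if self.has_token:
                message["token"] = True            # token attached to following messages
            return message

        def on_receive(self, message):
            if self.has_token and not message.get("token"):
                self.snapshot["late_messages"].append(message)   # late, token-less message
            return message
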
  • Each worker P i uses a buffer to store incoming messages, which is incrementally expanded when new messages arrive.
  • GRAPE+ allows users to provide an aggregate function f_aggr to resolve conflicts when a status variable receives multiple values from different workers. The only race condition is on the message buffer: when old messages are removed from the buffer by IncEval, the deletion is made atomic. Thus consistency control of GRAPE+ is not much harder than that of GRAPE.
  • Graphs We used five real-life graphs of different types, such that each algorithm was evaluated with two real-life graphs. These include (1) Friendster, a social network with 65 million users and 1.8 billion links; we randomly assigned weights to test SSSP; (2) traffic, an (undirected) US road network with 23 million nodes (locations) and 58 million edges; (3) UKWeb, a Web graph with 133 million nodes and 5 billion edges.
  • Exp-1 Efficiency. We first evaluated the efficiency of GRAPE+ by varying the number n of workers used, from 64 to 192. We evaluated (a) SSSP and CC with real-life graphs traffic and Friendster; (b) PageRank with Friendster and UKWeb, and (c) CF with movieLens and Netflix, based on applications of these algorithms in transportation networks, social networks, Web rating and recommendation.
  • the performance gain of GRAPE+ comes from the following: (i) efficient resource utilization by dynamically adjusting relative progress of workers under AAP; (ii) reduction of redundant computation and communication by the use of incremental IncEval; and (iii) optimization inherited from strategies for sequential algorithms. Note that under BSP, AP and SSP, GRAPE+BSP, GRAPE+AP and GRAPE+SSP can still benefit from (ii) and (iii) .
  • GRAPE+ inherits the optimization techniques from sequential (Dijkstra) algorithm by employing priority queues to prioritize vertex processing; in contrast, this optimization strategy is beyond the capacity of the vertex-centric systems.
  • GRAPE+ is on average 2.42, 1.71, and 1.47 (resp. 2.45, 1.76, and 1.40) times faster than GRAPE+ BSP , GRAPE+ AP and GRAPE+ SSP over traffic (resp. Friendster) , up to 2.69, 1.97 and 1.86 times, respectively. Since GRAPE+, GRAPE+ BSP , GRAPE+ AP and GRAPE+ SSP are the same system under different modes, the gap reflects the effectiveness of different models. We find that the idle waiting time of AAP is 32.3% and 55.6% of that of BSP and SSP, respectively.
  • GRAPE+ takes less time when n increases. It is on average 2.49 and 2.25 times faster on traffic and Friendster, respectively, when n varies from 64 to 192. That is, AAP makes effective use of parallelism by reducing stragglers and redundant stale computations.
  • GRAPE+ ships 22.4%, 8.0% and 68.3% of the data shipped by GraphLab sync , GraphLab async and PowerSwitch, respectively. This is because GRAPE+ (a) reduces redundant stale computations and hence unnecessary data traffic, and (b) ships only changed values of update parameters by incremental IncEval.
  • The communication cost of GRAPE+ is 1.22X, 40% and 1.02X compared to that of GRAPE+ BSP , GRAPE+ AP and GRAPE+ SSP , respectively. Since AAP allows workers with small workload to run faster and have more iterations, the amount of messages may increase. Moreover, workers under AAP additionally exchange their states and statistics to adjust relative speed. Despite these, its communication cost is not much worse than that of BSP and SSP.
  • Exp-3 Scale-up of GRAPE+ .
  • the speed-up of a system may degrade when using more workers.
  • We varied n from 96 to 320 and, for each n, deployed GRAPE+ over a synthetic graph of size varied from (60M, 2B) to (300M, 10B), proportional to n.
  • GRAPE+ preserves a reasonable scale-up. That is, the overhead of AAP does not weaken the benefit of parallel computation. Despite the overhead for adjusting relative progress, GRAPE+ retains scale-up comparable to that of BSP, AP and SSP.
  • AAP in a large-scale setting We tested synthetic graphs with 300 million vertices and 10 billion edges, generated by GTgraph following the power law and the small world property. We used a cluster of up to 320 workers. As shown in Fig. 6 (l) for PageRank, AAP is on average 4.3, 14.7 and 4.7 times faster than BSP, AP and SSP, respectively, up to 5.0, 16.8 and 5.9 times with 320 workers. Compared to the results in Exp-1, these show that AAP is far more effective on larger graphs with more workers, a setting closer to real-life applications, in which stragglers and stale computations are often heavy. These further verify the effectiveness of AAP.
  • GRAPE+ consistently outperforms the state-of-the-art systems. Over real-life graphs and with 192 workers, GRAPE+ is on average (a) 2080, 838, 550, 728, 1850 and 636 times faster than Giraph, GraphLab sync , GraphLab async , GiraphUC, Maiter and PowerSwitch for SSSP, (b) 835, 314, 93 and 368 times faster than Giraph, GraphLab sync , GraphLab async and GiraphUC for CC, (c) 339, 4.8, 8.6, 346, 9.7 and 4.6 times faster than Giraph, GraphLab sync , GraphLab async , GiraphUC, Maiter and PowerSwitch for PageRank, and (d) 11.9, 9.5 and 30.9 times faster than GraphLab sync , GraphLab async and Petuum for CF, respectively.
  • PowerSwitch has the closest performance to GRAPE+.
  • GRAPE+ scales well with the number n of workers used. It is on average 2.37, 2.68, 2.17 and 2.3 times faster when n varies from 64 to 192 for SSSP, CC, PageRank and CF, respectively. Moreover, it has good scale-up.

Abstract

A method for asynchronously parallelizing graph computations, the method comprising: distributing a plurality of fragments across a number of workers so that each worker has at least one local fragment, the plurality of fragments being obtained by partitioning a graph and each fragment being a subgraph of the graph; computing, by each worker, a partial result over each of its at least one local fragment using a predefined sequential batch algorithm; iteratively computing, by each worker, an updated partial result over each of its at least one local fragment based on one or more update messages using a predefined sequential incremental algorithm until a termination condition is satisfied. The one or more update messages are received from one or more other workers, respectively, and are stored in a respective buffer. Each worker is allowed to decide when to perform a next round of computation based on its delay stretch, and each worker is put on hold for a time period indicated by the delay stretch before performing the next round of computation, the delay stretch being dynamically adjustable based on each worker's relative computing progress to other workers. Some embodiments may have the effect of reducing stragglers and stale computations.

Description

PARALLELIZATION OF GRAPH COMPUTATIONS
TECHNICAL FIELD
The following disclosure relates to parallelization of graph computations.
BACKGROUND ART
Several parallel models are used for graphs. Bulk Synchronous Parallel (BSP) model has been adopted by graph systems. Under BSP, iterative computation is separated into supersteps, and messages from one superstep are only accessible in the next one. This leads to stragglers, i.e., some workers take substantially longer than the others. As workers converge asymmetrically, the speed of each superstep is limited to that of the slowest worker. To reduce stragglers, Asynchronous Parallel (AP) model has been employed. Under AP, a worker has immediate access to messages. Fast workers can move ahead, without waiting for stragglers. However, AP may incur excessive stale computations, i.e., processes triggered by messages that soon become stale due to more up-to-date messages. To rectify the problems, revisions of BSP and AP have been studied, notably Stale Synchronous Parallel (SSP) model. SSP relaxes BSP by allowing fastest workers to outpace the slowest ones by a fixed number of steps (bounded staleness) . It reduces stragglers, but incurs redundant stale computations.
SUMMARY OF THE INVENTION
In one aspect, a method for asynchronously parallelizing graph computations is provided. The method comprises: distributing a plurality of fragments across a number of workers so that each worker has at least one local fragment, the plurality of fragments being obtained by partitioning a graph and each fragment being a subgraph of the graph; computing, by each worker, a partial result over each of its at least one local fragment using a predefined sequential batch algorithm; and iteratively computing, by each worker, an updated partial result over each of its at least one local fragment based on one or more update messages using a predefined sequential incremental algorithm until a termination condition is satisfied. The one or more update messages are received from one or more other workers, respectively, and are stored in a respective buffer.
Each worker is allowed to decide when to perform a next round of computation based on its delay stretch, and each worker is put on hold for a time period indicated by the delay stretch before performing the next round of computation. The delay stretch can be dynamically adjusted based on each worker's relative computing progress to other workers.
One or more of the following features may also be included.
The delay stretch of each worker is adjusted by one or more parameters from the following group: the number of update messages stored in the respective buffer, the number of the one or more other workers from which the one or more update messages are received, the smallest and largest rounds being executed at all workers, running time prediction, query logs and other statistics collected from all workers.
When a worker is suspended during the delay stretch, its resources are allocated to one or more of the other workers.
Each worker sends a flag inactive to a master when it has no update messages stored in the respective buffer after its current round of computation. Upon receiving inactive from all workers, the master broadcasts a termination message to all workers. In response to the termination message, each worker responds with "acknowledgement" when it is inactive, or responds with "wait" when it is active or in the queue for a next round of computation. Upon receiving "acknowledgement" from all workers, the master pulls the updated partial results from all workers and applies a predefined assemble function to the updated partial results.
The predefined sequential incremental algorithm is monotonic.
The update message is based on a respective partial result and is defined by predefined update parameters.
In another aspect, provided is a system configured to perform the method for asynchronously parallelizing graph computations.
Certain implementations may provide one or more of the following advantages. Both stragglers and stale computations can be reduced by dynamically adjusting relative progress of workers. Correct convergence may also be guaranteed under a monotone condition. Other aspects, features, and advantages will be apparent from the following detailed description, the drawings and the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments will be described with reference to the following drawing figures, in which:
FIG. 1 (a) depicts runs for computing a connected components (CC) example as shown in FIG. 1 (b) under different models.
FIG. 1 (b) depicts a CC example.
FIG. 2 shows PEval for CC under AAP.
FIG. 3 shows IncEval for CC under AAP.
FIG. 4 shows workflow of AAP.
FIG. 5 shows the architecture of GRAPE+.
FIG. 6 shows results of performance evaluation.
DETAILED DESCRIPTION OF EMBODIMENTS
The scheme described in this application for asynchronously parallelizing graph computations is referred to as Adaptive Asynchronous Parallel (AAP) model. AAP is a parallel model that inherits the benefits of BSP and AP, and reduces both stragglers and stale computations, without explicitly switching between the two. Better still, the AAP model can ensure consistency, and guarantee correct convergence under a general condition.
Neither AP nor BSP consistently outperforms the other for different algorithms, input graphs and cluster scales. For many graph algorithms, different stages in a single execution demand different models for optimal performance. Switching between AP and BSP, however, requires predicting switching points and incurs switching costs.
Without global synchronization barriers, AAP is essentially asynchronous. As opposed to BSP and AP, each worker under AAP maintains parameters to measure (a) its progress relative to other workers, and (b) changes accumulated by messages (staleness) . Each worker has immediate access to incoming messages, and decides whether to start the next round of computation based on its own parameters. In contrast to SSP, each worker dynamically adjusts its parameters based on its relative progress and message staleness, instead of using a fixed bound. The workers can be distributed processors, or processors in a single machine, or threads on a processor.
FIG. 1 (a) compares runs for computing connected components shown in Fig. 1 (b) under different parallel models.
Consider a computation task being conducted at three workers, where workers P 1 and P 2 take 3 time units to do one round of computation, P 3 takes 6 units, and it takes 1 unit to pass messages. This is carried out under different models.
(1) BSP. As depicted in Fig. 1 (a) (1) , worker P 3 takes twice as long as P 1 and P 2, and is a straggler. Due to its global synchronization, each superstep takes 6 time units, the speed of the slowest P 3.
(2) AP. AP allows a worker to start the next round as soon as its message buffer is not empty. However, it comes with redundant stale computation. As shown in Fig. 1 (a) (2) , at clock time 7, the second round of P 3 can only use the messages from the first round of P 1 and P 2. This round of P 3  becomes stale at time 8, when the latest updates from P 1 and P 2 arrive. As will be seen later, a large part of the computations of faster P 1 and P 2 is also redundant.
(3) SSP. Consider bounded staleness of 1, i.e., the fastest worker can outpace the slowest one by at most 1 round. As shown in Fig. 1 (a) (3) , P 1 and P 2 are not blocked by the straggler in the first 3 rounds. However, like AP, the second round of P 3 is stale. Moreover, P 1 and P 2 cannot start their  rounds  4 and 5 until P 3 finishes its  rounds  2 and 3, respectively, due to the bounded staleness condition. As a result, P1, P2 and P3 behave like in BSP model after clock time 14.
(4) AAP. AAP allows a worker to accumulate changes and decides when to start the next round based on the progress of others. As shown in Fig. 1 (a) (4) , after P 3 finishes one round of computation at clock time 6, it may start the next round at time 8, at which point the latest changes from P 1 and P 2 are available. As opposed to AP, AAP reduces redundant stale computation. This also helps us mitigate the straggler problem, since P 3 can converge in less rounds by utilizing the latest updates from fast workers.
AAP reduces stragglers by not blocking fast workers. This is particularly helpful when the computation is CPU-intensive and skewed, when an evenly partitioned graph becomes skewed due to updates, or when we cannot afford evenly partitioning a large graph due to the partition cost. Moreover, AAP activates a worker only after it receives sufficient up-to-date messages and thus reduces redundant stale computations. This allows us to reallocate resources to useful computations via workload adjustments.
In addition, AAP differs from previous models in the following.
(1) Model switch. BSP, AP and SSP are special cases of AAP with fixed parameters. Hence AAP can naturally switch among these models at different stages of the same execution, without asking for explicit switching points or incurring the switching costs. As will be seen later, AAP is more flexible: some worker groups may follow BSP, while at the same time, the others run AP or SSP. (2) Programming paradigm. AAP can work with the programming model of GRAPE (Graphics Programming Environment) . It allows users to extend existing sequential (single-machine) graph algorithms with message declarations, and parallelizes the algorithms across a cluster of machines. It employs aggregate functions to resolve conflicts raised by updates from different workers, without worrying about race conditions or requiring extra efforts to enforce consistency by using, e.g., locks.
(3) Convergence guarantees. AAP is modeled as a simultaneous fixpoint computation. Based on this one of the first conditions is developed under which AAP parallelization of sequential algorithms guarantees (a) convergence at correct answers, and (b) the Church-Rosser property, i.e., all asynchronous runs converge at the same result, as long as the sequential algorithms are correct.
(4) Expressive power. Despite its simplicity, AAP can optimally simulate MapReduce, PRAM (Parallel Random Access Machine) , BSP, AP and SSP. That is, algorithms developed for these models can be migrated to AAP without increasing the complexity.
(5) Performance. AAP outperforms BSP, AP and SSP for a variety of graph computations. As an example, for PageRank and SSSP (single-source shortest path) on Friendster with 192 workers, Table 1 shows the performance of (a) Giraph (an open-source version of Pregel) and GraphLab under BSP, (b) GraphLab and Maiter under AP, (c) GiraphUC under BAP, (d) PowerSwitch under Hsync, and (e) GRAPE+, an extension of GRAPE by supporting AAP. GRAPE+ does better than these systems.
Table 1: PageRank and SSSP on parallel systems
Parallel Random Access Machine (PRAM) supports parallel RAM access with shared memory, not for the shared-nothing architecture that is used nowadays. MapReduce is adopted by, e.g., GraphX. However, it is not very efficient for iterative graph computations due to its blocking and I/O costs. BSP with vertex-centric programming works better for graphs as shown in some cases. However, it suffers from stragglers. As remarked earlier, AP reduces stragglers, but it comes with redundant stale computation. It also bears with race conditions and their locking/unblocking costs, and complicates the convergence analysis and programming.
SSP promotes bounded staleness for machine learning. Maiter reduces stragglers by accumulating updates, and supports prioritized asynchronous execution. BAP model (barrierless asynchronous parallel) reduces global barriers and local messages by using light-weighted local barriers. Hsync proposes to switch between AP and BSP.
Several graph systems under these models are in place, e.g., Pregel, GPS, Giraph++, GRAPE under BSP; GraphLab, Maiter, GRACE under (revised) AP; parameter servers under SSP; GiraphUC under BAP; and PowerSwitch under Hsync. Most of these are vertex-centric. While Giraph++ and Blogel process blocks, they inherit vertex-centric programming by treating blocks as vertices. GRAPE parallelizes sequential graph algorithms as a whole.
AAP differs from the prior models in the following.
(1) AAP reduces (a) stragglers of BSP via asynchronous message passing, and (b) redundant stale computations of AP by imposing a bound (delay stretch) , for workers to wait and accumulate updates.
(2) (a) AAP reduces redundant stale computations by enforcing a “lower bound” on accumulated messages, which also serves as an “upper bound” to support bounded staleness if needed. Performance can be improved when stragglers are forced to wait, rather than to catch up as suggested by SSP. (b) AAP dynamically adjusts the bound, instead of using a predefined constant. (c) Bounded staleness is not needed by SSSP, CC, and PageRank.
(3) Similar to Maiter, AAP aggregates changes accumulated. As opposed to Maiter, it reduces redundant computations by (a) imposing a delay stretch on workers, to adjust their relative progress, (b) dynamically adjusting the bound to optimize performance, and (c) combining incremental evaluation with accumulative computation. AAP operates on graph fragments, while Maiter is vertex-centric.
(4) Both BAP and AAP reduce unnecessary messages. However, AAP achieves this by operating on fragments (blocks) , and moreover, optimizes performance by adjusting relative progress of workers.
(5) As opposed to Hsync, AAP does not demand complete switch from one mode to another. Instead, each worker may decide its own “mode” based on its relative progress. Fast workers may follow BSP within a group, while meanwhile, the other workers may adopt AP. Moreover, the parameters are adjusted dynamically, and hence AAP does not have to predict switching points and pay the price of switching cost.
AAP can adopt the programming model of GRAPE. AAP is able to parallelize sequential graph algorithms just like GRAPE. That is, the asynchronous model does not make programming harder than GRAPE.
AAP supports data-partitioned parallelism. It is to work on graphs partitioned into smaller fragments.
Consider graphs G = (V, E, L), directed or undirected, where (1) V is a finite set of nodes; (2) E ⊆ V × V is a set of edges; and (3) each node v in V (resp. edge e ∈ E) is labeled with L (v) (resp. L (e) ) indicating its content, as found in property graphs.
Given a natural number m, a strategy P partitions G into fragments F = (F_1, …, F_m) such that each F_i = (V_i, E_i, L_i) is a subgraph of G, V = ∪_{i∈[1, m]} V_i and E = ∪_{i∈[1, m]} E_i. Here F_i is called a subgraph of G if V_i ⊆ V and E_i ⊆ E, and for each node v ∈ V_i (resp. edge e ∈ E_i), L_i (v) = L (v) (resp. L_i (e) = L (e) ). Note that F_i is a graph itself but is not necessarily an induced subgraph of G.
AAP allows users to pick an edge-cut or vertex-cut strategy P to partition a graph G. When P is edge-cut, a cut edge from F i to F j has a copy in both F i and F j. Denote by
(a) F i.I (resp. F i.O′) the set of nodes v∈V i such that there exists an edge (v′, v) (resp. (v, v′) ) with a node v′ in F j (i≠j) ; and
(b) F i.O (resp. F i.I′) the set of nodes v′ in some F j (i≠j) such that there exists an edge (v, v′) (resp. (v′, v) ) with v∈V i.
The nodes in F i.I∪F i.O′ are referred to as the border nodes of F i w.r.t. P. For vertex-cut, border nodes are those that have copies in different fragments. In general, a node v is a border node if v has an adjacent edge across two fragments, or a copy in another fragment.
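To make these notions concrete, the following sketch computes the sets F i.I, F i.O′, F i.O and F i.I′ for an edge-cut partition. It is a minimal illustration in Python under our own assumptions (an edge list and a node-to-fragment assignment as input), not part of GRAPE or GRAPE+.

```python
from collections import defaultdict

def border_sets(edges, assign):
    """Given directed edges and a node->fragment assignment (edge-cut),
    compute F_i.I, F_i.O', F_i.O and F_i.I' for every fragment i."""
    I = defaultdict(set)   # F_i.I : local nodes with an incoming cut edge
    O_ = defaultdict(set)  # F_i.O': local nodes with an outgoing cut edge
    O = defaultdict(set)   # F_i.O : remote targets of cut edges leaving F_i
    I_ = defaultdict(set)  # F_i.I': remote sources of cut edges entering F_i
    for u, v in edges:
        i, j = assign[u], assign[v]
        if i != j:                      # (u, v) is a cut edge
            O_[i].add(u); O[i].add(v)   # seen from the source fragment
            I[j].add(v); I_[j].add(u)   # seen from the target fragment
    return I, O_, O, I_

# Toy usage: 4 nodes split across 2 fragments.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
assign = {0: 0, 1: 0, 2: 1, 3: 1}
I, O_, O, I_ = border_sets(edges, assign)
print(I[1], O[0])   # border nodes of F_1 and F_0 w.r.t. this partition
```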
Using familiar terms, we refer to a graph computation problem as a class Q of graph queries, and instances of the problem as queries of Q. To answer queries Q∈Q under AAP, one only needs to specify three functions.
(1) PEval: a sequential algorithm for Q that given a query Q∈Q and a graph G, computes the answer Q (G) .
(2) IncEval: a sequential incremental algorithm for Q that given Q, G, Q (G) and updates ΔG to G, computes updates ΔO to the old output Q (G) such that
Figure PCTCN2018104689-appb-000004
where
Figure PCTCN2018104689-appb-000005
denotes G updated by ΔG.
(3) Assemble: a function that collects partial answers computed locally at each worker by PEval and IncEval, and assembles the partial results into complete answer Q (G) .
Taken together, the three functions are referred to as a PIE program for Q (PEval, IncEval and Assemble) . PEval and IncEval can be existing sequential (incremental) algorithms for Q, which are to operate on a fragment F i of G partitioned via a strategy P.
In addition, PEval declares the following.
(a) Update parameters. PEval declares status variables x̄ for a set C i in a fragment F i, to store contents of F i or partial results of a computation. Here C i is a set of nodes and edges within d-hops of the nodes in F i.I∪F i.O′ for an integer d. When d = 0, C i is F i.I∪F i.O. We denote by C̄ i the set of update parameters of F i, which consists of the status variables associated with the nodes and edges in C i. The variables in C̄ i are the candidates to be updated by the incremental steps of IncEval.
(b) Aggregate functions. PEval also specifies an aggregate function f aggr, e.g., min and max, to resolve conflicts when multiple workers attempt to assign different values to the same update parameter. These are specified in PEval and are shared by IncEval.
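As a rough illustration of the shape of a PIE program, the following Python sketch groups the three functions together with the declarations made by PEval; the class and field names are ours and do not reflect an actual GRAPE+ interface.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

@dataclass
class PIEProgram:
    # Sequential batch algorithm: (query, fragment) -> partial answer Q(F_i).
    peval: Callable[[Any, Any], Any]
    # Sequential incremental algorithm:
    # (query, fragment, old partial answer, updates M_i) -> new partial answer.
    inceval: Callable[[Any, Any, Any, Dict], Any]
    # Collects the partial answers from all workers into Q(G).
    assemble: Callable[[List[Any]], Any]
    # Declared by PEval: aggregate used to resolve conflicting values
    # assigned to the same update parameter (e.g., min or max).
    f_aggr: Callable[[List[Any]], Any]
```

For CC, for instance, peval would be a DFS-based labeling, inceval an incremental merge of components, and f_aggr the function min.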
First Example: Graph Connectivity
Consider graph connectivity (CC) . Given an undirected graph G= (V, E, L) , a subgraph G s of G is a connected component of G if (a) it is connected, i.e., for any two nodes v and v′in G s, there exists a path between v and v′, and (b) it is maximum, i.e., adding any node of G to G s makes the induced subgraph disconnected.
For each G, CC has a single query Q, to compute all connected components of G, denoted by Q (G) . CC is in O (|G|) time.
AAP parallelizes CC with the same PEval and IncEval of GRAPE. More specifically, a PIE program ρ is given as follows.
(1) As shown in Fig. 2, at each fragment F i, PEval uses a sequential CC algorithm (Depth-First Search, DFS) to compute the local connected components and create their ids, except that it declares the following: (a) for each node v∈V i, an integer variable v.cid, initially v.id; (b) F i.O as the candidate set C i, and {v.cid | v∈F i.O} as the update parameters; and (c) min as the aggregate function f aggr: if there are multiple values for the same v.cid, the smallest value is taken by the linear order on integers.
For each local connected component C, (a) PEval creates a “root” node v c carrying the minimum node id in C as v c. cid, and (b) links all the nodes in C to v c, and sets their cid as v c. cid. These can be done in one pass of the edges in fragment F i via DFS.
(2) Given a set M i of changed cids of border nodes, IncEval incrementally updates local components in F i, by “merging” components when possible. As shown in Fig. 3, by using min as f aggr, it (a) updates the cid of each border node to the minimum one; and (b) propagates the change to its root v c and all linked to v c.
(3) Assemble first updates the cid of each node to the cid of its linked root. It then merges all the nodes having the same cid into a single bucket, and returns all buckets as connected components.
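A minimal single-fragment sketch of this PIE program for CC is given below. It mimics PEval (one DFS pass assigning cids) and IncEval (merging components on changed border cids) with plain Python dictionaries; it is an illustration under our simplifications, not the GRAPE+ implementation.

```python
def peval_cc(fragment, nodes):
    """PEval: label every node of the fragment with the minimum node id
    of its local connected component (its cid)."""
    adj = fragment                        # node -> neighbours inside the fragment
    cid = {v: v for v in nodes}
    seen = set()
    for v in nodes:
        if v in seen:
            continue
        comp, stack = [], [v]
        while stack:                      # iterative DFS over the fragment
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u); comp.append(u)
            stack.extend(adj.get(u, []))
        root = min(comp)                  # "root" carrying the minimum id
        for u in comp:
            cid[u] = root
    return cid

def inceval_cc(cid, messages):
    """IncEval: apply min-aggregated border cids and propagate each change
    to every node carrying the same old cid."""
    changed = {v: c for v, c in messages.items() if c < cid[v]}
    for v, c in changed.items():
        old = cid[v]
        for u in cid:                     # relabel the whole local component
            if cid[u] == old:
                cid[u] = c
    return cid, changed                   # changed border values become M(i, j)

frag = {1: [2], 2: [1], 3: []}
cids = peval_cc(frag, [1, 2, 3])
cids, delta = inceval_cc(cids, {3: 0})    # a border node learns cid 0
print(cids)  # {1: 1, 2: 1, 3: 0}
```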
We remark the following about the programming paradigm.
(1) There have been methods for incrementalizing graph algorithms, to get incremental algorithms from their batch counterparts. Moreover, it is not hard to develop IncEval by revising a batch algorithm in response to changes to update parameters, as shown by the cases of CC (see Third Example below) and PageRank (see below) .
(2) Edge-cut is adopted in the sequel unless stated otherwise; but AAP works with other partition strategies. Indeed, the correctness of asynchronous runs under AAP remains intact under the conditions given below, regardless of the partition strategy used. Nonetheless, different strategies may yield partitions with various degrees of skewness and stragglers, which have an impact on the performance of AAP.
(3) The programming model aims to facilitate users to develop parallel programs, especially for those who are more familiar with conventional sequential programming. This said, programming with GRAPE still requires domain knowledge of algorithm design, to declare update parameters and design an aggregate function.
We next present the AAP model.
Setting. Adopting, for example, the programming model of GRAPE, to answer a class Q of queries on a graph G, AAP takes as input a PIE program ρ (i.e., PEval, IncEval, Assemble) for Q, and a partition strategy P. It partitions G into fragments (F 1, …, F m) using P, such that each fragment F i resides at a virtual worker P i for i∈ [1, m] . It works with a master P 0 and n shared-nothing physical workers (P 1, …, P n) , where n<m, i.e., multiple virtual workers are mapped to the same physical worker and share memory. Graph G is partitioned once for all queries Q ∈ Q posed on G.
PEval and IncEval can be (existing) sequential batch and incremental algorithms for Q, respectively, except that PEval additionally declares update parameters C̄ i and defines an aggregate function f aggr. At each worker P i, (a) PEval computes Q (F i) over local fragment F i, and (b) IncEval takes F i and updates M i to C̄ i as input, and computes updates ΔO i to Q (F i) such that Q (F i⊕M i) = Q (F i) ⊕ΔO i.
Each invocation of PEval or IncEval is referred to as one round of computation at worker P i.
Message passing. After each round of computation at worker P i, P i collects the update parameters in C̄ i whose values have changed, in a set ΔC̄ i. It groups ΔC̄ i into M (i, j) for j∈ [1, m] and j≠i, where M (i, j) includes those x v∈ΔC̄ i with v∈C j, i.e., v also resides in fragment F j. That is, M (i, j) includes the changes of ΔC̄ i to the update parameters C̄ j of F j. It sends M (i, j) as a message to worker P j. Messages M (i, j) may also be referred to as designated messages.
More specifically, each worker P i maintains the following:
(1) an index I i that given a border node v, retrieves the set of j∈ [1, m] such that v∈F j. I′∪ F j. O and i≠j, i.e., where v resides; it is deduced from the strategy P ; and
(2) a buffer B i to keep track of messages from other workers.
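The grouping of changed update parameters into designated messages can be pictured with the following sketch; the dictionary layout and function name are ours, for illustration only.

```python
from collections import defaultdict

def group_messages(changed, index_i, round_no):
    """Group changed update parameters into designated messages M(i, j).

    changed : dict mapping a border node v to the new value of x_v
    index_i : dict mapping v to the set of fragment ids j (j != i) where v
              also resides, i.e., the index I_i deduced from the strategy P
    """
    M = defaultdict(list)
    for v, val in changed.items():
        for j in index_i.get(v, ()):          # every fragment that shares v
            M[j].append((v, val, round_no))   # triple (x, val, r) as in the text
    return dict(M)

# Toy usage: node 7 is shared with fragments 2 and 3, node 9 only with 2.
print(group_messages({7: 1, 9: 4}, {7: {2, 3}, 9: {2}}, round_no=5))
```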
As opposed to GRAPE, AAP is asynchronous in nature. (1) AAP adopts (a) point-to-point communication: a worker P i can send a message M (i, j) directly to worker P j, and (b) push-based message passing: P i sends M (i, j) to worker P j as soon as M (i, j) is available, regardless of the progress at other workers. A worker P j can receive messages M (i, j) at any time, and saves them in its buffer B j without being blocked by supersteps. (2) Under AAP, master P 0 is only responsible for making the decision on termination and for assembling partial answers by Assemble. (3) Workers exchange their status to adjust relative progress.
Parameters. To reduce stragglers and redundant stale computations, each (virtual) worker P i maintains a delay stretch DS i such that P i is put on hold for DS i time to accumulate updates. Stretch DS i is dynamically adjusted by a function δ based on the following.
(1) Staleness η i, measured by the number of messages in buffer B i received by P i from distinct workers. Intuitively, the larger η i is, the more messages are accumulated in B i, and hence the earlier P i should start the next round of computation.
(2) Bounds r min and r max, the smallest and largest rounds being executed at all workers, respectively. Each P i keeps track of its current round r i. These are to control the relative speed of workers.
For example, to simulate SSP [14] , when r i=r max and r i-r min>c, we can set DS i = +∞, to prevent P i from moving too far ahead.
The adjustment function δ for DS i will be discussed shortly.
Parallel model. Given a query Q∈Q and a partitioned graph G, AAP posts the same query Q to all the workers. It computes Q (G) in three phases as shown in Fig. 4, described as follows.
(1) Partial evaluation. Upon receiving Q, PEval computes partial results Q (F i) at each worker P i in parallel. After this, PEval generates a message M  (i, j) and sends it to worker P j for j∈ [1, m] , j≠ i.
More specifically, M (i, j) consists of triples (x, val, r) , where x is an update parameter in C̄ i associated with a node v that is in C i∩C j, and C j is deduced from the index I i; val is the value of x, and r indicates the round when val is computed. Worker P i receives messages from other workers at any time and stores the messages in its buffer B i.
(2) Incremental evaluation. In this phase, IncEval iterates until the termination condition is satisfied. To reduce redundant computation, AAP adjusts (a) relative progress of workers and (b) work assignments. More specifically, IncEval works as follows.
(1) IncEval is triggered at worker P i to start the next round if (a) B i is nonempty, and (b) P i has been suspended for DS i time. Intuitively, IncEval is invoked only if changes are inflicted to C̄ i, i.e., B i≠∅, and only if P i has accumulated enough messages.
(2) When IncEval is triggered at P i, it does the following:
· compute M i=f aggr (B i) , i.e., IncEval applies the aggregate function to B i to deduce changes to its local update parameters, and it clears buffer B i;
· incrementally compute Q (F i⊕M i) with IncEval, by treating M i as updates to its local fragment F i (i.e., computing over F i⊕M i) ; and
· derive messages M (i, j) consisting of the updated values of C̄ i for border nodes that are in both C i and C j, for all j∈ [1, m] , j≠i; it sends M (i, j) to worker P j.
In the entire process, P i keeps receiving messages from other workers and saves them in its buffer B i. No synchronization is imposed.
When IncEval completes its current round at P i or when P i receives a new message, DS i is adjusted. The next round of IncEval is triggered if the conditions (a) and (b) in (1) above are satisfied; otherwise P i is suspended for DS i time, and its resources are allocated to other (virtual) workers P j to do useful computation, preferably to P j that is assigned to the same physical worker as P i to minimize the overhead for data transfer. When the suspension of P i exceeds DS i, P i is activated again to start the next round of IncEval.
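To summarize the control flow of the incremental phase at one worker, a schematic Python sketch is given below; the helpers f_aggr, inceval and delay_stretch are stand-ins for the components described above, and threading, message transport and termination signalling are deliberately omitted.

```python
import time

def worker_loop(buffer, f_aggr, inceval, delay_stretch, max_rounds=10):
    """Schematic IncEval loop at one worker: wake up only when the buffer
    is nonempty and the delay stretch has elapsed, then aggregate, update
    the local state and (conceptually) ship the derived changes."""
    state, r = {}, 0
    while r < max_rounds:
        if not buffer:                    # condition (a): any changes pending?
            break                         # would report `inactive' under AAP
        time.sleep(delay_stretch())       # condition (b): hold for DS_i time
        updates = f_aggr(buffer)
        buffer.clear()
        state, changes = inceval(state, updates)
        r += 1
        # ... the changed update parameters would be sent as M(i, j) here ...
    return state, r

# Toy usage: aggregate by min per key; IncEval keeps the smallest value seen.
buf = [{"v1": 3}, {"v1": 2}, {"v2": 5}]
aggr = lambda msgs: {k: min(m[k] for m in msgs if k in m)
                     for k in {k for m in msgs for k in m}}
ince = lambda st, up: ({**st, **{k: min(v, st.get(k, v)) for k, v in up.items()}}, up)
print(worker_loop(buf, aggr, ince, delay_stretch=lambda: 0))
```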
(3) Termination. When IncEval is done with its current round of computation at P i, if B i is empty, then P i sends a flag inactive to master P 0 and becomes inactive. Upon receiving inactive from all workers, P 0 broadcasts a message terminate to all workers. Each P i may respond with either ack if it is inactive, or wait if it is active or is in the queue for execution. If one of the workers replies wait, the iterative incremental step proceeds (phase (2) above) .
Upon receiving ack from all workers, P 0 pulls partial results from all workers, and applies Assemble to the partial results. The outcome is referred to as the result of the parallelization of ρ under P, denoted by ρ (Q, G) . AAP returns ρ (Q, G) and terminates.
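The two-step termination handshake can be sketched as follows; this is a toy, single-process rendering in which the worker class and method names are hypothetical, and only the inactive/terminate/ack/wait exchange mirrors the protocol above.

```python
def try_terminate(workers):
    """Master-side termination check: once every worker has reported
    `inactive', broadcast `terminate' and collect ack/wait replies."""
    if not all(w.reported_inactive for w in workers):
        return False                       # keep iterating (phase 2)
    replies = [w.on_terminate() for w in workers]   # broadcast terminate
    if all(r == "ack" for r in replies):
        return True                        # safe to pull partial results
    return False                           # some worker replied `wait'

class ToyWorker:
    def __init__(self, empty_buffer):
        self.reported_inactive = empty_buffer
    def on_terminate(self):
        return "ack" if self.reported_inactive else "wait"

print(try_terminate([ToyWorker(True), ToyWorker(True)]))   # True
print(try_terminate([ToyWorker(True), ToyWorker(False)]))  # False
```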
Second Example
Recall the PIE program ρ for CC from First Example. Under AAP, it works in three phases as follows.
(1) PEval computes connected components and their cids at each fragment F i by using DFS. At the end of the process, the cids of border nodes are grouped as messages and sent to neighboring workers. More specifically, for j∈ [1, m] , {v.cid | v∈F i.O∩F j.I} is sent to worker P j as message M (i, j) and is stored in buffer B j.
(2) IncEval first computes updates M i by applying min to the changed cids in B i when it is triggered at worker P i as described above. It then incrementally updates local components in F i starting from M i. At the end of the process, the changed cids are sent to neighboring workers as messages, just like PEval does. The process iterates until no more changes can be made.
(3) Assemble is invoked at master at this point. It computes and returns connected components as described in First Example.
The example shows that AAP works well with the programming model of GRAPE, i.e., AAP does not make programming harder.
AAP is able to dynamically adjust the delay stretch DS i at each worker P i; for example, function δ may define DS i=+∞ if S (r i, r min, r max) is false, and DS i=max (0, t (L i-η i) -T i) otherwise, where the parameters of function δ are described as follows.
(1) Predicate S (r i, r min, r max) is to decide whether P i should be suspended immediately. For example, under SSP, it is defined as false if r i=r max and r max-r min≥c. When bounded staleness is not needed, S (r i, r min, r max) is constantly true.
(2) Variable L i “predicts” how many messages should be accumulated, to strike a balance between stale-computation reduction and the useful outcome expected from the next round of IncEval at P i. AAP adjusts L i as follows. Users may opt to initialize L i with a uniform bound L, to start stale-computation reduction early. AAP adjusts L i at each round at P i, based on (a) the predicted running time t i of the next round, and (b) the predicted arrival rate s i of messages. When s i is above the average rate, L i is changed to max (η i, L) +Δt i*s i, where Δt i is a fraction of t i, and L is adjusted with the number of “fast” workers. t i and s i can be approximated by aggregating statistics of consecutive rounds of IncEval. One can get a more precise estimate by using a random forest model, with query logs as training samples.
(3) Variable t (L i-η i) estimates how much longer P i should wait to accumulate L i many messages. It can be approximated as (L i-η i) /s i, using the number of messages that remain to be received and the message arrival rate s i. Finally, T i is the idle time of worker P i after the last round of IncEval; it is used to prevent P i from indefinite waiting.
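Putting the three ingredients together, one possible rendering of the adjustment function δ is sketched below. The exact combination is an assumption on our part, since δ is left open above, but it follows the roles just described: the predicate S gates indefinite suspension, t (L i-η i) estimates the remaining wait, and the idle time caps the wait so that no worker waits indefinitely.

```python
import math

def delta(eta_i, L_i, s_i, idle_i, r_i, r_min, r_max, c=None):
    """One possible delay-stretch function DS_i (an illustrative assumption).

    eta_i  : number of messages accumulated in the buffer
    L_i    : predicted number of messages worth waiting for
    s_i    : estimated message arrival rate (messages per time unit)
    idle_i : time P_i has already been idle since its last round
    c      : staleness bound; None means bounded staleness is not needed
    """
    # Predicate S: suspend indefinitely if P_i is the fastest and too far ahead.
    if c is not None and r_i == r_max and r_max - r_min >= c:
        return math.inf
    # Estimated time t(L_i - eta_i) to accumulate L_i messages, reduced by the
    # time already spent idle so that P_i never waits indefinitely.
    wait = max(0.0, (L_i - eta_i) / s_i) if s_i > 0 else 0.0
    return max(0.0, wait - idle_i)

print(delta(eta_i=2, L_i=4, s_i=2.0, idle_i=0.0, r_i=3, r_min=3, r_max=3))        # 1.0
print(delta(eta_i=1, L_i=4, s_i=1.0, idle_i=0.0, r_i=9, r_min=2, r_max=9, c=4))   # inf
```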
BSP, AP and SSP are special cases of AAP. Indeed, these can be carried out by AAP by specifying function δ as follows.
· BSP: function δ sets DS i=+∞ if r i>r min, i.e., P i is suspended; otherwise, DS i=0, i.e., P i proceeds at once; thus all workers are synchronized as no one can outpace the others.
· AP: function δ always sets DS i=0, i.e., worker P i triggers the next round of computation as soon as its buffer is nonempty.
· SSP: function δ sets DS i=+∞ if r i>r min+c for a fixed bound c like in SSP, and sets DS i=0 otherwise. That is, the fastest worker may move at most c rounds ahead.
Moreover, AAP can simulate Hsync by using function δ to implement the same switching rules of Hsync.
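These special cases correspond to particularly simple choices of δ, sketched below; returning +∞ stands for “suspend until the condition changes”.

```python
import math

def delta_bsp(r_i, r_min):
    # Synchronous: nobody may run ahead of the slowest worker.
    return math.inf if r_i > r_min else 0

def delta_ap():
    # Fully asynchronous: never delay; condition (a) alone
    # (a nonempty buffer) triggers the next round.
    return 0

def delta_ssp(r_i, r_min, c):
    # Stale synchronous: at most c rounds ahead of the slowest worker.
    return math.inf if r_i > r_min + c else 0

print(delta_bsp(3, 2), delta_ap(), delta_ssp(5, 2, 2))  # inf 0 inf
```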
Third Example:
Referring to FIG. 1 (a) and (b) , recall the PIE program ρ for CC from First Example and illustrated in Second Example. Consider a graph G that is partitioned into fragments F 1, F 2 and F 3 and distributed across workers P 1, P 2 and P 3, respectively. As depicted in Fig. 1 (b) , (a) each circle represents a connected component, annotated with its cid, and (b) a dotted line indicates an edge between fragments. One can see that graph G has a single connected component with the minimal vertex id 0. Suppose that workers P 1, P 2 and P 3 take 3, 3 and 6 time units, respectively.
One can verify the following by referencing Figure 1 (a) .
(a) Under BSP, Figure 1 (a) (1) depicts part of a run of ρ, which takes 5 rounds for the minimal cid 0 to reach component 7.
(b) Under AP, a run is shown in Fig. 1 (a) (2) . Note that before getting cid 0, workers P 1 and P 2 invoke 3 rounds of IncEval and exchange cid 1 among components 1-4, while under BSP, one round of IncEval suffices to pass cid 0 from P 3 to these components. Hence a large part of the computations of faster P 1 and P 2 is stale and redundant.
(c) Under SSP with bounded staleness 1, a run is given in Fig. 1 (a) (3) . It is almost the same as Fig. 1 (a) (2) , except that P 1 and P 2 cannot start round 4 before P 3 finishes round 2. More specifically, when the minimal cids in components 5 and 6 are set to 0 and 4, respectively, P 1 and P 2 have to wait for P 3 to set the cid of component 7 to 5. These again lead to unnecessary stale computations.
(d) Under AAP, P 3 can suspend IncEval until it receives enough changes, as shown in Fig. 1 (a) (4) . For instance, function δ starts with L=0. It sets DS i = 0 if η i≥1 for i∈ [1, 2] since no messages are predicted to arrive within the next time unit. In contrast, it sets DS 3 = 1 if η 3<4 since in addition to the 2 messages accumulated, 2 more messages are expected to arrive in 1 time unit; hence δ decides to increase DS 3. These delay stretches are estimated based on the running time (3, 3 and 6 time units for P 1, P 2 and P 3, respectively) and message arrival rates. With these delay stretches, P 1 and P 2 may proceed as soon as they receive new messages, but P 3 starts a new round only after accumulating 4 messages. Now P 3 takes only 2 rounds of IncEval to update all the cids in F 3 to 0. Compared with Figures 1 (a) (1) - (3) , the straggler reaches the fixpoint in fewer rounds.
It is found that AAP reduces the costs of iterative graph computations mainly in three ways.
(1) AAP reduces redundant stale computations and stragglers by adjusting the relative progress of workers. In particular, (a) some computations are substantially improved when stragglers are forced to accumulate messages; this actually enables the stragglers to converge in fewer rounds, as shown by Third Example for CC. (b) When the time taken by different rounds at a worker does not vary much (e.g., PageRank) , fast workers are “automatically” grouped together after a few rounds and run essentially BSP within the group, while the group and slow workers run under AP. This shows that AAP is more flexible than Hsync.
(2) Like GRAPE, AAP employs incremental IncEval to minimize unnecessary recomputations. The speedup is particularly evident when IncEval is bounded, localizable or relatively bounded. For instance, IncEval is bounded if, given F i, Q, Q (F i) and M i, it computes ΔO i such that Q (F i⊕M i) = Q (F i) ⊕ΔO i in cost that can be expressed as a function of |M i|+|ΔO i|, the size of the changes in the input and output; intuitively, it reduces the cost of computation on (possibly big) F i to a function of the small |M i|+|ΔO i|. As an example, IncEval for CC (Fig. 3) is a bounded incremental algorithm.
(3) Observe that algorithms PEval and IncEval are executed on fragments, which are graphs themselves. Hence AAP inherits all optimization strategies developed for the sequential algorithms.
Convergence and correctness
Asynchronous executions complicate the convergence analysis. Nonetheless, a condition is developed under which AAP guarantees to converge at correct answers. In addition, AAP is generic, as parallel models MapReduce, PRAM, BSP, AP and SSP can be optimally simulated by AAP.
Given a PIE program ρ (i.e., PEval, IncEval, Assemble) for a class Q of graph queries and a partition strategy P, we want to know whether the AAP parallelization of ρ converges at correct results. That is, whether for all queries Q∈Q and all graphs G, ρ terminates under AAP over G partitioned via P, and its result ρ (Q, G) =Q (G) .
We formalize termination and correctness as follows.
Fixpoint. Similar to GRAPE, AAP parallelizes a PIE program ρ based on a simultaneous fixpoint operator φ (R 1, …, R m) that starts with the partial evaluation of PEval and employs the incremental function IncEval as the intermediate consequence operator:
R i 0 = PEval (Q, F i 0) ,
R i r+1 = IncEval (Q, R i r, F i r, M i) ,
where i∈ [1, m] , R i r denotes the partial result of round r at worker P i, fragment F i 0 is F i, F i r is fragment F i at the end of round r carrying the update parameters C̄ i, and M i denotes the changes to C̄ i computed by f aggr (B i) as described above.
The computation reaches a fixpoint if for all i∈ [1, m] , R i r+1 = R i r, i.e., no more changes can be made to the partial results R i r at any worker. At this point, Assemble is applied to R i r for i∈ [1, m] , and computes ρ (Q, G) . If so, we say that ρ converges at ρ (Q, G) .
In contrast to synchronous execution, a PIE program ρ may have different asynchronous runs, when IncEval is triggered in different orders at multiple workers depending on, e.g., partition of G, clusters and network latency. These runs may end up with different results [37] . A run of ρ can be represented as traces of PEval and IncEval at all workers (see, e.g., Fig. 1 (a) ) .
ρ terminates under AAP with P if for all queries Q∈Q and graphs G, all runs of ρ converge at a fixpoint. ρ has the Church-Rosser property under AAP if all its asynchronous runs converge at the same result. AAP correctly parallelizes ρ if ρ has the Church-Rosser property, i.e., it always converges at the same ρ (Q, G) , and ρ (Q, G) =Q (G) .
Termination and correctness. We now identify a monotone condition under which a PIE program is guaranteed to converge at correct answers under AAP. We start with some notation.
(1) We assume a partial order ⪯ on partial results R i r. To simplify the discussion, assume that R i r carries its update parameters C̄ i.
Define the following properties of IncEval.
· IncEval is contracting if for all queries Q∈Q and graphs G fragmented via P, R i r+1 ⪯ R i r for all i∈ [1, m] in the same run.
· IncEval is monotonic if for all queries Q∈Q and graphs G, for all i∈ [1, m] , if R i r ⪯ S i s, then IncEval (Q, R i r, F i r, M i) ⪯ IncEval (Q, S i s, F i s, M i) , where R i r and S i s denote partial results in (possibly different) runs.
For instance, consider the PIE program ρ for CC (First Example) . The order ⪯ is defined on sets of connected components (CCs) in each fragment, such that S 1 ⪯ S 2 if for each CC C 2 in S 2, there exists a CC C 1 in S 1 such that C 2 is contained in C 1 and cid 1≤cid 2, where cid i is the id of C i for i∈ [1, 2] . Then one can verify that the IncEval of ρ is both contracting and monotonic, since f aggr is defined as min.
(2) We want to identify a condition under which AAP correctly parallelizes a PIE program ρ as long as its sequential algorithms PEval, IncEval and Assemble are correct, regardless of the order in which PEval and IncEval are triggered. We use the following.
(a) PEval is correct for Q if for all queries Q∈Q and graphs G, PEval (Q, G) returns Q (G) ; (b) IncEval is correct for Q if IncEval (Q, Q (G) , G, M) returns Q (G⊕M) , where M denotes messages (updates) ; and (c) Assemble is correct for Q if, when ρ converges at round r 0 under BSP, Assemble applied to the partial results of round r 0 returns Q (G) . We say that ρ is correct for Q if PEval, IncEval and Assemble are correct for Q.
A monotone condition. Three conditions can be identified for ρ.
(T1) The values of update parameters are from a finite domain.
(T2) IncEval is contracting.
(T3) IncEval is monotonic.
While conditions T1 and T2 are essentially the same as the ones for GRAPE, condition T3 does not find a counterpart therein.
The termination condition of GRAPE remains intact under AAP.
Theorem 1: Under AAP, a PIE program ρ guarantees to terminate with any partition strategy P if ρ satisfies conditions T1 and T2.
These conditions are general. Indeed, given a graph G, the values of update parameters are often computed from the active domain of G and are finite. By the use of aggregate function f aggr, IncEval is often contracting, as illustrated by the PIE program for CC above.
Proof: By T1 and T2, each update parameter can be changed finitely many times. This warrants the termination of ρ since ρ terminates when no more changes can be incurred to its update parameters.
However, the condition of GRAPE does not suffice to ensure the Church-Rosser property of asynchronous runs. For the correctness of a PIE program under AAP, we need condition T3 additionally.
Theorem 2: Under conditions T1, T2 and T3, AAP correctly parallelizes a PIE program ρ for a query class Q if ρ is correct for Q, with any partition strategy P.
Proof: We show the following under the conditions. (1) Both the synchronous run of ρ under BSP and asynchronous runs of ρ under AAP reach a fixpoint. (2) No partial results of ρ under BSP are “larger” than any fixpoint of asynchronous runs. (3) No partial results of asynchronous runs are “larger” than the fixpoint under BSP. From (2) and (3) it follows that ρ has the Church-Rosser property. Hence AAP correctly parallelizes ρ as long as ρ is correct for Q.
Recall that AP, BSP and SSP are special cases of AAP. From the proof of Theorem 2 we can conclude that as long as a PIE program ρ is correct for Q, ρ can be correctly parallelized
· under conditions T1 and T2 by BSP;
· under conditions T1, T2 and T3 by AP; and
· under conditions T1, T2 and T3 by SSP.
T1, T2 and T3 provide the first condition for asynchronous runs to converge and ensure the Church-Rosser property. To see this, convergence conditions for GRAPE, Maiter, BAP and SSP are examined.
(1) As remarked earlier, the condition of GRAPE does not ensure the Church-Rosser property, which is not an issue for BSP.
(2) Maiter focuses on vertex-centric programming and identifies four conditions for convergence, on an update function f that changes the state of a vertex based on its neighbors. The conditions require that f is distributive, associative, commutative and moreover, satisfies an equation on initial values.
As opposed to Zhang, Y., Gao, Q., Gao, L. and Wang, C. 2014. Maiter: An asynchronous graph processing framework for delta-based accumulative iterative computation. TPDS. 25, 8 (2014) , 2091–2100, we deal with block-centric programming, of which the vertex-centric model is a special case, when a fragment is limited to a single node. Moreover, the last condition of Zhang et al. is quite restrictive. Further, the proof of Zhang et al. does not suffice for the Church-Rosser property. A counterexample could be conditionally convergent series, for which asynchronous runs may diverge.
(3) It is shown that BAP can simulate BSP under certain conditions on message buffers. It does not consider the Church-Rosser property, and we make no assumption about message buffers.
(4) Conditions have been studied to assure the convergence of stochastic gradient descent (SGD) with high probability. In contrast, our conditions are deterministic: under T1, T2 and T3, all  AAP runs guarantee to converge at correct answers. Moreover, we consider AAP computations not limited to machine learning.
Simulation of Other Parallel Models
Algorithms developed for MapReduce, PRAM, BSP, AP and SSP can be migrated to AAP without extra complexity. That is, AAP is as expressive as the other parallel models.
Note that while this paper focuses on graph computations, AAP is not limited to graphs as a parallel computation model. It is as generic as BSP and AP, and does not have to take graphs as input.
A parallel model M 1 optimally simulates model M 2 if there exists a compilation algorithm that transforms any program with cost C on M 2 to a program with cost O (C) on M 1. The cost includes computational and communication cost. That is, the complexity bound remains the same.
As remarked above, BSP, AP and SSP are special cases of AAP. From this one can easily verify the following.
Proposition 3: AAP can optimally simulate BSP, AP and SSP.
By Proposition 3, algorithms developed for, e.g., Pregel, GraphLab and GRAPE can be migrated to AAP. As an example, a Pregel algorithm A (with a function compute () for vertices) can be simulated by a PIE algorithm ρ as follows. (a) PEval runs compute () over vertices with a loop, and uses status variables to exchange local messages instead of SendMessageTo () of Pregel. (b) The update parameters are the status variables of border nodes, and function f aggr groups messages just like Pregel, following BSP. (c) IncEval also runs compute () over vertices in a fragment, except that it starts from active vertices (border nodes with changed values) .
AAP can optimally simulate MapReduce and PRAM. GRAPE can optimally simulate MapReduce and PRAM, by adopting a form of key-value messages.
Theorem 4: MapReduce and PRAM can be optimally simulated by (a) AAP and (b) GRAPE with designated messages only.
Proof: Since PRAM can be simulated by MapReduce, and AAP can simulate GRAPE, it suffices to show that GRAPE can optimally simulate MapReduce with the above-explained message scheme.
A MapReduce algorithm A can be specified as a sequence (B 1, …, B k) of subroutines, where B r (r∈ [1, k] ) consists of a mapper μ r and a reducer ρ r. To simulate A by GRAPE, we give a PIE program ρ in which (1) PEval is the mapper μ 1 of B 1, and (2) IncEval simulates reducer ρ i and mapper μ i+1 (i∈ [1, k-1] ) , and reducer ρ k in the final round. We define IncEval that treats the subroutines B 1, …, B k of A as program branches. Assume that A uses n processors. We add a clique  G W of n nodes as input, one for each worker, such that any two workers can exchange data stored in the status variables of their border nodes in G W. We show that ρ incurs no more cost than A in each step, using n processors.
Programming with AAP
It has been shown how AAP parallelizes CC (First to Third Examples) . We next examine two more PIE algorithms, namely SSSP and CF. We also give a PIE program for PageRank. We parallelize these algorithms below under AAP. These show that AAP does not make programming harder.
Graph Traversal
We start with the single source shortest path problem (SSSP) . Consider a directed graph G= (V, E, L) in which for each edge e, L (e) is a positive number. The length of a path (v 0, …, v k) in G is the sum of L (v i-1, v i) for i∈ [1, k] . For a pair (s, v) of nodes, denote by dist (s, v) the shortest distance from s to v. SSSP is stated as follows.
· Input: A directed graph G as above, and a node s in G.
· Output: Distance dist (s, v) for all nodes v in G.
AAP parallelizes SSSP in the same way as GRAPE.
(1) PIE. AAP takes Dijkstra’s algorithm for SSSP as PEval and the sequential incremental algorithm as IncEval. It declares a status variable x v for every node v, denoting dist (s, v) , initially ∞ (except dist (s, s) =0) . The candidate set C i at each F i is F i.O. The status variables in the candidate set are updated by PEval and IncEval as in [8] , and aggregated by using min as f aggr. When no changes can be incurred to these status variables, Assemble is invoked to take the union of all partial results.
(2) Correctness is assured by the correctness of the sequential algorithms for SSSP and Theorem 2. To see this, define order ⪯ on sets S 1 and S 2 of nodes in the same fragment F i such that S 1 ⪯ S 2 if for each node v∈F i, v 1. dist≤v 2. dist, where v 1 and v 2 denote the copies of v in S 1 and S 2, respectively. Then by the use of min as aggregate function f aggr, IncEval is both contracting and monotonic.
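For concreteness, a single-fragment sketch of PEval for SSSP (Dijkstra’s algorithm over the local fragment) is given below; it is our illustrative Python, not the GRAPE+ code, and IncEval would rerun the same relaxation seeded with the min-aggregated border distances.

```python
import heapq

def peval_sssp(adj, source, nodes):
    """Dijkstra over one fragment.  adj maps u -> list of (v, weight);
    nodes lists the local nodes.  Returns dist(s, v) for local nodes,
    with float('inf') for locally unreachable ones."""
    dist = {v: float("inf") for v in nodes}
    if source in dist:
        dist[source] = 0
    heap = [(0, source)] if source in dist else []
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                       # stale heap entry
        for v, w in adj.get(u, ()):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist   # the values at F_i.O become the update parameters

# IncEval would repeat the same relaxation, seeded with min-aggregated
# border distances received in M_i, and ship only the changed values.
print(peval_sssp({1: [(2, 3)], 2: [(3, 1)]}, 1, [1, 2, 3]))  # {1: 0, 2: 3, 3: 4}
```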
Collaborative Filtering
We next consider collaborative filtering (CF) . It takes as input a bipartite graph G that includes two types of nodes, namely, users U and products P, and a set of weighted edges E⊆U×P. More specifically, (1) each user u∈U (resp. product p∈P) carries an (unknown) latent factor vector u.f (resp. p.f) . (2) Each edge e= (u, p) in E carries a weight r (e) , estimated as u.f T*p.f (possibly ⊥, i.e., “unknown” ) , that encodes a rating from user u to product p. The training set E T refers to the edge set {e∈E | r (e) ≠⊥}, i.e., all the known ratings. The CF problem is stated as follows.
· Input: A directed bipartite graph G, and a training set E T.
· Output: The missing factor vectors u.f and p.f that minimize a loss function ε (f, E T) , estimated as the sum of (r (u, p) -u.f T*p.f) ^2 over all edges (u, p) in E T.
AAP parallelizes stochastic gradient descent (SGD) , a popular algorithm for CF. We give the following PIE program.
(1) PIE. PEval declares a status variable v.x= (v.f, v.δ, t) for each node v, where v.f is the factor vector of v (initially ⊥) , v.δ records accumulative updates to v.f, and t bookkeeps the timestamp at which v.f was last updated. Assuming w.l.o.g. that |P|<<|U|, it takes F i.O∪F i.I, i.e., the shared product nodes related to F i, as C i. PEval is essentially “mini-batched” SGD. It computes the descent gradients for each edge (u, p) in F i and accumulates them in u.δ and p.δ, respectively. The accumulated gradients are then used to update the factor vectors of all local nodes. At the end, PEval sends the updated values of {v.x | v∈C i} to neighboring workers.
IncEval first aggregates the factor vector of each node p in F i.O by taking max on the timestamp of the tuples (p.f, p.δ, t) in B i. For each node in F i.I, it aggregates its factor vector by applying a weighted sum of the gradients computed at other workers. It then runs a round of SGD; it sends the updated status variables as in PEval, as long as the bounded staleness condition is not violated.
Assemble simply takes the union of the factor vectors of all nodes from all the workers, and returns the collection.
(2) Correctness has been verified under the bounded staleness condition. Along the same lines, we show that the PIE program converges and correctly infers missing CF factors.
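A bare-bones rendering of one mini-batched SGD pass of this scheme is sketched below. The learning rate, vector length and the form of the gradient step are our assumptions; the actual PEval and IncEval additionally bookkeep timestamps and accumulated gradients as described above.

```python
import random

def sgd_round(ratings, f, lr=0.01, dim=2):
    """One pass of SGD over the local training edges.

    ratings : list of (u, p, r) tuples in the fragment's training set
    f       : dict mapping a node to its latent factor vector (list of floats)
    Returns the set of nodes whose vectors changed (candidates for M(i, j)).
    """
    changed = set()
    for u, p, r in ratings:
        fu = f.setdefault(u, [random.random() for _ in range(dim)])
        fp = f.setdefault(p, [random.random() for _ in range(dim)])
        err = r - sum(a * b for a, b in zip(fu, fp))    # r - u.f^T * p.f
        for k in range(dim):                            # simultaneous gradient step
            fu[k], fp[k] = fu[k] + lr * err * fp[k], fp[k] + lr * err * fu[k]
        changed.update((u, p))
    return changed

f = {}
print(sgd_round([("u1", "p1", 4.0), ("u1", "p2", 1.0)], f))
```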
PageRank
Finally, we study PageRank for ranking Web pages. Consider a directed graph G= (V, E) representing Web pages and links. For each page v∈V, its ranking score is denoted by P v. A PageRank algorithm iteratively updates P v as follows:
P v=d*Σ  {u| (u, v) ∈E} P u/N u+ (1-d) ,
where d is the damping factor and N u is the out-degree of u. The process iterates until the sum of changes of two consecutive iterations is below a threshold. The PageRank problem is stated as follows.
· Input: A directed graph G and a threshold ε.
· Output: The PageRank scores of nodes in G.
AAP parallelizes PageRank along the same lines as Tian, Y., Balmin, A., Corsten, S. A., Tatikonda, S. and McPherson, J. 2013. From “think like a vertex” to “think like a graph” . PVLDB. 7, 7 (2013) , 193–204.
(1) PIE. PEval declares a status variable x v for each node v∈F i, to keep track of updates to v from other nodes in F i, at each fragment F i. It takes F i.O as its candidate set C i. Starting from an initial score 0 and an update x v (initially 1-d) for each v, PEval (a) increases the score P v by x v, and (b) updates the variable x u for each u linked from v by an incremental change d*x v/N v. At the end of its process, it sends the values of {x v | v∈F i.O} to its neighboring workers.
Upon receiving messages, IncEval iteratively updates scores. It (a) first aggregates the changes to each border node from other workers by using sum as f aggr; (b) it then propagates the changes to update other nodes in the local fragment by conducting the same computation as in PEval; and (c) it derives the changes to the values of {x v | v∈F i.O} and sends them to its neighboring workers.
Assemble collects the scores of all the nodes in G when the sum of changes of two consecutive iterations at each worker is below ε.
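The delta-based propagation used by PEval and IncEval here can be sketched as follows: pending updates x v are applied to the local scores, d*x v/N v is pushed to out-neighbors, and the portion that falls on non-local nodes is returned as the message to neighboring fragments. This is a simplified, single-fragment Python illustration.

```python
def propagate(adj, scores, x, d=0.85, eps=1e-6):
    """Accumulative PageRank step over one fragment.

    adj    : node -> list of out-neighbours (within the fragment)
    scores : node -> current PageRank score P_v
    x      : node -> pending update x_v (initially 1 - d for every node)
    Returns the updates that fell on nodes with no local entry, i.e., the
    changes that would be shipped to neighbouring fragments.
    """
    outgoing = {}
    pending = [v for v in x if x[v] > eps]
    while pending:
        v = pending.pop()
        delta, x[v] = x[v], 0.0
        if delta <= eps:
            continue
        scores[v] = scores.get(v, 0.0) + delta          # P_v += x_v
        nbrs = adj.get(v, [])
        for u in nbrs:
            inc = d * delta / len(nbrs)                 # d * x_v / N_v
            if u in scores or u in adj:                 # node is local
                x[u] = x.get(u, 0.0) + inc
                if x[u] > eps:
                    pending.append(u)
            else:                                       # border copy elsewhere
                outgoing[u] = outgoing.get(u, 0.0) + inc
    return outgoing

adj = {1: [2], 2: [3]}          # node 3 lives in another fragment
scores, x = {1: 0.0, 2: 0.0}, {1: 0.15, 2: 0.15}
print(propagate(adj, scores, x), scores)
```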
(2) Correctness. We show that the PIE program under AAP terminates and has the Church-Rosser property, along the same lines as the proof of Theorem 2. The proof makes use of the following property, as also observed by [36] : for each node v in graph G, P v can be expressed as Σ p∈P p (v) + (1-d) , where P is the set of all paths to v in G, p is a path (v n, v n-1, …, v 1, v) , p (v) = (1-d) *d^n/ (N n*N n-1*…*N 1) , and N j is the out-degree of node v j for j∈ [1, n] .
Bounded staleness forbids the fastest workers from outpacing the slowest ones by more than c steps. It is mainly to ensure the correctness and convergence of CF. By Theorem 2, CC and SSSP are not constrained by bounded staleness; conditions T1, T2 and T3 suffice to guarantee their convergence and correctness. Hence fast workers can move ahead any number of rounds without affecting their correctness and convergence. One can show that PageRank does not need bounded staleness either, since for each path p∈P, p (v) can be added to P v at most once (see above) .
Implementation of GRAPE+
The architecture of GRAPE+ is shown in Fig. 5, to extend GRAPE by supporting AAP. Its top layer provides interfaces for developers to register their PIE programs, and for end users to run registered PIE programs. The core of GRAPE+ is its engine, to generate parallel evaluation plans. It schedules workload for working threads to carry out the evaluation plans. Underlying the engine are  several components, including (1) an MPI controller to handle message passing, (2) a load balancer to evenly distribute workload, (3) an index manager to maintain indices, and (4) a partition manager for graph partitioning. GRAPE+ employs distributed file systems, e.g., NFS, AWS S3 and HDFS, to store graph data.
GRAPE+ extends GRAPE by supporting the following.
Adaptive asynchronization manager. As opposed to GRAPE, GRAPE+ dynamically adjusts relative progress of workers. This is carried out by a scheduler in the engine. Based on statistics collected (see below) , the scheduler adjusts parameters and decides which threads to suspend or run, to allocate resources to useful computations. In particular, the engine allocates communication channels between workers, buffers messages generated, packages the messages into segments, and sends a segment each time. It further reduces costs by overlapping data transfer and computation.
Statistics collector. During a run of a PIE program, the collector gathers information for each worker, e.g., the amount of messages exchanged, the evaluation time in each round, historical data for a query workload, and the impact of the last parameter adjustment.
Fault tolerance. Asynchronous runs of GRAPE+ make it harder to identify a consistent state to rollback in case of failures. Hence as opposed to GRAPE, GRAPE+ adapts Chandy-Lamport snapshots for checkpoints. The master broadcasts a checkpoint request with a token. Upon receiving the request, each worker ignores the request if it has already held the token. Otherwise, it snapshots its current state before sending any messages. The token is attached to its following messages. Messages that arrive late without the token are added to the last snapshot. This gets us a consistent checkpointed state, including all messages passed asynchronously.
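The checkpointing rule can be pictured with the toy sketch below: a worker snapshots its state the first time it sees the token, tags its subsequent outgoing messages with the token, and folds late untagged messages into the last snapshot. The class and method names are ours; only the rule itself follows the description above.

```python
class SnapshotWorker:
    """Toy worker illustrating the adapted Chandy-Lamport checkpoint rule."""

    def __init__(self, state):
        self.state = dict(state)
        self.has_token = False
        self.snapshot = None

    def on_checkpoint_request(self):
        # Ignore the request if the token was already seen; otherwise
        # snapshot the current state before sending any further messages.
        if not self.has_token:
            self.snapshot = dict(self.state)
            self.has_token = True

    def send(self, payload):
        # Outgoing messages carry the token once the worker holds it.
        return {"payload": payload, "token": self.has_token}

    def receive(self, msg):
        # A late message without the token belongs to the last snapshot.
        if self.has_token and not msg["token"] and self.snapshot is not None:
            self.snapshot.setdefault("late", []).append(msg["payload"])
        self.state["last"] = msg["payload"]

w = SnapshotWorker({"x": 1})
w.on_checkpoint_request()
w.receive({"payload": 42, "token": False})   # late, pre-snapshot message
print(w.snapshot)   # {'x': 1, 'late': [42]}
```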
When we deployed GRAPE+ in a POC scenario that provides continuous online payment services, we found that on average, it took about 40 seconds to get a snapshot of the entire state, and 20 seconds to recover from failure of one worker. In contrast, it took 40 minutes to start the system and load the graph.
Consistency. Each worker P i uses a buffer B i to store incoming messages, which is incrementally expanded when new messages arrive. GRAPE+ allows users to provide an aggregate function f aggr to resolve conflicts when a status variable receives multiple values from different workers. The only race condition arises when old messages are removed from B i by IncEval; the deletion is atomic. Thus the consistency control of GRAPE+ is not much harder than that of GRAPE.
Experimental Study
Using real-life and synthetic graphs, we conducted four sets of experiments to evaluate the (1) efficiency, (2) communication cost, and (3) scale-up of GRAPE+, and (4) the effectiveness of AAP  and the impact of graph partitioning strategies on its performance. We also report a case study in Appendix B to illustrate how dynamic adjustment of AAP works. We compared the performance of GRAPE+ with (a) Giraph and synchronized GraphLab sync under BSP, (b) asynchronized GraphLab async, GiraphUC and Maiter [36] under AP, (c) Petuum under SSP, (d) PowerSwitch under Hsync, and (e) GRAPE+ simulations of BSP, AP and SSP, denoted by GRAPE+BSP, GRAPE+AP, GRAPE+SSP, respectively.
It has been found that GraphLab async, GraphLab sync, PowerSwitch and GRAPE+ outperform the other systems. Indeed, Table 1 shows the performance of SSSP and PageRank of the systems with 192 workers; results on the other algorithms are consistent. Hence we only report the performance of these four systems in detail. In all the experiments we also evaluated GRAPE+ BSP, GRAPE+ AP and GRAPE+ SSP. Note that GRAPE is essentially GRAPE+ BSP.
Experimental setting. We used real-life and synthetic graphs.
Graphs. We used five real-life graphs of different types, such that each algorithm was evaluated with two real-life graphs. These include (1) Friendster, a social network with 65 million users and 1.8 billion links; we randomly assigned weights to test SSSP; (2) traffic, an (undirected) US road network with 23 million nodes (locations) and 58 million edges; (3) UKWeb, a Web graph with 133 million nodes and 5 billion edges. We also used two recommendation networks (bipartite graphs) to evaluate CF, namely, (4) movieLens, with 20 million movie ratings (as weighted edges) between 138000 users and 27000 movies; and (5) Netflix, with 100 million ratings between 17770 movies and 480000 customers.
To test the scalability of GRAPE+, we developed a generator to produce synthetic graphs G = (V, E, L) controlled by the numbers of nodes |V| (up to 300 million) and edges |E| (up to 10 billion) .
Queries. For SSSP, we sampled 10 source nodes for each graph G used such that each node has paths to or from at least 90% of the nodes in G, and constructed an SSSP query for each of them.
Graph computations. We evaluated SSSP, CC, PageRank and CF over GRAPE+ by using their PIE programs. We used “default” code provided by the competitor systems when it is available. Otherwise we made our best efforts to develop “optimal” algorithms for them, e.g., CF for PowerSwitch.
We used XtraPuLP as the default graph partition strategy. To evaluate the impact of stragglers, we randomly reshuffled a small portion of partitions to make the partitioned graphs skewed.
We deployed the systems on an HPC cluster. For each experiment, we used up to 20 servers, each with 16 threads of 2.40GHz, and 128GB memory. On each thread, a GRAPE+ worker is deployed. We ran each experiment 5 times. The average is reported here.
Experimental results. We next report our findings.
Exp-1: Efficiency. We first evaluated the efficiency of GRAPE+ by varying the number n of workers used, from 64 to 192. We evaluated (a) SSSP and CC with real-life graphs traffic and Friendster; (b) PageRank with Friendster and UKWeb, and (c) CF with movieLens and Netflix, based on applications of these algorithms in transportation networks, social networks, Web rating and recommendation.
(1) SSSP. Figures 6 (a) and 6 (b) report the performance of SSSP.
(a) GRAPE+ consistently outperforms these systems in all cases. Over traffic (resp. Friendster) and with 192 workers, it is on average 1673 (resp. 3.0) times, 1085 (resp. 15) times and 1270 (resp. 2.56) times faster than synchronized GraphLab sync, asynchronized GraphLab async and hybrid PowerSwitch, respectively.
The performance gain of GRAPE+ comes from the following: (i) efficient resource utilization by dynamically adjusting relative progress of workers under AAP; (ii) reduction of redundant computation and communication by the use of incremental IncEval; and (iii) optimization inherited from strategies for sequential algorithms. Note that under BSP, AP and SSP, GRAPE+BSP, GRAPE+AP and GRAPE+SSP can still benefit from (ii) and (iii) .
As an example, GraphLab sync took 34 (resp. 10749) rounds over Friendster (resp. traffic) , while by using IncEval, GRAPE+BSP and GRAPE+SSP took 21 and 30 (resp. 31 and 42) rounds, respectively, and hence reduced synchronization barriers and communication costs. In addition, GRAPE+ inherits the optimization techniques of the sequential (Dijkstra) algorithm by employing priority queues to prioritize vertex processing; in contrast, this optimization strategy is beyond the capacity of the vertex-centric systems.
(b) GRAPE+ is on average 2.42, 1.71, and 1.47 (resp. 2.45, 1.76, and 1.40) times faster than GRAPE+ BSP, GRAPE+ AP and GRAPE+ SSP over traffic (resp. Friendster) , up to 2.69, 1.97 and 1.86 times, respectively. Since GRAPE+, GRAPE+ BSP, GRAPE+ AP and GRAPE+ SSP are the same system under different modes, the gap reflects the effectiveness of the different models. We find that the idle waiting time of AAP is 32.3% and 55.6% of that of BSP and SSP, respectively. Moreover, when measuring stale computation in terms of the total extra computation and communication time over BSP, the stale computation of AAP accounts for 37.2% and 47.1% of that of AP and SSP, respectively. These verify the effectiveness of AAP in dynamically adjusting the relative progress of different workers.
(c) GRAPE+ takes less time when n increases. It is on average 2.49 and 2.25 times faster on traffic and Friendster, respectively, when n varies from 64 to 192. That is, AAP makes effective use of parallelism by reducing stragglers and redundant stale computations.
(2) CC. As reported in Figures 6 (c) and 6 (d) over traffic and Friendster, respectively, (a) GRAPE+ outperforms GraphLab sync, GraphLab async and PowerSwitch. When n=192, GRAPE+ is on average 313, 93 and 51 times faster than the three systems, respectively. (b) GRAPE+ is faster than its variants under BSP, AP and SSP, on average 20.87, 1.34 and 3.36 (resp. 3.21, 1.11 and 1.61) times faster over traffic (resp. Friendster) , respectively, up to 27.4, 1.39 and 5.04 times. (c) GRAPE+ scales well with the number of workers used: it is on average 2.68 times faster when n varies from 64 to 192.
(3) PageRank. As shown in Figures 6 (e) -6 (f) over Friendster and UKWeb, respectively, when n=192, (a) GRAPE+ is on average 5, 9 and 5 times faster than GraphLab sync, GraphLab async and PowerSwitch, respectively. (b) GRAPE+ outperforms GRAPE+ BSP, GRAPE+ AP and GRAPE+ SSP by 1.80, 1.90 and 1.25 times, respectively, up to 2.50, 2.16 and 1.57 times. This is because GRAPE+ reduces stale computations, especially those of stragglers. On average stragglers took 50, 27 and 28 rounds under BSP, AP and SSP, respectively, as opposed to 24 rounds under AAP. (c) GRAPE+ is on average 2.16 times faster when n varies from 64 to 192.
(4) CF. We used movieLens and Netflix with training set |E T|=90%|E|, as shown in Figures 6 (g) -6 (h) , respectively. On average (a) GRAPE+ is 11.9, 9.5, 10.0 times faster than GraphLab sync, GraphLab async and PowerSwitch, respectively. (b) GRAPE+ beats GRAPE+ BSP, GRAPE+ AP and GRAPE+ SSP by 1.38, 1.80 and 1.26 times, up to 1.67, 3.16 and 1.38 times, respectively. (c) GRAPE+ is on average 2.3 times faster when n varies from 64 to 192.
Single-thread. Among the graphs, traffic, movieLens and Netflix can fit in a single machine. On a single machine, it takes 6.7s, 4.3s and 2354.5s for SSSP and CC over traffic, and CF over Netflix, respectively. Using 64-192 workers, GRAPE+ is on average 1.63 to 5.2, 1.64 to 14.3, and 4.4 to 12.9 times faster than single-thread, depending on how heavy stragglers are. Observe the following. (a) GRAPE+ incurs extra overhead of parallel computation not experienced by a single machine, just like other parallel systems. (b) Large graphs such as UKWeb are beyond the capacity of a single machine, and parallel computation is a must for such graphs.
Exp-2: Communication. We tracked the total bytes sent by each machine during a run, by monitoring the system file /proc/net/dev. The communication costs of PageRank and SSSP over Friendster are reported in Table 1, when 192 workers were used. The results on other algorithms are consistent and hence not shown. These results tell us the following.
(1) On average GRAPE+ ships 22.4%, 8.0% and 68.3% of the data shipped by GraphLab sync, GraphLab async and PowerSwitch, respectively. This is because GRAPE+ (a) reduces redundant stale computations and hence unnecessary data traffic, and (b) ships only the changed values of update parameters by incremental IncEval.
(2) The communication cost of GRAPE+ is 1.22X, 40% and 1.02X of that of GRAPE+ BSP, GRAPE+ AP and GRAPE+ SSP, respectively. Since AAP allows workers with small workload to run faster and have more iterations, the amount of messages may increase. Moreover, workers under AAP additionally exchange their states and statistics to adjust relative speed. Despite these, its communication cost is not much worse than that of BSP and SSP.
Exp-3: Scale-up of GRAPE+. The speed-up of a system may degrade when using more workers. Thus we evaluated the scale-up of GRAPE+, which measures the ability to keep similar performance when both the size of graph G = (|V|, |E|) and the number n of workers increase proportionally. We varied n from 96 to 320, and for each n, deployed GRAPE+ over a synthetic graph of size varied from (60M, 2B) to (300M, 10B) , proportional to n.
As reported in Figures 6 (i) and 6 (j) for SSSP and PageRank, respectively, GRAPE+ preserves a reasonable scale-up. That is, the overhead of AAP does not weaken the benefit of parallel computation. Despite the overhead for adjusting relative progress, GRAPE+ retains scale-up comparable to that of BSP, AP and SSP.
The results on other algorithms are consistent (not shown) .
Exp-4: Effectiveness of AAP. To further evaluate the effectiveness of AAP, we tested (a) the impact of graph partitioning on AAP, and (b) the performance of AAP over larger graphs with more workers. We evaluated GRAPE+, GRAPE+ BSP, GRAPE+ AP and GRAPE+ SSP. Remark that these are the same system under different modes, and hence the results are not affected by implementation.
Impact of graph partitioning. Define r=||F max||/||F median||, where ||F max|| and ||F median|| denote the size of the largest fragment and the median size, respectively, indicating the skewness of a partition.
As shown in Fig. 6 (k) for SSSP over Friendster, in which the x axis is r, (a) different partitions have an impact on the performance of GRAPE+, just like on other parallel graph systems. (b) The more skewed the partition is, the more effective AAP is. Indeed, AAP is more effective with larger r. When r=9, AAP outperforms BSP, AP, SSP by 9.5, 2.3, and 4.9 times, respectively. For a well-balanced partition (r=1) , BSP works well since the chances of having stragglers are small. In this case AAP works as well as BSP.
AAP in a large-scale setting. We tested synthetic graphs with 300 million vertices and 10 billion edges, generated by GTgraph following the power law and the small world property. We used a cluster of up to 320 workers. As shown in Fig. 6 (l) for PageRank, AAP is on average 4.3, 14.7 and 4.7 times faster than BSP, AP and SSP, respectively, up to 5.0, 16.8 and 5.9 times with 320 workers. Compared to the results in Exp-1, these show that AAP is far more effective on larger graphs with  more workers, a setting closer to real-life applications, in which stragglers and stale computations are often heavy. These further verify the effectiveness of AAP.
The results on other algorithms are consistent (not shown) .
It has been found that: (1) GRAPE+ consistently outperforms the state-of-the-art systems. Over real-life graphs and with 192 workers, GRAPE+ is on average (a) 2080, 838, 550, 728, 1850 and 636 times faster than Giraph, GraphLab sync, GraphLab async, GiraphUC, Maiter and PowerSwitch for SSSP, (b) 835, 314, 93 and 368 times faster than Giraph, GraphLab sync, GraphLab async and GiraphUC for CC, (c) 339, 4.8, 8.6, 346, 9.7 and 4.6 times faster than Giraph, GraphLab sync, GraphLab async, GiraphUC, Maiter and PowerSwitch for PageRank, and (d) 11.9, 9.5 and 30.9 times faster than GraphLab sync, GraphLab async and Petuum for CF, respectively. Among these PowerSwitch has the closest performance to GRAPE+. (2) It incurs as small as 0.0001, 0.027, 0.13 and 57.7 of the communication cost of these systems for these problems, respectively. (3) AAP effectively reduces stragglers and redundant stale computations. It is on average 4.8, 1.7 and 1.8 times faster than BSP, AP and SSP for these problems over real-life graphs, respectively. On large-scale synthetic graphs, AAP is on average 4.3, 14.7 and 4.7 times faster than BSP, AP and SSP, respectively, up to 5.0, 16.8 and 5.9 times with 320 workers. (4) The heavier stragglers and stale computations are, or the larger the graphs are and the more workers are used, the more effective AAP is. (5) GRAPE+ scales well with the number n of workers used. It is on average 2.37, 2.68, 2.17 and 2.3 times faster when n varies from 64 to 192 for SSSP, CC, PageRank and CF, respectively. Moreover, it has good scale-up.
It has also been shown that as an asynchronous model, AAP does not make programming harder, and it retains the ease of consistency control and convergence guarantees. We have also developed the first condition to warrant the Church-Rosser property of asynchronous runs, and a simulation result to justify the power and flexibility of AAP. The experimental results have verified that AAP is promising for large-scale graph computations.

Claims (11)

  1. A method for asynchronously parallelizing graph computations, the method comprising:
    distributing a plurality of fragments across a number of workers so that each worker has at least one local fragment, the plurality of fragments being obtained by partitioning a graph and each fragment being a subgraph of the graph;
    computing, by each worker, a partial result over each of its at least one local fragment using a predefined sequential batch algorithm;
    iteratively computing, by each worker, an updated partial result over each of its at least one local fragment based on one or more update messages using a predefined sequential incremental algorithm until a termination condition is satisfied, wherein the one or more update messages are received from one or more other workers, respectively, and are stored in a respective buffer;
    wherein each worker is allowed to decide when to perform a next round of computation based on its delay stretch, and wherein said each worker is put on hold for a time period indicated by the delay stretch before performing the next round of computation, the delay stretch being dynamically adjustable based on said each worker's relative computing progress to other workers.
  2. The method of claim 1, wherein the delay stretch of each worker is adjusted by one or more parameters from the following group: the number of update messages stored in the respective buffer, the number of the one or more other workers from which the one or more update messages are received, the smallest and largest rounds being executed at all workers, running time prediction, query logs and other statistics collected from all workers.
  3. The method of claim 1 or 2, wherein each worker keeps receiving update messages, when available, from other workers without synchronization being imposed.
  4. The method of any one of previous claims 1 to 3, wherein, when a worker is suspended during the delay stretch, its resources are allocated to one or more of the other workers.
  5. The method of any one of previous claims 1 to 4, wherein each worker sends a flag inactive to a master when said each worker has no update messages stored in the respective buffer after its current round of computation.
  6. The method of claim 5, wherein, upon receiving inactive from all workers, the master broadcasts a termination message to all workers.
  7. The method of claim 7, wherein, in response to the termination message, each worker responds with "acknowledgement" when it is inactive, or responds with "wait" when it is active or in the queue for a next round of computation.
  8. The method of claim 7, wherein, upon receiving "acknowledgement" from all workers, the master pulls the updated partial results from all workers and applies a predefined assemble function to the updated partial results.
  9. The method of any one of previous claims 1 to 8, wherein the predefined sequential incremental algorithm is monotonic.
  10. The method of any one of previous claims 1 to 9, wherein the update message is based on a respective partial result and is defined by predefined update parameters.
  11. A system for asynchronously parallelizing graph computations, configured to perform the method of any one of claims 1 to 10.
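
By way of non-limiting illustration of claims 1 and 2, the following Python sketch shows one possible way a worker could derive its delay stretch from statistics the claims enumerate: the number of buffered update messages, the number of distinct senders, and the smallest and largest rounds currently being executed across all workers. The concrete formula, the time unit and the field names are assumptions made only for illustration; the claims do not prescribe any particular adjustment policy.

    from dataclasses import dataclass

    @dataclass
    class WorkerStats:
        buffered_msgs: int     # update messages waiting in this worker's buffer
        distinct_senders: int  # how many other workers those messages came from
        my_round: int          # round this worker is about to execute
        min_round: int         # smallest round currently executed at any worker
        max_round: int         # largest round currently executed at any worker

    def delay_stretch(s: WorkerStats, unit: float = 0.01) -> float:
        """Illustrative policy: a worker that has run far ahead of the slowest
        worker, or that has accumulated few messages from few senders, is held
        longer so its next round can consume more accumulated updates."""
        lead = max(0, s.my_round - s.min_round)   # relative computing progress
        scarcity = 1.0 / (1 + s.buffered_msgs)    # few messages -> wait longer
        spread = 1.0 / (1 + s.distinct_senders)   # few senders  -> wait longer
        return unit * lead * (scarcity + spread)

    # A fast worker (round 12 while the slowest is at round 7) with a thin buffer
    # is held noticeably longer than a straggler, which proceeds immediately.
    fast = WorkerStats(buffered_msgs=2, distinct_senders=1, my_round=12, min_round=7, max_round=12)
    slow = WorkerStats(buffered_msgs=40, distinct_senders=8, my_round=7, min_round=7, max_round=12)
    print(delay_stretch(fast))   # 0.05 * (1/3 + 1/2) -> about 0.042 seconds of hold
    print(delay_stretch(slow))   # 0.0 -> the straggler is not delayed

In a fuller sketch, the worker would sleep for this stretch before draining its buffer and invoking the incremental step, would send a flag inactive to the master when its buffer is empty after a round (claims 5 to 8), and, while suspended during the stretch, could yield its resources to other workers (claim 4).
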
PCT/CN2018/104689 2018-06-08 2018-09-07 Parallelization of graph computations WO2019232956A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201880092086.1A CN112074829A (en) 2018-06-08 2018-09-07 Parallelization of graphical computations

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2018090372 2018-06-08
CNPCT/CN2018/090372 2018-06-08

Publications (1)

Publication Number Publication Date
WO2019232956A1 (en)

Family

ID=68769224

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/104689 WO2019232956A1 (en) 2018-06-08 2018-09-07 Parallelization of graph computations

Country Status (2)

Country Link
CN (1) CN112074829A (en)
WO (1) WO2019232956A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112799845A (en) * 2021-02-02 2021-05-14 深圳计算科学研究院 Graph algorithm parallel acceleration method and device based on GRAPE framework

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170286484A1 (en) * 2014-12-09 2017-10-05 Huawei Technologies Co., Ltd. Graph Data Search Method and Apparatus
CN105045790A (en) * 2015-03-13 2015-11-11 北京航空航天大学 Graph data search system, method and device
CN105653204A (en) * 2015-12-24 2016-06-08 华中科技大学 Distributed graph calculation method based on disk
CN106407455A (en) * 2016-09-30 2017-02-15 深圳市华傲数据技术有限公司 Data processing method and device based on graph data mining

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WENFEI FAN ET AL.: "Parallelizing Sequential Graph Computations", SIGMOD '17 PROCEEDINGS OF THE 2017 ACM INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 19 May 2017 (2017-05-19), pages 495 - 510, XP055665397 *

Also Published As

Publication number Publication date
CN112074829A (en) 2020-12-11

Similar Documents

Publication Publication Date Title
Jiang et al. Heterogeneity-aware distributed parameter servers
Low et al. Distributed graphlab: A framework for machine learning in the cloud
Panda et al. Efficient task scheduling algorithms for heterogeneous multi-cloud environment
McCune et al. Thinking like a vertex: A survey of vertex-centric frameworks for large-scale distributed graph processing
Braun et al. A taxonomy for describing matching and scheduling heuristics for mixed-machine heterogeneous computing systems
Safaei Real-time processing of streaming big data
Fan et al. Adaptive asynchronous parallelization of graph algorithms
Ren et al. Strider: A hybrid adaptive distributed RDF stream processing engine
Liu et al. A stepwise auto-profiling method for performance optimization of streaming applications
Tang et al. An effective reliability-driven technique of allocating tasks on heterogeneous cluster systems
Zhao et al. v pipe: A virtualized acceleration system for achieving efficient and scalable pipeline parallel dnn training
Martin et al. Scalable and elastic realtime click stream analysis using streammine3g
Lee et al. Performance improvement of mapreduce process by promoting deep data locality
Yi et al. Fast training of deep learning models over multiple gpus
Narayanamurthy et al. Towards resource-elastic machine learning
Hefny et al. Comparative study load balance algorithms for map reduce environment
WO2019232956A1 (en) Parallelization of graph computations
Samfass et al. Lightweight task offloading exploiting MPI wait times for parallel adaptive mesh refinement
Li et al. Cost-efficient scheduling algorithms based on beetle antennae search for containerized applications in Kubernetes clouds
Moreno-Vozmediano et al. Latency and resource consumption analysis for serverless edge analytics
Nemirovsky et al. A deep learning mapper (DLM) for scheduling on heterogeneous systems
Ji et al. EP4DDL: addressing straggler problem in heterogeneous distributed deep learning
Tang et al. IncGraph: An improved distributed incremental graph computing model and framework based on spark graphX
Popa et al. Adapting MCP and HLFET algorithms to multiple simultaneous scheduling
Kail et al. A novel adaptive checkpointing method based on information obtained from workflow structure

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18921517

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18921517

Country of ref document: EP

Kind code of ref document: A1